Data Science

Graph Summarization with Quality Guarantees

Posted on March 1, 2017

680.07 KB

Authors: Matteo Riondato (Two Sigma), David Garcia-Soriano, Francesco Bonchi

Published in: Data Mining and Knowledge Discovery, 31(2):314–349

Presented at: Department of Information Engineering, University of Padua, Padua (Italy), September 26, 2016

Abstract: We study the problem of graph summarization. Given a large graph we aim at producing a concise lossy representation (a summary) that can be stored in main memory and used to approximately answer queries about the original graph much faster than by using the exact representation. In this work we study a very natural type of summary: the original set of vertices is partitioned into a small number of supernodes connected by superedges to form a complete weighted graph. The superedge weights are the edge densities between vertices in the corresponding supernodes. To quantify the dissimilarity between the original graph and a summary, we adopt the reconstruction error and the cut-norm error. By exposing a connection between graph summarization and geometric clustering problems (i.e., k-means and k-median), we develop the first polynomial-time approximation algorithms to compute the best possible summary of a certain size under both measures. We discuss how to use our summaries to store a (lossy or lossless) compressed graph representation and to approximately answer a large class of queries about the original graph, including adjacency, degree, eigenvector centrality, and triangle and subgraph counting. Using the summary to answer queries is very efficient as the running time to compute the answer depends on the number of supernodes in the summary, rather than the number of nodes in the original graph.

DOI: https://doi.org/10.1007/s10618-016-0468-8

Graph Summarization with Quality Guarantees from Two Sigma

Download PDF — 680.07 KB

References

The final publication is available at Springer via https://doi.org/10.1007/s10618-016-0468-8

The views expressed above are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). The information presented above is only for informational and educational purposes and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. Additionally, the above information is not intended to provide, and should not be relied upon for investment, accounting, legal or tax advice. Two Sigma makes no representations, express or implied, regarding the accuracy or completeness of this information, and the reader accepts all risks in relying on the above information for any purpose whatsoever. Click here for other important disclaimers and disclosures.

Graph Summarization with Quality Guarantees

References

Rademacher Averages: Theory and Practice

TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size

Improving Python and Spark Performance and Interoperability with Apache Arrow

Tags:

References

Rademacher Averages: Theory and Practice

TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fixed Memory Size

Improving Python and Spark Performance and Interoperability with Apache Arrow