Summary of Graph Datasets in Subgraph Matching Research

This blog summarizes the data graphs that are frequently used in isomorphism-based subgraph matching algorithm/system research. The surveyed literature is listed in the “Reference” section.

Problem Definition of Isomorphism-based Subgraph Matching

Given a single large data graph $D$ and a small query graph $q$, the target of subgraph isomorphism is to find all subgraphs of $D$ that are isomorphic to $q$.

The data graph and query graph may have labels attached to the vertices and/or edges. In those cases, a subgraph of $D$ matches the query graph $q$ if and only if they are isomorphic and their corresponding vertex/edge labels match.

Classification of Graph Datasets

The graph datasets that are used as data graphs in research can be grouped into two groups: real-world graphs and synthetic graphs.

Real-world Graphs

The statistics and description of the real-world graph datasets are summarized in the following table. Each graph dataset is given a unified code name.

Note: The statistics of each graph are retrieved from the literature. Their precision values depend on how to process the dataset files. Since different authors may process the original dataset files in different ways, it is common that the statistics do not agree with each other in literature.

The average degree $\bar{d}$ is defined as $\bar{d}=V/E$ (for directed graphs) and $\bar{d}=2V/E$ (for undirected graphs), where $V$ is the number of vertices and $E$ is the number of edges.

*The table is sorted by $|E|$ in the ascending order.

Table: Statistics of Real-world Graphs

Code Name Dataset Name Directed Labeled V E Average Degree
human Human protein-protein interaction network NO Vertex 4.70E+03 8.60E+04 36.60
as-caida CAIDA AS Relationships Datasets NO NO 2.65E+04 1.07E+05 8.07
wordnet WordNet 1.7.1 YES Vertex and Edge 8.27E+04 1.33E+05 1.61
enron-email Enron email network NO NO 3.67E+04 1.84E+05 10.02
facebook-wosn Facebook (WOSN) NO NO 6.37E+04 8.17E+05 25.64
com-dblp DBLP collaboration network and ground-truth communities NO Vertex 3.17E+05 1.05E+06 6.62
com-youtube Youtube social network and ground-truth communities NO Vertex 1.13E+06 2.99E+06 5.27
amazon0601 Amazon product co-purchasing network from June 1 2003 YES YES 4.03E+05 3.39E+06 8.40
web-google Google web graph YES NO 8.76E+05 4.32E+06 4.94
web-berkstan Web graph of Berkeley and Stanford YES NO 6.85E+05 7.60E+06 11.09
as-skitter Autonomous systems by Skitter NO NO 1.70E+06 1.11E+07 13.08
ego-gplus Social circles: Google+ YES NO 1.08E+05 1.37E+07 127.06
imdb The International Movie Database (IMDb) graph YES Vertex 5.00E+06 1.45E+07 2.90
uspatents Patent citation network YES Vertex 3.77E+06 1.65E+07 4.38
eu-2005 A small crawl of the .eu domain, mainly useful for debugging and testing purposes. YES NO 8.63E+05 1.92E+07 22.30
us-road Challenge 9 - Road map of USA YES NO 2.39E+07 5.83E+07 2.44
soc-livejournal LiveJournal online social network YES NO 4.85E+06 6.90E+07 14.23
com-orkut Orkut online social network NO Vertex 3.07E+06 1.17E+08 76.28
uk-2002 Web graph from a 2002 crawl of the .uk domain. YES NO 1.85E+07 2.98E+08 16.10
yago YAGO 4 knowledge base YES Vertex and Edge 6.70E+07 3.43E+08 5.12
eu-road Roadmap of Europe NO NO 1.74E+08 3.48E+08 2.00
billion-triple Billion Triples Challenge 2009 Dataset YES YES 1.65E+08 3.61E+08 2.19
dbpedia-tiny-diamond The DBpedia Latest Core Release YES Vertex and Edge NA 9.00E+08 NA
twitter What is Twitter, a Social Network or a News Media? YES NO 4.16E+07 1.46E+09 35.10
friendster Friendster online social network NO Vertex 6.56E+07 1.81E+09 55.06
reddit-social-media Reddit social media graph YES Vertex and Edge 3.90E+09 7.00E+09 1.79
yahoo G2 - Yahoo! AltaVista Web Page Hyperlink Connectivity Graph, circa 2002 (multi part) (Hosted on AWS) YES NO 1.40E+09 1.29E+10 9.21
clueweb12 ClueWeb12 Web Graph YES To Construct Manually 9.78E+08 4.26E+10 43.51
wdc-2012 Web Data Commons - Hyperlink Graph 2012 YES Vertex 3.50E+09 1.29E+11 36.71

The description of each dataset is summarized in the following table.

Table: Description of Real-world Graphs

Code Name Dataset Name Graph Type Download Link Description
human Human protein-protein interaction network Protein-protein interaction network https://github.com/yspark-dblab/gcare (See the datasets.tar.gz part) Shijie Zhang, Shirong Li, and Jiong Yang. 2009. GADDI: distance index based subgraph matching in biological networks. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. 192–203. You can download the processed version provided by the g-care project.
as-caida CAIDA AS Relationships Datasets computer network http://snap.stanford.edu/data/as-caida.html The dataset contains 122 CAIDA AS graphs, from January 2004 to November 2007 - http://www.caida.org/data/active/as-relationships/. Each file contains a full AS graph derived from a set of RouteViews BGP table snapshots. Vertices represent routers. Edges represent routes between routers.
wordnet WordNet 1.7.1 natural language http://vlado.fmf.uni-lj.si/pub/networks/data/dic/Wordnet/Wordnet.htm Graphs of relations among words, retrieved from WordNet v1.7.1. Vertices represent words. Edges represent the relations between words. Vertex labels represent word types (like verb, noun, …). Edge labels represent types of relations among words (like similar, cause, …).
enron-email Enron email network communication network http://snap.stanford.edu/data/email-Enron.html Email communication network from Enron. Vertices are email addresses. Edges are communication relationships between email addresses.
facebook-wosn Facebook (WOSN) social network http://socialnetworks.mpi-sws.org/data-wosn2009.html This undirected network contains friendship data of Facebook users. A node represents a user and an edge represents a friendship between two users. The dataset is obviously not complete and contains a very small subset of the total Facebook friendship graph.
com-dblp DBLP collaboration network and ground-truth communities collaboration network http://snap.stanford.edu/data/com-DBLP.html DBLP co-authorship collaboration network. Vertices are authors. There is an edge among two vertices if the two authors have published at least one paper together. Each connected component is regarded as a community. Each label corresponds to a community.
com-youtube Youtube social network and ground-truth communities social network http://snap.stanford.edu/data/com-Youtube.html Youtube social network. Vertices represent users. Edges represent friendships between users. Labels represent ground-truth communities.
amazon0601 Amazon product co-purchasing network from June 1 2003 E-commerce http://snap.stanford.edu/data/amazon0601.html (graph), http://snap.stanford.edu/data/amazon-meta.html (label & property) Network was collected by crawling Amazon website. It is based on Customers Who Bought This Item Also Bought feature of the Amazon website. If a product i is frequently co-purchased with product j, the graph contains a directed edge from i to j. Vertices represent products. Edges represent the co-purchased relationships.
web-google Google web graph WWW http://snap.stanford.edu/data/web-Google.html Web hyperlink graph released by Google. Vertices represent web pages. Edges represent hyperlinks between web pages.
web-berkstan Web graph of Berkeley and Stanford WWW http://snap.stanford.edu/data/web-BerkStan.html Nodes represent pages from berkely.edu and stanford.edu domains and directed edges represent hyperlinks between them. The data was collected in 2002.
as-skitter Autonomous systems by Skitter computer network http://snap.stanford.edu/data/as-Skitter.html Internet topology graph. From traceroutes run daily in 2005 - http://www.caida.org/tools/measurement/skitter. From several scattered sources to million destinations. 1.7 million nodes, 11 million edges.
ego-gplus Social circles: Google+ social network http://snap.stanford.edu/data/ego-Gplus.html This dataset consists of ‘circles’ from Google+. Vertices represent users. Edges represent the relationships between users.
imdb The International Movie Database (IMDb) graph knowledge graph https://www.imdb.com/interfaces/ More details of how to construct the dataset is available in the paper: T. Reza, M. Ripeanu, G. Sanders, and R. Pearce, “Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees,” in Proc. SIGMOD, New York, NY, USA, 2020, pp. 1115–1131. doi: 10.1145/3318464.3380566.
uspatents Patent citation network citation network http://snap.stanford.edu/data/cit-Patents.html U.S. patent dataset is maintained by the National Bureau of Economic Research. Vertices represent patents. Edges represent citations among patents. Vertex labels can represent patent classification.
eu-2005 A small crawl of the .eu domain, mainly useful for debugging and testing purposes. WWW https://law.di.unimi.it/webdata/eu-2005/ A small crawl of the .eu domain, mainly useful for debugging and testing purposes. This graph exhibits a very low locality, probably because the crawl was quite shallow (and the chosen domain is quite artificial anyway). Vertices present web pages. Edges represent hyper links between webpages.
us-road Challenge 9 - Road map of USA road network http://www.diag.uniroma1.it/challenge9/download.shtml Road map of 12 USA road networks. Vertices represent key points. Edges present roads connected the key points. Edges have weights which represent either the distance or the travel time.
soc-livejournal LiveJournal online social network social network http://snap.stanford.edu/data/soc-LiveJournal1.html Social network of LiveJournal. Vertices represent users of LiveJournal. Edges represent friendship relationships between users.
com-orkut Orkut online social network social network http://snap.stanford.edu/data/com-Orkut.html Orkut social network and ground-truth communities. Vertices represent users. Edges represent friendships between users. Labels represent groud-truth communities.
uk-2002 Web graph from a 2002 crawl of the .uk domain. WWW https://law.di.unimi.it/webdata/uk-2002/ This graph has been obtained from a 2002 crawl of the .uk domain performed by UbiCrawler. Vertices represent web pages. Edges present hyperlinks between web pages.
yago YAGO 4 knowledge base knowledge graph https://yago-knowledge.org/downloads/yago-4 YAGO is a huge semantic knowledge base, derived from Wikipedia WordNet and GeoNames. YAGO dataset is a knowledge graph stored in the standard RDF framework. Vertices represent entities. Edges represent relationships between entities.
eu-road Roadmap of Europe road network https://i11www.iti.kit.edu/resources/roadgraphs.php Road map of Europe retrieved from the OpenStreetMap. Vertices represent crosses. Edges represent the roads connected crosses.
billion-triple Billion Triples Challenge 2009 Dataset knowledge graph http://km.aifb.kit.edu/projects/btc-2009/ The major part of the dataset was crawled during February/March 2009 based on datasets provided by Falcon-S, Sindice, Swoogle, SWSE, and Watson using the MultiCrawler/SWSE framework. To ensure wide coverage, we also included a (bounded) breadth-first crawl of depth 50 starting from http://www.w3.org/People/Berners-Lee/card.
dbpedia-tiny-diamond The DBpedia Latest Core Release knowledge graph https://www.dbpedia.org/resources/latest-core/ The DBpedia Latest Core Release is the small subset of of the total DBpedia releases that we are loading into the main DBpedia SPARQL endpoint and Linked Data Interface and Lookup Search. Dbpedia is a knowledge graph database. It is updated in a montly cycle.
twitter What is Twitter, a Social Network or a News Media? social network https://anlab-kaist.github.io/traces/WWW2010 Twitter’s following graph. Vertices represent twitter user IDs. Edges represent the following relationships.
friendster Friendster online social network social network http://snap.stanford.edu/data/com-Friendster.html Friendster social network. Vertices represent users in the Friendster on-line gaming network. Edges represent friendship relations between users. Labels represent ground-truth communities.
reddit-social-media Reddit social media graph social network https://github.com/dewarim/reddit-data-tools More details of how to construct the dataset is available in the paper: T. Reza, M. Ripeanu, G. Sanders, and R. Pearce, “Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees,” in Proc. SIGMOD, New York, NY, USA, 2020, pp. 1115–1131. doi: 10.1145/3318464.3380566.
yahoo G2 - Yahoo! AltaVista Web Page Hyperlink Connectivity Graph, circa 2002 (multi part) (Hosted on AWS) WWW https://webscope.sandbox.yahoo.com/catalog.php?datatype=g&guccounter=1 This dataset contains URLs and hyperlinks for over 1.4 billion public web pages indexed by the Yahoo! AltaVista search engine in 2002.
clueweb12 ClueWeb12 Web Graph WWW https://law.di.unimi.it/webdata/clueweb12/ This is the graph underlying the ClueWeb12 dataset. Vertices represent web pages. Edges represent hyper links between web pages. The labels needs to be manually constructed from the original dataset files by treating each domain as a label.
wdc-2012 Web Data Commons - Hyperlink Graph 2012 WWW http://webdatacommons.org/hyperlinkgraph/2012-08/download.html The hyperlink graph extracted from the Common Crawl 2012 web corpus. The graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages. Vertices represent web pages. Edges represent hyperlinks among web pages. Labels are retrieved from domain names.

Synthetic Graphs

The synthetic graphs are mainly used to evaluate performance of distributed subgraph matching methods on big graphs. The widely-used synthetic graph generation models or generator are summarized below.

R-MAT Model

The R-MAT model [21] uses a recursive model to generate the adjacency matrix of a random graph. The R-MAT model can generate different kinds of random graphs, including E-R models and power-law graphs. The R-MAT model is used in Graph500 Benchmark to generate very big graphs.

There are some typical implementation of RMAT random graph generators:

  1. TrillionG (SIGMOD17);
  2. Graph500 Kronecker Generator (Reference Implementation).

Waterloo SPARQL Diversity Test Suite (WatDiv)

WatDiv [22] is a benchmark developed for RDF data managements systems. WatDiv can generate large RDF datasets with a wide spectrum of SPARQL queries. WatDiv has a data generator and a query generator. The description of WatDiv is available at its home page.

WatDiv also has a stream version to generate dynamic RDF graphs.

Lehigh University Benchmark (LUBM)

LUBM is a benchmark for RDF data management systems. It follows a university domain ontology to generate big RDF graphs. It includes a data generator and a suite of test queries. The homepage of LUBM is http://swat.cse.lehigh.edu/projects/lubm/.

LDBC Social Network Benchmarking (LDBC-SNB)

LDBC Social Network Benchmarking (LDBC-SNB) is a benchmark suite for graph databases. It can simulate both interactive workloads and business intelligence workloads of a social network. It includes a social network generator that generates complex property graphs and a set of benchmark queries. The home page of LDBC-SNB is https://ldbcouncil.org/benchmarks/snb/.

EvoGraph

EvoGraph [23] is a tool upscaling a given real-world graph to generate large simulated graphs. The generated simulated graphs can preserve the properties of the underlying real-world graph. The source code of EvoGraph is available at chan150’s GitHub repository.

References

  1. C. T. Duong, D. Hoang, H. Yin, M. Weidlich, Q. V. H. Nguyen, and K. Aberer, “Efficient Streaming Subgraph Isomorphism with Graph Neural Networks,” Proc. VLDB Endow., vol. 14, no. 5, pp. 730–742, 2021.
  2. X. Jin, Z. Yang, X. Lin, S. Yang, L. Qin, and Y. Peng, “FAST: FPGA-based Subgraph Matching on Massive Graphs,” in Proc. ICDE, 2021, pp. 1452–1463. doi: 10.1109/ICDE51399.2021.00129.
  3. D. Mawhirter, S. Reinehr, C. Holmes, T. Liu, and B. Wu, “GraphZero: A High-Performance Subgraph Matching System,” SIGOPS Oper. Syst. Rev., vol. 55, no. 1, pp. 21–37, 2021, doi: 10.1145/3469379.3469383.
  4. H. Kim, Y. Choi, K. Park, X. Lin, S.-H. Hong, and W.-S. Han, “Versatile Equivalences: Speeding up Subgraph Query Processing and Subgraph Matching,” in SIGMOD, New York, NY, USA, 2021, pp. 925–937. doi: 10.1145/3448016.3457265.
  5. Z. Yang, L. Lai, X. Lin, K. Hao, and W. Zhang, “HUGE: An Efficient and Scalable Subgraph Enumeration System,” in Proc. SIGMOD, New York, NY, USA, 2021, pp. 2049–2062. doi: 10.1145/3448016.3457237.
  6. W. Guo, Y. Li, M. Sha, B. He, X. Xiao, and K.-L. Tan, “GPU-Accelerated Subgraph Enumeration on Partitioned Graphs,” in Proc. SIGMOD, New York, NY, USA, 2020, pp. 1067–1082. doi: 10.1145/3318464.3389699.
  7. S. Sun and Q. Luo, “In-Memory Subgraph Matching: An In-Depth Study,” in SIGMOD, New York, NY, USA, 2020, pp. 1083–1098. doi: 10.1145/3318464.3380581.
  8. H. Zhang, J. X. Yu, Y. Zhang, K. Zhao, and H. Cheng, “Distributed Subgraph Counting: A General Approach,” Proc. VLDB Endow., vol. 13, no. 11, pp. 2493–2507, 2020.
  9. X. Jian, Y. Wang, X. Lei, Y. Shen, and L. Chen, “DDSL: Efficient Subgraph Listing on Distributed and Dynamic Graphs,” in Proc. DASFAA, Cham, 2020, pp. 632–640.
  10. L. Zeng, L. Zou, M. T. Özsu, L. Hu, and F. Zhang, “GSI: GPU-friendly Subgraph Isomorphism,” in Proc. ICDE, Apr. 2020, pp. 1249–1260. doi: 10.1109/ICDE48307.2020.00112.
  11. S. Sun, X. Sun, Y. Che, Q. Luo, and B. He, “RapidMatch: a holistic approach to subgraph query processing,” Proc. VLDB Endow., vol. 14, no. 2, pp. 176–188, Oct. 2020, doi: 10.14778/3425879.3425888.
  12. D. Yan, G. Guo, M. M. R. Chowdhury, M. T. Özsu, W.-S. Ku, and J. C. S. Lui, “G-thinker: A Distributed Framework for Mining Subgraphs in a Big Graph,” in Proc. ICDE, 2020, pp. 1369–1380. doi: 10.1109/ICDE48307.2020.00122.
  13. Y. Park, S. Ko, S. S. Bhowmick, K. Kim, K. Hong, and W.-S. Han, “G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching,” in SIGMOD, New York, NY, USA, 2020, pp. 1099–1114. doi: 10.1145/3318464.3389702.
  14. T. Reza, M. Ripeanu, G. Sanders, and R. Pearce, “Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees,” in Proc. SIGMOD, New York, NY, USA, 2020, pp. 1115–1131. doi: 10.1145/3318464.3380566.
  15. M. Han, H. Kim, G. Gu, K. Park, and W.-S. Han, “Efficient Subgraph Matching: Harmonizing Dynamic Programming, Adaptive Matching Order, and Failing Set Together,” in Proc. SIGMOD, New York, New York, USA, 2019, pp. 1429–1446. doi: 10.1145/3299869.3319880.
  16. B. Bhattarai, H. Liu, and H. H. Huang, “CECI: Compact Embedding Cluster Index for Scalable Subgraph Matching,” in SIGMOD, 2019, pp. 1447–1462. doi: 10.1145/3299869.3300086.
  17. L. Lai et al., “Distributed subgraph matching on timely dataflow,” PVLDB, vol. 12, no. 10, pp. 1099–1112, 2019, doi: 10.14778/3339490.3339494.
  18. A. Mhedhbi and S. Salihoglu, “Optimizing Subgraph Queries by Combining Binary and Worst-Case Optimal Joins,” Proc. VLDB Endow., vol. 12, no. 11, pp. 1692–1704, 2019, doi: 10.14778/3342263.3342643.
  19. Z. Zhang, H. Wei, J. Xu, and B. Choi, “GScan: Exploiting Sequential Scans for Subgraph Matching,” in Proc. DASFAA, Cham, 2019, pp. 471–475.
  20. Z. Wang, R. Gu, W. Hu, C. Yuan, and Y. Huang, “BENU: Distributed Subgraph Enumeration with Backtracking-Based Framework,” in ICDE, Macao, Macao, Apr. 2019, pp. 136–147. doi: 10.1109/ICDE.2019.00021.
  21. D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-MAT: A Recursive Model for Graph Mining,” in Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, April 22-24, 2004, 2004, pp. 442–446. doi: 10.1137/1.9781611972740.43.
  22. G. Aluç, O. Hartig, M. T. Özsu and K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In Proc. The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, 2014, pages 197-212. WatDiv available from http://dsg.uwaterloo.ca/watdiv/.
  23. H. Park and M.-S. Kim, “EvoGraph: An Effective and Efficient Graph Upscaling Method for Preserving Graph Properties,” in Proc. SIGKDD, New York, NY, USA, 2018, pp. 2051–2059. doi: 10.1145/3219819.3220123.