Summary of Graph Datasets in Subgraph Matching Research

This blog summarizes the data graphs that are frequently used in isomorphism-based subgraph matching algorithm/system research. The surveyed literature is listed in the “Reference” section.

Problem Definition of Isomorphism-based Subgraph Matching

Given a single large data graph $D$ and a small query graph $q$, the target of subgraph isomorphism is to find all subgraphs of $D$ that are isomorphic to $q$.

The data graph and query graph may have labels attached to the vertices and/or edges. In those cases, a subgraph of $D$ matches the query graph $q$ if and only if they are isomorphic and their corresponding vertex/edge labels match.

Classification of Graph Datasets

The graph datasets that are used as data graphs in research can be grouped into two groups: real-world graphs and synthetic graphs.

Real-world Graphs

The statistics and description of the real-world graph datasets are summarized in the following table. Each graph dataset is given a unified code name.

Note: The statistics of each graph are retrieved from the literature. Their precision values depend on how to process the dataset files. Since different authors may process the original dataset files in different ways, it is common that the statistics do not agree with each other in literature.

The average degree $\bar{d}$ is defined as $\bar{d}=V/E$ (for directed graphs) and $\bar{d}=2V/E$ (for undirected graphs), where $V$ is the number of vertices and $E$ is the number of edges.

*The table is sorted by $|E|$ in the ascending order.

Table: Statistics of Real-world Graphs

Code Name	Dataset Name	Directed	Labeled	V	E	Average Degree
human	Human protein-protein interaction network	NO	Vertex	4.70E+03	8.60E+04	36.60
as-caida	CAIDA AS Relationships Datasets	NO	NO	2.65E+04	1.07E+05	8.07
wordnet	WordNet 1.7.1	YES	Vertex and Edge	8.27E+04	1.33E+05	1.61
enron-email	Enron email network	NO	NO	3.67E+04	1.84E+05	10.02
facebook-wosn	Facebook (WOSN)	NO	NO	6.37E+04	8.17E+05	25.64
com-dblp	DBLP collaboration network and ground-truth communities	NO	Vertex	3.17E+05	1.05E+06	6.62
com-youtube	Youtube social network and ground-truth communities	NO	Vertex	1.13E+06	2.99E+06	5.27
amazon0601	Amazon product co-purchasing network from June 1 2003	YES	YES	4.03E+05	3.39E+06	8.40
web-google	Google web graph	YES	NO	8.76E+05	4.32E+06	4.94
web-berkstan	Web graph of Berkeley and Stanford	YES	NO	6.85E+05	7.60E+06	11.09
as-skitter	Autonomous systems by Skitter	NO	NO	1.70E+06	1.11E+07	13.08
ego-gplus	Social circles: Google+	YES	NO	1.08E+05	1.37E+07	127.06
imdb	The International Movie Database (IMDb) graph	YES	Vertex	5.00E+06	1.45E+07	2.90
uspatents	Patent citation network	YES	Vertex	3.77E+06	1.65E+07	4.38
eu-2005	A small crawl of the .eu domain, mainly useful for debugging and testing purposes.	YES	NO	8.63E+05	1.92E+07	22.30
us-road	Challenge 9 - Road map of USA	YES	NO	2.39E+07	5.83E+07	2.44
soc-livejournal	LiveJournal online social network	YES	NO	4.85E+06	6.90E+07	14.23
com-orkut	Orkut online social network	NO	Vertex	3.07E+06	1.17E+08	76.28
uk-2002	Web graph from a 2002 crawl of the .uk domain.	YES	NO	1.85E+07	2.98E+08	16.10
yago	YAGO 4 knowledge base	YES	Vertex and Edge	6.70E+07	3.43E+08	5.12
eu-road	Roadmap of Europe	NO	NO	1.74E+08	3.48E+08	2.00
billion-triple	Billion Triples Challenge 2009 Dataset	YES	YES	1.65E+08	3.61E+08	2.19
dbpedia-tiny-diamond	The DBpedia Latest Core Release	YES	Vertex and Edge	NA	9.00E+08	NA
twitter	What is Twitter, a Social Network or a News Media?	YES	NO	4.16E+07	1.46E+09	35.10
friendster	Friendster online social network	NO	Vertex	6.56E+07	1.81E+09	55.06
reddit-social-media	Reddit social media graph	YES	Vertex and Edge	3.90E+09	7.00E+09	1.79
yahoo	G2 - Yahoo! AltaVista Web Page Hyperlink Connectivity Graph, circa 2002 (multi part) (Hosted on AWS)	YES	NO	1.40E+09	1.29E+10	9.21
clueweb12	ClueWeb12 Web Graph	YES	To Construct Manually	9.78E+08	4.26E+10	43.51
wdc-2012	Web Data Commons - Hyperlink Graph 2012	YES	Vertex	3.50E+09	1.29E+11	36.71

The description of each dataset is summarized in the following table.

Table: Description of Real-world Graphs

Code Name	Dataset Name	Graph Type	Download Link	Description
human	Human protein-protein interaction network	Protein-protein interaction network	https://github.com/yspark-dblab/gcare (See the datasets.tar.gz part)	Shijie Zhang, Shirong Li, and Jiong Yang. 2009. GADDI: distance index based subgraph matching in biological networks. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. 192–203. You can download the processed version provided by the g-care project.
as-caida	CAIDA AS Relationships Datasets	computer network	http://snap.stanford.edu/data/as-caida.html	The dataset contains 122 CAIDA AS graphs, from January 2004 to November 2007 - http://www.caida.org/data/active/as-relationships/. Each file contains a full AS graph derived from a set of RouteViews BGP table snapshots. Vertices represent routers. Edges represent routes between routers.
wordnet	WordNet 1.7.1	natural language	http://vlado.fmf.uni-lj.si/pub/networks/data/dic/Wordnet/Wordnet.htm	Graphs of relations among words, retrieved from WordNet v1.7.1. Vertices represent words. Edges represent the relations between words. Vertex labels represent word types (like verb, noun, …). Edge labels represent types of relations among words (like similar, cause, …).
enron-email	Enron email network	communication network	http://snap.stanford.edu/data/email-Enron.html	Email communication network from Enron. Vertices are email addresses. Edges are communication relationships between email addresses.
facebook-wosn	Facebook (WOSN)	social network	http://socialnetworks.mpi-sws.org/data-wosn2009.html	This undirected network contains friendship data of Facebook users. A node represents a user and an edge represents a friendship between two users. The dataset is obviously not complete and contains a very small subset of the total Facebook friendship graph.
com-dblp	DBLP collaboration network and ground-truth communities	collaboration network	http://snap.stanford.edu/data/com-DBLP.html	DBLP co-authorship collaboration network. Vertices are authors. There is an edge among two vertices if the two authors have published at least one paper together. Each connected component is regarded as a community. Each label corresponds to a community.
com-youtube	Youtube social network and ground-truth communities	social network	http://snap.stanford.edu/data/com-Youtube.html	Youtube social network. Vertices represent users. Edges represent friendships between users. Labels represent ground-truth communities.
amazon0601	Amazon product co-purchasing network from June 1 2003	E-commerce	http://snap.stanford.edu/data/amazon0601.html (graph), http://snap.stanford.edu/data/amazon-meta.html (label & property)	Network was collected by crawling Amazon website. It is based on Customers Who Bought This Item Also Bought feature of the Amazon website. If a product i is frequently co-purchased with product j, the graph contains a directed edge from i to j. Vertices represent products. Edges represent the co-purchased relationships.
web-google	Google web graph	WWW	http://snap.stanford.edu/data/web-Google.html	Web hyperlink graph released by Google. Vertices represent web pages. Edges represent hyperlinks between web pages.
web-berkstan	Web graph of Berkeley and Stanford	WWW	http://snap.stanford.edu/data/web-BerkStan.html	Nodes represent pages from berkely.edu and stanford.edu domains and directed edges represent hyperlinks between them. The data was collected in 2002.
as-skitter	Autonomous systems by Skitter	computer network	http://snap.stanford.edu/data/as-Skitter.html	Internet topology graph. From traceroutes run daily in 2005 - http://www.caida.org/tools/measurement/skitter. From several scattered sources to million destinations. 1.7 million nodes, 11 million edges.
ego-gplus	Social circles: Google+	social network	http://snap.stanford.edu/data/ego-Gplus.html	This dataset consists of ‘circles’ from Google+. Vertices represent users. Edges represent the relationships between users.
imdb	The International Movie Database (IMDb) graph	knowledge graph	https://www.imdb.com/interfaces/	More details of how to construct the dataset is available in the paper: T. Reza, M. Ripeanu, G. Sanders, and R. Pearce, “Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees,” in Proc. SIGMOD, New York, NY, USA, 2020, pp. 1115–1131. doi: 10.1145/3318464.3380566.
uspatents	Patent citation network	citation network	http://snap.stanford.edu/data/cit-Patents.html	U.S. patent dataset is maintained by the National Bureau of Economic Research. Vertices represent patents. Edges represent citations among patents. Vertex labels can represent patent classification.
eu-2005	A small crawl of the .eu domain, mainly useful for debugging and testing purposes.	WWW	https://law.di.unimi.it/webdata/eu-2005/	A small crawl of the .eu domain, mainly useful for debugging and testing purposes. This graph exhibits a very low locality, probably because the crawl was quite shallow (and the chosen domain is quite artificial anyway). Vertices present web pages. Edges represent hyper links between webpages.
us-road	Challenge 9 - Road map of USA	road network	http://www.diag.uniroma1.it/challenge9/download.shtml	Road map of 12 USA road networks. Vertices represent key points. Edges present roads connected the key points. Edges have weights which represent either the distance or the travel time.
soc-livejournal	LiveJournal online social network	social network	http://snap.stanford.edu/data/soc-LiveJournal1.html	Social network of LiveJournal. Vertices represent users of LiveJournal. Edges represent friendship relationships between users.
com-orkut	Orkut online social network	social network	http://snap.stanford.edu/data/com-Orkut.html	Orkut social network and ground-truth communities. Vertices represent users. Edges represent friendships between users. Labels represent groud-truth communities.
uk-2002	Web graph from a 2002 crawl of the .uk domain.	WWW	https://law.di.unimi.it/webdata/uk-2002/	This graph has been obtained from a 2002 crawl of the .uk domain performed by UbiCrawler. Vertices represent web pages. Edges present hyperlinks between web pages.
yago	YAGO 4 knowledge base	knowledge graph	https://yago-knowledge.org/downloads/yago-4	YAGO is a huge semantic knowledge base, derived from Wikipedia WordNet and GeoNames. YAGO dataset is a knowledge graph stored in the standard RDF framework. Vertices represent entities. Edges represent relationships between entities.
eu-road	Roadmap of Europe	road network	https://i11www.iti.kit.edu/resources/roadgraphs.php	Road map of Europe retrieved from the OpenStreetMap. Vertices represent crosses. Edges represent the roads connected crosses.
billion-triple	Billion Triples Challenge 2009 Dataset	knowledge graph	http://km.aifb.kit.edu/projects/btc-2009/	The major part of the dataset was crawled during February/March 2009 based on datasets provided by Falcon-S, Sindice, Swoogle, SWSE, and Watson using the MultiCrawler/SWSE framework. To ensure wide coverage, we also included a (bounded) breadth-first crawl of depth 50 starting from http://www.w3.org/People/Berners-Lee/card.
dbpedia-tiny-diamond	The DBpedia Latest Core Release	knowledge graph	https://www.dbpedia.org/resources/latest-core/	The DBpedia Latest Core Release is the small subset of of the total DBpedia releases that we are loading into the main DBpedia SPARQL endpoint and Linked Data Interface and Lookup Search. Dbpedia is a knowledge graph database. It is updated in a montly cycle.
twitter	What is Twitter, a Social Network or a News Media?	social network	https://anlab-kaist.github.io/traces/WWW2010	Twitter’s following graph. Vertices represent twitter user IDs. Edges represent the following relationships.
friendster	Friendster online social network	social network	http://snap.stanford.edu/data/com-Friendster.html	Friendster social network. Vertices represent users in the Friendster on-line gaming network. Edges represent friendship relations between users. Labels represent ground-truth communities.
reddit-social-media	Reddit social media graph	social network	https://github.com/dewarim/reddit-data-tools	More details of how to construct the dataset is available in the paper: T. Reza, M. Ripeanu, G. Sanders, and R. Pearce, “Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees,” in Proc. SIGMOD, New York, NY, USA, 2020, pp. 1115–1131. doi: 10.1145/3318464.3380566.
yahoo	G2 - Yahoo! AltaVista Web Page Hyperlink Connectivity Graph, circa 2002 (multi part) (Hosted on AWS)	WWW	https://webscope.sandbox.yahoo.com/catalog.php?datatype=g&guccounter=1	This dataset contains URLs and hyperlinks for over 1.4 billion public web pages indexed by the Yahoo! AltaVista search engine in 2002.
clueweb12	ClueWeb12 Web Graph	WWW	https://law.di.unimi.it/webdata/clueweb12/	This is the graph underlying the ClueWeb12 dataset. Vertices represent web pages. Edges represent hyper links between web pages. The labels needs to be manually constructed from the original dataset files by treating each domain as a label.
wdc-2012	Web Data Commons - Hyperlink Graph 2012	WWW	http://webdatacommons.org/hyperlinkgraph/2012-08/download.html	The hyperlink graph extracted from the Common Crawl 2012 web corpus. The graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages. Vertices represent web pages. Edges represent hyperlinks among web pages. Labels are retrieved from domain names.

Synthetic Graphs

The synthetic graphs are mainly used to evaluate performance of distributed subgraph matching methods on big graphs. The widely-used synthetic graph generation models or generator are summarized below.

R-MAT Model

The R-MAT model [21] uses a recursive model to generate the adjacency matrix of a random graph. The R-MAT model can generate different kinds of random graphs, including E-R models and power-law graphs. The R-MAT model is used in Graph500 Benchmark to generate very big graphs.

There are some typical implementation of RMAT random graph generators:

Waterloo SPARQL Diversity Test Suite (WatDiv)

WatDiv [22] is a benchmark developed for RDF data managements systems. WatDiv can generate large RDF datasets with a wide spectrum of SPARQL queries. WatDiv has a data generator and a query generator. The description of WatDiv is available at its home page.

WatDiv also has a stream version to generate dynamic RDF graphs.

Lehigh University Benchmark (LUBM)

LUBM is a benchmark for RDF data management systems. It follows a university domain ontology to generate big RDF graphs. It includes a data generator and a suite of test queries. The homepage of LUBM is http://swat.cse.lehigh.edu/projects/lubm/.

LDBC Social Network Benchmarking (LDBC-SNB) is a benchmark suite for graph databases. It can simulate both interactive workloads and business intelligence workloads of a social network. It includes a social network generator that generates complex property graphs and a set of benchmark queries. The home page of LDBC-SNB is https://ldbcouncil.org/benchmarks/snb/.

EvoGraph

EvoGraph [23] is a tool upscaling a given real-world graph to generate large simulated graphs. The generated simulated graphs can preserve the properties of the underlying real-world graph. The source code of EvoGraph is available at chan150’s GitHub repository.

References

C. T. Duong, D. Hoang, H. Yin, M. Weidlich, Q. V. H. Nguyen, and K. Aberer, “Efficient Streaming Subgraph Isomorphism with Graph Neural Networks,” Proc. VLDB Endow., vol. 14, no. 5, pp. 730–742, 2021.
X. Jin, Z. Yang, X. Lin, S. Yang, L. Qin, and Y. Peng, “FAST: FPGA-based Subgraph Matching on Massive Graphs,” in Proc. ICDE, 2021, pp. 1452–1463. doi: 10.1109/ICDE51399.2021.00129.
D. Mawhirter, S. Reinehr, C. Holmes, T. Liu, and B. Wu, “GraphZero: A High-Performance Subgraph Matching System,” SIGOPS Oper. Syst. Rev., vol. 55, no. 1, pp. 21–37, 2021, doi: 10.1145/3469379.3469383.
H. Kim, Y. Choi, K. Park, X. Lin, S.-H. Hong, and W.-S. Han, “Versatile Equivalences: Speeding up Subgraph Query Processing and Subgraph Matching,” in SIGMOD, New York, NY, USA, 2021, pp. 925–937. doi: 10.1145/3448016.3457265.
Z. Yang, L. Lai, X. Lin, K. Hao, and W. Zhang, “HUGE: An Efficient and Scalable Subgraph Enumeration System,” in Proc. SIGMOD, New York, NY, USA, 2021, pp. 2049–2062. doi: 10.1145/3448016.3457237.
W. Guo, Y. Li, M. Sha, B. He, X. Xiao, and K.-L. Tan, “GPU-Accelerated Subgraph Enumeration on Partitioned Graphs,” in Proc. SIGMOD, New York, NY, USA, 2020, pp. 1067–1082. doi: 10.1145/3318464.3389699.
S. Sun and Q. Luo, “In-Memory Subgraph Matching: An In-Depth Study,” in SIGMOD, New York, NY, USA, 2020, pp. 1083–1098. doi: 10.1145/3318464.3380581.
H. Zhang, J. X. Yu, Y. Zhang, K. Zhao, and H. Cheng, “Distributed Subgraph Counting: A General Approach,” Proc. VLDB Endow., vol. 13, no. 11, pp. 2493–2507, 2020.
X. Jian, Y. Wang, X. Lei, Y. Shen, and L. Chen, “DDSL: Efficient Subgraph Listing on Distributed and Dynamic Graphs,” in Proc. DASFAA, Cham, 2020, pp. 632–640.
L. Zeng, L. Zou, M. T. Özsu, L. Hu, and F. Zhang, “GSI: GPU-friendly Subgraph Isomorphism,” in Proc. ICDE, Apr. 2020, pp. 1249–1260. doi: 10.1109/ICDE48307.2020.00112.
S. Sun, X. Sun, Y. Che, Q. Luo, and B. He, “RapidMatch: a holistic approach to subgraph query processing,” Proc. VLDB Endow., vol. 14, no. 2, pp. 176–188, Oct. 2020, doi: 10.14778/3425879.3425888.
D. Yan, G. Guo, M. M. R. Chowdhury, M. T. Özsu, W.-S. Ku, and J. C. S. Lui, “G-thinker: A Distributed Framework for Mining Subgraphs in a Big Graph,” in Proc. ICDE, 2020, pp. 1369–1380. doi: 10.1109/ICDE48307.2020.00122.
Y. Park, S. Ko, S. S. Bhowmick, K. Kim, K. Hong, and W.-S. Han, “G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching,” in SIGMOD, New York, NY, USA, 2020, pp. 1099–1114. doi: 10.1145/3318464.3389702.
T. Reza, M. Ripeanu, G. Sanders, and R. Pearce, “Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees,” in Proc. SIGMOD, New York, NY, USA, 2020, pp. 1115–1131. doi: 10.1145/3318464.3380566.
M. Han, H. Kim, G. Gu, K. Park, and W.-S. Han, “Efficient Subgraph Matching: Harmonizing Dynamic Programming, Adaptive Matching Order, and Failing Set Together,” in Proc. SIGMOD, New York, New York, USA, 2019, pp. 1429–1446. doi: 10.1145/3299869.3319880.
B. Bhattarai, H. Liu, and H. H. Huang, “CECI: Compact Embedding Cluster Index for Scalable Subgraph Matching,” in SIGMOD, 2019, pp. 1447–1462. doi: 10.1145/3299869.3300086.
L. Lai et al., “Distributed subgraph matching on timely dataflow,” PVLDB, vol. 12, no. 10, pp. 1099–1112, 2019, doi: 10.14778/3339490.3339494.
A. Mhedhbi and S. Salihoglu, “Optimizing Subgraph Queries by Combining Binary and Worst-Case Optimal Joins,” Proc. VLDB Endow., vol. 12, no. 11, pp. 1692–1704, 2019, doi: 10.14778/3342263.3342643.
Z. Zhang, H. Wei, J. Xu, and B. Choi, “GScan: Exploiting Sequential Scans for Subgraph Matching,” in Proc. DASFAA, Cham, 2019, pp. 471–475.
Z. Wang, R. Gu, W. Hu, C. Yuan, and Y. Huang, “BENU: Distributed Subgraph Enumeration with Backtracking-Based Framework,” in ICDE, Macao, Macao, Apr. 2019, pp. 136–147. doi: 10.1109/ICDE.2019.00021.
D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-MAT: A Recursive Model for Graph Mining,” in Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, April 22-24, 2004, 2004, pp. 442–446. doi: 10.1137/1.9781611972740.43.
G. Aluç, O. Hartig, M. T. Özsu and K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In Proc. The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, 2014, pages 197-212. WatDiv available from http://dsg.uwaterloo.ca/watdiv/.
H. Park and M.-S. Kim, “EvoGraph: An Effective and Efficient Graph Upscaling Method for Preserving Graph Properties,” in Proc. SIGKDD, New York, NY, USA, 2018, pp. 2051–2059. doi: 10.1145/3219819.3220123.

上篇班主任手记：第一次班会 - 大学期间的时间管理

下篇Summary of Query Graphs in Subgraph Matching Research