
Schemaless Graph Querying
Shengqi Yang, Yinghui Wu, Huan Sun and Xifeng Yan
UC Santa Barbara
The Big Graphs
Internet graph
Social graph (Facebook)
Software graph (Apache)
Protein interaction graph (human)
Credits: Andrew R. Wade, Facebook, Apache, Ravi Iyengar
The Big Graphs
 It is not only about the scale!
[Figure: Facebook's graph 10 years ago (people and friendships) vs. Facebook now (people, organizations, businesses, videos, music, photos, and richer relations such as "join")]
 Examples are countless: Google’s knowledge graph, LinkedIn’s
professional network, Linked Open Data, etc.
The Graph Querying
 Problem: Subgraph querying
o a.k.a., subgraph matching, subgraph isomorphism, etc
[Figure: a pattern query posed against molecule graphs (a graph database) and against a data graph (credit: linkeddata.org); an answer is a.k.a. a match or embedding]
Querying Big Graphs: the Challenges
 The data graph is schemaless
 The data items, nodes and relations, come from various domains and
are thus quite diverse.
 There is no uniform convention followed by the data contributors.
 The query is schemaless
 Queries are usually not posed in accordance with the metadata,
which is invisible to the searchers.
 A “good” querying system should
 Bridge the gap between query and data: match query to candidates.
 Rank the results in a principled way.
 Return top results at query time.
Querying Graphs: the DB approach
 Keyword Search
 Find a root node (LCA in XML or trees) [Liu07][Koutrika06].
 Group Steiner Tree [Ding07][Talukdar08].
 Expansion from keyword matches [Bhalotia02][Dalvi08].
 Searching based on the score function [Hristidis03][Luo07][Kasneci08].
 Searching based on indexes (e.g., reachability, distance, etc)
[He07][Markowetz09][Li09][Qiao13].
 Graph Search
 Exact subgraph matching, based on indexing techniques.
 Search on graph databases [Shasha02][Yan04][Zhao07][Zou08].
 Search on a single large graph
[Ullman76][Cordella04][Shang08][Zhang09].
 Approximate subgraph matching [Tian08][Mongiovi10][Khan13].
 Most of these works focus primarily on efficiency; result
ranking is usually treated simplistically.
Querying Graphs: the IR and ML approach
 IR based ranking
 Classic IR metrics: TF-IDF, BM25, etc.
[Hristidis03][Liu06][Luo07]
 Language models: a probabilistic space for each document
[Elbassuoni09][Mass12].
 These works are mostly based on linear combinations of the IR
metrics, with human-tuned, fixed parameters.
 Machine-learned ranking (MLR) [Liu09]
 The research in this domain focuses on
 How to design an effective (ensemble) ranking model.
 How to estimate the parameters in the ranking model.
 Application: rank the documents or the web pages.
 Input: a score vector for a (query, doc) pair and the relevance
score.
Schemaless Graph Querying (SLQ)
 To the users, no knowledge of the graph data is required.
 Name the query and the search engine will do the rest.
 An efficient graph search technique that can quickly
extract good results.
 A machine learning based ranking algorithm.
 Robust to noisy and ambiguous queries.
 Adaptive to the user’s preference.
SLQ - Outline
• The matching strategy
• The ranking and searching technique
• Offline learning
• Online query processing
• Performance and Demonstration
The Matching Technique
 A transformation-based matching approach
 The users can freely pose queries, without possessing any
knowledge of the underlying data.
 The querying system should automatically find the matches
through a set of transformations.
Query: (Actor, ~30 yrs) - (UCB) - (M:I)
A match: Chris Pine (1980) - University of California, Berkeley - J. J. Abrams - Mission: Impossible
Acronym transformation matches 'UCB' to 'University of California, Berkeley'.
Abbreviation transformation matches 'M:I' to 'Mission: Impossible'.
Numeric transformation matches '~30' to '1980'.
Common Transformations
Transformation    | Category | Example
First/Last token  | String   | "Barack Obama" -> "Obama"
Abbreviation      | String   | "Jeffrey Jacob Abrams" -> "J. J. Abrams"
Prefix            | String   | "Doctor" -> "Dr"
Acronym           | String   | "International Business Machines" -> "IBM"
Synonym           | Semantic | "tumor" -> "neoplasm"
Ontology          | Semantic | "teacher" -> "educator"
Range             | Numeric  | "~30" -> "1980"
Unit Conversion   | Numeric  | "3 mi" -> "4.8 km"
Distance          | Topology | "Pine" - "M:I" -> "Pine" - "J.J. Abrams" - "M:I"
...               | ...      | ...
 The focus of this work is not to study all possible transformations.
Any new transformation can be easily plugged into the framework.
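As an illustration of how pluggable transformations might look in code (a toy sketch, not the system's implementation; all function names here are our own):

```python
# Each transformation is just a predicate over (query label, data label);
# plugging in a new one means appending another function to the list.

def first_last_token(q, d):
    # "Obama" matches "Barack Obama" if one is a token of the other
    return q in d.split() or d in q.split()

def acronym(q, d):
    # "UCB" matches "University of California, Berkeley":
    # initials of the capitalized words form the acronym
    words = [w for w in d.split() if w[0].isupper()]
    return q.upper() == "".join(w[0] for w in words).upper()

def prefix(q, d):
    # "Dr" matches "Doctor"
    return d.lower().startswith(q.lower()) or q.lower().startswith(d.lower())

TRANSFORMATIONS = [first_last_token, acronym, prefix]

def matches(query_label, data_label):
    """Names of the transformations under which the two labels match."""
    return [f.__name__ for f in TRANSFORMATIONS if f(query_label, data_label)]

print(matches("UCB", "University of California, Berkeley"))  # -> ['acronym']
```

A new transformation is then one more predicate appended to `TRANSFORMATIONS`, with no change to the matching loop.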
The Matching Technique: New Challenges
 The transformation-based matching approach is likely to yield
many more results than classic direct matching.
[Figure: the average result number grows with the number of transformations applied]
 Finding all matching results (the subgraph matching problem) is
quite expensive; it is necessary to first suggest the "best"
results to the users.
 Rather than IR-based top-K search, we resort to ML-based
top-K search.
The Ranking and Searching Techniques
 The offline learning
 A ranking model should encode the transformations between
the query and the match
 The ranking model should automatically determine its
parameters.
 User query logs
 Automatic training data generation
 The online ranking and searching
 Among many match candidates, an efficient algorithm is
required to quickly extract the top-K results under the
ranking model.
The Ranking Model
 With a set of transformations { f_i }, given a query Q and
its match result R, our ranking model considers
 the node matching, from a query node v to its match φ(v):
   F_V(v, φ(v)) = Σ_i α_i f_i(v, φ(v))
 the edge matching, from a query edge e to its match φ(e):
   F_E(e, φ(e)) = Σ_j β_j f_j(e, φ(e))
 (the edge match φ(e) in the result could be a path)
 The overall ranking model: a probabilistic model based
on Conditional Random Fields:
   P(R | Q) ∝ exp( Σ_{v ∈ V_Q} F_V(v, φ(v)) + Σ_{e ∈ E_Q} F_E(e, φ(e)) )
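A minimal sketch of how this score could be computed (illustrative only; the toy feature functions and weights below stand in for the learned transformations f_i and parameters α_i, β_j):

```python
import math

# Toy scoring of a match R (given by the mapping phi) against a query Q
# under P(R|Q) ∝ exp( Σ_v F_V(v, phi(v)) + Σ_e F_E(e, phi(e)) ).

def F_V(v, u, node_feats, alpha):
    # node matching score: weighted sum of transformation features
    return sum(a * f(v, u) for a, f in zip(alpha, node_feats))

def F_E(e, pe, edge_feats, beta):
    # edge matching score (pe may correspond to a path in the data graph)
    return sum(b * f(e, pe) for b, f in zip(beta, edge_feats))

def unnormalized_prob(Vq, Eq, phi, node_feats, edge_feats, alpha, beta):
    total = sum(F_V(v, phi[v], node_feats, alpha) for v in Vq)
    total += sum(F_E(e, (phi[e[0]], phi[e[1]]), edge_feats, beta)
                 for e in Eq)
    return math.exp(total)  # normalization constant omitted
```

Ranking candidates only requires comparing these unnormalized scores, so the normalization constant never has to be computed.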
The Offline Learning
 It is clear that the parameters { α_i ; β_j } need to be
determined appropriately.
 Classic IR method: the parameters are tuned by domain experts
manually.
 Specific domain knowledge is not sufficient for big graph data.
 Supervised method: learning to rank
 User query logs: not easy to acquire.
 Manually label the training data: not practical and scalable.
 Unsupervised method: automatic training data generation
 A set of high quality training instances could be generated directly from
the graph data.
 Inspired by advances in learning high-level representations with deep
belief networks:
 Denoising autoencoders [Vincent08].
Automatic training data generation
 The basis of denoising autoencoders
 An explicit criterion for learning good representations:
 Robustness to partial destruction of the input.
 Intuition: humans are able to recognize partially corrupted images.
 Application: image classification (search)
[Figure: training minimizes L(p1, p1'), where p1' is p1 corrupted with random noise (ratio 0.1); at search time the corrupted query still retrieves the good result p1, alongside Google (p2) and Baidu (p3) results (picture credit: etidbits.com, youtube.com, blog.sina.com.cn)]
Automatic training data generation
 1. Sampling: a set of subgraphs is randomly extracted from the data
graph based on the templates.
 2. Query generation: queries are generated by randomly adding noise
(transformations) to the extracted subgraphs, e.g., "Tom Cruise"
becomes "Tom".
 3. Searching: the generated queries are searched on the data graph.
 4. Labeling: the results are labeled based on the original subgraph,
e.g., "Tom Cruise" is a good result while "Samuel Tom" is not.
 5. Training: the queries, with the labeled results, are used to
estimate the parameters of the ranking model (L-BFGS [Liu89]).
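Steps 1-4 of the pipeline might be sketched as follows (a toy stand-in: `add_noise` mimics a single first-token transformation, and `search` abstracts the query processor; names are our own):

```python
# Illustrative sketch of the automatic training-data loop.  A subgraph
# sampled from the data graph is corrupted into a query; results that
# recover the original subgraph are labeled as positives.

def add_noise(label):
    # Mimic a first-token transformation: "Tom Cruise" -> "Tom".
    return label.split()[0]

def generate_instance(subgraph, search):
    """subgraph: {node_id: label} extracted from the data graph."""
    query = {v: add_noise(lbl) for v, lbl in subgraph.items()}   # step 2
    results = search(query)                                      # step 3
    # step 4: a result is a positive example iff it is the original
    return query, [(r, int(r == subgraph)) for r in results]

# Toy usage: the original subgraph is labeled 1, an imposter 0.
original = {0: "Tom Cruise", 1: "Mission Impossible"}
imposter = {0: "Tom Hanks", 1: "Mission Impossible"}
query, labeled = generate_instance(original, lambda q: [original, imposter])
print(query)  # -> {0: 'Tom', 1: 'Mission'}
```

The labeled pairs are exactly the input that the step-5 parameter estimation (e.g., L-BFGS on the CRF likelihood) consumes.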
Online Query Processing
 Exact subgraph matching is quite expensive
 The transformations produce many match candidates.
 An NP-hard problem (by reduction from subgraph isomorphism).
 A small (possible) compromise on the quality of the results
can lead to a significant improvement in efficiency.
 A fast top-K search algorithm
 Indexing technique.
 Two effective heuristics in the search algorithm.
 Approximate inference in the CRFs ranking model.
 Sketch graph based pruning strategy.
Online Query Processing
 Inference in graphical model (CRFs)
 A CRFs model is constructed based on the query (as variables)
and the node match candidates (as possible assignments to the
variables)
 Top-1 result: computing the most likely assignment.
 Exact inference in general graphical model is still NP-hard [Sutton06].
 Approximate inference: Loopy Belief Propagation (LoopyBP)
 A message-passing algorithm:
   m_{j→i}^{(t)}(u_i) = max_{u_j} [ F_V(v_j, u_j) + F_E((v_j, v_i), (u_j, u_i)) + Σ_{v_k ∈ N(v_j)\v_i} m_{k→j}^{(t-1)}(u_j) ]
 It is very efficient and empirically successful in many practical
applications.
 From Top-1 to Top-k: best max-marginal first algorithm
[Yanover04]
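A toy version of this message update, written in the log domain consistent with the exponential form of the ranking model (illustrative only; all names are our own):

```python
# Max-sum message update on a query graph: query node j sends node i its
# belief about each of i's candidates.  candidates[v] is the candidate
# set of query node v; F_V and F_E are the node/edge scoring functions;
# prev_msgs[(k, j)] maps each candidate of j to the previous-round message.

def message(j, i, candidates, F_V, F_E, neighbors, prev_msgs):
    """Compute m_{j->i}(u_i) for every candidate u_i of query node i."""
    m = {}
    for u_i in candidates[i]:
        m[u_i] = max(
            F_V(j, u_j)
            + F_E((j, i), (u_j, u_i))
            + sum(prev_msgs[(k, j)][u_j]
                  for k in neighbors[j] if k != i)
            for u_j in candidates[j]
        )
    return m
```

Iterating these messages to convergence yields approximate max-marginals, from which the top-1 assignment is read off; the best max-marginal first algorithm [Yanover04] then extends this to top-k.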
Query Processing – Sketch Graph
 Problem: due to the various transformations, a large number of
match candidates must be inspected by the LoopyBP algorithm.
 Solution: sketch graph
 Candidates produced by the same transformation receive the same matching
score and thus can be grouped together.
[Figure: for the query "Brad Pitt" - "Troy film", candidates reached through the same transformation (f1, ..., fn) are grouped into sketch-graph nodes (e.g., "Brad", "Bradley"; "Troy", "Jim Troy"); the top-1 result in the sketch graph guides the top-1 result "Brad Pitt" - "Troy" in the data graph]
1. Construct the sketch graph.
2. Search in the sketch graph.
3. Search in the data graph.
Query Processing – Two-level Search
 The two-level search algorithm
 Search in sketch graph
 The size of the sketch graph is very small: only related to the size of the
query and the number of the transformations.
 The score of the upper-level match in the sketch graph is an upper
bound on the score of the corresponding lower-level match.
 Search in data graph
 Based on the upper-level match found in the sketch graph, the algorithm
then extracts the lower-level matches in the data graph on a smaller set of
match candidates.
 Pruning: the search stops once the remaining upper-level match
scores fall below the scores of the lower-level matches already found.
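Under the upper-bound property, the two-level loop can be sketched as a best-first search with early termination (illustrative; `upper_matches` and `refine` stand in for the sketch-graph and data-graph search routines, which are not shown here):

```python
import heapq

# Sketch-graph matches arrive best-first; each upper score bounds the
# scores of its data-graph refinements, so we can stop as soon as k
# data-graph matches beat every remaining upper bound.

def two_level_topk(upper_matches, refine, k):
    """upper_matches: (upper_score, group) pairs in descending score order.
    refine(group): scored data-graph matches for one sketch-graph match."""
    results = []  # min-heap of (score, match)
    for upper_score, group in upper_matches:
        if len(results) == k and upper_score <= results[0][0]:
            break  # prune: no remaining match can enter the top-k
        for score, match in refine(group):
            if len(results) < k:
                heapq.heappush(results, (score, match))
            elif score > results[0][0]:
                heapq.heapreplace(results, (score, match))
    return sorted(results, reverse=True)
```

The break is safe precisely because the upper-level scores are upper bounds and are enumerated in descending order.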
Other Designs - Indexing
 An index is constructed based on the transformations. Given
a query, its match candidates can be extracted readily from
the index.
[Figure: the query (Pine; actor; ~30 yrs) is matched against the data graph (v1: Chris Pine, born in 1980; v2: UC Berkeley; v3: "Star Trek", with a cast edge) through three indexes:
 StrIdx, string transformations: Pine : {v1, ...}, C. Pine : {v1, ...}, Chris : {v1, ...}, ...
 OntIdx, ontology transformations: actor = cast : {v1, ...}, ...
 NumIdx, numeric transformations (a B+ tree over values): 32, 33, 34 : {v1, ...}]
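A toy version of such indexes (a dictionary over transformed labels for StrIdx/OntIdx lookups and a sorted array standing in for the B+ tree; labels and node ids are illustrative):

```python
import bisect
from collections import defaultdict

# Minimal transformation index: string variants are keyed directly, and
# numeric values are kept sorted so a range query (e.g. "~30 yrs" mapped
# to a birth-year interval) is two binary searches.

class TransformationIndex:
    def __init__(self):
        self.str_idx = defaultdict(set)      # transformed label -> node ids
        self.num_keys, self.num_vals = [], []

    def add_string(self, node, variants):
        for v in variants:                   # e.g. "Chris Pine", "C. Pine"
            self.str_idx[v].add(node)

    def add_number(self, node, value):
        i = bisect.bisect_left(self.num_keys, value)
        self.num_keys.insert(i, value)
        self.num_vals.insert(i, node)

    def lookup_string(self, label):
        return self.str_idx.get(label, set())

    def lookup_range(self, lo, hi):
        i = bisect.bisect_left(self.num_keys, lo)
        j = bisect.bisect_right(self.num_keys, hi)
        return set(self.num_vals[i:j])
```

At query time, each query node's label is probed under every applicable transformation, and the union of the returned node sets becomes its candidate set.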
The Performance Evaluation
 Dataset
Graph    | Nodes | Edges | Node types | Relations | Size
DBpedia  | 3.7M  | 20M   | 359        | 800       | 40G
YAGO2    | 2.9M  | 11M   | 6,543      | 349       | 18.5G
Freebase | 40.3M | 180M  | 10,110     | 9,101     | 88G
 Query template
 DBPSB [Morsey11] : the query templates derived from query logs on
DBpedia
 Baseline
 Spark [Luo07]: an IR-based ranking model that treats each node as a
document and does not consider any structural information. The
parameters are fixed.
 SLQ: our method.
 Unit: a variant of SLQ with all parameters set to the same value.
 Card: a variant of SLQ with each parameter set to the selectivity of the
corresponding transformation.
The Performance Evaluation – Case Study
 Queries on DBpedia
 In the above cases, Spark gives low IR scores and cannot identify
matches for Query 2.
The Performance Evaluation - Effectiveness
 Queries are randomly generated on
YAGO2 based on the query templates.
 Evaluation metric: MAP@5.
 Varying the query size with the transformation ratio fixed:
 SLQ shows the best results.
 Larger queries (with more evidence) are more robust to the
noise (the transformations).
 Varying the transformation ratio:
 The results degrade as the transformation ratio increases.
The Performance Evaluation - Efficiency
 Queries are randomly generated
on Freebase based on the query
templates.
 The transformation ratio:
0.2~0.5.
 Baseline
 Exact: exact search on all
candidates.
 NeMa [Khan13]: a graph search
algorithm similar to LoopyBP.
 Increasing the graph size in a
“streaming” mode
 With the sketch graph, SLQ
outperforms both baselines.
Conclusion
 A novel framework for schemaless graph querying.
 A matching strategy: transformation.
 A ranking model.
 Incorporates the transformations.
 Automatic training in the cold-start setting.
 Top-k search algorithm.
The Architecture and Future Work
[Architecture diagram: a runtime layer over indexing and storage]
Future work: natural-language query processing; better top-k search algorithms; storage for multiple data graphs; better ranking models; warm-start training.
System Demonstration
[Screenshot: dataset selection; keyword/SLQ query input; graph query drawing and result rendering panel; result navigation bar; information panel]
More Applications
Text mining / RDF -> link entities across documents -> build an information network -> graph visualization / querying / analysis
References
 [Vincent08] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
 [Liu89] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
 [Sutton06] C. Sutton and A. McCallum. An Introduction to Conditional Random Fields for Relational Learning. MIT Press, 2006.
 [Yanover04] C. Yanover and Y. Weiss. Finding the M most probable configurations using loopy belief propagation. In NIPS, 2004.
 [Luo07] Y. Luo, X. Lin, W. Wang, and X. Zhou. Spark: top-k keyword query in relational databases. In SIGMOD, 2007.
 [Morsey11] M. Morsey, J. Lehmann, S. Auer, and A.-C. Ngonga Ngomo. DBpedia SPARQL benchmark - performance assessment with real queries on real data. In ISWC, 2011.
 [Liu07] Z. Liu and Y. Chen. Identifying meaningful return information for XML keyword search. In SIGMOD, 2007.
 [Koutrika06] G. Koutrika, A. Simitsis, and Y. E. Ioannidis. Précis: the essence of a query answer. In ICDE, 2006.
 [Ding07] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding top-k min-cost connected trees in databases. In ICDE, 2007.
 [Talukdar08] P. P. Talukdar, M. Jacob, M. S. Mehmood, K. Crammer, Z. G. Ives, F. Pereira, and S. Guha. Learning to create data-integrating queries. In VLDB, 2008.
 [Bhalotia02] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, 2002.
 [Dalvi08] B. B. Dalvi, M. Kshirsagar, and S. Sudarshan. Keyword search on external memory data graphs. In VLDB, 2008.
 [Hristidis03] V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword proximity search on XML graphs. In ICDE, 2003.
 [Zhang09] S. Zhang, S. Li, and J. Yang. GADDI: distance index based subgraph matching in biological networks. In EDBT, 2009.
References (cont.)
 [Li09] G. Li, S. Ji, C. Li, and J. Feng. Efficient type-ahead search on relational data: a TASTIER approach. In SIGMOD, 2009.
 [Markowetz09] A. Markowetz, Y. Yang, and D. Papadias. Reachability indexes for relational keyword search. In ICDE, 2009.
 [He07] H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: ranked keyword searches on graphs. In SIGMOD, 2007.
 [Qiao13] M. Qiao, L. Qin, H. Cheng, J. X. Yu, and W. Tian. Top-k nearest keyword search on large graphs. In VLDB, 2013.
 [Kasneci08] G. Kasneci, F. M. Suchanek, G. Ifrim, M. Ramanath, and G. Weikum. NAGA: searching and ranking knowledge. In ICDE, 2008.
 [Khan13] A. Khan, Y. Wu, C. C. Aggarwal, and X. Yan. NeMa: fast graph search with label similarity. In VLDB, 2013.
 [Shasha02] D. Shasha, J. T. L. Wang, and R. Giugno. Algorithmics and applications of tree and graph searching. In PODS, 2002.
 [Yan04] X. Yan, P. S. Yu, and J. Han. Graph indexing: a frequent structure-based approach. In SIGMOD, 2004.
 [Zhao07] P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: tree + delta >= graph. In VLDB, 2007.
 [Zou08] L. Zou, L. Chen, J. X. Yu, and Y. Lu. A novel spectral coding in a large graph database. In EDBT, 2008.
 [Ullman76] J. R. Ullmann. An algorithm for subgraph isomorphism. J. ACM, 1976.
 [Cordella04] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub)graph isomorphism algorithm for matching large graphs. IEEE PAMI, 26(10):1367–1372, 2004.
References (cont.)
 [Shang08] H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. In VLDB, 2008.
 [Tian08] Y. Tian and J. M. Patel. TALE: a tool for approximate large graph matching. In ICDE, 2008.
 [Mongiovi10] M. Mongiovi, R. D. Natale, R. Giugno, A. Pulvirenti, A. Ferro, and R. Sharan. SIGMA: a set-cover-based inexact graph matching algorithm. J. Bioinformatics and Computational Biology, 8(2):199–218, 2010.
 [Liu06] F. Liu, C. Yu, W. Meng, and A. Chowdhury. Effective keyword search in relational databases. In SIGMOD, 2006.
 [Elbassuoni09] S. Elbassuoni, M. Ramanath, R. Schenkel, M. Sydow, and G. Weikum. Language-model-based ranking for queries on RDF-graphs. In CIKM, 2009.
 [Mass12] Y. Mass and Y. Sagiv. Language models for keyword search over data graphs. In WSDM, 2012.
 [Liu09] T.-Y. Liu. Learning to rank for information retrieval. In WWW, 2009.