Schemaless Graph Querying
Shengqi Yang, Yinghui Wu, Huan Sun and Xifeng Yan
UC Santa Barbara

The Big Graphs
- Internet graph, social graph (Facebook), software graph (Apache), protein interaction graph (human).
- (Image credits: Andrew R. Wade, Facebook, Apache, Ravi Iyengar)

The Big Graphs
It is not only about the scale!
- Facebook 10 years ago vs. Facebook now: beyond people, the graph now contains organizations, businesses, videos, music, photos, and "join" relations.
- Examples are countless: Google's knowledge graph, LinkedIn's professional network, Linked Open Data, etc.

The Graph Querying Problem: Subgraph Querying
- a.k.a. subgraph matching, subgraph isomorphism, etc.
- Given a pattern query, find its answers (a.k.a. matches, embeddings) in a graph database (e.g., molecule graphs) or in a single large data graph (credit: linkeddata.org).

Querying Big Graphs: the Challenges
The data graph is schemaless
- The data items, nodes and relations, come from various domains and are thus quite diverse.
- There is no uniform convention followed by the data contributors.
The query is schemaless
- Queries are usually not posed in accordance with the metadata, which is typically not visible to the searchers.
A "good" querying system should
- bridge the gap between query and data: match the query to candidates;
- rank the results in a principled way;
- return the top results at query time.

Querying Graphs: the DB approach
Keyword search
- Find a root node (LCA in XML or trees) [Liu07][Koutrika06].
- Group Steiner tree [Ding07][Talukdar08].
- Expansion from keyword matches [Bhalotia02][Dalvi08].
- Search based on a score function [Hristidis03][Luo07][Kasneci08].
- Search based on indexes (e.g., reachability, distance) [He07][Markowetz09][Li09][Qiao13].
Graph search
- Exact subgraph matching, based on indexing techniques:
  - search on graph databases [Shasha02][Yan04][Zhao07][Zou08];
  - search on a single large graph [Ullmann76][Cordella04][Shang08][Zhang09].
- Approximate subgraph matching [Tian08][Mongiovi10][Khan13].
Most of these works focus primarily on efficiency; result ranking is usually treated only superficially.

Querying Graphs: the IR and ML approach
IR-based ranking
- Classic IR metrics: TF-IDF, BM25, etc. [Hristidis03][Liu06][Luo07].
- Language models: a probabilistic space for each document [Elbassuoni09][Mass12].
- These works are mostly based on linear combinations of IR metrics, with fixed, hand-tuned parameters.
Machine-learned ranking (MLR) [Liu09]
- Research in this domain targets (i) how to design an effective (ensemble) ranking model and (ii) how to estimate the parameters of the ranking model.
- Application: ranking documents or web pages.
- Input: a score vector for a (query, doc) pair, together with its relevance score.

Schemaless Graph Querying (SLQ)
- No knowledge of the graph data is required of the users: name the query and the search engine does the rest.
- An efficient graph search technique that can quickly extract good results.
- A machine-learning-based ranking algorithm that is
  - robust to noisy and ambiguous queries;
  - adaptive to the user's preference.

SLQ - Outline
• The matching strategy
• The ranking and searching technique
  • Offline learning
  • Online query processing
• Performance and demonstration

The Matching Technique
A transformation-based matching approach
- Users can freely pose queries without possessing any knowledge of the underlying data.
- The querying system automatically finds the matches through a set of transformations.
Example query: an actor, ~30 yrs old, connected to "UCB" and "M:I".
A match: Chris Pine (1980), University of California, Berkeley, and Mission: Impossible (connected through J. J. Abrams).
- The acronym transformation matches 'UCB' to 'University of California, Berkeley'.
- The abbreviation transformation matches 'M:I' to 'Mission: Impossible'.
- The numeric transformation matches '~30' to '1980'.

Common Transformations

Transformation   | Category | Example
-----------------|----------|--------------------------------------------------
First/Last token | String   | "Barack Obama" > "Obama"
Abbreviation     | String   | "Jeffrey Jacob Abrams" > "J. J. Abrams"
Prefix           | String   | "Doctor" > "Dr"
Acronym          | String   | "International Business Machines" > "IBM"
Synonym          | Semantic | "tumor" > "neoplasm"
Ontology         | Semantic | "teacher" > "educator"
Range            | Numeric  | "~30" > "1980"
Unit conversion  | Numeric  | "3 mi" > "4.8 km"
Distance         | Topology | "Pine" - "M:I" > "Pine" - "J. J. Abrams" - "M:I"
...              | ...      | ...

The focus of this work is not to study all possible transformations; any new transformation can easily be plugged into the framework. (A small sketch of a few transformation tests follows.)
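To make the transformations concrete, here is a minimal Python sketch of three of the tests above; the function names, matching heuristics, and tolerance values are our own illustrative assumptions, not the system's implementation.

```python
import re

def acronym_match(query_term: str, data_term: str) -> bool:
    """Acronym transformation: 'UCB' ~ 'University of California, Berkeley'.
    Heuristic: compare the query term against the capitalized initials of the data term."""
    initials = "".join(w[0] for w in re.findall(r"[A-Za-z]+", data_term) if w[0].isupper())
    return query_term.replace(".", "").upper() == initials

def abbreviation_match(query_term: str, data_term: str) -> bool:
    """Abbreviation transformation: 'M:I' ~ 'Mission: Impossible'.
    Heuristic: each query token must be a prefix of the corresponding data token."""
    q = re.findall(r"[A-Za-z]+", query_term)
    d = re.findall(r"[A-Za-z]+", data_term)
    return len(q) == len(d) and all(dt.lower().startswith(qt.lower()) for qt, dt in zip(q, d))

def range_match(query_term: str, birth_year: int, current_year: int = 2014) -> bool:
    """Numeric (range) transformation: '~30' ~ '1980'.
    Heuristic: the age implied by the birth year falls within a tolerance window."""
    m = re.match(r"~(\d+)$", query_term)
    if m is None:
        return False
    return abs((current_year - birth_year) - int(m.group(1))) <= 5  # tolerance is an assumption

assert acronym_match("UCB", "University of California, Berkeley")
assert abbreviation_match("M:I", "Mission: Impossible")
assert range_match("~30", 1980)
```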
The Matching Technique: New Challenges
- The transformation-based matching approach is likely to produce many more results than classic direct matching: the average result number grows with the number of transformations applied. [Figure: avg. result number vs. no. of transformations applied.]
- Finding all matching results (the subgraph matching problem) is quite expensive.
- It is therefore necessary to first suggest the "best" results to the users.
- Instead of IR-based top-k search, we resort to ML-based top-k search.

The Ranking and Searching Techniques
Offline learning
- The ranking model should encode the transformations between the query and the match.
- The ranking model should determine its parameters automatically, either from user query logs or through automatic training data generation.
Online ranking and searching
- Among many match candidates, an efficient algorithm is required to quickly extract the top-k results in terms of the ranking model.

The Ranking Model
With a set of transformations $\{f_i\}$, given a query $Q$ and a match result $R$, our ranking model considers
- node matching, from a query node $v$ to its match $\varphi(v)$:
  $$F_V(v,\varphi(v)) = \sum_i \alpha_i\, f_i(v,\varphi(v))$$
- edge matching, from a query edge $e$ to its match $\varphi(e)$ (the edge match $\varphi(e)$ in the result may be a path):
  $$F_E(e,\varphi(e)) = \sum_j \beta_j\, f_j(e,\varphi(e))$$
The overall ranking model is a probabilistic model based on Conditional Random Fields:
$$P(R \mid Q) \propto \exp\Big(\sum_{v \in V_Q} F_V(v,\varphi(v)) + \sum_{e \in E_Q} F_E(e,\varphi(e))\Big)$$
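The following minimal Python sketch shows how this unnormalized score could be evaluated for one candidate match; the dictionary-based layout and the indicator-style features are our assumptions for illustration.

```python
import math

def match_score(phi_nodes, phi_edges, node_feats, edge_feats, alpha, beta):
    """Unnormalized CRF-style matching score exp(sum_v F_V + sum_e F_E).

    phi_nodes: {query_node: matched data node}
    phi_edges: {query_edge: matched data path}
    node_feats[i](v, u): 1.0 if transformation i maps query node v to data node u, else 0.0
    edge_feats[j](e, p): 1.0 if transformation j maps query edge e to data path p, else 0.0
    alpha, beta: learned transformation weights."""
    f_v = sum(alpha[i] * f(v, u)
              for v, u in phi_nodes.items()
              for i, f in enumerate(node_feats))
    f_e = sum(beta[j] * g(e, p)
              for e, p in phi_edges.items()
              for j, g in enumerate(edge_feats))
    # The normalization constant Z is omitted: ranking only needs the relative order.
    return math.exp(f_v + f_e)
```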
The Offline Learning
It is clear that the parameters $\{\alpha_i; \beta_j\}$ need to be determined appropriately.
- Classic IR method: the parameters are tuned manually by domain experts.
  - Specific domain knowledge is not sufficient for big graph data.
- Supervised method: learning to rank.
  - User query logs: not easy to acquire.
  - Manually labeling the training data: neither practical nor scalable.
- Unsupervised method: automatic training data generation.
  - A set of high-quality training instances can be generated directly from the graph data.
  - Inspired by the advances in learning high-level representations with deep belief networks: denoising autoencoders [Vincent08].

Automatic Training Data Generation
The basis of denoising autoencoders
- An explicit criterion for learning good representations: robustness to partial destruction of the input.
- Intuition: humans are able to recognize partially corrupted images.
- Application: image classification (search); train to minimize the reconstruction loss $L(p_1, p_1')$ between an image $p_1$ and its copy $p_1'$ corrupted with random noise (0.1).
[Figure: training on noised images from an image database; searching with a corrupted image still retrieves a good result, with Google and Baidu results shown for comparison. Picture credits: etidbits.com, youtube.com, blog.sina.com.cn]

Automatic Training Data Generation (cont.)
1. Sampling: a set of subgraphs is randomly extracted from the data graph based on the query templates.
2. Query generation: queries are generated by randomly adding noise (transformations) to the extracted subgraphs, e.g., "Tom Cruise" > "Tom".
3. Searching: the generated queries are run against the data graph.
4. Labeling: the results are labeled based on the original subgraph.
5. Training: the queries, together with the labeled results, are used to estimate the parameters of the ranking model (L-BFGS [Liu89]).
A small sketch of this pipeline follows.
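A minimal Python sketch of the generation loop; `sample_subgraph`, `corrupt`, and `search` are hypothetical stand-ins for the system's components, and the loop structure is our reading of the five steps above.

```python
import random

def generate_training_data(data_graph, templates, transformations, n_samples, k):
    """Automatic training data generation: sample -> corrupt -> search -> label.
    Returns (query, result, label) triples for estimating the ranking parameters."""
    training = []
    for _ in range(n_samples):
        answer = sample_subgraph(data_graph, random.choice(templates))   # 1. sampling
        query = answer.copy()
        for node in query.nodes:                                         # 2. add noise
            t = random.choice(transformations)
            node.label = t.corrupt(node.label)    # e.g. "Tom Cruise" -> "Tom"
        for result in search(data_graph, query, top=k):                  # 3. search
            label = 1.0 if result == answer else 0.0                     # 4. label
            training.append((query, result, label))
    return training

# 5. Training: feed the triples to a maximum-likelihood estimator of (alpha, beta),
# e.g. scipy.optimize.minimize(neg_log_likelihood, x0, method="L-BFGS-B").
```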
Online Query Processing
Exact subgraph matching is quite expensive:
- the transformations produce many match candidates;
- it is an NP-hard problem (subgraph isomorphism reduces to it).
A small, possible compromise on the quality of the results can lead to a significant improvement in efficiency.
A fast top-k search algorithm
- Indexing technique.
- Two effective heuristics in the search algorithm:
  - approximate inference in the CRF ranking model;
  - a sketch-graph-based pruning strategy.

Online Query Processing (cont.)
Inference in the graphical model (CRF)
- A CRF model is constructed from the query (as variables) and the node match candidates (as possible assignments to the variables).
- The top-1 result corresponds to the most likely assignment.
- Exact inference in general graphical models is still NP-hard [Sutton06].
Approximate inference: loopy belief propagation (LoopyBP)
- A message-passing algorithm with the update
  $$m^{(t)}_{j \to i}(u_i) = \max_{u_j}\Big[\,F_V(v_j, u_j) + F_E\big((v_j, v_i), (u_j, u_i)\big) + \sum_{v_k \in N(v_j) \setminus \{v_i\}} m^{(t-1)}_{k \to j}(u_j)\Big]$$
- It is very efficient and empirically successful in many practical applications.
- From top-1 to top-k: the Best Max-Marginal First algorithm [Yanover04].
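A minimal Python sketch of the max-product message update above, assuming candidates, neighbor lists, and previous messages are stored in plain dictionaries (the data layout is our assumption):

```python
def message_update(v_j, v_i, candidates, F_V, F_E, prev_msgs, neighbors):
    """One max-product LoopyBP message m_{j->i}^{(t)}(u_i).

    candidates[v]: iterable of data-node candidates for query node v
    prev_msgs[(v_k, v_j)][u_j]: message from v_k to v_j at iteration t-1
    neighbors[v_j]: query-graph neighbors of v_j"""
    msg = {}
    for u_i in candidates[v_i]:
        # For each candidate u_i of v_i, take the best candidate u_j of v_j,
        # scoring its node match, the edge match, and all incoming messages
        # to v_j except the one coming from v_i.
        msg[u_i] = max(
            F_V(v_j, u_j)
            + F_E((v_j, v_i), (u_j, u_i))
            + sum(prev_msgs[(v_k, v_j)][u_j] for v_k in neighbors[v_j] if v_k != v_i)
            for u_j in candidates[v_j]
        )
    return msg
```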
Query Processing: Sketch Graph
- Problem: due to the various transformations, a large number of match candidates must be inspected by the LoopyBP algorithm.
- Solution: the sketch graph. Candidates reached through the same transformation carry the same matching score and can thus be grouped together.
- Workflow: 1. construct the sketch graph; 2. search in the sketch graph; 3. search in the data graph.
[Figure: a query ("Brad Pitt" - "Troy") and its sketch graph; candidates such as "Brad", "Bradley", "Brad Pitt" and "Troy", "Troy (film)", "Jim Troy" are grouped by transformations f_1, ..., f_n, and the top-1 result in the sketch graph guides the top-1 result in the data graph.]

Query Processing: Two-Level Search
The two-level search algorithm
Search in the sketch graph
- The sketch graph is very small: its size depends only on the size of the query and the number of transformations.
- The score of an upper-level match in the sketch graph is no less than the score of the corresponding lower-level match, i.e., it is an upper bound.
Search in the data graph
- Based on the upper-level match found in the sketch graph, the algorithm extracts the lower-level matches in the data graph from a smaller set of match candidates.
- Pruning: stop expanding when the score of a lower-level match drops below that of a previously found upper-level match.

Other Designs: Indexing
- An index is constructed based on the transformations; given a query, its match candidates can be extracted directly from the index.
- StrIdx covers string transformations (e.g., "Pine": {v1, ...}, "C. Pine": {v1, ...}, "Chris": {v1, ...}); OntIdx covers ontology transformations (e.g., actor = cast: {v1, ...}); NumIdx covers numeric transformations with a B+ tree over values (e.g., 32, 33, 34 under the ranges <35 and <50, each pointing to node sets {v1, ...}).
[Figure: the query ("Pine", actor, ~30 yrs) probing StrIdx, OntIdx, and NumIdx over a data graph containing "Chris Pine, born in 1980, UC Berkeley, cast of 'Star Trek'".]
A minimal sketch of such a candidate index follows.
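A minimal Python sketch of a transformation-aware candidate index; the in-memory maps and the sorted list standing in for the B+ tree are our own simplifications, not the system's data structures.

```python
import bisect
from collections import defaultdict

class CandidateIndex:
    """StrIdx/OntIdx as inverted maps from transformed keys to node ids;
    NumIdx as a sorted list supporting range probes (a stand-in for a B+ tree)."""

    def __init__(self):
        self.str_idx = defaultdict(set)   # transformed string -> node ids
        self.ont_idx = defaultdict(set)   # ontology synonym   -> node ids
        self.num_vals = []                # sorted (value, node_id); node ids are strings

    def add_node(self, node_id, labels, synonyms=(), value=None):
        for key in labels:                # e.g. "Chris Pine", "C. Pine", "Pine"
            self.str_idx[key.lower()].add(node_id)
        for key in synonyms:              # e.g. "actor" for a node labeled "cast"
            self.ont_idx[key.lower()].add(node_id)
        if value is not None:             # e.g. the birth year 1980
            bisect.insort(self.num_vals, (value, node_id))

    def lookup(self, term):
        """Candidates reachable via string or ontology transformations."""
        return self.str_idx[term.lower()] | self.ont_idx[term.lower()]

    def range_probe(self, lo, hi):
        """Candidates whose numeric value lies in [lo, hi] (numeric transformation)."""
        i = bisect.bisect_left(self.num_vals, (lo, ""))
        j = bisect.bisect_right(self.num_vals, (hi, "\uffff"))
        return {nid for _, nid in self.num_vals[i:j]}

# Example usage (hypothetical node ids):
idx = CandidateIndex()
idx.add_node("v1", ["Chris Pine", "C. Pine", "Pine"], synonyms=["actor"], value=1980)
assert "v1" in idx.lookup("Pine") and "v1" in idx.range_probe(1975, 1985)
```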
The Performance Evaluation
Datasets:

Graph    | Nodes | Edges | Node types | Relations | Size
---------|-------|-------|------------|-----------|------
DBpedia  | 3.7M  | 20M   | 359        | 800       | 40G
YAGO2    | 2.9M  | 11M   | 6,543      | 349       | 18.5G
Freebase | 40.3M | 180M  | 10,110     | 9,101     | 88G

Query templates: DBPSB [Morsey11], query templates derived from query logs on DBpedia.
Baselines:
- Spark [Luo07]: an IR-based ranking model that treats each node as a document and does not consider any structural information; its parameters are fixed.
- SLQ: our method.
- Unit: a variant of SLQ with equal parameter values.
- Card: a variant of SLQ whose parameter values are the selectivities of the corresponding transformations.

The Performance Evaluation: Case Study
[Figure: example queries on DBpedia with their matches.] In these cases, Spark gives low IR scores and cannot identify any matches for Query 2.

The Performance Evaluation: Effectiveness
- Queries are randomly generated on YAGO2 based on the query templates; evaluation metric: MAP@5.
- Varying the query size while fixing the transformation ratio: SLQ shows the best results; larger queries (with more evidence) are more robust to the noise (the transformations).
- Varying the transformation ratio: the results degrade as the transformation ratio increases.

The Performance Evaluation: Efficiency
- Queries are randomly generated on Freebase based on the query templates; the transformation ratio is 0.2~0.5.
- Baselines: Exact (exact search over all candidates) and NeMa [Khan13] (a graph search algorithm similar to LoopyBP).
- Increasing the graph size in a "streaming" mode: with the sketch graph, SLQ outperforms both baselines.

Conclusion
- A novel framework for schemaless graph querying.
- A matching strategy based on transformations.
- A ranking model that incorporates the transformations and supports automatic training in the cold start.
- A top-k search algorithm.

The Architecture and Future Work
- Natural language query processing.
- A better top-k search algorithm (runtime).
- Indexing and storage; multiple data graphs.
- A better ranking model; warm-start training.

System Demonstration
- Dataset selection; keyword/SLQ query; graph query drawing / result rendering panel; result navigation bar; information panel.

More Applications
- Text mining / RDF -> link entities across documents -> build an information network.
- Graph visualization / querying / analysis.

References
[Vincent08] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[Liu89] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503-528, 1989.
[Sutton06] C. Sutton and A. McCallum. An Introduction to Conditional Random Fields for Relational Learning. MIT Press, 2006.
[Yanover04] C. Yanover and Y. Weiss. Finding the M most probable configurations using loopy belief propagation. In NIPS, 2004.
[Luo07] Y. Luo, X. Lin, W. Wang, and X. Zhou. SPARK: top-k keyword query in relational databases. In SIGMOD, 2007.
[Morsey11] M. Morsey, J. Lehmann, S. Auer, and A.-C. Ngonga Ngomo. DBpedia SPARQL benchmark: performance assessment with real queries on real data. In ISWC, 2011.
[Liu07] Z. Liu and Y. Chen. Identifying meaningful return information for XML keyword search. In SIGMOD, 2007.
[Koutrika06] G. Koutrika, A. Simitsis, and Y. E. Ioannidis. Précis: the essence of a query answer. In ICDE, 2006.
[Ding07] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding top-k min-cost connected trees in databases. In ICDE, 2007.
[Talukdar08] P. P. Talukdar, M. Jacob, M. S. Mehmood, K. Crammer, Z. G. Ives, F. Pereira, and S. Guha. Learning to create data-integrating queries. In VLDB, 2008.
[Bhalotia02] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, 2002.
[Dalvi08] B. B. Dalvi, M. Kshirsagar, and S. Sudarshan. Keyword search on external memory data graphs. In VLDB, 2008.
[Hristidis03] V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword proximity search on XML graphs. In ICDE, 2003.
[Zhang09] S. Zhang, S. Li, and J. Yang. GADDI: distance index based subgraph matching in biological networks. In EDBT, 2009.
[Li09] G. Li, S. Ji, C. Li, and J. Feng. Efficient type-ahead search on relational data: a TASTIER approach. In SIGMOD, 2009.
[Markowetz09] A. Markowetz, Y. Yang, and D. Papadias. Reachability indexes for relational keyword search. In ICDE, 2009.
[He07] H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: ranked keyword searches on graphs. In SIGMOD, 2007.
[Qiao13] M. Qiao, L. Qin, H. Cheng, J. X. Yu, and W. Tian. Top-k nearest keyword search on large graphs. In VLDB, 2013.
[Kasneci08] G. Kasneci, F. M. Suchanek, G. Ifrim, M. Ramanath, and G. Weikum. NAGA: searching and ranking knowledge. In ICDE, 2008.
[Khan13] A. Khan, Y. Wu, C. C. Aggarwal, and X. Yan. NeMa: fast graph search with label similarity. In VLDB, 2013.
[Shasha02] D. Shasha, J. T. L. Wang, and R. Giugno. Algorithmics and applications of tree and graph searching. In PODS, 2002.
[Yan04] X. Yan, P. S. Yu, and J. Han. Graph indexing: a frequent structure-based approach. In SIGMOD, 2004.
[Zhao07] P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: tree + delta >= graph. In VLDB, 2007.
[Zou08] L. Zou, L. Chen, J. X. Yu, and Y. Lu. A novel spectral coding in a large graph database. In EDBT, 2008.
[Ullmann76] J. R. Ullmann. An algorithm for subgraph isomorphism. J. ACM, 1976.
[Cordella04] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub)graph isomorphism algorithm for matching large graphs. IEEE PAMI, 26(10):1367-1372, 2004.
[Shang08] H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. In VLDB, 2008.
[Tian08] Y. Tian and J. M. Patel. TALE: a tool for approximate large graph matching. In ICDE, 2008.
[Mongiovi10] M. Mongiovi, R. D. Natale, R. Giugno, A. Pulvirenti, A. Ferro, and R. Sharan. SIGMA: a set-cover-based inexact graph matching algorithm. J. Bioinformatics and Computational Biology, 8(2):199-218, 2010.
[Liu06] F. Liu, C. Yu, W. Meng, and A. Chowdhury. Effective keyword search in relational databases. In SIGMOD, 2006.
[Elbassuoni09] S. Elbassuoni, M. Ramanath, R. Schenkel, M. Sydow, and G. Weikum. Language-model-based ranking for queries on RDF-graphs. In CIKM, 2009.
[Mass12] Y. Mass and Y. Sagiv. Language models for keyword search over data graphs. In WSDM, 2012.
[Liu09] T.-Y. Liu. Learning to rank for information retrieval. In WWW, 2009.