
Schemaless Graph Querying
Shengqi Yang, Yinghui Wu, Huan Sun and Xifeng Yan
UC Santa Barbara
The Big Graphs
Internet graph
Social graph (Facebook)
Software graph (Apache)
Protein interaction graph (human)
Credits: Andrew R. Wade, Facebook, Apache, Ravi Iyengar
The Big Graphs
 It is not only about the scale!
[Figure: Facebook's graph 10 years ago (people and friendships) vs. Facebook now (people, organizations, businesses, videos, music, photos, and richer relations such as "join")]
 Examples are countless: Google’s knowledge graph, LinkedIn’s
professional network, Linked Open Data, etc.
The Graph Querying
 Problem: Subgraph querying
o a.k.a., subgraph matching, subgraph isomorphism, etc
[Figure: a pattern query posed against molecule graphs (a graph database) and against a data graph (credit: linkeddata.org); an answer is a.k.a. a match or embedding]
Querying Big Graphs: the Challenges
 The data graph is schemaless
 The data items, nodes and relations, come from various domains and
are thus quite diverse.
 There is no uniform convention followed by the data contributors.
 The query is schemaless
 Queries are usually not posed in accordance with the metadata,
which is invisible to the searchers.
 A “good” querying system should
 Bridge the gap between query and data: match query to candidates.
 Rank the results in a principled way.
 Return top results at query time.
Querying Graphs: the DB approach
 Keyword Search
 Find a root node (LCA in XML or trees) [Liu07][Koutrika06].
 Group Steiner Tree [Ding07][Talukdar08].
 Expansion from keyword matches [Bhalotia02][Dalvi08].
 Searching based on the score function [Hristidis03][Luo07][Kasneci08].
 Searching based on indexes (e.g., reachability, distance, etc)
[He07][Markowetz09][Li09][Qiao13].
 Graph Search
 Exact subgraph matching, based on indexing techniques.
 Search on graph databases [Shasha02][Yan04][Zhao07][Zou08].
 Search on a single large graph
[Ullman76][Cordella04][Shang08][Zhang09].
 Approximate subgraph matching [Tian08][Mongiovi10][Khan13].
 Most of these works focus primarily on efficiency; result
ranking is usually treated simplistically.
Querying Graphs: the IR and ML approach
 IR based ranking
 Classic IR metrics: TF-IDF, BM25, etc.
[Hristidis03][Liu06][Luo07]
 Language models: a probabilistic space for each document
[Elbassuoni09][Mass12].
 These works are mostly based on linear combinations of the IR
metrics, with human-tuned, fixed parameters.
 Machine-learned ranking (MLR) [Liu09]
 The research in this domain focuses on
 How to design an effective (ensemble) ranking model.
 How to estimate the parameters in the ranking model.
 Application: rank the documents or the web pages.
 Input: a score vector for a (query, doc) pair and the relevance
score.
Schemaless Graph Querying (SLQ)
 To the users, no knowledge of the graph data is required.
 Name the query and the search engine will do the rest.
 An efficient graph search technique that can quickly
extract good results.
 A machine learning based ranking algorithm.
 Robust to noisy and ambiguous queries.
 Adaptive to the user’s preference.
SLQ - Outline
• The matching strategy
• The ranking and searching technique
• Offline learning
• Online query processing
• Performance and Demonstration
The Matching Technique
 A transformation-based matching approach
 The users can freely pose queries, without possessing any
knowledge of the underlying data.
 The querying system should automatically find the matches
through a set of transformations.
Query: (Actor, ~30 yrs) - (UCB) - (M:I)
A match: Chris Pine (1980) - University of California, Berkeley - J. J. Abrams - Mission: Impossible
Acronym transformation matches 'UCB' to 'University of California, Berkeley'.
Abbreviation transformation matches 'M:I' to 'Mission: Impossible'.
Numeric transformation matches '~30' to '1980'.
Common Transformations
Transformation    | Category | Example
First/Last token  | String   | "Barack Obama" -> "Obama"
Abbreviation      | String   | "Jeffrey Jacob Abrams" -> "J. J. Abrams"
Prefix            | String   | "Doctor" -> "Dr"
Acronym           | String   | "International Business Machines" -> "IBM"
Synonym           | Semantic | "tumor" -> "neoplasm"
Ontology          | Semantic | "teacher" -> "educator"
Range             | Numeric  | "~30" -> "1980"
Unit Conversion   | Numeric  | "3 mi" -> "4.8 km"
Distance          | Topology | "Pine" - "M:I" -> "Pine" - "J.J. Abrams" - "M:I"
...               | ...      | ...
 The focus of this work is not to study all possible transformations.
Any new transformation can be easily plugged into the framework.
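As an illustration of how pluggable transformations might look in code (a toy sketch, not the system's implementation; all function names here are our own):

```python
# Each transformation is just a predicate over (query label, data label);
# plugging in a new one means appending another function to the list.

def first_last_token(q, d):
    # "Obama" matches "Barack Obama" if one is a token of the other
    return q in d.split() or d in q.split()

def acronym(q, d):
    # "UCB" matches "University of California, Berkeley":
    # initials of the capitalized words form the acronym
    words = [w for w in d.split() if w[0].isupper()]
    return q.upper() == "".join(w[0] for w in words).upper()

def prefix(q, d):
    # "Dr" matches "Doctor"
    return d.lower().startswith(q.lower()) or q.lower().startswith(d.lower())

TRANSFORMATIONS = [first_last_token, acronym, prefix]

def matches(query_label, data_label):
    """Names of the transformations under which the two labels match."""
    return [f.__name__ for f in TRANSFORMATIONS if f(query_label, data_label)]

print(matches("UCB", "University of California, Berkeley"))  # -> ['acronym']
```

A new transformation is then one more predicate appended to `TRANSFORMATIONS`, with no change to the matching loop.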
The Matching Technique: New Challenges
 The transformation-based matching approach is likely to yield
many more results than classic direct matching.
[Figure: the average result number grows with the number of transformations applied]
 Finding all matching results (the subgraph matching problem) is
quite expensive; it is necessary to first suggest the "best"
results to the users.
 Rather than IR-based top-K search, we resort to ML-based
top-K search.
The Ranking and Searching Techniques
 The offline learning
 A ranking model should encode the transformations between
the query and the match
 The ranking model should automatically determine its
parameters.
 User query logs
 Automatic training data generation
 The online ranking and searching
 Among many match candidates, an efficient algorithm is
required to quickly extract the top-K results under the
ranking model.
The Ranking Model
 With a set of transformations { f_i }, given a query Q and
its match result R, our ranking model considers
 the node matching, from a query node v to its match φ(v):
   F_V(v, φ(v)) = Σ_i α_i f_i(v, φ(v))
 the edge matching, from a query edge e to its match φ(e):
   F_E(e, φ(e)) = Σ_j β_j f_j(e, φ(e))
 (the edge match φ(e) in the result could be a path)
 The overall ranking model: a probabilistic model based
on Conditional Random Fields:
   P(R | Q) ∝ exp( Σ_{v ∈ V_Q} F_V(v, φ(v)) + Σ_{e ∈ E_Q} F_E(e, φ(e)) )
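A minimal sketch of how this score could be computed (illustrative only; the toy feature functions and weights below stand in for the learned transformations f_i and parameters α_i, β_j):

```python
import math

# Toy scoring of a match R (given by the mapping phi) against a query Q
# under P(R|Q) ∝ exp( Σ_v F_V(v, phi(v)) + Σ_e F_E(e, phi(e)) ).

def F_V(v, u, node_feats, alpha):
    # node matching score: weighted sum of transformation features
    return sum(a * f(v, u) for a, f in zip(alpha, node_feats))

def F_E(e, pe, edge_feats, beta):
    # edge matching score (pe may correspond to a path in the data graph)
    return sum(b * f(e, pe) for b, f in zip(beta, edge_feats))

def unnormalized_prob(Vq, Eq, phi, node_feats, edge_feats, alpha, beta):
    total = sum(F_V(v, phi[v], node_feats, alpha) for v in Vq)
    total += sum(F_E(e, (phi[e[0]], phi[e[1]]), edge_feats, beta)
                 for e in Eq)
    return math.exp(total)  # normalization constant omitted
```

Ranking candidates only requires comparing these unnormalized scores, so the normalization constant never has to be computed.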
The Offline Learning
 It is clear that the parameters { α_i ; β_j } need to be
determined appropriately.
 Classic IR method: the parameters are tuned by domain experts
manually.
 Specific domain knowledge is not sufficient for big graph data.
 Supervised method: learning to rank
 User query logs: not easy to acquire.
 Manually label the training data: not practical and scalable.
 Unsupervised method: automatic training data generation
 A set of high quality training instances could be generated directly from
the graph data.
 Inspired by advances in learning high-level representations with deep
belief networks:
 Denoising autoencoders [Vincent08].
Automatic training data generation
 The basis of denoising autoencoders
 An explicit criterion for learning good representations:
 Robustness to partial destruction of the input.
 Intuition: humans are able to recognize partially corrupted images.
 Application: image classification (search)
[Figure: training minimizes L(p1, p1'), where p1' is p1 corrupted with random noise (ratio 0.1); at search time the corrupted query still retrieves the good result p1, alongside Google (p2) and Baidu (p3) results (picture credit: etidbits.com, youtube.com, blog.sina.com.cn)]
Automatic training data generation
 1. Sampling: a set of subgraphs is randomly extracted from the data
graph based on the templates.
 2. Query generation: queries are generated by randomly adding noise
(transformations) to the extracted subgraphs, e.g., "Tom Cruise"
becomes "Tom".
 3. Searching: the generated queries are searched on the data graph.
 4. Labeling: the results are labeled based on the original subgraph,
e.g., "Tom Cruise" is a good result while "Samuel Tom" is not.
 5. Training: the queries, with the labeled results, are used to
estimate the parameters of the ranking model (L-BFGS [Liu89]).
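Steps 1-4 of the pipeline might be sketched as follows (a toy stand-in: `add_noise` mimics a single first-token transformation, and `search` abstracts the query processor; names are our own):

```python
# Illustrative sketch of the automatic training-data loop.  A subgraph
# sampled from the data graph is corrupted into a query; results that
# recover the original subgraph are labeled as positives.

def add_noise(label):
    # Mimic a first-token transformation: "Tom Cruise" -> "Tom".
    return label.split()[0]

def generate_instance(subgraph, search):
    """subgraph: {node_id: label} extracted from the data graph."""
    query = {v: add_noise(lbl) for v, lbl in subgraph.items()}   # step 2
    results = search(query)                                      # step 3
    # step 4: a result is a positive example iff it is the original
    return query, [(r, int(r == subgraph)) for r in results]

# Toy usage: the original subgraph is labeled 1, an imposter 0.
original = {0: "Tom Cruise", 1: "Mission Impossible"}
imposter = {0: "Tom Hanks", 1: "Mission Impossible"}
query, labeled = generate_instance(original, lambda q: [original, imposter])
print(query)  # -> {0: 'Tom', 1: 'Mission'}
```

The labeled pairs are exactly the input that the step-5 parameter estimation (e.g., L-BFGS on the CRF likelihood) consumes.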
Online Query Processing
 Exact subgraph matching is quite expensive
 The transformations produce many match candidates.
 An NP-hard problem (by reduction from subgraph isomorphism).
 A small (possible) compromise on the quality of the results
can lead to a significant improvement in efficiency.
 A fast top-K search algorithm
 Indexing technique.
 Two effective heuristics in the search algorithm.
 Approximate inference in the CRFs ranking model.
 Sketch graph based pruning strategy.
Online Query Processing
 Inference in graphical model (CRFs)
 A CRFs model is constructed based on the query (as variables)
and the node match candidates (as possible assignments to the
variables)
 Top-1 result: computing the most likely assignment.
 Exact inference in general graphical model is still NP-hard [Sutton06].
 Approximate inference: Loopy Belief Propagation (LoopyBP)
 A message-passing algorithm:
   m_{j→i}^{(t)}(u_i) = max_{u_j} [ F_V(v_j, u_j) + F_E((v_j, v_i), (u_j, u_i)) + Σ_{v_k ∈ N(v_j)\v_i} m_{k→j}^{(t-1)}(u_j) ]
 It is very efficient and empirically successful in many practical
applications.
 From Top-1 to Top-k: best max-marginal first algorithm
[Yanover04]
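A toy version of this message update, written in the log domain consistent with the exponential form of the ranking model (illustrative only; all names are our own):

```python
# Max-sum message update on a query graph: query node j sends node i its
# belief about each of i's candidates.  candidates[v] is the candidate
# set of query node v; F_V and F_E are the node/edge scoring functions;
# prev_msgs[(k, j)] maps each candidate of j to the previous-round message.

def message(j, i, candidates, F_V, F_E, neighbors, prev_msgs):
    """Compute m_{j->i}(u_i) for every candidate u_i of query node i."""
    m = {}
    for u_i in candidates[i]:
        m[u_i] = max(
            F_V(j, u_j)
            + F_E((j, i), (u_j, u_i))
            + sum(prev_msgs[(k, j)][u_j]
                  for k in neighbors[j] if k != i)
            for u_j in candidates[j]
        )
    return m
```

Iterating these messages to convergence yields approximate max-marginals, from which the top-1 assignment is read off; the best max-marginal first algorithm [Yanover04] then extends this to top-k.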
Query Processing – Sketch Graph
 Problem: due to the various transformations, a large number of
match candidates must be inspected by the LoopyBP algorithm.
 Solution: sketch graph
 Candidates produced by the same transformation receive the same matching
score and thus can be grouped together.
[Figure: for the query "Brad Pitt" - "Troy film", candidates reached through the same transformation (f1, ..., fn) are grouped into sketch-graph nodes (e.g., "Brad", "Bradley"; "Troy", "Jim Troy"); the top-1 result in the sketch graph guides the top-1 result "Brad Pitt" - "Troy" in the data graph]
1. Construct the sketch graph.
2. Search in the sketch graph.
3. Search in the data graph.
Query Processing – Two-level Search
 The two-level search algorithm
 Search in sketch graph
 The size of the sketch graph is very small: only related to the size of the
query and the number of the transformations.
 The score of the upper-level match in the sketch graph is an upper
bound on the score of the corresponding lower-level match.
 Search in data graph
 Based on the upper-level match found in the sketch graph, the algorithm
then extracts the lower-level matches in the data graph on a smaller set of
match candidates.
 Pruning: the search stops once the remaining upper-level match
scores fall below the scores of the lower-level matches already found.
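Under the upper-bound property, the two-level loop can be sketched as a best-first search with early termination (illustrative; `upper_matches` and `refine` stand in for the sketch-graph and data-graph search routines, which are not shown here):

```python
import heapq

# Sketch-graph matches arrive best-first; each upper score bounds the
# scores of its data-graph refinements, so we can stop as soon as k
# data-graph matches beat every remaining upper bound.

def two_level_topk(upper_matches, refine, k):
    """upper_matches: (upper_score, group) pairs in descending score order.
    refine(group): scored data-graph matches for one sketch-graph match."""
    results = []  # min-heap of (score, match)
    for upper_score, group in upper_matches:
        if len(results) == k and upper_score <= results[0][0]:
            break  # prune: no remaining match can enter the top-k
        for score, match in refine(group):
            if len(results) < k:
                heapq.heappush(results, (score, match))
            elif score > results[0][0]:
                heapq.heapreplace(results, (score, match))
    return sorted(results, reverse=True)
```

The break is safe precisely because the upper-level scores are upper bounds and are enumerated in descending order.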
Other Designs - Indexing
 An index is constructed based on the transformations. Given
a query, its match candidates can be extracted readily from
the index.
[Figure: the query (Pine; actor; ~30 yrs) is matched against the data graph (v1: Chris Pine, born in 1980; v2: UC Berkeley; v3: "Star Trek", with a cast edge) through three indexes:
 StrIdx, string transformations: Pine : {v1, ...}, C. Pine : {v1, ...}, Chris : {v1, ...}, ...
 OntIdx, ontology transformations: actor = cast : {v1, ...}, ...
 NumIdx, numeric transformations (a B+ tree over values): 32, 33, 34 : {v1, ...}]
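A toy version of such indexes (a dictionary over transformed labels for StrIdx/OntIdx lookups and a sorted array standing in for the B+ tree; labels and node ids are illustrative):

```python
import bisect
from collections import defaultdict

# Minimal transformation index: string variants are keyed directly, and
# numeric values are kept sorted so a range query (e.g. "~30 yrs" mapped
# to a birth-year interval) is two binary searches.

class TransformationIndex:
    def __init__(self):
        self.str_idx = defaultdict(set)      # transformed label -> node ids
        self.num_keys, self.num_vals = [], []

    def add_string(self, node, variants):
        for v in variants:                   # e.g. "Chris Pine", "C. Pine"
            self.str_idx[v].add(node)

    def add_number(self, node, value):
        i = bisect.bisect_left(self.num_keys, value)
        self.num_keys.insert(i, value)
        self.num_vals.insert(i, node)

    def lookup_string(self, label):
        return self.str_idx.get(label, set())

    def lookup_range(self, lo, hi):
        i = bisect.bisect_left(self.num_keys, lo)
        j = bisect.bisect_right(self.num_keys, hi)
        return set(self.num_vals[i:j])
```

At query time, each query node's label is probed under every applicable transformation, and the union of the returned node sets becomes its candidate set.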
The Performance Evaluation
 Dataset
Graph    | Nodes | Edges | Node types | Relations | Size
DBpedia  | 3.7M  | 20M   | 359        | 800       | 40G
YAGO2    | 2.9M  | 11M   | 6,543      | 349       | 18.5G
Freebase | 40.3M | 180M  | 10,110     | 9,101     | 88G
 Query template
 DBPSB [Morsey11] : the query templates derived from query logs on
DBpedia
 Baseline
 Spark [Luo07]: an IR-based ranking model that treats each node as a
document and does not consider any structural information. The
parameters are fixed.
 SLQ: our method.
 Unit: a variant of SLQ with all parameters set to the same value.
 Card: a variant of SLQ with each parameter set to the selectivity of the
corresponding transformation.
The Performance Evaluation – Case Study
 Queries on DBpedia
 In the above cases, Spark gives low IR scores and cannot identify
matches for Query 2.
The Performance Evaluation - Effectiveness
 Queries are randomly generated on
YAGO2 based on the query templates.
 Evaluation metric: MAP@5.
 Varying the query size with the transformation ratio fixed:
 SLQ shows the best results.
 Larger queries (with more evidence) are more robust to the
noise (the transformations).
 Varying the transformation ratio:
 The results degrade as the transformation ratio increases.
The Performance Evaluation - Efficiency
 Queries are randomly generated
on Freebase based on the query
templates.
 The transformation ratio:
0.2~0.5.
 Baseline
 Exact: exact search on all
candidates.
 NeMa [Khan13]: a graph search
algorithm similar to LoopyBP.
 Increasing the graph size in a
“streaming” mode
 With the sketch graph, SLQ
outperforms both baselines.
Conclusion
 A novel framework for schemaless graph querying.
 A matching strategy: transformation.
 A ranking model.
 Incorporates the transformations.
 Automatic training in the cold-start setting.
 Top-k search algorithm.
The Architecture and Future Work
[Architecture diagram: a runtime layer over indexing and storage]
Future work: natural-language query processing; better top-k search algorithms; storage for multiple data graphs; better ranking models; warm-start training.
System Demonstration
[Screenshot: dataset selection; keyword/SLQ query input; graph query drawing and result rendering panel; result navigation bar; information panel]
More Applications
Text mining / RDF -> link entities across documents -> build an information network -> graph visualization / querying / analysis
References
 [Vincent08] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
 [Liu89] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
 [Sutton06] C. Sutton and A. McCallum. An Introduction to Conditional Random Fields for Relational Learning. MIT Press, 2006.
 [Yanover04] C. Yanover and Y. Weiss. Finding the M most probable configurations using loopy belief propagation. In NIPS, 2004.
 [Luo07] Y. Luo, X. Lin, W. Wang, and X. Zhou. Spark: top-k keyword query in relational databases. In SIGMOD, 2007.
 [Morsey11] M. Morsey, J. Lehmann, S. Auer, and A.-C. Ngonga Ngomo. DBpedia SPARQL benchmark - performance assessment with real queries on real data. In ISWC, 2011.
 [Liu07] Z. Liu and Y. Chen. Identifying meaningful return information for XML keyword search. In SIGMOD, 2007.
 [Koutrika06] G. Koutrika, A. Simitsis, and Y. E. Ioannidis. Précis: the essence of a query answer. In ICDE, 2006.
 [Ding07] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding top-k min-cost connected trees in databases. In ICDE, 2007.
 [Talukdar08] P. P. Talukdar, M. Jacob, M. S. Mehmood, K. Crammer, Z. G. Ives, F. Pereira, and S. Guha. Learning to create data-integrating queries. In VLDB, 2008.
 [Bhalotia02] G. Bhalotia, C. Nakhe, A. Hulgeri, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, 2002.
 [Dalvi08] B. B. Dalvi, M. Kshirsagar, and S. Sudarshan. Keyword search on external memory data graphs. In VLDB, 2008.
 [Hristidis03] V. Hristidis, Y. Papakonstantinou, and A. Balmin. Keyword proximity search on XML graphs. In ICDE, 2003.
 [Zhang09] S. Zhang, S. Li, and J. Yang. GADDI: distance index based subgraph matching in biological networks. In EDBT, 2009.
References (cont.)
 [Li09] G. Li, S. Ji, C. Li, and J. Feng. Efficient type-ahead search on relational data: a TASTIER approach. In SIGMOD, 2009.
 [Markowetz09] A. Markowetz, Y. Yang, and D. Papadias. Reachability indexes for relational keyword search. In ICDE, 2009.
 [He07] H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: ranked keyword searches on graphs. In SIGMOD, 2007.
 [Qiao13] M. Qiao, L. Qin, H. Cheng, J. X. Yu, and W. Tian. Top-k nearest keyword search on large graphs. In VLDB, 2013.
 [Kasneci08] G. Kasneci, F. M. Suchanek, G. Ifrim, M. Ramanath, and G. Weikum. NAGA: searching and ranking knowledge. In ICDE, 2008.
 [Khan13] A. Khan, Y. Wu, C. C. Aggarwal, and X. Yan. NeMa: fast graph search with label similarity. In VLDB, 2013.
 [Shasha02] D. Shasha, J. T. L. Wang, and R. Giugno. Algorithmics and applications of tree and graph searching. In PODS, 2002.
 [Yan04] X. Yan, P. S. Yu, and J. Han. Graph indexing: a frequent structure-based approach. In SIGMOD, 2004.
 [Zhao07] P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: tree + delta >= graph. In VLDB, 2007.
 [Zou08] L. Zou, L. Chen, J. X. Yu, and Y. Lu. A novel spectral coding in a large graph database. In EDBT, 2008.
 [Ullman76] J. R. Ullmann. An algorithm for subgraph isomorphism. J. ACM, 1976.
 [Cordella04] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub)graph isomorphism algorithm for matching large graphs. IEEE PAMI, 26(10):1367–1372, 2004.
References (cont.)
 [Shang08] H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. In VLDB, 2008.
 [Tian08] Y. Tian and J. M. Patel. TALE: a tool for approximate large graph matching. In ICDE, 2008.
 [Mongiovi10] M. Mongiovi, R. D. Natale, R. Giugno, A. Pulvirenti, A. Ferro, and R. Sharan. SIGMA: a set-cover-based inexact graph matching algorithm. J. Bioinformatics and Computational Biology, 8(2):199–218, 2010.
 [Liu06] F. Liu, C. Yu, W. Meng, and A. Chowdhury. Effective keyword search in relational databases. In SIGMOD, 2006.
 [Elbassuoni09] S. Elbassuoni, M. Ramanath, R. Schenkel, M. Sydow, and G. Weikum. Language-model-based ranking for queries on RDF-graphs. In CIKM, 2009.
 [Mass12] Y. Mass and Y. Sagiv. Language models for keyword search over data graphs. In WSDM, 2012.
 [Liu09] T.-Y. Liu. Learning to rank for information retrieval. In WWW, 2009.