CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu Can we identify node groups? (communities, modules, clusters) 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Nodes: Football Teams Edges: Games played 2 NCAA conferences 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Nodes: Football Teams Edges: Games played 3 Can we identify functional modules? 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Nodes: Proteins Edges: Physical interactions 4 Functional modules 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Nodes: Proteins Edges: Physical interactions 5 Can we identify social communities? 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Nodes: Facebook Users Edges: Friendships 6 Social communities Summer internship High school Stanford (Basketball) Stanford (Squash) 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Nodes: Facebook Users Edges: Friendships 7 Non-overlapping vs. overlapping communities 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 8 Nodes Nodes Network 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Adjacency matrix 9 What is the structure of community overlaps: Edge density in the overlaps is higher! Communities as “tiles” 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 10 This is what we want! Communities in a network 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 11 1) Given a model, we generate the network: B Generative model for networks F A D E G C H 2) Given a network, find the “best” model B F A D E G C H 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Generative model for networks 12 Goal: Define a model that can generate networks The model will have a set of “parameters” that we will later want to estimate (and detect communities) B Generative model for networks F A D E G C H Q: Given a set of nodes, how do communities “generate” edges of the network? 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 13 Communities, C pA pB Model Memberships, M Nodes, V Model Network Generative model B(V, C, M, {pc}) for graphs: Nodes V, Communities C, Memberships M Each community c has a single probability pc Later we fit the model to networks to detect communities 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 14 Communities, C pA pB Model Memberships, M Nodes, V Network Community Affiliations AGM generates the links: For each For each pair of nodes in community 𝑨𝑨, we connect them with prob. 𝒑𝒑𝑨𝑨 The overall edge probability is: P(u , v) = 1 − ∏ (1 − p ) c c∈M u ∩ M v 2/13/2014 𝑴𝑴𝒖𝒖 … set of communities If 𝒖𝒖, 𝒗𝒗 share no communities: 𝑷𝑷 𝒖𝒖, 𝒗𝒗 = 𝜺𝜺 node 𝒖𝒖 belongs to Think of this as an “OR” function: If at least 1 community says “YES” we create an edge Jure Leskovec, Stanford C246: Mining Massive Datasets 15 Model 2/13/2014 Network Jure Leskovec, Stanford C246: Mining Massive Datasets 16 AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 17 Detecting communities with AGM: B F A D E G C H Given a Graph, find the Model 1) Affiliation graph M 2) Number of communities C 3) Parameters pc 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 18 Maximum Likelihood Principle (MLE): Given: Data 𝑿𝑿 Assumption: Data is generated by some model 𝒇𝒇(𝚯𝚯) 𝒇𝒇 … model 𝚯𝚯 … model parameters Want to estimate 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯): The probability that our model 𝒇𝒇 (with parameters 𝜣𝜣) generated the data Now let’s find the most likely model that could have generated the data: arg max 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯) 2/13/2014 Θ Jure Leskovec, Stanford C246: Mining Massive Datasets 19 Imagine we are given a set of coin flips Task: Figure the bias of a coin! Data: Sequence of coin flips: 𝑿𝑿 = [𝟏𝟏, 𝟎𝟎, 𝟎𝟎, 𝟎𝟎, 𝟏𝟏, 𝟎𝟎, 𝟎𝟎, 𝟏𝟏] Model: 𝒇𝒇 𝚯𝚯 = return 1 with prob. Θ, else return 0 What is 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 ? Assuming coin flips are independent So, 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝑷𝑷𝒇𝒇 𝟎𝟎 𝚯𝚯 ∗ 𝑷𝑷𝒇𝒇 𝟏𝟏 𝚯𝚯 ∗ 𝑷𝑷𝒇𝒇 𝟏𝟏 𝚯𝚯 … ∗ 𝑷𝑷𝒇𝒇 𝟎𝟎 𝚯𝚯 What is 𝑷𝑷𝒇𝒇 𝟏𝟏 𝚯𝚯 ? Simple, 𝑷𝑷𝒇𝒇 𝟎𝟎 𝚯𝚯 = 𝚯𝚯 𝟓𝟓 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝟎𝟎. 𝟓𝟓 = 𝟎𝟎. 𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝟑𝟑 𝟖𝟖 = 𝟎𝟎. 𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎 What did we learn? Our data was most likely generated by coin with bias 𝚯𝚯 = 𝟑𝟑/𝟖𝟖 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 Then, 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝚯𝚯𝟑𝟑 𝟏𝟏 − 𝚯𝚯 For example: 𝚯𝚯∗ = 𝟑𝟑/𝟖𝟖 𝚯𝚯 20 How do we do MLE for graphs? Model generates a probabilistic adjacency matrix We then flip all the entries of the probabilistic matrix to obtain the binary adjacency matrix For every pair of nodes 𝒖𝒖, 𝒗𝒗 AGM gives the prob. 𝒑𝒑𝒖𝒖𝒖𝒖 of them being linked 0.25 0.10 0.10 0.05 0.15 0.02 0.05 0.02 0.01 0.03 Flip biased 0.04 coins 1 1 0 0 0.06 0 1 0 1 0.15 0.06 1 0 1 1 0.03 0.09 0 1 0 1 The likelihood of AGM generating graph G: P (G | Θ) = Π P (u , v) Π (1 − P (u , v)) ( u ,v )∈G 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets ( u ,v )∉G 21 Given graph G and Θ we calculate likelihood that Θ generated G: P(G|Θ) Θ=B(V, C, M, {pc}) G 0.25 0.10 0.10 0.04 1 0 1 1 0.05 0.15 0.02 0.06 0 1 0 1 0.05 0.02 0.15 0.06 1 0 1 1 0.01 0.03 0.03 0.09 1 1 1 1 G P(G|Θ) P(G | Θ) = Π P(u , v) Π (1 − P(u , v)) ( u ,v )∈G 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets ( u ,v )∉G 22 Our goal: Find 𝚯𝚯 = 𝑩𝑩(𝑽𝑽, 𝑪𝑪, 𝑴𝑴, 𝒑𝒑𝑪𝑪 ) such that: arg max Θ P( 𝑮𝑮 | AGM Θ ) How do we find 𝑩𝑩(𝑽𝑽, 𝑪𝑪, 𝑴𝑴, 𝒑𝒑𝑪𝑪 ) that maximizes the likelihood? No nice way, have to search over all affiliation graphs But, don’t despair… 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 23 Relaxation: Memberships have strengths 𝑭𝑭𝒖𝒖𝒖𝒖 u v 𝑭𝑭𝒖𝒖𝒖𝒖 : The membership strength of node 𝒖𝒖 to community 𝑨𝑨 (𝑭𝑭𝒖𝒖𝒖𝒖 = 𝟎𝟎: no membership) Each community 𝑨𝑨 links nodes independently: 𝑷𝑷𝑨𝑨 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞(−𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨 ) 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 24 Community membership strength matrix 𝑭𝑭 Nodes 𝑭𝑭 = 𝑭𝑭𝒖𝒖𝒖𝒖 … Communities 𝑷𝑷𝑨𝑨 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞(−𝑭𝑭𝒖𝒖𝒖𝒖 ⋅ 𝑭𝑭𝒗𝒗𝒗𝒗 ) Probability of connection is proportional to the product of strengths j strength of 𝒖𝒖’s membership to 𝑨𝑨 2/13/2014 Prob. that at least one common community 𝑪𝑪 links the nodes: 𝑷𝑷 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − ∏𝑪𝑪 𝟏𝟏 − 𝑷𝑷𝑪𝑪 𝒖𝒖, 𝒗𝒗 𝑭𝑭𝒗𝒗 … vector of community membership strengths of 𝒖𝒖 Jure Leskovec, Stanford C246: Mining Massive Datasets 25 Community 𝑨𝑨 links nodes 𝒖𝒖, 𝒗𝒗 independently: 𝑷𝑷𝑨𝑨 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞(−𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨 ) Then prob. at least one common 𝑪𝑪 links them: 𝑷𝑷 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − ∏𝑪𝑪 𝟏𝟏 − 𝑷𝑷𝑪𝑪 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞(− ∑𝑪𝑪 𝑭𝑭𝒖𝒖𝒖𝒖 ⋅ 𝑭𝑭𝒗𝒗𝒗𝒗 ) = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞(−𝑭𝑭𝒖𝒖 ⋅ 𝑭𝑭𝑻𝑻𝒗𝒗 ) For example: 𝑭𝑭𝒖𝒖 : 𝑭𝑭𝒗𝒗 : 𝑭𝑭𝒘𝒘 : 0 0.5 1.2 0 0 0 0.2 0.8 0 1.8 1 0 Node community membership strengths 2/13/2014 Then: 𝑭𝑭𝒖𝒖 ⋅ 𝑭𝑭𝑻𝑻𝒗𝒗 = 𝟎𝟎. 𝟏𝟏𝟏𝟏 And: 𝑷𝑷 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − 𝒆𝒆𝒆𝒆𝒆𝒆 −𝟎𝟎. 𝟏𝟏𝟏𝟏 = 𝟎𝟎. 𝟏𝟏𝟏𝟏 But: 𝑷𝑷 𝒖𝒖, 𝒘𝒘 = 𝟎𝟎. 𝟖𝟖𝟖𝟖 𝑷𝑷 𝒗𝒗, 𝒘𝒘 = 𝟎𝟎 Jure Leskovec, Stanford C246: Mining Massive Datasets 26 Another way to think about this is: Think of a weighted graph (but we don’t see the weights) 𝒖𝒖, 𝒗𝒗 interact with strength 𝑿𝑿𝑨𝑨𝒖𝒖𝒖𝒖 ~𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃(𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨 ) Total interaction strength is the sum: 𝑿𝑿𝒖𝒖𝒖𝒖 = ∑𝑪𝑪 𝑿𝑿𝑪𝑪𝒖𝒖𝒖𝒖 Then: 𝑿𝑿𝒖𝒖𝒖𝒖 ~𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃(∑𝐶𝐶 𝑭𝑭𝒖𝒖𝒖𝒖 ⋅ 𝑭𝑭𝒗𝒗𝒗𝒗 ) We observe an edge if there is non-zero interaction: 𝑃𝑃 𝑋𝑋𝑢𝑢𝑢𝑢 > 0 = 1 − 𝑃𝑃 𝑋𝑋𝑢𝑢𝑢𝑢 = 0 = = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞(−𝑭𝑭𝒖𝒖 ⋅ 𝑭𝑭𝑻𝑻𝒗𝒗 ) Poisson PMF: 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 𝝀𝝀𝒌𝒌 −𝝀𝝀 𝑷𝑷(𝑿𝑿 = 𝒙𝒙) = 𝒆𝒆 𝒌𝒌! 27 [wsdm ‘13] Task: Given a network, estimate 𝑭𝑭 Find 𝑭𝑭 that maximizes the log-likelihood Many times we take the logarithm of the likelihood and call it log-likelihood: 𝑙𝑙 𝐹𝐹 = log 𝑃𝑃(𝐺𝐺|𝐹𝐹) Objective function is biconvex Non-negative matrix factorization: Update 𝑭𝑭𝒖𝒖𝒖𝒖 for node 𝒖𝒖 while fixing the memberships of all other nodes 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 28 Task: Given a network, estimate 𝑭𝑭 Find 𝑭𝑭 that maximizes the log-likelihood Many times we take the logarithm of the likelihood and call it log-likelihood: 𝒍𝒍 𝑭𝑭 = 𝐥𝐥𝐥𝐥𝐥𝐥 𝑷𝑷(𝑮𝑮|𝑭𝑭) Connection: Non-negative matrix factorization We want to “factor” the adjacency matrix G 𝟏𝟏 − 𝒆𝒆𝒆𝒆𝒆𝒆(− 2/13/2014 F . FT Jure Leskovec, Stanford C246: Mining Massive Datasets )= Matrix of edge probs. Graph G 29 Non-negative matrix factorization: Use a gradient based approach! But updating the whole 𝑭𝑭 takes too long Computing the gradient takes quadratic time! Our network are far too big to allow for that: Network with 1M nodes, can have 1012 different edges Instead, update 𝑭𝑭 in small “units”: Update 𝑭𝑭𝒖𝒖𝒖𝒖 for node 𝒖𝒖 while fixing the memberships of all other nodes 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 30 Compute gradient of a single row 𝑭𝑭𝒖𝒖 of 𝑭𝑭: Coordinate gradient ascent: Iterate over the rows of 𝑭𝑭: Compute gradient 𝜵𝜵𝜵𝜵 𝑭𝑭𝒖𝒖 of row 𝒖𝒖 (while keeping others fixed) Update the row 𝑭𝑭𝒖𝒖: 𝑭𝑭𝒖𝒖 ← 𝑭𝑭𝒖𝒖 + 𝜼𝜼 𝛁𝛁𝒍𝒍(𝑭𝑭𝒖𝒖 ) Project 𝑭𝑭𝒖𝒖 back to a non-negative vector: If 𝑭𝑭𝒖𝒖𝒖𝒖 < 𝟎𝟎: 𝑭𝑭𝒖𝒖𝒖𝒖 = 𝟎𝟎 This is slow! Computing 𝜵𝜵𝜵𝜵 𝑭𝑭𝒖𝒖 takes linear time! 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 31 However, we notice: We can cache ∑𝒗𝒗 𝑭𝑭𝒗𝒗 So, computing ∑𝒗𝒗∉𝓝𝓝(𝒖𝒖) 𝑭𝑭𝒗𝒗 now takes linear time in the degree 𝓝𝓝(𝒖𝒖) of 𝒖𝒖 In networks degree of a node is much smaller to the total number of nodes in the network, so this is a significant speedup! 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 32 BigCLAM takes 5 minutes for 300k node nets Other methods take 10 days Can process networks with 100M edges! 33 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 34 Extension: Make community membership edges directed: Outgoing membership: Nodes “sends” edges Incoming membership: Node “receives” edges 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 35 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 36 Everything is almost the same except now we have 2 matrices: 𝑭𝑭 and 𝑯𝑯 𝑭𝑭… out-going community memberships 𝑯𝑯… in-coming community memberships 𝑭𝑭 𝑯𝑯 Edge prob.: 𝑷𝑷 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − 𝒆𝒆𝒆𝒆𝒆𝒆(−𝑭𝑭𝒖𝒖 𝑯𝑯𝑻𝑻𝒗𝒗 ) Network log-likelihood: which we optimize the same way as before 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 37 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 38 Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013. Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014. Community Detection in Networks with Node Attributes by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference On Data Mining (ICDM), 2013. 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 39
© Copyright 2025 ExpyDoc