Detecting overlapping communities - SNAP

CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
Can we identify
node groups?
(communities,
modules, clusters)
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
Nodes: Football Teams
Edges: Games played
2
NCAA conferences
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
Nodes: Football Teams
Edges: Games played
3
Can we identify
functional modules?
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
Nodes: Proteins
Edges: Physical interactions
4
Functional modules
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
Nodes: Proteins
Edges: Physical interactions
5
Can we identify
social communities?
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
Nodes: Facebook Users
Edges: Friendships
6
Social communities
Summer
internship
High school
Stanford (Basketball)
Stanford (Squash)
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
Nodes: Facebook Users
Edges: Friendships
7

Non-overlapping vs. overlapping communities
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
8
Nodes
Nodes
Network
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
Adjacency matrix
9

What is the structure of community overlaps:
Edge density in the overlaps is higher!
Communities as “tiles”
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
10
This is what we want!
Communities
in a network
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
11

1) Given a model, we generate the network:
B
Generative
model for
networks
F
A
D
E
G
C
H

2) Given a network, find the “best” model
B
F
A
D
E
G
C
H
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
Generative
model for
networks
12

Goal: Define a model that can generate
networks
 The model will have a set of “parameters” that we
will later want to estimate (and detect communities)
B
Generative
model for
networks
F
A
D
E
G
C
H

Q: Given a set of nodes, how do communities
“generate” edges of the network?
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
13
Communities, C
pA
pB
Model
Memberships, M
Nodes, V
Model

Network
Generative model B(V, C, M, {pc}) for graphs:
 Nodes V, Communities C, Memberships M
 Each community c has a single probability pc
 Later we fit the model to networks to detect
communities
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
14
Communities, C
pA
pB
Model
Memberships, M
Nodes, V
Network
Community Affiliations

AGM generates the links: For each
 For each pair of nodes in community 𝑨𝑨,
we connect them with prob. 𝒑𝒑𝑨𝑨
 The overall edge probability is:
P(u , v) = 1 −
∏ (1 − p )
c
c∈M u ∩ M v
2/13/2014
𝑴𝑴𝒖𝒖 … set of communities
If 𝒖𝒖, 𝒗𝒗 share no communities: 𝑷𝑷 𝒖𝒖, 𝒗𝒗 = 𝜺𝜺
node 𝒖𝒖 belongs to
Think of this as an “OR” function: If at least 1 community says “YES” we create an edge
Jure Leskovec, Stanford C246: Mining Massive Datasets
15
Model
2/13/2014
Network
Jure Leskovec, Stanford C246: Mining Massive Datasets
16

AGM can express a
variety of community
structures:
Non-overlapping,
Overlapping, Nested
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
17

Detecting communities with AGM:
B
F
A
D
E
G
C
H
Given a Graph, find the Model
1) Affiliation graph M
2) Number of communities C
3) Parameters pc
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
18

Maximum Likelihood Principle (MLE):
 Given: Data 𝑿𝑿
 Assumption: Data is generated by some model 𝒇𝒇(𝚯𝚯)
 𝒇𝒇 … model
 𝚯𝚯 … model parameters
 Want to estimate 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯):
 The probability that our model 𝒇𝒇 (with parameters 𝜣𝜣)
generated the data
 Now let’s find the most likely model that could have
generated the data: arg max 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯)
2/13/2014
Θ
Jure Leskovec, Stanford C246: Mining Massive Datasets
19


Imagine we are given a set of coin flips
Task: Figure the bias of a coin!
 Data: Sequence of coin flips: 𝑿𝑿 = [𝟏𝟏, 𝟎𝟎, 𝟎𝟎, 𝟎𝟎, 𝟏𝟏, 𝟎𝟎, 𝟎𝟎, 𝟏𝟏]
 Model: 𝒇𝒇 𝚯𝚯 = return 1 with prob. Θ, else return 0
 What is 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 ? Assuming coin flips are independent
 So, 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝑷𝑷𝒇𝒇 𝟎𝟎 𝚯𝚯 ∗ 𝑷𝑷𝒇𝒇 𝟏𝟏 𝚯𝚯 ∗ 𝑷𝑷𝒇𝒇 𝟏𝟏 𝚯𝚯 … ∗ 𝑷𝑷𝒇𝒇 𝟎𝟎 𝚯𝚯
 What is 𝑷𝑷𝒇𝒇 𝟏𝟏 𝚯𝚯 ? Simple, 𝑷𝑷𝒇𝒇 𝟎𝟎 𝚯𝚯 = 𝚯𝚯
𝟓𝟓
 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝟎𝟎. 𝟓𝟓 = 𝟎𝟎. 𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎
 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 =
𝟑𝟑
𝟖𝟖
= 𝟎𝟎. 𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎𝟎
 What did we learn? Our data was most
likely generated by coin with bias 𝚯𝚯 = 𝟑𝟑/𝟖𝟖
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯
 Then, 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝚯𝚯𝟑𝟑 𝟏𝟏 − 𝚯𝚯
 For example:
𝚯𝚯∗ = 𝟑𝟑/𝟖𝟖
𝚯𝚯
20

How do we do MLE for graphs?
 Model generates a probabilistic adjacency matrix
 We then flip all the entries of the probabilistic
matrix to obtain the binary adjacency matrix
For every pair
of nodes 𝒖𝒖, 𝒗𝒗
AGM gives the
prob. 𝒑𝒑𝒖𝒖𝒖𝒖 of
them being
linked

0.25
0.10
0.10
0.05
0.15
0.02
0.05
0.02
0.01
0.03
Flip
biased
0.04
coins
1
1
0
0
0.06
0
1
0
1
0.15
0.06
1
0
1
1
0.03
0.09
0
1
0
1
The likelihood of AGM generating graph G:
P (G | Θ) = Π P (u , v) Π (1 − P (u , v))
( u ,v )∈G
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
( u ,v )∉G
21

Given graph G and Θ we calculate
likelihood that Θ generated G: P(G|Θ)
Θ=B(V, C, M, {pc})
G
0.25 0.10 0.10 0.04
1
0
1
1
0.05 0.15 0.02 0.06
0
1
0
1
0.05 0.02 0.15 0.06
1
0
1
1
0.01 0.03 0.03 0.09
1
1
1
1
G
P(G|Θ)
P(G | Θ) = Π P(u , v) Π (1 − P(u , v))
( u ,v )∈G
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
( u ,v )∉G
22

Our goal: Find 𝚯𝚯 = 𝑩𝑩(𝑽𝑽, 𝑪𝑪, 𝑴𝑴, 𝒑𝒑𝑪𝑪 ) such
that:
arg max
Θ

P(
𝑮𝑮
|
AGM
Θ )
How do we find 𝑩𝑩(𝑽𝑽, 𝑪𝑪, 𝑴𝑴, 𝒑𝒑𝑪𝑪 ) that
maximizes the likelihood?
 No nice way, have to search over
all affiliation graphs  But, don’t despair…
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
23

Relaxation: Memberships have strengths
𝑭𝑭𝒖𝒖𝒖𝒖
u
v
 𝑭𝑭𝒖𝒖𝒖𝒖 : The membership strength of node 𝒖𝒖
to community 𝑨𝑨 (𝑭𝑭𝒖𝒖𝒖𝒖 = 𝟎𝟎: no membership)
 Each community 𝑨𝑨 links nodes independently:
𝑷𝑷𝑨𝑨 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞(−𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨 )
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
24
Community membership strength matrix 𝑭𝑭
Nodes

𝑭𝑭 =
𝑭𝑭𝒖𝒖𝒖𝒖 …
Communities
𝑷𝑷𝑨𝑨 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞(−𝑭𝑭𝒖𝒖𝒖𝒖 ⋅ 𝑭𝑭𝒗𝒗𝒗𝒗 )
 Probability of connection is
proportional to the product of
strengths

j
strength of 𝒖𝒖’s
membership to 𝑨𝑨
2/13/2014

Prob. that at least one common
community 𝑪𝑪 links the nodes:
 𝑷𝑷 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − ∏𝑪𝑪 𝟏𝟏 − 𝑷𝑷𝑪𝑪 𝒖𝒖, 𝒗𝒗
𝑭𝑭𝒗𝒗 … vector of
community
membership
strengths of 𝒖𝒖
Jure Leskovec, Stanford C246: Mining Massive Datasets
25
Community 𝑨𝑨 links nodes 𝒖𝒖, 𝒗𝒗 independently:
𝑷𝑷𝑨𝑨 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞(−𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨 )
 Then prob. at least one common 𝑪𝑪 links them:
𝑷𝑷 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − ∏𝑪𝑪 𝟏𝟏 − 𝑷𝑷𝑪𝑪 𝒖𝒖, 𝒗𝒗
= 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞(− ∑𝑪𝑪 𝑭𝑭𝒖𝒖𝒖𝒖 ⋅ 𝑭𝑭𝒗𝒗𝒗𝒗 )
= 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞(−𝑭𝑭𝒖𝒖 ⋅ 𝑭𝑭𝑻𝑻𝒗𝒗 )
 For example:

𝑭𝑭𝒖𝒖 :
𝑭𝑭𝒗𝒗 :
𝑭𝑭𝒘𝒘 :
0
0.5
1.2
0
0
0
0.2
0.8
0 1.8 1
0
Node community
membership strengths
2/13/2014
Then: 𝑭𝑭𝒖𝒖 ⋅ 𝑭𝑭𝑻𝑻𝒗𝒗 = 𝟎𝟎. 𝟏𝟏𝟏𝟏
And: 𝑷𝑷 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − 𝒆𝒆𝒆𝒆𝒆𝒆 −𝟎𝟎. 𝟏𝟏𝟏𝟏 = 𝟎𝟎. 𝟏𝟏𝟏𝟏
But: 𝑷𝑷 𝒖𝒖, 𝒘𝒘 = 𝟎𝟎. 𝟖𝟖𝟖𝟖
𝑷𝑷 𝒗𝒗, 𝒘𝒘 = 𝟎𝟎
Jure Leskovec, Stanford C246: Mining Massive Datasets
26

Another way to think about this is:
 Think of a weighted graph (but we don’t see the
weights)
 𝒖𝒖, 𝒗𝒗 interact with strength 𝑿𝑿𝑨𝑨𝒖𝒖𝒖𝒖 ~𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃(𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨 )
 Total interaction strength is the sum: 𝑿𝑿𝒖𝒖𝒖𝒖 = ∑𝑪𝑪 𝑿𝑿𝑪𝑪𝒖𝒖𝒖𝒖
 Then: 𝑿𝑿𝒖𝒖𝒖𝒖 ~𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃(∑𝐶𝐶 𝑭𝑭𝒖𝒖𝒖𝒖 ⋅ 𝑭𝑭𝒗𝒗𝒗𝒗 )
 We observe an edge if there is non-zero interaction:
𝑃𝑃 𝑋𝑋𝑢𝑢𝑢𝑢 > 0 = 1 − 𝑃𝑃 𝑋𝑋𝑢𝑢𝑢𝑢 = 0 =
= 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞(−𝑭𝑭𝒖𝒖 ⋅ 𝑭𝑭𝑻𝑻𝒗𝒗 )
Poisson PMF:
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
𝝀𝝀𝒌𝒌 −𝝀𝝀
𝑷𝑷(𝑿𝑿 = 𝒙𝒙) = 𝒆𝒆
𝒌𝒌!
27
[wsdm ‘13]

Task: Given a network, estimate 𝑭𝑭
 Find 𝑭𝑭 that maximizes the log-likelihood
 Many times we take the logarithm of the likelihood and
call it log-likelihood: 𝑙𝑙 𝐹𝐹 = log 𝑃𝑃(𝐺𝐺|𝐹𝐹)
 Objective function is biconvex

Non-negative matrix factorization:
 Update 𝑭𝑭𝒖𝒖𝒖𝒖 for node 𝒖𝒖 while fixing the
memberships of all other nodes
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
28

Task: Given a network, estimate 𝑭𝑭
 Find 𝑭𝑭 that maximizes the log-likelihood
 Many times we take the logarithm of the likelihood and
call it log-likelihood: 𝒍𝒍 𝑭𝑭 = 𝐥𝐥𝐥𝐥𝐥𝐥 𝑷𝑷(𝑮𝑮|𝑭𝑭)

Connection: Non-negative matrix factorization
 We want to “factor” the adjacency matrix G
𝟏𝟏 − 𝒆𝒆𝒆𝒆𝒆𝒆(−
2/13/2014
F
.
FT
Jure Leskovec, Stanford C246: Mining Massive Datasets
)=
Matrix of
edge probs.
Graph G
29

Non-negative matrix factorization:
 Use a gradient based approach!
 But updating the whole 𝑭𝑭 takes too long
 Computing the gradient takes quadratic time!
 Our network are far too big to allow for that:
 Network with 1M nodes, can have 1012 different edges
 Instead, update 𝑭𝑭 in small “units”:
 Update 𝑭𝑭𝒖𝒖𝒖𝒖 for node 𝒖𝒖 while fixing the memberships of
all other nodes
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
30

Compute gradient of a single row 𝑭𝑭𝒖𝒖 of 𝑭𝑭:

Coordinate gradient ascent:
 Iterate over the rows of 𝑭𝑭:

 Compute gradient 𝜵𝜵𝜵𝜵 𝑭𝑭𝒖𝒖 of row 𝒖𝒖 (while keeping others fixed)
 Update the row 𝑭𝑭𝒖𝒖: 𝑭𝑭𝒖𝒖 ← 𝑭𝑭𝒖𝒖 + 𝜼𝜼 𝛁𝛁𝒍𝒍(𝑭𝑭𝒖𝒖 )
 Project 𝑭𝑭𝒖𝒖 back to a non-negative vector: If 𝑭𝑭𝒖𝒖𝒖𝒖 < 𝟎𝟎: 𝑭𝑭𝒖𝒖𝒖𝒖 = 𝟎𝟎
This is slow! Computing 𝜵𝜵𝜵𝜵 𝑭𝑭𝒖𝒖 takes linear time!
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
31

However, we notice:
 We can cache ∑𝒗𝒗 𝑭𝑭𝒗𝒗
 So, computing ∑𝒗𝒗∉𝓝𝓝(𝒖𝒖) 𝑭𝑭𝒗𝒗 now takes linear time
in the degree 𝓝𝓝(𝒖𝒖) of 𝒖𝒖
 In networks degree of a node is much smaller to the total
number of nodes in the network, so this is a significant
speedup!
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
32

BigCLAM takes 5 minutes for 300k node nets
 Other methods take 10 days

Can process networks with 100M edges!
33
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
34

Extension: Make community membership edges
directed:
 Outgoing membership: Nodes “sends” edges
 Incoming membership: Node “receives” edges
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
35
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
36



Everything is almost the same except now we
have 2 matrices: 𝑭𝑭 and 𝑯𝑯
 𝑭𝑭… out-going community memberships
 𝑯𝑯… in-coming community memberships
𝑭𝑭
𝑯𝑯
Edge prob.: 𝑷𝑷 𝒖𝒖, 𝒗𝒗 = 𝟏𝟏 − 𝒆𝒆𝒆𝒆𝒆𝒆(−𝑭𝑭𝒖𝒖 𝑯𝑯𝑻𝑻𝒗𝒗 )
Network log-likelihood:
which we optimize the same way as before
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
37
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
38

Overlapping Community Detection at Scale: A Nonnegative Matrix
Factorization Approach by J. Yang, J. Leskovec. ACM International
Conference on Web Search and Data Mining (WSDM), 2013.

Detecting Cohesive and 2-mode Communities in Directed and
Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM
International Conference on Web Search and Data Mining (WSDM),
2014.

Community Detection in Networks with Node Attributes by J. Yang,
J. McAuley, J. Leskovec. IEEE International Conference On Data
Mining (ICDM), 2013.
2/13/2014
Jure Leskovec, Stanford C246: Mining Massive Datasets
39