
October 2022

Community detection notes

Girvan-Newman 

This approach defines communities by repeatedly cutting the most-traversed edges (until we form k communities). We identify the most-traversed edges by counting how many times an edge is traversed as part of a shortest path from one node to another. If the shortest path from i to j covers an edge (a,b), then that edge's betweenness increases by 1 due to that path. BUT if there are q shortest paths from i to j, each of them increases the betweenness of the edges it covers by only 1/q. We consider all n(n-1) ordered pairs of nodes (swapping source and destination counts as a different pair).
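To make the counting concrete, here is a naive baseline in Python (a sketch of my own; `bfs_counts` and `naive_edge_betweenness` are names I made up). It BFSes from every node to get distances and shortest-path counts, then credits each edge with its fraction of every ordered pair's shortest paths:

```python
import math
from collections import deque, defaultdict

def bfs_counts(adj, s):
    """Distances and shortest-path counts (sigma) from source s."""
    dist, sigma = {s: 0}, defaultdict(int)
    sigma[s] = 1
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
            if dist[v] == dist[u] + 1:   # u is a parent of v on shortest paths
                sigma[v] += sigma[u]
    return dist, sigma

def naive_edge_betweenness(adj):
    """Slow baseline. An edge u-v lies on a shortest s-t path iff
    d(s,u) + 1 + d(v,t) == d(s,t); it then carries sigma_s(u) * sigma_t(v)
    of the q = sigma_s(t) shortest paths, earning that fraction as credit.
    Every ordered (s, t) pair contributes separately."""
    info = {s: bfs_counts(adj, s) for s in adj}
    eb = defaultdict(float)
    for s in adj:
        ds, ss = info[s]
        for t in adj:
            if t == s or t not in ds:
                continue
            dt, st = info[t]
            for u in adj:
                for v in adj[u]:       # each directed edge u -> v
                    if ds.get(u, math.inf) + 1 + dt.get(v, math.inf) == ds[t]:
                        eb[frozenset((u, v))] += ss[u] * st[v] / ss[t]
    return eb
```

On the path graph a-b-c, both edges come out to 4.0 (each serves four ordered pairs), and on a 4-cycle the 1/q split also gives every edge 4.0. This all-pairs recomputation is exactly the cost that motivates Brandes' algorithm.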

 

However, this takes way too long, so we use Brandes' algorithm to compute edge betweenness. We do a top-down pass and then a bottom-up pass, as shown below:

Edge betweenness is the blue numbers on each edge. The bottom-up explanation might be confusing, but it basically works like this: each node starts with 1 credit; a node's total credit (1 plus whatever flowed up from below, its blue number) is distributed to its parents in proportion to the green numbers (shortest-path counts) assigned to each parent, and each edge records the credit that flows across it.

 

The Girvan-Newman algorithm basically runs Brandes' algorithm, cuts the edge with the highest edge betweenness, and repeats until there are k clusters.
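Putting the pieces together, a self-contained sketch of Brandes-style edge betweenness plus the cutting loop (my own toy implementation; `edge_betweenness`, `components`, `girvan_newman` are names I made up):

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Brandes-style edge betweenness: one BFS (top-down) per source,
    then a bottom-up pass distributing credit onto parent edges."""
    eb = defaultdict(float)
    for s in adj:
        dist, sigma = {s: 0}, defaultdict(int)
        sigma[s] = 1
        parents, order, q = defaultdict(list), [], deque([s])
        while q:                               # top-down BFS
            u = q.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]       # the green numbers
                    parents[v].append(u)
        delta = defaultdict(float)
        for v in reversed(order):              # bottom-up credit pass
            for u in parents[v]:
                c = sigma[u] / sigma[v] * (1 + delta[v])
                eb[frozenset((u, v))] += c     # blue numbers, summed over s
                delta[u] += c
    return eb

def components(adj):
    """Connected components of an adjacency dict."""
    comps, seen = [], set()
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                comp.add(x)
                stack.extend(adj[x])
        comps.append(comp)
    return comps

def girvan_newman(adj, k):
    """Remove the highest-betweenness edge until k components remain."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    while len(components(adj)) < k:
        eb = edge_betweenness(adj)
        u, v = max(eb, key=eb.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)
```

On two triangles joined by a single bridge edge, the bridge has by far the highest betweenness, so the first cut recovers the two triangles as communities.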

 

Spectral Clustering

The whole idea is that we construct k clusters by minimising the number of cuts while making sure each cluster is balanced - that is, we minimise the ratio cut. Creating 2 clusters with this method aims to minimise RatioCut(A,B), where A and B are the candidate clusters.

The naive approach is to enumerate all possible k-clusterings and pick the one that minimises the ratio cut. Instead, we convert the problem into a Laplacian eigenvalue problem.

 

// technical, skip this

Given N nodes, we convert our objective to finding a vector y (length N, populated with real numbers) such that yT L y is minimised. Here L = D - W, where D is the NxN diagonal matrix with D_{i,i} = degree of node i, and W is the NxN (binary) adjacency matrix (symmetric, since the graph is undirected). We get N eigenvalues as solutions, from which N eigenvectors are derived (all of which are potential y). If we order the eigenvectors {y1, y2, ..., yn}, where yi is the eigenvector derived from the i-th smallest eigenvalue, then y1 is not a valid solution (it is the constant vector, which the additional constraints rule out).
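One way to see why L shows up at all: the standard identity yT L y = sum over edges of (y_i - y_j)^2 holds, so minimising yT L y penalises putting connected nodes far apart. A quick check on a toy 4-node graph of my own choosing (assuming numpy is available):

```python
import numpy as np

# Toy graph: edges (0,1), (0,2), (1,2), (2,3).
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # graph Laplacian

y = np.array([1.0, 2.0, -1.0, 0.5])
quad = y @ L @ y
edge_sum = sum(W[i, j] * (y[i] - y[j]) ** 2
               for i in range(4) for j in range(i + 1, 4))
# quad == edge_sum == 16.25 for this y
```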

 

Anyways, we get {y2...yn} from solving yT L y minimum problem. 

 

For bi-partition, we use y2, which contains N real numbers. Nodes i and j where y2[i] is close to y2[j] belong in the same community. We could use 2-means clustering, but we can simply say every node i with y2[i] < 0 belongs to community A, and every node i with y2[i] >= 0 belongs to community B.
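A minimal sketch of the sign-based bi-partition (my own function name; assumes numpy), tried on a toy graph of two triangles joined by one edge:

```python
import numpy as np

def fiedler_bipartition(W):
    """Split nodes by the sign of y2, the eigenvector for the
    second-smallest eigenvalue of L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)    # eigh sorts eigenvalues ascending
    y2 = vecs[:, 1]
    return np.where(y2 < 0)[0], np.where(y2 >= 0)[0]

# Two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
A, B = fiedler_bipartition(W)   # recovers the two triangles
```

(The overall sign of an eigenvector is arbitrary, so which triangle lands in A vs B can flip; the split itself is the same.)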

For a k-way partition, we take [y2 ... yk], which forms an Nx(k-1) matrix. Each row becomes a (k-1)-dimensional point, and we find k clusters of these N points.
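Building that Nx(k-1) embedding is a one-liner on top of the eigendecomposition (a sketch assuming numpy; the clustering of the rows is left to any k-means routine):

```python
import numpy as np

def spectral_embedding(W, k):
    """Rows of the N x (k-1) matrix [y2 ... yk]; feed these points
    to any k-means routine to obtain the k communities."""
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:k]   # eigenvectors for the 2nd..k-th smallest eigenvalues

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by edge 2-3.
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
pts = spectral_embedding(W, 3)   # shape (6, 2): one 2-D point per node
```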

Modularity

Modularity is higher when there are more edges within communities and fewer edges between communities. The goal of community detection is to find the set of k communities with the highest modularity.

 

Note that here, we treat each edge as composed of two 'half-edges'.

Qc = sum over A in C of (e_AA - a_A^2), where:

Qc = modularity of the clustering C

2m = number of half-edges in the graph

C = a set of communities

e_AA = for each node i in A, count the number of half-edges of i that lead to a node j in A, then divide by 2m

e_AB = for each node i in A, count the number of half-edges of i that lead to a node j in B, then divide by 2m

a_A = number of half-edges that stem from nodes in A, divided by 2m
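These definitions translate directly into code (a sketch of my own, assuming the graph is an adjacency dict), computing Q = sum over communities A of (e_AA - a_A^2):

```python
def modularity(adj, communities):
    """Modularity of a given partition, via half-edge counts over 2m."""
    two_m = sum(len(vs) for vs in adj.values())   # number of half-edges
    q = 0.0
    for comm in communities:
        comm = set(comm)
        # e_AA: half-edges starting and ending inside the community
        e_aa = sum(1 for i in comm for j in adj[i] if j in comm) / two_m
        # a_A: all half-edges stemming from the community
        a_a = sum(len(adj[i]) for i in comm) / two_m
        q += e_aa - a_a ** 2
    return q

# Two triangles joined by one edge: the natural split scores 5/14,
# while lumping everything into one community scores 0.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
```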

 

But that's calculating modularity of clusters (which is the result). How do we create clusters?

 

We use a greedy algorithm. We start with N clusters (each node being its own cluster) and repeatedly merge the two clusters that cause the greatest increase in modularity, until we have k clusters.

To do this:

1. Construct a matrix M that is the adjacency matrix divided by 2m. Here, M[i,j] corresponds to e_ij.

2. a_i is just sum(M[i,:]).

3. For each directly connected pair of nodes, compute the modularity gain (delta Q, which is 2*(e_ij - a_i*a_j)).

4. Pick pair that causes highest delta Q, and merge these two nodes. 
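The steps above can be sketched like so (pure Python, function names of my own; this covers the first merge step, while node labels are still plain integers):

```python
def build_M(adj):
    """Step 1: M[i][j] = A[i][j] / 2m, so M[i][j] is e_ij."""
    two_m = sum(len(vs) for vs in adj.values())
    return {i: {j: (1.0 / two_m if j in adj[i] else 0.0) for j in adj}
            for i in adj}

def best_merge_pair(M):
    """Steps 2-4: a_i = sum(M[i,:]); for each directly connected pair,
    delta Q = 2 * (e_ij - a_i * a_j); return the best pair and its gain."""
    a = {i: sum(M[i].values()) for i in M}
    best, best_gain = None, float('-inf')
    for i in M:
        for j in M[i]:
            if i < j and M[i][j] > 0:   # each connected pair once
                gain = 2 * (M[i][j] - a[i] * a[j])
                if gain > best_gain:
                    best, best_gain = (i, j), gain
    return best, best_gain

# Two triangles joined by edge 2-3: the best first merge is a
# degree-2 pair inside a triangle, e.g. (0, 1), with gain 5/49.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
pair, gain = best_merge_pair(build_M(adj))
```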

 

In step 4, this is what happens

The rows and columns of the merged nodes disappear, and the new values are recomputed with {2,3} treated as a single node; the rest just carries over. Don't think about cleverly combining the removed values; it's much simpler to just recompute the merged node's row and column.
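The recompute-the-merged-row-and-column step might look like this (a sketch of my own; the merged cluster is represented as a tuple, and its diagonal entry absorbs e_ii + e_jj + 2*e_ij so the whole matrix still sums to 1):

```python
def merge_clusters(M, i, j):
    """Drop rows/columns i and j; add a merged node (i, j) whose row
    and column are recomputed from the old rows of i and j."""
    new = (i, j)
    rest = [x for x in M if x not in (i, j)]
    M2 = {x: {y: M[x][y] for y in rest} for x in rest}
    for x in rest:
        M2[x][new] = M[x][i] + M[x][j]
    M2[new] = {x: M[i][x] + M[j][x] for x in rest}
    M2[new][new] = M[i][i] + M[j][j] + 2 * M[i][j]
    return M2

# Triangle 0-1-2 (2m = 6, every off-diagonal e_ij = 1/6): merging 0 and 1
# leaves a 2x2 matrix whose entries still sum to 1.
M = {i: {j: (0.0 if i == j else 1 / 6) for j in range(3)} for i in range(3)}
M2 = merge_clusters(M, 0, 1)
```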

 

Do this until there are k communities.