PageRank notes

카테고리 없음

PageRank notes

hajinny 2022. 10. 21. 13:06

PageRank algorithm

Google answers queries by ranking webpages in the order of importance. PageRank postulates that an important webpage has many in-links from important webpages. This avoids displaying websites that pretends to be relevant (by word spam, for example).

PageRank takes a web graph formatted as matrix. Given N webpages, we have a NxN matrix. Each element of each column i takes a value of 1/(# of out-links from website i). Website i's rank is the sum of (in-linking website j's rank)/(#outlinks of website j). This can be solved using systems of equations, but matrix formulation avoids solving N systems of equations.

Matrix formulation is "do r' := M*r until r' converges", where r',r are rank vectors (1xN) and M is the matrix constructed as above. This only works if M satisfies the markov chain properties (see Note).

Since webpages can have deadends and spider traps (causing leakage and trap of scores), we also suppose that every website has implicit link to every other node (this is called teleport). We therefore have beta parameter which is the proportion of a webpage's rank that does not get distributed to other nodes, and (1-beta) proportion of the rank gets distributed to all N nodes. This has some consequence in the algorithm which is out of the scope for this note.

So the algorithm is roughly:

def pageRank(M, beta):
r := 1D rank vector initialized with 1/N
do{
r_new = M*r

r_new += (1-sum(r_new))/N
} while norm1(r_new-r) < epsilon
return r

Note: Markov chain properties

Stochastic: sum(col_i) = 1. Violated if the website i does not have outlinks (is a dead-end)

Aperiodic: if a random walker walks on the webgraph, we can't predict its location at time=t. Violated if there's spider web (because we can predict the random walker's location)

Irreducible: a walker from a node can travel to every other node.

Note: when we have only spider traip problem vs we also have deadend problem

If we didn't have deadends but spider traps, we would have the following:

We solve spider traps AND deadends with teleport:

where's (1-beta/N)? well, since both the leakage and teleport eventually evenly distribute the leftovers of this operation:

it suffices to just distribute 1-S.