Web Science:
PageRank
Parke Godfrey
29 November 2012
CSE-2041
These slides are based in large part on the book
and from
Imagine a surfer who surfs the Web forever.
He spends a minute on a page, then randomly follows one of the page's links to another page.
Over a long, long time, what proportion of his time will he spend on a given page?
A page is important if important pages link to it.
The random web surfer captures this idea. But our surfer can run into trouble:
He gets trapped on a page with no exits (no links going out).
He gets trapped in a sub-graph of the Web graph.
This can happen
if the Web graph has disconnected components; or
if the Web graph has sinks.
He gets eaten by lolcats.
Larry, Sergey, Rajeev, and Terry set out to model and implement this idea.
In retrospect, they were not the only ones to develop this type of idea
or even to have this idea for ranking pages on the Web!
But they were the most tenacious, and had a very sound idea.
We can cast their approach in terms of Markov chain theory, a part of linear algebra.
PageRank
Equation
\(\prv\): the pagerank vector
\(\S\): the stochastic hyperlink matrix
\(\E\): the teleportation matrix
\(\alpha\): the scaling parameter; \(\alpha \in (0..1)\)
\(r(P)\): rank of page \(P\)
\(B_{P_{i}}\): the set of pages that link to page \(P_{i}\); that is, the backlinks of page \(P_{i}\)
\(|P_{j}|\): the number of links that page \(P_{j}\) has
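Putting these definitions together gives the PageRank equation:
\[
r(P_{i}) = \sum_{P_{j} \in B_{P_{i}}} \frac{r(P_{j})}{|P_{j}|}
\]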
Of course, this definition for \(r\) is recursive: the values of \(r\) over the pages depend on the values of \(r\)!
So, we could recast this as an iterative process.
Perhaps let \(r_{0}(P_{i}) = 1/n\) for all \(P_{i}\) where there are \(n\) pages in total.
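In equation form, iteration \(k+1\) computes each page's rank from the previous iteration's values:
\[
r_{k+1}(P_{i}) = \sum_{P_{j} \in B_{P_{i}}} \frac{r_{k}(P_{j})}{|P_{j}|}
\]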
Will \(r\) converge?
\(\prv^{(k)}\): the pagerank vector (our \(r\)'s in vector form)
\(\H\): the normalized hyperlink matrix
\(H_{ij} = 1/{|P_{i}|}\), if there is a link from node \(i\) to \(j\)
\(H_{ij} = 0\), otherwise
\(\H\) is very sparse!
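In matrix form, each iteration is then a single vector-matrix multiplication:
\[
\trans{\prv^{(k+1)}} = \trans{\prv^{(k)}}\,\H
\]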
Great! Turn the crank until we converge; that is, once \(\prv^{(k+1)} = \prv^{(k)}\).
Well, there are potential problems.
Will this iterative process continue indefinitely, or converge?
Under what circumstances or properties of \(\H\) is it guaranteed to converge?
Will it converge to just one vector or multiple vectors?
Will it converge to something that makes sense in the context of PageRank?
Does the convergence depend on the starting vector \(\trans{\prv^{(0)}}\)?
If it will converge, how long is “eventually”?
That is, how many iterations can we expect until convergence?
Many nodes / pages have no outgoing links (e.g., images, PDFs); these are the dangling nodes.
\(\e\): vector of all \(1\)'s
\(\vector{a}\): the dangling node vector
\(a_{i} = 1\), if node \(i\) is a dangling node
\(a_{i} = 0\), otherwise
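With the dangling-node vector in hand, \(\S\) patches \(\H\) by replacing each dangling node's all-zero row with the uniform distribution:
\[
\S = \H + \frac{1}{n}\,\vector{a}\,\trans{\e}
\]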
\(\S\) is stochastic: each row sums to \(1\).
A primitive matrix is both irreducible and aperiodic.
A matrix with no \(0\) entries is primitive.
\(\G\): the Google matrix
\((1/n)\e\trans{\e}\): the teleportation matrix (\(\E\))
\(\alpha\): the scaling parameter; \(\alpha \in (0..1)\)
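Assembling the pieces, \(\G\) is the convex combination
\[
\G = \alpha \S + (1 - \alpha)\,\frac{1}{n}\,\e\trans{\e} = \alpha \S + (1 - \alpha)\,\E
\]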
Well...the random surfer gets bored occasionally — with probability \((1 - \alpha)\) — and jumps from the current page to any other page — with probability \(1/n\).
Like typing in the URL directly instead of following a link.
Or using a search engine. Bing, maybe?
Really, however, this \(\alpha\)-adjustment is artificial.
But it is a mathematical necessity for this approach to work.
\(\G\)
(stochastic) is the convex combination of two stochastic matrices, \(\S\) and \(\E\),
(irreducible) has every page directly connected to every other page (so irreducibility trivially follows),
(aperiodic) has self loops for every node (\(\G_{ii} > 0\)), guaranteeing aperiodicity,
(primitive) guarantees that \(\G^{k} > 0\) for some \(k\) — in fact, already \(\G > 0\) for \(k = 1\) — and thus is primitive, and
(dense) is completely dense.
That \(\G\) is primitive guarantees that the iteration converges, and to a unique \(\prv\).
\(\G\) is dense, which would be computationally intractable; but it is directly constructible from the sparse matrix \(\H\).
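To make the computation concrete, here is a minimal Python sketch (my own illustration, not from the slides; the function name, \(\alpha = 0.85\), and the tolerance are assumed for the example). It runs the power method with only the sparse \(\H\) and the dangling vector \(\vector{a}\), using the identity \(\trans{\prv}\G = \alpha\,\trans{\prv}\H + (\alpha\,\trans{\prv}\vector{a} + 1 - \alpha)\frac{1}{n}\trans{\e}\):

```python
import numpy as np

def pagerank(links, n, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power-method PageRank via the sparse formulation.

    links: dict mapping node i to the list of nodes i links to.
    Nodes absent from links (or with empty lists) are dangling.
    """
    # Row-normalized hyperlink matrix H (dense here only for clarity;
    # a real implementation would keep H in a sparse format).
    H = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:
            H[i, j] = 1.0 / len(outs)

    # Dangling-node indicator vector a.
    a = np.array([0.0 if links.get(i) else 1.0 for i in range(n)])

    pi = np.full(n, 1.0 / n)  # pi^(0): uniform start (any start works)
    for _ in range(max_iter):
        # pi^(k+1) = alpha*pi*H + (alpha*(pi.a) + 1 - alpha) * e/n;
        # the dense G is never materialized.
        nxt = alpha * (pi @ H) + (alpha * (pi @ a) + 1.0 - alpha) / n
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi

# Tiny example: page 0 links to 1; page 1 links to 0 and 2; 2 dangles.
print(pagerank({0: [1], 1: [0, 2]}, n=3))
```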
We can formulate the problem in three equivalent ways:
As an eigenvector problem for \(\trans{\prv}\).
As a linear homogeneous system for \(\trans{\prv}\).
\(\trans{\prv}\) as the stationary vector of the Markov chain with transition matrix \(\G\).
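In the slides' notation, with \(\mathbf{I}\) the identity matrix, the first two read
\[
\trans{\prv} = \trans{\prv}\G
\qquad \text{and} \qquad
\trans{\prv}(\mathbf{I} - \G) = \trans{\mathbf{0}},
\]
each normalized so that \(\trans{\prv}\e = 1\); the third says \(\prv\) is the long-run distribution of the random surfer whose transitions are governed by \(\G\).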
On the one hand, \(\alpha\) adds an artificial component to the ranking.
On the other hand, there is good mathematical justification — and need — for \(\alpha\).
\(\alpha\) is set as a constant (\(\alpha \in (0..1)\)) for the duration of the computation of \(\prv\).
What should it be set to?
For \(\alpha\) close to \(1\):
\(\boldsymbol{+}\) Seems to capture best “reputation” according to the Web graph's structure.
\(\boldsymbol{-}\) Takes a long, long time to converge! (That is, many iterations by the power method.)
\(\boldsymbol{-}\) Instability: small changes to the Web graph can result in large changes to \(\prv\).
For \(\alpha\) close to \(0\):
\(\boldsymbol{+}\) Converges quite quickly (by the power method).
\(\boldsymbol{-}\) Over-stability: even large changes to the Web graph result in only small changes to \(\prv\).
\(\boldsymbol{-}\) What's the point?! This ignores the structure of the Web graph. It is completely “artificial”.
Believed to have been \(\alpha = 0.85\).
Convergence in around 50 iterations.
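These two figures are consistent: the subdominant eigenvalue of \(\G\) has magnitude at most \(\alpha\), so the power method's error shrinks roughly like \(\alpha^{k}\), and \(0.85^{50} \approx 3 \times 10^{-4}\), about three digits of accuracy after 50 iterations.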
\(n =\) 13 billion.
Still, in the early 2000's, the \(\prv\) computation took around 3 days.
There was a monthly “day of reckoning” (the 28th?) when Google would roll over to the new \(\prv\).
SEO'ers would take note!
PageRank
the be-all and end-all?
No. There are many weaknesses.
The \(\prv\) is query independent!
One ranking for all queries.
It can be gamed.
PageRank
the Pillar?
Half of it.
The technique provided far superior search results to those of other engines — for the common types of search queries — in the early 2000's.
It gave — and gives — Google some powerful tools for adjusting their search rank and index.
The quality of the search brought in the customers / users.
AdSense
The other pillar — and foundation of Google's business model — is AdSense.
This revolutionized advertising on the Web.
It brought in the paying customers: the advertisers.
Which documents / pages are relevant to a given query?
Of the relevant ones, are some more relevant than others?
If so, how to blend relevancy scores (query dependent) with, say, PageRank scores (query independent)?
Can we weight \(\prv^{(0)}\) differently to favour “good content” pages?
No. Remember, the convergence is unique, regardless of the setting of \(\prv^{(0)}\)!
Can we weight the teleportation matrix \(\E\) for this?
Yes!
Can we weight a page's outgoing links differently?
Yes!
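A sketch of both adjustments in the slides' notation (the personalization vector \(\vector{v}\) is my label, not the slides'): replacing the uniform teleportation distribution with any probability vector \(\trans{\vector{v}}\) gives
\[
\G = \alpha \S + (1 - \alpha)\,\e\trans{\vector{v}},
\]
which biases the surfer's random jumps toward favoured pages. For the outgoing links, \(H_{ij}\) can be any nonnegative weights with each row of \(\H\) summing to \(1\), instead of the uniform \(1/|P_{i}|\).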
The science of who links to whom has extended beyond the Web to a variety of other networks that go collectively by the name of complex systems. Graph techniques have successfully been applied to learn valuable information about networks ranging from the AIDS transmission and power grid networks to terrorist and email networks.
— Langville & Meyer 2006 [p.30]