Web Science:
PageRank
Parke Godfrey
29 November 2012
CSE-2041
These slides are based in large part on the book
and from
Imagine a surfer who surfs the Web forever.
He spends a minute on a page, then randomly follows one of the page's links to another page.
Over a long, long time, what proportion of his time will he spend on a given page?
A page is important if important pages link to it.
The random web surfer captures this idea. But our surfer can run into trouble:
He gets trapped on a page with no exits (no links going out).
He gets trapped in a sub-graph of the Web graph.
This can happen
if the Web graph has disconnected components; or
if the Web graph has sinks.
He gets eaten by lolcats.
Larry, Sergey, Rajeev, and Terry set out to model and implement this idea.
In retrospect, they were not the only ones to develop this type of idea
or even to have this idea for ranking pages on the Web!
But they were the most tenacious, and had a very sound idea.
We can cast their approach in terms of Markov chain theory, a part of linear algebra.
PageRank
Equation
\(\prv\): the pagerank vector
\(\S\): the stochastic hyperlink matrix
\(\E\): the teleportation matrix
\(\alpha\): the scaling parameter; \(\alpha \in (0..1)\)
\(r(P)\): rank of page \(P\)
\(B_{P_{i}}\): the set of pages that link to page \(P_{i}\); that is, the backlinks of page \(P_{i}\)
\(|P_{j}|\): the number of links that page \(P_{j}\) has
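Putting these definitions together gives the PageRank equation:
\[
r(P_{i}) = \sum_{P_{j} \in B_{P_{i}}} \frac{r(P_{j})}{|P_{j}|}
\]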
Of course, this definition for \(r\) is recursive: the values of \(r\) over the pages depend on the values of \(r\)!
So, we could recast this as an iterative process.
Perhaps let \(r_{0}(P_{i}) = 1/n\) for all \(P_{i}\) where there are \(n\) pages in total.
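In equation form, iteration \(k+1\) computes each page's rank from the previous iteration's values:
\[
r_{k+1}(P_{i}) = \sum_{P_{j} \in B_{P_{i}}} \frac{r_{k}(P_{j})}{|P_{j}|}
\]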
Will \(r\) converge?
\(\prv^{(k)}\): the pagerank vector (our \(r\)'s in vector form)
\(\H\): the normalized hyperlink matrix
\(H_{ij} = 1/{|P_{i}|}\), if there is a link from node \(i\) to \(j\)
\(H_{ij} = 0\), otherwise
\(\H\) is very sparse!
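In matrix form, each iteration is then a single vector-matrix multiplication:
\[
\trans{\prv^{(k+1)}} = \trans{\prv^{(k)}}\,\H
\]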
Great! Turn the crank until we converge; that is, once \(\prv^{(k+1)} = \prv^{(k)}\).
Well, there are potential problems.
Will this iterative process continue indefinitely, or converge?
Under what circumstances or properties of \(\H\) is it guaranteed to converge?
Will it converge to just one vector or multiple vectors?
Will it converge to something that makes sense in the context of PageRank?
Does the convergence depend on the starting vector \(\trans{\prv^{(0)}}\)?
If it will converge, how long is “eventually”?
That is, how many iterations can we expect until convergence?
Many nodes / pages have no outgoing links (e.g., images, PDFs); these are the dangling nodes.
\(\e\): vector of all \(1\)'s
\(\vector{a}\): the dangling node vector
\(a_{i} = 1\), if node \(i\) is a dangling node
\(a_{i} = 0\), otherwise
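With the dangling-node vector in hand, \(\S\) patches \(\H\) by replacing each dangling node's all-zero row with the uniform distribution:
\[
\S = \H + \frac{1}{n}\,\vector{a}\,\trans{\e}
\]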
\(\S\) is stochastic: each row sums to \(1\).
A primitive matrix is both irreducible and aperiodic.
A matrix with no \(0\) entries is primitive.
\(\G\): the Google matrix
\((1/n)\e\trans{\e}\): the teleportation matrix (\(\E\))
\(\alpha\): the scaling parameter; \(\alpha \in (0..1)\)
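Assembling the pieces, \(\G\) is the convex combination
\[
\G = \alpha \S + (1 - \alpha)\,\frac{1}{n}\,\e\trans{\e} = \alpha \S + (1 - \alpha)\,\E
\]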
Well...the random surfer gets bored occasionally — with probability \((1 - \alpha)\) — and jumps from the current page to any other page — with probability \(1/n\).
Like typing in the URL directly instead of following a link.
Or using a search engine. Bing, maybe?
Really, however, this \(\alpha\)-adjustment is artificial.
But it is a mathematical necessity for this approach to work.
\(\G\)
(stochastic) is the convex combination of two stochastic matrices, \(\S\) and \(\E\),
(irreducible) has every page directly connected to every other page (so irreducibility trivially follows),
(aperiodic) has self loops for every node (\(\G_{ii} > 0\)), guaranteeing aperiodicity,
(primitive) guarantees that \(\G^{k} > 0\) for some \(k\) — in fact, already \(\G > 0\) for \(k = 1\) — and thus is primitive, and
(dense) is completely dense.
That \(\G\) is primitive guarantees that the iteration converges, and to a unique \(\prv\).
\(\G\) is dense, which would be computationally intractable; but it is directly constructible from the sparse matrix \(\H\).
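To make the computation concrete, here is a minimal Python sketch (my own illustration, not from the slides; the function name, \(\alpha = 0.85\), and the tolerance are assumed for the example). It runs the power method with only the sparse \(\H\) and the dangling vector \(\vector{a}\), using the identity \(\trans{\prv}\G = \alpha\,\trans{\prv}\H + (\alpha\,\trans{\prv}\vector{a} + 1 - \alpha)\frac{1}{n}\trans{\e}\):

```python
import numpy as np

def pagerank(links, n, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power-method PageRank via the sparse formulation.

    links: dict mapping node i to the list of nodes i links to.
    Nodes absent from links (or with empty lists) are dangling.
    """
    # Row-normalized hyperlink matrix H (dense here only for clarity;
    # a real implementation would keep H in a sparse format).
    H = np.zeros((n, n))
    for i, outs in links.items():
        for j in outs:
            H[i, j] = 1.0 / len(outs)

    # Dangling-node indicator vector a.
    a = np.array([0.0 if links.get(i) else 1.0 for i in range(n)])

    pi = np.full(n, 1.0 / n)  # pi^(0): uniform start (any start works)
    for _ in range(max_iter):
        # pi^(k+1) = alpha*pi*H + (alpha*(pi.a) + 1 - alpha) * e/n;
        # the dense G is never materialized.
        nxt = alpha * (pi @ H) + (alpha * (pi @ a) + 1.0 - alpha) / n
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi

# Tiny example: page 0 links to 1; page 1 links to 0 and 2; 2 dangles.
print(pagerank({0: [1], 1: [0, 2]}, n=3))
```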
We can formulate the problem in three equivalent ways:
As an eigenvector problem for \(\trans{\prv}\).
As a linear homogeneous system for \(\trans{\prv}\).
\(\trans{\prv}\) as the stationary vector of the Markov chain with transition matrix \(\G\).
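In the slides' notation, with \(\mathbf{I}\) the identity matrix, the first two read
\[
\trans{\prv} = \trans{\prv}\G
\qquad \text{and} \qquad
\trans{\prv}(\mathbf{I} - \G) = \trans{\mathbf{0}},
\]
each normalized so that \(\trans{\prv}\e = 1\); the third says \(\prv\) is the long-run distribution of the random surfer whose transitions are governed by \(\G\).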
On the one hand, \(\alpha\) adds an artificial component to the ranking.
On the other hand, there is good mathematical justification — and need — for \(\alpha\).
\(\alpha\) is set as a constant (\(\alpha \in (0..1)\)) for the duration of the computation of \(\prv\).
What should it be set to?
For \(\alpha\) close to \(1\):
\(\boldsymbol{+}\) Seems to capture best “reputation” according to the Web graph's structure.
\(\boldsymbol{-}\) Takes a long, long time to converge! (That is, many iterations by the power method.)
\(\boldsymbol{-}\) Instability: small changes to the Web graph can result in large changes to \(\prv\).
For \(\alpha\) close to \(0\):
\(\boldsymbol{+}\) Converges quite quickly (by the power method).
\(\boldsymbol{-}\) Over-stability: even large changes to the Web graph result in only small changes to \(\prv\).
\(\boldsymbol{-}\) What's the point?! This ignores the structure of the Web graph. It is completely “artificial”.
Believed to have been \(\alpha = 0.85\).
Convergence in around 50 iterations.
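These two figures are consistent: the subdominant eigenvalue of \(\G\) has magnitude at most \(\alpha\), so the power method's error shrinks roughly like \(\alpha^{k}\), and \(0.85^{50} \approx 3 \times 10^{-4}\), about three digits of accuracy after 50 iterations.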
\(n =\) 13 billion.
Still, in the early 2000's, the \(\prv\) computation took around 3 days.
There was a monthly “day of reckoning” (the 28th?) when Google would roll over to the new \(\prv\).
SEO'ers would take note!
PageRank
the be-all and end-all?
No. There are many weaknesses.
The \(\prv\) is query independent!
One ranking for all queries.
It can be gamed.
PageRank
the Pillar?
Half of it.
The technique provided far superior search results to those of other engines — for the common types of search queries — in the early 2000's.
It gave — and gives — Google some powerful tools for adjusting their search rank and index.
The quality of the search brought in the customers / users.
AdSense
The other pillar — and foundation of Google's business model — is AdSense.
This revolutionized advertising on the Web.
It brought in the paying customers: the advertisers.
Which documents / pages are relevant to a given query?
Of the relevant ones, are some more relevant than others?
If so, how to blend relevancy scores (query dependent) with, say, PageRank scores (query independent)?
Can we weight \(\prv^{(0)}\) differently to favour “good content” pages?
No. Remember, the convergence is unique, regardless of the setting of \(\prv^{(0)}\)!
Can we weight the teleportation matrix \(\E\) for this?
Yes!
Can we weight a page's outgoing links differently?
Yes!
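A sketch of both adjustments in the slides' notation (the personalization vector \(\vector{v}\) is my label, not the slides'): replacing the uniform teleportation distribution with any probability vector \(\trans{\vector{v}}\) gives
\[
\G = \alpha \S + (1 - \alpha)\,\e\trans{\vector{v}},
\]
which biases the surfer's random jumps toward favoured pages. For the outgoing links, \(H_{ij}\) can be any nonnegative weights with each row of \(\H\) summing to \(1\), instead of the uniform \(1/|P_{i}|\).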
The science of who links to whom has extended beyond the Web to a variety of other networks that go collectively by the name of complex systems. Graph techniques have successfully been applied to learn valuable information about networks ranging from the AIDS transmission and power grid networks to terrorist and email networks.
— Langville & Meyer 2006 [p.30]