Topic-specific PageRank

Suppose our random surfer, endowed with a teleport operation as before, teleports to *a random web page on the topic of sports* instead of teleporting to a uniformly chosen random web page. We will not focus on how we collect all web pages on the topic of sports; in fact, we only need a non-zero subset of sports-related web pages, so that the teleport operation is feasible. This may be obtained, for instance, from a manually built directory of sports pages such as the open directory project (`http://www.dmoz.org/`) or that of Yahoo.

Provided the set of sports-related pages is non-empty, it follows that there is a non-empty set of web pages over which the random walk has a steady-state distribution; let us denote this *sports PageRank* distribution by . For web pages not in , we set the PageRank values to zero. We call the *topic-specific PageRank* for sports.

Topic-specific PageRank.In this example we consider a user whose interests are 60% sports and 40% politics. If the teleportation probability is 10%, this user is modeled as teleporting 6% to sports pages and 4% to politics pages.

We do not demand that teleporting takes the random surfer to a uniformly chosen sports page; the distribution over teleporting targets could in fact be arbitrary.

In like manner we can envision topic-specific PageRank distributions for each of several topics such as science, religion, politics and so on. Each of these distributions assigns to each web page a PageRank value in the interval . For a user interested in only a single topic from among these topics, we may invoke the corresponding PageRank distribution when scoring and ranking search results. This gives us the potential of considering settings in which the search engine knows what topic a user is interested in. This may happen because users either explicitly register their interests, or because the system learns by observing each user's behavior over time.

But what if a user is known to have a mixture of interests from multiple topics? For instance, a user may have an interest mixture (or *profile*) that is 60% sports and 40% politics; can we compute a *personalized PageRank* for this user? At first glance, this appears daunting: how could we possibly compute a different PageRank distribution for each user profile (with, potentially, infinitely many possible profiles)? We can in fact address this provided we assume that an individual's interests can be well-approximated as a linear combination of a small number of topic page distributions. A user with this mixture of interests could teleport as follows: determine first whether to teleport to the set of known sports pages, or to the set of known politics pages. This choice is made at random, choosing sports pages 60% of the time and politics pages 40% of the time. Once we choose that a particular teleport step is to (say) a random sports page, we choose a web page in uniformly at random to teleport to. This in turn leads to an ergodic Markov chain with a steady-state distribution that is personalized to this user's preferences over topics (see Exercise 21.2.3 ).

While this idea has intuitive appeal, its implementation appears cumbersome: it seems to demand that for each user, we compute a transition probability matrix and compute its steady-state distribution. We are rescued by the fact that the evolution of the probability distribution over the states of a Markov chain can be viewed as a linear system. In Exercise 21.2.3 we will show that it is not necessary to compute a PageRank vector for every distinct combination of user interests over topics; the personalized PageRank vector for any user can be expressed as a linear combination of the underlying topic-specific PageRanks. For instance, the personalized PageRank vector for the user whose interests are 60% sports and 40% politics can be computed as

**Exercises.**

- Write down the transition probability matrix for
the example in Figure 21.2 .
- Consider a web graph with three nodes 1, 2 and 3. The links
are as follows:
. Write down the transition
probability matrices for the surfer's walk with teleporting,
for the following three values of the teleport probability:
(a) ; (b) and (c) .
- A user of a browser can, in addition to clicking a
hyperlink on the page he is currently browsing, use the
*back button*to go back to the page from which he arrived at . Can such a user of back buttons be modeled as a Markov chain? How would we model repeated invocations of the back button? - Consider a Markov chain with three states A, B and
C, and transition probabilities as follows. From state A,
the next state is B with probability 1. From B, the next
state is either A with probability , or state C with
probability . From C the next state is A with
probability 1. For what values of is this
Markov chain ergodic?
- Show that for any directed graph, the Markov chain
induced by a random walk with the teleport operation is
ergodic.
- Show that the PageRank of every page is at least
. What does this imply about the difference in
PageRank values (over the various pages) as becomes
close to 1?
- For the data in Example 21.2.2, write a
small routine or use a scientific calculator to compute the
PageRank values stated in Equation 260.
- Suppose that the web graph is stored on disk as an
adjacency list, in such a way that you may only query for
the out-neighbors of pages in the order in which they are
stored. You cannot load the graph in main memory but you may
do multiple reads over the full graph. Write the algorithm
for computing the PageRank in this setting.
- Recall the sets and introduced near the beginning of Section 21.2.3 . How does the set relate to ?
- Is the set always the set of all web pages?
Why or why not?
- Is the sports PageRank of any page in at least
as large as its PageRank?
- Consider a
setting where we have two topic-specific PageRank values for
each web page: a sports PageRank , and a
politics PageRank . Let be the
(common) teleportation probability used in computing both
sets of topic-specific PageRanks. For , consider
a user whose interest profile is divided between a fraction
in sports and a fraction in politics. Show that
the user's personalized PageRank is the steady-state
distribution of a random walk in which - on a teleport step
- the walk teleports to a sports page with probability
and to a politics page with probability .
- Show that the Markov chain corresponding to the walk in
Exercise 21.2.3 is ergodic and hence the
user's personalized PageRank can be obtained by computing
the steady-state distribution of this Markov chain.
- Show that in the steady-state distribution of
Exercise 21.2.3, the steady-state
probability for any web page equals
.

This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.

2009-04-07