Categories
Random observations

PageRank

After initially being told that a paper I wrote with Ryan, Jaga and Audun had been rejected from the Australian Computer Science Conference, it now appears that there was a glitch in the system and that our paper, in fact, was supposed to be accepted. We became aware of this today after I sent off a missive to the conference organisers telling them that they’d forgotten to attach the reviews of our paper. Anyway, more on that matter as further details become available.

Now that it appears the paper will be accepted, like a good little researcher, I started updating the paper for the camera-ready version. I came across the following claim in the paper regarding Google’s PageRank algorithm:

The algorithm reduces the rank of web pages with outgoing hyperlinks and increases the rank of web pages with incoming hyperlinks. This means that a page with few outgoing links and many incoming links will have a high rating.

I had previously flagged this description of the PageRank algorithm as being in need of revision. Therefore, I reviewed Brin’s and Page’s The Anatomy of a Large-Scale Hypertextual Web Search Engine (pdf) in the hope of distilling its contents and deriving a more satisfactory, yet still concise, description of PageRank. Unless I am mistaken, which is a distinct possibility, the original description is not as wildly off target as I thought. The key point to note is that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one. In PageRank, a link from page A to page B is essentially a vote (the weight of this vote depends upon a number of factors, including the number of other links on page A and the PageRank of page A, but these factors are not directly relevant to this discussion). In linking to page B, then, page A has helped to increase page B’s PageRank. But since the sum of the PageRanks must equal one, the increase of B’s PageRank must necessarily come at the expense of a decrease in the PageRanks of other pages. In other words, there’s no such thing as a free lunch: somebody has to pay. The question is, does A alone incur the cost of linking to B? The answer is no, since the calculation of A’s PageRank is not dependent in any way on the number of outgoing links from A, as shown by the PageRank algorithm (which you can find in the manuscript linked above). So where does the extra Googlejuice imbibed by B come from? One way to think about it is this. The addition of a link from A to B slightly reduces the weight of all the other links in the global collection of links. Therefore, it is not A alone that pays the price. Rather, the cost is shared by all existing pages. Another way to think about this is using the "random surfer" model suggested by Page and Brin. The PageRank of A equates to the probability that a user who starts on a random page and clicks randomly on hyperlinks will visit A before boredom sets in, at which time the surfer will randomly select another page to start from. Notice that adding a link from A to B does not lower the probability that the surfer will visit A any more than it reduces the probability of the surfer visiting any other page. What is clear, however, is that the probability that the surfer will visit B has increased, since B now has more incident edges, and has therefore increased its proportion of the total number of incident edges in the graph.

I was not at all surprised to find a great many discussions and (sometimes heated) arguments about the way PageRank works. Just Google it! The really weird stuff shows up when the search is constrained to discussions talking about outgoing links.

This investigation started me thinking about alternative algorithms for ranking results. In Google, the PageRank of page A utilises the PageRanks of all the pages that link to A. It occurred to me that not all pages linking to A would be relevant to the current search. A possible improvement to the PageRank algorithm might be to use only those pages linking to A which also appear in the query results. For instance, if my query is about "fluffy white dogs", and page A is a hit for this query, then only pages which were also a hit for this query and which link to A should be used in the calculation of A’s PageRank for this particular query. Why should a page which links to A for some other reason, say because A also discusses "lazy ginger cats", be included in the ranking of A for this query? Surely this adjustment to the algorithm would improve the ordering of results in Google. The one reason I can think of not to do this is that PageRank would be calculated at the time of query rather than offline, meaning that results would be delayed slightly longer. Mind you, it’s entirely possible that Google already does something like this, because there’s no doubt that the PageRank algorithm must have been updated and modified since 1998!

The question still remains: how do I go about adjusting our paper?

By ricky

Husband, dad, R&D manager and resident Lean Startup evangelist. I work at NICTA.