Microsoft have detailed an 8 page, two column document coming out of China which may well eventually beat Google’s PageRank and then some. Because the paper is so long, I thought I would do what Graywolf did years ago for the PageRank algorithm paper, by summarizing it all in easy to digest terms. (Added: Thanks Michael for sending me the link, see comments.)
PageRank was Larry Page’s brainchild (hence “Page”) and was the making of Google. It used the basic notion that a link from one web page to another can act as an endorsement of the landing page’s value. It worked brilliantly. Mostly. But Microsoft think they have a better approach. Here’s what their paper says.
The paper proposes a new method for working out a page’s importance – called BrowseRank. The article starts by summarizing the theory behind PageRank which I won’t reiterate, although they do also not that PageRank also uses a discrete time Markov process – which basically refers to the way its bot link-walks around the web, collecting data. It highlights two fundamental weaknesses of the model. First – it points out that the link graph that is generated is not especially reliable as a data source, because people can create pages with tons of links automatically and quickly. (Now who would do a think like that?) They also suggest that the PageRank model uses a random “walk” along the links. (I’m not so sure about “random”. Maybe in the early days) They also observe that the original PageRank theory took no account of the amount of time that a websurfer spends on a web page – something seen as a significant flaw by the writers of the China paper.
It is these drawbacks that BrowseRank seeks to overcome in what they say is a “more powerful mathematical model”. Instead of using the link graph as the basis for data collection, they propose to utilize a user browsing graph, generated from user behaviour. They will collect the data either through web browsers or at the server level. (MY ADD: Let’s face it – they have plenty of people using Microsoft’s Internet explorer, so even if this was decided by the courts to be something that required an opt-in, they would have a HUGE data set. Even if they didn’t, then any sites that wanted to be seen only need to opt in themselves… probably through adding tracking code or by allowing the server to pass logs to Microsoft. Either way – they have a potentially great data source here. ) They do propose to store the data by anonymous user, you will be glad to here! (MY ADD: They do not need to personalize it, but my guess is that they will need to do something to filter by IP number to avoid spammers… but even as I type, I think they could get around that.)
So – with this data collection methodology, the basic calculation of a page’s importance becomes obvious: The more visits to the web page combined with the longer people spend on those pages, the more important the page. (MY ADD: Time to start writing long, engaging copy and maybe if you are black hat, reduce the page’s usability so people take ages to figure out how to move off the page). The writers call this very “Web 2.0” in concept, with user behaviour defining “implicit voting” on page importance.
So now you get the basics – but the paper gets into much more depth… (I am only on page 2, column 1 of an eight page document here, and page 1 was half taken up with the author details).
Leveraging the data using BrowseRank
Here’s where they start to get clever. Once you have this data, you can’t use a discrete-time Markov process to leverage it. I’ll try to explain each bit separately as I go through it, but here’s the terminology they are using
My interpretation of all those is below. (TEMPORARY NOTE: As I plan to go through this paper in details, I will probably amend these as I understand them more. By the time I have written the last part of these blog posts, I’ll delete this sentence and keep the content static thereafter except typos)
Stationary probability distribution of the continuous Markov Process
What the heck is that??? Well look – it seems to me that a Markov process basically is a link-walking bot. This database will not be based on link walking – so it needs to find a way to put a percentage, if you will, on the probability that a person clicks on each link on a page. The highest POSSIBLE combined total for any page’s total links would be 100%, but this would be impossible in practice as the back button and people going nowhere from a page would also need to be included. So for every link on the page, there is a percentage chance that a user will click, with the total percentage representing the “Stationary Probability Distribution“. Hope that makes sense? In layman’s terms, It means that they don’t NEED to use a bot to linkwalk for this part.
Noise Model to represent the Markov Process Observations
I think this means that the system will look at the content on the page and relate it to the content on the pages that they link to (and content indicators like anchor text etc) to get some indication of where a user might link next, which would be valuable where a page does not have enough user behaviour data to provide a meaningful map.
An embedded Markov Chain based technology to speed up calculation of the stationary distribution
Relying entirely on user behaviour to map the whole web would take to long… so they are going to use a bot anyway! But now, instead of the Bot linking randomly, it will link based on the probability distribution from pages where there is suffficient confidence in the distribution (and presumably sufficient confidence in the page value).
I’ll expand on this and go through the rest of the paper on subsequent blog posts over the next few days, once I have digested the paper. We still need to cover:
- User Data Behaviour – how they plan to define user behaviour, including their modelling and assumptions. (Real MATHS here, that I can’t cut and paste because your browser may not supprt the symbols!)
- The Conituous-time Markov Model: More maths
- The Algorithm itself: Horrendous amounts of Maths
- Experimental results.
More soon! The actual paper is available in PDF form at http://research.microsoft.com/users/tyliu/files/fp032-Liu.pdf.