I put this video together on how PageRank works last year whilst practising for Pubcon in Las Vegas. I will also talk about PageRank at BrightonSEO in September. The presentation takes what looks to be a very complicated mathematical algorithm and breaks it down into concepts that mortals with basic Excel knowledge can understand.
Here’s what I say in the video:
Hi, I’m the global brand ambassador for Majestic. Majestic is a specialist search engine that has spent the last decade crawling the internet, looking at the backlinks into web pages, and using them as a way to map the Internet.
Before that, I worked for probably another ten years, founding the SEO agency Receptional in the UK, so I’ve got around 20 years in the industry.
When I went on holiday this year, in Dubrovnik, Croatia, a maths teacher came up to me and asked me what this meant on my t-shirt, and it dawned on me that even at conferences like this, people don’t understand what this algorithm means.
So at the end of this presentation you will know. This, of course, is the PageRank algorithm, and it is the one that made Larry Page and Sergey Brin some of the richest and most powerful people in the world. This is the maths that built Google, and you can just read it. It says the PageRank of a page in this iteration equals one minus a damping factor, plus, for every link into that page, add the PageRank of the linking page, divide it by the number of outbound links on that page, and reduce that by the damping factor. As long as you are not counting a page linking to itself. Easy, right?
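Written out as an equation, the sentence above might look like the following. This is a sketch in the style of the original Brin and Page paper, not a copy of the t-shirt: the symbols $d$, $B(p)$ and $L(q)$ are labels chosen here for illustration. In this form $d$ is the share passed on through links (the paper usually sets it to 0.85), so $1 - d$ is 0.15. The worked example later in the talk calls the 0.15 the damping factor, which gives the same numbers.

```latex
% PR_i(p): PageRank of page p in iteration i
% B(p):    the set of pages that link to p (assumed label)
% L(q):    the number of outbound links on page q (assumed label)
PR_{i}(p) = (1 - d) + d \sum_{\substack{q \in B(p) \\ q \neq p}} \frac{PR_{i-1}(q)}{L(q)}
```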
Well, okay, maybe for a few of you, but this algorithm is fundamental to understanding links, and in particular to understanding why most links count for nothing (or almost nothing) on the Internet.
When you get to grips with this algorithm, you will usually be light-years ahead of other SEOs. I never really see it properly explained, and I guarantee that even if you know this algorithm inside out, you will see some unexpected results from this maths, and you will also (I hope) never use the phrase domain authority in front of another customer again (at least in relation to links).
Now, I should say, I’m not asking anyone here to know much more than simple Excel for this presentation. I’m going to start by showing you how the maths in the calculation applies to this representation of a very small Internet system of only five nodes. Then we are going to look at a very slightly different map, which has profound consequences for our results.
So the PageRank algorithm is what is called an iterative algorithm. We start with some estimates, and then we continually refine our understanding of the ecosystem that we are measuring.
So how can we see how the PageRank formula applies to this ecosystem?
Firstly we need to create a matrix. We have the nodes A to E, which I’ve put in two columns. Nodes is the correct terminology, but I’m going to call them pages for now, because we understand pages in the SEO world. The word nodes is important, though, as we’ll find out later.
I’m also going to put in a start value, and I’m going to put in the number of links into each page as my first row. So page A has one link coming into it, page B has two links coming into it, and so on across, and we can see that from the chart in the top right-hand corner. Then I’m going to have to look at the out links as well. Page A is only linking out to one page, which is page C, whereas page C, for example, is linking out to three pages: A, B and E.
Interestingly, page D actually only links out to two nodes: D links to B and D links to E. Of course, it’s linking to E three times, but in this version we don’t count the other two links, because either there’s a link from D to E or there isn’t. So I’m going to put two in that particular column for D.
Now we also need to say that a page can’t link to itself, so we’ll put a red bar through AA, BB, CC, and so on.
Now we’re going to have a look and see, in this matrix, where there is a link between two nodes. So, for example, page A links to page C, page B links to page C, and page C links out to pages A, B and E, so it’s got three links coming out, which we can check against the number of out links. Then we do the same for D and E.
So now we’ve filled in the whole of our matrix and we can see the green spots. The number of green spots should add up to the number of backlinks, and it should add up to the number of out links. They should all add up to the same number.
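The grid and its cross-check can be sketched in a few lines of Python. The link map below is an assumption pieced together from the talk: the links from A, B, C and D are stated, while E linking to D is inferred from the total of eight links. The names `LINKS`, `matrix`, `in_links` and `out_links` are illustrative only.

```python
# Link map: page -> distinct pages it links out to.
# A, B, C and D are as described in the talk; E -> D is inferred
# from the stated total of eight links, so treat it as an assumption.
LINKS = {
    "A": ["C"],
    "B": ["C"],
    "C": ["A", "B", "E"],
    "D": ["B", "E"],  # D's three links to E are counted once
    "E": ["D"],
}
PAGES = sorted(LINKS)

# matrix[src][dst] is 1 where src links to dst (a green spot), else 0.
matrix = {src: {dst: int(dst in LINKS[src]) for dst in PAGES} for src in PAGES}

# Row totals are out links, column totals are backlinks.
out_links = {p: sum(matrix[p].values()) for p in PAGES}
in_links = {p: sum(matrix[src][p] for src in PAGES) for p in PAGES}
green_spots = sum(out_links.values())

print("in links: ", in_links)
print("out links:", out_links)
print("green spots:", green_spots)
```

Under these assumptions the green spots, the backlinks and the out links all total eight, and no page has a spot on its own diagonal cell.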
So here’s the grid, and we can check those numbers. That eight is very useful; it’s a good reason for us to use backlinks as a place to start here. But the PageRank algorithm doesn’t start with backlinks, it starts with an estimate, an initial estimate of PageRank.
Now most people when they use the PageRank algorithm start by saying that every page initially is worth one point. I’m not going to do that. We found that if we estimate that the PageRank starts with the number of links coming in or any other better estimate, then we’re going to save ourselves a lot of computing time and Majestic needs a lot of computing time.
So it is better to use the link count or some other proxy for PageRank as a first estimate. We’re going to use the link counts, the number of in links, as our initial estimate of PageRank in this calculation.
We also have a damping factor in the calculation. In most of the documentation that I’ve seen, the damping factor is specified as 0.15, so we’re going to use that. That is to say that we don’t want all of the power of a page to be dissipated out by the links: we are going to pass 0.85 out, so we’re going to dampen it down by 0.15.
So the opposite of 0.15, one minus the damping factor, will be 0.85. Now we can work out a multiplier for every single page: the PageRank that it has, multiplied by one minus the damping factor (1 − 0.15), divided by the number of out links.
We can simplify that, in fact, so that it’s PageRank times 0.85 divided by the number of out links.
One thing I should say here, which I’m not going to cover in this presentation, is what happens if the number of out links is zero. What if a page doesn’t link out? In theory that would mean that you’re dividing by zero, which causes all sorts of problems for this algorithm, and that does get a little bit complicated for us to explain.
So I’m going to have an ecosystem where all of the pages are linking out, and we can then fill in this table a little bit more. In this table, for example, page A initially has a PageRank of 1, which we multiply by 0.85 and divide by the single outbound link, so the multiplier is 0.85 in column A. In column C the PageRank is 2, which we multiply by 0.85 and divide by the three outbound links, which means that each link lends a score of 0.5666 recurring, or about 0.5667.
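As a quick check on those two figures, here is a minimal Python sketch of the multiplier row. The start values and out-link counts for A and C come straight from the talk; the remaining counts are inferred from the eight-link total and should be read as assumptions, as should the names `multipliers` and `per_link`.

```python
# The talk's damping factor: 0.15 is held back, 0.85 is passed on.
DAMPING = 0.15

# Start values are the in-link counts; out_links are distinct pages linked to.
# Values for B, D and E are partly inferred from the talk, not quoted.
start_rank = {"A": 1, "B": 2, "C": 2, "D": 1, "E": 2}
out_links = {"A": 1, "B": 1, "C": 3, "D": 2, "E": 1}


def multipliers(rank, out_links, damping=DAMPING):
    """Return the value each page hands to every page it links to."""
    return {p: rank[p] * (1 - damping) / out_links[p] for p in rank}


per_link = multipliers(start_rank, out_links)
for page, value in per_link.items():
    print(f"{page}: {value:.4f}")
```

Page A comes out at 0.8500 and page C at 0.5667, matching the spreadsheet columns described above.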
So now we can use this to fill in our new values for all of the green boxes.
Page A gives one link to page C, and each link it gives has a value of 0.85, so we reference this 0.85 in column A and say, right, that’s the amount it’s giving to page C. Page C links to three pages, each with a value of 0.5666 recurring, so it gives that value to pages A, B and E, and so it goes on.
So we’ve taken these calculations, and we’re referencing these cells and apportioning out the PageRank values to each of the pages.
And then we can add up those columns to find the new PageRank for each page. We have to add back in here the damping factor that we took out at the start, the 0.15, which we add back into every page.
So now we have some new numbers for a new estimation of page rank.
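One full round of that spreadsheet can be sketched as a small Python function. This is an illustration under assumptions rather than the spreadsheet itself: the link map is reconstructed from the talk (with E linking to D inferred from the eight-link total), and the names `LINKS` and `iterate` are made up for the example.

```python
DAMPING = 0.15  # the talk's damping factor

# Assumed link map: page -> distinct pages it links out to.
LINKS = {
    "A": ["C"],
    "B": ["C"],
    "C": ["A", "B", "E"],
    "D": ["B", "E"],
    "E": ["D"],  # inferred, not stated in the talk
}


def iterate(rank, links, damping=DAMPING):
    """One round: share out each page's value over its links, then add 0.15 back."""
    new_rank = dict.fromkeys(rank, damping)
    for src, targets in links.items():
        share = rank[src] * (1 - damping) / len(targets)
        for dst in targets:
            if dst != src:  # a page linking to itself is not counted
                new_rank[dst] += share
    return new_rank


# Start from the number of links coming into each page.
start_rank = {p: sum(p in targets for targets in LINKS.values()) for p in LINKS}
first = iterate(start_rank, LINKS)
for page, value in first.items():
    print(f"{page}: {value:.4f}")
```

With these inputs page C receives 0.85 from A and 1.70 from B, plus the 0.15 added back, for a new estimate of 2.7.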
Now that’s really all there is to the PageRank algorithm, it’s just that it gets repeated.
I did say that it’s iterative so you will need to do it again and again and again.
So I’ve got a spreadsheet that I did for all these workings, and if you’d like to have a copy, email [email protected] with PageRank in the subject to get the slides and also the Excel spreadsheet I used to do this presentation.
Basically, when I get down to the bottom, I cut and paste the values back into the top to start again, and that redoes the calculation, because these cells are referencing the out link values or multipliers.
So I cut and paste, put the numbers back into the top, and I get new numbers at the bottom for my second iteration, and again for my third iteration, and I carry on doing this again and again and again.
If you do this over a lot of iterations (I’ve done it 15 times and plotted the results here), you can see that after some initial jumping around, the numbers start to settle and the PageRank values start to stabilize. This is what happens in the charts after 15 iterations.
If we’d started with all the pages being one, by the way, which is what most people tell you to do, then this would have taken many more iterations to get a stable set of numbers.
So now we’ve done the maths. We can see which is the most important page on the Internet, on our internet, and it turns out to be node C.
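The 15 rounds of cutting and pasting can be sketched as a loop. As before, this is a rough illustration and not the spreadsheet: the link map is an assumption reconstructed from the talk (E linking to D is inferred), and `iterate`, `run` and `history` are hypothetical names.

```python
DAMPING = 0.15  # the talk's damping factor

# Assumed link map: page -> distinct pages it links out to.
LINKS = {
    "A": ["C"],
    "B": ["C"],
    "C": ["A", "B", "E"],
    "D": ["B", "E"],
    "E": ["D"],  # inferred, not stated in the talk
}


def iterate(rank, links, damping=DAMPING):
    """One round of sharing out PageRank, as in the spreadsheet."""
    new_rank = dict.fromkeys(rank, damping)
    for src, targets in links.items():
        share = rank[src] * (1 - damping) / len(targets)
        for dst in targets:
            if dst != src:
                new_rank[dst] += share
    return new_rank


def run(links, rounds=15, damping=DAMPING):
    """Start from in-link counts and keep every round's estimate for plotting."""
    rank = {p: sum(p in targets for targets in links.values()) for p in links}
    history = [rank]
    for _ in range(rounds):
        rank = iterate(rank, links, damping)
        history.append(rank)
    return history


history = run(LINKS)
final = history[-1]
top_page = max(final, key=final.get)
for page in sorted(final, key=final.get, reverse=True):
    print(page, round(final[page], 3))
print("most important page:", top_page)
```

On this assumed map the top page after 15 rounds is C, in line with the result above. The `history` list holds one estimate per round, which is what a chart of the values settling down would be drawn from.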
I don’t know if that’s the one you guessed, but whether or not you guessed C, it’s now time to reveal the wider story.
You recall I said nodes instead of pages, and that’s because this was doing the PageRank at the lowest common denominator that I had at the time, which was five nodes. But what if these were actually domains, not pages, in this diagram, and site D had three pages within it, E had two pages and C had three pages?
I now have an ecosystem with ten nodes in it, not five. And more importantly we now have some internal linking but we’ll break that all down so that we’ve got ten nodes so that we don’t have internal
Now where do you think the power is going to lie? Am I really going to go through all the maths again for these ten nodes? Well, hell yeah, just for you guys I am! And here we go. With these ten nodes, if we go through the algorithm again and again for 15 iterations (I haven’t done it on the screen, but if you want to see the maths it’s in the spreadsheet), you can see that again everything goes into a set pattern after a number of iterations, and we can plot that on the chart.
And here are the actual scores for every single page. Now we see that the winning page, the winning node in this example, by a country mile, is E1, the first page in what was initially the E node, and the C domain seems to be really rather lacking in depth when it comes to it. So when we do the maths at the page level, there are some surprising outcomes and observations that I wanted you to take away. Hopefully you’ve now seen the maths, and even if you missed some of it, you’ve got the spreadsheets to help you get there.
Because of the differences between the first example and the second example, you should always look at PageRank at the page level, not at the domain level. If you’d used the domain-level modelling, you would have hoped for links from the winning site in the five-node model, and its pages were amongst the worst at the page level. PageRank has only ever actually been done at the page level, as far as we know.
Majestic does its own calculations at the top-level domain, subdomain and page levels, and in our quest to show higher link counts we default to the top-level domain first, as do our competitors by the way, but really it’s the page level that counts. If you really want to compare two websites, I would urge you to compare their home pages rather than trying to compare the sites directly with each other, because you’ll get a better understanding of how those pages and those sites compare.
If you build a new site and you only use domain authority, in this example you could easily have got a link from one of the worst possible pages, even though it was from the best domain, because of the internal link structures. And when Google say you won’t know which links are going to pass PageRank, how on earth would you know, if you didn’t have the whole map and the whole calculation on the whole map?
You’re not going to be able to see the strength of a link if that link depends on the internal links of an entirely different website.
That’s interesting observation one. The second observation is that the data doesn’t have to be complete, but it works best with the universal data set. Back in 2014, so four or more years ago now, one of our researchers wrote this blog post after somebody in Toulouse, I think it was, did a study using the PageRank algorithm only on Wikipedia pages. That study showed that Carl Linnaeus, who’s a famous botanist, was more influential than Jesus or Hitler.
Majestic’s Citation Flow, a proxy for PageRank, works on a much more universal basis, and that information would have told the researcher a slightly different story and a more likely result, as our data uses a much larger part of the internet.
So every signal in a link is very small and, individually, is prone to error or opinion. But at scale the error decreases and the confidence level in what we’re predicting increases.
The next oddity is that the majority of pages have hardly any PageRank at all. This is taking the PageRank that we had in the final node graph and showing it by area in a tree graph, and really the top three pages in our ten-node model account for 75 to 80 percent of the entire PageRank of the system, as you can see from the chart here.
And the last oddity is that guessing at PageRank from link counts is not a good idea. If we have a look at our original estimates and use the same tree graph, we would have said that page C3 was one of our biggest pages, if not the biggest, by PageRank in our initial estimate, because of the number of links coming into it. But if you compare that to the PageRank by area chart, you find that C3 has hardly any PageRank in the final evolution of our algorithm.
So I’d like to leave you with these thoughts. I’ve shown you how this works in a world 10 pages big. 10 pages times 10 calculations, albeit many of them multiplied by zero, times 15 iterations is 1,500 calculations.
Majestic does a similar but different calculation, over 500 billion URLs a day for our fresh index and 1.8 billion pages a month on our historic index. The point is that PageRank proxies are hard to build, hard to understand and hard to generate; there’s a lot of maths and a lot of computing power involved.
Finally, PageRank is not about rankings, because pure PageRank doesn’t consider context, so be very wary of using page metrics that are based on search visibility for link building. Majestic’s Citation Flow is about the purest correlation to PageRank currently available, although the algorithm is a little different to the PageRank algorithm. Thank you for taking the time. I hope you learned something today.
Is PageRank still used today?
This is a common question. Larry Page came up with the concept and, with the help of Sergey Brin, it became the cornerstone of Google’s early success. Googlers say it is still used, according to this video. On the other hand, a recent article by Roger Montti suggests that it may have been shelved in 2006. Now, this is from an ex-Googler, so it does seem odd that Gary Illyes made it
So what gives?
The truth of the matter is that Google may have
The Power method is another iterative approach, one that can change the PageRank of an individual page at a time. These methods resulted in what we SEOs used to call the “Google Dance”. Periodically, Google will have recalculated PageRank for every URL, and the Search Engine Results Pages (SERPs) used to bounce around for a few days whilst all the indices were updated.
When the Google Dance stopped (which was probably around 2006, if my memory serves me correctly), Google will have found a way to calculate or approximate PageRank on the fly. This lecture from Stanford University looks at some ideas such as the Random Surfer Model, which considers PageRank from the point of view of the probability that the person was on the page one click after their last page visit.
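To get a feel for the Random Surfer Model, here is a rough Python simulation. It is a sketch under assumptions and is not taken from the Stanford lecture: the five-page link map is the one reconstructed from the talk (E linking to D is inferred), the 0.15 chance of jumping to a random page mirrors the talk’s damping factor, and `random_surfer`, `steps`, `jump` and `seed` are names invented for the example.

```python
import random

# Assumed link map: page -> distinct pages it links out to.
LINKS = {
    "A": ["C"],
    "B": ["C"],
    "C": ["A", "B", "E"],
    "D": ["B", "E"],
    "E": ["D"],  # inferred, not stated in the talk
}


def random_surfer(links, steps=100_000, jump=0.15, seed=42):
    """Return the share of clicks that land on each page during a random walk."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pages = sorted(links)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < jump:
            page = rng.choice(pages)  # bored surfer jumps to any page
        else:
            page = rng.choice(links[page])  # otherwise follow a link on the page
    return {p: visits[p] / steps for p in pages}


share = random_surfer(LINKS)
for page in sorted(share, key=share.get, reverse=True):
    print(page, round(share[page], 3))
```

The shares add up to one, and on this assumed map the surfer spends the most time on page C and the least on page A, the same ordering that the iterative calculation gives for those two pages.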
Thanks for listening/watching/reading all this!