Some stuff about Google’s Crawler

How Google crawls and retrieves data from the web, and some common SEO issues that arise, straight from a Googler’s mouth.

Pierre Far is a Googler. I expect he’d appreciate that I pointed him out on G+. He spoke a bit at ThinkVisibility about the crawler and some of the issues that face the whole information gathering and retrieval process. His pictures weren’t as pretty as the “How Majestic Works” infographic, but there was some useful substance in there.

For example: did you know that Google only checks robots.txt about once per day to help keep the load off your server? And that having a +1 button on your site can override robots.txt? These are some of the things that he brought up in his very interesting presentation. I made some notes as I went along. I hope they are legible…

Google sets a conservative crawl rate per server, so too many domains or URLs on one server will reduce the crawl rate per URL.

If you use shared hosting, then this could easily be problematic for you. If you do not know how many other websites are on the same IP address as you, then you may be surprised. You can easily check this by putting your domain or IP address into Majestic’s neighbourhood checker to see how many other websites we saw on the same IP address. Dixonjones.com is currently on a server with 10 sites, but there could be hundreds. More importantly, if one site has a massive number of URLs… and it is not yours… then you could be losing crawl opportunities, just because there’s a big site on the same IP address that isn’t connected to you in any way. You can’t really go complaining to Google about this. You bought the cheap hosting, and this is one of the sacrifices you made.

If a CMS has huge duplication, Google knows at this stage, and this is how it notifies you of duplicates in Webmaster Tools (WMT).

This is interesting because it is more efficient to realize a site has duplicate URLs at this point than after Google has had to analyze all the data and deduplicate on your behalf.

Google then picks URLs in a chosen order.

I asked Pierre what factors affected which URLs were selected. In truth, I asked whether deep links to URLs were likely to prioritize those URLs for a higher crawl rate than other pages. Of course I believe deep links will change this priority, but I had to ask. The answer I was given was:

Change Rate of page content will change this.

Which is not quite what I asked – but nice to know.

Google checks Robots.txt about once per day. Not every visit.

This was interesting to me. Majestic checks more often and you would be surprised at how simply checking Robots.txt annoys some people. Maybe less is more.
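
For anyone writing their own bot, re-reading robots.txt only about once a day comes down to caching the parsed rules with a timestamp. The sketch below is an illustration built on Python's standard `urllib.robotparser`, not a description of how Google or Majestic actually do it. The `RobotsCache` class, the 24-hour constant and the injected `fetch` callable are all assumptions made for the example.

```python
import time
from urllib.robotparser import RobotFileParser

ONE_DAY = 24 * 60 * 60  # assumed refresh interval, in seconds


class RobotsCache:
    """Hypothetical helper: re-read robots.txt at most once per ONE_DAY."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch   # callable returning the robots.txt text for a host
        self._clock = clock   # injectable clock, so the logic can be tested
        self._cache = {}      # host -> (fetched_at, parser)

    def allowed(self, host, path, agent="*"):
        now = self._clock()
        entry = self._cache.get(host)
        # Fetch on first use, or when the cached copy is a day old or more.
        if entry is None or now - entry[0] >= ONE_DAY:
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            entry = (now, parser)
            self._cache[host] = entry
        return entry[1].can_fetch(agent, path)
```

Every call to `allowed()` within the same day reuses the cached rules, so the server only sees one robots.txt request per day from this bot.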

Google then crawls the URLs and sends feedback to scheduler.

If a server spikes with 500 errors, Googlebot backs off. Also (as with Majestic), firewalls and the like can block the bot. After a few days, this can create a state in Google that says the site is dead. The jQuery blog had this issue.
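
To make "backs off" concrete, here is a minimal sketch of one way a polite crawler might react to server errors. The policy (double the pause after any 5xx response, halve it again after a healthy one) and the `next_delay` name and limits are assumptions for illustration; Pierre did not describe Googlebot's actual rules.

```python
def next_delay(current_delay, status_code, min_delay=1.0, max_delay=3600.0):
    """Return the pause (in seconds) before the next request to a server.

    Assumed policy, for illustration only: double the delay after any 5xx
    response, otherwise ease back toward min_delay by halving it.
    """
    if 500 <= status_code <= 599:
        return min(current_delay * 2, max_delay)
    return max(current_delay / 2, min_delay)
```

With these made-up numbers, a run of errors quickly stretches the gap between requests, which is why a long spell of 500s can leave a site barely crawled.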

If robots.txt returns a 503 error, they stop crawling.

OK. Don’t do that then 🙂

Biggest and smallest ISPs can block Googlebot at the ISP level.

It was good to see that other crawlers face this issue. Because ISPs need to protect their bandwidth, the fact that you want Google to visit your site does not necessarily mean it will be so. Firewalls at the ISP may block bots even before they see your home page. They may (more likely) start throttling bits. So if your pages are taking a long time to get indexed, this may be a factor.

Strong recommendation – set up email notifications in Webmaster Tools.

Pierre did not understand why we were not all doing this. If Google has crawling errors – or other things that they would like to warn us about – then an email notification trumps waiting for us to log back in to Webmaster Tools. I’ll be setting mine up right after this post.

Getting better and better at seeing .js files.

At least – I think that’s what he said.

Soft error pages create an issue and so Google tries hard to detect those.

If they can’t, they end up crawling the soft error page and using up a crawl slot (at the expense of another URL crawl, maybe). If you don’t know what a soft error is, it is when an error page returns a 200 response instead of a 4xx (usually 404) response. You can “ping” a random non-existent URL on your site to check this using Receptional’s free HTTP header checker if you want.
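
If you would rather script that check, the sketch below requests a made-up URL and looks at the status code, using only Python's standard library. The function names are hypothetical, and treating any 2xx answer as a soft error is a simplifying assumption (a redirect to the home page, for instance, also ends up as a 200 here because `urlopen` follows redirects).

```python
import uuid
import urllib.error
import urllib.request


def random_missing_url(base_url):
    """Build a URL that should not exist on the site."""
    return base_url.rstrip("/") + "/" + uuid.uuid4().hex


def http_status(url):
    """Return the final HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions


def looks_like_soft_404(base_url, get_status=http_status):
    """True if a made-up URL is answered with a 2xx status."""
    status = get_status(random_missing_url(base_url))
    return 200 <= status < 300
```

Calling `looks_like_soft_404("https://www.example.com")` with your own domain in place of the example one should return `False` on a site that serves proper 404s.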

Google then analyses the content. If the page is marked noindex, then that’s it.
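
To see whether a page of yours carries that instruction, you can look for a robots meta tag in the HTML. This is a rough sketch using Python's built-in `html.parser`; the `has_noindex` helper is hypothetical, and it only inspects meta tags (it does not look at the `X-Robots-Tag` HTTP header, which can carry the same directive).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects whether any robots/googlebot meta tag says noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            directives = [d.strip().lower()
                          for d in attr.get("content", "").split(",")]
            # "none" is shorthand for "noindex, nofollow".
            if "noindex" in directives or "none" in directives:
                self.noindex = True


def has_noindex(html):
    """True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex
```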

There was a question from the audience: “Is Google keeping up with the growth of the web?” Pierre likes to think they are, but admitted it was hard to tell.

Serving the data back to you:

Google receives your incoming query and searches the Index.

Err – yes. Google does not try to scan the whole web in real time. Non-techies don’t realize this it seems.

Magic produces ordered links.

No questions allowed on the magic!

On displaying a result, Google needs to:

  • Pick a URL
  • Pick a title: usually the title tag, but sometimes Google changes it based on the user’s query. This is a win-win for everyone
  • Generate a snippet: Google will create one from content on the page, but strongly recommends using rich snippets.
  • Generate site-links: whether these appear depends on the query and the result. If you see a bad site-link (wrong link), check for a canonicalisation issue.

A +1 button can override Robots.txt, on the basis that it is a stronger signal than Robots.txt.

Question from the audience: “Why is the showing of rich snippets so volatile?” Google has noticed people spamming rich snippets recently, so he said maybe that was a reason for increased testing.

Pierre was completely unable to talk about using +1 as a ranking signal (whether by policy or because it was not his part of the ship).

Q: “How can we prioritize the crawl to get new content spidered?” A: Pierre threw it back: do some simple maths. 1 URL/second is 86,400 per day. Google is unlikely to hit your site continually for 24 hours, so large amounts of new content can take time to crawl.
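
The simple maths is easy to put into a few lines. In this sketch the crawl rate and the number of hours per day that the bot spends on your site are inputs you have to guess at; the function names are made up for the example.

```python
SECONDS_PER_HOUR = 60 * 60


def urls_per_day(urls_per_second, hours_crawled=24):
    """URLs fetched in a day at a steady rate for the given hours."""
    return urls_per_second * hours_crawled * SECONDS_PER_HOUR


def days_to_crawl(total_urls, urls_per_second, hours_crawled):
    """Days needed to fetch total_urls at that rate and daily duration."""
    return total_urls / urls_per_day(urls_per_second, hours_crawled)
```

At 1 URL/second around the clock, `urls_per_day(1)` gives 86,400. If the bot only spent a hypothetical two hours a day on your site, that drops to 7,200 per day, so 72,000 new URLs would take ten days.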

Q: “What error message should you use if your site comes offline for a while?” A: 503, but be careful if only some of your site is offline not to serve a 503 on robots.txt.
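
As a sketch of that advice for a partial outage, the function below picks a status code per request path. The `status_for` name and the idea of listing offline sections as path prefixes are assumptions for illustration, not something Pierre showed.

```python
def status_for(path, offline_prefixes):
    """Pick the HTTP status for a request while part of the site is down."""
    if path == "/robots.txt":
        return 200  # keep robots.txt reachable so crawling is not halted
    if any(path.startswith(prefix) for prefix in offline_prefixes):
        return 503  # temporarily unavailable
    return 200
```

For example, with only `/shop/` offline, `status_for("/shop/cart", ["/shop/"])` returns 503 while `/robots.txt` and the rest of the site keep answering normally.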

OK – that was about it. Thanks Pierre for the help.

Oh – nearly forgot – Pierre would like to point out that all this is in the Google Webmaster Documentation.

12 thoughts on “Some stuff about Google’s Crawler”

  1. I’ll be a little cheeky here.

    If things like Google only hitting robots.txt approximately once every 24 hours, or +1 buttons overriding robots.txt disallow directives, are a surprise or new to you,
    then I strongly suggest you get yourself into the Google Webmaster Central Forums and start reading Google’s official Webmaster Blog.

    Spending a few hours each week reading through that lot (the blog has low activity, and the forums tend to repeat a lot, so it’s not too much time/effort)
    will definitely help keep you on top of such things.

    http://www.google.com/support/forum/p/Webmasters
    http://googlewebmastercentral.blogspot.com/

  2. Thank you for this summary – I had overlooked Pierre’s appearance at ThinkVisibility, glad I stumbled upon your post!

  3. Thanks for the summary. I couldn’t make it to ThinkVisibility, but this is a great summary of Pierre’s presentation; it’s always good to hear from real Googlers. Especially since my site has had issues with 500 errors in the past.

  4. Great write up, but your “simple maths” in Pierre’s answer is a little off. 1 URL/sec is approximately 84,000 (not 8,400) or exactly 86,400 per day (if Googlebot is hitting your site 24 hours/day).

    Otherwise, great info and thanks to John Mueller for informing us of this post.
