Google Search Quality team being transparent

I must say that I have been hugely encouraged by Google’s drive towards more open communication with the webmaster community recently. Their monthly search quality briefings and their decision to start encouraging users in Webmaster Central to set up email alerts are really helpful. In fact – so is the whole “Inside Search Quality” blog.

Today I saw that they had posted a video of a search quality meeting. It looked at autocorrection on ten-word phrases (I guess that would be called a decagram). It shows the immense level of detail that goes into algorithm changes.

This move towards proactive transparency is great. It really starts to show that there is SO much “white hat” stuff to get stuck into when optimizing a site that you probably shouldn’t start thinking about the less legitimate stuff for quite some time yet. I am hopeful that this goes some way towards putting clear water between professionals in the industry and dabblers, whereas before I would say there was at best a murky puddle between the two camps. There is now SO much we can learn from these briefings that you just don’t have time to do it all in your “free time”.

Right… where’s that Rel=author button…

Dixon.

Ten link building tests you can try in a single post

Last month, Google said they changed “something” to do with links. To be exact, they “switched something off”. Now – I’m pretty confident that the changes just around the corner will be hugely more significant, but in the meantime I thought I would do a post that shows you several ways to test theories about links in Google for yourself… or just see what happens to my tests.

Test 1:

Have a link in your post with a highly irregular anchor-text combination, pointing to a page that you have no interest in, which has absolutely no relevance to any of the words and no earthly reason to rank for the anchor text term. Then see if, after a few weeks, the page ranks in the SERPs.

Test 2:

Have a link in your post with a highly irregular anchor-text combination, pointing to a page that you have no interest in, which might have SOME relevance to one or more of the words but no earthly reason to rank for the anchor text term. Then see if, after a few weeks, the page ranks in the SERPs.

Do you spot the difference between test one and test two?

Test 3:

Find a page you have no interest in that lingers on the second page of the SERPs for some of the words in your page title (like “link building post tests” at 20, without the quotes), then link to it with nondescript anchor text and see if – after a few weeks – that page moves up or down. It helps if you choose a search phrase which does not invoke QDF, News, Places, Images or any other blended results. This test will need replicating several times before you can be confident, because many other factors can change the position of a page that already ranks.

Test 4:

Can’t tell you about test 4…

Test 5:

Actually Tezt 4 iz here

OK – that image should say “improve” not “discover”… I can’t check a page without Google knowing I found it, at least not without way more paranoia than I can currently lay my hands on. The one in the link was at 10 for the phrase (without quotes) when I looked. Oh… yes… that text right there in the line above?… that’s in an image for a reason.

Test 6:

tezt funfen (excuse my French). This one has the attribute:

Test 7:

Hey guys – can you click this link and mention this post on Google+? Let’s see if we can’t get a few “ripples”. Links are not all about rankings. They are about connections and relationships. If this post is giving you some ideas on how to test theories for yourself, then please pass the post on. Then – in the comments in a few weeks – I can tell you what traffic came to this page from Google+ and also see if anyone’s picture appears in the SERPs under this post. If it does, then we will be able to say that +1s do indeed affect SERPs – at least for friends of the people doing the +1-ing.

Test 8:

Because this post is going to get tweeted (at least a bit), I can’t really do too many tests on Twitter. However, by using a bit.ly link in my tweets, even though Twitter wraps the link in a t.co redirect, you will still be able to see the bit.ly stats for people reaching this post as the direct effect of my Twitter links here.

Test 9:
You can replicate test 1 with a nofollow link, using a similarly awkward and unlikely phrase pointing to a similarly obscure URL, and see if you can make a difference. If you can, then quite possibly Google has stopped taking any notice of nofollow. That would be a surprise, given that they pushed for the attribute in the first place, but personally I feel that nofollow never had the effect it was intended for.
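If you want to be certain which of your links actually carry nofollow before running this test, a minimal Python sketch along these lines will list them (this assumes the requests and BeautifulSoup libraries, and the URL is just a placeholder):

```python
# Minimal sketch: list which links on a page carry rel="nofollow".
# The page_url below is a hypothetical placeholder - swap in your own post.
import requests
from bs4 import BeautifulSoup

page_url = "https://example.com/my-test-post"
html = requests.get(page_url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a", href=True):
    rel = a.get("rel") or []          # BeautifulSoup returns rel as a list of tokens
    flag = "nofollow" if "nofollow" in rel else "followed"
    print(f"{flag:9} {a['href']}")
```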

Test 10:

Of course – this post is full of tests. But the purists amongst you will recognize that the strongest tests are not carried out in such an exposed environment, and that they follow this pattern:

Hypothesis: “I think that the First Tuesday in the month always ends up on the same date”

Then try to disprove the hypothesis. This is a much better way of approaching testing, because it is MUCH easier to DISPROVE something than to PROVE it. Proving that the first Tuesday will always be on the same date is pretty hard, but disproving it is much easier. (See what I did there? Changed the paradigm.)
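As a toy illustration of the approach, here is a quick Python sketch (the year is arbitrary) that tries to disprove the first-Tuesday hypothesis by brute force – one counterexample is enough:

```python
# Try to DISPROVE "the first Tuesday of the month always falls on the same date".
import datetime

def first_tuesday(year, month):
    d = datetime.date(year, month, 1)
    while d.weekday() != 1:            # Monday=0, Tuesday=1
        d += datetime.timedelta(days=1)
    return d.day

dates = {first_tuesday(2011, m) for m in range(1, 13)}
print(dates)  # more than one value in the set => hypothesis disproved
```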

I’ll leave the comment links in – but only if they don’t have any commercial intent whatsoever. Save that comment spam for another post please.

Some stuff about Google’s Crawler

Pierre Far is a Googler. I expect he’d appreciate it if I pointed him out on G+. He spoke a bit at ThinkVisibility about the crawler and some of the issues that face the whole information gathering and retrieval process. His pictures weren’t as pretty as the “How Majestic Works” infographic, but there was some useful substance in there.

For example: did you know that Google only checks robots.txt about once per day, to help keep the load off your server? And that having a +1 button on your site can override robots.txt? These are some of the things that he brought up in his very interesting presentation. I made some notes as I went along. I hope they are legible…

Google sets a conservative crawl rate per server, so too many domains or URLs on one server will reduce the crawl rate per URL.

If you use shared hosting, then this could easily be problematic for you. If you do not know how many other websites are on the same IP address as you, then you may be surprised. You can easily check this by putting your domain or IP address into Majestic’s neighbourhood checker to see how many other websites we saw on the same IP. Dixonjones.com is currently on a server with 10 sites, but there could be hundreds. More importantly, if one site has a massive number of URLs… and it is not yours… then you could be losing crawl opportunities, just because there’s a big site on the same IP address that isn’t connected to you in any way. You can’t really go complaining to Google about this. You bought the cheap hosting, and this is one of the sacrifices you made.
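As a first step before using a neighbourhood checker, a tiny Python sketch (standard library only; the domain is just an example) will confirm which IP address your domain resolves to:

```python
# Minimal sketch: resolve a domain to its IP address before dropping it into
# a reverse-IP / neighbourhood checker. The domain is only an example.
import socket

domain = "dixonjones.com"
ip = socket.gethostbyname(domain)
print(f"{domain} resolves to {ip}")
```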

If a CMS has huge duplication, Google knows at this stage, and this is how it notifies you of duplicates in WMT.

This is interesting, because it is more efficient to realize a site has duplicate URLs at this point than after Google has had to analyze all the data and deduplicate on your behalf.

Google then picks URLs in a chosen order.

I asked Pierre what factors affected which URLs were selected. In truth, I asked whether deep links to URLs were likely to prioritize those URLs for a higher crawl rate than other pages. Of course I believe deep links will change this priority, but I had to ask. I was just given:

The change rate of page content will change this.

Which is not quite what I asked – but nice to know.

Google checks robots.txt about once per day. Not every visit.

This was interesting to me. Majestic checks more often, and you would be surprised at how much simply checking robots.txt annoys some people. Maybe less is more.
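For what it’s worth, here is a rough Python sketch of the “check robots.txt about once a day” idea as any polite crawler might implement it – an assumption about the general technique, not Google’s actual code:

```python
# Sketch of a polite crawler that re-fetches robots.txt at most once per day.
# Names and the TTL are illustrative, not anything Google disclosed.
import time
import urllib.robotparser

ROBOTS_TTL = 24 * 60 * 60   # refresh robots.txt at most once per day
_cache = {}                 # host -> (parser, fetched_at)

def can_fetch(host, url, user_agent="MyBot"):
    parser, fetched_at = _cache.get(host, (None, 0.0))
    if parser is None or time.time() - fetched_at > ROBOTS_TTL:
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(f"https://{host}/robots.txt")
        parser.read()                       # fetch and parse robots.txt
        _cache[host] = (parser, time.time())
    return parser.can_fetch(user_agent, url)

print(can_fetch("example.com", "https://example.com/some-page"))
```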

Google then crawls the URLs and sends feedback to the scheduler.

If a server spikes with 500 errors, Googlebot backs off. Also (as with Majestic), firewalls etc. can block the bot. This can – after a few days – create a state in Google that says the site is dead. The jQuery blog had this issue.
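Purely as an illustration of what “backing off” means for a polite crawler (not Googlebot’s actual algorithm), a Python sketch with exponential backoff on 5xx responses might look like this:

```python
# Sketch: retry with increasing delays when the server answers with 5xx errors.
# This is a generic polite-crawler pattern, not Google's implementation.
import time
import requests

def polite_get(url, max_retries=5):
    delay = 1
    for _ in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code < 500:
            return response            # server is healthy enough to answer
        time.sleep(delay)              # server is struggling: wait before retrying
        delay *= 2                     # double the wait each time (exponential backoff)
    return None                        # give up for now; come back much later

polite_get("https://example.com/")
```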

If robots.txt returns a 503 error, they stop crawling.

OK. Don’t do that then 🙂

The biggest and the smallest ISPs can block Googlebot at the ISP level.

It was good to see that other crawlers face this issue. Because ISPs need to protect their bandwidth, the fact that you want Google to visit your site does not necessarily mean it will be so. Firewalls at the ISP may block bots even before they see your home page. They may (more likely) start throttling the bots. So if your pages are taking a long time to get indexed, this may be a factor.

Strong recommendation – set up email notifications in Webmaster Tools.

Pierre did not understand why we were not all doing this. If Google has crawling errors – or other things that they would like to warn us about – then an email notification trumps waiting for us to log back in to Webmaster Tools. I’ll be setting mine up right after this post.

Getting better and better at seeing .js files.

At least – I think that’s what he said.

Soft error pages create an issue, so Google tries hard to detect them.

If they can’t, the soft error ends up using a crawl slot (at the expense of another URL crawl, maybe). If you don’t know what a soft error is, it is when an error page returns a 200 response instead of a 4xx (usually 404) response. You can “ping” a random non-existent URL on your site to check this, using Receptional’s free HTTP header checker if you want.
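If you would rather script the check than use an online header checker, a minimal Python sketch (assuming the requests library; example.com is a placeholder) looks like this:

```python
# Minimal soft-404 check: request a URL that should not exist and confirm the
# server answers 404 (or another 4xx), not 200. The site is a placeholder.
import uuid
import requests

site = "https://example.com"
bogus_url = f"{site}/{uuid.uuid4().hex}"   # a page that almost certainly does not exist
status = requests.get(bogus_url, allow_redirects=False).status_code
print(bogus_url, "->", status)
if status == 200:
    print("Looks like a soft error page: missing URLs return 200 instead of 404.")
```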

Google then analyses the content. If it is noindex, then that’s it.

There was a question from the audience: “Is Google keeping up with the growth of the web?” Pierre likes to think they are, but admitted it was hard to tell.

Serving the data back to you:

Google receives your incoming query and searches the Index.

Err – yes. Google does not try to scan the whole web in real time. Non-techies don’t realize this, it seems.

Magic produces ordered links.

No questions allowed on the magic!

On displaying a result, Google needs to:

  • Pick a URL
  • Pick a title: usually the title tag, but sometimes changed based on the user query. This is win-win for everyone.
  • Generate a snippet: Google will create one from on-page content, but Pierre strongly recommends using rich snippets (see the sketch after this list).
  • Generate sitelinks: whether these appear depends on the query and the result. If you see a bad sitelink (wrong link), check for a canonicalisation issue.
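As a hedged illustration of the rich-snippet point, here is a small Python sketch that prints schema.org Article markup as JSON-LD – just one way to express structured data, with placeholder values taken from this post:

```python
# Sketch: build schema.org Article markup as JSON-LD. Values are placeholders
# for this post; JSON-LD is one format for structured data, not the only one.
import json

article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Ten link building tests you can try in a single post",
    "author": {"@type": "Person", "name": "Dixon Jones"},
}
print('<script type="application/ld+json">')
print(json.dumps(article, indent=2))
print("</script>")
```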

A +1 button can override robots.txt, on the basis that it is a stronger signal than robots.txt.

Question from the audience: “Why is the display of rich snippets so volatile?” Google has noticed people spamming rich snippets recently, so he said maybe that was a reason for increased testing.

Pierre was completely unable to talk about using +1 as a ranking signal (whether by policy or because it was not his part of the ship).

Q: “How can we prioritize the crawl to get new content spidered?” A: Pierre threw it back: do some simple maths. 1 URL per second is 86,400 per day, and Google is unlikely to hit your site continually for 24 hours, so large amounts of new content can take time to crawl.
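The back-of-the-envelope maths, with a purely hypothetical site size:

```python
# The simple maths behind the answer. Even at a steady 1 URL per second
# (which Google will not sustain against one site all day), a big site takes
# days to crawl. The site size below is purely illustrative.
seconds_per_day = 24 * 60 * 60          # 86,400
urls_per_day = 1 * seconds_per_day      # at 1 URL per second
site_size = 500_000                     # hypothetical number of URLs
print(f"{urls_per_day} URLs/day -> about {site_size / urls_per_day:.1f} days to cover the site")
```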

Q: “What error message should you use if your site comes offline for a while?” A: A 503 – but if only some of your site is offline, be careful not to serve a 503 on robots.txt.
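A minimal sketch of that advice (using Flask, which is my choice of example here, not anything Pierre showed): serve 503 for the pages that are down, but keep robots.txt answering 200:

```python
# Sketch: during maintenance, answer 503 for normal pages but keep robots.txt
# healthy (200) so crawling is not suspended site-wide. Flask is just an example.
from flask import Flask

app = Flask(__name__)

@app.route("/robots.txt")
def robots():
    # robots.txt stays available even while the rest of the site is offline
    return "User-agent: *\nDisallow:\n", 200, {"Content-Type": "text/plain"}

@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def maintenance(path):
    # everything else reports temporary unavailability with a retry hint
    return "Down for maintenance", 503, {"Retry-After": "3600"}

if __name__ == "__main__":
    app.run()
```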

OK – that was about it. Thanks Pierre for the help.

Oh – nearly forgot – Pierre would like to point out that all this is in the Google Webmaster Documentation.