What is Googlebot and how does it work?


Websites around the world are crawled by Googlebot, which analyzes them so that Google can establish a relevant ranking in its search results. In this post, we will look at the different facets of Googlebot, what it expects, and the means available to you to optimize the crawling of your site.


What is Googlebot?

Googlebot is a virtual robot developed by the engineers of the Mountain View giant. This little “Wall-E of the web” scours websites before indexing some of their pages. This computer program finds and reads the content of sites and updates its index according to the new content it finds. The index, which contains the search results, is in a sense Google's brain. This is where all its knowledge resides.

Google uses thousands of small computers to send its crawlers to all corners of the web to find these pages and see what's on them. There are several different robots, each with a well-defined purpose. For example, AdSense and AdsBot are responsible for checking the relevance of paid ads, while Android Mobile Apps checks Android apps. There are also Googlebot Images, Googlebot News, and others. Here is a list of the best known and most important, with their “User-agent” names:

Googlebot “desktop”: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Googlebot “mobile”: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Googlebot Video: Googlebot-Video/1.0

Googlebot Images: Googlebot-Image/1.0

Googlebot News: Googlebot-News
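
To make these names concrete, here is a minimal Python sketch that labels a request according to the crawler its User-agent header claims to be. The function name and the labels are illustrative assumptions, and the mobile check simply looks for the word “Mobile” in the string. Keep in mind that a user-agent can be spoofed, so this only tells you what a request claims, not who really sent it.

```python
# Illustrative mapping from user-agent tokens to crawler names.
# Order matters: the specific tokens must be tested before plain "Googlebot".
CRAWLER_TOKENS = [
    ("Googlebot-Image", "Googlebot Images"),
    ("Googlebot-Video", "Googlebot Video"),
    ("Googlebot-News", "Googlebot News"),
    ("Googlebot", "Googlebot"),
]


def identify_google_crawler(user_agent):
    """Return a label for the Google crawler named in the header, or None."""
    for token, label in CRAWLER_TOKENS:
        if token in user_agent:
            if label == "Googlebot":
                # The mobile crawler announces a phone browser in its string.
                if "Mobile" in user_agent:
                    return "Googlebot mobile"
                return "Googlebot desktop"
            return label
    return None


desktop = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(identify_google_crawler(desktop))
print(identify_google_crawler("Googlebot-Image/1.0"))
```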

How does Googlebot work and what is it looking for?

Googlebot is completely autonomous: no one really “pilots” it once it has started. The robot uses sitemaps and the links it discovered during previous crawls. Whenever the crawler finds new links on a site, it follows them to visit the pages they lead to and adds those pages to its index if they are of interest. Likewise, if Googlebot encounters broken or modified links, it takes them into account and refreshes its index. Googlebot itself determines how often it will crawl the pages.

It allocates a “crawl budget” to each site. It is therefore normal for a site of several hundred thousand pages not to be fully crawled or indexed. To make things easier for Googlebot and to ensure that your site is correctly indexed, you must check that nothing is blocking the crawl or slowing it down (a wrong command in robots.txt, for example).

Robots.txt commands

The robots.txt file is, in a way, Googlebot's roadmap. It is the first thing the robot crawls, so that it can follow the directions it contains. In the robots.txt file, you can restrict Googlebot's access to certain parts of your site. This system is often used in crawl budget optimization strategies. Each website's robots.txt can be accessed by appending /robots.txt to the site's root URL. A typical e-commerce site, for example, prohibits the exploration of cart pages, account pages, and other configuration pages.
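
As a rough illustration, the sketch below feeds a made-up robots.txt that blocks cart and account pages to Python's standard urllib.robotparser module and asks which URLs a crawler identifying itself as Googlebot may fetch. The rules, paths, and the www.example.com domain are assumptions chosen for the example, and the standard-library parser only approximates how Googlebot itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an online shop.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /my-account/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Check a few example paths against the rules.
for path in ("/products/blue-shirt", "/cart/", "/my-account/orders"):
    allowed = parser.can_fetch("Googlebot", "https://www.example.com" + path)
    print(path, "->", "allowed" if allowed else "blocked")
```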

CSS files

CSS stands for Cascading Style Sheets. This file describes how HTML elements should be displayed on the screen. It saves a lot of time because the stylesheets apply throughout the site. It can even control the layout of multiple sites at the same time. Googlebot doesn't just read text, it also downloads CSS files to better understand the overall content of a page.

This allows it to detect possible manipulation attempts by sites that try to deceive robots and rank better (the most famous being cloaking and white text on a white background), to download some images (logos, pictograms), and to read the responsive design rules, which are essential to show that your site adapts to mobile browsing.

Images

Googlebot downloads the images on your site to enrich its “Google Images” tool. Of course, the crawler doesn't “see” the image yet, but it can understand it thanks to the alt attribute and the overall context of the page. You should therefore not neglect your images, because they can become a major source of traffic, even if that traffic is still very complicated to analyze with Google Analytics.


Google's robot is rather discreet, and you don't really see it at first. For beginners, it is even a totally abstract notion. However, it is there, and it leaves some traces in its path. These “traces” are visible in the site logs. One way to understand how Googlebot visits your site is therefore log analysis. The log file lets you observe the precise date and time of the bot's visit, the requested file or page, the server response header, etc.
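
Here is a minimal sketch of such a log analysis in Python. It assumes your server writes the common “combined” access log format used by Apache and Nginx; the sample lines, IP addresses, and paths are invented for the example, and the helper name googlebot_hits is hypothetical. It filters on the user-agent only, so it lists requests that claim to be Googlebot.

```python
import re
from datetime import datetime

# Pattern for one line of the "combined" access log format (an assumption
# about your server configuration; adjust it if your format differs).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Return (datetime, path, status) for requests whose agent names Googlebot."""
    hits = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match and "Googlebot" in match.group("agent"):
            when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            hits.append((when, match.group("path"), int(match.group("status"))))
    return hits


# Invented sample lines, using documentation-only IP addresses.
SAMPLE_LOG = [
    '192.0.2.10 - - [12/Mar/2024:06:25:24 +0000] "GET /blog/post-1 HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [12/Mar/2024:06:26:02 +0000] "GET /blog/post-1 HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '192.0.2.10 - - [12/Mar/2024:06:27:10 +0000] "GET /old-page HTTP/1.1" 404 312 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

for when, path, status in googlebot_hits(SAMPLE_LOG):
    print(when.isoformat(), path, status)
```

Pages that return a 404 or 500 status to the bot are the first candidates for correction.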

Google Search Console


Search Console, formerly known as Webmaster Tools, is one of the most important free tools for checking how easily your site can be crawled. Through its indexing and crawl curves, you can see the ratio of crawled and indexed pages to the total number of pages that make up your site. You will also get a list of crawl errors (404 or 500 errors, for example) that you can correct to help Googlebot crawl your site better.

Paid log analysis tools


To find out how often Googlebot visits your site and what it does there, you can also opt for paid tools that are much more advanced than Search Console. Among the best known: Oncrawl, Botify, Kibana, Screaming Frog… These tools are intended more for sites made up of many pages, which need to be segmented to make analysis easier. Indeed, unlike Search Console, which gives you an overall crawl rate, some of these tools let you refine your analyses by determining a crawl rate for each type of page (category pages, product sheets, etc.). This segmentation is essential to bring out the problematic pages and then plan the necessary corrections.
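
The idea of a crawl rate per page type can be sketched in a few lines of Python. This is not how any of the tools above work internally; it is a simplified illustration that assumes the first component of the URL path identifies the page type, and the paths are invented.

```python
from collections import Counter


def segment_of(path):
    """Hypothetical segmentation rule: the first path component names the page type."""
    parts = path.strip("/").split("/")
    return parts[0] if parts[0] else "home"


def crawl_rate_by_segment(all_paths, crawled_paths):
    """Share of known pages in each segment that the bot requested at least once."""
    total = Counter(segment_of(p) for p in all_paths)
    crawled = Counter(segment_of(p) for p in set(crawled_paths) & set(all_paths))
    return {segment: crawled[segment] / count for segment, count in total.items()}


# Invented example: every page of the site, and the pages seen in the logs.
site_pages = ["/", "/category/shoes", "/category/hats",
              "/product/1", "/product/2", "/product/3", "/product/4"]
pages_in_logs = ["/", "/category/shoes", "/category/hats", "/product/1", "/product/1"]

for segment, rate in crawl_rate_by_segment(site_pages, pages_in_logs).items():
    print(f"{segment}: {rate:.0%}")
```

In this made-up case the product pages stand out as under-crawled, which is the kind of signal segmentation is meant to surface.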

Robots.txt


The IP addresses used by Google's different robots change often, so a fixed list is not a reliable way to recognize them. To find out whether a (real) Googlebot is visiting your site, you can do a reverse IP lookup instead. Spammers can easily spoof a user-agent name, but not the IP address a request comes from. The robots.txt file, for its part, lets you control how Googlebot visits certain parts of your site. Be careful, this method is not ideal for beginners: if you use the wrong commands, you could prevent Googlebot from crawling your entire site, which can result in your site being removed from search results.
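
The reverse lookup can be sketched as follows in Python. The approach assumed here is the usual two-step check: resolve the IP to a host name, make sure that name belongs to googlebot.com or google.com, then resolve the name back and confirm it points to the same IP. The function name and the injectable lookup parameters are choices made for this example so that it can be tried without network access, and the standard-library calls used only handle IPv4.

```python
import socket

# Domains assumed to host Google's crawlers.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm the IP."""
    try:
        hostname = reverse_lookup(ip)[0]
        if not hostname.endswith(GOOGLE_CRAWLER_DOMAINS):
            return False
        # The host name must resolve back to the address that made the request.
        return ip in forward_lookup(hostname)[2]
    except OSError:
        # Lookup failures (no reverse record, DNS error) count as unverified.
        return False


# Offline demonstration with stand-in lookups and documentation-only addresses.
def fake_reverse(ip):
    return ("crawl-192-0-2-10.googlebot.com", [], [ip])


def fake_forward(hostname):
    return (hostname, [], ["192.0.2.10"])


print(is_verified_googlebot("192.0.2.10", fake_reverse, fake_forward))
```

On a live server you would call is_verified_googlebot with only the IP taken from your logs, which uses real DNS lookups.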

How can I optimize my site to please Googlebot?
Helping Googlebot crawl more pages on your site can be a complex process, which boils down to breaking down the technical barriers that prevent the crawler from crawling your site optimally. This is one of the pillars of SEO: on-site optimization.

Update the content of your site regularly


Content is by far the most important criterion for Google, and for other search engines too. Sites that regularly update their content are likely to be crawled more frequently, because Google is constantly on the lookout for new things. If you have a showcase site where it is difficult to add content regularly, you can use a blog attached directly to your site.

This will encourage the bot to come more often while enriching the semantics of your site. A commonly cited rule of thumb is to publish fresh content at least three times a week to noticeably improve your crawl rate.

Improve server response time and page load time

Page load time is a determining factor. Indeed, if Googlebot takes too long to load and crawl a page, it will crawl fewer pages afterwards. You must therefore host your site on a reliable server that offers good performance.

Create a sitemap


Submitting a sitemap is one of the first things you can do to help bots crawl your site more easily and quickly. They may not crawl every page in the sitemap, but they will have the paths laid out for them, which is especially important for pages that tend to be poorly linked within the site.
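
If your CMS does not generate a sitemap for you, a basic one can be produced with a short script. The sketch below uses Python's standard xml.etree.ElementTree module and the namespace defined by the sitemaps.org protocol; the function name, the URLs, and the dates are placeholders invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML (as bytes) from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Placeholder pages; dates use the YYYY-MM-DD form.
sitemap_xml = build_sitemap([
    ("https://www.example.com/", "2024-03-12"),
    ("https://www.example.com/blog/post-1", "2024-03-10"),
])
print(sitemap_xml.decode("utf-8"))
```

The result can be saved as sitemap.xml at the root of the site and then submitted in Search Console.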

Avoid duplicate content


Duplicate content greatly decreases the crawl rate, because Google considers that you are using its resources to crawl the same content several times. In other words, you tire its robots for nothing! Duplicate content should therefore be avoided as much as possible, both for Googlebot and for our dear friend Google Panda.

Block access to unwanted pages via robots.txt

To preserve your crawl budget, you do not need to let search engine robots crawl irrelevant pages, such as information pages, account administration pages, etc. A simple modification to the robots.txt file will allow you to block the crawling of these pages by Googlebot.
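
Because a wrong rule can hide far more than intended, it can be worth testing a robots.txt before publishing it. The following Python sketch builds a simple file from a list of path prefixes and then uses the standard urllib.robotparser module to report any important page that would end up blocked. Both helper names, the paths, and the example.com domain are assumptions made for illustration, and the standard-library parser only approximates Googlebot's own rule matching.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_prefixes):
    """Return robots.txt text that disallows the given path prefixes for all bots."""
    lines = ["User-agent: *"]
    lines += ["Disallow: " + prefix for prefix in blocked_prefixes]
    return "\n".join(lines) + "\n"


def wrongly_blocked(robots_txt, must_stay_crawlable, site="https://www.example.com"):
    """List the important paths that the rules would stop Googlebot from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in must_stay_crawlable
            if not parser.can_fetch("Googlebot", site + path)]


important_pages = ["/", "/blog/", "/products/blue-shirt"]

good_rules = build_robots_txt(["/account/", "/admin/"])
print(wrongly_blocked(good_rules, important_pages))

# A single stray "/" blocks the whole site.
bad_rules = build_robots_txt(["/"])
print(wrongly_blocked(bad_rules, important_pages))
```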

Use Ping services

Pinging is a great way to get bots to visit you by notifying them of new updates. There are many ping services, such as Pingomatic, which can be used with WordPress. You can also manually add other ping services to notify more search engine bots.

Take care of your internal linking
Internal linking is essential to optimize your crawl budget. It not only delivers SEO juice to every page, it also guides bots to the deeper pages.

Concretely, if you maintain a blog, whenever you add an article you should, as far as possible, link to an older page. That older page will then keep receiving links and will continue to look relevant to Googlebot.

Internal linking doesn't directly help increase Google's crawl rate, but it does help bots efficiently crawl the deep pages of your site that are often overlooked.

Optimize your images

As smart as they are, robots are not yet able to visualize an image. They need textual guidance. If your site uses images, be sure to fill in the alt attributes with a clear description that search engines can understand and index. Images are far more likely to appear in search results if they are properly optimized.
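
To spot images that still lack a description, you can scan a page's HTML for img tags without a usable alt attribute. Here is a minimal sketch using Python's standard html.parser module; the class name, function name, and sample markup are invented for the example.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> whose alt attribute is absent or empty."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            if not (attributes.get("alt") or "").strip():
                self.missing.append(attributes.get("src"))


def images_missing_alt(html):
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing


# Invented sample markup.
sample_html = (
    '<img src="logo.png" alt="Company logo">'
    '<img src="hero.jpg">'
    '<img src="dot.gif" alt="">'
)
print(images_missing_alt(sample_html))
```

An empty alt can be a deliberate choice for purely decorative images, so treat the output as a list to review rather than a list of errors.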

Conclusion

Googlebot is therefore a little robot that visits your site regularly, looking for new things. If you've made the right technical choices for your site, it will come by frequently and crawl many pages.

If you provide it with fresh content on a regular basis, it will come back even more often. In fact, whenever you make a change on your site, you can invite Googlebot to come and see that change from Google Search Console. In principle, this allows for faster indexing.
