If your website is being crawled too often by Bing, Yahoo, or Live, this post describes how to reduce their crawl rate to an acceptable level.
Last week, we began receiving 500 and 503 errors from one of our affiliate stores. This had the undesirable side effect of placing our local Apache web server in an error state, taking our site offline for several hours each day.
We realized that our site was down, but did not know why. After reading through our server's log files, we discovered the 5xx errors. After researching these HTTP error codes, we found that we could not fix them directly. Instead, we had to correct the root cause.
Searching through our affiliate's website, we found that it returns these error codes when its server receives too many requests from a single IP address. So we went back to our log files and found that the Bing, Yahoo, and Live crawlers were requesting many of our pages simultaneously.
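If you want to run the same check on your own logs, here is a minimal sketch that tallies requests and 5xx responses per crawler from an Apache access log. It assumes the standard combined log format; the log path and the user-agent substrings are illustrative placeholders, not values from our setup.

```python
from collections import Counter

# Hypothetical log path; adjust for your own server.
LOG_PATH = "/var/log/apache2/access.log"

# User-agent substrings for the crawlers in question (illustrative).
BOTS = ("msnbot", "bingbot", "Yahoo! Slurp", "Googlebot")

hits = Counter()    # total requests per crawler
errors = Counter()  # 5xx responses per crawler

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        # Combined log format quotes the request, referer, and user agent:
        # ... "GET / HTTP/1.1" 503 234 "-" "msnbot/2.0 ..."
        parts = line.split('"')
        if len(parts) < 6:
            continue  # not a combined-format line
        fields = parts[2].split()
        if not fields:
            continue
        status, agent = fields[0], parts[5]
        for bot in BOTS:
            if bot in agent:
                hits[bot] += 1
                if status.startswith("5"):
                    errors[bot] += 1
                break

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests, {errors[bot]} 5xx responses")
```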
To fix the problem, we had to slow these crawlers down. Our first step was to add a crawl delay to our robots.txt file. Initially, we set this to 60 seconds.
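With that delay in place, the relevant robots.txt entry looked something like this (the value is in seconds; the exact syntax is repeated in the note at the end of this post):

```
User-Agent: *
Crawl-delay: 60
```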
Next, we discovered Bing Webmaster Tools. In order to use it, we needed to sign in with a Windows Live ID. We did not have one, so we created a new one. That was very easy, and we were able to sign in to the site within minutes.
Next, we had to add our site. The Bing Webmaster Home page has two sections: the first is for messages, and the second is for sites. We found the "Add Site" link and submitted our site's URL.
Unfortunately, it takes about 3 days before any statistics are displayed. So, we just waited.
Once we saw that Bing was crawling and indexing our pages, we were then able to reduce the crawl rate.
This was done by:
- Signing in to Bing Webmaster Tools
- Clicking our site's URL in the Sites section, which brought us to the Dashboard page
- Clicking the "Crawl" link at the top of the Dashboard
- Clicking the "Crawl Settings" link in the sub-menu that appeared, which brought us to a graphical "Crawl Rate" page
- Lowering our crawl rate to Minimum (by highlighting the boxes for each hour of the day)
- Pressing the "Save" link
Within 2 days, the Bing, Yahoo, and Live crawlers were behaving properly, and all of our HTTP 5xx errors disappeared.
During this process, we learned five important things about crawlers:
- The Googlebot crawl rate is well-behaved and does not overwhelm your server
- The Google crawler ignores the "Crawl-delay" directive in robots.txt
- Bing only allows a maximum crawl-delay of 4 seconds
- Once your site becomes large enough, the crawling bots can harm your site
- Crawlers are tamable.
Note: To set a crawl delay in your robots.txt file, add these two lines at the top of the file:

```
User-Agent: *
Crawl-delay: 4
```
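As a quick sanity check, Python's standard-library robots.txt parser can read the delay back after you publish the file. This is a small sketch; the domain is a placeholder:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical URL; substitute your own domain.
parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()  # fetch and parse the live robots.txt

# crawl_delay() returns the Crawl-delay that applies to the given
# user agent, or None if no delay is set for it.
print(parser.crawl_delay("msnbot"))  # -> 4 with the file above
print(parser.can_fetch("msnbot", "https://www.example.com/"))
```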
Even if you are not experiencing problems with your website, we suggest that you submit your site to Bing Webmaster Tools. Although the interface is slow, it provides a wide variety of information about your website and is a great complement to Google Webmaster Tools.