What to do when a robots.txt file prevents Googlebot from crawling your site?

by Anjali Gupta
Posted: Dec 14, 2017

You have landed on this page because you want to know what to do when a robots.txt file prevents Googlebot from crawling your site.

A robots.txt file that prevents Googlebot from crawling your site is one of the most common problems faced by bloggers and website owners. Ranking in Google search is the lifeline of any blog or website today. If Googlebot cannot fetch the robots.txt file on your server, or the file tells it to stay away, your pages cannot be crawled and indexed properly.

For your reference, below are a few errors you will usually encounter when the robots.txt file malfunctions.

Error-1, http://example.com/: Googlebot can’t access your site

Over the last 24 hours, Googlebot encountered 255 errors while attempting to access your robots.txt. To ensure that we didn’t crawl any pages listed in that file, we postponed our crawl. Your site’s overall robots.txt error rate is 66.2%.

Error-2 (an error from other crawlers such as woorank.com)

This URL cannot be reviewed.

There may be several reasons for this: the URL does not exist, the website does not allow woorank to make reviews, the website denies both our HEAD and GET requests. The DNS send to an infinite loop or the robots.txt page does not allow access to our bot.

Error-3 (the biggest confusion)

The URL http://yoursite.com/robots.txt shows a working robots.txt file on your server. Based on this URL structure, the file should sit in the root folder of the site, and bots and crawlers do not read a robots.txt file placed in a subfolder anyway. Yet when people look in the root folder, they often cannot find any robots.txt file there, even though the URL shows one that is present and working.

We will give you proper solutions for all of these errors, but first a short discussion of robots.txt files is necessary: what these files are and what the pros and cons of using them are, so that a beginner who lands on this page also gets the most out of this information.

a) What is a robots.txt file and what is it used for?

In very simple words, robots.txt is a file that hints to crawlers not to access specific content or URLs on your site or server. If you do not need to keep any content private and do not want to prevent anything from being indexed by Google, then you do not need a robots.txt file at all.
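To make the idea concrete, here is a minimal Python sketch (our own illustration, not a Google tool) that reads a small robots.txt with the standard-library urllib.robotparser module and asks whether a crawler may fetch a URL. The rules in SAMPLE_RULES, the /private/ folder and the example.com URLs are assumptions made up for this example. Python's parser applies the first matching rule, so in edge cases its answer can differ from Googlebot's.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every crawler is asked to stay out of /private/.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/", "http://example.com/private/page.html"):
        print(url, can_crawl(SAMPLE_RULES, "Googlebot", url))
```

The file only asks crawlers to stay away. It does not protect the content from visitors who know the URL.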

b) The robots.txt file is physically present in the root folder of your site.

If a robots.txt file is physically present in the root folder of your site, then trust me, your life has become much easier. I recommend that you first ask your host to fix the file, or you can edit it yourself using the proper format (for sample code, refer to the later part of this article). Plugins such as "All in One SEO" have a built-in option to create and customize the robots.txt file.

c) The robots.txt file is virtually present but not in the root folder.

This is the toughest part of the problem. You need to fix your robots.txt file, but it does not actually exist in the root of your site. It is just a virtual file, and without changing it you cannot get rid of the issue with Googlebot. Even if you have no problem with Googlebot, you may still need to change the robots.txt file to keep some content private.

With this problem in mind, we suggest the solutions below. Please try them one by one as a checklist, and we are confident your problem will be solved.

1. Go to Dashboard >> Privacy >> "My blog is visible to anyone" (in newer WordPress versions this setting is found under Settings >> Reading >> "Search Engine Visibility").

If you chose the option to block search engines from indexing your site, either when installing WordPress or later, then even though there is no robots.txt file on your server, WordPress still serves a virtual robots.txt file, which you can see by opening http://yoursite.com/robots.txt in your browser.
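A quick way to tell whether this setting is the culprit is to check whether the served robots.txt blocks the whole site. The Python sketch below is only an illustration: the function name blocks_whole_site is our own, and the two sample texts are the outputs that the do_robots() function shown later in this article prints for a hidden and a visible blog.

```python
from urllib.robotparser import RobotFileParser

# Output for a blog that blocks search engines, and for a visible blog.
HIDDEN_SITE = "User-agent: *\nDisallow: /\n"
VISIBLE_SITE = "User-agent: *\nDisallow:\n"


def blocks_whole_site(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules stop user_agent from fetching the home page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


if __name__ == "__main__":
    print("hidden blog blocks everything:", blocks_whole_site(HIDDEN_SITE))
    print("visible blog blocks everything:", blocks_whole_site(VISIBLE_SITE))
```

If the text you copy from your own /robots.txt URL behaves like HIDDEN_SITE, switching the privacy setting back to visible is the first thing to try.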

2. Plugins such as Google XML Sitemaps Generator also create a virtual robots.txt file, so try the solutions below.

a) On the plugin settings page, uncheck the box that says "Add sitemap URL to the virtual robots.txt file."

b) You can build on the sitemap entry by adding some code to the plugin's "sitemap-core.php" file.

Find the line in your "sitemap-core.php" file that is identical to the code below:

echo "\nSitemap: ". $smUrl. "\n";

This line adds the sitemap to the robots.txt file. Now add the code that allows all robots for the time being, so that your code looks like this:

echo "\nUser-agent: *";
echo "\nAllow: /\n";
echo "\nSitemap: ". $smUrl. "\n";

You can add as many rules as you want from there, according to your needs.
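Before editing the plugin, it can help to see what the three echo lines above would produce and how a parser reads the result. The Python sketch below rebuilds that text and inspects it with urllib.robotparser. The function names and the sitemap URL are assumptions for illustration, and site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser


def plugin_output(sitemap_url: str) -> str:
    """Rebuild the text printed by the three echo statements."""
    return "\nUser-agent: *" + "\nAllow: /\n" + "\nSitemap: " + sitemap_url + "\n"


def inspect_rules(robots_txt: str, probe_url: str, user_agent: str = "Googlebot"):
    """Return (is probe_url allowed, list of sitemap URLs found)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, probe_url), parser.site_maps() or []


if __name__ == "__main__":
    text = plugin_output("http://example.com/sitemap.xml")  # hypothetical URL
    print(text)
    print(inspect_rules(text, "http://example.com/any/page/"))
```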

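3. Fixing the virtual robots.txt file directly in WordPress.

The virtual robots.txt is generated by the functions.php file in the /wp-includes/ folder. Keep in mind that this is a WordPress core file, so any change you make here is overwritten when WordPress is updated.

Find the block in the "functions.php" file inside the wp-includes folder that is identical to the code below: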

function do_robots() {
    header( 'Content-Type: text/plain; charset=utf-8' );

    do_action( 'do_robotstxt' );

    if ( '0' == get_option( 'blog_public' ) ) {
        echo "User-agent: *\n";
        echo "Disallow: /\n";
    } else {
        echo "User-agent: *\n";
        echo "Disallow:\n";
    }
}

Below is just a sample of a working robots.txt setup. Feel free to change the directories to suit your needs.

function do_robots() {
    header( 'Content-Type: text/plain; charset=utf-8' );

    do_action( 'do_robotstxt' );

    // Note: in this sample both branches print the same rules, so the
    // "block search engines" option no longer hides the whole site.
    if ( '0' == get_option( 'blog_public' ) ) {
        echo "User-agent: *";
        echo "\nDisallow: /wp-admin";
        echo "\nDisallow: /wp-includes";
        echo "\nDisallow: /wp-content";
        echo "\nDisallow: /stylesheets";
        echo "\nDisallow: /_db_backups";
        echo "\nDisallow: /cgi";
        echo "\nDisallow: /store\n";
    } else {
        echo "User-agent: *";
        echo "\nDisallow: /wp-admin";
        echo "\nDisallow: /wp-includes";
        echo "\nDisallow: /wp-content";
        echo "\nDisallow: /stylesheets";
        echo "\nDisallow: /_db_backups";
        echo "\nDisallow: /cgi";
        echo "\nDisallow: /store\n";
    }
}
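Since a mistake in these rules can hide the wrong pages, it is worth checking the rule list before touching the PHP file. The Python sketch below builds the same robots.txt text from a list of folders and tests a few URLs against it. The helper names and the yoursite.com probe URLs are our own assumptions, and Python's parser only approximates how Googlebot matches rules.

```python
from urllib.robotparser import RobotFileParser

# The folders disallowed in the sample above.
BLOCKED_DIRS = [
    "/wp-admin", "/wp-includes", "/wp-content", "/stylesheets",
    "/_db_backups", "/cgi", "/store",
]


def build_robots_txt(blocked_dirs) -> str:
    """Return robots.txt text that disallows each folder for all crawlers."""
    lines = ["User-agent: *"]
    lines.extend(f"Disallow: {path}" for path in blocked_dirs)
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_robots_txt(BLOCKED_DIRS)
    for path in ("/", "/wp-admin/options.php", "/storefront/"):
        print(path, is_allowed(rules, "http://yoursite.com" + path))
```

Rules match by prefix, so "Disallow: /store" also covers a path such as /storefront/. Add a trailing slash to the rule if you only mean the folder.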

4. Disabling the virtual robots.txt so that it does not stop Googlebot from crawling your site.

Create a proper robots.txt file in the root folder of your site and make sure it allows crawlers. On a typical WordPress setup the web server then serves this physical file to Googlebot, and the virtual robots.txt file is ignored automatically.
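If you have shell or script access to your server, the file can be created with a few lines of Python. This is only a sketch under assumptions: the function name is our own, the document root path differs from host to host, and the rules written here allow every crawler to crawl everything. Most people will simply upload the file by FTP or through the hosting control panel instead.

```python
import tempfile
from pathlib import Path

# An empty Disallow value means that nothing is blocked.
PERMISSIVE_RULES = "User-agent: *\nDisallow:\n"


def write_permissive_robots(docroot: Path, overwrite: bool = False) -> Path:
    """Write an allow-everything robots.txt into docroot and return its path."""
    target = docroot / "robots.txt"
    if target.exists() and not overwrite:
        raise FileExistsError(f"{target} already exists; pass overwrite=True to replace it")
    target.write_text(PERMISSIVE_RULES, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Demo in a temporary folder; on a real server pass your site's root folder.
    with tempfile.TemporaryDirectory() as tmp:
        written = write_permissive_robots(Path(tmp))
        print(written.read_text(encoding="utf-8"))
```

The function refuses to replace an existing file by default, so a robots.txt that your host or a plugin already placed there is not lost by accident.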

Testing: log in to the Google Webmaster Tools panel, go to "Fetch as Google" and fetch the URL http://www.yoursite.com/robots.txt. If the URL is fetched successfully, your error is solved and Googlebot is now able to crawl, fetch and index your website easily.

In fact, Googlebot wants http://www.yoursite.com/robots.txt to return status code 200, which means robots.txt is available and Googlebot will crawl your site according to its rules.

OR

Googlebot wants http://www.yoursite.com/robots.txt to return status code 404, which means no robots.txt is available and Googlebot can crawl anything on your site.
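You can also check the status code yourself. The Python sketch below fetches a URL with the standard library and turns the status into a plain-language verdict. The function names and the wording of the verdicts are our own. The 200 and 404 cases follow the explanation above, and the server-error case is our reading of the "postponed our crawl" message in Error-1, so treat it as an assumption.

```python
import urllib.error
import urllib.request


def interpret_status(status: int) -> str:
    """Map an HTTP status for /robots.txt to what it likely means for crawling."""
    if status == 200:
        return "robots.txt found: crawl according to its rules"
    if status == 404:
        return "no robots.txt: crawl everything"
    if 500 <= status <= 599:
        return "server error: crawl is likely to be postponed"
    return "other response: check the server configuration"


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url (redirects are followed)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx answers arrive as exceptions that carry the code.
        return err.code
```

For example, print(interpret_status(fetch_status("http://www.yoursite.com/robots.txt"))) prints the verdict for your own site. A DNS failure or timeout raises urllib.error.URLError instead of returning a status code.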

With all the suggestions above, we are confident that your robots.txt issue will be solved. However, because many SEO plugins also provide robots.txt creation features, we recommend that you contact your hosting provider or someone very experienced in the relevant field if the above solutions do not work for you, so that you get professional assistance without damaging your site or its features.

Source – copyrighted@smileitsolutions

About the Author

Smile IT Solutions (http://www.smileitsolutions.com/) is India's best mobile app development, SEO and web design company, offering quality SEO services at the lowest cost.
