Hi again! We are here today for the third part of our blogging lessons, and today's lesson is how to add a robots.txt file to our Blogger blog.
Before that, we need some background on robots.txt, so let's start with a definition.
What is a robots.txt file?
Robots.txt is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).
In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.
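As a concrete sketch of what this looks like, here is a small robots.txt file that blocks one crawler entirely while leaving the site open to everyone else. The file always lives at the root of the domain (for example, example.com/robots.txt), and the bot name "BadBot" here is just a made-up placeholder:

```
# Block one specific crawler from the whole site
User-agent: BadBot
Disallow: /

# All other crawlers may access everything
User-agent: *
Disallow:
```

Lines starting with # are comments and are ignored by crawlers.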
How does robots.txt work?
Search engines have two main jobs:
Crawling the web to discover content;
Indexing that content so that it can be served up to searchers who are looking for information.
To crawl sites, search engines follow links to get from one site to another — ultimately, crawling across many billions of links and websites. This crawling behavior is sometimes known as “spidering.”
After arriving at a website but before spidering it, the search crawler will look for a robots.txt file. If it finds one, the crawler will read that file before continuing through the site. Because the robots.txt file contains information about how the search engine should crawl, the information found there will instruct further crawler action on this particular site. If the robots.txt file does not contain any directives that disallow a user-agent’s activity (or if the site doesn’t have a robots.txt file), the crawler will proceed to crawl the rest of the site.
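You can see this decision process in action with Python's standard-library robots.txt parser. The sketch below feeds it a made-up robots.txt (not any real site's file) and asks whether a generic bot may fetch two URLs:

```python
# Sketch: checking crawl permission the way a polite crawler would,
# using Python's built-in robots.txt parser.
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that blocks the /admin/ folder for all bots
rules = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The homepage is allowed...
print(parser.can_fetch("*", "https://example.com/"))        # True
# ...but anything under /admin/ is not
print(parser.can_fetch("*", "https://example.com/admin/"))  # False
```

Real crawlers do essentially the same check before requesting each page.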
Examples of robots.txt files:
1) Disallow All
The first template will stop all bots from crawling your site. This is useful for many reasons. For example:
The site is not ready yet
You do not want the site to appear in Google Search results
It is a staging website used to test changes before they go to production
Whatever the reason, this is how you would stop all web crawlers from reading your pages:
User-agent: *
Disallow: /
Here we have introduced two “rules”:
User-agent - Targets a specific bot by name, or use * as a wildcard, which means all bots
Disallow - Tells a bot that it cannot go to this area of the site. By setting this to / the bot will not crawl any of your pages
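For example, to apply rules to one named bot instead of every bot (Googlebot is used here purely as an illustration), you would name it in the User-agent line:

```
User-agent: Googlebot
Disallow: /
```

Bots that don't match this User-agent line are unaffected and may still crawl the whole site.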
What if we want the bot to crawl the whole site?
2) Allow All
If you do not have a robots.txt file on your site, then by default a bot will crawl the entire website. One option, then, is simply not to create a robots.txt file (or to remove the one you have).
Yet sometimes this is not possible and you have to add something. In that case, you would add the following:
User-agent: *
Disallow:
At first, this seems strange as we still have the Disallow rule in place. Yet, it is different as it does not contain the /. When a bot reads this rule it will see that no URLs have the Disallow rule.
In other words, the whole site is open.
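The robots exclusion protocol also defines an Allow directive, supported by major crawlers such as Googlebot and Bingbot. An equivalent, more explicit way to open the whole site is:

```
User-agent: *
Allow: /
```

Both forms mean the same thing; the empty Disallow is just the older, more widely recognized spelling.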
3) Block a Folder
There are times when you need to block one area of a site while allowing access to the rest. A good example of this is the admin area of a site.
The admin area may allow admins to log in and change the content of the pages. We don't want bots looking in this folder, so we can disallow it like this:
User-agent: *
Disallow: /admin/
Now the bot will ignore this area of the site.
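You can repeat the Disallow rule to block several folders at once; the folder names below are just examples:

```
User-agent: *
Disallow: /admin/
Disallow: /private/
```

Each Disallow line adds one more path prefix that bots should not crawl.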
4) Block a File
The same is true for files. There may be a specific file that you don't want to end up in Google Search results. Again, this could be an admin page or similar.
To block bots from reading this file, you would use this robots.txt:
User-agent: *
Disallow: /admin.html
This allows the bot to crawl the entire website except the /admin.html file.
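As with folders, you can list several files, one Disallow line per file (the file names here are placeholders):

```
User-agent: *
Disallow: /admin.html
Disallow: /login.html
```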
How to add robots.txt to your Blogger blog:
1- First, sign in to your blog, then click on Settings.
Note: replace the URL with your own blog's URL.
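As a concrete example, a commonly used custom robots.txt template for a Blogger blog looks like the following. The Sitemap line is the part containing the URL you must change: swap the placeholder address for your own blog's address. The /search path is blocked because Blogger's label and search pages would otherwise create duplicate content:

```
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: https://yourblog.blogspot.com/sitemap.xml
```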