{"id":879,"date":"2022-03-28T08:23:49","date_gmt":"2022-03-28T08:23:49","guid":{"rendered":"https:\/\/easyschema.com\/blog\/?p=879"},"modified":"2022-03-28T08:26:45","modified_gmt":"2022-03-28T08:26:45","slug":"robots-txt-what-is-it-and-how-does-it-works","status":"publish","type":"post","link":"https:\/\/easyschema.com\/blog\/robots-txt-what-is-it-and-how-does-it-works\/","title":{"rendered":"Robots.txt: What is it, and how does it work?"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Robots.txt is a short and simple text file stored in the root directory of a domain that tells search engine crawlers, such as Googlebot, what they are allowed to crawl on your site.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In SEO, robots.txt helps crawlers reach the pages of high importance first and keeps them away from low-value pages. To accomplish that, robots.txt can exclude individual files, one or more subdirectories, or even the entire domain from search engine crawling.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This simple text file not only controls crawling but can also include a link to your Sitemap. 
This gives website crawlers an overall view of the current URLs that exist in your domain.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">You can find the robots.txt file on any domain by simply adding &#8220;\/robots.txt&#8221; to the end of the homepage URL:<\/span><\/p>\n<pre><code class=\"language-\">https:\/\/yourdomain.com\/robots.txt<\/code><\/pre>\n<p><span style=\"font-weight: 400;\">Here is an example of an actual, working robots.txt file:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-881\" src=\"https:\/\/easyschema.com\/blog\/wp-content\/themes\/veen\/assets\/images\/transparent.gif\" data-lazy=\"true\" data-src=\"https:\/\/easyschema.com\/blog\/wp-content\/uploads\/2022\/03\/robots.png\" alt=\"\" width=\"623\" height=\"225\"><\/p>\n<p><span style=\"font-weight: 400;\">Don&#8217;t forget that the robots.txt file is a public file that you can find on nearly every website, such as Facebook, Apple, and even Amazon. Robots.txt is the first document that search engine crawlers open when visiting your website. You should also know that robots.txt offers little protection against unauthorized access.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Why is a robots.txt file important?<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">The fundamental purpose of the robots.txt file is to show website crawlers which areas of your website they can access and how they should interact with your pages. Search engines have to find your web pages before they end up in search results; that&#8217;s why your website&#8217;s content needs to be crawled and indexed first.<br \/>\n<\/span><span style=\"font-weight: 400;\">But in some cases, it&#8217;s better to ban crawlers from visiting specific pages such as empty pages, login pages (for your website), and so on. 
That&#8217;s why we need a robots.txt file: web crawlers always check it right before they start crawling the website as a whole.<br \/>\n<\/span><span style=\"font-weight: 400;\">Another thing to consider is that robots.txt only prevents search engines from crawling, not from indexing.<br \/>\n<\/span><span style=\"font-weight: 400;\">Even though website crawlers might not have access to a particular page, search engines may continue to index it if external links point to it.<br \/>\n<\/span><span style=\"font-weight: 400;\">Besides this primary purpose, the robots.txt file brings several other SEO benefits in different situations.<\/span><\/p>\n<p><strong>First, it can optimize the crawl budget.<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">The crawl budget indicates the total number of pages that website crawlers (e.g., Googlebot) will crawl within a specific amount of time.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Some larger websites contain dozens of unimportant pages that do not need to be indexed or crawled, and robots.txt tells search engines which pages to skip. So, using the robots.txt file optimizes crawling frequency and helps the important pages get indexed efficiently.<\/span><\/p>\n<p><strong>Second, it can manage duplicate content.<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">Robots.txt can prevent the crawling of similar or duplicate content on your web pages. As you now know, a lot of websites contain several forms of the same content. These pages can be www vs. non-www pages, URLs with parameters, identical PDF versions of a page, etc.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">You can avoid crawling these pages by using the robots.txt file to point them out. So, by using robots.txt, you can manage content that doesn&#8217;t need to be crawled, helping the search engine crawl only the pages you want to appear in SERPs.<\/span><\/p>\n<p><strong>Third. 
It can prevent servers from overloading.<\/strong><\/p>\n<p><span style=\"font-weight: 400;\">Using the robots.txt file can prevent your website server from crashing. Website crawlers such as Googlebot determine on their own how fast they crawl your website, and not always in line with your server&#8217;s capacity. However, sometimes specific web crawlers visit your site more often than you would like, and you may want to block them. The urge to block some bots arises when a misbehaving bot overloads your website with requests or when scrapers try to copy all your site&#8217;s content, which can cause a lot of site issues.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">It is essential to use robots.txt because it tells web crawlers which specific pages to focus on. In this way, the other pages of your website will be left alone, which prevents the site from overloading.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">How does a robots.txt file work?<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">The way a robots.txt file works comes down to 2 essential elements. 
Both elements are tightly related to each other: one states which website crawler a set of rules addresses (user-agents), and the other states what that crawler should do (directives).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The user-agent line specifies which web crawler the rules apply to, whereas the directives tell that user agent what to do with your pages.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is what they look like:<\/span><\/p>\n<pre><code class=\"language-\">User-agent: Googlebot\nDisallow: \/wp-admin\/<\/code><\/pre>\n<p><span style=\"font-weight: 400;\">Down below, you will find a closer look at these two elements.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">User-agents<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">User-agent represents the specific crawler that the directives instruct on crawling your site.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">e.g., the user-agent for the Google crawler is &#8220;Googlebot,&#8221; for Yahoo it is &#8220;Slurp,&#8221; for the Bing crawler it is &#8220;Bingbot,&#8221; etc.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If you want to address all types of bots with a particular directive at once, use the wildcard symbol, &#8220;*.&#8221; This symbol represents all website crawlers, which then have to follow the directives&#8217; rules.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is what this symbol looks like in the robots.txt file:<\/span><\/p>\n<pre><code class=\"language-\">User-agent: *\nDisallow: \/wp-admin\/<\/code><\/pre>\n<p><span style=\"font-weight: 400;\">Keep in mind that search engines run several user agents, each crawling for its own purpose.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Directives<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">Robots.txt directives are the instructions that the specified user-agents 
have to follow. By default, web crawlers will crawl every available page, so the robots.txt directives decide which pages (or sections) of your website they shouldn&#8217;t crawl.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Here are three standard rules used by robots.txt directives:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">&#8220;Disallow&#8221; \u2013 tells web crawlers not to access anything specified within this directive. You can assign several disallow instructions to a user agent.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">&#8220;Allow&#8221; \u2013 tells web crawlers that they can access certain pages within an otherwise disallowed website section.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">&#8220;Sitemap&#8221; \u2013 if you have already arranged an XML sitemap, this directive tells crawlers where to find the web pages that you want to be crawled (pointing crawlers to your sitemap).<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">Here is an example of these three directives for WordPress sites:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">User-agent: Googlebot \u2013 the specific crawler<br \/>\n<\/span><span style=\"font-weight: 400;\">Disallow: \/wp-admin\/ &#8211; the directive (tells Googlebot that we do not want it to access the login area of a WordPress site).<br \/>\n<\/span><span style=\"font-weight: 400;\">Allow: \/wp-admin\/random-content.php \u2013 we added an exception \u2013 Googlebot can visit that specific address (but it cannot access anything else under the \/wp-admin\/ folder).<br \/>\n<\/span><span style=\"font-weight: 400;\">Sitemap: <\/span><a href=\"https:\/\/www.example.com\/sitemap.xml\" target=\"_blank\" rel=\"noopener\"><span style=\"font-weight: 400;\">https:\/\/www.example.com\/sitemap.xml<\/span><\/a><span 
style=\"font-weight: 400;\"> &#8211; a list of URLs that you want to be crawled \u2013 we instructed Googlebot where to find your Sitemap.<br \/>\n<\/span><span style=\"font-weight: 400;\">Here are a few other rules that can apply to your robots.txt file if your site contains dozens of pages that need to be managed somehow.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">* (Wildcard)<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">This symbol defines a pattern-matching rule, and it is used mainly for websites that contain filtered product pages, dozens of generated pages, etc.<br \/>\n<\/span><span style=\"font-weight: 400;\">For instance, instead of disallowing each product page under the \/products\/ section one by one, as in the example below,<\/span><\/p>\n<pre><code class=\"language-\">User-agent: *\nDisallow: \/products\/shoes?\nDisallow: \/products\/boots?\nDisallow: \/products\/sneakers?<\/code><\/pre>\n<p><span style=\"font-weight: 400;\">you can use the wildcard directive to disallow them all at once:<\/span><\/p>\n<pre><code class=\"language-\">User-agent: *\nDisallow: \/products\/*?<\/code><\/pre>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">$ (End of URL)<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">This symbol is used to define the end of a URL.<br \/>\n<\/span><span style=\"font-weight: 400;\">It instructs web crawlers whether they should crawl URLs with a specific ending.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Example:<\/span><\/p>\n<pre><code class=\"language-\">User-agent: *\nDisallow: \/*.gif$<\/code><\/pre>\n<p><span style=\"font-weight: 400;\">In this example, the &#8220;$&#8221; sign tells crawlers to ignore all URLs that end with &#8220;.gif.&#8221;<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\"># (Comment)<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">This sign serves the same purpose that comments\/annotations serve for human readers. 
The &#8220;#&#8221; symbol doesn&#8217;t indicate a directive and has no impact on user agents.<\/span><\/p>\n<pre><code class=\"language-\"># We don&#8217;t want any crawler to visit our login page!\nUser-agent: *\nDisallow: \/wp-admin\/<\/code><\/pre>\n<h3><span style=\"font-weight: 400;\">How to compose your own robots.txt file?<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">If you use WordPress for your website, you will have a default robots.txt file there. However, a few plugins like Yoast SEO, Rank Math SEO, or All in One SEO can help you manage your robots.txt file in case you want to make changes in the future.<br \/>\n<\/span><span style=\"font-weight: 400;\">These plugins let you easily control what you want to allow or disallow, so you don&#8217;t have to write any complicated syntax all by yourself.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">Robots.txt file best practices<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">You have to know that robots.txt files can quickly get complex, so it&#8217;s better to keep things as simple as possible.<br \/>\n<\/span><span style=\"font-weight: 400;\">Down below are some tips on how to create and update your robots.txt file:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Use separate files for subdomains \u2013 if your site has multiple subdomains, the best you can do is treat them as different websites. 
You have to create a separate robots.txt file for every subdomain you own.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Ensure specificity \u2013 specify the exact URL paths, and pay attention to any signs (or trailing slashes) that are present or missing in your URLs.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Specify user-agents just once \u2013 merge all directives of a specific user-agent. This helps you establish simplicity and organization in your robots.txt file.<\/span><\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Robots.txt is a simple and short text file stored in the root directory of a domain that instructs search engine crawlers like Googlebot, what they&#8230;<\/p>\n","protected":false},"author":1,"featured_media":885,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14,18,1,20],"tags":[112,111,113,114,115],"class_list":["post-879","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-seo","category-beginner-seo","category-easyschema-blog","category-technical-seo","tag-create-robots-txt","tag-robots-txt","tag-robots-txt-example","tag-robots-txt-test","tag-robots-txt-wordpress"],"_links":{"self":[{"href":"https:\/\/easyschema.com\/blog\/wp-json\/wp\/v2\/posts\/879","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/easyschema.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/easyschema.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/easyschema.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/easyschema.com\/blog\/wp-json\/wp\/v2\/comments?post=879"}],"version-history":[{"count":7,"href":"https:\/\/easyschema.com\/blog\/wp-json\/wp\/v2\/posts\/879\/r
evisions"}],"predecessor-version":[{"id":888,"href":"https:\/\/easyschema.com\/blog\/wp-json\/wp\/v2\/posts\/879\/revisions\/888"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/easyschema.com\/blog\/wp-json\/wp\/v2\/media\/885"}],"wp:attachment":[{"href":"https:\/\/easyschema.com\/blog\/wp-json\/wp\/v2\/media?parent=879"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/easyschema.com\/blog\/wp-json\/wp\/v2\/categories?post=879"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/easyschema.com\/blog\/wp-json\/wp\/v2\/tags?post=879"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}