Robots.txt: the Disallow and Allow directives, Yandex robots

Robots.txt is a text file located in the root of the site - http://site.ru/robots.txt. Its main purpose is to give search engines certain directives - what to do on the site and when.

The simplest Robots.txt

The simplest robots.txt , which allows all search engines to index everything, looks like this:

User-agent: *
Disallow:

If the Disallow directive is left empty (no value after the colon), all pages are allowed to be indexed.

This directive completely prohibits the site from being indexed:

User-agent: *
Disallow: /

User-agent indicates who the directives are intended for: an asterisk means all search engines, while for Yandex you write User-agent: Yandex.

The Yandex help says that its crawlers process User-agent: *, but if a User-agent: Yandex section is present, User-agent: * is ignored.

Disallow and Allow directives

There are two main directives:

Disallow - prohibit

Allow - allow

Example: on a blog we forbade indexing of the /wp-content/ folder, which contains the plugin files, the template, and so on. But it also holds images that must be indexed by search engines in order to appear in image search. To achieve this, use the following scheme:

User-agent: *
Allow: /wp-content/uploads/ # Allow images in the uploads folder to be indexed
Disallow: /wp-content/

The order in which directives are written used to matter for Yandex when they applied to the same pages or folders. Suppose you specify it like this:

User-agent: *
Disallow: /wp-content/
Allow: /wp-content/uploads/

Under the old processing logic, the Yandex robot would not pick up images from the /uploads/ directory, because the first matching directive, the ban on the whole wp-content folder, won. Today, however, both Yandex and Google sort Allow and Disallow rules from the shortest to the longest and apply the most specific match (with Allow winning when the lengths are equal), so the order in which the directives are written no longer matters.

Google has always taken this calmly and follows all the directives of the robots.txt file regardless of their position.

Also, do not forget that directives with and without a trailing slash play different roles:

Disallow: /about denies access to the entire site.ru/about/ directory, and pages whose path merely starts with about - site.ru/about.html, site.ru/aboutlive.html, etc. - will not be indexed either.

Disallow: /about/ prohibits robots from indexing pages inside the site.ru/about/ directory, while pages like site.ru/about.html remain available for indexing.

Regular expressions in robots.txt

Two special characters are supported:

* - matches any (including empty) sequence of characters.

Example:

Disallow: /about* will deny access to all pages whose path starts with about; strictly speaking, the directive works the same way without the asterisk. But in some cases this expression is irreplaceable. For example, suppose one category has pages both with and without .html at the end; to close from indexing only the pages that contain .html, write the following directive:

Disallow: /about/*.html

Now the site.ru/about/live.html page is closed from indexing, and the site.ru/about/live page is open.

Another similar example:

User-agent: Yandex
Allow: /about/*.html # allow indexing
Disallow: /about/

All pages inside /about/ will be closed except those whose address ends in .html.

$ - cuts off the rest and marks the end of the line.

Example:

Disallow: /about - this robots.txt directive prohibits indexing of all pages that start with about, including pages in the /about/ directory.

By adding a dollar sign at the end - Disallow: /about$ - we tell the robots that only the /about page must not be indexed, while the /about/ directory, /aboutlive pages, and so on may be indexed.

Sitemap Directive

This directive specifies the path to the Sitemap, as follows:

Sitemap: http://site.ru/sitemap.xml

Host Directive

Specified in this form:

Host: site.ru

Without http://, slashes, and the like. If the main mirror of your site uses www, then write:

Host: www.site.ru

Robots.txt example for Bitrix

User-agent: *
Disallow: /*index.php$
Disallow: /bitrix/
Disallow: /auth/
Disallow: /personal/
Disallow: /upload/
Disallow: /search/
Disallow: /*/search/
Disallow: /*/slide_show/
Disallow: /*/gallery/*order=*
Disallow: /*?*
Disallow: /*&print=
Disallow: /*register=
Disallow: /*forgot_password=
Disallow: /*change_password=
Disallow: /*login=
Disallow: /*logout=
Disallow: /*auth=
Disallow: /*action=*
Disallow: /*bitrix_*=
Disallow: /*backurl=*
Disallow: /*BACKURL=*
Disallow: /*back_url=*
Disallow: /*BACK_URL=*
Disallow: /*back_url_admin=*
Disallow: /*print_course=Y
Disallow: /*COURSE_ID=
Disallow: /*PAGEN_*
Disallow: /*PAGE_*
Disallow: /*SHOWALL
Disallow: /*show_all=
Host: sitename.com
Sitemap: https://www.sitename.ru/sitemap.xml

WordPress robots.txt example

After all the necessary directives described above have been added, you should end up with a robots.txt file along these lines.
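A minimal sketch, assuming the directives discussed above and the placeholder domain site.ru:

User-agent: *
Disallow: /wp-content/
Allow: /wp-content/uploads/

User-agent: Yandex
Disallow: /wp-content/
Allow: /wp-content/uploads/
Host: site.ru

Sitemap: http://site.ru/sitemap.xml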

This is, so to speak, a basic version of robots.txt for WordPress. There are two User-agent blocks: one for all search engines and a second for Yandex, where the Host directive is specified.

robots meta tags

It is possible to close a page or the whole site from indexing not only with the robots.txt file, but also with a meta tag.

<meta name="robots" content="noindex,nofollow">

Place it inside the <head> tag, and this meta tag will prohibit indexing of the page. In WordPress there are plugins that let you set such meta tags, for example Platinum SEO Pack. With it, you can close any page from indexing via meta tags.

Crawl-delay directive

With this directive, you can set the time for which the search bot should be interrupted between downloading site pages.

User-agent: *
Crawl-delay: 5

The timeout between two page loads will be 5 seconds. To reduce the load on the server, it is usually set to 15-20 seconds. This directive is needed for large, frequently updated sites where search bots practically "live".

For regular sites/blogs this directive is not needed, but it can be used to rein in other, less relevant search robots (Rambler, Yahoo, Bing, etc.). After all, they also visit and index the site, creating load on the server.
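For example, a sketch that slows down only Yahoo's bot (Slurp); the 20-second value is purely illustrative:

User-agent: Slurp
Crawl-delay: 20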

Most robots are well designed and pose no problems for site owners. But if a bot is written by an amateur, or "something goes wrong", it can create a significant load on the site it crawls. By the way, spiders do not penetrate the server like viruses - they simply request the pages they need remotely (essentially they are analogues of browsers, but without the page-viewing function).

Robots.txt - user-agent directive and search engine bots

Robots.txt has a very simple syntax, which is described in great detail, for example, in the Yandex help and the Google help. It usually specifies which search bot the subsequent directives are intended for (the "User-agent"), the allowing ("Allow") and forbidding ("Disallow") rules, and "Sitemap" is also actively used to tell search engines exactly where the sitemap file is located.

The standard was created quite a long time ago and something was added later. There are directives and design rules that will be understood only by robots of certain search engines. In RuNet, only Yandex and Google are of interest, which means that it is with their help in compiling robots.txt that you should familiarize yourself in particular detail (I provided the links in the previous paragraph).

For example, earlier it was useful to tell the Yandex search engine which web project was the main one via the special "Host" directive, which only this search engine understands (well, Mail.ru too, because their search is powered by Yandex). True, in early 2018 Yandex cancelled Host, and its function, as with other search engines, is now performed by a 301 redirect.

Even if your resource does not have mirrors, it is useful to indicate which of the spellings (with or without www) is the main one.

Now let's talk a little about the syntax of this file. Directives in robots.txt look like this:

<field>:<space><value><space>

The correct code should contain at least one "Disallow" directive after each "User-agent" entry. An empty file assumes permission to index the entire site.

user-agent

"User-agent" directive must contain the name of the search bot. With it, you can set up rules of conduct for each specific search engine (for example, create a ban on indexing a separate folder only for Yandex). An example of writing a "User-agent", addressed to all bots that come to your resource, looks like this:

User-agent: *

If you want to set certain conditions in the "User-agent" for only one bot, for example, Yandex, then you need to write this:

User-agent: Yandex

The name of the search engine robots and their role in the robots.txt file

Each search engine's bot has its own name (for example, for Rambler it is StackRambler). Here I will list the best known of them:

Google - http://www.google.com - Googlebot
Yandex - http://www.ya.ru - Yandex
Bing - http://www.bing.com/ - bingbot

For major search engines, sometimes except for the main bots, there are also separate instances for indexing blogs, news, images, and more. You can get a lot of information on the types of bots (for Yandex) and (for Google).

How to handle this? If you need to write an indexing ban that all types of Googlebot must follow, then use the name Googlebot, and all the other spiders of this search engine will also obey. However, you can prohibit only, for example, the indexing of images by specifying Googlebot-Image as the User-agent. This may not be very clear right now, but with examples it will be easier.

Examples of using the Disallow and Allow directives in robots.txt

Let me give a few simple examples of using the directives and explain their effect.

  1. The code below allows all bots (indicated by the asterisk in User-agent) to index all content without exception, thanks to the empty Disallow directive:

     User-agent: *
     Disallow:

  2. The following code, on the contrary, completely prohibits all search engines from adding pages of this resource to the index, by setting Disallow with "/" in the value field:

     User-agent: *
     Disallow: /

  3. In this case, all bots are prohibited from viewing the contents of the /image/ directory (http://mysite.ru/image/ is the absolute path to this directory):

     User-agent: *
     Disallow: /image/

  4. To block a single file, it is enough to write its absolute path (read on):

     User-agent: *
     Disallow: /katalog1/katalog2/private_file.html

    Looking ahead a little, I’ll say that it’s easier to use the asterisk character (*) so as not to write the full path:

    Disallow: /*private_file.html

  5. In the example below, the "image" directory will be prohibited, as well as all files and directories beginning with the characters "image", i.e. files "image.htm", "images.htm" and directories "image", "images1", "image34", etc.:

     User-agent: *
     Disallow: /image

     The fact is that, by default, an asterisk is implied at the end of the entry, standing for any characters, including their absence. Read about it below.
  6. Using the Allow directive, we allow access. It is a good complement to Disallow. For example, with this condition we forbid the Yandex search robot from downloading (indexing) everything except web pages whose address starts with /cgi-bin:

     User-agent: Yandex
     Allow: /cgi-bin
     Disallow: /

     Well, or here is an obvious example of an Allow and Disallow combination:

     User-agent: *
     Disallow: /catalog
     Allow: /catalog/auto

  7. When describing paths for Allow-Disallow directives, you can use the symbols "*" and "$", thus setting certain logical expressions.
    1. The symbol "*" (asterisk) means any (including empty) sequence of characters. The following example prevents all search engines from indexing files with the ".php" extension:

       User-agent: *
       Disallow: *.php$

    2. Why is the $ (dollar) sign needed at the end? The fact is that, by the logic of robots.txt, a default asterisk is added at the end of each directive (it is not written, but it is implied). For example, we write:

       Disallow: /images

       assuming it is the same as:

       Disallow: /images*

       I.e. this rule forbids indexing of all files (web pages, pictures and other file types) whose address starts with /images, whatever follows it (see the example above). The $ symbol simply cancels that default (unwritten) asterisk at the end. For example:

       Disallow: /images$

       only disables indexing of the /images file, not /images.html or /images/primer.html. And in the first example we prohibited indexing only of files ending in .php (having that extension), so as not to catch anything extra:

       Disallow: *.php$

  • Many engines show users human-readable URLs, while system-generated URLs contain a question mark "?" in the address. You can take advantage of this and write the following rule in robots.txt:

     User-agent: *
     Disallow: /*?

     An asterisk after the question mark suggests itself, but, as we found out a little higher, it is already implied at the end. Thus we prohibit the indexing of search pages and other service pages created by the engine that the search robot can reach. This will not be superfluous, because the question mark is most often used by CMSs as a session identifier, which can lead to duplicate pages getting into the index.

  • Sitemap and Host directives (for Yandex) in Robots.txt

    In order to avoid unpleasant problems with site mirrors, it was previously recommended to add the Host directive to robots.txt, which pointed the Yandex bot to the main mirror.

    Host directive - specifies the main site mirror for Yandex

    For example, earlier, if you had not yet switched to a secure protocol, it was necessary to indicate in Host not the full URL but only the domain name (without http://, e.g. site.ru). If you have already switched to https, then you need to specify the full URL (like https://myhost.ru).

    A wonderful tool for combating duplicate content - the search engine simply will not index a page if a different URL is written in its Canonical. For example, for a pagination page of my blog, Canonical points to https://site, so there should be no problems with duplicated titles.

    But I digress...

    If your project is based on any engine, duplicate content will appear with high probability, which means you need to fight it, including with a ban in robots.txt and, especially, with the meta tag, because in the first case Google may ignore the ban, but it cannot ignore the meta tag (it was brought up that way).

    For example, in WordPress pages with very similar content can get into the search engines' index if indexing is allowed for category contents, tag archive contents and temporary (date) archives all at once. But if you use the Robots meta tag described above to create a ban for the tag archive and the temporary archive (you can keep the tags but prohibit indexing of the category contents), then duplication of content will not occur. How to do this is described at the link given just above (to the All in One SEO Pack plugin).

    Summing up, I’ll say that the Robots file is designed to set global rules for denying access to entire site directories, or to files and folders whose names contain specified characters (by mask). You can see examples of setting such prohibitions a little higher.

    Now let's consider concrete examples of robots.txt designed for different engines - Joomla, WordPress and SMF. Naturally, all three options, created for different CMSs, will differ significantly (if not radically) from each other. True, they will all have one thing in common, and that point is connected with the Yandex search engine.

    Because Yandex carries quite a lot of weight in RuNet, you have to take into account all the nuances of its work, and here the Host directive will help us. It explicitly indicates to this search engine the main mirror of your site.

    For it, it is advised to use a separate User-agent block intended only for Yandex (User-agent: Yandex). This is because other search engines may not understand Host, and its inclusion in the User-agent record intended for all search engines (User-agent: *) could lead to negative consequences and incorrect indexing.

    It's hard to say how things really are, because search algorithms are a thing in themselves, so it's better to do as advised. But in that case you have to duplicate in the User-agent: Yandex section all the rules set for User-agent: *. If you leave User-agent: Yandex with an empty Disallow:, you will thereby allow Yandex to go anywhere and drag everything into the index.

    Robots for WordPress

    I will not give the example file that the developers recommend - you can look it up yourself. Many bloggers do not limit the Yandex and Google bots at all in their walks through the innards of the WordPress engine. Most often on blogs you will find a robots.txt automatically filled in by a plugin.

    But, in my opinion, one should still help the search engines with the difficult task of separating the wheat from the chaff. Firstly, it will take the Yandex and Google bots a lot of time to index that garbage, and they may never get around to adding pages with your new articles to the index. Secondly, bots crawling the engine's junk files create extra load on your host's server, which is not good.

    You can see my version of this file for yourself. It is old, has not changed for a long time, but I try to follow the principle “don’t fix what didn’t break”, and it’s up to you to decide: use it, make your own or peep from someone else. I still had a ban on indexing pages with pagination there until recently (Disallow: */page/), but recently I removed it, relying on Canonical, which I wrote about above.

    But in general, a single correct file for WordPress probably does not exist. You can, of course, build any set of premises into it, but who says they would be correct? There are many versions of the "ideal" robots.txt on the web.

    I will give two extremes:

     1. You can find a mega-file with detailed explanations (the # symbol marks comments, which are better removed in a real file):

        User-agent: *      # general rules for robots, except Yandex and Google,
                           # because the rules for them are below
        Disallow: /cgi-bin # hosting folder
        Disallow: /?       # all query options on the main page
        Disallow: /wp-     # all WP files: /wp-json/, /wp-includes, /wp-content/plugins
        Disallow: /wp/     # if there is a /wp/ subdirectory where the CMS is installed (if not, the rule can be removed)
        Disallow: *?s=     # search
        Disallow: *&s=     # search
        Disallow: /search/ # search
        Disallow: /author/ # author's archive
        Disallow: /users/  # authors' archive
        Disallow: */trackback # trackbacks, notifications in comments when an open article link appears
        Disallow: */feed   # all feeds
        Disallow: */rss    # rss feed
        Disallow: */embed  # all embeds
        Disallow: */wlwmanifest.xml # Windows Live Writer manifest xml file (if not used, the rule can be removed)
        Disallow: /xmlrpc.php # WordPress API file
        Disallow: *utm=    # links with utm tags
        Disallow: *openstat= # links with openstat tags
        Allow: */uploads   # open the folder with uploaded files

        User-agent: GoogleBot # rules for Google (comments not duplicated)
        Disallow: /cgi-bin
        Disallow: /?
        Disallow: /wp-
        Disallow: /wp/
        Disallow: *?s=
        Disallow: *&s=
        Disallow: /search/
        Disallow: /author/
        Disallow: /users/
        Disallow: */trackback
        Disallow: */feed
        Disallow: */rss
        Disallow: */embed
        Disallow: */wlwmanifest.xml
        Disallow: /xmlrpc.php
        Disallow: *utm=
        Disallow: *openstat=
        Allow: */uploads
        Allow: /*/*.js     # open js scripts inside /wp- (/*/ - for priority)
        Allow: /*/*.css    # open css files inside /wp- (/*/ - for priority)
        Allow: /wp-*.png   # images in plugins, cache folder, etc.
        Allow: /wp-*.jpg   # images in plugins, cache folder, etc.
        Allow: /wp-*.jpeg  # images in plugins, cache folder, etc.
        Allow: /wp-*.gif   # images in plugins, cache folder, etc.
        Allow: /wp-admin/admin-ajax.php # used by plugins so that JS and CSS are not blocked

        User-agent: Yandex # rules for Yandex (comments not duplicated)
        Disallow: /cgi-bin
        Disallow: /?
        Disallow: /wp-
        Disallow: /wp/
        Disallow: *?s=
        Disallow: *&s=
        Disallow: /search/
        Disallow: /author/
        Disallow: /users/
        Disallow: */trackback
        Disallow: */feed
        Disallow: */rss
        Disallow: */embed
        Disallow: */wlwmanifest.xml
        Disallow: /xmlrpc.php
        Allow: */uploads
        Allow: /*/*.js
        Allow: /*/*.css
        Allow: /wp-*.png
        Allow: /wp-*.jpg
        Allow: /wp-*.jpeg
        Allow: /wp-*.gif
        Allow: /wp-admin/admin-ajax.php
        Clean-Param: utm_source&utm_medium&utm_campaign # Yandex recommends not closing such pages from indexing but removing the tag parameters; Google does not support such rules
        Clean-Param: openstat # similar

        # Specify one or more Sitemap files (no need to duplicate them for each User-agent).
        # Google XML Sitemap creates 2 sitemaps, as in the example below.
        Sitemap: http://site.ru/sitemap.xml
        Sitemap: http://site.ru/sitemap.xml.gz

        # Specify the main mirror of the site, as in the example below (with WWW / without WWW; if HTTPS,
        # write the protocol; if you need to specify a port, specify it). The Host command is understood by
        # Yandex and Mail.RU; Google does not take it into account.
        Host: www.site.ru
     2. Here is an example of minimalism:

        User-agent: *
        Disallow: /wp-admin/
        Allow: /wp-admin/admin-ajax.php
        Host: https://site.ru
        Sitemap: https://site.ru/sitemap.xml

     The truth probably lies somewhere in the middle. Also don't forget to write the robots meta tag for "extra" pages, for example using a wonderful plugin. It will also help set up Canonical.

    Correct robots.txt for Joomla

     User-agent: *
     Disallow: /administrator/
     Disallow: /bin/
     Disallow: /cache/
     Disallow: /cli/
     Disallow: /components/
     Disallow: /includes/
     Disallow: /installation/
     Disallow: /language/
     Disallow: /layouts/
     Disallow: /libraries/
     Disallow: /logs/
     Disallow: /modules/
     Disallow: /plugins/
     Disallow: /tmp/

     In principle, almost everything is taken into account here and it works well. The only thing is that you should add a separate User-agent: Yandex rule to insert the Host directive that defines the main mirror for Yandex, and also specify the path to the Sitemap file.

    Therefore, in the final form, the correct robots for Joomla, in my opinion, should look like this:

     User-agent: Yandex
     Disallow: /administrator/
     Disallow: /cache/
     Disallow: /includes/
     Disallow: /installation/
     Disallow: /language/
     Disallow: /libraries/
     Disallow: /modules/
     Disallow: /plugins/
     Disallow: /tmp/
     Disallow: /layouts/
     Disallow: /cli/
     Disallow: /bin/
     Disallow: /logs/
     Disallow: /components/
     Disallow: /component/
     Disallow: /component/tags*
     Disallow: /*mailto/
     Disallow: /*.pdf
     Disallow: /*%
     Disallow: /index.php
     Host: vash_sait.ru (or www.vash_sait.ru)

     User-agent: *
     Allow: /*.css?*$
     Allow: /*.js?*$
     Allow: /*.jpg?*$
     Allow: /*.png?*$
     Disallow: /administrator/
     Disallow: /cache/
     Disallow: /includes/
     Disallow: /installation/
     Disallow: /language/
     Disallow: /libraries/
     Disallow: /modules/
     Disallow: /plugins/
     Disallow: /tmp/
     Disallow: /layouts/
     Disallow: /cli/
     Disallow: /bin/
     Disallow: /logs/
     Disallow: /components/
     Disallow: /component/
     Disallow: /*mailto/
     Disallow: /*.pdf
     Disallow: /*%
     Disallow: /index.php
     Sitemap: http://path to your sitemap in XML format

     Yes, also note that in the second variant there are Allow directives permitting the indexing of styles, scripts and images. This was written specifically for Google, because its Googlebot sometimes complains that indexing of these files is prohibited in robots.txt, for example in the folder with the active theme. It even threatens to lower rankings because of this.

    Therefore, we allow this whole thing to be indexed in advance using Allow. By the way, the same thing happened in the sample file for WordPress.

     Good luck to you! See you soon on the pages of the blog.


     The robots.txt file is required for most sites.

     Every SEO specialist should understand the purpose of this file and be able to write the most commonly needed directives.

     A properly composed robots.txt improves the site's position in search results and, among other promotion methods, is an effective SEO tool.

     To understand what robots.txt is and how it works, let's remember how search engines work.

     To check whether your site has one, enter the root domain in the address bar, then add /robots.txt to the end of the URL.

     For example, the Moz robots.txt file is located at moz.com/robots.txt. Open that address and you will see the file's contents.

    Instructions for the "robot"

    How to create a robots.txt file?

    3 types of instructions for robots.txt.

    If you find that the robots.txt file is missing, creating one is easy.

    As already mentioned at the beginning of the article, this is a regular text file in the root directory of the site.

     You can create it through the admin panel or via the file manager that the developer uses to work with the site's files.

    We will figure out how and what to prescribe there in the course of the article.

    Search engines receive three types of instructions from this file:

     • scan everything, i.e. full access (Allow);
     • scan nothing, a complete ban (Disallow);
     • certain listed elements must not be scanned - partial access.

    In practice, it looks like this:

    Please note that the page can still get into the SERP if it has a link installed on this site or outside it.

    To better understand this, let's study the syntax of this file.

    Robots.Txt Syntax

    Robots.txt: what does it look like?

    Important points: what you should always remember about robots.

    Seven common terms that are often found on websites.

     In its simplest form, robots.txt looks like this:

     User-agent: [name of the bot for which we write directives]
     Disallow:
     Sitemap: [indicate where our sitemap is located]

     # Rule 1
     User-agent: Googlebot
     Disallow: /prim1/
     Sitemap: http://www.nashsite.com/sitemap.xml

    Together, these three lines are considered the simplest robots.txt.

    Here we prevented the bot from indexing the URL: http://www.nashsite.com/prim1/ and indicated where the sitemap is located.

    Please note: in the robots file, the set of directives for one user agent (search engine) is separated from the set of directives for another by a line break.

    In a file with multiple search engine directives, each prohibition or permission applies only to the search engine specified in that particular block of lines.

    This is an important point and should not be forgotten.

    If the file contains rules that apply to multiple user agents, the system will give priority to directives that are specific to the specified search engine.

    Here is an example:
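     Since rules are grouped by bot, a sketch of such a file (the paths here are purely illustrative) might look like this:

     User-agent: MSNbot
     Disallow: /private/

     User-agent: discobot
     Disallow: /tmp/

     User-agent: Slurp
     Disallow: /archive/

     User-agent: *
     Disallow: /admin/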

     In the example above, MSNbot, discobot and Slurp have individual rules that will work only for those search engines.

    All other user agents follow the general directives in the user-agent: * group.

    The robots.txt syntax is absolutely straightforward.

    There are seven general terms that are often found on websites.

    • User-agent: The specific web search engine (search engine bot) that you are instructing to crawl. A list of most user agents can be found here. In total, it has 302 systems, of which two are the most relevant - Google and Yandex.
    • Disallow: A disallow command that tells the agent not to visit the URL. Only one "disallow" line is allowed per URL.
    • Allow (only applicable to Googlebot): The command tells the bot that it can access the page or subfolder even if its parent page or subfolder has been closed.
     • Crawl-delay: how many seconds the search engine should wait before loading and crawling the page content.

    Please note - Googlebot does not support this command, but the crawl rate can be manually set in Google Search Console.

    • Sitemap: Used to call the location of any XML maps associated with this URL. This command is only supported by Google, Ask, Bing and Yahoo.
    • Host: this directive specifies the main mirror of the site, which should be taken into account when indexing. It can only be written once.
    • Clean-param: This command is used to deal with duplicate content in dynamic addressing.

    Regular Expressions

    Regular Expressions: what they look like and what they mean.

    How to enable and disable crawling in robots.txt.

    In practice, robots.txt files can grow and become quite complex and unwieldy.

    The system makes it possible to use regular expressions to provide the required functionality of the file, that is, to work flexibly with pages and subfolders.

    • * is a wildcard, meaning that the directive works for all search bots;
    • $ matches the end of the URL or string;
    • # used for developer and optimizer comments.

    Here are some examples of robots.txt for http://www.nashsite.com

    Robots.txt URL: www.nashsite.com/robots.txt

     User-agent: *   (i.e. for all search engines)
     Disallow: /     (the slash denotes the site's root directory)

    We have just banned all search engines from crawling and indexing the entire site.

    How often is this action required?

    Infrequently, but there are cases when it is necessary that the resource does not participate in search results, but visits are made through special links or through corporate authorization.

    This is how the internal sites of some firms work.

    In addition, such a directive is prescribed if the site is under development or modernization.

    If you need to allow the search engine to crawl everything on the site, then you need to write the following commands in robots.txt:

     User-agent: *
     Disallow:

    There is nothing in the prohibition (disallow), which means everything is possible.

    Using this syntax in the robots.txt file allows crawlers to crawl all pages on http://www.nashsite.com, including homepage, admin and contacts.

    Blocking specific search bots and individual folders

    Syntax for Google search engine (Googlebot).

    Syntax for other search agents.

     User-agent: Googlebot
     Disallow: /example-subfolder/

    This syntax only tells Google (Googlebot) not to crawl the address: www.nashsite.com/example-subfolder/.

    blocking individual pages for the specified bots:

     User-agent: Bingbot
     Disallow: /example-subfolder/blocked-page.html

     This syntax says that only Bingbot (the name of Bing's crawler) should not visit the page at www.nashsite.com/example-subfolder/blocked-page.html.

    In fact, that's all.

    If you master these seven commands and three symbols and understand the application logic, you can write the correct robots.txt.

    Why it doesn't work and what to do

    Main action algorithm.

    Other methods.

    Misbehaving robots.txt is a problem.

    After all, it will take time to identify the error, and then figure it out.

    Re-read the file, make sure you haven't blocked anything extra.

    If after a while it turns out that the page is still hanging in the search results, look in Google Webmaster to see if the site has been re-indexed by the search engine, and check if there are any external links to the closed page.

    Because if they are, then it will be more difficult to hide it from the search results, other methods will be required.

    Well, before using, check this file with a free tester from Google.

    Timely analysis helps to avoid troubles and saves time.

    This article is an example of the best, in my opinion, code for the WordPress robots.txt file that you can use in your sites.

     To begin with, let's remember why robots.txt is needed: the robots.txt file exists solely for search robots, to "tell" them which sections/pages of the site to visit and which not to visit. Pages that are closed from visiting will not be indexed by search engines (Yandex, Google, etc.).

    Option 1: Optimal robots.txt code for WordPress

     User-agent: *
     Disallow: /cgi-bin  # classic...
     Disallow: /?        # all query options on the main page
     Disallow: /wp-      # all WP files: /wp-json/, /wp-includes, /wp-content/plugins
     Disallow: *?s=      # search
     Disallow: *&s=      # search
     Disallow: /search   # search
     Disallow: /author/  # author archive
     Disallow: */embed   # all embeds
     Disallow: */page/   # all types of pagination
     Allow: */uploads    # open uploads
     Allow: /*/*.js      # inside /wp- (/*/ - for priority)
     Allow: /*/*.css     # inside /wp- (/*/ - for priority)
     Allow: /wp-*.png    # images in plugins, cache folder, etc.
     Allow: /wp-*.jpg    # images in plugins, cache folder, etc.
     Allow: /wp-*.jpeg   # images in plugins, cache folder, etc.
     Allow: /wp-*.gif    # images in plugins, cache folder, etc.
     Allow: /wp-*.svg    # images in plugins, cache folder, etc.
     Allow: /wp-*.pdf    # files in plugins, cache folder, etc.
     Allow: /wp-admin/admin-ajax.php
     #Disallow: /wp/     # when WP is installed in a /wp/ subdirectory

     Sitemap: http://example.com/sitemap.xml
     Sitemap: http://example.com/sitemap2.xml    # another file
     #Sitemap: http://example.com/sitemap.xml.gz # compressed version (.gz)

     # Code version: 1.1
     # Don't forget to change `example.com` to your site.

    Code parsing:

      In the User-agent: * line we indicate that all the rules below apply to all crawlers. If you want the rules to work for only one specific robot, then instead of * specify its name (User-agent: Yandex, User-agent: Googlebot).

      In the Allow: */uploads line we deliberately allow indexing of pages whose path contains /uploads. This rule is mandatory, because above we forbid indexing of pages starting with /wp-, and /wp- is part of /wp-content/uploads. Therefore, to override the Disallow: /wp- rule, you need the line Allow: */uploads, because links like /wp-content/uploads/... may lead to images that should be indexed, as well as to uploaded files that there is no need to hide. Allow: can come "before" or "after" Disallow:.

      The rest of the lines prevent robots from "walking" on links that start with:

      • Disallow: /cgi-bin - closes the script directory on the server
      • Disallow: /feed - closes the blog's RSS feed
      • Disallow: /trackback - Disable notifications
      • Disallow: ?s= or Disallow: *?s= - close search pages
      • Disallow: */page/ - closes all types of pagination
      The Sitemap: http://example.com/sitemap.xml rule points the robot to an XML sitemap file. If you have such a file on your site, write the full path to it. There can be several such files; in that case specify the path to each one separately.

      In the Host: site.ru line we indicate the main mirror of the site. If the site has mirrors (copies of the site on other domains), then for Yandex to index them all the same way you need to specify the main mirror. Only Yandex understands the Host: directive; Google does not! If the site runs over the https protocol, then the protocol must be specified in Host: Host: https://example.com

      From the Yandex documentation: "Host is an independent directive and works anywhere in the file (cross-section)". Therefore, we put it at the top or at the very end of the file, through an empty line.

     The feeds are left open because their presence is required, for example, by Yandex Zen when you need to connect the site to a channel (thanks to the commentator Digital). Open feeds may be needed elsewhere as well.

     At the same time, feeds have their own format in the response headers, thanks to which search engines understand that this is not an HTML page but a feed, and obviously handle it differently.

    Host directive for Yandex is no longer needed

     Yandex has completely abandoned the Host directive; it has been replaced by the 301 redirect. Host can safely be removed from robots.txt. However, it is important that all site mirrors have a 301 redirect to the main site (the main mirror).

    This is important: sorting rules before processing

     Yandex and Google do not process the Allow and Disallow directives in the order in which they are written; they first sort them from the shortest rule to the longest, and then process the last matching rule:

     User-agent: *
     Allow: */uploads
     Disallow: /wp-

    will be read as:

     User-agent: *
     Disallow: /wp-
     Allow: */uploads

    To quickly understand and apply the sorting feature, remember this rule: “the longer the rule in robots.txt, the more priority it has. If the length of the rules is the same, then the Allow directive takes precedence.”

    Option 2: Standard robots.txt for WordPress

     I don't know about you, but I am for the first option! It is more logical - there is no need to completely duplicate a section just to specify the Host directive for Yandex, which is cross-sectional (the robot understands it anywhere in the file, without reference to which robot it applies to). As for the non-standard Allow directive, it works for Yandex and Google, and if it does not open the uploads folder for other robots that do not understand it, in 99% of cases nothing dangerous will come of it. I have not yet noticed the first robots.txt failing to work as it should.

     The code above is slightly incorrect. Thanks to the commentator for pointing out the problem, although I had to figure out what exactly it was myself. And here is what I came up with (I could be wrong):

    1. Some robots (not Yandex and Google) do not understand more than 2 directives: User-agent: and Disallow:.

    2. The Yandex directive Host: should be placed after Disallow:, because some robots (not Yandex and Google) may not understand it and reject the robots.txt altogether. Judging by the documentation, Yandex itself does not care where and how Host: is used, even if you create a robots.txt with the single line Host: www.site.ru in order to glue together all the site's mirrors.

    3. Sitemap: is a cross-sectional directive for Yandex and Google, and apparently for many other robots too, so we write it at the end, separated by an empty line, and it will work for all robots at once.

    Based on these amendments, the correct code should look like this:

     User-agent: Yandex
     Disallow: /wp-admin
     Disallow: /wp-includes
     Disallow: /wp-content/plugins
     Disallow: /wp-json/
     Disallow: /wp-login.php
     Disallow: /wp-register.php
     Disallow: */embed
     Disallow: */page/
     Disallow: /cgi-bin
     Disallow: *?s=
     Allow: /wp-admin/admin-ajax.php
     Host: site.ru

     User-agent: *
     Disallow: /wp-admin
     Disallow: /wp-includes
     Disallow: /wp-content/plugins
     Disallow: /wp-json/
     Disallow: /wp-login.php
     Disallow: /wp-register.php
     Disallow: */embed
     Disallow: */page/
     Disallow: /cgi-bin
     Disallow: *?s=
     Allow: /wp-admin/admin-ajax.php

     Sitemap: http://example.com/sitemap.xml

    We add for ourselves

     If you need to prohibit other pages or groups of pages, you can add a Disallow: rule below. For example, if we need to close all posts in the news category from indexing, then before Sitemap: we add the rule:

    Disallow: /news

    It prevents robots from following links like this:

    • http://example.com/news
    • http://example.com/news/drugoe-name/

    If you need to close any occurrences of /news , then we write:

    Disallow: */news

    • http://example.com/news
    • http://example.com/my/news/drugoe-name/
    • http://example.com/category/newsletter-name.html

    You can learn more about the robots.txt directives on the Yandex help page (but keep in mind that not all the rules that are described there work for Google).

    Robots.txt check and documentation

    You can check if the prescribed rules are working correctly at the following links:

    • Yandex: http://webmaster.yandex.ru/robots.xml .
    • At Google, this is done in search console. You need authorization and the presence of the site in the webmaster panel...
    • Service for creating a robots.txt file: http://pr-cy.ru/robots/
    • Service for generating and checking robots.txt: https://seolib.ru/tools/generate/robots/

    I asked Yandex...

     I asked a question to Yandex technical support about the cross-sectional use of the Host and Sitemap directives:

    Question:

    Hello!
    I am writing an article about robots.txt on my blog. I would like to get an answer to such a question (I did not find an unambiguous "yes" in the documentation):

    If I need to glue all the mirrors, and for this I use the Host directive at the very beginning of the robots.txt file:

     Host: site.ru

     User-agent: *
     Disallow: /asd

     Will Host: site.ru work correctly in this example? Will it indicate to robots that site.ru is the main mirror? I.e. I am using this directive not inside a section but separately (at the beginning of the file), without specifying which User-agent it refers to.

    I also wanted to know if the Sitemap directive must be used inside the section or can it be used outside: for example, through an empty line, after the section?

     User-agent: Yandex
     Disallow: /asd

     User-agent: *
     Disallow: /asd

     Sitemap: http://example.com/sitemap.xml

    Will the robot understand the Sitemap directive in this example?

    I hope to get an answer from you that will put an end to my doubts.

    Answer:

    Hello!

    The Host and Sitemap directives are cross-sectional, so they will be used by the robot regardless of where they are specified in the robots.txt file.

    --
    Sincerely, Platon Schukin
    Yandex Support

    Conclusion

    It is important to remember that changes to robots.txt on an already working site will be noticeable only after a few months (2-3 months).

    Rumor has it that Google can sometimes ignore the rules in robots.txt and take a page into the index if it considers that the page is very unique and useful and it simply has to be in the index. However, other rumors refute this hypothesis by saying that inexperienced optimizers may incorrectly specify the rules in robots.txt and thus close the necessary pages from indexing and leave unnecessary ones. I'm leaning more towards the second suggestion...

    Dynamic robots.txt

     In WordPress, a request for the robots.txt file is handled separately, and it is not necessary to physically create a robots.txt file in the root of the site; moreover, it is not recommended, because with a physical file it becomes very difficult for plugins to change it, and this is sometimes necessary.

     Read about how dynamic creation of the robots.txt file works in the function description, and below I will give an example of how you can change the contents of this file on the fly, through a hook.

    To do this, add the following code to your functions.php file:

     add_action( 'do_robotstxt', 'my_robotstxt' );

     function my_robotstxt() {
         // Lines that will make up the robots.txt response.
         $lines = [
             'User-agent: *',
             'Disallow: /wp-admin/',
             'Disallow: /wp-includes/',
             '',
         ];
         echo implode( "\r\n", $lines );
         die; // stop PHP so nothing else is appended to the output
     }

     As a result, a request for /robots.txt returns:

     User-agent: *
     Disallow: /wp-admin/
     Disallow: /wp-includes/
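     If you only need to append rules rather than replace the whole output, WordPress also provides the robots_txt filter. A minimal sketch for functions.php (the appended rule here is purely illustrative):

     add_filter( 'robots_txt', function ( $output, $public ) {
         // Append an extra rule to whatever WordPress and other plugins have already generated.
         $output .= "Disallow: /wp-includes/\n";
         return $output;
     }, 20, 2 );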

    Crawl-delay - timeout for crazy robots (not taken into account since 2018)

    Yandex

    After analyzing emails for the last two years to our indexing support, we found out that one of the main reasons for the slow download of documents is an incorrectly configured Crawl-delay directive in robots.txt […] So that site owners do not have to worry about this anymore and so that all really necessary pages of sites appear and update quickly in the search, we decided to refuse to take into account the Crawl-delay directive.

     When the Yandex robot crawls the site like crazy, it creates unnecessary load on the server. The robot can be asked to "slow down".

    To do this, you need to use the Crawl-delay directive. It specifies the time in seconds that the robot should idle (wait) to crawl each next page of the site.

    For compatibility with robots that do not follow the robots.txt standard, Crawl-delay must be specified in the group (in the User-Agent section) immediately after Disallow and Allow

    The Yandex robot understands fractional values, for example, 0.5 (half a second). This does not guarantee that the crawler will visit your site every half a second, but it allows you to speed up the crawl of the site.

     User-agent: Yandex
     Disallow: /wp-admin
     Disallow: /wp-includes
     Crawl-delay: 1.5 # 1.5 second timeout

     User-agent: *
     Disallow: /wp-admin
     Disallow: /wp-includes
     Allow: /wp-*.gif
     Crawl-delay: 2 # 2 second timeout

    Google

    Googlebot does not understand the Crawl-delay directive. The timeout for its robots can be specified in the webmaster's panel.


     Everything needs instructions to work, and search engines are no exception to the rule, which is why a special file called robots.txt was invented. This file must be in the root folder of your site, or it can be virtual, but it must open at the request www.yoursite.ru/robots.txt

     Search engines have long since learned to distinguish the necessary HTML files from the internal script sets of your CMS; more precisely, they have learned to recognize links to content articles and all sorts of junk. Therefore, many webmasters forget to create a robots.txt for their sites and assume everything will be fine anyway. Yes, they are 99% right, because if your site does not have this file, search engines are unrestricted in their search for content, but there are nuances that can be taken care of in advance.

    If you have any problems with this file on the site, write comments to this article and I will quickly help you with this, absolutely free. Very often, webmasters make minor mistakes in it, which brings the site to poor indexing, or even exclusion from the index.

    What is robots.txt for?

     The robots.txt file is created to configure correct indexing of the site by search engines. That is, it contains rules allowing and denying certain paths of your site or certain content types. But it is not a panacea. The rules in the robots file are not strict instructions to be followed exactly, only recommendations for search engines. Google, for example, writes:

    You cannot use a robots.txt file to hide a page from Google Search results. Other pages may link to it, and it will still be indexed.

     Search robots decide for themselves what to index and what not, and how to behave on the site. Each search engine has its own tasks and functions. As much as we would like, this is not a way to tame them.

    But there is one trick that does not directly relate to the subject of this article. To completely prevent robots from indexing and showing a page in search results, you need to write:
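     That is, the robots meta tag shown earlier in this article, placed in the page's <head>:

     <meta name="robots" content="noindex,nofollow">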

    Let's get back to robots. The rules in this file can close or allow access to the following types of files:

    • Non-graphic files. Mainly HTML files containing some information. You can close duplicate pages, or pages that carry no useful information (pagination pages, calendar pages, archive pages, profile pages, etc.).
    • Graphic files. If you want the site's images not to appear in search, you can set this in robots.txt.
    • Resource files. With robots.txt you can also block the indexing of various scripts, CSS style files and other unimportant resources. But you should not block the resources responsible for the visual part of the site for visitors (for example, if you close the site's CSS and JS that render beautiful blocks or tables, the search robot will not see them and will complain about it).
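     Putting these three cases together, a rough sketch (all paths here are illustrative, not tied to a specific CMS) might look like this:

     User-agent: *
     Disallow: /*?page=      # pagination pages that carry no useful information
     Disallow: /images/      # keep the site's pictures out of image search
     Allow: /templates/*.css # but do not block the resources that render the page
     Allow: /templates/*.js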

    To visually show how robots works, look at the picture below:

    The search robot, following the site, looks at the indexing rules, then starts indexing according to the recommendations of the file.
    Depending on the rule settings, the search engine knows what can be indexed and what can't be indexed.

     The syntax of the robots.txt file

     To write rules for search engines, the robots.txt file uses directives with various parameters, which the robots then follow. Let's start with the very first and probably the most important directive:

    User-agent directive

     User-agent - with this directive you specify the name of the robot that should follow the recommendations in the file. There are officially 302 such robots on the Internet. Of course, you could write rules for each one separately, but if you don't have time for that, just write:

    User-agent: *

    * - in this example means "All". Those. your robots.txt file should start with "who exactly" the file is for. In order not to bother with all the names of the robots, just write an asterisk in the user-agent directive.

    I will give you detailed lists of robots of popular search engines:

     Google - Googlebot - the main robot

     Other Google robots:

     Googlebot-News - news search robot
     Googlebot-Image - image robot
     Googlebot-Video - video robot
     Googlebot-Mobile - mobile version robot
     AdsBot-Google - landing page quality check robot
     Mediapartners-Google - AdSense robot

     Yandex - YandexBot - the main indexing robot

    Other Yandex robots

    Disallow and Allow Directives

     Disallow is the most basic rule in robots.txt; it is with this directive that you prohibit indexing of certain parts of your site. The directive is written like this:

    Disallow:

     Very often you can see an empty Disallow: directive, i.e. one that supposedly tells the robot that nothing on the site is prohibited - index whatever you want. Be careful! If you put / in Disallow, you completely close the site to indexing.

    Therefore, the most standard version of robots.txt, which "allows the indexing of the entire site for all search engines" looks like this:

     User-agent: *
     Disallow:

    If you don't know what to write in robots.txt, but have heard about it somewhere, just copy the code above, save it to a file called robots.txt and upload it to the root of your site. Or don't create anything, because even without it, robots will index everything on your site. Or read the article to the end, and you will understand what to close on the site and what not.

     According to the robots.txt standard, at least one Disallow directive is required after each User-agent entry.

     With this directive you can block both a folder and an individual file.

     If you want to block a folder, write:

    Disallow: /folder/

     If you want to block a specific file:

    Disallow: /images/img.jpg

     If you want to block certain file types:

    Disallow: /*.png$

     Such wildcard patterns are not supported by all search engines, but Google supports them.

     Allow is the permitting directive in robots.txt. It lets the robot index a specific path or file inside a prohibited directory. Until recently it was used only by Yandex; Google has caught up and now uses it too. For example:

     Allow: /content
     Disallow: /

    these directives prohibit indexing all site content, except for the content folder. Or here are some more popular directives lately:

     Allow: /themplate/*.js
     Allow: /themplate/*.css
     Disallow: /themplate

     These values allow indexing of all CSS and JS files on the site, but prohibit indexing of everything in your template folder. Over the past year Google has sent webmasters a lot of letters with the following content:

    Googlebot can't access CSS and JS files on website

    And the related comment: We have discovered an issue on your site that may prevent it from being crawled. Googlebot cannot process JavaScript code and/or CSS files due to restrictions in the robots.txt file. This data is needed to evaluate the performance of the site. Therefore, if access to resources is blocked, then this may worsen the position of your site in the Search.

    If you add the two allow directives that are written in the last code to your Robots.txt, then you will not see such messages from Google.

     Using special characters in robots.txt

     Now about the signs used in directives. The basic signs (special characters) used in prohibiting or allowing rules are /, * and $.

    About slashes (forward slash) "/"

    The slash is very deceptive in robots.txt. I observed an interesting situation several dozen times when, out of ignorance, they added to robots.txt:

     User-agent: *
     Disallow: /

     ...because they read somewhere about the structure of the site and copied it onto their own site. In this case, however, you disable indexing of the entire site. To prohibit indexing of a directory with all its contents, you definitely need to put / at the end. For example, if you write Disallow: /seo, then absolutely all links on your site that contain the word seo will not be indexed - the /seo/ folder, the /seo-tool/ category and the /seo-best-of-the-best-soft.html article alike.

    Look carefully at everything / in your robots.txt

     Always put / at the end of directory names. If you put / in Disallow, you prohibit indexing of the entire site; if you leave / out of Allow, you likewise disable indexing of the entire site. In a sense, / means "everything that follows".

    About asterisks * in robots.txt

    The special character * means any (including empty) sequence of characters. You can use it anywhere in robots like this:

     User-agent: *
     Disallow: /papka/*.aspx
     Disallow: /*old

     This forbids all files with the aspx extension in the papka directory, and also forbids not just the /old folder but any path containing old, such as /papka/old. Tricky? That is why I do not recommend playing around with the * symbol in your robots.txt.

     By default, the allowing and banning rules in robots.txt have an implied * at the end of every directive!

    About the special character $

    The $ special character in robots terminates the * special character. For example:

    Disallow: /menu$

     This rule forbids '/menu' but does not forbid '/menu.html'; i.e. it blocks search engines only from the exact /menu path and does not block all files that contain the word menu in the URL.

    host directive

     The Host rule works only in Yandex, so it is optional; it determines the primary domain among your site's mirrors, if there are any. For example, you have the domain dom.com, but the following domains have also been purchased and configured: dom2.com, dom3.com, dom4.com, and they redirect to the main domain dom.com.

     In order for Yandex to quickly determine which of them is the main site (host), add the Host directive to your robots.txt:

     Host: dom.com

    If your site does not have mirrors, then you can not prescribe this rule. But first, check your site by IP address, it may open your main page, and you should register the main mirror. Or perhaps someone copied all the information from your site and made an exact copy, the entry in robots.txt, if it was also stolen, will help you with this.

     There must be only one Host entry; if necessary, it may include a port (Host: site:8080).

    Crawl-delay directive

     This directive was created to avoid overloading your server. Search bots can make hundreds of requests to your site at the same time, and if your server is weak this can cause minor glitches. To prevent this, the Crawl-delay rule was invented - it is the minimum interval between page loads on your site. The recommended default value for this directive is 2 seconds. In robots.txt it looks like this:

     Crawl-delay: 2

    This directive works for Yandex. In Google, you can set the crawl rate in the webmaster panel, in the Site Settings section, in the upper right corner with a "gear".

    Clean-param directive

     This parameter is also only for Yandex. If site page addresses contain dynamic parameters that do not affect their content (for example: session IDs, user IDs, referrer IDs, etc.), you can describe them using the Clean-param directive.

    The Yandex robot, using this information, will not repeatedly reload duplicate information. Thus, the efficiency of crawling your site will increase, and the load on the server will decrease.
    For example, the site has pages:

    www.site.com/some_dir/get_book.pl?ref=site_1&book_id=123

    Parameter ref is used only to track from which resource the request was made and does not change the content, the same page with the book book_id=123 will be shown at all three addresses. Then if you specify the directive like this:

     User-agent: Yandex
     Disallow:
     Clean-param: ref /some_dir/get_book.pl

     the Yandex robot will reduce all such page addresses to one:
     www.site.com/some_dir/get_book.pl?ref=site_1&book_id=123
     And if a page without parameters is available on the site:
     www.site.com/some_dir/get_book.pl?book_id=123
     then everything will be reduced to it when the robot indexes the site. Other pages of your site will be crawled more often, since there will be no need to refresh pages like:
     www.site.com/some_dir/get_book.pl?ref=site_2&book_id=123
     www.site.com/some_dir/get_book.pl?ref=site_3&book_id=123

     # for addresses like:
     # www.site1.com/forum/showthread.php?s=681498b9648949605&t=8243
     # www.site1.com/forum/showthread.php?s=1e71c4427317a117a&t=8243
     # robots.txt will contain:
     User-agent: Yandex
     Disallow:
     Clean-param: s /forum/showthread.php

    Sitemap Directive

    With this directive, you simply specify the location of your sitemap.xml. The robot remembers this, “thanks you”, and constantly analyzes it along the given path. It looks like this:

    Sitemap: http://site/sitemap.xml

    And now let's look at the general questions that arise when compiling a robot. There are many such topics on the Internet, so we will analyze the most relevant and most frequent.

    Correct robots.txt

    There is a lot of “correct” in this word, because for one site on one CMS it will be correct, and on another CMS it will give errors. "Correctly configured" for each site is individual. In Robots.txt, you need to close from indexing those sections and those files that are not needed by users and do not carry any value for search engines. The simplest and most correct version of robots.txt

     User-agent: *
     Disallow:
     Sitemap: http://site/sitemap.xml

     User-agent: Yandex
     Disallow:
     Host: site.com

     This file contains the following rules: a block of rules for all search engines (User-agent: *), indexing of the entire site is fully allowed (the empty "Disallow:", or you could specify "Allow: /"), the main mirror host for Yandex is specified (Host: site.com), as is the location of your sitemap.xml (Sitemap: http://site/sitemap.xml).

    Robots.txt for WordPress

     Again, there are a lot of questions here: one site may be an online store, another a blog, a third a landing page, a fourth a company business-card site, and all of them can run on the WordPress CMS, while the rules for the robots will be completely different. Here is my robots.txt for this blog:

     User-agent: *
     Allow: /wp-content/uploads/
     Allow: /wp-content/*.js$
     Allow: /wp-content/*.css$
     Allow: /wp-includes/*.js$
     Allow: /wp-includes/*.css$
     Disallow: /wp-login.php
     Disallow: /wp-register.php
     Disallow: /xmlrpc.php
     Disallow: /template.html
     Disallow: /wp-admin
     Disallow: /wp-includes
     Disallow: /wp-content
     Disallow: /category
     Disallow: /archive
     Disallow: */trackback/
     Disallow: */feed/
     Disallow: /?feed=
     Disallow: /job
     Disallow: /?
     Sitemap: http://site/sitemap.xml

    There are a lot of settings here, let's analyze them together.

     Allow in WordPress. The first, permitting rules are for content needed by users (the images in the uploads folder) and by robots (the CSS and JS used to render pages). It is css and js that Google often complains about, so we left them open. It would have been possible to cover all such files simply with "/*.css$", but because the folders containing these files are banned, that alone would not let them be indexed, so I had to write the full path inside each banned folder.

     Allow should always point to a path that is otherwise prohibited by Disallow. If something is not forbidden, there is no point writing an Allow for it in the hope of giving search engines a push, as if to say "Come on, here's a URL for you, index it faster." That does not work.

     Disallow in WordPress. A lot needs to be banned in the WP CMS: lots of different plugins, various settings and themes, a bunch of scripts and pages that carry no useful information. But I went further and completely forbade indexing of everything on my blog except the articles themselves (posts) and pages (About the Author, Services). I even closed the categories on the blog; I will open them when they are optimized for queries and when a text description appears for each of them, but for now they are just duplicate post previews that search engines do not need.

     Well, Host and Sitemap are standard directives. The only thing is that Host should have been moved into a separate section for Yandex, but I did not bother with that. That's it for robots.txt for WP.

    How to create robots.txt

    It is not as difficult as it seems at first glance. You just need to take a regular notepad (Notepad) and copy the data for your site according to the settings from this article. But if this is difficult for you, there are resources on the Internet that allow you to generate robots for your sites:

    No one will tell more about your Robots.txt than these comrades. After all, it is for them that you create your “forbidden file”.

    Now let's talk about some of the little bugs that can be in robots.

    • "Empty line" - an empty line must not be placed inside a User-agent block.
    • When two directives with prefixes of the same length conflict, priority is given to the Allow directive.
    • For each robots.txt file, only one Host directive is processed. If several are specified in the file, the robot uses the first one.
    • The Clean-param directive is cross-sectional, so it can be specified anywhere in the robots.txt file. If there are several such directives, the robot takes all of them into account.
    • Six Yandex robots do not follow the robots.txt rules (YaDirectFetcher, YandexCalendar, YandexDirect, YandexDirectDyn, YandexMobileBot, YandexAccessibilityBot). To keep them off the site, you have to create separate User-agent sections for each of them.
    • The User-agent directive must always be written above the deny directives.
    • One line per directory. You cannot write several directories on one line.
    • The file name must be exactly robots.txt. No Robots.txt, ROBOTS.txt and so on - only lowercase letters in the name.
    • In the Host directive, write the path to the domain without http and without slashes. Incorrect: Host: http://www.site.ru/; correct: Host: www.site.ru.
    • When the site uses the secure https protocol, the Host directive (for the Yandex robot) must include the protocol: Host: https://www.site.ru

    This article will be updated as interesting questions and nuances come in.

     Yours, the lazy Staurus.