Correct robots.txt for Joomla 3. The robots meta tag helps eliminate duplicate content

In this article we will talk about how to compose a correct robots.txt file for Joomla. It plays a vital role in the correct, fast indexing of your project. If robots.txt is composed incorrectly, some pages of your site may be excluded by search engines altogether, while duplicate and junk pages end up in the index. This naturally has a negative impact on search results, and your efforts to optimize the site go to waste.

So, the robots.txt file is a text file located at the root of your site that tells search robots exactly how to index your project: which pages to ignore and which ones to pay special attention to.

If the robots.txt file does not correctly define the rules for search robots, they will index many junk pages, and content on your site may be duplicated many times over: the same article will be available through different links, which is not good.

Let's look at the main directives and rules of this file.

Directives and rules for writing the robots.txt file.

The file starts with the most important directive, User-agent, which contains the name of the search robot. For all search robots it is written as User-agent: *, and to address Yandex specifically we add its name: User-agent: Yandex.

Next come Allow and Disallow. The first allows and the second prohibits indexing by search robots.

A correct robots.txt file should contain at least one "Disallow" directive after each "User-agent" entry. If you leave the robots.txt file completely empty, search engines will index your entire resource, and many junk and duplicate pages will end up in the index.

Another directive you need is Host. It is understood only by the Yandex search engine and serves to define the main mirror of your site: your resource can be accessible at several addresses, for example with and without www, which for search engines are two different sites.

Since the Host directive is understood only by Yandex, you need a separate User-agent: Yandex section for it, while indexing instructions for all other search robots go under the general User-agent section.

When composing a correct robots.txt, you must follow the writing rule: (directive):(space)(value).
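For instance, following this rule, a typical directive might look like this:

Disallow: /administrator/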

And the last important directive is Sitemap. It shows search engines where the sitemap of your blog in .xml format is located.

Correct robots.txt for Joomla

The correct robots.txt file for Joomla that I use on this site looks like this:
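The file is discussed line by line later in this article; as a rough sketch (with a placeholder domain instead of the real one), its structure looks roughly like this:

User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /libraries/
Disallow: /logs/
Disallow: /tmp/
Disallow: /*?
Allow: /images/
Host: your-site.ru
Sitemap: http://your-site.ru/sitemap.xml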

By the way, if you want to view the robots.txt of any website, just add /robots.txt to the site's address in the browser's address bar.

Also keep in mind that Google and Yandex have, in addition to their main robots, special robots for indexing news, images and so on, so do not forget to open the images on your site for indexing. By default the Joomla robots.txt contains Disallow: /images/. Remove this directive.
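If you prefer to open the images explicitly rather than just deleting that line, a rule along these lines should do the job:

Allow: /images/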

Before making changes to the robots.txt file, I think it would be useful to explain what kind of file it is and what it is needed for. Those who are already familiar with this file can skip the first part of the text.

Robots.txt: what is this file and what is it for?

This is a regular text file that is needed exclusively for search engines; it serves to give search robots instructions (or, if you like, recommendations) on what to index and how. A lot depends on a properly composed robots.txt file: with its help you can close the site from search robots or, conversely, allow crawling of only certain sections of the site. Therefore, its competent compilation is one of the priorities in the SEO optimization of a site.

In order to edit the robots.txt file correctly, you first need to know its location. For any site, including those created in CMS Joomla 3, this file is located in the root directory (folder) of the site. After installing Joomla 3 this file is already present, but its contents are far from ideal.

Robots.txt file syntax

In Joomla 3, the stock robots.txt file contains only the bare essentials; its contents look something like this:

User-agent: *
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/

At the very beginning of the file there may be additional text, but it is commented out with the "#" symbol. Simply put, a line that starts with "#" is not taken into account by search robots and can be safely deleted to reduce the file size. Thus, the basic robots.txt file will have exactly the contents shown above. Let's look at each line.

The first line contains the User-agent directive, whose parameter is the name of the robot that will index the site. The directives following it will be processed only by the specified robot. There can be many possible values, but let's consider only those we need:

  • User-agent: * # the "*" value means that the text following this line contains information for all robots without exception.

This parameter has other values as well; the most common of them are the Yandex and Google robots:

  • User-agent: Yandex # as the name implies, this value is intended for Yandex robots; Yandex actually has more than 10 of them, but I see no point in considering each one separately.
  • User-agent: Googlebot # and this is Google's main indexing robot.

It is worth noting that if you do not specify the User-agent directive, then the robots will think that they are allowed to crawl the entire site, that is, access is not limited. So don't neglect it.

The next directive is Disallow. It is needed to prevent search robots from indexing certain sections, and it plays a very important role, since Joomla is famous for creating duplicate pages.
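As a sketch, the duplicate-prone Joomla URLs are usually cut off with rules like the following (adjust them to the components your site actually uses):

Disallow: /index.php?   # URLs with query parameters
Disallow: /*?           # any address containing a question mark
Disallow: /component/   # component URLs that bypass the menu structure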

This is where the directives in the basic robots.txt file end, but there are many more than two of them. I won't describe them all; I'll only cover what is really needed for proper indexing of Joomla sites.

Compiling the correct robots.txt file for Joomla 3

I'll spare you unnecessary text and immediately give an example of my robots.txt file, with comments added to the lines:

User-agent: * # we indicate that the following directives are intended for all robots without exception
Host: site # the directive points to the main mirror of the site; according to Yandex recommendations it is advisable to place it after the Allow and Disallow directives
Disallow: /administrator
Disallow: /component/slogin/* # prohibit crawling the extraneous pages created by the Slogin authorization component (if there is no such component, remove the directive)
Disallow: /component/jcomments/ # prohibit robots from downloading pages created by the JComments component (remove if not used)
Disallow: /component/users # in the same way prohibit crawling other extraneous pages
Disallow: /bin/ # prohibit crawling system folders
Disallow: /cache
Disallow: /cli
Disallow: /includes
Disallow: /installation
Disallow: /language
Disallow: /layouts
Disallow: /libraries
Disallow: /logs
Disallow: /tmp
Disallow: /components
Disallow: /modules
Disallow: /plugins
Disallow: /component/content
Disallow: /component/contact
Disallow: /404 # hide the 404 error page from the eyes of the robot
Disallow: /index.php? # URLs with parameters; Joomla can create a great many such pages, and they should not be included in the index
Disallow: /*? # URLs with question marks
Disallow: /*% # URLs with percent signs
Disallow: /*& # URLs with &
Disallow: /index.php # remove duplicates, they should not be there either
Disallow: /index2.php # duplicates again
Allow: /*.js* # these directives allow robots to index files with the specified extensions
Allow: /*.css*
Allow: /*.png*
Allow: /*.jpg*
Allow: /*.gif*
Allow: /index.php?option=com_jmap&view=sitemap&format=xml # allow crawling the sitemap, otherwise it would be blocked by the rules above
Sitemap: http://site/index.php?option=com_jmap&view=sitemap&format=xml # this directive indicates where the sitemap in xml format is stored

This is roughly the robots.txt file used on this site; it contains both allowing and prohibiting directives, specifies the main mirror of the site and the path to the sitemap. Of course, everything is individual for each site, and there can be many more directives. But from this example you can understand the basic principles of working with the robots.txt file and, in the future, apply bans or permissions to specific pages of your own site.

I would like to add that, contrary to Yandex's recommendation to place the Host directive after the Disallow and Allow directives, I still placed it almost at the very top. I did this after Yandex, during one of its crawls of the site, informed me that it could not find this directive. Whether it was a temporary failure or something else, I did not check, and I returned this directive to the very top.

Pay attention to the last directive, Sitemap. It is needed to point the search robot to the location of the sitemap, and this is a very important point. What a Sitemap file is and what role it plays in website promotion is covered in a separate article.


Robots.txt is a text file used to control the behavior of search engines when they crawl a site. Using Disallow directives you can close individual pages, sections, or the whole site from crawling. Note, however, that Disallow blocks pages from indexing only for Yandex bots.

About the robots.txt file

You should not postpone steps to prepare your site for indexing until you fill it with materials. The basic preparation of a site for indexing can be done immediately after creating the site.

The main tool for managing the Google, Yandex, Bing, and other search engines is the robots.txt text file. The robots.txt file allows you to control what search engines should crawl and what they should skip. Yandex reads the directives of the robots.txt file not only as crawling permissions but also as indexing permissions: if a page is banned in robots.txt, Yandex will after a while remove it from the index if it is already there, and will not index it if it is not.

The robots.txt file is a text file placed at the root of the site. According to certain rules, it prescribes which material on the site search engines should crawl and which material they should avoid. You set the rules of search engine behavior with respect to the site's material in the robots.txt file.

To see what a robots.txt file looks like (if it is present in the site directory), just add /robots.txt after the site name in the browser's address bar, for example: http://your-site.com/robots.txt.

The robots.txt file is created according to certain rules. These rules are called file syntax. You can view the detailed syntax of the robots.txt file on Yandex ( https://help.yandex.ru/webmaster/?id=996567). Here I will focus on the basic rules that will help you create a robots.txt file for a Joomla website.

Rules for creating a robots.txt file

First, let me draw your attention: the robots.txt file must be created individually, taking into account the peculiarities of the site structure and its promotion policy. The proposed version of the file is conditional and approximate and cannot claim universality.

Each line in the file is called a directive. The robots.txt file directives look like this:

<FIELD>:<SPACE><VALUE><SPACE>

<FIELD>:<SPACE><VALUE><SPACE>

<FIELD>:<SPACE><VALUE><SPACE>
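Substituting real values into this pattern gives ordinary directives, for example:

User-agent: *
Disallow: /administrator/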

An empty robots.txt file means the entire site is indexed.

It might seem there is nothing bad about that: let search engines crawl and index all the site material. And that is fine as long as the site is empty. But as it fills up with materials and undergoes constant editing, photo uploads, and deletions, articles that are no longer related to the site, duplicate pages, old archives, and other garbage material get indexed. Search engines do not like this, especially duplicate pages, and the main material can get lost behind this "garbage".

Robots.txt file directives

  • "User-agent" is a personal or general address to search engines;
  • "Allow" is a permissive directive;
  • "Disallow" is a prohibiting directive.

"User-agent" directive

If no search engine is specified in the "User-agent" line, it contains an asterisk (*), which means that all directives in the robots.txt file apply to all search engines.

You can also set indexing rules for a specific search engine. For example, the rules for Yandex should be introduced with a "User-agent" directive like this:

User-agent: Yandex

Here are examples of other search engines that can be specified in the "User-agent" directive.

  • Google: Googlebot
  • Yahoo!: Slurp
  • AOL: Slurp
  • MSN: MSNBot
  • Live: MSNBot
  • Ask: Teoma
  • AltaVista: Scooter
  • Alexa: ia_archiver
  • Lycos: Lycos
  • Yandex: Yandex
  • Rambler: StackRambler
  • Mail.ru: Mail.Ru
  • Aport: Aport
  • Webalta: WebAlta (WebAlta Crawler/2.0)

Important! The robots.txt file must contain a "Disallow" directive after each "User-agent" entry. Even if you are not prohibiting anything, an empty "Disallow" directive should still be present in the file.

Let's look at the syntax characters that define the indexing rules.

The following special characters are allowed: the asterisk (*), the slash (/), and the dollar sign ($).

  • The asterisk (*) means "any", "all".
  • The dollar sign ($) cancels the effect of (*) by marking the end of an address.
  • The slash (/) on its own means the root directory of the site; as a separator, the slash (/) also shows the paths to the files a rule is written for.

For example, the line:

Disallow:

means a ban on nothing, that is, the entire site remains open for indexing. And the line:

Disallow: /

means a ban on everything, that is, a ban on all folders and files of the site. A line like:

Disallow: /components/

completely bans the entire /components/ folder, which is located at http://your_site/components/.

And here is the line:

class="eliadunit">Disallow: /components

It creates a ban on the "components" folder and on all files and folders whose names start with "components", for example "components56" or "components77".
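If, hypothetically, you wanted to block only the /components address itself without touching folders like "components56", the ($) character described above can be used to mark the end of the address:

Disallow: /components$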

If we add a "User-agent" line to these "Disallow" examples, specifying which search engine the rule is created for, we get a ready-made robots.txt file.

User-agent: Yandex
Disallow:

This robots.txt file means that the Yandex search engine may index the entire site without exception.

And this is how the lines are written:

User-agent: Yandex
Disallow: /

that, on the contrary, completely prohibit Yandex from indexing the entire site.

The principle is clear. I will look at a few more examples, and at the end I will give classic robots.txt files for Yandex and Google.

The following example is the robots.txt file of a freshly installed template Joomla site:

User-agent: *
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/

This robots.txt file defines rules for all search engines and prohibits indexing of 14 site folders located in the root directory of the site.

Additional information in the robots.txt file

In the robots.txt file you also need to indicate the Sitemap address to search engines, and the mirror domain for the Yandex search engine.

  • Sitemap: http://exempl.com/sitemap.xml.gz
  • Sitemap: http://exempl.com/sitemap.xml

Separately, you can create robots.txt for Yandex to include a Host directive and specify a site mirror in it.

Host: www.your-site.com # means that the main mirror of the site is the version with www

Host: your-site.com # means that the main mirror of the site is the version without www
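Put together, a separate Yandex section with the Host directive might look like this (the domain is a placeholder):

User-agent: Yandex
Disallow: /administrator/
Disallow: /tmp/
Host: your-site.com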

Important! When writing your robots.txt file, remember to leave a space after the colon, and everything after the colon should be written in lowercase.

Important! Try not to use template robots.txt files taken from the Internet (except Joomla's default robots.txt). Each robots.txt file must be compiled individually and edited depending on the site's traffic and its SEO analysis.

At the end of the article I will give an example of the correct robots.txt file for a Joomla site.

User-agent: *
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /tmp/
Disallow: /templates/

User-agent: Yandex
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /plugins/
Disallow: /tmp/
Disallow: /templates/
Disallow: /*?*
Host: domen.ru (or https://domen.ru)
Sitemap: http://domen.ru/sitemap.xml (or https://domen.ru/sitemap.xml)

Conclusions

Despite tradition, I will note that to block site pages from indexing you should also use internal CMS tools: all content editors allow you to insert the noindex, nofollow tags. Typical tasks for robots.txt, plus a couple of rules of thumb:

  • closing the entire site during its creation;
  • closing the site from unnecessary search engines;
  • closing personal sections;
  • reducing the load on the server (the Crawl-delay directive);
  • closing pagination, sorting and search pages from indexing;
  • closing duplicate pages, but only for Yandex; for Google use CMS tools instead;
  • not trying to remove pages and sections from the Google index via robots.txt, since this only works for Yandex.

In closing, I will note again that the robots.txt file for a Joomla site is compiled individually. To get started, use the boxed robots.txt.dist file, which you rename to robots.txt and divide into two sections: one for Yandex and the second for all other bots. For Yandex, be sure to add the Host directive, indicating the main mirror of the site in it.

Good afternoon, dear friends! As you all know, search engine optimization is a responsible and delicate matter. You need to take into account absolutely every little detail to get an acceptable result.

Today we will talk about robots.txt - a file that is familiar to every webmaster. It contains all the most basic instructions for search robots. As a rule, they are happy to follow the prescribed instructions and, if they are compiled incorrectly, refuse to index the web resource. Next, I will tell you how to compose the correct version of robots.txt, as well as how to configure it.

In the preface I already described what it is. Now I will tell you why it is needed. Robots.txt is a small text file stored in the root of the site. It is used by search engines: it clearly states the indexing rules, i.e. which sections of the site should be indexed (added to the search) and which should not.

Typically, technical sections of a site are closed from indexing. Occasionally, non-unique pages are blacklisted (a copy-pasted privacy policy is one example). Here the robots are "told" how to work with the sections that do need to be indexed. Very often rules are prescribed separately for several robots. We will talk about this further.

With a correctly configured robots.txt, your website is far more likely to rise in search engine rankings. Robots will take into account only useful content, ignoring duplicate and technical sections.

Creating robots.txt

To create the file, just use the standard tools of your operating system and then upload it to the server via FTP. Where it should go on the server is easy to guess: the root. Typically this folder is called public_html.

You can easily get there using any FTP client or the hosting's built-in file manager. Naturally, we will not upload an empty robots.txt to the server; let's write some basic directives (rules) in it.

User-agent: *
Allow: /

With these lines in your robots file you address all robots (the User-agent directive) and allow them to index your entire site, including all technical pages (Allow: /).

Of course, this option is not particularly suitable for us. The file will not be particularly useful for search engine optimization. It definitely needs some proper tuning. But before that, we will look at all the main directives and robots.txt values.

Directives

  • User-agent - one of the most important directives, because it indicates which robots should follow the rules that come after it. The rules are applied until the next User-agent in the file.
  • Allow - allows indexing of the specified blocks of the resource, for example "/" or "/tag/".
  • Disallow - on the contrary, prohibits indexing of sections.
  • Sitemap - the path to the sitemap (in xml format).
  • Host - the main mirror (with or without www, or one of several domains). The secure https protocol is also indicated here if the site uses it; for standard http you do not need to specify the protocol.
  • Crawl-delay - lets you set the interval at which robots visit and download files from your site, which helps reduce the load on the host.
  • Clean-param - lets you exclude URL parameters from indexing on certain pages (like www.site.com/cat/state?admin_id8883278). Unlike the previous directives, two values are specified here: the parameter itself and the address.
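To illustrate the last two directives, a hypothetical Yandex section could contain lines like these (the values are examples only):

User-agent: Yandex
Crawl-delay: 2   # ask the robot to wait 2 seconds between downloads
Clean-param: admin_id /cat/state   # ignore the admin_id parameter on pages under /cat/state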

These are all the rules supported by the flagship search engines. It is with their help that we will compose our robots.txt, working through different variations for different types of sites.

Settings

To configure the robots file properly, we need to know exactly which sections of the site should be indexed and which should not. For a simple one-page site built with HTML + CSS, it is enough to write a few basic directives, such as:

User-agent: *
Allow: /
Sitemap: site.ru/sitemap.xml
Host: www.site.ru

Here we have specified the rules and values ​​for all search engines. But it’s better to add separate directives for Google and Yandex. It will look like this:

User-agent: *
Allow: /

User-agent: Yandex
Allow: /
Disallow: /politika

User-agent: GoogleBot
Allow: /
Disallow: /tags/

Sitemap: site.ru/sitemap.xml
Host: site.ru

Now absolutely all files on our HTML site will be indexed. If we want to exclude a particular page or picture, we need to specify a relative link to it in Disallow.
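For instance, to hide one hypothetical page and one image, lines like these could be added (the paths are made up for illustration):

Disallow: /old-page.html
Disallow: /images/secret-photo.jpg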

You can also use services that generate the robots.txt file automatically. I do not guarantee that they will produce a perfectly correct version, but you can try them as an introduction.

There are quite a few such services online.

With their help you can create robots.txt in automatic mode. Personally, I strongly do not recommend this option, because it is much easier to do it manually, customizing it for your platform.

When I talk about platforms, I mean all kinds of CMS, frameworks, SaaS systems and much more. Next we will talk about how to configure the robots.txt file for WordPress and Joomla.

But before that, let’s highlight a few universal rules that can guide you when creating and setting up robots for almost any site:

Disallow from indexing:

  • site admin;
  • personal account and registration/authorization pages;
  • cart, data from order forms (for an online store);
  • cgi folder (located on the host);
  • service sections;
  • ajax and json scripts;
  • UTM and Openstat tags;
  • various parameters.

Open (Allow):

  • Pictures;
  • JS and CSS files;
  • other elements that must be taken into account by search engines.

In addition, at the end, do not forget to indicate the sitemap (path to the site map) and host (main mirror) data.

Robots.txt for WordPress

To create a file, we need to drop robots.txt into the root of the site in the same way. In this case, you can change its contents using the same FTP and file managers.

There is a more convenient option - create a file using plugins. In particular, Yoast SEO has such a function. Editing robots directly from the admin panel is much more convenient, so I myself use this method of working with robots.txt.

How you decide to create this file is up to you; it is more important for us to understand exactly what directives should be there. On my sites running WordPress I use this option:

User-agent: * # rules for all robots, except Google and Yandex

Disallow: /cgi-bin # folder with scripts
Disallow: /? # query parameters on the home page
Disallow: /wp- # files of the CMS itself (with the wp- prefix)
Disallow: *?s= # \
Disallow: *&s= # everything related to search
Disallow: /search/ # /
Disallow: /author/ # author archives
Disallow: /users/ # and users
Disallow: */trackback # notifications from WP that someone is linking to you
Disallow: */feed # feed in xml
Disallow: */rss # and rss
Disallow: */embed # built-in elements
Disallow: /xmlrpc.php #WordPress API
Disallow: *utm= # UTM tags
Disallow: *openstat= # Openstat tags
Disallow: /tag/ # tags (if available)
Allow: */uploads # open downloads (pictures, etc.)

User-agent: GoogleBot # for Google
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: *utm=
Disallow: *openstat=
Disallow: /tag/
Allow: */uploads
Allow: /*/*.js # open JS files
Allow: /*/*.css # and CSS
Allow: /wp-*.png # and pictures in png format
Allow: /wp-*.jpg # \
Allow: /wp-*.jpeg # and other formats
Allow: /wp-*.gif # /
# works with plugins

User-agent: Yandex # for Yandex
Disallow: /cgi-bin
Disallow: /?
Disallow: /wp-
Disallow: *?s=
Disallow: *&s=
Disallow: /search/
Disallow: /author/
Disallow: /users/
Disallow: */trackback
Disallow: */feed
Disallow: */rss
Disallow: */embed
Disallow: /xmlrpc.php
Disallow: /tag/
Allow: */uploads
Allow: /*/*.js
Allow: /*/*.css
Allow: /wp-*.png
Allow: /wp-*.jpg
Allow: /wp-*.jpeg
Allow: /wp-*.gif
Allow: /wp-admin/admin-ajax.php
Clean-Param: utm_source&utm_medium&utm_campaign # clean UTM tags
Clean-Param: openstat # and don’t forget about Openstat

Sitemap: # specify the path to the site map
Host: https://site.ru # main mirror

Attention! When copying lines to a file, do not forget to remove all comments (text after #).

This robots.txt variant is the most popular among webmasters who use WP. Is it ideal? No. You can try to add something or, on the contrary, remove something. But keep in mind that mistakes are common when tuning this text file for robots. We will talk about them further.

Robots.txt for Joomla

And although in 2018 few people use Joomla, I believe that this wonderful CMS cannot be ignored. When promoting projects on Joomla, you will certainly have to create a robots file, because how else would you block unnecessary elements from indexing?

As in the previous case, you can create a file manually by simply uploading it to the host, or use a module for these purposes. In both cases, you will have to configure it correctly. This is what the correct option for Joomla will look like:

User-agent: *
Allow: /*.css?*$
Allow: /*.js?*$
Allow: /*.jpg?*$
Allow: /*.png?*$
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

User-agent: Yandex
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

User-agent: GoogleBot
Disallow: /cache/
Disallow: /*.pdf
Disallow: /administrator/
Disallow: /installation/
Disallow: /cli/
Disallow: /libraries/
Disallow: /language/
Disallow: /components/
Disallow: /modules/
Disallow: /includes/
Disallow: /bin/
Disallow: /component/
Disallow: /tmp/
Disallow: /index.php
Disallow: /plugins/
Disallow: /*mailto/

Disallow: /logs/
Disallow: /component/tags*
Disallow: /*%
Disallow: /layouts/

Host: site.ru # don't forget to change the address here to yours
Sitemap: site.ru/sitemap.xml # and here

As a rule, this is enough to prevent unnecessary files from getting into the index.

Errors during setup

Very often people make mistakes when creating and configuring a robots file. Here are the most common ones:

  • The rules are specified only for User-agent.
  • Host and Sitemap are missing.
  • The presence of the http protocol in the Host directive (you only need to specify https).
  • Failure to comply with nesting rules when opening/closing images.
  • UTM and Openstat tags are not closed.
  • Writing host and sitemap directives for each robot.
  • Superficial elaboration of the file.

It is very important to configure this small file correctly. If you make serious mistakes, you can lose a significant part of the traffic, so be extremely careful when setting up.

How to check a file?

For these purposes, it is better to use the special services from Yandex and Google. Since these search engines are the most popular and in demand (most often the only ones used), there is no point in considering search engines such as Bing, Yahoo or Rambler.

First, let's consider the Yandex option. Go to Yandex.Webmaster, then open Tools - Robots.txt analysis.

Here you can check the file for errors, as well as check in real time which pages are open for indexing and which are not. Very convenient.

Google has a similar service. Go to Search Console, find the Crawl section and select the robots.txt Tester tool.

The functionality here is exactly the same as in the Yandex service.

Please note that it shows me 2 errors. This is because Google does not recognize the parameter-clearing directives that I specified for Yandex:

Clean-Param: utm_source&utm_medium&utm_campaign
Clean-Param: openstat

You should not pay attention to this, because Google robots only use the rules written for GoogleBot.

Conclusion

The robots.txt file is very important for the SEO optimization of your website. Approach its setup with full responsibility, because if it is implemented incorrectly, all your efforts can go to waste.

Keep in mind all the instructions I've shared in this article, and don't forget that you don't have to copy my robots variations exactly. It is quite possible that you will have to further understand each of the directives, adjusting the file to suit your specific case.

And if you want to understand robots.txt and building websites on WordPress more deeply, I invite you to keep studying the topic: you will learn how to easily create a website without forgetting to optimize it for search engines.