A command for filtering data using a regular expression. Regular expressions grep, egrep, sed in Linux. Special character classes

Regular expressions are a very powerful tool for searching text by pattern, processing and modifying strings, which can be used to solve many problems. Here are the main ones:

  • Text input check;
  • Search and replace text in a file;
  • Batch renaming of files;
  • Interaction with services such as Apache;
  • Checking a string for matching a pattern.

This list is far from complete; regular expressions let you do much more. To new users they may seem too complicated, because they are written in a special language of their own. But given the capabilities they provide, every Linux system administrator should know regular expressions and be able to use them.

In this article, we'll look at bash regular expressions for beginners so that you can understand all the features of this tool.

There are two types of characters that can be used in regular expressions:

  • ordinary characters;
  • metacharacters.

Ordinary characters are the letters, digits, and punctuation marks that make up any string. All text consists of such characters, and you can use them in regular expressions to find the desired position in the text.

Metacharacters are something else: they are what gives regular expressions their power. With metacharacters you can do much more than search for a single character. You can search for combinations of characters, allow a variable number of characters, and select ranges. All special characters fall into two types: substitution characters, which stand in for ordinary characters, and operators, which indicate how many times a character may be repeated. The general syntax of a regular expression looks like this:

ordinary_character operator

substitution_character operator

  • \ - alphabetic special characters begin with a backslash; it is also used to escape a punctuation character that would otherwise be special;
  • ^ - the beginning of the line;
  • $ - the end of the line;
  • * - the previous character may be repeated zero or more times;
  • + - the previous character must occur one or more times;
  • ? - the previous character may occur zero times or once;
  • {n} - the previous character must be repeated exactly n times;
  • {N,n} - the previous character may be repeated from N to n times;
  • . - any character except a line feed;
  • [az] - any one of the characters listed in the brackets;
  • x|y - character x or character y;
  • [^az] - any character except those listed in the brackets;
  • [a-z] - any character from the specified range;
  • [^a-z] - any character outside the range;
  • \b - a word boundary;
  • \B - a position that is not a word boundary, that is, inside a word; for example, ux\B matches in uxb or tuxedo, but not in Linux;
  • \d - a digit;
  • \D - a non-digit character;
  • \n - the line feed character;
  • \s - a whitespace character: space, tab, and so on;
  • \S - any non-whitespace character;
  • \t - the tab character;
  • \v - the vertical tab character;
  • \w - any word character: a letter, a digit, or an underscore;
  • \W - any character that is not a word character;
  • \uXXX - a Unicode character.

Note that alphabetic special characters must be preceded by a backslash to show that a special character follows. The reverse is also true: if you want to use a character that is special without a backslash as an ordinary character, you have to put a backslash in front of it. Support for the backslash shorthands varies between tools: GNU grep -E understands \w, \s, and \b, for example, but \d requires Perl-compatible mode (grep -P).

For example, suppose you want to find the string 1+2=3 in the text. If you use this string as a regular expression, you will not find it, because the plus is interpreted as a special character meaning that the preceding 1 must occur one or more times. So it has to be escaped: 1\+2=3. Without escaping, the regular expression would match only strings such as 12=3, 112=3, and so on. There is no need to put a backslash in front of the equals sign, because it is not a special character.
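The effect of escaping the plus sign can be checked directly. The following is a minimal sketch that assumes grep with the -E option; the sample lines and variable names are made up for illustration.

```shell
# Sample lines (hypothetical, for illustration only).
lines='1+2=3
12=3
112=3'

# Unescaped: "+" is a repetition operator, so "1+" means "one or more 1s".
unescaped=$(printf '%s\n' "$lines" | grep -E '1+2=3')

# Escaped: "\+" stands for a literal plus sign.
escaped=$(printf '%s\n' "$lines" | grep -E '1\+2=3')

printf 'unescaped pattern matched:\n%s\n\n' "$unescaped"
printf 'escaped pattern matched:\n%s\n' "$escaped"
```

The unescaped pattern selects 12=3 and 112=3 but skips the line that actually contains 1+2=3; the escaped pattern selects only that line.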

Examples of using regular expressions

Now that we've covered the basics and you know how everything works, all that remains is to consolidate the knowledge you've gained about linux grep regular expressions in practice. Two very useful special characters are ^ and $, which indicate the beginning and end of a line. For example, we want to get all users registered in our system whose name starts with s. Then you can apply the regular expression "^s". You can use the egrep command:

egrep "^s" /etc/passwd

If we want to select lines by their last characters, we can use $. For example, let's select all system users without a shell; the records for such users end in false:

egrep "false$" /etc/passwd

To display usernames that begin with s or d, use this expression:

egrep "^[sd]" /etc/passwd

The same result can be obtained with the "|" symbol. The first form is better suited to ranges, while the second is more often used for plain either/or alternatives:

egrep "^(s|d)" /etc/passwd

Now let's select all users whose name is exactly three characters long. The username ends with a colon, so we can say that any word character must be repeated three times before the colon:

egrep "^\w{3}:" /etc/passwd
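To try these patterns without depending on the contents of a real /etc/passwd, here is a small sketch that runs them against made-up records. The sample data and variable names are assumptions; grep -E is used in place of egrep, and the portable class [[:alnum:]_] stands in for \w.

```shell
# Hypothetical passwd-style records; only the first and last fields matter here.
sample='root:x:0:0:root:/root:/bin/bash
sys:x:3:3:sys:/dev:/usr/sbin/nologin
daemon:x:1:1:daemon:/usr/sbin:/bin/false
sshd:x:110:65534::/run/sshd:/bin/false
bob:x:1000:1000::/home/bob:/bin/bash'

# Names starting with "s".
starts_with_s=$(printf '%s\n' "$sample" | grep -E '^s' | cut -d: -f1)
# Records ending in "false".
ends_with_false=$(printf '%s\n' "$sample" | grep -E 'false$' | cut -d: -f1)
# Names starting with "s" or "d".
s_or_d=$(printf '%s\n' "$sample" | grep -E '^[sd]' | cut -d: -f1)
# Names exactly three word characters long.
three_chars=$(printf '%s\n' "$sample" | grep -E '^[[:alnum:]_]{3}:' | cut -d: -f1)

echo "starts with s:  " $starts_with_s
echo "ends in false:  " $ends_with_false
echo "starts with s/d:" $s_or_d
echo "three chars:    " $three_chars
```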

Conclusions

In this article we covered Linux regular expressions, but that was just the basics. If you dig a little deeper, you will find that you can do a lot more interesting things with this tool. Taking the time to master regular expressions will definitely be worth it.

To process text properly in bash scripts with sed and awk, you need to understand regular expressions. Implementations of this extremely useful tool can be found literally everywhere, and although all regular expressions are built in a similar way and rest on the same ideas, working with them in different environments has its own peculiarities. Here we will talk about regular expressions that are suitable for use in Linux command-line scripts.

This material is intended as an introduction to regular expressions, intended for those who may be completely unaware of what they are. So let's start from the very beginning.

What are regular expressions

Many people, when they first see regular expressions, immediately think that they are looking at a meaningless jumble of characters. But this, of course, is far from the case. Take a look at this regex, for example:

^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$

In our opinion, even an absolute beginner will immediately understand how it works and why it is needed :) If you don’t quite understand it, just read on and everything will fall into place.
A regular expression is a pattern that programs like sed or awk use to filter text. Patterns use ordinary ASCII characters, which represent themselves, and so-called metacharacters, which play a special role, for example by allowing reference to certain groups of characters.

Types of Regular Expressions

Implementations of regular expressions in various environments, whether in programming languages like Java, Perl, and Python or in Linux tools like sed, awk, and grep, have their own peculiarities. These depend on the so-called regular expression engines that interpret the patterns.
Linux has two regular expression engines:
  • An engine that supports the POSIX Basic Regular Expression (BRE) standard.
  • An engine that supports the POSIX Extended Regular Expression (ERE) standard.
Most Linux utilities conform to at least the POSIX BRE standard, but some utilities (including sed) understand only a subset of the BRE standard. One of the reasons for this limitation is the desire to make such utilities as fast as possible in text processing.

The POSIX ERE standard is often implemented in programming languages. It allows you to use a large number of tools when developing regular expressions. For example, these could be special sequences of characters for frequently used patterns, such as searching for individual words or sets of numbers in text. Awk supports the ERE standard.

There are many ways to develop regular expressions, depending both on the opinion of the programmer and on the features of the engine for which they are created. It's not easy to write universal regular expressions that any engine can understand. Therefore, we will focus on the most commonly used regular expressions and look at the features of their implementation for sed and awk.
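One visible difference between the two standards is how the interval operator is written. The sketch below uses grep rather than sed or awk, because grep can switch between BRE (the default) and ERE (the -E option); the sample text and variable names are made up for illustration.

```shell
# Two sample lines, made up for this illustration.
text='aa
a{2}'

# BRE (plain grep): unescaped braces are ordinary characters.
bre_literal=$(printf '%s\n' "$text" | grep 'a{2}')

# BRE: the interval operator is written with backslashes.
bre_interval=$(printf '%s\n' "$text" | grep 'a\{2\}')

# ERE (grep -E): unescaped braces are the interval operator.
ere_interval=$(printf '%s\n' "$text" | grep -E 'a{2}')

printf '%-16s -> %s\n' 'BRE a{2}' "$bre_literal"
printf '%-16s -> %s\n' 'BRE a\{2\}' "$bre_interval"
printf '%-16s -> %s\n' 'ERE a{2}' "$ere_interval"
```

The same characters therefore mean different things depending on the engine, which is why a pattern written for one tool may need adjusting for another.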

POSIX BRE regular expressions

Perhaps the simplest BRE pattern is a regular expression for searching for the exact occurrence of a sequence of characters in text. This is what searching for a string looks like in sed and awk:

$ echo "This is a test" | sed -n '/test/p'
$ echo "This is a test" | awk '/test/{print $0}'

Finding text by pattern in sed


Finding text by pattern in awk

You may notice that the search for a given pattern is performed without taking into account the exact location of the text in the line. In addition, the number of occurrences does not matter. After the regular expression finds the specified text anywhere in the string, the string is considered suitable and is passed on for further processing.

When working with regular expressions, you need to take into account that they are case sensitive:

$ echo "This is a test" | awk '/Test/{print $0}'
$ echo "This is a test" | awk '/test/{print $0}'

Regular expressions are case sensitive

The first regular expression did not find any matches because the word “Test”, starting with a capital letter, does not appear in the text. The second, which searches for the word written in lowercase letters, found a suitable line in the stream.

In regular expressions, you can use not only letters, but also spaces and numbers:

$ echo "This is a test 2 again" | awk '/test 2/{print $0}'

Finding a piece of text containing spaces and numbers

Spaces are treated as regular characters by the regular expression engine.

Special symbols

When using various characters in regular expressions, there are a few things to keep in mind. Some special characters, or metacharacters, require special handling when used in a pattern. Here they are:

.*[]^${}\+?|()
If you need one of them in a pattern as a literal character, escape it with a backslash: \ .

For example, if you need to find a dollar sign in the text, you need to include it in the template, preceded by an escape character. Let's say there is a file myfile with the following text:

There is 10$ on my pocket
The dollar sign can be detected using this pattern:

$ awk '/\$/{print $0}' myfile

Using a special character in a pattern

In addition, the backslash is also a special character, so if you need to use it in a pattern, it has to be escaped too. This looks like two backslashes in a row:

$ echo "\ is a special character" | awk '/\\/{print $0}'

Escaping a backslash

Although the forward slash is not included in the list of special characters above, attempting to use it in a regular expression written for sed or awk will result in an error:

$ echo "3 / 2" | awk '///{print $0}'

Incorrect use of forward slash in a pattern

If it is needed, it must also be escaped:

$ echo "3 / 2" | awk '/\//{print $0}'

Escaping a forward slash
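The three escapes discussed above can be tried together. This is a small sketch with made-up input lines and variable names. The awk programs are wrapped in single quotes so that the shell leaves $ and \ alone; inside double quotes the shell would expand $0 before awk ever saw it.

```shell
# Each command prints "found" if the escaped metacharacter occurs in the input.
dollar=$(printf '%s\n' 'There is 10$ in my pocket' | awk '/\$/{print "found"}')
backslash=$(printf '%s\n' 'a \ b' | awk '/\\/{print "found"}')
slash=$(printf '%s\n' '3 / 2' | awk '/\//{print "found"}')

echo "dollar sign:   $dollar"
echo "backslash:     $backslash"
echo "forward slash: $slash"
```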

Anchor symbols

There are two special characters for anchoring a pattern to the beginning or end of a line of text. The caret, ^, lets you describe sequences of characters located at the beginning of a line. If the pattern you are looking for is somewhere else in the line, the regular expression will not respond to it. Its use looks like this:

$ echo "welcome to likegeeks website" | awk '/^likegeeks/{print $0}'
$ echo "likegeeks website" | awk '/^likegeeks/{print $0}'

Finding a pattern at the beginning of a string

The ^ character searches for the pattern at the beginning of a line, and character case is still taken into account. Let's see how this affects the processing of a text file:

$ awk '/^this/{print $0}' myfile


Finding a pattern at the beginning of a line in text from a file

When using sed, if you place a caret somewhere inside the pattern, it is treated like any other ordinary character:

$ echo "This ^ is a test" | sed -n '/s ^/p'

Caret not at the beginning of the pattern in sed

In awk, when using the same pattern, this character must be escaped:

$ echo "This ^ is a test" | awk '/s \^/{print $0}'

Caret not at the beginning of the pattern in awk

We have figured out the search for text fragments located at the beginning of a line. What if you need to find something located at the end of a line?

The dollar sign - $, which is the anchor character for the end of the line, will help us with this:

$ echo "This is a test" | awk '/test$/{print $0}'

Finding text at the end of a line

You can use both anchor symbols in the same template. Let's process the file myfile, the contents of which are shown in the figure below, using the following regular expression:

$ awk '/^this is a test$/{print $0}' myfile


A pattern that uses special characters to start and end a line

As you can see, the template responded only to a line that fully corresponded to the given sequence of characters and their location.

Here is how to filter out empty lines using the anchor characters:

$ awk '!/^$/{print $0}' myfile
This pattern uses the negation symbol, an exclamation mark: ! . The pattern itself matches lines that contain nothing between the beginning and the end of the line, and thanks to the exclamation mark only the lines that do not match it are printed.
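The anchor examples can be reproduced without a file. Below is a minimal sketch in which the sample text stands in for myfile; the text and the variable names are assumptions made for illustration.

```shell
# Sample text with two empty lines; a stand-in for the contents of myfile.
text='likegeeks website

welcome to likegeeks website

this is a test'

# ^ anchors the pattern to the start of the line.
at_start=$(printf '%s\n' "$text" | awk '/^likegeeks/')

# ^ and $ together require the whole line to match.
whole_line=$(printf '%s\n' "$text" | awk '/^this is a test$/')

# !/^$/ selects every line that is not empty; count them.
non_empty=$(printf '%s\n' "$text" | awk '!/^$/{n++} END{print n}')

echo "starts with likegeeks: $at_start"
echo "exact line:            $whole_line"
echo "non-empty lines:       $non_empty"
```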

Dot symbol

The period matches any single character except the newline character. Let's apply this regular expression to the file myfile, whose contents are given below:

$ awk '/.st/{print $0}' myfile


Using a dot in regular expressions

As the output shows, only the first two lines of the file match the pattern, since they contain the sequence “st” preceded by another character. The third line does not contain a suitable sequence, and the fourth does contain “st”, but at the very beginning of the line, with no character before it.

Character classes

A dot matches any single character, but what if you want more control over the set of characters you are looking for? In that situation you can use character classes.

With this approach you can search for any character from a given set. Square brackets are used to describe a character class:

$ awk '/[oi]th/{print $0}' myfile


Description of a character class in a regular expression

Here we are looking for a sequence of "th" characters preceded by an "o" character or an "i" character.

Classes come in handy when searching for words that can begin with either an uppercase or lowercase letter:

$ echo "this is a test" | awk '/[Tt]his is a test/{print $0}'
$ echo "This is a test" | awk '/[Tt]his is a test/{print $0}'

Search for words that may begin with a lowercase or uppercase letter

Character classes are not limited to letters. Other symbols can be used here. It is impossible to say in advance in what situation classes will be needed - it all depends on the problem being solved.

Negation of character classes

Character classes can also be used to solve the inverse of the problem described above. Instead of searching for the characters included in a class, you can search for everything that is not in the class. To get this behavior, place a ^ sign in front of the list of class characters. It looks like this:

$ awk '/[^oi]th/{print $0}' myfile


Finding characters not in a class

In this case, sequences of “th” characters will be found that are preceded by neither “o” nor “i”.

Character ranges

In character classes, you can describe ranges of characters using dashes:

$ awk '/[e-p]st/{print $0}' myfile


Description of a range of characters in a character class

In this example, the regular expression responds to the sequence “st” preceded by any character that lies, in alphabetical order, between “e” and “p”.

Ranges can also be created from numbers:

$ echo "123" | awk '/[0-9][0-9][0-9]/'
$ echo "12a" | awk '/[0-9][0-9][0-9]/'

Regular expression to find any three digits

A character class can include several ranges:

$ awk '/[a-fm-z]st/{print $0}' myfile


A character class consisting of several ranges

This regular expression will find all sequences of “st” preceded by characters from ranges a-f and m-z.
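Character classes, negated classes, and ranges can be compared side by side. The following sketch feeds a few made-up words to awk instead of using myfile; the word list and the variable names are assumptions.

```shell
# Hypothetical word list, one word per line.
words='with
both
fifth
the
month'

# "th" preceded by "o" or "i".
in_class=$(printf '%s\n' "$words" | awk '/[oi]th/')

# "th" preceded by some character other than "o" or "i".
negated=$(printf '%s\n' "$words" | awk '/[^oi]th/')

# Three digits in a row, written with a range.
three_digits=$(printf '%s\n' 123 12a | awk '/[0-9][0-9][0-9]/')

# First letter in either case.
either_case=$(printf '%s\n' 'this is a test' 'This is a test' | awk '/[Tt]his is a test/')

echo "[oi]th:  " $in_class
echo "[^oi]th: " $negated
echo "digits:  " $three_digits
printf '%s\n' "$either_case"
```

Note that "the" is selected by neither of the first two patterns: both require some character in front of "th", and a negated class still has to match a character.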

Special character classes

BRE has special character classes that can be used when writing regular expressions:
  • [[:alpha:]] - matches any alphabetic character, written in upper or lower case.
  • [[:alnum:]] - matches any alphanumeric character, namely characters in the ranges 0-9, A-Z, a-z.
  • [[:blank:]] - matches a space and a tab character.
  • [[:digit:]] - any digit character from 0 to 9.
  • [[:upper:]] - uppercase alphabetic characters, A-Z.
  • [[:lower:]] - lowercase alphabetic characters, a-z.
  • [[:print:]] - matches any printable character.
  • [[:punct:]] - matches punctuation marks.
  • [[:space:]] - whitespace characters, in particular - space, tab, characters NL, FF, VT, CR.
You can use special classes in templates like this:

$ echo "abc" | awk '/[[:alpha:]]/{print $0}'
$ echo "abc" | awk '/[[:digit:]]/{print $0}'
$ echo "abc123" | awk '/[[:digit:]]/{print $0}'


Special character classes in regular expressions
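The named classes work the same way in grep. A short sketch follows, with made-up sample strings and variable names; the -o option, which prints only the matched part of a line, is available in GNU and BSD grep.

```shell
# Sample strings, one per line (hypothetical).
samples='abc
123
abc123'

# Lines containing at least one letter.
with_letters=$(printf '%s\n' "$samples" | grep '[[:alpha:]]')

# Lines containing at least one digit.
with_digits=$(printf '%s\n' "$samples" | grep '[[:digit:]]')

# Only the run of digits itself (-o prints just the matching part).
digit_run=$(printf '%s\n' 'abc123' | grep -Eo '[[:digit:]]+')

echo "letters:  " $with_letters
echo "digits:   " $with_digits
echo "digit run:" $digit_run
```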

Star symbol

If you place an asterisk after a character in a pattern, this will mean that the regular expression will work if the character appears in the string any number of times - including the situation when the character is absent in the string.

$ echo "test" | awk '/tes*t/{print $0}'
$ echo "tessst" | awk '/tes*t/{print $0}'


Using the * character in regular expressions

This metacharacter is typically used for words that are often misspelled, or for words that have more than one correct spelling:

$ echo "I like green color" | awk '/colou*r/{print $0}'
$ echo "I like green colour" | awk '/colou*r/{print $0}'

Finding a word with different spellings

In this example, the same regular expression responds to both the word "color" and the word "colour". This is so due to the fact that the character “u”, followed by an asterisk, can either be absent or appear several times in a row.

Another useful feature that comes from the asterisk symbol is to combine it with a dot. This combination allows the regular expression to respond to any number of any characters:

$ awk '/this.*test/{print $0}' myfile


A template that responds to any number of any characters

In this case, it doesn’t matter how many and what characters are between the words “this” and “test”.

The asterisk can also be used with character classes:

$ echo "st" | awk '/s[ae]*t/{print $0}'
$ echo "sat" | awk '/s[ae]*t/{print $0}'
$ echo "set" | awk '/s[ae]*t/{print $0}'


Using an asterisk with character classes

In all three examples, the regular expression works because the asterisk after the character class means that if any number of "a" or "e" characters are found, or if none are found, the string will match the given pattern.
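The asterisk examples can be run in one go, including a few words that should not match. This is a sketch with made-up word lists and variable names.

```shell
# "s" repeated zero or more times between "te" and "t".
star=$(printf '%s\n' tet test tessst | awk '/tes*t/')

# Optional (or repeated) "u" covers both spellings, but "colr" lacks the "o".
spelling=$(printf '%s\n' color colour colr | awk '/colou*r/')

# Any number of "a" or "e" characters between "s" and "t"; "sit" does not fit.
class_star=$(printf '%s\n' st sat set sit | awk '/s[ae]*t/')

echo "tes*t:    " $star
echo "colou*r:  " $spelling
echo "s[ae]*t:  " $class_star
```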

POSIX ERE regular expressions

POSIX ERE patterns, which some Linux utilities support, may contain additional characters. As already mentioned, awk supports this standard, while sed does not by default (GNU sed accepts ERE patterns when run with the -E option).

Here we will look at the most commonly used symbols in ERE patterns, which will be useful to you when creating your own regular expressions.

▍Question mark

A question mark indicates that the preceding character may appear once or not at all in the text. This character is one of the repetition metacharacters. Here are some examples:

$ echo "tet" | awk '/tes?t/{print $0}'
$ echo "test" | awk '/tes?t/{print $0}'
$ echo "tesst" | awk '/tes?t/{print $0}'


Question mark in regular expressions

As you can see, in the third case the letter “s” appears twice, so the regular expression does not respond to the word “tesst”.

The question mark can also be used with character classes:

$ echo "tst" | awk '/t[ae]?st/{print $0}'
$ echo "test" | awk '/t[ae]?st/{print $0}'
$ echo "tast" | awk '/t[ae]?st/{print $0}'
$ echo "taest" | awk '/t[ae]?st/{print $0}'
$ echo "teest" | awk '/t[ae]?st/{print $0}'


Question mark and character classes

If there are no characters from the class in the line, or one of them occurs once, the regular expression works, but as soon as two characters appear in the word, the system no longer finds a match for the pattern in the text.

▍Plus symbol

The plus character in the pattern indicates that the regular expression will match what it is looking for if the preceding character occurs one or more times in the text. However, this construction will not react to the absence of a symbol:

$ echo "test" | awk '/te+st/{print $0}'
$ echo "teest" | awk '/te+st/{print $0}'
$ echo "tst" | awk '/te+st/{print $0}'


The plus symbol in regular expressions

In this example, if there is no “e” character in the word, the regular expression engine will not find matches to the pattern in the text. The plus symbol also works with character classes - in this way it is similar to an asterisk and question mark:

$ echo "tst" | awk '/t[ae]+st/{print $0}'
$ echo "test" | awk '/t[ae]+st/{print $0}'
$ echo "teast" | awk '/t[ae]+st/{print $0}'
$ echo "teeast" | awk '/t[ae]+st/{print $0}'


Plus sign and character classes

In this case, if the line contains any character from the class, the text will be considered to match the pattern.

▍Curly braces

Curly braces, which can be used in ERE patterns, are similar to the symbols discussed above, but they let you specify more precisely how many times the preceding character must occur. The restriction can be written in two forms:
  • {n} - a number giving the exact number of occurrences to look for;
  • {n,m} - two numbers that are interpreted as “at least n times, but no more than m”.
Here are examples of the first form:

$ echo "tst" | awk '/te{1}st/{print $0}'
$ echo "test" | awk '/te{1}st/{print $0}'

Curly braces in patterns, searching for the exact number of occurrences

In older versions of awk you had to use the --re-interval command line option to make the program recognize intervals in regular expressions, but in newer versions this is not necessary.

$ echo "tst" | awk '/te{1,2}st/{print $0}'
$ echo "test" | awk '/te{1,2}st/{print $0}'
$ echo "teest" | awk '/te{1,2}st/{print $0}'
$ echo "teeest" | awk '/te{1,2}st/{print $0}'


Interval specified in curly braces

In this example, the character “e” must appear 1 or 2 times in the line, then the regular expression will respond to the text.

Curly braces can also be used with character classes. The principles you already know apply here:

$ echo "tst" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "test" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "teest" | awk '/t[ae]{1,2}st/{print $0}'
$ echo "teeast" | awk '/t[ae]{1,2}st/{print $0}'


Curly braces and character classes

The template will react to the text if it contains the character “a” or the character “e” once or twice.
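To see how the ERE repetition operators differ, the same word list can be passed through each of them. This sketch uses grep -E rather than awk, since some awk implementations handle interval expressions differently; the word list and variable names are made up for illustration.

```shell
# Words with zero, one, two, and three "e" characters.
words='tst
test
teest
teeest'

optional=$(printf '%s\n' "$words" | grep -E 'te?st')        # zero or one "e"
one_or_more=$(printf '%s\n' "$words" | grep -E 'te+st')     # one or more "e"
exactly_one=$(printf '%s\n' "$words" | grep -E 'te{1}st')   # exactly one "e"
one_or_two=$(printf '%s\n' "$words" | grep -E 'te{1,2}st')  # one or two "e"

echo "te?st:     " $optional
echo "te+st:     " $one_or_more
echo "te{1}st:   " $exactly_one
echo "te{1,2}st: " $one_or_two
```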

▍Logical “or” symbol

Symbol | - a vertical bar means a logical “or” in regular expressions. When processing a regular expression containing several fragments separated by such a sign, the engine will consider the analyzed text suitable if it matches any of the fragments. Here's an example:

$ echo "This is a test" | awk '/test|exam/{print $0}'
$ echo "This is an exam" | awk '/test|exam/{print $0}'
$ echo "This is something else" | awk '/test|exam/{print $0}'


Logical "or" in regular expressions

In this example, the regular expression is set up to search the text for the words “test” or “exam”. Note that there must be no spaces between the pattern fragments and the | character that separates them.

Regular expression fragments can be grouped using parentheses. If you group a certain sequence of characters, it will be perceived by the system as an ordinary character. That is, for example, repetition metacharacters can be applied to it. This is what it looks like:

$ echo "Like" | awk '/Like(Geeks)?/{print $0}'
$ echo "LikeGeeks" | awk '/Like(Geeks)?/{print $0}'


Grouping regular expression fragments

In these examples, the word “Geeks” is enclosed in parentheses, followed by a question mark. Recall that a question mark means “0 or 1 repetition,” so the regular expression will respond to both the string “Like” and the string “LikeGeeks.”
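One thing worth keeping in mind is that an unanchored pattern such as Like(Geeks)? also matches any longer string that merely contains "Like". The sketch below shows alternation in awk and then uses grep -E with the -x option (match whole lines only) to make the optional group meaningful; the sample strings and variable names are assumptions.

```shell
# Alternation: keep lines containing either word.
alternation=$(printf '%s\n' 'This is a test' 'This is an exam' 'This is something else' \
    | awk '/test|exam/')

candidates='Like
LikeGeeks
LikeGeeksGeeks
Likes'

# Unanchored: every candidate contains "Like", so all four lines match.
unanchored=$(printf '%s\n' "$candidates" | grep -E 'Like(Geeks)?')

# Whole-line match: "Like" followed by at most one "Geeks".
whole_line=$(printf '%s\n' "$candidates" | grep -E -x 'Like(Geeks)?')

printf '%s\n' "$alternation"
echo "unanchored:" $unanchored
echo "whole line:" $whole_line
```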

Practical examples

Now that we've covered the basics of regular expressions, it's time to do something useful with them.

▍Counting the number of files

Let's write a bash script that counts the files located in the directories listed in the PATH environment variable. To do this, you first need to generate a list of directory paths. Let's do this using sed, replacing the colons with spaces:

$ echo $PATH | sed "s/:/ /g"
The substitute command supports regular expressions as patterns for searching text. In this case everything is extremely simple, we are looking for the colon character, but nothing prevents us from using something else here; it all depends on the specific task.
Now you need to loop over the resulting list and perform the actions needed to count the files. The general outline of the script looks like this:

mypath=$(echo $PATH | sed "s/:/ /g")
for directory in $mypath
do
    :  # placeholder: count the files in $directory here
done
Now let’s write the full text of the script, using the ls command to obtain information about the number of files in each directory:

#!/bin/bash
# Turn the colon-separated PATH into a space-separated list
mypath=$(echo $PATH | sed "s/:/ /g")
count=0
for directory in $mypath
do
    # List the directory and count its entries one by one
    check=$(ls $directory)
    for item in $check
    do
        count=$((count + 1))
    done
    echo "$directory - $count"
    count=0
done
When running the script, it may turn out that some directories from PATH do not exist, however, this will not prevent it from counting files in existing directories.


File counting

The main value of this example is that using the same approach, you can solve much more complex problems. Which ones exactly depends on your needs.
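As a variation on the same idea, the sketch below runs against a throwaway directory tree so that the result is predictable. It skips missing directories explicitly and counts entries with a glob instead of parsing ls output. The directory names, the fake_path variable, and the report format are all assumptions made for illustration.

```shell
# Build a small temporary tree: two directories with 2 and 1 files.
tmp=$(mktemp -d)
mkdir "$tmp/bin1" "$tmp/bin2"
touch "$tmp/bin1/a" "$tmp/bin1/b" "$tmp/bin2/c"

# A stand-in for $PATH; the last entry deliberately does not exist.
fake_path="$tmp/bin1:$tmp/bin2:$tmp/missing"

report=""
for directory in $(echo "$fake_path" | sed 's/:/ /g'); do
    # Skip entries that are not existing directories.
    [ -d "$directory" ] || continue
    count=0
    for item in "$directory"/*; do
        # The glob stays unexpanded in an empty directory, so test each item.
        if [ -e "$item" ]; then
            count=$((count + 1))
        fi
    done
    # ${directory##*/} keeps only the last path component.
    report="$report${directory##*/}=$count "
done

echo "$report"
rm -rf "$tmp"
```

To run it against the real search path, replace "$fake_path" with "$PATH"; the word splitting used here assumes that the directory names contain no spaces.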

▍Verifying email addresses

There are websites with huge collections of regular expressions for checking email addresses, phone numbers, and so on. However, it’s one thing to take something ready-made, and quite another to create something yourself. So let's write a regular expression to check email addresses. Let's start by analyzing the source data. Here, for example, is a certain address:

[email protected]
The username part can consist of alphanumeric characters and a few others: a dot, a dash, an underscore, and a plus sign. The username is followed by an @ sign.

Armed with this knowledge, let's start assembling the regular expression from its left side, which is used to check the username. Here's what we got:

^([a-zA-Z0-9_.+-]+)@
This regular expression can be read as follows: “The line must begin with at least one character from those in the group specified in square brackets, followed by an @ sign.”

Now it is the turn of the host name. The same rules apply here as for the username, so its pattern looks like this:

([a-zA-Z0-9_.-]+)
The top-level domain name is subject to special rules. It may contain only alphabetic characters, of which there must be at least two (for example, such domains usually contain a country code) and no more than five. All this means that the pattern for checking the last part of the address looks like this:

\.([a-zA-Z]{2,5})$
You can read it like this: “First there must be a period, then 2 to 5 alphabetic characters, and after that the line ends.”

Having prepared templates for individual parts of the regular expression, let's put them together:

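^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$
Now all that remains is to test the result:

$ echo "[email protected]" | awk '/^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$/{print $0}'
$ echo "[email protected]" | awk '/^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$/{print $0}'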


Validating an email address using regular expressions

The fact that the text passed to awk is displayed on the screen means that the system recognized it as an email address.
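Here is a small sketch of how such a check might be wrapped in a reusable shell function. The pattern is written from the character sets described in the text (letters, digits, dot, dash, underscore, and plus for the username; two to five letters for the top-level domain) using portable bracket expressions. The function name is_email, the variable names, and the test addresses are hypothetical, and grep -E is used for the matching.

```shell
# Pattern assembled from the three parts described above.
email_re='^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9_.-]+)\.([a-zA-Z]{2,5})$'

# Succeeds (exit status 0) if the argument looks like an email address.
is_email() {
    printf '%s\n' "$1" | grep -Eq "$email_re"
}

results=$(
    for candidate in 'user.name+tag@example.com' 'user@example' 'user name@example.org'; do
        if is_email "$candidate"; then
            echo "valid: $candidate"
        else
            echo "invalid: $candidate"
        fi
    done
)
printf '%s\n' "$results"
```

The second address fails because it has no top-level domain, and the third because a space is not among the characters allowed in the username. A pattern like this is only a rough filter: it accepts some strings that are not deliverable addresses and rejects some valid ones, such as those with longer top-level domains.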

Results

If the regular expression for checking email addresses that you came across at the very beginning of the article seemed completely incomprehensible then, we hope that now it no longer looks like a meaningless set of characters. If so, this material has fulfilled its purpose. In fact, regular expressions are a topic you can study for a lifetime, but even the little we have covered can already help you write scripts that process text in fairly advanced ways.

In this series of materials we have usually shown very simple examples of bash scripts that consisted of literally a few lines. Next time we'll look at something bigger.

Dear readers! Do you use regular expressions when processing text in command line scripts?

grep stands for "global regular expression print". grep extracts from text files the lines that contain user-specified text.

grep can be used in two ways: on its own or in combination with pipes.

grep is very rich in functionality thanks to the large number of options it supports, such as searching by a fixed string, by a regular expression pattern, by Perl-compatible regular expressions, and so on.

Because of this range of functionality, the grep tool has several variants, including egrep (Extended GREP), fgrep (Fixed GREP), pgrep (Process GREP), rgrep (Recursive GREP), and so on. These variants differ only slightly from the original grep.

grep options

$ grep -V
grep (GNU grep) 2.10
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+

There are modifications of the grep utility: egrep (with extended regular expression processing), fgrep (which treats $*^|()\ symbols as literals, i.e. literally), rgrep (with recursive search enabled).

    egrep is the same as grep -E

    fgrep is the same as grep -F

    rgrep is the same as grep -r

    grep [-b] [-c] [-i] [-l] [-n] [-s] [-v] restricted_regex_BRE [file ...]

The grep command matches the lines of the source files against the pattern specified by restricted_regex_BRE. If no files are specified, standard input is used. Normally, each successfully matched line is copied to standard output; if there are several source files, the file name is printed before the matched line. grep uses a compact, non-deterministic algorithm. Restricted regular expressions (expressions made up of character strings with their own meanings, using a limited set of alphanumeric and special characters) are treated as patterns. They have the same meaning as regular expressions in ed.

To protect the characters $, *, [, ^, |, (, ), and \ from interpretation by the shell, it is easiest to enclose restricted_regex_BRE in single quotes.

Options:

    -b Prefaces each line with the number of the block in which it was found. This can be useful when locating blocks by context (blocks are numbered starting from 0).
    -c Prints only the number of lines that contain the pattern.
    -h Suppresses printing of the file name before each matched line. Used when searching across multiple files.
    -i Ignores case when making comparisons.
    -l Prints only the names of the files containing matching lines, one per line. If the pattern is found on several lines of a file, the file name is not repeated.
    -n Prints before each line its number in the file (lines are numbered starting from 1).
    -s Suppresses messages about non-existent or unreadable files.
    -v Prints all lines except those containing the pattern.
    -w Searches for the expression as a word, as if it were surrounded by the metacharacters \< and \>.
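A few of these options can be tried on a small temporary file. This is a minimal sketch; the file contents and variable names are made up for illustration.

```shell
# Create a throwaway file with five sample lines.
tmp=$(mktemp)
printf '%s\n' 'Error: disk full' 'warning: low memory' 'error: timeout' 'all good' 'errors' > "$tmp"

count_ci=$(grep -ci 'error' "$tmp")     # -c with -i: count matching lines, ignoring case
numbered=$(grep -n 'warning' "$tmp")    # -n: prefix each match with its line number
inverted=$(grep -vi 'error' "$tmp")     # -v: lines that do NOT match
whole_word=$(grep -w 'error' "$tmp")    # -w: "error" only as a separate word

rm -f "$tmp"

echo "count (case-insensitive): $count_ci"
echo "with line number:         $numbered"
echo "not matching:             " $inverted
echo "whole word:               $whole_word"
```

The -w search skips both "Error: disk full" (different case) and "errors" (the match is part of a longer word).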

grep --help

Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
By default, PATTERN is a basic regular expression (BRE).
Example: grep -i "hello world" menu.h main.c

Regexp selection and interpretation:
  -E, --extended-regexp     PATTERN is an extended regular expression (ERE)
  -F, --fixed-strings       PATTERN is a set of newline-separated fixed strings
  -G, --basic-regexp        PATTERN is a basic regular expression (BRE)
  -P, --perl-regexp         PATTERN is a Perl regular expression
  -e, --regexp=PATTERN      use PATTERN for matching
  -f, --file=FILE           obtain PATTERN from FILE
  -i, --ignore-case         ignore case distinctions
  -w, --word-regexp         force PATTERN to match only whole words
  -x, --line-regexp         force PATTERN to match only whole lines
  -z, --null-data           a data line ends in a null byte, not a newline

Miscellaneous:
  -s, --no-messages         suppress error messages
  -v, --invert-match        select non-matching lines
  -V, --version             print version information and exit
      --help                show this help and exit
      --mmap                ignored, kept for backwards compatibility

Output control:
  -m, --max-count=NUM       stop after NUM matches
  -b, --byte-offset         print the byte offset with output lines
  -n, --line-number         print the line number with output lines
      --line-buffered       flush output on every line
  -H, --with-filename       print the file name for each match
  -h, --no-filename         suppress the file name prefix on output
      --label=LABEL         use LABEL as the file name for standard input
  -o, --only-matching       show only the part of a line matching PATTERN
  -q, --quiet, --silent     suppress all normal output
      --binary-files=TYPE   assume that binary files are TYPE;
                            TYPE is binary, text, or without-match
  -a, --text                same as --binary-files=text
  -I                        same as --binary-files=without-match
  -d, --directories=ACTION  how to handle directories;
                            ACTION is read, recurse, or skip
  -D, --devices=ACTION      how to handle devices, FIFOs and sockets;
                            ACTION is read or skip
  -R, -r, --recursive       same as --directories=recurse
      --include=FILE_PATTERN  search only files that match FILE_PATTERN
      --exclude=FILE_PATTERN  skip files and directories matching FILE_PATTERN
      --exclude-from=FILE   skip files matching any file pattern from FILE
      --exclude-dir=PATTERN directories that match PATTERN will be skipped
  -L, --files-without-match print only names of FILEs containing no match
  -l, --files-with-matches  print only names of FILEs containing matches
  -c, --count               print only a count of matching lines per FILE
  -T, --initial-tab         make tabs line up (if needed)
  -Z, --null                print a 0 byte after the FILE name

Context control:
  -B, --before-context=NUM  print NUM lines of leading context
  -A, --after-context=NUM   print NUM lines of trailing context
  -C, --context=NUM         print NUM lines of output context
  -NUM                      same as --context=NUM
      --color[=WHEN],
      --colour[=WHEN]       use markers to highlight the matching strings;
                            WHEN is always, never, or auto
  -U, --binary              do not strip CR characters at end of line (MSDOS)
  -u, --unix-byte-offsets   report offsets as if CRs were not there (MSDOS)

"egrep" means "grep -E". "fgrep" means "grep -F".
Direct invocation as either "egrep" or "fgrep" is deprecated.
When FILE is not specified, or when FILE is -, standard input is read.
If fewer than two files are specified, -h is assumed.
The exit code is 0 if a match is found and 1 if not.
If an error occurs and the -q option is not specified, the exit code is 2.
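The exit codes described at the end of the help text can be checked directly from the shell. The following is a minimal sketch, assuming GNU grep; the path /nonexistent-demo-dir/file is a hypothetical name chosen only because it should not exist.

```shell
# Capture grep's exit code in three situations: match, no match, error.
found=0
printf 'hello\n' | grep -q hello || found=$?      # a line matches

missing=0
printf 'hello\n' | grep -q bye || missing=$?      # no line matches

failed=0
grep -q hello /nonexistent-demo-dir/file 2>/dev/null || failed=$?   # unreadable file

echo "match: $found, no match: $missing, error: $failed"
```

Because -q suppresses normal output, this form is convenient in scripts that only need to know whether a pattern occurs.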

Background and source: Not everyone who has to use regular expressions fully understands how they work or how to create them. I also belonged to this group: I looked for examples of regular expressions suitable for my tasks and tried to adjust them as necessary. Everything changed radically for me after reading the book The Linux Command Line (Second Internet Edition) by William E. Shotts, Jr. It sets out the principles of how regular expressions work so clearly that after reading it I learned to understand them and to create regular expressions of any complexity, and I now use them whenever necessary. This material is a translation of the part of the chapter devoted to regular expressions. This material is intended for absolute beginners who have absolutely no idea how regular expressions work, but have some understanding of how . I hope this article helps you make the same breakthrough that it helped me make. If the material presented here does not contain anything new to you, try the article “Regular expressions and the grep command”, which describes grep options in more detail and gives additional examples.

How regular expressions are used

Text data plays an important role in all Unix-like systems, such as Linux. Among other things, text is the output of console programs, configuration files, reports, and so on. Regular expressions are (perhaps) one of the most difficult concepts in working with text, since they involve a high level of abstraction. But the time spent studying them will more than pay off. If you know how to use regular expressions, you can do amazing things, although their full value may not be immediately obvious.

This article looks at using regular expressions with the grep command. But their use is not limited to this: regular expressions are supported by other Linux commands and by many programming languages, and they are used in configuration (for example, in the mod_rewrite rule settings in Apache). Some programs with a graphical interface also let you define search, copy and delete rules using regular expressions. Even in the popular office program Microsoft Word, you can use regular expressions and wildcard characters to find and replace text.

What are regular expressions?

In simple terms, a regular expression is a symbolic notation for a pattern that is searched for in text. Regular expressions are supported by many command line tools and most programming languages and are used to help solve text manipulation problems. However, as if their complexity were not enough, not all regular expressions are created equal. They vary slightly from tool to tool and from one programming language to another. For our discussion, we will limit ourselves to the regular expressions described in the POSIX standard (which covers most command line tools), as opposed to many programming languages (most notably Perl), which use slightly larger and richer sets of notations.

grep

The main program we'll use for regular expressions is our old friend, grep. The name "grep" actually comes from the phrase "global regular expression print", so we can see that grep has something to do with regular expressions. Essentially, grep searches text files for text that matches a specified regular expression and prints to standard output any line that contains a match.

grep can search for text received in standard input, for example:

ls /usr/bin | grep zip

This command will list files in the /usr/bin directory whose names contain the substring "zip".

The grep program can search for text in files.

General usage syntax:

grep [options] regex [file...]

  • regex is a regular expression.
  • [file...] is one or more files that will be searched using the regular expression.

[options] and [file...] are optional.

List of the most commonly used grep options:

Option Description
-i Ignore case. Do not distinguish between uppercase and lowercase characters. Can also be specified as --ignore-case.
-v Invert match. Normally grep prints the lines that contain a match. This option causes grep to print every line that does not contain a match. Can also be specified as --invert-match.
-c Print the number of matching lines (or non-matching lines if the -v option is also specified) instead of the lines themselves. Can also be specified as --count.
-l Instead of the lines themselves, print the name of each file that contains a match. Can also be specified as --files-with-matches.
-L Like the -l option, but print only the names of files that do not contain matches. Can also be specified as --files-without-match.
-n Prefix each matching line with its line number within the file. Can also be specified as --line-number.
-h When searching multiple files, suppress the output of file names. Can also be specified as --no-filename.

To explore grep more fully, let's create some text files to search for:

ls /bin > dirlist-bin.txt
ls /usr/bin > dirlist-usr-bin.txt
ls /sbin > dirlist-sbin.txt
ls /usr/sbin > dirlist-usr-sbin.txt
ls dirlist*.txt
dirlist-bin.txt dirlist-sbin.txt dirlist-usr-bin.txt dirlist-usr-sbin.txt

We can do a simple search through our list of files like this:

grep bzip dirlist*.txt
dirlist-bin.txt:bzip2
dirlist-bin.txt:bzip2recover

In this example, grep searches all listed files for the string bzip and finds two matches, both in the file dirlist-bin.txt. If we are only interested in the list of files containing the matches, and not the matching strings themselves, we can specify the option -l:

grep -l bzip dirlist*.txt
dirlist-bin.txt

Conversely, if we only wanted to see a list of files that did not contain matches, we could do this:

grep -L bzip dirlist*.txt
dirlist-sbin.txt
dirlist-usr-bin.txt
dirlist-usr-sbin.txt

If there is no output, it means that no files satisfying the conditions were found.
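The options from the table can be tried on small sample files without touching the real directory listings. This is a sketch only: the file names list-a.txt and list-b.txt and their contents are made up for the illustration, and GNU grep is assumed.

```shell
# Work in a throwaway directory so no real files are touched.
tmp=$(mktemp -d)
printf 'bzip2\nbzip2recover\ncat\n' > "$tmp/list-a.txt"   # hypothetical sample list
printf 'gzip\nunzip\nzip\n'         > "$tmp/list-b.txt"   # hypothetical sample list

with_match=$(cd "$tmp" && grep -l bzip list-*.txt)      # files containing a match
without_match=$(cd "$tmp" && grep -L bzip list-*.txt)   # files with no match
counts=$(cd "$tmp" && grep -c zip list-*.txt)           # matching lines per file
numbered=$(cd "$tmp" && grep -n zip list-b.txt)         # line numbers in one file

printf '%s\n' "$with_match" "$without_match" "$counts" "$numbered"
rm -rf "$tmp"
```

Note that -c counts matching lines, not individual occurrences of the pattern.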

Metacharacters and literals

Although it may not seem obvious, our grep searches always use regular expressions, albeit very simple ones. The regular expression "bzip" means that a line is considered a match only if it contains at least four characters and, somewhere in the line, the characters "b", "z", "i" and "p" appear in that order with no other characters in between. The characters in the "bzip" string are literals, i.e. literal symbols, because they match themselves. In addition to literals, regular expressions can also include metacharacters, which are used to specify more complex matches. Regular expression metacharacters consist of the following:

^ $ . [ ] { } - ? * + ( ) | \

All other characters are considered literals. The backslash character can have different meanings. It is used in several cases to create meta-sequences, and also allows metacharacters to be escaped and treated not as metacharacters, but as literals.

Note: as we can see, many regular expression metacharacters are also characters that have a special meaning to the shell (they trigger expansion). When specifying a regular expression that contains shell metacharacters, it is imperative to enclose it in quotes; otherwise the shell will interpret them in its own way and break your command.
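Quoting and backslash escaping can be seen together in a small sketch. The sample lines are invented for the illustration; the patterns are single-quoted so that the shell passes them to grep unchanged.

```shell
# Three sample lines: a literal dot, a letter, and an asterisk in the middle.
sample() { printf '%s\n' 'a.c' 'abc' 'a*c'; }

any_char=$(sample | grep 'a.c')       # unescaped dot: matches any character
literal_dot=$(sample | grep 'a\.c')   # escaped dot: matches only a real "."

printf 'unescaped:\n%s\nescaped:\n%s\n' "$any_char" "$literal_dot"
```

The backslash turns the metacharacter back into a literal, which is why only the first line survives the second search.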

Any character

The first metacharacter we will look at is the dot, which means "any character". If we include it in a regular expression, it will match any character at that character position. Example:

grep -h ".zip" dirlist*.txt
bunzip2
bzip2
bzip2recover
gunzip
gzip
funzip
gpg-zip
mzip
p7zip
preunzip
prezip
prezip-bin
unzip
unzipsfx

We looked for any line in our files that matched the regular expression ".zip". There are a couple of interesting things to note in the results. Notice that the zip program was not found. This is because including the dot metacharacter in our regular expression increased the length required for a match to four characters, and since the name "zip" contains only three, it does not match. Also, if any of the files in our lists had a .zip file extension, they would also be matched, since the dot character in the file extension also satisfies the "any character" condition.

Anchors

The caret ( ^ ) and the dollar sign ( $ ) are treated as anchors in regular expressions. This means that they cause a match only if the regular expression is found at the beginning of the line ( ^ ) or at the end of the line ( $ ):

grep -h "^zip" dirlist*.txt
zip
zipcloak
zipdetails
zipgrep
zipinfo
zipnote
zipsplit

grep -h "zip$" dirlist*.txt
gunzip
gzip
funzip
gpg-zip
mzip
p7zip
preunzip
prezip
unzip
zip

grep -h "^zip$" dirlist*.txt
zip

Here we searched the lists of files for the string "zip" located at the beginning of the line, at the end of the line, and in a line where it is at both the beginning and the end (i.e. the entire line contains only "zip"). Note that the regular expression "^$" (a beginning and an end with nothing in between) will match empty lines.
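The behaviour of the two anchors, including the "^$" pattern for empty lines, can be checked on a short made-up list. The names below are illustrative only; the last line of the sample is deliberately empty.

```shell
# Four names followed by one empty line.
names() { printf 'zip\nzipinfo\ngunzip\nunzipsfx\n\n'; }

starts=$(names | grep '^zip')        # "zip" at the beginning of the line
ends=$(names | grep 'zip$')          # "zip" at the end of the line
exact=$(names | grep '^zip$')        # the whole line is exactly "zip"
empty_lines=$(names | grep -c '^$')  # number of empty lines

printf 'starts:\n%s\nends:\n%s\nexact:\n%s\nempty lines: %s\n' \
  "$starts" "$ends" "$exact" "$empty_lines"
```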

A short digression: a crossword puzzle assistant

Even with our limited knowledge of regular expressions at this point, we can do something useful.

If you've ever done crossword puzzles, you've had to solve problems like "what's the five-letter word whose third letter is 'j' and last letter is 'r' that means...". This kind of question can make you think. Did you know that your Linux system has a dictionary? It does. Look in the /usr/share/dict directory and you may find one or more dictionaries there. The dictionaries located there are simply long lists of words, one per line, arranged in alphabetical order. On my system the dictionary file contains 99171 words. To search for possible answers to the above crossword question, we can do this:

grep -i "^..j.r$" /usr/share/dict/american-english
Major
major

Using this regular expression, we can find all the words in our dictionary file that are five letters long, have a "j" in the third position and an "r" in the last position.

The example used an English dictionary file because it is present on the system by default. Having previously downloaded the appropriate dictionary, you can do similar searches using words in Cyrillic or any other characters.
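Since a dictionary file may not be installed on every system, the same search can be sketched against a short inline word list. The five words below are chosen for the illustration and merely stand in for the dictionary file.

```shell
# A tiny stand-in for /usr/share/dict: one word per line.
words() { printf '%s\n' Major major mayor minor rajah; }

# Five characters, "j" in third position, "r" in last; -i ignores case.
answers=$(words | grep -i '^..j.r$')

printf '%s\n' "$answers"
```

Here "mayor" and "minor" fail on the third letter, and "rajah" fails on the last one.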

Bracket Expressions and Character Classes

In addition to matching any character at a given position in our regular expression, we can also use expressions in square brackets to match a single character from a specified set. With bracket expressions, we can specify a set of characters to match (including characters that would otherwise be interpreted as metacharacters). In this example, using a set of two characters:

grep -h "[bg]zip" dirlist*.txt
bzip2
bzip2recover
gzip

we will find any lines containing the strings "bzip" or "gzip".

The set can contain any number of characters, and metacharacters lose their special meaning when placed inside square brackets. However, there are two cases in which metacharacters used inside square brackets have a different meaning. The first is the caret ( ^ ), which is used to indicate negation; the second is the dash ( - ), which is used to specify a range of characters.

Negation

If the first character of the expression in square brackets is a caret ( ^ ), then the remaining characters are taken as a set of characters that should not be present at the given character position. Let's do this by changing our previous example:

grep -h "[^bg]zip" dirlist*.txt
bunzip2
gunzip
funzip
gpg-zip
mzip
p7zip
preunzip
prezip
prezip-bin
unzip
unzipsfx

With negation enabled, we get a list of files that contain the string "zip" preceded by any character other than "b" or "g". Please note that zip was not found. A negated character set still requires a character at the given position, but the character must not be a member of the negated character set.

The caret indicates negation only if it is the first character inside a bracket expression; otherwise, it loses its special meaning and becomes an ordinary character in the set.
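Both roles of the caret inside brackets can be compared in a short sketch. The sample names, including the artificial entry a^zip, are invented for the illustration.

```shell
# Sample names; "a^zip" contains a literal caret before "zip".
names() { printf '%s\n' 'bzip2' 'gzip' 'unzip' 'zip' 'a^zip'; }

negated=$(names | grep '[^bg]zip')   # caret first: any character except b or g
literal=$(names | grep '[b^]zip')    # caret not first: an ordinary set member

printf 'negated:\n%s\nliteral:\n%s\n' "$negated" "$literal"
```

In the first search "zip" itself is not matched, because the negated set still requires some character before "zip".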

Traditional character ranges

If we wanted to construct a regular expression that would find every file in our list that starts with a capital letter, we could do the following:

grep -h "^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]" dirlist*.txt
MAKEDEV
GET
HEAD
POST
VBoxClient
X
X11
Xorg
ModemManager
NetworkManager
VBoxControl
VBoxService

The point is that we put all 26 capital letters in the expression inside square brackets. But the idea of typing them all does not inspire enthusiasm, so there is another way:

grep -h "^[A-Z]" dirlist*.txt

Using a three-character range, we can shorten the 26-letter entry. You can express any range of characters this way, including multiple ranges at once, such as this expression, which matches all file names that begin with letters and numbers:

grep -h "^[A-Za-z0-9]" dirlist*.txt

In character ranges we see that the dash character is treated in a special way, so how can we include the dash character in an expression inside square brackets? By making it the first character in the expression. Let's look at two examples:

grep -h "[A-Z]" dirlist*.txt

This will match every filename containing a capital letter. Whereas:

grep -h "[-AZ]" dirlist*.txt

will match every filename that contains a dash, or a capital "A", or a capital "Z".
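The range and dash rules can be tried on a few invented names. One assumption is made explicit here: LC_ALL=C is set for each grep call so that [A-Z] covers exactly the 26 uppercase ASCII letters, since range behaviour can depend on the locale.

```shell
# Sample names chosen for the illustration.
names() { printf '%s\n' 'Xorg' 'zip' '7z' 'gpg-zip' 'AZ'; }

upper_start=$(names | LC_ALL=C grep '^[A-Z]')        # starts with a capital letter
alnum_start=$(names | LC_ALL=C grep '^[A-Za-z0-9]')  # starts with a letter or digit
dash_set=$(names | LC_ALL=C grep '[-AZ]')            # contains "-", "A" or "Z"

printf 'upper:\n%s\nalnum:\n%s\ndash set:\n%s\n' \
  "$upper_start" "$alnum_start" "$dash_set"
```

With the dash placed first, [-AZ] is a set of three literal characters, so "Xorg" is not matched even though it contains a capital letter.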

A regular expression is a pattern that describes a set of strings. Regular expressions are constructed similarly to arithmetic expressions, using various operators to combine smaller expressions.

Regular expressions (abbreviated RegExp or RegEx; in jargon, regexps or regexes) are a system for parsing text fragments according to a formalized template, based on a notation for recording search patterns. The pattern sets the search rule; it is also sometimes called a "template" or a "mask". Regular expressions revolutionized electronic text processing in the late 20th century. They can be seen as a development of wildcard characters.

Regular expressions are now used by numerous text editors and utilities to search for and change text according to selected rules. Many programming languages support regular expressions for working with strings. For example, Java, .NET Framework, Perl, PHP, JavaScript, Python and others have built-in support for regular expressions. A set of utilities found in UNIX distributions (including the sed editor and the grep filter) were among the first to help popularize the concept of regular expressions.

One of the more useful and feature-rich commands in the Linux terminal is the "grep" command. Grep is an acronym that stands for "global regular expression print" (that is, "search everywhere for lines matching a regular expression and print them").

This means that grep can be used to check whether input matches given patterns. In its simplest form, grep is used to find matches of literal patterns in a text file. This means that if grep is given a search word, it will print every line in the file that contains that word.

The purpose of grep is to search for lines that satisfy the condition expressed by a regular expression. There are variants of the classic grep: egrep, fgrep and rgrep. Each of them is tuned for a specific purpose, while grep itself covers all of their functionality. The simplest example of using the command is printing the lines of a file that match a pattern. For example, suppose we want to find the lines containing 'user' in the /etc/mysql/my.cnf file. To do this, use the following command:

grep user /etc/mysql/my.cnf

Grep can simply search for a specific word:

grep Hello ./example.cpp

Or a phrase containing spaces, in which case it must be enclosed in quotes:

grep "Hello world" ./example.cpp

In addition, there are the alternative programs egrep and fgrep, which are the same as grep -E and grep -F, respectively. The egrep and fgrep commands are deprecated but still work for backwards compatibility. It is recommended to use grep -E and grep -F instead of the legacy commands.

The grep command matches lines in the source files against a pattern, which by default is a basic regular expression. If no files are specified, standard input is used. As usual, each successfully matched line is copied to standard output; if there are several source files, the file name is printed before the matched line. Patterns are interpreted as basic regular expressions (expressions made up of character strings with their own meanings, using a limited set of alphanumeric and special characters).

Using egrep on Linux

egrep, or grep -E, is another version of grep: Extended grep. It is convenient for regular expression pattern matching because it treats metacharacters as metacharacters without requiring them to be escaped, rather than treating them as literal parts of the string. egrep uses ERE, Extended Regular Expressions.

egrep is shorthand for calling grep with the -E switch. The difference from plain grep is the ability to use extended regular expressions. Often the task arises of searching for words or expressions that belong to the same type but may vary in spelling, such as dates, file names with a certain extension and a standard name, or e-mail addresses. On the other hand, there are tasks that involve finding well-defined words that may be written in different ways, or searches that exclude individual characters or classes of characters.

For these purposes, several systems have been created that describe text using patterns. Regular expressions are one such system. Two very useful special characters are ^ and $, which indicate the beginning and end of a line. For example, suppose we want to list all users registered in our system whose names start with s. We can then use the regular expression "^s" with the egrep command:

egrep "^s" /etc/passwd

It is possible to search across multiple files; in that case the file name is printed before each matching line.

egrep -i Hello ./example.cpp ./example2.cpp

The following command prints the code while excluding lines that begin with a slash, such as lines containing only // comments:

egrep -v "^/" ./example.cpp

With egrep, metacharacters do not need to be escaped: the command treats them as special characters with their special meaning instead of treating them as part of the string.
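The difference between basic and extended syntax can be sketched with a pattern that uses grouping and alternation. The sample lines are invented, and the third search relies on the backslash-escaped \| alternation, which is a GNU grep extension to basic regular expressions rather than something POSIX guarantees.

```shell
# Sample lines: two words and one line containing the pattern text itself.
lines() { printf '%s\n' 'file' 'gile' '(f|g)ile'; }

basic=$(lines | grep '(f|g)ile')               # BRE: ( | ) are plain characters
extended=$(lines | grep -E '(f|g)ile')         # ERE: group with alternation
basic_escaped=$(lines | grep '\(f\|g\)ile')    # BRE with GNU-style escapes

printf 'basic:\n%s\nextended:\n%s\nbasic, escaped:\n%s\n' \
  "$basic" "$extended" "$basic_escaped"
```

In basic mode the unescaped pattern only finds the line that literally contains "(f|g)ile", while -E interprets the same text as "file or gile".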

Using fgrep on Linux

fgrep, or Fixed grep, or grep -F, is another version of grep. It is the one to use when you need to search for an exact string rather than a regular expression, since it recognizes neither regular expressions nor metacharacters. To search for any string literally, choose this version of grep.

fgrep searches for the whole string as given and does not treat special characters as part of a regular expression, whether the characters are escaped or not.

fgrep -C 0 "(f|g)ile" check_file
fgrep -C 0 "\(f\|g\)ile" check_file
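To see what the two fgrep commands do, here is a small sketch that uses grep -F (the recommended spelling of fgrep) on invented sample lines standing in for check_file.

```shell
# Sample lines, printed verbatim (printf '%s\n' does not interpret backslashes).
lines() { printf '%s\n' 'file' '(f|g)ile' '\(f\|g\)ile'; }

plain=$(lines | grep -F '(f|g)ile')        # finds only the literal text (f|g)ile
escaped=$(lines | grep -F '\(f\|g\)ile')   # backslashes are literal too
as_regex=$(lines | grep -E '(f|g)ile')     # for contrast: treated as a regex

printf 'fixed:\n%s\nfixed, escaped:\n%s\nregex:\n%s\n' \
  "$plain" "$escaped" "$as_regex"
```

With -F, each pattern matches only the line that contains exactly the same characters, backslashes included.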

Using sed on Linux

sed (from the English Stream EDitor) is a stream text editor (and also a programming language) that applies various predefined text transformations to a sequential stream of text data. sed can be used like grep, printing lines that match a basic regular expression:

sed -n '/Hello/p' ./example.cpp

It can also be used to delete lines (here, removing all empty lines):

sed '/^$/d' ./example.cpp

The main tool for working with sed is an expression of the form:

sed s/search_expression/replacement/ file_name

For example, the following command replaces the first occurrence of int on each line with long:

sed s/int/long/ ./example.cpp
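The sed commands above can be tried on a few inline lines instead of a real source file. The sample text stands in for example.cpp and is invented for the illustration; the g flag in the last command is an addition that replaces every occurrence on a line, not only the first.

```shell
# Four sample lines, the second one empty.
sample() { printf '%s\n' 'int a = 1;' '' 'int print(int x);' '// Hello'; }

printed=$(sample | sed -n '/Hello/p')       # print only matching lines
no_blank=$(sample | sed '/^$/d')            # delete empty lines
first_only=$(sample | sed 's/int/long/')    # replace first match on each line
every=$(sample | sed 's/int/long/g')        # replace every match on each line

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' \
  "$printed" "$no_blank" "$first_only" "$every"
```

The last result shows a common pitfall: the pattern int also matches inside the word print, so a plain text substitution can change more than intended.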

The differences between "grep", "egrep" and "fgrep" are discussed above. Despite the differences in the set of regular expressions supported and in execution speed, the command line options are the same for all three versions of grep.