How information search works in search engines. How does a search engine work? Composing search queries correctly

Last July, Ilya Segalovich passed away: founder and director of technology at Yandex, creator of the first version of the search engine and author of its name. In memory of this outstanding person and public figure, who helped many, including COLTA.RU, we are republishing his article about information retrieval and the mathematical models that underlie it. Ilya Segalovich called search engines one of the two new wonders of the world. In any case, without them, including without Segalovich's main brainchild, Yandex, our life would be completely different.

Hundreds of search engines have been written in the world, and if you count the search functions implemented in all kinds of programs, the count runs into the thousands. And no matter how the search process is implemented, no matter what mathematical model it is based on, the ideas and programs that implement search are quite simple, although this simplicity apparently belongs to the category about which they say "simple, but it works." One way or another, it was search engines that became one of the two new wonders of the world, giving Homo sapiens unlimited and instant access to information. The first miracle, obviously, is the Internet itself, with its capabilities for universal communication.

Search engines in historical perspective

There is a common belief that each new generation of software is more advanced than the previous one. They say that before everything was imperfect, but now artificial intelligence reigns everywhere. Another extreme point of view is that “everything new is well forgotten old.” I think that when it comes to search engines, the truth lies somewhere in the middle.

But what has actually changed in recent years? Not algorithms or data structures, not mathematical models, although they have changed too. What changed is the paradigm for using these systems. Simply put, a housewife looking for a cheaper iron and a graduate of an auxiliary boarding school hoping to find a job as a car mechanic both sat down in front of a screen with a search box. In addition to the emergence of a factor that was impossible in the pre-Internet era, the factor of total demand for search engines, a couple more changes became obvious. First, it became clear that people not only "think in words" but also "search in words": they expect to see the word typed in the query string in the system's response. Second, it is as difficult to "re-teach a searcher to search" as it is to re-teach someone to speak or write. The dreams of the 1960s-80s about iterative refinement of queries, about understanding natural language, about searching by meaning, about generating a coherent answer to a question now hardly stand the test of reality.

Algorithm + data structure = search system

Like any program, a search engine operates on data structures and executes an algorithm. The variety of algorithms is not very great, but it exists. Not counting quantum computers, which promise us a magical breakthrough in the “algorithmic complexity” of search and about which the author knows almost nothing, there are four classes of search algorithms. Three out of four algorithms require “indexing”, preliminary processing of documents, during which an auxiliary file is created, that is, an “index”, designed to simplify and speed up the search itself. These are algorithms for inverted files, suffix trees, and signatures. In the degenerate case, there is no preliminary indexing stage, and the search occurs using sequential scanning of documents. This search is called direct.

Direct search

Its simplest version is familiar to many, and there is no programmer who has not written similar code at least once in his life:
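What such a scan might look like is easy to sketch. Here is a minimal illustration in Python (the function name and sample texts are invented for this example; a real engine would not be written this way):

```python
def direct_search(documents, query):
    """Naive direct search: scan every document's text for the query string."""
    hits = []
    for doc_id, text in enumerate(documents):
        if query.lower() in text.lower():  # simple case-insensitive substring scan
            hits.append(doc_id)
    return hits

docs = ["In the beginning God created the heaven and the earth.",
        "And the earth was without form, and void."]
print(direct_search(docs, "earth"))  # -> [0, 1]
```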

Despite its apparent simplicity, direct search has been developing intensively over the past 30 years. A considerable number of ideas have been put forward that reduce search time severalfold. These algorithms are described in detail in the literature; there are surveys and comparisons of them. Good reviews of direct search methods can be found in textbooks, for example by Sedgewick or Cormen. It should be kept in mind that new algorithms and improved versions of them appear constantly.

Although scanning all texts directly is a rather slow task, one should not think that direct search algorithms are not used on the Internet. The Norwegian search engine Fast used a chip implementing the logic of direct search for simplified regular expressions (Fast PMC) and placed 256 such chips on one board. This allowed Fast to serve quite a large number of queries per unit of time.

In addition, there are many programs that combine index search to find a block of text with subsequent direct search within the block. One example is Glimpse, which is very popular, including in the RuNet.

In general, direct algorithms have some fundamentally winning distinctive features, for example, unlimited possibilities for approximate and fuzzy search. After all, any indexing is always associated with simplification and normalization of terms, and therefore with loss of information. Direct search works with the original documents directly, without any distortion.

Inverted file

This simplest data structure, despite its mysterious foreign name, is intuitively familiar both to any literate person and to any database programmer, even one who has never dealt with full-text search. The first category knows what it is thanks to "concordances": alphabetically ordered exhaustive lists of words from one text or one author (for example, "Concordance to the Poems of A.S. Pushkin", "Dictionary-Concordance of the Journalism of F.M. Dostoevsky"). The latter deal with some form of inverted list whenever they build or use a "database index on a key field."

Let us illustrate this structure with the help of a wonderful Russian concordance - “Symphony”, released by the Moscow Patriarchate based on the text of the Synodal translation of the Bible.

Here is an alphabetically ordered list of words. For each word, all the "positions" in which it occurred are listed. The search algorithm consists of finding the desired word and loading the already expanded list of positions into memory.
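As a toy sketch (in Python, with invented sample texts), an inverted file and a lookup over it might look like this, assuming a position is simply a (document number, word number) pair:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """For every word, record the list of positions (document number, word number)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for word_no, word in enumerate(text.lower().split()):
            index[word].append((doc_id, word_no))
    return index

def lookup(index, word):
    """Searching is a dictionary access; the posting list is loaded as a whole."""
    return index.get(word.lower(), [])

docs = ["in the beginning god created the heaven and the earth",
        "and the earth was without form and void"]
idx = build_inverted_index(docs)
print(lookup(idx, "earth"))  # -> [(0, 9), (1, 2)]
```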

To save on disk space and speed up searches, two techniques are usually used. Firstly, you can save on the details of the position itself. After all, the more detailed such a position is specified (for example, in the case of “Symphony” it is “book+chapter+verse”), the more space will be required to store the inverted file.

In the most detailed version, the inverted file can store the word number, the offset in bytes from the beginning of the text, the color and font size, and much more. More often, only the document number (say, a book of the Bible) and the number of occurrences of the word in it are stored. It is this simplified structure that is considered fundamental in the classical theory of information retrieval, Information Retrieval (IR).

The second (in no way related to the first) compression method: arrange the positions for each word in ascending order of addresses and for each position store not its full address, but the difference from the previous one. This is what such a list would look like for our page, assuming that we remember the position up to the chapter number:

On top of the difference method of storing addresses, some simple packing method is usually applied: why give a small integer a fixed "huge" number of bytes when you can give it almost exactly as many bytes as it deserves? Here it is appropriate to mention Golomb codes or the built-in function of the popular language Perl: pack("w").
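A rough sketch of both tricks, difference storage plus variable-byte packing, might look as follows (the byte convention here only mimics the general idea behind codes such as Perl's pack("w"); it is not any engine's actual format):

```python
def delta_encode(positions):
    """Store each address as the difference from the previous one (addresses sorted ascending)."""
    out, prev = [], 0
    for p in positions:
        out.append(p - prev)
        prev = p
    return out

def varbyte(n):
    """Give a small integer only as many bytes as it deserves:
    7 data bits per byte, the high bit marks that more bytes follow."""
    chunks = []
    while True:
        chunks.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    chunks.reverse()
    return bytes([c | 0x80 for c in chunks[:-1]]) + bytes([chunks[-1]])

positions = [3, 17, 44, 45, 120]
deltas = delta_encode(positions)                 # [3, 14, 27, 1, 75]
packed = b"".join(varbyte(d) for d in deltas)
print(deltas, len(packed), "bytes")              # each small gap fits into a single byte
```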

The literature also offers heavier artillery: packing algorithms of the widest range, such as arithmetic coding, Huffman, LZW, etc. Progress in this area is ongoing. In practice, they are rarely used in search engines: the gain is small, and processor power is spent inefficiently.

As a result of all the described tricks, the size of the inverted file is usually from 7 to 30 percent of the size of the original text, depending on the addressing details.

Listed in the Red Book

Algorithms and data structures other than inverted and direct search have been repeatedly proposed. These are, first of all, suffix trees (Manber, Gonnet), as well as signatures (Faloutsos).

The first of these has also operated on the Internet, as a patented algorithm of the OpenText search engine; I have come across suffix indexes in domestic search engines as well. The second, the signature method, is the transformation of a document into block-by-block tables of hash values of its words ("signatures") and sequential scanning of the signatures during the search.

Neither method has gained widespread use, and therefore neither deserves detailed discussion in this short article.

Mathematical models

Approximately three out of five search engines and search modules function without any mathematical models at all. More precisely, their developers do not set themselves the task of implementing an abstract model and/or are unaware of its existence. The principle here is simple: as long as the program finds at least something. Anything at all. The user will sort it out himself.

However, as soon as it comes to improving the quality of search, to large volumes of information, to a stream of user queries, then in addition to empirically established coefficients it turns out to be useful to operate with some theoretical apparatus, however simple. A search model is a certain simplification of reality, on the basis of which a formula is obtained (of no use by itself) that allows the program to decide which document to consider found and how to rank it. After the model is accepted, the coefficients often acquire a physical meaning and become clearer to the developer himself, and selecting them becomes more interesting.

The whole variety of traditional information retrieval (IR) models is usually divided into three types: set-theoretic (Boolean, fuzzy sets, extended Boolean), algebraic (vector, generalized vector, latent semantic, neural network) and probabilistic.

The Boolean family of models is, in fact, the first that comes to mind for a programmer implementing full-text search: if the word is there, the document is considered found; if not, it is not found. Actually, the classical Boolean model is a bridge connecting the theory of information retrieval with the theory of data retrieval and manipulation.

The criticism of the Boolean model, quite fair, is that it is extremely rigid and unsuitable for ranking. Therefore, back in 1957 Joyce and Needham proposed taking into account the frequency characteristics of words, so that "...the comparison operation would be a ratio of the distance between vectors..." (Joyce, 1957). The vector model was successfully implemented in 1968 by the founding father of information retrieval science, Gerard Salton, in the SMART search engine (Salton's Magical Automatic Retriever of Text).

Ranking in this model is based on a natural statistical observation: the higher the local frequency of a term in a document (TF) and the greater the "rarity" of the term in the collection (i.e., its inverse document frequency, IDF), the higher the weight of that document with respect to the term. The designation IDF was introduced by Karen Sparck Jones in 1972 in an article on distinctive power (term specificity). Since then, the designation TF*IDF has been widely used as a synonym for the vector model.
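For illustration only, here is one simple variant of such weighting (TF*IDF exists in many flavors; the formula below is a common textbook version, not the one used by any particular engine):

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, collection):
    """Weight of a document for a query: sum over query terms of TF * IDF,
    where TF is the term's share of the document and IDF = log(N / document frequency)."""
    N = len(collection)
    counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = counts[term] / len(doc_terms)
        df = sum(1 for d in collection if term in d)
        if df:
            score += tf * math.log(N / df)
    return score

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
print(tf_idf_score(["cat"], docs[0], docs))  # higher for docs[0] than for docs[1]
```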

Finally, in 1977, Robertson and Sparck Jones justified and implemented a probabilistic model (proposed back in 1960 (Maron)), which also laid the foundation for a whole family. In this model, relevance is treated as the probability that a given document may be of interest to the user. This presupposes an already existing initial set of relevant documents, selected by the user or obtained automatically under some simplified assumption. The probability of being relevant for each subsequent document is calculated from the ratio of term occurrences in the relevant set and in the rest, "irrelevant," part of the collection. Although probabilistic models have a certain theoretical advantage, ranking documents in descending order of the "probability of being relevant," in practice they have never gained much traction.

I am not going to write out cumbersome formulas for each model. Their summary, together with a discussion, takes up 35 pages in compressed form in the book "Modern Information Retrieval" (Baeza-Yates). It is only important to note that in each family the simplest model proceeds from the assumption of mutual independence of words and has a simple filtering condition: documents that do not contain the query word are never found. The advanced ("alternative") models of each family do not consider query words mutually independent and, in addition, allow finding documents that do not contain a single word from the query.

Search "by meaning"

The ability to find and rank documents that do not contain query words is often considered a sign of artificial intelligence or search by meaning, and is attributed a priori to the advantages of the model. Whether this is so or not, we will leave outside the scope of this article.

As an example, I will describe only one, perhaps the most popular, model that works by meaning. In information retrieval theory this model is usually called latent semantic indexing (in other words, the identification of hidden meanings). This algebraic model is based on the singular value decomposition of the rectangular matrix that associates words with documents. A matrix element is a frequency characteristic reflecting the degree of connection between a word and a document, for example TF*IDF. Instead of the original million-dimensional matrix, the authors of the method, Furnas and Deerwester, proposed using 50-150 "hidden meanings" corresponding to the first principal components of its singular decomposition.

A singular value decomposition of a real matrix A of size m*n is any decomposition of the form A = USV, where U is an orthogonal matrix of size m*m, V is an orthogonal matrix of size n*n, and S is a diagonal matrix of size m*n whose elements s_ij = 0 if i is not equal to j, and s_ii = s_i >= 0. The quantities s_i are called the singular values of the matrix and are equal to the arithmetic square roots of the corresponding eigenvalues of the matrix AA^T. In the English-language literature, the singular decomposition is usually called the SVD decomposition.

It was proven long ago (Eckart) that if we keep the first k singular values (setting the rest to zero), we obtain the closest possible rank-k approximation of the original matrix (in a certain sense, its "closest semantic interpretation of rank k"). By reducing the rank, we filter out irrelevant details; by increasing it, we try to reflect all the nuances of the structure of the real data.

Search operations and finding similar documents are greatly simplified, since each word and each document is associated with a relatively short vector of k meanings (the rows and columns of the corresponding matrices). However, whether due to the low meaningfulness of the "meanings" or for some other reason, using LSI head-on for search has never gained wide adoption. Although for auxiliary purposes (automatic filtering, classification, collection separation, preliminary dimensionality reduction for other models) this method apparently does find application.
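A compact sketch of the idea, using a general-purpose SVD from NumPy on a toy word-by-document matrix (the numbers are invented; real systems work with far larger sparse matrices and specialized solvers):

```python
import numpy as np

# Rows are words, columns are documents; the entries could be TF*IDF weights,
# here just invented counts for brevity.
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # number of "hidden meanings" to keep
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T      # each document as a short k-dimensional vector

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents (or a query folded into the same space) can now be compared
# even when they share no words literally.
print(cosine(doc_vectors[0], doc_vectors[1]))
```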

Quality control

“...robustness testing showed that the overlap of relevant documents between any two assessors was approximately 40% on average <...> precision and recall measured between assessors are about 65% <...> this places a practical upper limit on search quality in the region of 65%...”

(“What we have learned, and not learned, from TREC”, Donna Harman)

Whatever the model, the search engine needs “tuning” - assessing the quality of the search and adjusting the parameters. Quality assessment is an idea fundamental to search theory. For it is precisely thanks to quality assessment that one can talk about the applicability or inapplicability of a particular model and even discuss their theoretical aspects.

In particular, one of the natural limits on search quality is the observation in the epigraph: the opinions of two "assessors" (specialists who render the verdict on relevance) on average coincide with each other only to a rather limited extent! From this follows a natural upper bound on search quality, because quality is measured by comparison with the assessor's opinion.

“...I was shocked when someone from Google told me that they do not use anything developed at TREC at all, because all the algorithms tuned on the 'arbitrary queries' track are smashed to smithereens by spam...”

It's time to return to the topic with which this article began: what has changed in search engines lately?

First of all, it became obvious that search on the Internet cannot be performed correctly if it is based solely on the analysis of a document's text (however deep and semantic that analysis may be). After all, extra-textual (off-page) factors play no less, and sometimes an even greater, role than the text of the page itself. Position on the site, traffic, authoritativeness of the source, update frequency, citation of the page and its authors: none of these factors can be discounted.

Having become the main source of reference information for the human species, search engines have become the main source of traffic for Internet sites. As a result, they were immediately “attacked” by unscrupulous authors who wanted to appear on the first pages of search results at any cost. The artificial generation of entry pages rich in popular words, cloaking techniques, “blind text” and many other techniques designed to deceive search engines instantly flooded the Internet.

In addition to the problem of correct ranking, the creators of Internet search engines had to solve the problem of updating and synchronizing a colossal collection with heterogeneous formats, delivery methods, languages, encodings, and a lot of meaningless and duplicate texts. It is necessary to maintain the database in a state of maximum freshness (in fact, it is enough to create the illusion of freshness - but this is a topic for another discussion), perhaps taking into account the individual and collective preferences of users. Many of these problems have never before been considered in traditional information retrieval science.

As an example, let's look at a couple of such problems and practical ways to solve them in Internet search engines.

Ranking quality

Not all extra-textual criteria are equally useful. It was link popularity and its derivatives that turned out to be the decisive factor that, in 1999-2000, changed the world of search engines and returned users' loyalty to them. It was with its help that search engines learned to rank answers to short frequent queries, which make up a significant part of the search flow, decently and on their own, without support from manually edited results.

The simplest idea for global (i.e., static) accounting of link popularity is to count the number of links pointing to a page. This is roughly what traditional library science calls a citation index. This criterion was used in search engines even before 1998. However, it is easily gamed; moreover, it does not take into account the weight of the linking sources themselves.

A natural development of this idea is PageRank, the algorithm proposed by Brin and Page in 1998: an iterative algorithm similar to the one used to determine the winner in a Swiss-system chess tournament. Combined with searching the text of links pointing to a page (an old and very productive idea used in hypertext search engines back in the 1980s), this measure made it possible to dramatically increase search quality.
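A textbook power-iteration sketch of the idea (not Google's production formula; the damping factor 0.85 is simply the value most often cited in the literature):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-method PageRank over a dict: page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if not outgoing:                           # a dangling page spreads its weight evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                for q in outgoing:
                    new_rank[q] += damping * rank[p] / len(outgoing)
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(links))  # "c" accumulates the largest share in this toy graph
```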

A little earlier than PageRank, a local (i.e., dynamic, query-dependent) popularity algorithm, HITS (Kleinberg), was proposed; it is not used in practice mainly because of its computational cost, for roughly the same reason as local (i.e., dynamic) methods operating on words.

Both algorithms, their formulas and convergence conditions are described in detail, including in the Russian-language literature. I will only note that computing static popularity is not an end in itself; it is used for numerous auxiliary purposes: determining the order in which documents are crawled, ranking searches over link text, and so on. The formulas for computing popularity are constantly being improved; they take additional factors into account, such as the thematic proximity of documents (for example, in the popular search engine www.teoma.com) and their structure, which makes it possible to reduce the influence of nepotism. The efficient implementation of the corresponding data structures is an interesting topic in its own right (Bharat).

Index quality

Although at first glance the size of an Internet search engine's database does not seem to be a critical factor, this is not so. It is no accident that the number of visitors to engines such as Google and Fast correlates well with the growth of their databases. The main reason: "rare" queries, those for which fewer than 100 documents exist, make up about 30% of the total mass of searches, a very significant share. This fact makes database size one of the most critical parameters of the system.

However, the growth of the database, in addition to technical problems with disks and servers, is also limited by logical ones: the need to respond adequately to garbage, repetitions, and so on. I cannot help but describe the ingenious algorithm used in modern search engines to exclude "very similar documents."

Copies of documents on the Internet can arise in different ways. The same document on the same server may differ for technical reasons: it may be presented in different encodings and formats, or contain variable insertions such as advertising or the current date.

A wide class of documents on the web are actively copied and edited - news agency feeds, documentation and legal documents, store price lists, answers to frequently asked questions, etc. Popular types of changes: proofreading, reorganization, revision, abstracting, topic disclosure, etc. Finally, publications may be copied in a way that violates copyright and modified maliciously to make them difficult to discover.

In addition, indexing by search engines of pages generated from databases gives rise to another common class of documents that are not very different in appearance: questionnaires, forums, product pages in electronic stores.

Obviously, there are no particular problems with complete repetitions; it is enough to store the checksum of the text in the index and ignore all other texts with the same checksum. However, this method does not work to detect even slightly changed documents.

To solve this problem, Udi Manber (the author of the famous approximate direct-search program agrep) proposed the idea in 1994, and in 1997 Andrei Broder came up with the name and refined the "shingles" algorithm (from the word shingle, "tile, scale"). Here is a rough description of it.

For every ten-word sequence of the text, a checksum (shingle) is calculated. The ten-word windows overlap, sliding one word at a time, so that none is lost. Then, out of the whole set of checksums (obviously there are as many of them as there are words in the document, minus nine), only those divisible by, say, 25 are kept. Since checksum values are distributed uniformly, the selection criterion is in no way tied to the features of the text. Clearly, the repetition of even a single ten-word sequence is a significant sign of duplication, and if there are many such repetitions, say more than half, then with a certain confidence (the probability is easy to estimate) we can say: a copy has been found! After all, one matching shingle in the sample corresponds to roughly 25 matching ten-word sequences in the full text!
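A minimal sketch of this scheme (CRC32 stands in here for whatever checksum a real system would use; the overlap measure at the end is the usual set-intersection ratio):

```python
import zlib

def sampled_shingles(text, window=10, keep_mod=25):
    """Checksums of overlapping ten-word windows, keeping only those divisible by keep_mod."""
    words = text.lower().split()
    sums = {zlib.crc32(" ".join(words[i:i + window]).encode("utf-8"))
            for i in range(len(words) - window + 1)}
    return {s for s in sums if s % keep_mod == 0}

def resemblance(text_a, text_b):
    """Share of common sampled shingles; a large value signals a near-duplicate."""
    a, b = sampled_shingles(text_a), sampled_shingles(text_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

On real documents a few hundred words long the sample is large enough to be meaningful; for very short texts it may well be empty, which is why no toy output is shown.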

Obviously, this way you can determine the percentage of text overlap, identify all its sources, etc. This elegant algorithm has made the long-standing dream of associate professors come true: from now on, the painful question “from whom did the student copy this coursework” can be considered resolved! It is easy to assess the proportion of plagiarism in any article.

So that the reader does not get the impression that information retrieval is an exclusively Western science, I will mention an alternative algorithm for identifying near-duplicates, invented and implemented here in Yandex (Ilyinsky). It takes advantage of the fact that most search engines already have an index in the form of an inverted file (or inverted index), and this fact can be conveniently used in the procedure of finding near-duplicates.

The price of one percent

Architecturally, modern search systems are complex multi-computer systems. Starting from a certain point, as the system grows, the main load falls not on the robot at all, but on the search. After all, dozens and hundreds of requests arrive within a second.

To cope with this load, the index is split into parts and spread across tens, hundreds or even thousands of computers. Since 1997 (the Inktomi search engine), the computers themselves have been ordinary 32-bit machines (Linux, Solaris, FreeBSD, Win32) with corresponding constraints on price and performance. The only exception to the general rule has been AltaVista, which from the very beginning used relatively "large" 64-bit Alpha computers.

Internet search engines (and all large search engines in general) can speed up their work using echeloning and pruning techniques.

The first technique is to divide the index into obviously more relevant and less relevant parts. The search is first performed in the first part, and then, if nothing or too little is found, the search engine turns to the second part of the index. Pruning (from the English pruning, "cutting back, reduction") consists in dynamically stopping the processing of a query once a sufficient amount of relevant information has been accumulated. There is also static pruning, when, based on certain assumptions, the index is reduced at the expense of documents that will certainly never be found.
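Both tricks together might look roughly like this (a toy sketch; the shard contents and the cut-off threshold are invented for the example):

```python
def tiered_search(query_word, tiers, enough=10):
    """Echeloning: search the 'more relevant' part of the index first and fall back
    to the rest only if needed. Pruning: stop once enough candidates are accumulated."""
    found = []
    for tier in tiers:                       # each tier: dict word -> list of (doc_id, score)
        for doc_id, score in tier.get(query_word, []):
            found.append((score, doc_id))
            if len(found) >= enough:         # dynamic pruning: cut off the remaining work
                return sorted(found, reverse=True)
    return sorted(found, reverse=True)

primary   = {"earth": [(1, 0.9), (7, 0.8)]}  # hypothetical "more relevant" part of the index
secondary = {"earth": [(42, 0.3)]}
print(tiered_search("earth", [primary, secondary], enough=2))  # stops after the first tier
```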

A separate problem is to organize the uninterrupted operation of multi-computer systems, seamless index updating, and resistance to failures and delays in responses of individual components. Special protocols are being developed for communication between search servers and servers that collect responses and form the search results page.

Note that for a system of ten thousand computers, one percent of performance (say, a carelessly written statement in some loop) costs about a hundred machines. Therefore, you can imagine how carefully the code responsible for searching and ranking is polished, and how the use of every possible resource is optimized: every byte of memory, every disk access.

Thinking through the architecture of the entire complex from the very beginning is crucial, since any change, such as adding an unusual ranking factor or a complex data source, becomes an extremely painful and complicated procedure. Obviously, systems that start later have an advantage here. But user inertia is very high: it takes, say, two to four years for an established audience of many millions to switch, even slowly, to an unfamiliar search engine, even one with undeniable advantages. In conditions of fierce competition, this is sometimes simply not feasible.

Syntactic Clustering of the Web
Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse
WWW6, 1997

The Approximation of One Matrix by Another of Lower Rank
C. Eckart, G. Young
Psychometrika, 1936

Description and Performance Analysis of Signature File Methods
C. Faloutsos, S. Christodoulakis
ACM TOIS, 1987

Information Retrieval Using a Singular Value Decomposition Model of Latent Semantic Structure
G.W. Furnas, S. Deerwester, S.T. Dumais, T.K. Landauer, R. A. Harshman, L.A. Streeter and K.E. Lochbaum
ACM SIGIR, 1988

Examples of PAT Applied to the Oxford English Dictionary
Gonnet G.
University of Waterloo, 1987

The Thesaurus Approach to Information Retrieval
T. Joyce and R.M. Needham
American Documentation, 1958

An Efficient Method to Detect Duplicates of Web Documents with the Use of Inverted Index
S. Ilyinsky, M. Kuzmin, A. Melkov, I. Segalovich
WWW2002, 2002

Suffix Arrays: A New Method for On-line String Searches
U. Manber, G. Myers
1st ACM-SIAM Symposium on Discrete Algorithms, 1990

Finding Similar Files in a Large File System
U. Manber
USENIX Conference, 1994

On Relevance, Probabilistic Indexing and Information Retrieval
M.E. Maron and J.L. Kuhns
Journal of the ACM, 1960

Relevance Weighting of Search Terms
S.E. Robertson and K. Sparck Jones
JASIS, 1976

Algorithms in C++
Robert Sedgewick
Addison-Wesley, 1992

A Statistical Interpretation of Term Specificity and Its Application in Retrieval
K. Sparck Jones
Journal of Documentation, 1972

Natural Language Information Retrieval
Tomek Strzalkowski (ed.)
Kluwer Academic Publishers, 1999

Symphony, or Dictionary-Index to the Holy Scriptures of the Old and New Testaments
Compiled by M.A. Bondarev, M.S. Kosyan, S.Yu. Kosyan
Publishing house of the Moscow Patriarchate, 1995

Glossary

Assessor (assessor, expert) - a specialist in the subject area who makes a conclusion about the relevance of a document found by a search engine.

Boolean model (Boolean, binary) - a search model based on the operations of intersection, union and subtraction of sets.

Vector model - an information retrieval model that treats documents and queries as vectors in word space, and relevance as the distance between them.

Probabilistic model - an information retrieval model that treats relevance as the probability that a given document matches the query, based on the probabilities that the words of the document match an ideal answer.

Extra-textual criteria (off-page, off-page) - criteria for ranking documents in search engines, taking into account factors not contained in the text of the document itself and not extracted from there in any way.

Entry pages (doorways, hallways) - pages created to artificially raise a site's rank in search engines (search spam). When a user lands on them, they are redirected to the target page.

Disambiguation (tagging, part-of-speech disambiguation) - choosing one of several homonyms based on context; in English this often comes down to automatically assigning the grammatical category "part of speech."

Duplicates (duplicates) - different documents with content that is identical from the user's point of view; approximate duplicates (near-duplicates), unlike exact duplicates, contain minor differences.

The illusion of freshness - the effect of apparent freshness achieved by Internet search engines by re-crawling more regularly those documents that users find more often.

Inverted file (inverted file, inverse file, inverted index, inverted list) - a search engine index that lists the words of a document collection and, for each word, all the places in which it occurred.

Index (index) - see Indexing.

Citation index (citation index) - the number of mentions (citations) of a scientific article; in traditional bibliographic science it is calculated over a period of time, for example, a year.

Indexing (indexing) - the process of compiling or assigning an index, a service data structure necessary for subsequent search.

Information search (Information Retrieval, IR) - search for unstructured information, the unit of which is a document of arbitrary format. The subject of the search is the user's information need, informally expressed in the search query. Neither the search criterion nor its results are deterministic. These features distinguish information retrieval from "data retrieval," which operates on sets of formally defined predicates, deals with structured information, and whose result is always deterministic. Information retrieval theory studies all components of the search process: text preprocessing (indexing), query processing and execution, ranking, the user interface and feedback.

Cloaking (cloaking) - a technique of search spam, which consists in the recognition by the authors of documents of the robot (indexing agent) of the search engine and the generation of special content for it, which is fundamentally different from the content given to the user.

Term contrast - see Distinctive power.

Latent semantic indexing- a patented search algorithm for meaning, identical to factor analysis. Based on the singular value decomposition of the matrix of connections between words and documents.

Lemmatization (lemmatization, normalization) - bringing the form of a word to a dictionary form, that is, a lemma.

Cheating search engines - see Search engine spam.

Nepotism - a type of search engine spam: the placing of reciprocal links by document authors for the sole purpose of raising their rank in the search results.

Reverse occurrence in documents (inverse document frequency, IDF) - an indicator of the search value of a word (its distinctive power); it is called "reverse" because when this indicator is computed, the number of documents containing the word usually appears in the denominator of the fraction.

Feedback - the user's reaction to search results, their judgments about the relevance of the documents found, recorded by the search engine and used, for example, for iterative query modification. It should be distinguished from pseudo-feedback, a query-modification technique in which the first few documents found are automatically considered relevant.

Homonymy - see Polysemy.

Stem - the part of a word common to a set of its derivational and (more often) inflectional forms.

Search by meaning - an information retrieval algorithm capable of finding documents that do not contain the words of the query.

Search for similar documents (similar document search) - an information retrieval task in which the document itself serves as the query, and the documents most similar to it must be found.

Search system (search engine, SE, information retrieval system, IRS, "searcher") - a program designed to search for information, usually text documents.

Search query (query, request) - usually a line of text.

Polysemy (polysemy) - the presence of several meanings in one and the same word.

Completeness (recall, coverage) - the proportion of the relevant material in the collection that is contained in the search engine's response.

Almost-duplicates (near-duplicates, approximate duplicates) - see Duplicates.

Pruning - cutting off obviously irrelevant documents during the search in order to speed up query execution.

Direct search - search directly through the text of documents, without preliminary processing (without indexing).

Pseudo-feedback - see Feedback.

Distinctive power of a word (term specificity, term discriminating power, contrast) - the degree of breadth or narrowness of a word. Search terms that are too broad bring in too much information, much of it useless. Terms that are too narrow help find too few documents, though more precise ones.

Regular expression (regular expression, pattern, "template", less often "stencil", "mask") - a way of writing a search instruction that allows one to specify requirements for the word being sought, its possible spellings, errors, and so on. In a broad sense, a language for specifying queries of unlimited complexity.

Relevance (relevance) - the degree to which a document corresponds to the query.

Signature (signature) - the set of hash values of the words of a certain block of text. In a search by the signature method, all the signatures of all the blocks in the collection are scanned sequentially in search of matches with the hash values of the query words.

Inflection (inflection) - the formation of a word form with a certain grammatical meaning, usually obligatory in a given grammatical context, belonging to a fixed set of forms (a paradigm) characteristic of words of a given type. Unlike word formation, it never leads to a change of word type and produces a predictable meaning. The inflection of nominals is called declension, and of verbs, conjugation.

Word formation (derivation) - the formation of a word or stem from another word or stem.

Discriminating power - see Distinctive power.

Search engine spam (spam, spamdexing, cheating of search engines) - an attempt to influence the result of information search on the part of document authors.

Static popularity - see PageRank.

Stemming - the process of identifying the stem of a word.

Stop words (stop-words) - conjunctions, prepositions and other frequent words that a given search engine excludes from the indexing and search process in order to improve its performance and/or the accuracy of the search.

Suffix trees, suffix arrays (suffix trees, suffix arrays, PAT arrays) - an index based on representing all the significant suffixes of a text in a data structure known as a trie. A suffix in this index is any "substring" beginning at some position in the text (the text is treated as one continuous string) and continuing to its end. In real applications the length of suffixes is limited, and only significant positions are indexed, for example, the beginnings of words. This index makes it possible to execute more complex queries than an index built on inverted files.

Tokenization (tokenization, lexical analysis, graphematic analysis) - identifying words, numbers and other tokens in a text, including, for example, finding sentence boundaries.

Accuracy (precision) - the share of relevant material in the search engine response.

Hash value (hash value) - the value of a hash function, which converts data of arbitrary length (usually a string) into a number of fixed size.

Frequency of a word in documents (document frequency, occurrence in documents) - the number of documents in the collection that contain a given word.

Term frequency (term frequency, TF) - the frequency of use of a word within a document.

Shingle (shingle) - hash value of a continuous sequence of text words of a fixed length.

PageRank- an algorithm for calculating the static (global) popularity of a page on the Internet, named after one of the authors, Lawrence Page. Corresponds to the probability of a user hitting a page in a random walk model.

TF*IDF - a numerical measure of the correspondence between a word and a document in the vector model; it is the higher, the more often the word occurs in the document and the more rarely it occurs in the collection.

By definition, an Internet search engine is an information retrieval system that helps us find information on the World Wide Web and thereby facilitates the global exchange of information. But the Internet is an unstructured database: it grows exponentially and has become a huge repository of information. Finding information on the Internet is a difficult task, and there is a need for a tool to manage, filter and retrieve this ocean of information. The search engine serves this purpose.

How does a search engine work?

Internet search engines are systems that search for and retrieve information on the Internet. Most of them use a crawler-indexer architecture and depend on their crawler modules. Crawlers, also called spiders, are small programs that browse the web.

Crawlers start from an initial set of URLs. They extract the URLs that appear on the crawled pages and send this information to the crawler control module, which decides which pages to visit next and gives those URLs back to the crawlers.

The topics covered by different search engines vary depending on the algorithms they use. Some search engines are programmed to search sites on a specific topic, while others' crawlers may visit as many places as possible.

The indexing module extracts information from each page it visits and enters the URL into the database. This results in a huge lookup table with a list of URLs pointing to pages of information. The table shows the pages that were covered during the crawl.

The analysis module is another important part of the search engine architecture. It creates a utility index. The utility index can, for example, provide access to pages of a given length or pages containing a certain number of pictures.

During the crawling and indexing process, the search engine stores the pages it retrieves. They are temporarily stored in page storage. Search engines maintain a cache of the pages they visit to speed up the retrieval of pages that have already been visited.

The search engine's query module receives search queries from users in the form of keywords. The ranking module sorts the results.

The crawler-indexer architecture has many variations. One of them is the distributed search engine architecture, which consists of gatherers and brokers. Gatherers collect indexing information from web servers, while brokers provide the indexing mechanism and the query interface. Brokers update the index based on information received from gatherers and from other brokers, and they can filter information. Many search engines today use this type of architecture.

Search engines and page ranking

When we submit a query to a search engine, the results are displayed in a certain order. Most of us tend to visit the top pages and ignore the lower ones, because we believe that the top few pages are more relevant to our query. So everyone is interested in having their pages rank in the top ten of search engine results.

The words entered in the search engine's query interface are the keywords being searched for, and the answer is a list of pages related to those keywords. During this process, search engines retrieve pages in which these keywords occur frequently and look for relationships between the keywords. The placement of keywords also counts, as does the ranking of the pages containing them. Keywords that appear in page titles or URLs are given more weight. Pages with links pointing to them become even more popular: if many other sites link to a page, it is seen as valuable and more relevant.

Every search engine uses a ranking algorithm: a computerized formula designed to return pages relevant to the user's query. Each search engine may have a different ranking algorithm that analyzes the pages in the engine's database to determine relevant responses to search queries. Search engines also index information differently. This means that a specific query posed to two different search engines may return pages in different orders or retrieve different pages altogether. The popularity of a website is one of the factors that determine relevance. Click-through popularity, a measure of how often a site is visited from the results, is another factor that determines its rank.

Webmasters try to deceive search engine algorithms in order to raise their site's position in search results: they stuff pages with keywords or use meta tags to game search engine ranking strategies. But search engines are smart enough; they keep improving their algorithms so that the machinations of webmasters do not affect search results.

You need to understand that even the pages after the first few in the list may contain exactly the information you were looking for. But rest assured that good search engines will always bring you highly relevant pages in the top order!

On various sites on the Internet, users are offered a large amount of diverse information. Search engines were created to obtain the necessary information and find answers to questions. Hearing this phrase, many people think of Google or Yandex. However, there are many more search engines on the Internet.

What is a search engine

A search engine can be thought of as software built around a database of documents. Users are provided with a special interface that allows them to enter queries and receive links to relevant information. The documents that best match what a particular person is looking for always occupy the top positions in the results.

The results generated in response to a query usually contain different types of items: web pages, video and audio files, pictures, PDF files, or specific products (if the search is performed in an online store).

Classification of search engines

Existing search engines fall into several types. First of all, it is worth mentioning traditional search engines, whose operating principles are focused on finding information across the huge number of existing sites. Search functions are also found on individual Internet resources:

  • in online stores (to search for the necessary products);
  • on forums and blogs (to search for messages);
  • on information sites (to search for articles on the desired topic or news), etc.

Search engines are also subdivided based on geographic location. In this classification there are 3 groups of search engines:

  1. Global. The search is being conducted all over the world. The leader in this group is the Google search engine. Previously, there were such search engines as Inktomi, AltaVista, etc.
  2. Regional. The search is carried out within a country or a group of countries sharing the same language. Regional search engines are widespread; examples in Russia are Yandex and Rambler.
  3. Local. The search is carried out in a specific city. An example of such a search engine is Tomsk.ru.

Components of search engines

In any search engine, there are 3 components that determine the principles of operation of the search system:

  • robot (indexer, spider, crawler);
  • database;
  • request handler.

A robot is a special program whose purpose is to build the database. The database stores and sorts all the collected information. The request processor, also called the client, handles user requests; it has access to the database. The client is not necessarily located on a single computer: the request processor may be distributed across several physically unconnected machines.

All existing systems work according to the same principle. Consider, for example, the functioning of traditional search engines designed for the Internet. The functioning of the robot is similar to the actions of a regular user. This program periodically crawls all sites, adding new pages and Internet resources to the database. This process is called indexing.

When a user enters a specific query into the search bar, the client goes to work. The program accesses the existing database and generates results based on the keywords. The search engine provides links to the user in a certain sequence: they are sorted by how well they match the query, that is, relevance is taken into account.

Each search engine has its own way of determining relevance. If a user sends the same query to different systems, he will not receive exactly the same results. The algorithm for determining relevance is kept secret.

Read more about relevance

Put simply, relevance is the degree to which the results correspond to the word or combination of words entered into the search box. The position of a document in the list is affected by several factors:

  1. Presence of the query words in the document. This is obvious: if the document contains words from the user's query, it matches the search conditions.
  2. Frequency of occurrence of the words. The more often the keywords are used in a document, the higher it will appear in the results list. However, it is not that simple: using the words too often can signal low-quality content to the search engine.

The algorithm for determining relevance is quite complex. A few years ago, links that contained the necessary keywords but did not correspond to them in content could appear in the results. Nowadays the operating principles of search engines are more sophisticated: robots can analyze the entire text, and the ranking takes a huge number of different factors into account. Thanks to this, the results are formed from the highest-quality, most relevant links.

How to formulate queries correctly

Back in school we were taught to ask questions correctly, since this determines the answers we receive. However, this rule does not have to be followed when using search engines. For modern search engines it does not matter in what number or case a person writes the query; in any case, the output will include the same results.

Search engines do not need a clear formulation of the question. The user only needs to select the right keywords. Let's look at an example. We need to find the lyrics of the song “A Day Without You”, performed by the famous female pop group “Via-Gra”. When contacting a search engine, it is not necessary to name the group or indicate that it is a song. It is enough to write “a day without you text.” No case or punctuation required. These nuances are not taken into account by search engines.

The leading search engine in the world is Google. It was founded in 1998. The system is very popular, which is confirmed by analytical information. About 70% of requests received on the Internet are processed by Google. The search engine database is huge. More than 60 trillion different documents have been indexed. Google attracts users with a simple interface. On the main page there is a logo and a search bar. This feature allows us to call Google one of the most minimalistic search engines.

Bing is in second place in the ranking of popular search engines. It was launched in 2009 by the famous international corporation Microsoft. Lower positions in the ranking are occupied by Baidu, Yahoo!, AOL, Excite and Ask.

What's popular in Russia

Among search engines in Russia, Yandex is the most popular. The service appeared in 1997; it was first created by the Russian company CompTek International, and a little later the Yandex company was formed, which continued to develop the search engine. Over the years it has gained enormous popularity. It allows searching in several languages: Russian, Belarusian, Ukrainian, Tatar, Kazakh, English, German, French and Turkish.

From statistical information it is known that Yandex is of interest to more than 50% of Runet users. More than 40% of people prefer Google. Approximately 3% of users chose Mail.ru, a Russian-language Internet portal.

Protected search engines

Conventional search engines that are familiar to us are not entirely suitable for children. Young Internet users may accidentally find some adult materials or information that can harm their psyche. For this reason, special secure search engines were created. Their databases store only safe content for children.

An example of one such search engine is “Sputnik.Children”. This service is quite young. It was created by Rostelecom in 2014. The search engine's main page is brightly and interestingly designed. It presents a wide range of domestic and foreign cartoons for children of different ages. Additionally, the main page contains educational links related to several headings - “Sports”, “I want to know everything”, “Do it yourself”, “Games”, “Technology”, “School”, “Nature”.

Another example of a secure children's search system is Agakids.ru. This is an absolutely safe resource. How does a search engine work? The robot is configured in such a way that it crawls only those sites that are related to children's topics or are useful for parents. The search engine database includes resources with cartoons, books, educational literature, games, coloring books. Parents, using Agakids.ru, can find sites for themselves on the upbringing and health of children.

In conclusion, it is worth noting that search engines are complex systems. They face many problems - problems of spam, determining the relevance of documents, filtering out low-quality content, analyzing documents that do not contain textual information. For this reason, developers are introducing new approaches and algorithms that are a trade secret into the work of Internet search engines.

Why does a marketer need to know the basic principles of search engine optimization? It's simple: organic search is an excellent source of incoming target audience for your corporate website and even for landing pages.

Here begins a series of educational posts on the topic of SEO.

What is a search engine?

A search engine is a large database of documents (content). Search robots crawl resources and index different types of content, and it is these saved documents that are ranked in search.

In fact, Yandex is a "snapshot" of the RuNet (plus Turkey and a few English-language sites), and Google is a snapshot of the global Internet.

A search index is a data structure containing information about documents and the location of keywords in them.

According to the principle of operation, search engines are similar to each other, the differences lie in the ranking formulas (ordering sites in search results), which are based on machine learning.

Every day, millions of users submit queries to search engines.

Some ask to "write an abstract", some want to "buy", but most of all they are interested in... (the illustrative query screenshots are omitted here).

How does a search engine work?

To provide users with quick answers, the search architecture was divided into 2 parts:

  • basic search,
  • metasearch.

Basic search

Basic search is a program that searches its own part of the index and returns all documents matching the query.

Metasearch is a program that processes a search query and determines the user's region. If the query is popular, it returns a ready-made (cached) set of results; if the query is new, it selects the basic searches to use and issues a command to retrieve documents, then ranks the documents found with the help of machine learning and presents them to the user.
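A toy sketch of how such a split might look: the metasearch layer fans the query out to several basic-search processes ("shards") and merges their partial answers (all names and scores here are invented):

```python
def metasearch(query, shards, top_n=10):
    """Send the query to every basic-search shard, merge the partial answers,
    and return the best-ranked documents."""
    merged = []
    for shard in shards:                  # shard: callable query -> list of (score, doc_id)
        merged.extend(shard(query))
    merged.sort(reverse=True)
    return merged[:top_n]

# Hypothetical shards, each owning its own part of the index.
shard_a = lambda q: [(0.9, "doc-17"), (0.4, "doc-3")] if q == "laptop" else []
shard_b = lambda q: [(0.7, "doc-88")] if q == "laptop" else []
print(metasearch("laptop", [shard_a, shard_b]))
```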

Classification of search queries

To give a relevant answer to the user, the search engine first tries to understand what exactly he needs. The search query is analyzed and the user is analyzed in parallel.

Search queries are analyzed according to the following parameters:

  • Length;
  • definition;
  • popularity;
  • competitiveness;
  • syntax;
  • geography.

Request type:

  • navigation;
  • informational;
  • transactional;
  • multimedia;
  • general;
  • official

After parsing and classifying the request, a ranking function is selected.

The designation of query types is confidential information and the proposed options are the guesswork of search engine optimization specialists.

If a user asks a general query, the search engine returns different types of documents. You should understand that by promoting a commercial page of a site into the TOP 10 for a general query, you are competing not for one of all ten places, but only for the number of places that the ranking formula allots to commercial pages. The likelihood of ranking at the top for such queries is therefore lower.

MatrixNet is a machine-learning algorithm introduced by Yandex in 2009 that selects a ranking function for documents for particular classes of queries.

MatrixNet is used not only in Yandex search but also for scientific purposes: for example, at CERN, the European center for nuclear research, it is used to search for rare events in large volumes of data (the hunt for the Higgs boson).

The primary data to evaluate the effectiveness of the ranking formula is collected by the assessor department. These are specially trained people who evaluate a sample of sites using an experimental formula according to the following criteria.

Site quality assessment

Vital - the official website (Sberbank, LPgenerator). The search query corresponds to an official website, groups in social networks, or information on authoritative resources.

Useful (rated 5) - a site that provides extensive information upon request.

Example - request: banner fabric.

A site that is rated “useful” must contain the following information:

  • what is banner fabric;
  • specifications;
  • photos;
  • kinds;
  • price list;
  • something else.

Examples of queries in the top:

Relevant+ (score 4) – This score means the page is relevant to the search query.

Relevant - (score 3) - The page does not exactly match the search query.

Say the query “Guardians of the Galaxy showtimes” returns a page about the movie with no showtimes, a page for a showing that has already passed, or a trailer page on YouTube.

Irrelevant (score 2) – the page does not match the request.
Example: a query with the name of one hotel returns a page about a different hotel.

To promote a resource for a general or informational request, you need to create a page that corresponds to the “useful” rating.

For clear queries, it is enough to meet the “relevant+” rating.

Relevance is achieved through textual and link correspondence of the page to search queries.
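One standard way to turn such assessor grades into a single number for comparing ranking formulas is a discounted gain metric. The sketch below uses generic DCG; search engines have their own internal quality metrics, so this is only a textbook-style illustration.

```python
# Discounted cumulative gain (DCG) over assessor grades.
import math

def dcg(grades):
    """grades: assessor scores in the order a formula ranked the documents."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

formula_a = [5, 3, 4, 2]   # grades of the top four results under formula A
formula_b = [4, 5, 2, 3]
print(dcg(formula_a), dcg(formula_b))   # the higher the DCG, the better the formula
```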

Conclusions

  1. Not every query can be promoted with a commercial landing page;
  2. Not every informational query can be used to promote a commercial website;
  3. When promoting a general query, create a useful page.

A common reason why a site does not rank in the top is that the content of the promoted page does not match the search query.

We’ll talk about this in the next article, “Checklist for basic website optimization.”

Search engines (SEs) have been an essential part of the Internet for quite some time. Today they are huge, complex mechanisms that serve not only as a tool for finding any necessary information but also as an exciting area of business.


Many search engine users have never thought about how these systems operate, how they process user requests, or how they are built and function. This material will help people involved in optimization understand the structure and main functions of search engines.

Functions and concept of a search engine

A search engine is a hardware and software complex designed to perform search on the Internet: it responds to a user request, usually given as a text phrase (a search query), with a list of links to information sources ordered by relevance. The most common and largest search engines are Google, Bing, Yahoo and Baidu; in the Runet they are Yandex, Mail.Ru and Rambler.

Let's take a closer look at the meaning of the search query, taking the Yandex system as an example.

The user should formulate the request in full accordance with the subject of the search, as simply and briefly as possible. For example, suppose we want to find information on “how to choose a car for yourself”: we open the main page and enter the search query “how to choose a car.” After that, all we have to do is follow the links provided to information sources on the network.




But even acting this way, we may not get the information we need. If we get such a negative result, we simply need to rephrase our query, or there really is no useful information in the search database for this kind of query (which is quite possible with “narrow” query parameters such as, for example, “how to choose a car in Anadyr”).

The most basic task of every search engine is to deliver to people exactly the information they need. And it is practically impossible to teach users to compose the “correct” kind of queries, that is, phrases that match the engine's operating principles.

That is why search engine developers create principles and algorithms that allow users to find the information they are interested in. This means the system must “think” the same way a person does when searching for information on the Internet.

When a user enters a query into a search engine, they want to find what they need as easily and quickly as possible. Having received the result, the user judges the system's performance by several criteria. Did they manage to find the necessary information? If not, how many times did they have to rephrase the query to find it? How up to date was the information received? How quickly did the search engine process the request? How conveniently were the results presented? Was the desired result first, or in 30th place? How much “junk” (unnecessary information) was found along with the useful results? Will relevant information still be found, using the same engine, a week or a month from now?




To get the right answers to such questions, search developers constantly improve their ranking principles and algorithms, add new features and functions, and try by every means to make their systems work faster.

Main characteristics of search engines

Let us indicate the main characteristics of the search:

Completeness.

Completeness is one of the most important search characteristics: it is the ratio of the number of documents found for a query to the total number of documents on the Internet relevant to that query. For example, if there are 100 pages on the Internet with the phrase “how to choose a car” and only 60 of them are retrieved for that query, the completeness of the search is 0.6. Clearly, the more complete the search, the greater the likelihood that the user will find exactly the document they need, provided it exists at all.

Accuracy.

Another key characteristic of a search engine is accuracy. It measures how well the pages found match the user's request. For example, if a hundred documents are found for the phrase “how to choose a car,” fifty of them contain the phrase and the rest merely contain the individual words (say, “how to choose a car radio and install it in a car”), then the search accuracy is 50/100 = 0.5.

The more accurate the search, the sooner the user finds the information they need, the less “garbage” appears among the results, and the fewer found documents fail to match the meaning of the request.
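The two worked examples above are simply recall (completeness) and precision (accuracy); the tiny calculation below reproduces the 0.6 and 0.5 figures.

```python
# Completeness is recall, accuracy is precision.
def completeness(found_relevant, total_relevant_on_web):
    return found_relevant / total_relevant_on_web

def accuracy(relevant_in_results, total_results):
    return relevant_in_results / total_results

print(completeness(60, 100))   # 0.6 - the "how to choose a car" example
print(accuracy(50, 100))       # 0.5
```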

Freshness (up-to-dateness).

This is an important component of search, characterized by the time that passes from the moment information is published on the Internet until it is entered into the search engine's index database.

For example, the day after news of a new iPad's release appeared, many users turned to search engines with corresponding queries. In most cases, information about this news was already available in search results even though very little time had passed since it appeared. This is possible because large search engines maintain a “fast database” that is updated several times a day.

Search speed.

Search speed is closely related to what is called “load resistance.” Every second an enormous number of people are searching, and such a workload requires that the time to process a single request be kept very short. Here the interests of the search engine and the user coincide completely: the visitor wants results as quickly as possible, and the search engine must process the request as quickly as possible so as not to slow down the processing of subsequent requests.

Visibility.

Visual presentation of results is the most important element of search convenience. For many queries, the search engine finds thousands, and in some cases millions, of documents. Because key phrases are often composed unclearly or imprecisely, even the very first results do not always contain only the necessary information.

This means that a person often has to conduct their own search among the results provided. Various components of the results page help the user navigate the search results.

History of the development of search engines

When the Internet first began to develop, the number of its regular users was small, and the amount of information available was relatively modest. Mostly only specialists in research fields had access to the network, and at that time the task of finding information was not as pressing as it is now.

One of the very first ways of organizing wide access to information resources was the creation of site directories, in which links to sites were grouped by topic. The first such project was Yahoo.com, which opened in the spring of 1994. Later, when the number of sites in the Yahoo directory had grown significantly, the option to search within the directory was added. It was not yet a full search engine, since the search was limited to the sites listed in the directory rather than to all resources on the Internet. Link directories were widely used in the past, but nowadays they have almost completely lost their popularity.

After all, even today's directories, enormous as they are, contain information about only a small portion of the sites on the Internet. The largest and best-known directory in the world holds information on about five million sites, while Google's database contains information on more than 25 billion pages.




The very first real search engine was WebCrawler, which appeared back in 1994.

The following year AltaVista and Lycos appeared, and AltaVista remained the leader in information search for a long time.




In 1997, Sergey Brin and Larry Page created the Google search engine as a research project at Stanford University. Today Google is the most popular search engine in the world.




In September 1997, the Yandex search engine was officially announced; it is currently the most popular search engine on the RuNet.




As of September 2015, the worldwide shares of search engines were distributed as follows:
  • Google - 69.24%;
  • Bing - 12.26%;
  • Yahoo! - 9.19%;
  • Baidu - 6.48%;
  • AOL - 1.11%;
  • Ask - 0.23%;
  • Excite - 0.00%


As of December 2016, the shares of search engines in the Runet were:

  • Yandex - 48.40%
  • Google - 45.10%
  • Search.Mail.ru - 5.70%
  • Rambler - 0.40%
  • Bing - 0.30%
  • Yahoo - 0.10%

How a search engine works

In Russia, the main search engine is Yandex, followed by Google and then Search@Mail.ru. All large search engines have their own structure, which differs considerably from the others, but it is still possible to identify the basic elements common to all of them.

Indexing module.

This component consists of three robot programs:

Spider is a program designed to download web pages. The spider downloads a specific page while extracting all the links from it, obtaining the HTML code of practically every page over the HTTP protocol.




"Spider" functions as follows. The robot sends a request to the server “get/path/document” and other HTTP request commands. In response, the robot program receives a text stream that contains service-type information and, of course, the document itself.
  • URL of the downloaded page;
  • date when the page was downloaded;
  • server http response header;
  • html code, “body” of the page.
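A minimal sketch of this step, using only the Python standard library, might fetch a page and store exactly the four fields listed above. Real spiders also honour robots.txt, limit request rates and handle errors; none of that is shown here.

```python
# Toy spider: download one page and keep its URL, date, headers and HTML body.
from datetime import datetime, timezone
from urllib.request import urlopen

def fetch(url):
    with urlopen(url) as response:        # effectively a GET /path/document request
        return {
            "url": url,
            "downloaded_at": datetime.now(timezone.utc).isoformat(),
            "headers": dict(response.headers),
            "html": response.read().decode("utf-8", errors="replace"),
        }

page = fetch("https://example.com/")
print(page["headers"].get("Content-Type"))
```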
Crawler (“traveling” spider). This program automatically follows all the links found on a page and extracts them. Its task is to decide where the spider should go next, based on those links or on a predefined list of addresses.
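A hedged sketch of that decision logic is a simple frontier queue: take the next URL, let the spider download it, pull out new links and enqueue the unseen ones. It reuses the fetch function from the spider sketch above, and the link extraction is deliberately naive.

```python
# Toy crawler: breadth-first traversal over a frontier of URLs.
from collections import deque
import re

def extract_links(html):
    return re.findall(r'href="(https?://[^"]+)"', html)   # naive, illustration only

def crawl(seed_urls, limit=100):
    frontier, seen, pages = deque(seed_urls), set(seed_urls), []
    while frontier and len(pages) < limit:
        url = frontier.popleft()
        page = fetch(url)                     # download via the spider sketch above
        pages.append(page)
        for link in extract_links(page["html"]):
            if link not in seen:              # decide where the spider goes next
                seen.add(link)
                frontier.append(link)
    return pages
```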

Indexer (indexing robot) is a program that analyzes the pages the spiders have downloaded.



The indexer completely parses the page into its component elements and analyzes them using its own morphological and lexical algorithms.

The analysis covers various parts of the page, such as headings, text, links, style and structural features, HTML tags, and so on.
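As a toy illustration of this step, the sketch below splits a downloaded page into a couple of parts (title and visible text), “normalises” words by mere lowercasing in place of real morphological analysis, and feeds them into an inverted index like the one sketched earlier. Positions restart for each part, which a real indexer would not allow.

```python
# Toy indexer: break a page into parts and add normalised words to the index.
import re
from collections import defaultdict

index = defaultdict(lambda: defaultdict(list))   # term -> doc_id -> positions

def index_page(doc_id, html):
    title_match = re.search(r"<title>(.*?)</title>", html, flags=re.S)
    title = title_match.group(1) if title_match else ""
    body_html = re.sub(r"<head>.*?</head>", " ", html, flags=re.S)  # drop the <head> block
    body = re.sub(r"<[^>]+>", " ", body_html)                       # strip the remaining tags
    for part in (title, body):   # headings, link text, etc. would be handled the same way
        for pos, word in enumerate(re.findall(r"\w+", part.lower())):
            index[word][doc_id].append(pos)

index_page(1, "<html><head><title>How to choose a car</title></head>"
              "<body>A short guide to choosing a car.</body></html>")
print(dict(index["car"]))   # {1: [4, 6]} - once from the title, once from the body
```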

Thus, the indexing module makes it possible to follow the links of a given set of resources, download pages, extract links to new pages from the documents received, and perform a detailed analysis of them.

Database

The database (or search engine index) is a data storage complex, an array of information in which the processed parameters of every document downloaded and handled by the indexing module are stored in a particular way.

Search server

This is the most important element of the entire system, because the speed and, of course, the quality of the search depend directly on the algorithms underlying its operation.

The search server works as follows (a minimal sketch of these steps appears after the list):

  • The request coming from the user undergoes morphological analysis. The information context of each matching document in the database is generated (it will later be displayed as a snippet, that is, a fragment of text corresponding to the query).
  • The resulting data is passed as input to a specialized ranking module, where it is processed for all the documents; as a result, each document receives its own rating characterizing, among other things, how relevant it is to the user's request.
  • Depending on conditions specified by the user, this rating may be adjusted by additional criteria.
  • Then the snippet itself is generated: for each document found, the title, the abstract that best matches the query, and a link to the document are extracted from the corresponding table, with the matching words and word forms highlighted.
  • The search results are returned to the user as a search engine results page (SERP).
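The sketch below strings these steps together in miniature: normalise the query, pull candidate documents from the index, score them with a deliberately primitive term-frequency rating, and build a snippet with the query words highlighted. It reuses the toy index and docs structures from the earlier sketches; a real search server does all of this across a cluster with a machine-learned ranking formula.

```python
# Toy search server pipeline: query -> candidates -> ranking -> snippets (SERP).
def serve(query, index, docs, top_n=10):
    terms = query.lower().split()                       # stand-in for morphological analysis
    candidates = set()
    for term in terms:                                  # pull candidates from the index
        candidates |= set(index.get(term, {}))
    def rating(doc_id):                                 # toy relevance rating: term frequency
        return sum(len(index.get(t, {}).get(doc_id, [])) for t in terms)
    ranked = sorted(candidates, key=rating, reverse=True)[:top_n]
    serp = []
    for doc_id in ranked:                               # build the snippet for each result
        words = docs[doc_id].split()[:12]
        snippet = " ".join(w.upper() if w.lower() in terms else w for w in words)
        serp.append({"doc": doc_id, "rating": rating(doc_id), "snippet": snippet})
    return serp

# serve("choose a car", index, docs) -> list of results ready to render as a SERP
```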
All these elements are closely interconnected and work together, forming a complex mechanism behind the search engine's operation that requires an enormous expenditure of resources.