It’s a well-known fact that Google uses human raters in its search quality assessment. Yandex mainly talks about its machine learning and other fancy technologies, while few know that the human factor is involved in Yandex search just as much as in Google’s.
How Yandex collects the assessment data
Russia is a big country, and having committed to displaying perfect (or almost perfect, anyway) local search results, Yandex needs a lot of human raters in all regions of the country, and it has them. The raters work remotely from home, and the results of their work are closely monitored by Yandex.
Interestingly, Yandex hires many of these people through the Amazon Mechanical Turk service.
The raters are given random SERPs from real search queries and rate the documents on the following scale:
Vital (the best answer possible; usually official sites of organizations)
Useful (very good and informative answer)
Relevant + (answers the question)
Relevant – (partly relevant, but does not answer the question fully)
Irrelevant (does not answer the question)
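To make the scale concrete, here is a minimal sketch of how the labels might be mapped to numeric relevance gains for use in a metric. The labels come from the list above; the numeric values are my assumptions for the sake of the example, since Yandex does not publish the exact numbers.

```python
# Rater labels from the article, mapped to assumed numeric gains.
RATING_SCALE = {
    "Vital": 1.0,        # best possible answer, e.g. an official site
    "Useful": 0.7,       # very good and informative answer
    "Relevant+": 0.4,    # answers the question
    "Relevant-": 0.1,    # partly relevant, does not answer fully
    "Irrelevant": 0.0,   # does not answer the question
}

def gain(label: str) -> float:
    """Return the numeric gain for a rater label."""
    return RATING_SCALE[label]
```

A mapping like this is what lets discrete human judgments feed into continuous quality metrics and training objectives.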
The raters are given tasks such as evaluating a specific document, evaluating the search results for a particular query, evaluating a site’s snippet in a SERP, comparing two documents and picking the one more relevant to a specific search query, or comparing two search results pages and picking the better one.
The human assessments are mainly applied to the top 10 results, but they can also cover lower positions, depending on Yandex’s goal.
How Yandex uses the human ratings
There are two main ways the human assessments are used at Yandex: for evaluating quality of search results and for “teaching” MatrixNet, the machine learning technology that powers Yandex search results.
Yandex search quality measurements
Yandex has many different metrics for measuring the quality of search results, one of them being pFound. pFound measures the probability that the user will find the answer he or she is looking for, based on two hypotheses: a) the user browses the SERP from top to bottom, and b) the user clicks on every document until he or she finds the answer or leaves the SERP without one.
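The two hypotheses above can be turned into a simple cascade computation. This is a sketch based on the commonly published formulation of pFound; the per-step abandonment probability of 0.15 is a frequently cited value, used here as an assumption.

```python
def pfound(p_rel, p_break=0.15):
    """Compute pFound for a ranked list of relevance probabilities.

    p_rel[i] is the probability that the document at position i answers
    the query (in practice derived from rater labels). p_break is the
    assumed probability that the user abandons the SERP at each step.
    """
    p_look = 1.0  # probability the user examines the current position
    total = 0.0
    for p in p_rel:
        total += p_look * p
        # The user reaches the next result only if this one did not
        # answer the query and they did not give up.
        p_look *= (1.0 - p) * (1.0 - p_break)
    return total

# A SERP with the likely answer at the top scores higher than one
# with the same answer buried lower down.
good = pfound([0.9, 0.3, 0.1])
bad = pfound([0.1, 0.3, 0.9])
```

The cascade structure is what makes the metric reward putting the best answer first: relevance at lower positions is discounted by the probability the user ever gets there.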
Yandex keeps track of historical values of pFound, measuring the effect of changes in the ranking formula on the metric. For example, pFound improved significantly after Yandex changed the way it handles mistyped queries.
Teaching MatrixNet
Yandex is very proud of MatrixNet, which has indeed significantly improved its search results since it was implemented, but the company also understands the shortcomings of using an algorithm to judge content written by humans. Algorithms rely on rules and look for patterns in order to determine whether a document is relevant or not, e.g.
the document contains the search term + the document is clicked + the document is linked -> the document is probably relevant
And as all SEOs know, there are always ways to cheat the algorithms.
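The hand-written rule above can be expressed directly as code. This is purely an illustration of a brittle, explicit rule (the field names are invented), and it makes the problem obvious: anyone who knows the rule can stuff the term, buy links, and fake clicks.

```python
# The explicit rule from the text: a document is "probably relevant"
# if it contains the search term, gets clicked, and has inbound links.
# Field names (text, clicks, inlinks) are invented for illustration.
def probably_relevant(doc: dict, term: str) -> bool:
    return term in doc["text"] and doc["clicks"] > 0 and doc["inlinks"] > 0
```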
Human raters, on the other hand, without knowing the rules, classify documents as relevant or irrelevant according to their knowledge, intuition, understanding and common sense. Reviewing the data received from human raters, the algorithm tries to find and learn new, less obvious patterns and factors, e.g.
The document is rated relevant when it contains X factors (words, page structure etc.)
The document is rated irrelevant when it contains Y factors (words, page structure etc.)
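Here is a deliberately tiny sketch of that idea: given rater-labeled documents, find the single feature that best agrees with the “relevant” label. MatrixNet is vastly more sophisticated (gradient boosting over decision trees on thousands of factors), but the principle is the same: the pattern is inferred from labels rather than hard-coded. The feature names are invented.

```python
# Toy training set: (features, rater label) pairs. Feature names
# are hypothetical examples of X/Y factors from the text.
docs = [
    ({"has_query_term": 1, "has_popup_ads": 0}, "relevant"),
    ({"has_query_term": 1, "has_popup_ads": 0}, "relevant"),
    ({"has_query_term": 0, "has_popup_ads": 1}, "irrelevant"),
    ({"has_query_term": 1, "has_popup_ads": 1}, "irrelevant"),
]

def best_feature(docs):
    """Return the binary feature whose value agrees most often
    with the 'relevant' label across the labeled documents."""
    features = docs[0][0].keys()
    def accuracy(f):
        hits = sum((feats[f] == 1) == (label == "relevant")
                   for feats, label in docs)
        return hits / len(docs)
    return max(features, key=accuracy)
```

Even in this toy version, no rule was written by hand; the “rule” falls out of the rater labels.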
Other search quality improvement techniques at Yandex
The technology described above is very powerful, but on its own it is not enough to handle the complexity of the Russian language. Some words have several completely different meanings (e.g. Napoleon: a cake or the French emperor?), and some search terms are too generic, like “videos” (what does the user want: to watch, to make or to download them?).
Yandex has been focusing on determining user intent for quite some time now. I mentioned some of these attempts in my earlier posts.
For better understanding of user intent, Yandex uses, among other techniques, search session analysis and SERP experiments.
When performing search session analysis, Yandex looks at what queries users type in, how they reformulate their queries if no answer is found on the first attempt, and where users eventually click, if they click at all.
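Two of the signals just mentioned, reformulations and abandoned sessions, are easy to picture as simple aggregates over a session log. This is a minimal sketch with an invented log format; it is not Yandex’s actual pipeline.

```python
# Toy session log: each session records the queries the user typed
# and the result position they finally clicked (None = no click).
sessions = [
    {"queries": ["napoleon"], "clicked_position": 1},
    {"queries": ["napoleon", "napoleon cake recipe"], "clicked_position": 2},
    {"queries": ["videos"], "clicked_position": None},
]

def reformulation_rate(sessions):
    """Share of sessions where the user had to rephrase the query."""
    return sum(len(s["queries"]) > 1 for s in sessions) / len(sessions)

def no_click_rate(sessions):
    """Share of sessions that ended without any click."""
    return sum(s["clicked_position"] is None for s in sessions) / len(sessions)
```

A rising reformulation rate or no-click rate after an algorithm change is a hint that users are finding answers less easily.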
SERP experiments basically mean that a change in the search algorithm is rolled out on a certain percentage of search queries. After running the experiment for a period of time, Yandex analyzes how the change affected user behavior: how much the percentage of non-clicked results changed, what the average position of the first clicked result was, and about 10,500 other factors.
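The “certain percentage of queries” part is typically done with deterministic bucketing, so the same user consistently sees the same variant. The hash-based scheme below is a common generic approach, shown as an assumption; Yandex’s actual mechanism is not public.

```python
import hashlib

def in_experiment(user_id: str, percent: int) -> bool:
    """Deterministically assign a user to the experiment bucket.

    Hashing the user id and taking it modulo 100 gives a stable
    bucket in [0, 100); users below `percent` get the new variant.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100 < percent
```

Because the assignment is a pure function of the user id, behavior metrics for the experiment and control groups can be compared over the whole experiment period without users flip-flopping between variants.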
Yandex has a smart and complicated technology that is becoming better and better at understanding users. I find this fascinating. What’s your take on it?