Yandex scrapes Google and other SEO learnings from the source code leak
Abstract
The article discusses the recent leak of Yandex's codebase, which provides a rare glimpse into the inner workings of a modern search engine. The author analyzes the leaked code, highlights similarities and differences between Yandex's and Google's search technologies, and shares insights into Yandex's ranking factors, indexing processes, and architectural components.
Q&A
[01] Yandex Codebase Leak
1. What are the key insights the author gained from reviewing the Yandex codebase?
- The codebase reveals details about Yandex's search engine architecture, including its two distributed crawlers, its sharded database structure, and its use of BERT and the neural network-based MatrixNet ranking system.
- The codebase contains over 17,000 ranking factors, which are categorized as static, dynamic, and search-related. This suggests a highly complex and dynamic ranking environment, similar to Google's.
- The author found initial ranking factor weights hard-coded in the codebase, providing insights into Yandex's relevance scoring process.
- The codebase also reveals Yandex's use of various text processing techniques like TF-IDF, BM25, and BERT, as well as its link analysis and anti-spam measures.
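As an illustration of the text-scoring techniques named above, here is a minimal sketch of the standard Okapi BM25 formula. It is not taken from the leaked code: the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, and the toy documents are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is normalized by length (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "yandex search engine ranking factors".split(),
    "cooking recipes for pasta".split(),
    "search ranking with bm25 and tf idf".split(),
]
scores = bm25_scores("search ranking".split(), docs)
```

The second document shares no terms with the query and scores zero. The first and third both contain each query term once, so the shorter first document scores higher because of length normalization.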
2. How does the Yandex codebase compare to Google's search technology?
- Yandex is not Google, but both companies rely on similar state-of-the-art approaches to search, such as distributed computing, neural networks, and natural language processing.
- The author notes that many Yandex engineers have also worked at Google, and that Yandex uses open-source technologies that originated at Google, such as TensorFlow and BERT.
- However, the author also highlights differences, such as Yandex's lack of a separate rendering system for JavaScript and its apparent use of some Google data in its ranking calculations.
3. What are the implications of the Yandex codebase leak for the SEO community?
- The codebase provides valuable insights that can help the SEO community expand its understanding of modern search engine ranking factors and algorithms beyond the traditional "200 signals" narrative.
- The author suggests the codebase will lead to more hypotheses and testing around ranking factors, as well as improvements in SEO tools and techniques to better measure and analyze the factors that modern search engines consider.
- The author also anticipates that the leak will spur innovation as engineers at various search engines learn from the Yandex codebase.
[02] Ranking Factors and Algorithms
1. What are the key ranking factors and algorithms revealed in the Yandex codebase?
- The codebase contains over 17,000 ranking factors, covering a wide range of metrics related to content, links, user signals, and more.
- The author highlights some of the most heavily weighted positive and negative ranking factors, as well as some unexpected factors that stood out.
- Yandex uses a multi-layered ranking process: initial relevance scoring, followed by re-ranking with its neural network-based MatrixNet system.
- As noted under [01], the ranking pipeline also draws on text processing techniques like TF-IDF, BM25, and BERT, as well as link analysis and anti-spam measures.
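The multi-layered process described above (a cheap initial score from weighted factors, then re-ranking of a shortlist by a learned model) can be sketched roughly as follows. Everything here is hypothetical: the factor names, the weights in `INITIAL_WEIGHTS`, and `toy_model` are illustrative stand-ins, not values from the leaked code and not an implementation of MatrixNet.

```python
from dataclasses import dataclass

# Hypothetical hard-coded initial weights, for illustration only.
INITIAL_WEIGHTS = {
    "text_relevance": 0.5,
    "link_quality": 0.3,
    "spam_score": -0.4,  # negative weight acts as a penalty
}


@dataclass
class Doc:
    url: str
    factors: dict


def initial_score(doc):
    """First pass: a simple weighted sum of factor values."""
    return sum(w * doc.factors.get(name, 0.0)
               for name, w in INITIAL_WEIGHTS.items())


def toy_model(doc):
    """Stand-in for a learned re-ranker (an assumption, not MatrixNet)."""
    return doc.factors.get("text_relevance", 0.0) * doc.factors.get("click_rate", 0.0)


def rank(docs, model, top_k=2):
    # Stage 1: keep only the top_k candidates by the cheap initial score.
    shortlist = sorted(docs, key=initial_score, reverse=True)[:top_k]
    # Stage 2: re-order the shortlist with the more expensive model.
    return sorted(shortlist, key=model, reverse=True)


docs = [
    Doc("a.example", {"text_relevance": 0.9, "link_quality": 0.2, "click_rate": 0.5}),
    Doc("b.example", {"text_relevance": 0.6, "link_quality": 0.8, "click_rate": 0.5}),
    Doc("c.example", {"text_relevance": 0.8, "link_quality": 0.9,
                      "spam_score": 1.0, "click_rate": 0.9}),
]
ranked_urls = [d.url for d in rank(docs, toy_model)]
```

In this toy run the spam penalty drops `c.example` from the shortlist in the first stage, and the second stage then swaps the order of the two remaining documents. This shows how the two layers can disagree.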
2. How do Yandex's ranking factors and algorithms compare to Google's?
- The author suggests that Google's "200 signals" are likely composed of many individual features, similar to the thousands of factors found in the Yandex codebase.
- Both Yandex and Google use distributed computing and neural network-based ranking approaches, indicating similarities in their overall architectural approaches.
- However, the author notes that Yandex appears to have some unique features, such as its "Vital Hosts" boost that favors certain news agencies, which differs from Google's stated commitment to avoiding biases in its ranking system.
3. What are the implications of the revealed ranking factors and algorithms for SEO practitioners?
- The codebase provides SEO practitioners with a more concrete understanding of the complexity and dynamism of modern search engine ranking, moving beyond the abstract "200 signals" narrative.
- The insights can inform new hypotheses and testing around ranking factors, as well as improvements to SEO tools and techniques to better measure and analyze the factors that search engines consider.
- The author suggests the codebase will spur innovation in the SEO community as practitioners work to maximize the opportunities presented by the leaked information.