Google’s in-house docs about search ranking leak online, sparking SEO frenzy

GitHub trove details API features that 'contradict' Big G’s public statements about how its engine works


Updated: A trove of documents that appear to describe how Google ranks search results has appeared online, likely as the result of accidental publication by an in-house bot.

The leaked documentation describes an old version of Google's Content Warehouse API and provides a glimpse of Google Search’s inner workings.

The material appears to have been inadvertently committed to a publicly accessible Google-owned repository on GitHub around March 13 by the web giant's own automated tooling. That automation tacked an Apache 2.0 open source license on the commit, as is standard for Google's public documentation. A follow-up commit on May 7 attempted to undo the leak.

The material was nonetheless spotted by Erfan Azimi, CEO of search engine optimization (SEO) biz EA Digital Eagle, and was then disclosed on Sunday by fellow SEO operatives Rand Fishkin, CEO of SparkToro, and Michael King, CEO of iPullRank.

These documents do not contain code or the like, and instead describe how to use Google's Content Warehouse API that's likely intended for internal use only; the leaked documentation includes numerous references to internal systems and projects. While there is a similarly named Google Cloud API that's already public, what ended up on GitHub goes well beyond that, it seems.

The files are noteworthy for what they reveal about the things Google considers important when ranking web pages for relevancy, a matter of enduring interest to anyone involved in the SEO business and/or anyone operating a website and hoping Google will help it to win traffic.

Among the 2,500-plus pages of documentation, since assembled online for easy perusal, there are details on more than 14,000 attributes accessible through or associated with the API, though scant information about whether all these signals are used or how important each one is. It is therefore hard to discern the weight Google applies to the attributes in its search result ranking algorithm.

But SEO consultants believe the documents contain noteworthy details because they differ from public statements made by Google representatives.

"Many of [Azimi's] claims [in an email describing the leak] directly contradict public statements made by Googlers over the years, in particular the company’s repeated denial that click-centric user signals are employed, denial that subdomains are considered separately in rankings, denials of a sandbox for newer websites, denials that a domain’s age is collected or considered, and more," explained SparkToro’s Fishkin in a report.

iPullRank’s King, in his post on the documents, pointed to a statement made by Google search advocate John Mueller, who said in a video that "we don’t have anything like a website authority score" – a measure of whether Google considers a site authoritative and therefore worthy of higher rankings for search results.

But King notes that, according to the docs, a "siteAuthority" score can be calculated as part of the Compressed Quality Signals Google stores for documents.

Several other revelations are cited in the two posts.

One is the importance of clicks – and of different types of clicks (good, bad, long, etc.) – in determining how a webpage ranks. During the US v. Google antitrust trial, Google acknowledged that it considers click metrics as a ranking factor in web search.

Another is that Google uses views of websites in Chrome as a quality signal, seen in the API as the parameter ChromeInTotal. "One of the modules related to page quality scores features a site-level measure of views from Chrome," according to King.

Additionally, the documents indicate that Google considers other factors like content freshness, authorship, whether a page is related to a site's central focus, alignment between page title and content, and "the average weighted font size of a term in the doc body."
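
The leaked documentation, as reported, names that last attribute without saying how it is computed. As a rough illustration only, here is a minimal Python sketch of one plausible reading: for each term, average the font sizes of every occurrence in the page body. The function name, the input format (pairs of text and font size), and the averaging rule are all assumptions made for this example, not anything confirmed by the documents.

```python
from collections import defaultdict


def average_weighted_font_size(runs):
    """Hypothetical helper: mean font size per term, averaged over occurrences.

    `runs` is an iterable of (text, font_size) pairs, one per styled span
    of body text. This input shape is an assumption for illustration.
    """
    totals = defaultdict(float)  # sum of font sizes seen for each term
    counts = defaultdict(int)    # number of occurrences of each term
    for text, font_size in runs:
        for term in text.lower().split():
            totals[term] += font_size
            counts[term] += 1
    return {term: totals[term] / counts[term] for term in totals}


# A heading at 32px followed by a body sentence at 16px.
runs = [
    ("Leaked ranking docs", 32.0),
    ("The docs describe ranking signals", 16.0),
]
sizes = average_weighted_font_size(runs)
print(sizes["docs"], sizes["leaked"], sizes["signals"])
```

Under this reading, a term that also appears in a large heading ends up with a higher average than one that appears only in body text, which is presumably why such a measure would interest a ranking system. Whether Google weights or uses the value this way is not stated in the material.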

Google did not respond to a request for comment. ®

Updated to add

After publication, Google told The Register that everyone needs to calm down and be aware that the accidentally revealed files may be missing vital context.

"We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information," a spokesperson told us. "We've shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation."
