Gaps in the algorithm
What machine learning can teach us about the limits of our knowledge
SAScon 2017
Will Critchlow - @willcritchlow
The rise of ML has taken an already-complex system and made it incomprehensible.
We might believe we know what works. But experiments show that’s not really true.
Computers might already be better than us. By exploring their limits, we learn more about our own, and about the underlying algorithm.
This is the sequel to a talk I’ve given a couple of times in the US ...and once in Leeds. If you didn’t see those, you can catch up here:
See the full video of my San Diego talk in DistilledU
If you did see one of them, have a nap for a few minutes. Or check your email.
Information retrieval + PageRank + original research + tweaks: the “classical” algorithm is full of tweaks.
When Amit left, this thread was fascinating. Particularly this comment from a user called Kevin Lacker (@lacker):
The algorithm became far too complex to approximate in your head: high-dimensional, non-linear, discontinuous.
Authority vs. relevance: it’s not even easy in two dimensions.
Imagine choosing between a more-relevant page with less authority…
...and a less-relevant page with more authority.
It’s only getting worse under Sundar Pichai
Aided by the new head of search John Giannandrea and ML experts like Jeff Dean
If you haven’t already seen it, you should read the story of how Jeff Dean & three engineers took just a month to beat a decade’s worth of work by hundreds of engineers by attacking Translate with ML.
Audiences generally still think they’re pretty good at this.
You’re probably thinking something similar to yourself right now.
I’ve now run an in-person experiment a few times.
I show two pages that rank for a particular search, along with various metrics for each page.
Then I ask the audience to stand up and predict which page ranks better for a given query.
I get people to sit down as they get them wrong. By the time we’ve done 2 or 3, almost everyone is sitting.
Gaps in the algorithm
Wake up
Behind this chart is a lot of story...
It starts with a train.
This is the Thameslink. I commute into London on it.
It’s also where I allow myself to write code.
It all started because I wanted to learn ML
keras.io
I quickly found working in Keras was easier
In order to work on a problem area I knew well, I decided to build a system to predict rankings:
The question we really want to answer is:
“How good is this page for this query?”
We want to train our model on Google data.
But we don’t actually know how close together these different results are.
And we certainly don’t know whether position #3 has the same relevance to this query as #3 does to a totally different query.
So I decided to train on the problem “does page A outrank page B for query X?”
I.e. is it A then B, or B then A?
We have tons more data to train this model on: every pair of URLs for every query we look at.
And it’s ultimately equivalent to “how do we improve page A?”
In mathematical terms, we express each page as a set of features:
{‘DA’: ‘67’, ‘lrd’: ‘254’, ‘tld’: ‘1’, ‘h1_tgtg’: ‘0.478’, ‘links_on_page’: ‘200’, ...}
We combine the two sets of features into one big vector, and label it (1,0) if A outranks B and (0,1) if B outranks A.
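For the curious, here is a minimal sketch of that set-up in Keras. The feature names come from the example dict above; the network architecture and everything else here is illustrative, not the model we actually trained.

```python
# Minimal sketch of the pairwise set-up described above.
# Feature names follow the example dict on the slide; the architecture is illustrative.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

FEATURES = ["DA", "lrd", "tld", "h1_tgtg", "links_on_page"]

def pair_to_vector(page_a: dict, page_b: dict) -> np.ndarray:
    """Concatenate page A's and page B's features into one input vector."""
    return np.array([float(page_a[f]) for f in FEATURES] +
                    [float(page_b[f]) for f in FEATURES])

def pair_label(a_outranks_b: bool) -> np.ndarray:
    """(1, 0) if A outranks B, (0, 1) if B outranks A."""
    return np.array([1.0, 0.0]) if a_outranks_b else np.array([0.0, 1.0])

model = keras.Sequential([
    layers.Input(shape=(2 * len(FEATURES),)),
    layers.Dense(32, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(2, activation="softmax"),   # P(A first), P(B first)
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, ...) once the pipeline below has produced the pairs.
```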
Note: we’re doing no spam detection
We’re working only with Google’s top 10
To run the model, we input a new pair of pages with their associated metrics.
We get back probability-weighted predictions: the probability of page A outranking page B.
Why? What are we doing here?
If we could do this perfectly, then we could tweak the values of our page (call that A`) and compare A to A`.
We’d get to simulate changes to see their impact without making them.
This is the holy grail.
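As a sketch of what that simulation would look like (reusing pair_to_vector and model from the earlier snippet; the specific tweak here is invented for illustration):

```python
# Tweak page A's features (A`) and ask the model whether the tweaked page
# would outrank the original. Feature values and the tweak are made up.
page_a = {"DA": "67", "lrd": "254", "tld": "1", "h1_tgtg": "0.478", "links_on_page": "200"}
page_a_prime = dict(page_a, links_on_page="120")  # hypothetical change

x = pair_to_vector(page_a_prime, page_a).reshape(1, -1)
p_improved, p_original = model.predict(x)[0]
print(f"P(tweaked page outranks original) = {p_improved:.2f}")
```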
And when we get close, the gaps will tell us where the unknowns in the algorithm lie.
There are a lot of dead ends before we get anywhere near that, though.
Let’s go stumbling through the trees.
The first thing to realise is that data pipelines are hard. Really hard.
There’s a reason that most of Google’s Rules of ML are about data.
Here’s what we did:
Raw rankings data → pull in API data → crawl the page → process on-page data
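Roughly, that pipeline has the shape sketched below. Every function here is a placeholder standing in for a rank-tracking export, a link-metrics API and a crawler; it is not the code we actually run.

```python
# Rough shape of the pipeline above. Each step is stubbed out: in reality the
# first calls a link-metrics API and the second a crawler plus on-page extraction.
import requests

def fetch_api_metrics(url: str) -> dict:
    # Placeholder: would return link metrics (e.g. DA, linking root domains) from an API.
    return {"DA": 0.0, "lrd": 0.0}

def crawl_and_extract(url: str) -> dict:
    # Placeholder: fetch the page and compute naive on-page features.
    html = requests.get(url, timeout=10).text
    return {"links_on_page": float(html.count("<a "))}

def build_rows(raw_rankings: dict) -> list:
    # raw_rankings: {query: [url_ranked_1, url_ranked_2, ...]}
    rows = []
    for query, ranked_urls in raw_rankings.items():
        for position, url in enumerate(ranked_urls, start=1):
            features = {**fetch_api_metrics(url), **crawl_and_extract(url)}
            rows.append({"query": query, "url": url, "position": position, **features})
    return rows
```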
Google just released a useful tool for exploring and
checking your data
This is what it looks like on our data
(Running on their web version)
So I took this big dataset, restricted
it to property keywords, and gave it
a shot
I have an ongoing argument with @tomanthonySEO about how much
the keyword grouping matters...
OVER 90%
accuracy
Now hold on a second. That sounds implausible.
I was accidentally telling it the
answer.
I had included the rank in the
features.
Remember how I said that data pipelines are hard?
So I fixed that problem and re-ran it
OVER 80%
accuracy
Now hold on a second. That still sounds implausible.
One of the problems with deep learning is that the models are far from human understanding.
There is not really any concept of “explain how you got this answer”.
So I tried a much simpler model on the same data
A “decision tree classifier” from scikit-learn
You read these decision trees like flowcharts
The first # refers to the two URLs in the comparison
The name refers to the feature in question
...and the inequality should be self-explanatory
Then at the “leaf” node, you select the category
that got more of the samples
(the 2nd in this case - which means that B outranks A)
So you might end up taking a path like this:
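The tree diagram itself isn’t reproduced here, but as a sketch of how such a tree can be fitted and printed with scikit-learn (assuming X_train, y_train and FEATURES from the earlier snippets; max_depth is illustrative):

```python
# A much simpler, inspectable model on the same pairwise data.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train.argmax(axis=1))   # class 0 = A outranks B, class 1 = B outranks A

# Print the tree as the kind of flowchart described above: each split names a
# feature (prefixed by which URL in the comparison it belongs to) and a threshold.
feature_names = [f"0_{f}" for f in FEATURES] + [f"1_{f}" for f in FEATURES]
print(export_text(tree, feature_names=feature_names))
```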
ALSO OVER 80%
accuracy
This is getting silly.
I eventually figured out what was going on.
There are a small number of domains that rank well for
essentially every property-related search in the UK.
My model was just learning:
domain A > domain B > domain C
The model was essentially just identifying URLs: Zoopla vs. findaproperty, Rightmove vs. primelocation, etc.
So we started splitting the data better, so that the model was never evaluated on the same domains it was trained on.
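One way to implement that kind of split, as a sketch only (`pairs` is assumed to be the list of (url_a, url_b, label) tuples produced by the pipeline): assign whole domains to train or test up front, then keep only the pairs whose domains landed on the same side.

```python
# Domain-aware train/test split so no domain seen in training appears in testing.
import random
from urllib.parse import urlparse

def domain(url: str) -> str:
    return urlparse(url).netloc

random.seed(0)
all_domains = {domain(u) for a, b, _ in pairs for u in (a, b)}
test_domains = set(random.sample(sorted(all_domains), k=len(all_domains) // 5))

train_pairs = [p for p in pairs
               if domain(p[0]) not in test_domains and domain(p[1]) not in test_domains]
test_pairs = [p for p in pairs
              if domain(p[0]) in test_domains and domain(p[1]) in test_domains]
```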
Our current state-of-the-art is 65-66% accuracy on
large diverse keyword sets.
Decision trees are nowhere near as good on this data.
We are still only using fairly naive on-page metrics.
Gaps in the algorithm
Known factors vs. unknown factors: the better our model gets, the more we can constrain how much of an impact other things must be having (advanced on-page ML, usage data, etc.)
We expect to see progress from more advanced on-page analysis - we
have a theory that link signals get you into the consideration set, but
increasingly don’t reorder it:
See Tom Capper’s SearchLove San Diego talk in DistilledU
That was all very complicated.
In practice, we are running
real-world split-tests.
This is a difficult thing to do, so we’ve built a platform to help:
Gaps in the algorithm
In keeping with the theme of this
presentation, I want to share some
scary results
It turns out that you are probably recommending a ton of changes that
are making no difference, or even making things worse...
1. Adding ALT attributes
2. Adding structured data
3. Setting exact match title tags
4. Writing more emotive meta copy
Established wisdom and correlation studies would suggest ALT
attributes on images might be good for SEO
Result: null test. No measurable change in performance.
2. Adding structured data
Surprisingly often, also a null test result
3. Setting exact match title tags
Title tag before: Which TV should I buy? - Argos
Title tag after: Which TV to buy? - Argos
What happens when you match title tags to the greatest search volume?
Organic sessions decreased by an average of 8%
4. Writing more emotive meta copy
What happens when you try to write more engaging titles & meta?
Maybe not quite this engaging
Still nope.
Gaps in the algorithm
Don’t worry.
We’ve also had some great results.
Some that we have
talked about before
1. Adding structured data
2. Using JS to show content
3. Removing SEO category text
Category pages have lots of images and not much text
Adding structured data to category pages
Organic sessions increased by 11%
2. Using JS to show content
We can render Javascript!
What happens if your content is only visible with Javascript?
Javascript enabled vs. Javascript disabled
Making it visible increased organic sessions by ~ 6.2%
Read more on our blog: early results from split-testing JS for SEO
3. Removing SEO category text
How does SEO text on category pages perform?
E-commerce site number 1 ~ 3.1% increase in organic sessions
E-commerce site number 2 - No effect/negative effect
Gaps in the algorithm
And a bunch that we haven’t written up yet:
Including:
● Replacing en-gb words & spellings with en-us on British company’s US site
○ Status: statistically significant positive uplift
● Fresh content: more recent update dates across large long-tail set of pages
○ Status: statistically significant positive uplift
● Change on-page targeting to higher volume query structure
○ Status: statistically significant positive uplift
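A note on “statistically significant” above: the deck doesn’t show the statistics behind that call. Purely as a toy illustration of the kind of check involved (and not necessarily how the ODN platform computes it), here is a two-sample test on made-up daily session counts:

```python
# Toy illustration only: are variant pages' daily organic sessions significantly
# higher than the control group's? The numbers below are invented.
from scipy import stats

control_daily_sessions = [1180, 1225, 1190, 1240, 1210, 1195, 1230]
variant_daily_sessions = [1290, 1310, 1275, 1330, 1300, 1285, 1320]

t_stat, p_value = stats.ttest_ind(variant_daily_sessions, control_daily_sessions)
uplift = sum(variant_daily_sessions) / sum(control_daily_sessions) - 1
print(f"Uplift: {uplift:.1%}, p-value: {p_value:.4f}")
```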
All of this is why we have been investing so much in split-testing.
Check out www.distilledodn.com if you haven’t already. We will be happy to demo for you.
We’re now serving well over a billion requests/month, and recently published information covering everything from response times to our +£100k/month split test.
Let’s recap
1. Even in a world of 200+ “classical” ranking factors, humans were bad at understanding the algorithm
2. Machine learning will make this worse, and is accelerating under Sundar
3. By applying our own machine learning, we can model the algorithm and find the gaps in our understanding
4. We can apply what we learn by split-testing on our own sites:
a. It is very likely that if you are not split-testing, you are recommending changes that have no effect
b. And (obviously worse) you are very likely recommending changes that damage your visibility
Questions: @willcritchlow
Image credits
● Sundar Pichai
● Go
● Jeff Dean
● Train
● Wake up
● Statue of Liberty
● Sleeping cat
● Complexity
● Holy Grail
● Wilderness
● Pipeline
● Houses
● Head in hands
● Rope bridge
● Spider
● Cheating
● Celebration
● Split rock
● Science
● Jolly Roger
● Thumbs up
● Spam