JR Oakes | @jroakes | #TechSEOBoost
#TechSEOBoost | @CatalystSEM
THANK YOU TO THIS YEAR’S SPONSORS
What I Learned Building a Toy Example to
Crawl & Render like Google
JR Oakes, Locomotive
JR Oakes
Building a Simple Crawler on
a Toy Internet
About Me
Senior Director, Technical SEO Research, at
@LocomotiveSEO
Passionate about:
• Development
• Learning
• Community
• Technology
• Write some and do the Twitter thing.
• Share as much as I can on Github.
• Love to organize meetups
• Always testing something
• Love the brilliant team at Locomotive

What we will learn
• Overview of Crawling Landscape
• Key Components of Crawler
• Building a Toy Internet
• Building a Crawler and Renderer
Overview of Crawling
Landscape
The Web is Big
We have worked on sites with as many as a
billion potential pages. Google only crawls
(or knows about) a fraction of those.
• Crawled
• Want to Crawl (frontier)
• Unseen (or not wanted to be seen)
Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)

The Web is Big
PageRank (or another node popularity metric) is a
good way to decide how deep to crawl.
The hypothesis is that a measurement of node
popularity lets the crawler deprioritize links from
very unpopular nodes.
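To make the prioritization idea concrete, here is a minimal sketch of PageRank by power iteration over a tiny link graph. The page names, the damping factor of 0.85, and the `pagerank` helper are illustrative assumptions, not the code from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page site used only for illustration.
toy_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about", "deep-page"],
    "deep-page": [],
}
ranks = pagerank(toy_links)
# Links found on high-ranked pages would be crawled first.
crawl_order = sorted(ranks, key=ranks.get, reverse=True)
```

In a crawler, scores like these could feed the priority assigned to newly discovered links, so links found only on very low-ranked pages wait longer in the frontier.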
The Web is Big
Google has over 25 BILLION results in
their inverted index.
What a crawler must do
• Be robust. Handle spider traps and malicious behavior.
• Be distributed. Run across many machines.
• Be scalable. Easy to add more machines.
• Be efficient. Use network and processing resources wisely.
• Prioritize. Know the quality and priority of pages.
• Operate continuously.
• Be adaptable. Easy to change with new data / web needs.
• Be a good citizen. Respect robots.txt and server load.
Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
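The "good citizen" requirement is the easiest one to sketch. The example below uses Python's standard-library `urllib.robotparser` on an inline robots.txt. The rules, the `example.com` URLs, and the `toy-crawler` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this per host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="toy-crawler"):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


blog_ok = allowed("https://example.com/blog/post-1/")
private_ok = allowed("https://example.com/private/report/")
# Seconds to wait between requests to this host, if the site declares one.
delay = parser.crawl_delay("toy-crawler")
```

A crawler would run a check like `allowed()` before every fetch and use the crawl delay (or a default) when scheduling the next request to the same host.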
Key Components of
Crawler

Basic Crawl Architecture
Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
My Inferred Crawl Architecture
Hard to believe Google is wasting
resources to render something
that has not changed in 40 years.
Key Learnings
• The frontier is split into two parts: a Front Queue that manages priority and a Back Queue that manages politeness
• All queues are FIFO
• Each host has its own Back Queue
• MinHashes (sketches) are an effective way of deduping content
• Duplicates vs. near duplicates are measured by edit distance
• Everything is cached to reduce latency
• URL normalization is handled at the parser (e.g. /page-path/ to https://domain/page-path/)
• Interesting things can happen in the DOM, beyond just parsing the retrieved URL
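The front queue / back queue split can be sketched in a few lines. This is a simplified illustration in the spirit of the Stanford lecture's frontier design, not the talk's actual code: the `Frontier` class, its method names, the three priority levels, and the one-second delay are all assumptions. Unlike a production frontier, it drains every front queue on each call.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Front queues order URLs by priority; per-host back queues enforce politeness."""

    def __init__(self, priorities=3, delay=1.0):
        self.front = [deque() for _ in range(priorities)]  # index 0 = highest priority
        self.back = {}       # host -> FIFO queue of URLs waiting for that host
        self.ready_at = []   # heap of (earliest allowed fetch time, host)
        self.next_ok = {}    # host -> earliest time we may hit it again
        self.delay = delay

    def add(self, url, priority):
        self.front[priority].append(url)

    def _refill(self):
        # Move URLs, highest priority first, into their host's back queue.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).netloc
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.ready_at, (self.next_ok.get(host, 0.0), host))
                self.back[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be fetched at time `now`, else None."""
        self._refill()
        if not self.ready_at or self.ready_at[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready_at)
        queue = self.back[host]
        url = queue.popleft()
        self.next_ok[host] = now + self.delay
        if queue:
            heapq.heappush(self.ready_at, (self.next_ok[host], host))
        else:
            del self.back[host]
        return url


# Hypothetical hosts, used only to show the ordering.
frontier = Frontier(delay=1.0)
frontier.add("https://a.example/low", priority=1)
frontier.add("https://a.example/high", priority=0)
frontier.add("https://b.example/page", priority=2)
first = frontier.next_url(now=0.0)
second = frontier.next_url(now=0.0)
too_soon = frontier.next_url(now=0.5)
third = frontier.next_url(now=1.0)
```

Passing `now` in explicitly keeps the sketch deterministic; a real crawler would read a clock and sleep until the heap's earliest entry is due.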

Building a Toy Internet
Criteria
• Build quickly with topically similar pages for each site
• Exist on separate domains
• Linked to each other, but not to any other pages on the internet
• Contain basic SEO elements like title, description, canonical, etc.
Solution
• Github Pages
• Jekyll
• Wikipedia
• Python
• search-engine-optimization-blog.github.io
• data-science-blog.github.io
• python-software.github.io
PBN Maker 3000

Building a Crawler and
Renderer
Step One
I have no idea how to start. So
let’s do some research.
I <3 Github
Step Two
I don’t want to reinvent the wheel,
so let’s see what is already out
there that I can use.

Step Three
A lot of coffee
… and some beer.
A little help along the way
Streamlit is the first app
framework specifically for
Machine Learning and
Data Science teams.
So you can stop spending time on
frontend development and get
back to what you do best.
Criteria
• Use existing libraries where possible
• Be hardy enough to crawl my toy internet
• Make it as simple and approachable as possible (e.g. I use Pandas a lot)
• Stay as true as possible to what is known about what Google does
• Process linearly. No threading or extra services
• Include unit testing
• Include a Jupyter Notebook
• Include READMEs
• Include a simple indexer and search apparatus to play with results (Thanks John M.!)
Parts
• PageRank
• Chrome Headless Rendering
• Text NLP Normalization
• BERT Embeddings
• Robots
• Duplicate Content Shingling
• URL Hashing
• Document Frequency Functions (BM25 and TF-IDF)
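As an illustration of the last part in the list, here is a minimal sketch of textbook Okapi BM25 scoring over whitespace-tokenized documents. It is not the crawler's own implementation. The `bm25_scores` helper, the parameter defaults (`k1=1.5`, `b=0.75`), the smoothed IDF variant, and the sample documents are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length via b.
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical page texts for illustration.
docs = [
    "python crawler for technical seo",
    "data science blog about statistics",
    "headless chrome rendering with a python crawler",
]
scores = bm25_scores("python crawler", docs)
```

TF-IDF follows the same shape without the saturation and length-normalization terms, which is why the two are often implemented side by side.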

Learnings
Embeddings
https://github.com/huggingface/transformers

Learnings
• Applying PageRank to similar document clusters is an effective way of picking the right one.
• Deciding where to process and where (and when) to update values is hard (e.g. canonical tags for crawling and consolidation in HTML vs. rendered).
• Index compression techniques made my eyes glaze over.
• BERT models need all (or most) of the content.
• BERT is easily accessible.
• I made some things way simpler than they would be in real life.
• SentencePiece and BPE encoding are revolutionary for indexes and NLG.
• A minor code change can make the crawler go crazy. Hats off to Google and Screaming Frog.
• MinHash comparison made it easy to compare rendered content against crawled content.
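The MinHash point can be illustrated with a short sketch: shingle each version of a page into overlapping word triples, keep the minimum hash per seed, and compare signatures. The helper names, the 64-hash signature size, the three-word shingles, and the sample strings are assumptions for illustration, not the crawler's actual settings.

```python
import hashlib


def shingles(text, size=3):
    """Set of overlapping word n-grams (shingles) from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(1, len(words) - size + 1))}


def minhash(text, num_hashes=64, size=3):
    """MinHash signature: for each seed, the smallest hash over all shingles."""
    grams = shingles(text, size)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{gram}".encode()).digest()[:8], "big")
            for gram in grams
        ))
    return signature


def similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical text extracted from raw HTML vs. the rendered DOM.
raw_text = "welcome to the toy internet home page"
rendered_text = raw_text + " with extra content loaded by javascript"
unrelated_text = "completely different words appear in this other document"

same = similarity(minhash(raw_text), minhash(raw_text))
changed = similarity(minhash(raw_text), minhash(rendered_text))
different = similarity(minhash(raw_text), minhash(unrelated_text))
```

A low score between the raw and rendered versions suggests that rendering changed the content materially. The same signatures can also be used to group near-duplicate pages.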
Result
A crawler written in Python that we are releasing as
open source.
Keep in mind:
1. This was written in a month
2. Google engineers would laugh at it
3. It probably has bugs
4. It is really fun to play around with

Result
We also built a simple UI in
Streamlit so you can play
around with the results and
parameters.
Result
Complete with Ads!
Thank You
Start playing at the link below
https://locomotive.agency/coal-crawler-renderer-indexer-caboose
–
Find me on Twitter at: @jroakes
Thanks for Viewing the Slideshare!
–
Watch the Recording: https://youtube.com/session-example
Or
Contact us today to discover how Catalyst can deliver unparalleled SEO
results for your business. https://www.catalystdigital.com/

Recommended for you

October 2014 - USG Rock Eagle - Drupal 101
October 2014 - USG Rock Eagle - Drupal 101October 2014 - USG Rock Eagle - Drupal 101
October 2014 - USG Rock Eagle - Drupal 101

This document provides an overview and introduction to building websites with Drupal, an open-source content management system (CMS). It discusses what a CMS is and compares Drupal to WordPress. Key features of Drupal are explained, including its use of modules, entities and fields, content types, taxonomy, and the Views module. Common modules are listed and it is noted that Drupal can be used to build various applications without coding. The document concludes with suggestions for getting started with Drupal development locally and lists resources for learning more.

drupalintroductiondrupal 7
Surviving in a Microservices Environment
Surviving in a Microservices EnvironmentSurviving in a Microservices Environment
Surviving in a Microservices Environment

The document discusses various topics related to surviving in a microservices environment. It addresses questions around infrastructure, architecture, team communication and provides advice. Key points include the importance of centralized logging and monitoring, avoiding tight coupling between services, ensuring an overall architectural vision, and being reluctant to add new process unless something goes wrong. The document emphasizes that most of the challenge with microservices is in infrastructure.

springio17software architecturemicroservices
Technical Club PPT for BTech CS and Btech IT
Technical Club PPT for BTech CS and Btech ITTechnical Club PPT for BTech CS and Btech IT
Technical Club PPT for BTech CS and Btech IT

Technical Club PPT for BTech CS and Btech IT

More Related Content

Traditional Foods Of Australia and The History
 
campaign ads for fostanak local brand in egypt
campaign ads for fostanak local brand in egyptcampaign ads for fostanak local brand in egypt
campaign ads for fostanak local brand in egypt
 
Content Optimization Master Class - Matt Raven
Content Optimization Master Class - Matt RavenContent Optimization Master Class - Matt Raven
Content Optimization Master Class - Matt Raven
 
Free Healthcare Marketing Plan for Healthcare professionals
Free Healthcare Marketing Plan for Healthcare professionalsFree Healthcare Marketing Plan for Healthcare professionals
Free Healthcare Marketing Plan for Healthcare professionals
 
Factsheet pdf
Factsheet                            pdfFactsheet                            pdf
Factsheet pdf
 
Billion Broadcaster's Frame Posters and Horizontal Lift Advertising Screens: ...
Billion Broadcaster's Frame Posters and Horizontal Lift Advertising Screens: ...Billion Broadcaster's Frame Posters and Horizontal Lift Advertising Screens: ...
Billion Broadcaster's Frame Posters and Horizontal Lift Advertising Screens: ...
 
NIMA2024 | Hoe Danone Trends vertaalt naar Strategie voor het versterken van ...
NIMA2024 | Hoe Danone Trends vertaalt naar Strategie voor het versterken van ...NIMA2024 | Hoe Danone Trends vertaalt naar Strategie voor het versterken van ...
NIMA2024 | Hoe Danone Trends vertaalt naar Strategie voor het versterken van ...
 
Paid Media Targeting in a Cookieless Future - Kevin Lee
Paid Media Targeting in a Cookieless Future - Kevin LeePaid Media Targeting in a Cookieless Future - Kevin Lee
Paid Media Targeting in a Cookieless Future - Kevin Lee
 
[Webinar - VWO] AI-First Strategies to Drive Traffic and Conversions for 2024...
[Webinar - VWO] AI-First Strategies to Drive Traffic and Conversions for 2024...[Webinar - VWO] AI-First Strategies to Drive Traffic and Conversions for 2024...
[Webinar - VWO] AI-First Strategies to Drive Traffic and Conversions for 2024...
 
Revolutionizing Advertising with Billion Broadcaster Standee Screen Media
Revolutionizing Advertising with Billion Broadcaster Standee Screen MediaRevolutionizing Advertising with Billion Broadcaster Standee Screen Media
Revolutionizing Advertising with Billion Broadcaster Standee Screen Media
 
Toortizi - Rationale ( SALTY SNACKS )
Toortizi  -  Rationale  ( SALTY SNACKS )Toortizi  -  Rationale  ( SALTY SNACKS )
Toortizi - Rationale ( SALTY SNACKS )
 

What I Learned Building a Toy Example to Crawl & Render like Google

  • 1. JR Oakes | @jroakes | #TechSEOBoost #TechSEOBoost | @CatalystSEM THANK YOU TO THIS YEAR’S SPONSORS What I Learned Building a Toy Example to Crawl & Render like Google JR Oakes, Locomotive
  • 2. JR Oakes: Building a Simple Crawler on a Toy Internet
  • 3. About Me
      Senior Director, Technical SEO Research, at @LocomotiveSEO
      Passionate about:
      • Development
      • Learning
      • Community
      • Technology
  • 4. About Me
      • Write some and do the Twitter thing.
      • Share as much as I can on GitHub.
      • Love to organize meetups
      • Always testing something
      • Love the brilliant team at Locomotive
  • 5. What we will learn
  • 6. What we will learn
      • Overview of Crawling Landscape
      • Key Components of Crawler
      • Building a Toy Internet
      • Building a Crawler and Renderer
  • 7. Overview of Crawling Landscape
  • 8. The Web is Big
      We have worked on sites with as many as a billion potential pages. Google only crawls (or knows about) a fraction of those.
      • Crawled
      • Want to Crawl (frontier)
      • Unseen (or not wanted to be seen)
      Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
  • 9. The Web is Big
      PageRank (or another node popularity metric) is a good way to decide how deep to crawl. The hypothesis is that a measure of node popularity can be used to deprioritize links from very unpopular nodes.
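The hypothesis on slide 9 can be illustrated with a minimal PageRank sketch using power iteration. The graph `toy_web`, the function name `pagerank`, and the damping factor of 0.85 are assumptions chosen for illustration; they are not taken from the talk or from the released crawler.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its score (scores sum to ~1).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph; "d" has no inbound links.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)

# Links found on low-ranked pages go to the back of the crawl order.
crawl_order = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy graph the page with no inbound links ends up last in `crawl_order`, which is the deprioritization the slide describes.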
  • 10. The Web is Big
      Google has over 25 BILLION results in their inverted index.
  • 11. What a crawler must do
      • Be robust. Handle spider traps and malicious behavior.
      • Be distributed. Run across many machines.
      • Be scalable. Easy to add more machines.
      • Be efficient. Use network and processing resources wisely.
      • Prioritize. Know the quality and priority of pages.
      • Operate continuously.
      • Be adaptable. Easy to change with new data / web needs.
      • Be a good citizen. Respect robots.txt and server load.
      Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
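The "good citizen" requirement can be sketched with Python's standard-library `urllib.robotparser`. The robots.txt content, the `toy-crawler` user agent, and the `example.com` URLs below are made up for illustration; a real crawler would fetch the file from each host before crawling it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so the
# example runs without network access.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("toy-crawler", "https://example.com/blog/")
blocked = parser.can_fetch("toy-crawler", "https://example.com/private/page")
delay = parser.crawl_delay("toy-crawler")  # seconds to wait between requests
```

Checking `can_fetch` before each request covers the robots.txt part of the slide, and `crawl_delay` gives one signal for limiting server load.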
  • 12. Key Components of Crawler
  • 13. Basic Crawl Architecture
      Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
  • 14. My Inferred Crawl Architecture
  • 15. My Inferred Crawl Architecture
      Hard to believe Google is wasting resources to render something that has not changed in 40 years.
  • 16. Key Learnings
      • The frontier is broken into two sections: a Front Queue that manages priority and a Back Queue that manages politeness
      • All queues are FIFO
      • Each host has its own Back Queue
      • Min Hashes (Sketches) are an effective way of deduping content
      • Duplicates vs. near duplicates are measured by edit distance
      • Everything is cached to reduce latency
      • URL normalization is handled at the parser (e.g. /page-path/ to https://domain/page-path/)
      • There are interesting things that can happen in the DOM beyond just parsing the retrieved URL
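The front queue / back queue split from slide 16 can be sketched as follows. This is a simplified illustration and not the code from the released crawler: the class name `Frontier`, its methods, the fixed per-host delay, and the `*.example` hosts are all assumptions.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class Frontier:
    """Toy frontier: FIFO front queues for priority, FIFO per-host back queues for politeness."""

    def __init__(self, levels=3, delay=1.0):
        self.front = [deque() for _ in range(levels)]  # index 0 = highest priority
        self.back = defaultdict(deque)                 # one FIFO queue per host
        self.next_allowed = {}                         # host -> earliest next fetch time
        self.heap = []                                 # (earliest fetch time, host)
        self.delay = delay

    def add(self, url, priority=0):
        self.front[priority].append(url)

    def _route(self):
        # Drain front queues in priority order into the per-host back queues.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).netloc
                if not self.back[host]:
                    # Host had nothing queued, so it needs a heap entry.
                    heapq.heappush(self.heap, (self.next_allowed.get(host, 0.0), host))
                self.back[host].append(url)

    def next_url(self, now):
        """Return the next URL that may be fetched at time `now`, or None."""
        self._route()
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        url = self.back[host].popleft()
        self.next_allowed[host] = now + self.delay
        if self.back[host]:
            heapq.heappush(self.heap, (self.next_allowed[host], host))
        return url


frontier = Frontier(delay=1.0)
frontier.add("https://a.example/1", priority=0)
frontier.add("https://a.example/2", priority=0)
frontier.add("https://b.example/1", priority=1)

first = frontier.next_url(now=0.0)   # a.example is fetched first
second = frontier.next_url(now=0.0)  # a.example must wait, so b.example goes next
third = frontier.next_url(now=0.5)   # nothing is ready yet
fourth = frontier.next_url(now=1.0)  # a.example is allowed again
```

Passing `now` in explicitly keeps the sketch deterministic; a real crawler would use a clock and would refill back queues incrementally rather than draining the front queues in one pass.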
  • 17. Building a Toy Internet
  • 18. Criteria
      • Build quickly with topically similar pages for each site
      • Exist on separate domains
      • Linked to each other, but not to any other pages on the internet
      • Contain basic SEO elements like title, description, canonical, etc.
  • 19. Solution
      • GitHub Pages
      • Jekyll
      • Wikipedia
      • Python
      • search-engine-optimization-blog.github.io
      • data-science-blog.github.io
      • python-software.github.io
  • 20. PBN Maker 3000
  • 21. PBN Maker 3000
  • 22. Building a Crawler and Renderer
  • 23. Step One
      I have no idea how to start. So let’s do some research. I <3 GitHub
  • 24. Step Two
      I don’t want to reinvent the wheel, so let’s see what is already out there that I can use.
  • 25. Step Three
      A lot of coffee … and some beer.
  • 26. A little help along the way
      Streamlit is the first app framework specifically for Machine Learning and Data Science teams. So you can stop spending time on frontend development and get back to what you do best.
  • 27. Criteria
      • Use existing libraries where possible
      • Be hardy enough to crawl my toy internet
      • Make it as simple and approachable as possible (e.g. I use Pandas a lot)
      • Stay as true as possible to what Google is known to do
      • Process linearly. No threading or extra services
      • Include unit testing
      • Include a Jupyter Notebook
      • Include READMEs
      • Include a simple indexer and search apparatus to play with results (Thanks John M.!)
  • 28. Parts
      • PageRank
      • Chrome Headless Rendering
      • Text NLP Normalization
      • BERT Embeddings
      • Robots
      • Duplicate Content Shingling
      • URL Hashing
      • Document Frequency Functions (BM25 and TF-IDF)
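Of the parts listed on slide 28, the document frequency functions are the easiest to show compactly. Below is a minimal BM25 scorer, assuming whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant `ln((N - n + 0.5) / (n + 0.5) + 1)`. The function name and the sample documents are hypothetical, and the released crawler may implement scoring differently.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: long documents need more hits to score the same.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "python crawler for seo",
    "data science blog",
    "python python software",
]
scores = bm25_scores("python crawler", docs)
best = docs[scores.index(max(scores))]
```

The first document matches both query terms, including the rarer term "crawler", so it outranks the third document even though that one repeats "python".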
  • 29. Learnings
      • Applying PageRank to similar document clusters is an effective way of picking the right one.
      • Deciding where to process and where (and when) to update values is hard (e.g. canonical tags for crawling and consolidation in HTML vs. rendered).
      • Index compression techniques made my eyes glaze over.
      • BERT models need all (or most) of the content.
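The first learning can be sketched in a few lines: given clusters of near-duplicate URLs and a PageRank score per URL, keep the highest-scoring member of each cluster. The function name, the `site.example` URLs, and the scores are invented for illustration.

```python
def pick_canonicals(clusters, ranks):
    """Return one URL per cluster: the member with the highest PageRank."""
    return [max(cluster, key=lambda url: ranks.get(url, 0.0)) for cluster in clusters]


# Hypothetical near-duplicate clusters and PageRank scores.
clusters = [
    ["https://site.example/page?ref=nav", "https://site.example/page"],
    ["https://site.example/about"],
]
ranks = {
    "https://site.example/page": 0.40,
    "https://site.example/page?ref=nav": 0.05,
    "https://site.example/about": 0.10,
}
canonicals = pick_canonicals(clusters, ranks)
```

URLs missing from `ranks` are treated as having a score of 0.0, so a page the crawler has not scored yet never wins over one it has.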
  • 30. Learnings
  • 31. Learnings
      • Applying PageRank to similar document clusters is an effective way of picking the right one.
      • Deciding where to process and where (and when) to update values is hard (e.g. canonical tags for crawling and consolidation in HTML vs. rendered).
      • Index compression techniques made my eyes glaze over.
      • BERT models need all (or most) of the content.
      • BERT is easily accessible.
  • 32. Learnings
      Embeddings https://github.com/huggingface/transformers
  • 33. Learnings
      • Applying PageRank to similar document clusters is an effective way of picking the right one.
      • Deciding where to process and where (and when) to update values is hard (e.g. canonical tags for crawling and consolidation in HTML vs. rendered).
      • Index compression techniques made my eyes glaze over.
      • BERT models need all (or most) of the content.
      • BERT is easily accessible.
      • I made some things waaaaayy simpler than they would be in real life.
  • 34. Learnings
  • 35. Learnings
      • Applying PageRank to similar document clusters is an effective way of picking the right one.
      • Deciding where to process and where (and when) to update values is hard (e.g. canonical tags for crawling and consolidation in HTML vs. rendered).
      • Index compression techniques made my eyes glaze over.
      • BERT models need all (or most) of the content.
      • BERT is easily accessible.
      • I made some things way simpler than they would be in real life.
      • SentencePiece and BPE encoding are revolutionary for indexes and NLG.
      • A minor code change can make the crawler go crazy. Hats off to Google and Screaming Frog.
      • MinHash comparison made it easy to check rendered content against crawled content.
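The MinHash comparison in the last learning can be sketched with the standard library alone. The example assumes three-word shingles and 64 salted BLAKE2b hashes; the function names and sample texts are hypothetical, and the released crawler may use a dedicated library instead.

```python
import hashlib


def shingles(text, k=3):
    """Set of overlapping k-word shingles from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(text, num_hashes=64, k=3):
    """MinHash signature: for each salted hash function, keep the smallest digest."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            hashlib.blake2b(item, digest_size=8, salt=salt).digest()
            for item in items
        ))
    return signature


def similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical page text before and after JavaScript rendering.
raw = "toy crawler fetches the raw html and extracts links from each page"
rendered = raw + " plus a widget injected by javascript"
unrelated = "completely different words about cooking pasta with tomato sauce tonight"

same = similarity(minhash(raw), minhash(raw))
near = similarity(minhash(raw), minhash(rendered))
far = similarity(minhash(raw), minhash(unrelated))
```

A high estimate between the raw and rendered signatures suggests rendering added little new content, while a low one flags pages where JavaScript changes the content substantially.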
  • 36. Result
      A crawler written in Python that we are releasing as open source. Keep in mind:
      1. This was written in a month
      2. Google engineers would laugh at it
      3. It probably has bugs
      4. It is really fun to play around with
  • 37. Result
      We also built a simple UI in Streamlit so you can play around with the results and parameters.
  • 38. Result
      Complete with Ads!
  • 39. Thank You
      Start playing at the link below:
      https://locomotive.agency/coal-crawler-renderer-indexer-caboose
      – Find me on Twitter at: @jroakes
  • 40. Thanks for Viewing the Slideshare!
      – Watch the Recording: https://youtube.com/session-example
      Or contact us today to discover how Catalyst can deliver unparalleled SEO results for your business. https://www.catalystdigital.com/