LOG FILE ANALYSIS 
The most powerful tool in your SEO toolkit 
Tom Bennet 
Consultant, Builtvisible 
@tomcbennet
Getting Started
What is a log file? 
A record of every hit a server has received – from humans and robots alike. 
http://www.brightonseo.com/about/ 
1. Protocol 
2. Host name 
3. File name 
Host name -> IP address via DNS -> Connection to server -> 
HTTP GET request via protocol for file -> HTML to browser
They’re not pretty…
…but they’re very powerful. 
188.65.114.122 - - [30/Sep/2013:08:07:05 -0400] "GET /resources/whitepapers/retail-whitepaper/ HTTP/1.1" 200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 
Client IP (remote host) 
Timestamp (date & time) 
Method (GET / POST) 
Request URI 
HTTP status code 
User-agent
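The fields labelled above can be pulled out of a raw line with a short script. This is a sketch only: the regex assumes the Apache-style format shown on the slide (no bytes field), so adjust it to match your server's actual log configuration.

```python
import re

# Named groups mirror the field labels on the slide.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '                  # client IP (remote host)
    r'\[(?P<timestamp>[^\]]+)\] '            # timestamp (date & time)
    r'"(?P<method>\S+) (?P<uri>\S+) \S+" '   # method, request URI, protocol
    r'(?P<status>\d{3}) '                    # HTTP status code
    r'"(?P<referrer>[^"]*)" '                # referrer ("-" when absent)
    r'"(?P<user_agent>[^"]*)"'               # user-agent string
)

def parse_line(line):
    """Return a dict of named fields, or None if the line doesn't match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

sample = ('188.65.114.122 - - [30/Sep/2013:08:07:05 -0400] '
          '"GET /resources/whitepapers/retail-whitepaper/ HTTP/1.1" 200 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
fields = parse_line(sample)
```

Lines that fail to match (malformed requests, unusual formats) come back as `None`, which is itself worth counting during an audit.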
Log Files & SEO
What is Crawl Budget? 
Crawl Budget = The number of URLs crawled on each visit to your site. 
Higher Authority = Higher Crawl Budget
Crawl Budget Utilisation 
http://example.com/thin-product-page-1 
http://example.com/category/thin-product-page-1 
http://example.com/category/subcategory/thin-product-page-1 
http://example.com/category/subcategory/thin-product-page-1?colour=blue 
Etc… 
Conservation of crawl budget is key.
Working With Logs
Preparing Your Data 
Extraction: Varies by server. See accompanying guide. 
Filter: By Googlebot user-agent, then verify the IP range – user-agent strings can be spoofed. https://support.google.com/webmasters/answer/80553?hl=en 
Tools: Gamut and Splunk are great, but you can’t beat Excel.
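Google's recommended verification is a reverse DNS lookup followed by a forward confirmation; a sketch of that check, with the hostname test split out as a pure helper:

```python
import socket

def has_google_host(hostname):
    """True if a reverse-DNS hostname belongs to Google's crawler domains."""
    return hostname.endswith(('.googlebot.com', '.google.com'))

def is_verified_googlebot(ip):
    """Reverse DNS the IP, check the host, then forward-confirm the IP.
    Performs live DNS lookups, so results depend on network access."""
    try:
        host = socket.gethostbyaddr(ip)[0]             # reverse lookup
    except (socket.herror, socket.gaierror):
        return False
    if not has_google_host(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]  # forward confirm
    except socket.gaierror:
        return False
```

A matching user-agent plus a failed verification is a strong signal of a scraper impersonating Googlebot, and those rows should be excluded before any crawl-budget analysis.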
Working in Excel 
1. Convert .log to .csv 
(cool tip: just change the file extension)
Working in Excel 
2. Sample size 
(60-120k Googlebot requests / rows is a good size)
Working in Excel 
3. Text-to-columns 
(a space will usually be a suitable delimiter)
Working in Excel 
4. Create a table 
(Label your columns, sort by timestamp)
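The four Excel steps above can also be scripted. This sketch uses `shlex` rather than a plain space split, since text-to-columns on a space delimiter will break apart quoted fields like the user-agent; the column labels are illustrative and assume the common log layout.

```python
import csv
import shlex

def log_to_csv(log_path, csv_path, limit=120_000):
    """Convert a raw access log to a labelled CSV, capped at a
    workable sample size (the slide suggests 60-120k rows)."""
    headers = ['ip', 'ident', 'user', 'timestamp', 'tz',
               'request', 'status', 'referrer', 'user_agent']
    with open(log_path) as src, open(csv_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerow(headers)
        for i, line in enumerate(src):
            if i >= limit:
                break
            # shlex keeps quoted fields (request, UA) as single columns
            writer.writerow(shlex.split(line))
```

The resulting CSV opens directly in Excel with columns already labelled, ready for the table and timestamp sort in step 4.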
Investigate
Most vs Least Crawled 
Formula: Use COUNTIF on Request URL. 
Tip: Extract top-level category for crawl distribution by site-section. 
http://www.brightonseo.com/speakers/person-name/
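The COUNTIF and category-extraction steps have direct scripted equivalents; a sketch assuming request URIs like the example path above:

```python
from collections import Counter

def crawl_counts(uris):
    """COUNTIF equivalent: how many times each URI was requested."""
    return Counter(uris)

def section_counts(uris):
    """Crawl distribution by top-level site section,
    e.g. /speakers/person-name/ -> 'speakers'."""
    return Counter(u.strip('/').split('/')[0] or '(root)' for u in uris)
```

`crawl_counts(...).most_common()` gives the most-crawled URLs at the top; the least-crawled (and never-crawled, once compared against a full URL list) sit at the bottom.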
Crawl Frequency Over Time 
Formula: Pivot date against count of requests. 
Tip: Segment by site section or by user-agent (Googlebot Mobile, Images, Video, etc.).
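The pivot (date against request count, segmented by user-agent) can be sketched with a `Counter` keyed on both dimensions; the segmentation rule here is a deliberately crude placeholder to extend for Images, Video, and other bots:

```python
from collections import Counter

def crawl_frequency(records):
    """Pivot-table equivalent: requests per (date, segment).
    `records` is an iterable of (timestamp, user_agent) tuples where
    the timestamp looks like '30/Sep/2013:08:07:05'."""
    def segment(ua):
        # illustrative rule only; refine per the bots you care about
        return 'Googlebot Mobile' if 'Mobile' in ua else 'Googlebot'
    return Counter((ts.split(':')[0], segment(ua)) for ts, ua in records)
```

Plotting these per-day counts per segment is where crawl dips after a site change become visible.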
HTTP Response Codes 
Formula: Total up HTTP Response Codes. 
Tip: Find most common 302s or 404s, filter by code and sort by URL occurrence.
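Totalling the codes and then drilling into the URLs behind each error code can be sketched in a few lines, assuming (status, URI) pairs extracted from the parsed log:

```python
from collections import Counter

def status_summary(records):
    """Total up response codes, and surface the URIs behind each code.
    `records` is an iterable of (status, uri) tuples."""
    records = list(records)  # allow iterating twice
    totals = Counter(status for status, _ in records)
    by_code = {}
    for status, uri in records:
        by_code.setdefault(status, Counter())[uri] += 1
    return totals, by_code
```

`by_code['404'].most_common(10)` then answers the slide's tip directly: the most frequently crawled broken URLs, which are the first fixes to prioritise.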
Level Up 
Robots.txt – Crawl all URLs with Screaming Frog to determine if they are blocked in robots.txt. Investigate most frequently crawled. 
Faceted Nav Issues – Dedupe a list of unique resources, sort by times requested. 
Sitemap – Add your sitemap URLs into an Excel table, VLOOKUP against your logs. Which mapped URLs are crawl deficient? 
CSS / JS – These resources should be crawlable, but are files that aren't needed for rendering absorbing an inordinate amount of crawl budget?
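The sitemap check above is a VLOOKUP against the logs; the scripted equivalent is a set lookup, assuming sitemap URLs and logged request URIs have been normalised to the same form:

```python
def crawl_deficient(sitemap_urls, crawled_uris):
    """VLOOKUP equivalent: which mapped URLs never appear in the logs?
    Assumes both inputs use the same URL form (paths or absolute URLs)."""
    crawled = set(crawled_uris)
    return [u for u in sitemap_urls if u not in crawled]
```

URLs you've told Google are important but which Googlebot never requests are exactly the crawl-deficient pages the slide asks about.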
Top Level Crawl Waste 
Formula: Use IF statements to check for every cause of waste.
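The nested-IF approach translates to a function that flags every applicable cause of waste per request; the checks and the blocked-path list here are illustrative placeholders to replace with your site's own waste criteria:

```python
def waste_reasons(uri, status, low_value_paths=('/search', '/cart')):
    """IF-statement equivalent: list every cause of crawl waste that
    applies to one logged request. Checks shown are examples only."""
    reasons = []
    if '?' in uri:
        reasons.append('parameterised URL')
    if status.startswith(('3', '4', '5')):
        reasons.append('non-200 response')
    if any(uri.startswith(p) for p in low_value_paths):
        reasons.append('low-value section')
    return reasons
```

Summing the flags across all Googlebot rows gives the top-level waste figure: the share of crawl budget spent on URLs you'd rather it skipped.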
Crime = Solved
All Brighton SEO attendees will receive the guide via email.
THANKS FOR LISTENING 
Get in touch 
e: tom@builtvisible.com 
t: @tomcbennet 
Tom Bennet 
Consultant, Builtvisible 
@tomcbennet
