Large Scale Crawling with Apache Nutch and friends...

Julien Nioche
julien@digitalpebble.com
LUCENE/SOLR REVOLUTION EU 2013
About myself
 DigitalPebble Ltd, Bristol (UK)
 Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
 Strong focus on Open Source & the Apache ecosystem
 VP Apache Nutch
 User | Contributor | Committer
– Tika
– SOLR, Lucene
– GATE, UIMA
– Mahout
– Behemoth

Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

Nutch?
 “Distributed framework for large scale web crawling”
(but does not have to be large scale at all)
 Apache TLP since May 2010
 Based on Apache Hadoop
 Indexing and Search by Apache SOLR

A bit of history
 2002/2003 : Started by Doug Cutting & Mike Cafarella
 2005 : MapReduce implementation in Nutch
– 2006 : Hadoop sub-project of Lucene @Apache
 2006/7 : Parser and MimeType in Tika
– 2008 : Tika sub-project of Lucene @Apache
 May 2010 : TLP project at Apache
 Sept 2010 : Storage abstraction in Nutch 2.x
– 2012 : Gora TLP @Apache

Recent Releases

[Timeline 06/09 → 06/13: trunk (1.x) releases 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6, 1.7; 2.x branch releases 2.0, 2.1, 2.2.1]

Why use Nutch?
 Usual reasons
– Open source with a business-friendly license, mature, community, ...
 Scalability
– Tried and tested on a very large scale
– Standard Hadoop
 Features
– Index with SOLR / ES / CloudSearch
– PageRank implementation
– Loads of existing plugins
– Can easily be extended / customised

Use cases
 Crawl for search
– Generic or vertical
– Index and Search with SOLR et al.
– Single node to large clusters on the Cloud
 … but also
– Data Mining
– NLP (e.g. Sentiment Analysis)
– ML
 with
– MAHOUT / UIMA / GATE
– Use Behemoth as glueware
  (https://github.com/DigitalPebble/behemoth)

Customer cases

[Diagram: the two customers contrasted along two axes – Specificity (Verticality) vs Size]

 BetterJobs.com (CareerBuilder)
– Single server
– Aggregates content from job portals
– Extracts and normalizes structure (description, requirements, locations)
– ~2M pages total
– Feeds SOLR index

 SimilarPages.com
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved

CommonCrawl
http://commoncrawl.org/
 Open repository of web crawl data
 2012 dataset : 3.83 billion docs
 ARC files on Amazon S3
 Using Nutch 1.7
 A few modifications to Nutch code
– https://github.com/Aloisius/nutch

 Next release imminent
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

Installation
 http://nutch.apache.org/downloads.html
 1.7 => src and bin distributions
 2.2.1 => src only
 'ant clean runtime'
– runtime/local => local mode (test and debug)
– runtime/deploy => job jar for Hadoop + scripts

 Binary distribution for 1.x == runtime/local

Configuration and resources
 Changes in $NUTCH_HOME/conf
– Need recompiling with 'ant runtime'
– Local mode => can be made directly in runtime/local/conf

 Specify configuration in nutch-site.xml
– Leave nutch-default alone!

 At least :
<property>
  <name>http.agent.name</name>
  <value>WhateverNameDescribesMyMightyCrawler</value>
</property>

Running it!
 bin/crawl script : typical sequence of steps (example below)
 bin/nutch : individual Nutch commands
– Inject / generate / fetch / parse / update ….
 Local mode : great for testing and debugging
 Recommended : deploy + Hadoop (pseudo-)distributed mode
– Parallelism
– MapReduce UI to monitor the crawl, check logs, counters
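
For example, a hedged local invocation (the script's exact arguments differ between releases – run bin/crawl without arguments to see the usage for your version; the paths here are assumptions):

  # seed URLs under urls/, SOLR core at the given address, 2 rounds
  bin/crawl urls crawl http://localhost:8983/solr/ 2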

Monitor Crawl with MapReduce UI

[Screenshot: the Hadoop MapReduce web UI listing the crawl's jobs]

Counters and logs

[Screenshot: job counters and task logs in the MapReduce UI]

Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

Typical Nutch Steps
 Same in 1.x and 2.x
 Sequence of batch operations (command sketch below)
1) Inject → populates CrawlDB from the seed list
2) Generate → selects URLs to fetch in a segment
3) Fetch → fetches URLs from the segment
4) Parse → parses content (text + metadata)
5) UpdateDB → updates CrawlDB (new URLs, new status...)
6) InvertLinks → builds the Webgraph
7) Index → sends docs to [SOLR | ES | CloudSearch | … ]
 Repeat steps 2 to 7
 Or use the all-in-one crawl script
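
As an illustrative sketch of one round using the individual commands (directory names are assumptions; run 'bin/nutch <command>' without arguments for the exact usage in your version):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=$(ls -d crawl/segments/* | tail -1)   # newest segment
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s
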
Main steps from a data perspective

[Diagram: the Seed List is injected into the CrawlDB; each generate/fetch/parse round writes a Segment; updates flow back into the CrawlDB and links into the LinkDB]

 Segment contents:
– /crawl_generate/
– /crawl_fetch/
– /content/
– /crawl_parse/
– /parse_data/
– /parse_text/

Frontier expansion
 Manual “discovery”
– Adding new URLs by hand, “seeding”
 Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful – control
– Requires content parsing and link extraction

[Diagram: frontier growing outward from the seed over iterations i=1, i=2, i=3]
[Slide courtesy of A. Bialecki]

An extensible framework
 Plugins
– Activated with the parameter 'plugin.includes'
– Implement one or more endpoints
 Endpoints (URLFilter sketch below)
– Protocol
– Parser
– HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
– IndexWriter (NEW IN 1.7!)
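
To make an endpoint concrete, here is a minimal sketch of a URLFilter (the interface matches Nutch 1.x's org.apache.nutch.net.URLFilter; the class name and the rule it applies are invented for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Hypothetical filter: drop image URLs before they reach the CrawlDB.
public class SkipGifsURLFilter implements URLFilter {

  private Configuration conf;

  // Return the URL (possibly rewritten) to keep it, or null to reject it.
  @Override
  public String filter(String urlString) {
    return urlString.endsWith(".gif") ? null : urlString;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

A plugin like this ships with a plugin.xml descriptor declaring the extension point, and is only active once its id is added to 'plugin.includes'.
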
Features
 Fetcher
– Multi-threaded fetcher
– Queues URLs per hostname / domain / IP
– Limits the number of URLs per round of fetching
– Default values are polite but can be made more aggressive (config sketch below)
 Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom ScoringFilters
 Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
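
For illustration, a hedged nutch-site.xml sketch that makes fetching more aggressive (the property names are from nutch-default.xml; the values are arbitrary examples – the defaults are deliberately polite):

<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value> <!-- number of fetcher threads -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value> <!-- seconds between requests to the same queue -->
</property>
<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value> <!-- or byDomain / byIP -->
</property>
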
Features (cont.)
 Protocols
– http, https, ftp, file
– Respects robots.txt directives
 Scheduling
– Fixed or adaptive
 URL filters (regex sketch below)
– Regex, FSA, TLD, prefix, suffix
 URL normalisers
– Default, regex
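
A hedged sketch of conf/regex-urlfilter.txt for the regex filter (the '+'/'-' prefix syntax is how the plugin works – the first matching rule wins; the patterns and the example.com domain are purely illustrative):

  # skip common binary/static file extensions
  -\.(gif|jpg|png|css|js|zip|gz)$
  # skip URLs containing certain characters (sessions, queries...)
  -[?*!@=]
  # accept everything under an assumed seed domain...
  +^https?://([a-z0-9-]+\.)*example\.com/
  # ...and reject the rest (a generic crawl would end with +. instead)
  -.
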
Features (cont.)
 Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well
 Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata
 Pluggable indexing
– SOLR | ES etc...

Indexing
 Apache SOLR
– schema.xml in conf/
– SOLR 3.4
– JIRA issue for SOLRCloud
• https://issues.apache.org/jira/browse/NUTCH-1377

 ElasticSearch
– Version 0.90.1

 AWS CloudSearch
– WIP : https://issues.apache.org/jira/browse/NUTCH-1517

 Easy to build your own
– Text, DB, etc...

Typical Nutch document
 Some of the fields (from IndexingFilters in plugins or core code; a hedged filter sketch follows below)
– url
– content
– title
– anchor
– site
– boost
– digest
– segment
– host
– type
 Configurable ones
– meta tags (keywords, description etc...)
– arbitrary metadata
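
To show where such fields come from, a minimal sketch of an IndexingFilter (the interface matches Nutch 1.x's org.apache.nutch.indexer.IndexingFilter; the class and the 'contentLength' field are invented for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Hypothetical filter: adds the length of the extracted text as a field.
public class ContentLengthIndexingFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    doc.add("contentLength", String.valueOf(parse.getText().length()));
    return doc; // returning null would exclude the document from the index
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

The document returned here is what eventually reaches the pluggable IndexWriters (SOLR, ES, ...).
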
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

NUTCH 2.x
 2.0 released in July 2012
 2.2.1 in July 2013
 Same core features as 1.x
– MapReduce, Tika, delegation to SOLR, etc...
 Moved to a 'big table'-like architecture
– Wealth of NoSQL projects in the last few years
 Abstraction over the storage layer → Apache GORA

Apache GORA
 http://gora.apache.org/
 ORM for NoSQL databases
– and limited SQL support + file-based storage
 Current version 0.3
 DataStore implementations
● Accumulo
● Cassandra
● HBase
● Avro
● DynamoDB
● SQL (broken)
 Serialization with Apache AVRO
 Object-to-datastore mappings (backend-specific)

AVRO Schema => Java code

{"name": "WebPage",
 "type": "record",
 "namespace": "org.apache.nutch.storage",
 "fields": [
   {"name": "baseUrl", "type": ["null", "string"] },
   {"name": "status", "type": "int"},
   {"name": "fetchTime", "type": "long"},
   {"name": "prevFetchTime", "type": "long"},
   {"name": "fetchInterval", "type": "int"},
   {"name": "retriesSinceFetch", "type": "int"},
   {"name": "modifiedTime", "type": "long"},
   {"name": "protocolStatus", "type": {
     "name": "ProtocolStatus",
     "type": "record",
     "namespace": "org.apache.nutch.storage",
     "fields": [
       {"name": "code", "type": "int"},
       {"name": "args", "type": {"type": "array", "items": "string"}},
       {"name": "lastModified", "type": "long"}
     ]
   }},
   […]

Mapping file (backend specific – HBase)

<gora-orm>
  <table name="webpage">
    <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
    <family name="f" maxVersions="1"/>
    <family name="s" maxVersions="1"/>
    <family name="il" maxVersions="1"/>
    <family name="ol" maxVersions="1"/>
    <family name="h" maxVersions="1"/>
    <family name="mtdt" maxVersions="1"/>
    <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>

DataStore operations
 Basic operations (Java sketch below)
– get(K key)
– put(K key, T obj)
– delete(K key)
 Querying
– execute(Query<K, T> query) → Result<K,T>
– deleteByQuery(Query<K, T> query)
 Wrappers for Apache Hadoop
– GoraInput|OutputFormat
– GoraRecordReader|Writer
– GoraMapper|Reducer
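
A minimal Java sketch of the basic operations as seen from Nutch 2.x (StorageUtils.createWebStore and the reversed-URL keys follow Nutch 2.x conventions; the setStatus/getStatus accessors are assumptions about the AVRO-generated WebPage class):

import org.apache.gora.store.DataStore;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.StorageUtils;
import org.apache.nutch.storage.WebPage;

public class WebPageStoreDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Returns whichever backend 'storage.data.store.class' points to
    DataStore<String, WebPage> store =
        StorageUtils.createWebStore(conf, String.class, WebPage.class);

    WebPage page = new WebPage();
    page.setStatus(1); // accessor generated from the AVRO schema
    store.put("com.example.www:http/", page); // Nutch 2.x keys are reversed URLs
    store.flush();

    WebPage fetched = store.get("com.example.www:http/");
    System.out.println(fetched.getStatus());
    store.close();
  }
}

The same code runs unchanged against HBase, Cassandra, etc. – only the GORA configuration and mapping file differ.
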
GORA in Nutch
 AVRO schema provided and Java code pre-generated
 Mapping files provided for backends
– can be modified if necessary
 Need to rebuild to pull in the dependencies for a given backend
– hence the source-only distribution of Nutch 2.x
 http://wiki.apache.org/nutch/Nutch2Tutorial

Benefits
 Storage still distributed and replicated
 … but one big table
– status, metadata, content, text → one place
– no more segments
 Resume-able fetch and parse steps
 Easier interaction with other resources
– Third-party code just needs to use GORA and the schema
 Simplifies the Nutch code
 Potentially faster (e.g. the update step)

Drawbacks
 More stuff to install and configure
– Higher hardware requirements
 Current performance :-(
– http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
– N2+HBase : 2.7x slower than 1.x
– N2+Cassandra : 4.4x slower than 1.x
– due mostly to the GORA layer : not inherent to HBase or Cassandra
– https://issues.apache.org/jira/browse/GORA-119 → filtered scans
– Not all backends provide data locality!
 Not as stable as Nutch 1.x

2.x Work in progress
 Stabilise backend implementations
– GORA-HBase the most reliable
 Synchronize features with 1.x
– e.g. missing LinkRank equivalent (GSoC 2013 – use Apache Giraph)
– No pluggable indexers yet (NUTCH-1568)
 Filter-enabled scans
– GORA-119
• => no need to de-serialize the whole dataset

Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

Future
 1.x and 2.x to coexist in parallel
– 2.x not yet a replacement for 1.x
 New functionalities
– Support for SOLRCloud
– Sitemaps (via the CrawlerCommons library)
– Canonical tag
– Generic deduplication (NUTCH-656)
 Move to the new MapReduce API
– Use Nutch on Hadoop 2.x

More delegation
 Great deal done in recent years (SOLR, Tika)
 Share code with crawler-commons
(http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– URL normalisation / filtering

 PageRank-like computations to graph library
– Apache Giraph
– Should be more efficient + less code to maintain

Longer term
 Hadoop 2.x & YARN
 Convergence of batch and streaming
– Storm / Samza / Storm-YARN / …

 End of 100% batch operations ?
– Fetch and parse as streaming ?
– Always be fetching
– Generate / update / pagerank remain batch

 See https://github.com/DigitalPebble/storm-crawler
Where to find out more?
 Project page : http://nutch.apache.org/
 Wiki : http://wiki.apache.org/nutch/
 Mailing lists :
– user@nutch.apache.org
– dev@nutch.apache.org

 Chapter in 'Hadoop: The Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...

 Support / consulting :
– http://wiki.apache.org/nutch/Support

Questions?
