Large Scale Crawling with Apache Nutch and friends...

Julien Nioche
julien@digitalpebble.com
LUCENE/SOLR REVOLUTION EU 2013
About myself
 DigitalPebble Ltd, Bristol (UK)
 Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
 Strong focus on Open Source & the Apache ecosystem
 VP Apache Nutch
 User | Contributor | Committer
– Tika
– SOLR, Lucene
– GATE, UIMA
– Mahout
– Behemoth

Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

Nutch?
 “Distributed framework for large scale web crawling”
(but does not have to be large scale at all)
 Apache TLP since May 2010
 Based on Apache Hadoop
 Indexing and Search by Apache SOLR

A bit of history
 2002/2003 : Started by Doug Cutting & Mike Cafarella
 2005 : MapReduce implementation in Nutch
– 2006 : Hadoop sub-project of Lucene @Apache
 2006/7 : Parser and MimeType in Tika
– 2008 : Tika sub-project of Lucene @Apache
 May 2010 : TLP project at Apache
 Sept 2010 : Storage abstraction in Nutch 2.x
– 2012 : Gora TLP @Apache

Recent Releases

[Timeline 06/09 → 06/13: trunk (1.x) releases 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6, 1.7; 2.x branch releases 2.0, 2.1, 2.2.1]

Why use Nutch?
 Usual reasons
– Open source with a business-friendly license, mature, community, ...
 Scalability
– Tried and tested on a very large scale
– Standard Hadoop
 Features
– Index with SOLR / ES / CloudSearch
– PageRank implementation
– Loads of existing plugins
– Can easily be extended / customised

Use cases
 Crawl for search
– Generic or vertical
– Index and Search with SOLR et al.
– Single node to large clusters on the Cloud
 … but also
– Data Mining
– NLP (e.g. Sentiment Analysis)
– ML
 with
– MAHOUT / UIMA / GATE
– Use Behemoth as glueware
  (https://github.com/DigitalPebble/behemoth)

Customer cases

[Diagram: the two customers contrasted along two axes – Specificity (Verticality) vs Size]

 BetterJobs.com (CareerBuilder)
– Single server
– Aggregates content from job portals
– Extracts and normalizes structure (description, requirements, locations)
– ~2M pages total
– Feeds SOLR index

 SimilarPages.com
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved

CommonCrawl
http://commoncrawl.org/
 Open repository of web crawl data
 2012 dataset : 3.83 billion docs
 ARC files on Amazon S3
 Using Nutch 1.7
 A few modifications to Nutch code
– https://github.com/Aloisius/nutch

 Next release imminent
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

Installation
 http://nutch.apache.org/downloads.html
 1.7 => src and bin distributions
 2.2.1 => src only
 'ant clean runtime'
– runtime/local => local mode (test and debug)
– runtime/deploy => job jar for Hadoop + scripts

 Binary distribution for 1.x == runtime/local

Configuration and resources
 Changes in $NUTCH_HOME/conf
– Need recompiling with 'ant runtime'
– Local mode => can be made directly in runtime/local/conf

 Specify configuration in nutch-site.xml
– Leave nutch-default alone!

 At least :
<property>
  <name>http.agent.name</name>
  <value>WhateverNameDescribesMyMightyCrawler</value>
</property>

Running it!
 bin/crawl script : typical sequence of steps (example below)
 bin/nutch : individual Nutch commands
– Inject / generate / fetch / parse / update ….
 Local mode : great for testing and debugging
 Recommended : deploy + Hadoop (pseudo-)distributed mode
– Parallelism
– MapReduce UI to monitor the crawl, check logs, counters
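
For example, a hedged local invocation (the script's exact arguments differ between releases – run bin/crawl without arguments to see the usage for your version; the paths here are assumptions):

  # seed URLs under urls/, SOLR core at the given address, 2 rounds
  bin/crawl urls crawl http://localhost:8983/solr/ 2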

Monitor Crawl with MapReduce UI

[Screenshot: the Hadoop MapReduce web UI listing the crawl's jobs]

Counters and logs

[Screenshot: job counters and task logs in the MapReduce UI]

Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

Typical Nutch Steps
 Same in 1.x and 2.x
 Sequence of batch operations (command sketch below)
1) Inject → populates CrawlDB from the seed list
2) Generate → selects URLs to fetch in a segment
3) Fetch → fetches URLs from the segment
4) Parse → parses content (text + metadata)
5) UpdateDB → updates CrawlDB (new URLs, new status...)
6) InvertLinks → builds the Webgraph
7) Index → sends docs to [SOLR | ES | CloudSearch | … ]
 Repeat steps 2 to 7
 Or use the all-in-one crawl script
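
As an illustrative sketch of one round using the individual commands (directory names are assumptions; run 'bin/nutch <command>' without arguments for the exact usage in your version):

  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=$(ls -d crawl/segments/* | tail -1)   # newest segment
  bin/nutch fetch $s
  bin/nutch parse $s
  bin/nutch updatedb crawl/crawldb $s
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s
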
Main steps from a data perspective

[Diagram: the Seed List is injected into the CrawlDB; each generate/fetch/parse round writes a Segment; updates flow back into the CrawlDB and links into the LinkDB]

 Segment contents:
– /crawl_generate/
– /crawl_fetch/
– /content/
– /crawl_parse/
– /parse_data/
– /parse_text/

Frontier expansion
 Manual “discovery”
– Adding new URLs by hand, “seeding”
 Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful – control
– Requires content parsing and link extraction

[Diagram: frontier growing outward from the seed over iterations i=1, i=2, i=3]
[Slide courtesy of A. Bialecki]

An extensible framework
 Plugins
– Activated with the parameter 'plugin.includes'
– Implement one or more endpoints
 Endpoints (URLFilter sketch below)
– Protocol
– Parser
– HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
– IndexWriter (NEW IN 1.7!)
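
To make an endpoint concrete, here is a minimal sketch of a URLFilter (the interface matches Nutch 1.x's org.apache.nutch.net.URLFilter; the class name and the rule it applies are invented for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Hypothetical filter: drop image URLs before they reach the CrawlDB.
public class SkipGifsURLFilter implements URLFilter {

  private Configuration conf;

  // Return the URL (possibly rewritten) to keep it, or null to reject it.
  @Override
  public String filter(String urlString) {
    return urlString.endsWith(".gif") ? null : urlString;
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

A plugin like this ships with a plugin.xml descriptor declaring the extension point, and is only active once its id is added to 'plugin.includes'.
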
Features
 Fetcher
– Multi-threaded fetcher
– Queues URLs per hostname / domain / IP
– Limits the number of URLs per round of fetching
– Default values are polite but can be made more aggressive (config sketch below)
 Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom ScoringFilters
 Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
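
For illustration, a hedged nutch-site.xml sketch that makes fetching more aggressive (the property names are from nutch-default.xml; the values are arbitrary examples – the defaults are deliberately polite):

<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value> <!-- number of fetcher threads -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value> <!-- seconds between requests to the same queue -->
</property>
<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value> <!-- or byDomain / byIP -->
</property>
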
Features (cont.)
 Protocols
– http, https, ftp, file
– Respects robots.txt directives
 Scheduling
– Fixed or adaptive
 URL filters (regex sketch below)
– Regex, FSA, TLD, prefix, suffix
 URL normalisers
– Default, regex
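
A hedged sketch of conf/regex-urlfilter.txt for the regex filter (the '+'/'-' prefix syntax is how the plugin works – the first matching rule wins; the patterns and the example.com domain are purely illustrative):

  # skip common binary/static file extensions
  -\.(gif|jpg|png|css|js|zip|gz)$
  # skip URLs containing certain characters (sessions, queries...)
  -[?*!@=]
  # accept everything under an assumed seed domain...
  +^https?://([a-z0-9-]+\.)*example\.com/
  # ...and reject the rest (a generic crawl would end with +. instead)
  -.
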
Features (cont.)
 Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well
 Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata
 Pluggable indexing
– SOLR | ES etc...

Indexing
 Apache SOLR
– schema.xml in conf/
– SOLR 3.4
– JIRA issue for SOLRCloud
• https://issues.apache.org/jira/browse/NUTCH-1377

 ElasticSearch
– Version 0.90.1

 AWS CloudSearch
– WIP : https://issues.apache.org/jira/browse/NUTCH-1517

 Easy to build your own
– Text, DB, etc...

Typical Nutch document
 Some of the fields (from IndexingFilters in plugins or core code; a hedged filter sketch follows below)
– url
– content
– title
– anchor
– site
– boost
– digest
– segment
– host
– type
 Configurable ones
– meta tags (keywords, description etc...)
– arbitrary metadata
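
To show where such fields come from, a minimal sketch of an IndexingFilter (the interface matches Nutch 1.x's org.apache.nutch.indexer.IndexingFilter; the class and the 'contentLength' field are invented for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Hypothetical filter: adds the length of the extracted text as a field.
public class ContentLengthIndexingFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    doc.add("contentLength", String.valueOf(parse.getText().length()));
    return doc; // returning null would exclude the document from the index
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}

The document returned here is what eventually reaches the pluggable IndexWriters (SOLR, ES, ...).
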
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

NUTCH 2.x
 2.0 released in July 2012
 2.2.1 in July 2013
 Same core features as 1.x
– MapReduce, Tika, delegation to SOLR, etc...
 Moved to a 'big table'-like architecture
– Wealth of NoSQL projects in the last few years
 Abstraction over the storage layer → Apache GORA

Apache GORA
 http://gora.apache.org/
 ORM for NoSQL databases
– and limited SQL support + file-based storage
 Current version 0.3
 DataStore implementations
● Accumulo
● Cassandra
● HBase
● Avro
● DynamoDB
● SQL (broken)
 Serialization with Apache AVRO
 Object-to-datastore mappings (backend-specific)

AVRO Schema => Java code

{"name": "WebPage",
 "type": "record",
 "namespace": "org.apache.nutch.storage",
 "fields": [
   {"name": "baseUrl", "type": ["null", "string"] },
   {"name": "status", "type": "int"},
   {"name": "fetchTime", "type": "long"},
   {"name": "prevFetchTime", "type": "long"},
   {"name": "fetchInterval", "type": "int"},
   {"name": "retriesSinceFetch", "type": "int"},
   {"name": "modifiedTime", "type": "long"},
   {"name": "protocolStatus", "type": {
     "name": "ProtocolStatus",
     "type": "record",
     "namespace": "org.apache.nutch.storage",
     "fields": [
       {"name": "code", "type": "int"},
       {"name": "args", "type": {"type": "array", "items": "string"}},
       {"name": "lastModified", "type": "long"}
     ]
   }},
   […]

Mapping file (backend specific – HBase)

<gora-orm>
  <table name="webpage">
    <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
    <family name="f" maxVersions="1"/>
    <family name="s" maxVersions="1"/>
    <family name="il" maxVersions="1"/>
    <family name="ol" maxVersions="1"/>
    <family name="h" maxVersions="1"/>
    <family name="mtdt" maxVersions="1"/>
    <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>

DataStore operations
 Basic operations (Java sketch below)
– get(K key)
– put(K key, T obj)
– delete(K key)
 Querying
– execute(Query<K, T> query) → Result<K,T>
– deleteByQuery(Query<K, T> query)
 Wrappers for Apache Hadoop
– GoraInput|OutputFormat
– GoraRecordReader|Writer
– GoraMapper|Reducer
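
A minimal Java sketch of the basic operations as seen from Nutch 2.x (StorageUtils.createWebStore and the reversed-URL keys follow Nutch 2.x conventions; the setStatus/getStatus accessors are assumptions about the AVRO-generated WebPage class):

import org.apache.gora.store.DataStore;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.StorageUtils;
import org.apache.nutch.storage.WebPage;

public class WebPageStoreDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Returns whichever backend 'storage.data.store.class' points to
    DataStore<String, WebPage> store =
        StorageUtils.createWebStore(conf, String.class, WebPage.class);

    WebPage page = new WebPage();
    page.setStatus(1); // accessor generated from the AVRO schema
    store.put("com.example.www:http/", page); // Nutch 2.x keys are reversed URLs
    store.flush();

    WebPage fetched = store.get("com.example.www:http/");
    System.out.println(fetched.getStatus());
    store.close();
  }
}

The same code runs unchanged against HBase, Cassandra, etc. – only the GORA configuration and mapping file differ.
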
GORA in Nutch
 AVRO schema provided and Java code pre-generated
 Mapping files provided for backends
– can be modified if necessary
 Need to rebuild to pull in the dependencies for a given backend
– hence the source-only distribution of Nutch 2.x
 http://wiki.apache.org/nutch/Nutch2Tutorial

Benefits
 Storage still distributed and replicated
 … but one big table
– status, metadata, content, text → one place
– no more segments
 Resume-able fetch and parse steps
 Easier interaction with other resources
– Third-party code just needs to use GORA and the schema
 Simplifies the Nutch code
 Potentially faster (e.g. the update step)

Drawbacks
 More stuff to install and configure
– Higher hardware requirements
 Current performance :-(
– http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
– N2+HBase : 2.7x slower than 1.x
– N2+Cassandra : 4.4x slower than 1.x
– due mostly to the GORA layer : not inherent to HBase or Cassandra
– https://issues.apache.org/jira/browse/GORA-119 → filtered scans
– Not all backends provide data locality!
 Not as stable as Nutch 1.x

2.x Work in progress
 Stabilise backend implementations
– GORA-HBase the most reliable
 Synchronize features with 1.x
– e.g. missing LinkRank equivalent (GSoC 2013 – use Apache Giraph)
– No pluggable indexers yet (NUTCH-1568)
 Filter-enabled scans
– GORA-119
• => no need to de-serialize the whole dataset

Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

Future
 1.x and 2.x to coexist in parallel
– 2.x not yet a replacement for 1.x
 New functionalities
– Support for SOLRCloud
– Sitemaps (via the CrawlerCommons library)
– Canonical tag
– Generic deduplication (NUTCH-656)
 Move to the new MapReduce API
– Use Nutch on Hadoop 2.x

More delegation
 Great deal done in recent years (SOLR, Tika)
 Share code with crawler-commons
(http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– URL normalisation / filtering

 PageRank-like computations to graph library
– Apache Giraph
– Should be more efficient + less code to maintain

Longer term
 Hadoop 2.x & YARN
 Convergence of batch and streaming
– Storm / Samza / Storm-YARN / …

 End of 100% batch operations ?
– Fetch and parse as streaming ?
– Always be fetching
– Generate / update / pagerank remain batch

 See https://github.com/DigitalPebble/storm-crawler
Where to find out more?
 Project page : http://nutch.apache.org/
 Wiki : http://wiki.apache.org/nutch/
 Mailing lists :
– user@nutch.apache.org
– dev@nutch.apache.org

 Chapter in 'Hadoop: The Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...

 Support / consulting :
– http://wiki.apache.org/nutch/Support

Questions?
