Harnessing the power of Nutch with Scala

Crawling the web, Nutch with Scala

Vikas Hazrati

about

CTO at Knoldus Software

Co-Founder at MyCellWasStolen.com

Community Editor at InfoQ.com

Dabbling with Scala – last 40 months

Enterprise grade implementations on Scala – 18 months

nutch
Web search software
crawler, link-graph, parsing

solr

lucene

nutch – but we have google!

transparent

understanding

extensible

nutch – basic architecture

crawler searcher

nutch - architecture

[Architecture diagram: a recursive crawler; a web database (crawl db) holding pages and links; fetchlists; segments]

nutch – crawl cycle
Create crawldb
Inject root URLs in crawldb
generate – fetch – update cycle (repeat until depth reached):
  Generate fetchlist
  Fetch content
  Update crawldb
Update segments
Index fetched pages
Deduplication
Merge indexes for searching

bin/nutch crawl urls -dir crawl -depth 3 -topN 5
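
The generate – fetch – update loop can be sketched as a toy, in-memory model. This is not Nutch code: the `web` map, the `crawl` function and the alphabetical ordering of the fetchlist are assumptions made for illustration (Nutch ranks fetchlist candidates by score). It only shows how `-depth` bounds the number of rounds and `-topN` bounds the URLs fetched per round.

```scala
object CrawlCycleSketch {
  // Hypothetical in-memory "web": each URL maps to the links found on that page.
  val web: Map[String, List[String]] = Map(
    "a" -> List("b", "c"),
    "b" -> List("d"),
    "c" -> List("d", "e"),
    "d" -> Nil,
    "e" -> List("a")
  )

  // Returns the toy crawldb: URL -> has it been fetched yet?
  def crawl(seeds: List[String], depth: Int, topN: Int): Map[String, Boolean] = {
    // Inject: seed URLs enter the crawldb as unfetched.
    var crawlDb: Map[String, Boolean] = seeds.map(_ -> false).toMap
    for (_ <- 1 to depth) {
      // Generate: pick at most topN unfetched URLs (alphabetical here, by score in Nutch).
      val fetchList = crawlDb.collect { case (url, false) => url }.toList.sorted.take(topN)
      // Fetch: "download" each page and collect its outlinks.
      val outlinks = fetchList.flatMap(url => web.getOrElse(url, Nil))
      // Update: mark fetched pages, then add newly discovered URLs as unfetched.
      crawlDb = crawlDb ++ fetchList.map(_ -> true)
      crawlDb = crawlDb ++ outlinks.filterNot(crawlDb.contains).map(_ -> false)
    }
    crawlDb
  }
}
```

Calling `CrawlCycleSketch.crawl(List("a"), depth = 2, topN = 5)` fetches "a" in the first round and "b" and "c" in the second, leaving "d" and "e" discovered but unfetched.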

nutch - plugins
generate – fetch – update cycle

Create crawldb
Inject root URLs in crawldb
Generate fetchlist
Fetch content
Update crawldb

Plugins that hook into the cycle: parser, HtmlParseFilter, URL filter, scoring filter

nutch – extension points

plugin.xml // tells Nutch about the plugin

build.xml // build the plugin

ivy.xml // plugin dependencies

src // plugin source

nutch - example
<plugin id="KnoldusAggregator" name="Knoldus Parse Filter"
        version="1.0.0" provider-name="nutch.org">
    <runtime>
        <library name="kdaggregator.jar">
            <export name="*" />
        </library>
    </runtime>
    <requires>
        <import plugin="nutch-extensionpoints" />
    </requires>
    <extension id="org.apache.nutch.parse.headings"
               name="Nutch Headings Parse Filter"
               point="org.apache.nutch.parse.HtmlParseFilter">
        <implementation id="KDParseFilter"
                        class="com.knoldus.aggregator.server.plugins.DetailParserFilter"></implementation>
    </extension>
</plugin>

public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {

    LOG.debug("Parsing URL: " + content.getUrl());

    Parse parse = parseResult.get(content.getUrl());
    Metadata metadata = parse.getData().getParseMeta();
    // tags and TAG_KEY are defined elsewhere in the plugin class (not shown on the slide)
    for (String tag : tags) {
        metadata.add(TAG_KEY, tag);
    }
    return parseResult;
}

scala
I have Java!

concurrency

verbose

popular

strongly typed

jvm

OO

library

scala
Java:
class Person {
    private String firstName;
    private String lastName;
    private int age;

    public Person(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }

    public void setFirstName(String firstName) { this.firstName = firstName; }
    public String getFirstName() { return this.firstName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
    public String getLastName() { return this.lastName; }
    public void setAge(int age) { this.age = age; }
    public int getAge() { return this.age; }
}

Scala:
class Person(var firstName: String, var lastName: String, var age: Int)

Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
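
To show what the one-line Scala class provides, here is a small usage sketch. The `PersonDemo` object and the sample values are assumptions for illustration; the point is that each `var` constructor parameter gets a generated getter and setter, so field-style access goes through methods.

```scala
class Person(var firstName: String, var lastName: String, var age: Int)

object PersonDemo {
  // Sample values are made up for this sketch.
  val p = new Person("Ada", "Lovelace", 36)

  // Assignment syntax calls the generated setter age_=
  p.age = 37

  // The same kind of setter, called explicitly by its method name
  p.firstName_=("Augusta")
}
```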

scala
Java – everything is an object unless it is primitive

Scala – everything is an object. period.

Java – has operators (+, -, < ..) and methods

Scala – operators are methods

Java – statically typed – Thing thing = new Thing()
Scala – statically typed but uses type inference – val thing = new Thing
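
A minimal sketch of these three points in Scala. The `Money` class and the `OperatorsSketch` object are hypothetical names introduced only for this example.

```scala
// Hypothetical Money type, used only to illustrate the points above.
case class Money(cents: Long) {
  // "Operators" are ordinary methods whose names happen to be symbols.
  def +(other: Money): Money = Money(cents + other.cents)
  def <(other: Money): Boolean = cents < other.cents
}

object OperatorsSketch {
  val a = Money(150)          // type inferred as Money, no annotation needed
  val b = Money(275)

  val infix = a + b           // operator notation
  val dotted = a.+(b)         // the same call written as a plain method call
  val cheaper = a < b

  val three = (1).+(2)        // Int is an object too, so + is a method on it
  val label = 42.toString     // methods can be called directly on a literal
}
```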

scala and concurrency

Fine grained coarse grained

Actors

problem context

Aggregator

UGC

solution

Supplier 1
Aggregator

Supplier 2

Supplier 3

Create crawldb
Inject root URLs in crawldb (supplier URLs)
Generate fetchlist
Fetch content
Update crawldb

plugins written in Scala

logic

Crawl the supplier
Parse
Is URL interesting?
Pass extraction to actor
Seed database

plugin - scala
class DetailParserFilter extends HtmlParseFilter {

  def filter(content: Content, parseResult: ParseResult, metaTags: HTMLMetaTags,
             doc: DocumentFragment): ParseResult = {
    if (isDetailURL(content.getUrl)) {
      val rawHtml = content.getContent
      if (rawHtml.length > 0) processContent(rawHtml)
    }
    parseResult
  }

  private def isDetailURL(url: String): Boolean =
    url.matches(AggregatorConfiguration.regexEventDetailPages)

  // Pass extraction to an actor (asynchronous message send)
  private def processContent(rawHtml: Array[Byte]) = {
    (new DetailProcessor).start ! rawHtml
  }
}
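
The slide does not show `DetailProcessor` or the URL pattern, so the following is only a sketch of the "is URL interesting, then pass extraction to an actor" flow. It does not use `scala.actors`: `MiniActor` is a hand-rolled stand-in (one mailbox, one worker thread) so the example runs on a plain JVM. `CountingProcessor`, `detailPattern` and `ActorSketch` are hypothetical names, and the sketch reuses a single processor where the slide code starts a new actor per page.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Minimal stand-in for an actor: messages are queued and handled one at a time
// on a dedicated daemon thread. This only imitates the `!` send style.
abstract class MiniActor[A] {
  private val mailbox = new LinkedBlockingQueue[A]()
  private val worker = new Thread(new Runnable {
    def run(): Unit = while (true) receive(mailbox.take())
  })
  worker.setDaemon(true)

  def receive(msg: A): Unit
  def start(): this.type = { worker.start(); this }
  def !(msg: A): Unit = mailbox.put(msg)   // fire-and-forget send
}

// Hypothetical processor: counts received bytes instead of extracting data.
class CountingProcessor(done: CountDownLatch) extends MiniActor[Array[Byte]] {
  val bytesSeen = new AtomicInteger(0)
  def receive(rawHtml: Array[Byte]): Unit = {
    bytesSeen.addAndGet(rawHtml.length)
    done.countDown()
  }
}

object ActorSketch {
  // Assumed pattern, standing in for AggregatorConfiguration.regexEventDetailPages.
  val detailPattern = """https?://[^/]+/events/\d+"""

  // String.matches requires the whole URL to match the pattern.
  def isDetailUrl(url: String): Boolean = url.matches(detailPattern)

  // Sends every non-empty detail page to the processor; returns total bytes handled.
  def run(pages: Seq[(String, Array[Byte])]): Int = {
    val interesting = pages.filter { case (url, html) => isDetailUrl(url) && html.nonEmpty }
    val done = new CountDownLatch(interesting.size)
    val processor = new CountingProcessor(done).start()
    interesting.foreach { case (_, html) => processor ! html }
    done.await(5, TimeUnit.SECONDS)   // wait for the worker to drain the mailbox
    processor.bytesSeen.get
  }
}
```

The sender returns as soon as the message is queued, so page parsing in the crawl is not held up by extraction work.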

result

5 suppliers crawled

Crawl cycles run continuously for a few days

> 500K seed data collected

All with Nutch and 823 lines of Scala code

demo

in action ….

resources

http://blog.knoldus.com

http://wiki.apache.org/nutch/NutchTutorial

http://www.scala-lang.org/

vikas@knoldus.com
