Harnessing the power of Nutch with Scala
- 2. about
CTO at Knoldus Software
Co-Founder at MyCellWasStolen.com
Community Editor at InfoQ.com
Dabbling with Scala – last 40 months
Enterprise grade implementations on Scala – 18 months
- 4. nutch – but we have google!
transparent
understanding
extensible
- 7. nutch – crawl cycle
generate – fetch – update cycle
Create crawldb
Inject root URLs into crawldb
Generate fetchlist
Fetch content
Update crawldb
(repeat generate, fetch, update until depth reached)
Update segments
Index fetched pages
Deduplication
Merge indexes for searching
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
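The loop behind the command above can be modelled in a few lines of plain Scala. This is only a toy sketch, not Nutch code: the `web` map, the `example` URLs, and the `crawl` function are made up for illustration, and real Nutch selects the top N URLs by score rather than alphabetically.

```scala
object CrawlCycleSketch {
  // Hypothetical in-memory "web": each URL maps to the links found on that page.
  val web: Map[String, List[String]] = Map(
    "http://a.example/"  -> List("http://a.example/1", "http://a.example/2"),
    "http://a.example/1" -> List("http://a.example/3"),
    "http://a.example/2" -> Nil,
    "http://a.example/3" -> Nil
  )

  /** Runs `depth` rounds of generate, fetch, update and returns every URL fetched. */
  def crawl(seeds: List[String], depth: Int, topN: Int): Set[String] = {
    var crawlDb: Set[String] = seeds.toSet // inject root URLs into the crawldb
    var fetched: Set[String] = Set.empty
    for (_ <- 1 to depth) {
      // generate: pick at most topN URLs that have not been fetched yet
      val fetchList = (crawlDb -- fetched).toList.sorted.take(topN)
      // fetch: look up each page's outlinks in the toy web
      val outlinks = fetchList.flatMap(url => web.getOrElse(url, Nil))
      fetched ++= fetchList
      // update: add newly discovered URLs to the crawldb
      crawlDb ++= outlinks
    }
    fetched
  }

  def main(args: Array[String]): Unit =
    println(crawl(List("http://a.example/"), depth = 3, topN = 5))
}
```

With `depth = 3` the toy crawl reaches all four pages; lowering `depth` or `topN` cuts the crawl short in the same way the `-depth` and `-topN` flags limit a real run.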
- 8. nutch - plugins
generate – fetch – update cycle
Create crawldb
Inject root URLs into crawldb
Generate fetchlist
Fetch content
Update crawldb
Plugins: parser, HtmlParseFilter, URL filter, scoring filter
- 9. nutch – extension points
plugin.xml // tells Nutch about the plugin
build.xml // build the plugin
ivy.xml // plugin dependencies
src // plugin source
- 10. nutch - example
<plugin id="KnoldusAggregator" name="Knoldus Parse Filter"
        version="1.0.0" provider-name="nutch.org">
  <runtime>
    <library name="kdaggregator.jar">
      <export name="*" />
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints" />
  </requires>
  <extension id="org.apache.nutch.parse.headings"
             name="Nutch Headings Parse Filter"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="KDParseFilter"
                    class="com.knoldus.aggregator.server.plugins.DetailParserFilter"></implementation>
  </extension>
</plugin>
- 11. public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
    LOG.debug("Parsing URL: " + content.getUrl());
    // Look up the parse for this URL and attach each tag to its parse metadata
    Parse parse = parseResult.get(content.getUrl());
    Metadata metadata = parse.getData().getParseMeta();
    for (String tag : tags) {
        metadata.add(TAG_KEY, tag);
    }
    return parseResult;
}
- 12. scala
I have Java!
concurrency
verbose
popular
strongly typed
jvm
OO
library
- 13. scala
Java:
class Person {
    private String firstName;
    private String lastName;
    private int age;

    public Person(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }

    public void setFirstName(String firstName) { this.firstName = firstName; }
    public String getFirstName() { return this.firstName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
    public String getLastName() { return this.lastName; }
    public void setAge(int age) { this.age = age; }
    public int getAge() { return this.age; }
}
Scala:
class Person(var firstName: String, var lastName: String, var age: Int)
Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
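A short usage sketch shows what the one-line Scala class provides: each `var` constructor parameter becomes a field with a generated getter and setter. The sample values and the `PersonDemo` object are made up for illustration.

```scala
class Person(var firstName: String, var lastName: String, var age: Int)

object PersonDemo {
  def main(args: Array[String]): Unit = {
    val p = new Person("Ada", "Lovelace", 36)
    p.age = 37                             // calls the generated setter
    println(p.firstName + " " + p.lastName) // calls the generated getters
    println(p.age)
  }
}
```

Declaring a parameter with `val` instead of `var` would generate only the getter, which is the usual choice for immutable data.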
- 14. scala
Java – everything is an object unless it is primitive
Scala – everything is an object. period.
Java – has operators (+, -, < ..) and methods
Scala – operators are methods
Java – statically typed: Thing thing = new Thing()
Scala – statically typed, with type inference: val thing = new Thing
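The points about operators and type inference can be tried out directly. In this minimal sketch the `Money` class is a hypothetical example type, not something from the talk; it defines `+` as an ordinary method, and none of the `val` declarations need a type annotation.

```scala
object OperatorsDemo {
  // Hypothetical value type whose + operator is just a method named "+".
  case class Money(cents: Long) {
    def +(other: Money): Money = Money(cents + other.cents)
  }

  def main(args: Array[String]): Unit = {
    val sum = (1).+(2)                  // method-call form of 1 + 2; inferred as Int
    val total = Money(150) + Money(250) // operator form of Money(150).+(Money(250)); inferred as Money
    println(sum)
    println(total)
  }
}
```

The compiler still checks every type statically; it simply works out `Int` and `Money` from the right-hand sides.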
- 20. solution
Supplier 1
Aggregator
Supplier 2
Supplier 3
- 21. Create crawldb
Inject root URLs into crawldb (supplier URLs)
Generate fetchlist
Fetch content
Update crawldb
plugins written in Scala
- 23. plugin - scala
class DetailParserFilter extends HtmlParseFilter {

  // Hand the raw HTML of detail pages to a processor; the parse result passes through unchanged
  def filter(content: Content, parseResult: ParseResult, metaTags: HTMLMetaTags,
             doc: DocumentFragment): ParseResult = {
    if (isDetailURL(content.getUrl)) {
      val rawHtml = content.getContent
      if (rawHtml.length > 0) processContent(rawHtml)
    }
    parseResult
  }

  private def isDetailURL(url: String): Boolean =
    url.matches(AggregatorConfiguration.regexEventDetailPages)

  // Start a DetailProcessor and send it the page bytes with !
  private def processContent(rawHtml: Array[Byte]) = {
    (new DetailProcessor).start ! rawHtml
  }
}
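The URL check in the filter relies on `String.matches`, which succeeds only when the whole URL matches the pattern. The sketch below shows that behaviour in isolation; the pattern and the `supplier.example` URLs are assumptions made up for illustration, since the actual value of `AggregatorConfiguration.regexEventDetailPages` is not shown in the slides.

```scala
object DetailUrlSketch {
  // Hypothetical pattern standing in for AggregatorConfiguration.regexEventDetailPages.
  val detailPattern: String = """https?://[^/]+/events/\d+"""

  // matches() requires the entire URL to match, not just a prefix.
  def isDetailURL(url: String): Boolean = url.matches(detailPattern)

  def main(args: Array[String]): Unit = {
    println(isDetailURL("http://supplier.example/events/42"))        // true
    println(isDetailURL("http://supplier.example/events/"))          // false
    println(isDetailURL("http://supplier.example/events/42/photos")) // false
  }
}
```

Because listing pages fail the match, only detail pages are handed to the processor, while every page still flows through the normal Nutch parse.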
- 26. resources
http://blog.knoldus.com
http://wiki.apache.org/nutch/NutchTutorial
http://www.scala-lang.org/
vikas@knoldus.com