Crawling the web, Nutch with Scala

    Vikas Hazrati @
about


  CTO at Knoldus Software

  Co-Founder at MyCellWasStolen.com

  Community Editor at InfoQ.com

  Dabbling with Scala – last 40 months

  Enterprise grade implementations on Scala – 18 months



nutch
Web search software

  crawler
  link-graph
  parsing
  solr
  lucene

nutch – but we have google!


             transparent




            understanding




             extensible




nutch – basic architecture




crawler                 searcher




nutch - architecture

  Recursive crawler
  web database (crawl db)
    pages
    links
  segments
  fetchlists

nutch – crawl cycle

  1. Create crawldb
  2. Inject root URLs in crawldb
  3. Generate fetchlist
  4. Fetch content
  5. Update crawldb
     (steps 3-5 are the generate – fetch – update cycle; repeat until depth reached)
  6. Update segments
  7. Index fetched pages
  8. Deduplication
  9. Merge indexes for searching

 bin/nutch crawl urls -dir crawl -depth 3 -topN 5
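
The generate – fetch – update loop behind that command can be pictured with a toy, in-memory model. This is only a sketch: `CrawlCycleSketch`, `Web`, and `crawl` are hypothetical names unrelated to Nutch's real classes, the "web" is a plain map from a URL to its outlinks, and the fetchlist is ordered alphabetically where Nutch would rank URLs by score.

```scala
// Toy, in-memory model of the generate / fetch / update cycle.
// All names are hypothetical; nothing here calls into Nutch.
object CrawlCycleSketch {
  // Stand-in for the web: each URL maps to the links found on that page.
  type Web = Map[String, List[String]]

  def crawl(web: Web, seeds: List[String], depth: Int, topN: Int): Set[String] = {
    var crawlDb = seeds.toSet        // inject root URLs into the crawldb
    var fetched = Set.empty[String]  // pages downloaded so far
    for (_ <- 1 to depth) {
      // generate: pick up to topN known URLs that have not been fetched yet
      // (alphabetical order is a simplification of score-based selection)
      val fetchList = (crawlDb -- fetched).toList.sorted.take(topN)
      // fetch: "download" each page and collect its outlinks
      val outlinks = fetchList.flatMap(url => web.getOrElse(url, Nil))
      // update: record the fetched pages and add newly discovered URLs
      fetched ++= fetchList
      crawlDb ++= outlinks
    }
    fetched
  }
}
```

With `depth = 3` and `topN = 5`, as in the command above, the loop runs three rounds and fetches at most five pages per round, so pages more than two links away from a seed are never reached.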
nutch - plugins

generate – fetch – update cycle:
  Create crawldb
  Inject root URLs in crawldb
  Generate fetchlist
  Fetch content
  Update crawldb

plugin types hooked into the cycle:
  parser
  HtmlParseFilter
  URL filter
  scoring filter

nutch – extension points

  plugin.xml    // tells Nutch about the plugin
  build.xml     // builds the plugin
  ivy.xml       // plugin dependencies
  src           // plugin source

nutch - example
<plugin id="KnoldusAggregator" name="Knoldus Parse Filter"
        version="1.0.0" provider-name="nutch.org">
    <runtime>
        <library name="kdaggregator.jar">
            <export name="*" />
        </library>
    </runtime>
    <requires>
        <import plugin="nutch-extensionpoints" />
    </requires>
    <extension id="org.apache.nutch.parse.headings"
               name="Nutch Headings Parse Filter"
               point="org.apache.nutch.parse.HtmlParseFilter">
        <implementation id="KDParseFilter"
                        class="com.knoldus.aggregator.server.plugins.DetailParserFilter"></implementation>
    </extension>
</plugin>

public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {

  LOG.debug("Parsing URL: " + content.getUrl());

  // Look up the parse for this URL and attach each tag to its parse metadata.
  // `tags` and TAG_KEY are defined elsewhere in the plugin (not shown on the slide).
  Parse parse = parseResult.get(content.getUrl());
  Metadata metadata = parse.getData().getParseMeta();
  for (String tag : tags) {
    metadata.add(TAG_KEY, tag);
  }
  return parseResult;
}

scala
                  I have Java !

concurrency       verbose




        popular             Strongly typed

                                             jvm

          OO                library


scala
Java:
class Person {
    private String firstName;
    private String lastName;
    private int age;

    public Person(String firstName, String lastName, int age) {
      this.firstName = firstName;
      this.lastName = lastName;
      this.age = age;
    }

    public void setFirstName(String firstName) { this.firstName = firstName; }
    public String getFirstName() { return this.firstName; }
    public void setLastName(String lastName) { this.lastName = lastName; }
    public String getLastName() { return this.lastName; }
    public void setAge(int age) { this.age = age; }
    public int getAge() { return this.age; }
}


Scala:
class Person(var firstName: String, var lastName: String, var age: Int)




Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
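
To make the comparison concrete, here is a small usage sketch of the one-line Scala class. The `PersonDemo` object and the sample values are illustrative only; the point is that each `var` constructor parameter gets a getter and a setter generated by the compiler.

```scala
class Person(var firstName: String, var lastName: String, var age: Int)

object PersonDemo {
  def main(args: Array[String]): Unit = {
    val p = new Person("Ada", "Lovelace", 36)
    p.age = 37  // assignment goes through the generated setter `age_=`
    println(p.firstName + " " + p.lastName + " is " + p.age)  // prints "Ada Lovelace is 37"
  }
}
```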
scala
Java – everything is an object unless it is primitive

Scala – everything is an object. period.



Java – has operators (+, -, < ..) and methods

Scala – operators are methods



Java – statically typed – Thing thing = new Thing()
Scala – statically typed but uses type inference
val thing = new Thing
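
The three contrasts above fit in a few lines. A minimal sketch; `ScalaBasics` and `Money` are made-up names for illustration.

```scala
object ScalaBasics {
  // Operators are ordinary methods: `1 + 2` is shorthand for `(1).+(2)`.
  val sum: Int = (1).+(2)

  // A user-defined type can define its own "operators" the same way.
  case class Money(cents: Long) {
    def +(other: Money): Money = Money(cents + other.cents)
    def <(other: Money): Boolean = cents < other.cents
  }

  // Type inference: the compiler works out that `total` is a Money.
  val total = Money(150) + Money(250)

  // Everything is an object: even an Int literal has methods.
  val text: String = 42.toString
}
```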
evolution




scala and concurrency



Fine grained        coarse grained




                         Actors


actors




problem context


Aggregator




             UGC



solution

             Supplier 1
Aggregator

             Supplier 2



              Supplier 3




Create crawldb
Inject root URLs in crawldb   <- Supplier URLs
Generate fetchlist
Fetch content
Update crawldb

plugins written in Scala

logic

  1. Crawl the supplier
  2. Is URL interesting?
  3. Parse
  4. Pass extraction to actor
  5. seed database

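The "is URL interesting" check can be as simple as a regular-expression test on the URL. A minimal sketch, assuming a made-up pattern: `UrlCheck`, `detailPagePattern`, and the example URLs are hypothetical, since the real pattern used in the project is not shown in these slides.

```scala
object UrlCheck {
  // Hypothetical pattern for "detail" pages: scheme, host, then /events/<digits>.
  val detailPagePattern = """https?://[^/]+/events/\d+"""

  // String.matches succeeds only if the WHOLE url matches the pattern,
  // so longer paths such as /events/42/photos are rejected.
  def isDetailURL(url: String): Boolean = url.matches(detailPagePattern)
}
```

Because `matches` anchors at both ends, a pattern intended to accept longer URLs has to say so explicitly, for example by ending in `.*`.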
plugin - scala
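import org.apache.hadoop.conf.Configuration
import org.apache.nutch.parse.{HTMLMetaTags, HtmlParseFilter, ParseResult}
import org.apache.nutch.protocol.Content
import org.w3c.dom.DocumentFragment

class DetailParserFilter extends HtmlParseFilter {

  // HtmlParseFilter is Configurable, so Nutch hands the plugin its configuration.
  private var conf: Configuration = _
  def setConf(conf: Configuration): Unit = { this.conf = conf }
  def getConf(): Configuration = conf

  // Called for every parsed page; the parse result is passed through unchanged.
  def filter(content: Content, parseResult: ParseResult, metaTags: HTMLMetaTags,
             doc: DocumentFragment): ParseResult = {
    if (isDetailURL(content.getUrl)) {
      val rawHtml = content.getContent
      if (rawHtml.length > 0) processContent(rawHtml)
    }
    parseResult
  }

  // A page is interesting when its whole URL matches the configured regex.
  private def isDetailURL(url: String): Boolean =
    url.matches(AggregatorConfiguration.regexEventDetailPages)

  // Hand the raw HTML to a freshly started actor so extraction runs asynchronously.
  private def processContent(rawHtml: Array[Byte]) = {
    (new DetailProcessor).start ! rawHtml
  }
}
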
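The slides do not show `DetailProcessor`, so the following is only a sketch of the fire-and-forget idea behind `!`: the caller queues a message and returns immediately, while a worker handles messages one at a time. `MiniActor` is a hypothetical stand-in built on the JDK's single-threaded executor; it is not the `scala.actors` API used in the plugin.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical actor-like helper: messages are queued on a single-threaded
// executor and handled sequentially, off the caller's thread.
class MiniActor[A](handler: A => Unit) {
  private val mailbox = Executors.newSingleThreadExecutor()

  // Fire-and-forget send, mirroring the `!` operator on the slide.
  def !(msg: A): Unit =
    mailbox.execute(new Runnable { def run(): Unit = handler(msg) })

  // Stop accepting messages and wait (up to 5 seconds) for queued ones to finish.
  // Returns true if everything was processed in time.
  def shutdown(): Boolean = {
    mailbox.shutdown()
    mailbox.awaitTermination(5, TimeUnit.SECONDS)
  }
}
```

A parse filter could then call something like `extractor ! rawHtml` and return the `ParseResult` straight away, keeping slow extraction work out of the crawl path. Unlike the slide's code, which starts a new actor per page, this sketch assumes one long-lived worker.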
result

5 suppliers crawled

Crawl cycles run continuously for a few days

> 500K seed data collected




All with Nutch and 823 lines of Scala code


demo




in action ….




resources

         http://blog.knoldus.com


http://wiki.apache.org/nutch/NutchTutorial


       http://www.scala-lang.org/


          vikas@knoldus.com



