Bixo - Web Mining Toolkit                                                                   23 Sep 2009




                   Web Mining Toolkit




                                            Ken Krugler
                                    TransPac Software, Inc.




             My background - did a startup called Krugle from 2005 - 2008
             Used Nutch to do a vertical crawl of the web, looking for technical software
             pages.
             Mined pages for references to open source projects.


             Used experience to create Bixo, an open source web mining toolkit
             Built on top of Hadoop, Cascading, Tika.








                       Web Mining 101

                        Extracting & Processing Web Data
                        More Than Just Search
                        Business intelligence, competitive intelligence,
                        events, people, companies, popularity, pricing,
                        social graphs, Twitter feeds, Facebook friends,
                        support forums, shopping carts…




             Quick intro to web mining, so we’re on the same page


             Most people think about the big search companies when they think about web
             mining.
             Search is clearly the biggest web mining category, and generates the most
             revenue.
             But other types of web mining have value that is high and growing.
             This is what Bixo focuses on.








                      4 Steps in Mining

                        Collect - fetch content from web
                        Parse - extract data from formats
                        Analyze - tokenize, rate, classify, cluster
                        Produce - an index, a report
                        Search




             Note - does not include serving up the search results
             Why do I bring this up? To help clarify why web mining is not the same as
             vertical search (next slide)








                         Vertical Search

                         Vertical crawl to get specific content
                         Common use case for Nutch, Heritrix
                         But web mining often has different outcome
                         And specialized processing of data




             Most people think of vertical search when they think of specialized web
             mining.
             Lots of people have been doing this, using OSS like Nutch & Heritrix.
             End result is typically a Lucene index, plus the content, inverted links, etc.


             Typical web mining is not the same as vertical search.
             Often uses a white list, versus crawling to discover links.
             More specialized processing of the data.
             And these differences help answer the question of (next slide)…








                              Why Bixo?

                        Response to needs of commercial projects
                         – Plug into Cascading-based workflow
                         – Low IT time/skill requirements
                         – Run well in AWS EC2 environment
                         – Flexible I/O support for AWS - S3, HBase
                         – Toolkit for building custom solutions
                             • Fetch white list (parse/index, data mine)
                             • Scrape white list (social popularity)




             Does the world really need yet another web crawler?
             No, but it does need a web mining toolkit


             Two companies agreed to sponsor work on Bixo as an open source project.


             On the point of running well in an EC2 environment…
             Even though there are many web mining tasks that can be handled on a single
             computer, you very quickly run into issues of scale if you can’t handle
             100M+ pages.








                            Bixo Overview

                        MIT license open source project
                        In use by three companies
                        “Pipe” model for building workflows
                        Runs on top of Hadoop/Cascading




             Full disclosure - Bixo makes heavy use of Cascading, which is under GPL.
             So if you want to sell a product based on Bixo, you need to talk to Chris
             Wensel.


             The pipe model comes from our use of Cascading to define the workflows.








                     What is Cascading

                        API for Hadoop data processing workflows
                        Operations on tuples with named fields
                        Workflows created from pipes
                        Reduces painful low-level MR details
                        Key for complex/reliable workflows




             I know Chris Wensel has previously talked about Cascading here, but just to
             make sure we’re all on the same page…


             “tuple” is like a row in a database. Named fields with values.
             Example of tuple - result of fetching a page, has URL, time of fetch, content,
             headers, response rate, etc.
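
             A minimal sketch of the tuple idea, in plain Python rather than the
             Cascading API. The field names and values are assumptions modeled on the
             fetch result described above.

```python
from collections import namedtuple

# Hypothetical field names, modeled on the fetch result described in the notes.
FetchedTuple = namedtuple(
    "FetchedTuple", ["url", "fetch_time", "content", "headers", "response_rate"]
)

row = FetchedTuple(
    url="http://example.com/",
    fetch_time=1253664000,                  # seconds since the epoch
    content=b"<html>...</html>",
    headers={"Content-Type": "text/html"},
    response_rate=50_000,                   # bytes per second
)

# Fields are addressed by name, like columns in a database row.
print(row.url, row.headers["Content-Type"])
```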


             Because you can build workflows out of a mix of pre-defined & custom pipes,
             it’s a real toolkit.


             Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels
             more like C++ :)


             Key aspect of reliable workflows is Cascading’s ability to check your
             workflow (the DAG it builds)
             Finds cases where fields aren’t available for operations.
             Solves a key problem we ran into when customizing Nutch at Krugle
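
             To illustrate the kind of check this refers to, here is a rough sketch in
             plain Python. It is not Cascading's planner: it assumes a simple linear
             chain of operations instead of a full DAG, and the (name, needs, adds)
             operation format is invented for the example.

```python
def check_workflow(source_fields, operations):
    """Return (operation name, missing fields) for every broken step.

    Each operation is a (name, needs, adds) triple: the fields it reads
    and the fields it appends to the tuple stream.
    """
    available = set(source_fields)
    problems = []
    for name, needs, adds in operations:
        missing = sorted(set(needs) - available)
        if missing:
            problems.append((name, missing))
        available |= set(adds)
    return problems


ops = [
    ("parse", ["content"], ["text"]),
    ("score", ["text", "reply_to"], ["score"]),  # reply_to is never produced
]
print(check_workflow(["url", "content"], ops))
```

             The point is that the missing field is reported before any job runs,
             rather than as a failure partway through a long crawl.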








                            Architecture




             This architecture looks nice and squeaky clean - and in general it is.
             One issue is with the fetch phase of Bixo not fitting well into the MR model.
             External resource constraints mean you can’t treat it like a regular job.
             So lots of threads in a special reduce phase, with corresponding issues
             -Stack size
             -Error handling
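
             A rough sketch of a threaded, polite fetch in plain Python. This is not
             Bixo's fetcher: the per-host queue, the delay parameter and the fake_fetch
             stand-in are all assumptions made so the example runs offline.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def fetch_politely(urls, fetch, delay=0.0, max_threads=8):
    """Fetch urls with one sequential queue per host, hosts in parallel."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).netloc].append(url)

    def drain(host_urls):
        results = []
        for i, url in enumerate(host_urls):
            if i:
                time.sleep(delay)  # wait between requests to the same host
            try:
                results.append((url, fetch(url), None))
            except Exception as exc:  # one bad URL should not kill the queue
                results.append((url, None, exc))
        return results

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        per_host = pool.map(drain, by_host.values())
        return [item for results in per_host for item in results]


def fake_fetch(url):
    # Stand-in for a real HTTP client so the sketch runs offline.
    if "bad" in url:
        raise IOError("simulated failure")
    return "content of " + url


results = fetch_politely(
    ["http://a.example/1", "http://a.example/bad", "http://b.example/1"],
    fake_fetch,
)
```

             Each result carries either content or an error, which is one way to keep
             error handling out of the worker threads themselves.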








                                HUGMEE

                        Hadoop
                        Users who
                        Generate the
                        Most
                        Effective
                        Emails




             Let’s use a real example now of using Bixo to do web mining.

             Imagine that the Apache Foundation decided to honor people who make
             significant contributions to the Hadoop community.


             In a typical company, determining the winner would depend on political
             maneuvering, bribes, and sucking up.


             But the Apache Foundation could decide to go for a quantitative approach for
             the HUGMEE award.








                    Helpful Hadoopers

                        Use mailing list archives for data (collect)
                        Parse mbox files and emails (parse)
                        Score based on key phrases (analyze)
                        End result is score/name pair (produce)




             How do you figure out the most helpful Hadoopers?
             As we discussed previously, it’s a classic web mining problem


             Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files.


             How do we score based on key phrases (next slide)?








                     Scoring Algorithm

                        Very sophisticated point system
                        “thanks” == 5
                        “owe you a beer” == 50
                        “worship the ground you walk on” == 100
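
             The point system above can be sketched in a few lines of plain Python.
             Counting every occurrence of a phrase and using case-insensitive
             substring matching are assumptions; the slide only gives the point values.

```python
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}


def score_text(text):
    """Sum the points for every occurrence of each key phrase."""
    lowered = text.lower()  # simple case-insensitive substring matching
    return sum(points * lowered.count(phrase)
               for phrase, points in PHRASE_POINTS.items())


print(score_text("Thanks, that fixed it. I owe you a beer!"))  # prints 55
```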








                       High Level Steps

                        Collect emails
                         – Fetch mod_mbox generated page
                         – Parse it to extract links to mbox files
                         – Fetch mbox files
                         – Split into separate emails
                        Parse emails
                         – Extract key headers (messageId, email, etc)
                         – Parse body to identify quoted text




             Parsing the mod_mbox page is simple with Tika’s HtmlParser


             Cheated a bit when parsing emails - some users like Owen have many aliases,
             so I hand-generated an alias resolution table.
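
             A minimal sketch of header extraction plus alias resolution, using
             Python's standard email module instead of the Tika-based code from the
             talk. The alias entries, the output field names and the use of the
             In-Reply-To header for the reply-to message ID are assumptions.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical alias table: each known alias maps to one canonical address.
ALIASES = {
    "owen@home.example": "owen@work.example",
}


def parse_email(raw):
    """Extract the headers needed for scoring from one raw email."""
    msg = message_from_string(raw)
    name, address = parseaddr(msg.get("From", ""))
    address = address.lower()
    return {
        "message_id": msg.get("Message-ID"),
        "reply_to_id": msg.get("In-Reply-To"),
        "name": name,
        "address": ALIASES.get(address, address),
        "body": msg.get_payload(),
    }


raw = (
    "From: Owen <Owen@home.example>\n"
    "Message-ID: <2@example>\n"
    "In-Reply-To: <1@example>\n"
    "\n"
    "Try increasing the heap size.\n"
)
parsed = parse_email(raw)
print(parsed["address"])  # prints owen@work.example
```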








                      High Level Steps

                        Analyze emails
                         – Find key phrases in replies (ignore signoff)
                         – Score emails by phrases
                         – Group & sum by message ID
                         – Group & sum by email address
                        Produce ranked list
                         – Toss email addresses with no love
                         – Sort by summed score




             Need to ignore “thanks” in “thanks in advance for doing my job for me”
             signoff.
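
             One possible heuristic for this, sketched in plain Python: drop quoted
             lines (assumed to start with ">") and any line containing "thanks in
             advance" before scoring. The talk does not say how the signoff was
             actually detected, so treat the pattern as an assumption.

```python
import re

THANKS_IN_ADVANCE = re.compile(r"thanks\s+in\s+advance", re.IGNORECASE)


def reply_text(body):
    """Return the author's own words, minus quoted lines and sign-offs."""
    kept = []
    for line in body.splitlines():
        if line.lstrip().startswith(">"):
            continue  # quoted text from the message being replied to
        if THANKS_IN_ADVANCE.search(line):
            continue  # sign-off, not praise for the earlier poster
        kept.append(line)
    return "\n".join(kept)


body = (
    "> Thanks for any pointers.\n"
    "Thanks, that worked. New question below.\n"
    "Thanks in advance for doing my job for me\n"
)
print(reply_text(body))  # prints: Thanks, that worked. New question below.
```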


             Generate two tuples for each email:
             -one with messageId/name/address
             -one with reply-to messageId/score


             Group/sum aspect is classic reduce operation.
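
             The two-tuple join and the group/sum steps can be sketched in plain
             Python as below. In the real workflow these are Cascading pipes running
             as MR jobs; the dictionary keys and sample addresses are assumptions.

```python
from collections import defaultdict


def rank_authors(emails):
    """emails: dicts with message_id, address, reply_to_id and score."""
    # Tuple 1: messageId -> address of the person who wrote that message.
    author_of = {e["message_id"]: e["address"] for e in emails}

    # Tuple 2: reply-to messageId -> score; group and sum by message ID.
    score_by_message = defaultdict(int)
    for e in emails:
        if e["reply_to_id"] is not None:
            score_by_message[e["reply_to_id"]] += e["score"]

    # Join on message ID, then group and sum by email address.
    score_by_address = defaultdict(int)
    for message_id, score in score_by_message.items():
        if message_id in author_of:
            score_by_address[author_of[message_id]] += score

    # Toss addresses with no love, then sort by summed score.
    ranked = [(a, s) for a, s in score_by_address.items() if s > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


emails = [
    {"message_id": "<1>", "address": "ted@example", "reply_to_id": None, "score": 0},
    {"message_id": "<2>", "address": "ann@example", "reply_to_id": "<1>", "score": 5},
    {"message_id": "<3>", "address": "bob@example", "reply_to_id": "<1>", "score": 50},
    {"message_id": "<4>", "address": "ted@example", "reply_to_id": "<2>", "score": 0},
]
print(rank_authors(emails))  # prints [('ted@example', 55)]
```

             Note that the score found in a reply is credited to the author of the
             message being replied to, not to the author of the reply.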








                                Workflow




             I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom
             Cascading operations, 6 MR jobs.


             OK, actually not so clear, but…
             Key point is that only purple is stuff that I had to actually create
             Some lines are purple as well, since that workflow (DAG) is also something I
             defined - see next page.
             But only two custom operations actually needed - parsing the mod_mbox page and
             calculating score


             Running took about 30 minutes - mostly politely waiting until it was OK to
             politely do another fetch.
             Downloaded 150MB of mbox files
             409 unique email addresses with at least one positive reply.








                      Building the Flow




             Most of the code needed to create the workflow for this data mining app.


             Lots of oatmeal code - which is good. Don’t want to be writing tricky code
             here.


             Could optimize, but that would be a mistake…most web mining is
             programmer-constrained.
             So just use more servers in EC2 - cheaper & faster.








                       mod_mbox Page




             Example of the top-level pages that were fetched in first phase.


             Then needed to be parsed to extract links to mbox files.








                     Custom Operation




             Example of one of the two custom operations
             Parsing mod_mbox page
             Uses Tika to extract IDs
             Emits tuple with URL for each mbox ID
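
             The slide's code is not reproduced here, so this is only a sketch of the
             same idea using Python's standard html.parser instead of Tika. It assumes
             the index page links to monthly archives with IDs of the form
             "200909.mbox"; the page URL and sample markup are made up.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: monthly archives are linked with IDs such as "200909.mbox".
MBOX_ID = re.compile(r"(\d{6}\.mbox)")


class MboxIdParser(HTMLParser):
    """Collect each distinct mbox ID found in the href of an <a> tag."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = MBOX_ID.search(href)
        if match and match.group(1) not in self.ids:
            self.ids.append(match.group(1))


def mbox_urls(page_url, html):
    """Emit one URL per distinct mbox ID found on the index page."""
    parser = MboxIdParser()
    parser.feed(html)
    return [urljoin(page_url, mbox_id) for mbox_id in parser.ids]


html = (
    '<a href="200909.mbox/thread">Thread</a>'
    '<a href="200909.mbox/date">Date</a>'
    '<a href="200908.mbox/thread">Thread</a>'
)
urls = mbox_urls("http://mail-archives.example/list/", html)
print(urls)
```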








                                Validate




             Curve looks right - exponential decay.
             409 unique email addresses that got some love from somebody.








                    This Hug’s for Ted!




             And the winner is…Ted Dunning


             I know - I should have colored the elephant yellow.








                                 Produce




             A list of the usual suspects

             Coincidentally, Ted helped me derive the scoring algorithm I used…hmm.








                            Use Bixo to…

                        Find +/- product comments on forums
                        Compare web site quality
                        Track social network popularity
                        Derive optimized SEO terms
                        Scrape and analyze pricing data




             Previous example could be easily changed to “find opinion makers on forums”


             Many other use cases


             All involve web mining workflow - fetch, parse, analyze, produce








                               Summary

                        Bixo is a web mining toolkit
                        Built on Hadoop, Cascading, Tika
                        Young project but used commercially
                        Future - Mahout, monitoring, HBase, URL
                        DB, cleanup, bug fixes, rinse, repeat




             Lots to be done, of course, but moving fast








                             Resources

                        Web: http://bixo.101tec.com
                        List: http://tech.groups.yahoo.com/group/bixo-dev/
                        Source: http://github.com/emi/bixo/tree
                        Bugs: http://oss.101tec.com/jira/browse/bixo




             URLs to find out more about the Bixo project.


             Stefan Groschupf from 101tec helped with initial Bixo coding.
             His company provides infrastructure for the project, thus 101tec.com in the
             URLs above.








                        Any Questions?




