The original vision of Nutch, 14 years later:
Building an open source search engine
Apache Big Data Europe 2016
sylvain@sylvainzimmer.com
@sylvinus
/usr/bin/whoami
• Jamendo (Founder & CTO, 2004-2011)
• TEDxParis (Co-founder, 2009-2012)
• dotConferences (Founder, 2012-)
• Pricing Assistant (Co-founder & CTO, 2012-)
"The original motivation for the Nutch project was to provide
a transparent alternative to the growing power of a
handful of private search services over most users’ view of
the Web. However, as Nutch has been adopted with greater
enthusiasm by smaller organizations, the Nutch
Organization has de-emphasized operating a multi-billion-page
index in the public interest."
CommerceNet Labs Technical Report, Nov 2004
The original vision of Nutch, 14 years later: Building an open source search engine
again?
transparency
reproducibility
https://uidemo.commonsearch.org
https://explain.commonsearch.org/?q=python&g=en
Agenda
• Values & tech choices
• Search engine components
• Challenges
• Opportunities
Values & tech choices
Radical transparency
• Open source (Apache License v2)
• Open data
• (Governance)
Privacy
• Results can be tailored by language/country, but
NOT by user/cookie/sessionid
• o/ Cache everything!
• Tor service: http://comsearchl2zlnre.onion
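One way to read the privacy bullets: if a response depends only on the query and a language/country tag, the same cached entry can be served to every user. The sketch below illustrates that idea; `cache_key`, `search`, and `backend` are hypothetical names and are not taken from the Common Search code base.

```python
import hashlib

def cache_key(query: str, lang: str) -> str:
    """Build a cache key from the query and the language tag only.

    No user id, cookie, or session id goes into the key, so everyone who
    types the same query in the same language shares one cache entry.
    """
    # Normalize case and whitespace so trivially different queries share a key.
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{lang}\x00{normalized}".encode("utf-8")).hexdigest()

_cache = {}

def search(query, lang, backend):
    """Return cached results if present, else ask the (hypothetical) backend."""
    key = cache_key(query, lang)
    if key not in _cache:
        _cache[key] = backend(query, lang)
    return _cache[key]
```

Because nothing user-specific enters the key, the cache hit rate grows with traffic instead of being split per user.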
Participation & Pragmatism
• Use high-level languages as much as possible
(Python, Go)
• Embrace active communities (Spark, Elasticsearch)
• Use mainstream participation platforms, even if they
are nonfree (GitHub, Slack)
Search engines
http://infolab.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Crawler
Indexer
Database
Searcher
Ranker
Crawler
http://commoncrawl.org
Today at 3:30pm!
http://scrapy.org
http://github.com/cocrawler/cocrawler
Indexer
Specs
• HTML parsing & analysis
• Tokenization / NLP
• Static rankings
• Language detection
• I/O from crawls to databases
Common Search Pipeline
Doc sources (Common Crawl, WARC files, URLs ...)
→ Filter plugins
→ Document parsing
→ Output plugins
→ Data output (Database, file, HDFS, S3, ...)
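The stages on this slide (sources, filter plugins, parsing, output plugins) can be sketched as a small function that wires callables together. This is only an illustration of the plugin idea; `run_pipeline`, `https_only`, and `parse_title` are hypothetical names, and the real pipeline may be organized quite differently.

```python
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]

def run_pipeline(
    sources: Iterable[Document],
    filters: List[Callable[[Document], bool]],
    parse: Callable[[Document], Document],
    outputs: List[Callable[[Document], None]],
) -> int:
    """Push each source document through filters, parsing, and outputs.

    Returns the number of documents that reached the output stage.
    """
    written = 0
    for doc in sources:
        # A document is dropped as soon as one filter plugin rejects it.
        if not all(keep(doc) for keep in filters):
            continue
        parsed = parse(doc)
        for write in outputs:
            write(parsed)
        written += 1
    return written

def https_only(doc: Document) -> bool:
    """Example filter plugin: keep only https URLs."""
    return doc["url"].startswith("https://")

def parse_title(doc: Document) -> Document:
    """Example parser: naive <title> extraction (a real parser would be used)."""
    html = doc["html"]
    start, end = html.find("<title>"), html.find("</title>")
    title = html[start + len("<title>"):end] if start != -1 and end != -1 else ""
    return {"url": doc["url"], "title": title}

index: List[Document] = []  # stand-in for a database output plugin
docs = [
    {"url": "https://example.org/", "html": "<title>Example</title>"},
    {"url": "ftp://example.org/file", "html": ""},
]
count = run_pipeline(docs, [https_only], parse_title, [index.append])
```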
HTML parsers
• BeautifulSoup & friends
• lxml
• html5lib
• Gumbo!
https://github.com/google/gumbo-parser
Gumbocy
• Use Cython instead of ctypes
• Smaller API
• Tree traversal on the Cython side with basic
boilerplate/visibility support
https://github.com/commonsearch/gumbocy
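To give a feel for what "tree traversal with basic boilerplate/visibility support" can mean, here is a minimal sketch using only Python's standard `html.parser` module. It does not use Gumbo or Gumbocy and does not reflect their API; the tag sets and the `visible_text` helper are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets: content never rendered, and typical page boilerplate.
HIDDEN_TAGS = {"script", "style", "noscript", "template", "title"}
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside"}
SKIPPED_TAGS = HIDDEN_TAGS | BOILERPLATE_TAGS

class VisibleTextExtractor(HTMLParser):
    """Collect text that is outside hidden or boilerplate elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def visible_text(html: str) -> str:
    """Return the main visible text of a page, whitespace-normalized."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps block elements apart, at the cost of
    # occasionally splitting a word that spans inline tags.
    return " ".join(" ".join(parser.chunks).split())
```

A C parser with the traversal done on the compiled side would follow the same logic but avoid creating a Python object per node, which is presumably where the speedup comes from.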
https://github.com/commonsearch/urlparse4
Database(s)
http://lucene.apache.org/
Ranker
Ranking formula
rank = f( static_score , dynamic_score( query ) )
static_score: Alexa, DMOZ, Blacklists, PageRank, ...
dynamic_score: Elasticsearch & Lucene (TF-IDF, BM25, ...)
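The formula above leaves `f` unspecified. As a rough sketch, the query-dependent part can be illustrated with a plain-Python BM25 (using the smoothed IDF form and the common defaults k1=1.2, b=0.75), combined with a static score through a weighted sum. The weighted sum and the `weight` parameter are assumptions for illustration only, not the combination Common Search uses.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25."""
    n = len(docs)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

def rank(static_score, dynamic_score, weight=0.3):
    """Hypothetical f(): weighted sum of a static and a query-dependent score."""
    return weight * static_score + (1 - weight) * dynamic_score

docs = [
    ["python", "tutorial"],
    ["java", "tutorial"],
    ["python", "python", "guide"],
]
static = [0.9, 0.5, 0.1]  # made-up static scores, e.g. from a PageRank-like signal
dynamic = bm25_scores(["python"], docs)
order = sorted(range(len(docs)), key=lambda i: rank(static[i], dynamic[i]), reverse=True)
```

The static part can be computed once at indexing time, while the dynamic part has to be evaluated per query, which is why the two are kept separate.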
https://about.commonsearch.org/developer/get-started
Today @ 4:30pm ;-)
Searcher / Frontend
Specs
• Send user query to databases
• Search-as-you-type
• HTML & JSON endpoints
• High performance
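Search-as-you-type needs very cheap lookups for every keystroke. One simple way to sketch this is a binary search over a sorted list of known terms; the `suggest` function below is hypothetical, and the actual frontend may instead delegate prefix queries to the database.

```python
import bisect

def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms starting with `prefix`.

    `sorted_terms` must be sorted; all terms sharing a prefix are then
    contiguous, so one binary search finds where they begin.
    """
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:start + limit]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

terms = sorted(["python", "pypi", "pytest", "java", "pycon"])
```

Calling `suggest(terms, "py", limit=3)` walks only the three entries after the insertion point, regardless of how large the term list is.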
https://github.com/commonsearch/cosr-front
http://infolab.stanford.edu/~backrub/google.html
The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)
Crawler
Parser
Index
Searcher
Ranker
Challenges
Funding / Scale
• Frugalism
• Caching
• In-kind services
• Individual donations / Foundation grants
• General economic incentives
Spam
• Email spam
• Wikipedia vandalism
• Algorithm complexity & scale
• Given enough eyeballs, all spam is shallow?
Relevance
• Exhaustivity
• Rescoring
• Evaluation
• More at 4:30pm ;-)
More search dimensions
• Realtime search
• Local search
• Universal search
Semantic search
• Wikidata
• YAGO
• Conversational / Voice search
Outreach
• Easy onboarding & docs
• Making people care / believe
Opportunities
Decentralization
• YaCy
• Extremely high technical & social cost!
• Transparency?
Research
• More people should know how to build search
engines
• Spam, Relevance, Large-scale data processing
• We need more open datasets!
https://about.commonsearch.org/blog/
Make the Web a better place!
• SEO
• Transparency
• Influence of money
• Public service
Questions?
https://about.commonsearch.org/contributing
https://github.com/commonsearch
contact@commonsearch.org
slack.commonsearch.org
