How to build a small distributed search engine using open source software
Building a distributed search engine
Search engine subsystems:
● Page database
● List of the pages to retrieve
● Page retrieval and storage
● Page content parsing
● Full-text indexing of the contents
● Graph database of the links for ranking

Open Source Software
• Apache Hadoop
  • MapReduce
  • HDFS
  • HBase
• Apache Lucene

HDFS
Hadoop Distributed File System

HDFS – Assumptions and goals
● Hardware failure
● Big data
● Write once / read many
● Moving computation, not data

Lucene
Lucene - Inverted index

Term      Doc Id    Weight
JUG       301       0.97
JUG       198       0.65
JUG       120       0.43
Lugano    301       0.94
Lugano    278       0.15
Lugano    451       0.87
Lugano    103       0.45
Lugano    763       0.77
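
The term / document / weight listing above can be sketched in a few lines of plain Python. This is not Lucene code: the sample documents are invented, and the weight is a simple term frequency chosen only for illustration (Lucene's actual scoring is more elaborate).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a list of (doc_id, weight) postings.

    Assumption: weight = occurrences of the term / number of tokens in the doc.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        counts = defaultdict(int)
        for token in tokens:
            counts[token] += 1
        for term, count in counts.items():
            index[term].append((doc_id, count / len(tokens)))
    # Keep the highest-weighted document first in every postings list.
    for postings in index.values():
        postings.sort(key=lambda posting: posting[1], reverse=True)
    return dict(index)

# Hypothetical documents: the ids echo the slide, the texts are made up.
docs = {
    301: "JUG Lugano meeting",
    198: "JUG news and JUG events",
    120: "welcome to Lugano",
}
index = build_inverted_index(docs)
```

Looking up `index["jug"]` returns the postings for that term without scanning every document, which is the point of inverting the index.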
Lucene - Indexing main classes
● IndexWriter
● Directory
● Analyzer
● Document
● Field
Lucene - Searching main classes
● IndexSearcher
● Collector
● Query
● TopDocs
● ScoreDoc

Lucene - Analyzers
● Stop words
    "the book is on the table" → [book, table]
● Stemming
    [paint, paints, painted, …] → paint
● Synonyms
    [cat, feline] → cat
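
The three analysis steps above can be imitated with a toy pipeline in plain Python. This is a sketch under stated assumptions, not Lucene's Analyzer API: the stop-word list, the suffix-stripping rule, and the synonym map are all invented for the example, and a real stemmer is far more careful.

```python
# Assumed, deliberately tiny word lists for the example.
STOP_WORDS = {"the", "is", "on", "a", "an", "of"}
SYNONYMS = {"feline": "cat"}

def stem(token):
    """Naive stemming: strip a trailing 'ed' or 's' from longer tokens."""
    for suffix in ("ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Lowercase, drop stop words, stem, then map synonyms to one term."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [stem(t) for t in tokens]
    return [SYNONYMS.get(t, t) for t in tokens]

print(analyze("the book is on the table"))
```

The same `analyze` function would have to run on both the indexed text and the query text, otherwise the terms would not line up.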

Lucene - Search options
● Fields
    Title: JUG
    body: "JUG Lugano"
● Wildcards
    J?G → [JUG, JAG, ...]
    J*G → [JUG, JEEG, JUNG, …]
● Fuzzy (vocabulary-based)
    JUG~[n] → [MUG, JAG, …]
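
To make the wildcard and fuzzy options above concrete, here is a small plain-Python sketch that matches patterns against a vocabulary of indexed terms. It is not how Lucene implements these queries: the vocabulary is invented, `max_edits` is an assumed stand-in for the optional `n` parameter, and `fnmatchcase` also interprets `[...]` character sets, which goes beyond the two wildcards shown.

```python
from fnmatch import fnmatchcase

def wildcard_match(pattern, vocabulary):
    """Terms matching a pattern where ? is one character and * is any run."""
    return [term for term in vocabulary if fnmatchcase(term, pattern)]

def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_match(term, vocabulary, max_edits=1):
    """Terms within max_edits edits of `term` (the term itself included)."""
    return [t for t in vocabulary if edit_distance(term, t) <= max_edits]

# Hypothetical vocabulary of indexed terms.
vocabulary = ["JUG", "JAG", "JEEG", "JUNG", "MUG", "LUGANO"]
print(wildcard_match("J?G", vocabulary))
print(wildcard_match("J*G", vocabulary))
print(fuzzy_match("JUG", vocabulary))
```

Both searches run against the vocabulary of terms rather than the documents, which is why the slide calls fuzzy search vocabulary-based.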

Lucene - Search options
● Range
    Year: [2002 TO 2012]
    Name: {Alberto TO Andrea}
● Boost
    JUG^5 Lugano
    "JUG Lugano"^5
● Proximity
    "JUG Lugano"~5
● Boolean and existence
    AND, OR, NOT, (), +, -
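
The range, boost, and proximity options above can be approximated in plain Python to show what each one asks of a document. Everything here is a simplification made for illustration: the function names are invented, the score is a bare boosted term count, and Lucene's real proximity "slop" counts position moves and treats reversed word order differently.

```python
def in_range(value, low, high, inclusive=True):
    """[low TO high] includes the endpoints; {low TO high} excludes them."""
    if inclusive:
        return low <= value <= high
    return low < value < high

def within_proximity(tokens, first, second, slop):
    """True if at most `slop` other tokens separate `first` and `second`."""
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) - 1 <= slop
               for i in first_positions
                for j in second_positions
                if i != j)

def score(tokens, boosts):
    """Toy score: each term's count multiplied by its boost factor."""
    return sum(tokens.count(term) * boost for term, boost in boosts.items())

tokens = "jug meets every month in lugano".split()
print(in_range(2005, 2002, 2012))                     # Year: [2002 TO 2012]
print(within_proximity(tokens, "jug", "lugano", 5))   # "JUG Lugano"~5
print(score(tokens, {"jug": 5, "lugano": 1}))         # JUG^5 Lugano
```

With a boost of 5 on `jug`, a document mentioning it counts five times as much as one mentioning only `lugano`, which is the effect the `^` operator is after.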

HDFS - Lucene Integration
● File copy from/to HDFS
● Patch IndexWriter/Directory
● Rewrite of IndexWriter on RAM
● Lucene 4

And now...
Hands on!
