How to build a small distributed search engine using open source software

Building a distributed search engine
Search engine subsystems:
● Page database
● List of the pages to retrieve
● Page retrieval and storage
● Page content parsing
● Full-text indexing of the contents
● Graph database of the links for ranking
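
As a rough illustration of how the subsystems listed above fit together, here is a minimal single-process sketch in Python. Everything in it is an assumption made for the example: `FAKE_WEB` is a hypothetical in-memory stand-in for real HTTP fetches, and plain dictionaries stand in for the page database, the full-text index, and the graph database. The deck itself uses Hadoop and Lucene for these roles.

```python
from collections import defaultdict, deque
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# It replaces real network retrieval for the purpose of this sketch.
FAKE_WEB = {
    "a": ("JUG Lugano meeting", ["b", "c"]),
    "b": ("Lucene search engine", ["c"]),
    "c": ("Hadoop and Lucene", ["a"]),
}

def crawl(seed):
    page_db = {}                    # page database: URL -> raw text
    frontier = deque([seed])        # list of the pages to retrieve
    index = defaultdict(set)        # full-text index: term -> set of URLs
    link_graph = defaultdict(set)   # link graph: URL -> URLs it links to

    while frontier:
        url = frontier.popleft()
        if url in page_db or url not in FAKE_WEB:
            continue                # already stored, or unknown page
        text, links = FAKE_WEB[url]             # page retrieval
        page_db[url] = text                     # page storage
        for term in re.findall(r"\w+", text.lower()):  # content parsing
            index[term].add(url)                # full-text indexing
        for target in links:
            link_graph[url].add(target)         # record the link for ranking
            frontier.append(target)
    return page_db, index, link_graph

page_db, index, link_graph = crawl("a")

# Number of inbound links per page, used here as a crude ranking signal.
inbound = defaultdict(int)
for source, targets in link_graph.items():
    for target in targets:
        inbound[target] += 1
```

In a distributed setup each of these structures would live in its own system; the sketch only shows the data flow between them.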


Open Source Software
• Apache Hadoop
  • MapReduce
  • HDFS
  • HBase
• Apache Lucene


HDFS
Hadoop Distributed File System


HDFS – Assumptions and goals
● Hardware failure
● Big data
● Write once / read many
● Moving computation, not data


Lucene

Lucene - Inverted indexing

Term     Doc Id   Weight
JUG      301      0.97
         198      0.65
         120      0.43
Lugano   301      0.94
         278      0.15
         451      0.87
         103      0.45
         763      0.77
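
The table above can be modelled as a mapping from each term to its postings. The sketch below is plain Python, not Lucene code. It assumes that doc ids and weights pair up in the order shown on the slide, and the scoring rule (summing the weights of the matching terms) is a simplification chosen for illustration, not Lucene's actual scoring formula.

```python
from collections import defaultdict

# Postings from the slide: term -> list of (doc_id, weight).
postings = {
    "jug": [(301, 0.97), (198, 0.65), (120, 0.43)],
    "lugano": [(301, 0.94), (278, 0.15), (451, 0.87), (103, 0.45), (763, 0.77)],
}

def search(query):
    """Rank documents by the summed weights of the query terms they contain."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, weight in postings.get(term, []):
            scores[doc_id] += weight
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

results = search("JUG Lugano")
```

Document 301 is the only one that appears under both terms, so it ends up at the top of `results`.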

Lucene - Indexing main classes
● IndexWriter
● Directory
● Analyzer
● Document
● Field

Lucene - Searching main classes
● IndexSearcher
● Collector
● Query
● TopDocs
● ScoreDoc


Lucene - Analyzers
● Stop words
  "the book is on the table" → [book, table]
● Stemming
  [paint, paints, painted, …] → paint
● Synonyms
  [cat, feline] → cat
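
The three analysis steps above can be chained into one small pipeline. This is a toy Python sketch rather than a Lucene `Analyzer`. The stop-word list, the synonym map, and the suffix-stripping rule are all assumptions made for the example, and the suffix rule is far cruder than a real stemmer.

```python
import re

STOP_WORDS = {"the", "is", "on", "a", "an", "of"}  # tiny illustrative list
SYNONYMS = {"feline": "cat"}                        # variant -> canonical term
SUFFIXES = ("ed", "s")                              # naive rule, not a real stemmer

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]   # drop stop words
    tokens = [stem(t) for t in tokens]                    # stemming
    return [SYNONYMS.get(t, t) for t in tokens]           # map synonyms

book_tokens = analyze("the book is on the table")
paint_tokens = analyze("paint paints painted")
cat_tokens = analyze("cat feline")
```

The same pipeline has to run at indexing time and at query time, otherwise the terms in the query will not match the terms stored in the index.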


Lucene - Search options
● Fields
  Title: JUG
  body: "JUG Lugano"
● Wildcards
  J?G → [JUG, JAG, ...]
  J*G → [JUG, JEEG, JUNG, …]
● Fuzzy (dictionary-based)
  JUG~[n] → [MUG, JAG, …]
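
To make the wildcard and fuzzy examples above concrete, here is a small Python sketch that matches patterns against a term dictionary. It does not use Lucene. `VOCABULARY` is a hypothetical list of indexed terms, and the sketch assumes that `n` in `JUG~[n]` is a maximum edit distance, which is how `max_edits` is used below.

```python
from fnmatch import fnmatchcase

# Hypothetical dictionary of indexed terms.
VOCABULARY = ["JUG", "JAG", "MUG", "JEEG", "JUNG", "LUGANO"]

def wildcard(pattern):
    """'?' matches exactly one character, '*' matches any run of characters."""
    return [term for term in VOCABULARY if fnmatchcase(term, pattern)]

def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

def fuzzy(term, max_edits=1):
    """Dictionary terms within max_edits of term, excluding term itself."""
    return [t for t in VOCABULARY
            if t != term and edit_distance(term, t) <= max_edits]

single_char = wildcard("J?G")
any_run = wildcard("J*G")
near_jug = fuzzy("JUG", max_edits=1)
```

Both kinds of query are expanded against the dictionary of indexed terms first, which is why the fuzzy search is described as dictionary-based.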


Lucene - Search options
● Range
  Year: [2002 TO 2012]
  Name: {Alberto TO Andrea}
● Boost
  JUG^5 Lugano
  "JUG Lugano"^5
● Proximity
  "JUG Lugano"~5
● Boolean and existence
  AND, OR, NOT, (), +, -
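
Boolean and proximity queries can be pictured as set operations and position checks over the documents. The Python sketch below is only an illustration of the idea. `DOCS` is a hypothetical collection, and `within` treats the proximity value as the maximum number of words allowed between the two terms, which is a simplification of how Lucene counts slop.

```python
import re

# Hypothetical documents: id -> text.
DOCS = {
    1: "JUG Lugano meets every month",
    2: "the JUG of the beautiful city of Lugano",
    3: "Lugano lake",
}

def tokens(text):
    return re.findall(r"\w+", text.lower())

def docs_with(term):
    """Ids of the documents that contain the term."""
    return {doc_id for doc_id, text in DOCS.items()
            if term.lower() in tokens(text)}

def within(term_a, term_b, slop):
    """Ids of documents where at most `slop` words separate the two terms."""
    hits = set()
    for doc_id, text in DOCS.items():
        toks = tokens(text)
        pos_a = [i for i, t in enumerate(toks) if t == term_a.lower()]
        pos_b = [i for i, t in enumerate(toks) if t == term_b.lower()]
        if any(abs(a - b) - 1 <= slop for a in pos_a for b in pos_b):
            hits.add(doc_id)
    return hits

jug_and_lugano = docs_with("JUG") & docs_with("Lugano")  # JUG AND Lugano
jug_or_lugano = docs_with("JUG") | docs_with("Lugano")   # JUG OR Lugano
lugano_not_jug = docs_with("Lugano") - docs_with("JUG")  # Lugano NOT JUG
adjacent = within("JUG", "Lugano", 0)                    # terms side by side
nearby = within("JUG", "Lugano", 5)                      # up to 5 words apart
```

Document 2 has five words between the two terms, so it matches only once the allowed distance reaches 5.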


HDFS - Lucene Integration
● File copy from/to HDFS
● Patch IndexWriter/Directory
● Rewrite of IndexWriter on RAM
● Lucene 4


And now...
Hands on!
