This document discusses how to build a small distributed search engine using open source software. It describes the main subsystems of a search engine, including a page database, crawler, parser, indexer and link graph database. It then introduces Apache Hadoop and Apache Lucene as open source tools that can be used to build each subsystem in a distributed manner. Hadoop provides HDFS for distributed storage and MapReduce for distributed processing, while Lucene handles full-text indexing and search. The document outlines how Lucene indexes and searches document contents, and how its components can be integrated with HDFS to build a distributed search index and query system.
1. How to build a small
distributed search
engine using open
source software
2. Building a distributed search engine
Search engine subsystems:
● Page database
● List of the pages to retrieve
● Page retrieval and storage
● Page content parsing
● Full-text indexing of the contents
● Graph database of the links for ranking
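The full-text indexing step above can be sketched with a minimal in-memory inverted index. This is a toy, plain-Java stand-in for what Lucene does at scale on disk; the class and method names here are illustrative, not Lucene's API:

```java
import java.util.*;

// Minimal inverted index: maps each term to the set of document ids
// that contain it. A toy stand-in for Lucene's on-disk index.
public class InvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Tokenize on non-letters, lowercase, and record term -> docId.
    public void addDocument(int docId, String content) {
        for (String token : content.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Return the ids of all documents containing the term.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }
}
```

For example, indexing two pages and searching for a shared term returns both document ids, which is the core lookup the query subsystem performs.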
3. Building a distributed search engine
Open Source Software
• Apache Hadoop
  • MapReduce
  • HDFS
  • HBase
• Apache Lucene
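The MapReduce model listed above can be illustrated with the classic word count. This is a single-process, plain-Java sketch of the map/shuffle/reduce phases, not a real Hadoop job (a real job would extend Hadoop's Mapper and Reducer classes and run across nodes):

```java
import java.util.*;

// Single-process sketch of MapReduce word count.
// map: line -> (word, 1) pairs; shuffle: group by word; reduce: sum counts.
public class WordCount {
    public static Map<String, Integer> run(List<String> lines) {
        // Map phase: emit a (word, 1) pair for every word in every line.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) pairs.add(Map.entry(word, 1));
            }
        }
        // Shuffle phase: group the emitted values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // Reduce phase: sum the counts for each word.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return counts;
    }
}
```

In a distributed run, the map and reduce phases execute in parallel on many nodes and the shuffle moves data between them over the network.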
5. Building a distributed search engine
HDFS – Assumptions and goals
● Hardware failure
● Big data
● Write once / read many
● Moving computation, not data
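"Moving computation, not data" means a task is scheduled on the node that already stores the relevant block, instead of copying the block across the network. A hedged sketch of that idea, using a hypothetical hash partitioner (real HDFS block placement and scheduling are considerably more involved):

```java
import java.util.*;

// Toy data-locality sketch: documents are partitioned across nodes by
// hashing their id; the task for a document is sent to the node that
// stores it, so the data itself never moves over the network.
public class Partitioner {
    private final int numNodes;

    public Partitioner(int numNodes) { this.numNodes = numNodes; }

    // The node that stores (and therefore processes) this document.
    public int nodeFor(String docId) {
        return Math.floorMod(docId.hashCode(), numNodes);
    }

    // Group a batch of document ids by the node that will process them.
    public Map<Integer, List<String>> assign(List<String> docIds) {
        Map<Integer, List<String>> plan = new TreeMap<>();
        for (String id : docIds) {
            plan.computeIfAbsent(nodeFor(id), n -> new ArrayList<>()).add(id);
        }
        return plan;
    }
}
```

Because the assignment is deterministic, any node can compute where a document lives without consulting a central lookup table.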