Building a distributed search system with Apache Hadoop and Lucene
Academic Year 2012-2013
Outline
• Big Data Problem
• Map and Reduce approach: Apache Hadoop
• Distributing a Lucene index using Hadoop
• Measuring Performance
• Conclusion
Mirko Calvaresi, "Building a distributed search system with Apache Hadoop and Lucene"
“Big Data”
This work analyzes the technological challenge of managing and administering quantities of information of global scale, on the order of terabytes (10^12 bytes) or petabytes (10^15 bytes), growing at an exponential rate.
• Facebook processes 2.5 billion content items per day.
• YouTube: 72 hours of video uploaded per minute.
• Twitter: 50 million tweets per day.
Multitier architecture vs Cloud computing
[Diagram: a multitier architecture (Client → Front End Servers → Database Servers, real-time processing) compared with a cloud architecture (Client → Front End Servers → Cloud, real-time processing plus asynchronous data analysis).]
Apache Hadoop architecture
A Hadoop cluster scales computation capacity, storage capacity, and I/O bandwidth simply by adding commodity servers
HDFS: the distributed file system
• Files are stored as sets of (large) blocks
– Default block size: 64 MB (the ext4 default is 4 kB)
– Blocks are replicated for durability and availability
• The namespace is managed by a single name node
– Actual data transfer is directly between client and data node
– Pros and cons of this decision?
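The block size and replication factor have direct, easy-to-compute consequences. A minimal plain-Java sketch (not the Hadoop API; class and method names here are illustrative) of how many blocks a file occupies and how much raw storage replication costs:

```java
// Sketch (not the Hadoop API): how HDFS splits a file into fixed-size
// blocks and how much raw storage replication consumes.
public class HdfsBlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // HDFS default block: 64 MB
    static final int REPLICATION = 3;                 // common default replication factor

    // Number of blocks needed to store a file of the given size.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
    }

    // Raw bytes occupied on the cluster once every block is replicated.
    static long rawStorage(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGb)); // 16 blocks of 64 MB
    }
}
```

The large block size is a deliberate trade-off: fewer blocks per file means less metadata on the single name node, at the cost of wasting space for very small files.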
[Diagram: the client asks the name node where to find a block ("block #2 of foo.txt?"); the name node holds only the metadata (foo.txt: blocks 3, 9, 6; bar.data: blocks 2, 4), and the client then reads block 9 directly from one of the data nodes holding a replica.]
Map and Reduce
The computation takes a set of input key/value pairs, and produces a set of output key/value pairs.
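The canonical example of this key/value contract is word counting. A plain-Java simulation (not the Hadoop API; the class and method names are illustrative) of the map and reduce steps:

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java simulation of the MapReduce contract described above:
// map:    (docId, text)      -> list of (word, 1) pairs
// reduce: (word, [1, 1, ...]) -> (word, count)
public class WordCount {
    // Map phase: emit one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle + reduce phase: group the pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.toMap(
                Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(map("to be or not to be"));
        System.out.println(counts.get("to")); // 2
    }
}
```

In real Hadoop the map output is partitioned across many reducers by key, so each reducer sums the counts for a disjoint subset of words.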
Recap: Map Reduce approach
[Diagram: input data is split across several Mappers, which emit intermediate (key, value) pairs; "the Shuffle" redistributes the pairs by key to the Reducers, which produce the output data.]
Map and Reduce: where it is applicable
• Distributed "grep"
• Count of URL access frequency
• Reverse web-link graph
• Term vector per host
• Reduction of an n-level graph into a redundant hash table
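The first item in the list, distributed grep, is the simplest of these patterns: the map phase emits matching lines and the reduce phase is the identity. A plain-Java sketch (not the Hadoop API; names are illustrative), where each input split would be processed by a different mapper:

```java
import java.util.*;
import java.util.regex.*;
import java.util.stream.*;

// Sketch of the "distributed grep" MapReduce pattern: each mapper scans
// one input split and emits the lines matching a pattern; the reducer
// is the identity, simply passing the matches through.
public class DistributedGrep {
    // Map phase: one call per input split.
    static List<String> map(List<String> split, Pattern p) {
        return split.stream()
                .filter(line -> p.matcher(line).find())
                .collect(Collectors.toList());
    }

    // Identity reduce: concatenate the matches coming from every mapper.
    static List<String> reduce(List<List<String>> perMapperMatches) {
        return perMapperMatches.stream()
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("error");
        List<String> split1 = List.of("boot ok", "disk error");
        List<String> split2 = List.of("net error", "all good");
        System.out.println(reduce(List.of(map(split1, p), map(split2, p))));
        // [disk error, net error]
    }
}
```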
Implementation: distributing a Lucene index using Map and Reduce
The scope of the implementation is to:
1. populate a distributed Lucene index using the HDFS cluster
2. distribute and retrieve results using Map and Reduce
Apache Lucene: indexing
Apache Lucene is the de facto standard in the open source community for textual search
[Diagram: a Lucene Document is a set of fields, each a Field(type) → Value pair.]
Apache Lucene: searching
In Lucene each document is a vector. A measure of relevance is derived from the angle θ between the document vector and the query vector: the smaller the angle, the more relevant the document.
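The angle is compared through its cosine, which is 1.0 for vectors pointing the same way and 0.0 for orthogonal ones. A minimal sketch of this cosine similarity (real Lucene scoring also folds in tf-idf term weights and other normalization factors):

```java
// Sketch of the vector space model: score a document against a query by
// the cosine of the angle θ between their term-weight vectors.
public class CosineScore {
    static double cosine(double[] doc, double[] query) {
        double dot = 0, normDoc = 0, normQuery = 0;
        for (int i = 0; i < doc.length; i++) {
            dot       += doc[i] * query[i];
            normDoc   += doc[i] * doc[i];
            normQuery += query[i] * query[i];
        }
        return dot / (Math.sqrt(normDoc) * Math.sqrt(normQuery));
    }

    public static void main(String[] args) {
        // Term axes (illustrative): [hadoop, lucene, search]
        double[] doc   = {1, 1, 0}; // document mentions "hadoop" and "lucene"
        double[] query = {1, 0, 0}; // query is just "hadoop"
        System.out.println(cosine(doc, query)); // ≈ 0.707, i.e. θ = 45°
    }
}
```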
Distributing Lucene indexes using Hadoop
[Diagram, indexing: a Lucene Indexer Job reads a PDF document archive and builds Index 1, 2, 3 on the HDFS cluster. Map phase: creates and populates each index. Reduce phase: none.]
[Diagram, searching: a Lucene Search Job applies a search filter (a list of Lucene restrictions) to Index 1, 2, 3 on the HDFS cluster through parallel map tasks, then sorts, combines, and reduces the hits into a ResultSet. Map phase: queries the indexes. Reduce phase: merges and orders the result set.]
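The reduce phase of the search job can be sketched in plain Java (not the Lucene or Hadoop API; `Hit` and the method names here are illustrative): each mapper has queried one index and returned its own scored hits, and the reducer merges the partial lists into one result set ordered by descending score.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of the search job's reduce phase: merge the per-index hit
// lists produced by the map tasks and order them by descending score.
public class MergeHits {
    public record Hit(String docId, float score) {}

    static List<Hit> reduce(List<List<Hit>> perIndexHits) {
        return perIndexHits.stream()
                .flatMap(List::stream)
                .sorted(Comparator.comparingDouble((Hit h) -> h.score()).reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Hit> idx1 = List.of(new Hit("a", 0.9f), new Hit("b", 0.4f));
        List<Hit> idx2 = List.of(new Hit("c", 0.7f));
        List<Hit> merged = reduce(List.of(idx1, idx2));
        System.out.println(merged.get(0).docId()); // "a" (highest score)
    }
}
```

Because every index shard is searched independently in the map phase, the merge is the only sequential step; for top-k queries it can be made cheaper still by keeping only the k best hits per shard.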
Measuring Performance
The entire execution time can be formally defined as: [formula not reproduced in this transcription]
The duration of a single Map (or Reduce) phase: [formula not reproduced in this transcription]
where α is the percentage of reduce tasks still ongoing after the map phase completes.
Measuring Performance
With 4 or more data nodes, the setup cost of the Hadoop infrastructure is compensated
Measuring Performance (Word Count)
Having a single big file speeds up Hadoop considerably: performance is determined not so much by the quantity of data as by how many splits are added to the HDFS
[Screenshot: the Hadoop job detail page, showing the tasks queue and the tasks currently running.]
Conclusion
What
• Analysis of the current status of open source technologies
• Analysis of the potential applications for the web
• Implemented a fully working Hadoop architecture
• Designed a web portal based on the previous architecture
Objectives
• Explore the Map and Reduce approach to analyze unstructured data
• Measure performance and understand the Apache Hadoop framework
Outcomes
• Setup of the entire architecture in my company environment (Innovation Engineering)
• Main benefits in the indexing phase
• Poor impact on the search side (for standard query formats)
• In general, major benefits when the HDFS is populated by a relatively small number of big (GB-sized) files