Building a distributed search system with Hadoop and Lucene

Building a distributed search system with Apache Hadoop and Lucene

Academic Year 2012-2013

Outline

• Big Data Problem
• Map and Reduce approach: Apache Hadoop
• Distributing a Lucene index using Hadoop
• Measuring Performance
• Conclusion

Mirko Calvaresi, "Building a distributed search system with Apache Hadoop and Lucene"

“Big Data”

This work analyzes the technological challenge of managing and administering quantities of information of global dimension, in the order of terabytes (10^12 bytes) or petabytes (10^15 bytes), and with an exponential growth rate.
• Facebook processes 2.5 billion pieces of content per day.
• YouTube: 72 hours of video uploaded per minute.
• Twitter: 50 million tweets per day.

Multitier architecture vs Cloud computing

[Figure: multitier architecture vs cloud computing. Labels: Client, Front End Servers, Database Servers; Client, Front End Servers, Data asynchronous analysis, Real ti…]

Apache Hadoop architecture

A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers.

HDFS: the distributed file system

• Files are stored as sets of (large) blocks
  – Default block size: 64 MB (ext4 default is 4 kB)
  – Blocks are replicated for durability and availability
• Namespace is managed by a single name node
  – Actual data transfer is directly between client & data node
  – Pros and cons of this decision?

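[Figure: HDFS read path. The name node holds the block lists (foo.txt: 3, 9, 6; bar.data: 2, 4). The client asks the name node for block #2 of foo.txt, receives 9, and reads block 9 directly from the data nodes.]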
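The lookup in the figure can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the HDFS API: the dictionaries, the data node names and the function `locate_block` are hypothetical, and only the block lists for foo.txt and bar.data come from the slide.

```python
# Hypothetical in-memory model of the name node metadata; not the HDFS API.
# Block lists per file are taken from the slide.
BLOCK_MAP = {
    "foo.txt": [3, 9, 6],
    "bar.data": [2, 4],
}

# Assumed placement of block replicas on data nodes (names are invented).
BLOCK_LOCATIONS = {
    2: ["datanode-a", "datanode-b"],
    3: ["datanode-a", "datanode-c"],
    4: ["datanode-b", "datanode-c"],
    6: ["datanode-a", "datanode-b"],
    9: ["datanode-b", "datanode-c"],
}


def locate_block(filename, position):
    """Return (block_id, replica_nodes) for the 1-based block position of a file."""
    block_id = BLOCK_MAP[filename][position - 1]
    return block_id, BLOCK_LOCATIONS[block_id]


# The client asks the name node only for metadata ...
block_id, nodes = locate_block("foo.txt", 2)
# ... and would then read the block bytes directly from one of the data nodes.
print(block_id, nodes)
```

The name node answers with metadata only; the bytes never pass through it, which is the design decision the slide asks about.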

Map and Reduce

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs.

Recap: Map Reduce approach

[Figure: Mappers emit intermediate (key, value) pairs, "the shuffle" groups them by key, and Reducers consume each group.]
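A minimal single-process sketch of the three stages, using word count as the job. The function names are illustrative and nothing here uses the Hadoop API; it only shows how key/value pairs flow from mapper to reducer.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit an intermediate (word, 1) pair for every word."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group the intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: collapse all values of one key into a single output pair."""
    return word, sum(counts)


def word_count(documents):
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_phase(doc_id, text))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


counts = word_count({"a.txt": "big data big index", "b.txt": "index data"})
print(counts)
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; here everything runs in one process.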

Map and Reduce: where it is applicable

• Distributed “Grep”
• Count of URL Access Frequency
• Reverse Web-Link Graph
• Term-Vector per Host
• Reduce an n-level graph to a redundant hash
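As an example of one item in the list, the reverse web-link graph can be sketched as follows. The page names are invented and the code is a single-process illustration, not a Hadoop job: the map step emits (target, source) for every link, and the reduce step collects all sources that point to the same target.

```python
from collections import defaultdict


def map_links(source, targets):
    """Mapper: for each outgoing link, emit (target, source)."""
    for target in targets:
        yield target, source


def reverse_link_graph(pages):
    """Group the emitted pairs by target and reduce each group to a sorted list."""
    incoming = defaultdict(list)
    for source, targets in pages.items():
        for target, src in map_links(source, targets):
            incoming[target].append(src)
    return {target: sorted(sources) for target, sources in incoming.items()}


# Hypothetical input: page -> pages it links to.
pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
reversed_graph = reverse_link_graph(pages)
print(reversed_graph)
```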

Implementation: distributing a Lucene index using Map and Reduce

The scope of the implementation is to:
1. populate a distributed Lucene index using the HDFS cluster
2. distribute and retrieve results using Map and Reduce

Apache Lucene: indexing

Apache Lucene is the de facto standard in the open source community for textual search.

Document
  Field(type) -> Value
  Field(type) -> Value
  Field(type) -> Value

Apache Lucene: searching

In Lucene each document is a vector. A measure of relevance is the value of the angle θ between the document vector and the query vector: the smaller the angle, the more relevant the document.
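The angle-based measure can be illustrated with plain cosine similarity over term-frequency vectors. This is a simplified sketch under that assumption; Lucene's actual scoring applies further weighting and normalization, and none of the names here are Lucene APIs.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse term-frequency vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """cos(θ) between two sparse vectors; 1.0 means same direction, 0.0 no shared terms."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = term_vector("distributed search")
doc_close = term_vector("distributed search with hadoop")
doc_far = term_vector("video upload statistics")
print(cosine_similarity(query, doc_close), cosine_similarity(query, doc_far))
```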

Distributing Lucene indexes using Hadoop

[Figure: Lucene Indexer Job. The job runs on the cluster, reads the PDF doc archive, and writes Index 1, Index 2 and Index 3.]

Map Phase: creates and populates each index
Reduce Phase: none
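A map-only indexing job can be sketched like this. The sketch replaces Lucene with a toy inverted index (term -> set of document ids) and replaces HDFS input splits with a round-robin assignment; both are assumptions made for illustration, as are all names and the sample corpus.

```python
from collections import defaultdict


def split_input(doc_ids, num_splits):
    """Assign documents to splits round-robin (a stand-in for HDFS input splits)."""
    splits = [[] for _ in range(num_splits)]
    for i, doc_id in enumerate(sorted(doc_ids)):
        splits[i % num_splits].append(doc_id)
    return splits


def index_map_task(split, corpus):
    """Map task: build one independent inverted index for its split."""
    index = defaultdict(set)
    for doc_id in split:
        for term in corpus[doc_id].lower().split():
            index[term].add(doc_id)
    return dict(index)


# Hypothetical corpus standing in for the text extracted from the PDF archive.
corpus = {
    "doc1": "hadoop cluster",
    "doc2": "lucene index",
    "doc3": "hadoop index",
    "doc4": "search cluster",
}

# One index per map task; there is no reduce phase.
indexes = [index_map_task(split, corpus) for split in split_input(corpus, 3)]
print(indexes)
```

Because each map task writes its own index, no data has to be exchanged between tasks, which is why the job needs no reduce phase.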

[Figure: Lucene Search Job. A {Search Filter} (list of Lucene restrictions) is passed to the map tasks, which query Index 1, Index 2 and Index 3 on the cluster; the Combine and Reduce steps produce the ResultSet.]

Map Phase: queries the indexes
Reduce Phase: merges and orders the result set
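The search job can be sketched in the same spirit. The search filter is modeled as a plain list of terms and the score is simply the number of matching terms; both are assumptions for illustration and not Lucene's query model or scoring. The sample indexes are hypothetical.

```python
import heapq


def hit_order(hit):
    """Order hits by descending score, then by document id."""
    doc_id, score = hit
    return (-score, doc_id)


def search_map_task(index, query_terms):
    """Map task: query one index and return its hits, best first."""
    scores = {}
    for term in query_terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores.items(), key=hit_order)


def search_reduce(partial_results, limit):
    """Reduce: merge the already sorted per-index hit lists and keep the top hits."""
    merged = heapq.merge(*partial_results, key=hit_order)
    return list(merged)[:limit]


# Hypothetical toy indexes (term -> set of document ids), one per map task.
indexes = [
    {"hadoop": {"doc1"}, "cluster": {"doc1", "doc4"}, "search": {"doc4"}},
    {"lucene": {"doc2"}, "index": {"doc2"}},
    {"hadoop": {"doc3"}, "index": {"doc3"}},
]

partials = [search_map_task(index, ["hadoop", "index"]) for index in indexes]
result_set = search_reduce(partials, limit=2)
print(result_set)
```

Each map task returns its hits already sorted, so the reduce step only has to merge sorted lists rather than sort the full result set.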

Measuring Performance

The entire execution time can be formally defined as:

While the single Map (or Reduce) phase:

Where α is the percentage of reduce tasks still ongoing after the map phase has completed.

Measuring Performance

With four or more data nodes, the setup cost of the Hadoop infrastructure is compensated.

Measuring Performance (Word Count)

Having a single big file speeds up Hadoop considerably, so performance is determined less by the quantity of data than by how many splits are added to the HDFS.
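A rough way to see why the number of splits matters: assuming one split per block and at least one split per non-empty file (a simplification of how Hadoop computes input splits), the same amount of data can produce very different numbers of map tasks. The file sizes are invented for the example; the 64 MB block size is the default quoted earlier.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size


def count_splits(file_sizes, block_size=BLOCK_SIZE):
    """Approximate the number of splits: one per block, at least one per non-empty file."""
    return sum(math.ceil(size / block_size) for size in file_sizes if size > 0)


one_big_file = [1024 ** 3]                  # a single 1 GB file
many_small_files = [1024 * 1024] * 1024     # 1024 files of 1 MB, same total size

print(count_splits(one_big_file), count_splits(many_small_files))
```

Each split starts its own map task, so the second layout pays the per-task startup cost many more times for the same volume of data.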

[Screenshot: Hadoop job detail page, showing the tasks queue and the tasks currently running]

Conclusion

What
• Analysis of the current status of open source technologies
• Analysis of the potential applications for the web
• Implementation of a fully working Hadoop architecture
• Design of a web portal based on that architecture

Objectives
• Explore the Map and Reduce approach to analyzing unstructured data
• Measure performance and understand the Apache Hadoop framework

Outcomes
• Setup of the entire architecture in my company environment (Innovation Engineering)
• Main benefits in the indexing phase
• Poor impact on the search side (for standard query formats)
• In general, major benefits when the HDFS is populated by a relatively small number of big (GB) files
