[Hadoop] Terapot: Massive Email Archiving with Hadoop

Terapot: Massive Email Archiving with Hadoop & Friends - Commercial Hadoop Application
Jason Han, Founder & CEO, NexR [email protected]
Next Revolution, Toward Open Platform
  • Date posted: 12-Nov-2014
  • Category: Documents


Transcript of [Hadoop] Terapot: Massive Email Archiving with Hadoop

Terapot: Massive Email Archiving with Hadoop & Friends - Commercial Hadoop Application

Jason Han Founder & CEO, NexR [email protected]

Next Revolution, Toward Open Platform

#2

NexR: Introduction

Offering Hadoop & Cloud Computing Platform and Services

Hadoop & Cloud Computing Services
- Hadoop Provisioning & Management
- Massive Email Archiving
- MapReduce Workflow
- Academic Support Program

Massive Data Storage & Processing Platform

Cloud Computing Platform (Compatible with Amazon AWS)
- icube-cc (Compute)
- icube-sc (Storage)

#3

Email Archiving: Objectives

- Regulatory compliance
- e-Discovery: litigation and legal discovery
- E-mail backup and disaster recovery
- Messaging system & storage optimization
- Monitoring of internal and external e-mail content

#4

Email Archiving: Architecture

[Diagram: Email Servers are captured by crawling and journaling into the Email Archiving Servers (HA), which provide search & discovery. Metadata is held on a DB Server and indexes on the storage network. Email ages into Archival Storage: DAS, tape library, NAS, SAN.]

#5

Email Archiving: Challenges

Explosive growth of digital data
- 988 EB (exabytes) in 2010, six times the 2006 volume
- 95% (939 EB) is unstructured data, including email
- Increasing cost and complexity of archiving
Requires scalable & low-cost archiving

Reinforcement of data retention regulation
- Retention, disposal, e-Discovery, security
- HIPAA (healthcare) 21 ~ 23 yrs, SEC 17 (trading) 6 yrs, OSHA (toxic) 30 yrs, SOX (finance) 5 yrs, J-SOX, K-SOX
Requires scalable archiving & fast discovery

Need for intelligent data management
- Knowledge management from email data
- Filtering, monitoring, data mining, etc.
Requires integration with an intelligent system

#6

Email Archiving: Regulatory Compliance

#7

Email Archiving: Problems

[Diagram: the same components as the Architecture slide (Email Servers, crawling, journaling, DB Server with metadata, Email Archiving Servers (HA), storage network with indexes, search & discovery, Archival Storage with DAS, tape library, NAS, SAN), annotated with three problems]

- Centralized search & discovery is slow and not scalable
- Discovery from tape is slow
- Storage is expensive and not scalable

#8

Terapot: When Hadoop Met Email Archiving

Scale-out architecture with Hadoop
- Hadoop HDFS for archiving email data
- Hadoop MapReduce for crawling & indexing
- Apache Lucene for search & discovery

[Diagram: Email Servers feed the Email Archiving Servers (HA) through distributed crawling and a Journaling Server. Hadoop MapReduce handles crawling, indexing, etc.; Hadoop HDFS handles archiving; metadata goes to a DB Server; distributed search & discovery runs over the archive.]

#9

Terapot: Overview

Design Principles

- Shared-nothing architecture: unlimited scalability
- Inexpensive hardware: low cost
- Using open source software: fast development
- Exploiting parallelism: high performance
- Integrating with analysis: high intelligence

Features

- Distributed massive email archiving
- High scalability: thousands of servers, billions of emails
- High performance: fast search in under 1-2 seconds for each user account; fast discovery in parallel with MapReduce
- High intelligence: email data mining, such as social network analysis
- Supports both an on-premise version and a cloud (hosted) version
- Developed with various open source software

#10

Terapot: Open Source Software Stack

Frontend Layer: Apache Tomcat, Apache JAMES

Crawling, Downloading, Indexing, Searching, Email Mining

Backend Layer: Zookeeper, MySQL, Apache Lucene, Hive, Hadoop MapReduce, Hadoop HDFS

#11

Terapot: Architecture

[Diagram components]
- Terapot Clients: SOAP, REST, JSON, POP3 Server
- Email Sources: HTTP/FTP/SFTP Server, Mail Server, NAS/NFS
- Terapot Frontend: Search Gateway, MailServer, MR Workflow Manager, Analyzer
- Searching
- Real-Time Indexing
- Batch processing: Crawling, Indexing, Merging
- Analysis: ETL, Mining
- Hadoop MapReduce, Lucene, & Hive
- HDFS (email, index), Local (index)

#12

Terapot Data Archiving Flow

[Diagram: three layers over HDFS, which stores emails and index files. Other components shown: Internet, HTTP/FTP/SFTP Server, NAS/NFS, SMTP Server.]

Real-Time Indexing Layer (Real-Time Shard)
1. Send email
2. Deliver email
3. Push email
4. Save email & build index files at runtime
5. Forward email
6. Receive email

Batch Processing Layer (Crawler (MR), Indexing (MR))
1. Fetch emails in parallel
2. Save emails
3. Build index files

Search Layer (Shard Indexes)
1. Search emails

#13

Terapot Data Analysis Flow

[Diagram: three layers connecting NexR Terapot Front, the Terapot Mining Engine, and Terapot Archiving Storage (shards)]

Report Retrieval Layer (NexR Terapot Front, MySQL)
1. View report for archiving data

Data Analysis Layer (Hive, analysis data)
1. Send HiveQL to analyze data
2. Generate report in MySQL

ETL Layer (Transform (MR), HDFS, analysis data)
1. Fetch emails in parallel
2. Store large data

#14

Technical Features

Distributed Archiving
- Hadoop HDFS for storing email data
- Compression and deduplication for storage space efficiency

Distributed Crawling & Indexing
- Implemented with Hadoop MapReduce
- Supports both push-based crawling (HTTP) and pull-based crawling (SFTP, FTP, HTTP, NFS, etc.)
- Supports batch indexing & merging with MapReduce, and real-time indexing for instant archiving

Distributed Search
- Shards a search job and executes it in parallel
- Email is searchable instantly on receipt (due to real-time indexing)

Parallel Download
- Downloads full search results in parallel with MapReduce
- Supports various download protocols (Local FS, HDFS, FTP, SFTP, HTTP, etc.)

Standard Client Interface
- Supports REST/SOAP and JSON interfaces

Management
- Configurable MapReduce job scheduling (crawling, indexing, merging, etc.)
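
The slide lists compression and deduplication without saying how Terapot does either. As a rough illustration only, the sketch below stores each distinct message body once, keyed by its SHA-256 digest, and gzip-compresses it. The class name `DedupStore`, the choice of SHA-256 and gzip, and the in-memory dictionaries are all assumptions made for this example, not details from the talk.

```python
import gzip
import hashlib


class DedupStore:
    """Toy content-addressed store: identical messages are kept once, gzip-compressed."""

    def __init__(self):
        self.blobs = {}  # sha256 hex digest -> compressed message bytes
        self.refs = {}   # message id -> digest of its content

    def put(self, message_id: str, raw: bytes) -> str:
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in self.blobs:
            # Store the body only the first time this content is seen.
            self.blobs[digest] = gzip.compress(raw)
        self.refs[message_id] = digest
        return digest

    def get(self, message_id: str) -> bytes:
        return gzip.decompress(self.blobs[self.refs[message_id]])
```

Storing the same message under two ids leaves a single compressed blob with two references, which is the space saving the slide alludes to.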

#15

Crawling

Store Massive Email Data in HDFS through MapReduce
- The Hadoop utility (dfs put) just copies data sequentially
- Each Crawling MR task takes and stores a range of data in parallel

[Diagram: the Crawling Client splits the data location information (INPUT); several Crawling MR tasks run in parallel and write {key,email}* records to HDFS.]
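
The range-splitting idea on this slide can be sketched outside Hadoop. The example below is not Terapot code: it uses a thread pool in place of MapReduce tasks, and `split_ranges`, `crawl_range`, `crawl`, and the `fetch` callback are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total: int, parts: int):
    """Split [0, total) into at most `parts` contiguous, near-equal (start, end) ranges."""
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        if size:
            ranges.append((start, start + size))
        start += size
    return ranges


def crawl_range(locations, start, end, fetch):
    """One 'crawling task': fetch every email in its range and emit (key, email) pairs."""
    return [(locations[i], fetch(locations[i])) for i in range(start, end)]


def crawl(locations, fetch, parts=4):
    """Run one task per range concurrently and gather all (key, email) pairs in order."""
    ranges = split_ranges(len(locations), parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        chunks = pool.map(lambda r: crawl_range(locations, r[0], r[1], fetch), ranges)
        return [pair for chunk in chunks for pair in chunk]
```

A sequential copy would call `fetch` once per location in a single loop; here each range is handled by its own worker, which mirrors the contrast the slide draws with `dfs put`.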

#16

Indexing

Indexing Email Data with MapReduce
- Each Indexing MR task takes a range of data and builds a Lucene index in parallel

[Diagram: the Indexing Client splits the email data (INPUT); several Indexing MR tasks run in parallel and write {key,index}* records to HDFS.]
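
To make the per-range indexing step concrete, here is a minimal sketch that builds a toy inverted index for each range in parallel and then merges the partial indexes. It stands in for Lucene and MapReduce with plain Python; the tokenizer, the function names, and the merge step are assumptions for illustration and do not describe Terapot's index format.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_index(emails):
    """Build a toy inverted index (term -> sorted list of email keys) for one range."""
    postings = defaultdict(set)
    for key, text in emails:
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term].add(key)
    return {term: sorted(keys) for term, keys in postings.items()}


def index_in_parallel(emails, parts=2):
    """Split the emails into contiguous ranges and index each range independently."""
    size = max(1, -(-len(emails) // parts))  # ceiling division
    ranges = [emails[i:i + size] for i in range(0, len(emails), size)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        return list(pool.map(build_index, ranges))


def merge_indexes(indexes):
    """Merge per-range indexes into one, as a batch merging step might."""
    merged = defaultdict(set)
    for index in indexes:
        for term, keys in index.items():
            merged[term].update(keys)
    return {term: sorted(keys) for term, keys in merged.items()}
```

Because each range is indexed without reference to the others, the partial indexes can be built on separate machines and merged later, or left unmerged and searched as separate shards.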

#17

Real-Time Indexing

Indexing Email Data at Runtime
- Index in memory when a new email arrives
- Flush the RT Shard periodically into HDFS

[Diagram: JAMES (Mailet component) forwards incoming mail to the RT Shards; each Real-Time Shard keeps a local index and periodically flushes emails into HDFS.]
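
A minimal sketch of the index-on-arrival, flush-later pattern follows. It is an assumption-laden toy: a Python list stands in for HDFS, the flush is triggered by a message count (`flush_every`, a hypothetical parameter) rather than by a timer so the behavior is deterministic, and the in-memory index is kept after flushing for simplicity.

```python
import re
from collections import defaultdict


class RealTimeShard:
    """Toy real-time shard: index each email in memory on arrival, flush in batches."""

    def __init__(self, flush_every=3, sink=None):
        self.flush_every = flush_every                 # hypothetical flush threshold
        self.sink = sink if sink is not None else []   # stand-in for HDFS
        self.pending = []                              # emails not yet flushed
        self.index = defaultdict(set)                  # term -> message ids

    def on_email(self, message_id, text):
        self.pending.append((message_id, text))
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.index[term].add(message_id)           # searchable immediately
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        self.sink.extend(self.pending)                 # a real system would write to HDFS here
        self.pending.clear()

    def search(self, term):
        return sorted(self.index.get(term.lower(), ()))
```

The point of the pattern is that `search` sees a message as soon as `on_email` returns, while durable storage is written in larger, less frequent batches.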

#18

Searching

Distributed Search

- Indexes are split and stored on local disks
- Each shard is responsible for searching a range of the index

[Diagram: the Searching Client sends a search to the Shards and the RT Shard; each shard searches its local index and reads email from HDFS. Shards update their state and index information in Zookeeper, which sends notifications.]
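
The shard-per-index-range search can be illustrated as a scatter-gather query. The sketch below is a simplified stand-in: the `Shard` class and `distributed_search` function are hypothetical, shards live in one process, and the Zookeeper-based shard state tracking shown on the slide is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Toy shard holding an inverted index for one range of emails."""

    def __init__(self, name, postings):
        self.name = name
        self.postings = postings  # term -> list of message ids

    def search(self, term):
        return self.postings.get(term, [])


def distributed_search(shards, term):
    """Send the query to every shard in parallel and merge the hits."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        results = pool.map(lambda s: s.search(term), shards)
        return sorted(hit for hits in results for hit in hits)
```

Since every shard answers only for its own index range, adding shards spreads the index across more disks without changing the client-side merge.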

#19

Parallel Downloading

Downloading Massive Search Results in Parallel

- Supports various types of communication for downloading
- The Downloading MR sorts search results globally and pushes them to targets

[Diagram: the Download Client sends a download request; DL Map tasks read search results from the Shards and HDFS; DL Reduce tasks perform a distributed global sort and write results directly to the target: Local, HDFS, FTP, SFTP, or HTTP.]
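
The slide does not say how the global sort is implemented. One common way to get a total order out of parallel reducers is range partitioning: route each record to a reducer by key range, sort within each reducer, and read the reducers' outputs in order. The sketch below shows that technique under the assumption that Terapot does something similar; `pick_boundaries` and `global_sort` are hypothetical names, and for simplicity the split points are taken from all keys rather than from a sample.

```python
import bisect


def pick_boundaries(keys, reducers):
    """Choose reducers-1 split points from the sorted keys."""
    ordered = sorted(keys)
    return [ordered[len(ordered) * i // reducers] for i in range(1, reducers)]


def global_sort(records, reducers=3, key=lambda r: r):
    """Range-partition records, sort each partition, and return the partitions in order."""
    if not records:
        return [[] for _ in range(reducers)]
    boundaries = pick_boundaries([key(r) for r in records], reducers)
    partitions = [[] for _ in range(reducers)]
    for record in records:
        # "Map" side: route each record to the partition that owns its key range.
        partitions[bisect.bisect_right(boundaries, key(record))].append(record)
    # "Reduce" side: each partition is sorted independently.
    return [sorted(p, key=key) for p in partitions]
```

Concatenating the returned partitions yields a fully sorted result, so each reducer can push its own slice to the download target without a final single-node merge.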

#20

Email Data Analysis

Analysis Process

- ETL (Extract-Transform-Load) the archived email data into Hive table format
- Analyze the data using Hive with various analysis algorithms
- Generate the analysis result report

[Diagram: ETL MR tasks load archiving data and write results to Hive; Terapot Mining executes HiveQL, generates the report, and writes the result to MySQL.]
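
As an illustration of the transform step, the sketch below flattens one raw message into a tab-separated row of the kind a tab-delimited text table in Hive could load. The column set (id, from, to, cc, subject) and the function name `email_to_row` are assumptions for this example; the talk does not give Terapot's table schema.

```python
from email import message_from_string
from email.utils import getaddresses


def email_to_row(message_id, raw):
    """Flatten one RFC 822 message into a tab-separated row: id, from, to, cc, subject."""
    msg = message_from_string(raw)

    def addrs(header):
        # Parse every occurrence of the header into lowercase addresses.
        pairs = getaddresses(msg.get_all(header, []))
        return ",".join(addr.lower() for _, addr in pairs if addr)

    fields = [message_id, addrs("From"), addrs("To"), addrs("Cc"), msg.get("Subject", "")]
    # Tabs and newlines would break a delimited row, so replace them with spaces.
    return "\t".join(f.replace("\t", " ").replace("\n", " ") for f in fields)
```

Keeping sender and recipient addresses as separate columns is what makes the later TO, CC, and FROM link analysis expressible as queries.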

#21

Types of Analysis

Social Network Analysis

Personal Network Analysis

- Computing distance between recipients or senders based on TO, CC, FROM links
- Analyzing the statistics of mail frequency

Domain Analysis

Computing distance between recipient domains based on TO, CC, FROM

Keyword Analysis (in progress)

Keyword frequency for each user
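
The talk does not define "distance" for the network analysis above. One plausible reading is the number of hops on the shortest path in a graph whose edges link each sender to the TO and CC recipients of their messages. The sketch below computes that with breadth-first search; the hop-count definition, the undirected edges, and the function names are all assumptions for illustration.

```python
from collections import defaultdict, deque


def build_graph(messages):
    """Link each sender to every TO/CC recipient; links are treated as undirected."""
    graph = defaultdict(set)
    for sender, recipients in messages:
        for rcpt in recipients:
            if rcpt != sender:
                graph[sender].add(rcpt)
                graph[rcpt].add(sender)
    return graph


def distance(graph, src, dst):
    """Hop count of the shortest path between two addresses, or None if unconnected."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return hops + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None
```

Replacing each address with its domain part before calling `build_graph` would give the domain-level variant of the same computation.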

#22

Terapot Performance

Experimental Environment

11 Intel servers: 1 master + 10 slaves

Xeon 2.0 GHz 2 CPU, 16 GB memory, 4 TB disk

Number of emails: 270 million (index size: 270 GB)

Results

Indexing in local disks

Number of Emails    Number of Results    Response Time (sec)
67,217,298          12,547,398           1.4
134,434,596         25,094,796           1.4
201,651,894         37,642,194           1.4
268,869,192         50,189,592           1.4

Indexing in HDFS

Number of Emails    Number of Results    Response Time (sec)
67,217,298          12,547,398           2.8
134,434,596         25,094,796           2.8
201,651,894         37,642,194           3.2
268,869,192         50,189,592           3.2

#23

Demonstration

#24

www.nexr.co.kr

Hadoop & Cloud Computing Company