[Hadoop] Terapot: Massive Email Archiving with Hadoop
Date posted: 12-Nov-2014
Terapot: Massive Email Archiving with Hadoop & Friends
Commercial Hadoop Application
Jason Han, Founder & CEO, NexR [email protected]
Next Revolution, Toward Open Platform
#2
NexR: Introduction
Offering Hadoop & Cloud Computing Platform and Services
Hadoop & Cloud Computing Services
- Hadoop Provisioning & Management
- Massive Email Archiving
- MapReduce Workflow
- Academic Support Program
Massive Data Storage & Processing Platform
Cloud Computing Platform (compatible with Amazon AWS)
- icube-cc (Compute)
- icube-sc (Storage)
#3
Email Archiving: Objectives
- Regulatory compliance
- e-Discovery: litigation and legal discovery
- E-mail backup and disaster recovery
- Messaging system & storage optimization
- Monitoring of internal and external e-mail content
#4
Email Archiving: Architecture
[Figure: email servers feed the email archiving servers (HA) through crawling and journaling; a DB server holds metadata; indexes serve search & discovery; emails age into archival storage (DAS, tape library, NAS, SAN) over the storage network]
#5
Email Archiving: Challenges
Explosive growth of digital data
- Sixfold growth between 2006 and 2010 (988 XB in 2010)
- 95% (939 XB) is unstructured data, including email
- Increasing cost and complexity of archiving
Requires scalable, low-cost archiving
Reinforcement of data retention regulation
- Retention, disposal, e-Discovery, security
- HIPAA (healthcare) 21 to 23 yrs, SEC17 (trading) 6 yrs, OSHA (toxic) 30 yrs, SOX (finance) 5 yrs, J-SOX, K-SOX
Requires scalable archiving & fast discovery
Need for intelligent data management
- Knowledge management from email data
- Filtering, monitoring, data mining, etc.
Requires integration with intelligent systems
#6
Email Archiving: Regulatory Compliance
#7
Email Archiving: Problems
[Figure: the conventional architecture (email servers, crawling, journaling, DB server with metadata, email archiving servers (HA), storage network, indexes, archival storage on DAS, tape library, NAS, SAN), annotated with its problems]
- Centralized search is slow & not scalable
- Discovery from tape is slow
- Storage is expensive & not scalable
#8
Terapot: When Hadoop Met Email Archiving
Scale-out architecture with Hadoop
- Hadoop HDFS for archiving email data
- Hadoop MapReduce for crawling & indexing
- Apache Lucene for search & discovery
[Figure: email servers and a journaling server feed distributed crawling; Hadoop MapReduce (crawling, indexing, etc.) and Hadoop HDFS (archiving) replace the email archiving servers (HA); a DB server holds metadata; search & discovery are distributed]
#9
Terapot: Overview
Design Principles
- Shared-nothing architecture: unlimited scalability
- Inexpensive hardware: low cost
- Using open source software: fast development
- Exploiting parallelism: high performance
- Integrating with analysis: high intelligence
Features
- Distributed massive email archiving
- High scalability
  - Thousands of servers, billions of emails
- High performance
  - Fast search, under 1-2 seconds for each user account
  - Fast discovery in parallel with MapReduce
- High intelligence
  - Email data mining, such as social network analysis
- Supports both an on-premise version and a cloud (hosted) version
- Developed with various open source software
#10
Terapot: Open Source Software Stack
Frontend Layer
- Apache Tomcat
- Apache JAMES
Functions
- Crawling, Downloading, Indexing, Searching, Email Mining
Backend Layer
- Zookeeper
- MySQL
- Apache Lucene
- Hive
- Hadoop MapReduce
- Hadoop HDFS
#11
Terapot: Architecture
Terapot Clients
- SOAP, REST, JSON, POP3 Server
Email Sources
- HTTP/FTP/SFTP Server, Mail Server, NAS/NFS
Terapot Frontend
- Search Gateway, MailServer, MR Workflow Manager, Analyzer
Processing (Hadoop MapReduce, Lucene, & Hive)
- Searching
- Real-Time Indexing
- Batch processing: Crawling, Indexing, Merging
- Analysis: ETL, Mining
Storage
- HDFS (email, index)
- Local (index)
#12
Terapot Data Archiving Flow
[Figure: archiving flow across three layers; emails and index files are stored in HDFS, and shard indexes serve searches]
Search Layer (shard indexes)
1. Search emails
Real-Time Indexing Layer (SMTP server, real-time shard)
1. Send email
2. Deliver email (over the Internet)
3. Push email
4. Save email & build index files at runtime
5. Forward email
6. Receive email
Batch Processing Layer (HTTP/FTP/SFTP server, NAS/NFS; Crawler (MR) and Indexing (MR))
1. Fetch emails in parallel
2. Save emails
3. Build index files
#13
Terapot Data Analysis Flow
[Figure: the NexR Terapot Front, the Terapot Mining Engine, Terapot Archiving Storage (shards), Hive and HDFS (analysis data), and MySQL, arranged in three layers]
ETL Layer
1. Fetch emails in parallel
2. Store large data (Transform (MR) writes analysis data to HDFS)
Data Analysis Layer
1. Send HiveQL to analyze data
2. Generate report in MySQL
Report Retrieval Layer
1. View report for archiving data
#14
Technical Features
Distributed Archiving
- Hadoop HDFS for storing email data
- Compression and deduplication for storage space efficiency
Distributed Crawling & Indexing
- Implemented with Hadoop MapReduce
- Supports both push-based crawling (HTTP) and pull-based crawling (SFTP, FTP, HTTP, NFS, etc.)
- Supports batch indexing & merging by MapReduce, and real-time indexing for instant archiving
Distributed Search
- Shards a search job and executes it in parallel
- An email is searchable as soon as it is received (thanks to real-time indexing)
Parallel Download
- Downloads full search results in parallel with MapReduce
- Supports various download protocols (local FS, HDFS, FTP, SFTP, HTTP, etc.)
Standard Client Interface
- Supports REST/SOAP and JSON interfaces
Management
- Configurable MapReduce job scheduling (crawling, indexing, merging, etc.)
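The compression and deduplication item under Distributed Archiving can be illustrated with a small content-addressed store. This is only a sketch under stated assumptions: the slides do not say how Terapot detects duplicates or which codec it uses, so the SHA-256 digest, the gzip compression, and the `DedupStore` class are all hypothetical choices made for illustration.

```python
import gzip
import hashlib


class DedupStore:
    """Hypothetical store that keeps one compressed copy per unique message."""

    def __init__(self):
        self.blobs = {}  # content digest -> gzip-compressed message bytes
        self.refs = {}   # (account, message_id) -> content digest

    def put(self, account, message_id, raw):
        # Identical bytes hash to the same digest, so they are stored once.
        digest = hashlib.sha256(raw).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = gzip.compress(raw)
        self.refs[(account, message_id)] = digest
        return digest

    def get(self, account, message_id):
        return gzip.decompress(self.blobs[self.refs[(account, message_id)]])


store = DedupStore()
body = b"Subject: Q3 report\r\n\r\nSee attachment.\r\n" * 50
store.put("alice", "m1", body)
store.put("bob", "m7", body)  # same content delivered to a second mailbox
print(len(store.blobs), "stored copy for", len(store.refs), "references")
```

A message sent to many recipients then costs one compressed blob plus one small reference per mailbox.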
#15
Crawling
Store Massive Email Data in HDFS through MapReduce
- The Hadoop utility (dfs put) just copies data sequentially
- Each Crawling MR takes & stores a range of data in parallel
[Figure: the crawling client splits the input using data location information; several Crawling MR tasks emit {key,email}* pairs into HDFS]
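The split-then-store idea on this slide can be sketched in a few lines. This is an illustration, not Terapot's crawler: a thread pool stands in for the Crawling MR tasks, a dictionary stands in for HDFS, and `split_ranges`, `crawl_range`, and the key format are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total, num_tasks):
    """Divide positions 0..total-1 into contiguous (start, end) ranges."""
    size, extra = divmod(total, num_tasks)
    ranges, start = [], 0
    for i in range(num_tasks):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            ranges.append((start, end))
        start = end
    return ranges


def crawl_range(source, start, end):
    """Stand-in for one crawling task: read a range, emit (key, email) pairs."""
    return [(f"email-{i:08d}", source[i]) for i in range(start, end)]


def crawl(source, num_tasks=3):
    store = {}  # stands in for HDFS in this sketch
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        futures = [pool.submit(crawl_range, source, s, e)
                   for s, e in split_ranges(len(source), num_tasks)]
        for future in futures:
            store.update(future.result())
    return store


source = [f"raw message {i}" for i in range(10)]
archive = crawl(source)
print(len(archive), "emails stored")
```

Unlike a sequential copy, each task only touches its own range, so adding tasks shortens the wall-clock time as long as the source can serve the ranges concurrently.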
#16
Indexing
Indexing Email Data with MapReduce
- Each Indexing MR takes a range of data and builds a Lucene index in parallel
[Figure: the indexing client splits the email data; several Indexing MR tasks emit {key,index}* pairs into HDFS]
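To make the per-range indexing step concrete, the sketch below builds one small inverted index per range of emails. It is a simplification under clear assumptions: a plain dictionary of postings stands in for a Lucene index, a thread pool stands in for the Indexing MR tasks, and `build_index` and `index_in_parallel` are hypothetical names.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def build_index(emails):
    """Build a term -> sorted list of email keys mapping for one range of data."""
    postings = defaultdict(set)
    for key, text in emails:
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term].add(key)
    return {term: sorted(keys) for term, keys in postings.items()}


def index_in_parallel(emails, num_tasks=2):
    """Split the emails into ranges and index each range independently."""
    step = max(1, -(-len(emails) // num_tasks))  # ceiling division
    chunks = [emails[i:i + step] for i in range(0, len(emails), step)]
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        return list(pool.map(build_index, chunks))


emails = [
    ("k1", "Budget review on Friday"),
    ("k2", "Friday lunch"),
    ("k3", "Budget approved"),
    ("k4", "Lunch moved"),
]
indexes = index_in_parallel(emails)
print(len(indexes), "partial indexes built")
```

Each partial index covers only its own range, which is why a separate merging step appears alongside indexing in the batch processing layer.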
#17
Real-Time Indexing
Indexing Email Data at Runtime
- Index in memory when a new email arrives
- Flush the RT shard periodically into HDFS
[Figure: a Mailet component in JAMES forwards email data to the RT shards; each real-time shard keeps a local index and periodically flushes emails into HDFS]
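The index-in-memory-then-flush behaviour can be sketched as follows. This is an assumption-laden illustration rather than Terapot's RT shard: the slide says flushing is periodic, while this sketch flushes after a fixed number of emails to stay deterministic, a list stands in for HDFS, and `RealTimeShard` is a hypothetical class.

```python
import re


class RealTimeShard:
    """Hypothetical in-memory shard: searchable at once, flushed in batches."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.pending = []       # emails received since the last flush
        self.memory_index = {}  # term -> set of keys, covers pending emails only
        self.flushed = []       # stands in for batches written to HDFS

    def add(self, key, text):
        self.pending.append((key, text))
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            self.memory_index.setdefault(term, set()).add(key)
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def search(self, term):
        return sorted(self.memory_index.get(term.lower(), ()))

    def flush(self):
        # Hand the batch over to durable storage and start a fresh index.
        if self.pending:
            self.flushed.append(list(self.pending))
        self.pending.clear()
        self.memory_index.clear()


shard = RealTimeShard(flush_threshold=2)
shard.add("k1", "Hello world")
print(shard.search("hello"))  # found before any flush has happened
shard.add("k2", "Hello again")
print(len(shard.flushed), "batch flushed")
```

After a flush the in-memory index is empty, so in a full system the flushed emails would need to be covered by the batch-built indexes for searches to keep finding them.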
#18
Searching
Distributed Search
- Indexes are split & stored on local disks
- Each shard is responsible for searching a range of the index
[Figure: the searching client sends a search to the shards and the RT shard; each shard searches its local index and reads emails from HDFS; Zookeeper handles notifications and updates of shard state & index information]
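The shard-per-index-range design implies a scatter-gather query: ask every shard in parallel, then merge the partial results. The sketch below shows that pattern only; the `Shard` class, the `(timestamp, key)` hit format, and the newest-first ordering are assumptions, and the shard list is passed in directly instead of being looked up in Zookeeper.

```python
from concurrent.futures import ThreadPoolExecutor


class Shard:
    """Hypothetical shard holding the index for one range of emails."""

    def __init__(self, name, index):
        self.name = name
        self.index = index  # term -> list of (timestamp, email key)

    def search(self, term):
        return self.index.get(term, [])


def distributed_search(shards, term, limit=10):
    """Query all shards in parallel and merge the hits, newest first."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda shard: shard.search(term), shards))
    merged = [hit for partial in partials for hit in partial]
    merged.sort(reverse=True)
    return merged[:limit]


shards = [
    Shard("shard-1", {"budget": [(3, "k3"), (1, "k1")]}),
    Shard("shard-2", {"budget": [(2, "k2")], "lunch": [(4, "k4")]}),
]
print(distributed_search(shards, "budget"))
```

Because each shard searches only its own range, response time depends on the slowest shard rather than on the total index size.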
#19
Parallel Downloading
Downloading Massive Search Results in Parallel
- Supports various types of communication for downloading
- The downloading MR sorts search results globally & pushes them to the targets
[Figure: the download client sends a download request to the shards, which write results to HDFS; DL Map and DL Reduce tasks perform a distributed global sort and write results directly to the targets (local, HDFS, FTP, SFTP, HTTP)]
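A distributed global sort is commonly done by range partitioning: keys are routed to reducers by value range, each reducer sorts its own bucket, and the outputs concatenated in reducer order are fully sorted. The slides do not state how Terapot's DL Map and DL Reduce tasks partition the data, so the sketch below is one plausible scheme; `choose_boundaries` and `global_sort` are hypothetical names, and boundaries are computed from all keys where a real job would use a sample.

```python
import bisect


def choose_boundaries(sample, num_reducers):
    """Pick num_reducers - 1 split points from a sample of the keys."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[len(ordered) * i // num_reducers]
            for i in range(1, num_reducers)]


def global_sort(shard_results, num_reducers=3):
    """Return one sorted list per reducer; concatenated, they are globally sorted."""
    all_keys = [key for part in shard_results for key in part]
    boundaries = choose_boundaries(all_keys, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for part in shard_results:  # map side: route each key to its range
        for key in part:
            buckets[bisect.bisect_right(boundaries, key)].append(key)
    return [sorted(bucket) for bucket in buckets]  # reduce side: local sort


shard_results = [[9, 2, 7], [4, 1, 8], [6, 3, 5]]
for reducer_id, output in enumerate(global_sort(shard_results)):
    print(reducer_id, output)
```

Since no reducer needs to see another reducer's keys, each one can push its sorted slice to the download target independently.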
#20
Email Data Analysis
Analysis Process
- ETL (Extract-Transform-Load) the archived email data into Hive table format
- Analyze the data using Hive with various analysis algorithms
- Generate the analysis result report
[Figure: ETL MR tasks load the archiving data and write results to Hive; Terapot Mining executes HiveQL, generates the report, and writes the result to MySQL]
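The transform step of the ETL can be pictured as flattening each raw message into delimited rows that a Hive table can read. The sketch below uses Python's standard `email` package; the row layout (sender, recipient, field, subject) and the tab delimiter are assumptions, since the slides do not give Terapot's table schema, and a Hive table reading this output would have to be declared with a matching field delimiter.

```python
from email import message_from_string
from email.utils import getaddresses


def email_to_rows(raw):
    """Flatten one message into (sender, recipient, field, subject) rows."""
    msg = message_from_string(raw)
    senders = getaddresses(msg.get_all("From", []))
    sender = senders[0][1].lower() if senders else ""
    subject = msg.get("Subject", "")
    rows = []
    for field in ("To", "Cc"):
        for _name, address in getaddresses(msg.get_all(field, [])):
            rows.append((sender, address.lower(), field.upper(), subject))
    return rows


def to_tsv(rows):
    """Serialize rows as tab-separated lines, one row per line."""
    return "\n".join("\t".join(row) for row in rows)


raw = (
    "From: Alice <alice@example.com>\n"
    "To: bob@example.com, Carol <carol@example.org>\n"
    "Cc: dave@example.com\n"
    "Subject: Budget\n"
    "\n"
    "Body text\n"
)
print(to_tsv(email_to_rows(raw)))
```

One row per sender-recipient pair keeps later queries simple, because link counts become plain group-by aggregations.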
#21
Types of Analysis
Social Network Analysis
Personal Network Analysis
- Computing the distance between recipients or senders based on TO, CC, FROM links
- Analyzing mail frequency statistics
Domain Analysis
- Computing the distance between recipient domains based on TO, CC, FROM
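The slides do not define "distance", so the sketch below assumes the simplest reading: the number of hops between two people in an undirected graph whose edges are sender-recipient links taken from the FROM, TO, and CC fields. `build_graph` and `distance` are hypothetical names, and breadth-first search is a choice made here rather than a method the slides name.

```python
from collections import defaultdict, deque


def build_graph(links):
    """Build an undirected graph from (sender, recipient) pairs."""
    graph = defaultdict(set)
    for sender, recipient in links:
        graph[sender].add(recipient)
        graph[recipient].add(sender)
    return graph


def distance(graph, source, target):
    """Return the hop count between two nodes, or None if unconnected."""
    if source == target:
        return 0
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == target:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


links = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("erin", "frank")]
graph = build_graph(links)
print(distance(graph, "alice", "dave"))
```

Replacing each address with its domain before calling `build_graph` turns the same code into the domain analysis variant.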
Keyword Analysis (in progress)
Keyword frequency for each user
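Since keyword analysis is marked as in progress, the following is only a guess at what a per-user keyword count could look like. The tokenizer, the tiny stop-word list, and the `keyword_frequency` function are all illustrative assumptions.

```python
import re
from collections import Counter, defaultdict

# Illustrative stop words only; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "on", "for", "in"}


def keyword_frequency(emails):
    """Count non-stop-word terms per user from (user, text) pairs."""
    frequencies = defaultdict(Counter)
    for user, text in emails:
        terms = re.findall(r"[a-z]+", text.lower())
        frequencies[user].update(t for t in terms if t not in STOPWORDS)
    return frequencies


emails = [
    ("alice", "Budget review for the budget committee"),
    ("alice", "Budget approved"),
    ("bob", "Lunch on Friday"),
]
frequencies = keyword_frequency(emails)
print(frequencies["alice"].most_common(1))
```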
#22
Terapot Performance
Experimental Environment
- 11 Intel servers: 1 master + 10 slaves
  - Xeon 2.0 GHz, 2 CPUs, 16 GB memory, 4 TB disk
- Number of emails: 270 million (index size: 270 GB)
Results
Index on local disks
Number of Emails    Number of Results    Response Time (sec)
67,217,298          12,547,398           1.4
134,434,596         25,094,796           1.4
201,651,894         37,642,194           1.4
268,869,192         50,189,592           1.4
Index in HDFS
Number of Emails    Number of Results    Response Time (sec)
67,217,298          12,547,398           2.8
134,434,596         25,094,796           2.8
201,651,894         37,642,194           3.2
268,869,192         50,189,592           3.2
#23
Demonstration
#24
www.nexr.co.kr
Hadoop & Cloud Computing Company