Research on big data

© Copyright 2011 EMC Corporation. All rights reserved.

Research on Big Data - FlexDB: A cloud-scale database engine based on Hadoop

Jidong Chen, Manager, Research Scientist, Big Data Lab

EMC Labs China, Sept. 2011


Grand Opening Announcement

EMC Labs China is formed from EMC Research China and the Advanced Technology Venture group, which were established in 2007 by the Office of the CTO.


EMC Labs China - Vision and Mission

Advanced Technology Research and Development

Big Data Lab

Cloud Infrastructure and System Lab

Cloud Platform and Applications Lab

University Collaboration

Industry Standards Office

IP Portfolio Development

Vision: Become an elite research and advanced technology institute in China, and become the model for future EMC Labs worldwide


Outline

• Big Data projects overview at EMC Labs China
• Introduction to Cloud Databases
• Data analytics in the cloud
  – Parallel DBMS
  – MapReduce
• FlexDB - A cloud-scale database engine based on Hadoop
• Summary


The Digital Universe 2009-2020

• 2009: 0.8 ZB
• 2020: 35.2 ZB (zettabytes), growing by a factor of 44

Source: IDC Digital Universe Study, sponsored by EMC, May 2010


Big Data is Changing the World

Expanding Data Sources
• Science and research
  – Gene sequences
  – LHC accelerator
  – Earth and space exploration
• Enterprise applications
  – Email, documents, files
  – Application logs
  – Transaction records
• Web 2.0 data
  – Search log / click stream
  – Twitter / Blog / SNS
  – Wiki
• Other unstructured data
  – Video/Movie
  – Graphics
  – Digital widgets

Bigger Challenges
• Scale out automatically
  – Vs. scale up manually
• More capacity and bigger pool
  – E.g., 10 PB in a single file system
• New process capability
  – Loading, analyzing, moving data
  – Intelligence
• Better performance
  – Linear vs. exponential
  – Faster
• Autonomous
  – Less human intervention
  – Lower cost


Research Scopes and Topics in Big Data

• Search and Analytics
  – Search: Entity Search, Faceted Search, Associative Search
  – Analytics: Text Analysis, Activity Modeling and Sequence Analysis, Real-time Data Analysis for Streaming, Parallel Data Mining Algorithms
• MPP Databases and Data Services
  – Parallel Database: Parallel Query Optimization, Data Partitioning and Replication, Distributed Transaction
  – In-memory Database: Cache, Recovery, Consistency
  – Database as a Service: Multi-tenant Data Management, Auto-Administration
• Hadoop/NoSQL
  – Hadoop: Single-node Failure, Performance, Real-time MapReduce Scheduler and Fault Tolerance
  – NoSQL: Key-Value Store, Document Store, Graph Data Store


Project Overview

• Hadoop/NoSQL
  – vHadoop - joint project with VMware
    • Parallel SAN file system for DISC on virtualized platform
  – Online MapReduce for Real-time Data Analytics
    • Pipelined task execution, Group task scheduling, Enhanced fault tolerance
    • Parallel Data Mining
  – FlexDB: Cloud-scale Parallel Database for OLAP
    • MapReduce integration into DBMS, Parallel query execution, Cost-based query optimization
  – Cloud-scale Parallel Database for OLTP
    • Intelligent database sharding and resharding
    • Active-active (eager) replication with group communication service
    • Multiple masters with elastic distributed coordination


Cloud Databases

• Two largest components of the data management market
  – Transactional Data Management
    • Banks, airline reservation, online e-commerce
    • ACID, write-intensive
  – Analytical Data Management
    • Business planning, decision support
    • Query-intensive
• Challenges of data management in the Cloud
  – Scalability
  – Fault Tolerance
  – Availability & Consistency
  – Transaction Management
  – Flexible Schemas


Cloud Databases

• Data analytics in the cloud
  – Parallel DBMS
  – MapReduce
• Transactional data management in the cloud
  – NoSQL Store
  – SQL Database
• Cloud data services (Database as a Service)
  – Multi-tenant data management
  – Auto-administration


Commercial Landscape: Major Players

• Amazon EC2
  – IaaS abstraction
  – Data management using S3 and SimpleDB
• Microsoft Azure
  – PaaS abstraction
  – Relational engine (SQL Azure)
• Google AppEngine
  – PaaS abstraction
  – Data management using Google Megastore


Data Analytics in the Cloud

• Scalability to large data volumes:
  – Scan 100 TB on 1 node @ 50 MB/sec = 23 days
  – Scan on 1000-node cluster = 33 minutes
  – Approach: divide and conquer (i.e., data partitioning)
• Cost-efficiency:
  – Commodity nodes (cheap, but unreliable)
  – Commodity network
  – Automatic fault-tolerance (fewer admins)
  – Easy to use (fewer programmers)
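
The scan-time figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly even partitioning with no coordination overhead; the function name `scan_seconds` is illustrative only.

```python
# Back-of-the-envelope scan times.
# Assumptions: decimal units and perfectly even data partitioning.

def scan_seconds(data_bytes: float, mb_per_sec: float, nodes: int = 1) -> float:
    """Time to scan data_bytes when it is split evenly over `nodes` nodes."""
    per_node_bytes = data_bytes / nodes
    return per_node_bytes / (mb_per_sec * 10**6)

DATA = 100 * 10**12  # 100 TB

single = scan_seconds(DATA, 50)          # one node at 50 MB/sec
cluster = scan_seconds(DATA, 50, 1000)   # 1000 nodes, each at 50 MB/sec

print(f"1 node:     {single / 86400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

Under these assumptions the single-node scan takes about 23 days and the 1000-node scan about 33 minutes, matching the slide.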


Solutions for Large-scale Data Analysis

• Parallel DBMS technologies
  – Proposed in the late eighties
  – Matured over the last two decades
  – Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises
• MapReduce
  – Pioneered by Google
  – Popularized by Yahoo! (Hadoop)


Parallel DBMS technologies

• Popularly used for more than two decades
  – Research projects: Gamma, Grace, …
  – Commercial: Teradata, Greenplum (acquired by EMC), Netezza (acquired by IBM), DATAllegro (acquired by Microsoft), Vertica (acquired by HP), Aster Data (acquired by Teradata)
• Shared-nothing node clusters
• Relational data model
• Indexing
• Familiar SQL interface
• Parallel query execution
  – Horizontal partitioning of relational tables with partitioned execution of SQL queries
• Advanced query optimization
• Well understood and studied
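
Horizontal partitioning with partitioned execution can be illustrated with a small simulation. This is a minimal sketch, not how any particular parallel DBMS is implemented: the `sales` table, the four-node cluster size, and all function names are hypothetical, and threads stand in for independent nodes. It mimics `SELECT region, SUM(amount) FROM sales GROUP BY region`.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4  # hypothetical cluster size

def partition(rows, key, num_nodes=NUM_NODES):
    """Horizontally partition rows across nodes by hashing the key column."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

def local_sum(rows):
    """Per-node partial aggregate: SUM(amount) GROUP BY region."""
    partial = defaultdict(float)
    for row in rows:
        partial[row["region"]] += row["amount"]
    return partial

def parallel_sum(nodes):
    """Run the partial aggregate on every node, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(local_sum, nodes))
    total = defaultdict(float)
    for partial in partials:
        for region, amount in partial.items():
            total[region] += amount
    return dict(total)

# Hypothetical data: 100 orders of 10.0 each, alternating between two regions.
sales = [
    {"order_id": i, "region": ("east", "west")[i % 2], "amount": 10.0}
    for i in range(100)
]
result = parallel_sum(partition(sales, "order_id"))
print(result)
```

Each node only touches its own partition, and only the small partial aggregates are merged at the end, which is why adding nodes shortens the scan.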


Greenplum: A Shared-nothing Parallel DBMS

• Greenplum’s MPP Database has extreme scalability
  – Optimized for BI and analytics
  – Fault-tolerant reliability and optimized performance using commodity CPUs, disks and networking
• Provides automatic parallelization
  – No need for manual partitioning or tuning
  – Just load and query like any database
  – Tables are automatically distributed across nodes
• Extremely scalable and I/O optimized
  – All nodes can scan and process in parallel
  – No I/O contention between segments
• Linear scalability by adding nodes
  – Each adds storage, query performance and loading performance

http://www.greenplum.com/


Greenplum Database Architecture

MPP (Massively Parallel Processing), Shared-Nothing Architecture

[Figure: Greenplum architecture diagram]
• Interfaces: SQL, MapReduce
• Master Servers: query planning & dispatch
• Network Interconnect
• Segment Servers: query processing & data storage
• External Sources: loading, streaming, etc.


Example of Parallel Query Optimization

select
    c_custkey, c_name,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    c_acctbal, n_name, c_address, c_phone, c_comment
from
    customer, orders, lineitem, nation
where
    c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate >= date '1994-08-01'
    and o_orderdate < date '1994-08-01' + interval '3 month'
    and l_returnflag = 'R'
    and c_nationkey = n_nationkey
group by
    c_custkey, c_name, c_acctbal,
    c_phone, n_name, c_address, c_comment
order by
    revenue desc

Query plan:

Gather Motion 4:1 (slice 3)
  Sort
    HashAggregate
      HashJoin
        Redistribute Motion 4:4 (slice 1)
          HashJoin
            Seq Scan on lineitem
            Hash
              Seq Scan on orders
        Hash
          HashJoin
            Seq Scan on customer
            Hash
              Broadcast Motion 4:4 (slice 2)
                Seq Scan on nation
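
The three motion operators in the plan (Broadcast, Redistribute, Gather) can be sketched as simple data movements between segments. The code below is an illustrative simulation only, not Greenplum's implementation: the four-segment count is read off the "4:4" and "4:1" labels, while the sample rows and all function names are assumptions. It reproduces the customer/nation branch of the plan, where the small `nation` table is broadcast so that each segment can join its local `customer` rows.

```python
NUM_SEGMENTS = 4  # taken from the "4:4" / "4:1" labels in the plan

def redistribute(segments, key):
    """Redistribute Motion: re-hash every row so equal keys land on one segment."""
    out = [[] for _ in range(NUM_SEGMENTS)]
    for rows in segments:
        for row in rows:
            out[hash(row[key]) % NUM_SEGMENTS].append(row)
    return out

def broadcast(segments):
    """Broadcast Motion: every segment receives a full copy of the table."""
    everything = [row for rows in segments for row in rows]
    return [list(everything) for _ in range(NUM_SEGMENTS)]

def gather(segments):
    """Gather Motion: collect all segment results on the master."""
    return [row for rows in segments for row in rows]

def local_hash_join(left, right, left_key, right_key):
    """Hash join executed independently on one segment."""
    table = {}
    for row in right:
        table.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in table.get(l[left_key], [])]

# Hypothetical sample data.
names = ["CN", "US", "DE"]
customer = [{"c_custkey": k, "c_nationkey": k % 3} for k in range(8)]
nation = [{"n_nationkey": n, "n_name": name} for n, name in enumerate(names)]

# Initial placement: each table is hash-distributed on its own key.
cust_segs = redistribute([customer], "c_custkey")
nation_segs = redistribute([nation], "n_nationkey")

# customer is not distributed on c_nationkey, so the small table is broadcast.
nation_all = broadcast(nation_segs)
joined_segs = [
    local_hash_join(c, n, "c_nationkey", "n_nationkey")
    for c, n in zip(cust_segs, nation_all)
]
result = gather(joined_segs)
print(len(result), "joined rows")
```

Broadcasting suits a small table like `nation`; for two large inputs, redistributing one side on the join key (as the plan does for the lineitem/orders result) avoids copying a whole table to every segment.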


MapReduce

• Overview
  – Large-scale, massively parallel data access platform
  – Simple data-parallel programming model to express relatively sophisticated distributed programs
  – An associated parallel and distributed implementation for commodity clusters
• Pioneered by Google
  – Processes 20 PB of data per day
• Popularized by the open-source Hadoop project
  – Used by Yahoo!, Facebook, Amazon, and the list is growing …


Programming Framework

Raw Input: <key, value> → MAP → <K1, V1>, <K2, V2>, <K3, V3> → REDUCE


MapReduce Example: WordCount

[Figure: input words (Cat, Bat, Dog, other words; size: TByte) → 4 splits → 4 map tasks → 3 combine tasks → 3 reduce tasks → output files part0, part1, part2]

Map(K, V) {
    For each word w in V
        Collect(w, 1);
}

Combine(K, V[ ]) {
    Int count = 0;
    For each v in V
        count += v;
    Collect(K, count);
}

Reduce(K, V[ ]) {
    Int count = 0;
    For each v in V
        count += v;
    Collect(K, count);
}

Output:
Cat 3
Bat 4
Dog 3
…
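
The WordCount pseudocode can be turned into a small runnable simulation. This is a single-process sketch of the map, combine, shuffle, and reduce steps, not Hadoop code: the helper names (`group_by_key`, `run_wordcount`) and the sample splits are assumptions chosen so that the totals match the slide's output.

```python
from collections import defaultdict
from itertools import chain

def map_fn(_key, text):
    """Map: emit (word, 1) for every word in the input value."""
    for word in text.split():
        yield word, 1

def combine_fn(word, counts):
    """Combine: pre-aggregate counts within one split, before the shuffle."""
    yield word, sum(counts)

def reduce_fn(word, counts):
    """Reduce: sum the partial counts for one word across all splits."""
    yield word, sum(counts)

def group_by_key(pairs):
    """Group (key, value) pairs into key -> list of values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def run_wordcount(splits):
    # Map and combine run independently per split.
    combined = []
    for split_id, text in enumerate(splits):
        mapped = group_by_key(map_fn(split_id, text))
        for word, counts in mapped.items():
            combined.extend(combine_fn(word, counts))
    # Shuffle: group the partial counts by word, then reduce each group.
    shuffled = group_by_key(combined)
    return dict(chain.from_iterable(reduce_fn(w, c) for w, c in shuffled.items()))

# Hypothetical input splits.
splits = ["Cat Bat Dog Bat", "Bat Cat Dog", "Dog Cat Bat"]
counts = run_wordcount(splits)
print(counts)
```

Combine and Reduce have the same body because summation is associative and commutative, so partial sums can be computed per split and summed again later without changing the result.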


MapReduce Implementation in Hadoop

[Figure: MapReduce execution flow in Hadoop]
• The client submits a job to the master, which assigns map and reduce tasks
• Input files: split0 to split4
• Map phase: mappers read the splits and write intermediate files to local disk
• Reduce phase: reducers read the intermediate files remotely and write the output files (file0, file1)


MapReduce Advantages

• Automatic parallelization:
  – Depending on the size of the raw input data, instantiate multiple MAP tasks
  – Similarly, depending on the number of intermediate <key, value> partitions, instantiate multiple REDUCE tasks
• Run-time:
  – Data partitioning
  – Task scheduling
  – Handling machine failures
  – Managing inter-machine communication
• Completely transparent to the programmer/analyst/user


Possible Applications

• Special-purpose programs to process large amounts of data: crawled documents, Web query logs, etc.
  – ETL and “read once” data sets
  – Complex analytics
  – Semi-structured data, key-value pairs
• At Google and others (Yahoo!, Facebook):
  – Inverted index
  – Graph structure of Web documents
  – Summaries of #pages/host, set of frequent queries, etc.
  – Ad optimization
  – Spam filtering


MapReduce vs. Parallel DBMS

Feature | Parallel DBMS | MapReduce
Schema Support | Yes | Not out of the box
Indexing | Yes | Not out of the box
Programming Model | Declarative (SQL) | Imperative (C/C++, Java, …); extensions through Pig and Hive
Optimizations (Compression, Query Optimization) | Yes | Not out of the box
Flexibility | Not out of the box | Yes
Fault Tolerance | Coarse-grained techniques | Yes


Further Analysis and Comparison

• Limitations of some current parallel databases / data warehouses
  – Often use expensive/specialized hardware
  – Difficult to scale to more than 100 nodes
  – Difficult to parallelize data mining applications
    • MPI …
  – Difficult to deal with unstructured data
  – Fault tolerance
    • If one node fails, the whole query is restarted
  – Expensive
• Disadvantages of some MapReduce-based solutions (Hive)
  – A sub-optimal brute-force implementation: no indexing, no JOINs
    • E.g., find the employees whose salary is $10,000
  – Row-based storage, Updates?
  – Not SQL/BI tool compatible
  – No support for schema
  – Non-declarative programming model


MapReduce Integration in DBMS Context

• FlexDB - A Cloud-scale Parallel Database Engine based on Hadoop MapReduce (A Research Project)
  – An architectural hybrid of MapReduce and DBMS technologies
  – Use the fault tolerance and scalability of the MapReduce framework
  – Leverage advanced data processing techniques (e.g., query optimization) of an RDBMS for high performance
  – Expose a declarative interface to the user
• Goal: leverage the best of both worlds


FlexDB Architecture


[Figure: FlexDB architecture]
• FlexDB Master: Query Parser, Query Optimizer, Job Generator, Job Executor, Catalog manager
• Example query: SELECT * FROM Account WHERE balance > 30, split by the master into subqueries (… WHERE balance > 30)
• MapReduce Framework: Mappers and Reducers running the generated Jobs
• Database nodes: each holds part of the rows of the Account table
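
The idea of sending the same subquery to each node-local database and merging the results can be sketched as follows. This is only an illustration of the general pattern under stated assumptions, not FlexDB's actual job generator: in-memory SQLite databases stand in for the node-local DBMS, the `Account` columns (`id`, `balance`) and the sample balances are invented, and plain function calls replace real Hadoop map and reduce tasks.

```python
import sqlite3

def make_partition(rows):
    """One in-memory SQLite database standing in for a node-local DBMS."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Account (id INTEGER PRIMARY KEY, balance REAL)")
    db.executemany("INSERT INTO Account VALUES (?, ?)", rows)
    return db

def map_task(db, subquery):
    """Map side: push the subquery down to the local database."""
    return db.execute(subquery).fetchall()

def reduce_task(partials):
    """Reduce side: merge the per-partition results into one result set."""
    merged = [row for part in partials for row in part]
    return sorted(merged)

# Hypothetical data: eight accounts with balances 10, 20, ..., 80,
# spread round-robin over three partitions.
accounts = [(i, float(i * 10)) for i in range(1, 9)]
partitions = [make_partition(accounts[i::3]) for i in range(3)]

SUBQUERY = "SELECT id, balance FROM Account WHERE balance > 30"
result = reduce_task([map_task(db, SUBQUERY) for db in partitions])
print(result)
```

Because the filter runs inside each local database, only qualifying rows cross the network to the reduce side, which is the benefit of combining a DBMS with the MapReduce framework.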


Comparison with other systems

Feature | FlexDB | Hive | HadoopDB | Traditional parallel database
Query Language | SQL | HQL | SQL (joins not currently supported) | SQL
Storage | Postgres/Greenplum | HDFS | JDBC compatible | Native OS files
Optimizer | Cost based (DB/MR paths) | Simple rule based | Simple rule based | Cost based
Physical storage organization | Column/Row based | Row based | Currently Row based | Column/Row based
Implementation | FlexDB Master + Hadoop + DB | Hive + Hadoop | Hive (rev) + Hadoop + DB | Native
Efficiency | High | Low | Middle | Very High
Scale | Large | Large | Large | Middle
Cost | Low | Low | Low | High


Summary

• New in cloud computing
  – Elasticity/Scalability
  – Resource sharing (multi-tenancy)
  – Focus on failure
• Data analytics in the cloud: different solutions suit different workloads
  – Parallel DBMSs excel at efficient querying of large data sets
  – MR-style systems excel at complex analytics and ETL tasks
• Combine MapReduce with a shared-nothing DBMS to produce a system that better fits the cloud computing market


Acknowledgements

• Some slides are adapted from the following references:
  – Divy Agrawal, Sudipto Das, and Amr El Abbadi, “Big Data and Cloud Computing: New Wine or just New Bottles?”, VLDB 2010 Tutorial
  – Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin, “MapReduce and Parallel DBMSs: Friends or Foes?”, Communications of the ACM 2010


EMC Labs China

Dr. Tao Bo

Director, EMC Labs China

Blog: http://blog.sina.com.cn/emclabschina

Weibo: http://weibo.com/emclabschina


THANK YOU
