Research on big data

Research on Big Data - FlexDB: A cloud-scale database engine based on Hadoop
Jidong Chen ([email protected]), Manager, Research Scientist, Big Data Lab, EMC Labs China
Sept. 2011
© Copyright 2011 EMC Corporation. All rights reserved.

Description

• Big Data projects overview at EMC Labs China
• Introduction to Cloud Databases
• Data analytics in the cloud
– Parallel DBMS
– MapReduce
• FlexDB - A cloud-scale database engine based on Hadoop

Transcript of Research on big data

Page 1: Research on big data

Research on Big Data - FlexDB: A cloud-scale database engine based on Hadoop

Jidong Chen ([email protected])
Manager, Research Scientist, Big Data Lab

EMC Labs China
Sept. 2011

Page 2: Research on big data

Grand Opening Announcement

EMC Labs China was formed from EMC Research China and the Advanced Technology Venture group, both of which were established in 2007 by the Office of the CTO.

Page 3: Research on big data

EMC Labs China - Vision and Mission

Advanced Technology Research and Development

Big Data Lab

Cloud Infrastructure and System Lab

Cloud Platform and Applications Lab

University Collaboration

Industry Standards Office

IP Portfolio Development

Vision

• Become an elite research and advanced technology institute in China
• Become the model for future EMC Labs worldwide

Page 4: Research on big data

Outline

• Big Data projects overview at EMC Labs China
• Introduction to Cloud Databases
• Data analytics in the cloud
– Parallel DBMS
– MapReduce
• FlexDB - A cloud-scale database engine based on Hadoop
• Summary

Page 5: Research on big data

The Digital Universe 2009-2020

[Figure: the digital universe grows from 0.8 zettabytes (ZB) in 2009 to 35.2 ZB in 2020, a factor of 44. Source: IDC Digital Universe Study, sponsored by EMC, May 2010]

Page 6: Research on big data

Big Data is Changing the World

Expanding Data Sources

• Science and research
– Gene sequences
– LHC accelerator
– Earth and space exploration
• Enterprise applications
– Email, documents, files
– Application logs
– Transaction records
• Web 2.0 data
– Search logs / click streams
– Twitter / blogs / SNS
– Wikis
• Other unstructured data
– Video / movies
– Graphics
– Digital widgets

Bigger Challenges

• Scale out automatically
– Vs. scale up manually
• More capacity and a bigger pool
– E.g., 10 PB in a single file system
• New processing capability
– Loading, analyzing, moving data
– Intelligence
• Better performance
– Linear vs. exponential
– Faster
• Autonomous
– Less human intervention
– Lower cost

Page 7: Research on big data

Research Scopes and Topics in Big Data

• Search and Analytics
– Search: Entity Search, Faceted Search, Associative Search
– Analytics: Text Analysis, Activity Modeling and Sequence Analysis, Real-time Data Analysis for Streaming, Parallel Data Mining Algorithms
• MPP Databases and Data Services
– Parallel Database: Parallel Query Optimization, Data Partitioning and Replication, Distributed Transactions
– In-memory Database: Cache, Recovery, Consistency
– Database as a Service: Multi-tenant Data Management, Auto-Administration
• Hadoop/NoSQL
– Hadoop: Single-node Failure, Performance, Real-time MapReduce Scheduler and Fault Tolerance
– NoSQL: Key-Value Store, Document Store, Graph Data Store

Page 8: Research on big data

Project Overview

• Hadoop/NoSQL
– vHadoop: joint project with VMware
  • Parallel SAN file system for DISC on a virtualized platform
– Online MapReduce for Real-time Data Analytics
  • Pipelined task execution, group task scheduling, enhanced fault tolerance
  • Parallel data mining
– FlexDB: Cloud-scale Parallel Database for OLAP
  • MapReduce integration into DBMS, parallel query execution, cost-based query optimization
– Cloud-scale Parallel Database for OLTP
  • Intelligent database sharding and resharding
  • Active-active (eager) replication with group communication service
  • Multiple masters with elastic distributed coordination

Page 9: Research on big data

Cloud Databases

• Two largest components of the data management market
– Transactional Data Management
  • Banks, airline reservations, online e-commerce
  • ACID, write-intensive
– Analytical Data Management
  • Business planning, decision support
  • Query-intensive
• Challenges of data management in the cloud
– Scalability
– Fault tolerance
– Availability and consistency
– Transaction management
– Flexible schemas

Page 10: Research on big data

Cloud Databases

• Data analytics in the cloud
– Parallel DBMS
– MapReduce
• Transactional data management in the cloud
– NoSQL Store
– SQL Database
• Cloud data services (Database as a Service)
– Multi-tenant data management
– Auto-administration

Page 11: Research on big data

Commercial Landscape: Major Players

• Amazon EC2
– IaaS abstraction
– Data management using S3 and SimpleDB
• Microsoft Azure
– PaaS abstraction
– Relational engine (SQL Azure)
• Google AppEngine
– PaaS abstraction
– Data management using Google Megastore

Page 12: Research on big data

Data Analytics in the Cloud

• Scalability to large data volumes:
– Scan 100 TB on 1 node @ 50 MB/sec = 23 days
– Scan on a 1000-node cluster = 33 minutes
– Divide-and-conquer (i.e., data partitioning)
• Cost-efficiency:
– Commodity nodes (cheap, but unreliable)
– Commodity network
– Automatic fault tolerance (fewer admins)
– Easy to use (fewer programmers)
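The scan-time figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), an even split of the data across nodes, and no coordination overhead; the function name `scan_seconds` is illustrative only.

```python
# Back-of-the-envelope scan times, assuming decimal units and perfect parallelism.
TB = 10**12
MB = 10**6


def scan_seconds(data_bytes: int, nodes: int, mb_per_sec_per_node: float) -> float:
    """Time to scan data split evenly across `nodes`, each reading in parallel."""
    bytes_per_node = data_bytes / nodes
    return bytes_per_node / (mb_per_sec_per_node * MB)


single = scan_seconds(100 * TB, nodes=1, mb_per_sec_per_node=50)
cluster = scan_seconds(100 * TB, nodes=1000, mb_per_sec_per_node=50)

print(f"1 node:     {single / 86400:.1f} days")    # about 23 days
print(f"1000 nodes: {cluster / 60:.1f} minutes")   # about 33 minutes
```

The 1000x speed-up only holds if the data is already partitioned so that every node reads its own share, which is the divide-and-conquer point made on the slide.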

Page 13: Research on big data

Solutions for Large-scale Data Analysis

• Parallel DBMS technologies
– Proposed in the late eighties
– Matured over the last two decades
– Multi-billion dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises
• MapReduce
– Pioneered by Google
– Popularized by Yahoo! (Hadoop)

Page 14: Research on big data

Parallel DBMS technologies

• Popularly used for more than two decades
– Research projects: Gamma, Grace, …
– Commercial: Teradata, Greenplum (acquired by EMC), Netezza (acquired by IBM), DATAllegro (acquired by Microsoft), Vertica (acquired by HP), Aster Data (acquired by Teradata)
• Shared-nothing clusters of nodes
• Relational data model
• Indexing
• Familiar SQL interface
• Parallel query execution
– Horizontal partitioning of relational tables with partitioned execution of SQL queries
• Advanced query optimization
• Well understood and studied
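To make "horizontal partitioning with partitioned execution" concrete, here is a minimal sketch. It is not how any particular product works: the four-node cluster, the `orders` rows and all function names are assumptions made for illustration. Rows are hash-partitioned on the grouping column, each partition computes its own partial `SUM ... GROUP BY`, and the partial results are merged.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_NODES = 4  # hypothetical cluster size


def node_for(key: str) -> int:
    """Deterministic hash partitioning: the same key always lands on the same node."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


def partition(rows):
    """Horizontally partition rows across nodes by the 'customer' column."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row["customer"])].append(row)
    return nodes


def local_sum(rows):
    """Per-node work: SELECT customer, SUM(amount) ... GROUP BY customer."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


def parallel_sum(rows):
    nodes = partition(rows)
    with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
        partials = list(pool.map(local_sum, nodes.values()))
    merged = {}
    for partial in partials:
        # Partitioning on the grouping key means no customer spans two nodes,
        # so the partial results can simply be unioned.
        merged.update(partial)
    return merged


orders = [
    {"customer": "alice", "amount": 10},
    {"customer": "bob", "amount": 7},
    {"customer": "alice", "amount": 20},
    {"customer": "carol", "amount": 12},
]
print(parallel_sum(orders))
```

Partitioning on the grouping key is a deliberate choice here: it lets each node finish its groups locally. With a different partitioning key, the merge step would have to add up partial sums for the same customer.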

Page 15: Research on big data

Greenplum: A Shared-nothing Parallel DBMS

• Greenplum's MPP Database has extreme scalability
– Optimized for BI and analytics
– Fault-tolerant reliability and optimized performance using commodity CPUs, disks and networking
• Provides automatic parallelization
– No need for manual partitioning or tuning
– Just load and query like any database
– Tables are automatically distributed across nodes
• Extremely scalable and I/O optimized
– All nodes can scan and process in parallel
– No I/O contention between segments
• Linear scalability by adding nodes
– Each node adds storage, query performance and loading performance

[Figure labels: Interconnect, Loading]

Page 16: Research on big data

Greenplum Database Architecture

MPP (Massively Parallel Processing) Shared-Nothing Architecture

[Figure: architecture diagram. Labels: SQL and MapReduce (inputs); Master Servers (query planning & dispatch); Network Interconnect; Segment Servers (query processing & data storage); External Sources (loading, streaming, etc.)]

Page 17: Research on big data

Example of Parallel Query Optimization

    select
        c_custkey, c_name,
        sum(l_extendedprice * (1 - l_discount)) as revenue,
        c_acctbal, n_name, c_address, c_phone, c_comment
    from
        customer, orders, lineitem, nation
    where
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and o_orderdate >= date '1994-08-01'
        and o_orderdate < date '1994-08-01' + interval '3 month'
        and l_returnflag = 'R'
        and c_nationkey = n_nationkey
    group by
        c_custkey, c_name, c_acctbal,
        c_phone, n_name, c_address, c_comment
    order by
        revenue desc

Query plan:

    Gather Motion 4:1 (slice 3)
      Sort
        HashAggregate
          HashJoin
            Redistribute Motion 4:4 (slice 1)
              HashJoin
                Seq Scan on lineitem
                Hash
                  Seq Scan on orders
            Hash
              HashJoin
                Seq Scan on customer
                Hash
                  Broadcast Motion 4:4 (slice 2)
                    Seq Scan on nation
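The three motion operators in the plan can be illustrated with a toy model. This is a sketch of the general idea, not Greenplum's implementation: segments are plain Python lists, and the sample `customers` and `nations` rows, the hash function and all helper names are assumptions. A Broadcast Motion copies a small table to every segment, a Redistribute Motion rehashes rows on a key, and a Gather Motion collects all segments' output in one place.

```python
import zlib

NUM_SEGMENTS = 4  # matches the "4:4" and "4:1" motions in the plan


def segment_of(key) -> int:
    """Deterministic hash used to place a row on a segment."""
    return zlib.crc32(str(key).encode("utf-8")) % NUM_SEGMENTS


def redistribute(segments, key_fn):
    """Redistribute Motion N:N: each row moves to the segment chosen by hashing its key."""
    out = [[] for _ in range(NUM_SEGMENTS)]
    for rows in segments:
        for row in rows:
            out[segment_of(key_fn(row))].append(row)
    return out


def broadcast(segments):
    """Broadcast Motion N:N: every segment receives a full copy of the table."""
    all_rows = [row for rows in segments for row in rows]
    return [list(all_rows) for _ in range(NUM_SEGMENTS)]


def gather(segments):
    """Gather Motion N:1: collect every segment's rows on a single node."""
    return [row for rows in segments for row in rows]


def hash_join(left, right, left_key, right_key):
    """Local hash join: build a hash table on `right`, probe it with `left`."""
    table = {}
    for r in right:
        table.setdefault(right_key(r), []).append(r)
    return [(l, r) for l in left for r in table.get(left_key(l), [])]


# (c_custkey, c_nationkey) rows, spread arbitrarily over four segments
customers = [[(1, 10), (2, 20)], [(3, 10)], [], [(4, 30)]]
# (n_nationkey, n_name) rows
nations = [[(10, "CHINA")], [(20, "JAPAN")], [(30, "INDIA")], []]

# customer JOIN nation: broadcast the small table, then join locally on each segment
nation_copies = broadcast(nations)
joined = gather(
    [
        hash_join(customers[i], nation_copies[i], lambda c: c[1], lambda n: n[0])
        for i in range(NUM_SEGMENTS)
    ]
)
result = sorted((c[0], n[1]) for c, n in joined)
print(result)

# Rehash customers on c_custkey, as a redistribute motion would before a join on that key
by_custkey = redistribute(customers, key_fn=lambda c: c[0])
```

Broadcasting suits a small table such as `nation`, because copying it everywhere is cheaper than moving the large side of the join. Redistribution suits two large inputs that must be co-located on the join key.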

Page 18: Research on big data

MapReduce

• Overview
– Large-scale, massively parallel data access platform
– Simple data-parallel programming model to express relatively sophisticated distributed programs
– An associated parallel and distributed implementation for commodity clusters
• Pioneered by Google
– Processes 20 PB of data per day
• Popularized by the open-source Hadoop project
– Used by Yahoo!, Facebook, Amazon, and the list is growing …

Page 19: Research on big data

Programming Framework

[Figure: Raw input <key, value> → MAP → intermediate pairs <K1, V1>, <K2, V2>, <K3, V3> → REDUCE]

Page 20: Research on big data

MapReduce Example: WordCount

[Figure: WordCount data flow. Input words (Cat, Bat, Dog, other words; size: TByte) → split (x4) → map (x4) → combine (x3) → reduce (x3) → output files part0, part1, part2]

    Map(K, V) {
        For each word w in V
            Collect(w, 1);
    }

    Combine(K, V[ ]) {
        Int count = 0;
        For each v in V
            count += v;
        Collect(K, count);
    }

    Reduce(K, V[ ]) {
        Int count = 0;
        For each v in V
            count += v;
        Collect(K, count);
    }

Output: Cat 3, Bat 4, Dog 3, …
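The WordCount pseudocode can be run as a single-process simulation. The sketch below is not Hadoop code: `run_wordcount`, `group` and the three sample splits are assumptions chosen so that the totals match the slide's example output. It applies the map and combine steps per split, then groups the combined pairs by word and reduces them.

```python
from collections import defaultdict


def map_fn(_key, text):
    """Map(K, V): emit (word, 1) for each word in the input value."""
    for word in text.split():
        yield word, 1


def combine_fn(word, counts):
    """Combine(K, V[]): pre-aggregate counts produced by a single map task."""
    yield word, sum(counts)


def reduce_fn(word, counts):
    """Reduce(K, V[]): add up the combined counts from all map tasks."""
    yield word, sum(counts)


def group(pairs):
    """Group (key, value) pairs by key, standing in for the shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_wordcount(splits):
    combined = []
    for split_id, text in enumerate(splits):      # one map task per split
        mapped = map_fn(split_id, text)
        for word, counts in group(mapped).items():
            combined.extend(combine_fn(word, counts))
    result = {}
    for word, counts in group(combined).items():  # one reduce call per word
        for key, total in reduce_fn(word, counts):
            result[key] = total
    return result


splits = ["Cat Bat Dog Cat", "Bat Dog Bat", "Cat Dog Bat"]
counts = run_wordcount(splits)
print(counts)
```

The combiner has the same body as the reducer because addition is associative and commutative. Running it on each map task's output cuts the number of pairs sent across the network without changing the result.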

Page 21: Research on big data

MapReduce Implementation in Hadoop

[Figure: MapReduce execution in Hadoop. The client submits a job to the master, which assigns map and reduce tasks. Input files (split0 to split4) → map phase (mappers read the splits and write intermediate files to local disk) → reduce phase (reducers read the intermediate files remotely) → output files (file0, file1)]

Page 22: Research on big data

MapReduce Advantages

• Automatic parallelization:
– Depending on the size of the raw input data, instantiate multiple MAP tasks
– Similarly, depending on the number of intermediate <key, value> partitions, instantiate multiple REDUCE tasks
• Run-time:
– Data partitioning
– Task scheduling
– Handling machine failures
– Managing inter-machine communication
• Completely transparent to the programmer/analyst/user
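A rough sketch of how the number of tasks follows from the data, under stated assumptions: one map task per fixed-size input split, and intermediate keys assigned to reduce partitions by a hash modulo the number of reducers. The 64 MiB split size, the CRC32 hash and the function names are illustrative choices, not values taken from the slides.

```python
import math
import zlib

MIB = 2**20
TIB = 2**40


def num_map_tasks(input_bytes: int, split_bytes: int) -> int:
    """One map task per input split; at least one task even for tiny inputs."""
    return max(1, math.ceil(input_bytes / split_bytes))


def reduce_partition(key: str, num_reducers: int) -> int:
    """Assign an intermediate key to a reduce partition by hashing it."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


maps_for_1tib = num_map_tasks(1 * TIB, split_bytes=64 * MIB)
print(maps_for_1tib)  # 16384 map tasks for 1 TiB of input with 64 MiB splits

for word in ["Cat", "Bat", "Dog"]:
    print(word, "-> partition", reduce_partition(word, num_reducers=3))
```

Because the partition depends only on the key, every value for a given key reaches the same reduce task, whichever map task produced it.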

Page 23: Research on big data

Possible Applications

• Special-purpose programs to process large amounts of data: crawled documents, Web query logs, etc.
– ETL and "read once" data sets
– Complex analytics
– Semi-structured data, key-value pairs
• At Google and others (Yahoo!, Facebook):
– Inverted index
– Graph structure of Web documents
– Summaries of #pages/host, set of frequent queries, etc.
– Ad optimization
– Spam filtering

Page 24: Research on big data

Map Reduce vs Parallel DBMS

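Aspect | Parallel DBMS | MapReduce
Schema support | (mark not captured in transcript) | Not out of the box
Indexing | (mark not captured in transcript) | Not out of the box
Programming model | Declarative (SQL) | Imperative (C/C++, Java, …); extensions through Pig and Hive
Optimizations (compression, query optimization) | (mark not captured in transcript) | Not out of the box
Flexibility | Not out of the box | (mark not captured in transcript)
Fault tolerance | Coarse-grained techniques | (mark not captured in transcript)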

Page 25: Research on big data

Further Analysis and Comparison

• Limitations of some current parallel databases / data warehouses
– Often use expensive/specialized hardware
– Difficult to scale to more than 100 nodes
– Difficult to parallelize data mining applications
  • MPI …
– Difficult to deal with unstructured data
– Fault tolerance
  • If one node fails, the whole query restarts
– Expensive
• Disadvantages of some MapReduce-based solutions (Hive)
– A sub-optimal brute-force implementation: no indexing, no JOINs
  • E.g., find everyone whose salary is $10,000
– Row-based storage, updates?
– Not SQL/BI tool compatible
– No support for schema
– Non-declarative programming model
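The "salary is $10,000" example contrasts a brute-force scan with an index lookup. A minimal sketch, assuming a small in-memory `employees` table and hypothetical helper names: the full scan touches every row on each query, while the index is built once and each lookup then costs one dictionary access.

```python
from collections import defaultdict

# Hypothetical sample data, for illustration only
employees = [
    {"name": "Ann", "salary": 10_000},
    {"name": "Bo", "salary": 12_500},
    {"name": "Chen", "salary": 10_000},
    {"name": "Dee", "salary": 8_000},
]


def full_scan(rows, salary):
    """Brute force: examine every row, as a MapReduce job without indexes must."""
    return [row["name"] for row in rows if row["salary"] == salary]


def build_index(rows):
    """One-time cost: map each salary value to the names that have it."""
    index = defaultdict(list)
    for row in rows:
        index[row["salary"]].append(row["name"])
    return index


def index_lookup(index, salary):
    """Indexed access: a single hash lookup, independent of table size."""
    return index.get(salary, [])


salary_index = build_index(employees)
print(full_scan(employees, 10_000))
print(index_lookup(salary_index, 10_000))
```

Both calls return the same names. The difference is the work done per query, which is why the lack of indexing matters for selective lookups on large tables.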

Page 26: Research on big data

MapReduce Integration in DBMS Context

• FlexDB: a cloud-scale parallel database engine based on Hadoop MapReduce (a research project)
– An architectural hybrid of MapReduce and DBMS technologies
– Uses the fault tolerance and scalability of the MapReduce framework
– Leverages advanced data processing techniques (e.g., query optimization) of an RDBMS for high performance
– Exposes a declarative interface to the user
• Goal: get the best of both worlds

Page 27: Research on big data

FlexDB Architecture

Page 28: Research on big data

[Figure: FlexDB architecture. Labels: FlexDB Master (Catalog manager, Query Parser, Query Optimizer, Job Generator, Job Executor); MapReduce Framework (Mapper, Reducer, Jobs); several Database nodes holding the Account table. The example query SELECT * FROM Account WHERE balance > 30 appears at the master and again as a subquery at the database nodes.]
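The figure suggests that the same subquery runs against each database node and that the MapReduce layer combines the results. The sketch below shows that general pattern only and is not FlexDB's code: in-memory SQLite databases stand in for the Postgres/Greenplum nodes, and `make_node`, `map_task`, `reduce_task` and the sample rows are assumptions.

```python
import sqlite3

# The subquery pushed down to every node (explicit columns instead of SELECT *)
SUBQUERY = "SELECT id, balance FROM Account WHERE balance > ?"


def make_node(rows):
    """Create one database node holding a horizontal partition of Account."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Account (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO Account VALUES (?, ?)", rows)
    return conn


def map_task(conn, threshold):
    """Map side: run the subquery inside the local database engine."""
    return conn.execute(SUBQUERY, (threshold,)).fetchall()


def reduce_task(partials):
    """Reduce side: merge the per-node result sets into one ordered result."""
    return sorted(row for part in partials for row in part)


nodes = [
    make_node([(1, 10), (2, 45)]),
    make_node([(3, 31), (4, 30)]),
    make_node([(5, 99)]),
]
result = reduce_task(map_task(node, 30) for node in nodes)
print(result)

for node in nodes:
    node.close()
```

Running the filter inside each database lets the local engine use its own indexes and optimizer, so only matching rows reach the MapReduce layer.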

Page 29: Research on big data

Comparison with other systems

Aspect | FlexDB | Hive | HadoopDB | Traditional parallel database
Query language | SQL | HQL | SQL (joins not currently supported) | SQL
Storage | Postgres/Greenplum | HDFS | JDBC compatible | Native OS files
Optimizer | Cost based (DB/MR paths) | Simple rule based | Simple rule based | Cost based
Physical storage organization | Column/row based | Row based | Currently row based | Column/row based
Implementation | FlexDB Master + Hadoop + DB | Hive + Hadoop | Hive (rev) + Hadoop + DB | Native
Efficiency | High | Low | Middle | Very high
Scale | Large | Large | Large | Middle
Cost | Low | Low | Low | High

Page 30: Research on big data

Summary

• New in cloud computing
– Elasticity/scalability
– Resource sharing (multi-tenancy)
– Focus on failure
• Data analytics in the cloud: different solutions suit different workloads
– Parallel DBMSs excel at efficient querying of large data sets
– MR-style systems excel at complex analytics and ETL tasks
• Combine MapReduce with a shared-nothing DBMS to produce a system that better fits the cloud computing market

Page 31: Research on big data

Acknowledgements

• Some slides are adapted from the following references:
– Divy Agrawal, Sudipto Das, and Amr El Abbadi, "Big Data and Cloud Computing: New Wine or just New Bottles?", VLDB 2010 Tutorial
– Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin, "MapReduce and Parallel DBMSs: Friends or Foes?", Communications of the ACM, 2010

Page 32: Research on big data

EMC Labs China

Dr. Tao Bo

Director, EMC Labs China

Blog: http://blog.sina.com.cn/emclabschina

Weibo: http://weibo.com/emclabschina

Page 33: Research on big data

THANK YOU