Big data (overview) - (MOSG)

Post on 13-Apr-2017

Big data - Overview -

2016/03/04 Mulodo Vietnam Co., Ltd.

“Big data”

What is “Big data”?

Types:
- Science: LHC (Large Hadron Collider)
- Medical: gene analysis
- Market (IT?): business use

History of Data processing

50's
- “BI: Business Intelligence” (1958)

80's
- “DSS: Decision support system” (80's)
- “SQL86” (1986)
- “Knowledge Discovery in Databases” (1989)
- “BI (Redefinition)” (1989)

90's
- “Data Warehouse” (1990)
- “OLAP: online analytical processing” (1993)
- “Improvement of computing power” (90's)
- “Price reduction of storage” (90's)
- “Data Mining” (1996)

2000's
- “Spread of the Internet” (00's)
- ‘Google: Big data stack 1.0’ (00's)
- “MapReduce framework” (2004)
- “Independence of Hadoop project from Nutch” (2006)
- “Amazon: S3” (2006)
- “Explosive prosperity of EC” (00's)

2010's
- “Big data” in ‘The Economist (UK)’ (2010)
- “Google: BigQuery” (2010)
- “fluentd” (2011)
- “Amazon: Redshift” (2012)
- “DMP: data management platform” (10's)
- “Google: Big data stack 2.0-3.0” (10's)
- “Apache Crunch, Impala, Presto, ...” (10's)

Let's look back on the history of Big data
(especially storage and query engines)

[Timeline figure, axis: 80's, 90's, 00's, 10's]

[Timeline so far: SQL(86)]

SQL(86)
- Easy to use, structured/ruled.
- Independent from storage.

[Timeline so far: SQL(86), MapReduce, big data stack/GFS]

MapReduce, big data stack/GFS
- Processes HUGE data in a batch-like way (for huge logs).
- But proprietary.

Too huge to handle on a usual RDBMS.

[Timeline so far: SQL(86), MapReduce, big data stack/GFS, Hadoop, HBase]

Hadoop, HBase
- Open source products!

We need source. We love freedom.

[Timeline so far: SQL(86), MapReduce, big data stack/GFS, Hadoop, HBase, Hive, Pig]

Hive, Pig
- Easy to use.

E-commerce requires huge data analysis.
M/R is too heavy to use...

[Timeline so far: SQL(86), MapReduce, big data stack/GFS, Hadoop, HBase, Hive, Pig]

- Hive: SQL -> (M/R) -> Result
- Pig: original language <=> (M/R)

[Timeline so far: SQL(86), MapReduce, big data stack/GFS, big data stack/CFS, Hadoop, HBase, Hive, Pig, Dremel, BigQuery]

Dremel, BigQuery
- Google announced Dremel for interactive analysis of huge data.

We want to analyze huge data interactively.

[Timeline so far: SQL(86), MapReduce, big data stack/GFS, big data stack/CFS, Hadoop, HBase, Hive, Pig, Dremel, BigQuery]

Dremel
1. Divides the SQL query across shards.
2. Processes them in parallel.

It is not a wrapper around M/R; it processes SQL in a massively parallel way (i.e. a full scan for each query across thousands of servers, without an index).
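
The two steps above can be sketched as a scatter-gather aggregation: the same sub-query is run as a full scan on every shard in parallel, and the partial results are merged. This is only a toy illustration of the idea under assumed data, not Dremel's actual design; `SHARDS`, `scan_shard` and `query_sum_count` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data set, pre-split into shards (one list per "server").
SHARDS = [
    [("vn", 3), ("jp", 5)],
    [("vn", 7), ("us", 1)],
    [("jp", 2), ("vn", 4)],
]


def scan_shard(shard, country):
    """Full scan of one shard without an index: partial SUM and COUNT."""
    total = 0
    count = 0
    for row_country, value in shard:
        if row_country == country:
            total += value
            count += 1
    return total, count


def query_sum_count(shards, country, max_workers=4):
    # Step 1: the same sub-query is sent to every shard.
    # Step 2: shards are scanned in parallel, then partial results are merged.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda s: scan_shard(s, country), shards))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count


if __name__ == "__main__":
    # Roughly: SELECT SUM(value), COUNT(*) FROM t WHERE country = 'vn'
    print(query_sum_count(SHARDS, "vn"))
```

SUM and COUNT merge cleanly because partial results can simply be added; an average would be derived from the merged total and count rather than averaged per shard.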

[Timeline so far: SQL(86), MapReduce, big data stack/GFS, big data stack/CFS, Hadoop, HBase, Hive, Pig, Dremel, BigQuery, Presto, Impala]

Presto, Impala
- Open source products!

We need source. We love freedom.

[Timeline so far: SQL(86), MapReduce, big data stack/GFS, big data stack/CFS, Hadoop, HBase, Hive, Pig, Dremel, BigQuery, Presto, Impala]

Add social circumstances to this figure.

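[Timeline figure with social circumstances added, axis: 80's, 90's, 00's, 10's]
- Storage and query engines: SQL(86), MapReduce, big data stack/GFS, big data stack/CFS, Hadoop, HDFS, HBase, Hive, Pig, Dremel, BigQuery, Presto, Impala, S3, Redshift
- Also shown: BI, DSS, DWH, Data Mining, DMP
- Social circumstances: improvement of computing power, price reduction of storage, spread of the Internet, explosive prosperity of EC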

Many requests, many solutions...

But you can work out which solution is better for your project. (I hope)

How to use Big data

A) How to aggregate data?
- Huge amount of data
- Too high frequency data

B) How to maintain data?
- Data will increase...
- Query engine cost, storage cost
- Data check cost

C) How to analyze data? (what for?)
- UI / UX
- Understanding of business requirements

How to aggregate data

<Libevent shock>
- parallel -> event driven
  * similar to “parallel -> USB”

Fluentd
- Async
- (Pseudo) realtime <-> periodic batch

Other
- logstash
- Lambda and Kinesis (AWS)
- ...
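
As a rough illustration of event-driven, asynchronous and pseudo-realtime collection, the sketch below buffers incoming events and flushes them in small chunks from a single event loop. It uses only Python's asyncio and is not Fluentd's API; `BufferedForwarder`, `producer` and the in-memory `flushed` list (standing in for remote storage) are hypothetical.

```python
import asyncio


class BufferedForwarder:
    """Toy event-driven log collector: buffer events, flush them in chunks."""

    def __init__(self, flush_size=3):
        self.flush_size = flush_size
        self.buffer = []
        self.flushed = []  # stands in for remote storage in this sketch

    async def emit(self, event):
        # Appending is cheap; a write happens only once per full chunk.
        self.buffer.append(event)
        if len(self.buffer) >= self.flush_size:
            await self.flush()

    async def flush(self):
        if not self.buffer:
            return
        # Swap the buffer first so new events go to a fresh chunk.
        chunk, self.buffer = self.buffer, []
        await asyncio.sleep(0)  # placeholder for a non-blocking network write
        self.flushed.append(chunk)


async def producer(forwarder, name, n):
    for i in range(n):
        await forwarder.emit(f"{name}-{i}")
        await asyncio.sleep(0)  # yield so other producers can run


async def main():
    forwarder = BufferedForwarder(flush_size=3)
    # Two sources share one event loop instead of one thread per source.
    await asyncio.gather(
        producer(forwarder, "web", 4),
        producer(forwarder, "app", 4),
    )
    await forwarder.flush()  # final flush of the partial chunk
    return forwarder.flushed


if __name__ == "__main__":
    for chunk in asyncio.run(main()):
        print(chunk)
```

A smaller `flush_size` moves the behaviour toward realtime; a larger one moves it toward a periodic batch.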

How to analyze data

UI / UX <solution set for log monitoring>
* ELK: Logstash + Elasticsearch + Kibana
* Fluentd + Norikra + GrowthForecast

Next:
* Trying some storage
* Trying to build system design
* Diving into some solutions