Sharing Business Big Data v3, Part 1

Accelerating Startup Growth with Big Data

Dwika Sudrajat, IT Consultant

Florida, Hong Kong & Jakarta. November 23rd, 2016

▐ email: [email protected]
▐ Florida: +1-407-2502812
▐ Hong Kong: +852-54152971
▐ Jakarta: +62-8161108571
▐ FB: dwika.sudrajat
▐ TW: @dwikasudrajat
▐ managingconsultant.blogspot.com
▐ dwikasudrajat.blogspot.com
▐ dwikasudrajat.wordpress.com


Job Opportunities

Startup Team at Work

Startup Team Creating Mobile Apps

What technologies do you think they are running on?

Conventional Startup Development Team

Today's Startup Development Team

From LAMP to MEAN

Modern web development stack

MEAN.JS: a full-stack JavaScript solution using MongoDB, Express, AngularJS, and Node.js

What is Big Data?

Hadoop, Why?

Hadoop, Volume, Velocity, Variety

Data Growing

Real Application of Big Data Today

SHORT LIFESPAN OF THE DATA

FAST MOVING DATA

FAST DATA PROCESSING

HIGH VARIETY OF DATA

Challenges

Data Volume and Variety

Four V’s and a C

It is not only volume that makes big data big. It is about the three V's (high Volume, Variety, and Velocity), which together lead to a fourth: high Value!

In addition, the Challenge: the data is very complex in nature and often unstructured, for example text documents, emails, images, videos, clickstream data, and social media feeds.

Eliminate a Single Point of Failure: make sure the load balancer itself does not become a single point of failure. Load balancers must be deployed as a high-availability cluster.
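
One common way to meet this requirement is an active/passive pair, where a standby balancer takes over when the active one fails its health check. The following is a minimal Python sketch of that selection logic only. The balancer names (`lb-a`, `lb-b`) and the `healthy` flag are hypothetical; a real setup would set the flag from an actual health probe.

```python
from dataclasses import dataclass


@dataclass
class Balancer:
    name: str      # hypothetical identifier, not taken from the slides
    healthy: bool  # assumed to be set by an external health probe


def pick_active(balancers):
    """Return the first healthy balancer in priority order, or None."""
    for lb in balancers:
        if lb.healthy:
            return lb
    return None


pair = [Balancer("lb-a", healthy=True), Balancer("lb-b", healthy=True)]
print(pick_active(pair).name)  # the primary serves traffic

pair[0].healthy = False         # simulate a failure of the primary
print(pick_active(pair).name)  # the standby takes over
```

With a single balancer in the list, a failure leaves `pick_active` returning `None`, which is exactly the single point of failure the slide warns about.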

A Typical Hadoop Cluster

[Figure: a Hadoop cluster spread across Rack 1, Rack 2 and Rack 3. A Client talks to the Master Node (Name Node, Job Tracker); each Slave Node runs a Data Node and a Task Tracker executing Map/Reduce tasks. Arrow labels: data assignment to nodes, data read, data write, metadata for block info, job assignment, task assignment.]

1. Client
2. Master Node: Name Node, Job Tracker
3. Slave Nodes: Data Nodes, Task Trackers, Map/Reduce

Hadoop File System (HDFS)

[Figure: a Client writing a FILE to Data Nodes (Data Node 1, Data Node 4, Data Node 7) across Rack 1, Rack 2 and Rack 3, coordinated by the Name Node. Arrow labels: data assignment to nodes, data read, data write, metadata for block info.]

1. Client consults Name Node
2. Client writes block to Data Node
3. Data Node replicates block
4. Cycle repeats for next blocks

Rack 1: Data Node 1, Data Node 2, …
Rack 2: Data Node 4, …
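
The HDFS write path (client consults the Name Node, writes a block, the block is replicated, then the cycle repeats) can be illustrated with a toy simulation in Python. This is a sketch under stated assumptions, not Hadoop's actual placement policy: the three-nodes-per-rack layout, the tiny block size, the replication factor of 3, and the "one replica per rack" rule in the hypothetical `place_replicas` helper are all simplifications chosen for illustration.

```python
# Assumed layout: three racks of three nodes each.
# The figure only names Data Nodes 1, 2, 4 and 7; the rest are hypothetical.
RACKS = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}


def split_into_blocks(data: bytes, block_size: int):
    """Cut a file into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(block_index: int, replication: int = 3):
    """Name Node role: pick one Data Node per rack, rotating by block index.

    Simplified rule for illustration only; real HDFS placement differs.
    """
    racks = list(RACKS)
    chosen = []
    for r in range(replication):
        rack = racks[(block_index + r) % len(racks)]
        nodes = RACKS[rack]
        chosen.append(nodes[block_index % len(nodes)])
    return chosen


def write_file(data: bytes, block_size: int = 4):
    """Client role: per block, consult the Name Node, then write and replicate."""
    storage = {}   # data node -> list of (block index, bytes)
    metadata = {}  # block index -> data nodes holding it (Name Node metadata)
    for idx, block in enumerate(split_into_blocks(data, block_size)):
        targets = place_replicas(idx)        # step 1: consult Name Node
        for node in targets:                 # steps 2-3: write, then replicate
            storage.setdefault(node, []).append((idx, block))
        metadata[idx] = targets              # step 4: repeat for the next block
    return metadata, storage


metadata, storage = write_file(b"hello hadoop")
for idx, nodes in metadata.items():
    print(idx, nodes)
```

Because every block lands on a node in each rack, losing any single node or rack in this toy model still leaves two copies of every block.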

MapReduce

Input → Splitting → Map → Shuffle/Sort → Reduce → Output

Map:
the, 1; quick, 1; brown, 1; fox, 1
the, 1; fox, 1; ate, 1; the, 1; mouse, 1
how, 1; now, 1; brown, 1; cow, 1

Shuffle/Sort:
the, 1; the, 1; the, 1
fox, 1; fox, 1
quick, 1
brown, 1; brown, 1
ate, 1
mouse, 1
how, 1
now, 1
cow, 1

Reduce:
the, 3
fox, 2
quick, 1
brown, 2
ate, 1
mouse, 1
how, 1
now, 1
cow, 1

Output:
the, 3; fox, 2; quick, 1; brown, 2; ate, 1; mouse, 1; how, 1; now, 1; cow, 1

The Map function processes one line at a time, splits it into tokens separated by whitespace, and emits a key-value pair <word, 1> for each token.

The Reducer function just sums up the values, which are the occurrence counts for each key (i.e. the words in this example).
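
The word count flow can be sketched in plain Python, running in a single process rather than on a cluster. The three input lines are inferred from the map output shown in the slide, and the function names (`map_line`, `shuffle`, `reduce_counts`) are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict


def map_line(line):
    """Map: split a line on whitespace and emit a (word, 1) pair per token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle/sort: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: sum the occurrence counts for one key."""
    return key, sum(values)


# Input lines inferred from the slide's map output.
lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

mapped = [pair for line in lines for pair in map_line(line)]
result = dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())
print(result)
```

In a real MapReduce job the map calls run in parallel on different input splits and the framework performs the shuffle; here the three phases are simply called in sequence.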

MapReduce Wordcount Example in R

Map function.

Reduce function.

Reading the input from HDFS: from.dfs().

Writing the results back to HDFS: to.dfs().

What is MapReduce used for?

• At Google:
– Index building for Google Search
– Article clustering for Google News
– Statistical machine translation

• At Yahoo!:
– Index building for Yahoo! Search
– Spam detection for Yahoo! Mail

• At Facebook:
– Data mining
– Ad optimization
– Spam detection

Who uses Hadoop?

▐ Facebook (Hadoop, Hive, Scribe)
▐ Google File System (HDFS)
▐ Yahoo! (Hadoop in Yahoo Search)
▐ IBM Transarc (Andrew File System)
▐ Amazon/A9

Goals of HDFS - Hadoop Distributed File System

▐ Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB
▐ Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
▐ Optimized for Batch Processing
– Provides very high aggregate bandwidth

Hadoop, Why?

▐ Need to process multi-petabyte datasets
▐ Need common infrastructure
– Efficient, reliable, open source (Apache License)
▐ The above goals are the same as Condor's, but workloads are I/O bound and not CPU bound

Hive, Why?

▐ Need a multi-petabyte warehouse
▐ Hive is a Hadoop subproject!

What is MapReduce?

▐ Data-parallel programming model for clusters of commodity machines
▐ Pioneered by Google
– Processes 20 PB of data per day
▐ Popularized by open-source Hadoop project
– Used by Yahoo!, Facebook, Amazon, …

Hadoop at Facebook

▐ Production cluster
• 4800 cores, 600 machines, 16 GB per machine (April 2009)
• 8000 cores, 1000 machines, 32 GB per machine (July 2009)
• 4 SATA disks of 1 TB each per machine
• 2-level network hierarchy, 40 machines per rack
• Total cluster size is 2 PB, projected to be 12 PB in Q3 2009
▐ Test cluster
• 800 cores, 16 GB each
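
A quick arithmetic check of the production cluster figures, as a short Python sketch. The inputs are the July 2009 numbers from the slide; treating 1 PB as 1000 TB (decimal units) is an assumption.

```python
# Figures taken from the slide (July 2009 production cluster).
machines = 1000
cores = 8000
disks_per_machine = 4
tb_per_disk = 1
machines_per_rack = 40

cores_per_machine = cores // machines
raw_tb = machines * disks_per_machine * tb_per_disk
raw_pb = raw_tb / 1000  # assumes decimal units: 1 PB = 1000 TB
racks = machines // machines_per_rack

print(cores_per_machine, raw_pb, racks)
```

With these inputs the July 2009 configuration works out to 8 cores per machine, 4 PB of raw disk, and 25 racks. How the slide's 2 PB total relates to raw disk capacity is not stated in the source.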

2016 - Hadoop clusters

▐ ~20,000 machines running Hadoop
▐ Largest clusters are currently 2000 nodes
▐ Several petabytes of user data (compressed, unreplicated)
▐ Run hundreds of thousands of jobs every month

2016 - Big Data Server Farm

Conclusions

The Digital Age brings many opportunities but also challenges.

Big Data and Analytics can help meet the challenges and realize the opportunities.

It is within anyone's grasp; do it incrementally and iteratively.

Hadoop cloud solutions are scalable, flexible and cost-efficient, but sometimes limited in functionality (or not standardized).

Good data scientists are needed, in a team with mixed competences, to make the right choices.


QUESTIONS?


Q&A

Thanks
