Introduction of Big data, NoSQL & Hadoop

Transcript of Introduction of Big data, NoSQL & Hadoop

Page 1: Introduction of Big data, NoSQL & Hadoop

INTRODUCTION: BIG DATA, NOSQL & HADOOP

Page 2: Introduction of Big data, NoSQL & Hadoop

BIG DATA

Page 3: Introduction of Big data, NoSQL & Hadoop

1 BIG DATA

WHAT IS BIG DATA?

Big Data refers to TECHNOLOGY and INITIATIVES that involve data that is too DIVERSE, FAST-CHANGING or MASSIVE for conventional technologies, skills and infrastructure to address efficiently.

Page 7: Introduction of Big data, NoSQL & Hadoop

1 BIG DATA

BIG DATA CHARACTERISTICS

o VOLUME: high data capacity (terabytes or petabytes)
o VELOCITY: batch, real-time, streams
o VARIETY: various kinds (structured, unstructured, semi-structured)
o VERACITY: quality, consistency, reliability

Page 8: Introduction of Big data, NoSQL & Hadoop

1 BIG DATA

BIG DATA TYPES

Type | Characteristics | Examples | Technology
STRUCTURED data | Entities with a pre-defined format/schema | RDBMS records | RDBMS, NoSQL
SEMI-STRUCTURED data | Looser structure; may have a schema | XML files, JSON files | NoSQL, MapReduce
UNSTRUCTURED data | NO structure | Email content, images, videos, PDF files | MapReduce

Page 9: Introduction of Big data, NoSQL & Hadoop

1 BIG DATA

BIG DATA CHALLENGES IN STORAGE & ANALYSIS

1. SLOW PROCESSING, NOT SCALABLE

o SSD (800 MB/s, 2 ms seek)
o SATA (300 MB/s)
o IDE drive (75 MB/s, 10 ms seek)
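
To put these transfer rates in perspective, the short calculation below estimates how long one full pass over a dataset takes on each drive. It is only a back-of-the-envelope sketch: the 1 TB dataset size and the 100-drive cluster are assumptions made for illustration, and seek time and coordination overhead are ignored.

```python
# Sequential read throughput in MB/s, taken from the slide.
DRIVE_MB_PER_S = {"SSD": 800, "SATA": 300, "IDE": 75}

# Assumption for illustration: 1 TB of data, counted as 1,000,000 MB.
DATASET_MB = 1_000_000


def scan_seconds(dataset_mb: float, mb_per_s: float, drives: int = 1) -> float:
    """Time to read the whole dataset once, split evenly across `drives`."""
    return dataset_mb / (mb_per_s * drives)


for name, speed in DRIVE_MB_PER_S.items():
    hours = scan_seconds(DATASET_MB, speed) / 3600
    print(f"{name}: {hours:.2f} h on one drive")

# Hypothetical cluster: 100 IDE drives, each reading its own share in parallel.
minutes = scan_seconds(DATASET_MB, DRIVE_MB_PER_S["IDE"], drives=100) / 60
print(f"100 x IDE: {minutes:.1f} min")
```

A single IDE drive needs roughly 3.7 hours for one pass, while spreading the same data over many drives cuts the time to a few minutes. That gap is the motivation for distributed storage and processing.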

Page 10: Introduction of Big data, NoSQL & Hadoop

1 BIG DATA

BIG DATA CHALLENGES IN STORAGE & ANALYSIS

2. UNRELIABLE MACHINES

o Risky: any single machine can fail

Page 11: Introduction of Big data, NoSQL & Hadoop

1 BIG DATA

BIG DATA CHALLENGES IN STORAGE & ANALYSIS

3. RELIABILITY

o Scalability
o Data recovery
o Partial failure

Page 14: Introduction of Big data, NoSQL & Hadoop

1 BIG DATA

BIG DATA CHALLENGES IN STORAGE & ANALYSIS

1. SLOW PROCESSING, NOT SCALABLE
2. UNRELIABLE MACHINES
3. RELIABILITY
4. BACKUP
5. PARALLEL PROCESSING
6. HIGH COST

Page 15: Introduction of Big data, NoSQL & Hadoop

HADOOP

Page 16: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

WHAT IS HADOOP?

A free, Java-based framework that allows the DISTRIBUTED PROCESSING of LARGE DATA SETS across CLUSTERS OF COMPUTERS using SIMPLE PROGRAMMING MODELS.

Page 17: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

HADOOP ORIGIN

o 2002-2004: Google publishes the GFS and MapReduce papers
o 2004: Doug Cutting adds GFS and MapReduce to Nutch
o Yahoo! hires Doug Cutting and builds a team to develop Hadoop
o 2007: The NY Times converts 4 TB of archives (cluster of 100 EC2 machines)
o Web-scale deployment at Yahoo!, Facebook, Twitter

Page 18: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

HADOOP ORIGIN

o 2009: Yahoo! does the fastest sort of a TB, in 62 seconds
o 2009: Yahoo! sorts a PB in 16.25 hours (3658 nodes)
o Apache Hadoop is now an open-source project

Page 19: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

HADOOP ARCHITECTURE

Hadoop is designed and built on top of two independent parts:

HADOOP = HDFS + MAP REDUCE

o HDFS: storage file system
o MAP REDUCE: processing

Page 20: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

HADOOP ARCHITECTURE

HDFS – Hadoop distributed file system

Distributed across “NODES”

Page 21: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

HADOOP ARCHITECTURE

HDFS – Hadoop distributed file system

NAME NODE
o Master of the system
o Stores metadata: transaction log, list of files, list of blocks, data nodes
o Maintains and manages the blocks on the data nodes

DATA NODE
o Slaves, deployed on each machine
o Provide the actual storage
o Responsible for serving read/write requests
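
The split between the two roles can be sketched with a toy in-memory model. This is not the HDFS API: the class names, the tiny block size and the replica count are assumptions chosen for illustration, and the real read path (where the client asks the name node for locations and then contacts data nodes itself) is collapsed into one method.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # bytes per block; tiny on purpose (assumed value)
REPLICAS = 2    # copies kept of each block (assumed value)


@dataclass
class DataNode:
    """Slave: provides the actual storage for blocks."""
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> bytes


@dataclass
class NameNode:
    """Master: keeps only metadata, never the file contents."""
    data_nodes: list
    files: dict = field(default_factory=dict)      # file name -> [block ids]
    locations: dict = field(default_factory=dict)  # block id -> [DataNode]

    def write(self, name: str, data: bytes) -> None:
        self.files[name] = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_id = f"{name}#{offset // BLOCK_SIZE}"
            # Place the replicas round-robin on distinct data nodes.
            start = len(self.locations)
            targets = [self.data_nodes[(start + r) % len(self.data_nodes)]
                       for r in range(REPLICAS)]
            for node in targets:
                node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            self.files[name].append(block_id)
            self.locations[block_id] = targets

    def read(self, name: str) -> bytes:
        # Look up each block's location, then fetch it from a data node.
        return b"".join(self.locations[b][0].blocks[b]
                        for b in self.files[name])


nodes = [DataNode(f"dn{i}") for i in range(3)]
master = NameNode(nodes)
master.write("log.txt", b"hello hadoop")
print(master.files["log.txt"])  # block ids held as metadata on the master
print(master.read("log.txt"))   # contents reassembled from the data nodes
```

Note that the name node object holds only names and locations; every byte of file content lives in the `blocks` dictionaries of the data nodes.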

Page 22: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

HADOOP ARCHITECTURE

HDFS – Hadoop distributed file system: MODEL (diagram)

Page 23: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

HADOOP ARCHITECTURE

MAP REDUCE: COMPONENTS

JOB TRACKER
o Master; manages jobs and resources in the cluster

TASK TRACKER
o Slaves, deployed on each machine
o Run the map & reduce tasks as the job tracker requires

Page 24: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

HADOOP ARCHITECTURE

MAP REDUCE: MODEL (diagram)

Page 28: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

HADOOP ARCHITECTURE

MAP REDUCE: ALGORITHM

o Parallel algorithm
o 3 basic steps
  1. Map step: split data into key & value pairs
  2. Shuffle step: sort by key
  3. Reduce step: gather the values of each key
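
The three steps can be illustrated with a word count that runs in a single Python process. It is a minimal sketch of the idea only: the function names and the sample input are made up for illustration, and nothing here uses Hadoop or runs in parallel.

```python
from collections import defaultdict


def map_step(line):
    """Map: split the input into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_step(pairs):
    """Shuffle: group the values by key and sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_step(key, values):
    """Reduce: gather the values of one key into a single result."""
    return key, sum(values)


lines = ["big data big cluster", "data node"]
mapped = [pair for line in lines for pair in map_step(line)]
counts = [reduce_step(key, values) for key, values in shuffle_step(mapped)]
print(counts)  # [('big', 2), ('cluster', 1), ('data', 2), ('node', 1)]
```

On a cluster, each call to `map_step` and `reduce_step` could run on a different machine, because a map call only needs its own line and a reduce call only needs the values of its own key.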

Page 29: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

HADOOP ARCHITECTURE

MAP REDUCE: FUNCTIONS

o Logical functions: MAPPER & REDUCER
o Hadoop handles distributing MAP & REDUCE tasks across the cluster
o MAP & REDUCE functions are written by the developer and submitted to the Hadoop cluster as .jar files
o Typically batch oriented

Page 30: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

HADOOP ARCHITECTURE

HADOOP ECOSYSTEM: MODEL (diagram)

Page 36: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

HADOOP FEATURES SUMMARY

o STORE ANYTHING: unstructured data, semi-structured data
o STORAGE CAPACITY: scales linearly; cost is not exponential
o DATA LOCALITY & PROCESS IN YOUR WAY
o FAILURE & FAULT TOLERANCE: detects failure & heals itself (data is replicated, a failed task is re-run, no need to maintain backup data)
o COST EFFECTIVE
o PRIMARILY USED FOR BATCH PROCESSING, NOT REAL-TIME

Page 38: Introduction of Big data, NoSQL & Hadoop

2 HADOOP

WHO IS USING HADOOP & FOR WHAT

o SEARCH
o LOG PROCESSING
o RECOMMENDATION SYSTEMS
o DATA WAREHOUSE
o VIDEO & IMAGE ANALYSIS
o AND MANY MORE …

Page 39: Introduction of Big data, NoSQL & Hadoop

NOSQL

Page 44: Introduction of Big data, NoSQL & Hadoop

3 NOSQL

WHAT IS NOSQL?

o NOSQL = Not Only SQL
o SCHEMA FREE

NOSQL CATEGORIES

o KEY VALUE STORE: Dynamo, Azure, Redis, Memcached
o BIG TABLE (GOOGLE) / COLUMN STORE: HBase, Cassandra. Similar to an RDBMS but handles semi-structured data
o GRAPH DB: Neo4j
o DOCUMENT STORE: MongoDB, Redis, CouchDB. Similar to a key-value store, but the DB knows what the value is
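
The difference between a key-value store and a document store can be sketched with plain Python data structures. This is a toy illustration, not the API of any of the products listed: the sample records and the `find` helper are hypothetical.

```python
import json

# Key-value store: the value is an opaque blob; lookup is by key only.
kv_store = {
    "user:1": json.dumps({"name": "An", "city": "Hanoi"}).encode(),
    "user:2": json.dumps({"name": "Binh", "city": "Hue"}).encode(),
}
print(kv_store["user:1"])  # the store hands back bytes it cannot interpret

# Document store: the database understands the fields inside each value,
# so it can filter on them.
documents = [
    {"_id": 1, "name": "An", "city": "Hanoi"},
    {"_id": 2, "name": "Binh", "city": "Hue"},
]


def find(docs, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [d for d in docs
            if all(d.get(k) == v for k, v in criteria.items())]


print(find(documents, city="Hue"))
```

With the key-value layout, finding every user in a given city means fetching and decoding each value in application code; with the document layout, the query can be expressed against the stored fields directly.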

Page 45: Introduction of Big data, NoSQL & Hadoop

3 NOSQL

MONGO DB – DATA MODELING CONCEPT

o Data in MongoDB has A FLEXIBLE SCHEMA.
o Data is stored in the form of DOCUMENTS (JSON-like key-value pairs).
o COLLECTION: a group of RELATED DOCUMENTS.

Page 46: Introduction of Big data, NoSQL & Hadoop

3 NOSQL

MONGO DB – DATA MODELING CONCEPT

No JOIN; instead, there are 2 types of DOCUMENT STRUCTURE:

o Reference
o Embedded
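
The two document structures can be pictured with plain Python dictionaries standing in for documents. The field names and the `addresses_of` helper are hypothetical, chosen only to show the shape of the data; this is not driver code.

```python
# Embedded: the related data lives inside the parent document.
user_embedded = {
    "_id": "u1",
    "name": "An",
    "address": {"street": "12 Le Loi", "city": "Hue"},
}

# Reference: the related data is its own document, linked by an id.
users = {"u1": {"_id": "u1", "name": "An"}}
addresses = {
    "a1": {"_id": "a1", "user_id": "u1", "street": "12 Le Loi", "city": "Hue"},
}


def addresses_of(user_id):
    """Extra lookup done by the application, standing in for a JOIN."""
    return [a for a in addresses.values() if a["user_id"] == user_id]


print(user_embedded["address"]["city"])         # everything in one document
print([a["city"] for a in addresses_of("u1")])  # lookup in a second collection
```

Embedding keeps data that is read together in one place, while references avoid duplicating data that is shared or that grows without bound.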

Page 47: Introduction of Big data, NoSQL & Hadoop

3 NOSQL

MONGO DB – DATA MODELING CONCEPT

* Always consider how the data will be used (queries or updates) when designing data models.

MODEL RELATIONSHIPS BETWEEN DOCUMENTS
o One-to-one
o One-to-many

MODEL TREE STRUCTURES
o Parent reference
o Child reference
o Array of ancestors
o Materialized paths
o Nested sets
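
Four of the tree patterns are shown below on the same small tree, again with plain Python dictionaries standing in for documents. The category names, field names and helper functions are assumptions made for illustration; nested sets are left out to keep the sketch short.

```python
# Hypothetical category tree: Books -> Programming -> Databases

# Parent reference: each node stores its parent.
parent_refs = [
    {"_id": "Books", "parent": None},
    {"_id": "Programming", "parent": "Books"},
    {"_id": "Databases", "parent": "Programming"},
]

# Child reference: each node stores its children.
child_refs = [
    {"_id": "Books", "children": ["Programming"]},
    {"_id": "Programming", "children": ["Databases"]},
    {"_id": "Databases", "children": []},
]

# Array of ancestors: each node stores the full list of its ancestors.
ancestor_arrays = [
    {"_id": "Books", "ancestors": []},
    {"_id": "Programming", "ancestors": ["Books"]},
    {"_id": "Databases", "ancestors": ["Books", "Programming"]},
]

# Materialized paths: each node stores its path from the root as a string.
materialized_paths = [
    {"_id": "Books", "path": None},
    {"_id": "Programming", "path": ",Books,"},
    {"_id": "Databases", "path": ",Books,Programming,"},
]


def descendants_by_ancestors(docs, node):
    """All nodes below `node`, using the array-of-ancestors layout."""
    return [d["_id"] for d in docs if node in d["ancestors"]]


def descendants_by_path(docs, node):
    """All nodes below `node`, using the materialized-path layout."""
    return [d["_id"] for d in docs if d["path"] and f",{node}," in d["path"]]


print(descendants_by_ancestors(ancestor_arrays, "Books"))
print(descendants_by_path(materialized_paths, "Programming"))
```

Which layout fits best depends on the queries: parent and child references make single-step navigation cheap, while ancestor arrays and materialized paths answer "everything under this node" in one pass.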

Page 48: Introduction of Big data, NoSQL & Hadoop

3 NOSQL

MONGO DB – CRUD OPERATIONS

COMPARING: SQL VS MONGO STATEMENTS

o QUERY STATEMENT
o CREATE / INSERT / UPDATE / DELETE
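
The comparison table itself did not survive in this transcript. As a rough stand-in, the sketch below pairs common SQL statements with their MongoDB shell counterparts in comments, and mimics their effect with a toy in-memory class. `ToyCollection`, the `users` data and the field names are all hypothetical; the class is not the MongoDB driver API and only supports exact-match filters.

```python
class ToyCollection:
    """In-memory stand-in for a collection; NOT the MongoDB driver API."""

    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(dict(doc))

    def find(self, query):
        # Exact-match filter on every field named in the query.
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in query.items())]

    def update(self, query, changes):
        for doc in self.find(query):
            doc.update(changes)

    def delete(self, query):
        matched = self.find(query)
        self.docs = [d for d in self.docs if d not in matched]


users = ToyCollection()

# SQL:   INSERT INTO users (name, city) VALUES ('An', 'Hue');
# Mongo: db.users.insertOne({name: "An", city: "Hue"})
users.insert({"name": "An", "city": "Hue"})
users.insert({"name": "Binh", "city": "Hanoi"})

# SQL:   SELECT * FROM users WHERE city = 'Hue';
# Mongo: db.users.find({city: "Hue"})
print(users.find({"city": "Hue"}))

# SQL:   UPDATE users SET city = 'Da Nang' WHERE name = 'An';
# Mongo: db.users.updateMany({name: "An"}, {$set: {city: "Da Nang"}})
users.update({"name": "An"}, {"city": "Da Nang"})

# SQL:   DELETE FROM users WHERE name = 'Binh';
# Mongo: db.users.deleteMany({name: "Binh"})
users.delete({"name": "Binh"})
print(users.docs)
```

In both styles the filter plays the same role: the SQL WHERE clause becomes a query document, and the SET clause becomes an update document.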

Page 49: Introduction of Big data, NoSQL & Hadoop

THE END