Introduction to big data

Haifa Big Data Meetup - Meeting 1

Introduction to Big Data

Organizer + Lecturer – Nathan Krasney

Nathan Krasney 23/6/15

Introduction to Big Data

• Big Data use cases
• What is Big Data:
  – Definitions
  – Technologies
• Why is the future so bright for Big Data

Use Cases – A
• http://www.ted.com/playlists/56/making_sense_of_too_much_data
• In recent years, a huge amount of data has been coming from users: blogs, web sites, forums, Facebook, YouTube, LinkedIn, …
• The data is mostly personal: posts, likes, profiles, …
• The data contains the personal preferences, geographic location, … of hundreds of millions of people, at a scale that did not exist a few years ago.
• This data can be processed with machine learning algorithms to infer very interesting personal characteristics of people

Use Cases – A con’d
[Chart: Facebook active users per month, in millions]

Use Cases – A con’d

What kind of information can we produce by processing data on the web?
• Political preferences
• Personal characteristics
• Age
• Gender
• Religion
• Intelligence
• Consumer preferences

Use Cases – A1

Example 1: Facebook likes
Recent research found the top 5 likes that indicate intelligent people, for example liking this page. But why?

Use Cases – A1 con’d

In general, people tend to choose friends who are like them. For example, young people choose young people as their friends, smart people choose smart people as their friends, and so on.

It turns out that this particular page was liked by a group of intelligent people, and it spread virally on the web via the likes of their friends (who also have high intelligence).

But this conclusion could be reached only by having big data and being able to process it.

Use Cases – A2
Example 2 - As reported in Forbes magazine, a company named Target started sending a particular family suggestions for baby clothing even before the daughter had told her parents that she was pregnant. How did Target know about it?

Use Cases – A2 con’d

• It turns out that the company (https://corporate.target.com/) has a huge database of the purchases made in its stores. Furthermore, the company has a smart algorithm that identifies pregnancy from the shopping a woman does at Target
• The algorithm even identifies the pregnancy due date!
• The algorithm identified the girl's pregnancy not necessarily from baby products she bought, but from the vitamins she bought, a bigger handbag (for diapers) and other indirect characteristics
• The company's sales reached 71 billion $ in 2014, and the company has existed since 1902, so it has quite big data …

Use Cases – A2 con’d

• The huge amount of data (big data) that Target has gathered about its customers and their purchases has allowed the company to extract behavioral patterns that indicate a coming pregnancy, such as the purchase of vitamins, a bigger bag and so on

Use Cases – A3

Example 3
• Processing the huge amount of personal data that publicly exists on the web (Facebook, LinkedIn, forums, web sites, blogs, YouTube, Instagram, …) to predict a personal profile. This can help e.g. HR offices and companies hiring people …
• Identifying the social group you belong to, using clustering, can further improve this predicted profile
• A better prediction of the user profile is worth more money

What is Big Data?

• 3 V’s:
  – Volume
  – Velocity
  – Variety

What is Big Data ? Con’d

What is Big Data ? Con’d

What is Big Data ? Con’d

What is Big Data ? Con’d
The three V’s from another angle

What is Big Data ? Con’d
• Data model - which fields of data will be stored and how: the data type and any restrictions on the data input
• Structured data – based on a data model, e.g. a relational database. Needs a schema
• Unstructured data – no data model, e.g. e-mails, PDF files, web pages, videos, audio, photos. Schema free. Suits NoSQL
• Batch: offline processing, e.g. by Hadoop
• Streaming: online (real-time) processing, e.g. by Spark
• Terabyte – 1,000 GB
• Zettabyte – 1,000,000,000 TB
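
The two unit definitions above can be checked with a little arithmetic. The sketch below assumes decimal (SI) units, where each step up is a factor of 1,000; the constant names and the `convert` helper are illustrative and not part of the original slides.

```python
# Decimal (SI) storage units, expressed in bytes.
# Assumption: each unit is 1,000 times the previous one (not 1,024).
GB = 10**9
TB = 10**12
PB = 10**15
ZB = 10**21


def convert(amount, from_unit, to_unit):
    """Convert an amount between two units, both given in bytes."""
    return amount * from_unit // to_unit


print(convert(1, TB, GB))  # 1 TB expressed in GB
print(convert(1, ZB, TB))  # 1 ZB expressed in TB
```

The second call confirms the slide's figure: one zettabyte is a billion terabytes.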

What is Big Data ? Con’d
The three V’s from yet another angle

What is Big Data ? Con’d

Who’s Generating Big Data

Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

What is Big Data ? Con’d

Batch use case – Blackberry (good times stat…)
Data:
• Instrumentation data from devices
• 650 TB daily, 100 PB total

Processing is used for business analytics e.g. view graphs

What is Big Data ? Con’d
Batch use case – CBS Interactive (online content network for information and entertainment)
Data:
• 1 PB of content, click streams, web logs
• 1 PB events tracked daily

Processing is used for business analytics, e.g. to identify user patterns such as “high value” users in order to target content

What is Big Data ? Con’d

Streaming use case – Cyber security (fraud detection) by RSA

Machine learning may stop credit card transactions that look suspicious. E.g. an Israeli person buys a lot online; however, once he travels to China, he might be blocked when making the same online purchase.
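
The streaming idea can be sketched with a deliberately simple rule in place of machine learning. The example below handles transactions one at a time, as they arrive, and flags a purchase made from a country where the card has not been seen before. The class name, the field names and the rule itself are assumptions made for illustration; this is not a description of how RSA's product works.

```python
from collections import defaultdict


class CountryRuleDetector:
    """Toy streaming fraud check (hypothetical, rule-based, no machine learning)."""

    def __init__(self):
        # card id -> set of countries already seen for that card
        self.seen = defaultdict(set)

    def process(self, tx):
        """Handle one transaction as it arrives; return True if it looks suspicious."""
        known = self.seen[tx["card"]]
        # The first transaction of a card only establishes its profile.
        suspicious = bool(known) and tx["country"] not in known
        if not suspicious:
            known.add(tx["country"])
        return suspicious


# Made-up stream of transactions for a single card.
stream = [
    {"card": "A", "country": "IL", "amount": 120},
    {"card": "A", "country": "IL", "amount": 80},
    {"card": "A", "country": "CN", "amount": 95},
]

detector = CountryRuleDetector()
flags = [detector.process(tx) for tx in stream]
print(flags)
```

Each decision is made immediately from the state kept so far, which is what separates this from a batch job that would scan the whole history offline.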

What is Big Data ? Con’d
So we have gathered a huge amount of data, now what?

The problem – processing big data
Traditional large-scale computation used a strong computer (a supercomputer):
• faster processors
• more memory

What is Big Data ? Con’d
But even this was not enough.
A better solution is a distributed system - using multiple machines for a single job.
But this also has its problems:
• programming complexity - keeping data and processes in sync
• finite bandwidth
• partial failures - e.g. the failure of one computer should not bring the system down

What is Big Data ? Con’d
Modern systems have much more data:
• terabytes (1,000 gigabytes) a day
• petabytes (1,000 terabytes) total

The approach of keeping all the data in one central place is not suitable for big data

What is Big Data ? Con’d

What is Big Data ? Con’d
The new approach – Apache Hadoop

A software framework for storing, processing and analyzing big data

• Distributed
• Scalable
• Fault tolerant
• Open source
• Ecosystem

What is Big Data ? Con’d
The new approach – Hadoop

Hadoop core components:
• HDFS (Hadoop Distributed File System) - stores the data on the cluster
• MapReduce - processes the data on the cluster

What is Big Data ? Con’d
HDFS basic concepts

• HDFS is a file system written in Java
• It sits on top of a native file system, e.g. that of Linux
• It stores massive amounts of data:
  – scalable
  – fault tolerant
  – supports efficient processing with MapReduce

What is Big Data ? Con’d
HDFS basic concepts

A cluster may have hundreds or thousands of servers

What is Big Data ? Con’d
HDFS basic concepts

How files are stored
• Data files are split into blocks and distributed to the data nodes (computers)
• Each block is replicated on multiple nodes (3 is the default)
• The NameNode stores the metadata
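
The storage scheme above can be sketched in a few lines. This is a toy model, not HDFS code: the block size is tiny on purpose, the node names are made up, and blocks are placed round-robin, which is an assumption of the sketch rather than the placement policy HDFS actually uses.

```python
BLOCK_SIZE = 4     # bytes; tiny on purpose (real HDFS blocks are far larger)
REPLICATION = 3    # the default replication factor mentioned above


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the file content into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(filename, data, nodes, replication=REPLICATION):
    """Return (metadata, storage).

    metadata plays the role of the NameNode: per file, it records which nodes
    hold each block. storage plays the role of the data nodes: it holds the bytes.
    Assumes len(nodes) >= replication so that the replicas land on distinct nodes.
    """
    storage = {node: {} for node in nodes}
    block_locations = []
    for index, block in enumerate(split_into_blocks(data)):
        # Hypothetical round-robin placement of the replicas.
        holders = [nodes[(index + r) % len(nodes)] for r in range(replication)]
        for node in holders:
            storage[node][(filename, index)] = block
        block_locations.append(holders)
    return {filename: block_locations}, storage


nodes = ["node1", "node2", "node3", "node4"]
metadata, storage = place_blocks("log.txt", b"hello big data", nodes)
print(metadata)
```

Because every block lives on three different nodes, the file can still be reassembled from the metadata after any single node is lost.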

What is Big Data ? Con’d
HDFS basic concepts

What is Big Data ? Con’d
Get data in / out of HDFS

What is Big Data ? Con’d
MapReduce

MapReduce has 3 main phases:

Phase 1 - The Mapper
• Each Map task (typically) works on one HDFS block
• Map tasks (typically) run on the node where the block is stored

Phase 2 - Shuffle & Sort
• Sorts and collects all intermediate data from all mappers
• Happens after all Map tasks are completed

Phase 3 - The Reducer
• Operates on the sorted / shuffled intermediate data (the output of the previous phase)
• Produces the final output

What is Big Data ? Con’d
Example: counting words

What is Big Data ? Con’d
Phase 1 - The mapper maps the text

What is Big Data ? Con’d
Phase 2 - Shuffle & Sort

What is Big Data ? Con’d
Phase 3 – Reduce
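
The three phases of the word-count example can be imitated on a single machine in plain Python. This is a sketch of the idea only: it does not use the Hadoop API, the function names are chosen for illustration, and the input lines are made up.

```python
from collections import defaultdict


def mapper(line):
    """Phase 1: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Phase 2: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Phase 3: sum the counts collected for one word."""
    return key, sum(values)


lines = ["the cat sat on the mat", "the dog sat"]   # made-up input

intermediate = [pair for line in lines for pair in mapper(line)]
grouped = shuffle_and_sort(intermediate)
counts = dict(reducer(key, values) for key, values in grouped)
print(counts)
```

On a real cluster each call to `mapper` would run on the node holding that part of the input, and the grouping step would move data between nodes over the network.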

What is Big Data ? Con’d
It is important to understand that:
• Map tasks run in parallel - this reduces computation time
• Map tasks run on the machines that contain the data, so there are no network traffic issues
• Reduce tasks also run in parallel
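
The parallel structure can be sketched with a thread pool standing in for the cluster. This is an assumption-heavy illustration: threads in one Python process only show the shape of the computation and give no real speed-up for this kind of work, whereas Hadoop runs the tasks on separate machines. The blocks are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(block):
    """Count the words of one block; stands in for a Map task on one node."""
    return Counter(block.split())


blocks = ["big data big", "data is big", "data data"]   # made-up blocks

# Each block is handed to its own worker, so the Map tasks can overlap in time.
with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
    partial_counts = list(pool.map(map_task, blocks))

# Merging the partial results plays the role of the Reduce step.
total = sum(partial_counts, Counter())
print(total)
```

No Map task needs anything from another Map task, which is why they can run side by side.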

What is Big Data ? Con’d
Core Hadoop concepts:
• Applications are written in high-level languages
• Nodes talk to each other as little as possible
• Data is distributed in advance
• Data is replicated for increased availability and reliability
• Hadoop is scalable and fault tolerant

What is Big Data ? Con’d
Fault tolerance:
• Node failure is inevitable
• What happens in this case:
  – the system continues to function
  – the master re-assigns tasks to a different node
  – data is replicated, so no data is lost
  – a node that recovers rejoins the cluster automatically
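
The re-assignment step can be sketched as follows. The function names, node names and round-robin policy are all hypothetical and chosen for illustration; a real Hadoop master tracks much more state than this.

```python
def assign_tasks(tasks, nodes):
    """Spread tasks over the live nodes in round-robin order."""
    return {task: nodes[i % len(nodes)] for i, task in enumerate(tasks)}


def handle_failure(assignment, failed_node, live_nodes):
    """Re-assign only the tasks that were running on the failed node."""
    updated = dict(assignment)
    orphaned = [task for task, node in assignment.items() if node == failed_node]
    for i, task in enumerate(orphaned):
        updated[task] = live_nodes[i % len(live_nodes)]
    return updated


nodes = ["node1", "node2", "node3"]
assignment = assign_tasks(["t1", "t2", "t3", "t4"], nodes)

# node2 fails: the master moves its tasks to the remaining nodes.
live = [n for n in nodes if n != "node2"]
assignment = handle_failure(assignment, "node2", live)
print(assignment)
```

Tasks on healthy nodes are left where they are, so the job keeps running while only the lost work is redone.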

What is Big Data ? Con’d
Scalability means:
• capacity grows linearly with the number of nodes added
• increased load results in a graceful decline in performance, not in failure

What is Big Data ? Con’d
Hadoop Ecosystem

What is Big Data ? Con’d

Hadoop Ecosystem
• Querying data: Hive, Pig, Impala
• Data store: HBase (Bigtable-like, over HDFS)
• Getting data into HDFS: Flume
• Schedulers (e.g. for Hadoop MapReduce jobs, Pig jobs): Oozie
• Machine learning: Mahout

What is Big Data ? Con’d
Who uses Hadoop

What is Big Data ? Con’d

Spark

The problem: MapReduce may be slow and does only batch processing

Solution – Spark
• Can do both batch and streaming
• Apache Spark processes data in memory, while Hadoop MapReduce persists data back to disk after a map or reduce action. Processing can be up to 100 times faster

What is Big Data ? Con’d
NoSQL (Not only SQL)
The problem: storage and retrieval of unstructured data, typically a huge amount of it.

The solution:
• A NoSQL database
• The data structures used by NoSQL databases:
  – key-value: the key is the identifier
  – graph: nodes + edges to represent relationships
  – document: stores data as JSON documents (MongoDB, CouchDB, ..)
  – …
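
The three data structures listed above can be pictured with plain Python objects. This is only an illustration of the shapes of the data: no real NoSQL database or client library is used, and all names and values are made up.

```python
import json

# Key-value: the key is the identifier, the value is opaque to the store.
key_value_store = {"user:42": "Dana"}

# Graph: nodes plus edges that represent relationships.
graph = {
    "nodes": ["Dana", "Yossi", "Noa"],
    "edges": [("Dana", "Yossi"), ("Yossi", "Noa")],
}


def neighbors(graph, node):
    """Return every node that shares an edge with the given node."""
    result = set()
    for a, b in graph["edges"]:
        if a == node:
            result.add(b)
        elif b == node:
            result.add(a)
    return result


# Document: one self-describing record, serialized as JSON. No schema is
# declared up front, so another document may carry different fields.
document = {"_id": 42, "name": "Dana", "likes": ["big data", "hiking"]}
as_json = json.dumps(document)

print(key_value_store["user:42"])
print(sorted(neighbors(graph, "Yossi")))
print(json.loads(as_json)["likes"])
```

Which shape fits best depends on the queries: lookups by identifier suit key-value, relationship queries suit a graph, and whole-record reads suit documents.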

Why is the future so bright for Big Data

• IoT (Internet of Things) will add a huge amount of data in the coming years
• The cloud allows us to easily store a lot of data
• As time goes by, more data is stored on the net, in companies, institutions, …
• Data processing abilities improve as time goes by (Hadoop, Spark)
• The ability to store huge amounts of data improves as time goes by
• The ability to store more data + better processing leads to smarter information that can be retrieved from the data
• Smart information is power = money