Introduction to big data


Haifa Big Data Meetup - Meeting 1

Introduction to Big Data
Organizer + Lecturer – Nathan Krasney

Nathan Krasney 23/6/15


Introduction to Big Data

• Big Data use cases
• What is Big Data :
– Definitions
– Technologies

• Why is the future so bright for Big Data


Use Cases – A
• http://www.ted.com/playlists/56/making_sense_of_too_much_data
• In recent years we have had a huge amount of data coming from users: blogs, web sites, forums, Facebook, YouTube, LinkedIn, …
• Data is mostly personal: posts, likes, profiles, …
• Data contains personal preferences, geographic location, … of hundreds of millions of people, on a scale that did not exist a few years ago.
• It is possible to process this data using Machine Learning algorithms to infer very interesting personal characteristics of people



Use Cases – A con’d
Facebook Active Users Per Month [in millions]


Use Cases – A con’d

What kind of information can we produce by processing data on the web?

• Political preferences
• Personal characteristics
• Age
• Gender
• Religion
• Intelligence
• Consumer preferences


Use Cases – A1

Example 1 : Facebook likes
A recent study found the top 5 likes that indicate intelligent people, for example liking the page shown on this slide. But why?


Use Cases – A1 con’d

In general, people tend to choose friends who are like them. For example, young people choose young people as their friends, smart people choose smart people as their friends, and so on.

It turns out that this particular page was liked by a group of intelligent people, and it spread virally on the web via the likes of their friends (who also have high intelligence).

But this conclusion could be reached only by having big data and being able to process it.


Use Cases – A2
Example 2 - As reported in Forbes magazine, a company named Target started sending a particular family suggestions for baby clothing even before the daughter had told her parents she was pregnant. How did Target know about it?



• It turns out that the company (https://corporate.target.com/) has a huge database of the shopping done in its stores. Furthermore, the company has a smart algorithm that identifies pregnancy from the purchases a woman makes at Target
• The algorithm even estimates the pregnancy due date!
• The algorithm identified the girl's pregnancy not necessarily from baby products she bought, but from the vitamins she bought, a bigger handbag (for diapers) and other indirect characteristics
• The company's sales reached 71 billion $ in 2014 and the company has existed since 1902, so it has quite big data …




• The huge amount of data (big data) that Target has gathered about its customers and their purchases has allowed the company to extract behavioral patterns that indicate an upcoming pregnancy, based on purchases of items like vitamins, a bigger bag and so on


Use Cases – A3

Example 3
• Processing the huge amount of personal data that publicly exists on the web (Facebook, LinkedIn, forums, web sites, blogs, YouTube, Instagram, …) to predict a personal profile. This can help e.g. HR offices and companies hiring people
• Identifying the social group you belong to using clustering can further improve this predicted profile
• A better prediction of the user profile is worth more money


What is Big Data?

• 3 V’s :
– Volume
– Velocity
– Variety


What is Big Data ? Con’d


What is Big Data ? Con’d
The three V’s from another angle


What is Big Data ? Con’d
• Data model - which fields of data will be stored and how: data types and any restrictions on the data input
• Structured data – data that follows a data model, e.g. a relational database. Needs a schema
• Unstructured data – no data model, e.g. e-mails, PDF files, web pages, videos, audio, photos. Schema free. Suits NoSQL
• Batch : offline processing, e.g. by Hadoop
• Streaming : online (real-time) processing, e.g. by Spark
• Terabyte – 1,000 GB
• Zettabyte – 1,000,000,000 TB
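The unit sizes above follow the decimal (SI) convention, where each step up is a factor of 1,000. A small Python sketch of that arithmetic; the `UNITS` list and the `convert` helper are illustrative names, not part of any library:

```python
# Decimal (SI) storage units: each step up the list is a factor of 1,000.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def convert(value, from_unit, to_unit):
    """Convert a quantity between two decimal storage units."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1000 ** steps

print(convert(1, "TB", "GB"))  # 1000
print(convert(1, "ZB", "TB"))  # 1000000000
```

The same helper shows that one petabyte is 1,000 terabytes, which is the scale the batch use cases in this deck talk about.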


What is Big Data ? Con’d
The three V’s from yet another angle



Who’s Generating Big Data

Social media and networks (all of us are generating data)

Scientific instruments (collecting all sorts of data)

Mobile devices (tracking all objects all the time)

Sensor technology and networks (measuring all kinds of data)

Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion



Batch use case – Blackberry (good times stat…)
Data :
• Instrumentation data from devices
• 650 TB daily, 100 PB total

Processing is used for business analytics, e.g. viewing graphs


What is Big Data ? Con’d
Batch use case – CBS Interactive (online content network for information and entertainment)
Data :
• 1 PB of content, click streams, web logs
• 1 PB events tracked daily

Processing is used for business analytics, e.g. identifying user patterns such as “high value” users in order to target content



Streaming use case – Cyber security (fraud detection) by RSA

Machine learning may stop credit card transactions that look suspicious. E.g. an Israeli person who buys a lot online might be blocked when making the same kind of online purchase once he travels to China.


What is Big Data ? Con’d
So we have gathered a huge amount of data, now what?

The problem – processing big data
Traditional large-scale computation used a strong computer (supercomputer) :

• faster processors
• more memory


What is Big Data ? Con’d
But even this was not enough.
A better solution is a distributed system - use multiple machines for a single job.
But this also has its problems :
• programming complexity - keeping data and processes in sync
• finite bandwidth
• partial failures - e.g. one computer failing should not bring the whole system down


What is Big Data ? Con’d
Modern systems have much more data :
• terabytes (1,000 gigabytes) a day
• petabytes (1,000 terabytes) total

The approach of keeping all the data in one central place is not suitable for big data

http://en.wikipedia.org/wiki/Gigabyte


What is Big Data ? Con’d
The new approach – Apache Hadoop

A software framework for storing, processing and analyzing big data

• Distributed
• Scalable
• Fault tolerant
• Open source
• Ecosystem


What is Big Data ? Con’d
The new approach – Hadoop

Hadoop core components :

• HDFS (Hadoop Distributed File System) - stores the data on the cluster

• MapReduce - processes the data on the cluster


What is Big Data ? Con’d
HDFS basic concepts

• HDFS is a file system written in Java
• Sits on top of a native file system, e.g. that of Linux
• Storage of massive amounts of data :
– scalable
– fault tolerant
– supports efficient processing with MapReduce



A cluster may have hundreds or thousands of servers



How files are stored

• Data files are split into blocks and distributed to the data nodes (computers)

• Each block is replicated on multiple nodes (3 is the default)

• The NameNode stores the metadata
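The storage scheme above can be sketched in a few lines of Python. This is a simplified illustration only: the 128 MB block size, the node names and the round-robin placement are assumptions made for the example, and real HDFS uses a more elaborate placement policy. The returned dictionary plays the role of the metadata the NameNode keeps.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes of the blocks a file is cut into (block size is an assumed value)."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])

def place_blocks(num_blocks, data_nodes, replication=3):
    """Assign each block to `replication` distinct data nodes, round-robin (hypothetical policy)."""
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    return {
        block_id: [data_nodes[(block_id + r) % len(data_nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }

blocks = split_into_blocks(300)  # a hypothetical 300 MB file
metadata = place_blocks(len(blocks), ["node1", "node2", "node3", "node4"])
print(blocks)
print(metadata)
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.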


What is Big Data ? Con’d
Getting data in / out of HDFS


What is Big Data ? Con’d
MapReduce

MapReduce has 3 main phases :

Phase 1 - The Mapper
• Each Map task works (typically) on one HDFS block
• Map tasks run (typically) on the same node where the block is stored

Phase 2 - Shuffle & Sort
• sorts and collects all intermediate data from all mappers
• happens after all Map tasks are completed

Phase 3 - The Reducer
• operates on the sorted / shuffled intermediate data - the output of the previous phase
• produces the final output


What is Big Data ? Con’d
Example : counting words


What is Big Data ? Con’d
Phase 1 - The mapper maps the text


What is Big Data ? Con’d
Phase 2 - Shuffle & Sort


What is Big Data ? Con’d
Phase 3 – Reduce
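A minimal Python sketch of the three phases for the word-count example, run on a single machine. The function names (`mapper`, `shuffle_and_sort`, `reducer`, `word_count`) and the sample text are illustrative assumptions, not Hadoop's API; in a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict

def mapper(line):
    """Phase 1: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Phase 2: group all intermediate values by key and sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Phase 3: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    intermediate = []
    for line in lines:  # each line stands in for one input split
        intermediate.extend(mapper(line))
    return dict(reducer(key, values) for key, values in shuffle_and_sort(intermediate))

sample = ["the cat sat on the mat", "the dog sat"]
print(word_count(sample))
# {'cat': 1, 'dog': 1, 'mat': 1, 'on': 1, 'sat': 2, 'the': 3}
```

Note that the mapper never needs to see more than its own line and the reducer never needs more than one key, which is what lets both phases be spread across machines.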


What is Big Data ? Con’d
It is important to understand that :

• Map tasks run in parallel - this reduces computation time.

• Map tasks run on the machines that contain the data, so network traffic is not a bottleneck for the map phase

• Reduce tasks also run in parallel


What is Big Data ? Con’d
Core Hadoop concepts :

• applications are written in high-level languages
• nodes talk to each other as little as possible
• data is distributed in advance
• data is replicated for increased availability and reliability
• Hadoop is scalable and fault tolerant


What is Big Data ? Con’d
Fault tolerance :
• node failure is inevitable
• what happens in this case :
– the system continues to function
– the master re-assigns tasks to a different node
– data replication - so no data is lost
– a node which recovers rejoins the cluster automatically
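The re-assignment step can be illustrated with a toy Python sketch. It is not how Hadoop's scheduler is implemented: the `reassign_tasks` function, the node and task names, and the "least-loaded node" rule are all assumptions chosen to keep the example short.

```python
def reassign_tasks(assignments, failed_node, live_nodes):
    """Move the tasks of a failed node to the least-loaded surviving nodes (hypothetical policy)."""
    live = [node for node in live_nodes if node != failed_node]
    if not live:
        raise ValueError("no live nodes left to take over the tasks")
    # Copy the current assignments of the surviving nodes.
    new_assignments = {node: list(assignments.get(node, [])) for node in live}
    for task in assignments.get(failed_node, []):
        target = min(live, key=lambda node: len(new_assignments[node]))
        new_assignments[target].append(task)
    return new_assignments

before = {"node1": ["t1", "t2"], "node2": ["t3"], "node3": []}
after = reassign_tasks(before, failed_node="node1", live_nodes=["node2", "node3"])
print(after)
```

The tasks can be re-run elsewhere only because the input blocks are replicated, so another node still holds a copy of the data the failed node was working on.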


What is Big Data ? Con’d
Scalability means :
• capacity grows linearly with the number of nodes added
• increased load results in a graceful decline in performance and not in failure


What is Big Data ? Con’d
Hadoop Ecosystem



Hadoop Ecosystem
• Querying data : Hive, Pig, Impala
• Data store : HBase (Bigtable-like, over HDFS)
• Getting data into HDFS : Flume
• Schedulers (e.g. for Hadoop MapReduce jobs, Pig jobs) : Oozie
• Machine Learning : Mahout


What is Big Data ? Con’d
Who uses Hadoop



Spark

The problem : MapReduce may be slow and does only batch processing

Solution – Spark
• Can do both batch and streaming
• Apache Spark processes data in memory, while Hadoop MapReduce persists data back to disk after each map or reduce action. Processing can be up to 100 times faster


What is Big Data ? Con’d
NoSQL (Not only SQL)
The problem : storage and retrieval of unstructured data, typically a huge amount of it.

The solution :
• NoSQL databases
• The data structures used by NoSQL databases :
– key-value : the key is the identifier
– graph : nodes + edges to represent relationships
– document : stores data as JSON documents (MongoDB, CouchDB, ..)
– …
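The three data structures listed above can be mimicked with plain Python objects. This is only a sketch of the shapes of the data, not of any particular NoSQL product; all keys, names and values below are made up for illustration.

```python
import json

# Key-value: the key is the identifier, the value is opaque to the store.
kv_store = {}
kv_store["user:42"] = "Dana"

# Graph: nodes plus edges that represent relationships between them.
graph = {
    "nodes": {"dana", "omer", "noa"},
    "edges": {("dana", "omer"), ("omer", "noa")},
}

def neighbors(graph, node):
    """Return the nodes directly connected to `node`, in sorted order."""
    linked = set()
    for a, b in graph["edges"]:
        if a == node:
            linked.add(b)
        elif b == node:
            linked.add(a)
    return sorted(linked)

# Document: a schema-free, nested record stored as JSON.
document = {"_id": 42, "name": "Dana", "likes": ["hiking", "jazz"], "address": {"city": "Haifa"}}
serialized = json.dumps(document)
restored = json.loads(serialized)

print(kv_store["user:42"])
print(neighbors(graph, "omer"))
print(restored["address"]["city"])
```

None of the three needs a schema defined up front: a second document could carry entirely different fields, which is what makes these models a fit for unstructured data.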


Why is the future so bright for Big Data

• IoT (Internet of Things) will add a huge amount of data in the coming years
• The cloud allows us to easily store a lot of data
• More data is stored as time goes by: on the net, in companies, institutions, …
• Data processing abilities improve as time goes by (Hadoop, Spark)
• The ability to store huge amounts of data improves as time goes by
• The ability to store more data + better processing leads to smarter information that can be retrieved from the data
• Smart information is power = money
