A Simple Introduction to Big Data and Hadoop

Post on 24-Jan-2017

BIG DATA

BIG DATA EXAMPLE

• Social media (likes, friends, videos, pictures, tweets, …)
• Mobile signals, sensors, clicks
• Online shopping, stocks
• Code
• …

BUY A BOOK FROM AMAZON

• Amazon knows what you searched for
• It knows everything you have ever bought
• It knows how much you are willing to pay
• It can ask Facebook (friends, likes, hangouts, …)
• It knows who else is buying what

BIG DATA USAGE?

WHAT IS BIG DATA?

• Any data that you cannot store on a single PC
• The 3 Vs (Volume, Velocity, Variety)

APACHE HADOOP

• Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

DISTRIBUTED STORAGE: HDFS (HADOOP DISTRIBUTED FILE SYSTEM)

SUPERCOMPUTER OR NORMAL COMPUTERS?

WHY HDFS?

• What if something goes wrong (hardware failure)?
• What is the cost of a supercomputer?
• How easily can we add capacity?

• HDFS automatically handles hardware failure
• It automatically backs up data by keeping several copies of it
• To add capacity, just buy new cheap computers
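The idea behind these answers can be sketched in a few lines of Python. This is a toy model, not the real HDFS API: the node names, the tiny block size, and the helper functions `split_into_blocks`, `place_blocks`, and `read_file` are all made up for illustration. The only values taken from HDFS itself are its defaults of 3 copies per block and 128 MB per block.

```python
# Toy model of HDFS-style block replication; not the real HDFS API.

BLOCK_SIZE = 4    # bytes per block in this toy; real HDFS defaults to 128 MB
REPLICATION = 3   # copies kept of each block; 3 is the HDFS default


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Store each block on `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    storage = {node: {} for node in nodes}
    for block_id, block in enumerate(blocks):
        for k in range(replication):
            node = nodes[(block_id + k) % len(nodes)]
            storage[node][block_id] = block
    return storage


def read_file(storage, num_blocks, dead_nodes=frozenset()):
    """Rebuild the file from any surviving replica of each block."""
    parts = []
    for block_id in range(num_blocks):
        replica = next(
            (held[block_id] for node, held in storage.items()
             if node not in dead_nodes and block_id in held),
            None,
        )
        if replica is None:
            raise IOError(f"block {block_id} lost: every replica is on a dead node")
        parts.append(replica)
    return b"".join(parts)


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cheap computers
data = b"big data on cheap hardware"
blocks = split_into_blocks(data)
storage = place_blocks(blocks, nodes)

# Two of the four machines fail, yet every block still has a live replica.
recovered = read_file(storage, len(blocks), dead_nodes={"node1", "node2"})
```

With four nodes and three copies per block, any two machines can fail without losing data. Adding capacity in this model only means appending another name to `nodes`.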

DISTRIBUTED PROCESSING (MAPREDUCE)

• Problem: count the number of trees in the United States
• Solution 1: ask Superman?
• Solution 2: ask 1,000 people?
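"Ask 1,000 people" is the MapReduce idea: split the work, let each worker count its own share (map), then add up the partial counts (reduce). The sketch below shows that shape in plain Python rather than with Hadoop itself. The area names and tree counts are invented, and `split`, `map_count`, and `reduce_sum` are hypothetical helper names. Real MapReduce jobs also emit key-value pairs and group them by key, which this simplified version leaves out.

```python
from functools import reduce

# Hypothetical tree counts per surveyed area; the numbers are made up.
areas = [
    ("area-1", 120), ("area-2", 75), ("area-3", 310),
    ("area-4", 42), ("area-5", 208), ("area-6", 95),
]


def split(items, num_workers):
    """Deal the areas out to the workers, one share each."""
    return [items[i::num_workers] for i in range(num_workers)]


def map_count(share):
    """Map step: one worker counts the trees in its own share only."""
    return sum(count for _area, count in share)


def reduce_sum(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


shares = split(areas, num_workers=3)

# Each call is independent, so on a cluster these could run in parallel.
partial_counts = [map_count(share) for share in shares]

total_trees = reduce(reduce_sum, partial_counts, 0)
```

Because no worker needs to know about any other worker's share, adding more workers shortens the map step without changing the final total.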

BIG DATA USAGE IN COMPUTER SCIENCE

• Mining repositories
• Ownership (plagiarism, copyright)
• Detecting code smells
• Auto commenting
• Predicting bugs, bug reports

OTHER TOPICS

• Data scientist
• NoSQL
• Machine learning