CSE6242 / CX4242: Data & Visual Analytics - Scaling Up: Hadoop


http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics

Scaling Up Hadoop

Duen Horng (Polo) Chau, Assistant Professor; Associate Director, MS Analytics; Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parishit Ram (GT PhD alum; SkyTree), Alex Gray


How to handle data that is really large?

Really big, as in...

• Petabytes (PB; 1 PB is about 1,000 terabytes)

• Or beyond: exabytes, zettabytes, etc.

Do we really need to deal with such scale?

• Yes!


“Big Data” is Common...

Google processed 24 PB / day (2009)

Facebook adds 0.5 PB / day to its data warehouses

CERN generated 200 PB of data from “Higgs boson” experiments

Avatar’s 3D effects took 1 PB to store

So, think BIG!

http://www.theregister.co.uk/2012/11/09/facebook_open_sources_corona/
http://thenextweb.com/2010/01/01/avatar-takes-1-petabyte-storage-space-equivalent-32-year-long-mp3/
http://dl.acm.org/citation.cfm?doid=1327452.1327492

How to analyze such large datasets?

First thing: how to store them?

Single machine? A 60 TB SSD has been announced. $$$$$…

Cluster of machines?

• How many machines?

• Need data backup, redundancy, recovery, etc.

• Need to worry about machine and drive failure.

https://arstechnica.com/gadgets/2016/08/seagate-unveils-60tb-ssd-the-worlds-largest-hard-drive/


Really? Really???

http://lifehacker.com/how-long-will-my-hard-drives-really-last-1700405627

How to analyze such large datasets?

3% of 100,000 hard drives fail within the first 3 months

Failure Trends in a Large Disk Drive Population
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf
http://arstechnica.com/gadgets/2015/08/samsung-unveils-2-5-inch-16tb-ssd-the-worlds-largest-hard-drive/


How to analyze such large datasets?

How to analyze them?

• What software libraries to use?
• What programming languages to learn?
• Or more generally, what framework to use?


Lecture based on Hadoop: The Definitive Guide

Book covers Hadoop, some Pig, some HBase, and other things.

http://goo.gl/YNCWN


Hadoop: open-source software for reliable, scalable, distributed computing

Written in Java

Scales to thousands of machines

• Linear scalability (with good algorithm design): if you have 2 machines, your job runs twice as fast (ideally)

Uses a simple programming model (MapReduce)

Fault tolerant (HDFS)

• Can recover from machine/disk failure (no need to restart computation)

http://hadoop.apache.org

Why learn Hadoop?

Fortune 500 companies use it

Many research groups/projects use it

Strong community support, and favored/backed by major companies, e.g., IBM, Google, Yahoo, eBay, Microsoft, etc.

It’s free, open-source

Low cost to set up (works on commodity machines)

An “essential skill”, like SQL

http://strataconf.com/strata2012/public/schedule/detail/22497


Elephant in the room

Hadoop was created by Doug Cutting and Michael Cafarella (Cutting was working at Yahoo! at the time)

Hadoop named after Doug’s son’s toy elephant


How does Hadoop scale up computation?

Uses a master-worker architecture and a simple computation model called MapReduce (popularized by Google’s paper)

A simple way to think about it:

1. Divide data and computation into smaller pieces; each machine works on one piece

2. Combine the results to produce the final result

MapReduce: Simplified Data Processing on Large Clusters
http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf

How does Hadoop scale up computation?

More technically...

1. Map phase: the master node divides data and computation into smaller pieces; each worker node (“mapper”) works on one piece independently, in parallel

2. Shuffle phase (done automatically for you): the master sorts the mappers’ results and moves them to the “reducers”

3. Reduce phase: worker nodes (“reducers”) combine the results independently, in parallel


Example: find word frequencies across text documents

Input

• “Apple Orange Mango Orange Grapes Plum”
• “Apple Plum Mango Apple Apple Plum”

Output

• Apple, 4
• Grapes, 1
• Mango, 2
• Orange, 2
• Plum, 3

http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html

Master divides the data (each worker gets one line)

Each worker (mapper) outputs a key-value pair

Pairs sorted by key (automatically done)

Each worker (reducer) combines pairs into one

A machine can be both a mapper and a reducer
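To make the flow concrete, here is a hand-worked trace of the two example lines above (illustrative only; which keys end up on which reducer depends on how the pairs are partitioned):

  Map output (mapper 1):  (Apple,1) (Orange,1) (Mango,1) (Orange,1) (Grapes,1) (Plum,1)
  Map output (mapper 2):  (Apple,1) (Plum,1) (Mango,1) (Apple,1) (Apple,1) (Plum,1)
  After shuffle/sort:     Apple -> [1,1,1,1]   Grapes -> [1]   Mango -> [1,1]   Orange -> [1,1]   Plum -> [1,1,1]
  Reduce output:          (Apple,4) (Grapes,1) (Mango,2) (Orange,2) (Plum,3)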

map(String key, String value):
  // key: document id
  // value: document contents
  for each word w in value:
    emit(w, "1");

How to implement this?
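One hedged answer: the sketch below shows roughly how this mapper could be written with Hadoop's Java MapReduce API (the class and variable names are my own, not the course's reference solution):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: for each word in the input line, emit the pair (word, 1).
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key: byte offset of the line (ignored); value: one line of the document
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);   // emit(w, "1") from the pseudocode
    }
  }
}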


reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  emit(AsString(result));

How to implement this?
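And a matching reducer plus a minimal driver, again as a hedged sketch against the standard Hadoop Java API (class names and the command-line input/output paths are placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Reducer: sum the counts for each word and emit (word, total).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();           // result += ParseInt(v) from the pseudocode
      }
      result.set(sum);
      context.write(key, result); // emit the final (word, count) pair
    }
  }

  // Driver: wires the mapper and reducer into a job and submits it.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);   // the mapper sketched above
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, this would typically be launched with something like: hadoop jar wordcount.jar WordCount <input> <output>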


What if a machine dies?

Replace it!

“map” and “reduce” jobs are redistributed (for you) to other machines

Hadoop’s HDFS (Hadoop Distributed File System) enables this


HDFS: Hadoop Distributed File System

A distributed file system

Built on top of the OS’s existing file system to provide redundancy and distribution

HDFS hides complexity of distributed storage and redundancy from the programmer

In short, you don’t need to worry much about this!
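For the curious, a minimal sketch of what talking to HDFS looks like from Java, using Hadoop's FileSystem API (assumes a configured Hadoop installation; the path is made up). The point is that it reads like ordinary file I/O, with replication and distribution handled underneath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml etc. from the classpath
    FileSystem fs = FileSystem.get(conf);          // HDFS if configured, local file system otherwise
    Path p = new Path("/tmp/hello.txt");           // hypothetical path
    try (FSDataOutputStream out = fs.create(p, true)) {  // create/overwrite the file
      out.writeUTF("hello from HDFS");
    }
    System.out.println("File exists? " + fs.exists(p));
  }
}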


“History” of HDFS and Hadoop

Hadoop & HDFS are based on...

• 2003: Google File System (GFS) paper
• 2004: Google MapReduce paper

http://cracking8hacking.com/cracking-hacking/Ebooks/Misc/pdf/The%20Google%20filesystem.pdf

http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf


What can you use Hadoop for?

As a “Swiss Army knife”.

Works for many types of analyses/tasks (but not all of them).

What if you want to write less code?

• There are tools that make it easier to write MapReduce programs (Pig), or to query results (Hive)


How to try Hadoop?

Hadoop can run on a single machine (e.g., your laptop)

• Takes < 30 min from setup to running

Or a “home-grown” cluster

• Research groups often connect retired computers as a small cluster

Or cloud services: Amazon EC2 (Amazon Elastic Compute Cloud), Microsoft Azure

• You only pay for what you use, e.g., compute time, storage
