Another Intro To Hadoop

Another Intro to Hadoop. Fridays@5, Context Optional. April 2, 2010. By Adeel Ahmad.

Description

Introduction to Hadoop. What are Hadoop, MapReduce, and the Hadoop Distributed File System? Who uses Hadoop? How do you run Hadoop? What are Pig, Hive, and Mahout?

Transcript of Another Intro To Hadoop

Page 1: Another Intro To Hadoop

Another Intro to Hadoop

Fridays@5, Context Optional

April 2, 2010, by Adeel Ahmad

Page 2: Another Intro To Hadoop

About Me

Follow me on Twitter @_adeel

The AI Show podcast: www.aishow.org

Artificial intelligence news every week.

Senior App Genius at Context Optional

We're hiring Ruby developers. Contact me!

Page 3: Another Intro To Hadoop

Too much data

User-generated, social networks, logging and tracking

Google, Yahoo and others need to index the entire internet and return search results in milliseconds

NYSE generates 1 TB of data per day

Facebook has 400 TB of stored data and ingests 20 TB of new data per day; hosts approx. 10 billion photos (about 1 PB, 2009)

Page 4: Another Intro To Hadoop

Can't scale

Challenge to both store and analyze datasets

Slow to process

Unreliable machines (CPUs and disks can go down)

Not affordable (faster, more reliable machines are expensive)

Page 5: Another Intro To Hadoop

Solve it through software

Split up the data

Run jobs in parallel

Sort and combine to get the answer

Schedule across an arbitrarily-sized cluster

Handle fault tolerance

Since even the best systems break down, use cheap commodity computers

Page 6: Another Intro To Hadoop

Enter Hadoop

Open-source Apache project written in Java

MapReduce implementation for parallelizing applications

Distributed filesystem for redundant data

Many other sub-projects

Meant for cheap, heterogeneous hardware

Scale up by simply adding more cheap hardware

Page 7: Another Intro To Hadoop

History

Open-source Apache project

Grew out of the Apache Nutch project, an open-source search engine

Based on two Google papers:

Google File System (2003): fault-tolerant storage of large amounts of data

MapReduce (2004): programming model for parallel processing

Page 8: Another Intro To Hadoop

MapReduce

Operates exclusively on <key, value> pairs

Split the input data into independent chunks

Chunks are processed by the map tasks in parallel

Sort the outputs of the maps

Send to the reduce tasks

Write to output files
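The phases above can be sketched in plain Ruby. This is only a toy in-memory simulation of the map, shuffle/sort, and reduce steps for a word count, not Hadoop code:

```ruby
# Input split into independent chunks (here, one line per chunk)
input = ["the cat sat", "the cat"]

# Map phase: emit a <word, 1> pair for every word in every chunk
pairs = input.flat_map { |line| line.split.map { |word| [word, 1] } }

# Shuffle/sort phase: group all pairs by key
grouped = pairs.group_by { |word, _| word }

# Reduce phase: sum the values for each key
counts = grouped.map { |word, kv| [word, kv.map { |_, n| n }.sum] }.sort

counts.each { |word, n| puts "#{word}\t#{n}" }
# cat  2
# sat  1
# the  2
```

On a real cluster each phase runs on different machines, but the data flow is exactly this: independent map tasks, a sort/group on keys, then reduce tasks.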

Page 9: Another Intro To Hadoop

MapReduce (diagram)

Page 10: Another Intro To Hadoop

MapReduce (diagram)

HDFS

Hadoop Distributed File System

Files split into large blocks

Designed for streaming reads and appending writes, not random access

3 replicas of each piece of data by default

Data can be stored in encoded/archived formats

Page 12: Another Intro To Hadoop

Self-managing and self-healing

Bring the computation as physically close to the data as possible for best bandwidth, instead of copying data

Tries to use same node, then same rack, then same data center

Auto-replication if data is lost

Auto-kill and restart of tasks on another node if a task is taking too long or the node is flaky

Page 13: Another Intro To Hadoop

Hadoop Streaming

Don't need to write mappers and reducers in Java

Text-based API that exposes stdin and stdout

Use any language

Ruby gems: Wukong, Mandy

Page 14: Another Intro To Hadoop

Example: Word count

# mapper.rb

STDIN.each_line do |line|
  word_count = {}
  line.split.each do |word|
    word_count[word] ||= 0
    word_count[word] += 1
  end
  word_count.each do |k, v|
    puts "#{k}\t#{v}"
  end
end

# reducer.rb

word = nil
count = 0
STDIN.each_line do |line|
  wordx, countx = line.strip.split
  if wordx != word
    puts "#{word}\t#{count}" unless word.nil?
    word = wordx
    count = 0
  end
  count += countx.to_i
end
puts "#{word}\t#{count}" unless word.nil?

Page 15: Another Intro To Hadoop

Who Uses Hadoop?

Yahoo, Facebook, Netflix, eHarmony, LinkedIn, NY Times, Digg, Flightcaster, RapLeaf, Trulia, Last.fm, Ning, CNET, and lots more...

Page 16: Another Intro To Hadoop

Developing With Hadoop

Don't need a whole cluster to start

Standalone

– Non-distributed

– Single Java process

Pseudo-distributed

– Just like fully-distributed

– Components run in separate processes

Fully distributed

– Now you need a real cluster
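As a rough sketch, pseudo-distributed mode on a 2010-era (0.20.x) Hadoop is switched on by pointing the filesystem and job tracker at localhost in the XML config files. The property names below are the 0.20-era ones; newer distributions rename them, so check your version's docs:

```xml
<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- single node, so one replica instead of the default three -->
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```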

Page 17: Another Intro To Hadoop

How to Run Hadoop

Runs on Linux, OS X, Windows, Solaris

Just need Java and SSH access to the nodes

XML config files

Download core Hadoop: it can do everything we mentioned, but still needs the user to play with config files and create scripts

Page 18: Another Intro To Hadoop

How to Run Hadoop

Cloudera Inc. provides its own distribution, enterprise support, and training for Hadoop

Core Hadoop plus patches

Bundled with command-line scripts, Hive, Pig

Publishes AMIs and scripts for EC2

Best option for running your own cluster

Page 19: Another Intro To Hadoop

How to Run Hadoop

Amazon Elastic MapReduce (EMR)

GUI or command-line cluster management

Supports Streaming, Hive, Pig

Grabs data and MapReduce code from S3 buckets and puts it into HDFS

Auto-shutdown of EC2 instances

Cloudera now has scripts for EMR

Easiest option

Page 20: Another Intro To Hadoop

Pig

High-level scripting language developed by Yahoo

Describes multi-step jobs

Translated into MapReduce tasks

Grunt command-line interface

Ex: Find the top 5 most-visited pages by users aged 18 to 25

Users = LOAD 'users' AS (name, age);

Filtered = FILTER Users BY age >= 18 AND age <= 25;

Pages = LOAD 'pages' AS (user, url);

Joined = JOIN Filtered BY name, Pages BY user;

Grouped = GROUP Joined BY url;

Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;

Sorted = ORDER Summed BY clicks DESC;

Top5 = LIMIT Sorted 5;

Page 21: Another Intro To Hadoop

Hive

High-level interface created by Facebook

Gives db-like structure to data

HiveQL: declarative language for querying

Queries get turned into MapReduce jobs

Command-line interface

Ex:

CREATE TABLE raw_daily_stats_table (dates STRING, ..., pageviews STRING);

LOAD DATA INPATH 'finaloutput' INTO TABLE raw_daily_stats_table;

SELECT … FROM … JOIN ...
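The talk leaves the query itself elided; as a sketch of the kind of HiveQL that would run against the table created above (the aggregation and column handling here are illustrative, not from the talk):

```sql
-- Roll up pageviews per day; pageviews was declared STRING, so cast it
SELECT dates, SUM(CAST(pageviews AS INT)) AS total_views
FROM raw_daily_stats_table
GROUP BY dates;
```

Hive compiles a query like this into one or more MapReduce jobs: the GROUP BY becomes the shuffle key, and the SUM runs in the reducers.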

Page 22: Another Intro To Hadoop

Mahout

Machine-learning libraries for Hadoop

– Collaborative filtering

– Clustering

– Frequent pattern recognition

– Genetic algorithms

Applications

– Product/friend recommendation

– Classify content into defined groups

– Find associations, patterns, behaviors

– Identify important topics in conversations

Page 23: Another Intro To Hadoop

More stuff

HBase – database based on Google's Bigtable

Sqoop – database import tool

ZooKeeper – coordination service for distributed apps to keep track of servers, like a filesystem

Avro – data serialization system

Scribe – logging system developed by Facebook