Another Intro To Hadoop
Uploaded by adeel-ahmad
Category: Technology
Transcript of Another Intro To Hadoop
About Me
Follow me on Twitter @_adeel
The AI Show podcast: www.aishow.org
Artificial intelligence news every week.
Senior App Genius at Context Optional
We're hiring Ruby developers. Contact me!
Too much data
User-generated, social networks, logging and tracking
Google, Yahoo and others need to index the entire internet and return search results in milliseconds
NYSE generates 1 TB of data/day
Facebook has 400 terabytes of stored data and ingests 20 terabytes of new data per day; hosts approx. 10 billion photos, 1 petabyte (2009)
Can't scale
Challenge to both store and analyze datasets
Slow to process
Unreliable machines (CPUs and disks can go down)
Not affordable (faster, more reliable machines are expensive)
Solve it through software
Split up the data
Run jobs in parallel
Sort and combine to get the answer
Schedule across arbitrarily-sized clusters
Handle fault tolerance
Since even the best systems break down, use cheap commodity computers
Enter Hadoop
Open-source Apache project written in Java
MapReduce implementation for parallelizing applications
Distributed filesystem for redundant data
Many other sub-projects
Meant for cheap, heterogeneous hardware
Scale up by simply adding more cheap hardware
History
Open-source Apache project
Grew out of Apache Nutch project, an open-source search engine
Based on two Google papers:
The Google File System (2003): fault-tolerant storage of large amounts of data
MapReduce (2004): programming model for parallel processing
MapReduce
Operates exclusively on <key, value> pairs
Split the input data into independent chunks
Processed by the map tasks in parallel
Sort the outputs of the maps
Send to the reduce tasks
Write to output files
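The split/map/sort/reduce flow above can be sketched in a few lines of plain Ruby — a toy, single-process illustration of the model, not the Hadoop API:

```ruby
# Toy single-process MapReduce: count words across input "chunks".
chunks = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each chunk independently emits <word, 1> pairs.
pairs = chunks.flat_map { |chunk| chunk.split.map { |word| [word, 1] } }

# Shuffle/sort phase: group the pairs by key.
grouped = pairs.group_by { |word, _| word }

# Reduce phase: sum the counts for each word.
counts = grouped.map { |word, ps| [word, ps.map { |_, n| n }.sum] }.to_h

counts.each { |word, n| puts "#{word}\t#{n}" }
```

On a real cluster each chunk's map runs on a different node, and the grouped pairs are streamed to reduce tasks rather than held in one hash.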
HDFS
Hadoop Distributed File System
Files split into large blocks
Designed for streaming reads and appending writes, not random access
3 replicas for each piece of data by default
Data can be in encoded/archived formats
Self-managing and self-healing
Bring the computation as physically close to the data as possible for best bandwidth, instead of copying data
Tries to use same node, then same rack, then same data center
Auto-replication if data lost
Auto-kill and restart of tasks on another node if taking too long or flaky
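The same-node, same-rack, same-data-center preference can be illustrated with a toy scorer (hypothetical node hashes and field names; real HDFS scheduling also weighs load and topology):

```ruby
# Toy locality chooser: prefer the node holding the data,
# then a node in the same rack, then any node in the data center.
def pick_node(data_node, candidates)
  candidates.min_by do |node|
    if node[:name] == data_node[:name] then 0  # same node: best bandwidth
    elsif node[:rack] == data_node[:rack] then 1  # same rack
    else 2  # same data center
    end
  end
end

data = { name: "node-a1", rack: "rack-a" }
cluster = [
  { name: "node-b1", rack: "rack-b" },
  { name: "node-a2", rack: "rack-a" },
]
puts pick_node(data, cluster)[:name]  # node-a2: the same-rack candidate wins
```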
Hadoop Streaming
Don't need to write mappers and reducers in Java
Text-based API that exposes stdin and stdout
Use any language
Ruby gems: Wukong, Mandy
Example: Word count
# mapper.rb
STDIN.each_line do |line|
  word_count = {}
  line.split.each do |word|
    word_count[word] ||= 0
    word_count[word] += 1
  end
  word_count.each do |k, v|
    puts "#{k}\t#{v}"
  end
end

# reducer.rb
word = nil
count = 0
STDIN.each_line do |line|
  wordx, countx = line.strip.split
  if wordx != word
    puts "#{word}\t#{count}" unless word.nil?
    word = wordx
    count = 0
  end
  count += countx.to_i
end
puts "#{word}\t#{count}" unless word.nil?
Who Uses Hadoop?
Yahoo Facebook Netflix eHarmony LinkedIn NY Times Digg
Flightcaster RapLeaf Trulia Last.fm Ning CNET Lots more...
Developing With Hadoop
Don't need a whole cluster to start
Standalone
– Non-distributed
– Single Java process
Pseudo-distributed
– Just like full-distributed
– Components in separate processes
Full-distributed
– Now you need a real cluster
How to Run Hadoop
Linux, OSX, Windows, Solaris
Just need Java, SSH access to nodes
XML config files
Download core Hadoop
Can do everything we mentioned
Still needs user to play with config files and create scripts
How to Run Hadoop
Cloudera Inc. provides their own distribution, plus enterprise support and training, for Hadoop
Core Hadoop plus patches
Bundled with command-line scripts, Hive, Pig
Publish AMI and scripts for EC2
Best option for your own cluster
How to Run Hadoop
Amazon Elastic MapReduce (EMR)
GUI or command-line cluster management
Supports Streaming, Hive, Pig
Grabs data and MapReduce code from S3 buckets and puts it into HDFS
Auto-shutdown of EC2 instances
Cloudera now has scripts for EMR
Easiest option
Pig
High-level scripting language developed by Yahoo
Describes multi-step jobs
Translated into MapReduce tasks
Grunt command-line interface
Ex: Find top 5 most visited pages by users aged 18 to 25
Users = LOAD 'users' AS (name, age);
Filtered = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Joined = JOIN Filtered BY name, Pages BY user;
Grouped = GROUP Joined BY url;
Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
Top5 = LIMIT Sorted 5;
Hive
High-level interface created by Facebook
Gives db-like structure to data
HiveQL declarative language for querying
Queries get turned into MapReduce jobs
Command-line interface
Ex:
CREATE TABLE raw_daily_stats_table (dates STRING, ..., pageviews STRING);
LOAD DATA INPATH 'finaloutput' INTO TABLE raw_daily_stats_table;
SELECT … FROM … JOIN ...
Mahout
Machine-learning libraries for Hadoop
– Collaborative filtering
– Clustering
– Frequent pattern recognition
– Genetic algorithms
Applications
– Product/friend recommendation
– Classify content into defined groups
– Find associations, patterns, behaviors
– Identify important topics in conversations
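The collaborative-filtering idea behind product/friend recommendation can be sketched in plain Ruby — a toy overlap score on a made-up dataset, not Mahout's actual algorithms:

```ruby
# Toy user-based recommendation: suggest items liked by users
# who share the most items with the target user.
likes = {
  "alice" => ["hadoop", "pig", "hive"],
  "bob"   => ["hadoop", "pig", "mahout"],
  "carol" => ["ruby", "rails"],
}

def recommend(user, likes)
  mine = likes[user]
  scores = Hash.new(0)
  likes.each do |other, items|
    next if other == user
    overlap = (mine & items).size           # similarity: shared likes
    (items - mine).each { |item| scores[item] += overlap }
  end
  scores.select { |_, s| s > 0 }.sort_by { |_, s| -s }.map(&:first)
end

recommend("alice", likes)  # => ["mahout"] (bob shares two likes with alice)
```

Mahout runs this kind of computation as MapReduce jobs so it scales to millions of users.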
More stuff
HBase – database based on Google's Bigtable
Sqoop – database import tool
Zookeeper – coordination service for distributed apps to keep track of servers, like a filesystem
Avro – data serialization system
Scribe – logging system developed by Facebook