Another Intro To Hadoop
Uploaded by adeel-ahmad
Category: Technology
Transcript of Another Intro To Hadoop
About Me
Follow me on Twitter @_adeel
The AI Show podcast: www.aishow.org
Artificial intelligence news every week.
Senior App Genius at Context Optional
We're hiring Ruby developers. Contact me!
Too much data
User-generated, social networks, logging and tracking
Google, Yahoo and others need to index the entire internet and return search results in milliseconds
NYSE generates 1 TB of data/day
Facebook has 400 terabytes of stored data and ingests 20 terabytes of new data per day; hosts approx. 10 billion photos, 1 petabyte (2009)
Can't scale
Challenge to both store and analyze datasets
Slow to process
Unreliable machines (CPUs and disks can go down)
Not affordable (faster, more reliable machines are expensive)
Solve it through software
Split up the data
Run jobs in parallel
Sort and combine to get the answer
Schedule across arbitrarily-sized clusters
Handle fault tolerance
Since even the best systems break down, use cheap commodity computers
Enter Hadoop
Open-source Apache project written in Java
MapReduce implementation for parallelizing applications
Distributed filesystem for redundant data
Many other sub-projects
Meant for cheap, heterogeneous hardware
Scale up by simply adding more cheap hardware
History
Open-source Apache project
Grew out of Apache Nutch project, an open-source search engine
Based on two Google papers:
The Google File System (2003): fault-tolerant storage of large amounts of data
MapReduce (2004): programming model for parallel processing
MapReduce
Operates exclusively on <key, value> pairs
Split the input data into independent chunks
Processed by the map tasks in parallel
Sort the outputs of the maps
Send to the reduce tasks
Write to output files
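The split/map/sort/reduce flow above can be sketched in a few lines of plain Ruby — a toy, single-process illustration of the model, not the Hadoop API:

```ruby
# Toy single-process MapReduce: count words across input "chunks".
chunks = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each chunk independently emits <word, 1> pairs.
pairs = chunks.flat_map { |chunk| chunk.split.map { |word| [word, 1] } }

# Shuffle/sort phase: group the pairs by key.
grouped = pairs.group_by { |word, _| word }

# Reduce phase: sum the counts for each word.
counts = grouped.map { |word, ps| [word, ps.map { |_, n| n }.sum] }.to_h

counts.each { |word, n| puts "#{word}\t#{n}" }
```

On a real cluster each chunk's map runs on a different node, and the grouped pairs are streamed to reduce tasks rather than held in one hash.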
HDFS
Hadoop Distributed File System
Files split into large blocks
Designed for streaming reads and appending writes, not random access
3 replicas for each piece of data by default
Data can be in encoded/archived formats
Self-managing and self-healing
Bring the computation as physically close to the data as possible for best bandwidth, instead of copying data
Tries to use same node, then same rack, then same data center
Auto-replication if data lost
Auto-kill and restart of tasks on another node if taking too long or flaky
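The same-node, same-rack, same-data-center preference can be illustrated with a toy scorer (hypothetical node hashes and field names; real HDFS scheduling also weighs load and topology):

```ruby
# Toy locality chooser: prefer the node holding the data,
# then a node in the same rack, then any node in the data center.
def pick_node(data_node, candidates)
  candidates.min_by do |node|
    if node[:name] == data_node[:name] then 0  # same node: best bandwidth
    elsif node[:rack] == data_node[:rack] then 1  # same rack
    else 2  # same data center
    end
  end
end

data = { name: "node-a1", rack: "rack-a" }
cluster = [
  { name: "node-b1", rack: "rack-b" },
  { name: "node-a2", rack: "rack-a" },
]
puts pick_node(data, cluster)[:name]  # node-a2: the same-rack candidate wins
```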
Hadoop Streaming
Don't need to write mappers and reducers in Java
Text-based API that exposes stdin and stdout
Use any language
Ruby gems: Wukong, Mandy
Example: Word count
# mapper.rb
STDIN.each_line do |line|
  word_count = {}
  line.split.each do |word|
    word_count[word] ||= 0
    word_count[word] += 1
  end
  word_count.each do |k, v|
    puts "#{k}\t#{v}"
  end
end

# reducer.rb
word = nil
count = 0
STDIN.each_line do |line|
  wordx, countx = line.strip.split
  if wordx != word
    puts "#{word}\t#{count}" unless word.nil?
    word = wordx
    count = 0
  end
  count += countx.to_i
end
puts "#{word}\t#{count}" unless word.nil?
Who Uses Hadoop?
Yahoo Facebook Netflix eHarmony LinkedIn NY Times Digg
Flightcaster RapLeaf Trulia Last.fm Ning CNET Lots more...
Developing With Hadoop
Don't need a whole cluster to start
Standalone
– Non-distributed
– Single Java process
Pseudo-distributed
– Just like full-distributed
– Components in separate processes
Full-distributed
– Now you need a real cluster
How to Run Hadoop
Linux, OSX, Windows, Solaris
Just need Java, SSH access to nodes
XML config files
Download core Hadoop
Can do everything we mentioned
Still needs user to play with config files and create scripts
How to Run Hadoop
Cloudera Inc. provides their own distribution, plus enterprise support and training, for Hadoop
Core Hadoop plus patches
Bundled with command-line scripts, Hive, Pig
Publish AMI and scripts for EC2
Best option for your own cluster
How to Run Hadoop
Amazon Elastic MapReduce (EMR)
GUI or command-line cluster management
Supports Streaming, Hive, Pig
Grabs data and MapReduce code from S3 buckets and puts it into HDFS
Auto-shutdown of EC2 instances
Cloudera now has scripts for EMR
Easiest option
Pig
High-level scripting language developed by Yahoo
Describes multi-step jobs
Translated into MapReduce tasks
Grunt command-line interface
Ex: Find top 5 most visited pages by users aged 18 to 25
Users = LOAD 'users' AS (name, age);
Filtered = FILTER Users BY age >= 18 AND age <= 25;
Pages = LOAD 'pages' AS (user, url);
Joined = JOIN Filtered BY name, Pages BY user;
Grouped = GROUP Joined BY url;
Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
Sorted = ORDER Summed BY clicks DESC;
Top5 = LIMIT Sorted 5;
Hive
High-level interface created by Facebook
Gives db-like structure to data
HiveQL declarative language for querying
Queries get turned into MapReduce jobs
Command-line interface
Ex:
CREATE TABLE raw_daily_stats_table (dates STRING, ..., pageviews STRING);
LOAD DATA INPATH 'finaloutput' INTO TABLE raw_daily_stats_table;
SELECT … FROM … JOIN ...
Mahout
Machine-learning libraries for Hadoop
– Collaborative filtering
– Clustering
– Frequent pattern recognition
– Genetic algorithms
Applications
– Product/friend recommendation
– Classify content into defined groups
– Find associations, patterns, behaviors
– Identify important topics in conversations
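The collaborative-filtering idea behind product/friend recommendation can be sketched in plain Ruby — a toy overlap score on a made-up dataset, not Mahout's actual algorithms:

```ruby
# Toy user-based recommendation: suggest items liked by users
# who share the most items with the target user.
likes = {
  "alice" => ["hadoop", "pig", "hive"],
  "bob"   => ["hadoop", "pig", "mahout"],
  "carol" => ["ruby", "rails"],
}

def recommend(user, likes)
  mine = likes[user]
  scores = Hash.new(0)
  likes.each do |other, items|
    next if other == user
    overlap = (mine & items).size           # similarity: shared likes
    (items - mine).each { |item| scores[item] += overlap }
  end
  scores.select { |_, s| s > 0 }.sort_by { |_, s| -s }.map(&:first)
end

recommend("alice", likes)  # => ["mahout"] (bob shares two likes with alice)
```

Mahout runs this kind of computation as MapReduce jobs so it scales to millions of users.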
More stuff
HBase – database based on Google's Bigtable
Sqoop – database import tool
Zookeeper – coordination service for distributed apps to keep track of servers, like a filesystem
Avro – data serialization system
Scribe – logging system developed by Facebook