Perl on Amazon Elastic MapReduce


A Gentle Introduction to MapReduce

• Distributed computing model

• Mappers process the input and forward intermediate results to reducers.

• Reducers aggregate these intermediate results, and emit the final results.


A sort/shuffle phase between the two steps guarantees that all mapper results for a single key go to the same reducer, and that the workload is distributed evenly.

$ map | sort | reduce


MapReduce

• Input data is sent to mappers as (k, v) pairs.

• After processing, mappers emit (k_out, v_out).

• These pairs are sorted and sent to reducers.

• All (k_out, v_out) pairs for a given k_out are sent to a single reducer.


The sorting guarantees that all values for a given key are sent to a single reducer.

MapReduce

• Reducers get (k, [v1, v2, …, vn]).

• After processing, the reducer emits a (k_f, v_f) pair per result.
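As a concrete illustration of that contract (not from the original talk; the file names are made up), here is a word-count job in the streaming style used later. The mapper emits a ( word, 1 ) pair per word, and the reducer, which sees those pairs sorted by word, sums them:

#!/usr/bin/env perl
# wc_mapper.pl - emit a ( word, 1 ) pair for every word on STDIN
use strict;
use warnings;

while ( <> ) {
    chomp;
    print "$_\t1\n" for split ' ';
}

#!/usr/bin/env perl
# wc_reducer.pl - input arrives sorted on the key, so all ( word, 1 ) pairs
# for a given word are adjacent; sum them and emit ( word, count )
use strict;
use warnings;

my ( $current, $count ) = ( undef, 0 );
while ( <> ) {
    chomp;
    my ( $word, $n ) = split /\t/;
    if ( defined $current && $word ne $current ) {
        print "$current\t$count\n";
        $count = 0;
    }
    $current = $word;
    $count  += $n;
}
print "$current\t$count\n" if defined $current;

Locally, this is literally the pipeline from the previous slide:

$ cat input.txt | ./wc_mapper.pl | sort | ./wc_reducer.pl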


MapReduce

We wanted to have a world map showing where people were starting our games (like Mozilla Glow).


Mozilla Glow tracked Firefox 4 downloads on a world map, in near real-time.

Glowfish


MapReduce

• Input: ( epoch, IP address )

• Mappers group these into 5-minute blocks, and emit ( blockId, IP address )

• Reducers get ( blockId, [ip1, ip2, …, ipn] )

• Do a geo lookup and emit

( epoch, [ ( lat1, lon1 ), ( lat2, lon2 ), … ] )


On a 50-node cluster, processing ~3BN events takes 11 minutes, including data transfers. Two hours' worth takes 3 minutes, so we can easily have data from 5 minutes ago. One day to modify the Glow protocol, one day to build. Everything stored on S3.

$ map | sort | reduce


Apache Hadoop

• Distributed programming framework

• Implements MapReduce

• Does all the usual distributed programming heavy-lifting for you

• Highly fault-tolerant, with automatic task re-assignment in case of failure

• You focus on mappers and reducers


Serialisation, heartbeat, node management, directory, etc. Speculative task execution: the first one to finish wins. Potentially very simple and contained code.

Apache Hadoop

• Native Java API

• Streaming API which can use mappers and reducers written in any programming language.

• Distributed file system (HDFS)

• Distributed Cache


You supply the mapper, reducer, and driver code

Amazon Elastic MapReduce

• On-demand Hadoop clusters running on EC2 instances.

• Improved S3 support for storage of input and output data.

• Build workflows by sending jobs to a cluster.


S3 gives you virtually unlimited storage with very high redundancy. S3 performance: roughly 750 MB/s of uncompressed data (110-byte rows -> ~7M rows/sec). All of this is controlled using a REST API. Jobs are called ‘steps’ in EMR lingo.
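For instance, with the Ruby elastic-mapreduce command-line tool linked at the end of these slides, a streaming step can be submitted while creating a job flow, roughly like this (a sketch; the job name is made up and the exact flags varied between tool versions):

$ elastic-mapreduce --create --name "glowfish" \
    --num-instances 50 \
    --stream \
    --input   s3://events/2011-30-10 \
    --output  s3://glowfish/output/2011-30-10 \
    --mapper  s3://glowfish/bin/mapper.pl \
    --reducer s3://glowfish/bin/reducer.pl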

EMR Downsides

• No control over the machine images.

• Perl 5.8.8

• Ephemeral: when your cluster is shut down (or dies), HDFS is gone.

• HDFS not available at cluster-creation time.

• Debian


No way to customise the image and, e.g., install your own Perl. So it’s a good idea to store the final results of a workflow in S3. No way to store dependencies in HDFS when the cluster is created.

Streaming vs. Native

$ cat | map | sort | reduce


Streaming vs. Native

Instead of

( k, [ v1, v2, …, vn ] )

reducers get

(( k1, v1 ), …, ( k1, vn ), ( k2, v1 ), …, ( k2, vm ))
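So a streaming reducer has to notice the key change itself and do its own grouping, the same trick as in the word-count sketch earlier. A minimal skeleton of the pattern (the per-key processing here just counts values; a real job would do something more interesting):

#!/usr/bin/env perl
# Generic streaming reducer skeleton: input is "key<TAB>value" lines sorted
# on key, so a change of key means the previous group is complete.
use strict;
use warnings;

my ( $previous_key, @values );
while ( <> ) {
    chomp;
    my ( $key, $value ) = split /\t/, $_, 2;
    if ( defined $previous_key && $key ne $previous_key ) {
        emit_group( $previous_key, \@values );
        @values = ();
    }
    push @values, $value;
    $previous_key = $key;
}
emit_group( $previous_key, \@values ) if defined $previous_key;

# Stand-in for the real per-key processing: just report the group size.
sub emit_group {
    my ( $key, $vals ) = @_;
    print "$key\t", scalar @$vals, "\n";
}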


Composite Keys

• Reducers receive both keys and values sorted

• Merge 3 tables:

userid, 0, … # customer info

userid, 1, … # payments history

userid, recordid1, … # clickstream

userid, recordid2, … # clickstream


If you set a value to 0, you know it will be the first (k, v) pair the reducer sees for that key, 1 will be the second, etc. When the userid changes, it’s a new user.
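A sketch of a reducer consuming that merged stream (assuming lines of the form userid<TAB>tag<TAB>payload, partitioned on userid but sorted on both fields; what gets done with each record type is purely illustrative):

#!/usr/bin/env perl
# Merge-join sketch: for each user the customer record (tag 0) arrives first,
# then the payments record (tag 1), then the clickstream records.
use strict;
use warnings;

my ( $current_user, $customer, $clicks ) = ( undef, undef, 0 );
while ( <> ) {
    chomp;
    my ( $userid, $tag, $payload ) = split /\t/, $_, 3;
    if ( defined $current_user && $userid ne $current_user ) {
        # the previous user is complete; emit whatever we accumulated
        print join( "\t", $current_user, defined $customer ? 'known' : 'unknown', $clicks ), "\n";
        ( $customer, $clicks ) = ( undef, 0 );
    }
    $current_user = $userid;
    if ( $tag eq '0' ) {
        $customer = $payload;          # customer info
    }
    elsif ( $tag eq '1' ) {
        # payments history would be handled here
    }
    else {
        $clicks++;                     # one clickstream record
    }
}
print join( "\t", $current_user, defined $customer ? 'known' : 'unknown', $clicks ), "\n"
    if defined $current_user;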

Streaming vs. Native

• Limited API

• About a 7-10% increase in run time

• About a 1000% decrease in development time (as reported by a non-representative sample of developers)


E.g., no control over output file names, many of the API settings can’t be configured programmatically (only via command-line switches), no separate mappers per input, etc. Because reducer input is also sorted on keys, when the key changes you know you won’t be seeing any more values for it. You might need to keep track of the current key, to use as the previous one.

Where’s My Towel?

• Tasks run chrooted in a non-deterministic location.

• It’s easy to store files in HDFS when submitting a job, but impossible to store directory trees.

• For native Java jobs, your dependencies get packaged in the JAR alongside your code.


So how do you get all the CPAN goodness you know and love in there? HDFS operations are limited to copy, move, and delete, and the host OS doesn’t see it, so no untar’ing!

Streaming’s Little Helpers

Define your inputs and outputs:

--input s3://events/2011-30-10

--output s3://glowfish/output/2011-30-10


Can have multiple inputs

Streaming’s Little Helpers

You can use any class in Hadoop’s classpath as a comparator, partitioner, codec, etc., and several come bundled:

-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator

-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner


That -D is a Hadoop define, not a JVM system property definition
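Put together, a streaming invocation that sorts on the first two fields but partitions on the first one only (the composite-key trick from a few slides back) might look roughly like this (a sketch using the pre-1.0 mapred.* option names and a generic streaming jar path; check the exact names for your Hadoop version):

$ hadoop jar hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-k1,2 \
    -D mapred.text.key.partitioner.options=-k1,1 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input s3://events/2011-30-10 \
    -output s3://glowfish/output/2011-30-10 \
    -mapper s3://glowfish/bin/mapper.pl \
    -reducer s3://glowfish/bin/reducer.pl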

Streaming’s Little Helpers

• Use S3 to store…

• input data

• output data

• supporting data (e.g., Geo-IP)

• your code


On a streaming job you specify the programs to use as mapper and reducer

Mapper and Reducer

To specify the mapper and reducer to be used in your streaming job, you can point Hadoop to S3:

--mapper s3://glowfish/bin/mapper.pl

--reducer s3://glowfish/bin/reducer.pl


Support Files

When specifying a file to store in the Distributed Cache, the URI fragment will be used as the name of a symlink in the task’s local filesystem:

-cacheFile s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat


The symlink is created in the otherwise unknown directory where the task is running, making the file accessible to it.
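That is exactly how the reducer shown later opens the GeoIP database, by the fragment name, relative to wherever the task happens to run:

use Geo::IP;

# 'GeoLiteCity.dat' is the fragment from the -cacheFile URI above; the
# Distributed Cache symlinks it into the task's working directory.
my $geo = Geo::IP->open( 'GeoLiteCity.dat', GEOIP_MEMORY_CACHE )
    or die "Could not open GeoIP database: $!\n";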

Dependencies

But if you store an archive (Zip, TGZ, or JAR) in the Distributed Cache, …

-cacheArchive s3://glowfish/lib/perllib.tgz


Dependencies

…and give the archive URI a fragment, just like with -cacheFile:

-cacheArchive s3://glowfish/lib/perllib.tgz#locallib


Dependencies

Hadoop will uncompress it and create a link to whatever directory it created, in the task’s working directory.


Dependencies

Which is where it stores your mapper and reducer.


Dependencies

use lib qw/ locallib /;


Mapper

#!/usr/bin/env perl
use strict;
use warnings;

use lib qw/ locallib /;    # CPAN deps unpacked from the cached archive

use JSON::PP;

my $decoder    = JSON::PP->new->utf8;
my $missing_ip = 0;

while ( <> ) {
    chomp;
    next unless /load_complete/;    # only the events we care about
    my @line = split /\t/;
    # column 1 is a millisecond epoch; bucket it into 5-minute (300 s) slots
    my ( $epoch, $payload ) = ( int( $line[1] / 1000 / 300 ), $line[5] );
    my $json = $decoder->decode( $payload );
    if ( ! exists $json->{'ip'} ) {
        $missing_ip++;
        next;
    }
    print "$epoch\t$json->{'ip'}\n";
}

print STDERR "reporter:counter:Job Counters,MISSING_IP,$missing_ip\n";


At the end of the job, Hadoop aggregates counters from all tasks.

Reducer

#!/usr/bin/env perl
use strict;
use warnings;

use lib qw/ locallib /;    # CPAN deps unpacked from the cached archive

use Geo::IP;
use Regexp::Common qw/ net /;
use Readonly;

Readonly::Scalar my $TAB => "\t";

# 'GeoLiteCity.dat' is the Distributed Cache symlink set up with -cacheFile
my $geo = Geo::IP->open( 'GeoLiteCity.dat', GEOIP_MEMORY_CACHE )
    or die "Could not open GeoIP database: $!\n";

my $format_errors      = 0;
my $invalid_ip_address = 0;
my $geo_lookup_errors  = 0;

my ( $time_slot, $ip_addr );
my $previous_time_slot = -1;


while ( <> ) {
    chomp;
    my @cols = split $TAB;
    if ( scalar @cols != 2 ) {
        $format_errors++;
        next;
    }
    # no 'my' here: the final emit after the loop needs the last $time_slot
    ( $time_slot, $ip_addr ) = @cols;
    if ( $previous_time_slot != -1 && $time_slot != $previous_time_slot ) {
        # we've entered a new time slot, write the previous one out
        emit( $time_slot, $previous_time_slot );
    }
    if ( $ip_addr !~ /$RE{net}{IPv4}/ ) {
        $invalid_ip_address++;
        $previous_time_slot = $time_slot;
        next;
    }


    my $geo_record = $geo->record_by_addr( $ip_addr );
    if ( ! defined $geo_record ) {
        $geo_lookup_errors++;
        $previous_time_slot = $time_slot;
        next;
    }

    # update entry for time slot with lat and lon

    $previous_time_slot = $time_slot;
} # while ( <> )

# flush the last time slot
emit( $time_slot + 1, $time_slot );

print STDERR "reporter:counter:Job Counters,FORMAT_ERRORS,$format_errors\n";
print STDERR "reporter:counter:Job Counters,INVALID_IPS,$invalid_ip_address\n";
print STDERR "reporter:counter:Job Counters,GEO_LOOKUP_ERRORS,$geo_lookup_errors\n";



Recap

• EMR clusters are volatile.

• Values for a given key will all go to a single reducer, sorted.

• Use S3 for everything, and plan your dataflow ahead.


( On data )

• Store it wisely, e.g., using a directory structure like the following to get free partitioning in Hive and other tools:

s3://bucket/path/data/run_date=2011-11-12

• Don’t worry about getting the data out of S3: you can always write a simple job that does that and run it at the end of your workflow.


Hive partitioning

Recap

• EMR clusters are volatile.

• Values for a given key will all go to a single reducer, sorted. Watch for the key changing.

• Use S3 for everything, and plan your dataflow ahead.

• Make carton a part of your life, and especially of your build tool’s.


( carton )

• Shipwright for humans

• Reads dependencies from Makefile.PL

• Installs them locally to your app

• Deploy your stuff, including carton.lock

• Run carton install --deployment

• Tar result and upload to S3
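A sketch of that packaging step (the archive name and bucket are the ones used earlier; the local/lib/perl5 path and s3cmd are assumptions, substitute whatever your carton version and upload tool actually use):

$ carton install --deployment              # install exactly what carton.lock says, into local/
$ tar czf perllib.tgz -C local/lib/perl5 . # module dirs at the top level, for use lib qw/ locallib /
$ s3cmd put perllib.tgz s3://glowfish/lib/perllib.tgz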


URLs

• The MapReduce Paper: http://labs.google.com/papers/mapreduce.html

• Apache Hadoop: http://hadoop.apache.org/

• Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/


URLs

• Amazon EMR Perl Client Library: http://aws.amazon.com/code/Elastic-MapReduce/2309

• Amazon EMR Command-Line Tool: http://aws.amazon.com/code/Elastic-MapReduce/2264


That’s All, Folks!

Slides available at http://slideshare.net/pfig/perl-on-amazon-elastic-mapreduce

me@pedrofigueiredo.org
