Perl on Amazon Elastic MapReduce


Transcript of Perl on Amazon Elastic MapReduce

Page 1: Perl on Amazon Elastic MapReduce

Perl on Amazon Elastic MapReduce


Page 2: Perl on Amazon Elastic MapReduce

A Gentle Introduction to MapReduce

• Distributed computing model

• Mappers process the input and forward intermediate results to reducers.

• Reducers aggregate these intermediate results, and emit the final results.


A sort/shuffle phase between the two steps guarantees that all mapper results for a single key go to the same reducer, and that the workload is distributed evenly.

Page 3: Perl on Amazon Elastic MapReduce

$ map | sort | reduce
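To make the pipeline above concrete, here is a hypothetical word-count pair (not from the deck) that runs under exactly this shell pipeline, with the shell's sort standing in for Hadoop's shuffle; the script names are illustrative:

#!/usr/bin/env perl
# wc_mapper.pl - emit a "word<TAB>1" pair for every word on stdin
use strict;
use warnings;

while ( my $line = <STDIN> ) {
    chomp $line;
    print "$_\t1\n" for grep { length } split /\s+/, $line;
}

#!/usr/bin/env perl
# wc_reducer.pl - input arrives sorted on the key, so sum until the key changes
use strict;
use warnings;

my ( $current, $count ) = ( undef, 0 );
while ( my $line = <STDIN> ) {
    chomp $line;
    my ( $word, $n ) = split /\t/, $line;
    if ( defined $current && $word ne $current ) {
        print "$current\t$count\n";
        $count = 0;
    }
    $current = $word;
    $count  += $n;
}
print "$current\t$count\n" if defined $current;

Run as: $ cat input.txt | ./wc_mapper.pl | sort | ./wc_reducer.pl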


Page 4: Perl on Amazon Elastic MapReduce

MapReduce

• Input data sent to mappers as (k, v) pairs.

• After processing, mappers emit (kout, vout).

• These pairs are sorted and sent to reducers.

• All (kout, vout) pairs for a given kout are sent to a single reducer.


The sorting guarantees that all values for a given key are sent to a single reducer.

Page 5: Perl on Amazon Elastic MapReduce

MapReduce

• Reducers get (k, [v1, v2, …, vn]).

• After processing, the reducer emits a (kf, vf) pair per result.


Page 6: Perl on Amazon Elastic MapReduce

MapReduce

We wanted to have a world map showing where people were starting our games (like Mozilla Glow).


Mozilla Glow tracked Firefox 4 downloads on a world map, in near real-time.

Page 7: Perl on Amazon Elastic MapReduce

Glowfish


Page 8: Perl on Amazon Elastic MapReduce

MapReduce

• Input: ( epoch, IP address )

• Mappers group these into 5-minute blocks, and emit ( blockId, IP address )

• Reducers get ( blockId, [ip1, ip2, …, ipn] )

• Do a geo lookup and emit

( epoch, [ ( lat1, lon1 ), ( lat2, lon2), … ] )


On a 50-node cluster, processing ~3BN events takes 11 minutes, including data transfers. 2 hours' worth takes 3 minutes, so we can easily have data from 5 minutes ago. 1 day to modify the Glow protocol, 1 day to build. Everything stored on S3.
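As a back-of-the-envelope sketch (not from the deck) of the bucketing described above, assuming epochs arrive in milliseconds as in the mapper shown later; the timestamp value is illustrative:

#!/usr/bin/env perl
use strict;
use warnings;

# 300 seconds = 5 minutes; the event timestamp is in milliseconds
my $epoch_ms = 1_325_240_103_000;              # illustrative value
my $block_id = int( $epoch_ms / 1000 / 300 );

# a reducer can turn the block id back into the epoch of the slot's start
my $slot_start_epoch = $block_id * 300;

print "block $block_id starts at epoch $slot_start_epoch\n";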

Page 9: Perl on Amazon Elastic MapReduce

$ map | sort | reduce


Page 10: Perl on Amazon Elastic MapReduce


Page 11: Perl on Amazon Elastic MapReduce

Apache Hadoop

• Distributed programming framework

• Implements MapReduce

• Does all the usual distributed programming heavy-lifting for you

• Highly fault-tolerant, with automatic task re-assignment in case of failure

• You focus on mappers and reducers


Serialisation, heartbeat, node management, directory, etc. Speculative task execution: the first one to finish wins. Potentially very simple and contained code.

Page 12: Perl on Amazon Elastic MapReduce

Apache Hadoop

• Native Java API

• Streaming API which can use mappers and reducers written in any programming language.

• Distributed file system (HDFS)

• Distributed Cache


You supply the mapper, reducer, and driver code

Page 13: Perl on Amazon Elastic MapReduce

Amazon Elastic MapReduce

• On-demand Hadoop clusters running on EC2 instances.

• Improved S3 support for storage of input and output data.

• Build workflows by sending jobs to a cluster.


S3 gives you virtually unlimited storage with very high redundancy. S3 performance: ~750MB of uncompressed data (110-byte rows -> ~7M rows/sec). All this is controlled using a REST API. Jobs are called 'steps' in EMR lingo.

Page 14: Perl on Amazon Elastic MapReduce

EMR Downsides

• No control over the machine images.

• Perl 5.8.8

• Ephemeral: when your cluster is shut down (or dies), HDFS is gone.

• HDFS not available at cluster-creation time.

• Debian


No way to customise the image and, e.g., install your own Perl. So it's a good idea to store the final results of a workflow in S3. No way to store dependencies in HDFS when the cluster is created.

Page 15: Perl on Amazon Elastic MapReduce

Streaming vs. Native

$ cat | map | sort | reduce


Page 16: Perl on Amazon Elastic MapReduce

Streaming vs. Native

Instead of

( k, [ v1, v2, …, vn ] )

reducers get

(( k1, v1 ), …, ( k1, vn ), ( k2, v1 ), …, ( k2, v2 ))


Page 17: Perl on Amazon Elastic MapReduce

Composite Keys

• Reducers receive both keys and values sorted

• Merge 3 tables:

userid, 0, … # customer info

userid, 1, … # payments history

userid, recordid1, … # clickstream

userid, recordid2, … # clickstream


If you set a value to 0, you know it's going to be the first (k,v) pair the reducer sees, 1 will be the second, etc. When the userid changes, it's a new user.
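A sketch of how a streaming reducer might exploit this ordering when merging the three tables above; the record layout follows the slide (a 0 value for customer info, 1 for payments, record ids for clickstream), but the code itself is illustrative and not part of the deck:

#!/usr/bin/env perl
use strict;
use warnings;

my $current_user = '';
my %user;    # everything we have collected for the current userid

while ( <STDIN> ) {
    chomp;
    my ( $userid, $record_type, @rest ) = split /\t/;

    if ( $userid ne $current_user ) {
        # the key changed: we have seen everything for the previous user
        flush_user( $current_user, \%user ) if $current_user;
        $current_user = $userid;
        %user         = ();
    }

    if ( $record_type eq '0' ) {            # customer info sorts first
        $user{info} = \@rest;
    }
    elsif ( $record_type eq '1' ) {         # then the payments history
        push @{ $user{payments} }, \@rest;
    }
    else {                                  # then the clickstream records
        push @{ $user{clicks} }, \@rest;
    }
}
flush_user( $current_user, \%user ) if $current_user;

sub flush_user {
    my ( $userid, $data ) = @_;
    # emit whatever the merged record should look like
    print "$userid\t", scalar @{ $data->{clicks} || [] }, " clicks\n";
}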

Page 18: Perl on Amazon Elastic MapReduce

Streaming vs. Native

• Limited API

• About a 7-10% increase in run time

• About a 1000% decrease in development time (as reported by a non-representative sample of developers)


E.g., no control over output file names, many of the API settings can't be configured programmatically (only via command-line switches), no separate mappers per input, etc. Because reducer input is also sorted on keys, when the key changes you know you won't be seeing any more of those. You might need to keep track of the current key, to use as the previous one.

Page 19: Perl on Amazon Elastic MapReduce

Where’s My Towel?

• Tasks run chrooted in a non-deterministic location.

• It's easy to store files in HDFS when submitting a job, but impossible to store directory trees.

• For native Java jobs, your dependencies get packaged in the JAR alongside your code.


So how do you get all the CPAN goodness you know and love in there? HDFS operations are limited to copy, move, and delete, and the host OS doesn't see it, so no untar'ing!

Page 20: Perl on Amazon Elastic MapReduce

Streaming’s Little Helpers

Define your inputs and outputs:

--input s3://events/2011-30-10

--output s3://glowfish/output/2011-30-10


Can have multiple inputs

Page 21: Perl on Amazon Elastic MapReduce

Streaming’s Little Helpers

You can use any class in Hadoop's classpath as a comparator or partitioner; several come bundled:

-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator

-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner


That -D is a Hadoop define, not a JVM system property definition

Page 22: Perl on Amazon Elastic MapReduce

Streaming’s Little Helpers

• Use S3 to store…

• input data

• output data

• supporting data (e.g., Geo-IP)

• your code


On a streaming job you specify the programs to use as mapper and reducer

Page 23: Perl on Amazon Elastic MapReduce

Mapper and Reducer

To specify the mapper and reducer to be used in your streaming job, you can point Hadoop to S3:

--mapper s3://glowfish/bin/mapper.pl

--reducer s3://glowfish/bin/reducer.pl


Page 24: Perl on Amazon Elastic MapReduce

Support Files

When specifying a file to store in the Distributed Cache, the URI fragment is used as the name of a symlink in the local filesystem:

-cacheFile s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat


The link is created in the (unknown) directory where the task is running, making the file accessible to it.

Page 26: Perl on Amazon Elastic MapReduce

Dependencies

But if you store an archive (Zip, TGZ, or JAR) in the Distributed Cache, …

-cacheArchive s3://glowfish/lib/perllib.tgz



Page 28: Perl on Amazon Elastic MapReduce

Dependencies

But if you store an archive (Zip, TGZ, or JAR) in the Distributed Cache, …

-cacheArchive s3://glowfish/lib/perllib.tgz#locallib


Page 29: Perl on Amazon Elastic MapReduce

Dependencies

Hadoop will uncompress it and create a link to whatever directory it created, in the task's working directory.


Page 30: Perl on Amazon Elastic MapReduce

Dependencies

Which is where it stores your mapper and reducer.


Page 31: Perl on Amazon Elastic MapReduce

Dependencies

use lib qw/ locallib /;


Page 32: Perl on Amazon Elastic MapReduce

Mapper

#!/usr/bin/env perl

use strict;
use warnings;

use lib qw/ locallib /;

use JSON::PP;

my $decoder    = JSON::PP->new->utf8;
my $missing_ip = 0;

while ( <> ) {
    chomp;
    next unless /load_complete/;
    my @line = split /\t/;
    # the epoch is in milliseconds; bucket it into 5-minute (300 s) slots
    my ( $epoch, $payload ) = ( int( $line[1] / 1000 / 300 ), $line[5] );
    my $json = $decoder->decode( $payload );
    if ( ! exists $json->{'ip'} ) {
        $missing_ip++;
        next;
    }
    print "$epoch\t$json->{'ip'}\n";
}

print STDERR "reporter:counter:Job Counters,MISSING_IP,$missing_ip\n";


At the end of the job, Hadoop aggregates counters from all tasks.

Page 34: Perl on Amazon Elastic MapReduce

Reducer

#!/usr/bin/env perl

use strict;
use warnings;
use lib qw/ locallib /;

use Geo::IP;
use Regexp::Common qw/ net /;
use Readonly;

Readonly::Scalar my $TAB => "\t";

my $geo = Geo::IP->open( 'GeoLiteCity.dat', GEOIP_MEMORY_CACHE )
    or die "Could not open GeoIP database: $!\n";

my $format_errors      = 0;
my $invalid_ip_address = 0;
my $geo_lookup_errors  = 0;

my $time_slot;
my $previous_time_slot = -1;


Page 37: Perl on Amazon Elastic MapReduce

Reducer

while ( <> ) {
    chomp;
    my @cols = split $TAB;
    if ( scalar @cols != 2 ) {
        $format_errors++;
        next;
    }
    # assign into the file-scoped $time_slot declared above, so the
    # final emit() after the loop still sees its last value
    ( $time_slot, my $ip_addr ) = @cols;
    if ( $previous_time_slot != -1 && $time_slot != $previous_time_slot ) {
        # we've entered a new time slot, write the previous one out
        emit( $time_slot, $previous_time_slot );
    }
    if ( $ip_addr !~ /$RE{net}{IPv4}/ ) {
        $invalid_ip_address++;
        $previous_time_slot = $time_slot;
        next;
    }


Page 40: Perl on Amazon Elastic MapReduce

Reducer

    my $geo_record = $geo->record_by_addr( $ip_addr );
    if ( ! defined $geo_record ) {
        $geo_lookup_errors++;
        $previous_time_slot = $time_slot;
        next;
    }

    # update entry for time slot with lat and lon

    $previous_time_slot = $time_slot;
} # while ( <> )

emit( $time_slot + 1, $time_slot );

print STDERR "reporter:counter:Job Counters,FORMAT_ERRORS,$format_errors\n";
print STDERR "reporter:counter:Job Counters,INVALID_IPS,$invalid_ip_address\n";
print STDERR "reporter:counter:Job Counters,GEO_LOOKUP_ERRORS,$geo_lookup_errors\n";



Page 46: Perl on Amazon Elastic MapReduce

Recap

• EMR clusters are volatile.

• Values for a given key will all go to a single reducer, sorted.

• Use S3 for everything, and plan your dataflow ahead.


Page 47: Perl on Amazon Elastic MapReduce

( On data )

• Store it wisely, e.g., using a directory structure like the following to get free partitioning in Hive and other tools:

s3://bucket/path/data/run_date=2011-11-12

• Don't worry about getting the data out of S3; you can always write a simple job to do that and run it at the end of your workflow.


Hive partitioning

Page 48: Perl on Amazon Elastic MapReduce

Recap

• EMR clusters are volatile.

• Values for a given key will all go to a single reducer, sorted. Watch for the key changing.

• Use S3 for everything, and plan your dataflow ahead.

• Make carton a part of your life, and especially of your build tool’s.


Page 49: Perl on Amazon Elastic MapReduce

( carton )

• Shipwright for humans

• Reads dependencies from Makefile.PL (see the sketch after this list)

• Installs them locally to your app

• Deploy your stuff, including carton.lock

• Run carton install --deployment

• Tar result and upload to S3
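To illustrate the Makefile.PL step referenced in the list above, here is a minimal sketch carton could read; the distribution name, version, and module list are illustrative (they mirror the modules used by the mapper and reducer in this talk):

# Makefile.PL - carton reads the PREREQ_PM list to install dependencies locally
use ExtUtils::MakeMaker;

WriteMakefile(
    NAME      => 'Glowfish',
    VERSION   => '0.01',
    PREREQ_PM => {
        'Geo::IP'        => 0,
        'JSON::PP'       => 0,
        'Readonly'       => 0,
        'Regexp::Common' => 0,
    },
);

The locally installed dependencies are presumably what gets tarred up as perllib.tgz and shipped to the cluster with -cacheArchive, as shown earlier.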


Page 50: Perl on Amazon Elastic MapReduce

URLs

• The MapReduce Paper: http://labs.google.com/papers/mapreduce.html

• Apache Hadoop: http://hadoop.apache.org/

• Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/


Page 52: Perl on Amazon Elastic MapReduce

URLs

• Amazon EMR Perl Client Library: http://aws.amazon.com/code/Elastic-MapReduce/2309

• Amazon EMR Command-Line Tool: http://aws.amazon.com/code/Elastic-MapReduce/2264


Page 53: Perl on Amazon Elastic MapReduce

That’s All, Folks!

Slides available at http://slideshare.net/pfig/perl-on-amazon-elastic-mapreduce

[email protected]
