Perl on Amazon Elastic MapReduce


A Gentle Introduction to MapReduce

• Distributed computing model

• Mappers process the input and forward intermediate results to reducers.

• Reducers aggregate these intermediate results, and emit the final results.


A sort/shuffle phase between the two steps guarantees that all mapper results for a single key go to the same reducer, and that the workload is distributed evenly.

$ map | sort | reduce


MapReduce

• Input data is sent to mappers as (k, v) pairs.

• After processing, mappers emit (k_out, v_out).

• These pairs are sorted and sent to reducers.

• All (k_out, v_out) pairs for a given k_out are sent to a single reducer.


The sorting guarantees that all values for a given key are sent to a single reducer.

MapReduce

• Reducers get (k, [v1, v2, …, vn]).

• After processing, the reducer emits a (k_f, v_f) pair per result.
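As a concrete illustration of that contract (not from the original talk; the file names are made up), here is a word-count job in the streaming style used later. The mapper emits a ( word, 1 ) pair per word, and the reducer, which sees those pairs sorted by word, sums them:

#!/usr/bin/env perl
# wc_mapper.pl - emit a ( word, 1 ) pair for every word on STDIN
use strict;
use warnings;

while ( <> ) {
    chomp;
    print "$_\t1\n" for split ' ';
}

#!/usr/bin/env perl
# wc_reducer.pl - input arrives sorted on the key, so all ( word, 1 ) pairs
# for a given word are adjacent; sum them and emit ( word, count )
use strict;
use warnings;

my ( $current, $count ) = ( undef, 0 );
while ( <> ) {
    chomp;
    my ( $word, $n ) = split /\t/;
    if ( defined $current && $word ne $current ) {
        print "$current\t$count\n";
        $count = 0;
    }
    $current = $word;
    $count  += $n;
}
print "$current\t$count\n" if defined $current;

Locally, this is literally the pipeline from the previous slide:

$ cat input.txt | ./wc_mapper.pl | sort | ./wc_reducer.pl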


MapReduce

We wanted to have a world map showing where people were starting our games (like Mozilla Glow).


Mozilla Glow tracked Firefox 4 downloads on a world map, in near real-time.

Glowfish


MapReduce

• Input: ( epoch, IP address )

• Mappers group these into 5-minute blocks, and emit ( blockId, IP address )

• Reducers get ( blockId, [ip1, ip2, …, ipn] )

• Do a geo lookup and emit

( epoch, [ ( lat1, lon1 ), ( lat2, lon2 ), … ] )


On a 50-node cluster, processing ~3BN events takes 11 minutes, including data transfers. Two hours' worth takes 3 minutes, so we can easily have data from 5 minutes ago. One day to modify the Glow protocol, one day to build. Everything stored on S3.

$ map | sort | reduce


Apache Hadoop

• Distributed programming framework

• Implements MapReduce

• Does all the usual distributed programming heavy-lifting for you

• Highly fault-tolerant, with automatic task re-assignment in case of failure

• You focus on mappers and reducers


Serialisation, heartbeat, node management, directory, etc. Speculative task execution: the first one to finish wins. Potentially very simple and contained code.

Apache Hadoop

• Native Java API

• Streaming API which can use mappers and reducers written in any programming language.

• Distributed file system (HDFS)

• Distributed Cache


You supply the mapper, reducer, and driver code

Amazon Elastic MapReduce

• On-demand Hadoop clusters running on EC2 instances.

• Improved S3 support for storage of input and output data.

• Build workflows by sending jobs to a cluster.


S3 gives you virtually unlimited storage with very high redundancy. S3 performance: roughly 750 MB/s of uncompressed data (110-byte rows -> ~7M rows/sec). All of this is controlled using a REST API. Jobs are called ‘steps’ in EMR lingo.
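For instance, with the Ruby elastic-mapreduce command-line tool linked at the end of these slides, a streaming step can be submitted while creating a job flow, roughly like this (a sketch; the job name is made up and the exact flags varied between tool versions):

$ elastic-mapreduce --create --name "glowfish" \
    --num-instances 50 \
    --stream \
    --input   s3://events/2011-30-10 \
    --output  s3://glowfish/output/2011-30-10 \
    --mapper  s3://glowfish/bin/mapper.pl \
    --reducer s3://glowfish/bin/reducer.pl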

EMR Downsides

• No control over the machine images.

• Perl 5.8.8

• Ephemeral: when your cluster is shut down (or dies), HDFS is gone.

• HDFS not available at cluster-creation time.

• Debian


No way to customise the image and, e.g., install your own Perl. So it’s a good idea to store the final results of a workflow in S3. No way to store dependencies in HDFS when the cluster is created.

Streaming vs. Native

$ cat | map | sort | reduce


Streaming vs. Native

Instead of

( k, [ v1, v2, …, vn ] )

reducers get

(( k1, v1 ), …, ( k1, vn ), ( k2, v1 ), …, ( k2, vm ))
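So a streaming reducer has to notice the key change itself and do its own grouping, the same trick as in the word-count sketch earlier. A minimal skeleton of the pattern (the per-key processing here just counts values; a real job would do something more interesting):

#!/usr/bin/env perl
# Generic streaming reducer skeleton: input is "key<TAB>value" lines sorted
# on key, so a change of key means the previous group is complete.
use strict;
use warnings;

my ( $previous_key, @values );
while ( <> ) {
    chomp;
    my ( $key, $value ) = split /\t/, $_, 2;
    if ( defined $previous_key && $key ne $previous_key ) {
        emit_group( $previous_key, \@values );
        @values = ();
    }
    push @values, $value;
    $previous_key = $key;
}
emit_group( $previous_key, \@values ) if defined $previous_key;

# Stand-in for the real per-key processing: just report the group size.
sub emit_group {
    my ( $key, $vals ) = @_;
    print "$key\t", scalar @$vals, "\n";
}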


Composite Keys

• Reducers receive both keys and values sorted

• Merge 3 tables:

userid, 0, … # customer info

userid, 1, … # payments history

userid, recordid1, … # clickstream

userid, recordid2, … # clickstream


If you set a value to 0, you know it will be the first (k, v) pair the reducer sees for that key, 1 will be the second, etc. When the userid changes, it’s a new user.
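A sketch of a reducer consuming that merged stream (assuming lines of the form userid<TAB>tag<TAB>payload, partitioned on userid but sorted on both fields; what gets done with each record type is purely illustrative):

#!/usr/bin/env perl
# Merge-join sketch: for each user the customer record (tag 0) arrives first,
# then the payments record (tag 1), then the clickstream records.
use strict;
use warnings;

my ( $current_user, $customer, $clicks ) = ( undef, undef, 0 );
while ( <> ) {
    chomp;
    my ( $userid, $tag, $payload ) = split /\t/, $_, 3;
    if ( defined $current_user && $userid ne $current_user ) {
        # the previous user is complete; emit whatever we accumulated
        print join( "\t", $current_user, defined $customer ? 'known' : 'unknown', $clicks ), "\n";
        ( $customer, $clicks ) = ( undef, 0 );
    }
    $current_user = $userid;
    if ( $tag eq '0' ) {
        $customer = $payload;          # customer info
    }
    elsif ( $tag eq '1' ) {
        # payments history would be handled here
    }
    else {
        $clicks++;                     # one clickstream record
    }
}
print join( "\t", $current_user, defined $customer ? 'known' : 'unknown', $clicks ), "\n"
    if defined $current_user;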

Streaming vs. Native

• Limited API

• About a 7-10% increase in run time

• About a 1000% decrease in development time (as reported by a non-representative sample of developers)


E.g., no control over output file names, many of the API settings can’t be configured programmatically (only via command-line switches), no separate mappers per input, etc. Because reducer input is also sorted on keys, when the key changes you know you won’t be seeing any more values for it. You might need to keep track of the current key, to use as the previous one.

Where’s My Towel?

• Tasks run chrooted in a non-deterministic location.

• It’s easy to store files in HDFS when submitting a job, but impossible to store directory trees.

• For native Java jobs, your dependencies get packaged in the JAR alongside your code.


So how do you get all the CPAN goodness you know and love in there? HDFS operations are limited to copy, move, and delete, and the host OS doesn’t see it, so no untar’ing!

Streaming’s Little Helpers

Define your inputs and outputs:

--input s3://events/2011-30-10

--output s3://glowfish/output/2011-30-10


Can have multiple inputs

Streaming’s Little Helpers

You can use any class in Hadoop’s classpath as a comparator, partitioner, codec, etc., and several come bundled:

-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator

-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner


That -D is a Hadoop define, not a JVM system property definition
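Put together, a streaming invocation that sorts on the first two fields but partitions on the first one only (the composite-key trick from a few slides back) might look roughly like this (a sketch using the pre-1.0 mapred.* option names and a generic streaming jar path; check the exact names for your Hadoop version):

$ hadoop jar hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapred.text.key.comparator.options=-k1,2 \
    -D mapred.text.key.partitioner.options=-k1,1 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input s3://events/2011-30-10 \
    -output s3://glowfish/output/2011-30-10 \
    -mapper s3://glowfish/bin/mapper.pl \
    -reducer s3://glowfish/bin/reducer.pl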

Streaming’s Little Helpers

• Use S3 to store…

• input data

• output data

• supporting data (e.g., Geo-IP)

• your code


On a streaming job you specify the programs to use as mapper and reducer

Mapper and Reducer

To specify the mapper and reducer to be used in your streaming job, you can point Hadoop to S3:

--mapper s3://glowfish/bin/mapper.pl

--reducer s3://glowfish/bin/reducer.pl


Support Files

When specifying a file to store in the Distributed Cache, the URI fragment will be used as the name of a symlink in the task’s local filesystem:

-cacheFile s3://glowfish/data/GeoLiteCity.dat#GeoLiteCity.dat


The symlink is created in the otherwise unknown directory where the task is running, making the file accessible to it.
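That is exactly how the reducer shown later opens the GeoIP database, by the fragment name, relative to wherever the task happens to run:

use Geo::IP;

# 'GeoLiteCity.dat' is the fragment from the -cacheFile URI above; the
# Distributed Cache symlinks it into the task's working directory.
my $geo = Geo::IP->open( 'GeoLiteCity.dat', GEOIP_MEMORY_CACHE )
    or die "Could not open GeoIP database: $!\n";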

Dependencies

But if you store an archive (Zip, TGZ, or JAR) in the Distributed Cache, …

-cacheArchive s3://glowfish/lib/perllib.tgz


Dependencies

…and give the archive URI a fragment, just like with -cacheFile:

-cacheArchive s3://glowfish/lib/perllib.tgz#locallib


Dependencies

Hadoop will uncompress it and create a link to whatever directory it created, in the task’s working directory.


Dependencies

Which is where it stores your mapper and reducer.


Dependencies

use lib qw/ locallib /;


Mapper

#!/usr/bin/env perl
use strict;
use warnings;

use lib qw/ locallib /;    # CPAN deps unpacked from the cached archive

use JSON::PP;

my $decoder    = JSON::PP->new->utf8;
my $missing_ip = 0;

while ( <> ) {
    chomp;
    next unless /load_complete/;    # only the events we care about
    my @line = split /\t/;
    # column 1 is a millisecond epoch; bucket it into 5-minute (300 s) slots
    my ( $epoch, $payload ) = ( int( $line[1] / 1000 / 300 ), $line[5] );
    my $json = $decoder->decode( $payload );
    if ( ! exists $json->{'ip'} ) {
        $missing_ip++;
        next;
    }
    print "$epoch\t$json->{'ip'}\n";
}

print STDERR "reporter:counter:Job Counters,MISSING_IP,$missing_ip\n";


At the end of the job, Hadoop aggregates counters from all tasks.

Reducer

#!/usr/bin/env perl
use strict;
use warnings;

use lib qw/ locallib /;    # CPAN deps unpacked from the cached archive

use Geo::IP;
use Regexp::Common qw/ net /;
use Readonly;

Readonly::Scalar my $TAB => "\t";

# 'GeoLiteCity.dat' is the Distributed Cache symlink set up with -cacheFile
my $geo = Geo::IP->open( 'GeoLiteCity.dat', GEOIP_MEMORY_CACHE )
    or die "Could not open GeoIP database: $!\n";

my $format_errors      = 0;
my $invalid_ip_address = 0;
my $geo_lookup_errors  = 0;

my ( $time_slot, $ip_addr );
my $previous_time_slot = -1;


while ( <> ) {
    chomp;
    my @cols = split $TAB;
    if ( scalar @cols != 2 ) {
        $format_errors++;
        next;
    }
    # no 'my' here: the final emit after the loop needs the last $time_slot
    ( $time_slot, $ip_addr ) = @cols;
    if ( $previous_time_slot != -1 && $time_slot != $previous_time_slot ) {
        # we've entered a new time slot, write the previous one out
        emit( $time_slot, $previous_time_slot );
    }
    if ( $ip_addr !~ /$RE{net}{IPv4}/ ) {
        $invalid_ip_address++;
        $previous_time_slot = $time_slot;
        next;
    }


    my $geo_record = $geo->record_by_addr( $ip_addr );
    if ( ! defined $geo_record ) {
        $geo_lookup_errors++;
        $previous_time_slot = $time_slot;
        next;
    }

    # update entry for time slot with lat and lon

    $previous_time_slot = $time_slot;
} # while ( <> )

# flush the last time slot
emit( $time_slot + 1, $time_slot );

print STDERR "reporter:counter:Job Counters,FORMAT_ERRORS,$format_errors\n";
print STDERR "reporter:counter:Job Counters,INVALID_IPS,$invalid_ip_address\n";
print STDERR "reporter:counter:Job Counters,GEO_LOOKUP_ERRORS,$geo_lookup_errors\n";



Recap

• EMR clusters are volatile.

• Values for a given key will all go to a single reducer, sorted.

• Use S3 for everything, and plan your dataflow ahead.


( On data )

• Store it wisely, e.g., using a directory structure like the following to get free partitioning in Hive and other tools:

s3://bucket/path/data/run_date=2011-11-12

• Don’t worry about getting the data out of S3: you can always write a simple job that does that and run it at the end of your workflow.


Hive partitioning

Recap

• EMR clusters are volatile.

• Values for a given key will all go to a single reducer, sorted. Watch for the key changing.

• Use S3 for everything, and plan your dataflow ahead.

• Make carton a part of your life, and especially of your build tool’s.


( carton )

• Shipwright for humans

• Reads dependencies from Makefile.PL

• Installs them locally to your app

• Deploy your stuff, including carton.lock

• Run carton install --deployment

• Tar result and upload to S3
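A sketch of that packaging step (the archive name and bucket are the ones used earlier; the local/lib/perl5 path and s3cmd are assumptions, substitute whatever your carton version and upload tool actually use):

$ carton install --deployment              # install exactly what carton.lock says, into local/
$ tar czf perllib.tgz -C local/lib/perl5 . # module dirs at the top level, for use lib qw/ locallib /
$ s3cmd put perllib.tgz s3://glowfish/lib/perllib.tgz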


URLs

• The MapReduce Paper: http://labs.google.com/papers/mapreduce.html

• Apache Hadoop: http://hadoop.apache.org/

• Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/


URLs

• Amazon EMR Perl Client Library: http://aws.amazon.com/code/Elastic-MapReduce/2309

• Amazon EMR Command-Line Tool: http://aws.amazon.com/code/Elastic-MapReduce/2264


That’s All, Folks!

Slides available at http://slideshare.net/pfig/perl-on-amazon-elastic-mapreduce

me@pedrofigueiredo.org
