Post on 15-Apr-2017
• What are they? Programming frameworks for writing Hadoop applications.
• MapReduce: the original, batch-based framework; used by other applications such as Hive, Sqoop, etc.
• Spark: newer, memory bound, generally faster than MapReduce; best for iteration-intensive processes such as machine learning
• Both work on top of YARN. MapReduce was the native framework of YARN; Spark can run on top of YARN or as a standalone process.
• Mapper: splits the input from HDFS, most likely one split per HDFS block. Each split has a key and a value.
• Reducer: similar to an aggregator in common SQL. The output of the reducer is written to HDFS. Default number of reducers = 2
• Sort and shuffle phase: generally the most demanding phase
Map Reduce
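The three phases above can be sketched on in-memory data in plain Java. This is illustrative only (class and method names here are made up for the sketch, not the Hadoop API; the real Mapper/Reducer API appears in the word-count example later in this document):

```java
import java.util.*;

// A minimal in-memory simulation of the map, sort-and-shuffle,
// and reduce phases, using word count as the example job.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: each input line emits (word, 1) pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                mapped.add(Map.entry(word, 1));

        // Sort-and-shuffle phase: group the pairs by key. On a real
        // cluster this moves data over the network, which is why it is
        // generally the most demanding phase.
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped)
            shuffled.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                    .add(e.getValue());

        // Reduce phase: aggregate each key's values, like SUM ... GROUP BY
        // in common SQL.
        Map<String, Integer> reduced = new LinkedHashMap<>();
        shuffled.forEach((word, ones) ->
                reduced.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return reduced;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or", "not to be")));
    }
}
```

Each word maps to a list of 1s during shuffle, and reduce collapses each list into a count, just as the real IntSumReducer does.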
• DAG (Directed Acyclic Graph) scheduler engine
• RDD: an immutable structure stored in memory across the cluster. Can be created from a file, from data in memory, or from another RDD.
• Resilient: if the data in memory is lost, it can be recreated
• Distributed: stored in memory across the cluster
• Dataset: contains data initially from HDFS or another RDD
• Operations: a Transformation creates another RDD out of an existing RDD and is a lazy operator; an Action returns a value of any type other than an RDD
• Some Spark modules: Spark Streaming, MLlib, GraphX
Spark
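The lazy-transformation-versus-action distinction can be imitated with java.util.stream pipelines, which (like RDD lineages) only describe work until a terminal operation runs. This is an analogy only, not Spark's API; the names are made up for the sketch:

```java
import java.util.stream.Stream;

// Analogy: intermediate stream operations are "transformations"
// (lazy, return another pipeline); the terminal operation is the
// "action" (triggers execution, returns a plain value).
public class LazyPipeline {
    public static int run() {
        // "Transformations": nothing executes yet, the pipeline is
        // just a description of the computation.
        Stream<Integer> pipeline = Stream.of(1, 2, 3, 4, 5)
                .filter(n -> n % 2 == 1)   // keep odd numbers
                .map(n -> n * n);          // square them

        // "Action": the terminal operation runs the whole pipeline
        // and returns a value that is not another pipeline.
        return pipeline.mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(run());  // 1 + 9 + 25 = 35
    }
}
```

In Spark the same shape appears as flatMap/map/reduceByKey (transformations) followed by saveAsTextFile or collect (actions), as in the word-count example at the end of this document.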
Spark Run Mode
• Standalone: simple distributed FIFO scheduling
• On YARN: two deploy modes, client and cluster
• On Mesos: wider partitioning capability (sharing with other frameworks and between Spark instances)
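The three run modes above map onto spark-submit's standard --master and --deploy-mode options. A sketch, assuming hypothetical host and jar names:

```shell
# Standalone cluster:
spark-submit --master spark://master:7077 app.jar

# On YARN, choosing where the driver runs (client vs cluster deploy mode):
spark-submit --master yarn --deploy-mode client  app.jar
spark-submit --master yarn --deploy-mode cluster app.jar

# On Mesos:
spark-submit --master mesos://mesos-master:5050 app.jar
```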
MapReduce vs Spark
MapReduce                        | Spark
---------------------------------|----------------------------------
Disk bound                       | Memory bound
Mapper/Reducer                   | DAG scheduler
One YARN container per task      | One YARN container per application
Great for batch programs         | Great for iterative programs
Map Reduce Program Example: Word Counts
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  // Mapper: tokenizes each input line and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Spark Program Example: Word Counts
scala> val file = sc.textFile("hdfs://sabtu:8020/rawdata/emails.csv")
scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).sortByKey()
scala> counts.saveAsTextFile("hdfs://sabtu:8020/tmp/sparkcount")
scala> sc.stop()
scala> sys.exit()