Experiences with a new Hadoop cluster: deployment, teaching and research
Andre Barczak, February 2018

Page 1

Experiences with a new Hadoop cluster:

deployment, teaching and research

Andre Barczak February 2018

Page 2

Abstract

In 2017 the Machine Learning research group got funding for a new Hadoop cluster. However, the funding only sufficed for buying the hardware. The deployment of the cluster was done by the research group. The members had no previous specific experience with Hadoop clusters, so the learning curve was steep.

This talk covers the pros and cons of deploying a new machine in this manner, and also illustrates how we are currently using the machine for research and teaching. The history of Hadoop is briefly covered, putting it in context with other parallel platforms such as Beowulf clusters.

Page 3

The beginnings

● Jan 2017: proposed a new Hadoop cluster for our research group
  – Several industry projects with big data
  – New Master of Analytics programme
● New courses in data analysis and big data
● No specific infrastructure for teaching or research
● The cluster should have:
  – 2 master servers
  – 14 slave nodes
  – ~NZ$80K budget, everything included

Page 4

The machine

Page 5

The machine
● 2 master nodes
  – 32 cores
  – 24 TB disk
  – 64 GB RAM
  – 2 × network interfaces
● 14 slave nodes
  – 8 cores
  – 8 TB disk
  – 32 GB RAM
● TOTAL
  – 160 TB disk
  – 576 GB RAM
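The totals can be cross-checked from the per-node figures with a few lines of arithmetic (a quick sketch; the core total is not on the slide but follows from the same numbers):

```python
# Sanity-check the cluster totals from the per-node specifications above.
masters = {"count": 2, "cores": 32, "disk_tb": 24, "ram_gb": 64}
slaves = {"count": 14, "cores": 8, "disk_tb": 8, "ram_gb": 32}

total_disk = masters["count"] * masters["disk_tb"] + slaves["count"] * slaves["disk_tb"]
total_ram = masters["count"] * masters["ram_gb"] + slaves["count"] * slaves["ram_gb"]
# Core count is not stated on the slide; it is derived here from the same specs.
total_cores = masters["count"] * masters["cores"] + slaves["count"] * slaves["cores"]

print(total_disk, total_ram, total_cores)  # 160 TB, 576 GB, 176 cores
```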

Page 6

Deployment

● Ambari → the easy choice for installing all the components

● Ubuntu → previous experience with Beowulf clusters

● ITS wants the nodes isolated from the network
  – No ITS support: the academics become system administrators...

Page 7

Deployment

● Industry projects require confidentiality
● Teaching requires sharing

[Diagram: separate Teaching and Research partitions, with a connection to the Internet]

Page 8

Flexible Configuration

● Teaching: busy less than 30 weeks/year
● Research: may need full resources

[Diagram: nodes reassigned between the Teaching and Research partitions; connection to the Internet]

Page 9

The Software

● We chose Ambari as the main platform
● Free, open source
● No free support (this took its toll later)
● Tools included with Ambari:
  – HDFS
  – MapReduce, Spark
  – Hive, Pig, etc.
● Two biggest hurdles:
  – Hostname → IP resolution
  – Wrong space measurement in HDFS

Page 10

A view of the dashboard

Page 11

A view of the hosts

Page 12

From MPI to MapReduce

Early clusters (Beowulf type) with MPI (mid-1990s)

● Broadcast

● Scatter

● Gather

● Reduce

Page 13

MPI Broadcast

[Diagram: the master's buffer is copied in full to every node (Node 1 ... Node N)]

Page 14

MPI Scatter

[Diagram: the master's data is split into blocks B1, B2, B3 ... Bn, and each node receives one block]

Page 15

MPI Gather

[Diagram: each node sends its block B1 ... Bn back to the master, which concatenates them]

Page 16

MPI Reduce

[Diagram: each node's data is combined with a function F( ) into a single buffer on the master]
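The four collectives illustrated above can be sketched in plain Python. This is a single-process simulation of their semantics only, not real message passing; on an actual cluster one would use an MPI library (for instance mpi4py, not used in the talk):

```python
# Single-process sketch of MPI collective semantics (no real communication).
def broadcast(data, n_nodes):
    """Every node receives a copy of the master's buffer."""
    return [data for _ in range(n_nodes)]

def scatter(data, n_nodes):
    """The master's buffer is split into contiguous blocks, one per node."""
    block = len(data) // n_nodes
    return [data[i * block:(i + 1) * block] for i in range(n_nodes)]

def gather(blocks):
    """The master concatenates one block from each node."""
    return [x for b in blocks for x in b]

def reduce_(blocks, f, init):
    """Each node's values are folded into a single result with function f."""
    acc = init
    for b in blocks:
        for x in b:
            acc = f(acc, x)
    return acc

parts = scatter([1, 2, 3, 4, 5, 6], 3)          # [[1, 2], [3, 4], [5, 6]]
total = reduce_(parts, lambda a, b: a + b, 0)   # 21
```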

Page 17

Problem: Amdahl's Law

Source: Wilkinson and Allen, 2005

Page 18

Amdahl's Law

Source: Wilkinson and Allen, 2005
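For reference, the law behind the plotted curves: with a fraction $f$ of the work inherently serial, the speedup on $p$ processors is bounded by

```latex
S(p) = \frac{1}{f + \dfrac{1-f}{p}}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{f}
```

For example, with f = 0.05 the speedup can never exceed 20, no matter how many nodes are added.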

Page 19

Beyond Amdahl's Law

● The serial percentage is not constant with different problem sizes

● Scatter data before processing it
  – Distributed database
● Develop an algorithm that is "aware" of the data distribution
● Resilient and fault-tolerant
● Scalable
● One answer: MapReduce with HDFS
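The first bullet is the observation behind Gustafson's law: if the problem size grows with the machine so that the parallel part fills the added capacity, the scaled speedup is

```latex
S(p) = p - f\,(p - 1)
```

which keeps growing with $p$ instead of saturating at $1/f$.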

Page 20

MapReduce

● Resembles Scatter / Reduce from MPI
● Added benefits:
  – Scalability
  – Fault-tolerance
● Variously called an "infrastructure", a "framework" or a "technology"
● Criticism: is this really a new "technology"?
● Key element: HDFS

Page 21

MapReduce example

● Counting word occurrences in a book.
● Main:

// Imports needed by the full WordCount class (spread over these slides):
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Page 22

MapReduce example

Map and reduce

public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

Page 23

Command line

Hadoop:
time hadoop jar wc.jar WordCount filename.txt output.txt

Stand-alone machine:
time cat filename.txt | tr '[:space:]' '[\n*]' | grep -v "^\s*$" | sort | uniq -c

Sample input (from Ulysses):

  ...A pleasant smile broke quietly over his lips.
  —The mockery of it! he said gaily. Your absurd name, an ancient Greek!
  He pointed his finger in friendly jest and went over to the parapet, laughing to himself. Stephen Dedalus stepped up, followed him wearily halfway and sat down on the edge of the gunrest, watching him still as he propped his mirror on the parapet, dipped the brush in the bowl and lathered cheeks and neck...

Sample output (word, count):

  ...
  Pleasant 20
  Pleasants 20
  Please 90
  Please, 10
  Pleased 10
  Pleasure 10
  Pleiades, 10
  Plenty 10
  Plevna 30
  Plevna. 10
  Plot, 10
  Plough 10
  Plovers 10
  Pluck 10
  Plucking 10
  Plump. 10
  Plumped, 10
  ...

Page 24

MapReduce example

● Counting words in a book. Compare:

  File size (KB) | Single machine | 1 master/5 slaves | 1 master/9 slaves | Number of splits
  1.5            | 0.9s           | 23s               | 21s               | 1
  15             | 6s             | 28s               | 27s               | 1
  150            | 1m 4s          | 59s               | 58s               | 2
  1500           | 9m 53s         | 3m 36s            | 3m 33s            | 12
  15000          | 110m 47s       | 34m 4s            | 27m 11s           | 118
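From the last row of the table, the cluster's speedup over the single machine can be computed directly (a quick sketch; times converted to seconds):

```python
# Speedup of the 1 master/9 slaves configuration over the single machine,
# for the 15,000 KB run from the table above.
single = 110 * 60 + 47      # 110m 47s -> 6647 s
nine_slaves = 27 * 60 + 11  # 27m 11s  -> 1631 s

speedup = single / nine_slaves
print(round(speedup, 2))  # ~4.08
```

Roughly 4× with 9 slaves: well short of linear, consistent with the fine-tuning difficulties noted in the conclusions.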

Page 25

MapReduce example

● Counting words in a book. Compare:

[Chart: Word Count MapReduce; run time (s) vs file size (KB), 0 to 1500 KB; series: single machine, 1 master/5 slaves, 1 master/9 slaves]

Page 26

MapReduce example

● Counting words in a book. Compare:

[Chart: Word Count MapReduce; run time (s) vs file size (KB), 0 to 15000 KB; series: single machine, 1 master/5 slaves, 1 master/9 slaves]

Page 27

Spark

● Spark minimises I/O
  – Keeps partial results in memory
  – Smart scheduling
  – pyspark example:

text_file = sc.textFile("hdfs:///user/albarcza/test/4300.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs:///user/albarcza/test/sparktest")
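The same flatMap / map / reduceByKey pipeline can be imitated in plain Python to see what each stage produces, a single-machine sketch of the semantics only, no Spark required:

```python
from collections import defaultdict

lines = ["to be or not", "to be"]

# flatMap: split each line into words, flattening the result into one list
words = [w for line in lines for w in line.split(" ")]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

On the cluster, Spark performs the reduceByKey stage in parallel across partitions, keeping intermediate pairs in memory rather than on disk.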

Page 28

Spark example

● Counting words in a book. Compare:

  File size (KB) | 1 master/5 slaves | 1 master/9 slaves | Number of tasks
  1.5            | 1.3s              | 1.4s              | 2
  15             | 2.7s              | 2.8s              | 2
  150            | 14.7s             | 13.9s             | 4
  1500           | 35.2s             | 31.0s             | 24
  15000          | 225.7s            | 225.6s            | 236

Page 29

Spark vs MapReduce

[Chart: Word Count, MapReduce vs Spark; run time (s) vs file size (KB), 0 to 15000 KB; series: single machine, 1/5 MapReduce, 1/9 MapReduce, Spark]

Page 30

Teaching
● 158222 Data Wrangling and Machine Learning

– Perform data processing and data preparation tasks using domain-specific programming technologies.

– Integrate data from different sources and formats using a high-level programming language.

– Transform data into appropriate structures for analysis.

– Plot raw data and results of data analysis at an introductory level.

– Apply introductory machine learning and statistical techniques to generate data-driven solutions.

Page 31

Teaching
● 158333 Applied Machine Learning and Data Visualisation
  – Use a broad variety of sophisticated machine learning and data mining techniques to extract patterns in data.

– Assess the usefulness of predictive models.

– Perform advanced data visualisation techniques.

– Formulate problems for real-world datasets from various contexts.

– Present data-driven solutions to real world problems.

– Devise strategies for Big Data problems.

Page 32

Conclusions: negative aspects

● Too much jargon
● Too many competing tools
● Documentation often incomplete (e.g., Ambari)
● Difficult to configure anything beyond the defaults
● Very difficult to fine-tune particular jobs for performance (e.g., the MapReduce example)
● The effective size (disk and memory) is much smaller than the nominal one
● Many of the tools are not yet mature
  – e.g., Zeppelin for multiple users

Page 33

Conclusions: positive aspects

● When it all works, it is wonderful: one can really use big data and get results
  – e.g., we trained an RF for a 1000-class problem, with 200 GB of images
  – Time series project (GDP prediction)
● Ambari facilitates the installation process
● Very good performance for multiple jobs
● Multiple uses of the machines (even when not used as a dedicated Hadoop cluster), with a flexible arrangement of the nodes
● Teaching: students benefit from using a true platform rather than just a sandbox.