Experiences with a new Hadoop cluster:
deployment, teaching and research
Andre Barczak, February 2018
Abstract
In 2017 the Machine Learning research group got funding for a new Hadoop cluster. However, the funding only sufficed for buying the hardware. The deployment of the cluster was done by the research group. The members had no previous specific experience with Hadoop clusters, so the learning curve was steep.
This talk covers the pros and cons of deploying a new machine in this manner, and also illustrates how we are currently using the machine for research and teaching. The history of Hadoop is briefly covered, putting it in context with other parallel platforms such as Beowulf clusters.
The beginnings
● Jan 2017: proposed a new Hadoop cluster for our research group
  – Several industry projects with big data
  – New Master of Analytics programme
● New courses in data analysis and big data
● No specific infrastructure for teaching or research
● The cluster should have:
  – 2 master servers
  – 14 slave nodes
  – ~NZ$80K budget, everything included
The machine
● 2 master nodes
  – 32 cores
  – 24 TB disk
  – 64 GB RAM
  – 2 × network interfaces
● 14 slave nodes
  – 8 cores
  – 8 TB disk
  – 32 GB RAM
● TOTAL
  – 160 TB disk
  – 576 GB RAM
Deployment
● Ambari → the easy choice for installing all the components
● Ubuntu → previous experience with Beowulf clusters
● ITS wants the nodes isolated from the network
  – No ITS support: the academics become system administrators...
Deployment
● Industry projects require confidentiality
● Teaching requires sharing

[Diagram: nodes split into a Teaching partition and a Research partition, both connected to the Internet]
Flexible Configuration
● Teaching: busy less than 30 weeks/year
● Research: may need full resources

[Diagram: nodes flexibly reallocated between the Teaching and Research partitions, both connected to the Internet]
The Software
● We chose Ambari as the main platform
● Free, open source
● No free support (this took its toll later)
● Tools included with Ambari:
  – HDFS
  – MapReduce, Spark
  – Hive, Pig etc...
● Two biggest hurdles:
  – Hostname → IP resolution
  – Wrong space measurement in HDFS
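The hostname → IP hurdle is a classic Ambari/HDFS pitfall: every node must resolve every other node's fully-qualified hostname consistently, both forward and reverse. A minimal sketch of the usual fix is an identical hosts file on all nodes; the hostnames and addresses below are hypothetical examples, not our actual configuration:

```
# /etc/hosts — kept identical on every node (names/addresses are examples only)
10.0.0.1   master1.cluster.local   master1
10.0.0.2   master2.cluster.local   master2
10.0.0.11  slave01.cluster.local   slave01
10.0.0.12  slave02.cluster.local   slave02
```

With the nodes isolated from the campus network (no central DNS), a shared hosts file like this is typically the simplest way to keep name resolution consistent.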
A view of the dashboard
A view of the hosts
From MPI to MapReduce
Early clusters (Beowulf type) with MPI (1994)
● Broadcast
● Scatter
● Gather
● Reduce
MPI Broadcast

[Diagram: the master sends a copy of the same data buffer to Node 1, Node 2, ..., Node N]
MPI Scatter

[Diagram: the master splits its buffer into blocks B1, B2, B3, ..., Bn and sends one block to each node]
MPI Gather

[Diagram: the master collects blocks B1, B2, B3, ..., Bn from the nodes back into a single buffer]
MPI Reduce

[Diagram: each node's data are combined on the master into one buffer by applying a function F( )]
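The four collectives above can be sketched in plain Python. This is a simulation of their semantics only, not real MPI (with mpi4py the corresponding calls would be comm.bcast, comm.scatter, comm.gather and comm.reduce); the function names here are illustrative.

```python
# Pure-Python sketch of MPI's collective operations.
# Each "node" is simply an entry in a list.
from functools import reduce as fold

def broadcast(data, n_nodes):
    """Broadcast: every node receives a copy of the master's buffer."""
    return [data for _ in range(n_nodes)]

def scatter(data, n_nodes):
    """Scatter: the master's buffer is split into blocks B1..Bn, one per node."""
    k = len(data) // n_nodes
    return [data[i * k:(i + 1) * k] for i in range(n_nodes)]

def gather(blocks):
    """Gather: the master concatenates the nodes' blocks into one buffer."""
    return [x for block in blocks for x in block]

def mpi_reduce(blocks, f):
    """Reduce: each node reduces locally, the master combines the partials with f."""
    partials = [fold(f, block) for block in blocks]  # local reduction per node
    return fold(f, partials)                         # final reduction on the master

data = list(range(8))
parts = scatter(data, 4)                       # [[0,1], [2,3], [4,5], [6,7]]
print(gather(parts))                           # round trip: [0, 1, ..., 7]
print(mpi_reduce(parts, lambda a, b: a + b))   # 28
```

The scatter/reduce pair is the part that MapReduce later echoes: distribute the data, compute partial results in parallel, combine them.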
Problem: Amdahl's Law

[Plots of Amdahl's Law speedup curves. Source: Wilkinson and Allen, 2005]
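Amdahl's Law bounds the speedup of a program with serial fraction f: on p processors, S(p) = 1 / (f + (1 - f)/p), so the speedup can never exceed 1/f no matter how many nodes are added. A quick numeric sketch:

```python
# Amdahl's Law: S(p) = 1 / (f + (1 - f) / p), bounded above by 1/f.
def amdahl_speedup(f: float, p: int) -> float:
    """Speedup on p processors for a program with serial fraction f."""
    return 1.0 / (f + (1.0 - f) / p)

# Even with only 5% serial work, 16 nodes give far less than a 16x speedup:
print(round(amdahl_speedup(0.05, 16), 2))      # ≈ 9.14
print(round(amdahl_speedup(0.05, 10**6), 2))   # ≈ 20.0, the 1/f limit
```

This is why simply adding slave nodes yields diminishing returns, and motivates the "beyond Amdahl" points on the next slide.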
Beyond Amdahl's Law
● The serial percentage is not constant with different problem sizes
● Scatter the data before processing it
  – Distributed database
● Develop an algorithm that is “aware” of the data distribution
● Resilient and fault-tolerant
● Scalable
● One answer: MapReduce with HDFS
MapReduce
● Resembles Scatter / Reduce from MPI
● Added benefits:
  – Scalability
  – Fault-tolerance
● Called:
  – “Infrastructure”
  – “Framework”
  – “Technology”
● Criticism: is this really a new “technology”?
● Key element: HDFS
MapReduce example
● Counting word occurrences in a book.
● Main:
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
MapReduce example
Map and reduce
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Command line, Hadoop:
time hadoop jar wc.jar WordCount filename.txt output.txt

Stand-alone machine:
time cat filename.txt | tr '[:space:]' '[\n*]' | grep -v "^\s*$" | sort | uniq -c
...A pleasant smile broke
quietly over his lips.
—The mockery of it! he said gaily. Your absurd name, an ancient Greek!
He pointed his finger in friendly jest and went over to the parapet,
laughing to himself. Stephen Dedalus stepped up,
followed him wearily halfway and sat down on
the edge of the gunrest, watching him still as he propped his mirror on the parapet,
dipped the brush in the bowl and lathered cheeks and neck
...
…..
Pleasant 2
Pleasants 2
Please 9
Please, 1
Pleased 1
Pleasure 1
Pleiades, 1
Plenty 1
Plevna 3
Plevna. 1
Plot, 1
Plough 1
Plovers 1
Pluck 1
Plucking 1
Plump. 1
Plumped, 1
…..
MapReduce example
● Counting words in a book. Compare:

File size (KB) | Single machine | 1 master/5 slaves | 1 master/9 slaves | Number of splits
1.5            | 0.9s           | 23s               | 21s               | 1
15             | 6s             | 28s               | 27s               | 1
150            | 1m 4s          | 59s               | 58s               | 2
1500           | 9m 53s         | 3m 36s            | 3m 33s            | 12
15000          | 110m 47s       | 34m 4s            | 27m 11s           | 118
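From the 15,000 KB row of the table above, the measured speedups are well below the node counts, consistent with Amdahl's Law and with the per-job overhead visible in the small-file rows. A quick check, converting the times to seconds:

```python
# Speedup for the 15,000 KB word-count run (times from the table above).
single = 110 * 60 + 47   # 110m 47s on a single machine -> 6647 s
five   = 34 * 60 + 4     # 34m 4s with 1 master / 5 slaves -> 2044 s
nine   = 27 * 60 + 11    # 27m 11s with 1 master / 9 slaves -> 1631 s

print(round(single / five, 2))   # ≈ 3.25x with 5 slaves
print(round(single / nine, 2))   # ≈ 4.08x with 9 slaves
```

Note also that for the small files (1.5 KB and 15 KB) the cluster is slower than the single machine: the fixed job start-up cost dominates when there is little data to distribute.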
MapReduce example
● Counting words in a book. Compare:
[Chart: Word Count MapReduce — run time (s) vs file size (KB), up to 1,500 KB; series: single machine, 1 master/5 slaves, 1 master/9 slaves]
MapReduce example
● Counting words in a book. Compare:
[Chart: Word Count MapReduce — run time (s) vs file size (KB), up to 15,000 KB; series: single machine, 1 master/5 slaves, 1 master/9 slaves]
Spark
● Spark minimises I/O
  – Keeps partial results in memory
  – Smart scheduling
  – pyspark example:

text_file = sc.textFile("hdfs:///user/albarcza/test/4300.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs:///user/albarcza/test/sparktest")
Spark example
● Counting words in a book. Compare:

File size (KB) | 1 master/5 slaves | 1 master/9 slaves | Number of tasks
1.5            | 1.3s              | 1.4s              | 2
15             | 2.7s              | 2.8s              | 2
150            | 14.7s             | 13.9s             | 4
1500           | 35.2s             | 31.0s             | 24
15000          | 225.7s            | 225.6s            | 236
Spark vs MapReduce

[Chart: Word Count, MapReduce vs Spark — run time (s) vs file size (KB), up to 15,000 KB; series: single machine, 1/5 MapRed, 1/9 MapRed, Spark]
Teaching
● 158222 Data Wrangling and Machine Learning
– Perform data processing and data preparation tasks using domain-specific programming technologies.
– Integrate data from different sources and formats using a high-level programming language.
– Transform data into appropriate structures for analysis.
– Plot raw data and results of data analysis at an introductory level.
– Apply introductory machine learning and statistical techniques to generate data-driven solutions.
Teaching
● 158333 Applied Machine Learning and Data Visualisation
  – Use a broad variety of sophisticated machine learning and data mining techniques to extract patterns in data.
– Assess the usefulness of predictive models.
– Perform advanced data visualisation techniques.
– Formulate problems for real-world datasets from various contexts.
– Present data-driven solutions to real world problems.
– Devise strategies for Big Data problems.
Conclusions: negative aspects
● Too much jargon
● Too many competing tools
● Documentation often incomplete (e.g., Ambari)
● Difficult to configure anything beyond the defaults
● Very difficult to fine-tune particular jobs for performance (e.g., the MapReduce example)
● The effective size (disk and memory) is much smaller than the nominal one
● Many of the tools are not mature yet
  – (e.g., Zeppelin for multiple users)
Conclusions: positive aspects
● When it all works, it is wonderful: one can really use big data and get results
  – e.g., we trained a Random Forest (RF) for a 1000-class problem with 200 GB of images
  – Time series project (GDP prediction)
● Ambari facilitates the installation process
● Very good performance for multiple jobs
● Multiple uses for the machines (even when not used as a dedicated Hadoop cluster), flexible arrangement of the nodes
● Teaching: students benefit from using a true platform rather than just a sandbox.