Experiences with a new Hadoop cluster:
deployment, teaching and research
Andre Barczak, February 2018
Abstract
In 2017 the Machine Learning research group got funding for a new Hadoop cluster. However, the funding only sufficed for buying the hardware. The deployment of the cluster was done by the research group. The members had no previous specific experience with Hadoop clusters, so the learning curve was steep.
This talk covers the pros and cons of deploying a new machine in this manner, and also illustrates how we are currently using the machine for research and teaching. The history of Hadoop is briefly covered, putting it in context with other parallel platforms such as Beowulf clusters.
The beginnings
● Jan 2017: proposed a new Hadoop cluster for our research group
  – Several industry projects with big data
  – New Master of Analytics programme
● New courses in data analysis and big data
● No specific infrastructure for teaching or research
● The cluster should have:
  – 2 master servers
  – 14 slave nodes
  – ~NZ$80K budget, everything included
The machine
● 2 master nodes
  – 32 cores
  – 24 TB disk
  – 64 GB RAM
  – 2 × network interfaces
● 14 slave nodes
  – 8 cores
  – 8 TB disk
  – 32 GB RAM
● TOTAL
  – 160 TB disk
  – 576 GB RAM
Deployment
● Ambari → the easy choice for installing all the components
● Ubuntu → previous experience with Beowulf clusters
● ITS wants the nodes isolated from the network
  – No ITS support: the academics become system administrators...
Deployment
● Industry projects require confidentiality
● Teaching requires sharing

[Diagram: nodes split into a Teaching partition and a Research partition, both connected to the Internet]
Flexible Configuration
● Teaching: busy less than 30 weeks/year
● Research: may need full resources

[Diagram: nodes flexibly reallocated between the Teaching and Research partitions, both connected to the Internet]
The Software
● We chose Ambari as the main platform
● Free, open source
● No free support (this took its toll later)
● Tools included with Ambari:
  – HDFS
  – MapReduce, Spark
  – Hive, Pig etc...
● Two biggest hurdles:
  – Hostname → IP resolution
  – Wrong space measurement in HDFS
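The hostname → IP hurdle is a classic Ambari/HDFS pitfall: every node must resolve every other node's fully-qualified hostname consistently, both forward and reverse. A minimal sketch of the usual fix is an identical hosts file on all nodes; the hostnames and addresses below are hypothetical examples, not our actual configuration:

```
# /etc/hosts — kept identical on every node (names/addresses are examples only)
10.0.0.1   master1.cluster.local   master1
10.0.0.2   master2.cluster.local   master2
10.0.0.11  slave01.cluster.local   slave01
10.0.0.12  slave02.cluster.local   slave02
```

With the nodes isolated from the campus network (no central DNS), a shared hosts file like this is typically the simplest way to keep name resolution consistent.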
A view of the dashboard
A view of the hosts
From MPI to MapReduce
Early clusters (Beowulf type) with MPI (1994)
● Broadcast
● Scatter
● Gather
● Reduce
MPI Broadcast

[Diagram: the master sends a copy of the same data buffer to Node 1, Node 2, ..., Node N]
MPI Scatter

[Diagram: the master splits its buffer into blocks B1, B2, B3, ..., Bn and sends one block to each node]
MPI Gather

[Diagram: the master collects blocks B1, B2, B3, ..., Bn from the nodes back into a single buffer]
MPI Reduce

[Diagram: each node's data are combined on the master into one buffer by applying a function F( )]
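The four collectives above can be sketched in plain Python. This is a simulation of their semantics only, not real MPI (with mpi4py the corresponding calls would be comm.bcast, comm.scatter, comm.gather and comm.reduce); the function names here are illustrative.

```python
# Pure-Python sketch of MPI's collective operations.
# Each "node" is simply an entry in a list.
from functools import reduce as fold

def broadcast(data, n_nodes):
    """Broadcast: every node receives a copy of the master's buffer."""
    return [data for _ in range(n_nodes)]

def scatter(data, n_nodes):
    """Scatter: the master's buffer is split into blocks B1..Bn, one per node."""
    k = len(data) // n_nodes
    return [data[i * k:(i + 1) * k] for i in range(n_nodes)]

def gather(blocks):
    """Gather: the master concatenates the nodes' blocks into one buffer."""
    return [x for block in blocks for x in block]

def mpi_reduce(blocks, f):
    """Reduce: each node reduces locally, the master combines the partials with f."""
    partials = [fold(f, block) for block in blocks]  # local reduction per node
    return fold(f, partials)                         # final reduction on the master

data = list(range(8))
parts = scatter(data, 4)                       # [[0,1], [2,3], [4,5], [6,7]]
print(gather(parts))                           # round trip: [0, 1, ..., 7]
print(mpi_reduce(parts, lambda a, b: a + b))   # 28
```

The scatter/reduce pair is the part that MapReduce later echoes: distribute the data, compute partial results in parallel, combine them.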
Problem: Amdahl's Law

[Plots of Amdahl's Law speedup curves. Source: Wilkinson and Allen, 2005]
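Amdahl's Law bounds the speedup of a program with serial fraction f: on p processors, S(p) = 1 / (f + (1 - f)/p), so the speedup can never exceed 1/f no matter how many nodes are added. A quick numeric sketch:

```python
# Amdahl's Law: S(p) = 1 / (f + (1 - f) / p), bounded above by 1/f.
def amdahl_speedup(f: float, p: int) -> float:
    """Speedup on p processors for a program with serial fraction f."""
    return 1.0 / (f + (1.0 - f) / p)

# Even with only 5% serial work, 16 nodes give far less than a 16x speedup:
print(round(amdahl_speedup(0.05, 16), 2))      # ≈ 9.14
print(round(amdahl_speedup(0.05, 10**6), 2))   # ≈ 20.0, the 1/f limit
```

This is why simply adding slave nodes yields diminishing returns, and motivates the "beyond Amdahl" points on the next slide.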
Beyond Amdahl's Law
● The serial percentage is not constant with different problem sizes
● Scatter the data before processing it
  – Distributed database
● Develop an algorithm that is “aware” of the data distribution
● Resilient and fault-tolerant
● Scalable
● One answer: MapReduce with HDFS
MapReduce
● Resembles Scatter / Reduce from MPI
● Added benefits:
  – Scalability
  – Fault-tolerance
● Called:
  – “Infrastructure”
  – “Framework”
  – “Technology”
● Criticism: is this really a new “technology”?
● Key element: HDFS
MapReduce example
● Counting word occurrences in a book.
● Main:
public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}
MapReduce example
Map and reduce
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Command line, Hadoop:
time hadoop jar wc.jar WordCount filename.txt output.txt

Stand-alone machine:
time cat filename.txt | tr '[:space:]' '[\n*]' | grep -v "^\s*$" | sort | uniq -c
...A pleasant smile broke
quietly over his lips.
—The mockery of it! he said gaily. Your absurd name, an ancient Greek!
He pointed his finger in friendly jest and went over to the parapet,
laughing to himself. Stephen Dedalus stepped up,
followed him wearily halfway and sat down on
the edge of the gunrest, watching him still as he propped his mirror on the parapet,
dipped the brush in the bowl and lathered cheeks and neck
...
…..
Pleasant 2
Pleasants 2
Please 9
Please, 1
Pleased 1
Pleasure 1
Pleiades, 1
Plenty 1
Plevna 3
Plevna. 1
Plot, 1
Plough 1
Plovers 1
Pluck 1
Plucking 1
Plump. 1
Plumped, 1
…..
MapReduce example
● Counting words in a book. Compare:

File size (KB) | Single machine | 1 master/5 slaves | 1 master/9 slaves | Number of splits
1.5            | 0.9s           | 23s               | 21s               | 1
15             | 6s             | 28s               | 27s               | 1
150            | 1m 4s          | 59s               | 58s               | 2
1500           | 9m 53s         | 3m 36s            | 3m 33s            | 12
15000          | 110m 47s       | 34m 4s            | 27m 11s           | 118
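From the 15,000 KB row of the table above, the measured speedups are well below the node counts, consistent with Amdahl's Law and with the per-job overhead visible in the small-file rows. A quick check, converting the times to seconds:

```python
# Speedup for the 15,000 KB word-count run (times from the table above).
single = 110 * 60 + 47   # 110m 47s on a single machine -> 6647 s
five   = 34 * 60 + 4     # 34m 4s with 1 master / 5 slaves -> 2044 s
nine   = 27 * 60 + 11    # 27m 11s with 1 master / 9 slaves -> 1631 s

print(round(single / five, 2))   # ≈ 3.25x with 5 slaves
print(round(single / nine, 2))   # ≈ 4.08x with 9 slaves
```

Note also that for the small files (1.5 KB and 15 KB) the cluster is slower than the single machine: the fixed job start-up cost dominates when there is little data to distribute.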
MapReduce example
● Counting words in a book. Compare:
[Chart: Word Count MapReduce — run time (s) vs file size (KB), up to 1,500 KB; series: single machine, 1 master/5 slaves, 1 master/9 slaves]
MapReduce example
● Counting words in a book. Compare:
[Chart: Word Count MapReduce — run time (s) vs file size (KB), up to 15,000 KB; series: single machine, 1 master/5 slaves, 1 master/9 slaves]
Spark
● Spark minimises I/O
  – Keeps partial results in memory
  – Smart scheduling
  – pyspark example:

text_file = sc.textFile("hdfs:///user/albarcza/test/4300.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs:///user/albarcza/test/sparktest")
Spark example
● Counting words in a book. Compare:

File size (KB) | 1 master/5 slaves | 1 master/9 slaves | Number of tasks
1.5            | 1.3s              | 1.4s              | 2
15             | 2.7s              | 2.8s              | 2
150            | 14.7s             | 13.9s             | 4
1500           | 35.2s             | 31.0s             | 24
15000          | 225.7s            | 225.6s            | 236
Spark vs MapReduce

[Chart: Word Count, MapReduce vs Spark — run time (s) vs file size (KB), up to 15,000 KB; series: single machine, 1/5 MapRed, 1/9 MapRed, Spark]
Teaching
● 158222 Data Wrangling and Machine Learning
– Perform data processing and data preparation tasks using domain-specific programming technologies.
– Integrate data from different sources and formats using a high-level programming language.
– Transform data into appropriate structures for analysis.
– Plot raw data and results of data analysis at an introductory level.
– Apply introductory machine learning and statistical techniques to generate data-driven solutions.
Teaching
● 158333 Applied Machine Learning and Data Visualisation
  – Use a broad variety of sophisticated machine learning and data mining techniques to extract patterns in data.
– Assess the usefulness of predictive models.
– Perform advanced data visualisation techniques.
– Formulate problems for real-world datasets from various contexts.
– Present data-driven solutions to real world problems.
– Devise strategies for Big Data problems.
Conclusions: negative aspects
● Too much jargon
● Too many competing tools
● Documentation often incomplete (e.g., Ambari)
● Difficult to configure anything beyond the defaults
● Very difficult to fine-tune particular jobs for performance (e.g., the MapReduce example)
● The effective size (disk and memory) is much smaller than the nominal one
● Many of the tools are not mature yet
  – (e.g., Zeppelin for multiple users)
Conclusions: positive aspects
● When it all works, it is wonderful: one can really use big data and get results
  – e.g., we trained a Random Forest (RF) for a 1000-class problem with 200 GB of images
  – Time series project (GDP prediction)
● Ambari facilitates the installation process
● Very good performance for multiple jobs
● Multiple uses for the machines (even when not used as a dedicated Hadoop cluster), flexible arrangement of the nodes
● Teaching: students benefit from using a true platform rather than just a sandbox.