Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data...
Transcript of Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data...
![Page 1: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/1.jpg)
Daniel Vicory Allan Hancock College, Computer Science Mentor: Nan Li Faculty advisor: Prof. Xifeng Yan; University of California, Santa Barbara
![Page 2: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/2.jpg)
Data Mining: Big Picture
• Big data is rampant in fields, data mining helps solve that
• Process of extracting patterns and meaningful data from large data sets
• Useful for business, research, medicine, etc.
2
![Page 3: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/3.jpg)
Data Mining Applications
3
![Page 4: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/4.jpg)
What is MapReduce and Hadoop?
• MapReduce was invented by Google and used to index the web
• Hadoop is software that implements MapReduce
• Map and Reduce refer to the two main steps in algorithm
• MapReduce steps and final results are key-value pairs
4
1. map (k1,v1) → list(k2,v2) 2. reduce (k2,list(v2)) → list(k3,v3)
![Page 5: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/5.jpg)
Word Count in MapReduce
5
Courtesy of JTeam/Martijn van Groningen <http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/>
Input Splitting Mapping Shuffling Reducing Final Result
![Page 6: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/6.jpg)
The Problem of Skew
• MapReduce is a sequential algorithm
• Heterogeneous computing environments and non-random datasets can cause each task, or partition of data, to complete at varying times, also known as skew
• Skew can mean that a cluster will not be utilized efficiently
• SkewReduce, a framework developed by Washington researchers, solves the issue of skew
6
![Page 7: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/7.jpg)
Skew Illustrated
7
Tim
e El
apse
d
Time wasted
Time doing task
Courtesy of Skew-Resistant Parallel Processing […] YongChul Kwon, Magdalena Balazinksa, Bill Howe, and Jerome Rolia
#1 #2 #3 #4 #5 #6
![Page 8: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/8.jpg)
SkewReduce
• SkewReduce is a framework built on top of Hadoop
• Has an API, which is tied to processing specific types of data
• Also has an optimizer – Makes use of cost analysis functions
– Cost is used to partition data so that each computer finishes its task at about the same time as the rest
• Cost functions require sample data and more programming, not out of the box
8
![Page 9: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/9.jpg)
Project Goals
• Setup Hadoop cluster, run SkewReduce
• Work off of SkewReduce as a base
• Leave API alone, remove optimizer
• Implement a task scheduler that does not make use of cost functions or respective sample data
• Compare performance with default “dumb” Hadoop task scheduler and SkewReduce’s optimizer
9
![Page 10: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/10.jpg)
Our Optimized Task Scheduler
• Novel and clever way of “fast-tracking” tasks to completion
• Does not care about underlying data or algorithm
• Tasks which are deemed to take too long in comparison to all other equally-sized tasks on a computer are stopped and split up for rest of the cluster
• Removes the need for cost functions or sample data
10
![Page 11: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/11.jpg)
Task Scheduler Visualized Ti
me
Elap
sed
Incomplete, running too long, task
Redistributed task chunks
Killed tasks
Complete task
#1 #2 #3 #4 #5 #6
![Page 12: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/12.jpg)
Hadoop Cluster Performance Tuning
12
• Test cluster with 8665 books from Gutenberg project, or ~3.2 GB, using word count
• Seven node cluster, Core 2 Duo 2.8GHz, 3GB RAM, and 160GB HD each
Run # Configuration (each run inherits last configuration)
Runtime
1 8665 separate files, replication 2
1 hrs, 3 mins, 12 sec
2 Compiled single file 3 mins, 30 sec
3 Increase file buffer size 3 mins, 25 sec
4 Turn off speculative execution 3 mins, 20 sec
5 Increase MapReduce memory to 512MB from 200MB
3 mins, 30 sec
6 Increase block size to 128MB from 64MB
3 mins, 21 sec
![Page 13: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/13.jpg)
Experimental Methods
• Use SkewReduce’s following datasets:
• Use SkewReduce’s included MapReduce algorithms which identifies clusters of particles
• Benchmark SkewReduce emulating Hadoop behavior, SkewReduce’s Optimizer, and our task scheduler with both datasets
13
Dataset Size # Items Description
Astro 18 GB 900 M Cosmology simulation
Seaflow 1.9 GB 59 M Flow cytometry
![Page 14: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/14.jpg)
Expected Runtime Results
14.1
1.6
87.2
14.1
0
10
20
30
40
50
60
70
80
90
100
Hadoop's default scheduler
Our Task Scheduler Goal
SkewReduce's Optimizer
Astro (hours)
Seaflow (minutes)
14
Dataset (time scale)
![Page 15: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/15.jpg)
Challenges and Future Work
• Large learning curve for MapReduce, Hadoop, and SkewReduce
• Finish task scheduler
• Ensure task scheduler requires no changes to algorithm or dataset
• Experiment with small variations to task scheduler algorithm to improve upon
• Compare to Hadoop’s scheduler and SkewReduce’s optimizer
15
![Page 16: Daniel Vicory - inset-csep.cnsi.ucsb.eduinset-csep.cnsi.ucsb.edu/sites/inset-csep.cnsi... · Data Mining: Big Picture • Big data is rampant in fields, data mining helps solve that](https://reader033.fdocuments.in/reader033/viewer/2022043011/5fa48401c01acc27d02cd082/html5/thumbnails/16.jpg)
Acknowledgements
University of California, Santa Barbara
Mentor: Nan Li
Faculty advisor: Prof. Xifeng Yan
Graduate students: Shengqi Yang
University of Washington
Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions
YongChul Kwon, Magdalena Balazinska
16