Identifying and Incorporating Latencies in Distributed Data Mining Algorithms
Michael Sevilla
Applicability of Mahout for Large Data Sets
Michael Sevilla
What is Mahout?
• Distributed machine learning libraries
  – “scalable to reasonably large data sets”
  – Runs on Hadoop
http://heureka.blogetery.com/
The Data: Million Song Data Set
• Large Data Set
  – 1,019,318 users
  – 384,546 MSD songs
  – 48,373,586 (user, song, count)
• Kaggle Competition: offline evaluation
  – Predict songs a user will listen to using
    • Training: 1M user listening history
    • Validation: 110K users
• “Martin L” blogged his methodology + results
Motivations
• Can Mahout easily be modified?
• Can Mahout perform well for this workload?
• Can Mahout produce accurate results?
• Can Mahout work ‘out of box’?
• Hypothesis: 22 machines + Mahout > 1 guy
What kind of Recommender?
• Format: <userID, songID, count>
• Users interacting with items
• Users express preferences towards items
• We can use Collaborative Filtering
Collaborative Filtering
• Predicts preference of user towards an item
• Constructs a Top-N-Recommendation
1. Parse input training data
2. Create user-item-matrix
3. Predict missing entries
Mahout has item-based Collaborative Filtering jobs!
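The first two steps above, plus the notion of a "missing entry", can be sketched in a few lines of Python. This is only an illustration: the tab-separated input layout and the helper names `parse_line`, `build_user_item_matrix`, and `missing_entries` are assumptions, not taken from Mahout or from Martin's scripts.

```python
from collections import defaultdict


def parse_line(line):
    """Step 1: split one '<userID>\t<songID>\t<count>' line (assumed layout)."""
    user_id, song_id, count = line.rstrip("\n").split("\t")
    return user_id, song_id, int(count)


def build_user_item_matrix(lines):
    """Step 2: sparse user-item matrix stored as user -> {song: play count}."""
    matrix = defaultdict(dict)
    for line in lines:
        user_id, song_id, count = parse_line(line)
        matrix[user_id][song_id] = count
    return matrix


def missing_entries(matrix, user_id):
    """Songs the user has no count for; step 3 would score exactly these."""
    all_songs = {song for row in matrix.values() for song in row}
    return sorted(all_songs - set(matrix[user_id]))


# Tiny made-up sample, not real Million Song data.
sample = ["u1\ta\t3\n", "u1\tb\t1\n", "u2\tb\t2\n", "u2\tc\t5\n"]
matrix = build_user_item_matrix(sample)
print(missing_entries(matrix, "u1"))
```

Storing only the observed counts keeps the matrix sparse, which matters when most users have played only a tiny fraction of the songs.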
CAN MAHOUT EASILY BE MODIFIED?
Martin’s Code
• Methodology: similarity vector of history
  – Sparse-matrix
    • COLISTEN(i, j) – listeners who listened to i and j
  – Sum similarities for each song user x listens to
• The code: all Python
  – Parse: 27 lines of code (l.o.c)
  – Create Matrix: 46 l.o.c
  – Predict: 45 l.o.c
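A minimal sketch of the co-listen idea, assuming COLISTEN(i, j) is simply the number of users who have both songs in their history and that a candidate song's score is the sum of its co-listen counts with the songs the user already played. The function names and the toy data are hypothetical; this is not Martin's actual code.

```python
from collections import defaultdict
from itertools import combinations


def build_colisten(triplets):
    """Return (user -> set of songs, (i, j) -> users who listened to both)."""
    history = defaultdict(set)
    for user, song, _count in triplets:
        history[user].add(song)
    colisten = defaultdict(int)
    for songs in history.values():
        # Count every unordered pair once per user, stored in both directions.
        for i, j in combinations(sorted(songs), 2):
            colisten[(i, j)] += 1
            colisten[(j, i)] += 1
    return history, colisten


def recommend(user, history, colisten, top_n=3):
    """Score unseen songs by summing co-listen counts with the user's songs."""
    seen = history[user]
    scores = defaultdict(int)
    for (i, j), n in colisten.items():
        if i in seen and j not in seen:
            scores[j] += n
    # Highest score first; ties broken by song id for a stable order.
    return sorted(scores, key=lambda s: (-scores[s], s))[:top_n]


# Made-up (user, song, count) triplets for illustration only.
triplets = [
    ("u1", "a", 3), ("u1", "b", 1),
    ("u2", "a", 1), ("u2", "b", 2), ("u2", "c", 5),
    ("u3", "b", 1), ("u3", "c", 1),
]
history, colisten = build_colisten(triplets)
print(recommend("u1", history, colisten))
```

Scanning every pair in `recommend` is quadratic in the worst case; a real implementation would index the co-listen counts by song, but the scan keeps the sketch short.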
Mahout’s Code
• Methodology:
  – No Idea…
• The code: all Java
  – Poorly commented
  – 14 *.java files
  – Many Directories
    • ~/mahout/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
  – RecommenderJob.java: 284 lines of code (l.o.c)
  – SimilarityMatrixRowWrapperMapper.java: 47 l.o.c
  – UserVectorSplitterMapper.java: 138 l.o.c
Mahout’s Code
CAN MAHOUT EASILY BE MODIFIED?
NO
CAN MAHOUT PERFORM WELL FOR THIS WORKLOAD?
Martin’s Code
• Performance on 86MB:
  – Parse data: 10 minutes
  – Make Matrix: 22 minutes
  – Predict songs for 11000 users: 1 hour, 18 minutes
• Did not test scalability
$/ python convertToNumbers.py
$/ python colisten.py
$/ python predict_colisten.py
Mahout’s Code
• Performance on 86MB:
  – Parse Time: 10 minutes
  – Total Time: 25 minutes
• Tested scalability
  – 64MB, 128MB, 256MB, 1GB, 2GB, 3GB
Mahout’s Code
• Total Time
• ~12m, 43m, 1hr, 2hr, 4hr, >5hr …
10 Nodes Failed
Mahout’s Code
• Prepare Jobs (parse): seconds - minutes
Mahout’s Code
• Recommend Jobs (predict): seconds - minutes
Mahout’s Code
• Create Matrix Jobs: minutes - hours
CAN MAHOUT PERFORM WELL FOR THIS WORKLOAD?
NO
CAN MAHOUT PRODUCE ACCURATE RESULTS?
Training Set
• Kaggle Million Song Subset: 110K users
  – User 2: 16 entries – took out 8
  – User 16: 32 entries – took out 8
  – User 17: 25 entries – took out 8
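Hiding 8 entries per user, as listed above, amounts to a hold-out split: the recommender sees the remaining history and is judged on whether it recovers the hidden songs. The slide does not say how the 8 entries were chosen; the sketch below assumes a seeded random sample, and `hold_out` and the song names are hypothetical.

```python
import random


def hold_out(songs, n_hidden=8, seed=0):
    """Split one user's history into (visible, hidden) song lists.

    Assumes the hidden entries are picked uniformly at random; the
    original selection rule is not stated in the slides.
    """
    rng = random.Random(seed)
    songs = list(songs)
    hidden = set(rng.sample(songs, n_hidden))
    visible = [s for s in songs if s not in hidden]
    return visible, sorted(hidden)


# A made-up user with 16 entries, matching the size quoted for User 2.
user_2 = [f"song{i}" for i in range(16)]
visible, hidden = hold_out(user_2)
print(len(visible), len(hidden))
```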
Martin’s Code
User 2:
User 16:
User 17:
where Q is the number of queries
Mahout’s Code
User 2:
User 16:
User 17:
where Q is the number of queries
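The scoring formula and the per-user values on these two slides did not survive in the transcript. The surviving phrase "where Q is the number of queries" matches the usual definition of mean average precision, MAP = (1/Q) Σ AveP(q), so the sketch below assumes that metric; it is not confirmed by the slides, and the Kaggle competition's own variant may truncate or normalize differently. Function names and sample data are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked recommendation list (one query)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, song in enumerate(ranked, start=1):
        if song in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(queries):
    """Mean of the per-query average precision over Q (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Made-up example: one user with two hidden songs, one with a single miss.
queries = [
    (["a", "x", "b"], ["a", "b"]),
    (["x", "y"], ["a"]),
]
print(mean_average_precision(queries))
```

Here each user with hidden songs is one query, so with three test users Q would be 3.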
CAN MAHOUT PRODUCE ACCURATE RESULTS?
YES
CAN MAHOUT WORK ‘OUT OF BOX’?
YES… but not well
Conclusion
• Mahout did not scale well
• Mahout was not easy to learn
• Mahout was not easily modifiable
• For performance and efficiency, it is better to
  – Understand the data set
  – Understand data mining
  – Understand the methodology