Identifying and Incorporating Latencies in Distributed Data Mining Algorithms Michael Sevilla.

Applicability of Mahout for Large Data Sets

Michael Sevilla

What is Mahout?

• Distributed machine learning libraries
  – “scalable to reasonably large data sets”
  – Runs on Hadoop

http://heureka.blogetery.com/

The Data: Million Song Data Set

• Large Data Set
  – 1,019,318 users
  – 384,546 MSD songs
  – 48,373,586 (user, song, count) triplets

• Kaggle Competition: offline evaluation
  – Predict songs a user will listen to using
    • Training: 1M user listening history
    • Validation: 110K users

• “Martin L” blogged his methodology + results

Motivations

• Can Mahout easily be modified?
• Can Mahout perform well for this workload?
• Can Mahout produce accurate results?
• Can Mahout work ‘out of box’?

• Hypothesis: 22 machines + Mahout > 1 guy

What kind of Recommender?

• Format: <userID, songID, count>
• Users interacting with items
• Users express preferences towards items

• We can use Collaborative Filtering

Collaborative Filtering

• Predicts preference of user towards an item
• Constructs a Top-N-Recommendation

1. Parse input training data
2. Create user-item-matrix
3. Predict missing entries

Mahout has item-based Collaborative Filtering jobs!
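
The first two steps, and the starting point of the third, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Mahout's or Martin's implementation: the tab-separated input layout and the function names parse_triplets, build_user_item_matrix and missing_entries are hypothetical.

```python
from collections import defaultdict

def parse_triplets(lines):
    """Step 1: parse '<userID>\t<songID>\t<count>' lines (assumed layout)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, song, count = line.split("\t")
        yield user, song, int(count)

def build_user_item_matrix(triplets):
    """Step 2: sparse user-item matrix stored as {user: {song: count}}."""
    matrix = defaultdict(dict)
    for user, song, count in triplets:
        matrix[user][song] = matrix[user].get(song, 0) + count
    return matrix

def missing_entries(matrix, user):
    """Step 3 starts here: songs with no entry for this user are the ones to predict."""
    all_songs = {song for songs in matrix.values() for song in songs}
    return all_songs - set(matrix.get(user, {}))

# Tiny made-up sample, only to show the shapes involved.
sample = ["u1\ts1\t3", "u1\ts2\t1", "u2\ts2\t5", "u2\ts3\t2"]
matrix = build_user_item_matrix(parse_triplets(sample))
candidates = missing_entries(matrix, "u1")
```

Storing only the observed entries matters at this scale: a dense matrix of 1,019,318 users by 384,546 songs would be almost entirely empty.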

CAN MAHOUT EASILY BE MODIFIED?

Martin’s Code

• Methodology: similarity vector of history
  – Sparse-matrix
    • COLISTEN(i, j): number of listeners who listened to both i and j
  – Sum similarities for each song user x listens to
• The code: all Python
  – Parse: 27 lines of code (l.o.c)
  – Create Matrix: 46 l.o.c
  – Predict: 45 l.o.c
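
The co-listen idea can be sketched as follows. This is not Martin's actual code: it assumes raw co-listen counts are used directly as the similarity (the original may normalize them), and the names colisten_matrix and recommend are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def colisten_matrix(history):
    """COLISTEN(i, j): number of users who listened to both song i and song j."""
    colisten = defaultdict(lambda: defaultdict(int))
    for songs in history.values():
        # Each unordered pair of distinct songs in one user's history counts once.
        for i, j in combinations(sorted(set(songs)), 2):
            colisten[i][j] += 1
            colisten[j][i] += 1
    return colisten

def recommend(user, history, colisten, n=10):
    """Score each unseen song by summing its co-listen counts with the user's songs."""
    seen = set(history[user])
    scores = defaultdict(int)
    for song in seen:
        for other, count in colisten[song].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken by song ID so the order is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _ in ranked[:n]]

# Made-up listening histories for illustration.
history = {
    "u1": ["a", "b"],
    "u2": ["a", "b", "c"],
    "u3": ["b", "c", "d"],
}
colisten = colisten_matrix(history)
top = recommend("u1", history, colisten, n=2)
```

Here song c is ranked ahead of d for user u1 because it was co-listened with both a and b, while d only shares one listener with b.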

Mahout’s Code

• Methodology:
  – No idea…

• The code: all Java
  – Poorly commented
  – 14 *.java files
  – Many directories
    • ~/mahout/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java
  – RecommenderJob.java: 284 lines of code (l.o.c)
  – SimilarityMatrixRowWrapperMapper.java: 47 l.o.c
  – UserVectorSplitterMapper.java: 138 l.o.c

CAN MAHOUT EASILY BE MODIFIED?

NO

CAN MAHOUT PERFORM WELL FOR THIS WORKLOAD?

Martin’s Code

• Performance on 86MB:
  – Parse data: 10 minutes
  – Make Matrix: 22 minutes
  – Predict songs for 11,000 users: 1 hour, 18 minutes

• Did not test scalability

$ python convertToNumbers.py
$ python colisten.py
$ python predict_colisten.py

Mahout’s Code

• Performance on 86MB:
  – Parse Time: 10 minutes
  – Total Time: 25 minutes

• Tested scalability
  – 64MB, 128MB, 256MB, 1GB, 2GB, 3GB

Mahout’s Code

• Total Time: ~12m, 43m, 1hr, 2hr, 4hr, >5hr…

10 Nodes Failed

Mahout’s Code

• Prepare Jobs (parse): seconds - minutes

Mahout’s Code

• Recommend Jobs (predict): seconds - minutes

Mahout’s Code

• Create Matrix Jobs: minutes - hours

CAN MAHOUT PERFORM WELL FOR THIS WORKLOAD?

NO

CAN MAHOUT PRODUCE ACCURATE RESULTS?

Training Set

• Kaggle Million Song Subset: 110K users
  – User 2: 16 entries – took out 8
  – User 16: 32 entries – took out 8
  – User 17: 25 entries – took out 8
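
Taking entries out of a user's history can be sketched as below. The slide does not say how the 8 removed entries were chosen; this sketch assumes a seeded random sample, and the helper name hold_out is hypothetical.

```python
import random

def hold_out(entries, k=8, seed=0):
    """Split one user's entries into a visible part and k hidden entries."""
    if len(entries) <= k:
        raise ValueError("need more than k entries to hold some out")
    rng = random.Random(seed)  # seeded so the split is repeatable
    hidden_idx = set(rng.sample(range(len(entries)), k))
    visible = [e for i, e in enumerate(entries) if i not in hidden_idx]
    hidden = [e for i, e in enumerate(entries) if i in hidden_idx]
    return visible, hidden

# A made-up history with 16 entries, the same size as User 2's.
entries = [(f"song{i}", 1) for i in range(16)]
visible, hidden = hold_out(entries, k=8)
```

The recommender only sees the visible entries; the hidden ones serve as the ground truth its predictions are scored against.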

Martin’s Code

[Figure: evaluation results for User 2, User 16 and User 17, scored with a formula in which Q is the number of queries]

Mahout’s Code

[Figure: evaluation results for User 2, User 16 and User 17, scored with a formula in which Q is the number of queries]
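
The scoring formula itself did not survive in the transcript. The phrase "where Q is the number of queries" matches the usual definition of mean average precision (MAP), so the sketch below assumes that metric; the deck does not confirm it, and the function names are hypothetical. It also assumes each recommendation list contains no duplicates.

```python
def average_precision(recommended, relevant):
    """Average precision of one ranked list against the held-out songs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, song in enumerate(recommended, start=1):
        if song in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)

def mean_average_precision(queries):
    """queries: list of (recommended, relevant) pairs; Q is len(queries)."""
    if not queries:
        return 0.0
    return sum(average_precision(rec, rel) for rec, rel in queries) / len(queries)

# Two made-up queries (users) for illustration.
queries = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
score = mean_average_precision(queries)
```

Under this assumption, each user with held-out songs is one query, and a higher score means the held-out songs appear nearer the top of the recommendations.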

CAN MAHOUT PRODUCE ACCURATE RESULTS?

YES

CAN MAHOUT WORK ‘OUT OF BOX’?

YES… but not well

Conclusion

• Mahout did not scale well
• Mahout was not easy to learn
• Mahout was not easily modifiable

• For performance and efficiency, it is better to
  – Understand the data set
  – Understand data mining
  – Understand the methodology