Introduction to Mahout with HDInsight
![Page 1: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/1.jpg)
MAKING BUSINESS INTELLIGENT www.pragmaticworks.com
Introduction to Mahout with HDInsight (Hadoop)
Chris Price, Senior BI Consultant
@BluewaterSQL
![Page 2: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/2.jpg)
Intro
- Chris Price, Senior BI Consultant with Pragmatic Works
- Author
- Regular Speaker
- Data Geek & Super Dad!
@BluewaterSQL http://bluewatersql.wordpress.com/ [email protected]
![Page 3: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/3.jpg)
Survey
Who's currently using machine learning?
- Google
- Facebook
- LinkedIn
- Twitter
- Amazon
- Wal-Mart
![Page 4: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/4.jpg)
Outline
- Mahout Introduction
- The Algorithms
- Hands On: A recommendation engine
![Page 5: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/5.jpg)
Riding the Elephant
- Born out of the Apache Lucene project
- Top-level Apache project
- A scalable machine learning library
- Fast, Efficient & Pragmatic
- Many of the algorithms can be run on Hadoop
![Page 6: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/6.jpg)
Algorithms
- Collaborative Filtering: Item/User Recommenders
- Clustering: Grouping movies by type
- Classification: Categorizing documents
- Frequent Itemset: Market basket analysis
![Page 7: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/7.jpg)
Collaborative Filtering
- Find the subset of users whose tastes and preferences are similar to the target user's, and use this subset for recommendations
- Types: User-Based, Item-Based
- Examples: Amazon
![Page 8: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/8.jpg)
Clustering
- Group similar objects
- Examples: News Aggregator, Customer Grouping
![Page 9: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/9.jpg)
Clustering
- Algorithms: K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet
- Similarity Distance: Euclidean, Squared Euclidean, Cosine, Tanimoto, Manhattan
** Also weighted measures
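The distance measures named on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch, not Mahout's implementation: the function names are made up for the example, vectors are assumed to be equal-length and non-zero, and the Tanimoto form used is the common vector generalisation (1 minus dot product over the sum of squared norms minus dot product).

```python
import math

def squared_euclidean(a, b):
    # Sum of squared per-dimension differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance: square root of the squared distance.
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    # Sum of absolute per-dimension differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle); assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 - dot / (|a|^2 + |b|^2 - dot); assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q), squared_euclidean(p, q), manhattan(p, q))  # 5.0 25.0 7.0
```

Which measure suits a clustering job depends on the data: Euclidean and Manhattan are sensitive to magnitude, while cosine and Tanimoto mostly compare direction or overlap.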
![Page 10: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/10.jpg)
Clustering
![Page 11: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/11.jpg)
Classification
- Using a pre-determined set of groups, predict the type of a new object based on its features
- Classifiable Data
  - Continuous: quantitative value (e.g. stock price)
  - Categorical: small known set (e.g. colors)
  - Word-Like: large unknown set
  - Text-Like: many word-like values that are unordered
- Examples: Spam Identification, Photo Facial Recognition
![Page 12: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/12.jpg)
Frequent Itemset
- Examples: Product Placement, Market Basket Analysis, Query Recommendations
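To make market basket analysis concrete, here is a minimal sketch that counts how often pairs of items appear in the same basket and keeps the pairs that meet a minimum support. It is brute-force pair counting on one machine, not a scalable itemset-mining algorithm, and the function name and sample baskets are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # De-duplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]
print(frequent_pairs(baskets, min_support=3))  # {('beer', 'diapers'): 3}
```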
![Page 13: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/13.jpg)
Mahout on HDInsight
Installation
- Download: http://www.apache.org/dyn/closer.cgi/mahout/
- Unpack
- Add to Path (Environment Variable)
![Page 14: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/14.jpg)
Recommendation Engine
- Define the Business Objective
  - Metrics
  - Context
- Identify Data Sources
  - Normalization
  - Data Shift
- Which Algorithm?
- Integration?
![Page 15: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/15.jpg)
Business Objective
- Website
- Navigational Inefficiency
- Cross-Sell
- Up-Sell
- Increase # of Orders
- Increase Items per Order
- Increase Average Item Price
- Increase Revenue
![Page 16: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/16.jpg)
Handling Context
- ???
- January: 20 degrees & snowing...
![Page 17: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/17.jpg)
Data Acquisition
Sources of Data for Recommendation
- Explicit: Ratings, Feedback, Demographics, Psychographics (Personality/Lifestyle/Attitude), Ephemeral Need (need of the moment)
- Implicit: Purchase History, Click/Browse History
- Product/Item: Taxonomy, Attributes, Descriptions
![Page 18: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/18.jpg)
Data Preparation
- Preparation
  - Remove Outliers (Z-Score)
  - Remove frequent buyers (Skew)
  - Normalize Data (Unity-Based)
- Beware of Data Shift
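Two of the preparation steps above, z-score outlier removal and unity-based (min-max) normalization, can be sketched as follows. The function names and the cutoff of 2.0 are assumptions chosen for this small sample; in practice the threshold is a judgment call, and with very few values a large z-score is mathematically impossible.

```python
import statistics

def remove_outliers(values, max_z=3.0):
    """Drop values whose z-score (distance from the mean in standard deviations) exceeds max_z."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return list(values)  # All values identical: nothing to remove.
    return [v for v in values if abs(v - mean) / stdev <= max_z]

def unity_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # Avoid division by zero on constant input.
    return [(v - lo) / (hi - lo) for v in values]

order_totals = [10, 12, 11, 13, 12, 11, 10, 12, 500]
cleaned = remove_outliers(order_totals, max_z=2.0)  # The 500 order is dropped.
scaled = unity_normalize(cleaned)
```

Outliers are removed before normalizing so that a single extreme value does not compress every other value toward zero.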
![Page 19: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/19.jpg)
Algorithms
- Collaborative Filtering (Mahout)
  - User-Based
  - Item-Based
- Content-Based (Mahout Clustering)
- Data Mining (SSAS)
  - Association
  - Clustering
![Page 20: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/20.jpg)
CF Recommendations
- Neighborhood Formation
- Similarity Metrics
  - Pearson Correlation
  - Euclidean Distance
  - Spearman Correlation
  - Cosine
  - Tanimoto Coefficient
  - Log-Likelihood
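As an illustration of two of these metrics, the sketch below computes Pearson correlation over the items two users have both rated, and the Tanimoto coefficient over the sets of items each user has touched. The function names and the dictionary-of-ratings layout are assumptions for the example, not Mahout's API.

```python
import math

def pearson(prefs_a, prefs_b):
    """Pearson correlation over co-rated items; prefs map item -> rating."""
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return 0.0  # Not enough overlap to measure correlation.
    xs = [prefs_a[i] for i in common]
    ys = [prefs_b[i] for i in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # A user who rates everything the same has no trend.
    return cov / math.sqrt(var_x * var_y)

def tanimoto(items_a, items_b):
    """Size of the intersection over size of the union; ignores rating values."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

alice = {"A": 1.0, "B": 2.0, "C": 3.0}
bob = {"A": 2.0, "B": 4.0, "C": 6.0}
carol = {"A": 3.0, "B": 2.0, "C": 1.0}
```

Pearson rewards users whose ratings move together even when one rates consistently higher, while Tanimoto is useful when only "bought or not bought" data is available.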
![Page 21: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/21.jpg)
CF Pseudo-Code
for each item i that u has no preference for
  for each user v that has a preference for i
    compute similarity s between u and v
    calculate running average of v's preference for i, weighted by s
return top ranked (weighted average) i

Restrict to Neighborhood
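The pseudo-code above translates fairly directly into Python. This is a minimal sketch under stated assumptions: preferences are held in a nested dictionary, the similarity function is a simple inverse-Euclidean measure picked only to keep the example self-contained, and `neighborhood_size` is a hypothetical parameter that implements the "restrict to neighborhood" step by keeping only the most similar users.

```python
import math

def similarity(a, b):
    """Inverse Euclidean distance over co-rated items (illustrative choice)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dist = math.sqrt(sum((a[i] - b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + dist)

def recommend(user, prefs, top_n=3, neighborhood_size=None):
    own = prefs[user]
    # Similarity between u and every other user v, computed once up front.
    sims = {v: similarity(own, p) for v, p in prefs.items() if v != user}
    neighbors = sorted(sims, key=lambda v: (-sims[v], v))
    if neighborhood_size is not None:
        neighbors = neighbors[:neighborhood_size]
    # Items u has no preference for, drawn from the neighbors' items.
    candidates = {i for v in neighbors for i in prefs[v]} - set(own)
    scores = {}
    for item in candidates:
        total = weight = 0.0
        for v in neighbors:
            s = sims[v]
            if item not in prefs[v] or s <= 0:
                continue
            total += s * prefs[v][item]  # v's preference weighted by s
            weight += s
        if weight > 0:
            scores[item] = total / weight  # weighted average
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

prefs = {
    "u": {"A": 5.0, "B": 3.0},
    "v": {"A": 5.0, "B": 3.0, "C": 4.0},
    "w": {"A": 1.0, "B": 5.0, "C": 2.0, "D": 5.0},
}
everyone = recommend("u", prefs)
nearest_only = recommend("u", prefs, neighborhood_size=1)
```

In this sample, item D ranks first when every user is considered even though only the dissimilar user w rated it; limiting the neighborhood to the single closest user leaves only item C, which is the motivation for restricting to a neighborhood.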
![Page 22: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/22.jpg)
Testing
- Smell Test
- Built-In (Requires Java Coding)
  - Root Mean Squared Error (RMSE)
  - Average Absolute Difference

RandomUtils.useTestSeed();
evaluator.evaluate(builder, null, model, 0.7, 1.0);
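The two built-in scores are simple to compute by hand, which helps when reading the evaluator's output. The sketch below is plain Python rather than the Mahout API: `split` is a hypothetical helper showing a seeded, repeatable 70/30 hold-out in the spirit of the 0.7 argument and the fixed test seed, and the two metric functions compare held-out ratings with predicted ones.

```python
import math
import random

def split(ratings, training_fraction, seed):
    """Shuffle with a fixed seed, then split into (training, test) lists."""
    rng = random.Random(seed)  # Fixed seed makes the split repeatable.
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]

def average_absolute_difference(actual, predicted):
    # Mean of |actual - predicted|; every error counts in proportion to its size.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Square root of the mean squared error; large misses are penalised more.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.0, 3.0, 4.0, 4.0]
print(average_absolute_difference(actual, predicted))  # 1.0
train, test = split(range(10), 0.7, seed=42)
```

For both scores, lower is better, and 0.0 means the held-out ratings were predicted exactly.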
![Page 23: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/23.jpg)
Recommendation Engine Steps
1 – Generate List of ItemIDs
2 – Create Preference Vector
3 – Count Unique Users
4 – Transpose Preference Vectors
5 – Row Similarity
  - Compute Weights
  - Compute Similarities
  - Similarity Matrix
6 – Pre-Partial Multiply, Similarity Matrix
7 – Pre-Partial Multiply, Preferences
8 – Partial Multiply (Steps 6 & 7)
9 – Filter Items
10 – Aggregate & Recommend
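Stripped of the Map/Reduce plumbing, the steps above amount to multiplying an item-to-item similarity matrix by a user's preference vector, filtering out items the user already has, and ranking the rest. The single-machine sketch below illustrates that idea only: it uses raw co-occurrence counts as the similarity (an assumption; the slide's "Row Similarity" step can use other measures), and all names are invented for the example.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence(prefs):
    """Item-to-item matrix: how many users have a preference for both items."""
    matrix = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in permutations(items, 2):
            matrix[a][b] += 1
    return matrix

def recommend(user, prefs, top_n=3):
    matrix = cooccurrence(prefs)          # roughly steps 1-5
    own = prefs[user]                     # the user's preference vector
    scores = defaultdict(float)
    for item, pref in own.items():        # roughly steps 6-8: matrix x vector
        for other, count in matrix[item].items():
            if other not in own:          # step 9: filter items already known
                scores[other] += count * pref
    # Step 10: aggregate and rank.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

prefs = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 4.0, "B": 2.0, "C": 5.0},
    "u3": {"B": 4.0, "C": 3.0, "D": 5.0},
}
print(recommend("u1", prefs))  # [('C', 11.0), ('D', 3.0)]
```

The distributed version splits the same multiplication into partial products so that no single machine has to hold the whole matrix.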
![Page 24: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/24.jpg)
Batch Integration
- ETL Data to HDFS: SSIS, Map/Reduce
- Process with Mahout
- ETL Results: Map/Reduce, Hive/Sqoop
![Page 25: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/25.jpg)
Hands-On Demo
![Page 26: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/26.jpg)
Resources
- Mahout in Action, by Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman
- Hadoop: The Definitive Guide, by Tom White
![Page 27: Introduction to Mahout with HDInsight](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b72bf84a795916198b480a/html5/thumbnails/27.jpg)
Thank you!
@BluewaterSQL http://bluewatersql.wordpress.com/ [email protected]