Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016
chris-fregly -
Flux Capacitor AI: Bringing AI Back to the Future!
advancedspark.com
Who Am I?
Streaming Data Engineer, Netflix OSS Committer
Data Solutions Engineer, Apache Contributor
Principal Data Solutions Engineer, IBM Technology Center
Meetup Organizer, Advanced Apache Meetup
Book Author, Advanced . (Due 2016)

Advanced Apache Spark Meetup
http://advancedspark.com
Meetup Metrics
  Top 10 Most-active Spark Meetup!
  3200+ Members in just 9 mos!!
  3700+ Docker downloads (demos)
Meetup Mission
  Code deep-dive into Spark and related open source projects
  Surface key patterns and idioms
  Focus on distributed systems, scale, and performance

Live, Interactive Demo!
Audience Participation Required!!
Cell Phone Compatible!!!
demo.advancedspark.com

http://demo.advancedspark.com
[Diagram: demo pipeline with labels End User, ElasticSearch, Spark ML, Data Scientist, Kafka, Spark Streaming, Cassandra, Redis, Zeppelin, iPython]

Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Scaling with Parallelism
[Diagram labels: Peter, O(log n), Worker Nodes]

Parallelism with Composability
[Diagram labels: Worker 1, Worker 2, Collect at Driver]
Max: (a max b max c max d) == (a max b) max (c max d)
Set Union: (a U b U c U d) == (a U b) U (c U d)
Addition: (a + b + c + d) == (a + b) + (c + d)
Multiply: (a * b * c * d) == (a * b) * (c * d)
What about Division and Average?

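The composability identities on this slide can be checked with a small sketch. It assumes a hypothetical split of four values across two workers; the helper names `distributed` and `sequential` are illustrative and not taken from the talk.

```python
from functools import reduce
import operator

values = [3, 4, 7, 8]
# Hypothetical split of the data across two workers.
worker_1, worker_2 = values[:2], values[2:]

def distributed(op, left, right):
    """Reduce each partition locally, then combine the two partial results."""
    return op(reduce(op, left), reduce(op, right))

def sequential(op, data):
    """Reduce the whole list left to right on a single node."""
    return reduce(op, data)

composable = {"max": max, "addition": operator.add, "multiply": operator.mul}
results = {
    name: sequential(op, values) == distributed(op, worker_1, worker_2)
    for name, op in composable.items()
}

# Set union composes the same way.
sets = [{1}, {2}, {2, 3}, {4}]
union_ok = sequential(operator.or_, sets) == distributed(operator.or_, sets[:2], sets[2:])

# Division does not: regrouping changes the answer.
division_sequential = sequential(operator.truediv, values)                # ((3 / 4) / 7) / 8
division_distributed = distributed(operator.truediv, worker_1, worker_2)  # (3 / 4) / (7 / 8)
```

Every entry in `results` and `union_ok` is true, while the two division results differ, which is why division cannot simply be pushed down to the workers.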
What about Division?
Division: (a / b / c / d) != (a / b) / (c / d)
(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857
Not Composable
What were the Egyptians thinking?!
“Divide like an Egyptian”

What about Average?
Overall AVG, carried as (value, count) pairs: (3, 1) + (5, 1) + (5, 1) + (7, 1)
  (3 + 5 + 5 + 7) / (1 + 1 + 1 + 1) == 20 / 4 == 5
  Add, Add, Add? Composable!
  Single-Node Divide at the End? Doesn’t need to be Composable!
Pairwise AVG
  (3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
  Divide, Add, Divide? Not Composable
AVG (3, 5, 5, 7) == 5

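A minimal sketch of the "add, add, add, divide once at the end" idea, assuming the four values from the slide are split across two hypothetical partitions. Each value is carried as a (sum, count) pair so that the combine step is plain addition.

```python
from functools import reduce

def combine(a, b):
    """Composable step: add the sums and add the counts."""
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical partitions holding the values from the slide.
partitions = [[3, 5], [5, 7]]

# Each worker reduces its own (value, 1) pairs.
partials = [reduce(combine, [(v, 1) for v in part]) for part in partitions]

# The driver combines the partial pairs and divides exactly once.
total, count = reduce(combine, partials)
average = total / count

# Dividing on each worker first, then adding, gives the wrong answer.
pairwise = sum(sum(part) / len(part) for part in partitions)
```

Here `average` is 5.0, matching the overall AVG, while `pairwise` is 10.0, the non-composable result shown on the slide.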
Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Similarities
Euclidean Similarity
Exists in Euclidean, flat space
Based on Euclidean distance
Linear measure
Bias towards magnitude

Cosine Similarity
Angular measure
Adjusts for Euclidean magnitude bias
Normalize to unit vectors in all dimensions
Used with real-valued vectors (versus binary)
org.jblas.DoubleMatrix

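To illustrate the magnitude bias, here is a small sketch comparing the two measures on made-up rating vectors. The vectors `casual` and `binge` are hypothetical; they point in the same direction but differ in scale by a factor of ten.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; grows with the magnitude of the vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical users with identical taste but very different activity levels.
casual = [1.0, 2.0, 1.0]
binge = [10.0, 20.0, 10.0]

distance = euclidean_distance(casual, binge)
similarity = cosine_similarity(casual, binge)
```

The Euclidean distance is large (about 22) even though the cosine similarity is 1 up to rounding, which is the bias the cosine measure adjusts for.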
Jaccard Similarity
Set similarity measurement
Set intersection / set union
Bias towards popularity
Works with binary vectors

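As a quick sketch of "set intersection / set union", the function below computes Jaccard similarity for two tag sets. The tags are invented for illustration and are not the tags used in the demo.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical movie tags.
top_gun = {"action", "military", "pilot"}
a_few_good_men = {"drama", "military", "courtroom"}

score = jaccard(top_gun, a_few_good_men)
```

The two sets share one tag out of five distinct tags, so `score` is 0.2.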
Log Likelihood Similarity
Adjusts for popularity bias
Netflix “Shawshank” problem

Word Similarity
Edit Distance
  Misspellings and autocorrect
Word2Vec
  Similar words are defined by similar contexts in vector space
[Figure labels: English, Spanish]

Demo! Find Synonyms with Word2Vec
[Screenshot: Find Synonyms using Word2Vec]

Document Similarity
TF/IDF
  Term Freq / Inverse Document Freq
  Used by most search engines
Doc2Vec
  Similar documents are determined by similar contexts

Bonus! Text Rank Document Summary
Text Rank (aka Sentence Rank)
  Surface summary sentences
  TF/IDF + Similarity Graph + PageRank
Most similar sentence to all other sentences
  TF/IDF + Similarity Graph
Most influential sentences
  PageRank

Similarity Pathways (Recommendations)
Best recommendations for 2 (or more) people
“You like Mad Max. I like Message in a Bottle. We might like a movie similar to both.”
Item-to-Item Similarity Graph + Dijkstra Heaviest Path

Demo! Similarity Pathway for Movie Recommendations
Load Movies with Tags into DataFrame
[Screenshot labels: My Choice, Their Choice]

Item-to-Item Tag Jaccard Similarity
Based on Tags
Calculate Jaccard Similarity (Tag Set Similarity)
Must be Above the Given Jaccard Similarity Threshold

Item-to-Item Tag Similarity Graph
Edge Weights == Jaccard Similarity (Based on Tag Sets)

TODO: Use Dijkstra to Find Heaviest Pathway
Calculating Exact Similarity
Brute-Force Similarity
  Cartesian Product
  O(n^2) shuffle and compute
  aka All-pairs, Pair-wise, Similarity Join

Calculating Approximate Similarity
Goal: Reduce Shuffle
Approximate Similarity
  Sampling
  Bucketing or Clustering
  Ignore low-similarity probability
Locality Sensitive Hashing
  Twitter Algebird MinHash
[Figure label: Bucket By Genre]

Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Recommendations
Basic Terminology
User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: user knows they are rating or liking, can choose to dislike
Implicit User Feedback: user not explicitly aware, cannot dislike (click, hover, etc)
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features

Features
Binary: True or False
Numeric Discrete: Integers
Numeric: Real Values
Binning: Convert Continuous into Discrete (Time of Day -> Morning, Afternoon)
Categorical Ordinal: Size (Small -> Medium -> Large), Ratings (1 -> 5)
Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots
Temporal: Time-based, Time of Day, Binge Viewing
Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)
Media: Images, Audio, Video
Geographic: (Longitude, Latitude), Geohash
Latent: Hidden Features within Data (Collaborative Filtering)
Derived: Age of Movie, Duration of User Subscription

Feature Engineering
Dimension Reduction
  Reduce number of features in feature space
Principal Component Analysis (PCA)
  Find principal features that best describe data variance
  Peel dimensional layers back
One-Hot Encoding
  Convert nominal categorical feature values into 0’s and 1’s
  Remove any numerical relationship between categories
  Convert Each Item to Binary Vector with Single 1.0 Column

  Bears    -> 1         Bears    -> [1.0, 0.0, 0.0]
  49’ers   -> 2   -->   49’ers   -> [0.0, 1.0, 0.0]
  Steelers -> 3         Steelers -> [0.0, 0.0, 1.0]

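The one-hot table on this slide can be reproduced with a few lines of plain Python. This is a sketch of the idea only, not the Spark ML encoder; the `one_hot` helper and the fixed team ordering are assumptions made for the example.

```python
# Category order taken from the slide's example.
teams = ["Bears", "49'ers", "Steelers"]
index = {team: position for position, team in enumerate(teams)}

def one_hot(team):
    """Return a vector with a single 1.0 in the column for this team."""
    vector = [0.0] * len(index)
    vector[index[team]] = 1.0
    return vector

encoded = {team: one_hot(team) for team in teams}
```

Because every pair of encoded vectors is equally far apart, the encoding removes the false ordering that the integer labels 1, 2, 3 would imply.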
Feature Normalization & Standardization
Goal
  Scale features to standard size
  Required by many ML algos
Normalize Features
  Calculate L1 (or L2, etc) norm, then divide into each element
  org.apache.spark.ml.feature.Normalizer
Standardize Features
  Apply standard normal transformation (mean->0, stddev->1)
  org.apache.spark.ml.feature.StandardScaler
http://www.mathsisfun.com/data/standard-normal-distribution.html

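A plain-Python sketch of the two transformations, for illustration only. It is not the Spark implementation: `l2_normalize` works on one feature vector, `standardize` works on one feature column, and the use of the population standard deviation is an assumption (library implementations may use the sample standard deviation instead).

```python
import math

def l2_normalize(vector):
    """Divide each element by the vector's L2 norm so the result has length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

def standardize(column):
    """Shift a feature column to mean 0 and scale it to standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    # Assumption: population standard deviation (divide by n).
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column] if std else [0.0] * n

unit_vector = l2_normalize([3.0, 4.0])
z_scores = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

The example column has mean 5 and population standard deviation 2, so the first value 2.0 maps to -1.5 and the last value 9.0 maps to 2.0.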
Non-Personalized Recommendations
Cold Start Problem
“Cold Start” problem
  New user, don’t know their preferences, must show something!
Movies with highest-rated actors
  Top K aggregations
Facebook social graph
  Friend-based recommendations
Most desirable singles
  PageRank of likes and dislikes

Demo! GraphFrame PageRank
Example: Dating Site “Like” Graph
PageRank of Top Influencers

Personalized Recommendations
Demo! Personalized PageRank

Personalized PageRank: Outbound Links
0.15 = (1 - 0.85 “Damping Factor”)
85% Probability: Choose Among Outbound Network
15% Probability: Choose Self or Random

Personalized PageRank: No Outbound
0.15 = (1 - 0.85 “Damping Factor”)
85% Probability: Choose Among Outbound Network
15% Probability: Choose Self or Random
85% Among No Outbound Network!!

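The 85% / 15% split can be sketched as a simple power iteration. This is a generic random-surfer illustration, not the GraphFrames implementation used in the demo. Two assumptions are made here: the 15% jump always returns to the source user ("self"), and a node with no outbound links sends its 85% share back to the source as well. The tiny "like" graph is invented.

```python
def personalized_pagerank(graph, source, damping=0.85, iterations=50):
    """graph maps each node to the list of nodes it links to (all targets must be keys)."""
    rank = {node: (1.0 if node == source else 0.0) for node in graph}
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        for node, outbound in graph.items():
            if outbound:
                # 85%: spread this node's rank evenly among its outbound links.
                share = damping * rank[node] / len(outbound)
                for target in outbound:
                    updated[target] += share
            else:
                # No outbound network: assumed to jump back to the source.
                updated[source] += damping * rank[node]
        # 15%: jump back to the source from anywhere.
        updated[source] += (1.0 - damping) * sum(rank.values())
        rank = updated
    return rank

# Hypothetical "like" graph: alice likes bob and carol, bob likes carol.
likes = {"alice": ["bob", "carol"], "bob": ["carol"], "carol": []}
ranks = personalized_pagerank(likes, source="alice")
```

The ranks always sum to 1. In this graph carol ends up ranked above bob because she is liked by both of the other users.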
User-to-User Clustering
User Similarity
  Time-based
    Pattern of viewing (binge or casual)
    Time of viewing (am or pm)
  Ratings-based
    Content ratings or number of views
    Average rating relative to others (critical or lenient)
  Search-based
    Search terms

Item-to-Item Clustering
Item Similarity
  Profile text (TF/IDF, Word2Vec, NLP)
  Categories, tags, interests (Jaccard Similarity, LSH)
  Images, facial structures (Neural Nets, Eigenfaces)
Dating Site Example…
  Cluster Similar Eigen-faces
  Cluster Similar Profiles
  Cluster Similar Categories

Bonus: NLP Conversation Starter Bot
“If your responses to my generic opening lines are positive, I may read your profile.”
Spark ML, Stanford CoreNLP, TF/IDF, DecisionTrees, Sentiment
http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Bonus: Demo! Spark + Stanford CoreNLP Sentiment Analysis
Bonus: Top 100 Country Song Sentiment
Bonus: Surprising Results…?!

Item-to-Item Based Recommendations
Based on Metadata: Genre, Description, Cast, City

Demo! Item-to-Item-based Recommendations
One-Hot Encoding + K-Means Clustering
  One-Hot Encode Tag Feature Vectors
  Cluster Movie Tag Feature Vectors
    Hyperparameter Tuning (K Clusters?)
  Analyze Movie Tag Clusters

User-to-Item Collaborative Filtering
Matrix Factorization
① Factor the large matrix (left) into 2 smaller matrices (right)
② Lower-rank matrices approximate original when multiplied
③ Fill in the missing values of the large matrix
④ Surface k (rank) latent features from user-item interactions

Item-to-Item Collaborative Filtering
Famous Amazon Paper circa 2003
Problem
  As users grew, user-to-item collaborative filtering didn’t scale
Solution
  Item-to-item similarity, nearest neighbors
  Offline (Batch)
    Generate itemId->List[userId] vectors
  Online (Real-time)
    From cart, recommend nearest-neighbors in vector space

Demo! Collaborative Filtering-based Recommendations
Fitting the Matrix Factorization Model
Show ItemFactors Matrix from ALS
Show UserFactors Matrix from ALS
Generating Individual Recommendations
Generating Batch Recommendations

Clustering + Collaborative Filtering Recs
Cluster matrix output from Matrix Factorization
Latent features derived from user-item interaction
Item-to-Item Similarity
  Cluster item-factor matrix
User-to-User Similarity
  Cluster user-factor matrix

Demo! Clustering + Collaborative Filtering-based Recommendations
Show ItemFactors Matrix from ALS
Convert to Item Factors -> mllib.Vector
  Required by K-Means Clustering Algorithm
Fit and Evaluate K-Means Cluster Model
  Measures Closeness of Points Within Clusters
  K = 5 Clusters

Netflix Genres and Clusters
Typical Genres
  Documentary, Romance, Comedy, Horror, Action, Adventure
Latent (Hidden) Clusters
  Emotionally-Independent Dramas for Hopeless Romantics
  Witty Dysfunctional-Family TV Animated Comedies
  Romantic Crime Movies based on Classic Literature
  Latin American Forbidden-Love Movies
  Critically-acclaimed Emotional Drug Movie
  Cerebral Military Movie based on Real Life
  Sentimental Movies about Horses for Ages 11-12
  Gory Canadian Revenge Movies
  Raunchy Mad Scientist Comedy

Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

When to Approximate?
Memory or time constrained queries
Relative vs. exact counts are OK (approx # errors after a release)
Using machine learning or graph algos
  Inherently probabilistic and approximate
Streaming aggregations
  Inherently sloppy collection (exactly once?)
Approximate as much as you can get away with!
Ask for forgiveness later!!

When NOT to Approximate?
If you’ve ever heard the term…
“Sarbanes-Oxley”
…at the office.

A Few Good Algorithms
You can’t handle the approximate!

Common to These Algos & Data Structs
Low, fixed size in memory
Store large amount of data
Known error bounds
Tunable tradeoff between size and error
Less memory than Java/Scala collections
Rely on multiple hash functions or operations
Size of hash range defines error

Bloom Filter
Set.contains(key): Boolean
“Hash Multiple Times and Flip the Bits Wherever You Land”

Bloom Filter
Approximate Set.contains(key)
No means No, Yes means Maybe
Elements can only be added
  Never updated or removed

Bloom Filter in Action
set(key), contains(key): Boolean
Images by @avibryant
Set.contains(key): TRUE -> maybe contains (other key hashes may overlap)
Set.contains(key): FALSE -> definitely does not contain (no key flipped all bits)

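"Hash multiple times and flip the bits wherever you land" can be sketched in a few lines. This is a toy illustration rather than the Algebird BloomFilter; the class, the default sizes, and the use of seeded SHA-256 as the family of hash functions are all assumptions.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # One bit position per seeded hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        """Flip every bit the key hashes to. Bits are never cleared."""
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        """False means definitely absent; True only means maybe present."""
        return all(self.bits[position] for position in self._positions(key))

movies = BloomFilter()
movies.add("Top Gun")
seen_top_gun = movies.might_contain("Top Gun")
```

A key that was added always reports True. A key that was never added usually reports False, but can report True when other keys happen to have flipped all of its bits.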
CountMin Sketch
Frequency Count and TopK
“Hash Multiple Times and Add 1 Wherever You Land”

CountMin Sketch (CMS)
Approximate frequency count and TopK for key
ie. “Heavy Hitters” on Twitter
[Figure labels: Matei Zaharia, Martin Odersky, Donald Trump]

CountMin Sketch In Action (TopK Count)
Use Case: TopK movies using total views
Images derived from @avibryant
add(Top Gun, 2)   (x 2 occurrences of “Top Gun” for slightly additional complexity)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long
Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
Find minimum of all rows
Can overestimate, but never underestimate
[Figure: sketch rows holding Top Gun (x 2), A Few Good Men, and Taps, with cells marked “Overlap Top Gun” and “Overlap A Few Good Men”]

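The add and getCount steps above can be sketched as follows. This is a toy version for illustration, not the Algebird CMS; the class name, the default depth and width, and the seeded SHA-256 hashing are assumptions.

```python
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=256):
        self.depth = depth    # number of rows, one hash function per row
        self.width = width    # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, key):
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        """Add the count to one counter in every row."""
        for row in range(self.depth):
            self.table[row][self._column(row, key)] += count

    def get_count(self, key):
        """Take the minimum across rows; collisions can only inflate a counter."""
        return min(self.table[row][self._column(row, key)] for row in range(self.depth))

views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
top_gun_views = views.get_count("Top Gun")
```

Because overlapping keys only ever add to a shared counter, `top_gun_views` is at least 2 and can never be lower than the true count.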
HyperLogLog
Count Distinct
“Hash Multiple Times and Uniformly Distribute Where You Land”

HyperLogLog (HLL)
Approximate count distinct
Slight twist
  Special hash function creates uniform distribution
  Hash subsets of data with single, special hash func
Error estimate
  14 bits for size of range
  m = 2^14 = 16,384 hash slots
  error = 1.04/(sqrt(16,384)) = .81%
[Figure label: Not many of these]

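The error estimate on this slide is a one-line calculation. The sketch below just evaluates the slide's formula, error = 1.04 / sqrt(m) with m = 2^bits; the function name is made up for the example.

```python
import math

def hll_standard_error(precision_bits):
    """Evaluate 1.04 / sqrt(m), where m = 2 ** precision_bits hash slots."""
    slots = 2 ** precision_bits
    return 1.04 / math.sqrt(slots)

slots_14 = 2 ** 14
error_14 = hll_standard_error(14)
error_10 = hll_standard_error(10)
```

With 14 bits there are 16,384 slots and the error is 1.04 / 128 = 0.008125, the 0.81% quoted on the slide. Dropping to 10 bits raises it to 3.25%, which shows the size versus error tradeoff.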
HyperLogLog In Action (Count Distinct)
Use Case: Number of distinct users who view a movie
Uniform Distribution: Estimate distinct # of users by inspecting just the beginning
Combine across different scales
[Figure: user IDs hashed onto uniform ranges.
  Top Gun: Hour 1 (range 0 to 16): user3001, user7009, user1001, user2009, user3005, user3003
  Top Gun: Hour 2 (range 0 to 32): user2001, user4009, user3002, user7002, user1005, user6001, user8001, user8002
  Top Gun: Hour 1 + 2 (range 0 to 32): all of the users above combined]

Locality Sensitive Hashing
Set Similarity
“Pre-process Items into Buckets, Compare Within Buckets”

Locality Sensitive Hashing (LSH)
Approximate set similarity
Pre-process m rows into b buckets
  b << m; b = buckets, m = rows
Hash items multiple times
  ** Similar items hash to overlapping buckets
  ** Hash designed to cluster similar items
Compare just contents of buckets
  Much smaller cartesian compare
  ** Compare in parallel !!
Avoids huge cartesian all-pairs compare
[Figure label: Chapter 3: LSH]

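One common way to realise "pre-process items into buckets, compare within buckets" for sets is MinHash with banding. The sketch below assumes that technique; it is not the Algebird MinHash code from the demo, and the function names, parameters, and movie tags are invented for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def _hash(seed, value):
    digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=20):
    """For each seeded hash function, keep the smallest hash over the set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def candidate_pairs(tag_sets, num_hashes=20, rows_per_band=2):
    """Bucket items by each band of their signature; pair up items sharing a bucket."""
    buckets = defaultdict(set)
    for name, tags in tag_sets.items():
        signature = minhash_signature(tags, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            band = tuple(signature[start:start + rows_per_band])
            buckets[(start, band)].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

# Hypothetical tag sets.
movies = {
    "Top Gun": {"action", "military", "pilot"},
    "Iron Eagle": {"action", "military", "pilot"},
    "The Notebook": {"romance", "drama"},
}
pairs = candidate_pairs(movies)
```

Only the candidate pairs need an exact similarity check afterwards. Sets with identical tags always share every bucket, while sets with no tags in common do not share a bucket barring a hash collision.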
DIMSUM
Set Similarity
“Pre-process and ignore data that is unlikely to be similar.”

DIMSUM
“Dimension Independent Matrix Square Using MR”
Remove vectors with low probability of similarity
  RowMatrix.columnSimilarities(threshold)
Twitter DIMSUM Case Study
  40% efficiency gain over brute-force Cosine Sim

Common Tools to Approximate
Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing

Twitter Algebird
Algebraic Fundamentals
  Parallel
  Associative
  Composable
Examples
  Min, Max, Avg
  BloomFilter (Set.contains(key))
  HyperLogLog (Count Distinct)
  CountMin Sketch (TopK Count)

Redis
Implementation of HyperLogLog (Count Distinct)
  12KB per item count
  2^64 max # of items
  0.81% error
Add user views for given movie
  PFADD TopGun_Hour1_HLL user1001 user2009 user3005
  PFADD TopGun_Hour1_HLL user3003 user1001
Get distinct count (cardinality) of set
  PFCOUNT TopGun_Hour1_HLL
  Returns: 4 (distinct users viewed this movie)
Union 2 HyperLogLog Data Structures
  PFMERGE TopGun_Hour1_HLL TopGun_Hour2_HLL
[Slide callouts: ignore duplicates, Tunable]

Approximations in Spark Libraries
Spark Core
  countByKeyApprox(timeout: Long, confidence: Double)
  PartialResult
Spark SQL
  approxCountDistinct(column: Column)
  HyperLogLogPlus
Spark ML
  Stratified sampling
    sampleByKey(fractions: Map[K, Double])
  DIMSUM sampling
    Probabilistic sampling reduces amount of shuffle
    RowMatrix.columnSimilarities(threshold: Double)

Demo! Exact Count vs. Approximate HLL and CMS Count
HashSet vs. HyperLogLog (Memory)
HashSet vs. CountMin Sketch (Memory)

Demo! Exact Similarity vs. Approximate LSH Similarity
Brute Force Cartesian All Pair Similarity: 47 seconds
Locality Sensitive Hash All Pair Similarity: 6 seconds

Many More Demos!
Download Docker or Clone on Github
http://advancedspark.com

Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Netflix Recommendations
From Ratings to Real-time

Netflix Has a Lot of Data
Netflix has a lot of data about a lot of users and a lot of movies.
Netflix can use this data to buy new movies.
Netflix is global.
Netflix can use this data to choose original programming.
Netflix knows that a lot of people like politics and Kevin Spacey.
My favorite movie: “Harold and Kumar Go to White Castle”
The UK doesn’t have White Castle. Renamed my favourite movie to: “Harold and Kumar Get the Munchies”
This broke my unit tests!
Summary: Buy NFLX Stock!

Netflix Data Pipeline - Then
[Diagrams: v1.0, v2.0]

Netflix Data Pipeline – Now (Keystone)
v3.0
9 million events per second
22 GB per second!!
EC2 D2XL
  Disk: 6 TB, 475 MB/s
  RAM: 30 G
  Network: 700 Mbps
Auto-scaling, Fault tolerance
A/B Tests, Trending Now
SAMZA: Splits high and normal priority

Netflix Recommendation Data Pipeline
Throw away batch user factors (U)
Keep batch video factors (V)

Netflix Trending Now (Time-based Recs)
Uses Spark Streaming
Personalized to user (viewing history, past ratings)
Learns and adapts to events (Valentine’s Day)
“VHS”
Calculate Take Rate from Number of Plays and Number of Impressions

Bonus: Pandora Time-based Recs
Work Days
  Play familiar music
  User is less likely to accept new music
Evenings and Weekends
  Play new music
  More likely to accept new music

$1 Million Netflix Prize (2006-2009)
Goal
  Improve movie predictions by 10% (Root Mean Sq Error)
  Test data withheld to calculate RMSE upon submission
5-star Ratings Dataset
  (userId, movieId, rating, timestamp)
Winning algorithm(s)
  10.06% improvement (RMSE)
  Ensemble of 500+ ML models combined with GBDT’s
  Computationally impractical

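For readers unfamiliar with the metric, here is a small sketch of RMSE and of the relative improvement the prize was measured by. The ratings and predictions are made up, and the `improvement` helper is an illustrative name rather than anything from the competition.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def improvement(baseline_rmse, new_rmse):
    """Relative reduction in RMSE compared to the baseline."""
    return (baseline_rmse - new_rmse) / baseline_rmse

# Hypothetical held-out ratings on a 5-star scale.
actual = [4, 3, 5, 1]
baseline_predictions = [3, 3, 3, 3]
model_predictions = [4, 3, 4, 2]

baseline_error = rmse(baseline_predictions, actual)
model_error = rmse(model_predictions, actual)
gain = improvement(baseline_error, model_error)
```

In this toy data the baseline RMSE is 1.5 and the model RMSE is about 0.71. The prize required a gain of 10% on the withheld test set.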
Secrets to the Winning Algorithms
Adjust for the following human bias…
① Alice effect: user rates lower than avg
② Inception effect: movie rated higher than avg
③ Overall mean rating of a movie
④ Number of people who have rated a movie
⑤ Number of days since user’s first rating
⑥ Number of days since movie’s first rating
⑦ Mood, time of day, day of week, season, weather

Netflix Common ML Algorithms
Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering
Ensembles!

Netflix Genres and Clusters
Typical Genres
  Documentaries, Romance Comedies, Horror, Action, Adventure
Latent (Hidden) Clusters
  Emotionally-Independent Dramas for Hopeless Romantics
  Witty Dysfunctional-Family TV Animated Comedies
  Romantic Crime Movies based on Classic Literature
  Latin American Forbidden-Love Movies
  Critically-acclaimed Emotional Drug Movie
  Cerebral Military Movie based on Real Life
  Sentimental Movies about Horses for Ages 11-12
  Gory Canadian Revenge Movies
  Raunchy Mad Scientist Comedy

Netflix Social Integration
Post to Facebook after movie start (5 mins)
Recommend to new users based on friends
Helps with Cold Start problem

Netflix Search
No results? No problem… Show similar results!
  Utilize extensive DVD Catalog
  Metadata search (ElasticSearch)
  Named entity recognition (NLP)
Empty searches are opportunity!
  Explicit feedback for future recommendations
  Content to buy and produce!

Netflix A/B Tests
Users tend to click on images featuring…
  Faces with strong emotional expressions
  Villains over heroes
  Small number of cast members

Netflix Recommendation Serving Layer
Use Case: Recommendation service depends on EVCache
Problem: EVCache cluster goes down or becomes latent!?
Answer: github.com/Netflix/Hystrix Circuit Breaker!
Circuit States
  Closed: Service OK
  Open: Service DOWN
    Fallback to Static

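The closed/open behaviour with a static fallback can be sketched in a few lines. This is a toy illustration of the pattern, not the Hystrix API. The class, the failure threshold, and the stand-in cache and fallback functions are all assumptions, and the retry-after-timeout behaviour that production breakers typically add is left out.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def state(self):
        # Closed: calls pass through. Open: calls are short-circuited.
        return "open" if self.failures >= self.failure_threshold else "closed"

    def call(self, service, fallback):
        if self.state == "open":
            return fallback()          # do not touch the failing service
        try:
            result = service()
        except Exception:
            self.failures += 1
            return fallback()
        self.failures = 0              # a success resets the failure count
        return result

attempts = []

def flaky_cache_lookup():
    """Stand-in for a cache call that is currently failing."""
    attempts.append(1)
    raise ConnectionError("cache cluster unavailable")

def static_recommendations():
    """Stand-in for a precomputed, non-personalized fallback list."""
    return ["Top Gun", "A Few Good Men", "Taps"]

breaker = CircuitBreaker(failure_threshold=3)
responses = [breaker.call(flaky_cache_lookup, static_recommendations) for _ in range(5)]
```

After three consecutive failures the breaker opens, so the last two requests are served from the static list without calling the cache at all.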
Why Higher Average Ratings 2004+?
2004, Netflix noticed higher ratings on average
Some possible reasons why…
① Significant UI improvements deployed
② New recommendation engine deployed
③

Thank You, Everyone!!
Chris Fregly @cfregly
IBM Spark Technology Center
San Francisco, California, USA
http://advancedspark.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker
Find me: LinkedIn, Twitter, Github, Email, Fax
Image derived from http://www.duchess-france.org/