Transcript of Telecom datascience master_public
Data Science in the E-commerce Industry
Telecom Paris - Séminaires Big Data, 2016/06/09
Vincent Michel
Big Data Europe, BDD, Rakuten Inc. / PriceMinister
[email protected] @HowIMetYourData
2
Short Bio
Education
• ESPCI: engineer in Physics / Biology
• ENS Cachan: MVA Master (Mathematics, Vision and Learning)
• INRIA Parietal team: PhD in Computer Science - understanding the visual cortex using classification techniques

Experience
• Logilab - development and data science consulting: Data.bnf.fr (French National Library open-data platform), Brainomics (platform for heterogeneous medical data)
• Rakuten PriceMinister - Senior Developer and data scientist: data engineering and data science consulting
Software engineering
Lessons learned from (painful) experiences
4
Do not redo it yourself!

Lots of interesting open-source libraries for all your needs:
• Test first on a small POC, then contribute/develop
• Scikit-learn, pandas, Caffe, scikit-image, OpenCV, …
• Be careful: it is easy to do something wrong!

Open data:
• More and more open data for catalogs, …
• E.g. data.bnf.fr: ~2,000,000 authors, ~200,000 works, ~200,000 topics

Contribute to open source:
• Is there a need / pool of potential developers?
• Do it well (documentation / tests), unless you are doing some kind of super magical algorithm
• May bring you help, bug fixes, and engineers! But it takes time and energy
5
Quality in data science software engineering
Never underestimate integration cost:
• Easy to write 20 lines of Python doing some fancy Random Forests… that could be hard to deploy (data pipeline, packaging, monitoring)
• Developer != DevOps != Sys admin

Make it clean from the start (> 2 days of dev or > 100 lines of code):
• Tests, tests, tests, tests, tests, …
• Documentation
• Packaging / supervision / monitoring
• Release early, release often
• Agile development, pull requests, code versioning

Choose the right tool:
• Do you really need this super fancy NoSQL database to store your transactions?
6
Monitoring and metrics
Always monitor:
• Your development: continuous integration (Jenkins)
• Your service: Nagios / Shinken
• Your business data (BI): Kibana
• Your users: tracker
• Your data science process: e.g. A/B tests

Evaluation:
• Choose the right metric: prediction accuracy, precision-recall, …
• Always A/B test rather than relying on personal opinions
• Good questions lead to good answers: define your problem
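The "choose the right metric" point can be made concrete with a small precision/recall computation over a recommendation list; a minimal sketch (the function name and item ids are illustrative, not from the deck):

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user's recommendation list."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 2 of the top-4 recommendations were actually purchased.
p, r = precision_recall_at_k(['a', 'b', 'c', 'd'], ['b', 'd', 'e'], k=4)
print(p, round(r, 2))  # 0.5 0.67
```

In practice these offline numbers complement, but do not replace, the A/B tests advocated above.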
Hiring remarks
Selling yourself as a (good) data scientist
8
Few remarks on hiring – my personal opinion
Be careful of CVs with buzzwords!
• E.g. "IT skills: SVM (linear, non-linear), Clustering (K-means, hierarchical), Random Forests, Regularization (L1, L2, Elastic net…) …"
• It is like someone saying "IT skills: Python (for loops, if/else, …)"
• Often found in junior CVs (ok), but a huge warning sign in senior CVs

Hungry for data?
• Loving data is the most important thing to check
• Open data? Personal projects? Curious about data? (Hackathons?)
• Pluridisciplinary == knowing how to handle various datasets

Check for IT skills:
• Should be able to install/develop new libraries/algorithms
• A huge part of the job can be to format / clean up the data
• Experience vs education -> autonomy
Recommendations @ Rakuten
Data science use-case
10
Rakuten Group Worldwide
Recommendation challenges: different languages, user behaviors, business areas
11
Rakuten Group in Numbers
Rakuten in Japan
> 12,000 employees
> 48 billion euros of GMS
> 100,000,000 users
> 250,000,000 items
> 40,000 merchants
Rakuten Group
Kobo: 18,000,000 users
Viki: 28,000,000 users
Viber: 345,000,000 users
12
Rakuten Ecosystem
Rakuten global ecosystem:
• Member-based business model that connects Rakuten services
• Rakuten ID common to various Rakuten services
• Online shopping and services

Main business areas: e-commerce, internet finance, digital content

Recommendation challenges: cross-services, aggregated data, complex user features
13
Rakuten’s e-commerce: B2B2C Business Model
Business to Business to Consumer:
• Merchants located in different regions / online virtual shopping mall
• Main profit sources: fixed fees from merchants; fees based on each transaction and other services

Recommendation challenges: many shops, item references, global catalog
14
Big Data Department @ Rakuten
Big Data Department: 150+ engineers - Japan / Europe / US
Missions
Development and operations of internal systems for:
Recommendations, search, targeting, user behavior tracking

Average traffic:
> 100,000,000 events / day
> 40,000,000 item views / day
> 50,000,000 searches / day
> 750,000 purchases / day

Technology stack: Java / Python / Ruby; Solr / Lucene; Cassandra / Couchbase; Hadoop / Hive / Pig; Redis / Kafka
15
Recommendations on Rakuten Marketplaces
Non-personalized recommendations:
• All-shop recommendations: item-to-item, user-to-item
• In-shop recommendations
• Review-based recommendations

Personalized recommendations:
• Purchase history recommendations
• Cart add recommendations
• Order confirmation recommendations

System status and scale:
• In production in over 35 services of Rakuten Group worldwide
• Several hundred servers running Hadoop, Cassandra, and APIs
Recommendations
The big picture
17
Challenges in Recommendations
(Diagram: items catalogue -> items similarity -> recommendations engine -> evaluation process)

Items catalogues: catalogue for multiple shops with different item references?
Items similarity / distances: cross-services aggregation? Lots of parameters?
Recommendations engine: best / optimal recommendations logic?
Evaluation process: offline / online evaluation? Long tail? KPIs?
18
Recommendations Architecture: Constantly Evolving
(Diagram: browsing events and purchase events feed a co-counts storage, joined with the catalogue(s), and served through a distribution layer)

Recommendations: offline / materialized
Recommendations: online algebra / multi-arm
19
Items Catalogues
Use different levels of aggregation to improve recommendations:
• Category level (e.g. food, soda, clothes, …)
• Product level (manufactured items)
• Item-in-shop level (a specific product sold by a specific shop)

Increased statistical power in co-events computation
Easier business handling (picking the good item)
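A minimal sketch of the aggregation idea: rolling shop-level co-purchase counts up to product level, so that pairs too sparse in any single shop combine into one stronger signal (all ids and the `item_to_product` mapping are hypothetical):

```python
from collections import Counter

# Hypothetical mapping from shop-specific item ids to product ids.
item_to_product = {'shopA:123': 'P1', 'shopB:777': 'P1', 'shopA:456': 'P2'}

# Co-purchase counts observed at the item-in-shop level.
item_cocounts = Counter({('shopA:123', 'shopA:456'): 2,
                         ('shopB:777', 'shopA:456'): 3})

# Roll item-level co-counts up to product level: two weak per-shop
# signals (2 and 3) become one stronger product-level signal (5).
product_cocounts = Counter()
for (a, b), n in item_cocounts.items():
    product_cocounts[(item_to_product[a], item_to_product[b])] += n

print(product_cocounts[('P1', 'P2')])  # 5
```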
20
Enriching Catalogues using Record Linkage
(Diagram: Marketplace 1 and Marketplace 2 aligned against a reference database)

Record linkage:
• Use external sources (e.g. Wikidata) to align marketplaces' products
• Fuzzy matching of 600K vs 350K items for the movies alignment use case
• Blocking algorithm

Cross recommendations:
• Global catalog / items aggregation
• Helps with cold-start issues
• Improved navigation
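The deck does not show its matching code; here is a toy sketch of blocking plus fuzzy matching using Python's stdlib `difflib` (the blocking key, threshold, and example titles are illustrative assumptions, not the production algorithm):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(title):
    # Cheap blocking key: only compare titles sharing the same first
    # character, to avoid the full 600K x 350K comparison product.
    t = title.lower().strip()
    return t[0] if t else ''

def link(catalog, reference, threshold=0.85):
    """Map each catalog title to its best reference match, if any."""
    blocks = defaultdict(list)
    for title in reference:
        blocks[block_key(title)].append(title)
    matches = {}
    for title in catalog:
        best, best_score = None, threshold
        for cand in blocks.get(block_key(title), []):
            score = SequenceMatcher(None, title.lower(), cand.lower()).ratio()
            if score >= best_score:
                best, best_score = cand, score
        if best is not None:
            matches[title] = best
    return matches

aligned = link(['Terminator 2: Judgement Day'],
               ['Terminator 2: Judgment Day', 'Titanic'])
print(aligned)
```

Real record linkage would normalize titles further (years, articles, accents) and use better-calibrated string distances.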
21
Semantic-web and RDF format
Triples: <subject> <relation> <object>
URI: unique identifier
E.g. http://dbpedia.org/page/Terminator_2:_Judgment_Day
Recommendations
Co-counts and matrices
23
Recommendation datatypes
Ratings:
• Numerical feedback from the users
• Sources: stars, reviews, …
✔ Qualitative and valuable data
✖ Hard to obtain; scaling and normalization!

(Sparse users x items ratings matrix, with rows such as 1 3 2 / 5 2 / 2 4 1 / 3 1 5 / 4 4 1 3)

Unitary data:
• Only 0/1, without any quality feedback
• Sources: clicks, purchases, …
✔ Easy to obtain (e.g. tracker)
✖ No direct rating

(Sparse users x items matrix with binary entries)
24
Collaborative filtering
User-user:
• #items < #users
• Items are changing quickly
• 1 - Compute user similarities (cosine similarity, Pearson)
• 2 - Predict the missing rating as a weighted average of the other users' ratings

(Sparse users x items ratings matrix with a missing rating "?" to predict)

Item-item:
• #items >> #users
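The two user-user steps can be sketched with NumPy on a toy matrix (the rating values, the 1e-12 smoothing, and the convention that 0 means "unrated" are illustrative):

```python
import numpy as np

# Toy users x items rating matrix (0 = unrated).
R = np.array([[1., 3., 2., 0.],
              [5., 0., 2., 0.],
              [2., 4., 0., 1.],
              [3., 0., 1., 5.]])

def predict_user_user(R, user, item):
    # 1 - cosine similarity between the target user and every user
    norms = np.linalg.norm(R, axis=1)
    sims = R @ R[user] / (norms * norms[user] + 1e-12)
    sims[user] = 0.0                      # ignore self-similarity
    # 2 - similarity-weighted average over users who rated the item
    rated = R[:, item] > 0
    weights = sims * rated
    if weights.sum() <= 0:
        return 0.0
    return float(weights @ R[:, item] / weights.sum())

pred = predict_user_user(R, user=1, item=1)
print(round(pred, 2))
```

The Pearson variant mentioned on the slide would first center each user's ratings around their mean.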
25
Matrix factorization
Decompose the users x items ratings matrix into the product of two low-rank matrices: a (users x latent factors) matrix times a (latent factors x items) matrix.

• Choose a number of latent variables to decompose the data
• Predict new ratings using the product of latent vectors
• Use gradient descent techniques (e.g. SGD)
• Add some regularization
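A minimal SGD sketch of the factorization just described, on a toy ratings matrix (the learning rate, regularization strength, number of factors, and epoch count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy users x items rating matrix (0 = unobserved).
R = np.array([[1., 3., 2., 0.],
              [5., 0., 2., 0.],
              [2., 4., 0., 1.],
              [3., 0., 1., 5.]])
n_users, n_items, k = R.shape[0], R.shape[1], 2
U = 0.1 * rng.standard_normal((n_users, k))   # user latent factors
V = 0.1 * rng.standard_normal((n_items, k))   # item latent factors
lr, reg = 0.01, 0.02                          # learning rate, L2 penalty

observed = [(u, i) for u in range(n_users)
            for i in range(n_items) if R[u, i] > 0]
for epoch in range(2000):
    for u, i in observed:
        err = R[u, i] - U[u] @ V[i]
        # SGD step on the regularized squared error
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * U[u] - reg * V[i])

# Predicted rating = product of latent vectors
print(round(float(U[0] @ V[0]), 2))
```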
26
Matrix factorization – MovieLens example
Read files:

import csv

movies_fname = '/path/ml-latest/movies.csv'
with open(movies_fname) as fobj:
    movies = dict((r[0], r[1]) for r in csv.reader(fobj))

ratings_fname = '/path/ml-latest/ratings.csv'
with open(ratings_fname) as fobj:
    header = next(fobj)
    ratings = [(r[0], movies[r[1]], float(r[2])) for r in csv.reader(fobj)]

Build sparse matrix:

import scipy.sparse as sp

user_idx, item_idx = {}, {}
data, rows, cols = [], [], []
for u, i, s in ratings:
    rows.append(user_idx.setdefault(u, len(user_idx)))
    cols.append(item_idx.setdefault(i, len(item_idx)))
    data.append(s)
ratings = sp.csr_matrix((data, (rows, cols)))
reverse_item_idx = dict((v, k) for k, v in item_idx.items())
reverse_user_idx = dict((v, k) for k, v in user_idx.items())
27
Matrix factorization – MovieLens example
Fit Non-negative Matrix Factorization:

from sklearn.decomposition import NMF

nmf = NMF(n_components=50)
user_mat = nmf.fit_transform(ratings)
item_mat = nmf.components_

Plot results:

component_ind = 3
component = [(reverse_item_idx[i], s)
             for i, s in enumerate(item_mat[component_ind, :]) if s > 0.]
for movie, score in sorted(component, key=lambda x: x[1], reverse=True)[:10]:
    print(movie, round(score))

Example components (top-scored movies):
• Terminator 2: Judgment Day (1991) 24.0; Terminator, The (1984) 23.0; Die Hard (1988) 19.0; Aliens (1986) 17.0; Alien (1979) 16.0
• Exorcist, The (1973) 8.0; Halloween (1978) 7.0; Nightmare on Elm Street, A (1984) 7.0; Shining, The (1980) 7.0; Carrie (1976) 7.0
• Star Trek II: The Wrath of Khan (1982) 10.0; Star Trek: First Contact (1996) 10.0; Star Trek IV: The Voyage Home (1986) 9.0; Contact (1997) 8.0; Star Trek VI: The Undiscovered Country (1991) 8.0; Blade Runner (1982) 8.0
28
Binary / Unitary data
Only occurences of items views/purchases/…
Jaccard distance
Cosine similarity
Conditional probability
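Assuming those standard set-based definitions, the three measures can be computed directly from co-occurrence counts (the user sets below are made up):

```python
import math

# Sets of users who viewed/purchased each item (unitary 0/1 data).
users_a = {1, 2, 3, 4, 5}
users_b = {4, 5, 6, 7}
both = users_a & users_b               # co-occurring users

jaccard = len(both) / len(users_a | users_b)
cosine = len(both) / math.sqrt(len(users_a) * len(users_b))
cond_prob = len(both) / len(users_a)   # P(B | A): asymmetric

print(jaccard, round(cosine, 3), cond_prob)
```

Note that the conditional probability is the only asymmetric one, which connects to the symmetric/asymmetric co-occurrence choice discussed next.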
29
Co-occurrences and Similarities Computation
Only access to unitary data (purchase / browsing)
Use co-occurrences for computing items similarity

Multiple possible parameters:
• Size of the time window to be considered: do browsing and purchase data reflect similar behavior?
• Threshold on co-occurrences: is one co-occurrence significant enough to be used? Two? Three?
• Symmetric or asymmetric: is the order important in the co-occurrence? A then B == B then A?
• Similarity metrics: which similarity metric should be used on top of the co-occurrences?
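A sketch of time-windowed co-occurrence counting with the symmetric/asymmetric choice exposed as a flag (the event data, window size, and function shape are illustrative, not the production pipeline):

```python
from collections import Counter
from datetime import datetime, timedelta
from itertools import combinations

# Hypothetical per-user event streams: (timestamp, item) purchases.
events = {
    'u1': [(datetime(2015, 11, 7), 'A'), (datetime(2015, 11, 8), 'B'),
           (datetime(2015, 11, 24), 'C')],
    'u2': [(datetime(2015, 9, 8), 'A'), (datetime(2015, 9, 10), 'B')],
}

def cocounts(events, window=timedelta(days=3), symmetric=True):
    counts = Counter()
    for stream in events.values():
        stream = sorted(stream)          # chronological order
        for (t1, a), (t2, b) in combinations(stream, 2):
            if a != b and t2 - t1 <= window:   # close enough in time
                counts[(a, b)] += 1
                if symmetric:                  # A then B == B then A
                    counts[(b, a)] += 1
    return counts

c = cocounts(events)
print(c[('A', 'B')])  # 2: both users bought A then B within the window
```

A threshold on co-occurrences would then simply drop pairs whose count is below the chosen minimum.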
30
Co-occurrences Example
(Timeline diagram: browsing and purchase events for one user, with example dates in September and November 2015; session boundaries and two different time-window sizes determine which event pairs count as co-occurrences)
31
Co-occurrences Computation
Classical co-occurrences:
• Co-purchases -> complementary items ("You may also want…")
• Co-browsing -> substitute items ("Similar items…")

Other possible co-occurrences:
• Items browsed and bought together
• Items browsed and not bought together

(Timeline diagrams with example dates in September and November 2015)
RecommendationsDevelopment and evaluation
33
Recommendations Algebra
Algebra for defining and combining recommendations engines
Key ideas: reuse already existing logics and combine them easily. Write business logic, not code! Handle multiple input/output formats.

Available logics: content-based; collaborative filtering (item-item, user-item personalization)

Available backends: in-memory, HDF5 files, Cassandra, Couchbase

Available hybridizations: linear algebra / weighting, mixed, cascade engines, meta-level
34
Python Algebra Example
(Diagram: a purchase-based engine (top-20, asymmetric, conditional probability) combined with 0.2 times a browsing-based engine (similarity > 0.01, symmetric, cosine similarity) into a composite engine)

>>> engine1 = RecommendationsEngine(nb_recos=20, datatype='purchase',
...                                 asymmetric=True,
...                                 distance='conditional_probability')
>>> engine2 = RecommendationsEngine(similarity_th=0.01, datatype='browsing',
...                                 asymmetric=False,
...                                 distance='cosine_similarity')
>>> composite_engine = engine1 + 0.2 * engine2

Get recommendations from items (item-to-item):

>>> recos = composite_engine.recommendations_by_items([123, 456, 789, …])
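`RecommendationsEngine` is Rakuten's internal API. As an illustration of how such an algebra can be built, here is a toy engine where `+` and `*` work via Python operator overloading (the class, its score tables, and all ids are invented for this sketch):

```python
class Engine:
    """Toy recommendations engine: maps an item to {item: score}."""

    def __init__(self, scores):
        self.scores = scores    # hypothetical precomputed similarities

    def recommendations_by_items(self, items):
        # Merge the score tables of the query items, best scores first.
        merged = {}
        for item in items:
            for reco, s in self.scores.get(item, {}).items():
                merged[reco] = merged.get(reco, 0.0) + s
        return sorted(merged, key=merged.get, reverse=True)

    def __mul__(self, weight):          # engine * 0.2
        return Engine({i: {r: weight * s for r, s in recos.items()}
                       for i, recos in self.scores.items()})
    __rmul__ = __mul__                  # 0.2 * engine

    def __add__(self, other):           # engine1 + engine2: sum scores
        scores = {i: dict(r) for i, r in self.scores.items()}
        for i, recos in other.scores.items():
            merged = scores.setdefault(i, {})
            for r, s in recos.items():
                merged[r] = merged.get(r, 0.0) + s
        return Engine(scores)

engine1 = Engine({123: {'x': 1.0, 'y': 0.5}})
engine2 = Engine({123: {'y': 3.0, 'z': 2.0}})
composite = engine1 + 0.2 * engine2
print(composite.recommendations_by_items([123]))  # ['y', 'x', 'z']
```

The design payoff is exactly the slide's point: new hybrid engines are written as one-line expressions over existing logics, not as new code paths.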
35
Python Algebra with Personalization
(Same composite engine as before, now with a purchase-history engine: time window 180 days, time decay 0.01)

>>> history = HistoryEngine(datatype='purchase', time_window=180,
...                         time_decay=0.01)
>>> engine1.register_history_engine(history)

…same code as previously (user-to-item):

>>> recos = composite_engine.recommendations_by_user('userid')
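A sketch of what the history engine's two parameters could mean: purchases older than the time window are dropped, and the rest are exponentially down-weighted by age (this scoring formula is an assumption for illustration, not the deck's actual implementation):

```python
import math
from datetime import datetime

def history_weights(purchases, now, time_window=180, time_decay=0.01):
    """Time-decayed weights over a user's purchase history.

    Hypothetical scoring: purchases older than `time_window` days are
    dropped; the rest contribute exp(-time_decay * age_in_days).
    """
    weights = {}
    for item, ts in purchases:
        age = (now - ts).days
        if age <= time_window:
            weights[item] = weights.get(item, 0.0) + math.exp(-time_decay * age)
    return weights

now = datetime(2016, 6, 9)
purchases = [('A', datetime(2016, 6, 1)),    # 8 days old: kept
             ('B', datetime(2015, 6, 1))]    # ~1 year old: outside window
w = history_weights(purchases, now)
print(sorted(w))  # ['A']
```

These weights would then seed the item-to-item engines to produce user-to-item recommendations.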
36
Python Algebra – Complete Example
(Diagram: the item-level composite engine from before - purchase-based (top-20, asymmetric, conditional probability) plus 0.2 times browsing-based (similarity > 0.01, symmetric, cosine similarity), with the purchase-history engine (time window 180 days, time decay 0.01) - cascaded (x) with a second, category-level composite engine: purchase-based (similarity > 0.01, asymmetric, conditional probability) plus 0.1 times browsing-based (similarity > 0.1, symmetric, cosine similarity))
37
Recommendation Quality Challenges
Recommendations categories (quadrant diagram: new vs old products, minor vs major/popular products):
• Hot products (A): top-N items?
• Short tail (B)
• Long tail (C + D)
• Cold start issue: external data? Cross-services?
38
Long Tail is Fat
Long tail numbers:
• Most of the items are long tail
• They still represent a large portion of the traffic

Long tail approaches:
• Content-based
• Aggregation / clustering
• Personalization

(Chart: browsing share vs number of items for the popular, short-tail, and long-tail segments)
39
Recommendations Offline Evaluation
Pros/cons:
• Convenient way to try new ideas
• Fast and cheap
• But hard to align with online KPIs

Approaches:
• Rescoring
• Prediction game
• Business simulator
40
Public Initiative – Viki Recommendation Challenge
567 submissions from 132 participants
http://www.dextra.sg/challenges/rakuten-viki-video-challenge
41
Data science everywhere!

Rakuten provides marketplaces worldwide, with specific challenges for recommendations:

Items catalogue: reinforce the statistical power of co-occurrences across shops and services

Items similarities: find the right parameters for the different use cases

Recommendation models: what are the best models for in-shop, all-shops, personalization?

Evaluation: how to handle the long tail? How to compare different models?
42
THANKS !
Questions ?
More on Rakuten tech initiatives:
• http://www.slideshare.net/rakutentech
• http://rit.rakuten.co.jp/oss.html
• http://rit.rakuten.co.jp/opendata.html

Positions:
• http://global.rakuten.com/corp/careers/bigdata/
• http://www.priceminister.com/recrutement/?p=197
43
We are Hiring!
Big Data Department – team in Parishttp://global.rakuten.com/corp/careers/bigdata/
http://www.priceminister.com/recrutement/?p=197
Data Scientist / Software Developer
• Build algorithms for recommendations, search, targeting
• Predictive modeling, machine learning, natural language processing
• Working close to the business
• Python, Java, Hadoop, Couchbase, Cassandra, …
Also hiring: search engine developers, big data system administrators, etc.