Data Science and Machine Learning for eCommerce and Retail
Dr. Andrei Lopatenko
Director of Engineering, Recruit Institute of Technology, Recruit Holdings
Formerly Walmart Labs, Google (twice), Apple (twice)
[email protected]
ML for eCommerce
• Search and browse for commerce sites and applications
• Help users to find and discover items they will purchase
• Maximize revenue/profit per user session
Search data size
• Catalogue items
• 8 M items now, compared with ~400 M at Amazon / eBay
• ×10 in the near future
• 2 KB of text description per item, plus images
• Several hundred structured attributes per catalog
Search – user searches
• Tens of millions per day
• Tens of billions of sessions per year
• Online sales: 13.2 B per year (http://fortune.com/2015/11/17/walmart-ecommerce/)
• 500 B per year in sales in offline stores (8% of the USA economy) across ~11 K stores
• Number of transactions: ~10 B (public data)
ML addressable problems
• Learning to rank
• Given a query, what is the list of items with the highest probability of conversion (purchase), ATC (add to cart), or page view?
ML addressable problems
• Typeahead
• Given a sequence of characters typed by the user, what are the most probable completions, and which items is the user most likely to want to buy?
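A minimal sketch of the typeahead problem, assuming a hypothetical log of past queries with counts: rank the logged queries that start with the typed prefix by how often they were issued. The query log, the function name `complete`, and the frequency-only ranking are illustrative assumptions, not the system described in the talk; a production version would also use conversion data and an index such as a trie.

```python
from collections import Counter

# Hypothetical query log: query -> number of times users issued it.
QUERY_COUNTS = Counter({
    "samsung tv": 120,
    "samsung galaxy case": 80,
    "sandals": 60,
    "salt lamp": 15,
})

def complete(prefix, counts=QUERY_COUNTS, k=3):
    """Return up to k logged queries starting with prefix, most frequent first."""
    prefix = prefix.lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Sort by descending count, then alphabetically so ties are stable.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:k]]

print(complete("sa"))
```

Ranking by raw frequency is the simplest stand-in for "most probable completion"; replacing the count with a purchase-weighted score would move it closer to the second half of the problem statement.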
ML addressable problems
• Spell correction
• Given a user query, what is the query the user actually wanted to type?
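One common baseline for this problem, sketched here under stated assumptions, is to generate every string one edit away from the typed word and pick the most frequent one found in a known vocabulary. The vocabulary, its counts, and the function names are hypothetical; the talk does not say which method was used.

```python
import string

# Hypothetical vocabulary with query-log frequencies.
VOCAB = {"samsung": 500, "shoes": 400, "shirt": 300, "short": 100}

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, vocab=VOCAB):
    """Return word if known, else the most frequent vocabulary word one edit away."""
    if word in vocab:
        return word
    candidates = [w for w in edits1(word) if w in vocab]
    # Fall back to the input unchanged when nothing in the vocabulary is close.
    return max(candidates, key=vocab.get) if candidates else word

print(correct("samsnug"))
```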
ML addressable problems
• Cold start
• Given a new item with its set of attributes and no history of sales or exposure on the site, predict the item's sales and its sales per query
ML addressable problems
• Prediction of LHN (left-hand navigation)
• Given a user query, what is the best set of facets and facet values, i.e. the set that gives the highest probability of users interacting with them and finally buying an item?
ML addressable problems
• Query understanding
• Given a query, build a semantic parse of the query and tag tokens with attributes: blue tshirts for teenagers -> blue:color tshirts:type for:opt teenagers:agerestriction10-20
• Classification: blue tshirts for teenagers -> type:apparel, price preference: 10-30, releaseyearpreference: 2014-2016
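The token-tagging notation in the example can be reproduced with a toy lookup tagger. This is only a sketch: the lexicon is hypothetical and hand-written, whereas a real system would learn the tags (for instance with a sequence model) rather than look them up.

```python
# Hypothetical lexicon mapping tokens to attribute tags
# (tag names follow the slide's example).
LEXICON = {
    "blue": "color",
    "red": "color",
    "tshirts": "type",
    "for": "opt",
    "teenagers": "agerestriction10-20",
}

def tag_query(query, lexicon=LEXICON, unknown="unk"):
    """Tag each whitespace-separated token with its attribute, or `unknown`."""
    return [(tok, lexicon.get(tok, unknown)) for tok in query.lower().split()]

def format_tags(tagged):
    """Render tagged tokens in the token:tag notation."""
    return " ".join(f"{tok}:{tag}" for tok, tag in tagged)

print(format_tags(tag_query("blue tshirts for teenagers")))
```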
ML addressable problems
• Related searches
• Given a query, which queries are either semantically close to it or represent co-occurring user interests?
• Nike shoes -> adidas shoes, sport shoes
• Coffee mugs -> travel mugs, photo coffee mugs, cappuccino cups
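The "co-occurring interests" half of this problem can be sketched by counting which queries appear together in the same user session. The session data and the function name are hypothetical, and plain co-occurrence counting is an assumed baseline, not the method from the talk; semantic closeness would need a separate signal such as embeddings.

```python
from collections import Counter, defaultdict
from itertools import permutations

def related_queries(sessions, k=3):
    """Map each query to the k queries that most often share a session with it."""
    co_counts = defaultdict(Counter)
    for session in sessions:
        # Count each ordered pair of distinct queries once per session.
        for a, b in permutations(set(session), 2):
            co_counts[a][b] += 1
    return {q: [r for r, _ in c.most_common(k)] for q, c in co_counts.items()}

# Hypothetical search sessions.
sessions = [
    ["nike shoes", "adidas shoes"],
    ["nike shoes", "adidas shoes", "sport shoes"],
    ["coffee mugs", "travel mugs"],
]
related = related_queries(sessions)
print(related["nike shoes"])
```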
ML addressable problems
• product discovery
• help users to explore the product assortment
• drive users to diverse products
• reduce the risk of selecting irrelevant items
• help to find alternatives by price, quality, brand, etc.
• reduce pigeonhole risk
• provide relevant data to make a decision
ML addressable problems
• Image similarity
• Given images of an item, find other items whose images are visually appealing to users who like the original item (appealing by shape? color? texture?), leading to high conversion in recommendations
ML addressable problems
• Voice search
• Given voice input, reply with a list of the best items
• “what are the cheapest samsung tvs in the store”
• “what is best deal on queen bed today?”
ML addressable problems
• extraction of item attributes
• Given an item, what are its attributes: brand, color, size (wheel, screen, height, S/M/XL, Queen/Twin/King/Full), gender, pattern, shape, features
ML addressable problems
• Representations of users: actions on websites/apps -> searches, clicks, browsing behaviour; product -> purchase preferences, reviews, ratings, return rates
ML addressable problems
• title generation: how to generate the title that will produce the maximum conversion rate
• which product attributes to select for the title?
Limits
• Most models should be served in production
• 50 ms per prediction
• Part of a big system, memory limits ~10G
Retail
• Key directions which require machine learning:
• discounting tools
• coupons and rewards
• loyalty
• inventory management
Inventory management
• Customers want to buy products
• Customers have diverse needs
• Products should be in stock, ideally in warehouses close to customers
• but it’s expensive to store products
• Problem: how many products of each type should be stored, and when should product supply be refilled?
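The talk does not name a method for the "when to refill" question. One textbook starting point, shown here purely as an assumed illustration, is a reorder point: expected demand over the supplier lead time plus a safety stock. The function name, the normal-demand assumption, and the default `z` are all choices made for this sketch.

```python
import math

def reorder_point(mean_daily_demand, std_daily_demand, lead_time_days, z=1.65):
    """Stock level at which a replenishment order should be placed.

    Expected demand during the lead time plus a safety stock of z standard
    deviations (z=1.65 is roughly a 95% service level if daily demand is
    assumed independent and normally distributed).
    """
    expected = mean_daily_demand * lead_time_days
    safety = z * std_daily_demand * math.sqrt(lead_time_days)
    return expected + safety

# Hypothetical product: 100 units/day on average, std 20, 4-day lead time.
print(round(reorder_point(100, 20, 4)))
```

The trade-off in the slide shows up directly: a larger `z` lowers the risk of stock-outs but raises the amount of product that has to be stored.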
Customer intelligence
• Retail
• analyze sales data, find anomalies, explain them
• low sales of umbrellas during the last month in Northern California stores
• No rain? (integration with external data about weather conditions)
• Seasonal / the same as last year / time series
• Competitors
Fraud detection
• identify fraudulent transactions online
• Hundreds of fraud schemes detected daily
• Global retail shrinkage was $119 billion in 2011, an average of 1.45% of retail sales.
• from stolen credit cards to swapped price tags and price discounts given by high-level managers to achieve personal goals
Propensity Modeling for Marketing Campaigns
• build effective email/Facebook/Google Ads campaigns that address the right customer at the right time at the right cost
• behavior-based customer segmentation and clustering with demographic, lifestyle, and attitudinal information
Online Grocery
• which items can be replaced by other items, and by which items
• data are individual purchases in chain grocery stores, drug stores, and online grocery shopping
• the problem: find which items can replace others that are not in store, in order to fulfill the order
Dynamic pricing
• define the best price
• continuously scrape competitors' prices, predict demand as a function of price, know the expenses
• online commerce sites change prices every 10 minutes
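To make "predict demand by price, know the expenses" concrete, here is a small sketch that picks the candidate price with the highest expected profit, (price - unit cost) × predicted demand. The linear demand model and its parameters are hypothetical placeholders for whatever demand predictor is actually trained, and competitor prices are left out for brevity.

```python
def best_price(candidate_prices, demand_fn, unit_cost):
    """Return the candidate price with the highest expected profit."""
    return max(candidate_prices, key=lambda p: (p - unit_cost) * demand_fn(p))

def linear_demand(price, base=1000.0, slope=40.0):
    """Hypothetical demand model: units sold fall linearly as price rises."""
    return max(0.0, base - slope * price)

# Search whole-dollar prices from 10 to 25 for an item that costs 5 to supply.
print(best_price(range(10, 26), linear_demand, unit_cost=5))
```

Scraped competitor prices could enter either as a cap on `candidate_prices` or as an extra input to the demand model; both are design choices outside what the slide specifies.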
Challenges
• Data volumes: transactions at Walmart: 10 million per day
• Computations: complicated modeling techniques
Data storage
• Volumes of data:
• 10 M transactions per day over 5 years: ~18 billion transactions -> 1T
• Catalog: 500 M items * 2K each -> 1T
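The back-of-the-envelope numbers above can be checked in a few lines. The figures come from the slide; reading "2K" as about 2,000 bytes, "1T" as about 10^12 bytes, and a year as 365 days are assumptions made for this check.

```python
TRANSACTIONS_PER_DAY = 10_000_000   # from the slide
YEARS = 5                           # from the slide
CATALOG_ITEMS = 500_000_000         # from the slide
BYTES_PER_ITEM = 2_000              # "2K" per item, assumed to mean ~2 KB

total_transactions = TRANSACTIONS_PER_DAY * 365 * YEARS
catalog_bytes = CATALOG_ITEMS * BYTES_PER_ITEM

# Assumption: "-> 1T" means about 1e12 bytes for all transactions, which
# implies only a few dozen bytes stored per transaction.
implied_bytes_per_transaction = 1e12 / total_transactions

print(total_transactions)                    # 18250000000
print(catalog_bytes)                         # 1000000000000
print(round(implied_bytes_per_transaction))  # 55
```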
Data Storage
• but if you move to video: petabytes of data; RetailNext: 75 PB per year from 30,000+ sensors
• Walmart: 500 PB
• eBay: 40 PB in 2013 (transactions + online behaviours)
Data processing
• Rebuild the model over fresh data:
• typically daily: add the day's data (millions of transactions, hundreds of millions of behavior units) to the yearly data store (billions of transactions, hundreds of billions or trillions of behavior units)
• build a model to serve in production the next day
Data processing
• some models, such as fraud detection and dynamic pricing, should be almost online (10-15 minutes)
• built over data such as daily transactions or web crawls of competitors' sites
Serving online
• online commerce (WML): thousands to tens of thousands of queries per second at peak times
• complicated ranking and recommendation algorithms
• 50 ms limit
Serving online
• price, in-store availability: millions of requests per second at peak times
• item information: millions of requests per second
• serving online: Solr/Lucene/Elasticsearch, Cassandra, MongoDB, Oracle, CouchDB, Node.js/Java solutions, etc.
Data processing
• Hadoop / Spark clusters
• a lot of I/O
• HDFS handles the redundancy, so RAID is not necessary; RAID is slow to write, and Hadoop writes a lot
• SAN, NAS are not good either
• so bare metal with DAS (direct-attached storage)
Data Processing
• more servers, cheaper servers
• more, smaller disks are better than fewer large disks
• allocate the cluster 100% to Hadoop
Data processing
• Hadoop masters vs workers
• large clusters: masters with > 64 GB RAM, dual Ethernet NIC, dual quad-core CPU
• workers: 64 GB+ memory, SAS 6 Gb/s disk controllers, 2 Ethernet cards, 2 × 6-core processors, 15 MB cache; Intel’s Hyper-Threading and QPI are good to have
Data Processing
• big models, deep learning
• Nvidia DGX-1 and the like
• Pascal GPUs, NVLink interconnect
• Tesla K40, K80 work pretty well too
• may require a lot of tuning: http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/
• hard to buy: big data solutions are considered profit generators, HPC servers are not
Serving online
• Typically large memory, but not necessarily (for example, Elasticsearch/Solr degrades over 64G)
• CPUs: more cores rather than faster cores
• Disks: SSD, RAID 0, no NAS; in many cases, optimize for how easy it is to swap drives rather than for SSD endurance
ecommerce example
• Database servers
• Unified hardware platform: from HP
• HP DL line:
• 4 CPU sockets
• 256 GB RAM
• network interfaces
• not much HDD, data is in NAS
ecommerce example
• cloud servers:
• purchased by racks: 40 in a rack
• 2 CPU sockets
• 198G
• 18-core CPU
• SSD
network requirements
• 1 network card per server - a big mistake, 1 switch per rack
• 3 cards per server:
• three typical data flows:
• production
• “administrative” (Docker etc.)
• analytics
example
• application servers vs big data servers
• application servers (Java, Node.js apps):
• 1 TB SSD, RAID 5
• big data servers:
• 5 TB SAS
Questions?