Data Science and Machine Learning for eCommerce and Retail


Dr. Andrei Lopatenko Director of Engineering,

Recruit Institute of Technology Recruit Holdings

former Walmart Labs, Google (twice), Apple (twice) [email protected]


ML for eCommerce

• Search and Browse for commerce sites and applications

• Help users to find and discover items they will purchase

• Maximize revenue/profit per user session

Search

Search - ranking


Search - LHN (Left Hand Navigation)

Search spell correction

Search type ahead

Browse

Search data size

• Catalogue items
• 8 M items now, compared with ~400 M at Amazon / eBay
• x10 in the near future
• 2 K of text description per item + images
• Several hundred structured attributes per catalog

Search – user searches

• Tens of millions per day
• Tens of billions of sessions per year
• Online sales: 13.2 B per year (http://fortune.com/2015/11/17/walmart-ecommerce/)
• 500 B per year in sales at offline stores (8% of the USA economy) in ~11K stores
• The number of transactions: ~10 B (public data)


ML addressable problems

• Learning to rank
• Given a query, which list of items has the highest probability of conversion (purchase), ATC (add to cart), or page view?
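A minimal sketch of this ranking objective, assuming a hypothetical upstream model has already predicted per-item probabilities of purchase, add to cart, and page view. The `Candidate` type, the weights, and the sample items are illustrative assumptions, not taken from the talk.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    item_id: str
    p_purchase: float  # predicted probability of conversion (purchase)
    p_atc: float       # predicted probability of add to cart
    p_view: float      # predicted probability of an item page view

# Hypothetical weights: a purchase is worth more than an ATC or a page view.
WEIGHTS = {"purchase": 1.0, "atc": 0.3, "view": 0.05}

def score(c: Candidate) -> float:
    """Combine the three predicted probabilities into one ranking score."""
    return (WEIGHTS["purchase"] * c.p_purchase
            + WEIGHTS["atc"] * c.p_atc
            + WEIGHTS["view"] * c.p_view)

def rank(candidates):
    """Return candidates ordered from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)

candidates = [
    Candidate("tv-a", 0.02, 0.10, 0.40),
    Candidate("tv-b", 0.05, 0.08, 0.30),
    Candidate("tv-c", 0.01, 0.02, 0.60),
]
ranked_ids = [c.item_id for c in rank(candidates)]
```

In a real system the probabilities would come from a learned model and the weights would be tuned against revenue or profit per session.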


• Typeahead
• Given a sequence of characters typed by the user, what are the most probable completions, and what are the most probable items the user wants to buy?
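One simple baseline for the completion half of this problem is to suggest the most frequent past queries that start with the typed prefix. The sketch below assumes a tiny hypothetical query log (`QUERY_LOG`); a production system would use a prefix index and a learned ranking rather than a linear scan.

```python
from collections import Counter

# Hypothetical query log; in practice this comes from search logs.
QUERY_LOG = [
    "samsung tv", "samsung tv", "samsung phone", "sandals",
    "samsung tv", "sandals", "sandwich maker",
]
QUERY_COUNTS = Counter(QUERY_LOG)

def complete(prefix, k=3):
    """Return up to k logged queries starting with prefix, most frequent first."""
    matches = [(q, n) for q, n in QUERY_COUNTS.items() if q.startswith(prefix)]
    # Sort by descending frequency, then alphabetically to break ties.
    matches.sort(key=lambda qn: (-qn[1], qn[0]))
    return [q for q, _ in matches[:k]]
```

For example, `complete("san")` returns the queries about sandals before the one about sandwich makers, because the former was logged more often.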


• Spell correction
• Given a user query, what is the query the user actually wanted to type?
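A common baseline for this problem is a frequency-based corrector over words within one edit of the input. The sketch below assumes a hypothetical vocabulary with counts (`WORD_COUNTS`); it is one textbook approach, not necessarily the one used in the system described here.

```python
from collections import Counter
import string

# Hypothetical word frequencies, e.g. from past queries or catalog titles.
WORD_COUNTS = Counter({"shirt": 120, "short": 80, "shoes": 200, "skirt": 40})

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return word if known, else the most frequent known word one edit away."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing better found; leave the input unchanged
    return max(candidates, key=lambda w: WORD_COUNTS[w])
```

Query-level correction would additionally need context (neighbouring words, click and reformulation data), which this word-level sketch ignores.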


• Cold start
• Given a new item with its set of attributes and no history of sales or exposure on the site, predict the item's sales and its sales per query
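One way to approach cold start is to borrow the sales history of existing items with similar attributes. The sketch below uses a similarity-weighted average over the nearest catalog items; the `CATALOG` data, the Jaccard similarity on attribute pairs, and the choice of `k` are all illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two attribute dicts, on (name, value) pairs."""
    a, b = set(a.items()), set(b.items())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical catalog: (attributes, observed weekly sales).
CATALOG = [
    ({"brand": "acme", "type": "tv", "size": "40"}, 100.0),
    ({"brand": "acme", "type": "tv", "size": "50"}, 60.0),
    ({"brand": "zen", "type": "mug", "size": "s"}, 500.0),
]

def predict_sales(new_attrs, k=2):
    """Similarity-weighted mean sales of the k most similar catalog items."""
    top = sorted(CATALOG, key=lambda rec: jaccard(new_attrs, rec[0]),
                 reverse=True)[:k]
    weights = [jaccard(new_attrs, attrs) for attrs, _ in top]
    if sum(weights) == 0:
        # No similar item at all: fall back to the catalog-wide average.
        return sum(s for _, s in CATALOG) / len(CATALOG)
    return sum(w * s for w, (_, s) in zip(weights, top)) / sum(weights)

estimate = predict_sales({"brand": "acme", "type": "tv", "size": "45"})
```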


• Prediction of LHN
• Given a user query, what is the best set of facets and facet values, i.e. the set with the highest probability that users interact with them and finally buy an item?


• Query understanding
• Given a query, build a semantic parse of the query and tag tokens with attributes: blue tshirts for teenagers -> blue:color tshirts:type for:opt teenagers:agerestriction10-20
• Classification: blue tshirts for teenagers -> type:apparel, price preference: 10-30, releaseyearpreference: 2014-2016
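The token tagging in the example can be sketched with a plain lexicon lookup. The `LEXICON` below is a hypothetical hand-written dictionary that reuses the tag names from the slide; a real tagger would be a learned sequence model and would handle ambiguity and multi-word phrases.

```python
# Hypothetical token -> attribute lexicon, using the tag names from the example.
LEXICON = {
    "blue": "color",
    "red": "color",
    "tshirts": "type",
    "shoes": "type",
    "for": "opt",
    "teenagers": "agerestriction10-20",
}

def tag_query(query):
    """Tag each whitespace-separated token; unseen tokens get 'unknown'."""
    return [(tok, LEXICON.get(tok, "unknown")) for tok in query.lower().split()]

def render(tagged):
    """Format tags in the token:attribute notation used above."""
    return " ".join(f"{tok}:{tag}" for tok, tag in tagged)

parsed = render(tag_query("blue tshirts for teenagers"))
```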


• Related searches
• Given a query, which queries are either semantically close to it or represent co-occurring user interests?
• Nike shoes -> adidas shoes, sport shoes
• Coffee mugs -> travel mugs, photo coffee mugs, cappuccino cups
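A simple way to surface co-occurring interests is to count which queries appear in the same user session. The sketch below assumes a small hypothetical set of sessions (`SESSIONS`); semantic closeness would need a separate signal such as text or embedding similarity.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical search sessions: each is the list of queries one user issued.
SESSIONS = [
    ["nike shoes", "adidas shoes"],
    ["nike shoes", "sport shoes", "adidas shoes"],
    ["coffee mugs", "travel mugs"],
    ["nike shoes", "sport shoes"],
    ["coffee mugs", "cappuccino cups", "travel mugs"],
]

# co_counts[a][b] = number of sessions containing both query a and query b.
co_counts = defaultdict(Counter)
for session in SESSIONS:
    for a, b in permutations(set(session), 2):
        co_counts[a][b] += 1

def related(query, k=2):
    """Return up to k queries that most often share a session with query."""
    ranked = sorted(co_counts[query].items(), key=lambda qn: (-qn[1], qn[0]))
    return [q for q, _ in ranked[:k]]
```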


• Product discovery
• help users to explore the product assortment
• drive users to diverse products
• reduce the risk of selecting irrelevant items
• help to find price, quality, brand, etc. alternatives
• reduce pigeonhole risk
• provide relevant data to make a decision


• Image similarity
• Given images of an item, return other items whose images are visually appealing to users who like the original item (appealing by shape? color? texture?), so that recommending them yields high conversion


• Voice search
• Given voice input, reply with a list of the best items
• “what are the cheapest samsung tvs in the store”
• “what is best deal on queen bed today?”


• Extraction of item attributes
• Given an item, what are its attributes: brand, color, size (wheel, screen, height, S/M/XL, Queen/Twin/King/Full), gender, pattern, shape, features


• Representations of users: actions on websites/apps -> searches, clicks, browsing behaviour; product -> purchase preferences, reviews, ratings, return rates


• Title generation: how to generate the title that yields the maximum conversion rate?

• Which product attributes to select for the title?

What makes a good title?

Limits

• Most models should be served in production
• 50 ms per prediction
• Part of a big system, memory limit ~10G

Retail


• Key directions which require machine learning:
• discounting tools
• coupons and rewards
• loyalty
• inventory management

Inventory management

• Customers want to buy products
• Customers have diverse needs
• Products should be in stock, ideally in warehouses close to customers
• but it’s expensive to store products
• Problem: how many products of each type should be stored, and when should the product supply be refilled?
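The "when to refill" question is often answered with a textbook reorder-point rule: reorder once stock falls to the expected demand over the supplier lead time plus a safety stock. The sketch below assumes independent daily demand with a known mean and standard deviation; all parameter names and values are illustrative, and this is not presented as the method used by any retailer named here.

```python
import math

def reorder_point(mean_daily_demand, std_daily_demand, lead_time_days, z=1.65):
    """Stock level at which a new order should be placed.

    z is the safety factor; 1.65 corresponds to roughly a 95% chance of
    not running out during the lead time under a normal demand assumption.
    """
    expected_demand = mean_daily_demand * lead_time_days
    safety_stock = z * std_daily_demand * math.sqrt(lead_time_days)
    return expected_demand + safety_stock

def should_reorder(on_hand, mean_daily_demand, std_daily_demand, lead_time_days, z=1.65):
    """True when the current stock is at or below the reorder point."""
    return on_hand <= reorder_point(mean_daily_demand, std_daily_demand,
                                    lead_time_days, z)

# Example: 20 units/day on average, std 5, 4-day lead time, z = 2.
rop = reorder_point(20, 5, 4, z=2.0)
```

The trade-off from the slide shows up directly in `z`: a higher value keeps more products in stock but raises storage cost.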

Customer intelligence

• Retail
• analyze sales data, find anomalies, explain them
• low sales of umbrellas during the last month in Northern California’s stores
• No rain? (integration with external data about weather conditions)
• Seasonal / the same as last year / time series
• Competitors
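A first-pass way to flag an anomaly like the umbrella example is to compare the current month with the same month in earlier years using a z-score. The sales figures and the threshold below are made-up illustrations; explaining the anomaly (weather, seasonality, competitors) would require joining external data, which this sketch does not attempt.

```python
from statistics import mean, stdev

def z_score(current, history):
    """Standard score of current against historical values (needs >= 2 points)."""
    mu, sigma = mean(history), stdev(history)
    return 0.0 if sigma == 0 else (current - mu) / sigma

def is_anomaly(current, history, threshold=2.0):
    """Flag values more than threshold standard deviations from the mean."""
    return abs(z_score(current, history)) > threshold

# Hypothetical umbrella unit sales for the same month in four previous years.
UMBRELLA_SALES_SAME_MONTH = [1000, 1100, 900, 1000]

low_month_flagged = is_anomaly(400, UMBRELLA_SALES_SAME_MONTH)
normal_month_flagged = is_anomaly(950, UMBRELLA_SALES_SAME_MONTH)
```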

Fraud detection

• identify fraudulent transactions online
• Hundreds of fraud schemes detected daily
• Global retail shrinkage was $119 billion in 2011, an average of 1.45% of retail sales.
• from stolen credit cards to replaced price tags and price discounts given by high-level managers to achieve personal goals

Propensity Modeling for Marketing Campaigns

• build effective email/Facebook/Google ads campaigns addressing the proper customer at the proper time at the proper cost

• behavior-based customer segmentation and clustering with demographic, lifestyle, and attitudinal information

Online Grocery

• which items can be replaced by other items, and by which items

• data are individual purchases in chain grocery stores, drug stores, online grocery shopping

• the problem: find which items can be replaced by other items if they are not in store, in order to fulfill the order
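One heuristic for finding substitutes from purchase data is that substitutable items tend to be bought alongside the same other products but rarely together. The sketch below scores items by the Jaccard overlap of their co-purchase contexts and skips items that share a basket; the `BASKETS` data and this heuristic are assumptions for illustration, not the method from the talk.

```python
from collections import defaultdict

# Hypothetical purchase baskets.
BASKETS = [
    {"cola_a", "chips", "salsa"},
    {"cola_b", "chips", "salsa"},
    {"cola_a", "pizza"},
    {"cola_b", "pizza", "chips"},
    {"milk", "cereal"},
]

# context[item] = every other item that ever shared a basket with it.
context = defaultdict(set)
for basket in BASKETS:
    for item in basket:
        context[item] |= basket - {item}

def substitutes(item, k=1):
    """Items with the most similar co-purchase context, never bought with item."""
    item_ctx = context.get(item, set())
    scores = {}
    for other, ctx in context.items():
        if other == item or other in item_ctx:
            continue  # bought together: more likely a complement
        union = item_ctx | ctx
        scores[other] = len(item_ctx & ctx) / len(union) if union else 0.0
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, s in ranked[:k] if s > 0]
```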

Dynamic pricing

• define the best price
• continuously scrape competitors' prices, predict demand by price, know the expenses

• online commerce sites change prices every 10 minutes
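Once demand can be predicted as a function of price and the unit cost is known, choosing a price reduces to maximizing expected profit over candidate prices. The sketch below assumes a hypothetical constant-elasticity demand curve; the `demand` function and all its parameters are illustrative stand-ins for a learned demand model.

```python
def demand(price, base=1000.0, elasticity=-1.5, ref_price=10.0):
    """Hypothetical demand curve: base units sold at ref_price, constant elasticity."""
    return base * (price / ref_price) ** elasticity

def expected_profit(price, unit_cost):
    """Margin per unit times predicted units sold at this price."""
    return (price - unit_cost) * demand(price)

def best_price(candidates, unit_cost):
    """Pick the candidate price with the highest expected profit."""
    return max(candidates, key=lambda p: expected_profit(p, unit_cost))

chosen = best_price([8, 10, 12, 14, 16], unit_cost=4.0)
```

With an elasticity of -1.5 the profit-maximizing price is three times the unit cost, which is why the grid search settles on 12 for a cost of 4. Competitor prices could enter as extra inputs to the demand model or as bounds on the candidate grid.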

Challenges

• Data volumes: transactions at Walmart: 10 million per day

• Computations: complicated modeling techniques

Hardware platform

• Needs:
• Data storage
• Data processing
• Serving online

Data storage

• Volumes of data:
• 10 M transactions per day over 5 years -> 18 billion transactions -> 1T
• Catalog: 500 M items * 2K each -> 1T
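The arithmetic behind these estimates can be checked directly. The record size per transaction is not given on the slide; the 55 bytes used below is an assumption back-solved from the stated total of about 1T.

```python
TRANSACTIONS_PER_DAY = 10_000_000
YEARS = 5
TB = 10**12  # decimal terabyte

# 10 M per day * 365 days * 5 years = 18.25 billion transactions.
total_transactions = TRANSACTIONS_PER_DAY * 365 * YEARS

# Assumption: ~55 bytes per stored transaction record, chosen so that the
# total matches the ~1T figure; the slide does not state the record size.
BYTES_PER_TRANSACTION = 55
transaction_tb = total_transactions * BYTES_PER_TRANSACTION / TB

# Catalog: 500 M items * 2K of text each = 1T.
CATALOG_ITEMS = 500_000_000
BYTES_PER_ITEM = 2_000
catalog_tb = CATALOG_ITEMS * BYTES_PER_ITEM / TB

print(f"{total_transactions:,} transactions, ~{transaction_tb:.2f} TB")
print(f"catalog text: {catalog_tb:.2f} TB")
```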

Data Storage

• but with video: petabytes of data, RetailNext 75P per year from 30000+ sensors
• Walmart 500P
• eBay 40P in 2013 (transactions + online behaviours)

Data processing

• Rebuild the model over fresh data:
• typically daily: add daily data (millions of transactions, hundreds of millions of behavior units) to the yearly data store (billions of transactions, hundreds of billions to trillions of behavior units)

• build a model to serve in production the next day

Data processing

• some models, such as fraud detection and dynamic pricing, should be almost online (10-15 minutes)

• build over data such as daily transactions or web crawl over competitors' sites

Serving online

• online commerce WML - thousands to tens of thousands of queries per second at peak times

• complicated ranking and recommendation algorithms

• 50ms limit

serving online

• price, in-store availability - millions of requests per second at peak times

• item information - millions of requests per second

• serving online - Solr/Lucene/Elasticsearch, Cassandra, MongoDB, Oracle, CouchDB, Node.js/Java solutions, etc.

Data processing

• Hadoop / Spark clusters
• a lot of I/O
• HDFS provides the redundancy, so RAID is not necessary; RAID is slow to write, and Hadoop writes a lot
• SAN, NAS are not good either
• so bare metal with DAS (Directly Attached Storage)

Data Processing

• more servers, cheaper servers
• more, smaller disks are better than a few large disks
• allocate the cluster 100% to Hadoop

Data processing

• Hadoop Masters vs Workers
• large clusters: Masters > 64G RAM, dual Ethernet NIC, dual quad-core CPU
• Workers: memory 64G+, SAS 6Gb/s disk controllers, 2 Ethernet cards, 2*6-core processors, 15M cache, Intel’s Hyper-Threading and QPI good to have

Data Processing

• big models, deep learning
• Nvidia DGX-1 and alike
• Pascal GPUs, NVLink interconnect
• Tesla K40, K80 work pretty well too
• may require a lot of tuning: http://timdettmers.com/2015/03/09/deep-learning-hardware-guide/

• hard to buy: big data solutions are considered profit generators, HPC servers are not


Serving online

• Typically large memory, but not necessarily (for example, Elasticsearch/Solr degrades over 64G)

• CPUs: more cores rather than faster cores
• Disks: SSD, RAID 0, no NAS; in many setups, optimize for how easy drives are to replace rather than for SSD endurance

ecommerce example

• Database servers
• Unified hardware platform: from HP
• HP DL line:
• 4 CPU sockets
• 256 GB RAM
• network interfaces
• not much HDD, data is in NAS

ecommerce example

• cloud servers:
• purchased by racks: 40 in a rack
• 2 CPU sockets
• 198G
• 18-core CPU
• SSD

network requirements

• 1 network card per server - a big mistake, 1 switch per rack

• 3 cards per server:
• typical three data flows:
• production
• “administrative” (dockers etc)
• analytics

example

• application servers vs big data servers
• application servers (Java, Node.js apps):
• 1TB SSD, RAID 5
• Big data servers:
• 5T SAS

Questions?
