Thomas Jensen. Machine Learning

Post on 27-Aug-2014


#BigDataBY


The Impact of Big Data on Classic Machine Learning Algorithms

Thomas Jensen, Senior Business Analyst @ Expedia

Who am I?

• Senior Business Analyst @ Expedia

• Working within the competitive intelligence unit

• Responsible for:

  • Algorithm that scores new hotels

  • Algorithm that predicts room nights sold on existing Expedia hotels

  • Scraping competitor sites

  • Other stuff…

The Promise of Big Data

Real-time data

Data-driven decisions

More accurate and robust models

Granularity

Big Data Challenges

Data Processing – not going to talk about this.

Speed at which to use data – how fast should we update algorithms?

How do we train algorithms on data sets that do not fit into memory?

Big Data Challenges

Taken from: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Classification - Logistic Regression

• One classic task in machine learning / statistics is to classify some objects/events/decisions correctly

• Examples are:

  • Customer churn

  • Click behavior

  • Purchase behavior

  • …

• One of the most popular algorithms to carry out these tasks is logistic regression

What is logistic regression?

• Logistic regression attaches probabilities to individual outcomes, showing how likely they are to belong to one class or the other

• $\Pr(y \mid x) = \dfrac{1}{1 + e^{-x\beta}}$

• The challenge is to choose the optimal beta(s)

• To do that we minimize a cost function
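The formula above and the idea of a cost function can be sketched in a few lines of Python. The slides do not say which cost function is minimized; the log loss (negative log-likelihood) used here is the usual choice for logistic regression and is an assumption, as are the function names.

```python
import math

def predict_proba(x, beta):
    """Probability of the positive class: 1 / (1 + exp(-x . beta))."""
    z = sum(xi * bi for xi, bi in zip(x, beta))
    # Two algebraically equal branches keep exp() from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_loss(y, p, eps=1e-15):
    """Cost for one example (assumed log loss): small when p is close to the label y."""
    p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mean_cost(rows, labels, beta):
    """Average cost over a data set; training searches for the beta that minimizes it."""
    total = sum(log_loss(y, predict_proba(x, beta)) for x, y in zip(rows, labels))
    return total / len(rows)
```

With all betas at zero every prediction is 0.5, so the cost per example is log 2; better betas push the cost below that.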

Why Use Logistic Regression?

• It is a simple and well-understood algorithm

• Outputs probabilities

• There are tried and tested methods to estimate the parameters

• It is flexible – can handle a number of different inputs, and feature transformations

Usual Approaches

• Batch training (offline approach)

  • Get all the data and train the algorithm in one go

• Disadvantages when data is big

  • Requires all data to be loaded into memory

  • Periodic retraining is necessary

  • Very time consuming with big data!

Batch Training

Examples of Logistic Regression in Industry Settings – Real Time Bidding

• RTB

  • RTB algorithms are usually based on logistic regression

  • Whether or not to bid on a user is determined by the probability that the user will click on an ad

  • Each day billions of bids are processed

  • Each bid has to be processed within 80 milliseconds

Examples of Logistic Regression in Industry Settings – Fraud Detection

Detecting Fraudulent Credit Card Transactions

• The probability that a transaction is using a stolen credit card is typically estimated with logistic regression

• Billions of transactions are analyzed each day

How Slow is the Batch Version of Logistic Regression?

One target variable and two feature vectors. All randomly generated.

A Real World Problem


• Some stats on the training job in the pipeline:

  • Runs training jobs on a per-country basis

  • Longest running job lasts ~9 hours

  • Shortest running job lasts ~3 hours

  • There are often convergence failures

• What we need is an algorithm that:

  • Can reduce training time

  • Is robust towards convergence failures

A Big Data Friendly Approach

Online Training

• Pass each data point sequentially through the algorithm

• Only requires one data point at a time in memory

• Allows for on-the-fly training of the algorithm

Online Learning

• We want to learn a vector of weights

• Initialize all weights. Begin loop:

  1. Get training example

  2. Make a prediction for the target variable

  3. Learn the true value of the target

  4. Update the weights and go to 1
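The four steps of the loop can be sketched in plain Python. This is an illustration under stated assumptions, not the speaker's implementation: the update rule is the standard logistic-regression gradient step, and `synthetic_stream`, the step size `alpha` and the toy labeling rule are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written so exp() cannot overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def online_logistic_regression(stream, n_features, alpha=0.1):
    """Train on one example at a time; only the weights stay in memory."""
    weights = [0.0] * n_features                      # initialize all weights
    mistakes = seen = 0
    for x, y in stream:                               # 1. get training example
        p = sigmoid(sum(w * xj for w, xj in zip(weights, x)))  # 2. make a prediction
        mistakes += int((p >= 0.5) != (y == 1))       # 3. learn the true value, score the prediction
        for j in range(n_features):                   # 4. update the weights, then loop
            weights[j] -= alpha * (p - y) * x[j]
        seen += 1
    return weights, mistakes / max(seen, 1)

def synthetic_stream(n, seed=0):
    """Hypothetical data source: a bias term plus one feature, label 1 when the feature is positive."""
    rng = random.Random(seed)
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)
        yield [1.0, x1], (1 if x1 > 0 else 0)

weights, error_rate = online_logistic_regression(synthetic_stream(5000), n_features=2)
```

Because each prediction is scored before the weights see that example, `error_rate` is an honest running estimate of out-of-sample error, and the generator could be swapped for a file or network reader without changing the loop.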

Online Learning

• Initialise all weights. Begin loop:

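Repeat {
  For i = 1 to m {
    $\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x_i, y_i))$
  }
}

• $\dfrac{\partial}{\partial \theta_j}$: the partial derivative of the cost function

• $\mathrm{cost}(\theta, (x_i, y_i))$: the cost function given theta and row i, i.e. how wrong are we?

• $\alpha$: the step size, i.e. how fast we should descend the gradient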
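For logistic regression the partial derivative in the update has a simple closed form. The derivation below assumes the log loss as the cost function (the slides do not name one) and writes $h_\theta(x) = 1 / (1 + e^{-\theta^{T} x})$ for the predicted probability, with $\theta$ playing the role of $\beta$ from the earlier slide and $x_{ij}$ denoting feature j of row i.

$$
\mathrm{cost}(\theta, (x_i, y_i)) = -\,y_i \log h_\theta(x_i) - (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr)
$$

$$
\frac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x_i, y_i)) = \bigl(h_\theta(x_i) - y_i\bigr)\, x_{ij}
$$

$$
\theta_j := \theta_j - \alpha \bigl(h_\theta(x_i) - y_i\bigr)\, x_{ij}
$$

Each update therefore needs only the current row: the prediction error $h_\theta(x_i) - y_i$ scaled by the feature value.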

Online Learning

• Approaches the minimum of the cost function in a jumpy manner and never actually settles on the minimum.

Batch vs. Online Learning

Data size: 4.8 GB. Rows: 500,000. Columns: 5,000.

[Bar chart: training time for Batch, SGDClassifier and Sofia-ml; axis from 0 to 120, units not shown]

*Times include reading data and training algorithm

Online Learning Vs. Batch

Online Learning

• When we have a continuous stream of data

• When it is important to update the algorithm in real time – can hit a moving target

• When training speed is important

• Parameters are “jumpy” around the optimal values

Batch

• When it is very important to get the exact optimal values

• When data can fit in memory

• When training time is not of the essence

Popular Online Learning Libraries

• Sofia-ml (C/C++)

  • Requires data in svmLight format

  • Has implementations of SVM, neural networks and logistic regression

  • Supports classification and ranking

• Vowpal Wabbit (C/C++)

  • Requires data in its own VW format

  • Has implementations of the most popular loss functions

  • Supports classification, ranking and regression

• Pandas + scikit-learn (Python)

  • Pandas has a nice function for reading files in batches

  • Can handle sparse and non-sparse matrices

  • Scikit-learn has an SGD classifier that can fit the model in batches

  • Supports classification, ranking and regression