Pollyanna Document Classifier


Page 1: Pollyanna Document Classifier

Pollyanna

A machine learning system for classifying product pages on the Internet

Page 2: Pollyanna Document Classifier

What is Pollyanna?

• Pollyanna is a Machine Learning System that uses ‘Supervised Learning’ techniques to associate words and categories quantitatively, based on the examples in the training set.

• The training system is programmed to interpret the association between words and categories using theories in probability and statistics

• It applies the training knowledge to classify documents based on the text contained in the document using the ‘linear classifier’ function

Page 3: Pollyanna Document Classifier

What does Pollyanna do?

• It reads the text in the product pages of Internet merchants and retailers,

• Quantitatively associates the words in the title, meta and body tags with the product categories in its taxonomy, and

• Predicts the top 3 categories to which the products in the product page may belong
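The three steps above can be sketched roughly as follows. This is a minimal illustration, not Pollyanna's actual code: the `WEIGHTS` table, its numbers, and the function name `top_categories` are assumptions made for the example, and the scoring is a plain sum of word weights.

```python
from collections import defaultdict

# Hypothetical word-to-category weights; the numbers are illustrative only.
WEIGHTS = {
    "acrylic": {"Men's Hats": 0.95, "Men's Socks": 0.94, "Women's Hats": 0.77},
    "beanie": {"Men's Hats": 0.80, "Women's Hats": 0.75},
    "crew": {"Men's Socks": 0.60},
}


def top_categories(text, weights=WEIGHTS, k=3):
    """Sum the weights of known words per category and return the k best."""
    scores = defaultdict(float)
    for word in text.lower().split():
        # Words without an entry in the weight table contribute nothing.
        for category, weight in weights.get(word, {}).items():
            scores[category] += weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [category for category, _ in ranked[:k]]


print(top_categories("Acrylic beanie hat"))
# ["Men's Hats", "Women's Hats", "Men's Socks"]
```

In a real system the text would come from the title, meta and body tags of the product page rather than from a literal string.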

Page 4: Pollyanna Document Classifier

The Context

What is Pollyanna’s business context?

Page 5: Pollyanna Document Classifier

The Comparison Shopping Engine (CSE) Eco-system

Internet Retailer

Comparison Shopping Engine

Internet Buyer

Page 6: Pollyanna Document Classifier

The Process

Retailer Offer Classification

Retailer Offer Alignment

Product

Attribution

Internet Shopper Internet Retailer

Search

Identify/Shortlist

Purchase

Comparison Shopping Engine

Page 7: Pollyanna Document Classifier

Sample Product Taxonomy for Classification

Page 8: Pollyanna Document Classifier

Classification

• Classification of Retailer’s offers is a critical process for most Comparison Shopping Sites

• Classification enables a focused search for a product within a specific product category

Page 9: Pollyanna Document Classifier

Efficiency of existing classification methods

• The approximate accuracy of current classification algorithms (in the Comparison Shopping Space) – 65%

• About 10 % of merchant offers are manually classified

• About 10 % of merchant offers are always mis-classified

Page 10: Pollyanna Document Classifier

Problem Definition

How to most effectively classify merchant/retailer offers accurately at the lowest cost?

Page 11: Pollyanna Document Classifier

The Solution

The Pollyanna System

Page 12: Pollyanna Document Classifier

A fresh perspective of the process and inputs

Disregard retailer’s data-feed to search engine

Train the system on retailer’s website content

Use the product web page text as input

Page 13: Pollyanna Document Classifier

A new viewpoint on support vectors in a machine learning system

A new predictive coefficient not used in any other publicly known machine learning system

Synthesis of statistical theories widely applied in social sciences and medical research

Page 14: Pollyanna Document Classifier

Pollyanna’s current one-dimensional relationship analysis

[Diagram: a Keyword linked to Product Category 1, Product Category 2, Product Category 3 and Product Category 4]

Page 15: Pollyanna Document Classifier

Example of the one dimensional relationship

Word      Relationship   Product Category
acrylic   0.950338803    Men’s Hats
acrylic   0.944220332    Men’s Socks
acrylic   0.061613565    Men’s Sweaters / Vests
acrylic   0.002798075    Miscellaneous Men’s Accessories
acrylic   0.001157465    Miscellaneous Women’s Accessories
acrylic   0.772611278    Women’s Hats
acrylic   0.442448187    Women’s Socks & Hosiery

Page 16: Pollyanna Document Classifier

Conditional Probability

• Conditional probability is the probability of some event A, given the occurrence of some other event B. Conditional probability is written as P(A|B), and is read as "the probability of A, given B".

• Bayes’ Theorem provides the equation for conditional probability, which can be stated as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

which can also be written as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Page 17: Pollyanna Document Classifier

Conditional Probability

Attribute               Contains ‘Drawstring’ (B)   Does not contain ‘Drawstring’ (b)   Total
Women’s Pants (A)       195                         12053                               12248
Not Women’s Pants (a)   628                         434347                              434975
Total                   823                         446400                              447223

Data from Pollyanna

In this example, CP = P(A|B) = 195/823 = 0.2369380316
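The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation; the function name `conditional_probability` is chosen for the example, and the counts are the ones in the table. The second part confirms that the Bayes form of the equation gives the same number.

```python
def conditional_probability(count_a_and_b, count_b):
    """P(A | B) estimated as count(A and B) / count(B)."""
    if count_b == 0:
        raise ValueError("event B was never observed")
    return count_a_and_b / count_b


# Counts from the 'Drawstring' table.
total_docs = 447223
women_pants = 12248            # documents in Women's Pants (A)
docs_with_word = 823           # documents containing the word (B)
women_pants_with_word = 195    # documents in both A and B

cp = conditional_probability(women_pants_with_word, docs_with_word)
print(round(cp, 6))  # 0.236938

# Same value through Bayes' theorem: P(B | A) * P(A) / P(B).
p_a = women_pants / total_docs
p_b = docs_with_word / total_docs
p_b_given_a = women_pants_with_word / women_pants
bayes = p_b_given_a * p_a / p_b
print(abs(bayes - cp) < 1e-12)  # True
```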

Page 18: Pollyanna Document Classifier

Risk Ratio

                           High Systolic BP   Normal BP
Congestive Heart Failure   400                400
No CHF                     1100               2600
Total                      1500               3000

$$RR = \frac{400/1500}{400/3000} = 2.0$$

Example from cohort studies in medicine.
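As a quick check of the cohort example, the risk ratio can be computed directly from the four counts. The function name `risk_ratio` and its parameter names are chosen for this sketch.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


# High systolic BP: 400 CHF cases out of 1500 people.
# Normal BP:        400 CHF cases out of 3000 people.
rr = risk_ratio(400, 1500, 400, 3000)
print(round(rr, 6))  # 2.0
```

A ratio of 2 means that, in this example, congestive heart failure is twice as frequent in the high systolic BP group as in the normal BP group.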

Page 19: Pollyanna Document Classifier

Risk Ratio

                  Document contains “Oxford”   Does not contain “Oxford”
Men’s Shoes       738                          29689
Not Men’s Shoes   808                          415988
Total             1546                         445677

$$RR = \frac{738/1546}{29689/445677} = 7.16591289$$

Data from Pollyanna
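Counts such as these could be tallied from a labeled training set along the following lines. This is a sketch under assumptions: the tiny `docs` list is made up, whitespace tokenization is a simplification, and the function names are chosen for the example. The last line recomputes the ratio from the counts reported for “Oxford”.

```python
def contingency_counts(documents, word, category):
    """Tally a 2x2 table for one word and one category.

    documents: iterable of (text, category_label) pairs.
    Returns (with_word_in_cat, with_word_total,
             without_word_in_cat, without_word_total).
    """
    with_in = with_total = without_in = without_total = 0
    for text, label in documents:
        has_word = word.lower() in text.lower().split()
        in_category = label == category
        if has_word:
            with_total += 1
            with_in += int(in_category)
        else:
            without_total += 1
            without_in += int(in_category)
    return with_in, with_total, without_in, without_total


def risk_ratio(with_in, with_total, without_in, without_total):
    """Category rate among documents with the word over the rate without it."""
    return (with_in / with_total) / (without_in / without_total)


# Made-up training examples, for illustration only.
docs = [
    ("black leather oxford", "Men's Shoes"),
    ("brown oxford lace up", "Men's Shoes"),
    ("oxford cotton shirt", "Men's Shirts"),
    ("running sneaker", "Men's Shoes"),
    ("wool scarf", "Accessories"),
    ("silk tie", "Accessories"),
    ("denim jacket", "Men's Jackets"),
]
counts = contingency_counts(docs, "oxford", "Men's Shoes")
print(counts)  # (2, 3, 1, 4)

# The same ratio on the counts reported for "Oxford" and Men's Shoes.
print(round(risk_ratio(738, 1546, 29689, 445677), 4))  # 7.1659
```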

Page 20: Pollyanna Document Classifier

Pollyanna is a Linear Classifier

• If the input feature vector to the classifier is a real vector x, then the output score is

$$y = f(\mathbf{w} \cdot \mathbf{x}) = f\Big(\sum_j w_j x_j\Big)$$

• where w is a real vector of weights and f is a function that converts the scalar product of the two vectors into the desired output.
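A linear classifier of this kind can be written in a few lines. The weights and features below are made-up values, and `linear_score` is a name chosen for this sketch; by default `f` is the identity, so the function returns the raw scalar product.

```python
def linear_score(weights, features, f=lambda s: s):
    """Return f(w . x) for a weight vector w and a feature vector x."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    dot = sum(w * x for w, x in zip(weights, features))
    return f(dot)


w = [0.5, 2.0, -1.0]   # hypothetical weights
x = [1.0, 0.0, 1.0]    # hypothetical feature vector

print(linear_score(w, x))                      # -0.5
print(linear_score(w, x, f=lambda s: s > 0))   # False
```

Passing a threshold function as `f` turns the score into a yes/no decision, which is one common choice for the output function.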

Page 21: Pollyanna Document Classifier

Solution Statement

• Pollyanna is a Machine Learning System that uses new processes, inputs and statistical theories

• That provides a highly accurate automated classification (87% ± 3%)

• Unlike other classification algorithms (in the E-Commerce space) that are dependent on retailer’s data-feeds, and are less accurate (Approx 65%) and are supported by manual classification

• We have assembled a highly accurate, cost-effective classification system that does not require ongoing manual support

Page 22: Pollyanna Document Classifier

Pollyanna Demo

Page 23: Pollyanna Document Classifier

Architecture

[Architecture diagram. Labels: Sampling; Statistical Elimination; Statistical Validation; Knowledge; Training Module; Front End Tool; User/Client; Internet Cloud; Perl]

Page 24: Pollyanna Document Classifier

Pollyanna can be applied to predictive analytics in online payment fraud

If the following conditions are met:

Page 25: Pollyanna Document Classifier

The problem must be clearly defined in terms of:

• Input
  – Type of data: Integer, String, Floating, Boolean
  – File type: XML, Delimited, Database

• Process
  – Human intelligence and any other methods, procedures required for arriving at a decision, prediction or forecast

• Output
  – All possible decisions/outcomes

Examples:

• Bucketing a transaction into fraud risk category

• Forecasting fraud losses on completed transactions

Page 26: Pollyanna Document Classifier

Historical data should be available

• Reliable data
  – Data is sufficiently complete and error free

• Valid data
  – Data actually represents what you think is being measured

• Sufficient data
  – Data is adequate to support the outcome of the process or the decision

• Spatial data

• Time series data

Page 27: Pollyanna Document Classifier

Data should yield binomial probability distribution for each attribute

• Example

• A key attribute of an online transaction is the location of the “IP” address and the location of the physical address of the credit card holder:
  – Two outcomes are possible for the above attribute
    • The “IP” address and the physical address are located geographically in the same country
    • The “IP” address and the physical address are not located geographically in the same country

• Continued in the next slide

Page 28: Pollyanna Document Classifier

Data should yield binomial probability distribution for each attribute

• Example – Continued from previous slide

• The Machine Learning system is supplied 200 online payment transactions received in the previous year.

• The machine learning system should be able to determine, for each possible outcome, the number of Yes or No events observed
  – Example: For the outcome “The IP address and the physical address are located geographically in the same country” – 20 Yes and 180 No
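Counting the Yes and No events for one attribute is straightforward. In this sketch the field name `same_country` and the 200 generated records are assumptions that simply mirror the 20 / 180 split in the example.

```python
from collections import Counter


def yes_no_counts(transactions, attribute):
    """Return (yes, no) counts for a boolean attribute across transactions."""
    counts = Counter("Yes" if t[attribute] else "No" for t in transactions)
    return counts["Yes"], counts["No"]


# 200 made-up transactions matching the example: 20 Yes and 180 No.
transactions = (
    [{"same_country": True} for _ in range(20)]
    + [{"same_country": False} for _ in range(180)]
)

print(yes_no_counts(transactions, "same_country"))  # (20, 180)
```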

Page 29: Pollyanna Document Classifier

How will the support vector be calculated in the context of online payment transactions?

Illustration with a hypothetical case

Page 30: Pollyanna Document Classifier

Support Vector Computation Example

• To simplify the problem let us say that every transaction has to be bucketed into one of the two classes:
  – A genuine transaction
  – A fraudulent transaction

• The training module’s goal is to calculate the relationship - between an attribute of a transaction and each of the classes mentioned above - which is the ‘Support Vector’

Page 31: Pollyanna Document Classifier

Support Vector Computation Example

• The training module is supplied with 200 sample transactions (historical data) representing the population

• Of the 200 transactions 20 are fraudulent and 180 are genuine

• A key attribute of the transaction is: The IP address and the physical address of the credit card holder are not located geographically in the same country. Of the 200 transactions 40 had the above attribute and 160 did not have the above attribute. Let us call the above attribute ‘X’.

• The training module will analyze the data and arrive at the following matrix:

Page 32: Pollyanna Document Classifier

Support Vector Computation Example

                 Attribute ‘X’ observed   Attribute ‘X’ not observed   Total
Fraudulent       15                       5                            20
Not Fraudulent   25                       155                          180
Total            40                       160                          200

Association between Fraudulent Transaction and Attribute ‘X’

Page 33: Pollyanna Document Classifier

Support Vector Computation Example

• Applying a synthesis of theories in probability and statistics the support vector is calculated as 4.040816

• The support vector is a measure of the relationship between a Fraudulent Transaction and the attribute: “The IP address and the physical address of the credit card holder are not located geographically in the same country”.
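The slides do not state the formula behind the value 4.040816, so it is not reproduced here. As a hedged illustration only, the sketch below computes the two association measures defined earlier in the deck (conditional probability and risk ratio) from the same hypothetical matrix; neither is claimed to be Pollyanna's coefficient, and the variable names are chosen for the example.

```python
# Counts from the hypothetical matrix (attribute 'X' vs. fraud).
fraud_with_x, fraud_without_x = 15, 5
genuine_with_x, genuine_without_x = 25, 155

with_x = fraud_with_x + genuine_with_x            # 40 transactions with 'X'
without_x = fraud_without_x + genuine_without_x   # 160 transactions without 'X'

# Conditional probability of fraud given that 'X' was observed.
p_fraud_given_x = fraud_with_x / with_x

# Risk ratio: fraud rate with 'X' divided by fraud rate without 'X'.
rr = (fraud_with_x / with_x) / (fraud_without_x / without_x)

print(p_fraud_given_x)  # 0.375
print(rr)               # 12.0
```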

Page 34: Pollyanna Document Classifier

How will the machine learning system forecast fraud loss?

Illustration with a hypothetical case

Page 35: Pollyanna Document Classifier

Forecasting Fraud Loss

• The problem:
  – To forecast the value of losses on all fraudulent credit card payment transactions that have been successfully executed in a given month

• There are two steps to doing this:
  – Step 1: Determine whether each transaction is fraudulent or not based on attributes of the transaction
  – Step 2: Sum the values of the fraudulent transactions to arrive at the forecast of loss for that month

Page 36: Pollyanna Document Classifier

Forecasting Fraud Loss – Step 1: Determining whether a transaction is fraudulent or not

• Let us hypothetically say that there are two outcomes for each transaction, either it is a Fraudulent transaction or it is a Genuine transaction.

• For each outcome the following linear function is applied:

$$y = f(\mathbf{w} \cdot \mathbf{x}) = f\Big(\sum_j w_j x_j\Big)$$

• Refer to slide 20 for a brief explanation of the function

Page 37: Pollyanna Document Classifier

Forecasting Fraud Loss – Step 1: Determining whether a transaction is fraudulent or not

• So the linear function is applied for the observed attributes in the transaction (X vector) weighted by the Support Vector (W vector) calculated in the training module

• For our example there are 2 outcomes for each transaction – Fraudulent or Genuine

• For every transaction, the linear function gives the values for both the outcomes and the prediction will be in favor of the outcome with the higher value
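Step 1 can be sketched as scoring each outcome with its own weight vector and picking the larger score. The attribute order, the weights, and the function names below are hypothetical and serve only to illustrate the mechanism.

```python
def score(weights, features):
    """Scalar product of a weight vector and a feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def predict(features, weights_by_outcome):
    """Return the outcome with the highest score, plus all scores."""
    scores = {
        outcome: score(weights, features)
        for outcome, weights in weights_by_outcome.items()
    }
    # On a tie, max() keeps the outcome that was listed first.
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical features: [IP/billing country mismatch, repeat customer].
weights_by_outcome = {
    "Fraudulent": [4.0, 0.2],
    "Genuine": [0.5, 3.0],
}

print(predict([1, 0], weights_by_outcome)[0])  # Fraudulent
print(predict([0, 1], weights_by_outcome)[0])  # Genuine
```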

Page 38: Pollyanna Document Classifier

Forecasting Fraud Loss – Step 2: Summing the values of all fraudulent transactions

• Step 1 is performed on each transaction in a given period

• The values of fraudulent transactions are totaled to arrive at the forecast of losses due to fraud in a given period
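Step 2 then reduces to a filtered sum. In this sketch the transaction fields and the stand-in rule `flag_mismatch` are assumptions; in practice the decision would come from the Step 1 classifier.

```python
def forecast_fraud_loss(transactions, is_fraudulent):
    """Sum the amounts of all transactions that Step 1 marks as fraudulent."""
    return sum(t["amount"] for t in transactions if is_fraudulent(t))


# Made-up transactions for one period.
transactions = [
    {"amount": 120.0, "ip_country_mismatch": True},
    {"amount": 35.5, "ip_country_mismatch": False},
    {"amount": 60.0, "ip_country_mismatch": True},
]


def flag_mismatch(transaction):
    """Stand-in for the Step 1 decision, used here only for illustration."""
    return transaction["ip_country_mismatch"]


print(forecast_fraud_loss(transactions, flag_mismatch))  # 180.0
```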

Page 39: Pollyanna Document Classifier

All the details contained in the examples above are imaginary.

They serve only for the purpose of understanding the system and its application to the field of fraud analytics.

Page 40: Pollyanna Document Classifier

Benefits of adopting Pollyanna

Page 41: Pollyanna Document Classifier

Benefits of adopting Pollyanna

• For a process that is currently supported by human intelligence, Pollyanna may confer a cost saving benefit ranging from 40% to 80% from reduction of human resources

• For a process that is already automated or uses machine intelligence, Pollyanna may bring efficiency or accuracy improvement ranging from 10% to 25%

Page 42: Pollyanna Document Classifier

“Any sufficiently advanced technology is indistinguishable from magic”

Sir Arthur C. Clarke

Page 43: Pollyanna Document Classifier

Thank You

Page 44: Pollyanna Document Classifier

Contact Details

• PG Vijay (Consultant – Machine Learning Systems)
  – Mobile: +91 98418 21167
  – E-Mail: [email protected]
  – LinkedIn Public Profile: http://www.linkedin.com/in/machinelearning