Pollyanna Document Classifier

Pollyanna: a machine learning system for classifying product pages on the Internet

What is Pollyanna?

• Pollyanna is a Machine Learning System that uses ‘Supervised Learning’ techniques to associate words and categories quantitatively, based on the examples in the training set.

• The training system is programmed to interpret the association between words and categories using theories in probability and statistics

• It applies the training knowledge to classify documents based on the text contained in the document using the ‘linear classifier’ function

What does Pollyanna do?

• It reads the text in the product pages of Internet merchants and retailers,

• Quantitatively associates the words in the title, meta and body tags with the product categories in its taxonomy, and

• Predicts the top 3 categories to which the products in the product page may belong
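The first two steps above (reading a product page and pulling out the words in the title, meta and body tags) can be sketched as follows. This is a minimal illustration using only the Python standard library, not Pollyanna's actual reader: the class name `PageTextExtractor`, the helper `extract_words`, the choice of `description`/`keywords` meta tags, the tokenizing rule and the sample page are all assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Collect text from the title, meta and body parts of a page."""

    SKIPPED = ("script", "style")  # assumed not to carry product text

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "meta": [], "body": []}
        self._open = []        # open <title>/<body> tags, innermost last
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.fields["meta"].append(attr.get("content") or "")
        elif tag in ("title", "body"):
            self._open.append(tag)
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("title", "body") and tag in self._open:
            self._open.remove(tag)
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._open and not self._skip_depth and data.strip():
            self.fields[self._open[-1]].append(data.strip())


def extract_words(html):
    """Return lowercase word lists keyed by 'title', 'meta' and 'body'."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        field: re.findall(r"[a-z0-9']+", " ".join(chunks).lower())
        for field, chunks in parser.fields.items()
    }


# Hypothetical product page used only to exercise the sketch.
page = """<html><head><title>Acrylic Beanie</title>
<meta name="description" content="Warm acrylic hat for men">
</head><body><h1>Men's Hat</h1><script>var x = 1;</script>
<p>Soft acrylic knit.</p></body></html>"""

words = extract_words(page)
print(words["title"])  # ['acrylic', 'beanie']
print(words["meta"])   # ['warm', 'acrylic', 'hat', 'for', 'men']
```

Keeping the three fields separate leaves room to weight title, meta and body words differently later; whether Pollyanna does so is not stated in the slides.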

The Context

What is Pollyanna’s business context?

The Comparison Shopping Engine (CSE) Eco-system

Internet Retailer

Comparison Shopping Engine

Internet Buyer

The Process

Retailer Offer Classification

Retailer Offer Alignment

Product Attribution

Internet Shopper

Internet Retailer

Search

Identify/Shortlist

Purchase

Comparison Shopping Engine

Sample Product Taxonomy for Classification

Classification

• Classification of retailers’ offers is a critical process for most Comparison Shopping Sites

• Classification enables a focused search for a product within a specific product category

Efficiency of existing classification methods

• The approximate accuracy of current classification algorithms (in the Comparison Shopping Space) – 65%

• About 10 % of merchant offers are manually classified

• About 10 % of merchant offers are always mis-classified

Problem Definition

How can merchant/retailer offers be classified accurately at the lowest cost?

The Solution

The Pollyanna System

A fresh perspective of the process and inputs

Disregard retailer’s data-feed to search engine

Train the system on retailer’s website content

Use the product web page text as input

A new viewpoint on support vectors in a machine learning system

A new predictive coefficient not used in any other publicly known machine learning system

Synthesis of statistical theories widely applied in social sciences and medical research

Pollyanna’s current one-dimensional relationship analysis

[Diagram: a single Keyword linked to Product Category 1, Product Category 2, Product Category 3 and Product Category 4]

Example of the one dimensional relationship

Word | Relationship | Product Category
acrylic | 0.950338803 | Men’s Hats
acrylic | 0.944220332 | Men’s Socks
acrylic | 0.061613565 | Men’s Sweaters / Vests
acrylic | 0.002798075 | Miscellaneous Men’s Accessories
acrylic | 0.001157465 | Miscellaneous Women’s Accessories
acrylic | 0.772611278 | Women’s Hats
acrylic | 0.442448187 | Women’s Socks & Hosiery

Conditional Probability

• Conditional probability is the probability of some event A, given the occurrence of some other event B. It is written as P(A|B) and read as "the probability of A, given B".

• Bayes’ theorem provides the equation for conditional probability, which can be stated as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(A \cap B)}{P(B)}$$

Conditional Probability

Attribute | Document contains the word ‘Drawstring’ (B) | Document does not contain the word ‘Drawstring’ (b) | Total
Women’s Pants (A) | 195 | 12053 | 12248
Not Women’s Pants (a) | 628 | 434347 | 434975
Total | 823 | 446400 | 447223

Data from Pollyanna

In this example, CP = P(A|B) = 195/823 = 0.2369380316
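The arithmetic in the ‘Drawstring’ example can be checked with a few lines of Python. The sketch below uses only the counts from the table; the variable and function names are illustrative. It computes P(A|B) directly from the counts and again through the Bayes form, to show that the two routes agree.

```python
# Counts taken from the 'Drawstring' / Women's Pants table.
both = 195            # Women's Pants AND contains 'Drawstring' (A and B)
contains_word = 823   # all documents containing 'Drawstring' (B)
in_category = 12248   # all Women's Pants documents (A)
total = 447223        # all documents


def conditional_probability(joint_count, condition_count):
    """P(A|B) estimated as count(A and B) / count(B)."""
    return joint_count / condition_count


# Direct route: P(A and B) / P(B), with the totals cancelling out.
cp_direct = conditional_probability(both, contains_word)

# Bayes route: P(B|A) * P(A) / P(B).
p_b_given_a = both / in_category
p_a = in_category / total
p_b = contains_word / total
cp_bayes = p_b_given_a * p_a / p_b

print(round(cp_direct, 10))  # 0.2369380316
```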

Risk Ratio

Example from cohort studies in medicine.

Outcome | High Systolic BP | Normal BP
Congestive Heart Failure | 400 | 400
No CHF | 1100 | 2600
Total | 1500 | 3000

RR = (400/1500) / (400/3000) = 2.0

Risk Ratio

Data from Pollyanna

Category | Document contains “Oxford” | Does not contain “Oxford”
Men’s Shoes | 738 | 29689
Not Men’s Shoes | 808 | 415988
Total | 1546 | 445677

RR = (738/1546) / (29689/445677) = 7.16591289
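Both risk ratio examples (the cohort-study one and the “Oxford” / Men’s Shoes one) follow the same formula: the rate of the outcome in the exposed group divided by its rate in the unexposed group. A minimal sketch, with a hypothetical function name and the counts taken from the two examples:

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Rate of the outcome among the exposed divided by the rate among the unexposed."""
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)


# Cohort example: CHF among high systolic BP (400 of 1500)
# versus CHF among normal BP (400 of 3000).
rr_chf = risk_ratio(400, 1500, 400, 3000)

# Pollyanna example: Men's Shoes among documents containing "Oxford" (738 of 1546)
# versus Men's Shoes among documents without it (29689 of 445677).
rr_oxford = risk_ratio(738, 1546, 29689, 445677)

print(rr_chf)                # 2.0
print(round(rr_oxford, 4))   # 7.1659
```

Read this way, a document containing “Oxford” is roughly seven times as likely to be a Men’s Shoes page as a document that does not contain it.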

Pollyanna is a Linear Classifier

• If the input feature vector to the classifier is a real vector x, then the output score is

$$y = f(\mathbf{w} \cdot \mathbf{x}) = f\Big(\sum_j w_j x_j\Big)$$

• where w is a real vector of weights and f is a function that converts the scalar product of the two vectors into the desired output.
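To make the linear classifier concrete, here is a small sketch that scores a document against several categories and returns the three highest-scoring ones. It rests on assumptions not confirmed by the slides: that the ‘Relationship’ values from the acrylic example can serve directly as the weights w, that x is a 0/1 indicator of which words are present, and that f is the identity. The function names are hypothetical.

```python
def score(weights, features):
    """Scalar product w . x over sparse dict vectors (f is the identity here)."""
    return sum(w * features.get(word, 0.0) for word, w in weights.items())


def top_categories(weights_by_category, features, k=3):
    """Return the k categories with the highest linear score, best first."""
    scored = {cat: score(w, features) for cat, w in weights_by_category.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Weights for the word 'acrylic', copied from the relationship example.
# Treating them as classifier weights is an assumption of this sketch.
weights_by_category = {
    "Men's Hats": {"acrylic": 0.950338803},
    "Men's Socks": {"acrylic": 0.944220332},
    "Men's Sweaters / Vests": {"acrylic": 0.061613565},
    "Women's Hats": {"acrylic": 0.772611278},
    "Women's Socks & Hosiery": {"acrylic": 0.442448187},
}

# x: a document in which only the word 'acrylic' was observed.
document = {"acrylic": 1.0}

print(top_categories(weights_by_category, document))
# ["Men's Hats", "Men's Socks", "Women's Hats"]
```

A real page would contribute many words, so each category score would be a sum over all of the words it shares with that category's weight vector.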

Solution Statement

• Pollyanna is a Machine Learning System that uses new processes, inputs and statistical theories

• It provides highly accurate automated classification (87% ± 3%)

• Other classification algorithms in the E-Commerce space depend on retailers’ data-feeds, are less accurate (approximately 65%) and are supported by manual classification

• We have assembled a highly accurate, cost-effective classification system that does not require ongoing manual support

Pollyanna Demo

http://97.74.87.186/pollyanna

Architecture

[Architecture diagram. Recoverable labels: Internet Cloud, Sampling, Statistical Elimination, Statistical Validation, Knowledge, Training Module, Front End Tool, User/Client. Three components are labelled Perl.]

Pollyanna can be applied to predictive analytics in online payment fraud

If the following conditions are met:

The problem must be clearly defined in terms of:

• Input

– Type of data: Integer, String, Floating, Boolean

– File type: XML, Delimited, Database

• Process

– Human intelligence and any other methods or procedures required for arriving at a decision, prediction or forecast

• Output

– All possible decisions/outcomes

Examples:

• Bucketing a transaction into a fraud risk category

• Forecasting fraud losses on completed transactions

Historical data should be available

• Reliable data

– Data is sufficiently complete and error free

• Valid data

– Data actually represents what you think is being measured

• Sufficient data

– Data is adequate to support the outcome of the process or the decision

• Spatial data

• Time series data

Data should yield binomial probability distribution for each attribute

• Example: a key attribute of an online transaction is the location of the “IP” address relative to the location of the physical address of the credit card holder.

– Two outcomes are possible for this attribute:

• The “IP” address and the physical address are located geographically in the same country

• The “IP” address and the physical address are not located geographically in the same country

• The Machine Learning system is supplied with 200 online payment transactions received in the previous year.

• The machine learning system should be able to determine, for each possible outcome, the number of Yes and No events observed.

– Example: for the outcome “The IP address and the physical address are located geographically in the same country”: 20 Yes and 180 No
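Counting Yes and No events for a two-outcome attribute is a simple tally. The sketch below builds 200 made-up transactions that reproduce the 20 Yes / 180 No split of the example; the field names `ip_country` and `card_country`, the country codes and the helper `tally_outcome` are hypothetical.

```python
from collections import Counter


def tally_outcome(transactions, predicate):
    """Count how many transactions satisfy the outcome (Yes) and how many do not (No)."""
    counts = Counter("Yes" if predicate(t) else "No" for t in transactions)
    return counts["Yes"], counts["No"]


def same_country(transaction):
    """Outcome: IP address and card holder's physical address are in the same country."""
    return transaction["ip_country"] == transaction["card_country"]


# Hypothetical historical data: 20 matching and 180 non-matching transactions.
transactions = (
    [{"ip_country": "IN", "card_country": "IN"}] * 20
    + [{"ip_country": "US", "card_country": "IN"}] * 180
)

yes, no = tally_outcome(transactions, same_country)
print(yes, no)  # 20 180
```

Because each transaction falls into exactly one of the two outcomes, the pair of counts is all that a binomial description of the attribute needs.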

How will the support vector be calculated in the context of an online payment transaction?

Illustration with a hypothetical case

Support Vector Computation Example

• To simplify the problem, let us say that every transaction has to be bucketed into one of two classes:

– A genuine transaction

– A fraudulent transaction

• The training module’s goal is to calculate the relationship between an attribute of a transaction and each of the classes mentioned above. This relationship is the ‘Support Vector’.


• The training module is supplied with 200 sample transactions (historical data) representing the population

• Of the 200 transactions 20 are fraudulent and 180 are genuine

• A key attribute of the transaction is: The IP address and the physical address of the credit card holder are not located geographically in the same country. Of the 200 transactions 40 had the above attribute and 160 did not have the above attribute. Let us call the above attribute ‘X’.

• The training module will analyze the data and arrive at the following matrix:

Class | Attribute ‘X’ observed | Attribute ‘X’ not observed | Total
Fraudulent | 15 | 5 | 20
Not Fraudulent | 25 | 155 | 180
Total | 40 | 160 | 200

Association between Fraudulent Transaction and Attribute ‘X’



• Applying a synthesis of theories in probability and statistics the support vector is calculated as 4.040816

• The support vector is a measure of the relationship between a Fraudulent Transaction and the attribute: “The IP address and the physical address of the credit card holder are not located geographically in the same country”.
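The slides do not give the formula behind the value 4.040816, so it cannot be reproduced here. What can be shown is how the two standard measures introduced earlier, conditional probability and risk ratio, come out for the same fraud matrix. The sketch below is only that: the class `TwoByTwo` and its method names are hypothetical, and neither result is claimed to be Pollyanna's support vector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from a 2x2 association table (class versus attribute)."""
    class_with_attr: int       # e.g. fraudulent, attribute X observed
    class_without_attr: int    # e.g. fraudulent, attribute X not observed
    other_with_attr: int       # e.g. not fraudulent, attribute X observed
    other_without_attr: int    # e.g. not fraudulent, attribute X not observed

    def conditional_probability(self):
        """P(class | attribute observed)."""
        return self.class_with_attr / (self.class_with_attr + self.other_with_attr)

    def risk_ratio(self):
        """Class rate when the attribute is observed over the rate when it is not."""
        rate_without = self.class_without_attr / (
            self.class_without_attr + self.other_without_attr
        )
        return self.conditional_probability() / rate_without


# Counts from the fraud matrix for attribute 'X'.
fraud_x = TwoByTwo(class_with_attr=15, class_without_attr=5,
                   other_with_attr=25, other_without_attr=155)

print(fraud_x.conditional_probability())  # 0.375  (15 of 40)
print(fraud_x.risk_ratio())               # 12.0   (0.375 / 0.03125)
```

In this hypothetical data, a transaction with attribute ‘X’ is twelve times as likely to be fraudulent as one without it.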

How will the machine learning system forecast fraud loss?

Illustration with a hypothetical case

Forecasting Fraud Loss

• The problem:

– To forecast the value of losses on all fraudulent credit card payment transactions that have been successfully executed in a given month

• There are two steps to doing this:

– Step 1: Determine whether each transaction is fraudulent or not, based on attributes of the transaction

– Step 2: Sum the values of the fraudulent transactions to arrive at the forecast of loss for that month

Forecasting Fraud Loss – Step 1: Determining whether a transaction is fraudulent or not

• Let us hypothetically say that there are two outcomes for each transaction: it is either a Fraudulent transaction or a Genuine transaction.

• For each outcome the following linear function is applied:

$$y = f(\mathbf{w} \cdot \mathbf{x})$$

• Refer to slide 20 for a brief explanation of the function


• So the linear function is applied for the observed attributes in the transaction (X vector) weighted by the Support Vector (W vector) calculated in the training module

• For our example there are 2 outcomes for each transaction – Fraudulent or Genuine

• For every transaction, the linear function gives the values for both the outcomes and the prediction will be in favor of the outcome with the higher value

Forecasting Fraud Loss – Step 2: Summing the values of all fraudulent transactions

• Step 1 is performed on each transaction in a given period

• The values of fraudulent transactions are totaled to arrive at the forecast of losses due to fraud in a given period
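The two steps can be put together in a short sketch: score each transaction for both outcomes with a linear function, predict the outcome with the higher score, then total the amounts predicted as fraudulent. Everything here is illustrative. The attribute names, transaction amounts and most weights are invented; only the 4.040816 weight is taken from the hypothetical support vector example, and f is assumed to be the identity.

```python
def score(weights, attributes):
    """Linear score w . x where x is 1 for each observed attribute and 0 otherwise."""
    return sum(weights.get(attribute, 0.0) for attribute in attributes)


def predict(weights_by_outcome, attributes):
    """Step 1: pick the outcome whose linear score is highest."""
    return max(weights_by_outcome,
               key=lambda outcome: score(weights_by_outcome[outcome], attributes))


def forecast_fraud_loss(transactions, weights_by_outcome):
    """Step 2: sum the amounts of all transactions predicted to be fraudulent."""
    return sum(
        t["amount"]
        for t in transactions
        if predict(weights_by_outcome, t["attributes"]) == "Fraudulent"
    )


# Hypothetical weights; 4.040816 is the value from the support vector example.
weights_by_outcome = {
    "Fraudulent": {"ip_country_mismatch": 4.040816, "new_account": 1.5},
    "Genuine": {"ip_country_match": 3.0, "repeat_customer": 2.0},
}

# Hypothetical transactions for one month.
transactions = [
    {"amount": 120.0, "attributes": {"ip_country_mismatch", "new_account"}},
    {"amount": 80.0, "attributes": {"ip_country_match", "repeat_customer"}},
    {"amount": 45.5, "attributes": {"ip_country_mismatch", "repeat_customer"}},
]

print(forecast_fraud_loss(transactions, weights_by_outcome))  # 165.5
```

The first and third transactions score higher for Fraudulent than for Genuine, so their amounts (120.0 and 45.5) make up the forecast.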

All the details contained in the examples above are imaginary.

They serve only for the purpose of understanding the system and its application to the field of fraud analytics.

Benefits of adopting Pollyanna


• For a process that is currently supported by human intelligence, Pollyanna may confer a cost saving benefit ranging from 40% to 80% from reduction of human resources

• For a process that is already automated or uses machine intelligence, Pollyanna may bring efficiency or accuracy improvement ranging from 10% to 25%

“Any sufficiently advanced technology is indistinguishable from magic”

Sir Arthur C. Clarke

Thank You

Contact Details

• PG Vijay (Consultant – Machine Learning Systems)

– Mobile: +91 98418 21167

– E-Mail: [email protected]

– LinkedIn Public Profile: http://www.linkedin.com/in/machinelearning
