Review of Fraud Classification Using Principal Components Analysis of RIDITS

By Louise A. Francis

Francis Analytics and Actuarial Data Mining, Inc.

Objectives

Address question: Why use new method, PRIDIT?

Introduce other methods used in similar circumstances

Explain how PRIDIT adds to methods available

Explain limitations of PRIDIT/RIDIT

A Key Problem in Fraud Modeling

Most data mining methods need a target (dependent) variable

Y = a + b1x1 + b2x2 + … + bnxn

Fraud (Yes/No or Fraud Score) = f(predictor variables)

Need sample of data where claims have been determined to be fraudulent or legitimate

Dependent variable hard to get

In a large sample of automobile insurance claims perhaps 1/3 may have an element of abuse or fraud

Scarce resources are not expended on such large volumes of claims to determine their legitimacy

Only a small percentage are referred to SIU investigators or other investigations

There are time lags in determining the outcome of investigations

Unsupervised learning

Another approach that does not require a dependent variable

Two Key Kinds

Cluster Analysis

Principal Components/Factor Analysis

PRIDIT uses this approach

It is applied to ordered categorical variables

Cluster Analysis

Records are grouped in categories that have similar values on the variables

Examples

Marketing: People with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing

Text analysis: Use words that tend to occur together to classify documents

Note: no dependent variable used in analysis

Clustering

Common methods: k-means, hierarchical

No dependent variable – records are grouped into classes with similar values on the variables

Start with a measure of similarity or dissimilarity

Maximize dissimilarity between members of different clusters
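The grouping idea on this slide can be illustrated with a minimal k-means sketch. The function name `kmeans`, the toy records, and the starting centers are all hypothetical and chosen only for illustration; this is not the procedure used to produce the clusters discussed later in the deck.

```python
import math

def kmeans(points, centers, n_iter=20):
    """Plain k-means: assign each record to its nearest center,
    then move each center to the mean of its assigned records."""
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by Euclidean distance.
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: recompute each center as the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center alone
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return labels, centers

# Hypothetical toy records: (number of providers, legal representation 0/1).
claims = [(1, 0), (2, 0), (1, 0), (4, 1), (5, 1), (4, 1)]
labels, centers = kmeans(claims, centers=[(1, 0), (5, 1)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No dependent variable appears anywhere in the sketch: the labels come only from how close the records are to each other.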

Dissimilarity (Distance) Measure – Continuous Variables

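Euclidean Distance

$$d_{ij} = \left( \sum_{k=1}^{m} (x_{ik} - x_{jk})^2 \right)^{1/2}, \quad i, j = \text{records}, \; k = \text{variable}$$

Manhattan Distance

$$d_{ij} = \sum_{k=1}^{m} \left| x_{ik} - x_{jk} \right|$$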
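As a small worked illustration of the two distance measures named on this slide, the sketch below computes both for a pair of records. The function names and the two example records are hypothetical.

```python
def euclidean(x, y):
    """Square root of the summed squared differences across variables."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    """Sum of absolute differences across variables."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Two hypothetical records measured on three continuous variables.
record_i = (3.0, 1.0, 0.0)
record_j = (0.0, 5.0, 0.0)

print(euclidean(record_i, record_j))  # 5.0
print(manhattan(record_i, record_j))  # 7.0
```

In practice the variables are usually put on a common scale first, since otherwise the variable with the largest units dominates either distance.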

Binary Variables

                        Row Variable
                        1       0
Column Variable   1     a       b       a+b
                  0     c       d       c+d
                        a+c     b+d

Binary Variables

Simple Matching

$$d = \frac{b + c}{a + b + c + d}$$

Rogers and Tanimoto

$$d = \frac{2(b + c)}{(a + d) + 2(b + c)}$$
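The two binary dissimilarity measures can be sketched as follows, using the cell counts a, b, c, d of the two-by-two table (a = both 1, b and c = the two kinds of mismatch, d = both 0). The function names and the example records are hypothetical.

```python
def binary_counts(x, y):
    """Return the 2x2 cell counts (a, b, c, d) for two 0/1 records."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching(x, y):
    """Share of variables on which the two records disagree."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)

def rogers_tanimoto(x, y):
    """Like simple matching, but mismatches carry double weight."""
    a, b, c, d = binary_counts(x, y)
    return 2 * (b + c) / ((a + d) + 2 * (b + c))

# Two hypothetical claims described by six yes/no indicators.
claim_i = (1, 1, 0, 0, 1, 0)
claim_j = (1, 0, 0, 1, 1, 0)

print(binary_counts(claim_i, claim_j))    # (2, 1, 1, 2)
print(rogers_tanimoto(claim_i, claim_j))  # 0.5
```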

Example: Fraud Data

Data from 1993 closed claim study conducted by Automobile Insurers Bureau of Massachusetts

Claim files often have variables which may be useful in assessing suspicion of fraud, but a dependent variable is often not available

Variables used for clustering:

Legal representation

Prior Claim

SIU Investigation

At fault

Police report

Number of providers

Statistics for Clusters

Based on descriptive statistics, Cluster 2 appears to have higher likelihood of fraudulent claims (more about this later)

Percentage Yes (except Number Providers)

Cluster   Police Report   Medical Audit   At Fault   Legal Rep   SIU Investigation   Number Providers
1         46.7%           0.1%            42.2%      6.1%        0.0%                2
2         49.8%           5.9%            2.4%       96.0%       6.5%                4

Principal Components Analysis

A form of dimension (variable) reduction

Suppose we want to combine all the information related to the “financial” dimension of fraud

Medical provider bill (indicative of padding claim)

Hospital bill

Number of providers

Economic Losses

Claimed wages

Incurred Losses

Principal Components

These variables are correlated but not perfectly correlated

We replace many variables with a weighted sum of the variables

Correlation Matrix for Variables

Correlations

                   Number      Medical   Provider   Economic              Hospital
                   Providers   Bill      Paid       Losses     Incurred   Pymt
Number Providers   1.000       0.387     0.571      0.382      0.382      0.168
Medical Bill       0.387       1.000     0.539      0.952      0.952      0.922
Provider Paid      0.571       0.539     1.000      0.531      0.531      0.327
Economic Losses    0.382       0.952     0.531      1.000      1.000      0.888
Incurred           0.382       0.952     0.531      1.000      1.000      0.888
Hospital Pymt      0.168       0.922     0.327      0.888      0.888      1.000

Finding Factor or Component

The correlation matrix is used to find the factor that explains the most variance (captures most of the correlation) for the set of variables

That component or factor extracted will be a weighted average of the variables

More than one Component or Factor may result from applying the method

Evaluating Importance of Variables

Use factor loadings

Component Matrix

Variable           Loading
Number Providers   0.497
Medical Bill       0.974
Provider Paid      0.646
Economic Losses    0.976
Incurred           0.976
Hospital Pymt      0.886
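To show how loadings of this kind can be obtained from a correlation matrix, here is a minimal sketch that extracts the first principal component by power iteration and scales it by the square root of its eigenvalue. Power iteration is used only to keep the example dependency-free; it is an assumption of this sketch, not necessarily how the slide's output was produced, and `first_component` is a hypothetical name. The input is the rounded correlation matrix shown earlier in the deck.

```python
names = ["Number Providers", "Medical Bill", "Provider Paid",
         "Economic Losses", "Incurred", "Hospital Pymt"]

# Correlation matrix as printed on the slide (rounded to three decimals).
R = [
    [1.000, 0.387, 0.571, 0.382, 0.382, 0.168],
    [0.387, 1.000, 0.539, 0.952, 0.952, 0.922],
    [0.571, 0.539, 1.000, 0.531, 0.531, 0.327],
    [0.382, 0.952, 0.531, 1.000, 1.000, 0.888],
    [0.382, 0.952, 0.531, 1.000, 1.000, 0.888],
    [0.168, 0.922, 0.327, 0.888, 0.888, 1.000],
]

def first_component(matrix, n_iter=500):
    """Largest eigenvalue and unit eigenvector via power iteration."""
    n = len(matrix)
    v = [1.0 / n ** 0.5] * n
    for _ in range(n_iter):
        w = [sum(row[k] * v[k] for k in range(n)) for row in matrix]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the converged vector.
    w = [sum(row[k] * v[k] for k in range(n)) for row in matrix]
    eigenvalue = sum(a * b for a, b in zip(v, w))
    return eigenvalue, v

eigenvalue, vector = first_component(R)

# Loading = eigenvector entry scaled by the square root of the eigenvalue.
loadings = [x * eigenvalue ** 0.5 for x in vector]
for name, loading in zip(names, loadings):
    print(f"{name:<17} {loading:.3f}")
```

Because the input matrix is rounded, the printed loadings should agree with the Component Matrix only approximately.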

Problem: Categorical Variables

It is not clear how to best perform Principal Components/Factor Analysis on categorical variables

The categories may be coded as a series of binary dummy variables

If the categories are ordered categories, you may lose important information

This is the problem that PRIDIT addresses
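A short sketch of why dummy coding discards order: once an ordered variable is split into binary indicators, every pair of distinct categories is equally far apart. The category labels and function names below are hypothetical.

```python
# Hypothetical ordered categories, from least to most severe.
levels = ["minor", "moderate", "severe"]

def dummy_code(value):
    """One binary indicator per category."""
    return [1 if value == level else 0 for level in levels]

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

codes = {level: dummy_code(level) for level in levels}

# Adjacent and non-adjacent categories end up the same distance apart,
# so the ordering minor < moderate < severe is no longer visible.
print(squared_distance(codes["minor"], codes["moderate"]))  # 2
print(squared_distance(codes["minor"], codes["severe"]))    # 2
```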

RIDIT

Variables are ordered so that lowest value is associated with highest probability of fraud

Use Cumulative distribution of claims at each value, i, to create RIDIT statistic for claim t, value i

$$R_{ti} = \sum_{j < i} \hat{p}_{tj} - \sum_{j > i} \hat{p}_{tj}$$

Example: RIDIT for Legal Representation

Legal Representation

Value   Code   Number   Proportion   Proportion Below   Proportion Above   RIDIT
Yes     1      706      0.504        0.000              0.496              -0.496
No      2      694      0.496        0.504              0.000              0.504
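The Legal Representation example can be reproduced with a few lines of code. The sketch below takes the claim counts per ordered category and returns the proportion below minus the proportion above for each one; the function name `ridit_scores` is hypothetical.

```python
def ridit_scores(counts):
    """RIDIT for each ordered category: proportion of claims in lower
    categories minus proportion of claims in higher categories.

    counts lists the number of claims per category, lowest code first.
    """
    total = sum(counts)
    proportions = [c / total for c in counts]
    scores = []
    for i in range(len(proportions)):
        below = sum(proportions[:i])
        above = sum(proportions[i + 1:])
        scores.append(below - above)
    return scores

# Counts from the slide: Yes (code 1) = 706 claims, No (code 2) = 694 claims.
legal_rep = ridit_scores([706, 694])
print([round(r, 3) for r in legal_rep])  # [-0.496, 0.504]
```

The same function works unchanged for variables with more than two ordered categories.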

PRIDIT

Use RIDIT statistics in Principal Components Analysis

Component Matrix

Variable        Component 1
SIU             .248
Police Report   .220
At Fault        .709
Legal Rep       .752
Medical Audit   .341
Prior Claim     .406

Extraction Method: Principal Component Analysis. 1 component extracted.
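One simplified way to read the whole PRIDIT procedure is sketched below: convert each coded variable to RIDITs, take the first principal component of the resulting matrix, and use its weights to score each claim. Everything here is an assumption made for illustration: the toy claims are invented, the component is taken from the cross-product matrix F'F by power iteration rather than from a statistics package, and the names `ridit_by_code` and `top_eigenvector` are hypothetical.

```python
def ridit_by_code(values):
    """Map each ordered code to proportion below minus proportion above."""
    n = len(values)
    codes = sorted(set(values))
    prop = {c: values.count(c) / n for c in codes}
    return {
        c: sum(prop[j] for j in codes if j < c)
           - sum(prop[j] for j in codes if j > c)
        for c in codes
    }

def top_eigenvector(matrix, n_iter=500):
    """Unit eigenvector for the largest eigenvalue, via power iteration."""
    n = len(matrix)
    v = [1.0 / n ** 0.5] * n
    for _ in range(n_iter):
        w = [sum(row[k] * v[k] for k in range(n)) for row in matrix]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Sign convention assumed here: weights sum to a positive number.
    if sum(v) < 0:
        v = [-x for x in v]
    return v

# Hypothetical claims; columns are legal rep, SIU, prior claim (1 = Yes, 2 = No).
claims = [
    (1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2),
    (2, 2, 2), (2, 2, 2), (2, 2, 1), (2, 2, 2),
]

# Step 1: RIDIT-transform every variable.
maps = [ridit_by_code(list(col)) for col in zip(*claims)]
F = [[maps[k][claim[k]] for k in range(len(maps))] for claim in claims]

# Step 2: first principal component of the RIDIT matrix.
m = len(maps)
FtF = [[sum(row[a] * row[b] for row in F) for b in range(m)] for a in range(m)]
weights = top_eigenvector(FtF)

# Step 3: score each claim and sort; lower scores mean more "Yes" indicators.
scores = [sum(w * f for w, f in zip(weights, row)) for row in F]
ranked = sorted(range(len(claims)), key=lambda i: scores[i])
print(ranked[0])  # 0
```

With the lowest code tied to the highest suspicion, the claim with every indicator set to Yes receives the lowest score in this toy data, so an ascending sort puts it first.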

Scoring

Assign a score to each claim

The score can be used to sort claims

More effort expended on claims more likely to be fraudulent or abusive

In the case of AIB data, we can use additional information to test how well PRIDIT did, using the PRIDIT score

A suspicion score was assigned to each claim by an expert

PRIDIT vs. Suspicion Score

[Figure: Suspicion Score vs PRIDIT Score. Horizontal axis: Suspicion Score. Vertical axis: PRIDIT Score, from (1.50) to 1.00.]

Clustering and Suspicion Score

Report: Mean Suspicion Level by TwoStep Cluster Number

TwoStep Cluster Number   Mean Suspicion Level
1                        .6445
2                        3.3737
Total                    1.9643

Result

There appears to be a strong relationship between PRIDIT score and suspicion that claim is fraudulent or abusive

The clusters resulting from the cluster procedure also appeared to be effective in separating legitimate from fraudulent or abusive claims

Comparison: PRIDIT and Clustering

PRIDIT gives a score, which may be very useful for claims sorting. Clustering assigns claims to classes. They are either in or out of the assigned class.

Clustering ignores information about the order of values for categorical variables

Clustering can accommodate both categorical and continuous variables

Comparison

Unordered categorical variables with many values (e.g., injury type):

Clustering has a procedure for measuring dissimilarity for these variables and can use them in clustering

If the values for the variables contain no meaningful order, PRIDIT will not help in creating variables to use in Principal Components Analysis.
