Review of Fraud Classification Using Principal Components Analysis of RIDITS
By Louise A. Francis
Francis Analytics and Actuarial Data Mining, Inc.
Objectives
Address question: Why use new method, PRIDIT?
Introduce other methods used in similar circumstances
Explain how PRIDIT adds to methods available
Explain limitations of PRIDIT/RIDIT
A Key Problem in Fraud Modeling
Most data mining methods need a target (dependent) variable
$$Y = a + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
Fraud (Yes/No or Fraud Score) = f(predictor variables)
Need sample of data where claims have been determined to be fraudulent or legitimate
Dependent variable hard to get
In a large sample of automobile insurance claims perhaps 1/3 may have an element of abuse or fraud
Scarce resources are not expended on such large volumes of claims to determine their legitimacy
Only a small percentage are referred to SIU investigators or other investigations
There are time lags in determining the outcome of investigations
Unsupervised learning
Another approach that does not require a dependent variable
Two Key Kinds
Cluster Analysis
Principal Components/Factor Analysis
PRIDIT uses this approach
It is applied to ordered categorical variables
Cluster Analysis
Records are grouped in categories that have similar values on the variables
Examples
Marketing: People with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing
Text analysis: Use words that tend to occur together to classify documents
Note: no dependent variable used in analysis
Clustering
Common methods: k-means, hierarchical
No dependent variable – records are grouped into classes with similar values on the variables
Start with a measure of similarity or dissimilarity
Maximize dissimilarity between members of different clusters
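As a rough illustration of the clustering idea, here is a minimal k-means sketch in plain Python. It assumes squared Euclidean distance as the dissimilarity and takes the starting centroids as an argument; the function names and the toy points are hypothetical and are not taken from the AIB data. k-means works by pulling each record toward its nearest cluster center, which is one common way of ending up with clusters that are dissimilar from one another.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two records."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the given record."""
    distances = [squared_distance(point, c) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, centroids, iterations=10):
    """Assign records to clusters, then recompute each centroid as the mean."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for idx, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[idx] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids


# Toy records (illustrative values only)
toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
toy_labels, toy_centroids = kmeans(toy_points, [toy_points[0], toy_points[3]])
print(toy_labels)
```

Note that no dependent variable appears anywhere in the sketch: the grouping depends only on the records' own values.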
Dissimilarity (Distance) Measure – Continuous Variables
Euclidean Distance
$$d_{ij} = \left( \sum_{k=1}^{m} (x_{ik} - x_{jk})^2 \right)^{1/2}, \quad i, j = \text{records}, \; k = \text{variable}$$
Manhattan Distance
$$d_{ij} = \sum_{k=1}^{m} |x_{ik} - x_{jk}|$$
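The two distance measures for continuous variables can be sketched in a few lines of Python. The function names and the two example records are assumptions made for illustration only.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared differences over the m variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute differences over the m variables."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical records i and j measured on three continuous variables
record_i = [3.0, 0.0, 1.0]
record_j = [0.0, 4.0, 1.0]
print(euclidean(record_i, record_j))  # 5.0
print(manhattan(record_i, record_j))  # 7.0
```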
Binary Variables

                       Row Variable
                       1      0
Column Variable   1    a      b      a+b
                  0    c      d      c+d
                       a+c    b+d
Binary Variables
Simple Matching
$$d = \frac{b + c}{a + b + c + d}$$
Rogers and Tanimoto
$$d = \frac{2(b + c)}{(a + d) + 2(b + c)}$$
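A small Python sketch of the two binary dissimilarity measures follows. It assumes a counts the variables on which both records equal 1, d those on which both equal 0, and b and c the two kinds of mismatch; the function names and example records are hypothetical.

```python
def binary_counts(x, y):
    """Cell counts a, b, c, d of the 2x2 table for two 0/1 records."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Share of variables on which the two records disagree."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)


def rogers_tanimoto(x, y):
    """Like simple matching, but mismatches are given double weight."""
    a, b, c, d = binary_counts(x, y)
    return 2 * (b + c) / ((a + d) + 2 * (b + c))


# Hypothetical claims coded on five yes/no indicators
claim_x = [1, 1, 0, 0, 1]
claim_y = [1, 0, 0, 1, 1]
print(simple_matching(claim_x, claim_y))  # 0.4
print(rogers_tanimoto(claim_x, claim_y))
```

Both measures are symmetric in b and c, so it does not matter which record is treated as the row and which as the column.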
Example: Fraud Data
Data from 1993 closed claim study conducted by Automobile Insurers Bureau of Massachusetts
Claim files often have variables which may be useful in assessing suspicion of fraud, but a dependent variable is often not available
Variables used for clustering: Legal representation, Prior Claim, SIU Investigation, At fault, Police report, Number of providers
Statistics for Clusters
Based on descriptive statistics, Cluster 2 appears to have higher likelihood of fraudulent claims – more about this later

           Police    Medical   At       Legal    SIU             Number
Cluster    Report    Audit     Fault    Rep      Investigation   Providers
           (first five columns: Percentage Yes)
1          46.7%     0.1%      42.2%    6.1%     0.0%            2
2          49.8%     5.9%      2.4%     96.0%    6.5%            4
Principal Components Analysis
A form of dimension (variable) reduction
Suppose we want to combine all the information related to the “financial” dimension of fraud
Medical provider bill (indicative of padding claim)
Hospital bill
Number of providers
Economic Losses
Claimed wages
Incurred Losses
Principal Components
These variables are correlated but not perfectly correlated
We replace many variables with a weighted sum of the variables
Correlation Matrix for Variables

Correlations
                    Number      Medical   Provider   Economic              Hospital
                    Providers   Bill      Paid       Losses     Incurred   Pymt
Number Providers    1.000       0.387     0.571      0.382      0.382      0.168
Medical Bill        0.387       1.000     0.539      0.952      0.952      0.922
Provider Paid       0.571       0.539     1.000      0.531      0.531      0.327
Economic Losses     0.382       0.952     0.531      1.000      1.000      0.888
Incurred            0.382       0.952     0.531      1.000      1.000      0.888
Hospital Pymt       0.168       0.922     0.327      0.888      0.888      1.000
Finding Factor or Component
The correlation matrix is used to find the factor that explains the most variance (captures most of the correlation) for the set of variables
That component or factor extracted will be a weighted average of the variables
More than one Component or Factor may result from applying the method
Evaluating Importance of Variables
Use factor loadings
Component Matrix
Variable            Loading
Number Providers    0.497
Medical Bill        0.974
Provider Paid       0.646
Economic Losses     0.976
Incurred            0.976
Hospital Pymt       0.886
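To make the link between the correlation matrix and the loadings concrete, the sketch below extracts the first principal component from the correlation matrix shown earlier using power iteration. Power iteration is a stand-in chosen here to keep the example dependency-free; the slides do not say which routine the statistical package used. Loadings are taken as the unit eigenvector scaled by the square root of its eigenvalue, and the results should come out close to the loadings listed above.

```python
import math

variables = ["Number Providers", "Medical Bill", "Provider Paid",
             "Economic Losses", "Incurred", "Hospital Pymt"]

# Correlation matrix as printed in the slides (rounded to three decimals)
corr = [
    [1.000, 0.387, 0.571, 0.382, 0.382, 0.168],
    [0.387, 1.000, 0.539, 0.952, 0.952, 0.922],
    [0.571, 0.539, 1.000, 0.531, 0.531, 0.327],
    [0.382, 0.952, 0.531, 1.000, 1.000, 0.888],
    [0.382, 0.952, 0.531, 1.000, 1.000, 0.888],
    [0.168, 0.922, 0.327, 0.888, 0.888, 1.000],
]


def first_component(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    n = len(matrix)
    v = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        eigenvalue = norm  # converges to the largest eigenvalue
    return eigenvalue, v


eigenvalue, vector = first_component(corr)
loadings = [math.sqrt(eigenvalue) * x for x in vector]
for name, loading in zip(variables, loadings):
    print(f"{name:18s}{loading:6.3f}")
print("Share of variance explained:", round(eigenvalue / len(corr), 3))
```

Because the six variables are standardized, the eigenvalue divided by six gives the share of total variance captured by the first component.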
Problem: Categorical Variables
It is not clear how to best perform Principal Components/Factor Analysis on categorical variables
The categories may be coded as a series of binary dummy variables
If the categories are ordered categories, you may lose important information
This is the problem that PRIDIT addresses
RIDIT
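Variables are ordered so that lowest value is associated with highest probability of fraud
Use the cumulative distribution of claims at each value, i, to create the RIDIT statistic for variable t, value i
$$R_{ti} = \sum_{j<i} \hat{p}_{tj} - \sum_{j>i} \hat{p}_{tj}$$
where $\hat{p}_{tj}$ is the observed proportion of claims with value j on variable t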
Example: RIDIT for Legal Representation
Legal Representation
                                         Proportion   Proportion
Value   Code   Number   Proportion   Below        Above        RIDIT
Yes     1      706      0.504        0.000        0.496        -0.496
No      2      694      0.496        0.504        0.000        0.504
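The Legal Representation example can be reproduced with a short Python sketch. It assumes the RIDIT for a category is the proportion of claims in lower-coded categories minus the proportion in higher-coded categories, which matches the Below and Above columns of the table; the function name is hypothetical.

```python
def ridit_scores(counts):
    """RIDIT for each ordered category, given claim counts listed in code order."""
    total = sum(counts)
    proportions = [c / total for c in counts]
    scores = []
    for i in range(len(proportions)):
        below = sum(proportions[:i])       # categories coded lower than i
        above = sum(proportions[i + 1:])   # categories coded higher than i
        scores.append(below - above)
    return scores


# Legal Representation counts from the slide: Yes (code 1) = 706, No (code 2) = 694
legal_rep = ridit_scores([706, 694])
print([round(s, 3) for s in legal_rep])  # [-0.496, 0.504]
```

The same function handles variables with more than two ordered categories, since it only needs the counts in code order.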
PRIDIT
Use RIDIT statistics in Principal Components Analysis
Component Matrix (a)
Variable         Component 1
SIU              .248
Police Report    .220
At Fault         .709
Legal Rep        .752
Medical Audit    .341
Prior Claim      .406
Extraction Method: Principal Component Analysis.
a. 1 components extracted.
Scoring
Assign a score to each claim
The score can be used to sort claims
More effort expended on claims more likely to be fraudulent or abusive
In the case of AIB data, we can use additional information to test how well PRIDIT did, using the PRIDIT score
A suspicion score was assigned to each claim by an expert
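One simplified way to picture the scoring step is as a weighted sum of each claim's RIDIT values, with the weights coming from the first principal component. The sketch below makes several assumptions: the weights are hypothetical, the At Fault RIDIT values are made up for illustration, and only the Legal Representation RIDITs come from the example slide. It should not be read as the exact PRIDIT scoring formula.

```python
# Hypothetical component weights for two of the variables
weights = {"legal_rep": 0.75, "at_fault": 0.71}

# RIDIT value for each category of each variable.
# legal_rep values are from the Legal Representation example;
# at_fault values are invented for illustration only.
ridits = {
    "legal_rep": {"yes": -0.496, "no": 0.504},
    "at_fault": {"no": -0.4, "yes": 0.6},
}

# Hypothetical claims and their observed categories
claims = {
    "A": {"legal_rep": "yes", "at_fault": "no"},
    "B": {"legal_rep": "no", "at_fault": "yes"},
    "C": {"legal_rep": "yes", "at_fault": "yes"},
}


def score(claim):
    """Weighted sum of the claim's RIDIT values."""
    return sum(weights[var] * ridits[var][value] for var, value in claim.items())


scores = {claim_id: score(claim) for claim_id, claim in claims.items()}

# With low codes tied to higher fraud probability and positive weights,
# the most negative scores are the most suspicious in this sketch.
ranked = sorted(scores, key=scores.get)
print(ranked)
```

Sorting on the score gives a review queue, so that investigative effort can go first to the claims at the suspicious end of the ranking.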
PRIDIT vs. Suspicion Score
[Figure: Suspicion Score vs PRIDIT Score. Horizontal axis: Suspicion Score. Vertical axis: PRIDIT Score, with ticks from (1.50) to 1.00.]
Clustering and Suspicion Score
Report: Mean SuspicionLevel by TwoStep Cluster Number
TwoStep Cluster Number    Mean SuspicionLevel
1                         .6445
2                         3.3737
Total                     1.9643
Result
There appears to be a strong relationship between PRIDIT score and suspicion that claim is fraudulent or abusive
The clusters resulting from the cluster procedure also appeared to be effective in separating legitimate from fraudulent or abusive claims
Comparison: PRIDIT and Clustering
PRIDIT gives a score, which may be very useful for claims sorting. Clustering assigns claims to classes. They are either in or out of the assigned class.
Clustering ignores information about the order of values for categorical variables
Clustering can accommodate both categorical and continuous variables
Comparison
Unordered categorical variables with many values (e.g., injury type):
Clustering has a procedure for measuring dissimilarity for these variables and can use them in clustering
If the values for the variables contain no meaningful order, PRIDIT will not help in creating variables to use in Principal Components Analysis.