Fuzzy Matching in Fraud Analytics
Grant Brodie, President, Arbutus Software
Outline
What Is Fuzzy?
Causes
Effective Implementation
Application to Specific Products
Demonstration
Q&A
Why Is Fuzzy Important?
Big data
Too many transactions
User-entered data (web sites)
E-Commerce
Less manual oversight
What Is Fuzzy?
Subset of duplicates testing
Find specific keywords in text (FCPA, PCard)
Close, but not the same
Two reasonable definitions
Proximity
Looks similar
Proximity
Sorts close together
Characters
“Albert” vs. “Albertson”
Numbers
123,456.78 vs. 123,792.16
Dates
Jan 19, 2014 vs. Jan 20, 2014
Looks Similar
Characters
Microsoft vs. Wicrosoft
Numbers
127,894.63 vs. 12,894.63
Dates
Jan 13, 2014 vs. Jan 31, 2014
Traditional Approach to “Close”
Pronunciation based
Soundex
NYSIIS
Designed for names
Many false positives
Not useful for numbers or dates
Fuzzy Today
Based on physical string matching
Levenshtein (ACL)
Damerau-Levenshtein (Arbutus)
N-Gram
Jaro-Winkler
And many more…
Differences expressed as a “distance” or percentage
Quick Lesson: Damerau-Levenshtein
Min. # changes to make one string into another
Insert, delete, replace, transpose
‘123 Main Street’ vs. ‘123 Main St’ = 4
34567 vs. 34576 = 1 (Levenshtein: 2)
‘Rob’ vs. ‘Robert’ = 3
‘Gary’ vs. ‘Mary’ = 1
‘Gary’ vs. ‘gary’ = 1
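The restricted (optimal string alignment) form of Damerau-Levenshtein fits in a few lines of Python. This is an illustrative sketch, not the ACL or Arbutus implementation, but it reproduces the distances above:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal-string-alignment distance: the minimum number of
    insertions, deletions, substitutions, and adjacent transpositions
    needed to turn a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # adjacent transposition counts as a single edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

print(damerau_levenshtein("123 Main Street", "123 Main St"))  # 4
print(damerau_levenshtein("34567", "34576"))                  # 1
print(damerau_levenshtein("Gary", "gary"))                    # 1 (case counts)
```

Deleting the transposition check turns this into plain Levenshtein, which is why ‘34567’ vs. ‘34576’ scores 2 there instead of 1.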
Problems with String Matching
Very literal
Doesn’t apply any context
“John Smith” vs. “John  Smith” (extra blank: 1)
“Smith John” vs. “Smith, John” (1)
“John Smith” vs. “john smith” (2)
México vs. Mexico (1)
“John Smith” vs. “john smith” scores the same as “John Smith” vs. “John Hmitz” (both 2)
What Do You Use?
Whatever your tool offers
Almost impossible to implement manually
VERY compute intensive
Causes
Accidental errors
Carelessness/mistyping
Transpositions
Blurry source
Punctuation
Extra blanks
1 vs. I, 0 vs. O (particularly with OCR)
Errors vs. Fraud
All of the causes were likely “errors”
Fraud uses intentional errors to mask activity
Obscure duplicates
Obscure relationships
Trick through similarity
Disparate systems make comparison even harder
Practical Issues
Generally hard to “target” fuzzy tests
Forced to use broad tests
Most findings will be errors
Even so, the finding is still valuable
Need a process to address errors found
“Our System Catches Duplicates”
Exact matches only
Strict application (i.e. company, vendor, invoice)
May only warn
Not all duplicates are payments
Most only test document numbers
Types of Duplicates
Names
Personal
Corporate
Addresses
Document numbers (e.g., invoice)
Contact information
Phone numbers
Emails
Issues
Very compute intensive (wait times)
Quadratic relationship: every record is compared with every other
1000x data = 1,000,000x more work
False positives
Ease of use
False Positives
Easily the most challenging aspect
Any time spent on a false positive is wasted
Can easily outnumber the true positives by 10, 100, 1000 to 1
If too many, can remove any cost effectiveness
How does this happen?
Only one way to get an exact match
Virtually unlimited ways to get close
False Positive Examples
Matching to “12345” with a single difference:
Missing digit (1245): 5 ways; transposition (12435): 4 ways
Incorrect digit (12745): at least 45 ways (175 if alphabetic, 1,000+ if any character)
Extra digit (123345): at least 60 ways (200+ if alphabetic, 1,000+ if any character)
Hundreds/thousands of ways that differ by just 1
Not just errors, all close values
Exponentially more with a distance of 2
A bad actor relies on being the needle in this haystack
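A brute-force enumeration (a Python sketch, not part of either tool) confirms how fast the neighborhood grows: even a 5-digit number has over a hundred distinct digit strings within distance 1 of it.

```python
def digit_neighbors(s: str, alphabet: str = "0123456789") -> set:
    """All distinct strings reachable from s by exactly one edit:
    a deletion, substitution, insertion, or adjacent transposition."""
    out = set()
    n = len(s)
    for i in range(n):
        out.add(s[:i] + s[i + 1:])                      # deletion
        for c in alphabet:
            if c != s[i]:
                out.add(s[:i] + c + s[i + 1:])          # substitution
    for i in range(n + 1):
        for c in alphabet:
            out.add(s[:i] + c + s[i:])                  # insertion
    for i in range(n - 1):
        if s[i] != s[i + 1]:
            out.add(s[:i] + s[i + 1] + s[i] + s[i + 2:])  # transposition
    return out

print(len(digit_neighbors("12345")))  # 109 distinct strings at distance 1
```

Allow letters as well as digits and the count jumps past a thousand, which is exactly the false-positive pressure described above.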
How to Address the Issues
Data preparation
Utilize “context”
Use “tight” specifications
Choose software that meets needs
Rank your results
Choose Your Software
Has the capabilities you need
Can process your data volumes
Easy to implement
Easy to automate
ACL, Arbutus, IDEA, fraud-specific, non-audit tools
Data Preparation
Remove immaterial differences first (i.e., normalization)
Text manipulation
Upper case
Punctuation
Extra blanks
Foreign characters (México vs. Mexico, Québec vs. Quebec)
Data Preparation (Cont.)
(Remove immaterial differences first, normalization)
Eliminate “noise” words
Different by type of data
Address: Suite, Unit
Corporate name: Company, Co, Inc
Personal name: Mr, Ms, Dr, Prof
Data Preparation (Cont.)
(Remove immaterial differences first, normalization)
Common misspellings/typos
Common vocabulary (chair vs. silla)
Different by data type
Avenue: Av, Ave, Aven, Avenu
First vs. 1st…
West vs. W…
Richard, Rick, Dick, Ricky, Rich
Data Preparation (Cont.)
(Remove immaterial differences first, normalization)
Word order
“123 W Main St.” vs. “123 Main St. W”
Data Preparation: Result
Well implemented data prep. minimizes the need for fuzzy
Consider the two addresses:
“#200-1234 Main Street West”
“1234 W MAIN ST, Suite 200”
Levenshtein distance is 20
Applying data prep can make both strings identical
W ST MAIN 200 1234
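The effect of this kind of data prep can be sketched in Python. The noise-word and vocabulary lists here are tiny hypothetical stand-ins for a real substitution file, but they are enough to make both example addresses collapse to the same string:

```python
import re

# Hypothetical noise words and abbreviations, for illustration only;
# a real vocabulary file (e.g. the USPS street-suffix list) is far larger.
NOISE = {"SUITE", "UNIT"}
VOCAB = {"STREET": "ST", "WEST": "W", "AVENUE": "AVE", "FIRST": "1ST"}

def sort_normalize(addr: str) -> str:
    """Upper-case, strip punctuation, drop noise words, standardize
    vocabulary, then sort the words so word order no longer matters."""
    words = re.sub(r"[^0-9A-Z ]", " ", addr.upper()).split()
    words = [VOCAB.get(w, w) for w in words if w not in NOISE]
    # any consistent ordering works; descending happens to match the slide
    return " ".join(sorted(words, reverse=True))

a = sort_normalize("#200-1234 Main Street West")
b = sort_normalize("1234 W MAIN ST, Suite 200")
print(a)        # W ST MAIN 200 1234
print(a == b)   # True
```

Two strings that started with a Levenshtein distance of 20 now match exactly, with no fuzzy processing at all.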
Text Manipulation: ACL
Create a computed field
Upper case: Upper(field) (FUZZYDUP ignores case, but data prep is simpler)
Punctuation: Include(field, " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"), but…
Extra blanks (replace two blanks with one, repeatedly): Replace(Replace(field, "  ", " "), "  ", " ")…
Foreign characters: Replace(Replace(field, "É", "E"), "Á", "A")…
Combined: Replace(Replace(Replace(Replace(Include(Upper(field), " 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"), "  ", " "), "  ", " "), "  ", " "), "É", "E")…
In practice, many more replace calls
May break up into multiple fields for clarity
Text Manipulation: Arbutus
Create a computed field
Upper case: Upper(field)
Punctuation: Include(field, " 0~9A~Z"), but…
Extra blanks: Compact(field)
Foreign characters: Replace(field, "É", "E", "Á", "A",…)
Combined: Replace(Compact(Include(Upper(field), " 0~9A~Z")), "É", "E"…)
May break up into multiple fields for clarity
Only for unusual situations (use Normalize function)
Eliminate “Noise” Words: ACL
Use “whole words”
Omit(field+" ", "INCORPORATED ,INC ,LIMITED ,LTD…", F), but…
Omit(field, "INC"): CINCH INDUSTRIES becomes CH INDUSTRIES
Problem is, many noise words to eliminate—two solutions:
Long list
Alltrim(Omit(field+" ", "INCORPORATED ,INC ,LIMITED ,LTD ,CORPORATION ,CORP ,…"))
Sequential omits of a variable in a group
v_field=Omit(field…
v_field=Omit(v_field…
…
Common Vocabulary: ACL
Similar to noise words, only Replace instead of Omit
Use “whole words”
Replace(field+" ", "ROAD ", "RD ")
Otherwise, “BROADWAY” becomes “BRDWAY”
Don’t omit, as Peachtree Lane is not the same as Peachtree Court
Problem is, MANY vocabulary words to potentially normalize
USPS 400 street terms, 500+ male names, 700+ female names
Nested functions (with Replace instead of Omit)
Sequential replaces of a variable in a group
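In a general-purpose language, the same whole-word behavior falls out of regex word boundaries. A Python sketch (not ACL syntax, which instead relies on the trailing-blank trick above):

```python
import re

def replace_whole_word(text: str, word: str, repl: str) -> str:
    """Replace word only where it appears as a complete word, so
    ROAD -> RD does not turn BROADWAY into BRDWAY."""
    return re.sub(r"\b%s\b" % re.escape(word), repl, text)

print(replace_whole_word("BROADWAY ROAD", "ROAD", "RD"))   # BROADWAY RD
print(replace_whole_word("PEACHTREE LANE", "LANE", "LN"))  # PEACHTREE LN
```

The `\b` boundaries refuse to match the “ROAD” embedded inside “BROADWAY”, which is exactly the failure mode the whole-word rule guards against.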
Word Order: ACL
No practical way to address this
Noise Words and Common Vocabulary: Arbutus
If you choose, the ACL syntax all works
Instead: Use Normalize() or SortNormalize()
Automatically implements ALL of the data prep described
(Upper case, punctuation, blanks, foreign, noise, vocabulary)
Normalize(address, “addr.txt”)
Normalize("Suite 200-1234 Main Street West", "addr.txt") = "200 1234 MAIN ST W"
SortNormalize has the same syntax, but = “W ST MAIN 200 1234”
Normalize can use a separate vocabulary file (addr.txt)
Replaces or omits any word, on a “whole word” basis
User configurable and selectable, by data type
Noise Words and Common Vocabulary: Arbutus
Substitution file (addr.txt, for example)
FIRST 1ST
SEVENTH 7TH
AV AVE
AVENU AVE
AVENUE AVE
AVN AVE
PARKWAY PKWY
PARKWY PKWY
PKWAY PKWY
PKY PKWY
SUITE
UNIT
False Positive Reduction: Utilize Context
Data elements always have a “context”
Names or address: location (e.g., city, state, ZIP, country, etc.)
Documents: vendor, employee, etc.
Reference the similarities to minimize the ambiguity
Same state, city, similar address
“123 Main St.”, Springfield, IL/MA
Same vendor, date, amount, similar invoice number
Utilize Context: Application
ACL FUZZYDUP: Only supports one key field
Concatenate fields into a single expression/computed field
State+City+Address
Other data types require conversion: vendor+date(dt)+str(amount, 16)+invno
Arbutus DUPLICATES: Supports multiple key fields
Specify each key separately
Last key can be fuzzy
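The idea of grouping on the exact context keys and only fuzzy-matching within each group can be sketched in Python. The invoice records and the crude one-edit check below are hypothetical illustrations, not either tool's algorithm:

```python
from collections import defaultdict

# Hypothetical invoice records: (vendor, date, amount, invoice_no)
invoices = [
    ("ACME", "2014-01-19", 1500.00, "INV-1001"),
    ("ACME", "2014-01-19", 1500.00, "INV-1010"),  # transposed digits
    ("ACME", "2014-02-03", 1500.00, "INV-1001"),  # different date: ignored
    ("BOLT", "2014-01-19", 1500.00, "INV-1001"),  # different vendor: ignored
]

def within_one_edit(a: str, b: str) -> bool:
    """Crude same-length check: one substitution or one adjacent
    transposition apart (enough for this sketch)."""
    if len(a) != len(b):
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diffs) == 1:
        return True
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i = diffs[0]
        return a[i] == b[i + 1] and a[i + 1] == b[i]
    return False

# Exact match on (vendor, date, amount) first; fuzzy only inside each group
groups = defaultdict(list)
for vendor, date, amount, invno in invoices:
    groups[(vendor, date, amount)].append(invno)

for key, invnos in groups.items():
    for i in range(len(invnos)):
        for j in range(i + 1, len(invnos)):
            if within_one_edit(invnos[i], invnos[j]):
                print(key, invnos[i], "~", invnos[j])
```

Only the two ACME invoices sharing a date and amount ever get compared, which is what keeps both the false positives and the compute cost down.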
False Positive Reduction: Use “Tight” Specs
Levenshtein distance 1, or 2 max
Looser specifications = more false positives
Avoid Soundex and similar approaches
There is no substitute for good data prep
False Positives: Rank Your Results
Order based on exposure
Size of item
Degree of inherent risk (cash)
Order based on degree of similarity
Distance (1 vs. 2)
Number of matching “same” elements
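A minimal Python sketch of that ranking, assuming each finding carries a distance and a dollar amount (both data and layout are hypothetical): closest matches first, largest exposure first within each distance.

```python
# Hypothetical fuzzy-match findings: (distance, amount, description)
findings = [
    (2,   310.00, "ACME CO ~ ACMEE COMPANY"),
    (1, 48200.00, "INV-1001 ~ INV-1010"),
    (1,    95.50, "J SMITH ~ J SMYTH"),
    (2,  7600.00, "123 MAIN ST ~ 123 MIAN ST"),
]

# Sort by distance ascending, then amount descending
ranked = sorted(findings, key=lambda f: (f[0], -f[1]))
for distance, amount, desc in ranked:
    print(distance, amount, desc)
```

With limited review time, the items most likely to be real and most costly if real get looked at first.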
Execution: ACL
Separate menu item
Analyze/fuzzy duplicates
Choose your (concatenated) key
Choose diff. threshold (1 or 2)
Select other fields to use in investigation
Select the output table name
Be patient
Execution: Arbutus
Included with duplicates testing
Analyze/duplicates
Choose your key fields (any type)
Choose either near or similar processing
Choose max. difference (0, 1, or 2)
Select other fields to use in investigation
Select output location and name
“Similar” Processing: Arbutus
Specifically designed to work with document IDs
Uses Damerau-Levenshtein, but automatically pre-processes
Removes all blanks and punctuation, upper cases
Matches similar characters: O=0, I=1, 5=S, etc.
Works on all data types
127,894.63 vs. 12,894.63 (diff. 1)
I-12345 vs. 112345 (diff 0)
Particularly useful with OCR
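The character-folding step of “similar” processing can be sketched in Python. The exact fold table is an assumption on my part (the slide gives only O=0, I=1, 5=S “etc.”):

```python
# Fold visually similar characters before comparing, so OCR-style
# confusions (O vs 0, I vs 1, S vs 5) don't count as differences.
# The L -> 1 entry is an extra assumption, not from the slide.
FOLD = str.maketrans({"O": "0", "I": "1", "L": "1", "S": "5"})

def fold(s: str) -> str:
    """Upper-case, drop blanks and punctuation, map similar characters."""
    kept = [c for c in s.upper() if c.isalnum()]
    return "".join(kept).translate(FOLD)

print(fold("I-12345"))              # 112345
print(fold("I-12345") == fold("112345"))  # True: distance 0 after folding
```

After folding, the ordinary Damerau-Levenshtein comparison runs on the cleaned strings, which is why ‘I-12345’ and ‘112345’ come out at difference 0.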
“Similar” Processing: ACL
Not explicitly supported
Pre-process the data to create a computed field
Upper case
Include only numbers and letters (no blanks, punctuation)
Convert numbers and dates to strings (using the DATE or STRING functions)
Use the FUZZYDUP command as in the past
Manual Duplicates Testing: ACL
Data prep is still important
LevDist(string1, string2 <, case sensitive>)
Case sensitive by default
Filter: LevDist(name1, name2, F) < 3
IsFuzzyDup(string1, string2, distance <, diff%> )
Automatically case insensitive
Filter: IsFuzzyDup(name1, name2, 2)
Can also be used as a join test
Manual Duplicates Testing: Arbutus
All case sensitive by default (assumes normalized inputs)
Difference(string1, string2 <, case sensitive>)
Filter: difference(name1, name2, F) < 3
Near(field1, field2, difference)
Filter: near(name1, name2, 2)
Applies to all data types
Char: Damerau-Levenshtein; numbers and dates: proximity (4799 vs 4803)
Similar(field, field2, difference)
Applies to all data types, always uses Damerau-Levenshtein
Char: prepared data; numbers and dates: 123,456 vs. 12,456
Find Specific Keywords in Text: ACL
Very common for purchase card reviews, FCPA
Use the Find function:
Filter: IF Find(“Exotic”, desc)
Multiple words: IF Find(“Exotic”, desc) OR Find(“IPad”, desc)…
Not case sensitive, not whole word
Create a Logical computed field (say “Exception”):
T IF Find(“Exotic”, desc)
T IF Find(“IPad”, desc)
…
F
Filter: IF Exception
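The same keyword logic, sketched in Python (substring match, case-insensitive, not whole-word, mirroring how Find behaves above); the keyword list is a hypothetical example:

```python
# Hypothetical keyword list; in practice this would be maintained
# externally, one entry per line.
KEYWORDS = ["exotic", "ipad", "casino"]

def is_exception(desc: str) -> bool:
    """True if any keyword appears anywhere in the description
    (case-insensitive, not whole-word)."""
    d = desc.lower()
    return any(kw in d for kw in KEYWORDS)

print(is_exception("EXOTIC CAR RENTALS LLC"))  # True
print(is_exception("Office supplies"))         # False
```

Because the match is not whole-word, a description containing “iPads” still trips the “ipad” entry, which is usually what a purchase card review wants.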
Find Specific Keywords in Text: Arbutus
The Find function works the same as in ACL
Use the ListFind function instead:
Filter: IF ListFind(“exceptions.txt”, desc)
Simple text file
Easily maintained in Notepad
Unlimited entries
Supports an external reference file or an internal array
Like Find function, not case sensitive, not whole word
Continuous Monitoring
Mostly errors
“Test” vs. “control”
Ownership of the process
May relate to frequency
Detective vs. Preventative
This entire presentation is detective
Opportunity to run against documents before committing
Preventative almost certainly a “control”
Fuzzy Testing in Action
Demonstration