Benford’s Law Data Quality Analysis applications Alan F Doyle, Data Management Specialist Group...
-
Upload
amos-dixon -
Category
Documents
-
view
214 -
download
1
Transcript of Benford’s Law Data Quality Analysis applications Alan F Doyle, Data Management Specialist Group...
Benford’s LawData Quality Analysis applications
Alan F Doyle, Data Management SpecialistGroup Data Management
4 March 2009
2
Contents
Introduction – who are we ?
Benford's Law – What is it, when should it apply, and to what types of data ?
Benford's Law and Data Quality 101 30% of all balances in your GL should begin with a "1" (and other stories)!!!
Fraud identification using Benford's Law
Using Excel to test real data sets using Benford's Law
References/Bibliography
AppendixExcel approach to Benford's Law (suggested procedure and Excel tips)
Our brands
Key Group facts
39,729 employees (FTE), globally
10 million retail and business banking customers, globally
1,714 branches and service centres, globally
$15.4 billion revenue
$8.1 billion underlying profit
2,939 ATMs, globally (including non-branded ATMs)
Source: 2008 Shareholder Review
5
Data Management and Benford's Law The Data Quality link
NAB Group Data Management developed a Group Data Quality (DQ) Policy We have a key focus in the policy on the capabilities to ensure appropriate DQ We undertake Data Quality Profiling – a key tool/capability, but generally
business-rules based There was some awareness in our team, of Benford’s Law and its application to
fraud detection…..maybe we can apply it to DQ profiling also? We selected various data stores which should conform to Benford’s Law and tested
this hypothesis Conclusion: very close correlation found, and
a worthwhile addition to existing DQ assessment techniques/rules
Today’s paper provides background on:– what Benford’s Law is (and is not)– how it can be applied in practice, and – examples of the results of analysis on some real data sets
6
Benford's Law What is it? ….lay man’s guide
Also called the “First-Digit Law”
Method of predicting, with surprising accuracy,the initial digits of any non-random series of numbers
Wikipedia4 – simple, plain language:
> “…in lists of numbers from many real-life sources of data,
> the leading digit is distributed in a specific, non-uniform way…..
> the first digit is 1 almost one third of the time, and
> larger digits occur as the leading digit with lower frequency,
> to the point where 9 as a first digit occurs less than one time in twenty.”
7
Benford's Law What is it? ….the technical stuff
Basis: values of real-world measurements are often distributed logarithmically, so the log of such sets is generally distributed uniformly
Log tables were more “dog-eared” at 1st few pages (1st digits 1,2 etc) than last pages (1st digits 7,8,9) ….(fictional?)
Probability of digit D as 1st digit = log10(1+1/D)
A generalised formula exists also which
allows us to predict the probability of, for example :– the first 3 digits being 314! – the 4th digit being a 6! – etc..
Benford’s Law
First Digit of numbers
% predicted byBenford’s Law
1 30.103
2 17.609
3 12.494
4 9.691
5 7.918
6 6.695
7 5.799
8 5.115
9 4.576
8
Benford's Law Real world/day-to-day examples of where it should apply
Electricity bills
Street addresses
Stock prices
Population numbers
Death rates
Lengths of rivers
Accounts payable invoice and payment values
General Ledger balances
Customer Loan and Deposit account balances
Land Valuations
9
Benford's Law To what “types” of data should it apply?
Balances or totals of numbers resulting from aggregation (e.g. General Ledger Balances, supplier accounts payable balances, data warehouse aggregates)
The more stages of calculations to obtain each member of a series of numbers, the more likely it is that the end results will conform to the predictions of Benford’s Law
Numbers resulting from the mathematical combination of numbers (e.g. price times quantity)
Transaction-level data (e.g. payments, sales, purchases)
Numbers that describe the ‘count’ or ‘value’ of the elements of a dataset
10
Benford's Law When does it NOT (or most likely not) apply?
Assigned numbers (e.g. cheque numbers, invoice numbers) Numbers which conform to other distributions (e.g. normal distribution,
uniform distribution like random drawings, lotteries, or the roll of one die) Numbers influenced by human thought (e.g. prices set with psychological
thresholds such as $1.99) Balances of accounts set up for a specific purpose (e.g. to record $100
refunds) Items/numbers with built-in minimum or maximum values, e.g. 1st digit of
heights (in metres) of a group of humans is most likely to be a 1 or a 2 ‘Price Effect’ e.g. Sales receipts where one product (with a specific price)
forms a large part of the population of sales made, or individual staff members’ payroll totals for a pay period (predominance of similar hours times similar rates per hour
When selecting small sample sizes Non-naturally occurring numbers (e.g. telephone numbers)
11
Linking Benford's Law and Data Quality Benford’s Law helps to identify …..
Duplicate payments (accounts payable)
Fraudulent payments
Fraudulent expense claims
Tax return fraud
Biased estimation in General Ledger balances
Arbitrarily invented numbers in forecasting (forecasts should conform to the expected distributions of their related ‘actuals’)
Biased estimates in bad debt provisions
Systemic error (e.g. through incorrect ETL logic, resulting in accidentally duplicated or repeated values)
Processing inefficiencies (e.g. high quantity/low $ transactions)
12
Fraud identification using Benford's Law “It is the failure of assigned numbers to follow
Benford’s Law that makes the Law so powerful in detecting fraudulent “made up” numbers among calculated numbers” 1
“The general expectation is that Benford’s Law will apply to any series of calculated numbers, and explanation is required when any series does not conform. The explanation may be in the exceptions listed above [refer previous slides] or the explanation may be in anomalous behaviour.” 1
Results are indicative. Need to evaluate the results to conclude which applies (a valid exception to Benford’s Law, or an anomaly)
If anomalous => further investigation should be conducted relating to the anomalies to confirm why they occur (e.g. fraud, estimation biases, unintended or manual or programmed generation of duplicates)
13
Fraud identification using Benford's LawExample: Expense Claims
A classic and easily understood example relates to expense account manipulation
Assume General Manager approval required for expense claims >= $300 Often find a clustering of expenses below $300 to avoid the need to seek
GM approval This is often achieved by arranging multiple purchases just below the
threshold, and/or collusion by purchasers with suppliers, to split larger invoices into smaller individual invoices (e.g. 2 invoices for $260 and $240 rather than one for $500)
Benford’s Law to the rescue! First digits will show anomalies (e.g. preponderance of 1s and/or 2s, and fewer than expected 3s, 4s and 5s)
Refer suggested simple procedure andExcel tips in the appendix
14
Fraud identification using Benford's Law Example: Expense Claims
Anomalies in distribution of actual occurrence of first digits compared with Benford’s expected distribution – the tell-tale signs*
* Dummy data used to illustrate this example
Possible Fraudulent Expense Claims
1 2 3 4 5 6 7 8 9
15
Using Excel to test a real data set using Benford's Law Sample data from specific balance files (6.5m records) – Data Warehouse
16
Using Excel to test a real data set using Benford's Law Graph of this data (6.5m records) – closely mirrors Benford’s Distribution
Actual < Benford’s
Actual > Benford’s
KEY :
17
Using Excel to test a real data set using Benford's Law Another Sample data set from the Warehouse (3.3m records)
18
Using Excel to test a real data set using Benford's Law Graph of this data (3.3m records) – matches Benford’s, slight tendency to
understate ? However differences are relatively small.
0.00%
5.00%
10.00%
15.00%
20.00%
25.00%
30.00%
35.00%
1 2 3 4 5 6 7 8 9
First Digit of Numbers
% O
ccu
rren
ce
Actual Result
Benford's Expected Result
Actual < Benford’s
Actual > Benford’s
KEY :
19
Using Excel to test a real data set using Benford's Law Australian GL : Balance Sheet balances : Feb 09 – 55,000 records
Match is Spot On !!! …good news for us !!
20
Using Excel to test a real data set using Benford's Law Australian GL transactions (single Balance Sheet account type; all
branches) : Feb 09 - 62,000 records
• 62k GL records (transactions) v 6.5m Warehouse records (balances) • Less inherent ‘aggregation’ (transaction vs. balance) and smaller number of items => a less exact match is not unexpected• Nevertheless, still a close match• The following example (personal credit card) illustrates increasing ‘lumpiness’ as the number of items decreases and ‘behaviour’ plays a greater role…
21
Using Excel to test a real data set using Benford's Law My MasterCard ! (12 months/976 transactions) – “raw”
0.00%
5.00%
10.00%
15.00%
20.00%
25.00%
30.00%
35.00%
1 2 3 4 5 6 7 8 9
First Digit of Numbers
% O
cc
urr
en
ce
Actual Result
Benford's ExpectedResult
Interesting : rough match (trend at least) even if somewhat ‘lumpy’
As expected - know it includes many duplicates (direct debits, credit transfers, round sum ATM withdrawals, bank fees, etc)
Distributions are still quite similar
May be worth adjusting the calculations to allow for the ‘known’ behaviours
Either the “raw” or the adjusted actual distributions can be used as a ‘fingerprint” for comparative analysis across/with other time periods
22
Using Excel to test a real data set using Benford's Law My MasterCard ! (12 months/976 transactions) – Adjusted for duplicates, etc
Backed out duplicates and near duplicates e.g. monthly direct debits of $19.99 then $19.98, then $19.99; bank fees, standing a/c transfers, etc
Backed out known ‘behaviours’ e.g. many ‘ATM withdrawals’ for $20, $40, $60, $80
Added back total of these, to each digit’s subtotal, after being redistributed to the leading digits per Benford’s Law
Recalculated ‘adjusted’ actuals
Close match given the relatively small sample, and many behavioural factors at play here
Tendency towards lower digits ! Penny pincher ?......contd/
0.00%
5.00%
10.00%
15.00%
20.00%
25.00%
30.00%
35.00%
1 2 3 4 5 6 7 8 9
First Digit of Numbers
% O
cc
urr
en
ce
Actual Result
Benford's ExpectedResult
23
Using Excel to test a real data set using Benford's Law My MasterCard ! (12 months/976 transactions) – Adjusted (continued)
Graphically the raw distribution is an intuitive outcome based on my personal behaviours ……
I tend to spend larger sums like $600 to $999 less frequently than lower amounts like $10 to $19, and $100 to $199, or $20 to $29, $200 to $299, $30 to $39, $300 to $399 etc !! …a normal behaviour for most of us when it hits our own hip pocket!!
For a business, the aggregates are comprised of the outcomes of the behaviours of many
=> reduced impact of the behaviours of one or two individuals - i.e. a closer fit to Benford’s is expected for businesses (larger populations)
0.00%
5.00%
10.00%
15.00%
20.00%
25.00%
30.00%
35.00%
1 2 3 4 5 6 7 8 9
First Digit of Numbers
% O
cc
urr
en
ce
Actual Result
Benford's ExpectedResult
24
Conclusions from our analyses
Benford’s Law does indeed apply to aggregated datasets => very relevant to Data Quality assessment!
Where anomalies are found, further review may indicate the initial hypothesis regarding a data set’s expected ‘distribution’ was incorrect
Alternatively, review of outcomes may indicate that although the distribution should conform to Benford’s Law, in reality it doesn’t – further investigation required.
Further investigation should then reveal the true reasons for deviationse.g. fraud, inefficient processes, genuine repeated patterns, or systemic data processing/ETL logic errors.
Benford’s Law provides a simple yet potentially powerful technique to add to our DQ assessment armory, and can be achieved by applying nothing more than a very simple spreadsheet against your data set
We’ll never look at GL and data warehouse balances (or at least their leading digits) the same way again!
25
Questions
26
References/Bibliography
1. Benford’s Law and Fraud DetectionRobert Lowe, NZ Chartered Accountants Journal November 2000
2. When Benford’s Law is Broken Robert Lowe, NZ Chartered Accountants Journal December 2000
3. I’ve Got Your NumberMark Nigrini, AICPA Journal of Accountancy May 1999
4. Wikipedia – Benford’s Lawhttp://en.wikipedia.org/wiki/Benford’s_law
27
Appendix: Excel approach to Benford's Law(suggested procedure and Excel tips)
1. Columnar (left to right) derivation recommended (helps check first digit extraction is correct)
2. Convert to absolute values first – e.g. =ABS(F11)
3. Exclude zero balances (just sort list, then delete rows)
4. Remove any leading zeros (number/balance <0)
– select numbers/balances <0, then – multiply by 100, 1000, doesn’t matter (e.g. 2 decimal places => use 100 or greater)– just need to get leading digit on the left of the decimal
5. Extract leading digit(s) – e.g. =LEFT(J11,1)
6. Sort list in ascending order of extracted leading digit cell
7. Create subtotals on change in leading digit value
8. Calculate Grand total of record count (less zero balances)
9. Calculate percentage of Grand total represented by the count of each leading digit
10. Compare with Benford’s Distribution percentages and graph if desired
11. Review outcomes and decide next steps.