Benford’s Law Data Quality Analysis applications Alan F Doyle, Data Management Specialist Group Data Management 4 March 2009


Benford’s Law: Data Quality Analysis applications

Alan F Doyle, Data Management Specialist, Group Data Management

4 March 2009


Contents

Introduction – who are we ?

Benford's Law – What is it, when should it apply, and to what types of data ?

Benford's Law and Data Quality 101 – 30% of all balances in your GL should begin with a "1" (and other stories)!

Fraud identification using Benford's Law

Using Excel to test real data sets using Benford's Law

References/Bibliography

Appendix: Excel approach to Benford's Law (suggested procedure and Excel tips)


Our brands


Key Group facts

39,729 employees (FTE), globally

10 million retail and business banking customers, globally

1,714 branches and service centres, globally

$15.4 billion revenue

$8.1 billion underlying profit

2,939 ATMs, globally (including non-branded ATMs)

Source: 2008 Shareholder Review


Data Management and Benford's Law – The Data Quality link

NAB Group Data Management developed a Group Data Quality (DQ) Policy

We have a key focus in the policy on the capabilities to ensure appropriate DQ

We undertake Data Quality Profiling – a key tool/capability, but generally business-rules based

There was some awareness in our team of Benford’s Law and its application to fraud detection… maybe we can apply it to DQ profiling also?

We selected various data stores which should conform to Benford’s Law and tested this hypothesis

Conclusion: very close correlation found, and a worthwhile addition to existing DQ assessment techniques/rules

Today’s paper provides background on:
– what Benford’s Law is (and is not)
– how it can be applied in practice, and
– examples of the results of analysis on some real data sets


Benford's Law – What is it? …layman’s guide

Also called the “First-Digit Law”

Method of predicting, with surprising accuracy, the initial digits of many naturally occurring (non-assigned) series of numbers

Wikipedia [4] – simple, plain language:

> “…in lists of numbers from many real-life sources of data,

> the leading digit is distributed in a specific, non-uniform way…..

> the first digit is 1 almost one third of the time, and

> larger digits occur as the leading digit with lower frequency,

> to the point where 9 as a first digit occurs less than one time in twenty.”


Benford's Law – What is it? …the technical stuff

Basis: values of real-world measurements are often distributed logarithmically, so the log of such sets is generally distributed uniformly

Log tables were more “dog-eared” at 1st few pages (1st digits 1,2 etc) than last pages (1st digits 7,8,9) ….(fictional?)

Probability of digit D as 1st digit = log10(1+1/D)

A generalised formula also exists, which allows us to predict the probability of, for example:
– the first 3 digits being 314
– the 4th digit being a 6
– etc.

Benford’s Law

| First digit of numbers | % predicted by Benford’s Law |
| --- | --- |
| 1 | 30.103 |
| 2 | 17.609 |
| 3 | 12.494 |
| 4 | 9.691 |
| 5 | 7.918 |
| 6 | 6.695 |
| 7 | 5.799 |
| 8 | 5.115 |
| 9 | 4.576 |
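The first-digit formula and its generalised form can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the original presentation; the function names are made up for this example. The generalised form used here is the standard one: the probability that a number starts with the digit string n is log10(1 + 1/n), and the probability of a given digit in a later position is obtained by summing over all possible preceding digits.

```python
import math


def first_digit_prob(d: int) -> float:
    """Benford probability that the leading digit is d (1 to 9)."""
    return math.log10(1 + 1 / d)


def leading_digits_prob(n: int) -> float:
    """Probability that a number starts with the digits of n (e.g. n=314)."""
    return math.log10(1 + 1 / n)


def nth_digit_prob(position: int, digit: int) -> float:
    """Probability that the digit at `position` (2 or later, counted from
    the left) equals `digit` (0 to 9), summed over all preceding digits."""
    if position < 2:
        raise ValueError("use first_digit_prob for the first position")
    lo = 10 ** (position - 2)
    hi = 10 ** (position - 1)
    return sum(math.log10(1 + 1 / (10 * k + digit)) for k in range(lo, hi))


# Reproduce the first-digit table as percentages.
for d in range(1, 10):
    print(d, f"{100 * first_digit_prob(d):.3f}")

# The two examples mentioned on the slide.
print("first 3 digits are 314:", leading_digits_prob(314))
print("4th digit is a 6:", nth_digit_prob(4, 6))
```

By the fourth position the digits are almost evenly spread, which is why first-digit (and first-two-digit) tests are the ones normally used in practice.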


Benford's Law – Real world/day-to-day examples of where it should apply

Electricity bills

Street addresses

Stock prices

Population numbers

Death rates

Lengths of rivers

Accounts payable invoice and payment values

General Ledger balances

Customer Loan and Deposit account balances

Land Valuations


Benford's Law – To what “types” of data should it apply?

Balances or totals of numbers resulting from aggregation (e.g. General Ledger Balances, supplier accounts payable balances, data warehouse aggregates)

The more stages of calculations to obtain each member of a series of numbers, the more likely it is that the end results will conform to the predictions of Benford’s Law

Numbers resulting from the mathematical combination of numbers (e.g. price times quantity)

Transaction-level data (e.g. payments, sales, purchases)

Numbers that describe the ‘count’ or ‘value’ of the elements of a dataset
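The point about stages of calculation can be illustrated with a small simulation. This is a sketch under stated assumptions, not the author's analysis: each "stage" is modelled (hypothetically) as multiplication by a random factor drawn uniformly between 1 and 10, and the function and parameter names are invented for this example. The more stages are multiplied together, the smaller the largest gap between the simulated first-digit shares and Benford’s predictions should become.

```python
import math
import random

# Benford's expected share for first digits 1 to 9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def leading_digit(x: float) -> int:
    """First significant digit of a non-zero number, read from its
    scientific-notation form (e.g. 314.0 -> '3.14...e+02' -> 3)."""
    return int(f"{abs(x):.15e}"[0])


def max_deviation(stages: int, samples: int = 20000, seed: int = 1) -> float:
    """Largest absolute gap between simulated first-digit shares and
    Benford's shares when each value is a product of `stages` factors."""
    rng = random.Random(seed)
    counts = [0] * 9
    for _ in range(samples):
        value = 1.0
        for _ in range(stages):
            value *= rng.uniform(1.0, 10.0)  # hypothetical 'calculation stage'
        counts[leading_digit(value) - 1] += 1
    return max(abs(c / samples - p) for c, p in zip(counts, BENFORD))


for stages in (1, 2, 4, 6):
    print(stages, "stage(s): max deviation", round(max_deviation(stages), 4))
```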


Benford's Law – When does it NOT (or most likely not) apply?

Assigned numbers (e.g. cheque numbers, invoice numbers)

Numbers which conform to other distributions (e.g. normal distribution, uniform distribution like random drawings, lotteries, or the roll of one die)

Numbers influenced by human thought (e.g. prices set with psychological thresholds such as $1.99)

Balances of accounts set up for a specific purpose (e.g. to record $100 refunds)

Items/numbers with built-in minimum or maximum values, e.g. 1st digit of heights (in metres) of a group of humans is most likely to be a 1 or a 2

‘Price Effect’ e.g. sales receipts where one product (with a specific price) forms a large part of the population of sales made, or individual staff members’ payroll totals for a pay period (predominance of similar hours times similar rates per hour)

When selecting small sample sizes

Non-naturally occurring numbers (e.g. telephone numbers)
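As a quick illustration of the first exception, the sketch below profiles a hypothetical run of sequentially assigned four-digit cheque numbers (1000 to 9999; the range and names are assumptions made for this example). Every first digit occurs equally often, so the series cannot match Benford’s skewed distribution.

```python
import math
from collections import Counter


def first_digit_shares(numbers):
    """Share of each first digit (1 to 9) in a series of positive integers."""
    counts = Counter(int(str(n)[0]) for n in numbers)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


# Hypothetical sequentially assigned cheque numbers.
shares = first_digit_shares(range(1000, 10000))

for d in range(1, 10):
    expected = math.log10(1 + 1 / d)
    print(d, f"actual {shares[d]:.3f}", f"Benford {expected:.3f}")
```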


Linking Benford's Law and Data Quality – Benford’s Law helps to identify…

Duplicate payments (accounts payable)

Fraudulent payments

Fraudulent expense claims

Tax return fraud

Biased estimation in General Ledger balances

Arbitrarily invented numbers in forecasting (forecasts should conform to the expected distributions of their related ‘actuals’)

Biased estimates in bad debt provisions

Systemic error (e.g. through incorrect ETL logic, resulting in accidentally duplicated or repeated values)

Processing inefficiencies (e.g. high quantity/low $ transactions)


Fraud identification using Benford's Law

“It is the failure of assigned numbers to follow Benford’s Law that makes the Law so powerful in detecting fraudulent “made up” numbers among calculated numbers” [1]

“The general expectation is that Benford’s Law will apply to any series of calculated numbers, and explanation is required when any series does not conform. The explanation may be in the exceptions listed above [refer previous slides] or the explanation may be in anomalous behaviour.” [1]

Results are indicative. Need to evaluate the results to conclude which applies (a valid exception to Benford’s Law, or an anomaly)

If anomalous => further investigation should be conducted relating to the anomalies to confirm why they occur (e.g. fraud, estimation biases, unintended or manual or programmed generation of duplicates)


Fraud identification using Benford's Law – Example: Expense Claims

A classic and easily understood example relates to expense account manipulation

Assume General Manager approval required for expense claims >= $300

Often find a clustering of expenses below $300 to avoid the need to seek GM approval

This is often achieved by arranging multiple purchases just below the threshold, and/or collusion by purchasers with suppliers, to split larger invoices into smaller individual invoices (e.g. 2 invoices for $260 and $240 rather than one for $500)

Benford’s Law to the rescue! First digits will show anomalies (e.g. preponderance of 1s and/or 2s, and fewer than expected 3s, 4s and 5s)

Refer suggested simple procedure and Excel tips in the appendix
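The expense-claim pattern can be mimicked with made-up numbers. In the sketch below everything is hypothetical: the "ordinary" claims are a geometric series (chosen because it conforms closely to Benford’s Law), the "split" claims are 150 invented amounts between $250 and $299, and the 2-percentage-point tolerance is an arbitrary illustration rather than a recommended threshold.

```python
import math
from collections import Counter


def leading_digit(x: float) -> int:
    """First significant digit of a non-zero amount."""
    return int(f"{abs(x):.15e}"[0])


def digit_gaps(amounts):
    """Actual share minus Benford's expected share, per first digit."""
    counts = Counter(leading_digit(a) for a in amounts if a != 0)
    total = sum(counts.values())
    return {d: counts[d] / total - math.log10(1 + 1 / d) for d in range(1, 10)}


# Hypothetical 'ordinary' claims: geometric growth from $10 to just under $10,000.
ordinary = [10 * 1.01 ** k for k in range(694)]
# Hypothetical claims kept just under the $300 approval threshold.
split = [250 + (i % 50) for i in range(150)]

gaps = digit_gaps(ordinary + split)

TOLERANCE = 0.02  # arbitrary cut-off for this illustration
flagged = {d: round(g, 3) for d, g in gaps.items() if abs(g) > TOLERANCE}
print("digits outside tolerance:", flagged)
```

A flagged digit is only a prompt for further review; it does not by itself show that anything improper has occurred.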


Fraud identification using Benford's Law – Example: Expense Claims

Anomalies in distribution of actual occurrence of first digits compared with Benford’s expected distribution – the tell-tale signs*

* Dummy data used to illustrate this example

[Chart: Possible Fraudulent Expense Claims, by first digit (1 to 9)]


Using Excel to test a real data set using Benford's Law

Sample data from specific balance files (6.5m records) – Data Warehouse


Using Excel to test a real data set using Benford's Law

Graph of this data (6.5m records) – closely mirrors Benford’s Distribution

[Chart key: Actual < Benford’s; Actual > Benford’s]


Using Excel to test a real data set using Benford's Law

Another sample data set from the Warehouse (3.3m records)


Using Excel to test a real data set using Benford's Law

Graph of this data (3.3m records) – matches Benford’s, slight tendency to understate? However, differences are relatively small.

[Chart: % occurrence (0% to 35%) by first digit of numbers (1 to 9); series: Actual Result, Benford's Expected Result; key: Actual < Benford’s, Actual > Benford’s]


Using Excel to test a real data set using Benford's Law

Australian GL: Balance Sheet balances: Feb 09 – 55,000 records

Match is spot on! …good news for us!


Using Excel to test a real data set using Benford's Law

Australian GL transactions (single Balance Sheet account type; all branches): Feb 09 – 62,000 records

• 62k GL records (transactions) v 6.5m Warehouse records (balances)
• Less inherent ‘aggregation’ (transaction vs. balance) and smaller number of items => a less exact match is not unexpected
• Nevertheless, still a close match
• The following example (personal credit card) illustrates increasing ‘lumpiness’ as the number of items decreases and ‘behaviour’ plays a greater role…


Using Excel to test a real data set using Benford's Law

My MasterCard! (12 months/976 transactions) – “raw”

[Chart: % occurrence (0% to 35%) by first digit of numbers (1 to 9); series: Actual Result, Benford's Expected Result]

Interesting : rough match (trend at least) even if somewhat ‘lumpy’

As expected - know it includes many duplicates (direct debits, credit transfers, round sum ATM withdrawals, bank fees, etc)

Distributions are still quite similar

May be worth adjusting the calculations to allow for the ‘known’ behaviours

Either the “raw” or the adjusted actual distributions can be used as a ‘fingerprint’ for comparative analysis across/with other time periods


Using Excel to test a real data set using Benford's Law

My MasterCard! (12 months/976 transactions) – Adjusted for duplicates, etc

Backed out duplicates and near duplicates e.g. monthly direct debits of $19.99 then $19.98, then $19.99; bank fees, standing a/c transfers, etc

Backed out known ‘behaviours’ e.g. many ‘ATM withdrawals’ for $20, $40, $60, $80

Added back total of these, to each digit’s subtotal, after being redistributed to the leading digits per Benford’s Law

Recalculated ‘adjusted’ actuals

Close match given the relatively small sample, and many behavioural factors at play here

Tendency towards lower digits! Penny pincher? (continued)
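One possible reading of the adjustment steps above, sketched in Python. The per-digit counts below are invented purely for illustration (they are not the MasterCard figures), and the function name is hypothetical: backed-out items are removed from their own digit’s subtotal, and their total is then added back across all nine digits in Benford’s proportions, so the grand total is unchanged.

```python
import math

# Benford's expected share for first digits 1 to 9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def adjusted_counts(raw_counts, backed_out_counts):
    """Remove backed-out items from each digit's subtotal, then add their
    total back, redistributed across digits per Benford's Law."""
    removed_total = sum(backed_out_counts.values())
    adjusted = {}
    for d in range(1, 10):
        remaining = raw_counts.get(d, 0) - backed_out_counts.get(d, 0)
        adjusted[d] = remaining + removed_total * BENFORD[d]
    return adjusted


# Hypothetical transaction counts by first digit (1,000 transactions).
raw = {1: 260, 2: 250, 3: 110, 4: 130, 5: 70, 6: 70, 7: 40, 8: 45, 9: 25}
# Hypothetical duplicates and known 'behaviours' (e.g. $20/$40/$60/$80 withdrawals).
backed_out = {1: 12, 2: 80, 4: 40, 6: 20, 8: 15}

adjusted = adjusted_counts(raw, backed_out)
total = sum(adjusted.values())
for d in range(1, 10):
    print(d, f"adjusted {adjusted[d] / total:.3f}", f"Benford {BENFORD[d]:.3f}")
```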

[Chart: % occurrence (0% to 35%) by first digit of numbers (1 to 9); series: Actual Result, Benford's Expected Result]


Using Excel to test a real data set using Benford's Law

My MasterCard! (12 months/976 transactions) – Adjusted (continued)

Graphically the raw distribution is an intuitive outcome based on my personal behaviours ……

I tend to spend larger sums like $600 to $999 less frequently than lower amounts like $10 to $19, and $100 to $199, or $20 to $29, $200 to $299, $30 to $39, $300 to $399 etc !! …a normal behaviour for most of us when it hits our own hip pocket!!

For a business, the aggregates are comprised of the outcomes of the behaviours of many

=> reduced impact of the behaviours of one or two individuals - i.e. a closer fit to Benford’s is expected for businesses (larger populations)

[Chart: % occurrence (0% to 35%) by first digit of numbers (1 to 9); series: Actual Result, Benford's Expected Result]


Conclusions from our analyses

Benford’s Law does indeed apply to aggregated datasets => very relevant to Data Quality assessment!

Where anomalies are found, further review may indicate the initial hypothesis regarding a data set’s expected ‘distribution’ was incorrect

Alternatively, review of outcomes may indicate that although the distribution should conform to Benford’s Law, in reality it doesn’t – further investigation required.

Further investigation should then reveal the true reasons for deviations, e.g. fraud, inefficient processes, genuine repeated patterns, or systemic data processing/ETL logic errors.

Benford’s Law provides a simple yet potentially powerful technique to add to our DQ assessment armory, and can be achieved by applying nothing more than a very simple spreadsheet against your data set

We’ll never look at GL and data warehouse balances (or at least their leading digits) the same way again!


Questions


References/Bibliography

1. Benford’s Law and Fraud Detection. Robert Lowe, NZ Chartered Accountants Journal, November 2000

2. When Benford’s Law is Broken. Robert Lowe, NZ Chartered Accountants Journal, December 2000

3. I’ve Got Your Number. Mark Nigrini, AICPA Journal of Accountancy, May 1999

4. Wikipedia – Benford’s Law. http://en.wikipedia.org/wiki/Benford’s_law


Appendix: Excel approach to Benford's Law (suggested procedure and Excel tips)

1. Columnar (left to right) derivation recommended (helps check first digit extraction is correct)

2. Convert to absolute values first – e.g. =ABS(F11)

3. Exclude zero balances (just sort list, then delete rows)

4. Remove any leading zeros (number/balance < 1)

– select numbers/balances < 1, then
– multiply by 100, 1000, doesn’t matter (e.g. 2 decimal places => use 100 or greater)
– just need to get leading digit on the left of the decimal

5. Extract leading digit(s) – e.g. =LEFT(J11,1)

6. Sort list in ascending order of extracted leading digit cell

7. Create subtotals on change in leading digit value

8. Calculate Grand total of record count (less zero balances)

9. Calculate percentage of Grand total represented by the count of each leading digit

10. Compare with Benford’s Distribution percentages and graph if desired

11. Review outcomes and decide next steps.
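For readers who prefer a script to a spreadsheet, the same procedure can be sketched in Python. This is an illustrative equivalent rather than the method used for the analyses above; the function names and sample balances are hypothetical. Reading the first digit from the decimal representation takes the place of steps 4 and 5 (scaling small values and using LEFT), and a counter replaces the sort and subtotal steps.

```python
import math
from collections import Counter
from decimal import Decimal


def leading_digit(value) -> int:
    """First significant digit of a non-zero value, ignoring sign
    (steps 2, 4 and 5). Decimal drops leading zeros, so 0.07 gives 7."""
    magnitude = abs(Decimal(str(value)))
    return magnitude.as_tuple().digits[0]


def benford_profile(values):
    """Return rows of (digit, count, actual share, Benford share, gap)."""
    digits = [leading_digit(v) for v in values if v != 0]  # step 3: drop zeros
    if not digits:
        raise ValueError("no non-zero values to profile")
    counts = Counter(digits)   # steps 6 and 7: subtotal per leading digit
    total = len(digits)        # step 8: grand total, less zero balances
    rows = []
    for d in range(1, 10):
        actual = counts[d] / total           # step 9
        expected = math.log10(1 + 1 / d)     # step 10
        rows.append((d, counts[d], actual, expected, actual - expected))
    return rows


# Hypothetical balances, including negatives, zeros and values below 1.
balances = [-1250.40, 0, 0.07, 19.99, 310, 1875.10, 27.5, 0.0, 142]

for digit, count, actual, expected, gap in benford_profile(balances):
    print(digit, count, f"{actual:.1%}", f"{expected:.1%}", f"{gap:+.1%}")
```

With only a handful of values the comparison means little; as noted earlier, small samples are one of the cases where Benford’s Law is unlikely to hold.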