The Washington Dept. of Revenue Data Mining Pilot Pilot Project: A Retrospective Overview 2000 FTA...

The Washington Dept. of Revenue Data Mining Pilot Pilot Project: A Retrospective Overview. 2000 FTA Revenue Estimating and Tax Research Conference, September 26, 2000. Stan Woodwell, Research Information Manager, Washington Dept. of Revenue.


Page 1:

The Washington Dept. of Revenue

Data Mining Pilot Pilot Project: A Retrospective Overview

2000 FTA Revenue Estimating and Tax Research Conference

September 26, 2000

Stan Woodwell

Research Information Manager

Washington Dept. of Revenue

Page 2:

Data Warehousing/Data Mining Study Team

Three Integrated Efforts

• building small data warehouse -- “Data Mart”
• testing query & analysis tools
• doing data mining pilot project

Page 3:

Data warehousing

A data warehouse is a copy of transaction data specifically structured for querying, analysis and reporting.

That is,
- on physically separate hardware
- organized differently, especially for querying, analysis & reporting

Page 4:

Data Mart/Query Software

Data Mart
• SQL Server - 50 Gig
• NT Operating System
• ODBC

Query Software -- COGNOS

Page 5:

Data Mining Continuum

• Query
• Statistical Procedures
• Decision Trees
• Neural Networks

Page 6:

Data Mining: What’s New, What’s Not

Neural Networks, Decision Trees (Rule Induction)
• represent “Artificial Intelligence”

Data trains software

Query Logic

Statistical Procedures
• Regression
• Cluster Analysis
• Association Rules (Affinity/Market Basket Analysis)

Page 7:

Driving What’s New...

Incredible Increases in Computer Speed and Memory

Software Utilizing Extremely Complex Iterative Processes Can Really Crank

Page 8:

Dogbert the Consultant

“If you mine the data hard enough you can also find messages from God”

Page 9:

EXPECTATIONS

• Top Management
• Conferences
• Vendor Presentations

#1 DOR Priority

Page 10:

Committee Decisions: Data Mining

• Selection of Pilot Project “Proof of Concept”
• Selection of Data Mining Software for Pilot Project

Page 11:

Data Mining Software Selection

• NCR
• SAS
• SPSS
• IBM

SPSS Clementine Miner
• “In a cavern, in a canyon, excavating for a…”

Page 12:

Criteria for Pilot Project

• Doable
• Measurable
• Produces Efficiency within Program
• Within Budget
• Divisional Resources Available
• Can be completed by End of June

Page 13:

Projects Considered for Pilot

• Enhancing Audit Selection
• More Sophisticated Audit Retail Profiling
• Expanded Active Non-Reporter Profiling
• Tax Discovery - Identifying Non-Filers
• Parallel Taxpayer Education Effort
• Examining Transactions for Fraud
• Controlled Experiment with Collections

Page 14:

Data Mining Pilot Project: Audit Selection

Purpose
• Provide “Proof of Concept” for Advanced Data Mining
• Demonstrate Enhanced Predictive Capabilities through Utilization of Sophisticated Software
• Lead to Development of More Productive Audit Selection Criteria

Page 15:

Data Mining Pilot Project: Audit Selection

Design
• “Quasi-Experimental”
• Utilizes ODBC from Data Mart
• Dependent Variable: Audit Recovery
• Build “Supervised” Model Using Known Results from Audits Issued in 1997
• Use Model to Predict Recovery for 1998 Audits
• Compare Predictions with Actual 1998 Results

Page 16:

Data Mining Pilot Project: Audit Selection

Process
• Divided Audit Recovery into 4 Bands
– $1 - 1,000
– $1,000 - 5,000
– $5,000 - 10,000
– Over $10,000
• Divided 1997 Audit Sample into 2 Samples -- “Test” Sample and “Training” Sample
• Built Models using Training sample, applied to Test sample to test generalizability
• Applied Best Models to Predict Recovery for 1998 Audits
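The banding and sample-splitting steps on this slide can be sketched in a few lines of Python. This is only an illustration, not the Clementine stream used in the pilot: the function names, the 50/50 split, the fixed seed, and the sample dollar amounts are all assumptions, and because the slide's band edges overlap, the sketch assumes a boundary value such as exactly $1,000 falls in the lower band.

```python
import random

# Upper bounds of the first three bands from the slide; band 4 is "over $10,000".
# Assumption: a boundary value (e.g. exactly $1,000) belongs to the lower band.
BAND_UPPER_BOUNDS = [1_000, 5_000, 10_000]


def recovery_band(recovery):
    """Return a band number 1-4 for an audit recovery amount in dollars."""
    if recovery < 1:
        # The slide's bands start at $1; how smaller amounts were handled is not stated.
        raise ValueError("recovery below $1 is not covered by the four bands")
    for band, upper in enumerate(BAND_UPPER_BOUNDS, start=1):
        if recovery <= upper:
            return band
    return 4


def split_sample(records, training_fraction=0.5, seed=0):
    """Shuffle a copy of the records and return (training, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up recovery amounts standing in for 1997 audit results.
audits_1997 = [250, 1_000, 4_200, 7_500, 10_000, 18_000, 90, 5_001]
training, test = split_sample(audits_1997)
print([recovery_band(amount) for amount in audits_1997])
print(len(training), len(test))
```

Models would then be fit on `training` only and scored on `test`, so that the rules are judged on audits they have not seen before being applied to 1998.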

Page 17:

SPSS Clementine Rule Set Example

Rule Induction modeling

Page 18:

Data Used for Modeling

• gross income, taxable income and tax due from the preceding 4 years.
• total deductions from the preceding 4 years.
• total wages and average # of employees from the preceding 4 years.
• 26 industry categories derived from SIC codes
• location (in-state vs. out-of-state)
• ownership type
• flags set on the presence or absence of line codes and deduction codes
• lag variables--changes in variables from year to year.
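Two of the derived inputs listed above, lag variables and presence/absence flags, are simple to illustrate. The sketch below is a hedged example only: the function names, the dollar figures, and the deduction codes `D01`, `D02`, `D07` are hypothetical and do not come from the Department's data.

```python
def lag_variables(yearly_values):
    """Year-over-year changes for values ordered oldest to newest."""
    return [later - earlier
            for earlier, later in zip(yearly_values, yearly_values[1:])]


def presence_flags(reported_codes, all_codes):
    """Map every known code to 1 if the taxpayer reported it, else 0."""
    reported = set(reported_codes)
    return {code: int(code in reported) for code in all_codes}


# Made-up gross income for the four preceding years.
gross_income = [120_000, 135_000, 128_000, 150_000]
print(lag_variables(gross_income))

# Hypothetical deduction codes; the real code list is not given in the slides.
print(presence_flags(["D01", "D07"], ["D01", "D02", "D07"]))
```

Four years of a variable yield three lag values, so each lagged input adds three columns to the taxpayer's record.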

Page 19:

Data NOT Used for Modeling

Washington Combined Excise Tax Return
• Sales Tax -- 1 line code, 1 rate
• Business and Occupation Tax and Public Utility Tax -- 22 line codes, different activities, different rates
• 27 Deduction Codes associated with line codes

Unable to Use
• line code amounts
• deduction type amounts
• deduction type by line amounts

Page 20:

Major Problems

• Data Structures
• Missing/Imperfect Data
• Modeling Overspecification

Page 21:

Data Structures

Query Software
• “Star Structures”
• relational data base/hub tables
• myriad tables connected by multiple keys

Mining Software
• “flat file”
• single record containing everything for each taxpayer
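The gap between the two structures can be shown with a small sketch that collapses a hub table and a yearly fact table into one wide record per taxpayer. The table layouts, field names, and the `flatten` helper are assumptions made for illustration; the pilot did this work with Clementine merge streams, not with this code.

```python
def flatten(taxpayers, yearly_returns, years):
    """Build one wide record per taxpayer from a hub table and a yearly table."""
    # Index the yearly rows by their two-part key (taxpayer, year).
    by_key = {(row["taxpayer_id"], row["year"]): row for row in yearly_returns}
    flat = []
    for taxpayer in taxpayers:
        record = dict(taxpayer)  # start from the hub-table attributes
        for year in years:
            row = by_key.get((taxpayer["taxpayer_id"], year))
            # A year with no return becomes an explicit empty field.
            record[f"gross_income_{year}"] = row["gross_income"] if row else None
        flat.append(record)
    return flat


# Made-up rows standing in for the Data Mart tables.
taxpayers = [
    {"taxpayer_id": 1, "industry": "retail"},
    {"taxpayer_id": 2, "industry": "services"},
]
yearly_returns = [
    {"taxpayer_id": 1, "year": 1995, "gross_income": 100_000},
    {"taxpayer_id": 1, "year": 1996, "gross_income": 120_000},
    {"taxpayer_id": 2, "year": 1996, "gross_income": 80_000},
]
for record in flatten(taxpayers, yearly_returns, [1995, 1996]):
    print(record)
```

Every yearly measure multiplies into one column per year, which is why the flat file grows wide quickly and why missing years have to be represented explicitly.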

Page 22:

SPSS Clementine Merge Stream

Page 23:

SPSS Clementine Merge Stream

Page 24:

Overspecification

• Model “too close to data”
• No problem generating rules to “predict” training sample with extreme accuracy
• Model’s predictive rules did not generalize particularly well to test sample
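A deliberately extreme toy example makes the symptom concrete: a "model" that stores one rule per exact combination of inputs reproduces its training sample perfectly and does much worse on new cases. Everything here is hypothetical (the memorizing approach, the three features, and the data); it is not the rule induction algorithm in Clementine, only an illustration of the training-versus-test gap.

```python
from collections import Counter


def fit_memorizer(training):
    """Store one rule per exact feature tuple seen in training (overspecified on purpose)."""
    rules = {features: band for features, band in training}
    # Fall back to the most common training band for unseen combinations.
    default = Counter(band for _, band in training).most_common(1)[0][0]
    return rules, default


def predict(model, features):
    rules, default = model
    return rules.get(features, default)


def accuracy(model, sample):
    hits = sum(predict(model, features) == band for features, band in sample)
    return hits / len(sample)


# (industry, ownership, in_state) -> recovery band; all values are made up.
training = [
    (("retail", "corp", True), 1),
    (("retail", "sole", True), 2),
    (("construction", "corp", False), 4),
    (("services", "corp", True), 1),
]
test = [
    (("retail", "corp", True), 2),
    (("services", "sole", True), 3),
    (("construction", "corp", False), 4),
    (("services", "corp", False), 1),
]

model = fit_memorizer(training)
print("training accuracy:", accuracy(model, training))
print("test accuracy:", accuracy(model, test))
```

Comparing the two accuracy figures on a held-out sample is the check that exposes a model sitting "too close to data".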

Page 25:

Results--Predicting 1998 Audit Recovery Band

Correct Band Predicted: 52%
Prediction off by 1 Band: 28%
Prediction off by 2 Bands: 19%
Prediction off by 3 Bands: 1%

“Results Positive but Modest…”
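A breakdown like the one on this slide can be computed by comparing predicted and actual band numbers and counting how far apart they are. The sketch below assumes bands are coded 1 to 4; the function name and the sample predictions are hypothetical and are not the pilot's data. From the reported figures, 52% + 28% = 80% of predictions landed within one band of the actual result.

```python
from collections import Counter


def band_error_distribution(predicted, actual):
    """Share of predictions that miss the actual band by 0, 1, 2, or 3 bands."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal in length")
    misses = Counter(abs(p - a) for p, a in zip(predicted, actual))
    # With four bands the largest possible miss is three bands.
    return {distance: misses[distance] / len(predicted) for distance in range(4)}


# Made-up band numbers for six audits.
predicted = [1, 2, 3, 4, 1, 4]
actual = [1, 3, 3, 2, 4, 4]
for distance, share in band_error_distribution(predicted, actual).items():
    print(f"off by {distance} band(s): {share:.0%}")
```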

Page 26:

Conclusions/Lessons Learned

Due to a number of limiting factors, the predictive power of the pilot model was positive but modest.

As a learning experience the pilot was an unquestioned success. A great deal of technical knowledge was acquired within the Department in a very short period of time. Some of the major lessons learned are as follows:

Page 27:

Conclusions/Lessons Learned

Optimal data structures for query software are definitely not optimal for mining software—a “two-tiered” approach to data warehousing will frequently be necessary.

The major part of data mining (possibly 85 to 95%) is data preparation and data cleansing.

Optimal use of mining software requires “perfect” data, structured with fillers for missing records and missing fields.
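The "fillers" requirement can be sketched as a padding step that guarantees every taxpayer has a row for every year and a value for every field. The helper name, the field names, and the choice of 0 as the filler are assumptions; the slides do not say which filler values the pilot used, and a filler of 0 cannot be told apart from a genuine zero unless a separate flag is kept.

```python
def pad_history(yearly_rows, years, fields, filler=0):
    """Return one complete row per year, using the filler for absent years and fields."""
    by_year = {row["year"]: row for row in yearly_rows}
    padded = []
    for year in years:
        row = by_year.get(year, {})  # a missing record becomes an empty row
        complete = {"year": year}
        for field in fields:
            value = row.get(field)  # a missing field comes back as None
            complete[field] = filler if value is None else value
        padded.append(complete)
    return padded


# Made-up history with one missing year (1994) and one missing field (1995 employees).
rows = [
    {"year": 1995, "total_wages": 50_000, "employees": None},
    {"year": 1996, "total_wages": 52_000, "employees": 3},
]
for row in pad_history(rows, [1994, 1995, 1996], ["total_wages", "employees"]):
    print(row)
```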

Page 28:

Conclusions/Lessons Learned

Despite the power of the modeling software, modeling is still a complicated process of structural design, analysis and experimentation.

While training is essential and limited use of outside consultants may be beneficial, the Department does have the technical capacity to do data mining in-house.

Page 29:

Conclusions/Lessons Learned

Data Mining is not a “magic bullet.” It requires a highly focused and structured approach. It is highly technical and resource intensive.

For appropriate applications, sophisticated Data Mining could be an extremely valuable and cost effective strategy for the Department.

Page 30:

Into the Realm of Budget Process…

$$$

FTE’s

Internal Politics
• Mining vs. Querying

External Politics (Gov’s Office, Legislature)
• Government Intrusiveness
• Politically Correct Terms

???????????