Differences Between Statistics and Data Mining


By Kathy Lange

From a business perspective, it doesn't really matter what you call it: statistics, data mining or predictive analytics. Competitive advantage comes from making better decisions faster and more confidently.

A deceptively simple question triggers lively debate among analytical professionals: What is the difference between statistics and data mining?

Wikipedia defines statistics as, "A mathematical science pertaining to collection, analysis, interpretation and presentation of data." Statistics draws valid conclusions and makes reasonable decisions on the basis of such analysis. Wikipedia further states that predictive analytics encompasses a variety of statistical techniques that process current and historical data in order to make predictions about future events.

I contend that data mining is a form of predictive analytics that uses a variety of techniques to explore massive amounts of data to identify relationships between hundreds of data elements - relationships that could not be uncovered through simple queries or reports. Data mining methodologies overlap with those in analytical disciplines such as statistics (simulation, principal components, Bayesian methods), forecasting (regression, time-series analysis) and operations research (clustering, neural networks, genetic algorithms).

Problems such as predicting customer behavior, identifying fraud and optimizing goods in a supply chain often require a combination of analytical disciplines, business knowledge and data management expertise to solve.

Where is Data Mining Being Used in Business Today?

Data mining has had its broadest success in the area of modeling customer behavior. Data mining techniques can be used to measure customer profitability, predict churn and acquisition rates, and model acquisition costs.

Leading retail firms use data mining to profile stores and merchandise to better align their customers' purchasing patterns with store inventory. Banks and telecommunications firms are targeting customers for additional products and services. Specialized data mining models are used by many financial institutions to grant or deny credit to applicants. These businesses benefit from more responsive and targeted interactions with customers and, ultimately, from higher profits and reduced risk.

Online retailer 1-800-Flowers.com uses a data-driven decision-making process for managing customer relationships. Collecting data at all customer contact points, the company turns that data into knowledge for understanding and anticipating customer behavior, meeting customer needs, building more profitable customer relationships and gaining a holistic view of a customer's lifetime value.

In order to increase response rates and identify profitable customers, 1-800-Flowers.com relies on data mining technologies to discover trends, explain outcomes and predict results. Because the company is able to access better customer information, it has reduced the amount of time it needs to spend on the phone with its customers.

CIO Enzo Micali views information technology as an invaluable element of 1-800-Flowers.com's corporate success. The company has a multitiered information delivery framework that puts strategic information directly into the hands of business users. Micali explains, "The decision process for CRM [customer relationship management] permeates our entire organization - on the back end gathering data from multiple operational systems and on the front end using the data to make better, more reliable decisions. Customer data, accessible through our company intranet, can be securely viewed at many different levels, including departmental views, which present data for unique divisional needs and common views which show a general snapshot of customers, including order history and household data across the whole family of our brands."

Why are Businesses Turning to Predictive Analytics?

Two primary drivers have emerged: competitive advantage and compliance. Businesses need to be more nimble in reacting to changes in their environment, and many believe that a data-driven decision-making process that includes predictive analytics will enable high-quality, consistent, repeatable and auditable decisions.

Professor Tom Davenport, the director of research for the School of Executive Education at Babson College, recently published the results of a research study titled "Competing on Analytics," based on discussions with C-level executives and directors at more than 30 industry-leading and globally competitive organizations. "The net takeaway of the study is this: The ability to make business decisions based on tightly focused, fact-based analysis is emerging as a measurable competitive edge in the global economy," Davenport says. "Organizations that fail to invest in the proper analytic technologies will be unable to compete in a fact-based [data-driven] business environment."

32 December 2006 DM Review www.dmreview.com

What is Data-Driven Decision-Making?

Data-driven decision-making is a process that requires collaboration and a variety of skills across all levels of the enterprise. Predictive modeling is only a small piece of the process. A large piece of the process revolves around the data: data acquisition, data quality, data manipulation and data distribution. In fact, good decisions cannot be made without reliable, high-quality data.

The Data Story

Figure 1: Steps to Data-Driven Decision-Making

From my perspective, the most difficult part of the process is the mathematical formulation of a model that describes the problem you are trying to solve. Often, the analytic methods used will depend on what data is available. Working together, IT and the analytic teams need to identify where the data resides within the organization and what format it is in (relational data tables, spreadsheets, enterprise resource planning [ERP] systems). Are there multiple instances of the data that don't match? Is the data complete? Is additional data from external sources (demographic or socioeconomic data) needed?
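The questions about completeness and mismatched copies can be checked mechanically once two sources are keyed the same way. The following is a minimal sketch, assuming two hypothetical extracts (`crm` and `billing`) held as dictionaries keyed by customer ID; the field names and values are invented for illustration and do not come from any system described in this article.

```python
# Hypothetical customer records held in two systems (illustrative data only).
crm = {
    "C001": {"segment": "gold", "postcode": "10001"},
    "C002": {"segment": "silver", "postcode": None},
    "C003": {"segment": "gold", "postcode": "60601"},
}
billing = {
    "C001": {"segment": "gold", "postcode": "10001"},
    "C002": {"segment": "bronze", "postcode": "94105"},
}


def completeness(records):
    """Return the share of non-missing values for each field."""
    fields = sorted({field for rec in records.values() for field in rec})
    total = len(records)
    return {
        field: sum(1 for rec in records.values() if rec.get(field) is not None) / total
        for field in fields
    }


def mismatches(left, right):
    """List fields that disagree for customers present in both sources."""
    found = []
    for key in sorted(left.keys() & right.keys()):
        for field in sorted(left[key].keys() & right[key].keys()):
            a, b = left[key][field], right[key][field]
            # A missing value is a completeness problem, not a mismatch.
            if a is not None and b is not None and a != b:
                found.append((key, field, a, b))
    return found


def missing_keys(left, right):
    """Customers that appear in `left` but not in `right`."""
    return sorted(left.keys() - right.keys())


print(completeness(crm))
print(mismatches(crm, billing))
print(missing_keys(crm, billing))
```

In practice the same checks would more likely run as queries against the source systems; the point of the sketch is only that each question in the paragraph maps to a small, repeatable test.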

Doing exploratory data analysis on a subset of the data and examining the metadata is a common practice for understanding the data. Summary statistics and visualization can be key methods for identifying anomalies in the data that need to be addressed prior to a more in-depth modeling exercise. Data may need to be converted or transformed for use in predictive modeling. Measurement data may need to be standardized. Individual transactions may need to be summarized into new variables representing rates, counts or indicators. Data may need to be reformatted from product or transactional data into customer-focused data. Assumptions about the underlying distribution of the data need to be tested for statistical validity.
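Two of these transformations are easy to show concretely: standardizing a measurement and rolling individual transactions up into customer-level counts, rates and indicators. This is a minimal sketch using only the Python standard library; the transaction layout (`customer_id`, `amount`, `was_returned`) and the derived variable names are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing to scale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical transaction rows: (customer_id, amount, was_returned)
transactions = [
    ("C001", 40.0, False),
    ("C001", 60.0, True),
    ("C002", 25.0, False),
    ("C001", 20.0, False),
]


def summarize_by_customer(rows):
    """Turn transaction-level rows into one summary record per customer."""
    grouped = defaultdict(list)
    for customer_id, amount, was_returned in rows:
        grouped[customer_id].append((amount, was_returned))

    summary = {}
    for customer_id, items in grouped.items():
        order_count = len(items)
        returns = sum(1 for _, returned in items if returned)
        summary[customer_id] = {
            "order_count": order_count,            # a count
            "total_spend": sum(a for a, _ in items),
            "return_rate": returns / order_count,  # a rate
            "has_returned": returns > 0,           # an indicator
        }
    return summary


customer_view = summarize_by_customer(transactions)
spend_z = standardize([rec["total_spend"] for rec in customer_view.values()])
```

The result, `customer_view`, is the customer-focused table the paragraph describes: one row per customer rather than one row per transaction.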

In the predictive modeling phase, trade-offs need to be considered between the speed of modeling, the accuracy of the model and how easily it is understood. Business users need to trust the results of the analysis, regardless of their knowledge of analytical methods. Many software packages provide only a few simple methods with limited options, while others provide a wide variety. In general, more flexible modeling strategies lead to better predictions, which impact bottom-line revenue.

No single method works best in all cases. One widely accepted strategy is to try all the most common modeling methods (decision trees, regression and neural networks) and compare them to determine the best model. A common criterion for evaluation is a comparison of the expected profits or losses to actual profits or losses obtained from model results. This criterion enables you to make cross-model comparisons and assessments independent of all other factors.
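One way to read the profit criterion is sketched below. It assumes each candidate model has already produced a response probability for every case in a holdout set, and that the business has put a dollar value on acting correctly and incorrectly. The gain, the cost, the scores and the outcomes are all invented for illustration; a real comparison would use scores from fitted models on held-out data.

```python
GAIN_IF_RESPONDS = 50.0  # assumed profit when a contacted customer responds
COST_IF_NOT = 5.0        # assumed loss when a contacted customer does not


def profit_report(probabilities, outcomes):
    """Return (expected_profit, actual_profit) for one model.

    A case is acted on only when its expected profit is positive.
    """
    expected = 0.0
    actual = 0.0
    for p, responded in zip(probabilities, outcomes):
        case_expected = p * GAIN_IF_RESPONDS - (1 - p) * COST_IF_NOT
        if case_expected <= 0:
            continue  # the model's score does not justify acting
        expected += case_expected
        actual += GAIN_IF_RESPONDS if responded else -COST_IF_NOT
    return expected, actual


# Hypothetical holdout outcomes and per-model scores for the same five cases.
outcomes = [True, False, False, True, False]
model_scores = {
    "decision_tree": [0.5, 0.5, 0.0, 0.5, 0.0],
    "regression": [0.8, 0.2, 0.05, 0.6, 0.2],
    "neural_network": [0.9, 0.05, 0.05, 0.7, 0.05],
}

reports = {name: profit_report(scores, outcomes) for name, scores in model_scores.items()}

# Pick the model with the highest actual profit on the holdout cases.
best = max(reports, key=lambda name: reports[name][1])

for name, (expected, actual) in reports.items():
    print(f"{name}: expected={expected:.2f} actual={actual:.2f}")
print("best by actual profit:", best)
```

Because every model is scored in the same currency on the same cases, the comparison does not depend on which modeling method produced the probabilities. A large gap between expected and actual profit for a model is itself a warning that its probabilities are poorly calibrated.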

Delivering the output from the best model to the business user is a key consideration for IT staff. Output from the models can be sophisticated or simple. Output may be fed programmatically into real-time systems, such as database engines, message queues or Web services, triggering real-time alerts or product recommendation offers to call center staff. Alternatively, a set of reports (documents, spreadsheets or presentations) could be generated either statically or dynamically on demand in a Web portal or a dashboard. Ultimately, the information needs to be accessible where and when it is needed, in a context relevant to the decision-maker.

Data-driven decision-making can be used throughout the enterprise to model customer, supplier and operational processes. The models are corporate assets that may have significant financial impact, particularly in the areas of marketing, risk assessment and operations. They must be continually assessed and validated for their accuracy over time.

IT staff will be tasked with managing the data and models throughout the lifecycle (development, test/stage, deploy, track, retire), including version control and change management for audit reporting purposes.

Storing model packages with their metadata allows automated model scheduling, including exception reports and model tracking reports. A common metadata repository provides the ability to perform impact analysis - to analyze and evaluate changes in data definitions or model specifications across the organization before an actual change breaks existing applications.
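Impact analysis of this kind reduces to a lookup once each model's inputs are recorded alongside it. The sketch below assumes a hypothetical in-memory registry; the model names, input fields and the choice to skip retired models are illustrative assumptions, and a production repository would be a shared database or a vendor tool, not a dictionary.

```python
# Hypothetical metadata repository: model name -> lifecycle stage and input fields.
model_registry = {
    "churn_v3": {"stage": "deploy", "inputs": {"tenure_months", "return_rate"}},
    "credit_v1": {"stage": "test/stage", "inputs": {"income", "tenure_months"}},
    "offer_v2": {"stage": "retire", "inputs": {"order_count"}},
}


def impacted_models(registry, changed_field, ignore_stages=("retire",)):
    """Names of models that read `changed_field`, skipping ignored stages."""
    return sorted(
        name
        for name, meta in registry.items()
        if changed_field in meta["inputs"] and meta["stage"] not in ignore_stages
    )


# Which models must be reviewed before the definition of tenure_months changes?
print(impacted_models(model_registry, "tenure_months"))
```

Running the lookup before a data definition changes gives IT a review list, which is the "before an actual change breaks existing applications" step described above.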

Of course, decision-making is an ongoing cycle. Information gleaned from one iteration of the cycle should be fed back into the process to make it better the next time.

Data mining and statistics are powerful tools that enable organizations to make more structured, repeatable decisions. The decision-making process begins with data access, data exploration and transformation, followed by predictive modeling. The process concludes with the delivery of information to the decision-makers throughout the enterprise, enabling them to take action. From a business perspective, it doesn't really matter what you call it: statistics, data mining or predictive analytics. Competitive advantage comes from making better decisions faster and more confidently.

Kathy Lange is a senior business director for SAS Analytical Consulting. She may be reached at [email protected].
