
PART I

HISTORY OF PHASES OF DATA ANALYSIS, BASIC THEORY, AND THE DATA MINING PROCESS

Part I focuses on the historical and theoretical background for statistical analysis and data mining, and integrates it with the data discovery and data preparation operations necessary to prepare for modeling. Part II presents some basic algorithms and application areas where data mining technology is commonly used. Part III is not a set of chapters, but is rather a group of tutorials you can follow to learn data mining by example. In fact, you don't even have to read the chapters in the other parts at first. You can start with a tutorial in an area of your choice (if you have the tool used in that tutorial) and learn how to create a model successfully in that area. Later, you can return to the text to learn why the various steps were included in the tutorial and understand what happened behind the scenes when you performed them. The third group of chapters, in Part IV, leads you into some advanced data mining areas, where you will learn how to create a "good-enough" model and avoid the most common (and sometimes devastating) mistakes of data mining practice.


CHAPTER 1

The Background for Data Mining Practice

OUTLINE

Preamble
A Short History of Statistics and Data Mining
Modern Statistics: A Duality?
Two Views of Reality
The Rise of Modern Statistical Analysis: The Second Generation
Machine Learning Methods: The Third Generation
Statistical Learning Theory: The Fourth Generation
Postscript

PREAMBLE

You must be interested in learning how to practice data mining; otherwise, you would not be reading this book. We know that there are many books available that will give a good introduction to the process of data mining. Most books on data mining focus on the features and functions of various data mining tools or algorithms. Some books do focus on the challenges of performing data mining tasks. This book is designed to give you an introduction to the practice of data mining in the real world of business.

One of the first things considered in building a business data mining capability in a company is the selection of the data mining tool. It is difficult to penetrate the hype erected around the description of these tools by the vendors. The fact is that even the most mediocre of data mining tools can create models that are at least 90% as good as the best tools. A 90% solution performed with a relatively cheap tool might be more cost effective in your organization than a more expensive tool. How do you choose your data mining tool?


Few reviews are available. The best listing of tools by popularity is maintained and updated yearly by KDNuggets.com. Some detailed reviews available in the literature go beyond just a discussion of the features and functions of the tools (see Nisbet, 2006, Parts 1–3). The interest in an unbiased and detailed comparison is great. We are told the "most downloaded document in data mining" is the comprehensive but decade-old tool review by Elder and Abbott (1998).

The other considerations in building a business's data mining capability are forming the data mining team, building the data mining platform, and forming a foundation of good data mining practice. This book will not discuss the building of the data mining platform. This subject is discussed in many other books, some in great detail. A good overview of how to build a data mining platform is presented in Data Mining: Concepts and Techniques (Han and Kamber, 2006). The primary focus of this book is to present a practical approach to building cost-effective data mining models aimed at increasing company profitability, using tutorials and demo versions of common data mining tools.

Just as important as these considerations in practice is the background against which they must be performed. We must not imagine that the background doesn't matter . . . it does matter, whether or not we recognize it initially. The reason it matters is that the capabilities of statistical and data mining methodology were not developed in a vacuum. Analytical methodology was developed in the context of prevailing statistical and analytical theory. But the major driver in this development was a very pressing need to provide a simple and repeatable analysis methodology in medical science. From this beginning developed modern statistical analysis and data mining. To understand the strengths and limitations of this body of methodology and use it effectively, we must understand the strengths and limitations of the statistical theory from which they developed. This theory was developed by scientists and mathematicians who "thought" it out. But this thinking was not one-sided or unidirectional; there arose several views on how to solve analytical problems. To understand how to approach the solving of an analytical problem, we must understand the different ways different people tend to think. This history of the statistical theory behind the development of various statistical techniques bears strongly on the ability of a technique to serve the tasks of a data mining project.

A SHORT HISTORY OF STATISTICS AND DATA MINING

Analysis of patterns in data is not new. The concepts of average and grouping can be dated back to the 6th century BC in Ancient China, following the invention of the bamboo rod abacus (Goodman, 1968). In Ancient China and Greece, statistics were gathered to help heads of state govern their countries in fiscal and military matters. (This makes you wonder if the words statistic and state might have sprung from the same root.) In the sixteenth and seventeenth centuries, games of chance were popular among the wealthy, prompting many questions about probability to be addressed to famous mathematicians (Fermat, Leibniz, etc.). These questions led to much research in mathematics and statistics during the ensuing years.


MODERN STATISTICS: A DUALITY?

Two branches of statistical analysis developed in the eighteenth century: Bayesian and classical statistics. (See Figure 1.1.) To treat both fairly in the context of history, we will consider both in the First Generation of statistical analysis. For the Bayesians, the probability of an event's occurrence is proportional to the probability assigned to it before the data were seen (the prior, based on past occurrence) times the likelihood of the evidence observed. Analysis proceeds based on the concept of conditional probability: the probability of an event occurring given that another event has already occurred. Bayesian analysis begins with the quantification of the investigator's existing state of knowledge, beliefs, and assumptions. These subjective priors are combined with observed data quantified probabilistically through an objective function of some sort. The classical statistical approach (which flowed out of the mathematical works of Gauss and Laplace) considered that the joint probability, rather than the conditional probability, was the appropriate basis for analysis. The joint probability function expresses the probability that, simultaneously, X takes the specific value x and Y takes the value y, as a function of x and y.
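
As a point of reference (not part of the original text), the two viewpoints can be written in standard probability notation. Bayes' rule relates the conditional probability of a hypothesis H given evidence E to the prior and the likelihood, while the joint probability treats the two events symmetrically:

    \[ P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}, \qquad P(X = x,\, Y = y) = P(X = x \mid Y = y)\,P(Y = y) \]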

Interest in probability picked up early among biologists following Mendel in the latter part of the nineteenth century. Sir Francis Galton, founder of the School of Eugenics in England, and his successor, Karl Pearson, developed the concepts of regression and correlation for analyzing genetic data. Later, Pearson and colleagues extended their work to the social sciences. Following Pearson, Sir R. A. Fisher in England developed his system for inference testing in medical studies based on his concept of standard deviation. While the development of probability theory flowed out of the work of Galton and Pearson, early predictive methods followed Bayes's approach. Bayesian approaches to inference testing could lead to widely different conclusions by different medical investigators because they used different sets of subjective priors. Fisher's goal in developing his system of statistical inference was to provide medical investigators with a common set of tools for use in comparison studies of effects of different treatments by different investigators. But to make his system work even with large samples, Fisher had to make a number of assumptions to define his "Parametric Model."

FIGURE 1.1 Rev. Thomas Bayes (1702–1761).


Assumptions of the Parametric Model

1. Data Fits a Known Distribution (e.g., Normal, Logistic, Poisson, etc.)

Fisher's early work was based on calculation of the parameter standard deviation, which assumes that data are distributed in a normal distribution. The normal distribution is bell-shaped, with the mean (average) at the top of the bell and "tails" falling off evenly at the sides. Standard deviation is, loosely speaking, the "average" deviation of a value from the mean: it is calculated as the square root of the sum of the squared deviations divided by the total number of values minus one (n − 1). Subtracting one expresses (to some extent) the increased uncertainty introduced because the mean itself must be estimated from the same sample. Subsequent developments used modified parameters based on the logistic or Poisson distributions. The assumption of a particular known distribution is necessary in order to draw upon the characteristics of the distribution function for making inferences. All of these parametric methods run the gauntlet of dangers related to force-fitting data from the real world into a mathematical construct that does not fit.
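
As a minimal illustration (the data values below are invented), the sample standard deviation with the n − 1 divisor can be computed directly with nothing but the standard library:

    import math

    def sample_std(values):
        # Sample standard deviation: square root of the sum of squared
        # deviations from the mean, divided by (n - 1).
        n = len(values)
        mean = sum(values) / n
        sum_sq_dev = sum((v - mean) ** 2 for v in values)
        return math.sqrt(sum_sq_dev / (n - 1))

    print(sample_std([4.0, 7.0, 6.0, 5.0, 8.0]))  # illustrative values only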

2. Factor Independency

In parametric predictive systems, the variable to be predicted (Y) is considered as a function of predictor variables (X's) that are assumed to have independent effects on Y. That is, the effect on Y of each X-variable is not dependent on the effects on Y of any other X-variable. This situation could be created in the laboratory by allowing only one factor (e.g., a treatment) to vary, while keeping all other factors constant (e.g., temperature, moisture, light, etc.). But in the real world, such laboratory control is absent. As a result, some factors that do affect other factors are permitted to have a joint effect on Y. This problem is called collinearity. When it occurs between more than two factors, it is termed multicollinearity. The multicollinearity problem led statisticians to use an interaction term in the relationship that supposedly represented the combined effects. This interaction term functioned as a magnificent kluge, and the reality of its effects was seldom analyzed. Later development included a number of interaction terms, one for each interaction the investigator might wish to represent.
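
A hedged sketch of the idea (the variable names and data are invented for illustration): an interaction term is simply an extra column formed by multiplying two predictors, appended to an ordinary least-squares fit.

    import numpy as np

    # Invented data: y depends on x1, x2, and their joint (interaction) effect.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = rng.normal(size=200)
    y = 2.0 * x1 + 1.0 * x2 + 1.5 * (x1 * x2) + rng.normal(scale=0.1, size=200)

    # Design matrix: intercept, the two main effects, and the interaction term.
    X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coefs)  # approximately [0, 2.0, 1.0, 1.5]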

3. Linear Additivity

Not only must the X-variables be independent, but their effects on Y must also be cumulative and linear. That means the effect of each factor is added to or subtracted from the combined effects of all X-variables on Y. But what if the relationship between Y and the predictors (X-variables) is not additive, but multiplicative or divisive? Such relationships can be expressed only by exponential or power-function equations that usually generate very nonlinear relationships. Assuming linear additivity for these relationships may cause large errors in the predicted outputs. This is often the case with their use in business data systems.
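
As a generic worked example of the point (not drawn from the text), a multiplicative relationship violates linear additivity, but taking logarithms restores an additive form:

    \[ Y = a\,X_1^{b_1} X_2^{b_2} \quad\Longrightarrow\quad \log Y = \log a + b_1 \log X_1 + b_2 \log X_2 \]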

4. Constant Variance (Homoscedasticity)

The variance throughout the range of each variable is assumed to be constant. This means that if you divided the range of a variable into bins, the variance across all records for bin #1 would be the same as the variance for all the other bins in the range of that variable. If the variance throughout the range of a variable differs significantly from constancy, it is said to be heteroscedastic. The error in the predicted value caused by the combined heteroscedasticity among all variables can be quite significant.
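
A minimal sketch of the binning check just described, using invented data (the variable, the noise model, and the bin count are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=1000)
    # Heteroscedastic example: the noise grows with x.
    y = 3.0 * x + rng.normal(scale=0.5 + 0.3 * x)

    bins = np.linspace(0, 10, 6)          # five bins across the range of x
    labels = np.digitize(x, bins[1:-1])   # bin index for each record
    for b in range(5):
        print(f"bin {b}: variance of y = {y[labels == b].var():.2f}")
    # Roughly equal variances would suggest homoscedasticity; here they increase with x.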

5. Variables Must Be Numerical and Continuous

The assumption that variables must be numerical and continuous means that data must be numeric (or must be transformable to a number before analysis) and the number must be part of a distribution that is inherently continuous. Integer values in a string are not continuous; they are discrete. Classical parametric statistical methods are not valid for use with discrete data, because the probability distributions for continuous and discrete data are different. But both scientists and business analysts have used them anyway.

In his landmark paper, Fisher (1921; see Figure 1.2) began with the broad definition of probability as the intrinsic probability of an event's occurrence divided by the probability of occurrence of all competing events (very Bayesian). By the end of his paper, Fisher had modified his definition of probability for use in medical analysis (the goal of his research) to the intrinsic probability of an event's occurrence, period. He named this quantity likelihood. From that foundation, he developed the concept of standard deviation based on the normal distribution. Those who followed Fisher began to refer to likelihood as probability. The concept of likelihood approaches the classical concept of probability only as the sample size becomes very large and the effects of subjective priors approach zero (von Mises, 1957). In practice, these two conditions may be satisfied sufficiently if the initial distribution of the data is known and the sample size is relatively large (following the Law of Large Numbers).
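
A hedged numerical illustration of that convergence (the prior and the data are invented): with a Beta prior on a proportion, the posterior mean is pulled toward the observed frequency as the sample grows, so the influence of the subjective prior fades.

    # Beta(a, b) prior for a proportion; after k successes in n trials the
    # posterior is Beta(a + k, b + n - k), with mean (a + k) / (a + b + n).
    a, b = 8.0, 2.0           # a strongly opinionated (invented) prior, mean 0.8
    true_rate = 0.3
    for n in (10, 100, 10_000):
        k = round(true_rate * n)             # idealized observed successes
        posterior_mean = (a + k) / (a + b + n)
        print(n, round(posterior_mean, 3))   # approaches 0.3 as n grows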

Why did this duality of thought arise in the development of statistics? Perhaps it is because of the broader duality that pervades all of human thinking. This duality can be traced all the way back to the ancient debate between Plato and Aristotle.

FIGURE 1.2 Sir Ronald Fisher.


TWO VIEWS OF REALITY

Whenever we consider solving a problem or answering a question, we start by conceptualizing it. That means we do one of two things: (1) try to reduce it to key elements or (2) try to conceive of it in general terms. We call people who take each of these approaches "detail people" and "big picture people," respectively. What we don't consider is that this distinction has its roots deep in Greek philosophy, in the works of Aristotle and Plato.

Aristotle

Aristotle (Figure 1.3) believed that the true being of things (reality) could be discerned only by what the eye could see, the hand could touch, etc. He believed that the highest level of intellectual activity was the detailed study of the tangible world around us. Only in that way could we understand reality. Based on this approach to truth, Aristotle was led to believe that you could break down a complex system into pieces, describe the pieces in detail, put the pieces together, and understand the whole. For Aristotle, the "whole" was equal to the sum of its parts. This view of the whole was, for Aristotle, very machine-like.

Science gravitated toward Aristotle very early. The nature of the world around us was studied by looking very closely at the physical elements and biological units (species) that composed it. As our understanding of the natural world matured into the concept of the ecosystem, it was discovered that many characteristics of ecosystems could not be explained by traditional (Aristotelian) approaches. For example, in the science of forestry, we discovered that when a tropical rain forest is cut down on the periphery of its range, it may take a very long time to regenerate (if it does at all). We learned that the reason for this is that in areas of relative stress (e.g., peripheral areas), the primary characteristics necessary for the survival and growth of tropical trees are maintained by the forest itself! High rainfall leaches nutrients down beyond the reach of the tree roots, so almost all of the nutrients for tree growth must come from recently fallen leaves and branches.

FIGURE 1.3 Aristotle before the bust of Homer.


When you cut down the forest, you remove that source of nutrients. The forest canopy also maintains favorable conditions of light, moisture, and temperature required by the trees. Removing the forest removes the very factors necessary for it to exist at all in that location. These factors emerge only when the system is whole and functioning. Many complex systems are like that, even business systems. In fact, these emergent properties may be the major drivers of system stability and predictability.

To understand the failure of Aristotelian philosophy to completely define the world, we must return to Ancient Greece and consider Aristotle's rival, Plato.

Plato

Plato (Figure 1.4) was Aristotle's teacher for 20 years, and they both agreed to disagree on the nature of being. While Aristotle focused on describing tangible things in the world by detailed studies, Plato focused on the world of ideas that lay behind these tangibles. For Plato, the only thing that had lasting being was an idea. He believed that the most important things in human existence were beyond what the eye could see and the hand could touch. Plato believed that the influence of ideas transcended the world of tangible things that commanded so much of Aristotle's interest. For Plato, the "whole" of reality was greater than the sum of its tangible parts.

The concept of the nature of being was developed initially in Western thinking upon a Platonic foundation. Platonism ruled philosophy for over 2,000 years, up to the Enlightenment. Then the tide of Western thinking turned toward Aristotle. This division of thought on the nature of reality is reflected in many of our attempts to define the nature of reality in the world, sometimes unconsciously so. We speak of the difference between "big picture people" and "detail people"; we contrast "top-down" approaches to organization with "bottom-up" approaches; and we compare "left-brained" people with "right-brained" people. These dichotomies of perception are little more than a rehash of the ancient debate between Plato and Aristotle.

FIGURE 1.4 Plato.


THE RISE OF MODERN STATISTICAL ANALYSIS: THE SECOND GENERATION

In the 1980s, it became obvious to statistical mathematicians that the rigorously Aristotelian approach of the past was too restrictive for analyzing highly nonlinear relationships in large data sets in complex systems of the real world. Mathematical research continued dominantly along Fisherian statistical lines by developing nonlinear versions of parametric methods. Multiple curvilinear regression was one of the earliest approaches for accounting for nonlinearity in continuous data distributions. But many nonlinear problems involved discrete rather than continuous distributions (see Agresti, 1996). These methods included the following:

• Logit Model (including Logistic Regression): Data are assumed to follow a logistic distribution, and the dependent variable is categorical (e.g., 1:0). In this method, the probability of the dependent variable (Y) is defined through an exponential (logistic) function of the predictor variables (X's). As such, this relationship can account for nonlinearities in the response of the Y-variable to the X-variables, but not in the interactions between the X-variables. (A minimal numeric sketch of the logistic link follows this list.)

• Probit Model (and, analogously, Poisson Regression): Like the Logit Model, except that the data are assumed to follow a normal (Gaussian) distribution in the Probit Model, and a Poisson distribution in Poisson regression.

• The Generalized Linear Model (GLM): The GLM expands the general estimation equation used in prediction, Y = f(X), where f is some function and X is a vector of predictor variables. In the GLM, the random component is the probability distribution assumed for Y, the systematic (deterministic) component is the linear combination of the X-variables, and a link function connects the expected value of Y to that linear combination; the plain equal sign of ordinary regression corresponds to the identity link. Statisticians recognized that the link could be a nonlinear function (like the logistic function) even though the combination of the X-variables remained linear. Now mathematicians had a framework for defining functions that could fit data sets with much more nonlinearity. But it would be left to the development of neural networks (see the following text) to express functions with any degree of nonlinearity.
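
The sketch promised above: a minimal, hedged illustration of the logistic (logit) link, which maps a linear combination of predictors onto a probability between 0 and 1. The intercept, coefficients, and input values are invented for illustration.

    import numpy as np

    def logistic(z):
        # Logistic (inverse logit) function: maps any real number into (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    # Invented intercept and coefficients for two predictors.
    b0, b1, b2 = -1.0, 0.8, -0.5
    x1, x2 = 2.0, 1.5

    linear_combination = b0 + b1 * x1 + b2 * x2  # systematic part: still linear in the X's
    p = logistic(linear_combination)             # the logistic link turns it into a probability
    print(round(p, 3))                           # predicted probability that Y = 1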

While these developments were happening in the Fisherian world, a stubborn group of Bayesians continued to push their approach. To the Bayesians, the practical significance (related to what happened in the past) is more significant than the statistical significance calculated from joint probability functions. For example, the practical need to correctly diagnose cancerous tumors (true-positives) is more important than the error of misdiagnosing a tumor as cancerous when it is not (false-positives). To this extent, their focus was rather Platonic, relating correct diagnosis to the data environment from which any particular sample was drawn, rather than just to the data of the sample alone. To serve this practical need, they had to ignore the fact that you can consider only the probability of events that actually happened in the past data environment, not the probability of events that could have happened but did not (Lee, 1989).

In Fisherian statistics, the observation and the corresponding alpha error determine whether it is different from what is expected (Newton and Rudestam, 1999). The alpha error is the probability of being wrong when you think you are right (concluding that a difference exists when it does not), while the beta error is the probability of missing a difference that really is there (accepting that there is no difference when one exists). Fisherians set the alpha error at the beginning of the analysis and refer to significant differences between data populations in terms of the alpha error that was specified. Fisherians would add a suffix to their prediction, such as ". . . at the 95% Confidence Level." The Confidence Level (95% in this case) is the complement of the alpha error (0.05). It means that the investigator is willing to be wrong 5% of the time. Fisherians use the beta error to calculate the "power" or "robustness" of an analytical test. Bayesians feel free to twiddle with both the alpha and beta errors and contend that you cannot arrive at a true decision without considering the alternatives carefully. They maintain that a calculated probability level of 0.023 for a given event in the sample data does not imply that the probability of the event within the entire universe of events is 0.023.
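
A hedged sketch of how the alpha error is used in practice. The samples are invented, and scipy's two-sample t-test is used only as a generic stand-in for a Fisherian comparison test:

    from scipy import stats

    treatment = [5.1, 6.0, 5.8, 6.4, 5.9, 6.2]  # invented measurements
    control = [4.8, 5.0, 5.2, 4.9, 5.1, 5.3]

    alpha = 0.05  # chosen before the analysis (95% Confidence Level)
    t_stat, p_value = stats.ttest_ind(treatment, control)
    if p_value < alpha:
        print(f"p = {p_value:.4f}: significant at the 95% Confidence Level")
    else:
        print(f"p = {p_value:.4f}: not significant at this alpha")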

Which approach is right, Fisherian or Bayesian? The answer depends on the nature of the study, the possibility of considering priors, and the relative costs of false-positive and false-negative errors. Before selecting one, we must bear in mind that all statistical tests have advantages and disadvantages. We must be informed about the strengths and weaknesses of both approaches and have a clear understanding of the meaning of the results produced by either one. Regardless of its problems and its "bad press" among the Fisherians, Bayesian statistics eventually did find its niche in the developing field of data mining in business, in the form of Bayesian Belief Networks and Naïve Bayes Classifiers. In business, success in practical applications depends to a great degree upon the analysis of all viable alternatives; nonviable alternatives aren't worth considering. One of the tutorials on the enclosed DVD uses a Naïve Bayes Classifier algorithm.
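
The tutorial itself uses the tool supplied on the DVD; purely as an illustrative stand-in, a Naïve Bayes classifier can be sketched with scikit-learn (the customer records here are invented):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    # Invented two-feature customer records (age, income) and a binary outcome.
    X = np.array([[25, 40_000], [47, 82_000], [35, 60_000],
                  [52, 95_000], [23, 35_000], [44, 78_000]])
    y = np.array([0, 1, 0, 1, 0, 1])

    model = GaussianNB().fit(X, y)
    print(model.predict([[30, 50_000]]))        # predicted class for a new record
    print(model.predict_proba([[30, 50_000]]))  # posterior class probabilities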

Data, Data Everywhere . . .

The crushing practical needs of business to extract knowledge from data that could be leveraged immediately to increase revenues required new analytical techniques that enabled analysis of highly nonlinear relationships in very large data sets with an unknown distribution. Development of new techniques followed three paths rather than the two classical paths described previously. The third path (machine learning) might be viewed as a blend of the Aristotelian and Platonic approach to truth, but it was not Bayesian.

MACHINE LEARNING METHODS: THE THIRD GENERATION

The line of thinking known as machine learning arose out of the Artificial Intelligence community in the quest for the intelligent machine. Initially, these methods followed two parallel pathways of development: artificial neural networks and decision trees.

Artificial Neural Networks. The first pathway sought to express a nonlinear function directly (the "cause") by assigning weights to the input variables, accumulating their effects, and "reacting" to produce an output value (the "effect") according to some sort of decision function. These systems (artificial neural networks) represented simple analogs of the way the human brain works by passing neural impulses from neuron to neuron across synapses. The "resistance" in the transmission of an impulse between two neurons in the human brain is variable. The complex relationship of neurons and their associated synaptic connections is "trainable" and could "learn" to respond faster as required by the brain. Computer scientists began to express this sort of system in very crude terms in the form of an artificial neural network that could be used to learn how to recognize complex patterns in the input variables of a data set.
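
A minimal sketch of the idea just described (the weights, inputs, and choice of activation are all invented): a single artificial neuron weights its inputs, accumulates them, and "reacts" through a decision function.

    import numpy as np

    def neuron(inputs, weights, bias):
        # Weighted sum of the inputs plus a bias, passed through a simple
        # decision (activation) function, here a sigmoid.
        z = np.dot(inputs, weights) + bias
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, 1.2, -0.3])  # input variables (invented)
    w = np.array([0.8, -0.4, 1.5])  # connection weights (invented)
    print(neuron(x, w, bias=0.1))   # output value between 0 and 1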

Decision Trees. The second pathway of development was concerned with expressing the effects directly by developing methods to find "rules" that could be evaluated for separating the input values into one of several "bins," without having to express the functional relationship directly. These methods focused on expressing the rules explicitly (rule induction) or on expressing the relationship among the rules (the decision tree) that produced the results. These methods avoided the strictures of the Parametric Model and were well suited for analysis of nonlinear events (NLEs), both in terms of the combined effects of the X-variables on the Y-variable and in terms of the interactions between the independent variables. While decision trees and neural networks could express NLEs more completely than parametric statistical methods, they were still intrinsically linear in their aggregation functions.
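
A hedged illustration of rule induction by a decision tree, again using scikit-learn only as a generic stand-in (the data are invented):

    from sklearn.tree import DecisionTreeClassifier, export_text

    X = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]  # invented binary attributes
    y = [1, 0, 1, 0, 1, 0]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree))  # the induced "rules", printed as a tree of splits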

STATISTICAL LEARNING THEORY: THE FOURTH GENERATION

Logistic regression techniques can account for the combined effects of interaction among all predictor variables by virtue of the nonlinear function that defines the dependent variable (Y). Yet there are still significant limitations to these linear learning machines (see Minsky and Papert, 1969). Even neural nets and decision trees suffered from this problem, to some extent. One way of expressing these limitations is to view them according to their "hypothesis space." The hypothesis space is a mathematical construct within which a solution is sought. But this space of possible solutions may be highly constrained by the linear functions in classical statistical analysis and machine learning techniques. Complex problems in the real world may require much more expressive hypothesis spaces than can be provided by linear functions (Cristianini and Shawe-Taylor, 2000). Multilayer neural nets can account for much more of the nonlinear effects by virtue of the network architecture and error minimization techniques (e.g., back-propagation).

An alternative approach is to arrange data points into vectors (like rows in a customer record). Such vectors are composed of elements (one for each attribute in the customer record). The vector space of all rows of customers in a database can be characterized, conceptually and mathematically, as a space with N dimensions, where N is the number of customer attributes (predictive variables). When you view the data in a customer record as a vector, you can take advantage of linear algebra concepts, one of which is that you can express all of the differences between the attributes of two customer records by calculating the dot product (or inner product). The dot product of two vectors is the sum of all the products between corresponding attributes of the two vectors. Consequently, we can express our data as a series of dot products composed into an inner product space with N dimensions. Conversion of our data into inner products is referred to as "mapping" the data to inner product space.
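
In symbols (a standard definition, not specific to this book), the dot (inner) product of two N-dimensional customer vectors x and y is

    \[ \langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^{N} x_i\, y_i \]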

Even classical statistical algorithms (like linear regression) can be expressed in this way. In Statistical Learning Theory, various complex functions, or "kernels," replace the inner product. When you map data into these complex kernel spaces, the range of possible solutions to your problem increases significantly. The data in these spaces are referred to as "features" rather than as the attributes that characterized the original data.

A number of new learning techniques have taken advantage of the properties of kernel learning machines. The most common implementation is the Support Vector Machine. When a neural net is "trained," rows of customer data are fed into the net, and errors between predicted and observed values are calculated (an example of supervised learning). The learning function of the training and the error minimization function (which defines the best approximate solution) are closely intertwined in neural nets. This is not the case with support vector machines. Because the learning process is separated from the approximation process, you can experiment by using different kernel definitions with different learning theories. Therefore, instead of choosing from among different architectures for a neural net application, you can experiment with different kernels in a support vector machine implementation.
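
A hedged sketch of the point about swapping kernels, using scikit-learn's support vector machine as a generic stand-in (the data and the circular class boundary are invented):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # a circular (nonlinear) boundary

    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel).fit(X, y)
        print(kernel, round(clf.score(X, y), 2))
    # Only the kernel changes; the rbf kernel fits the circular boundary far better.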

Several commercial packages include algorithms based on Statistical Learning Theory, notably STATISTICA Data Miner and KXEN (Knowledge Extraction Engine). In the future, we will see more of these powerful algorithms in commercial packages. Eventually, data mining methods may become organized around the steps that enable these algorithms to work most efficiently. For example, the KXEN tool incorporates a smart data recoder (to standardize inputs) and a smart variable derivation routine that uses variable ratios and recombination to produce powerful new predictors.

Is the Fourth Generation of statistical methods the last? Probably it is not. As we accumulate more and more data, we will probably discover increasingly clever ways to simulate more closely the operation of the most complex learning machine in the universe: the human brain.

POSTSCRIPT

New strategies are being exploited now to spread the computing efforts among multiple computers connected like the many neurons in a brain:

• Grid Computing: Utilizing a group of networked computers to divide and conquer computing problems.

• "Cloud" Computing: Using the Internet to distribute data and computing tasks to many computers anywhere in the world, but without the centralized hardware infrastructure of grid computing.


These strategies for harnessing multiple computers for analysis provide a rich new milieu for data mining. This approach to analysis with multiple computers is the next logical step in the development of artificial "brains." This step might develop into the Fifth Generation of data mining.

References

Agresti, A. (1996). An Introduction to Categorical Data Analysis. New York, NY: John Wiley & Sons.

Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge, UK: Cambridge University Press.

Elder, J., & Abbott, D. (1998). A Comparison of Leading Data Mining Tools. 4th Annual Conference on Knowledge Discovery and Data Mining, New York, NY, August 28, 1998. http://www.datamininglab.com/Resources/TOOLCOMPARISONS/tabid/58/Default.aspx

Fisher, R. A. (1921). On the Mathematical Foundations of Theoretical Statistics. London: Philos. Trans. Royal Soc., A 222.

Goodman, A. F. (1968). The interface of computer science and statistics: An historical perspective. The American Statistician, 22, 17–20.

Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques. New York: Morgan Kaufmann.

Lee, P. M. (1989). Bayesian Statistics: An Introduction. New York: Oxford University Press.

Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. Cambridge, MA: MIT Press (3rd edition published in 1988).

Newton, R. R., & Rudestam, K. E. (1999). Your Statistical Consultant. Thousand Oaks, CA: Sage Publishing.

Nisbet, R. A. (2006). Data mining tools: Which one is best for CRM? Part 1. DM-Review Special Report, January 23, 2006. http://www.dmreview.com/specialreports/20060124/1046025-1.html

Nisbet, R. A. (2006). Data mining tools: Which one is best for CRM? Part 2. DM-Review Special Report, February 2006. http://www.dmreview.com/specialreports/20060207/1046597-1.html

Nisbet, R. A. (2006). Data mining tools: Which one is best for CRM? Part 3. DM-Review Special Report. http://www.dmreview.com/specialreports/20060321/1049954-1.html

Von Mises, R. (1957). Probability, Statistics, and Truth. New York: Dover Publications.
