Kevin Swingler: Introduction to Data Mining

Data Mining Methodology

Kevin Swingler, University of Stirling

Lecturer, Computing [email protected]

What is Data Mining?

• Generally, methods of using large quantities of data and appropriate algorithms to allow a computer to ‘learn’ to perform a task

• Task oriented:
– Predict outcomes or forecast the future
– Classify objects as belonging to one of several categories
– Separate data into clusters of similar objects

• Most methods produce a model of the data that performs the task

Some Examples

• Predicting patterns of drug side-effects
• Spotting credit card or insurance fraud
• Controlling complex machinery
• Predicting the outcome of medical interventions
• Predicting the price of stocks and shares or exchange rates
• Knowing when a cow is most fertile (really!)

Examples in LIS

• Text Mining
– Automatically determine what an article is ‘about’
– Classify attitudes in social media

• Demand Prediction
– Predicting demand for resources such as new books or journals or buildings

• Search and Recommend
– Analysis of borrowing history to make recommendations
– Links analysis for citation clustering

Data Sources

• In House – Data you own
– Borrow records
– Search histories
– Catalogue data

• Bought in
– Demographic data about customers
– Demographic data about the locality around a library

Methods

• Techniques for data mining are based on mathematics and statistics, but are implemented in easy-to-use software packages

• Methodology matters most in pre-processing the data, choosing the techniques, and interpreting the results

CRISP-DM Standard

• CRoss Industry Standard Process for Data Mining

Data Preparation

• Clean the data
– Remove rows with missing values
– Remove rows with obvious data entry errors – e.g. Age = 200
– Recode obvious data entry inconsistencies – e.g. if Gender = M or F, but occasionally Male
– Remove rows with minority values
– Select which variables to use in the model
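The first few cleaning steps above can be sketched in a few lines of Python. This is only an illustration: the records, the field names `age` and `gender`, the plausible age range, and the recoding table are all assumptions made for the example, not part of any particular data set.

```python
# Hypothetical records; field names and values are illustrative only.
rows = [
    {"age": 34, "gender": "M"},
    {"age": 200, "gender": "F"},     # obvious data entry error
    {"age": None, "gender": "F"},    # missing value
    {"age": 51, "gender": "Male"},   # inconsistent coding
    {"age": 27, "gender": "F"},
]

# Assumed recoding table and assumed plausible age range.
GENDER_CODES = {"Male": "M", "Female": "F", "M": "M", "F": "F"}
MIN_AGE, MAX_AGE = 0, 120


def clean(records):
    """Drop rows with missing or implausible values and recode gender."""
    cleaned = []
    for row in records:
        if any(value is None for value in row.values()):
            continue  # remove rows with missing values
        if not MIN_AGE <= row["age"] <= MAX_AGE:
            continue  # remove rows with obvious data entry errors
        # Recode inconsistent spellings to a single code per category.
        recoded = dict(row, gender=GENDER_CODES.get(row["gender"], row["gender"]))
        cleaned.append(recoded)
    return cleaned


cleaned_rows = clean(rows)
print(cleaned_rows)
```

Three of the five rows survive, and the one coded as Male is recoded to M.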

Data Quantity

• Choose the variables to be used for the model
• Look at the distributions of the chosen values
• Look at the level of noise in the data
• Look at the degree of linearity in the data
• Decide whether or not there are sufficient examples in the data
• Treat unbalanced data

Consider Error Costs

• Imagine a system that classifies input patterns into one of several possible categories

• Sometimes it will get things wrong; how often depends on the problem:
– Direct mail targeting – very often
– Credit risk assessment – quite often
– Medical reasoning – very infrequently

Error Costs

• An error in one direction can cost more than an error in the opposite direction
– Recommending a blood test based on a false positive is better than missing an infection due to a false negative
– Missing a case of insurance fraud is more costly than flagging a claim to be double checked

• The balance of examples in each case can be manipulated to reflect the cost
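To see why the direction of an error matters, it can help to put numbers on it. The sketch below compares two hypothetical classifiers that make the same number of mistakes; the unit costs and error counts are invented for illustration. It shows the cost asymmetry only, not the rebalancing of examples mentioned above.

```python
# Hypothetical unit costs, in arbitrary units.
COST_FALSE_POSITIVE = 5     # e.g. an unnecessary blood test
COST_FALSE_NEGATIVE = 100   # e.g. a missed infection


def total_error_cost(false_positives, false_negatives):
    """Total cost of a classifier's errors under the assumed unit costs."""
    return (false_positives * COST_FALSE_POSITIVE
            + false_negatives * COST_FALSE_NEGATIVE)


# Two hypothetical classifiers that each make 30 errors.
cost_a = total_error_cost(false_positives=25, false_negatives=5)
cost_b = total_error_cost(false_positives=5, false_negatives=25)
print(cost_a, cost_b)  # 625 2525
```

Both classifiers have the same error rate, but the second is about four times as expensive under these assumed costs.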

Check Points

• Data quantity and quality: do you have sufficient good data for the task?
– How many variables are there?
– How complex is the task?
– Is the data’s distribution appropriate?

• Outliers
• Balance
• Value set size

Distributions

• A frequency distribution is a count of how often each value of a variable occurs in a data set

• For discrete numbers and categorical values, this is simply a count of each value

• For continuous numbers, the count is of how many values fall into each of a set of sub-ranges
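Both kinds of count can be sketched with the Python standard library. The values, the bin width and the helper name `binned_counts` are assumptions chosen for the example.

```python
from collections import Counter

# Illustrative values only.
genders = ["M", "F", "F", "M", "F"]
ages = [23.5, 31.0, 35.2, 47.9, 52.1, 38.4]

# Discrete or categorical values: count each value directly.
gender_counts = Counter(genders)


def binned_counts(values, low, width, n_bins):
    """Count how many values fall into each sub-range
    [low + i*width, low + (i+1)*width) for i in 0..n_bins-1."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - low) // width)
        if 0 <= index < n_bins:   # values outside all sub-ranges are ignored
            counts[index] += 1
    return counts


# Continuous values: sub-ranges 20-30, 30-40, 40-50 and 50-60.
age_counts = binned_counts(ages, low=20, width=10, n_bins=4)
print(gender_counts["M"], gender_counts["F"])  # 2 3
print(age_counts)                              # [1, 3, 1, 1]
```

The choice of sub-ranges changes the shape of the distribution you see, so it is worth trying more than one bin width.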

Plotting Distributions

• The easiest way to visualise a distribution is to plot it in a histogram:
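As a rough stand-in for a plotted histogram, the sketch below prints one bar of `#` characters per value. The data and the function name `text_histogram` are illustrative assumptions; a charting package would normally be used instead.

```python
from collections import Counter

# Illustrative values only.
values = ["F", "M", "F", "F", "M", "F"]


def text_histogram(data):
    """Return one line per distinct value: the value, a bar with one '#'
    per occurrence, and the count."""
    counts = Counter(data)
    return [f"{value:<8}{'#' * count} {count}"
            for value, count in sorted(counts.items())]


for line in text_histogram(values):
    print(line)
```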

Features of a Distribution to Look For

• Outliers
• Minority values
• Data Balance
• Data entry errors

Outliers

• A small number of values that are much larger or much smaller than all the others
• Can disrupt the data mining process and give misleading results
• You should either remove them or, if they are important, collect more data to reflect this aspect of the world you are modelling
• Could be data entry errors
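One common convention for flagging outliers, which is not taken from these slides, is to mark values lying more than 1.5 interquartile ranges beyond the quartiles. The sketch below assumes that rule and some made-up ages; whether a flagged value should be removed is still a judgement call.

```python
import statistics

# Illustrative ages; 200 is much larger than all the others.
ages = [22, 25, 27, 30, 31, 33, 35, 38, 41, 200]


def iqr_outliers(values, k=1.5):
    """Return values more than k interquartile ranges outside the quartiles.
    The factor k=1.5 is a widely used convention, assumed here."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


outliers = iqr_outliers(ages)
print(outliers)  # [200]
```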

Minority Values

• Values that only appear infrequently in the data
• Do they appear often enough to contribute to the model?
• Might be worth removing them from the data or collecting more data where they are represented
• Are they needed in the finished system?
• Could they be the result of data entry errors?
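Finding minority values is a matter of counting and applying a cut-off. In the sketch below the column, the category names and the 2% threshold are all assumptions; the right threshold depends on the data set and the task.

```python
from collections import Counter

# Hypothetical column of membership types.
membership = (["adult"] * 600 + ["child"] * 380
              + ["staff"] * 15 + ["visitor"] * 5)
MIN_SHARE = 0.02  # assumed cut-off: values in fewer than 2% of rows


def minority_values(column, min_share):
    """Return {value: count} for values whose share of rows is below min_share."""
    counts = Counter(column)
    total = len(column)
    return {value: n for value, n in counts.items() if n / total < min_share}


rare = minority_values(membership, MIN_SHARE)
print(rare)  # {'staff': 15, 'visitor': 5}
```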

Minority Values

[Bar chart: counts of each value of the gender variable, on an axis from 0 to 600, with categories Male, Female, M and F]

What does this chart tell you about the gender variable in a data set?

What should you do before modelling or mining the data?

Flat and Wide Variables

• Variables where all the values are minority values have a flat, wide distribution – one or two of each possible value

• Such variables are of little use in data mining because the goal of DM is to find general patterns from specific data

• No such patterns can exist if each data point is completely different

• Such variables should be excluded from a model
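A simple screen for flat, wide variables is to compare the number of rows with the number of distinct values. The sketch below assumes a cut-off of two occurrences per value on average, following the "one or two of each possible value" description; the column names and the function name are hypothetical.

```python
def is_flat_and_wide(column, max_mean_count=2):
    """True when each distinct value occurs, on average, no more than
    max_mean_count times (an assumed threshold)."""
    if not column:
        return False
    return len(column) / len(set(column)) <= max_mean_count


# Hypothetical columns: a unique ID per borrower, and a two-valued variable.
borrower_ids = [f"B{n:04d}" for n in range(500)]
genders = ["M", "F"] * 250

print(is_flat_and_wide(borrower_ids))  # True: exclude from the model
print(is_flat_and_wide(genders))       # False
```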

Data Balance

• Imagine I want to predict whether or not a prospective customer will respond to a mailing campaign

• I collect the data, put it into a data mining algorithm, which learns and reports a success rate of 98%

• Sounds good, but when I put a new set of prospects through to see who to mail, what happens?

A Problem

• … the system predicts ‘No’ for every single prospect.

• With a campaign response rate of 2%, the system is right 98% of the time if it always says ‘No’.

• So it never chooses anybody to target in the campaign
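The arithmetic is easy to check. The sketch below uses 1,000 made-up prospects with a 2% response rate and a "model" that always answers ‘No’.

```python
# Illustrative data: 1,000 prospects, 2% of whom responded.
actual = ["Yes"] * 20 + ["No"] * 980
predicted = ["No"] * len(actual)  # a "model" that always says No

# Share of prospects for which the prediction matches the outcome.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Number of real responders the model would have chosen to mail.
responders_found = sum(a == "Yes" and p == "Yes"
                       for a, p in zip(actual, predicted))

print(accuracy)          # 0.98
print(responders_found)  # 0
```

High accuracy here says nothing about whether the model can find the 20 people worth mailing.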

A Solution

• One data pre-processing solution is to balance the number of examples of each target class in the output variable
• In our previous example: 50% customers and 50% non-customers
• That way, any gain in accuracy over 50% would certainly be due to patterns in the data, not the prior distribution
• This is not always easy to achieve – you might need to throw away a lot of data to balance the examples, or build several models on balanced subsets
• Not always necessary – if an event is rare because its cause is rare, then the problem won’t arise
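One way to balance the classes is to throw away randomly chosen examples from the larger class (often called undersampling). The sketch below is a minimal version under that assumption; the record layout and the `responded` field are hypothetical.

```python
import random


def undersample(rows, target_key, seed=0):
    """Randomly discard rows from the larger classes until every class
    has as many rows as the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[target_key], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)  # avoid leaving the classes in blocks
    return balanced


# Hypothetical prospects: 20 responders out of 1,000.
prospects = [{"id": i, "responded": i < 20} for i in range(1000)]
balanced = undersample(prospects, "responded")
print(len(balanced))  # 40 rows: 20 responders and 20 non-responders
```

Note how much data is discarded: 960 of the 1,000 rows. Calling the function with different seeds gives different balanced subsets, which is one way to build several models as suggested above.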

Data Quantity

• How much data do you need?
• How long is a piece of string?
• Data must be sufficient to:
– Represent the dynamics of the system to be modelled
– Cover all situations likely to be encountered when predictions are needed
– Compensate for any noise in the data

Model Building

• Choose a number of techniques suitable to the task:
– Neural network for prediction or classification
– Decision tree for classification
– Rule induction for classification
– Bayesian network for classification
– K-Means for clustering

Train Models

• For each technique:
– Run a series of experiments with different parameters
– Each experiment should use around 70% of the data for training and the rest for testing
– When a good solution is found, use cross validation (10 fold is a good choice) to verify the result
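A 70/30 split can be sketched as a shuffle followed by a cut. The function name and the fixed random seed are assumptions for the example; most data mining packages provide an equivalent.

```python
import random


def train_test_split(rows, train_share=0.7, seed=0):
    """Shuffle the rows and split them into a training and a test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)  # shuffle so the split does not follow file order
    cut = round(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


# Illustrative data: 100 row numbers standing in for 100 records.
train, test = train_test_split(range(100))
print(len(train), len(test))  # 70 30
```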

Cross Validation

• Split the data into ten subsets, then train 10 models – each one using 9 of the 10 subsets as training data and the 10th as test. The score is the average of all 10.

• This is a more accurate representation of how well the data may be modelled, as it reduces the risk of getting a lucky test set
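The fold bookkeeping can be sketched as follows. The callback `train_and_score` is hypothetical: it stands for whatever trains a model on the training rows and returns its score on the test rows. The sketch assumes the rows have already been shuffled.

```python
def k_fold_indices(n_rows, k=10):
    """Yield (train_indices, test_indices) once for each of k folds."""
    indices = list(range(n_rows))
    folds = [indices[i::k] for i in range(k)]  # k near-equal subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_rows, train_and_score, k=10):
    """Train and score one model per fold, then average the k scores."""
    scores = [train_and_score(train, test)
              for train, test in k_fold_indices(n_rows, k)]
    return sum(scores) / len(scores)


# Placeholder scorer that just reports the training set size per fold.
mean_train_size = cross_validate(100, lambda train, test: len(train))
print(mean_train_size)  # 90.0
```

Every row is used for testing exactly once and for training nine times.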

Assess Models

• You can measure the success of your model in a number of ways
– Mean Squared error – not always meaningful
– Percentage correct for classification
– Confusion matrix for classification

         Output=
         True   False
True     80     30
False    20     90
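Percentage correct can be read straight off a confusion matrix: it is the diagonal total divided by the grand total. The sketch below uses the counts from the table above and assumes that the rows are the actual class and the columns the model's output, since the rows are not labelled. The percentage correct is the same either way.

```python
# Counts from the table above, keyed as (row, column).
# Assumption: rows are the actual class, columns the model's output.
confusion = {
    ("True", "True"): 80, ("True", "False"): 30,
    ("False", "True"): 20, ("False", "False"): 90,
}

total = sum(confusion.values())
correct = sum(n for (row, col), n in confusion.items() if row == col)
percent_correct = 100 * correct / total

print(total, correct)             # 220 170
print(round(percent_correct, 1))  # 77.3
```

The off-diagonal cells show where the errors fall, which matters when the two kinds of error have different costs.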

Probability Outputs

• Most classification techniques provide a score with the classification – either a probability or some other measure

• This can be used to:
– Allow an answer of “unsure” for cases where no single class has a high enough probability
– Weight outputs to allow for unequal cost of outcomes
– Draw lift charts and ROC curves
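The first two uses can be sketched with plain dictionaries of class probabilities. The confidence threshold, the class names and the cost table below are assumptions for illustration; the second function chooses the answer with the lowest expected cost, which is one possible way of weighting outputs.

```python
def classify_or_unsure(probabilities, min_confidence=0.7):
    """Return the most probable class, or 'unsure' when even that class
    falls below the (assumed) confidence threshold."""
    label, p = max(probabilities.items(), key=lambda item: item[1])
    return label if p >= min_confidence else "unsure"


# Hypothetical costs, indexed as COSTS[actual][predicted].
COSTS = {
    "fraud": {"fraud": 0, "genuine": 50},   # missed fraud is expensive
    "genuine": {"fraud": 1, "genuine": 0},  # a double check is cheap
}


def cheapest_decision(probabilities, costs=COSTS):
    """Return the prediction with the lowest expected cost."""
    def expected_cost(predicted):
        return sum(p * costs[actual][predicted]
                   for actual, p in probabilities.items())
    return min(costs, key=expected_cost)


scores = {"fraud": 0.1, "genuine": 0.9}
print(classify_or_unsure(scores))                              # genuine
print(cheapest_decision(scores))                               # fraud
print(classify_or_unsure({"fraud": 0.55, "genuine": 0.45}))    # unsure
```

With these assumed costs, a claim with only a 10% chance of fraud is still worth flagging, because a missed fraud costs fifty times as much as a double check.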

Generalisation and Over Fitting

• Most data mining models have a degree of complexity that can be controlled by the designer

• The goal is to find the degree of complexity that is best suited to the data

• A model that is too simple over generalises
• A model that is too complex over fits
• Both have an adverse effect on performance

Gen-Spec Trade Off

• Adding to the complexity of the model fits the training data better but, beyond a certain point, at the expense of higher test error

Repeat or Finish

• The result of the data mining will leave you with either a model that works or the need to improve

• More data may need to be collected
• Different variables might be tried
• The process can loop several times before a satisfactory answer is found

Understanding and Using the Results

• The resulting model has the ability to perform the task it was set, so can be embedded in an automated system

• Some techniques produce models that are human readable and allow insights into the structure of the data

• Some are almost impossible to extract knowledge from
