Data preprocessing before classification In Kennedy et al.: “Solving data mining problems”
Collecting data
• Collecting “example patterns”
– Inputs (vectors of independent variables)
– Outputs (vectors of dependent variables)
• More data is better
• Begin with an elementary set of data
Collecting data
• Choose an appropriate sampling rate for time-series data.
• Make sure the measurement units of the data are consistent.
• Keep non-essential variables out of the input vector.
• Make sure no major structural (systemic) changes have occurred during collection.
Collecting data
• How much data is enough?
– Train and test using a subset of the data
– If the performance does not increase when the full data set is used, the data is enough
– There are statistical validation methods (Ch. 11)
• Using simulated data
– When it is difficult to collect (sufficient) data
– Realistic
– Representative
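The "is the data enough?" check can be sketched as a small learning curve: train on growing subsets and see whether test accuracy stops improving. This is only an illustration. The nearest-centroid classifier, the simulated one-dimensional data, and the names `nearest_centroid_accuracy` and `learning_curve` are assumptions made for the example, not anything taken from Kennedy et al.

```python
import random

def nearest_centroid_accuracy(train, test):
    """Fit one centroid per class on train, return accuracy on test."""
    sums, counts = {}, {}
    for x, label in train:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    correct = sum(
        1 for x, label in test
        if min(centroids, key=lambda c: abs(x - centroids[c])) == label
    )
    return correct / len(test)

def learning_curve(train, test, sizes):
    """Accuracy on the same test set when training on the first n patterns."""
    return [(n, nearest_centroid_accuracy(train[:n], test)) for n in sizes]

# Simulated (hypothetical) data: one input, two classes centred at 0 and 2.
rng = random.Random(0)

def make_example():
    label = rng.choice([0, 1])
    return (rng.gauss(2.0 * label, 1.0), label)

data = [make_example() for _ in range(1200)]
train, test = data[:1000], data[1000:]
curve = learning_curve(train, test, [50, 200, 1000])
for n, acc in curve:
    print(n, round(acc, 3))
```

If accuracy is roughly flat between the larger subset sizes, that suggests (but does not prove) that more data of the same kind would add little.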
Missing data
• Discard incomplete example patterns
• Manually enter a reasonable, probable, or expected value
• Use a statistic generated from the example patterns that have that value
– Mean, mode
• Encode missing values explicitly by creating new indicator variables
• Generate a predictive model to predict each missing data value
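Two of the options above, filling with the mean and adding an indicator variable, can be combined in a few lines. This is a minimal sketch that assumes missing values are represented as `None`. The function name `impute_with_indicator` and the sample data are hypothetical.

```python
from statistics import mean

def impute_with_indicator(values):
    """Fill None with the mean of the observed values.

    Also returns a parallel 0/1 list marking which entries were missing,
    so the model can still see that a value was imputed.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    missing = [1 if v is None else 0 for v in values]
    return filled, missing

ages = [20, None, 40, 30, None]  # hypothetical input variable
filled, missing = impute_with_indicator(ages)
print(filled)   # [20, 30, 40, 30, 30]
print(missing)  # [0, 1, 0, 0, 1]
```

For a categorical variable the same pattern works with `statistics.mode` in place of `mean`.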
Categorical data
• Ordinal:
– Convert to a numerical representation in a straightforward manner
– “Low”, “medium”, “high” => 0, 1, 2
• Nominal:
– “One of n” representation
– Encode the input variable as n different binary inputs when there are n distinct categories.
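Both encodings are easy to illustrate. The sketch below uses the slide's ordinal mapping. The nominal category list (`colors`) and the function names are made up for the example.

```python
# Ordinal: the order of the levels is meaningful, so map them to integers.
ORDINAL_LEVELS = {"low": 0, "medium": 1, "high": 2}

def encode_ordinal(value):
    return ORDINAL_LEVELS[value]

# Nominal: no order, so use one binary input per category ("one of n").
def one_of_n(value, categories):
    return [1 if value == c else 0 for c in categories]

colors = ["red", "green", "blue"]  # hypothetical nominal variable, n = 3
print(encode_ordinal("medium"))    # 1
print(one_of_n("green", colors))   # [0, 1, 0]
```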
Further processing of “one of n”
• When n is too large, reduce the number of inputs in the new encoding.
– Manually
– PCA-based reduction
  • Reduce the one-of-n representation to a one-of-m representation where m is less than n.
– Eigenvalue-based reduction
– Output variable-based reduction
Inconsistent data and outliers
• Removing erroneous data
• Identifying inconsistent data
– Thresholding, filtering
• Outliers
– Data points that lie outside the normal region of interest in the input space, which may be
  • Unusual situations that are “correct”
  • Misleading or incorrect measurements
Outliers
• Ways to spot outliers
– Plot: box plot, histogram…
– Number of S.D. from the mean
• Handling outliers
– Remove them
  • Assumption: the region of the input space where the outliers reside is not of interest
– “Winsorize” them
  • Convert the values of outliers into the values of the upper or lower thresholds.
• Outliers can always be reintroduced into a satisfactory model to study the changes in the performance of the model.
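Spotting outliers by distance from the mean and then clamping them to thresholds might look like the sketch below. The function names, the cut-off of 2 standard deviations, the thresholds 5 and 15, and the sample readings are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def find_outliers(values, max_sd):
    """Return the values lying more than max_sd standard deviations from the mean."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return []
    return [v for v in values if abs(v - mu) > max_sd * sd]

def winsorize(values, lower, upper):
    """Clamp every value into [lower, upper] instead of discarding it."""
    return [min(max(v, lower), upper) for v in values]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]  # hypothetical measurements
print(find_outliers(readings, max_sd=2.0))  # [95]
print(winsorize(readings, 5, 15))           # the 95 becomes 15
```

Note that a single extreme value inflates the standard deviation itself: with these readings the value 95 lies just under 3 S.D. from the mean, so a 3 S.D. rule would not flag it. Checking a box plot or histogram alongside the S.D. rule helps catch such cases.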
Reasons to preprocess data
• Reducing noise
• Enhancing the signal
• Reducing input space
• Feature extraction
• Normalizing data
• Modifying prior probabilities (specific for classification)
Reducing noise
• Averaging data values
• Thresholding data
– Convert numeric data into categorical data
– E.g. grey-scale => monochrome (black-and-white) image
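Both noise-reduction ideas fit in a few lines. This is a sketch only: the window size, the cut-off of 128 for 8-bit grey levels, and the function names are assumptions.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values (output is shorter than input)."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def threshold(values, cutoff):
    """Turn numeric values into two categories: 1 if >= cutoff, else 0."""
    return [1 if v >= cutoff else 0 for v in values]

print(moving_average([1, 2, 3, 4, 5], window=3))   # [2.0, 3.0, 4.0]
print(threshold([12, 200, 130, 90], cutoff=128))   # [0, 1, 1, 0]
```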
Reducing input space
• Principal component analysis (PCA)
– Identify an m-dimensional subspace of the n-dimensional input space
– The original n variables are reduced to m variables that are mutually orthogonal (uncorrelated)
• Eliminating correlated input variables
– Identify highly correlated input variables by
  • Statistical correlation tests
  • Visual inspection of graphed data variables
  • Seeing if a data variable can be modeled using one or more others.
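Eliminating correlated inputs can be sketched with a plain Pearson correlation: keep a variable only if it is not strongly correlated with one already kept. The 0.95 limit, the function names, and the sample columns are assumptions for illustration. The sketch also assumes no column is constant, since that would make the correlation undefined.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_correlated(columns, limit=0.95):
    """Keep each variable unless |r| >= limit with a variable already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < limit for k in kept):
            kept.append(name)
    return kept

columns = {  # hypothetical input variables
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [35, 22, 58, 41, 29],
}
print(drop_correlated(columns))  # ['height_cm', 'age']
```

Which variable of a correlated pair survives depends on the order of the columns, so it is worth ordering them by how much you trust each measurement.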
Reducing input space
• Combining non-correlated input variables
• Sensitivity analysis
– If variations of a particular input variable cause large changes in the estimation model output, the variable is very significant.
– Sensitivity analysis prunes input variables based on information provided by both input and output data.
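One simple way to run a sensitivity analysis is to nudge each input in turn and record how much the model output moves. In the sketch below, `toy_model`, the baseline point, and the step size `delta` are hypothetical. In practice the model would be the trained estimator and the analysis would be repeated at several baseline points.

```python
def sensitivity(model, baseline, delta=0.1):
    """Approximate |change in output| per unit change of each input at `baseline`."""
    base_out = model(baseline)
    scores = []
    for i in range(len(baseline)):
        perturbed = list(baseline)
        perturbed[i] += delta          # vary one input, hold the others fixed
        scores.append(abs(model(perturbed) - base_out) / delta)
    return scores

def toy_model(x):
    # Stand-in for a trained model: input 0 matters far more than input 1.
    return 5.0 * x[0] + 0.01 * x[1]

scores = sensitivity(toy_model, [1.0, 1.0])
print(scores)  # roughly [5.0, 0.01]; input 1 is a candidate for pruning
```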
Normalizing data
• Not “transform to normal distribution”
• For models that perform better on normalized data
– Distance-based non-parametric algorithms implicitly assume distances in different directions carry the same weight (e.g. K-nearest neighbor, “KNN”)
– Backpropagation (BP) and multi-layer perceptron (MLP) models often perform better if all inputs and outputs are normalized
• Avoiding numerical problems
Types of normalization
• Min-max normalization
– It preserves all relationships of the data values exactly
– It would compress the normal range if extreme values or outliers exist
• Z-score normalization
• Sigmoidal normalization
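The three normalizations can be sketched as follows. Min-max and z-score are standard. For sigmoidal normalization the slide gives no formula, so the version here (a logistic function applied to the z-score, mapping into the range 0 to 1) is one common form and should be read as an assumption. The sketch also assumes the values are not all identical.

```python
from math import exp
from statistics import mean, pstdev

def min_max(values, lo=0.0, hi=1.0):
    """Linearly rescale so the minimum maps to lo and the maximum to hi."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def sigmoidal(values):
    """Squash z-scores through a logistic curve; extreme values saturate near 0 or 1."""
    return [1.0 / (1.0 + exp(-z)) for z in z_score(values)]

data = [2.0, 4.0, 6.0, 8.0, 100.0]  # hypothetical variable with one extreme value
print(min_max(data))    # the first four values are squeezed below 0.07
print(z_score(data))
print(sigmoidal(data))
```

The min-max output shows the compression the slide warns about: one extreme value pushes the ordinary values into a small corner of the range.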
Other considerations
• Preprocess according to the characteristics of the specific classifier being used for modeling
– E.g. CHAID uses categorical data directly
• Input variables produce the best modeling accuracy when exhibiting a uniform or Gaussian distribution
• Add expert knowledge when preprocessing data