Data Preprocessing in FAKE GAME

Page 1: Data Preprocessing in FAKE GAME, fakegame.sourceforge.net/lib/exe/fetch.php?media=take2010-mira.pdf

Data Preprocessing in FAKE GAME

Miroslav Čepek
[email protected]

http://cig.felk.cvut.cz

Computational Intelligence Group
Department of Computer Science and Engineering
Faculty of Electrical Engineering
Czech Technical University in Prague

Data mining process

Data preprocessing

Introduction

● Data preprocessing usually takes about 80% of the time of the whole data mining process.

● Data preprocessing is a cornerstone of the data mining process.

– Garbage in, garbage out principle

– When the training data does not contain important information, the resulting model is irrelevant and does not work.

● Data preprocessing is time-consuming but important!

● The Data Preprocessing subsystem is a part of the FAKE GAME project.

● The Data Preprocessing subsystem is divided into:

– Manual preprocessing part.

– Automatic preprocessing part.

Manual preprocessing

● In this mode you can select and apply data preprocessing methods as you like.

Data preprocessing methods

● Data Preprocessing subsystem contains methods from various fields:

– Data acquisition (CSV, XLS, Databases)

– Missing data imputation

– Data normalization

– Data reduction (sampling and dimension reduction)

– Outlier detection

– …

Automation of data preprocessing

● How to simplify the data preprocessing task?

– Some parts cannot be automated: data acquisition, feature extraction, data cleaning.

– On the other hand, there is still a large part which can be automated.

● Replace (impute) missing values. Which method to use?

● Normalize attributes? Which method to use? To which range?

● …
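
The two questions above (which imputation method, which normalization range) can be made concrete with a minimal sketch. The function names `impute` and `normalize` are hypothetical and are not taken from FAKE GAME; the sketch only shows that each step has interchangeable variants and parameters that someone, or something, has to choose.

```python
from statistics import mean, median
from typing import Callable, List, Optional, Sequence


def impute(values: Sequence[Optional[float]],
           strategy: Callable[[List[float]], float] = mean) -> List[float]:
    """Replace missing values (None) by a statistic of the known values.

    Assumes at least one known value; an all-missing column is not handled.
    """
    known = [v for v in values if v is not None]
    fill = strategy(known)
    return [fill if v is None else v for v in values]


def normalize(values: Sequence[float],
              low: float = 0.0, high: float = 1.0) -> List[float]:
    """Linearly rescale values to the range [low, high] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to `low`.
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


column = [1.0, None, 2.0, 9.0]
print(impute(column, mean))    # [1.0, 4.0, 2.0, 9.0]
print(impute(column, median))  # [1.0, 2.0, 2.0, 9.0]
print(normalize(impute(column, median), -1.0, 1.0))  # [-1.0, -0.75, -0.75, 1.0]
```

Even on this tiny column, the mean and the median fill in different values, which is the kind of choice the automatic part is meant to make.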

Automatic Preprocessing – Ideas

● The task of finding the order and setup of preprocessing methods is an optimization problem.

– Simulated annealing

– Linear programming

– Taboo search

– Ant colony optimization

– Particle swarm optimization

– Genetic algorithms

Ideas for data preprocessing

● Automatic Preprocessing utilizes a genetic algorithm to search for the optimal sequence of preprocessing methods.

– Each individual represents a sequence of preprocessing methods.

– The fitness of each individual is the accuracy, measured over the testing data, of a model created from data preprocessed by the individual's sequence of preprocessing methods.
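
The search described above can be sketched as a small genetic algorithm over sequences of method names. This is an illustration under stated assumptions, not the FAKE GAME implementation: the operators (tournament selection, one-point crossover, single-gene mutation), the method names, and `toy_fitness` are all hypothetical stand-ins. In the real system the fitness would be the accuracy of a trained model.

```python
import random
from typing import Callable, List, Sequence

Genome = List[str]  # one individual = a sequence of preprocessing method names


def evolve(methods: Sequence[str], fitness: Callable[[Genome], float],
           population_size: int = 20, generations: int = 30,
           max_length: int = 4, seed: int = 0) -> Genome:
    """Return the best sequence found; assumes max_length >= 1."""
    rng = random.Random(seed)

    def random_genome() -> Genome:
        return [rng.choice(methods) for _ in range(rng.randint(0, max_length))]

    def crossover(a: Genome, b: Genome) -> Genome:
        # One-point crossover: head of one parent, tail of the other.
        child = a[:rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
        return child[:max_length]

    def mutate(genome: Genome) -> Genome:
        child = list(genome)
        if child and rng.random() < 0.5:
            child[rng.randrange(len(child))] = rng.choice(methods)  # replace a method
        elif len(child) < max_length:
            child.insert(rng.randint(0, len(child)), rng.choice(methods))  # add a method
        else:
            del child[rng.randrange(len(child))]  # drop a method
        return child

    def pick() -> Genome:
        # Tournament selection of size two.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    population = [random_genome() for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        population = [mutate(crossover(pick(), pick()))
                      for _ in range(population_size)]
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate  # keep the best-so-far individual
    return best


METHODS = ["impute_mean", "impute_median", "min_max", "z_score"]


def toy_fitness(genome: Genome) -> float:
    # Stand-in for model accuracy: rewards two particular methods and
    # slightly penalizes long sequences.
    score = 0.5 * ("impute_mean" in genome) + 0.5 * ("min_max" in genome)
    return score - 0.05 * len(genome)


best = evolve(METHODS, toy_fitness)
print(best, toy_fitness(best))
```

Because every fitness evaluation in the real setting means training a model, the number of evaluations (population size times generations) dominates the running time.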

Automatic Preprocessing – Sequences

● For each input attribute there exists one subsequence

● Each attribute is preprocessed separately

Genome

Attribute sequences are concatenated into an individual (genome).

− One more special subsequence exists: Global

− This subsequence contains preprocessing methods that manipulate the whole dataset.

− Methods like PCA, Data reduction, …
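
One possible encoding of such a genome is sketched below. The representation (a flat list of `(target, method)` pairs with a reserved `GLOBAL` marker) and all names are assumptions made for illustration; the slides do not specify how FAKE GAME stores the genome.

```python
from typing import Dict, List, Tuple

GLOBAL = "__global__"  # hypothetical marker for the whole-dataset subsequence
Gene = Tuple[str, str]  # (attribute name or GLOBAL, preprocessing method name)


def concatenate(per_attribute: Dict[str, List[str]],
                global_methods: List[str]) -> List[Gene]:
    """Join the per-attribute subsequences and the Global one into one genome."""
    genome = [(attribute, method)
              for attribute, methods in per_attribute.items()
              for method in methods]
    genome += [(GLOBAL, method) for method in global_methods]
    return genome


def split(genome: List[Gene]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Recover the subsequences; attributes with no methods do not appear."""
    per_attribute: Dict[str, List[str]] = {}
    global_methods: List[str] = []
    for target, method in genome:
        if target == GLOBAL:
            global_methods.append(method)
        else:
            per_attribute.setdefault(target, []).append(method)
    return per_attribute, global_methods


subsequences = {"age": ["impute_mean", "min_max"], "income": ["min_max"]}
genome = concatenate(subsequences, ["pca"])
print(genome)
```

Keeping the Global methods at the end reflects the idea that attribute-wise steps run first and whole-dataset steps such as PCA run on the already cleaned attributes; that ordering is also an assumption of this sketch.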

Fitness function

• The fitness is the accuracy of the model created from the training data.

• The final fitness and results depend on the modeling method used.

– Modeling method should be fast and reasonably accurate.

– At this time, the decision tree looks the best.

– But logistic and SVM classifiers can also be used.

Calculation of Fitness

• The fitness of an individual in Automatic Preprocessing is the accuracy of a model created from data preprocessed by the methods selected by that individual.
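
The calculation can be sketched end to end on a toy dataset. Two simplifications are assumed here and are not taken from the slides: a 1-nearest-neighbour classifier stands in for the decision tree to keep the example dependency-free, and the preprocessing is applied to all rows before the train/test split. The names `min_max_columns`, `predict_1nn` and `fitness` are hypothetical.

```python
from typing import Callable, List, Sequence

Row = List[float]
Method = Callable[[List[Row]], List[Row]]


def min_max_columns(rows: List[Row]) -> List[Row]:
    """Rescale every column to [0, 1]; constant columns become 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]


def predict_1nn(train_x: List[Row], train_y: Sequence[str], x: Row) -> str:
    """Label of the training row closest to x (squared Euclidean distance)."""
    nearest = min(range(len(train_x)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[nearest]


def fitness(sequence: Sequence[Method], rows: List[Row],
            labels: Sequence[str], n_train: int) -> float:
    """Accuracy on the test rows of a model built from the preprocessed data."""
    for method in sequence:  # apply the individual's methods in order
        rows = method(rows)
    train_x, test_x = rows[:n_train], rows[n_train:]
    train_y, test_y = labels[:n_train], labels[n_train:]
    hits = sum(predict_1nn(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)


# First column is informative but small; second column is large-scale noise.
rows = [[0.0, 100.0], [1.0, 900.0], [0.1, 800.0], [0.9, 200.0]]
labels = ["a", "b", "a", "b"]
print(fitness([], rows, labels, n_train=2))                 # 0.0
print(fitness([min_max_columns], rows, labels, n_train=2))  # 1.0
```

On this deliberately constructed data the empty sequence scores 0.0 and the sequence containing normalization scores 1.0, so the genetic algorithm would prefer the second individual.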

Fitness value modifications

● Regularization

– Keep the number of applied preprocessing methods as small as possible.

– Avoid preprocessing methods which do not improve the accuracy.

● Remaining missing values

– Penalization if missing values are still present in the dataset after preprocessing is complete.

● Complexity of resulting model

– Penalization of too complex models.
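
The slides do not give the formula for these modifications. One simple way to combine them, assumed here purely for illustration, is to subtract weighted penalties from the raw accuracy. The function name, the linear form, and the default weights below are all hypothetical.

```python
def penalized_fitness(accuracy: float, n_methods: int, n_missing: int,
                      model_size: int, method_weight: float = 0.01,
                      missing_weight: float = 0.05,
                      size_weight: float = 0.001) -> float:
    """Accuracy minus three penalties (all weights are illustrative guesses).

    n_methods  -- number of preprocessing methods in the individual
    n_missing  -- missing values still present after preprocessing
    model_size -- a complexity measure, e.g. number of nodes in a decision tree
    """
    penalty = (method_weight * n_methods
               + missing_weight * n_missing
               + size_weight * model_size)
    return accuracy - penalty


# Two individuals with the same raw accuracy: the shorter sequence wins.
print(penalized_fitness(0.90, n_methods=3, n_missing=0, model_size=10))
print(penalized_fitness(0.90, n_methods=6, n_missing=0, model_size=10))
```

With such a form, a method that does not improve accuracy only adds to the penalty, so individuals carrying it are gradually selected against.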

Example result

● Improvements of the fitness with EColi data.

● Boxplots created from 20 repeated runs showing the accuracy of the best-so-far individual and the average accuracy in the population.

[Figure: boxplots of accuracy (fitness value) over generations, showing the average fitness of the population and the fitness of the best-so-far individual]

Influence of modeling method

● Not all modelling methods achieved the same accuracy.

– The final fitness and the differences between the fitnesses achieved by the modelling methods are problem dependent.

– For now, if the most accurate model is needed, all modelling methods must be tried.

Utilization of preprocessing methods

● There remains a very important question about the utilized methods.

– I have to double check which methods are utilized by successful individuals.

Future of Automatic Preprocessing

● Utilization of meta-data to speed up the genetic algorithm or even to skip it.

● We plan to extend the Automatic Preprocessing to

– automatic selection of features to extract from time-series (signals),

– automatic selection of features to extract from images.

● Improvements to the “manual” part of the preprocessing algorithms.

Thank you for your attention.

Miroslav Čepek ([email protected])

● Introduction of the data preprocessing module, plus the fact that we have many implemented methods

● Start of automatic preprocessing

● How it works (internals)

● Evolving for specific models

● Fitness: model accuracy, regularization, model complexity, …

● Basic results

● Future of preprocessing (automatic preprocessing of signals/images, preprocessing based on meta-data)