How to solve a problem with machine learning
How I did my Master Thesis
Amendra Shrestha
Uppsala University
January 10, 2017
Introduction Workflow Data Preparation Modeling Process Experiment
Introduction
Find a topic
Reading
• papers
• online machine learning courses (coursera.org)
• self-study tutorials: Machine Learning Mastery (http://machinelearningmastery.com)
• writing
Data
• crawl data from social media, discussion forums, blogs
• download from archives
  • KONECT (http://konect.uni-koblenz.de)
  • UCI (http://archive.ics.uci.edu/ml/datasets.html)
  • Kaggle (http://blog.kaggle.com)
  • Språkbanken (https://spraakbanken.gu.se/)
Project workflow
• Data Preparation
  • data cleaning
  • data preparation
  • feature vector creation
• Modeling Process
  • feature selection
  • transformation
  • missing data
  • model generation
  • model selection
Data Preparation
Cleaning data
• removing duplicates
• impossible values
  • negative ages
• misspelt words
• inconsistent time formats
• unwanted elements
  • text: quotes, retweets, strange symbols, URLs, punctuation, function words
• outliers
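
A few of the cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the code used in the thesis: the function names, the tiny function-word list, the ASCII-only character filter and the 0 to 120 age range are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny function-word list; a real project would use a fuller one.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "is", "and", "on", "at"}


def clean_text(text):
    """Lowercase a post and strip URLs, retweet markers, symbols and function words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)     # URLs
    text = re.sub(r"\brt\b\s*@\w+:?", " ", text)  # retweet markers such as "RT @user:"
    # Assumption: ASCII-only text. This also drops letters such as "å".
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # quotes, punctuation, strange symbols
    tokens = [t for t in text.split() if t not in FUNCTION_WORDS]
    return " ".join(tokens)


def remove_duplicates(texts):
    """Drop exact duplicates, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for t in texts:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique


def valid_age(age):
    """Flag impossible values; the 0-120 range is an assumed plausibility bound."""
    return 0 <= age <= 120
```

Cleaning before deduplication is usually the safer order, since two posts that differ only in a URL or in punctuation become identical after cleaning.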
Preparation of data
• alterations of data
• stemming and lemmatization of text
• uniformization of units
Feature vector
• n-dimensional vector of features
• types
  • data dependent
  • data independent
• text
  • bag of words
  • term frequency
  • tf-idf
  • n-grams
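
The text features listed above can be computed with the standard library alone. The sketch below is illustrative: the function names and the toy documents are made up, and the idf variant is the plain log(N / df) without smoothing, which is only one of several common definitions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def term_frequency(tokens, vocabulary):
    """Bag-of-words counts divided by the document length."""
    total = len(tokens)
    return [c / total if total else 0.0 for c in bag_of_words(tokens, vocabulary)]


def tf_idf(documents, vocabulary):
    """Weight term frequency by log(N / df), where df is the document frequency."""
    n_docs = len(documents)
    df = {term: sum(1 for doc in documents if term in doc) for term in vocabulary}
    idf = {t: math.log(n_docs / df[t]) if df[t] else 0.0 for t in vocabulary}
    return [
        [tf * idf[term] for tf, term in zip(term_frequency(doc, vocabulary), vocabulary)]
        for doc in documents
    ]


# Toy corpus, invented for the example.
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vocab = sorted({t for d in docs for t in d})  # ['bad', 'good', 'movie', 'plot']
weights = tf_idf(docs, vocab)
```

Each document becomes a vector with one dimension per vocabulary term, so the dimensionality n grows with the vocabulary.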
Modeling Process
Feature selection
use a minimal number of maximally informative features, to limit:
• noise
• overfitting
• computational load
how to find the best features?
• background/expert knowledge
• pairwise statistical analysis
• model validation
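
One possible reading of "pairwise statistical analysis" is to score each feature by its correlation with the target and keep the highest-ranked ones. The sketch below assumes numeric features and a numeric target and uses the Pearson coefficient; the slides do not say which statistic was used, and the function names and data are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_features(rows, target):
    """Return (feature index, |correlation with target|), strongest first."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((j, abs(pearson(column, target))))
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Toy data: feature 0 tracks the target, feature 1 is constant, feature 2 is noisy.
rows = [[1.0, 5.0, 3.0], [2.0, 5.0, 1.0], [3.0, 5.0, 4.0], [4.0, 5.0, 1.5]]
target = [2.0, 4.0, 6.0, 8.0]
ranking = rank_features(rows, target)
```

A correlation ranking only sees linear, one-feature-at-a-time relationships, which is why the list above also names expert knowledge and model validation.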
Transformation
• scaling
• PCA
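
Two common forms of scaling are sketched below for a single feature column: standardization (zero mean, unit variance) and min-max scaling to [0, 1]. The function names are hypothetical and the slides do not say which form was used. PCA is left out because it would normally rely on a linear algebra library.

```python
import math


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    if std == 0:
        return [0.0] * n  # constant column: nothing to scale
    return [(x - mean) / std for x in column]


def min_max_scale(column):
    """Rescale a column linearly so that its minimum is 0 and its maximum is 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(x - lo) / (hi - lo) for x in column]
```

The scaling parameters (mean, standard deviation, minimum, maximum) should be computed on the training data only and then reused for validation and test data.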
Dealing with incomplete data
• use a model that can deal with missing items
• throw away the incomplete data
• simple statistic (not recommended)
  • mean, median, mode
• k-NN imputation
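
The difference between a simple statistic and k-nearest-neighbour imputation can be shown on a small table. This is a sketch under assumptions: missing values are encoded as None, distances are computed only over the columns both rows have observed, and the function names and data are invented for the example.

```python
import math


def mean_impute(rows):
    """Replace each missing value (None) with the mean of its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def knn_impute(rows, k=2):
    """Replace each missing value with the mean over the k most similar rows."""

    def distance(a, b):
        # Root mean squared difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows where column j is observed.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # leave the gap if no row can supply a value
                new_row[j] = sum(nearest) / len(nearest)
        completed.append(new_row)
    return completed


# Toy table: the last row resembles the first two, not the third.
rows = [[1.0, 10.0], [1.1, 12.0], [5.0, 50.0], [1.2, None]]
by_mean = mean_impute(rows)
by_knn = knn_impute(rows, k=2)
```

Here the column mean is pulled towards the dissimilar third row, while the k-NN estimate uses only the two rows closest to the incomplete one.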
Model generation
• supervised learning
  • artificial neural network
  • decision tree learning
  • support vector machines
  • random forests
• unsupervised learning
  • clustering
Model selection
• how well the model performs on new data
• splitting the data into training, validation and test sets
• cross validation
  • divide data into n subsets
  • generate n models, each using all but one subset
  • test each model on the hold-out subset
  • combine results
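
The four cross-validation steps above map directly onto a short loop. The sketch below is generic: `train` and `evaluate` are caller-supplied functions, and the majority-class "model" is a hypothetical stand-in used only to make the example runnable. Results are combined by averaging, which is one common choice.

```python
def k_fold_indices(n_samples, n_folds):
    """Divide the sample indices into n_folds disjoint subsets."""
    return [list(range(i, n_samples, n_folds)) for i in range(n_folds)]


def cross_validate(xs, ys, n_folds, train, evaluate):
    """Train on all but one subset, test on the hold-out subset, average the scores."""
    scores = []
    for held_out in k_fold_indices(len(xs), n_folds):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        test_x = [xs[i] for i in held_out]
        test_y = [ys[i] for i in held_out]
        model = train(train_x, train_y)
        scores.append(evaluate(model, test_x, test_y))
    return sum(scores) / len(scores)


# Hypothetical stand-in model: always predict the most frequent training label.
def train_majority(xs, ys):
    return max(set(ys), key=ys.count)


def accuracy(model, xs, ys):
    return sum(1 for y in ys if y == model) / len(ys)


xs = list(range(6))
ys = [1, 1, 1, 1, 0, 1]
score = cross_validate(xs, ys, 3, train_majority, accuracy)
```

In practice the data is usually shuffled before it is divided, so that the subsets do not inherit any ordering present in the original file.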
Experiment
Thank You