Simpler Machine Learning with SKLL

Post on 09-May-2015


Dan Blanchard Educational Testing Service

dblanchard@ets.org

PyData NYC 2013

Survived / Perished

• first class, female, 1 sibling, 35 years old
• third class, female, 2 siblings, 18 years old
• second class, male, 0 siblings, 50 years old

Can we predict survival from data?

SciKit-Learn Laboratory

SKLL

It's where the learning happens.

Learning to Predict Survival

1. Split up given training set: train (80%) and dev (20%)

    $ ./make_titanic_example_data.py
    Creating titanic/train directory
    Creating titanic/dev directory
    Creating titanic/test directory
    Loading train.csv............done
    Loading test.csv........done
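The 80%/20% split in step 1 can be illustrated with a minimal sketch. This is not the code of make_titanic_example_data.py, which the slides do not show; the function name split_train_dev, the shuffling, and the seed are assumptions made for illustration only.

```python
import random


def split_train_dev(rows, dev_fraction=0.2, seed=0):
    """Shuffle rows and return (train, dev) lists.

    Hypothetical helper: whether the real script shuffles is not shown.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(round(len(shuffled) * dev_fraction))
    # The first n_dev shuffled rows become dev, the rest become train
    return shuffled[n_dev:], shuffled[:n_dev]


# Toy stand-in for the rows of train.csv (made-up passenger IDs)
rows = list(range(1, 101))
train, dev = split_train_dev(rows)
print(len(train), len(dev))
```

Fixing the seed keeps the split reproducible, so that accuracy numbers measured on dev stay comparable between runs.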

2. Pick classifiers to try:

    1. Random forest
    2. Support Vector Machine (SVM)
    3. Naive Bayes

3. Create configuration file for SKLL

    [General]
    experiment_name = Titanic_Evaluate
    task = evaluate

    [Input]
    train_location = train
    test_location = dev
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Output]
    results = output
    models = output

• train_location: directory with feature files for training learner
• test_location: directory with feature files for evaluating performance
• family.csv: # of siblings, spouses, parents, children
• misc.csv: departure port
• socioeconomic.csv: fare & passenger class
• vitals.csv: sex & age
• results: directory to store evaluation results
• models: directory to store trained models
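To make the structure of the configuration file concrete, here is a small sketch that reads the same text with Python's standard configparser and decodes the list-valued fields as JSON. It is not SKLL's own parser; it only illustrates that featuresets is a list of lists of file names and learners is a list of names. Pairing every learner with every feature set, as the jobs list does, is an assumption about how the experiment is expanded.

```python
import configparser
import json

CONFIG_TEXT = """
[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_location = train
test_location = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Output]
results = output
models = output
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# The list-valued options are valid JSON, so json.loads can decode them
featuresets = json.loads(parser["Input"]["featuresets"])
learners = json.loads(parser["Input"]["learners"])

# Assumed expansion: one job per (learner, feature set) pair
jobs = [(learner, tuple(files)) for files in featuresets for learner in learners]
for learner, files in jobs:
    print(learner, "on", ", ".join(files))
```

With one feature set and three learners this yields three jobs, which matches the three rows in the results tables of the deck.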

4. Run the configuration file with run_experiment

    $ run_experiment evaluate.cfg
    Loading train/family.csv...........done
    Loading train/misc.csv...........done
    Loading train/socioeconomic.csv...........done
    Loading train/vitals.csv...........done
    Loading dev/family.csv.....done
    Loading dev/misc.csv.....done
    Loading dev/socioeconomic.csv.....done
    Loading dev/vitals.csv.....done
    Loading train/family.csv...........done
    Loading train/misc.csv...........done
    Loading train/socioeconomic.csv...........done
    Loading train/vitals.csv...........done
    Loading dev/family.csv.....done
    ...

5. Examine results

    Experiment Name: Titanic_Evaluate
    Training Set: train
    Test Set: dev
    Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
    Learner: RandomForestClassifier
    Task: evaluate

    +-------+------+------+-----------+--------+-----------+
    |       |  0.0 |  1.0 | Precision | Recall | F-measure |
    +-------+------+------+-----------+--------+-----------+
    | 0.000 | [97] |   18 |     0.874 |  0.843 |     0.858 |
    +-------+------+------+-----------+--------+-----------+
    | 1.000 |   14 | [50] |     0.735 |  0.781 |     0.758 |
    +-------+------+------+-----------+--------+-----------+
    (row = reference; column = predicted)

    Accuracy = 0.8212290502793296

Aggregate Evaluation Results

    Dev. Accuracy   Learner
    0.821           RandomForestClassifier
    0.771           SVC
    0.709           MultinomialNB
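The precision, recall, F-measure, and accuracy figures in the RandomForestClassifier report follow directly from the four counts in its confusion matrix. The sketch below recomputes them in plain Python; the helper name per_class_metrics is made up for this illustration and is not part of SKLL.

```python
# Counts copied from the RandomForestClassifier report
# (row = reference, column = predicted)
conf_matrix = [[97, 18],
               [14, 50]]


def per_class_metrics(matrix, k):
    """Return (precision, recall, F-measure) for class index k."""
    true_positives = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column total
    reference_k = sum(matrix[k])                    # row total
    precision = true_positives / predicted_as_k
    recall = true_positives / reference_k
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Accuracy is the diagonal total divided by the number of examples
n_classes = len(conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(n_classes))
total = sum(sum(row) for row in conf_matrix)
accuracy = correct / total

for k in range(n_classes):
    p, r, f = per_class_metrics(conf_matrix, k)
    print("class", k, "precision", p, "recall", r, "f-measure", f)
print("accuracy", accuracy)
```

Up to rounding, the values agree with the report: 147 of the 179 dev examples are classified correctly, which gives the 0.821 accuracy listed for the random forest.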

Tuning learner

• Can we do better than default hyperparameters?

    [General]
    experiment_name = Titanic_Evaluate
    task = evaluate

    [Input]
    train_location = train
    test_location = dev
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Tuning]
    grid_search = true
    objective = accuracy

    [Output]
    results = output

Tuned Evaluation Results

    Untuned Accuracy   Tuned Accuracy   Learner
    0.821              0.849            RandomForestClassifier
    0.771              0.737            SVC
    0.709              0.709            MultinomialNB
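The idea behind grid_search = true with objective = accuracy can be sketched as an exhaustive search: try every combination in a parameter grid and keep the one with the best objective score. The example below is a toy in plain Python, not SKLL's tuning code. The data, the one-feature threshold classifier, and the grid values are all invented for illustration, and the score is computed on the training set directly, whereas a real grid search would typically use cross-validation.

```python
from itertools import product

# Toy (age, label) pairs; invented data, not the Titanic features
train = [(5, 1), (12, 1), (30, 0), (45, 0), (8, 1), (60, 0)]
dev = [(10, 1), (35, 0), (50, 0), (3, 1)]


def predict(age, threshold, young_label):
    """Toy classifier: one label below the threshold, the other above."""
    return young_label if age < threshold else 1 - young_label


def accuracy(data, **params):
    """Objective function: fraction of examples predicted correctly."""
    correct = sum(predict(age, **params) == label for age, label in data)
    return correct / len(data)


# Hypothetical hyperparameter grid
param_grid = {"threshold": [10, 20, 40], "young_label": [0, 1]}

best_params, best_score = None, -1.0
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    score = accuracy(train, **params)
    if score > best_score:
        best_params, best_score = params, score

print("best parameters:", best_params)
print("dev accuracy with best parameters:", accuracy(dev, **best_params))
```

Because the setting is chosen without looking at dev, the tuned score on dev can go down as well as up, as the SVC row in the table shows.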

Using All Available Data

• Use training and dev to generate predictions on test

    [General]
    experiment_name = Titanic_Predict
    task = predict

    [Input]
    train_location = train+dev
    test_location = test
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Tuning]
    grid_search = true
    objective = accuracy

    [Output]
    results = output

Test Set Performance

    Untuned Accuracy   Tuned Accuracy   Untuned Accuracy   Tuned Accuracy   Learner
    (Train only)       (Train only)     (Train + Dev)      (Train + Dev)
    0.732              0.746            0.746              0.756            RandomForestClassifier
    0.608              0.617            0.612              0.641            SVC
    0.627              0.623            0.622              0.622            MultinomialNB

Advanced SKLL Features

• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data
• Parameter grids for all supported classifiers/regressors
• Parallelize experiments on DRMAA clusters
• Ablation experiments
• Collapse/rename classes from config file
• Rescale predictions to be closer to observed data
• Feature scaling
• Python API

Currently Supported Learners

    Classifiers                      Regressors
    Linear Support Vector Machine    Elastic Net
    Logistic Regression              Lasso
    Multinomial Naive Bayes          Linear
    Decision Tree
    Gradient Boosting
    Random Forest
    Support Vector Machine

Coming Soon

Classifiers Regressors

AdaBoost

K-Nearest Neighbors

Stochastic Gradient Descent

Acknowledgements

• Mike Heilman

• Nitin Madnani

• Aoife Cahill

References

• Dataset: kaggle.com/c/titanic-gettingStarted

• SKLL GitHub: github.com/EducationalTestingService/skll

• SKLL Docs: skll.readthedocs.org

• Titanic configs and data splitting script in examples dir on GitHub

Twitter: @Dan_S_Blanchard

GitHub: dan-blanchard

Bonus Slides

Cross-validation

    [General]
    experiment_name = Titanic_CV
    task = cross_validate

    [Input]
    train_location = train+dev
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Tuning]
    grid_search = true
    objective = accuracy

    [Output]
    results = output

Cross-validation Results

    Avg. CV Accuracy   Learner
    0.815              RandomForestClassifier
    0.717              SVC
    0.681              MultinomialNB
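For readers new to the cross_validate task, the sketch below shows the general k-fold idea behind an average CV accuracy: the data is cut into k folds, each fold serves once as the test set while the rest is used for training, and the per-fold scores are averaged. It is a plain-Python illustration, not SKLL's implementation. The function names are invented, k = 10 mirrors the 10-fold comment in the API slides, the fold accuracies are made-up numbers, and there is no shuffling or stratification.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_idx, test_idx) pairs; each example is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_examples))
        yield train_idx, test_idx
        start += size


def average(scores):
    """Mean of the per-fold scores."""
    return sum(scores) / len(scores)


folds = list(k_fold_indices(n_examples=23, k=10))
for train_idx, test_idx in folds:
    print(len(train_idx), "train /", len(test_idx), "test")

# Made-up per-fold accuracies, only to show how the average is formed
print(average([0.5, 1.0, 0.75, 0.75]))
```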

SKLL API

    from skll import Learner, load_examples

    # Load training examples
    train_examples = load_examples('myexamples.megam')

    # Train a linear SVM
    learner = Learner('LinearSVC')
    learner.train(train_examples)

    # Load test examples and evaluate
    test_examples = load_examples('test.tsv')
    (conf_matrix, accuracy, prf_dict, model_params,
     obj_score) = learner.evaluate(test_examples)

    # Generate predictions from trained model
    predictions = learner.predict(test_examples)

    # Perform 10-fold cross-validation with a radial SVM
    learner = Learner('SVC')
    (fold_result_list,
     grid_search_scores) = learner.cross_validate(train_examples)

• conf_matrix: confusion matrix
• prf_dict: precision, recall, f-score for each class
• model_params: tuned model parameters
• obj_score: objective function score on test set
• fold_result_list: per-fold evaluation results
• grid_search_scores: per-fold training set obj. scores

SKLL API

    import numpy as np
    import os
    from skll import write_feature_file

    # Assumed values: the slide uses these two names without defining them
    num_train_examples = 100
    _my_dir = os.path.dirname(os.path.abspath(__file__))

    # Create some training examples
    classes = []
    ids = []
    features = []
    for i in range(num_train_examples):
        y = "dog" if i % 2 == 0 else "cat"
        ex_id = "{}{}".format(y, i)
        x = {"f1": np.random.randint(1, 4),
             "f2": np.random.randint(1, 4),
             "f3": np.random.randint(1, 4)}
        classes.append(y)
        ids.append(ex_id)
        features.append(x)

    # Write them to a file
    train_path = os.path.join(_my_dir, 'train', 'test_summary.jsonlines')
    write_feature_file(train_path, ids, classes, features)