Simpler Machine Learning with SKLL

Dan Blanchard, Educational Testing Service

dblanchard@ets.org

PyData NYC 2013

Survived / Perished

first class, female, 1 sibling, 35 years old

third class, female, 2 siblings, 18 years old

second class, male,

Can we predict survival from data?

SciKit-Learn Laboratory

It's where the learning happens.

Learning to Predict Survival

1. Split up given training set: train (80%) and dev (20%)

    $ ./make_titanic_example_data.py
    Creating titanic/train directory
    Creating titanic/dev directory
    Creating titanic/test directory
    Loading train.csv............done
    Loading test.csv........done
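The 80/20 split in step 1 can be sketched with the standard library alone. This is a minimal illustration, not the logic of make_titanic_example_data.py: the function name split_train_dev, the shuffling, the seed, and the stand-in rows are all assumptions.

```python
import random


def split_train_dev(rows, dev_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, dev) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(round(len(shuffled) * dev_fraction))
    # The first n_dev shuffled rows become dev; the rest stay in train.
    return shuffled[n_dev:], shuffled[:n_dev]


# Hypothetical stand-in for the rows of train.csv (not real Titanic data).
passengers = [{"PassengerId": i} for i in range(1, 11)]
train, dev = split_train_dev(passengers)
print(len(train), len(dev))
```

Fixing the seed keeps the split identical between runs, so dev-set scores from different learners stay comparable.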

2. Pick classifiers to try:

    1. Random forest
    2. Support Vector Machine (SVM)
    3. Naive Bayes

3. Create configuration file for SKLL

    [General]
    experiment_name = Titanic_Evaluate
    task = evaluate

    [Input]
    train_location = train
    test_location = dev
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Output]
    results = output
    models = output

Config annotations:

• train_location: directory with feature files for training learner

• test_location: directory with feature files for evaluating performance

• family.csv: # of siblings, spouses, parents, children

• misc.csv: departure port

• socioeconomic.csv: fare & passenger class

• vitals.csv: sex & age

• results: directory to store evaluation results

• models: directory to store trained models
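The configuration file is INI-style, with list values that look like JSON. As a rough sketch of how such a file could be read, the block below parses the same text with Python's configparser and json modules. It does not use SKLL's own parser, and the idea that each (feature set, learner) pair becomes one job is an assumption drawn from the run output, not a documented guarantee.

```python
import configparser
import json

CONFIG_TEXT = """
[General]
experiment_name = Titanic_Evaluate
task = evaluate

[Input]
train_location = train
test_location = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
label_col = Survived

[Output]
results = output
models = output
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# The list-valued options are stored as strings; decode them as JSON.
featuresets = json.loads(config["Input"]["featuresets"])
learners = json.loads(config["Input"]["learners"])

# Assumption: one job per (feature set, learner) combination.
jobs = [(tuple(fs), learner) for fs in featuresets for learner in learners]
for feature_files, learner in jobs:
    print(learner, "on", ", ".join(feature_files))
```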

4. Run the configuration file with run_experiment

    $ run_experiment evaluate.cfg
    Loading train/family.csv...........done
    Loading train/misc.csv...........done
    Loading train/socioeconomic.csv...........done
    Loading train/vitals.csv...........done
    Loading dev/family.csv.....done
    Loading dev/misc.csv.....done
    Loading dev/socioeconomic.csv.....done
    Loading dev/vitals.csv.....done
    Loading train/family.csv...........done
    Loading train/misc.csv...........done
    Loading train/socioeconomic.csv...........done
    Loading train/vitals.csv...........done
    Loading dev/family.csv.....done
    ...

    Experiment Name: Titanic_Evaluate
    Training Set: train
    Test Set: dev
    Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
    Learner: RandomForestClassifier
    Task: evaluate

    +-------+------+------+-----------+--------+-----------+
    |       |  0.0 |  1.0 | Precision | Recall | F-measure |
    +-------+------+------+-----------+--------+-----------+
    | 0.000 | [97] |   18 |     0.874 |  0.843 |     0.858 |
    +-------+------+------+-----------+--------+-----------+
    | 1.000 |   14 | [50] |     0.735 |  0.781 |     0.758 |
    +-------+------+------+-----------+--------+-----------+
    (row = reference; column = predicted)
    Accuracy = 0.8212290502793296
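The precision, recall, F-measure, and accuracy figures in this report follow directly from the confusion matrix. The sketch below recomputes them from the four counts shown, using the standard definitions; the function names are illustrative and are not part of SKLL.

```python
def per_class_metrics(conf_matrix, labels):
    """Return {label: (precision, recall, f_measure)}.

    conf_matrix[i][j] counts examples whose reference class is i
    and whose predicted class is j (row = reference, column = predicted).
    """
    results = {}
    for i, label in enumerate(labels):
        true_pos = conf_matrix[i][i]
        predicted_as_label = sum(row[i] for row in conf_matrix)  # column sum
        actually_label = sum(conf_matrix[i])                     # row sum
        precision = true_pos / predicted_as_label if predicted_as_label else 0.0
        recall = true_pos / actually_label if actually_label else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f_measure)
    return results


def accuracy(conf_matrix):
    """Fraction of examples on the diagonal of the confusion matrix."""
    correct = sum(conf_matrix[i][i] for i in range(len(conf_matrix)))
    total = sum(sum(row) for row in conf_matrix)
    return correct / total


# Counts taken from the RandomForestClassifier report above.
matrix = [[97, 18],
          [14, 50]]
metrics = per_class_metrics(matrix, ["0.0", "1.0"])
for label, (p, r, f) in metrics.items():
    print(label, round(p, 3), round(r, 3), round(f, 3))
print("Accuracy =", accuracy(matrix))
```

For class 0.0, for example, precision is 97 / (97 + 14) and recall is 97 / (97 + 18), which round to the 0.874 and 0.843 in the table.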

5. Examine results

Aggregate Evaluation Results

    Dev. Accuracy    Learner
    0.821            RandomForestClassifier
    0.771            SVC
    0.709            MultinomialNB

Tuning learner

• Can we do better than default hyperparameters?

    [General]
    experiment_name = Titanic_Evaluate
    task = evaluate

    [Input]
    train_location = train
    test_location = dev
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Tuning]
    grid_search = true
    objective = accuracy

    [Output]
    results = output
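Setting grid_search = true asks SKLL to try several hyperparameter combinations and keep the one that scores best on the chosen objective. The general idea can be sketched as an exhaustive search over a parameter grid. Everything below is a toy: the data points are made up, the "learner" is a one-parameter threshold rule, and scoring on the same data it was fit to is a simplification, since how SKLL scores each combination internally is not stated here.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:  # strict '>' keeps the first of any ties
            best_params, best_score = params, score
    return best_params, best_score


# Made-up (x, label) pairs, purely for illustration.
data = [(2, 1), (8, 1), (15, 1), (22, 0), (30, 0), (41, 0), (55, 0), (63, 1)]


def accuracy_of_threshold_rule(threshold, below_label):
    """Accuracy of: predict below_label if x < threshold, else the other label."""
    correct = 0
    for x, label in data:
        predicted = below_label if x < threshold else 1 - below_label
        correct += predicted == label
    return correct / len(data)


grid = {"threshold": [10, 20, 40, 60], "below_label": [0, 1]}
best_params, best_score = grid_search(grid, accuracy_of_threshold_rule)
print(best_params, best_score)
```

The objective is passed in as a function, so switching from accuracy to another metric only means supplying a different score_fn.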

Tuned Evaluation Results

    Untuned Accuracy    Tuned Accuracy    Learner
    0.821               0.849             RandomForestClassifier
    0.771               0.737             SVC
    0.709               0.709             MultinomialNB

Using All Available Data

• Use training and dev to generate predictions on test

    [General]
    experiment_name = Titanic_Predict
    task = predict

    [Input]
    train_location = train+dev
    test_location = test
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Tuning]
    grid_search = true
    objective = accuracy

    [Output]
    results = output

Test Set Performance

    Untuned Accuracy    Tuned Accuracy    Untuned Accuracy    Tuned Accuracy    Learner
    (Train only)        (Train only)      (Train + Dev)       (Train + Dev)
    0.732               0.746             0.746               0.756             RandomForestClassifier
    0.608               0.617             0.612               0.641             SVC
    0.627               0.623             0.622               0.622             MultinomialNB

Advanced SKLL Features

• Read/write .arff, .csv, .jsonlines, .megam, .ndj, and .tsv data

• Parameter grids for all supported classifiers/regressors

• Parallelize experiments on DRMAA clusters

• Ablation experiments

• Collapse/rename classes from config file

• Rescale predictions to be closer to observed data

• Feature scaling

• Python API
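Of the features listed above, ablation is easy to picture with the Titanic feature files: rerun the experiment with one feature file removed at a time and see how much the score drops. The sketch below only enumerates the feature sets such a study would compare. It assumes the leave-one-out reading of "ablation"; the helper name ablation_featuresets is hypothetical and says nothing about how SKLL implements the feature.

```python
from itertools import combinations


def ablation_featuresets(feature_files, num_removed=1):
    """Return every feature set obtained by dropping num_removed files."""
    keep = len(feature_files) - num_removed
    return [list(subset) for subset in combinations(feature_files, keep)]


files = ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]

# Full feature set first (the baseline), then each leave-one-out variant.
featuresets = [files] + ablation_featuresets(files)
for fs in featuresets:
    removed = sorted(set(files) - set(fs))
    print("without", removed if removed else "nothing", "->", fs)
```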

Currently Supported Learners

    Classifiers                      Regressors
    Linear Support Vector Machine    Elastic Net
    Logistic Regression              Lasso
    Multinomial Naive Bayes          Linear

    Classifiers and regressors:
    Decision Tree
    Gradient Boosting
    Random Forest
    Support Vector Machine

Coming Soon

Classifiers Regressors

AdaBoost

K-Nearest Neighbors

Stochastic Gradient Descent

Acknowledgements

• Mike Heilman

• Nitin Madnani

• Aoife Cahill

References

• Dataset: kaggle.com/c/titanic-gettingStarted

• SKLL GitHub: github.com/EducationalTestingService/skll

• SKLL Docs: skll.readthedocs.org

• Titanic configs and data splitting script in examples dir on GitHub

@Dan_S_Blanchard

dan-blanchard

Bonus Slides

Cross-validation

    [General]
    experiment_name = Titanic_CV
    task = cross_validate

    [Input]
    train_location = train+dev
    featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
    learners = ["RandomForestClassifier", "SVC", "MultinomialNB"]
    label_col = Survived

    [Tuning]
    grid_search = true
    objective = accuracy

    [Output]
    results = output

Cross-validation Results

    Avg. CV Accuracy    Learner
    0.815               RandomForestClassifier
    0.717               SVC
    0.681               MultinomialNB
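The averaged accuracies above come from training and evaluating on several different partitions of the same data. The mechanics of k-fold cross-validation can be sketched as follows. This version uses plain contiguous folds with no shuffling or stratification, and the default of 10 folds mirrors the API example in this deck; whether SKLL partitions data this way is an assumption.

```python
def k_fold_indices(n_examples, n_folds=10):
    """Yield (train_indices, test_indices) for each of n_folds contiguous folds."""
    base, extra = divmod(n_examples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional example each.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_examples))
        yield train_idx, test_idx
        start += size


def average(scores):
    """Mean of the per-fold scores, as reported in an 'Avg. CV Accuracy' column."""
    return sum(scores) / len(scores)


folds = list(k_fold_indices(25, n_folds=10))
for train_idx, test_idx in folds:
    print(len(train_idx), "train /", len(test_idx), "test")
```

Every example lands in exactly one test fold, so each one is scored once by a model that never saw it during training.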

SKLL API

    from skll import Learner, load_examples

    # Load training examples
    train_examples = load_examples('myexamples.megam')

    # Train a linear SVM
    learner = Learner('LinearSVC')
    learner.train(train_examples)

    # Load test examples and evaluate. The returned tuple includes:
    #   conf_matrix  - confusion matrix
    #   prf_dict     - precision, recall, f-score for each class
    #   model_params - tuned model parameters
    #   obj_score    - objective function score on test set
    test_examples = load_examples('test.tsv')
    (conf_matrix, accuracy, prf_dict, model_params, obj_score) = learner.evaluate(test_examples)

    # Generate predictions from trained model
    predictions = learner.predict(test_examples)

    # Perform 10-fold cross-validation with a radial SVM. Returns:
    #   fold_result_list   - per-fold evaluation results
    #   grid_search_scores - per-fold training set obj. scores
    learner = Learner('SVC')
    (fold_result_list, grid_search_scores) = learner.cross_validate(train_examples)

    import numpy as np
    import os
    from skll import write_feature_file

    # Create some training examples
    # (num_train_examples and _my_dir are assumed to be defined elsewhere)
    classes = []
    ids = []
    features = []
    for i in range(num_train_examples):
        y = "dog" if i % 2 == 0 else "cat"
        ex_id = "{}{}".format(y, i)
        x = {"f1": np.random.randint(1, 4),
             "f2": np.random.randint(1, 4),
             "f3": np.random.randint(1, 4)}
        classes.append(y)
        ids.append(ex_id)
        features.append(x)

    # Write them to a file
    train_path = os.path.join(_my_dir, 'train', 'test_summary.jsonlines')
    write_feature_file(train_path, ids, classes, features)
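To give a feel for what a .jsonlines feature file might contain, here is a standard-library sketch that writes one JSON object per line. The field names "id", "y", and "x" are an assumption about the layout and should be checked against the SKLL documentation. The helper write_jsonlines is hypothetical and is not a replacement for write_feature_file.

```python
import json
import os
import tempfile


def write_jsonlines(path, ids, classes, features):
    """Write one JSON object per example: assumed keys are id, y (label), x (features)."""
    with open(path, "w") as out:
        for ex_id, label, feats in zip(ids, classes, features):
            out.write(json.dumps({"id": ex_id, "y": label, "x": feats}) + "\n")


# Two hand-written examples in the same shape as the loop above produces.
ids = ["dog0", "cat1"]
classes = ["dog", "cat"]
features = [{"f1": 1, "f2": 3, "f3": 2}, {"f1": 2, "f2": 2, "f3": 1}]

out_path = os.path.join(tempfile.mkdtemp(), "train.jsonlines")
write_jsonlines(out_path, ids, classes, features)

# Read the file back to confirm each line is an independent JSON record.
with open(out_path) as infile:
    records = [json.loads(line) for line in infile]
print(records)
```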
