BSSML16 L10. Summary Day 2 Sessions

Class summary

Transcript of BSSML16 L10. Summary Day 2 Sessions



BigML, Inc

Day 2 – Morning sessions


Basic transformations

Expectations vs. reality:

● Expectation: any data is always ML-ready.
● Reality: ML-ready data needs work!

What does ML-ready mean?
● Machine Learning algorithms consume instances of the question that you want to model. Each row must describe one of the instances and each column a property of the instance.
● Fields can be:
– already present in your data
– derived from your data
– generated using other fields


Basic transformations

● Select the right model for the problem you want to solve: Classification, regression, cluster analysis, anomaly detection, association discovery, topic modeling, etc.

● Perform cleansing, denormalizing, aggregating, pivoting, and other data-wrangling tasks to generate a collection of instances relevant to the problem at hand. Finally, write the output in a common format such as CSV.

● Choose the right format to store each type of feature in a field.
● Feature engineering: using domain knowledge and Machine Learning expertise, generate explicit features that help to better represent the instances (Flatline).

ML-ready steps


Basic transformations

● Cleansing: homogenize missing values and different types in the same feature, fix input errors, correct semantic issues, etc.
● Denormalizing: data is usually normalized in relational databases; ML-ready datasets need the information denormalized in a single file/dataset.
● Aggregation: when data is stored as individual transactions, as in log files, an aggregation to get the entity might be needed.
● Pivoting: different values of a feature are pivoted to new columns in the result dataset.
● Regular time windows: create new features using values over different periods of time.

Preprocessing data
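The aggregation and pivoting steps above can be sketched in plain Python. Everything here is a hypothetical illustration (the transaction rows, field names, and the `aggregate_by_user` helper are not part of the course material):

```python
from collections import defaultdict

# Hypothetical transaction log: one row per purchase event.
transactions = [
    {"user": "u1", "category": "books", "amount": 12.0},
    {"user": "u1", "category": "music", "amount": 8.0},
    {"user": "u2", "category": "books", "amount": 20.0},
]

def aggregate_by_user(rows):
    """Aggregate transactions into one ML-ready instance per user,
    pivoting each category into its own column."""
    categories = sorted({r["category"] for r in rows})
    totals = defaultdict(lambda: {c: 0.0 for c in categories})
    for r in rows:
        totals[r["user"]][r["category"]] += r["amount"]
    # One row per entity (user), one column per pivoted category.
    return {user: dict(cols) for user, cols in totals.items()}

print(aggregate_by_user(transactions))
# → {'u1': {'books': 12.0, 'music': 8.0}, 'u2': {'books': 20.0, 'music': 0.0}}
```

The result is denormalized: each user becomes a single instance, ready to be written out as one CSV row.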


Basic transformations

For numeric features:
– Discretization: percentiles, within percentiles, groups
– Replacement
– Normalization
– Exponentiation
– Shocks (speed of change compared to stdev)

For text features:
– Misspellings
– Length
– Number of subordinate sentences
– Language
– Levenshtein distance

Stacking
Compute a field using non-linear combinations of other fields

Feature engineering
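As an illustration of the discretization item above (a standard-library sketch, not BigML's own implementation), a hypothetical numeric feature can be binned into quartiles:

```python
import statistics

def discretize_quartiles(values):
    """Replace each numeric value with a quartile label (q1..q4)."""
    # quantiles(n=4) returns the three inner cut points (Q1, Q2, Q3).
    q1, q2, q3 = statistics.quantiles(values, n=4)
    def label(v):
        if v <= q1:
            return "q1"
        if v <= q2:
            return "q2"
        if v <= q3:
            return "q3"
        return "q4"
    return [label(v) for v in values]

print(discretize_quartiles([1, 2, 3, 4, 5, 6, 7, 8]))
# → ['q1', 'q1', 'q2', 'q2', 'q3', 'q3', 'q4', 'q4']
```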


Basic transformations

● Define a clear idea of the goal.
● Understand what ML tasks will achieve the goal.
● Understand the data structure to perform those ML tasks.
● Find out what kind of data you have and make it ML-ready:
– where is it, how is it stored?
– what are the features?
– can you access it programmatically?
● Feature engineering: transform the data you have into the data you actually need.
● Evaluate: try it on a small scale.
● Accept that you might have to start over…
● But when it works, automate it!

Holistic approach


Basic transformations

Command line tools: join, jq, awk, sed, sort, uniq
Automation: Shell, Python, etc.; Talend
BigML: Flatline, bindings, BigMLer, API, WhizzML
Relational DB: MySQL
Non-relational DB: MongoDB

Tools that help


Feature Engineering

Data + ML Algorithm, is that enough?

The ML Algorithm only knows about the features in the dataset. Features can be useless to the algorithm if:

● They are not correlated to the objective to be predicted
● Their values change their meaning when combined with other features

For ML Algorithms to work there must be some kind of statistical relation between some of the features and the objective. Sometimes, you must transform the available features to find such relations.

Feature engineering: the process of transforming raw data into machine-learning-ready data


Feature Engineering

When do you need Feature Engineering?
● When the relationship between the feature and the objective is mathematically unsatisfying
● When the relationship of a function of two or more features with the objective is far more relevant than that of the original features
● When there is missing data
● When the data is time-series, especially when the previous time period's objective is known
● When the data can't be used for machine learning in the obvious way (e.g., timestamps, text data)


Feature Engineering

Mathematical transformations
● Statistical aggregations (group by, all and all-but)
● Better categories
– too many detailed categories should be avoided
– ordered categories can be translated to numeric values; the model will be able to extract more information by partitioning the ordered number range
● Binning or discretization: consider whether your number is more informative in ranges (quartiles, deciles, percentiles), even for the objective field
● Linearization: unimportant for decision trees but can matter for logistic regression (watch out for exponential distributions)

Missing data
● Missing value imputation (replace missings with common values: mean, median, mode, even with a Machine Learning model)
● The presence of missing values can itself be informative, so it can be added as a new feature
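Both missing-data ideas above, imputation with a common value and a missingness-indicator feature, can be sketched as follows (the column values and the helper name are hypothetical):

```python
import statistics

def impute_with_indicator(values):
    """Replace missings (None) with the column median and add a
    was_missing indicator as a new feature."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    imputed = [v if v is not None else median for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

vals, flags = impute_with_indicator([3.0, None, 5.0, 7.0])
print(vals)   # → [3.0, 5.0, 5.0, 7.0]
print(flags)  # → [0, 1, 0, 0]
```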


Feature Engineering

Time-series transformations
● Better objective (percent change instead of absolute values)
● Deltas from previous reference time points
● Deltas from moving average (time windows)
● Recent volatility…

Problem: exponential explosion of possible transformations

Caveats:
● The regularity in time of the points has to match your training data
● You have to keep track of past points to compute your windows
● It is really easy to get information leakage by including your objective in a window computation (and it can be very hard to detect)!
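Two of the transformations above, percent change and deltas from a moving average, can be sketched like this; each window uses only past values, which is one way to respect the leakage caveat. The helper names and the series are hypothetical:

```python
def pct_change(series):
    """Better objective: percent change instead of absolute values."""
    return [(b - a) / a for a, b in zip(series, series[1:])]

def delta_from_moving_average(series, window):
    """Delta of each point from the moving average of the *previous*
    `window` points, so the current objective never leaks into its
    own window computation."""
    deltas = []
    for i in range(window, len(series)):
        avg = sum(series[i - window:i]) / window
        deltas.append(series[i] - avg)
    return deltas

print(pct_change([100.0, 110.0, 99.0]))                           # → [0.1, -0.1]
print(delta_from_moving_average([1.0, 2.0, 3.0, 6.0], window=2))  # → [1.5, 3.5]
```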


Feature Engineering

Date-time features
● Cannot be used "as is" in a model: a date-time is really a collection of features. BigML can decompose them automatically when they are provided in the most usual formats. With Flatline, you can decompose them all.
● Date-time predicates that the computer does not know (some of them domain-dependent): working hours? Daylight? Rush hour?…

Text features
● Bag of words: a new feature is associated with each word in the document
● Tokenization: how do we select tokens? Do we want n-grams? What about numbers?
● Stemming: grouping forms of the same word into a unique term
● Length
● Text predicates: dollar amounts? Dates? Salutations? Please and thank you?
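The bag-of-words and tokenization points can be sketched as follows. The tokenizer, the stopword list, and the sample document are hypothetical simplifications, not BigML's text analysis:

```python
import re
from collections import Counter

def bag_of_words(document, stopwords=frozenset({"the", "a", "is"})):
    """Tokenize a document and associate a count feature with each word."""
    # Crude tokenization: lowercase runs of letters/apostrophes only.
    tokens = re.findall(r"[a-z']+", document.lower())
    return Counter(t for t in tokens if t not in stopwords)

bow = bag_of_words("The quick fox is a quick fox")
print(bow)  # → Counter({'quick': 2, 'fox': 2})
```

Real tokenizers must also decide about n-grams, numbers, and stemming, which this sketch ignores.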


Feature Engineering

Machine Learning for feature engineering
● Latent Dirichlet Allocation
– Learn word distributions for topics
– Infer topic scores for each document
– Use the topic scores as features for a model (dimensionality reduction)
● Distance to cluster centroids
● Stacked generalization: classifiers provide new features
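The distance-to-centroid idea can be sketched like this, assuming hypothetical cluster centroids rather than ones produced by BigML:

```python
import math

def centroid_distance_features(instance, centroids):
    """New features: Euclidean distance from an instance to each
    cluster centroid (the centroids here are hypothetical)."""
    return [math.dist(instance, c) for c in centroids]

centroids = [(0.0, 0.0), (3.0, 4.0)]
print(centroid_distance_features((3.0, 4.0), centroids))  # → [5.0, 0.0]
```

Each distance becomes one extra column in the dataset, letting a supervised model exploit the unsupervised cluster structure.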


Day 2 – Evening sessions


REST API, bindings and basic workflows

What do Machine Learning workflows look like? In academia vs. in the real world.

We need high-level tools to face real-world workflows by growing in:

● Automation
● Abstraction


REST API, bindings and basic workflows

The foundations
● REST API-first applications: standards in software development. First level of abstraction.

Client-side tools

● Web UI: Sitting on top of the REST API. Human-friendly access and visualizations for all the Machine Learning resources. Workflows must be defined and executed step by step. Second level of abstraction.

● Bindings: Sitting on top of the REST API. Fine-grained accessors for the REST API calls. Workflows must be defined and executed step by step. Second level of abstraction.

● BigMLer: Relying on the bindings. High-level syntax. Entire workflows can be created in only one command line. Third level of abstraction.


REST API, bindings and basic workflows

BigMLer automation

● Basic 1-click workflows in one command line
● Rich parameterized workflows: feature selection, cross-validation, etc.
● Models are downloaded to your laptop, tablet, cell phone, etc. once and can be used offline to create predictions

Great for local predictions. Still…


REST API, bindings and basic workflows

Problems of client-side solutions

● Complexity: lots of details outside the problem domain
● Reuse: no inter-language compatibility
● Scalability: client-side workflows are hard to optimize
● Extensibility: BigMLer hides complexity at the cost of flexibility
● Not enough abstraction


REST API, bindings and basic workflows

Solution: bringing automation and abstraction to the server side

● DSL for ML workflow automation
● Framework for scalable, remote execution of ML workflows

● Sophisticated server-side optimization
● Out-of-the-box scalability
● Client-server brittleness removed
● Infrastructure for creating and sharing ML scripts and libraries

WhizzML


REST API, bindings and basic workflows

WhizzML's new REST API resources:

● Scripts: executable code that describes an actual workflow, taking a list of typed inputs and producing a list of outputs.
● Executions: given a script and a complete set of inputs, the workflow can be executed and its outputs generated.
● Libraries: a collection of WhizzML definitions that can be imported by other libraries or scripts.


REST API, bindings and basic workflows

Scripts

Creating scripts
● Usable by any binding (from any language)
● Built-in parallelization
● BigML resources management as primitives of the language
● Complete programming language for workflow definition

Using scripts
● Web UI
● Bindings
● BigMLer
● WhizzML


Advanced WhizzML workflows

WhizzML offers:
● Primitives for all ML resources (datasets, models, clusters, etc.)
● A complete programming language to compose these ML resources at will
● Parallelization and scalability built in

This empowers the user to benefit from:
● Automated feature engineering: best-first feature selection
● Automated configuration choice: randomized parameter optimization, SMACdown
● Complex algorithms as 1-click: stacked generalization, boosting

All of them can be shared, reproduced and reused as one more BigML resource in a language-agnostic way.


Advanced WhizzML workflows

[Diagram: best-first feature selection. Step 1: starting from an empty set of selected fields (), candidate models are built by adding each field; the best score is obtained for the model with (f5). Step 2: a second field is added to (f5); the best score is obtained for the model with (f5 f7). Following iterations don't improve the score for the model with (f5 f7), so the process stops.]

Best-first feature selection
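The loop in the diagram can be sketched generically. The `score` function here is a hypothetical stand-in for training and evaluating a model on a field subset (in WhizzML it would create and evaluate real BigML models):

```python
def best_first_selection(all_fields, score):
    """Greedily add the field that most improves the model's score;
    stop when no candidate improves on the current best."""
    selected, best = [], float("-inf")
    while True:
        candidates = [f for f in all_fields if f not in selected]
        if not candidates:
            return selected
        scored = [(score(selected + [f]), f) for f in candidates]
        top_score, top_field = max(scored)
        if top_score <= best:  # no improvement: stop
            return selected
        best = top_score
        selected.append(top_field)

# Toy score: fields f5 and f7 are the only informative ones.
weights = {"f5": 0.5, "f7": 0.3}
toy_score = lambda fields: sum(weights.get(f, -0.01) for f in fields)
print(best_first_selection(["f1", "f5", "f7", "fn"], toy_score))  # → ['f5', 'f7']
```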


Advanced WhizzML workflows

[Diagram: stacked generalization. 50% of the data is held out; base models are trained on the rest. A new dataset is generated with the predictions for the hold-out data, and a new metamodel is created from this dataset.]

Stacked generalization
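A schematic of this stacking flow, with hypothetical toy learners standing in for real BigML models:

```python
def train_stacked(rows, labels, base_learners, meta_learner):
    """Hold out 50% of the data; train base models on the first half,
    build a new dataset from their predictions on the hold-out,
    and train a metamodel on that dataset."""
    half = len(rows) // 2
    models = [fit(rows[:half], labels[:half]) for fit in base_learners]
    # New dataset: one column of predictions per base model.
    meta_rows = [[m(r) for m in models] for r in rows[half:]]
    metamodel = meta_learner(meta_rows, labels[half:])
    return models, metamodel

# Toy learner: "fit" returns a predictor closed over the label mean.
def mean_fit(xs, ys):
    mean = sum(ys) / len(ys)
    return lambda x: mean

models, meta = train_stacked(
    rows=[[1], [2], [3], [4]], labels=[1.0, 1.0, 0.0, 0.0],
    base_learners=[mean_fit, mean_fit],
    meta_learner=lambda X, y: ("meta", X, y),
)
print(meta)
```

Holding out the metamodel's training data is what keeps the base models' predictions honest.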


Advanced WhizzML workflows

[Diagram: randomized parameter optimization. A random configuration generator proposes candidate configurations; each is evaluated and the best score is kept. The process stops when you reach the expected performance or the user-given iterations limit.]

Randomized parameter optimization
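The random-search loop can be sketched as follows; the configuration sampler and scoring function are hypothetical toys, not the WhizzML implementation:

```python
import random

def random_search(score, sample_config, max_iterations=50, target=None):
    """Draw random configurations, keep the best score, and stop at the
    target performance or the user-given iterations limit."""
    best_score, best_config = float("-inf"), None
    for _ in range(max_iterations):
        config = sample_config()
        s = score(config)
        if s > best_score:
            best_score, best_config = s, config
        if target is not None and best_score >= target:
            break
    return best_config, best_score

random.seed(0)
# Toy one-parameter "model": the score peaks at depth 7.
sampler = lambda: {"depth": random.randint(1, 20)}
scorer = lambda c: -abs(c["depth"] - 7)
config, score_ = random_search(scorer, sampler, max_iterations=500, target=0)
print(config)
```

SMACdown refines exactly this loop by filtering the sampled configurations with a model of past performances before paying to evaluate them.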


Advanced WhizzML workflows

[Diagram: SMACdown. A random configuration generator proposes candidates, but new configurations are filtered according to the predictions of the model of performances, so only promising configurations are analyzed.]

SMACdown


Advanced WhizzML workflows

[Diagram: boosting. A sequence of models (T0/F0 through T8/F8) is built, each one correcting the errors of the previous ones; the final model is an ensemble of models.]

Boosting


Advanced WhizzML workflows

Script it once, for everybody anywhere

Publish scripts in the gallery

Add scripts to your menus