Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Learning from Data: A fast-paced guide to machine learning and artificial intelligence, by Thomas Holloway, Co-Founder/Software Engineer @ Nuvi (http://www.nuviapp.com)

Description

This is a fast-paced guide to machine learning, a branch of artificial intelligence, given at Utah Code Camp 2014.

Transcript of Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Page 1: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Learning from Data: A fast-paced guide to machine learning and artificial intelligence

by Thomas Holloway, Co-Founder/Software Engineer @ Nuvi (http://www.nuviapp.com)

Page 2: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Thanks to our Sponsors!

To connect to wireless:
1. Choose UGuest in the wireless list
2. Open a browser; this will open a U of U website
3. Choose Login

Page 3: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

“Intelligence is the art of good guesswork”

– H.B. BARLOW

Page 4: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

General Intelligence Goals

• Deduction, Reasoning, Problem Solving
• Knowledge Representation
• Planning
• Learning
• Natural Language Processing
• Motion and Manipulation
• Perception
• Social Intelligence
• Creativity

• Early AI research began in the study of logic itself, leading to algorithms that imitate the step-by-step reasoning used to solve puzzles and problems (heuristics).
• In contrast, methods pulled from economics and probability in the late '80s/'90s led to very successful approaches for dealing with uncertainty or incomplete information.
• Statistical approaches, neural networks (the probabilistic nature of human guessing).

Page 5: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

General Intelligence Goals: Knowledge Representation

• Represent conceptual knowledge about objects, places, situations, events, things, times, and language

• What they look like

• Categorical features

• Properties

• Relationships between each other

• Meta-knowledge (knowledge of what other people know)

• Causes, effects, and many other less-known research areas

• “what exists” = Ontology

Page 6: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

General Intelligence Goals: Knowledge Representation (continued)

• Difficult Problems

• Working assumptions, default reasoning, qualification problem

• Commonsense Knowledge

• Major goal is to automatically acquire this largely through unsupervised learning

• Ontology Engineering

• Subsymbolic Form of Commonsense Knowledge

• Not all knowledge can be represented as facts or statements (e.g. the intuition to avoid a move because a position “feels too exposed” in a chess match).

Page 7: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

General Intelligence Goals: Planning

• Set goals and achieve them
• (visualize the representation of the world, predict how actions will change it, make choices to maximize utility)
• Requires reasoning under uncertainty (checking whether the world/environment actually matches its predictions) -> error correction
• Example: move a chess piece here, the opponent responds and puts me in a seemingly poor position, act accordingly

Page 8: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

General Intelligence Goals: Learning

• Machine Learning is the study of algorithms that automatically improve through experience.

• Probably plays the most central role in artificial intelligence.

• Unsupervised Learning - finding patterns

• Supervised Learning - classifying what category something belongs to, or producing a function that maps input -> output

• Reinforcement Learning - rewards

• Developmental Learning - self-exploration, active learning, imitation, guidance, entropy

Page 9: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

General Intelligence Goals: Natural Language Processing

• Read and understand text

• Listen and understand speech

• Information Retrieval

• Machine Translation

• Sentiment Analysis

• Category Theory (Quantum Logic in Information Flow Theory)

• Common techniques in semantic indexing, parse trees, syntactic and semantic analysis

• Major Goal to automatically build ontology (for knowledge representation) by scanning books, wikipedia, dictionaries… etc

• Recently used Wiktionary and Wikipedia to automatically build a part-of-speech tagger and sentiment analysis engine for multiple languages. *http://www.nuviapp.com/* <- PLUG

Page 10: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

• Entropic Force (Alex Wissner-Gross argument for intelligence)

• Language Discovery

• Automated Trading Systems

• Machine Translation

• Spam Detection

• Self-Driving Cars

• Facial Recognition

• Gesture Recognition

• Speech Recognition

• Nest

• Shazam

Statistical Machine Learning is the art of taking lots of data and turning it into statistically known probabilities.

• Spotify

• Netflix, Amazon Recommendations

• Duolingo

• Robot Movement

• Fraud Detection

• Intrusion Detection / State Anomaly

• DNA Sequence Alignment

• Siri, Google Voice, Google Now, Kinect

• Sentiment Analysis

• Text/Character Recognition (Scanning books)

• Health Monitoring (Healthcare)

• Pandora, iTunes / iGenius

Page 11: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Types of Machine Learning

• Supervised Learning

• Unsupervised Learning

• Recommendation Systems

• Reinforcement Learning

• (rewards for good responses, punishments for bad ones)

• Developmental Learning
• (self-exploration, entropic force, cumulative acquisition of novel skills typical of robot movement: autonomous interaction with the environment and “teachers”, imitation, maturation)

Page 12: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Supervised Learning

• Two types that we will discuss within supervised learning:

• Regression analysis (single-valued real output)

• Classification

Page 13: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Linear Regression

Page 14: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Optimization Objectives

• Hypothesis: h_theta(x) = theta_0 + theta_1 * x
• Parameters: theta_0, theta_1
• Cost Function: J(theta_0, theta_1) = (1 / 2m) * sum over i=1..m of (h_theta(x(i)) - y(i))^2
• Goal: minimize J(theta_0, theta_1) with respect to theta_0 and theta_1

m = number of samples
x(i) = x at sample i
y(i) = y at sample i

Our cost function is effectively taking the squared error difference between all predictions from our hypothesis and the actual values y, and finally summing the error up into a total “cost” error.

Minimize the error produced from the cost function by manipulating the parameters theta.
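A minimal NumPy sketch of this hypothesis and cost function (the function names and the tiny example data are illustrative, not from the talk):

    import numpy as np

    def hypothesis(theta_0, theta_1, x):
        # h_theta(x) = theta_0 + theta_1 * x
        return theta_0 + theta_1 * x

    def cost(theta_0, theta_1, x, y):
        # J(theta_0, theta_1) = (1 / 2m) * sum((h_theta(x(i)) - y(i))^2)
        m = len(y)
        errors = hypothesis(theta_0, theta_1, x) - y
        return np.sum(errors ** 2) / (2 * m)

    # Tiny illustrative data: x = TV budget, y = sales (values from the table later in the talk)
    x = np.array([230.1, 44.5, 17.2, 180.8])
    y = np.array([22.1, 10.4, 9.3, 18.5])
    print(cost(0.0, 0.1, x, y))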

Page 15: Utah Code Camp 2014 - Learning from Data by Thomas Holloway
Page 16: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Gradient Descent

• First-Order Optimization Algorithm

• Finds Local Minimum of a function by taking steps proportional to the negative of the gradient of the function at the current point.

• Popular for large-scale optimization problems

• easy to implement

• works on just about any black-box function

• each iteration is relatively cheap

Page 17: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Gradient Descent

repeat until convergence {
    theta_j := theta_j - alpha * d/d(theta_j) J(theta_0, theta_1)    (for j = 1 and j = 0)
}

Hypothesis: h_theta(x) = theta_0 + theta_1 * x

Cost Function: J(theta_0, theta_1) = (1 / 2m) * sum over i=1..m of (h_theta(x(i)) - y(i))^2

Page 18: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Gradient Descent

repeat until convergence {
    theta_j := theta_j - alpha * d/d(theta_j) J(theta_0, theta_1)    (for j = 1 and j = 0)
}

With the partial derivatives written out for linear regression:

repeat until convergence {
    theta_0 := theta_0 - alpha * (1/m) * sum over i=1..m of (h_theta(x(i)) - y(i))
    theta_1 := theta_1 - alpha * (1/m) * sum over i=1..m of (h_theta(x(i)) - y(i)) * x(i)
}

Page 19: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Gradient Descent

repeat until convergence {
    theta_j := theta_j - alpha * d/d(theta_j) J(theta_0, theta_1)    (for j = 1 and j = 0)
}

Page 20: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Gradient Descent

repeat until convergence {
    theta_0 := theta_0 - alpha * (1/m) * sum over i=1..m of (h_theta(x(i)) - y(i))
    theta_1 := theta_1 - alpha * (1/m) * sum over i=1..m of (h_theta(x(i)) - y(i)) * x(i)
}

Hypothesis: h_theta(x) = theta_0 + theta_1 * x

* note: sometimes referred to as batch gradient descent (given that we iterate over all training examples to perform a single update on our parameters)
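A minimal sketch of batch gradient descent for the single-variable case (the learning rate alpha and the iteration count are arbitrary illustrative choices):

    import numpy as np

    def batch_gradient_descent(x, y, alpha=1e-5, iterations=10000):
        m = len(y)
        theta_0, theta_1 = 0.0, 0.0
        for _ in range(iterations):
            errors = (theta_0 + theta_1 * x) - y      # h_theta(x(i)) - y(i)
            # Every training example contributes to a single simultaneous update.
            new_theta_0 = theta_0 - alpha * errors.sum() / m
            new_theta_1 = theta_1 - alpha * (errors * x).sum() / m
            theta_0, theta_1 = new_theta_0, new_theta_1
        return theta_0, theta_1

    x = np.array([230.1, 44.5, 17.2, 180.8])   # TV budget
    y = np.array([22.1, 10.4, 9.3, 18.5])      # sales
    print(batch_gradient_descent(x, y))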

Page 21: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Multivariate Linear Regression

TV Budget    Online Ads    Billboards    Sales
230.1        37.8          63.1          22.1
 44.5        39.9          45.1          10.4
 17.2        45.8          69.3           9.3
180.8        41.3          58.5          18.5

Page 22: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Multivariate Linear Regression

• Hypothesis: h_theta(x) = theta^T * x = theta_0 * x_0 + theta_1 * x_1 + ... + theta_n * x_n
• Think of x as our example with its features in a vector of up to n features, with x_0 = 1 by convention (so theta_0 acts as the intercept)
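A short sketch of the vectorized hypothesis over the advertising table above (the x_0 = 1 convention is assumed; variable names are mine):

    import numpy as np

    # Rows: [TV budget, online ads, billboards]; target: sales (the table above).
    X = np.array([[230.1, 37.8, 63.1],
                  [ 44.5, 39.9, 45.1],
                  [ 17.2, 45.8, 69.3],
                  [180.8, 41.3, 58.5]])
    y = np.array([22.1, 10.4, 9.3, 18.5])

    # Prepend the x_0 = 1 column so theta_0 acts as the intercept.
    X = np.c_[np.ones(len(y)), X]

    theta = np.zeros(X.shape[1])
    predictions = X @ theta        # h_theta(x) = theta^T x for every sample at once
    print(predictions)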

Page 23: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Optimization Objectives

• Hypothesis: h_theta(x) = theta^T * x
• Parameters: theta (the vector theta_0 ... theta_n)
• Cost Function: J(theta) = (1 / 2m) * sum over i=1..m of (h_theta(x(i)) - y(i))^2
• Goal: minimize J(theta) with respect to theta

m = number of samples
x(i) = x at sample i
y(i) = y at sample i

Our cost function is effectively taking the squared error difference between all predictions from our hypothesis and the actual values y, and finally summing the error up into a total “cost” error.

Minimize the error produced from the cost function by manipulating the parameters theta.

Page 24: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Gradient Descent

repeat until convergence {
    theta_j := theta_j - alpha * (1/m) * sum over i=1..m of (h_theta(x(i)) - y(i)) * x_j(i)    (for j = 0…n)
}

Hypothesis: h_theta(x) = theta^T * x

Cost Function: J(theta) = (1 / 2m) * sum over i=1..m of (h_theta(x(i)) - y(i))^2
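A vectorized sketch of this update rule (the learning rate and iteration count are illustrative; it can be run on the X and y built in the earlier multivariate sketch):

    import numpy as np

    def gradient_descent(X, y, alpha=1e-5, iterations=5000):
        # X: m x (n+1) matrix that already includes the x_0 = 1 column; y: m targets.
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iterations):
            errors = X @ theta - y            # h_theta(x(i)) - y(i) for every sample
            gradient = (X.T @ errors) / m     # one component per theta_j, j = 0..n
            theta -= alpha * gradient         # simultaneous update of all parameters
        return theta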

Page 25: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Gradient Descent

repeat until convergence {
    theta_j := theta_j - alpha * (1/m) * sum over i=1..m of (h_theta(x(i)) - y(i)) * x_j(i)    (for j = 0…n)
}

Hypothesis: h_theta(x) = theta^T * x

Cost Function: J(theta) = (1 / 2m) * sum over i=1..m of (h_theta(x(i)) - y(i))^2

Page 26: Utah Code Camp 2014 - Learning from Data by Thomas Holloway
Page 27: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Techniques in managing input

• Mean normalization (make sure all your inputs have similar ranges; a short sketch of normalization and data splitting follows this list)

• FFT for audio

• Mean / Average / Range

• Graph your Cost Function over the number of iterations (make sure it is decreasing)

• Separate data sets (cross validation, test set)

• Train on a given set of data, manipulate regularization / extra features, etc., and graph your cost function against the cross-validation set

• Finally, test against unseen data using your test set

• Typically this is 60-30-10, or even 70-20-10, depending on how you wish to split things up.
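A minimal sketch of mean normalization and a 60/30/10 split, as mentioned in the list above (the helper names are mine):

    import numpy as np

    def mean_normalize(X):
        # Rescale every feature: (x - mean) / range, so all inputs have similar ranges.
        return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    def split_data(X, y, seed=0):
        # Shuffle, then split 60% training / 30% cross-validation / 10% test.
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(y))
        X, y = X[order], y[order]
        n_train = int(0.6 * len(y))
        n_cv = int(0.9 * len(y))
        return (X[:n_train], y[:n_train],            # training set
                X[n_train:n_cv], y[n_train:n_cv],    # cross-validation set
                X[n_cv:], y[n_cv:])                  # test set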

Page 28: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Normal Equation

• Analytically solves for the parameters: theta = (X^T X)^(-1) X^T y
• Useful when n is relatively small (number of features n < 5000 or so)
• Uses the entire matrix of input
• Each sample = vector of features (one row of X)
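A short sketch of the normal equation in NumPy (the pseudo-inverse is used here for numerical robustness; that choice is mine, not from the slides):

    import numpy as np

    def normal_equation(X, y):
        # X: m x (n+1) design matrix with the leading column of ones; y: m targets.
        # theta = (X^T X)^(-1) X^T y, computed via the pseudo-inverse.
        return np.linalg.pinv(X.T @ X) @ X.T @ y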

Page 29: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Supervised Learning - Classification

• Spam/Not Spam

• Benign/Malignant

• Biometric Identification

• Speech Recognition

• Fraudulent Transactions

• Pattern Recognition

• 0 = Negative Class

• 1 = Positive Class

Page 30: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Logistic Regression / Classification

• What we want is a function that will produce a value between 0 and 1 for all weighted input we provide.

• Sigmoid Activation Unit: g(z) = 1 / (1 + e^(-z)), giving the hypothesis h_theta(x) = g(theta^T x)

Page 31: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Logistic Regression / Classification

• What we want is a function that will produce a value between 0 and 1 for all weighted input we provide.

• Sigmoid Activation Unit
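A minimal sketch of the sigmoid unit and the resulting logistic hypothesis (names are illustrative):

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + e^(-z)): squashes any real value into the interval (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    def predict_probability(theta, X):
        # h_theta(x) = g(theta^T x), read as the probability that y = 1 given x.
        return sigmoid(X @ theta)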

Page 32: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Logistic Regression Cost Function

• Hypothesis: h_theta(x) = g(theta^T x) = 1 / (1 + e^(-theta^T x))
• Cost Function: the linear regression (squared error) cost function is a poor fit here; with the sigmoid hypothesis it becomes non-convex, so a different cost function is used (next slide)

Page 33: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Logistic Regression Cost Function

• Hypothesis: h_theta(x) = 1 / (1 + e^(-theta^T x))
• Cost Function:

    Cost(h_theta(x), y) = -log(h_theta(x))        if y = 1
    Cost(h_theta(x), y) = -log(1 - h_theta(x))    if y = 0

Page 34: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Logistic Regression Cost Function Intuition

Page 35: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Logistic Regression Cost Function Intuition

In other words, if we predicted 0 when we should have predicted 1, we are going to return back a very large cost.

Page 36: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Logistic Regression Cost Function Intuition

Page 37: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Logistic Regression Cost Function Intuition

In other words, if we predicted 1 when we should have predicted 0, we are going to return back a very large cost.

Page 38: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Logistic Regression Cost Function

• This is the “simplified” formula, combining both cases into a single expression:

    J(theta) = -(1/m) * sum over i=1..m of [ y(i) * log(h_theta(x(i))) + (1 - y(i)) * log(1 - h_theta(x(i))) ]
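A sketch of that simplified cost in NumPy (no numerical safeguard against log(0) is included; this is a bare-bones illustration):

    import numpy as np

    def logistic_cost(theta, X, y):
        m = len(y)
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # sigmoid hypothesis h_theta(x)
        # J(theta) = -(1/m) * sum[ y*log(h) + (1 - y)*log(1 - h) ]
        return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m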

Page 39: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Gradient Descent

repeat until convergence {
    theta_j := theta_j - alpha * (1/m) * sum over i=1..m of (h_theta(x(i)) - y(i)) * x_j(i)    (for j = 0…n)
}

Hypothesis: h_theta(x) = 1 / (1 + e^(-theta^T x))

Cost Function: J(theta) = -(1/m) * sum over i=1..m of [ y(i) * log(h_theta(x(i))) + (1 - y(i)) * log(1 - h_theta(x(i))) ]

Page 40: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Logistic Regression Decision Boundaries

• The threshold or line at which input data is favoring one class or another. This is usually the same point where we see our sigmoid function cross the 0.5 mark.

Page 41: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Logistic Regression Decision Boundaries

• The threshold or line at which input data is favoring one class or another. This is usually the same point where we see our sigmoid function cross the 0.5 mark.

Page 42: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Logistic Regression Decision Boundaries

• The threshold or line at which input data is favoring one class or another. This is usually the same point where we see our sigmoid function cross the 0.5 mark. (Sunny, Rainy, Cloudy, etc.)

Multi-class classification deals with multiple categories of classification. It is typically done as one-vs-all classification, where each class is trained as (1 = positive for the given class, 0 for everything else). To predict, find the max probability across all of the classes tested against.
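A sketch of one-vs-all prediction, assuming one parameter vector has already been trained per class (all names here are illustrative):

    import numpy as np

    def predict_one_vs_all(thetas, X):
        # thetas: one trained parameter vector per class, shape (num_classes, n).
        scores = X @ np.asarray(thetas).T                # theta^T x for every class
        probabilities = 1.0 / (1.0 + np.exp(-scores))    # sigmoid of each score
        # Pick, per sample, the class whose classifier reports the highest probability.
        return np.argmax(probabilities, axis=1)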

Page 43: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Overfitting and Regularization

Page 44: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Regularization

• Regularized Logistic Regression:

    J(theta) = -(1/m) * sum over i=1..m of [ y(i) * log(h_theta(x(i))) + (1 - y(i)) * log(1 - h_theta(x(i))) ] + (lambda / 2m) * sum over j=1..n of theta_j^2

• Regularized Gradient Descent:

    repeat until convergence {
        theta_0 := theta_0 - alpha * (1/m) * sum over i=1..m of (h_theta(x(i)) - y(i)) * x_0(i)
        theta_j := theta_j - alpha * [ (1/m) * sum over i=1..m of (h_theta(x(i)) - y(i)) * x_j(i) + (lambda / m) * theta_j ]    (for j = 1…n)
    }

lambda = the Regularization Parameter (penalizes large parameter values to reduce overfitting)
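A sketch of the regularized cost (lambda_ is named to avoid Python's lambda keyword; theta_0 is deliberately left out of the penalty):

    import numpy as np

    def regularized_cost(theta, X, y, lambda_=1.0):
        m = len(y)
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))
        unregularized = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
        penalty = (lambda_ / (2 * m)) * np.sum(theta[1:] ** 2)   # theta_0 is not penalized
        return unregularized + penalty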

Page 45: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Neural Networks

Page 46: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Sophisticated Neural Networks can do some really amazing things

Multi-layered (deep) neural networks can be built to identify extremely complex things, with potentially millions of features to train on.

Neural networks can auto-encode (learn from the input itself/self-learn), classify into many categories at once, and be trained to output real values; they can even be built to retain memory or long-term state (as in the case of hidden Markov models or finite state automata).

Page 47: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Types of Neural Networks

• Feedforward

• Recurrent

• Echo-State

• Long Short-Term Memory

• Stochastic

• Bidirectional (propagates in both directions)

Page 48: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Feed Forward Network

Page 49: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Feed Forward Network

(diagram: Input Features / Input Layer feeding hidden layers into an Output node; the +1 nodes are bias units)

Page 50: Utah Code Camp 2014 - Learning from Data by Thomas Holloway


What is the value of ?

Answer: the sigmoid activation of the sum of its weighted inputs

Page 51: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

What is the output of ?

Answer: the sigmoid activation of the sum of its weighted inputs

Page 52: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

What is the output of ?

Answer: the sigmoid activation of the sum of its weighted inputs

Page 53: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

What is the output of ?

Answer: the sigmoid activation of the sum of its weighted inputs

Page 54: Utah Code Camp 2014 - Learning from Data by Thomas Holloway


What is the output of ?

Page 55: Utah Code Camp 2014 - Learning from Data by Thomas Holloway


What is the output of ?

Page 56: Utah Code Camp 2014 - Learning from Data by Thomas Holloway


What is the output of ?

Page 57: Utah Code Camp 2014 - Learning from Data by Thomas Holloway


Feed Forward Propagation
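A minimal sketch of feed-forward propagation through one hidden layer (layer sizes and random weights are illustrative, not from the slides):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def feed_forward(x, weights_1, weights_2):
        a1 = np.concatenate(([1.0], x))              # input layer plus the +1 bias unit
        z2 = weights_1 @ a1                          # weighted sums into the hidden layer
        a2 = np.concatenate(([1.0], sigmoid(z2)))    # hidden activations plus bias unit
        z3 = weights_2 @ a2                          # weighted sums into the output layer
        return sigmoid(z3)                           # network output

    # Illustrative shapes: 3 inputs -> 4 hidden units -> 1 output.
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((4, 4))   # 4 hidden units x (3 inputs + bias)
    w2 = rng.standard_normal((1, 5))   # 1 output x (4 hidden units + bias)
    print(feed_forward(np.array([0.2, 0.5, 0.9]), w1, w2))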

Page 58: Utah Code Camp 2014 - Learning from Data by Thomas Holloway
Page 59: Utah Code Camp 2014 - Learning from Data by Thomas Holloway
Page 60: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Backpropagation

• Gradient computation is done by computing the gradient of the error between our expected output and our actual output, and propagating that error backwards through the network.

• Calculate:

    delta(L) = a(L) - y                                   (error at the output layer L)
    delta(l) = (Theta(l))^T * delta(l+1) .* g'(z(l))      (error propagated back to layer l)
    gradient for Theta(l) = delta(l+1) * a(l)^T           (one entry per weight)
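A compact sketch of backpropagation for a single training sample in the same small network shape as above (this follows the standard textbook derivation rather than anything specific to the talk):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_single(x, y, weights_1, weights_2):
        # Forward pass (same wiring as the feed-forward sketch), keeping intermediates.
        a1 = np.concatenate(([1.0], x))
        z2 = weights_1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(weights_2 @ a2)
        # Backward pass: output error first, then the hidden-layer error (bias entry dropped).
        delta3 = a3 - y
        delta2 = (weights_2.T @ delta3)[1:] * sigmoid(z2) * (1.0 - sigmoid(z2))
        # Gradient contributions for each weight matrix, one entry per weight.
        grad2 = np.outer(delta3, a2)
        grad1 = np.outer(delta2, a1)
        return grad1, grad2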

Page 61: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Backpropagation (let y = 1 for this sample)

Page 62: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Backpropagation

Page 63: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Recurrent Neural Networks

Page 64: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Recurrent Neural Networks

• Connections between units form a directed cycle

• Allows the network to exhibit dynamic temporal behavior

• Useful for maintaining internal memory or state over time

• Ex: unsegmented handwriting recognition

• At any given time step, each non-input unit computes its current activation as a nonlinear function of the weighted sum of the activations of all units from which it receives connections.

• Training is done with backpropagation through time

• vanishing gradient problem (addressed by LSTM networks)

Page 65: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Recurrent Neural Networks

Page 66: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Recurrent Neural Networks

http://www.manoonpong.com/AMOSWD08.html

Page 67: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

LSTM Recurrent Neural Network

• Long Short-Term Memory

• Well suited for classifying, predicting and processing time series data with very long range dependencies.

• Achieves best known results in unsegmented handwriting recognition

• Traps error within a memory block (often referred to as an error carousel)

• Amazing applications in rhythm learning, grammar learning, music composition, robot control…etc

Page 68: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Other classification techniques

• SVM (support vector machines)

• constructs a hyperplane in a high/infinite-dimensional space used for training/classification, regression..etc

• by defining a kernel function (some function that tells us similarity), an SVM lets us perform simple dot products between high-dimensional features

• high margin (the decision boundary has good separation from the training points), which benefits generalization

• Naive Bayes

Page 69: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Unsupervised Learning

• Categorization
• Clustering (density estimation)
• Selecting k clusters (k-means): assign each data point to the nearest centroid, update each centroid to the average of its assigned points, and iterate (a sketch follows this list)
• Blind Signal Separation
• Feature Extraction for Dimensionality Reduction
• Hidden Markov Models
• Non-normal & normal distribution analysis (finding the distributions of the data)
• Self-Organizing Maps
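A minimal k-means sketch (random initialization and a fixed iteration count are simplifying assumptions):

    import numpy as np

    def k_means(X, k, iterations=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize centroids by picking k distinct data points at random.
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(iterations):
            # Assign every point to its nearest centroid.
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = np.argmin(distances, axis=1)
            # Move each centroid to the average of the points assigned to it.
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = X[labels == j].mean(axis=0)
        return centroids, labels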

Page 70: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Autoencoders

Unsupervised Learning from Neural Networks

Page 71: Utah Code Camp 2014 - Learning from Data by Thomas Holloway
Page 72: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Autoencoders

Page 73: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Knowing what to do next

• Build your algorithm quick and dirty; don't spend a lot of time on it until you have something to use

• Split up your training, cross validation and test sets (don’t test on your training data!)

• Move on to PCA or unsupervised pre-training for your supervised algorithms to help improve performance (then use the bias/variance checklist below)

• Don't just try to get a lot of data to train on; implement your algorithm quick and dirty, use smaller data sets initially, and determine bias/variance

• High variance: get more training data

• High variance: try fewer features

• High bias: add additional features

• High bias: add polynomial features

• High bias: decrease regularization

• High variance: increase regularization

Page 74: Utah Code Camp 2014 - Learning from Data by Thomas Holloway

Follow me @nyxtom

Thank you!

Questions?

http://ml-class.org/

https://www.coursera.org/course/bluebrain

https://www.coursera.org/course/neuralnets