Utah Code Camp 2014 - Learning from Data by Thomas Holloway
Transcript of Utah Code Camp 2014 - Learning from Data by Thomas Holloway
Learning from DataA fast-paced guide to machine learning and artificial intelligence
by Thomas HollowayCo-Founder/Software Engineer @ Nuvi (http://www.nuviapp.com)
Thanks to our Sponsors!
“Intelligence is the art of good guesswork”
– H.B. BARLOW
General Intelligence Goals
• Deduction, Reasoning, Problem Solving
• Knowledge Representation
• Planning
• Learning
• Natural Language Processing
• Motion and Manipulation
• Perception
• Social Intelligence
• Creativity
• Early AI research began in the study of logic itself, leading to algorithms that imitate the step-by-step reasoning humans use to solve puzzles and problems (heuristics).
• In contrast, methods drawn from economics and probability in the late ’80s and ’90s led to very successful approaches for dealing with uncertainty or incomplete information.
• Statistical approaches, neural networks (the probabilistic nature of human guessing)
Knowledge Representation
• Represent knowledge about objects, places, situations, events, times, and language
• What they look like
• Categorical features
• Properties
• Relationships between each other
• Meta-knowledge (knowledge of what other people know)
• Causes, effects, and many other less-explored research areas
• “what exists” = Ontology
• Difficult Problems
• Working assumptions, default reasoning, qualification problem
• Commonsense Knowledge
• Major goal is to automatically acquire this largely through unsupervised learning
• Ontology Engineering
• Subsymbolic Form of Commonsense Knowledge
• Not all knowledge can be represented as facts or statements (e.g. the intuition to avoid a move because a position “feels too exposed” in a chess match)
Planning
• Set goals and achieve them
• (visualize a representation of the world, predict how actions will change it, make choices that maximize utility)
• Requires reasoning under uncertainty (checking whether the world/environment matches its predictions) -> error correction
• Move a chess piece here, the opponent responds in a way that puts me in a seemingly poor position, act accordingly
Learning
• Machine Learning is the study of algorithms that automatically improve through experience.
• Probably the most central discipline within Artificial Intelligence.
• Unsupervised Learning - finding patterns
• Supervised Learning - classifying which category something belongs to, producing a function that maps input -> output
• Reinforcement Learning - rewards
• Developmental Learning - self-exploration, active learning, imitation, guidance, entropy
Natural Language Processing
• Read and understand text
• Listen and understand speech
• Information Retrieval
• Machine Translation
• Sentiment Analysis
• Category Theory (Quantum Logic in Information Flow Theory)
• Common techniques in semantic indexing, parse trees, syntactic and semantic analysis
• A major goal: automatically build an ontology (for knowledge representation) by scanning books, Wikipedia, dictionaries, etc.
• Recently used Wiktionary and Wikipedia to automatically build a part-of-speech tagger and sentiment analysis engine for multiple languages. *http://www.nuviapp.com/* <— PLUG
• Entropic Force (Alex Wissner-Gross argument for intelligence)
• Language Discovery
• Automated Trading Systems
• Machine Translation
• Spam Detection
• Self-Driving Cars
• Facial Recognition
• Gesture Recognition
• Speech Recognition
• Nest
• Shazam
Statistical Machine Learning is the art of taking lots of data and turning it into statistically known probabilities.
• Spotify
• Netflix, Amazon Recommendations
• Duolingo
• Robot Movement
• Fraud Detection
• Intrusion Detection / State Anomaly
• DNA Sequence Alignment
• Siri, Google Voice, Google Now, Kinect
• Sentiment Analysis
• Text/Character Recognition (Scanning books)
• Health Monitoring (Healthcare)
• Pandora, iTunes / iGenius
Types of Machine Learning
• Supervised Learning
• Unsupervised Learning
• Recommendation Systems
• Reinforcement Learning
• (rewards for good responses, punishment for bad ones)
• Developmental Learning
• (self-exploration, entropic force, cumulative acquisition of novel skills typical of robot movement - autonomous interaction with the environment and “teachers”, imitation, maturation)
Supervised Learning
• Two types that we will discuss within supervised learning:
• Regression analysis (single-valued real output)
• Classification
Linear Regression
Optimization Objectives
• Hypothesis: h_θ(x) = θ_0 + θ_1·x
• Parameters: θ_0, θ_1
• Cost Function: J(θ_0, θ_1) = (1/2m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))^2
• Goal: minimize J(θ_0, θ_1) with respect to θ_0, θ_1
m = number of samples, x^(i) = x at sample i, y^(i) = y at sample i
Our cost function effectively takes the squared error between every prediction from our hypothesis and the actual value y, then sums those errors into a total “cost”.
The goal is to minimize the error produced by the cost function by manipulating the parameters theta.
Gradient Descent
• First-Order Optimization Algorithm
• Finds Local Minimum of a function by taking steps proportional to the negative of the gradient of the function at the current point.
• Popular for large-scale optimization problems
• easy to implement
• works on just about any black-box function
• each iteration is relatively cheap
Gradient Descent
repeat until convergence {
    θ_j := θ_j − α · ∂/∂θ_j J(θ_0, θ_1)    (simultaneously for j = 0 and j = 1)
}
Substituting the hypothesis into the cost function, the partial derivatives work out to:
repeat until convergence {
    θ_0 := θ_0 − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))
    θ_1 := θ_1 − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x^(i)
}
* note: sometimes referred to as batch gradient descent (given that we iterate over all training examples to perform a single update on our parameters)
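The batch gradient descent loop above can be sketched in Python; the learning rate, iteration count, and data points here are illustrative choices, not values from the talk:

```python
# A minimal sketch of batch gradient descent for univariate linear
# regression: on every iteration we sweep all m samples once and then
# update both parameters simultaneously.

def gradient_descent(xs, ys, alpha=0.01, iterations=5000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # h(x) = theta0 + theta1 * x; collect the error for every sample
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # simultaneous update of both parameters
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Fit a line to points that lie exactly on y = 2x + 1
t0, t1 = gradient_descent([0, 1, 2, 3], [1, 3, 5, 7])
```

With enough iterations the parameters settle near θ_0 = 1, θ_1 = 2 for this toy data.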
Multivariate Linear Regression
TV Budget | Online Ads | Billboards | Sales
230.1     | 37.8       | 63.1       | 22.1
44.5      | 39.9       | 45.1       | 10.4
17.2      | 45.8       | 69.3       | 9.3
180.8     | 41.3       | 58.5       | 18.5
• Hypothesis: h_θ(x) = θ_0·x_0 + θ_1·x_1 + … + θ_n·x_n
• Think of x as one example, a vector of up to n features, with x_0 = 1 (the bias feature)
Multivariate Linear Regression
Optimization Objectives
• Hypothesis: h_θ(x) = θ^T·x = θ_0·x_0 + θ_1·x_1 + … + θ_n·x_n
• Parameters: θ = [θ_0, θ_1, …, θ_n]
• Cost Function: J(θ) = (1/2m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i))^2
• Goal: minimize J(θ) with respect to θ
(m = number of samples; x^(i), y^(i) = the i-th sample, as before)
Gradient Descent
repeat until convergence {
    θ_j := θ_j − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x_j^(i)    (for j = 0…n)
}
Techniques in managing input
• Mean normalization (make sure all your inputs have similar ranges)
• FFT for audio
• Mean / average / range
• Graph your cost function over the number of iterations (make sure it is decreasing)
• Separate data sets (cross-validation set, test set)
• Train on a given set of data, tweak regularization / extra features etc., and graph your cost function against the cross-validation set
• Finally, test against unseen data using your test set
• Typically the split is 60-30-10, or even 70-20-10, depending on how you wish to split things up
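The mean normalization step above can be sketched in a few lines; the budget figures reuse the TV Budget column from the earlier table, purely for illustration:

```python
# A small sketch of mean normalization: shift each feature to zero
# mean and divide by its range, so all features end up on a
# comparable scale and gradient descent converges faster.

def mean_normalize(values):
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    return [(v - mean) / value_range for v in values]

budgets = [230.1, 44.5, 17.2, 180.8]
scaled = mean_normalize(budgets)
```

After scaling, the values have zero mean and span a range of exactly 1.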
Normal Equation
• Analytically solves for the parameters: θ = (X^T·X)^(−1) · X^T·y
• Useful when n is relatively small (number of features < 5000 or so)
• Uses the entire matrix of input, X
• Each sample = a row vector of features
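A quick sketch of the normal equation with NumPy; the data is made up so that y = 1 + 2x exactly:

```python
import numpy as np

# Normal equation: theta = (X^T X)^{-1} X^T y.
# The first column of X is the bias feature x0 = 1 for every sample.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Solving the linear system is numerically better behaved than
# forming the explicit inverse of X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ y)
```

For this toy data the solution lands exactly on θ = [1, 2], with no iteration or learning rate needed.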
Supervised Learning - Classification
• Spam/Not Spam
• Benign/Malignant
• Biometric Identification
• Speech Recognition
• Fraudulent Transactions
• Pattern Recognition
• 0 = Negative Class
• 1 = Positive Class
Logistic Regression / Classification
• What we want is a function that will produce a value between 0 and 1 for all weighted input we provide.
• Sigmoid Activation Unit: g(z) = 1 / (1 + e^(−z)), giving the hypothesis h_θ(x) = g(θ^T·x)
Logistic Regression Cost Function
• Hypothesis: h_θ(x) = 1 / (1 + e^(−θ^T·x))
• The linear regression (squared error) cost function is non-convex when combined with the sigmoid hypothesis, so instead:
• Cost(h_θ(x), y) = −log(h_θ(x)) if y = 1
• Cost(h_θ(x), y) = −log(1 − h_θ(x)) if y = 0
Logistic Regression Cost Function Intuition
In other words, if we predicted 0 when we should have predicted 1, we return a very large cost.
Likewise, if we predicted 1 when we should have predicted 0, we also return a very large cost.
Logistic Regression Cost Function
• This is the “simplified” formula:
J(θ) = −(1/m) · Σ_{i=1..m} [ y^(i) · log(h_θ(x^(i))) + (1 − y^(i)) · log(1 − h_θ(x^(i))) ]
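The intuition above (confident wrong answers cost a lot, confident right answers cost almost nothing) is easy to see numerically; the prediction values here are invented for illustration:

```python
import math

# A sketch of the simplified logistic regression cost over m samples.
# Each prediction is an h_theta(x) value already passed through the
# sigmoid, so it lies strictly between 0 and 1.

def logistic_cost(predictions, labels):
    m = len(predictions)
    total = 0.0
    for h, y in zip(predictions, labels):
        # -log(h) punishes predicting ~0 when y = 1;
        # -log(1 - h) punishes predicting ~1 when y = 0.
        total += -y * math.log(h) - (1 - y) * math.log(1 - h)
    return total / m

# Confident, correct predictions -> small cost
low = logistic_cost([0.99, 0.01], [1, 0])
# Confident, wrong predictions -> large cost
high = logistic_cost([0.01, 0.99], [1, 0])
```

Here `low` comes out near 0.01 while `high` is over 4.6, matching the intuition slides.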
Gradient Descent
repeat until convergence {
    θ_j := θ_j − α · (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x_j^(i)    (for j = 0…n)
}
(Identical in form to the linear regression update; only the hypothesis h_θ has changed.)
Logistic Regression Decision Boundaries
• The threshold or line at which input data is favoring one class or another. This is usually the same point where we see our sigmoid function cross the 0.5 mark.
Multi-class Classification (Sunny, Rainy, Cloudy, etc.)
Multi-class classification deals with multiple categories of classification. It is typically done as one-vs-all classification, where each class is trained as a binary problem (1 = positive for the given class, 0 for everything else).
To predict, find the max probability across all the classifiers tested.
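The one-vs-all prediction step reduces to an argmax over the per-class classifiers; the probabilities below are made-up stand-ins for the sigmoid outputs of three trained classifiers:

```python
# A sketch of one-vs-all prediction: one binary classifier per class,
# each reporting P(positive) for its own class; we pick the class
# whose classifier is most confident.

def predict_one_vs_all(probabilities_per_class):
    # probabilities_per_class: class name -> P(positive) from that
    # class's binary classifier
    return max(probabilities_per_class, key=probabilities_per_class.get)

label = predict_one_vs_all({"sunny": 0.7, "rainy": 0.1, "cloudy": 0.2})
```

With these invented scores the winning label is "sunny".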
Overfitting and Regularization
Regularization
• Regularized Logistic Regression:
J(θ) = −(1/m) · Σ_{i=1..m} [ y^(i) · log(h_θ(x^(i))) + (1 − y^(i)) · log(1 − h_θ(x^(i))) ] + (λ/2m) · Σ_{j=1..n} θ_j^2
• Regularized Gradient Descent (for j = 1…n; θ_0 is not regularized):
θ_j := θ_j − α · [ (1/m) · Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) · x_j^(i) + (λ/m) · θ_j ]
• λ = the regularization parameter
Neural Networks
Sophisticated Neural Networks
can do some really amazing things
Multi-layered (deep) neural networks can be built to identify extremely complex things with
potentially millions of features to train on.
Neural networks can auto-encode (learn from the input itself / self-learn), classify into many categories at once, and can be trained to output real values; they can even be built to retain memory or long-term state (as in the case of hidden Markov models or finite state automata).
Types of Neural Networks
• Feedforward
• Recurrent
• Echo-State
• Long-Short-Term Memory
• Stochastic
• Bidirectional (propagates in both directions)
Feed Forward Network
(diagram: input features / input layer -> hidden layers -> output layer, with a +1 bias unit feeding each layer)
For every non-input unit in the network the question is the same:
What is the output of a given unit?
Answer: the sigmoid activation of the sum of its weighted inputs.
Feed Forward Propagation
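The rule that each unit outputs the sigmoid of the sum of its weighted inputs can be sketched as a forward pass; the weight values here are arbitrary made-up numbers, not from the slides:

```python
import math

# A sketch of one feed-forward pass: a layer of units, each computing
# sigmoid(weighted sum of its inputs), with a +1 bias unit prepended
# to every layer's input.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights):
    # weights: one row per unit in the next layer; the first weight
    # in each row multiplies the +1 bias unit
    activations = [1.0] + inputs
    return [sigmoid(sum(w * a for w, a in zip(row, activations)))
            for row in weights]

# Two input features -> two hidden units -> one output unit
hidden = layer_forward([0.5, -0.2], [[0.1, 0.4, -0.3], [0.2, -0.5, 0.6]])
output = layer_forward(hidden, [[-0.3, 0.8, 0.8]])
```

Because every unit ends in a sigmoid, each activation (including the final output) stays strictly between 0 and 1.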
Feed Forward Propagation
Backpropagation
• Gradient computation is done by taking the error between our expected output and our actual output and propagating that error backwards through the network.
• Calculate the error terms layer by layer: δ^(L) = a^(L) − y for the output layer, then δ^(l) = (Θ^(l))^T·δ^(l+1) .* g′(z^(l)) for each hidden layer.
(Example from the slides: for a sample where y = 1, the output-layer error a^(L) − 1 is pushed back through the weights.)
Recurrent Neural Networks
• Connections between units form a directed cycle
• Allows the network to exhibit dynamic temporal behavior
• Useful for maintaining internal memory or state over time
• Ex: unsegmented hand writing recognition
• At any given time step, each non-input unit computes its current activation as a nonlinear function of the weighted sum of the activations of all units from which it receives connections.
• Training is done with backpropagation through time
• vanishing gradient problem (addressed with LSTM networks)
http://www.manoonpong.com/AMOSWD08.html
LSTM Recurrent Neural Network
• Long Short-Term Memory
• Well suited for classifying, predicting and processing time series data with very long range dependencies.
• Achieves best known results in unsegmented handwriting recognition
• Traps error within a memory block (often referred to as an error carousel)
• Amazing applications in rhythm learning, grammar learning, music composition, robot control…etc
Other classification techniques
• SVM (support vector machines)
• constructs a hyperplane in a high- or infinite-dimensional space used for training/classification, regression, etc.
• by defining a kernel function (some function that tells us similarity), SVMs let us work with high-dimensional features via simple dot products
• high margin (the decision boundary has good separation from the training points), which benefits generalization
• Naive Bayes
Unsupervised Learning
• Categorization
• Clustering (density estimation)
• Clustering with k centroids (k-means): assign data points to the nearest centroid, update each centroid to the average of its cluster, and iterate
• Blind Signal Separation
• Feature Extraction for Dimensionality Reduction
• Hidden Markov Models
• Non-normal & normal distribution analysis (finding the distributions of data)
• Self-Organizing Maps
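The assign-then-update loop of k-means described above can be sketched on 1-D data; the points and initial centroids are invented for illustration:

```python
# A minimal k-means sketch on 1-D data: assign every point to its
# nearest centroid, recompute each centroid as the mean of its
# cluster, and repeat.

def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # empty clusters keep their previous centroid
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Two obvious groups: points near 1 and points near 9.5
centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.0], [0.0, 5.0])
```

On this toy data the centroids converge to the two group means, 1.0 and 9.5, after the first iteration.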
Autoencoders
Unsupervised learning with neural networks
Knowing what to do next
• Build your algorithm quick and dirty; don’t spend a lot of time on it until you have something to use
• Split up your training, cross-validation, and test sets (don’t test on your training data!)
• Move on to PCA or unsupervised pre-training for your supervised algorithms to help improve performance afterwards
• Don’t just try to get a lot of data to train on; implement your algorithm quick and dirty, use smaller data sets initially, and determine bias/variance:
• High variance: get more training data
• High variance: try fewer features
• High variance: increase regularization
• High bias: add additional features
• High bias: add polynomial features
• High bias: decrease regularization
Follow me @nyxtom
Thank you!
Questions?
http://ml-class.org/
https://www.coursera.org/course/bluebrain
https://www.coursera.org/course/neuralnets