Deep Learning Student workshop
September, 2017
Agenda
⎯ Welcome & Introductions
⎯ Intel® Nervana™ AI Academy for Students
⎯ Intel® & AI
⎯ What is Machine Learning & Data Science
⎯ Deep Learning and Neural Networks
⎯ DL frameworks optimized for IA
Questions? Ask us!
Ben Odom, Developer Evangelist
Bob Duffy, Student Ambassador Program Manager
Meghana Rao, Developer Evangelist
Niven Singh, AI Student Developer Community Manager
Announcing: Intel® Nervana™ AI Academy for Students
With the Intel® Nervana™ AI Academy for Students, our goal is to drive awareness of the innovation around AI at the academic level. We do this by training students on campus and online, and then showcasing and highlighting their expertise, inspiration and innovation as Intel Student Ambassadors.
⎯ Educate students on campus, in person, and begin to build relationships between students, professors, universities and Intel
⎯ Recruit qualified Student Ambassadors
⎯ Support them with IA access and training
⎯ Coach and help them deliver innovative ideas, expert content and student training to other students
⎯ Showcase examples of early innovation work by students
Intel student ambassadors - Who are they?
They're just like you!
- Graduate and PhD students who are excited about doing real work in the field of Deep Learning
- Subject matter experts who go to events like SXSW, SIGCSE, PyCon, and on campus to talk about their work
- Active participants, working on projects, papers, articles - content that has their name on it!
- Curious and inventive thinkers - trying new things, creating demos and working on REALLY cool stuff to share with the community
Intel student ambassadors - What are they doing?
Intel Student Ambassadors are working on innovative, real-world, applicable research and projects, like:
- Using smartphone cameras to collect data and identify harmful vs. harmless mosquitoes
- Leveraging neural networks and deep learning for stock price analysis and prediction
- Enabling individuals with speech impediments to use speech-to-text software to recognize and dictate their speech
- Using ML & AI to solve medical problems, like disease detection and identifying cures for epidemics
http://devmesh.intel.com
Intel & AI
Intel® Nervana™ Portfolio
(Stack diagram, summarized:)
- Experiences: Intel® Nervana™ DL Software & Cloud
- Toolkits: Intel® DL Training & Deployment, Intel® Computer Vision SDK, Movidius Fathom, Intel® GO™ Automotive SDK (*future: Computer Vision)
- Frameworks: Intel Distribution, MLlib, BigDL, Intel® Nervana™ Graph*
- Libraries: Intel® MKL, MKL-DNN, Intel® MLSL, Intel® DAAL
- Hardware: Compute, Memory/Storage, Networking
What is data science?
"Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured."
- Wikipedia (https://en.wikipedia.org/wiki/Data_science)
The data science process
(Word cloud: Math, Statistics, Machine Learning, Deep Learning, Neural Networks, R/Python/Scala, NoSQL, Visualization, Communication, Story Teller, Domain Knowledge, Hacker Mindset, Love the Data, Passion, Engage with "C" Level.)
How to become a data scientist?
What is machine learning?
Applying algorithms to observed data and making predictions based on that data.
Machines learn in two ways:
Supervised Learning & Unsupervised Learning
Supervised Learning
We train the model by feeding it correct answers - the "ground truth". The model learns and finally predicts.
Unsupervised Learning
Data is given to the model without the right answers. The model makes sense of the data on its own, and can teach you something you were probably not aware of in the given dataset.
Types of Supervised and Unsupervised learning
SUPERVISED: Classification, Regression
UNSUPERVISED: Clustering, Recommendation
CLASSIFICATION: Predict a label for an entity with a given set of features. Examples: spam prediction, sentiment analysis.
REGRESSION: Predict a real numeric value for an entity with a given set of features.
(Example: predicting a house's price from property attributes - address, type, age, parking, school, transit, total sqft, lot size, bathrooms, bedrooms, yard, pool, fireplace - with a linear regression model of price vs. sqft.)
CLUSTERING: Group entities with similar features. (Example: market segmentation - a scatter of play time in hours vs. age separates into casual gamers, serious gamers, and non-gamers.)
RECOMMENDATION: Recommend an item to a user based on past behavior or preferences of similar users. (Diagram: user info + your past purchase data + purchases of other users + product info feed a recommendation ML method - classifier/matrix - that produces "You May Also Like" recommendations.)
Applications of Machine Learning
Fraud Detection
Movie Recommendation
Face Detection
Anomaly Detection
Product Sentiment Analysis
Natural Language Processing
Image Analysis
IoT Analysis
Spam Filtering/Virus Detection
Working with data sets
Machine Learning Vocabulary - How do you read a data set?
Target: the predicted category or value of the data (the column to predict)
Features: properties of the data used for prediction (the non-target columns)
Example: a single data point within the data (one row)
Label: the target value for a single data point
An example data set
sepal length  sepal width  petal length  petal width  species
6.7           3.0          5.2           2.3          virginica
6.4           2.8          5.6           2.1          virginica
4.6           3.4          1.4           0.3          setosa
6.9           3.1          4.9           1.5          versicolor
4.4           2.9          1.4           0.2          setosa
4.8           3.0          1.4           0.1          setosa
5.9           3.0          5.1           1.8          virginica
5.4           3.9          1.3           0.4          setosa
4.9           3.0          1.4           0.2          setosa
5.4           3.4          1.7           0.2          setosa
Here the species column is the Target, each row is an Example, the four measurement columns are the Features, and a row's species value is its Label.
Training, Validation & Test datasets
If our dataset is 100,000 homes sold in Portland, a typical split would be:
Train = 70,000 homes
Validation = 10,000 homes
Test = 20,000 homes
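A minimal sketch of this 70/10/20 split using scikit-learn's train_test_split; the placeholder arrays here stand in for the 100,000 homes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100_000).reshape(-1, 1)   # placeholder features
y = np.arange(100_000)                  # placeholder prices

# Carve off the 20% test set first (20,000 homes)...
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# ...then split the remaining 80% into train and validation:
# 10% of the total is 12.5% of what is left.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.125, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 70000 10000 20000
```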
Setting up your environment
What is in a Basic Data Science Toolkit?
Intel® Distribution of Python* 2017
6 Steps to a Jupyter Notebook with the Intel Distribution of Python:
1. Install Anaconda: https://www.continuum.io/downloads#linux
2. Choose Intel packages: conda config --add channels intel
3. Create the environment: conda create -n intelpython3 intelpython3_full python=3
4. Activate the environment: source activate intelpython3
5. Run the Jupyter notebook: jupyter notebook --no-browser (only use --no-browser if running remotely or using Bash on Windows)
6. Access the notebook: http://localhost:8888
linear regression

Introduction to Linear Regression
yβ(x) = β0 + β1·x
(Plot: box office revenue vs. movie budget, both axes in units of 10^8 dollars.)
Here y is box office revenue, x is movie budget, β0 is coefficient 0 (the intercept) and β1 is coefficient 1 (the slope). Fitting the data gives β0 = 80 million, β1 = 0.6.
Predicting from Linear Regression
yβ(x) = β0 + β1·x, with β0 = 80 million and β1 = 0.6.
A 160 million budget predicts a gross of 80M + 0.6 × 160M = 176 million.
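Worked in code, the prediction is just the fitted line evaluated at the new budget (coefficients taken from the slide):

```python
# y_beta(x) = beta_0 + beta_1 * x, evaluated at a $160M budget
beta_0 = 80e6    # intercept: $80 million
beta_1 = 0.6     # slope
budget = 160e6
predicted_gross = beta_0 + beta_1 * budget
print(predicted_gross)  # 176000000.0, i.e. ~$176M gross
```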
Which Model Fits the Best?
(Plot: several candidate regression lines drawn through the box office vs. budget data.)
Calculating the Residuals
For each observed point, the residual is the difference between the predicted and observed value:
yβ(x_obs(i)) − y_obs(i) = β0 + β1·x_obs(i) − y_obs(i)
(Plot: vertical gaps between the fitted line and the observed points.)
Mean Squared Error
MSE = (1/m) Σ_{i=1..m} (β0 + β1·x_obs(i) − y_obs(i))²
Minimum Mean Squared Error
The best-fitting line minimizes the MSE over the parameters:
min_{β0,β1} (1/m) Σ_{i=1..m} (β0 + β1·x_obs(i) − y_obs(i))²
Cost Function
J(β0, β1) = (1/2m) Σ_{i=1..m} (β0 + β1·x_obs(i) − y_obs(i))²
Gradient Descent
Start with a cost function J(β), then gradually move towards the minimum - the Global Minimum. (Plot: J(β) vs. β, a bowl-shaped curve.)
Now imagine there are two parameters (β0, β1).
Gradient Descent with Linear Regression
With two parameters, J(β0, β1) is a more complicated surface on which the minimum must be found. How can we do this without knowing what J(β0, β1) looks like?
Compute the gradient, ∇J(β0, β1), which points in the direction of the biggest increase. −∇J(β0, β1) (the negative gradient) points in the direction of the biggest decrease at that point. The gradient is a vector whose coordinates are the partial derivatives of J with respect to the parameters:
∇J(β0, …, βn) = ⟨∂J/∂β0, …, ∂J/∂βn⟩
Then use the gradient (∇) and the cost function to calculate the next point (ω1) from the current one (ω0):
ω1 = ω0 − α·∇ (1/2m) Σ_{i=1..m} (β0 + β1·x_obs(i) − y_obs(i))²
The learning rate (α) is a tunable parameter that determines the step size. Each point can be iteratively calculated from the previous one:
ω2 = ω1 − α·∇ (1/2m) Σ_{i=1..m} (β0 + β1·x_obs(i) − y_obs(i))²
ω3 = ω2 − α·∇ (1/2m) Σ_{i=1..m} (β0 + β1·x_obs(i) − y_obs(i))²
(Plots: successive points ω0, ω1, ω2, ω3 descending the J(β0, β1) surface.)
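A minimal sketch of these updates for the one-feature model, minimizing the cost J(β0, β1) = (1/2m)·Σ(β0 + β1·x − y)² directly; the synthetic budget/gross data (in units of 10^8 dollars, so the slide's β0 = 80M becomes 0.8) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0, size=100)            # budgets, in units of 1e8
y = 0.8 + 0.6 * x + rng.normal(0, 0.1, 100)    # grosses, in units of 1e8

b0, b1 = 0.0, 0.0
alpha = 0.1                                    # learning rate: tunable step size
m = len(x)
for _ in range(2000):
    residual = b0 + b1 * x - y                 # prediction minus observation
    grad_b0 = residual.sum() / m               # dJ/d(beta_0)
    grad_b1 = (residual * x).sum() / m         # dJ/d(beta_1)
    b0 -= alpha * grad_b0                      # step opposite the gradient
    b1 -= alpha * grad_b1

print(b0, b1)  # should approach 0.8 and 0.6
```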
Modelling Best Practice
- Use a cost function to fit the model
- Develop multiple models
- Compare results and choose the best one

k nearest neighbors
K Nearest Neighbors Classification
(Plot: age vs. number of malignant nodes; each patient is marked "survived" or "did not survive".)
To predict a new point, look at its K nearest neighbors and count their labels:
- K = 1: neighbor count 0 vs. 1
- K = 2: neighbor count 1 vs. 1
- K = 3: neighbor count 2 vs. 1
- K = 4: neighbor count 3 vs. 1
Two questions remain: what is the correct value for K, and how do we measure the closeness of neighbors?
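A hedged sketch of the same K sweep with scikit-learn's KNeighborsClassifier; the handful of (age, malignant-nodes) points and their labels are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[45, 2], [60, 12], [35, 1], [70, 15], [50, 4], [65, 10]])
y = np.array([1, 0, 1, 0, 1, 0])    # 1 = survived, 0 = did not survive

for k in (1, 2, 3, 4):              # same sweep of K as on the slides
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, model.predict([[55, 6]]))   # prediction for a new patient
```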
What is Needed to Select a KNN Model?
(Plot: the same age vs. malignant-nodes data; selecting a model requires choosing K and a distance measure.)
Value of 'K' Affects Decision Boundary
(Plots: decision boundaries over age vs. number of malignant nodes for K=1 and K=All.)
Measurement of Distance in KNN
(Plot: the distance between two points in the age vs. malignant-nodes plane can be measured in different ways.)
Euclidean Distance (L2 Distance)
d = √(ΔNodes² + ΔAge²)
(Plot: d is the straight-line hypotenuse between the two points.)
Manhattan Distance (L1 or City Block Distance)
d = |ΔNodes| + |ΔAge|
(Plot: d follows the axis-aligned legs between the two points.)
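Both distances in a few lines of NumPy, matching the formulas above; the two patient points are illustrative.

```python
import numpy as np

a = np.array([40.0, 5.0])     # (age, number of malignant nodes)
b = np.array([55.0, 12.0])
delta = np.abs(a - b)

d_euclidean = np.sqrt((delta ** 2).sum())   # sqrt(dAge^2 + dNodes^2)
d_manhattan = delta.sum()                   # |dAge| + |dNodes|
print(d_euclidean, d_manhattan)
```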
Scale is Important for Distance Measurement
(Plots: age vs. number of surgeries. Age spans roughly 20-60 while surgeries span 1-5, so unscaled distances are dominated by age, and the raw axes suggest misleading nearest neighbors.)
"Feature Scaling": after rescaling the features to comparable ranges, the nearest neighbors change.
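A minimal sketch of feature scaling with scikit-learn's StandardScaler, which gives each feature zero mean and unit variance; the four (age, surgeries) points are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[60.0, 1], [62.0, 5], [20.0, 1], [22.0, 5]])  # (age, surgeries)

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
print(X_scaled)   # both columns now on comparable scales,
                  # so neither feature dominates the distance
```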
Performance comparison - Linear Regression and KNN
Linear regression:
- Fitting involves minimizing a cost function (slow)
- Model has few parameters (memory efficient)
- Prediction involves calculation (fast)
K nearest neighbors:
- Fitting involves storing the training data (fast)
- Model has many parameters (memory intensive)
- Prediction involves finding the closest neighbors (slow)
what is the issue with linear classifiers we have learnt so far?
XOR: the counter-example to all linear models. We need non-linear functions.
X1  X2  y
0   0   0
0   1   1
1   0   1
1   1   0
(Plot: the four XOR points on the X1/X2 plane; no single line separates the classes.)
Source: https://medium.com/towards-data-science/introducing-deep-learning-and-neural-networks-deep-learning-for-rookies-1-bd68f9cf5883
We need layers (usually lots) with Non-Linear Transformations
XOR = (X1 AND NOT X2) OR (NOT X1 AND X2)
(Diagram: a two-layer network of threshold units - the inputs feed a hidden unit through +1/+1 weights with threshold 1.5, and the output unit combines the inputs (+1/+1) and the hidden unit (−2) with threshold 0.5, thresholding to 0 or 1 - reproduces the XOR truth table above. A sketch of this wiring follows.)
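A sketch of threshold units computing XOR, consistent with the slide's weights and thresholds; the exact assignment of the +1/−2 weights is an assumption recovered from the diagram.

```python
def step(z):
    return 1 if z >= 0 else 0       # threshold to 0 or 1

def xor(x1, x2):
    h = step(x1 + x2 - 1.5)         # hidden unit fires only for (1, 1): X1 AND X2
    return step(x1 + x2 - 2 * h - 0.5)   # OR of the inputs, minus the AND case

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor(x1, x2))  # 0, 1, 1, 0: the XOR truth table
```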
This is an emerging domain called Deep Learning. In the machine learning world, we use neural networks - an idea that comes from biology. Each layer learns something.
"Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations."
- Wikipedia
(Diagram: Layer 1 → Layer 2 → … → Layer N → Prediction)
Each layer learns something
(Figure: features learned at successive layers, up to a fully connected layer, for faces, cars, elephants and chairs.)

What is deep learning good for?
Classification And DETECTION
Detect and label objects in the image: person, motorcyclist, bike.
Source: https://people.eecs.berkeley.edu/~jhoffman/talks/lsda-baylearn2014.pdf
Semantic Segmentation
Label every pixel.
Source: http://arxiv.org/pdf/1511.04164v3.pdf
Natural Language Object Retrieval

Speech Recognition
The same architecture is used for English and Mandarin Chinese speech recognition.
Source: http://svail.github.io/mandarin/
The basics of building a neural network
Motivation for Neural Nets
• Use biology as inspiration for the mathematical model
• Get signals from previous neurons
• Generate signals (or not) according to inputs
• Pass signals on to next neurons
• By layering many neurons, we can create complex models
Basic Neuron Visualization
(Diagram: inputs x1, x2, x3 with weights w1, w2, w3, plus a bias input of 1 with weight b, feed the net input z = x1·w1 + x2·w2 + x3·w3 + b, which passes through the activation function to give the output f(z).)
Types of activation functions
• Sigmoid function: smooth transition in output between (0, 1)
• Tanh function: smooth transition in output between (-1, 1)
• ReLU function: f(x) = max(x, 0)
• Step function: output jumps between 0 and 1
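Minimal NumPy versions of these four activation functions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # smooth output in (0, 1)

def tanh(z):
    return np.tanh(z)                 # smooth output in (-1, 1)

def relu(z):
    return np.maximum(z, 0)          # f(z) = max(z, 0)

def step(z):
    return (z >= 0).astype(float)    # hard 0/1 output

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), step(z))
```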
Why Neural Nets?
• Why not just use a single neuron? Why do we need a larger network?
• A single neuron (like logistic regression) only permits a linear decision boundary.
• Most real-world problems are considerably more complicated!
Feedforward Neural Network
(Diagram: inputs x1, x2, x3 feed two hidden layers of σ units, which feed outputs y1, y2, y3. The same network is shown repeatedly with each component highlighted.)
• Input layer: x1, x2, x3
• Hidden layers: the σ units
• Output layer: y1, y2, y3
• Weights, represented by matrices: W(1), W(2), W(3)
• Net input (sum of weighted inputs, before the activation function): z(2), z(3), z(4)
• Activations (output of neurons to the next layer): a(1), a(2), a(3), a(4)
Matrix representation of computation
For a single data point (instance), x = (x1, x2, x3) (and x = a(1)):
z(2) = x·W(1)
a(2) = σ(z(2))
W(1) is a 3x4 matrix, so z(2) and a(2) are 4-vectors.
Continuing the Computation
For a single training instance (data point), the input is a vector x (a row vector of length 3) and the output is a vector ŷ (a row vector of length 3):
z(2) = x·W(1),     a(2) = σ(z(2))
z(3) = a(2)·W(2),  a(3) = σ(z(3))
z(4) = a(3)·W(3),  ŷ = softmax(z(4))
Multiple data points
In practice, we do these computations for many data points at the same time by "stacking" the rows into a matrix - and the equations look the same. Input: matrix X (an n×3 matrix, each row a single instance); output: ŷ (an n×3 matrix, each row a single prediction).
z(2) = X·W(1),     a(2) = σ(z(2))
z(3) = a(2)·W(2),  a(3) = σ(z(3))
z(4) = a(3)·W(3),  ŷ = softmax(z(4))
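A minimal NumPy sketch of this stacked forward pass; the layer sizes follow the slides (3 → 4 → 4 → 3), and the random weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # stabilized row-wise softmax
    return e / e.sum(axis=1, keepdims=True)

x = rng.normal(size=(5, 3))      # n=5 instances as rows, 3 features (a1 = x)
W1, W2, W3 = (rng.normal(size=s) for s in [(3, 4), (4, 4), (4, 3)])

a2 = sigmoid(x @ W1)             # z2 = x W1,  a2 = sigma(z2)
a3 = sigmoid(a2 @ W2)            # z3 = a2 W2, a3 = sigma(z3)
y_hat = softmax(a3 @ W3)         # z4 = a3 W3, y_hat = softmax(z4)
print(y_hat.shape, y_hat.sum(axis=1))  # (5, 3); each row sums to 1
```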
How to Train a Neural Net?
(Diagram: input (feature vector) → network → output (label).)
• Put in training inputs, get the output
• Compare the output to the correct answers: look at the loss function J
• Adjust and repeat!
• Backpropagation tells us how to make a single adjustment using calculus.
Using Gradient Descent
1. Make a prediction
2. Calculate the loss
3. Calculate the gradient of the loss function w.r.t. the parameters
4. Update the parameters by taking a step in the opposite direction
5. Iterate
Calculate the loss function
(Diagram: the network's outputs ŷ1, ŷ2, ŷ3 are compared against the true labels y1, y2, y3.)
Evaluate: J(ŷᵢ, yᵢ)
Chain Rule
∂J/∂W(3) = (ŷ − y)·a(3)
∂J/∂W(2) = (ŷ − y)·W(3)·σ′(z(3))·a(2)
∂J/∂W(1) = (ŷ − y)·W(3)·σ′(z(3))·W(2)·σ′(z(2))·X
• Recall that σ′(z) = σ(z)(1 − σ(z))
• Though they appear complex, the expressions above are easy to compute!
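A quick numerical sanity check of that σ′ identity via a finite difference:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))   # sigma'(z) = sigma(z)(1 - sigma(z))
print(numeric, analytic)                   # agree to ~1e-10
```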
Backpropagation
(Diagram: the same network, traversed backwards.)
We want ∂J(ŷᵢ, yᵢ)/∂W(k) for each weight matrix W(1), W(2), W(3). Working backwards from the output, compute ∂J/∂W(3) first, then ∂J/∂W(2), then ∂J/∂W(1).
What we have learnt so far
• Nomenclature required to build a NN
• Input, hidden, output layers
• Weights, activation
• Backpropagation using gradient descent
• Representing it all using matrices
Convolutional neural network

Convolutional Neural Nets
Primary ideas behind Convolutional Neural Networks:
• Let the Neural Network learn which kernels are most useful
• Use the same set of kernels across the entire image (translation invariance)
• Reduces the number of parameters and "variance" (from the bias-variance point of view)
Kernels as Feature Detectors
Can think of kernels as "local feature detectors":
Vertical Line Detector:
-1  1 -1
-1  1 -1
-1  1 -1
Horizontal Line Detector:
-1 -1 -1
 1  1  1
-1 -1 -1
Corner Detector:
-1 -1 -1
-1  1  1
-1  1  1
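A sketch of the vertical-line detector applied to a tiny image; scipy.signal.correlate2d slides the kernel as-is (convolve2d would flip it), and the 5x5 image with a bright center column is illustrative.

```python
import numpy as np
from scipy.signal import correlate2d

image = np.zeros((5, 5))
image[:, 2] = 1.0                    # a vertical line of 1s down the middle

kernel = np.array([[-1, 1, -1],
                   [-1, 1, -1],
                   [-1, 1, -1]])     # the vertical-line detector above

response = correlate2d(image, kernel, mode="valid")
print(response)   # strongest response (3) where the kernel sits on the line
```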
Padding the input data
Without padding, we lose data at the edges.
Pooling: Max-pool
• For each distinct patch, represent it by the maximum
• 2x2 max-pool (sketched below)
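A minimal 2x2 max-pool in NumPy using a reshape trick, as described above: each distinct 2x2 patch is represented by its maximum.

```python
import numpy as np

def maxpool2x2(x):
    h, w = x.shape
    # Trim to even dimensions, group into 2x2 patches, take each patch's max.
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 8, 6],
              [2, 2, 7, 3]])
print(maxpool2x2(x))  # [[4 5]
                      #  [2 8]]
```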
CNN for Digit recognition
Source: http://cs231n.github.io/

Convolutional Neural Networks (CNN) for Image Recognition: LeNet-5
How many total weights in the network?
Conv1: 1·6·5·5 + 6 = 156
Conv3: 6·16·5·5 + 16 = 2,416
FC1: 400·120 + 120 = 48,120
FC2: 120·84 + 84 = 10,164
FC3: 84·10 + 10 = 850
Total: 61,706
Less than a single FC layer with [1200x1200] weights! Note that convolutional layers have relatively few weights.
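The same weight count, reproduced in a few lines (conv weights = in_channels × out_channels × kernel height × kernel width, plus one bias per output channel; FC weights = fan_in × fan_out, plus one bias per output):

```python
conv1 = 1 * 6 * 5 * 5 + 6       # 156
conv3 = 6 * 16 * 5 * 5 + 16     # 2,416
fc1   = 400 * 120 + 120         # 48,120
fc2   = 120 * 84 + 84           # 10,164
fc3   = 84 * 10 + 10            # 850
print(conv1 + conv3 + fc1 + fc2 + fc3)   # 61706
```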
Differences between CNN and fully connected networks
CONVOLUTIONAL NEURAL NETWORK:
• Each neuron connected to a small set of nearby neurons in the previous layer
• Uses the same set of weights for each neuron
• Ideal for spatial feature recognition, e.g. image recognition
• Cheaper on resources due to fewer connections
FULLY CONNECTED NEURAL NETWORK:
• Each neuron is connected to every neuron in the previous layer
• Every connection has a separate weight
• Not optimal for detecting features
• Computationally intensive, with heavy memory usage
Network architectures
AlexNet - Model Diagram
VGG16 Diagram
VGG
(Diagrams: a stack of 3x3 convolutions - Layer 1 (input) → Layer 2 → Layer 3 - shown step by step.)
Each output in Layer 2 has been influenced by a 3x3 patch of inputs, so we can say that the "receptive field" of Layer 2 is 3x3.
What about on Layer 3? An output on Layer 3 uses a 3x3 patch from Layer 2. How much of Layer 1 does it use? Tracing back, each square in Layer 3 "sees" a 5x5 grid from Layer 1.
So two 3x3, stride-1 convolutions in a row ≈ one 5x5 convolution, and three 3x3 convolutions ≈ one 7x7 convolution.
Benefit: fewer parameters. One 7x7 layer has 7 × 7 × C × C = 49C² weights, while three 3x3 layers have 3 × (3 × 3 × C × C) = 27C². Going from 49C² to 27C² is a ≈45% reduction!
Inception V3 schematic
Inception: this whole "block" serves the function of a previous convolutional layer.
ResNet
• Add the previous layer back in to the current layer!
• Similar idea to "boosting"
examples

Unattended baggage detection using Intel® optimized Caffe*
Source: https://software.intel.com/en-us/articles/unattended-baggage-detection-using-deep-neural-networks-in-intel-architecture

Why are Deep Neural Networks called "Deep"?
Source: https://research.facebook.com/publications/deepface-closing-the-gap-to-human-level-performance-in-face-verification/
Example of CNN topologies
GoogLeNet (2014)
(Diagram legend: Convolution, Pooling, Softmax, Other.)
Source: Google white paper and Krizhevsky et al.
Diagnosis of heart disease using CNNs
Using 30 MRIs during one cardiac cycle, from different axis views, to predict VS and VD.
Source: http://cs231n.stanford.edu/reports2016/331_Report.pdf
Diabetic Retinopathy diagnosis
A Kaggle competition solution from deepsense.io; images from EyePACS.
Source: https://deepsense.io/diagnosing-diabetic-retinopathy-with-deep-learning/
Intel® Nervana™ AI Portfolio
(The same stack diagram as earlier: experiences - Intel® Nervana™ DL Software & Cloud; toolkits - Intel® DL Training & Deployment, Intel® Computer Vision SDK, Movidius Fathom, Intel® GO™ Automotive SDK; frameworks - Intel Distribution, MLlib, BigDL, Intel® Nervana™ Graph*; libraries - Intel® MKL, MKL-DNN, Intel® MLSL, Intel® DAAL; hardware - Compute, Memory/Storage, Networking.)
AI silicon positioning
(Diagram, summarized:)
Training
- Batch: train machine learning models across a diverse set of dense and sparse data; many batch models
- Train large deep neural networks; train large models as fast as possible (Lake Crest, *future*)
Inference
- Batch: infer billions of data samples at a time and feed applications within ~1 day
- Stream: infer deep data streams with low latency in order to take action within milliseconds (required for lower latency)
- Edge: power-constrained environments (or other Intel® edge processor; option for higher throughput/watt)
Intel® Movidius™ Neural Compute Stick
Get started: https://developer.movidius.com/
• Nervana Cloud: build an AI POC
• neon: train DL models quickly
• Intel Nervana Graph: any framework, any hardware
• Intel Nervana HW: industry-leading AI, coming soon
"Deep learning by design"
neon deep learning framework
Intel® Nervana™ Full Stack Platform
Multi-user collaboration
Interactive sessions
Model library
Fast training
Batch training
Experiment tracking
Multi-node distribution
Analytics & visualization
Hyperparameter optimization
Batch inference
Model compression
Inference deployment
Export to edge devices
Data curation/processing
Data partitioning
Data labeling
Accelerate time-to-solution by compressing both compute and labor-intensive steps in the innovation cycle to deliver scalable end-to-end AI solutions
Intel® Nervana™ Deep Learning Software
DL Frameworks Optimized for IA: TensorFlow
Performance Optimization on Modern Platforms
• Utilize all the cores: OpenMP, MPI, TBB…; reduce synchronization events and serial code; improve load balancing
• Vectorize/SIMD: unit-strided access per SIMD lane; high vector efficiency; data alignment
• Efficient memory/cache use: blocking; data reuse; prefetching; memory allocation
Hierarchical Parallelism
• Coarse-grained parallelism / multi-node: domain decomposition; scaling - improve load balancing, reduce synchronization events and all-to-all comms
• Fine-grained parallelism / within node, sub-domain: 1) multi-level domain decomposition (e.g. across layers); 2) data decomposition (layer parallelism)
Example Challenge 1: Data Layout Has Big Impact on Performance
• Data layout impacts performance
  • Sequential access to avoid gather/scatter
  • Have iterations in the inner-most loop to ensure high vector utilization
  • Maximize data reuse, e.g. weights in a convolution layer
• Converting to/from an optimized layout is sometimes less expensive than operating on the unoptimized layout
(Diagram: the same values stored in native row-major layout vs. a blocked layout that is better optimized for some operations.)
Example Challenge 2: Minimize Conversions Overhead
• End-to-end optimization can reduce conversions
• Staying in the optimized layout as long as possible becomes one of the tuning goals
• Minimize the number of back-and-forth conversions
• Use graph optimization techniques
(Diagram: a Convolution → Max Pool → Convolution pipeline with repeated Native-to-MKL-layout and MKL-layout-to-Native conversions in between.)
Optimizing TensorFlow & Other DL Frameworks for Intel® Architecture
• Leverage high-performance compute libraries and tools, e.g. Intel® Math Kernel Library, Intel® Python, Intel® Compiler etc.
• Data format/shape: the right format/shape for max performance: blocking, gather/scatter
• Data layout: minimize the cost of data layout conversions
• Parallelism: use all cores, eliminate serial sections and load imbalance
• Memory allocation: unique characteristics and the ability to reuse buffers
• Data layer optimizations: parallelization, vectorization, IO
• Optimize hyperparameters: e.g. batch size for more parallelism; learning rate and optimizer to ensure accuracy/convergence
Initial Performance Gains on Modern Xeon (2-Socket Broadwell, 22 Cores)

Benchmark             Metric      Batch Size  Baseline Perf Training  Baseline Perf Inference  Optimized Perf Training  Optimized Perf Inference  Speedup Training  Speedup Inference
ConvNet-AlexNet       Images/sec  128         33.52                   84.2                     524                      1696                      15.6x             20.2x
ConvNet-GoogleNet v1  Images/sec  128         16.87                   49.9                     112.3                    439.7                     6.7x              8.8x
ConvNet-VGG           Images/sec  64          8.2                     30.7                     47.1                     151.1                     5.7x              4.9x

• Baseline using the TensorFlow 1.0 release with standard compiler knobs
• Optimized performance using TensorFlow with Intel optimizations, built with:
  bazel build --config=mkl --copt="-DEIGEN_USE_VML"
Initial Performance Gains on Modern Xeon Phi (Knights Landing, 68 Cores)

Benchmark             Metric      Batch Size  Baseline Perf Training  Baseline Perf Inference  Optimized Perf Training  Optimized Perf Inference  Speedup Training  Speedup Inference
ConvNet-AlexNet       Images/sec  128         12.21                   31.3                     549                      2698.3                    45x               86.2x
ConvNet-GoogleNet v1  Images/sec  128         5.43                    10.9                     106                      576.6                     19.5x             53x
ConvNet-VGG           Images/sec  64          1.59                    24.6                     69.4                     251                       43.6x             10.2x

• Baseline using the TensorFlow 1.0 release with standard compiler knobs
• Optimized performance using TensorFlow with Intel optimizations, built with:
  bazel build --config=mkl --copt="-DEIGEN_USE_VML"
Additional Performance Gains from Parameters Tuning
• Data format: CPU prefers the NCHW data format
• Intra_op, inter_op and OMP_NUM_THREADS: set for best core utilization
• Batch size: a higher batch size provides more parallelism, but too high a batch size can increase the working set and impact cache/memory performance

Best Settings for Xeon (Broadwell, 2 Socket, 44 Cores)
Benchmark             Data Format  Inter_op  Intra_op  KMP_BLOCKTIME  Batch size
ConvNet-AlexNet       NCHW         1         44        30             2048
ConvNet-GoogleNet V1  NCHW         2         44        1              256
ConvNet-VGG           NCHW         1         44        1              128

Best Settings for Xeon Phi (Knights Landing, 68 Cores)
Benchmark             Data Format  Inter_op  Intra_op  KMP_BLOCKTIME  OMP_NUM_THREADS  Batch size
ConvNet-AlexNet       NCHW         1         68        30             136              2048
ConvNet-GoogleNet V1  NCHW         2         68        1              68               256
ConvNet-VGG           NCHW         1         68        1              136              128
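A hedged sketch of applying one row of these settings in TensorFlow 1.x (the era of this deck): tf.ConfigProto's threading fields and the KMP/OMP environment variables are standard knobs, with the Broadwell AlexNet values filled in.

```python
import os
import tensorflow as tf

os.environ["KMP_BLOCKTIME"] = "30"     # MKL/OpenMP spin-wait time, per the table
os.environ["OMP_NUM_THREADS"] = "44"   # one thread per physical core

config = tf.ConfigProto(
    inter_op_parallelism_threads=1,    # Inter_op from the table
    intra_op_parallelism_threads=44)   # Intra_op from the table

with tf.Session(config=config) as sess:
    pass  # build and run the NCHW-format model here (batch size 2048)
```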
Q&A
Social Media & SurveyPrize Winners
Want to learn more?Check out the
Intel® Nervana™ AI Academy for students
software.intel.com/AIStudents
backup
Intel tools and libraries
Intel® Distribution for Python*
• Ready access to a set of tools and techniques for high performance on Intel® Architecture
• Accelerated Python packages: NumPy, SciPy, pandas, scikit-learn, Jupyter, matplotlib, and mpi4py
• Integrated with Intel® Math Kernel Library (Intel® MKL), Intel® Data Analytics Acceleration Library (Intel® DAAL) and pyDAAL, Intel® MPI Library, and Intel® Threading Building Blocks (Intel® TBB)
• Get out-of-the-box performance that is closer to native code speeds
• Speed up data analytics with pyDAAL and parallelize Python workloads
• Manage packages and Jupyter Notebooks easily with conda, Anaconda Cloud, and PIP
Learn more: https://software.intel.com/en-us/intel-distribution-for-python
Intel® Math Kernel Library (MKL)
• Features highly optimized, threaded and vectorized functions to maximize performance on Intel® Architecture and compatible processors
• Linear Algebra, Fast Fourier Transforms (FFT), Neural Network, Vector Math and Statistics functions
• Standard APIs for immediate performance results
• Utilizes de facto standard C and Fortran APIs for compatibility with BLAS, LAPACK and FFTW functions from other math libraries
• Available with both free community-supported and paid support licenses
Learn more: https://software.intel.com/en-us/intel-mkl
Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN)
• A library of DNN performance primitives optimized for Intel architectures
• A set of highly optimized building blocks intended to accelerate compute-intensive parts of deep learning applications, particularly DNN frameworks such as Caffe, Tensorflow, Theano and Torch
• Distributed as source code through GitHub
• Implemented in C++ and provides both C++ and C APIs
• Allows the functionality to be used from a wide range of high-level languages, such as Python or Java
Learn more: https://01.org/mkl-dnn/overview
Intel® Data Analytics Acceleration Library (Intel® DAAL)
• Features highly tuned functions for deep learning, classical machine learning, and data analytics performance across a spectrum of Intel® architecture devices
• Intel® DAAL addresses all stages of the Big Data Ecosystem
• Includes Python*, C++, and Java* APIs and connectors to popular data sources including Spark* and Hadoop*
• Free and open source community-supported versions are available, as well as paid versions that include premium support.
Learn more: https://software.intel.com/en-us/intel-daal
Intel® Machine Learning Scaling Library for Linux* OS
• A library providing an efficient implementation of communication patterns used in deep learning.
• Built on top of MPI, allows for use of other communication libraries
• Optimized to drive scalability of communication patterns
• Works across various interconnects: Intel(R) Omni-Path Architecture, InfiniBand*, and Ethernet
• Common API to support Deep Learning frameworks (Caffe*, Theano*, Torch*, etc.)
Learn more: https://github.com/01org/MLSL
BigDL: Distributed Deep Learning Library for Apache Spark*
• Write deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters
• Rich deep learning support - numeric computing (via Tensor) and high level neural networks; load pre-trained Caffe or Torch models into Spark programs using BigDL
• Extremely high performance - uses Intel® MKL and multi-threaded programming in each Spark task
• Efficiently scale-out to “Big Data Scale” using Apache Spark
Learn more: https://github.com/intel-analytics/BigDL
Trusted Analytics Platform
• Facilitates data ingestion, preparation, and analysis with parallel processing and distributed analytics.
• The software leverages Apache Spark*, Intel® Data Analytics Acceleration Library, and Intel® Math Kernel Library for optimized distributed analytics and parallel processing on Intel® processors.
• Accelerates the modeling process with Intel optimized computational machine-learning and deep-learning algorithms, as well as graph operations, scoring engine, and pipelines.
• Integrates with industry-leading software frameworks such as Apache Spark, TensorFlow*, and Superset to expedite application development and enable deep-learning and visualization techniques.
Learn more: https://software.intel.com/en-us/bigdata/tap