Machine Learning: Naïve Bayes, Neural Networks, Clustering (Skim 20.5, CMSC 471)
1
Machine Learning:
Naïve Bayes, Neural Networks, Clustering
Skim 20.5
CMSC 471
2
The Naïve Bayes Classifier
Some material adapted from slides by Tom Mitchell, CMU.
3
The Naïve Bayes Classifier
Recall Bayes rule:
$$P(Y_i \mid X_j) = \frac{P(Y_i)\, P(X_j \mid Y_i)}{P(X_j)}$$
Which is short for:
$$P(Y = y_i \mid X = x_j) = \frac{P(Y = y_i)\, P(X = x_j \mid Y = y_i)}{P(X = x_j)}$$
We can re-write this as:
$$P(Y = y_i \mid X = x_j) = \frac{P(Y = y_i)\, P(X = x_j \mid Y = y_i)}{\sum_k P(X = x_j \mid Y = y_k)\, P(Y = y_k)}$$
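As a quick worked instance of the rewritten rule (all numbers below are invented purely for illustration):

```python
# Invented numbers, purely to exercise the formula above.
prior = {"spam": 0.3, "ham": 0.7}          # P(Y = y_k)
likelihood = {"spam": 0.8, "ham": 0.1}     # P(X = x_j | Y = y_k) for one observed x_j

# Denominator: sum_k P(X = x_j | Y = y_k) * P(Y = y_k)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Posterior P(Y = y_k | X = x_j) for each class
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
print(posterior)   # {'spam': 0.774..., 'ham': 0.225...}
```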
4
Deriving Naïve Bayes
Idea: use the training data to directly estimate $P(Y)$ and $P(X \mid Y)$.
Then, we can use these values to estimate $P(Y \mid X^{new})$ using Bayes rule.
Recall that representing the full joint probability $P(X_1, X_2, \ldots, X_n \mid Y)$ is not practical.
5
Deriving Naïve Bayes
However, if we make the assumption that the attributes are independent, estimation is easy!
$$P(X_1, \ldots, X_n \mid Y) = \prod_i P(X_i \mid Y)$$
In other words, we assume all attributes are conditionally independent given Y.
Often this assumption is violated in practice, but more on that later…
6
Deriving Naïve Bayes
Let $X_1, \ldots, X_n$ and the label $Y$ be discrete.
Then, we can estimate $P(Y_i)$ and $P(X_i \mid Y_i)$ directly from the training data by counting!

| Sky   | Temp | Humid  | Wind   | Water | Forecast | Play? |
|-------|------|--------|--------|-------|----------|-------|
| sunny | warm | normal | strong | warm  | same     | yes   |
| sunny | warm | high   | strong | warm  | same     | yes   |
| rainy | cold | high   | strong | warm  | change   | no    |
| sunny | warm | high   | strong | cool  | change   | yes   |
P(Sky = sunny | Play = yes) = ?
P(Humid = high | Play = yes) = ?
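Both questions reduce to relative-frequency counts over the rows where Play = yes. A minimal sketch of that counting (the table hard-coded as tuples; the function and variable names are mine):

```python
# Toy training set from the slide: (Sky, Temp, Humid, Wind, Water, Forecast, Play)
data = [
    ("sunny", "warm", "normal", "strong", "warm", "same",   "yes"),
    ("sunny", "warm", "high",   "strong", "warm", "same",   "yes"),
    ("rainy", "cold", "high",   "strong", "warm", "change", "no"),
    ("sunny", "warm", "high",   "strong", "cool", "change", "yes"),
]
ATTRS = ["Sky", "Temp", "Humid", "Wind", "Water", "Forecast"]

def cond_prob(attr, value, label):
    """Relative-frequency estimate of P(attr = value | Play = label)."""
    col = ATTRS.index(attr)
    rows = [r for r in data if r[-1] == label]
    return sum(r[col] == value for r in rows) / len(rows)

print(cond_prob("Sky", "sunny", "yes"))    # 3/3 = 1.0
print(cond_prob("Humid", "high", "yes"))   # 2/3 ≈ 0.667
```

So P(Sky = sunny | Play = yes) = 3/3 and P(Humid = high | Play = yes) = 2/3.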
7
The Naïve Bayes Classifier
Now we have:
$$P(Y = y_j \mid X_1, \ldots, X_n) = \frac{P(y_j) \prod_i P(X_i \mid y_j)}{\sum_k P(y_k) \prod_i P(X_i \mid y_k)}$$
which is just a one-level Bayesian network: a single label node $Y$ (hypotheses) whose children are the attributes $X_1, \ldots, X_n$ (evidence), parameterized by $P(Y = y_j)$ and $P(X_i \mid Y = y_j)$.
To classify a new point $X^{new}$:
$$Y^{new} = \arg\max_{y_k} P(y_k) \prod_i P(X_i^{new} \mid y_k)$$
8
The Naïve Bayes Algorithm
For each value $y_k$:
  Estimate $P(Y = y_k)$ from the data.
  For each value $x_{ij}$ of each attribute $X_i$:
    Estimate $P(X_i = x_{ij} \mid Y = y_k)$.
Classify a new point via:
$$Y^{new} = \arg\max_{y_k} P(y_k) \prod_i P(X_i^{new} \mid y_k)$$
In practice, the independence assumption often doesn't hold, but Naïve Bayes performs very well despite that.
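Putting the two estimation steps and the argmax together, here is a compact sketch of the whole algorithm run on the toy dataset from slide 6. The log-space product and the Laplace (+1) smoothing are my additions (standard practice, but not on the slide):

```python
import math
from collections import Counter, defaultdict

# Toy training set from slide 6: (Sky, Temp, Humid, Wind, Water, Forecast, Play)
data = [
    ("sunny", "warm", "normal", "strong", "warm", "same",   "yes"),
    ("sunny", "warm", "high",   "strong", "warm", "same",   "yes"),
    ("rainy", "cold", "high",   "strong", "warm", "change", "no"),
    ("sunny", "warm", "high",   "strong", "cool", "change", "yes"),
]

def train(data):
    """Estimate P(Y = y_k) and P(X_i = x_ij | Y = y_k) by counting."""
    labels = Counter(r[-1] for r in data)      # class counts
    cond = defaultdict(Counter)                # (i, y) -> counts of values of X_i
    vocab = defaultdict(set)                   # i -> all values seen for X_i
    for r in data:
        for i, v in enumerate(r[:-1]):
            cond[(i, r[-1])][v] += 1
            vocab[i].add(v)
    return labels, cond, vocab

def classify(x, labels, cond, vocab):
    """Y_new = argmax_y P(y) * prod_i P(x_i | y), computed in log space."""
    total = sum(labels.values())
    best, best_score = None, -math.inf
    for y, ny in labels.items():
        score = math.log(ny / total)
        for i, v in enumerate(x):
            # Laplace (+1) smoothing is my addition, not on the slide;
            # it keeps unseen attribute values from zeroing the product.
            score += math.log((cond[(i, y)][v] + 1) / (ny + len(vocab[i])))
        if score > best_score:
            best, best_score = y, score
    return best

labels, cond, vocab = train(data)
print(classify(("sunny", "warm", "high", "strong", "cool", "change"),
               labels, cond, vocab))           # -> 'yes'
```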
9
Naïve Bayes Applications
Text classification:
  Which e-mails are spam?
  Which e-mails are meeting notices?
  Which author wrote a document?
Classifying mental states:
[Figure: classifying brain activity evoked by "people" words vs. "animal" words; the task is learning P(BrainActivity | WordCategory). Pairwise classification accuracy: 85%.]
10
Neural Networks
Some material adapted from lecture notes by Lise Getoor and Ron Parr.
Adapted from slides by Tim Finin and Marie desJardins.
11
Neural function
Brain function (thought) occurs as the result of the firing of neurons.
Neurons connect to each other through synapses, which propagate action potentials (electrical impulses) by releasing neurotransmitters.
Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds.
Learning occurs as a result of the synapses' plasticity: they exhibit long-term changes in connection strength.
There are about 10^11 neurons and about 10^14 synapses in the human brain(!)
12
Biology of a neuron
13
Brain structure
Different areas of the brain have different functions.
  Some areas seem to have the same function in all humans (e.g., Broca's region for motor speech); the overall layout is generally consistent.
  Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly.
We don't know how different functions are "assigned" or acquired.
  Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors).
  Partly the result of experience (learning).
We really don't understand how this neural structure leads to what we perceive as "consciousness" or "thought".
Artificial neural networks are not nearly as complex or intricate as the actual brain structure.
14
Comparison of computing power
Computers are way faster than neurons…
But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel.
Neural networks are designed to be massively parallel.
The brain is effectively a billion times faster.

Information circa 1995:

|                   | Computer                       | Human Brain                   |
|-------------------|--------------------------------|-------------------------------|
| Computation units | 1 CPU, 10^5 gates              | 10^11 neurons                 |
| Storage units     | 10^4 bits RAM, 10^10 bits disk | 10^11 neurons, 10^14 synapses |
| Cycle time        | 10^-8 sec                      | 10^-3 sec                     |
| Bandwidth         | 10^4 bits/sec                  | 10^14 bits/sec                |
| Updates / sec     | 10^5                           | 10^14                         |
15
Neural networks
Neural networks are made up of nodes or units, connected by links.
Each link has an associated weight and activation level.
Each node has an input function (typically summing over weighted inputs), an activation function, and an output.
[Figure: layered feed-forward network, with input units, hidden units, and output units.]
16
Model of a neuron
Neuron modeled as a unit i; the weight on the input from unit j to unit i is $w_{ji}$.
Net input to unit i is:
$$in_i = \sum_j w_{ji}\, o_j$$
Activation function g() determines the neuron's output.
  g() is typically a sigmoid.
  Output is either 0 or 1 (no partial activation).
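A minimal sketch of this unit in code, with a sigmoid for g() as the slide suggests (the weights and input activations are arbitrary examples):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs):
    """Compute in_i = sum_j w_ji * o_j, then pass it through g()."""
    net = sum(w_ji * o_j for w_ji, o_j in zip(weights, inputs))
    return sigmoid(net)

# Arbitrary example weights and input activations:
print(unit_output([0.5, -1.0, 0.25], [1.0, 0.0, 1.0]))  # sigmoid(0.75) ≈ 0.68
```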
17
"Executing" neural networks
Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level.
Working forward through the network, the input function of each unit is applied to compute the input value.
  Usually this is just the weighted sum of the activation on the links feeding into this node.
The activation function transforms this input function into a final value.
  Typically this is a nonlinear function, often a sigmoid function corresponding to the "threshold" of that node.
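The same forward computation over a whole layered network, as a sketch (the layer shapes and weights below are arbitrary placeholders):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, inputs):
    """layers: list of weight matrices; layers[l][i][j] is the weight on the
    link from unit j in layer l to unit i in layer l+1."""
    acts = inputs
    for W in layers:
        # Input function: weighted sum; activation function: sigmoid.
        acts = [sigmoid(sum(w * a for w, a in zip(row, acts))) for row in W]
    return acts

# A 2-input, 2-hidden, 1-output network; weights are arbitrary examples.
net = [
    [[0.5, -0.5], [1.0, 1.0]],   # input -> hidden
    [[1.0, -1.0]],               # hidden -> output
]
print(forward(net, [1.0, 0.0]))  # e.g. [0.47...]
```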
18
Learning rules
Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, we can incrementally change the weights to learn to produce these outputs using the perceptron learning rule.
  Assumes binary-valued inputs/outputs.
  Assumes a single linear threshold unit.
19
Perceptron learning rule
If the target output for unit i is $t_i$:
$$w_{ji} \leftarrow w_{ji} + \alpha\,(t_i - o_i)\, o_j$$
Equivalent to the intuitive rules:
  If output is correct, don't change the weights.
  If output is low ($o_i = 0$, $t_i = 1$), increment the weights for all the inputs which are 1.
  If output is high ($o_i = 1$, $t_i = 0$), decrement the weights for all inputs which are 1.
Must also adjust the threshold; or, equivalently, assume there is a weight $w_{0i}$ for an extra input unit that has an output of 1.
20
Perceptron learning algorithm
Repeatedly iterate through the examples, adjusting the weights according to the perceptron learning rule, until all outputs are correct:
  Initialize the weights to all zero (or random).
  Until outputs for all training examples are correct:
    For each training example e:
      Compute the current output $o_j$.
      Compare it to the target $t_j$ and update the weights.
Each execution of the outer loop is called an epoch.
For multiple-category problems, learn a separate perceptron for each category and assign a new instance to the class whose perceptron output most exceeds its threshold.
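A sketch of this loop for a single unit, with the threshold folded in as a bias weight w_0 on an always-1 input as the previous slide suggests (the learning rate, epoch cap, and AND dataset are my choices):

```python
def perceptron_train(examples, n_inputs, lr=1.0, max_epochs=100):
    """examples: list of (inputs, target) with binary inputs and targets.
    Returns weights [w0, w1, ..., wn], where w0 is the bias weight."""
    w = [0.0] * (n_inputs + 1)
    for _ in range(max_epochs):                    # each pass is one epoch
        all_correct = True
        for x, t in examples:
            xb = [1.0] + list(x)                   # extra always-1 input for w0
            o = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else 0
            if o != t:
                all_correct = False
                for j in range(len(w)):            # w_j += lr * (t - o) * x_j
                    w[j] += lr * (t - o) * xb[j]
        if all_correct:
            return w
    return w  # may not have converged (e.g., data not linearly separable)

# Learn logical AND, which is linearly separable:
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(perceptron_train(AND, n_inputs=2))
```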
21
Representation limitations of a perceptron
Perceptrons can only represent linear threshold functions, and can therefore only learn functions which linearly separate the data.
  i.e., the positive and negative examples are separable by a hyperplane in n-dimensional space:
$$\langle W, X \rangle - \theta = 0$$
[Figure: the separating hyperplane, with $\langle W, X \rangle - \theta > 0$ on one side and $< 0$ on the other.]
22
Perceptron learnability
Perceptron Convergence Theorem: If there is a set of weights that is consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969).
Unfortunately, many functions (like parity) cannot be represented by an LTU.
23
Learning: Backpropagation
Similar to the perceptron learning algorithm, we cycle through our examples:
  If the output of the network is correct, no changes are made.
  If there is an error, the weights are adjusted to reduce the error.
The trick is to assess the blame for the error and divide it among the contributing weights.
24
Output layer
As in the perceptron learning algorithm, we want to minimize the difference between the target output and the output actually computed:
$$W_{ji} \leftarrow W_{ji} + \alpha\, a_j\, Err_i\, g'(in_i)$$
where $a_j$ is the activation of hidden unit j, $Err_i = (T_i - O_i)$, and $g'$ is the derivative of the activation function. Writing $\Delta_i = Err_i\, g'(in_i)$, this becomes:
$$W_{ji} \leftarrow W_{ji} + \alpha\, a_j\, \Delta_i$$
25
Hidden layers
Need to define the error; we do error backpropagation.
Intuition: Each hidden node j is "responsible" for some fraction of the error $\Delta_i$ in each of the output nodes to which it connects.
$\Delta_i$ is divided according to the strength of the connection between the hidden node and the output node, and propagated back to provide the $\Delta_j$ values for the hidden layer:
$$\Delta_j = g'(in_j) \sum_i W_{ji}\, \Delta_i$$
Update rule:
$$W_{kj} \leftarrow W_{kj} + \alpha\, I_k\, \Delta_j$$
26
Backpropagation algorithm
Compute the Δ values for the output units using the observed error.
Starting with the output layer, repeat the following for each layer in the network, until the earliest hidden layer is reached:
  Propagate the Δ values back to the previous layer.
  Update the weights between the two layers.
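A compact sketch of these steps for a network with one hidden layer, following the update rules from the previous two slides (the network shape, learning rate, and initialization are illustrative assumptions):

```python
import math, random

def g(x):
    return 1.0 / (1.0 + math.exp(-x))      # sigmoid activation

def gp(o):
    return o * (1.0 - o)                   # g'(in) written in terms of o = g(in)

def backprop_step(W_in, W_out, x, target, lr=0.5):
    """One forward + backward pass for a single-hidden-layer network.
    W_in[j][k]: weight from input k to hidden unit j.
    W_out[i][j]: weight from hidden unit j to output unit i."""
    # Forward pass
    hidden = [g(sum(w * xk for w, xk in zip(row, x))) for row in W_in]
    output = [g(sum(w * h for w, h in zip(row, hidden))) for row in W_out]

    # Delta_i = Err_i * g'(in_i) for each output unit (observed error)
    delta_out = [(t - o) * gp(o) for o, t in zip(output, target)]

    # Propagate back: Delta_j = g'(in_j) * sum_i W_ji * Delta_i
    delta_hid = [gp(h) * sum(W_out[i][j] * delta_out[i]
                             for i in range(len(delta_out)))
                 for j, h in enumerate(hidden)]

    # Update the weights between each pair of layers: W += lr * activation * Delta
    for i, d in enumerate(delta_out):
        for j, h in enumerate(hidden):
            W_out[i][j] += lr * h * d
    for j, d in enumerate(delta_hid):
        for k, xk in enumerate(x):
            W_in[j][k] += lr * xk * d
    return output                          # output from before the update

# One step on a tiny 2-2-1 network with random weights:
random.seed(0)
W_in  = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
W_out = [[random.uniform(-1, 1) for _ in range(2)]]
print(backprop_step(W_in, W_out, x=[1.0, 0.0], target=[1.0]))
```

In practice this step would be repeated over many examples and epochs, exactly as in the perceptron loop earlier.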
27
Backprop issues
"Backprop is the cockroach of machine learning. It's ugly, and annoying, but you just can't get rid of it." (Geoff Hinton)
Problems:
  Black box
  Local minima
28
Unsupervised Learning: Clustering
Some material adapted from slides by Andrew Moore, CMU.
Visit http://www.autonlab.org/tutorials/ for Andrew's repository of Data Mining tutorials.
29
Unsupervised Learning
Supervised learning used labeled data pairs (x, y) to learn a function f : X→Y.
But, what if we don't have labels?
  No labels = unsupervised learning.
  Only some points are labeled = semi-supervised learning.
    Labels may be expensive to obtain, so we only get a few.
Clustering is the unsupervised grouping of data points. It can be used for knowledge discovery.
30
Clustering Data
31
K-Means Clustering
K-Means(k, data):
• Randomly choose k cluster center locations (centroids).
• Loop until convergence:
  • Assign each point to the cluster of the closest centroid.
  • Re-estimate the cluster centroids based on the data assigned to each.
32
K-Means Clustering
(Same algorithm as on the previous slide; the figure shows the next iteration of assignment and centroid re-estimation.)
33
K-Means Clustering
(Same algorithm; the figure shows a further iteration.)
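A sketch of the loop above for 2-D points (the sample data, k, and iteration cap are arbitrary; an empty cluster is handled crudely by keeping its old centroid):

```python
import random

def kmeans(points, k, max_iters=100):
    """points: list of (x, y) tuples. Returns (centroids, assignments)."""
    centroids = random.sample(points, k)        # random initial centers
    assign = []
    for _ in range(max_iters):
        # Assign each point to the cluster of the closest centroid.
        assign = [min(range(k),
                      key=lambda c: (p[0] - centroids[c][0]) ** 2 +
                                    (p[1] - centroids[c][1]) ** 2)
                  for p in points]
        # Re-estimate each centroid as the mean of its assigned points.
        new = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new.append((sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members)))
            else:
                new.append(centroids[c])        # empty cluster: keep old center
        if new == centroids:                    # converged: nothing moved
            break
        centroids = new
    return centroids, assign

random.seed(1)
pts = [(0.1, 0.2), (0.0, 0.0), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
print(kmeans(pts, k=2)[0])   # one center near (0.1, 0.1), one near (5.0, 5.0)
```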
34
K-Means Animation
Example generated by Andrew Moore using Dan Pelleg's super-duper fast K-means system:
Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases, 1999.
35
Problems with K-Means
Very sensitive to the initial points.
  Do many runs of k-means, each with different initial centroids.
  Seed the centroids using a better method than random (e.g., farthest-first sampling; see the sketch below).
Must manually choose k.
  Learn the optimal k for the clustering. (Note that this requires a performance measure.)
36
Problems with K-Means
How do you tell it which clustering you want?
Constrained clustering techniques:
  Same-cluster constraint (must-link)
  Different-cluster constraint (cannot-link)