Introduction to AI (continued)
Lecture #5
2/14/2017 Spring 2017 Social Computing Course
What will we cover today?
• Feature Selection
• Neural Networks
• Unsupervised Learning
  – K-means
Feature Selection
• Task: Given a string, predict whether it is an email address
  – e.g., "[email protected]"
• Feature selection: Given an input x, which properties might be useful for predicting y?
• Needs a feature extractor ϕ(x):
  – contains_@      1
  – endswith_.com   1
  – endswith_.org   0
  – contains_%      0
← This is the feature vector
← Binary features have a value of 1 if present, 0 if not
Feature weights
• contains_@: 3
• endswith_.com: 2.4
• endswith_.org: 1.3
• contains_%: -2
• !contains_alpha: -4
• This is the weight vector w
• Weights typically normalized to [-1, 1]
Linear combination
• For a new example string, sum the weights of the features that are present
  – If the sum is >= 0, output Yes
  – else output No
• String 1: "[email protected]"
• String 2: "0.5%"
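The linear combination above can be sketched in a few lines of code. This is a minimal illustration using the weights from the "Feature weights" slide; the example email address is a hypothetical stand-in, since the slide's own example string is redacted here.

```python
def features(s):
    """Binary feature extractor phi(x): 1 if the property holds, else 0."""
    return {
        "contains_@": int("@" in s),
        "endswith_.com": int(s.endswith(".com")),
        "endswith_.org": int(s.endswith(".org")),
        "contains_%": int("%" in s),
        "!contains_alpha": int(not any(c.isalpha() for c in s)),
    }

# Weight vector w, taken from the "Feature weights" slide
WEIGHTS = {
    "contains_@": 3,
    "endswith_.com": 2.4,
    "endswith_.org": 1.3,
    "contains_%": -2,
    "!contains_alpha": -4,
}

def predict(s):
    """Sum the weights of the active features; Yes if the sum is >= 0."""
    score = sum(WEIGHTS[name] * value for name, value in features(s).items())
    return ("Yes" if score >= 0 else "No", score)

label1, score1 = predict("user@example.com")  # score ≈ 5.4 → "Yes"
label2, score2 = predict("0.5%")              # score -6 → "No"
```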
Let’s add some math!
Feature Selection
• Seems arbitrary
  – But it actually captures our knowledge
  – What if we do not really know?
• How can we select good features then?
• Feature Templates
  – Rather than a specific feature, decide what kind of feature may be useful
    • E.g., the presence of certain characters in a string
  – And then let the system tell us which ones are good!
Feature Templates
• Allow us to define a set of related features
  – contains_@, contains_a, contains_b, ...
• As a template with a variable:
  – E.g., contains _ , where _ can be filled by any single character
• Less burden on the designer, since we do not need to know which features to choose, only whether they are useful cues
• But... it may produce lots of "useless" features
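Instantiating the "contains _" template might look like the sketch below: generate one binary feature per candidate character and let learning decide which ones receive useful weights. The particular character set is an illustrative assumption.

```python
import string

def contains_template(s, chars=string.ascii_lowercase + string.digits + "@%._"):
    """Instantiate the template 'contains _' once per character in chars."""
    return {f"contains_{c}": int(c in s) for c in chars}

feats = contains_template("abc@test.com")
# feats["contains_@"] == 1, feats["contains_z"] == 0, and so on;
# most of the generated features will be "useless" for any one input
```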
Feature Template for "ends with"
• Inefficient to represent all the zeros!
• But it is common for feature vectors to be quite sparse
Feature Vector Representations
• As an array (or vector):
  [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
  – Good for dense features
  – Efficient in space and speed
  – E.g., in computer vision, pixel-intensity vectors are typically dense
Feature Vector Representations
• As a map:
  {"contains_@": 1, "endswith_m": 1}
  – Good for sparse features
    • Also known as "one-hot" if only one feature is set
  – Features not in the map are assumed to be 0
  – Typical in NLP applications, sometimes with trillions of features
  – But less efficient than arrays, and slower
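The two representations can be contrasted in a small sketch: a sparse map holds only the nonzero features, and expands to a dense array given a fixed feature vocabulary. The vocabulary here is a hypothetical three-feature example.

```python
# Hypothetical fixed vocabulary mapping feature names to array positions
FEATURE_INDEX = {"contains_@": 0, "endswith_m": 1, "contains_%": 2}

def to_dense(sparse, index=FEATURE_INDEX):
    """Expand a sparse feature map into a dense array of length len(index)."""
    vec = [0] * len(index)          # features absent from the map default to 0
    for name, value in sparse.items():
        vec[index[name]] = value
    return vec

dense = to_dense({"contains_@": 1, "endswith_m": 1})  # [1, 1, 0]
```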
Dimensionality Reduction – selecting only the features that matter
• Feature vectors for a data set:
  x1: [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
  x2: [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
  x3: [0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0]
  x4: [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
  x5: [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
  ...
  xn: [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]
A Couple of Dimensionality Reduction Techniques
• Missing Values Ratio
  – Columns that have too many zeros are unlikely to carry much useful information: remove columns where the fraction of zeros exceeds a threshold
• High Correlation Filter
  – Columns with very similar trends tend to carry similar information: calculate the correlation between columns, and remove one of any pair whose correlation coefficient exceeds a threshold
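A hedged sketch of the two filters, in plain Python. The threshold values and the keep-the-earlier-column strategy are illustrative assumptions, not details given in the lecture; data is represented as a list of columns.

```python
from math import sqrt

def missing_values_ratio_filter(columns, max_zero_frac=0.9):
    """Keep indices of columns whose fraction of zeros is <= max_zero_frac."""
    return [j for j, col in enumerate(columns)
            if col.count(0) / len(col) <= max_zero_frac]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    if sd_a == 0 or sd_b == 0:   # constant column: correlation undefined
        return 0.0
    return cov / (sd_a * sd_b)

def high_correlation_filter(columns, max_corr=0.95):
    """Drop any column too highly correlated with an already-kept column."""
    keep = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) <= max_corr for k in keep):
            keep.append(j)
    return keep
```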
How do we know ϕ is good?
• Score: score(x) = w · ϕ(x)
• Learning chooses the optimal w
• How does feature extraction affect the quality of prediction?
  – Assume w is optimally selected by an oracle
Feature Selection affects what we can learn!
Feature Selection issues
• Hypothesis class: a set of possible predictors
• Feature extraction defines a hypothesis class: a subset of the set of all possible predictors
• If feature extraction defines a feature set/hypothesis class which does not contain any good predictors, then no amount of learning will help!
• We need to select features expressive enough to represent good predictors!
• But it is also necessary to keep the hypothesis class small, because of the computational cost, and because we usually do not obtain the optimal w
Linear Predictors
Neural Networks
A sigmoid function converts the activation input into an output signal (neuron-like). More recently, the simpler ReLU function is used (closer to recent biological findings).
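The two activation functions named above, written as plain functions:

```python
import math

def sigmoid(z):
    """Squashes any real-valued activation into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Passes positive activations through unchanged; zeroes out negatives."""
    return max(0.0, z)
```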
Neural Networks
• Think of the intermediate hidden units as learned features of a linear predictor
• The key idea is feature learning
  – BEFORE: manually specify features
  – AFTER: automatically learn them from data
• Deep learning: pack a sparse, high-dimensional representation into a dense, low-dimensional space
  – Allows generalizations to be learned!
• Objective function: minimize the loss!
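A minimal sketch of the idea above: a one-hidden-layer network whose hidden units are sigmoid-squashed linear combinations of the input, feeding a final linear predictor. The weights here are illustrative placeholders; in practice they are learned by minimizing a loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, output_weights):
    # Each hidden unit computes a "learned feature" of the input x
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)))
              for ws in hidden_weights]
    # The output layer is just a linear predictor over those learned features
    return sum(w * h for w, h in zip(output_weights, hidden))

out = forward([1.0, 0.0],
              hidden_weights=[[1.0, 0.0], [0.0, 1.0]],
              output_weights=[1.0, 1.0])
```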
Deep neural networks
Deep learning Neural Networks
Deep Neural Networks
Supervised vs. Unsupervised
• Labeled training data is expensive!
  – For some tasks, it may not even be available
• Unsupervised learning overcomes the need for labeled training data
• Key idea: the data itself contains rich latent structure; unsupervised learning discovers this structure automatically
Types of unsupervised learning
Clustering
• Input: training set of input points
  Dtrain = {x1, x2, ..., xn}
• Output: assignment of each point to a cluster
  [z1, z2, ..., zn] where zi ∈ {1, ..., K}
• Want similar points in the same cluster, and dissimilar points in different clusters
K-means
K-means example
• Input: Dtrain = {0, 2, 10, 12}
• Output: K = 2 clusters with centroids µ1, µ2
K-means
• But we know neither the centroids nor the assignments!
• Break the hard problem down into two easy problems
K-means algorithm (Step 1)
K-means algorithm (Step 2)
K-means algorithm
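The two alternating steps can be sketched as below, run on the 1-D example Dtrain = {0, 2, 10, 12} from the earlier slide. The initial centroids are an illustrative choice; the lecture does not fix them.

```python
def kmeans(points, centroids, iters=100):
    assignments = []
    for _ in range(iters):
        # Step 1: assign each point to its closest centroid
        assignments = [min(range(len(centroids)),
                           key=lambda k: (x - centroids[k]) ** 2)
                       for x in points]
        # Step 2: move each centroid to the mean of its assigned points
        new_centroids = []
        for k in range(len(centroids)):
            cluster = [x for x, z in zip(points, assignments) if z == k]
            new_centroids.append(sum(cluster) / len(cluster) if cluster
                                 else centroids[k])
        if new_centroids == centroids:   # no change: converged (local minimum)
            break
        centroids = new_centroids
    return centroids, assignments

centroids, assignments = kmeans([0, 2, 10, 12], centroids=[0, 2])
# converges to centroids [1.0, 11.0] with assignments [0, 0, 1, 1]
```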
K-means example
Local minima
• K-means is guaranteed to converge to a local minimum, but not guaranteed to find the global minimum
• Solution: run multiple times with different random initializations, then choose the solution that has the lowest loss
What will we cover today?
• Feature Selection
• Neural Networks
• Unsupervised Learning
  – K-means
Something to think about
• What might be good features for predicting homepages?
• What might be good features for spam recognition?
Assigned Reading
• No paper assigned today
Assigned Reading
• Antisocial Behavior in Online Discussion Communities
  – Justin Cheng, Cristian Danescu-Niculescu-Mizil, Jure Leskovec
  – Proceedings of ICWSM 2015
• Responses due on Feb 16th, 2017, 11:59 pm