Post on 05-Jan-2016
1
Machine Learning
SCE 5820: Machine Learning
Instructor: Jinbo Bi
Computer Science and Engineering Dept.
2
Course Information
Instructor: Dr. Jinbo Bi
– Office: ITEB 233
– Phone: 860-486-1458
– Email: jinbo@engr.uconn.edu
– Web: http://www.engr.uconn.edu/~jinbo/
– Time: Tue / Thur. 2:00pm – 3:15pm
– Location: BCH 302
– Office hours: Thur. 3:15-4:15pm
HuskyCT
– http://learn.uconn.edu
– Login with your NetID and password
– Illustration
3
Introduction of the instructor and TA
Ph.D. in Mathematics
Research interests: machine learning, data mining, optimization, biomedical informatics, bioinformatics
[Figure labels: subtyping; GWAS; color of flowers; cancer, psychiatric disorders, …]
http://labhealthinfo.uconn.edu/EasyBreathing
4
Course Information
Prerequisite: Basics of linear algebra, calculus, optimization and basics of programming
Course textbook (not required):
– Introduction to Data Mining (2005) by Pang-Ning Tan, Michael Steinbach, Vipin Kumar
– Pattern Recognition and Machine Learning (2006) Christopher M. Bishop
– Pattern Classification (2nd edition, 2000) Richard O. Duda, Peter E. Hart and David G. Stork
Additional class notes and copied materials will be given
Reading material links will be provided
5
Course Information
Objectives:
– Introduce students to the basic concepts of machine learning and to state-of-the-art machine learning algorithms
– Focus on some high-demand application domains, with hands-on experience applying data mining / machine learning techniques
Format:
– Lectures, micro-teaching assignment, quizzes, a term project
6
Grading
Micro-teaching assignment (1): 20%
In-class/in-lab open-book, open-notes quizzes (4-5): 40%
Term project (1): 30%
Participation: 10%
There is one term project per team. A team can consist of one or two students. Each student in the team needs to specify his/her role in the project.
Term projects can be chosen from a list of pre-defined projects
7
Policy
Computers
Participation in micro-teaching sessions is very important, and itself accounts for 50% of the credits for the micro-teaching assignment
Quizzes are graded by the instructor
Final term projects will be graded by the instructor
If you miss two quizzes, there will be a take-home quiz to make up the credits (missing one may be ok for your final grade.)
8
Micro-teaching sessions
Students in our class need to form THREE roughly-even study groups
The instructor will help to balance the study groups
Each study group will be responsible for teaching one specific topic chosen from the following:
– Support Vector Machines
– Spectral Clustering
– Boosting (PAC learning model)
9
Term Project
Each team needs to give two presentations: a progress or preparation presentation (10-15min); a final presentation in the last week (15-20min)
Each team needs to submit a project report
– Definition of the problem
– Data mining approaches used to solve the problem
– Computational results
– Conclusion (success or failure)
10
Machine Learning / Data Mining
Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information
– http://www.kdd.org/kdd2013/ ACM SIGKDD conference
The ultimate goal of machine learning is the creation and understanding of machine intelligence
– http://icml.cc/2013/ ICML conference
The main goal of statistical learning theory is to provide a framework for studying the problem of inference, that is, of gaining knowledge, making predictions, and making decisions from a set of data.
– http://nips.cc/Conferences/2012/ NIPS conference
11
Traditional Topics in Data Mining /AI
Fuzzy set and fuzzy logic
– Fuzzy if-then rules
Evolutionary computation
– Genetic algorithms
– Evolutionary strategies
Artificial neural networks
– Back propagation network (supervised learning)
– Self-organization network (unsupervised learning, will not be covered)
12
Challenges in traditional techniques
Lack of theoretical analysis of the behavior of the algorithms
Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Diagram labels: Machine Learning/Pattern Recognition; Statistics/AI; Soft Computing]
13
Recent Topics in Data Mining
Supervised learning such as classification and regression
– Support vector machines
– Regularized least squares
– Fisher discriminant analysis (LDA)
– Graphical models (Bayesian nets)
– Boosting algorithms
Draw from Machine Learning domains
14
Recent Topics in Data Mining
Unsupervised learning such as clustering
– K-means
– Gaussian mixture models
– Hierarchical clustering
– Graph based clustering (spectral clustering)
Dimension reduction
– Feature selection
– Compact feature space into low-dimensional space (principal component analysis)
15
Statistical Behavior
Many perspectives to analyze how the algorithm handles uncertainty
Simple examples:
– Consistency analysis
– Learning bounds (upper bound on test error of the constructed model or solution)
"Statistical" not "deterministic"
– With probability p, the upper bound holds
  $P(\text{test error} \le \text{Upper\_bound}) \ge p$
16
Tasks in Data Mining
Prediction tasks (supervised problem)
– Use some variables to predict unknown or future values of other variables.
Description tasks (unsupervised problem)
– Find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
17
Classification: Definition
Given a collection of examples (training set)
– Each example contains a set of attributes, one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen examples should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
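The training/test procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy data, the 30% hold-out fraction, and the 1-nearest-neighbor model are all hypothetical choices, not something the slides prescribe.

```python
import random


def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle the examples and hold out a fraction of them as the test set."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, attributes):
    """Return the class of the training example closest to `attributes`."""
    def squared_distance(example):
        return sum((a - b) ** 2 for a, b in zip(example[0], attributes))
    return min(train, key=squared_distance)[1]


def accuracy(train, test):
    """Fraction of test examples whose predicted class matches the true class."""
    correct = sum(1 for attributes, label in test
                  if predict_1nn(train, attributes) == label)
    return correct / len(test)


# Hypothetical toy data: two well-separated classes in 2-D.
examples = [((float(i), float(i % 3)), "neg") for i in range(10)]
examples += [((100.0 + i, float(i % 3)), "pos") for i in range(10)]

train, test = train_test_split(examples)
test_accuracy = accuracy(train, test)
print(len(train), len(test), test_accuracy)
```

The model is built from `train` only; `test` is touched only when measuring accuracy, which is the point of the split.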
18
Classification Example
Training Set
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Test Set
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
[Diagram: Training Set → Learn Classifier → Model; the Model is applied to the Test Set]
19
Classification: Application 1
High-Risk Patient Detection
– Goal: Predict whether a patient will suffer a major complication after a surgical procedure
– Approach:
  Use patients' vital signs before and after the surgical operation.
  – Heart Rate, Respiratory Rate, etc.
  Have expert medical professionals monitor patients and label which patients have complications and which do not.
  Learn a model for the class of after-surgery risk.
  Use this model to detect potential high-risk patients for a particular surgical procedure
20
Classification: Application 2
Face recognition
– Goal: Predict the identity of a face image
– Approach:
  Align all images to derive the features
  Model the class (identity) based on these features
21
Classification: Application 3
Cancer Detection
– Goal: To predict class (cancer or normal) of a sample (person), based on the microarray gene expression data
– Approach:
  Use expression levels of all genes as the features
  Label each example as cancer or normal
  Learn a model for the class of all samples
22
Classification: Application 4
Alzheimer's Disease Detection
– Goal: To predict class (AD or normal) of a sample (person), based on neuroimaging data such as MRI and PET
– Approach:
  Extract features from neuroimages
  Label each example as AD or normal
  Learn a model for the class of all samples
[Figure: Reduced gray matter volume (colored areas) detected by MRI voxel-based morphometry in AD patients compared to normal healthy controls.]
23
Regression
Predict a value of a real-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Extensively studied in statistics, neural network fields.
Find a model to predict the dependent variable as a function of the values of independent variables.
Goal: previously unseen examples should be predicted as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
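As a small illustration of the regression setting, the sketch below fits a straight line by ordinary least squares on a training set and measures squared error on a held-out test set. The data are hypothetical (generated from y = 2x + 1), and a one-variable linear model is only the simplest possible choice of dependency.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(slope, intercept, xs, ys):
    """Average squared difference between predictions and true values."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data generated from y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]
test_x = [5.0, 6.0]
test_y = [11.0, 13.0]

slope, intercept = fit_line(train_x, train_y)
test_mse = mean_squared_error(slope, intercept, test_x, test_y)
print(slope, intercept, test_mse)
```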
24
Regression application 1
Goal: Predict the possible loss from a customer
Training Set (past transaction records, labeled)
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Loss (continuous target)
Tid  Refund  Marital Status  Taxable Income  Loss
1    Yes     Single          125K            100
2    No      Married         100K            120
3    No      Single          70K             -200
4    Yes     Married         120K            -300
5    No      Divorced        95K             -400
6    No      Married         60K             -500
7    Yes     Divorced        220K            -190
8    No      Single          85K             300
9    No      Married         75K             -240
10   No      Single          90K             90
Test Set (current data; we want to use the model to predict)
Refund  Marital Status  Taxable Income  Loss
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
[Diagram: Training Set → Learn Regressor → Model; the Model is applied to the Test Set]
25
Regression applications
Examples:
– Predicting sales amounts of new product based on advertising expenditure.
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
26
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures
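One way to make this definition concrete is a minimal K-means sketch that uses Euclidean distance as the (dis)similarity measure. K-means is only one of several clustering algorithms; the toy points, the choice of two clusters, and the initial centers below are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, initial_centers, n_iter=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    centers = [tuple(c) for c in initial_centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its closest center.
        labels = [min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        centers = new_centers
    return labels, centers


# Hypothetical toy data: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, initial_centers=[points[0], points[3]])
print(labels, centers)
```

Points that end up with the same label are closer to their shared center than to the other center, which matches the "more similar within, less similar across" requirement.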
27
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
Intracluster distances are minimized
Intercluster distances are maximized
28
Clustering: Application 1
High-Risk Patient Detection
– Goal: Predict whether a patient will suffer a major complication after a surgical procedure
– Approach:
  Use patients' vital signs before and after the surgical operation.
  – Heart Rate, Respiratory Rate, etc.
  Find patients whose symptoms are dissimilar from those of most other patients.
29
Clustering: Application 2
Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
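A frequency-based similarity measure of the kind described in the approach could look like the sketch below. Cosine similarity over raw term counts is an assumption made here for illustration (the slides do not name a specific measure), and the three short documents are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated, lower-cased term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0 = no shared terms)."""
    dot = sum(count * tf_b[term] for term, count in tf_a.items() if term in tf_b)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: two about finance, one about sports.
docs = [
    "stock market prices fall",
    "stock prices rise on market news",
    "team wins football match",
]
tfs = [term_frequencies(d) for d in docs]
sim_finance = cosine_similarity(tfs[0], tfs[1])
sim_mixed = cosine_similarity(tfs[0], tfs[2])
print(sim_finance, sim_mixed)
```

A clustering algorithm would then group documents whose pairwise similarity is high, so the two finance documents would be expected to land together.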
30
Illustrating Document Clustering
Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents (after some word filtering).
Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278
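The table's counts can be turned into per-category and overall placement rates with a few lines of arithmetic. The numbers are copied from the table; the variable names are just illustrative.

```python
# (total articles, correctly placed) per category, copied from the table above.
results = {
    "Financial": (555, 364),
    "Foreign": (341, 260),
    "National": (273, 36),
    "Metro": (943, 746),
    "Sports": (738, 573),
    "Entertainment": (354, 278),
}

total_articles = sum(total for total, _ in results.values())
total_correct = sum(correct for _, correct in results.values())
per_category = {name: correct / total for name, (total, correct) in results.items()}
overall_rate = total_correct / total_articles

for name, rate in per_category.items():
    print(f"{name:14s} {rate:.1%}")
print(f"{'Overall':14s} {overall_rate:.1%}")
```

The category totals add up to the 3204 articles stated on the slide, and the "National" category stands out with by far the lowest placement rate.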
31
Algorithms to solve these problems
32
Classification algorithms
K-Nearest-Neighbor classifiers
Naïve Bayes classifier
Neural Networks
Linear Discriminant Analysis (LDA)
Support Vector Machines (SVM)
Decision Trees
Logistic Regression
Graphical models
33
Regression methods
Linear Regression
Ridge Regression
LASSO – Least Absolute Shrinkage and Selection Operator
Neural Networks
34
Clustering algorithms
K-Means
Hierarchical clustering
Graph-based clustering (Spectral clustering)
Semi-supervised clustering
Others
35
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
36
Basics of probability
An experiment (random variable) is a well-defined process with observable outcomes.
The set or collection of all outcomes of an experiment is called the sample space, S.
An event E is any subset of outcomes from S.
Probability of an event: when all outcomes are equally likely, P(E) = number of outcomes in E / number of outcomes in S.
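The counting definition can be checked on a small, hypothetical experiment: rolling two fair dice and asking for the probability that the faces sum to 7. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# Sample space S: all ordered outcomes of rolling two fair six-sided dice.
sample_space = list(product(range(1, 7), repeat=2))

# Event E: the two faces sum to 7.
event = [outcome for outcome in sample_space if sum(outcome) == 7]

# Counting definition, valid because all 36 outcomes are equally likely.
p_event = Fraction(len(event), len(sample_space))
print(len(event), len(sample_space), p_event)
```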
37
Probability Theory
Apples and Oranges
X: identity of the fruit
Y: identity of the box
Assume P(Y=r) = 40%, P(Y=b) = 60% (prior)
P(X=a|Y=r) = 2/8 = 25%
P(X=o|Y=r) = 6/8 = 75%
P(X=a|Y=b) = 3/4 = 75%
P(X=o|Y=b) = 1/4 = 25%
Marginal: P(X=a) = 11/20, P(X=o) = 9/20
Posterior: P(Y=r|X=o) = 2/3, P(Y=b|X=o) = 1/3
38
Probability Theory
Marginal Probability
Conditional Probability
Joint Probability
39
Probability Theory
Sum Rule: $p(X) = \sum_Y p(X, Y)$
The marginal prob of X equals the sum of the joint prob of X and Y with respect to Y
Product Rule: $p(X, Y) = p(Y|X)\,p(X)$
The joint prob of X and Y equals the product of the conditional prob of Y given X and the prob of X
40
Illustration
[Figure: joint distribution p(X,Y) for Y=1 and Y=2, with the marginals p(X) and p(Y) and the conditional p(X|Y=1)]
41
The Rules of Probability
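Sum Rule: $p(X) = \sum_Y p(X, Y)$
Product Rule: $p(X, Y) = p(Y|X)\,p(X)$
Bayes' Rule: $p(Y|X) = \frac{p(X|Y)\,p(Y)}{p(X)}$, where $p(X) = \sum_Y p(X|Y)\,p(Y)$
posterior ∝ likelihood × prior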
42
Application of Prob Rules
Assume P(Y=r) = 40%, P(Y=b) = 60%
P(X=a|Y=r) = 2/8 = 25%, P(X=o|Y=r) = 6/8 = 75%
P(X=a|Y=b) = 3/4 = 75%, P(X=o|Y=b) = 1/4 = 25%
p(X=a) = p(X=a,Y=r) + p(X=a,Y=b) = p(X=a|Y=r)p(Y=r) + p(X=a|Y=b)p(Y=b) = 0.25*0.4 + 0.75*0.6 = 11/20
P(X=o) = 9/20
p(Y=r|X=o) = p(Y=r,X=o)/p(X=o) = p(X=o|Y=r)p(Y=r)/p(X=o) = 0.75*0.4 / (9/20) = 2/3
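The apples-and-oranges numbers can be reproduced with exact fractions by applying the sum, product, and Bayes rules directly. The priors and conditionals are the ones stated on the slide; the function names are illustrative.

```python
from fractions import Fraction

# Prior over boxes Y (r = red, b = blue) and conditionals p(X | Y)
# over fruits X (a = apple, o = orange), as given on the slide.
prior = {"r": Fraction(4, 10), "b": Fraction(6, 10)}
likelihood = {
    "r": {"a": Fraction(2, 8), "o": Fraction(6, 8)},
    "b": {"a": Fraction(3, 4), "o": Fraction(1, 4)},
}


def marginal(x):
    """Sum rule: p(X=x) = sum over boxes of p(X=x | Y=y) p(Y=y)."""
    return sum(likelihood[y][x] * prior[y] for y in prior)


def posterior(y, x):
    """Bayes' rule: p(Y=y | X=x) = p(X=x | Y=y) p(Y=y) / p(X=x)."""
    return likelihood[y][x] * prior[y] / marginal(x)


p_apple = marginal("a")
p_orange = marginal("o")
p_red_given_orange = posterior("r", "o")
p_blue_given_orange = posterior("b", "o")
print(p_apple, p_orange, p_red_given_orange, p_blue_given_orange)
```

Observing an orange raises the probability of the red box from the prior 2/5 to the posterior 2/3, because oranges are much more common in the red box.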
44
Mean and Variance
The mean of a random variable X is the average value X takes.
The variance of X is a measure of how dispersed the values that X takes are.
The standard deviation is simply the square root of the variance.
45
Simple Example
X = {1, 2} with P(X=1) = 0.8 and P(X=2) = 0.2
Mean – 0.8 × 1 + 0.2 × 2 = 1.2
Variance – 0.8 × (1 – 1.2) × (1 – 1.2) + 0.2 × (2 – 1.2) × (2 – 1.2) = 0.16
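The same calculation written as code, for the distribution in this example. The helper names are illustrative; the standard deviation is included because it is just the square root of the variance.

```python
import math

# Distribution from the example: value -> probability.
distribution = {1: 0.8, 2: 0.2}


def mean(dist):
    """Probability-weighted average of the values."""
    return sum(p * x for x, p in dist.items())


def variance(dist):
    """Probability-weighted average squared deviation from the mean."""
    mu = mean(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())


mu = mean(distribution)
var = variance(distribution)
std = math.sqrt(var)
print(mu, var, std)
```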
46
The Gaussian Distribution
47
Gaussian Mean and Variance
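The formulas for these slides did not survive extraction, so the sketch below assumes the standard univariate Gaussian density with mean mu and standard deviation sigma, and checks numerically that it integrates to 1 and has mean mu and variance sigma squared. The parameter values and the midpoint-rule integration are illustrative choices.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Standard univariate Gaussian density with mean mu and std sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def numeric_moments(mu, sigma, half_width=10.0, steps=20000):
    """Midpoint-rule estimates of total mass, mean, and variance."""
    lo = mu - half_width * sigma
    dx = 2.0 * half_width * sigma / steps
    xs = [lo + (i + 0.5) * dx for i in range(steps)]
    mass = sum(gaussian_pdf(x, mu, sigma) for x in xs) * dx
    mean = sum(x * gaussian_pdf(x, mu, sigma) for x in xs) * dx
    var = sum((x - mean) ** 2 * gaussian_pdf(x, mu, sigma) for x in xs) * dx
    return mass, mean, var


# Hypothetical parameters.
mass, est_mean, est_var = numeric_moments(mu=1.5, sigma=2.0)
print(mass, est_mean, est_var)
```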
48
The Multivariate Gaussian
49
References
SC_prob_basics1.pdf (necessary)
SC_prob_basic2.pdf
Loaded to HuskyCT
50
Basics of Linear Algebra
51
Matrix Multiplication
The product of two matrices: $C = AB$
Special case: vector-vector product, matrix-vector product
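A plain-Python sketch of the matrix product, with the vector-vector and matrix-vector products shown as special cases where one factor has a single row or column. Matrices are represented as lists of rows; the function name is illustrative.

```python
def matmul(a, b):
    """Return C = A B, where C[i][j] is the dot product of row i of A and column j of B."""
    n_inner = len(b)
    if any(len(row) != n_inner for row in a):
        raise ValueError("columns of A must equal rows of B")
    n_cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(n_inner)) for j in range(n_cols)]
            for i in range(len(a))]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)

# Special cases: a row vector times a column vector, and a matrix times a column vector.
inner = matmul([[1, 2, 3]], [[4], [5], [6]])
mat_vec = matmul(A, [[1], [1]])
print(C, inner, mat_vec)
```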
52
Matrix Multiplication
53
Rules of Matrix Multiplication
$C = AB$
54
Orthogonal Matrix
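The columns of $U_{m \times n}$ are orthonormal if and only if $U^T U = I_{n \times n}$.
$V_{m \times m}$ is orthogonal if and only if $V^T V = V V^T = I_{m \times m}$, i.e. $V^{-1} = V^T$ ($I$ is the identity matrix, $I = \mathrm{diag}(1, \dots, 1)$).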
55
Square Matrix – EigenValue, EigenVector
$(\lambda, x)$ is an eigenpair of $A$, if and only if $Ax = \lambda x$,
where $\lambda$ is the eigenvalue and $x$ is the eigenvector.
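The eigenpair condition can be checked directly by multiplying the matrix with a candidate vector. The 2-by-2 matrix and the candidate pairs below are hypothetical, and the check excludes the zero vector, which is never counted as an eigenvector.

```python
def matvec(a, x):
    """Matrix-vector product A x for a matrix given as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def is_eigenpair(a, eigenvalue, x, tol=1e-9):
    """True if A x equals eigenvalue * x (componentwise, within tol) and x is nonzero."""
    if all(abs(x_i) <= tol for x_i in x):
        return False
    ax = matvec(a, x)
    return all(abs(ax_i - eigenvalue * x_i) <= tol for ax_i, x_i in zip(ax, x))


A = [[2.0, 1.0], [1.0, 2.0]]
print(is_eigenpair(A, 3.0, [1.0, 1.0]))
print(is_eigenpair(A, 1.0, [1.0, -1.0]))
print(is_eigenpair(A, 2.0, [1.0, 0.0]))
```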
56
Symmetric Matrix – EigenValue EigenVector
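$A$ is symmetric, if $A = A^T$.
$A_{n \times n}$ is symmetric and positive semi-definite, if $x^T A x \ge 0$ for any $x$; then $\lambda_i \ge 0$, $i = 1, \dots, n$.
$A_{n \times n}$ is symmetric and positive definite, if $x^T A x > 0$ for any nonzero $x$; then $\lambda_i > 0$, $i = 1, \dots, n$.
eigen-decomposition of A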
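For a symmetric 2-by-2 matrix the eigenvalues have a closed form, which gives a quick way to illustrate the link between definiteness and the signs of the eigenvalues. The sketch is restricted to the 2-by-2 case on purpose; the example matrices and the tolerance are hypothetical.

```python
import math


def symmetric_2x2_eigenvalues(a):
    """Eigenvalues (smallest first) of a symmetric 2x2 matrix [[p, q], [q, s]]."""
    (p, q), (r, s) = a
    if q != r:
        raise ValueError("matrix is not symmetric")
    center = (p + s) / 2.0
    radius = math.sqrt(((p - s) / 2.0) ** 2 + q ** 2)
    return center - radius, center + radius


def definiteness(a, tol=1e-12):
    """Classify a symmetric 2x2 matrix by the sign of its smallest eigenvalue."""
    smallest, _ = symmetric_2x2_eigenvalues(a)
    if smallest > tol:
        return "positive definite"
    if smallest >= -tol:
        return "positive semi-definite"
    return "not positive semi-definite"


print(symmetric_2x2_eigenvalues([[2.0, 1.0], [1.0, 2.0]]))
print(definiteness([[2.0, 1.0], [1.0, 2.0]]))
print(definiteness([[1.0, 1.0], [1.0, 1.0]]))
print(definiteness([[1.0, 2.0], [2.0, 1.0]]))
```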
57
Matrix Norms and Trace
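Matrix norm:
2-norm: $\|A\|_2 = \max_{\|v\|_2 = 1} \|Av\|_2$ = the square root of the largest eigenvalue of $A^T A$.
F-norm (Frobenius norm): $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$.
1-norm: $\|A\|_1 = \sum_{i,j} |A_{ij}|$.
$\mathrm{trace}(A) = \sum_{i=1}^{m} A_{ii}$, for a square matrix $A$ of size $m$ by $m$.
$\mathrm{trace}(AB) = \mathrm{trace}(BA)$, $\|A\|_F^2 = \mathrm{trace}(A^T A) = \mathrm{trace}(A A^T)$.
$\|QA\|_F = \|A\|_F$, if $Q$ has orthonormal columns.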
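Standard trace and Frobenius-norm identities can be verified numerically on small matrices: the squared Frobenius norm equals trace(A^T A), trace(AB) equals trace(BA), and multiplying by a matrix with orthonormal columns leaves the Frobenius norm unchanged. The matrices below are hypothetical; Q is a 90-degree rotation.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def trace(a):
    """Sum of the diagonal entries of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))


def frobenius_norm(a):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in a for x in row))


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [5.0, -2.0]]
Q = [[0.0, -1.0], [1.0, 0.0]]  # rotation by 90 degrees, so its columns are orthonormal

print(frobenius_norm(A) ** 2, trace(matmul(transpose(A), A)))
print(trace(matmul(A, B)), trace(matmul(B, A)))
print(frobenius_norm(matmul(Q, A)), frobenius_norm(A))
```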
58
Singular Value Decomposition
Singular Value Decomposition (SVD): $A = U \Sigma V^T$, where $U_{m \times m}$ and $V_{n \times n}$ are orthogonal, and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r)$ is diagonal with $r = \min(m, n)$.
$U$: forms the eigenvectors of $A A^T$: $A A^T = U \Sigma \Sigma^T U^T$.
$V$: forms the eigenvectors of $A^T A$: $A^T A = V \Sigma^T \Sigma V^T$.
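Because the eigenvalues of A^T A are the squared singular values of A, the singular values of a small matrix can be obtained without a linear algebra library. The sketch below handles only the 2-by-2 case via the closed-form eigenvalues of the symmetric matrix A^T A; it is an illustration of that relationship, not a general SVD routine, and the example matrix is hypothetical.

```python
import math


def gram(a):
    """Return A^T A for a matrix given as a list of rows."""
    cols = list(zip(*a))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]


def singular_values_2x2(a):
    """Singular values (largest first) of a 2x2 matrix via the eigenvalues of A^T A."""
    g = gram(a)
    center = (g[0][0] + g[1][1]) / 2.0
    radius = math.sqrt(((g[0][0] - g[1][1]) / 2.0) ** 2 + g[0][1] ** 2)
    eigenvalues = [center + radius, max(center - radius, 0.0)]  # clamp rounding noise
    return [math.sqrt(e) for e in eigenvalues]


A = [[3.0, 0.0], [4.0, 5.0]]
sigma = singular_values_2x2(A)
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]
print(sigma, sigma[0] * sigma[1], det_A)
```

Two sanity checks follow from the decomposition: the product of the singular values equals the absolute value of the determinant, and the sum of their squares equals the squared Frobenius norm.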
59
References
SC_linearAlg_basics.pdf (necessary)
SVD_basics.pdf
Loaded to HuskyCT
60
Summary
This is the end of the FIRST chapter of this course
Next Class
Cluster analysis
– General topics
– K-means
Slides after this one are backup slides, you can also check them to learn more