
Lecture 1
Introduction to Machine Learning & Modern Applications

Pavel Laskov¹, Blaine Nelson¹

¹Cognitive Systems Group, Wilhelm Schickard Institute for Computer Science, Universität Tübingen, Germany

Advanced Topics in Machine Learning, 2012

April 17, 2012


Part I

Course Outline


Course Overview

Course Information:

Course Time: Tues, 14:00 c.t. – 16:00 (except May 1 & May 29)
Course Location: Sand F122
Office Hours: By appointment
Website: http://www.ra.cs.uni-tuebingen.de/lehre/ss12/advanced_ml.html

Course Material:

Textbook (1/2 of course): John Shawe-Taylor and Nello Cristianini: Kernel Methods for Pattern Analysis. Cambridge University Press, 2004 [10].
Supplementary material to be supplied.

Final Exam: July 31


Instructors

Dr. Pavel Laskov
Office: Sand A304
Email: pavel DOT laskov AT uni-tuebingen DOT de
Web: www-rsec.cs.uni-tuebingen.de/laskov

Dr. Blaine Nelson
Office: Sand A316
Email: blaine DOT nelson AT wsii DOT uni-tuebingen DOT de
Web: www.ra.cs.uni-tuebingen.de/mitarb/nelson/


Übungen and Grading

In the exercise meetings, some solution techniques will be presented & model solutions will be discussed.

Time: Wed, 14:00 c.t. - 16:00, on selected dates, TBA

Location: Sand F122

Homework: 4-5 graded written assignments

Grades: Homework will comprise 30% of the grade. The remaining 70% will be from the final exam.


Part II

Applications of Machine Learning


Why is Machine Learning Relevant?

[Figure: word cloud of ML application areas: Machine Perception, Computer Vision, Natural Language Processing, Search Engines, Medical Diagnosis, Bioinformatics, Cheminformatics, Fraud Detection, Stock Market Analysis, Speech/Handwriting Recognition, Object Recognition, Robot Locomotion, Google Translator, Recommender Systems (Amazon, Facebook, LastFM, LinkedIn), Paper Assignment System (NIPS)]


Why is Machine Learning Relevant?
Google Translator

[Figure: statistical MT pipeline. Statistical analysis of French/English bilingual text yields a translation model; statistical analysis of English text alone yields a language model; a decoding algorithm combines the two as $\hat{e} = \arg\max_e P(e) \cdot P(s \mid e)$, turning French input into (poor) English and then fluent English. Example alignments: la maison ↔ the house, la maison bleue ↔ the blue house, la fleur ↔ the flower]

Statistical translators are composed of 2 elements:
1. Translation model: learns correspondences between words
2. Language model: learns word order for a proper sentence

Google trains their translation model by learning correspondences found in bilingual text

Figures used were reproduced from the talk What's New in Statistical Machine Translation by Knight & Koehn [3]
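To make the decoding rule concrete, here is a minimal, hypothetical Python sketch: the probability tables are invented stand-ins for models that would actually be learned from bilingual and monolingual text, and real decoders search far larger candidate spaces.

```python
import math

# Toy translation model P(f|e): invented word-correspondence probabilities.
translation_model = {("la", "the"): 0.9, ("maison", "house"): 0.8,
                     ("bleue", "blue"): 0.7, ("fleur", "flower"): 0.8}

# Toy language model P(e): invented bigram probabilities for English word order.
language_model = {("the", "house"): 0.5, ("the", "blue"): 0.2,
                  ("blue", "house"): 0.6, ("the", "flower"): 0.4}

def log_score(french, english):
    """log P(e) + log P(f|e) under the toy models (unseen pairs get a tiny floor)."""
    lm = sum(math.log(language_model.get(b, 1e-6))
             for b in zip(english, english[1:]))
    tm = sum(math.log(max(translation_model.get((f, e), 1e-6) for e in english))
             for f in french)
    return lm + tm

def decode(french, candidates):
    """The decoding rule: argmax over e of P(e) * P(f|e)."""
    return max(candidates, key=lambda e: log_score(french, e))

print(decode(("la", "maison", "bleue"),
             [("the", "blue", "house"),     # good words, good order
              ("the", "house", "blue"),     # good words, bad order
              ("the", "flower", "house")]))  # wrong words
# -> ('the', 'blue', 'house'): the language model penalizes bad word order,
#    the translation model penalizes wrong word correspondences.
```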


Why is Machine Learning Relevant?
Recommender System

Recommender System: given users' past ratings of items (books, movies, etc.), recommend new items to them [8]

Collaborative Filtering (User-based):
1. Find users with ratings similar to the current user's
2. Use the ratings of these like-minded users to make predictions for the current user (a minimal sketch follows below)

Collaborative Filtering (Item-based) [4]:
1. Determine item-item relationships from user data
2. Use this matrix and the user's preferences to predict new items for the user
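As a concrete illustration, here is a minimal sketch of the user-based variant with NumPy; the rating matrix is invented, and real systems add rating normalization, neighborhood selection, and sparsity handling.

```python
import numpy as np

# Invented user-item rating matrix; rows = users, columns = items, 0 = unrated.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine(u, v):
    """Cosine similarity between two users' rating vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def predict(R, user, item):
    """Step 1: find users similar to `user`; step 2: average their ratings
    of `item`, weighted by similarity."""
    others = [u for u in range(len(R)) if u != user and R[u, item] > 0]
    sims = np.array([cosine(R[user], R[u]) for u in others])
    ratings = np.array([R[u, item] for u in others])
    return (sims @ ratings) / (sims.sum() + 1e-12)

# User 0 has not rated item 2; the most similar user (user 1) rated it low,
# so the prediction comes out low (about 2.1).
print(predict(R, user=0, item=2))
```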


Why is Machine Learning Relevant?
Paper Assignment System

Paper Assignment System (NIPS): given a set of reviewers (R) and a set of papers (P), find a matching between them

Matching Criteria:
1. Each paper p ∈ P must be reviewed by at least 3 reviewers
2. Each reviewer r ∈ R should be assigned papers related to his/her research
3. Each reviewer r ∈ R should not be assigned too many papers

Approach of Charlin, Zemel, & Boutilier [1]:
1. Construct a language model based on observed words in papers
2. Construct a suitability score for each reviewer-paper pair using linear regression based on (1) parameters of the language model and (2) reviewer preferences
3. Use collaborative filtering to find a reviewer-paper assignment
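The sketch below only illustrates the constraint structure of the problem: it replaces the learned suitability scores of [1] with random stand-ins and uses a naive greedy assignment rather than the paper's optimization, so read it as a toy, not their method.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reviewers, n_papers = 6, 4

# Stand-in suitability scores; in [1] these come from a language model over
# the papers' words plus reviewer preferences, via regression.
S = rng.random((n_reviewers, n_papers))

REVIEWS_PER_PAPER = 3   # criterion 1: at least 3 reviews per paper
MAX_LOAD = 3            # criterion 3: cap each reviewer's load

assignment = {p: [] for p in range(n_papers)}
load = np.zeros(n_reviewers, dtype=int)

for p in range(n_papers):
    for r in np.argsort(-S[:, p]):            # reviewers, best-suited first
        if load[r] < MAX_LOAD:                 # respect the load cap
            assignment[p].append(int(r))
            load[r] += 1
        if len(assignment[p]) == REVIEWS_PER_PAPER:
            break

print(assignment)   # each paper gets 3 distinct reviewers, loads stay <= 3
```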


Topics to be Covered

1. Kernel Methods
   Kernel functions provide data abstraction
   Kernel methods provide common algorithms for any kernel

2. Large-scale/Online Learning
   Learning may need to work with vast amounts of data
   Learning can be incrementally updated for new data

3. Learning for Structured Data
   Most real data is not numeric & converting to a numeric representation may lose important structural elements
   We will discuss methods that allow for structured data

4. Learning in Adversarial Environments
   Not all data comes from a static source; in fact, it may change adversarially
   Learning methods need to be robust against such data


Part III

Scope of Machine Learning


What is Machine Learning?
The Chinese Room Problem (see Chapter 26 of [9])

Suppose you are placed in a room with a book of symbols/instructions

When a symbol comes, the book tells you what symbols to produce

To any outside observer, the room is able to perfectly answer questions in Chinese, but…

Does the room know Chinese?
Do you know Chinese?
Does the book know Chinese?

The same dilemma occurs when we talk about machine learning… What does it mean for a machine to learn?


What is Machine Learning?
Machine Learning & Artificial Intelligence

Machine Learning (ML) & Artificial Intelligence (AI) are closely related, but there are several key differences

Artificial Intelligence: the broad study of machines' ability to solve a wide range of human-like tasks; e.g.,

Search
Solving Constraint Satisfaction Problems
Logical Inference
Planning
Computing probabilities for events

Machine Learning: the branch of AI that studies the ability of machines to learn (albeit not necessarily like humans)


What is Machine Learning?
Machine Learning & Artificial Intelligence

[Figure: diagram relating the goal of AI, "classic" AI, and learning. Particular observations are turned into a general representation (e.g., $F_{1,2} \propto \frac{m_1 m_2}{r_{1,2}^2}$) by acquisition (induction); the general representation is applied to particular cases by application (deduction).]

Classic AI addressed deductive reasoning & knowledge representation
Learning is concerned with inductive reasoning (generalizing) to construct hypotheses


What is Machine Learning?
Classic Artificial Intelligence: Search, CSPs, & Games

Classic AI includes many interesting problems (see Chapters 1–6 of [9])

[Figures: Route Planning, Constraint Satisfaction, Game Search]

These AI algorithms solve difficult problems, but do they learn?
They use pre-defined knowledge/rules to solve particular instances of each problem
Their solutions do not summarize any inherent aspects of the problems they solve
These algorithms do not extract information from their input data that can be applied to solve later problems


What is Machine Learning?
Classic Artificial Intelligence: Logical & Probabilistic Inference

Inference algorithms derive consequences from prior knowledge & evidence (see Chapters 7–9 & 13–17 of [9])

[Figures: Logical Inference, Probabilistic Inference]

Do inference algorithms learn?

They derive previously unknown knowledge from evidence
Their rules & structure are given a priori from a knowledge base; all their derivations follow as consequences


Part IV

Pattern Analysis


Machine Learning as Pattern Analysis
Redundancy

Data Redundancy: indicates that there are (simple) patterns in the data that allow missing information to be reconstructed/predicted

Compressibility: redundant data can be compressed (sometimes losslessly)

Example: any given natural-language text (or photograph) can be compressed to a significantly smaller size

Predictability: redundancy allows predictions to be made with only partial information

Example: if we know how long an object has been falling (in a vacuum on Earth), we can predict how far it has fallen: x ∝ t²
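A minimal sketch of the compressibility point, assuming only Python's standard zlib and os modules: highly redundant text shrinks dramatically, while incompressible random bytes barely shrink at all.

```python
import os
import zlib

# Redundant "natural language": the same sentence repeated many times.
text = ("the house is blue and the flower is red. " * 50).encode()
# Random bytes of the same length: essentially no redundancy to exploit.
noise = os.urandom(len(text))

print(len(text), "->", len(zlib.compress(text)))    # shrinks to a tiny fraction
print(len(noise), "->", len(zlib.compress(noise)))  # stays roughly the same size
```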


Machine Learning as Pattern Analysis
Patterns

Pattern: any relation present in data; i.e., a function f : X → Y

Exact Pattern: non-trivial pattern such that f(x) = 0 for all foreseeable x
Approximate Pattern: non-trivial pattern such that f(x) ≈ 0 for all foreseeable x
Statistical Pattern: non-trivial pattern such that $E_{x \sim P}[f(x)] \approx 0$ for some distribution P on X

The veracity of a pattern is assessed by comparing the pattern's prediction f(x) to the true value y; this is accomplished via a loss function, e.g., the squared loss $L(f(x), y) = (f(x) - y)^2$


What is Machine Learning?
Machine Learning as Pattern Analysis

Pattern Analysis: discovery of underlying relations, regularities or structures that are inherent to a set of data

Detecting an inherent pattern allows predictions to be made about future data from the same source
Example - Kepler's Law: from observation, Kepler found that the periodicity of a planet (P) & its distance (D) are related as P² ≈ D³


Planet     Periodicity P (years)   Distance D (AU)   P²        D³
Mercury    0.24                    0.39              0.058     0.059
Venus      0.62                    0.72              0.38      0.39
Earth      1.00                    1.00              1.00      1.00
Mars       1.88                    1.52              3.53      3.51
Jupiter    11.86                   5.20              140.66    140.61
Saturn     29.46                   9.58              867.89    879.22
Uranus     84.32                   19.23             7109.86   7111.11

By finding patterns, the system is able to generalize & make predictions; thus, this is a form of learning
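A quick sketch checking the pattern on the table above and then using it predictively, as one could for Uranus (values as in the table; P in Earth years, D in AU):

```python
# Verify the pattern P^2 ≈ D^3 on the observed planets.
planets = {"Mercury": (0.24, 0.39), "Venus": (0.62, 0.72), "Earth": (1.00, 1.00),
           "Mars": (1.88, 1.52), "Jupiter": (11.86, 5.20), "Saturn": (29.46, 9.58)}

for name, (P, D) in planets.items():
    print(f"{name:8s} P^2/D^3 = {P**2 / D**3:.3f}")   # all close to 1

# Use the pattern to predict: from Uranus's observed period alone,
# infer its distance as D = P^(2/3).
P_uranus = 84.32
print("Uranus: D ≈", round(P_uranus ** (2 / 3), 2), "AU")   # ≈ 19.2
```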


Machine Learning
A General Description

Machine Learning Definition (paraphrased from Tom Mitchell [6])

A computer algorithm A is said to learn from data/experience D with respect to some class of tasks T & performance measure L, if its performance at tasks in T, as measured by L, improves with experience D.

The algorithm A is a learning algorithm

The data/experience D will generally be a dataset

The performance function L will generally be a statistical loss function

Throughout this course, we will consider a number of different learning tasks; among them are classification, regression, subspace estimation, outlier detection & clustering.


Part V

Learning Framework & Tasks


Machine Learning
A Mathematical Framework & Terminology

Input Space (X): space used to describe individual data items; e.g., the D-dimensional Euclidean space $\Re^D$

Output Space (Y): space of possible predictions

Dataset (D): indexable collection of data; i.e., the data consists of N items from X; each instance is a data point $x_i$ (with its output $y_i$)

Hypothesis/Estimator (f): object or function that represents the learned entity; f : X → Y

Hypothesis Space (F): set of all learnable hypotheses

Learning Algorithm (A): algorithm that selects a hypothesis f ∈ F based on data D; i.e., $A : X^N \to F$

Loss Function L(·, ·): a non-negative function that measures disagreement between its arguments


Machine Learning
A Simple Example

Consider the task of finding the center of a distribution on the reals, ℜ, given N numbers drawn from the distribution.

The dataset is $D = \{x_i\}_{i=1}^N$ where each data point $x_i \in \Re$.

The hypothesis space is the set of all possible centroids; i.e., F = ℜ

The mean is one possible centroid-estimating algorithm: $A(D) = \frac{1}{N} \sum_{i=1}^N x_i$

The median is a second centroid-estimating algorithm: $A(D) = \mathrm{median}(x_1, \ldots, x_N)$
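Both estimators in a short sketch on synthetic data, using NumPy; the corrupted-point comparison at the end shows one reason to prefer the median:

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(loc=3.0, scale=1.0, size=100)   # N = 100 draws, true center 3.0

mean_hat = D.sum() / len(D)    # A(D) = (1/N) sum_i x_i
median_hat = np.median(D)      # A(D) = median(x_1, ..., x_N)
print(mean_hat, median_hat)    # two hypotheses from F = R, both near 3.0

# The two algorithms differ in robustness: one corrupted point drags the
# mean far off but leaves the median essentially unchanged.
D[0] = 1000.0
print(D.sum() / len(D), np.median(D))
```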


Common Machine Learning Tasks
Types of Learning

Supervised Learning: pattern analysis in which training data contains paired examples of inputs $x_i$ & their corresponding outputs $y_i$

Examples: Regression, Binary/Multiclass Classification

Semi-Supervised Learning: pattern analysis in which training data contains both paired examples $(x_i, y_i)$ & unpaired examples $x_j$

Examples: Transduction, Ranking

Unsupervised Learning: pattern analysis in which training data contains only unpaired examples of inputs $x_i$

Examples: Anomaly Detection, Subspace Detection, Clustering


Common Machine Learning Tasks
Regression

Objective: find the relationship between (correlated) input variables x & output variables y

The dataset is $D = \{(x_i, y_i)\}_{i=1}^N$ where each data point is a pair of input variables $x_i \in \Re^D$ & the corresponding output $y_i$

The hypothesis space is the set of all functions from the input space X to the output space Y; i.e., F = {f | f : X → Y}

This hypothesis space is generally too large (to be discussed)

A common restriction is to consider just the set of linear mappings from X to Y, parametrized by w and b as $f(x) = w^\top x - b$ (a minimal sketch follows below)
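A minimal least-squares sketch of this restricted hypothesis space, using NumPy on synthetic data; the parametrization $f(x) = w^\top x - b$ follows the slide, and appending a constant −1 column folds b into a single linear solve:

```python
import numpy as np

rng = np.random.default_rng(2)
N, Ddim = 200, 3
X = rng.normal(size=(N, Ddim))                      # inputs x_i in R^D
w_true, b_true = np.array([1.5, -2.0, 0.5]), 0.7
y = X @ w_true - b_true + 0.1 * rng.normal(size=N)  # noisy y = w^T x - b

# Append a constant -1 column so f(x) = w^T x - b becomes one linear solve.
Xa = np.hstack([X, -np.ones((N, 1))])
params, *_ = np.linalg.lstsq(Xa, y, rcond=None)     # least-squares fit
w_hat, b_hat = params[:-1], params[-1]
print(w_hat, b_hat)   # recovers approximately w_true and b_true
```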


Common Machine Learning Tasks
Classification

Objective: find a separation between input variables $x_i$ based on the class $y_i$ of observed instances

The dataset is $D = \{(x_i, y_i)\}_{i=1}^N$ where each data point is a pair of input variables $x_i \in \Re^D$ & the corresponding output $y_i \in \{1, \ldots, K\}$

The hypothesis space is the set of all functions from the input space X to the label set; i.e., $F = \{f \mid f : X \to \{1, \ldots, K\}\}$

The case of binary classification (labels −1 & 1) can be addressed with regression, i.e., as the sign of a real-valued function (see the sketch below)
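A sketch of that reduction: fit a real-valued linear function to the ±1 labels by least squares, then classify by its sign. The data is synthetic and well separated; this is just the reduction the slide mentions, not a recommendation for how to train classifiers in general.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two Gaussian classes in R^2 with labels -1 and +1.
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

# Regress the labels on the inputs (constant column for the offset) ...
Xa = np.hstack([X, np.ones((100, 1))])
params, *_ = np.linalg.lstsq(Xa, y, rcond=None)

# ... and use the sign of the real-valued output as the class prediction.
pred = np.sign(Xa @ params)
print("training accuracy:", (pred == y).mean())   # near 1.0 on this easy data
```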


Common Machine Learning Tasks
Subspace Estimation

Objective: find a projection $P_X$ onto a subspace which "captures" the data; i.e., $P_X(x_i)$ has a small residual $\|P_X(x_i) - x_i\|$

The dataset is a set of points in X: $D = \{x_i\}_{i=1}^N$

The hypothesis space is the set of all subspace projections; i.e., $F = \{P_X \mid \forall x \in X,\; P_X(x) = P_X(P_X(x))\}$

When X is a Euclidean space, the subspace (and its projection) can be parametrized by a set of k ≤ D orthonormal basis vectors
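A sketch of the Euclidean case, assuming NumPy: the top k right singular vectors of the centered data supply the k ≤ D orthonormal basis vectors (this is PCA), and the resulting projection is idempotent as required.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic data lying near a 1-dimensional subspace of R^3.
t = rng.normal(size=(200, 1))
X = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(200, 3))

k = 1                                   # dimension of the fitted subspace
c = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - c, full_matrices=False)
B = Vt[:k]                              # k orthonormal basis vectors (rows)

def project(x):
    """P_X: project onto the fitted subspace; satisfies P_X(P_X(x)) = P_X(x)."""
    return c + (x - c) @ B.T @ B

residuals = np.linalg.norm(project(X) - X, axis=1)
print(residuals.mean())                 # small: the subspace captures the data
print(np.allclose(project(project(X)), project(X)))   # idempotence -> True
```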


Common Machine Learning Tasks
Clustering

Objective: find the underlying clusters (K) within the dataset; i.e., there is a latent label y that predicts structure

The dataset is a set of points in X: $D = \{x_i\}_{i=1}^N$

The hypothesis space is the set of all functions from the input space X to the cluster labels; i.e., $F = \{f \mid f : X \to \{1, \ldots, K\}\}$

The number of clusters (K) is often preselected
Assumptions are often made about the shape of the clusters
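A minimal k-means sketch (Lloyd's algorithm) in NumPy that illustrates both caveats: K is fixed in advance, and the synthetic clusters are spherical, matching the method's implicit shape assumption. Empty clusters simply keep their old center.

```python
import numpy as np

rng = np.random.default_rng(5)
# Three spherical clusters: the shape assumption k-means implicitly makes.
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in true_centers])

K = 3                                           # number of clusters, preselected
centers = X[rng.choice(len(X), size=K, replace=False)]
for _ in range(20):                             # Lloyd's iterations
    # Assign each point the latent label of its nearest center ...
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(axis=2), axis=1)
    # ... then move each center to the mean of its assigned points.
    centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(K)])

print(np.round(centers, 1))                     # close to the three true centers
```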


Part VI

General Challenges for Learning


Challenges for Machine Learning
Inductive Bias (see also [2, 7])

Inductive learning algorithms require an inductive bias

Without a bias, the number of possible hypotheses is too large and untenable

For a finite space X, the number of possible binary hypotheses is $2^{|X|}$

Suppose you maintain the set of all hypotheses consistent with your observations… For any new unseen instance, there will always be an equal number of hypotheses that predict that instance as positive & as negative!

Inductive Bias: "the set of assumptions that the learner uses to predict outputs given inputs that it has not encountered" [7]

Occam's Razor: prefer shorter/simpler hypotheses
Maximum Margin: prefer hypotheses with the largest margin (gap)
Minimum Features: only include significant features


Challenges for Machine Learning
Spurious Patterns

Underfitting: the inability to find significant patterns when overly-restrictive assumptions are made about the data or the hypothesis space is too small

Overfitting: finding spurious patterns when too few assumptions are made about the data or the hypothesis space is too large

[Figures: "codes" found in (left) the Bible & (right) War and Peace (see [5])]


Challenges for Machine Learning
Computational Efficiency

Learning algorithms should be able to (computationally) scale…
1. To large datasets
2. For quick predictions

Training efficiency is generally measured in the dataset size N:
1. Algorithms are efficient if their computational complexity is polynomial in N; i.e., O(N^a) for some fixed a ≥ 0
2. Algorithms are considered large-scale if their computational complexity is linear in N; i.e., O(N)
3. In some applications, it may not even be computationally feasible to look at every data point; these require sublinear or logarithmic complexity

Prediction efficiency is generally measured in the dataset size N and in the number of predictions M

Prediction should be sublinear in N


Summary

1. Machine learning (ML) is a relevant, popular topic with applications spanning many data-driven tasks (e.g., translation & recommender systems)

2. ML spans tasks in inductive reasoning; unlike classic AI, ML infers general patterns from specific samples

3. ML algorithms can be viewed as pattern analyzers; the patterns they find can be used to make predictions

4. Common tasks in ML include regression, classification, subspace discovery, & clustering

5. Learning algorithms face challenges including choosing an inductive bias, under- & over-fitting, & computational efficiency

6. Next Lecture: we will discuss a general approach to learning called kernel methods & show its application to regression


Bibliography I

[1] Laurent Charlin, Richard S. Zemel, and Craig Boutilier. A framework for optimizing paper matching. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI), pages 86–95, 2011.

[2] Diana F. Gordon and Marie desJardins. Evaluation and selection of biases in machine learning. Machine Learning, 20(1-2):5–22, 1995.

[3] Kevin Knight and Philipp Koehn. What's new in statistical machine translation. In HLT-NAACL, 2003. http://people.csail.mit.edu/people/koehn/publications/tutorial2003.pdf.

[4] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.

[5] Brendan McKay, Dror Bar-Natan, Maya Bar-Hillel, and Gil Kalai. Solving the Bible code puzzle. Statistical Science, 14:150–173, 1999.


Bibliography II

[6] Tom Mitchell. Machine Learning. McGraw Hill, 1997.

[7] Tom M. Mitchell. The need for biases in learning generalizations. Technical Report CBM-TR 5-110, Rutgers University, New Brunswick, NJ, 1980.

[8] Francesco Ricci, Lior Rokach, and Bracha Shapira. Introduction to recommender systems handbook. In Recommender Systems Handbook, pages 1–35. 2011.

[9] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 3rd edition, 2010.

[10] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
