Machine Learning for NLP


  • 1. Seminar: Statistical NLP, Girona, June 2003. Machine Learning for Natural Language Processing. Lluís Màrquez, TALP Research Center, Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya

2. Outline

  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP

3. Outline

  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP

4.

  • There are many general-purpose definitions of Machine Learning (or artificial learning):

Machine Learning ML4NLP

  • Learners are computers: we study learning algorithms
  • Resources are scarce: time, memory, data, etc.
  • It has (almost) nothing to do with: Cognitive science, neuroscience, theory of scientific discovery and research, etc.
  • Biological plausibility is welcome but not the main goal

Making a computer automatically acquire some kind of knowledge from a concrete data domain 5. Machine Learning

  • We will concentrate on:
    • Supervised inductive learning for classification
    • = discriminative learning
  • Learning... but what for?
    • To perform some particular task
    • To react to environmental inputs
    • Concept learning from data:
      • modelling concepts underlying data
      • predicting unseen observations
      • compacting the knowledge representation
      • knowledge discovery for expert systems

ML4NLP 6. Machine Learning

  • What to read?
    • Machine Learning (Mitchell, 1997)

A more precise definition: ML4NLP Obtaining a description of the concept in some representation language that explains the observations and helps predict new instances of the same distribution 7.

  • Lexical and structural ambiguity problems
    • Word selection (SR, MT)
    • Part-of-speech tagging
    • Semantic ambiguity (polysemy)
    • Prepositional phrase attachment
    • Reference ambiguity (anaphora)
    • etc.

Empirical NLP, 90's: Application of Machine Learning (ML) techniques to NLP problems ML4NLP

  • What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999)

Classification problems 8.

  • Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity resolution = classification

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

( The Wall Street Journal Corpus ) ML4NLP 9.

  • Morpho-syntactic ambiguity

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

NN VB JJ VB NN VB ( The Wall Street Journal Corpus ) ML4NLP 10.

  • Morpho-syntactic ambiguity: Part-of-Speech Tagging

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

NN VB JJ VB NN VB ( The Wall Street Journal Corpus ) ML4NLP 11.

  • Semantic (lexical) ambiguity

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

body-part / clock-part ( The Wall Street Journal Corpus ) ML4NLP 12.

  • Semantic (lexical) ambiguity: Word Sense Disambiguation

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

body-part / clock-part ( The Wall Street Journal Corpus ) ML4NLP 13.

  • Structural (syntactic) ambiguity

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

( The Wall Street Journal Corpus ) ML4NLP 14.

  • Structural (syntactic) ambiguity

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

( The Wall Street Journal Corpus ) ML4NLP 15.

  • Structural (syntactic) ambiguity: PP-attachment disambiguation

NLP classification problems

    • He was shot in the hand as he (chased (the robbers)_NP (in the back street)_PP )

( The Wall Street Journal Corpus ) ML4NLP 16. Outline

  • The Classification Problem
  • Three ML Algorithms in detail
  • Applications to NLP
  • Machine Learning for NLP

17. Feature Vector Classification

  • An instance is a vector x = < x1, ..., xn > whose components, called features (or attributes), are discrete or real-valued.
  • Let X be the space of all possible instances.
  • Let Y = { y1, ..., ym } be the set of categories (or classes).
  • The goal is to learn an unknown target function, f : X → Y
  • A training example is an instance x belonging to X, labelled with the correct value for f( x ), i.e., a pair < x, f( x ) >
  • Let D be the set of all training examples.
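As a concrete illustration of these definitions, here is a minimal sketch; the COLOR/SHAPE features, the labels and the rule are made up, anticipating the toy example of the next slides:

    # A training example is a pair <x, f(x)>: an instance (feature vector) plus its class.
    X_FEATURES = ("COLOR", "SHAPE")                        # components of an instance
    Y = {"positive", "negative"}                           # the set of categories

    D = [                                                  # training set D of pairs <x, f(x)>
        ({"COLOR": "red",  "SHAPE": "circle"},   "positive"),
        ({"COLOR": "blue", "SHAPE": "circle"},   "negative"),
        ({"COLOR": "red",  "SHAPE": "triangle"}, "negative"),
    ]

    def h(x):
        # one candidate definition of the target function, consistent with D
        return "positive" if x["COLOR"] == "red" and x["SHAPE"] == "circle" else "negative"

    assert all(h(x) == y for x, y in D)                    # h(x) = f(x) for every pair in D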

AI perspective Classification 18. Feature Vector Classification

  • The goal is to find a function h belonging to H such that for every pair < x, f( x ) > belonging to D, h( x ) = f( x )
  • The hypothesis space, H, is the set of functions h : X → Y that the learner can consider as possible definitions

Classification 19. An Example

Rules: (COLOR = red) ∧ (SHAPE = circle) → positive; otherwise → negative. [Figure: the equivalent decision tree, testing COLOR (red / blue) and then SHAPE (circle / triangle).]

20. An Example

Rules: (SIZE = small) ∧ (SHAPE = circle) → positive; (SIZE = big) ∧ (COLOR = red) → positive; otherwise → negative. [Figure: the equivalent decision tree, testing SIZE (small / big), then SHAPE or COLOR.]

21. Some important concepts

  • Inductive bias
  • Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias (Mooney & Cardie, 99)
    • Language / search bias

Classification [Figure: the COLOR/SHAPE decision tree from the earlier example.] 22.

  • Inductive bias
  • Training error and generalization error

Some important concepts

  • Generalization ability and overfitting
  • Batch learning vs. on-line learning
  • Symbolic vs. statistical learning
  • Propositional vs. first-order learning

Classification 23. Propositional vs. Relational Learning. Classification: color(red) ∧ shape(circle) → classA

  • Propositional learning

course(X) ∧ person(Y) ∧ link_to(Y,X) → instructor_of(X,Y)
research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) → member_proj(X,Z)

  • Relational learning = ILP (induction of logic programs)

24. The Classification Setting Class, Point, Example, Data Set, ...

  • Input Space: X ⊆ R^n
  • (binary) Output Space: Y = { +1, -1 }
  • A point, pattern or instance: x ∈ X, x = ( x1, x2, ..., xn )
  • Example: ( x, y ), with x ∈ X, y ∈ Y
  • Training Set: a set of m examples generated i.i.d. according to an unknown distribution P( x, y ): S = {( x1, y1 ), ..., ( xm, ym )} ∈ ( X × Y )^m

Classification CoLT/SLT perspective 25. The Classification Setting Learning, Error, ...

  • The hypothesis space, H, is the set of functions h : X → Y that the learner can consider as possible definitions. In SVMs they are of the form h( x ) = sign( ⟨ w, x ⟩ + b )
  • The goal is to find a function h belonging to H such that the expected misclassification error on new examples, also drawn from P( x, y ), is minimal (Risk Minimization, RM)

Classification 26. The Classification Setting Learning, Error, ...

  • Expected error (risk)
  • Problem: P itself is unknown. Only the training examples are known, so an induction principle is needed
  • Empirical Risk Minimization (ERM): Find the function h belonging to H for which the training error (empirical risk) is minimal
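In symbols, with the 0/1 loss (a standard formulation consistent with these definitions; the slide's own formulas are shown as images and are not reproduced in this transcript):

    R(h) = \int L\bigl(h(\mathbf{x}), y\bigr)\, dP(\mathbf{x}, y),
    \qquad
    R_{\mathrm{emp}}(h) = \frac{1}{m} \sum_{i=1}^{m} L\bigl(h(\mathbf{x}_i), y_i\bigr),
    \qquad
    L\bigl(h(\mathbf{x}), y\bigr) =
    \begin{cases}
      0 & \text{if } h(\mathbf{x}) = y \\
      1 & \text{otherwise}
    \end{cases}

    \text{ERM:} \quad \hat{h} = \arg\min_{h \in H} R_{\mathrm{emp}}(h)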

Classification 27. The Classification Setting Error, Over(under)fitting,...

  • Low training error ⇒ low true error?
  • The overfitting dilemma:
  • Trade-off between training error and complexity
  • Different learning biases can be used

(Müller et al., 2001) Classification [Figure: underfitting vs. overfitting] 28. Outline

  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP
  • Machine Learning for NLP

29. Outline

  • The Classification Problem
  • Three ML Algorithms
    • Decision Trees
    • AdaBoost
    • Support Vector Machines
  • Applications to NLP
  • Machine Learning for NLP

30. Learning Paradigms

  • Statistical learning:
    • HMM, Bayesian Networks, ME, CRF, etc.
  • Traditional methods from Artificial Intelligence (ML, AI)
    • Decision trees/lists, exemplar-based learning, rule induction, neural networks, etc.
  • Methods from Computational Learning Theory (CoLT/SLT)
    • Winnow, AdaBoost, SVMs, etc.

Algorithms 31. Learning Paradigms

  • Classifier combination:
    • Bagging, Boosting, Randomization, ECOC, Stacking, etc.
  • Semi-supervised learning : learning from labelled and unlabelled examples
    • Bootstrapping, EM, Transductive learning (SVMs, AdaBoost), Co-Training, etc.
  • etc.

Algorithms 32. Decision Trees

  • Decision trees are a way to represent rules underlying training data, with hierarchical structures that recursively partition the data.
  • They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration, with some of the following purposes: description, classification, and generalization.
  • From a machine-learning perspective: Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes
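As a minimal illustration of this view (the nested-dictionary representation and the toy values are assumptions, matching the earlier COLOR/SHAPE example):

    # A decision tree as nested tests; each root-to-leaf path is one classification rule.
    tree = {"feature": "COLOR",
            "branches": {"blue": "negative",                      # COLOR=blue -> negative
                         "red": {"feature": "SHAPE",
                                 "branches": {"circle": "positive",
                                              "triangle": "negative"}}}}

    def classify(tree, x):
        while isinstance(tree, dict):                             # follow the branch chosen by x
            tree = tree["branches"][x[tree["feature"]]]
        return tree                                               # a leaf is a class label

    print(classify(tree, {"COLOR": "red", "SHAPE": "circle"}))    # positive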

Algorithms 33. Decision Trees

  • Acquisition: Top-Down Induction of Decision Trees (TDIDT)
  • Systems:
  • CART (Breiman et al. 84),
  • ID3, C4.5, C5.0 (Quinlan 86, 93, 98),
  • ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95)
  • etc.

Algorithms 34. An Example

[Figure: a generic n-ary decision tree with test nodes A1, A2, A3, A5, branch values v1 ... v7 and leaf classes C1, C2, C3, next to the SIZE/COLOR/SHAPE decision tree of the earlier toy example.]

Algorithms 35. Learning Decision Trees

Training: Training Set + TDIDT → DT. Test: Example + DT → Class.

Algorithms 36. General Induction Algorithm

    function TDIDT (X: set-of-examples; A: set-of-features)
    var
      tree1, tree2: decision-tree;
      X': set-of-examples;
      A': set-of-features
    end-var
    if (stopping_criterion(X)) then
      tree1 := create_leaf_tree(X)
    else
      a_max := feature_selection(X, A);
      tree1 := create_tree(X, a_max);
      for-all val in values(a_max) do
        X' := select_examples(X, a_max, val);
        A' := A - {a_max};
        tree2 := TDIDT(X', A');
        tree1 := add_branch(tree1, tree2, val)
      end-for
    end-if
    return (tree1)
    end-function

(Slide 37 repeats the same induction algorithm.)

Algorithms 38. Feature Selection Criteria

  • Functions derived from Information Theory:
    • Information Gain, Gain Ratio (Quinlan 86)
  • Functions derived from Distance Measures:
    • Gini Diversity Index (Breiman et al. 84)
    • RLM (López de Mántaras 91)
  • Statistically-based:
    • Chi-square test (Sestito & Dillon 94)
    • Symmetrical Tau (Zhou & Dillon 91)
  • ReliefF-IG: a variant of ReliefF (Kononenko 94)
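Putting slides 36 and 38 together, here is a minimal runnable sketch of TDIDT using Information Gain as the feature_selection criterion. The stopping criterion, helper names and toy data are assumptions for illustration; the learned tree reuses the nested-dictionary representation of the earlier sketch:

    # Minimal TDIDT sketch with Information Gain feature selection (categorical features).
    from collections import Counter
    from math import log2

    def entropy(examples):
        counts = Counter(y for _, y in examples)
        total = len(examples)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def information_gain(examples, feature):
        total = len(examples)
        remainder = 0.0
        for val in {x[feature] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[feature] == val]
            remainder += len(subset) / total * entropy(subset)
        return entropy(examples) - remainder

    def tdidt(examples, features):
        labels = [y for _, y in examples]
        majority = Counter(labels).most_common(1)[0][0]
        if len(set(labels)) == 1 or not features:          # stopping_criterion
            return majority                                # create_leaf_tree
        a_max = max(features, key=lambda a: information_gain(examples, a))
        tree = {"feature": a_max, "branches": {}}          # create_tree
        for val in {x[a_max] for x, _ in examples}:        # for-all val in values(a_max)
            subset = [(x, y) for x, y in examples if x[a_max] == val]
            tree["branches"][val] = tdidt(subset, features - {a_max})
        return tree

    data = [({"SIZE": "big",   "COLOR": "red",  "SHAPE": "circle"},   "pos"),
            ({"SIZE": "big",   "COLOR": "blue", "SHAPE": "circle"},   "neg"),
            ({"SIZE": "small", "COLOR": "red",  "SHAPE": "circle"},   "pos"),
            ({"SIZE": "small", "COLOR": "blue", "SHAPE": "triangle"}, "neg")]
    print(tdidt(data, {"SIZE", "COLOR", "SHAPE"}))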

Algorithms 39. Extensions of DTs

  • Pruning (pre/post)
  • Minimize the effect of the greedy approach: lookahead
  • Non-linear splits
  • Combination of multiple models
  • Incremental learning (on-line)
  • etc.

(Murthy 95) Algorithms 40. Decision Trees and NLP

  • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
  • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95, 97; Màrquez et al. 00)
  • Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
  • Parsing (Magerman 95, 96; Haruno et al. 98, 99)
  • Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
  • Text summarization (Mani & Bloedorn 98)
  • Dialogue act tagging (Samuel et al. 98)

Algorithms 41. Decision Trees and NLP

  • Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95)
  • Discourse analysis in information extraction (Soderland & Lehnert 94)
  • Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)
  • Verb classification in Machine Translation (Tanaka 96; Siegel 97)

Algorithms 42. Decision Trees: pros&cons

  • Advantages
    • Acquires symbolic knowledge in an understandable way
    • Very well studied ML algorithms and variants
    • Can be easily translated into rules
    • Existence of available software: C4.5, C5.0, etc.
    • Can be easily integrated into an ensemble

Algorithms 43. Decision Trees: pros&cons

  • Drawbacks
    • Computationally expensive when scaling to large natural language domains: training examples, features, etc.
    • Data sparseness and data fragmentation: the problem of the small disjuncts => probability estimation
    • DTs are a model with high variance (unstable)
    • Tendency to overfit training data: pruning is necessary
    • Requires quite a big effort in tuning the model

Algorithms 44. Boosting algorithms

  • Idea
  • to combine many simple and moderately accurate hypotheses (weak classifiers) into a single, highly accurate classifier
  • AdaBoost (Freund & Schapire 95) has been studied extensively, both theoretically and empirically
  • Many other variants and extensions (1997-2003)
  • http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html
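A minimal sketch of the combination scheme shown on the next slides (binary labels in {-1, +1}, decision stumps as the weak learner; the helper names, stump learner and toy data are assumptions for illustration, not the original notation of Freund & Schapire):

    # Minimal AdaBoost sketch with decision stumps as weak hypotheses.
    import math

    def stump_learner(X, y, D):
        """Pick the feature/threshold/sign with smallest weighted training error."""
        best = None
        for j in range(len(X[0])):
            for thr in sorted({x[j] for x in X}):
                for sign in (+1, -1):
                    pred = [sign if x[j] >= thr else -sign for x in X]
                    err = sum(d for d, p, yi in zip(D, pred, y) if p != yi)
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        return err, (lambda x, j=j, thr=thr, sign=sign: sign if x[j] >= thr else -sign)

    def adaboost(X, y, T=10):
        m = len(X)
        D = [1.0 / m] * m                        # initial uniform distribution D_1
        ensemble = []                            # pairs (alpha_t, h_t)
        for _ in range(T):
            eps, h = stump_learner(X, y, D)      # weak hypothesis h_t, weighted error eps_t
            eps = max(eps, 1e-10)
            if eps >= 0.5:
                break
            alpha = 0.5 * math.log((1 - eps) / eps)
            ensemble.append((alpha, h))
            # probability distribution updating: raise the weight of misclassified examples
            D = [d * math.exp(-alpha * yi * h(x)) for d, x, yi in zip(D, X, y)]
            Z = sum(D)
            D = [d / Z for d in D]
        # final classifier: sign of the linear combination F(h_1, ..., h_T)
        return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

    # Toy usage: points separable by axis-parallel stumps.
    X = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.5), (4.0, 4.0)]
    y = [-1, -1, +1, +1]
    clf = adaboost(X, y, T=5)
    print([clf(x) for x in X])

Each round reweights the training distribution so that examples misclassified by h_t receive more weight in round t+1; the final decision is the sign of the weighted linear combination F( h_1, h_2, ..., h_T ).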

Algorithms 45. AdaBoost: general scheme

TRAINING: [Figure: training sets TS1 ... TST, drawn under distributions D1 ... DT, are fed to a Weak Learner, producing weak hypotheses h1 ... hT; a probability-distribution updating step links the rounds; the weak hypotheses are combined linearly into F( h1, h2, ..., hT ), which is used at TEST time.]

Algorithms 46. AdaBoost: algorithm (Freund & Schapire 97) [Figure: pseudocode of the algorithm.]

Algorithms 47. AdaBoost: example. Weak hypotheses = vertical/horizontal hyperplanes. 48-50. AdaBoost: rounds 1-3 [Figures]. 51. Combined Hypothesis [Figure]. www.research.att.com/~yoav/adaboost

52. AdaBoost and NLP

  • POS Tagging (Abney et al. 99; Màrquez 99)
  • Text and Speech Categorization (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99)
  • PP-attachment Disambiguation (Abney et al. 99)
  • Parsing (Haruno et al. 99)
  • Word Sense Disambiguation (Escudero et al. 00, 01)
  • Shallow parsing (Carreras & Màrquez, 01a; 02)
  • Email spam filtering (Carreras & Màrquez, 01b)
  • Term Extraction (Vivaldi et al. 01)

Algorithms 53. AdaBoost: pros&cons Algorithms

  • Easy to implement and few parameters to set
  • Time and space grow linearly with number of examples. Ability to manage very large learning problems
  • Does not constrain explicitly the complexity of the learner
  • Naturally combines feature selection with learning
  • Has been successfully applied to many practical problems

54. AdaBoost: pros&cons

  • Seems to be rather robust to overfitting (number of rounds), but sensitive to noise
  • Performance is very good when there are relatively few relevant terms (features)
  • Can perform poorly when there is insufficient training data relative to the complexity of the base classifiers, or when the training errors of the base classifiers become too large too quickly

Algorithms 55.

  • Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory. (Cristianini & Shawe-Taylor, 2000)

Algorithms SVM: A General Definition 56. SVM: A General Definition

  • Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory. (Cristianini & Shawe-Taylor, 2000)

Key Concepts Algorithms 57. Linear Classifiers

  • Hyperplanes in R^N.
  • Defined by a weight vector ( w ) and a threshold ( b ).
  • They induce a classification rule:
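A minimal sketch of that rule (the vector w and threshold b below are arbitrary illustrative values):

    # Classification rule induced by a hyperplane (w, b): h(x) = sign(<w, x> + b)
    def linear_classify(w, b, x):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        return +1 if score >= 0 else -1

    print(linear_classify([2.0, -1.0], -0.5, [1.0, 1.0]))   # 2 - 1 - 0.5 = 0.5 -> +1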

Algorithms [Figure: positive (+) and negative (-) points separated by a hyperplane with weight vector w.] 58-59. Optimal Hyperplane: Geometric Intuition [Figures: the maximal margin hyperplane; the points lying on the margin are the support vectors.] 60. Linearly separable data [Figure: quadratic programming formulation.] Algorithms 61. Non-separable case (soft margin) [Figure.] Algorithms 62. Non-linear SVMs

  • Implicit mapping into feature space via kernel functions

Algorithms [Figure: the non-linear mapping, the resulting set of hypotheses, the dual formulation, the kernel function and its evaluation.] 63. Non-linear SVMs

  • Kernel functions
    • Must be efficiently computable
    • Characterization via Mercer's theorem
    • One of the curious facts about using a kernel is that we do not need to know the underlying feature map in order to be able to learn in the feature space! (Cristianini & Shawe-Taylor, 2000)
    • Examples: polynomials, Gaussian radial basis functions, two-layer sigmoidal neural networks, etc.
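Two of those kernels as a minimal sketch (the degree, c and gamma values are arbitrary defaults, not taken from the slides):

    # K(x, z) implicitly computes an inner product in a feature space
    # without ever constructing the feature map.
    import math

    def polynomial_kernel(x, z, degree=3, c=1.0):
        return (sum(xi * zi for xi, zi in zip(x, z)) + c) ** degree

    def rbf_kernel(x, z, gamma=0.5):
        return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

    print(polynomial_kernel((1.0, 2.0), (0.5, -1.0)))   # (1*0.5 + 2*(-1) + 1)^3
    print(rbf_kernel((1.0, 2.0), (0.5, -1.0)))          # exp(-0.5 * ||x - z||^2)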

Algorithms 64. Non-linear SVMs [Figure: a degree-3 polynomial kernel on a linearly separable and on a linearly non-separable data set.] Algorithms 65. Toy Examples

  • All examples have been run with the 2D graphic interface of LIBSVM (Chang and Lin, National Taiwan University)
  • LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMlight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs
  • Available from: www.csie.ntu.edu.tw/~cjlin/libsvm (it includes a Web-integrated demo tool)
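LIBSVM's demo is graphical; as a non-graphical stand-in, here is a small sketch using scikit-learn's SVC class (which wraps LIBSVM) to explore the effect of the C parameter illustrated on the next slides. The data set is invented for illustration:

    # Effect of the C parameter (margin vs. training-error trade-off) with a linear SVM.
    from sklearn.svm import SVC

    X = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0], [1.2, 1.2]]   # last point: an outlier
    y = [-1, -1, +1, +1, +1]

    for C in (100.0, 0.01):          # high C favours fitting every example; low C, a wider margin
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(C, "support vectors per class:", clf.n_support_,
              "prediction for the outlier:", clf.predict([[1.2, 1.2]]))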

Algorithms 66. Toy Examples (I): linearly separable data set, linear SVM, maximal margin hyperplane. What happens if we add a blue training example here? 67. Toy Examples (I): (still) linearly separable data set, linear SVM, high value of the C parameter: maximal margin hyperplane, the example is correctly classified. 68. Toy Examples (I): (still) linearly separable data set, linear SVM, low value of the C parameter: trade-off between margin and training error, the example is now a bounded SV. 69-71. Toy Examples (II) [Figures]. 72. Toy Examples (III) [Figure]. Algorithms 73. SVM: Summary

  • SVMs were introduced at COLT-92 (Boser, Guyon & Vapnik, 1992). Great development since then
  • Kernel-induced feature spaces: SVMs work efficiently in very high dimensional feature spaces ( + )
  • Learning bias: maximal margin optimisation. Reduces the danger of overfitting. Generalization bounds for SVMs ( + )
  • Compact representation of the induced hypothesis. The solution is sparse in terms of SVs ( + )

Algorithms 74. SVM: Summary

  • Due to Mercer's conditions on the kernels, the optimisation problems are convex. No local minima ( + )
  • Optimisation theory guides the implementation. Efficient learning ( + )
  • Mainly for classification but also for regression, density estimation, clustering, etc.
  • Success in many real-world applications: OCR, vision, bioinformatics, speech recognition, NLP: TextCat, POS tagging, chunking, parsing, etc. ( + )
  • Parameter tuning ( - ). Implications for convergence times, sparsity of the solution, etc.

Algorithms 75. Outline

  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP
  • Machine Learning for NLP

76. NLP problems Applications

  • Warning! We will not focus on final NLP applications, but on intermediate tasks...
  • We will classify the NLP tasks according to their (structural) complexity

77. NLP problems: structural complexity Applications

  • Decisional problems
    • Text Categorization, Document filtering, Word Sense Disambiguation, etc.
  • Sequence tagging and detection of sequential structures
    • POS tagging, Named Entity extraction, syntactic chunking, etc.
  • Hierarchical structures
    • Clause detection, full parsing, IE of complex concepts, composite Named Entities, etc.

78.

  • Morpho-syntactic ambiguity: Part-of-Speech Tagging

POS tagging

    • He was shot in the hand as he chased the robbers in the back street

NN VB JJ VB NN VB ( The Wall Street Journal Corpus ) Applications

79. POS tagging

Applications [Figure: a decision tree for the preposition-adverb ambiguity of the word form "As"/"as": at the root P(IN)=0.81, P(RB)=0.19; after testing the word form P(IN)=0.83, P(RB)=0.17; after testing tag(+1) P(IN)=0.13, P(RB)=0.87; after testing tag(+2) P(IN)=0.013, P(RB)=0.987.] Probabilistic interpretation: P( RB | word="as", tag(+1)=RB, tag(+2)=IN ) = 0.987; P( IN | word="as", tag(+1)=RB, tag(+2)=IN ) = 0.013

80. POS tagging

Collocations: as_RB much_RB as_IN; as_RB well_RB as_IN; as_RB soon_RB as_IN [Figure: the same preposition-adverb tree.] Applications

81. POS tagging

RTT (Màrquez & Rodríguez 97): [Figure: raw text → morphological analysis → disambiguation loop (classify, update, filter) driven by a language model, repeated until a stopping condition → tagged text.] Applications. A Sequential Model for Multi-class Classification: NLP/POS Tagging (Even-Zohar & Roth, 01)

82. POS tagging

STT (Màrquez & Rodríguez 97): [Figure: raw text → morphological analysis → disambiguation with the Viterbi algorithm using a language model (lexical probs. + contextual probs.) → tagged text.] Applications. The Use of Classifiers in Sequential Inference: Chunking (Punyakanok & Roth, 00)

83. Detection of sequential and hierarchical structures

  • Named Entity recognition
  • Clause detection

Applications 84. Summary/conclusions

  • We have briefly outlined:
    • The ML setting: supervised learning for classification
    • Three concrete machine learning algorithms
    • How to apply them to solve intermediate NLP tasks

Conclusions 85.

  • Any ML algorithm for NLP should be:
    • Robust to noise and outliers
    • Efficient in large feature/example spaces
    • Adaptive to new/changing domains: portability, tuning, etc.
    • Able to take advantage of unlabelled examples: semi-supervised learning

Conclusions Summary/conclusions 86. Summary/conclusions

  • Statistical and ML-based Natural Language Processing is a very active and multidisciplinary area of research

Conclusions 87. Some current research lines

  • Appropriate learning paradigm for all kinds of NLP problems: TiMBL (DBZ 99), TBEDL (Brill 95), ME (Ratnaparkhi 98), SNoW (Roth 98), CRF (Pereira & Singer 02).
  • Definition of an adequate (and task-specific) feature space: mapping from the input space to a high dimensional feature space, kernels, etc.
  • Resolution of complex NLP problems: inference with classifiers + constraint satisfaction
  • etc.

Conclusions 88. Bibliography

  • You may find additional information at:
    • http://www.lsi.upc.es/~lluism/
    • tesi.html
    • publicacions/pubs.html
    • cursos/talks.html
    • cursos/MLandNL.html
    • cursos/emnlp1.html
  • This talk at:
  • http://www.lsi.upc.es/~lluism/udg03.ppt.gz

Conclusions 89. Seminar: Statistical NLP, Girona, June 2003. Machine Learning for Natural Language Processing. Lluís Màrquez, TALP Research Center, Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya