1. Seminar: Statistical NLP, Girona, June 2003. Machine Learning
for Natural Language Processing. Lluís Màrquez, TALP Research
Center, Llenguatges i Sistemes Informàtics, Universitat Politècnica
de Catalunya
2. Outline
- The Classification Problem
3. Outline
- The Classification Problem
4.
- There are many general-purpose definitions of Machine Learning
(or artificial learning):
Machine Learning ML4NLP
- Learners are computers: we study learning algorithms
- Resources are scarce: time, memory, data, etc.
- It has (almost) nothing to do with: Cognitive science,
neuroscience, theory of scientific discovery and research,
etc.
- Biological plausibility is welcome but not the main goal
Making a computer automatically acquire some kind of knowledge
from a concrete data domain 5. Machine Learning
- Supervised inductive learning for classification
- Learning... but what for?
- To perform some particular task
- To react to environmental inputs
- Concept learning from data:
- modelling concepts underlying data
- predicting unseen observations
- compacting the knowledge representation
- knowledge discovery for expert systems
ML4NLP 6. Machine Learning
- Machine Learning (Mitchell, 1997)
A more precise definition: ML4NLP Obtaining a description of the
concept in some representation language that explains the observations
and helps predict new instances of the same distribution 7.
- Lexical and structural ambiguity problems
- Semantic ambiguity (polysemy)
- Prepositional phrase attachment
- Reference ambiguity (anaphora)
Empirical NLP, 90s: application of Machine Learning (ML)
techniques to NLP problems ML4NLP
- What to read? Foundations of Statistical Natural Language
Processing (Manning & Schütze, 1999)
Classification problems 8.
- Ambiguity is a crucial problem for natural language
understanding/processing. Ambiguity Resolution = Classification
NLP classification problems
- He was shot in the hand as he chased the robbers in the back
street
(The Wall Street Journal Corpus) ML4NLP 9.
- Morpho-syntactic ambiguity
NLP classification problems
- He was shot in the hand as he chased the robbers in the back
street (the ambiguous words highlighted on the slide are shot, hand, chased)
NN VB JJ VB NN VB (The Wall Street Journal Corpus) ML4NLP
10.
- Morpho-syntactic ambiguity: Part of Speech Tagging
NLP classification problems
- He was shot in the hand as he chased the robbers in the back
street
NN VB JJ VB NN VB (The Wall Street Journal Corpus) ML4NLP
11.
- Semantic (lexical) ambiguity
NLP classification problems
- He was shot in the hand as he chased the robbers in the back
street (the ambiguous word highlighted on the slide is hand)
body-part vs. clock-part (The Wall Street Journal Corpus) ML4NLP
12.
- Semantic (lexical) ambiguity: Word Sense Disambiguation
NLP classification problems
- He was shot in the hand as he chased the robbers in the back
street
body-part vs. clock-part (The Wall Street Journal Corpus) ML4NLP
13.
- Structural (syntactic) ambiguity
NLP classification problems
- He was shot in the hand as he chased the robbers in the back
street (the fragment highlighted on the slide is "in the back street")
(The Wall Street Journal Corpus) ML4NLP 14.
- Structural (syntactic) ambiguity
NLP classification problems
- He was shot in the hand as he chased the robbers in the back
street (the fragments highlighted on the slide are "was shot" and "chased the robbers")
(The Wall Street Journal Corpus) ML4NLP 15.
- Structural (syntactic) ambiguity: PP-attachment disambiguation
NLP classification problems
- He was shot in the hand as he (chased (the robbers)_NP (in the
back street)_PP)
( The Wall Street Journal Corpus ) ML4NLP 16. Outline
- The Classification Problem
- Three ML Algorithms in detail
17. Feature Vector Classification
- An instance is a vector x = <x_1, ..., x_n> whose components,
called features (or attributes), are discrete or real-valued.
- Let X be the space of all possible instances.
- Let Y = {y_1, ..., y_m} be the set of categories (or classes).
- The goal is to learn an unknown target function, f : X → Y
- A training example is an instance x belonging to X, labelled with
the correct value for f(x), i.e., a pair <x, f(x)>
- Let D be the set of all training examples.
AI perspective Classification 18. Feature Vector
Classification
- The goal is to find a function h belonging to H such that for every
pair <x, f(x)> belonging to D, h(x) = f(x)
- The hypotheses space, H, is the set of functions h : X → Y that the
learner can consider as possible definitions
Classification 19. An Example
Classification Rules: (COLOR = red) ∧ (SHAPE = circle) → positive; otherwise → negative
[Decision tree: root COLOR (red / blue); the red branch tests SHAPE (circle → positive, triangle → negative); the blue branch → negative]
20. An Example
Classification Rules: (SIZE = small) ∧ (SHAPE = circle) → positive; (SIZE = big) ∧ (COLOR = red) → positive; otherwise → negative
[Decision tree: root SIZE (small / big); the small branch tests SHAPE (circle → pos, triang → neg); the big branch tests COLOR (red → pos, blue → neg)]
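As an illustration, the rules of slide 20 can be written directly as code; this is a hypothetical sketch, not part of the original slides:

```python
# The classification rules of slide 20 as an explicit decision procedure.
def classify(size, shape, color):
    if size == "small":
        return "positive" if shape == "circle" else "negative"
    else:  # size == "big"
        return "positive" if color == "red" else "negative"

print(classify("small", "circle", "blue"))  # positive
print(classify("big", "triangle", "red"))   # positive
print(classify("big", "circle", "blue"))    # negative
```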
21. Some important concepts
- Any means that a classification learning system uses to choose
between two functions that are both consistent with the training
data is called inductive bias (Mooney & Cardie, 99)
Classification [Decision tree from slide 19 shown again] 22.
- Training error and generalization error
Some important concepts
- Generalization ability and overfitting
- Batch learning vs. on-line learning
- Symbolic vs. statistical learning
- Propositional vs. first-order learning
Classification 23. Propositional vs. Relational Learning
Classification
color(red) ∧ shape(circle) → classA
course(X) ∧ person(Y) ∧ link_to(Y,X) → instructor_of(X,Y)
research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z)
∧ neighbour_word_people(L1) → member_proj(X,Z)
- Relational learning = ILP (induction of logic programs)
24. The Classification Setting Class, Point, Example, Data Set,
...
- (binary) Output Space: Y = {+1, -1}
- A point, pattern or instance: x ∈ X, x = (x_1, x_2, ..., x_n)
- Example: (x, y) with x ∈ X, y ∈ Y
- Training Set: a set of m examples generated i.i.d. according to
an unknown distribution P(x, y): S = {(x_1, y_1), ..., (x_m, y_m)} ∈ (X × Y)^m
Classification CoLT/SLT perspective 25. The Classification
Setting Learning, Error, ...
- The hypotheses space, H, is the set of functions h : X → Y that the
learner can consider as possible definitions. In SVMs these are of the
form:
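The formula on the original slide is not preserved in this text version; as a hedged sketch, the SVM hypotheses referred to are linear functions in a (possibly kernel-induced) feature space with feature map φ:

h(\mathbf{x}) = \operatorname{sign}\bigl(\langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b\bigr)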
- The goal is to find a function h belonging to H such that the
expected misclassification error on new examples, also drawn from
P(x, y), is minimal (Risk Minimization, RM)
Classification 26. The Classification Setting Learning, Error,
...
- Problem: P itself is unknown. Known are only training
examples → an induction principle is needed
- Empirical Risk Minimization (ERM): find the function h belonging
to H for which the training error (empirical risk) is minimal
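The risk formulas shown on the slide are not reproduced in this transcript; the standard definitions they refer to are, as a sketch (0/1 loss):

R(h) = \int \mathbb{1}[h(\mathbf{x}) \ne y] \, dP(\mathbf{x}, y)
\qquad
R_{\mathrm{emp}}(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[h(\mathbf{x}_i) \ne y_i]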
Classification 27. The Classification Setting Error,
Over(under)fitting, ...
- Low training error → low true error?
- Trade-off between training error and complexity
- Different learning biases can be used
(Müller et al., 2001) Classification [Figure: underfitting vs. overfitting]
28. Outline
- The Classification Problem
29. Outline
- The Classification Problem
30. Learning Paradigms
- HMM, Bayesian Networks, ME, CRF, etc.
- Traditional methods from Artificial Intelligence (ML, AI)
- Decision trees/lists, exemplar-based learning, rule induction,
neural networks, etc.
- Methods from Computational Learning Theory (CoLT/SLT)
- Winnow, AdaBoost, SVMs, etc.
Algorithms 31. Learning Paradigms
- Bagging, Boosting, Randomization, ECOC, Stacking, etc.
- Semi-supervised learning: learning from labelled and
unlabelled examples
- Bootstrapping, EM, Transductive learning (SVMs, AdaBoost),
Co-Training, etc.
Algorithms 32. Decision Trees
- Decision trees are a way to represent rules underlying training
data, with hierarchical structures that recursively partition the
data.
- They have been used by many research communities (Pattern
Recognition, Statistics, ML, etc.) for data exploration, with some
of the following purposes: Description, Classification, and
Generalization.
- From a machine-learning perspective: Decision Trees are n-ary
branching trees that represent classification rules for classifying
the objects of a certain domain into a set of mutually
exclusive classes
Algorithms 33. Decision Trees
- Acquisition: Top-Down Induction of Decision Trees (TDIDT)
- CART (Breiman et al. 84),
- ID3, C4.5, C5.0 (Quinlan 86, 93, 98),
- ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95)
Algorithms 34. An Example
[Figure: a generic decision tree over attributes A1-A5 with branch values
v1-v7 and leaf classes C1-C3, together with the SIZE/SHAPE/COLOR tree of
slide 20] 35.
Learning Decision Trees Algorithms
[Diagram. Training: Training Set → TDIDT → DT. Test: DT + Example → Class.]
36. General Induction Algorithm
Algorithms
function TDIDT (X: set-of-examples; A: set-of-features)
var
  tree1, tree2: decision-tree; X': set-of-examples; A': set-of-features
end-var
if (stopping_criterion(X)) then
  tree1 := create_leaf_tree(X)
else
  a_max := feature_selection(X, A);
  tree1 := create_tree(X, a_max);
  for-all val in values(a_max) do
    X' := select_examples(X, a_max, val);
    A' := A - {a_max};
    tree2 := TDIDT(X', A');
    tree1 := add_branch(tree1, tree2, val)
  end-for
end-if
return(tree1)
end-function
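A minimal Python sketch of the TDIDT scheme above, assuming an ID3-style setup: information gain as feature_selection, label purity as stopping_criterion, and the majority class at the leaves. The toy data set echoes slide 20; none of this code comes from the original slides.

```python
# Hypothetical ID3-style TDIDT sketch.
from collections import Counter
import math

def entropy(examples):
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, feature):
    by_value = {}
    for x, y in examples:
        by_value.setdefault(x[feature], []).append((x, y))
    remainder = sum(len(sub) / len(examples) * entropy(sub) for sub in by_value.values())
    return entropy(examples) - remainder

def tdidt(examples, features):
    labels = {y for _, y in examples}
    if len(labels) == 1 or not features:                                   # stopping_criterion
        return Counter(y for _, y in examples).most_common(1)[0][0]       # create_leaf_tree
    a_max = max(features, key=lambda a: information_gain(examples, a))    # feature_selection
    tree = {a_max: {}}                                                     # create_tree
    for val in {x[a_max] for x, _ in examples}:                            # for-all val in values(a_max)
        subset = [(x, y) for x, y in examples if x[a_max] == val]          # select_examples
        tree[a_max][val] = tdidt(subset, features - {a_max})               # recursion + add_branch
    return tree

# Toy data set in the spirit of slide 20: small circles and big red objects are positive.
data = [({"SIZE": "small", "SHAPE": "circle", "COLOR": "red"},  "pos"),
        ({"SIZE": "small", "SHAPE": "triang", "COLOR": "blue"}, "neg"),
        ({"SIZE": "big",   "SHAPE": "circle", "COLOR": "red"},  "pos"),
        ({"SIZE": "big",   "SHAPE": "circle", "COLOR": "blue"}, "neg")]
print(tdidt(data, {"SIZE", "SHAPE", "COLOR"}))
```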
37. General Induction Algorithm
Algorithms (the same TDIDT pseudocode is repeated on this slide)
38. Feature Selection Criteria
- Functions derived from Information Theory:
- Information Gain, Gain Ratio (Quinlan 86)
- Functions derived from Distance Measures
- Gini Diversity Index (Breiman et al. 84)
- Chi-square test (Sestito & Dillon 94)
- Symmetrical Tau (Zhou & Dillon 91)
- RELIEFF-IG: a variant of RELIEFF (Kononenko 94)
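For reference (this formula is not on the original slide), the Information Gain of a feature a on an example set X is the standard entropy-reduction criterion:

\mathrm{Gain}(X, a) = H(X) - \sum_{v \in \mathrm{values}(a)} \frac{|X_v|}{|X|}\, H(X_v),
\qquad
H(X) = -\sum_{y} p_y \log_2 p_y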
Algorithms 39. Extensions of DTs
- Minimize the effect of the greedy approach: lookahead
- Combination of multiple models
- Incremental learning (on-line)
(Murthy 95) Algorithms 40. Decision Trees and NLP
- Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
- POS Tagging (Cardie 93; Schmid 94b; Magerman 95; Màrquez &
Rodríguez 95, 97; Màrquez et al. 00)
- Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
- Parsing (Magerman 95, 96; Haruno et al. 98, 99)
- Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
- Text summarization (Mani & Bloedorn 98)
- Dialogue act tagging (Samuel et al. 98)
Algorithms 41. Decision Trees and NLP
- Noun phrase coreference (Aone & Benett 95; McCarthy &
Lehnert 95)
- Discourse analysis in information extraction (Soderland &
Lehnert 94)
- Cue phrase identification in text and speech (Litman 94; Siegel
& McKeown 94)
- Verb classification in Machine Translation (Tanaka 96; Siegel 97)
Algorithms 42. Decision Trees: pros & cons
- Acquires symbolic knowledge in an understandable way
- Very well studied ML algorithms and variants
- Can be easily translated into rules
- Existence of available software: C4.5, C5.0, etc.
- Can be easily integrated into an ensemble
Algorithms 43. Decision Trees: pros & cons
- Computationally expensive when scaling to large natural
language domains: many training examples, features, etc.
- Data sparseness and data fragmentation: the problem of small
disjuncts => probability estimation
- DTs are a model with high variance (unstable)
- Tendency to overfit the training data: pruning is necessary
- Requires quite a big effort in tuning the model
Algorithms 44. Boosting algorithms
- To combine many simple and moderately accurate hypotheses
(weak classifiers) into a single, highly accurate classifier
- AdaBoost (Freund & Schapire 95) has been studied extensively,
both theoretically and empirically
- Many other variants and extensions (1997-2003)
- http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html
Algorithms 45. AdaBoost: general scheme
[Diagram. TRAINING: a sequence of weak learners is trained; weak learner t
receives the training set TS_t weighted by a probability distribution D_t and
produces a hypothesis h_t; the distribution is updated after each round.
TEST: the final classifier is a linear combination F(h_1, h_2, ..., h_T).]
Algorithms 46. AdaBoost: algorithm
Algorithms (Freund & Schapire 97)
47. AdaBoost: example
Weak hypotheses = vertical/horizontal hyperplanes Algorithms
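A minimal Python sketch of this scheme, assuming decision stumps (axis-parallel thresholds, i.e. the vertical/horizontal hyperplanes of the example) as weak hypotheses; the toy data and all names are illustrative and not taken from the original slides.

```python
# Hypothetical AdaBoost sketch with decision stumps; labels are +1 / -1.
import math

def stump_predict(stump, x):
    dim, thr, sign = stump
    return sign if x[dim] <= thr else -sign

def best_stump(X, y, w):
    best, best_err = None, float("inf")
    for dim in range(len(X[0])):
        for thr in sorted({x[dim] for x in X}):
            for sign in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict((dim, thr, sign), xi) != yi)
                if err < best_err:
                    best, best_err = (dim, thr, sign), err
    return best, best_err

def adaboost(X, y, T=10):
    m = len(X)
    w = [1.0 / m] * m                       # initial distribution D_1
    ensemble = []
    for _ in range(T):
        h, err = best_stump(X, y, w)        # weak learner on current distribution
        err = max(err, 1e-12)
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # probability distribution update: misclassified examples gain weight
        w = [wi * math.exp(-alpha * yi * stump_predict(h, xi))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):                   # F(h_1, ..., h_T): weighted vote
    return 1 if sum(a * stump_predict(h, x) for a, h in ensemble) >= 0 else -1

# Toy 2D example
X = [(1, 1), (2, 1), (1, 2), (3, 3), (4, 3), (3, 4)]
y = [+1, +1, +1, -1, -1, -1]
model = adaboost(X, y, T=5)
print([predict(model, x) for x in X])
```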
48. AdaBoost: round 1. Algorithms 49. AdaBoost: round 2. Algorithms 50.
AdaBoost: round 3. Algorithms 51. Combined Hypothesis. Algorithms
www.research.att.com/~yoav/adaboost 52. AdaBoost and NLP
- POS Tagging (Abney et al. 99; Màrquez 99)
- Text and Speech Categorization (Schapire & Singer 98;
Schapire et al. 98; Weiss et al. 99)
- PP-attachment Disambiguation (Abney et al. 99)
- Parsing (Haruno et al. 99)
- Word Sense Disambiguation (Escudero et al. 00, 01)
- Shallow parsing (Carreras & Màrquez, 01a; 02)
- Email spam filtering (Carreras & Màrquez, 01b)
- Term Extraction (Vivaldi et al. 01)
Algorithms 53. AdaBoost: pros & cons Algorithms
- Easy to implement and few parameters to set
- Time and space grow linearly with the number of examples. Ability
to manage very large learning problems
- Does not constrain explicitly the complexity of the learner
- Naturally combines feature selection with learning
- Has been successfully applied to many practical problems
54. AdaBoost: pros & cons
- Seems to be rather robust to overfitting (number of rounds) but
sensitive to noise
- Performance is very good when there are relatively few relevant
terms (features)
- Can perform poorly when there is insufficient training data
relative to the complexity of the base classifiers, or when the
training errors of the base classifiers become too large too quickly
Algorithms 55.
- Support Vector Machines (SVM) are learning systems that use a
hypothesis space of linear functions in a high dimensional feature
space, trained with a learning algorithm from optimisation theory
that implements a learning bias derived from statistical learning
theory. (Cristianini & Shawe-Taylor, 2000)
Algorithms SVM: A General Definition 56. SVM: A General
Definition
- Support Vector Machines (SVM) are learning systems that use a
hypothesis space of linear functions in a high dimensional feature
space, trained with a learning algorithm from optimisation theory
that implements a learning bias derived from statistical learning
theory. (Cristianini & Shawe-Taylor, 2000)
Key Concepts Algorithms 57. Linear Classifiers
- Defined by a weight vector (w) and a threshold (b).
- They induce a classification rule:
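The rule appears as a formula on the original slide and is not preserved here; the standard linear decision rule is, as a sketch:

h(\mathbf{x}) = \operatorname{sign}\bigl(\langle \mathbf{w}, \mathbf{x} \rangle + b\bigr)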
Algorithms [Figure: a hyperplane with normal vector w separating positive (+)
from negative (-) examples] 58. Optimal Hyperplane: Geometric Intuition
Algorithms 59. Optimal Hyperplane: Geometric Intuition
[Figure: the Maximal Margin Hyperplane; the examples lying on the margin are
the Support Vectors] Algorithms
60. Linearly separable data: Quadratic Programming Algorithms
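The optimisation problem shown on the slide is not reproduced in this transcript; the standard primal formulation for linearly separable data is, as a sketch:

\min_{\mathbf{w}, b} \ \tfrac{1}{2}\|\mathbf{w}\|^2
\quad \text{subject to} \quad y_i\,(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \ge 1, \quad i = 1, \dots, m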
61. Non-separable case (soft margin) Algorithms
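Again as a hedged sketch of the formula the slide refers to: the soft-margin primal introduces slack variables ξ_i and the trade-off parameter C that reappears in the toy examples below:

\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad y_i\,(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0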
62. Non-linear SVMs
- Implicit mapping into feature space via kernel functions
Algorithms [Slide formulas not preserved; labels: Non-linear mapping, Set of
hypotheses, Dual formulation, Kernel function, Evaluation] 63.
Non-linear SVMs
- Must be efficiently computable
- Characterization via Mercer's theorem
- One of the curious facts about using a kernel is that we do not
need to know the underlying feature map in order to be able to
learn in the feature space! (Cristianini & Shawe-Taylor, 2000)
- Examples: polynomials, Gaussian radial basis functions,
two-layer sigmoidal neural networks, etc.
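As a hedged illustration (not from the slides), here are the polynomial and Gaussian RBF kernels named above, plus the kernelized decision function an SVM evaluates at test time; the support vectors, alphas, labels and bias b are assumed to come from a previous training step.

```python
# Hypothetical sketch of kernel functions and of the dual-form SVM decision rule.
import math

def poly_kernel(x, z, degree=3, c=1.0):
    return (sum(xi * zi for xi, zi in zip(x, z)) + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    # f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b )
    s = sum(a * y * kernel(sv, x) for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + b >= 0 else -1
```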
Algorithms 64. Non-linear SVMs
[Figure: a degree-3 polynomial kernel on a linearly separable and a linearly
non-separable data set] Algorithms
65. Toy Examples
- All examples have been run with the 2D graphic interface of
LIBSVM (Chang and Lin, National Taiwan University)
- LIBSVM is an integrated software for support vector
classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR)
and distribution estimation (one-class SVM). It supports
multi-class classification. The basic algorithm is a simplification
of both SMO by Platt and SVMLight by Joachims. It is also a
simplification of the modification 2 of SMO by Keerthi et al. Our
goal is to help users from other fields to easily use SVM as a
tool. LIBSVM provides a simple interface where users can easily
link it with their own programs
- Available from: www.csie.ntu.edu.tw/~cjlin/libsvm (it includes a
Web-integrated demo tool)
Algorithms 66. Toy Examples (I): Linearly separable data set.
Linear SVM. Maximal margin hyperplane. Algorithms. What happens if
we add a blue training example here? 67. Toy Examples (I): (still)
linearly separable data set. Linear SVM. High value of the C parameter.
Maximal margin hyperplane. The example is correctly classified.
Algorithms 68. Toy Examples (I): (still) linearly separable data set.
Linear SVM. Low value of the C parameter. Trade-off between margin and
training error. The example is now a bounded SV. Algorithms 69. Toy
Examples (II) Algorithms 70. Toy Examples (II) Algorithms 71. Toy
Examples (II) Algorithms 72. Toy Examples (III) Algorithms 73. SVM:
Summary
- SVMs were introduced in COLT92 (Boser, Guyon, & Vapnik, 1992).
Great development since then
- Kernel-induced feature spaces: SVMs work efficiently in very
high dimensional feature spaces (+)
- Learning bias: maximal margin optimisation. Reduces the danger of
overfitting. Generalization bounds for SVMs (+)
- Compact representation of the induced hypothesis. The solution
is sparse in terms of SVs (+)
Algorithms 74. SVM: Summary
- Due to Mercer's conditions on the kernels, the optimisation
problems are convex. No local minima (+)
- Optimisation theory guides the implementation. Efficient
learning (+)
- Mainly for classification, but also for regression, density
estimation, clustering, etc.
- Success in many real-world applications: OCR, vision,
bioinformatics, speech recognition, NLP: TextCat, POS tagging,
chunking, parsing, etc. (+)
- Parameter tuning (-). Implications in convergence times,
sparsity of the solution, etc.
Algorithms 75. Outline
- The Classification Problem
76. NLP problems Applications
- Warning! We will not focus on final NLP applications, but on
intermediate tasks...
- We will classify the NLP tasks according to their (structural)
complexity
77. NLP problems: structural complexity Applications
- Text Categorization, Document filtering, Word Sense
Disambiguation, etc.
- Sequence tagging and detection of sequential structures
- POS tagging, Named Entity extraction, syntactic chunking, etc.
- Clause detection, full parsing, IE of complex concepts,
composite Named Entities, etc.
78.
- Morpho-syntactic ambiguity: Part of Speech Tagging
POS tagging
- He was shot in the hand as he chased the robbers in the back
street
NN VB JJ VB NN VB (The Wall Street Journal Corpus)
Applications 79. POS tagging Applications
[Figure: a preposition-adverb decision tree for the word form "As"/"as".
Root: P(IN)=0.81, P(RB)=0.19; after testing Word Form = As/as, tag(+1) = RB
and tag(+2) = IN, the leaf gives P(IN)=0.013, P(RB)=0.987.]
Probabilistic interpretation:
P(RB | word=As/as ∧ tag(+1)=RB ∧ tag(+2)=IN) = 0.987
P(IN | word=As/as ∧ tag(+1)=RB ∧ tag(+2)=IN) = 0.013
80. POS tagging
Collocations: as_RB much_RB as_IN; as_RB well_RB as_IN; as_RB soon_RB as_IN
Applications
[Figure: the same preposition-adverb tree as on slide 79.]
81. POS tagging
[Diagram of RTT (Màrquez & Rodríguez 97): Raw text → Morphological analysis →
Disambiguation (Classify / Update / Filter, using a Language Model) → stop? →
Tagged text] Applications
A Sequential Model for Multi-class Classification: NLP/POS
Tagging (Even-Zohar & Roth, 01)
82. POS tagging
[Diagram of STT (Màrquez & Rodríguez 97): Raw text → Morphological analysis →
Disambiguation with the Viterbi algorithm, using a Language Model (lexical
probs. + contextual probs.) → Tagged text] Applications
The Use of Classifiers in Sequential Inference: Chunking
(Punyakanok & Roth, 00)
83. Detection of sequential and hierarchical structures
Applications 84. Summary/conclusions
- We have briefly outlined:
- The ML setting: supervised learning for classification
- Three concrete machine learning algorithms
- How to apply them to solve intermediate NLP tasks
Conclusions 85.
- Any ML algorithm for NLP should be:
- Robust to noise and outliers
- Efficient in large feature/example spaces
- Adaptive to new/changing domains: portability, tuning, etc.
- Able to take advantage of unlabelled examples: semi-supervised
learning
Conclusions Summary/conclusions 86. Summary/conclusions
- Statistical and ML-based Natural Language Processing is a very
active and multidisciplinary area of research
Conclusions 87. Some current research lines
- Appropriate learning paradigm for all kinds of NLP
problems: TiMBL (DBZ 99), TBEDL (Brill 95), ME (Ratnaparkhi 98), SNoW
(Roth 98), CRF (Pereira & Singer 02).
- Definition of an adequate (and task-specific) feature
space: mapping from the input space to a high dimensional feature
space, kernels, etc.
- Resolution of complex NLP problems: inference with classifiers +
constraint satisfaction
Conclusions 88. Bibliography
- You may find additional information at:
- http://www.lsi.upc.es/~lluism/
- http://www.lsi.upc.es/~lluism/udg03.ppt.gz
Conclusions 89. Seminar: Statistical NLP, Girona, June 2003.
Machine Learning for Natural Language Processing. Lluís Màrquez, TALP
Research Center, Llenguatges i Sistemes Informàtics, Universitat
Politècnica de Catalunya