Machine Learning for NLP


  • 1. Seminar: Statistical NLP, Girona, June 2003. Machine Learning for Natural Language Processing. Lluís Màrquez, TALP Research Center, Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya

2. Outline

  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP

3. Outline

  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP

4.

  • There are many general-purpose definitions of Machine Learning (or artificial learning):

Machine Learning ML4NLP

  • Learners are computers: we study learning algorithms
  • Resources are scarce: time, memory, data, etc.
  • It has (almost) nothing to do with: Cognitive science, neuroscience, theory of scientific discovery and research, etc.
  • Biological plausibility is welcome but not the main goal

Making a computer automatically acquire some kind of knowledge from a concrete data domain 5. Machine Learning

  • We will concentrate on:
    • Supervised inductive learning for classification
    • = discriminative learning
  • Learning... but what for?
    • To perform some particular task
    • To react to environmental inputs
    • Concept learning from data:
      • modelling concepts underlying data
      • predicting unseen observations
      • compacting the knowledge representation
      • knowledge discovery for expert systems

ML4NLP 6. Machine Learning

  • What to read?
    • Machine Learning (Mitchell, 1997)

A more precise definition: ML4NLP Obtaining a description of the concept in some representation language that explains the observations and helps predict new instances of the same distribution 7.

  • Lexical and structural ambiguity problems
    • Word selection (SR, MT)
    • Part-of-speech tagging
    • Semantic ambiguity (polysemy)
    • Prepositional phrase attachment
    • Reference ambiguity (anaphora)
    • etc.

Empirical NLP, 90's: Application of Machine Learning (ML) techniques to NLP problems ML4NLP

  • What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999)

Classification problems 8.

  • Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity resolution = classification

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

( The Wall Street Journal Corpus ) ML4NLP 9.

  • Morpho-syntactic ambiguity

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

NN VB JJ VB NN VB ( The Wall Street Journal Corpus ) ML4NLP 10.

  • Morpho-syntactic ambiguity: Part-of-Speech Tagging

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

NN VB JJ VB NN VB ( The Wall Street Journal Corpus ) ML4NLP 11.

  • Semantic (lexical) ambiguity

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

body-part / clock-part ( The Wall Street Journal Corpus ) ML4NLP 12.

  • Semantic (lexical) ambiguity: Word Sense Disambiguation

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

body-part / clock-part ( The Wall Street Journal Corpus ) ML4NLP 13.

  • Structural (syntactic) ambiguity

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

( The Wall Street Journal Corpus ) ML4NLP 14.

  • Structural (syntactic) ambiguity

NLP classification problems

    • He was shot in the hand as he chased the robbers in the back street

( The Wall Street Journal Corpus ) ML4NLP 15.

  • Structural (syntactic) ambiguity: PP-attachment disambiguation

NLP classification problems

    • He was shot in the hand as he (chased (the robbers)_NP (in the back street)_PP )

( The Wall Street Journal Corpus ) ML4NLP 16. Outline

  • The Classification Problem
  • Three ML Algorithms in detail
  • Applications to NLP
  • Machine Learning for NLP

17. Feature Vector Classification

  • An instance is a vector x = < x1, ..., xn > whose components, called features (or attributes), are discrete or real-valued.
  • Let X be the space of all possible instances.
  • Let Y = { y1, ..., ym } be the set of categories (or classes).
  • The goal is to learn an unknown target function, f : X → Y
  • A training example is an instance x belonging to X, labelled with the correct value for f( x ), i.e., a pair < x, f( x ) >
  • Let D be the set of all training examples.
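As a concrete illustration of these definitions, here is a minimal sketch; the COLOR/SHAPE features, the labels and the rule are made up, anticipating the toy example of the next slides:

    # A training example is a pair <x, f(x)>: an instance (feature vector) plus its class.
    X_FEATURES = ("COLOR", "SHAPE")                        # components of an instance
    Y = {"positive", "negative"}                           # the set of categories

    D = [                                                  # training set D of pairs <x, f(x)>
        ({"COLOR": "red",  "SHAPE": "circle"},   "positive"),
        ({"COLOR": "blue", "SHAPE": "circle"},   "negative"),
        ({"COLOR": "red",  "SHAPE": "triangle"}, "negative"),
    ]

    def h(x):
        # one candidate definition of the target function, consistent with D
        return "positive" if x["COLOR"] == "red" and x["SHAPE"] == "circle" else "negative"

    assert all(h(x) == y for x, y in D)                    # h(x) = f(x) for every pair in D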

AI perspective Classification 18. Feature Vector Classification

  • The goal is to find a function h belonging to H such that for every pair < x, f( x ) > belonging to D, h( x ) = f( x )
  • The hypothesis space, H, is the set of functions h : X → Y that the learner can consider as possible definitions

Classification 19. An Example

Rules: (COLOR = red) ∧ (SHAPE = circle) → positive; otherwise → negative. [Figure: the equivalent decision tree, testing COLOR (red / blue) and then SHAPE (circle / triangle).]

20. An Example

Rules: (SIZE = small) ∧ (SHAPE = circle) → positive; (SIZE = big) ∧ (COLOR = red) → positive; otherwise → negative. [Figure: the equivalent decision tree, testing SIZE (small / big), then SHAPE or COLOR.]

21. Some important concepts

  • Inductive bias
  • Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias (Mooney & Cardie, 99)
    • Language / search bias

Classification [Figure: the COLOR/SHAPE decision tree from the earlier example.] 22.

  • Inductive bias
  • Training error and generalization error

Some important concepts

  • Generalization ability and overfitting
  • Batch learning vs. on-line learning
  • Symbolic vs. statistical learning
  • Propositional vs. first-order learning

Classification 23. Propositional vs. Relational Learning. Classification: color(red) ∧ shape(circle) → classA

  • Propositional learning

course(X) ∧ person(Y) ∧ link_to(Y,X) → instructor_of(X,Y)
research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) → member_proj(X,Z)

  • Relational learning = ILP (induction of logic programs)

24. The Classification Setting Class, Point, Example, Data Set, ...

  • Input Space: X ⊆ R^n
  • (binary) Output Space: Y = { +1, -1 }
  • A point, pattern or instance: x ∈ X, x = ( x1, x2, ..., xn )
  • Example: ( x, y ), with x ∈ X, y ∈ Y
  • Training Set: a set of m examples generated i.i.d. according to an unknown distribution P( x, y ): S = {( x1, y1 ), ..., ( xm, ym )} ∈ ( X × Y )^m

Classification CoLT/SLT perspective 25. The Classification Setting Learning, Error, ...

  • The hypothesis space, H, is the set of functions h : X → Y that the learner can consider as possible definitions. In SVMs they are of the form h( x ) = sign( ⟨ w, x ⟩ + b )
  • The goal is to find a function h belonging to H such that the expected misclassification error on new examples, also drawn from P( x, y ), is minimal (Risk Minimization, RM)

Classification 26. The Classification Setting Learning, Error, ...

  • Expected error (risk)
  • Problem: P itself is unknown. Only the training examples are known, so an induction principle is needed
  • Empirical Risk Minimization (ERM): Find the function h belonging to H for which the training error (empirical risk) is minimal
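In symbols, with the 0/1 loss (a standard formulation consistent with these definitions; the slide's own formulas are shown as images and are not reproduced in this transcript):

    R(h) = \int L\bigl(h(\mathbf{x}), y\bigr)\, dP(\mathbf{x}, y),
    \qquad
    R_{\mathrm{emp}}(h) = \frac{1}{m} \sum_{i=1}^{m} L\bigl(h(\mathbf{x}_i), y_i\bigr),
    \qquad
    L\bigl(h(\mathbf{x}), y\bigr) =
    \begin{cases}
      0 & \text{if } h(\mathbf{x}) = y \\
      1 & \text{otherwise}
    \end{cases}

    \text{ERM:} \quad \hat{h} = \arg\min_{h \in H} R_{\mathrm{emp}}(h)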

Classification 27. The Classification Setting Error, Over(under)fitting,...

  • Low training error ⇒ low true error?
  • The overfitting dilemma:
  • Trade-off between training error and complexity
  • Different learning biases can be used

(Müller et al., 2001) Classification [Figure: underfitting vs. overfitting] 28. Outline

  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP
  • Machine Learning for NLP

29. Outline

  • The Classification Problem
  • Three ML Algorithms
    • Decision Trees
    • AdaBoost
    • Support Vector Machines
  • Applications to NLP
  • Machine Learning for NLP

30. Learning Paradigms

  • Statistical learning:
    • HMM, Bayesian Networks, ME, CRF, etc.
  • Traditional methods from Artificial Intelligence (ML, AI)
    • Decision trees/lists, exemplar-based learning, rule induction, neural networks, etc.
  • Methods from Computational Learning Theory (CoLT/SLT)
    • Winnow, AdaBoost, SVMs, etc.

Algorithms 31. Learning Paradigms

  • Classifier combination:
    • Bagging, Boosting, Randomization, ECOC, Stacking, etc.
  • Semi-supervised learning : learning from labelled and unlabelled examples
    • Bootstrapping, EM, Transductive learning (SVMs, AdaBoost), Co-Training, etc.
  • etc.

Algorithms 32. Decision Trees

  • Decision trees are a way to represent rules underlying training data, with hierarchical structures that recursively partition the data.
  • They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration, with some of the following purposes: description, classification, and generalization.
  • From a machine-learning perspective: Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes
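As a minimal illustration of this view (the nested-dictionary representation and the toy values are assumptions, matching the earlier COLOR/SHAPE example):

    # A decision tree as nested tests; each root-to-leaf path is one classification rule.
    tree = {"feature": "COLOR",
            "branches": {"blue": "negative",                      # COLOR=blue -> negative
                         "red": {"feature": "SHAPE",
                                 "branches": {"circle": "positive",
                                              "triangle": "negative"}}}}

    def classify(tree, x):
        while isinstance(tree, dict):                             # follow the branch chosen by x
            tree = tree["branches"][x[tree["feature"]]]
        return tree                                               # a leaf is a class label

    print(classify(tree, {"COLOR": "red", "SHAPE": "circle"}))    # positive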

Algorithms 33. Decision Trees

  • Acquisition: Top-Down Induction of Decision Trees (TDIDT)
  • Systems:
  • CART (Breiman et al. 84),
  • ID3, C4.5, C5.0 (Quinlan 86, 93, 98),
  • ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95)
  • etc.

Algorithms 34. An Example

[Figure: a generic n-ary decision tree with test nodes A1, A2, A3, A5, branch values v1 ... v7 and leaf classes C1, C2, C3, next to the SIZE/COLOR/SHAPE decision tree of the earlier toy example.]

Algorithms 35. Learning Decision Trees

Training: Training Set + TDIDT → DT. Test: Example + DT → Class.

Algorithms 36. General Induction Algorithm

    function TDIDT (X: set-of-examples; A: set-of-features)
    var
      tree1, tree2: decision-tree;
      X': set-of-examples;
      A': set-of-features
    end-var
    if (stopping_criterion(X)) then
      tree1 := create_leaf_tree(X)
    else
      a_max := feature_selection(X, A);
      tree1 := create_tree(X, a_max);
      for-all val in values(a_max) do
        X' := select_examples(X, a_max, val);
        A' := A - {a_max};
        tree2 := TDIDT(X', A');
        tree1 := add_branch(tree1, tree2, val)
      end-for
    end-if
    return (tree1)
    end-function

(Slide 37 repeats the same induction algorithm.)

Algorithms 38. Feature Selection Criteria

  • Functions derived from Information Theory:
    • Information Gain, Gain Ratio (Quinlan 86)
  • Functions derived from Distance Measures:
    • Gini Diversity Index (Breiman et al. 84)
    • RLM (López de Mántaras 91)
  • Statistically-based:
    • Chi-square test (Sestito & Dillon 94)
    • Symmetrical Tau (Zhou & Dillon 91)
  • ReliefF-IG: a variant of ReliefF (Kononenko 94)
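Putting slides 36 and 38 together, here is a minimal runnable sketch of TDIDT using Information Gain as the feature_selection criterion. The stopping criterion, helper names and toy data are assumptions for illustration; the learned tree reuses the nested-dictionary representation of the earlier sketch:

    # Minimal TDIDT sketch with Information Gain feature selection (categorical features).
    from collections import Counter
    from math import log2

    def entropy(examples):
        counts = Counter(y for _, y in examples)
        total = len(examples)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def information_gain(examples, feature):
        total = len(examples)
        remainder = 0.0
        for val in {x[feature] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[feature] == val]
            remainder += len(subset) / total * entropy(subset)
        return entropy(examples) - remainder

    def tdidt(examples, features):
        labels = [y for _, y in examples]
        majority = Counter(labels).most_common(1)[0][0]
        if len(set(labels)) == 1 or not features:          # stopping_criterion
            return majority                                # create_leaf_tree
        a_max = max(features, key=lambda a: information_gain(examples, a))
        tree = {"feature": a_max, "branches": {}}          # create_tree
        for val in {x[a_max] for x, _ in examples}:        # for-all val in values(a_max)
            subset = [(x, y) for x, y in examples if x[a_max] == val]
            tree["branches"][val] = tdidt(subset, features - {a_max})
        return tree

    data = [({"SIZE": "big",   "COLOR": "red",  "SHAPE": "circle"},   "pos"),
            ({"SIZE": "big",   "COLOR": "blue", "SHAPE": "circle"},   "neg"),
            ({"SIZE": "small", "COLOR": "red",  "SHAPE": "circle"},   "pos"),
            ({"SIZE": "small", "COLOR": "blue", "SHAPE": "triangle"}, "neg")]
    print(tdidt(data, {"SIZE", "COLOR", "SHAPE"}))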

Algorithms 39. Extensions of DTs

  • Pruning (pre/post)
  • Minimize the effect of the greedy approach: lookahead
  • Non-linear splits
  • Combination of multiple models
  • Incremental learning (on-line)
  • etc.

(Murthy 95) Algorithms 40. Decision Trees and NLP

  • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
  • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95, 97; Màrquez et al. 00)
  • Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
  • Parsing (Magerman 95, 96; Haruno et al. 98, 99)
  • Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
  • Text summarization (Mani & Bloedorn 98)
  • Dialogue act tagging (Samuel et al. 98)

Algorithms 41. Decision Trees and NLP

  • Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95)
  • Discourse analysis in information extraction (Soderland & Lehnert 94)
  • Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)
  • Verb classification in Machine Translation (Tanaka 96; Siegel 97)

Algorithms 42. Decision Trees: pros&cons

  • Advantages
    • Acquires symbolic knowledge in an understandable way
    • Very well studied ML algorithms and variants
    • Can be easily translated into rules
    • Existence of available software: C4.5, C5.0, etc.
    • Can be easily integrated into an ensemble

Algorithms 43. Decision Trees: pros&cons

  • Drawbacks
    • Computationally expensive when scaling to large natural language domains: training examples, features, etc.
    • Data sparseness and data fragmentation: the problem of the small disjuncts => probability estimation
    • DTs are a model with high variance (unstable)
    • Tendency to overfit training data: pruning is necessary
    • Requires quite a big effort in tuning the model

Algorithms 44. Boosting algorithms

  • Idea
  • to combine many simple and moderately accurate hypotheses (weak classifiers) into a single, highly accurate classifier
  • AdaBoost (Freund & Schapire 95) has been studied extensively, both theoretically and empirically
  • Many other variants and extensions (1997-2003)
  • http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html
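A minimal sketch of the combination scheme shown on the next slides (binary labels in {-1, +1}, decision stumps as the weak learner; the helper names, stump learner and toy data are assumptions for illustration, not the original notation of Freund & Schapire):

    # Minimal AdaBoost sketch with decision stumps as weak hypotheses.
    import math

    def stump_learner(X, y, D):
        """Pick the feature/threshold/sign with smallest weighted training error."""
        best = None
        for j in range(len(X[0])):
            for thr in sorted({x[j] for x in X}):
                for sign in (+1, -1):
                    pred = [sign if x[j] >= thr else -sign for x in X]
                    err = sum(d for d, p, yi in zip(D, pred, y) if p != yi)
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        return err, (lambda x, j=j, thr=thr, sign=sign: sign if x[j] >= thr else -sign)

    def adaboost(X, y, T=10):
        m = len(X)
        D = [1.0 / m] * m                        # initial uniform distribution D_1
        ensemble = []                            # pairs (alpha_t, h_t)
        for _ in range(T):
            eps, h = stump_learner(X, y, D)      # weak hypothesis h_t, weighted error eps_t
            eps = max(eps, 1e-10)
            if eps >= 0.5:
                break
            alpha = 0.5 * math.log((1 - eps) / eps)
            ensemble.append((alpha, h))
            # probability distribution updating: raise the weight of misclassified examples
            D = [d * math.exp(-alpha * yi * h(x)) for d, x, yi in zip(D, X, y)]
            Z = sum(D)
            D = [d / Z for d in D]
        # final classifier: sign of the linear combination F(h_1, ..., h_T)
        return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

    # Toy usage: points separable by axis-parallel stumps.
    X = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.5), (4.0, 4.0)]
    y = [-1, -1, +1, +1]
    clf = adaboost(X, y, T=5)
    print([clf(x) for x in X])

Each round reweights the training distribution so that examples misclassified by h_t receive more weight in round t+1; the final decision is the sign of the weighted linear combination F( h_1, h_2, ..., h_T ).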

Algorithms 45. AdaBoost: general scheme

TRAINING: [Figure: training sets TS1 ... TST, drawn under distributions D1 ... DT, are fed to a Weak Learner, producing weak hypotheses h1 ... hT; a probability-distribution updating step links the rounds; the weak hypotheses are combined linearly into F( h1, h2, ..., hT ), which is used at TEST time.]

Algorithms 46. AdaBoost: algorithm (Freund & Schapire 97) [Figure: pseudocode of the algorithm.]

Algorithms 47. AdaBoost: example. Weak hypotheses = vertical/horizontal hyperplanes. 48-50. AdaBoost: rounds 1-3 [Figures]. 51. Combined Hypothesis [Figure]. www.research.att.com/~yoav/adaboost

52. AdaBoost and NLP

  • POS Tagging (Abney et al. 99; Màrquez 99)
  • Text and Speech Categorization (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99)
  • PP-attachment Disambiguation (Abney et al. 99)
  • Parsing (Haruno et al. 99)
  • Word Sense Disambiguation (Escudero et al. 00, 01)
  • Shallow parsing (Carreras & Màrquez, 01a; 02)
  • Email spam filtering (Carreras & Màrquez, 01b)
  • Term Extraction (Vivaldi et al. 01)

Algorithms 53. AdaBoost: pros&cons Algorithms

  • Easy to implement and few parameters to set
  • Time and space grow linearly with number of examples. Ability to manage very large learning problems
  • Does not constrain explicitly the complexity of the learner
  • Naturally combines feature selection with learning
  • Has been successfully applied to many practical problems

54. AdaBoost: pros&cons

  • Seems to be rather robust to overfitting (number of rounds), but sensitive to noise
  • Performance is very good when there are relatively few relevant terms (features)
  • Can perform poorly when there is insufficient training data relative to the complexity of the base classifiers, or when the training errors of the base classifiers become too large too quickly

Algorithms 55.

  • Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory. (Cristianini & Shawe-Taylor, 2000)

Algorithms SVM: A General Definition 56. SVM: A General Definition

  • Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory. (Cristianini & Shawe-Taylor, 2000)

Key Concepts Algorithms 57. Linear Classifiers

  • Hyperplanes in R^N.
  • Defined by a weight vector ( w ) and a threshold ( b ).
  • They induce a classification rule:
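A minimal sketch of that rule (the vector w and threshold b below are arbitrary illustrative values):

    # Classification rule induced by a hyperplane (w, b): h(x) = sign(<w, x> + b)
    def linear_classify(w, b, x):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        return +1 if score >= 0 else -1

    print(linear_classify([2.0, -1.0], -0.5, [1.0, 1.0]))   # 2 - 1 - 0.5 = 0.5 -> +1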

Algorithms [Figure: positive (+) and negative (-) points separated by a hyperplane with weight vector w.] 58-59. Optimal Hyperplane: Geometric Intuition [Figures: the maximal margin hyperplane; the points lying on the margin are the support vectors.] 60. Linearly separable data [Figure: quadratic programming formulation.] Algorithms 61. Non-separable case (soft margin) [Figure.] Algorithms 62. Non-linear SVMs

  • Implicit mapping into feature space via kernel functions

Algorithms [Figure: the non-linear mapping, the resulting set of hypotheses, the dual formulation, the kernel function and its evaluation.] 63. Non-linear SVMs

  • Kernel functions
    • Must be efficiently computable
    • Characterization via Mercer's theorem
    • One of the curious facts about using a kernel is that we do not need to know the underlying feature map in order to be able to learn in the feature space! (Cristianini & Shawe-Taylor, 2000)
    • Examples: polynomials, Gaussian radial basis functions, two-layer sigmoidal neural networks, etc.
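Two of those kernels as a minimal sketch (the degree, c and gamma values are arbitrary defaults, not taken from the slides):

    # K(x, z) implicitly computes an inner product in a feature space
    # without ever constructing the feature map.
    import math

    def polynomial_kernel(x, z, degree=3, c=1.0):
        return (sum(xi * zi for xi, zi in zip(x, z)) + c) ** degree

    def rbf_kernel(x, z, gamma=0.5):
        return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

    print(polynomial_kernel((1.0, 2.0), (0.5, -1.0)))   # (1*0.5 + 2*(-1) + 1)^3
    print(rbf_kernel((1.0, 2.0), (0.5, -1.0)))          # exp(-0.5 * ||x - z||^2)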

Algorithms 64. Non-linear SVMs [Figure: a degree-3 polynomial kernel on a linearly separable and on a linearly non-separable data set.] Algorithms 65. Toy Examples

  • All examples have been run with the 2D graphic interface of LIBSVM (Chang and Lin, National Taiwan University)
  • LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMlight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs
  • Available from: www.csie.ntu.edu.tw/~cjlin/libsvm (it includes a Web-integrated demo tool)
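LIBSVM's demo is graphical; as a non-graphical stand-in, here is a small sketch using scikit-learn's SVC class (which wraps LIBSVM) to explore the effect of the C parameter illustrated on the next slides. The data set is invented for illustration:

    # Effect of the C parameter (margin vs. training-error trade-off) with a linear SVM.
    from sklearn.svm import SVC

    X = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0], [1.2, 1.2]]   # last point: an outlier
    y = [-1, -1, +1, +1, +1]

    for C in (100.0, 0.01):          # high C favours fitting every example; low C, a wider margin
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(C, "support vectors per class:", clf.n_support_,
              "prediction for the outlier:", clf.predict([[1.2, 1.2]]))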

Algorithms 66. Toy Examples (I): linearly separable data set, linear SVM, maximal margin hyperplane. What happens if we add a blue training example here? 67. Toy Examples (I): (still) linearly separable data set, linear SVM, high value of the C parameter: maximal margin hyperplane, the example is correctly classified. 68. Toy Examples (I): (still) linearly separable data set, linear SVM, low value of the C parameter: trade-off between margin and training error, the example is now a bounded SV. 69-71. Toy Examples (II) [Figures]. 72. Toy Examples (III) [Figure]. Algorithms 73. SVM: Summary

  • SVMs were introduced at COLT-92 (Boser, Guyon & Vapnik, 1992). Great development since then
  • Kernel-induced feature spaces: SVMs work efficiently in very high dimensional feature spaces ( + )
  • Learning bias: maximal margin optimisation. Reduces the danger of overfitting. Generalization bounds for SVMs ( + )
  • Compact representation of the induced hypothesis. The solution is sparse in terms of SVs ( + )

Algorithms 74. SVM: Summary

  • Due to Mercer's conditions on the kernels, the optimisation problems are convex. No local minima ( + )
  • Optimisation theory guides the implementation. Efficient learning ( + )
  • Mainly for classification but also for regression, density estimation, clustering, etc.
  • Success in many real-world applications: OCR, vision, bioinformatics, speech recognition, NLP: TextCat, POS tagging, chunking, parsing, etc. ( + )
  • Parameter tuning ( - ). Implications for convergence times, sparsity of the solution, etc.

Algorithms 75. Outline

  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP
  • Machine Learning for NLP

76. NLP problems Applications

  • Warning! We will not focus on final NLP applications, but on intermediate tasks...
  • We will classify the NLP tasks according to their (structural) complexity

77. NLP problems: structural complexity Applications

  • Decisional problems
    • Text Categorization, Document filtering, Word Sense Disambiguation, etc.
  • Sequence tagging and detection of sequential structures
    • POS tagging, Named Entity extraction, syntactic chunking, etc.
  • Hierarchical structures
    • Clause detection, full parsing, IE of complex concepts, composite Named Entities, etc.

78.

  • Morpho-syntactic ambiguity: Part-of-Speech Tagging

POS tagging

    • He was shot in the hand as he chased the robbers in the back street

NN VB JJ VB NN VB ( The Wall Street Journal Corpus ) Applications

79. POS tagging

Applications [Figure: a decision tree for the preposition-adverb ambiguity of the word form "As"/"as": at the root P(IN)=0.81, P(RB)=0.19; after testing the word form P(IN)=0.83, P(RB)=0.17; after testing tag(+1) P(IN)=0.13, P(RB)=0.87; after testing tag(+2) P(IN)=0.013, P(RB)=0.987.] Probabilistic interpretation: P( RB | word="as", tag(+1)=RB, tag(+2)=IN ) = 0.987; P( IN | word="as", tag(+1)=RB, tag(+2)=IN ) = 0.013

80. POS tagging

Collocations: as_RB much_RB as_IN; as_RB well_RB as_IN; as_RB soon_RB as_IN [Figure: the same preposition-adverb tree.] Applications

81. POS tagging

RTT (Màrquez & Rodríguez 97): [Figure: raw text → morphological analysis → disambiguation loop (classify, update, filter) driven by a language model, repeated until a stopping condition → tagged text.] Applications. A Sequential Model for Multi-class Classification: NLP/POS Tagging (Even-Zohar & Roth, 01)

82. POS tagging

STT (Màrquez & Rodríguez 97): [Figure: raw text → morphological analysis → disambiguation with the Viterbi algorithm using a language model (lexical probs. + contextual probs.) → tagged text.] Applications. The Use of Classifiers in Sequential Inference: Chunking (Punyakanok & Roth, 00)

83. Detection of sequential and hierarchical structures

  • Named Entity recognition
  • Clause detection

Applications 84. Summary/conclusions

  • We have briefly outlined:
    • The ML setting: supervised learning for classification
    • Three concrete machine learning algorithms
    • How to apply them to solve intermediate NLP tasks

Conclusions 85.

  • Any ML algorithm for NLP should be:
    • Robust to noise and outliers
    • Efficient in large feature/example spaces
    • Adaptive to new/changing domains: portability, tuning, etc.
    • Able to take advantage of unlabelled examples: semi-supervised learning

Conclusions Summary/conclusions 86. Summary/conclusions

  • Statistical and ML-based Natural Language Processing is a very active and multidisciplinary area of research

Conclusions 87. Some current research lines

  • Appropriate learning paradigm for all kinds of NLP problems: TiMBL (DBZ 99), TBEDL (Brill 95), ME (Ratnaparkhi 98), SNoW (Roth 98), CRF (Pereira & Singer 02).
  • Definition of an adequate (and task-specific) feature space: mapping from the input space to a high dimensional feature space, kernels, etc.
  • Resolution of complex NLP problems: inference with classifiers + constraint satisfaction
  • etc.

Conclusions 88. Bibliography

  • You may find additional information at:
    • http://www.lsi.upc.es/~lluism/
    • tesi.html
    • publicacions/pubs.html
    • cursos/talks.html
    • cursos/MLandNL.html
    • cursos/emnlp1.html
  • This talk at:
  • http://www.lsi.upc.es/~lluism/udg03.ppt.gz

Conclusions 89. Seminar: Statistical NLP, Girona, June 2003. Machine Learning for Natural Language Processing. Lluís Màrquez, TALP Research Center, Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya