
Page 1: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

EMNLP’02 11-18/11/2002

• ML: Classical methods from AI

–Decision-Tree induction

–Exemplar-based Learning

–Rule Induction

–TBEDL

Page 2: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Exemplar-Based Learning (EBL)

• Preliminaries

• The k-Nearest Neighbour algorithm

• Distance Metrics

• Feature Relevance and Weighting

• Example Storage

ACL’99 Tutorial on: Symbolic Machine Learning methods for NLP (Mooney & Cardie 99)

Page 3: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Preliminaries

• Unlike most learning algorithms, exemplar-based (also called case-, instance-, or memory-based) approaches do not construct an abstract hypothesis but instead base the classification of test instances on their similarity to specific training cases (e.g. Aha et al. 1991)

• Training is typically very simple: just store the training instances

• Generalization is postponed until a new instance must be classified. Therefore, exemplar-based methods are sometimes called “lazy” learners (Aha, 1997)

Page 4: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Preliminaries

• Given a new instance, its relation to the stored examples is examined in order to assign a target function value for the new instance

• In the worst case, testing requires comparing a test instance to every training instance

• Consequently, exemplar-based methods may require more expensive indexing of cases during training to allow for efficient retrieval: e.g. k-d trees

Page 5: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Preliminaries

• Other solutions (simplifying assumptions):

– Store only relevant examples: instance editing, IB2, IB3, etc. (Aha et al., 1991)

– IGTREEs (Daelemans et al. 1995)

– etc.

• Rather than estimating the target function for the entire instance space, case-based methods estimate the target function locally and differently for each new instance to be classified

Page 6: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

k-Nearest Neighbours (Cover & Hart, 1967; Duda & Hart, 1973)

• Calculate the distance between a test instance and every training instance

• Pick the k closest training examples and assign to the test example the most common category among these “nearest neighbours”

[Figure: a test point marked “?” whose single nearest neighbour says + but whose 5 nearest neighbours say -]

• Voting multiple neighbours helps increase resistance to noise. For binary classification tasks, odd values of k are normally used to avoid ties
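To make the procedure concrete, here is a minimal Python sketch of the k-NN classifier just described (an illustration added to this transcript, not code from the tutorial), assuming symbolic feature vectors and a simple overlap distance:

from collections import Counter

def overlap_distance(x, y):
    # overlap (Hamming) distance for symbolic feature vectors
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def knn_classify(train, test_instance, k=3, distance=overlap_distance):
    # train: list of (feature_vector, label) pairs
    neighbours = sorted(train, key=lambda ex: distance(ex[0], test_instance))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# hypothetical toy data with symbolic features
train = [(("DT", "NN"), "+"), (("DT", "JJ"), "+"), (("VB", "NN"), "-")]
print(knn_classify(train, ("DT", "NN"), k=1))   # prints '+'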

Page 7: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Implicit Classification Function

• Although it is not necessary to explicitly calculate it, the learned classification rule is based on the regions of feature space closest to each training example

• For 1-nearest neighbour, the Voronoi diagram gives the complex polyhedra that segment the space into the regions of points closest to each training example

Page 8: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Distance Metrics

• Exemplar-based methods assume a function for calculating the (dis)similarity of two instances

• Simplest approach:

– For continuous feature vectors, just use Euclidean distance

– To compensate for differences in units, scale all continuous features so that their values lie between 0 and 1
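A small sketch of this simplest approach for continuous features (illustration only, not from the original slides; the feature ranges are assumed to be precomputed on the training data):

import math

def minmax_scale(ranges):
    # ranges: list of (min, max) per feature, precomputed on the training data
    def scale(x):
        return [(xi - lo) / (hi - lo) if hi > lo else 0.0
                for xi, (lo, hi) in zip(x, ranges)]
    return scale

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

scale = minmax_scale([(0.0, 10.0), (0.0, 1.0)])   # hypothetical feature ranges
print(euclidean(scale([5.0, 0.2]), scale([10.0, 0.8])))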

Page 9: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Distance Metrics

• Simplest approach:

– For symbolic feature vectors, assume Hamming distance (overlapping measure)

Page 10: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Distance Metrics

• More sophisticated metrics

– Distance metrics can account for relationships that exist among feature values. E.g., when comparing a part-of-speech feature, a proper noun and a common noun are probably more similar than a proper noun and a verb (Cardie & Mooney 99)

– Graded similarity between discrete features: e.g., the Modified Value Difference Metric, MVDM (Cost & Salzberg, 1993)

– Weighting attributes according to their importance (Cardie, 1993; Daelemans et al., 1996)

– Many empirical comparative studies between distance metrics for exemplar-based classification: (Aha & Goldstone, 1992; Friedman, 1994; Hastie & Tibshirani, 1995; Wettschereck et al., 1997; Blanzieri & Ricci, 1999)

Page 11: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Distance Metrics: MVDM

• Modified Value Difference Metric (Cost & Salzberg, 1993)

• Graded similarity between discrete features

• Better results in some NLP applications, e.g. WSD (Escudero et al., 00b)
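As an illustration (not the authors’ code), a minimal sketch of MVDM: the distance between two values of a discrete feature is the summed difference of their conditional class distributions, delta(v1, v2) = sum over classes c of |P(c|v1) - P(c|v2)|, estimated from labelled training data:

from collections import Counter, defaultdict

def mvdm_table(examples, feature_index):
    # examples: list of (feature_vector, label); returns P(class | value) for each observed value
    counts = defaultdict(Counter)
    for x, y in examples:
        counts[x[feature_index]][y] += 1
    return {v: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for v, cnt in counts.items()}

def mvdm(v1, v2, table, classes):
    # delta(v1, v2) = sum_c | P(c|v1) - P(c|v2) |
    p1, p2 = table.get(v1, {}), table.get(v2, {})
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in classes)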

Page 12: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Distance Metrics

• More information on similarity metrics

– Foundations of Statistical NLP (Manning & Schütze, 1999)

– Survey on similarity metrics (HERMES project) (Rodríguez, 2001)

Page 13: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

k-NN: Variations

• Linear search is not a very efficient classification procedure

• The k-d tree data structure can be used to index training examples and find nearest neighbours in logarithmic time on average (Bentley 75; Friedman et al. 77)

• Nodes branch on threshold tests on individual features and leaves terminate at nearest neighbours

• There is no loss of information (nor generalization) with respect to the straightforward representation
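A sketch of the indexing idea using SciPy’s k-d tree implementation (illustration only; the random data below is hypothetical):

import numpy as np
from scipy.spatial import KDTree
from collections import Counter

X = np.random.rand(1000, 5)                  # 1000 training vectors with 5 continuous features
y = np.random.choice(["+", "-"], size=1000)  # their labels

tree = KDTree(X)                             # built once, at training time

def classify(x, k=5):
    _, idx = tree.query(x, k=k)              # indices of the k nearest neighbours
    return Counter(y[i] for i in idx).most_common(1)[0][0]

print(classify(np.random.rand(5)))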

Page 14: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

k-NN: Variations (2)

• k-NN can be used to approximate the value of a continuous function (regression) by taking the average function value of the k nearest neighbours (Bishop 95; Atkeson et al. 97)

• All training examples can be used to contribute to the classification by giving every example a vote that is weighted by the inverse square of its distance from the test example (Shepard 68). See the implementation details and parameters of TiMBL 4.1 (Daelemans et al. 01)
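A minimal sketch of such distance-weighted voting, with every training example voting with weight 1/d^2 (an illustration in the spirit of Shepard (1968), not TiMBL’s actual implementation):

from collections import defaultdict

def weighted_vote(train, test_instance, distance, eps=1e-9):
    # train: list of (feature_vector, label); eps avoids division by zero for exact matches
    votes = defaultdict(float)
    for x, label in train:
        d = distance(x, test_instance)
        votes[label] += 1.0 / (d * d + eps)
    return max(votes, key=votes.get)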

Page 15: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Feature Relevance

• The standard distance metrics weight each feature equally, which can cause problems if only a few of the features are relevant to the classification task, since the method could be misled by similarity along many irrelevant dimensions

• The problem of finding the “best” feature set (or a weighting of the features) is a general problem for ML induction algorithms

• Parenthesis:

– Wrapper vs. Filter feature selection methods

– Global vs. Local feature selection methods

Page 16: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Feature Relevance (2)

• Wrapper methods for feature selection generate a set of candidate features, run the induction algorithm with these features, use the accuracy of the resulting concept description to evaluate the feature set, and iterate (John et al. 94)

– Forward selection / Backward elimination (see the sketch after this slide)

• Filter methods for feature selection consider attributes independently of the induction algorithm that will use them. They use an inductive bias that can be entirely different from the bias employed in the induction learning algorithm

– Information gain / RELIEFF / Decision Trees / etc.
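A sketch of the simplest wrapper strategy mentioned above, greedy forward selection; evaluate(features) stands for any function that trains the learner on the given feature subset and returns held-out accuracy (a placeholder, not a real API):

def forward_selection(all_features, evaluate):
    selected, best_score = [], 0.0
    improved = True
    while improved:
        improved = False
        for f in all_features:
            if f in selected:
                continue
            score = evaluate(selected + [f])    # wrapper: the learner itself scores the subset
            if score > best_score:
                best_score, best_feature = score, f
                improved = True
        if improved:
            selected.append(best_feature)       # add the single best feature found in this pass
    return selected, best_score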

Page 17: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Feature Relevance (3)

• Global feature weighting methods compute a single weight vector for the classification task (e.g. Information Gain in the IGTREE algorithm; see the sketch after this slide)

• Local feature weighting methods allow feature weights to vary for each training instance, for each test instance, or both (Wettschereck et al. 97) (e.g. the value-difference metric)
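For illustration, a sketch of global feature weighting with information gain, one weight per feature computed from the labelled training examples; such weights can then scale each feature’s contribution to the distance, as in IB1-IG:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, feature_index):
    # examples: list of (feature_vector, label)
    labels = [y for _, y in examples]
    base = entropy(labels)
    remainder = 0.0
    for value, count in Counter(x[feature_index] for x, _ in examples).items():
        subset = [y for x, y in examples if x[feature_index] == value]
        remainder += (count / len(examples)) * entropy(subset)
    return base - remainder

def ig_weights(examples, n_features):
    return [information_gain(examples, i) for i in range(n_features)]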

Page 18: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Linguistic Biases and Feature Weighting

• Assign weights based on linguistic or cognitive preferences (Cardie 96). Use of a priori domain knowledge for the Information Extraction task

– recency bias: assign higher weights to features that represent temporally recent information

– focus of attention bias: assign higher weights to features that correspond to words or constituents in focus (e.g. the subject of a sentence)

– restricted memory bias: keep the N features with the highest weights

Page 19: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Linguistic Biases and Feature Weighting (2)

• Use cross-validation to determine which biases apply to the task and which actual weights to assign

• Hybrid filter-wrapper approach

– Wrapper approach for bias selection

– Selected biases direct feature weighting using a filter approach

Page 20: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example Storage

• Some algorithms only store a subset of the most informative training examples in order to focus the system and to make it more efficient: instance editing

• Edit superfluous regular instances, e.g. the “exemplar growing” IB2 algorithm (Aha et al. 91):

initialize the set of stored examples to the empty set
for each example in the training set do
    if the example is correctly classified by picking the nearest neighbour
       among the currently stored examples
    then do not store the example
    else add it to the set of stored examples
end for
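A runnable Python rendering of the IB2 pseudocode above (a sketch assuming a distance function over feature vectors, not the original IBL code):

def ib2(training_set, distance):
    # training_set: list of (feature_vector, label) pairs
    stored = []
    for x, y in training_set:
        if stored:
            nearest_label = min(stored, key=lambda ex: distance(ex[0], x))[1]
        else:
            nearest_label = None        # nothing stored yet, so the example counts as misclassified
        if nearest_label != y:
            stored.append((x, y))       # keep only the examples the current store gets wrong
    return stored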

Page 21: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example Storage (2)

• In some experimental tests this approach (IB2) works as well as -if not better than- storing all examples

• Edit unproductive exceptions, e.g. the IB3 algorithm (Aha et al. 93): delete all instances that are bad class predictors for their neighbourhood in the training set

• There is evidence that keeping all training instances is best in exemplar-based approaches to NLP (Daelemans et al. 99)

Page 22: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

EBL: Summary

• EBL methods base decisions on similarity to specific past instances rather than constructing abstractions

• Exemplar-based methods abandon the goal of maintaining concept “simplicity”

• Consequently, they trade decreased learning time for increased classification time

• They store all examples and, thus, all exceptions too. This seems to be very important for NLP tasks: “Forgetting Exceptions is Harmful in NL Learning” (Daelemans et al. 99)

Page 23: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

EBL: Summary

• Important issues are:

– Defining appropriate distance metrics

– Representing example attributes appropriately

– Efficient indexing of training cases

– Handling irrelevant features

Page 24: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Exemplar-Based Learning and NLP

• Stress acquisition (Daelemans et al. 94)

• Grapheme-to-phoneme conversion (van den Bosch & Daelemans 93; Daelemans et al. 99)

• Morphology (van den Bosch et al. 96)

• POS Tagging (Daelemans et al. 96; van Halteren et al. 98)

• Domain-specific lexical tagging (Cardie 93)

• Word sense disambiguation (Ng & Lee 96; Mooney 96; Fujii et al. 98; Escudero et al. 00)

Page 25: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Exemplar-Based Learning and NLP

• PP-attachment disambiguation (Zavrel et al. 97; Daelemans et al. 99)

• Partial parsing and NP extraction (Veenstra 98; Argamon et al. 98; Cardie & Pierce 98; Daelemans et al. 99)

• Context-sensitive parsing (Simmons & Yu 92)

• Text Categorization (Riloff & Lehnert 94; Yang & Chute 94; Yang 99)

• etc.

Page 26: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

EBL and NLP: Software

• TiMBL: Tilburg Memory Based Learner (Daelemans et al. 1998-2002)

• From the ILK (Induction of Linguistic Knowledge) group

• Version 4.1 (new): http://ilk.kub.nl/software.html

Page 27: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

IGTREE (Daelemans et al. 97)

• IGTREE approach: uses trees for compression and classification in lazy learning algorithms

• IGTREEs are “equivalent” to the Decision Trees that would be inductively acquired with a simple, pre-calculated function for attribute selection

– Compression of the base of examples = TDIDT

– The attribute ordering is computed only at the root node, using Quinlan’s Gain Ratio, and is kept constant during TDIDT

– A final pruning step is performed in order to compress the information even more

• Example 1: IGTREEs applied to POS tagging (…) (Daelemans et al. 96)
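A rough sketch of the IGTREE idea under simplifying assumptions (an illustration added here, not the published algorithm; it ignores the final pruning step): features are tested in one fixed order, each node stores a default class, and classification backs off to the default of the deepest matching node:

from collections import Counter

def build_igtree(examples, feature_order):
    # examples: list of (feature_vector, label); feature_order: feature indices, e.g. sorted by gain ratio
    node = {"default": Counter(y for _, y in examples).most_common(1)[0][0], "children": {}}
    if not feature_order or len(set(y for _, y in examples)) == 1:
        return node
    f, rest = feature_order[0], feature_order[1:]
    groups = {}
    for x, y in examples:
        groups.setdefault(x[f], []).append((x, y))
    for value, subset in groups.items():
        node["children"][value] = build_igtree(subset, rest)
    return node

def igtree_classify(tree, x, feature_order):
    node = tree
    for f in feature_order:
        child = node["children"].get(x[f])
        if child is None:
            break                       # back off to the default class of the deepest match
        node = child
    return node["default"]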

Page 28: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 2: Forgetting Exceptions is Harmful in Natural Language Learning (Daelemans et al. 99)

– Benchmark problems: grapheme-to-phoneme conversion with stress assignment (GS), part-of-speech tagging (POS), prepositional phrase attachment disambiguation (PP), and base noun phrase chunking (NP)

– IB1-IG: editing superfluous regular instances and unproductive exceptions. Measures: (1) typicality; (2) class-prediction strength

– Editing exceptions is harmful for IB1-IG in all benchmark problems

Page 29: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 2: Forgetting Exceptions is Harmful in Natural Language Learning (Daelemans et al. 99)

Page 30: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 2: Forgetting Exceptions is Harmful in Natural Language Learning (Daelemans et al. 99)

Page 31: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 2: Forgetting Exceptions is Harmful in Natural Language Learning (Daelemans et al. 99)

– Methods applied: IB1-IG < IGTREE < C5.0

– Forgetting by decision-tree learning (abstracting) can be harmful in language learning

– IB1-IG “outperforms” IGTREE and C5.0 on the benchmark problems

Page 32: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 2: Forgetting Exceptions is Harmful in Natural Language Learning (Daelemans et al. 99)

– The decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness)

– Why is forgetting exceptions harmful?

Page 33: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 3: Word Sense Disambiguation (Escudero et al. 00)

• Semantic ambiguity (lexical level): WSD

“He was shot in the hand as he chased the robbers in the back street” (The Wall Street Journal Corpus)

hand: body-part vs. clock-part

Page 34: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 3: Word Sense Disambiguation (Escudero et al. 00)

• Corpus:

– (Ng 96,97), available through LDC

– 192,800 examples (full sentences) of the 191 most ambiguous English words (121 nouns and 70 verbs)

– Avg. number of senses: 7.2 per noun, 12.6 per verb, 9.2 overall

• Features (positions are relative to the target word):

– SetA (local context): w-2, w-1, w+1, w+2, (w-2,w-1), (w-1,w+1), (w+1,w+2)

– SetB (broader context): w-1, w+1, (w-2,w-1), (w-1,w+1), (w+1,w+2), (w-3,w-2,w-1), (w-2,w-1,w+1), (w-1,w+1,w+2), (w+1,w+2,w+3), the part-of-speech tags p-3, p-2, p-1, p+1, p+2, p+3, and the content words c1, …, cn of the sentence (topical context)

Page 35: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 3: Word Sense Disambiguation (Escudero et al. 00)

Comparative Study

• Motivation: inconsistent previous results

• Metrics:

– Hamming distance (h)

– Modified Value-Difference Metric, MVDM (cs) (Cost & Salzberg 93)

• Variants:

– Example weighting (e)

– Attribute weighting (a), using the RLM distance for calculating attribute weights (López de Mántaras 91)

Page 36: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 3: Word Sense Disambiguation (Escudero et al. 00)

• Results: SetA on a reduced set of 15 words

        MFS    NB     EB(h,1)  EB(h,7)  EB(h,15,e)  EB(h,7,a)  EB(h,7,e,a)  EB(cs,1)  EB(cs,10)  EB(cs,10,e)
nouns   57.4   71.7   65.8     70.0     71.1        72.1       72.6         70.6      73.6       73.7
verbs   46.6   57.6   51.1     56.3     58.1        56.4       58.1         55.9      60.3       60.5
all     53.3   66.4   60.2     64.8     66.2        66.1       67.2         65.0      68.6       68.7

time           00:07  00:08  00:11  09:56

Page 37: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 3: Word Sense Disambiguation (Escudero et al. 00)

• Naive Bayes

Page 38: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 3: Word Sense Disambiguation (Escudero et al. 00)

• Results: SetB on a reduced set of 15 words

        MFS    NB     EB(h,15)
nouns   57.4   72.2   64.3
verbs   46.6   55.2   43.0
all     53.3   65.8   56.2

time           16:13  06:04

• There must be an experimental error. Or not?

Page 39: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 3: Word Sense Disambiguation (Escudero et al. 00)

• Results: SetB on a reduced set of 15 words

        MFS    NB     EB(h,15)
nouns   57.4   72.2   64.3
verbs   46.6   55.2   43.0
all     53.3   65.8   56.2

time           16:13  06:04

• The binary representation of topical attributes is not appropriate for either the NB or the EB algorithm

Page 40: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 3: Word Sense Disambiguation (Escudero et al. 00)

• Proposal for EB:

– Represent all topical binary attributes as a single multivalued attribute, containing the set of all content words that appear in the example (sentence)

– Set the similarity measure between two values of this attribute to the matching coefficient

• P(ositive)EB
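A small sketch of the proposed representation (illustration only): the matching coefficient between the two content-word sets is the size of their intersection, and one simple, assumed way of folding it into the overall score is to subtract it from the overlap distance on the local features; the exact combination used by Escudero et al. (00) may differ:

def matching_coefficient(words1, words2):
    # similarity between two values of the multivalued "topical" attribute
    return len(set(words1) & set(words2))

def peb_score(local1, words1, local2, words2):
    # local1, local2: local symbolic feature vectors; lower score = more similar
    overlap_dist = sum(1 for a, b in zip(local1, local2) if a != b)
    return overlap_dist - matching_coefficient(words1, words2)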

Page 41: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 3: Word Sense Disambiguation (Escudero et al. 00)

• Results of the Positive versions: SetB on a reduced set of 15 words

        MFS    NB     PNB    EB(h,15)  PEB(h,1)  PEB(h,7)  PEB(h,7,e)  PEB(h,7,a)  PEB(h,10,e,a)  PEB(cs,1)  PEB(cs,10)  PEB(cs,10,e)
nouns   57.4   72.2   72.4   64.3      70.6      72.4      73.7        72.5        73.4           73.2       75.4        75.6
verbs   46.6   55.2   55.3   43.0      54.7      57.7      59.5        58.9        60.2           58.6       61.9        62.1
all     53.3   65.8   66.0   56.2      64.6      66.8      68.4        67.4        68.4           67.7       70.3        70.5

time           16:13  00:12  06:04  00:25  03:55  49:43

Page 42: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

Example 3: Word Sense Disambiguation (Escudero et al. 00)

• Results of the Positive versions: on the whole corpus (191 words)

                SetA                     SetB
        MFS     PNB    PEBh   PEBcs     PNB    PEBh   PEBcs
nouns   56.4    68.7   68.5   70.2      69.2   70.1   -
verbs   48.7    64.8   65.3   66.4      63.4   67.0   -
all     53.2    67.1   67.2   68.6      66.8   68.8   -

time            00:33  00:47  92:22     01:06  01:46  -

Page 43: ML: Classical methods from AI Decision-Tree induction Exemplar-based Learning Rule Induction TBEDL

EBL: Some Final Comments

• Advantages

– Well known and well studied approach

– Extremely simple learning algorithm

– Robust in the face of redundant features

– It stores all examples and, thus, all exceptions too. This seems to be very important for NLP tasks: “Forgetting Exceptions is Harmful in NL Learning” (Daelemans et al. 99)

• Drawbacks

– Sensitive to irrelevant features

– Computational requirements: storing the base of exemplars, calculation of complex metrics, classification cost (needs efficient indexing and retrieval algorithms), etc.