Dependency Parser for Swedish Project for EDA171 by Jonas Pålsson Marcus Stamborg.


Page 1:

Dependency Parser for Swedish

Project for EDA171

by Jonas Pålsson

Marcus Stamborg

Page 2:

Dependency Grammar

Describes relations between words in a sentence.
A relation holds between a head and its dependent(s).
All words have a head except the root of the sentence.

[Figure: dependency tree for "The big brown beaver"]
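One common way to store such a tree is a map from each word to its head, with the root attached to an artificial index 0. The sketch below is only an illustration: the slide's figure did not survive, so the arcs (with "beaver" as root and the other three words as its dependents) are an assumption, and the helper names are hypothetical.

```python
# Words are numbered from 1; index 0 is reserved for an artificial root.
words = ["<root>", "The", "big", "brown", "beaver"]

# heads[i] is the index of the head of word i.
# ASSUMPTION: "beaver" is the root and heads the three other words.
heads = {1: 4, 2: 4, 3: 4, 4: 0}


def dependents(head, heads):
    """Return the indices of all words whose head is `head`, in word order."""
    return sorted(d for d, h in heads.items() if h == head)


def root_words(heads):
    """Return the words attached to the artificial root (normally one)."""
    return dependents(0, heads)


print([words[i] for i in dependents(4, heads)])
print([words[i] for i in root_words(heads)])
```

Every word appears exactly once as a key, which encodes the rule that each word has a single head.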

Page 3:

Dependency Parsing

Find the links that connect words using a computer. Different algorithms exist. Nivre's parser has the best reported results for Swedish.

Page 4:

Nivre's Parser

Extension to Shift-Reduce.
Adds arcs between input and stack.
Produces a dependency graph using the following actions:
Shift - moves the input to the stack.
Reduce - pops the stack.
Left arc - creates an arc from input to stack.
Right arc - creates an arc from stack to input.
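The four actions can be sketched as operations on a small parser state. This is a minimal illustration, not the project's code: the `State` class and function names are hypothetical, and two details not spelled out on the slide are assumed from the standard formulation of the algorithm (Nivre, 2004), namely that Left arc also pops the stack and Right arc also pushes the input word.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    stack: list                                # partially processed words
    queue: list                                # remaining input words
    heads: dict = field(default_factory=dict)  # dependent -> head


def shift(s):
    """Move the next input word onto the stack."""
    s.stack.append(s.queue.pop(0))


def reduce(s):
    """Pop the stack; only valid if the top word already has a head."""
    assert s.stack[-1] in s.heads
    s.stack.pop()


def left_arc(s):
    """Make the next input word the head of the stack top, then pop it."""
    assert s.stack[-1] not in s.heads
    s.heads[s.stack.pop()] = s.queue[0]


def right_arc(s):
    """Make the stack top the head of the next input word, then push it."""
    s.heads[s.queue[0]] = s.stack[-1]
    s.stack.append(s.queue.pop(0))


# Three-word sentence with words numbered 1..3.
state = State(stack=[], queue=[1, 2, 3])
for action in (shift, left_arc, shift, right_arc):
    action(state)
print(state.heads)
```

Word 2 never receives a head in this run, so it ends up as the root of the sentence.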

Page 5:

More about actions

Nivre, J. (2004)

Page 6:

Corpus

Talbanken05 – modernized and computerized version of Talbanken76.
Modified for use in the CoNLL-X Shared Task.
The training set is about 11500 sentences.
We used a test set containing about 300 sentences.

Example from the corpus:

1 Jag _ PO PO _ 2 SS _ _
2 tycker _ VV VV _ 0 ROOT _ _
3 det _ PO PO _ 2 OO _ _
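Each token line in the CoNLL-X format has ten columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD and PDEPREL. A minimal reader for one sentence might look like the sketch below; the `Token` type and function name are hypothetical, and splitting on any whitespace is an assumption that covers both the tab-separated files and the space-separated example on the slide.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    cpostag: str
    postag: str
    head: int
    deprel: str


def parse_conll_sentence(text):
    """Parse one sentence given as one token per line."""
    tokens = []
    for line in text.strip().splitlines():
        cols = line.split()
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Keep only the columns the parser needs; "_" marks an empty field.
        tokens.append(Token(int(cols[0]), cols[1], cols[3], cols[4],
                            int(cols[6]), cols[7]))
    return tokens


SAMPLE = """\
1 Jag _ PO PO _ 2 SS _ _
2 tycker _ VV VV _ 0 ROOT _ _
3 det _ PO PO _ 2 OO _ _
"""

sentence = parse_conll_sentence(SAMPLE)
for tok in sentence:
    print(tok.id, tok.form, tok.postag, "head =", tok.head, tok.deprel)
```

A HEAD value of 0 marks the root of the sentence, here the verb "tycker".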

Page 7:

How we did it

Three steps: collect data, build model, parse.

Collect data: Train Corpus -> ARFFBuilder -> Data
Build model: Data -> Trainer -> Trained Classifier
Parse: Test Corpus + Trained Classifier -> Parser -> Test Corpus with relations

Page 8:

Collect data – Gold Standard Parsing

Build a Weka-compatible data file (ARFF).
Determining the action sequence from an annotated corpus is possible using the following rules (Gold Standard Parsing):

If input has stack as head -> Right Arc
else if stack has input as head -> Left Arc
else if arc exists between input and any word in stack -> Reduce
else Shift
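The rules above can be sketched as a function that replays a sentence and records one action per step. This is an illustration under stated assumptions rather than the project's implementation: the function name is hypothetical, the tree is assumed to be projective, and Left arc is assumed to pop the stack while Right arc pushes the input word, as in the standard formulation of the algorithm.

```python
def gold_actions(heads):
    """Derive the action sequence for one annotated sentence.

    `heads` maps each word index (1..n) to its gold head; 0 marks the root.
    """
    stack, queue, actions = [], sorted(heads), []
    while queue:
        nxt = queue[0]
        top = stack[-1] if stack else None
        if top is not None and heads[nxt] == top:
            # Input has stack as head.
            actions.append("right-arc")
            stack.append(queue.pop(0))
        elif top is not None and heads[top] == nxt:
            # Stack has input as head.
            actions.append("left-arc")
            stack.pop()
        elif any(heads[nxt] == w or heads[w] == nxt for w in stack):
            # An arc links the input to a word deeper in the stack.
            actions.append("reduce")
            stack.pop()
        else:
            actions.append("shift")
            stack.append(queue.pop(0))
    return actions


# Heads taken from the corpus example "Jag tycker det".
print(gold_actions({1: 2, 2: 0, 3: 2}))
```

Each recorded action, paired with the features of the state it was taken in, becomes one training instance for the classifier.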

Page 9:

Train classifier

Weka 3 – Data mining software.
C4.5 (J48) – Extension to the ID3 algorithm. Generates decision trees.
Uses features derived from the current state of the parser.
Outputs a trained classifier used by the parser to decide the next action.
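Weka reads training data in the ARFF format: a `@relation` line, one `@attribute` line per feature, and a `@data` section with one comma-separated instance per line. The sketch below writes such a file for nominal features. The attribute names, the `nil` placeholder and the short action codes are assumptions made for illustration; the actual layout of the project's ARFF file is not given in the slides.

```python
ACTIONS = ["SH", "RE", "LA", "RA"]  # shift, reduce, left arc, right arc


def to_arff(relation, attributes, rows):
    """Render nominal attributes and instances as ARFF text.

    `attributes` is a list of (name, allowed_values) pairs, with the class
    attribute last. `rows` holds one list of values per instance.
    """
    lines = ["@relation " + relation, ""]
    for name, values in attributes:
        lines.append("@attribute " + name + " {" + ",".join(values) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


POS_TAGS = ["PO", "VV", "nil"]  # "nil" stands for an empty stack or input
attributes = [
    ("stack_1_pos", POS_TAGS),
    ("input_1_pos", POS_TAGS),
    ("action", ACTIONS),
]
# States visited while replaying "Jag tycker det" with its gold actions.
rows = [
    ["nil", "PO", "SH"],
    ["PO", "VV", "LA"],
    ["nil", "VV", "SH"],
    ["VV", "PO", "RA"],
]
print(to_arff("parser_actions", attributes, rows))
```

The last attribute is the class that the decision tree learns to predict from the others.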

Page 10:

Parse using trained classifier

Uses the trained classifier to determine the head for each word in a sentence.
Uses Nivre's algorithm with the action decided by the classifier.
Calculates the score as:

score = (number of words assigned the correct head) / (total number of words)
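The parsing loop and the score can be sketched as follows. The `choose_action` callback stands in for the trained classifier and is purely hypothetical, as are the function names. Falling back to Shift when the predicted action is not applicable, and attaching headless words to an artificial root 0, are assumptions made to keep the sketch total; the slides do not say how the project handled these cases.

```python
def parse(n_words, choose_action):
    """Parse a sentence of `n_words` words and return {dependent: head}.

    `choose_action(stack, queue, heads)` must return one of
    "shift", "reduce", "left-arc" or "right-arc".
    """
    stack, queue, heads = [], list(range(1, n_words + 1)), {}
    while queue:
        action = choose_action(stack, queue, heads)
        if action == "right-arc" and stack:
            heads[queue[0]] = stack[-1]
            stack.append(queue.pop(0))
        elif action == "left-arc" and stack and stack[-1] not in heads:
            heads[stack.pop()] = queue[0]
        elif action == "reduce" and stack and stack[-1] in heads:
            stack.pop()
        else:
            # ASSUMPTION: shift when the predicted action is not applicable.
            stack.append(queue.pop(0))
    # ASSUMPTION: words left without a head attach to the artificial root 0.
    return {w: heads.get(w, 0) for w in range(1, n_words + 1)}


def attachment_score(predicted, gold):
    """Fraction of words that were assigned the correct head."""
    correct = sum(1 for w in gold if predicted.get(w) == gold[w])
    return correct / len(gold)


gold = {1: 2, 2: 0, 3: 2}
script = iter(["shift", "left-arc", "shift", "right-arc"])
predicted = parse(3, lambda stack, queue, heads: next(script))
print(predicted, attachment_score(predicted, gold))
```

Here a scripted action sequence replaces the classifier; a chooser that always answers "shift" would attach every word to the root and get only the real root right.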

Page 11:

Features

All features describe the current state of the parser.
1st set – Input and stack.
2nd set – Input, stack and children.
3rd set – Input, stack and previous input.
4th set – Input, stack, children and previous input.
We only used POS in the feature sets; using lexical values actually decreased performance.
For every set we used constraints to model valid actions in the current state of the parser.
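A sketch of what the first feature set and the action constraints might look like is given below. The feature names echo the Stack_n_POS and Input_n_POS labels used with the results, but the function names, the `n_stack` and `n_input` parameters and the `nil` placeholder are hypothetical. The constraints follow the standard preconditions of the algorithm; how the project actually encoded them is not stated in the slides.

```python
NIL = "nil"  # placeholder when the stack or input has too few words


def pos_features(stack, queue, pos, n_stack=2, n_input=2):
    """POS tags of the top `n_stack` stack words and next `n_input` inputs."""
    feats = {}
    for k in range(n_stack):
        word = stack[-1 - k] if k < len(stack) else None
        feats[f"stack_{k + 1}_pos"] = pos[word] if word is not None else NIL
    for k in range(n_input):
        word = queue[k] if k < len(queue) else None
        feats[f"input_{k + 1}_pos"] = pos[word] if word is not None else NIL
    return feats


def valid_actions(stack, queue, heads):
    """Actions that are applicable in the current parser state."""
    valid = []
    if queue:
        valid.append("shift")
        if stack:
            valid.append("right-arc")
            if stack[-1] not in heads:
                # Left arc would give the stack top a second head otherwise.
                valid.append("left-arc")
    if stack and stack[-1] in heads:
        # Reduce may only remove a word that already has a head.
        valid.append("reduce")
    return valid


pos = {1: "PO", 2: "VV", 3: "PO"}  # tags from the corpus example
print(pos_features([2], [3], pos))
print(valid_actions([2], [3], {1: 2}))
```

Restricting the classifier's choice to the valid actions keeps the parser from ever producing a word with two heads.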

Page 12:

Results

          Input
Stack     1       2       3       4       5       6
1         0.7161  0.8007  0.7972  0.7967  0.8036  0.8064
2         0.7268  0.8078  0.8055  0.8094  0.8136  0.8129
3         0.7275  0.8066  0.8076  0.8098  0.8129  0.8131
4         0.7300  0.8057  0.8076  0.8094  0.8096  0.8091
5         0.7309  0.8073  0.8071  0.8096  0.8101  0.8097
6         0.7307  0.8064  0.8071  0.8089  0.8092  0.8094

Scores using features: Stack_n_POS, Input_n_POS, Children

          Input
Stack     1       2       3       4       5       6
1         0.6936  0.7765  0.7804  0.7801  0.7779  0.7806
2         0.7297  0.7937  0.7970  0.7961  0.7958  0.7946
3         0.7300  0.7933  0.7963  0.7958  0.7940  0.7944
4         0.7309  0.7940  0.7967  0.7972  0.7960  0.7953
5         0.7327  0.7944  0.7974  0.7984  0.7969  0.7960
6         0.7313  0.7940  0.7972  0.7986  0.7965  0.7960

Scores using features: Stack_n_POS, Input_n_POS

Page 13:

Results cont.

          Input
Stack     1       2       3       4       5       6
1         0.7242  0.8022  0.8055  0.8052  0.8046  0.8050
2         0.7558  0.8156  0.8168  0.8179  0.8174  0.8182
3         0.7580  0.8152  0.8186  0.8184  0.8174  0.8184
4         0.7581  0.8158  0.8177  0.8184  0.8172  0.8175
5         0.7594  0.8167  0.8182  0.8186  0.8174  0.8177
6         0.7574  0.8161  0.8181  0.8177  0.8165  0.8172

Scores using features: Stack_n_POS, Input_n_POS, Children, Previous_Input_POS

          Input
Stack     1       2       3       4       5       6
1         0.7210  0.7999  0.8004  0.8002  0.8062  0.8076
2         0.7279  0.8064  0.8068  0.8108  0.8110  0.8142
3         0.7283  0.8068  0.8068  0.8101  0.8136  0.8138
4         0.7307  0.8068  0.8089  0.8106  0.8108  0.8105
5         0.7316  0.8068  0.8075  0.8103  0.8114  0.8114
6         0.7344  0.8064  0.8076  0.8101  0.8106  0.8108

Scores using features: Stack_n_POS, Input_n_POS, Previous_Input_POS

Page 14:

Conclusions

Lexical values did not help; the score even became worse. They might work better with a different classification algorithm or a different test corpus.

The previous input word was a very effective feature, probably the single best addition to the stack-and-input baseline.

It is difficult to find the optimal feature set.

Page 15:

Future improvements

Try other features: siblings, LEX on specific words, more words from the original input string.
Run simulations to find the optimum feature set.
Use SVM instead of C4.5.

Page 16:

Thank you for listening

More to come in the report