Page 1: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

An approach to unsupervised historical text normalisation

Petar Mitankin, Sofia University, FMI

Stefan Gerdjikov, Sofia University, FMI

Stoyan Mihov, Bulgarian Academy of Sciences, IICT

DATeCH 2014, Maye (May) 19 - 20, Madrid, Spain

Page 3: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Contents

● Supervised Text Normalisation

– CULTURA

– REBELS Translation Model

– Functional Automata

● Unsupervised Text Normalisation

– Unsupervised REBELS

– Experimental Results

– Future Improvements

Page 4: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Co-funded under the 7th Framework Programme of the European Commission

● Maye - 34 occurrences in the 1641 Depositions, 8022 documents, 17th century Early Modern English

● CULTURA: CULTivating Understanding and Research through Adaptivity

● Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI

Page 6: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Supervised Text Normalisation

● Manually created ground truth

– 500 documents from the 1641 Depositions

– All words: 205 291

– Normalised words: 51 133

● Statistical Machine Translation from historical language to modern language combines:

– Translation model

– Language model

Page 8: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

REgularities Based Embedding of Language Structures

shee -> REBELS Translation Model -> he / -1.89, se / -1.69, she / -9.75, shea / -10.04

Automatic Extraction of Historical Spelling Variations

Page 9: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Training of the REBELS Translation Model

● Training pairs from the ground truth:

(shee, she), (maye, may), (she, she),

(tyme, time), (saith, says), (have, have),

(tho:, thomas), ...

Page 10: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Training of the REBELS Translation Model

● Deterministic structure of all historical/modern subwords

● Each word has several hierarchical decompositions in the DAWG:

Hierarchical decomposition of each historical word

Hierarchical decomposition of each modern word

Page 11: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Training of the REBELS Translation Model

● For each training pair (knowth, knows) we find a mapping between the decompositions:

● We collect statistics about historical subword -> modern subword
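
As a rough illustration of collecting such statistics, the sketch below counts subword rewrites over a few of the training pairs listed earlier. It is not the DAWG-based decomposition mapping from the slides: it assumes a plain character alignment from Python's `difflib.SequenceMatcher` as a stand-in, and the function names `subword_rewrites` and `collect_statistics` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def subword_rewrites(historical, modern):
    """Yield (historical_subword, modern_subword) pairs from one alignment."""
    matcher = SequenceMatcher(a=historical, b=modern, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'equal' spans map a subword to itself; all other spans are rewrites
        # (an empty string on either side means an insertion or a deletion).
        yield historical[i1:i2], modern[j1:j2]


def collect_statistics(training_pairs):
    """Count how often each historical subword maps to each modern subword."""
    counts = Counter()
    for historical, modern in training_pairs:
        counts.update(subword_rewrites(historical, modern))
    return counts


pairs = [("shee", "she"), ("maye", "may"), ("tyme", "time"), ("knowth", "knows")]
stats = collect_statistics(pairs)
for (hist_sub, mod_sub), count in stats.most_common():
    print(repr(hist_sub), "->", repr(mod_sub), count)
```

With these four pairs, the dropped final "e" of "shee" and "maye" is counted twice, which is the kind of regularity a translation model can reuse for unseen words.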

Page 12: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

REgularities Based Embedding of Language Structures

shee -> REBELS Translation Model -> he / -1.89, se / -1.69, she / -9.75, shea / -10.04

REBELS generates normalisation candidates for unseen historical words

Page 13: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

shee knowth me

shee -> REBELS

knowth -> REBELS

me -> REBELS

Page 14: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

relevance score (he knuth my) = REBELS TM (he knuth my) * C_tm + Statistical Language Model (he knuth my) * C_lm

Combination of REBELS with Statistical Bigram Language Model

● Bigram Statistical Model

– Smoothing: Absolute Discounting, Backing-off

– Gutenberg English language corpus
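
A minimal sketch of how such a combined score could be computed, under several assumptions that are not in the slides: the sentence-level translation score is taken as the sum of per-word log scores, the three-sentence corpus and the per-word scores are made up, the weights are set to 1.0, and `BigramLM` and `relevance_score` are hypothetical names. The language model is a small bigram model with absolute discounting that backs off to add-one-smoothed unigrams.

```python
import math
from collections import Counter, defaultdict


class BigramLM:
    """Toy bigram model: absolute discounting with back-off to unigrams."""

    def __init__(self, sentences, discount=0.5):
        self.discount = discount
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split()
            self.unigrams.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[prev][word] += 1
        self.total = sum(self.unigrams.values())
        self.vocab_size = len(self.unigrams) + 1  # +1 slot for unknown words

    def unigram_prob(self, word):
        # Add-one smoothing keeps unseen words at a non-zero probability.
        return (self.unigrams[word] + 1) / (self.total + self.vocab_size)

    def prob(self, prev, word):
        followers = self.bigrams.get(prev)
        if not followers:
            return self.unigram_prob(word)  # unseen context: back off fully
        context_count = sum(followers.values())
        if word in followers:
            return (followers[word] - self.discount) / context_count
        # Probability mass freed by discounting goes to unseen followers.
        freed = self.discount * len(followers) / context_count
        seen_mass = sum(self.unigram_prob(w) for w in followers)
        return freed * self.unigram_prob(word) / (1.0 - seen_mass)

    def log_prob(self, words):
        tokens = ["<s>"] + list(words)
        return sum(math.log(self.prob(p, w)) for p, w in zip(tokens, tokens[1:]))


def relevance_score(tm_log_scores, candidate, lm, c_tm, c_lm):
    """Weighted sum of translation-model and language-model log scores."""
    return c_tm * sum(tm_log_scores) + c_lm * lm.log_prob(candidate)


# Made-up modern corpus and made-up translation-model scores.
lm = BigramLM(["she knows me", "he knows my name", "she knows my name"])
score_she = relevance_score([-1.0, -1.0, -1.0], ["she", "knows", "me"], lm, 1.0, 1.0)
score_he = relevance_score([-1.0, -1.0, -1.0], ["he", "knuth", "my"], lm, 1.0, 1.0)
print(score_she, score_he)
```

With identical translation scores the language model decides, and "she knows me" outranks "he knuth my" because "knuth" never follows "he" in the toy corpus.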

Page 15: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Functional Automata

L(C_tm, C_lm) is represented with Functional Automata

Page 16: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Automatic Construction of Functional Automaton For The Partial Derivative w.r.t. x

L(C_tm, C_lm) is optimised with the Conjugate Gradient method
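
The slides do not give the form of L, so the following is only an illustration of the conjugate gradient iteration itself. It assumes a hypothetical convex quadratic in two weights as a stand-in for L (the matrix `A` and vector `b` are made up) and uses the standard linear conjugate gradient method; the objective in the talk is built from functional automata and will differ.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, x0, tol=1e-12, max_iter=100):
    """Minimise 0.5 * x.A.x - b.x for a symmetric positive-definite A."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]  # negative gradient
    p = list(r)                                       # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old <= tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                   # exact step along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        # The next direction is conjugate to the previous ones with respect to A.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical stand-in objective over the two weights (C_tm, C_lm).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
c_tm, c_lm = conjugate_gradient(A, b, x0=[2.0, 1.0])
print(c_tm, c_lm)
```

For a two-parameter quadratic the iteration reaches the minimiser in at most two steps, which is why the method suits objectives with a small number of weights.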

Page 17: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Supervised Text Normalisation

REBELS Translation Model

Search Module Based on Functional Automata

Ground Truth

Training Module Based on Functional Automata

Historical text

Normalised text

Page 18: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Unsupervised Text Normalisation

REBELS Translation Model

Unsupervised Generation of Training Pairs (knoweth, knows)

Historical text

Normalised text

Search Module Based on Functional Automata

Page 19: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Unsupervised Generation of the Training Pairs

● We use similarity search to generate training pairs:

– For each historical word H:

  ● If H is a modern word, then generate (H,H), else

  ● Find each modern word M that is at Levenshtein distance 1 from H and generate (H,M). If no modern words are found, then

  ● Find each modern word M that is at distance 2 from H and generate (H,M). If no modern words are found, then

  ● Find each modern word M that is at distance 3 from H and generate (H,M).

  ● If more than 6 modern words were generated for H, then do not use the corresponding pairs for training.

Page 23: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Unsupervised Generation of the Training Pairs

● We use similarity search to generate training pairs:

– For each historical word H:

  ● If H is a modern word, then generate (H,H), else

  ● Find each modern word M that is at Levenshtein distance 1 from H and generate (H,M). If no modern words are found, then

  ● Find each modern word M that is at distance 2 from H and generate (H,M). If no modern words are found, then

  ● Find each modern word M that is at distance 3 from H and generate (H,M).

  ● If too many (> 5) modern words were generated for H, then do not use the corresponding pairs for training.
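
The procedure above can be sketched as follows. This is a minimal version under stated assumptions: a brute-force scan over a small, made-up modern lexicon stands in for the similarity search, the names `levenshtein` and `generate_training_pairs` are hypothetical, and the cut-off is left as the parameter `max_candidates` because the slides state it both as "more than 6" and as "> 5".

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


def generate_training_pairs(historical_words, modern_lexicon, max_candidates=5):
    """Pair each historical word with its nearest modern words (distance <= 3)."""
    pairs = []
    for h in historical_words:
        if h in modern_lexicon:
            pairs.append((h, h))
            continue
        for distance in (1, 2, 3):
            matches = sorted(m for m in modern_lexicon
                             if levenshtein(h, m) == distance)
            if matches:
                # Too many equally close modern words: H is too ambiguous,
                # so its pairs are left out of the training data.
                if len(matches) <= max_candidates:
                    pairs.extend((h, m) for m in matches)
                break  # only the smallest distance with matches is used
    return pairs


lexicon = {"she", "may", "time", "knows", "know", "me"}
words = ["shee", "maye", "tyme", "me", "knoweth"]
training_pairs = generate_training_pairs(words, lexicon)
print(training_pairs)
```

On this toy input "knoweth" has no modern word within distance 1 or 2, so it is paired with both "know" and "knows" at distance 3; the wrong pair is the kind of noise the later training step has to tolerate.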

Page 24: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Normalisation of the 1641 Depositions. Experimental results

Method  Generation of REBELS Training Pairs  Spelling Probabilities  Language Model      Accuracy  BLEU
1       ----                                 ----                    ----                75.59     50.31
2       Unsupervised                         NO                      YES                 67.84     45.52
3       Unsupervised                         YES                     NO                  79.18     56.55
4       Unsupervised                         YES                     YES                 81.79     61.88
5       Unsupervised                         Supervised Trained      Supervised Trained  84.82     68.78
6       Supervised                           Supervised Trained      Supervised Trained  93.96     87.30

Page 25: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Future Improvement

REBELS Translation Model

Unsupervised Generation of Training Pairs (knoweth, knows) with probabilities

Historical text

Normalised text

Search Module Based on Functional Automata

MAP Training Module

Page 26: Datech2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

Thank You!

Comments / Questions?

ACKNOWLEDGEMENTS

The reported research work is supported by the project CULTURA, grant 269973, funded by the FP7 Programme and the project AComIn, grant 316087, funded by the FP7 Programme.