DATeCH 2014 - Session 2 - An approach to Unsupervised Historical Text Normalisation

An approach to unsupervised historical text normalisation

Petar Mitankin, Sofia University, FMI

Stefan Gerdjikov, Sofia University, FMI

Stoyan Mihov, Bulgarian Academy of Sciences, IICT

DATeCH 2014, Maye 19 - 20, Madrid, Spain ("Maye" normalised to "May")

Contents

● Supervised Text Normalisation

– CULTURA

– REBELS Translation Model

– Functional Automata

● Unsupervised Text Normalisation

– Unsupervised REBELS

– Experimental Results

– Future Improvements

Co-funded under the 7th Framework Programme of the European Commission

● Maye - 34 occurrences in the 1641 Depositions, 8022 documents, 17th century Early Modern English

● CULTURA: CULTivating Understanding and Research through Adaptivity

● Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI

Supervised Text Normalisation

● Manually created ground truth

– 500 documents from the 1641 Depositions

– All words: 205 291

– Normalised words: 51 133

● Statistical Machine Translation from historical language to modern language combines:

– Translation model

– Language model

REgularities Based Embedding of Language Structures

shee → REBELS Translation Model → he / -1.89, se / -1.69, she / -9.75, shea / -10.04

Automatic Extraction of Historical Spelling Variations

Training of the REBELS Translation Model

● Training pairs from the ground truth:

(shee, she), (maye, may), (she, she),

(tyme, time), (saith, says), (have, have),

(tho:, thomas), ...


● Deterministic structure of all historical/modern subwords

● Each word has several hierarchical decompositions in the DAWG:

Hierarchical decomposition of each historical word

Hierarchical decomposition of each modern word


● For each training pair (knowth, knows) we find a mapping between the decompositions:

● We collect statistics about historical subword -> modern subword



REBELS generates normalisation candidates for unseen historical words

shee → REBELS, knowth → REBELS, me → REBELS

shee knowth me

relevance score (he knuth my) = REBELS TM (he knuth my) * C_tm + Statistical Language Model (he knuth my) * C_lm
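The weighted combination above can be sketched in a few lines of Python. This is only an illustration, assuming both models return log-scale scores (as the negative REBELS scores suggest). The function names and all numeric scores below are hypothetical and do not come from the REBELS system.

```python
from typing import List, Tuple

# A candidate is (normalised text, translation-model score, language-model score).
Candidate = Tuple[str, float, float]


def relevance_score(tm_score: float, lm_score: float,
                    c_tm: float, c_lm: float) -> float:
    """Weighted sum of the translation-model and language-model scores."""
    return tm_score * c_tm + lm_score * c_lm


def best_candidate(candidates: List[Candidate],
                   c_tm: float, c_lm: float) -> str:
    """Return the candidate text with the highest relevance score."""
    best = max(candidates,
               key=lambda c: relevance_score(c[1], c[2], c_tm, c_lm))
    return best[0]


# Hypothetical scores for two normalisations of "shee knowth me".
candidates = [
    ("he knuth my", -5.0, -20.0),   # cheap for the TM, unlikely for the LM
    ("she knows me", -12.0, -8.0),  # costlier for the TM, likely for the LM
]
```

With equal weights the language model outweighs the translation model's preference for the first candidate; with C_lm set to 0 the translation model decides alone. This is why the two weights need to be tuned.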

Combination of REBELS with Statistical Bigram Language Model

● Bigram Statistical Model

– Smoothing: Absolute Discounting, Backing-off

– Gutenberg English language corpus
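As an illustration of the smoothing named above, here is a minimal sketch of a bigram model with absolute discounting and back-off to unigrams. It is a textbook formulation, not the authors' implementation. The class name `BigramLM`, the discount value 0.5 and the toy sentences are assumptions made for the example.

```python
from collections import Counter, defaultdict


class BigramLM:
    """Bigram model with absolute discounting and back-off to unigrams."""

    def __init__(self, sentences, discount=0.5):
        self.d = discount                  # assumed discount, 0 < d < 1
        self.bigrams = Counter()           # c(v, w)
        self.context = Counter()           # c(v) = sum over w of c(v, w)
        self.words = Counter()             # counts of predicted words w
        self.followers = defaultdict(set)  # distinct words seen after v
        for sent in sentences:
            tokens = ["<s>"] + list(sent) + ["</s>"]
            for v, w in zip(tokens, tokens[1:]):
                self.bigrams[(v, w)] += 1
                self.context[v] += 1
                self.words[w] += 1
                self.followers[v].add(w)
        self.total = sum(self.words.values())

    def unigram(self, w):
        return self.words[w] / self.total

    def prob(self, v, w):
        """P(w | v); out-of-vocabulary words get probability 0 here."""
        c_v = self.context[v]
        if c_v == 0:                       # unknown history: unigram only
            return self.unigram(w)
        c_vw = self.bigrams[(v, w)]
        if c_vw > 0:                       # seen bigram: discounted estimate
            return (c_vw - self.d) / c_v
        # Unseen bigram: share the mass freed by discounting among the
        # words not seen after v, in proportion to their unigram probability.
        reserved = self.d * len(self.followers[v]) / c_v
        unseen_mass = 1.0 - sum(self.unigram(x) for x in self.followers[v])
        if unseen_mass <= 0.0:
            return 0.0
        return reserved * self.unigram(w) / unseen_mass


lm = BigramLM([["she", "knows", "me"],
               ["he", "knows", "me"],
               ["she", "may", "know"]])
```

For any history seen in training, the probabilities over the training vocabulary sum to 1 unless every vocabulary word has followed that history. A production model would also reserve mass for out-of-vocabulary words.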

Functional Automata

L(C_tm, C_lm) is represented with Functional Automata

Automatic Construction of Functional Automaton for the Partial Derivative w.r.t. x

L(C_tm, C_lm) is optimised with the Conjugate Gradient method

Supervised Text Normalisation

[Diagram components: Ground Truth; Training Module Based on Functional Automata; REBELS Translation Model; Search Module Based on Functional Automata; Historical text → Normalised text]

Unsupervised Text Normalisation


Unsupervised Generation of Training Pairs (knoweth, knows)

Historical text → Normalised text


Unsupervised Generation of the Training Pairs

● We use similarity search to generate training pairs:

– For each historical word H:

● If H is a modern word, then generate (H,H), else

● Find each modern word M that is at Levenshtein distance 1 from H and generate (H,M). If no modern words are found, then

● Find each modern word M that is at distance 2 from H and generate (H,M). If no modern words are found, then

● Find each modern word M that is at distance 3 from H and generate (H,M).

● If more than 6 modern words were generated for H, then do not use the corresponding pairs for training.

Unsupervised Generation of the Training Pairs

● We use similarity search to generate training pairs:

– For each historical word H:

● If H is a modern word, then generate (H,H), else

● Find each modern word M that is at Levenshtein distance 1 from H and generate (H,M). If no modern words are found, then

● Find each modern word M that is at distance 2 from H and generate (H,M). If no modern words are found, then

● Find each modern word M that is at distance 3 from H and generate (H,M).

● If too many (> 5) modern words were generated for H, then do not use the corresponding pairs for training.
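The procedure above can be sketched in Python as follows. This is a brute-force illustration, not the similarity search used by the authors, which is presumably far more efficient. The function names and the toy lexicon are hypothetical. Because the slides state the cutoff both as "more than 6" and as "> 5", it is left as the parameter `max_candidates`.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def generate_pairs(historical_words, modern_lexicon, max_candidates=6):
    """Generate (historical, modern) training pairs without ground truth."""
    pairs = []
    for h in historical_words:
        if h in modern_lexicon:          # already a modern word
            pairs.append((h, h))
            continue
        for distance in (1, 2, 3):       # widen the search step by step
            matches = sorted(m for m in modern_lexicon
                             if levenshtein(h, m) == distance)
            if matches:
                # Too many candidates means the word is too ambiguous to use.
                if len(matches) <= max_candidates:
                    pairs.extend((h, m) for m in matches)
                break                    # stop at the first non-empty distance
    return pairs


modern = {"she", "may", "time", "know", "knows", "says", "the", "shoe"}
pairs = generate_pairs(["shee", "maye", "tyme", "she", "xyzzyq"], modern)
```

In this toy run "shee" yields two pairs, because both "she" and "shoe" are at distance 1, while "xyzzyq" yields none because no lexicon word is within distance 3. Wrong pairs such as (shee, shoe) are expected at this stage.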

Normalisation of the 1641 Depositions. Experimental results

Method  Generation of REBELS Training Pairs  Spelling Probabilities  Language Model      Accuracy  BLEU
1       ----                                 ----                    ----                75.59     50.31
2       Unsupervised                         NO                      YES                 67.84     45.52
3       Unsupervised                         YES                     NO                  79.18     56.55
4       Unsupervised                         YES                     YES                 81.79     61.88
5       Unsupervised                         Supervised Trained      Supervised Trained  84.82     68.78
6       Supervised                           Supervised Trained      Supervised Trained  93.96     87.30

Future Improvement


Unsupervised Generation of Training Pairs (knoweth, knows) with probabilities

MAP Training Module

Historical text → Normalised text

Thank You!

Comments / Questions?

ACKNOWLEDGEMENTS

The reported research work is supported by the project CULTURA, grant 269973, funded by the FP7 Programme and the project AComIn, grant 316087, funded by the FP7 Programme.