Text Correction using Domain Dependent Bigram Models from Web Crawls

Text Correction using

Domain Dependent Bigram Models from

Web Crawls

Christoph Ringlstetter, Max Hadersbeck, Klaus U. Schulz, and Stoyan Mihov

Two recent goals of text correction

Use of

powerful language models

word frequencies, n-gram models, HMMs, probabilistic grammars, etc.

Keenan et al. 91, Srihari 93, Hong & Hull 95,Golding & Schabes 96,...

Use of

Document centric and

adaptive text correction

prefer words of the text as correction suggestions for unknown tokens.

Taghva & Stofsky 2001, Nartker et al. 2003, Rong Jin 2003, ...

Use of

Here: Use of document centric language models (bigrams)

Use of document centric bigram models

Wk-1 Wk+1Wk............. .............Text T:

Idea ill-formed

Wk-1 Wk+1Wk............. .............Text T:

Idea ill-formed

Wk-1 Wk+1Wk............. .............Text T:

correction candidates

Idea ill-formed

Wk-1 Wk+1Wk............. .............Text T:

Prefer those correction candidates V where bigrams Wk-1V and VWk+1"are natural, given the text T".

Idea ill-formed

Wk-1 Wk+1Vi............. .............Text T:

Idea ill-formed

Wk-1 Wk+1Vi............. .............Text T:

Idea ill-formed

Wk-1 Wk+1Vi............. .............Text T:

ProblemHow to measure "naturalness of a bigram, given a text"?

How to derive "natural" bigram models for a text?

• Counting bigram frequencies in text T?

Sparseness of bigrams: low chance to find bigrams repeated in T.

• Using a fixed background corpus (British National Corpus, Brown Corpus)?

Sparseness problem partially solved - but models not document centric!

Our suggestion

Using domain dependent terms from T, crawl a corpus C in the web thatreflects domain and vocabulary of T. Count bigram frequencies in C.

Correction Experiments

Text T

Text T 1. Extract domain specific terms (compounds).

2. Crawl a corpus C that reflects domain and vocabulary of T.

Dictionary D

Dictionary D3. For each pair of dictionary words UV, store the frequency of UV in C as a score s(U,V).

First experiment ("in isolation")

What is the correction accuracy reached when using s(U,V) as the single information for ranking correction suggestions?

First experiment ("in isolation")

What is the correction accuracy reached when using s(U,V) as the single information for ranking correction suggestions?

Second experiment ("in combination")

Which gain is obtained when adding s(U,V) as a new parameter to a sophisticated correction system using other scores as well?

Experiment 1: bigram scores "in isolation"

• Set of ill-formed output tokens of commercial OCR system.• Candidate sets for ill-formed tokens: dictionary entries with edit distance < 3.• Using s(U,V) as the single information for ranking correction suggestions.• Measured the percentage of correctly top-ranked correction suggestions.• Comparing bigram scores from web crawls, from BNC, from Brown Corpus.

Neurol. Fish Mushr. Holoc. Rom Botany

Crawl 64.5% 43.6% 54.8% 59.5% 48.2% 56.5%

BNC 46.8% 34.7% 41.8% 40.9% 37.5% 28.5%

Brown 38.2% 30.5% 36.4% 40.2% 37.0% 25.5%

Texts from 6 domains

Experiment 1: bigram scores "in isolation"

• Set of ill-formed output tokens of commercial OCR system.• Candidate sets for ill-formed tokens: dictionary entries with edit distance < 3.• Using s(U,V) as the single information for ranking correction suggestions.• Measured the percentage of correctly top-ranked correction suggestions.• Comparing bigram scores from web crawls, from BNC, from Brown Corpus.

Neurol. Fish Mushr. Holoc. Rom Botany

Crawl 64.5% 43.6% 54.8% 59.5% 48.2% 56.5%

BNC 46.8% 34.7% 41.8% 40.9% 37.5% 28.5%

Brown 38.2% 30.5% 36.4% 40.2% 37.0% 25.5%

Texts from 6 domains

Resumee: crawled bigram frequencies clearly better than those from static corpora.

Experiment 2: adding bigram scores to fully-fledged correction system

• Baseline: correction with length-sensitive Levenshtein distance and crawled word frequencies as two scores.

• Then adding bigram frequencies as a third score.

• Measuring the correction accuracy (percentage of correct tokens) reached with fully automated correction (optimized parameters).

• Corrected output of commercial OCR 1 and open source OCR 2.

OCR 1Output

Baseline

correction

Adding bigram score

Additional

gainNeurology 98.74 99.39 99.44 0.05Fish 99.23 99.47 99.57 0.10Mushroom 99.01 99.50 99.55 0.05Holocaust 98.86 99.03 99.15 0.12Roman Empire 98.73 98.90 99.00 0.10Botany 97.19 97.67 97.89 0.22

OCR 1output

Baseline

correction

Adding bigram score

Additional gain

Neurology 98.74 99.39 99.44 0.05Fish 99.23 99.47 99.57 0.10Mushroom 99.01 99.50 99.55 0.05Holocaust 98.86 99.03 99.15 0.12Roman Empire 98.73 98.90 99.00 0.10Botany 97.19 97.67 97.89 0.22

Output highly accurate

OCR 1Output

Baseline

correction

Adding bigram score

Additional gain

Neurology 98.74 99.39 99.44 0.05Fish 99.23 99.47 99.57 0.10Mushroom 99.01 99.50 99.55 0.05Holocaust 98.86 99.03 99.15 0.12Roman Empire 98.73 98.90 99.00 0.10Botany 97.19 97.67 97.89 0.22

Baseline correction adds significant improvement

OCR 1Output

Baseline

correction

Adding bigram score

Additional

Small additional gain by adding bigram score

OCR 2Output

Baseline

correction

Adding bigram score

Additional

OCR 2Output

Baseline

correction

Adding bigram score

Additional

Reduced output accuracy

OCR 2Output

Baseline

correction

Adding bigram score

Additional

Baseline correction adds drastic improvement

OCR 2Output

Baseline

correction

Adding bigram score

Additional

Considerable additional gain by adding bigram score

Additional experiments: comparing language models

Compare word frequencies in input text with1. word frequencies retrieved from "general" standard corpora2. word frequencies retrieved from crawled domain dependent corpora

Result

Experiment

Using the same large word list (dictionary) D,the top-k segments w.r.t. ordering using frequencies of type 2 covers much more tokens of the input text than the top-k segments w.r.t. ordering using frequencies of type 1

Additional experiments: comparing language models

TokensTypes

Crawled frequencies

Standard frequencies

Summing up

• Bigram scores represent a useful additional score for correction systems.

Summing up

• Bigram scores obtained from text-centered domain dependent crawled corpora more valuable than uniform bigram scores from general corpora.

Summing up

• Sophisticated crawling strategies developed. Special techniques for keeping arbitrary bigram scores in main memory (see paper).

Summing up

• The additional gain in accuracy reached with bigram scores depends on the baseline.

Summing up

• The additional gain in accuracy reached with bigram scores depends on the baseline.

• Language models obtained from text-centered domain dependent corpora retrieved in the web reflect the language of the input document much more closely than those obtained from general corpora.

Thanks for your attention!

Text Correction using Domain Dependent Bigram Models from Web Crawls

Documents

Transcript of Text Correction using Domain Dependent Bigram Models from Web Crawls

Mistakes-and-Correction'-EdgeJulian_LongmanKeyToLanguageTeachinkes and Correction' EdgeJulian LongmanKeyToLanguageTeaching

CoNEXT'12 ~ Defending Against Large-Scale Crawls in Online ...

NEW JERSEY CIVIL SERVICE COMMISSION - nj.gov County Correction... · Carlos Montoya County Correction Officer Caroline Reynolds County Correction Officer Olushyi Jindal County Correction

Paper-pen peer-correction versus wiki-based peer- correction

BEAM CRAWLS - Elephant Lifting Products...ADJUSTABLE BEAM CRAWLS A BEAM WIDTH RANGE MINIMUM RADIUS OF TRACK CURVE (METER) 45-28065-280 70-300 90-300 S.W.L CAPACITY 1t 2t 3t 5t B …

Defending against large-scale crawls in online social networks

Gravity: Corrections and analysis · Latitude correction 2. Free-air correction 3. Bouguer correction 4. Terrain correction. 2 Applied Geophysics – Corrections and analysis Latitude

The fastest way to scan - ZF- · PDF fileheight correction models (geoid correction ﬁ les) • Correction: ... Software Module Basic Calibration Correction Calibration Correction

Error correction: Part 2 Deuxième partie: la correction d’erreurs

Identifying Urdu Complex Predication via Bigram Extraction

Arcomem training Specifying Crawls Advanced

The thing practically crawls up walls.

Temp Correction Factor Guidelines - GE Water is now … May-11 operating guidelines temperature correction factors Temperature Correction Factor A temperature correction factor (TCF)

Things to do in san antonio my city crawl bar crawls my city crawl _ san antonio bar crawls

A Bigram Supported Generic Knowledge-Assisted Malware …ceur-ws.org/Vol-2009/fmt-proceedings-2017-paper15.pdf · 2017-11-29 · A Bigram Supported Generic Knowledge-Assisted Malware

COMPARISON OF A BIGRAM PLSA AND A NOVEL CONTEXT-BASED PLSA LANGUAGE MODEL FOR SPEECH RECOGNITION

Error correction: Part 2 Deuxième partie: la correction d’erreurs.

Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

E tutorial - Online Correction - Challlan Correction

Correction and Retractions - PNAS · Correction and Retractions CORRECTION BIOCHEMISTRY Correction for “Mod5 protein binds to tRNA gene complexes and affects local transcriptional