Improving orthographic transcriptions using sentence similarities

NL Oosthuizen, MJ Puttkammer & M Schlemmer Centre for Text Technology (CTexT®) Research Unit: Languages and Literature in the South African Context North-West University, Potchefstroom Campus (PUK) South Africa E-mail: {nico.oosthuizen, martin.puttkammer, martin.schlemmer}@nwu.ac.za Improving orthographic transcriptions using sentence similarities 18 May 2010; AfLaT 2010; Valletta, Malta

© NL Oosthuizen, MJ Puttkammer, M Schlemmer

Transcript of Improving orthographic transcriptions using sentence similarities

Page 1: Improving orthographic transcriptions using sentence similarities

NL Oosthuizen, MJ Puttkammer & M Schlemmer

Centre for Text Technology (CTexT®)

Research Unit: Languages and Literature in the South African Context

North-West University, Potchefstroom Campus (PUK)

South Africa

E-mail: {nico.oosthuizen, martin.puttkammer, martin.schlemmer}@nwu.ac.za

Improving orthographic transcriptions using sentence similarities

18 May 2010; AfLaT 2010; Valletta, Malta

Page 2: Improving orthographic transcriptions using sentence similarities

Oosthuizen, Puttkammer & Schlemmer

Introduction · Identified Differences · Methodology · Results · Conclusion

Background · Problem Statement

Introduction: Background

• Lwazi (knowledge) project

– 200 mother-tongue speakers per language

– 30 phrases: 14 open-ended and 16 phoneme-rich sentences

– 350 phoneme-rich sentences from various corpora, each recorded 6-10 times, totalling 3200 phoneme-rich sentences

– Relatively small ASR corpus meant extremely accurate transcriptions were needed

Page 3: Improving orthographic transcriptions using sentence similarities

Introduction: Problem Statement

• Lwazi project: issues

– 2-year running time

– 4-6 transcribers were employed per language

– Different quality control phases were unsuccessful

– Another solution was needed to improve quality

Page 4: Improving orthographic transcriptions using sentence similarities

Confusables · Splits · Insertions · Deletions · Non-words

Identified Differences: Confusables

• English examples:

– has it been tried on <too> small a scale

– has it been tried on <to> small a scale

• isiXhosa examples:

– andingomntu <othanda> kufunda

• (I’m not a person <who loves> to read)

– andingomntu <uthanda> kufunda

• (I’m not a person <you love> to read)

• Setswana examples:

– bosa bo <jang> ko engelane ka nako e

– bosa bo <yang> ko engelane ka nako e

• (How is the weather in England at this time?)

• <yang> in the second example is slang for “how”

Page 5: Improving orthographic transcriptions using sentence similarities

Identified Differences: Splits

• English examples:

– there’s <nowhere> else for it to go

– there’s <no_where> else for it to go

• isiXhosa examples:

– alwela phi na <loo_madabi>

– alwelwa phi na <loomadabi>

• (Where is it taking place, these challenges)

• Setswana examples:

– le fa e le <gone> re ratanang tota

– le fa e le <go_ne> re ratanang tota

• (Even though we have started dating)

Page 6: Improving orthographic transcriptions using sentence similarities

Identified Differences: Insertions

• English examples:

– so we took our way toward the palace

– so we <we_>took our way toward the palace

• isiXhosa examples:

– ibiyini ukuba unga mbambi wakumbona

• (Why didn’t <you catch> him or her when you saw him or her)

– ibiyini <na_>ukuba unga mbambi wakumbona

• (Why didn’t <you caught> him or her when you saw him or her)

• Setswana examples:

– ba mmatlela mapai ba mo alela a robala

– ba mmatlela mapai <li-> ba mo alela a robala

• (They have looked for blankets and made a bed for themselves to sleep)

Page 7: Improving orthographic transcriptions using sentence similarities

Identified Differences: Deletions

• English examples:

– as <to_>the first the answer is simple

– as the first the answer is simple

• isiXhosa examples:

– yagaleleka impi <ke_>xa kuthi qheke ukusa

– yagaleleka impi xa kuthi qheke ukusa

• (It started the battle at the beginning of the morning)

• Setswana examples:

– ke eng gape se seng<we> se o se lemogang

– ke eng gape se seng se o se lemogang

• (What else have you noticed?)

Page 8: Improving orthographic transcriptions using sentence similarities

Identified Differences: Non-words

• English examples:

– there is no <arbitrator> except a legislature fifteen thousand miles off

– there is no <abritator> except a legislature fifteen thousand miles off

• isiXhosa examples:

– yile <venkile> yayikhethwe ngabathembu le

– yile <venkeli> yayikhethwe ngabathembu le

• (It is this shop that was selected by the Bathembu)

• <venkeli> in the second example is a spelling mistake.

• Setswana examples:

– lefapha la dimenerale le <eneji>

– lefapha la dimenerale le <energy>

• (Department of minerals and energy)

• <energy> in the second example is a spelling mistake.

Page 9: Improving orthographic transcriptions using sentence similarities

Flowchart · Cleanup · Transcription Mapping · Sentences and Mark-up · Manual Verification

Methodology: Flowchart

• Original sentences & transcriptions

– Average of 350 sentences recorded 6-10 times, transcribed by 4-6 people per language

• Cleanup

– For transcriptions: remove punctuation, noise markers & partials; convert to lower case (LC)

– For original sentences: remove punctuation; convert to LC

• Map transcription to original

– Compute Levenshtein distance and map transcriptions to closest original sentence

– Original: slighter faults of substance are numerous

– slighter fault substance are numerous (90.20%)

– slighter faults of substances are numerous (97.60%)

• Compare mapped sentences

– String Similarity (Brad Wood)

– “Look ahead” window finds differences

Page 10: Improving orthographic transcriptions using sentence similarities

Methodology: Flowchart

• Markup

– HTML markup to illustrate differences with colours

– slighter faults of substance are numerous vs slighter fault substance are numerous

– slighter faults of substance are numerous vs slighter faults of substances are numerous

• Manual verification

– Verify the differences in context by listening to recordings

– I told him to make the charge at once vs I told him to make the change at once

• Correct errors

– Replace errors with correct string

• Improved transcriptions

Page 11: Improving orthographic transcriptions using sentence similarities

Methodology: Cleanup

• Remove possible differences from the sentences to improve matches:

– Punctuation

• Any commas, full stops, extra spaces etc.

– Noise markers

• External noises [n] and speaker noises [s]

– Partials

• Any incomplete words (indicated by a leading or trailing hyphen)
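The cleanup rules above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean`, the exact punctuation set and the regular expressions are assumptions made for the example. Apostrophes and word-internal hyphens are deliberately left alone.

```python
import re

# Assumed patterns, chosen for illustration only.
NOISE_MARKERS = re.compile(r"\[[ns]\]")        # [n] external noise, [s] speaker noise
PUNCTUATION = re.compile(r"[,.;:!?\"]")        # assumed punctuation set
PARTIAL_WORD = re.compile(r"(?<!\S)(?:-\S+|\S+-)(?!\S)")  # leading or trailing hyphen


def clean(sentence: str, is_transcription: bool = True) -> str:
    """Normalise a sentence so that only real wording differences remain."""
    if is_transcription:
        # Noise markers only occur in transcriptions.
        sentence = NOISE_MARKERS.sub(" ", sentence)
    sentence = PUNCTUATION.sub(" ", sentence)
    if is_transcription:
        # Drop incomplete words such as "sub-" or "-stance".
        sentence = PARTIAL_WORD.sub(" ", sentence)
    # Lower-case and collapse any extra spaces.
    return " ".join(sentence.lower().split())


print(clean("Slighter faults, of sub- substance [n] are numerous."))
print(clean("Slighter faults of substance are numerous.", is_transcription=False))
```

Punctuation is stripped before partials so that a partial followed by a comma ("sub-,") is still recognised by its trailing hyphen.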

Page 12: Improving orthographic transcriptions using sentence similarities

Methodology: Transcription Mapping

• Levenshtein mapping:

– Link each transcribed sentence (T) to an original sentence (O) using Levenshtein distance

– If no difference is found (DIFF (O, T) = 0)

• Do nothing

– If a difference is found (DIFF (O, T) = 1)

• Continue to next step

Page 13: Improving orthographic transcriptions using sentence similarities

Methodology: Transcription Mapping

• Levenshtein example:

– Original sentence (O):

• slighter faults of substance are numerous

– Transcriptions (T):

• slighter fault substance are numerous

– 90.20%

• slighter faults of substances are numerous

– 97.60%
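As an illustration of the mapping step, the sketch below computes a Levenshtein distance, turns it into a percentage, and picks the closest original sentence. The slides do not say how the distance is normalised. Dividing by the length of the longer string is an assumption, although it does give 90.2% and 97.6% for the two example transcriptions. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def closest_original(transcription: str, originals: list[str]) -> str:
    """Map a transcription (T) to the most similar original sentence (O)."""
    return max(originals, key=lambda original: similarity(original, transcription))


original = "slighter faults of substance are numerous"
for transcription in ("slighter fault substance are numerous",
                      "slighter faults of substances are numerous"):
    print(f"{similarity(original, transcription):.1%}  {transcription}")
```

A transcription whose similarity to its mapped original is 100% corresponds to DIFF (O, T) = 0. Anything lower moves on to the comparison step.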

Page 14: Improving orthographic transcriptions using sentence similarities

Methodology: Sentences and Mark-up

• String comparison algorithm developed by Brad Wood (2008):

– Based on finding the Longest Common String (LCS)

– Windowing compares the strings on character level over a maximum search distance

– Differences found are annotated with HTML

• Repeat after swapping string 1 with string 2
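The mark-up idea can be illustrated with Python's standard `difflib` module. This is a stand-in technique, not Brad Wood's windowed algorithm. The function name `mark_differences` and the inline colour style are assumptions made for the example.

```python
import difflib
import html


def mark_differences(reference: str, candidate: str, colour: str = "red") -> str:
    """Return `candidate` as HTML, colouring the characters absent from `reference`."""
    matcher = difflib.SequenceMatcher(None, reference, candidate, autojunk=False)
    parts = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        segment = html.escape(candidate[j1:j2])
        if tag == "equal":
            parts.append(segment)
        elif segment:
            # 'replace' or 'insert': text that only the candidate contains.
            parts.append(f'<span style="color:{colour}">{segment}</span>')
    return "".join(parts)


original = "I told him to make the charge at once"
transcription = "I told him to make the change at once"

# First pass marks what is new in the transcription; the second pass swaps
# the two strings to mark what the transcription is missing.
print(mark_differences(original, transcription))
print(mark_differences(transcription, original, colour="blue"))
```

Running the comparison in both directions mirrors the "repeat after swapping" step: a single pass can only highlight characters present in the string being rendered.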

Page 15: Improving orthographic transcriptions using sentence similarities

Methodology: Manual Verification

• If (DIFF (O, T) = 1)

– The spoken utterance (U) is compared to the original sentence (O)

– If (DIFF (O, U) = 1 AND DIFF (T, U) = 0) then

• U = T (No change is needed)

– If (DIFF (O, U) = 1 AND DIFF (T, U) = 1) then

• The transcription is incorrect and needs to be checked manually
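The decision rule above can be written as a small function. This is a hedged sketch: in practice U is what a human verifier hears in the recording, so it is passed in here as a string. The names `diff` and `verify` and the returned labels are hypothetical. As a simplification, any mismatch between T and U is sent for manual checking, whether or not O and U differ.

```python
def diff(a: str, b: str) -> int:
    """DIFF as used on the slides: 0 if the strings are identical, 1 otherwise."""
    return int(a != b)


def verify(original: str, transcription: str, utterance: str) -> str:
    """Decide what to do with a transcription (T), given O and the heard utterance (U)."""
    if diff(original, transcription) == 0:
        return "no difference"        # nothing was flagged for this sentence
    if diff(transcription, utterance) == 0:
        return "no change needed"     # the speaker deviated; T matches U
    return "check manually"           # T does not match what was said


# The speaker said "change" instead of "charge" and was transcribed faithfully.
print(verify("i told him to make the charge at once",
             "i told him to make the change at once",
             "i told him to make the change at once"))

# The speaker read the sentence correctly, but the transcription says "wood".
print(verify("a heavy word intervened between",
             "a heavy wood intervened between",
             "a heavy word intervened between"))
```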

Page 16: Improving orthographic transcriptions using sentence similarities

Methodology: Manual Verification

• Transcribed correctly

– Original sentence (O):

• “I told him to make the charge at once.”

– Spoken utterance (U)

• “I told him to make the change at once.”

– Transcriptions (T):

• “I told him to make the cha<n>ge at once.”

Page 17: Improving orthographic transcriptions using sentence similarities

Methodology: Manual Verification

• Transcribed incorrectly

– Original sentence (O):

• “a heavy word intervened, between...”

– Spoken utterance (U)

• “a heavy word intervened, between...”

– Transcriptions (T):

• “a heavy wo<o>d intervened, between...”

Page 18: Improving orthographic transcriptions using sentence similarities

Results

Language      Differences found   Actual errors
Afrikaans                   776             152
English                    1143             337
isiNdebele                  958             291
isiXhosa                   1484            1081
isiZulu                    1854            1228
Sepedi                     1596             736
Sesotho                     739             261
Setswana                   1479             828
Siswati                    1558             351
Tshivenda                   814             191
Xitsonga                   1586             456
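To see what share of the flagged differences turned out to be actual errors, the figures in the table can be fed through a short calculation. The percentage column is derived here and does not appear on the slide. The names `RESULTS` and `error_share` are illustrative.

```python
# (differences found, actual errors) per language, copied from the table above.
RESULTS = {
    "Afrikaans": (776, 152),
    "English": (1143, 337),
    "isiNdebele": (958, 291),
    "isiXhosa": (1484, 1081),
    "isiZulu": (1854, 1228),
    "Sepedi": (1596, 736),
    "Sesotho": (739, 261),
    "Setswana": (1479, 828),
    "Siswati": (1558, 351),
    "Tshivenda": (814, 191),
    "Xitsonga": (1586, 456),
}


def error_share(found: int, actual: int) -> float:
    """Percentage of flagged differences that were actual transcription errors."""
    return 100 * actual / found


for language, (found, actual) in RESULTS.items():
    print(f"{language:<12}{found:>6}{actual:>6}{error_share(found, actual):>7.1f}%")

total_found = sum(found for found, _ in RESULTS.values())
total_actual = sum(actual for _, actual in RESULTS.values())
print(f"{'All':<12}{total_found:>6}{total_actual:>6}"
      f"{error_share(total_found, total_actual):>7.1f}%")
```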

Page 19: Improving orthographic transcriptions using sentence similarities

Summary · Future Work · Questions?

Conclusion: Summary

• We introduced a method for identifying differences in ASR data

• Overall quality of the transcriptions was increased

• The Lwazi project had an average transcription accuracy of 98%.

Page 20: Improving orthographic transcriptions using sentence similarities

Conclusion: Summary

• Even with inexperienced transcribers, high accuracy is still possible

• Provides employment opportunities to people who have few linguistic skills but a basic knowledge of their language

• Empowers people to learn skills that may be invaluable in future projects

Page 21: Improving orthographic transcriptions using sentence similarities

Conclusion: Future Work

• If (DIFF (O, T) = 0 AND DIFF (O, U) = 1)

– This will indicate that DIFF (T, U) = 1

– For the current system only DIFF (T, U) = 0 was considered, as the specifications required it

– This will mean that one can check the reader’s performance

– Future work will include this statement

Page 22: Improving orthographic transcriptions using sentence similarities

Conclusion: Questions?
