Slides

Contributions for building a Corpora-Flow system. André Santos, [email protected]. Informatics Engineering MSc, University of Minho. December 2011.


Page 1: Slides

Contributions for building a Corpora-Flow system

André Santos, [email protected]

Informatics Engineering MSc, University of Minho

December 2011

Page 2: Slides

Concepts

Aligned parallel corpus: Set of parallel texts in which correspondences have been marked between blocks (paragraphs, sentences, words, ...) from each text.

Corpora-flow: Adaptation of the concept of workflow to the several tasks, decisions and sequences of steps involved in the process of building a corpus.

This presentation and the underlying master's thesis describe the implementation of several tools to be used in typical corpus building activities.


Page 4: Slides

Context

The work developed in the context of this master's thesis was motivated and supported by Project Per-fide, an ongoing project at the University of Minho which aims to build large parallel corpora between Portuguese and six other languages.


Page 5: Slides

Corpora building challenges

file format and format conversion

finding duplicated files

text encoding format

structural residues

section delimiters

unpaired sections (parallel corpora)

. . .


Page 6: Slides

Corpora building challenges

Severe problems which often lead to bad results

Many (most?) of them are hard or impossible to solve completely

Find the problem and report it when it cannot be solved automatically

Provide intelligent ways of describing what wasfound and done


Page 7: Slides

5 key issues

Book cleaning

Duplicates and candidate pairs detection

Book synchronization

Alignment evaluation

Corpora-flow system


Page 8: Slides

Book processing problems – Motivation

(...) d <92>’ entrée, donnant accès dans la salle commune.

Une légère véranda, qui en proté-

M

<96>- 86 <96>-

^L geait la partie antérieure contre l <92>’ action

des rayons solaires, reposait sur de sveltes bambous. (...)

La Jangada, Jules Verne

Problems in this excerpt:

<92>’ : right single quotation mark (CP1252)

<96>- : en dash (CP1252)

^L : page break (0xC)

proté-(...)geait : transpagination

After cleaning:

(...) d ’ entrée, donnant accès dans la salle commune.

Une légère véranda, qui en protégeait _pb1_

la partie antérieure contre l ’ action

des rayons solaires, reposait sur de sveltes bambous. (...)
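
The cleaning step illustrated by this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of Text::Perfide::BookCleaner: the function name `clean_excerpt`, the limit of three header/footer lines before a page break, and the exact placement of the `_pbN_` marker are hypothetical choices made to reproduce the cleaned text shown on the slide.

```python
import re

# Matches a word split by a page break ("transpagination"):
#   head-  <up to 3 header/footer lines>  form feed  tail
PAGE_BREAK = re.compile(
    r"(?P<head>\w+)-[ \t]*\n"      # word fragment ending in a hyphen
    r"(?:[^\n\f]*\n){0,3}?"        # page header/footer lines, e.g. the page number
    r"\f\s*"                       # the form feed (0x0C) itself
    r"(?P<tail>\w+)[ \t]*"         # rest of the word on the next page
)


def clean_excerpt(raw: bytes) -> str:
    """Decode CP1252 bytes and rejoin words split across page breaks."""
    text = raw.decode("cp1252")    # 0x92 becomes U+2019, 0x96 becomes U+2013
    count = 0

    def rejoin(match):
        nonlocal count
        count += 1
        # Glue the two halves together and leave a numbered page-break mark.
        return f"{match.group('head')}{match.group('tail')} _pb{count}_\n"

    return PAGE_BREAK.sub(rejoin, text)


raw = b"l \x92 entr\xe9e.\nUne v\xe9randa, qui en prot\xe9-\nM\n\x96 86 \x96\n\x0c geait la partie\n"
print(clean_excerpt(raw))
```

Decoding first and pattern-matching second keeps the regular expression free of byte-level concerns; the page header lines between the hyphen and the form feed are simply dropped.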


Page 11: Slides

Book cleaning

Subdivided into several steps: [diagram not preserved in the transcript]


Page 12: Slides

Sections ontology

contains common section types

used to automatically generate the code to recognize section delimiters

allows discussion/cooperation with people with no programming knowledge

code becomes simpler and cleaner

chap
  PT capítulo, cap, capitulo
  FR chapitre, chap
  EN chapter, chap
  NT sec

end
  PT fim
  FR fin
  EN the_end
  BT _alone

scene
  PT cena
  FR scène
  EN scene
  RU глава
  BT act
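
As a rough illustration of how recognizer code could be generated from such an ontology, the sketch below builds one regular expression from a dictionary of terms. Everything here is an assumption made for the example: the `ONTOLOGY` dictionary layout, the function names, the reading of `_` in `the_end` as a space, and the rule that a delimiter occupies a whole line followed by at most one label. The NT/BT relations are ignored.

```python
import re

# Hypothetical in-memory form of the ontology fragment on the slide:
# section type -> language -> terms ("_" is assumed to stand for a space).
ONTOLOGY = {
    "chap": {"PT": ["capítulo", "cap", "capitulo"],
             "FR": ["chapitre", "chap"],
             "EN": ["chapter", "chap"]},
    "end": {"PT": ["fim"], "FR": ["fin"], "EN": ["the_end"]},
    "scene": {"PT": ["cena"], "FR": ["scène"], "EN": ["scene"]},
}


def build_delimiter_regex(ontology):
    """Generate one regex that recognises a section-delimiter line."""
    groups = []
    for section_type, by_language in ontology.items():
        terms = {term for term_list in by_language.values() for term in term_list}
        # Longest terms first so "chapter" is tried before "chap".
        ordered = sorted(terms, key=lambda t: (-len(t), t))
        alternatives = "|".join(
            re.escape(term).replace("_", "[ _]+") for term in ordered
        )
        groups.append(f"(?P<{section_type}>{alternatives})")
    pattern = (r"^[ \t]*(?:" + "|".join(groups) + r")\b\.?[ \t]*"
               r"(?P<label>\w*)[ \t]*$")
    return re.compile(pattern, re.IGNORECASE | re.MULTILINE)


def find_delimiters(text, ontology=ONTOLOGY):
    """Return (section_type, label) pairs for each delimiter line found."""
    regex = build_delimiter_regex(ontology)
    found = []
    for match in regex.finditer(text):
        # Exactly one of the section-type groups took part in the match.
        section_type = next(k for k in ontology if match.group(k))
        found.append((section_type, match.group("label")))
    return found


sample = "Chapter 1\nIt was a dark night.\nCAPÍTULO II\nScène 3\nThe End\n"
print(find_delimiters(sample))
```

Because the regex is derived from data, adding a language or a section type only means editing the dictionary, which matches the point made above about cooperating with people who do not program.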


Page 13: Slides

Duplicates and pairs detection

Motivation

  Duplicates can result in a biased corpus
  Finding candidate pairs for alignment

Language independent elements (LIEs)

  terms which are usually kept untranslated
  year references – “1973”
  proper names – “Hamlet”

Measuring similarity

  $$\mathrm{similarity}(A, B) = \frac{|A_{\mathrm{LIEs}} \cap B_{\mathrm{LIEs}}|}{|A_{\mathrm{LIEs}} \cup B_{\mathrm{LIEs}}|}$$

Thresholds

  < 0.2: unrelated
  > 0.4: pair
  > 0.9: duplicates
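
The similarity measure is the Jaccard index over the two sets of LIEs. A minimal sketch, assuming the LIEs of each document have already been extracted into Python sets; the function names and the "undecided" label for scores between 0.2 and 0.4 (a range the slide does not name) are assumptions of this example.

```python
def similarity(a_lies, b_lies):
    """Jaccard index of two sets of language independent elements."""
    union = a_lies | b_lies
    if not union:
        return 0.0  # assumption: two documents without LIEs count as unrelated
    return len(a_lies & b_lies) / len(union)


def classify(score):
    """Map a similarity score to the categories given by the thresholds."""
    if score > 0.9:
        return "duplicates"
    if score > 0.4:
        return "pair"
    if score < 0.2:
        return "unrelated"
    return "undecided"  # hypothetical label for the 0.2 to 0.4 range


a = {"1973", "Hamlet", "Elsinore", "Ophelia"}
b = {"1973", "Hamlet", "Elsinore", "Horatio"}
score = similarity(a, b)          # 3 shared elements out of 5 distinct ones
print(score, classify(score))
```

The duplicate test comes first because any score above 0.9 is also above 0.4.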

Page 14: Slides

Book synchronization

Definition

  Structural alignment at section level, based on previously added section delimiting marks.

Motivation

  Some aligners cannot handle large documents
  Section delimiters can act as anchor points
  Unpaired sections can be discarded

Implementation

  match similar section delimiters
  synchronization points
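
One simple way to match section delimiters is to treat each book as a sequence of section labels and look for matching blocks between the two sequences. The sketch below uses Python's standard `difflib.SequenceMatcher` for this; it is a stand-in technique chosen for illustration, not necessarily the matching algorithm used by Text::Perfide::BookSync, and the `(type, label)` tuple representation is an assumption.

```python
from difflib import SequenceMatcher


def synchronization_points(sections_a, sections_b):
    """Return index pairs (i, j) where section i of A matches section j of B."""
    matcher = SequenceMatcher(a=sections_a, b=sections_b, autojunk=False)
    points = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            points.append((block.a + offset, block.b + offset))
    return points


def unpaired(sections, matched_indices):
    """Indices of sections that found no partner and may be discarded."""
    return [i for i in range(len(sections)) if i not in matched_indices]


book_a = [("chap", "1"), ("chap", "2"), ("chap", "3"), ("end", "")]
book_b = [("chap", "1"), ("chap", "3"), ("end", "")]

points = synchronization_points(book_a, book_b)
print(points)
print(unpaired(book_a, {i for i, _ in points}))
```

Each pair in `points` can serve as an anchor: the text between two consecutive anchors forms one chunk pair, small enough to hand to an aligner.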


Page 15: Slides

Output

pair of files with synchronization marks

pair of files divided into smaller pairs of chunks

text report

synchronization matrix

Page 17: Slides

Alignment evaluation

Motivation

  compare alignments of the same documents (performed by different tools, with different options, ...)
  determine if an alignment was successful

Comparing alignments

  parse TMX files and output the total number of correspondences of each type: 0:1/1:0, 1:1, 2:1/1:2 and 2:2
  evaluate the other tools developed
  compare the performance of the available alignment tools
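
The per-type totals can be sketched as a simple tally. This example assumes the number of source and target sentences in each translation unit has already been obtained from the TMX file (how that count is derived is not described here); the function names and the extra "other" bucket are hypothetical.

```python
from collections import Counter

# Symmetric types are reported together, as on the slide.
TYPE_LABELS = {
    (0, 1): "0:1/1:0",
    (1, 1): "1:1",
    (1, 2): "2:1/1:2",
    (2, 2): "2:2",
}


def alignment_type(n_source, n_target):
    """Label one correspondence by its sentence counts on each side."""
    key = tuple(sorted((n_source, n_target)))
    return TYPE_LABELS.get(key, "other")  # "other" is an assumed catch-all


def tally_alignment_types(units):
    """Count correspondences of each type in a list of (n_source, n_target)."""
    return Counter(alignment_type(s, t) for s, t in units)


units = [(1, 1), (1, 1), (2, 1), (0, 1), (1, 2), (2, 2), (3, 1)]
for label, total in tally_alignment_types(units).most_common():
    print(label, total)
```

Comparing these tallies for two alignments of the same document pair gives a quick first indication of how differently two tools, or two option sets, behaved.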

Page 19: Slides

Alignment evaluation

Determine if an alignment was successful

Summarize a TMX by sampling. Sampling can be performed based on:

  number of samples desired
  explicit sampling points
  translation units which match a given regular expression

Output is a (much?) smaller TMX file
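
The three sampling modes can be sketched with Python's standard XML parser. This is an illustrative sketch, not the interface of Text::Perfide::TMX::Utils: the function name `sample_tmx`, its parameters, the even spacing used for the "number of samples" mode, and the precedence between modes are all assumptions. It relies only on the TMX layout of `tu` elements inside `body`, each holding `tuv`/`seg` children.

```python
import re
import xml.etree.ElementTree as ET


def sample_tmx(tmx_xml, n_samples=None, points=None, pattern=None):
    """Return a smaller TMX string containing only the selected <tu> elements."""
    root = ET.fromstring(tmx_xml)
    body = root.find("body")
    units = body.findall("tu")

    if pattern is not None:
        # Keep units in which any segment matches the regular expression.
        regex = re.compile(pattern)
        keep = [i for i, tu in enumerate(units)
                if any(regex.search("".join(seg.itertext()))
                       for seg in tu.iter("seg"))]
    elif points is not None:
        # Keep the explicitly requested positions (0-based, an assumption).
        keep = [i for i in points if 0 <= i < len(units)]
    elif n_samples and n_samples < len(units):
        # Keep n_samples units spread evenly over the file.
        step = len(units) / n_samples
        keep = [int(i * step) for i in range(n_samples)]
    else:
        keep = list(range(len(units)))

    keep = set(keep)
    for i, tu in enumerate(units):
        if i not in keep:
            body.remove(tu)
    return ET.tostring(root, encoding="unicode")


# Minimal made-up TMX with six translation units.
tmx = '<tmx version="1.4"><header/><body>' + "".join(
    f'<tu><tuv xml:lang="de"><seg>Satz {i}</seg></tuv>'
    f'<tuv xml:lang="ru"><seg>фраза {i}</seg></tuv></tu>'
    for i in range(6)
) + "</body></tmx>"

print(sample_tmx(tmx, n_samples=3))
```

Sampling by regular expression is the mode that supports checks such as searching for a proper name on one side and verifying that its counterpart appears in the same translation unit.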


Page 20: Slides

Alignment evaluation

Adson (DE) = Адсо (RU)

The Name of the Rose, Umberto Eco

Page 24: Slides

Distribution

All the tools implemented as Perl modules:

  Text::Perfide::BookCleaner
  Text::Perfide::BookPairs
  Text::Perfide::BookSync
  Text::Perfide::TMX::Utils

publicly available on CPAN

including tests and documentation

additional effort required to make code installable and usable by other people


Page 25: Slides

Corpora-flow

Motivation

  building a corpus is a complex task
  linear pipeline is not powerful enough

Workflow

  states
  actions
  conditions
  context

Makefiles

  file-oriented
  timestamps and dependencies
  fail-fast and resumable execution
  parallelization

Page 26: Slides

Corpora-flow

workflow + Makefiles = corpora-flow

DSL (→ Slay::Makefile)

  workflow: rule*
  rule: pre-condition* action post-condition*
  action: targets dependencies function
  condition: filename function
  target: pattern*
  dependencies: pattern*
  function: Perl code
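
To make the grammar above more concrete, here is a small Python model of one rule and a make-style way of running it. It is a sketch under several assumptions and not the corpora-flow implementation (which targets Perl and Slay::Makefile): the class and function names are invented, patterns are simplified to literal file names, and functions are Python callables rather than Perl code.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Condition:
    filename: str
    function: Callable[[str], bool]           # check applied to the file


@dataclass
class Action:
    targets: List[str]                         # literal names, not patterns
    dependencies: List[str]
    function: Callable[[List[str], List[str]], None]


@dataclass
class Rule:
    action: Action
    pre_conditions: List[Condition] = field(default_factory=list)
    post_conditions: List[Condition] = field(default_factory=list)


def is_stale(action):
    """Make-style check: a target is missing or older than a dependency."""
    if not action.targets or not all(os.path.exists(t) for t in action.targets):
        return True
    newest_dep = max((os.path.getmtime(d) for d in action.dependencies), default=0.0)
    oldest_target = min(os.path.getmtime(t) for t in action.targets)
    return newest_dep > oldest_target


def check(conditions, kind):
    """Fail fast on the first condition that does not hold."""
    for cond in conditions:
        if not cond.function(cond.filename):
            raise RuntimeError(f"{kind}-condition failed for {cond.filename}")


def run_rule(rule):
    """Check pre-conditions, run the action if needed, check post-conditions."""
    check(rule.pre_conditions, "pre")
    if is_stale(rule.action):
        rule.action.function(rule.action.targets, rule.action.dependencies)
        status = "built"
    else:
        status = "up to date"
    check(rule.post_conditions, "post")
    return status
```

The timestamp test is what makes execution resumable: after a failure, rules whose targets are already up to date are skipped on the next run, while the conditions provide the workflow-style checks that a plain Makefile lacks.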


Page 27: Slides

Conclusions

Evaluation of the tools has shown that they do help to solve problems

Most of the methods devised can be applied in other contexts

Working within a larger project:

  provides requirements and resources
  specific needs and priorities

Making code available to other people:

  requires additional effort
  gives meaning to the work
  external contributions

Higher level objects help to organize and discuss


Page 37: Slides

Future work

Document cleaners

  other types of documents (e.g. scientific articles)
  algorithm for finding section delimiters with a notion of hierarchy
  create ebooks/bilingual books

Duplicates and pair detection

  list of correspondences (e.g. Adson → Адсо, London → Londres)
  calculate best threshold values in real time


Page 38: Slides

Future work

Document synchronization

  interactive mode
  improvements to the synchronization matrix and metrics
  hierarchical sections
  other section alignment algorithms

Corpora-flow

  finish specification and implementation
  implement a corpora-flow for Project Per-fide
