
Contributions for building a Corpora-Flow system

Andre Santos, andrefs@cpan.org

Informatics Engineering MSc, University of Minho

December 2011

Concepts

Aligned parallel corpus: Set of parallel texts in which correspondences have been marked between blocks (paragraphs, sentences, words, ...) from each text.

Corpora-flow: Adaptation of the concept of workflow to the several tasks, decisions and sequences of steps involved in the process of building a corpus.

This presentation and the underlying master thesis describe the implementation of several tools to be used in typical corpus building activities.

1 Andre Santos, andrefs@cpan.org Contributions for building a Corpora-Flow system


Context

The work developed in this master thesis was motivated and supported by Project Per-fide, an ongoing project at the University of Minho which aims to build large parallel corpora between Portuguese and six other languages.


Corpora building challenges

file format and format conversion

finding duplicated files

text encoding format

structural residues

section delimiters

unpaired sections (parallel corpora)

. . .


Corpora building challenges

Severe problems which often lead to bad results

Many (most?) of them are hard/impossible to solve completely

Find the problem and report it when it is not solvable automatically

Provide intelligent ways of describing what was found and done


5 key issues

Book cleaning

Duplicates and candidate pairs detection

Book synchronization

Alignment evaluation

Corpora-flow system


Book processing problems – Motivation

(...) d <92>’ entree, donnant acces dans la salle commune.

Une legere veranda, qui en prote-

M

<96>- 86 <96>-

^L geait la partie anterieure contre l <92>’ action

des rayons solaires, reposait sur de sveltes bambous. (...)

La Jangada, Jules Verne


<92>’ : right single quotation mark (CP1252)

<96>- : en dash (CP1252)

^L : page break (0xC)

prote-(...)geait : transpagination


(...) d ’ entree, donnant acces dans la salle commune.

Une legere veranda, qui en protegeait _pb1_

la partie anterieure contre l ’ action

des rayons solaires, reposait sur de sveltes bambous. (...)
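
The repair shown above (decode the CP1252 bytes, drop the page furniture, rejoin the word split across the page break, and leave a page-break mark) can be sketched in Python. This is only an illustration under stated assumptions, not the Text::Perfide::BookCleaner code: the function name `clean_page_breaks`, the limit of three short "furniture" lines before the form feed, and the `_pbN_` numbering are all hypothetical choices made for this example.

```python
import itertools
import re

# Hypothetical sample modelled on the slide: a word hyphenated across a page
# break, with a stray "M", a page number between en dashes and a form feed.
SAMPLE = (b"Une legere veranda, qui en prote-\nM\n\x96 86 \x96\n"
          b"\x0c geait la partie anterieure contre l\x92action\n")

# word-  <up to 3 short lines of page furniture>  form feed  rest-of-word
TRANSPAGINATION = re.compile(
    r"(\w+)-[ \t]*\n(?:[^\n\f]{0,20}\n){0,3}[ \t]*\f[ \t]*(\w+)[ \t]*"
)

def clean_page_breaks(raw: bytes) -> str:
    """Decode CP1252 text and rejoin words split across page breaks."""
    text = raw.decode("cp1252")      # 0x92 -> right quote, 0x96 -> en dash
    counter = itertools.count(1)     # numbers the page-break marks

    def join(match: re.Match) -> str:
        # Glue both halves of the word and record where the page ended.
        return f"{match.group(1)}{match.group(2)} _pb{next(counter)}_\n"

    return TRANSPAGINATION.sub(join, text)

if __name__ == "__main__":
    print(clean_page_breaks(SAMPLE))
```

Page breaks that do not split a word are left untouched by this sketch; a fuller cleaner would mark those as well.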


Book cleaning

Subdivided into several steps (the slide's diagram is not part of this transcript).


Sections ontology

contains common section types

used to automatically generate the code to recognize section delimiters

allows discussion/cooperation with people with no programming knowledge

code becomes simpler and cleaner

chap
  PT capítulo, cap, capitulo
  FR chapitre, chap
  EN chapter, chap
  NT sec

end
  PT fim
  FR fin
  EN the_end
  BT _alone

scene
  PT cena
  FR scene
  EN scene
  RU глава
  BT act
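
As an illustration of how delimiter-recognition code might be generated from such an ontology, here is a minimal Python sketch. It is not the generator used by the author's tools: the dictionary layout, the function names `build_matcher` and `classify`, the `number` group, and the reading of `the_end` as "the end" are assumptions made for this example, and the NT/BT relations are left out.

```python
import re

# Hypothetical in-memory form of part of the ontology above.
ONTOLOGY = {
    "chap": {"PT": ["capítulo", "cap", "capitulo"],
             "FR": ["chapitre", "chap"],
             "EN": ["chapter", "chap"]},
    "end": {"PT": ["fim"], "FR": ["fin"], "EN": ["the end"]},
    "scene": {"PT": ["cena"], "FR": ["scene"], "EN": ["scene"]},
}

def build_matcher(ontology, lang):
    """Compile one regex per language, with one named group per section type."""
    parts = []
    for section_type, by_lang in ontology.items():
        # Longest terms first, so "chapter" is tried before "chap".
        terms = sorted(by_lang.get(lang, []), key=len, reverse=True)
        if terms:
            alternatives = "|".join(re.escape(term) for term in terms)
            parts.append(f"(?P<{section_type}>{alternatives})")
    if not parts:
        raise ValueError(f"no terms for language {lang!r}")
    pattern = rf"(?:{'|'.join(parts)})\b\.?\s*(?P<number>\d+|[IVXLCDM]+)?"
    return re.compile(pattern, re.IGNORECASE)

def classify(line, matcher):
    """Return (section_type, number) if the whole line is a delimiter."""
    match = matcher.fullmatch(line.strip())
    if match is None:
        return None
    section_type = next(name for name, value in match.groupdict().items()
                        if name != "number" and value is not None)
    return section_type, match.group("number")

if __name__ == "__main__":
    english = build_matcher(ONTOLOGY, "EN")
    for line in ["Chapter 12", "THE END", "Chapter house was cold"]:
        print(repr(line), "->", classify(line, english))
```

Adding a language or a synonym then means editing the data, not the code, which is the benefit the slide points at.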


Duplicates and pairs detection

Motivation

Duplicates can result in a biased corpus

Finding candidate pairs for alignment

Language independent elements (LIEs)

terms which are usually kept untranslated

year references – “1973”

proper names – “Hamlet”

Measuring similarity

\[ \mathrm{similarity}(A, B) = \frac{|A_{\mathrm{LIEs}} \cap B_{\mathrm{LIEs}}|}{|A_{\mathrm{LIEs}} \cup B_{\mathrm{LIEs}}|} \]

Thresholds

< 0.2: unrelated

> 0.4: pair

> 0.9: duplicates
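
The measure above is the Jaccard index over the two sets of LIEs. A minimal Python sketch follows; it is not the Text::Perfide::BookPairs code. The extraction heuristics in `extract_lies` (four-digit years, capitalized words that do not start a sentence) and the "undecided" label for scores between the stated thresholds are assumptions made for this example.

```python
import re

def extract_lies(text):
    """Crude LIE extractor: years plus capitalized words inside a sentence."""
    years = set(re.findall(r"\b(?:1[0-9]{3}|20[0-9]{2})\b", text))
    # A capitalized word preceded by a lowercase letter, comma or semicolon
    # and a space is taken as a proper name (assumption for this sketch).
    names = set(re.findall(r"(?<=[a-z,;] )[A-Z][a-z]+", text))
    return years | names

def similarity(lies_a, lies_b):
    """Jaccard index: size of the intersection over size of the union."""
    union = lies_a | lies_b
    if not union:
        return 0.0
    return len(lies_a & lies_b) / len(union)

def classify(score):
    """Apply the thresholds from the slide; gaps are reported as undecided."""
    if score > 0.9:
        return "duplicates"
    if score > 0.4:
        return "pair"
    if score < 0.2:
        return "unrelated"
    return "undecided"

if __name__ == "__main__":
    english = extract_lies("In 1973 the actor played Hamlet in London.")
    portuguese = extract_lies("Em 1973 o ator interpretou Hamlet em Londres.")
    score = similarity(english, portuguese)
    print(score, classify(score))
```

In this example "London" and "Londres" count as different elements, which lowers the score; the list of correspondences mentioned under future work addresses exactly that.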


Book synchronization

Definition

Structural alignment at section level, based on previously added section delimiting marks.

Motivation

Some aligners cannot handle large documents

Section delimiters can act as anchor points

Unpaired sections can be discarded

Implementation

match similar section delimiters

synchronization points
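
One way to picture the matching step is to treat each book as a sequence of normalized section labels and look for the in-order matches between the two sequences. The Python sketch below uses the standard library's `difflib.SequenceMatcher` for that. This is a stand-in technique chosen for the illustration, not necessarily the algorithm in Text::Perfide::BookSync, and the label strings and function names are hypothetical.

```python
from difflib import SequenceMatcher

def sync_points(sections_a, sections_b):
    """Return (index_in_a, index_in_b) pairs for sections that match in order."""
    matcher = SequenceMatcher(a=sections_a, b=sections_b, autojunk=False)
    points = []
    for block in matcher.get_matching_blocks():
        # Each block is a run of `size` consecutive matching labels.
        for offset in range(block.size):
            points.append((block.a + offset, block.b + offset))
    return points

def unpaired(section_count, points, side):
    """Indices on one side (0 = first book, 1 = second) with no partner."""
    paired = {point[side] for point in points}
    return [index for index in range(section_count) if index not in paired]

if __name__ == "__main__":
    book_a = ["chap 1", "chap 2", "chap 3", "end"]
    book_b = ["chap 1", "chap 3", "end"]
    points = sync_points(book_a, book_b)
    print(points)                              # synchronization points
    print(unpaired(len(book_a), points, 0))    # sections only in book_a
```

Each synchronization point can then be used to cut both files into smaller chunk pairs, and the unpaired indices identify sections that can be discarded.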


Output

pair of files with synchronization marks

pair of files divided into smaller pairs of chunks

text report

synchronization matrix


Alignment evaluation

Motivation

compare alignments of the same documents (performed by different tools, with different options, ...)

determine if an alignment was successful

Comparing alignments

parse TMX files and output the total number of correspondences of each type: 0:1/1:0, 1:1, 2:1/1:2 and 2:2

evaluate the other tools developed

compare the performance of the available alignment tools
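
Counting correspondence types can be sketched with Python's standard XML parser. The slides do not say how the number of sentences inside a translation unit is determined, so this example assumes a naive rule (runs of `.`, `!` or `?` followed by whitespace or the end of the segment). The function names are hypothetical, and this is not the Text::Perfide::TMX::Utils implementation.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

def count_sentences(text):
    """Naive sentence count (assumption): terminal punctuation marks."""
    text = (text or "").strip()
    if not text:
        return 0
    return max(1, len(re.findall(r"[.!?]+(?=\s|$)", text)))

def correspondence_types(tmx_string):
    """Count translation units per type, e.g. '1:1' or '1:2'.

    Sizes are sorted, so 2:1 and 1:2 are reported together as '1:2'.
    """
    root = ET.fromstring(tmx_string)
    counts = Counter()
    for unit in root.iter("tu"):
        sizes = []
        for variant in unit.iter("tuv"):
            segment = variant.find("seg")
            text = "".join(segment.itertext()) if segment is not None else ""
            sizes.append(count_sentences(text))
        counts[":".join(str(size) for size in sorted(sizes))] += 1
    return counts

if __name__ == "__main__":
    demo = """<tmx version="1.4"><header/><body>
    <tu><tuv xml:lang="en"><seg>Hello.</seg></tuv>
        <tuv xml:lang="pt"><seg>Olá.</seg></tuv></tu>
    <tu><tuv xml:lang="en"><seg>One. Two.</seg></tuv>
        <tuv xml:lang="pt"><seg>Um e dois.</seg></tuv></tu>
    </body></tmx>"""
    print(correspondence_types(demo))
```

Comparing the resulting counters for two alignments of the same document gives a quick first impression of how the tools differ, for instance when one of them produces far more 0:1 units.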


Alignment evaluation

Determine if an alignment was successful

Summarize a TMX by sampling. Sampling can be performed based on:

number of samples desired

explicit sampling points

translation units which match a given regular expression

Output is a (much?) smaller TMX file
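
The three sampling modes can be sketched as follows. This is a minimal Python illustration, not the Text::Perfide::TMX::Utils code: the function name `sample_tmx`, its parameter names, the use of zero-based unit indices as sampling points, and the evenly spaced selection are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

def sample_tmx(tmx_string, n_samples=None, points=None, pattern=None):
    """Return a smaller TMX document containing only the sampled units."""
    root = ET.fromstring(tmx_string)
    body = root.find("body")
    units = body.findall("tu")

    if pattern is not None:
        # Keep units in which any segment matches the regular expression.
        regex = re.compile(pattern)
        keep = {index for index, unit in enumerate(units)
                if any(regex.search("".join(segment.itertext()))
                       for segment in unit.iter("seg"))}
    elif points is not None:
        # Keep the explicitly requested unit indices (zero-based).
        keep = {index for index in points if 0 <= index < len(units)}
    elif n_samples:
        # Keep roughly evenly spaced units, at most n_samples of them.
        step = max(1, len(units) // n_samples)
        keep = set(list(range(0, len(units), step))[:n_samples])
    else:
        keep = set(range(len(units)))

    for index, unit in enumerate(units):
        if index not in keep:
            body.remove(unit)
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    units = "".join(
        f'<tu tuid="{i}"><tuv xml:lang="de"><seg>Satz {i}</seg></tuv>'
        f'<tuv xml:lang="ru"><seg>s{i}</seg></tuv></tu>' for i in range(10))
    demo = f'<tmx version="1.4"><header/><body>{units}</body></tmx>'
    print(sample_tmx(demo, n_samples=3))
```

A person checking the alignment then only has to read the sampled units to judge whether the two texts stay in step.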


Alignment evaluation

Adson (DE) = Адсо (RU)

The Name of the Rose, Umberto Eco


Distribution

All the tools implemented as Perl modules:

Text::Perfide::BookCleaner

Text::Perfide::BookPairs

Text::Perfide::BookSync

Text::Perfide::TMX::Utils

publicly available on CPAN

including tests and documentation

additional effort required to make code installable and usable by other people


Corpora-flow

Motivation

building a corpus is a complex task

linear pipeline is not powerful enough

Workflow

states

actions

conditions

context

Makefiles

file-oriented

timestamps and dependencies

fail-fast and resumable execution

parallelization

Corpora-flow

workflow + Makefiles = corpora-flow

DSL (→ Slay::Makefile)

workflow: rule*

rule: pre-condition* action post-condition*

action: targets dependencies function

condition: filename function

target: pattern*

dependencies: pattern*

function: Perl code
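
To make the grammar more concrete, here is a minimal Python sketch of a single rule with pre-conditions, a Makefile-style timestamp check, an action and post-conditions. It is not the corpora-flow implementation, which the slides describe as unfinished and built on Slay::Makefile with Perl functions. The class `Rule`, the helpers `is_stale`, `non_empty` and `run_rule`, and the simplification of target and dependency patterns to plain filenames are all assumptions made for this example.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Rule:
    target: str                                   # simplified: one filename
    dependencies: List[str]                       # simplified: filenames
    action: Callable[[str, List[str]], None]      # builds target from deps
    pre_conditions: List[Callable[[str], bool]] = field(default_factory=list)
    post_conditions: List[Callable[[str], bool]] = field(default_factory=list)

def is_stale(target, dependencies):
    """Makefile-style check: rebuild if missing or older than a dependency."""
    if not os.path.exists(target):
        return True
    built = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > built for dep in dependencies)

def non_empty(path):
    """Example condition: the file holds at least one byte."""
    return os.path.getsize(path) > 0

def run_rule(rule):
    """Run one rule; return True if the action was executed."""
    for dep in rule.dependencies:
        for check in rule.pre_conditions:
            if not check(dep):
                # Fail fast: stop before doing any work on bad input.
                raise RuntimeError(f"pre-condition failed for {dep}")
    if not is_stale(rule.target, rule.dependencies):
        return False                              # up to date: resumable
    rule.action(rule.target, rule.dependencies)
    for check in rule.post_conditions:
        if not check(rule.target):
            raise RuntimeError(f"post-condition failed for {rule.target}")
    return True
```

Running the same rule twice executes the action only the first time, which is what makes an interrupted corpus build resumable.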


Conclusions

Evaluation of the tools has shown that they do help to solve problems

Most of the methods devised can be applied in other contexts

Working within a larger project:

provides requirements and resources

specific needs and priorities

Making code available to other people:

requires additional effort

gives meaning to the work

external contributions

Higher level objects help to organize and discuss


Future work

Document cleaners

other types of documents (e.g. scientific articles)

algorithm for finding section delimiters with notion of hierarchy

create ebooks/bilingual books

Duplicates and pair detection

list of correspondences (e.g. Adson → Адсо, London → Londres)

calculate best threshold values in real time


Future work

Document synchronization

interactive mode

improvements on synchronization matrix and metrics

hierarchical sections

other section alignment algorithms

Corpora-flow

finish specification and implementation

implement a corpora-flow for Project Per-fide
