Post on 18-Dec-2014
Cleaning plain text books with Text::Perfide::BookCleaner
Andre Santos andrefs@cpan.org
September 23, 2011
Introduction Per-Fide
1 Introduction: Per-Fide, Text alignment, Books
2 Text::Perfide::BookCleaner
3 Conclusions, wish list and future work
Andre Santos andrefs@cpan.org Cleaning plain text books with Text::Perfide::BookCleaner
Project Per-Fide
Joint venture between the Computer Science Department and the School of Humanities of the University of Minho
Portuguese in parallel with six languages: Español, Russian, Français, Italiano, Deutsch, English
Build parallel corpora that will establish a relation between Portuguese and the other 6 languages
[Parallel] Corpora
Corpora: collection of natural language texts
Parallel corpora: collection of natural language bitexts
Bitext: pair formed by a text in a given language and its translation in another language, frequently aligned
Alignment: mapping between the sentences/paragraphs/words of one text and the other
Project Per-Fide
Original texts in the seven languages and their translations
Two main genres: contemporary fiction and non-fiction
non-fiction: judicial, journalistic, religious, technical, ...
fiction: contemporary novels and short stories
per-fide.di.uminho.pt
Introduction Text alignment
Text alignment
Manual or automatic
Paragraph/sentence/word level
Automatic alignment tools/algorithms generally fall into three categories:
length based: “when two sentences correspond, the words in them also correspond”
lexical/dictionary based: relies on lexical information or dictionaries to perform the alignment
partial similarity (cognates) based: relies on occurrences of tokens graphically or otherwise identical (cognates)
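As a rough illustration of two of these families, the sketch below scores a candidate sentence pair with a length cue and with a cognate cue. The function names and both formulas are assumptions made for this example; they are not taken from any aligner used in Per-Fide.

```python
import re

def length_score(src: str, tgt: str) -> float:
    """Length cue: ratio of the shorter sentence to the longer one (0..1)."""
    a, b = len(src), len(tgt)
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)

def cognate_score(src: str, tgt: str) -> float:
    """Cognate cue: share of tokens that occur identically in both sentences."""
    def tokens(s: str) -> set:
        return set(re.findall(r"\w+", s.lower()))
    s, t = tokens(src), tokens(tgt)
    if not s or not t:
        return 0.0
    return len(s & t) / min(len(s), len(t))
```

A real aligner would combine such cues over whole sequences of sentences rather than score isolated pairs.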
Text alignment – Example
Table: Extract of sentence-level alignment performed using Portuguese and Russian subtitles from the movie Tron.
Introduction Books
Books
Obtained directly from publishers or, if in public domain, from Project Gutenberg and similar projects
Large variety of formats: PDF, MS Word, HTML, ebook formats, ...
If not already in plain text, they need to be converted before the alignment
This is where all the trouble starts!
Book alignment problems
pagination – page numbers, headers, footers, ...
previous text formatting – sub/superscript, bold, italics, ...
sections
paragraphs
translineations and transpaginations
footnotes
text encoding
...
Book alignment problems – Example
(...)
gaiement. Sur le devant s<92>’ouvrait la porte
d<92>’entrée, donnant accès dans la salle commune.
Une légère véranda, qui en proté-
<96>- 86 <96>-
^L geait la partie antérieure contre l<92>’action
des rayons solaires, reposait sur de sveltes bambous.
Le tout était peint d<92>’une fraîche
(...)
La Jangada, Jules Verne
Text::Perfide::BookCleaner
First approach
RegExp + Find & Replace
Well-intentioned but:
Too naïve
Big mess
A more sophisticated approach was needed!
Architecture
Build a pipeline; each step handles a specific set of problems.
1 pages
2 sections
3 paragraphs
4 footnotes
5 chars
6 ...
7 commit
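The pipeline idea can be sketched as a list of functions, each taking the text and returning a cleaned version of it. This is a minimal sketch, not the module's interface: `run_pipeline` and the two placeholder steps are hypothetical names that only show the chaining.

```python
from typing import Callable, List

Step = Callable[[str], str]

def run_pipeline(text: str, steps: List[Step]) -> str:
    """Apply each cleaning step in order, feeding each output to the next step."""
    for step in steps:
        text = step(text)
    return text

# Placeholder steps standing in for the real pages/sections/... stages.
def pages(text: str) -> str:
    return text.replace("\f", "_pb_")    # mark form-feed page breaks

def chars(text: str) -> str:
    return text.replace("\u2019", "'")   # normalize curly apostrophes

cleaned = run_pipeline("l\u2019action\fdes rayons", [pages, chars])
```

Keeping every step behind the same text-in, text-out signature makes it easy to reorder, skip or test steps in isolation.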
whenever possible, use ontologies and DSLs
they help organize things
they make it possible to abstract from the code and discuss details at a higher level (even with people from other areas)
Pages
Goal
Identify and remove from the text the elements related to book pagination:
page numbers
headers
footers
page breaks
These elements often degrade the performance of the aligner.
Pages – Example
est vrai qu’il fallait être assez chanceux pour
rencontrer le nabab, et assez audacieux pour
s’emparer de sa personne.
Page 3
^L La maison à vapeur Jules Verne
Le faquir, - évidemment le seul entre tous
que ne surexcitât pas l’espoir de gagner la
prime, - filait au milieu des groupes, s’arrêtant
La Maison à Vapeur, Jules Verne
Pages – Algorithm
1 identify page breaks (e.g., ^L)
2 lines near a page break are candidates for headers and footers
3 count the occurrences of each normalized candidate
4 headers and footers are extracted from candidates which occur more than a threshold value
5 replace everything with a custom mark
6 move all the necessary information to a standoff file
Pages – Example
est vrai qu’il fallait être assez chanceux pour
rencontrer le nabab, et assez audacieux pour
s’emparer de sa personne. _pb2_
Le faquir, - évidemment le seul entre tous
que ne surexcitât pas l’espoir de gagner la
prime, - filait au milieu des groupes, s’arrêtant
La Maison à Vapeur, Jules Verne
Sections
Goal
Identify and normalize the divisions between the several sections of a book (parts, chapters, acts, scenes, epilogue, afterword, ...)
An ontology was created, containing types of divisions and subdivisions, in several languages.
Sections – Ontology
Example
cap
PT capítulo, cap, capitulo
FR chapitre, chap
EN chapter, chap
NT sec
PT fim
FR fin
EN the_end
BT _alone
This ontology is used to automatically generate a part of the code.
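One way to picture code generated from the ontology is a regular expression built from its terms. The sketch below assumes a dictionary form of the `cap` entry only; the dictionary layout, `build_pattern` and `SECTION_RE` are hypothetical and do not reproduce the generator used by the module.

```python
import re

# Hypothetical in-memory form of the "cap" entry shown above.
ONTOLOGY = {
    "cap": {
        "PT": ["capítulo", "cap", "capitulo"],
        "FR": ["chapitre", "chap"],
        "EN": ["chapter", "chap"],
    },
}

def build_pattern(ontology: dict) -> re.Pattern:
    """Generate one regex with a named group per section type."""
    groups = []
    for kind, langs in ontology.items():
        words = {w for ws in langs.values() for w in ws}
        # Longest alternatives first, so "chapter" is tried before "chap".
        ordered = sorted(words, key=lambda w: (-len(w), w))
        alternatives = "|".join(re.escape(w) for w in ordered)
        groups.append(f"(?P<{kind}>{alternatives})")
    return re.compile(r"^\s*(?:" + "|".join(groups) + r")\b", re.IGNORECASE)

SECTION_RE = build_pattern(ONTOLOGY)
```

Adding a language or a synonym then means editing the ontology, not the matching code.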
Sections – Example
PRIMEIRA PARTE
FANTINE
^L LIVRO PRIMEIRO
UM JUSTO
O abade Myriel
Em 1815, era bispo de Digne, o reverendo Carlos
Francisco Bemvindo Myriel, o qual contava setenta e
Os Miseráveis, Vitor Hugo
Sections – Algorithm
1 Search for potential section divisions:
lines with keywords – capítulo, chapter, Chap., Appendix, Table des Matières, ...
pages or lines containing only numbers
roman numbering
...
2 Insert a custom mark immediately before the identified section
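A minimal sketch of these two steps is shown below. The keyword list is a small hand-written sample, the mark format `_sec:...._` is a simplification of the marks shown in the examples, and `mark_sections` is a hypothetical name. Roman numerals are matched case-sensitively so that an ordinary lowercase word such as the French "dix" is not mistaken for one.

```python
import re

KEYWORDS = r"cap[ií]tulo|chapter|chap\.?|appendix|table des mati[eè]res"
ROMAN = r"(?=[IVXLCDM])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"

# A line is a candidate if it starts with a keyword, or holds only a number,
# or holds only an uppercase roman numeral.
SECTION_LINE = re.compile(
    rf"^\s*(?:(?P<kw>(?i:{KEYWORDS})\b.*)|(?P<num>\d+)|(?P<rom>{ROMAN}))\s*$"
)

def mark_sections(text: str) -> str:
    """Insert a mark on its own line immediately before each candidate line."""
    out = []
    for line in text.split("\n"):
        m = SECTION_LINE.match(line)
        if m:
            out.append(f"_sec:{m.lastgroup}_")   # records which rule fired
        out.append(line)
    return "\n".join(out)
```

The original line is kept after the mark, so a false positive can still be spotted and corrected by hand.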
Sections – Example
_sec+O:PARTE=PRIMEIRA_
FANTINE
_sec+O:LIVRO=PRIMEIRO_
UM JUSTO
O abade Myriel
Em 1815, era bispo de Digne, o reverendo Carlos
Francisco Bemvindo Myriel, o qual contava setenta e
Os Miseráveis, Vitor Hugo
Sections
Identifying the different parts within a bitext:
makes it possible to subsequently compare the two versions and remove parts which can only be found in one of them
makes it possible to perform a structural alignment (Text::Perfide::BookSync)
Paragraphs
Goal
Identify and normalize paragraph notation, direct speech, etc.
Paragraphs – Example
L’hôtesse prit la défense de son curé:
- D’ailleurs, il en plierait quatre comme vous sur
son genou. Il a, l’année dernière, aidé nos gens à
rentrer la paille; il en portait jusqu’à six bottes
à la fois, tant il est fort!
- Bravo! dit le pharmacien. Envoyez donc vos filles
en confesse à des gaillards d’un tempérament pareil!
Moi, si j’étais le gouvernement, je voudrais qu’on
saignât les prêtres une fois par mois.
Paragraphs – Example
L’hôtesse prit la défense de son curé:
"D’ailleurs, il en plierait quatre comme vous sur
son genou. Il a, l’année dernière, aidé nos gens à
rentrer la paille; il en portait jusqu’à six bottes
à la fois, tant il est fort! "
"Bravo!" dit le pharmacien. "Envoyez donc vos filles
en confesse à des gaillards d’un tempérament pareil!
Moi, si j’étais le gouvernement, je voudrais qu’on
saignât les prêtres une fois par mois."
Paragraphs – Algorithm
paragraph identification is performed by calculating metrics based on the number of blank lines and indentation
identification and normalization of direct speech:
punctuation, paragraph, dash
text in quotes
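The metric-based paragraph identification might look like the sketch below, which guesses how a given file delimits its paragraphs. The two ratios and the decision rule are assumptions chosen for illustration, and `paragraph_style` is a hypothetical name; the slides do not give the actual metrics.

```python
def paragraph_style(text: str) -> str:
    """Guess whether paragraphs are delimited by blank lines or by indentation."""
    lines = text.split("\n")
    non_blank = [l for l in lines if l.strip()]
    if not non_blank:
        return "unknown"
    # Metric 1: share of blank lines. Metric 2: share of indented lines.
    blank_ratio = (len(lines) - len(non_blank)) / len(lines)
    indent_ratio = sum(l[0] in " \t" for l in non_blank) / len(non_blank)
    if blank_ratio == 0 and indent_ratio == 0:
        return "one paragraph per line"
    # Assumed rule: the more frequent cue is taken as the paragraph delimiter.
    return "blank lines" if blank_ratio >= indent_ratio else "indentation"
```

Once the notation is known, it can be rewritten into a single convention before alignment.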
Footnotes
Goal
Identify and remove footnote call marks and footnote expansions
Footnotes – Example
On fit un inventaire de son argent comptant, et on
le mena dans le château que fit construire le roi
Charles V, fils de Jean II, auprès de la rue
Saint-Antoine, à la porte des Tournelles[1].
[1] La Bastille, qui fut prise par le peuple de
Paris, le 14 juillet 1789, puis démolie. B.
^L Quel était en chemin l’étonnement de l’Ingénu!
je vous le laisse à penser. Il crut d’abord
que c’était un rêve.
Oeuvres de Voltaire, Voltaire
Footnotes – Algorithm
1 Search for footnote expansions (lines beginning with <<1>>, [2], ^3, ...)
2 Replace with custom mark
3 Only footnote call marks left
4 Search again for the same patterns in the middle of the text
5 Replace with custom mark
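The two passes can be sketched as below. This is an assumption-laden illustration: an expansion is taken to run until the next blank line or form feed, the marks are numbered in order of appearance, and `mark_footnotes` is a hypothetical name, not a function of the module.

```python
import re

EXPANSION = re.compile(r"^(?:<<\d+>>|\[\d+\]|\^\d+)")   # at the start of a line
CALL = re.compile(r"<<\d+>>|\[\d+\]|\^\d+")             # anywhere in the text

def mark_footnotes(text: str):
    """Return (marked text, list of removed footnote expansions)."""
    out, expansions, in_note = [], [], False
    for line in text.split("\n"):
        if EXPANSION.match(line):                 # steps 1-2: expansions
            expansions.append(line)
            out.append(f"_fne{len(expansions)}_")
            in_note = True
        elif in_note and line.strip() and not line.startswith("\f"):
            expansions[-1] += "\n" + line         # continuation of the note
        else:
            in_note = False
            out.append(line)

    count = 0
    def number_call(_match):                      # steps 3-5: call marks
        nonlocal count
        count += 1
        return f"_fnr{count}_"

    return CALL.sub(number_call, "\n".join(out)), expansions
```

Removing the expansions first matters: afterwards any remaining occurrence of the same pattern can only be a call mark.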
Footnotes – Example
On fit un inventaire de son argent comptant, et on
le mena dans le château que fit construire le roi
Charles V, fils de Jean II, auprès de la rue
Saint-Antoine, à la porte des Tournelles_fnr29_.
_fne8_
^L Quel était en chemin l’étonnement de l’Ingénu!
je vous le laisse à penser. Il crut d’abord
que c’était un rêve.
Oeuvres de Voltaire, Voltaire
Words and characters
translineations
text encoding
...
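As an example for translineations, the sketch below rejoins a word that was hyphenated across a line break. It assumes page marks and headers between the two halves have already been dealt with by the pages step. The lowercase test is a crude guard, assumed here, against joining genuine compounds such as "Saint-Antoine"; `join_translineations` is a hypothetical name.

```python
import re

# Letters, a hyphen at the end of the line, then the rest of the word.
TRANSLINEATION = re.compile(r"([^\W\d_]+)-\n[ \t]*(\S+)[ \t]*\n?")

def join_translineations(text: str):
    """Rejoin words split across lines; return (text, number of joins)."""
    count = 0
    def join(m):
        nonlocal count
        head, tail = m.group(1), m.group(2)
        if not tail[0].islower():   # probably a real hyphen: leave untouched
            return m.group(0)
        count += 1
        return head + tail + "\n"   # the line break moves after the word
    return TRANSLINEATION.sub(join, text), count
```

The returned count corresponds to the kind of figure a report entry such as `word_tr` could hold.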
Report
Previous steps produce a report
Summarizes what was found, what was assumed and what was done
Main goal is to support diagnosing what the program did, so that whatever went wrong can be corrected manually
livros/_FR_15.pdf.txt:
footers=['( Page) = 241']
headers=[
"(La maison \x{e0} vapeur Jules Verne) = 241"]
ctrL=1;
pagnum_ctrL=241;
sectionsO=2;
sectionsN=30;
word_tr=58;
words=118036;
Commit
Final and irreversible step which removes all the custom marks added by the previous steps
Outputs a cleaned copy of the document
This is the last stage before the alignment (or any other further processing)
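A commit step could be as small as one substitution. The mark syntax in the pattern below is inferred from the marks visible in the examples (`_pb2_`, `_fnr29_`, `_fne8_`, `_sec+O:PARTE=PRIMEIRA_`) and is an assumption, as is the name `commit`; the module may treat each kind of mark differently.

```python
import re

# Marks as they appear in the examples, with any spaces right before them.
MARK = re.compile(r"[ \t]*_(?:pb\d+|fn[re]\d+|sec[^_\s]*)_")

def commit(text: str) -> str:
    """Strip every custom mark; the marks cannot be recovered afterwards."""
    return MARK.sub("", text)
```

Because nothing is deleted for good before this point, every earlier step can be reviewed through the report and corrected first.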
Conclusions, wish list and future work
Conclusions and wish list
There is no de facto standard format for plain text books (documents?)
Documents are very heterogeneous (provenance, type and quantity, notation formats, ...)
Hurrah to regular expressions!
20/80 rule applies
Ontologies and DSLs lead to a better structure
Common pattern:
search text
calculate metrics
perform action accordingly
Report generated at the end should present a smart summary of what was found and done
Related ongoing work
Text::Perfide::BookPairs: finds repeated books and pairs of books (same book in different languages) within a collection
Text::Perfide::BookSync: uses the section delimitation made by T::P::BC to make a structural alignment
Text::Perfide::CorporaFlow: uses a DSL to guide the corpora preparation workflow (to be done)
Text::Perfide::SciPaperCleaner: cleaner for scientific papers (to be done)
Future work
Standoff annotation – no changes in the original file until commit
Export to ebook formats – .fb2, .epub, ...
...
CPAN
Is it on CPAN yet?
No, but it will be really, really soon!
Missing
More and better documentation
More and better tests
Questions
o/
Andre Santos andrefs@cpan.org