Page 1: Making Portable Document Format (PDF) Files Work for You Tuomas S. Kostiainen and Jill R. Sommer ATA Conference 2009.

Making Portable Document Format (PDF) Files Work for You

Tuomas S. Kostiainen

and

Jill R. Sommer

ATA Conference 2009

Page 2:

PDF File Basics

What is a PDF file and why do we use them?

– Stands for "Portable Document Format." PDF is a multi-platform file format developed by Adobe Systems. A PDF file captures document text, fonts, images, and even formatting of documents from a variety of applications. You can e-mail a PDF document to your friend and it will look the same way on his screen as it looks on yours, even if he has a Mac and you have a PC. Since PDFs contain color-accurate information, they should also print the same way they look on your screen.

(Source: TechTerms.com)

Page 3:

PDF File Basics

Where do we as translators encounter PDF files?
– Translation projects: source text, proofreading, reference
– Forms: registration forms, IRS W-9
– Creating PDFs: resume, invoices, file sharing, printing/publishing

Problems associated with PDF files
– Rigid (not meant to be editable)
– Converting from PDF > DOC etc.

Page 4:

Adobe Acrobat

Versions: Adobe Reader, Adobe Acrobat Standard, Adobe Acrobat Pro, Adobe Acrobat Pro Extended

Product comparison: www.adobe.com/products/acrobat/matrix.html

Compatibility with earlier versions
– Update to the current version (9)

Should be part of your toolbox

Page 5:

Other “Comparable” Products

PDF Nitro (Express, Professional, and free version): www.nitropdf.com

Foxit PDF Tools (Reader, Editor, Creator, Phantom etc.): www.foxitsoftware.com/pdf/

Solid PDF Tools: www.soliddocuments.com

DocuCom PDF Gold: www.pdfwizard.com

Pdf995 Suite: www.pdf995.com

others

Page 6:

Working with PDF Files

1. Editing and Commenting PDF Files

2. Searching for Text in PDF Files

3. Creating and Filling Electronic Forms

4. Using Electronic Signatures

5. Creating TMs from PDF Files Using LogiTerm AlignFactory

Page 7:

Editing and Commenting PDF Files

Using Commenting Tools in a normal review cycle - don’t use only sticky note comments
– available in Reader only if the PDF file author has enabled the document for commenting using Acrobat Pro (Advanced > Extend Features in Adobe Reader)

Comment & Markup tools (Tools > Comment & Markup / Tools > Customize Toolbars)
– Text Edits, Highlight Text, Callout, Arrow, Rectangle, etc.
– Show/Hide Comments
– Comments List (View > Navigation Panels > Comments)
– Spell checking of notes (Edit > Check Spelling)

Page 8:

Editing and Commenting PDF Files

Touching up text (changing text and text properties)
– Tools > Advanced Editing / Advanced Editing toolbar
– TouchUp Text, Crop

Typewriter tool
– available in Reader only if the PDF author has enabled it

Inserting/extracting/rearranging pages
– Document > Insert/Extract/Replace/Delete Pages, or use the Pages navigation pane on the left

E-mail-based review or shared review (on acrobat.com) for multi-party reviews

Page 9:

Searching for Text in PDF Files

Edit > Find: text within the current document

Edit > Search: text in one or more files

Indexing (only in Professional): possibility to index hundreds of files for quick searching
– Select Advanced > Document Processing > Full Text Index with Catalog > New Index
– Name the index, select directories to be included, click Build and specify location for the index file
– Use the resulting .pdx file for searching

Creating a Searchable Image
– With the image file open in Acrobat, select Document > OCR Text Recognition > Recognize Text Using OCR
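
To illustrate why an index makes searching hundreds of files quick, here is a minimal Python sketch of the general idea behind a full-text index: each word is mapped once to the files that contain it, so a later search is a lookup rather than a re-read of every file. This is not how Acrobat's Catalog or its .pdx files work internally. The function names and the sample texts are hypothetical, and the sketch assumes the text has already been extracted from the PDF files.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Return the sorted names of documents that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical sample data standing in for text extracted from PDF files.
documents = {
    "contract.pdf": "The supplier shall deliver the goods within 30 days.",
    "manual.pdf": "Deliver the unit to an authorized service center.",
    "invoice.pdf": "Payment is due within 30 days of the invoice date.",
}
index = build_index(documents)
print(search(index, "within 30 days"))  # ['contract.pdf', 'invoice.pdf']
print(search(index, "deliver"))         # ['contract.pdf', 'manual.pdf']
```

Building the index takes time up front, which is why it pays off mainly for large reference collections that are searched repeatedly.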

Page 10:

Creating and Filling Electronic Forms

Simple filling with Typewriter tool

Using text boxes

Converting electronic files to forms using Form Wizard

Blueberry PDF Form Filler; FREE application for filling in and printing PDF forms (www.bbconsult.co.uk/Resources/PDFFormFiller.aspx)
– Note: deselect “Lock All Controls” button

Page 11:

Using Electronic Signatures

Inserting a scanned signature
– copy and paste via clipboard (gets inserted as a “stamp”)

Creating and using a digital ID
– Creating a digital ID: Advanced > Security Settings > Digital IDs > Add ID > “A new digital ID I want to create now” > Next > New PKCS#12 digital ID file > Next. Fill in the information fields as needed and click Next. Select a location for the ID and define a password. Click Finish to return to the Security Settings dialog box. Click Close.
– Signing a PDF document: Advanced > Sign & Certify > Place Signature. Drag a rectangle where you want to place the signature. Choose a digital ID, type the password, choose an appearance, and click Sign.

Page 12:

Creating Translation Memories from PDF Files Using LogiTerm AlignFactory

Other tools:
– YouAlign by LogiTerm; online tool, FREE (for a limited time), limited selection of languages (www.youalign.com)
– NoBabel AutoAligner by KCSL; online tool, limited selection of languages (http://nobabel.com/)

LogiTerm AlignFactoryLight (http://www.terminotix.com)
– Quick and easy tool to create TMs from PDF files
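
For readers curious about what an alignment tool produces, the Python sketch below writes aligned sentence pairs into TMX, a common XML exchange format for translation memories. It is an assumption that TMX is the format you want; the slides do not say which formats AlignFactory or the online tools export. The function name, the tool name in the header, and the German/English sample pairs are hypothetical, and the alignment itself (the hard part) is assumed to be already done.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this namespaced attribute out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def pairs_to_tmx(pairs, src_lang, tgt_lang):
    """Build a TMX 1.4 document (as a string) from aligned segment pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        # One translation unit per aligned pair, with one variant per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

# Hypothetical aligned segments (source, target).
pairs = [
    ("Das Gerät ausschalten.", "Switch off the device."),
    ("Den Netzstecker ziehen.", "Unplug the power cord."),
]
tmx_text = pairs_to_tmx(pairs, "de", "en")
print(tmx_text)
```

Whether a given translation environment tool imports such a file without adjustments should be checked in a test project first.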

Page 13:

Additional PDF-related links

www.adobe.com/support/

www.planetpdf.com

www.pdfstore.com

Page 14:

Part Two: Creating PDFs and OCR

Page 15:

Reasons translators might need to create a PDF

Résumés

Invoices

Letters of certification

Various protected files

Page 16:

Creating a PDF from Word or Excel

Choose Print, then select a PDF tool (Acrobat Distiller, win2pdf, etc.) as the printer

Select the menu button

Page 17:

Optical Character Recognition

Optical character recognition (or OCR) is the conversion of images of handwritten, typewritten or printed text into machine-editable text.

PDFs can be either straight text or a graphic. OCR can handle both.

Page 18:

Optical Character Recognition

OCR tools use pattern recognition, artificial intelligence and computer vision as well as digital character recognition.

The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem in applications where clear imaging is available, such as scanning of printed documents. Some tools can now easily recognize Cyrillic and Asian characters as well.

Page 19:

OCR Tools

ABBYY FineReader (http://www.abbyy.com/)

PDF Transformer (by ABBYY)

OmniPage (http://www.nuance.com/imaging/products/omnipage.asp)

Microsoft Office Document Imaging (part of MS Office 2007)

ExperVision (http://www.expervision.com/)

Page 20:

ABBYY FineReader

ABBYY FineReader is the clear favorite among translators (although PDF Transformer is a close second), because it creates fewer text boxes than other OCR programs.

The spellcheck feature helps ensure that the document you are working on doesn’t have any spelling errors that would corrupt the TM.

ABBYY supports the most languages (184 at last count).

Page 21:

Abkhaz, Adyghian, Afrikaans, Agul, Albanian, Altai, Armenian (Eastern, Western, Grabar), Avar, Aymara, Azerbaijani (Cyrillic), Azerbaijani (Latin), Bashkir, Basic, Basque, Belarusian, Bemba, Blackfoot, Breton, Bugotu, Bulgarian, Buryat, C/C++, COBOL, Catalan, Cebuano, Chamorro, Chechen, Chinese Simplified, Chinese Traditional, Chukchee, Chuvash, Corsican, Crimean Tatar, Croatian, Crow, Czech, Dakota, Danish, Dargwa, Dungan, Dutch (Netherlands and Belgium), English, Eskimo (Cyrillic), Eskimo (Latin), Esperanto, Estonian, Even, Evenki, Faroese, Fijian, Finnish, Fortran, French,

Page 22:

Frisian, Friulian, Gagauz, Galician, Ganda, German (Luxemburg), German (new and old spelling), Greek, Guarani, Hani, Hausa, Hawaiian, Hebrew, Hungarian, Icelandic, Ido, Indonesian, Ingush, Interlingua, Irish, Italian, JAVA, Japanese, Jingpo, Kabardian, Kalmyk, Karachay-balkar, Karakalpak, Kasub, Kawa, Kazakh, Khakass, Khanty, Kikuyu, Kirghiz, Kongo, Koryak, Kpelle, Kumyk, Kurdish, Lak, Latin, Latvian, Lezgi, Lithuanian, Luba, Macedonian, Malagasy, Malay, Malinke, Maltese, Mansy, Maori, Mari, Maya, Miao, Minangkabau, Mohawk, Moldavian, Mongol, Mordvin, Nahuatl, Nenets, Nivkh, Nogay,

Page 23:

Norwegian (nynorsk and bokmål), Nyanja, Occidental, Ojibway, Ossetian, Papiamento, Pascal, Polish, Portuguese (Portugal and Brazil), Provencal, Quechua, Rhaeto-romanic, Romanian, Romany, Rundi, Russian, Russian (old spelling), Rwanda, Sami (Lappish), Samoan, Scottish Gaelic, Selkup, Serbian (Cyrillic), Serbian (Latin), Shona, Simple chemical formulas, Slovak, Slovenian, Somali, Sorbian, Sotho, Spanish, Sunda, Swahili, Swazi, Swedish, Tabasaran, Tagalog, Tahitian, Tajik, Tatar, Thai, Tok Pisin, Tongan, Tswana, Tun, Turkish, Turkmen, Tuvinian, Udmurt, Uighur (Cyrillic), Uighur (Latin), Ukrainian,

Page 24:

Uzbek (Cyrillic), Uzbek (Latin), Welsh, Wolof, Xhosa, Yakut, Zapotec, Zulu

Page 25:

Spellchecking

ABBYY supports pre- and post-reform German orthography, Old German script, programming languages, and simple chemical formulas.

ABBYY can also sometimes replicate graphics and logos that you can paste into your file.

The text recognition software includes dictionaries with spell-checking capabilities for 38 languages, allowing verification of recognized text directly in the FineReader Editor.

Page 26:

Potential Problems

Page setups can vary and create inconsistent margins and page layouts.

OCR tools have problems with handwriting, bullet lists, check boxes, static from "fuzzy" fax transmissions, and tables.

Formatting sometimes needs to be cleaned up (double spaces, text boxes, columns, etc.).
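
Some of this cleanup can be scripted once the OCR output is available as plain text. The Python sketch below is one possible approach, not a feature of any tool named in these slides; the function name and the sample string are hypothetical. It handles only text-level debris (double spaces, soft hyphens, words hyphenated across line breaks) and does nothing about text boxes or columns, which still need to be fixed in the word processor.

```python
import re

def clean_ocr_text(text):
    """Apply a few conservative cleanups to plain text taken from an OCR tool."""
    # Remove soft hyphens (U+00AD), which OCR output often carries invisibly.
    text = text.replace("\u00ad", "")
    # Rejoin words split by a hyphen at the end of a line ("trans-\nlation").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Strip trailing spaces and tabs at the end of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    return text.strip()

# Hypothetical OCR output with typical debris.
sample = "The  trans-\nlation   memory \nis ready.  "
print(clean_ocr_text(sample))
```

Note that the hyphen rule also joins genuinely hyphenated compounds that happen to break at a line end, so the result should still be proofread against the source.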

Page 27:

ABBYY FineReader

Page 28:

Using OCR to create Word files

Open image (PDF, TIF, etc.)

Read file using OCR tool

Check spelling (allows you to verify words that the OCR program misread or did not recognize)

Save as Word file

Create a clean file and copy and paste the text into it.

Page 29:

Troubleshooting

Use Edit > Paste Special to copy the text into a fresh Word file and format it by hand.

Delete the illegible pages in Adobe Acrobat and run the legible pages through the OCR tool.

OCR isn’t a magic bullet. If the source text is largely illegible, you may want to just give up and type it in by hand.

Page 30:

CodeZapper

"CodeZapper" is a set of Word macros designed to “clean up” Word files before being imported into a translation environment program such as Deja Vu DVX or MemoQ.

Word documents are often strewn with junk or “rogue” tags (so-called “smart tags”, language tags, track changes tags, soft hyphenations, scaling and spacing changes, redundant bookmarks, etc.).

This tagged information shows up in the DVX or MemoQ grid as spurious {1}codes{2} around, or even in the middle of, words, making sentences difficult to read and translate and generally negating many of the productivity benefits of the program.

Page 31:

CodeZapper

OCR’d files or files converted from PDF are even worse.

CodeZapper tries to remove as many of these tags as possible while retaining formatting and layout.

It also contains a number of other macros which may be useful before and after importing files into DVX or MemoQ.

Page 32:

Final Words

Do not use online OCR tools like the Tesseract OCR Engine from Google if your documents are confidential.

Try several demos to determine which tool best suits your needs.

Shop around for the best price ($399.99 vs. EUR 139/GBP 89 or even EUR 90 ($116)).

Page 33:

Links

CodeZapper http://www.transir.cn/redirect.php?tid=992&goto=lastpost

Presentation: http://translationmusings.com/2009/11/05/presentation-from-ata-conference/