Dissecting PDF Documents - Mark S....

23
Dissecting PDF Documents Mark S. Rasmussen iPaper [email protected]

Transcript of Dissecting PDF Documents - Mark S....

Page 1: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Dissecting PDF Documents

Mark S. Rasmussen – [email protected]

Page 2: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

What Is This Session NOT About?

• Creating PDFs

• How to use Acrobat

• Transparency flattening options in InDesign

• So what is it about?– PDF documents

– Tooling

– Extracting data

Page 3: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

The PDF Format

• 1.0 released in 1993

• Open standard as of July 1st 2008

• Reference publicly available– http://www.adobe.com/devnet/pdf/pdf_reference_archive.html

0

500

1000

1500

PDF 1.3 PDF 1.4 PDF 1.5 PDF 1.6 PDF 1.7 OOXML 1.0

Page 4: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

PDF Structure

• Header– %PDF-1.4– %âãÏÓ (optional but common)

• Body– Objects

• Xref table– Index table containing pointers to objects

• Trailer– Pointers to Xref table, key objects– %%EOF

Page 5: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

PDF Objects

• Boolean, Number, String, Name, Array, Dictionary, Stream, Null

• Indirect & direct objects

• Random access

”A PDF file should be thought of as a flattenedrepresentation of a data structure consisting of a collection of objects that can refer to each other in any arbitrary way.”

Page 6: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Reading A PDF – The Ninja Way!

Page 7: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Incremental Changes

• Fast saves, but not for free

• Undo & history

• Save vs Save As

• Single-pass writing

• Linearization

Page 8: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Linearization & Xref Chaining

Page 9: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

PDF Objects: Image

• Stream object with dictionary header

Page 10: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

ABCpdf

• Commercial

• Excellent .NET API

• ObjectSoup is avaluable friend

• Good image rendering

• Useless SWF rendering

• Unstable rendering

• Decent support

• http://www.websupergoo.com/secret.htm

Page 11: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Acrobat

• Commercial (tricky license)

• No COM libraries after 7.x

• Surprisingly stable and fast

• Ugly API

Page 12: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Rendering Using Acrobat

Page 13: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Xpdf

• Open source (GPL)

• Pdffonts, pdfimages,pdfinfo, pdftops, pdftotext

• Basis for many other libraries & tools

• Commercial license & COM library available at www.glyphandcog.com

• http://www.foolabs.com/xpdf/

Page 14: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

PDF Font Management

• Client must have fonts used in PDF document

• However…

– Complete font can be embedded

– Or a subset

– 14 standard fonts (Courier, Helvetica, Times + ITC Zapf & Dingbats)

– Font replacement

Page 15: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Text In PDF

• No concept of text, just characters

• Flow order not guaranteed

• Requires guesstimation to extract text

• Extraction may require embedded fonts

• Lots of tools, some better than others

Page 16: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Text According To ABCpdf1 2

2

3

3

4

4

5

5

6

1

6

Page 17: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Text According To Xpdf

1

2

3

4

1 2

3

4

5

6

5

6

Page 18: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Physical Text According To Xpdf1 2

3

4

5

6

1 23

4

5

Page 19: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

SWFTools

• Open source (GPL)

• PDF2SWF converts PDF files to SWF format

– Based on Xpdf

– Active mailing list

– Author actively working on project

– Use dev snapshots / git repo

– Stable, but some kinks

• http://www.swftools.org

Page 20: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

iTextSharp

• Open source (5.0 – AGPL(!), 4.1 - LGPL)

• Commercial license available

• .NET port of iText

• Very stable

• Excellent for creating &modifying PDFs

• No rendering capabilites

• http://itextsharp.sourceforge.net/

• http://itextpdf.com/

Page 21: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Extracting Bookmarks

Page 22: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Extracting Links

Page 23: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,

Thank you!For attending this session

[email protected]@improvedk

improve.dk

EmailTwitterBlog