Http Www.basistech.com Knowledge-center Forensics Extracting-text-from-Arabic-PDF

7/31/2019 Http Www.basistech.com Knowledge-center Forensics Extracting-text-from-Arabic-PDF


Extracting Searchable Text from Arabic PDFs

Brian Carrier, Ph.D.

Director of Digital Forensics

Basis Technology Corp.



Motivation

Need to get the text out of files before they can be indexed and searched.

Arabic PDF files can be challenging.


PDF Basics

Raw file contents are organized into objects.

Each object stores a specific type of info:

Document (Root) object

Page objects

Font objects

Basic structure of file is viewable text:


[]

7 0 obj

endobj

[]
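Because the object structure is viewable text, a short script can list the objects in a file. The following is a minimal sketch, not a full PDF parser: the sample bytes are a hand-written, hypothetical fragment (not taken from the slides), and the regular expression assumes uncompressed objects that are not stored inside object streams.

```python
import re

# Hypothetical, hand-written fragment of a PDF body (illustration only).
SAMPLE = b"""7 0 obj
<< /Type /Page /Parent 3 0 R /Contents 8 0 R >>
endobj
9 0 obj
<< /Type /Font /Subtype /TrueType /BaseFont /ABCDEF+Arial >>
endobj
"""

# "<number> <generation> obj ... endobj" delimits one object.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def list_objects(raw: bytes):
    """Return (object number, generation, /Type name or None) per object."""
    found = []
    for num, gen, body in OBJ_RE.findall(raw):
        type_match = re.search(rb"/Type\s*/(\w+)", body)
        type_name = type_match.group(1).decode() if type_match else None
        found.append((int(num), int(gen), type_name))
    return found


for entry in list_objects(SAMPLE):
    print(entry)
```

Real files often compress page content and may pack objects into object streams, so a production tool needs a proper parser rather than a regular expression.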



PDF Text

Text is stored in chunks of one or more characters.

Each chunk is located at a given X,Y coordinate

Chunks can be stored in any order in the file
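Since chunks carry their own coordinates and may appear in any order, an extractor has to sort them before reading. A minimal sketch of that idea follows; the `TextChunk` class, the `reading_order` function, and the `line_tolerance` value are all assumptions made for illustration, and the code assumes a simple single-column page read top to bottom, left to right.

```python
from dataclasses import dataclass


@dataclass
class TextChunk:
    x: float   # left edge, in PDF user-space units
    y: float   # baseline; PDF y grows upward from the bottom of the page
    text: str


def reading_order(chunks, line_tolerance=2.0):
    """Sort chunks top to bottom, then left to right within a line."""
    def key(chunk):
        # Quantize y so baselines that differ slightly fall on the same line.
        # Negate it because larger y means higher on the page.
        return (-round(chunk.y / line_tolerance), chunk.x)
    return sorted(chunks, key=key)


chunks = [
    TextChunk(120.0, 700.0, "world"),
    TextChunk(72.0, 680.0, "second line"),
    TextChunk(72.0, 700.4, "Hello"),
]
print([c.text for c in reading_order(chunks)])
```

Quantizing the baseline is a crude way to group lines; chunks that sit near a bucket boundary can be split, so a real extractor would cluster baselines instead.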


PDF Fonts and Encodings

PDF fonts typically store only the glyphs that are used.

Text chunk stores an index into a PDF font object.

Font object may map glyph to a Unicode value.
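The index-to-Unicode lookup can be pictured as a small table per font. The sketch below uses a hypothetical subset font, `FONT_F1`, whose indices are arbitrary; it is an illustration of the lookup, not of how a PDF font object is encoded on disk. Indices without a mapping become U+FFFD, the Unicode replacement character.

```python
# Hypothetical subset font: glyph index -> Unicode code point.
# Only the glyphs actually used in the document are present.
FONT_F1 = {1: 0x0048, 2: 0x0069, 3: 0x0021}   # 'H', 'i', '!'


def decode(indices, to_unicode):
    """Map glyph indices to text; unmapped indices become U+FFFD."""
    return "".join(
        chr(to_unicode[i]) if i in to_unicode else "\ufffd"
        for i in indices
    )


print(decode([1, 2, 3], FONT_F1))
```

When the font object provides no mapping at all, the text cannot be recovered from the indices alone, which is why the mapping is described as optional.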


Rendering Difference

Displaying a PDF requires the PDF Engine to map fonts

Note that standard encoding values are not required.


Basic Extraction Approach

1. Parse PDF file to identify page content objects

2. Parse page content stream into text chunks

3. Sort text chunks based on coordinates

4. Process chunks in order:

   1. Get index for each character

   2. Use font information to map index to Unicode (if defined)

   3. Add Unicode value to end of string


English Extraction Example


Arabic Glyphs

Arabic characters have different shapes depending on their location in a word.

Each shape is a different glyph in a font.
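Unicode happens to encode these shapes as separate "presentation form" code points, which makes them easy to inspect. The sketch below lists the four forms of the letter Beh using Python's standard `unicodedata` module; the helper name `contextual_forms` is made up for this example.

```python
import unicodedata


def contextual_forms(first=0xFE8F, last=0xFE92):
    """List (code point, name, decomposition) for a range of code points.

    The default range covers the four presentation forms of Beh.
    """
    rows = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        rows.append(
            (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.decomposition(ch))
        )
    return rows


for row in contextual_forms():
    print(*row)
```

Each decomposition points back to the same general letter, U+0628, which is the relationship a text extractor later relies on when it normalizes the glyphs it finds.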


Arabic Extraction Example


Logical and Presentation Orders

Text in computers is typically stored in logical order

First character stored is first character read or written

Presentation order is based on screen layout

Orders are same for Left to Right (LTR) Languages:

Opposite for Right to Left (RTL) Languages:


Possible Order Solution

PDF stores data in presentation (display) order.

Text editors need the text in logical order though.

Need to convert from presentation to logical order.

Obvious solution:

After decoding each line, reverse the order of the Arabic text:
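A minimal sketch of the reversal idea, and of where it stops working, is shown below. The Arabic word is written with escape sequences so the source reads the same in any editor; the function name `naive_to_logical` and the sample strings are assumptions for illustration.

```python
def naive_to_logical(presentation_line: str) -> str:
    """Reverse a left-to-right scan of the page into logical order."""
    return presentation_line[::-1]


# Kaf, Teh, Alef, Beh in logical (reading) order.
WORD = "\u0643\u062a\u0627\u0628"

# Scanning the page left to right yields the letters back to front.
pure_arabic_line = WORD[::-1]

# The same word followed (logically) by a number. On the page the number
# sits to the left of the word, with its digits still running left to right.
mixed_line = "2019 " + WORD[::-1]

print(naive_to_logical(pure_arabic_line) == WORD)          # pure Arabic: fine
print(naive_to_logical(mixed_line) == WORD + " 2019")      # mixed: digits flipped
```

Reversal recovers the pure Arabic line, but on the mixed line it also reverses the digits, so the year comes out back to front.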


Bi-directional Text

How should the following be logically stored?


Text can have both RTL and LTR characters and each should go in the correct direction

Unicode Bi-directional Text (BiDi) algorithm defines how to order characters in a paragraph based on:

Dominant direction of text in paragraph

Direction of each character in text

Punctuation and neighboring characters

Implicit direction markers

BiDi lets you convert from logical to presentation order.
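The per-character direction that the algorithm relies on is a standard Unicode property, and Python exposes it through `unicodedata.bidirectional`. The sketch below prints the class of a few characters; the helper name `bidi_classes` is made up for this example.

```python
import unicodedata


def bidi_classes(text: str):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Latin letter, Arabic letter Beh, space, European digit.
sample = "A\u0628 1"
for ch, cls in zip(sample, bidi_classes(sample)):
    print(f"U+{ord(ch):04X}", cls)
```

Classes such as `L` (left to right) and `AL` (Arabic letter) are strong directions, while spaces (`WS`) and most punctuation are neutral and take their direction from their neighbors, which is why the algorithm also looks at surrounding characters.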


Reverse Bi-directional Algorithm

We need Reverse BiDi to convert from presentation to logical order.
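To make the idea concrete, here is a heavily simplified sketch for one line of an RTL-dominant paragraph: reverse the whole line, then flip each left-to-right run back. This is an assumption-laden illustration, not the Unicode BiDi algorithm run in reverse and not the method used in the talk; it ignores embedding levels, mirrored punctuation, explicit direction marks, and LTR-dominant paragraphs. The function name `reverse_bidi_rtl` is hypothetical.

```python
import unicodedata

LTR_CLASSES = {"L", "EN", "AN"}   # letters and digits that run left to right
RTL_CLASSES = {"R", "AL"}         # Hebrew-style and Arabic letters


def reverse_bidi_rtl(presentation: str) -> str:
    """Presentation -> logical order for one RTL line (simplified)."""
    chars = list(presentation[::-1])
    n = len(chars)
    is_ltr = [unicodedata.bidirectional(c) in LTR_CLASSES for c in chars]
    is_rtl = [unicodedata.bidirectional(c) in RTL_CLASSES for c in chars]

    # Neutral characters join an LTR run only when LTR text is on both sides.
    i = 0
    while i < n:
        if is_ltr[i] or is_rtl[i]:
            i += 1
            continue
        j = i
        while j < n and not is_ltr[j] and not is_rtl[j]:
            j += 1
        if i > 0 and j < n and is_ltr[i - 1] and is_ltr[j]:
            for k in range(i, j):
                is_ltr[k] = True
        i = j

    # Flip each maximal LTR run back to its original direction.
    out = []
    i = 0
    while i < n:
        if is_ltr[i]:
            j = i
            while j < n and is_ltr[j]:
                j += 1
            out.extend(reversed(chars[i:j]))
            i = j
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)


WORD = "\u0643\u062a\u0627\u0628"   # Kaf, Teh, Alef, Beh in logical order
print(reverse_bidi_rtl("2019 " + WORD[::-1]) == WORD + " 2019")
```

Treating the space between two Latin words as part of the LTR run keeps a phrase such as "Hello World" in its original word order instead of swapping the two words.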



Updated Extraction Approach

1. Parse PDF file to identify page content objects

2. Parse page content stream into text chunks

3. Sort text chunks based on coordinates

4. Determine dominant text direction

5. Process chunks in order and by line:

   1. Get index for each character

   2. Use font information to map index to Unicode

   3. Add Unicode value to end of presentation order string

   4. Apply reverse BiDi algorithm to presentation order string
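Step 4 does not say how the dominant direction is found. One plausible heuristic, offered here purely as an assumption, is a majority vote over characters with a strong direction; note that the Unicode BiDi algorithm itself, when given logical text, uses the first strong character of the paragraph instead. The function name `dominant_direction` is hypothetical.

```python
import unicodedata


def dominant_direction(text: str) -> str:
    """Guess 'RTL' or 'LTR' by counting strongly directional characters."""
    rtl = sum(unicodedata.bidirectional(c) in ("R", "AL") for c in text)
    ltr = sum(unicodedata.bidirectional(c) == "L" for c in text)
    # Digits, spaces and punctuation carry no vote; ties default to LTR.
    return "RTL" if rtl > ltr else "LTR"


print(dominant_direction("\u0643\u062a\u0627\u0628 PDF 2019"))
```

A majority vote is convenient for extracted text because the characters are still in presentation order at this point, so "the first character" of the paragraph is not yet known.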


Presentation Forms / Ligatures

Encodings typically define only the general form of Arabic characters.

Unicode is an exception.

The OS determines which glyph form to use (initial, medial, etc.) based on the context of the character.

PDF stores the specific form of each Arabic character.

Unicode presentation forms should not be used in a string and many tools cannot process them.

Need to normalize text from presentation to general forms.
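Unicode compatibility normalization (NFKC) already maps presentation forms back to general forms, so a sketch of this step can lean on Python's `unicodedata.normalize`. To avoid changing unrelated compatibility characters, the sketch only touches the two Arabic Presentation Forms blocks (U+FB50 to U+FDFF and U+FE70 to U+FEFF); limiting it this way is a design assumption, and the function name `to_general_forms` is made up for this example.

```python
import unicodedata


def in_arabic_presentation_block(ch: str) -> bool:
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF


def to_general_forms(text: str) -> str:
    """Replace Arabic presentation forms with their general forms."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if in_arabic_presentation_block(ch) else ch
        for ch in text
    )


# Kaf initial, Teh medial, Alef final, Beh isolated.
shaped = "\ufedb\ufe98\ufe8e\ufe8f"
print(to_general_forms(shaped) == "\u0643\u062a\u0627\u0628")
```

Ligatures expand to several characters under the same rule; for example the Lam-Alef ligature U+FEFB becomes Lam followed by Alef.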


Arabic Extraction Example 2


Font-specific Ligature Implementations

U+FDF2 is the Unicode Arabic ligature for Allah (ﷲ).

The single ligature represents four characters: Alef, Lam, Lam, Heh.

Some fonts implement the ligature differently:

The ligature glyph covers only Lam, Lam, Heh.

They add a separate Alef before the ligature.

Alef (U+0627) Allah (U+FDF2)

When decomposing using Unicode specs:

Alef Alef Lam Lam Heh
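The doubled Alef can be reproduced, and worked around, in a few lines. The sketch below assumes the text is already in logical order and that an Alef immediately before U+FDF2 is always the font's extra Alef; that assumption could be wrong for a word that genuinely ends in Alef directly before the ligature. The function name `decompose_allah` is hypothetical.

```python
import unicodedata

ALEF = "\u0627"
ALLAH = "\ufdf2"


def decompose_allah(text: str) -> str:
    """Expand U+FDF2 without doubling an Alef that the font added."""
    # Assumption: Alef directly before the ligature is the font's own Alef.
    text = text.replace(ALEF + ALLAH, ALLAH)
    return text.replace(ALLAH, unicodedata.normalize("NFKC", ALLAH))


naive = unicodedata.normalize("NFKC", ALEF + ALLAH)
fixed = decompose_allah(ALEF + ALLAH)
print(len(naive), len(fixed))
```

The naive result has five characters (Alef Alef Lam Lam Heh) while the adjusted one has the intended four.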


Diacritic Placement

Vocalizations and diacritics can be separate glyphs

With Unicode:

Diacritics are stored after the base character in logical order

Diacritics are placed over the base character when rendered onscreen

With PDF:

Diacritics are stored in separate text chunks

Coordinates cause them to overlap

Diacritic chunk can be before or after the chunk it modifies
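One way to reattach a diacritic is to use the overlap itself: find the base character whose horizontal position is closest and store the mark right after it. The sketch below is an assumed approach for illustration, with a made-up `Glyph` class holding one character per chunk; it identifies diacritics by their non-zero Unicode combining class.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float       # left edge
    width: float


def center(glyph: Glyph) -> float:
    return glyph.x + glyph.width / 2


def merge_diacritics(glyphs):
    """Place each combining mark after its horizontally nearest base."""
    bases = [g for g in glyphs if unicodedata.combining(g.char) == 0]
    marks = [g for g in glyphs if unicodedata.combining(g.char) != 0]
    if not bases:
        return "".join(m.char for m in marks)
    attached = [[] for _ in bases]
    for mark in marks:
        nearest = min(
            range(len(bases)),
            key=lambda i: abs(center(bases[i]) - center(mark)),
        )
        attached[nearest].append(mark.char)
    # Bases keep their incoming order; marks follow the base they modify.
    return "".join(b.char + "".join(a) for b, a in zip(bases, attached))


glyphs = [
    Glyph("\u064e", 11.0, 3.0),   # Fatha, stored before its base
    Glyph("\u0628", 10.0, 6.0),   # Beh
    Glyph("\u0627", 2.0, 4.0),    # Alef
]
print(merge_diacritics(glyphs) == "\u0628\u064e\u0627")
```

Because the match is made on coordinates, it does not matter whether the diacritic chunk appears before or after its base in the file.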


Diacritic Insertion


Spacing Estimation

Spaces and newlines are not explicitly stored.

Spacing is achieved by direct placement of text.

Extraction requires guessing where spaces and newlines should exist.

Is this text chunk's X-value further away than we expected?

Is this text chunk's Y-value further away than we expected?

Spacing estimation can be done by keeping track of average character width thus far.

Newline estimation can be done by keeping track of character heights.
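The running-average idea for spaces can be sketched as follows. The input format (tuples of character, left edge, and width for one line, already sorted by x) and the `gap_factor` threshold of 0.3 are assumptions chosen for illustration, not values from the talk.

```python
def join_with_spaces(glyphs, gap_factor=0.3):
    """Join one line of (char, x, width) tuples, inserting guessed spaces."""
    out = []
    total_width = 0.0
    prev_end = None
    for seen, (ch, x, width) in enumerate(glyphs):
        if prev_end is not None:
            average_width = total_width / seen   # average of glyphs so far
            if x - prev_end > gap_factor * average_width:
                out.append(" ")
        out.append(ch)
        total_width += width
        prev_end = x + width
    return "".join(out)


line = [("H", 0, 5), ("i", 5, 2), ("y", 12, 5), ("o", 17, 5), ("u", 22, 5)]
print(join_with_spaces(line))
```

Newlines can be guessed the same way by comparing the vertical distance between consecutive chunks with the character heights seen so far.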


PDFBox

PDFBox is an open source Apache Incubator project

It worked well for many documents in LTR languages

We enhanced it to:

Correct direction of RTL text

Normalize ligatures and presentation forms

Merge diacritics into text

Better estimate where to add spaces

Fix parsing issues

Deal with corrupt / non-compliant files

Can be freely downloaded (in next release):

http://incubator.apache.org/pdfbox/


Thank You!
