Http Www.basistech.com Knowledge-center Forensics Extracting-text-from-Arabic-PDF

  • 7/31/2019 Http Www.basistech.com Knowledge-center Forensics Extracting-text-from-Arabic-PDF

    Extracting Searchable Text from Arabic PDFs

    Brian Carrier, Ph.D.

    Director of Digital Forensics

    Basis Technology Corp.

    Motivation

    Need to get the text out of files before they can be indexed and searched.

    Arabic PDF files can be challenging.

    PDF Basics

    Raw file contents are organized into objects.

    Each object stores a specific type of info:

    Document (Root) object

    Page objects

    Font objects

    Basic structure of file is viewable text:

    []

    7 0 obj

    endobj

    []
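
    Because the object structure is viewable text, objects can be located with a simple scan. The following is a minimal sketch, not a real PDF parser: the function name, the regular expression, and the sample object bodies are illustrative assumptions, and a production parser would use the cross-reference table and handle binary stream data.

```python
import re

# Matches "<object number> <generation number> obj ... endobj".
# Simplification: stream data that happens to contain "endobj" would
# confuse this pattern; real parsers rely on the cross-reference table.
OBJECT_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)

def list_objects(raw: bytes):
    """Return (object number, generation, body) for each object found."""
    return [
        (int(m.group(1)), int(m.group(2)), m.group(3).strip())
        for m in OBJECT_RE.finditer(raw)
    ]

# Hypothetical file fragment; the dictionary contents are made up.
sample = b"7 0 obj\n<< /Type /Page >>\nendobj\n8 0 obj\n<< /Type /Font >>\nendobj\n"
for number, generation, body in list_objects(sample):
    print(number, generation, body)
```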

    PDF Text

    Text is stored in chunks of one or more characters.

    Each chunk is located at a given X,Y coordinate

    Chunks can be stored in any order in the file

    PDF Fonts and Encodings

    PDF fonts typically store only the glyphs that are used.

    Text chunk stores an index into a PDF font object.

    Font object may map glyph to a Unicode value.

    Rendering Difference

    Displaying a PDF requires the PDF Engine to map fonts

    Note that standard encoding values are not required.

    Basic Extraction Approach

    1. Parse PDF file to identify page content objects

    2. Parse page content stream into text chunks

    3. Sort text chunks based on coordinates

    4. Process chunks in order:

        1. Get index for each character

        2. Use font information to map index to Unicode (if defined)

        3. Add Unicode value to end of string
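
    Steps 3 and 4 above can be sketched as follows, assuming the chunks have already been parsed out of the page content stream. The `TextChunk` class, the `FONT_TO_UNICODE` table, and the choice of U+FFFD for unmapped glyphs are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

@dataclass
class TextChunk:
    x: float        # horizontal position on the page
    y: float        # vertical position (default PDF origin is bottom-left)
    font: str       # name of the font object the chunk refers to
    indexes: list   # glyph indexes into that font

# Hypothetical per-font tables mapping a glyph index to a Unicode character.
FONT_TO_UNICODE = {"F1": {1: "H", 2: "i", 3: "!"}}

def extract_text(chunks):
    # Step 3: top of the page first (larger y), then left to right.
    ordered = sorted(chunks, key=lambda c: (-c.y, c.x))
    out = []
    for chunk in ordered:
        table = FONT_TO_UNICODE.get(chunk.font, {})
        for index in chunk.indexes:
            # Step 4: use the mapping if the font defines one; otherwise
            # emit U+FFFD so the gap stays visible in the output.
            out.append(table.get(index, "\ufffd"))
    return "".join(out)

# Chunks stored out of order in the file, as the slides warn.
chunks = [
    TextChunk(x=120, y=700, font="F1", indexes=[3]),
    TextChunk(x=100, y=700, font="F1", indexes=[1, 2]),
]
print(extract_text(chunks))  # prints "Hi!"
```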

    English Extraction Example

    Arabic Glyphs

    Arabic characters have different shapes depending on their location in a word.

    Each shape is a different glyph in a font.
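
    As an illustration, Unicode itself assigns separate code points to the positional shapes of each letter. This short sketch uses Python's standard `unicodedata` module to list the four forms of the letter Beh (U+0628); the choice of Beh is arbitrary.

```python
import unicodedata

# The four positional shapes of ARABIC LETTER BEH (U+0628) as encoded in
# the Unicode "Arabic Presentation Forms-B" block.
for code_point in (0xFE8F, 0xFE90, 0xFE91, 0xFE92):
    ch = chr(code_point)
    print(
        f"U+{code_point:04X}",
        unicodedata.name(ch),
        unicodedata.decomposition(ch),  # compatibility mapping back to U+0628
    )
```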

    Arabic Extraction Example

    Logical and Presentation Orders

    Text in computers is typically stored in logical order

    First character stored is first character read or written

    Presentation order is based on screen layout

    Orders are same for Left to Right (LTR) Languages:

    Opposite for Right to Left (RTL) Languages:

    Possible Order Solution

    PDF stores data in presentation (display) order.

    Text editors need the text in logical order though.

    Need to convert from presentation to logical order.

    Obvious solution:

    After decoding each line, reverse the order of the Arabic text:
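
    A minimal sketch of this obvious solution, with a hypothetical function name and made-up sample text. It works for a line that is purely Arabic, and the last lines show where it goes wrong once left-to-right content such as a number shares the line.

```python
def naive_to_logical(presentation_line: str) -> str:
    """Reverse a whole line that was read left to right off the page."""
    return presentation_line[::-1]

# Logical order: alef, lam, seen, lam, alef, meem.
logical = "\u0627\u0644\u0633\u0644\u0627\u0645"
presentation = logical[::-1]  # glyphs as laid out left to right on the page
assert naive_to_logical(presentation) == logical

# A number displayed to the left of the Arabic word also gets reversed,
# so "2019" comes out as "9102".
mixed_presentation = "2019 " + presentation
print(naive_to_logical(mixed_presentation).endswith("9102"))  # prints True
```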

    Bi-directional Text

    How should the following be logically stored?

    Text can have both RTL and LTR characters and each should go in the correct direction

    Unicode Bi-directional Text (BiDi) algorithm defines how to order characters in a paragraph based on:

    Dominant direction of text in paragraph

    Direction of each character in text

    Punctuation and neighboring characters

    Implicit direction markers

    BiDi lets you convert from logical to presentation order.
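
    The per-character direction that the BiDi algorithm starts from is a property in the Unicode Character Database. As a small illustration, Python's standard `unicodedata.bidirectional` reports it; the sample characters are arbitrary.

```python
import unicodedata

# Latin letter, European digit, space, comma, Arabic letter Beh,
# Hebrew letter Alef, and the invisible RIGHT-TO-LEFT MARK.
for ch in ("a", "1", " ", ",", "\u0628", "\u05D0", "\u200F"):
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
```

    Strong classes such as L, R, and AL fix a direction outright, while digits, spaces, and punctuation take their direction from context.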

    Reverse Bi-directional Algorithm

    We need Reverse BiDi to convert from presentation to logical order.

    Updated Extraction Approach

    1. Parse PDF file to identify page content objects

    2. Parse page content stream into text chunks

    3. Sort text chunks based on coordinates

    4. Determine dominant text direction

    5. Process chunks in order and by line:

        1. Get index for each character

        2. Use font information to map index to Unicode

        3. Add Unicode value to end of presentation order string

        4. Apply reverse BiDi algorithm to presentation order string
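
    The last step can be approximated with a heavily simplified sketch. This is not the full Unicode algorithm and not necessarily what the presenter implemented: it assumes neutral characters simply follow the dominant direction, and it ignores explicit embeddings, mirrored punctuation, and nested runs. The function names are hypothetical.

```python
import unicodedata
from itertools import groupby

def direction(ch: str, dominant: str) -> str:
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):
        return "rtl"
    if bidi in ("L", "EN", "AN"):
        return "ltr"   # letters and digits that read left to right
    return dominant    # simplification: neutrals follow the paragraph

def visual_to_logical(line: str, dominant: str = "rtl") -> str:
    # Split the line (read left to right off the page) into directional runs.
    runs = [
        (d, "".join(group))
        for d, group in groupby(line, key=lambda c: direction(c, dominant))
    ]
    if dominant == "rtl":
        runs.reverse()  # the rightmost run on the page is read first
    # Characters inside RTL runs are stored right to left on the page.
    return "".join(text[::-1] if d == "rtl" else text for d, text in runs)

arabic = "\u0627\u0644\u0633\u0644\u0627\u0645"
visual = "2019 " + arabic[::-1]                   # as laid out on the page
print(visual_to_logical(visual) == arabic + " 2019")  # prints True
```

    Unlike reversing the whole line, the digits keep their order because only the right-to-left runs are reversed.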

    Presentation Forms / Ligatures

    Encodings typically define only the general form of Arabic characters.

    Unicode is an exception.

    The OS determines which glyph form to use (initial, medial, etc.) based on the context of the character.

    PDF stores the specific form of each Arabic character.

    Unicode presentation forms should not be used in a string and many tools cannot process them.

    Need to normalize text from presentation to general forms
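
    One way to do this normalization, offered as a sketch rather than as the presenter's method, is Unicode compatibility normalization (NFKC), which maps presentation forms back to the general letters. Note that NFKC also folds other compatibility characters (for example fullwidth Latin letters), which may or may not be wanted. The function name is hypothetical.

```python
import unicodedata

def to_general_forms(text: str) -> str:
    """Map presentation forms and ligatures to general characters."""
    return unicodedata.normalize("NFKC", text)

# U+FE91 is the initial form of Beh; U+FEFB is the isolated Lam-Alef ligature.
presentation = "\uFE91\uFEFB"
general = to_general_forms(presentation)
print([f"U+{ord(c):04X}" for c in general])  # ['U+0628', 'U+0644', 'U+0627']
```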

    Arabic Extraction Example 2

    Font-specific Ligature Implementations

    U+FDF2 is the Unicode Arabic ligature for Allah (ﷲ).

    The single ligature represents four characters: Alef, Lam, Lam, Heh.

    Some fonts implement the ligature differently:

    Lam, Lam, Heh

    They add a separate Alef before the ligature.

    Alef (U+0627) Allah (U+FDF2)

    When decomposing using Unicode specs:

    Alef Alef Lam Lam Heh
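
    The doubled Alef can be reproduced and worked around as below. The workaround is an assumed heuristic, not a documented fix: it treats an Alef immediately followed by U+FDF2 as the font-specific case and drops the extra Alef before normalizing. It expects text that is already in logical order, and the function name is hypothetical.

```python
import unicodedata

ALEF = "\u0627"
ALLAH_LIGATURE = "\uFDF2"

def normalize_allah(text: str) -> str:
    # Assumed heuristic: Alef directly before the ligature means the font
    # drew only Lam, Lam, Heh for U+FDF2, so the separate Alef is redundant.
    text = text.replace(ALEF + ALLAH_LIGATURE, ALLAH_LIGATURE)
    return unicodedata.normalize("NFKC", text)

extracted = ALEF + ALLAH_LIGATURE
print(len(unicodedata.normalize("NFKC", extracted)))  # 5: Alef Alef Lam Lam Heh
print(len(normalize_allah(extracted)))                # 4: Alef Lam Lam Heh
```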

    Diacritic Placement

    Vocalizations and diacritics can be separate glyphs

    With Unicode:

    Diacritics are stored after the base character in logical order

    Diacritics are placed over the base character when rendered onscreen

    With PDF:

    Diacritics are stored in separate text chunks

    Coordinates cause them to overlap

    Diacritic chunk can be before or after the chunk it modifies
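
    A minimal sketch of merging such diacritic chunks back into the text, under two assumptions that are not from the slides: the base characters are already in logical order, and each diacritic belongs to the base glyph whose horizontal center is nearest. The `Glyph` class and function names are hypothetical.

```python
import unicodedata
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float      # left edge on the page
    width: float

def is_diacritic(ch: str) -> bool:
    # Combining marks have a non-zero canonical combining class.
    return unicodedata.combining(ch) != 0

def merge_diacritics(glyphs):
    bases = [g for g in glyphs if not is_diacritic(g.char)]
    if not bases:
        return ""
    attached = {id(b): [] for b in bases}
    for mark in (g for g in glyphs if is_diacritic(g.char)):
        center = mark.x + mark.width / 2
        # Attach the mark to the base glyph it overlaps most closely.
        owner = min(bases, key=lambda b: abs(b.x + b.width / 2 - center))
        attached[id(owner)].append(mark.char)
    # Unicode order: each base character followed by its diacritics.
    return "".join(b.char + "".join(attached[id(b)]) for b in bases)

# Fatha stored before the Beh it sits on; Teh lies further left on the page.
glyphs = [Glyph("\u064E", 11, 3), Glyph("\u0628", 10, 5), Glyph("\u062A", 4, 5)]
print(merge_diacritics(glyphs) == "\u0628\u064E\u062A")  # prints True
```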

    Diacritic Insertion

    Spacing Estimation

    Spaces and newlines are not explicitly stored.

    Spacing is achieved by direct placement of text.

    Extraction requires guessing where spaces and newlines should exist.

    Is this text chunk's X-value further away than we expected?

    Is this text chunk's Y-value further away than we expected?

    Spacing estimation can be done by keeping track of average character width thus far.

    Newline estimation can be done by keeping track of character heights.
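
    The two questions above can be turned into a small heuristic. This sketch assumes glyphs arrive in left-to-right reading order (an RTL line would measure the gap on the other side), and the `Glyph` class and both threshold ratios are illustrative guesses rather than values from the presentation.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float
    y: float
    width: float
    height: float

def add_spacing(glyphs, space_ratio=0.3, line_ratio=0.5):
    out = []
    total_width = total_height = 0.0
    prev = None
    for seen, g in enumerate(glyphs):  # seen = glyphs processed so far
        if prev is not None:
            avg_width = total_width / seen
            avg_height = total_height / seen
            if abs(g.y - prev.y) > line_ratio * avg_height:
                out.append("\n")  # Y moved further than expected
            elif g.x - (prev.x + prev.width) > space_ratio * avg_width:
                out.append(" ")   # X gap is wider than expected
        out.append(g.char)
        total_width += g.width
        total_height += g.height
        prev = g
    return "".join(out)

glyphs = [
    Glyph("a", 0, 700, 5, 10), Glyph("b", 5, 700, 5, 10),
    Glyph("c", 13, 700, 5, 10), Glyph("d", 18, 700, 5, 10),
    Glyph("e", 0, 688, 5, 10),
]
print(repr(add_spacing(glyphs)))  # prints 'ab cd\ne'
```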

    PDFBox

    PDFBox is an open source Apache Incubator project

    It worked well for many documents in LTR languages

    We enhanced it to:

    Correct direction of RTL text

    Normalize ligatures and presentation forms

    Merge diacritics into text

    Better estimate where to add spaces

    Fix parsing issues

    Deal with corrupt / non-compliant files

    Can be freely downloaded (in next release):

    http://incubator.apache.org/pdfbox/

    Thank You!