ITEC810 Final Report Inferring Document Structure Wieyen Lin/41348133 Supervised by Jette Viethen.

Date posted: 18-Dec-2015

ITEC810 Final Report

Inferring Document Structure

Wieyen Lin/41348133

Supervised by

Jette Viethen

2

Outline

Part A: Introduction, Related Work

Part B: Material, Methodology

Part C: Implementation, Conclusion

3

Part A: Introduction

4

Introduction

5

Introduction (cont’d)

Research Objective: Analyze a document image and detect its logical structure with annotated labels.

Project Scope:
  • Focus on: academic articles
  • Source corpus: Association for Computational Linguistics (ACL) Anthology Corpus

6

Related Work

Physical Layout Analysis:
  • Top-down methods
  • Bottom-up methods

Logical Structure Analysis:
  • Syntactic methods
  • Rule-based methods

7

Part B: Methodology

8

Material: XML Source by Text

An example input file for the project

9

Methodology

Phase I: Aggregation of Homogeneous Blocks

XML source by text

1a. Grouping text into lines

XML source by line

1b. Aggregating lines into blocks

Physical structure
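Step 1a can be illustrated with a minimal sketch. The slides do not show the XML schema or the grouping rule, so everything here is an assumption: the `Fragment` fields, the baseline tolerance `y_tol`, and the idea of grouping by vertical position are hypothetical stand-ins, not the project's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """A positioned piece of text; all field names are hypothetical."""
    text: str
    x: float     # left edge of the fragment
    y: float     # vertical position of the baseline
    size: float  # font size


def group_into_lines(fragments: List[Fragment], y_tol: float = 2.0) -> List[List[Fragment]]:
    """Group fragments whose baselines lie within y_tol of a line's first fragment."""
    lines: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(frag.y - lines[-1][0].y) <= y_tol:
            lines[-1].append(frag)   # same baseline: extend the current line
        else:
            lines.append([frag])     # new baseline: start a new line
    for line in lines:
        line.sort(key=lambda f: f.x)  # restore left-to-right reading order
    return lines


def line_text(line: List[Fragment]) -> str:
    """Join the fragments of one line with single spaces."""
    return " ".join(f.text for f in line)


# Illustrative input only; coordinates are made up.
fragments = [
    Fragment("Document", 160.0, 100.5, 18.0),
    Fragment("Inferring", 72.0, 100.0, 18.0),
    Fragment("Structure", 250.0, 100.0, 18.0),
    Fragment("Abstract", 72.0, 140.0, 12.0),
]
lines = group_into_lines(fragments)
print([line_text(line) for line in lines])
```

A fixed tolerance is the simplest choice for a sketch; a real corpus would probably need a tolerance that scales with font size.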

10

Methodology (cont’d)

Phase II: Detection of Logical Structure

1b. Aggregating lines into blocks

XML source by block

2. Annotating each block with a logical label

Logical structure
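Step 2 can be pictured as a small rule-based labeler. The slides do not list the actual rules, so the rules below are purely illustrative assumptions, and `Block`, `label_block`, and the label strings are hypothetical names chosen for this sketch.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    """A block from Phase I; field names are hypothetical."""
    text: str
    size: float  # dominant font size of the block
    page: int    # 1-based page number


def label_block(block: Block, body_size: float, max_size: float) -> str:
    """Assign one logical label using simple, assumed heuristics."""
    text = block.text.strip()
    if re.fullmatch(r"\d+", text):
        return "page-number"        # assumed: a block that is only digits
    if text.lower() == "abstract":
        return "abstract-heading"   # assumed: the literal word "Abstract"
    if block.page == 1 and block.size == max_size:
        return "title"              # assumed: largest font on the first page
    if (re.match(r"\d+(\.\d+)*\.?\s+\S", text)
            and block.size >= body_size and len(text) < 80):
        return "section-heading"    # assumed: short numbered line
    return "paragraph"


# Illustrative input only.
blocks = [
    Block("Inferring Document Structure", 18.0, 1),
    Block("Abstract", 12.0, 1),
    Block("This paper describes a method.", 10.0, 1),
    Block("1 Introduction", 12.0, 1),
    Block("3", 9.0, 1),
]
max_size = max(b.size for b in blocks)
body_size = 10.0  # assumed body-text size for this example
labels = [label_block(b, body_size, max_size) for b in blocks]
print(labels)
```

The rules are checked in a fixed order, so a block that satisfies several of them receives the first matching label.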

11

Methodology (cont’d)

Read in 3 lines at a time and check the dominant font size of each line. The possible patterns are A1A2A3, AAB, ABB, A1BA2, and ABC.

ABC: A B C

A1BA2: A1 B A2

ABB: A BB

AAB: AA B

A1A2A3: check spacing

    s1 = s2: AAA

    s1 > s2: A1 A2A3

    s1 < s2: A1A2 A3

A, B, C: lines of text with different dominant font sizes

A1, A2: lines of text with the same dominant font size

s1: spacing between A1 and A2

s2: spacing between A2 and A3

Letters written together (e.g. AA): lines that belong to the same block

Algorithm for aggregating blocks in Phase II
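The decision for a single three-line window can be sketched as follows. This is a hedged reading of the diagram, not the author's code: `split_window` and the spacing tolerance `tol` are hypothetical, font sizes are compared exactly, and the third spacing case is assumed to be s1 < s2. The slides do not say how consecutive windows are merged, so that part is left out.

```python
from typing import List, Sequence


def split_window(sizes: Sequence[float], s1: float, s2: float,
                 tol: float = 0.5) -> List[List[int]]:
    """Partition line indices 0, 1, 2 of one window into blocks.

    sizes: dominant font size of each of the three lines
    s1:    spacing between lines 0 and 1
    s2:    spacing between lines 1 and 2
    tol:   assumed tolerance for treating s1 and s2 as equal
    """
    a, b, c = sizes
    same_ab = a == b
    same_bc = b == c
    if same_ab and same_bc:          # pattern A1A2A3: decide by spacing
        if abs(s1 - s2) <= tol:
            return [[0, 1, 2]]       # AAA
        if s1 > s2:
            return [[0], [1, 2]]     # A1 A2A3
        return [[0, 1], [2]]         # A1A2 A3 (assumed s1 < s2 case)
    if same_ab:
        return [[0, 1], [2]]         # pattern AAB: AA B
    if same_bc:
        return [[0], [1, 2]]         # pattern ABB: A BB
    return [[0], [1], [2]]           # patterns ABC and A1BA2


print(split_window((10, 10, 10), 12, 12))  # equal spacing
print(split_window((10, 10, 10), 20, 12))  # larger gap before line 1
print(split_window((10, 12, 10), 12, 12))  # A1BA2
```

Spacing is only consulted when all three lines share a font size; in every other pattern the font sizes alone determine the split.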

12

Part C: Outcomes

13

Current Outcome

Original PDF document

Physical layout outcome in HTML

14

Current Outcome (cont’d)

Logical structure outcome in HTML

15

Implementation: Class Diagram

16

Implementation: User Interfaces

17

Conclusion: Information Evaluation

| Error Type | Errors Found | Accuracy of Detection |
|---|---|---|
| Incorrect title or missing title | 1 | 97.5% (39/40) |
| Incorrect Abstract heading or missing Abstract heading | 4 | 90.0% (36/40) |
| Incorrect Abstract or missing Abstract | 4 | 90.0% (36/40) |
| Incorrect Affiliation(s) or missing Affiliation(s) | 11 | 72.5% (29/40) |
| Missing >50% of Page number(s) or erroneous Page number(s) found | 15 | 62.5% (25/40) |
| Missing >50% of Section heading(s) or erroneous Section heading(s) found | 11 | 72.5% (29/40) |

Summary of detection results for 40 randomly selected documents
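The accuracy figures follow directly from the error counts: each is the share of the 40 documents in which that error type was not found. A short check, using the counts from the table and shortened labels chosen for this sketch:

```python
TOTAL_DOCS = 40

# Error counts from the summary table; the keys are shortened labels.
errors_found = {
    "Title": 1,
    "Abstract heading": 4,
    "Abstract": 4,
    "Affiliation(s)": 11,
    "Page number(s)": 15,
    "Section heading(s)": 11,
}


def accuracy(errors: int, total: int = TOTAL_DOCS) -> float:
    """Percentage of documents in which this error type did not occur."""
    return 100.0 * (total - errors) / total


results = {label: accuracy(n) for label, n in errors_found.items()}
for label, pct in results.items():
    correct = TOTAL_DOCS - errors_found[label]
    print(f"{label}: {pct:.1f}% ({correct}/{TOTAL_DOCS})")
```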

18

Conclusion: Future Work

Improving algorithms:
  • Aggregation of homogeneous blocks
  • Detection of Abstract Heading, Section Heading, and Paragraph

Removing noise:
  • Incomplete table contents
  • Incomplete mathematical formulas