ListReader : Wrapper Induction for Lists in OCRed Documents

Post on 22-Feb-2016


1

ListReader: Wrapper Induction for Lists in OCRed Documents

Thomas Packer

BYU CS DEG

2012.03.17

2

We Love Data (In Digital Form)

3

Lots of Paper Documents in the World

4

Lots of Text in Paper Documents

5

Lots of Lists in Text

Lots of Data in Lists

6

Manual Data Entry

7

Wrappers: Individualized Extraction Rules

8

Semi-Automatic Wrapper Induction

9

Weakly-supervised Wrapper Induction

Semi-supervised Wrapper Induction

10

Data: Image

11

Data: OCR

12

Data: Hand-Labeled OCR

13

Induced Regex Wrappers

Single-Specific: (i)(\. )(Lydia)( )(Lewis)(\*, )(b\. ) …

Single-General: ([a-z]{1,5})([ \t\r\n\.]{1,6})([a-zA-Z]{3,9})([ \t\r\n]{1,5}) ([a-zA-Z]{3,9})([ \t\r\n\*,]{1,7})([ \t\r\n\.a-z]{1,7}) …

Multi-General: ([iv]{1,3})([ \.]{2,2})([ACHJLacdehilmnrstuvy]{4,6})([ ]{1,1}) ([Leisw]{5,6})([ \*,]{2,3})([ \.abp]{3,5}) …
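
As a rough illustration of how such a regex wrapper could be applied, the sketch below runs only the visible prefix of the Single-Specific pattern (the slide elides the rest) over one line of text and pairs each capture group with a field label. The label names, the `extract` helper, and the sample lines are assumptions made for illustration; they do not come from the slides.

```python
import re

# Visible prefix of the Single-Specific wrapper from the slide.
# The remainder is elided on the slide and is not reconstructed here.
SINGLE_SPECIFIC = r"(i)(\. )(Lydia)( )(Lewis)(\*, )(b\. )"

# Hypothetical field labels, one per capture group (the slide does not name them).
LABELS = ["ChildNumber", "Delim", "GivenName", "Delim",
          "Surname", "Delim", "BirthMarker"]


def extract(pattern, labels, text):
    """Return (label, value) pairs for the non-delimiter groups, or [] if no match."""
    match = re.search(pattern, text)
    if match is None:
        return []
    return [(label, value)
            for label, value in zip(labels, match.groups())
            if label != "Delim"]


# Hypothetical OCR lines for illustration.
print(extract(SINGLE_SPECIFIC, LABELS, "i. Lydia Lewis*, b. 1780"))
print(extract(SINGLE_SPECIFIC, LABELS, "ii. John Lewis, b. 1782"))
```

The second call returns an empty list: a wrapper built from the literal text of one record matches only that record, which is presumably what motivates the more general variants.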

Character Classes: [a-z], [A-Z], [0-9], [ \t\r\n], <each punct.>
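
One way to picture the generalization step is the sketch below, which turns the example strings observed for a single field into one capture group built from the character classes listed on the slide. It is a minimal sketch under stated assumptions, not the ListReader algorithm: the helper names `class_of` and `generalize` are hypothetical, and the length range is simply the minimum and maximum observed lengths, whereas the slide's ranges (for example `{1,5}`) appear to be widened by a rule the slides do not describe.

```python
import re

WHITESPACE = " \t\r\n"


def class_of(ch):
    """Map a character to one of the slide's classes, as a regex fragment."""
    if "a" <= ch <= "z":
        return "a-z"
    if "A" <= ch <= "Z":
        return "A-Z"
    if "0" <= ch <= "9":
        return "0-9"
    if ch in WHITESPACE:
        return r" \t\r\n"
    # Assumption: every other character (punctuation) is its own class.
    return re.escape(ch)


def generalize(examples):
    """Build one capture group covering all example strings for a field."""
    classes = []
    for text in examples:
        for ch in text:
            cls = class_of(ch)
            if cls not in classes:
                classes.append(cls)
    lengths = [len(text) for text in examples]
    # Assumption: length bounds are the observed minimum and maximum.
    return "([%s]{%d,%d})" % ("".join(classes), min(lengths), max(lengths))


print(generalize(["i"]))                # ([a-z]{1,1})
print(generalize(["Lydia", "Hannah"]))  # ([A-Za-z]{5,6})
```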

15

Preliminary Results (Field Label F-measure, Small Dataset)

[Bar chart: F-measure, axis 0% to 100%, for Single-Specific, Single-General, and Multi-General wrappers; legend: Training, Test. Data labels in extraction order: 31%, 36%, 54%, 45%, 46%, 60%.]
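
For readers unfamiliar with the metric, the sketch below computes an F-measure from counts of correctly and incorrectly labeled fields. It assumes the balanced form (the harmonic mean of precision and recall); the slides do not say which variant was used, and the function name and the counts in the example are hypothetical.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure (F1) from field-label counts; 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 2 correct labels, 2 spurious labels, 6 missed labels.
print(f_measure(2, 2, 6))
```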

16

Conclusions and Future Work

[Bar chart: F-measure, axis 0% to 100%, for Single-Specific, Single-General, Multi-General, and Transduction; legend: Training, Test. Data labels in extraction order: 31%, 36%, 54%, 76%, 45%, 46%, 60%, 79%.]

17

All done. Suggestions?