Page 1: ListReader : Wrapper Induction for Lists in OCRed Documents

1

ListReader: Wrapper Induction for Lists in OCRed Documents

Thomas Packer, BYU CS DEG, 2012.03.17

2

We Love Data (In Digital Form)

3

Lots of Paper Documents in the World

4

Lots of Text in Paper Documents

5

Lots of Lists in Text

Lots of Data in Lists

6

Manual Data Entry

7

Wrappers: Individualized Extraction Rules

8

Semi-Automatic Wrapper Induction

9

Weakly-supervised Wrapper Induction

Semi-supervised Wrapper Induction

10

Data: Image

11

Data: OCR

12

Data: Hand-Labeled OCR

13

Induced Regex Wrappers

Single-Specific: (i)(\. )(Lydia)( )(Lewis)(\*, )(b\. ) …

Single-General: ([a-z]{1,5})([ \t\r\n\.]{1,6})([a-zA-Z]{3,9})([ \t\r\n]{1,5})([a-zA-Z]{3,9})([ \t\r\n\*,]{1,7})([ \t\r\n\.a-z]{1,7}) …

Multi-General: ([iv]{1,3})([ \.]{2,2})([ACHJLacdehilmnrstuvy]{4,6})([ ]{1,1})([Leisw]{5,6})([ \*,]{2,3})([ \.abp]{3,5}) …

Character Classes: [a-z], [A-Z], [0-9], [ \t\r\n], <each punct.>
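
To make the difference between the three wrappers concrete, here is a minimal Python sketch that applies the regex prefixes from this slide to two records. It assumes the regexes pair with the labels in the order they appear on the slide, drops the elided "…" tails, and treats the space between groups in the wrapped regexes as a line-wrap artifact. The record "i. Lydia Lewis*, b. " is read off the Single-Specific regex; the second record and the helper name `extract` are hypothetical.

```python
import re

# Wrapper prefixes copied from the slide; the elided "…" tails are not reconstructed.
SINGLE_SPECIFIC = r"(i)(\. )(Lydia)( )(Lewis)(\*, )(b\. )"
SINGLE_GENERAL = (
    r"([a-z]{1,5})([ \t\r\n\.]{1,6})([a-zA-Z]{3,9})([ \t\r\n]{1,5})"
    r"([a-zA-Z]{3,9})([ \t\r\n\*,]{1,7})([ \t\r\n\.a-z]{1,7})"
)
MULTI_GENERAL = (
    r"([iv]{1,3})([ \.]{2,2})([ACHJLacdehilmnrstuvy]{4,6})([ ]{1,1})"
    r"([Leisw]{5,6})([ \*,]{2,3})([ \.abp]{3,5})"
)


def extract(pattern, record):
    """Return the captured fields if the wrapper matches the start of the record."""
    match = re.match(pattern, record)
    return list(match.groups()) if match else None


seen = "i. Lydia Lewis*, b. "     # the record the Single-Specific wrapper encodes
unseen = "ii. Julia Lewis*, b. "  # hypothetical second record, not from the talk

for name, pattern in [
    ("Single-Specific", SINGLE_SPECIFIC),
    ("Single-General", SINGLE_GENERAL),
    ("Multi-General", MULTI_GENERAL),
]:
    print(name, extract(pattern, seen), extract(pattern, unseen))
```

The Single-Specific wrapper matches only the record it was built from, while the two generalized wrappers also split the hypothetical record into the same seven fields.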

15

Preliminary Results (Field Label F-measure, Small Dataset)

[Bar chart: F-measure, 0% to 100%, for Single-Specific, Single-General, and Multi-General; series: Training, Test. Data labels in slide order: 31%, 36%, 54%; 45%, 46%, 60%.]
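
The slide reports field-label F-measure without defining it. As a minimal sketch, assuming the usual balanced F-measure (the harmonic mean of precision and recall) computed over (token position, label) pairs, the score could be calculated as below. The function name, the label names, and the sample data are hypothetical and are not taken from the talk.

```python
def f_measure(gold, predicted):
    """Balanced F-measure over sets of (token_position, label) pairs.

    Assumption: a predicted label counts as correct only if both its
    position and its label match a gold pair exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for one list record.
gold = {(0, "GivenName"), (1, "Surname"), (2, "BirthDate")}
predicted = {(0, "GivenName"), (1, "Surname"), (3, "BirthDate")}

score = f_measure(gold, predicted)
print(f"{score:.0%}")
```

Here two of three predicted labels are correct and two of three gold labels are found, so precision and recall are both 2/3.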

16

Conclusions and Future Work

[Bar chart: F-measure, 0% to 100%, for Single-Specific, Single-General, Multi-General, and Transduction; series: Training, Test. Data labels in slide order: 31%, 36%, 54%, 76%; 45%, 46%, 60%, 79%.]

17

All done. Suggestions?