ListReader : Wrapper Induction for Lists in OCRed Documents
1
ListReader: Wrapper Induction for Lists
in OCRed Documents
Thomas Packer, BYU CS DEG, 2012.03.17
2
We Love Data (In Digital Form)
3
Lots of Paper Documents in the World
4
Lots of Text in Paper Documents
5
Lots of Lists in Text
Lots of Data in Lists
6
Manual Data Entry
7
Wrappers: Individualized Extraction Rules
8
Semi-Automatic Wrapper Induction
9
Weakly-supervised Wrapper Induction
Semi-supervised Wrapper Induction
10
Data: Image
11
Data: OCR
12
Data: Hand-Labeled OCR
13
Induced Regex Wrappers
Single-Specific:
(i)(\. )(Lydia)( )(Lewis)(\*, )(b\. ) …
Single-General:
([a-z]{1,5})([ \t\r\n\.]{1,6})([a-zA-Z]{3,9})([ \t\r\n]{1,5})([a-zA-Z]{3,9})([ \t\r\n\*,]{1,7})([ \t\r\n\.a-z]{1,7}) …
Multi-General:
([iv]{1,3})([ \.]{2,2})([ACHJLacdehilmnrstuvy]{4,6})([ ]{1,1})([Leisw]{5,6})([ \*,]{2,3})([ \.abp]{3,5}) …
Character Classes: [a-z], [A-Z], [0-9], [ \t\r\n], <each punct.>
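The three wrapper types above can be read as three ways of turning hand-labeled fields into a regex with one capture group per field. The following is a minimal sketch of that reading, not the ListReader implementation. The function names, the ASCII-only handling, and the length slack (`shrink=2`, `grow=4`) are assumptions; the slack values are guesses that happen to be consistent with the length ranges in the Single-General example.

```python
import re

def single_specific(fields):
    """One labeled record -> regex that matches each field literally."""
    return "".join("(" + re.escape(f) + ")" for f in fields)

def single_general(fields, shrink=2, grow=4):
    """One labeled record -> regex over character classes.

    Each field becomes a class built from whitespace, its punctuation
    marks, [a-z], [A-Z], and [0-9], as needed. The length range is the
    field length with some slack (shrink/grow are assumed values).
    """
    groups = []
    for f in fields:
        parts = []
        if any(c.isspace() for c in f):
            parts.append(r" \t\r\n")
        for c in f:  # punctuation, in order of first appearance
            if not c.isalnum() and not c.isspace():
                escaped = re.escape(c)
                if escaped not in parts:
                    parts.append(escaped)
        if any(c.islower() for c in f):
            parts.append("a-z")
        if any(c.isupper() for c in f):
            parts.append("A-Z")
        if any(c.isdigit() for c in f):
            parts.append("0-9")
        low = max(1, len(f) - shrink)
        high = len(f) + grow
        groups.append("([%s]{%d,%d})" % ("".join(parts), low, high))
    return "".join(groups)

def multi_general(records):
    """Several aligned labeled records -> regex per field position.

    Each group allows exactly the characters observed at that position,
    with the observed minimum and maximum field length.
    """
    groups = []
    for column in zip(*records):
        chars = sorted(set("".join(column)))
        char_class = "".join(re.escape(c) for c in chars)
        lengths = [len(f) for f in column]
        groups.append("([%s]{%d,%d})" % (char_class, min(lengths), max(lengths)))
    return "".join(groups)

# Hypothetical labeled records; only the first comes from the slide.
record = ["i", ". ", "Lydia", " ", "Lewis", "*, ", "b. "]
other = ["iv", ". ", "John", " ", "Weiss", ", ", "bap. "]

for pattern in (single_specific(record),
                single_general(record),
                multi_general([record, other])):
    match = re.fullmatch(pattern, "".join(record))
    print(pattern, "->", match.groups() if match else None)
```

The trade-off is visible in the output: the literal pattern matches only the record it was built from, while the two generalized patterns accept unseen records at the cost of possibly over-matching.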
15
Preliminary Results (Field Label F-measure, Small Dataset)
[Bar chart. Categories: Single-Specific, Single-General, Multi-General. Series: Training, Test. Y-axis: 0% to 100%. Data labels as extracted: 31%, 36%, 54%; 45%, 46%, 60%.]
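The slides do not say how the field label F-measure is computed. A minimal sketch, assuming the standard definition (harmonic mean of precision and recall) over sets of (field position, label) pairs; the function name and the label names are hypothetical:

```python
def f_measure(predicted, gold):
    """F1 over sets of (field_position, label) pairs (assumed definition)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels for one list record.
gold = {(0, "ordinal"), (2, "given_name"), (4, "surname")}
predicted = {(0, "ordinal"), (2, "given_name"), (4, "given_name")}
print(round(f_measure(predicted, gold), 3))
```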
16
Conclusions and Future Work
[Bar chart. Categories: Single-Specific, Single-General, Multi-General, Transduction. Series: Training, Test. Y-axis: 0% to 100%. Data labels as extracted: 31%, 36%, 54%, 76%; 45%, 46%, 60%, 79%.]
17
All done. Suggestions?