Thomas L. Packer 12/2012 CS/BYU
1
Populating Ontologies by Semi-automatically Inducing Information Extraction Wrappers for Lists in OCRed Documents
Thomas L. Packer
12/2012 CS/BYU
2
Research Area as a Prezi presentation:
http://prezi.com/vg8dyhx11kq0/dissertation-proposal-list-reading/?kw=view-vg8dyhx11kq0&rc=ref-4464709
3
Family History
Related Work
Project Description
Validation
Conclusion
Motivation
4
Anyone Like Family History?
• 10M people do it (http://www.deseretnews.com/article/700180627/Genealogy-Expanding-the-family-tree.html?pg=all)
• $1B/year market (http://blog.genlighten.com/2010/03/01/genealogy-a-1b-market-maybe/)
• 2nd most popular hobby
5
What is Family History Research?
• Search for ancestors in records
• Construct family trees from records
• Add to them:
– Data
– Photos
– Stories
– Temple work
6
Willing to Work?
• Annotate records
– 125,000 volunteers at FamilySearch.org
• Family trees
– 26M at Ancestry.com
– More at other sites
7
How Do We Do It?
• Manual
• Automation
– OCR keyword search only
– Extract rich data for querying, record linkage, question answering
8
What Records?
• Source records
– Hand written
– Machine printed
• Lists
– Rich and dense
– Variable and underutilized
• Documents
– Family history books
– City directories
– Birth, marriage, death records
– School yearbooks
– Church yearbooks
– Newspapers
– Local history books
– Navy cruise books
9
What is a List?
• Contiguous sequence of records
• Records contain fields and delimiters in a regular language
• Fields may be nested lists
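As a rough illustration of this definition, a record can be written as a regular expression over fields and delimiters, and a list read as a contiguous run of matching records. The pattern, the field names, and the sample text (the child list used later in these slides, with its OCR errors corrected) are assumptions made for this sketch, not ListReader's wrapper language.

```python
import re

# Hypothetical record pattern: number, given name(s), birth year, optional death year.
RECORD = re.compile(
    r"(?P<number>\d+)\. "
    r"(?P<given_name>[A-Z][a-z]+(?: [A-Z][a-z]+)*), "
    r"b\. (?P<birth_year>\d{4})"
    r"(?:, d\. (?P<death_year>\d{4}))?\."
)

def read_list(text):
    """Return one dict of fields per record in a contiguous run of records."""
    records = []
    pos = 0
    while True:
        m = RECORD.match(text, pos)
        if m is None:
            break  # the run of records has ended
        records.append({k: v for k, v in m.groupdict().items() if v is not None})
        pos = m.end()
        # Skip the record delimiter (a newline) if present.
        if pos < len(text) and text[pos] == "\n":
            pos += 1
    return records

sample = "1. Sarah, b. 1797.\n2. Amy, b. 1799, d. 1800.\n3. John Erastus, b. 1836, d. 1876."
records = read_list(sample)
```

The optional group models a field that appears in only some records; a nested list would need a repeated sub-pattern inside a field, which this sketch leaves out.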
10
Other Applications?
• Printed receipts
– Nutrition app (Noshly.com)
– Marketing + personal finance app (Itemize.com)
• Document conversion
• Citation metrics
11
Research Project
Child(child1)
Child-ChildNumber(child1, “1”)
Child-Name(child1, name1)
Name-GivenName(name1, “Sarah”)
Child-BirthDate(child1, date1)
BirthDate-Year(date1, “1797”)
12
Wrapper Induction
| Related Work: Wrapper Induction | Lists | OCR | OCR Error Tolerant | OCR & Lists | Ont. | Lists and Ont. |
|---|---|---|---|---|---|---|
| Sum | 7.7 | 2.0 | 0.5 | 1.8 | 3.0 | 2.2 |
| blanco_redundancy_2010 | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.3 |
| dalvi_automatic_2010 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| gupta_answering_2009 | 1.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 |
| carlson_bootstrapping_2008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| heidorn_automatic_2008 | 0.8 | 1.0 | 0.5 | 0.8 | 0.5 | 0.4 |
| chang_automatic_2003 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| crescenzi_roadrunner_2001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| lerman_automatic_2001 | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| chidlovskii_wrapper_2000 | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| kushmerick_wrapper_2000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| lerman_learning_2000 | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| thomas_t-wrappers_1999 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| adelberg_nodose_1998 | 1.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 |
| kushmerick_wrapper_1997 (dis.) | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.3 |
| kushmeric_wrapper_1997 (paper) | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.3 |
13
Wrapper Induction for Printed Text
• Adelberg 1998:
– Grammar induction for any structured text
– Not robust to OCR errors
– No empirical evaluation
• Heidorn 2008:
– Wrapper induction for museum specimen labels
– Not typical lists
• Supervised: will not scale well
• Entity attribute extraction: limited ontology population
14
Typical Ontology Population
15
Expressive Ontology Population
1. Lexical vs. non-lexical: GivenName(“Joe”) vs. Person(p1)
2. N-ary relationships: City-Population-Year(“Provo”, “115000”, “2011”)
3. M degrees of separation: Husband-Wife(p1, p2), Wife-BirthDate(p2, d2), BirthDate-Year(d2, “1876”)
4. Functionality and optionality: Person-Birth() vs. Person-Marriage()
5. Generalization-specialization class hierarchies: Business vs. Person
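To make the first three properties concrete, the sketch below stores the example facts from this slide as (predicate, arguments) tuples and walks a chain of binary relationships. The tuple representation and the helper names `follow` and `wife_birth_years` are assumptions for illustration; the slides do not specify a storage format.

```python
# Facts as (predicate, arguments) tuples. Identifiers such as "p1" stand for
# non-lexical objects; the other strings are lexical values taken from text.
facts = {
    ("Person", ("p1",)),
    ("Person", ("p2",)),
    ("GivenName", ("Joe",)),
    ("City-Population-Year", ("Provo", "115000", "2011")),  # n-ary relationship
    ("Husband-Wife", ("p1", "p2")),
    ("Wife-BirthDate", ("p2", "d2")),
    ("BirthDate-Year", ("d2", "1876")),
}

def follow(facts, predicate, subject):
    """Return the objects related to `subject` by a binary predicate."""
    return [args[1] for pred, args in facts
            if pred == predicate and len(args) == 2 and args[0] == subject]

def wife_birth_years(facts, husband):
    """Walk three relationships: husband -> wife -> birth date -> year."""
    years = []
    for wife in follow(facts, "Husband-Wife", husband):
        for date in follow(facts, "Wife-BirthDate", wife):
            years.extend(follow(facts, "BirthDate-Year", date))
    return years

result = wife_birth_years(facts, "p1")
```

A flat attribute extractor would return the string "1876" without the intermediate objects, which is why it cannot answer whose birth year it is.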
16
Why not Apply Web Wrapper Induction to OCR Text?
• Noise tolerance:
– Allowing character variations increases recall but decreases precision
• Populate only the simplest ontologies
• Problems with wrapper language:
– Left-right context (Kushmerick 2000)
– XPath (Dalvi 2009, etc.)
– CRF (Gupta 2009)
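The noise-tolerance trade-off can be seen in a small sketch. The four lines of text, their gold labels, and both patterns are invented for illustration: the strict pattern misses records damaged by OCR, while the relaxed one recovers them but also accepts a line that is not a birth record.

```python
import re

lines = [
    "1. Sarah, b. 1797.",    # clean record
    "2. Amy, h. 1799.",      # OCR error: "b" read as "h"
    "3. John, b. i836.",     # OCR error: "1" read as "i"
    "4. Moved, to. Ohio.",   # not a birth record
]
is_record = [True, True, True, False]

strict = re.compile(r"\d+\. \w+, b\. \d{4}\.")
# Assumed relaxation: any lowercase marker and any four word characters as the year.
tolerant = re.compile(r"\d+\. \w+, [a-z]+\. \w{4}\.")

def precision_recall(pattern):
    """Score a pattern as a record detector against the gold labels."""
    predicted = [pattern.fullmatch(line) is not None for line in lines]
    tp = sum(p and t for p, t in zip(predicted, is_record))
    fp = sum(p and not t for p, t in zip(predicted, is_record))
    fn = sum(t and not p for p, t in zip(predicted, is_record))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

strict_scores = precision_recall(strict)
tolerant_scores = precision_recall(tolerant)
```

On this toy data the strict pattern has perfect precision but finds one record in three, and the tolerant pattern finds all three at the cost of one false positive.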
17
Solution: ListReader
• OCR
• Wrapper induction
– Semi-supervised
– Weakly supervised
– Bootstrapping
• Extract information into ontology
18
Semi-supervised Wrapper Induction
Child(child1)
Child-ChildNumber(child1, “1”)
Child-Name(child1, name1)
Name-GivenName(name1, “Sarah”)
Child-BirthDate(child1, date1)
BirthDate-Year(date1, “1797”)
19
Construct Form, Label First Record
<Child.ChildNumber>1</Child.ChildNumber>. <Child.Name.GivenName>Sarah</Child.Name.GivenName>, b. <Child.BirthDate.Year>1797</Child.BirthDate.Year>.
20
Wrapper Generalization
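[Figure: wrapper diagram aligned to the list below; recoverable labels: delimiters "." and ",", birth marker "b/h", field Child.BirthDate.Year, record delimiter "\n", and unresolved fields marked "??"]
1. Sarah, b. 1797.
2. Amy, h. 1799, d. i800.
3. John Erastus, b. 1836, d. 1876.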
21
Wrapper Generalization
1. Sarah, b. 1797.
2. Amy, h. 1799, d. i800.
3. John Erastus, b. 1836, d. 1876.
[Figure: generalized wrapper diagram; recoverable labels: birth marker "b/h", field Child.BirthDate.Year, death marker "d", new field Child.DeathDate.Year, delimiters "." and ",", record delimiter "\n"]
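The generalization shown on these two slides can be sketched with regular expressions standing in for wrappers. Both patterns and the group names are assumptions for illustration (Python group names cannot contain dots, so the field labels are shortened). The wrapper built from the first record matches only that record; allowing "h" for "b" and adding an optional, noise-tolerant death-year field lets it cover all three.

```python
import re

records = [
    "1. Sarah, b. 1797.",
    "2. Amy, h. 1799, d. i800.",
    "3. John Erastus, b. 1836, d. 1876.",
]

# Wrapper as it might look after labeling the first record only.
initial = re.compile(
    r"(?P<ChildNumber>\d+)\. (?P<GivenName>[A-Za-z ]+), b\. (?P<BirthYear>\d{4})\."
)

# Two assumed adjustments: accept "h" for "b" (an OCR confusion) and add an
# optional death-year field whose first character may be a misread digit.
generalized = re.compile(
    r"(?P<ChildNumber>\d+)\. (?P<GivenName>[A-Za-z ]+), [bh]\. (?P<BirthYear>\d{4})"
    r"(?:, d\. (?P<DeathYear>[\dil]\d{3}))?\."
)

def coverage(wrapper):
    """Count how many records the wrapper matches completely."""
    return sum(wrapper.fullmatch(r) is not None for r in records)

before, after = coverage(initial), coverage(generalized)
amy = generalized.fullmatch(records[1])
```

Note that the extracted death year for the second record is still the noisy string "i800"; correcting it would be a separate step.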
22
Wrapper Generalization as Beam Search
1. Initialize wrapper from first record
2. Apply predefined set of wrapper adjustments
3. Score alternate wrappers with:
– “Prior” (is like known list structure)
– “Likelihood” (how well they match next text)
4. Add best to wrapper set
5. Repeat until end of list
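A minimal sketch of these five steps follows, with regular expressions standing in for wrappers. Everything concrete here is an assumption: the two adjustment functions, the prior (a small penalty per adjustment), the likelihood (fraction of records read so far that the wrapper matches, a simplification of "how well they match next text"), and the beam width.

```python
import re

FIRST = r"(?P<ChildNumber>\d+)\. (?P<GivenName>[A-Za-z ]+), b\. (?P<BirthYear>\d{4})\."

# Hypothetical wrapper adjustments; each rewrites a regex string.
def allow_h_for_b(w):
    return w.replace(r"b\.", r"[bh]\.")

def add_optional_death(w):
    if "DeathYear" in w:
        return w
    # Drop the final "\." and re-add it after an optional death-year field.
    return w[:-2] + r"(?:, d\. (?P<DeathYear>\w{4}))?\."

ADJUSTMENTS = [allow_h_for_b, add_optional_death]

def score(wrapper, n_adjustments, seen):
    prior = -0.1 * n_adjustments  # assumed: fewer changes resemble known structure
    likelihood = sum(re.fullmatch(wrapper, r) is not None for r in seen) / len(seen)
    return prior + likelihood

def induce(first_wrapper, records, beam_width=3):
    beam = [(first_wrapper, 0)]                      # step 1
    for i in range(1, len(records) + 1):             # step 5
        seen = records[:i]
        candidates = set(beam)
        for wrapper, n in beam:                      # step 2
            for adjust in ADJUSTMENTS:
                adjusted = adjust(wrapper)
                if adjusted != wrapper:
                    candidates.add((adjusted, n + 1))
        ranked = sorted(candidates,                  # step 3
                        key=lambda c: (-score(c[0], c[1], seen), c[1], c[0]))
        beam = ranked[:beam_width]                   # step 4
    return beam[0][0]

records = [
    "1. Sarah, b. 1797.",
    "2. Amy, h. 1799, d. i800.",
    "3. John Erastus, b. 1836, d. 1876.",
]
best = induce(FIRST, records)
```

Keeping several candidates matters here: neither adjustment alone matches the second record, so a greedy search that kept only the top wrapper after each step could discard the partial change it needs later.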
23
Mapping Sequential Labels to Predicates
Child(child1)
Child-ChildNumber(child1, “1”)
Child-Name(child1, name1)
Name-GivenName(name1, “Sarah”)
Child-BirthDate(child1, date1)
BirthDate-Year(date1, “1797”)
<Child.ChildNumber>1</Child.ChildNumber>. <Child.Name.GivenName>Sarah</Child.Name.GivenName>, b. <Child.BirthDate.Year>1797</Child.BirthDate.Year>.
[Figure: wrapper for one record; recoverable sequence: "\n", Child.ChildNumber, ".", Child.Name.GivenName, ",", "b", ".", Child.BirthDate.Year, ".", "\n"]
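One way to read the mapping on this slide: each dotted label is a path through the ontology, where the last segment is a lexical field and every earlier segment is a non-lexical object. The function below is a sketch under that reading; its name, the input format, and the object-naming scheme (lowercased class name plus record number, which gives `birthdate1` where the slide writes `date1`) are assumptions. It expects every label to have at least two segments.

```python
def labels_to_predicates(labeled_fields, record_number=1):
    """Map (dotted label, text) pairs from one record to ontology predicates."""
    predicates = []
    objects = {}  # path prefix -> object identifier

    def object_for(prefix):
        if prefix not in objects:
            objects[prefix] = f"{prefix[-1].lower()}{record_number}"
        return objects[prefix]

    for path, text in labeled_fields:
        parts = tuple(path.split("."))
        root = parts[:1]
        if root not in objects:
            predicates.append(f"{parts[0]}({object_for(root)})")
        # Link each intermediate (non-lexical) object to its parent once.
        for depth in range(1, len(parts) - 1):
            prefix = parts[:depth + 1]
            if prefix not in objects:
                parent = object_for(parts[:depth])
                child = object_for(prefix)
                predicates.append(f"{parts[depth - 1]}-{parts[depth]}({parent}, {child})")
        # The last segment is lexical: attach the extracted string.
        holder = object_for(parts[:-1])
        predicates.append(f'{parts[-2]}-{parts[-1]}({holder}, "{text}")')
    return predicates

fields = [
    ("Child.ChildNumber", "1"),
    ("Child.Name.GivenName", "Sarah"),
    ("Child.BirthDate.Year", "1797"),
]
predicates = labels_to_predicates(fields)
```

Calling it again with `record_number=2` on the second record's fields would produce a separate `child2` object, so facts from different records stay distinct.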
24
Weakly Supervised Wrapper Induction
1. Apply wrappers and ontologies
2. Spot list by repeated patterns
3. Find best ontology fragments for best-labeled record
4. Generalize wrapper
– Both above and below
– Active learning without human input
25
Knowledge from Previously Wrapped Lists
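[Figure: wrapper retained from previously wrapped lists; recoverable field labels: Child.ChildNumber, Child.Name.GivenName, Child.BirthDate.Year, Child.DeathDate.Year, Child.Spouse.Name.GivenName, Child.Spouse.Name.Surname; recoverable delimiters: ".", ",", ";", "b", "d", "m", "\n"]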
26
List Spotting
1. Sarah, b. 1797.
2. Amy, h. 1799, d. i800.
3. John Erastus, b. 1836.
[Figure: partial wrapper applied to the list; recoverable labels: "\n", Child.ChildNumber, ".", Child.Name.GivenName]
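Spotting a list by repeated patterns can be approximated by reducing each line to a coarse character-class signature and looking for contiguous runs of lines that share one. This is a simplified stand-in, not the list spotter used in the project; the signature scheme, its length of four characters, the minimum run length, and the surrounding sample lines are all assumptions.

```python
import re
from itertools import groupby

def signature(line, length=4):
    """Coarse shape of the start of a line: 9 = digit, A = upper, a = lower."""
    shape = re.sub(r"[0-9]", "9", line)
    shape = re.sub(r"[A-Z]", "A", shape)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"(.)\1+", r"\1", shape)  # collapse repeated characters
    return shape[:length]

def spot_lists(lines, min_records=2):
    """Return (start, end) line spans where consecutive lines share a signature."""
    spans, start = [], 0
    for _, group in groupby(lines, key=signature):
        n = len(list(group))
        if n >= min_records:
            spans.append((start, start + n))
        start += n
    return spans

page = [
    "Children of John and Mary:",
    "1. Sarah, b. 1797.",
    "2. Amy, h. 1799, d. i800.",
    "3. John Erastus, b. 1836.",
    "They moved to Ohio.",
]
spans = spot_lists(page)
```

Because only the start of each line is compared, the OCR errors later in the second record do not break the run; an error in the record number itself would.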
27
Select Ontology Fragments and Label the Starting Record
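1. Sarah, b. 1797.
2. Amy, h. 1799, d. i800.
3. John Erastus, b. 1836.
[Figure: two wrapper fragments matched to the starting record; recoverable labels: "\n", Child.ChildNumber, "." and ",", "b", ".", Child.BirthDate.Year]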
28
Merge Ontology and Wrapper Fragments
29
Generalize Wrapper & Learn New Fields without User
1. Sarah, b. 1797.
2. Amy, h. 1799, d. i800.
3. John Erastus, b. 1836.
[Figure: newly learned wrapper fragment; recoverable labels: "d", ".", Child.DeathDate.Year, "."]
30
Thesis Statement
It is possible to populate an ontology semi-automatically, with better than state-of-the-art accuracy and cost, by inducing information extraction wrappers to extract the stated facts in the lists of an OCRed document, firstly relying only on a single user-provided field label for each field in each list, and secondly relying on less ongoing user involvement by leveraging the wrappers induced and facts extracted previously from other lists.
31
Four Hypotheses
1. Is a single labeling of each field sufficient?
2. Is fully automatic induction possible?
3. Does ListReader perform increasingly better?
4. Are induced wrappers better than the best?
32
Hypothesis 1
• Single user labeling of each field per list
• Evaluate detecting new optional fields
• Evaluate semi-supervised wrapper induction
33
Hypothesis 2
• No user input required with imperfect recognizers
• Find required level of noisy recognizer P & R
34
Hypothesis 3
• Increasing repository knowledge decreases the cost
• Show repository can produce P- and R-level recognizers
• Evaluate number of user-provided labels over time
35
Hypothesis 4
• ListReader performs better than a representative state-of-the-art information extraction system
• Compare ListReader with the supervised CRF in Mallet
36
Evaluation Metrics
• Precision
• Recall
• F-measure
• Accuracy
• Number of user-provided labels
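For reference, the first four metrics can be computed as in the sketch below, using their standard definitions. Treating extracted facts as a set compared against a gold set, and accuracy as the fraction of words with the correct label, are assumptions about how the metrics would be applied here; the example facts are invented.

```python
def precision_recall_f1(extracted, gold):
    """Standard set-based precision, recall, and balanced F-measure."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def accuracy(predicted_labels, gold_labels):
    """Fraction of positions (e.g. words) whose predicted label is correct."""
    correct = sum(p == g for p, g in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels) if gold_labels else 0.0

gold = {
    ("Child-ChildNumber", "child1", "1"),
    ("Name-GivenName", "name1", "Sarah"),
    ("BirthDate-Year", "date1", "1797"),
    ("DeathDate-Year", "date2", "1800"),
}
extracted = {
    ("Child-ChildNumber", "child1", "1"),
    ("Name-GivenName", "name1", "Sarah"),
    ("BirthDate-Year", "date1", "1979"),   # wrong value
}
p, r, f1 = precision_recall_f1(extracted, gold)
acc = accuracy(["Number", "Name", "Other", "Year"], ["Number", "Name", "Other", "Other"])
```

The fifth metric, the number of user-provided labels, is a count of annotation effort rather than a formula, so it is not computed here.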
37
Corpus
• Lists in several types of historical docs
• Dev. set: ~100 pages
• Blind set: ~400 pages
38
Research Schedule
1. Prepare datasets: Incremental
2. Semi-supervision and label mapping: Fall 2012
3. ICDAR conference paper “Semi-supervised Wrapper Induction for OCRed Lists”: Feb. 1, 2013
4. Journal paper “Semi-supervised Wrapper Induction for OCRed Lists”: Winter 2013
5. Weak supervision: Winter 2013
6. Journal paper “Weakly-supervised Wrapper Induction for OCRed Lists”: Winter 2013
7. Dissertation: Summer 2013
8. Dissertation defense: Fall 2013
• (Journals considered: IJDAR first; JASIST, PAMI, PR, TKDE, DKE second)
39
Work and Results Thus Far
• Large, diverse corpus of OCRed documents
• Semi-supervised regex and HMM induction
• Both beat CRF trained on three times the data
• Designed label-to-predicate mapping
• Implemented preliminary mapping
• 85% accuracy of word-level list spotting
40
Expected Contributions
• ListReader
– Wrapper induction
– OCRed lists
– Populating ontologies
– Accuracy and cost
41
Questions & Answers
42
What Does that Mean?
• Populating Ontologies
– A machine-readable and mathematically specified conceptualization of a collection of facts
• Semi-automatically Inducing
– Pushing more work to the machine
• Information Extraction Wrappers
– Specialized processes exposing data in documents
• Lists in OCRed Documents
– Data-rich with variable format and noisy content
43
Who Cares?
• Populating Ontologies
– Versatile, expressive, structured, digital information is queryable, linkable, editable.
• Semi-automatically Inducing
– Lowers cost of data
• Information Extraction Wrappers
– Accurate by specializing for each document format
• Lists in OCRed Documents
– Lots of data useful for family history, marketing, personal finance, etc. but challenging to extract
44
Related Work
[Diagram of research areas around the current research problem: Artificial Intelligence, Machine Learning, Natural Language Processing, Information Extraction, Wrappers, Document Image Analysis, Conceptual Modeling]
45
Related Work
[Diagram of related work areas: Wrapper Induction, Noisy OCR Text, Lists, Ontology Population]
46
List Reading
• Specialized for one kind of list:
– Printed ToC: Marinai 2010, Dejean 2009, Lin 2006
– Printed bibs: Besagni 2004, Besagni 2003, Belaid 2001
– HTML lists: Elmeleegy 2009, Gupta 2009, Tao 2009, Embley 2000, Embley 1999
• Use specialized hand-crafted knowledge
• Rely on clean input text containing useful HTML structure or tags
• NER or flat attribute extraction: limited ontology population
• Omit one or more reading steps
47
Why not Use Left-Right Context?
• Field boundaries
• Field position and character content
• Record boundaries
[Figure: OCRed list]
48
Why not Use XPaths?
• OCR text has no explicit XML DOM tree structure
• XPaths require HTML tags to perfectly mark field text
49
Why not Use (Gupta’s) CRFs?
• HTML lists and records are explicitly marked
• Different application: Augment tables using tuples from any lists on web
• At web scale, they can throw away harder-to-process lists
• They rely on more training data than we will
• We will compare our approach to CRFs
50
Page Grammars
• Conway [1993]
• 2-D CFG and chart parser for page layout recognition from document images
• Can assign logical labels to blocks of text
• Manually constructed grammars
• Rely on spatial features
51
Reading Steps
1. List spotting
2. Record segmentation
3. Field segmentation
4. Field labeling
5. Nested list recognition
Members of the football team:
Captain: Donald Bakken.................Right Half Back
LeRoy "sonny' Johnson.........,........Lcft Half Back
Orley "Dude" Bakken......,.......,......Quarter Back
Roger Jay Myhrum........................ .Full Back
Bill "Snoz" Krohg,...........................Center
They had a good year.
52
Special Labels Resolve Ambiguity
Child(child1)
Child-ChildNumber(child1, “1”)
Child-Name(child1, name1)
Name-GivenName(name1, “Sarah”)
Child-BirthDate(child1, date1)
BirthDate-Year(date1, “1797”)
<Child.ChildNumber>1</Child.ChildNumber>. <Child.Name.GivenName>Sarah</Child.Name.GivenName>, b. <Child.BirthDate.Year>1797</Child.BirthDate.Year>.
1. Sarah, b. 1797.
2. Amy, h. 1799, d. i800.
3. John Erastus, b. 1836, d. 1876.
[Figure: wrapper for one record; recoverable sequence: "\n", Child.ChildNumber, ".", Child.Name.GivenName, ",", "b", ".", Child.BirthDate.Year, ".", "\n"]
53
Agenda (90 minutes)
• Research Area + Questions (35 minutes)
• Research Problem + Questions (55 minutes)
• Committee Deliberation ()
• Please ask questions along the way
54
Research Area
35 minutes
55
Research Problem
55 minutes