Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft...
![Page 1: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/1.jpg)
Table Extraction using Conditional Random Fields
David Pinto
Andrew McCallum
Xing Wei
Bruce Croft
University of Massachusetts Amherst
Building on previous joint work with John Lafferty and Fernando Pereira
![Page 2: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/2.jpg)
Documents convey meaning by…
Apple to Open Its First Retail Store in New York City
MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.
"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."
Stream of Words Words + Formatting & Layout
![Page 3: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/3.jpg)
Different modalities of “Grammar”
Prepositions Formatting & Layout
![Page 4: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/4.jpg)
Most complex use of layout: The Table
![Page 5: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/5.jpg)
Tables have a long history
Old (circa 2700 BC) New (2003)
![Page 6: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/6.jpg)
Simple
Total Returns -- Year Ended December 31, 2001
Money Market Fund            + 3.9%
All America Fund             -17.4%
Equity Index Fund            -12.2%
Mid-Cap Equity Index Fund    - 1.1%
Bond Fund                    + 8.7%
Short-Term Bond Fund         + 7.4%
Mid-Term Bond Fund           +10.4%
Composite Fund               -11.0%
Aggressive Equity Fund       -10.6%
![Page 7: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/7.jpg)
Complex
Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
      :            :             Production of Milk and Milkfat 2/
      :   Number   :-------------------------------------------------------
 Year :     of     :   Per Milk Cow    :  Percentage   :      Total
      :Milk Cows 1/:-------------------: of Fat in All :------------------
      :            :   Milk  : Milkfat : Milk Produced :   Milk  : Milkfat
--------------------------------------------------------------------------------
      : 1,000 Head     --- Pounds ---       Percent        Million Pounds
      :
 1993 :    9,589     15,704      575         3.66        150,582   5,514.4
 1994 :    9,500     16,175      592         3.66        153,664   5,623.7
 1995 :    9,461     16,451      602         3.66        155,644   5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
Source: www.FedStats.gov
![Page 8: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/8.jpg)
Table Information Extraction
Text:
Total Returns -- Year Ended December 31, 2001
Money Market Fund            + 3.9%
All America Fund             -17.4%
Equity Index Fund            -12.2%
Mid-Cap Equity Index Fund    - 1.1%
Bond Fund                    + 8.7%
Short-Term Bond Fund         + 7.4%
Mid-Term Bond Fund           +10.4%
Composite Fund               -11.0%
Aggressive Equity Fund       -10.6%
Table IE turns this text into a database:
FUND NAME        TOTAL % RETURN 2001
Money Market     +3.9
All America      -17.4
Equity Index     -12.2
...
or into a short document:
Total returns, year ended December 31, 2001, Aggressive Equity Fund, -10.6%
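As a rough illustration of the text-to-database direction, the sketch below parses fund lines like the ones above with a regular expression. The pattern and the name `parse_fund_rows` are assumptions made for this example and are not part of the talk; a fixed pattern like this only works for the "Simple" kind of table, which is why the rest of the deck turns to learned, layout-aware labeling.

```python
import re

# Hypothetical pattern: a fund name, whitespace, then a signed percentage such as "+ 3.9%".
ROW = re.compile(r"^\s*(?P<name>.+?)\s+(?P<sign>[+-])\s*(?P<value>\d+(?:\.\d+)?)%\s*$")


def parse_fund_rows(lines):
    """Turn 'Name  +/- N.N%' lines into (name, float) records; other lines are skipped."""
    records = []
    for line in lines:
        match = ROW.match(line)
        if match is None:
            continue  # title, blank, or anything else that is not a data row
        value = float(match.group("value"))
        if match.group("sign") == "-":
            value = -value
        records.append((match.group("name"), value))
    return records


table = [
    "Total Returns -- Year Ended December 31, 2001",
    "Money Market Fund            + 3.9%",
    "All America Fund             -17.4%",
    "Mid-Cap Equity Index Fund    - 1.1%",
]
records = parse_fund_rows(table)
```

The title line is dropped because it does not end in a percentage; hyphenated names such as "Mid-Cap" survive because the sign must be preceded by whitespace.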
![Page 9: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/9.jpg)
Automated Table Extraction needed for Question Answering
Question: How much snow fell in the greatest single snow storm?
Answer: 4800 mm
e.g. [Pinto, et al 2002]
![Page 10: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/10.jpg)
Automated Table Extraction needed for Data Mining
ENRON GLOBAL POWER & PIPELINES L.L.C. CONSOLIDATED BALANCE SHEETS (IN THOUSANDS, EXCEPT SHARE AMOUNTS)
                                               SEPTEMBER 30,   DECEMBER 31,
                                                   1997            1996
                                               -------------   ------------
                                                (UNAUDITED)
ASSETS
Current Assets
  Cash and cash equivalents                      $ 54,262        $ 24,582
  Accounts receivable                               8,473           6,301
  Dividends receivable                              7,189              --
  Current portion of notes receivable               1,470           1,394
  Other current assets                                336             404
                                                 --------        --------
    Total Current Assets                           71,730          32,681
                                                 --------        --------
Investments in to Unconsolidated Subsidiaries     286,340         298,530
Notes Receivable                                   16,059          12,111
                                                 --------        --------
    Total Assets                                 $374,408        $343,843
                                                 ========        ========
LIABILITIES AND SHAREHOLDERS' EQUITY
Current Liabilities
  Accounts payable                               $ 13,461        $ 11,277
  Accrued taxes                                     1,910           1,488
  Current portion of note payable                      --          36,583
                                                 --------        --------
    Total Current Liabilities                      15,371          49,348
                                                 --------        --------
Deferred Income Taxes                                 525           4,301
Commitments and Contingencies (Note 9)
Much of the data in SEC reports is contained in tables.
Would like to mine these reports for suspicious behavior, and to better understand what is normal.
![Page 11: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/11.jpg)
Sub-problems of Table Information Extraction
1. Locate the table
2. Identify the row positions and types
3. Identify the column positions and types
4. Segment the table into individual cells
5. Label the cells as data or label
6. Associate data cells with their appropriate headers
... value. The three largest crops in terms of production were head lettuce,
onions, and watermelon, which combined to account for 41 percent of the total
production. Head lettuce, tomatoes, and onions were the most valuable crops,
accounting for 34 percent of the total value when combined.
Principal Vegetables for Fresh Market: Area Planted and Harvested
by Crop, United States, 1997-99 1/
--------------------------------------------------------------------------------
: Area Planted : Area Harvested
Crop :----------------------------------------------------------------
: 1997 : 1998 : 1999 : 1997 : 1998 : 1999
--------------------------------------------------------------------------------
: Acres
:
Artichokes 2/ : 9,300 9,700 9,800 9,300 9,700 9,800
Asparagus 2/ : 79,530 77,730 79,590 74,030 74,430 75,890
Beans, Snap : 90,260 94,700 98,700 82,660 87,800 90,600
Brussels :
Sprouts 2/ : 3,200 3,200 3,200 3,200 3,200 3,200
Cabbage : 77,950 79,680 79,570 75,230 76,280 74,850
WEIGHTS and MEASURES
The approximate or average weights as given in this table do not necessarily
official standing as a basis for packing or as grounds for settling...
![Page 13: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/13.jpg)
Row types in the example:
Title
Super Header
Column Header
Data Row
Sub Header
Section Header Row
Section Data Row
Separator
![Page 16: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/16.jpg)
Header
Data
![Page 18: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/18.jpg)
Treat as a sequence labeling problem
Assign each line one of 12 labels:
• Non Table
• Title
• Super Header
• Table Header
• Sub Header
• Section Header
• Data Row
• Section Data Row
• Table Footnote
• Table Caption
• Blank
• Separator
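One way to picture the sequence-labeling view is as a list of labels, one per line, from which table locations fall out as runs of table-internal labels. The sketch below is illustrative only: the helper `table_spans` and the hand-written label sequence are assumptions for this example, and treating every label other than "Non Table" as inside a table is a simplification.

```python
LABELS = [
    "Non Table", "Title", "Super Header", "Table Header", "Sub Header",
    "Section Header", "Data Row", "Section Data Row", "Table Footnote",
    "Table Caption", "Blank", "Separator",
]


def table_spans(labels):
    """Return (start, end) pairs, end exclusive, for maximal runs of lines not labeled 'Non Table'."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label != "Non Table" and start is None:
            start = i  # a table run begins here
        elif label == "Non Table" and start is not None:
            spans.append((start, i))  # the run ended on the previous line
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


# Hand-assigned labels for ten consecutive lines (illustrative, not model output).
example = [
    "Non Table", "Title", "Title", "Separator", "Super Header",
    "Table Header", "Separator", "Data Row", "Data Row", "Non Table",
]
spans = table_spans(example)
```

With per-line labels in hand, sub-problem 1 (locating the table) reduces to reading off these spans.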
![Page 19: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/19.jpg)
Hidden Markov Models
[Figure: an HMM drawn as a finite state model and as a graphical model, with states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}]
Parameters, for all states $S = \{s_1, s_2, \ldots\}$:
• Start state probabilities: $P(s_t)$
• Transition probabilities: $P(s_t \mid s_{t-1})$
• Observation (emission) probabilities: $P(o_t \mid s_t)$
Training: Maximize probability of training observations (w/ prior)
$$P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
Generates:
• State sequence (via transitions)
• Observation sequence o1 o2 o3 o4 o5 o6 o7 o8 (via observations)
The observation distribution is usually a multinomial over an atomic, fixed alphabet.
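To make the factorization concrete, here is a minimal sketch that evaluates the joint probability of a state sequence and an observation sequence under a toy two-state HMM. All probabilities, the two-symbol alphabet, and the function name `joint_log_prob` are made up for illustration; the start distribution plays the role of the first transition.

```python
import math

# Toy parameters (assumed values, not from the talk).
start = {"Non Table": 0.8, "Data Row": 0.2}
trans = {
    "Non Table": {"Non Table": 0.9, "Data Row": 0.1},
    "Data Row": {"Non Table": 0.2, "Data Row": 0.8},
}
emit = {
    "Non Table": {"alpha": 0.9, "digits": 0.1},
    "Data Row": {"alpha": 0.2, "digits": 0.8},
}


def joint_log_prob(states, observations):
    """log P(s, o) as a sum over t of log P(s_t | s_{t-1}) + log P(o_t | s_t)."""
    logp = 0.0
    prev = None
    for s, o in zip(states, observations):
        p_trans = start[s] if prev is None else trans[prev][s]
        logp += math.log(p_trans) + math.log(emit[s][o])
        prev = s
    return logp


lp = joint_log_prob(
    ["Non Table", "Data Row", "Data Row"],
    ["alpha", "digits", "digits"],
)
```

Working in log space avoids underflow on long documents, where the product of many small probabilities would otherwise round to zero.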
![Page 20: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/20.jpg)
Table Row Labeling with Hidden Markov Models
accounting for 34 percent of the total value when combined.
Principal Vegetables for Fresh Market: Area Planted and Harvested
by Crop, United States, 1997-99 1/
--------------------------------------------------------------------------------
: Area Planted : Area Harvested
Crop :----------------------------------------------------------------
: 1997 : 1998 : 1999 : 1997 : 1998 : 1999
--------------------------------------------------------------------------------
: Acres
:
Artichokes 2/ : 9,300 9,700 9,800 9,300 9,700 9,800
Asparagus 2/ : 79,530 77,730 79,590 74,030 74,430 75,890
Beans, Snap : 90,260 94,700 98,700 82,660 87,800 90,600
Brussels :
Sprouts 2/ : 3,200 3,200 3,200 3,200 3,200 3,200
Cabbage : 77,950 79,680 79,570 75,230 76,280 74,850
WEIGHTS and MEASURES
The approximate or average weights as given in this table do not necessarily
official standing as a basis for packing or as grounds for settling...
Given a sequence of text lines…
…and a trained HMM with states:
• Table Title
• Table Header
• Data Row
• Non-Table
Find the most likely state sequence (Viterbi): $\arg\max_{s} P(s, o)$
…then any line said to be generated by the designated “Table Title” state is extracted as part of the title.
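The Viterbi step can be sketched as a short dynamic program. The two-state HMM and its probabilities below are invented for illustration, and the observations are reduced to a single coarse symbol per line ("alpha" or "digits"); the function name `viterbi` and the dictionary layout are assumptions of this example.

```python
import math


def viterbi(observations, states, start, trans, emit):
    """Return the state sequence that maximizes P(s, o) for an HMM given as nested dicts."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start[s]) + math.log(emit[s][observations[0]]) for s in states}]
    back = []  # back[t - 1][s]: best predecessor of state s at time t
    for o in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans[r][s]))
            scores[s] = best[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


states = ["Non Table", "Data Row"]
start = {"Non Table": 0.8, "Data Row": 0.2}
trans = {
    "Non Table": {"Non Table": 0.9, "Data Row": 0.1},
    "Data Row": {"Non Table": 0.2, "Data Row": 0.8},
}
emit = {
    "Non Table": {"alpha": 0.9, "digits": 0.1},
    "Data Row": {"alpha": 0.2, "digits": 0.8},
}
path = viterbi(["alpha", "digits", "digits", "alpha"], states, start, trans, emit)
```

Lines whose decoded state is a table state would then be extracted, in the same way the slide describes for the "Table Title" state.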
![Page 21: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/21.jpg)
What Observational Features?
Usually, P(o|s) is a multinomial over a one-dimensional alphabet of atomic features (such as words), but here we care about words plus many aspects of the layout.
Example features of text lines:
• Is indented
• Is indented by more than 4 spaces
• Is centered
• Contains more than 3 separate multi-space regions
• Has an interior region with more spaces than the indentation
• Whitespace in this line aligns vertically with whitespace in the previous line
• Contains mostly digits
• Contains mostly alphabetics
• Contains all “ASCII-graphics” characters
• Contains some “ASCII-graphics”
• Contains month names, years, or other strings associated with headers
• Contains more than 4 consecutive periods
• Next line contains all “ASCII-graphics”
• This line contains mostly alphabetics and contains more than 3 separate multi-space regions.
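A few of these line features can be sketched as simple predicates. The thresholds ("mostly" read as more than half of the non-space characters), the character set treated as "ASCII-graphics", and the names `GRAPHICS`, `all_graphics`, and `line_features` are assumptions made for this example; the slide does not define them precisely.

```python
import re

GRAPHICS = set("-=+*:|_")  # assumed definition of "ASCII-graphics" characters


def all_graphics(text):
    """True if text has at least one non-space character and all of them are graphics."""
    chars = [c for c in text if not c.isspace()]
    return bool(chars) and all(c in GRAPHICS for c in chars)


def line_features(line, next_line=""):
    """Compute a handful of boolean layout features for one text line."""
    chars = [c for c in line if not c.isspace()]
    n = len(chars) or 1  # avoid dividing by zero on blank lines
    indent = len(line) - len(line.lstrip(" "))
    gaps = re.findall(r" {2,}", line.strip())  # interior runs of 2+ spaces
    mostly_alpha = sum(c.isalpha() for c in chars) / n > 0.5
    return {
        "is_indented": indent > 0,
        "indented_more_than_4": indent > 4,
        "more_than_3_multispace_regions": len(gaps) > 3,
        "mostly_digits": sum(c.isdigit() for c in chars) / n > 0.5,
        "mostly_alphabetics": mostly_alpha,
        "all_ascii_graphics": all_graphics(line),
        "some_ascii_graphics": any(c in GRAPHICS for c in chars),
        "more_than_4_consecutive_periods": "....." in line,
        "next_line_all_ascii_graphics": all_graphics(next_line),
        "mostly_alpha_and_multispace": mostly_alpha and len(gaps) > 3,
    }


row = "Cabbage      :   77,950     79,680     79,570     75,230     76,280     74,850"
rule = "-" * 80
feats = line_features(row, next_line=rule)
```

Note that several of these predicates overlap by construction (the last one is a conjunction of two others), which is exactly the kind of dependence discussed on the next slide.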
![Page 22: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/22.jpg)
Problems with Rich Representation and a Generative Model
• These features are not independent:
– Overlapping features
– Multiple levels of granularity (words, characters)
– Multiple modalities (words, formatting, layout)
– Observations from past and future
• HMMs are generative models of the text: $P(s, o)$
• Generative models do not easily handle these non-independent features. Two choices:
– Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
– Ignore the dependencies. This causes “over-counting” of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
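The over-counting effect can be seen in a two-class toy calculation. The numbers below are invented, and `naive_bayes_posterior` is a hypothetical helper: it multiplies per-feature likelihoods as if the features were independent, so presenting the same evidence through three overlapping features inflates the posterior.

```python
import math


def naive_bayes_posterior(prior_a, likelihoods_a, likelihoods_b):
    """P(A | features) for two classes A and B under the naive independence assumption."""
    score_a = prior_a * math.prod(likelihoods_a)
    score_b = (1 - prior_a) * math.prod(likelihoods_b)
    return score_a / (score_a + score_b)


# One feature that favors class A by 3:1.
once = naive_bayes_posterior(0.5, [0.6], [0.2])

# The same evidence, seen through three perfectly correlated features,
# is counted three times and the model becomes overconfident.
thrice = naive_bayes_posterior(0.5, [0.6] * 3, [0.2] * 3)
```

Here `once` is 0.75 while `thrice` is about 0.964, even though no new information was added.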
![Page 23: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/23.jpg)
Conditional Sequence Models
• We would prefer a conditional model: P(s|o) instead of P(s,o):
– Can examine features, but not responsible for generating them.
– Don’t have to explicitly model their dependencies.
– Don’t “waste modeling effort” trying to generate what we are given at test time anyway.
![Page 24: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/24.jpg)
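From HMMs to MEMMs to CRFs
Conditional Finite State Sequence Models
[McCallum, Freitag & Pereira, 2000] [Lafferty, McCallum, Pereira 2001]
$$s = s_1, s_2, \ldots, s_n \qquad o = o_1, o_2, \ldots, o_n$$
[Figure: graphical models for the HMM, MEMM, and CRF over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}]
HMM (a special case of MEMMs and CRFs):
$$P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$
MEMM:
$$P(s \mid o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}, o_t) = \prod_{t=1}^{|o|} \frac{1}{Z_{s_{t-1}, o_t}} \exp\Big( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \Big)$$
CRF:
$$P(s \mid o) = \frac{1}{Z_o} \prod_{t=1}^{|o|} \exp\Big( \sum_j \lambda_j f_j(s_t, s_{t-1}) + \sum_k \mu_k g_k(s_t, o_t) \Big)$$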
![Page 25: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/25.jpg)
Conditional Random Fields (CRFs)
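[Figure: linear chain of states S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}, each connected to the whole observation sequence O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}]
Markov on s, conditional dependency on o.
$$P(s \mid o) = \frac{1}{Z_o} \prod_{t=1}^{|o|} \exp\Big( \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) \Big)$$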
Hammersley-Clifford theorem stipulates that the CRF has this form—an exponential function of the cliques in the graph.
Assuming that the dependency structure of the states is tree-shaped (linear chain is a trivial tree), inference can be done by dynamic programming in time $O(|o| \, |S|^2)$, just like HMMs.
Set parameters by maximum likelihood, using optimization method on L.
[Lafferty, McCallum, Pereira 2001]
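The dynamic-programming claim can be checked on a toy linear-chain CRF: the forward recursion below computes the log of the normalizer Z_o with one pass over positions and a double loop over states, and the result is compared against brute-force enumeration of all state sequences. The two feature functions, their weights, and the names `sequence_score` and `log_partition` are assumptions for this sketch; a practical implementation would also use a numerically stable log-sum-exp.

```python
import itertools
import math


def sequence_score(states, obs, weights, features):
    """Unnormalized log-score: sum over t and k of lambda_k * f_k(s_t, s_{t-1}, o, t)."""
    total = 0.0
    prev = None  # stands in for "no previous state" at the first position
    for t, s in enumerate(states):
        total += sum(w * f(s, prev, obs, t) for w, f in zip(weights, features))
        prev = s
    return total


def log_partition(obs, labels, weights, features):
    """log Z_o by the forward recursion: |o| positions, |S|^2 state pairs per position."""
    def local(s, prev, t):
        return sum(w * f(s, prev, obs, t) for w, f in zip(weights, features))

    alpha = {s: local(s, None, 0) for s in labels}
    for t in range(1, len(obs)):
        alpha = {
            s: math.log(sum(math.exp(alpha[r] + local(s, r, t)) for r in labels))
            for s in labels
        }
    return math.log(sum(math.exp(a) for a in alpha.values()))


# Toy features (assumed): one ties "Data Row" to digit-heavy lines, one rewards staying put.
features = [
    lambda s, prev, o, t: 1.0 if s == "Data Row" and o[t] == "digits" else 0.0,
    lambda s, prev, o, t: 1.0 if prev == s else 0.0,
]
weights = [2.0, 0.5]
labels = ["Non Table", "Data Row"]
obs = ["alpha", "digits", "digits"]

log_z = log_partition(obs, labels, weights, features)
all_scores = [
    sequence_score(seq, obs, weights, features)
    for seq in itertools.product(labels, repeat=len(obs))
]
brute_log_z = math.log(sum(math.exp(score) for score in all_scores))
```

Because each feature receives the whole observation sequence `o` and the position `t`, it is free to look at neighboring lines, which a per-state emission distribution cannot do.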
![Page 26: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/26.jpg)
General CRFs vs. HMMs
• More general and expressive modeling technique
• Comparable computational efficiency for inference
• Features may be arbitrary functions of any or all observations
• Parameters need not fully specify generation of observations; require less training data
• Easy to incorporate domain knowledge
![Page 27: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/27.jpg)
Experimental Results
• 114 plain ASCII documents obtained from FedStats.gov.
• Train on 52 documents, test on 62.
• Line labels: Title, Super Header, Table Header, Separator, Sub Header, Data Row, Table Caption, Section Header, Section Data Row, Blank, Table Footnote, Non Table
... value. The three largest crops in terms of production were head lettuce,
onions, and watermelon, which combined to account for 41 percent of the total
production. Head lettuce, tomatoes, and onions were the most valuable crops,
accounting for 34 percent of the total value when combined.
Principal Vegetables for Fresh Market: Area Planted and Harvested
by Crop, United States, 1997-99 1/
--------------------------------------------------------------------------------
: Area Planted : Area Harvested
Crop :----------------------------------------------------------------
: 1997 : 1998 : 1999 : 1997 : 1998 : 1999
--------------------------------------------------------------------------------
: Acres
:
Artichokes 2/ : 9,300 9,700 9,800 9,300 9,700 9,800
Asparagus 2/ : 79,530 77,730 79,590 74,030 74,430 75,890
Beans, Snap : 90,260 94,700 98,700 82,660 87,800 90,600
Brussels :
Sprouts 2/ : 3,200 3,200 3,200 3,200 3,200 3,200
Cabbage : 77,950 79,680 79,570 75,230 76,280 74,850
/2 Includes processing total for dual usage crops
WEIGHTS and MEASURES
The approximate or average weights as given in this table do not necessarily
official standing as a basis for packing or as grounds for settling...
![Page 28: Table Extraction using Conditional Random Fields David Pinto Andrew McCallum Xing Wei Bruce Croft University of Massachusetts Amherst Building on previous.](https://reader035.fdocuments.in/reader035/viewer/2022062409/5697bfc21a28abf838ca4a25/html5/thumbnails/28.jpg)
Features
• 4 spaces in a row
• 5+ spaces in a row
• 4+ space indentation
• 1 space indentation
• 2+ regions of 2+ spaces
• 3+ regions of 2+ spaces
• all spaces
• contains alphabetics
• contains digits
• contains separator characters (-+!=:*)
• contains 4+ periods
• contains common “header” words, such as months, years, etc.
• percentage of white space
• percentage of alphabetics
• percentage of digits
• percentage of separator chars
• percentage of “header” words.
• All these features in time-shifted conjunctions {-1, 0}, {0, 1}, {1, 2}.
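The binary features and their time-shifted conjunctions can be sketched roughly as follows. This is only an illustration under assumptions: the feature names, the tiny `HEADER_WORDS` set, counting periods over the whole line, and counting space regions after stripping the indentation are all hypothetical choices, not the authors' implementation. The percentage (continuous) features are omitted.

```python
import re

SEPARATOR_CHARS = set("-+!=:*")
# Hypothetical, deliberately tiny stand-in for the "header" word list.
HEADER_WORDS = {"january", "february", "march", "1997", "1998", "1999", "total"}

def line_features(line):
    """Return the set of binary layout/content features active on one line."""
    feats = set()
    indent = len(line) - len(line.lstrip(" "))
    # Assumption: "regions" are interior gaps, so indentation is stripped first.
    space_runs = re.findall(r" {2,}", line.strip(" "))
    if re.search(r"(?<! ) {4}(?! )", line):      # a run of exactly 4 spaces
        feats.add("4_spaces_in_a_row")
    if re.search(r" {5,}", line):
        feats.add("5+_spaces_in_a_row")
    if indent >= 4:
        feats.add("4+_space_indentation")
    if indent == 1:
        feats.add("1_space_indentation")
    if len(space_runs) >= 2:
        feats.add("2+_regions_of_2+_spaces")
    if len(space_runs) >= 3:
        feats.add("3+_regions_of_2+_spaces")
    if line and line.strip(" ") == "":
        feats.add("all_spaces")
    if any(c.isalpha() for c in line):
        feats.add("contains_alpha")
    if any(c.isdigit() for c in line):
        feats.add("contains_digits")
    if any(c in SEPARATOR_CHARS for c in line):
        feats.add("contains_separator")
    if line.count(".") >= 4:                     # assumption: total count, not a run
        feats.add("contains_4+_periods")
    if any(w in HEADER_WORDS for w in re.findall(r"\w+", line.lower())):
        feats.add("contains_header_word")
    return feats

def with_conjunctions(lines):
    """Add conjunctions of features at line offsets {-1,0}, {0,1}, {1,2}."""
    per_line = [line_features(l) for l in lines]
    out = []
    for i in range(len(lines)):
        feats = {f"{f}@0" for f in per_line[i]}
        for a, b in [(-1, 0), (0, 1), (1, 2)]:
            if i + a >= 0 and i + b < len(lines):
                for fa in per_line[i + a]:
                    for fb in per_line[i + b]:
                        feats.add(f"{fa}@{a}&{fb}@{b}")
        out.append(feats)
    return out
```

Each conjunction lets the model react to patterns such as "a line with digits directly followed by a separator line", which a single-line feature cannot express.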
Table Extraction Experimental Results
Line labels, percent correct
Table segments, F1
95 % 92 %
65 % 64 %
error = 85%
error = 77%
85 % -
HMM
Stateless MaxEnt
CRF w/out conjunctions
CRF continuous features
81 % 71 %
93 % 91 % CRF binary features
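The two "error =" annotations read as relative error reductions. The sketch below shows that arithmetic under an assumption the flattened table does not make explicit: that 65 % / 64 % is the baseline row and 95 % / 92 % is the best CRF row. The function name is hypothetical.

```python
def relative_error_reduction(baseline_score, new_score):
    """Percent of the baseline's error (100 - score) removed by the new model."""
    baseline_error = 100 - baseline_score
    new_error = 100 - new_score
    return 100 * (baseline_error - new_error) / baseline_error

# Assumed pairing of rows (see the note above).
line_label_reduction = relative_error_reduction(65, 95)   # about 85.7
segment_f1_reduction = relative_error_reduction(64, 92)   # about 77.8
```

Both values are consistent with the 85% and 77% shown on the slide if the slide truncates rather than rounds.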
Per-label Results
Label               Recall  Precision
Non Table           98      95
Separator           90      94
Title               54      90
Super Header        65      91
Table Header        46      34
Sub Header          92      62
Section Header      44      70
Data Row            86      91
Section Data Row    55      68
Table Footnote      69      90
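For comparing labels with a single number, recall and precision can be combined into F1 (their harmonic mean). The values computed below are derived from the table above, not reported on the slide; the dictionary layout is just one convenient choice.

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# (recall, precision) pairs copied from the per-label table.
per_label = {
    "Non Table": (98, 95),
    "Separator": (90, 94),
    "Title": (54, 90),
    "Super Header": (65, 91),
    "Table Header": (46, 34),
    "Sub Header": (92, 62),
    "Section Header": (44, 70),
    "Data Row": (86, 91),
    "Section Data Row": (55, 68),
    "Table Footnote": (69, 90),
}

per_label_f1 = {label: f1(r, p) for label, (r, p) in per_label.items()}
weakest = min(per_label_f1, key=per_label_f1.get)
```

On these numbers the weakest label by F1 is Table Header, while Non Table is the strongest.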
Current Work
Infer rows, columns, cell boundaries and types all at once!
... value. The three largest crops in terms of production were head lettuce,
onions, and watermelon, which combined to account for 41 percent of the total
production. Head lettuce, tomatoes, and onions were the most valuable crops,
accounting for 34 percent of the total value when combined.
Principal Vegetables for Fresh Market: Area Planted and Harvested
by Crop, United States, 1997-99 1/
--------------------------------------------------------------------------------
: Area Planted : Area Harvested
Crop :----------------------------------------------------------------
: 1997 : 1998 : 1999 : 1997 : 1998 : 1999
--------------------------------------------------------------------------------
: Acres
:
Artichokes 2/ : 9,300 9,700 9,800 9,300 9,700 9,800
Asparagus 2/ : 79,530 77,730 79,590 74,030 74,430 75,890
Beans, Snap : 90,260 94,700 98,700 82,660 87,800 90,600
Brussels :
Sprouts 2/ : 3,200 3,200 3,200 3,200 3,200 3,200
Cabbage : 77,950 79,680 79,570 75,230 76,280 74,850
/2 Includes processing total for dual usage crops
WEIGHTS and MEASURES
A random variable per character, connected in a grid of dependencies.
Similar to Markov Random Fields as used in computer vision, but conditionally trained.
Exact inference with loops is intractable. We use recent methods of approximate inference [Wainwright et al 02, 03].
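To make the "grid of dependencies" concrete, here is a minimal sketch that builds one node per character and links each node to its right and lower neighbour. The 4-neighbour structure, the padding of short lines, and the function name are assumptions for illustration; the slide does not specify the exact graph.

```python
def grid_edges(lines):
    """Build a character grid: pad lines to equal width, link adjacent cells."""
    width = max((len(line) for line in lines), default=0)
    padded = [line.ljust(width) for line in lines]
    edges = []
    for r in range(len(padded)):
        for c in range(width):
            if c + 1 < width:              # horizontal dependency (same line)
                edges.append(((r, c), (r, c + 1)))
            if r + 1 < len(padded):        # vertical dependency (same column)
                edges.append(((r, c), (r + 1, c)))
    return padded, edges
```

The vertical edges create cycles in the graph, which is why exact inference is no longer tractable as it is for a chain.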
CRF Related Work
• Maximum entropy for language tasks
– Language modeling [Rosenfeld ’94], [Chen & Rosenfeld ’99]
– POS tagging, conditioning on previous state [Ratnaparkhi ’98]
– Segmentation [Beeferman, Berger, Lafferty ’99]
• Other Conditional Markov Models
– Sequence of Winnow classifiers [Roth ’98]
– Gradient descent on state path [LeCun et al ’98]
– Maximum Entropy Markov Models [McCallum, Freitag, Pereira 2000], used by [Klein, Smarr, Nguyen, Manning 2003], …
– Maximum Margin sequence models [Taskar et al 2003], [Altun et al 2003], [Joachims 2003]
– Feature induction for CRFs [McCallum 2003]
• Training methods
– Limited Memory Quasi-Newton [Malouf 2002], [Sha & Pereira 2002]
– Voted Perceptron [Collins 2002]
– Adaptive Over-relaxed Bound Optimization [Roweis 2003]
Table Extraction Related Work
• Matthew Hurst [1999, 2000, 2002]
– Defined many of the issues of table modeling
– Used a naïve-Bayes-like model of table layout
• Ng, Lim, Koo [1999]
– Serially find table, segment rows and columns using stateless C4.5 and neural network classifiers.
• Pyreddy & Croft [1997], Pinto et al [2002]
– Heuristically find tables, cells and their associations; use for question answering.
• “Wrapper Learning” for extraction from consistently formatted Web pages also uses language and formatting
– e.g. Stephen Soderland, Nick Kushmerick, Dayne Freitag, William Cohen, Ion Muslea, …
Summary
• In many documents, meaning is conveyed not only in the stream of words, but in layout.
• Conditional Random Fields combine the benefits of finite-state context with robustness to non-independent language+layout features.
• Variants of CRFs will bring even finer-grained and more tightly integrated decision-making capabilities.
End of talk
MEMM & CRF Related Work
• Maximum entropy for language tasks:
– Language modeling [Rosenfeld ’94, Chen & Rosenfeld ’99]
– Part-of-speech tagging [Ratnaparkhi ’98]
– Segmentation [Beeferman, Berger & Lafferty ’99]
– Named entity recognition “MENE” [Borthwick, Grishman, … ’98]
• HMMs for similar language tasks
– Part of speech tagging [Kupiec ’92]
– Named entity recognition [Bikel et al ’99]
– Other Information Extraction [Leek ’97], [Freitag & McCallum ’99]
• Serial Generative/Discriminative Approaches
– Speech recognition [Schwartz & Austin ’93]
– Reranking Parses [Collins ’00]
• Other conditional Markov models
– Non-probabilistic local decision models [Brill ’95], [Roth ’98]
– Gradient-descent on state path [LeCun et al ’98]
– Markov Processes on Curves (MPCs) [Saul & Rahim ’99]
– Voted Perceptron-trained FSMs [Collins ’02]
Voted Perceptron Sequence Models
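Given training data: $\{\langle \mathbf{o}, \mathbf{s} \rangle^{(i)}\}$
Initialize parameters to zero: $\forall k:\ \lambda_k = 0$
Iterate to convergence:
for all training instances $i$:
$\mathbf{s}_{\mathrm{Viterbi}} = \arg\max_{\mathbf{s}} \exp\Big( \sum_t \sum_k \lambda_k f_k(s_{t-1}, s_t, \mathbf{o}^{(i)}, t) \Big)$
$\forall k:\ \lambda_k \leftarrow \lambda_k + C_k(\mathbf{s}^{(i)}, \mathbf{o}^{(i)}) - C_k(\mathbf{s}_{\mathrm{Viterbi}}, \mathbf{o}^{(i)})$
where $C_k(\mathbf{s}, \mathbf{o}) = \sum_t f_k(s_{t-1}, s_t, \mathbf{o}, t)$ as before.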
[Collins 2002]
Like CRFs with stochastic gradient ascent and a Viterbi approximation.
Avoids calculating the partition function (normalizer) Z_o, but uses gradient ascent rather than a 2nd-order or conjugate-gradient method.
Analogous to the gradient for this one training instance.
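The training loop on this slide can be sketched as a plain structured perceptron: decode each training sequence with Viterbi, then add the feature counts of the gold path and subtract those of the predicted path. This is a simplified illustration, not Collins' full method: the voting/averaging step is omitted, and the `features` function (transition plus emission indicators) and the toy observation symbols are hypothetical.

```python
from collections import defaultdict

def features(prev, cur, obs, t):
    """Hypothetical f_k(s_{t-1}, s_t, o, t): names of the active features."""
    return [("trans", prev, cur), ("emit", cur, obs[t])]

def score(weights, prev, cur, obs, t):
    return sum(weights[f] for f in features(prev, cur, obs, t))

def viterbi(weights, obs, labels):
    """Highest-scoring state sequence under the current weights."""
    best = [{s: (score(weights, None, s, obs, 0), None) for s in labels}]
    for t in range(1, len(obs)):
        column = {}
        for s in labels:
            column[s] = max(
                ((best[t - 1][p][0] + score(weights, p, s, obs, t), p) for p in labels),
                key=lambda pair: pair[0],
            )
        best.append(column)
    path = [max(labels, key=lambda s: best[-1][s][0])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])   # follow back-pointers
    return path[::-1]

def counts(states, obs):
    """C_k(s, o): how often each feature fires along a state sequence."""
    c = defaultdict(int)
    prev = None
    for t, s in enumerate(states):
        for f in features(prev, s, obs, t):
            c[f] += 1
        prev = s
    return c

def train(data, labels, epochs=10):
    weights = defaultdict(float)            # all parameters start at zero
    for _ in range(epochs):
        mistakes = 0
        for obs, gold in data:
            predicted = viterbi(weights, obs, labels)
            if predicted != gold:
                mistakes += 1
                for f, v in counts(gold, obs).items():
                    weights[f] += v
                for f, v in counts(predicted, obs).items():
                    weights[f] -= v
        if mistakes == 0:                    # converged on the training set
            break
    return weights

labels = ["NonTable", "Separator", "DataRow"]
data = [
    (["prose", "dashes", "digits"], ["NonTable", "Separator", "DataRow"]),
    (["caption", "rule", "numbers"], ["NonTable", "Separator", "DataRow"]),
]
weights = train(data, labels)
```

No normalizer is ever computed: only the single best path is needed, which is the practical appeal noted on the slide.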
Part-of-speech Tagging
The asbestos fiber , crocidolite, is unusually resilient once
it enters the lungs , with even brief exposures to it causing
symptoms that show up decades later , researchers said .
DT NN NN , NN , VBZ RB JJ IN
PRP VBZ DT NNS , IN RB JJ NNS TO PRP VBG
NNS WDT VBP RP NNS JJ , NNS VBD .
45 tags, 1M words training data, Penn Treebank
                            Using spelling features*
       Error    oov error   error    Δ err   oov error   Δ err
HMM    5.69%    45.99%
CRF    5.55%    48.05%      4.27%    -24%    23.76%      -50%
* use words, plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.
[Pereira 2001 personal comm.]
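The overlapping spelling features in the footnote might be computed along these lines. This is a sketch under assumptions: the feature names are invented for illustration, and "begins with #" is read here as "begins with a digit".

```python
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")

def spelling_features(word):
    """Word identity plus overlapping spelling features for one token."""
    feats = {f"word={word}"}
    if word[:1].isupper():
        feats.add("capitalized")
    if word[:1].isdigit():                 # assumption: "#" means a number
        feats.add("begins_with_number")
    if "-" in word:
        feats.add("contains_hyphen")
    for suffix in SUFFIXES:
        if word.lower().endswith(suffix):
            feats.add(f"ends_in_-{suffix}")
    return feats
```

Several features can fire on the same word (for example both -ion and -tion), which is exactly the kind of non-independent evidence a conditionally trained model tolerates and which helps most on out-of-vocabulary words.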