The ODU Metadata Extraction Project March 28, 2007 Dr. Steven J. Zeil [email protected]
Outline
1. Overview
2. Recent Developments
A. Independent Document Model
B. Validation
C. Diversifying – NASA & GPO collections
3. New Issues & Future Directions
A. Post-processing
B. Image-Based Classification
1. Overview
[Figure: system overview flow diagram]
• Input Documents → Input Processing & OCR → XML model of document
• Form Processing (Form Templates: sf298_1, sf298_2, ...) → Extracted Metadata
• Unresolved Documents → Nonform Processing (Nonform Templates: au, eagle, ...) → Extracted Metadata
• Extracted Metadata → Post Processing → Cleaned Metadata → Validation
• Validation → trusted outputs → Final Metadata Output
• Validation → Untrusted Metadata Outputs → Human Review & Correction → corrected metadata
Input Processing & OCR
• Select pages of interest
• Apply Off-The-Shelf OCR software
• Convert OCR output to XML model format
Form Processing
• Scan document for form names
– Select form template
• Apply form extraction engine to document and template
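The form-selection step above can be sketched as a simple text scan. This is only an illustration: the template names sf298_1 and sf298_2 come from the overview slide, but the marker strings, their mapping to templates, and the function name are assumptions rather than the project's actual rules.

```python
from typing import Optional

# Hypothetical marker strings; which revision maps to which template is assumed.
FORM_MARKERS = {
    "sf298_1": ["standard form 298 (rev. 2-89)"],
    "sf298_2": ["standard form 298 (rev. 8-98)"],
}


def select_form_template(page_text: str) -> Optional[str]:
    """Return the name of the first form template whose marker appears in the text."""
    # Collapse whitespace and case so OCR line breaks do not hide a form name.
    normalized = " ".join(page_text.lower().split())
    for template_name, markers in FORM_MARKERS.items():
        if any(marker in normalized for marker in markers):
            return template_name
    # No form name found: the document is left unresolved for non-form processing.
    return None
```

A document for which this returns None would fall through to the non-form path shown in the overview.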
Sample RDP
Sample RDP (cont.)
Metadata Extracted from Sample RDP (1/3)
<metadata templateName="sf298_2">
<ReportDate>18-09-2003</ReportDate>
<DescriptiveNote>Final Report</DescriptiveNote>
<DescriptiveNote>1 April 1996 - 31 August 2003</DescriptiveNote>
<UnclassifiedTitle>VALIDATION OF IONOSPHERIC MODELS</UnclassifiedTitle>
<ContractNumber>F19628-96-C-0039</ContractNumber>
<ContractNumber></ContractNumber>
<ProgramElementNumber>61102F</ProgramElementNumber>
<PersonalAuthor>Patricia H. Doherty Leo F. McNamara
Susan H. Delay Neil J. Grossbard</PersonalAuthor>
<ProjectNumber>1010</ProjectNumber>
<TaskNumber>IM</TaskNumber>
<WorkUnitNumber>AC</WorkUnitNumber>
<CorporateAuthor>Boston College / Institute for Scientific Research 140 Commonwealth Avenue Chestnut Hill, MA 02467-3862</CorporateAuthor>
Metadata Extracted from Sample RDP (2/3)
<ReportNumber></ReportNumber>
<MonitorNameAndAddress>Air Force Research Laboratory 29 Randolph Road
Hanscom AFB, MA 01731-3010</MonitorNameAndAddress>
<MonitorAcronym>VSBP</MonitorAcronym>
<MonitorSeries>AFRL-VS-TR-2003-1610</MonitorSeries>
<DistributionStatement>Approved for public release; distribution unlimited.</DistributionStatement>
<Abstract>This document represents the final report for work performed under the Boston College contract F I9628-96C-0039. This contract was entitled Validation of Ionospheric Models. The objective of this contract was to obtain satellite and ground-based ionospheric measurements from a wide range of geographic locations and to utilize the resulting databases to validate the theoretical ionospheric models that are the basis of the Parameterized Real-time Ionospheric Specification Model (PRISM) and the Ionospheric Forecast Model (IFM). Thus our various efforts can be categorized as either observational databases or modeling studies.</Abstract>
Metadata Extracted from Sample RDP (3/3)
<Identifier>Ionosphere, Total Electron Content (TEC), Scintillation, Electron density, Parameterized Real-time Ionospheric Specification Model (PRISM), Ionospheric Forecast Model (IFM), Paramaterized Ionosphere Model (PIM), Global Positioning System (GPS)</Identifier>
<ResponsiblePerson>John Retterer</ResponsiblePerson>
<Phone>781-377-3891</Phone>
<ReportClassification>U</ReportClassification>
<AbstractClassification>U</AbstractClassification>
<AbstractLimitaion>SAR</AbstractLimitaion>
</metadata>
Non-Form Processing
• Classification – compare document against known document layouts
– Select template written for closest matching layout
• Apply non-form extraction engine to document and template
Non-Form Sample (1/2)
Non-Form Sample (2/2)
Template Used for Sample Document
<structdef pagenumber="1" templateID="au">
  <identifier min="1" max="1">
    <begin inclusive="current">
      <stringmatch case="yes" loc="beginwith">AU/</stringmatch>
    </begin>
    <end>onesection</end>
  </identifier>
  <CorporateAuthor min="1" max="1">
    <begin inclusive="current">
      <stringmatch case="no" loc="beginwith">AIR COMMAND | AIR WAR</stringmatch>
    </begin>
    <end inclusive="current">
      <stringmatch case="no" loc="beginwith">AIR UNIVERSITY</stringmatch>
    </end>
  </CorporateAuthor>
  <UnclassifiedTitle min="1" max="1">
    <begin inclusive="after">CorporateAuthor</begin>
    <end inclusive="before">
      <stringmatch case="no" loc="beginwith">by</stringmatch>
    </end>
  </UnclassifiedTitle>
  …
Metadata Extracted From the Title Page of the Sample Document
<paper templateid="au">
  <identifier>AU/ACSC/012/1999-04</identifier>
  <CorporateAuthor>AIR COMMAND AND STAFF COLLEGE AIR UNIVERSITY</CorporateAuthor>
  <UnclassifiedTitle>INTEGRATING COMMERCIAL ELECTRONIC EQUIPMENT TO IMPROVE MILITARY CAPABILITIES </UnclassifiedTitle>
  <PersonalAuthor>Jeffrey A. Bohler LCDR, USN</PersonalAuthor>
  <advisor>Advisor: CDR Albert L. St.Clair</advisor>
  <ReportDate>April 1999</ReportDate>
</paper>
Post-Processing
• Coerce extracted values into standard formats
Validation
• Estimate quality of extracted metadata
• Untrusted outputs referred (to humans) for review and correction
Recent Developments
A. Independent Document Model
B. Validation
C. Diversifying – NASA and GPO Collections
A. Independent Document Model (IDM)
• Platform-independent Document Model
• Motivation
– Dramatic XML Schema change between OmniPage 14 and 15
– Tie the template engine to a stable specification
– Protects from linking directly to a specific OCR product
– Allows us to include statistics for enhanced feature usage
• Statistics (e.g., avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount, etc.)
Documents in IDM
• A document consists of pages
• pages are divided into regions
• regions may be divided into
– blocks of vertical whitespace
– paragraphs
– tables
– images
• paragraphs are divided into lines
• lines are divided into words
All of these carry standard attributes for size, position, font, etc.
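The containment hierarchy above can be pictured as nested record types. The sketch below is an assumption-laden illustration, not the IDM schema itself: the class and attribute names are invented, regions hold only paragraphs (whitespace blocks, tables, and images are omitted), and only the statistic names wordCount and avgDocFontSize are taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    text: str
    font_size: float  # stand-in for the "size, position, font" attributes


@dataclass
class Line:
    words: List[Word] = field(default_factory=list)


@dataclass
class Paragraph:
    lines: List[Line] = field(default_factory=list)


@dataclass
class Region:
    # Simplification: real regions may also hold whitespace blocks, tables, images.
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class Page:
    regions: List[Region] = field(default_factory=list)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)

    def words(self) -> List[Word]:
        """Flatten the page/region/paragraph/line hierarchy into a word list."""
        return [w for p in self.pages for r in p.regions
                for para in r.paragraphs for ln in para.lines for w in ln.words]

    def word_count(self) -> int:
        return len(self.words())

    def avg_doc_font_size(self) -> float:
        words = self.words()
        return sum(w.font_size for w in words) / len(words) if words else 0.0
```

Document-level statistics such as these are the kind of derived values the IDM is said to carry for use by templates.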
Generating IDM
• Use XSLT 2.0 stylesheets to transform OCR output into IDM
– Supporting a new OCR schema only requires writing a new XSLT stylesheet
– Engine does not change
IDM Usage
[Figure: OCR output is converted to IDM by a per-format stylesheet; both extraction engines read the IDM document]
• OmniPage 14 XML Doc → docTreeModelOmni14.xsl → IDM XML Doc
• OmniPage 15 XML Doc → docTreeModelOmni15.xsl → IDM XML Doc
• Other OCR Output XML Doc → docTreeModelOther.xsl → IDM XML Doc
• IDM XML Doc → Form Based Extraction, Non Form Extraction
IDM Tool Status
• Converters completed to generate IDM from OmniPage 14 and 15 XML
– OmniPage 15 proved to have numerous errors in its representation of an OCR’d document
– Consequently, not recommended
• Form-based extraction engine revised to work from IDM
• Non-form engine still works from our older “CleanXML”
– converter from IDM to CleanXML completed as stop-gap measure
– direct use of IDM deferred pending review of other engine modifications
B. Validation
• Given a set of extracted metadata
– mark each field with a confidence value indicating how trustworthy the extracted value is
– mark the set with a composite confidence score
• Fields and Sets with low confidence scores may be referred for additional processing
– automated post-processing
– human intervention and correction
Validating Extracted Metadata
• Techniques must be independent of the extraction method
• A validation specification is written for each collection, combining
• Field-specific validation rules
– statistical models derived for each field of
  • text length
  • % of words from English dictionary
  • % of phrases from knowledge base prepared for that field
– pattern matching
Sample Validation Specification
• Combines results from multiple fields
<val:validate collection="dtic"
xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
<val:average>
<val:field name="UnclassifiedTitle">...</val:field>
<val:field name="PersonalAuthor">...</val:field>
<val:field name="CorporateAuthor">...</val:field>
<val:field name="ReportDate">...</val:field>
</val:average>
</val:validate>
Validation Spec: Field Tests
• Each field is subjected to one or more tests
…
<val:field name="PersonalAuthor">
  <val:average>
    <val:length/>
    <val:max>
      <val:phrases length="1"/>
      <val:phrases length="2"/>
      <val:phrases length="3"/>
    </val:max>
  </val:average>
</val:field>
<val:field name="ReportDate">
  <val:reportFormat/>
</val:field>
...
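The way the PersonalAuthor tests nest (an average of a length test and the maximum of three phrase tests) can be mimicked with small combinator functions. This is a sketch under stated assumptions: the scoring inside length_test and phrases_test is a crude stand-in for the statistical models and knowledge base mentioned earlier, and none of these Python names come from the project's tag library.

```python
from statistics import mean
from typing import Callable, Set

FieldTest = Callable[[str], float]  # maps a field value to a confidence in [0, 1]


def average(*tests: FieldTest) -> FieldTest:
    """Combine tests by averaging their scores (mirrors <val:average>)."""
    return lambda value: mean([t(value) for t in tests])


def maximum(*tests: FieldTest) -> FieldTest:
    """Combine tests by taking the best score (mirrors <val:max>)."""
    return lambda value: max(t(value) for t in tests)


def length_test(min_words: int, max_words: int) -> FieldTest:
    # Assumption: a hard word-count window instead of a statistical length model.
    return lambda value: 1.0 if min_words <= len(value.split()) <= max_words else 0.0


def phrases_test(known_phrases: Set[str], length: int) -> FieldTest:
    """Fraction of `length`-word phrases in the value found in a knowledge base."""
    def test(value: str) -> float:
        words = value.lower().split()
        grams = [" ".join(words[i:i + length]) for i in range(len(words) - length + 1)]
        if not grams:
            return 0.0
        return sum(g in known_phrases for g in grams) / len(grams)
    return test


# Hypothetical knowledge base for author names.
KNOWN_AUTHORS = {"kathleen", "kathleen a. kerrigan"}

personal_author = average(
    length_test(1, 5),
    maximum(*(phrases_test(KNOWN_AUTHORS, n) for n in (1, 2, 3))),
)
```

With these stand-ins, a clean author name scores higher than the same name preceded by extra label text, which is the behaviour the validator relies on.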
Sample Input Metadata Set
<metadata>
<UnclassifiedTitle>Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
<PersonalAuthor>Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
<ReportDate>Accepted this 18th day of June 2004 by:</ReportDate>
</metadata>
Sample Validator Output
<metadata confidence="0.522">
<UnclassifiedTitle confidence="0.943">Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle>
<PersonalAuthor confidence="0.622">Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor>
<ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate>
</metadata>
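The composite score in this output is consistent with a plain average of the three field scores that are present: (0.943 + 0.622 + 0.0) / 3 ≈ 0.522. The short check below recomputes it from a trimmed copy of the output. That the validator averages only the fields present in the set is an inference from these numbers, not something the slides state.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the validator output shown above (field text shortened).
validator_output = """
<metadata confidence="0.522">
  <UnclassifiedTitle confidence="0.943">Thesis Title: ...</UnclassifiedTitle>
  <PersonalAuthor confidence="0.622">Name of Candidate: ...</PersonalAuthor>
  <ReportDate confidence="0.0">Accepted this 18th day of June 2004 by:</ReportDate>
</metadata>
"""

root = ET.fromstring(validator_output)
field_scores = [float(child.get("confidence")) for child in root]
composite = sum(field_scores) / len(field_scores)
reported = float(root.get("confidence"))

print(f"recomputed {composite:.3f}, reported {reported:.3f}")
```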
Classification (a priori)
[Figure: a priori classification flow]
• Unresolved Document (CleanXML) → Classify (select best template), using Nonform Templates (au, eagle, ...) → selected template
• selected template → Extract Metadata → Extracted Metadata → Final Nonform Output
• Previously, we had attempted various schemes for a priori classification
– x-y trees
– bin classification
• Still investigating some
– image-based recognition
Post-Hoc Classification
• Apply all templates to document
– results in multiple candidate sets of metadata
• Score each candidate using the validator
– Select the best-scoring set
[Figure: post-hoc classification flow]
• Unresolved Document (CleanXML) → Extract Metadata, using Nonform Templates (au, eagle, ...) → Candidate Metadata Sets
• Candidate Metadata Sets → Select Best Metadata, using validation rules from the Validation Spec. → Selected Metadata → Final Nonform Output
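Post-hoc classification as described above amounts to "extract with every template, score every result, keep the best". A minimal sketch follows; the extractor and validator are passed in as plain callables, and all names here are hypothetical rather than the project's API.

```python
from typing import Callable, Dict, Tuple

Metadata = Dict[str, str]
Extractor = Callable[[str], Metadata]   # applies one template to a document
Validator = Callable[[Metadata], float]  # returns a composite confidence


def classify_post_hoc(document: str,
                      templates: Dict[str, Extractor],
                      validate: Validator) -> Tuple[str, Metadata, float]:
    """Apply every template, score each candidate set, return the best one."""
    if not templates:
        raise ValueError("at least one template is required")
    # One candidate metadata set per template.
    candidates = {name: extract(document) for name, extract in templates.items()}
    # Score each candidate with the validator.
    scores = {name: validate(metadata) for name, metadata in candidates.items()}
    best = max(scores, key=scores.get)
    return best, candidates[best], scores[best]
```

The winning template name doubles as the document's class, so classification falls out of extraction plus validation rather than preceding them.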
Experimental Results
Number of documents, by manually assigned class (rows) and validator-preferred class (columns):
Manually Assigned Class   Au   Eagle   Rand   Title   Total
Au                        86   0       0      0       86
Eagle                     0    8       33     4       45
Rand                      0    0       8      4       12
Title                     0    0       1      23      24
Interpretation of Results
• Validator agreed with human on 125 out of 167 cases
• Of 42 cases where they disagreed
– 37 were due to “extra” words in extracted metadata (e.g., military ranks in author names)
  • highlights need for post-processing to clean up metadata
– 2 were mistakes by template
– 2 were due to garbled characters by OCR
– 1 due to a bug in the validator
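The figures quoted here can be cross-checked against the results table. The sketch below assumes the four validator-preferred columns appear in the same class order as the rows (Au, Eagle, Rand, Title); under that reading the diagonal sums to the 125 agreements reported.

```python
CLASSES = ["Au", "Eagle", "Rand", "Title"]

# Rows: manually assigned class. Columns: validator-preferred class
# (column order is an assumption, see the note above).
COUNTS = {
    "Au":    [86, 0, 0, 0],
    "Eagle": [0, 8, 33, 4],
    "Rand":  [0, 0, 8, 4],
    "Title": [0, 0, 1, 23],
}

total = sum(sum(row) for row in COUNTS.values())
agreed = sum(COUNTS[name][i] for i, name in enumerate(CLASSES))
disagreed = total - agreed

# Causes of disagreement listed on the slide.
causes = {"extra words": 37, "template mistakes": 2, "OCR garbling": 2, "validator bug": 1}

print(f"agreement: {agreed}/{total} = {agreed / total:.1%}")
```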
C. Diversifying – NASA and GPO Collections
Document collections differ in
• whether forms are used and form layout
• document layout
• what metadata fields are present & which ones are collected
Changing Collections
• Porting to a new document collection
– identify pages of interest
– training classifiers to recognize new document layouts (?)
– templates for forms & document layouts
– new validation scripts
  • collect statistics for collection model
– new post-processing rules
• No changes required to core engines & other software
NASA Technical Reports
• Different layouts than DTIC
– fewer total
– tend to be visually more similar
– mixture with and without RDPs
NASA Sample Document
Extracted Metadata for NASA Sample
<paper templateid="singleAuthor">
  <metadata>
    <UnclassifiedTitle> A Computationally Efficient Meshless Local Petrov-Galerkin Method for Axisymmetric Problems </UnclassifiedTitle>
    <PersonalAuthor> I.S. Raju* and T. Chen? </PersonalAuthor>
    <CorporateAuthor> NASA Langley Research Center Hampton, VA 23681 </CorporateAuthor>
    <Abstract> The Meshless Local Petrov-Galerkin (MLPG) method is one of the recently developed element-free …
Govt. Printing Office
• Congressional acts & reports
• EPA reports
Preliminary study with Acts of Congress and EPA reports
• samples suggest layouts are more diverse than DTIC or NASA
– metadata actually present in document varies widely
GPO Sample – Act of Congress
Metadata Extracted for Act of Congress
<paper>
  <metadata>
    <public_law_report_num> 118 STAT. 3984 PUBLIC LAW 108?493?DEC. 23, 2004 </public_law_report_num>
    <bill_number>[H.R. 5394 ] components.</bill_number>
    <congress_num>108th Congress</congress_num>
    <type>An Act</type>
    <acttype> Dec. 23, 2004 To amend the Internal Revenue Code of 1986 to modify the taxation of arrow [H.R. 5394 ] components. </acttype>
  </metadata>
</paper>
GPO sample report
Metadata Extracted from GPO Sample Report
<paper>
  <metadata>
    <title> CHINA'S PROLIFERATION PRACTICES AND ROLE IN THE NORTH KOREA CRISIS </title>
    <type> HEARING BEFORE THE U.S.-CHINA ECONOMIC AND SECURITY REVIEW COMMISSION </type>
    <session>ONE HUNDRED NINTH CONGRESS FIRST SESSION</session>
    <date>MARCH 10, 2005</date>
    <use> Printed for the use of the U.S.-China Economic and Security Review Commission </use>
    <online>Available via the World Wide Web: http://www.uscc.gov</online>
  </metadata>
</paper>
3. New Issues and Future Directions
A. Post-Processing
B. Image-Based Classification
Post-processing
• WYSIWYG
– What You See is What You Get
• WYG != WYW
– What You Get is not What You Want
Example – DTIC Date Format
• Document may contain:
– March 28, 2007
– 3/28/2007
– 3/28/07
• DTIC requires:
– 28 MAR 2007
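A date post-processor for this case can try each known input layout in turn and re-emit the date in the required form. The sketch below handles only the three input layouts listed above; the function name is hypothetical, zero-padding of single-digit days is an assumption, and parsing month names with %B assumes the default English locale.

```python
from datetime import datetime
from typing import Optional

# Input layouts from the slide: "March 28, 2007", "3/28/2007", "3/28/07".
# The four-digit-year pattern is tried before the two-digit one.
INPUT_FORMATS = ["%B %d, %Y", "%m/%d/%Y", "%m/%d/%y"]

# Explicit month table so the output does not depend on the locale.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def to_dtic_date(text: str) -> Optional[str]:
    """Coerce a recognised date string to the 'DD MON YYYY' form, else None."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next layout
        return f"{parsed.day:02d} {MONTHS[parsed.month - 1]} {parsed.year}"
    return None
```

Returning None for unrecognised input leaves the decision to a later stage, for example the validator's pattern test on ReportDate.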
Example – Personal Authors
Example – Personal Authors (cont.)
• We extract:
<PersonalAuthor>Patricia H. Doherty Leo F. McNamara Susan H. Delay Neil J. Grossbard</PersonalAuthor>
• DTIC requires:
<PersonalAuthor>Patricia H. Doherty ;Leo F. McNamara ;Susan H. Delay ;Neil J. Grossbard</PersonalAuthor>
• NASA requires:
<author>Patricia H. Doherty</author>
<author>Leo F. McNamara</author>
<author>Susan H. Delay</author>
<author>Neil J. Grossbard</author>
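Once the individual names are known, producing the two target formats is mechanical. The sketch below covers only that formatting step and assumes the names have already been separated into a list; splitting the undelimited extracted string into names is the harder problem and is not attempted here. The function names are hypothetical.

```python
from typing import List
from xml.sax.saxutils import escape


def format_dtic_authors(authors: List[str]) -> str:
    """Single PersonalAuthor element, names joined with ' ;' as in the DTIC example."""
    return "<PersonalAuthor>" + " ;".join(escape(a) for a in authors) + "</PersonalAuthor>"


def format_nasa_authors(authors: List[str]) -> str:
    """One author element per name, as in the NASA example."""
    return "".join(f"<author>{escape(a)}</author>" for a in authors)


names = ["Patricia H. Doherty", "Leo F. McNamara", "Susan H. Delay", "Neil J. Grossbard"]
print(format_dtic_authors(names))
print(format_nasa_authors(names))
```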
Post-Processing Requirements
• Post-processing rules must vary by
– metadata field
– collection
Post-Processing Architecture
[Figure: class diagram]
• MetadataPostProcessor, with operation register(tag, processor)
• Collection-specific post-processors: DTICPostProcessor, NASAPostProcessor, GPOPostProcessor
• Registered TagProcessors: DTICPersonalAuthors, NASAPersonalAuthors, DTICDates
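The architecture suggests a base post-processor that keeps a registry of per-tag processors, with each collection registering its own. A minimal sketch: the class names MetadataPostProcessor and DTICPostProcessor and the register(tag, processor) operation come from the diagram, while the process method and the two stand-in tag processors are assumptions made for illustration.

```python
from typing import Callable, Dict

TagProcessor = Callable[[str], str]


class MetadataPostProcessor:
    """Applies a registered processor to each metadata field, keyed by tag name."""

    def __init__(self) -> None:
        self._processors: Dict[str, TagProcessor] = {}

    def register(self, tag: str, processor: TagProcessor) -> None:
        self._processors[tag] = processor

    def process(self, metadata: Dict[str, str]) -> Dict[str, str]:
        # Fields without a registered processor pass through unchanged.
        return {tag: self._processors.get(tag, lambda v: v)(value)
                for tag, value in metadata.items()}


def dtic_personal_authors(value: str) -> str:
    # Stand-in only: normalises whitespace instead of real author handling.
    return " ".join(value.split())


def dtic_dates(value: str) -> str:
    # Stand-in only: upper-cases the value instead of real date coercion.
    return value.upper()


class DTICPostProcessor(MetadataPostProcessor):
    """Collection-specific post-processor that registers the DTIC tag processors."""

    def __init__(self) -> None:
        super().__init__()
        self.register("PersonalAuthor", dtic_personal_authors)
        self.register("ReportDate", dtic_dates)
```

Keeping the rules in a registry lets them vary by both field and collection without touching the base class.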
Image-Based Classification
• filter to find likely candidates for validator-based selection of template
• Looking at a variety of techniques inspired by work in image recognition
Example: Image-Based Classification
• Example: represent a page using various colors to denote images, text, bold text, etc.
• find visually most similar pages in documents of known classes
– “vote” based on 5 most similar documents
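The voting step can be sketched as a nearest-neighbour lookup. Everything about the representation here is an assumption: each page is reduced to an equal-length sequence of cell codes (for example "T" for text, "B" for bold text, "I" for image, "." for blank), and similarity is the count of differing cells. The slides specify only that the 5 most similar documents vote.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def distance(page_a: Sequence[str], page_b: Sequence[str]) -> int:
    """Number of cells whose content type differs (pages assumed same size)."""
    return sum(a != b for a, b in zip(page_a, page_b))


def classify_by_vote(page: Sequence[str],
                     labelled_pages: List[Tuple[Sequence[str], str]],
                     k: int = 5) -> str:
    """Return the majority class among the k visually closest labelled pages."""
    nearest = sorted(labelled_pages, key=lambda item: distance(page, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

As the previous slide notes, a classifier like this would act as a filter that narrows the set of templates handed to validator-based selection.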
Visual Matching Example (1/2)
Visual Matching Example (2/2)
Conclusions
• Automated metadata extraction can be performed effectively on a wide variety of documents
– Coping with heterogeneous collections is a major challenge
• Much attention must be paid to “support” issues
– validation, post-processing, etc.