Document Content Analysis for Digital Archives
Eric Saund
Perceptual Document Analysis Area
Intelligent Systems Laboratory
Palo Alto Research Center
Digital Archives
Tasks
- casual browsing
- look up information
- follow trails
- compose narratives
- form and organize collections
- distribute
- assemble timelines
Operations
- browse by topic, type, etc.
- search for known items
- search for items meeting criteria
- find duplicate items
- find similar items
- follow links
- establish links
- apply logical rules
- edit metadata
All enabled by Metadata
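A few of the operations listed above (browse by topic or type, search for items meeting criteria, find duplicate items) can be sketched as lookups over metadata records. This is a minimal illustration only: the record fields, values, and function names are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical metadata records; field names and values are illustrative only.
records = [
    {"id": "a1", "type": "invoice", "topic": "construction", "year": 2004},
    {"id": "a2", "type": "letter", "topic": "construction", "year": 2003},
    {"id": "a3", "type": "invoice", "topic": "construction", "year": 2004},
]


def build_index(recs, field):
    """Browse by field: map each field value to the ids of matching items."""
    index = defaultdict(list)
    for rec in recs:
        index[rec[field]].append(rec["id"])
    return dict(index)


def search(recs, **criteria):
    """Search for items whose metadata equals every given criterion."""
    return [r["id"] for r in recs
            if all(r.get(k) == v for k, v in criteria.items())]


def find_duplicates(recs, fields):
    """Group items that agree on all the chosen fields."""
    groups = defaultdict(list)
    for rec in recs:
        groups[tuple(rec[f] for f in fields)].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

Each operation works only as far as the stored fields allow: a search on a field that was never extracted returns nothing.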
Content layer
Metadata layer
Index
Metadata
Two major problems with metadata:
1. Extracting metadata from raw content items.
2. Metadata is always incomplete for some purposes.
Title: Sarix neob
Date: 37-23-55
Media: niobium
Format: jnb
Author: Rsi Liwer
Text: “aliirn xeca sarlia isyb...”
Index ID: 34962s
pointer to item
Metadata as a static record
computeSimilarityTo()
containsEntity?()
fitsSlotInModel?()
extractTextAfterImageCleanup()
Metadata as an interface
functions applied to item content
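The contrast between the two views of metadata can be sketched in code. The following is a rough illustration, not the author's design: the static record mirrors the fields shown on the slide, while the interface computes answers from item content on demand. The Python method names are adapted from the slide's function names, the `item_pointer` field and the `TextItem` class are hypothetical, the word-overlap similarity is a stand-in chosen for brevity, and `fitsSlotInModel?()` is omitted because the slides do not say what a model is.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class StaticRecord:
    """Metadata as a static record: fixed fields plus a pointer to the item."""
    title: str
    date: str
    media: str
    format: str
    author: str
    text: str
    index_id: str
    item_pointer: str  # hypothetical: a path or URI to the raw content


class MetadataInterface(Protocol):
    """Metadata as an interface: functions applied to item content."""
    def extract_text_after_image_cleanup(self) -> str: ...
    def contains_entity(self, entity: str) -> bool: ...
    def compute_similarity_to(self, other: "MetadataInterface") -> float: ...


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


class TextItem:
    """Toy item whose content is already plain text (an assumption)."""
    def __init__(self, content: str) -> None:
        self.content = content

    def extract_text_after_image_cleanup(self) -> str:
        # Placeholder: a real system would run image cleanup and OCR here.
        return " ".join(self.content.split())

    def contains_entity(self, entity: str) -> bool:
        # Naive substring match, used only for illustration.
        return entity.lower() in self.extract_text_after_image_cleanup().lower()

    def compute_similarity_to(self, other: MetadataInterface) -> float:
        # Jaccard overlap of word sets, chosen only for illustration.
        a = _tokens(self.extract_text_after_image_cleanup())
        b = _tokens(other.extract_text_after_image_cleanup())
        return len(a & b) / len(a | b) if a | b else 0.0
```

With the interface view, a question nobody anticipated at cataloging time can still be answered by adding a new function, instead of re-extracting a new field for every item.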
Automatic Content Analysis
State of the Art
• document image analysis
• photographic image analysis
• video/film analysis
• audio analysis
• web site analysis
text
appearance, layout
who, what, where, when
topics, entities
genre, category, functional roles
genre, scenes, who, what, ...
genre, speech/music, speaker ID, transcription
APR 21 2004 17:38 FR ---- 203 749 4519 TO 4264 P.02/06 * 9STCapitalModularSpace SALE INVOICE _ jz5| g'" ni'idspace.com -IPage: 1 FAX TO:_ BILL TO: REMIT TO: ACCOUNT NO.: ;m11 GE Capital Corp10 Riverview Drive Danbury, CT 06810 PO NUMBER: per Chad LOCATION OF UNITS: SAME AS ABOVE UNIT NO.: 076613 SERIAL NO.: SM069A 26,351.00 4,;- UNIT NO.: 076614 SERIAL NO.: SM0G9B 26, 351. 00 DOWN PAYMENT 0. 00 BUILDING DELIVERY0. 00 BUILDING DELIVERY 400.00 BLOCK AND LEVEL 0. 00 BLOCK AND LEVEL 2,100.00ANCHOR/TIE DOWN 780 00 DECKING 950. 00 / ELECTRICAL 1, 350. 00 / PLUMBING3, 025. 00 INSTALLATION SITE MANAGEMENT 1,100 00 SKIRTING- VINYL 1,360. 00TOTAL DUE THIS INVOICE 63,767.00
When OCR Works...
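Even when OCR recovers the characters, fields such as amounts come back with stray spaces ("26, 351. 00", "0. 00"). As a rough sketch of the post-processing this implies, the following pulls dollar amounts out of OCR text while tolerating a space after a comma or around the decimal point. The pattern and the function name `extract_amounts` are assumptions made for illustration. The pattern deliberately does not handle a dropped decimal point (as in "780 00"), which would need layout or context cues to resolve.

```python
import re
from decimal import Decimal

# Digits with optional thousands groups, then a decimal point and two digits.
# One optional space is allowed after each comma and on either side of the point.
AMOUNT = re.compile(r"(?<!\d)(\d{1,3}(?:,\s?\d{3})+|\d+)\s?\.\s?(\d{2})(?!\d)")


def extract_amounts(ocr_text: str) -> list[Decimal]:
    """Return the amounts found in OCR text, in order of appearance."""
    amounts = []
    for match in AMOUNT.finditer(ocr_text):
        whole = re.sub(r"[,\s]", "", match.group(1))  # drop commas and spaces
        amounts.append(Decimal(f"{whole}.{match.group(2)}"))
    return amounts


sample = ("UNIT NO.: 076614 SERIAL NO.: SM0G9B 26, 351. 00 "
          "DOWN PAYMENT 0. 00 BUILDING DELIVERY 400.00")
amounts = extract_amounts(sample)
```

The unit and serial numbers in the sample are skipped because no decimal part follows them.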
Header alignment
Graphical logo
Font / Layout / Symbol Pattern of Fax ID Line
Redacting markings
Address block
Repeated elements
Hand-drawn graphical annotation
Handwritten Textual Annotation
Textual Field Indicator
Tabular Layout
Graphic separator
Amount Field
How People See a Document
Category / Type
Structural Elements and Relations
Relational Context
• Invoice
• Construction project
• Supplier relationship
• Inventory & materials management
• Bill
• Itemized purchase listing
• Annotated document
Technology Ecology
Academia
• Computer Vision
• Document Recognition
• Information Retrieval
• Machine Learning
• Speech Recognition
• Natural Language
• Artificial Intelligence
Paying Customer:
• government
• industry
Characteristics:
• science-based
• toy problems
• fragile

Industry
• Document Imaging
• Transaction Processing
• Workflow Systems
• Database Vendors
• Business Software
• Business Process Outsourcing
• Advertising/Search
Paying Customer:
• businesses
• consumers
• government
Characteristics:
• engineering-based
• robust
• limited capabilities

Hobbyists
• museums
• schools
• local governments
• NGOs
• individuals
• startups
• boutique companies
• shoestring projects in Academia and Industry
A Hobby Project
Document Capture Station
+ Collection Comprehension Engine
Wanted:
Collection Comprehension Engine
OCR
Document Structure Modeling
Document Collection Linking
Image Processing
Automatic Cataloging
Genre Tagging
Clustering
Classification
Visualization GUI
Conclusion
The hobby stage brings together kindred spirits.