High Fidelity Programmatic Access to Document Content
Standardization and the industry
Matevž Gačnik
CTO, Gama System
Microsoft Regional Director
Microsoft MVP – Solution Architecture
Company: About Us
Consultants in eDMS space
Solution architects
Devs
Software development and consulting company
Advanced document systems, intensive development
15 years
Privately owned
Own products, development investment
Developing on strategic platforms (Microsoft, IBM)
Today
Gama System® Document Line
eDMS, ECM: Gama System® eDocs
eArchive: Gama System® eArchive
Based on Electronic Documents, not scans
Completeness
Innovative (Gama System® eArchive, innovation of 2008, Slovenia)
Universal usage
Modern technologies, modern demands
Simplicity
Integrability, portal technologies, SOA
Legal validity
Existing user knowledge
Gama System® Document Line: Current Solutions
Document System: Gama System® eDocs
Archiving System: Gama System® eArchive
Design
Modular platform, open system, integration points
Standardized interfaces
Modern technology
Users work in Office apps
Microsoft Word
Microsoft Excel
Microsoft PowerPoint, …
Support for Microsoft 2007 Office System
Support for OpenXML format
Default persistence format in Office 2007 installations
Basics: Document Line: Two Products
Slovenian / EU legislation...
1.1.2007 - ZVDAGA
Positions electronic document as an original, if:
It is signed by the author – with a qualified dsig and cert
It is stored and archived by a certified software solution
Needs to maintain validity (dsig could be extended by time stamping or resigning)
Needs to maintain readability (supports format change, reformatting)
Is in a preferred document format
If not, needs a certified, authentic format conversion method – information value needs to be preserved
You can destroy paper documents, no physical archives
Current Legislation: Document Content
Slovenian / EU legislation...
Specifically - Common Technology Demands (CTD)
CTD defines technology demands on document and archiving systems for electronic documents
Based on MoReq v1.0/2.0, but formalized and regulated
... says:
Electronic documents with textual and mixed content should be stored long term in one of the following standard formats:
ISO Latin-1 – ISO 8859-1, PDF/A – ISO 19005-1
XML – SGML – ISO 8879, ODF – ISO/IEC 26300
OpenXML is (currently) not a preferred format
When CTD was approved, OpenXML was not an ISO standard
Current Legal Issues: Document Content
Currently
If you want to archive document for long term storage, it must be in a preferred standard document format
Actual meaning? PDF/A.
Even though XML/SGML is on the list as an ISO standard, one would probably fail to prove in court that OOXML qualifies as plain XML
Not everything is XML in OOXML (think /word/media/*)
We are pushing for OOXML to be included in the list
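The point about /word/media/* can be checked on any .docx package. The following is a minimal sketch, assuming the package is available as bytes; the function name `non_xml_parts` is hypothetical, and the check is a simple file-extension heuristic rather than a lookup in [Content_Types].xml.

```python
import io
import zipfile


def non_xml_parts(docx_bytes):
    """List package parts that are not XML (hypothetical helper).

    Assumption: a part counts as XML if its name ends in .xml or .rels.
    A stricter check would read the content types declared in
    [Content_Types].xml instead of trusting file extensions.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(
            name
            for name in package.namelist()
            if not name.endswith("/")
            and not name.lower().endswith((".xml", ".rels"))
        )
```

For a document with embedded pictures, the result would typically contain entries under word/media/, which is exactly the content that an "it is all XML" argument does not cover.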
Major drawbacks
Office 2007 penetration in enterprise environment is not at a critical point
Pre-2007 Office document formats are practically unparsable and cannot be seriously automated
Currently, the majority of documents cannot be stored long term in their native format
Current Legal Issues: Impact and Drawbacks
OpenXML is now an open standard
Spec is freely available
XML based, easily processed
Converters available
Able to store arbitrary document content
Low impact for out-of-band editing – only need efficient parsing methods
Format is size efficient
Not based on bitmap representation
Fidelity is high, granularity high, parsable
Major benefit in long term archiving systems
OOXML and ISO: Technical View – Gains
High fidelity formats preserve user content
Market share of editable content is preserved
Fully capable format
Preserves formatting and document layout
Preserves usability of office software
Supports signatures natively
Format and tooling
Users only work with one tool – for storage and editing
No need for additional tools (in terms of long term storage)
OOXML and ISO: User View – Gains
Long term archiving in OpenXML
Digital documents will be stored in OpenXML
Simpler implementation of storage in eDMS and archiving systems
Generating software and e-archive use the same format
No conversions necessary (for legal reasons) – no need for authenticity of conversions
Additional timestamps not needed – no format change
Cost benefits
Users get what they submitted
Archived document is a bitwise clone of the original document
Content reuse
OOXML and ISO: Legal and Business View – Gains
World needs fast, standards based, programmatic access to document content
Past implementations are suboptimal
To fill a bookmark programmatically, we need an instance of winword.exe, excel.exe, …
Server side suffers, not reliable
Client suffers, performance wise
Our tests confirm OOXML access – XML parsing is 2000x faster
Practically no memory footprint, working set drops from 30MB+ to 1MB per processing instance
Programmatic Access: Bookmark Prefill, Headers, Footers
Programmatic Access: What We Do: Office Documents
User opens/reviews/approves a document
Document content is prefilled
Headers are updated
Footers are updated
Bookmarks are updated/filled
Document sections are locked
Gama System® eDocs
Programmatic Access: How We Do It: Office Documents
Generator concept
Our products have plugins, called document generators
They generate / update documents & document content
They run on document templates & instance documents
OOXML generator
Handles format specifics without tool instantiation and automation
Reads / writes to:
Document bookmarks
Header, footer content
Inserts arbitrary styled text content
Inserts images (i.e. handwritten signatures, logos) into bookmarks
Allows bookmark positioning inside Office files
Programmatic Access: How We Do It: Word Example Simplified
Obtain needed bookmarks and values / images
Document system knows which values need to be inserted
Gets appropriate values from the system
Unzips .docx
Checks if bookmarks exist in OOXML
Parses /word/_rels/document.xml.rels. Gets all document parts (headers, footers, footnotes, endnotes)
Parses /word/document.xml, inserts all found bookmarks
Parses /word/headerN.xml
Parses /word/footerN.xml
Fill:
Replaces XML between <w:bookmarkStart w:id="#"/> and <w:bookmarkEnd w:id="#"/>
Zips .docx
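The steps above can be sketched in a few dozen lines of Python using only the standard library. This is an illustrative sketch, not the Gama System generator: the function names (`content_parts`, `fill_part`, `fill_docx`) are hypothetical, it assumes the transitional WordprocessingML namespace, it only handles bookmarks whose start and end markers share the same parent element, and it inserts plain unstyled text. Re-serializing with ElementTree may also rename namespace prefixes other than `w`, which a production tool would have to preserve.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Assumption: transitional OOXML namespaces (strict documents use others).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W)


def content_parts(package):
    """Main document plus header/footer/footnote/endnote parts from the rels file."""
    parts = ["word/document.xml"]
    rels = ET.fromstring(package.read("word/_rels/document.xml.rels"))
    for rel in rels.findall(f"{{{PKG_REL}}}Relationship"):
        kind = rel.get("Type", "").rsplit("/", 1)[-1]
        if kind in {"header", "footer", "footnotes", "endnotes"}:
            parts.append(posixpath.join("word", rel.get("Target")).lstrip("/"))
    return parts


def fill_part(xml_bytes, values):
    """Replace everything between bookmarkStart/bookmarkEnd with a text run."""
    root = ET.fromstring(xml_bytes)
    for parent in list(root.iter()):
        for start in parent.findall(f"{{{W}}}bookmarkStart"):
            children = list(parent)
            name = start.get(f"{{{W}}}name")
            if name not in values or start not in children:
                continue
            first = children.index(start) + 1
            bookmark_id = start.get(f"{{{W}}}id")
            end = next(
                (i for i in range(first, len(children))
                 if children[i].tag == f"{{{W}}}bookmarkEnd"
                 and children[i].get(f"{{{W}}}id") == bookmark_id),
                None,
            )
            if end is None:
                continue  # bookmark spans several paragraphs: out of scope here
            for old in children[first:end]:
                parent.remove(old)
            run = ET.Element(f"{{{W}}}r")
            text = ET.SubElement(run, f"{{{W}}}t")
            text.set(XML_SPACE, "preserve")
            text.text = values[name]
            parent.insert(first, run)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def fill_docx(docx_bytes, values):
    """Unzip, fill bookmarks in all content parts, zip again; other parts are copied."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        targets = set(content_parts(src))
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename in targets:
                data = fill_part(data, values)
            dst.writestr(item, data)
    return out.getvalue()
```

A call such as `fill_docx(template_bytes, {"Title": "Contract"})` returns a new .docx as bytes; no instance of winword.exe is involved, and parts that are not content parts (for example images under /word/media/) are copied through unchanged.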
Workflow Steps: Each Step Needs Access to Content
Creation (editor or scan), Editing, Review, Approval, Archiving
Users expect to only enter information once
Lots of metadata per document
Label, code, classification, authors, signers, reviewers, approvers, custom metadata (external document numbers, costs, …, references), images…
Archiving step needs to be transparent – no format change due to long term storage
90%+ software needs recognition for long term storage
Content prefill is still cumbersome, we need supported tooling
Vendors are reacting, due to past/current stack limitations
OOXML brings promises, but is also complex to parse and support – think readability
Industry should push for high-fidelity formats, not image representations
Conclusion: Formats and future
?
Matevž Gačnik
Microsoft Regional Director
Microsoft MVP – Solutions Architect
CTO
Gama System d.o.o.
http://www.gama-system.si