High Fidelity Programmatic Access to Document...

High Fidelity Programmatic Access to Document Content Standardization and the industry Matevž Gačnik CTO, Gama System Microsoft Regional Director Microsoft MVP Solution Architecture


Page 1: High Fidelity Programmatic Access to Document Contentdownload.microsoft.com/download/C/D/1/CD1AA269-0362... · Document System: Gama System® eDocs Archiving System: Gama System®

High Fidelity Programmatic Access to Document Content

Standardization and the industry

Matevž Gačnik

CTO, Gama System

Microsoft Regional Director

Microsoft MVP – Solution Architecture

Page 2

Company

About Us

Consultants in eDMS space

Solution architects

Devs

Software development and consulting company

Advanced document systems, intensive development

15 years

Privately owned

Own products, development investment

Developing on strategic platforms (Microsoft, IBM)

Page 3

Today

Gama System® Document Line

eDMS, ECM: Gama System® eDocs

eArchive: Gama System® eArchive

Based on Electronic Documents, not scans

Completeness

Innovative (Gama System® eArchive, innovation of 2008 – Slovenia)

Universal usage

Modern technologies, modern demands

Simplicity

Integrability, portal technologies, SOA

Legal validity

Existing user knowledge

Gama System® Document Line

Current Solutions

Page 4

Document System: Gama System® eDocs

Archiving System: Gama System® eArchive

Design

Modular platform, open system, integration points

Standardized interfaces

Modern technology

Users work in Office apps

Microsoft Word

Microsoft Excel

Microsoft PowerPoint, …

Support for Microsoft 2007 Office System

Support for OpenXML format

Default persistence format in Office 2007 installations

Basics

Document Line: Two Products

Page 5

Slovenian / EU legislation...

1.1.2007 - ZVDAGA

Positions an electronic document as an original if:

It is signed by the author with a qualified digital signature and certificate

It is stored and archived by a certified software solution

Needs to maintain validity (the digital signature can be extended by time stamping or re-signing)

Needs to maintain readability (supports format change, reformatting)

Is in a preferred document format

If not, needs a certified, authentic format conversion method – information value needs to be preserved

You can destroy paper documents, no physical archives

Current Legislation

Document Content

Page 6

Slovenian / EU legislation...

Specifically - Common Technology Demands (CTD)

CTD defines technology demands on document and archiving systems for electronic documents

Based on MoReq v1.0/2.0, but formalized and regulated

... says:

For long-term storage, electronic documents with textual and mixed content should be stored in one of the following standard formats:

ISO Latin-1 (ISO 8859-1)

PDF/A (ISO 19005-1)

XML / SGML (ISO 8879)

ODF (ISO/IEC 26300)

OpenXML is (currently) not a preferred format

When CTD was approved, OpenXML was not an ISO standard

Current Legal Issues

Document Content

Page 7

Currently

If you want to archive a document for long-term storage, it must be in a preferred standard document format

Actual meaning? PDF/A.

Even though SGML is an ISO standard, one would probably fail to prove in court that OOXML qualifies as XML

Not everything is XML in OOXML (think /word/media/*)

We are pushing for OOXML to be included in the list

Major drawbacks

Office 2007 penetration in enterprise environments has not yet reached a critical point

Pre-2007 Office document formats are unparsable and cannot be seriously automated

Currently, the majority of documents cannot be stored long term in their native format

Current Legal Issues

Impact and Drawbacks
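The point that not everything in an OOXML package is XML can be checked by listing the parts of a .docx file. The following is a minimal sketch, not part of the original slides; the function name `split_parts_by_kind` is hypothetical, and it assumes that parts ending in `.xml` or `.rels` are XML while everything else (for example files under /word/media/) is opaque binary content.

```python
import zipfile


def split_parts_by_kind(docx_path):
    """Return (xml_parts, other_parts) for an OOXML package.

    docx_path may be a file path or a file-like object.
    """
    xml_parts, other_parts = [], []
    with zipfile.ZipFile(docx_path) as package:
        for name in package.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            # Assumption: .xml and .rels parts are XML, the rest is binary
            if name.lower().endswith((".xml", ".rels")):
                xml_parts.append(name)
            else:
                other_parts.append(name)
    return xml_parts, other_parts
```

Calling `split_parts_by_kind("contract.docx")` on a document with embedded images would typically report the image files in the second list.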

Page 8

OpenXML is now an open standard

Spec is freely available

XML based, easily processed

Converters available

Able to store arbitrary document content

Low impact for out-of-band editing – only need efficient parsing methods

Format is size efficient

Not based on bitmap representation

Fidelity is high, granularity high, parsable

Major benefit in long term archiving systems

OOXML and ISO

Technical View – Gains

Page 9

High fidelity formats preserve user content

Market share of editable content is preserved

Fully capable format

Preserves formatting and document layout

Preserves usability of office software

Supports signatures natively

Format and tooling

Users only work with one tool – for storage and editing

No need for additional tools (in terms of long term storage)

OOXML and ISO

User View – Gains

Page 10

Long term archiving in OpenXML

Digital documents will be stored in OpenXML

Simpler implementation of storage in eDMS and archiving systems

Generating software and e-archive use the same format

No conversions necessary (for legal reasons) – no need for authenticity of conversions

Additional timestamps not needed – no format change

Cost benefits

Users get what they submitted

Archived document is a bitwise clone of the original document

Content reuse

OOXML and ISO

Legal and Business View – Gains
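The claim that the archived document is a bitwise clone of the original can be verified mechanically. The sketch below is an illustration only and is not taken from the slides; the function names `sha256_of` and `is_bitwise_clone` are hypothetical, and SHA-256 is an assumed choice of digest.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_bitwise_clone(original_path, archived_path):
    """True if both files have the same SHA-256 digest."""
    return sha256_of(original_path) == sha256_of(archived_path)
```

Because no format conversion takes place, a check like this is expected to pass for every archived document; after a conversion to another format it would fail by construction.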

Page 11

World needs fast, standards based, programmatic access to document content

Past implementations are suboptimal

To fill a bookmark programmatically, we need an instance of winword.exe, excel.exe, …

Server side suffers, not reliable

Client suffers, performance wise

Our tests confirm that OOXML access through XML parsing is 2000x faster than Office automation

Practically no memory footprint: working set drops from 30 MB+ to 1 MB per processing instance

Programmatic Access

Bookmark Prefill, Headers, Footers

Page 12

Programmatic Access

What We Do: Office Documents

User opens/reviews/approves a document

Document content is prefilled

Headers are updated

Footers are updated

Bookmarks are updated/filled

Document sections are locked

Gama System® eDocs

Page 13

Programmatic Access

How We Do It: Office Documents

Generator concept

Our products have plugins, called document generators

They generate / update documents & document content

They run on document templates & instance documents

OOXML generator

Handles format specifics without tool instantiation and automation

Reads / writes to:

Document bookmarks

Header, footer content

Inserts arbitrary styled text content

Inserts images (i.e. handwritten signatures, logos) into bookmarks

Allows bookmark positioning inside Office files
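Reading bookmarks without instantiating Office can be sketched with a ZIP reader and an XML parser. This is an illustrative sketch under stated assumptions, not the actual generator code: the names `related_parts` and `list_bookmarks` are hypothetical, and the namespace URIs assume a transitional WordprocessingML package.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Assumption: transitional WordprocessingML and package relationship namespaces
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
RELS_PATH = "word/_rels/document.xml.rels"


def related_parts(package, kinds=("header", "footer", "footnotes", "endnotes")):
    """Return part names referenced from document.xml.rels whose
    relationship type ends with one of `kinds`."""
    root = ET.fromstring(package.read(RELS_PATH))
    parts = []
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("TargetMode") == "External":
            continue  # not a part inside the package
        if rel.get("Type", "").rsplit("/", 1)[-1] in kinds:
            target = posixpath.join("word", rel.get("Target"))
            parts.append(posixpath.normpath(target).lstrip("/"))
    return parts


def list_bookmarks(docx_path):
    """Map each part name to the list of bookmark names it contains."""
    found = {}
    with zipfile.ZipFile(docx_path) as package:
        for part in ["word/document.xml"] + related_parts(package):
            root = ET.fromstring(package.read(part))
            found[part] = [
                el.get(f"{{{W_NS}}}name")
                for el in root.iter(f"{{{W_NS}}}bookmarkStart")
            ]
    return found
```

The result is a plain dictionary, so a document system could compare it against the list of bookmarks it expects in a template before attempting any prefill.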

Page 14

Programmatic Access

How We Do It: Word Example Simplified

Obtain needed bookmarks and values / images

Document system knows which values need to be inserted

Gets appropriate values from the system

Unzips .docx

Checks whether bookmarks exist in OOXML

Parses /word/_rels/document.xml.rels. Gets all document parts (headers, footers, footnotes, endnotes)

Parses /word/document.xml, inserts all found bookmarks

Parses /word/headerN.xml

Parses /word/footerN.xml

Fill:

Replaces XML between <w:bookmarkStart w:id="#"/> and <w:bookmarkEnd w:id="#"/>

Zips .docx
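The unzip, fill and zip steps above can be sketched as follows. This is a simplified illustration, not the product's implementation: the names `fill_bookmark` and `fill_docx` are hypothetical, the namespace assumes a transitional package, and only bookmarks whose start and end markers sit in the same paragraph are handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumption: transitional WordprocessingML namespace
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_NS = "http://www.w3.org/XML/1998/namespace"
W = f"{{{W_NS}}}"
ET.register_namespace("w", W_NS)


def fill_bookmark(root, name, text):
    """Replace the content between a bookmark's start and end markers
    with a single text run. Returns True if the bookmark was filled."""
    for para in root.iter(W + "p"):
        children = list(para)
        for i, child in enumerate(children):
            if child.tag != W + "bookmarkStart" or child.get(W + "name") != name:
                continue
            mark_id = child.get(W + "id")
            for j in range(i + 1, len(children)):
                end = children[j]
                if end.tag == W + "bookmarkEnd" and end.get(W + "id") == mark_id:
                    for old in children[i + 1:j]:
                        para.remove(old)  # drop the previous content
                    run = ET.Element(W + "r")
                    t = ET.SubElement(run, W + "t")
                    t.set(f"{{{XML_NS}}}space", "preserve")
                    t.text = text
                    para.insert(i + 1, run)  # lands just before bookmarkEnd
                    return True
            return False  # end marker is elsewhere; not handled in this sketch
    return False


def fill_docx(src, dst, values, part="word/document.xml"):
    """Copy the package from src to dst, filling bookmarks in one part."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == part:
                root = ET.fromstring(data)
                for name, text in values.items():
                    fill_bookmark(root, name, text)
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            zout.writestr(item, data)
```

A call such as `fill_docx("template.docx", "instance.docx", {"Customer": "ACME"})` would fill one bookmark in the main document part; passing a header or footer part name as `part` would target that part instead. One caveat of this sketch: ElementTree rewrites namespace prefixes it has not been told about, so real Word documents with many namespace declarations would likely need a namespace-preserving parser.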

Page 15

Workflow Steps

Each Step Needs Access to Content

Creation (editor or scan) → Editing → Review → Approval → Archiving

Users expect to only enter information once

Lots of metadata per document

Label, code, classification, authors, signers, reviewers, approvers, custom metadata (external document numbers, costs, …, references), images…

Archiving step needs to be transparent – no format change due to long term storage

Page 16

The format used by 90%+ of software needs recognition for long-term storage

Content prefill is still cumbersome, we need supported tooling

Vendors are reacting, due to past/current stack limitations

OOXML brings promises, but is also complex to parse and support – think readability

Industry should push for high-fidelity formats, not image representations

Conclusion

Formats and future

Page 17

?

Matevž Gačnik

Microsoft Regional Director

Microsoft MVP – Solutions Architect

CTO

Gama System d.o.o.

http://www.gama-system.si

[email protected]