
WHITE PAPER your PDF to data extractor operating with 100% accuracy

V03


CONTENTS

Introduction

What Is xTractor

xTractor Competitive Advantage

Compared To OCR Solutions

Compared To PDF Specific Solutions

xTractor Architecture

xTractor Deployment

Deployment Options

On-Premises Scenario

Cloud Scenario

Hybrid Cloud Scenario

Creating Rule Definition Sets (RDSes)

Integration Options

Putting It All Together: The Deployment Process

Limits

Support, Error Handling

Frequently Asked Questions (FAQ)

General

Rule Definition Set (RDS)

Deployment

Input Format Support

Additional Information

Disclaimer

xTractor® White Paper Your pdf to data extractor operating with 100% accuracy

EU Office: 149/7 Thámova st, 18600 Prague, Czech Republic, +420 776 434 884

US Office: 8 Faneuil Hall Market Pl, 3rd Fl, Boston, MA 02109, USA, +1-888-228-8611

Web: www.x-center.eu, www.x-center.org, www.x-tractor.net

INTRODUCTION

Businesses today have to respond much more quickly to changing customer and market environments. As new demands challenge the status quo, expectations are greater and the stakes are higher.

New-breed IT solutions have to be able to respond instantly to market needs. They also have to offer definite and fully automated solutions. An approach that merely offers solutions requiring validation by an operator is no longer sufficient.

The time has come for solutions operating with 100% accuracy.

The PDF format dominates the electronic document market. PDF is in all probability the format landing in your mailbox or information system: invoices, purchase orders and many other documents. Its advantages are well known to business users, clients and even the general public. The properties of this format, however, also pose certain limitations.

Data extraction was never part of the original PDF functionality. When created, PDF was designed as a fixed-layout, output-only format that users can look at or print. Today, 25 years after its origin, automated solutions producing documents in PDF format are commonplace. With PDF by Adobe being the most common document format, there is understandably a strong market demand for data extraction from PDF documents. OCR/ICR solutions available on the market are not an option, as they lack precision and require manpower to validate the extracted data.

Introducing xTractor, the solution extracting data from PDF documents with 100% accuracy.

Do you want to extract invoice details from PDF? Do you need to match the extracted data to a SAP vendor ID or company code?


Do you need to export the extracted data to UBL or cXML, or insert it into your SAP system? Would you find it useful to save the input PDF on an ArchiveLink server or in SharePoint, or to generate an HL7 message and send it to a medical system? This, and much more, can be achieved with xTractor, your PDF to data solution.

xTractor has been designed and produced by X-Center Ltd., an international IT company focusing primarily on solutions for a full range of tasks related to circulation of electronic documents. xTractor® is a registered trademark.

This document provides a general understanding of xTractor, presenting insights into xTractor Architecture and xTractor Deployment.

WHAT IS XTRACTOR

xTractor extracts structured data from PDF documents. It is able to master the text layer contained in any PDF document, including purchase & payment orders, medical documentation, delivery slips, insurance forms, and many more document types.

PDF data is extracted using a rule definition set (RDS), which is defined on a per-PDF-layout basis. The RDS describes the requested output; it is the principal element shaping both the appearance and content of the final product. The more rules the RDS contains, the more specific the outcome is. As long as the input is textual PDF and the requested output is structured data, xTractor is a solution superior to anything else currently available on the market.

Content-wise, an invoice is a relatively stable document. It always includes an issue date, vendor identification, amount payable, etc. However, each vendor uses a specific page layout, and there is no generally agreed, fixed location on the page where a given type of information is located.

An example would be the extraction of the issue date from PDF invoices.

Some vendors position the issue date in the top right corner, while others prefer placing it in the center of the page, or perhaps at the bottom.

Each vendor may label the issue date with different text.

The spatial relationship between the label and the value also varies a great deal. The date format depends on the vendor's conventions (mainly the country of origin), so the same date can be written in several different ways.


The situation grows even more complicated when it comes to line items. There are invoices containing neat rectangular tables with one line per item, but you will also come across line items with varying numbers of textual lines, irregular tables, line items split between pages, etc. No wonder data extraction can be such a headache!

With xTractor, exact rules for locating the invoice issue date are specified. Described in words, the rule for solving this task would read roughly as follows: find the issue date label in a left quadrant of the page, take the text right of it and treat it as a date in the US/English format.

xTractor knows how to obtain the date and it is always extracted as you would expect.
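A rule of this kind can be pictured in a few lines of Python. This is only an illustrative sketch, not xTractor's RDS format or algorithm, which this paper does not disclose. The `Word` type, the label text "Invoice date:", the choice of the top left quadrant, the 2-point line tolerance and the `%m/%d/%Y` date format are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional


@dataclass
class Word:
    """A piece of text from the PDF text layer with its position (hypothetical type)."""
    text: str
    x: float  # left edge, in points from the left of the page
    y: float  # top edge, in points from the top of the page


def extract_issue_date(words: List[Word], page_width: float, page_height: float,
                       label: str = "Invoice date:",
                       date_format: str = "%m/%d/%Y") -> Optional[date]:
    """Find the label in the top left quadrant, take the text right of it, parse it as a date."""
    for w in words:
        in_quadrant = w.x < page_width / 2 and w.y < page_height / 2
        if not (in_quadrant and w.text == label):
            continue
        # Candidates: words on the same line (within 2 pt) and to the right of the label.
        right = [c for c in words if abs(c.y - w.y) <= 2.0 and c.x > w.x]
        if not right:
            return None
        nearest = min(right, key=lambda c: c.x)
        try:
            return datetime.strptime(nearest.text, date_format).date()
        except ValueError:
            return None
    return None


# Made-up sample page (US Letter, 612 x 792 pt).
page = [
    Word("ACME Corp.", 40, 30),
    Word("Invoice date:", 40, 120),
    Word("12/31/2017", 130, 120),
    Word("Total:", 400, 700),
]
issue_date = extract_issue_date(page, page_width=612, page_height=792)
print(issue_date)  # 2017-12-31
```

Because the rule names the label, the region and the date format explicitly, the result is either the expected date or no value at all; nothing is guessed.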

xTractor uses proprietary algorithms based on X-Center original research for line item extraction.

Defining an RDS is a one-time investment. RDSes are defined using a GUI desktop application. The process typically takes about 0.5-2.5 hours, depending on the complexity of the given PDF. The customer usually provides sample PDFs and specifies further requirements, and the X-Center consultants take care of the rest.

As soon as the RDS has been defined, the job is virtually completed and you begin receiving your data.

Namely, you are able to do the following:

Extract structured data from PDF documents and process it like any other structured data.

Apply integration options such as exporting to electronic invoice formats (UBL, cXML), exporting the source PDF and the extracted data to SAP, other ERP systems or even health information systems, saving documents to ArchiveLink storage, converting between various document formats, and collecting emails and attachments from multiple mailboxes.

xTractor integration options encompass, however, much more.

Consider this example: migration from a legacy hospital information system to a new system. The migration is based on a PDF virtual printer because it is the easiest way to obtain data from the old system.

To migrate old data to the new system, xTractor extracts data such as the patient ID, department ID, document type, etc. from the PDF documents and sends the information into the new system using the standard HL7 protocol.
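As a rough illustration of the last step, the sketch below assembles an HL7 v2 style message from three extracted values. It is not taken from xTractor or xDPRO. The message type `MDM^T02`, the version `2.5`, the application and facility names, and the placement of the department ID in `PV1-3` and the document type in `TXA-2` are assumptions, and the message is deliberately incomplete: a real document message carries further segments as well as the document content.

```python
from datetime import datetime


def hl7_escape(value: str) -> str:
    """Escape HL7 v2 delimiter characters inside a field value."""
    value = value.replace("\\", "\\E\\")  # the escape character itself goes first
    for char, code in (("|", "\\F\\"), ("^", "\\S\\"), ("&", "\\T\\"), ("~", "\\R\\")):
        value = value.replace(char, code)
    return value


def build_document_message(patient_id: str, department_id: str, document_type: str,
                           control_id: str, timestamp: datetime) -> str:
    """Build a minimal, illustrative HL7 v2 message; segments end with a carriage return."""
    ts = timestamp.strftime("%Y%m%d%H%M%S")
    segments = [
        # Sender and receiver names are placeholders.
        f"MSH|^~\\&|XTRACTOR|LEGACY_HIS|NEW_HIS|HOSPITAL|{ts}||MDM^T02|{control_id}|P|2.5",
        f"PID|1||{hl7_escape(patient_id)}",
        f"PV1|1|U|{hl7_escape(department_id)}",
        f"TXA|1|{hl7_escape(document_type)}",
    ]
    return "\r".join(segments) + "\r"


message = build_document_message("P-1001", "CARD", "DS", "MSG0001",
                                 datetime(2018, 1, 15, 9, 30, 0))
print(message.replace("\r", "\n"))
```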

These, and many other, integration options are enabled by means of additional solutions working in sync with xTractor. You can find out more about their architecture and functions in the chapter xTractor Architecture below.


XTRACTOR COMPETITIVE ADVANTAGE

COMPARED TO OCR SOLUTIONS

Most companies use OCR solutions (e.g. Kofax Transformation Modules, ABBYY FineReader, IRIS) to extract data from digital documents.

It is no exaggeration to state, however, that xTractor has clear advantages over OCR-based solutions:

OCR solutions use guesswork on a character recognition level: for instance, the number 1 vs. the lowercase letter L, zero vs. the uppercase O, H vs. K, etc. This is mitigated to some extent by dictionaries, but unfortunately the data we want to extract, such as the invoice number or the shipping reference, is usually not in the dictionary. Since xTractor uses the letters stored in the PDF itself, no character-level guesswork is involved.

OCR solutions use guesswork on a context level. For instance, somewhere close to (what might be) the word "Invoice" there is, with great probability, the invoice number. In some cases the guess is correct; in others it is not. Since xTractor uses precise extraction rules, the context is not merely guessed at.

Line item extraction is a notorious headache with OCR solutions. xTractor uses proprietary algorithms based on years of experience with OCR malfunctions and limitations. As a result, xTractor can extract line items with unprecedented accuracy, including cases when a line item has an irregular layout, contains a variable number of text lines, spans across pages, etc.

OCR results are accurate in 60-80% of cases. As a result, extraction results have to be checked by operators, leading to high costs. xTractor can offer 100% extraction accuracy, essentially eliminating the need to have operators validate the extracted data.

OCR solutions take a long time to process documents. xTractor is fast: single-page documents are typically extracted in less than one second.


COMPARED TO PDF SPECIFIC SOLUTIONS

PDF-specific solutions operate on letters stored in PDF documents. Once again, xTractor surpasses its competitors by far:

xTractor is the only solution with guaranteed 100% accuracy.

Thanks to the xDPRO platform, xTractor offers a wide range of integration options, including interfacing with SAP and other ERP systems, ArchiveLink archiving solutions, databases, medical systems, etc.

xTractor is at hand. It can operate in a cloud, on-premises, or in a hybrid cloud mode.

Other PDF-specific solutions may also be based on extraction rules, much like xTractor. These rules are defined, however, in a programming language, such as the non-procedural language Prolog. Such an approach is not user friendly: programming in an exotic language is both time-consuming and error-prone. By comparison, xTractor rules are created in the comfortable xTractor Designer GUI, which makes rule design far more user friendly and time-effective. xTractor Designer is also much quicker to learn than the programming approach.

XTRACTOR ARCHITECTURE

As mentioned above, xTractor offers a wide range of integration options by means of integration components. xTractor is in fact an umbrella term for the following:

xTractor Designer, a GUI application for RDS (Rule Definition Set) creation.

xTractor Server components for xDPRO, the server-side part of xTractor running within the xDPRO framework.

xDPRO (X-Center Document Processor), a modular document processing platform, the cornerstone of many solutions made by X-Center. It is able to process various document types using different types of actions, such as inserting a document into SAP, saving document information into a database, converting between various file formats, etc.

In addition, X-Center projects often use the following in-house components:

EmailFetcher, a utility which collects emails from multiple mailboxes, extracts and converts attachments and sends them to target systems, including xDPRO.

xMon, a highly modular platform for monitoring systems and processes. It provides monitoring on all levels, from low-level OS details to SAP workflow statistics. The list of products xMon is able to monitor includes the following: xDPRO, Open Text Archive Server, Open Text Content Server, WMD xFlow for SAP (incl. reporting), WMD xFlow for Windows (incl. reporting), OpenText Document Pipeline, EmailFetcher, xArchive, OpenText Invoice Capture Center (ICC) Server, OCR solutions and many other systems and solutions.

xArchive, a robust document storage solution based on the ArchiveLink standard. It is a sophisticated tool for, most importantly, reducing database size. It is also very useful in SAP HANA migration, archiving invoices, outgoing documents and reports, etc.

DMS Online, a cloud-based DMS system (also works on-premises) for user-friendly and effective management of documents stored on reliable encrypted servers. The user interface contains easy-to-use customizable dashboards. Its functionality is fully extensible.


XTRACTOR DEPLOYMENT

DEPLOYMENT OPTIONS

xTractor can be deployed on your systems or offered as a cloud service. There are three possible scenarios for its deployment:

ON-PREMISES SCENARIO

xTractor (powered by xDPRO) is deployed at the customer site. xDPRO manages PDF acquisition, extracts data, performs any project-specific actions and sends the results to any target systems.

Example: a customer places PDF files into the xTractor input folder. xTractor performs the extraction, uses fuzzy matching to recognize SAP vendor ID, archives PDF to SAP ArchiveLink Server and sends XML to SAP using an RFC call.

Advantages: the data remains completely on the customer site; all integration options are available.
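The fuzzy matching step in the example above can be pictured as follows. This is a minimal sketch using Python's standard `difflib` module, not the matching logic xDPRO actually uses, which this paper does not describe. The vendor IDs, vendor names and the similarity cutoff of 0.8 are made up for illustration.

```python
import difflib
from typing import Dict, Optional

# Hypothetical vendor master data: SAP vendor ID -> vendor name.
VENDORS: Dict[str, str] = {
    "0000100234": "ACME Industrial Supplies Ltd.",
    "0000100871": "Northwind Traders Inc.",
    "0000102290": "Globex Corporation",
}


def match_vendor_id(extracted_name: str, vendors: Dict[str, str],
                    cutoff: float = 0.8) -> Optional[str]:
    """Return the vendor ID whose name is most similar to the extracted name, or None."""
    by_name = {name.lower(): vendor_id for vendor_id, name in vendors.items()}
    hits = difflib.get_close_matches(extracted_name.lower(), list(by_name),
                                     n=1, cutoff=cutoff)
    return by_name[hits[0]] if hits else None


matched = match_vendor_id("Acme Industrial Supplies Limited", VENDORS)
unmatched = match_vendor_id("Unknown Vendor GmbH", VENDORS)
print(matched, unmatched)
```

Returning no match below the cutoff, rather than the nearest candidate, lets such documents be routed to an error queue instead of being posted against the wrong vendor.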

CLOUD SCENARIO

xTractor is deployed in a cloud, typically in Microsoft Azure. The customer sends PDF files using a web service interface and receives the resulting files (XML, PDF and/or other). All the integration tasks are to be implemented by the customer.

Example: the customer sends a PDF to the xTractor cloud service using the web services API and receives an electronic invoice in UBL or cXML format.

Advantages: rapid deployment; no HW cost.

Disadvantages: the data leaves the customer site (over the secure HTTPS protocol); integration options are limited.
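To give a feel for the UBL output mentioned in the cloud example, the sketch below turns a handful of extracted fields into a UBL-style invoice fragment using Python's standard library. The field names in the input dictionary are hypothetical, and the result is intentionally not a schema-valid UBL invoice: mandatory parts such as the supplier and customer parties and the invoice lines are left out. It does not represent the output actually produced by xTractor.

```python
import xml.etree.ElementTree as ET

# UBL 2.x namespaces.
UBL_NS = "urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
CBC_NS = "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2"
CAC_NS = "urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"


def to_ubl_fragment(fields: dict) -> str:
    """Serialize a few extracted invoice fields as a partial UBL Invoice document."""
    ET.register_namespace("", UBL_NS)
    ET.register_namespace("cbc", CBC_NS)
    ET.register_namespace("cac", CAC_NS)
    invoice = ET.Element(f"{{{UBL_NS}}}Invoice")
    ET.SubElement(invoice, f"{{{CBC_NS}}}ID").text = fields["invoice_number"]
    ET.SubElement(invoice, f"{{{CBC_NS}}}IssueDate").text = fields["issue_date"]
    total = ET.SubElement(invoice, f"{{{CAC_NS}}}LegalMonetaryTotal")
    payable = ET.SubElement(total, f"{{{CBC_NS}}}PayableAmount",
                            currencyID=fields["currency"])
    payable.text = fields["amount_payable"]
    return ET.tostring(invoice, encoding="unicode")


# Made-up extraction result.
extracted = {
    "invoice_number": "INV-2017-00042",
    "issue_date": "2017-12-31",
    "currency": "EUR",
    "amount_payable": "1250.00",
}
ubl_xml = to_ubl_fragment(extracted)
print(ubl_xml)
```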


HYBRID CLOUD SCENARIO

The xDPRO framework is deployed at the customer site. It manages PDF acquisition, sends the PDF to the cloud, gets the extracted data back, then sends the results to any of the target systems.

Example: the customer places PDF files into the xTractor input folder. xDPRO takes each PDF and sends it to the xTractor cloud service. The xTractor cloud performs the extraction and returns the result to the on-premises xDPRO. The on-premises xDPRO then uses fuzzy matching to recognize the SAP vendor ID, archives the PDF to the SAP ArchiveLink Server and sends XML to SAP.

Advantages: all the integration options are available; no xTractor maintenance is needed.

CREATING RULE DEFINITION SETS (RDSES)

As discussed previously, document extraction requires an RDS (Rule Definition Set), which is defined on a per-document-layout basis.

The procedure of RDS creation is relatively simple. The customer typically provides sample PDFs for each vendor and specifies the type of data to be extracted. X-Center consultants subsequently create the RDS.

RDS creation usually takes about 0.5-2.5 hours per document layout (for instance, per invoice vendor), depending on the complexity of the input PDF files. The free initial deployment package includes the creation of several RDSes.

Once an RDS is defined, it can be used for data extraction indefinitely, until the format or layout of the input PDF files changes. While changes in the input document layout are not all that common, they do occur from time to time. In this case the customer has two options: either allow the X-Center team to modify the RDS as needed, or assume this role itself. Each customer is granted access to xTractor Designer to be able to modify the RDS. X-Center consultants are available to provide proper training to the authorized personnel who will operate xTractor Designer.

The training has the following prerequisites on the part of such personnel: a general knowledge of the structure of the document type being processed (for instance invoices), basic technical skills and a knowledge of basic regular expressions. The training session usually takes about 16 hours (two working days) in total. Given the fact that needs usually vary from customer to customer, please contact our X-Center sales support team to confirm this information.
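For orientation, "basic regular expressions" means patterns of roughly the following complexity. The example is a generic Python sketch; the label variants and the invoice number format are invented, and the way xTractor Designer itself uses regular expressions is not described in this paper.

```python
import re
from typing import Optional

# Hypothetical pattern: a label ("Invoice No.", "Invoice Number", "Invoice #"),
# an optional colon, then a number shaped like INV-2017-00042.
INVOICE_NO = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z]{2,4}-\d{4}-\d{3,6})",
    re.IGNORECASE,
)


def find_invoice_number(text: str) -> Optional[str]:
    """Return the first invoice number found in the text, or None."""
    match = INVOICE_NO.search(text)
    return match.group(1) if match else None


print(find_invoice_number("Invoice No.: INV-2017-00042  Due 01/31/2018"))
print(find_invoice_number("Delivery note 55"))
```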

X-Center reserves the right to reuse any RDS created by our consultants for internal purposes, including the provision of the RDS to other customers, provided that the RDS does not contain any customer-specific information. The reuse concept makes it possible to reduce the overall RDS creation costs.

INTEGRATION OPTIONS

The server side of xTractor runs within the xDPRO Document Processing framework. Its structure resembles Lego, the famous construction toy, in which different pieces can be used for different purposes. xTractor Server Components are just like these pieces, although structurally speaking quite large ones.

Existing xDPRO components include XML transformations, file input/output, (S)FTP, emailing, calling SAP RFCs, archiving to ArchiveLink servers (including ones from OpenText, ECM and IBM), executing DB statements, sending and receiving messages in HL7 format (for health-care applications), conversion between file formats and more. If there is no component available for a given purpose, the X-Center development team can prepare it at short notice.
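The building-block idea can be sketched as a chain of small functions, each taking a document and handing an enriched document to the next. This is only a conceptual illustration: the component names, the dictionary-based document and the `run_pipeline` helper are invented here and say nothing about how xDPRO components are actually implemented or configured.

```python
from typing import Callable, Dict, List
from xml.sax.saxutils import escape

Document = Dict[str, object]
Component = Callable[[Document], Document]


def read_input(doc: Document) -> Document:
    """Stand-in for a file input component."""
    return {**doc, "source": "input-folder"}


def extract_fields(doc: Document) -> Document:
    """Stand-in for the extraction component; returns fixed sample data."""
    return {**doc, "fields": {"invoice_number": "INV-2017-00042"}}


def to_xml(doc: Document) -> Document:
    """Stand-in for an XML transformation component."""
    fields = doc["fields"]
    body = "".join(f"<{name}>{escape(str(value))}</{name}>" for name, value in fields.items())
    return {**doc, "xml": f"<invoice>{body}</invoice>"}


def run_pipeline(doc: Document, components: List[Component]) -> Document:
    """Pass the document through each component in order."""
    for component in components:
        doc = component(doc)
    return doc


result = run_pipeline({"file": "invoice.pdf"}, [read_input, extract_fields, to_xml])
print(result["xml"])  # <invoice><invoice_number>INV-2017-00042</invoice_number></invoice>
```

Swapping `to_xml` for, say, an archiving or emailing step changes the solution without touching the other pieces, which is the point of the Lego comparison.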

PUTTING IT ALL TOGETHER: THE DEPLOYMENT PROCESS

As a general rule, a few facts have to be known before the deployment:

the document type (i.e. which set of fields has to be extracted),

the expected document volume,

representative sample documents for each layout (for example, each vendor),

the desired output format (for example, XML, UBL, cXML), and

the integration options to be used (for example, sending the document and the extracted data into an SAP ERP system).

Once this information has been provided to the X-Center team, the implementation process may begin.

The very basic xTractor deployment is simple. Consider the following situation:

interfacing is limited to Windows shared folders (or the cloud web services API),

the document type is invoice (plus debit note, credit note), and

the output format is XML (xTractor-specific), UBL or cXML.

In this case, the entire deployment task is virtually covered by the free initial deployment package, with the possible exception of any Rule Definition Sets (RDSes) that have to be created beyond those included in the package.


A more demanding deployment, in contrast, can add a number of steps. Processing car insurance agreement forms might require, for example, defining a new set of fields to be extracted (a field model). It can also involve integration options such as importing documents from mailboxes or (S)FTP folders, archiving to an ArchiveLink server, identifying the SAP vendor code, sending data to an ERP system, etc. In such cases the deployment turns into a small project. The free initial deployment package still covers some of the effort, of course, but additional work on top of the package is likely to be required.

Each customer is welcome to specify further requirements which can be solved either by our library of existing components or by custom development. All X-Center professionals are available to assist.

LIMITS

X-Center designers use their years of experience in the area of document management, OCR and integration projects to fine-tune xTractor to perfection. The limits of xTractor should also, however, be discussed.

Input PDFs have to be textual, meaning that the PDF contains letters as letters, not as pictures. Identifying a textual PDF is simple: open the PDF document in Adobe Reader, select a few letters and try to copy and paste them. If your attempt is successful, go ahead and zoom in on the letters. If the letters are jagged or blurry, the PDF is most likely not textual. If the letters look clear even at large magnification (and you can copy and paste the content), the PDF is textual.

No scanned PDFs. In some cases, PDFs are made of scanned images (pictures). Technically speaking, X-Center can integrate an OCR solution into the process and xTractor might even be able to extract most of the data, but the reliability of such a solution is doubtful at best. We advise against this approach. xTractor works great with textual PDF documents. For scanned documents, X-Center offers a range of OCR solutions.

No scanned PDFs with an OCR text layer. In some cases, PDFs are composed of scanned images (pictures) AND a text layer produced by OCR software (letters from this layer can be copied and pasted from the PDF). The text layer provided by OCR software is generally of poor quality, and the "No scanned PDFs" statement above applies.


SUPPORT, ERROR HANDLING

X-Center is committed to providing three levels of customer support:

1. Bug fixing. If the customer reports a product bug, the case is investigated and the defect fixed. Resolution of document errors can be provided by X-Center on a time and materials (T&M) basis.

2. Bug fixing plus Monitoring. The health status of the solution, such as its availability and the presence of documents in the error queue, is monitored. The customer can view relevant information on a web-based dashboard (xMon solution; deployed in the cloud or on-site). The customer is notified by email whenever the system encounters a warning or error state. Resolution of document errors is provided on a time and materials (T&M) basis.

3. Managed Services, optional SLA. X-Center manages the solution either in the customer environment or in the cloud. This level includes bug fixing, proactive maintenance of the solution, upgrades to the latest version and resolution of documents in the error state.

FREQUENTLY ASKED QUESTIONS (FAQ)

GENERAL

Q: What is xTractor? A: xTractor is a software solution that extracts structured data from PDF documents.

Q: Can xTractor process picture-based data? A: No, it cannot. It is only able to master the text layer type of data. Customers are welcome to contact the X-Center team to discuss the range of OCR solutions provided.

Q: Can xTractor extract all data present on a PDF form? A: Yes, xTractor is 100% accurate provided that the Rule Definition Set has been properly defined.

RULE DEFINITION SET (RDS)

Q: What is RDS? A: RDS stands for Rule Definition Set. It defines where, how and what has to be extracted from a document.

Q: What needs to be specified in order to create an RDS? A: The document type (e.g. invoice), the data fields to be extracted (e.g. the standard invoice set: date of issue, due date, total amount, vendor ID, etc.), plus a set of representative documents (for example, single page, multiple pages, with and without a table).

Q: Do I have to provide a sample document? A: It depends. An RDS can only be created and tested using sample documents. Customers who are concerned about providing sample documents to X-Center are welcome to have their personnel trained and create RDSes on their own.

Q: What does the free initial deployment package contain? A: xTractor deployment on-site or in the cloud, creation of RDSes for several document layouts (e.g. several invoice vendors), configuration of an XML, UBL or cXML output scenario and several hours of consulting/integration services. Please contact the X-Center sales team to find out more about the content of the most recent free deployment package.

Q: Does every customer have access to RDSes and tools to modify them? A: Yes, if the customer wishes to proceed with in-house modification of RDS, X-Center provides full access to xTractor Designer app. It is strongly recommended, however, that each member of personnel undergo xTractor training.

Q: How long does it take to train an xTractor Designer operator? A: Usually 16 hours (2 working days). For more information, contact the X-Center team.


DEPLOYMENT

Q: On-premises scenario available? A: Yes.

Q: Cloud scenario available? A: Yes.

Q: Hybrid cloud scenario available? A: Yes.

INPUT FORMAT SUPPORT

Q: Which formats does xTractor support? A: PDF files with textual content.

Q: Can xTractor extract pictures from PDFs? A: No, the content has to be textual.

Q: Can xTractor process scanned PDFs? A: No. Scans are picture-based documents: they contain images (including pictures of letters), not the letters themselves.

ADDITIONAL INFORMATION

To learn more, contact your X-Center representative or visit us at www.x-center.org or www.x-tractor.net.

DISCLAIMER

This presentation outlines our general product direction. This presentation is not subject to your license agreement or any other agreement with X-Center Ltd. X-Center Ltd. has no obligation to pursue any course of business outlined in this presentation or to develop or release any functionality mentioned.

This presentation and X-Center's possible future developments are subject to change and may be changed at any time for any reason without notice.

This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or noninfringement. X-Center assumes no responsibility for errors or omissions in this document, except if such damages were caused by X-Center intentionally or through gross negligence.