NLM CONVERSION TO BUILD “ATOMIC” PHYSICS CONTENT IN AN AGILE FASHION JATS-CON, April 2, 2014 OSA The Optical Society & DCL Data Conversion Laboratory, Inc. 1


OSA: scholarly publisher with 19 current and legacy journals and 300+ conference proceedings

2

OSA Governance: Build more-flexible products and services!

How?

Break 1917-2012 content into “well-polished” atomic pieces following an industry standard

Develop infrastructure to manage and enrich content, to build new products and services in an agile fashion

Budget allocated for five-year strategic plan

3

Some evidence of success

With content converted to NLM XML, we have developed:

Enhanced article: Interactive HTML

Derivative products: ImageBank

Business Intelligence: New insights into author, topic, funding, and other trends

4

Citation data

5

Equation data

6

7

Legacy content (750,000 journal pages)

We expected this . . .

8

This . . . not so much

JOURNAL AS COMIC BOOK

JOURNAL AS SCHOOL YEARBOOK

9

1. Most confusing: articles skipping pages, sometimes in two directions

10

2. Most shocking: legacy PDF not matching legacy print

(Images: print page; legacy PDF for the same article)

11

3. Most pervasive: nonscientific content tacked onto research articles

These are not the authors

12

Project specifications: two extremes

1. Hand the project over to the trusted vendor and be done with it

2. Spend up to a year doing heavy content analysis and spec creation

13

Data Conversion Laboratory

• We convert content from any format to any format.

• Expertise with JATS and most industry-standard DTDs and schemas

• Established in 1981; a pioneer in the data conversion industry

• Over a billion pages converted

• Expertise in complex conversion projects: STM publishing, eBooks, technical documents, educational publishing, and library digitization

• Projects range from one book to entire libraries and legacy collections

• Infrastructure for large-scale projects, with automated tracking, quality assurance, and customer reporting for every item

• Industries include Publishing, Technical Societies, Aerospace, Government, Defense, Health Sciences, Libraries & Universities

• Publisher of DCLNews, a monthly newsletter on XML and electronic publishing topics, sent to 7,000 subscribers

14

Thoughts on Managing a Large Legacy Conversion Effort

1) Phased Approach

2) Flexibility and Collaboration

3) Keep it Simple

4) Keep Monitoring Quality

15

1) Phased Approach

Why?

• Varied sources (PDF, XML, SGML)

• Content that changed over time

• Very large input corpus going back to 1917

• Allow for the quick, phased release of new OSA products

Strategy for OSA materials

• Focus on one source type at a time but keep the big picture in mind

• Convert newest material first

• Review and decide on conversion nuances as they come up

16

Source Material Challenges

XML

• OSA Proprietary DTD

• NLM v2.3 DTD

PDF

• PDF Normal

• PDF Image

SGML

• Multiple DTDs

17

2) Build Flexibility and Collaboration into the Conversion Process

• Develop an overall specification, with allowance for change as new scenarios are uncovered

• Software development sprints to incorporate changes

• Close collaboration with OSA to manage new situations affecting completed work and work in process

18

Tools Used to Retain Flexibility

• Client-vendor collaboration for decision making

• Hub and spoke processing

• Handling of conversion anomalies

• Quality assurance reviews

• Learning databanks

19

3) There’s a Lot of Detail – Keep It Simple

• Fitting structures into the existing JATS tagging structure

• CALS to HTML table conversion

• MathML line break retention

• Cross-reference ranges

• Rendering limitations

• Unexpected content scenarios
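The CALS to HTML table conversion listed above can be sketched roughly as follows. This is a minimal illustration, not the conversion software used on the project: the function name `cals_to_html` is hypothetical, it assumes a single `tgroup`, and it copies cell text only. It maps the CALS `namest`/`nameend` column span to `colspan` and `morerows` to `rowspan`.

```python
import xml.etree.ElementTree as ET

def cals_to_html(cals_table):
    """Convert a simple CALS <table> element into an HTML <table> element.

    Hypothetical helper: assumes one <tgroup> and text-only cells.
    """
    html = ET.Element("table")
    tgroup = cals_table.find("tgroup")
    if tgroup is None:
        return html

    # Map each column name to its position so column spans can be measured.
    col_pos = {}
    for pos, spec in enumerate(tgroup.findall("colspec"), start=1):
        col_pos[spec.get("colname", f"c{pos}")] = pos

    for section, cell_tag in (("thead", "th"), ("tbody", "td")):
        src_section = tgroup.find(section)
        if src_section is None:
            continue
        out_section = ET.SubElement(html, section)
        for row in src_section.findall("row"):
            tr = ET.SubElement(out_section, "tr")
            for entry in row.findall("entry"):
                cell = ET.SubElement(tr, cell_tag)
                cell.text = entry.text  # text only; inline markup is dropped
                # namest/nameend name the first and last spanned columns.
                start, end = entry.get("namest"), entry.get("nameend")
                if start in col_pos and end in col_pos:
                    span = col_pos[end] - col_pos[start] + 1
                    if span > 1:
                        cell.set("colspan", str(span))
                # morerows counts the extra rows a cell extends into.
                more_rows = int(entry.get("morerows", "0"))
                if more_rows > 0:
                    cell.set("rowspan", str(more_rows + 1))
    return html
```

A production converter would also need to carry over inline markup, alignment, and multiple `tgroup` elements, which is where much of the detail mentioned on this slide comes from.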

20

Cross-Reference Ranges

• Bibliographic

• Figure
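As a rough sketch of what a cross-reference range involves: a label such as "3–6" has to be resolved to every target it covers, not only the two endpoints. The helper below is hypothetical. The `b` prefix follows the ref ids shown in the Schematron messages later in the deck ('b10', 'b14'); any figure prefix would be an assumption.

```python
import re

# Matches "3-6" or "3–6" (hyphen or en dash), with optional spaces.
RANGE = re.compile(r"(\d+)\s*[-\u2013]\s*(\d+)")

def expand_range(label, prefix):
    """Return the ids covered by a range label, e.g. '3–6' with prefix 'b'."""
    match = RANGE.fullmatch(label.strip())
    if match is None:
        raise ValueError(f"not a range: {label!r}")
    first, last = int(match.group(1)), int(match.group(2))
    if last < first:
        raise ValueError(f"descending range: {label!r}")
    return [f"{prefix}{n}" for n in range(first, last + 1)]

# A bibliographic range "3–6" covers four references.
print(expand_range("3\u20136", "b"))  # ['b3', 'b4', 'b5', 'b6']
```

How the expanded ids are then written into the XML (one link per id, or a single link carrying several ids) is a tagging decision this sketch does not make.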

21

Rendering Limitations

• No CSS support for table character alignment

(Images: PDF rendering vs. HTML rendering)

22

Unexpected Content Scenarios

• Missing text - Printed page problems

23

Unexpected Content Scenarios (cont.)

• Jumping pages

24

Unexpected Content Scenarios (cont.)

• Special characters with no corresponding Unicode

25

Unexpected Content Scenarios (cont.)

• Non-standard structure

<body><boxed-text>
<sec><title>Optical Activities in Industry</title>
<p>66 Summer Street, North Brookfield, Mass. Mr. Cooke welcomes news and comments for this column which should be sent to him at the above address</p>
<p><inline-graphic xlink:href="ao-8-4-792-i001"/></p></sec>
</boxed-text></body>

26

Unexpected Content Scenarios (cont.)

• White space filler

27

4) Keep Checking Quality – Don’t Get Too Far Ahead

• Visual review

• OSA Schematron

• Reporting stylesheets

• OCR and hyphenation spellchecker software

• QA software

• Learning databanks

28

Visual Review

• Correct entities are used

• Math displays correctly

• Table alignment is accurate

• Images correspond to the source

29

OSA Schematron

• The Schematron includes over 300 checks

Warning:ALERT [LJF:RGCO250]: ref 'b10': unpublished materials must have @publication-type='other' ($unpublished and @publication-type != 'communication' and @publication-type != 'other' / warning) [report]

Warning:ALERT [LJF:JBCO140]: no tables found but title reads 'Figures and Tables' (matches(title, 'Table') and not(exists(table-wrap)) / warning) [report]

ERROR [LJF:RGCO250]: ref 'b14': journal citation contains more than one article-title (count(article-title) > 1) [report]
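To make two of the reported rules concrete, here is a minimal Python sketch of the same tests. It is not the OSA Schematron: the function name `check_article` is hypothetical, and the rule context for the table check (a `sec` element) is an assumption, since the messages show only the test expressions.

```python
import xml.etree.ElementTree as ET

def check_article(root):
    """Return messages for two checks modeled on the reports above."""
    messages = []

    # count(article-title) > 1: a citation should hold one article title.
    for ref in root.iter("ref"):
        for citation in ref:
            if len(citation.findall("article-title")) > 1:
                messages.append(
                    f"ERROR: ref '{ref.get('id')}': journal citation "
                    "contains more than one article-title"
                )

    # matches(title, 'Table') and not(exists(table-wrap)):
    # the title promises tables but no table-wrap child is present.
    # Assumption: the rule context is <sec>.
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="")
        if "Table" in title and sec.find("table-wrap") is None:
            messages.append(
                f"Warning: no tables found but title reads '{title}'"
            )
    return messages
```

Running checks like these on every batch, rather than at the end of the project, is what lets a rule added late still be applied to work already completed.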

30

DCL QA Software

• Highlight any discrepancies between the specifications and the tagging

• Identify suspicious start of a paragraph

• Flag missing external files associated with the XML

• Find missing cross references to specified structures such as Tables and Figures
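Two of these checks can be illustrated with a short sketch. This is one reading of them, not DCL's QA software: the function names are hypothetical, "missing cross references" is taken to mean figures and tables that no `xref` points to, and a delivered file is assumed to match a reference either by full name or by name without extension.

```python
import os
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def uncited_objects(root):
    """Return ids of <fig> and <table-wrap> elements that no <xref> cites."""
    cited = set()
    for xref in root.iter("xref"):
        # rid may list several ids separated by spaces.
        cited.update(xref.get("rid", "").split())
    missing = []
    for tag in ("fig", "table-wrap"):
        for element in root.iter(tag):
            object_id = element.get("id")
            if object_id and object_id not in cited:
                missing.append(object_id)
    return missing

def missing_files(root, delivered_names):
    """Return xlink:href values with no matching delivered file name."""
    known = set(delivered_names)
    known |= {os.path.splitext(name)[0] for name in delivered_names}
    missing = []
    for element in root.iter():
        href = element.get(XLINK_HREF)
        if href and href not in known:
            missing.append(href)
    return missing
```

Both functions return plain lists so the results can feed the per-item tracking and reporting described earlier.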

31

Hyphenation Spellchecker

32

Reporting Stylesheets

• Provides easier review of metadata components for a set of articles

33

OCR Tools

• Modified versions of fonts, designed to help distinguish between similar-looking characters (“O” vs “0”, “Z” vs “2”, “1” vs “l”), used in the proofreading phase

34

Learning Databanks

Ongoing updates, based on feedback and newly determined rules and structures, are made to:

• Conversion software

• QA software

• Schematron

• Spellchecker and hyphenation software

• Editorial guidelines

• Image creation

35

Conclusions

OSA has nearly completed a large backfile conversion project in close coordination with DCL. The project, which is based around NLM markup, has allowed OSA to enhance its publishing platform, build derivative products, and significantly improve its ability to gather business intelligence from a deep journal backfile. We offer the following lessons learned:

• With large content projects, plan ahead but prepare to work in an agile fashion

• The content owner should stay engaged throughout the project to align real-time decisions with business aims

• Owner–vendor collaboration, when the right partners are involved, improves morale, attention to detail, and decision-making

36

Scott Dineen, Sr. Director, Publishing Production & Technol.

The Optical Society

[email protected]

Devorah Ashlem, Senior Project Manager

Data Conversion Laboratory

[email protected]

37