Request for Tender (RFT) PART B: STATEMENT OF REQUIREMENTS · SOR OCR Statement of Requirements...


SOR OCR Statement of Requirements v1.0 2017-11-03.docx © State Library of New South Wales – 2017 Page 1 of 35

Request for Tender (RFT)

PART B:

STATEMENT OF REQUIREMENTS

Supply Panel for Digitisation of Heritage Materials

Establishment of Category 3 – Optical Character Recognition (OCR) and Related Services

PROJECT No. SLNSW_2017006

Annexures

A – Project Status Report
B – Quality Assurance Certificate
C – Deliverables Acceptance
D – Example of Collection Metadata Spreadsheet to be supplied by the SLNSW
E – Example of the Collection Metadata Spreadsheet to be returned by the Respondent
F – OCR Pilot Test
G – Inventory Acceptance
H – Rosetta METS file and SIP structure example


DOCUMENT CONTROL

Document Version: 1.0

TRIM File No:

Date: 03/11/2017

Status: Final

Confidentiality:

Name Department & Job Title Date Approved Approval Details

Scott Wajon Manager, Digitisation & Imaging

Release


Table of Contents

1. TECHNICAL REQUIREMENTS
1.1. Overview
1.2. Objectives
1.3. Source Images of Collection Material
1.4. Inventory
1.5. Location and Environment
1.6. Data Security and Management
1.7. Work Health and Safety
1.8. Environmental Management
1.9. Hardware and Software
1.10. Project Management

2. CARE AND HANDLING OF DIGITAL MATERIALS
2.1. Delivery of the Digital Materials
2.2. SOURCE IMAGE
2.3. SOURCE External Metadata (CSV)
2.4. Inventory Requirements
2.5. Storage and Production Facilities

3. SPECIFICATIONS
3.1. Deliverables
3.2. Optical Character Recognition (OCR)
3.3. SCREEN JPEG file specification
3.4. METS and ALTO files
3.5. PDF Specification

4. METADATA
4.1. External Metadata (CSV)
4.2. Embedded Metadata
4.3. SCREEN JPEG metadata
4.4. PDF metadata

5. QUALITY ASSURANCE

6. FILE NAMING AND FOLDER STRUCTURES
6.1. File and Folder Naming Schema
6.2. File Directory Structure – METS, ALTO, SCREEN JPEG & PDF/A Deliverables
6.3. Tools for Creating and Validating Bags
6.4. File Naming and Digital ID Schema

7. SERVICE LEVELS
7.1. Services, responsibility and measure

8. DELIVERABLES
8.1. Matrix of deliverables mapped against each service

9. ACCEPTANCE OF DELIVERABLES
9.1. Quality Assurance Process
9.2. Quality Acceptance Level for Deliverables

10. DEFINITIONS AND ACRONYMS

11. STANDARDS REFERENCES & VERSION REQUIREMENTS

12. LIST OF ANNEXURES
Annexure A – Project Status Report for completion by the Respondent
Annexure B – Quality Assurance Certificate for completion by the Respondent
Annexure C – Deliverables Acceptance Criteria for completion by the SLNSW
Annexure D – Example of supplied Collection Metadata Spreadsheets (CSVs) for completion by the Respondent
Annexure E – Example of completed Collection Metadata Spreadsheet (CSV) to be returned by the Respondent
Annexure F – OCR Pilot Test
Annexure G – Goods Shipment Receipt
Annexure H – Rosetta METS file and SIP structure example


1. TECHNICAL REQUIREMENTS

This document sets out the minimum requirements of the State Library of New South Wales (SLNSW) in relation to Optical Character Recognition (OCR) and Related Services. This document is the key technical document that will govern this tender process and future quotation processes.

The State Library is seeking to identify suitably qualified and experienced companies to provide optical character recognition (OCR) services for digitised materials from the Library’s collection.

As one part of the State Library's assessment process, respondents will be given a test batch of digitised items (the SOURCE IMAGES). The Respondent will be required to OCR all files in the test batch, deliver specified cropped SCREEN JPEG files, METS / ALTO and PDF/A files, as well as any administrative and technical metadata requested. Annexure F sets out the instructions for “OCR Pilot Test”.

The OCR Services (the “Deliverables”) required by the Library include:

The creation of cropped and de-skewed SCREEN JPEGs from SOURCE IMAGES that conform to the specifications provided by the SLNSW

The creation of OCR from SCREEN JPEGs

The creation of METS, Rosetta METS and ALTO files that conform to specified schemas and reference the SCREEN JPEGs

Addition of specific metadata to the Rosetta METS, METS and ALTO files

The creation of searchable PDF/A for each book/serial/item that conforms to the standards provided by the SLNSW for specific administrative metadata

Quality Assurance of the Deliverables produced from the services described in Section 5 of this document

File transfer of the Deliverables back to SLNSW

Retention of a copy of the digitised files until the integrity and quality acceptance of the transferred files has been confirmed by SLNSW

Project management for the term of the contract

The Digitised Collection Materials identified in this SOR consist of a selection of Printed Material including books, pamphlets, catalogues, serials, ephemera and newspapers.

The SOURCE IMAGES are JPEG files that are used as the entry point for all cropping and OCR processes.

The majority of source materials will be in English, with a small number in other languages. In a small number of cases special fonts may be present, e.g. Gothic.

The digitised collection materials were created from the following physical collection materials:

1.3.1. Books

Bound printed books ranging in physical size from 15 cm to 60 cm (height) and up to 500+ pages. Examples include:

o Bound books consisting of pages in standard book formats in hardcover and softcover
o Books that (primarily or solely) consist of pages of printed text
o Bound books that contain photographic plates, colour illustrations and line diagrams
o Bound books that contain foldouts. These may include, but are not limited to, maps, illustrations, diagrams, tables, etc.


1.3.2. Pamphlets and Catalogues

Various physical sizes ranging from 15 cm to 30 cm and up to 250+ pages. Examples include:

o Stitched or stapled pamphlets and catalogues consisting of pages in standard booklet formats in softcover
o Pamphlets and catalogues that (primarily or solely) consist of pages of printed text
o Pamphlets and catalogues that contain photographic plates, colour illustrations and line diagrams
o Pamphlets and catalogues that contain foldouts. These may include, but are not limited to, maps, illustrations, diagrams, tables, etc.

1.3.3. Serials

Bound serials/journals/magazines ranging in physical size from 15 cm to 60 cm (height) and up to 500+ pages. Examples include:

o Serials consisting of pages in standard magazine or journal formats in hardcover and softcover
o Serials that (primarily or solely) consist of pages of printed text
o Bound serials that contain photographic plates, colour illustrations and line diagrams
o Bound serials that contain foldouts. These may include, but are not limited to, maps, illustrations, diagrams, tables, etc.

1.3.4. Ephemera

Bound or loose collections of ephemera (e.g. greeting cards, single sheet pamphlets, flyers, invitations, menus, handouts) ranging in physical size from 15 cm to 60 cm (height) and up to 500+ pages. Examples include:

o Ephemera consisting of loose pages or sheets
o Ephemera consisting of pages in standard magazine or journal formats in hardcover and softcover
o Ephemera that (primarily or solely) consist of pages of printed text
o Bound or loose sheets of ephemera that contain photographic plates, colour illustrations and line diagrams
o Large size loose sheets of ephemera that are folded. These may include, but are not limited to, maps, illustrations, diagrams, tables, etc.

1.3.5. Newspapers

Bound or loose collections of hardcopy newspapers, or reels of microfilmed newspapers. Examples include:

o Pages may contain black and white and/or colour illustrations/photographs/diagrams
o Single issues with loose pages, or stapled
o Multiple issues bound into large volumes
o Reels of microfilmed newspapers

A detailed inventory will be provided to the Respondent with every shipment of digital materials.

The services are to be delivered offsite at the Respondent’s premises. The services are to include the provision of appropriate equipment, personnel, resources, and appropriate management and quality systems.


The SLNSW prefers that the Respondent has current certification for:

ISO 9001 Quality Management Systems and

ISO 27001 Information Security Management.

The Respondent must conduct its own assessments and investigations regarding the Work Health and Safety of all persons affected by the services delivered under this project. All equipment used must be in safe and reliable condition and meet all relevant safety requirements, regulations and standards.

The Library’s Work Health and Safety Policy can be downloaded from: www.sl.nsw.gov.au/about/policies/docs/work_health_safety_policy_2012.pdf

The SLNSW is committed to procuring sustainable products, works and services where possible. The Respondent will be required to comply with any directions, policies and guidelines given by the SLNSW. If requested by the SLNSW, the Respondent must submit evidence of its environmental management system.

The Respondent must maintain its hardware and be able to show a record of this maintenance. The Respondent may be required to report in writing to the SLNSW any intended hardware and software upgrades prior to their implementation.

The Respondent must provide a single point of contact for managing all aspects of the project. This person must have appropriate authority to make day-to-day project-related decisions. Project status reports will be provided by the Respondent on a weekly basis, on a day to be agreed.


2. CARE AND HANDLING OF DIGITAL MATERIALS

The digital materials supplied by SLNSW comprise:

SOURCE IMAGES

Inventories

Metadata (CSV)

SLNSW digital materials will be delivered to the Respondent by SLNSW staff via Microsoft’s OneDrive cloud service.

The SLNSW will supply JPEG image files for printed material collections that are to undergo OCR. These are referred to as the SOURCE IMAGES.

2.2.1. SOURCE IMAGE File Specification

Technical specifications for SOURCE IMAGE files that will be provided by SLNSW are:

o JPEG files
o 24-bit full colour
o Colour space Adobe RGB (1998)
o 300 ppi or 400 ppi at scale (depending on collection)
o Compression: Maximum quality, 100% (Quality 12)
o Unsharpened
o Colour Correct
o Not interpolated or upscaled
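As a rough planning aid, the pixel dimensions implied by these resolutions can be estimated from the physical sizes listed in Section 1.3. The sketch below is illustrative only: the function name is hypothetical, and it assumes that "at scale" means capture at 1:1 physical size.

```python
CM_PER_INCH = 2.54


def pixels_for_length(length_cm: float, ppi: int) -> int:
    """Approximate pixel count for a physical length captured at the given ppi."""
    if length_cm <= 0 or ppi <= 0:
        raise ValueError("length_cm and ppi must be positive")
    return round(length_cm / CM_PER_INCH * ppi)


if __name__ == "__main__":
    # Smallest and largest heights mentioned for books, serials and ephemera.
    for ppi in (300, 400):
        print(ppi, "ppi:", pixels_for_length(15, ppi), "to", pixels_for_length(60, ppi), "px")
```

Under these assumptions a 60 cm page is roughly 7,087 pixels high at 300 ppi and 9,449 pixels high at 400 ppi.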

Figure 1, Example of SOURCE IMAGE Book Page


2.2.2. Title Scan

In some sets of SOURCE IMAGES, including Books and Pamphlets, a Title Scan will be present. When a Title Scan is present in a set of scanned page images, whether the set is a complete physical volume or a subdivision of a physical volume folder, it will be the first digital image, e.g. b12345_0001.jpg.

The Title Scan contains information about the item such as the call number and title. The Title Scan is required to be OCRed as per 3.2 OCR specifications. These files are not required to be cropped or de-skewed.
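A minimal sketch of how a Respondent might separate the Title Scan from the page images in one set follows. The file name pattern is an assumption inferred only from the single example b12345_0001.jpg, and whether a set contains a Title Scan must be known from elsewhere (for example the inventory), so it is passed in as a flag. The function name is hypothetical.

```python
import re

# Assumed pattern: <item id>_<4-digit sequence>.jpg, inferred from "b12345_0001.jpg".
SEQUENCE_RE = re.compile(r"^(?P<item>.+)_(?P<seq>\d{4})\.jpg$", re.IGNORECASE)


def split_title_scan(filenames, has_title_scan):
    """Return (title_scan, page_images) for one set of SOURCE IMAGES.

    The Title Scan, when present, is taken to be the image with the lowest
    sequence number in the set.
    """
    def sequence(name):
        match = SEQUENCE_RE.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        return int(match.group("seq"))

    ordered = sorted(filenames, key=sequence)
    if has_title_scan and ordered:
        return ordered[0], ordered[1:]
    return None, ordered
```

The Title Scan returned here would still be OCRed, but it would skip the cropping and de-skewing steps applied to the page images.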

Figure 2 Example of Title Scan


2.2.3. Foldout Scans

Some collection items have folded inserts such as maps or tables. These inserts were usually captured on a different device from the standard pages in the rest of the item. As a result, a foldout SOURCE IMAGE file has a significantly different file size and pixel dimensions from the other SOURCE IMAGES in the item. The Foldout Scans are required to be OCRed as per 3.2 OCR specifications.

Figure 3 Example of a cropped Foldout

The SLNSW will provide a Collection Metadata CSV file containing metadata for each batch of collection items to be processed. The detail of the metadata is outlined in Section 4.1 External Metadata (CSV).

Upon receipt of the digital collection material, the respondent must:

Confirm to the SLNSW Project Manager that the material received matches the inventory provided by the SLNSW.

Sign and return the Inventory Acceptance spreadsheet confirming receipt of SLNSW collection materials.
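The inventory check described above can be sketched as a simple set comparison. The inventory format is not defined in this section, so the example assumes it can be reduced to a list of file names. The function name and sample file names are hypothetical.

```python
def compare_with_inventory(inventory_names, received_names):
    """Report files listed in the inventory but not received, and vice versa."""
    expected = set(inventory_names)
    received = set(received_names)
    return {
        "missing": sorted(expected - received),      # listed but not received
        "unexpected": sorted(received - expected),   # received but not listed
    }


if __name__ == "__main__":
    report = compare_with_inventory(
        ["b12345_0001.jpg", "b12345_0002.jpg"],
        ["b12345_0001.jpg", "b12345_0003.jpg"],
    )
    print(report)
```

An empty "missing" and "unexpected" list would indicate that the material received matches the inventory.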

The digital materials must be stored securely while in the Respondent’s possession. Secure storage requires that:

Access to the stored materials is limited to Respondent staff only.

Storage facilities have appropriate disaster recovery systems in place.

The Respondent has a disaster management strategy in place.


3. SPECIFICATIONS

The Respondent will deliver to the Library (“the Deliverables”) the following:

A cropped and deskewed SCREEN JPEG file for each SOURCE IMAGE file supplied.

An ALTO XML file created for each SCREEN JPEG

A METS XML file for each item (i.e. book, journal, pamphlet etc.)

A Rosetta METS XML file for each item (i.e. book, journal, pamphlet etc.)

A PDF/A for each item/book/serial/pamphlet etc.

The Deliverables will be:

expected on a regular weekly delivery schedule (or as agreed with SLNSW)

accompanied by:
o a completed Quality Assurance Certificate
o an updated Metadata CSV
o a completed and current Project Status Report

The Deliverables must be returned via the SLNSW’s Microsoft OneDrive cloud service.

The Respondent is required to perform the following:

Crop and de-skew each SOURCE IMAGE to create SCREEN JPEGs

If required, a WORKING JPEG may be created to optimise images (increased contrast, sharpening etc.) to maximise OCR accuracy. SLNSW does not require a copy of the WORKING JPEG to be returned as a deliverable

Create OCR output (i.e. OCR text and defined zones on pages) for every SCREEN JPEG

Ensure OCR quality meets SLNSW requirements and if required, rekey specific content, as detailed in 3.2.2

Ensure page types are appropriately applied within each page and that zones break correctly and are arranged in the correct reading order

All inverted and rotated text to be identified and marked up

Zoned content as specified at section 3.2.1 is to be referenced in the METS within descriptive metadata and the physical and logical struct maps and, if possible, within the ALTO as TYPE in ComposedBlock and Illustration and/or layout tags (see 3.4.1)

Add text corrections, terms, words into the OCR engine’s dictionary to improve base level OCR quality

Glyphs (non-standard typography) are to be added to the OCR engine’s dictionary, where possible

Inline math and non-keyboard characters to be encoded as UTF-8, where they occur in elements that require rekeying and where possible, added to the OCR engine’s dictionary

Populate mandatory fields in the External Metadata CSV as specified in section 4.1.2

Provide a QA certificate reporting the Accuracy Level and sample size for each delivery (see Annexure B)

3.2.1. OCR Zones

The Respondent is required to ensure that the following elements are zoned correctly:

Material Type | Element
Books | Title, Subtitle (title page); Author (title page); Chapter headings
Ephemera | Chapter headings; Mast head (Title, Volume, Date); Article Titles; Article Authors
Newspapers | Mast head (Title, Volume, Date); Article Title; Article Authors
Pamphlets and Catalogues | Title, Subtitle (title page); Author (title page); Chapter headings
Serials | Title, Volume, Date; Article Titles and Article Authors

All Pages are to have the correct type (e.g. Title, Cover, Table of Contents, Blank, Image)

All Illustrations are to have the correct type (e.g. Illustration, Photo, Cartoon, Map, Graph)

Chapter number and title are zoned as the chapter heading

Articles must contain all the text that forms the article, including articles that continue on a different page (split article linking).

As specified in ALTO version 3.1:
o The TYPE attribute is used in ComposedBlock and Illustration as appropriate (see 3.4.1)
o Layout tags may be used as appropriate (see 3.4.1)

3.2.2. OCR Accuracy / Error Levels

OCR Accuracy or Error Levels can be assessed in different ways. SLNSW assessment of OCR Accuracy/Error Levels is per character as follows:

An accuracy rate of 99.9% is 1 incorrect character in 1000 characters

An accuracy rate of 98.0% is 20 incorrect characters in 1000 characters
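The per-character measure above can be illustrated with a short sketch. This section does not define how an "incorrect character" is counted, so the example assumes the count is the Levenshtein edit distance (insertions, deletions and substitutions) between a trusted reference transcription and the OCR output. The function names are hypothetical.

```python
def edit_distance(reference: str, ocr: str) -> int:
    """Levenshtein distance between the reference text and the OCR text."""
    previous = list(range(len(ocr) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, ocr_char in enumerate(ocr, start=1):
            cost = 0 if ref_char == ocr_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR text
                current[j - 1] + 1,      # extra character in the OCR text
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr: str) -> float:
    """Fraction of reference characters recognised correctly (assumed definition)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr)
    return max(0.0, 1.0 - errors / len(reference))
```

With this definition, one wrong character in a 1000-character sample gives 0.999 (99.9%) and twenty wrong characters give 0.98 (98.0%), matching the two rates quoted above.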

SLNSW requests the following levels of accuracy. If these levels cannot be achieved, please clearly indicate the highest achievable level.

Material Type | Element | Description | Accuracy Required
Books | Title, Author (on title page); Chapter titles | Rekeying required | 99.50%
Books | Body text | Uncorrected | 90% or higher
Ephemera | Chapter titles; Mast head (Title, Volume, Date); Article Titles; Article Authors | Rekeying required | 99.90%
Ephemera | Body text | Uncorrected | 90% or higher
Newspapers | Mast head (Title, Volume, Date); Article Title; Article Authors | Rekeying required | 99.90%
Newspapers | Body text | Uncorrected | 90% or higher
Pamphlets and Catalogues | Title, Author (on title page); Chapter titles | Rekeying required | 99.90%
Pamphlets and Catalogues | Body text | Uncorrected | 90% or higher
Serials | Title, Volume, Date; Article Titles and Article Authors | Rekeying required | 99.90%
Serials | Body text | Uncorrected | 90% or higher
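The table above can be expressed as a lookup for use in a QA script. This is a sketch only: the keys "rekeyed" and "body" are hypothetical labels for the two element classes in each row, and the measured accuracy is assumed to be a per-character fraction between 0 and 1.

```python
# Thresholds transcribed from the table above, as fractions.
REQUIRED_ACCURACY = {
    ("Books", "rekeyed"): 0.9950,
    ("Books", "body"): 0.90,
    ("Ephemera", "rekeyed"): 0.9990,
    ("Ephemera", "body"): 0.90,
    ("Newspapers", "rekeyed"): 0.9990,
    ("Newspapers", "body"): 0.90,
    ("Pamphlets and Catalogues", "rekeyed"): 0.9990,
    ("Pamphlets and Catalogues", "body"): 0.90,
    ("Serials", "rekeyed"): 0.9990,
    ("Serials", "body"): 0.90,
}


def meets_requirement(material_type: str, element_class: str, measured: float) -> bool:
    """True when the measured accuracy reaches the level requested in the table."""
    required = REQUIRED_ACCURACY[(material_type, element_class)]
    return measured >= required
```

For example, a measured accuracy of 0.996 on rekeyed elements would satisfy the level requested for Books but not the level requested for Serials.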


Each text page and foldout will be supplied as a SOURCE IMAGE. This SOURCE IMAGE file needs to be cropped, de-skewed and saved as an individual cropped JPEG file. This is referred to as the SCREEN JPEG.

In creating the SCREEN JPEG from the SOURCE IMAGE, the Respondent is required to adhere, as closely as possible, to the best practice principles of a digital lossless workflow. This is to maintain the best image quality and data integrity. Key elements of this workflow include minimising the number of times a file is modified and re-saved, and using non-destructive image enhancements wherever possible.

The SCREEN JPEG is to be supplied to the SLNSW as one of the deliverables.

The SCREEN JPEG is to be used in the subsequent OCR processes.

In addition to the SCREEN JPEG, the Respondent may decide to create a Working JPEG derivative to enhance OCR accuracy. These modifications may include processing to bitonal, de-speckling and sharpening. SLNSW does not require these Working JPEG derivatives.

The returned METS and ALTO files must reference the full colour, cropped and de-skewed SCREEN JPEG files. This includes, but is not limited to, correct file names, structure mapping, correct zoning and pixel dimensions.

The Respondent will need to contact the SLNSW for instructions when encountering unusual or non-standard page layouts, tables or foldouts.

Technical specifications for SCREEN JPEG files are:

Cropped and de-skewed (from SOURCE IMAGE precursor)

24-bit full colour, JPEG files

Colour space is to be maintained as Adobe RGB (1998)

Files are to remain at 400 ppi

No additional file compression

No sharpening to be applied to the files unless approval is obtained from the SLNSW

During the image enhancement process the JPEG is to be saved as few times as possible to preserve image data and to follow a digital lossless workflow

The JPEG colour and tonality are not to be modified unless approval is obtained from the SLNSW

Embedded metadata must remain in the JPEG, as specified in Section 4.2 Embedded Metadata

No interpolation or upscaling permitted

3.3.1. De–skewing and Cropping of SCREEN JPEG files

The cropping of the SOURCE IMAGE must create a SCREEN IMAGE with the following attributes:

Where required, pages are to be de-skewed prior to cropping so that the text lines are within +/- 10 of horizontal

Handwritten annotations and marginalia must remain visible and complete in the cropped image. Marginalia must not be cropped out or obscured. (see Figure 7)

All images must be cropped 3-5 mm inside the page edge

The cropping must not impinge upon the text block or illustrations/tables present on the page
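The 3-5 mm inset can be converted to pixels for a given resolution. The sketch below assumes the page edge has already been located as a pixel bounding box (left, top, right, bottom) by some other step. The function names and the default 4 mm inset are hypothetical choices, not SLNSW requirements.

```python
MM_PER_INCH = 25.4


def mm_to_pixels(mm: float, ppi: int) -> int:
    """Convert a physical distance in millimetres to pixels at the given ppi."""
    return round(mm / MM_PER_INCH * ppi)


def inset_crop_box(page_box, ppi, inset_mm=4.0):
    """Return a crop box set inside the detected page edge by inset_mm.

    page_box is (left, top, right, bottom) in pixels for the detected page edge.
    """
    if not 3.0 <= inset_mm <= 5.0:
        raise ValueError("inset must be between 3 mm and 5 mm")
    inset = mm_to_pixels(inset_mm, ppi)
    left, top, right, bottom = page_box
    box = (left + inset, top + inset, right - inset, bottom - inset)
    if box[0] >= box[2] or box[1] >= box[3]:
        raise ValueError("page box is too small for the requested inset")
    return box
```

At 400 ppi the permitted inset corresponds to roughly 47 to 79 pixels. A box computed this way would still need to be checked against the text block, illustrations and any marginalia before cropping.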


Figure 4, SOURCE IMAGE: Page before De-skewing and Cropping

Figure 5, SCREEN IMAGE: Page after De-skewing and Cropping

Figure 6, SOURCE IMAGE: Page before De-skewing and Cropping

Figure 7, SCREEN IMAGE: Page after De-skewing and Cropping with marginalia still visible

The OCR and rekeyed OCR (where applicable) created from SCREEN JPEG images must be supplied in the ALTO XML and referenced within the descriptive metadata, file group and physical and logical structure maps within the METS XML document.

3.4.1. ALTO

The ALTO files must comply with ALTO 3.1 Schema and validate against the XSD:


https://www.loc.gov/standards/alto/v3/alto-3-1.xsd

Appropriate use of TYPE attribute for ComposedBlock and Illustration elements.

Where ALTO tagging is used it must follow the use cases published by the ALTO Board: http://altoxml.github.io/documentation/use-cases/tags/ALTO_tags_usecases.html

SLNSW expects that Layout tagging and structured tagging should be possible and can supplement page and chapter struct maps in the METS.
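A lightweight pre-check for the TYPE attribute requirement can be written with the Python standard library. This is a sketch only and does not replace validation against the ALTO 3.1 XSD, which needs a schema-aware validator. It assumes the files use the ALTO v3 namespace shown in the code, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by ALTO version 3 documents.
ALTO_V3_NS = "http://www.loc.gov/standards/alto/ns-v3#"


def untyped_blocks(alto_xml: str):
    """List (element name, ID) for ComposedBlock/Illustration elements lacking TYPE."""
    root = ET.fromstring(alto_xml)
    if root.tag != f"{{{ALTO_V3_NS}}}alto":
        raise ValueError("root element is not <alto> in the ALTO v3 namespace")
    missing = []
    for name in ("ComposedBlock", "Illustration"):
        for element in root.iter(f"{{{ALTO_V3_NS}}}{name}"):
            if not element.get("TYPE"):
                missing.append((name, element.get("ID")))
    return missing
```

An empty list means every ComposedBlock and Illustration carries a TYPE attribute. Whether each value is appropriate still has to be judged against the zoning rules in 3.2.1.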

3.4.2. METS

The METS XML file must contain:

Descriptive metadata as specified in section 4.2.2 as MODS records.

Preservation metadata as PREMIS records.

Technical metadata for images as MIX records.

Technical metadata for text as Textmode records.

A PHYSICAL struct map referencing the SCREEN JPEGs and ALTO files.

A LOGICAL struct map referencing chapters, articles etc

3.4.3. Rosetta METS

The Rosetta METS XML file will be used to create a Rosetta Submission Information Package (SIP), which is required by SLNSW to facilitate ingestion into our repository (Ex Libris Rosetta).

The Rosetta METS file must validate against the XSD:

https://raw.githubusercontent.com/ExLibrisGroup/Rosetta.dps-sdk-projects/master/current/dps-sdk-deposit/src/xsd/mets_rosetta.xsd

The Rosetta METS xml must contain:

A dmdSec for the IE and each File.

An amdSec for the IE, each of the 5 Reps and each File.

A fileSec with a fileGrp for each of the 5 Reps.

A structMap for each of the 5 Reps containing a physical struct map.

See the Rosetta METS file in Annexure H – Rosetta METS file and SIP structure example for more detail on what is to be within each of these sections.

The order of files in the physical struct map is important and must match for all Representations containing multiple files.

Also note that although you will not be dealing with any PRESERVATION_MASTER files you do have to reference them in the METS file. We will be dropping the files in prior to ingestion into Rosetta. To be complete the METS must reference these files. The filenames to be referenced will be provided.
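The ordering rule above can be checked before delivery. The following is a minimal sketch only: it assumes every file in a multi-file Representation carries the 4-digit sequence counter described in Section 6, and the function names and the Representation labels passed in are hypothetical. PRESERVATION_MASTER filenames are supplied by SLNSW and may need a different key.

```python
import re

# Assumption: the 4-digit sequence counter follows an underscore, as in b08851_0001.jpg.
SEQ_RE = re.compile(r"_(\d{4})(?=\D|$)")


def sequence_counter(filename: str) -> str:
    """Return the 4-digit sequence counter found in a filename."""
    match = SEQ_RE.search(filename)
    if match is None:
        raise ValueError(f"no 4-digit sequence counter in {filename!r}")
    return match.group(1)


def order_mismatches(representations: dict[str, list[str]]) -> list[str]:
    """Name every multi-file Representation whose file order differs from
    the first multi-file Representation in the mapping."""
    multi = [
        (name, [sequence_counter(f) for f in files])
        for name, files in representations.items()
        if len(files) > 1  # single-file Representations (e.g. the PDF) are skipped
    ]
    if not multi:
        return []
    _, reference = multi[0]
    return [name for name, counters in multi[1:] if counters != reference]
```

An empty list means the file order agrees across all multi-file Representations that were passed in.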

SLNSW requires that PDF/A files are created from each item/book/serial/journal with the following specifications:

a. PDF files should conform to the PDF/A-3 specification

b. PDF files must contain compressed, full page images derived from the SCREEN JPEG

   a. Compression level medium-high (level 8)

c. The Title scan must not be included in the PDF

d. PDF files must contain the corrected OCR text and be fully content text searchable

e. PDF files must keep the OCR content text aligned with text as it appears in the SCREEN JPEG images

f. PDF files must contain technical & descriptive metadata as specified in Section.9 PDF Metadata

g. Embedded fonts must be ‘open licensed’ so that SLNSW may legally use and redistribute them as part of the PDF.

4. METADATA

SLNSW requires that the Respondent supplies a variety of metadata relating to the creation of the OCR products to allow for long term management and sustainability.

The CSV file supplied by SLNSW is to be used by the Respondent to:

● incorporate specific metadata into new fields in the provided CSV for each item digitised as outlined in Section 4.1.2 Metadata required from Respondent and

● embed a small subset of the supplied metadata into the METS, Rosetta METS, ALTO and PDF deliverables.

This section covers the Collection Metadata CSV file and how this data is to be supplemented and used by the Respondent.

A CSV will be supplied for each batch of items and will be named according to the Batch Number and Work Order Number for that Batch (e.g. OCR_D47519_Batch1.csv). Each item will be represented on a single row of the CSV file.

The supplemented CSV should be saved using UTF-8 encoding with the same filename. It should be placed at the 4th Level of the Delivery folder as outlined in 6.2 File Directory Structure.
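As an illustration of the step above, here is a minimal sketch of merging Respondent values into the supplied CSV and saving it as UTF-8 under the same filename. The function name `supplement_csv` and the `updates` mapping (keyed by DIGITAL ID) are hypothetical; reading with `utf-8-sig` is an assumption, since the encoding of the supplied file is not stated here.

```python
import csv
from pathlib import Path


def supplement_csv(src: Path, dest_dir: Path, updates: dict[str, dict[str, str]]) -> Path:
    """Write a copy of the supplied CSV into dest_dir under the same filename,
    merging Respondent values into each row (rows are matched on DIGITAL ID)."""
    # Assumption: utf-8-sig is used so an optional byte-order mark is tolerated.
    with src.open(newline="", encoding="utf-8-sig") as fh:
        reader = csv.DictReader(fh)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    for row in rows:
        for field, value in updates.get(row.get("DIGITAL ID", ""), {}).items():
            if field not in fieldnames:
                fieldnames.append(field)  # a new field added by the Respondent
            row[field] = value

    dest = dest_dir / src.name  # same filename as the supplied CSV
    with dest.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return dest
```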

4.1.1. CSV fields

The fields MATCH POINT, TITLE, AUTHOR, RIGHTS and CONTRIBUTOR (highlighted in green below) are required by SLNSW to be embedded in the deliverables as outlined in 4.2. Embedded Metadata.

The other fields are supplied as reference only and will be embedded by SLNSW during the ingestion process. Additional metadata such as LANGUAGE, VOLUME, EDITION and DATE (highlighted in yellow below) are required for PDF deliverables as these will not be re-processed.

Field Label Definition

BATCH Work Order Number supplied by the SLNSW

CATALOGUE

DIGITAL ID Digital ID supplied by the SLNSW

MATCH POINT MMSID from the bibliographic record – unique number from the ALMA record used as a match point

CALL NO Call Number from the ALMA holding record

VOLUME Volume field from the ALMA item record if present

TITLE 245 field from the bibliographic record

AUTHOR 100 field from the bibliographic record if present

EDITION 250 field from the bibliographic record if present

PLACE 260 field subfield a from bibliographic record

PUBLISHER 260 field subfield b from bibliographic record

DATE 260 field subfield c from bibliographic record. May also include copyright information or multiple dates.

MATERIAL TYPE Code from MATTYPE fixed field in ALMA

“-” – no information

“a” – printed material

LANGUAGE ISO 639 code for the language of the book, e.g.

“eng” – English

“fre” – French

“spa” – Spanish

“dut” – Dutch

“dan” – Danish

“ger” – German

PAGINATION 300 field subfield a – number of pages in the book, as printed in their respective sequences. “Leaves” differ from “pages” in that leaves are printed on one side not both sides.

ILLUSTRATIONS 300 field subfield b – an indication about whether the book is illustrated, has maps, whether the illustrations are in colour, and the presence of other graphical content

DIMENSIONS 300 field subfield c – usually an indication in cm of the height of the book

BARCODE Barcode field from the ALMA item record

RIGHTS Copyright statement supplied by the SLNSW

CONTRIBUTOR Respondent acknowledgment statement

SL NOTES [as required]

CAPTURE DATE [for respondent – see 4.1.2]

CAPTURE DEVICE [for respondent – see 4.1.2]

OCR SOFTWARE [for respondent – see 4.1.2]

OPERATOR ID [for respondent – see 4.1.2]

EQUIPMENT ID [for respondent – see 4.1.2]

EXCEPTION NOTE [for respondent – see 4.1.2]

CONDITION NOTE [for respondent – see 4.1.2]

PAGE COUNT [for respondent – see 4.1.2]

FOLDOUT COUNT [for respondent – see 4.1.2]

TOTAL FILE COUNT [for respondent – see 4.1.2]

CONTRACTOR NOTE [for respondent – see 4.1.2]

4.1.2. CSV Metadata Required from the Respondent

The Respondent is required to incorporate the following metadata into the provided CSV for each item digitised.

Field Label Definition Requirement

CAPTURE DATE Date of OCR capture in ISO 8601 format, e.g. 2016-08-17 for August 17, 2016. Mandatory

OCR SOFTWARE Name of the software used for OCR capture. Mandatory

OPERATOR ID Identifier for the Operator, i.e. an operator number or initials. Mandatory

EQUIPMENT ID Identifier for the device or devices. Not required

EXCEPTION NOTE Exception reports for missing or out of sequence pages. As required

CONDITION NOTE Damage and condition notes. As required

PAGE COUNT Not required

FOLDOUT COUNT Not required

TOTAL FILE COUNT Actual page count based on pages OCRed (excluding Title Scans, Reference Scans & Foldout Pages). Mandatory

CONTRACTOR NOTE General note after OCRing or Not OCRed / Reason, e.g. Not OCRed / Source image corrupted. As required
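The mandatory fields listed above lend themselves to an automated check before delivery. The sketch below is illustrative only; the function name and the wording of the messages are hypothetical, and it treats TOTAL FILE COUNT as a whole number, which is an assumption.

```python
from datetime import date

# Fields marked Mandatory in section 4.1.2.
MANDATORY_FIELDS = ("CAPTURE DATE", "OCR SOFTWARE", "OPERATOR ID", "TOTAL FILE COUNT")


def check_respondent_fields(row: dict[str, str]) -> list[str]:
    """Return a list of problems found in one CSV row (empty if none)."""
    problems = []
    for field in MANDATORY_FIELDS:
        if not (row.get(field) or "").strip():
            problems.append(f"{field}: mandatory value is missing")

    capture = (row.get("CAPTURE DATE") or "").strip()
    if capture:
        try:
            # Only the extended ISO 8601 form YYYY-MM-DD is accepted here.
            if len(capture) != 10:
                raise ValueError(capture)
            date.fromisoformat(capture)
        except ValueError:
            problems.append("CAPTURE DATE: expected ISO 8601, e.g. 2016-08-17")

    count = (row.get("TOTAL FILE COUNT") or "").strip()
    if count and not count.isdigit():
        problems.append("TOTAL FILE COUNT: expected a whole number")
    return problems
```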

The SLNSW requires the following metadata to be embedded in the Rosetta METS, METS and ALTO derivatives, along with metadata that is automatically produced during processing, e.g. Technical Metadata.

4.2.1. Rosetta METS

Field label DC element Example

MATCH POINT dc:identifier <dc:identifier>991017059979702626</dc:identifier>

TITLE dc:title <dc:title>The service of Man, An essay towards the religion of the future</dc:title>

AUTHOR dc:creator <dc:creator>Morison, J. Cotter (James Cotter)</dc:creator>

CALL NUMBER dc:source <dc:source>DSM/211/M</dc:source>

MATERIAL TYPE* dc:type <dc:type>Textual Records</dc:type>

CATALOGUE SYSTEM** dcterms:isReferencedBy <dcterms:isReferencedBy>Alma</dcterms:isReferencedBy>

*SLNSW requires Textual Records to be used as the material type.

**All items in the OCR pilot test are from the Alma catalogue system.

4.2.2. METS

The METS file should contain the following set of MODS elements:

Field label MODS MODS Example

AUTHOR Creator <mods:name type="personal"> <mods:namePart>Morison, J. Cotter (James Cotter)</mods:namePart> <mods:namePart type="date">1832-1888</mods:namePart> <mods:role> <mods:roleTerm authority="marcrelator" type="text">creator</mods:roleTerm> </mods:role> </mods:name>

CONTRIBUTOR Contributor <mods:name type="corporate"> <mods:namePart>CONTRIBUTOR Respondent name</mods:namePart> <mods:role> <mods:roleTerm type="text" authority="marcrelator">contributor</mods:roleTerm> <mods:roleTerm type="code" authority="marcrelator">ctb</mods:roleTerm> </mods:role> </mods:name>

MATCH POINT Identifier <mods:identifier type="Match Point">991017059979702626</mods:identifier>

RIGHTS Rights <accessCondition>

TITLE Title <mods:titleInfo> <mods:nonSort>The</mods:nonSort> <mods:title>Service of Man</mods:title> <mods:subTitle>An essay towards the religion of the future</mods:subTitle> </mods:titleInfo>

4.2.3. ALTO

ALTO files created from each image/page in an item will contain the following metadata:

Field label ALTO Term/s Example

‘File name’ sourceImageInformation filename <sourceImageInformation> <fileName>File name</fileName> <fileIdentifier>MATCH POINT</fileIdentifier> </sourceImageInformation>

MATCH POINT sourceImageInformation fileIdentifier As above

CONTRIBUTOR processingAgency <processingAgency>CONTRIBUTOR Respondent name</processingAgency>

‘Measurement unit’* MeasurementUnit <MeasurementUnit>mm10</MeasurementUnit>

*SLNSW requires mm10 as the measurement unit.
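A quick check of these four values in a generated ALTO file might look like the sketch below. It uses only the standard library and does not replace validation against the XSD. The namespace is the ALTO version 3 namespace; the function name and messages are hypothetical.

```python
import xml.etree.ElementTree as ET

ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v3#"}


def check_alto_description(xml_text: str, match_point: str) -> list[str]:
    """Return problems found in the ALTO metadata listed in 4.2.3 (empty if none)."""
    root = ET.fromstring(xml_text)
    problems = []

    unit = root.findtext("a:Description/a:MeasurementUnit", namespaces=ALTO_NS)
    if (unit or "").strip() != "mm10":
        problems.append(f"MeasurementUnit is {unit!r}, expected 'mm10'")

    source = "a:Description/a:sourceImageInformation/"
    if not (root.findtext(source + "a:fileName", namespaces=ALTO_NS) or "").strip():
        problems.append("fileName is missing or empty")

    identifier = root.findtext(source + "a:fileIdentifier", namespaces=ALTO_NS)
    if (identifier or "").strip() != match_point:
        problems.append("fileIdentifier does not equal the MATCH POINT")

    # processingAgency is searched for anywhere in the document.
    if not (root.findtext(".//a:processingAgency", namespaces=ALTO_NS) or "").strip():
        problems.append("processingAgency is missing or empty")
    return problems
```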

The following metadata is embedded in the SLNSW supplied SOURCE IMAGE JPEGs and must be retained in the SCREEN JPEGs created by the Respondent:

Title (TITLE)

Rights (RIGHTS)

Identifier (MATCH POINT)

Artist (CONTRIBUTOR)

Additional metadata fields are required in the PDF derivative as these files will not be post-processed. This includes embedding any supplied VOLUME and/or EDITION data within the TITLE field.

Field label Usage Example

MATCH POINT dc:identifier <dc:identifier>MATCH POINT</dc:identifier>

TITLE dc:title including any volume or edition data supplied, delimited with a full stop (.) <dc:title> <rdf:Alt> <rdf:li xml:lang="x-default">TITLE. VOLUME</rdf:li> </rdf:Alt> </dc:title> <dc:title> <rdf:Alt> <rdf:li xml:lang="x-default">TITLE. EDITION</rdf:li> </rdf:Alt> </dc:title>

AUTHOR dc:creator <dc:creator> <rdf:Seq> <rdf:li>CREATOR</rdf:li> </rdf:Seq> </dc:creator>

DATE dc:date <dc:date> <rdf:Seq> <rdf:li>DATE</rdf:li> </rdf:Seq> </dc:date>

RIGHTS dc:rights <dc:rights> <rdf:Alt> <rdf:li xml:lang="x-default">RIGHTS</rdf:li> </rdf:Alt> </dc:rights>

CONTRIBUTOR dc:contributor <dc:contributor> <rdf:Seq> <rdf:li>CONTRIBUTOR Respondent name </rdf:li> </rdf:Seq> </dc:contributor>

LANGUAGE dc:language <dc:language> <rdf:Bag> <rdf:li>LANGUAGE</rdf:li> </rdf:Bag> </dc:language>
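The TITLE rule above (volume or edition appended after a full stop) can be sketched as follows. The function names are hypothetical. The order used when both VOLUME and EDITION are supplied is an assumption, as the table only shows each case separately.

```python
from xml.sax.saxutils import escape


def pdf_title(title: str, volume: str = "", edition: str = "") -> str:
    """Join TITLE with any supplied VOLUME and EDITION, delimited by a full stop."""
    parts = [p.strip() for p in (title, volume, edition) if p and p.strip()]
    # Drop a trailing full stop on every part except the last to avoid '..'.
    return ". ".join([p.rstrip(".") for p in parts[:-1]] + parts[-1:])


def dc_title_xmp(title: str, volume: str = "", edition: str = "") -> str:
    """Return the dc:title fragment in the form shown in the table above."""
    text = escape(pdf_title(title, volume, edition))
    return (
        '<dc:title> <rdf:Alt> <rdf:li xml:lang="x-default">'
        + text
        + "</rdf:li> </rdf:Alt> </dc:title>"
    )
```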

5. QUALITY ASSURANCE

Quality Assurance (QA) encompasses all procedures and techniques to verify the quality, accuracy and consistency of the Deliverables. The SLNSW uses ANSI/ISO/ASQ 2859-4:2002: Sampling procedures for inspection by attributes - Part 4: Procedures for assessment of declared quality levels as a reference.

The Respondent’s Quality Plan must include a detailed overview of all Quality Assurance processes that will be conducted in providing the digitisation services and the measures that will be applied.

The SLNSW will engage in a formal evaluation both to verify that the digital product of the participating Respondents meets the specifications detailed in this document and to evaluate the quality and acceptability of the digital product for future needs.

The SLNSW reserves the right to reject any digital file(s) which fail to meet the specifications and requirements detailed in this document, as determined by both automated and manual validation methods.

The SLNSW reserves the right to require the Respondent to re-make any deliverables which do not meet the specifications and requirements of this document. The SLNSW also reserves the right to refuse payment, up to and including, the whole digital product produced for a shipment of materials.

If the digital deliverables of the Respondent continue to fail to meet these specifications and requirements, the SLNSW reserves the right to require the return of materials for submission to another Respondent.

6. FILE NAMING AND FOLDER STRUCTURES

Instructions for specific file naming schema will be provided with documentation that accompanies each batch of Source Images supplied by SLNSW.

SLNSW requires all deliveries to be compliant with the BagIt File Packaging Format V0.97 available at https://tools.ietf.org/html/draft-kunze-bagit-14.

Each delivery must be a valid bag containing the following:

bagit.txt

data folder containing the payload

payload manifest using the MD5 checksum algorithm (i.e. manifest-md5.txt)

Optional:

Tag manifest

bag-info.txt

Deliveries must not include Thumbs.db, .DS_Store, and/or other hidden system files.
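For Respondents who script their packaging rather than use one of the tools listed in this section, the required tag files can be produced with the standard library alone. This is a minimal sketch, not a full BagIt implementation; the function name is hypothetical, and treating every dot-prefixed file as a hidden system file is an assumption.

```python
import hashlib
from pathlib import Path

# Names that must not appear in a delivery (compared in lower case).
EXCLUDED_NAMES = {"thumbs.db", ".ds_store"}


def write_bag_tags(bag_dir: Path) -> list[str]:
    """Write bagit.txt and manifest-md5.txt for the payload under bag_dir/data."""
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    lines = []
    for path in sorted(p for p in (bag_dir / "data").rglob("*") if p.is_file()):
        # Assumption: any dot-prefixed file counts as a hidden system file.
        if path.name.lower() in EXCLUDED_NAMES or path.name.startswith("."):
            raise ValueError(f"hidden system file in payload: {path}")
        digest = hashlib.md5()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        # Manifest line: checksum, whitespace, path relative to the bag root.
        lines.append(f"{digest.hexdigest()}  {path.relative_to(bag_dir).as_posix()}")
    (bag_dir / "manifest-md5.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

A payload manifest lists files only, so an empty folder such as PRESERVATION_MASTER produces no manifest entry.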

NOTE: For this OCR project the 3-letter project code should be OCR

The directory structure is to be organised in the following levels:

Level 1  [Delivery Folder]
Level 2    | bag-info.txt
Level 2    | bagit.txt
Level 2    | manifest-md5.txt
Level 2    \ [metadata]
Level 3      | Updated CSV
Level 2    \ [data]
Level 3      \ [Batch Folder]
Level 4        \ [Digital ID]
Level 5          \ [content]
Level 6            | Rosetta METS file
Level 6            \ [streams]
Level 7              \ [ALTO]
Level 8                | ALTO xml files
Level 7              \ [PDF]
Level 8                | PDF/A
Level 7              \ [SCREEN]
Level 8                | SCREEN JPEG files
Level 7              \ [PRESERVATION_MASTER]
Level 8                | empty
Level 7              \ [METS]
Level 8                | METS file

Figure 1, table showing the levels of file directory structure

Folder Level Item Name Folder or File Name Description

Level 1 [Delivery Folder] OCR_Vend_YYYYMMDD_001 Delivery Folder - A Digital Delivery Folder named: 3 Letter Project Code_4 Letter Respondent Code_Reverse Date_Delivery Number

Level 2 | bag-info.txt bag-info.txt BagIt information file

Level 2 | bagit.txt bagit.txt BagIt information file

Level 2 | manifest-md5.txt manifest-md5.txt Manifest checksum of files in the bag

Level 2 \ [metadata] metadata Folder with the Updated CSV and administration files including QA Certificates, Project Status Reports etc.

Level 2 \ [data] data Folder containing the METS / ALTO xml, SCREEN JPEG, and Admin folder

Level 3 | Updated CSV OCR_DXXXXX_BatchX.csv Updated External Metadata CSV

Level 3 \ [Batch Folder] DXXXXX Batch Number Folder = Work Order Number (e.g. D47549)

Level 4 \ [Digital ID] bXXXXX Digital ID folder for material

Level 5 \ [content] content Structural subfolder

Level 6 | Rosetta METS file bXXXXX_rosetta.xml Rosetta METS xml file

Level 6 \ [streams] streams Structural subfolder

Level 7 \ [ALTO] ALTO Folder that holds the ALTO xml files

Level 7 \ [PDF] PDF Folder that holds the PDF/A file of the item

Level 7 \ [SCREEN] SCREEN Folder that holds the cropped and de-skewed SCREEN JPEGs

Level 7 \ [PRESERVATION_MASTER] PRESERVATION_MASTER This folder is empty - required for SLNSW use only

Level 7 \ [METS] METS Folder that holds the METS xml file

Level 8 | ALTO xml files bXXXXX_0001.xml - bXXXXX_9999.xml ALTO xml files e.g. b12345_0004.xml to b12345_9999.xml

Level 8 | PDF/A bXXXXX.pdf Searchable PDF/A e.g. b12345.pdf

Level 8 | SCREEN JPEG files bXXXXX_0001.jpg - bXXXXX_9999.jpg SCREEN JPEGs e.g. b12345_0004.jpg to b12345_9999.jpg

Level 8 | METS file bXXXXX.xml METS xml file

Figure 2, table describing the levels of file directory structure
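The Level 1 naming rule in the table above can be expressed as a small helper. This is a sketch only; the function name is hypothetical, and allowing digits in the 4-character Respondent code is an assumption.

```python
import re
from datetime import date

# Pattern: 3-letter project code, 4-character Respondent code,
# reverse date (YYYYMMDD) and a 3-digit delivery number.
DELIVERY_RE = re.compile(r"^[A-Za-z]{3}_[A-Za-z0-9]{4}_\d{8}_\d{3}$")


def delivery_folder_name(respondent_code: str, delivered: date,
                         number: int, project: str = "OCR") -> str:
    """Build the Level 1 Delivery Folder name, e.g. OCR_Vend_20171103_001."""
    name = f"{project}_{respondent_code}_{delivered:%Y%m%d}_{number:03d}"
    if not DELIVERY_RE.match(name):
        raise ValueError(f"name does not follow the delivery folder pattern: {name}")
    return name
```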

Figure 3, Example of Folder Structures, File Naming & Inclusions

Open source software available for the creation and validation of bags per the BagIt standard includes, but is not limited to:

Bagger - https://github.com/LibraryOfCongress/bagger/releases/latest

Exactly - https://www.avpreserve.com/tools/exactly/

Bagit-python - http://libraryofcongress.github.io/bagit-python/

NOTE: Some bagging software produces tag manifest and bag-info.txt files automatically.

If the software you use creates these files, you do not have to delete them. If the software you use does not create these files, you are not required to create them.

Each file will have a unique identifier. The naming pattern will be:

Item Description Example

SCREEN JPEG Files b08851 is the Digital ID, 0001 is a 4-digit sequence counter b08851_0001.jpg to b08851_9999.jpg

ALTO Files b08851 is the Digital ID, 0001 is a 4-digit sequence counter b08851_0001.xml to b08851_9999.xml

Rosetta METS File b08851 is the Digital ID b08851_rosetta.xml

METS File b08851 is the Digital ID b08851.xml

PDF/A File b08851 is the Digital ID b08851.pdf
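The naming pattern above can be generated rather than typed by hand. The sketch below is illustrative: the function name and the dictionary keys (chosen to mirror the folder names in Section 6) are hypothetical, and the Digital ID pattern follows the definition of a letter plus a 5-digit sequence.

```python
import re

DIGITAL_ID_RE = re.compile(r"^[A-Za-z]\d{5}$")  # a letter plus 5 digits, e.g. b08851


def item_filenames(digital_id: str, page_count: int) -> dict[str, list[str]]:
    """List the expected deliverable filenames for one item, grouped by folder."""
    if not DIGITAL_ID_RE.match(digital_id):
        raise ValueError(f"unexpected Digital ID: {digital_id!r}")
    if not 1 <= page_count <= 9999:
        raise ValueError("the 4-digit sequence counter allows 1 to 9999 pages")
    counters = [f"{n:04d}" for n in range(1, page_count + 1)]
    return {
        "SCREEN": [f"{digital_id}_{c}.jpg" for c in counters],
        "ALTO": [f"{digital_id}_{c}.xml" for c in counters],
        "content": [f"{digital_id}_rosetta.xml"],
        "METS": [f"{digital_id}.xml"],
        "PDF": [f"{digital_id}.pdf"],
    }
```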

7. SERVICE LEVELS

Service Respondent’s Responsibility Principal’s responsibility Measure Level required

7.1.1 Creation of METS / ALTO, SCREEN JPEG and PDF/A Meets the requirements set out in this SOR 100%

7.1.2 Supply of specific Administrative metadata Meets the requirements set out in this SOR 100%

7.1.4 Quality assurance of the Deliverables produced from the services described in sections 7.1.1 and 7.1.2 (above) Meets the requirements set out in this SOR 100%

7.1.5 File transfer of the digitised files Meets the requirements set out in this SOR 100%

7.1.6 Retention of the digitised files until the integrity and Quality acceptance of the transferred files has been confirmed by SLNSW Meets the requirements set out in this SOR 100%

7.1.7 Project Management Meets the requirements set out in this SOR 100%

7.1.8 Change in Technical Evaluation Schedule Inform SLNSW’s Project Manager with supporting evidence in writing immediately as per clause 7.6 “Delays” of the Master Digitisation Services Agreement (Part C). 100%

7.1.9 Response from SLNSW on change to Project Schedule Upon receiving the written advice about changes in project schedule, SLNSW’s Project Manager will respond within 3 Business Days with the Principal’s decision on the change in project schedule 100%

7.1.10 Delivery of Digitised files (Deliverables) All deliveries to the SLNSW of Deliverables must have a supporting QA Report 100%

7.1.11 Quality Inspection The quality of Deliverables will be assessed against requirements set out in the Statement of Requirements. The SLNSW must complete inspections and certify acceptance within 30 days of receiving the Deliverables QA certified from the Respondent 100%

7.1.12 Technical Evaluation SLNSW must complete inspections and certify acceptance within 30 days of receiving the Deliverables QA certified from the Respondent 100%

8. DELIVERABLES

Service Deliverable

8.1.3 SCREEN JPEG Files SCREEN JPEG Files must be created as per Section Error! Reference source not found.

Metadata must be embedded as specified in Section 4

SCREEN JPEG Files are to be saved as per the file naming convention in Section 6

8.1.1 METS / ALTO The METS / ALTO must be created as per Section 3.4

Metadata must be embedded as specified in Section 4

The METS / ALTO are to be saved as per the file naming convention in Section 6

8.1.2 Searchable PDF/A The Searchable PDF/A must be created as per Section 3.5

Metadata must be embedded as specified in Section 4

The Searchable PDF/A is to be saved as per the file naming convention in Section 6

8.1.4 Addition of specific technical and administrative information in the provided CSV

Completed Itemised Collection Metadata Spreadsheet (CSV)

8.1.5 Quality assurance of digital deliverables

Files digitised as per Quality Acceptance Criteria (Section 9)

8.1.6 File transfer and supply of digital Files transferred between Respondent and SLNSW via Microsoft OneDrive cloud service.

8.1.7 Retention of digital files Maintain Access to Respondent version of digitised files for required period of time

8.1.8 Project Management Compliance with SLNSW supplied templates for status reporting

9. ACCEPTANCE OF DELIVERABLES

The Respondent will on reasonable notice from the SLNSW allow the SLNSW to inspect materials and services associated with Optical Character Recognition of the digitised collection items as per this SOR.

The Respondent’s services and the Deliverables must be of a professional standard and reflect expertise commensurate with standard commercial or industrial practice for activities such as those required under this Agreement.

Acceptance of deliverables will be measured against fulfilment of the requirements in this Statement of Requirements (including Annexures A to I).

The following acceptance process for Deliverables will apply:

1. Delivery of Digitised files (Deliverables) - Respondent to the SLNSW

All project Deliverables must have a supporting QA Certificate completed by the Respondent

2. The SLNSW QA Inspection

The quality of Deliverables will be assessed against requirements set out in the Statement of Requirements

3. The SLNSW’s QA Inspection - Acceptance

The SLNSW must complete inspections and certify acceptance within 30 days of receiving the Deliverables QA certified from the Respondent

4. The SLNSW Quality Rejections

The Respondent must correct any deficiencies identified by the SLNSW within its 30 days’ quality inspection review period at the Respondent’s own expense and resubmit all the deliverable(s) for that entire batch within 10 business days from the date of notification.

5. Review by the SLNSW of rejected Deliverables

The SLNSW must complete inspections and certify acceptance within 30 days of receiving the rejected Deliverables QA certified from the Respondent

6. Resubmissions

Rework should be submitted under the same work order number as the original rejected work, with a numerical counter on the work order number to indicate that it is a resubmission. The complete set of images within that work order should be resubmitted, not just the rejected files. For example, if there are files rejected under a work order D12345 the resubmitted folders would be named:

D12345 (the original submission)

D12345_1 (the first resubmission)

D12345_2 (the second resubmission)

The SLNSW will acknowledge the acceptance of deliverables by supplying an Acceptance Certificate (Annexure D SLNSW Deliverables Acceptance.docx).

PLEASE NOTE: Acceptance Certificates will be issued at batch level, not at book or page level. Resubmissions are expected to replace the rejected item, not just the affected page.
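The resubmission naming rule in step 6 (D12345, D12345_1, D12345_2) is simple enough to capture in a helper. A minimal sketch, with a hypothetical function name:

```python
def resubmission_folder(work_order: str, attempt: int) -> str:
    """Return the folder name for a submission of a work order.

    attempt 0 is the original submission; 1, 2, ... are resubmissions.
    """
    if attempt < 0:
        raise ValueError("attempt must be zero or greater")
    return work_order if attempt == 0 else f"{work_order}_{attempt}"
```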

The SLNSW’s quality acceptance criteria are:

Criteria Description Acceptance Level

Compliance The deliverables follow the SLNSW’s specifications 100%

Corrected OCR As specified in section 3, this includes rekeying and machine learning 99.9%

Uncorrected OCR As specified in section 3, this includes additions to Dictionary to improve machine learning 98.0%

Completeness METS, ALTO and SCREEN JPEG exist for each page of the items supplied, as per the SLNSW’s specification. Single PDF/A exists for each book/volume or individual item supplied by SLNSW 100%

Enhancements SCREEN JPEG image quality is to be maintained and non-degraded from the supplied SOURCE IMAGE JPEGs and meeting the SLNSW’s image quality specification 100%

File format accuracy The file format is valid and has been created as per the SLNSW’s specification 100%

File name accuracy The file names and directory structure are accurate and follow SLNSW’s specification 100%

Metadata Accurate metadata has been written to the files or added to the csv as per the SLNSW’s specification 100%

Cropping SCREEN JPEG files have been cropped and de-skewed as per the SLNSW’s Specification 100%

Compliance to BagIt Standard SLNSW requires all deliveries to be compliant with the BagIt File Packaging Format V0.97 100%

File Delivery and Media - Operating System & File System Delivery media is readable and images can be ingested by the SLNSW, Windows OS 7.0 compatible, NTFS and free of malware 100%

Reporting – CSVs Collection Metadata CSVs completed as per the SOR 100%

Zoning, Elements and Type As per the SLNSW’s specification, as specified in section 3. 99.9%
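To give a feel for what the OCR acceptance levels imply, the sketch below computes a character-level accuracy from an edit distance. This is an assumption made for illustration only: this part of the SOR does not define how accuracy is measured (section 3 governs), and the function names are hypothetical.

```python
def edit_distance(reference: str, candidate: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(candidate) + 1))
    for i, ref_char in enumerate(reference, 1):
        current = [i]
        for j, cand_char in enumerate(candidate, 1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_char != cand_char),  # substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, candidate: str) -> float:
    """Assumed measure: 1 - (edit distance / length of the reference text)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1.0 - edit_distance(reference, candidate) / len(reference))


def meets_level(reference: str, candidate: str, level: float) -> bool:
    """Compare the accuracy with an acceptance level given as a fraction, e.g. 0.999."""
    return character_accuracy(reference, candidate) >= level
```

Under this assumed measure, a level of 99.9% allows at most one character error per 1,000 reference characters.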

10. DEFINITIONS AND ACRONYMS

Term Definition

Accepted (work) Packages of work returned to the SLNSW by the successful Respondents that meet all quality acceptance checks conducted by the SLNSW

Acceptance Certificate A document supplied by SLNSW to the Respondent confirming the approval or rejection of Deliverables.

Acknowledgement (of Delivery) Receipt of delivery prior to SLNSW quality validation. Acknowledgement doesn't constitute Acceptance of Deliverables. Deliverables may be rejected and subject to warranty correction as described in this statement of requirements.

ALTO TYPE A user defined description of the type of the composed block or illustration.

Archival Master A high resolution reproduction of the Original source material that contains content that the SLNSW intends to maintain for the long term without loss of essential features.

Bag A bag consists of a "payload" and "tags".

BagIt BagIt is a hierarchical file packaging format designed to support disk-based or network-based storage and transfer of arbitrary digital content.

Batch A set of 25 books or volumes under the one Work Order Number e.g. D12345

Collection Item / Material Any item that forms part of the collection held by the SLNSW

Cropping/Cropped The process of reducing the size of images by removing parts that are near the edges

Deliverable(s) Digital files or services that the successful Respondents will provide

Descriptive Metadata Metadata describing and identifying information resources

Deskewing The correcting of distortion caused by image capture from a viewpoint other than on the perpendicular

Digital Collection Item / Material Any digitised item that forms part of the collection held by the SLNSW

Digital ID A unique letter plus 5-digit sequence that is assigned to each book or volume within a Batch e.g. b09150

Digital Imaging A term used generally to describe the process of creating and manipulating digital images

Digitisation The process of converting analogue materials into digital objects. This involves not only digital imaging but also the production of Metadata.

Dictionary The OCR’s engine word list that can be extended by the user

Foldouts Foldouts can be attached or contained as a loose item within the volume. A foldout page that is imaged on a device different to regular page images will be a Foldout Scan

Image Manipulation The process of editing images e.g. de-skewing, cropping, rotating, de-speckling, etc.

Image Optimisation for OCR The process of improving the digital image for improved OCR accuracy as well as display e.g. smoothing of characters, sharpening, increasing contrast, removing dirt and noise.

Intellectual Entity A set of content that is considered a single intellectual unit for purposes of management and description – for example, a book, map, photograph, or database. An intellectual entity may have one or more digital representations.

Item An item could be a book, an issue of a journal, etc. An item is represented as a single row in the External Metadata CSV (4.1) that is supplied with each batch of source images.

Lossless Workflow For the purposes of this SOR a lossless workflow means capturing raw files and converting these to uncompressed 24-bit tiffs in TIFF 6.0 format

Master File A high-resolution file of content that the SLNSW intends to maintain for the long term without loss of essential features. It excludes Title Scans, Resolutions Scans, Foldouts with QP cards and is named with underscore ‘m’. Derivatives of these files will be displayed through the SLNSW’s catalogue.

Match Point Unique bibliographic identifier from the Libraries Australia Database

Metadata Metadata is structured data that instructs end-users (people and computerised programs) about how the data is to be interpreted.

There are three main types of metadata

1: Descriptive metadata describes a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, and keywords. This is critical for Discovery.

2: Structural metadata indicates how compound objects are put together, for example, how pages are ordered to form chapters. This is critical for Delivery.

3: Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it. There are several subsets of administrative data; two that are sometimes listed as separate metadata types are:

Rights management metadata, which deals with intellectual property rights,

Preservation metadata, which contains information needed to archive and preserve a resource.

Optical Character Recognition (OCR)

The conversion of printed material (text, illustrations, etc.) into machine-readable text.

Original Source Material The hard copy or other text-based material used to create Source images

Page(s) A single page printed in any text-based item

Payload The data encapsulated by the bag.

Payload Manifest A payload manifest is a tag file that lists payload files and checksums for those payload files generated using a bag checksum algorithm.

Printed Materials SLNSW Collection Category consisting of Books, Pamphlets, Catalogues, Serials, Newspapers

Quality Assurance (QA) Quality assurance (QA) encompasses all procedures and techniques to verify the quality, accuracy, and consistency of the deliverables

Raw File The unprocessed file from a digital camera. These files are converted using a raw converter to the TIFF files specified in this SOR


Reference File A high-resolution file of content that the SLNSW intends to maintain for the long term without loss of essential features. It includes Title Scans, Resolution Scans and Foldouts with QP cards. These files will not be displayed through the SLNSW’s catalogue.

Resolution Maximum spatial resolution without recourse to interpolation, i.e. the resolution imaged by the optical system of a capture device, without subsequent software-interpolated pixels.

Resolution Scan The second and third scans in a set. They allow the SLNSW to evaluate scanner performance, focus, exposure and colour balance for that scanning session.

Rotating Process of turning (rotating) the image file so that the top of the page appears at the top of the image file

SCREEN JPEG The cropped image file that is used in the OCR process and that the METS / ALTO and PDF/A will reference

Sharpening A software method of exaggerating ‘edges’ in an image to give an enhanced definition

SOURCE IMAGE (or Page Image)

A digital image representing a scanned book/serial/catalogue page that is supplied by SLNSW

SLNSW State Library of New South Wales

SOR Statement of Requirements

Tag The "tags" are metadata files intended to facilitate and document the storage and transfer of the bag.

Technical Metadata Metadata that describes the technical attributes of the digital image

TIFF Tagged Image File Format, a file format for storing images

Title Scan The first digital image in every set of scanned page images. It includes important information derived from the CSV (see 3.8.1)

Respondent The commercial entity responding to the Request for Tender to perform OCR services, contracted to digitise the SLNSW’s printed materials collections

Volume A single book/serial/journal/catalogue belonging to the SLNSW printed materials collection

Work Order Number A unique number assigned to a set of items that make up a Batch. Each item within that Batch will have a unique Digital ID

Work Package A package of work that constitutes a batch of volumes to be digitised

Work Report A customised report outlining detail about completed work packages

Work Schedule The schedule in which Work packages are to be sent by the SLNSW and delivered by the successful Respondents.

Working JPEG An optional enhanced JPEG file used internally by the Respondent to assist with OCR creation


11. STANDARDS REFERENCES & VERSION REQUIREMENTS

The SLNSW’s standards are based on internationally applied standards as set out in these documents:

Rosetta METS

Ex Libris Rosetta METS profile

http://www.loc.gov/standards/mets/profiles/00000042.xml

Extension schemas

o DC XML Schema http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd Used for encoding descriptive metadata. (MDTYPE="DC")

o Qualified DC XML Schema (dcterms) http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd Used for encoding descriptive metadata. (MDTYPE="DC")

o DNX Used for encoding technical metadata. (MDTYPE="OTHER" OTHERMDTYPE="dnx")

Rosetta AIP data model (contains the DNX Data Dictionary) https://knowledge.exlibrisgroup.com/@api/deki/files/57231/Rosetta_AIP_Data_Model.pdf?revision=1

o Rosetta XSD

Used for validating METS files against the Rosetta METS Profile

https://github.com/ExLibrisGroup/Rosetta.dps-sdk-projects/blob/master/current/dps-sdk-deposit/src/xsd/mets_rosetta.xsd

https://raw.githubusercontent.com/ExLibrisGroup/Rosetta.dps-sdk-projects/master/current/dps-sdk-deposit/src/xsd/mets_rosetta.xsd (direct)

METS

METS (Metadata Encoding and Transmission Standard)

http://www.loc.gov/standards/mets/

METS Schema - Version 1.11

http://www.loc.gov/standards/mets/mets.xsd

External schemas

http://www.loc.gov/standards/mets/mets-extenders.html

o MODS - Metadata Object Description Schema

http://www.loc.gov/standards/mods/

o PREMIS Data Dictionary for Preservation Metadata

http://www.loc.gov/standards/premis/

o NISO MIX – NISO Metadata for Images in XML Schema. http://www.loc.gov/standards/mix/

o TextMD - technical metadata for text

https://www.loc.gov/standards/textMD/
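As an illustration of how a METS file ties a delivery together, the following minimal Python sketch lists the files declared in a METS file section. It assumes only the generic METS and XLink namespaces; the sample document, the file names and the function name list_mets_files are hypothetical and are not taken from the Rosetta METS profile or from Annexure H. The sketch does not validate against the Rosetta XSD, since the Python standard library has no XSD validator.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical, heavily simplified METS document used only for illustration.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp ID="rep1">
      <mets:file ID="fid1">
        <mets:FLocat LOCTYPE="URL" xlink:href="page_0001.jpg"/>
      </mets:file>
      <mets:file ID="fid2">
        <mets:FLocat LOCTYPE="URL" xlink:href="page_0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def list_mets_files(mets_xml):
    """Return (file ID, xlink:href) pairs for every mets:file element."""
    root = ET.fromstring(mets_xml)
    files = []
    for file_el in root.iter("{%s}file" % METS):
        flocat = file_el.find("{%s}FLocat" % METS)
        # A file without an FLocat child has no external location to report.
        href = flocat.get("{%s}href" % XLINK) if flocat is not None else None
        files.append((file_el.get("ID"), href))
    return files
```

A listing of this kind could be compared against the files actually present in a delivery before a full schema validation is run.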

ALTO

ALTO (Analysed Layout and Text Object)

Technical Metadata for layout and Text objects

http://www.loc.gov/standards/alto/

ALTO Schema - Version 3.1

https://www.loc.gov/standards/alto/v3/alto-3-1.xsd
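To show how recognised text is carried in ALTO, the following minimal Python sketch reads the text lines from an ALTO version 3 document. It assumes the ALTO v3 namespace and the standard TextLine and String elements with a CONTENT attribute; the sample page, its coordinates and the function name alto_text_lines are hypothetical. Hyphenation (HYP) elements and layout coordinates are ignored for brevity.

```python
import xml.etree.ElementTree as ET

ALTO = "http://www.loc.gov/standards/alto/ns-v3#"

# Hypothetical, heavily simplified ALTO page used only for illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1">
            <String CONTENT="State" HPOS="100" VPOS="100" WIDTH="120" HEIGHT="40"/>
            <SP/>
            <String CONTENT="Library" HPOS="240" VPOS="100" WIDTH="160" HEIGHT="40"/>
          </TextLine>
          <TextLine ID="TL2">
            <String CONTENT="Sydney" HPOS="100" VPOS="160" WIDTH="150" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def alto_text_lines(alto_xml):
    """Return the recognised text of each TextLine, words joined by one space."""
    root = ET.fromstring(alto_xml)
    lines = []
    for text_line in root.iter("{%s}TextLine" % ALTO):
        words = [s.get("CONTENT", "") for s in text_line.iter("{%s}String" % ALTO)]
        lines.append(" ".join(words))
    return lines
```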


PDF/A-3

PDF, Version 1.7 (ISO 32000-1:2008)

ISO 19005-3:2012 Document management -- Electronic document file format for long-term preservation -- Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3).

Adobe RGB (1998) ICC profile

https://www.adobe.com/digitalimag/pdfs/AdobeRGB1998.pdf

XMP

Extensible metadata platform (XMP) specification

http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=57421

Adobe XMP Specification

http://www.adobe.com/devnet/xmp.html

Quality Assurance

ANSI/ISO/ASQ 2859-4:2002: Sampling procedures for inspection by attributes - Part 4: Procedures for assessment of declared quality levels

BagIt

https://tools.ietf.org/html/draft-kunze-bagit-14
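The Payload Manifest defined in the glossary (a tag file listing payload files and their checksums) can be sketched in Python as follows. This is an illustration only: it assumes the BagIt convention that payload files sit under a data/ directory and that the manifest is named manifest-<algorithm>.txt. The choice of SHA-256 is an assumption, as this SOR section does not name the checksum algorithm, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def payload_manifest_lines(bag_dir, algorithm="sha256"):
    """Return one '<checksum>  <path relative to the bag>' line per payload file."""
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"
    # Sort on the relative POSIX path so the output order is stable.
    relative_paths = sorted(
        p.relative_to(bag_dir).as_posix() for p in data_dir.rglob("*") if p.is_file()
    )
    lines = []
    for relative in relative_paths:
        digest = hashlib.new(algorithm)
        with (bag_dir / relative).open("rb") as handle:
            # Read in chunks so large image files are not loaded into memory at once.
            for chunk in iter(lambda: handle.read(65536), b""):
                digest.update(chunk)
        lines.append("%s  %s" % (digest.hexdigest(), relative))
    return lines


def write_payload_manifest(bag_dir, algorithm="sha256"):
    """Write manifest-<algorithm>.txt at the top level of the bag and return its path."""
    bag_dir = Path(bag_dir)
    manifest = bag_dir / ("manifest-%s.txt" % algorithm)
    text = "\n".join(payload_manifest_lines(bag_dir, algorithm)) + "\n"
    manifest.write_text(text, encoding="utf-8")
    return manifest
```

On receipt, the same function can be run over the delivered bag and its output compared line by line with the supplied manifest to detect missing or altered payload files.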


12. LIST OF ANNEXURES

Annexure A – Project Status Report for completion by the Respondent

The Project Status Report is to be used to track progress against the agreed baseline of the project. It is a requirement this is updated and sent to the SLNSW on a regular basis (agreed by both parties).

Annexure B – Quality Assurance Certificate for completion by the Respondent

The Quality Assurance Certificate is to be completed by the Respondent at the end of digitisation for the Technical Evaluation. It must be included with the Archive Master files when being sent back to the SLNSW.

Annexure C – Deliverables Acceptance Criteria for completion by the SLNSW

The Acceptance Criteria/Quality Assurance Report will be completed by the SLNSW at the completion of their QA activities for the Technical Evaluation.

Annexure D – Example of supplied Collection Metadata Spreadsheets (CSVs) for completion by the Respondent

Itemised Collection Metadata Spreadsheet – this CSV is used to capture and add various descriptive, technical and administrative metadata for each item and/or files in the collection. The Respondent is required to follow the instructions in the spreadsheet and add required metadata.

Annexure E – Example of completed Collection Metadata Spreadsheet (CSV) to be returned by the Respondent

The same as Annexure D but with the top line of the CSV completed to indicate the type and format of information required from the Respondent.

Annexure F – OCR Pilot Test

This annexure contains details of the pilot test. The document provides instructions for the pilot test and identifies the source images and their location.

Annexure G – Goods Shipment Receipt

The document that will be used to formally transfer SLNSW material to the respondent for OCR processing and to acknowledge the return of that collection material to the SLNSW.

Annexure H – Rosetta METS file and SIP structure example

Annexure H shows an example of a valid Rosetta METS file with files structured for a Submission Information Package (SIP). Other included files are based on an older sample and may not fully comply with current requirements.