Load eTheses Background - McGill Library


1

HOW TO… Load eTheses

Background:

eTheses are transferred by FTP to the Library SAN three times a year: March 1, May 1 and November 1. The metadata output file that accompanies the theses is used to drive the batch upload. A batch upload job into DigiTool is referred to as an ingest.

The eThesis PDFs and the metadata file that describes them are delivered to the SAN share on digitool.library.mcgill.ca known as collect6/etheses.

PDF files are named like this: SZPTHUP_<YYYY>_<studentid>_[CERTIFICATE_]<jobid>.pdf

e.g. SZPTHUP_2009_260265930_J21080144-j32568656.pdf (eThesis PDF. These will be assigned a Usage Type of VIEW during the ingest.)

e.g. SZPTHUP_2009_260265930_CERTIFICATE_J21080144-j32568656.pdf (Certificate files contain personal information required to be retained with the thesis but which must not be displayed to the public. They will be assigned a Usage Type of ARCHIVE.)

Metadata files are named like this: SZPTHUP_<date of upload as MMM_DD>_<YYYY>_METADATA_<jobid>.xml

e.g. SZPTHUP_JUN_09_2009_METADATA_J21080144-j32568656.xml
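The naming conventions above can be checked with a short script before an ingest. The following is a minimal sketch, not part of the DigiTool tooling: the function name classify and the returned dictionary keys are illustrative assumptions, and the patterns cover only the PDF and metadata names documented here (media file names are not described, so they are not matched).

```python
import re

# Patterns derived from the naming conventions described above.
# Assumption: the job id never contains an underscore.
PDF_NAME = re.compile(
    r"^SZPTHUP_(?P<year>\d{4})_(?P<student_id>\d+)_"
    r"(?P<certificate>CERTIFICATE_)?(?P<job_id>[^_]+)\.pdf$"
)
METADATA_NAME = re.compile(
    r"^SZPTHUP_(?P<month>[A-Z]{3})_(?P<day>\d{2})_(?P<year>\d{4})_"
    r"METADATA_(?P<job_id>[^_]+)\.xml$"
)


def classify(filename):
    """Return a dict describing a delivered file, or None if the name is unrecognised."""
    m = PDF_NAME.match(filename)
    if m:
        # Certificate files are ARCHIVE; thesis PDFs are VIEW (as stated above).
        usage = "ARCHIVE" if m.group("certificate") else "VIEW"
        return {
            "kind": "pdf",
            "usage_type": usage,
            "year": m.group("year"),
            "student_id": m.group("student_id"),
            "job_id": m.group("job_id"),
        }
    m = METADATA_NAME.match(filename)
    if m:
        return {
            "kind": "metadata",
            "upload_date": "{}_{}_{}".format(m.group("month"), m.group("day"), m.group("year")),
            "job_id": m.group("job_id"),
        }
    return None


if __name__ == "__main__":
    for name in (
        "SZPTHUP_2009_260265930_J21080144-j32568656.pdf",
        "SZPTHUP_2009_260265930_CERTIFICATE_J21080144-j32568656.pdf",
        "SZPTHUP_JUN_09_2009_METADATA_J21080144-j32568656.xml",
    ):
        print(name, "->", classify(name))
```

Files for which classify returns None are worth inspecting by hand before the ingest is started.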

A record in the metadata file looks like this:

<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Legume production in semi-arid areas</dc:title>
  <dc:creator>Bourgault, Maryse</dc:creator>
  <dc:contributor>Donald L Smith (Supervisor)</dc:contributor>
  <dc:date>2009</dc:date>
  <dcterms:localdissacceptdate>04/21/2009</dcterms:localdissacceptdate>
  <dcterms:abstract>Context: Approximately one billion people … </dcterms:abstract>
  <dcterms:abstract xml:lang="fr">Contexte : Environ un milliard de personnes … </dcterms:abstract>
  <dc:subject>Agriculture - Plant Physiology</dc:subject>
  <dcterms:localumicode>0817</dcterms:localumicode>
  <dc:language>en</dc:language>
  <dc:type>Electronic Thesis or Dissertation</dc:type>
  <dc:format>application/pdf</dc:format>
  <dc:publisher>McGill University</dc:publisher>
  <dc:rights>© Maryse Bourgault, 2009</dc:rights>
  <dcterms:localthesisdegreename>Doctor of Philosophy</dcterms:localthesisdegreename>
  <dcterms:localthesisdegreediscipline>Department of Plant Science</dcterms:localthesisdegreediscipline>
  <dcterms:localcollectioncode>ETHESIS</dcterms:localcollectioncode>
  <dcterms:localfilename>http://etheses.library.mcgill.ca/SZPTHUP_2009_110026092_J21080144-j32568656.pdf</dcterms:localfilename>
  <dcterms:localdisspagecount>239</dcterms:localdisspagecount>
</record>
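To spot-check a metadata file before an ingest, a record like the one above can be read with Python's standard library. This is a sketch only: the function name summarize and the keys it returns are illustrative assumptions, and the sample is an abbreviated copy of the record shown above.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the sample record above.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}
# ElementTree exposes xml:lang under the fixed XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Legume production in semi-arid areas</dc:title>
  <dc:creator>Bourgault, Maryse</dc:creator>
  <dcterms:abstract>Context: Approximately one billion people … </dcterms:abstract>
  <dcterms:abstract xml:lang="fr">Contexte : Environ un milliard de personnes … </dcterms:abstract>
  <dcterms:localfilename>http://etheses.library.mcgill.ca/SZPTHUP_2009_110026092_J21080144-j32568656.pdf</dcterms:localfilename>
  <dcterms:localdisspagecount>239</dcterms:localdisspagecount>
</record>"""


def summarize(record_xml):
    """Extract a few fields from one <record> element given as a string."""
    record = ET.fromstring(record_xml)

    def text(path):
        element = record.find(path, NS)
        if element is None or element.text is None:
            return None
        return element.text.strip()

    # Abstracts without an xml:lang attribute are filed under "default".
    abstracts = {
        element.get(XML_LANG, "default"): (element.text or "").strip()
        for element in record.findall("dcterms:abstract", NS)
    }
    pages = text("dcterms:localdisspagecount")
    return {
        "title": text("dc:title"),
        "creator": text("dc:creator"),
        "file_url": text("dcterms:localfilename"),
        "page_count": int(pages) if pages else None,
        "abstract_languages": sorted(abstracts),
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

The file URL is stripped of surrounding whitespace because the remote stream is fetched from it during the ingest, so a stray space there is worth catching early.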

The metadata fields have been designed to enable conversion to the required output formats: NDLTD ETD-MS (required by LAC), ProQuest DISS (formerly required for electronic submission to ProQuest, which we no longer do) and DC (for OAI harvesters), as well as to support upload, organization, and display within DigiTool.

2

Ingest:

NB. Before ingesting, make sure that the files have been converted to UTF-8 (see How to Create ISO-8859-1 files to UTF-8) and that the necessary basic editing has been done: handling of leading articles (see the Meditor HOW TO “How to Apply the Non-Filing-Characters Workaround”), removal (or escaping as &amp;, &lt; and &gt;) of reserved XML characters ('&', '<', and '>') within elements, replacement of Windows "smart quotes" with regular quotes, etc.

NB. Until ISR changes the export, the new dcterms:isPartOf element needs to be added, containing the value "Electronically-submitted theses."
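The encoding conversion and character clean-up described above can be sketched in Python. This is an illustration under stated assumptions, not the procedure from the referenced HOW TOs: the function names are hypothetical, the input is assumed to be ISO-8859-1, and reserved characters are escaped rather than removed. clean_element_text must be applied to the text inside an element only, never to a whole file, or the tags themselves would be escaped.

```python
import re

# Windows "smart quotes" and their plain replacements.
SMART_QUOTES = {
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
}

# An ampersand that does not already start an entity or character reference.
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")


def to_utf8(raw):
    """Re-encode ISO-8859-1 bytes as UTF-8 bytes."""
    return raw.decode("iso-8859-1").encode("utf-8")


def clean_element_text(text):
    """Normalise quotes and escape reserved XML characters in element text."""
    for smart, plain in SMART_QUOTES.items():
        text = text.replace(smart, plain)
    text = BARE_AMPERSAND.sub("&amp;", text)
    return text.replace("<", "&lt;").replace(">", "&gt;")


if __name__ == "__main__":
    print(to_utf8(b"caf\xe9"))
    print(clean_element_text("Smith & Jones \u201cA<B\u201d"))
```

If the export turns out to be Windows-1252 rather than ISO-8859-1, the codec name "cp1252" would be the one to use in to_utf8, since the smart-quote bytes are only defined in that code page.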

Log on to http://digitool.library.mcgill.ca:1801/webingest/ and choose MCG02 as the admin unit. Click on the New Ingest Activity tab and fill in the ingest parameters as follows:

Name: Something descriptive, e.g. ‘ethesis upload’ followed by the month and year of upload.

Ingest type: Dublin Core XML file and associated file stream(s).

Scheduling: Use default settings to have the job run right away.

Assign to: This should automatically show your login name.

Note. This field is useful when the person running the ingest is not the person who created it. It is usually not used.

3

Template Task Chain: Remote Stream Linkage

Click the Next > button

You can add further ingest tasks at the Create a New Task Chain step.

Optional step: If the set of upload files does not contain any supplementary information (i.e. there are no CERTIFICATE files), then full-text indexing can be done as part of the ingest job. Click on Full Text Extraction in the All Tasks column and click the right arrow button (center of screen) to add it to the list of Scheduled Tasks in the right-hand panel.

4

Click the Next > button

Parameters

Digital Entity Template: dc_simple_entity_with_url-etheses-media.xml

Processing Instruction File: dc-with_file_stream_view_archive_media.xml

DC file upload: Use the Browse… button to navigate to the location where you’ve stored the converted and edited metadata output file.

Remote Stream Download Store Link Locally: true

Remote Stream Download File Extension: pdf. If there are MEDIA files, enter the file extension for each type of media file present in the load, e.g. if there are mov and mp3 files, enter 'pdf,mov,mp3' in the File Extension box.

If you’ve added the Full Text Extraction task (only done if there are no CERTIFICATE or MEDIA files):

Full Text Extraction Encoding: utf8

Full Text Extraction File Extension: pdf
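The two decisions in the Parameters step (which value to enter in the File Extension box, and whether Full Text Extraction may be part of the ingest) can be derived from the list of delivered file names. The sketch below is an assumption-laden illustration: plan_ingest and its return keys are hypothetical names, any non-PDF, non-XML file is treated as a MEDIA file, and CERTIFICATE files are recognised by the _CERTIFICATE_ marker in the name.

```python
from pathlib import PurePosixPath


def plan_ingest(filenames):
    """Suggest the File Extension value and whether full text can run in the ingest."""
    extensions = ["pdf"]  # pdf is always listed first
    has_certificate = False
    for name in filenames:
        extension = PurePosixPath(name).suffix.lstrip(".").lower()
        if not extension or extension == "xml":
            continue  # the metadata file is not a file stream
        if "_CERTIFICATE_" in name:
            has_certificate = True
        if extension not in extensions:
            extensions.append(extension)
    has_media = len(extensions) > 1
    return {
        "file_extension": ",".join(extensions),
        # Full Text Extraction is only added when there are no CERTIFICATE or MEDIA files.
        "full_text_in_ingest": not (has_certificate or has_media),
    }


if __name__ == "__main__":
    # The media file names below are made up for illustration.
    print(plan_ingest(["thesis.pdf", "clip.mov", "interview.mp3"]))
```

When full_text_in_ingest is False, full-text indexing is run as a separate maintenance job after the ingest instead.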

5

Click the Next > button

Upload Files

File upload is not required because files will be loaded directly using the URLs provided in the metadata file.

6

Click the Activate button

This activates the ingest. You can follow the progress of the ingest in the Folders tab by checking the number of jobs showing up next to the Scheduled, Running and Success (or Failed) folder tabs in the left-hand column.

If the upload succeeded, the completed job will show up at the top of the list in the Success folder, as shown above.

Click on the ingest id (e.g. ing1900) in the Id column.

7

This will display a list of the uploaded items in a separate window.

Check that the icons in the Delivery column are all PDF icons. In some circumstances objects may show up as web icons, which means the ingest failed to correctly associate a mime type of PDF with the file object. If this happens, follow the instructions for failed jobs, below.

Note that the third-last item in the screenshot above has ARCHIVE as a Usage Type. This is how CERTIFICATE files will show up in a successful upload.

8

Failed jobs:

If a job shows up in the Failed jobs folder, or if a successful job failed to correctly associate the PDF mime type with the file objects, highlight the ingest job by clicking on its name in the list of jobs, click on the Full/Split Mode icon at the top of the list of jobs, and then click the Task Log tab in the Log View: pane, which will appear in the lower portion of the screen. This displays the ingest job log, which can help track down why an ingest failed. For example, if there were errors in the metadata load file, they will be listed here.

Click the ingest rollback icon to roll back the ingest. This removes any objects which might have been created before the job failed. The job will now show up in the Not Scheduled folder. Click on the edit icon to verify and adjust all ingest parameters and to re-upload a corrected metadata file, if required. Click on the Save button at the bottom right of the screen, move to the Upload step and click the Finish button at the bottom right of the screen. The job will be moved to the Scheduled jobs folder. Click the Activate icon to rerun the job.

9

Post ingest steps:

A number of post ingest actions must now be carried out. These consist of:

1. Full-text indexing of VIEW objects in DigiTool.

2. Application of accessrights metadata to ARCHIVE objects in DigiTool.

3. Export of eScholarship to flat file.

4. Digitization of corresponding paper waivers supplied by the Thesis Office. These waivers are a legal requirement, entitling us to distribute theses electronically. These paper waivers also include paper forms which must be submitted to ProQuest when electronic submission of theses is carried out.

5. Attaching of digitized waivers to the appropriate thesis in DigiTool. The waivers must be available in DigiTool in order to allow Collection Services to catalogue the eThesis. Cataloguers consult information in the waivers when cataloguing a thesis.

6. Electronic submission of eTheses to ProQuest.

7. Submission of paper waivers to ProQuest.

8. Export of theses to Aleph

NB. Steps 6 and 7 are no longer required as of April 2013. As of that date we are no longer submitting eTheses to ProQuest.

NB. The ingest processing instruction file loads the files associated with a thesis in the following order: PDF, certificate file (waivers, etc.), multimedia file. It then assigns respective usage types of VIEW_MAIN, ARCHIVE, and VIEW. This means that, for a thesis which has a multimedia file but no certificate file, the multimedia file will be assigned a usage type of ARCHIVE. These eTheses therefore need to be opened in Meditor and the usage type for the media file manually changed from ARCHIVE to VIEW.
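The position-based assignment described in the note above can be modelled in a few lines to find the theses that will need a manual fix in Meditor. This is a simplified model of the described behaviour, not DigiTool's implementation; the function names and the kind labels 'pdf', 'certificate' and 'media' are illustrative assumptions.

```python
# Usage types are assigned by load position, not by file kind (per the note above).
USAGE_BY_POSITION = ["VIEW_MAIN", "ARCHIVE", "VIEW"]


def assign_usage_types(kinds):
    """Pair each file kind, given in load order, with the usage type for its position."""
    return list(zip(kinds, USAGE_BY_POSITION))


def needs_manual_fix(kinds):
    """True when a media file would land in the ARCHIVE position."""
    return any(
        kind == "media" and usage == "ARCHIVE"
        for kind, usage in assign_usage_types(kinds)
    )


if __name__ == "__main__":
    print(assign_usage_types(["pdf", "certificate", "media"]))
    print(needs_manual_fix(["pdf", "media"]))
```

A thesis with files of kinds pdf and media only is the case flagged here: the media file takes the second position and so receives ARCHIVE.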

The first two actions are described below. See other HOW TOs for information on the remaining 6 actions.

10

Full-text indexing of VIEW objects in DigiTool.

If the load contained CERTIFICATE files or other files which will receive a usage type of ARCHIVE in addition to eTheses, then the Full Text Extraction task must be run after the ingest has been completed instead of as part of the ingest process.

To run a Full-Text Extraction job:

Log on to http://digitool.library.mcgill.ca:1801/mng/ and choose MCG02 as the admin unit.

1. Click on the Maintenance tab and choose processing in the Maintenance Job Types dropdown in the Submit a new job screen.

2. Click on the Full Text radio button and click the Next button.

11

Job Population: Full Text

Click on the Advanced search link at the top right of the search pane.

Choose Ingest Id in the Find: drop down and in the text box enter the id of the ingest which you have just run (e.g. ‘ing1900’).

Click the + Add Condition link to the right of the text box to add a new search condition.

Choose Usage Type in the Find: drop down and View in the dropdown which will appear under the text box.

Click the Search button.

12

Only files from the ingest job which were assigned a Usage Type of VIEW should be displayed in the search results pane.

Click the Next button.

13

Job additional details: Full Text

Leave the Job additional details: Full Text settings as they are:

Parallel Processes: 1

Encoding: UTF-8

Overwrite Existing Index: checkbox left unchecked.

Click the Next button

14

Confirm job details: Full Text

Make sure that the details provided in the red WARNING! text seem reasonable.

If all seems right, click the Confirm button. Otherwise click the Back button and adjust the search parameters, or click Cancel Job.

15

You can follow the progress of the job by clicking on the Monitor tab.

16

You can also monitor the progress of a job or inspect the results by clicking on the Jobs List tab and choosing either Running or Completed in the Job Status dropdown.

Click on the view icon in the Action column to watch a Running job in progress, or to see the job report for a Completed job.

17

Application of accessrights metadata to ARCHIVE objects

Accessrights metadata must be applied to ARCHIVE objects. ARCHIVE objects are not made publicly available through the eScholarship website. However, the file streams (PDFs, JPGs, etc.) of all objects in DigiTool are delivered directly from the Repository. Therefore, to ensure that unauthorized persons cannot obtain access to the file (which could happen in the unlikely event that the DigiTool PID of an ARCHIVE object becomes known), an accessrights metadata file must be associated with the ARCHIVE object in DigiTool.

To apply accessrights metadata:

Log on to the DigiTool Management module and follow the steps for submitting a new job, as described on page 10 above, choosing Add Metadata as the Maintenance Job Type. Click the Next button.

Follow the steps for Job Population, as described on page 11 above, choosing ARCHIVE as the Usage Type. Click the Next button.

18

Job additional details: Add Metadata

Choose accessrights rights_md in the MD Name & Type: dropdown

Leave the MD File: text box empty

Enter 54759 in the MID: text box

Leave the Shared: dropdown at True

Leave the Filter text boxes empty

Ensure that the Parent Objects Only checkbox is NOT checked.

Click the Next button.

19

Confirm job details: Add Metadata

If the information in the job details message seems reasonable, click Confirm. Otherwise click Back to adjust the job parameters, or click Cancel Job.

You can follow the progress of the job and inspect the job report by following the instructions provided on pages 15 and 16 above.

E. Thomson - LTS Digitization - 2009/08/11 (updated 2013/06/12)