CatConf2001

I name thee Bay of Pe(a)rls : some practical virtues of Perl for cataloguers

Jenny Quilliam

Abstract

With the increasing numbers of aggregated electronic resources, libraries now tend to ‘collect in batches’. These aggregated collections may not be permanent and are subject to frequent and significant content changes. One survival strategy for cataloguers is to ‘catalogue in batches’. While some publishers and vendors are now supplying files of MARC records for their aggregated resources, these often need to be adapted by libraries to include local authentication and access restriction information. Perl (Practical Extraction and Report Language) is an easy-to-learn programming language which was designed to work with chunks of text – extracting, pattern matching / replacing, and reporting. MARC records are just long strings of highly formatted text, and scripting with Perl is a practical way to edit fields, add local information, change subfields, delete unwanted fields etc. – any find-and-replace or insert operation for which the algorithm can be defined. As cataloguers are already familiar with MARC coding and can define the algorithms, learning a bit of Perl means that cataloguers can easily add a few strings of Perls to their repertoire of skills.

Introduction

In reviewing the literature on current and future roles for cataloguers, two major themes emerge: cataloguers need to be outcomes focussed, and new competencies are required to address the challenges in providing bibliographic access control for remote-access online resources. Electronic resources – primarily fulltext electronic journals and fulltext aggregated databases – have significantly improved libraries’ ability to deliver content to users regardless of time and distance. Integrated access means that the library catalogue must reflect all the resources that can be accessed, especially those that are just a few clicks away. Macro cataloguing approaches are needed to deal with the proliferation of electronic resources and the high maintenance load caused by both long-term and temporary content volatility of these resources.

In the United States, the Federal Library and Information Center Committee’s Personnel Working Group (2001) is developing Knowledge, Skills and Abilities statements for its various professional groups. For Catalogers, it has identified abilities including:

• Ability to apply cataloging rules and adapt to changing rules and guidelines
• Ability to accept and deal with ambiguity and make justifiable cataloguing decisions in the absence of clear-cut guidelines
• Ability to create effective cataloging records where little or no precedent cataloguing exists

Anderson (1999) argues that, without decrying the importance of individual title cataloguing, macro-cataloguing approaches to manage large sets of records are essential. Responsibility for managing quality control, editing, loading, maintaining and unloading requires the “Geek Factor”. In a column which outlined skills required for librarians to manage digital collections, Tennant (1999) observed that while digital librarians do not need to be programmers, it is useful to know one’s way around a programming language, and while the specific languages will vary, a “general purpose language such as Perl can serve as a digital librarian’s Swiss Army knife – something that can perform a variety of tasks quickly and easily”.

What is Perl and why is it useful?

Perl is the acronym for Practical Extraction and Report Language. It is a high-level interpreted language optimized for scanning arbitrary text files and extracting, manipulating and reporting information from those text files. Unpacking this statement:

• high-level = humans can read it
• interpreted = doesn’t need to be compiled and is thus easier to debug and correct
• text capabilities = Perl handles text in much the same way as people do

Perl is a low cost (free) scripting language with very generous licensing provisions. To write a Perl script all you need is a text editor, e.g. Notepad or Arachnophilia, as Perl scripts are just plain text files.

Perl is an outcomes focussed programming language – the ‘P’ in Perl means practical and it is designed to get things done. This means that it is complete, easy to use and efficient. Perl uses sophisticated pattern-matching techniques to scan large amounts of data very quickly, and it can do tasks that in other programming languages would be more complex and take longer to write, debug and test. There are often many ways to accomplish a task in Perl.

Perl is optimized for text processing, and this is precisely what is required in creating, editing and otherwise manipulating MARC records. A word of caution: while Perl is more forgiving than many other programming languages, there is a structure and syntax to be observed – in many ways familiar territory to cataloguers who deal with AACR2R and MARC rules, coding and syntax.

Resources for learning Perl

There are many how-to books on Perl. If you have no previous programming knowledge, two introductory texts are Paul Hoffman’s Perl 5 for dummies and Schwartz & Christiansen’s Learning Perl. Both are written in a gentle tutorial style, with comprehensive indexes and detailed tables of contents. Another useful resource is the Perl Cookbook, which contains around 1000 how-to recipes for Perl, giving first the quick answer and then a detailed discussion of the answer to the problem.

For online resources, an Internet search on the phrase ‘Perl tutorial’ yields pages of results. Two examples of beginner level tutorials are Take 10 min to learn Perl and Nik Silver’s Perl tutorial.

How much Perl is needed to manipulate MARC records?

The good news is “not a lot” – there are a number of tools available to deal with the more challenging intricacies of the MARC format: the directory structure and offsets, field and subfield terminators etc. These MARC editing tools (discussed below) allow you to deal with MARC records in a tagged text format rather than as a single string. Not only is a tagged text format much easier for humans to read, but it can be easily updated and manipulated using simple Perl scripts. To create a useful Perl script you need to learn how to open files for reading and writing, and something about control structures, conditionals, and pattern matching and substitution.
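As a rough illustration of how little is involved, the following minimal sketch touches each of the skills just listed: opening files, a loop, a conditional, and a pattern substitution. The helper names edit_line and edit_file are hypothetical, chosen for this example only, and the tagged text layout is assumed to follow the appendices of this paper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: change the GMD on a 245 line, leave other lines alone.
sub edit_line {
    my ($line) = @_;
    if ($line =~ /^245/) {                                    # conditional + pattern match
        $line =~ s/\[computer file\]/[electronic journal]/;   # substitution
    }
    return $line;
}

# Hypothetical helper: copy one file to another, editing each line on the way.
sub edit_file {
    my ($in_name, $out_name) = @_;
    open(my $in,  '<', $in_name)  or die "Can't open $in_name: $!\n";
    open(my $out, '>', $out_name) or die "Can't open $out_name: $!\n";
    while (my $line = <$in>) {                                # control structure: one line per pass
        print {$out} edit_line($line);
    }
    close($in);
    close($out) or die "Can't close $out_name: $!\n";
}

# Run from the command line as: perl edit.pl input.txt output.txt
edit_file(@ARGV) if @ARGV == 2;
```

Keeping the per-line edit in its own subroutine makes it easy to try the rule on a single sample line before running it over a whole file.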

MARC record tools

There are a range of MARC editing tools available for use, and the Library of Congress maintains a listing of MARC Specialized Tools at: http://lcweb.loc.gov/marc/marctools.html

MARCBreaker is a Library of Congress utility for converting MARC records into an ASCII text file format. It has a complementary utility, MARCMaker, which can then be used to reformat from this file format into MARC records. The current version only runs under DOS and Windows 95/98. There is also a companion utility to MARCBreaker/MARCMaker, MarcEdit, developed by Terry Reese (2001). MarcEdit is currently in version 3.0 and has a number of useful editing features including global field addition and deletion. Simon Huggard and David Groenewegen, in their paper ‘E-data management: data access and cataloguing strategies to support the Monash University virtual library’, outline the use of MARCBreaker and MARCMaker to edit record sets for various database aggregates. The Virtual Library of Virginia (VIVA) has also used MARCMaker together with the MARC.pm module to convert and manipulate MARC records for electronic texts.

MARC.pm is a Perl 5 module for preprocessing, converting and manipulating MARC records. SourceForge maintains an informative website for MARC.pm that includes documentation with a few examples. It is a comprehensive module that can convert from MARC to ASCII, HTML, and XML and includes a number of ‘methods’ with options to create, delete and update MARC fields. Using MARC.pm requires a reasonable knowledge of Perl and general programming constructs. MARC.pm is used by the JAKE project to create MARC records. Michael Doran, University of Texas at Arlington, uses MARC.pm together with Perl scripts to preprocess MARC records for netLibrary. A description of this project can be found at: http://rocky.uta.edu/doran/preprocess/process.html

marc.pl is a Perl utility written by Steve Thomas from Adelaide University. It extracts records from a file of MARC records and converts records between standard MARC format and a tagged text representation, in either direction. One of the best features of this utility is the ability to add tags globally to each record by the use of a globals tagged text file. The marc.pl utility with documentation is available for download at: www.library.adelaide.edu.au/~sthomas/scripts/ It uses command line switches to specify the output format and options to include a global file or skip records.

By default, marc.pl creates serial format MARC records (Leader ‘as’), so it is particularly suited to creating records for electronic journals in aggregated databases and publisher collections. The tagged text format required by marc.pl is simple: each field is on a separate line, the tag and indicator information is separated by a space, and subfields are terminated with a single dagger delimiter. Records are separated by a blank line. To use marc.pl it is helpful to know what Perl is, and this is why I first dived [paddled is probably a more accurate verb] into the world of Perl. Once in, though, it is easy to learn enough to write simple Perl scripts.
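The tagged text layout just described is simple enough to read with a few lines of Perl. The sketch below is an illustration only, not part of marc.pl: the subroutine name parse_tagged_text is hypothetical, and it assumes the subfield delimiter appears as ‘|’, as it does in the appendices of this paper.

```perl
use strict;
use warnings;

# Hypothetical helper: split tagged text into records and fields.
# Assumption: '|' is the subfield delimiter, as printed in the appendices.
sub parse_tagged_text {
    my ($text) = @_;
    my @records;
    for my $block (split /\n\s*\n/, $text) {          # a blank line separates records
        my @fields;
        for my $line (split /\n/, $block) {           # one field per line
            next unless $line =~ /^(\w{3}) (.*)$/;    # tag, one space, then the rest
            my ($tag, $rest) = ($1, $2);
            # Text before the first delimiter is the indicators
            # (or the whole content of a control field such as 001).
            my ($lead, @subfields) = split /\|/, $rest;
            push @fields, { tag => $tag, lead => $lead, subfields => \@subfields };
        }
        push @records, \@fields if @fields;
    }
    return @records;
}
```

Each record comes back as a list of fields, so a script can loop over records and test or change a field by its tag without worrying about the MARC directory or offsets.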

Scenarios for Perl scripting with MARC records

Three scenarios where Perl scripting is used for cataloguing purposes:

• Creating brief MARC records from delimited titles lists
• Editing vendor-supplied MARC record files to adapt for local requirements
• Deriving MARC records for ejournals based on the print version.

The Final report of the Program for Cooperative Cataloging’s Task Group on Journals in Aggregator Databases (2000) provides a useful checklist of appropriate tags and content when scripting to either create or derive MARC records. It lists proposed data elements for both machine-generated and machine-derived (i.e. from existing print records) aggregator analytics records.

Depending on whether there is an existing file of MARC records, the record creation/manipulation process steps are:

1. Convert from MARC to tagged text using marc.pl, or capture the vendor’s delimited titles, ISSN, coverage file
2. Edit tagged text using a locally written Perl script
3. Create a globals tagged text file for fields, including a default holdings tag, to be added to each record
4. Convert from tagged text to MARC using marc.pl
5. Load resulting file of MARC records to the library system

Creating brief MARC records from delimited titles lists

When no MARC record set exists for an aggregated database, Perl scripts are used to parse delimited titles, ISSN, coverage and URL information into MARC tagged text. The resulting tagged text file is then formatted to MARC, incorporating a global tagged text file, using marc.pl to create a set of records.

In brief, all the Perl script has to do is to open the input file for reading, parse the information into the appropriate fields, format it as tagged text and write the tags to an output file. This approach has been used to create records for several databases including IDEAL, Emerald, Dow Jones Interactive and BlackwellScience. For some publisher databases, fuller records with subject access have been created by adding one or more subject heading terms for each title in the delimited titles file. Appendix 1 shows the simple Perl script written to process Emerald records. Appendix 2 shows an example of the resulting tagged text together with the global file used for Emerald.

Editing vendor-supplied MARC records

Database vendors now make available files of records for their various aggregated databases. EBSCO Publishing had undertaken a pilot project for the PCC Task Group on Aggregator Databases to derive records for aggregated databases, and their records are freely available to subscribers. When the University of South Australia subscribed to the Ebsco MegaFile offer in late 1999, the availability of full MARC records was regarded as a definite advantage. However, these records required preprocessing to include UniSA-specific information, change the supplied title level URLs to incorporate Digital Island access, and add a second URL for off campus clients. Additional edits include changing the GMD from [computer file] to [electronic journal] and altering subject headings form subdivision coding from ‘x’ to ‘v’. Again, to enable bulk deletion for maintenance purposes, a tag to create a default holding was required.

The Perl scripts for these files do string pattern matching and substitution, or [semi-global] find-and-replace operations. In many cases, these changes could be done with a decent text editor with find/replace capabilities, and if dealing with the records on a one-off basis this is a practical process. However, aggregator databases are notoriously volatile, changing content frequently, and hence the record sets need to be deleted and new files downloaded from the vendor site, edited and loaded to the library system. So it’s worth spending a little time to write a custom Perl editing script. Appendix 3 shows a script to edit Ebsco-sourced records. Until mid-2000, Ebsco did not include publishers’ embargo periods in their MARC records but maintained a separate embargoes page, hence further scripting to incorporate this information was needed. Vendor MARC records are also available for the Gale and Proquest databases.

A variation of this process is also used to preprocess netLibrary MARC records: adding a default holding, adding a second remote authentication URL, and editing the GMD.

Deriving MARC records for ejournals from print records

The third scenario where Perl scripts are used with MARC records is deriving records for the electronic version from existing records. At UniSA we have reworked existing MARC records for print titles to create ejournal records for APAIS FullText. No records were available as ejournals, and as we already had print records for a majority of titles, it was decided to rework these records into ejournal records. Title, ISSN and coverage information was captured from the Informit site and edited into a spreadsheet. During the pre-subscription evaluation process, APAIS FullText titles had been searched against the UniSA catalogue and the bibkeys of existing records noted. MARC records for these titles were exported from the catalogue as tagged text. For the titles not held at UniSA, bibliographic records were captured to file from Kinetica and then converted to tagged text. The ISSN and coverage data was also exported in tab-delimited format from the spreadsheet. By matching on ISSN, the fulltext coverage information could be linked to each title and incorporated into the MARC record.

The records were edited following the PCC’s (2000) proposed data elements for machine-derived records: deleting unwanted fields, and adding and editing fields as needed. A globals file was used to add tag 006 and 007 data, a tag 530 additional physical form note, a 590 local access information note, a 773 Host-item entry for the database, a 710 for the vendor Informit and a default local holdings tag. The Perl script to process records is longer than the earlier examples but no more complex – it just does more deleting, updating and reworking. Appendix 4 shows an example of a print record for APAIS FullText: the original print form, the edited form, the globals file and the final record as an ejournal.
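The ISSN matching step can be sketched with a Perl hash. This is an illustrative outline under stated assumptions rather than the script actually used at UniSA: the subroutine names are hypothetical, the coverage file is assumed to hold one ISSN, a tab, and a coverage statement per line, and the records are assumed to be tagged text with ‘|’ as the subfield delimiter.

```perl
use strict;
use warnings;

# Hypothetical helper: build an ISSN => coverage lookup from tab-delimited lines.
sub load_coverage {
    my (@lines) = @_;
    my %coverage;
    for my $line (@lines) {
        chomp $line;
        my ($issn, $cov) = split /\t/, $line;
        next unless defined $cov;          # skip lines without both columns
        $coverage{$issn} = $cov;
    }
    return %coverage;
}

# Hypothetical helper: find the ISSN in the 022 field of one tagged-text record
# and return the matching coverage statement, or undef when there is no match.
sub coverage_for_record {
    my ($record, $coverage) = @_;
    if ($record =~ /^022 [^|]*\|a(\d{4}-\d{3}[\dXx])/m) {
        return $coverage->{$1};
    }
    return;
}

# Sample data taken from the Appendix 4 example record.
my %cov = load_coverage("0310-2939\tVol. 24- (June 1996-)\n");
my $rec = "001 jaq01-0607\n022 0 |a0310-2939\n245 00|aHabitat Australia|h[electronic journal].\n";
my $statement = coverage_for_record($rec, \%cov);
print "856 41|zSelected fulltext available: $statement\n" if defined $statement;
```

Records whose ISSN is missing from the spreadsheet come back undefined, which gives the script a natural place to log titles that need manual attention.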

Conclusion

While Perl is currently mostly used to deal with the challenges of providing and maintaining MARC records for electronic resources, scripts are also used to post-process original cataloguing for all formats for batch uploading to Kinetica. The uses of Perl in the cataloguer’s toolkit can be many and varied – it is a not-so-little language that can and does! And it’s fun!

Appendix 1 – Perl script to edit Emerald titles file

#!/usr/local/bin/perl
# Script to edit Emerald tab-delimited title file into tagged text
# Entries contain Title, ISSN, Coverage and specific URL
# Written: Jenny Quilliam   Revised: August 2001
# Command line >perl Emerald_RPA.pl [INPUT FILE] [OUTPUT FILE]
#
################################################################################
$TheFile = shift;
$OutFile = shift;
open(INFILE, $TheFile) or die "Can't open Input\n";
open(OUTFILE, ">$OutFile") or die "Can't open Output\n";

# control structure to read and process each line from the input file
while (<INFILE>) {
    s/"//g;    # deleting any quote marks from the string
    $TheLine = $_;
    chomp($TheLine);
    # parsing the contents at the tab delimiters to populate the variables
    ($ISSN, $Title, $Coverage, $URL) = split(/\t/, $TheLine);
    # printing out blank line between records
    print OUTFILE "\n";
    # processing ISSN
    print OUTFILE "022 |a$ISSN\n";
    # processing Title - fixing filing indicators
    # checking for leading The in Title
    if ($Title =~ /^The /) {
        print OUTFILE "245 04|a$Title|h[electronic journal]\n";
    }
    else {
        print OUTFILE "245 00|a$Title|h[electronic journal]\n";
    }
    # processing to generate URL tag with Coverage info
    print OUTFILE "856 40|zFulltext from: $Coverage.";
    print OUTFILE " This electronic journal is part of the Emerald database.";
    print OUTFILE " Access within University network. |u$URL\n";
    # adding generic RPA URL link to all records
    print OUTFILE "856 41|zAccess outside University network.";
    print OUTFILE "|uhttp://librpa.levels.unisa.edu.au/rpa/webauth.exe?rs=emerald\n";
}
close(INFILE);
close(OUTFILE);

Appendix 2 – Global and example of tagged text for Emerald titles

006 m d
007 cr cn-
008 001123c19uu9999enkuu p 0 a0eng d
040 |aSUSA|beng|cSUSA
260 |a[Bradford, England :|bMCB University Press.]
530 |aOnline version of the print publication.
590 |aAvailable to University of South Australia staff and students. Access is by direct login from computers within the University network or by authenticated remote access. Articles available for downloading in PDF and HTML formats.
773 0 |tEmerald
991 |cEJ|nCAE|tNFL
___________________________________________________________________________
001 jaq00-05205
245 00|aAsia Pacific Journal of Marketing & Logistics|h[electronic journal]
022 |a0945-7517
856 40|zFulltext from: 1998. This electronic journal is part of the Emerald library database. Access within University network. |uhttp://www.emeraldinsight.com/094-57517.htm
856 41|zAccess outside University network. |uhttp://librpa.levels.unisa.edu.au/rpa/webauth.exe?rs=emerald

Appendix 3 – Perl script to edit Ebsco sourced records

#!/usr/local/bin/perl
#
# Author: Jenny Quilliam November 2000
#
# Program to edit EbscoHost records [as converted to text using marc.pl]
# GMD to be altered to: electronic journal
# Form subfield coding to be altered to v
# French subject headings to be deleted
# Fix URL to incorporate Digital Island access
# Command line string, takes 2 arguments:
# Command line: mlx> perl EHedit.pl [input filename] [output filename]
#############################################################################
$TheFile = shift;
$OutFile = shift;
open(INFILE, $TheFile) or die "Can't open input\n";
open(OUTFILE, ">$OutFile") or die "Can't open output\n";

while (<INFILE>) {
    $TheLine = $_;
    # processing selected lines only
    # editing the GMD in the 245 field from [computer file] to [electronic journal]
    if ($TheLine =~ /^245/) { $TheLine =~ s/computer file/electronic journal/g; }
    #
    # editing subject headings to fix form subdivision subfield character
    if ($TheLine =~ /^65/) { $TheLine =~ s/xPeriodicals/vPeriodicals/g; }
    # editing out French subject headings
    if ($TheLine =~ /^650 6/) { next }
    # editing URL to add .global to string for Digital Island address
    # (the dot is escaped so that it matches a literal full stop only)
    if ($TheLine =~ /^856/) { $TheLine =~ s/search\.epnet/search.global.epnet/g; }
    # echo the edited line to the screen and write it to the output file
    print $TheLine;
    print OUTFILE $TheLine;
}
close(INFILE);
close(OUTFILE);

Appendix 4 – APAIS FullText examples

Print record

LDR 00824nas 2200104 a 4500
001 dup91000065
008 820514c19739999vrabr p 0 0 0eng d
022 0 $a0310-2939
035 $a(atABN)2551638
035 $u145182
040 $dSIT$dSCAE
043 $au-at---
082 0 $a639.9$219
245 00 $aHabitat Australia.
259 00 $aLC$bP639.9 H116$cv.2, no.1 (Mar. 1974)-
260 01 $aHawthorn, Vic. :$bAustralian Conservation Foundation,$c1973-
300 $av. :$bill. (some col.), maps ;$c28 cm.
362 0 $aVol. 1, no. 1 (June 1973)-
580 $aAbsorbed Peace magazine Australia. Vol 15, no. 4 (Aug. 1987)
650 0 $aNatural resources$xResearch$zAustralia.
650 0 $aConservation of natural resources$zAustralia.
710 20 $aAustralian Conservation Foundation.
780 05 $tPeace magazine Australia$x0817-895X
984 $a2036$cCIT PER 304.2 HAB v.1 (1973)-$cUND PER 304.2 HAB v.1 (1973)-$cMAG PER 333.9506 H116 v.1 (1973)-$cSAL PER 333.705 H11 v.1 (1973)-
EndRecord

Edited record

LDR 00824nas 2200104 a 4500
001 jaq01-0607
008 820514c19739999vrabr p 0 0 0eng d
022 0 |a0310-2939
082 0 |a639.9|219
245 00|aHabitat Australia|h[electronic journal].
260 |aHawthorn, Vic. :|bAustralian Conservation Foundation,|c1973-
362 0 |aVol. 1, no. 1 (June 1973)-
580 |aAbsorbed Peace magazine Australia. Vol 15, no. 4 (Aug. 1987)
650 0|aNatural resources|xResearch|zAustralia.
650 0|aConservation of natural resources|zAustralia.
710 2 |aAustralian Conservation Foundation.
780 05|tPeace magazine Australia|x0817-895X
856 41|zSelected fulltext available: Vol. 24- (June 1996-). Access via Australian public affairs full text.|uhttp://www.informit.com.au
991 |cEJ|nCAE|tNFL

Globals file

006 m d
007 cr anu
040 |aSUSA
530 |aOnline version of the print title.
590 |aAvailable to University of South Australia staff and students. Access is by direct login from computers within the University network or by login and password for remote users. File format and amount of fulltext content of journals varies.
710 2 |aInformit.
773 0 |tAustralian public affairs full text|dMelbourne, Vic. : RMIT Publishing, 2000-.
991 |cEJ|nCAE|tNFL

References

Anderson, B., 1999, ‘Cataloging issues’, paper presented to Technical Services Librarians: the training we need, the issues we face, PTPL Conference 1999. http://www.lib.virginia.edu/ptpl/anderson.html

Christiansen, T. & Torkington, N., 1998, Perl cookbook, O’Reilly, Sebastopol CA.

FLICC Personnel Working Group, 2001, Sample KSAs for Librarian Positions: Catalogers. http://www.loc.gov/flicc/wg/ksa-cat.html

Hoffman, P., 1997, Perl 5 for dummies, IDG Books, Foster City CA.

Huggard, S. & Groenewegen, D., 2001, ‘E-data management: data access and cataloguing strategies to support the Monash University virtual library’, LASIE, April 2001, p.25-42.

Library of Congress’s MARCBreaker and MARCMaker programs available at: http://lcweb.loc.gov/marc/marc/marctools.html

Program for Cooperative Cataloging Task Group on Aggregator Databases, 2000, Final report. http://lcweb.loc.gov/catdir/pcc/aggfinal.html

Reese, T., MarcEdit 3.0 program available at: http://ucs.orst.edu/~reeset/marcedit/index.html

Schwartz, R. & Christiansen, T., 1997, Learning Perl, 2nd ed., O’Reilly, Sebastopol CA.

Silver, N., Perl tutorial. http://fpg.uwaterloo.ca:80/perl/

Take 10 min to learn Perl. http://www.geocities.com/SiliconValley/7331/ten_perl.html

Tennant, R., 1999, ‘Skills for the new millennium’, LJ Digital, January 1, 1999. http://www.libraryjournal.com/articles/infotech/digitallibraries/19990101_412.htm

Thomas, S., marc.pl utility available at: http://www.library.adelaide.edu.au/~sthomas/scripts/

Using MARC.pm with batches of MARC records : the VIVA experience, 2000. [Online] http://marcpm.sourceforge.net/examples/viva.html

Author

Jenny Quilliam
Coordinator (Records)
Technical Services
University of South Australia Library
