Data Warehousing and Mining Data from Library and University Systems for Assessment of Library Operations ENUG Conference Cheng Library, William Paterson University, Wayne, New Jersey, Thursday, October 21, 2010 Ray Schwartz, Systems Specialist Librarian Cheng Library, William Paterson University, Wayne, New Jersey, USA schwartzr2 @ wpunj.edu


Page 1: Data Warehousing and Mining Data from Library and University Systems for Assessment of Library Operations

ENUG Conference
Cheng Library, William Paterson University, Wayne, New Jersey
Thursday, October 21, 2010

Ray Schwartz, Systems Specialist Librarian
Cheng Library, William Paterson University, Wayne, New Jersey, USA
schwartzr2 @ wpunj.edu

Page 2: Outline

• What is Data Mining and Data Warehousing and Why Do We Do It?
• Our Library and University
• Patron Statistical Categories
• Application Server
• Reporting

Page 3: Collecting Transactional Data

• ILSs collect transactional data for circulation and allocation of collection funds.
• ILL and Document Delivery services supply general transactional data.
• Reports from vendor services
  – Bibliographic utilities
  – Subscription agents
  – Book jobbers

Page 4: Collecting Transactional Data cont.

• Most ILSs have search and web server logs
• Most (if not all) Databases have usage reports
• Link Resolver logs
• Proxy Server logs
• Many other ways of collecting transactional data.
  – Gate counts
  – Reference transaction counts
  – Reshelving counts

Page 5: What would we like to see?

• Breakdowns by department and majors.
• Combined usage by department/majors of more than one library service.

Page 6: What is Data Mining and Data Warehousing

• Extracting data from legacy systems and other resources;
• cleaning, scrubbing and preparing data for decision support;
• maintaining data in appropriate data stores;
• accessing and analysing data using a variety of end user tools;
• and mining data for significant relationships.

• Chaffey, D., Mayer, R., Johnston, K., & Ellis-Chadwick, F. (2002). Internet Marketing: Strategy, Implementation and Practice (2nd ed.). Financial Times/Prentice Hall.
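The stages listed above (extract, clean, store, access) can be sketched end to end in a few lines. This is a minimal illustration in Python rather than the Perl used later in the slides. The export text, column names and table name are all hypothetical stand-ins, not taken from any of the systems in this talk.

```python
import csv
import io
import sqlite3

# Hypothetical circulation export; the column names are assumptions.
RAW_EXPORT = """patron_major,item_type,charge_date
 History ,Book,2008-02-01
history,DVD,2008-02-03
,Book,2008-02-04
Nursing,Book,2008-02-05
"""

def extract(text):
    """Extract: read rows from a delimited export of a source system."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Clean: trim whitespace, normalize case, drop rows with no major."""
    cleaned = []
    for row in rows:
        major = row["patron_major"].strip().title()
        if major:
            cleaned.append((major, row["item_type"].strip(), row["charge_date"].strip()))
    return cleaned

def load(conn, rows):
    """Store: keep the prepared rows in a table of their own."""
    conn.execute("CREATE TABLE IF NOT EXISTS circ (major TEXT, item_type TEXT, charge_date TEXT)")
    conn.executemany("INSERT INTO circ VALUES (?, ?, ?)", rows)

def circ_by_major(conn):
    """Access: a simple decision-support query over the stored data."""
    return conn.execute(
        "SELECT major, COUNT(*) FROM circ GROUP BY major ORDER BY major"
    ).fetchall()

conn = sqlite3.connect(":memory:")
load(conn, clean(extract(RAW_EXPORT)))
print(circ_by_major(conn))  # [('History', 2), ('Nursing', 1)]
```

The two differently formatted "History" rows collapse into one group only because of the cleaning step, which is the point of preparing data before it reaches the reporting tools.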

Page 7

• The primary purpose of these efforts is to provide easy access to specifically prepared data that can be used with decision support applications such as management reports, queries, decision support systems, executive information systems and data mining.

• Chaffey, D., Mayer, R., Johnston, K., & Ellis-Chadwick, F. (2002). Internet Marketing: Strategy, Implementation and Practice (2nd ed.). Financial Times/ Prentice Hall.

Page 8: Our University

• 9000 undergraduates
• 1000 graduates (mostly education majors)
• 400 faculty
• 800 adjuncts
• 1000 staff

Page 9: Our Library

• 19 librarians and 26 library staff
• 350,000 volumes
• 18,000 audiovisual items
• 47,000 print and electronic periodicals
• 124 general and subject specific databases
• $1,100,000 Non-Salary Allocations

Page 10: Our Transactions

• 600,000 Database Searches
• 413,000 Gate Counts
• 40,000 Library Materials Circulation
• 34,000 Equipment Circulation
• 19,000 Reference Queries
• 3,000 Interlibrary Loans
• 5,000 Documents Delivered

Page 11: Our Systems

• Voyager ILS
• Clio ILL Software
• EZProxy Server
• Banner – University ERP
• University Networked Drive K:
• University Email Server
• University Web Server

Page 12: Vendor Services

• Serials Solutions
  – A to Z list
  – MARC Record Service
  – Link Resolver
• OCLC – Bibliographic Utility
  – WorldCat Collection Analysis
• Coutts (was Blackwell) – Book Jobber
• EBSCO – Subscription Agent
• Marcive – Authority Control
• Database Vendors

Page 13: Email Reports from the ILS

Page 14: Voyager Overdue and Fine Notices - Daily

Page 15: Quarterly Extract for Serials Solutions AtoZ Service

Page 16: Which categories of patrons are accessing which services?

Page 17: First Step – Patron Statistical Categories

Page 18

• Voyager Patron Database allows a maximum of 10 statistical categories per patron record.
• Decide which statistical categories are needed for each patron group defined.
• Work with your University Information Systems Department to extract the relevant data from the relevant sources.

Page 19: Groups and Services

Groups
• Major
• Status
  – Undergrad or Grad
  – Faculty, Adjunct Faculty or Staff
• Department
• College
• Degree
• No. of Credits
• Year of Study
• Campus Location

Services
• Circulation
  – Books
  – Media
  – Reserve
  – By Fund Code
  – Location
• ILL / Document Delivery
• Databases
• Library Web Pages
  – Subject Area Resource Guides
  – Reference Requests
• Catalog
• Other Vendor Services
  – Serials Solutions

Page 20: History Department - 12 months - Feb. 2008

PATRON STATUS           BOOK CIRC  MEDIA CIRC  EQUIP CIRC  TOTAL CIRC  MEMBERS  BORROWERS  % BORROWING  CIRC/MEMBER  CIRC/BORROWER
UNDERGRADUATE STUDENTS      2,715         250         698       3,663      238        186          78%        15.39          19.69
GRADUATE STUDENTS             419          13          76         508       14         13          93%        36.29          39.08
ADJUNCT FACULTY               100          65          20         185       32         20          63%         5.78           9.25
FULL-TIME FACULTY             159         115         194         468       24         23          96%        19.50          20.35
HISTORY TOTALS              3,393         443         988       4,824      308        242          79%        15.66          19.93
LIBRARY TOTALS             23,370       8,713      20,703      52,756    7,418      4,981          67%         7.11          10.59

DEFINITIONS:

BOOK CIRCULATION = books, book disks, maps, oversize, Curriculum materials, reserve books, NJ History, Leisure Lounge

MEDIA CIRCULATION = audio & video materials, including media reserves

EQUIPMENT CIRCULATION = camcorders, overhead & data projectors, laptops, easels, DVD players, etc.

MEMBER = declared major or department member

BORROWER = any member who borrowed materials

Library Total = declared undergrad & grad majors, adjuncts & full time faculty borrowers
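The three ratio columns in this table appear to be derived from the TOTAL CIRC, MEMBERS and BORROWERS counts. A minimal Python sketch of that arithmetic follows. The function name is hypothetical, and the rounding rules (whole percent rounded half up, ratios to two decimals) are assumptions chosen to match the figures shown.

```python
def usage_ratios(total_circ, members, borrowers):
    """Return (% borrowing, circ per member, circ per borrower)."""
    # Whole-number percent, rounded half up (assumed rounding rule).
    pct_borrowing = int(100 * borrowers / members + 0.5)
    circ_per_member = round(total_circ / members, 2)
    circ_per_borrower = round(total_circ / borrowers, 2)
    return pct_borrowing, circ_per_member, circ_per_borrower

# HISTORY TOTALS row: 4,824 total circulations, 308 members, 242 borrowers.
print(usage_ratios(4824, 308, 242))  # (79, 15.66, 19.93)
```

Circulation per member shows how heavily a department uses the collection overall, while circulation per borrower shows how intensively the active borrowers use it.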

Page 21: Communications Majors FY08/09

Columns: Statistical Categories // Item Type / Location / Call No Type / Call No | Communications Majors | Freshman | Sophomore | Junior | Senior

M- DVD / Media Services / Other / DVD: 194 17 31 52 94
M- VideoCass / Media Services / Other / VC: 228 11 40 67 110
T- Book / 2nd Floor - Circulating / Library of Congress / B: 34 9 8 11 6
T- Book / 2nd Floor - Circulating / Library of Congress / BD: 3 1 2
T- Book / 2nd Floor - Circulating / Library of Congress / BF: 30 5 5 12 8
...
2nd Floor Circulating: 1531 222 310 403 596
T- Juvenile / CMC /: 125 14 26 20 35
T- NJDoc / Askew Documents Room / Other /: 1 1
New Jersey History: 10 0 2 7 1
T- ReserveBk / Reserves Desk /: 189 13 46 68 62
T- SpecColl / Special Collection / Library of Congress / LC: 3 3
T- Book-McNaughton / Leisure Lounge / Library of Congress / F: 2 1 1
T- Book-McNaughton / Leisure Lounge / Library of Congress / HF: 1 1
T- Book-McNaughton / Leisure Lounge / Library of Congress / HS: 2 2
T- Book-McNaughton / Leisure Lounge / Library of Congress / HV: 5 1 2 2
T- Book-McNaughton / Leisure Lounge / Library of Congress / ML: 1 1
T- Book-McNaughton / Leisure Lounge / Library of Congress / PN: 3 3
T- Book-McNaughton / Leisure Lounge / Library of Congress / PS: 29 4 10 15
T- Book-McNaughton / Leisure Lounge / Library of Congress / RC: 2 1 1
T- Book-McNaughton / Leisure Lounge / Library of Congress / TL: 1 1
Leisure Lounge: 49 9 1 19 20

Page 22: Challenges with combining data from various services

• Little to no linkage of data
• Multiple user IDs for authentication

Page 23: Second Step – Set Up an Application Server

Page 24: What is an Application Server?

• A machine or its software that works in conjunction with a web server to deliver application services such as the dynamic creation of a webpage from content stored in a database. From http://www.webtools.ca.gov/help/Glossary.asp
• Web Server Software (Apache or IIS)
• Database Management System – DBMS (MySQL, Oracle, MS SQL Server)
• Scripting Language (Perl, PHP, ColdFusion, ASP)
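The three tiers named above can be illustrated with the Python standard library alone. This is a sketch only: SQLite stands in for the DBMS, a WSGI callable stands in for the script that the web server invokes, and the table and its rows are hypothetical. The production setup described in these slides uses other components.

```python
import sqlite3
from html import escape

# Stand-in for the DBMS tier; table and column names are hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reports (title TEXT, schedule TEXT)")
db.executemany(
    "INSERT INTO reports VALUES (?, ?)",
    [("Reserves Overdues", "Daily"), ("ILL Lending Reports", "Weekly")],
)

def render_reports_page(conn):
    """Scripting tier: build an HTML page from rows stored in the database."""
    rows = conn.execute("SELECT title, schedule FROM reports ORDER BY title").fetchall()
    items = "".join(f"<li>{escape(t)} ({escape(s)})</li>" for t, s in rows)
    return f"<html><body><h1>Reports</h1><ul>{items}</ul></body></html>"

def app(environ, start_response):
    """Web tier: a WSGI callable that a web server would invoke per request."""
    body = render_reports_page(db).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]

# To serve it locally with the standard library:
#   from wsgiref.simple_server import make_server
#   make_server("localhost", 8000, app).serve_forever()
```

The page is rebuilt from the database on every request, which is what "dynamic creation of a webpage from content stored in a database" means in practice.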

Page 25: Why an Application Server?

• Relevant data in logfiles need to be in a database to be analyzed.
• Need your own DBMS to create new tables and queries.

Page 26

• Decide how you will use the Application Server.

• Decide on the best and most plausible configuration.

Page 27

Authentication for ILL and other forms is routed through the EZProxy server

Page 28: Daily and Weekly Email Reports from the Application Server

Circ Fines Audit Daily Report - Daily at 6:05 AM.
Dupe Patron Record Report - Daily at 5:56 AM.
Hobart Media Services Equipment Pickup Summary - Daily at 6:58 AM.
Media Service Scheduling Rooms Report - Daily at 6:02 AM.
Media Services Equipment Pickup Summary - Daily at 7:00 AM.
Received Title Alert - Daily at 6:59 AM.
Reserves Overdues - Daily at 5:59 AM.
Scheduled LIS Tasks - Daily at 6:00 AM.
ILL Borrowing Overdues Report - Weekly at 5:59 AM.
ILL Lending Reports - Weekly at 6:15 AM.

Page 29: Monthly Email Reports from the Application Server

Circ Fines Audit - Monthly at 6:10 AM.
Circulation by Location and Item Type - Monthly at 6:21 AM.
Circulation Lost and Paid - Monthly at 6:25 AM.
Circulation Online Renewal Count - Monthly at 6:30 AM.
Media Circulation - Monthly at 6:35 AM.
Reserve Circulation - Monthly at 6:40 AM.
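A scheduled report of this kind usually amounts to a query whose rows are formatted into an email body. The Python sketch below shows only the formatting step. The rows, counts and addresses are made-up placeholders, and how the actual reports are produced and scheduled on the application server is not shown in these slides.

```python
from email.message import EmailMessage

def build_report_email(title, rows, sender, recipient):
    """Format query results as a plain-text email; sending is left to the caller."""
    msg = EmailMessage()
    msg["Subject"] = title
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"{location:<28}{item_type:<12}{count:>6}" for location, item_type, count in rows]
    msg.set_content("\n".join([title, ""] + lines))
    return msg

# Made-up rows, shaped like a "Circulation by Location and Item Type" result.
sample_rows = [("2nd Floor - Circulating", "Book", 120), ("Leisure Lounge", "Book", 45)]
message = build_report_email(
    "Circulation by Location and Item Type",
    sample_rows,
    "appserver@example.edu",  # placeholder address
    "circdesk@example.edu",   # placeholder address
)
print(message.get_content())

# Delivery would use smtplib, triggered by a scheduler such as cron, e.g.:
#   import smtplib
#   with smtplib.SMTP("localhost") as smtp:
#       smtp.send_message(message)
```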

Page 31: On Demand Reports

Page 32: Lending Services Reports

Lists of patrons with fines between $10 and $19.99
• Student and Alumni fines list - Sorted by either Name, Amount or Notice Date.
• PALS and Courtesy Patron fines list - Sorted by Name.
• All other Patron fines list - Sorted by Name.

Lists of patrons with fines over $19.99
• Student and Alumni fines list - Sorted by either Name, IID, Amount, Notice Date or Notes.
• PALS and Courtesy Patron fines list - Sorted by Name.
• VALE Patron fines list - Sorted by Name.
• All other Patron fines list - Sorted by Name.

Lists of patrons with overdues older than 30 days
• Student and Alumni overdues list - Sorted by either Name, IID or Notes.
• PALS and Courtesy Patron overdues list - Sorted by Name.
• All other Patron overdues list except VALE - Sorted by Name.
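An on-demand list such as "fines between $10 and $19.99, sorted by Name or Amount" reduces to a range query with a selectable sort column. The Python sketch below runs against a hypothetical table with invented patrons. It does not reflect the ILS schema, which the slides do not show. Balances are held in cents to avoid floating-point comparisons.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and rows; not the ILS schema.
conn.execute("CREATE TABLE patron_fines (name TEXT, patron_group TEXT, balance_cents INTEGER)")
conn.executemany(
    "INSERT INTO patron_fines VALUES (?, ?, ?)",
    [
        ("Rivera, Ana", "Student", 1250),
        ("Chen, Li", "Student", 1999),
        ("Okafor, Ben", "Alumni", 2500),
        ("Adams, Jo", "Student", 975),
    ],
)

def fines_between(conn, low_cents, high_cents, order_by="name"):
    """List patrons whose balance is in [low, high], sorted by a whitelisted column."""
    if order_by not in {"name", "balance_cents"}:
        # Column names cannot be bound as parameters, so only known names are allowed.
        raise ValueError(f"unsupported sort column: {order_by}")
    return conn.execute(
        "SELECT name, balance_cents FROM patron_fines "
        f"WHERE balance_cents BETWEEN ? AND ? ORDER BY {order_by}",
        (low_cents, high_cents),
    ).fetchall()

print(fines_between(conn, 1000, 1999))  # [('Chen, Li', 1999), ('Rivera, Ana', 1250)]
```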

Page 33: Lending Services Reports, cont.

Lists of VALE patrons with overdues older than 6 months
• VALE patron overdues list - Sorted by Name.

Miscellaneous Reports
• Patrons with the word "Collection Agency" or "CA" in their notes.
• Patrons with the word "FINE" in one of their notes.
• Patrons with the word "SOILS" in their notes.
• Patrons with the word "FALL07 SOILS" in their notes.
• Patrons with the word "HOLD" in their notes.
• Combined list of HOLD, FINE, and CA.

Circulation Reports by Item Type from 2003 to the present
• All Staff.
• All Colleges
• Undergraduates by Major.
• Graduates by Major
• Patrons that have reached a total fine balance of $10 or more after 31-Dec-2009 and 30-Nov-2009

Page 34: One of Our Projects

• Mining EZProxy logfiles and linking to patron statistical categories from the Voyager Patron Database
  – What majors and departments are accessing which database services?
  – What majors and departments are accessing the ILL services?

Page 35: ILL request form authentications by major

Book Count  Major
90  M- History
28  M- Non-Degree
25  M- Pub Pol & Intl Affairs
20  M- Spanish
18  M- English
16  M- Undecided
14  M- Art
14  M- Education
11  M- Sociology
10  M- Biology
9   M- Music
9   M- Special Programs
8   M- Psychology
7   M- Biotechnology
7   M- Political Science
6   M- Anthropology
6   M- Music - Jazz Studies
4   M- Business
4   M- Communication
4   M- Nursing

Article Count  Major
62  M- Psychology
60  M- Sociology
42  M- Applied Clinical Psych
35  M- Education
31  M- History
30  M- Spanish
29  M- Nursing
19  M- Communication Disorders
19  M- Communication
14  M- Biotechnology
14  M- Counseling
14  M- English
12  M- Non-Degree
10  M- Community/Sch Health
7   M- Biology
7   M- Political Science
6   M- Undecided
5   M- Comm Media Studies
5   M- Reading
4   M- Business

Page 36: Which Databases are accessed by Majors and Departments?

Page 37: By Major and Host

Major                      Count  Host
M- Nursing                  3377  ebscohost.com
M- Non-Degree               3010  ebscohost.com
M- Psychology               2303  ebscohost.com
M- Counseling               1487  ebscohost.com
M- Communication            1359  ebscohost.com
M- Education                1267  ebscohost.com
M- Business                 1246  proquest.umi.com
M- Sociology                1152  ebscohost.com
M- Business                 1145  lexis-nexis.com
M- Undecided                1100  ebscohost.com
M- Applied Clinical Psych   1075  ebscohost.com
M- English                  1034  ebscohost.com
M- Sociology                 916  csa.com
M- Business                  794  ebscohost.com
M- Accounting                738  lexis-nexis.com
M- Reading                   683  ebscohost.com
M- Physical Education        653  ebscohost.com
M- Special Programs          600  ebscohost.com
M- Non-Degree                463  ereserve.wpunj.edu
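A count by major and host like the one above can be produced by reducing each requested URL to its host and tallying (major, host) pairs. The Python sketch below assumes each log hit has already been linked to a major and that the target URL has been extracted. The sample records are invented. It groups by the full hostname, whereas the slide uses shortened labels such as ebscohost.com, whose shortening rule is not shown.

```python
from collections import Counter
from urllib.parse import urlsplit

def count_by_major_and_host(records):
    """Tally (major, host) pairs from (major, url) records, most frequent first."""
    counts = Counter()
    for major, url in records:
        host = urlsplit(url).hostname
        if host:  # skip entries that are not absolute URLs
            counts[(major, host)] += 1
    return counts.most_common()

# Invented records: (patron major, requested URL).
records = [
    ("M- Nursing", "http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=c8h"),
    ("M- Business", "http://www.lexis-nexis.com/universe"),
    ("M- Nursing", "http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=c8h"),
    ("M- Nursing", "webscript.exe?fs.scr"),
]
for (major, host), n in count_by_major_and_host(records):
    print(major, n, host)
```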

Page 38: By Dept and Host

Department               Count  Host
S- Information Systems     933  webscript.exe?fs.scr
S- Psychology Dept.        742  ebscohost.com
S- Accounting and Law      559  lexis-nexis.com
S- Political Sci Dept.     308  lexis-nexis.com
S- Nursing Dept.           204  ebscohost.com
S- Market & Mgt. Dept.     175  proquest.umi.com
S- Library                 167  ebscohost.com
S- Sociology Dept.         151  ebscohost.com
S- Sociology Dept.         134  csa.com
S- History Dept.           121  serials.abc-clio.com
S- Exercise & Mov Sci      110  ebscohost.com
S- Political Sci Dept.     104  ebscohost.com
S- Library                 103  ILL_article.cfm
S- Library                 100  webscript.exe?fs.scr
S- History Dept.            94  webscript.exe?fs.scr

Page 39: By Dept and Service

Department  Count  Service
S- Information Systems  933  http://www.wpunj.edu/scripts/webscript.exe?fs.scr
S- Accounting and Law  549  http://www.lexis-nexis.com/universe
S- Psychology Dept.  364  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=psych
S- Nursing Dept.  114  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=c8h
S- Sociology Dept.  96  http://www.csa.com/htbin/dbrng.cgi?&db=socioabs-set-c&adv=1
S- Sociology Dept.  75  http://search.ebscohost.com/login.asp?profile=asp
S- Philosophy Dept.  74  http://webspirs4.silverplatter.com:8900/c119646?sp.form.first.p=srchmain.htm&sp.dbid.p=S(PHIL
S- Library  65  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=asp
S- Anthropology Dept.  62  http://www.sciencedirect.com/
S- History Dept.  61  http://serials.abc-clio.com/active/start?_appname=serials&initialdb=AHL
S- Psychology Dept.  61  http://search.ebscohost.com/login.asp?profile=psyart
S- History Dept.  58  http://serials.abc-clio.com/active/start?_appname=serials&initialdb=HA
S- Psychology Dept.  54  http://search.ebscohost.com/login.asp?profile=psych
S- Psychology Dept.  42  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=psyart
S- English Dept.  42  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=mzh

Page 40

Admin VLANs                    Labs VLANs
VLAN ID  VLAN Name             VLAN ID  VLAN Name
2        Servers               3        Lab Servers
4        Admin                 9        Imaging
5        Science               160      Lib Labs
6        Test Servers          174      STU VPN
7        NAS                   175      Ben Shahn Lab
101      Energy Management     178      Hobart Lab
102      Diebold               179      SCI Lab
104      Xerox                 187      CS Lab
150      Media Services        192      Atrium
161      Dorms Offices         209      Labs
162      RBI                   212      Resnet Labs
163      Police                214      Raub Labs
164      Maintenance           228      VR Labs

IP Address Location = 149.151.VlanID.*
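Because the third octet of a campus address is the VLAN ID, an on-campus hit can be mapped to a location directly from its IP address. A minimal Python sketch follows, using a subset of the table above. The function name is hypothetical.

```python
import ipaddress

# Subset of the VLAN table above; extend as needed.
VLAN_NAMES = {2: "Servers", 4: "Admin", 150: "Media Services",
              160: "Lib Labs", 178: "Hobart Lab", 212: "Resnet Labs"}

def vlan_for(ip):
    """Return (vlan_id, vlan_name) for a 149.151.x.x address, else None."""
    octets = ipaddress.IPv4Address(ip).packed  # four bytes, one per octet
    if octets[0] != 149 or octets[1] != 151:
        return None  # off campus, or not in the campus range
    vlan_id = octets[2]
    return vlan_id, VLAN_NAMES.get(vlan_id, "unknown")

print(vlan_for("149.151.160.27"))  # (160, 'Lib Labs')
print(vlan_for("10.0.0.1"))        # None
```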

Page 41: FY08/09 On Campus Hits to Databases by Class C IP Address

Page 42: Patron Privacy and Standards

Page 43: Using Voyager as the model for Patron Privacy

Page 44

• Active Circ transactions are stored in a table with patron ID and statistical categories.

• Completed Circ transactions are stored in a table without the patron ID, but still with the patron statistical categories.

• The Patron Table contains the total counts of transactions for each patron, but no link to which transactions they are.
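The three-table arrangement described above can be sketched as follows in Python with SQLite. The table and column names are hypothetical and are not Voyager's actual schema. Incrementing the patron's total at checkout rather than at check-in is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patron (patron_id INTEGER PRIMARY KEY, stat_cat TEXT,
                     total_charges INTEGER DEFAULT 0);
CREATE TABLE circ_active (item_id INTEGER, patron_id INTEGER, stat_cat TEXT);
CREATE TABLE circ_archive (item_id INTEGER, stat_cat TEXT);  -- no patron_id column
""")

def charge(conn, item_id, patron_id):
    """Check out: the active transaction carries the patron ID."""
    (stat_cat,) = conn.execute(
        "SELECT stat_cat FROM patron WHERE patron_id = ?", (patron_id,)).fetchone()
    conn.execute("INSERT INTO circ_active VALUES (?, ?, ?)", (item_id, patron_id, stat_cat))
    conn.execute("UPDATE patron SET total_charges = total_charges + 1 WHERE patron_id = ?",
                 (patron_id,))

def discharge(conn, item_id):
    """Check in: archive the transaction without the patron ID, then drop the active row."""
    conn.execute("INSERT INTO circ_archive SELECT item_id, stat_cat FROM circ_active "
                 "WHERE item_id = ?", (item_id,))
    conn.execute("DELETE FROM circ_active WHERE item_id = ?", (item_id,))

conn.execute("INSERT INTO patron (patron_id, stat_cat) VALUES (1, 'M- History')")
charge(conn, 1001, 1)
discharge(conn, 1001)
print(conn.execute("SELECT * FROM circ_archive").fetchall())        # [(1001, 'M- History')]
print(conn.execute("SELECT total_charges FROM patron").fetchall())  # [(1,)]
```

After check-in, the archive can still answer "how many History majors borrowed this item type" while nothing in it identifies the individual borrower.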

Page 45

• EZProxy transactions would be stored in one table with patron statistical categories, but without the user ID.

• User IDs would be stored in another table with counts for each service divided by academic year.

• Logs are collected, loaded, and deleted monthly.
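The two-table split proposed above might look like the following Python sketch. All table and column names are hypothetical, the July 1 start of the academic year is an assumption, and the sample hit is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ezproxy_hits (hit_time TEXT, host TEXT, stat_cat TEXT);  -- no user ID
CREATE TABLE user_counts (user_id TEXT, academic_year TEXT, service TEXT, hits INTEGER,
                          PRIMARY KEY (user_id, academic_year, service));
""")

def academic_year(date_iso):
    """Map an ISO date to a label such as FY08/09, assuming the year starts July 1."""
    year, month = int(date_iso[:4]), int(date_iso[5:7])
    start = year if month >= 7 else year - 1
    return f"FY{start % 100:02d}/{(start + 1) % 100:02d}"

def record_hit(conn, user_id, stat_cat, hit_time, host):
    """Store the transaction without the user ID and bump that user's yearly count."""
    conn.execute("INSERT INTO ezproxy_hits VALUES (?, ?, ?)", (hit_time, host, stat_cat))
    key = (user_id, academic_year(hit_time), host)
    conn.execute("INSERT OR IGNORE INTO user_counts VALUES (?, ?, ?, 0)", key)
    conn.execute("UPDATE user_counts SET hits = hits + 1 "
                 "WHERE user_id = ? AND academic_year = ? AND service = ?", key)

record_hit(conn, "theuser", "M- Nursing", "2008-01-01 04:25:15", "ebscohost.com")
record_hit(conn, "theuser", "M- Nursing", "2008-01-02 10:00:00", "ebscohost.com")
print(conn.execute("SELECT * FROM user_counts").fetchall())
```

Neither table alone links a person to a specific transaction: the hits table has categories but no identity, and the counts table has identity but only totals.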

Page 46: Example of EZProxy log entry

• IP address: nj.dhcp.embarqhsd.net
• (Not used): -
• User ID: theuser
• Date/time: 1/1/2008 4:25:15 AM
• Method: GET
• Page retrieved: http://ezproxy.wpunj.edu:2048/connect?session=sGHMbeSss121YxZa&url=http://www.wpunj.edu/scripts/webscript.exe?fs.scr
• Version: HTTP/1.1
• Response code: 302
• No. of bytes: 537
• Referring URL: http://ezproxy.wpunj.edu:2048/login?url=http://www.wpunj.edu/scripts/webscript.exe?fs.scr
• User agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)

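Page 47: Perl Script for loading EZProxy log into MySQL

use strict;
use warnings;

# Month abbreviations as they appear in the log, mapped to two-digit numbers.
my %month = (Jan=>'01', Feb=>'02', Mar=>'03', Apr=>'04', May=>'05', Jun=>'06',
             Jul=>'07', Aug=>'08', Sep=>'09', Oct=>'10', Nov=>'11', Dec=>'12');

# Four leading fields, the bracketed timestamp, the quoted request line,
# status, byte count, then the quoted referrer and user agent.
my $pattern = '^(\S*) (\S*) (\S*) (\S*) '.
              '\[(..)\/(...)\/(....):(..):(..):(..) .....\]'.
              ' "(\S*) (\S*) (\S*)" '.
              '(\d*) (-|\d*) "([^"]*)" "([^"]*)"';

# Read log lines from the files named on the command line (or STDIN)
# and print one INSERT statement per matching line.
while (<>) {
    if (m/$pattern/) {
        my ($tgt, $ref, $agt) = (esc($12), esc($16), esc($17));
        # A byte count logged as '-' becomes SQL NULL.
        my $byt = $15 eq '-' ? 'NULL' : $15;
        print "INSERT INTO ezproxylogs VALUES ('$1','$2','$3',".
              " TIMESTAMP '$7/$month{$6}/$5 $8:$9:$10','$11','$tgt',".
              "'$13',$14,$byt,'$ref','$agt');\n";
    } else {
        # MySQL requires a space after "--" for a comment.
        print "-- Skipped line $.\n";
    }
}

# Double any single quotes so the value is safe inside an SQL string literal.
sub esc {
    my ($p) = @_;
    $p =~ s/'/''/g;
    return $p;
}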
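For comparison, the same parse-and-load step can be written with bound parameters, so that no quoting of values is needed. The Python sketch below mirrors the field layout of the Perl pattern (four leading fields, of which the fourth is not stored) and loads into SQLite as a stand-in for MySQL. The sample line, its fourth field and its time zone are invented for illustration. Month names are parsed with `%b`, which assumes an English locale.

```python
import re
import sqlite3
from datetime import datetime

LOG_PATTERN = re.compile(
    r'^(\S*) (\S*) (\S*) (\S*) '
    r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [^\]]*\] '
    r'"(\S*) (\S*) (\S*)" '
    r'(\d*) (-|\d*) "([^"]*)" "([^"]*)"'
)

def parse_line(line):
    """Return an 11-field tuple for one log line, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    (host, ident, user, _unused, stamp, method, target,
     version, status, size, referrer, agent) = m.groups()
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S").isoformat(sep=" ")
    return (host, ident, user, when, method, target, version,
            int(status) if status else None,
            None if size in ("-", "") else int(size),
            referrer, agent)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ezproxylogs (host, ident, user, hit_time, method, target, "
             "version, status, bytes, referrer, agent)")

# Invented sample line in the layout the pattern expects.
sample = ('nj.dhcp.embarqhsd.net - theuser - [01/Jan/2008:04:25:15 -0500] '
          '"GET http://ezproxy.wpunj.edu:2048/login?url=http://www.wpunj.edu/ HTTP/1.1" '
          '302 537 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')
row = parse_line(sample)
if row is not None:
    # Placeholders let the driver handle quoting, including apostrophes in URLs.
    conn.execute("INSERT INTO ezproxylogs VALUES (?,?,?,?,?,?,?,?,?,?,?)", row)
print(conn.execute("SELECT user, hit_time, status, bytes FROM ezproxylogs").fetchall())
```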

Page 48: Created table to assist the linking

SELECT PATRON_ADDRESS.ADDRESS_TYPE,
       Left([ADDRESS_LINE1], InStr([ADDRESS_LINE1], "@") - 1) AS usr,
       PATRON_ADDRESS.PATRON_ID,
       PATRON_ADDRESS.ADDRESS_STATUS,
       PATRON_ADDRESS.EFFECT_DATE,
       PATRON_ADDRESS.EXPIRE_DATE,
       PATRON_ADDRESS.MODIFY_DATE,
       PATRON_ADDRESS.MODIFY_OPERATOR_ID
INTO emailprefix
FROM PATRON_ADDRESS
WHERE (((PATRON_ADDRESS.ADDRESS_TYPE)="3"));
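The query above derives a login name from the part of each patron's email address before the "@", which is what lets a log entry's user ID be matched to a patron record. A Python sketch of the same idea follows. The sample rows are invented, and lower-casing both sides before matching is an assumption. Unlike Left(..., InStr(...) - 1), which fails when there is no "@", the helper returns None for such addresses.

```python
def email_prefix(address):
    """Return the part of an email address before '@', or None if there is no '@'."""
    prefix, sep, _domain = address.partition("@")
    return prefix.lower() if sep and prefix else None

# Invented rows: (PATRON_ID, ADDRESS_LINE1) for email-type addresses.
patron_emails = [(101, "theuser@student.example.edu"), (102, "no-at-sign")]

prefix_to_patron = {}
for patron_id, address in patron_emails:
    prefix = email_prefix(address)
    if prefix is not None:
        prefix_to_patron[prefix] = patron_id

def patron_for_login(user_id):
    """Look up the patron ID for a proxy login name, or None if unmatched."""
    return prefix_to_patron.get(user_id.lower())

print(patron_for_login("theuser"))  # 101
```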

Page 49: Reporting and Standards

• Reporting
  – Emailed periodically, e.g., daily dossiers, and other event-triggered reports.
  – On demand, via email, web pages or a printer.
• Standards
  – Share data for comparative research.
  – Groups of libraries and consortia

Page 53: Questions?

Ray Schwartz, Systems Specialist Librarian
Cheng Library, William Paterson University, Wayne, New Jersey, USA
schwartzr2 @ wpunj.edu