Transcript of Introduction to Data Ingest using the Harvester, 2012 VIVO Implementation Fest.

Introduction to Data Ingest using the Harvester

2012 VIVO Implementation Fest

Welcome & Who are we?

Vincent Sposato, University of Florida
Enterprise Software Engineering
Primarily focused on VIVO operations and reproducible harvests

Eliza Chan, Weill Cornell Medical College
Information Technologies and Services (ITS)
Primarily focused on VIVO customization (content / branding / ontology) and data ingest

John Fereira, Cornell University
Mann Library Information Technology Services (ITS)
Programmer / Analyst / Technology Strategist


Goals of this session

• Provide you with the basics of harvester functionality

• Provide you with a brief overview of the standard harvest process

• Answer questions

Harvester Basics

What is the harvester?

• A library of ETL tools written in Java for:
  – Extracting data from external sources,
  – Transforming it into RDF in the VIVO schema,
  – and Loading it into the VIVO application

• A way to build automated and reproducible data ingests to get data into your VIVO application

What the harvester is

• A useful set of tools for data ingest and manipulation for semantic datastores

• An example of ingesting data into VIVO in real life scenarios

• An open source community developed solution

What the harvester is not

• A one button solution to all of your data ingest problems

• Perfect

A simple data ingest workflow

Fetch → Translate → Score & Match → Transfer

Fetch

• First step in the harvest process

• Brings data from the external source to the harvester

• Fetch classes:
  – OAIFetch – pull from Open Archives Initiative repositories
  – PubmedFetch – pull publications from the PubMed catalog utilizing a SOAP-style interface
  – NLMJournalFetch – pull publications from the National Library of Medicine’s catalog utilizing a SOAP-style interface
  – JDBCFetch – pull information from a JDBC database
  – D2RMapFetch – pull information from a relational database directly to an RDF format using the D2RMap library
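To make the later steps concrete, here is a purely hypothetical example of the kind of flat XML record a fetch step (for instance JDBCFetch) can leave in a file record handler. The element names are made up for illustration and depend entirely on your source tables and configuration:

  <record>
    <UFID>12345678</UFID>
    <FIRST_NAME>Jane</FIRST_NAME>
    <LAST_NAME>Smith</LAST_NAME>
    <TYPE_CODE>192</TYPE_CODE>
    <UF_EMAIL>jsmith@ufl.edu</UF_EMAIL>
  </record>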

Translate

• Most important part of the entire process, as a mistake here will result in ‘dirty’ data

• Most common translation function is the XSLTranslator – which uses an XML style sheet (XSL)

• Translate classes:
  – XSLTranslator – use XSL files to translate non-native data into VIVO RDF/XML
  – GlozeTranslator – use a Gloze schema to translate data into basic RDF/XML
  – VCardTranslator – intended to translate a vCard into VIVO RDF (still in progress)
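Continuing the hypothetical record from the Fetch step, a translator would turn it into VIVO-flavored RDF/XML in a temporary ‘harvested’ namespace, roughly along these lines (the example.edu namespace is a placeholder; the ontology terms match the field mappings shown later in this walk-through):

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
           xmlns:foaf="http://xmlns.com/foaf/0.1/"
           xmlns:core="http://vivoweb.org/ontology/core#">
    <rdf:Description rdf:about="http://example.edu/harvested/person12345678">
      <rdf:type rdf:resource="http://vivoweb.org/ontology/core#FacultyMember"/>
      <rdfs:label>Smith, Jane</rdfs:label>
      <foaf:firstName>Jane</foaf:firstName>
      <foaf:lastName>Smith</foaf:lastName>
      <core:primaryEmail>jsmith@ufl.edu</core:primaryEmail>
    </rdf:Description>
  </rdf:RDF>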

Score

• Scores incoming data against VIVO data to determine potential matches

• It will score all input data based on a defined algorithm

• Can limit the scored dataset to a given ‘harvested’ namespace

• Multi-tiered scoring can be useful for narrowing the dataset down to a smaller set before adding other factors

Score Algorithms

• EqualityTest (most common)
  – Tests for exact equality

• NormalizedDoubleMetaphoneDifference
  – Tests for phonetic equality

• NormalizedSoundExDifference
  – Tests for misspelling distance

• NormalizedDamerauLevenshteinDifference
  – Tests for misspelling distance, accounting for transpositions

• NormalizedTypoDifference
  – Tests for misspelling distance specific to the “qwerty” keyboard

• CaseInsensitiveInitialTest
  – Tests whether the first letter of each string is the same, case insensitive

• NameCompare
  – Test specifically designed for comparing names

Match

• Uses cumulative weighted scores from Scoring to determine a ‘match’

• The harvested URI changes to the matching entity’s VIVO URI

• Can either leave or delete data properties about the matched entity (e.g., extraneous name information about a known VIVO author)


Transfer

• The final step – and the one that actually puts your data into VIVO

• Works directly with Jena models, and speaks RDF only

• Useful for troubleshooting data issues throughout the process

Ingest Walk-Thru


Identify the data

• Determine the resource that has the data you are seeking

• Meet with data stewards (owners) to determine how to access it

• Develop a listing of the data fields available

Start to build your ingest

• Based on your source data, select a template from the examples
  – In our example we started with example-peoplesoft

• Copy example-harvester-name into a separate directory to begin your work (a sketch of this is shown below)
  – This allows you to keep an unadulterated copy of the original for reference, and also prevents any accidental breakage when updating the harvester package

• Perform renaming to match the example ingest to your actual ingest

• Update all files to reflect the reality of your locations and data sources

• Use file record handlers to view the data in a human-readable format until you have a working run of the harvest
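A minimal sketch of that copy-and-rename step, assuming a typical installation layout. The install path, script name, and target directory here are assumptions for illustration; only the example-peoplesoft name comes from the slides:

  # copy the example so the original stays untouched
  cp -r /usr/share/vivo/harvester/example-scripts/example-peoplesoft /data/harvests/people-harvest
  cd /data/harvests/people-harvest

  # rename scripts and configs to reflect your ingest, then edit paths inside them
  mv run-peoplesoft.sh run-people.sh
  grep -rl "example-peoplesoft" . | xargs sed -i 's/example-peoplesoft/people-harvest/g'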

Build the Analytics

• Write SPARQL queries that will express the data that you will be working with (a sample query is sketched below)
  – This is helpful for determining net changes in data with regards to the ingest

• Utilize the harvester JenaConnect tools to execute queries and output the results in text format

• Insert items into the master shell script so that the analytics run both before and after the transfer to VIVO

• Possibly set up email delivery of these analytics, so that they can be monitored on a daily basis
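For instance, a simple before/after count of people in VIVO could use a query like the following. Treat this as an illustrative sketch, not a required query; the foaf:Person class is standard in VIVO 1.x:

  PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>

  SELECT (COUNT(DISTINCT ?person) AS ?people)
  WHERE {
    ?person rdf:type foaf:Person .
  }

Running the same query before and after the transfer, and diffing the two result files, gives the net change introduced by the ingest.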

Build your Fetch

• Create model files that point to the correct locations for source data

• Test connections to ensure the data is accessible and arrives in the format expected

• Identify any issues with connections, and determine the speed of transfer for overall timings

Build field mappings

• Identify related VIVO properties

• Build a mapping between source data and the resulting VIVO properties / classes

Classes (type code → VIVO class):
  192                 core:FacultyMember
  195                 core:NonAcademic
  197                 ufVivo:CourtesyFaculty
  221                 ufVivo:Consultant

Properties (source field → VIVO data property):
  UF Email            core:primaryEmail
  Phone               core:phoneNumber
  Fax                 core:faxNumber
  First name          foaf:firstName
  Last name           foaf:lastName
  Middle name         core:middleName
  Display name        rdfs:label
  UFID                ufVivo:ufid
  Gatorlink           ufVivo:gatorlink
  Working title       core:preferredTitle
  Department number   ufVivo:deptID

Build translation file

• Utilize the base XSL from the ingest example that you selected

• The field mappings created in the previous step will help immensely here

• Determine the entities that will be created from this process
  – Our example is Person and Department

• Work through each data property and/or class

• Build logic into the XSL file where necessary to accomplish your goals (a small sketch follows)
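As a rough, hypothetical sketch of what one mapping in that XSL file can look like. The source element names follow the made-up fetched record from earlier, and the example.edu namespace is a placeholder; the real example XSL shipped with the harvester is considerably larger:

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:foaf="http://xmlns.com/foaf/0.1/"
      xmlns:core="http://vivoweb.org/ontology/core#">

    <!-- wrap all translated records in a single rdf:RDF document -->
    <xsl:template match="/">
      <rdf:RDF>
        <xsl:apply-templates select="//record"/>
      </rdf:RDF>
    </xsl:template>

    <!-- one harvested person per source record -->
    <xsl:template match="record">
      <rdf:Description rdf:about="http://example.edu/harvested/person{UFID}">
        <!-- type code 192 maps to core:FacultyMember per the field mappings -->
        <xsl:if test="TYPE_CODE = '192'">
          <rdf:type rdf:resource="http://vivoweb.org/ontology/core#FacultyMember"/>
        </xsl:if>
        <foaf:firstName><xsl:value-of select="FIRST_NAME"/></foaf:firstName>
        <foaf:lastName><xsl:value-of select="LAST_NAME"/></foaf:lastName>
        <core:primaryEmail><xsl:value-of select="UF_EMAIL"/></core:primaryEmail>
      </rdf:Description>
    </xsl:template>
  </xsl:stylesheet>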

Test run through translation

• Test the first two steps of your ingest by inserting an exit into your script (see the sketch below)

• Verify that the source data came over, and that your translated records look as expected

• Make sure to have test cases for each potential type of data you could see
  – Avoid the inevitable hand-to-forehead moments

• Wash, rinse, and repeat
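A minimal sketch of that "insert an exit" trick, assuming a run script that calls the harvester steps in sequence. The command names, flags, and config paths below are placeholders modeled on a generic example; check the scripts shipped with your harvester version for the exact invocations:

  # run-people.sh (excerpt)
  harvester-jdbcfetch -X config/tasks/people.jdbcfetch.xml
  harvester-xsltranslator -X config/tasks/people.xsltranslator.xml

  # stop here while testing: inspect the raw and translated record files,
  # then remove this line once the output looks right
  exit 0

  # ...score, match, and transfer steps follow...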

Setup the scoring

• Determine your scoring strategy based on the unique items that you have available to you
  – Approach 1 (People) – almost all institutions have some sort of unique identifier that is not an SSN; this is a slam dunk as an EqualityTest
  – Approach 2 (Publications) – we utilized a tiered scoring approach to successively shrink down the data set, and also to provide a better match

• Determining the weight of each algorithm will be important for cumulative scoring (a worked illustration follows)
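As a purely hypothetical illustration of how cumulative weighting plays out (the weights and threshold are made-up numbers, not recommendations):

  Suppose three comparisons are weighted as follows:
    last name,   NameCompare,                          weight 0.5
    first name,  NormalizedDoubleMetaphoneDifference,  weight 0.3
    department,  EqualityTest,                         weight 0.2

  If an incoming record scores 1.0 on last name, 1.0 on first name, and 0.0 on department,
  the cumulative score is (0.5 × 1.0) + (0.3 × 1.0) + (0.2 × 0.0) = 0.8.
  With a match threshold of 0.75 this pair would be treated as a match; with a threshold of 0.9 it would not.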

Setup the matching

• The bulk of the work here was done in thinking about scoring; now it is time to implement your threshold for matching

• The matching is done on the individual entities, and matches will be called based upon meeting a threshold

• All data associated with an entity will go over, unless you determine it is not needed

Test run through match

• Test run the process through the match step

• Utilize all test cases from your previous tests to make sure you can account for all variations

• You need matching data in your test VIVO to ensure that you see the match work

• Use Transfer to output the harvested data model and verify that all data is as you expect

• Still review the outputs of the previous two steps to ensure nothing has inadvertently changed

Setup the Transfer

• Determine whether or not subtractions will be necessary
  – Will the entire data set be provided every time?
  – Will only new data be provided?
  – Will mixed data, new and old, be provided?

• Make sure that the previous harvest model gets updated to reflect these additions / subtractions

Test run through entire process

• This is the full dress rehearsal, and should be done in a test environment (UF calls it Development)

• This is where your analytics really help, as the review of your test cases against what actually made it into VIVO is invaluable

• Check all outputs from all steps of the process to make sure that everything is firing as expected

• Review the data as it appears in the VIVO application, as sometimes even the best-designed ingest still has unintended view consequences

Full production ingest

• This is the moment we all wait for, and the point where the rest of the world gets to see the fruits of our labor

• Promote the ingest to the production environment, and confirm that all settings are changed for this environment

• Kick off the ingest, sit back, and watch the data move

• Pat yourself and your team on the back, as your VIVO is now alive with the sound of data

Harvester Additional Tools

Additional Harvester Tools/Utilities

• ChangeNamespace
  – Creates new nodes for unmatched harvested entities

• Smush
  – Combines graphs of RDF data when they share certain links
  – Provides the same functionality on the command line as the data ingest menu

• Qualify
  – Used to clean and verify data before import
  – Also allows for independent use to manipulate data via regular expressions, string replaces, and property removal

• RenameResource
  – Takes in an old URI and a new URI, and renames any old URI match to the new URI
  – Provides the same functionality on the command line as the data ingest menu

Additional Harvester Tools/Utilities

• JenaConnect
  – Used by the harvester to connect to Jena models
  – Also allows for SPARQL queries to be run from the command line (see the example below):
      harvester-jenaconnect -j pathToVIVOModel -q "Query Text"

• XMLGrep
  – Allows for moving files that match an XPath expression
  – Useful for removing XML files from a set of data, for separate processing or a different workflow
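For example, the analytics count from earlier in the walk-through could be run against the VIVO model with something along these lines; the model config path is an assumption about your local layout:

  harvester-jenaconnect -j config/models/vivo.model.xml \
    -q "SELECT (COUNT(DISTINCT ?p) AS ?people) WHERE { ?p a <http://xmlns.com/foaf/0.1/Person> }"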


Harvester Troubleshooting Basics

Troubleshooting

• Memory Issues
  – Each harvester function has a corresponding setup script that gets called from the harvester/bin directory
  – These are set for minimum memory usage, but for large datasets they need to be adjusted (see the sketch below)
    • Currently UF allocates a minimum of 4GB and a maximum of 6GB for the Score, Match, and Diff functions
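Those memory settings are ordinary JVM heap flags. A hedged sketch of the kind of change involved; the script path, default values, and class name are placeholders that vary by harvester version, so adjust to what your bin scripts actually contain:

  # in the setup script for the function (e.g. harvester/bin script for Score), raise the heap:
  # before (defaults will differ by version)
  java -Xms64m -Xmx1g -cp "$CLASSPATH" org.vivoweb.harvester.score.Score "$@"
  # after (UF-style allocation for Score, Match, and Diff)
  java -Xms4g -Xmx6g -cp "$CLASSPATH" org.vivoweb.harvester.score.Score "$@"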

Troubleshooting

• Unexpected Data Issues
  – When you are receiving unexpected results from the harvester, dump the steps to file for review
      harvester-transfer -i model-config-file -d path_and_name_of_file_to_dump_to
    • Invaluable for reviewing each step of the process and the outputs that are being generated
  – When things are not scoring or matching correctly, check to make sure that you have your comparisons set up correctly
    • Make sure that you are using the correct predicates and their associated predicate namespace
        <Param name="inputJena-predicates">label=http://www.w3.org/2000/01/rdf-schema#label</Param>
    • Make sure that your harvested namespace is correct based upon your translation of the source data

Follow-Up Discussion

Q&A