
Enhancing the Performance and Extensibility of the XC Metadata Services Toolkit

Ben Anderson, Software Engineer, XCO

Download this presentation: www.extensiblecatalog.org/learnmore


Timeline

• 2/10: Jennifer Bowen presented at code4lib
• 3/10: I began at XCO
• 4/10: work began on 0.3
• 1/11: 0.3 released
• (0.2 released and 1.0 released also mark the timeline)
• MARCXML (6M records), DC-TERMS (13k records)

XC Software Components

• XC Drupal Toolkit – user interface for searching and browsing, built into a library website (on Drupal)
• XC Metadata Services Toolkit – tools for automated processing of large batches of metadata
• XC NCIP Toolkit – tools for connectivity between XC and an ILS (circulation status/requests, authentication)
• XC OAI Toolkit – shown connecting the Integrated Library System and repositories such as irplus

Learn More about XC at www.extensiblecatalog.org


One Example of Process Flow

Within the MST:
• a MARC BIB record arrives from an external repository
• the normalization service produces a normalized MARC BIB record
• the transformation service produces FRBRized records: work, expression, manifestation

Logical Process

• An OAI-PMH harvest pulls records from the external provider into a provider cache (a repo)
• The MARC Normalization Service, then the MARC-XC Transformation Service, each read the previous repo via a pseudo OAI-PMH harvest and write to a repo of their own
• Each resulting repo is itself OAI-PMH harvestable

Screenshots of the MST web UI:
• Add an External Repository
• Schedule a Harvest
• Configure Processing Rules
• Browse Records

Goals for 0.3

• Each service should process one million records per hour on an “average library server”
  – 1.5 GHz SPARC V9
  – 8G RAM (3G for the JVM)
  – 10k RPM hard drive
• Services should show little to no degradation as the size of a repository grows
  – the University of Rochester has 6M records
• Implementing a service should be easy
  – it should require no knowledge of MST internals
  – it should not be up to the service implementer to figure out how to build and package their service

Determine Throughput of 0.2

• Using the MARC Normalization service as our metric, the first million records processed at an average speed of:
  – 29 ms/record = 120k/hr (goal is 3.6 ms/rec = 1M/hr)
• Before the service had processed 2 million records, the process crawled to a halt (the goal was little to no degradation through at least 6 million records).

Determine Bottlenecks with TimingLogger

This code produces this output:
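The code and its output were screenshots on the slide and don't survive in this transcript. Below is a minimal sketch of the idea in Java; the class name matches the slide, but the method names and output format are assumptions rather than the MST's actual API.

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a TimingLogger: accumulate named timing buckets around hot
// code paths, then dump the totals to see where the time goes.
public class TimingLogger {
    private static final Map<String, Long> totals = new LinkedHashMap<String, Long>();
    private static final Map<String, Long> starts = new LinkedHashMap<String, Long>();

    public static void start(String name) {
        starts.put(name, System.nanoTime());
    }

    public static void stop(String name) {
        long elapsed = System.nanoTime() - starts.remove(name);
        Long total = totals.get(name);
        totals.put(name, (total == null ? 0L : total) + elapsed);
    }

    // Print the accumulated totals, e.g. "createDom: 2500 ms", then clear them.
    public static void dump() {
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            System.out.println(e.getKey() + ": " + (e.getValue() / 1000000L) + " ms");
        }
        totals.clear();
    }
}

Wrapping each phase in TimingLogger.start("createDom") / TimingLogger.stop("createDom") and calling dump() every batch is what yields per-phase numbers like the ones on the next slide.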

Bottleneck Breakdown

• 29 ms per record
  – 2.5 ms to create DOM
  – 5 ms for actual service processing (the innards of the MARC Normalization service)
  – 21 ms for querying Solr and inserting
• These are averages – both querying and inserting are done in batch.
• I had a hard time separating the two.

0.2 Design

MySQL:
• All data needed for the UI, except for searching and browsing records
• All data needed for configuring harvests, services, processing rules, etc.

Solr:
• Text indexes necessary for searching and browsing records
• All record/repository data

0.3 Design Change to use MySQL

MySQL:
• All data needed for the UI, except for searching and browsing records
• All data needed for configuring harvests, services, processing rules, etc.
• All record/repository data

Solr:
• Doesn’t store any data
• Used only for indexing records to support searching in the UI

0.3 Design – Keep the table sizes small

• One Solr index for all repositories
• Each external repository cache and each service gets its own set of database tables: an external provider repo, a normalization repo, a transformation repo

0.3 Design – Yes, a boring ERD

• record_updates (record_id, update_date) – one or more rows per record
• records_xml (record_id, xml) – one row per record
• record_sets and record_predecessors (record_id, pred_record_id) – zero or more rows per record

Did that improve things?

• 11 ms per record (previously 29)
  – 2.5 ms to create DOM
  – 5 ms for actual service processing (the innards of the MARC Normalization service)
  – 3.5 ms (previously 21) for querying MySQL and inserting into MySQL
• Again, both querying and inserting are done in batch
• The query time is almost nil – it’s the inserting that takes time
• It’s faster, but still nearly 3x slower than our goal
• The performance showed little to no degradation

Get rid of XPath

XPath isn’t a bad technology, but when you’re optimizing for performance, it can be beneficial to find other ways to accomplish the same task. So I changed this code… to this code… (a sketch of the change follows).
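The before/after code was shown as screenshots that aren't preserved here. A sketch of the kind of change, assuming the task is pulling MARCXML datafields by tag (names are illustrative, not the actual MST code):

import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FieldLookup {
    // Before: evaluate an XPath expression per record -- concise, but it pays
    // for expression parsing and evaluation machinery every time.
    public static NodeList withXPath(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (NodeList) xpath.evaluate(
                "//*[local-name()='datafield'][@tag='245']", doc, XPathConstants.NODESET);
    }

    // After: walk the DOM directly -- the same result with plain method calls.
    public static List<Element> withDom(Document doc) {
        List<Element> matches = new ArrayList<Element>();
        NodeList fields = doc.getElementsByTagNameNS("*", "datafield");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            if ("245".equals(field.getAttribute("tag"))) {
                matches.add(field);
            }
        }
        return matches;
    }
}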

Did that improve things?

• 7 ms per record (previously 11)
  – 2.5 ms to create DOM
  – 1.0 ms (previously 5) for actual service processing (the innards of the MARC Normalization service)
  – 3.5 ms for MySQL inserts
• It’s faster, but still nearly 2x slower than our goal

Delayed Indexing in MySQL

• MySQL modifies table indexes with each insert.
• It is faster to drop the indexes, insert lots of rows into the tables, and then add the indexes back (a sketch follows).
  – This is the way mysqldump works
  – This means you can’t read the data while doing an insert. No big deal – we’ll just do it during large loads.
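A sketch of that pattern over JDBC; the table name is illustrative. (mysqldump wraps its inserts in the equivalent ALTER TABLE ... DISABLE KEYS / ENABLE KEYS statements, which affect non-unique MyISAM indexes; elsewhere you would drop and re-create the indexes explicitly.)

import java.sql.Connection;
import java.sql.Statement;

public class DelayedIndexing {
    public static void bulkLoad(Connection conn) throws Exception {
        Statement stmt = conn.createStatement();
        // Stop maintaining the indexes while the big load runs.
        stmt.execute("ALTER TABLE records_xml DISABLE KEYS");
        try {
            // ... perform the large batch of inserts here ...
        } finally {
            // Rebuild the indexes once, in bulk -- far cheaper than
            // updating them row by row during the load.
            stmt.execute("ALTER TABLE records_xml ENABLE KEYS");
            stmt.close();
        }
    }
}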

Did that improve things?

• 6 ms per record (previously 7)
  – 2.5 ms to create DOM
  – 1.0 ms for actual service processing (the innards of the MARC Normalization service)
  – 2.2 ms (previously 3.5) for MySQL inserts
• It’s faster, but still nearly 2x slower than our goal

Batch Prepared Statements

Java/JDBC provides a highly performant way to send large chunks of data to the db at once: batch prepared statements.

There’s no way to speed this part up… or so I thought…
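A minimal JDBC sketch of that pattern; the table and columns are illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsert {
    public static void insertRecords(Connection conn, List<String[]> rows) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
                "insert into records_xml (record_id, xml) values (?, ?)");
        try {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();     // queue the row locally; no round trip yet
            }
            ps.executeBatch();     // send the whole batch in one round trip
        } finally {
            ps.close();
        }
    }
}

(With MySQL's Connector/J driver, adding rewriteBatchedStatements=true to the JDBC URL lets the driver rewrite such batches into multi-row inserts.)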

LOAD DATA INFILE

When discussing db optimizations with XC’s Drupal Toolkit developer, Peter Kiraly, he mentioned that PHP didn’t have the same batching ability; instead he’d have to write out a CSV file and load it in with LOAD DATA INFILE. I figured I might as well try that from Java too.
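A sketch of the same trick from Java: write the batch out as a delimited file and hand the whole thing to MySQL in one statement. Paths and table names are illustrative; real code must also escape tabs, newlines, and backslashes in the values per LOAD DATA's rules, and the MySQL server needs read access to the file (or use LOAD DATA LOCAL INFILE).

import java.io.File;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.Statement;
import java.util.List;

public class LoadDataInfile {
    public static void load(Connection conn, List<String[]> rows) throws Exception {
        // Write the rows out as tab-separated values.
        File tsv = File.createTempFile("records_xml", ".tsv");
        FileWriter out = new FileWriter(tsv);
        for (String[] row : rows) {
            out.write(row[0] + "\t" + row[1] + "\n");
        }
        out.close();

        // One statement streams the whole file into the table.
        Statement stmt = conn.createStatement();
        stmt.execute("LOAD DATA INFILE '" + tsv.getAbsolutePath().replace('\\', '/')
                + "' INTO TABLE records_xml (record_id, xml)");
        stmt.close();
        tsv.delete();
    }
}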

Did that improve things?

• 4 ms per record (previously 6)
  – 2.5 ms to create DOM
  – 1.0 ms for actual service processing (the innards of the MARC Normalization service)
  – 0.6 ms (previously 2.2) for MySQL inserts
• Pretty close, but still not there

Sometimes it’s the little things

DomFactoryBuilderDOAServiceFactoryFactoryImpl

I knew enough not to create the DocumentBuilderFactory each time, but didn’t realize creating the DocumentBuilder each time would have that much of an effect.

Code was… code is now… (a sketch of the change follows).
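The screenshots aren't preserved; a sketch of the change, reusing the DocumentBuilder and not just the factory (note that a shared builder must not be used from multiple threads at once):

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomParsing {
    private static final DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();

    // Code was: a fresh DocumentBuilder per record -- surprisingly expensive.
    public static Document parsePerRecord(String xml) throws Exception {
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Code is now: one builder, reset() between records.
    private static final DocumentBuilder builder;
    static {
        try {
            builder = factory.newDocumentBuilder();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static Document parseReusingBuilder(String xml) throws Exception {
        builder.reset();  // clear any state left over from the previous parse
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}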

Did that improve things?

• 3 ms per record (previously 4)
  – 0.9 ms (previously 2.5) to create DOM
  – 1.0 ms for actual service processing (the innards of the MARC Normalization service)
  – 0.6 ms for MySQL inserts
• WE DID IT! We have exceeded our goal!

0.2 Service Development

Internals of the MST were exposed to the service developer, and the developer was expected to re-implement much of this internal code.

code.google.com/p/xcmetadataservicestoolkit/

0.3.x Service Development

• Install Java, Ant, MySQL

$ wget 'http://xcmetadataservicestoolkit.googlecode.com/files/example-0.3.0-dev-env.zip'
$ unzip example-0.3.0-dev-env.zip
$ cd example
$ ant retrieve
$ ant -Dtest=ProcessFiles test
$ ls -ladh ./build/test/actual_output_records/1/*
$ ant zip

Input Files for Testing

$ ls -1 ./test/input_records/1/* | xargs cat
<records xmlns="http://www.openarchives.org/OAI/2.0/">
  <record>
    <header>
      <identifier>oai:mst.rochester.edu:bib:1</identifier>
    </header>
    <metadata>
      <foo xmlns="foo:bar">pb&amp;j</foo>
    </metadata>
  </record>
  <record>
    <header>
      <identifier>oai:mst.rochester.edu:bib:1</identifier>
    </header>
    <metadata>
      <foo xmlns="foo:bar">pb&amp;j 2</foo>
    </metadata>
  </record>
</records>
...

Output Files from Testing

$ ls -1 ./build/test/actual_output_records/1/* | xargs cat
<records xmlns="http://www.openarchives.org/OAI/2.0/">
  <record>
    <header status="replaced">
      <identifier>oai:mst.rochester.edu:example/1</identifier>
      <datestamp />
      <predecessors>
        <predecessor>oai:mst.rochester.edu:bib:1</predecessor>
      </predecessors>
    </header>
    <metadata>
      <foo xmlns="foo:bar">
        pb&amp;j
        <bar>you've been foobarred!</bar>
      </foo>
    </metadata>
  </record>

Implementing in Code

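The slide showed the example service's source, which isn't preserved in this transcript. As a hypothetical sketch only (the MST's real service API and base classes differ), here is a self-contained version of the behavior visible in the test output above: append a <bar> element to each record's <foo> payload.

import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class ExampleService {
    private final DocumentBuilder builder;

    public ExampleService() throws Exception {
        builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    // Process one record payload: parse, mutate, serialize. In the real MST,
    // the framework handles repositories, predecessors, and OAI headers.
    public String process(String fooXml) throws Exception {
        builder.reset();
        Document doc = builder.parse(new InputSource(new StringReader(fooXml)));
        Element bar = doc.createElementNS("foo:bar", "bar");
        bar.setTextContent("you've been foobarred!");
        doc.getDocumentElement().appendChild(bar);

        Transformer t = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}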

More tidbits for interested implementers

• The MST is now configured via Spring
  – each service is given its own application context as well as its own classloader
  – this means it can use all the objects and services from the MST while not worrying about name collisions (naming or dependencies) with other services
• Each service is given its own db schema (again, so you don’t have to worry about name collisions). The db schema is prefixed with “xc_”. (A sketch of the classloader/context idea follows.)
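A sketch of the per-service isolation idea from the bullets above; the config path is illustrative, and the MST's actual wiring differs in detail:

import java.net.URL;
import java.net.URLClassLoader;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class ServiceContextLoader {
    public static ClassPathXmlApplicationContext loadService(URL serviceJar,
            ClassLoader mstClassLoader) {
        // The service sees the MST's classes through the parent loader, but
        // classes in its own jar can't collide with those of sibling services.
        URLClassLoader serviceLoader =
                new URLClassLoader(new URL[] { serviceJar }, mstClassLoader);

        ClassPathXmlApplicationContext ctx = new ClassPathXmlApplicationContext();
        ctx.setClassLoader(serviceLoader);            // beans resolve inside the service jar
        ctx.setConfigLocation("spring-service.xml");  // illustrative path
        ctx.refresh();
        return ctx;
    }
}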

Other Services

• MARC-XC-Transformation – just as fast as the MARC Normalization service
• DC-XC-Transformation – initially contributed by Kyushu University (in Japan); now one of our core services

Photo Credits

• All photos taken from flickr.com
  – “Brick Wall” by somenametoforget
  – “Snail” by DRB62
  – “Paris Train” by Pictr 30D
  – “Spaghetti with tomato sauce” by HatM
  – “Hawk in Flight” by Nick Chill
  – “Tortoise” by GraphicReality

Final Numbers (on a 1.5 GHz CPU)

0.2:
• 125k records / hr (29 ms / record)
• fell down before 2M records processed
• not easily extensible

0.3:
• 1.2M records / hr (3.0 ms / record)
• processed 16M records with no degradation
• easily extensible

Download XC software at eXtensibleCatalog.org
Contact me at banderson@library.rochester.edu