REPLIX Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

REPLIX www.mpi.nl/replix Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA

Page 1: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

REPLIX

www.mpi.nl/replix

Willem.Elbers(@mpi.nl)

Max Planck Institute for Psycholinguistics, TLA

Page 2: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Agenda: Goals, Motivation, Infrastructure, Language Archive specific, Results, Discussion.

REPLIX – Repository / Workspace Workshop – September 2010

Page 3: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

REPLIX Goals

Data replication / synchronization between repositories at a logical level. What is this logical level?

More than just moving files. What about access rights? What about structure defined on top of the data? What about persistent identifiers (PIDs)? What about things we didn’t think about?

Workflow-based and easy to configure and adapt for different scenarios. A workflow is a chain of small tasks (a.k.a. blocks). It should be easy to develop and integrate new blocks.
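The idea of a workflow as a chain of small blocks can be sketched as follows. This is only a minimal illustration in Python, not the REPLIX implementation (which the slides describe in terms of iRODS rules and micro-services). The names `Workflow` and `make_logging_block` and the block labels are hypothetical, and each block here only records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A block is one small task: it receives the shared context and returns it.
Context = Dict[str, object]
Block = Callable[[Context], Context]


@dataclass
class Workflow:
    """A named chain of blocks that run in the order they were added."""
    name: str
    blocks: List[Block] = field(default_factory=list)

    def add(self, block: Block) -> "Workflow":
        self.blocks.append(block)
        return self  # returning self allows chained .add(...) calls

    def run(self, context: Context) -> Context:
        for block in self.blocks:
            context = block(context)
        return context


def make_logging_block(label: str) -> Block:
    """Hypothetical stand-in block: it only records that it ran."""
    def block(context: Context) -> Context:
        done = list(context.get("done", []))
        done.append(label)
        return {**context, "done": done}
    return block


# Adapting the workflow to another scenario means adding or swapping blocks.
workflow = (
    Workflow("example-sync")
    .add(make_logging_block("transfer files"))
    .add(make_logging_block("check access rights"))
    .add(make_logging_block("update PIDs"))
)
result = workflow.run({"node": "root"})
print(result["done"])
```

Because every block has the same signature, a new block can be developed and tested in isolation and then inserted anywhere in the chain.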

Page 4: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

REPLIX Goals

Independent of repository implementation. Repositories use different software solutions and we do not expect them to change.

How do we synchronize between different software solutions? Through an inter-connection layer.

The originating repository is, and should remain, owner of the data. The repository should never depend on REPLIX for anything other than the synchronization. REPLIX has access to the file system but does not control it. The repository controls its data with policies.

Page 5: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

REPLIX Goals

[Diagram: two repositories, each shown with "Repository Software / Tools", a "Repository File System" and a REPLIX instance, linked by "inter-connect" connections.]

Page 6: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Motivation – REPLIX TLA

Open up the MPI LAT backup sites (B1, B2) as read-only archives.

Improve the replication process in general: speed, validation, replicating parts of the archive, updating PID information.

Keep generalization in mind, to provide an out-of-the-box solution for other repositories.

[Diagram: the LAT archive and its backup sites B1 and B2; site labels Nijmegen, Garching and Göttingen.]

Page 7: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Motivation - Software

Implementation of the REPLIX communication system. iRODS looks like a promising candidate (federated zones).

Implementation of the interface to the repository file system. iRODS looks like a promising candidate.

Implementation of the inter-connection layer. REPLIX side: iRODS looks like a promising candidate, using a custom module. Repository side: will require custom programming.

Page 8: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Motivation

Perform iRODS performance tests. See if iRODS lives up to our expectations. How does iRODS compare to the current rsync process?

Develop a concrete test case, based on iRODS, to test our ideas. Main archive located at the MPI in Nijmegen, the Netherlands. Backup archive located at the RZG in Garching, Germany. Approximately 25-30 TB.

How much do we have to change in the existing software?

Page 9: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Infrastructure

iRODS zones provide the archive-to-archive connection. The data of a single archive exists inside an iRODS zone. Use federated zones, ensuring each archive remains autonomous.

Loose connection to the file system: iRODS mounted collections. Regular iRODS collections are too strict, since iRODS controls all files and their metadata.

XML-RPC interface to the existing software (inter-connection layer). Develop an iRODS micro-service to facilitate XML-RPC communication. Use some reserved disk space for caching purposes. How to handle different method signatures?
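One possible answer to the method-signature question is a small mapping from generic workflow actions to whatever each repository actually exposes. The sketch below uses Python's standard `xmlrpc` modules only to illustrate the idea; it is not the iRODS micro-service. The action name `permissions.export`, the remote method `export_permissions`, the returned data and the PID string are all hypothetical.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def start_repository_endpoint() -> SimpleXMLRPCServer:
    """Stand-in for the repository side of the inter-connection layer."""
    # Port 0 lets the operating system pick a free port.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)

    def export_permissions(node_pid):
        # Hypothetical export: one (pid, user, permission) triple.
        return [[node_pid, "alice", "read"]]

    server.register_function(export_permissions, "export_permissions")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Hypothetical mapping: generic action name -> repository-specific method.
# A repository with a different method name only needs a different entry.
METHOD_MAP = {"permissions.export": "export_permissions"}


def call_action(url: str, action: str, *args):
    """Invoke a generic workflow action on one repository over XML-RPC."""
    proxy = ServerProxy(url)
    remote_method = getattr(proxy, METHOD_MAP[action])
    return remote_method(*args)


server = start_repository_endpoint()
host, port = server.server_address
triples = call_action(f"http://{host}:{port}/", "permissions.export", "hdl:0000/example")
server.shutdown()
server.server_close()
```

Keeping the mapping on the REPLIX side means the workflow only ever refers to generic action names, while each repository keeps its own method names.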

Page 10: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Infrastructure

[Diagram: REPLIX infrastructure. Component labels: iRODS, icommands, rule-base, virtual file system, scripts, micro services, mounted collection(s), msiXmlRPC, replix scripts, replix rule-base, jargon, core + rule engine, WM, existing software stack, repository (local) file system.]

Page 11: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Infrastructure

Two ways of interacting with the REPLIX system: use the workflow manager (WM), or invoke the workflow rules directly through the icommands.

The workflow manager is preferred. It is exposed through a REST service interface.

Page 12: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Language Archive Specific

How does this fit into the existing LAT infrastructure?

[Diagram: LAT 1 (SOURCE) and LAT 2 (DESTINATION), connected through REPLIX. Component labels on both sides: AMS, IMDI Browser, LAMUS, ams, corpus-structure, crawler. LAT 1 additionally shows PID / pid.]

Page 13: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Language Archive Specific

LAT synchronization workflow:

(1) File synchronization. iRODS sync.

(2) Start crawler (index all files). msiExecCmd.

(3) Permission synchronization. msiXmlRpc.

(4) Update PID information. msiXmlRpc.

Each step implemented as an iRODS action.

Page 14: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Language Archive Specific

(1) Synchronize based on nodes in the archive tree. If the node is the root node, synchronize all files. If the node is not the root node, create a list of the files that need to be synchronized and synchronize them. File list export functionality needs to be available. Do not touch file content.

(2) Start the crawler. Use the iRODS “msiExecCmd” micro-service to start the crawler at the destination archive through a script. The time this could take might be a problem. PIDs should remain untouched and can be used as a reference to the parent archive. Do not touch file content.
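The node-based selection in step (1) can be sketched as below. This is an illustrative Python model only: the `Node` class, the function names and the sample paths are hypothetical, and the real file list would come from the archive's own export functionality. Returning `None` is used here as a convention for "synchronize everything".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical archive tree node holding file paths and child nodes."""
    name: str
    files: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def collect_files(node: Node) -> List[str]:
    """Return the files of this node and of all nodes below it."""
    paths = list(node.files)
    for child in node.children:
        paths.extend(collect_files(child))
    return paths


def files_to_synchronize(root: Node, selected: Node) -> Optional[List[str]]:
    """None means a full synchronization; otherwise an explicit file list."""
    if selected is root:
        return None
    return collect_files(selected)


# Small sample tree with made-up paths.
session = Node("session1", files=["corpus/a/session1.imdi", "corpus/a/rec1.wav"])
corpus_a = Node("corpus-a", files=["corpus/a/a.imdi"], children=[session])
corpus_b = Node("corpus-b", files=["corpus/b/b.imdi"])
root = Node("root", files=["root.imdi"], children=[corpus_a, corpus_b])

subtree_list = files_to_synchronize(root, corpus_a)
full_sync = files_to_synchronize(root, root)
```

Only paths are collected; file content is never opened, in line with the "do not touch file content" constraint.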

Page 15: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Language Archive Specific

(3) Replicate archive permissions. AMS is in charge of the permissions in the archive: (node id, user id, permission) triples.

Create an export, based on the selected node, at the source archive. Transfer the export to the destination archive. Import the data into AMS at the destination archive.

Export based on PIDs, which are constant between source and destination archive. Translate between PID and node id, since AMS internally uses node ids.

How to synchronize users? Discard triples for non-existing users.
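The export and import of permission triples can be sketched as follows. This is a minimal Python illustration of the translation between node ids and PIDs, not AMS code; the function names, node ids, PIDs and users are hypothetical. Skipping triples whose PID is unknown at the destination is an extra assumption that the slides do not state.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (node id or PID, user id, permission)


def export_permissions(
    triples: Iterable[Triple], node_id_to_pid: Dict[str, str]
) -> List[Triple]:
    """Source side: replace the local node id with the shared PID."""
    return [
        (node_id_to_pid[node_id], user, permission)
        for node_id, user, permission in triples
        if node_id in node_id_to_pid
    ]


def import_permissions(
    exported: Iterable[Triple],
    pid_to_node_id: Dict[str, str],
    known_users: Set[str],
) -> List[Triple]:
    """Destination side: translate PIDs back and drop unknown users."""
    imported = []
    for pid, user, permission in exported:
        if user not in known_users:
            continue  # discard triples for non-existing users
        if pid not in pid_to_node_id:
            continue  # assumption: the node has not been replicated yet
        imported.append((pid_to_node_id[pid], user, permission))
    return imported


# Made-up sample data: node ids differ per archive, PIDs do not.
source_triples = [("N1", "alice", "read"), ("N2", "bob", "read")]
source_map = {"N1": "hdl:0000/a", "N2": "hdl:0000/b"}
destination_map = {"hdl:0000/a": "M7", "hdl:0000/b": "M8"}

exported = export_permissions(source_triples, source_map)
imported = import_permissions(exported, destination_map, known_users={"alice"})
```

In this sample, `bob` does not exist at the destination, so only the triple for `alice` is imported, under the destination's own node id.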

Page 16: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Language Archive Specific

(4) Update PID information. After replicating a file from the source archive to another archive, the file's PID record has to be updated.

Create an export at the destination archive: (pid, url) pairs. Transfer it to the parent archive. Import it into the PID system.

How to administer these changes to the PID record? New domains are always allowed to be added. An archive is only allowed to update its ‘own’ URL; assume the domain is constant.
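The update policy for a PID record can be sketched as below. This is one possible reading of the two rules, written in Python for illustration; it does not use any real PID or handle system API. A record is modelled as a plain list of URLs, and the function name, the `sender_domain` parameter and the example domains are hypothetical.

```python
from typing import List
from urllib.parse import urlparse


def apply_pid_update(record_urls: List[str], new_url: str, sender_domain: str) -> List[str]:
    """Return the record's URL list after one (pid, url) update.

    Rules sketched here:
    - an archive may only submit URLs in its own (constant) domain;
    - a URL with a domain not yet in the record is added;
    - a URL with a domain already in the record replaces that entry.
    """
    domain = urlparse(new_url).netloc
    if domain != sender_domain:
        raise PermissionError("an archive may only update its own URL")

    updated = []
    replaced = False
    for url in record_urls:
        if urlparse(url).netloc == domain:
            if not replaced:
                updated.append(new_url)  # replace the sender's own entry
                replaced = True
        else:
            updated.append(url)  # entries of other domains stay untouched
    if not replaced:
        updated.append(new_url)  # new domain: always allowed to be added
    return updated


record = ["http://source.example.org/data/file1.wav"]
record = apply_pid_update(
    record, "http://backup.example.org/copy/file1.wav", "backup.example.org"
)
record = apply_pid_update(
    record, "http://backup.example.org/moved/file1.wav", "backup.example.org"
)
```

After these two calls the record holds the untouched source URL plus one URL for the backup domain, pointing at the most recently reported location.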

Page 17: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Results

Performance test executed. Transfer files from one zone (MPI) to another federated zone (RZG). Gigabit connection.

Two sets of tests: an increasing number of small files (100 KB each), and a decreasing number of increasingly large files (1 MB to 1 GB).

[Chart: Transfer Speed (small files). Series: AVG(iput), AVG(iget). Y axis: Transfer (MB/s), ticks from 0 to 0.6. X axis: output_50x100KB.csv, output_100x100KB.csv, output_250x100KB.csv, output_500x100KB.csv, output_1000x100KB.csv, output_5000x100KB.csv.]

[Chart: Transfer Speed (large files). Series: AVG(iput), AVG(iget). Y axis: Transfer (MB/s), ticks from 0 to 80. X axis: output_500x1MB.csv, output_100x5MB.csv, output_50x10MB.csv, output_20x25MB.csv, output_20x50MB.csv, output_10x100MB.csv, output_10x250MB.csv, output_5x500MB.csv, output_5x1000MB.csv.]

Page 18: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Results (local)

Pilot to test the initial workflow.

Transfer files.

Trigger crawler: invoke script at destination.

Export permissions: invoke XML-RPC at source to create the export, transfer the export file, invoke XML-RPC at destination to import.

Initial results look promising.

Page 19: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Results

What to do:

Implement the local pilot project in the Nijmegen-Garching environment. Support sub-tree synchronization. Support updating of handle records.

The interconnection layer requires changes in existing software. The repository is required to provide the interconnection functionality for the synchronization workflow actions that are used.

Page 20: REPLIX  Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.

Questions / Discussion

Any questions?
