REPLIX Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA.
REPLIX
www.mpi.nl/replix
Willem.Elbers(@mpi.nl)
Max Planck Institute for Psycholinguistics, TLA
Agenda
- Goals
- Motivation
- Infrastructure
- Language Archive specific
- Results
- Discussion
REPLIX – Repository / Workspace Workshop – September 2010
REPLIX Goals
- Data replication / synchronization between repositories at a logical level.
  - What is this logical level? More than just moving files.
  - What about access rights?
  - What about structure defined on top of the data?
  - What about persistent identifiers (PIDs)?
  - What about things we didn’t think about?
- Workflow based and easy to configure and adapt for different scenarios.
  - A workflow is a chain of small tasks (a.k.a. blocks).
  - Easy to develop and integrate new blocks.
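The idea of a workflow as a chain of small blocks can be sketched as follows. This is a minimal illustration in Python, not the REPLIX implementation (the slides describe the workflow in terms of iRODS rules); the block names and the shared `context` dictionary are assumptions made for the example.

```python
from typing import Callable, Dict, List

# A block is a small task: it receives the shared context and returns an updated copy.
Block = Callable[[Dict], Dict]


def run_workflow(blocks: List[Block], context: Dict) -> Dict:
    """Run the blocks in order, passing the context along the chain."""
    for block in blocks:
        context = block(context)
    return context


# Hypothetical blocks, for illustration only.
def transfer_files(ctx: Dict) -> Dict:
    return {**ctx, "log": ctx["log"] + ["files transferred"]}


def sync_permissions(ctx: Dict) -> Dict:
    return {**ctx, "log": ctx["log"] + ["permissions synchronized"]}


result = run_workflow([transfer_files, sync_permissions], {"log": []})
print(result["log"])
```

Adding a new scenario then amounts to writing another block with the same signature and inserting it into the list.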
REPLIX Goals
- Independent of repository implementation.
  - Repositories use different software solutions and we do not expect them to change.
  - How do we synchronize between different software solutions? Inter-connection layer.
- Originating repository is and should remain owner of the data.
  - Repository should never depend on REPLIX for anything other than the synchronization.
  - REPLIX has access to the file system but does not control it.
  - Repository controls its data with policies.
REPLIX Goals
[Figure: two repositories, each labeled "Repository Software / Tools", "Repository File System", and "REPLIX", joined by "inter-connect" links.]
Motivation – REPLIX TLA
- Open up the MPI LAT backup sites (B1, B2) as read-only archives.
- Improve the replication process in general.
  - Speed.
  - Validation.
  - Parts of the archive.
  - Update PID information.
- Aim to generalize, so that REPLIX can serve as an out-of-the-box solution for other repositories.
[Figure: sites LAT, B1, and B2; locations Nijmegen, Garching, and Göttingen.]
Motivation - Software
- Implementation of the REPLIX communication system.
  - iRODS looks like a promising candidate (federated zones).
- Implementation of the interface to the repository file system.
  - iRODS looks like a promising candidate.
- Implementation of the inter-connection layer.
  - REPLIX side: iRODS looks like a promising candidate by using a custom module.
  - Repository side: will require custom programming.
Motivation
- Perform iRODS performance tests.
  - See if iRODS lives up to our expectations.
  - How does iRODS compare to the current rsync process?
- Develop a concrete test case, based on iRODS, to test our ideas.
  - Main archive located at the MPI in Nijmegen, the Netherlands.
  - Backup archive located at the RZG in Garching, Germany.
  - Approximately 25-30 TB.
- How much do we have to change in the existing software?
Infrastructure
- iRODS zones provide archive-to-archive connection.
  - Single archive data exists inside an iRODS zone.
  - Use federated zones, ensuring each archive remains autonomous.
- Loose connection to the file system.
  - iRODS mounted collections.
  - iRODS regular collections are too strict, since iRODS controls all files and their metadata.
- XML-RPC interface to existing software (inter-connection layer).
  - Develop an iRODS micro-service to facilitate XML-RPC communication.
  - Use some reserved disk space for caching purposes.
  - How to handle different method signatures?
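One possible way to cope with differing method signatures is a small adapter table that maps a generic workflow action to a repository-specific XML-RPC method name and argument order. The sketch below uses only Python's standard `xmlrpc.client` module to build and decode the request payload; the action name, the method name `ams.exportPermissions`, and the arguments are hypothetical, since the slides do not specify the actual interface.

```python
import xmlrpc.client

# Hypothetical adapter table: generic action -> (repository method, argument order).
SIGNATURES = {
    "export_permissions": ("ams.exportPermissions", ("node", "target_file")),
}


def build_request(action: str, **kwargs: str) -> str:
    """Marshal a generic action into an XML-RPC request for one repository."""
    method, arg_order = SIGNATURES[action]
    params = tuple(kwargs[name] for name in arg_order)
    return xmlrpc.client.dumps(params, methodname=method)


# Keyword arguments can be given in any order; the table fixes the wire order.
request = build_request(
    "export_permissions",
    target_file="/tmp/permissions-export.xml",  # hypothetical path
    node="example-node",                        # hypothetical node identifier
)

# Decoding the payload shows what the repository side would receive.
params, method = xmlrpc.client.loads(request)
print(method, params)
```

A second repository with a different signature would only need its own entry in the table, leaving the workflow blocks unchanged.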
REPLIX Infrastructure
[Figure: architecture diagram. Labels: iRODS, icommands, rule-base, virtual file system, scripts, micro services, mounted collection(s), msiXmlRPC, replix scripts, replix rule-base, jargon, core + rule engine, WM, existing software stack, repository (local) file system.]
Infrastructure
- Two ways of interacting with the REPLIX system:
  - Use the workflow manager (WM).
  - Invoke the workflow rules directly through the icommands.
- Workflow manager is preferred.
  - Exposed through a REST-service interface.
Language Archive Specific
- How does this fit into the existing LAT infrastructure?
[Figure: LAT 1 (SOURCE) and LAT 2 (DESTINATION) connected through REPLIX. Labels on both sides: AMS, IMDI Browser, LAMUS, ams, corpus-structure, crawler. LAT 1 additionally shows PID and pid.]
Language Archive Specific
- LAT synchronization workflow:
  - (1) File synchronization: iRODS sync.
  - (2) Start crawler (index all files): msiExecCmd.
  - (3) Permission synchronization: msiXmlRpc.
  - (4) Update PID information: msiXmlRpc.
- Each step is implemented as an iRODS action.
Language Archive Specific
- (1) Synchronize based on nodes in the archive tree.
  - If the node is the root node, synchronize all files.
  - If the node is not the root node, create a list of files that need to be synchronized and synchronize them.
  - File list export functionality needs to be available.
  - Do not touch file content.
- (2) Start the crawler.
  - Use the iRODS “msiExecCmd” micro-service to start the crawler at the destination archive through a script.
  - The time this could take might be a problem.
  - PIDs should remain untouched and can be used as a reference to the parent archive.
  - Do not touch file content.
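The node-based selection in step (1) can be sketched as a recursive walk over the archive tree. This is an illustration under stated assumptions: the tree, the node names, and the file paths are invented, and the `"*"` marker merely stands for "synchronize everything"; the slides do not describe how the file list export is actually implemented.

```python
from typing import Dict, List

# Hypothetical archive tree: node -> child nodes, and node -> files attached to it.
CHILDREN: Dict[str, List[str]] = {
    "root": ["corpus_a", "corpus_b"],
    "corpus_a": ["session_1"],
    "corpus_b": [],
    "session_1": [],
}
FILES: Dict[str, List[str]] = {
    "root": [],
    "corpus_a": ["a/meta.imdi"],
    "session_1": ["a/s1/audio.wav"],
    "corpus_b": ["b/meta.imdi"],
}


def files_to_sync(node: str, root: str = "root") -> List[str]:
    """Return ["*"] for the root node, otherwise the file list of the sub-tree."""
    if node == root:
        return ["*"]  # root node: synchronize all files, no list needed
    files = list(FILES.get(node, []))
    for child in CHILDREN.get(node, []):
        files.extend(files_to_sync(child, root))
    return files


print(files_to_sync("corpus_a"))
print(files_to_sync("root"))
```

The function only collects paths; consistent with the slide, file content is never read or modified.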
Language Archive Specific
- (3) Replicate archive permissions.
  - AMS is in charge of the permissions in the archive: (node id, user id, permission) triples.
  - Create an export, based on the selected node, at the source archive.
  - Transfer the export to the destination archive.
  - Import the data into AMS at the destination archive.
- Export based on PIDs.
  - Constant between source and destination archive.
  - Translate between PID and node id, since AMS internally uses node ids.
- How to synchronize users?
  - Discard triples for non-existing users.
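The export and import of permission triples can be sketched as two small functions: the source side swaps internal node ids for PIDs, and the destination side maps the PIDs back to its own node ids while dropping users it does not know. All identifiers, user names, and mapping tables below are hypothetical; the actual AMS export format is not given in the slides.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (node id or PID, user id, permission)


def export_triples(triples: List[Triple], node_to_pid: Dict[str, str]) -> List[Triple]:
    """Source side: replace internal node ids by PIDs, which are constant across archives."""
    return [(node_to_pid[node], user, perm) for node, user, perm in triples]


def import_triples(
    exported: List[Triple], pid_to_node: Dict[str, str], known_users: Set[str]
) -> List[Triple]:
    """Destination side: map PIDs to local node ids and discard unknown users."""
    return [
        (pid_to_node[pid], user, perm)
        for pid, user, perm in exported
        if user in known_users
    ]


# Hypothetical data for illustration.
source_triples = [("n1", "alice", "read"), ("n2", "bob", "read")]
exported = export_triples(source_triples, {"n1": "pid:1", "n2": "pid:2"})
imported = import_triples(exported, {"pid:1": "m7", "pid:2": "m9"}, {"alice"})
print(imported)
```

Here the triple for `bob` is discarded because that user does not exist at the destination.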
Language Archive Specific
- (4) Update PID information.
  - After replicating a file from the source archive to another archive, the file's PID record has to be updated.
  - Create an export at the destination archive: (pid, url) pairs.
  - Transfer to the parent archive.
  - Import into PID system.
- How to administer these changes to the PID record?
  - New domains are always allowed to be added.
  - Only allowed to update ‘own’ URL, assuming the domain is constant.
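One reading of the update policy above is: a URL whose domain is not yet in the PID record is appended, while a URL whose domain is already present replaces the existing entry for that domain. The sketch below implements that interpretation; it assumes at most one URL per domain in a record, and the example URLs are placeholders rather than real archive locations.

```python
from typing import List
from urllib.parse import urlparse


def apply_update(record_urls: List[str], new_url: str) -> List[str]:
    """Add a URL for a new domain, or replace the URL of the same ('own') domain."""
    domain = urlparse(new_url).netloc
    updated = []
    replaced = False
    for url in record_urls:
        if urlparse(url).netloc == domain:
            updated.append(new_url)  # same domain: the archive updates its own URL
            replaced = True
        else:
            updated.append(url)      # other archives' URLs are left untouched
    if not replaced:
        updated.append(new_url)      # new domain: always allowed to be added
    return updated


# Hypothetical PID record with one URL at the source archive.
record = ["https://source.example.org/a.wav"]
record = apply_update(record, "https://backup.example.net/a.wav")
record = apply_update(record, "https://backup.example.net/moved/a.wav")
print(record)
```

Identifying ownership by domain only works as long as each archive keeps its domain, which is the assumption stated on the slide.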
Results
- Performance test executed.
  - Transfer files from one zone (MPI) to another federated zone (RZG).
  - Gigabit connection.
- Two sets of tests:
  - Increasing number of small files (100 KB).
  - Decreasing number of increasingly large files (1 MB to 1 GB).
[Figure: Transfer Speed (small files). Series: AVG(iput), AVG(iget). Y axis: Transfer (MB/s), ticks 0 to 0.6. X axis: output_50x100KB.csv, output_100x100KB.csv, output_250x100KB.csv, output_500x100KB.csv, output_1000x100KB.csv, output_5000x100KB.csv.]
[Figure: Transfer Speed (large files). Series: AVG(iput), AVG(iget). Y axis: Transfer (MB/s), ticks 0 to 80. X axis: output_500x1MB.csv, output_100x5MB.csv, output_50x10MB.csv, output_20x25MB.csv, output_20x50MB.csv, output_10x100MB.csv, output_10x250MB.csv, output_5x500MB.csv, output_5x1000MB.csv.]
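The batch names on the chart axes (for example `output_50x100KB.csv`) encode the number of files and the size of each file, so the total volume per batch, and from it a transfer rate in MB/s, can be derived as sketched below. The sketch assumes decimal units (1 MB = 1000 KB), and the duration used in the example is a made-up value, not a measurement from the test.

```python
import re

# Assumption: decimal units, 1 MB = 1000 KB.
UNIT_IN_KB = {"KB": 1, "MB": 1000}


def batch_size_mb(name: str) -> float:
    """Total batch volume in MB, parsed from a name like 'output_50x100KB.csv'."""
    match = re.fullmatch(r"output_(\d+)x(\d+)(KB|MB)\.csv", name)
    if match is None:
        raise ValueError(f"unexpected batch name: {name}")
    count, size, unit = int(match[1]), int(match[2]), match[3]
    return count * size * UNIT_IN_KB[unit] / 1000


def throughput_mb_per_s(total_mb: float, seconds: float) -> float:
    """Average transfer rate for one batch."""
    return total_mb / seconds


small = batch_size_mb("output_50x100KB.csv")
large = batch_size_mb("output_5x1000MB.csv")
print(small, large)

# Hypothetical duration of 100 seconds, for illustration only.
print(throughput_mb_per_s(large, 100.0))
```

Under this naming scheme the large-file batches each total 500 MB to 5000 MB, while the small-file batches range from 5 MB to 500 MB.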
Results
- (Local) pilot to test initial workflow.
  - Transfer files.
  - Trigger crawler.
    - Invoke script at destination.
  - Export permissions.
    - Invoke xmlRPC at source to create export.
    - Transfer export file.
    - Invoke xmlRPC at destination to import.
- Initial results look promising.
Results
- What to do:
  - Implement local pilot project in Nijmegen-Garching environment.
  - Support sub-tree synchronization.
  - Support updating of handle records.
- The interconnection layer requires changes in existing software.
  - The repository is required to provide the interconnection functionality for the synchronization workflow actions that are used.
Questions / Discussion
Any questions?