DQ2 discussion on future features

Distributed Data Management. Miguel Branco. DQ2 discussion on future features. BNL workshop, October 4, 2007.


Transcript of DQ2 discussion on future features

Page 1: DQ2 discussion on future features

Distributed Data Management

Miguel Branco

DQ2 discussion on future features

BNL workshop, October 4, 2007

Page 2: DQ2 discussion on future features

DQ2 0.4.x

• Continue to optimize DB schema to cope with higher load
  – channel allocation to follow ‘Dataset Subscription policy’
• Hiro/Patrick also asking for local configurable ordered list of preferred sources within cloud
  – implications on channel allocation
    • How much to ‘prefer’ a T1 before going to a T2 for a replica? Right now, shortest queue wins…
  – distinguishing files unlikely to have replicas in the future (bad subscriptions)
    • particularly in the local monitoring
  – removing ‘holes’ in system (growing backlogs)
• Reduce load (better GSI session reuse)
• Goal O(100K) file transfers/day/site
  – or SRM/storage limitations
  – Need better understanding outside DQ2
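The question of how much to ‘prefer’ a T1 over a T2, when today the shortest queue wins, can be sketched as a scoring rule. The Python below is only an illustration under stated assumptions: `pick_source`, the `preferred` list, the `penalty` weight and the site names are all hypothetical and are not part of DQ2.

```python
def pick_source(queues, preferred=(), penalty=10):
    """Return the source site with the lowest score.

    queues    -- dict mapping site name to current queue length
    preferred -- ordered sequence of preferred sites (hypothetical local config)
    penalty   -- how many queued files one step down the list is worth
    """
    def score(site):
        # Sites missing from the list rank after every listed site.
        rank = preferred.index(site) if site in preferred else len(preferred)
        return queues[site] + penalty * rank

    # With an empty preference list this reduces to "shortest queue wins".
    # Sorting first makes ties resolve by site name.
    return min(sorted(queues), key=score)


queues = {"T1_A": 120, "T2_B": 40, "T2_C": 95}
print(pick_source(queues))                                           # T2_B
print(pick_source(queues, preferred=["T1_A", "T2_B"], penalty=100))  # T1_A
```

A small `penalty` keeps the behaviour close to shortest-queue; a large one makes the ordered list dominate, which is one way to express the trade-off under discussion.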

Page 3: DQ2 discussion on future features

Local monitoring of site services

Page 4: DQ2 discussion on future features

Staging…

• Did not recognize this was a problem for OSG
• … It is very hard to do with remote storages without SRM
  – FTS 2 + SRMv2 move in the right direction but not there yet
• Could do a local mechanism for T1->T2 transfers in the same cloud
  – provided site services for T2 run “close” to the T1 storage
• … but not for cross T1 transfers

Page 5: DQ2 discussion on future features

Hierarchies
current thoughts, for discussion

• Hierarchical datasets would be a special kind of dataset.
• These would have only 2 states: open AND frozen
• These would not have versions
• The constituents of a hierarchical dataset could only be closed dataset versions or frozen datasets
• Not sure if the following commands should be provided explicitly:
  – list files in hierarchical dataset directly?
    • or only list datasets in hierarchical dataset and forcing user to loop over results?
  – subscribe open hierarchical dataset?
    • or only allow listing datasets in open hierarchical dataset and forcing user to manually subscribe sub-units
    • point is: having to loop over OPEN hierarchies (likely manageable)
  – locations of hierarchical dataset?
    • or only allow listing locations of the individual datasets in the hierarchical dataset?
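The rules proposed on this slide (two states, no versions, only closed versions or frozen datasets as constituents) can be written down as a small data structure. This is a minimal sketch for discussion, not the DQ2 design: the class names, the `state` strings and the `list_files` helper are assumptions made for illustration, and dataset version numbers are not modelled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetRef:
    """A candidate constituent of a hierarchical dataset."""
    name: str
    state: str          # "open", "closed" or "frozen"
    files: tuple = ()


@dataclass
class HierarchicalDataset:
    name: str
    frozen: bool = False                      # only two states: open or frozen
    members: list = field(default_factory=list)

    def add(self, ds):
        if self.frozen:
            raise ValueError("hierarchical dataset is frozen")
        if ds.state not in ("closed", "frozen"):
            raise ValueError("only closed versions or frozen datasets allowed")
        self.members.append(ds)

    def list_datasets(self):
        return [ds.name for ds in self.members]

    def list_files(self):
        # The "direct" listing in question: the loop over constituents
        # is done here instead of by the user.
        return [f for ds in self.members for f in ds.files]


h = HierarchicalDataset("container.example")
h.add(DatasetRef("ds1", "frozen", ("a", "b")))
h.add(DatasetRef("ds2", "closed", ("c",)))
print(h.list_datasets())  # ['ds1', 'ds2']
print(h.list_files())     # ['a', 'b', 'c']
```

Whether `list_files` should exist at all is exactly the open question above; leaving it out would force users to call `list_datasets` and loop themselves.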

Page 6: DQ2 discussion on future features

Merging

• Not much to do from DQ2 side here but provide an attribute for each dataset
  – “merged” Y/N (or protocol: zip, tar?)
• DQ2 does 3rd party transfers only
  – does not actually ‘see’ the data

Page 7: DQ2 discussion on future features

Checksums

• Not much from DQ2 here but enforcing checksums in the central catalogues and its protocol
  – ‘md5:’ for MD5
• adler32 is frequently discussed as a better checksum candidate
  – but not relevant to DQ2, rather to the sites and production people
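For reference, both checksums mentioned above are available in the Python standard library. The sketch below prefixes the MD5 digest with ‘md5:’ as on the slide; the ‘adler32:’ prefix and the function names are assumptions made for this example only.

```python
import hashlib
import zlib


def md5_checksum(data):
    """Return the MD5 hex digest with the 'md5:' prefix."""
    return "md5:" + hashlib.md5(data).hexdigest()


def adler32_checksum(data):
    """Return adler32 as 8 hex digits; the 'adler32:' prefix is hypothetical."""
    return "adler32:%08x" % (zlib.adler32(data) & 0xFFFFFFFF)


print(md5_checksum(b""))      # md5:d41d8cd98f00b204e9800998ecf8427e
print(adler32_checksum(b""))  # adler32:00000001
```

For large files one would feed the data in chunks (`hashlib` objects have `update`, and `zlib.adler32` accepts a running value as its second argument) rather than reading everything into memory.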

Page 8: DQ2 discussion on future features

Subscription lifetime

• Increasingly important…
  – Would clean up what no one is cleaning up now… (some sites with O(100K) files in impossible situations)
• Discussion from yesterday:
  – allow only waitForSources to be set by users with production role?
    • avoid creating looping subscriptions in the system
• Forbid subscriptions for datasets with more than X files, if not production user requesting?
• Forbid more than Y subscriptions per user, if not production user?
• Ignore subscription - regardless of its state - after more than 3 months?
  – Subscription is marked as broken
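The proposals above amount to a handful of admission checks plus an expiry rule. The following Python is a minimal sketch of how they could fit together, assuming placeholder values for X and Y (the slide leaves them open) and treating "3 months" as 90 days; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

MAX_FILES = 10000               # "X" on the slide; placeholder value
MAX_SUBSCRIPTIONS = 50          # "Y" on the slide; placeholder value
MAX_AGE = timedelta(days=90)    # "more than 3 months", taken as 90 days


def may_subscribe(is_production, n_files, n_user_subscriptions,
                  wait_for_sources=False):
    """Apply the proposed limits; production users are exempt from all."""
    if is_production:
        return True
    if wait_for_sources:                       # production role only
        return False
    if n_files > MAX_FILES:                    # dataset too large
        return False
    if n_user_subscriptions >= MAX_SUBSCRIPTIONS:  # would exceed Y
        return False
    return True


def is_broken(created, now):
    """A subscription older than MAX_AGE is ignored and marked broken."""
    return now - created > MAX_AGE


print(may_subscribe(False, 500, 3))                            # True
print(may_subscribe(False, 500, 3, wait_for_sources=True))     # False
print(is_broken(datetime(2007, 6, 1), datetime(2007, 10, 4)))  # True
```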

Page 9: DQ2 discussion on future features

Central catalogues

• [ as mentioned yesterday ]
• Main changes are:
  – for Scalability only…
  – dropping VUIDs (becomes DUID+Version number)
  – DUID becomes timestamp-oriented UUID so that backend is partitioned in time
    • and highly optimized UUID storage on ORACLE
      – meaning shorter index
    • ORACLE partitioning, redirect service…
  – … but fully backward compatible with 0.3 clients
• Many queries become much faster
  – list files in dataset is a query by DUID as opposed to a query by N VUIDs
  – ORACLE IOTs guarantee that listing files from a dataset [version] reads close to sequential blocks on disk
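To illustrate why a timestamp-oriented identifier helps partition the backend in time, here is a minimal Python sketch. The byte layout (8-byte big-endian microsecond timestamp followed by 8 random bytes), the monthly partition label and the function names are assumptions chosen for the example; the slide does not specify the actual DUID format or the ORACLE partitioning scheme.

```python
import os
import struct
import time


def make_duid(now=None):
    """Build a 16-byte identifier whose first 8 bytes are a timestamp."""
    seconds = time.time() if now is None else now
    micros = int(seconds * 1_000_000)
    return struct.pack(">Q", micros) + os.urandom(8)


def partition_of(duid):
    """Map an identifier to a monthly partition label such as '2007-10'."""
    micros = struct.unpack(">Q", duid[:8])[0]
    return time.strftime("%Y-%m", time.gmtime(micros / 1_000_000))


duid = make_duid(now=1191456000)   # 2007-10-04 00:00:00 UTC
print(len(duid))                   # 16
print(partition_of(duid))          # 2007-10
```

Because the timestamp is the big-endian prefix, identifiers created at different times sort in creation order as raw bytes, so rows for recently created datasets cluster together in an index.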

Page 10: DQ2 discussion on future features

Location catalogue

• [ as mentioned yesterday ]
• Location catalogue will be populated asynchronously with:
  – information on missing files
  – (re)marking complete/incomplete locations for existing datasets - consistency
  – Missing files are extra information made available on ‘best-effort’ to the users
    • derived from request by Ganga
• This is populated by the ‘tracker’ service
  – Which was being reworked for the site services
  – The tracker service is a ‘stronger’ Fetcher (as existing on the site services), used to find content on site VS content missing on site - one of the site services performance bottlenecks
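At its core, the comparison the tracker performs (content on site versus content missing on site) is a set difference. The Python below is only a sketch of that idea; the function name `track` and the shape of its result are hypothetical and do not describe the real tracker service.

```python
def track(dataset_files, site_files):
    """Split a dataset's files into those present and missing on a site."""
    expected = set(dataset_files)
    present = expected & set(site_files)
    missing = expected - present
    return {
        "present": sorted(present),
        "missing": sorted(missing),
        "complete": not missing,   # feeds the complete/incomplete marking
    }


result = track(["f1", "f2", "f3"], ["f1", "f3", "other"])
print(result["missing"])   # ['f2']
print(result["complete"])  # False
```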

Page 11: DQ2 discussion on future features

Dashboard

• Relatively big update coming soon
  – distinguish errors source/destination
  – display messages on the dashboard for all sites
  – alarms supported
  – more overview of site services state from a central place
    • e.g. states of files (based also on new site services monitoring)

Page 12: DQ2 discussion on future features

ToA

• More and more info there…
• Blacklist/whitelist
• Preferred site connections
• This is a cache file, same style as ToA
  – but independent file from ToA cache since it is more dynamic
• ToA renewal much stronger
  – I’d claim it is the most reliable info system so far on the Grid :-)

Page 13: DQ2 discussion on future features

Communication…

• … still not working:
  – e.g. did not recognize staging as a problem
  – e.g. 0.3.2 apparently not deployed on OSG T2s
    • quite bad as 0.3.1 had a simple bug where agents could simply die whenever a glitch happened in the central catalogue connection
      – glitches “common” with the central catalogue request rate, but harmless and ok to retry
• … what to do here?
• Jabber chatroom :-)
  – [email protected]
  – ask me - [email protected] or [email protected] - to be authorized