Clusters, SubClusters and Queues A Spotters Guide Chris Brew HepSysMan 06/11/2008.

Clusters, SubClusters and Queues

A Spotters Guide

Chris Brew

HepSysMan 06/11/2008

Slide 2

Current Default Setup

• YAIM sets up by default
  – One Cluster (Batch System)
  – One SubCluster (Set of WNs)
  – Multiple CEs (queues) pointing to the SubCluster

• Falls down with
  – Non Identical Worker Nodes
  – Multiple CENodes attached to the same batch system

Slide 3

The way it’s supposed to be

[Diagram: one CE Node with one Cluster; three SubClusters, one each for the Type 1, Type 2 and Type 3 WNs; queues Q1 to Q5; Tags]

Slide 4

The way it usually is

[Diagram: one CE Node with one Cluster and a single SubCluster; worker node rows labelled Type 1 WN, Type 21 WN and Type 31 WN; queues Q1 to Q5; Tags]

Slide 5

How bad it can be

[Diagram: two CE Nodes, each with its own Cluster, SubCluster, queues Q1 to Q4 and Tags; worker node rows labelled Type 1 WN, Type 21 WN and Type 31 WN]

Slide 6

Problem of Non Identical Worker Nodes

• Default setup assumes that all worker nodes are identical
  – Obviously not the case at most sites
  – SubCluster has to publish the lowest spec WN

• Leads to:
  – Small memory jobs wasting large memory nodes
  – Inability to publish existence of large memory nodes
  – Differing CPU specs lead to inaccurate timing and accounting (CPU scaling helps here)
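The "lowest spec" rule above can be illustrated with a minimal Python sketch. The node names and memory sizes are hypothetical, and the code only models the idea that a single SubCluster must advertise a value that holds for every WN behind it; it is not how YAIM or the information system is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerNode:
    name: str      # hypothetical host name
    ram_mb: int    # physical memory in MB


def published_ram_mb(nodes):
    """One SubCluster for all nodes: only the minimum is safe to advertise."""
    return min(n.ram_mb for n in nodes)


def nodes_able_to_run(nodes, job_ram_mb):
    """Nodes that could really run the job, whatever is advertised."""
    return [n.name for n in nodes if n.ram_mb >= job_ram_mb]


# Hypothetical mixed site: three hardware generations in one SubCluster.
nodes = [
    WorkerNode("wn-a", 500),
    WorkerNode("wn-b", 1000),
    WorkerNode("wn-c", 2000),
]

advertised = published_ram_mb(nodes)
job_ram_mb = 1500
print("advertised RAM (MB):", advertised)
print("job matches site:", job_ram_mb <= advertised)
print("nodes that could run it:", nodes_able_to_run(nodes, job_ram_mb))
```

In this toy site the 1500MB job does not match the advertised value even though one node could run it, which is the "inability to publish existence of large memory nodes" point.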

Slide 7

Problem of multiple CENodes

• Sites want to add multiple CENodes for Scaling and Redundancy
  – Should just add CEs (queue endpoints)
  – Currently duplicates Clusters and SubClusters

• Causes problems in CPU counting (gStat, GridMap, Accounting Reports, etc.)

• Various hacks to try to help with this
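A rough Python sketch of why duplicated (Sub)Clusters inflate CPU counts. The record layout, host names and numbers are assumptions made for illustration; real tools such as gStat read the published information directly and are not modelled here.

```python
from collections import namedtuple

# Hypothetical record: what one CENode publishes about the batch system behind it.
Published = namedtuple("Published", ["ce_node", "batch_system", "logical_cpus"])


def naive_cpu_total(records):
    """Sum every published SubCluster, as a simple counting tool might."""
    return sum(r.logical_cpus for r in records)


def per_batch_system_total(records):
    """Count each underlying batch system once, however many CENodes front it."""
    seen = {}
    for r in records:
        seen.setdefault(r.batch_system, r.logical_cpus)
    return sum(seen.values())


# Two CENodes attached to the same (hypothetical) 400-CPU batch system.
records = [
    Published("ce01", "batch-A", 400),
    Published("ce02", "batch-A", 400),
]

print("naive total:", naive_cpu_total(records))
print("per batch system:", per_batch_system_total(records))
```

The second function assumes the counter can tell that both CENodes share one batch system, which is exactly the information the duplicated publishing does not make obvious.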

Slide 8

Current Hacks

• Can already set up multiple Clusters, SubClusters to advertise different memory queues
  – See publishing for RAL-LCG2 and UKI-SOUTHGRID-RALPP
  – Involves hand crafted ldif files to set up (Sub)Clusters and map queues to them
  – Cannot let YAIM near them

Slide 9

Traylen Proposal

• Move (Sub)Cluster publishing from CENode to new node type
  – Probably share node with site-bdii

• CENode gip will associate queues to SubClusters

• Software Tags are currently associated with the CENode, not the (Sub)Cluster; they’ll be fixed and published through the new node type.

Slide 10

How it may be

[Diagram: a Glite-Cluster Node holding the Cluster and three SubClusters, each with its own Tags; three CE Nodes, each with queues Q1 to Q5; worker node rows labelled Type 1 WN, Type 2 WN and Type 3 WN]

Slide 11

Our Experience

• We’ve put in hand crafted ldif files to define 500MB, 1000MB and 2000MB SubClusters

• grid[500|1000|2000] queues pointing at them on both CENodes

• Technically it works – jobs with higher memory requirements only match the high memory queues

• In practice it makes no difference – almost no jobs include memory requirements
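The matching behaviour described above can be sketched in a few lines of Python. The queue names follow the grid[500|1000|2000] pattern from the slide; the matching function is a simplified stand-in for what a resource broker does and is an assumption, not its actual logic.

```python
# Memory (MB) advertised by the SubCluster behind each queue.
QUEUE_RAM_MB = {"grid500": 500, "grid1000": 1000, "grid2000": 2000}


def matching_queues(job_ram_mb=None):
    """Queues whose advertised memory satisfies the job.

    A job with no memory requirement (the common case) matches every queue.
    """
    if job_ram_mb is None:
        candidates = QUEUE_RAM_MB
    else:
        candidates = {q: ram for q, ram in QUEUE_RAM_MB.items() if ram >= job_ram_mb}
    # Smallest queue first, so output order is stable.
    return sorted(candidates, key=QUEUE_RAM_MB.get)


print(matching_queues(1500))  # job that states a 1500MB requirement
print(matching_queues())      # job that states nothing
```

A job that asks for 1500MB only matches the 2000MB queue, while a job with no stated requirement matches all three, which is why the extra SubClusters make little practical difference.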

Slide 12

Conclusion

• You’re probably not doing it right at the moment
  – But the fix is probably worse

• You can add hacks to provide more info to the batch system
  – But it probably won’t make any difference

• Things are likely to change (for the better) in the near future
  – Wait until then