CERN Batch System, Monitoring and Accounting - HEPiX Fall 2012
PES
CERN IT Department
CH-1211 Geneve 23, Switzerland
www.cern.ch/it
CERN Batch System, Monitoring and Accounting
HEPiX Fall 2012
Jerome Belleman, CERN – IT-PES
October 2012
Context
Growing community
Busier batch system
Agile Infrastructure project
Outline
1 Batch System Challenges
2 Batch Monitoring Tools
3 Batch Accounting Overhaul
Section 1
Batch System Challenges
CERN Batch Setup
Platform LSF 7.0.6
All resources to one cluster
Different shares for different customers: public, grid and several for CERN experiments
[Diagram: local jobs and grid jobs; LSF master node, NFS server and LSF master failover; worker nodes (WN)]
A Large Batch System
> 4 000 physical nodes
> 60 000 cores, some SMT-enabled (25% overcommit)
> 55 000 job slots, > 400 000 jobs/day
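To put these figures in perspective, they can be turned into average rates. The following is a back-of-the-envelope sketch in Python, not something shown in the talk. It treats the ">" figures as exact values and assumes every job occupies exactly one slot; both are assumptions, so the results are rough orders of magnitude rather than measurements.

```python
# Figures from the slide, treated as exact values (assumption: the slide
# only gives lower bounds).
JOBS_PER_DAY = 400_000
JOB_SLOTS = 55_000
SECONDS_PER_DAY = 24 * 60 * 60


def average_rates(jobs_per_day: int, job_slots: int) -> dict:
    """Return average throughput figures derived from daily totals."""
    jobs_per_slot = jobs_per_day / job_slots
    return {
        "jobs_per_second": jobs_per_day / SECONDS_PER_DAY,
        "jobs_per_slot_per_day": jobs_per_slot,
        # Upper bound on the mean job duration, reached only if every slot
        # were busy all day (assumption: one slot per job).
        "max_mean_job_hours": 24 / jobs_per_slot,
    }


rates = average_rates(JOBS_PER_DAY, JOB_SLOTS)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

These are daily averages; submission bursts can be much higher than the mean rate suggests.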
Future of the Batch Service
Agile Infrastructure Project:
Virtualise resources in CC: batch nodes to be fat VMs
Uniform IaaS layer
Configuration management with Puppet
Today’s Operational Issues
High submission and query load → Slow response
Ensuring fairshare scheduling
Complex LSF setup
Poor dynamism requiring daily reconfiguration
Scalability
Possible Alternatives to LSF
Goal for 5 years:
4 000 → 12 000 physical nodes
60 000 → 300 000 cores
Support frequent structural changes
Possible alternatives (unordered):
LSF 8
Condor
Grid Engine
Torque
SLURM ←
Evaluating SLURM
From the SLURM Web site:
Free
65 000 physical nodes
120 000 jobs/hour
Active community
Extensible via plug-ins
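As a rough sanity check, the scalability figures quoted from the SLURM Web site can be set against the five-year goals stated earlier in the talk. The sketch below uses only numbers from the slides, plus one assumption that is not from the talk: that the job rate grows in proportion to the number of cores.

```python
# Figures quoted on the slides.
SLURM_CLAIMED_NODES = 65_000
SLURM_CLAIMED_JOBS_PER_HOUR = 120_000
CURRENT_JOBS_PER_DAY = 400_000
CURRENT_CORES = 60_000
TARGET_NODES = 12_000
TARGET_CORES = 300_000

current_jobs_per_hour = CURRENT_JOBS_PER_DAY / 24

# Assumption (not from the talk): job rate scales linearly with core count.
projected_jobs_per_hour = current_jobs_per_hour * TARGET_CORES / CURRENT_CORES

# Ratio of what the SLURM site claims to what the goal would require.
node_headroom = SLURM_CLAIMED_NODES / TARGET_NODES
rate_headroom = SLURM_CLAIMED_JOBS_PER_HOUR / projected_jobs_per_hour

print(f"current rate:   {current_jobs_per_hour:,.0f} jobs/hour")
print(f"projected rate: {projected_jobs_per_hour:,.0f} jobs/hour")
print(f"node headroom:  {node_headroom:.1f}x")
print(f"rate headroom:  {rate_headroom:.1f}x")
```

Under that assumption the claimed figures leave some margin on paper, but they are averages against vendor-style claims, which is presumably why a test bed with reproducible load is needed.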
Test bed:
Implement and test hierarchical fairshare model
Controllably submit queries and jobs
Reproducible load
Scale number of hosts, jobs, slots and queries
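The idea behind a hierarchical fairshare model can be illustrated with a small share tree: each group receives a fraction of its parent's entitlement in proportion to its shares among its siblings. The sketch below is a generic illustration only; it is not the LSF or SLURM algorithm, and the group names and share values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ShareNode:
    """A node in a share tree; names and share values are hypothetical."""
    name: str
    shares: int
    children: List["ShareNode"] = field(default_factory=list)


def effective_shares(node: ShareNode, fraction: float = 1.0) -> Dict[str, float]:
    """Return the fraction of the whole cluster each leaf is entitled to."""
    if not node.children:
        return {node.name: fraction}
    total = sum(child.shares for child in node.children)
    result: Dict[str, float] = {}
    for child in node.children:
        # A child gets its parent's fraction, split by shares among siblings.
        result.update(effective_shares(child, fraction * child.shares / total))
    return result


tree = ShareNode("root", 1, [
    ShareNode("public", 20),
    ShareNode("grid", 30),
    ShareNode("experiments", 50, [
        ShareNode("experiment_a", 3),
        ShareNode("experiment_b", 1),
    ]),
])

entitlements = effective_shares(tree)
for group, share in entitlements.items():
    print(f"{group}: {share:.3f}")
```

A real scheduler would combine such entitlements with recent usage to rank pending jobs; that part is left out here.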
Section 2
Batch Monitoring Tools
Technology Overview
Oracle, Python, Matplotlib & Django → Stats
Cassandra → Fairshare monitoring
OpenTSDB → Live monitoring
Splunk → Historical usage
Live Monitoring with OpenTSDB
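The talk does not say how batch metrics reach OpenTSDB. As a minimal sketch, assuming data points are sent with OpenTSDB's line-based `put` command (`put <metric> <timestamp> <value> <tagk=tagv> ...`), a data point could be formatted as below. The metric name and tags are hypothetical, not the ones used at CERN.

```python
import time
from typing import Dict, Optional


def opentsdb_put_line(metric: str, value: float, tags: Dict[str, str],
                      timestamp: Optional[int] = None) -> str:
    """Format one data point as an OpenTSDB 'put' command line."""
    if not tags:
        # OpenTSDB expects at least one tag on every data point.
        raise ValueError("at least one tag is required")
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the Unix epoch
    # Sort the tags so the output is deterministic.
    tag_text = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_text}"


# Hypothetical metric and tags, with a fixed timestamp for reproducibility.
line = opentsdb_put_line(
    "batch.jobs.running", 41250,
    {"queue": "grid", "cluster": "main"},
    timestamp=1349049600,
)
print(line)
```

Tags such as queue or cluster are what allow the same metric to be sliced per customer on a live dashboard.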
Historical Usage with Splunk
Section 3
Batch Accounting Overhaul
New Batch Accounting: Goals
Make portable to other schedulers
Publish local job information
Publish correct normalisation factor per job
Use the new APEL software
Remove complexity, improve consistency
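One way to read "normalisation factor per job" is that each job's CPU time is scaled by a factor describing the host it actually ran on, instead of by a single site-wide value. The sketch below illustrates that reading only; the record fields, host names and factor values are hypothetical and do not reflect the APEL record format.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class JobRecord:
    """Minimal, hypothetical accounting record for one finished job."""
    job_id: str
    host: str
    cpu_seconds: float


# Hypothetical scaling factors relative to a reference machine.
HOST_FACTORS: Dict[str, float] = {"wn001": 1.0, "wn002": 1.5}


def normalised_cpu_seconds(job: JobRecord, host_factors: Dict[str, float]) -> float:
    """Scale a job's CPU time by the factor of the host it ran on."""
    return job.cpu_seconds * host_factors[job.host]


jobs = [
    JobRecord("job-1", "wn001", 3600.0),
    JobRecord("job-2", "wn002", 3600.0),
]
normalised = {job.job_id: normalised_cpu_seconds(job, HOST_FACTORS) for job in jobs}
for job_id, seconds in normalised.items():
    print(f"{job_id}: {seconds:.0f} normalised CPU seconds")
```

With a single averaged factor both jobs would be reported identically, even though they ran on hosts of different speed.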
Old vs. New Batch Accounting
[Diagram labels: CEs, BLAH File, LRMS, Acct. File, Acct. Reports, Acct. Page, APEL, Acct. Portal, Daily, Filter, Local APEL, SSM Messaging]
[Diagram labels, final build: CEs, BLAH File, LRMS, Acct. File, Acct. Reports, Acct. Page, APEL, Acct. Portal, Real-Time, Filter, Local APEL, SSM Messaging]
Conclusion
We need to scale
We’re moving to new infrastructure tools
CERN batch service being prepared for future challenges
Thanks!
Questions?