Installing, running, and maintaining large Linux Clusters at CERN (Thorsten Kleinwort, CERN-IT/FIO)
Installing, running, and maintaining large Linux Clusters at CERN
Thorsten Kleinwort, CERN-IT/FIO
CHEP 2003, 24.03.2003
3/24/2003 Thorsten Kleinwort CERN-IT 2
Overview
• The Linux clusters at the CERN CC
• Recent achievements to improve manageability
  • Installation
  • Configuration
  • Monitoring
  • Collaboration with EDG (WP4)
• Maintenance of the clusters
• The batch system LSF
• Steps towards LHC Computing
• References
Introduction
• The computing facilities in the CERN Computer Center:
  • Decommissioned non-Linux platforms, apart from some Suns
  • Merged private clusters into two big, shared clusters:
    • LXPLUS for interactive use (~80 nodes)
    • LXBATCH as batch farm (~700 nodes)
• All commodity hardware (towers, dual CPU), but diverse (CPU speed, disk sizes and count, memory, …)
• The current OS is RedHat Linux; we are in the transition from 6.1 to 7.3, around 70% done
The CERN Computer Center
Recent achievements…
• Moving from RedHat 6.1 to 7.3
• Revised and rewrote the existing installation and maintenance tools, because the requirements have changed:
  • Focusing on Linux
  • Using well-established tools/protocols/languages (RPM, HTTP, XML, …)
  • Standards adherence (LSB, init scripts, …)
• Separated installation & configuration:
  • Identified all parts of the installation
  • Identified all sources of configuration information
Installation
• The system is installed with kickstart
• The installation is completely automatic
• Software installation: RPM
  • RPM is the tool of choice:
    • Allows easy install/update/uninstall
    • Version control
• Additional software comes in RPMs, ours as well as software from others (e.g. CASTOR, EDG, LCG, …)
• (Post-)installation is split up into components:
  • One RPM per component for installation
  • Configuration is done per component as well
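For readers unfamiliar with kickstart, a file driving such an unattended RedHat 7.3 install looks roughly like the sketch below. All values are invented placeholders for illustration, not CERN's actual configuration:

```
# ks.cfg sketch -- illustrative values only
install
lang en_US
network --bootproto dhcp
clearpart --all
part / --size 4096
part swap --size 512

%packages
@ Base

%post
# post-install hook: fetch the per-component RPMs from the install server
```

Because every answer the installer would normally ask for interactively is in the file, the node comes up identically on every (re)install, which is what makes "update must equal reinstall" feasible.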
Configuration
• Configuration of the system:
  • We enhanced SUE with a configuration interface
• Identified all sources of configuration information:
  • First step: make this information available through one interface (CCConfig)
  • Next step: work on the unification and merging of the different data sources behind it (ongoing)
Configuration II
• Using the EDG WP4 configuration tool: Pan & CDB (Configuration Data Base) for describing hosts
• Pan is a very flexible language for describing host configuration information:
  • Expressed in templates (ASCII)
  • Allows includes (inheritance)
• Pan is compiled into XML inside CDB
• The XML is downloaded, and the information is provided through CCConfig, the high-level API
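To give a flavour of the template style described above, a Pan host profile might look roughly like this; the template name, include target, paths, and values are invented for illustration, and the syntax is only a sketch of the Pan language:

```
# Hypothetical Pan template sketch (names and values invented)
template profile_lxb0001;

include machine_types/batch;   # inheritance: pull in the common batch-node setup

"/system/network/hostname" = "lxb0001";
"/software/packages/openssh" = "3.1p1-6";
```

Templates like this are compiled into XML in CDB, and clients read the resulting per-host tree rather than the templates themselves.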
Monitoring
• Adoption of EDG WP4 monitoring:
  • Has replaced the old home-grown alarm scripts
• Still relying on the old alarm system (SURE):
  • Will be replaced, either by the WP4 tool or by a commercial tool (PVSS)
• The monitoring information is stored in a database:
  • With a user API for queries
  • Eliminates the need for client access
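The idea of the last bullet can be illustrated with a minimal sketch: agents feed samples into a central store, and consumers query an API instead of contacting the nodes. The table layout and function names here are invented; the real WP4 repository is a different system:

```python
# Toy monitoring store: collector writes samples, users query the store,
# never the client nodes.  (Schema and API invented for illustration.)
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metrics (host TEXT, metric TEXT, value REAL, ts INTEGER)")

def record(host, metric, value, ts):
    """Called centrally on behalf of a node's monitoring agent."""
    db.execute("INSERT INTO metrics VALUES (?, ?, ?, ?)", (host, metric, value, ts))

def latest(host, metric):
    """User API: most recent value, with no access to the node itself."""
    row = db.execute(
        "SELECT value FROM metrics WHERE host=? AND metric=? "
        "ORDER BY ts DESC LIMIT 1", (host, metric)).fetchone()
    return row[0] if row else None

record("lxb0001", "load1", 2.5, 1000)
record("lxb0001", "load1", 3.0, 1060)
print(latest("lxb0001", "load1"))  # prints the most recent sample: 3.0
```

Keeping history in the store also allows queries over time ranges, which per-node polling cannot offer.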
Maintenance
• Machines must be 'updatable':
  • Updating a machine must lead to the same result as a new install
• Rpmupdate:
  • Based on RPMT, a transactional RPM which allows updates, installs, and uninstalls at the same time
  • Will be superseded by the EDG WP4 tool: SPMA
• Notification mechanism:
  • No automatic/periodic upgrade
  • The change mechanism is triggered to run on the nodes
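The "transactional" property RPMT provides can be modelled in a few lines: a batch of installs and removals either applies completely or leaves the node untouched. The function and data names below are invented; real RPMT drives rpm itself:

```python
# All-or-nothing package transaction, as a toy model of the RPMT idea.
import copy

def apply_transaction(installed, operations):
    """operations: list of ('install'|'remove', name, version) tuples."""
    state = copy.deepcopy(installed)        # work on a copy ...
    for op, name, version in operations:
        if op == "install":
            state[name] = version           # covers install and update alike
        elif op == "remove":
            if name not in state:
                raise KeyError(f"{name} not installed")  # aborts the whole batch
            del state[name]
    installed.clear()                       # ... and commit only on success
    installed.update(state)

node = {"openssh": "3.1p1-3"}
apply_transaction(node, [("install", "openssh", "3.1p1-6"),
                         ("install", "castor", "1.4.1")])
print(node)  # both changes landed together
```

If any step in the batch fails, the caller sees an exception and the node keeps its previous package set, which is what makes "update equals reinstall" safe to rely on.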
The batch system (LSF)
• Current version: LSF 4.2
  • LSF 5.1 is being evaluated at the moment
• No multi-clusters any more
• We introduced fairshare, for better utilization of unused capacity:
  • Experiments have guaranteed shares of the batch capacity
  • If shares go unused, they can be used by others
  • No more available but unusable resources
• We oversubscribe our hosts (3 jobs per dual-CPU machine)
• Close collaboration with the provider, Platform: they benefit from our big farm, we benefit from their help and their willingness to implement our requirements
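The fairshare bullets above can be sketched as a simple allocation rule: each group first gets up to its guaranteed slice, then idle slots flow to groups that still have pending work. The group names and numbers are invented, and LSF's real fairshare scheduler works via dynamic job priorities, not this one-shot calculation:

```python
# Toy fairshare: guaranteed shares first, then idle capacity is re-used.
def allocate(capacity, shares, demand):
    """shares: fraction per group (sums to 1); demand: requested job slots."""
    alloc = {g: min(demand[g], int(capacity * s)) for g, s in shares.items()}
    spare = capacity - sum(alloc.values())
    # hand the idle slots to groups with unmet demand, largest share first
    for g in sorted(shares, key=shares.get, reverse=True):
        extra = min(spare, demand[g] - alloc[g])
        alloc[g] += extra
        spare -= extra
    return alloc

# One group under-uses its half; the other absorbs the idle slots.
print(allocate(100, {"alice": 0.5, "atlas": 0.5},
               {"alice": 10, "atlas": 90}))  # {'alice': 10, 'atlas': 90}
```

The point of the example is the last line: without fairshare, 40 of the 100 slots would sit reserved but idle; with it, the busy group runs 90 jobs while the other's guarantee remains intact the moment its demand returns.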
Other improvements
• Secure installations:
  • Each node has its own GPG key pair to exchange secure information:
    • e.g. for SSH keys, the (encrypted) root password
• Intervention rundown:
  • Allows a scheduled reboot of batch nodes once they have finished their batch jobs, e.g. for a new kernel or other software installs
• Server cluster:
  • Serves the RPMs, the configuration information, etc.
  • Several machines, selected by 'dynamic DNS aliases'
Going to Grid Computing
• Merging EDG/VDT middleware into a large-scale production farm
• Enlarging our batch capacity by 400 nodes in April
• Early contribution to LCG 1 by this summer
  • LXBATCH fully integrated by Q4/2003
• Close collaboration with EDG (WP4) and LCG will continue
Conclusions
• Redid the Linux installation for RH 7.3:
  • Clearer concepts, new tools
  • Streamlined it with the EDG WP4 tools
• Continuous collaboration with EDG WP4 and LCG
• Facing and implementing the needs of Grid computing
References
• CERN-IT/FIO: http://it-div-fio.web.cern.ch/it-div-fio/
• EDG: http://eu-datagrid.web.cern.ch/eu-datagrid/
• WP4: http://hep-proj-grid-fabric.web.cern.ch/hep-proj-grid-fabric/
• LCG: http://lcg.web.cern.ch/LCG/
• SUE: http://proj-sue.web.cern.ch/proj-sue/
• LCFG: http://www.lcfg.org/
• LSF (Platform Computing): http://www.platform.com/
• PVSS: