Rewriting The Rules For Enterprise IT: Enterprise Grid Orchestrator
Christof Westhues SE Manager EMEA
Platform Computing 2007/03/01
National Grid Meeting Ankara
© Platform Computing Inc. 2003
Platform Enterprise Grid Orchestrator: boosting EU Grid Technology exploitation
Agenda
Increasing the industrial impact of EU Grid Technologies Programme
About Platform Computing
Understanding Industry requirements
Unified Grid resource layer
Integrate your Grid solution with Platform EGO
Platform Collaborations – EGEE, DEISA etc.
Conclusion - Open for new ideas
Platform Enterprise Grid Orchestrator: boosting EU Grid Technology exploitation
Increasing the industrial impact of EU Grid Technologies Programme with Platform Enterprise Grid Orchestrator
The EU Grid Technologies Programme targets the logical next step: "From Vision to Impacts in Industry and Society".
How do we make this real?
Platform Computing holds what is probably the largest commercially productive install base of Grid infrastructure in industry worldwide.
It is now introducing the Enterprise Grid Orchestrator (EGO), the first Grid SOI (Service-Oriented Infrastructure) rolled out at large scale for technical as well as business computing.
Platform Computing invites all Grid technology solutions to integrate with EGO's unified Grid resource layer.
Platform Computing
Platform Computing
The leading systems infrastructure software company, accelerating applications and delivering IT agility to high-performance data centers
14 years of grid computing experience
Global network of offices, resellers & partners
7 x 24 world-wide support and consulting
Gartner Group 2006 “Cool Vendor” award in I.T. Operations Management
Over 2,000 leading Global Customers
Our Customers: from all verticals
Electronics: AMD, ARM, ATI, Broadcom, Cadence, HP, IBM, Motorola, NVIDIA, Qualcomm, Samsung, ST Micro, Synopsys, Texas Instruments, Toshiba
Financial Services: Fidelity Investments, HSBC, JP Morgan Chase, Mass Mutual, Royal Bank of Canada, Sal. Oppenheim, Société Générale, Lehman Brothers
Industrial Manufacturing: BMW, Boeing, Bombardier, Airbus, Daimler Chrysler, GE, GM, Lockheed Martin, Pratt & Whitney, Toyota, Volkswagen
Life Sciences: AstraZeneca, Bristol-Myers Squibb, Celera, DuPont, GSK, Johnson & Johnson, Merck, Novartis, Pfizer, Wellcome Trust Sanger Institute, Wyeth
Government & Research: ASCI, CERN, US DoD, US DoE, ENEA, Fleet Numeric, Max Planck, SSC China, TACC, University of Tokyo
Other Business: Bell Canada, Cablevision, eBay, Starwood Hotels, Telecom Italia, Telefónica, Sprint, GE, IRI, Cadbury Schweppes
Understanding Industry requirements
Understanding Industry requirements
Grid value: shared resources and shared usage.
Unify many different users AND multiple different workload types. Avoid building "Grid silos": don't become part of the problem.
The primary target is "agility": speed and ease of change, driven by business-process and business-change needs. As a consequence of handling all workload in the Grid, orchestration, scaling and acceleration result in agility.
Let's have a look at the users. In industry, generically, these are professional users aiming to create results (€, $, ₤) using the tool "Grid".
Call them customers (a change of perspective).
Understanding Industry requirements
Quality requirements
Reliability: self-healing, recovery from incidents, policy-driven proactive problem containment, and no job loss during operation or in error conditions, including during reconfiguration or failover.
Performance: tens of millions of jobs per day throughput at 90% job-slot utilization, based on a 15-minute job runtime, with at most 5 minutes for failover.
Scalability: thousands of users and hosts, millions of jobs in one logical cluster at any time, tens of millions of jobs per day throughput, and thousands-way-parallel jobs.
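Those throughput and scalability targets can be cross-checked with quick, illustrative arithmetic. The figures below simply restate the slide's numbers (10 million jobs/day, 15-minute runtime, 90% utilization); the script itself is an added sketch, not part of the original material:

```python
import math

# Slide targets: ~10 million jobs/day at 90% job-slot utilization,
# with an average job runtime of 15 minutes.
jobs_per_day = 10_000_000
job_runtime_min = 15
utilization = 0.90

# Sustained dispatch rate the scheduler must handle.
dispatch_rate_per_sec = jobs_per_day / 86_400

# Jobs concurrently in flight, hence job slots needed at 90% utilization.
concurrent_jobs = jobs_per_day * job_runtime_min / (24 * 60)
slots_needed = math.ceil(concurrent_jobs / utilization)

print(f"{dispatch_rate_per_sec:.0f} dispatches/s, {slots_needed} job slots")
# -> 116 dispatches/s, 115741 job slots
```

In other words, sustaining the stated daily volume implies a cluster on the order of a hundred thousand busy job slots, which is why performance and scalability are listed together.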
LSF Roadmap
The LSF product roadmap is based on feedback from, and interviews with, 75+ customers including:
Agilent, Airbus, AMD, ARM, Apple, ATI, BASF, BMS, Boehringer Ingelheim, Boeing, Broadcom, Caltech, CEA, Cineca, Cinesite, Conexant, Daimler Chrysler, DEISA, Devon Energy, Disney, DoD (ARL, ASC, ERDC, MHPCC, NAVO), DoE (LANL, LLNL, Sandia), Dreamworks, Emulex, Engineous, Ferrari, Fleet Numerics, Ford, Freescale, GE, GM, Halliburton/Landmark, Harvard, Hilti, HP, IDT, Intel, J&J, Land Rover Jaguar, Lockheed Martin, LSI Logic, Magma, Merck, Motorola, MSC, MTU, NCAR, NCSA, Nissan, NOAA, Novartis, Novo Nordisk, NVidia, Philips, Pratt & Whitney, Pfizer, PSA, Qlogic, Qualcomm, RBC, Renesas, Samsung, Sandisk, Seagate, Shell, Skyworks, Synopsys, TACC, Tenor Networks, TI, Toshiba and Volvo
Understanding Industry requirements
Quality requirements
Why scaling counts: performance and scalability translate into reliability.
Reliability can be measured as "MTBF": Mean Transactions (= Jobs) Between Failure. Platform technology meets this requirement as the technology leader.
Support 24/7 around the globe.
Non-technical quality requirements:
Focus on Grid technology: commitment.
A reliable partner: experienced, stable, profitable.
Unified Grid resource layer
Enterprise Grid Problem: workload characteristics
Heterogeneous enterprise resources: servers, licenses, data, storage, network bandwidth.
HPC and enterprise applications place unpredictable, effectively infinite demand on finite compute resources. The result: under-provisioning or over-provisioning.
IT Architectures Are Still Statically Coupled and Silo'd
Core applications in the data center place unpredictable, infinite demand on finite computing resources. With multiple engineering groups collaborating on multiple designs, core and business applications can consume vast amounts of computing resources.
Applications are "siloed", often procured out of different budgets at different times for different purposes.
The Need is for Variable Resources to Meet Variable Business Demand
Results of a statically coupled and silo'd infrastructure are the data-center business "pain points":
Underutilized resources: some server silos have insufficient capacity while there is excess capacity in others.
Difficulty meeting SLAs: it is difficult to meet application SLAs because resources may not be available when required.
Costly IT environment: with application silos underutilized, excess capacity, cooling, space and power are required.
Complex: coordination of resources is complex, time-consuming and error-prone.
Unpredictable: hardware failures, outages or insufficient capacity make the environment unpredictable.
Model architecture
Core applications in the data center place unpredictable, infinite demand on finite computing resources. The model architecture therefore:
Creates a shared pool of computer systems
Decouples resources from applications
Open & Decoupled Architecture: Platform Enterprise Grid Orchestrator
(Architecture diagram.) The Platform EGO kernel allocates, manages and executes work, with built-in fail-over, across heterogeneous resources: Solaris, AIX, Windows and Linux servers, desktops and Grid devices, plus storage, licenses and interconnects (e.g. InfiniBand), attached via resource and infrastructure plug-ins.
On top of the kernel run the Platform EGO standard services: portal, logging, deployment, event service, service director, data cache, SNMP and security, exposed through the Platform EGO SDK/API.
Application workload management connects through API/CLI: Platform LSF and Platform LSF HPC, Platform Symphony, Platform VOVMO & ASE, and 3rd-party middleware integrations, serving applications from LS, MDA, EDA, CAE and FSI to VMs, J2EE, databases, ERP, CRM and BI. The result is a Service-Oriented Infrastructure (SOI) underpinning a SOA.
Example: Dynamic Resource Allocations – Live SOI
Built on the Platform EGO foundation, Platform EGO responds to requests from consumers and allocates supply according to policy: a Service-Oriented Infrastructure.
Resource allocation: min, max, conditions, resource requirements.
Dynamic response: resource re-allocation based on policies (=> SLAs), "lend & borrow" between host groups (e.g. Linux 2.4, Windows NT).
Dynamic response: acquisition of additional resources.
3rd-party middleware integration.
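The "lend & borrow" policy described above can be sketched as a toy allocator: every consumer keeps its guaranteed minimum, and idle capacity is lent out up to each consumer's cap. This is an invented illustration (consumer names, numbers and policy details are not from the deck), not the real EGO algorithm:

```python
# Toy sketch of an EGO-style "lend & borrow" allocation policy.
# Consumer names, min/max figures and the round-robin policy are invented.

def allocate(consumers, total_slots):
    """consumers: dict name -> {"min": guaranteed, "max": cap, "demand": wanted}.
    Returns dict name -> granted slots."""
    # 1. Guarantee each consumer its minimum (capped by its actual demand).
    grant = {n: min(c["min"], c["demand"]) for n, c in consumers.items()}
    free = total_slots - sum(grant.values())
    # 2. Lend leftover capacity round-robin to consumers still below
    #    both their demand and their max cap.
    while free > 0:
        hungry = [n for n, c in consumers.items()
                  if grant[n] < min(c["demand"], c["max"])]
        if not hungry:
            break
        for n in hungry:
            if free == 0:
                break
            grant[n] += 1
            free -= 1
    return grant

demand_peak = {
    "grid_apps": {"min": 40, "max": 100, "demand": 90},
    "web_tier":  {"min": 30, "max": 60,  "demand": 10},
}
print(allocate(demand_peak, 100))
# -> {'grid_apps': 90, 'web_tier': 10}  (grid_apps borrows idle web_tier slots)
```

When the lender's own demand returns, re-running the allocator reclaims the borrowed slots back toward the guaranteed minimums, which is the "breathing" behaviour the slide calls lend & borrow.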
Integrate your Grid solution with Platform EGO
Integrate your Grid solution with Platform EGO
Meet industrial quality requirements AND deploy innovative technologies and methods
Specific and targeted solutions as well as general purpose workload adapters can join one unified resource Grid
Reliability (self-healing, recovery from incidents, policy driven proactive problem containment)
Dynamic Resource Allocation – peak power on demand
Scalability & Performance
Integrate your Grid solution with Platform EGO
Platform EGO offers, via an open API/SDK, policy-based access to all resources in the Grid.
Access the same resource Grid from and for all workload types and Grid solutions. No Grid silos!
Access to resources on EGO includes dynamic allocations within SLA guarantees: "breathing" resource allocations with an SLA minimum and maximum ("lend & borrow").
This may well replace traditional static advance reservations, which were building up "virtual silos": a virtual, grid-based flavor of the silo'd infrastructure that Grid technology was supposed to make redundant. No Grid silos, not even virtual ones!
Platform Collaborations
Platform Engagements and Collaborations
Currently, Platform Computing is engaged in:
QosCosGrid
DEISA
EGEE
…
Platform Collaborations - QosCosGrid
What is QosCosGrid?
IST Proposal
Specific Targeted Research Project (STREP)
IST Call 5
FP6-2005-IST-5
Quasi-Opportunistic Supercomputing for Complex Systems in Grid Environments
(QosCosGrid)
Participants (number, organisation name, short name):
1.* University of Ulster, United Kingdom (UU)
2. The University of Queensland, Australia (UQ)
3. Israel Institute of Technology, Israel (TECH)
4. Cranfield University, United Kingdom (CU)
5. Universitat Pompeu Fabra, Spain (UPF)
6. Eötvös Loránd University, Hungary (ELU)
7. National Inst. for Research in Computer Science and Control, France (INRIA)
8. Poznan Supercomputing and Networking Centre, Poland (PSNC)
9. University of Amsterdam, Netherlands (UA)
10. Platform Computing (PCC)
What is QosCosGrid?
Target & Definition, from the proposal paper:
…. “Whereas supercomputing resources are more or less dependable, the grid approach is characterized by an opportunistic sharing of resources as they become available. This distributed quasi-opportunistic supercomputing, while not offering the quality of service of a supercomputer, will be to some degree better than the pure opportunistic grid approach. Furthermore it will enable users to develop applications with supercomputing requirements without the need to deploy supercomputers themselves. …
QosCosGrid is, therefore, an effort to use the best from two worlds: the opportunistic approach of the grid technology to sharing and using resources whenever they become available, and the reliant or dependable approach of the supercomputing. By developing an infrastructure for quasi-opportunistic supercomputing, QosCosGrid aims at providing a reliable, effortless and cost-effective access to the enormous computational and storage resources required across a wide range of CS research areas and application domains and industrial sectors.”
Prof. Dr. Dubitzky, University of Ulster
What is QosCosGrid?
Why Platform Computing? Researchers from the initiating University of Ulster remembered Platform Computing from the D-Grid (German e-science initiative) working groups and asked for Platform's participation.
EU Commission funding rule: each research project must have a commercial partner.
Platform is invited to enter the academic IT research scene in Europe and thereby increase its success in a currently underdeveloped market.
Platform was offered a package of 45 person-months, with total funding of over 400,000 Euro.
QosCosGrid Project Plan
Person-months (PM) per workpackage and task. The lead partner is given in parentheses; the bracketed numbers are the PM split per partner in the order UU, UQ, TECH, CU, UPF, ELU, INRIA, PSNC, UA, PCC. The original slide also showed a Gantt chart over the project's years and quarters (30 months runtime), with Platform's (PCC) tasks marked.
WP1 Grid Services for CS Simulations (PSNC): 159 PM [1, 45, 11, 3, 19, 6, 22, 29, 14, 9]
T1.1 State-of-the-art/gap analysis of grid technologies for CS modelling (UQ): 28 PM [0, 15, 1, 0, 4, 0, 2, 3, 2, 1]
T1.2 Selection and adaptation of grid monitoring and meta-scheduling services to specific requirements of CS modelling (PSNC): 32 PM [0, 8, 4, 0, 4, 0, 0, 9, 6, 1]
T1.3 Design and implementation of fault tolerance protocols for point-to-point and grid-middleware-aware CS communication routines (INRIA): 22 PM [0, 0, 1, 0, 0, 0, 20, 0, 0, 1]
T1.4 Design and development of CS-oriented interfaces to grid services (UPF): 21 PM [1, 8, 2, 0, 4, 2, 0, 3, 0, 1]
T1.5 Adaptation and integration of advanced features provided by local scheduling and storage systems according to specific CS requirements (PCC): 16 PM [0, 0, 2, 0, 4, 0, 0, 6, 0, 4]
T1.6 Integration of storage management and data transfer systems with CS-oriented grid services and interfaces (PSNC): 20 PM [0, 0, 1, 2, 2, 2, 0, 6, 6, 1]
T1.7 Remote steering of grid-middleware-aware CS simulations (UQ): 20 PM [0, 14, 0, 1, 1, 2, 0, 2, 0, 0]
WP2 Grid Services for QO Supercomputing (TECH): 125 PM [1, 10, 70, 1, 5, 10, 0, 10, 0, 18]
T2.1 State-of-the-art/gap analysis of QO supercomputing (TECH): 19 PM [0, 5, 7, 0, 0, 3, 0, 3, 0, 1]
T2.2 Multi-resource QoS-aware provisioning (TECH): 26 PM [0, 0, 21, 0, 0, 0, 0, 0, 0, 5]
T2.3 QoS-aware resource orchestration (TECH): 31 PM [0, 0, 21, 0, 2, 3, 0, 3, 0, 2]
T2.4 QoS resource orchestration for a multi-application Grid environment (TECH): 32 PM [1, 0, 21, 0, 1, 3, 0, 3, 0, 3]
T2.5 Accounting and billing services (PCC): 17 PM [0, 5, 0, 1, 2, 1, 0, 1, 0, 7]
WP3 CS Simulations on the Grid (use-case scenarios) (UPF): 166 PM [22, 23, 6, 27, 17, 27, 9, 4, 29, 2]
T3.1 Use-case requirements analysis and specification (UA): 29 PM [3, 5, 1, 3, 4, 5, 1, 3, 3, 1]
T3.2 Living simulations (protein folding, astrophysics apps) (UA): 46 PM [6, 5, 0, 2, 4, 5, 1, 0, 23, 0]
T3.3 Evolutionary computation (ELU): 36 PM [10, 5, 0, 6, 4, 10, 1, 0, 0, 0]
T3.4 Co-evolutionary agent models (supply chain) (CU): 28 PM [0, 5, 0, 12, 4, 6, 1, 0, 0, 0]
T3.5 Integration with WP1 and WP2 and demonstrations (UU): 27 PM [3, 3, 5, 4, 1, 1, 5, 1, 3, 1]
WP4 Concertation (CU): 12 PM [1, 0, 1, 3, 1, 1, 2, 1, 1, 1]
T4.1 Exploitation of synergies / technical exploitation (PCC): 2 PM [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
T4.2 Joint fora for exchange and dissemination (INRIA): 3 PM [0, 0, 0, 1, 0, 0, 1, 1, 0, 0]
T4.3 Co-ordination of standardisation efforts (TECH): 2 PM [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
T4.4 Repository of reference implementations and grid middleware (CU): 1 PM [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
T4.5 Collaboration on research inventors and roadmaps (UU): 2 PM [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
T4.6 Indicators and impact assessment (ELU): 1 PM [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
T4.7 Training activities (UA): 1 PM [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
WP5 Exploitation and Dissemination (PCC): 34 PM [5, 1, 1, 2, 1, 1, 6, 1, 1, 15]
T5.1 Promotional and dissemination activities (UU): 16 PM [5, 0, 0, 2, 1, 0, 5, 0, 0, 3]
T5.2 Targeted user groups & exploitation (PCC): 18 PM [0, 1, 1, 0, 0, 1, 1, 1, 1, 12]
WP6 Project Management (UU): 31 PM [20, 0, 6, 0, 2, 0, 3, 0, 0, 0]
T6.1 Overall project coordination (UU): 21 PM [20, 0, 0, 0, 0, 0, 1, 0, 0, 0]
T6.2 Technical coordination (TECH): 8 PM [0, 0, 6, 0, 0, 0, 2, 0, 0, 0]
T6.3 Quality management (UPF): 2 PM [0, 0, 0, 0, 2, 0, 0, 0, 0, 0]
Grand total: 527 PM [50, 79, 95, 36, 45, 45, 42, 45, 45, 45]
Share of total per partner: UU 9%, UQ 15%, TECH 18%, CU 7%, UPF 9%, ELU 9%, INRIA 8%, PSNC 9%, UA 9%, PCC 9%.
QosCosGrid Technology Stack & LSF
QosCosGrid Technology Stack: QosCosGrid research and development efforts will be based on existing grid technology (such as GT4 [i], Glite [ii] and LSF [iii] from PCC) and will focus on three additional layers, as depicted in the figure below.
To achieve that, one of the first activities in the project will be the roll-out of a world-spanning Platform LSF-MultiCluster grid – from Ireland across Europe, Israel and Australia.
(Figure: a layered stack of Grid Fabric, Middleware, Interfaces/Services/Tools and Applications, with demos 1…N running on top.)
[i] GT4: www.globus.org/toolkit [ii] Glite: glite.web.cern.ch/Glite [iii] LSF: www.platform.com/Products
Platform Collaborations - DEISA
Heterogeneous job submissions and Co-Allocation capability
Develop and extend heterogeneous job-submission capability (UNIVERSUS) across OpenPBS/PBSPro, IBM LoadLeveler and Platform LSF, with NEC NQS optional, on a virtualized infrastructure.
Co-Allocation: heterogeneous multi-site resource allocation.
Example: "Give me 200 CPUs on Site1 and 300 CPUs on Site2 at the same time."
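The co-allocation request above can be sketched as an all-or-nothing reservation across sites. This is an invented illustration (site names and capacities are made up), not the DEISA/UNIVERSUS implementation:

```python
# Illustrative sketch of all-or-nothing co-allocation across sites.
# Site names and free-CPU counts are invented for the example.

def co_allocate(request, free_cpus):
    """request: dict site -> CPUs needed; free_cpus: dict site -> CPUs free.
    Grant everything at the same time, or grant nothing."""
    if all(free_cpus.get(site, 0) >= n for site, n in request.items()):
        for site, n in request.items():
            free_cpus[site] -= n       # reserve on every site together
        return True
    return False                       # a partial allocation is refused

free = {"Site1": 250, "Site2": 350}
print(co_allocate({"Site1": 200, "Site2": 300}, free))  # True
print(free)  # {'Site1': 50, 'Site2': 50}
```

The point of the all-or-nothing check is that a parallel job spanning both sites can only start when every site's share is available simultaneously.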
Platform Collaborations - EGEE
Platform Computing - EGEE-Business-Associate
The collaboration “Plan” Step 1
Immediate improvements for the EGEE users and resource providers
Technology boost
SLA Scheduling
Parallel job control and accounting
Resource aware scheduling – double compute efficiency
What's next? Step 2
Mid-term target: a production Grid unifying all resources AND all users
Enable and integrate new user groups and their resources
All kinds of applications: commercial code; complex systems
Long-term target: SOA/SOI for Service-Oriented Science
"IT agility" for scientific computing
Introduce novelties faster
Respond to changing requests in time
EGEE & Platform: the “Plan”
The collaboration “Plan” Step 1: 4 Actions
1st Action: improve the LSF/gLite integration. Platform LSF is one of the supported batch systems of gLite; currently, about 45% of all CPUs in EGEE run under LSF.
May include version maintenance as well as performance improvements.
Will include improved documentation and communication.
Leads to a better understanding of the capabilities of LSF, in order to build complex algorithms that can benefit from information passing and use all the features of LSF.
EGEE & Platform: the “Plan”
The collaboration “Plan” Step 1: 4 Actions
2nd Action: SLA scheduling. Exploit LSF and gLite features to enhance user and resource-provider capabilities. SLA scheduling helps both sides:
For the user, it provides guaranteed result delivery, in time or in throughput.
For the resource provider, it translates to "least-impact scheduling", that is, serving the SLA user while there is still room left to host other requests. In other words: handling different service levels, for different customers, at the same time.
Expected results: resource providers will offer more resources to EGEE users under well-defined SLAs, and users perceive predictable result delivery and predictable behaviour of the Grid.
EGEE & Platform: the “Plan”
The collaboration “Plan” Step 1: 4 Actions
3rd Action: parallel application support. gLite today supports sequential jobs and provides basic support for parallel jobs based on MPICH.
Exploit LSF-HPC features: LSF-HPC allows control of MPI parallel jobs down to task level, provides a signalling layer for management or workflow-control signals, delivers accounting that includes all children of a parallel application, and supports multiple MPI types in one cluster.
Is parallel application support in EGEE easy? No. LSF-HPC might be the best choice to start with. We may identify topics worth a research project or support action, e.g. parallel application checkpoint/restart.
EGEE & Platform: the “Plan”
The collaboration “Plan” Step 1: 4 Actions
4th Action: Resource aware scheduling – double compute efficiency
Exploit LSF features: LSF supports a generic resource concept, thus data is a resource too. All resources can be used for scheduling decisions. The scheduling paradigm "job-follows-data" results in up to a 50% gain in compute power.
Is resource-aware scheduling in EGEE easy? No. EGEE supports co-location of data and computation based on sites, but not for computation scheduling within a site. There are major topics in the operations model and medium topics for the compute resources: re-think, re-build, re-budget. Maybe switch to a mid-term horizon.
SLA Scheduling for EGEE
LSF service-level agreement (SLA) scheduling:
Is a goal-oriented, "just-in-time" scheduling policy that enables the user to focus on the "what and when" of a project instead of "how" the resources need to be allocated to satisfy various workloads
Defines an agreement between LSF administrators and users
Helps configure workload so that jobs complete on time
Reduces the risk of missed deadlines
The three types of service-level goals are Deadline, Velocity and Throughput, or a combination of these.
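The arithmetic behind a deadline goal is simple to sketch: from the remaining work and the time left, derive how many job slots must be dedicated now. The helper name and the numbers below are invented for illustration; this is only the idea behind the goal, not LSF's actual scheduler:

```python
import math

# Sketch of the reasoning behind a "deadline" service-level goal:
# given the pending jobs and the time remaining, how many job slots
# does the SLA need right now? (Invented helper; numbers are examples.)

def slots_for_deadline(pending_jobs, runtime_min, minutes_left):
    waves = minutes_left // runtime_min   # sequential runs that still fit
    if waves == 0:
        return None                       # deadline can no longer be met
    return math.ceil(pending_jobs / waves)

# 960 pending 15-minute jobs and 8 hours until the deadline:
print(slots_for_deadline(960, 15, 8 * 60))  # -> 30 slots
```

A velocity goal instead keeps a fixed number of jobs running at all times, and a throughput goal completes a fixed number of jobs per hour; all three let the user state the goal and leave the slot arithmetic to the scheduler.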
SLA Scheduling for EGEE: SLA "Deadline"
(Chart, 100% = 8 job slots: classical opportunistic scheduling fills the cluster to 100% from "now" on. Under a deadline SLA, SLA 1 consumes only 50% of the cluster and still finishes "early enough for me" before the deadline, leaving free resources for dialog users, real-time requests, online sessions and other workload that "needs to work now".)
SLA Scheduling for EGEE: SLA "Throughput"
(Chart: SLA 2 consumes 25% of the cluster, delivering a steady 4 results/hr; the remaining resources stay free for dialog users, real-time requests, online sessions, other workload, other SLAs, and more EGEE users. User's view: "I am a scientist, I need just as many results as I can process per time interval.")
EGEE High Performance Parallel Computing
Distributed computation is "imperfectly parallel" in the real world: tasks communicate with each other at runtime (inter-task runtime communication), often implemented using MPI, the Message Passing Interface.
MPI: Many Possible Implementations.
Different communication patterns:
"Neighbour" tasks (defined by the problem-decomposition topology)
"All to all", "some to many" (= N-to-M)
Central instance to tasks (commercial code, …)
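The "neighbour tasks" pattern can be made concrete with a small sketch: on a 2-D domain decomposition, each task only exchanges data with the tasks owning adjacent sub-domains. The helper is invented and pure Python; a real application would perform the exchange with an MPI library:

```python
# Sketch of the "neighbour tasks" communication pattern on a 2-D
# process grid. Pure-Python illustration (invented helper); a real
# code would do the actual halo exchange via MPI.

def neighbours(rank, px, py):
    """Ranks adjacent to `rank` in a px-by-py process grid (no wrap-around)."""
    x, y = rank % px, rank // px
    out = []
    for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < px and 0 <= ny < py:
            out.append(ny * px + nx)
    return out

# In a 4x3 process grid, rank 5 sits in the interior with 4 neighbours:
print(neighbours(5, 4, 3))  # -> [4, 6, 1, 9]
```

Corner and edge ranks get fewer neighbours, which is exactly why the communication pattern, and hence the scheduler's placement of tasks, depends on the decomposition topology.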
LSF-HPC – LSF for High Performance Computing
LSF-HPC is LSF plus additional functionality:
Topology-aware scheduling for large SMPs and large clusters
Task-granular control of parallel computation
Generic and vendor-specific MPI integrations
Signal forwarding to all tasks
Resource-usage accounting for all tasks
Limit enforcement: time, memory, threads, …
Scalability: 8000+ in LSF 6.2 / 16000+ in LSF 7.0
Platform LSF/HPC – Generic integration
Without the generic PJL framework, the PJL starts tasks directly on each host and manages the job. Even if the MPI job was submitted through LSF, LSF never receives information about the individual tasks, so it cannot track job resource usage or provide job control.
If you simply replace PAM with a parallel job launcher that is not integrated with LSF, LSF loses control of the process: it cannot monitor job resource usage or provide job control.
(Architecture figure: running a parallel job using a non-integrated PJL; the PJL starts the tasks directly on the first and second execution hosts.)
Platform LSF/HPC – Generic integration
PAM is the resource manager for the job.
The key step in the integration is to place TS in the job startup hierarchy, just before the task starts.
TS must be the parent process of each task in order to collect the task process ID (PID) and pass it to PAM.
(Architecture figure: using the generic PJL framework. A job submission reaches mbatchd and mbschd on the LSF master host; on the first execution host, sbatchd invokes mpirun.lsf, which starts PAM; PAM runs the PJL through a PJL wrapper; on each execution host, RES starts TS, and each TS starts its task.)
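The parent-child trick that lets TS capture each task's PID can be sketched in a few lines. Everything below is an invented stand-in (a plain subprocess wrapper), not the real LSF TS/PAM protocol:

```python
import subprocess, sys

# Minimal analogue of the TaskStarter idea: a wrapper process that
# becomes the parent of the task, so it can capture the task's PID and
# exit status and report them upward. Invented stand-in, not LSF code.

def start_task(argv):
    proc = subprocess.Popen(argv)        # wrapper is the task's parent
    report = {"pid": proc.pid, "argv": argv}
    # A real TS would send the PID to PAM here; we just record it.
    report["exit_status"] = proc.wait()  # parent also sees completion
    return report

info = start_task([sys.executable, "-c", "print('task runs')"])
print(info["exit_status"])  # 0
```

Because the wrapper is the direct parent, it reliably observes the task's PID and exit status, which is exactly what PAM needs to account for every task of a parallel job.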
LSF-HPC – LSF for High Performance Computing
Advantages for EGEE, users and resource providers: the freedom to integrate and use
All MPI types
All compute architectures
May implement optional automated MPI selection, dependent on actual availability: the best possible choice
Full application control, ready to implement optional parallel preemption, which is important to guarantee service levels:
Suspend/resume
Checkpoint/migrate/restart
Resource aware scheduling for EGEE
EGEE: data handling in the resource center
EGEEusers
Compute nodes
Data nodes
Storage
Controller
Drive
Drive
Drive
Drive
Drive
Drive
Drive
Drive
LA
N
Robot
EGEE jobs
EGEE example operations model Job arrives and is started on compute node
Requested data is ordered from storage robot
Tape mounted and content “data set” provided to compute node via NFS
allocating 2 nodes for 1 job
Resource aware scheduling – up to double compute efficiency
Resource-aware scheduling, step by step:
1. A job arrives and is queued with a resource requirement, e.g. "data=#4711".
2. The requested data set "#4711" is ordered from the storage robot by LSF.
3. The tape is mounted.
4. The LSF resource "data" is updated to "data=#4711".
5. As soon as the resource requirements are satisfied, the job is dispatched to the right host, which holds the right data locally.
(Diagram: EGEE users submit EGEE jobs to an LSF cluster of combined compute & data nodes; LSF's mbschd tracks the resource "data" (value: "identifier") and drives the storage controller, tape drives and robot.)
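The job-follows-data flow described above can be sketched as a tiny dispatcher: jobs stay queued until some host advertises the data set they require, then run where the data already is, so no second node is tied up serving the data over NFS. Host names and the second data-set identifier are invented for the example:

```python
# Sketch of "job-follows-data" scheduling: a job runs only on a host
# that already holds its required data set; otherwise it stays queued
# until the robot has staged the data. Host names are invented.

def dispatch(queue, hosts):
    """queue: list of jobs, each with a 'data' requirement;
    hosts: dict host -> set of locally staged data sets."""
    placements, still_queued = {}, []
    for job in queue:
        target = next((h for h, data in hosts.items()
                       if job["data"] in data), None)
        if target:
            placements[job["id"]] = target   # compute lands on the data
        else:
            still_queued.append(job)         # wait until data is staged
    return placements, still_queued

hosts = {"nodeA": {"#4711"}, "nodeB": set()}
queue = [{"id": "job1", "data": "#4711"}, {"id": "job2", "data": "#9999"}]
print(dispatch(queue, hosts))
# -> ({'job1': 'nodeA'}, [{'id': 'job2', 'data': '#9999'}])
```

Compared with the earlier operations model, which allocates a compute node plus a data node per job, placing the job directly on the data-holding host is what yields the "up to double compute efficiency" claim.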
Conclusion
Conclusion: Increasing the industrial impact of Grid
Increasing the industrial impact of EU Grid Technologies Programme with Platform Enterprise Grid Orchestrator
Platform Computing invites all Grid technology solutions to integrate with its unified Grid resource layer, the Enterprise Grid Orchestrator (EGO).
Platform Computing is open to partner with academia, research and industry to push forward adoption and “impact” of Grid technology.
Contact: Christof Westhues, SE Manager EMEA Platform Computing GmbH, [email protected]
Proline Bilişim A.Ş. Tel: +90 212 236 8070 Fax: +90 212 236 7740
Thank you