InstantGrid: A Framework for On-Demand Grid Point Construction
R.S.C. Ho, K.K. Yin, D.C.M. Lee, D.H.F. Hung, C.L. Wang, and F.C.M. Lau
Dept. of Computer Science, The University of Hong Kong
Grid point construction is a difficult task
Different grid users/applications demand different execution environments (EE’s)
Managing, and switching between, different EE’s incurs considerable system administration overhead
E.g. Computing grid (MPICH-G2, etc.) vs. service grid (GT3); different OS distributions/versions, libraries, etc.
Our solution: InstantGrid, a framework for efficient construction of grid points
• Convenient system administration for multiple EE’s
• Instant EE construction in remote nodes
• Complete transparency to user applications
• Supports in-memory execution, which protects hard-disk data from malicious access
The InstantGrid Framework
All EE’s are installed, configured, and managed in central InstantGrid servers
Cluster/grid nodes obtain customized EE’s over the network (i.e., the “dissemination” process)
The framework consists of the following key elements:
• Application-centric software grouping
• Proactive software configuration
• Discriminative file sharing mechanisms
• Options for file storage in compute nodes
• An EE dissemination service
Single Linux Image Management (SLIM): The infrastructure for EE dissemination
SLIM is able to deliver customized EE’s for:
• HPC cluster/grid systems
• Linux desktops
• Diskless Linux nodes
Application-centric Software Grouping
Software is grouped together to match the specific requirements of applications
An EE is a collection of software components, which include an OS, system libraries, grid/cluster middleware, applications, and the user data
Customized EE “images” for different users/applications
Facilitates software management and dissemination
Sample EE’s:
(a) A service-oriented grid point
(b) A frontend node for HPC job submission
(c) A typical cluster node which processes jobs dispatched from the frontend node
(b)+(c): A single EE group indicating the software requirements of a cluster-based grid point, which includes a gatekeeper and a number of compute nodes
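The grouping idea can be illustrated with a small data-structure sketch. This is not InstantGrid's actual representation, which the poster does not show. The class names, field names, and package lists below are hypothetical and only mirror the components listed above (OS, system libraries, middleware, applications, user data).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExecutionEnvironment:
    """One EE: the software stack a node needs for a given role (hypothetical model)."""
    name: str
    os: str
    system_libraries: tuple = ()
    middleware: tuple = ()
    applications: tuple = ()
    user_data: tuple = ()


@dataclass
class EEGroup:
    """An EE group: maps each node role to the EE that role receives."""
    name: str
    roles: dict = field(default_factory=dict)  # role name -> ExecutionEnvironment

    def ee_for(self, role: str) -> ExecutionEnvironment:
        return self.roles[role]


# Placeholder contents: the OS label and package lists are illustrative only.
frontend = ExecutionEnvironment(
    name="hpc-frontend", os="linux", middleware=("gatekeeper", "job-scheduler"))
compute = ExecutionEnvironment(
    name="hpc-compute", os="linux", middleware=("MPICH-G2",))

# Corresponds to sample (b)+(c): one group covering the frontend and compute nodes.
cluster_grid_point = EEGroup(
    name="cluster-grid-point", roles={"frontend": frontend, "compute": compute})
```

Keeping the frontend and compute EE's in one group reflects the (b)+(c) sample: the two roles are managed and disseminated together as a single cluster-based grid point.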
Proactive Software Configuration
Traditionally, software is installed/configured incrementally
InstantGrid advocates “configuration before dissemination”
Try to configure all software in the central server if possible
The EE’s disseminated are (almost) ready-to-run
Discriminative File Sharing Mechanism
Full replication is impractical due to the large size of typical EE’s
Updating files through NFS is slow
InstantGrid adopts a hybrid approach: Replicate (frequently updated files) + NFS (other files)
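The hybrid file sharing approach (replicate frequently updated files, share everything else over NFS) can be sketched as a simple partition. The poster does not say how InstantGrid decides which files count as frequently updated. The update-count threshold, the function name, and the sample paths below are assumptions made for illustration.

```python
def partition_files(update_counts, threshold):
    """Split paths into (replicate, share_via_nfs).

    update_counts: dict mapping path -> number of observed updates (assumed metric).
    threshold: paths updated at least this often are replicated to the node.
    """
    replicate, nfs = [], []
    for path, count in sorted(update_counts.items()):
        # Frequently updated paths are replicated; the rest stay on NFS.
        (replicate if count >= threshold else nfs).append(path)
    return replicate, nfs


# Illustrative update counts, not measurements.
counts = {"/var/log": 120, "/etc": 15, "/usr/lib": 0, "/usr/local/gt3.2": 1}
replicate, nfs = partition_files(counts, threshold=10)
```

With these sample counts, the rarely changing software trees stay on NFS and only the small, frequently written directories are copied to each node.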
Options for File Storage in Compute Nodes
“Full-copy to RAM”: files stored entirely in physical memory
“Full-copy to HD”: files stored in hard disk
“Copy-if-needed”: files stored in HD; only new files are copied
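The three storage options differ in where files land and how many are transferred. The following is a minimal sketch under stated assumptions: the poster does not define how a "new" file is detected, so this example treats a file as new when it is missing locally or its content digest differs from the server's. The enum and function names are hypothetical.

```python
from enum import Enum


class StorageOption(Enum):
    FULL_COPY_TO_RAM = "full-copy-to-ram"
    FULL_COPY_TO_HD = "full-copy-to-hd"
    COPY_IF_NEEDED = "copy-if-needed"


def files_to_copy(option, server_files, local_files):
    """Return (target, sorted list of paths to transfer).

    server_files / local_files: dict mapping path -> content digest.
    """
    if option is StorageOption.FULL_COPY_TO_RAM:
        return "ram", sorted(server_files)
    if option is StorageOption.FULL_COPY_TO_HD:
        return "hd", sorted(server_files)
    # Copy-if-needed: assumed to mean "absent locally or digest differs".
    needed = [p for p, digest in server_files.items()
              if local_files.get(p) != digest]
    return "hd", sorted(needed)


server = {"a": "1", "b": "2", "c": "3"}
local = {"a": "1", "b": "old"}
plan = files_to_copy(StorageOption.COPY_IF_NEEDED, server, local)
```

Here only `b` (changed) and `c` (missing) would be transferred, which is consistent with copy-if-needed being the fastest option in the reported measurements.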
EE Dissemination Service
Service is offered through a DHCP server, a TFTP server, and an NFS server
When a client machine boots up, it obtains its IP address and the kernel from the DHCP and TFTP servers respectively
The client then constructs the pre-defined EE by replicating writable files to local storage and mounting the read-only directories through NFS
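The client-side sequence (DHCP, then TFTP, then replication of writable files and read-only NFS mounts) can be written out as an ordered plan. This sketch only builds descriptive strings and executes nothing. The function name, the server name `slim-server`, and the example paths are hypothetical, and the actual SLIM boot scripts are not shown in the poster.

```python
def construction_plan(server, writable, read_only):
    """Return the ordered steps a client would follow, as (stage, detail) pairs."""
    steps = [
        ("dhcp", "obtain IP address"),
        ("tftp", "download kernel"),
    ]
    # Writable files are copied to local storage.
    steps += [("replicate", f"{server}:{path} -> {path}") for path in writable]
    # Read-only directories are mounted from the server over NFS.
    steps += [("nfs-mount", f"mount -t nfs -o ro {server}:{path} {path}")
              for path in read_only]
    return steps


plan = construction_plan("slim-server", writable=["/etc"], read_only=["/usr"])
for stage, detail in plan:
    print(f"{stage:10s} {detail}")
```

Separating the writable and read-only lists keeps the plan aligned with the hybrid file sharing mechanism: only the writable part costs per-node transfer time.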
Example – Constructing a service-oriented grid point
[Figure: a SLIM server holding the OS image and /usr/local/gt3.2, a CA server, and client nodes; DHCP, TFTP, and certificate exchanges are labeled with the step numbers below]
1. Software installation at SLIM server
2. Client boots and obtains kernel
3. OS image/App disseminated
4. Process to generate certificates
Performance Evaluation
Conducted in HKU CS’s Gideon Cluster (300 Pentium 4 nodes; Fast Ethernet; each node has 512 MB RAM and a 40 GB IDE hard disk)
Two tests: (a) a cluster-based grid point, and (b) standalone grid points
A 256-node cluster-based grid point can be constructed from scratch in three (copy-if-needed) to five (full-copy to hard disk) minutes
Standalone grid points take longer to construct; the bottleneck lies mainly in the process of generating host certificates
Future Work
To devise standard protocols for communicating EE specifications between the InstantGrid servers and compute nodes
To optimize InstantGrid’s performance in WAN