Progress on Integration, Vote on APIs, SC2003, and SW release

Al Geist
September 11-12, 2003
Rockville, MD


Participating Organizations

Coordinator: Al Geist

ORNL, ANL, LBNL, PNNL
PSC, SDSC, IBM
SNL, LANL, Ames, NCSA
Cray, Intel, Unlimited Scale

External reviewers want to see more vendors involved. This could be an important point in our long-term plans.

Have begun working with Don Mason and John Lawson to set up a presentation to a vendor forum.

Will need your participation when logistics are known.

No progress since last meeting.

Scalable Systems Software

Participating Organizations

ORNL, ANL, LBNL, PNNL
NCSA, PSC, SDSC
SNL, LANL, Ames
IBM, Cray, Intel, Unlimited Scale

Problem

• Computer centers use incompatible, ad hoc sets of systems tools

• Present tools are not designed to scale to multi-Teraflop systems

Goals

• Collectively (with industry) define standard interfaces between systems components for interoperability

• Create scalable, standardized management tools for efficiently running our large computing centers

Impact

• Reduced facility management costs

• More effective use of machines by scientific applications

[Diagram labels: Resource Management; Accounting & user mgmt; System Build & Configure; Job management; System Monitoring]

To learn more visit www.scidac.org/ScalableSystems

Scalable Systems Software Suite

[Diagram: suite components joined by standard XML interfaces on top of a shared authentication and communication layer. Working components and interfaces are shown in bold.]

Components shown: Grid Interfaces; Meta Services; Meta Scheduler; Meta Monitor; Meta Manager; Accounting; Event Manager; Service Directory; Scheduler; Node State Manager; Allocation Management; Process Manager; Usage Reports; System & Job Monitor; Job Queue Manager; Node Configuration & Build Manager; Checkpoint / Restart; Hardware Infrastructure Manager; Validation & Testing

Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite

Scalable Systems Software Center
June 5-6
Chicago, IL

Review of Last Meeting

Details in Main project notebook

Highlights from June Meeting

Matt Sottile – Using SSS to create bstat_sss
It is a prototype distribution, so some of these issues are expected.
Major gripes: had to write socket code and XML parsing and creation code. These should be APIs.
XML parsing – the schema and associated parser are intimately related.

Craig Steffan – Warehouse Monitoring Software Infrastructure
Describes the old way the cluster monitor worked and the scalability issues with it. Presents new design.

Thomas Naughton – SSS deployment using OSCAR
Seems to be consensus of group to do this for SC2003.

Slides can be found in Main Notebook

Highlights from June Meeting (cont.)

Narayan Desai – All Service Directory, BC, and PM APIs changed to restriction syntax – draft spec given.

Scott Jackson – SSSRMAP v2 proposal
Have taken an object-oriented approach to jobs and attributes.
Discussion of the differences between RM Schema and BC Schema.
Part of the difference is the incorporation of security.
Another part is functional vs. object-oriented.

Good discussion of the strengths and weaknesses of both.

Consensus and Voting

Communication Infrastructure Spec Draft
We should be able to hardwire components together.
Existence of static file to define where things are – may just have service directory.
Unix domain socket protocol for SMP servers.
Vote – accept the spec pending amendment to allow hardwired components: Yes 15, No 0, Abstaining 0

Agreement on having common error objects with 3-digit codes and messages. The message is a human-readable string. Two special codes: 000 success, 999 unknown.
Straw vote: Yes 15, No 1, Abstaining 0

Add “supported scheme version” to Service Directory.
Vote: Yes 15, No 0, Abstaining 0

Discussion of outer (envelope, signature, body) framing, to be put in SSSlib (the SSSlib guys said it would be done; no vote taken).

Scalable Systems Software Center

June-September

Progress Since Last Meeting

Five Project Notebooks – little activity this quarter

A main notebook for general information

And individual notebooks for each working group

• Over 281 total pages – 11 added since last meeting

• BC and PM groups need to get info into their notebooks

• Add Telecom meeting notes even if short

Get to all notebooks through main web site www.scidac.org/ScalableSystems

Click on side bar or at “project notebooks” at bottom of page

Bi-Weekly Working Group Telecoms
Starting to pick up as SC2003 approaches

Resource management, scheduling, and accounting

Tuesday 3:00 pm (Eastern) 1-800-664-0771 keyword “SSS mtg”

Validation and Testing (hasn’t met since last year)

Wednesday 1:00 pm (Eastern) 1-877-540-9892 mtg code 999157

Process management, system monitoring, and checkpointing

Thursday 1:00 pm (Eastern) 1-877-252-5250 mtg code 160910

Node build, configuration, and information service

Thursday 3:00 pm (Eastern) 1-888-469-1934 mtg code (changes)

Scalable Systems Software Center

September 11-12, 2003

This Meeting

Major Topics this Meeting

Five year project – Fred says that the five year projects will go all five years, but they need to be finished at that point. He asks “What is our exit strategy?”

Open Source License – Fred asks that we come up with one general text that all organizations can agree on and then he will bless it.

Software Release – deadline for a suite release is SC2003

Formal API presentations and voting - it is that time in the project when we should be settling on some APIs. Use less time for progress reports

SC2003 prep - booth space, demos, posters

Agenda – September 11

8:30 Al Geist – Project Status
9:15 Rusty Lusk – Use of Scalable Systems Suite on Chiba

Working Group Reports
• Progress report on what their group has done
• API proposals for adoption by the group
• Progress on SC2003 software release date

9:30 Scott Jackson – Resource Management
10:30 Break
11:00 Erik Debenedictis – Validation and Testing
12:00 Lunch (on own – hotel restaurant)
1:00 Paul Hargrove – Process Management
2:00 Narayan Desai – Node Build, Configure
3:00 Break
3:30 Thomas Naughton – Discussion of SSS OSCAR software suite release, XML syntax
4:30 Discussion of SC2003 demos, booths, posters
5:30 Adjourn

Working groups may wish to prepare material for voting Friday

Agenda – September 12

8:30 Discussion, proposals, votes
• Eric – tweaks for peta-scale systems
• Scott – error codes extensibility
• Narayan – communication infrastructure 2nd vote
• Rusty – mystery topic
• Plans for SC2003 demos and talks

10:30 Break
11:00 Al Geist – Summary
• Review plans for SC2003
• Next meeting date: January 15-16, 2004
• Location: ANL

12:00 Meeting ends

Meeting notes

Rusty Lusk – SSS on Chiba project (summer project)
Experiment to see if SSS architecture can replace Chiba City SW
Needed better software on the cluster, test SSS (chkpt), scalable testbed
Needed external testing
Needed more experience with published XML API
Use ANL SSS components, and stubs for scheduler, QM w/ PBS compatibility
Use restriction syntax for everything
After 2-week shakedown Remy agreed to go forward and use SSS only
Been running user job mix for about 3 weeks without disasters
Shook out XML ambiguities, fixed bugs, fixed scalability problems
Plans, short term: incorporate checkpoint, LAM support, monitoring warehouse
Plans, long term: incorporate components from RM group, use Chiba

Question: What is Jazz using? Qbank, MPI-GM, Veridian PBS, etc.
Question: What is the bug fix load? Exponential decrease with low load.

Meeting notes

Scott Jackson – RM progress
SSSRMAP v2.0 is in all components except Silver meta-scheduler
Tested security
Running suite on ORNL’s XTORC; some issues with ssslib and PM
Created Node object v1.0
Proposed set of response/status codes (more tomorrow)
Suite for SC2003 includes openPBS_sss, Maui, Qbank, sss_xml_svr
Running on Linux, HP-UX, AIX 5.1, IRIX 6.5 (to come: Tru64, Solaris)
Uses SSSRMAP v1.0
Webpage for RMwg recreated w/ documentation, tarballs, rpms, bug tracking
Scheduler progress – support for error codes
QM progress – named “Bamboo”, implements SSSRMAP v2 incl. security
Accounting and Allocation – QBank portability testing
Gold – implements SSSRMAP v2.0
Reimplemented in Perl to overcome latency issues in Java startup
Created a suite of full-featured Perl command line clients
Installed Gold on PNNL 11.8 TF Linux cluster to compare to Qbank
Slow progress on open sourcing. Asks about public domain SW; group says no

Meeting notes

Scott Jackson – RM progress continued
Future work: release alpha of Bamboo, Silver, Gold
Support multi-source resource management, multi-step job support
Interface to system monitor
I/O staging (need API from PM)
Package code for distribution
Open source Gold (BSD)
SSL on web GUI

Issues for group discussion
• Response Codes
• SC03
• Problem Response System
• Need process exit codes from PM
• Cluster Monitor
• Open Source
Discuss: is OpenPBS_sss a real SSS component? Can it drop in?
http://sss.scl.ameslab.gov/downloads.shtml

Meeting notes

Will McClendon – Validation and Testing
Neil reports that when the SNL Institutional Cluster is up, SSS will be able to use Cplant for scalability testing.
APITest supports multiprotocol
Status daemon – configurable monitoring infrastructure for clusters
Distributed Runtime System Testing
Progress: week at ANL (July)
Major rework on framework for APITest – individual tests are atomic
Framework handles checking tests, dependencies, and aggregate results
Extensibility – new types of tests are easy to create
Dependency system
• define relationships as DAG encoded in XML (shows many examples)
• edges are boolean dependencies
Supported Tests
• sssTest – use ssslib to communicate with ssslib components
• shellTest – execute a command
• httpTest – app testing web interfaces
• tcpipTest – raw socket via TCP
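The dependency system noted above can be illustrated with a minimal Python sketch. It is not the APITest implementation: the XML element and attribute names (suite, test, dependency, on, expect), the test names, and the run_suite helper are all assumptions made for illustration. Each test is atomic, each edge carries the boolean result required of the parent test, and the framework aggregates results, recording a test as None when it is skipped. It uses graphlib from the standard library (Python 3.9 or later).

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter

# Hypothetical XML layout; the real APITest schema is not shown in these notes.
SUITE_XML = """
<suite>
  <test name="ping"/>
  <test name="register">
    <dependency on="ping" expect="true"/>
  </test>
  <test name="query">
    <dependency on="register" expect="true"/>
  </test>
  <test name="report_outage">
    <dependency on="ping" expect="false"/>
  </test>
</suite>
"""


def load_dependencies(xml_text):
    """Return {test: [(parent, expected_result), ...]} parsed from the XML."""
    deps = {}
    for test in ET.fromstring(xml_text).iter("test"):
        deps[test.get("name")] = [
            (d.get("on"), d.get("expect") == "true")
            for d in test.iter("dependency")
        ]
    return deps


def run_suite(deps, actions):
    """Run atomic tests in dependency order and aggregate the results.

    A test runs only if every parent finished with the boolean result the
    edge expects; otherwise it is recorded as None (skipped).
    """
    graph = {name: {parent for parent, _ in edges} for name, edges in deps.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        ready = all(results.get(parent) is expected for parent, expected in deps[name])
        results[name] = bool(actions[name]()) if ready else None
    return results


# Stand-in test bodies; real tests would talk to components or run commands.
actions = {
    "ping": lambda: True,
    "register": lambda: False,
    "query": lambda: True,
    "report_outage": lambda: True,
}
results = run_suite(load_dependencies(SUITE_XML), actions)
```

Here "query" is skipped because "register" failed, and "report_outage" is skipped because it only runs when "ping" fails.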

Meeting notes

Will McClendon – Validation and Testing
How is this different from “DART” or other testing harnesses?
They are not doing DAG dependencies
They don’t have regular expression matching

The SW is released inside ssslib (already available)
Updates are placed directly into the CVS by Will.

Issue Tracking (same topic as Scott brought up)
Is anyone using Bugzilla on the SSS website?

New hire: Ron Oldfield (new PhD)

Meeting notes

Paul Hargrove – PM progress
Checkpoint manager – docs nearly done. Issue of open files 30% done
Need to chase kernel versions (need to be a part of OSCAR)
Hope to test on unknowing NERSC users on PDSF system
Expect to deploy on Chiba
Still need to define XML interfaces to checkpoint
Outside interest from Altair (PBS Pro), LANL (SLURM), Quadrics (RMS)
Will be able to have “something” in the SC2003 suite (toy)
Process Manager – improved scalability, mods to support SSS-PM
Support for multi-step jobs – uses MPISH
Now the production PM on Chiba
Monitoring – Data Warehouse written and tested
XML parsing 80% done, response not done, Service Directory registration not done yet
Future
Integrate/deploy on Chiba
Release it (OSCAR-based release at SC2003)
Demo it at SC2003

Meeting notes

Narayan Desai – BC progress
Communication – scale tested and in production
Added schema version, added component tier
Event manager – data persistence and event statistics
Integration with APITest – service directory tests written, event mgr next
SSSlib – core rewrite to improve code reuse, smaller code base
Node State Manager – improved diagnostics
Build system – new config mgmt system, working on OSCAR implementation
Cluster HW Infrastructure – identified need for generic topology support
Restriction Syntax – command format
Provides SQL-like functionality
Now it is Disjunctive Normal Form
Data ownership is explicit – which component owns the data
Basic command syntax (describes and shows examples from report)
Future – improved integration with RM, admin tools leveraging R syntax
More APITest tests. Dummy components – such as “file stager”
Long syntax discussion
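As a rough illustration of what "SQL-like functionality" in Disjunctive Normal Form means, the sketch below selects records that satisfy an OR of AND-clauses. It does not reproduce the actual restriction syntax, which is XML-based and not shown in these notes; the dictionary encoding, the field names (name, state, arch), and the matches and select helpers are assumptions for illustration only.

```python
def matches(record, dnf):
    """True if the record satisfies at least one clause (an OR of ANDs).

    Each clause is a dict of field -> required value; every field in a
    clause must match for that clause to hold.
    """
    return any(
        all(record.get(field) == value for field, value in clause.items())
        for clause in dnf
    )


def select(records, dnf):
    """Return the records matching the DNF restriction, in original order."""
    return [record for record in records if matches(record, dnf)]


# Hypothetical node records, as a node state manager might hold them.
nodes = [
    {"name": "n1", "state": "up", "arch": "x86"},
    {"name": "n2", "state": "down", "arch": "x86"},
    {"name": "n3", "state": "up", "arch": "ia64"},
]

# (state == up AND arch == x86) OR (name == n2)
query = [{"state": "up", "arch": "x86"}, {"name": "n2"}]
selected = [node["name"] for node in select(nodes, query)]
```

Any boolean restriction over equality tests can be rewritten into this OR-of-ANDs shape, which is one reason a normal form keeps the command format simple to parse.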

Meeting notes

Thomas Naughton – SSS deployment using OSCAR
A release of OSCAR that contains all SSS software
Roll SSS components into OSCAR packages – RPM format
Create repository for OSCAR package uploads
• SourceForge sss-oscar.sf.net for our team use
• Accounts & CVS permissions
Establish “supported” Linux distribution
• RedHat 7.3? Or 9.0? Discussion and group decides 9.0
• Myrinet?
Put an OSCAR RH9.0 version on ftp site for team to grab.

OSCAR Homepage http://oscar.sf.net

Proposed timeline for SC2003 SW release
Oct 06: SSS pkgs OSCAR-ized & in CVS
Oct 24: CVS freeze – begin beta tests
Nov 17: SC2003

Meeting notes Day 2

Erik Debenedictis – Issues for peta-scale systems
Redstorm and Bluelight use a mesh rather than a switch
This means that topology is an important consideration
Speed of light is about 5% of communication time
But this is growing at 40% a year, so that in 2008 light will be 20%
Discussion that machine size in 2008 may be physically smaller
Either SW has to have hooks for manual placement
Or automatically optimized placement
For SSS to consider:
XML attribute to specify topology and I/O resources
XML attribute to specify data arrangement on disk
OS functionality hints to help auto placement
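The speed-of-light figures can be checked with a few lines of Python. This is only an arithmetic sketch: it assumes the 40% growth compounds annually on the 5% share, and the baseline year is not stated in the notes. The function name light_share is hypothetical.

```python
def light_share(initial_share, annual_growth, years):
    """Share of communication time after compounding growth for `years` years."""
    return initial_share * (1 + annual_growth) ** years


# Share of communication time due to speed of light, year by year.
shares = {years: round(light_share(0.05, 0.40, years), 3) for years in range(6)}
```

Under this assumption, four years of growth give about 19% and five years give about 27%, so the quoted 20% corresponds to roughly four years of compounding.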

Ron Oldfield – distributed file copy and permutation
To what extent does SSS want to be involved in the post-100T range? Yes.
Is it appropriate for SNL to consider work in this area as part of its contribution to SSS? No. Fred asked us not to do I/O – he funds I/O through other projects.

Meeting notes Day 2

Scott Jackson – Error reporting and codes
Divide up code space in a consistent way.
Code
0xx Success
1xx Warning
2xx Wire protocol
3xx Message XML format
4xx Security
5xx Event Management
6xx Reserved
7xx Server application
8xx Client application
9xx Misc Failure
999 Unknown Failure
Rusty mentions MPI error classes and error codes
Al suggests these general error classes – success, warning, temp failure, partial failure, failure
People need to come up with a counter proposal if they care
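One way to read the proposed code space is as a lookup on the leading digit, with 999 treated as a special case. The Python sketch below mirrors the table from the slide; the classify function and its handling of malformed input are assumptions, not part of the proposal.

```python
import re

# Classes keyed by leading digit, taken from the proposal on the slide.
CODE_CLASSES = {
    "0": "Success",
    "1": "Warning",
    "2": "Wire protocol",
    "3": "Message XML format",
    "4": "Security",
    "5": "Event Management",
    "6": "Reserved",
    "7": "Server application",
    "8": "Client application",
    "9": "Misc Failure",
}


def classify(code):
    """Map a 3-digit status code string to its class name."""
    if not re.fullmatch(r"[0-9]{3}", code):
        raise ValueError(f"expected a 3-digit code, got {code!r}")
    if code == "999":
        # 999 is singled out as "unknown" rather than a generic 9xx failure.
        return "Unknown Failure"
    return CODE_CLASSES[code[0]]
```

Keeping codes as strings preserves leading zeros, so "000" stays distinct from an integer 0 when it is written into a message.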

Meeting notes Day 2

Narayan – Communication second vote
Wire protocols – need to add security envelope protocol
Added service location. Bootstrapped using /etc/sss/

Vote to accept as spec for
• Wire Protocol definition to get new ones accepted
• Service Directory interface
• Event Manager interface

Second vote: 16 yes, 2 abs, 0 no

Rusty – Plan for voting on specific component interfaces
• Service directory (today)
• Event manager (today)
• Node state manager (1st vote next time)
• Build system (discuss next time)
• Process manager (1st vote next time)

Meeting notes Day 2

Al – SC2003

Rusty – fancy dancing meatball in wxPython!
Mike – try running SSS suite on 1400-node cluster (not at SC, before SC). Capture a trace log to play.
Thomas – SSS-OSCAR working! Implies that the whole suite works together
Will – fancy graphic demonstration of APITest
Brett – demonstrate swapping components in SSS architecture (show accounting?)
Paul – checkpoint interacting with PM on Chiba

Where? All across the show floor

SciDAC booth – Talks by Geist, Rusty, Craig

OSCAR BOF on Tuesday 5:00-6:00 will mention SSS-OSCAR

Scalable Systems Software Suite

[Diagram: suite components joined by standard XML interfaces on top of a shared authentication and communication layer. Working components and interfaces are shown in bold; interfaces needing work are shown in red.]

Components shown: Grid Interfaces; Meta Services; Meta Scheduler; Meta Monitor; Meta Manager; Accounting; Event Manager; Service Directory; Scheduler; Node State Manager; Allocation Management; Process Manager; Usage Reports; System & Job Monitor; Job Queue Manager; Node Configuration & Build Manager; Checkpoint / Restart; Hardware Infrastructure Manager; Validation & Testing

Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite