SC04 Release, API Discussions, SDK, and FastOS Al Geist August 26-27, 2004 Chicago, ILL.


Page 2: Scalable Systems Software

Participating Organizations

ORNL, ANL, LBNL, PNNL, NCSA, PSC, SDSC, SNL, LANL, Ames, IBM, Cray, Intel, SGI

Problem

• Computer centers use incompatible, ad hoc set of systems tools

• Present tools are not designed to scale to multi-Teraflop systems

Goals

• Collectively (with industry) define standard interfaces between systems components for interoperability

• Create scalable, standardized management tools for efficiently running our large computing centers

Resource Management, Accounting & user mgmt, System Build & Configure, Job management, System Monitoring

To learn more visit www.scidac.org/ScalableSystems

Page 3: Participating Organizations

Coordinator: Al Geist

ORNL, ANL, LBNL, PNNL, PSC, IBM, SGI, SNL, LANL, Ames, NCSA, Cray, Intel

How do we position ourselves with respect to the

- National Leadership-class facility? NLCF is a partnership between ORNL (Cray), ANL (BG), PNNL (cluster)

- NERSC and NSF centers

Page 4: Major Topics this Meeting

Fred’s Feedback on the Project – he asks that we discuss several things at this meeting.

SC04 Suite Release – code freeze Sept 3. Where do we stand? What do we show/demo at SC? (make demo list)

SDK for SSS Components - developed since last meeting. What it is, how to use it.

FastOS presentations - SNL, ORNL, ANL, and LANL winners. What are they proposing? How can SSS help?

API Discussions - SSSRMAP and Restriction Syntax: the saga continues.

Page 5: Fred’s Requests

“The ISICs will not be continued beyond 5 years. Time to think about how you are going to wrap up the SSS effort.”

• Fred would like a list of our accomplishments consistent with the $10M he has spent on the project over 5 years.

• List our priorities for what is still to be done (and what is likely to not be done) by the end.

• Get our software out on large clusters (LLNL, NCSA, NERSC, and PNNL were the ones he mentioned).

• Is our software robust enough to use at NLCF? NERSC? (in part or whole)

• Define our relationship to the new FastOS effort.

Page 6: Scalable Systems Software Suite

[Diagram of the suite. Component labels: Grid Interfaces; Meta Services; Meta Scheduler; Meta Monitor; Meta Manager; Accounting; Event Manager; Service Directory; Scheduler; Node State Manager; Allocation Management; Process Manager; Usage Reports; System & Job Monitor; Job Queue Manager; Node Configuration & Build Manager; Checkpoint/Restart; Validation & Testing; Hardware Infrastructure Manager; Packaging & Install. Also labeled: Standard XML interfaces, authentication, communication.]

Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite.

Any updates to this diagram?

Page 7: Review of Last Meeting

Scalable Systems Software Center, May 6-7, Argonne

Details in main project notebook

Page 8: Highlights from May mtg

Craig – General thoughts on the official V1.0 release at SC04
• This will be the first time many people will see the software.
• Our orthogonal directions in syntax are disturbing.
• If we don’t make a decision soon, it hurts project progress towards V1.0.

Brett, who works with both, favors SSSRMAP. He likes its more descriptive nature and its OO nature.

Paul says one is better, but two is not too bad.

Scott doesn’t think we can reconcile.

Paul asks for a straw vote for a preference:
• For SSSRMAP: 7 votes representing 5 institutions
• For Restriction Syntax: 3 votes, all ANL
• Abstain: 3 votes

Craig says he will do whatever it takes to make either work. He is going to make ssslib SSSRMAP work.

Neil says “users” are the guiding factor and RMAP is better there.

Paul says understandability and acceptability are key and RMAP is better.

Both say that RS is more compact and elegant.

Page 9: Highlights from May Meeting (cont)

Narayan – asks whether it just needs documentation and tutorials.

Paul says no. There is a closer match for SOAP et al. The OO nature was not a factor in his choice, but it is more popular today.

Neil says potential users won’t have a Narayan to figure this out. Components are both client and server, so the developer has to know the syntax.

Rusty – what if something else were added to RS that made it easier to use or understand? He is not sure it is a good idea.

Will – documentation is better in RMAP, and he has looked at RMAP more. Would all this stuff be more abstracted? Users do as little as they can and read the manual only after they get stuck. Doesn’t care as long as we pick ONE! Need to have the same look and feel across the project.

Rick – I don’t care which. I don’t like XML. What about the SD and EM that are already accepted?

Al – says he feels that RMAP would be more acceptable to vendors and that this would be critical to the long-term success of the project.

Page 10: This Meeting

Scalable Systems Software Center

August 26-27, 2004

Page 11: Agenda – August 26

8:30 Al Geist – Project Status and Fred’s comments
9:30 Scott Jackson – Resource Management
10:30 Paul Hargrove – Process Management
11:30 Narayan Desai – Node Build, Configure
12:30 Lunch (on own – cafeteria)
1:30 Rusty Lusk – Comparing Restriction Syntax and SSSRMAP
3:00 Break
FastOS presentations and discussion of SSS interaction:
3:30 Beckman
4:00 Brightwell
4:30 Scott
5:00 Al – Discussion on SC04. What demos will we have?
5:30 Adjourn

Page 12: Agenda – August 27

8:30 Discussion, proposals, votes
- Rusty – new SDK for SSS components
- Will McClendon – Validation and Testing
- Thomas Naughton – SSS-OSCAR and v1.0 release
10:30 Break
11:00 Al Geist – Response to Fred
- v1.0 SSS-OSCAR release
- formal collaboration with FastOS projects
- SSS on NLCF machines rather than production clusters
- run scale tests on big clusters (short windows)
- steps towards long-term support
- priority to vote on written component interfaces

Next meeting date: hacking mtg Oct. 6-8, ORNL
Next regular mtg: Jan 25-26, 2005 (check w/Fred); location: DC for Fred to attend

12:00 Meeting ends

Page 13: Meeting notes

Al Geist – presents Fred’s requests and goals for this meeting

Scott – RM working group status
• Updated and implemented SSSRMAP v3 spec
• Second alpha release including Maui, Bamboo, Warehouse, Gold
• Added interactive FAQOMATIC
• Completed merger of Maui 3.2 and Maui SSS; uses SSS interfaces (commercial versions of these will also, as a matter of course)
• QM: interactive job support finished and tested. Packaging updated to separate out components required on the execution nodes.
• Accounting and allocation: complete rewrite in Perl. Significantly improved accounting design and account report.
• Completed allocation, reservation, quotation and charge rates; GOLD GUI
• Metascheduler (grid scheduler, Silver) migrated interface to use SSS
• Future work: beta release of all components including Silver; FT supporting 25% cluster loss
• Continued OS support for Linux, AIX, Tru64. Future: OS X, Unicos, HP-UX
• Who is using these alphas? A few, just looking at it.
- Production deployment of GOLD on 11.8TF PNNL cluster (November)
- Also think Maui HPC center, ANL, and DOD centers are likely

Page 14: Meeting notes (cont)

SC04 demo discussion – RM GUI (commercial) by David Jackson

Paul – Process Management status
• Checkpoint status: full save and restore of registers, memory, signals, PID, files (open but unmodified, open and appended, pipes between processes), and communication (via LAM/MPI over TCP)
• Handles in-flight data (drains), linear scaling, and migration. In future OpenMPI? Paul will check.
• Discussion of handling files
• Will always be a Linux-only solution (across all of them)
• Presently x86 only; Alpha and PPC are possible future work, not high priority
• Future work: more on files (mutable files, directories), process groups
• Checkpoint Manager works with Bamboo and MPDPM
• Process Manager: continued daily use on Chiba
• New option to signal entire Unix process group
• Misc. hardening of MPD system, prompted by Intel use (cook & associates)
• Future: Intel donated an IA64 test cluster that could be used to test SSS
• Warehouse: major bug fix, works with RM components
• ssslib version with RMAP delayed due to hard drive crash

Page 15: Meeting notes (cont)

Narayan – Build configure status
• Infrastructure improvements: ssslib wire protocol, user support additions, SSS SDK development
• Component improvements: efficiency of service directory and event mgr
• Node state manager: simplified implementation by using other SSS components, particularly from PM

More discussion of SC04 demos – GUIs and handling failure.

Rusty – Syntax Discussion
• We agreed on XML as the basis of the communication mechanism. Many benefits.
• Allowed multiple wire protocols, with the service directory to keep track
• We have created a couple of XML styles. Rusty thinks having two or more is fine. He is not suggesting we have only one, although others in the group have.
• Steps: match a set of objects, apply function with args to set of objects, and construct return message.
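The three steps Rusty lists (match a set of objects, apply a function with args to that set, construct a return message) can be sketched in Python. This is only an illustration under stated assumptions: the node records, the helper names (`match_objects`, `apply_function`, `build_reply`, `handle`) and the `<reply>`/`<node>` element names are hypothetical and are not taken from any SSS component.

```python
import xml.etree.ElementTree as ET

# Hypothetical objects a component might manage (not real SSS data).
NODES = [
    {"name": "node1", "state": "up", "arch": "x86"},
    {"name": "node2", "state": "down", "arch": "x86"},
    {"name": "node3", "state": "up", "arch": "ia64"},
]

def match_objects(objects, criteria):
    """Step 1: select every object whose fields equal all the criteria."""
    return [o for o in objects
            if all(o.get(k) == v for k, v in criteria.items())]

def apply_function(function, selected, args):
    """Step 2: apply a function (with its arguments) to each selected object."""
    for obj in selected:
        function(obj, **args)
    return selected

def build_reply(selected, fields):
    """Step 3: construct an XML return message carrying the requested fields."""
    reply = ET.Element("reply")
    for obj in selected:
        ET.SubElement(reply, "node", {f: obj[f] for f in fields})
    return ET.tostring(reply, encoding="unicode")

def set_state(obj, state):
    """Example function to apply: change the state field of one object."""
    obj["state"] = state

def handle(criteria, function, args, fields):
    """Run the three steps in order and return the reply message."""
    selected = match_objects(NODES, criteria)
    apply_function(function, selected, args)
    return build_reply(selected, fields)
```

For example, `handle({"state": "up"}, set_state, {"state": "offline"}, ["name", "state"])` selects the two nodes that are up, marks them offline, and returns a reply with one `<node>` element per selected node.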

Page 16: Meeting notes (cont)

Rusty – Syntax Discussion cont.

RS syntax is <command> predicate </command>
• Command is the function to apply (args go here)
• Predicate is a field value match to select set of objects
• Return message includes info on all fields in predicate
• Goes through a few examples in RS and explains them

In SSSRMAP <request action=value><where …></where></request>
• Go through same examples in RMAP
• Matching is in the “where” clause
• Args are in the “Option” object
• Return message indicated by “Get” object

Looking at both
• Completeness: probably equal; both lack general negation
• Validation: RS is somewhat better here
• Extensibility: SSSRMAP is somewhat better
• Readability: SSSRMAP is somewhat better here
• Conciseness: RS is better
• Atomicity: equal
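To make the two message shapes concrete, here is a rough Python sketch that builds one request in each style with the standard-library ElementTree module. Only the outer shapes come from the notes (`<command> predicate </command>` for RS; `<request action=value>` with a `where` clause, an `Option` object and a `Get` object for SSSRMAP). The command name `set-state`, the action value `Modify`, the `node` object and the `name` attribute on the child elements are assumptions made for illustration, not the actual RS or SSSRMAP specifications.

```python
import xml.etree.ElementTree as ET

def rs_message(command, args, obj_type, predicate):
    """RS shape: <command args...> predicate </command>.

    The command element carries the arguments; the nested element is the
    predicate whose field values select the objects (assumed layout).
    """
    cmd = ET.Element(command, args)
    ET.SubElement(cmd, obj_type, predicate)
    return ET.tostring(cmd, encoding="unicode")

def rmap_message(action, get_fields, where, options):
    """SSSRMAP shape: <request action=...> with Get, where and Option children.

    How each child encodes its field name and value is an assumption here.
    """
    req = ET.Element("request", {"action": action})
    for field in get_fields:              # fields to return
        ET.SubElement(req, "Get", {"name": field})
    for field, value in where.items():    # matching criteria
        ET.SubElement(req, "where", {"name": field}).text = value
    for name, value in options.items():   # arguments
        ET.SubElement(req, "Option", {"name": name}).text = value
    return ET.tostring(req, encoding="unicode")

if __name__ == "__main__":
    print(rs_message("set-state", {"state": "offline"}, "node", {"name": "node1"}))
    print(rmap_message("Modify", ["state"], {"name": "node1"}, {"state": "offline"}))
```

Printing both messages side by side gives a feel for the trade-off recorded in the notes: the RS form packs more into attributes, while the SSSRMAP form spells out matching, arguments and returned fields as separate objects.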

Page 17: Meeting notes (cont)

Rusty – Syntax Discussion cont.

Critique of both

RS
• Puts too much in attributes
• Overloads use for (see slides; flesh out here)

Page 18: Meeting notes (cont)

Rusty – Syntax Discussion cont.

The Less Restrictive Syntax
- Keep high-level spec of commands like RS for validation
- Move attributes in RS to subobjects as in SSSRMAP
- Explicitly specify fields to return

Show and discuss examples in new syntax style

<function> <List of objects to match> <matching criteria> </List></function>

Still has the same implicit AND and OR that was in RS.

Argonne is starting to transition to the Less Restrictive Syntax.
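A small Python sketch of how a message in the `<function> <List of objects to match> <matching criteria> </List></function>` shape might be parsed and evaluated. The notes do not spell out what the implicit AND and OR mean, so the semantics below are an assumption: fields inside one list entry are ANDed, and separate entries in the list are ORed. The element names (`get-state`, `nodes`, `node`, and the field subobjects) are hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Hypothetical message: function element, a list of objects to match,
# and matching criteria expressed as subobjects rather than attributes.
MESSAGE = """
<get-state>
  <nodes>
    <node><arch>x86</arch><state>up</state></node>
    <node><name>node3</name></node>
  </nodes>
</get-state>
"""

def parse(message):
    """Return (function name, list of criteria dicts) from a message."""
    root = ET.fromstring(message)
    object_list = root[0]  # the list-of-objects element
    criteria = [{field.tag: field.text for field in obj} for obj in object_list]
    return root.tag, criteria

def matches(obj, criteria):
    """Assumed semantics: AND within one entry, OR across entries."""
    return any(all(obj.get(k) == v for k, v in entry.items())
               for entry in criteria)
```

With this message, a node matches if it is an x86 node that is up, or if it is named node3, whatever its other fields say.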

Page 19: Meeting notes (cont)

Pete – FastOS at ANL and U Oregon (budget starts Oct. at 2/3 of request)
• Future systems (smart memory, message processor, stream processor)
• Functional decomposition and hierarchical organization
• Example: BG/L uses 4 OSes: SuSE 8, SuSE 9, embedded Linux, microkernel
• Get a BG/L in December
• For petascale, how many OSes will be required?
• What are their performance characteristics and requirements?
• Can they be dynamic?
• What is the cost of each component? What if a part is left out?
• Are collective, coupled OSes needed?
• Can we build an experimental framework for FT?
• Four focus areas: flexible OS suites, scalable system calls, FT, performance tools
• Interact with SSS
- Dynamic node builds and kernel loads
- Tao will be added to kernel and middleware; could complement SSS
- Faulty Towers provides info to SSS layers via component interface
• OS is Linux 2.6 kernel, embedded Linux

Page 20: Meeting notes (cont)

Ron – SNL, UNM, CalTech also got 2/3 budget
• Need OS functionality out on the network interface: distributing bits of OS (OS bypass, offload, splintering)
• Light Weight Kernel influences: hard to make changes
• Programming models: problems with mixing PIM, MPI, OpenMP
• Usage models: apps number and time change over time
• External services: parallel file system, chkpt, dynamic libraries
• What: build a collection of micro services. Small components with well-defined interfaces. Combine services specifically for an app and system.
• Tools for combining micro services
• Building custom OS on the fly
• See http://coset.irisa.fr/

Page 21: Meeting notes (cont)

Stephen – ORNL, UNM, NCSU, OSU, Louisiana Tech got 1/2 budget
• RAS for scientific and engineering apps

Paul – getting K42 to work on clusters

Scott – PNNL going to do work in SGI and single system image

Page 22: Meeting notes – Day 2

Rusty – SDK for SSS components
• Lots of components in the future, some we have never imagined
• Crucial to make component development easy
• ssslib, event manager, and service directory for a foundation
• Also need to encapsulate functionality of an abstract component. Have been trying this with Python classes.
• Useful for BG/L and FastOS experiments
• Low levels of SDK: multiple wire protocols, EM, SD, and communication for any language
• Upper level of SDK: Server and Event receiver classes provide all the services that are independent of component: registers, logging, errors, XML validating, …
• Shows the “stack”
• Goes through echo example: makes SSS coding pretty easy!
• Goes through job submitter example (several slides)
• Conclusion: makes writing SSS components easy, currently for Python components, but other languages possible
• Scott says it should be easy to implement RMAP syntax in this SDK
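The idea of an upper SDK level that supplies the component-independent services (registration, logging, errors, XML checking) so that an echo component stays tiny can be illustrated with a short Python sketch. This is not the SSS SDK: the class names `Component` and `Echo`, the `do_<tag>` dispatch convention, the dictionary standing in for the service directory, and the `<echo-reply>` and `<error>` elements are all hypothetical, and the XML check is for well-formedness only, not schema validation.

```python
import logging
import xml.etree.ElementTree as ET

class Component:
    """Hypothetical base class holding services common to every component."""
    name = "component"

    def __init__(self, directory):
        self.log = logging.getLogger(self.name)
        directory[self.name] = self  # stand-in for service directory registration
        self.log.info("registered %s", self.name)

    def handle(self, raw):
        """Parse a message, dispatch on its root tag, return an XML reply string."""
        try:
            msg = ET.fromstring(raw)  # well-formedness check only
        except ET.ParseError as err:
            return self._error("malformed XML: %s" % err)
        handler = getattr(self, "do_" + msg.tag.replace("-", "_"), None)
        if handler is None:
            return self._error("unknown command: %s" % msg.tag)
        return ET.tostring(handler(msg), encoding="unicode")

    def _error(self, text):
        """Log the problem and wrap it in an <error> reply."""
        self.log.error(text)
        err = ET.Element("error")
        err.text = text
        return ET.tostring(err, encoding="unicode")

class Echo(Component):
    """Echo component: the only component-specific code is one handler."""
    name = "echo"

    def do_echo(self, msg):
        reply = ET.Element("echo-reply")
        reply.text = msg.text
        return reply
```

A component author would then write only the handler: `Echo({}).handle("<echo>hello</echo>")` returns an `<echo-reply>` carrying the same text, while malformed or unknown messages come back as `<error>` replies from the base class.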

Page 23: Meeting notes – Day 2

Will – Validation and Testing status
• Mainly working on APItest; current release v0.2.0
• Available for download on SNL ftp site (see slides)
• Easier to define a new test type (already does shell, script, and SSS)
• There is some caution with SUID
• Packages required: Python 2.3, ElementTree (www.effbot.com), Twisted, and ssslib
• SSS Service Directory test; need to extend to all SSS components
• Discussion about details of how to use for SSS tests
• What about a user manual? Future work
• Future work
- develop more tests for SSS components
- test developer GUI
- additional native test types: http, TCP, XMLRPC
- user guide
- ability to SU jobs to different users
• Discussion

Page 24: Meeting notes – Day 2

Thomas – SSS-OSCAR
• Current status: v0.2a8, prerelease for v1.0 at SC04
• Two more items are in CVS (at this meeting); need testing
• Starting work on v0.3 w/ new GOLD pkg
• OSCAR support for BCWG schema
• Future work
- Integrate Gold
- Integrate APItest in OSCAR; authors create their own test cases
- Improve documentation for v1.0
- Start weekly builds for testing
• Release schedule
- Nov 8: SC04 release v1.0
- Oct 4: code freeze
- Sept: weekly builds, available first day of week by noon for developers to test their component
• Test resources: ORNL “Test1” cluster