HPC Systems Engineering in the Interaction Room

HPC Systems Engineering in the Interaction Room Matthias Book with Morris Riedel, Jülich Supercomputing Centre / UoI and Helmut Neukirchen, University of Iceland

HPC Systems Engineering in the Interaction Room

Matthias Book

with Morris Riedel, Jülich Supercomputing Centre / UoI, and Helmut Neukirchen, University of Iceland

General Software Engineering Challenges

Need to ensure we are building the right software, and we are building it right

But many sources of miscommunication between domain & technology experts:

Different vocabulary / areas of competence

Struggling to convey / understand requirements precisely

Struggling to realize what is non-obvious / implicit knowledge / unknown

Struggling to realize what bears particular value / effort / risk

Struggling to convey / understand what is fixed, what is flexible, what is variable

Up-front specifications often do not solve these problems, but just mask them

Same struggles put in writing

Issues surface later, when they are more expensive to fix

Agile approaches encourage (and actually depend on) more interaction but provide little guidance for communicating about what is really crucial in a project

The Nature of Software Development

“Because software is embodied knowledge,

and that knowledge is initially dispersed, tacit, latent, and incomplete,

software development is a social learning process.”

Howard Baetjer, Jr.: Software as Capital. IEEE Computer Society Press, 1998

The Interaction Room

Successful projects require personal, focused discussion of critical project aspects

thorough understanding of application domain, and how it is modeled in software

early recognition of value and effort drivers

early elimination of risks and uncertainties

The Interaction Room is a dedicated room for the project team

where domain and technical stakeholders feel at home

with large whiteboards on the walls

but without a classic conference table

to visualize and discuss key project aspects informally

instead of going over tedious documents

[Figure: Example IR for information system development, with Process Canvas, Interaction Canvas, Integration Canvas and Object Canvas]

Example: Annotated Object Canvas for an Information System

A Pragmatic Approach to Conceptualizing Software

Informal, high-level sketches of software models sacrifice formality, consistency, completeness (no strict UML necessary)

in favor of focus, pragmatism, interdisciplinary understanding, value-orientation

Not a replacement for formal software specifications!

Formal specifications may well be necessary for certain aspects in later stages, and can then be delegated to expert groups

Informal sketches serve as catalysts for the identification, understanding and discussion of the most critical project aspects

Interdisciplinary communication about domain and technology

High-level orientation about project goals, dependencies, conflicts, trade-offs

Early identification of value and complexity drivers, risks, uncertainties

High Performance Computing / Scientific Computing

Simulation Science

Simulation of natural processes to learn about known behavior of a complex system, e.g.

Weather forecast

Human brain

Glacial processes

Volcanic processes

Crowd behavior

etc.

(Focus of following IR ideas)

Data Science

Identification of patterns / correlations to learn about unknown aspects of a complex system, e.g.

Recognizing customer preferences

Recognizing medieval manuscript scribes

etc.

HPC: Break simulation/data science problems down into small chunks for parallel processing on very large number of cores
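
The chunking idea can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the "work" is a toy sum of squares, and Python threads merely stand in for cores to show the structure (a real HPC code would use MPI ranks or OpenMP threads).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(n_items, n_workers):
    """Split indices 0..n_items-1 into n_workers contiguous, near-equal ranges."""
    base, extra = divmod(n_items, n_workers)
    chunks = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional item each.
        size = base + (1 if worker < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def partial_sum_of_squares(chunk):
    """Work done independently on one chunk (a stand-in for real computation)."""
    return sum(i * i for i in chunk)


def parallel_sum_of_squares(n_items, n_workers):
    """Process all chunks concurrently and combine the partial results."""
    chunks = split_into_chunks(n_items, n_workers)
    # Threads only illustrate the structure; they are not a performance claim.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum_of_squares, chunks))
```

The pattern is the same at any scale: partition the index space, compute per chunk without shared state, then reduce the partial results.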

Crucial Interdisciplinary Communication Points in HPC Simulation Science Projects

Domain experts need to help systems engineers understand:

What research question are we trying to answer? What context, what boundary conditions?

What parameters and variables are there? How are they evolving over time? Initial values?

How do the variables affect each other? Is interaction long- or short-range?

What are particularly interesting segments of the simulation space? Are these variable?

etc.

Systems engineers need to validate technical decisions with domain experts:

Cluster architecture: Memory-intensive or compute-intensive? Many-core, multi-core, GPUs?

Domain decomposition: How to map the problem most efficiently to the cluster?

Communication patterns: Choice of communication type? Ghosts and halos?

Memory model: Distributed (MPI), shared (OpenMP) or hybrid?

Data structures: What can be transient / must be persistent? Checkpointing? Parallel I/O?

etc.
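
To make "domain decomposition" and "ghosts and halos" concrete, here is a minimal sketch in plain Python, not MPI. It assumes a 1D field, a three-point averaging stencil, and fixed boundary values; all function names are hypothetical. Each subdomain carries one ghost cell per side, filled from its neighbour before every stencil step.

```python
def decompose(values, n_parts):
    """Split a 1D field into contiguous subdomains, each padded with two ghost cells."""
    base, extra = divmod(len(values), n_parts)
    parts, start = [], 0
    for p in range(n_parts):
        size = base + (1 if p < extra else 0)
        # Layout: [left ghost] + owned cells + [right ghost]
        parts.append([0.0] + list(values[start:start + size]) + [0.0])
        start += size
    return parts


def exchange_halos(parts, left_boundary, right_boundary):
    """Fill each ghost cell with the neighbour's adjacent owned cell."""
    for p, part in enumerate(parts):
        part[0] = parts[p - 1][-2] if p > 0 else left_boundary
        part[-1] = parts[p + 1][1] if p < len(parts) - 1 else right_boundary


def stencil_step(part):
    """Three-point average over the owned cells; ghosts are read but not updated."""
    owned = [(part[i - 1] + part[i] + part[i + 1]) / 3.0
             for i in range(1, len(part) - 1)]
    return [part[0]] + owned + [part[-1]]


def gather(parts):
    """Concatenate the owned cells of all subdomains back into one field."""
    return [v for part in parts for v in part[1:-1]]


def serial_step(values, left_boundary, right_boundary):
    """Reference: the same stencil applied to the undivided field."""
    padded = [left_boundary] + list(values) + [right_boundary]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]
```

With the halo exchange in place, the decomposed step reproduces the serial result; leaving it out gives wrong values at subdomain edges, which is the kind of issue worth raising on a canvas early.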

Typical Pitfalls in HPC Simulation Science Projects

Choosing appropriate solvers vs. reinventing the wheel

Inefficient domain decomposition; load imbalance

Dealing with differences between & unique strengths of individual architectures

Dealing with different schedulers and their job scripts

Debugging costs high amount of (possibly expensive) computing time

Approximation of real world, insufficient validation data

Integrating different physical models/processes with each other (multi-physics)

Constant change of hardware, software, modus operandi

Constant need for porting; always an early adopter; changing code ownership

Many of these revealed only in late (i.e. expensive to fix) project stages
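
The load imbalance pitfall can be made measurable early on. The sketch below is an illustration only: it assumes per-cell cost estimates are available, uses the common ratio of maximum to mean load as the imbalance metric, and compares two hypothetical assignment strategies. None of this is prescribed by the slides.

```python
def load_imbalance(loads):
    """Ratio of the busiest worker's load to the mean load (1.0 = perfectly balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


def block_loads(cell_costs, n_workers):
    """Total cost per worker when cells are assigned in contiguous blocks."""
    base, extra = divmod(len(cell_costs), n_workers)
    loads, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        loads.append(sum(cell_costs[start:start + size]))
        start += size
    return loads


def cyclic_loads(cell_costs, n_workers):
    """Total cost per worker when cells are dealt out round-robin."""
    return [sum(cell_costs[w::n_workers]) for w in range(n_workers)]


# Hypothetical cost profile: 12 cheap cells followed by 4 expensive ones.
costs = [1] * 12 + [10] * 4
```

For this profile, `block_loads(costs, 4)` puts all expensive cells on one worker, while `cyclic_loads(costs, 4)` spreads them evenly. The cyclic layout usually costs more communication, so the trade-off still needs discussion between domain and technical experts.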

Software Process for HPC Simulation Science Projects

1. Understand the problem domain

2. Perform appropriate domain decomposition and choose appropriate communicators, helpful libraries, data structures etc.

3. Implement correct code framework for communication between processes; integrate correct problem-domain code into communication code

4. Test and validate simulation model

5. Optimize accuracy, tune performance

Conceptual Levels in HPC Simulation Science Projects

Problem level: Statement of research question / project goal and scope

Goal, context, scope: Research question, boundary conditions, assumptions, abstractions

Quality requirements: Accuracy, generalizability, performance

Scientific level: Description of the pertinent aspects of the domain

Static aspects: Coordinates, variables, sources of influence, points of interest, physical laws

Dynamic aspects: Forces, interactions, events, timing, discontinuities

Distribution level: Breakdown of the scientific model into parallelizable units

Static aspects: Domain decomposition, data structure, initial conditions

Dynamic aspects: Communication patterns, stencils, halos, ghosts, adaptive mesh refinements, iterative numerical methods

Technical level: Implementation of distribution model on particular architecture

Static aspects: Cluster architecture, (parallel) file system, memory model, interconnect

Dynamic aspects: Communication protocols, I/O operations, available libraries, solvers

Interaction Room Canvases for Simulation Science

Problem canvas: Goal and scope of research question about the domain

Real-world canvas: Description of the pertinent aspects of the domain

Decomposition canvas: Breakdown of scientific model into parallelizable units

Architecture canvas: Implementation of simulation on suitable HPC technology

Problem Canvas: Goal and scope of research question about the domain

Domain experts collect on note cards:

Research question

Boundary conditions

Assumptions

Abstractions

Quality requirements

Example: Heat dissipation problem

Question: What will the temperature in the middle of a room be like after running an air conditioner on one side and a heater on the other for several hours?

Boundary conditions: Room size, starting temperature, A/C and heater setting

Abstractions: Consider heat transfer by air flow / convection only, not by radiation

Assumptions: No moving objects in the room, no windows

Quality requirements: Temperature must be determined with double precision
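
If the note cards are to be carried over into a project repository, they could be recorded in a small data structure. The sketch below is one possible representation, not part of the Interaction Room method itself: the class name, field names and the `missing_cards` helper are hypothetical, while the card texts are taken from the heat dissipation example above.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemCanvas:
    """Note cards collected by domain experts on the problem canvas."""
    research_question: str
    boundary_conditions: list = field(default_factory=list)
    assumptions: list = field(default_factory=list)
    abstractions: list = field(default_factory=list)
    quality_requirements: list = field(default_factory=list)

    def missing_cards(self):
        """Names of note-card categories that are still empty."""
        cards = {
            "boundary_conditions": self.boundary_conditions,
            "assumptions": self.assumptions,
            "abstractions": self.abstractions,
            "quality_requirements": self.quality_requirements,
        }
        return [name for name, entries in cards.items() if not entries]


heat_problem = ProblemCanvas(
    research_question=(
        "What will the temperature in the middle of a room be like after running "
        "an air conditioner on one side and a heater on the other for several hours?"
    ),
    boundary_conditions=["Room size", "Starting temperature", "A/C and heater setting"],
    assumptions=["No moving objects in the room", "No windows"],
    abstractions=["Heat transfer by air flow / convection only, not by radiation"],
    quality_requirements=["Temperature must be determined with double precision"],
)
```

An empty category reported by `missing_cards()` is a prompt for discussion, not an error: it may simply mean nobody has thought about that aspect yet.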

Real-World Canvas: Description of the pertinent aspects of the domain

Domain experts sketch static properties of the simulation space

Spatial setup

Locations and properties of simulation elements

Domain experts sketch dynamic properties of simulation process

Forces

Events

Points of interest (actors, sensors)

Changes over time

Example: Heat dissipation problem

Room geometry, placement of heater, A/C, monitor

Working of convection forces

Working of air flows, times of A/C operation

Appropriate formulae, numerical methods

Decomposition Canvas: Breakdown of scientific model into parallelizable units

Technical experts sketch digital model reflecting real-world model, focusing on:

Static aspects: Domain decomposition, data structure, abstraction level, initial conditions

Dynamic aspects: Communication patterns, stencils, halos/ghosts, adaptive mesh refinements, iterative numerical methods

Example: Heat dissipation problem

Initial room temperature: 20°C, A/C set to 10°C

Resource requirements: e.g. regular grid, number of cores, memory usage

Adaptive mesh refinement

Cartesian communicator

Halo creation & communication strategy

Iterative method of known physical formula (heat transfer)
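
What a Cartesian communicator provides can be sketched without MPI. The plain-Python functions below mimic the bookkeeping that MPI's Cartesian topology routines (`MPI_Cart_create`, `MPI_Cart_shift`) perform: ranks are laid out row-major on a process grid and each rank can look up its neighbours for halo exchange. The 4 x 4 grid, the non-periodic boundaries and all function names are assumptions for illustration.

```python
def rank_to_coords(rank, dims):
    """Row-major (row, col) coordinates of a rank on a rows x cols process grid."""
    _rows, cols = dims
    return divmod(rank, cols)


def coords_to_rank(coords, dims):
    """Rank at the given coordinates, or None outside a non-periodic grid."""
    row, col = coords
    rows, cols = dims
    if 0 <= row < rows and 0 <= col < cols:
        return row * cols + col
    return None


def neighbours(rank, dims):
    """Ranks a process would exchange halos with; None marks a physical boundary."""
    row, col = rank_to_coords(rank, dims)
    return {
        "up": coords_to_rank((row - 1, col), dims),
        "down": coords_to_rank((row + 1, col), dims),
        "left": coords_to_rank((row, col - 1), dims),
        "right": coords_to_rank((row, col + 1), dims),
    }
```

For example, on a 4 x 4 grid `neighbours(5, (4, 4))` names four partner ranks, while rank 0 in the corner has only two. The `None` entries correspond to room walls, where boundary conditions apply instead of halo data.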

Architecture Canvas: Implementation of simulation on suitable HPC technology

Technical experts sketch mapping of digital model to actual cluster infrastructure

Static aspects: Cluster architecture, parallel file system, memory model, available modules, tool support

Dynamic aspects: Communication protocols, I/O operations, solvers, scheduling, checkpointing, output format

Example: Heat dissipation problem

16 cores, I/O and compute nodes

MPI or hybrid, Intel compiler

Jacobi solver

10,000 iterations (based on previous experience) require ~3 h wall time

Checkpoints every 2,000 iterations, parallel I/O

Output format: List of temperature measurements
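
The numbers in this example imply a simple budget: ~3 h for 10,000 iterations is about 1.08 s per iteration, and a checkpoint every 2,000 iterations gives 5 checkpoints, roughly 36 minutes apart. The serial sketch below shows how a Jacobi sweep and periodic checkpointing fit together. It is a simplified stand-in, not the authors' code: it relaxes a small 2D grid with fixed wall temperatures, ignores air flow, keeps checkpoints in memory instead of using parallel I/O, and the 30°C heater wall is a hypothetical value.

```python
import copy


def make_room(rows, cols, initial, cold_wall, warm_wall):
    """Grid at the initial temperature with a cold left wall and a warm right wall."""
    grid = [[initial] * cols for _ in range(rows)]
    for row in grid:
        row[0] = cold_wall
        row[-1] = warm_wall
    return grid


def jacobi_step(grid):
    """One Jacobi sweep: each interior cell becomes the mean of its four neighbours."""
    rows, cols = len(grid), len(grid[0])
    new = copy.deepcopy(grid)  # boundary cells are carried over unchanged
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            new[r][c] = 0.25 * (grid[r - 1][c] + grid[r + 1][c]
                                + grid[r][c - 1] + grid[r][c + 1])
    return new


def run(grid, iterations, checkpoint_every):
    """Iterate and keep a copy of the grid at every checkpoint interval."""
    checkpoints = {}
    for it in range(1, iterations + 1):
        grid = jacobi_step(grid)
        if it % checkpoint_every == 0:
            # A real run would write this state to the parallel file system.
            checkpoints[it] = copy.deepcopy(grid)
    return grid, checkpoints


# Hypothetical small setup: 20 degrees initially, A/C wall at 10, heater wall at 30.
room = make_room(5, 7, 20.0, 10.0, 30.0)
final, checkpoints = run(room, iterations=10, checkpoint_every=2)
```

Scaled down to 10 iterations with a checkpoint every 2, the run produces the same count of 5 checkpoints as the slide's configuration. The checkpoint interval bounds how much wall time is lost if a job fails.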

Interaction Room Annotations for Simulation Science

Highlight model elements that merit particular consideration:

Value annotations

Scientific value

Risk annotations

Complexity

Innovation

Uncertainty

Effort annotations

Quality requirements

Boundary conditions

Interfaces

Shift attention from what is visible in models to what is implied, what is assumed, what is unknown

i.e. those aspects that often make or break a project
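
Annotations placed on a canvas can also be tracked in a simple model, for example to list everything still marked as uncertain. The sketch below is hypothetical throughout: the grouping into value, risk and effort follows the order of the list above and is an assumption, and the annotated elements are invented placements on the heat dissipation canvases.

```python
from dataclasses import dataclass, field
from enum import Enum


class Annotation(Enum):
    SCIENTIFIC_VALUE = "Scientific value"
    COMPLEXITY = "Complexity"
    INNOVATION = "Innovation"
    UNCERTAINTY = "Uncertainty"
    QUALITY_REQUIREMENTS = "Quality requirements"
    BOUNDARY_CONDITIONS = "Boundary conditions"
    INTERFACES = "Interfaces"


# Assumed grouping, read off the order of the slide's list.
CATEGORY = {
    Annotation.SCIENTIFIC_VALUE: "value",
    Annotation.COMPLEXITY: "risk",
    Annotation.INNOVATION: "risk",
    Annotation.UNCERTAINTY: "risk",
    Annotation.QUALITY_REQUIREMENTS: "effort",
    Annotation.BOUNDARY_CONDITIONS: "effort",
    Annotation.INTERFACES: "effort",
}


@dataclass
class CanvasElement:
    canvas: str
    name: str
    annotations: set = field(default_factory=set)


def elements_with(elements, annotation):
    """Names of all canvas elements carrying the given annotation."""
    return [e.name for e in elements if annotation in e.annotations]


# Invented example placements, for illustration only.
elements = [
    CanvasElement("decomposition", "Adaptive mesh refinement",
                  {Annotation.COMPLEXITY, Annotation.UNCERTAINTY}),
    CanvasElement("decomposition", "Halo creation & communication strategy",
                  {Annotation.COMPLEXITY}),
    CanvasElement("problem", "Temperature in double precision",
                  {Annotation.QUALITY_REQUIREMENTS}),
]
```

A query such as `elements_with(elements, Annotation.UNCERTAINTY)` then yields the items that need clarification before planning.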

Interaction Room Workflow

1. Define project scope on problem canvas

2. For real-world, decomposition and architecture canvas, iteratively:

   1. Sketch static & dynamic canvas concepts

   2. Place and discuss canvas annotations

   3. Place Uncertainty annotations

   4. Refine prior canvases with new insights

3. Identify need for formal specifications

4. Make project plan, e.g. agile backlog or classic work packages

Outlook on Further Work

More precise definition of canvases and notations

Choice of appropriate annotations – identification of particular challenges

Refinement of process

Validation in actual projects

Thank you!