The High Performance Simulation Project

Status and short term plans

17th April 2013

Federico Carminati

SFT: SoFTware Development for Experiments

Where are we now?

Present status
• Several investigations of possible alternatives for “extremely parallel – no lock” transport
• Not much code written, several blackboards full
• Some investigation on a simplified but fully vectorized model to prove the vectorization gain
• New design in preparation

Major points under discussion

• How to minimise locks and maximise local handling of particles
• How to handle hit and digit structures
• How to preserve the history of the particles
  • This point seems more difficult at the moment and requires more design work
• What speedup can be obtained by micro-parallelisation
• What are the bottlenecks and opportunities with parallel I/O

Current design

[Figure: current design. Labels: dispatcher thread, thread local, logical volume, logical volume basket, p array, p* arrays, transport, output particle store.]

Features

Pros
• Good parallel performance but…
• Easy recording of particle history
• Limited data movement

Cons
• Possibly limited scalability with a large number of cores
• Non-locality of particles in memory
• Difficult to introduce hits and digits while maintaining locality

Design under study

[Figure: design under study. Labels: list of logical volumes, logical volume lv, list of baskets for lv, input particle list (p array), output particle list (p array), hits, history, sensitive volumes, list of active events for lv, active event list, event ev, digits for lv and event ev.]

Design under study

[Figure: design under study, with the threads shown. Labels: list of logical volumes, logical volume lv, list of baskets for lv, input particle list (p array), output particle list (p array), hits, history, sensitive volumes, list of active events for lv, active event list, event ev, digits for lv and event ev. Added labels: transport thread, digitizer thread, event-builder thread, events.]

Design under study

[Figure: design under study, with buffer lifetimes annotated. Labels: list of logical volumes, logical volume lv, list of baskets for lv, input particle list (p array), output particle list (p array), hits, history, sensitive volumes, list of active events for lv, active event list, event ev, digits for lv and event ev. Added annotations: “Continuously rotated” and “Flushed at the end of event”.]

Features

Pros
• Excellent potential locality
• Easy to introduce hits and digits

Cons
• One more copy (but it is done in parallel)
• More difficult to preserve particle history (it is non-local!) and to introduce particle pruning

Processing flow I

• The transport thread takes particles from the input buffer and transports them until they stop, interact or exit from the volume
• At this point they are inserted in the output particle buffer for further processing
• If the LV is a sensitive detector, hits are generated and stored per LV basket
• An LV basket history record is kept (we have no idea how for the moment, we need more blackboard work!)
• Input and output particle buffers are fixed-size structures, which can however evolve (be optimised) during the simulation
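
The transport step described on this slide can be sketched roughly as follows. This is a toy illustration, not the project's code: the `Particle` and `Basket` classes, the `steps` and `fate` fields, and the rule "one hit per step" are all assumptions made up for the example, and Python lists stand in for the fixed-size buffers.

```python
from dataclasses import dataclass, field


@dataclass
class Particle:
    event: int         # event the particle belongs to
    steps: int         # hypothetical: steps taken inside this volume
    fate: str          # hypothetical: "stop", "interact" or "exit"
    alive: bool = True


@dataclass
class Basket:
    sensitive: bool    # is the LV a sensitive detector?
    input_buf: list = field(default_factory=list)
    output_buf: list = field(default_factory=list)
    hits: list = field(default_factory=list)   # stored per LV basket


def transport_basket(basket):
    """Drain the input buffer; every particle ends in the output buffer."""
    while basket.input_buf:
        p = basket.input_buf.pop()
        for step in range(p.steps):
            # Assumption: one hit per step in a sensitive volume.
            if basket.sensitive:
                basket.hits.append((p.event, step))
        # A stopped particle is flagged dead; the others go on.
        p.alive = p.fate != "stop"
        basket.output_buf.append(p)
```

Keeping the hits inside the basket rather than in a shared container is what lets the transport thread work without taking a lock in this sketch.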

Design under study

[Figure: design under study, with buffer states marked. Labels: list of logical volumes, logical volume lv, list of baskets for lv, input particle list (p array), output particle list (p array), hits, history, sensitive volumes, list of active events for lv, active event list, event ev, digits for lv and event ev. Added markers: “✔ full!” and “✗ empty!”.]

Processing flow II

• When an input particle buffer is exhausted, it is marked as such by the transport thread
  • No lock is needed if it is kept assigned to the LV basket, but memory may be wasted
  • It can instead be passed to a queue of “used baskets”, but this implies a lock
  • In case of a flag, the dispatcher thread has to scan all LVs -> all baskets -> all output buffers to know which ones are used, but this can be optimized
• Used buffers are scanned by the dispatcher thread, which updates a global track counter per event
  • -1 for each stopped (“dead”) particle
• They are then declared “empty” to be reused
• The transport thread picks up another “ready” basket
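
A minimal sketch of the flag-based variant, under stated assumptions: the `Buffer` class, the three state names and the `(event_id, alive)` record layout are hypothetical, and the buffer is assumed to keep a status for each particle it held so that the dispatcher can count the dead ones.

```python
from collections import Counter

EMPTY, READY, USED = "empty", "ready", "used"   # hypothetical state flags


class Buffer:
    def __init__(self):
        self.state = EMPTY
        self.particles = []    # (event_id, alive) records, an assumed layout


def recycle_used_buffers(buffers, tracks_per_event):
    """Dispatcher-side scan: account for dead particles, then free buffers."""
    recycled = 0
    for buf in buffers:
        if buf.state != USED:
            continue
        for event_id, alive in buf.particles:
            if not alive:
                tracks_per_event[event_id] -= 1   # -1 per stopped particle
        buf.particles.clear()
        buf.state = EMPTY                          # ready to be reused
        recycled += 1
    return recycled
```

In a threaded version the `state` field would have to be an atomic flag written by the transport thread and read by the dispatcher; the sketch is single-threaded and only shows the bookkeeping.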

Processing flow III

• When an output particle buffer is full, it is marked as such
  • Again, queue insertion or just a flag
  • In case of a flag, the dispatcher thread has to scan all LVs -> all baskets -> all output buffers to know which ones are full, but this can be optimized
• The transport thread picks another empty output buffer
• The dispatcher thread copies particles from the full output particle buffer to LV-specific input particle buffers
  • Increasing the global particle event counter
• When an input particle buffer is full, the dispatcher declares it “ready to be transported”
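
The dispatcher's copy step might look like the sketch below. Several things here are assumptions rather than anything the slides specify: the `(lv, event_id, is_new)` record layout, the tiny `CAPACITY`, and in particular the rule that the per-event counter is incremented only for newly created particles, which is one plausible reading of the bookkeeping.

```python
from collections import Counter, defaultdict

CAPACITY = 2   # hypothetical fixed buffer size, kept tiny for illustration


def dispatch(full_output, input_by_lv, ready, tracks_per_event):
    """Copy particles from a full output buffer to per-LV input buffers."""
    for lv, event_id, is_new in full_output:
        if is_new:
            # Assumption: only newly created particles raise the counter.
            tracks_per_event[event_id] += 1
        input_by_lv[lv].append(event_id)
        if len(input_by_lv[lv]) == CAPACITY:
            # A full input buffer is declared ready to be transported.
            ready.append((lv, input_by_lv.pop(lv)))
    full_output.clear()   # the output buffer can now be reused
```

For example, dispatching two particles bound for the same volume fills one input buffer and moves it to the `ready` list, while a single particle bound for another volume stays in a partially filled buffer.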

Design under study

[Figure: design under study, buffer lists. Labels: list of logical volumes, logical volume lv, list of baskets for lv, input particle list (p array), output particle list (p array), empty buffer list, full buffer list, ready buffer list.]

Processing flow IV

Note an important point:
• The LV basket structure has input and output particle buffers, and hits and history buffers
• Input and output particle buffers are
  • Multi-event
  • Volatile: they get emptied and filled during the transport of a single event
• Hits and history buffers are
  • Per event
  • Permanent during the transport of a single event
• A basket of an LV can be handled by different threads successively, each one with new input and output buffers
  • …but all these threads will add to the hits and history data structures until the event is flushed
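
The two lifetimes can be made concrete with a small data-structure sketch. The class and method names (`LVBasket`, `attach_buffers`, `flush_event`) are hypothetical, and the event structure itself is still to be designed according to the slides, so this only illustrates the split between volatile and per-event storage.

```python
from collections import defaultdict


class LVBasket:
    def __init__(self):
        # Volatile, multi-event: replaced each time a thread picks up the basket.
        self.input_buf = []
        self.output_buf = []
        # Per event: kept until that event is flushed.
        self.hits = defaultdict(list)      # event_id -> hits
        self.history = defaultdict(list)   # event_id -> history records

    def attach_buffers(self, input_buf, output_buf):
        """Give the basket a fresh pair of particle buffers."""
        self.input_buf, self.output_buf = input_buf, output_buf

    def flush_event(self, event_id):
        """Hand over and forget everything stored for one finished event."""
        return self.hits.pop(event_id, []), self.history.pop(event_id, [])
```

Swapping the particle buffers never touches `hits` or `history`, which is why successive threads can keep adding to the same per-event records.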

Processing flow V

• When an event is finished, the digitizer thread kicks in, scans all the hits in all the baskets of all the LVs and digitises them, inserting them in the LV event->digit structure
• When this is over, the event is built into the event structure (to be designed!) by the event builder thread
• After that, the history for this event is assembled by the same thread
• Then the event is output
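
As a rough illustration of the digitizer scan, the sketch below assumes that each basket stores its hits as a dictionary from event id to a list of energy deposits, and that "digitising" simply means summing the deposits per logical volume. Both choices are placeholders; the real digitisation is detector-specific.

```python
def digitise_event(event_id, baskets_by_lv):
    """Scan all baskets of all LVs and build the digits of one event.

    baskets_by_lv maps an LV name to a list of baskets; each basket is
    assumed to be a dict {event_id: [energy deposits]}.
    """
    digits = {}
    for lv, baskets in baskets_by_lv.items():
        deposits = [e for basket in baskets for e in basket.get(event_id, [])]
        if deposits:
            # Hypothetical digitisation: total deposit per (LV, event).
            digits[(lv, event_id)] = sum(deposits)
    return digits
```

The result is keyed by `(lv, event_id)`, mirroring the "digits for lv and event ev" structure in the design diagram.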

Questions?

• How many dispatcher, digitizer and event-builder threads?
  • Difficult to say, we need some more quantitative design work
  • Measurements with G4 simulations could help
• Transport thread numbers will have to adapt to the size of the simulation and of the detector
  • In ATLAS, for instance, 50% of the time is spent in 0.75% of the volumes
  • Threads could be distributed proportionally to the time spent in the different LVs
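
One way to distribute threads proportionally to the time spent per LV is sketched below. The largest-remainder rounding is a choice made for this example, not something the slides prescribe, and the timing figures in the usage note are invented.

```python
def allocate_threads(time_by_lv, n_threads):
    """Split n_threads across LVs in proportion to the time spent in each."""
    total = sum(time_by_lv.values())
    quotas = {lv: n_threads * t / total for lv, t in time_by_lv.items()}
    # Start from the integer part of each quota...
    alloc = {lv: int(q) for lv, q in quotas.items()}
    leftover = n_threads - sum(alloc.values())
    # ...then hand the remaining threads to the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda lv: quotas[lv] - alloc[lv],
                          reverse=True)
    for lv in by_remainder[:leftover]:
        alloc[lv] += 1
    return alloc
```

With made-up timings of 50, 30 and 20 units for three volumes and 8 threads, the quotas are 4.0, 2.4 and 1.6, so the volumes receive 4, 2 and 2 threads. A volume with a very small share can end up with no thread at all, so a real scheduler would need a fallback for such volumes.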

Simple observation: HEP transport is mostly local!

• Locality not exploited by the classical transportation approach

• Existing code very inefficient (0.6-0.8 IPC)

• Cache misses due to fragmented code

50 per cent of the time spent in 50/7100 volumes

Questions?

• What about memory?
  • Fortunately we do not have “that many” LVs

Detector   Physical volumes   Logical volumes
ALICE             4,354,735             4,764
ATLAS            29,046,966             7,143
CMS               1,166,318             1,537
LHCb             18,491,756               709
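
To see why the number of logical volumes is the relevant one for memory, the short calculation below computes how many physical volumes correspond, on average, to one logical volume in each detector. The figures are copied from the table; the function name is made up for the example.

```python
# (physical volumes, logical volumes), copied from the table above
VOLUMES = {
    "ALICE": (4_354_735, 4_764),
    "ATLAS": (29_046_966, 7_143),
    "CMS": (1_166_318, 1_537),
    "LHCb": (18_491_756, 709),
}


def physical_per_logical(volumes):
    """Average number of physical volumes per logical volume."""
    return {det: physical / logical
            for det, (physical, logical) in volumes.items()}


if __name__ == "__main__":
    for det, ratio in physical_per_logical(VOLUMES).items():
        print(f"{det}: about {ratio:.0f} physical volumes per logical volume")
```

Since baskets are attached to logical volumes, the number of basket structures scales with the right-hand column (hundreds to thousands) rather than with the millions of physical volumes.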

Grand strategy

[Figure: grand strategy. Labels: simulation job, create vectors, basic algorithms, use vectors. Annotations: “We are concentrating here” and “But we should look also here”.]

Short term tasks

• Continue the design work – essential before any more substantial implementation
  • This is the most important task at the moment
  • We have to evaluate the potential bottlenecks before starting the implementation
• Implement the new design and evaluate it against the first
• Demonstrate speedup of some chosen geometry routines
  • Both on x86 CPUs and GPUs
• Demonstrate speedup of some chosen physics methods
  • Particularly in the EM domain

Possible timeline

• Summer 2013
  • Implement a prototype according to the present design
  • Get essential numbers from G4 (to be defined!)
    • Total number of particles in a shower, profile of the development of a shower in terms of multiplicity, locality of transport, etc.
  • Vectorize, GPU-ize, Phi-ize at least three geometry classes (simple, intermediate, hard)
  • Vectorize, GPU-ize, Phi-ize at least a couple of simplified EM methods (from G4?)
• Fall 2013
  • Interface the methods above to the prototype to realise a first prototype of vectorized transport

Thank you!