The High Performance Simulation Project

Status and short term plans

17th April 2013

Federico Carminati

SFT: SoFTware Development for Experiments

Where are we now?

Present status

• Several investigations of possible alternatives for “extremely parallel – no lock” transport
• Not much code written, several blackboards full
• Some investigation on a simplified but fully vectorized model to prove vectorization gain
• New design in preparation

Major points under discussion

• How to minimise locks and maximise local handling of particles
• How to handle hit and digit structures
• How to preserve the history of the particles
  • This point seems more difficult at the moment and it requires more design
• What is the possible speedup obtained by micro-parallelisation
• What are the bottlenecks and opportunities with parallel I/O

Current design

[Diagram. Labels: Dispatcher thread; Thread local; Logical Volume; Logical Volume basket; p array; p* array; Transport; Output particle store.]

Features

Pros

• Good parallel performance but…
• Easy recording of particle history
• Limited data movement

Cons

• Possibly limited scalability with a large number of cores
• Non-locality of particles in memory
• Difficult to introduce hits and digits while maintaining locality

Design under study

[Diagram. Labels: List of logical volumes; Logical Volume lv; List of baskets for lv; Input particle list (p array); Output particle list (p array); Hits; History; Sensitive volumes; List of active events for lv; Active event list; Event ev; Digits for lv and event ev.]

Design under study

[Diagram repeated from the previous slide, with added labels: Transport thread; Digitizer thread; Event builder thread; Events.]

Design under study

[Diagram repeated from the previous slide, with added annotations: “Continuously rotated”; “Flushed at the end of event”.]

Features

Pros

• Excellent potential locality
• Easy to introduce hits and digits

Cons

• One more copy (but it is done in parallel)
• More difficult to preserve particle history (it is non-local!) and introduce particle pruning

Processing flow I

• The transport thread takes particles from the input buffer and transports them till they stop, interact or exit from the volume
  • At this point they are inserted in the output particle buffer for further processing
  • If the LV is a sensitive detector, hits are generated and stored per LV basket
  • An LV basket history record is kept (we have no idea how for the moment, we need more blackboard work!)
• Input and output particle buffers are fixed-size structures, which can however evolve (be optimised) during simulation
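
The transport step described in this slide can be sketched as below. This is only an illustration under stated assumptions, not the project's code: the names `Particle`, `Basket`, `Fate`, `toy_step` and `transport_basket` are hypothetical, the physics is replaced by a placeholder that picks a fate from the energy alone, and the history record is left out because the slide itself says its design is still open.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Fate(Enum):
    STOPPED = "stopped"        # particle came to rest
    INTERACTED = "interacted"  # particle underwent an interaction
    EXITED = "exited"          # particle left the logical volume


@dataclass
class Particle:
    event: int
    energy: float
    fate: Optional[Fate] = None


@dataclass
class Basket:
    """Hypothetical LV basket: fixed-size particle buffers plus hits."""
    capacity: int
    sensitive: bool
    input_buffer: list = field(default_factory=list)
    output_buffer: list = field(default_factory=list)
    hits: list = field(default_factory=list)


def toy_step(p):
    """Placeholder for real physics: decide a fate from the energy alone."""
    if p.energy < 1.0:
        return Fate.STOPPED
    if p.energy < 10.0:
        return Fate.INTERACTED
    return Fate.EXITED


def transport_basket(basket, step=toy_step):
    """Transport particles until the input is exhausted or the output is full."""
    while basket.input_buffer and len(basket.output_buffer) < basket.capacity:
        p = basket.input_buffer.pop()
        p.fate = step(p)
        if basket.sensitive:
            # Hits are stored per LV basket, tagged with the event number.
            basket.hits.append((p.event, p.energy))
        basket.output_buffer.append(p)


basket = Basket(capacity=4, sensitive=True)
basket.input_buffer = [Particle(0, 0.5), Particle(0, 5.0), Particle(1, 50.0)]
transport_basket(basket)
```

Note that particles of different events share the same buffers, which is why each hit carries its event number.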

Design under study

[Diagram repeated from the previous slide, with two buffers flagged: “✔ full!” and “✗ empty!”.]

Processing flow II

• When an input particle buffer is exhausted
  • It is marked as such by the transport thread
  • No lock if it is kept assigned to the LV basket, but possible memory waste
  • Can be passed to a queue of “used baskets”, but this implies a lock
  • In case of a flag, the dispatcher thread has to scan all LVs -> all baskets -> all output buffers to know which ones are used, but this can be optimized
• Used buffers are scanned by the dispatcher thread, which updates a global track counter per event
  • -1 for each stopped “dead” particle
• And then they are declared “empty” to be reused
• The transport thread picks up another “ready” basket
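
A minimal sketch of the flag variant of this bookkeeping follows. It assumes, purely for illustration, that an exhausted buffer still holds one `(event_id, is_dead)` entry per processed particle. The names `State`, `Buffer` and `scan_used_buffers` are hypothetical, and the code is single-threaded, so it shows the logic only and not the synchronisation.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    EMPTY = 0   # free for reuse
    READY = 1   # filled, waiting for a transport thread
    USED = 2    # exhausted by a transport thread, not yet scanned


@dataclass
class Buffer:
    state: State = State.EMPTY
    # Each entry is (event_id, is_dead); this layout is an assumption.
    particles: list = field(default_factory=list)


def scan_used_buffers(buffers, tracks_per_event):
    """Dispatcher pass over all buffers, acting only on those flagged USED."""
    recycled = 0
    for buf in buffers:
        if buf.state is not State.USED:
            continue
        for event_id, is_dead in buf.particles:
            if is_dead:
                tracks_per_event[event_id] -= 1  # -1 per stopped particle
        buf.particles.clear()
        buf.state = State.EMPTY  # declared empty, can be reused
        recycled += 1
    return recycled


tracks_per_event = Counter({0: 3, 1: 2})
buffers = [
    Buffer(State.USED, [(0, True), (0, False), (1, True)]),
    Buffer(State.READY, [(1, False)]),
]
recycled = scan_used_buffers(buffers, tracks_per_event)
```

The scan touches every buffer on each pass, which is the cost the slide attributes to the flag approach. A queue of used buffers would avoid the scan at the price of a lock.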

Processing flow III

• When an output particle buffer is full, it is marked as such
  • Again queue insertion or just a flag
  • In case of a flag, the dispatcher thread has to scan all LVs -> all baskets -> all output buffers to know which ones are full, but this can be optimized
• The transport thread picks another empty output buffer
• The dispatcher thread copies particles from the full output particle buffer to LV-specific input particle buffers
  • Increasing the global particle event counter
• When an input particle buffer is full, the dispatcher declares it “ready to be transported”
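
The dispatcher's copy step can be sketched as below. The function `dispatch_full_buffer`, the constant `CAPACITY`, the tuple layout `(event_id, next_lv)` and the volume names are all hypothetical. The sketch also assumes that the per-event counter is incremented once per copied particle, which the slide does not spell out.

```python
from collections import Counter, defaultdict

CAPACITY = 2  # hypothetical fixed size of an input buffer


def dispatch_full_buffer(output_buffer, input_buffers, ready, in_flight):
    """Copy particles to the input buffer of their next logical volume.

    output_buffer: list of (event_id, next_lv) taken from a full buffer
    input_buffers: dict lv -> input buffer currently being filled
    ready:         list of (lv, buffer) declared ready to be transported
    in_flight:     Counter of particles per event
    """
    for event_id, next_lv in output_buffer:
        buf = input_buffers[next_lv]
        buf.append((event_id, next_lv))
        in_flight[event_id] += 1
        if len(buf) == CAPACITY:
            ready.append((next_lv, buf))
            input_buffers[next_lv] = []  # start a fresh buffer for this LV
    output_buffer.clear()  # the output buffer can now be reused


input_buffers = defaultdict(list)
ready = []
in_flight = Counter()
full_output = [(0, "calo"), (0, "tracker"), (1, "calo")]
dispatch_full_buffer(full_output, input_buffers, ready, in_flight)
```

This copy is the "one more copy" listed among the cons of the design under study. In exchange, each input buffer holds only particles located in one logical volume.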

Design under study

[Diagram. Labels: List of logical volumes; Logical Volume lv; List of baskets for lv; Input particle list (p array); Output particle list (p array); Empty buffer list; Full buffer list; Ready buffer list.]

Processing flow IV

• Note an important point
  • The LV basket structure has input and output particle buffers and hits and history buffers
• Input and output particle buffers are
  • Multi-event
  • Volatile, they get emptied and filled during transport of a single event
• Hits and history buffers are
  • Per event
  • Permanent during the transport of a single event
• A basket of an LV can be handled by different threads successively, each one with new input and output buffers
  • …but all these threads will add to the hits and history data structure till the event is flushed
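
The two lifetimes described here can be made concrete with a small data-structure sketch. The class `LVBasket` and its methods are hypothetical names, and keying the hits and history by event number is an assumption made for illustration. The real layout is still under design.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LVBasket:
    """Hypothetical layout of one basket of a logical volume."""
    # Volatile, multi-event: replaced whenever a new thread takes over.
    input_particles: list = field(default_factory=list)
    output_particles: list = field(default_factory=list)
    # Per event: kept until that event is flushed.
    hits: dict = field(default_factory=lambda: defaultdict(list))
    history: dict = field(default_factory=lambda: defaultdict(list))

    def swap_buffers(self, new_input, new_output):
        """A new thread takes over with fresh particle buffers."""
        self.input_particles = new_input
        self.output_particles = new_output

    def flush_event(self, event_id):
        """Hand over and drop the per-event data once the event is done."""
        return self.hits.pop(event_id, []), self.history.pop(event_id, [])


basket = LVBasket()
basket.hits[7].append("hit from thread A")
basket.swap_buffers([], [])  # thread B takes over the same basket
basket.hits[7].append("hit from thread B")
flushed_hits, flushed_history = basket.flush_event(7)
```

Swapping the particle buffers leaves the hits untouched, so both threads contribute to the same per-event record until `flush_event` is called.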

Processing flow V

• When an event is finished, the digitizer thread kicks in, scans all the hits in all the baskets of all the LVs and digitises them, inserting them in the LV event->digit structure
• When this is over, the event is built into the event structure (to be designed!) by the event builder thread
• After that, the history for this event is assembled by the same thread
• Then the event is output
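
The order of these end-of-event steps can be sketched as a single function. Everything concrete here is an assumption for illustration: the function names, the dictionary-based basket and event layouts, and the toy digitizer that just sums the deposited energy per logical volume. The slide states that the event structure is still to be designed.

```python
def digitize(hits):
    """Toy digitizer: sum the energy deposited in each logical volume."""
    digits = {}
    for lv, energy in hits:
        digits[lv] = digits.get(lv, 0.0) + energy
    return digits


def finish_event(event_id, baskets):
    """Run the end-of-event steps in the order given on the slide.

    baskets: list of dicts with per-event 'hits' and 'history' entries.
    """
    # 1. Digitizer: scan the hits of this event in all baskets.
    hits = [h for b in baskets for h in b["hits"].get(event_id, [])]
    digits = digitize(hits)
    # 2. Event builder: assemble the event structure (hypothetical layout).
    event = {"id": event_id, "digits": digits}
    # 3. Same thread: assemble the history for this event.
    event["history"] = [
        rec for b in baskets for rec in b["history"].get(event_id, [])
    ]
    # 4. Output: here the event is simply returned to the caller.
    return event


baskets = [
    {"hits": {0: [("calo", 1.5), ("calo", 0.5)]},
     "history": {0: ["e- -> e- gamma"]}},
    {"hits": {0: [("tracker", 0.25)], 1: [("calo", 9.0)]},
     "history": {}},
]
event = finish_event(0, baskets)
```

Hits belonging to event 1 are left in place, since several events are in flight at the same time.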

Questions?

• How many dispatcher, digitizer and event-builder threads?
  • Difficult to say, we need some more quantitative design work
  • Measurements with G4 simulations could help
• Transport thread numbers will have to adapt to the size of the simulation and of the detector
  • In ATLAS for instance 50% of the time is spent in 0.75% of the volumes
  • Threads could be distributed proportionally to the time spent in the different LVs
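
One possible way to distribute threads proportionally is sketched below. The function `allocate_threads` is a hypothetical name, the largest-remainder rounding is a choice made here and not something the slides prescribe, and the time profile is made up for the example.

```python
def allocate_threads(time_per_lv, n_threads):
    """Split n_threads across LVs in proportion to the time spent in each.

    Uses largest-remainder rounding so the shares add up to n_threads.
    """
    total = sum(time_per_lv.values())
    exact = {lv: n_threads * t / total for lv, t in time_per_lv.items()}
    shares = {lv: int(x) for lv, x in exact.items()}
    leftover = n_threads - sum(shares.values())
    # Give the remaining threads to the LVs with the largest fractional part.
    by_remainder = sorted(exact, key=lambda lv: exact[lv] - shares[lv],
                          reverse=True)
    for lv in by_remainder[:leftover]:
        shares[lv] += 1
    return shares


# Made-up profile: one hot volume takes half of the time.
profile = {"calorimeter_cell": 50.0, "tracker_layer": 30.0, "beam_pipe": 20.0}
shares = allocate_threads(profile, 8)
```

With a strictly proportional rule, volumes that take very little time can end up with no thread at all, so a real scheduler would need some way to serve them as well.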

Simple observation: HEP transport is mostly local!

• Locality not exploited by the classical transportation approach

• Existing code very inefficient (0.6-0.8 IPC)

• Cache misses due to fragmented code

50 per cent of the time spent in 50/7100 volumes

Questions?

• What about memory?
  • Fortunately we do not have “that many” LVs

Detector   Physical volumes   Logical volumes
ALICE             4,354,735             4,764
ATLAS            29,046,966             7,143
CMS               1,166,318             1,537
LHCb             18,491,756               709
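
To put these counts in perspective, the short calculation below divides the number of physical volumes by the number of logical volumes for each detector. The figures are copied from the table. The interpretation is an assumption: since baskets are kept per logical volume, the memory needed for them should scale with the second column and not the first.

```python
# Volume counts copied from the table: (physical, logical).
volumes = {
    "ALICE": (4_354_735, 4_764),
    "ATLAS": (29_046_966, 7_143),
    "CMS": (1_166_318, 1_537),
    "LHCb": (18_491_756, 709),
}

# Average number of physical volumes per logical volume.
placements_per_lv = {
    detector: physical / logical
    for detector, (physical, logical) in volumes.items()
}

for detector, ratio in placements_per_lv.items():
    print(f"{detector:6s} {ratio:10.0f} physical volumes per logical volume")
```

The ratio ranges from several hundred (CMS, ALICE) to tens of thousands (LHCb), so structures kept per logical volume are far fewer than structures kept per physical volume would be.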

Grand strategy

[Diagram. Labels: Simulation job; Create vectors; Basic algorithms; Use vectors. Annotations: “We are concentrating here”; “But we should look also here”.]

Short term tasks

• Continue the design work – essential before any more substantial implementation
  • This is the most important task at the moment
  • We have to evaluate the potential bottlenecks before starting the implementation
• Implement the new design and evaluate it against the first
• Demonstrate speedup of some chosen geometry routines
  • Both on x86 CPUs and GPUs
• Demonstrate speedup of some chosen physics methods
  • Particularly in the EM domain

Possible timeline

• Summer 2013
  • Implement a prototype according to the present design
  • Get essential numbers from G4 (to be defined!)
    • Total particles in a shower, profile of development of a shower in terms of multiplicity, locality of transport, etc.
  • Vectorize, GPU-ize, Phi-ize at least three geometry classes (simple, intermediate, hard)
  • Vectorize, GPU-ize, Phi-ize at least a couple of EM simplified methods (from G4?)
• Fall 2013
  • Interface the methods above to the prototype to realise a first prototype of vectorized transport

Thank you!