“ Status of GPU trigger ”
description
Transcript of “ Status of GPU trigger ”
![Page 1: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/1.jpg)
“Status of GPU trigger”
Gianluca Lamanna (INFN)On behalf of GAP collaboration
TDAQ, Liverpool, 28.8.2013
![Page 2: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/2.jpg)
Two problems to use GPU in the trigger
Computing power: Is the GPU fast enough to take trigger decision at tens of MHz events rate?
Elena -> Use with the RICH Jacopo -> Use with the Straw
Latency: Is the GPU latency per event small enough to cope with the tiny latency of a low level trigger system in HEP? Is the latency stable enough for usage in synchronous trigger systems? (Felice, Roberto, me + Alessandro, Piero, Andrea, etc.)
2
![Page 3: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/3.jpg)
GPU processing
3
NIC GPU
chipset
CPU RAM
PCI express
VRAM• Example: packet with
1404 B (20 events in NA62 RICH application)
• T=0
0 us
![Page 4: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/4.jpg)
GPU processing
4
NIC GPU
chipset
CPU RAM
PCI express
VRAM
0 10 us
![Page 5: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/5.jpg)
GPU processing
5
NIC GPU
chipset
CPU RAM
PCI express
VRAM
0 10 99 us
![Page 6: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/6.jpg)
GPU processing
6
NIC GPU
chipset
CPU RAM
PCI express
VRAM
0 10 99
104
us
![Page 7: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/7.jpg)
GPU processing
7
NIC GPU
chipset
CPU RAM
PCI express
VRAM
0 10 99
104
134 us
![Page 8: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/8.jpg)
GPU processing
8
NIC GPU
chipset
CPU RAM
PCI express
VRAM
0 10 99
104
134
139
us
![Page 9: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/9.jpg)
GPU processing
9
NIC GPU
chipset
CPU RAM
PCI express
VRAM
0 10 99
104
134
139
us
The latency due to the data transfer from data source to the system is more important than the latency due to the computing on the GPUIt scales almost linearly (apart from the overheads) with the data size while the latency due to the computing can be hidden exploiting the huge resources.Communication latency fluctuations quite big (~50%).
![Page 10: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/10.jpg)
Two approaches: PF_RING driver
Fast packet capturing from standard NIC (PF_RING from ntop) The data are written directly on the user space memory. Skip redundant copy in the kernel memory space.Both for 1 Gb/s and 10 Gb/sLatency fluctuations could be reduced using RTOS ( Mauro P.).Host/kernel parallelized with three concurrent streaming.
10
NIC GPU
chipset
CPU RAM
PCI express
VRAM
![Page 11: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/11.jpg)
Tests
Events simulated in TEL62Grouped in MTPStart signal rises with the first event in the MTPFirst stop: arriving of the packetBuffering in the PC RAM (buffer depth can be changed (GMTP))Second stop: after execution on GPU (single ring reconstruction kernel)The precision of the method as been evaluated as better than 1us
11
GPUTEL62
NIC PC1 Gb/s
Scope
lpt
Dual processor PC:XEON E5-2620 2GhzI350T2 Gigabit card32 GBGPU K20c (2496 cores) PCIe v2 x16
Stop 1
Stop 2
Start
![Page 12: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/12.jpg)
Data transfer time
Using PF_RING the latency (and the fluctuations) due to packet handling in the NIC are highly reduced. 12
~80B/event
![Page 13: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/13.jpg)
GPU computing time
It’s better to accumulate a big number of events (GMTP) in order to exploit the computing cores available in the GPU.
13
![Page 14: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/14.jpg)
Total latency
Total latency: Start on first GMTP event
14
256 GMTP
Total latency: Start on last GMTP event
256 GMTP
NA62 latency
Latency timeout not implemented yet.
![Page 15: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/15.jpg)
Two approaches: NANET
NANET based on the Apenet+ card (collaboration with the Apenet group of Rome INFN).Additional UDP protocol offloadFirst not-NVIDIA device having P2P connection with a GPU.
Joint development with NVIDIA.
Preliminary version implemented on Altera DEV4 dev board with PCIX8 Gen2 (Gen3 under study).Modular structure: the link can be replaced (1 Gb/s, 10 Gb/s, SLINK, GBT,…)
15
NANET
![Page 16: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/16.jpg)
NANET data trasfer performances
Test with system loopback (data produced on the same PC and sent through standard NIC to NANET).The 50us plateau in the latency is mainly due to the NIC used for data transmission.Full bandwidth in 1 Gb/s.10Gb/s version will be ready soon.
16
![Page 17: “ Status of GPU trigger ”](https://reader034.fdocuments.in/reader034/viewer/2022042522/568154c1550346895dc2c93b/html5/thumbnails/17.jpg)
Best performance with apenet link
Faster than Infiniband for GPU-GPU transmission.Latency at level of 8us with Apenet Link.Link implemented on high-speed daughter card: in principle several types of links can be implemented (GBT, PCI express (Gianmaria), Infiniband,…)
17