Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng,...
-
date post
15-Jan-2016 -
Category
Documents
-
view
219 -
download
0
Transcript of Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng,...
![Page 1: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/1.jpg)
1
Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting
Author:Yuhao Zhu , Yangdong Deng , Yubei Chen
Publisher:DAC'11, June 5-10, 2011, San Diego, California, USA
Presenter:Ye-Zhi Chen
Date:2011/10/5
![Page 2: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/2.jpg)
2
Introduction
Current GPU architectures are still under two serious limitations for routing
processing :
1. GPU computing requires the packets to be copied from CPU’s main
memory to GPU’s global memory
2. the batch based GPU processing could not guarantee processing QoS for
an individual packet (throughput vs delay)
A novel solution : Hermes, an integrated CPU/GPU, shared memory
microarchitecture that is enhanced with an adaptive warp issuing
mechanism for IP packet processing
![Page 3: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/3.jpg)
3
Hermes
Hermes : a heterogeneous microarchitecture with CPU and GPU integrated
on a single chip memory .The copy overhead can thus be removed by
sharing a common memory system between CPU and GPU
![Page 4: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/4.jpg)
4
Hermes
The warp issuing mechanism problem
Warp – the basic unit of job scheduling on GPUs
The warp issuing mechanism of Hermes is responsible for assigning
parallel tasks onto shader cores for further intra-core scheduling.
Best effort strategy – the number of warps that can be issued in one round
is only constrained by the number of available warps as well as hardware
resources such as per core register and shared memory size.
adaptive warp issuing mechanism – adapts to the arrival pattern of
network packets and maintains a good balance between overall throughput
and worst-case per-packet delay.
![Page 5: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/5.jpg)
5
Hermes The packets are received by network interface cards (NICs) and then
copied to the shared memory via DMA transfers. The CPU is thus able to
keep track of the number of arrived packets. Accordingly, the CPU is
responsible for notifying the GPU to fetch packets for processing
When CPU decides it is appropriate to report the availability of packets, it
creates a new FIFO entry with its value as the number of packets ready for
further processing, assuming the task FIFO is not full
the GPU is constantly monitoring the FIFO and making decisions on
fetching a proper number of packets.
The minimum granularity , i.e., number of packets, of one round of
fetching by GPU should be at least equal to the number of threads in one
warp
![Page 6: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/6.jpg)
HermesHow frequently the CPU should update the task FIFO?
transferring a packet from NIC to the shared memory involves a book-
keeping overhead
too frequently updating of the task FIFO also complicates GPU fetching
due to the restriction of finest fetching granularity
too large an interval between two consecutive updates increases average
packet delay
![Page 7: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/7.jpg)
HermesSolution
set the minimum data transfer granularity to be the size of a warp, i.e., 32
packets (from NIC to shared memory)
if there are not enough packets arriving in a given interval, these packets
should still be fetched and processed by GPU, interval is chosen to be 2
warp arriving times
A round-robin issuing strategy is employed to evenly distribute the
workload among each shader core
![Page 8: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/8.jpg)
HermesGuaranteed In-Order Warp Commit
Thread warps running on one shader core may finish in an arbitrary order,
not to mention warps running on different shader cores
TCP - its header includes extra areas to enable retransmission and
reassembly
UDP – it does require in-order processing
![Page 9: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/9.jpg)
HermesDelay Commit Queue
The key idea is to allow out-of-order warps execution but enforce in-order
commitment.
DCQ holds the IDs of those warps that have finished but not committed yet
Every time a warp is about to be issued onto one shader core, and the DCQ
is not full, a new entry is allocated.
Only could a warp be committed when all warps arrived earlier have been
finished
![Page 10: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/10.jpg)
Hermes
![Page 11: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/11.jpg)
HermesAPI modification
Memory copy API is not necessary
Add two memory management API : RMalloc and Rfree
Abandon thread blocks , directly organized threads in warps
packetIdx
![Page 12: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/12.jpg)
Experiment
Configuration : DPI : bloom filter
String rule set :Snort
Packet classification : linear search
CheckIPHeader, DecTTL, and Fragmentation: RouterBench
![Page 13: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/13.jpg)
Experiment
Three QoS metrics:
Throughput: the total number of bits that can be processed and transferred
during a given time period
Delay : 1.queuing delay : The waiting time which the packet arriving rate
exceeds the processing throughput of the system, the succeeding packets
have to wait before shader cores are available.
2.service delay : the time for a packet to receive complete processing by
a shader core
Delay variance : The delay disparity of different packets measured by
interquartile range
![Page 14: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/14.jpg)
Experiment
![Page 15: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/15.jpg)
Experiment
![Page 16: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/16.jpg)
Experiment
![Page 17: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/17.jpg)
Experiment
![Page 18: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/18.jpg)
Experiment
![Page 19: Hermes: An Integrated CPU/GPU Microarchitecture for IPRouting Author: Yuhao Zhu, Yangdong Deng, Yubei Chen Publisher: DAC'11, June 5-10, 2011, San Diego,](https://reader035.fdocuments.in/reader035/viewer/2022062322/56649d2e5503460f94a05aeb/html5/thumbnails/19.jpg)
Experiment