Thesis Defense
![Page 1: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/1.jpg)
Multi-Threaded End-to-End Applications on Network Processors
Michael Watts
January 26th, 2006
![Page 2: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/2.jpg)
The Stage
• Demand to move applications from end nodes to network edge
• Increased processing power at edge makes this possible
[Figure: two end nodes, two edge devices, the Internet]
![Page 3: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/3.jpg)
Example
• All communication between corporate offices secured at Internet edge
[Figure: Corporate Office West and Corporate Office East connected across the Internet]
• End nodes responsible for establishing secure communication
![Page 4: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/4.jpg)
Applications at Network Edge
• Provide service to end nodes
  – Security
  – Quality of Service
  – Intrusion detection
  – Load balancing
• Kernels carry out single task
  – Such as MD5, URL-based switching, and AES
• End-to-end applications combine multiple kernels
![Page 5: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/5.jpg)
Intelligent Devices
• High-level applications at network edge
  – Demand processing power
  – Demand flexibility of general-purpose processors
• Application-Specific Integrated Circuit (ASIC)
  – Speed without flexibility
  – Customized for particular use
• Network Processing Unit (NPU)
  – Programmable flexibility
  – Performance through parallelization
![Page 6: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/6.jpg)
Benchmarks
• Increasing complexity of next-generation applications
  – More demand on NPUs
  – Benchmark applications used to test performance of NPUs
• Current network benchmarks
  – Single-threaded kernels
  – Insufficient for NPU multi-processor architecture
![Page 7: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/7.jpg)
Contributions
• Multi-threaded end-to-end application benchmark suite
• Generic NPU simulator
• Analysis shows kernel performance inaccurate indicator of end-to-end application performance
![Page 8: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/8.jpg)
Overview
1. Network Processors and Simulators
2. The NPU Simulator
3. Benchmark Applications
4. Tests and Results
5. Conclusion
6. Future Work
![Page 9: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/9.jpg)
Network Processors
• NPU
  – Programmable packet processing device
  – Over 30 self-identified NPUs
• NPU Architecture
  – Dedicated co-processors
  – High-speed network interfaces
  – Multiple processing units
    • Pipelined
    • Symmetric
![Page 10: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/10.jpg)
Pipelined vs. Symmetric
• Pipelined
• Symmetric
[Figure: pipelined layout (one packet, processing units) and symmetric layout (three packets, processing units)]
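The difference between the two layouts can be sketched as a mapping from a packet and a processing stage to the unit that does the work. This is only an illustration: the function names are hypothetical, and spreading packets over symmetric units by packet id is an assumed policy, not something the slides state.

```c
#include <assert.h>

/* Pipelined: each processing unit owns one stage, so every packet
 * visits unit 0, then unit 1, and so on. */
int pipelined_unit(int packet, int stage, int num_units)
{
    (void)packet;  /* the packet id does not affect the choice */
    assert(stage >= 0 && stage < num_units);
    return stage;
}

/* Symmetric: one unit runs every stage for a given packet. Packets
 * are spread over units by id here, which is an assumed policy. */
int symmetric_unit(int packet, int stage, int num_units)
{
    (void)stage;   /* all stages of a packet stay on the same unit */
    assert(packet >= 0 && num_units > 0);
    return packet % num_units;
}
```

In the pipelined case a packet moves between units as it advances; in the symmetric case the packet stays put and the units work on different packets side by side.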
![Page 11: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/11.jpg)
Intel IXP1200
• Symmetric architecture
• Processors (266 MHz, 32-bit RISC)
  – 1 x StrongARM controller
    • L1 and L2 cache
  – 6 x microengines (ME)
    • 4 hardware-supported threads each
    • No cache, lots of registers
• Shared Memory
  – 8 MBytes SRAM
  – 256 MBytes SDRAM
  – StrongARM and MEs share memory bus
  – No built-in memory management
![Page 12: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/12.jpg)
Intel IXP1200 Architecture
![Page 13: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/13.jpg)
NPU Simulators
• Purpose
  – Execute programs on foreign platform
  – Provide performance statistics
• SimpleScalar
  – Cycle-accurate hardware simulation
  – Architecture similar to MIPS
  – Modified GNU GCC generates binaries
![Page 14: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/14.jpg)
PacketBench
• Developed at University of Massachusetts
• Uses SimpleScalar
• Provides API for basic NPU functions
• NPU platform independence
• Drawback: no support for multiprocessor architectures
![Page 15: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/15.jpg)
Benchmarks
• Applications designed to assess performance characteristics of a single platform or differences between platforms
  – Synthetic
    • Mimic a particular type of workload
  – Application
    • Real-world applications
• Our focus: application benchmarks for the domain of NPUs
![Page 16: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/16.jpg)
Benchmark Suites
• MiBench
  – Target: embedded microprocessors
  – Including Rijndael encryption (AES)
• NetBench
  – Target: NPUs
  – Including Message-Digest 5 (MD5) and URL-based switching
• Source available in C
• Limitation: single-threaded
![Page 17: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/17.jpg)
The Simulator
• Modified existing multiprocessor simulator
• Built on SimpleScalar
• Modeled after Intel IXP1200
  – Modeled processing units, memory, and cache structure
  – Processors share memory bus
  – SRAM reserved for instruction stacks
| Parameter | StrongARM | Microengines |
| --- | --- | --- |
| Scheduling | Out-of-order | In-order |
| Width | 1 (single-issue) | 1 (single-issue) |
| L1 I Cache Size | 16 KByte | SRAM (0 penalty) |
| L1 D Cache Size | 8 KByte | 1 KByte (replace registers) |
![Page 18: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/18.jpg)
Methods of Use
• Simulator compiles on Linux using GCC
• Takes SimpleScalar binary as input
sim3ixp1200 [-h] [sim-args] program [program-args]
• Threads argument controls number of microengine threads (0-24)
• 6 microengines allotted threads using round-robin
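The round-robin allotment can be sketched as below. This is a minimal illustration rather than the simulator's source: `NUM_ME`, `THREADS_PER_ME`, and `allot_threads` are hypothetical names, and the assumption is that thread i is placed on microengine i mod 6.

```c
#include <assert.h>

#define NUM_ME 6          /* microengines on the modeled IXP1200 */
#define THREADS_PER_ME 4  /* hardware threads per microengine */

/* Fill per_me[] with the number of threads each microengine receives
 * when `threads` ME threads are dealt out round-robin.
 * Hypothetical helper, not part of the simulator. */
void allot_threads(int threads, int per_me[NUM_ME])
{
    assert(threads >= 0 && threads <= NUM_ME * THREADS_PER_ME);
    for (int me = 0; me < NUM_ME; me++)
        per_me[me] = 0;
    for (int t = 0; t < threads; t++)
        per_me[t % NUM_ME]++;
}
```

Under this assumption the first six threads each get a microengine to themselves, and the 7th, 13th, and 19th threads are the first to put a second, third, and fourth thread on a microengine.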
![Page 19: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/19.jpg)
Application Development
• Developed in C
• Compiled using GCC 2.7.2.3 cross-compiler
  – Linux/x86 to SimpleScalar
• No POSIX thread support, same binary executed by each thread
• No memory management
• Multi-threading
  – getcpu()
  – barrier()
  – ncpus
![Page 20: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/20.jpg)
Example Code
// common initialization…
barrier();
int thread_id = getcpu();
if (thread_id == 0) {
    // StrongARM
}
else if (thread_id == 1) {
    // 1st microengine thread
}
else {
    // 2 – ncpu microengine threads
}
![Page 21: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/21.jpg)
Benchmark Applications
• Modified 3 kernels from MiBench and NetBench
  – Message-Digest 5 (MD5)
  – URL-based switching (URL)
  – Advanced Encryption Standard (AES) [Rijndael]
• Modified memory allocations
• Modified source of incoming packets
• Parallelized
![Page 22: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/22.jpg)
MD5
• Creates a 128-bit signature of input
• Used extensively in public-key cryptography and verification of data integrity
• Packet processing offloaded to microengine (ME) threads
• Packets processed in parallel
![Page 23: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/23.jpg)
MD5 Algorithm
• Every packet processed on separate ME thread
• StrongARM monitors for idle threads and assigns work
[Figure: Incoming Packets, Microengines]
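The controller role described above, where the StrongARM watches for idle ME threads and hands each one a whole packet, might look roughly like this. The `me_pool` structure and both function names are hypothetical, and the sketch models only the bookkeeping, not the MD5 computation or the shared-memory signalling the real benchmark would need.

```c
#include <assert.h>

#define MAX_ME_THREADS 24

/* busy[i] != 0 means ME thread i is still processing a packet. */
typedef struct {
    int busy[MAX_ME_THREADS];
    int packet[MAX_ME_THREADS];  /* packet id held by each thread */
    int nthreads;
} me_pool;

/* Controller step: hand packet_id to the first idle ME thread.
 * Returns the thread index, or -1 if every thread is busy. */
int assign_packet(me_pool *pool, int packet_id)
{
    for (int t = 0; t < pool->nthreads; t++) {
        if (!pool->busy[t]) {
            pool->busy[t] = 1;
            pool->packet[t] = packet_id;
            return t;
        }
    }
    return -1;
}

/* Called when ME thread t has finished its packet. */
void mark_idle(me_pool *pool, int t)
{
    assert(t >= 0 && t < pool->nthreads);
    pool->busy[t] = 0;
}
```

Because each packet is independent, no coordination between ME threads is needed beyond this idle/busy flag.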
![Page 24: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/24.jpg)
MD5 Parallelization
[Figure: MD5 parallelization, StrongARM and Microengines]
![Page 25: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/25.jpg)
URL
• Directs packets based on payload content
• Useful for load-balancing, fault detection and recovery
• Layer 7 switch, content-switch, web-switch
• Uses pattern matching algorithm
![Page 26: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/26.jpg)
URL Algorithm
• Work for each packet split among ME threads
• StrongARM iterates over search tree, assigning work to idle ME threads
• ME threads report when match found
[Figure: Incoming Packets, StrongARM, Microengines]
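Splitting the matching work for one packet can be sketched as follows. Note the substitutions: the slides describe a search tree walked by the StrongARM, while this sketch uses a flat pattern array cut into contiguous slices, and it runs the slices one after another as a serial stand-in for the ME threads. The names `search_slice` and `url_match` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Search patterns[begin..end) for one that occurs in payload.
 * Returns the index of the first match, or -1 if none matches.
 * This is the share of work one ME thread would receive. */
int search_slice(const char *payload, const char *const patterns[],
                 int begin, int end)
{
    for (int i = begin; i < end; i++)
        if (strstr(payload, patterns[i]) != NULL)
            return i;
    return -1;
}

/* Split npatterns into nthreads near-equal contiguous slices and
 * search them in turn. Returns the matching pattern index or -1. */
int url_match(const char *payload, const char *const patterns[],
              int npatterns, int nthreads)
{
    assert(nthreads > 0);
    for (int t = 0; t < nthreads; t++) {
        int begin = (int)((long)npatterns * t / nthreads);
        int end = (int)((long)npatterns * (t + 1) / nthreads);
        int hit = search_slice(payload, patterns, begin, end);
        if (hit >= 0)
            return hit;  /* a thread reports its match */
    }
    return -1;
}
```

In a truly parallel version the slices would run at the same time, so a thread that finds a match early still leaves the others running until they finish or are told to stop.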
![Page 27: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/27.jpg)
URL Parallelization
[Figure: URL parallelization, StrongARM and Microengines]
![Page 28: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/28.jpg)
AES
• Block cipher encryption algorithm
• Made US government standard in 2001
• 256-bit key
• Same parallelization technique as MD5
• Key loaded into each ME’s stack during initialization
• Packet encryption performed in parallel
![Page 29: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/29.jpg)
Performance Tests
• Purpose
  – Evaluate multi-threading kernels and end-to-end applications
• Tests
  – Isolation
  – Shared
  – Static
  – Dynamic
![Page 30: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/30.jpg)
Isolation Tests
• Establish baseline
• Explore effects of multi-threading kernels
• Each kernel run in isolation
• Number of ME threads varied from 1 to 24
• Speedup graphed against serial version
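Speedup in these tests is assumed here to follow the conventional definition, the serial cycle count divided by the cycle count with n ME threads. The symbols below are labels introduced for this note, not taken from the slides:

$$ S(n) = \frac{C_{\text{serial}}}{C_n} $$

Under this definition a value below 1 means the parallel run took more cycles than the serial run on the StrongARM.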
![Page 31: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/31.jpg)
MD5 Isolation Results
• 0: serial on StrongARM
• 1-24: parallel on MEs
• Decreased speedup on 1 ME
• Significant speedup overall
• Note decreasing slope at 7, 13, and 19 threads
![Page 32: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/32.jpg)
URL Isolation Results
• When 1 thread finds a match, must wait for other threads to finish
  – Polling version required polling of global flag
  – Performed slightly worse (1.64 compared to 1.75)
  – Matching pattern found in 40% of packets
• When too many threads working at once, shared resource bottlenecks affect speedup
![Page 33: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/33.jpg)
AES Isolation Results
• Performs poorly on MEs
• Packets processed in 16-byte chunks
• State maintained in accumulator for packet lifetime
• Static lookup table of 8 KBytes
• L1 data cache: 8 KBytes for StrongARM, 1 KByte for MEs
• Consumes more cycles on ME by factor of 8.4
![Page 34: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/34.jpg)
Shared Tests
• Reveal sensitivity of each kernel to concurrent execution of other kernels
• StrongARM serves as controller
• Baseline of 1 MD5, 4 URL, and 1 AES thread
• Separate packet streams for each kernel
• Number of threads increased for kernel under test
![Page 35: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/35.jpg)
Shared Results
• MD5: not substantially affected
• URL: maximum speedup of 1.17 (compared to 1.75 in Isolation)
• AES: order of magnitude higher
  – Baseline uses ME, not StrongARM
![Page 36: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/36.jpg)
Static Tests
• Characteristics of end-to-end application
• Location of bottlenecks
• Kernels work together to process single packet stream
• Find optimal thread configuration
![Page 37: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/37.jpg)
End-to-End Application
• Distribution of sensitive information from trusted network over Internet to different hosts
1. Calculate MD5 signature
2. Determine destination host using URL
3. Encrypt packet using AES
4. Send packet and signature to host
![Page 38: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/38.jpg)
Static Results
• Baseline of 1 MD5, 4 URL, and 1 AES thread
• Additional thread tried on each kernel
• Best configuration used as starting point for next
• Final result 1 MD5, 11 URL, and 12 AES threads
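The search procedure described above reads like a greedy, one-thread-at-a-time search, which could be sketched as follows. Everything here is an assumption layered on the slide: the names `greedy_allocate`, `cost_fn`, and `toy_cost` are hypothetical, the stopping rule (all threads allocated) is a guess, and `toy_cost` uses made-up numbers in place of a real simulator run, so it does not reproduce the thesis results.

```c
#include <assert.h>

#define NUM_KERNELS 3  /* MD5, URL, AES */

/* Cost of a configuration, e.g. simulated cycles; lower is better. */
typedef double (*cost_fn)(const int threads[NUM_KERNELS]);

/* Greedy search: repeatedly try one extra thread on each kernel and
 * keep the cheapest candidate, until max_threads are allocated.
 * cfg[] holds the baseline on entry and the result on return. */
void greedy_allocate(int cfg[NUM_KERNELS], int max_threads, cost_fn cost)
{
    int total = 0;
    for (int k = 0; k < NUM_KERNELS; k++) {
        assert(cfg[k] >= 1);
        total += cfg[k];
    }
    while (total < max_threads) {
        int best = -1;
        double best_cost = 0.0;
        for (int k = 0; k < NUM_KERNELS; k++) {
            cfg[k]++;                  /* try one more thread on k */
            double c = cost(cfg);
            cfg[k]--;                  /* undo the trial */
            if (best < 0 || c < best_cost) {
                best = k;
                best_cost = c;
            }
        }
        cfg[best]++;                   /* adopt the best candidate */
        total++;
    }
}

/* Hypothetical stand-in for a simulator run: per-kernel work divided
 * by its thread count. The numbers are made up for illustration. */
double toy_cost(const int threads[NUM_KERNELS])
{
    static const double work[NUM_KERNELS] = { 4.0, 16.0, 8.0 };
    double sum = 0.0;
    for (int k = 0; k < NUM_KERNELS; k++)
        sum += work[k] / threads[k];
    return sum;
}
```

A greedy search like this needs one cost evaluation per kernel per added thread, which keeps the number of simulator runs small compared with trying every possible split of 24 threads.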
![Page 39: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/39.jpg)
Static Results (cont.)
• Although MD5 best speedup in Isolation, unable to improve speedup in Static
  – Amdahl’s Law: 1 / ((1 – P) + (P / S))
• More threads initially allocated to URL
  – URL bottleneck until 10 threads
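As a worked reading of Amdahl's Law, take P as the fraction of end-to-end execution time spent in the kernel being parallelized and S as the speedup of that kernel. The numbers below are illustrative only and are not measurements from the thesis:

$$ \text{Speedup} = \frac{1}{(1 - P) + \frac{P}{S}}, \qquad P = 0.1,\ S = 24 \;\Rightarrow\; \frac{1}{0.9 + 0.1/24} \approx 1.106 $$

Even as S grows without bound the result is capped at 1 / (1 - P) = 1 / 0.9 ≈ 1.11, which is how a kernel can scale well on its own yet barely move the end-to-end figure.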
![Page 40: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/40.jpg)
Dynamic Tests
• MEs not dedicated to single kernel, instead assigned work by StrongARM based on demand
• StrongARM responsible for allocating threads and maintaining wait-queues
• Realistic configuration
• Increased development complexity
![Page 41: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/41.jpg)
Dynamic Algorithm
[Figure: packet queues (URL, AES), StrongARM, Microengines; example thread assignment MD5, URL, URL, AES]
• MD5 → URL → AES
• StrongARM monitors MEs
• Assigns work to idle threads
• First from queues, then from incoming packet stream
  – AES queue
  – URL queue
  – Network
• URL queue fills as MD5 outperforms URL
• Additional threads created for URL
• AES threads created each time URL finishes
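The assignment rule for an idle ME thread can be sketched as a small priority check. This reads the listed order (AES queue, URL queue, network) as a priority order, which is an interpretation of the slide; the type and function names are hypothetical, and the queues are reduced to counters.

```c
#include <assert.h>

typedef enum { WORK_NONE, WORK_MD5, WORK_URL, WORK_AES } work_kind;

typedef struct {
    int aes_queue;  /* packets waiting for AES (URL already done) */
    int url_queue;  /* packets waiting for URL (MD5 already done) */
    int incoming;   /* packets not yet started */
} work_state;

/* Pick the job for an idle ME thread and take it off its queue:
 * queued work first, new packets from the network last. */
work_kind next_work(work_state *s)
{
    if (s->aes_queue > 0) { s->aes_queue--; return WORK_AES; }
    if (s->url_queue > 0) { s->url_queue--; return WORK_URL; }
    if (s->incoming > 0)  { s->incoming--;  return WORK_MD5; }
    return WORK_NONE;
}

/* Called when a thread finishes a stage: the packet moves to the
 * queue of the following stage (MD5, then URL, then AES, then done). */
void stage_done(work_state *s, work_kind finished)
{
    assert(finished != WORK_NONE);
    if (finished == WORK_MD5)
        s->url_queue++;
    else if (finished == WORK_URL)
        s->aes_queue++;
}
```

Serving the later stages first drains packets that are already in flight before new ones are admitted, which keeps the queues from growing without bound.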
![Page 42: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/42.jpg)
Dynamic Results
• Baseline same as Static
• Substantial speedup over Static
![Page 43: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/43.jpg)
Dynamic Results (cont.)
• 25% as many cycles as Static
• Some ME threads in Static waste idle cycles
• Less affected by URL bottleneck
• Able to adjust to varying packet sizes
![Page 44: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/44.jpg)
Analysis
• Isolation
  – Established baseline
• Shared
  – Explored concurrent kernels
• Static
  – End-to-end application characteristics
  – Thread allocation optimization
• Dynamic
  – Contrast on-demand to static thread allocation
![Page 45: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/45.jpg)
Conclusion
• NPU multi-processor simulator
• Multi-threaded end-to-end benchmark applications
• Analysis of benchmarks on NPU simulator
  – Kernel performance is not indicative of end-to-end application performance
  – MD5 scaled well in Isolation and Shared, little effect in end-to-end applications
![Page 46: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/46.jpg)
Future Work
• NPU simulator
  – Already used in two other M.S. thesis projects
  – Larger cycle count capability
  – Updated to model current NPU generation
• End-to-end applications
  – Simulated on next-generation simulator
  – Further investigation into bottlenecks
![Page 47: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/47.jpg)
Future Work (cont.)
• Benchmark suite
  – Include additional kernels
  – Model more real-world end-to-end applications
![Page 48: Thesis Defense](https://reader033.fdocuments.in/reader033/viewer/2022061116/5466d578b4af9fbb068b4eee/html5/thumbnails/48.jpg)
Thank You, Questions