Post on 19-Dec-2015
Uncorq:
Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors
http://iacoma.cs.uiuc.edu
Karin Strauss AMD Advanced Architecture and Technology Lab
Xiaowei Shen IBM Research
Josep Torrellas University of Illinois at Urbana-Champaign
Karin Strauss - “Uncorq”
Motivation
• CMPs are ubiquitous
• Shared memory + caches = cache coherence
• Traditional cache coherence solutions
  • directory-based: indirection, storage
  • shared bus-based: electrical, layout issues
Embedded-ring cache coherence [ISCA 2006]
• Novel snoopy cache coherence for mid-sized machines
  • logical ring is embedded in network
  • control messages use ring
  • data messages use any path
Simple and inexpensive to implement
Snoop requests can have long latencies
Contributions
• Propose invariant for transaction serialization
• Propose performance enhancements
  • Uncorq: unconstrained snoop request delivery
    • reduces cache-to-cache transfer latency
  • Simple hardware data prefetching technique
    • reduces memory-to-cache transfer latency
Embedded-ring terminology
• Snoopy, invalidate protocol
• Single supplier protocol
• Types of messages:
  • snoop request
  • snoop response
  • snoop request + response
  • data
[Figure: nodes A and B on the logical ring; labels: control messages, request, response, snoop op. outcome, positive snoop op. outcome (+), positive response (+), data]
Ordering invariant
Transaction serialization
[Figure: MSI protocol timeline for nodes A, B and S; labels: inv, ack, read, data, line states S and I, old value, new value; axis: time]
Serialization enforcement with embedded-ring
• Logical unidirectional ring provides partial ordering
• Distributed algorithm establishes global order for same-address transactions
• On simultaneous transactions to same address:
  • one is declared the “winner” (first to reach supplier)
  • others have to retry
[Figure: node A on the ring; labels: request, response]
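The winner rule above can be sketched in a few lines. This is a minimal illustration, not the hardware algorithm: the function name `serialize` and the idea of passing explicit arrival times are assumptions made for the example, and arrival times are assumed to be distinct.

```python
def serialize(arrivals):
    """Pick the winner among simultaneous same-address transactions.

    arrivals: list of (requester, arrival_time_at_supplier) pairs.
    Returns (winner, losers); losers are the nodes that have to retry.
    """
    ordered = sorted(arrivals, key=lambda pair: pair[1])
    winner = ordered[0][0]                    # first to reach the supplier
    losers = [requester for requester, _ in ordered[1:]]
    return winner, losers


# Hypothetical example: B's request reaches the supplier before A's.
winner, losers = serialize([("A", 7), ("B", 3)])
```

In the real system no node holds the full list of arrival times; each requester learns the outcome from the responses it sees on the ring.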
How to serialize transactions
[Figure: ring with nodes A, B and supplier S; labels: A’s request and response, B’s request and response, positive response (+)]
No clear “first” transaction
B’s request reaches S first
A receives B’s positive response before its own
A retries (order: B, then A)
Ring guarantees responses are forwarded in the order S performed snoop operations
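The A/B/S example can be replayed with a toy model of what each requester observes. The sketch assumes that the ring delivers responses in the order the supplier processed the requests, and that only the first request processed gets a positive response; the names `responses_seen` and `decide` are hypothetical.

```python
def responses_seen(processing_order):
    """Responses each requester sees, in order, up to and including its own.

    processing_order: requesters in the order the supplier snooped them.
    Each response is an (owner, positive) pair.
    """
    # Assumption: only the first request processed finds the supplier.
    responses = [(req, i == 0) for i, req in enumerate(processing_order)]
    return {req: responses[: i + 1] for i, req in enumerate(processing_order)}


def decide(node, seen):
    """A node retries if another node's positive response arrives first."""
    for owner, positive in seen[node]:
        if owner != node and positive:
            return "retry"
        if owner == node:
            return "winner" if positive else "retry"
    return "pending"


seen = responses_seen(["B", "A"])   # B's request reaches S first
```

With this input, `decide("B", seen)` yields "winner" and `decide("A", seen)` yields "retry", matching the sequence on the slide.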
Enforcing transaction serialization
• Node whose request arrives at supplier node first is the “winner”
• What we need to enforce transaction serialization:
  loser node sees other node’s positive response before its own
Ordering Invariant: the order in which responses travel the ring after leaving the supplier must be the same as the order in which the supplier processed their corresponding requests.
[Figure: supplier S on the ring; labels: request, response, positive response (+)]
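One way to read the Ordering Invariant is as a check over traces. The sketch below is an illustration under assumptions: `link_orders` is a hypothetical list holding, for each ring link downstream of the supplier, the order in which responses crossed that link.

```python
def invariant_holds(processing_order, link_orders):
    """True if every link carries responses in the supplier's processing order."""
    rank = {req: i for i, req in enumerate(processing_order)}
    for order in link_orders:
        ranks = [rank[req] for req in order]
        if ranks != sorted(ranks):   # some response overtook an earlier one
            return False
    return True


ok = invariant_holds(["B", "A"], [["B", "A"], ["B", "A"]])
broken = invariant_holds(["B", "A"], [["B", "A"], ["A", "B"]])
```

Here `ok` is True and `broken` is False, because in the second trace A's response overtakes B's on one link.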
Uncorq: Unconstrained snoop request delivery
Uncorq idea
Idea: requests do not have to follow the ring (but responses do)
[Figure: Baseline vs. Uncorq; labels: request, response]
Benefit of Uncorq
Reduced cache-to-cache transfer latency
[Figure: timelines for Baseline and Uncorq; labels: request, snoop, data, request reaches supplier node, savings; axis: time]
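A toy latency model shows where the savings come from. All numbers and names here are assumptions for illustration (unit-cost hops, a fixed snoop time at the supplier, and no snoop delay at intermediate nodes); they are not measurements from the talk.

```python
def ring_distance(ring, src, dst):
    """Hops from src to dst along a unidirectional ring."""
    return (ring.index(dst) - ring.index(src)) % len(ring)


def c2c_latency(request_hops, data_hops, hop_cycles=10, snoop_cycles=5):
    """Cycles from issuing a request until the data arrives (toy model)."""
    return request_hops * hop_cycles + snoop_cycles + data_hops * hop_cycles


ring = list(range(16))          # hypothetical 16-node logical ring
direct_hops = 2                 # hypothetical shortest path to the supplier

# Baseline: the request follows the ring. Uncorq: it takes the direct path.
baseline = c2c_latency(ring_distance(ring, 0, 12), direct_hops)
uncorq = c2c_latency(direct_hops, direct_hops)
savings = baseline - uncorq
```

The data message takes the direct path in both cases, so the whole difference comes from how soon the request reaches the supplier.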
Implications of Uncorq
• Uncorq no longer restricts order of requests
• Nodes may receive and process requests in any order
• Responses may also get reordered
Problem: distributed algorithm relies on the fact that response order reflects order of requests at supplier
Example: incorrect transaction ordering
[Figure: ring with nodes A, B and supplier S, shown in three steps; labels: request, response, positive response (+)]
Ordering invariant
A node cannot forward any other response if it has an outstanding positive snoop outcome
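The stall rule can be sketched as a small per-node state machine. This is a simplified illustration, not the hardware design: the class `RingNode`, its method names, and the choice to apply the stall per address are all assumptions.

```python
from collections import deque


class RingNode:
    def __init__(self):
        self.positive_for = {}  # addr -> requester with outstanding positive outcome
        self.held = {}          # addr -> responses stalled at this node
        self.forwarded = []     # (addr, requester, positive), in forwarding order

    def snoop_request(self, addr, requester, hit):
        """Requests may arrive in any order; a hit leaves a positive outcome pending."""
        if hit:
            self.positive_for[addr] = requester

    def receive_response(self, addr, requester, positive=False):
        owner = self.positive_for.get(addr)
        if owner is not None and owner != requester:
            # Another response must not overtake the pending positive one.
            self.held.setdefault(addr, deque()).append((requester, positive))
            return
        if owner == requester:
            positive = True                 # merge the local positive outcome
            del self.positive_for[addr]
        self.forwarded.append((addr, requester, positive))
        if owner == requester:
            for held_req, held_pos in self.held.pop(addr, ()):
                self.forwarded.append((addr, held_req, held_pos))


s = RingNode()
s.snoop_request(0x40, "B", hit=True)    # B's request reaches the supplier first
s.snoop_request(0x40, "A", hit=False)
s.receive_response(0x40, "A")           # stalled: B's positive outcome is pending
s.receive_response(0x40, "B")           # forwarded as positive, then A is released
```

After these calls, B's positive response leaves the supplier ahead of A's response, even though A's response arrived first.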
How Uncorq stalls responses
• Local transaction table (per-node structure)
  • records messages that node is currently processing
[Figure: local transaction table with fields addr, requests and responses; nodes A, B, …, C; labels: request, response, positive response (+)]
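A per-node table of this kind might look like the sketch below. The slide only names the fields addr, requests and responses; the class names, the methods, and the `ready` check (a transaction needs both its request and its response at this node before it can proceed) are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    requests: set = field(default_factory=set)   # requesters whose request was snooped
    responses: set = field(default_factory=set)  # requesters whose response arrived


class LocalTransactionTable:
    def __init__(self):
        self.entries = {}                        # addr -> Entry

    def record_request(self, addr, requester):
        self.entries.setdefault(addr, Entry()).requests.add(requester)

    def record_response(self, addr, requester):
        self.entries.setdefault(addr, Entry()).responses.add(requester)

    def ready(self, addr, requester):
        """True once both messages of the transaction have reached this node."""
        entry = self.entries.get(addr, Entry())
        return requester in entry.requests and requester in entry.responses

    def retire(self, addr, requester):
        """Drop a finished transaction; free the entry when it is empty."""
        entry = self.entries.get(addr)
        if entry is None:
            return
        entry.requests.discard(requester)
        entry.responses.discard(requester)
        if not (entry.requests or entry.responses):
            del self.entries[addr]


table = LocalTransactionTable()
table.record_response(0x80, "A")        # response arrives before the request
early = table.ready(0x80, "A")
table.record_request(0x80, "A")
late = table.ready(0x80, "A")
```

Because requests are no longer tied to the ring, a response can reach a node before the matching request, which is why the sketch tracks the two separately.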
Optimization: prefetching from memory
• Goal: reduce latency of memory-to-cache transfers
• Predict when no node will supply data
• Access memory in parallel with ring snoop
[Figure: requester R and memory. Unoptimized: ring snoop (1), then memory access (2). Optimized: ring snoop (1) and memory access (1) in parallel.]
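The effect of overlapping the memory access with the ring snoop can be written as a one-line model. The function name and cycle counts are hypothetical, and the sketch assumes a correct prediction and that the data is usable only once the snoop confirms that no node supplies it.

```python
def memory_to_cache_latency(snoop_cycles, memory_cycles, prefetch):
    """Toy latency of a transfer that ends up being served by memory."""
    if prefetch:
        # Memory access overlaps the ring snoop: the slower one dominates.
        return max(snoop_cycles, memory_cycles)
    # Memory is accessed only after the ring snoop finds no supplier.
    return snoop_cycles + memory_cycles


unoptimized = memory_to_cache_latency(300, 200, prefetch=False)
optimized = memory_to_cache_latency(300, 200, prefetch=True)
```

A wrong prediction costs an unnecessary memory access, which is why the slide speaks of predicting when no node will supply the data.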
Evaluation
Experimental setup
• 64 nodes in a single CMP
• SESC simulator (sesc.sourceforge.net)
• SPLASH-2, SPECjbb and SPECweb workloads
• Interconnection network: 2D torus with embedded-ring
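To make "2D torus with embedded-ring" concrete, the sketch below builds one possible logical ring over an 8x8 torus. It is an assumption for illustration, not the embedding used in the evaluation: it follows a snake pattern through columns 1 and up, then returns along column 0, which requires an even number of rows.

```python
def embed_ring(width, height):
    """Return nodes (row, col) in ring order; consecutive nodes are neighbors."""
    assert width >= 2 and height % 2 == 0
    ring = [(0, 0)]
    for row in range(height):
        # Snake through columns 1..width-1, alternating direction per row.
        cols = range(1, width) if row % 2 == 0 else range(width - 1, 0, -1)
        ring.extend((row, col) for col in cols)
    # Come back up column 0 to close the cycle at (0, 0).
    ring.extend((row, 0) for row in range(height - 1, 0, -1))
    return ring


def torus_adjacent(a, b, width, height):
    """True if a and b are one link apart in a 2D torus."""
    dr = min((a[0] - b[0]) % height, (b[0] - a[0]) % height)
    dc = min((a[1] - b[1]) % width, (b[1] - a[1]) % width)
    return dr + dc == 1


ring = embed_ring(8, 8)   # 64 nodes, as in the setup above
```

Every node appears exactly once and each ring hop maps to a single physical link, so control messages on the ring never need multi-hop routing.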
Cache-to-cache transfer latency
[Figure: two panels, Uncorq and Baseline. x-axis: cache-to-cache transfer latency (0 to 600); left y-axis: distribution (%) (0 to 10); right y-axis: cumulative distribution (%) (0 to 100)]
Substantial reduction in latency
Execution Time
[Figure: bar chart of normalized execution time (0 to 1) for SPLASH-2, SPECjbb and SPECweb; bars: Baseline, Uncorq, Uncorq+Pref]
• Uncorq significantly reduces execution time (reduction: 5-23%)
• Uncorq + Pref performs the best (reduction: 13-26%)
Also in the paper
• Serialization mechanism for case with no supplier
• System and node forward progress
• Fences and memory consistency issues
• Characterization of prefetching mechanism
• Comparison against ccHyperTransport
Conclusion
• Propose invariant for transaction serialization
• Propose performance enhancements
  • Uncorq: unconstrained snoop request delivery
  • Simple hardware data prefetching technique
• Reduce execution time by 13-26%