Improving System Throughput and Fairness Simultaneously in Shared Memory CMP Systems via Dynamic Bank Partitioning
Mingli Xie, Dong Tong, Kan Huang, Xu Cheng Microprocessor Research and Development Center, Peking University, Beijing, China
{xiemingli, tongdong, huangkan, chengxu}@mprc.pku.edu.cn
Abstract
Applications running concurrently in CMP systems
interfere with each other at the DRAM memory, leading to poor system performance and fairness. Memory access scheduling reorders memory requests to improve system throughput and fairness; however, it cannot resolve the interference issue effectively. To reduce interference, memory partitioning divides memory resources among threads. Memory channel partitioning maps the data of threads that are likely to severely interfere with each other to different channels. However, it allocates memory resources unfairly and physically exacerbates the memory contention of intensive threads, ultimately resulting in increased slowdown of these threads and high system unfairness. Bank partitioning divides memory banks among cores and eliminates interference. However, previous equal bank partitioning restricts the number of banks available to an individual thread and reduces bank level parallelism.
In this paper, we first propose Dynamic Bank Partitioning (DBP), which partitions memory banks according to threads' requirements for banks. DBP compensates for the reduced bank level parallelism caused by equal bank partitioning. The key principle is to profile threads' memory characteristics at run-time and estimate their demands for banks, then use this estimation to direct our bank partitioning.
Second, we observe that bank partitioning and memory scheduling are orthogonal: both methods benefit when they are applied together. Therefore, we present a comprehensive approach which integrates Dynamic Bank Partitioning and Thread Cluster Memory scheduling (DBP-TCM; TCM is one of the best-performing memory scheduling policies) to further improve system performance.
Experimental results show that the proposed DBP improves system performance by 4.3% and improves system fairness by 16% over equal bank partitioning. Compared to TCM, DBP-TCM improves system throughput by 6.2% and fairness by 16.7%. When compared with MCP, DBP-TCM provides 5.3% better system throughput and 37% better system fairness. We conclude that our methods are effective in improving both system throughput and fairness.
1. Introduction
Applications running concurrently contend with each other for the shared memory in CMP systems. Hence, a thread can be slowed down compared to when it runs alone and owns the entire memory system. If the memory contention is not properly managed, it can degrade both individual thread and overall system performance, while simultaneously causing system unfairness [1-8]. In addition, the memory streams of different threads are interleaved and interfere with each other at the DRAM memory; this inter-thread interference destroys the original spatial locality and bank level parallelism of individual threads, severely degrading system performance [4, 7, 14, 15, 29]. As the number of cores on a chip continues to grow, the contention for the limited memory resources and the interference become more serious [4, 31].
The effectiveness of a memory system is commonly evaluated by two objectives: system throughput and fairness [1-9, 11, 16]. On the one hand, the overall system throughput should remain high. On the other hand, no single thread should be disproportionately slowed down, to ensure fairness. Previously proposed out-of-order memory access scheduling policies [1-6, 9, 23] reorder memory accesses to improve system throughput and/or fairness. Thread Cluster Memory scheduling (TCM) [2] is one of the best-performing policies among them. Memory access scheduling policies can potentially recover a portion of the original spatial locality. However, the primary design consideration of these policies is not reclaiming the lost locality. Furthermore, the effectiveness of memory scheduling in recovering locality is restricted by the limited scheduling buffer size and the (often large) arrival interval of memory requests from a single thread. In short, memory access scheduling can somewhat decrease inter-thread interference, but it does not address the issue at its root. As shown in Figure 1, there is a large disparity between TCM (threads share the DRAM memory) and the optimal system performance (each thread runs alone). Therefore, reducing inter-thread interference can further improve system performance.
Memory partitioning divides memory resources among threads to reduce interference. MCP [8] maps memory intensive and non-intensive threads onto different channel(s) to enable faster progress of non-intensive threads. MCP allocates memory resources unfairly and physically exacerbates intensive threads' contention for memory, ultimately resulting in increased slowdown of these threads and high system unfairness (its system fairness is close to that of FR-FCFS [23], as shown in Figure 1). Bank partitioning [7, 15, 16, 29] partitions the internal memory banks among cores to isolate the memory access streams of different threads and eliminates interference. It solves the interference at its root and improves DRAM system performance. However, bank partitioning allocates an equal number of banks to each core without considering the varying demands for bank level parallelism of individual threads. In some cases, equal bank partitioning (BP) restricts the number of banks available to a thread and reduces bank level parallelism, thereby significantly degrading system performance.
To compensate for the reduced bank level parallelism of equal bank partitioning, we propose Dynamic Bank Partitioning (DBP), which dynamically partitions memory banks to accommodate threads' disparate requirements for banks.
Figure 1. System Throughput and Fairness of state-of-the-art approaches (x-axis: system throughput; y-axis: maximum slowdown)
The key idea is to profile threads' memory characteristics at run-time and estimate their needs for BLP, then use this estimation to direct our bank partitioning. The main goals of DBP are to 1) isolate memory intensive threads with high row-buffer locality from other memory intensive threads to preserve spatial locality, and 2) provide as many banks as possible (8 or 16) to each memory intensive thread to improve bank level parallelism.
Although bank partitioning solves the interference issue, the improvement of system throughput and fairness is limited. Previous work [1, 4, 8, 20] shows that prioritizing memory non-intensive threads enables these threads to quickly resume computation, hence ultimately improving system throughput. Bank partitioning alone cannot ensure this priority, thus the system throughput improvement is constrained. In addition, bank partitioning does not take into account system fairness, and its ability to improve fairness is also restricted.
We find that memory partitioning is orthogonal to memory scheduling: memory partitioning focuses on preserving row-buffer locality, while out-of-order memory scheduling focuses on improving system performance by considering threads' memory access behavior and system fairness. Memory scheduling policies can benefit from improved spatial locality, while memory partitioning can benefit from better scheduling. Therefore, we propose a comprehensive approach which integrates Dynamic Bank Partitioning and Thread Cluster Memory scheduling [2]. Together, the two methods reinforce each other. TCM divides threads into two separate clusters and employs a different scheduling policy for each cluster. TCM also introduces an insertion shuffling algorithm to reduce interference. Since DBP resolves the interference issue, there is no need for insertion shuffle, and we replace the complicated insertion algorithm with random shuffle. Experimental results show that this powerful combination is able to enhance both overall system throughput and fairness significantly.
Contributions. In this paper, we make the following contributions:
—We introduce Dynamic Bank Partitioning (DBP) to compensate for the reduced bank level parallelism of equal bank partitioning. DBP dynamically partitions memory banks according to threads' needs for banks. DBP not only solves the inter-thread interference, but also improves the bank level parallelism of individual threads. DBP effectively improves system throughput while maintaining fairness.
—We also present a comprehensive approach which combines Dynamic Bank Partitioning and Thread Cluster Memory scheduling, which we call DBP-TCM. TCM focuses
on improving memory scheduling by considering threads' memory access behavior and system fairness. DBP is orthogonal to TCM, as DBP focuses on preserving row-buffer locality and improving bank level parallelism. When DBP is applied on top of TCM, both methods benefit.
—We simplify the shuffling algorithm of TCM by replacing the complicated insertion shuffle with random shuffle, which simplifies the implementation.
—We compare our proposals with several related approaches. DBP improves system throughput by 9.4% and system fairness by 18.1% over FR-FCFS. Compared with BP, DBP improves system throughput by 4.3% and system fairness by 16.0%. DBP-TCM provides 15.6% better system throughput (and 42.7% better system fairness) than FR-FCFS. We also compare our proposal with TCM: experimental results show that DBP-TCM outperforms TCM in terms of both system throughput and fairness, improving system throughput by 6.2% and fairness by 16.7% over TCM. When compared with MCP, DBP-TCM provides 5.3% better system throughput and 37% better system fairness.
The rest of this paper is organized as follows: Section 2 provides background information and motivation. Section 3 introduces our proposals. Section 4 describes implementation details. Section 5 shows the evaluation methodology and Section 6 discusses the results. Section 7 presents related work and Section 8 concludes.
2. Background and Motivation
2.1 DRAM Memory System Organization
We present a brief introduction to the DRAM memory system; more details can be found in [12]. In modern DRAM memory systems, a memory system consists of one or more channels. Each channel has independent address, data, and command buses, so memory accesses to different channels can proceed in parallel. A channel typically consists of one or several ranks (multiple ranks share the address, data, and command buses of a channel). A rank is a set of DRAM devices that operate in lockstep in response to a given command. Each rank consists of multiple banks (e.g., DDR3: 8 banks per rank). A bank is organized as a two-dimensional structure consisting of multiple rows and columns. A location in the DRAM is thus accessed using a DRAM address consisting of channel, rank, bank, row, and column fields.
2.2 Memory Access Behavior
We define an application's memory access behavior using three components identified by previous work: memory intensity [1, 2, 8, 29], row-buffer locality [2, 8, 29], and bank level parallelism [4, 7].
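To make the address organization in Section 2.1 concrete, the sketch below decodes a physical address into channel, rank, bank, row, and column fields. The field widths mirror the simulated system in Table 2 (2 channels, 2 ranks per channel, 8 banks per rank, 64B lines), but the specific bit layout and the assumed 8KB row size are ours; real memory controllers use many different interleaving schemes.

```python
# Minimal sketch of decoding a physical address into DRAM coordinates.
# Field widths follow Table 2; the bit order (row : rank : bank : channel :
# column : offset) and the 8KB row size are assumptions for illustration.

def decode_dram_address(paddr: int) -> dict:
    offset  = paddr & 0x3F          # 64B cache-line offset (6 bits)
    column  = (paddr >> 6)  & 0x7F  # 7 bits of column (128 lines per 8KB row)
    channel = (paddr >> 13) & 0x1   # 1 bit  -> 2 channels
    bank    = (paddr >> 14) & 0x7   # 3 bits -> 8 banks per rank
    rank    = (paddr >> 17) & 0x1   # 1 bit  -> 2 ranks per channel
    row     = paddr >> 18           # remaining bits select the row
    return {"channel": channel, "rank": rank, "bank": bank,
            "row": row, "column": column, "offset": offset}

if __name__ == "__main__":
    print(decode_dram_address(0x12345678))
```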
Memory Intensity. Memory intensity is the frequency at which an application generates memory requests, i.e., misses in the last-level cache. In this paper, we measure it in terms of memory accesses per interval (MAPI) instead of LLC misses per thousand instructions (MPKI), since per-thread instruction counts are not typically available in the memory controller. We also evaluated our proposal using MPKI and found that it provides similar results.
Row Buffer Locality. In modern DRAM memory systems, each bank has a row buffer that provides temporary storage for a DRAM row. If a subsequent memory access goes to the row currently in the row buffer, only a column access is necessary; this is called a row-buffer hit. Otherwise, there is a row-buffer conflict: the memory controller has to first precharge the open row, activate another row, and then perform the column access.
Figure 2. Generalized Thread Clustering Memory scheduling and Memory Channel Partitioning
A row-buffer hit can be serviced much more quickly than a conflict. Row-Buffer Locality (RBL) is measured by the average row-buffer hit rate (RBH).
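The latency gap between a hit and a conflict can be illustrated with a rough open-page model of a single bank, using the DDR3-1333 (8-8-8) timing from Table 2 (tCAS = tRCD = tRP = 8 memory cycles). This is only a sketch under those assumptions: real DRAM has many more timing constraints (tRAS, tFAW, bus contention) that are ignored here.

```python
# Rough open-page latency model for a single bank (DDR3 8-8-8 timing).
T_CAS, T_RCD, T_RP = 8, 8, 8

class Bank:
    def __init__(self):
        self.open_row = None          # row currently held in the row buffer

    def access(self, row: int) -> int:
        if self.open_row == row:      # row-buffer hit: column access only
            return T_CAS
        latency = T_RCD + T_CAS       # closed bank: activate + column access
        if self.open_row is not None: # conflict: precharge the open row first
            latency += T_RP
        self.open_row = row
        return latency

bank = Bank()
print([bank.access(r) for r in (5, 5, 9)])   # [16, 8, 24]: miss, hit, conflict
```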
Bank Level Parallelism. Banks are independent memory arrays inside a DRAM device. Modern DRAM devices contain multiple banks (DDR3: 8 banks), so that multiple independent accesses to different DRAM arrays can proceed in parallel, subject to the availability of the shared address, command, and data buses [12]. As a result, long memory access latencies can be overlapped and DRAM throughput can be improved, leading to high system performance. The notion of servicing multiple memory accesses in parallel in different banks is called Bank Level Parallelism (BLP).
In modern out-of-order processors, a memory access that reaches the head of the ROB (reorder buffer) causes a structural stall. Due to the high latency of off-chip DRAM, most memory accesses reach the head of the ROB, negatively impacting system performance [30]. Moreover, memory accesses that miss in the open row cause additional precharge and activate operations, further exacerbating this problem. If threads with low row-buffer locality generate multiple memory accesses to different banks, these accesses can proceed in parallel in the memory system [12]. Exploiting BLP hides memory access latency and dramatically improves system performance.
2.3 Memory Scheduling
Memory access scheduling policies [1-6, 23] reorder memory requests to improve system throughput and fairness. In this section, we introduce Thread Cluster Memory scheduling [2] in detail. As shown in Figure 2(a), TCM periodically divides applications into latency-sensitive and bandwidth-sensitive clusters based on their memory intensity at fixed-length time intervals. The least memory-intensive threads are placed in the latency-sensitive cluster, while the other threads are in the bandwidth-sensitive cluster. TCM uses a parameter called ClusterThresh to specify the amount of memory bandwidth consumed by the latency-sensitive cluster, thus limiting the number of threads in the latency-sensitive cluster.
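The sketch below shows one way this clustering could be implemented, following the description above: threads are sorted by memory intensity and the lightest ones are admitted into the latency-sensitive cluster until that cluster accounts for a ClusterThresh fraction of the total bandwidth. This reflects our reading of [2], not its exact implementation.

```python
# Sketch of TCM-style thread clustering: lightest threads (by MAPI) are
# admitted to the latency-sensitive cluster up to ClusterThresh of the
# total bandwidth; everything else is bandwidth-sensitive.

def cluster_threads(mapi: dict, cluster_thresh: float = 1/6):
    total = sum(mapi.values())
    latency_sensitive, consumed = [], 0.0
    for tid in sorted(mapi, key=mapi.get):        # lightest threads first
        if total > 0 and (consumed + mapi[tid]) / total > cluster_thresh:
            break
        latency_sensitive.append(tid)
        consumed += mapi[tid]
    bandwidth_sensitive = [t for t in mapi if t not in latency_sensitive]
    return latency_sensitive, bandwidth_sensitive

mapi = {"povray": 3, "gromacs": 5, "mcf": 900, "libquantum": 700}
print(cluster_threads(mapi))   # (['povray', 'gromacs'], ['mcf', 'libquantum'])
```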
TCM strictly prioritizes memory requests of threads in the latency-sensitive cluster over the bandwidth-sensitive cluster, because servicing memory requests from such "light" threads allows them to quickly resume computation, hence improving system throughput [1, 4, 20]. To further improve system throughput and to minimize unfairness, TCM employs a different memory scheduling policy in
each cluster. For the latency-sensitive cluster, the thread with the least memory intensity receives the highest priority. This ensures that such threads spend most of their time at the processor, allowing them to make fast progress and a large contribution to overall system throughput. Threads in the bandwidth-sensitive cluster share the remaining memory bandwidth; to ensure system fairness, no single thread should be disproportionately slowed down. To minimize unfairness, TCM periodically shuffles (every ShuffleInterval) the priority ordering among threads in the bandwidth-sensitive cluster.
TCM also proposes a notion of niceness, which reflects a thread's propensity to cause interference and its susceptibility to interference. Threads with high row-buffer locality are less nice than others, while threads with high bank-level parallelism are nicer. Based on niceness, TCM introduces an insertion shuffling algorithm that minimizes inter-thread interference by ensuring that nicer threads are prioritized more often than others. TCM employs dynamic shuffling, which switches back and forth between random shuffle and insertion shuffle according to the composition of the workload. If threads have similar memory behavior, TCM disables insertion shuffle and employs random shuffle to prevent unfair treatment.
Out-of-order memory scheduling improves system performance by considering threads' memory access behavior and system fairness. However, it cannot solve the interference issue effectively for several reasons. First, the primary design consideration of scheduling is not recovering the lost spatial locality. Second, the effectiveness in reclaiming locality is constrained, as the scheduling buffer size is limited and the arrival interval of memory requests from a single thread is often large. As the number of cores increases, this recovery ability becomes less effective, since the buffer size does not scale well with the number of cores per chip, negatively affecting the scheduling capability.
2.4 Memory Partitioning
Memory partitioning divides memory resources among cores to reduce inter-thread interference. Memory Channel Partitioning [8] maps the data of threads with different memory characteristics onto separate memory channels. MCP also integrates memory partitioning with a simple memory scheduling policy to further improve system throughput. MCP can effectively reduce the interference of threads with different memory access behavior and improve system throughput.
Figure 3. Influence of bank amounts on a single thread's performance (x-axis: number of banks × number of ranks)
Figure 4. Trade-offs between bank level parallelism and row-buffer locality
However, the improvement of system throughput comes at the cost of system fairness. As shown in Figure 2(b), MCP allocates memory channels unfairly and inherently exacerbates intensive threads' contention for the heavily loaded memory resources, ultimately resulting in increased memory slowdown of these threads and high system unfairness.
Bank partitioning [7, 15, 16, 29] divides the internal memory banks among cores to isolate memory access streams from different threads, hence eliminating the inter-thread interference. Bank partitioning is implemented by extending the operating system's physical frame allocation algorithm to enforce that physical frames mapped to the same DRAM bank are exclusively allocated to a single thread; detailed implementations can be found in [6, 7, 29]. Bank partitioning divides the internal memory banks among cores by allocating different page colors to different cores, thus avoiding inter-thread interference. When a core runs out of physical frames allocated to it, the operating system can assign frames of a different color at the cost of reduced isolation.
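As an illustration of page coloring, the sketch below derives a frame's bank color from its physical frame number, assuming 4KB frames and the bit layout used in the earlier address-decoding sketch (both are assumptions, not the exact mapping used in the paper). Frames whose color falls in a core's assigned color set are handed out to that core.

```python
# Illustrative page-coloring helper: with 4KB frames and the assumed address
# layout, the rank and bank bits lie above the page offset, so every frame
# maps to exactly one of 16 rank-and-bank colors.  Bit positions are
# assumptions for illustration.

PAGE_SHIFT = 12            # 4KB frames
BANK_SHIFT, BANK_BITS = 14, 3
RANK_SHIFT, RANK_BITS = 17, 1

def frame_color(frame_number: int) -> int:
    paddr = frame_number << PAGE_SHIFT
    bank = (paddr >> BANK_SHIFT) & ((1 << BANK_BITS) - 1)
    rank = (paddr >> RANK_SHIFT) & ((1 << RANK_BITS) - 1)
    return (rank << BANK_BITS) | bank          # 16 colors: rank x bank

# Frames whose color is in a core's assigned set are handed out to that core.
print({f: frame_color(f) for f in range(0, 64, 4)})
```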
However, the benefits of bank partitioning are not fully exploited, since previous BP allocates an equal number of banks to each core without considering the disparate needs for bank level parallelism of individual applications. In some cases, BP restricts the number of banks available to an application and reduces bank level parallelism, which significantly degrades system performance.
2.5 Motivation
In this section, we motivate our partitioning by showing how bank level parallelism impacts threads' performance.
The Impact of Bank Level Parallelism. We study how the number of banks influences a single thread's performance by performing experiments on gem5 (the detailed experimental parameters are in Section 5). For each application, we vary the number of available banks from 2 to 64 and observe the performance change; we obtain results similar to [7, 16]. Figure 3 shows the correlation between the number of banks and thread performance, which depicts threads' disparate needs for bank level parallelism; all results are normalized to 64 banks.
Basically, the results can be classified into three groups. First, applications with low memory intensity (e.g. gromacs, namd, povray, etc.) show almost no performance change as the number of banks varies, because these applications seldom generate memory requests and memory latency accounts for a small fraction of the total program execution time. Furthermore, their request arrival interval is long and there is hardly any bank level parallelism to exploit.
Second, memory intensive applications with low row-buffer locality (e.g. mcf, bwaves, zeusmp, etc.) are sensitive to the number of banks in the memory system; their performance first rises and then remains stable as the number of banks increases. As described in Section 2.2, exploiting bank level parallelism hides memory access latency and improves performance. BLP undoubtedly improves system performance, but its benefits are limited, since a single thread is unable to generate enough concurrent memory accesses [7, 16, 30] due to a combination of factors such as the limited number of MSHRs, high cache hit rates, and memory dependencies. Therefore, after some point, the number of available banks exceeds the thread's needs and performance remains stable.
Third, the performance of memory intensive applications with high row-buffer locality (e.g. libq, milc, leslie3d, etc.) also improves owing to the benefits brought by BLP, and then stays stable, although the performance improvement is slower than that of the applications in the second group. Benchmark lbm is an outlier, since it shows a higher row-buffer hit rate as the number of banks increases. The applications that benefit from an increased number of banks are BLP-sensitive.
In summary, the performance of memory intensive applications is sensitive to the number of banks, and most applications reach peak performance at 8 or 16 banks, while non-intensive applications are not sensitive to the number of banks. Due to the 2-cycle rank-to-rank switching penalty, beyond this point applications show no further performance improvement or even perform worse.
The Trade-offs between Bank Level Parallelism and Row-Buffer Locality. We also study the trade-offs between bank level parallelism and row-buffer locality of memory intensive applications with low spatial locality. In our experiments, we run two threads concurrently. The example in Figure 4 illustrates how increasing bank level parallelism can improve system throughput. The column BP means two threads have their own memory banks, and shared means two threads share all banks evenly.
In the first case, two instances of mcf run together. We use bank partitioning [7, 15, 16, 29] to divide the memory banks among the threads so that each thread can only access its own four banks; row-buffer locality is preserved. However, the original spatial locality of these applications is low, so recovering row-buffer locality brings limited benefits. In the second situation, the two threads share the 8 memory banks evenly to increase bank level parallelism. Although evenly sharing memory banks reduces row-buffer locality, the increased bank level parallelism provides a larger system performance benefit than the elimination of interference. When the number of banks rises to 16 or 32, increasing bank level parallelism also outperforms reducing interference. When the number of banks rises to 64, increasing bank level parallelism by evenly sharing all banks negatively impacts DRAM performance, since most applications reach peak performance at 8 or 16 banks. Two instances of bwaves running together show similar results. When two different threads (mcf and bwaves) run together, evenly sharing their memory banks also improves system throughput.

Figure 5. Overview of Our Proposal

Rule 1. Application Classifying (at the end of each interval)
for each application i do
  if MAPIi < MAPIt: application i is classified into the non-intensive group (NI)
  else if RBHi < RBHt: application i is classified into the low RBL group (L-RBL)
  else: application i is classified into the high RBL group (H-RBL)
3 Our Proposals
3.1 Overview of Our Proposal
To accommodate the disparate needs for bank level parallelism of concurrently running applications, we propose Dynamic Bank Partitioning. There are two key principles behind DBP. First, to save memory banks for BLP-sensitive threads, DBP does not allocate dedicated memory banks to threads in the non-intensive group. Instead, these threads can access all memory banks (as shown in Figure 5), because they only generate a small number of memory requests and their propensity to cause interference is low. Second, for memory intensive threads with high row-buffer locality, DBP isolates their memory requests from other intensive threads to preserve spatial locality, while for memory intensive threads with low row-buffer locality, DBP balances spatial locality and bank level parallelism. We discuss DBP in detail in the next section. To further improve system performance, we propose a comprehensive approach which integrates Dynamic Bank Partitioning and Thread Cluster Memory scheduling.
DBP resolves the interference issue, while TCM improves system throughput and fairness by considering threads' memory access behavior. These two methods are orthogonal; when applied together, both benefit.
3.2 Dynamic Bank Partitioning
Based on the observations in Section 2.5, we propose a dynamic bank partitioning which takes into account both inter-thread interference and applications' disparate needs for bank level parallelism. DBP consists of three steps: 1) profiling applications' memory access behavior at run time, 2) classifying applications into groups, and 3) making bank partitioning decisions. The policy proceeds periodically at fixed-length time intervals. During each interval, applications' memory access behavior is profiled. We categorize applications into groups based on the profiled characteristics, and then decide the bank partitioning policy at the end of each interval. The decided bank partitioning is applied in the subsequent interval.
Profiling Applications’ Characteristics. As described in Section 2.2, memory intensity and row-buffer locality are the two main factors that determine the benefits of bank level parallelism. Memory intensity is measured in the unit of memory accesses per interval (MAPI); spatial locality is measured by row-buffer hit rate (RBH). Therefore, each thread’s MAPI and RBH are collected during each interval.
Grouping Applications. The second step is to classify applications into groups. We first categorize applications into memory intensive (MI) and non-intensive (NI) groups according to their MAPI. We further classify memory-intensive applications into low row-buffer locality (L-RBL) and high row-buffer locality (H-RBL) groups based on their RBH. We use two threshold parameters MAPIt and RBHt to classify applications. DBP categorizes applications using the rules shown in Rule 1.
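A minimal sketch of this classification step is shown below. It assumes the per-application MAPI and RBH values collected during the previous interval and uses the MAPIt = 200 and RBHt = 0.5 thresholds from Section 5.

```python
# Direct transcription of Rule 1; the thresholds follow Section 5.
MAPI_T = 200
RBH_T = 0.5

def classify(apps: dict) -> dict:
    """apps maps an application id to its (MAPI, RBH) from the last interval."""
    groups = {"NI": [], "L-RBL": [], "H-RBL": []}
    for app, (mapi, rbh) in apps.items():
        if mapi < MAPI_T:
            groups["NI"].append(app)        # memory non-intensive
        elif rbh < RBH_T:
            groups["L-RBL"].append(app)     # intensive, low row-buffer locality
        else:
            groups["H-RBL"].append(app)     # intensive, high row-buffer locality
    return groups

print(classify({"povray": (10, 0.86), "mcf": (900, 0.21), "libquantum": (700, 0.99)}))
```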

$MPU = \frac{N_{rank} \times N_{bank}}{N_{core}}, \qquad N_{core} < N_{rank} \times N_{bank}$
At the end of each interval, our policy dynamically selects bank partitioning based on the proportion of each group among all applications. Algorithm 1 shows the pseudo code of our dynamic bank partitioning.
Algorithm 1. Dynamic Bank Partitioning
Definitions:
  MI: the proportion of memory intensive applications among all applications
  NI: non-intensive group
  H-RBL: high row-buffer locality group
  L-RBL: low row-buffer locality group
  MPU: the minimum partitioning unit
At the end of each interval:
  if MI = 0%:
    allocate an equal number of colors to each core
  else if 0% < MI < 100%:
    applications in the NI group can access all colors
    allocate MPU colors to each core in the H-RBL group
    if MPU < 16: each two cores in the L-RBL group share 2 × MPU colors
    else: allocate MPU colors to each core in the L-RBL group
  else if MI = 100%:
    allocate MPU colors to each core in the H-RBL group
    if MPU < 16: each two cores in the L-RBL group share 2 × MPU colors
    else: allocate MPU colors to each core in the L-RBL group

In the first case, all applications are memory non-intensive; we assign an equal number of colors to each core to isolate the memory access streams of different threads, since increasing the number of banks cannot improve system throughput for these applications (Figure 3).
In the second case, the proportion of memory intensive applications is greater than 0% but less than 100%. To save memory banks for BLP-sensitive applications, we do not allocate dedicated colors to applications in the NI group. Instead, these applications can access all memory banks, since they only generate a small number of memory accesses and their propensity to cause interference is low. We isolate H-RBL threads from other memory intensive threads by allocating each core MPU colors to preserve row-buffer locality. For threads in the L-RBL group, there are two circumstances. If MPU is less than 16, each two cores share 2 × MPU colors to improve bank level parallelism, since improving bank level parallelism brings more benefits than eliminating interference (as shown in Figure 4).¹ Otherwise, we allocate MPU colors to each core in the L-RBL group, since most threads reach peak performance at 8 or 16 banks (Figure 3).
In the last case, all applications are memory intensive. For applications in the H-RBL group, we allocate MPU colors to each core to isolate their memory accesses from other cores, thus reducing inter-thread interference. If MPU is 16 or larger, we allocate MPU colors to each core in the L-RBL group; otherwise, each two cores evenly share 2 × MPU colors to increase bank level parallelism.
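The sketch below illustrates the partitioning decision of Algorithm 1 for one interval. The color bookkeeping is simplified: it ignores the preference for a core's previous colors discussed in Section 4 and simply hands out colors in order; group membership follows Rule 1.

```python
# Sketch of the Algorithm 1 decision.  Colors are numbered 0..NUM_COLORS-1
# and MPU = NUM_COLORS / NUM_CORES; the parameters match the 8-core,
# 32-bank example in the text.

NUM_CORES, NUM_COLORS = 8, 32
MPU = NUM_COLORS // NUM_CORES
ALL_COLORS = list(range(NUM_COLORS))

def partition(groups: dict) -> dict:
    """groups: {'NI': [...], 'H-RBL': [...], 'L-RBL': [...]} -> core -> colors."""
    alloc, color_iter = {}, iter(ALL_COLORS)
    take = lambda n: [next(color_iter) for _ in range(n)]
    intensive = groups["L-RBL"] or groups["H-RBL"]

    for core in groups["NI"]:
        # MI = 0% degenerates to equal partitioning; otherwise NI cores may
        # touch every color, since they rarely issue requests.
        alloc[core] = ALL_COLORS if intensive else take(MPU)
    for core in groups["H-RBL"]:
        alloc[core] = take(MPU)                    # private colors: keep locality
    lrbl = groups["L-RBL"]
    if MPU < 16:
        for a, b in zip(lrbl[::2], lrbl[1::2]):    # pair cores, share 2*MPU colors
            alloc[a] = alloc[b] = take(2 * MPU)
        if len(lrbl) % 2:                          # footnote 1: leftover core
            alloc[lrbl[-1]] = take(MPU)
    else:
        for core in lrbl:
            alloc[core] = take(MPU)
    return alloc

print(partition({"NI": ["povray"], "H-RBL": ["libquantum"], "L-RBL": ["mcf", "bwaves"]}))
```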
3.3 Integrated Dynamic Bank Partitioning and Memory Scheduling
Bank partitioning aims to solve inter-thread memory interference purely using OS-based page coloring. In contrast, various existing approaches try to improve system performance entirely in hardware using sophisticated memory scheduling policies (e.g. [1-6]) that consider threads' memory access behavior and system fairness. The question is whether either alone (bank partitioning alone or memory scheduling alone) can provide the best system performance. According to the observations below, the answer is negative.
¹ If the number of threads in the L-RBL group is odd, we allocate MPU colors to the last core in the L-RBL group.
TCM reorders memory requests to improve system performance and can potentially recover some of the lost spatial locality. However, its primary design consideration is not reclaiming the lost locality. Furthermore, its effectiveness in recovering locality is constrained by the limited scheduling buffer size and the (often large) arrival interval of memory requests from a single thread. As the number of cores increases, this recovery becomes less effective, since the buffer size does not scale well with the number of cores per chip, thus negatively affecting the scheduling capability.
Bank partitioning effectively reduces inter-thread interference using OS-based page coloring. However, its improvement of system throughput and fairness is limited. Previous work [1, 4, 8, 20] shows that prioritizing memory non-intensive threads enables these threads to quickly continue with their computation, thereby significantly improving system throughput; bank partitioning alone cannot ensure this priority, since it does not change the memory scheduling policy, so the improvement in system throughput is restricted. In addition, bank partitioning does not take system fairness into account, and hence its ability to improve fairness is constrained.
TCM aims to improve system performance by considering threads' memory access behavior and system fairness, while dynamic bank partitioning focuses on preserving row-buffer locality and restoring the reduced bank level parallelism. These two kinds of methods are complementary: TCM can benefit from improved spatial locality, and bank partitioning can benefit from better scheduling. Therefore, we present a comprehensive approach which integrates Dynamic Bank Partitioning and Thread Cluster Memory scheduling to further improve system throughput and fairness.
To save memory banks, DBP does not separate non-intensive threads’ memory requests from memory intensive threads, which exacerbates the interference experienced by non-intensive threads. TCM strictly prioritizes latency-sensitive cluster which consists of non-intensive threads, thus nullifying the interference caused by dynamic bank partitioning.
The insertion shuffling algorithm of TCM aims to reduce inter-thread interference. However, we find that it destroys the row-buffer locality of less nice threads. In addition, prioritizing nicer threads increases the memory slowdown of less nice threads, leading to high maximum slowdown. Our comprehensive approach combines TCM with DBP, and dynamic bank partitioning already solves the inter-thread interference; therefore, instead of insertion shuffle we use random shuffle. Random shuffle not only avoids the implementation complexity of monitoring BLP and calculating a niceness value for each thread, but also improves system performance (as shown in Section 6.3).
The prioritization rules of DBP-TCM are: 1) threads of the latency-sensitive cluster are prioritized over the bandwidth-sensitive cluster, 2) within latency-sensitive threads, less intensive threads are prioritized over others, 3) within bandwidth-sensitive threads, prioritization is determined by the shuffling algorithm, 4) when two memory requests have the same priority, row-buffer hit requests are favored, and 5) otherwise, older requests are prioritized.
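One way to realize these rules is as a lexicographic sort key over the request queue, as sketched below. The cluster membership, per-thread intensity ranks, and shuffle ordering are assumed to come from the scheduler state described earlier; the tuple structure is our interpretation rather than the paper's exact implementation.

```python
# Sketch of the five DBP-TCM prioritization rules as a sort key: lower
# tuples are served first.
from dataclasses import dataclass

@dataclass
class Request:
    thread: str
    row_hit: bool
    arrival: int          # arrival time, for the oldest-first tie-break

def priority_key(req, latency_cluster, intensity_rank, shuffle_rank):
    in_latency = req.thread in latency_cluster
    return (
        0 if in_latency else 1,                        # rule 1: cluster
        intensity_rank[req.thread] if in_latency       # rule 2: least intensive
            else shuffle_rank[req.thread],             # rule 3: shuffled order
        0 if req.row_hit else 1,                       # rule 4: row hits first
        req.arrival,                                   # rule 5: oldest first
    )

queue = [Request("mcf", True, 5), Request("povray", False, 9), Request("mcf", False, 2)]
order = sorted(queue, key=lambda r: priority_key(
    r, latency_cluster={"povray"},
    intensity_rank={"povray": 0}, shuffle_rank={"mcf": 1}))
print([r.thread for r in order])     # ['povray', 'mcf', 'mcf']
```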
4 Implementation
Hardware Support. Our approach requires hardware support to 1) profile threads' memory access behavior at run-time, and 2) schedule memory requests as described.
Table 1. Hardware overhead
Storage | Function | Size (bits)
MAPI counter | Memory accesses per interval | Ncore × log2(MAPImax)
Shadow row-buffer index | Row address of previous memory access | Ncore × Nch × Nrank × Nbank × log2(Nrow)
Shadow row-buffer hits | Number of row hits | Ncore × log2(MAPImax)
Table 2. Simulated system parameters
Parameter | Value
Processor | 8 cores, 4 GHz, out-of-order
L1 caches (per core) | 32KB Inst / 32KB Data, 4-way, 64B line, LRU
L2 cache (shared) | 8MB, 16-way, 64B line
Memory Controller | Open-page policy; 64-entry read queue, 64-entry write queue
DRAM Memory | 2 channels, 2 ranks/channel, 8 banks/rank; DDR3-1333 (8-8-8) timing; all parameters from the Micron datasheet [10]
Table 3. SPEC CPU2006 benchmark memory characteristics
No. | Benchmark | MPKI | RBH
1 | 429.mcf | 35.8 | 21.1%
2 | 470.lbm | 34.8 | 96.2%
3 | 462.libquantum | 28.9 | 99.2%
4 | 436.cactusADM | 25.8 | 25.4%
5 | 459.GemsFDTD | 20.3 | 32.7%
6 | 450.soplex | 20.1 | 91.5%
7 | 410.bwaves | 18.4 | 18.5%
8 | 433.milc | 16.84 | 84.2%
9 | 434.zeusmp | 14.6 | 23.7%
10 | 471.omnetpp | 9.84 | 46.4%
11 | 437.leslie3d | 7.45 | 81.5%
12 | 482.sphinx3 | 3.84 | 82.6%
13 | 473.astar | 3.27 | 79.3%
14 | 401.bzip2 | 1.45 | 71.3%
15 | 456.hmmer | 0.28 | 43.2%
16 | 435.gromacs | 0.27 | 88.3%
17 | 458.sjeng | 0.23 | 25.5%
18 | 445.gobmk | 0.21 | 77.0%
19 | 447.dealII | 0.19 | 82.8%
20 | 481.wrf | 0.10 | 85.7%
21 | 444.namd | 0.04 | 91.3%
22 | 453.povray | 0.03 | 85.6%
The major hardware storage cost incurred to profile threads’ memory access behavior is shown in Table 1.
We need a counter to record the MAPI of each application. To compute RBH, we use a per-bank shadow row-buffer index to record the row address of the previous memory request, and we record the number of shadow row-buffer hits for each thread, as described in previous work [2, 5, 8, 29]. RBH is simply computed as the number of shadow row-buffer hits divided by MAPI. These counters are readable by software. To implement TCM, additional logic is required to calculate priorities and cluster threads, as was done in [2].
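A software model of these counters might look like the sketch below: one MAPI counter per core and one shadow row-buffer index per (core, channel, rank, bank), with RBH computed as shadow hits divided by MAPI when the interval ends. The per-interval reset is our assumption.

```python
# Sketch of the profiling counters from Table 1.
from collections import defaultdict

class ProfilingCounters:
    def __init__(self):
        self.mapi = defaultdict(int)                 # core -> accesses this interval
        self.hits = defaultdict(int)                 # core -> shadow row-buffer hits
        self.shadow_row = {}                         # (core, channel, rank, bank) -> row

    def on_access(self, core, channel, rank, bank, row):
        self.mapi[core] += 1
        key = (core, channel, rank, bank)
        if self.shadow_row.get(key) == row:
            self.hits[core] += 1
        self.shadow_row[key] = row

    def end_interval(self, core):
        mapi, hits = self.mapi[core], self.hits[core]
        rbh = hits / mapi if mapi else 0.0
        self.mapi[core] = self.hits[core] = 0        # counters reset each interval
        return mapi, rbh

p = ProfilingCounters()
for row in (7, 7, 9, 9, 9):
    p.on_access(core=0, channel=0, rank=0, bank=3, row=row)
print(p.end_interval(0))    # (5, 0.6): five accesses, three shadow hits
```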
Software Support. Dynamic Bank Partitioning requires system software support to 1) read the counters provided by the hardware in the memory controller, 2) categorize applications into groups according to their memory access behavior as described in Rule 1, and 3) dynamically allocate page colors to cores according to Algorithm 1 at the end of each interval.
We use OS-based page coloring to map the memory requests of different threads to their allocated page colors. Once a thread is allocated a set of colors, DBP enforces this allocation: when a page fault occurs, the page fault handler attempts to allocate a free physical page in the assigned colors. If a thread runs out of physical frames in the colors allocated to it, the OS can assign frames of a different color at the cost of reduced isolation. The decision of which color's frames to spill into could be further explored, but this is beyond the scope of this paper (for the experimental workloads, a memory capacity of 0.5 to 1GB is enough [15, 18]). Experimental results on a real machine show that the overhead of page coloring is negligible [15].
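The page-fault path described above can be sketched as a color-aware allocator that first tries the faulting core's assigned colors and then spills to any free frame at the cost of isolation. The free-list organization and the spill policy (take from whichever color has a free frame) are assumptions for illustration.

```python
# Sketch of a color-aware frame allocator with a spill fallback.
from collections import defaultdict

class ColorAwareAllocator:
    def __init__(self, free_frames, frame_color, colors_of_core):
        self.free = defaultdict(list)                # color -> free frame numbers
        for f in free_frames:
            self.free[frame_color(f)].append(f)
        self.colors_of_core = colors_of_core         # core -> allowed colors

    def alloc(self, core):
        for color in self.colors_of_core[core]:      # preferred: assigned colors
            if self.free[color]:
                return self.free[color].pop()
        for frames in self.free.values():            # fallback: reduced isolation
            if frames:
                return frames.pop()
        raise MemoryError("out of physical frames")

alloc = ColorAwareAllocator(range(64), lambda f: f % 16, {0: [0, 1], 1: [2, 3]})
print(alloc.alloc(0), alloc.alloc(1))   # a color-0/1 frame, then a color-2/3 frame
```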
Page Migration. There are cases where an accessed page is present in colors other than the mapped colors, which we observe to be very rare in our workloads (less than 10%) for two reasons. First, when deciding the bank partitioning policy we preferentially allocate a thread as many of the page colors it held in former intervals as possible, which reduces such cases. Take an 8-core system with 32 memory banks as an example: each core has 4 banks under equal bank partitioning, and in our dynamic bank partitioning we preferentially allocate these 4 banks to the corresponding core. Second, an application's memory access behavior is relatively constant within an interval. Dynamically migrating these pages to the preferred colors could be beneficial; however, page migration incurs extra penalties (TLB and cache block invalidation overheads, as discussed in [8, 27]). Hence, page migration is not likely to gain much benefit, and we do not implement it in our work. However, migration can be incorporated into our work if needed.
5 Evaluation Methodology
5.1 Simulation Setup
We evaluate our proposal using gem5 [14] as the base architectural simulator and integrate DRAMSim2 [13] to simulate the details of the DRAM memory system. The memory subsystem is modeled using DDR3 timing parameters [10]. gem5 models a virtual-to-physical address translator; when a page fault occurs, the translator allocates a physical page to the virtual page. We modified the translator to support dynamic bank partitioning. We simulate out-of-order processors running Alpha binaries. Table 2 summarizes the major processor and DRAM parameters of our simulation. We set the length of the profiling interval to 100K memory cycles, MAPIt to 200, and RBHt to 0.5. We set ClusterThresh to 1/6, ShuffleInterval to 800, and ShuffleAlgoThresh to 0.1 for TCM, as presented in [2].
We simulate the following 7 configurations to evaluate our proposal.
1) FR-FCFS — First Ready, First Come First Serve scheduling.
2) BP — Bank Partitioning with first ready, first come first serve scheduling.
3) DBP — Dynamic Bank Partitioning with first ready, first come first serve scheduling.
4) TCM — Thread Cluster Memory scheduling.
5) MCP — Memory Channel Partitioning.
6) BP-TCM — Bank Partitioning employed with Thread Cluster Memory scheduling.
7) DBP-TCM — Dynamic Bank Partitioning integrated with Thread Cluster Memory scheduling.
5.2 Metrics
We measure system throughput using weighted speedup (WS) and the balance between system throughput and fairness using harmonic speedup (HS); system fairness is measured by maximum slowdown (MS):
$WS = \sum_{i} \frac{IPC_i^{shared}}{IPC_i^{alone}}, \qquad HS = \frac{N}{\sum_{i} IPC_i^{alone} / IPC_i^{shared}}, \qquad MS = \max_{i} \frac{IPC_i^{alone}}{IPC_i^{shared}}$
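As a worked example, the snippet below computes the three metrics from per-thread IPC values measured alone and in the shared run; the numbers themselves are made up.

```python
# Worked example of weighted speedup, harmonic speedup and maximum slowdown.
def metrics(ipc_alone, ipc_shared):
    slowdowns = [a / s for a, s in zip(ipc_alone, ipc_shared)]
    ws = sum(s / a for a, s in zip(ipc_alone, ipc_shared))   # weighted speedup
    hs = len(ipc_alone) / sum(slowdowns)                     # harmonic speedup
    ms = max(slowdowns)                                      # maximum slowdown
    return ws, hs, ms

print(metrics(ipc_alone=[2.0, 1.0, 0.5], ipc_shared=[1.6, 0.5, 0.2]))
# roughly (1.7, 0.52, 2.5): throughput, throughput/fairness balance, fairness
```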
5.3 Workload Construction
We use workloads constructed from the SPEC CPU2006 benchmarks [28] for evaluation. Table 3 shows the memory characteristics of the benchmarks. We cross-compiled each benchmark using gcc 4.3.2 with -O3 optimizations. In our simulation, each processor core is single-threaded and runs one program. We warm up the system for 100 million instructions before taking measurements, followed by a final 200 million instructions used in our evaluation. If a benchmark finishes its instructions before the others, its statistics are collected but it continues to execute so that it keeps exerting pressure on the memory system.
Memory intensity and row-buffer locality are the two key memory characteristics of an application. Memory intensity is represented by last-level cache misses per kilo-instruction (MPKI); row-buffer locality is measured by row-buffer hit rate (RBH). We classify the benchmarks into two groups: memory intensive (MPKI greater than 1) and non-intensive (MPKI less than 1); we further categorize each group into two sub-groups: high row-buffer locality (RBH greater than 0.5) and low row-buffer locality (RBH less than 0.5). We evaluate our proposals on a wide variety of multi-programmed workloads. We vary the fraction of memory intensive benchmarks in our workloads among 100%, 75%, 50%, 25%, and 0%, and construct 8 workloads in each category.
6 Results
We first evaluate the impact of our proposal on system performance. We measure system throughput using weighted speedup; the balance between system throughput and fairness is measured by harmonic speedup. We separately simulate the 7 configurations as described in Section 5.1. Figure 6 shows the average system throughput and harmonic speedup over all workloads; all results are normalized to the baseline FR-FCFS. The upper right part of the graph corresponds to better system throughput (higher weighted speedup) and a better balance between system throughput and fairness (higher harmonic speedup).
Figure 6. System performance over all workloads
DBP improves system throughput by 9.4% and harmonic speedup by 21.1% over the baseline FR-FCFS. Compared to BP, DBP improves system throughput by 4.2% and harmonic speedup by 15.1%. DBP-TCM provides 6.2% better system throughput (20% better harmonic speedup) than DBP, and 6.4% better system throughput (9% better harmonic speedup) than TCM. When compared with MCP, DBP-TCM improves system throughput by 5.3% and harmonic speedup by 35.3%.
We make two major conclusions based on the above results. First, dynamic bank partitioning is beneficial for system performance, as DBP takes into account applications' varying needs for bank level parallelism. Second, integrating dynamic bank partitioning and thread cluster memory scheduling provides better system performance than employing either alone.
6.1 System Throughput
We then evaluate the impact of our mechanism on system throughput in more detail. Figure 7 provides insight into where the performance benefits of DBP and DBP-TCM come from by breaking down performance based on the memory intensity of workloads. All results are normalized to the baseline FR-FCFS. Compared to FR-FCFS, dynamic bank partitioning improves system throughput by 9.4% on average. As expected, dynamic bank partitioning outperforms previous equal bank partitioning by increasing the bank level parallelism of BLP-sensitive benchmarks (4.3% better system throughput than BP over all workloads). When bank partitioning is employed with thread cluster memory scheduling (BP-TCM), the improvement over FR-FCFS is 11.3%, while integrating DBP and TCM (DBP-TCM) improves system throughput by 15.6%. Compared to MCP, DBP-TCM provides 5.3% better system throughput.
The workloads with 100% memory intensive benchmarks benefit most from dynamic bank partitioning for two reasons. On the one hand, these benchmarks recover the most spatial locality lost due to interference, and the increased row-buffer hit rate results in a system throughput improvement. On the other hand, DBP increases the bank level parallelism of memory intensive benchmarks with low row-buffer locality, thereby improving system throughput.
Figure 7. System throughput
Figure 8. System fairness
For highly memory intensive workloads (75% and 100%), reducing inter-thread interference via bank partitioning is more effective than memory scheduling: both BP and DBP outperform TCM, because contention for DRAM memory is severe in such workloads. Bank partitioning divides memory banks among cores and effectively reduces inter-thread interference, thus improving system throughput. In contrast, TCM tries to address contention by reordering requests and can only potentially recover a fraction of the row-buffer locality, and the effectiveness of memory scheduling in recovering locality is restricted by the limited scheduling buffer size.
Low memory intensity workloads (0%, 25%, and 50%) benefit more from TCM. TCM strictly prioritizes memory requests of the latency-sensitive cluster over the bandwidth-sensitive cluster, thus improving system throughput. Besides, TCM enforces a strict priority within the latency-sensitive cluster: the thread with the least memory intensity receives the highest priority. This strict priority allows "light" threads to quickly resume computation, thereby making a large contribution to overall system throughput. In contrast, the workloads with 0% memory intensive benchmarks gain almost no benefit from BP and DBP, since all the benchmarks are memory non-intensive. In such workloads, applications seldom generate memory requests and contention for DRAM memory is slight.
Based on the above analysis, we conclude that neither memory scheduling alone nor bank partitioning alone can provide the best system performance. A
combination of memory scheduling and bank partitioning is a more effective solution than pure scheduling or pure partitioning.
6.2 System Fairness
System throughput tells only half of the story. Figure 8 provides insight into the impact of our proposals on system fairness for the six workload categories with different memory intensity. System fairness is measured by maximum slowdown. Compared to FR-FCFS, bank partitioning improves system fairness by 2.1% on average; the improvement of dynamic bank partitioning is 18.1%. The benefit of thread cluster memory scheduling is 31.7%. Although MCP effectively improves system throughput, its improvement in system fairness is slight (5.7% over FR-FCFS, which is consistent with the result in [8]). When bank partitioning is employed with thread cluster memory scheduling (BP-TCM), the fairness improvement is 26.3%, while integrating DBP and TCM (DBP-TCM) improves system fairness by 42.7%.
For workloads with 0% memory intensive benchmark, all schemes have slight benefits for system fairness. As all benchmarks are non-intensive, the memory latency is relatively low.
Workloads with 25% memory intensive benchmarks suffer from equal bank partitioning. BP reduces the bank level parallelism of BLP-sensitive applications and increases their memory access latency, thereby increasing the memory slowdown of BLP-sensitive applications. DBP addresses this problem by increasing the bank level parallelism of BLP-sensitive applications, thereby improving system fairness.
Figure 9. Sensitivity to the profile interval length
Figure 10. Sensitivity to MAPIt
Figure 11. Sensitivity to RBHt
Table 4. Impact of different shuffling algorithms (compared to insertion shuffle)
Shuffling algorithm | TCM | Round-robin | Random
WS | 0.28% | 0.74% | 1.05%
MS | 4.3% | 8.5% | 13.8%
DBP and TCM together optimize memory contention, spatial locality, and bank level parallelism; the two methods applied together provide a fairer and more robust solution.
6.3 Impact of Shuffling Algorithm
TCM improves system fairness by shuffling priority among applications in the bandwidth-sensitive cluster. TCM also proposes the insertion shuffling algorithm to reduce inter-thread interference. However, insertion shuffle destroys the row-buffer locality of less nice threads (threads with high row-buffer locality are less nice and are deprioritized). In addition, prioritizing nicer threads increases the memory slowdown of less nice threads. Therefore, the improvement in system fairness is restricted.
We study the impact of four shuffling algorithms (insertion, TCM, round-robin, and random) when DBP-TCM employs each shuffle. Table 4 shows the system throughput (Weighted Speedup, WS) and fairness (Maximum Slowdown, MS) results; all results are normalized to insertion shuffle. Of the four shuffles, insertion shuffle shows the worst system performance: prioritizing nicer threads increases the slowdown of less nice threads, thus decreasing system throughput and leading to high maximum slowdown. TCM shuffle performs slightly better than insertion shuffle, since it dynamically switches between insertion and random shuffle to accommodate the disparate memory characteristics of diverse workloads. Round-robin and random shuffle provide more benefits, as DBP solves the inter-thread interference and there is therefore no need for insertion shuffle. Random shuffle provides the best results of the four shuffling algorithms.
We conclude that when TCM is employed along with DBP, the insertion shuffling algorithm is no longer needed, since DBP solves the interference issue.
6.4 Sensitivity Study
We first experiment with different profile interval lengths to study their performance impact on DBP and DBP-TCM
(Figure 9); all results are normalized to the baseline FR-FCFS. A short profile interval causes unstable MAPI and RBH values, and hence potentially inaccurate estimation of applications' memory characteristics. Conversely, a long profile interval cannot capture changes in an application's memory access behavior. A profile interval length of 100K memory cycles balances the downsides of short and long intervals and provides the best results.
We also vary the value of MAPIt to study the sensitivity of DBP and DBP-TCM (Figure 10). As the value of MAPIt increases, more applications fall into the non-intensive group. The system throughput first increases and then falls. To save memory banks for BLP-sensitive applications, DBP does not allocate dedicated memory banks to non-intensive applications, thus increasing the bank level parallelism of BLP-sensitive applications and improving system throughput. However, increasing MAPIt aggravates inter-thread interference; after some point, the drawbacks overwhelm the benefits, resulting in lower system throughput. The value of 200 is the best choice. We also evaluated our proposal using MPKI instead of MAPI and found that MPKI provides results similar to MAPI.
Figure 11 shows the trade-offs between row-buffer locality and bank level parallelism obtained by adjusting the row-buffer hit rate threshold RBHt. As RBHt increases, more memory intensive applications fall into the low RBL group. In some cases, DBP then makes two cores in the low RBL group share their colors to increase bank level parallelism, thereby improving system throughput. However, this also increases row-buffer conflicts, which negatively impacts DRAM memory performance. An RBHt value of 0.5 achieves a good balance between row-buffer locality and bank level parallelism and gives the best performance.
We vary ClusterThresh from 1/8 to 1/5 to study the performance of DBP-TCM; the results are shown in Table 5, where we compare our proposal with Thread Cluster Memory scheduling. DBP-TCM has a wide range of balanced points which provide both high system throughput and fairness. By tuning the clustering threshold, system throughput and fairness can be gently traded off for one another. We also experiment with different shuffle interval lengths (Table 5). Compared to TCM, system performance remains high and stable over a wide range of these values, and the best performance is observed at the value of 800.
Table 5 also provides the performance of DBP-TCM as
Table 5. DBP-TCM’s sensitivity to parameters (improvement compared to TCM)

        ClusterThresh                  ShuffleInterval                Number of Cores
        1/8    1/7    1/6    1/5       600    700    800    900       4      8      16
WS      4.8%   5.5%   6.2%   6.6%      5.7%   6.0%   6.2%   5.9%      3.9%   6.2%   6.5%
MS      19.3%  17.9%  16.7%  14.5%     17.8%  16.3%  16.7%  17.2%     6.9%   16.7%  18.0%
The results show similar trends as the number of cores increases. Therefore, our approach scales well with the number of cores.
7 Related Work

7.1 Memory Scheduling
A number of studies have focused on memory scheduling algorithms that aim to enhance system performance and/or fairness, or to maximize DRAM throughput.
Much prior work has focused on mitigating inter-thread interference at the DRAM memory system to improve the overall throughput and/or fairness of shared memory CMP systems. Most previous approaches address this problem using application-aware memory scheduling algorithms [1-6]. Fair queuing memory scheduling (FQM) [5, 6] adapts variants of the fair queuing algorithms of computer networks to the memory controller; FQM achieves some notion of fairness at the cost of system throughput. STFM [3] makes priority decisions that equalize the DRAM-related stall time experienced by each thread to achieve system fairness. PAR-BS [4] employs a parallelism-aware batch scheduling policy to balance system throughput and fairness. ATLAS [1] strictly prioritizes threads that have attained the least service from DRAM memory to maximize system throughput, but at a large expense of system fairness. MISE [19] proposes a memory-interference-induced slowdown estimation model that estimates slowdowns caused by interference to enforce system fairness. SMS [9] presents a staged memory scheduler which achieves high system throughput and fairness in integrated CPU-GPU systems. TCM [2] divides threads into two separate clusters and employs a different memory scheduling policy for each cluster; TCM achieves both high system throughput and fairness in CMP systems [2].
These memory access scheduling policies focus on improving system performance and fairness by considering threads' memory access behavior. They can potentially recover a fraction of the original spatial locality, but this ability is constrained because the scheduling buffer size is limited and the arrival interval of memory requests from a single thread is often large. Moreover, reclaiming row-buffer locality is not the primary design consideration of these policies. Memory access scheduling therefore cannot solve the interference problem effectively, leaving room to further improve system performance by resolving inter-thread interference.
Application-unaware memory scheduling policies [21-26] aim to maximize DRAM throughput, including the commonly employed FR-FCFS [23], which prioritizes row-buffer-hit memory requests over others. These scheduling policies do not distinguish between threads and do not take inter-thread interference into account. Therefore, they lead to low system throughput and are prone to starvation when multiple threads compete for the shared DRAM memory in general purpose multi-core systems, as shown in previous work [1-9].
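For reference, the FR-FCFS rule summarized above can be sketched as a simple per-bank selection function; the request fields below are hypothetical, and real controllers additionally respect DRAM timing constraints.

from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    row: int
    arrival_time: int

def frfcfs_pick(queue, open_row):
    # Prefer row-buffer hits; among the candidates, pick the oldest request.
    if not queue:
        return None
    hits = [r for r in queue if r.row == open_row]
    candidates = hits if hits else queue
    return min(candidates, key=lambda r: r.arrival_time)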
7.2 Memory Partitioning
Muralidhara et al. [8] propose application-aware memory channel partitioning (MCP). MCP maps the data of applications that are likely to severely interfere with each other to different memory channels. The key principle is to partition threads with different memory characteristics (memory intensity and row-buffer locality) onto separate channels. MCP can effectively reduce the interference between threads with different memory access behavior and improve system throughput.
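A rough sketch of MCP-style channel partitioning, as summarized above: threads are grouped by memory intensity and row-buffer locality, and each group's pages are steered to a disjoint channel subset. The thresholds and the even split of channels among groups are assumptions for illustration, not MCP's exact allocation policy.

def partition_channels(profiles, channels, mpki_t=1.5, rbh_t=0.5):
    # profiles: {thread_id: (mpki, rbh)}; returns {thread_id: assigned channels}.
    low, high_rbl, low_rbl = [], [], []
    for tid, (mpki, rbh) in profiles.items():
        if mpki < mpki_t:
            low.append(tid)                 # non-intensive threads
        elif rbh >= rbh_t:
            high_rbl.append(tid)            # intensive, streaming-like threads
        else:
            low_rbl.append(tid)             # intensive, random-access threads
    groups = [g for g in (low, high_rbl, low_rbl) if g]
    share = max(len(channels) // max(len(groups), 1), 1)
    mapping = {}
    for i, group in enumerate(groups):
        assigned = channels[i * share:(i + 1) * share] or [channels[-1]]
        for tid in group:
            mapping[tid] = assigned         # pages of tid go to these channels
    return mapping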
However, we find that MCP has several drawbacks. First, MCP cannot eliminate inter-thread interference, since threads with the same memory characteristics still share channel(s). Second, MCP nullifies fine-grained DRAM channel interleaving, which limits the peak memory bandwidth of individual threads and reduces channel level parallelism, thus degrading system performance. Third, MCP is only suited to multi-channel systems, while mobile systems usually have only one channel. Most importantly, MCP places highly memory intensive threads onto the same channel(s) to enable faster progress of low memory intensive threads. This leads to unfair memory resource allocation and physically exacerbates intensive threads' contention for memory, ultimately resulting in increased slowdown of these threads and high system unfairness.
Previous bank partitioning work [7, 15, 16, 29] proposes to partition memory banks among cores to isolate the memory accesses of different applications, thereby eliminating inter-thread interference and improving system performance. Liu et al. [16] implement and evaluate bank partitioning on a real system. However, all of these bank partitioning schemes are unaware of applications' varying demands for bank level parallelism. In some cases, bank partitioning restricts the number of banks available to an application and reduces bank level parallelism, significantly degrading system performance.
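The software approaches cited above typically realize bank partitioning through OS page coloring, roughly as sketched below; the address layout (which physical-address bits index the bank) is hypothetical and depends on the memory controller's mapping.

PAGE_SHIFT = 12     # 4 KiB pages
BANK_SHIFT = 13     # assumed position of the bank-index bits
NUM_BANKS = 8

def bank_color(phys_addr):
    # The color of a physical page is the DRAM bank it maps to.
    return (phys_addr >> BANK_SHIFT) % NUM_BANKS

def alloc_page(free_frames, allowed_colors):
    # Allocate a frame whose color lies in the core's assigned bank set.
    for frame in free_frames:
        if bank_color(frame << PAGE_SHIFT) in allowed_colors:
            free_frames.remove(frame)
            return frame
    raise MemoryError("no free frame with an allowed bank color")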
Jeong et al. [7] combine bank partitioning with sub-ranking [17] to increase the number of banks available to each thread, thus compensating for the reduced bank level parallelism caused by bank partitioning. However, sub-ranking increases data transfer time and reduces peak memory bandwidth, so the performance improvement is slight and sub-ranking sometimes even hurts system performance. In addition, conventional DRAM DIMMs must be modified to support sub-ranking.
8 Conclusion
Bank partitioning divides memory banks among cores and eliminates inter-thread interference. Previous bank partitioning allocates an equal number of memory banks to each core without considering the disparate needs of individual threads for bank level parallelism. In some cases, it restricts the number of banks available to an application and reduces bank level parallelism, thereby significantly degrading system performance. To compensate for the reduced bank level parallelism, we present Dynamic Bank Partitioning, which takes into account threads’
demands for bank level parallelism. DBP improves system throughput, while maintaining fairness.
Memory access scheduling focuses on improving system performance by considering threads' memory access behavior and system fairness, but the improvement it can deliver is restricted by its limited ability to reclaim the original spatial locality. Bank partitioning focuses on preserving row-buffer locality and resolves the inter-thread interference issue. These two kinds of methods are complementary and orthogonal to each other: memory scheduling benefits from improved spatial locality, and bank partitioning benefits from better scheduling. Therefore, we introduce a comprehensive approach which integrates Dynamic Bank Partitioning and Thread Cluster Memory scheduling; applied together, the two methods reinforce each other. We show that this combination, unlike DBP or TCM alone, is able to simultaneously and significantly improve overall system throughput and fairness.
Experimental results using eight-core multiprogramming workloads show that the proposed DBP improves system performance by 9.4% and system fairness by 18.1% over FR-FCFS on average; the improvements of our comprehensive approach DBP-TCM are 15.6% and 42.7%. Compared with TCM, which is one of the best memory scheduling policies, DBP-TCM improves system throughput by 6.2% and fairness by 16.7%. When compared with MCP, DBP-TCM provides 5.3% better system throughput and 37% better system fairness. We conclude that our approaches are effective in enhancing both system throughput and fairness.
Acknowledgments
We thank the anonymous reviewers for their valuable and helpful comments and suggestions. This research was partially supported by the Major National Science and Technology Special Projects of China under Grant No. 2009ZX01029-001-002.
References
[1] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers. HPCA, 2010.
[2] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. MICRO, 2010.
[3] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. MICRO, 2007.
[4] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. ISCA, 2008.
[5] K. Nesbit, N. Aggarwal, J. Laudon, and J. Smith. Fair Queuing Memory Systems. MICRO, 2006.
[6] N. Rafique, W.-T. Lim, and M. Thottethodi. Effective Management of DRAM Bandwidth in Multicore Processors. PACT, 2007.
[7] M. Jeong, D. Yoon, D. Sunwoo, M. Sullivan, I. Lee, and M. Erez. Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems. HPCA, 2012.
[8] S. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda. Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning. MICRO, 2011.
[9] R. Ausavarungnirun, K. Chang, L. Subramanian, G. Loh, and O. Mutlu. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems. ISCA, 2012.
[10] Micron Technology, Inc. Micron DDR3 SDRAM Part MT41J256M8, 2006.
[11] H. Park, S. Baek, J. Choi, D. Lee, and S. Noh. Regularities Considered Harmful: Forcing Randomness to Memory Accesses to Reduce Row Buffer Conflicts for Multi-Core, Multi-Bank Systems. ASPLOS, 2013.
[12] B. Jacob, S. W. Ng, and D. T. Wang. Memory Systems: Cache, DRAM, Disk. Elsevier, 2008.
[13] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A Cycle Accurate Memory System Simulator. IEEE CAL, vol. 10, pages 16-19, 2011.
[14] N. Binkert, B. Beckmann, G. Black, S. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. Hill, and D. Wood. The Gem5 Simulator. SIGARCH CAN, vol. 39, pages 1-7, 2011.
[15] W. Mi, X. Feng, J. Xue, and Y. Jia. Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors. NPC, 2010.
[16] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu. A Software Memory Partition Approach for Eliminating Bank-level Interference in Multicore Systems. PACT, 2012.
[17] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu. Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency. MICRO, 2008.
[18] J. L. Henning. SPEC CPU2006 Memory Footprint. ACM SIGARCH Computer Architecture News, 2007.
[19] L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, and O. Mutlu. MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems. HPCA, 2013.
[20] H. Zheng, J. Lin, Z. Zhang, and Z. Zhu. Memory Access Scheduling Schemes for Systems with Multi-core Processors. ICPP, 2008.
[21] I. Hur and C. Lin. Adaptive History-based Memory Schedulers. MICRO, 2004.
[22] S. A. McKee, W. A. Wulf, J. Aylor, R. Klenke, M. Salinas, S. Hong, and D. Weikle. Dynamic Access Ordering for Streamed Computations. IEEE TC, vol. 49, pages 1255-1271, Nov. 2000.
[23] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory Access Scheduling. ISCA, 2000.
[24] J. Shao and B. T. Davis. A Burst Scheduling Access Reordering Mechanism. HPCA, 2007.
[25] C. Natarajan, B. Christenson, and F. Briggs. A Study of Performance Impact of Memory Controller Features in Multi-processor Server Environment. WMPI, 2004.
[26] W. K. Zuravleff and T. Robinson. Controller for a Synchronous DRAM that Maximizes Throughput by Allowing Memory Requests and Commands to be Issued Out of Order. U.S. Patent 5,630,096, May 1997.
[27] M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, and A. Davis. Handling the Problems and Opportunities Posed by Multiple on-chip Memory Controllers. PACT, 2010.
[28] SPEC CPU2006. http://www.spec.org/spec2006.
[29] M. Xie, D. Tong, Y. Feng, K. Huang, and X. Cheng. Page Policy Control with Memory Partitioning for DRAM Performance and Power Efficiency. ISLPED, 2013.
[30] T. Karkhanis and J. E. Smith. A Day in the Life of a Data Cache Miss. WMPI, 2002.
[31] H. Cheng, C. Lin, J. Li, and C. Yang. Memory Latency Reduction via Thread Throttling. MICRO, 2010.