Transcript of LEADERSHIP SKILLS, EMOTIONAL INTELLIGENCE AND SPIRITUAL
untitledImproving System Throughput and Fairness Simultaneously in
Shared Memory CMP Systems via Dynamic Bank Partitioning
Mingli Xie, Dong Tong, Kan Huang, Xu Cheng Microprocessor Research
and Development Center, Peking University, Beijing, China
{xiemingli, tongdong, huangkan, chengxu}@mprc.pku.edu.cn
Abstract Applications running concurrently in CMP systems
interfere with each other at DRAM memory, leading to poor system
performance and fairness. Memory access scheduling reorders memory
requests to improve system throughput and fairness. However, it
cannot resolve the interference issue effectively. To reduce
interference, memory partitioning divides memory resource among
threads. Memory channel partitioning maps the data of threads that
are likely to severely interfere with each other to different
channels. However, it allocates memory resource unfairly and
physically exacerbates memory contention of intensive threads, thus
ultimately resulting in the increased slowdown of these threads and
high system unfairness. Bank partitioning divides memory banks
among cores and eliminates interference. However, previous equal
bank partitioning restricts the number of banks available to
individual thread and reduces bank level parallelism.
In this paper, we first propose a Dynamic Bank Partitioning (DBP),
which partitions memory banks according to threads’ requirements
for bank amounts. DBP compensates for the reduced bank level
parallelism caused by equal bank partitioning. The key principle is
to profile threads’ memory characteristics at run-time and estimate
their demands for bank amount, then use the estimation to direct
our bank partitioning.
Second, we observe that bank partitioning and memory scheduling are
orthogonal in the sense; both methods can be illuminated when they
are applied together. Therefore, we present a comprehensive
approach which integrates Dynamic Bank Partitioning and Thread
Cluster Memory scheduling (DBP-TCM, TCM is one of the best memory
scheduling) to further improve system performance.
Experimental results show that the proposed DBP improves system
performance by 4.3% and improves system fairness by 16% over equal
bank partitioning. Compared to TCM, DBP-TCM improves system
throughput by 6.2% and fairness by 16.7%. When compared with MCP,
DBP-TCM provides 5.3% better system throughput and 37% better
system fairness. We conclude that our methods are effective in
improving both system throughput and fairness.
1. Introduction Applications running concurrently contend with each
other for the shared memory in CMP systems. Hence, thread can be
slowed down compared to when it runs alone and entirely owns the
memory system. If the memory contention is not properly managed, it
can degrade both individual thread and overall system perfor-
mance, simultaneously causing system unfairness [1-8]. In addition,
memory streams of different threads are inter-
leaved and interfere with each other at DRAM memory; the
inter-thread interference destroys original spatial locality and
bank level parallelism of individual threads, thus severely
degrading system performance [4, 7, 14, 15, 29]. As the number of
cores on a chip continues to grow, the contention for the limited
memory resources and the interference become more serious [4,
31].
The effectiveness of a memory system is commonly evaluated by two
objectives: system throughput and fairness [1-9, 11, 16]. On the
one hand, the overall system throughput should remain high. On the
other hand, no single thread should be disproportionately slowed
down to ensure fairness. Previously proposed out-of-order memory
access scheduling policies [1-6, 9, 23] reorder memory accesses to
improve system throughput and/or fairness. Thread Clustering Memory
scheduling (TCM) [2] is one of the best scheduling among these
policies. Memory access scheduling policies can potentially recover
a portion of original spatial locality. However, the primary design
consideration of these policies is not reclaiming the lost
locality. Furthermore, the effective- ness in recovering locality
of memory scheduling is restricted due to the limited scheduling
buffer size and the arrival interval (often large) of memory
requests from a single thread. In a word, memory access scheduling
can somewhat decrease inter-thread interference, but it does not
address the issue at its root. As shown in Figure 1, there is large
disparity between TCM (threads share DRAM memory) and the optimal
system performance (thread runs alone). Therefore, reducing
inter-thread interference can further improve system
performance.
Memory partitioning divides memory resource among threads to reduce
interference. MCP [8] maps memory intensive and non-intensive
threads onto different channel(s) to enable faster progress of
non-intensive threads. MCP allocates memory resource unfairly and
physically exacerbates intensive threads’ contention for memory,
thus ultimately resulting in the increased slowdown of these
threads and high system unfairness (system fairness is approximate
to FR-FCFS [23], as shown in Figure 1). Bank partitioning [7, 15,
16, 29] partitions the internal memory banks among cores to isolate
memory access streams of different threads and eliminates
interference. It solves the interference at its root and improves
DRAM system performance. However, bank partitioning allocates an
equal number of banks to each core without considering the varying
demands for bank level parallelism of individual thread. In some
cases, BP restricts the number of banks available to a thread and
reduces bank level parallelism, thereby significantly degrading
system performance.
To compensate for the reduced bank level parallelism of equal bank
partitioning, we propose Dynamic Bank Partitioning (DBP) which
dynamically partitions memory
978-1-4799-3097-5/14/$31.00 ©2014 IEEE978-1-4799-3097-5/14/$31.00
©2014 IEEE
0
1
2
3
4
5
6
M ax
im um
S lo
w do
w n
System Throughput
Figure 1. System Throughput and Fairness of state-of-the-art
approaches
banks to accommodate threads’ disparate requirements for bank
amounts. The key idea is to profile threads’ memory characteristics
at run-time and estimate their needs for BLP, then use the
estimation to direct our bank partitioning. The main goals of DBP
are to 1) isolate memory intensive threads with high row-buffer
locality from other memory intensive threads to reserve spatial
locality, 2) provide as many as (8 or 16) banks to memory intensive
thread to improve bank level parallelism.
Although bank partitioning solves the interference issue, the
improvement of system throughput and fairness is limited. Previous
work [1, 4, 8, 20] shows that prioritizing memory non-intensive
threads enables these threads to quickly resume computation, hence
ultimately improving system throughput. Bank partitioning alone
cannot ensure this priority, thus the system throughput improvement
is constrained. In addition, bank partition- ing does not take into
account system fairness, and its ability to improve fairness is
also restricted.
We find that memory partitioning is orthogonal to memory
scheduling. As memory partitioning focuses on preserving row-buffer
locality, out-of-order memory scheduling focuses on improving
system performance by considering threads’ memory access behavior
and system fairness. Memory scheduling policies can benefit from
improved spatial locality, while memory partitioning can benefit
from better scheduling. Therefore, we propose a comprehensive
approach which integrates Dynamic Bank Partitioning and Thread
Clustering Memory scheduling [2]. Together, the two methods can
illuminate each other. TCM divides threads into two separate
clusters and employs different scheduling policy for each cluster.
TCM also introduce an insertion shuffling algorithm to reduce
interference. While DBP resolves the interference issue, there is
no need for insertion shuffle. We shrink the complicated insertion
algorithm to random shuffle. Experimental results show that this
powerful combination is able to enhance both overall system
throughput and fairness significantly.
Contributions. In this paper, we make the following
contributions:
—We introduce a dynamic bank partitioning to compensate for the
reduced bank level parallelism of equal bank partitioning. DBP
dynamically partitions memory banks according to threads’ needs for
bank amounts. DBP not only solves the inter-thread interferen- ce,
but also improves bank level parallelism of individual thread. DBP
effectively improves system throughput, while maintaining
fairness.
—We also present a comprehensive approach which combines Dynamic
Bank Partitioning and Thread Cluster Memory scheduling, we call it
DBP-TCM. TCM focuses
on improving memory scheduling by considering threads’ memory
access behavior and system fairness. DBP is orthogonal to TCM, as
DBP focuses on preserving row-buffer locality and improving bank
level parallelism. When DBP is applied on top of TCM, both methods
are benefited.
—We simplify the shuffling algorithm of TCM by shrinking the
complicated insertion shuffle to random shuffle, and simplify the
implementation.
—We compare our proposals with several related approaches. DBP
improves system throughput by 9.4% and improves system fairness by
18.1% over FR-FCFS. Compared with BP, DBP improves system
throughput by 4.3% and improves system fairness by 16.0%. DBP-TCM
provides 15.6% better system throughput (and 42.7% better system
fairness) than FR-FCFS. We also compare our proposal with TCM.
Experimental results show that DBP-TCM outperforms TCM in terms of
both system throughput and fairness, DBP-TCM improves system
throughput by 6.2% and fairness by 16.7% over TCM. When compared
with MCP, DBP-TCM provides 5.3% better system throughput and 37%
better system fairness.
The rest of this paper is organized as follows: Section 2 provides
background information and motivation. Section 3 introduces our
proposals. Section 4 describes impleme- ntation details. Section 5
shows the evaluation methodo- logy and Section 6 discusses the
results. Section 7 pre- sents related work and Section 8
concludes.
2. Background and Motivation 2.1 DRAM Memory System Organization We
present a brief introduction about DRAM memory system; more details
can be found in [12]. In modern DRAM memory systems, a memory
system consists of one or more channels. Each channel has
independent address, data, and command buses. Memory accesses to
different channels can proceed in parallel. A channel typically
consists of one or several ranks (multiple ranks share the address,
data, and commands buses of a channel). A rank is a set DRAM
devices that operate in lockstep in response to a given command.
Each rank consists of multiple banks (e.g. DDR3 8 banks/rank). A
bank is organized as a two-dimensional structure, consisting of
multiple rows and columns. A location in the DRAM is thus accessed
using a DRAM address consisting of channel, rank, bank, row, and
column fields. 2.2 Memory Access Behavior We define an
application’s memory access behavior using three components: memory
intensity [1, 2, 8, 29], row- buffer locality [2, 8, 29], and bank
level parallelism [4, 7] as identified by previous work.
Memory Intensity. Memory intensity is the frequency that an
application generates memory requests or misses in the last-level
cache. In this paper, we measure it in the terms of memory accesses
per interval (MAPI) instead of LLC misses per thousand instructions
(MPKI), since the per-thread instruction counts are not typically
available in memory controller. We also evaluate our proposal using
MPKI, we found that it provides similar results.
Row Buffer Locality. In modern DRAM memory systems, each bank has a
row buffer that provides temporary data storage of a DRAM row. If a
subsequent memory access goes to the same row currently in the row
buffer, only column access is necessary; this is called a
row-buffer hit. Otherwise, there is a row-buffer conflict, memory
controller has to first precharge the opened row, activate another
row, and then perform column access.
Figure 2. Generalized Thread Clustering Memory scheduling and
Memory Channel Partitioning
Row-buffer hit can be serviced much more quickly than conflict. The
Row-Buffer Locality (RBL) is measured by the average Row-Buffer Hit
rate (RBH).
Bank Level Parallelism. Bank is a set of independent memory arrays
inside a DRAM device. Modern DRAM devices contain multiple banks
(DDR3: 8 banks), so that multiple, independent accesses to
different DRAM arrays can proceed in parallel subject to the
availability of the shared address, command, and data busses [12].
As a result, the long memory access latency can be overlapped and
DRAM throughput can be improved, thus leading to high system
performance. The notion of servicing multiple memory accesses in
parallel in different banks is called Bank Level Parallelism
(BLP).
In modern out-of-order processors, a memory access that reaches the
head of ROB (reorder buffer) will cause structural stall. Due to
the high latency of off-chip DRAM, most memory accesses reach the
head of ROB, thus negatively impacting system performance [30].
Moreover, memory accesses which miss in the open row causes
additional precharge and activate, further exacerb- ating this
problem. If threads with low row-buffer locality generate multiple
memory accesses to different banks, they can proceed in parallel in
memory system [12]. Exploiting BLP hides memory access latency and
dramatically improves system performance. 2.3 Memory Scheduling
Memory access scheduling policies [1-6, 23] reorders memory
requests to improve system throughput and fairness. In this
section, we introduce Thread Clustering Memory scheduling [2] in
detail. As shown in Figure 2(a), TCM periodically divides
applications into latency- sensitive and bandwidth-sensitive
clusters based on their memory intensity at fixed-length time
intervals. The least memory-intensive threads are placed in
latency-sensitive cluster, while other threads are in
bandwidth-sensitive cluster. TCM use a parameter called
ClusterThresh to specify the amount of memory bandwidth consumed by
latency-sensitive cluster, thus limiting the number of threads in
latency-sensitive cluster.
TCM strictly prioritizes memory requests of threads in
latency-sensitive cluster over bandwidth-sensitive cluster. Because
servicing memory requests from such “light” threads allows them to
quickly resume computation, and hence improving system throughput
[1, 4, 20]. To further improve system throughput and to minimize
unfairness, TCM employs different memory scheduling policies
in
each cluster. For latency-intensive cluster, thread with the least
memory intensity receives the highest priority. This ensures
threads spend most of their time at the processor, thus allowing
them to make fast progress and make large contribution to overall
system throughput. Threads in bandwidth-sensitive cluster share the
remaining memory bandwidth, no single thread should be
disproportionately slowed down to ensure system fairness. To
minimize unfairness, TCM periodically shuffles (ShuffleInterval)
priority ordering among threads in bandwidth-sensitive
cluster.
TCM also proposes a notion of niceness, which reflects threads’
propensity to cause interference and their susceptibility to
interference. Threads with high row-buffer locality are less nice
than others, while threads with high bank-level parallelism are
nicer. Based on niceness, TCM introduces an insertion shuffling
algorithm to minimize inter-thread interference by enforcing nicer
threads are prioritized more often than others. TCM employs dynamic
shuffling, which switches back and forth between random shuffle and
insertion shuffle according to the composition of workload. If
threads have similar memory behavior, TCM disables insertion
shuffle and employs random shuffle to prevent unfair
treatment.
Out-of-order memory scheduling improves system performance by
considering threads’ memory access behavior and system fairness.
However, it cannot solve the interference issue effectively due to
several reasons. First, the primary design consideration of
scheduling is not recovering the lost spatial locality. Second, the
effectiveness in reclaiming locality is constrained, as the
scheduling buffer size is limited and the arrival interval of
memory requests from a single thread is often large. As the number
of core increases, the recovering ability will become less
effective since buffer size does not scale well with the number of
cores per chip, and hence negatively affecting the scheduling
capability. 2.4 Memory Partitioning Memory partitioning divides
memory resource among cores to reduce inter-thread interference.
Memory Channel Partitioning [8] maps the data of threads with
different memory characteristics onto separate memory channels. MCP
also integrates memory partitioning with a simple memory scheduling
to further improve system throughput. MCP can effectively reduce
the interference of threads with different memory access behavior
and im-
Number of banks × Number of ranks Figure 3. Influence of bank
amounts on single
thread’s performance
Figure 4. Trade-offs between bank level parallelism and row-buffer
locality
prove system throughput. However, the improvement of system
throughput comes at the cost of system fairness. As shown in Figure
2(b), MCP allocate memory channels unfairly and inherently
exacerbates intensive threads’ contention for high load memory
resource, thus ultimate- ly resulting in the increased memory
slowdown of these threads and high system unfairness.
Bank partitioning [7, 15, 16, 29] divides the internal memory banks
among cores to isolate memory access streams from different
threads, hence eliminating the inter-thread interference. Bank
partitioning is implement- ed by extending the operating system
physical frame allocation algorithm to enforce that physical frames
mapped to the same DRAM bank is exclusively allocated to a single
thread. The detailed implementation of bank partitioning can be
found in [7, 6, 29]. Bank partitioning divides memory internal
banks among cores by allocating different page colors to different
cores, thus avoiding inter-thread interference. When a core runs
out of physic- al frames allocated to it, operating system can
assign different colored frames at the cost of reducing
isolation.
However, the benefits of bank partitioning are not ful- ly
exploited, since previous BP partitions equal number of banks to
each core without considering the disparate needs for bank level
parallelism of individual application. In some cases, BP restricts
the number of banks available to an application and reduces bank
level parallelism, which significantly degrades system performance.
2.5 Motivation In this section, we motivate our partitioning by
showing how bank level parallelism impacts threads’
performance.
The Impact of Bank Level Parallelism. We study how the amount of
bank influences single thread’s performance by performing
experiments on gem5 (the detailed experimental parameters are in
Section 5). For each application, we fix the available banks from 2
to 64 to see the performance change. We get similar results as in
[7, 16]. Figure 3 shows the correlation between bank amounts and
threads’ performance, which depicts threads’ disparate needs for
bank level parallelism, all the results are normalized to 64
banks.
Basically, the results can be classified into three groups. First,
the applications with low memory intensity (e.g. gromacs, namd,
povray, etc.) show almost no perform- ance changes with the bank
amount varying. Because these applications seldom generate memory
requests, memory latency accounts for a small fraction of the
total
program execution time. Furthermore, the requests arrival interval
is long and there is hardly bank level parallelism to be
exploited.
Second, memory intensive applications with low row-buffer locality
(e.g. mcf, bwaves, zeusmp, etc.) are sensitive to the number of
banks in memory system; the performance first rises, and then
remains stable as the number of banks increases. As described in
Section 2.2, the exploitation of bank level parallelism hides
memory access latency and improves performance. BLP improves system
performance undoubtedly, whereas its benefits are limited. Since a
single thread is unable to generate enough concurrent memory
accesses [7, 16, 30] due to the combination of many factors such as
limited number of MSHRs, high cache hit rate and memory dependency.
Therefore, after some point, the available bank amount exceeds
threads’ needs and performance remains stable.
Third, the performance of memory intensive applica- tions with high
row-buffer locality (e.g. libq, milc, leslie3d, etc.) also improves
owing to the benefits brought by BLP, and then stays stable. While
the performance improvement is slower than those applications’ in
the second group. Benchmark lbm is an outlier, since it shows a
higher row-buffer hit rate as the number of bank increases. These
applications that benefit from the increased bank amount are
BLP-sensitive.
In summary, the performance of memory intensive applications are
sensitive to bank amounts and most applications give peak
performance at 8 or 16 banks, while non-intensive applications’ are
not sensitive to bank amounts. Due to the 2 cycle rank-to-rank
switching penalty, after this point, applications show no further
performance improvement or even perform worse.
The Trade-offs between Bank Level Parallelism and Row-Buffer
Locality. We also study the trade-offs between bank level
parallelism and row-buffer locality of memory intensive
applications with low spatial locality. In our experiments, we run
two threads concurrently. The example in Figure 4 illustrates how
increasing bank level parallelism can improve system throughput.
The column BP means two threads have their own memory banks, and
shared means two threads share all banks evenly.
In the first case, 2 instances of mcf running together, we use bank
partitioning [7, 15, 16, 29] divides memory banks among threads and
each thread can only access its own four banks; the row-buffer
locality is reserved. However, the original spatial locality of
these applicat- ions is low; recovering row-buffer locality brings
limited
0.6
0.7
0.8
0.9
1.0
1.1
2x1 4x1 8x1 8x2 8x4 8x6 8x8
hmmer gromacs namd povray dealII bwaves cactusADM zeusmp mcf lbm
leslie3d milc libquatum
N or
m al
iz ed
1
1.2
1.4
1.6
1.8
2
8 16 32 64 8 16 32 64 8 16 32 64
W ei
gh te
d S
pe ed
Figure 5. Overview of Our Proposal
Rule 1. Application Classifying at the end of each interval for
each application do
if MAPIi < MAPIt Application i is classified into non-intensive
group (NI)
else if RBHi < RBHt
Application i is into low RBL group (L-RBL) else
Application i is into high RBL group H-RBL) benefits. The second
situation is that the two threads share the 8 memory banks evenly
to increase bank level parallelism. Although evenly share memory
banks reduces row-buffer locality, the increased bank level
parallelism provides more system performance benefits than the
elimination of interference. When the number of banks rises to 16
or 32, increasing bank level parallelism also outperforms reducing
interference. As the amount of banks rises to 64, increasing bank
level parallelism by evenly sharing all banks negatively impacts
DRAM performance, since most applications give peak perfor- mance
at 8 or 16 banks. The 2 instances of bwaves running together show
similar results. When two different threads (mcf and bwaves)
running together, evenly share their memory banks also improves
system throughput.
3 Our Proposals 3.1 Overview of Our Proposal To accommodate the
disparate memory needs for bank level parallelism of concurrently
running applications, we propose Dynamic Bank Partitioning. There
are two key principles of DBP. First, to save memory banks for BLP-
sensitive threads, DBP does not allocate dedicated memory banks for
threads in non-intensive group. Instead, they can access all memory
banks (as shown in Figure 5). Because these threads only generate a
small amount of memory requests, and their propensities to cause
interfer- ence are low. Second, for memory intensive threads with
high row-buffer locality, DBP isolates their memory requests from
other intensive threads to reserve spatial locality. While for
memory intensive threads with low row-buffer locality, DBP balances
spatial locality and bank level parallelism. We will discuss DBP in
detail in the next section. To further improve system performance,
we propose a comprehensive approach which integrates Dynamic Bank
Partitioning and Thread Clustering Memory scheduling.
DBP resolves the interference issue while TCM improve system
throughput and fairness by considering threads’ memory access
behavior. These two methods are orthog- onal in the sense, when
applied together both concepts are benefited. 3.2 Dynamic Bank
Partitioning Based on the observations in Section 2.5, we propose a
dynamic bank partitioning which takes into account both
inter-thread interference and applications’ disparate needs for
bank level parallelism. DBP consists of three steps: 1) profiling
applications’ memory access behavior at run time, 2) classifying
applications into groups, 3) making bank partitioning decisions.
The policy proceeds periodic- ally at fixed-length time intervals.
During each interval applications’ memory access behavior is
profiled. We category applications into groups based on the
profiled characteristics, and then decide the bank partitioning
policy at the end of each interval. The decided bank partitioning
is applied in the subsequent interval.
Profiling Applications’ Characteristics. As described in Section
2.2, memory intensity and row-buffer locality are the two main
factors that determine the benefits of bank level parallelism.
Memory intensity is measured in the unit of memory accesses per
interval (MAPI); spatial locality is measured by row-buffer hit
rate (RBH). Therefore, each thread’s MAPI and RBH are collected
during each interval.
Grouping Applications. The second step is to classify applications
into groups. We first categorize applications into memory intensive
(MI) and non-intensive (NI) groups according to their MAPI. We
further classify memory-intensive applications into low row-buffer
locality (L-RBL) and high row-buffer locality (H-RBL) groups based
on their RBH. We use two threshold para- meters MAPIt and RBHt to
classify applications. DBP categorizes applications using the rules
shown in Rule 1.
=⋅ <⋅⋅
corebankrank
corebankrankMPU
At the end of each interval, our policy dynamically selects bank
partitioning based on the proportion of each group among all
applications. Algorithm 1 shows the pseudo code of our dynamic bank
partitioning.
In the first case, all applications are memory non- intensive, we
assign an equal number of colors to each
Algorithm 1. Dynamic Bank Partitioning Definition: MI: the
proportion of memory intensive applications among
all applications NI: non-intensive group H-RBL: high row-buffer
locality group L-RBL: low row-buffer locality group MPU: the
minimum partitioning unit
Dynamic Bank Partitioning: (at the end of each interval) if MI = 0%
Allocate an equal number of colors to each core else if 0% < MI
< 100%
Application in NI group can access all colors Allocate MPU colors
to each core in H-RBL group if MPU < 16
Each two core shares MPU 2 colors in L-RBL group else
Allocate MPU colors to each core in L-RBL group else if MI =
100%
Allocate MPU colors to each core in H-RBL group if MPU <
16
Each two core shares MPU 2 colors in L-RBL group else
Allocate MPU colors to each core in L-RBL group core to isolate
memory access streams from different threads, since increasing bank
amounts cannot improve system throughput for these applications
(Figure 3).
The second case is the proportion of memory intensive application
is bigger than 0%, but less than 100%. To save memory banks for
BLP-sensitive applications, we do not allocate dedicate colors for
applications in NI group. Instead, these applications can access
all memory banks, since they only generate a small amount of memory
accesses, and their propensities to cause interference are low. We
isolate H-RBL threads from other memory intensive threads by
allocating each core MPU colors to reserve row-buffer locality.
While for threads in L-RBL group, there are two circumstances. If
MPU is less than 16, each two cores share MPU2 colors to improve
bank level parallelism, since improving bank level parallelism
brings more benefits than eliminating interference (as shown in
Figure 4).1 Otherwise, we allocate MPU colors to each core in L-RBL
group, since most threads give peak performance at 8 or 16 banks
(Figure 5).
The last case is that all applications are memory intensive. For
applications in H-RBL group, we allocate MPU colors to each core to
isolate memory accesses from other cores, thus reducing
inter-thread interference. If MPU is bigger than 16, we allocate
MPU colors to each core in L-RBL group. Otherwise, each two cores
evenly share MPU2 colors to increase bank level parallelism. 3.3
Integrated Dynamic Bank Partitioning and
Memory Scheduling Bank partitioning aims to solve inter-thread
memory interference purely using OS-based page coloring. In
contrast, various existing approaches try to improve system
performance entirely in hardware using sophisti- cated memory
scheduling policies (e.g. [1-6]) by considering threads’ memory
access behavior and system fairness. The question is whether either
alone (bank partitioning alone or memory scheduling alone) can
provide the best system performance. According to our observations
below, the answer is negative. 1 If the number of threads in L-RBL
group is odd, we allocate MPU colors to the last core in L-RBL
group.
TCM reorders memory requests to improve system performance and can
potentially recover some of the lost spatial locality. However, the
primary design considerat- ion is not reclaiming the lost locality.
Furthermore, the effectiveness in recovering locality is
constrained due to the limited scheduling buffer size and the
arrival interval (often large) of memory requests from a single
thread. As the number of core increases, the recovering will become
less effective since buffer size does not scale well with the
number of cores per chip, thus negatively affecting the scheduling
capability.
Bank partitioning effectively reduces inter-thread interference
using OS-based page coloring. However, the improvement of system
throughput and fairness is limited. Previous work [1, 4, 8, 20]
shows that prioritizing memory non-intensive threads enables these
threads to quickly continue with their computation, thereby
significantly improving system throughput. While bank partitioning
alone cannot ensure this priority since it does not change the
memory scheduling policy. Therefore, the improvement of system
throughput is restricted. In addition, bank partitioning does not
take into account system fairness, and hence its ability to improve
fairness is constrained.
TCM aims to improve system performance by considering threads’
memory access behavior and system fairness, while dynamic bank
partitioning focuses on preserving row-buffer locality and
improving the reduced bank level parallelism. These two kinds of
method are complementary to each other. TCM can benefit from
improved spatial locality, and bank partitioning can benefit from
better scheduling. Therefore, we present a comprehensive approach
which integrates Dynamic Bank Partitioning and Thread Cluster
Memory scheduling to further improve system throughput and
fairness.
To save memory banks, DBP does not separate non-intensive threads’
memory requests from memory intensive threads, which exacerbates
the interference experienced by non-intensive threads. TCM strictly
prioritizes latency-sensitive cluster which consists of
non-intensive threads, thus nullifying the interference caused by
dynamic bank partitioning.
The insertion shuffling algorithm of TCM aims to reduce
inter-thread interference. However, we find that it destroys the
row-buffer locality of less nice threads. In addition, prioritizing
nicer threads increases memory slowdown of less nice threads, thus
leading to high maximum slowdown. Our comprehensive approach
combines TCM with DBP, dynamic bank partitioning solves the
inter-thread interference. Therefore, instead of insertion shuffle
we use random shuffle. Random shuffle not only saves implementation
complexity of monitoring BLP and calculating niceness values of
each thread, but also improves system performance (as shown in
Section 6.3).
The prioritization rules of DBP-TCM are: 1) threads of
latency-sensitive cluster are prioritized over bandwidth- sensitive
cluster, 2) within latency-sensitive threads, the less intensive
threads are prioritized over others, 3) within bandwidth-sensitive
threads, prioritization is determined by the shuffling algorithm,
4) when two memory requests have the same priority, row-buffer hit
requests are favored. 5) or else, older requests are
prioritized.
4 Implementation Hardware Support. Our approach requires hardware
support to 1) profile threads’ memory access behavior at run-time,
and 2) schedule memory requests as described.
Table 1. Hardware overhead Function Size (bits) MAPI-counter Memory
accesses per interval Ncore × log 2MAPImax Shadow row-buffer index
Row address of previous memory access Ncore ×Nch × Nrank × Nbank ×
log 2Nrow Shadow row-buffer hits Number of row hits Ncore × log
2MAPImax
Table 2. Simulated system parameters Parameter Value Processor 8
cores, 4 GHz, out-of-order L1 caches (per core) 32KB Inst/32KB
Data, 4-way, 64B line, LRU L2 cache (shared) 8MB, 16-way, 64B line
Memory Controller Open-page policy; 64 entries read queue, 64
entries write queue
DRAM Memory 2 channel, 2-ranks/channel, 8-banks/rank. Timing:
DDR3-1333 (8-8-8) All parameters from the Micron datasheet
[10]
Table 3. SPEC CPU2006 benchmark memory characteristics No.
Benchmark MPKI RBH No. Benchmark MPKI RBH 1 429.mcf 35.8 21.1% 12
482.sphinx3 3.84 82.6% 2 470.lbm 34.8 96.2% 13 473.astar 3.27 79.3%
3 462.libquantum 28.9 99.2% 14 401.bzip2 1.45 71.3% 4 436.cactusADM
25.8 25.4% 15 456.hmmer 0.28 43.2% 5 459.GemsFDTD 20.3 32.7% 16
435.gromacs 0.27 88.3% 6 450.soplex 20.1 91.5% 17 458.sjeng 0.23
25.5% 7 410.bwaves 18.4 18.5% 18 445.gobmk 0.21 77.0% 8 433.milc
16.84 84.2% 19 447.dealII 0.19 82.8% 9 434.zeusmp 14.6 23.7% 20
481.wrf 0.10 85.7% 10 471.omnetpp 9.84 46.4% 21 444.namd 0.04 91.3%
11 437.leslie3d 7.45 81.5% 22 453.povray 0.03 85.6%
The major hardware storage cost incurred to profile threads’ memory
access behavior is shown in Table 1.
We need a counter to record the MAPI for each application. To
compute RBH, we use a per-bank shadow row-buffer index to record
the row address of previous memory request, and record the number
of shadow row-buffer hits for each thread as described in previous
work [2, 5, 8, 29]. RBH is simply computed as the number of shadow
row-buffer hits divided by MAPI. These counters are readable by
software. To implement TCM, additional logic required to calculate
priority and cluster threads, as was done in [2].
Software Support. Dynamic Bank Partitioning requires system
software support to 1) read the counters provided by hardware at
memory controller, 2) categorize applications into groups according
to their memory access behavior as described in rule 1, 3)
dynamically allocate page colors to cores according to algorithm 1,
at the end of each interval.
We use OS-based page coloring to map memory requests of different
threads to their allocated page colors. Once each thread is
allocated a bunch of colors, DBP ensures this allocating. When a
page fault occurs, the page fault handler attempts to allocate a
free physical page in the assigned colors. If a thread runs out of
physical frames in colors allocated to it, the OS can assign frames
of different color at the cost of reduced isolation. The decision
of whose color the frames are to spill can be further explored, but
this is beyond the scope of this paper (for the experimental
workloads, the memory capacity of 0.5 ~ 1GB is enough [15, 18]).
Experimental results on real machine show that the overhead of page
coloring is negligible [15].
Page Migration. There are cases where the accessed page is present
in colors other than the mapped colors, which we observe to be very
rare in our workloads (less
than 10%) for two reasons. First, we allocate a thread as many as
possible page colors as former intervals to reduce these
circumstances when deciding bank partition- ing policy. Take an 8
core system with 32 memory banks for example, each core can have 4
banks in an equal bank partitioning. In our dynamic bank
partitioning, we compel to allocate these 4 banks to the
corresponding core in priority. Second, applications’ memory access
behavior is relatively constant within an interval. Dynamically
migrating these pages to the preferred colors could be beneficial.
However, page migration incurs extra penalty (TLB and cache block
invalidation overheads as discussed in [8, 27]). Hence, page
migration does not likely gain much benefit; we don’t implement it
in our work. However, migration can be incorporated into our work
if needed.
5 Evaluation Methodology 5.1 Simulation Setup We evaluate our
proposal using gem5 [14] as the base architectural simulator, and
integrated DRAMSim2 [13] to simulate the details of DRAM memory
system. The memory subsystem is modeled using DDR3 timing
parameters [10]. Gem5 models a virtual-to-physical address
translator. When a page fault occurs, the translator allocates a
physical page to the virtual page. We modified the translator to
support the dynamic bank partitioning. We simulate out-of-order
processor running Alpha binaries. Table 2 summarizes the major
processor and DRAM parameters of our simulation. We set the length
of profiling interval to 100K memory cycles, MAPIt to 200, and RBHi
to 0.5. We set ClusterThresh to 1/6, ShuffleInterval to 800, and
ShuffleAlgoThresh to 0.1 for TCM as presented in [2].
We simulate the following 7 configurations to evaluate
our proposal. 1) FR-FCFS — First Ready, First Come First
Serve
scheduling. 2) BP — Bank Partitioning with first ready, first
come
first serve scheduling. 3) DBP — Dynamic Bank Partitioning with
first ready,
first come first serve scheduling. 4) TCM — Thread Cluster Memory
scheduling. 5) MCP — Memory Channel Partitioning. 6) BP-TCM — Bank
Partitioning employed with Thread
Cluster Memory scheduling. 7) DBP-TCM — integrating Dynamic Bank
Partitioning
∑
∑
max
5.3 Workload Construction We use workloads constructed from the
SPEC CPU2006 benchmarks [28] for evaluation. Table 3 shows memory
characteristics of benchmarks. We cross compiled each benchmark
using gcc4.3.2 with –O3 optimizations. In our simulation, each
processor core is single-threaded and runs one program. We warm up
system for 100 million instructions before taking measurement;
followed by a final 200 million instructions used in our
evaluation. If a benchmark finishes its instructions before others,
its statistics are collected but it continues to execute so that it
insistently exerts pressure on memory system.
Memory intensity and row-buffer locality are the two key memory
characteristics of an application. Memory intensity is represented
by the last-level cache miss per kilo instruction (MPKI);
row-buffer locality is measured by row-buffer hit rate (RBH). We
classify the benchmarks into two groups: memory-intensive (MPKI
greater than 1) and non-intensive (MPKI less than 1); we further
categorize each group into two sub-groups: high row-buffer locality
(RBH greater than 0.5) and low row-buffer locality (RBH less than
0.5). We evaluate our proposals on a wide variety of
multi-programmed workloads. We vary the fraction of memory
intensive benchmarks in our workloads from 100%, 75%, 50%, 25%, 0%
and constructed 8 workloads in each category.
6 Results We first evaluate the impact of our proposal on system
performance. We measure system throughput using weighted speedup;
the balance between system throughput and fairness is measured by
harmonic speedup. We separately simulate the 7 configurations as
described in Section 5.1. Figure 6 shows the average system
throughput and harmonic speedup over all workloads, all the results
are normalized to the baseline FR-FCFS. The upper right part of the
graph corresponds to better system
Figure 6. System performance over all workloads
throughput (higher weighted speedup) and a better balance between
system throughput and fairness (higher harm-onic speedup). DBP
improves system throughput by 9.4% and harmonic speedup by 21.1%
over the baseline FR-FCFS. Compared to BP, DBP improves system
throughput by 4.2% and harmonic speedup by 15.1%. DBP-TCM provides
6.2% better system throughput (20% better harmonic speedup) than
DBP, and 6.4% better system throughput (9% better harmonic speedup)
than TCM. When compared with MCP, DBP-TCM improves system
throughput by 5.3% and harmonic speedup by 35.3%.
We make two major conclusions based on the above results. First,
dynamic bank partitioning is beneficial for system performance, as
DBP takes into account applications’ varying needs for bank level
parallelism. Second, integrating dynamic bank partitioning and
thread cluster memory scheduling provides better system performance
than employing either alone. 6.1 System Throughput We then evaluate
the impact of our mechanism on system throughput in details. Figure
7 provides insight into where the performance benefits of DBP and
DBP-TCM are coming from by breaking down performance based on
memory intensity of workloads. All the results are normalized to
the baseline FR-FCFS. Compared to FR-FCFS, dynamic bank
partitioning improves system throughput by 9.4% on average. As
expected, dynamic bank partitioning outperforms previous equal bank
partitioning by increasing the bank level parallelism of
BLP-sensitive benchmarks (4.3% better system throughput than BP
over all workloads). When bank partitioning employed with thread
cluster memory scheduling (BP-TCM), the improvement over FR-FCFS is
11.3%, while integrating DBP and TCM (DBP-TCM) improves system
throughput by 15.6%. Compared to MCP, DBP-TCM provides 5.3% better
system throughput.
The workloads with 100% memory intensity bench- mark benefit most
from dynamic bank partitioning due to two reasons. On the one hand,
these benchmarks recover the most spatial locality lost due to
interference and the increased row-buffer hit rate results in
system throughput improvement. On the other hand, DBP increases the
bank level parallelism of memory intensive benchmarks with low
row-buffer locality, thereby improving system throughput.
0.95
1
1.05
1.1
1.15
1.2
N or
m al
iz ed
W ei
Figure 7. System throughput
Figure 8. System fairness
For high memory intensive workloads (75% and 100%), reducing
inter-thread interference via bank partitioning is more effective
than memory scheduling: both BP and DBP outperform TCM. Because
contention for DRAM memory is severe in such workloads. Bank
partitioning divides memory banks among cores and effectively
reduces inter-thread interference, thus improving system
throughput. In contrast, TCM tries to address contention
potentially recover a fraction of row-buffer locality. However, the
effectiveness in recovering locality of memory scheduling is
restricted due to the limited scheduling buffer size.
Low memory intensive workloads (0%, 25%, and 50%) benefit more from
TCM. TCM strictly prioritizes memory requests of latency-sensitive
cluster over bandwidth- sensitive cluster, thus improving system
throughput. Besides, TCM enforces a strict priority for latency-
sensitive cluster; thread with the least memory intensity receives
the highest priority. This strict priority allows “light” threads
to quickly resume computation, thereby making great contribution to
the overall system throughput. On the opposite, the workloads with
0% memory intensive benchmark have almost no benefits from BP and
DBP, since all the benchmarks are memory non-intensive. In such
workloads, applications seldom generate memory requests and the
contention for DRAM memory is slight.
Based on the above analysis, we conclude that either memory
scheduling alone or bank partitioning alone cannot provide the best
system performance. A
combination of memory scheduling and bank partitioning is a more
effective solution than pure scheduling or pure partitioning. 6.2
System Fairness System throughput only shows half of the story.
Figure 8 provides the inside into the impact on system fairness of
our proposals for six workload categories with different memory
intensity. System fairness is measured by maximum slowdown.
Compared to FR-FCFS, bank partitioning improves system fairness by
2.1% on average; the improvement of dynamic bank partitioning is
18.1%. The benefit of thread cluster memory scheduling is 31.7%.
Although MCP effectively improves system throughput, the
improvement of system fairness is slight (5.7% over FR-FCFS, this
is consistent with the result in [8]). When bank partitioning
employed with thread cluster memory scheduling (BP-TCM), the
fairness improvement is 26.3%. While integrating DBP and TCM
(DBP-TCM) improves system fairness by 42.7%.
For workloads with 0% memory intensive benchmark, all schemes have
slight benefits for system fairness. As all benchmarks are
non-intensive, the memory latency is relatively low.
Workloads with 25% memory intensive benchmark suffer from equal
bank partitioning. BP reduces bank level parallelism of
BLP-sensitive applications and increases their memory access
latency, thereby increasing memory slowdown of BLP-sensitive
applications. While DBP addresses this problem by increasing bank
level par-
0.9
0.95
1
1.05
1.1
1.15
1.2
1.25
N or
m al
iz ed
W ei
gh te
d Sp
ee du
FRFCFS
BP
DBP
TCM
MCP
BP-TCM
DBP-TCM
0
1
2
3
4
5
6
7
M ax
m iu
m S
lo w
do w
FRFCFS
BP
DBP
TCM
MCP
BP-TCM
DBP-TCM
interval
Figure 10. Sensitivity to MPAIt
Figure 11. Sensitivity to RBHt
Table 4. Impact of different shuffling algorithm Compared to
Insertion Shuffle
Shuffling Algorithm TCM Round-robin Random
WS 0.28% 0.74% 1.05% MS 4.3% 8.5% 13.8%
allelism of BLP-sensitive applications, and hence improv- ing
system fairness.
DBP and TCM together optimize memory contention, spatial locality
and bank level parallelism; the two methods applied together
provides a fairer and more robust solution. 6.3 Impact of Shuffling
Algorithm TCM improves system fairness by shuffling priority among
applications in bandwidth-sensitive cluster. TCM also proposes the
insertion shuffling algorithm to reduce inter-thread interference.
However, it destroys the row-buffer locality of less nice threads
(threads with high row-buffer locality are less nice and
deprioritized). In addition, prioritizing nicer threads increases
memory slowdown of less nice threads. Therefore, the improve- ment
of system fairness is restricted.
We study the impact of four shuffling algorithms (insertion, TCM,
round-robin, and random) when DBP-TCM employs different shuffle.
Table 4 shows the system throughput (Weight Speedup, WS) and
fairness (Maximum Slowdown, MS) results, all the results are
normalized to insertion shuffle. Of the four shuffles, insertion
shuffle shows the worst system performance. Prioritizing nicer
threads increases slowdown of less nice threads, thus decreasing
system throughput and leading to high maximum slowdown. TCM shuffle
performs slightly better than insertion shuffle, since it
dynamically switches between insertion and random shuffle to
accommodate the disparate memory characteristics of diversity
workloads. Round-robin and random shuffle provides more benefits,
as DBP solves the inter-thread interference. Therefore, there is no
need for insertion shuffle. Random shuffle provides the best
results of the four shuffling algorithm.
We conclude that when TCM is employed along with DBP, the insertion
shuffling algorithm is no longer needed since DBP solves the
interference issue. 6.4 Sensitivity Study We first experiment with
different profile interval length to study its performance impact
on DBP and DBP-TCM
(Figure 9); all the results are normalized to the baseline FR-FCFS.
A short profile interval length causes unstable MAPI and RBH, and
hence potentially inaccurate estimation of applications’ memory
characteristics. Conversely, a long profile interval length cannot
catch the memory access behavior change of an application. A
profile interval length of 100K memory cycles balances the
downsides of short and long intervals, and provides the best
results.
We also vary the value of MAPIt to study the sensitivity of DBP and
DBP-TCM (Figure 10). As the value of MAPIt increases, more
applications get into non-intensive group. The system throughput
first increases, and then falls. To save memory banks for
BLP-sensitive applications, DBP does not allocate dedicated memory
banks to non-intensive applications, thus increasing bank level
parallelism of BLP-sensitive applications and improving system
throughput. However, increasing MAPIt aggravates inter-thread
interference. After some point, the drawbacks overwhelm the
benefits, thereby resulting in lower system throughput. The value
of 200 is the best choice. We also evaluate our proposal using MPKI
instead of MAPI; we found that MPKI provides similar results as
MAPI.
Figure 11 shows the trade-offs between row-buffer locality and bank
level parallelism by adjusting the row-buffer hit rate threshold
RBHt. As the scaling factor RBHt increases, more memory intensive
applications get into low RBL group. In some cases, DBP enforces
two cores in low RBL group share their colors to increase bank
level parallelism, thereby improving system throughput. However,
this also increases row-buffer conflicts, and hence negatively
impacting DRAM memory performance. An RBHt value of 0.5 achieves a
good balance between row-buffer locality and bank level
parallelism, and gives the best performance.
We vary the ClusterThresh from 1/8 to 1/5 to study the performance
of DBP-TCM, the results are shown in Table 5, we compare our
proposal with Thread Cluster Memory scheduling. DBP-TCM has a wide
range of balanced points which provide both high system throughput
and fairness. By tuning the clustering threshold, system throughput
and fairness can be gently traded off for one another. We also
experiment with different shuffle interval lengths (Table 5).
Compared to TCM, system performance remains high and stable over a
wide range of these values, and the best performance observed at
the value of 800.
Table 5 also provides the performance of DBP-TCM as
1.02
1.06
1.1
1.14
1.18
or m
al iz
ed W
ei gh
te d
Sp ee
du p
Table 5. DBP-TCM’s sensitivity to parameters Compared to TCM
ClusterThresh ShuffleInterval Number of Cores 1/8 1/7 1/6 1/5 600
700 800 900 4 8 16
WS 4.8% 5.5% 6.2% 6.6% 5.7% 6.0% 6.2% 5.9% 3.9% 6.2% 6.5% MS 19.3%
17.9% 16.7% 14.5% 17.8% 16.3% 16.7% 17.2% 6.9% 16.7% 18.0%
the number of core varies (memory and L2 cache capacity scale with
cores). The results show similar trends as the number of core
increase. Therefore, our approach scales well with the number of
cores.
7 Related Work 7.1 Memory Scheduling A number of studies have
focused on memory scheduling that aim to enhance system performance
and/or fairness, or maximum DRAM throughput.
Much prior work has focused on mitigating inter-thread interference
at the DRAM memory system to improve overall system throughput
and/or fairness of share memory CMP systems. Most of previous
approaches address this problem using application-aware memory
scheduling algorithm [1-6]. Fair queuing memory scheduling [5, 6]
adapts the variants fair queuing algorithm of computer network to
memory controller. FQM achieves some notion of fairness at the cost
of system throughput. STFM [3] makes priority decisions to equalize
the DRAM-related stall time each thread experienced to achieve
system fairness. PAR-BS [4] employs a parallelism-aware batch
scheduling policy to achieve a balance of system throughput and
fairness. ATLAS [1] strictly prioritizes threads which have
attained the least service from DRAM memory to maximize system
throughput, but at a large expense of system fairness. MISE [19]
proposes a memory- interference induced slowdown estimation model
that estimates slowdowns caused by interference to enforce system
fairness. SMS [9] presents a staged memory scheduler which achieves
high system throughput and fairness in integrated CPU-GPU systems.
TCM [2] divides threads into two separate clusters and employs
different memory scheduling policy for each cluster. TCM can
achieve both high system throughput and fairness in CMP systems
[2].
These memory access scheduling policies focus on improving system
performance by considering threads’ memory access behavior and
system fairness. They can potentially recover a fraction of
original spatial locality. However, the recovering ability is
constrained as the scheduling buffer size is limited and the
arrival interval of memory requests from a single thread is often
large. Moreover, the primary design consideration of these policies
is not reclaiming row-buffer locality. Memory access scheduling
cannot solve the interference problem effectively. Therefore, there
is potential space to further improve system performance by solving
inter-thread interference.
Application-unaware memory scheduling policies [21-26] aims to
maximize DRAM throughput, including the commonly employed FR-FCFS
[23] which prioritizes row-buffer hit memory requests over others.
These scheduling policies do not distinguish between different
threads, and also do not take into account inter-thread
interference. Therefore, they lead to low system throughput and
prone to starvation when multiple threads compete the shared DRAM
memory in general purpose multi-core systems, as shown in previous
work [1-9].
7.2 Memory Partitioning Muralidhara et al. [8] propose
application-aware memory channel partitioning (MCP). MCP maps the
data of applications that are likely to severely interfere with
each other to different memory channels. The key principle is to
partition threads with different memory characteristics (memory
intensity and row-buffer locality) onto separate channels. MCP can
effectively reduce the interference of threads with different
memory access behavior and improve system throughput.
However, we find that MCP has several drawbacks. First, MCP cannot
eliminate the inter-thread interference, since threads with the
same memory characteristics still share channel(s). Second, MCP
nullifies fine-grained DRAM channel interleaving, which limits peak
memory bandwidth of individual threads and reduces channel level
parallelism, thus degrading system performance. Third, MCP is only
suited for multi-channel system, while mobile system usually has
only one channel. Most importantly, MCP places high memory
intensive threads onto the same channel(s) to enable faster
progress of low memory intensive threads. This leads to unfair
memory resource allocation and physically exacerbates intensive
threads’ contention for memory, thus ultimately resulting in the
increased slowdown of these threads and high system
unfairness.
Previous bank partitioning [7, 15, 16, 29] proposes to partition
memory banks among cores to isolate memory access of different
applications, thereby eliminating inter- thread interference and
improving system performance. Liu et al. [16] implement and
evaluate bank partitioning in reality. However, all of the previous
bank partitioning are unaware of applications’ varying demands for
bank level parallelism. In some cases, bank partitioning restricts
the number of banks available to an application and reduces bank
level parallelism, hence significantly degrading system
performance.
Jeong et al. [7] also combine bank partitioning with sub-rank [17]
to increase the number of banks available to each thread, thus
compensating for the reduced bank level parallelism caused by bank
partitioning. However, sub-rank increases data transferring time
and reduces peak memory bandwidth. Therefore, the performance
improvement is slight and sometimes it even hurts system
performance. In addition, we have to modify conventional DRAM DIMM
to support sub-rank.
8 Conclusion
Bank partitioning divides memory banks among cores and eliminates
inter-thread interference. Previous bank partitioning schemes
allocate an equal number of memory banks to each core without
considering the disparate needs of individual threads for bank
level parallelism. In some cases, this restricts the number of banks
available to an application and reduces bank level parallelism,
thereby significantly degrading system performance. To compensate
for the reduced bank level parallelism, we present Dynamic Bank
Partitioning which takes into account threads’
demands for bank level parallelism. DBP improves system throughput,
while maintaining fairness.
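The sketch below illustrates the allocation idea behind DBP in simplified form: banks are divided roughly in proportion to each thread's estimated demand for bank level parallelism, with at least one bank per thread. The demand estimates passed in are a placeholder for the run-time profiling used by DBP, and the rounding scheme is an illustrative choice rather than the paper's exact mechanism.

# Hedged sketch of demand-proportional bank allocation. Assumes
# total_banks >= number of threads; blp_estimates stands in for the
# run-time estimate of each thread's bank-level-parallelism demand.
def partition_banks(blp_estimates, total_banks):
    threads = list(blp_estimates)
    total_demand = sum(blp_estimates.values())
    if total_demand == 0:
        # No measured demand: fall back to an equal split.
        blp_estimates = {t: 1.0 for t in threads}
        total_demand = float(len(threads))
    # Start with one bank per thread, then distribute the rest by demand share.
    alloc = {t: 1 for t in threads}
    spare = total_banks - len(threads)
    shares = {t: spare * blp_estimates[t] / total_demand for t in threads}
    for t in threads:
        alloc[t] += int(shares[t])  # integer part of each thread's share
    leftover = total_banks - sum(alloc.values())
    # Hand any remaining banks to the threads with the largest fractional share.
    for t in sorted(threads, key=lambda x: shares[x] - int(shares[x]), reverse=True)[:leftover]:
        alloc[t] += 1
    return alloc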
Memory access scheduling focuses on improving system performance
and fairness by considering threads' memory access behavior.
However, its performance improvement is limited by its inability to
reclaim the original spatial locality. Bank partitioning focuses on
preserving row-buffer locality and solves the inter-thread
interference issue. These two kinds of methods are complementary
and orthogonal to each other: memory scheduling can benefit from
improved spatial locality, and bank partitioning can benefit from
better scheduling. Therefore, we introduce a comprehensive approach
that integrates Dynamic Bank Partitioning and Thread Cluster Memory
scheduling. Applied together, the two methods reinforce each other.
We show that this combination, unlike DBP or TCM alone, is able to
simultaneously and significantly improve both overall system
throughput and fairness.
Experimental results using eight-core multiprogrammed workloads
show that the proposed DBP improves system performance by 9.4% and
system fairness by 18.1% over FR-FCFS on average; our comprehensive
approach DBP-TCM improves them by 15.6% and 42.7%, respectively.
Compared with TCM, which is one of the best memory
scheduling policies, DBP-TCM improves system throughput by 6.2% and
fairness by 16.7%. When compared with MCP, DBP-TCM provides 5.3%
better system throughput and 37% better system fairness. We
conclude that our approaches are effective in enhancing both system
throughput and fairness.
Acknowledgments
We thank the anonymous reviewers for their valuable comments and
suggestions. This research
was partially supported by the Major National Science and
Technology Special Projects of China under Grant No.
2009ZX01029-001-002.
References
[1] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers. HPCA, 2010.
[2] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. MICRO, 2010.
[3] O. Mutlu and T. Moscibroda. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. MICRO, 2007.
[4] O. Mutlu and T. Moscibroda. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. ISCA, 2008.
[5] K. Nesbit, N. Aggarwal, J. Laudon, and J. Smith. Fair Queuing Memory Systems. MICRO, 2006.
[6] N. Rafique, W.-T. Lim, and M. Thottethodi. Effective Management of DRAM Bandwidth in Multicore Processors. PACT, 2007.
[7] M. Jeong, D. Yoon, D. Sunwoo, M. Sullivan, I. Lee, and M. Erez. Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems. HPCA, 2012.
[8] S. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda. Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning. MICRO, 2011.
[9] R. Ausavarungnirun, K. Chang, L. Subramanian, G. Loh, and O. Mutlu. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems. ISCA, 2012.
[10] Micron Technology, Inc. Micron DDR3 SDRAM Part MT41J256M8, 2006.
[11] H. Park, S. Baek, J. Choi, D. Lee, and S. Noh. Regularities Considered Harmful: Forcing Randomness to Memory Accesses to Reduce Row Buffer Conflicts for Multi-Core, Multi-Bank Systems. ASPLOS, 2013.
[12] B. Jacob, S. W. Ng, and D. T. Wang. Memory Systems: Cache, DRAM, Disk. Elsevier, 2008.
[13] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A Cycle Accurate Memory System Simulator. IEEE CAL, vol. 10, pages 16-19, 2011.
[14] N. Binkert, B. Beckmann, G. Black, S. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. Hill, and D. Wood. The Gem5 Simulator. SIGARCH CAN, vol. 39, pages 1-7, 2011.
[15] W. Mi, X. Feng, J. Xue, and Y. Jia. Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors. NPC, 2010.
[16] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu. A Software Memory Partition Approach for Eliminating Bank-level Interference in Multicore Systems. PACT, 2012.
[17] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu. Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency. MICRO, 2008.
[18] J. L. Henning. SPEC CPU2006 Memory Footprint. ACM SIGARCH Computer Architecture News, 2007.
[19] L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, and O. Mutlu. MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems. HPCA, 2013.
[20] H. Zheng, J. Lin, Z. Zhang, and Z. Zhu. Memory Access Scheduling Schemes for Systems with Multi-core Processors. ICPP, 2008.
[21] I. Hur and C. Lin. Adaptive History-based Memory Schedulers. MICRO, 2004.
[22] S. A. McKee, W. A. Wulf, J. Aylor, R. Klenke, M. Salinas, S. Hong, and D. Weikle. Dynamic Access Ordering for Streamed Computations. IEEE TC, vol. 49, pages 1255-1271, Nov. 2000.
[23] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory Access Scheduling. ISCA, 2000.
[24] J. Shao and B. T. Davis. A Burst Scheduling Access Reordering Mechanism. HPCA, 2007.
[25] C. Natarajan, B. Christenson, and F. Briggs. A Study of Performance Impact of Memory Controller Features in Multi-processor Server Environment. WMPI, 2004.
[26] W. K. Zuravleff and T. Robinson. Controller for a Synchronous DRAM that Maximizes Throughput by Allowing Memory Requests and Commands to be Issued Out of Order. U.S. Patent Number 5,630,096, May 1997.
[27] M. Awasthi, D. Nellans, K. Sudan, R. Balasubramonian, and A. Davis. Handling the Problems and Opportunities Posed by Multiple on-chip Memory Controllers. PACT, 2010.
[28] SPEC CPU2006. http://www.spec.org/spec2006.
[29] M. Xie, D. Tong, Y. Feng, K. Huang, and X. Cheng. Page Policy Control with Memory Partitioning for DRAM Performance and Power Efficiency. ISLPED, 2013.
[30] T. Karkhanis and J. E. Smith. A Day in the Life of a Data Cache Miss. WMPI, 2002.
[31] H. Cheng, C. Lin, J. Li, and C. Yang. Memory Latency Reduction via Thread Throttling. MICRO, 2010.