
Memory Optimizations Research at UNT

Krishna Kavi

Professor
Director of NSF Industry/University Cooperative Center
for Net-Centric Software and Systems (Net-Centric IUCRC)

Computer Science and Engineering
The University of North Texas
Denton, Texas 76203, USA

[email protected]
http://csrl.unt.edu/~kavi


Motivation

The memory subsystem plays a key role in achieving performance on multi-core processors

The memory subsystem contributes a significant portion of the energy consumed

Pin limitations restrict bandwidth to off-chip memories

Shared caches may have non-uniform access behaviors

Shared caches may encounter inter-core conflicts and coherency misses

Different data types exhibit different locality and reuse behaviors

Different applications need different memory optimizations


Our Research Focus

Cache Memory optimizations

software and hardware solutions

primarily at L-1

some ideas at L-2

Memory Management

Intelligent allocation and user defined layouts

Hardware supported allocation and garbage collection


Non-Uniformity of Cache Accesses

Non-uniform access to cache sets: some sets are accessed 100,000 times more often than others, causing more misses while other sets go unused

(Chart: Non-Uniform Cache Accesses for Parser)


Non-Uniformity of Cache Accesses

But, not all applications exhibit “bad” access behavior

(Chart: Non-Uniform Cache Accesses for Selected Benchmarks)

Need different solutions for different applications


Improving Uniformity of Cache Accesses

Possible solutions

• Using Fully associative caches with perfect replacement policies

• Selecting optimal addressing schemes

• Dynamically re-mapping addresses to new cache lines

• Partitioning caches into smaller portions

• Each partition used by a different data object

• Using Multiple address decoders

• Static or dynamic data mapping and relocation


Associative Caches Improve Uniformity

(Figures: access distribution for a Direct Mapped Cache vs. a 16-Way Associative Cache)


Data Memory Characteristics

• Different object types exhibit different access behaviors
    - Arrays exhibit spatial locality
    - Linked lists and pointer data types are difficult to prefetch
    - Statics and scalars may exhibit temporal locality

• Custom memory allocators and custom run-time support can be used to improve locality of dynamically allocated objects
    - Pool Allocators (U of Illinois)
    - Regular expressions to improve on Pool Allocators (Korea)
    - Profiling and reallocating objects (UNT)
    - Hardware support for intelligent memory management (UNT and Iowa State)


ABC’s of Cache Memories

Multiple levels of memory – memory hierarchy

CPU and Registers → L1 Instruction Cache / L1 Data Cache → L2 Cache (combined Data and Instructions) → DRAM (Main memory) → DISK


ABC’s of Cache Memories

Consider a direct mapped Cache

An address can only be in a fixed cache line as specified by the 6-bit line number of the address
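To make the decomposition concrete, here is a sketch in C. The 6-bit line number matches the slide; the 32-byte line size (and hence the 5-bit byte offset) is our assumption for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 64 lines (6-bit index) of 32 bytes (5-bit offset). */
#define OFFSET_BITS 5
#define INDEX_BITS  6

/* Byte offset within the cache line. */
static uint32_t byte_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);
}

/* 6-bit line number: the only line this address can occupy. */
static uint32_t line_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

/* Remaining high bits form the tag stored alongside the line. */
static uint32_t tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);
}
```

Two addresses that share the same 6-bit line number but differ in tag evict each other, which is exactly the conflict behavior the following slides address.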


ABC’s of Cache Memories

Consider a 2-way set associative cache

An address is located in a fixed set of the cache, but it can occupy either of the 2 lines of that set.

We extend this idea to 4-way, 8-way,.. fully associative caches


ABC’s of Cache Memories

Consider a fully associative cache

An address is located in any line

Or, there is only one set in the cache.

Very expensive since we need to compare the address tag with each line tag.

Also need a good replacement strategy.

Can lead to more uniform access to cache lines

(Address format: Tag | Byte offset)


Programmable Associativity

Can we provide higher associativity only when we need it? Consider a simple idea:

Heavily accessed cache lines are provided with alternate locations, as indicated by a "partner index"
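The partner-index idea can be sketched as follows; this toy one-line-per-set cache and its `partner` table are our illustration of the concept, not the exact hardware proposed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSETS 64

/* Toy direct-mapped cache: valid bit + tag per set, plus a partner
   index that points heavily conflicting sets at an alternate set. */
typedef struct {
    bool     valid[NSETS];
    uint32_t tag[NSETS];
    int      partner[NSETS];   /* -1 means no partner assigned */
} toy_cache;

/* Probe the home set first; on a miss, try the partner location.
   A hot set thus behaves as if it had 2-way associativity. */
static bool lookup(const toy_cache *c, uint32_t set, uint32_t tag) {
    if (c->valid[set] && c->tag[set] == tag) return true;
    int p = c->partner[set];
    return p >= 0 && c->valid[p] && c->tag[p] == tag;
}
```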


Programmable Associativity

Peir's adaptive cache uses two tables:
    Set-reference History Table (SHT) – tracks heavily used cache lines
    Out-of-position directory (OUT) – tracks alternate locations

[Peir 98] J. Peir, Y. Lee, and W. Hsu, "Capturing Dynamic Memory Reference Behavior with Adaptive Cache Topology," in Proc. of the 8th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, 1998, pp. 240–250.

[Zhang 06] C. Zhang, "Balanced cache: Reducing conflict misses of direct-mapped caches," ISCA, pp. 155–166, June 2006.

Zhang's programmable associativity (B-Cache): the cache index is divided into a programmable index and a non-programmable index (NPI); the NPI allows for varying associativities.


Programmable Associativity

(Chart: % Reduction in Miss-Rate on MiBench benchmarks (adpcm, basicmath, bitcount, crc, dijkstra, fft, patricia, qsort, rijndael, sha, susan, Average) for Adaptive_Cache, B_Cache, and Column_associative.)


Programmable Associativity

(Chart: % Reduction in AMAT on MiBench benchmarks (adpcm, basicmath, bitcount, crc, dijkstra, fft, patricia, qsort, rijndael, sha, susan, Average) for Adaptive_Cache, B_Cache, and Column_associative.)


Multiple Decoders

(Figure: multiple address decoders over a single tag/data array, each splitting the address into Tag | Set Index | Byte offset at a different point.)

Different decoders may use different associativities


Multiple Decoders

But how to select index bits?


Index Selection Techniques

Different approaches have been studied:
    Givargis' quality bits
    XOR some tag bits with the index bits
    Add a multiple of the tag to the index
    Use prime modulo indexing

[Givargis 03] T. Givargis, "Improved Indexing for Cache Miss Reduction in Embedded Systems," in Proc. of the Design Automation Conference, 2003.

[Kharbutli 04] M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee, "Using Prime Numbers for Cache Indexing to Eliminate Conflict Misses," in Proc. Int'l Symp. on High Performance Computer Architecture, 2004.
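The schemes listed above can be sketched as alternative index functions; the bit widths, the odd multiplier, and the prime 61 are assumptions chosen for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS 6
#define NSETS (1u << INDEX_BITS)

/* Conventional index: low bits above the 5-bit byte offset. */
static uint32_t plain_index(uint32_t addr) { return (addr >> 5) & (NSETS - 1); }
static uint32_t addr_tag(uint32_t addr)    { return addr >> (5 + INDEX_BITS); }

/* XOR some tag bits into the index to spread conflicting addresses. */
static uint32_t xor_index(uint32_t addr) {
    return (plain_index(addr) ^ addr_tag(addr)) & (NSETS - 1);
}

/* Odd-multiplier method: add an odd multiple of the tag to the index. */
static uint32_t odd_mult_index(uint32_t addr, uint32_t odd_mult) {
    return (plain_index(addr) + odd_mult * addr_tag(addr)) & (NSETS - 1);
}

/* Prime-modulo indexing: map into a prime number of sets (here 61),
   leaving a few sets unused in exchange for fewer conflicts. */
static uint32_t prime_index(uint32_t addr) {
    return (addr >> 5) % 61;
}
```

Addresses that collide under `plain_index` generally land in different sets under the hashed variants, which is the source of the miss-rate reductions shown later.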


Multiple Decoders

Odd multiplier method

Different multipliers for each thread

(Chart: % Reduction in Miss-Rate on multi-threaded benchmarks: bitcount_adpcm, bzip2_libquantum, fft_susan, gromacs_namd, milc_namd, qsort_basicmath, qsort_patricia, fft_basicmath_patricia_susan, susan_bitcount_adpcm_patricia, Average.)


Multiple Decoders

(Chart: % Improvement in AMAT on multi-threaded applications: bitcount_adpcm, fft_susan, qsort_basicmath, qsort_fft, qsort_patricia, libquantum_milc, milc_namd, gromacs_namd, bzip2_libquantum, fft_basicmath_patricia_susan, susan_bitcount_adpcm_patricia, Average.)

Here we split the cache into segments, one per thread

But we used Adaptive Cache techniques to "donate" underutilized sets to other threads


Other Cache Memory Research at UNT

Use of a single data cache can lead to unnecessary cache misses

Arrays exhibit higher spatial locality while scalars may exhibit higher temporal locality

May benefit from different cache organizations (associativity, block size)

If using separate instruction and data caches, why not different data caches -- either statically or dynamically partitioned

And if separate array and scalar caches are included, how to further improve their performance

Optimize the sizes of array and scalar caches for each application


Reconfigurable Caches

(Figure: CPU connected to separate Array Cache and Scalar Cache, with a Secondary Cache and Main Memory behind them.)


Percentage reduction of power, area and cycles for data cache

(Chart: percentage reduction in power, area, and cycles for benchmarks bc, qs, dj, bf, sh, ri, ss, ad, cr, ff, and avg.)

Conventional cache configuration: 8K direct-mapped data cache; 32K 4-way unified level-2 cache
Scalar cache configuration: size variable, direct-mapped with a 2-line victim cache
Array cache configuration: size variable, direct-mapped


Summarizing

For the instruction cache:
    85% (average 62%) reduction in cache size
    72% (average 37%) reduction in cache access time
    75% (average 47%) reduction in energy consumption

For the data cache:
    78% (average 49%) reduction in cache size
    36% (average 21%) reduction in cache access time
    67% (average 52%) reduction in energy consumption

when compared with an 8KB L-1 instruction cache and an 8KB L-1 data cache backed by a 32KB unified level-2 cache


Generalization

Why not extend Array/Scalar split caches to more than 2 partitions?

Each partition customized to a specific object type

Partitioning can be achieved using multiple decoders with a single cache resource (virtual partitioning)

Reconfigurable partitions are possible with programmable decoders
Each decoder accesses a portion of the cache
    either physically restricted to a segment of the cache
    or virtually limited in the number of lines accessed by a decoder

Scratchpad Memories can be viewed as cache partitions
    Dedicate a segment of the cache for scratchpad use


Scratch Pad Memories

They are viewed as compiler-controlled memories: as fast as L-1 caches, but not managed as caches

The compiler decides which data will reside in scratch pad memory

A new paper from Maryland proposes a way of compiling programs for unknown-sized scratch pad memories

Only stack data (static and global variables) are placed in the SPM
The compiler views the stack as two stacks:
    a potential SPM data stack
    a DRAM data stack


Current and Future Research

Extensive study of using Multiple Decoders

Separate decoders for different data structures
    partitioning of L-1 caches

Separate decoders for different threads and cores
    at L-2 or Last Level Caches
    minimize conflicts
    minimize coherency related misses
    minimize loss due to non-uniform memory access delays

Investigate additional indexing or programmable associativity ideas
    Cooperative L-2 caches using adaptive caches


Program Analysis Tool

We need tools to profile and analyze
• Data layout at various levels of the memory hierarchy
• Data access patterns

• Existing tools (Valgrind, Pin) do not provide fine-grained information
• We want to relate each memory access back to a source-level construct

Source variable name, function/thread that caused the access


Gleipnir

Our tool is built on top of Valgrind

Can be used with any architecture supported by Valgrind: x86, PPC, MIPS, and ARM


Gleipnir

How can we use Gleipnir? Explore different data layouts and their impact on cache accesses


Gleipnir

Standard layout


Gleipnir

Tiled matrices
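"Tiled matrices" refers to classic loop tiling; as a sketch of what the tiled layout looks like in code (the matrix size N and tile size B are arbitrary illustrative choices, not the sizes used in the Gleipnir experiments):

```c
#include <assert.h>

#define N 8
#define B 4   /* tile (block) size, chosen so B divides N */

/* C = A * Bm, with all three loops tiled so that each BxB working
   set of A, Bm, and C can stay resident in the cache while used. */
static void matmul_tiled(double A[N][N], double Bm[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                /* multiply one BxB tile pair into the C tile */
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        for (int k = kk; k < kk + B; k++)
                            C[i][j] += A[i][k] * Bm[k][j];
}
```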


Gleipnir

Matrices A and C combined


Further Research

• Restructuring memory allocation – currently in progress
    - Analyze cache set conflicts and relate them to data objects
    - Modify data placement of these objects
    - Reorder variables, include dummy variables, …

• Restructure code to improve data access patterns (SLO tool)
    - Loop Fusion – combine loops that use the same data
    - Loop Tiling – split loops into smaller loops to limit the data accessed
    - Similar techniques to assure "common" data resides in L-2 (shared caches)
    - Similar techniques such that data is transferred to GPUs infrequently


Loop Tiling Idea

Too much data accessed in the loop


Code Refactoring

double sum(...) {
    ...
    for (int i = 0; i < len; i++)
        result += X[i];     /* all cache misses occur here */
    ...
}


Code Refactoring

Loop Fusion Idea

double inproduct(...) {
    ...
    for (int i = 0; i < len; i++)
        result += X[i] * Y[i];      /* previous use of X[i] occurs here */
    ...
}

double sum(...) {
    ...
    for (int i = 0; i < len; i++)
        result += X[i];             /* all cache misses occur here */
    ...
}


SLO Tool

double inproduct(...) {
    ...
    for (int i = 0; i < len; i++)
        result += X[i] * Y[i];
    ...
}

double sum(...) {
    ...
    for (int i = 0; i < len; i++)
        result += X[i];
    ...
}
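A fused version of the two loops above (our sketch of the refactoring SLO's output suggests) computes both results in a single pass, so each X[i] is reused while it is still in the cache:

```c
#include <assert.h>

/* Fusion of inproduct() and sum(): one traversal of X serves both,
   shrinking the reuse distance of X[i] from an entire array scan
   to a single adjacent statement. */
static void inproduct_and_sum(const double *X, const double *Y, int len,
                              double *inprod, double *sum) {
    double p = 0.0, s = 0.0;
    for (int i = 0; i < len; i++) {
        p += X[i] * Y[i];   /* first use of X[i] ... */
        s += X[i];          /* ... immediately reused here */
    }
    *inprod = p;
    *sum = s;
}
```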


Extensions Planned

Key Factors Influencing Code and Data Refactoring

Reuse Distance – reducing distance improves data utilization

Can be used with CPU-GPU configurations

Fuse loops so that all computations using the “same” data are grouped

Conflict sets and conflict distances

The set of variables that fall to the same cache line (or group of lines)

Conflict between pairs of conflicting variables

Increase conflict distance


Further Research

We are currently investigating several of these ideas

Using architectural simulators like SimICS

explore multiple decoders with multiple threads, cores or for different data types

Further extend Gleipnir

and explore using Gleipnir with compilers

and Gleipnir with other tools like SLO,

evaluate the effectiveness of custom allocators

Some hardware implementations of memory management using FPGAs

And we welcome collaborations


The End

Questions?

More information and papers at http://csrl.cse.unt.edu/~kavi


Custom Memory Allocators

Consider a typical pointer-chasing program

node {
    int key;
    ...  data;      /* complex data part */
    node *next;
}

We will explore two possibilities:
    pool allocation
    split structures


Custom Memory Allocators

• Pool Allocator (Illinois)

(Figure: a single heap with Data type A and Data type B objects interleaved, versus separate per-type pools where all Data type A objects are grouped together and all Data type B objects are grouped together.)
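A minimal bump-pointer pool in the spirit of the figure (our sketch, not the Illinois implementation): each data type allocates from its own contiguous chunk, so same-type objects end up adjacent in memory:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One pool per data type: objects are carved sequentially
   from a single contiguous chunk. */
typedef struct {
    char  *base;
    size_t obj_size, capacity, used;
} pool;

static void pool_init(pool *p, size_t obj_size, size_t nobjs) {
    p->base = malloc(obj_size * nobjs);
    p->obj_size = obj_size;
    p->capacity = nobjs;
    p->used = 0;
}

/* Bump allocation: consecutive allocations of the same type are
   physically consecutive, improving spatial locality. */
static void *pool_alloc(pool *p) {
    if (p->used == p->capacity) return NULL;
    return p->base + p->obj_size * p->used++;
}
```

A real pool allocator also supports freeing and growing pools; this sketch only shows why same-type objects become cache-line neighbors.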


Custom Memory Allocators

Further Optimization
Consider a typical pointer-chasing program

node {
    int key;
    ...  data;      /* complex data part */
    node *next;
}

The data part is accessed only if the key matches:

while (...) {
    if (h->key == k) return h->data;
    h = h->next;
}

Consider a different definition of the data:

node {
    int key;
    node *next;
    data_node *data_ptr;
}

(Figure: a list of small nodes holding only key, *next, and *data_ptr, each pointing to a separately allocated Data_node.)
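Completing the split-structure example with a search routine (our sketch): the scan touches only the small key/next/pointer nodes, so more keys fit per cache line, and the large data record is fetched only on a match:

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int payload; } data_node;   /* cold, rarely-touched part */

typedef struct node {
    int          key;        /* hot field, examined on every iteration */
    struct node *next;
    data_node   *data_ptr;   /* cold data kept out of the scanned nodes */
} node;

/* Only key/next are dereferenced until the key matches. */
static data_node *find(node *h, int k) {
    while (h) {
        if (h->key == k) return h->data_ptr;
        h = h->next;
    }
    return NULL;
}
```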


Custom Memory Allocators

Profiling (UNT)
Using data profiling, "flatten" dynamic data into consecutive blocks
Make linked lists look like arrays!
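One way to realize "make linked lists look like arrays" (our sketch, assuming profiling has already identified the hot list and its length): relocate the nodes, in traversal order, into one contiguous block:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct node { int key; struct node *next; } node;

/* Copy a list of n nodes into consecutive memory, preserving
   traversal order; afterwards, walking the list strides linearly
   through one array, like an array scan. */
static node *flatten(node *head, size_t n) {
    node *arr = malloc(n * sizeof *arr);
    for (size_t i = 0; i < n; i++, head = head->next) {
        arr[i].key  = head->key;
        arr[i].next = (i + 1 < n) ? &arr[i + 1] : NULL;
    }
    return arr;
}
```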


Cache Based Side-Channel Attacks

Encryption algorithms use keys (or blocks of the key) as index into tables containing constants used in the algorithm

By observing which table entries caused cache misses, an attacker can find the address of the table entry, and from it the value of the key that was used

Z. Wang and R. Lee. “New cache designs for thwarting software cache based side channel attacks”, ISCA 2007, pp 494-505

Two solutions: 1. Lock cache lines (cannot be displaced) when using encryption

2. Use a random replacement policy in selecting which line of a set is replaced
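The second solution can be sketched as a victim-selection policy; `rand()` here stands in for whatever hardware entropy source a real design would use:

```c
#include <assert.h>
#include <stdlib.h>

/* Random replacement: the evicted way no longer depends on the
   access history, so observed misses leak less information about
   which table entries displaced which lines. */
static int pick_victim_random(int ways) {
    return rand() % ways;
}
```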


Offloading Memory Management Functions

1. Dynamic memory management is the management of main memory for use by programs during runtime

2. Dynamic memory management accounts for a significant amount of execution time – 42% for 197.parser (from the SPEC 2000 benchmarks)

3. If the CPU performs memory management, the CPU cache performs poorly due to switching between user functions and memory management functions

4. With separate hardware and a separate cache for memory management, CPU cache performance can be improved dramatically


Offloading Memory Management Functions

(Figure: CPU, with its instruction and data caches and a bus interface unit, connected over the system bus to a separate Memory Processor with its own MP instruction and data caches and a second-level cache; numbered arrows show allocation requests, "Allocation Ready", and "De-All Completion" signals.)


Improved Performance

• Object-oriented and linked-data-structure applications exhibit poor locality: cache pollution caused by memory management functions

• Memory management functions do not use user data caches: on average, about 40% of cache misses eliminated

• The memory manager does not need large data caches


Improved Execution Performance

Benchmark   | % cycles spent on malloc | Instructions (conventional) | Instructions (separate hardware) | % perf. increase (separate HW) | % perf. increase (fastest separate HW)
255.vortex  | 0.59  | 13,020,462,240 | 12,983,022,203 | 2.81  | 2.90
164.gzip    | 0.04  | 4,540,660      | 4,539,765      | 0.031 | 0.0346
197.parser  | 17.37 | 2,070,861,403  | 1,616,890,742  | 3.19  | 18.8
espresso    |       |                |                |       |
Cfrac       | 31.17 | 599,365        | 364,679        | 19.03 | 39.99
bisort      | 2.08  | 620,560,644    | 607,122,284    | 10.03 | 12.76


Other Uses of Hardware Memory Manager

Dynamic relocation of objects to improve locality

The hardware manager can track object usage and relocate objects without the CPU's knowledge

New and innovative allocation/garbage collection methods

    Estranged Buddy Allocator
    Contaminated Garbage Collector
    Predictive allocation to achieve "one-cycle" allocation
    Allocator bookkeeping data kept separate from objects