MAI: Memory Affinity Interface


Transcript of MAI: Memory Affinity Interface

HAL Id: inria-00344189 (https://hal.inria.fr/inria-00344189v3)

Submitted on 29 Sep 2009 (v3), last revised 14 Jun 2010 (v6)


MAi: Memory Affinity Interface
Christiane Pousa Ribeiro, Jean-François Méhaut

To cite this version: Christiane Pousa Ribeiro, Jean-François Méhaut. MAi: Memory Affinity Interface. [Technical Report] RT-0359, 2008. <inria-00344189v3>

Rapport de recherche - ISSN 0249-6399 - ISRN INRIA/RR--0359--FR+ENG

Thème NUM

INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE

MAI: Memory Affinity Interface

Christiane Pousa Ribeiro — Jean-François Méhaut

N° 0359

December 2008


MAi: Memory Affinity Interface

Christiane Pousa Ribeiro, Jean-François Méhaut

Thème NUM - Systèmes numériques

Équipe-Projet MESCAL

Rapport de recherche n° 0359 - December 2008 - 22 pages

Abstract: In this document, we describe an interface called MAI. This interface allows developers to manage memory affinity in NUMA architectures. The affinity unit in MAI is an array of the parallel application. A set of memory policies implemented in MAI can be applied to these arrays in a simple way. High-level functions implemented in MAI minimize the developer's work when managing memory affinity on NUMA machines. MAI's performance was evaluated on two different NUMA machines using three different parallel applications. The results obtained with MAI show important gains when compared with standard memory affinity solutions.

Key-words: multi-core architecture, memory affinity, static data, performance evaluation, programming

MAI: Interface pour l'Affinité de Mémoire

Résumé: This document describes MAI, an interface for controlling memory affinity on NUMA architectures. The affinity unit used by MAI is an application array. MAI provides a set of memory policies that can be applied to the arrays of an application in a simple way. High-level functions minimize the programmer's work. MAI's performance is evaluated on two different NUMA machines using three parallel applications. The results obtained with this interface show important gains compared with the standard solutions.

Mots-clés: NUMA architectures, memory affinity, static data, performance evaluation


1 Introduction

Parallel applications are generally developed using message passing or shared memory programming models. Both models have advantages and disadvantages. The message passing model scales better but is more complex to use. On the other hand, the shared memory programming model is less scalable but easier for the developer to use. The latter model can be used for parallel application development through libraries, languages, interfaces or compiler directives; some examples are Pthreads [1], OpenMP [2] and TBB [3].

The shared memory programming model is the ideal model for traditional UMA (Uniform Memory Access) architectures, such as symmetric multiprocessors. In these architectures the computer has a single memory, which is shared by all processors. This single memory often becomes a bottleneck when many processors access it at the same time, and the problem gets worse as the number of processors increases, since the single memory controller does not scale.

The limitation of classical symmetric multiprocessor (SMP) parallel systems can be overcome with NUMA (Non-Uniform Memory Access) architectures. The main characteristic of these architectures is multiple memory levels that are physically distributed but seen by the developer as a single memory [4]. NUMA architectures combine the efficiency and scalability of MPP (Massively Parallel Processing) with the programming facility of SMP machines [5]. However, because the memory is distributed between the machine nodes, the time spent to access it depends on the distance between the processor and the data. A memory access by a given processor can be local (the data is close) or remote (the interconnection network has to be used to access the data) [5, 6].

As these machines are being used as servers for applications that demand a low response time, it is important to assure memory affinity on them. Memory affinity is the guarantee that processing units will always have their data close to them. To reach the maximum performance on NUMA architectures it is thus necessary to minimise the number of remote accesses during the application execution; that is, processors and data have to be scheduled so that the distance between them is as small as possible [5, 6, 7, 8].

To reduce this distance and consequently increase the application performance on these architectures, many different solutions have been proposed. These solutions are based on algorithms, mechanisms and tools that perform allocation, migration and replication of memory pages to guarantee memory affinity in NUMA systems for an entire process [5, 9, 6, 7, 10, 11, 12, 4, 13, 14, 15, 16]. However, none of these solutions provides the variable/object as the memory affinity control unit: memory affinity is assured by applying a memory policy to an entire process. Thus, developers cannot express the different data access patterns of their application. This can result in sub-optimal performance, because memory affinity is related to the program's most important variables/objects (important in terms of memory consumption and accesses).

Thus, to allow different memory policies for each of the most important variables/objects of an application, and consequently a better control of memory affinity, we developed MAI (Memory Affinity Interface). In this report we introduce MAI, an interface developed in C that defines some high-level functions to deal with memory affinity in NUMA architectures. The main advantages of MAI are: it allows a different memory policy for each of the most important arrays of the application, it provides several standard and new memory policies, and it is a simple way to control memory affinity in parallel applications. In order to evaluate MAI's performance we made some experiments with memory-bound applications (MG from the NAS benchmark [17], Ondes 3D [18] and ICTM [19]) on two different NUMA machines. We then compared the results obtained with MAI with the Linux standard policy for NUMA machines [11] and with the solution most used by developers of parallel applications for NUMA machines [17]. This last solution is based on a parallel initialization of the data: threads and processes touch the data to place it close to them (in this work we call this solution Parallel-Init).

The report is structured as follows. In Section 2 we describe previous solutions for memory affinity management on NUMA architectures. Section 3 depicts our interface and its main functionalities. In Section 4 we present and discuss the performance evaluation of MAI. In the last section we present our conclusions and future work.

2 Memory Affinity Management Solutions

A lot of work has been done on memory affinity management for NUMA architectures. Most of it proposes new memory policies that use some intelligent mechanism to place/migrate memory pages. Other works propose new OpenMP directives or support integrated into the operating system. In this section we present these three groups of related work.

2.1 Memory Affinity Policies

In [9], the authors present a new memory policy called on-next-touch. This policy allows data migration the second time a thread/process touches the data. Thus, threads can have their data on the same node, allowing more local accesses. The performance evaluation of this policy was done using a real application whose main characteristic is an irregular data access pattern. The gain obtained with this solution is about 69% with 22 threads.

In [20], the authors present two new memory policies called skew-mapping and prime-mapping. In the first one, the allocation of memory pages is performed skipping one node per round. For example, suppose that we have to allocate 16 memory pages on four nodes. The first four pages will be placed on nodes 0, 1, 2, 3, the next four on nodes 1, 2, 3, 0, and so on. The prime-mapping policy works with virtual nodes to allocate data: after the allocation on the virtual nodes, the pages are re-allocated onto the real nodes. As scientific applications usually work with powers of 2 for data distribution, these two policies allow a better data distribution. The gains with these solutions are about 35% on some benchmarks.
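For illustration only, a minimal sketch of how such a round-robin-with-shift assignment can be computed; the helper name skew_node and the 16-page/4-node setup are ours, not part of [20]:

#include <stdio.h>

/* Sketch of the skew-mapping assignment described above: each round of
 * num_nodes pages shifts the starting node by one (0,1,2,3 then 1,2,3,0, ...). */
static int skew_node(int page, int num_nodes)
{
    return (page + page / num_nodes) % num_nodes;
}

int main(void)
{
    for (int page = 0; page < 16; page++)
        printf("page %2d -> node %d\n", page, skew_node(page, 4));
    return 0;
}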

The work in [16] presents two algorithms to perform page migration and assure memory affinity on NUMA machines. These algorithms use information obtained from the kernel scheduler to decide on migrations. The performance evaluation shows gains of 264% compared with an existing solution.


2.2 Memory Affinity with OpenMP Directives

In [21], the authors present a strategy to assure memory affinity using OpenMP on NUMA machines. The idea is to use information about the scheduling of threads and data and to establish relations between them. The work does not present formal OpenMP extensions but gives some suggestions of how this can be done and what has to be included. All the performance evaluation was done using tightly-coupled NUMA machines. The results show that their proposal can scale well on the machines used.

In [11], the authors present new OpenMP directives for memory allocation in OpenMP. The new directives allow developers to express how data have to be allocated on the NUMA machine. All memory allocation directives are for arrays and the Fortran programming language. The authors present the ideas behind the directives and how they can be implemented in a real compiler.

2.3 Memory Affinity with Operating System Support

NUMA support is now present in several operating systems, such as Linux and Solaris. This support can be found at the user level (with administration tools or shell commands) and at the kernel level (with system calls and NUMA APIs) [4].

The user-level support allows the programmer to specify a policy for memory placement and thread scheduling for an application. The advantage of using this support is that the programmer does not need to modify the application code. However, the chosen policy is applied to the entire application and cannot be changed during the execution.

The NUMA API is an interface that defines a set of system calls to apply memory policies and processes/threads scheduling. In this solution, the programmer must change the application code to apply the policies. The main advantage of this solution is a finer control of memory allocation.

The use of system calls or user-level tools to manage memory affinity on NUMA machines can provide some gains. However, developers must know the application and architecture characteristics. To use system calls, developers need to change the application source code and do the memory management by themselves. This is complex work and, depending on the application, expressive gains may not be achieved. User-level solutions require several shell commands from developers and do not allow fine control over memory affinity.

2.4 Conclusion on Related Work

None of the solutions presented in this section allows developers to manage memory affinity on NUMA machines using the most important arrays of the application. Additionally, complex code or modifications are necessary to do memory affinity management.

3 MAi

In this section we describe MAI and its features. We first describe the interface proposal and its main contributions. Then we explain its high-level functions and memory policies. Finally, we describe its implementation details and usage. MAI can be downloaded from [22].

3.1 The Interface

MAI is a library developed in C that defines some high-level functions to deal with memory affinity in applications executed on NUMA architectures. This library allows developers to manage memory affinity considering the most important arrays of their applications (important in terms of memory consumption and accesses). This characteristic makes memory management easier for developers because they do not need to care about pointers and page addresses as in the system call APIs for NUMA (libnuma in Linux, for example [4]). Furthermore, with MAI it is possible to have fine control over memory affinity: memory policies can be changed throughout the application code (different policies for different phases). This is not possible with user-level tools such as numactl in Linux.

MAI's main characteristics are: (i) simplicity of use (less complex than other solutions such as the NUMA API or numactl) and fine control of memory (several array-based memory policies), (ii) portability (hardware abstraction) and (iii) performance (better performance than other standard solutions).

Figure 1: Cyclic memory policies: (a) cyclic and (b) cyclic_block.

The library has seven memory policies: cyclic, cyclic_block, prime_mapp, skew_mapp, bind_all, bind_block and random. In the cyclic policies the memory pages that contain the array are placed in physical memory in a round-robin fashion over the memory blocks. The main difference between cyclic and cyclic_block (a block can be a set of rows or columns of the array) is the amount of memory pages used in each step of the cyclic process (Figure 1). In prime_mapp, memory pages are allocated on virtual memory nodes. The number of virtual memory nodes is chosen by the developer and has to be a prime number (Figure 2 (b), with the prime number equal to 5). This policy tries to spread memory pages across the NUMA machine. In skew_mapp the memory pages are placed on the NUMA machine in a round-robin way (Figure 2 (a)), but in every round a shift of one is applied to the memory node number. Both prime_mapp and skew_mapp can be used on blocks, like cyclic_block, where a block can be a set of rows or columns. In the bind_all and bind_block (a block can be a set of rows or columns of the array) policies, the memory pages that contain the array data are placed on memory blocks specified by the application developer. The main difference between bind_all and bind_block is that in the latter, pages are placed on the memory blocks that hold the threads/processes that will make use of them (Figure 3). The last memory policy is random. In this policy, memory pages that contain the array are placed in physical memory following a uniform random distribution. The main objective of this policy is to minimize memory contention on the NUMA machine.

Figure 2: Cyclic memory policies: (a) skew_mapp and (b) prime_mapp.
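As a rough illustration of the page-to-node mappings behind the cyclic family (cyclic and cyclic_block; skew_mapp was sketched in Section 2.1), the following lines compute the node that would receive a given page. The helper names and the block_pages parameter are ours and do not correspond to actual MAI internals:

#include <stdio.h>

/* cyclic: one page per memory block per round. */
static int cyclic_node(int page, int num_nodes)
{
    return page % num_nodes;
}

/* cyclic_block: a whole block of pages (for example the pages holding a set
 * of rows or columns) is placed before moving on to the next memory block. */
static int cyclic_block_node(int page, int block_pages, int num_nodes)
{
    return (page / block_pages) % num_nodes;
}

int main(void)
{
    for (int page = 0; page < 8; page++)
        printf("page %d: cyclic -> node %d, cyclic_block -> node %d\n",
               page, cyclic_node(page, 4), cyclic_block_node(page, 2, 4));
    return 0;
}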

These memory policies can be combined during the application steps to implement a new memory policy. Arrays are generally accessed in different ways during the application, so it is important to give the developer the opportunity to develop/create a memory policy for his application. The majority of the newly proposed memory affinity algorithms do not allow memory policy changes during the application execution. In MAI it is also possible to perform memory migration to correct any improper memory placement.

3.2 High-Level Functions

To develop applications using MAI the programmer just has to use some high-level functions. The functions are divided into five groups: system, allocation, memory, threads and statistics.

System Functions: the system functions are divided into functions to configure the interface and functions to obtain information about the platform. The configuration functions must be called at the beginning (mai_init()) and at the end (mai_final()) of the application code. The init() function is responsible for setting the node and thread ids that will be used for memory/thread policies and for starting the interface. This function can receive a file name as parameter. The file is an environment configuration file that gives information about the memory blocks and processors/cores of the underlying architecture. If no file name is given, the interface chooses which memory blocks and processors/cores to use. The final() function is used to finish the interface and free any data that was allocated by the developer. The other system functions are used to extract information about the target architecture and application.

Figure 3: Bind memory policies: (a) bind_all and (b) bind_block.

void mai_init(char filename[]);
void mai_final();
int mai_get_num_nodes();
int mai_get_num_threads();
int mai_get_num_cpu();
unsigned long* mai_get_nodes_id();
unsigned long* mai_get_cpus_id();
void mai_show_nodes();
void mai_show_cpus();

Allocation Functions: these functions allow the developer to allocate arrays of the C primitive types (CHAR, INT, DOUBLE and FLOAT) and of non-primitive types, such as structures. The functions need the number of items and the C type. They return a void pointer to the allocated data. The memory is always allocated sequentially, like a linear (1D) array. This strategy allows better performance and also makes it easier to define a memory policy for an array. An example is presented in Figure 4.

void* mai_alloc_1D(int nx, size_t size_item, int type);
void* mai_alloc_2D(int nx, int ny, size_t size_item, int type);
void* mai_alloc_3D(int nx, int ny, int nz, size_t size_item, int type);
void* mai_alloc_4D(int nx, int ny, int nz, int nk, size_t size_item, int type);
void mai_free_array(void *p);

Figure 4: Alloc functions example
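A minimal usage sketch of the system and allocation functions (it is not the exact code of Figure 4; the header name mai.h, the NULL argument to mai_init() and the MAI_DOUBLE type tag are assumptions):

#include "mai.h"   /* header name assumed */

int main(void)
{
    double *a;

    /* NULL: let MAI choose the memory blocks and cpus/cores itself
       (passing NULL instead of a configuration file name is an assumption). */
    mai_init(NULL);

    /* 1024 x 1024 doubles, allocated as one linear array;
       MAI_DOUBLE is a hypothetical type tag. */
    a = mai_alloc_2D(1024, 1024, sizeof(double), MAI_DOUBLE);

    /* ... apply a memory policy and compute on a ... */

    mai_free_array(a);
    mai_final();
    return 0;
}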

Memory Policy Functions: these functions allow the developer to select a memory policy for an application and to migrate data between the nodes. There are seven different policies: cyclic, cyclic_block, prime_mapp, skew_mapp, bind_all, bind_block and random. It is also possible to migrate rows and columns of an array between the nodes of the NUMA system using the migrate_* functions. An example is presented in Figure 5.

void mai_cyclic(void *p);
void mai_skew_mapp(void *p);
void mai_prime_mapp(void *p, int prime);
void mai_cyclic_rows(void *p, int nr);
void mai_skew_mapp_rows(void *p, int nr);
void mai_prime_mapp_rows(void *p, int prime, int nr);
void mai_cyclic_columns(void *p, int nc);
void mai_skew_mapp_columns(void *p, int nc);
void mai_prime_mapp_columns(void *p, int prime, int nc);

/* Bind memory policy - function prototypes */
void mai_bind_all(void *p);
void mai_bind_rows(void *p);
void mai_bind_columns(void *p);

/* Random memory policy - function prototypes */
void mai_random(void *p);
void mai_random_rows(void *p, int nr);
void mai_random_columns(void *p, int nc);

/* Migration - function prototypes */
void mai_migrate_rows(void *p, unsigned long node, int nr, int start);
void mai_migrate_columns(void *p, unsigned long node, int nc, int start);
void mai_migrate_scatter(void *p, unsigned long nodes);
void mai_migrate_gather(void *p, unsigned long node);

Figure 5: Memory policies functions example
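Along the lines of Figure 5, a short sketch of how these calls might be combined; the arrays, their dimensions, the target node and the MAI_DOUBLE tag are hypothetical:

#include "mai.h"   /* header name assumed */

/* Sketch: spread one array over the nodes, keep another close to the threads
 * that use its rows, then gather the first array on node 0 between phases. */
void place_arrays(void)
{
    double *pressure = mai_alloc_2D(512, 512, sizeof(double), MAI_DOUBLE);
    double *velocity = mai_alloc_3D(128, 128, 128, sizeof(double), MAI_DOUBLE);

    mai_cyclic(pressure);            /* pages placed node by node           */
    mai_bind_rows(velocity);         /* rows near the threads that use them */

    /* ... computation phase ... */

    mai_migrate_gather(pressure, 0); /* collect the whole array on node 0   */

    mai_free_array(pressure);
    mai_free_array(velocity);
}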

Thread Policy Functions: these functions allow the developer to bind or migrate a thread to a cpu/core of the NUMA system. The bind_threads function uses the thread identifiers and the cpu/core mask (given in the configuration file) to bind threads.

void bind_threads();
void bind_thread_posix(int id);
void bind_thread_omp(int id);
void mai_migrate_thread(pid_t id, unsigned long cpu);
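A possible way to combine the thread functions with an OpenMP parallel region; this usage pattern is our assumption, not something prescribed by MAI:

#include <omp.h>
#include "mai.h"   /* header name assumed */

void parallel_phase(void)
{
    #pragma omp parallel
    {
        /* pin each OpenMP thread to a cpu/core taken from the configured mask */
        bind_thread_omp(omp_get_thread_num());

        /* ... per-thread work on the placed arrays ... */
    }
}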

Statistics Functions: these functions allow the developer to get some information about the application execution on the NUMA system. There are statistics about memory and thread placement/migration and about the overhead of migrations.

/* Page information - function prototypes */
void mai_print_pagenodes(unsigned long *pageaddrs, int size);
int mai_number_page_migration(unsigned long *pageaddrs, int size);
double mai_get_time_pmigration();

/* Thread information - function prototypes */
void mai_print_threadcpus();
int mai_number_thread_migration(unsigned int *threads, int size);
double mai_get_time_tmigration();
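A small sketch of how the statistics functions could be used after a phase that triggered migrations; how the page addresses are collected is left out, and the unit returned by mai_get_time_pmigration() is an assumption:

#include <stdio.h>
#include "mai.h"   /* header name assumed */

void report_migrations(unsigned long *pageaddrs, int npages)
{
    mai_print_pagenodes(pageaddrs, npages);        /* node holding each page */

    int moved = mai_number_page_migration(pageaddrs, npages);
    double t  = mai_get_time_pmigration();         /* migration overhead     */

    printf("%d pages migrated, overhead: %f\n", moved, t);
}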

3.3 Implementation Details

The MAI interface is implemented in C for Linux-based NUMA systems. Developers who want to use MAI have to program their applications using the shared memory programming model. The supported shared memory libraries are POSIX threads and OpenMP for the C/C++ languages. To use this interface, libnuma must be installed on the NUMA system. Figure 6 presents the architecture of the interface.

Figure 6: MAI Architecture

To guarantee memory affinity, all memory management is done using MAI structures. These structures are loaded during the initialization of the interface (with the init() function) and are used for the rest of the application execution. All allocation and memory policy algorithms consider the NUMA architecture before doing any allocation or placement of memory pages on the memory blocks. A memory policy set for a variable/object of the application can be changed during the application steps. To do this, developers just have to call the memory policy function for that variable/object and MAI will change it.

Thread scheduling in MAI is done using a one-thread-per-processor/core strategy. This minimizes the memory contention problem that exists in some NUMA architectures. If the number of threads is bigger than the number of processors/cores, more threads per processor/core will be used.


4 Performance Evaluation

In this section we present the performance evaluation of MAI. We first describe the two NUMA platforms used in our experiments. Then we describe the applications and their main characteristics. Finally, we present the results and their analysis.

4.1 NUMA Platforms

In our experiments we used two different NUMA architectures. The first NUMA architecture is a sixteen-processor Itanium 2 machine, with a frequency of 1.6 GHz and 9 MB of L3 cache memory per processor. The machine is organized in four nodes of four processors and has in total 64 GB of main memory. This memory is divided into four blocks, one per node (16 GB of local memory). The nodes are connected using a FAME Scalability Switch (FSS), which is a backplane developed by Bull [23]. This connection gives different memory latencies for remote accesses between nodes (NUMA factor from 2 to 2.5). A schematic figure of this machine is given in Figure 7. We will use the name Itanium 2 for this architecture.

The operating system used on this machine was Linux version 2.6.18-B64k.1.21 and the distribution was Red Hat with support for the NUMA architecture (system calls and the numactl user API). The compiler used for the OpenMP code compilation was ICC (Intel C Compiler).

Figure 7: NUMA platforms.

The second NUMA architecture used is an eight dual-core AMD Opteron machine, with a frequency of 2.2 GHz and 2 MB of cache memory per processor. The machine is organized in eight nodes and has in total 32 GB of main memory. This memory is divided among the eight nodes (4 GB of local memory per node) and the system page size is 4 KB. Each node has three connections, which are used to link with other nodes or with input/output controllers (node 0 and node 1). These connections give different memory latencies for remote accesses between nodes (NUMA factor from 1.2 to 1.5). A schematic figure of this machine is given in Figure 7. We will use the name Opteron for this architecture.

The operating system used on this machine was Linux version 2.6.23-1-amd64 and the distribution was Debian with support for the NUMA architecture (system calls and the numactl user API). The compiler used for the OpenMP code compilation was the GNU Compiler Collection (GCC).

4.2 Applications

In our experiments we used three different applications: one kernel from the NAS benchmark (MG [17]), Ondes 3D (a seismology application [24]) and ICTM (Interval Categorization Tessellation Model [25]). All the applications have as main characteristics a high memory consumption, a large number of data accesses and regular and/or irregular data access patterns. In this work, we consider the access pattern regular when the threads of an application always access the same data set. On the contrary, with an irregular data access pattern, the threads or processes of an application do not always access the same data set; in particular, they do not access the same data set during the initialization and the computational steps of the application. All applications were developed in C using OpenMP for the code parallelization.

One of the three applications is a kernel from the NAS benchmark [17] called MG. This kernel was selected because it is representative in terms of computations and presents important memory characteristics. MG is a kernel that uses a V-cycle MultiGrid method to calculate the solution of the 3D scalar Poisson equation. The main characteristic of this kernel is that it tests both short and long distance data movement. Figure 8 presents a schema of the application.

Figure 8: Multi Grid

Ondes 3D is a seismology application that simulates the propagation of a seismic wave in a three-dimensional region [18]. This application has three main steps: data allocation, data initialization and propagation calculation (operations). In the allocation step all arrays are allocated. The initialization is important because the data are touched and physically allocated on the memory blocks. The last step is composed of two big calculation loops. In these loops all arrays are accessed for reading and writing, with the same access patterns as in the initialization step. Figure 9 presents a schema of the application.


Figure 9: Ondes 3D Application

ICTM is a multi-layered and multi-dimensional tessellation model for the categorization of geographic regions considering several different characteristics (relief, vegetation, climate, land use, etc.) [19]. This application has two main steps: data allocation-initialization and classification computation. In the allocation-initialization step all arrays are allocated and initialized. The second step is composed of four big calculation loops. In these loops all arrays are accessed for reading and writing, with both regular and irregular access patterns. Figure 10 presents a schema of the application.

Figure 10: ICTM Application


4.3 Results

In this section we present the results obtained with each application on the two NUMA machines with MAI and with the two other solutions (the Linux native solution, first-touch, and Parallel-Init, that is, first-touch used with a parallel data initialization). The experiments were done using 2, 4, 8 and 16 threads and a problem size of 256² elements for MG, 2.0 GB for ICTM and 2.4 GB for Ondes 3D. As performance metrics we selected speedup (sequential execution time / parallel execution time) and efficiency (speedup / number of threads).
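For instance, if the sequential run takes 120 seconds and the run with 8 threads takes 20 seconds, the speedup is 120/20 = 6 and the efficiency is 6/8 = 0.75.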

In Figure 11(a) we present the speedups and efficiencies for the MG application when executed on the Opteron machine. As we can observe, for this application and on the Opteron NUMA machine the best memory policy is cyclic. An important characteristic of this NUMA machine is its rather limited network bandwidth, so optimizing bandwidth on this machine matters more than optimizing latency. As the NUMA factor of this machine is small, the influence of latency for this type of application (irregular access pattern) is not high. Thus, distributing the memory pages in a cyclic way avoids contention and consequently reduces the influence of the network bandwidth on the application execution. The other memory policies had worse results because they try to optimize latency; since the problem on this machine is network contention, these policies do not give good results.

In Figure 11(b) we present the speedups and efficiencies for the MG application on the Itanium machine. The bind_block memory policy is the best one: as this machine has a high NUMA factor, trying to allocate data close to the threads that use it pays off. The cyclic memory policy had worse results because it tries to optimize bandwidth and, as the machine has a high NUMA factor, optimizing bandwidth does not give good results. The other two policies, First-Touch and Parallel-Init, also do not show good gains. The first-touch policy centralizes data on one node, so threads make a lot of remote and concurrent accesses to the same memory block. On the other hand, Parallel-Init tries to distribute data between the machine nodes but, as the access patterns are irregular, this strategy does not give good speedups and efficiencies.

In Figures 12(a) and 12(b) we present the results for the ICTM application on the Opteron and the Itanium machines. For this application, as for MG, the best memory policies are cyclic for the Opteron and bind_block for the Itanium.

The Opteron has a network bandwidth problem, so it is better to spread the data among the NUMA nodes. By using this strategy, we reduce the number of simultaneous accesses to the same memory block and, as a consequence, we can obtain better performance. The bind_all and bind_block policies presented worse speedups on the Opteron machine. These policies allow a lot of remote accesses by threads and also do not take the network bandwidth into account during memory allocation.

We can see that the results for the Itanium are quite different from those obtained on the Opteron (Figure 12(b)). By allocating data closer to the threads which compute on them, we decrease the impact of the high NUMA factor. As a result, bind_block was the best policy for this machine.

In Figures 13(a) and 13(b) we present the results for the Ondes 3D application on the Opteron and the Itanium machines. As we can observe, the results obtained with the MAI policies, First-Touch and Parallel-Init were similar. The main reason for this is the regularity of the data access pattern of this application: threads always access the same set of memory pages. However, there is a small performance gain with the bind_block policy. This policy places threads and data on the same node, so it avoids remote accesses. With the other solutions, threads can migrate or data are placed on a remote node, allowing more remote accesses.


Figure 11: MG Application. (a) Opteron Speedup-Efficiency; (b) Itanium Speedup-Efficiency. (Plots of efficiency and speedup versus the number of threads, from 2 to 16, for the Cyclic, Bind-Block, Parallel-Init and First-Touch policies against the ideal curve.)


Figure 12: ICTM Application. (a) Opteron Speedup-Efficiency; (b) Itanium Speedup-Efficiency. (Plots of efficiency and speedup versus the number of threads, from 2 to 16, for the Cyclic, Bind-Block and First-Touch policies against the ideal curve.)


Figure 13: Ondes 3D Application. (a) Opteron Speedup-Efficiency; (b) Itanium Speedup-Efficiency. (Plots of efficiency and speedup versus the number of threads, from 2 to 16, for the Cyclic, Bind-Block, Parallel-Init and First-Touch policies against the ideal curve.)


5 Conclusion

In this report we presented MAI, an interface to manage memory affinity on NUMA machines. This interface provides some memory policies that can be applied to the arrays of an application to assure memory affinity. MAI is also an easy way to manage memory on NUMA machines: high-level functions allow developers to change their application with little work.

The experiments made with MAI show that the interface gives better performance than some existing solutions. We evaluated MAI using three different applications and two NUMA machines. The results show gains of 2%-60% relative to the Linux memory policy and to an optimized application code version.

This work shows that memory has to be managed on NUMA machines and that this management must be easy for the developer. Thus, an interface that helps developers to do this is essential. As future work we highlight: a MAI version that allows developers to implement new memory policies as plug-ins, new performance evaluations of MAI with other applications, and the integration of MAI into GOMP, the GCC implementation of OpenMP.

References

[1] D. R. Butenhof, "Programming with POSIX Threads," 1997.

[2] OpenMP, "The OpenMP specification for parallel programming - http://www.openmp.org," 2008. [Online]. Available: http://www.openmp.org

[3] "Threading Building Blocks - http://www.threadingbuildingblocks.org/," 2008. [Online]. Available: http://www.threadingbuildingblocks.org/

[4] A. Kleen, "A NUMA API for Linux," Tech. Rep. Novell-4621437, April 2005. [Online]. Available: http://whitepapers.zdnet.co.uk/0,1000000651,260150330p,00.htm

[5] T. Mu, J. Tao, M. Schulz, and S. A. McKee, "Interactive Locality Optimization on NUMA Architectures," in SoftVis '03: Proceedings of the 2003 ACM Symposium on Software Visualization. New York, NY, USA: ACM, 2003, pp. 133 ff.

[6] J. Marathe and F. Mueller, "Hardware Profile-Guided Automatic Page Placement for NUMA Systems," in PPoPP '06: Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. New York, NY, USA: ACM, 2006, pp. 90-99. [Online]. Available: http://portal.acm.org/citation.cfm?id=1122987

[7] F. Bellosa and M. Steckermeier, "The Performance Implications of Locality Information Usage in Shared-Memory Multiprocessors," J. Parallel Distrib. Comput., vol. 37, no. 1, pp. 113-121, August 1996. [Online]. Available: http://portal.acm.org/citation.cfm?id=241170.241180

[8] A. Carissimi, F. Dupros, J.-F. Mehaut, and R. V. Polanczyk, "Aspectos de Programação Paralela em arquiteturas NUMA," in VIII Workshop em Sistemas Computacionais de Alto Desempenho, 2007.

[9] H. Löf and S. Holmgren, "Affinity-on-next-touch: Increasing the Performance of an Industrial PDE Solver on a cc-NUMA System," in ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing. New York, NY, USA: ACM, 2005, pp. 387-392. [Online]. Available: http://portal.acm.org/citation.cfm?id=1088149.1088201

[10] A. Joseph, J. Pete, and R. Alistair, "Exploring Thread and Memory Placement on NUMA Architectures: Solaris and Linux, UltraSPARC/FirePlane and Opteron/HyperTransport," 2006, pp. 338-352. [Online]. Available: http://dx.doi.org/10.1007/11945918_35

[11] J. Bircsak, P. Craig, R. Crowell, Z. Cvetanovic, J. Harris, C. A. Nelson, and C. D. Offner, "Extending OpenMP for NUMA Machines," in SC '00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, Dallas, Texas, USA, 2000.

[12] L. T. Schermerhorn, "Automatic Page Migration for Linux," in Proceedings of the Linux Symposium, Sydney, Australia, 2007.

[13] B. Verghese, S. Devine, A. Gupta, and M. Rosenblum, "Operating System Support for Improving Data Locality on CC-NUMA Compute Servers," in ASPLOS-VII: Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, NY, USA: ACM, 1996, pp. 279-289. [Online]. Available: http://portal.acm.org/citation.cfm?id=237090.237205

[14] R. Vaswani and J. Zahorjan, "The Implications of Cache Affinity on Processor Scheduling for Multiprogrammed, Shared Memory Multiprocessors," SIGOPS Oper. Syst. Rev., vol. 25, no. 5, pp. 26-40, 1991.

[15] R. Chandra, S. Devine, B. Verghese, A. Gupta, and M. Rosenblum, "Scheduling and Page Migration for Multiprocessor Compute Servers," SIGPLAN Not., vol. 29, no. 11, pp. 12-24, 1994.

[16] D. S. Nikolopoulos, T. S. Papatheodorou, C. D. Polychronopoulos, J. Labarta, and E. Ayguadé, "User-Level Dynamic Page Migration for Multiprogrammed Shared-Memory Multiprocessors," in ICPP '00: Proceedings of the 2000 International Conference on Parallel Processing, 2000, pp. 95-104.

[17] H. Jin, M. Frumkin, and J. Yan, "The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance," NAS System Division, NASA Ames Research Center, Tech. Rep. NAS-99-011, 1999. [Online]. Available: https://www.nas.nasa.gov/Research/Reports/Techreports/1999/PDF/nas-99-011.pdf

[18] F. Dupros, H. Aochi, A. Ducellier, D. Komatitsch, and J. Roman, "Exploiting Intensive Multithreading for the Efficient Simulation of 3D Seismic Wave Propagation," in CSE '08: Proceedings of the 11th International Conference on Computational Science and Engineering, Sao Paulo, Brazil, 2008, pp. 253-260.

[19] R. K. S. Silva, C. A. F. D. Rose, M. S. de Aguiar, G. P. Dimuro, and A. C. R. Costa, "HPC-ICTM: A Parallel Model for Geographic Categorization," in JVA '06: Proceedings of the IEEE John Vincent Atanasoff 2006 International Symposium on Modern Computing. Washington, DC, USA: IEEE Computer Society, 2006, pp. 143-148.

[20] R. Iyer, H. Wang, and L. Bhuyan, "Design and Analysis of Static Memory Management Policies for CC-NUMA Multiprocessors," College Station, TX, USA, Tech. Rep., 1998.

[21] D. S. Nikolopoulos, E. Artiaga, E. Ayguadé, and J. Labarta, "Exploiting Memory Affinity in OpenMP Through Schedule Reuse," SIGARCH Computer Architecture News, vol. 29, no. 5, pp. 49-55, 2001.

[22] C. P. Ribeiro and J.-F. Méhaut, "Memory Affinity Interface," INRIA, Tech. Rep. RT-0359, December 2008. [Online]. Available: http://hal.inria.fr/inria-00344189/en/

[23] BULL, "Bull - Architect of an Open World," 2008. [Online]. Available: http://www.bull.com/

[24] F. Dupros, C. Pousa, A. Carissimi, and J.-F. Méhaut, "Parallel Simulations of Seismic Wave Propagation on NUMA Architectures," in ParCo'09: International Conference on Parallel Computing (to appear), Lyon, France, 2009.

[25] M. Castro, L. G. Fernandes, C. Pousa, J.-F. Méhaut, and M. S. de Aguiar, "NUMA-ICTM: A Parallel Version of ICTM Exploiting Memory Placement Strategies for NUMA Machines," in PDSEC '09: Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium - IPDPS. Rome, Italy: IEEE Computer Society, 2009.

