QQ: Nanoscale Timing and Profiling


QQ: Nanoscale Timing and Profiling. James Frye †*, James G. King †*, Christine J. Wilson *◊, Frederick C. Harris, Jr. †*. †Department of Computer Science and Engineering, *Brain Computation Lab, ◊Biomedical Engineering, University of Nevada, Reno, NV 89557.


Page 1: QQ: Nanoscale Timing and Profiling

QQ: Nanoscale Timing and Profiling

James Frye †*, James G. King †*, Christine J. Wilson *◊, Frederick C. Harris, Jr. †*

†Department of Computer Science and Engineering
*Brain Computation Lab
◊Biomedical Engineering

University of Nevada
Reno, NV 89557

Page 2: QQ: Nanoscale Timing and Profiling

What is QQ

QQ is a simple and efficient tool for measuring timing and memory use

Developed for the examination of a massively parallel program (NCS)

Easily extensible to inspect other programs

Page 3: QQ: Nanoscale Timing and Profiling

The Place: The Human Brain

Goal: create the first large-scale, synaptically realistic cortical computational model.

Purpose: Simulation Experiments
  Drug Trials
  Alzheimer’s Research
  Robotics

Page 4: QQ: Nanoscale Timing and Profiling

The Science:

Neurons
  Excitatory
  Interneurons (inhibitory)

Columns
  High connectivity within columns.
  Less connectivity across columns

Page 5: QQ: Nanoscale Timing and Profiling

The Science (cont):

Channels
  Potassium Family
  M, A, AHP Channels
  Suppressing behavior on parent cell

Synapses
  Analog converter of binary spike event.
  Contextual filters.

Page 6: QQ: Nanoscale Timing and Profiling

Neurons

The Science (cont):

Page 7: QQ: Nanoscale Timing and Profiling

NCS Biology

The membrane voltage determines the cell’s firing rate

Once threshold voltage is reached the cell sends an action potential to its connected synapses

[Figure: Action Potential. Membrane voltage (mV) versus Time (ms); axis marks at -45, 0, and 30 mV.]

Page 8: QQ: Nanoscale Timing and Profiling

2-Cell Model

[Figure: Pre-Synaptic Cell and Post-Synaptic Cell traces; scale bar 0.2 mV; Time (ms) axis from 0 to 500.]

Page 9: QQ: Nanoscale Timing and Profiling

No Channels

Sustained firing at maximum rate during a continuous stimulus

Page 10: QQ: Nanoscale Timing and Profiling

Ka Channel

Slows the initial response during a sustained stimulus

Page 11: QQ: Nanoscale Timing and Profiling

Km Channel

Prevents continuous bursting during a continuous stimulus

Page 12: QQ: Nanoscale Timing and Profiling

Kahp Channel

Dampens the effect while still allowing for some action potentials during a sustained stimulus

Page 13: QQ: Nanoscale Timing and Profiling

QQ Development

QQ was developed to optimize a parallel program used to simulate cortical neurons, the NeoCortical Simulator (NCS)

Our goal for the summer of 2002 was to simulate 10^6 neurons with 10^9 synapses within a realistic run time

Before optimization, NCS would run about 1.5 million synapses at a rate of 1 day per simulated second of synaptic activity

Clearly optimization of NCS was needed

Page 14: QQ: Nanoscale Timing and Profiling

QQ Design

QQ is designed so that all of its routines can be selectively compiled into a program

In the QQ.h header file, each routine is defined with a preprocessor directive, so that if profiling is not enabled, it reduces to an empty statement.

#ifdef QQ_ENABLE
void QQInit (int);
#else
#define QQInit(dummy)
#endif
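The compile-out pattern above can be sketched as follows. This is a minimal illustration, not the actual QQ.h: only QQ_ENABLE and QQInit come from the slide, while qq_initialized and run_with_profiling_hooks are hypothetical names added so the effect can be observed.

```c
/* Sketch of a selectively compiled profiling hook.
 * QQ_ENABLE and QQInit come from the slide; everything else is hypothetical. */

static int qq_initialized = 0;   /* hypothetical state touched by the real routine */

#ifdef QQ_ENABLE
/* Profiling build: QQInit is a real function. */
static void QQInit(int nodeCount)
{
    qq_initialized = nodeCount;
}
#else
/* Normal build: the call disappears, leaving an empty statement.
 * Note there is no space between the macro name and its parameter list. */
#define QQInit(dummy) ((void)0)
#endif

/* Application code is written once and does not change between builds.
 * Returns what QQInit stored, or 0 when profiling is compiled out. */
static int run_with_profiling_hooks(void)
{
    QQInit(4);
    return qq_initialized;
}
```

Built without -DQQ_ENABLE, the call to QQInit costs nothing at run time; built with it, the same source line calls the real routine.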

Page 15: QQ: Nanoscale Timing and Profiling

QQ Design

Memory profiling routines also use the C preprocessor to intercept library calls

#ifdef QQ_ENABLE
#define malloc(arg) MemMalloc (MEM_KEY, arg)
#endif

The MemMalloc function records allocation information, calls the malloc function to do the actual allocation, and returns the result to the caller
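A minimal sketch of this interception, assuming a per-key tally is all that is recorded. MemMalloc, MEM_KEY, and KEY_CHANNEL appear in the slides; KEY_SYNAPSE, the counter arrays, and make_channel_buffer are hypothetical, and the real MemMalloc presumably records more than this.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical key set; only KEY_CHANNEL is named in the slides. */
enum { KEY_CHANNEL, KEY_SYNAPSE, KEY_COUNT };

static size_t bytes_by_key[KEY_COUNT];   /* bytes requested per key   */
static size_t calls_by_key[KEY_COUNT];   /* allocation calls per key  */

/* Defined before the macro below, so the malloc here is the library call. */
static void *MemMalloc(int key, size_t size)
{
    void *p = malloc(size);
    if (p != NULL) {
        bytes_by_key[key] += size;
        calls_by_key[key] += 1;
    }
    return p;
}

/* Each source file picks its key, then every later malloc is redirected. */
#define MEM_KEY KEY_CHANNEL
#define malloc(arg) MemMalloc(MEM_KEY, (arg))

/* Ordinary-looking application code: expands to MemMalloc(KEY_CHANNEL, n). */
static void *make_channel_buffer(size_t n)
{
    return malloc(n);
}
```

Because the redirection happens in the preprocessor, the calling code needs no changes beyond defining MEM_KEY before the header that installs the macro.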

Page 16: QQ: Nanoscale Timing and Profiling

QQ Timing

Extremely accurate measurement of execution speed.

In theory, fine-grained resolution to a single clock cycle.
  Using the IA32 instruction RDTSC

In practice, measurements are accurate to tens of cycles
  Because of instruction reordering and multiple pipelines in the CPU
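As a rough sketch of cycle-level timing, the counter can be read through the __rdtsc() intrinsic that GCC and Clang expose in <x86intrin.h>. This is not QQ's own code: read_cycles and cycles_to_usec are hypothetical names, and the non-x86 fallback returns nanoseconds rather than cycles.

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>   /* GCC/Clang header providing __rdtsc() */

/* Reads the processor's time-stamp counter. */
static uint64_t read_cycles(void)
{
    return __rdtsc();
}
#else
#include <time.h>

/* Fallback for non-x86 builds: monotonic nanoseconds, not true cycles. */
static uint64_t read_cycles(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* A clock rate in MHz is cycles per microsecond,
 * so cycles divided by MHz gives elapsed microseconds. */
static double cycles_to_usec(uint64_t cycles, double mhz)
{
    return (double)cycles / mhz;
}
```

Taking the difference of two read_cycles() values around a code block gives its cost in cycles; for example, 400 cycles is 0.5 µs at 800 MHz. The out-of-order effects mentioned above still apply, since nothing here serializes the instruction stream.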

Page 17: QQ: Nanoscale Timing and Profiling

Timing Measurements

Measuring the impact of a line change in the calculation for the Km channel

From:
I = unitaryG * strength * pow (m, mPower) * (ReversePot - CmpV);

To:
I = unitaryG * strength * (ReversePot - CmpV);

For a Km-type channel, mPower is always 1, so we were able to change the equation to streamline the execution

Wrapping the line in calls to QQ, we measure the effect of this single change

QQStateOn (QQ_Km);
I = unitaryG * strength * (ReversePot - CmpV);
QQStateOff (QQ_Km);

Page 18: QQ: Nanoscale Timing and Profiling

Timing Measurements

Note that both code versions give similar cycle counts on different processors, though more consistent and somewhat fewer on P4 than P3.

Times for similar counts are proportional to processor speed, as expected.

The function call pays a heavy penalty on its first call. It is only called by the Km channel code here, so that time represents the first load of the code into cache

Page 19: QQ: Nanoscale Timing and Profiling

Timing Measurements

PIII – 800 MHz

[Figure: four charts comparing old and new code, x-axis 1 to 10. Old uSec vs. New uSec (y-axis 0 to 40, and 0 to 0.6); Old Clockcycles vs. New Clockcycles (y-axis 0 to 40000, and 0 to 1400).]

Page 20: QQ: Nanoscale Timing and Profiling

Timing Measurements

P4 – 2200 MHz

[Figure: four charts comparing old and new code, x-axis 1 to 10. Old uSec vs. New uSec (y-axis 0 to 40, and 0 to 0.7); Old Clockcycles vs. New Clockcycles (y-axis 0 to 40000, and 0 to 600).]

Page 21: QQ: Nanoscale Timing and Profiling

Expanding Timing Information

QQ allows the user to record an additional item of information with the normal timing.

  QQCount records an integer with the key
    QQCount( eventKey, integer_of_interest );

  QQValue records a double precision floating point value with the key
    QQValue( eventKey, double_of_interest );

  QQState records a state of ON or OFF with the key
    QQStateOn( eventKey ); QQStateOff( eventKey );

These will be described during discussion of the output format
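To make the three recording calls concrete, here is a toy in-memory recorder. It is an assumption-laden sketch rather than QQ's implementation: the four function names match the slide, but DemoItem, demo_log, demo_clock (a counter standing in for the cycle counter), and the fixed capacity are hypothetical.

```c
#include <stddef.h>

enum { DEMO_LOG_CAPACITY = 1024 };   /* hypothetical fixed log size */

/* One recorded event: key, one extra item of information, and a time. */
typedef struct {
    int key;
    union {                 /* one slot holds whichever item was recorded */
        int state;          /* 1 = on, 0 = off */
        int count;
        double value;
    } info;
    unsigned long long time;
} DemoItem;

static DemoItem demo_log[DEMO_LOG_CAPACITY];
static size_t demo_len = 0;
static unsigned long long demo_clock = 0;   /* stand-in for the cycle counter */

/* Claims the next log slot and stamps it; returns NULL when the log is full. */
static DemoItem *demo_next(int key)
{
    DemoItem *it;
    if (demo_len == DEMO_LOG_CAPACITY)
        return NULL;
    it = &demo_log[demo_len++];
    it->key = key;
    it->time = ++demo_clock;
    return it;
}

static void QQCount(int key, int n)
{
    DemoItem *it = demo_next(key);
    if (it) it->info.count = n;
}

static void QQValue(int key, double v)
{
    DemoItem *it = demo_next(key);
    if (it) it->info.value = v;
}

static void QQStateOn(int key)
{
    DemoItem *it = demo_next(key);
    if (it) it->info.state = 1;
}

static void QQStateOff(int key)
{
    DemoItem *it = demo_next(key);
    if (it) it->info.state = 0;
}
```

Sharing one slot for the state, count, or value keeps every entry the same size, which makes the log easy to write out and index later.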

Page 22: QQ: Nanoscale Timing and Profiling

QQ Memory

Records memory allocation dedicated to the code-block, rather than the total allocation due to code and library calls, to single-byte accuracy

Page 23: QQ: Nanoscale Timing and Profiling

QQ Memory Example

NCS implementation of ion channels

Suppose we want to know the total memory used by all channels. Each channel function would require the channel key:

#define MEM_KEY KEY_CHANNEL

Then at any point in the program execution, just call the MemPrint function to display memory use

Page 24: QQ: Nanoscale Timing and Profiling

Memory Usage Output

Memory Allocation: Total Allocated = 988 KBytes

Item          Object  Number   Number   Object  Alloc  Total  Max
              Size    Created  Deleted  KB      KB     KB     KB
Brain         120     1        0        1       0      1      1
CellManager   44      1        0        1       1      1      1
Cell          16      100      0        2       0      2      2
Channel       252     300      0        74      0      74     74
Compartment   324     100      0        32      2      33     33
MessageMgr    16      1        0        1       205    205    205
MessageBus    0       0        0        0       1      1      1
Report        80      1        0        1       1      1      1
Stimulus      252     1        0        1       1      1      1
Synapse       44      10000    0        430     118    547    547
------------------------------------------------------------------
1             2       3        4        5       6      7      8

Key
  1 - Internal name given to recording category
  2 - The size of the object being allocated - it's valid only if all allocations are the same size, as with "new Object".
  3 - Number of allocation calls made: new, malloc, calloc, etc.
  4 - Number of free or delete calls made
  5 - KBytes allocated via object creation (new)
  6 - KBytes allocated via *alloc calls
  7 - Total memory currently allocated
  8 - Max memory ever allocated = high-water mark.

Page 25: QQ: Nanoscale Timing and Profiling

QQ Applications

Brain Communication Server (BCS)
NCS

Page 26: QQ: Nanoscale Timing and Profiling

Brain Communication Server

Further experimentation with the simulator required another application be developed to coordinate communication between NCS and numerous potential clients:

• virtual creatures
• physical robots
• visualization tools

[Diagram labels: BCS, NCS]

Page 27: QQ: Nanoscale Timing and Profiling

Optimizing BCS

Different applications make non-sequential requests. No single function was called in a loop iterating several times, so time needed to be measured over the course of execution, followed by an analysis of QQ’s final output.

Page 28: QQ: Nanoscale Timing and Profiling

Parsing QQ’s output

QQ uses a straightforward layout for the final output file

The data can be easily extracted and displayed in a text report as shown on the previous slide or sent to a graphical display

The following slides describe the output format and how to manage the information

Page 29: QQ: Nanoscale Timing and Profiling

QQ file format

Header
  Number of Keys (int), Key Name string length (int)

Key Table
  For each Key: Key ID (int), Key type (int), Key name (char *)

Node Information
  Number of nodes (int)

Node Table
  For each Node: Byte offset to data (size_t), Number of entries (int), Starting Base Time (unsigned long long), MHz (double)

Data
  For each Node, for each entry: item (QQItem)

Page 30: QQ: Nanoscale Timing and Profiling

QQ Format: Data Close Up

Previous Sections
  Node 0 Byte offset
  Node 1 Byte offset
  Node 2 Byte offset

Data
  Node 0: for each entry, Key (int), [Optional Info], Event Time (unsigned long long)
  Node 1: for each entry, Key (int), [Optional Info], Event Time (unsigned long long)
  Node 2: for each entry, …

Where Optional Info is the size of a double, but contains a State (int), a Count (int), or a Value (double)

Page 31: QQ: Nanoscale Timing and Profiling

Gathering the Results

After reading a node’s data section, entries with the same key can be gathered.

Using the key table, the user knows what is contained in the second block of a timing entry

Example: Key 2 has type “State”
  The second block contains integer 1 for “on” or integer 0 for “off”
  By subtracting the event times, the length of time spent in the “on” state is determined

2 1 109342759
2 0 109342768
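The subtraction described above can be sketched as a small gathering routine. StateEntry and cycles_on are hypothetical names, and the struct is a simplified stand-in for a decoded State entry rather than the actual QQItem layout.

```c
#include <stddef.h>

/* Simplified, already-decoded State entry (not the on-disk QQItem). */
typedef struct {
    int key;
    int state;                  /* 1 = on, 0 = off */
    unsigned long long time;    /* event time in clock cycles */
} StateEntry;

/* Sums the cycles one key spent "on", pairing each on entry
 * with the next off entry for the same key. */
static unsigned long long cycles_on(const StateEntry *e, size_t n, int key)
{
    unsigned long long total = 0, started = 0;
    int is_on = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (e[i].key != key)
            continue;                       /* entry belongs to another key */
        if (e[i].state == 1 && !is_on) {
            started = e[i].time;            /* remember when it switched on */
            is_on = 1;
        } else if (e[i].state == 0 && is_on) {
            total += e[i].time - started;   /* close the interval */
            is_on = 0;
        }
    }
    return total;
}
```

For the two entries on the slide, 109342768 - 109342759 gives 9 cycles in the "on" state; dividing by the node's MHz value from the node table would convert that to microseconds.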

Page 32: QQ: Nanoscale Timing and Profiling

Another example

Example: Key 4 has type “Value”
  The second block contains a double precision value passed in during execution
  The value can be saved and displayed with timing information, or sent to a separate graph
  Timing is obtained the same as before, by subtracting the event times

4 -65.3477 109342735
4 -58.2367 109342819

Page 33: QQ: Nanoscale Timing and Profiling

NCS Performance Measurement

QQ was able to home in on specific blocks of code and allow measurement at a resolution necessary to allow for easy interpretation

Page 34: QQ: Nanoscale Timing and Profiling

Optimization Targets

QQ analysis quickly identified two major targets within the code
  Synapses
  Message Passing

Page 35: QQ: Nanoscale Timing and Profiling

Synapses

Synapses were by far the most common element of any NCS model with the most memory usage
  Active only when an action potential was processed through the synapse
  Pass information between the nodes via message passing

Page 36: QQ: Nanoscale Timing and Profiling

Message Parsing Overhead

Using QQ, we were able to identify areas for improvement within NCS 3

Many unneeded fields requiring better encoding of their destination

Fixed number of messages pre-allocated, far more than needed by the program
  Implemented a shared pool, buffers allocated as needed

Messages sent individually, processed multiple times
  Implemented a packet scheme: process packet once for send, once for receive
  Process messages only when used

Page 37: QQ: Nanoscale Timing and Profiling

Optimization Results

Page 38: QQ: Nanoscale Timing and Profiling

Execution Time Measurements after Optimization

Page 39: QQ: Nanoscale Timing and Profiling

Conclusions

QQ allows profiling of nanoscale timing of code segments and memory usage analysis

Fine grained measurements of specific events

Ability to measure memory at an object or event level with a small memory and performance footprint

Simple and effective tool

Page 40: QQ: Nanoscale Timing and Profiling

Future Work

New Opteron cluster

BlueGene migration
  NCS is currently being installed at our sister lab, The Brain Mind Institute at EPFL in Switzerland, on their new machine

Robotic integration

Page 41: QQ: Nanoscale Timing and Profiling

Acknowledgements

Office of Naval Research
  6 years of funding for people (3 year renewable)
  4 DURIP grants for hardware

Page 42: QQ: Nanoscale Timing and Profiling

QQ: Nanoscale Timing and Profiling

James Frye †*, James G. King †*, Christine J. Wilson *◊, Frederick C. Harris, Jr. †*

†Department of Computer Science and Engineering
*Brain Computation Lab
◊Biomedical Engineering

University of Nevada
Reno, NV 89557

Page 43: QQ: Nanoscale Timing and Profiling
Page 44: QQ: Nanoscale Timing and Profiling

QQ API