QQ: Nanoscale Timing and Profiling
James Frye † *, James G. King † *, Christine J. Wilson * ◊, Frederick C. Harris, Jr. † *
† Department of Computer Science and Engineering
* Brain Computation Lab
◊ Biomedical Engineering
University of Nevada
Reno, NV 89557
What is QQ
QQ is a simple and efficient tool for measuring timing and memory use
Developed for the examination of a massively parallel program (NCS)
Easily extensible to inspect other programs
The Place: The Human Brain
Goal: create the first large-scale, synaptically realistic cortical computational model.
Purpose: Simulation Experiments
  Drug Trials
  Alzheimer’s Research
  Robotics
The Science:
Neurons
  Excitatory
  Interneurons (inhibitory)
Columns
  High connectivity within columns.
  Less connectivity across columns
The Science (cont):
Channels
  Potassium Family
  M, A, AHP Channels
  Suppressing behavior on parent cell
Synapses
  Analog converter of binary spike event.
  Contextual filters.
Neurons
The Science (cont):
NCS Biology
The membrane voltage determines the cell’s firing rate
Once the threshold voltage is reached, the cell sends an action potential to its connected synapses
[Figure: Action Potential. Membrane voltage (mV) versus time (ms), with axis marks at -45, 0, and 30 mV.]
2-Cell Model
[Figure: Pre-Synaptic Cell and Post-Synaptic Cell traces over time (ms), 0 to 500 ms, with a 0.2 mV scale mark.]
No Channels
Sustained firing at maximum rate during a continuous stimulus
Ka Channel
Slows the initial response during a sustained stimulus
Km Channel
Prevents continuous bursting during a continuous stimulus
Kahp Channel
Dampens the effect while still allowing for some action potentials during a sustained stimulus
QQ Development
QQ was developed to optimize a parallel program used to simulate cortical neurons, the NeoCortical Simulator (NCS)
Our goal for the summer of 2002 was to simulate 10^6 neurons with 10^9 synapses within a realistic run time
Before optimization, NCS would run about 1.5 million synapses at a rate of 1 day per simulated second of synaptic activity
Clearly optimization of NCS was needed
QQ Design
QQ is designed so that all of its routines can be selectively compiled into a program
In the QQ.h header file, each routine is defined with a preprocessor directive, so that if profiling is not enabled, it reduces to an empty statement.
  #ifdef QQ_ENABLE
  void QQInit (int);
  #else
  #define QQInit(dummy)
  #endif
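A minimal sketch of how this selective compilation might look from the caller's side. Only QQInit and QQ_ENABLE come from the slide; the counter qq_init_calls, the workload simulate, and the body given to QQInit are hypothetical stand-ins.

```c
#include <stdio.h>

/* Illustrative counter: real QQInit calls made (stays 0 when profiling is off). */
int qq_init_calls = 0;

#ifdef QQ_ENABLE
/* Profiling build: QQInit is a real function. This body is only a stand-in. */
void QQInit(int nodes)
{
    qq_init_calls++;
    printf("QQ profiling enabled for %d node(s)\n", nodes);
}
#else
/* Normal build: the call expands to nothing, leaving an empty statement. */
#define QQInit(dummy)
#endif

/* Hypothetical workload; the instrumentation call is written unconditionally. */
int simulate(int steps)
{
    int sum = 0;
    QQInit(1);                      /* becomes ";" unless QQ_ENABLE is defined */
    for (int i = 0; i < steps; i++)
        sum += i;
    return sum;
}
```

Building with -DQQ_ENABLE switches the real function in; without the flag the call costs nothing at run time. Because the empty macro discards its argument, any side effect inside the argument expression also disappears in a non-profiling build.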
QQ Design
Memory profiling routines also use the C preprocessor to intercept library calls
  #ifdef QQ_ENABLE
  #define malloc(arg) MemMalloc (MEM_KEY, arg)
  #endif
The MemMalloc function records allocation information, calls the malloc function to do the actual allocation, and returns the result to the caller
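The interception described above can be sketched as follows. MemMalloc, MEM_KEY, and KEY_CHANNEL are names from the slides; the MemStats structure, the mem_stats table, KEY_SYNAPSE, and make_channel_state are assumptions made for illustration, and the real QQ bookkeeping is not shown in the source.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical recording categories. */
enum { KEY_CHANNEL, KEY_SYNAPSE, KEY_COUNT };

/* Per-key bookkeeping kept by the wrapper (illustrative fields). */
typedef struct {
    size_t calls;   /* number of successful allocation calls */
    size_t bytes;   /* bytes requested through those calls */
} MemStats;

MemStats mem_stats[KEY_COUNT];

/* Defined before the macro below, so "malloc" here is still the library call. */
void *MemMalloc(int key, size_t size)
{
    void *p = malloc(size);
    if (p != NULL) {
        mem_stats[key].calls++;
        mem_stats[key].bytes += size;
    }
    return p;
}

/* From here on, plain malloc calls are redirected and tagged with MEM_KEY. */
#define malloc(arg) MemMalloc(MEM_KEY, arg)
#define MEM_KEY KEY_CHANNEL

/* Hypothetical channel code: it still just calls malloc. */
double *make_channel_state(size_t n)
{
    return malloc(n * sizeof(double));
}

#undef malloc   /* stop intercepting after the instrumented code */
```

The instrumented code needs no source changes beyond defining MEM_KEY. Redefining a standard library name as a macro is formally reserved by the C standard, so this sketch relies on the behavior of common compilers rather than on a guarantee.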
QQ Timing
Extremely accurate measurement of execution speed.
In theory, fine-grained resolution to a single clock cycle.
  Using the IA32 instruction RDTSC
In practice, measurements are accurate to tens of cycles
  Because of instruction reordering and multiple pipelines in the CPU
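A hedged sketch of reading the cycle counter. It assumes GCC or Clang, whose x86intrin.h header provides the __rdtsc() intrinsic; read_cycles and timer_overhead are hypothetical names, and the non-x86 branch is only a coarse fallback so the sketch still compiles elsewhere. How QQ itself issues the instruction is not shown in the source.

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>          /* GCC/Clang header that provides __rdtsc() */
/* Read the processor's time stamp counter via the RDTSC instruction. */
uint64_t read_cycles(void)
{
    return (uint64_t)__rdtsc();
}
#else
#include <time.h>
/* Fallback for other CPUs: processor-time ticks, far coarser than RDTSC. */
uint64_t read_cycles(void)
{
    return (uint64_t)clock();
}
#endif

/* Smallest gap seen between two back-to-back reads: the measurement floor. */
uint64_t timer_overhead(int trials)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t start = read_cycles();
        uint64_t stop = read_cycles();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}
```

Taking the minimum over many trials filters out interrupts and cache misses. What remains is the cost of the read itself, which is one reason a measured interval cannot be trusted down to a single cycle.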
Timing Measurements
Measuring the impact of a line change in the calculation for the Km channel
From:
  I = unitaryG * strength * pow (m, mPower) * (ReversePot - CmpV);
To:
  I = unitaryG * strength * (ReversePot - CmpV);
For a Km-type channel, mPower is always 1, so we were able to change the equation to streamline the execution
Wrapping the line in calls to QQ, we measure the effect of this single change
  QQStateOn (QQ_Km);
  I = unitaryG * strength * (ReversePot - CmpV);
  QQStateOff (QQ_Km);
Timing Measurements
Note that both code versions give similar cycle counts on different processors, though more consistent and somewhat fewer on P4 than P3.
Times for similar counts are inversely proportional to processor speed, as expected.
The function call pays a heavy penalty on its first call. Only the Km channel code calls it in this program, so that time represents the first load of the code into cache
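The relation between cycle counts and times noted above is a single division, since a clock rate in MHz is the number of cycles per microsecond. A minimal sketch, with cycles_to_usec as a hypothetical helper name:

```c
#include <assert.h>

/* Convert a cycle count to microseconds: MHz is cycles per microsecond. */
double cycles_to_usec(unsigned long long cycles, double mhz)
{
    assert(mhz > 0.0);              /* a zero clock rate would be meaningless */
    return (double)cycles / mhz;
}
```

For example, a count of 400 cycles corresponds to 0.5 µs on the 800 MHz PIII and to about 0.18 µs on the 2200 MHz P4.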
Timing Measurements
[Figure: PIII – 800 MHz. Four charts over runs 1 to 10 comparing the old and new code: two in microseconds (Old uSec, New uSec; scales 0 to 40 and 0 to 0.6) and two in clock cycles (Old Clockcycles, New Clockcycles; scales 0 to 40000 and 0 to 1400).]
Timing Measurements
[Figure: P4 – 2200 MHz. Four charts over runs 1 to 10 comparing the old and new code: two in microseconds (Old uSec, New uSec; scales 0 to 40 and 0 to 0.7) and two in clock cycles (Old Clockcycles, New Clockcycles; scales 0 to 40000 and 0 to 600).]
Expanding Timing Information
QQ allows the user to record an additional item of information with the normal timing.
  QQCount records an integer with the key
    QQCount( eventKey, integer_of_interest );
  QQValue records a double precision floating point value with the key
    QQValue( eventKey, double_of_interest );
  QQState records a state of ON or OFF with the key
    QQStateOn( eventKey ); QQStateOff( eventKey );
These will be described during discussion of the output format
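One way these four calls could append to an in-memory event log is sketched below. The function names and the key/info/time entry layout follow the slides; the fixed-size qq_log array, the qq_next helper, and the qq_fake_clock counter (a stand-in for a real cycle counter) are assumptions for illustration only.

```c
#include <stddef.h>

/* One recorded event: key, one optional payload, and a time stamp. */
typedef struct {
    int key;
    union {                         /* payload slot, as wide as a double */
        int state;                  /* QQStateOn / QQStateOff */
        int count;                  /* QQCount */
        double value;               /* QQValue */
    } info;
    unsigned long long time;
} QQItem;

enum { QQ_MAX_ITEMS = 1024 };       /* hypothetical fixed log capacity */
QQItem qq_log[QQ_MAX_ITEMS];
size_t qq_used = 0;
unsigned long long qq_fake_clock = 0;   /* stand-in for a cycle counter */

/* Reserve the next log slot and stamp it, or return NULL when the log is full. */
static QQItem *qq_next(int key)
{
    if (qq_used == QQ_MAX_ITEMS)
        return NULL;
    QQItem *item = &qq_log[qq_used++];
    item->key = key;
    item->time = ++qq_fake_clock;
    return item;
}

void QQCount(int key, int n)
{
    QQItem *item = qq_next(key);
    if (item) item->info.count = n;
}

void QQValue(int key, double v)
{
    QQItem *item = qq_next(key);
    if (item) item->info.value = v;
}

void QQStateOn(int key)
{
    QQItem *item = qq_next(key);
    if (item) item->info.state = 1;
}

void QQStateOff(int key)
{
    QQItem *item = qq_next(key);
    if (item) item->info.state = 0;
}
```

Keeping every entry the same size, whatever its payload, means the log can be written out and indexed as a flat array.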
QQ Memory
Records memory allocation dedicated to the code-block, rather than the total allocation due to code and library calls, to single-byte accuracy
QQ Memory Example
NCS implementation of ion channels
Suppose we want to know the total memory used by all channels. Each channel function would need the channel key defined:
  #define MEM_KEY KEY_CHANNEL
Then at any point in the program execution, just call the MemPrint function to display memory use
Memory Usage Output
Memory Allocation: Total Allocated = 988 KBytes
               Object   Number   Number   Object    Alloc    Total      Max
Item             Size  Created  Deleted       KB       KB       KB       KB
Brain             120        1        0        1        0        1        1
CellManager        44        1        0        1        1        1        1
Cell               16      100        0        2        0        2        2
Channel           252      300        0       74        0       74       74
Compartment       324      100        0       32        2       33       33
MessageMgr         16        1        0        1      205      205      205
MessageBus          0        0        0        0        1        1        1
Report             80        1        0        1        1        1        1
Stimulus          252        1        0        1        1        1        1
Synapse            44    10000        0      430      118      547      547
---------------------------------------------------------------------------
1                   2        3        4        5        6        7        8
Key
  1 - Internal name given to recording category
  2 - The size of the object being allocated - it's valid only if all allocations are the same size, as with "new Object".
  3 - Number of allocation calls made: new, malloc, calloc, etc.
  4 - Number of free or delete calls made
  5 - KBytes allocated via object creation (new)
  6 - KBytes allocated via *alloc calls
  7 - Total memory currently allocated
  8 - Max memory ever allocated = high-water mark.
QQ Applications
Brain Communication Server (BCS)
Further experimentation with the simulator required that another application be developed to coordinate communication between NCS and numerous potential clients:
• virtual creatures
• physical robots
• visualization tools
[Figure: diagram labeled BCS (Brain Communication Server) and NCS]
Optimizing BCS
Different applications make non-sequential requests. No single function was called in a loop iterating several times, so time had to be measured over the course of execution and then analyzed from QQ’s final output.
Parsing QQ’s output
QQ uses a straightforward layout for the final output file
The data can be easily extracted and displayed in a text report as shown on the previous slide or sent to a graphical display
The following slides describe the output format and how to manage the information
QQ file format
Header
  Number of Keys (int), Key Name string length (int)
Key Table
  For each Key: Key ID (int), Key type (int), Key name (char *)
Node Information
  Number of nodes (int)
Node Table
  For each Node: Byte offset to data (size_t), Number of entries (int), Starting Base Time (unsigned long long), MHz (double)
Data
  For each Node, for each entry: item (QQItem)
QQ Format – Data Close Up
Previous Sections
  Node 0 Byte offset
  Node 1 Byte offset
  Node 2 Byte offset
Data
  Node 0: for each entry, Key (int), [Optional Info], Event Time (unsigned long long)
  Node 1: for each entry, Key (int), [Optional Info], Event Time (unsigned long long)
  Node 2: for each entry …
Where Optional Info is the size of a double, but contains a State (int), a Count (int), or a Value (double)
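The entry layout above maps naturally onto a C union, sketched here together with a formatter that picks the right interpretation of Optional Info from the key's type. The field order comes from the slide; the names QQInfo, QQKeyType, QQ_TYPE_*, and qq_format are hypothetical, and the actual QQItem definition may differ.

```c
#include <stdio.h>

/* Hypothetical names for the key types stored in the key table. */
typedef enum { QQ_TYPE_STATE, QQ_TYPE_COUNT, QQ_TYPE_VALUE } QQKeyType;

/* Optional Info: one slot shared by the three possible payloads. */
typedef union {
    int state;
    int count;
    double value;
} QQInfo;

/* The slot is exactly as large as its widest member, the double. */
_Static_assert(sizeof(QQInfo) == sizeof(double), "info slot is one double wide");

typedef struct {
    int key;
    QQInfo info;
    unsigned long long time;
} QQItem;

/* Write "key info time" into buf, reading info according to the key's type. */
int qq_format(const QQItem *item, QQKeyType type, char *buf, size_t len)
{
    switch (type) {
    case QQ_TYPE_STATE:
        return snprintf(buf, len, "%d %d %llu", item->key, item->info.state, item->time);
    case QQ_TYPE_COUNT:
        return snprintf(buf, len, "%d %d %llu", item->key, item->info.count, item->time);
    case QQ_TYPE_VALUE:
        return snprintf(buf, len, "%d %.4f %llu", item->key, item->info.value, item->time);
    }
    return -1;                      /* unknown key type */
}
```

The entry itself does not say which member is valid, so a reader has to look the key up in the key table before touching the info slot.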
Gathering the Results
After reading a node’s data section, entries with the same key can be gathered.
Using the key table, the user knows what is contained in the second block of a timing entry
Example: Key 2 has type “State”
  The second block contains integer 1 for “on” or integer 0 for “off”
  By subtracting the event times, the length of time spent in the “on” state is determined
  2 1 109342759
  2 0 109342768
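For the two sample entries above, the subtraction gives 109342768 - 109342759 = 9 time units. A minimal sketch of the same calculation over a whole node, pairing each "on" entry with the next "off" entry of the same key. StateEntry and total_on_time are hypothetical names, and the entry is simplified to carry only a state.

```c
#include <stddef.h>

/* Simplified entry for State-type keys: key, state (1 = on, 0 = off), event time. */
typedef struct {
    int key;
    int state;
    unsigned long long time;
} StateEntry;

/* Sum the time one key spent "on" across a node's entries, in event-time units. */
unsigned long long total_on_time(const StateEntry *entries, size_t n, int key)
{
    unsigned long long total = 0;
    unsigned long long on_since = 0;
    int is_on = 0;

    for (size_t i = 0; i < n; i++) {
        if (entries[i].key != key)
            continue;               /* entries for other keys are interleaved */
        if (entries[i].state == 1 && !is_on) {
            is_on = 1;
            on_since = entries[i].time;
        } else if (entries[i].state == 0 && is_on) {
            is_on = 0;
            total += entries[i].time - on_since;
        }
    }
    return total;                   /* an unmatched trailing "on" is ignored */
}
```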
Another example
Example: Key 4 has type “Value”
  The second block contains a double precision value passed in during execution
  The value can be saved and displayed with timing information, or sent to a separate graph
  Timing is obtained the same as before, by subtracting the event times
  4 -65.3477 109342735
  4 -58.2367 109342819
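For the two sample entries above, the elapsed time is 109342819 - 109342735 = 84 time units, during which the recorded value rose by about 7.11. A small sketch of that pairing; ValueEntry, ValueDelta, and value_delta are hypothetical names, and the entry is simplified to carry only a double.

```c
#include <assert.h>

/* Simplified entry for Value-type keys: key, recorded value, event time. */
typedef struct {
    int key;
    double value;
    unsigned long long time;
} ValueEntry;

/* Elapsed event time and change in value between two samples. */
typedef struct {
    unsigned long long elapsed;
    double change;
} ValueDelta;

/* Compare two samples of the same key, with `second` recorded after `first`. */
ValueDelta value_delta(const ValueEntry *first, const ValueEntry *second)
{
    ValueDelta d;
    assert(first->key == second->key);
    assert(second->time >= first->time);
    d.elapsed = second->time - first->time;
    d.change = second->value - first->value;
    return d;
}
```

If the event times are raw cycle counts, dividing the elapsed figure by the node's MHz entry from the node table would give microseconds.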
NCS Performance Measurement
QQ was able to home in on specific blocks of code and measure them at a resolution fine enough for easy interpretation
Optimization Targets
QQ analysis quickly identified two major targets within the code
  Synapses
  Message Passing
Synapses
Synapses were by far the most common element of any NCS model with the most memory usage
  Active only when an action potential was processed through the synapse
  Pass information between the nodes via message passing
Message Parsing Overhead
Using QQ, we were able to identify areas for improvement within NCS 3
Many unneeded fields requiring better encoding of their destination
Fixed number of messages pre-allocated, far more than needed by the program
  Implemented a shared pool, buffers allocated as needed
Messages sent individually, processed multiple times
  Implemented a packet scheme: process packet once for send, once for receive
  Process messages only when used
Optimization Results
Execution Time Measurements after Optimization
Conclusions
QQ allows profiling of nanoscale timing of code segments and memory usage analysis
Fine-grained measurements of specific events
Ability to measure memory at an object or event level with a small memory and performance footprint
Simple and effective tool
Future Work
New Opteron cluster
BlueGene migration
  NCS is currently being installed at our sister lab, the Brain Mind Institute at EPFL in Switzerland, on their new machine
Robotic integration
Acknowledgements
Office of Naval Research
  6 years of funding for people (3 year renewable)
  4 DURIP grants for hardware
QQ API