Tony Chen, Sunjeev Sikand, and John Kerwin CSE 291 - Programming Sensor Networks May 23, 2003
Overview of An Efficient Implementation Scheme of Concurrent Object-Oriented Languages on Stock Multicomputers
Tony Chen, Sunjeev Sikand, and John Kerwin
CSE 291 - Programming Sensor Networks
May 23, 2003
Paper by: Kenjiro Taura, Satoshi Matsuoka, and Akinori Yonezawa
Background

- Most of the work done on high-performance, concurrent object-oriented programming languages (OOPLs) has focused on combinations of elaborate hardware and highly-tuned, specially tailored software.
- These software architectures (the compiler and the runtime system) exploit special features provided by the hardware in order to achieve:
  - Efficient intra-node multithreading
  - Efficient message passing between objects
Special Hardware Features

- The hardware manages the thread scheduling queue, and automatically dispatches the next runnable thread upon termination of the current thread.
- Processors and the network are tightly connected.
  - Processors can send a packet to the network within a few machine cycles.
  - Dispatching a task upon packet arrival takes only a few cycles.
Objective of this Paper

Demonstrate software techniques that can be used to achieve comparable intra-node multithreading and inter-node message passing performance on conventional multicomputers, without special hardware scheduling and message passing facilities.
System Used to Demonstrate these Techniques

- The authors developed a runtime environment for a concurrent object-oriented programming language called ABCL/onAP1000.
- Used Fujitsu Laboratory's experimental multicomputer called the AP1000.
  - 512 SPARC chips running at 25 MHz
  - Interconnected with a 25 MB/s torus network
Computation/Programming Model

- Computation is carried out by message transmissions among concurrent objects.
  - Units of concurrency that become active when they accept messages.
- Multiple message transmissions may take place in parallel, so objects may become active simultaneously.
- When an object receives a message, the message is placed in its message queue, so that messages can be invoked one at a time.
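The queue-per-object discipline above can be sketched as follows (a hypothetical C++ illustration; the class and method names are invented, not taken from ABCL):

```cpp
#include <cassert>
#include <queue>
#include <string>

// Sketch: each concurrent object owns a message queue and processes
// buffered messages strictly one at a time, in arrival order.
struct Message { std::string selector; int argument; };

class ConcurrentObject {
    std::queue<Message> mailbox;  // per-object message queue
    int state = 0;                // encapsulated state variable
public:
    void receive(const Message& m) { mailbox.push(m); }

    // Drain the mailbox, invoking one method per message.
    void run() {
        while (!mailbox.empty()) {
            Message m = mailbox.front();
            mailbox.pop();
            if (m.selector == "add") state += m.argument;
        }
    }
    int value() const { return state; }
};
```

Even if two senders deliver messages concurrently, the receiver's state is only ever mutated by one method invocation at a time.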
Computation/Programming Model (cont.)

- Messages can contain mail addresses of concurrent objects, in addition to basic values such as numbers and booleans.
- Each object has its own autonomous single thread of control, and its own encapsulated state variables.
- An object is in dormant mode if it has no messages to process, active mode if it is executing a method, or waiting mode if it is waiting to receive a certain set of messages.
Possible Actions Within a Method

- Message sends to other concurrent objects
  - Past type – sender does not wait for a reply message
  - Now type – sender waits for a reply message
  - Reply messages are sent through a third object called a reply destination object, which resumes the original sender upon reception of the reply message.
- Creation of concurrent objects
Possible Actions Within a Method (cont.)

- Referencing and updating the contents of state variables
- Waiting for a specified set of messages
- Standard operations (such as arithmetic operations) on values stored in state variables
Scheduling Process

- Scheduling for sequential OOPLs simply involves a method lookup and a stack-based function call.
- For concurrent OOPLs, scheduling of methods is not necessarily LIFO-based, since methods may be blocked to wait for messages, and resumed upon the arrival of a message.
  - Therefore, a naïve implementation must allocate invocation frames from the heap instead of the stack, and use a scheduling queue to keep track of pending methods.
Scheduling Process (cont.)

- In addition, since it may not be possible for a receiver object to immediately process incoming messages, each object must have its own message queue to buffer incoming messages.
- This can lead to substantial overhead for frame allocation/deallocation and queue manipulation, for both the scheduling and message queues.
Example of a Naïve Scheduling Mechanism

A naïve implementation of message reception / method invocation for an object would require:
1. Allocation of an invocation frame to hold local variables and message arguments of the method.
2. Buffering the message into the frame.
3. Enqueueing the frame into the object's message queue.
4. Enqueueing the object into the scheduling queue (if it is not already there).
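The four steps above might look like the following C++ sketch (all names are invented for illustration; this is not the paper's code, and frame deallocation is omitted):

```cpp
#include <cassert>
#include <deque>
#include <queue>

struct Frame { int arg; };                    // heap-allocated invocation frame

struct Object {
    std::queue<Frame*> messageQueue;          // per-object message queue
    bool inSchedulingQueue = false;
};

std::deque<Object*> schedulingQueue;          // global queue of pending objects

void naiveReceive(Object* receiver, int arg) {
    Frame* f = new Frame{arg};                // 1. allocate invocation frame on the heap
                                              // 2. message arguments buffered into the frame
    receiver->messageQueue.push(f);           // 3. enqueue frame in the object's queue
    if (!receiver->inSchedulingQueue) {       // 4. enqueue object in the scheduling queue
        schedulingQueue.push_back(receiver);  //    (only if not already there)
        receiver->inSchedulingQueue = true;
    }
}
```

Every message send pays for a heap allocation plus two queue operations — the overhead the paper's stack-based scheme avoids in the common case.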
Key Observation for Intra-node Scheduling Strategy

- In many cases, this full scheduling mechanism is not necessary, and we can use more efficient stack-based scheduling.
- If an object is dormant, meaning it has no messages to be processed, its method can be invoked immediately upon message reception, without message buffering or scheduling queue manipulation.
- If it is active, then the message is buffered, and the method is invoked later via the scheduling queue.
Example of ABCL/onAP1000 Intra-node Scheduling Strategy

[figure]
Scheduling Strategy Implementation

- We need a mechanism to implement this strategy efficiently.
- We cannot perform a runtime check on every intra-node message send to determine whether or not the receiver is dormant.
- When a running object becomes blocked on the stack, we must be able to resume other objects.
Components of an Object

[figure]
Virtual Function Tables

A Virtual Function Table Pointer (VFTP) points to a Virtual Function Table, which contains the address of each compiled function (method) of the class.
Key Idea in Object Representation

- Each class has multiple virtual function tables, each of which roughly corresponds to a mode (dormant, active, or waiting) of an object.
- When an object is in dormant mode, its Virtual Function Table Pointer (VFTP) points to the table that contains the method bodies.
- When an object is active, the VFTP points to a virtual function table that holds tiny queueing procedures, which simply allocate a frame, store the message into the frame, and enqueue it on the object's message queue.
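The mode-as-vtable trick can be sketched with explicit function-pointer tables (a hypothetical illustration — in the real system the compiler generates these tables and the runtime swaps the VFTP automatically on mode changes):

```cpp
#include <cassert>
#include <queue>

struct Object;
using MethodFn = void (*)(Object&, int);

struct VTable { MethodFn addMethod; };  // one slot per method of the class

struct Object {
    const VTable* vftp = nullptr;       // current table encodes the mode
    std::queue<int> messageQueue;
    int state = 0;
};

// Dormant-mode table entry: the actual method body.
void addBody(Object& self, int arg)  { self.state += arg; }
// Active-mode table entry: a tiny queueing procedure.
void addQueue(Object& self, int arg) { self.messageQueue.push(arg); }

const VTable dormantTable{addBody};
const VTable activeTable{addQueue};

// The sender performs only an ordinary virtual-table lookup; the mode
// check has vanished into the indirection.
void send(Object& o, int arg) { o.vftp->addMethod(o, arg); }
```

The same dispatch site either runs the method or buffers the message, depending solely on which table the VFTP currently points to.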
Benefits of Multiple Virtual Function Tables

- With multiple virtual function tables, a sender object does not have to do a runtime check of whether or not the receiver object is dormant.
- Instead, this check is built into the virtual function table lookup, which is already a necessary cost in object-oriented programming languages.
Benefits of Multiple Virtual Function Tables (cont.)

- Can be used to implement selective message reception, where acceptable messages trigger functions that restore the context of the object, and unacceptable messages trigger queueing procedures.
- Can also be used to initialize an object's state variables, by creating a table that points to initialization functions that initialize variables before calling a method body.
Combining the Stack with the Scheduling Queue

- When a method is invoked on a dormant object, an activation frame is allocated on the stack, thereby achieving fast frame allocation/deallocation.
- If this invocation blocks in the middle of a thread, it allocates another frame on the heap and saves its context to this frame, which will survive until termination of the method.
- The scheduling queue is used to schedule preempted objects that saved their context into a heap-allocated frame, or to invoke messages that were buffered in a message queue.
Example of Stack Unwinding

[figure]
Inter-node Software Architecture

- Important for message passing between objects on different nodes, and object creation on a remote node.
- Assumes the hardware (or message passing libraries) provides an interface to send and receive messages asynchronously.
- Uses an Active Message-like mechanism, where each message attaches its own self-dispatching message handler, which is invoked immediately after the delivery of the message.
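The self-dispatching mechanism can be sketched as a packet that carries the address of its own handler (names invented for illustration; real active-message packets also carry network headers and payload buffers):

```cpp
#include <cassert>

struct Packet;
using Handler = void (*)(const Packet&);

struct Packet {
    Handler handler;   // self-dispatching message handler
    int payload;       // message contents (simplified to one int)
};

int delivered = 0;

// One handler per kind of remote message.
void normalSendHandler(const Packet& p) { delivered += p.payload; }

// On arrival, the node invokes the packet's own handler directly —
// no central dispatch table, no intermediate buffering.
void onPacketArrival(const Packet& p) { p.handler(p); }
```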
Customized Message Handlers

Providing a customized message handler for each kind of remote message allows the system to achieve low-overhead remote task dispatching. Message handlers are classified into the following categories:
1. Normal message transmission between objects
2. Request for remote object creation
3. Reply to a remote memory allocation request
4. Other services such as load balancing, garbage collection, etc.
Remote Object Creation

- A mail address of an object is represented as <processor number, pointer>.
- This provides maximum performance for local object access, and avoids the overhead of export table management.
- Object creation on a remote node requires a memory allocation on the remote node to generate a remote mail address.
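A minimal sketch of the <processor, pointer> representation (field and function names are invented):

```cpp
#include <cassert>

// A global mail address is just a processor/pointer pair, so a local
// access degenerates to a raw pointer dereference — no export table.
struct MailAddress {
    int   processor;   // node that owns the object
    void* pointer;     // raw address of the object on that node
};

const int myNode = 0;  // this node's processor number (assumed)

bool isLocal(const MailAddress& a) { return a.processor == myNode; }
```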
Remote Object Creation (cont.)

- Since the latency of remote communication is unpredictable, and the cost of context switching is high, it is unacceptable to wait for the remote node to allocate memory and return a pointer.
- Therefore, the system uses a prefetch scheme, where each node manages predelivered stocks of addresses of memory chunks on remote nodes, and these addresses are used for remote object allocation.
- A node only has to wait for a remote address to be allocated if its local stock is empty.
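The stock of predelivered addresses might be managed as below (a sketch with invented names; the blocking wait on an empty stock is elided):

```cpp
#include <cassert>
#include <vector>

// An address of a pre-allocated memory chunk on a remote node.
struct RemoteChunk { int processor; unsigned long address; };

class AddressStock {
    std::vector<RemoteChunk> stock;   // predelivered remote addresses
public:
    // Called when a remote node delivers a replacement chunk address.
    void replenish(RemoteChunk c) { stock.push_back(c); }

    // Fast path: hand out a remote address without any communication.
    // Returns false when the stock is empty — the only case in which a
    // real node would have to wait for a remote allocation round trip.
    bool tryAllocate(RemoteChunk& out) {
        if (stock.empty()) return false;
        out = stock.back();
        stock.pop_back();
        return true;
    }
};
```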
Typical Remote Object Creation Sequence

1. The requester node obtains a unique mail address locally from the stock.
2. It sends a creation request message to the node specified by the mail address.
3. The target node performs class-specific initialization (such as initialization of the virtual function table) of the created object upon receipt of the creation message.
4. The target node allocates a replacement chunk of memory, and returns its address to the requester node.
5. The requester replenishes its stock upon receipt of the replacement address.
Costs of Basic Operations

Operation                           Time (µs)
Intra-node Message (to Dormant)        2.3
Intra-node Message (to Active)         9.6
Intra-node Creation                    2.1
Latency of Inter-node Message          8.9
Breakdown of Intra-node Message to Dormant Object

Step                                  Instructions
Check Locality                             3
Lookup and Call                            5
Switch VFTP to Active Mode                 3
Execution of Method Body                   –
Check Message Queue                        3
Switch VFTP to Dormant Mode                3
Polling of Remote Message Arrival          5
Adjusting Stack Pointer and Return         3
Total                                     25
Comparison of Send/Reply Latency

System               Instruction Counts   Real Time (µs)   Cycles   Clock Rate (MHz)
ABCL/onAP1000               160                17.8          450          25
ABCL/onEM4                  100                 9            110          12.5
CST (on J-Machine)          110                 4            220          50

Send and reply latency for the ABCL/onAP1000 conventional multicomputer is only about 4 times that of the ABCL/onEM4 fine-grain machine, and 2 times that of the CST fine-grain machine (measured in processor cycles).
Benchmark Statistics

- To evaluate these techniques on real applications, the authors measured the performance of the N-queen exhaustive search algorithm for N = 8 and N = 13.
- They compared these results to the results of running the same programs on a single-CPU SPARCstation 1+, which uses the same CPU that is used in the AP1000.
The Scale of the N-queen Program

                              N = 8        N = 13
Number of Solutions              92        73,712
Number of Objects Created     2,056     4,636,210
Number of Messages            4,104     9,349,765
Total Memory Used (KB)          130       549,463
Elapsed Time on SS1+          84 ms    461,955 ms
Speedup of the N-queen Program

[figure]
The Effect of Stack-based Scheduling

- To demonstrate the effect of stack-based scheduling, they compared the performance of the N-queen program using stack-based scheduling to its performance using a naïve scheduling mechanism that always buffers a message in the message queue of the receiver object, and schedules the object through the scheduling queue.
- In these programs, approximately 75% of local messages are sent to dormant objects.
- In general, they observed a speedup of approximately 30%.
The Effect of Stack-based Scheduling (cont.)

[figure]
Conclusions

- The authors proposed a software architecture for concurrent OOPLs on conventional multicomputers that can compete with implementations on special-purpose, fine-grain architectures.
- Their stack-based intra-node scheduling mechanism significantly reduces the average cost of intra-node method invocation.
- Their Active Message-like messages and address prefetch scheme minimize the cost of inter-node message passing and remote object creation.
Discussion

- The eternal question: How does this apply to sensor networks?
  - Low instruction count for intra-node scheduling
  - Power-efficient remote object creation cuts down on communication
Flaws

- Security problems related to active messages: a user can run any code they desire on the receiving node.
- Scalability of the prefetch scheme: with thousands of nodes, maintaining address stocks results in a lot of communication between nodes, and memory becomes a scarce commodity.