Tony Chen, Sunjeev Sikand, and John Kerwin CSE 291 - Programming Sensor Networks May 23, 2003
Overview of An Efficient Implementation Scheme of Concurrent Object-Oriented Languages on Stock Multicomputers
Tony Chen, Sunjeev Sikand, and John Kerwin
CSE 291 - Programming Sensor Networks
May 23, 2003
Paper by: Kenjiro Taura, Satoshi Matsuoka, and Akinori Yonezawa
Background

- Most of the work done on high-performance, concurrent object-oriented programming languages (OOPLs) has focused on combinations of elaborate hardware and highly-tuned, specially tailored software.
- These software architectures (the compiler and the runtime system) exploit special features provided by the hardware in order to achieve:
  - Efficient intra-node multithreading
  - Efficient message passing between objects
Special Hardware Features

- The hardware manages the thread scheduling queue, and automatically dispatches the next runnable thread upon termination of the current thread.
- Processors and the network are tightly connected.
  - Processors can send a packet to the network within a few machine cycles.
  - Dispatching a task upon packet arrival takes only a few cycles.
Objective of this Paper

Demonstrate software techniques that can be used to achieve comparable intra-node multithreading and inter-node message passing performance on conventional multicomputers, without special hardware scheduling and message passing facilities.
System Used to Demonstrate these Techniques

- The authors developed a runtime environment for a concurrent object-oriented programming language called ABCL/onAP1000.
- Used Fujitsu Laboratory's experimental multicomputer called the AP1000.
  - 512 SPARC chips running at 25 MHz
  - Interconnected with a 25 MB/s torus network
Computation/Programming Model

- Computation is carried out by message transmissions among concurrent objects.
  - Units of concurrency that become active when they accept messages.
- Multiple message transmissions may take place in parallel, so objects may become active simultaneously.
- When an object receives a message, the message is placed in its message queue, so that messages can be invoked one at a time.
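The queue-per-object discipline above can be sketched as follows (a hypothetical C++ illustration; the class and method names are invented, not taken from ABCL):

```cpp
#include <cassert>
#include <queue>
#include <string>

// Sketch: each concurrent object owns a message queue and processes
// buffered messages strictly one at a time, in arrival order.
struct Message { std::string selector; int argument; };

class ConcurrentObject {
    std::queue<Message> mailbox;  // per-object message queue
    int state = 0;                // encapsulated state variable
public:
    void receive(const Message& m) { mailbox.push(m); }

    // Drain the mailbox, invoking one method per message.
    void run() {
        while (!mailbox.empty()) {
            Message m = mailbox.front();
            mailbox.pop();
            if (m.selector == "add") state += m.argument;
        }
    }
    int value() const { return state; }
};
```

Even if two senders deliver messages concurrently, the receiver's state is only ever mutated by one method invocation at a time.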
Computation/Programming Model (cont.)

- Messages can contain mail addresses of concurrent objects, in addition to basic values such as numbers and booleans.
- Each object has its own autonomous single thread of control, and its own encapsulated state variables.
- An object is in dormant mode if it has no messages to process, active mode if it is executing a method, or waiting mode if it is waiting to receive a certain set of messages.
Possible Actions Within a Method

- Message sends to other concurrent objects
  - Past type – sender does not wait for a reply message
  - Now type – sender waits for a reply message
  - Reply messages are sent through a third object called a reply destination object, which resumes the original sender upon reception of the reply message.
- Creation of concurrent objects
Possible Actions Within a Method (cont.)

- Referencing and updating the contents of state variables
- Waiting for a specified set of messages
- Standard operations (such as arithmetic operations) on values stored in state variables
Scheduling Process

- Scheduling for sequential OOPLs simply involves a method lookup and a stack-based function call.
- For concurrent OOPLs, scheduling of methods is not necessarily LIFO-based, since methods may be blocked to wait for messages, and resumed upon the arrival of a message.
  - Therefore, a naïve implementation must allocate invocation frames from the heap instead of the stack, and use a scheduling queue to keep track of pending methods.
Scheduling Process (cont.)

- In addition, since it may not be possible for a receiver object to immediately process incoming messages, each object must have its own message queue to buffer incoming messages.
- This can lead to substantial overhead for frame allocation/deallocation and queue manipulation, for both the scheduling and message queues.
Example of a Naïve Scheduling Mechanism

A naïve implementation of message reception / method invocation for an object would require:
1. Allocation of an invocation frame to hold local variables and message arguments of the method.
2. Buffering the message into the frame.
3. Enqueueing the frame into the object's message queue.
4. Enqueueing the object into the scheduling queue (if it is not already there).
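The four steps above might look like the following C++ sketch (all names are invented for illustration; this is not the paper's code, and frame deallocation is omitted):

```cpp
#include <cassert>
#include <deque>
#include <queue>

struct Frame { int arg; };                    // heap-allocated invocation frame

struct Object {
    std::queue<Frame*> messageQueue;          // per-object message queue
    bool inSchedulingQueue = false;
};

std::deque<Object*> schedulingQueue;          // global queue of pending objects

void naiveReceive(Object* receiver, int arg) {
    Frame* f = new Frame{arg};                // 1. allocate invocation frame on the heap
                                              // 2. message arguments buffered into the frame
    receiver->messageQueue.push(f);           // 3. enqueue frame in the object's queue
    if (!receiver->inSchedulingQueue) {       // 4. enqueue object in the scheduling queue
        schedulingQueue.push_back(receiver);  //    (only if not already there)
        receiver->inSchedulingQueue = true;
    }
}
```

Every message send pays for a heap allocation plus two queue operations — the overhead the paper's stack-based scheme avoids in the common case.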
Key Observation for Intra-node Scheduling Strategy

- In many cases, this full scheduling mechanism is not necessary, and we can use more efficient stack-based scheduling.
- If an object is dormant, meaning it has no messages to be processed, its method can be invoked immediately upon message reception, without message buffering or scheduling queue manipulation.
- If it is active, then the message is buffered, and the method is invoked later via the scheduling queue.
Example of ABCL/onAP1000 Intra-node Scheduling Strategy

[figure]
Scheduling Strategy Implementation

- We need a mechanism to implement this strategy efficiently.
- We cannot perform a runtime check on every intra-node message send to determine whether or not the receiver is dormant.
- When a running object becomes blocked on the stack, we must be able to resume other objects.
Components of an Object

[figure]
Virtual Function Tables

A Virtual Function Table Pointer (VFTP) points to a Virtual Function Table, which contains the address of each compiled function (method) of the class.
Key Idea in Object Representation

- Each class has multiple virtual function tables, each of which roughly corresponds to a mode (dormant, active, or waiting) of an object.
- When an object is in dormant mode, its Virtual Function Table Pointer (VFTP) points to the table that contains the method bodies.
- When an object is active, the VFTP points to a virtual function table that holds tiny queueing procedures, which simply allocate a frame, store the message into the frame, and enqueue it on the object's message queue.
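The mode-as-vtable trick can be sketched with explicit function-pointer tables (a hypothetical illustration — in the real system the compiler generates these tables and the runtime swaps the VFTP automatically on mode changes):

```cpp
#include <cassert>
#include <queue>

struct Object;
using MethodFn = void (*)(Object&, int);

struct VTable { MethodFn addMethod; };  // one slot per method of the class

struct Object {
    const VTable* vftp = nullptr;       // current table encodes the mode
    std::queue<int> messageQueue;
    int state = 0;
};

// Dormant-mode table entry: the actual method body.
void addBody(Object& self, int arg)  { self.state += arg; }
// Active-mode table entry: a tiny queueing procedure.
void addQueue(Object& self, int arg) { self.messageQueue.push(arg); }

const VTable dormantTable{addBody};
const VTable activeTable{addQueue};

// The sender performs only an ordinary virtual-table lookup; the mode
// check has vanished into the indirection.
void send(Object& o, int arg) { o.vftp->addMethod(o, arg); }
```

The same dispatch site either runs the method or buffers the message, depending solely on which table the VFTP currently points to.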
Benefits of Multiple Virtual Function Tables

- With multiple virtual function tables, a sender object does not have to do a runtime check of whether or not the receiver object is dormant.
- Instead, this check is built into the virtual function table lookup, which is already a necessary cost in object-oriented programming languages.
Benefits of Multiple Virtual Function Tables (cont.)

- Can be used to implement selective message reception, where acceptable messages trigger functions that restore the context of the object, and unacceptable messages trigger queueing procedures.
- Can also be used to initialize an object's state variables, by creating a table that points to initialization functions that initialize variables before calling a method body.
Combining the Stack with the Scheduling Queue

- When a method is invoked on a dormant object, an activation frame is allocated on the stack, thereby achieving fast frame allocation/deallocation.
- If this invocation blocks in the middle of a thread, it allocates another frame on the heap and saves its context to this frame, which will survive until termination of the method.
- The scheduling queue is used to schedule preempted objects that saved their context into a heap-allocated frame, or to invoke messages that were buffered in a message queue.
Example of Stack Unwinding

[figure]
Inter-node Software Architecture

- Important for message passing between objects on different nodes, and object creation on a remote node.
- Assumes the hardware (or message passing libraries) provides an interface to send and receive messages asynchronously.
- Uses an Active Message-like mechanism, where each message attaches its own self-dispatching message handler, which is invoked immediately after the delivery of the message.
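The self-dispatching mechanism can be sketched as a packet that carries the address of its own handler (names invented for illustration; real active-message packets also carry network headers and payload buffers):

```cpp
#include <cassert>

struct Packet;
using Handler = void (*)(const Packet&);

struct Packet {
    Handler handler;   // self-dispatching message handler
    int payload;       // message contents (simplified to one int)
};

int delivered = 0;

// One handler per kind of remote message.
void normalSendHandler(const Packet& p) { delivered += p.payload; }

// On arrival, the node invokes the packet's own handler directly —
// no central dispatch table, no intermediate buffering.
void onPacketArrival(const Packet& p) { p.handler(p); }
```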
Customized Message Handlers

Providing a customized message handler for each kind of remote message allows the system to achieve low-overhead remote task dispatching. Message handlers are classified into the following categories:
1. Normal message transmission between objects
2. Request for remote object creation
3. Reply to a remote memory allocation request
4. Other services such as load balancing, garbage collection, etc.
Remote Object Creation

- A mail address of an object is represented as <processor number, pointer>.
- This provides maximum performance for local object access, and avoids the overhead of export table management.
- Object creation on a remote node requires a memory allocation on the remote node to generate a remote mail address.
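A minimal sketch of the <processor, pointer> representation (field and function names are invented):

```cpp
#include <cassert>

// A global mail address is just a processor/pointer pair, so a local
// access degenerates to a raw pointer dereference — no export table.
struct MailAddress {
    int   processor;   // node that owns the object
    void* pointer;     // raw address of the object on that node
};

const int myNode = 0;  // this node's processor number (assumed)

bool isLocal(const MailAddress& a) { return a.processor == myNode; }
```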
Remote Object Creation (cont.)

- Since the latency of remote communication is unpredictable, and the cost of context switching is high, it is unacceptable to wait for the remote node to allocate memory and return a pointer.
- Therefore, the system uses a prefetch scheme, where each node manages predelivered stocks of addresses of memory chunks on remote nodes, and these addresses are used for remote object allocation.
- A node only has to wait for a remote address to be allocated if its local stock is empty.
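The stock of predelivered addresses might be managed as below (a sketch with invented names; the blocking wait on an empty stock is elided):

```cpp
#include <cassert>
#include <vector>

// An address of a pre-allocated memory chunk on a remote node.
struct RemoteChunk { int processor; unsigned long address; };

class AddressStock {
    std::vector<RemoteChunk> stock;   // predelivered remote addresses
public:
    // Called when a remote node delivers a replacement chunk address.
    void replenish(RemoteChunk c) { stock.push_back(c); }

    // Fast path: hand out a remote address without any communication.
    // Returns false when the stock is empty — the only case in which a
    // real node would have to wait for a remote allocation round trip.
    bool tryAllocate(RemoteChunk& out) {
        if (stock.empty()) return false;
        out = stock.back();
        stock.pop_back();
        return true;
    }
};
```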
Typical Remote Object Creation Sequence

1. The requester node obtains a unique mail address locally from the stock.
2. It sends a creation request message to the node specified by the mail address.
3. The target node performs class-specific initialization (such as initialization of the virtual function table) of the created object upon receipt of the creation message.
4. The target node allocates a replacement chunk of memory, and returns its address to the requester node.
5. The requester replenishes its stock upon receipt of the replacement address.
Costs of Basic Operations

Operation                           Time (µs)
Intra-node Message (to Dormant)        2.3
Intra-node Message (to Active)         9.6
Intra-node Creation                    2.1
Latency of Inter-node Message          8.9
Breakdown of Intra-node Message to Dormant Object

Step                                  Instructions
Check Locality                             3
Lookup and Call                            5
Switch VFTP to Active Mode                 3
Execution of Method Body                   –
Check Message Queue                        3
Switch VFTP to Dormant Mode                3
Polling of Remote Message Arrival          5
Adjusting Stack Pointer and Return         3
Total                                     25
Comparison of Send/Reply Latency

System               Instruction Counts   Real Time (µs)   Cycles   Clock Rate (MHz)
ABCL/onAP1000               160                17.8          450          25
ABCL/onEM4                  100                 9            110          12.5
CST (on J-Machine)          110                 4            220          50

Send and reply latency for the ABCL/onAP1000 conventional multicomputer is only about 4 times that of the ABCL/onEM4 fine-grain machine, and 2 times that of the CST fine-grain machine (measured in processor cycles).
Benchmark Statistics

- To evaluate these techniques on real applications, the authors measured the performance of the N-queen exhaustive search algorithm for N = 8 and N = 13.
- They compared these results to the results of running the same programs on a single-CPU SPARCstation 1+, which uses the same CPU that is used in the AP1000.
The Scale of the N-queen Program

                              N = 8        N = 13
Number of Solutions              92        73,712
Number of Objects Created     2,056     4,636,210
Number of Messages            4,104     9,349,765
Total Memory Used (KB)          130       549,463
Elapsed Time on SS1+          84 ms    461,955 ms
Speedup of the N-queen Program

[figure]
The Effect of Stack-based Scheduling

- To demonstrate the effect of stack-based scheduling, they compared the performance of the N-queen program using stack-based scheduling to its performance using a naïve scheduling mechanism that always buffers a message in the message queue of the receiver object, and schedules the object through the scheduling queue.
- In these programs, approximately 75% of local messages are sent to dormant objects.
- In general, they observed a speedup of approximately 30%.
The Effect of Stack-based Scheduling (cont.)

[figure]
Conclusions

- The authors proposed a software architecture for concurrent OOPLs on conventional multicomputers that can compete with implementations on special-purpose, fine-grain architectures.
- Their stack-based intra-node scheduling mechanism significantly reduces the average cost of intra-node method invocation.
- Their Active Message-like messages and address prefetch scheme minimize the cost of inter-node message passing and remote object creation.
Discussion

- The eternal question: How does this apply to sensor networks?
  - Low instruction count for intra-node scheduling
  - Power-efficient remote object creation cuts down on communication
Flaws

- Security problems related to active messages: a user can run any code they desire on the receiving node.
- Scalability of the prefetch scheme: with thousands of nodes, maintaining address stocks results in a lot of communication between nodes, and memory becomes a scarce commodity.