Getting More From Your Multicore With Polycephaly - Arden Thomas

Polycephaly and Leveraging Multi-Core: How can we get the most out of our modern CPUs? Arden Thomas, Cincom Smalltalk Product Manager

description

This presentation discusses some basic strategies for how Smalltalk can leverage multi-core computers, and presents the results of using a simple new framework that is now included with Cincom’s VisualWorks and ObjectStudio.

Transcript of Getting More From Your Multicore With Polycephaly - Arden Thomas

Page 1: Getting More From Your Multicore With Polycephaly - Arden Thomas

Polycephaly and Leveraging Multi-Core

How can we get the most out of our modern CPUs?

Arden Thomas

Cincom Smalltalk

Product Manager

Page 2:

Multi-Core Computers

• Are becoming ubiquitous…

• Quad cores from Intel & AMD are becoming commonplace

• 8, 16, 64 cores around the corner?

Page 3:

Cincom Smalltalk™ Roadmap Item

• “Research ways to leverage Multi-Core computing”

• This item was a magnet – lots of interest

Page 4:

What is the Attraction?

• Making the most of what you have:

Page 5:

Rear Admiral Grace Murray Hopper

• Distinguished computer scientist

• Was there when they took a moth out of the relays of an early computer (“getting the bugs out”)

• Had a great ability to convey ideas in easy-to-grasp perspectives

Page 6:

Grace Murray Hopper

“When the farmer, using a horse, could not pull out a stump, he didn’t go back to the barn for a bigger horse, he went back for another horse.”

Page 7:

Using Multi-Core Computers

• “Team” of horses

• Type of concurrent programming

• On the same machine / OS

Generally faster

More options (e.g., shared memory)

Page 8:

…A ‘Small’ Matter of Programming…

• Most concurrency is NOT EASY

• Concurrency problems and solutions have been studied for decades

Page 9:

…A ‘Small’ Matter of Programming…

• Both AMD & Intel have donated money and personnel to universities doing concurrency research

Specifically with the intent of increasing market demand for their products

Page 10:

…A ‘Small’ Matter of Programming…

“As the power of using concurrency increases linearly, the complexity increases exponentially”

Page 11:

Approaches to Using Multi-Cores

[Diagram: Multiple Applications across Core 1, Core 2, Core 3, Core 4]

Page 12:

Approaches to Using Multi-Cores

• Multiple Applications

• Advantages

Simple

Easy

Effective

Minimal contention

• Disadvantages

Multiple independent applications needed

Not necessarily using multi-cores for one problem

Page 13:

Approaches to Using Multi-Cores

[Diagram: Multiple Process Threads in a Single Application; Thread 1, Thread 2, Thread 3 across Core 1, Core 2, Core 3, Core 4]

Page 14:

Approaches to Using Multi-Cores

• Multiple Process threads in a single application

• Advantages

Threads have access to the same object space

Can be effective concurrency

• Disadvantages

Object contention and overhead

Usually significant added complexity

Threads (if native) can take down the whole application in severe error situations
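
Object contention between threads in one image is usually handled with mutual exclusion. The following is a minimal workspace sketch, assuming a VisualWorks image and using only green threads; the names counter, lock, and done are illustrative and do not come from the slides.

```smalltalk
| counter lock done |
counter := 0.
lock := Semaphore forMutualExclusion.   "guards every update to counter"
done := Semaphore new.                  "signalled once by each worker when it finishes"
4 timesRepeat: [
    [1000 timesRepeat: [lock critical: [counter := counter + 1]].
     done signal] fork].
4 timesRepeat: [done wait].             "block until all four workers have signalled"
counter
```

Each update is wrapped in critical:, so the final value is 4000 regardless of how the workers are scheduled. The cost is the locking overhead and added complexity that the slide lists as disadvantages.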

Page 15:

Approaches to Using Multi-Cores

Multiple “green” (non-native) threads available in VisualWorks/ObjectStudio

• Modeling producer/consumer problems

• Not inherently concurrent

• They can still be very effective for coordinating concurrent work (perhaps even better?)

• Fast, lightweight, efficient

• Effective example later

Page 16:

Some Smalltalk Objects for Concurrency

• Process

• Semaphore

• Promise (a kind of ‘Future’)

• SharedQueue
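
To make the four objects above concrete, here is a small workspace sketch, assuming a VisualWorks image. It is an illustration only: the variable names and the toy workload are not from the slides.

```smalltalk
| queue done sum promise |
queue := SharedQueue new.               "thread-safe hand-off between processes"
done := Semaphore new.                  "used here as a completion signal"

"Process: fork a green thread that produces five squares, then signals"
[1 to: 5 do: [:i | queue nextPut: i * i].
 done signal] fork.
done wait.                              "wait for the producer to finish"

sum := 0.
5 timesRepeat: [sum := sum + queue next].   "1 + 4 + 9 + 16 + 25"

"Promise: start a computation now, ask for its value later"
promise := [100 factorial printString size] promise.
Array with: sum with: promise value
```

Sending value to the Promise blocks until the forked computation has finished, which is what makes it behave like a future.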

Page 17:

Approaches to Using Multi-Cores

[Diagram: Multiple Process Threads in a CST VM; execution thread, garbage collection thread, JIT thread across Core 1, Core 2, Core 3, Core 4]

Page 18:

Approaches to Using Multi-Cores

• Multiple Process threads in a CST VM

Note that this idea is just an example for discussion (product management brainstorming)

• Advantages

Use of multi-cores even with single-threaded applications

• Disadvantages

Time & resources to develop VM

Feasibility and stability questions

Page 19:

Approaches to Using Multi-Cores

[Diagram: Coordinated Multiple Applications across Core 1, Core 2, Core 3, Core 4]

Page 20:

Approaches to Using Multi-Cores

• Coordinated Multiple Applications

• Advantages

Smaller changes in code needed

Fairly Easy & Effective

Could be Scaled to number of Cores

Fewer contention issues

Doable without VM changes

• Disadvantages

Need fairly independent pieces

Good for a subset of problems only

Page 21:

Approaches to Using Multi-Cores

[Diagram: Shared Object Memory & Coordinated Multiple Applications across Core 1, Core 2, Core 3, Core 4]

Page 22:

Approaches to Using Multi-Cores

• Shared Object Memory with Coordinated Multiple Applications

• Advantages

Can solve a broader set of problems

Can bring multiple cores to bear on the same object set

• Disadvantages

Garbage collection complications

Need to coordinate or manage sharing (traditional concurrency problems)

Page 23:

Product Management Requirements

• Research into ways to utilize multi-core computers

We can add multiple facilities & abilities over time

• Something “Smalltalk simple”

A mechanism that is simple, effective, and high level

Not a Smalltalk implementation of generally hard things

• Avoids most of the traditional problems and difficulties if possible

Minimize contention, locking, ability to deadlock

• Is flexible enough to be the basis for more sophisticated solutions as situations warrant

Page 24:

Polycephaly

• Engineering experiment

• Assists in starting and managing headless images

Handing them work

Transporting the results

Page 25:

Polycephaly – What it Isn’t

We have not magically invented some panacea to the difficult issue of concurrency

Page 26:

Polycephaly – What it Is

• A simple framework that allows a subset of concurrency problems to be solved

• For this class of problems, it works rather nicely

Page 27:

Experiments

• Take some code that performs a task

• See how overall time to perform the task could be improved using concurrency

• Use the Polycephaly framework

• Goal:

See if substantial improvements could be made via concurrency with a small and simple amount of effort

Page 28:

Experiment A – number crunching example

numberQueue
    "Answer a SharedQueue holding the 50 integers 50001 .. 50050"
    | numberQueue |
    numberQueue := SharedQueue new: 50.
    50001 to: 50050 do: [:i | numberQueue nextPut: i].
    ^numberQueue

generateResult: n
    ^n factorial

processLocal: numberQueue
    "self processLocal: self numberQueue"
    | results |
    results := OrderedCollection new.
    [numberQueue isEmpty] whileFalse: [results add: (self generateResult: numberQueue next)].
    ^results

self processRemote: self numberQueue

Page 29:

Experiment A – number crunching example

processRemote: numberQueue
    | done results vms |
    done := Semaphore new.
    results := SharedQueue new.
    vms := Polycephaly.VirtualMachines new: 4.
    vms machines do: [:machine |
        [ | number |
        [number := numberQueue nextAvailable.
        number isNil] whileFalse: [
            results nextPut: (machine do: [:n | PolyExample generateResult: n] with: number)].
        machine release. "Shut down the vm"
        done signal] fork].
    vms machines size timesRepeat: [done wait]. "Wait until each machine has reported in as completed"
    ^results

Page 30:

Experiment A – number crunching results

self processLocal: self numberQueueSmall

self processRemote: self numberQueueSmall

Example Small Local: 410

Example Small Remote: 792

- The difference is the overhead of Polycephaly

self processLocal: self numberQueue

self processRemote: self numberQueue

Example Local: 308002

Example Remote: 72051

- Less than ¼ the time using Polycephaly!
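
The ratio behind the “less than ¼” remark can be worked out directly from the two large-run figures above. The sketch below also shows one common way to collect such timings in a VisualWorks workspace; the slides do not say how the measurements were taken, so the use of Time millisecondsToRun: and the stand-in workload are assumptions.

```smalltalk
| localTime remoteTime speedup sampleTime |
"Figures copied from the slide; the slide does not state the unit"
localTime := 308002.
remoteTime := 72051.
speedup := (localTime / remoteTime) asFloat.    "roughly 4.27 times faster"

"Assumed measurement technique, applied to a stand-in workload"
sampleTime := Time millisecondsToRun: [20 timesRepeat: [2000 factorial]].
Array with: speedup with: sampleTime
```

For the small run the ratio goes the other way: 792 / 410 is roughly 1.9, so the remote version takes almost twice as long when there is too little work to offset the overhead.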

Page 31:

Experiment B – Load Stock Daily data

• Load daily data for all stocks in S&P 100

Open, high, low, close, volume

All days from 1990 to present

• Source

Yahoo historical data

Index spx100 do: [:stk | stk loadDaily].

Page 32:

Experiment B – Load Stock Daily data

• Load local: 84434 ms

Page 33:

Experiment B – Load Stock Daily data

• Load local: 84434 ms

• Load remote, 3 vms: 21613 ms

• Load remote, 5 hot vms: 18291 ms

• Load remote, 10 hot vms: 18404 ms

Page 34:

Experiment C - Load Security information

• Loading market security information

Stocks

Mutual Funds

ETFs

• HTTP

Files from Nasdaq for NYSE, AMEX, Nasdaq

Mutual Funds (around 24,000!)

ETFs (584)

Page 35:

Experiment C - Load Security information

• Original sequential load code:

nyse := Security loadNYSE.

amex := Security loadAMEX.

nasd := Security loadNasdaq.

funds := MutualFund load.

etfs := ETF load.

Page 36:

Time Baseline - Code Run Sequentially

Page 37:

Experiment I

• Make the five loads work concurrently using Polycephaly

Page 38:

Experiment I

• Concurrent load code:

nyseLoad := self promise: 'Security loadNYSE'.
amexLoad := self promise: 'Security loadAMEX'.
nasdLoad := self promise: 'Security loadNasdaq'.
fundsLoad := self promise: 'MutualFund load'.
etfsLoad := self promise: 'ETF load'.

nyse := nyseLoad value.
amex := amexLoad value.
nasd := nasdLoad value.
funds := fundsLoad value.
etfs := etfsLoad value.

Page 39:

Experiment I

promise: aString
    | machine |
    machine := VirtualMachine new.
    ^[ | val |
    val := machine doit: aString.
    machine release.
    val] promise
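
The start-everything-then-collect pattern used in Experiment I can be tried without any drone images. The sketch below assumes a VisualWorks workspace and replaces the remote loads with a hypothetical slowSquare block that simply waits before answering; only the promise / value pairing is taken from the slides.

```smalltalk
| slowSquare a b c |
"Hypothetical stand-in for a slow load: wait 200 ms, then answer n squared"
slowSquare := [:n | (Delay forMilliseconds: 200) wait. n * n].

"Start all three before asking for any result"
a := [slowSquare value: 2] promise.
b := [slowSquare value: 3] promise.
c := [slowSquare value: 4] promise.

"Each value send blocks only until that particular result is ready"
a value + b value + c value
```

Because every promise is created before the first value is requested, the three waits overlap instead of running back to back, which is the same reason the five loads in Experiment I can proceed together.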

Page 40:

Experiment I

Page 41:

Experiment I

Page 42:

Experiment I - Bottleneck

Page 43:

Experiment II – Addressing the Bottleneck

• loadFunds took the majority of the time

• Loads information for over 24,000 Mutual Funds

• Loads via HTTP

• Loads all Funds starting with ‘A’, then ‘B’…

• So try making each letter an independent task…

Page 44:

Experiment II – Addressing the Bottleneck

• Create n “drones” (start n images)

• Each drone grabs a letter and processes it

Then returns the results to the requesting image

• When done it asks for another letter

• For n from 3 to 20

Page 45:

Experiment II – Addressing the Bottleneck

Time with 3 drones: 30244
Time with 4 drones: 22373
Time with 5 drones: 20674
Time with 6 drones: 18271
Time with 7 drones: 18570
Time with 8 drones: 17619
Time with 9 drones: 17802
Time with 10 drones: 17691
Time with 11 drones: 19558
Time with 12 drones: 18905
Time with 13 drones: 17658
Time with 14 drones: 19696
Time with 15 drones: 21028
Time with 16 drones: 19704
Time with 17 drones: 21899
Time with 18 drones: 19400
Time with 19 drones: 19884
Time with 20 drones: 20698

Page 46:

Experiment II – Addressing the Bottleneck

• Points to note about the solution

Times include

• Start-up of drone VMs

• Object transport time

Page 47:

Experiment III – Addressing the Bottleneck, Revisited

• Let’s take another look at the problem

• 26 HTTP calls

Call

Wait

Process

Repeat

Page 48:

Bottleneck Revisited – the Orange Issue

[Diagram: timeline (0 to 4) showing the Call, Wait, and Process phases]

This Wait is the time waster

Page 49:

Bottleneck Revisited- Sequence

Page 50:

Bottleneck Revisited – Overlap Solution

Work Done overlapping the wait

Page 51:

Code for the Curious…

1 to: n do: [:i |
    [ | char |
    [char := queue nextAvailable. char isNil]
        whileFalse: [results nextPut: (Alpha.MutualFund loadForChar: char)].
    done signal] fork].
n timesRepeat: [done wait].
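
The snippet above depends on the presenter's own classes. A self-contained variant for experimenting with the overlap effect might look like the following, assuming a VisualWorks workspace: the 100 ms Delay is a hypothetical stand-in for waiting on an HTTP response, and nextAvailable is used as in the slides, answering nil once the queue is empty.

```smalltalk
| queue results done n elapsed |
queue := SharedQueue new.
$A asInteger to: $Z asInteger do: [:code | queue nextPut: (Character value: code)].
results := SharedQueue new.
done := Semaphore new.
n := 13.                                "number of green threads, as in the talk"
elapsed := Time millisecondsToRun: [
    n timesRepeat: [
        [ | char |
        [char := queue nextAvailable. char isNil] whileFalse: [
            (Delay forMilliseconds: 100) wait.   "simulated network wait"
            results nextPut: char].
        done signal] fork].
    n timesRepeat: [done wait]].        "wait until every worker has signalled"
elapsed
```

Run sequentially, 26 waits of 100 ms would take about 2.6 seconds. With the waits overlapped across 13 green threads the elapsed time should be far lower, even though only one thread executes Smalltalk code at any moment.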

Page 52:

Bottleneck Redux – Overlap Solution

• Results

Used 13 green threads

Reduced the load time to 15 seconds!

All in one image

What about 26 threads?

Page 53:

Putting it All Together

[Diagram: Coordinated Multiple Applications; loadNYSE, loadAMEX, loadNasdaq, loadFunds, loadETFs across Core 1, Core 2, Core 3, Core 4]

Page 54:

Putting it All Together

Page 55:

Putting it All Together

• Using 5 drones, along with threaded HTTP calls for loadFunds, overall load time improved from 114 seconds to 35 seconds

The 35 seconds includes all image startup and transport time

Page 56:

Putting it All Together

processAToZ
    "self processAToZ"
    "This is a generalized example of one way to use Polycephaly to apply multiple cores to a problem.
    It breaks our universe into pieces to process by letter;
    for example, it could process all customers with last names beginning with 'A', all with 'B', etc., separately."
    | letterQueue done results vms |
    letterQueue := self aToZ. "A collection (SharedQueue) of letters A .. Z"
    done := Semaphore new. "This is to help make sure we wait until all processing is complete before proceeding"
    results := SharedQueue new. "Use SharedQueues for safe concurrent access"
    vms := Polycephaly.VirtualMachines new: 3. "Create this many headless processing images - benchmark different sizes to optimize"
    vms machines do: [:machine |
        [ | letter |
        [letter := letterQueue nextAvailable.
        letter isNil] whileFalse: [
            results nextPut: (machine do: [:ch | PolyExampleAtoZ processLetter: ch] with: letter)].
        machine release. "Shut down the vm"
        done signal] fork]. "Signal that the machine has completed its work"
    vms machines size timesRepeat: [done wait]. "Wait until each machine has reported in as completed"
    ^results

Page 57:

Some Observations

• Measure the elapsed time of work: it helps to know where the problems are in order to address them

• Green threads are an effective way to coordinate concurrency, whether managing VMs or overlapping HTTP calls. They are lightweight, and you can create hundreds in a fraction of a second. I ended up combining both (green threads and a VM) for a superior solution.

• Strive for simplicity. Remember, if it gets out of hand, concurrency can get ugly and difficult very quickly. Follow the KISS principle (Keep it Simple, Smalltalk ;-) ).

Page 58:

Some Observations

• The launch of drone images was faster than I expected

• Drone images are instances of the main image

• As mentioned in the commentary, my measurements in experiment C included startup and shutdown of images, a (perhaps overly) conservative approach that may be atypical of how many use this technology. In experiment B, I tried hot images too, which saved a little time.

Page 59:

Some Observations

• Polycephaly has some nice features:

If I had a dozen drone images running and lost my references and the ability to close them, closing the main image shuts everything down. Things like that save a lot of time, avoid aggravation, and make it fun.

If I started a local process, which in turn started a drone, then later I terminated that process, the drone would be terminated.

Remote errors are passed back to the main image

Page 60:

What is next for Polycephaly?

• Ability to specify vm

• Integration into base.im

• Pluggable marshalling

Page 61:

Conclusions

• I believe my requirements of providing a means for concurrency with high gain and low pain were met (80/20)

• I like the simplicity and robustness of Polycephaly

• I welcome you to experiment with it and benefit from it!

• Feedback and suggestions always welcome!

[email protected]

Page 62:

Cincom Smalltalk Contacts

Arden Thomas – Product Manager - [email protected]

Suzanne Fortman – Smalltalk Business Director - [email protected]

Jeremy Jordan – Marketing Manager - [email protected]

http://www.cincomsmalltalk.com

2011 Cincom Systems, Inc. All Rights Reserved

Cincom, the Quadrant Logo, VisualWorks, ObjectStudio and WebVelocity are trademarks or registered trademarks of Cincom Systems, Inc.

All other trademarks belong to their respective companies.