Minesh B. Amin mamin @ mbasciences.com …...Terminology: "Exploiting Parallelism" Exploiting...

A Technical Anatomy of How OpenMPI Applications Can Inherit Fault Tolerance Using SPM.Python Minesh B. Amin mamin @ mbasciences.com http://www.mbasciences.com PyHPC Workshop Supercomputing Conference 2011 Seattle, Washington Nov 18, 2011

Page 1:

A Technical Anatomy of How OpenMPI Applications Can Inherit Fault Tolerance Using SPM.Python

Minesh B. Amin

mamin @ mbasciences.com

http://www.mbasciences.com

PyHPC Workshop

Supercomputing Conference 2011

Seattle, Washington

Nov 18, 2011

© 2011 MBA Sciences, Inc. All rights reserved.

Page 5:

Prologue

GNU/Linux [] mpirun ... ./hello_world -prefix "api"

Typical OpenMPI application ... that lacks support for:

• fault tolerance
• timeout
• detection of deadlocks

⇒ Prototyping is (deeply)^∞ frustrating

Page 9:

Problem Statement

Prototyping should be frictionless

Must use original OpenMPI application:
• original source code
• original binary

Original OpenMPI application must inherit support for:
• fault tolerance
• timeout
• detecting deadlocks

GNU/Linux [] spm.python ... mpirun ... ./hello_world -prefix "api"
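The requirement above, adding timeouts and recovery around an unmodified binary, can be illustrated with a minimal sketch that uses only Python's standard `subprocess` module. It is not SPM.Python's mechanism: the helper names `run_once` and `run_with_retries`, the outcome labels, and the retry policy are assumptions made for illustration, and a timeout is only a coarse proxy for deadlock detection.

```python
import subprocess
import sys


def run_once(argv, timeout_s):
    """Run the unmodified command once and classify how it ended."""
    try:
        returncode = subprocess.run(argv, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before raising.
        return "timeout"
    return "ok" if returncode == 0 else "failed"


def run_with_retries(argv, timeout_s, max_attempts=3):
    """Re-run the command until it succeeds or the attempts are used up."""
    outcomes = []
    for _ in range(max_attempts):
        outcomes.append(run_once(argv, timeout_s))
        if outcomes[-1] == "ok":
            break
    return outcomes


if __name__ == "__main__":
    # Hypothetical stand-in for `mpirun ... ./hello_world`:
    # a child process that prints a greeting and exits cleanly.
    print(run_with_retries([sys.executable, "-c", "print('hello')"], timeout_s=30))
```

The command line is passed through untouched, which mirrors the constraint that both the original source code and the original binary stay as they are.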

Page 13:

Problem Statement (Cont’d)

GNU/Linux [] spm.python ... mpirun ... ./hello_world -prefix "api"

[Diagram labels: A, B]

Exploiting two very different forms of parallelism:
• Using same resources
• At the same time

Drop-in replacement for mpirun

Multiple sessions of mpirun within a single session of spm.python

Can use same resources for:
• Checkpoint based parallelism
• What-if analysis
• Stress testing
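The idea of several launcher sessions running at the same time inside one controlling session can be sketched with the standard library alone. This is an illustrative approximation, not how spm.python schedules mpirun sessions: `run_sessions` is a hypothetical helper, and the thread-per-session layout is an assumption chosen for brevity.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_sessions(commands, timeout_s):
    """Run independent commands at the same time; return exit codes in order.

    A session that exceeds timeout_s is killed and reported as None.
    """
    def run_one(argv):
        try:
            return subprocess.run(argv, timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            return None

    # One thread per session; each thread only waits on its child process.
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Hypothetical what-if analysis: the same program with two different inputs.
    variants = [[sys.executable, "-c", "print(%d * 2)" % n] for n in (1, 2)]
    print(run_sessions(variants, timeout_s=30))
```

Because results come back in submission order, a what-if analysis or stress test can pair each exit code with the variant that produced it.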

Page 20:

Terminology: "Exploiting Parallelism"

Exploiting parallelism entails the management of a collection of serial tasks which may communicate using only compatible communication primitives

management refers to policies by which:
• tasks are scheduled,
• premature terminations are handled,
• preemptive support is provided,
• communication primitives are enabled/disabled, and
• resources are obtained and released

serial tasks are classified in terms of either:
• Coarse grain ... where tasks may not communicate prior to conclusion, or
• Fine grain ... where tasks may communicate prior to conclusion.

Management policies codify how serial tasks are to be managed ... independent of what they may be
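The separation described above, a management policy that is independent of what the serial tasks do, can be sketched for the coarse-grain case, where tasks do not communicate before they finish. The function `manage` and its retry-on-failure policy are assumptions for illustration only; they are not taken from SPM.Python.

```python
from concurrent.futures import ThreadPoolExecutor


def manage(tasks, max_attempts=2, workers=4):
    """Apply one management policy to arbitrary coarse-grain tasks.

    Policy (assumed for this sketch): schedule tasks on a small worker pool
    and retry any task that terminates prematurely, up to max_attempts times.
    """
    def attempt(task):
        last_error = None
        for _ in range(max_attempts):
            try:
                return ("done", task())
            except Exception as exc:  # premature termination of this task
                last_error = exc
        return ("gave_up", repr(last_error))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(attempt, tasks))


if __name__ == "__main__":
    # The policy never inspects what a task computes.
    print(manage([lambda: 1 + 1, lambda: "text".upper(), lambda: 1 / 0]))
```

Swapping in a different policy, for example no retries or a different pool size, requires no change to the tasks themselves.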

Page 31:

Terminology: "Parallel Enabling Technologies"

Means to the end

• Bottom-up: OpenMPI, OpenMP, CUDA, OpenGL
  • Maximum flexibility
  • Maximum headaches
  • Must implement fault tolerance

• Top-down: Hadoop, Goldenorb, GraphLab
  • Limited flexibility
  • Fewer headaches
  • Fault tolerance is inherited

• Self-contained environment: SPM.Python
  • Maximum flexibility
  • Fewest headaches
  • Fault tolerance is inherited

N environments/installations for N Frameworks

One environment/installation, N suites of Pclosures

>>> createVirtualCloud -async
>>> cmdA    >>> cmdA -parallel
>>> cmdB    >>> cmdB -parallel
>>> cmdC    >>> cmdC -parallel
>>> cmdD    >>> cmdD -parallel

Page 38:

Anatomy: Timeline

GNU/Linux [] spm.python ... mpirun ./hello_world -prefix "api"

Page 39:

Anatomy: Timeline (Cont’d)

[Timeline diagram, normal execution. Process columns: Hub, mpirun, Spoke, orted, wrapper, Application. Steps are numbered 1 to 7, and four exit() calls appear along the timeline.]

Launch:
• mpirun
Monitor:
• mpirun
• Spokes

Launch:
• orted
Monitor:
• orted
• wrapper

Launch:
• Application
Monitor/Timeout:
• Application
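The last launch/monitor pair in the timeline, where a wrapper launches the Application and watches it under a timeout, can be sketched as an explicit polling loop. This is an assumption-laden illustration rather than SPM.Python's wrapper: the function name `wrapper`, the polling interval, and the returned tuples are all hypothetical.

```python
import subprocess
import sys
import time


def wrapper(app_argv, timeout_s, poll_s=0.05):
    """Launch the application, monitor it, and enforce a timeout."""
    app = subprocess.Popen(app_argv)              # Launch: Application
    deadline = time.monotonic() + timeout_s
    while app.poll() is None:                     # Monitor: still running?
        if time.monotonic() >= deadline:          # Timeout: give up on it
            app.kill()
            app.wait()                            # reap the killed child
            return ("timeout", None)
        time.sleep(poll_s)
    return ("exited", app.returncode)             # pass the exit status upward


if __name__ == "__main__":
    # Hypothetical stand-in for ./hello_world.
    print(wrapper([sys.executable, "-c", "print('hello')"], timeout_s=30))
```

Returning the exit status rather than swallowing it lets whichever process launched the wrapper apply its own monitoring in turn, which matches the layered launch/monitor structure of the timeline.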

Page 49:

Anatomy: Breakdown

[Timeline diagram repeated: Hub, mpirun, Spoke, orted, wrapper, Application; steps 1 to 7; normal execution, with the same Launch and Monitor annotations as on the Timeline slide.]

Built-in Package Management System
• Selectively change default OpenMPI env

Page 50:

Built-in Package Management System

>>> sys.path = [ ".", "/-@-/pkg.builtin", "/opt/default" ]
>>> import pycuda

Hub      3.0 3.1 3.2
Spoke    3.0 3.1 3.2
Spoke    3.0 3.1 3.2
Spoke    3.0 3.1 3.2 ⇒ Uncatchable exception
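The `sys.path` assignment on this slide relies on a standard Python rule: the first search-path entry that contains a module is the one imported. The sketch below demonstrates only that rule, using throwaway directories. The module name `demo_pkg`, the directory names, and both helper functions are hypothetical, and nothing here reproduces SPM.Python's package manager or its `/-@-/pkg.builtin` entry.

```python
import importlib.machinery
import os
import tempfile


def make_pkg_dir(root, dirname, module_name, version):
    """Create root/dirname/module_name.py that only defines __version__."""
    directory = os.path.join(root, dirname)
    os.makedirs(directory)
    with open(os.path.join(directory, module_name + ".py"), "w") as handle:
        handle.write("__version__ = %r\n" % version)
    return directory


def first_match(module_name, search_path):
    """Return the file the import system would pick for module_name."""
    spec = importlib.machinery.PathFinder.find_spec(module_name, search_path)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        builtin = make_pkg_dir(root, "pkg.builtin", "demo_pkg", "3.1")
        default = make_pkg_dir(root, "default", "demo_pkg", "3.0")
        # Reordering the search path changes which copy is selected.
        print(first_match("demo_pkg", [builtin, default]))
        print(first_match("demo_pkg", [default, builtin]))
```

Since the order of entries alone decides which version is loaded, a Hub and its Spokes would each need the same search path to resolve the same version of a package.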

Page 58: Minesh B. Amin mamin @ mbasciences.com …...Terminology: "Exploiting Parallelism" Exploiting parallelism entails the management of a collection of serial tasks which may communicate

Anatomy: Breakdown

[Diagram: process chain Hub, mpirun, Spoke, orted, wrapper, Application, labelled "Normal Execution", with steps 1 to 7 and exit() calls marked]

Launch:
• mpirun

Monitor:
• mpirun
• Spokes

Launch:
• orted

Monitor:
• orted
• wrapper

Launch:
• Application

Monitor/Timeout:
• Application

Built-in Package Management System
• Selectively change default OpenMPI env

Redirection of library calls
• Augment libmpi.so, libc.so ... with libSPM.so


Redirecting Shared Library Calls

OpenMPI Application
    signal(...);
    exit(...);
    MPI_Init(...);
    MPI_Init_thread(...);

libSPM.so
    SPMMPI_Init(...) {
        ...
        return MPI_Init(...);
    }

libmpi.so
    MPI_Init(...) {
        ...
    }

libc.so
    exit(...) {
        ...
    }
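The wrap-and-delegate idea on this slide can be modelled in a few lines of Python. This is an analogy only, not the libSPM.so mechanism: libraries are represented as dictionaries, `resolve` is a hypothetical stand-in for symbol lookup, and the search order is an assumption chosen so that the wrapper is found before the original.

```python
from typing import Callable, Dict, List

call_log: List[str] = []

def mpi_init() -> int:
    """Stand-in for the original entry point exported by libmpi.so."""
    call_log.append("libmpi:MPI_Init")
    return 0

def make_wrapper(name: str, real: Callable[[], int]) -> Callable[[], int]:
    """Build a wrapper that does extra work, then delegates to the original."""
    def wrapper() -> int:
        call_log.append("libSPM:" + name)  # e.g. bookkeeping before the call
        return real()
    return wrapper

# Each "library" maps symbol names to callables.
libmpi: Dict[str, Callable[[], int]] = {"MPI_Init": mpi_init}
libspm: Dict[str, Callable[[], int]] = {
    "MPI_Init": make_wrapper("MPI_Init", libmpi["MPI_Init"]),
}

def resolve(symbol: str, search_order: List[Dict[str, Callable[[], int]]]):
    """Return the first definition of `symbol` found in the search order."""
    for lib in search_order:
        if symbol in lib:
            return lib[symbol]
    raise LookupError(symbol)

# The application asks for MPI_Init; the wrapper is found first.
status = resolve("MPI_Init", [libspm, libmpi])()
print(status, call_log)
```

The application's call site is unchanged; only the order in which definitions are found decides whether the wrapper runs.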


Anatomy: Breakdown

[Diagram: process chain Hub, mpirun, Spoke, orted, wrapper, Application, labelled "Normal Execution", with steps 1 to 7 and exit() calls marked]

Launch:
• mpirun

Monitor:
• mpirun
• Spokes

Launch:
• orted

Monitor:
• orted
• wrapper

Launch:
• Application

Monitor/Timeout:
• Application

Built-in Package Management System
• Selectively change default OpenMPI env

Redirection of library calls
• Augment libmpi.so, libc.so ... with libSPM.so

Second Parallel Capability
• ∼ 60-line Python script
• Authored by developer


Second Parallel Capability

    @spm.util.dassert(predicateCb = spm.sys.sstat.amOffline)
    @spm.util.dassert(predicateCb = spm.sys.pstat.amHub)
    def __init():
        return spm.pclosure.macro.papply.template.openMPI.\
               policyA.defun(signature = 'signature::Hub',
                             stage1Cb = __taskStat,
                             );

    __pc = __init();

Declaration + Definition of Pclosure


Second Parallel Capability

    @spm.util.dassert(predicateCb = spm.sys.sstat.amOffline)
    @spm.util.dassert(predicateCb = spm.sys.pstat.amHub)
    def main(pool,
             taskApiArgs,
             taskTimeout):
        # Initialize 'stage0'.
        __pc.stage0.init.main(typedef = ...);
        hdl = __pc.stage0.payload.tie();
        # Populate the template task
        hdl.spm.meta.label = '***'; # Not interested.
        hdl.spm.meta.apiArgs = taskApiArgs;
        hdl.spm.meta.timeout = taskTimeout;
        # Invoke the pmanager
        __pc.stage0.event.manage(pool = pool,
                                 nSpokesMin = ...
                                 nSpokesMax = ...
                                 timeoutWaitForSpokes = ...
                                 timeoutExecution = ...
                                 );
        return;

Population + Invocation of Pclosure
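To make the "populate a template task, then hand it to a manager with an execution timeout" flow concrete, here is a loose sketch using only the Python standard library. It is not the SPM.Python API: `run_tasks` is a hypothetical name, worker threads stand in for Spokes, and the single `timeout_execution` argument only loosely mirrors the `timeoutExecution` option above.

```python
import concurrent.futures as cf
import time

def run_tasks(task, n_spokes, timeout_execution):
    """Run task(rank) on n_spokes workers and report one outcome per rank."""
    outcomes = {}
    pool = cf.ThreadPoolExecutor(max_workers=n_spokes)
    futures = {pool.submit(task, rank): rank for rank in range(n_spokes)}
    done, not_done = cf.wait(futures, timeout=timeout_execution)
    for fut in done:
        exc = fut.exception()
        if exc is not None:
            outcomes[futures[fut]] = ("failed", repr(exc))
        else:
            outcomes[futures[fut]] = ("ok", fut.result())
    for fut in not_done:
        # Threads cannot be killed; the task is only reported as timed out.
        outcomes[futures[fut]] = ("timeout", None)
    pool.shutdown(wait=False)
    return outcomes

def demo_task(rank):
    """Stand-in task: rank 1 fails, rank 2 overruns, the rest succeed."""
    if rank == 1:
        raise ValueError("boom")
    if rank == 2:
        time.sleep(1.0)
    return "app => %d" % rank

results = run_tasks(demo_task, n_spokes=3, timeout_execution=0.3)
print(results)
```

A real supervisor would run each task in a separate process so that an overrunning task can actually be terminated; the thread pool here only keeps the sketch short.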


Second Parallel Capability

    r"""
    task<template> ::struct {
        # SPM component ...
        spm ::struct {
            meta ::struct {
                label   ::scalar<stringSnippet> = deferred;
                apiArgs ::dict<string,mixed>    = deferred;
                timeout ::scalar<timeout>       = deferred;
            };
            core ::struct {
                relaunchPre  ::scalar<bool> = None;
                relaunchPost ::scalar<bool> = None;
                nameHost     ::scalar<auto> = None;
                whoAmI       ::scalar<auto> = None;
            };
            stat ::struct {
                exception   ::scalar<auto>   = None;
                returnValue ::scalar<record> = None;
            };
        };
        # non-SPM component ...
    };
    """

Typedef for Template Task
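For readers more familiar with plain Python, the shape of this typedef can be approximated with dataclasses. This is an analogy under stated assumptions, not how SPM.Python represents tasks: the class names are hypothetical, the field types are guesses (for example, `timeout` as seconds in a float), and "deferred" is read here as "no default, must be supplied by the caller".

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class Meta:
    # "deferred" in the typedef: no default, so the caller must supply these.
    label: str
    apiArgs: Dict[str, Any]
    timeout: float  # assumption: seconds

@dataclass
class Core:
    relaunchPre: Optional[bool] = None
    relaunchPost: Optional[bool] = None
    nameHost: Optional[str] = None
    whoAmI: Optional[str] = None

@dataclass
class Stat:
    exception: Optional[BaseException] = None
    returnValue: Optional[Dict[str, Any]] = None

@dataclass
class Task:
    meta: Meta
    core: Core = field(default_factory=Core)
    stat: Stat = field(default_factory=Stat)

# Only the meta fields are populated up front; core and stat start empty.
task = Task(meta=Meta(label="***",
                      apiArgs={"app": "./hello_world"},
                      timeout=10.0))
print(task)
```

The split mirrors the typedef: `meta` is filled in by the developer before dispatch, while `core` and `stat` are left for the runtime to populate.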


Second Parallel Capability

    @spm.util.dassert(predicateCb = spm.sys.sstat.amOnline)
    @spm.util.dassert(predicateCb = spm.sys.pstat.amHub)
    def __taskStat(pc):
        try:
            hdl = pc.stage1.payload.tie();
            returnValue = hdl.spm.stat.returnValue;
            if (returnValue.Has(attr = 'stdOut')):
                print("\tstdOut : %s", returnValue.stdOut);
            if (returnValue.Has(attr = 'stdErr')):
                print("\tstdErr : %s", returnValue.stdErr);
            if (returnValue.Has(attr = 'stdOutErr')):
                print("\tstdOutErr: %s", returnValue.stdOutErr);
        except (SPMTaskDropped,
                SPMTaskLoad,
                SPMTaskEval,), (hdl,):
            pass;
        return (pc.stage1.event.done(),
                None,)[-1];

Callback for Status Reports
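The core of this callback, reporting only the output streams that a result record actually carries, can be sketched without SPM.Python. In the sketch below `report` is a hypothetical helper, `types.SimpleNamespace` stands in for the record type, and `hasattr` plays the role of the record's `Has(attr = ...)` check.

```python
from types import SimpleNamespace

def report(return_value):
    """Return one line per output stream that the record actually carries."""
    lines = []
    for attr in ("stdOut", "stdErr", "stdOutErr"):
        if hasattr(return_value, attr):
            lines.append("\t%-9s: %s" % (attr, getattr(return_value, attr)))
    return lines

# A record that captured stdout and an empty stderr, but no merged stream.
record = SimpleNamespace(stdOut="app => 0", stdErr="")
for line in report(record):
    print(line)
```

Returning the lines rather than printing them inside the helper keeps the formatting easy to check separately from where the report is displayed.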


SPM.Python Session

GNU/Linux [] spm.3.111116.trial.A.python
(Trial Edition)

Spm.Python 3.111116 / Python 2.4.6
[GCC 4.4.3 (64 bit) on linux2]

NOTE
>>>> Trial period ends at <<<<
>>>> 24:00 hrs (Pacific Standard Time) <<<<
>>>> December 29, 2011 <<<<

Type "help", "copyright", "credits", "license" or "spm.Api()" for more information.
Type "spm.DemoExtract(dirname = ...)" to extract demo scripts.

Please visit www.mbasciences.com for the latest and growing
collection of scripts and technical briefs classified in terms of
parallel management patterns.

>>> import pool
>>> import demo
>>> import os;
>>> taskApiArgs = \
    dict(app = os.getcwd() + '/hello_world',
         appOptions = "-prefix='app'",
         );
>>> taskTimeout = spm.util.timeout.after(seconds = 10);
>>> demo.main(pool = pool.intraAll(),
              taskApiArgs = taskApiArgs,
              taskTimeout = taskTimeout)
#: MetaStatus (hub): Waiting - ForSpokes ...
#: MetaStatus (hub): Tasks - Eval
app => 0
app => 1
#: MetaStatus (hub): Tasks - EvalDone
>>> demo.main(pool = pool.intraOnePerServer(),
              taskApiArgs = taskApiArgs,
              taskTimeout = taskTimeout)
#: MetaStatus (hub): Waiting - ForSpokes ...
#: MetaStatus (hub): Tasks - Eval
#: MetaStatus (hub): Tasks - EvalDone
>>> exit()
GNU/Linux []


Conclusion

Prototyping should be frictionless

Must use original OpenMPI application
• original source code
• original binary

Original OpenMPI application must inherit support for:
• fault tolerance
• timeout
• detecting deadlocks

GNU/Linux [] spm.python ...mpirun ... ./hello_world -prefix "api"


Conclusion (Cont'd)

http://www.mbasciences.com
• SPM.Python distribution
• Technical Briefs

Parallel Management Patterns

Elementary Parallel Primitives
• Clone: Once, Repeat
• Partition: DAG, List
• Partition: Aggregate, Centralized, Decentralized

HPC Parallel Primitives
• Partition: Grid/OpenMPI
• Limited Beta: Nov 30

Data / Graph Parallel Primitives
• Partition: Data Flow Graph
• Stanford U: Dec 6