Running Parallel Simulations and Enabling Science Gateways with the NSF MATLAB Experimental Computing Resource at Cornell

Steve Lantz, Cornell Center for Advanced Computing

Susan Mehringer, Cornell Center for Advanced Computing

The “MATLAB on the TeraGrid” experimental computing resource is funded by NSF grant 0844032 in partnership with Purdue University, Dell, The MathWorks, and Microsoft.

Presented at SC10, New Orleans, LA, November 2010

www.cac.cornell.edu/matlab 2

MATLAB on the TeraGrid

• This is an effort to provide a large parallel MATLAB resource to a national (and international) community in a secure, usable manner.

• Several different hardware and software components make up the system. These integrate with the MATLAB client at different levels.

• All functions are provided by various “services”, meaning you never actually log on to any CAC systems. The client software simply makes requests to CAC systems.

High-Level Process

• Security is managed via short-lived certificates. When you log in to the system, you trade a username/password for a certificate that allows you to use the services.

• File transfer service – enables you to move files through a specialized FTP server to a network file system that is mounted on all compute nodes.

• Job submission service – enables you to submit and query jobs on the cluster; these jobs are executed by MATLAB workers on the compute nodes.

Hardware View

[Diagram showing: MyProxy Server, GridFTP Server, HPC 2008 Head Node, DataDirect Networks 9700 Storage, Windows Server 2008, CAC 10GB Interconnect]

1. Retrieve certificate

2. Upload files to storage via GridFTP

3. Submit job to run MATLAB workers on cluster

4. Download files via GridFTP

Software View

• File movement and job submission interactions are largely hidden by software integrated with MATLAB.

• CAC’s client code for MATLAB is a mix of Java and .m files that enable access to the TUC cluster directly from your MATLAB client through the PCT “generic scheduler” interface.

• Client code will communicate as needed with server-side software to run your parallel jobs on TUC, the 512-core cluster devoted to parallel MATLAB applications.

[Diagram showing: MATLAB functions (matlabpool, parfor, createJob, submit, getAllOutputArguments); client libraries (JGlobus CoG, Apache CXF); Certificate Management, MyProxy, GridFTP, SSL, JSDL]

A Word on Security

• Logging on to MyProxy returns a short-lived X.509 certificate that is used to authenticate to services.

• This allows any TeraGrid user to access the system using their username/password on a TeraGrid MyProxy server. Most users will use the CAC MyProxy server.

• Job submission and status information is accessed via a web service call that is secured by a client-certificate SSL (or TLS) connection. Your data and job requests are transferred over secure channels.

GridFTP

• GridFTP is an extension of the standard File Transfer Protocol, developed as part of the Globus Toolkit.

• GridFTP provides two key extensions that the CAC client code uses:

– GSI Security – The Grid Security Infrastructure provides file transfer authentication and encryption and interoperates with MyProxy X.509 certificates.

– Parallel Transfers (extended block mode) – Makes use of multiple simultaneous connections so a higher percentage of available bandwidth can be used.

A Note on the Platform

• The compute nodes that run the MATLAB jobs are running Windows HPC 2008 (64-bit).

– Because only a minority of users run a Win64 platform, any files requiring compilation (e.g., MEX files) will likely need to be recompiled on TUC.

– MATLAB is relatively tolerant of paths with the wrong direction of slashes, but the difference can still cause problems:

• C:\Users\naw47\myfiles\this.dat (Windows path)

• /home/naw47/myfiles/this.dat (Mac, Linux path)
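A minimal sketch of keeping paths portable between a Mac or Linux client and the Windows compute nodes. It uses only the built-in fullfile, filesep, and strrep functions; the folder and file names are taken from the example paths above and are otherwise arbitrary.

```matlab
% Build a relative path from its parts; fullfile inserts the separator
% that is correct for the machine the code is running on.
relPath = fullfile('myfiles', 'this.dat');

% Convert an explicit path between the two slash conventions when a
% string has to match the other platform exactly.
unixStyle = 'myfiles/this.dat';
winStyle  = strrep(unixStyle, '/', '\');   % forward slashes to backslashes
backAgain = strrep(winStyle, '\', '/');    % and back again

fprintf('Local form:   %s\n', relPath);
fprintf('Windows form: %s\n', winStyle);
```

Building paths with fullfile inside the task function means the same code produces a valid path on both the client and TUC.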

Support

• As a funded project, the system is free to use for research applications.

– We will ask for information on your project so that we can learn who we are supporting and how to best address problems.

• We also provide consulting support for the system.

– Troubleshooting

– Guidance on optimizing your application

– General help with parallel MATLAB

Installing the CAC Parallel MATLAB Client Code

Installation Overview

1. Check that prerequisites are met

2. Download the CAC client code

3. Modify your MATLAB classpath.txt file

4. Modify one function in the CAC client code

5. Register your Certificate

6. Set paths, runtime, etc

7. Run test jobs

Prerequisites

• Linux, Mac, or Windows operating system on the client machine

• MATLAB Release 2009a, 2009b, or 2010a

• MATLAB Parallel Computing Toolbox

• Access to the resource, obtained by submitting the Interest Form found at http://www.cac.cornell.edu/matlab/

Terminology

MATLABROOT: The MATLAB installation directory

• >> matlabroot

• Common locations are:

– Vista/XP/7: C:\Program Files\MATLAB\R2009a

– Mac: /Applications/MATLAB_R2009a.app

– Linux: /opt/matlab/r2009a or /usr/local/matlab/r2009a

CACHOME: Wherever you install the CAC client code

• Be sure to substitute your folder path for CACHOME in all installation steps. The folder itself can have any name.

Helpful Links

• General: http://www.cac.cornell.edu/matlab/

• CAC client code, download and installation FAQ: http://www.cac.cornell.edu/matlab/downloads

• Help: http://www.cac.cornell.edu/help/

Download CAC Client Code

• Choose a name and location for CACHOME. You will need write permissions. Some good choices:

– Windows: c:\username\cac

– Mac: /Users/username/cac

– Linux: /home/username/cac

• Download the .zip file from:

http://www.cac.cornell.edu/matlab/downloads

• Unpack it into CACHOME. You should end up with a folder that contains .m files and subdirectories.

classpath.txt: Java Libraries

• The CAC client code depends heavily on a set of Java libraries. The CACHOME/lib/*.jar files must be added to MATLAB’s Java classpath, which is defined in the text file classpath.txt.

• Find the location of classpath.txt: >> which classpath.txt

classpath.txt: Modifications

• Find the line maci64=$matlabroot/java/jarext/aquaDecorations.jar and add the 12 lines listed below after it.

– The lines below use Windows backslashes; use forward slashes on Mac and Unix.

– Replace CACHOME with your install path, which must be an absolute path.

• Comment out one line: ## $matlabroot/java/jarext/ice/ib6https.jar

• The 12 lines to add:

CACHOME\lib\littlejohn.jar

CACHOME\lib\bcprov-jdk15-1.43.jar

CACHOME\lib\bcprov-jdk16-143.jar

CACHOME\lib\cog-jglobus-1.7.0.jar

CACHOME\lib\commons-logging-1.1.1.jar

CACHOME\lib\cryptix-asn1.jar

CACHOME\lib\cryptix.jar

CACHOME\lib\cryptix32.jar

CACHOME\lib\cxf-2.2.7.jar

CACHOME\lib\log4j-1.2.15.jar

CACHOME\lib\not-yet-commons-ss0.3.11.jar

CACHOME\lib\puretls.jar
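The 12 lines can also be generated rather than typed. This is only a sketch: cachome is a hypothetical install path that you would replace with your own absolute CACHOME, and the jar names are copied as they appear in the list above. The printed lines are meant to be pasted into classpath.txt by hand.

```matlab
cachome = 'C:\username\cac';   % hypothetical; use your absolute CACHOME
jars = {'littlejohn.jar', 'bcprov-jdk15-1.43.jar', 'bcprov-jdk16-143.jar', ...
        'cog-jglobus-1.7.0.jar', 'commons-logging-1.1.1.jar', ...
        'cryptix-asn1.jar', 'cryptix.jar', 'cryptix32.jar', 'cxf-2.2.7.jar', ...
        'log4j-1.2.15.jar', 'not-yet-commons-ss0.3.11.jar', 'puretls.jar'};

% Choose the separator to match the style of cachome itself.
if any(cachome == '\')
    sep = '\';
else
    sep = '/';
end

% Assemble and print one classpath entry per jar file.
entries = cell(size(jars));
for k = 1:numel(jars)
    entries{k} = [cachome sep 'lib' sep jars{k}];
    fprintf('%s\n', entries{k});
end
```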

classpath.txt: Testing

1. Restart MATLAB.

2. List the paths of all of the jar files that MATLAB knows about. Do you see the 12 lines you added?

>> javaclasspath

3. Are you using the classpath.txt file you expected?

>> which classpath.txt

4. Test that classpath.txt is set up properly.

>> addpath('CACHOME/contrib');

>> updateContrib();

>> cacCheckClassPath();

classpath.txt: But, what if…

You are not administrator on the machine?

• You must run explicitly as the administrator when editing classpath.txt for the changes to be saved, since it affects the global MATLAB install.

• If this is not feasible, e.g., you are on a multi-user system:

– Identify your startup directory: >> userpath

– Place a classpath.txt file in your startup directory. This file is user-specific and will only affect your MATLAB environment.

– MATLAB looks first in the startup directory for a classpath.txt file, then the default directory, using whichever it finds first.

– http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_env/f8-10506.html

cacsched.m: Modifications

• Start MATLAB

• Edit CACHOME\cacsched.m. Change USERNAME to your CAC username on line 24.

• Add the CAC client code to your MATLAB path:

>> addpath('CACHOME');

• Run cacsched to set up the scheduler object, sched. This object is passed to the createJob functions in order to initiate jobs. The scheduler settings will be output.

>> cacsched

cacsched.m: Optional

• Review the following line from cacsched.m:

set(sched,'DataLocation',fullfile(LJHome, 'jobs'));

• The communication between the client MATLAB and the scheduler is file based, so each job submission creates a large number of files that must be stored somewhere on the client machine. The default location is CACHOME\jobs.

• For a different location, change the line to an explicit path, e.g.:

set(sched,'DataLocation','/home/myuserid/myJobsLocation');

Register your Certificate

Download the CAC server certificates and register your certificate with us.

>> cacRegisterCertificate();

• Follow the dialogue box instructions

• Run this again any time you change your password

• It can take up to two minutes to complete

Job Submission Settings

• These settings remain in effect for all subsequent job submissions until you change them:

>> ClusterInfo.setQueueName('Quick'); % use only for jobs of 10 minutes or less on 16 cores or fewer

>> ClusterInfo.setWallTime(10); % set your wall time limit to 10 minutes

>> ClusterInfo.getWallTime % show your current wall time setting

>> help ClusterInfo % see more examples

Installation Testing

runtests.m, found in CACHOME, is a tool which performs a series of functionality tests:

>> runtests(sched,1); % run local tests on file and path settings

>> runtests(sched,2); % run test on the file transfer system and on scheduler communication.

>> runtests(sched,3); % submit sample MATLAB jobs to the cluster to ensure that both parallel and distributed jobs work.

>> runtests(sched,0); % run all tests

Future Sessions

>> addpath('CACHOME');

>> cacsched

>> addpath('CACHOME/contrib');

>> updateContrib();

>> ClusterInfo.setQueueName('Quick'); % Or rely on setting from previous session

>> ClusterInfo.setWallTime(10); % Or rely on setting from previous session

>> submityourjob(sched); % “submityourjob” stands for your own job submission function

How to Submit a Job

Next Steps

• At this point, you have a fully operational install of the CAC client code for parallel MATLAB.

• Your next step should be to take a look in the examples directory to start seeing how to take advantage of TUC.

– cacsubmit – super simple distributed job example

– cacparsubmit – simple example of a parallel (MPI) job

– pooljobremote – MATLAB pool example

– cacNonBlockSubmit – example of submitting a non-blocking job (avoiding waitForState)

Using the PCT

• MATLAB’s Parallel Computing Toolbox is the client-side toolbox that enables parallelism (including using TUC).

• PCT provides a set of interfaces that allow us (CAC) to write implementations of PCT functions that talk to TUC but look the same to you (the user) as when run locally.

• Parallel resources are selected either by using a named configuration or by using the findResource function.

– Either way, PCT function calls must be tied to specific implementations to provide resource-specific functionality.

– You don’t ever call the underlying functions directly.

findResource

• In our examples (and in general practice) we call the findResource function via a script called cacsched.m.

• If you examine cacsched, you’ll see we also tie the PCT interface functions to specific functions provided by CAC.

Using a Configuration

Jobs and Tasks

• findResource creates a scheduler object, which allows you to create Jobs. In PCT, Jobs are containers for Tasks, which are where the actual work is.

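[Diagram: the scheduler object sched holds Jobs(24) and Jobs(25).

– Jobs(24), created with j=createJob(sched), contains Tasks(1) someFunction(x) and Tasks(2) otherFunction(y).

– Jobs(25), created with j=createParallelJob(sched), contains Tasks(1) myFunction(z).

– Each task is added with createTask(j,…).]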

Distributed Jobs

• There are three types of jobs in the PCT: distributed, parallel, and pool.

• Distributed jobs have one or more one-core tasks and no communication between tasks. Thus, each task could be run as a one-core job through a batch scheduler. These are useful for embarrassingly parallel (EP) work or for shifting lengthy jobs to TUC.

Parallel and Pool Jobs

• Parallel and Pool jobs are multi-core; communication between cores is possible. These jobs have just one task!

• The number of cores must be given. The task function is responsible for implementing the actual parallelism using MPI_Rank logic (or parfor/spmd/labindex for pool jobs).

State of Jobs

• After a job is submitted, “job.state” is just one of several different ways to learn the state of the job.

• waitForState is a PCT interface to block on job state; it’s problematic for long-running jobs or jobs that fail.

• cacNonBlockingGetJobStates is an optional, non-PCT interface that offers more control.
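One way to avoid blocking indefinitely is to poll the state yourself with a limit on the number of attempts. The helper below is a hypothetical sketch, not part of the CAC client code or the PCT. It takes a function handle so that it can be tried without a cluster, and it assumes that 'finished' and 'failed' are the state strings that mark the end of a job.

```matlab
function state = pollState(getStateFn, intervalSec, maxPolls)
% pollState  Poll a job state, giving up after a fixed number of polls.
%   getStateFn  - function handle returning the current state as a string
%   intervalSec - seconds to pause between polls
%   maxPolls    - maximum number of times to query the state
state = getStateFn();
nPolls = 1;
while ~any(strcmp(state, {'finished', 'failed'})) && nPolls < maxPolls
    pause(intervalSec);
    state = getStateFn();
    nPolls = nPolls + 1;
end
```

For a submitted job j, a call might look like state = pollState(@() get(j,'State'), 30, 120); which checks every 30 seconds for at most an hour and then returns whatever state the job is in.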

Retrieving Results

• Once your job completes, you need to get the results in two steps: (1) download files, (2) load into workspace.

– getAllOutputArguments returns cell array a{Task,Output}

– a{1,2} = Task 1, second output
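The indexing convention can be tried without a cluster. In this sketch the cell array is a hand-made stand-in for what getAllOutputArguments would return for a job with two tasks and two outputs per task; the values are made up for illustration.

```matlab
% Stand-in for a = getAllOutputArguments(j): one row per task,
% one column per output argument.
a = {magic(3), 'task 1 note'; ...
     eye(2),   'task 2 note'};

[nTasks, nOutputs] = size(a);   % rows are tasks, columns are outputs
secondOfFirst = a{1,2};         % Task 1, second output
firstOfSecond = a{2,1};         % Task 2, first output

fprintf('%d tasks, %d outputs each\n', nTasks, nOutputs);
```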

Helpers

• The CAC client code provides a number of functions beyond the PCT interface which should be helpful to you. The hands-on labs will take advantage of these functions.

– gridFTP() – creates an object whose methods are, in effect, a command-line interface to the TUC file storage.

– littleJohnLog/qpeek – monitor the status of a running job. littleJohnLog is a server-side function that writes data to a file that qpeek reads.

– getErrors/getOutput – pretty-print any errors your tasks had, as well as the command-line output.

Putting it All Together

We can control which resource is used to execute the job simply by swapping out the scheduler object!
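A minimal sketch of what swapping the scheduler object can look like. The function name runRandJob is hypothetical. The job itself uses only the PCT calls that appear elsewhere in these slides, and the local scheduler lookup in the usage note is an assumption about the PCT releases listed in the prerequisites.

```matlab
function a = runRandJob(sched)
% runRandJob  Run one small distributed job on whichever scheduler is passed in.
j = createJob(sched);
createTask(j, @rand, 1, {3,3});   % one task returning a 3-by-3 random matrix
submit(j);
waitForState(j);                  % block until the job finishes
a = getAllOutputArguments(j);     % a{1,1} holds the matrix
```

Assuming a local scheduler is available, the same function could be run in both places:

>> localSched = findResource('scheduler','type','local');

>> a = runRandJob(localSched); % runs on your workstation

>> cacsched % creates sched for TUC

>> a = runRandJob(sched); % same code, runs on TUC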

Parallelizing a Pool Code

• As we have seen, converting code to run remotely as a distributed job is fairly trivial. All you really need to do is call createTask on your function (maybe in a loop).

• Pool jobs are not hard, either. Let’s take a code that opens a pool on a multi-core workstation and alter it to exploit the many cores on TUC. The basic process:

1) Modify the pool function to run on TUC

2) Write the submitter or driver script

3) Script the movement of any needed files

Pool Code

• Our example code opens a matlabpool, reads an input file, then uses a parfor loop to execute the peak-picking algorithm in parallel.
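The example code from the slide is not reproduced in this transcript. As a stand-in, here is a small sketch of the parfor pattern it describes: each iteration processes one signal independently and counts its local maxima. The data and the simple peak test are made up for illustration and are not the presenters' algorithm. Without an open pool, the parfor loop simply runs serially.

```matlab
% Hypothetical data: one signal per column.
signals = [0 1 0 2 0; 0 0 3 0 0; 1 0 1 0 1]';
nSignals = size(signals, 2);
peakCounts = zeros(1, nSignals);

parfor k = 1:nSignals
    s = signals(:, k);
    % An interior point is a peak if it is strictly greater than both neighbours.
    isPeak = s(2:end-1) > s(1:end-2) & s(2:end-1) > s(3:end);
    peakCounts(k) = sum(isPeak);
end

disp(peakCounts);
```

Because each iteration touches only its own column and its own element of peakCounts, the iterations can run on different workers in any order.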

Modifications to the Pool Code

– More outputs are needed

– Pool commands are removed

– Absolute paths are best for I/O

– Graphics may be moved off to the client, or may be dumped to a file

Submitter Script

• The submitter or driver script sets up the pool job

• It starts up the matlabpool automatically (8 “labs” here)

Moving the Files

• Both the task function and the datafile must be present on the remote server. We’ll use gridFTP to take care of it.

• The submitter also sets PathDependencies for the job.

Parallel Jobs

• PCT supports basic MPI commands inside parallel jobs.

– Initialization is done for you (no MPI_Init)

– Size and rank are available from the start of the job; numlabs = MPI_Comm_size, labindex = MPI_Comm_rank

More on Parallel Jobs

• All the basic communication methods are available: Send, Receive, Broadcast, Barrier, gop (global operations such as reductions)

• Source and tag are the same as in MPI, but MATLAB figures out data formats for you.

– labSend(data,destination,[tag]);

– labReceive(source,tag);

– labReceive(); %take anything

• Co-distributed arrays are sliced across the workers so that huge matrices can be operated on.
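A minimal sketch of a task function for a parallel job that uses the rank-style logic described above. The function name sumOnLab1 is hypothetical. It relies on labindex, numlabs, labSend, and labReceive from the PCT. In a plain MATLAB session with the PCT installed, numlabs is 1, so the function just returns its input.

```matlab
function total = sumOnLab1(localValue)
% sumOnLab1  Collect one value from every lab and sum them on lab 1.
%   Returns the sum on lab 1 and [] on every other lab.
if labindex == 1
    total = localValue;
    for src = 2:numlabs
        total = total + labReceive(src);   % blocks until lab src sends
    end
else
    labSend(localValue, 1);                % send this lab's value to lab 1
    total = [];
end
```

In practice a reduction like this is usually written with a single global operation such as gplus(localValue); the explicit version is shown only to illustrate the send and receive calls.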

File Transfer

File Transfer

• The basic job submit operation specifies that the program will run on a remote server. When it runs, the functions and data must be available.

j = createJob(sched);

createTask(j,@rand,1,{3,3});

submit(j);

waitForState(j);

a = getAllOutputArguments(j);

• This example works as-is because rand is a built-in MATLAB function. It is always on the MATLAB path.

File Transfer

• Have a custom function and/or require a datafile?

j = createJob(sched);

createTask(j,@myfunction,1,{3,3});

submit(j);

waitForState(j);

a = getAllOutputArguments(j);

• myfunction.m does not exist on the remote computer.

• Transfer this file and get it added to the path.

FileDependencies

• The MATLAB FileDependencies property will move the files for you

• Best for smaller projects with only a couple of files

• Specify directories and files the worker will need. All files and directory structure will be copied; file transfer occurs for each worker running a task for that particular job on a machine

set(j,'FileDependencies',{'/home/username/src/myfunction.m', '/home/username/data/dfile.mat'});

Move the Files Yourself

• FileDependencies is best for smaller projects with only a couple of files

• Alternative:

1. Move the file(s)

2. Add the path to the worker sessions

1. Move the Files

• First move the file(s) needed by the job:

sendFileToCAC('filename.m');

• sendFileToCAC('filename') – super simple method for dumping a single file into your home directory on TUC.

• sendDirToCAC('mydir','tucDir') – Recursively move a directory and its contents to TUC.

2. Add the path

• On your laptop/workstation, you commonly issue addpath('path/to/file') statements.

• The same is true when running MATLAB on TUC, but:

– The task function must be on the startup path of MATLAB.

– You may enter addpath and cd statements into your task function, but first your function must be available.

• We will use PathDependencies to make our task function available.

PathDependencies

• PathDependencies is a property of the Job object that allows you to issue addpath statements on TUC before calling your task.

• Assuming the file has been moved to \\storage01\matlab\username\MyProjectDir

• Specify the path dependency in your job submission script:

set(j,'PathDependencies',{'\\storage01\matlab\username\MyProjectDir'});
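Assembling the remote path from its parts makes it harder to drop a separator. This is a small sketch: the server and share names come from the slide, while username and MyProjectDir are placeholders to replace with your own values.

```matlab
username   = 'username';       % placeholder for your CAC username
projectDir = 'MyProjectDir';   % placeholder for your project folder on TUC

% Backslashes are literal inside single-quoted MATLAB strings.
remoteDir = ['\\storage01\matlab\' username '\' projectDir];
pathDeps  = {remoteDir};       % PathDependencies takes a cell array of paths

fprintf('%s\n', remoteDir);
% With a job object j in hand: set(j,'PathDependencies',pathDeps);
```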

PathDependencies

Scripted Solution

• Both send*ToCAC methods are primarily for one time use, best for moving a big directory of data files, testing, or copying a single file.

• The gridFTP interface is more flexible. It allows you to interactively move files to TUC as well as write scripts that move files.

• For projects that involve more than one or two source files, we recommend writing a “prep” function which ensures that the most up-to-date functions are available on TUC.
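A “prep” function can be as simple as a loop over the source files. The sketch below is hypothetical and not part of the CAC client code. It takes the transfer routine as a function handle, so sendFileToCAC can be passed in on a configured client and a stub can be passed in for a dry run.

```matlab
function sent = prepFiles(fileList, sendFn)
% prepFiles  Send each existing file in fileList using sendFn.
%   fileList - cell array of file names
%   sendFn   - function handle that transfers one file, e.g. @sendFileToCAC
%   sent     - cell array of the file names that were actually sent
sent = {};
for k = 1:numel(fileList)
    f = fileList{k};
    if exist(f, 'file') == 2
        sendFn(f);
        sent{end+1} = f; %#ok<AGROW>
    else
        warning('prepFiles:missing', 'Skipping missing file: %s', f);
    end
end
```

A call from a submission script might look like prepFiles({'calcLatLongDistance.m','degrees2Radians.m'}, @sendFileToCAC); so that the current versions are sent before every submit.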

Then add PathDependencies to the job in the submission script:

Lab

• Source files:

– calcLatLongDistance.m

– degrees2Radians.m

• Data file:

– Airports_boardings.txt

• Task function:

– addpath_remote.m

• Batch script:

– addpath_submit.m

Using this set of files, we will work with:

• FileDependencies

• PathDependencies

• GridFTP

Debugging

Debugging

• Debugging a remote process is always difficult. The situation on TUC is no different. Errors must be caught, captured and returned to the client machine to resolve.

• MATLAB generally captures any errors thrown by a task and stores them as an MException in the task output.

– Distributed jobs may store an exception for each task.

– The CAC-provided function getErrors(j) collects the errors from the tasks of a job and pretty-prints them for you.
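To see what an MException carries, you can create and catch one locally. This sketch runs in a plain MATLAB session; the identifier and message are made up for illustration.

```matlab
try
    % Throw a deliberate error with an identifier and a formatted message.
    error('demo:badInput', 'Expected %d inputs but got %d.', 3, 1);
catch err
    % err is an MException object.
    fprintf('identifier: %s\n', err.identifier);
    fprintf('message:    %s\n', err.message);
    fprintf('stack depth: %d\n', numel(err.stack));
end
```

The stack field is a struct array with file, name, and line entries; it is the same kind of information you inspect when examining the errors returned from a task.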

Getting Errors

• SimpleError.m has a simple error in the task function. Submit SimpleErrorSubmit.m and examine the error:

>> [j,a] = SimpleErrorSubmit(sched);

>> getErrors(j); % how do we view just one stack trace?

>> ts = get(j,'Tasks');

>> es = get(ts,'Errors');

>> es.message

>> es.stack(1)

Manual Retrieval

• Sometimes a job will hang or fail in such a way that the files don’t get downloaded from TUC correctly. In this case, you’ll want to retrieve those files manually.

>> downloadJob(sched,j);

• Here’s what to do first for a job defined in a prior session:

>> cacsched % re-create the sched object

>> j = sched.Jobs(12) % copy the Jobs(12) object, e.g.

>> get(j,'name') % check the job’s name

Manual Retrieval with gridFTP

• For large parallel jobs, you may want to use gridFTP to bypass downloadJob, which downloads all of the files.

– If you need to spot the error in a large parallel job, for example, you can just use gridFTP to grab Task1.out.mat.

– It very likely contains the exception, because the error is almost always found in all Tasks or the master (Task1).

>> ftp = gridFTP();

>> ftp.get('Job4/Task1.out.mat');

>> ftp.close();

Stdout, Stderr

• For a parallel job named JobN, the standard output and errors for all the MATLAB processes are stored in two files called JobN.ou and JobN.er respectively.

• In a distributed job, each TaskM writes its own output and error into TaskM.ou and TaskM.er in the JobN folder.

• But capturing errors doesn’t help you catch other things:

– Problems with numerics

– Running times of different sections of your code

– Progress of a long-running job…

Printf

• There are two other ways to get diagnostic information:

– CaptureCommandWindowOutput - if true for a task, this property tells MATLAB to return output from fprintf statements and other console output (e.g., from statements lacking a semicolon) to the client.

– littleJohnLog/qpeek - this pair of functions can be used to create a log for a long-running job and examine the log as the job runs. Usage ideas can be found in cacLog.m and cacLogSubmit.m, in the examples folder.

Printf Example

• Verbose.m and VerboseSubmit.m contain both fprintf and littleJohnLog statements. Notice that the tasks must be set up to return the command window output at task creation.

• Run the jobs and make sure you can retrieve the output manually as well as using the getOutput(job) function.

>> at = get(j,'Tasks');

>> out = get(at,'CommandWindowOutput');

>> getOutput(j);

Debug Lab – Intro

• The Traveling Salesman Problem is a classic minimization problem. A salesman has a fixed set of cities that he must visit (each only one time). What order of city visits will minimize the total distance travelled?

– ga_run.m solves this problem using a genetic algorithm (GA) on the airport dataset that we worked with in the file transfer lab. This is a relatively small dataset with about 150 locations.

– ga_run2.m is a buggy version that solves the problem for a larger dataset (cities.txt).

Debug Lab – Instructions

• Examine the output from getErrors and getOutput in order to find and fix the problems with ga_run2.m and help our intrepid salesman out.

• The functions defined in the two .m files take the same arguments and return the same outputs, so ga_submit.m should not need to be modified, except to change the name of the function in createTask.