Sge

Scaling, Grid Engine and Running UIMA on the Cluster

Chris Roeder 11/2010

The Scaling Problem

“Does the solution scale?” asks whether a given piece of software can handle larger versions of the problem (often more data).

“Scaling” is a loose collection of techniques to improve or implement a solution’s scalability.

The choice of techniques depends on the critical resource (CPU, memory or I/O) and on how easily the task is broken into pieces.

This talk focuses on scaling as it applies to UIMA NLP processing (OpenDMAPv2 notwithstanding).

It is a work in progress.

Scaling NLP

Processing a file is independent of processing another file:

Text in, annotations out.

• Multi-threaded
  – More than one thread of execution in one process
    • Pipelines share memory and can step on each other.
  – Ex. Stanford crashes because of concurrency issues
    • “was not an issue in 2001”
  – <casProcessors casPoolSize="4" processingUnitThreadCount="2">

• Multi-process
  – Separate JVMs, each with a single thread
    • Memory is not shared, no crushed toes
    • <casProcessors casPoolSize="3" processingUnitThreadCount="1">
    • Overhead of starting a JVM and pipeline for each process does cost, but it works.

• Many machines
  – More memory, more cores
  – Independence means they won’t miss being on the same machine
  – Independent machines (Cluster) are cheaper than integrated (Enki)
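The multi-process option can be sketched in plain bash: split the file range into slices and start one background process per slice. This is only an illustration under stated assumptions. process_slice is a hypothetical stand-in for launching one single-threaded JVM, and the counts (10 files, 3 workers) are made up.

```shell
# Hypothetical stand-in for "start one JVM on this slice of files".
process_slice() {
    local first=$1 last=$2
    echo "files $first..$last"
}

# Split 1..total into roughly equal slices and run each in its own process.
run_workers() {
    local total=$1 workers=$2
    local per=$(( (total + workers - 1) / workers ))   # ceiling division
    local first=1 last
    while [ "$first" -le "$total" ]; do
        last=$(( first + per - 1 ))
        if [ "$last" -gt "$total" ]; then
            last=$total
        fi
        process_slice "$first" "$last" &   # separate process, no shared memory
        first=$(( last + 1 ))
    done
    wait   # block until every worker has exited
}

run_workers 10 3
```

Because the workers run concurrently, the order of the output lines is not guaranteed.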

Hardware

• Local Cluster (Colfax)
  – A rack of machines with software (SGE) to integrate

• Integrated CPUs (Enki)
  – Much like a rack, but motherboards are tied together and can share memory
    • Gigabit ethernet delivers on the order of 300Mb/sec
    • Motherboard runs up to 4.8GB/sec

• Virtual Cluster
  – Virtualization software allows for a single machine to appear as many, offers flexibility, security

• Cloud
  – A virtual cluster on the net: Amazon EC2

Hardware: CCP’s Colfax Cluster

• Runs Linux (Fedora/Red Hat)
• 6 machines (amc-colfax, amc-colfaxnd[1-5])
• 2 cpus (Intel), 4 cores each, 48 cores total
• Intel motherboard
• 16GB memory each, 96 GB total
• 5TB shared (over NFS) disk array, RAID5
• Named after the assembler: Colfax International

(Sun|Oracle) Grid Engine (SGE)

• Manages a queue of jobs, optimizing resource utilization
• Starts individual processes for a job
• Often used with Message Passing Interface (MPI) for processes that cooperate
• Used here to start “Array Jobs”
  – Each job processes a portion of a large array of work to be done.

SGE Job

– An SGE job is a script and a command line
– Command line specifies resources for scheduling
  • Memory
  • others
– Script is run once for each process started
  • Is not pure shell, but more/less a shell script (next slide)
– Job is assigned an ID number

more/less a shell script?

• Put these lines at top for SGE:
  – #$ -N stanford_out
    • Names the job; standard out goes to a file with this prefix
  – #$ -S /bin/bash
    • The shell to use (no “she-bang”: #!/bin/sh)
  – #$ -cwd
    • Runs from the current directory
  – #$ -j y
    • Merge stdout and stderr to one file
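Put together, the top of such a script might look like the sketch below. The four "#$" lines are the ones listed above; bash treats them as ordinary comments. The describe_task function and its message are made up for illustration, and the fallback values only exist so the sketch also runs outside SGE.

```shell
#$ -N stanford_out
#$ -S /bin/bash
#$ -cwd
#$ -j y

# Illustrative only: report which job and task this process is.
# JOB_ID and SGE_TASK_ID are set by SGE; outside SGE they are unset,
# so placeholders are printed instead.
describe_task() {
    echo "job ${JOB_ID:-none} task ${SGE_TASK_ID:-none} on ${HOSTNAME:-unknown}"
}

describe_task
```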

Submit a Job: qsub

• qsub -t 1-200000:20000 sge_stanford_out.sh
  – -t Index Range
    • Do array items from 1 to 200 thousand, by 20k: 10 processes
  – Do this with the sge_stanford_out.sh script

• How does the script know what files to process?
  – $SGE_TASK_ID (first file number to run)
  – $SGE_TASK_STEPSIZE
    • For example, the first task gets SGE_TASK_ID=1 and SGE_TASK_STEPSIZE=20000, so it handles files 1 through 20000
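A minimal sketch of how a task could turn those two variables into a file range. The task_range function and the TOTAL_FILES variable are hypothetical names, not part of SGE; the defaults only let the sketch run outside an array job.

```shell
# Print "first last" for the files one array task should handle.
# The last task is clamped so it does not run past the final file.
task_range() {
    local id=$1 step=$2 total=$3
    local end=$(( id + step - 1 ))
    if [ "$end" -gt "$total" ]; then
        end=$total
    fi
    echo "$id $end"
}

# SGE sets SGE_TASK_ID and SGE_TASK_STEPSIZE inside an array task.
# TOTAL_FILES is an assumed variable the submitter would provide.
task_range "${SGE_TASK_ID:-1}" "${SGE_TASK_STEPSIZE:-20000}" "${TOTAL_FILES:-200000}"
```

The sketch assumes an array job; for a job submitted without -t, SGE does not supply a numeric task ID.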

sge_stanford_out.sh

• Will evolve into generic UIMA job submission script
• Script modifies a template CPE file, creates a CPE for each process
• CPE specifies starting document number and number to process
• http://wikis.sun.com/display/gridengine62u2/How+to+Submit+an+Array+Job+From+the+Command+Line
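One way to create a CPE per process from a template is a sed substitution, sketched here. The @START@ and @COUNT@ tokens, the make_cpe function and the element names in the demo template are all made up for illustration; the real template and script may work differently.

```shell
# Fill a template with a start document and a count, writing a new file.
# @START@ and @COUNT@ are hypothetical placeholder tokens.
make_cpe() {
    local template=$1 start=$2 count=$3 out=$4
    sed -e "s/@START@/$start/g" -e "s/@COUNT@/$count/g" "$template" > "$out"
}

# Demo with a throwaway template so the sketch runs anywhere.
# The element names below are invented, not real CPE descriptor elements.
workdir=$(mktemp -d)
printf '<startDocument>@START@</startDocument>\n<numToProcess>@COUNT@</numToProcess>\n' \
    > "$workdir/template.xml"

make_cpe "$workdir/template.xml" 20001 20000 "$workdir/temp_cpe_20001.xml"
cat "$workdir/temp_cpe_20001.xml"
```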

[roederc@amc-colfax sge_scripts]$ qsub -t 1-50:3 sge_stanford_out.sh

Your job-array 130.1-50:3 ("stanford_out") has been submitted




qstat

[roederc@amc-colfax sge_scripts]$ qstat
job-ID  prior    name        user     state  submit/start at      queue              slots  ja-task-ID
------------------------------------------------------------------------------------------------------
   130  0.00000  stanford_o  roederc  qw     11/02/2010 12:39:01                     1      1-49:3
[roederc@amc-colfax sge_scripts]$ qmon
[roederc@amc-colfax sge_scripts]$ qstat
job-ID  prior    name        user     state  submit/start at      queue              slots  ja-task-ID
------------------------------------------------------------------------------------------------------
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10  [email protected]  1      4
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10  [email protected]  1      7
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10  [email protected]  1      10
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10  [email protected]  1      13
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10  [email protected]  1      16
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10  [email protected]  1      19
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10  [email protected]  1      22
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10  [email protected]  1      25
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10  [email protected]  1      28
   130  0.55500  stanford_o  roederc  r      11/02/2010 12:39:10  [email protected]  1      31

qdel command

• Use to kill a job
• qdel <job num>

Failures?

• Q: What if a job fails?
  – (A: it stops)

• Open problem
  – For now, that process dies leaving unprocessed jobs
  – Need to cull unprocessed files and try again
    • Usually not enough memory
  – Future: db-driven collection reader with cas-consumer that reports completion
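Until something like that exists, culling unprocessed files could be as simple as comparing input and output directories. A rough sketch, assuming one output file per input file; the .txt and .xmi extensions and the list_unprocessed name are assumptions, not the actual layout.

```shell
# List input files that have no matching output file yet.
# Assumes inputs are NAME.txt and outputs are NAME.xmi (hypothetical layout).
list_unprocessed() {
    local in_dir=$1 out_dir=$2 f base
    for f in "$in_dir"/*.txt; do
        [ -e "$f" ] || continue            # the glob matched nothing
        base=$(basename "$f" .txt)
        if [ ! -e "$out_dir/$base.xmi" ]; then
            echo "$f"
        fi
    done
}
```

Calling list_unprocessed with an input directory and an output directory prints the inputs that still need a run, which can then be fed to a new, smaller job.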

Example 1:

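• Distribute a simple script on cluster:
  – test_sge.sh
  – qsub test_sge.sh
    • Runs it once
  – qsub -t 1-5:1 test_sge.sh
    • Runs it five times
    • The -t option goes before the script name; anything after the script name is passed to the script as an argument
  – qsub -t 100-500:100 test_sge.sh
    • Also runs it five times
    • Gives index starts spaced by 100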
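The index arithmetic can be checked locally, without a cluster, by listing the task IDs a given range would produce. The task_ids function is a made-up helper built on seq; it only mirrors the first-last:step arithmetic and does not talk to SGE.

```shell
# Print the task IDs that a "-t first-last:step" range would generate.
task_ids() {
    local first=$1 last=$2 step=$3
    seq "$first" "$step" "$last"
}

task_ids 1 5 1 | wc -l                  # how many tasks for 1-5:1
task_ids 100 500 100 | tr '\n' ' '; echo   # the IDs for 100-500:100
task_ids 1 50 3 | tail -n 1             # last ID actually used for 1-50:3
```

For 1-50:3 the last ID is 49, which is why qstat reports that job as 1-49:3.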

Example 2: Run UIMA on Cluster

• sge_stanford_out.sh:
• Calls a script with a template CPE and index range:
• run_cpe_cluster_stanford_out.sh
  – Modifies CPE template, creating a CPE for each sub-range
  – Sets up environment, calls SimpleRunCPE (java)

• Note temp_cpe_<n>.xml in ../desc/cpe
• Start a number of terminals, run “top” in each to see cpu and memory usage.

Hadoop

• Inspired by Lisp’s map/reduce
• Map: apply a function to each element of a hash
• Reduce: combine hashes into one
• Known for optimizing by moving processing rather than data
• Similar code used by Google.
• Hadoop is open source, used by Yahoo, Amazon.
• Specialized interfaces make it more suited to greenfield development
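The map and reduce steps can be imitated with a Unix pipeline, which may help make the idea concrete. This is an analogy only, not Hadoop code: map_words and reduce_counts are made-up function names, and word counting is just the customary toy example.

```shell
# Map: emit one "word<TAB>1" pair for every word on standard input.
map_words() {
    tr -s '[:space:]' '\n' | awk 'NF { print $0 "\t1" }'
}

# Reduce: sum the counts for each key and print "word<TAB>total".
reduce_counts() {
    awk -F '\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }' | sort
}

# The sort in the middle groups equal keys together, as a shuffle step would.
printf 'text in annotations out text\n' | map_words | sort | reduce_counts
```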

What about “The Cloud”?

• Amazon’s Elastic Compute Cloud (EC2) is a cluster on the internet that can be rented by the hour

• Very Dynamic
  – Set up nodes when you start using them
  – Expect them to disappear when you stop
  – Must have machine configuration management sussed. You have to re-install everything.

• Use S3 for long-term storage
• Starts at $0.10/hour

Diagram: Colfax Cluster (6 CPUs, 5TB disk array) and Enki (CPU, 8TB RAID)
