Beyond Best Practice: Grid Computing in the Modern World


Transcript of Beyond Best Practice: Grid Computing in the Modern World

1

About the Presenter

Jan Bigalke

SAS Architect at Allianz Managed Operations & Services SE

Greg Nelson

CEO and Founder of ThotWave Technologies

2

Beyond Best Practice: Grid Computing in the Modern World 11562-2016

3

Five Reasons You Should Stay…

1. 90% of the time, an out-of-the-box grid install won't handle your use cases

2. You will gain a better appreciation of what's happening when you use applications such as EG with SAS Grid Manager

3. You'll better understand how queue options affect your users under differing workloads

4. You will learn how to estimate how many jobs your grid environment can theoretically run

5. You will get a tutorial on how to go beyond the installation and configure your system for high availability

6. Plus more….

4

Agenda

§ Introduction: Modern SAS Architectures and SAS Grid

§ Understanding how SAS Workloads can impact different resources

§ How SAS Grid Processes Work

§ Best practices for Post-Installation Configuration

§ Calculating Capacity

§ Implementing High Availability

§ Maintaining your SAS Software in a Grid Environment

5

Trends / Motivation

§ Workload management

§ Commodity hardware (x86)

§ Scalability

§ Start small, grow with customer demand

§ Efficiency (reduce cost)

6

Types of Workload in an Enterprise SAS Environment

7

Grid architecture

[Diagram: the Default layout (SAS WebServer, SAS WebApp, SAS Meta, one SAS Compute server) next to the Grid layout (SAS WebServer, SAS WebApp, SAS Meta, several SAS Compute servers on a Shared FS)]

Changes with GRID:

§ A shared filesystem for the compute servers is necessary

§ Distribution of workload between servers

§ Scalability

8

A SAS Grid computing environment is one in which SAS computing tasks are distributed among multiple computers on a network, all under the control of SAS Grid Manager.

Architecture: SAS Grid @work

9

Design considerations

§ Memory, CPU and I/O

§ Utilization, latency, throughput

§ Type of workload: different workloads have different needs

10

SAS Versions and grid capabilities

SAS Version | New capabilities (key points)
9.4 M2 | Grid Manager plug-in for the Environment Manager
9.4 M1 | Grid-launched stored process servers and pooled workspace servers
9.4 M0 | Grid options, grid-launched workspace servers
9.3 | Load balancing for stored process servers, OLAP servers and pooled workspace servers
9.2 | SAS code analyzer, grid-launched batch SAS jobs, load balancing for SAS Workspace Servers

11

SAS Grid request flow

[Diagram: SAS® Enterprise Guide®, SAS Metadata Server, SAS Object Spawner, gridrun, LSF, and the Workspace Server on a SAS Grid Node, connected by arrows numbered 1 to 4]

Request flow:

1. Metadata Server
2. Object Spawner
3. Get grid options
4. Spawn task on Grid Node

12

Metadata and GRID

Object Spawner Console Log:

2015-10-07T21:09:23,801 INFO (gridrun.c:590) - commHandler: command received is >[INIT] [PROVNAME]:"Platform" [MODNAME]:"" [SRVHOST]:"sas94-app1-syst.testdomain" [SRVPORT]:"0" [USERNAME]:"" [PASSWORD]:"" [TIMEOUT]:"0" [OPTIONS]:<project=SASApp><.

2015-10-07T21:09:23,836 INFO (gridrun.c:609) - commHandler: command response is >[DONE]<.

2015-10-07T21:09:23,837 INFO (gridrun.c:590) - commHandler: command received is >[STARTJOB] [JOBNAME]:"SAS Enterprise Guide_SASApp - Workspace Server_4101A009-6340-2B46-8E08-8DB2933E8182" [RESOURCES]:"" [COMMAND]:</var/opt/data/sas/sas94/configAPP/Lev1/SASApp/WorkspaceServer/WorkspaceServer.sh> [ARGUMENTS]:<-noterminal -noxcmd -netencryptalgorithm AES -metaserver sas94-meta-syst.testdomain -metaport 8561 -metarepository Foundation -locale en_US -objectserver -objectserverparms "delayconn sph=hosta.testdomain protocol=bridge spawned spp=42449 cid=0 pb classfactory=440196D4-90F0-11D0-9F41-00A024BB830C server=OMSOBJ:SERVERCOMPONENT/A5ZI7 NU4.AY0000WN cel=everything lb recon grid "keepalive=30"" -METAUSER '"testuser@!*(generatedpassworddomain)*!"' -METAPASS 49944139d506b727d1555D7b1d8E6162 > [OPTIONS]:<> [ARMCORR]:"" [FLAGS]:"0" [INFILES]:"" [OUTFILES]:"" [HOSTS]:"sas94-app1-syst.testdomain,sas94-app2-syst.testdomain"[MPIPROCS]:"0" [PROCSHOST]:"0"<.

Job <33707> is submitted to queue <qiSASApp>.
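The command strings in the log above carry their parameters as bracketed keys followed by a quoted or angle-bracketed value. As a minimal sketch for reading such lines when troubleshooting, the Python below pulls those pairs out of the [INIT] command. The field layout is inferred from this sample log only, not from a documented format, and the function name `parse_fields` is hypothetical. The sketch does not handle the [ARGUMENTS] field of [STARTJOB], which contains nested quotes and angle brackets.

```python
import re

# Matches [KEY]:"value" or [KEY]:<value>. This layout is inferred from the
# sample log lines, not taken from a documented format.
FIELD = re.compile(r'\[([A-Z]+)\]:\s*(?:"([^"]*)"|<([^<>]*)>)')


def parse_fields(command: str) -> dict:
    """Return the key/value fields found in one commHandler command string."""
    fields = {}
    for key, quoted, angled in FIELD.findall(command):
        # Exactly one of the two value groups can be non-empty per match.
        fields[key] = quoted if quoted else angled
    return fields


init = ('[INIT] [PROVNAME]:"Platform" [MODNAME]:"" '
        '[SRVHOST]:"sas94-app1-syst.testdomain" [SRVPORT]:"0" '
        '[USERNAME]:"" [PASSWORD]:"" [TIMEOUT]:"0" [OPTIONS]:<project=SASApp>')

fields = parse_fields(init)
print(fields["PROVNAME"], fields["SRVHOST"], fields["OPTIONS"])
```

The bare [INIT] token has no value attached, so it is skipped and only the parameters end up in the dictionary.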

13

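Job Flow Processing Inside LSF

SEQUENCE:

1. Submit the job
2. Schedule the job
3. Dispatch the job
4. Run the job
5. Return output

[Diagram: gridrun submits the job (1) to the Queue on the Master Host, where it has state PEND; the job is scheduled (2) using Resource Info and dispatched (3) to the Compute Host, where it has state RUN (4); the job output is returned (5) and the job has state DONE. Labels: mbatchd: JOB_SCHEDULING_INTERVAL; sbatchd: JOB_ACCEPT_INTERVAL, SBD_SLEEP_TIME]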
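The job states named on this slide (PEND, RUN, DONE) can be read as a small state machine. The sketch below is a simplified model in Python, written for illustration. Real LSF has further states, such as suspended and exited jobs, which are left out here. The names `JobState` and `lifecycle` are hypothetical.

```python
from enum import Enum


class JobState(Enum):
    PEND = "PEND"   # submitted and waiting in the queue
    RUN = "RUN"     # dispatched to a compute host and running
    DONE = "DONE"   # finished, output returned


# Allowed transitions in this simplified model only.
TRANSITIONS = {
    JobState.PEND: JobState.RUN,
    JobState.RUN: JobState.DONE,
}


def lifecycle(start: JobState = JobState.PEND) -> list:
    """Follow the transitions from the start state until no successor exists."""
    states = [start]
    while states[-1] in TRANSITIONS:
        states.append(TRANSITIONS[states[-1]])
    return states


print(" -> ".join(state.value for state in lifecycle()))  # PEND -> RUN -> DONE
```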

14

Best Practices

§ Configuration in Metadata

§ Beyond the basics with Queues

§ Multi-tenant considerations

§ SAS Grid and Hadoop

§ Theoretical number of jobs

§ High availability

§ Software Updates

15

SAS configuration

§ STP, Workspace, batch

§ STP: load balancing only (need to check whether this helps)

§ Workspace: grid-launched

§ Batch: grid integration into the enterprise scheduler

16

Grid Configuration in Metadata

§ Workspace Server

17

Grid Configuration in Metadata

§ Stored Process Server

18

Grid Configuration in Metadata

§ Pooled Workspace Server

19

GRID and workload considerations

§ GRID is load balanced (based on utilization)

§ Web requests need short latency

§ Grid for longer-running requests (Batch, Workspace Server)

§ Online requests are distributed via the Object Spawner (shorter latency)

20

Manage Workloads with Queues

Queues can manage workloads with different requirements

Protect access with groups

Parameters per queue:

• Priority
• Limits
• Stop and start conditions

[Diagram: SAS clients, Scheduler, SAS Progs; Queues: Interactive, Batch, default]

21

Why Queues Matter

Parameter | Example (Interactive Queue) | Example (Batch Queue) | Definition
PRIORITY | PRIORITY=50 | PRIORITY=20 | The priority of the queue relative to other queues
NICE | NICE=20 | NICE=10 | The execution priority change, based on Linux “nice” values
CPULIMIT | CPULIMIT=5 | CPULIMIT=15 | A CPU time limit applied to jobs
UJOB_LIMIT | UJOB_LIMIT=5 | UJOB_LIMIT=2 | The maximum number of job slots per user in a queue
PJOB_LIMIT | PJOB_LIMIT=10 | PJOB_LIMIT=5 | The maximum number of job slots per processor in a queue
QJOB_LIMIT | QJOB_LIMIT=120 | QJOB_LIMIT=60 | The maximum number of jobs in a queue
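To make the effect of these parameters concrete, here is a minimal Python sketch that models the two example queues and checks whether one more job fits under the per-user and per-queue limits. It is a simplified illustration, not how LSF itself evaluates limits. The queue names and the helper `may_start` are hypothetical, while the numbers are the example values from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Queue:
    name: str
    priority: int     # PRIORITY: a larger value is served first
    ujob_limit: int   # UJOB_LIMIT: job slots per user
    qjob_limit: int   # QJOB_LIMIT: jobs in the queue


interactive = Queue("interactive", priority=50, ujob_limit=5, qjob_limit=120)
batch = Queue("batch", priority=20, ujob_limit=2, qjob_limit=60)


def may_start(queue: Queue, user_slots_used: int, queue_slots_used: int) -> bool:
    """True if one more job fits under both the per-user and per-queue limit."""
    return (user_slots_used < queue.ujob_limit
            and queue_slots_used < queue.qjob_limit)


# Queues with a larger PRIORITY value are considered first.
order = sorted([batch, interactive], key=lambda q: q.priority, reverse=True)
print([q.name for q in order])  # ['interactive', 'batch']

# A user who already holds 2 slots is blocked in batch but not in interactive.
print(may_start(batch, user_slots_used=2, queue_slots_used=10))        # False
print(may_start(interactive, user_slots_used=2, queue_slots_used=10))  # True
```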

22

Why Queues Matter

Parameter | Example (Interactive Queue) | Example (Batch Queue) | Definition
HJOB_LIMIT | HJOB_LIMIT=4 | HJOB_LIMIT=4 | The maximum number of job slots that this queue can use on any host
CHUNK_JOB_SIZE | CHUNK_JOB_SIZE=4 | CHUNK_JOB_SIZE=4 | The maximum number of jobs allowed to be dispatched together in a chunk
r1m | r1m=0.3/1.5 | r1m=0.3/1.5 | 1-minute CPU run queue length (alias: cpu)
ut | ut=0.2 | ut=0.2 | 1-minute CPU utilization (0.0 to 1.0)
r15s | r15s=0.3/1.5 | r15s=0.3/1.5 | 15-second CPU run queue length
it | it=10/1 | it=10/1 | Idle time in minutes (alias: idle)
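Load thresholds such as r1m=0.3/1.5 combine two values: a scheduling threshold and a stop threshold. The Python sketch below shows one way to read them, under the simplifying assumption that a higher value means a busier host, which holds for r1m, r15s and ut but not for it (idle time), where the direction is reversed. The function names are hypothetical, and the exact comparison rules inside LSF may differ from this model.

```python
def parse_threshold(spec: str):
    """Split 'sched/stop' (or just 'sched') into floats; stop may be None."""
    sched, _, stop = spec.partition("/")
    return float(sched), (float(stop) if stop else None)


def decide(load: float, spec: str) -> str:
    """Classify a host for an index where a higher value means a busier host."""
    sched, stop = parse_threshold(spec)
    if stop is not None and load > stop:
        return "suspend running jobs"
    if load > sched:
        return "no new jobs"
    return "dispatch allowed"


for load in (0.1, 0.8, 2.0):
    print(load, decide(load, "0.3/1.5"))
```

With r1m=0.3/1.5, a run queue length of 0.1 still allows dispatch, 0.8 blocks new jobs, and 2.0 crosses the stop threshold.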

23

Multi-tenancy considerations

Queues per tenant

§ allow different settings per tenant

Queues and workload

§ stop_cond = select[ (cpuusg > 95.0) && (cguxx > 90.0) ]

§ resume_cond = select[ (cguxx < 95.0) || (cpuusg < 95.0) ]
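The stop and resume conditions above can be modelled as two predicates that toggle a suspended flag. The Python sketch below is such a model, assuming that `cpuusg` and `cguxx` are load values reported per host; the index names are copied from the slide and their exact meaning is not given there. The function `next_state` and the sample readings are hypothetical.

```python
def stop_cond(load: dict) -> bool:
    return load["cpuusg"] > 95.0 and load["cguxx"] > 90.0


def resume_cond(load: dict) -> bool:
    return load["cguxx"] < 95.0 or load["cpuusg"] < 95.0


def next_state(suspended: bool, load: dict) -> bool:
    """Return the new 'suspended' flag after one evaluation of the conditions."""
    if not suspended and stop_cond(load):
        return True
    if suspended and resume_cond(load):
        return False
    return suspended


samples = [
    {"cpuusg": 80.0, "cguxx": 50.0},  # quiet host: jobs keep running
    {"cpuusg": 97.0, "cguxx": 96.0},  # stop condition met: suspend
    {"cpuusg": 96.0, "cguxx": 97.0},  # resume condition not met: stay suspended
    {"cpuusg": 90.0, "cguxx": 97.0},  # cpuusg below 95: resume
]

suspended = False
history = []
for load in samples:
    suspended = next_state(suspended, load)
    history.append(suspended)
print(history)  # [False, True, True, False]
```

In this model the two conditions overlap when cpuusg is above 95 and cguxx lies between 90 and 95, so a host in that band could alternate between suspending and resuming.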

24

GRID and interaction with other systems (databases / Hadoop)

§ Database access interfaces (same DB client configuration on every node)

§ Hadoop (Java JARs, Kerberos authentication, Grid)

[Diagram: grid nodes Node 1 through Node n, with RDBMS]

25

Capacity Sample

§ $\sum_{k=1}^{n} \mathrm{Jobs}(k) = MXJ_1 + MXJ_2 + \dots + MXJ_n$

§ JOB_ACCEPT_INTERVAL is 1

§ MBD_SLEEP_TIME is 5

§ Platform LSF dispatches one job to a particular machine and waits for 5 seconds before dispatching another job to the same machine, regardless of how long each job takes

26

Capacity Sample

§ Average job duration: 5 seconds

§ JOB_ACCEPT_INTERVAL: 1

§ MBD_SLEEP_TIME: 5

§ 4 cores per host

§ 2 hosts

§ 4 job slots per core

27

Capacity Sample

§ Jobs per host per minute = 60 / (JOB_ACCEPT_INTERVAL * MBD_SLEEP_TIME) = 60 / (1 * 5) = 12

§ 12 * 2 hosts = 24 jobs per minute in the GRID
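The arithmetic of this capacity sample can be reproduced with a few lines of Python. This is a sketch under the assumptions of the example: identical hosts, and a dispatch wait of JOB_ACCEPT_INTERVAL * MBD_SLEEP_TIME seconds per host. The function names are hypothetical, and the slot total additionally assumes that each host's MXJ equals cores per host times job slots per core.

```python
def jobs_per_host_per_minute(job_accept_interval: int, mbd_sleep_time: int) -> float:
    """Dispatch rate to one host, given the wait between two dispatches."""
    wait_seconds = job_accept_interval * mbd_sleep_time
    return 60 / wait_seconds


def grid_jobs_per_minute(hosts: int, job_accept_interval: int,
                         mbd_sleep_time: int) -> float:
    """Dispatch rate for the whole grid, assuming identical hosts."""
    return hosts * jobs_per_host_per_minute(job_accept_interval, mbd_sleep_time)


def total_job_slots(hosts: int, cores_per_host: int, slots_per_core: int) -> int:
    """Sum of the per-host slot limits when every host has the same MXJ."""
    return hosts * cores_per_host * slots_per_core


per_host = jobs_per_host_per_minute(job_accept_interval=1, mbd_sleep_time=5)
per_grid = grid_jobs_per_minute(hosts=2, job_accept_interval=1, mbd_sleep_time=5)
slots = total_job_slots(hosts=2, cores_per_host=4, slots_per_core=4)
print(per_host, per_grid, slots)  # 12.0 24.0 32
```

The dispatch rate (24 jobs per minute) and the number of job slots (32) are separate ceilings: the first bounds how fast jobs can start, the second how many can run at once.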

28

Failover Options

What’s Included

§ Core LSF Processes (Base and Batch)

What’s Not Included

§ SAS Management Console

§ SAS Mid-Tier

§ SAS Object Spawner

§ Platform Process Manager

§ Platform Grid Management Service

29

LSF Daemons on UNIX

§ LSF BASE: LIM, RES, PIM, PEM

§ LSF BATCH: SBATCHD, MBATCHD, MBSCHD

[Diagram: daemons on the Server Host and the Master Server]

30

Other Daemons on UNIX

§ PM (Platform Process Manager): JFD

§ GMS (Platform Grid Management Service): GABD

31

GRID failover EGO

§ Where can grid help: the compute cluster

[Diagram: SAS WebServer; Midtier Cluster (SAS WebApp nodes); Metadata Cluster (SAS Meta nodes); Grid (SAS Compute nodes) with compute services via EGO]

32

GRID and failover

§ EGO defines the services and the number of instances

§ If one server goes down, EGO restarts the service on a failover node

[Diagram: Node 1, Node 2, Node 3, each marked with a service instance (S)]
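The failover behaviour described on this slide can be pictured with a toy model. The Python sketch below moves every service from a failed node to the first remaining node. It is an illustration only and says nothing about how EGO actually selects a failover node. The node names, service names and the function `fail_over` are hypothetical.

```python
def fail_over(placement: dict, failed: str, nodes: list) -> dict:
    """Move every service on the failed node to the first remaining node."""
    healthy = [node for node in nodes if node != failed]
    if not healthy:
        raise RuntimeError("no failover node available")
    return {service: (healthy[0] if node == failed else node)
            for service, node in placement.items()}


nodes = ["node1", "node2", "node3"]
placement = {"object_spawner": "node1", "process_manager": "node2"}

after = fail_over(placement, failed="node1", nodes=nodes)
print(after)  # {'object_spawner': 'node2', 'process_manager': 'node2'}
```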

33

Failover

34

GRID and hot fixing process

§ Update considerations

§ Base and beyond

[Diagram: Node 1, Node 2, Node 3 and the Shared Store, each marked FIX]

FIX:

1. Close the node in LSF, stop the EGO services
2. Wait for running SAS processes to end
3. Fix the binaries, open the node
4. Sync (rsync)
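The hot fix steps on this slide can be laid out as an ordered plan. The Python sketch below only generates a checklist of step descriptions and runs no commands. Applying the steps to one node at a time is an assumption made for illustration, and the function name `rolling_fix_plan` is hypothetical.

```python
# Step descriptions follow the order given on the slide.
STEPS = [
    "close node in LSF",
    "stop EGO services",
    "wait for SAS processes to end",
    "fix binaries",
    "open node",
    "sync (rsync)",
]


def rolling_fix_plan(nodes: list) -> list:
    """Return the ordered checklist, handling one node at a time."""
    return [f"{node}: {step}" for node in nodes for step in STEPS]


plan = rolling_fix_plan(["node1", "node2", "node3"])
for line in plan[:6]:
    print(line)
```

Handling one node at a time keeps the other nodes open for jobs while a single node is closed for patching.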

35

Conclusion

§ GRID allows us to scale horizontally

§ Different workloads need different settings

§ We can optimize workloads with queues

§ We use EGO to manage service failover

§ GRID is not only SAS: EGO/LSF settings matter too

36

Jan Bigalke, Allianz Managed Operations & Services

Greg Nelson, ThotWave