Condor: Overview and User Guide to the Condor Biostatistics Environment.
Authorship
• Authors– Patrícia Kayser Vargas– September 2002– Talk at Biostat, Wisconsin, USA
• Revisions– V1
• C. Geyer• PDP/2005-2, PPGC, UFRGS• December 2005
Topics
• Introduction– What is Condor?
– Why and when to use Condor?
– What are Condor Universes?
• Running Jobs on Condor– C programs
• YAP
– Java Programs
• Final Remarks
Introduction
What is Condor?
• Condor– is a distributed batch scheduling system
• “The goal of Condor is to provide the highest feasible throughput by executing the most jobs over extended periods of time.” [1]
• What is a job? – Several possibilities
What is Condor?
• Condor– is composed of a collection of different daemons
that provide various services, such as • a job queueing mechanism,• scheduling policies,• a priority scheme,• monitoring,• resource management,• job management,• matchmaking...
What is Condor? Architecture
[Architecture figure from [1]]
What is Condor? Architecture
• Machine types– Central Manager
• Manager of a Condor network (grid)• One per “pool”• Central point of failure
– Submit Machines• Users’ machines• A user submits, monitors, and controls the execution of a job
– Execution Machine (worker)• Executes jobs
– One machine can play several roles
What is Condor? Architecture
• Machine types (cont.)– Checkpoint Server
• Optional• Stores checkpoint files
What is Condor? Architecture
• Condor has four main daemons• startd and schedd run on the Submit and Execution Machines
– startd: • monitors the conditions of the resource where it runs• publishes a resource-offer ClassAd, and • is responsible for enforcing the resource owner’s policy
for starting, suspending, and evicting jobs. – schedd:
• maintains a persistent job queue• publishes resource-request ClassAds, and • negotiates for available resources
What is Condor? Architecture
• Only on Central Manager:– collector:
• is the central repository of information • startd and schedd send periodic updates to the collector
– negotiator:• periodically performs a negotiation cycle
– process of matchmaking
– negotiator tries to find matches between various ClassAds,
– of resource offers and requests, and
– once a match is made, both parties are notified and are responsible for acting on that match
What is Condor? Architecture
[Architecture diagram from [1]]
What is Condor? Architecture
[Architecture diagram from [1]: Submitter and Executing machines]
What is Condor? Architecture
• Resource and job ClassAds are published and sent to the collector– startd sends resource ads– schedd sends job ads
• The collector forwards everything to the negotiator, which does the matchmaking
What is Condor? Architecture
• Matchmaking algorithm– the negotiator can discover a resource on which a
job can be executed– it tells the schedd daemon on the submitting
machine whom it should contact to export the job
– it tells the startd daemon on the machine chosen to execute it (an idle resource that meets the requirements) that it will receive a task
What is Condor? Architecture
• At this point the central manager no longer acts; the two machines execute the job on their own– the submit machine creates a shadow process
• to send the task and receive the results– the machine that will execute it
• creates a starter process that receives the task, and• a “user job” process that in turn runs the task,• and at the end the results are sent back to the submit
machine
Why and when to use Condor?
• Condor is useful when– there are several jobs to be submitted– there is one executable and several different input
data
Why and when to use Condor?
• Condor is useful because– it can use different available machines
• opportunistic scheduling– it controls file transfers
• the job must be able to access the data files from any machine on which it can potentially run
– it sends an email notification when a job has completed• except for jobs submitted from a Linux machine
What are Condor Universes?
• Types of universes– standard
– vanilla
– java
– parallel
• The Universe attribute is specified in the submit description file– the default is standard
What are Condor Universes?
• standard– provides
• checkpointing and • remote system calls
– jobs are more reliable and have uniform access to resources from anywhere in the pool
– to prepare a program as a standard universe job, it must be relinked with condor_compile
What are Condor Universes?
• standard– there are a few restrictions
– complete list in manualhttp://www.cs.wisc.edu/condor/manual/v6.4/2_4Road_map_running.html
– examples• no multi-process jobs (no fork(), exec(), or system()) • no inter-process communication
(including pipes, semaphores, and shared memory)• no sending or receiving SIGUSR2 or SIGTSTP• all files must be opened read-only or write-only
What are Condor Universes?
• vanilla– used for programs which cannot be successfully
re-linked
– useful for shell scripts
– cannot checkpoint or use remote system calls
– sometimes a job must restart from the beginning on another machine in the pool
• since there is no checkpointing
What are Condor Universes?
• java– can execute on any machine in the pool that will
run the Java Virtual Machine
– at the moment it does not work at Biostat• the department at Wisconsin
– compiled Java programs can be submitted
– creating a jar file is recommended for programs with several classes
What are Condor Universes?
• parallel– MPI and PVM
• used for parallel programs using message passing
– Globus• must have Condor-G installed
– I did not check whether they work at Biostat
Running Jobs on Condor
Running Jobs on Condor
• You can submit your jobs from any biostat machine, since they all run schedd and startd
• You must – set PATH environment variable– prepare a submission file– compile your job with condor_compile if using
standard universe– submit your job(s) with condor_submit command
Running Jobs on Condor
• Submission file– the submit description file is the file that says
• which executable to run• the directory where the output files will be
placed• how many jobs will be instantiated, etc.
Running Jobs on Condor
• Submission file– this file is turned into a ClassAd for
each job that needs to be instantiated• e.g. if the file contains the command 'queue 50',
50 jobs of that program will have to be executed
• so 50 ClassAds will be published on the central manager
Running Jobs on Condor Setting PATH environment variable
• Change PATH to find the Condor commands (according to your shell)
bash:
source /s/pkg/condor/condor.sh
PATH=$PATH:/s/pkg/`/s/share/ostoken`/condor/bin; export PATH
csh:
source /s/pkg/condor/condor.csh
set path = ( $path /s/pkg/`/s/share/ostoken`/condor/bin )
rehash
Running Jobs on Condor Preparing a submission file
• ClassAds (Classified Advertisements)– name/value pairs
– syntax similar to C/Java
• The commands are case insensitive, e.g. “executable = fact” and “Executable = fact” are equivalent
Running Jobs on Condor Preparing a submission file
• At a minimum, it must have the “executable” attribute: your program/binary
Executable = fact
• Other useful attribute: input file – your data
input = test.data
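Putting these attributes together, a minimal submit description file could look like the sketch below (fact and test.data are the example names used in these slides; Queue is the command that actually instantiates a job, and the universe defaults to standard):

```
####################
# minimal submit description file (sketch)
####################
Executable = fact       # the program/binary to run
input      = test.data  # data fed to the job's standard input
Queue                   # instantiate one job
```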
Running Jobs on CondorCompiling your job with condor_compile
• If using standard universe:– use condor_compile
• it is necessary to relink the program with the Condor library
condor_compile gcc fact.c -o fact
Running Jobs on CondorSubmitting your job(s) with condor_submit
• In any Condor Universe– jobs submitted using condor_submit command
with submission file as parameter condor_submit condor1.sub
– the -v option shows information about the submission (the full generated ClassAd)
• it only prints the listing and exits (not interactive)
condor_submit -v condor1.sub
Example of C Program
Running Jobs on Condor C programs
• options:– gcc (the GNU C compiler)
– cc (the system C compiler)
– acc (ANSI C compiler, on Sun systems)
– CC (the system C++ compiler)
– …(http://www.cs.wisc.edu/condor/manual/v6.4/condor_compile.html)
bash-2.03$ condor_compile gcc fact.c -o fact
Running Jobs on Condor C programs – example “submission file”
####################
# C Example: demonstrate use of multiple directories
# "Arguments = 5" to pass integer 5 as parameter
####################
Executable = fact
Universe = standard
output = loop.out
error = loop.error
Log = loop.log
Arguments = 5
Initialdir = run_1
Queue
Initialdir = run_2
Queue
Running Jobs on Condor C programs
• Log– contains important information for evaluating the
execution/performance of the application– for an ordinary user it may not be that
relevant– records every event that happens to the job,
with date/time/machine information• when it was: submitted, started executing,
suspended, migrated, finished (with error or success)
Running Jobs on Condor C programs
• Arguments– parameters for the executable– in the example:
• arguments = 5• is equivalent to running 'fact 5' in a terminal
• Initialdir – where the output/error/log files will be
stored– initialdir = run_1
• Directory “run_1”
Running Jobs on Condor C programs
• Queue– runs a single job instance, using run_1
as initialdir– the directory must be created before running
condor_submit, otherwise an error occurs
• “Initialdir = run_2” and “Queue”– one more instance of the job, now in another directory
Running Jobs on Condor C programs
another example of a “submission file”
####################
# C Example:
# each job runs with a different input and
# stores results in different files
####################
Executable = fact
notify_user = [email protected]
Input = in.$(Process)
Output = out.$(Process)
Error = err.$(Process)
Log = fact.log
Queue 2
Running Jobs on Condor C programs
• notify_user = [email protected]– tells Condor to send a message announcing the job’s completion
• Input = in.$(Process)– $(Process): Condor’s Process variable
• which is instantiated with a sequential integer for each job created
• so it will use in.0, in.1, in.2, and so on
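The per-process input files must exist before submission. A quick shell sketch to generate them (assuming two jobs, as with Queue 2, and a single integer of input per job — the file contents are hypothetical):

```shell
# Create in.0, in.1 to match "Input = in.$(Process)" with "Queue 2".
# Assumes each job reads one integer from standard input.
for i in 0 1; do
    echo "$i" > "in.$i"
done
```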
Running Jobs on Condor C programs
• Log = fact.log– a single log file despite the several jobs– events are annotated with the job number
• Queue 2– creates two jobs– any integer can be used– Queue 100
• creates 100 tasks
Running Jobs on Condor C programs – YAP
• To configure YAP with Condor:
configure --enable-depth-limit --enable-condor
make
Running Jobs on Condor C programs – YAP
• condor.sub
Universe = standard
Executable = /u/dutra/Yap-4.3.20/condor/yap.$$(Arch).$$(OpSys)
Initialdir = /u/dutra/App/f1/train_best
Log = /u/dutra/App/f1/train_best/log
Requirements = ((Arch == "INTEL" && OpSys == "LINUX") && (Mips >= 500) || (IsDedicated && UidDomain == "cs.wisc.edu"))
Arguments = -b /u/dutra/Yap-4.3.20/condor/../pl/boot.yap
Input = condor.in.$(Process)
Output = /dev/null
Error = /dev/null
Queue 300
Running Jobs on Condor C programs – YAP
• condor.in.0
['~/Yap-4.3.20/condor/../pl/init.yap'].
module(user).
['~/Aleph/aleph.pl'].
read_all('~/App/f1/train_best/train').
set(i,5).
set(minacc,0.7).
set(clauselength,5).
set(recordfile,'~/App/f1/train_best/trace-0.7-5.0').
set(test_pos,'~/App/f1/train_best/test.f').
set(test_neg,'~/App/f1/train_best/test.n').
set(evalfn,coverage).
induce.
write_rules('~/App/f1/train_best/theory-0.7-5.0').
halt.
Example of Java Program
Running Jobs on CondorJava programs
• Using the Java Universe• No need to compile with Condor• Use a jar file for programs with several classes:
http://java.sun.com/docs/books/tutorial/jar/
• If using the Computer Science environment, you must grant access on AFS to the files to be used
http://www.cs.wisc.edu/condor/uwcs/
Running Jobs on CondorJava programs
####################
# Example in Java Universe
# executable must have the .class file and
# arguments must have the main class as first argument
####################
universe = java
executable = Fact.class
arguments = Fact
notify_user = [email protected]
output = loop.out
error = loop.error
log = loop.log
Queue
Running Jobs on CondorJava programs
####################
# Example in Java Universe using jar file
####################
universe = java
executable = jgfSection2.jar
arguments = JGFAllSizeA 4
jar_files = jgfSection2.jar
transfer_files = ALWAYS
output = logAllSection2f.out
error = logAllSection2f.error
log = logAllSection2f.log
Queue
Running Jobs on CondorJava programs
• executable = jgfSection2.jar– it is a jar– not a .class as in the previous example
• arguments = JGFAllSizeA 4– two arguments– example generated from the Java Grande benchmark
• jar_files = jgfSection2.jar– looks redundant– but without this attribute the file is not transferred
Running Jobs on CondorJava programs
• transfer_files = ALWAYS– likewise: needed to transfer the .jar– perhaps a bug that has since been fixed
Running Jobs on CondorInspecting Condor Jobs
• Some useful commands:– condor_q
• shows the queue of locally submitted jobs
– condor_q -analyze• more information• helps to understand whether a job is not running because of a problem with its requirements or because no resource is available
• condor_q -submitter <user>
Running Jobs on CondorInspecting Condor Jobs
• condor_q -run– shows only the jobs that are running
• condor_q -submitter <user>– filters to show information only about the jobs submitted by “user”
Running Jobs on CondorInspecting Condor Jobs
• condor_status– shows each of the machines in the Condor pool
– with information that is• static (e.g. which OS)• dynamic (e.g. whether it is idle or busy)
Running Jobs on CondorInspecting Condor Jobs
• condor_rm– if you decide to remove a job or a set of jobs from the
queue– similar to kill– you must give the job number
• condor_q -global– shows information about all the queues– on all machines from which jobs were submitted
Final Remarks
Final Remarks
• So, Condor...– controls execution of several jobs
– can really improve your runtime• Yap+Aleph: during three months: 53,000 CPU
hours (peak of 400 machines)
• But, Condor...– does not automatically parallelize your job
Final Remarks
• Running Jobs on Condor - Observations:– the input data file and the directories used for output/log/error must be
created beforehand, • otherwise an error will be reported and no job will be
executed– for each execution,
• the output is appended to the log files• the results overwrite the out files
– error, log and out files must have different names• to avoid race conditions
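For the multi-directory C example earlier, this means creating the Initialdir directories before submitting. A sketch (the condor_submit line is commented out because it needs a live Condor pool; condor1.sub is the submit file name used in these slides):

```shell
# Create the Initialdir directories that the submit file references;
# without them, condor_submit's jobs fail with an error.
mkdir -p run_1 run_2
# condor_submit condor1.sub   # requires a running Condor pool
```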
Final Remarks
• Work on data management– but I do not know to what extent it is integrated with Condor– Stork (Data Placement Scheduler):
http://www.cs.wisc.edu/condor/stork– Kangaroo (apparently abandoned):
http://www.cs.wisc.edu/condor/kangaroo– NeST: Network Storage:
http://www.cs.wisc.edu/condor/nest/
Final Remarks
• Work on monitoring– Hawkeye System Monitoring Tool:
http://www.cs.wisc.edu/condor/hawkeye/
Final Remarks
• More information about Condor:http://www.cs.wisc.edu/condor/
• Tutorials– http://www.cs.wisc.edu/condor/CondorWeek2006/
– http://www.cs.wisc.edu/condor/CondorWeek2005/presentations.html
• More information about running Condor:http://www.cs.wisc.edu/condor/manual/v6.4/
Final Remarks
• References:– [1] WRIGHT, Derek. Cheap cycles from the desktop to the
dedicated cluster: combining opportunistic and dedicated scheduling with Condor. In: Conference on Linux Clusters: The HPC Revolution, June, 2001, Champaign - Urbana, IL - USA. http://www.cs.wisc.edu/condor/doc/cheap-cycles.pdf
NMR-Star to ClassAd
• BioMagResBank (http://www.bmrb.wisc.edu)– an international repository for biological NMR (nuclear
magnetic resonance) data
– uses the NMR Self-defining Text Archival and Retrieval (NMR-STAR) format to store its data
• NMR-STAR is characterized by a set of information organized as a hierarchical tree – stored as plain text file
– some may have inconsistencies that are manually verified
NMR-Star to ClassAd
• ClassAds– a simple representation language first used in the
Condor context
• Steps:– conversion of NMR-STAR data to ClassAds format
using starlibj (a Java package)– use it to detect inconsistencies in NMR-STAR files
NMR-Star to ClassAd
• Future work:– Matchmaking as consistency checker
– try to “learn” similarities among NMR data
• Working with R. Kent Wenger from the Condor team of UW-Madison
TALK 1: Condor: Managing Resources in the Biostatistics Department Environment
TALK 2: Using ClassAds to Represent NMR Data
What is Condor? Architecture
• After schedd receives a match for a given job, the schedd enters into a claiming protocol directly with the startd
• Through this protocol, the schedd presents the job ClassAd to the startd and requests temporary control over the resource