torque maui user handout


Queue systems
and how to use TORQUE & Maui
Piero Calucci
Scuola Internazionale Superiore di Studi Avanzati
Trieste
November 2008
Advanced School in High Performance and Grid Computing

Outline
1. The Problem We Are Trying to Solve
2. Using the Resource Manager
3. Understanding Resource Management

The User's Problem
- have dedicated resources
  multitasking is Bad for HPC
- have resources as soon as possible
  you need to have your computation done by next week, right?
- have jobs run unattended and results delivered back to you
  what do you want to do at 4:30 AM?

The Admin's Problem
- minimize resource waste
- promote fair share of resources
  a.k.a. «avoid complaints from users»
- monitor and account for everything

The Resource Manager
At the core of a batch system there is a resource manager (RM) that:
- accepts job submissions from users
- tracks resource usage
- delivers jobs to execution nodes
- informs users about job status

The TORQUE Resource Manager
The Terascale Open-source Resource and QUEue manager (TORQUE) is deployed as
- a server component (pbs_server) on the master node
- an execution mini-server (pbs_mom) on each execution node
There is also a scheduler component, but we will use the Maui Scheduler instead (more on this later).

A Job's Life
1. a job is a shell script that contains a description of the resources needed and the commands you want to execute
2. you submit the job to the batch system
3. the batch system sends the job to an execution queue where it is executed without human intervention
4. job results are then delivered back to you

Job Must Be a Shell Script
A job script contains a description of the resources you request and all the commands your job needs to perform.
Resource description always comes at the beginning of the script and is identified by the #PBS mark.
    #!/bin/sh
    #PBS -l walltime=1:00:00
    #PBS -l nodes=1:ppn=2
    #PBS -N MyTestJob
    do_something_useful && \
    do_more || \
    do_something_else
    exit $?
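As a hedged illustration of the script layout described here, the sketch below writes the example job script to a file, lists its #PBS lines, and checks its syntax with `sh -n` without running it. The function names and the default file name `job.sh` are made up for this example, and nothing in it talks to the batch system.

```shell
# Write the example job script to the file named by $1 (default: job.sh).
# do_something_useful, do_more and do_something_else are the handout's placeholders.
write_example_job() {
    cat > "${1:-job.sh}" <<'EOF'
#!/bin/sh
#PBS -l walltime=1:00:00
#PBS -l nodes=1:ppn=2
#PBS -N MyTestJob
do_something_useful && \
do_more || \
do_something_else
exit $?
EOF
}

# Print the resource description: every line that starts with the #PBS mark.
list_pbs_directives() {
    grep '^#PBS' "${1:-job.sh}"
}

# Parse the script without executing it; a non-zero exit means a syntax error.
check_job_syntax() {
    sh -n "${1:-job.sh}"
}
```

Running `check_job_syntax job.sh` before submission can catch a typo before the job has spent time waiting in a queue.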

Job Submission
Jobs are submitted to the batch system by means of the qsub command, as in
    qsub job.sh
But you can also add resource descriptions directly on the command line:
    qsub -l nodes=4:ppn=4 job.sh
This is especially useful when you are experimenting with subtle variations of a job submission.
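One way to experiment with variations is to generate the command lines first and inspect them before submitting anything. The following is a dry-run sketch: the function name, the `scale_N` job names and the fixed `ppn=4` are assumptions chosen for illustration, and the function only prints commands.

```shell
# Print one qsub command line per node count, without submitting anything.
# $1 is the job script; the remaining arguments are the node counts to try.
print_submissions() {
    script=$1
    shift
    for nodes in "$@"; do
        echo "qsub -l nodes=${nodes}:ppn=4 -N scale_${nodes} ${script}"
    done
}

print_submissions job.sh 1 2 4
```

Once the printed lines look right, they can be piped to `sh` on a machine where qsub is installed.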

Queues
Batch systems are usually configured with multiple queues.
Each queue can be configured to accept jobs from a certain group of users, or within specified resource limits, or simply on request from the user.
Be sure to select the right queue for your jobs.
Queue selection is performed with -q queuename on the qsub command line or with #PBS -q queuename in the job script.

Simple Resource Specification
-l nodes=n            request n execution nodes
-l nodes=n:ppn=m      request n execution nodes with m CPUs each
-l walltime=n         request n seconds of wallclock time
                      (walltime can also be specified as hours:minutes:seconds)
-l nodes=n:feature    request n nodes with feature
                      (e.g. we use :myri for nodes with Myrinet cards)
-q name               submit job to named queue
-N name               give job a name
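The two walltime notations can be converted into each other with a small helper. This is a sketch that assumes the colon-separated form is read right to left as seconds, minutes, hours, with the leading fields optional (so 5:00 would mean five minutes); `hms_to_seconds` is a name invented for this example.

```shell
# Convert a walltime such as 1:00:00, 5:00 or 90 into plain seconds.
# awk is used so that fields with leading zeros (e.g. 08) are read as decimal.
hms_to_seconds() {
    echo "$1" | awk -F: '{
        total = 0
        for (i = 1; i <= NF; i++) total = total * 60 + $i
        print total
    }'
}

hms_to_seconds 1:00:00    # prints 3600
```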

Interactive Jobs
If resources are available right now you can run interactive jobs with qsub -I.
In an interactive job you are given a shell on a computing node and are allowed to execute all your computation interactively, possibly on several nodes.
    master $ qsub -I -q smp -l walltime=5:00 -l nodes=1:ppn=2
    qsub: job 29506.cerbero.hpc.sissa.it ready
    a211 $

(No) Access to Computing Nodes
A common configuration on mid-sized to large clusters is:
- no «normal» user access to computing nodes
- access permissions are created on the fly by the RM when (and where) needed for your job to run
- while a job is running you are granted interactive access to nodes allocated to your job
- at job completion access rights are cleared

Node Access and Resource Limit Enforcement
- access right is granted only to nodes allocated to your job
  this enforces the limit on the number of nodes you can access and guarantees that no concurrent usage of a resource is possible
- access right is granted only for the walltime allocated to your job
  when your allocated walltime expires, you are given a short grace time, then all your processes on the computing node are killed
- you should arrange so that your job completes before the walltime limit, or save partial results before the job is killed
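One possible way to save partial results is to trap the termination signal inside the job script. The sketch below assumes that the batch system sends SIGTERM to the job script at the start of the grace time, before the final kill; this is common but should be confirmed with your site administrators. The file name `partial_results.txt`, the function names and the `sleep 1` work step are placeholders for illustration.

```shell
# Hypothetical checkpoint file, used only for this illustration.
CHECKPOINT=${CHECKPOINT:-partial_results.txt}

# Signal handler: record how far the computation got, then stop.
save_partial_results() {
    echo "completed_steps=$step" > "$CHECKPOINT"
    exit 1    # non-zero: the run was cut short
}

# Run $1 work steps, saving progress if SIGTERM arrives in the meantime.
run_steps() {
    trap save_partial_results TERM
    step=0
    while [ "$step" -lt "$1" ]; do
        step=$((step + 1))
        sleep 1    # stand-in for one unit of real work
    done
    echo "completed_steps=$step" > "$CHECKPOINT"
}
```

The shell runs the handler only after the current foreground command returns, so each work step should be short compared with the grace time.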

Queue Status
qstat                query queue status
qstat -a             alternate form
qstat -r             show only running jobs
qstat -rn            only running jobs, w/ list of allocated nodes
qstat -i             only idle jobs
qstat -u username    show jobs for named user

Job Trace
tracejob id          show what happened today to job id
tracejob -n d id     search last d days
Searching the RM logs is a time-consuming operation, don't abuse it!
    $ tracejob 29506
    Job: 29506.cerbero.hpc.sissa.it
    02/26/2007 10:12:39 S Job Queued at request of
    cxxx@cerbero [...] job name = STDIN, queue = em64ts
    ...
    02/26/2007 10:12:40 S Job Run at request of
    maui@cerbero
    ...
    02/26/2007 10:19:36 S Exit_status=265
    resources_used.cput=00:00:00
    resources_used.mem=2940kb resources_used.vmem=89532kb
    resources_used.walltime=00:06:51
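The key=value fields in a trace like this one can be picked out with standard text tools. The sketch below is based only on the sample output shown here; the function names are made up. The interpretation of exit statuses in `describe_exit_status` is an assumption that the handout does not state: values of 256 or more are read as "killed by signal (value minus 256)" and negative values as "the job could not be started".

```shell
# Read tracejob output on stdin and print the value of one key=value field,
# e.g. Exit_status or resources_used.walltime (the last occurrence wins).
tracejob_field() {
    tr -s ' \t' '\n\n' | awk -F= -v key="$1" '
        $1 == key { value = $2 }
        END { if (value != "") print value }'
}

# ASSUMPTION, not from the handout: >= 256 means killed by signal (value - 256),
# a negative value means the job never started.
describe_exit_status() {
    if [ "$1" -ge 256 ]; then
        echo "killed by signal $(( $1 - 256 ))"
    elif [ "$1" -lt 0 ]; then
        echo "job could not be started (code $1)"
    else
        echo "script exited with status $1"
    fi
}
```

A possible use is `tracejob 29506 | tracejob_field Exit_status`, followed by `describe_exit_status` on the result.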

The Scheduler
The Maui Scheduler prioritizes jobs in the idle queue, according to admin-defined policies. The highest-priority job is run as soon as resources are available.
Jobs can be blocked if their requirements exceed available resources.
Blocked jobs have an undefined priority.
Job priorities are recomputed at each scheduler iteration, so your job can move up and down the idle queue as an effect of resource usage by other jobs of yours.

Queues as Seen by Maui
    $ showq
    ACTIVE JOBS-------------
    JOBNAME  USERNAME  STATE      PROC  REMAINING   STARTTIME
    29199    axxxx     Running      32  1:59:17     Wed ...
    29055    sxxxxxxx  Running       8  4:03:07     Tue ...
    28496    mxxxxxxx  Running       4  5:24:00     Sat ...
    ...
    27 Active Jobs   125 of 142 Processors Active (88.03%)
                     52 of 58 Nodes Active (89.66%)
    IDLE JOBS---------------
    JOBNAME  USERNAME  STATE      PROC  WCLIMIT     QUEUETIME
    29069    sxxxx     Idle          4  1:21:00:00  Mon Feb 19 ...
    29019    kxxxxxxx  Idle          4  4:00:00:00  Mon Feb 19 ...
    29076    fxxxxxx   Idle          4  4:00:00:00  Mon Feb 19 ...
    ...
    22 Idle Jobs
    BLOCKED JOBS-----------
    JOBNAME  USERNAME  STATE      PROC  WCLIMIT     QUEUETIME
    28777    rxxxxxxx  Hold          8  2:00:00:00  Thu ...
    28892    dxxxxxxx  BatchHold     4  4:00:00:00  Sat ...
    29025    axxxx     Idle          4  4:00:00:00  Mon ...

The Backfill Window
            node 1  node 2  node 3
    0:00    job1    job1    job3
    1:00    job1    job1    job3
    2:00    job2    job2    job2
- job2 cannot run until job1 is done
- if you submit a job3 that requires only one node for two hours or less you can run before job2!
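The condition in the last point can be written as a simple check. This is a simplified sketch of the idea only, not of Maui's actual backfill policy: it assumes the window is fully described by a number of free nodes and a duration in seconds, and `fits_backfill` is a name invented for the example.

```shell
# Succeed (exit status 0) when a job fits in the backfill window.
# $1 free nodes, $2 window length in seconds,
# $3 nodes requested, $4 walltime requested in seconds.
fits_backfill() {
    [ "$3" -le "$1" ] && [ "$4" -le "$2" ]
}

# The slide's example: one node is free for two hours (7200 s) until job2 starts.
if fits_backfill 1 7200 1 7200; then
    echo "job3 can start now"
fi
```

A job asking for two nodes, or for more than two hours, fails the check and would have to wait behind job2.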

Discovering Free Resources
The showbf command queries the scheduler and displays resources that are available for immediate use.
showbf             summary of free resources
showbf -f myri     select only nodes with a given feature
showbf -p intel    select only nodes in a given partition
    $ showbf
    backfill window (user: cxxx group: bxxx partition: ALL) Mon Feb 26 13:46:16
    5 procs available with no timelimit
    $ showbf -f myri
    backfill window (user: cxxx group: bxxx partition: ALL) Mon Feb 26 13:49:16
    no procs available
    $ showbf -p intel
    backfill window (user: cxxx group: bxxx partition: intel) Mon Feb 26 13:51:16
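For use in scripts, the number of free processors can be extracted from this output. The sketch below is written against the sample lines shown here only ("N procs available ..." and "no procs available"); the output of showbf on other installations may differ, and `showbf_free_procs` is a made-up name.

```shell
# Read showbf output on stdin and print the processor count from the first
# "... procs available" line; "no procs available" is reported as 0.
showbf_free_procs() {
    awk '/procs available/ {
        if ($1 == "no") print 0; else print $1 + 0
        exit
    }'
}
```

A possible use is `showbf -f myri | showbf_free_procs`, for example to decide how many processors to request in the next submission.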