PBS is an alternative to our current method of managing machines on the cluster: jobs are managed by a scheduler, in the hope of better resource utilization. It will enable a number of new workflows, since it can handle the submission of a large number of jobs. It should also be harder to overload a machine's CPU or RAM (assuming cooperative participation). We are experimenting with using PBS to manage a portion of the nodes on the cluster (currently some subset of the jude nodes). Since PBS doesn't know about non-PBS usage of a machine, you will not be able to ssh to machines managed by PBS.
See also: PBS FAQ.
The PBS Contract
When using PBS, please try to follow these principles -- PBS breaks if there are too many violations.
- Do not game the system. Be courteous to other users.
- Monitor your jobs. Check in on them periodically on the machines page to make sure they're not overloading machines.
- Watch your memory and CPU usage (i.e. don't use more than you request), and don't run multithreaded programs unless you request the extra cores through nlpsub.
- When possible, try to make your jobs short (1-2h is often a good tradeoff between the overhead of starting jobs and the cost of jobs that run too long). This gives the scheduler greater flexibility.
- If you're running a very long job or an extremely large number of jobs, please use the "preemptable" priority level to allow other people to cut ahead and get more balanced cluster use.
- Remember to log out of your interactive sessions when you're finished with them. Idle interactive sessions will eventually be killed by the cluster autocop script.
Some example commands for submitting jobs:
nlpsub -h or nlpsub --help (print help menu for nlpsub)
nlpsub -i (create an interactive session with the default settings -- long queue, normal priority, single core, 2GB of RAM)
nlpsub -ie (create an interactive session in exclusive mode -- same as above, but will use an entire machine)
nlpsub xyzzy (run the command 'xyzzy' with default settings in batch mode -- verylong queue, normal priority, single core, 2GB of RAM)
nlpsub -c2 xyzzy (run the command 'xyzzy' in batch mode asking for two cores)
nlpsub -m4g xyzzy (run the command 'xyzzy' in batch mode asking for 4GB RAM)
nlpsub -pbackground -qshort xyzzy (run the command 'xyzzy' in batch mode on the "short" queue with "background" priority)
nlpsub -ppreemptable -qverylong xyzzy (for commands that take a very long time to run but can be preempted -- note that jobs will only be preempted if given "preemptable" priority)
nlpsub -dxyzzy-logs xyzzy (run the command 'xyzzy' in batch mode, putting the logs inside the xyzzy-logs/ directory)
nlpsub --tail xyzzy (run 'xyzzy' in batch mode but tail the output as it appears -- note that NFS will cause the output to not be realtime)
Queues and priorities can be abbreviated, so "nlpsub -ppreemptable -qverylong xyzzy" could also be typed as:
nlpsub -pp -qv xyzzy
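One way to picture the abbreviation rule is as unique-prefix matching. The sketch below is a guess at the behaviour, not nlpsub's actual code: expand_abbrev is a hypothetical helper, and the queue and priority lists contain only the names that appear in the examples above (the real lists may be longer).

```shell
# Hypothetical helper (not part of nlpsub): expand an abbreviation to the
# single name it is a prefix of. Prefix matching is an assumption.
expand_abbrev() {
    abbrev=$1
    shift
    match=""
    for name in "$@"; do
        case $name in
            "$abbrev"*)
                # A second match means the abbreviation is ambiguous.
                if [ -n "$match" ]; then
                    echo "ambiguous abbreviation: $abbrev" >&2
                    return 1
                fi
                match=$name
                ;;
        esac
    done
    if [ -z "$match" ]; then
        echo "unknown name: $abbrev" >&2
        return 1
    fi
    echo "$match"
}

# Queue and priority names taken from the examples on this page.
expand_abbrev v short long verylong            # prints: verylong
expand_abbrev p normal background preemptable  # prints: preemptable
```

Under this reading, an abbreviation only works when it picks out exactly one queue or priority, so longer abbreviations are the safer choice in scripts.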
Other critical commands:
showq (display status of the cluster)
qdel <jobnumber> (kill a running job or unqueue a scheduled job)
There are a few commands you should know: nlpsub (submitting jobs), showq (monitoring the cluster), and qdel/qhold/qrls (managing queued and running jobs). All commands can be run from any machine on the cluster.
nlpsub helps you submit jobs to the grid. For those who remember it, it is similar to qqqsub (though it is not a drop-in replacement). nlpsub includes a help menu which you should familiarize yourself with (type nlpsub -h or just nlpsub with no arguments to display it). There are two classes of jobs -- interactive and batch.
Interactive jobs are just like ssh'ing to a machine -- if you type nlpsub -i, nlpsub will put you on a free core. Note that you do not get exclusive use of a machine unless you ask for it (nlpsub -ie).
Batch jobs run outside of a terminal. For these, nlpsub will create a directory inside the current directory to store the stdout/stderr of your command. To run a command on PBS, just type nlpsub [command] [arguments to command]. Note that if you're running a Java command, nlpsub will automatically detect memory use from -Xmx flags. For non-Java commands, you will need to specify memory use yourself (for example, with -m4g).
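To make the -Xmx detection concrete, here is a minimal sketch of how a wrapper could pull the heap size out of a Java command line. It is an illustration under assumptions, not nlpsub's implementation: xmx_of is a hypothetical helper, MyTrainer is a made-up class name, and how nlpsub turns the detected value into a memory request is not documented on this page.

```shell
# Hypothetical helper (not part of nlpsub): print the value of the last
# -Xmx flag found among the arguments, or fail if there is none.
xmx_of() {
    found=""
    for arg in "$@"; do
        case $arg in
            # Strip the "-Xmx" prefix and keep the size, e.g. "4g".
            -Xmx?*) found=${arg#-Xmx} ;;
        esac
    done
    if [ -n "$found" ]; then
        echo "$found"
    else
        return 1
    fi
}

# Only parses the command line; java itself is never started here.
xmx_of java -Xmx4g -cp classes MyTrainer   # prints: 4g
```

A command without an -Xmx flag gives the helper nothing to detect, which is why non-Java commands need an explicit memory request such as nlpsub -m4g xyzzy.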
At present, nlpsub only supports submitting a single command at a time (except for array jobs), so you'll need to run nlpsub once for each command. If there's demand, it will support submitting a whole list of commands at once.
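Until then, a small shell loop can run nlpsub once per command. The following is a sketch, not an official tool: submit_each and the SUBMIT variable are hypothetical names, the input file names are placeholders, and by default the loop only prints the nlpsub invocations instead of submitting anything.

```shell
# SUBMIT defaults to a dry run that only prints each nlpsub invocation.
# Set SUBMIT=nlpsub on the cluster to submit for real.
SUBMIT=${SUBMIT:-echo nlpsub}

# Hypothetical helper: read one command per line from stdin and hand each
# line to nlpsub as a separate job.
submit_each() {
    while IFS= read -r line; do
        # Skip blank lines and comment lines.
        case $line in
            ''|'#'*) continue ;;
        esac
        # Word splitting is intentional (command plus arguments); quoting
        # inside a line is not honoured. </dev/null keeps the submit command
        # from swallowing the remaining lines on stdin.
        $SUBMIT $line </dev/null
    done
}

# Prints "nlpsub xyzzy input1.txt" and "nlpsub xyzzy input2.txt".
printf '%s\n' 'xyzzy input1.txt' '' 'xyzzy input2.txt' | submit_each
```

With a file of commands, the same helper can be fed as submit_each < commands.txt. When submitting many jobs this way, the preemptable priority mentioned in the contract above is the courteous choice.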
qdel, qhold, and qrls operate on PBS job IDs. When you submit a job with nlpsub, nlpsub will report the job's PBS job ID. showq lists these IDs as well.
qdel deletes jobs from the queue and kills running jobs. qhold will put a hold on a job in the queue (thus causing it to not be run). This can be used when you want to let someone else run ahead of you or if you've made a mistake in your job that you'd like to correct first. qrls removes the hold on a queued job.
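The hold-and-release workflow described above might look like the sketch below. The job ID 12345 is a placeholder (use the ID that nlpsub or showq reports), hold_then_release and RUN are hypothetical names, and by default the commands are only printed rather than executed.

```shell
# RUN defaults to "echo", so the commands are printed instead of executed.
# Set RUN= (empty) on the cluster to actually run them.
RUN=${RUN-echo}

# Hypothetical helper: hold a queued job, then release it again.
hold_then_release() {
    jobid=$1
    $RUN qhold "$jobid"   # keep the queued job from being started
    # ... fix your inputs here, or wait for someone else's job to go first ...
    $RUN qrls "$jobid"    # remove the hold so the job can be scheduled again
}

# 12345 is a placeholder job ID.
hold_then_release 12345
```

If the job turns out to be wrong altogether, qdel with the same job ID removes it from the queue instead of releasing it.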