PBS is an alternative to our current method of managing machines on the cluster: jobs are managed by a scheduler, in the hope of better resource utilization. It enables a number of new workflows, since it can handle the submission of a large number of jobs. It should also be harder to overload a machine's CPU or RAM (assuming cooperative participation). We are experimenting with using PBS to manage a portion of the nodes on the cluster (currently some subset of the jude nodes). Since PBS doesn't know about non-PBS usage of a machine, you will not be able to ssh to machines managed by PBS.
See also: PBS FAQ.
The PBS Contract
When using PBS, please follow these principles -- PBS breaks down if there are too many violations.
- Do not game the system. Be courteous to other users.
- Monitor your jobs. Check in on them periodically on the machines page to make sure they're not overloading machines.
- Watch your memory and CPU usage (i.e. don't use more than you request) and don't run multithreaded programs unless you request them in nlpsub.
- When possible, try to make your jobs short (1-2h is often a good tradeoff between the overhead of starting jobs and not having jobs that are too long). This gives the scheduler great flexibility.
- If you're running a very long job or an extremely large number of jobs, please use the "preemptable" priority level to allow other people to cut ahead and get more balanced cluster use.
- Remember to log out of your interactive sessions when you're finished with them. Idle interactive sessions will eventually be killed by the cluster autocop script.
Some example commands for submitting jobs:
nlpsub -h or nlpsub --help (print help menu for nlpsub)
nlpsub -i (create an interactive session with the default settings -- long queue, normal priority, single core, 2GB of RAM)
nlpsub -ie (create an interactive session in exclusive mode -- same as above, but will use an entire machine)
nlpsub xyzzy (run the command 'xyzzy' with default settings in batch mode -- verylong queue, normal priority, single core, 2GB of RAM)
nlpsub -c2 xyzzy (run the command 'xyzzy' in batch mode asking for two cores)
nlpsub -m4g xyzzy (run the command 'xyzzy' in batch mode asking for 4GB RAM)
nlpsub -pbackground -qshort xyzzy (run the command 'xyzzy' in batch mode on the "short" queue with "background" priority)
nlpsub -ppreemptable -qverylong xyzzy (for commands that take a very long time to run but can be preempted -- note that jobs will only be preempted if given "preemptable" priority)
nlpsub -dxyzzy-logs xyzzy (run the command 'xyzzy' in batch mode, putting the logs inside the xyzzy-logs/ directory)
nlpsub --tail xyzzy (run 'xyzzy' in batch mode but tail the output as it appears -- note that NFS delays mean the output will not be realtime)
Queues and priorities can be abbreviated, so "nlpsub -ppreemptable -qverylong xyzzy" could also be typed as:
nlpsub -pp -qv xyzzy
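The abbreviation rule above behaves like unique-prefix matching. The following is a minimal sketch of that idea, not nlpsub's actual implementation: the function name expand is hypothetical, and the lists contain only the queue and priority names that appear on this page (the real cluster may define others).

```python
# Queue and priority names taken from the examples on this page (assumed complete).
QUEUES = ["short", "long", "verylong"]
PRIORITIES = ["high", "normal", "background", "preemptable"]


def expand(abbrev, choices):
    """Return the single choice that starts with abbrev, or raise ValueError."""
    if abbrev in choices:  # an exact name always wins
        return abbrev
    matches = [c for c in choices if c.startswith(abbrev)]
    if len(matches) != 1:
        raise ValueError(
            f"{abbrev!r} matches {matches or 'nothing'}; expected exactly one of {choices}"
        )
    return matches[0]


print(expand("p", PRIORITIES))  # preemptable
print(expand("v", QUEUES))      # verylong
```

Under this reading, an abbreviation is only safe when it matches exactly one name, which is why "p" works for priorities and "v" works for queues.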
Other critical commands:
showq (display status of the cluster)
qdel <jobnumber> (kill a running job or unqueue a scheduled job)
There are a few commands you should know: nlpsub (submitting jobs), showq (monitoring the cluster), and qdel/qhold/qrls (managing running jobs). All commands can be run from any machine on the cluster.
nlpsub helps you submit jobs to the grid. For those who remember it, it is similar to qqqsub (though it is not a drop-in replacement). nlpsub includes a help menu which you should familiarize yourself with (type nlpsub -h or just nlpsub with no arguments to display it). There are two classes of jobs -- interactive and batch. Interactive jobs are just like ssh'ing to a machine -- if you type nlpsub -i, nlpsub will put you on a free core. Note that you do not get exclusive use of a machine unless you ask for it. Batch jobs run outside of a terminal. For these, nlpsub will create a directory inside the current directory to store the stdout/stderr of your command. To run a command on PBS, just type nlpsub [command] [arguments to command]. Note that if you're running a Java command, nlpsub will automatically detect memory use from -Xmx flags. For non-Java commands, you will need to specify memory use yourself (e.g. with -m4g).
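To illustrate what detecting memory use from -Xmx flags could look like, here is a rough sketch. It is not nlpsub's actual code: the names XMX_RE, UNIT_BYTES and max_heap_bytes are hypothetical, and taking the largest value when several flags are present is an assumption made to stay conservative.

```python
import re

# Matches JVM heap flags such as -Xmx512m or -Xmx4g (optional k/m/g suffix).
XMX_RE = re.compile(r"^-Xmx(\d+)([kKmMgG]?)$")
UNIT_BYTES = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def max_heap_bytes(argv):
    """Return the largest -Xmx value found in argv, in bytes, or None if absent."""
    sizes = []
    for arg in argv:
        match = XMX_RE.match(arg)
        if match:
            sizes.append(int(match.group(1)) * UNIT_BYTES[match.group(2).lower()])
    return max(sizes) if sizes else None


# A Java command carries its own memory hint; a plain command does not.
print(max_heap_bytes(["java", "-Xmx4g", "-cp", "x.jar", "Main"]))  # 4294967296
print(max_heap_bytes(["xyzzy"]))                                   # None
```

The None case corresponds to the non-Java situation described above, where you have to state the memory requirement yourself.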
At present, nlpsub only supports submitting a single command at a time (except for array jobs), so you'll need to run nlpsub once per command. If there's demand, support for submitting a whole list of commands at once may be added.
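Since each command needs its own nlpsub call, a small wrapper can do the looping. The sketch below uses only flags shown on this page (-c, -m, -p, -q, -d); the helper names nlpsub_argv and submit_all are hypothetical. It defaults to a dry run that only returns the command lines, so you can check them before anything is submitted.

```python
import shlex
import subprocess


def nlpsub_argv(command, cores=None, memory=None, priority=None, queue=None, log_dir=None):
    """Build an nlpsub argument list using only flags shown on this page."""
    argv = ["nlpsub"]
    if cores is not None:
        argv.append(f"-c{cores}")
    if memory is not None:
        argv.append(f"-m{memory}")
    if priority is not None:
        argv.append(f"-p{priority}")
    if queue is not None:
        argv.append(f"-q{queue}")
    if log_dir is not None:
        argv.append(f"-d{log_dir}")
    argv.extend(shlex.split(command))  # the command and its own arguments come last
    return argv


def submit_all(commands, dry_run=True, **options):
    """Submit each command separately; with dry_run, only return the command lines."""
    submitted = []
    for command in commands:
        argv = nlpsub_argv(command, **options)
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failed submission
        submitted.append(argv)
    return submitted


for argv in submit_all(["xyzzy a.txt", "xyzzy b.txt"], priority="preemptable", queue="verylong"):
    print(" ".join(argv))
```

Running a dry run on the first couple of commands before submitting the whole batch fits the advice elsewhere on this page about testing a few jobs first.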
qdel, qhold and qrls operate on PBS job IDs. When you submit a job with nlpsub, nlpsub will report the job's PBS job ID. showq lists job IDs as well.
qdel deletes jobs from the queue and kills running jobs. qhold will put a hold on a job in the queue (thus causing it to not be run). This can be used when you want to let someone else run ahead of you or if you've made a mistake in your job that you'd like to correct first. qrls removes the hold on a queued job.
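When you need to hold, release or delete many jobs at once (for example, to let someone else run ahead of you), it can help to generate the command lines from a list of job IDs. This is only a sketch: the helper name job_commands and the action labels are hypothetical, and it issues one command per job ID to keep failures easy to trace.

```python
# Map a readable action name to the PBS command described above.
ACTIONS = {"delete": "qdel", "hold": "qhold", "release": "qrls"}


def job_commands(action, job_ids):
    """Return one command line (as an argument list) per job ID."""
    tool = ACTIONS[action]  # raises KeyError for an unknown action
    return [[tool, str(job_id)] for job_id in job_ids]


for argv in job_commands("hold", [101, 102]):
    print(" ".join(argv))
```

Each argument list can be passed to subprocess.run once you have checked that the job IDs are the ones showq reports for you.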
Debugging and diagnostics
To get more information about a specific job, try
qstat -f <jobname>
Other useful diagnostics:
- "diagnose -p" will give you details about the priorities of jobs waiting in the queue.
- "diagnose -f" will give you details on everyone's PBS usage and how fair share scores are determined.
- Adding --debug and/or -v to nlpsub will make it print the command line it is using to submit your job -- see if you can run your script file locally.
In addition to queues, priority levels (-p in nlpsub) and "fair share" influence scheduling order. There are four priority levels:
- high (runs first)
- normal (the default)
- background
- preemptable (runs last, can be suspended by high and normal jobs)
See PBS FAQ for a summary of when you should use which. The "fair share" component of priority calculates how much you've used PBS in the past two weeks (with a decaying scale). PBS aims to equalize CPU time usage across all users so heavy users will get lower job priorities. Note that "social scheduling" (i.e. emailing other users and negotiating resources) is still a viable option -- as needed, users may put holds on their jobs or cancel/unqueue them to make way for other users. If users are non-responsive to social scheduling, you may email the grid czar(s).
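To give a feel for what a decaying scale means for fair share, here is a toy calculation. The formula, the per-day granularity and the decay factor of 0.8 are all assumptions chosen for illustration; the scheduler's real fair share computation may differ, and "diagnose -f" remains the authoritative view.

```python
def decayed_usage(daily_cpu_hours, decay=0.8):
    """Weighted sum of daily usage, most recent day first; older days count less."""
    return sum(hours * decay ** age for age, hours in enumerate(daily_cpu_hours))


# 100 CPU-hours used today weigh far more than 100 CPU-hours used two weeks ago.
recent = [100] + [0] * 13
old = [0] * 13 + [100]
print(decayed_usage(recent) > decayed_usage(old))  # True
```

The practical consequence is that heavy use lowers your priority most right after it happens, and the effect fades as the usage ages out of the two-week window.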
- To have PBS email you about events in the life of a job, use the -M and --mail-user options. -M takes the values b (job begins execution), e (job terminates normally) and a (job is aborted by the batch system). The default email address is <username>@stanford.edu. To send mail to a different address, pass it with --mail-user (for example, email@example.com).
- To overwrite the contents of the log directory (by default PBS appends to the logs), use --clobber.
Tips for running jobs on the cluster
- When running large batches of jobs, test a couple first to make sure they're correct. Running jobs in batches amplifies your errors (including the really bad ones which fill up disks, overload machines, etc.)
- Design your programs to be more parallelizable -- sometimes this is very hard, but often it is easy to make them operate on a subset of the input files (for data-independent problems)
- Avoid running daemons as part of your jobs. Daemons are no longer part of your process subtree which means that PBS won't kill them properly with qdel.
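For the tip about making programs operate on a subset of the input files, one simple pattern is to give every job the same file list plus its own index. The sketch below assumes a hypothetical helper named shard; sorting first ensures every job sees the files in the same order regardless of how the list was produced.

```python
def shard(items, index, total):
    """Return the items handled by job number index out of total jobs."""
    if not 0 <= index < total:
        raise ValueError("index must satisfy 0 <= index < total")
    return sorted(items)[index::total]  # every total-th item, starting at index


files = ["c.txt", "a.txt", "b.txt", "d.txt", "e.txt"]
print(shard(files, 0, 2))  # ['a.txt', 'c.txt', 'e.txt']
print(shard(files, 1, 2))  # ['b.txt', 'd.txt']
```

Each job would then be submitted with its own index, and together the shards cover every file exactly once. This only applies to data-independent problems, as noted above.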
Questions or suggestions?
Please tell the grid czar(s) if you have any feedback or observe any bugs. Also, if this page seems out of date, please update it or let the grid czar(s) know.
- Support code (including nlpsub) lives in /u/nlp/packages/nlpbs/. It's all in Python and should be easy to extend. It lives in SVN under trunk/nonjavanlp/projects/nlpbs if you'd like to help out.