Difference between revisions of "Machines/PBS FAQ"

From NLPWiki

Latest revision as of 14:39, 12 December 2013

PBS Frequently Asked Questions

Have a question? Discovered something interesting about PBS? Please help keep this page current.

See also PBS Introduction.

What commands should I know about?

  • nlpsub (main mechanism for submitting jobs, see nlpsub --help and nlpsub --help-examples)
  • showq (checking node status, also to a lesser extent: qstat)
  • qdel (to unschedule jobs, cancel running jobs)
  • qstat -f <jobnum> / checkjob <jobnum> (check job status)
  • pbstop (check node status, also can give you details on queued/running jobs)

Extra credit (commands you'll probably use less often)

  • qhold/qrls (to suspend/release jobs you don't want to run just yet)
  • qalter (to change attributes of submitted jobs)
  • pbsnodes (detailed status of all the PBS node machines)
  • diagnose -f/diagnose -p (details of fair share/priority systems)

Why isn't my job running?

Try showq, qstat -f <jobnum>, and checkjob <jobnum> for some details. Note that the jude machines often have less than 16GB available, so they won't be able to run (for example) 4 jobs that each need 4GB of RAM. Additionally, if you ask for resources not available on PBS (too much RAM, etc.), your job will sit in the queue forever with "Deferred" status. If you're still stumped, email the grid czar.
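The memory arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and the 15GB figure are assumptions for the example, not values reported by PBS.

```python
def jobs_that_fit(available_gb: float, request_gb: float) -> int:
    """Return how many jobs with the given memory request fit in the available memory."""
    if request_gb <= 0:
        raise ValueError("request_gb must be positive")
    # Floor division: a job only fits if its whole request is available.
    return max(0, int(available_gb // request_gb))


# Hypothetical node with 15GB free out of a nominal 16GB: only 3 jobs of 4GB fit.
print(jobs_that_fit(15, 4))
# With the full 16GB free, all 4 jobs fit.
print(jobs_that_fit(16, 4))
```

If the fourth job in a batch stays queued while three are running, this kind of shortfall is one possible explanation.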

How come my log files are empty? Is my job really running?

NFS buffering can make it difficult to see your output in real time. As a result, you probably won't be able to tail the output of your job as it runs (i.e. the stdout/stderr files in your nlpsub directories may not grow in real time). If you need this, you should run an interactive job (nlpsub -i). It's also possible that in the future you'll be able to ssh to the machines running your job and inspect the actual output files there.

What queue and priority should I use?

Queues and priorities can be used in any combination. Please choose the shortest queue and the lowest priority level that works for you. The system doesn't work well if everyone submits jobs in the verylong queue with high priority.

Queues

  • short (maximum length: 2 hours)
  • long (maximum length: 1 day)
  • verylong (unlimited)

Shorter jobs get a priority bonus and thus will generally be scheduled sooner.
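As a rough sketch of the "shortest queue that works" rule, the helper below maps an expected runtime to a queue name using the limits listed above. The function and constant names are hypothetical, and "1 day" is taken to mean 24 hours.

```python
# Queue limits in hours, shortest first; None means no limit (verylong).
QUEUE_LIMITS_HOURS = [("short", 2.0), ("long", 24.0), ("verylong", None)]


def shortest_queue(expected_hours: float) -> str:
    """Return the shortest queue whose time limit covers the expected runtime."""
    if expected_hours < 0:
        raise ValueError("expected_hours must be non-negative")
    for name, limit in QUEUE_LIMITS_HOURS:
        if limit is None or expected_hours <= limit:
            return name
    # Not reached: the last queue has no limit.
    return QUEUE_LIMITS_HOURS[-1][0]


print(shortest_queue(0.5))   # short
print(shortest_queue(6))     # long
print(shortest_queue(30))    # verylong
```

In practice it is worth leaving some headroom in the estimate, since a job that runs right up to its queue limit risks being cut off.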

Priority levels

  • high (use for extreme situations on critical jobs only)
  • normal (most jobs are in this category)
  • background (for large numbers of small jobs or less important jobs that should only be run when the grid is free)
  • preemptable (like background, but allows jobs to be suspended by high and normal priority jobs if necessary. This is the nice thing to do.)

Note that there is also a fairshare system in place and frequent PBS usage will naturally lower your priority. The system attempts to equalize the CPU time used by all its participants.

What are some example PBS use cases?

  • standard edit/run/debug cycle (use interactive jobs -- log out before walking away from the keyboard)
  • a ton of short jobs (use short queue, background or preemptable priority)
  • highly parallelized jobs (ask for 4 cores to get a complete machine)
  • a ton of long running jobs (use verylong queue, background or preemptable priority -- note though that this can easily clog up the grid. Please break your jobs up into small pieces (ideally under 2 hours) or run a small number of long running jobs. If you must run a lot of long running jobs, you should use the preemptable priority level.)
  • urgent jobs for a conference deadline (use high priority -- but be courteous to other users who may have the same deadline!)

My job was next in line in qstat but didn't get scheduled. What's up with that?

qstat doesn't know the job scheduling order. Use showq instead. (qstat -f <jobnum> will still give you some details on a job, just not an accurate notion of its scheduling priority.)

What happens if I exceed the (wallclock) time limit on a job?

For now, it will get killed shortly after the timer expires. A more relaxed policy may be possible in the future.

What happens if I exceed the memory limit on a job?

Currently nothing. This may be addressed in the future, but for now PBS itself is unable to accurately monitor memory usage. This means that you should be extra careful not to exceed the amount of memory you requested -- remember that if PBS schedules 4 of your jobs on a 16GB machine and each job requests 4GB but uses 5GB, your jobs will thrash and not make any progress.
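The thrashing scenario is easy to check by hand, and the sketch below just spells out the arithmetic. The function name is hypothetical; the numbers are the ones from the paragraph above.

```python
def oversubscribed(machine_gb: float, per_job_gb: float, num_jobs: int) -> bool:
    """Return True if the jobs together need more memory than the machine has."""
    return per_job_gb * num_jobs > machine_gb


# What was requested: 4 jobs x 4GB = 16GB, which fits on a 16GB machine.
print(oversubscribed(16, 4, 4))  # False
# What is actually used: 4 jobs x 5GB = 20GB, which does not fit.
print(oversubscribed(16, 5, 4))  # True
```

Because the scheduler only sees the requested figure, the second case is not caught automatically, so the request should reflect the job's real peak usage.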

How do I find out if my job is really running?

For all jobs, you can run qstat -f <jobnum>, which lists the CPU time used among its output (this is updated fairly frequently but not in real time). In an interactive job, you could run your job within a screen session and run top in a new window. The machine status web page is also a good indicator.
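If you want to check CPU time from a script, one option is to parse the text that qstat -f prints. The sketch below assumes the CPU time appears on a line of the form "resources_used.cput = HH:MM:SS", as in typical PBS/Torque output; the sample text and job number are made up for illustration, so check the format against your own qstat output.

```python
import re
from typing import Optional


def cpu_time_seconds(qstat_output: str) -> Optional[int]:
    """Extract CPU time in seconds from `qstat -f` text, or None if it is absent."""
    match = re.search(r"resources_used\.cput\s*=\s*(\d+):(\d{2}):(\d{2})", qstat_output)
    if match is None:
        # Jobs that have not started yet have no CPU time to report.
        return None
    hours, minutes, seconds = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Hypothetical excerpt of `qstat -f <jobnum>` output (assumed format).
SAMPLE = """Job Id: 12345.nlp.stanford.edu
    Job_Name = example
    resources_used.cput = 00:12:30
    job_state = R
"""

print(cpu_time_seconds(SAMPLE))  # 750
```

Running this on two snapshots taken a few minutes apart and comparing the values gives a rough sign of whether the job is making progress.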

How do I find out the job name/number for my job?

nlpsub (and qsub) will print out something like "XXX.nlp.stanford.edu". XXX is the job number. Jobs can also have a (non-unique) ASCII name associated with them for better descriptions (pass these with the -n parameter) -- these descriptions cannot be used in commands like qdel, though. Once a job has been run, qstat and showq will display the job numbers. Additionally, batch jobs will include the job number in the qsub.log file in their log directories once they've started running.
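When scripting around job submission, the job number can be pulled out of the printed identifier. This is a minimal sketch assuming the identifier has the form "<number>.<hostname>" with a purely numeric job number; the function name and the example number are hypothetical.

```python
def job_number(job_id: str) -> str:
    """Return the numeric part of an identifier such as '12345.nlp.stanford.edu'."""
    # Everything before the first dot is taken to be the job number.
    number, _, _host = job_id.strip().partition(".")
    if not number.isdigit():
        raise ValueError(f"unexpected job identifier: {job_id!r}")
    return number


print(job_number("12345.nlp.stanford.edu"))  # 12345
```

The returned number is what commands such as qdel and qstat -f expect, unlike the descriptive job name.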