What’s going on: checking the status of projects and jobs on NeSI (and other places)
It can be important to keep track of what you are doing on the server: how many CPU core hours you have used and how many you have left, and how much disk space remains in your project or home folder. Here are a few commands to help you keep track, along with links to more information on NeSI.
CPU usage
NeSI provides a command to track how many core hours you have used over the course of each project. Note that this is not just the number of hours a job runs: the run time is multiplied by the number of CPUs used for that job. The amount of working memory (RAM) may also be figured into that calculation. To keep it straight, use the following command.
In this example, we are looking at logical CPU core use for the project nesi00420:
nn_corehour_usage -l nesi00420
To view the Fair Share adjusted CPU core use (see this link for details on Fair Share and here for job prioritisation), use -f instead of -l:
nn_corehour_usage -f nesi00420
And to see only two months of usage, use the -n option:
nn_corehour_usage -l nesi00420 -n 2
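To make the core-hour arithmetic described above concrete, here is a minimal sketch. The function name core_hours is hypothetical (it is not a NeSI tool), and it only multiplies CPUs by wall-clock hours; it does not attempt to reproduce any memory weighting or Fair Share adjustment that NeSI may apply.

```shell
# Hypothetical helper, not a NeSI command.
# Rough estimate: core hours = number of CPUs x wall-clock hours.
# Ignores any RAM weighting or Fair Share adjustment.
core_hours() {
    cpus="$1"     # CPUs requested for the job
    hours="$2"    # wall-clock run time in hours
    awk -v c="$cpus" -v h="$hours" 'BEGIN { printf "%.1f\n", c * h }'
}

# An 8-CPU job that ran for 2.5 hours
core_hours 8 2.5   # prints 20.0
```

So a job that looks short on the clock can still use a sizeable share of an allocation if it runs on many CPUs.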
Storage quota
Your home folder and each project folder have storage limits, for both the total size and the number of files. This link gives an overview of NeSI system quotas. Running the following command will help you find where you are running out of space.
nn_storage_quota
(just like that, no other options)
Dini at NeSI reminded me about two other useful commands for your home folder (which has a low storage limit). One to show the folders with the most files (in order):
find . -printf "%h\n" | cut -d/ -f-2 | sort | uniq -c | sort -rn
And the other to list the ten largest files and directories (largest first):
du -a | sort -n -r | head -n 10
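The du output above is in raw blocks, which can be hard to read. As a sketch of a human-readable variant, the wrapper below assumes GNU coreutils (the -h options of du and sort); the function name largest_items is hypothetical and only there for illustration.

```shell
# Hypothetical wrapper, assuming GNU du and GNU sort are available.
# $1: directory to scan (default: current directory)
# $2: number of entries to show (default: 10)
largest_items() {
    dir="${1:-.}"
    count="${2:-10}"
    # du -a: list files as well as directories; -h: human-readable sizes
    # sort -r -h: largest first, comparing sizes such as 4.0K, 12M, 3G
    du -ah "$dir" | sort -rh | head -n "$count"
}

# Example: largest_items "$HOME" 10
```

The first line of the output is usually the scanned directory itself, since its total includes everything beneath it.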
Checking status of active or submitted jobs
This is likely one of the first Slurm commands you learn (after submitting a job with sbatch), but as a reminder, here is how to check your job’s status:
squeue -u user.name
The other important command for active jobs is the scancel command, which will cancel a job:
scancel 12345678
This will cancel the job with id 12345678 (check the status with squeue to get the job id).
You can also check on recently run jobs with the sacct command. Here is a link for all job-related commands and other useful information.
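As a sketch of how sacct might be used for this, the wrapper below lists your own jobs since a given date with a handful of columns. The function name recent_jobs is hypothetical, and the choice of columns is only an assumption about what is useful; the sacct options themselves (-u, --starttime, --format) are standard Slurm options.

```shell
# Hypothetical wrapper around sacct; requires Slurm to be installed.
# $1: start date, e.g. 2024-01-01
recent_jobs() {
    start="$1"
    user="${USER:-$(id -un)}"   # fall back to id if USER is unset
    # Columns are a suggestion; adjust --format to taste
    sacct -u "$user" --starttime "$start" \
          --format=JobID,JobName,State,Elapsed,MaxRSS
}

# Example: recent_jobs 2024-01-01
```

Columns such as Elapsed and MaxRSS are handy for checking whether the time and memory you requested were close to what the job actually used.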