Deceema Basic Training¶
Overview¶
The purpose of these training pages is to get you up and running on Deceema as quickly as possible by providing a series of steps to follow. Their content is therefore structured as a brief overview with pointers to the relevant documentation elsewhere on the Deceema Docs site.
The topics covered are:
Deceema Compute Resources
For information on Deceema compute resources please refer to the System Architecture documentation.
Useful Websites
- Deceema Website: This site is aimed more at the general public, presenting news updates and general information on Deceema partners.
- Deceema Docs: This is the main information site for finding out how to use Deceema.
- Deceema Admin: This site shows information about your account, projects and use of resource.
- Deceema Apps: This site shows all the software/applications available to use on Deceema, with info on how to access them.
Accessing Deceema¶
Please refer to the following sections of the documentation:
First Time Access
Please follow the instructions on first time access carefully. These instructions ensure that your account is correctly set up.
Deceema Apps¶
module load¶
The Deceema Apps website provides details on Deceema’s available applications, including their corresponding module load commands. Alternatively, the module spider command lists the installed modules along with their descriptions.
The module load command is used to load installed applications. Before loading your specific applications the following commands are required:
module purge; module load deceema
module load hype-apps/live
The first line ensures that your job starts with a clean environment. Failure to include this line means that your job will inherit its environment from the shell where you submitted the job to the cluster, which can have unintended consequences.
Apps environment¶
There is one module environment available to all users on Deceema, called hype-apps/live. This “live” environment provides modules that have undergone a period of testing and are considered stable. The various modules’ pages at https://apps.deceema.com include the required module load commands.
Write out the module load procedure for a given version of Python.
module purge; module load deceema
module load hype-apps/live
module load Python/version-GCCcore-version
Explanation: you should always include the first line as stated above, otherwise your job will inherit the environment from the shell where you submitted your batch script, which can then cause issues. Also, whilst not explicitly asked for in the question, you should initially load the “live”, rather than the “test”, versions of modules (see Deceema application environments).
Python Environments¶
For information on using Python virtual environments to install additional Python modules please see the documentation on self-installing Python software.
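As a minimal sketch of that workflow (the directory name demo-venv is purely illustrative), a virtual environment can be created, activated and inspected as follows once a Python module is loaded:

```shell
# Create an isolated virtual environment (the directory name is arbitrary)
python3 -m venv demo-venv

# Activate it; python and pip now resolve inside the environment
. demo-venv/bin/activate

# Confirm that the active interpreter lives inside the environment
venv_prefix=$(python -c 'import sys; print(sys.prefix)')
echo "Interpreter prefix: $venv_prefix"

# pip install <package> would now install into demo-venv only

# Leave the environment and tidy up
deactivate
rm -rf demo-venv
```

Packages installed while the environment is active do not affect the system-wide Python module, which is the point of the exercise.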
Job Submission¶
Further information on job submission can be found within the Deceema job documentation.
Slurm¶
Deceema uses the scheduler Slurm to submit jobs. Slurm has a wide array of features – we’ll look at a few but for more information please see the Slurm website.
The primary method for submitting jobs is by using a batch script. The various Slurm options are passed via header lines that begin with #SBATCH, for example:
#SBATCH --account _projectaccount_ # Only required if you are a member of more than one Deceema project
#SBATCH --qos _qos_ # upon signing-up to Deceema you will be assigned a qos
#SBATCH --time days-hours:minutes:seconds # Time assigned for the simulation
#SBATCH --nodes n # Normally set to 1 unless your job requires multi-node, multi-GPU
#SBATCH --gpus n # Resource allocation on Deceema is primarily based on GPU requirement
#SBATCH --cpus-per-gpu 36 # This number should normally be fixed as "36" to ensure that the system resources are used effectively
#SBATCH --job-name _jobname_ # Title for the job
The --time Option
For convenience, the --time option can be expressed in multiple formats (in addition to the one detailed above):
- 35 – a single numerical value is treated as minutes.
- 1:30 – two colon-separated values are treated as minutes:seconds.
- 3:45:0 – three colon-separated values are treated as hours:minutes:seconds.
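These formats can be sanity-checked with plain bash. For example, a walltime in the days-hours:minutes:seconds form can be converted to total minutes (a sketch; the variable names are illustrative):

```shell
# Convert a Slurm walltime of the form days-hours:minutes:seconds to minutes
walltime="2-05:30:00"
days=${walltime%%-*}                         # text before the dash -> days
IFS=: read -r hours mins secs <<< "${walltime#*-}"
# 10# forces base-10 so zero-padded values like "05" are not read as octal
total_minutes=$(( days * 24 * 60 + 10#$hours * 60 + 10#$mins ))
echo "$walltime is $total_minutes minutes"    # -> 2-05:30:00 is 3210 minutes
```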
Build a submission script requiring 2 GPUs and a wall-time (time-limit) of 2 days, 5 hours and 30 minutes. Please also specify the job name.
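One possible answer, as a sketch: the account and QOS values are the documentation’s placeholders, the job name is illustrative, and the module lines are commented out so the script is inert away from the cluster (the payload is just an echo).

```shell
#!/bin/bash
#SBATCH --account _projectaccount_
#SBATCH --qos _qos_
#SBATCH --job-name two-gpu-test
#SBATCH --time 2-05:30:00
#SBATCH --nodes 1
#SBATCH --gpus 2
#SBATCH --cpus-per-gpu 36

# On the cluster, uncomment the following lines:
# module purge; module load deceema
# module load hype-apps/live

msg="Running with 2 GPUs for 2 days, 5 hours and 30 minutes"
echo "$msg"
```

Note the walltime 2-05:30:00 uses the days-hours:minutes:seconds form described above.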
Monitoring Jobs¶
Slurm provides several commands that you can use to inspect and/or adjust your running jobs. For example you can:
- Find out how many jobs you have running
- See the nodes your jobs are running on
- See how long a job has been running for
- See how much time a job has left
- Cancel running jobs
Info and details on these commands can be found on their help pages:

- squeue – https://slurm.schedmd.com/squeue.html
- scontrol – https://slurm.schedmd.com/scontrol.html
- scancel – https://slurm.schedmd.com/scancel.html
You have submitted 2 jobs, which have Slurm IDs 7123 and 7235.
- What command would you use to see all jobs for your user account?
- What command would you use to inspect only job 7123?
- What command would you use to cancel job 7235?
- squeue
- squeue -j 7123 or scontrol show job 7123
- scancel 7235
Watching Job Progress
When you submit a job to Slurm it creates a file titled slurm-xxxxxx.stats (where “xxxxxx” represents the job’s ID); when the job starts running it creates a further file titled slurm-xxxxxx.out. You can watch the progress of a running job by inspecting the .out file as it is updated.
N.B. whilst the stats file shows useful information such as the amount of CPU, memory and time consumed by a job, it doesn’t show how much GPU resource was used beyond what was requested/allocated.
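A common way to follow the .out file is the standard tail utility; on the cluster you would run tail -f slurm-&lt;jobid&gt;.out. The sketch below demonstrates tail on a throwaway file (in non-follow mode, so it terminates; the file contents are made up):

```shell
# On the cluster: follow a live job's output with
#   tail -f slurm-<jobid>.out      # Ctrl-C stops watching (not the job)
# Local demonstration on a throwaway file, in non-follow mode:
printf 'epoch 1 done\nepoch 2 done\nepoch 3 done\n' > slurm-demo.out
last_lines=$(tail -n 2 slurm-demo.out)   # show only the newest two lines
echo "$last_lines"
rm slurm-demo.out
```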
QOSes¶
Slurm uses QOSes (also referred to as “queues”) to ensure the equitable distribution of resources across the system. Further details can be found in the main documentation on Deceema projects and QOSes.
How can you find information on your available QOSes?
The following are valid methods for querying your available QOSes:
- Look at the confirmation email that you received when your Deceema account was created.
- See your project(s) and their associated QOSes at https://admin.deceema.com
- Execute the my_deceema command in a terminal shell on Deceema.
GPUs¶
General information on Deceema’s GPUs (in addition to info on CPUs, storage etc.) can be found on our system architecture page whilst specific information on the A100 GPUs can be found on the relevant Nvidia pages.
Quick GPU Walk-through¶
This task looks at analysing the effects of GPUs using the routine cudaOpenMP. Further information on CUDA’s library samples can be found on the Getting CUDA Samples page from NVidia’s website.
In order to run the cudaOpenMP command it is first necessary to retrieve and compile the CUDA samples. This process is relatively simple: each version of the CUDA modules on Deceema contains a command to unpack the sample sources into a specified directory, after which we can use the make command to build them. Please follow the preparatory steps outlined below before commencing the tasks:
- Use module spider or query https://apps.deceema.com to determine the available versions of the fosscuda module.
- Write a batch script (named, e.g., samples.sh) to accomplish the following tasks:

  a. Load the required fosscuda (and therefore CUDA) module.
  b. Run the cuda-install-samples-11.1.sh command, passing an argument to specify the installation directory.
  c. Change directory (cd) to where the samples were unpacked.
  d. Run the make command to build the necessary tools.

  Example:

  ```bash
  #!/bin/bash
  #SBATCH --account _projectaccount_
  #SBATCH --qos _userqos_
  #SBATCH --time 0-0:60:0
  #SBATCH --nodes 1
  #SBATCH --gpus 1
  #SBATCH --cpus-per-gpu 36
  set -x
  module purge; module load deceema
  module load hype-apps/live
  module load fosscuda/2020b
  # Run the unpack command to extract the sources into the current working directory
  cuda-install-samples-11.1.sh .
  # Navigate to the sources directory and make using the available resource
  cd NVIDIA_CUDA-11.1_Samples && make -j ${SLURM_CPUS_ON_NODE}
  ```

- Submit the above script to Slurm using the sbatch command. Once the job has completed you will have the cudaOpenMP binary required to run the tasks below: relative to the NVIDIA_CUDA-11.1_Samples directory, the file’s path is ./bin/x86_64/linux/release/cudaOpenMP.
Summary

Write and submit a batch file to run the cudaOpenMP command. It should specify the following details:

- an appropriate account
- an appropriate QOS
- a job-name
- wall-time of 10 minutes
- 1 node
- 2 GPUs

Then answer the following:

- What is your output file?
- Change GPUs to 4. What happens and why?
- Change GPUs to 8. What happens and why?
- Change nodes to 2 (whilst retaining 8 GPUs). What happens and why?
Refer to the Monitoring Jobs section for info on watching job progress and how to read the .out and .stats files.
- Submission script and associated output file:
- The output changes as follows, representing the increase in the reported GPUs and a proportional increase in the host CPUs:

  number of host CPUs: 144
  number of CUDA devices: 4
  0: A100-SXM4-40GB
  1: A100-SXM4-40GB
  2: A100-SXM4-40GB
  3: A100-SXM4-40GB
  ---------------------------
  CPU thread 0 (of 4) uses CUDA device 0
  CPU thread 1 (of 4) uses CUDA device 1
  CPU thread 2 (of 4) uses CUDA device 2
  CPU thread 3 (of 4) uses CUDA device 3
  ---------------------------

- The sbatch command rejects the job with an error: Deceema’s compute nodes each have 4 NVidia A100 GPUs, so by increasing the GPU request to “8” whilst still restricting the job to a single node (with --nodes 1) the job is unable to run within Deceema’s configuration. You should always keep a record of the architecture at hand and the associated maximum number of GPUs and CPUs available on a node.
- See below for the .out and .stats files. The output from cudaOpenMP shows far less resource than was actually requested, as can be seen by comparing it with the stats file (4 GPUs and 144 CPUs versus 8 GPUs, 288 CPUs and 2 nodes). The resource was divided equally between the two nodes, but cudaOpenMP is only designed to operate on a single node and is therefore reporting on half of the allocated resource. This demonstrates the importance of requesting an amount of resource appropriate for the application you are running, ensuring that resource does not sit idle whilst still being allocated.
number of host CPUs: 144
number of CUDA devices: 4
0: A100-SXM4-40GB
1: A100-SXM4-40GB
2: A100-SXM4-40GB
3: A100-SXM4-40GB
---------------------------
CPU thread 0 (of 4) uses CUDA device 0
CPU thread 3 (of 4) uses CUDA device 3
CPU thread 2 (of 4) uses CUDA device 2
CPU thread 1 (of 4) uses CUDA device 1
---------------------------
+--------------------------------------------------------------------------+
| Job on the Deceema cluster:
| Starting at Tue Sep 28 15:00:30 2021 for auser(123456)
| Identity jobid 12345 jobname cudaopenmp.sh
| Running against project ace-project and in partition Deceema-shared
| Requested cpu=288,mem=864G,node=2,billing=288,gres/gpu=8 - 00:10:00 walltime
| Assigned to nodes bask-pg0308u30a,bask-pg0308u31a
| Command /bask/projects/a/ace-project/cudaopenmp.sh
| WorkDir /bask/projects/a/ace-project
+--------------------------------------------------------------------------+
+--------------------------------------------------------------------------+
| Finished at Tue Sep 28 15:00:35 2021 for auser(123456) on the Deceema Cluster
| Required (00:01.942 cputime, 3580K memory used) - 00:00:05 walltime
| JobState COMPLETING - Reason None
| Exitcode 0:0
+--------------------------------------------------------------------------+
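A sketch of a submission script for the summary task above: the account and QOS values are the documentation’s placeholders, the job name is illustrative, and the module and run lines are commented out so the sketch is inert away from the cluster.

```shell
#!/bin/bash
#SBATCH --account _projectaccount_
#SBATCH --qos _qos_
#SBATCH --job-name cudaopenmp
#SBATCH --time 0-0:10:0
#SBATCH --nodes 1
#SBATCH --gpus 2
#SBATCH --cpus-per-gpu 36

# On the cluster, uncomment the following lines:
# module purge; module load deceema
# module load hype-apps/live
# module load fosscuda/2020b
# cd NVIDIA_CUDA-11.1_Samples
# ./bin/x86_64/linux/release/cudaOpenMP

payload="cudaOpenMP job script parsed OK"
echo "$payload"
```

The #SBATCH lines are comments as far as bash is concerned, so the script can be syntax-checked locally before submission with sbatch.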
CUDA routines and samples
Refer to NVidia’s Samples Reference documentation for further info on the CUDA routines and samples.
Interactive Jobs and Deceema Portal¶
Deceema Portal is the recommended method for running interactive jobs on Deceema. For information on other methods of interactive jobs, please refer to the Interactive Jobs section of the documentation.
Transferring Data¶
Our recommended command-line tools for transferring data to/from the Deceema cluster are as follows:
- rsync – see the rsync man page for usage instructions; for general guidance please see the following webpage: https://linuxize.com/post/how-to-use-rsync-for-local-and-remote-data-transfer-and-synchronization
- scp – see the scp man page for usage instructions; for general guidance please see the following webpage: https://linuxize.com/post/how-to-use-scp-command-to-securely-transfer-files

Alternatively, Deceema Portal includes a file management web-interface that can be used to upload and download content from directories to which you have access.
Support¶
Available User Support¶
The stages you should follow when you have an issue are as follows:
- Check the application’s documentation and/or support pages for either the error message or the issue you’re experiencing.
- Consult with your colleagues – they might have had experience with the problem you are facing.
- Contact your site’s Deceema support team.