Deceema Basic Training¶
Overview¶
The purpose of these training pages is to get you up and running on Deceema as quickly as possible by providing a series of steps to follow. Their content is therefore structured as a brief overview with pointers to the relevant documentation elsewhere on the Deceema Docs site.
The topics covered are:
Deceema Compute Resources
For information on Deceema compute resources please refer to the System Architecture documentation.
Useful Websites
- Deceema Website: This site is aimed more at the general public, presenting news updates and general information on Deceema partners.
- Deceema Docs: This is the main information site for finding out how to use Deceema.
- Deceema Admin: This site shows information about your account, projects and use of resource.
- Deceema Apps: This site shows all the software/applications available to use on Deceema, with info on how to access them.
Accessing Deceema¶
Please refer to the following sections of the documentation:
First Time Access
Please follow the instructions on first time access carefully. These instructions ensure that your account is correctly set up.
Deceema Apps¶
module load¶
The Deceema Apps website provides details on Deceema’s available applications, including their corresponding module load commands. Alternatively, the module spider command lists the installed modules along with their descriptions.
The module load command is used to load installed applications. Before loading your specific applications the following commands are required:
module purge; module load deceema
module load hype-apps/live
The first line ensures that your job starts with a clean environment. Failure to include this line means that your job will inherit its environment from the shell where you submitted the job to the cluster, which can have unintended consequences.
Apps environment¶
There is one module environment available to all users on Deceema, called hype-apps/live. This “live” environment provides modules that have undergone a period of testing and are considered stable. The various modules’ pages at https://apps.deceema.com include the required module load commands.
Write out the module load procedure for a given version of Python.
module purge; module load deceema
module load hype-apps/live
module load Python/version-GCCcore-version
Explanation: you should always include the first line as stated above, otherwise your job will inherit the environment from the shell where you submitted your batch script, which can then cause issues. Also, whilst not explicitly asked for in the question, you should initially load the “live”, rather than the “test”, versions of modules (see Deceema application environments).
Python Environments¶
For information on using Python virtual environments to install additional Python modules please see the documentation on self-installing Python software.
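As a minimal sketch of that workflow (the directory name demo-venv is purely illustrative), a virtual environment can be created, activated and inspected as follows once a Python module is loaded:

```shell
# Create an isolated virtual environment (the directory name is arbitrary)
python3 -m venv demo-venv

# Activate it; python and pip now resolve inside the environment
. demo-venv/bin/activate

# Confirm that the active interpreter lives inside the environment
venv_prefix=$(python -c 'import sys; print(sys.prefix)')
echo "Interpreter prefix: $venv_prefix"

# pip install <package> would now install into demo-venv only

# Leave the environment and tidy up
deactivate
rm -rf demo-venv
```

Packages installed while the environment is active do not affect the system-wide Python module, which is the point of the exercise.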
Job Submission¶
Further information on job submission can be found within the Deceema job documentation.
Slurm¶
Deceema uses the scheduler Slurm to submit jobs. Slurm has a wide array of features – we’ll look at a few but for more information please see the Slurm website.
The primary method for submitting jobs is by using a batch script. The various Slurm options are passed via header lines that begin with #SBATCH, for example:
#SBATCH --account _projectaccount_ # Only required if you are a member of more than one Deceema project
#SBATCH --qos _qos_ # upon signing-up to Deceema you will be assigned a qos
#SBATCH --time days-hours:minutes:seconds # Time assigned for the simulation
#SBATCH --nodes n # Normally set to 1 unless your job requires multi-node, multi-GPU
#SBATCH --gpus n # Resource allocation on Deceema is primarily based on GPU requirement
#SBATCH --cpus-per-gpu 36 # This number should normally be fixed as "36" to ensure that the system resources are used effectively
#SBATCH --job-name _jobname_ # Title for the job
The --time Option
For convenience, the --time option can be expressed in multiple formats (in addition to the one detailed above):
- 35 – a single numerical value is treated as minutes.
- 1:30 – two colon-separated values are treated as minutes:seconds.
- 3:45:0 – three colon-separated values are treated as hours:minutes:seconds.
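These formats can be sanity-checked with plain bash. For example, a walltime in the days-hours:minutes:seconds form can be converted to total minutes (a sketch; the variable names are illustrative):

```shell
# Convert a Slurm walltime of the form days-hours:minutes:seconds to minutes
walltime="2-05:30:00"
days=${walltime%%-*}                         # text before the dash -> days
IFS=: read -r hours mins secs <<< "${walltime#*-}"
# 10# forces base-10 so zero-padded values like "05" are not read as octal
total_minutes=$(( days * 24 * 60 + 10#$hours * 60 + 10#$mins ))
echo "$walltime is $total_minutes minutes"    # -> 2-05:30:00 is 3210 minutes
```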
Build a submission script requiring 2 GPUs and a wall-time (time-limit) of 2 days, 5 hours and 30 minutes. Please also specify the job name.
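One possible answer, as a sketch: the account and QOS values are the documentation’s placeholders, the job name is illustrative, and the module lines are commented out so the script is inert away from the cluster (the payload is just an echo).

```shell
#!/bin/bash
#SBATCH --account _projectaccount_
#SBATCH --qos _qos_
#SBATCH --job-name two-gpu-test
#SBATCH --time 2-05:30:00
#SBATCH --nodes 1
#SBATCH --gpus 2
#SBATCH --cpus-per-gpu 36

# On the cluster, uncomment the following lines:
# module purge; module load deceema
# module load hype-apps/live

msg="Running with 2 GPUs for 2 days, 5 hours and 30 minutes"
echo "$msg"
```

Note the walltime 2-05:30:00 uses the days-hours:minutes:seconds form described above.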
Monitoring Jobs¶
Slurm provides several commands that you can use to inspect and/or adjust your running jobs. For example you can:
- Find out how many jobs you have running
- See the nodes your jobs are running on
- See how long a job has been running for
- See how much time a job has left
- Cancel running jobs
Info and details on these commands can be found on their help pages:

- squeue – https://slurm.schedmd.com/squeue.html
- scontrol – https://slurm.schedmd.com/scontrol.html
- scancel – https://slurm.schedmd.com/scancel.html
You have submitted 2 jobs, which have Slurm IDs 7123 and 7235.
- What command would you use to see all jobs for your user account?
- What command would you use to inspect only job 7123?
- What command would you use to cancel job 7235?
- squeue
- squeue -j 7123 or scontrol show job 7123
- scancel 7235
Watching Job Progress
When you submit a job to Slurm it creates a file titled slurm-xxxxxx.stats (where “xxxxxx” represents the job’s ID); when the job starts running it creates a further file titled slurm-xxxxxx.out. You can watch the progress of a running job by inspecting the .out file as it is updated.
N.B. whilst the stats file shows useful information such as the amount of CPU, memory and time consumed by a job, it doesn’t show how much GPU resource was used beyond what was requested/allocated.
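A common way to follow the .out file is the standard tail utility; on the cluster you would run tail -f slurm-&lt;jobid&gt;.out. The sketch below demonstrates tail on a throwaway file (in non-follow mode, so it terminates; the file contents are made up):

```shell
# On the cluster: follow a live job's output with
#   tail -f slurm-<jobid>.out      # Ctrl-C stops watching (not the job)
# Local demonstration on a throwaway file, in non-follow mode:
printf 'epoch 1 done\nepoch 2 done\nepoch 3 done\n' > slurm-demo.out
last_lines=$(tail -n 2 slurm-demo.out)   # show only the newest two lines
echo "$last_lines"
rm slurm-demo.out
```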
QOSes¶
Slurm uses QOSes (also referred to as “queues”) to ensure the equitable distribution of resources across the system. Further details can be found in the main documentation on Deceema projects and QOSes.
How can you find information on your available QOSes?
The following are valid methods for querying your available QOSes:
- Look at the confirmation email that you received when your Deceema account was created.
- See your project(s) and their associated QOSes at https://admin.deceema.com
- Execute the my_deceema command in a terminal shell on Deceema.
GPUs¶
General information on Deceema’s GPUs (in addition to info on CPUs, storage etc.) can be found on our system architecture page whilst specific information on the A100 GPUs can be found on the relevant Nvidia pages.
Quick GPU Walk-through¶
This task looks at analysing the effects of GPUs using the routine cudaOpenMP. Further information on CUDA’s library samples can be found on the Getting CUDA Samples page from NVidia’s website.
In order to run the cudaOpenMP command it is first necessary to retrieve and compile the CUDA samples. This process is relatively simple: each version of the CUDA modules on Deceema contains a command to unpack the sample sources into a specified directory, after which we can use the make command to build them. Please follow the preparatory steps outlined below before commencing the tasks:
- Use module spider or query https://apps.deceema.com to determine the available versions of the fosscuda module.
- Write a batch script (named, e.g., samples.sh) to accomplish the following tasks:

  a. Load the required fosscuda (and therefore CUDA) module.
  b. Run the cuda-install-samples-11.1.sh command, passing an argument to specify the installation directory.
  c. Change directory (cd) to where the samples were unpacked.
  d. Run the make command to build the necessary tools.

  Example:

  ```bash
  #!/bin/bash
  #SBATCH --account _projectaccount_
  #SBATCH --qos _userqos_
  #SBATCH --time 0-0:60:0
  #SBATCH --nodes 1
  #SBATCH --gpus 1
  #SBATCH --cpus-per-gpu 36
  set -x
  module purge; module load deceema
  module load hype-apps/live
  module load fosscuda/2020b
  # Run the unpack command to extract the sources into the current working directory
  cuda-install-samples-11.1.sh .
  # Navigate to the sources directory and make using the available resource
  cd NVIDIA_CUDA-11.1_Samples && make -j ${SLURM_CPUS_ON_NODE}
  ```

- Submit the above script to Slurm using the sbatch command. Once the job has completed you will have the cudaOpenMP binary required to run the tasks below: relative to the NVIDIA_CUDA-11.1_Samples directory, the file’s path is ./bin/x86_64/linux/release/cudaOpenMP.
Summary

Write and submit a batch file to run the cudaOpenMP command. It should specify the following details:

- an appropriate account
- an appropriate QOS
- a job-name
- wall-time of 10 minutes
- 1 node
- 2 GPUs

Then answer the following:

- What is your output file?
- Change GPUs to 4. What happens and why?
- Change GPUs to 8. What happens and why?
- Change nodes to 2 (whilst retaining 8 GPUs). What happens and why?
Refer to the Monitoring Jobs section for info on watching job progress and how to read the .out and .stats files.
- Submission script and associated output file:
- The output changes as follows, representing the increase in the reported GPUs and a proportional increase in the host CPUs:

  number of host CPUs: 144
  number of CUDA devices: 4
  0: A100-SXM4-40GB
  1: A100-SXM4-40GB
  2: A100-SXM4-40GB
  3: A100-SXM4-40GB
  ---------------------------
  CPU thread 0 (of 4) uses CUDA device 0
  CPU thread 1 (of 4) uses CUDA device 1
  CPU thread 2 (of 4) uses CUDA device 2
  CPU thread 3 (of 4) uses CUDA device 3
  ---------------------------

- The sbatch command rejects the job with an error: Deceema’s compute nodes each have 4 NVidia A100 GPUs, so by increasing the GPU request to “8” whilst still restricting the job to a single node (with --nodes 1) the job is unable to run within Deceema’s configuration. You should always keep a record of the architecture at hand and the associated maximum number of GPUs and CPUs available on a node.
- See below for the .out and .stats files. The output from cudaOpenMP shows far less resource than was actually requested, as can be seen by comparing it with the stats file (4 GPUs and 144 CPUs versus 8 GPUs, 288 CPUs and 2 nodes). The resource was divided equally between the two nodes, but cudaOpenMP is only designed to operate on a single node and is therefore reporting on half of the allocated resource. This demonstrates the importance of requesting an amount of resource appropriate for the application you are running, ensuring that resource does not sit idle whilst still being allocated.
number of host CPUs: 144
number of CUDA devices: 4
0: A100-SXM4-40GB
1: A100-SXM4-40GB
2: A100-SXM4-40GB
3: A100-SXM4-40GB
---------------------------
CPU thread 0 (of 4) uses CUDA device 0
CPU thread 3 (of 4) uses CUDA device 3
CPU thread 2 (of 4) uses CUDA device 2
CPU thread 1 (of 4) uses CUDA device 1
---------------------------
+--------------------------------------------------------------------------+
| Job on the Deceema cluster:
| Starting at Tue Sep 28 15:00:30 2021 for auser(123456)
| Identity jobid 12345 jobname cudaopenmp.sh
| Running against project ace-project and in partition Deceema-shared
| Requested cpu=288,mem=864G,node=2,billing=288,gres/gpu=8 - 00:10:00 walltime
| Assigned to nodes bask-pg0308u30a,bask-pg0308u31a
| Command /bask/projects/a/ace-project/cudaopenmp.sh
| WorkDir /bask/projects/a/ace-project
+--------------------------------------------------------------------------+
+--------------------------------------------------------------------------+
| Finished at Tue Sep 28 15:00:35 2021 for auser(123456) on the Deceema Cluster
| Required (00:01.942 cputime, 3580K memory used) - 00:00:05 walltime
| JobState COMPLETING - Reason None
| Exitcode 0:0
+--------------------------------------------------------------------------+
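A sketch of a submission script for the summary task above: the account and QOS values are the documentation’s placeholders, the job name is illustrative, and the module and run lines are commented out so the sketch is inert away from the cluster.

```shell
#!/bin/bash
#SBATCH --account _projectaccount_
#SBATCH --qos _qos_
#SBATCH --job-name cudaopenmp
#SBATCH --time 0-0:10:0
#SBATCH --nodes 1
#SBATCH --gpus 2
#SBATCH --cpus-per-gpu 36

# On the cluster, uncomment the following lines:
# module purge; module load deceema
# module load hype-apps/live
# module load fosscuda/2020b
# cd NVIDIA_CUDA-11.1_Samples
# ./bin/x86_64/linux/release/cudaOpenMP

payload="cudaOpenMP job script parsed OK"
echo "$payload"
```

The #SBATCH lines are comments as far as bash is concerned, so the script can be syntax-checked locally before submission with sbatch.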
CUDA routines and samples
Refer to NVidia’s Samples Reference documentation for further info on the CUDA routines and samples.
Interactive Jobs and Deceema Portal¶
Deceema Portal is the recommended method for running interactive jobs on Deceema. For information on other methods of interactive jobs, please refer to the Interactive Jobs section of the documentation.
Transferring Data¶
Our recommended command-line tools for transferring data to/from the Deceema cluster are as follows:
- rsync – see the rsync man page for usage instructions; for general guidance please see the following webpage: https://linuxize.com/post/how-to-use-rsync-for-local-and-remote-data-transfer-and-synchronization
- scp – see the scp man page for usage instructions; for general guidance please see the following webpage: https://linuxize.com/post/how-to-use-scp-command-to-securely-transfer-files

Alternatively, Deceema Portal includes a file management web-interface that can be used to upload and download content from directories to which you have access.
Support¶
Available User Support¶
The stages you should follow when you have an issue are as follows:
- Check the application’s documentation and/or support pages for either the error message or the issue you’re experiencing.
- Consult with your colleagues – they might have had experience with the problem you are facing.
- Contact your site’s Deceema support team.