ICCS Summer School 2025
Working definition:
A computing resource that is larger than can be provided by one laptop or server
One of the most performant computers in the world at a particular point in time.
An architecture for combining a number of servers, storage and networking so that they act in concert.
Most supercomputers for the past few decades have been clusters.
Why would I need a supercomputer?
Three traditional applications:
Now, AI
Computer math is not people math
>>> 0.1 + 0.2
0.30000000000000004
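An aside on why (not specific to Python): 0.1 and 0.2 are recurring fractions in binary, so a 64-bit float can only store rounded approximations of them:
\[
0.1_{10} = 0.0\overline{0011}_2, \qquad 0.2_{10} = 0.\overline{0011}_2
\]
The rounded values add up to a number slightly above 0.3, which is what gets printed.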
One FLOPS == one floating point operation per second.
Conventionally these are 64-bit (“double precision”) FLOPS
Image source: Felix LeClair
A benchmark is a particular known and specified workload which can be repeated on different systems and the performance compared.
A typical weather-related one is WRF running the CONUS 2.5 km configuration.
LINPACK is a software library for performing numerical linear algebra
LINPACK makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations.
The LINPACK benchmarks appeared initially as part of the LINPACK user’s manual. The parallel LINPACK benchmark implementation called HPL (High Performance Linpack) is used to benchmark and rank supercomputers for the TOP500 list.
Go to the Top500 site at https://top500.org/
Before we get to the computing infrastructure, there is the underpinning building and plant (power, cooling) required.
The name comes from the terminology of mathematical graphs - nodes and edges.
You can think of a node as a single server - one computer that runs an instance of an operating system.
These are your entry point on to the cluster
Usually accessible from the outside world.
Often more than one (sometimes multiple login nodes use the same DNS name, e.g. login.hpc.cam.ac.uk).
Shared between multiple users.
DO NOT RUN COMPUTE JOBS ON THE LOGIN NODE
These are the nodes that do the heavy lifting computing work.
Normally managed by the job scheduler - you don’t usually log in to them directly.
Quite often for the exclusive use of one user for the duration of their job.
N.B. On some clusters compute nodes can be of a different architecture to the login nodes.
Compute nodes sometimes have on-node disk storage.
There is normally some large storage that is visible to all the compute nodes.
Since this is a shared resource, an anti-social user can affect the performance of other users.
Connects the compute nodes, login nodes and storage
Usually faster (higher bandwidth, lower latency) than commodity ethernet networking.
It’s what makes a supercomputer super.
Examples:
- InfiniBand
- Omni-Path
- Slingshot
login.hpc.cam.ac.uk
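To connect, ssh to a login node, for example (replace the placeholder with your own username):
ssh <username>@login.hpc.cam.ac.uk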
The scheduler takes requests to run jobs with particular cluster resources, fits these in around other users’ jobs according to some policy, launches the job, terminates it if it overruns, and does accounting.
Examples:
- PBSpro
- Platform LSF
- Flux
- Slurm (today, on CSD3)
A shell script with shell comments that are directives to the scheduler about how the job should be run.
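A minimal sketch of such a script (say, saved as job.sh; the job name, resources and final command are placeholders - see the CSD3-specific example later for real account and partition values):
#!/bin/bash
#SBATCH --job-name=my-first-job   # name shown in the queue
#SBATCH --time=00:05:00           # wall-clock time limit (HH:MM:SS)
#SBATCH --nodes=1                 # how many nodes to request
#SBATCH --ntasks=1                # how many tasks (processes) to run
echo "Hello from $(hostname)"     # the actual work goes here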
sbatch job.sh
You will get back a Job ID.
squeue
squeue --me
If you don’t specify, by default it will be called slurm-<JOBID>.out
To change this you can add an extra directive #SBATCH --output=<filename>
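For example (the filename is illustrative; Slurm expands %j to the job ID):
#SBATCH --output=my-job-%j.out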
sbatch
squeue --me
ls -lrt
cat
sleep 60
squeue --me
scancel <JOBID>
Note: by default stderr is also written to the same output file.
module avail 2>&1 | grep the_thing
module list
module avail
h5perf_serial
module load hdf5/1.12.1
h5perf_serial
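Roughly what to expect (module name and version may differ on other clusters):
h5perf_serial            # fails: command not found
module load hdf5/1.12.1  # puts the HDF5 tools on your PATH
h5perf_serial            # now found, and runs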
#!/bin/bash
#SBATCH --account=TRAINING-CPU
#SBATCH --reservation=iccs-summer-school2
#SBATCH --time=00:02:00
#SBATCH --job-name=array-test
#SBATCH --partition=icelake
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
echo "Hello from array member ${SLURM_ARRAY_TASK_ID}"
sbatch --array=1-10 array-test.sh
squeue --me
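Each array task writes its own output file, named slurm-<JOBID>_<TASKID>.out by default. Once the tasks finish, one way to check them:
ls -l slurm-*_*.out
grep Hello slurm-*_*.out   # one 'Hello from array member N' line per task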
Do not be tempted to write your own workflow orchestrator
Choose one of the already-existing ones, e.g.:
See NERSC’s advice
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        printf("Hello from thread %d\n", thread_id);
    }
    return 0;
}
Build with
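For example with GCC (the source file name here is an assumption; use whatever you saved the code as):
gcc -fopenmp omp-hello.c -o omp-hello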
Note: to change the number of threads we need to set the environment variable OMP_NUM_THREADS
e.g., export OMP_NUM_THREADS=1
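For instance (executable name as in the build sketch above):
export OMP_NUM_THREADS=4
./omp-hello   # prints four 'Hello from thread N' lines, in no particular order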
What happens when we run the pi.c example (in directory example-code/)?
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}
Build:
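Typically with the MPI compiler wrapper; the wrapper name depends on the MPI installation (mpicc is common, but it may be e.g. cc or mpiicc on some systems):
mpicc mpi-hello.c -o mpi-hello.exe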
Run a small test on the login node:
mpiexec -np 4 ./mpi-hello.exe
mpi-hello.c
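For a real run across compute nodes, submit it through the scheduler. A sketch of such a job script, reusing the account and partition from the array example and assuming srun is set up to launch MPI programs on your cluster:
#!/bin/bash
#SBATCH --account=TRAINING-CPU
#SBATCH --partition=icelake
#SBATCH --time=00:02:00
#SBATCH --nodes=1
#SBATCH --ntasks=4        # four MPI ranks
srun ./mpi-hello.exe      # srun starts one rank per task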
\(S = 1 / (1 - p + p/s)\), where…
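To get a feel for it (illustrative numbers): if 90% of the runtime parallelises (\(p = 0.9\)) and that part is sped up 8x (\(s = 8\)), then
\[
S = \frac{1}{(1 - 0.9) + 0.9/8} = \frac{1}{0.2125} \approx 4.7
\]
and even with \(s \to \infty\) the speedup can never exceed \(1/(1 - p) = 10\).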
printf()
Debuggers: gdb, lldb, Linaro DDT, …
Warning!
Premature Optimization Is the Root of All Evil
Donald Knuth (1974)
Advice:
For more information we can be reached at:
You can also contact the ICCS, make a resource allocation request, or visit us at the Summer School RSE Helpdesk.