Introduction to High Performance Computing

ICCS Summer School 2025

Chris Edsall

Head of RSE
ICCS/Cambridge

Tom Meltzer

Senior Research Software Engineer
ICCS - University of Cambridge

High Performance Computing

Working definition:

A computing resource that is larger than can be provided by one laptop or server

Supercomputers and clusters

Supercomputer

One of the most performant computers in the world at a particular point in time.

Cluster

An architecture for combining a number of servers, storage and networking to act in concert.

Most supercomputers for the past few decades have been clusters.

Applications of HPC

Why would I need a supercomputer?

Three traditional applications:

  • Nuclear
  • Chemical
  • Climate / weather

Now, AI

Floating Point

Computer math is not people math

>>> 0.1 + 0.2

Floating Point

>>> 0.1 + 0.2
0.30000000000000004

Floating Point

  • In the ’60s and ’70s there were many incompatible vendor implementations
  • Standardised in 1985 as IEEE 754

FLOPS

One FLOPS == one floating point operation per second.

  • TF teraflops
  • PF petaflops
  • EF exaflops

Conventionally these are 64-bit (“double precision”) FLOPS
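
As a rough worked example (all figures hypothetical): theoretical peak performance is cores × clock rate × FLOPs per core per cycle, so a 64-core node at 2 GHz whose cores can each perform 16 double-precision FLOPs per cycle peaks at \(64 \times 2\times10^{9} \times 16 \approx 2\) TF.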

“AI” FLOPS

  • smaller data formats
    • float16
    • bfloat16
    • int8

FLOPS

Image source: Felix LeClair

Benchmarks

A benchmark is a particular known and specified workload which can be repeated on different systems and the performance compared.

A typical weather-related one is WRF running the CONUS 2.5 km configuration.

HPL

LINPACK is a software library for performing numerical linear algebra

LINPACK makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations.

The LINPACK benchmarks appeared initially as part of the LINPACK user’s manual. The parallel LINPACK benchmark implementation called HPL (High Performance Linpack) is used to benchmark and rank supercomputers for the TOP500 list.

Top500 list

Exercise 1

Go to the Top500 site at https://top500.org/

  • Use the sublist generator to find the largest HPC systems in your country
  • What is the ratio of Rmax performance between the number 1 system in June 1993 and the number 1 system in June 2025?

Cluster architecture

Before we get to the computing infrastructure, there is the underpinning building and plant (power, cooling) required.

Cluster architecture

Nodes

The name comes from the terminology of mathematical graphs - nodes and edges.

You can think of a node as a single server: one computer running a single instance of an operating system.

Login Nodes

These are your entry point on to the cluster

Usually accessible from the outside world.

Often more than one (sometimes multiple login nodes share the same DNS name, e.g. login.hpc.cam.ac.uk).

Shared with multiple users.

DO NOT RUN COMPUTE JOBS ON THE LOGIN NODE

Compute nodes

These are the nodes that do the heavy lifting computing work.

Normally managed by the job scheduler - you don’t usually log in to them directly.

Quite often for the exclusive use of one user for the duration of their job.

N.B. On some clusters compute nodes can be of a different architecture to the login nodes.

Shared storage

Compute nodes sometimes have on-node disk storage.

There is normally some large storage that is visible to all the compute nodes.

Since this is a shared resource, an anti-social user can affect the performance of other users.

Interconnect

Connects the compute nodes, login nodes and storage

Usually faster (higher bandwidth, lower latency) than commodity Ethernet networking.

It’s what makes a supercomputer super.

Examples:

  • InfiniBand
  • Omni-Path
  • Slingshot

Connecting to CSD3

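A minimal sketch of logging in over SSH, using the login address mentioned earlier (your-username is a placeholder):

ssh your-username@login.hpc.cam.ac.uk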

The Command Line

  • Not as discoverable as a GUI
  • You can’t break the HPC system
  • You type a command with optional flags and optional arguments and press “Return”
  • The system may or may not give you any output

The Command Line resources

  • https://swcarpentry.github.io/shell-novice/
  • https://wizardzines.com/comics/every-core-unix-program-i-use/

The Scheduler

The scheduler takes requests to run jobs with particular cluster resources, fits them in around other users’ jobs according to some policy, launches the job, terminates it if it is overrunning, and does accounting.

Examples:

  • PBS Pro
  • Platform LSF
  • Flux
  • Slurm (today, on CSD3)

Job Scripts

A shell script with shell comments that are directives to the scheduler about how the job should be run.

#!/bin/bash
#SBATCH --account=TRAINING-CPU 
#SBATCH --reservation=iccs-summer-school1
#SBATCH --time=00:02:00
#SBATCH --job-name=my-first-job
#SBATCH --nodes=1
#SBATCH --ntasks=1

echo "My first job - hooray"

Submitting Jobs

sbatch job.sh

You will get back a Job ID.
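
Slurm confirms the submission on stdout and prints the job ID you will use later with squeue and scancel (the number below is purely illustrative):

$ sbatch job.sh
Submitted batch job 123456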

Viewing the Queue

  • squeue
  • squeue --me

Job Output

If you don’t specify, the output file will by default be called slurm-<jobid>.out

To change this you can add an extra directive: #SBATCH --output=<filename>
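
Slurm also supports filename patterns in this directive, for example %x (job name) and %j (job ID); a minimal sketch:

#SBATCH --output=%x-%j.out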

Exercise 2

  • Write a job script to echo “hello world”
  • Submit the job with sbatch
  • See it in the queue with squeue --me
  • Find the output in the directory you submitted it from (ls -lrt)
  • Examine the output using cat

Exercise 3

  • add the Unix command sleep 60 to your job script
  • find your job in the queue with squeue --me
  • kill it with scancel <JOBID>

Exercise 4

  • change the sleep to 180 seconds
  • reduce the job request time to 1 minute
  • see what happens

Environment Modules

  • Multiple versions of the same software can be installed and you can choose between them
  • Two main implementations: Environment Modules (Tcl) and Lmod (Lua)
  • Module names are site specific
  • Module output is on stderr (!)
    • grep with module avail 2>&1 | grep the_thing
  • If you load a module interactively, remember to load it in the job script too
  • Purge and then load only the modules needed in the job script (see the sketch below)
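
A minimal sketch of the purge-then-load pattern inside a job script, using the hdf5/1.12.1 module from the exercise below (remember that module names are site specific):

module purge
module load hdf5/1.12.1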

Exercise 5

  • List the currently loaded modules module list
  • List the available modules module avail
  • Try running h5perf_serial
  • Load the module module load hdf5/1.12.1
  • Now try h5perf_serial

Array Jobs

  • map the array member ID onto your problem domain; it needn’t be one-dimensional (see the sketch below)
  • array members need to be independent: they can execute concurrently and their order isn’t guaranteed
#!/bin/bash
#SBATCH --account=TRAINING-CPU 
#SBATCH --reservation=iccs-summer-school2
#SBATCH --time=00:02:00
#SBATCH --job-name=array-test
#SBATCH --partition=icelake
#SBATCH --nodes=1
#SBATCH --ntasks=1

echo "Hello from array member ${SLURM_ARRAY_TASK_ID}"

sbatch --array=1-10 array-test.sh
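
A common mapping is to use the task ID to select an input file; a minimal sketch with hypothetical file names (input-1.dat, input-2.dat, …) and a hypothetical program that could replace the echo line above:

INPUT="input-${SLURM_ARRAY_TASK_ID}.dat"
./my-analysis.exe "${INPUT}"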

Exercise 6

  • submit an array job using the previous example
  • see how it appears in squeue --me
  • examine the output files

Workflows

  • Do not be tempted to write your own workflow orchestrator

  • Choose from one of the already existing ones, e.g.:

  • Snakemake

  • NextFlow

See NERSC’s advice

programming HPC

single node

  • OpenMP is a specification for parallel programming
  • Hardware independent by design (e.g., CPU, FPGA, GPU…)
  • Shared memory multiprocessing programming model

hello OpenMP

#include <stdio.h>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    int thread_id = omp_get_thread_num();
    printf("Hello from thread %d\n", thread_id);
  }
  return 0;
}

Build with

$ gcc -fopenmp -O3 hello.c -o hello.exe

hello OpenMP

  • Can you predict what the output will be with 4 threads?

Note

To change the number of threads, set the environment variable OMP_NUM_THREADS, e.g. export OMP_NUM_THREADS=1
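
For example, to run the hello example above with four threads:

export OMP_NUM_THREADS=4
./hello.exe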

Exercise 7

  • Build the OpenMP hello world
  • Run with varying numbers of threads

hello OpenMP

What happens when we run the following code?

$ gcc -fopenmp -O3 hello.c -o hello.exe
$ ./hello.exe
Hello from thread 1
Hello from thread 3
Hello from thread 0
Hello from thread 2

OpenMP Scaling

  • Try running the pi.c example (in directory example-code/)
  • To compile pi.c use command:
cd example-code
gcc -fopenmp -O3 pi.c -o pi.exe

If the scaling is worse than you expect, what might be causing it?

  • The problem is “pleasingly parallel”
  • Thread/process pinning (see the sketch below)
  • NUMA
    • Non-uniform memory access
    • differing latencies from cores to main memory
  • First-touch placement policy
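
A minimal sketch of controlling thread placement with the standard OpenMP environment variables (whether this helps depends on the node and the code):

export OMP_PROC_BIND=close   # bind threads and keep them close together
export OMP_PLACES=cores      # each place is a physical core
./pi.exe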

lstopo

lstopo (part of hwloc) displays the hardware topology of a node: sockets, cores, caches and NUMA domains.

programming HPC

distributed (multi-node)

  • MPI (Message Passing Interface)
  • One of the most common methods of distributed compute
  • Distributed memory multiprocessing programming model
  • Implementations: Open MPI, MPICH, Intel MPI

MPI Hello World

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
} 

Building MPI Programs

Build:

mpicc mpi-hello.c -o mpi-hello.exe

Run a small test on login node:

mpiexec -np 4 ./mpi-hello.exe
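
For a multi-node run the program is launched through the scheduler instead; a minimal sketch reusing the account from the earlier job scripts (reservation and partition directives are site and course specific, and depending on the MPI installation srun may be used instead of mpiexec):

#!/bin/bash
#SBATCH --account=TRAINING-CPU
#SBATCH --time=00:05:00
#SBATCH --job-name=mpi-hello
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# SLURM_NTASKS = nodes * ntasks-per-node (8 here)
mpiexec -np ${SLURM_NTASKS} ./mpi-hello.exe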

Exercise 8

  • Build mpi-hello.c
  • Write a job script to run on 2 nodes, 3 cores per node
    • (np should be 2*3 = 6)

programming HPC

GPU offloading

  • There are a range of GPU offloading programming models
  • Vendor specific (e.g. CUDA for NVIDIA, HIP for AMD)
  • Vendor agnostic (e.g. OpenMP target offload, SYCL)

The bad news: Amdahl’s law

\(S = 1 / (1 - p + p/s)\), where…

  • \(S\) is the speedup of a process
  • \(p\) is the proportion of a program that can be made parallel, and
  • \(s\) is the speedup factor of the parallel portion.
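
For example, if 90% of a program can be parallelised (\(p = 0.9\)) and that portion is sped up by a factor of 10 (\(s = 10\)), the overall speedup is \(S = 1 / (1 - 0.9 + 0.9/10) = 1/0.19 \approx 5.3\); even with \(s \to \infty\) the speedup can never exceed \(1/(1-p) = 10\).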

Amdahl’s Law

The good news: Gustafson’s law

  • Mitigates the drawbacks of Amdahl’s Law
  • Scale the problem size as you scale the number of nodes
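
In the same notation, Gustafson’s law gives the scaled speedup \(S = (1 - p) + p\,N\) on \(N\) processors, where \(p\) is the parallel fraction of the scaled-up problem: keep \(p\) high as the problem grows and the achievable speedup grows roughly linearly with \(N\) rather than saturating.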

debugging

  • Multiple strategies 🔍🐛
    • printf()
    • logging
    • debuggers (gdb, lldb, Linaro DDT…)
  • gdb
    • available on most HPC systems
    • works with C, C++, Fortran, Rust…
    • command-line interface (see the sketch below)
  • Debugging Course coming up next in this room!
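
A minimal sketch of a gdb session on the OpenMP hello example (compile with -g so the debugger can see source lines; the commands shown are standard gdb commands):

gcc -g -O0 -fopenmp hello.c -o hello.exe
gdb ./hello.exe
(gdb) break main      # stop at the start of main
(gdb) run             # start the program
(gdb) next            # step over one line at a time
(gdb) continue        # run to completion
(gdb) quit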

Profiling

Warning!

Premature Optimization Is the Root of All Evil

Donald Knuth (1974)

Profiling

IO profiling with Darshan

Green HPC

Advice:

  • Profile!
  • Use reduced precision
  • Design experiments well

Exercise

  • Go to the Green Algorithms website
  • Use the calculator to look at
    • 3 hours
    • 4 nodes (4 * 76 cores)
    • Xeon Platinum 8268
    • Memory 512 GB
    • Location Europe / United Kingdom
    • PUE 1.1

Applying for Resources

Further Resources

  • Your local HPC support
  • HPC carpentry
    • https://carpentries-incubator.github.io/hpc-intro/
  • ARCHER2
    • https://www.archer2.ac.uk/training/materials/
  • ATPESC
    • https://extremecomputingtraining.anl.gov/
  • SC, ISC Tutorials

Contact

For more information we can be reached at: