Introduction to High Performance Computing

ICCS Summer School 2025

Chris Edsall

Head of RSE
ICCS/Cambridge

Tom Meltzer

Senior Research Software Engineer
ICCS - University of Cambridge

High Performance Computing

Working definition:

A computing resource that is larger than can be provided by one laptop or server

Supercomputers and clusters

Supercomputer

One of the most performant computers in the world at a particular point in time.

Cluster

An architecture for combining a number of servers, storage and networking to act in concert.

Most supercomputers for the past few decades have been clusters.

Applications of HPC

Why would I need a supercomputer?

Three traditional applications:

  • Nuclear
  • Chemical
  • Climate / weather

Now, AI

Floating Point

Computer math is not people math

>>> 0.1 + 0.2

Floating Point

>>> 0.1 + 0.2
0.30000000000000004

Floating Point

  • In the ’60s and ’70s there were many incompatible vendor implementations
  • Standardised in 1985 as IEEE 754

FLOPS

One FLOPS == one floating point operation per second.

  • TF teraflops
  • PF petaflops
  • EF exaflops

Conventionally these are 64-bit (“double precision”) FLOPS
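
As a rough worked example (all figures hypothetical): theoretical peak performance is cores × clock rate × FLOPs per core per cycle, so a 64-core node at 2 GHz whose cores can each perform 16 double-precision FLOPs per cycle peaks at \(64 \times 2\times10^{9} \times 16 \approx 2\) TF.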

“AI” FLOPS

  • smaller data formats
    • float16
    • bfloat16
    • int8

FLOPS

Image source: Felix LeClair

Benchmarks

A benchmark is a particular known and specified workload which can be repeated on different systems and the performance compared.

A typical weather-related one is WRF running the CONUS 2.5 km configuration.

HPL

LINPACK is a software library for performing numerical linear algebra

LINPACK makes use of the BLAS (Basic Linear Algebra Subprograms) libraries for performing basic vector and matrix operations.

The LINPACK benchmarks appeared initially as part of the LINPACK user’s manual. The parallel LINPACK benchmark implementation called HPL (High Performance Linpack) is used to benchmark and rank supercomputers for the TOP500 list.

Top500 list

Exercise 1

Go to the Top500 site at https://top500.org/

  • Use the sublist generator to find the largest HPC systems in your country
  • What is the ratio of Rmax performance between the number 1 system in June 1993 and the number 1 system in June 2025?

Cluster architecture

Before we get to the computing infrastructure, there is the underpinning building and plant (power, cooling) required.

Cluster architecture

Nodes

The name comes from the terminology of mathematical graphs - nodes and edges.

You can think of a node as a single server: one computer running a single instance of an operating system.

Login Nodes

These are your entry point on to the cluster

Usually accessible from the outside world.

Often more than one (sometimes multiple login nodes share the same DNS name, e.g. login.hpc.cam.ac.uk).

Shared with multiple users.

DO NOT RUN COMPUTE JOBS ON THE LOGIN NODE

Compute nodes

These are the nodes that do the heavy lifting computing work.

Normally managed by the job scheduler - you don’t usually log in to them directly.

Quite often for the exclusive use of one user for the duration of their job.

N.B. On some clusters compute nodes can be of a different architecture to the login nodes.

Shared storage

Compute nodes sometimes have on-node disk storage.

There is normally some large storage that is visible to all the compute nodes.

Since this is a shared resource, an anti-social user can affect the performance of other users.

Interconnect

Connects the compute nodes, login nodes and storage

Usually faster (higher bandwidth, lower latency) than commodity Ethernet networking.

It’s what makes a supercomputer super.

Examples:

  • InfiniBand
  • Omni-Path
  • Slingshot

Connecting to CSD3

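A minimal sketch of logging in over SSH, using the login address mentioned earlier (your-username is a placeholder):

ssh your-username@login.hpc.cam.ac.uk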

The Command Line

  • Not as discoverable as a GUI
  • You can’t break the HPC system
  • You type a command with optional flags and optional arguments and press “Return”
  • The system may or may not give you any output

The Command Line resources

  • https://swcarpentry.github.io/shell-novice/
  • https://wizardzines.com/comics/every-core-unix-program-i-use/

The Scheduler

The scheduler takes requests to run jobs with particular cluster resources, fits them in around other users’ jobs according to some policy, launches the job, terminates it if it is overrunning, and does accounting.

Examples:

  • PBS Pro
  • Platform LSF
  • Flux
  • Slurm (today, on CSD3)

Job Scripts

A shell script with shell comments that are directives to the scheduler about how the job should be run.

#!/bin/bash
#SBATCH --account=TRAINING-CPU 
#SBATCH --reservation=iccs-summer-school1
#SBATCH --time=00:02:00
#SBATCH --job-name=my-first-job
#SBATCH --nodes=1
#SBATCH --ntasks=1

echo "My first job - hooray"

Submitting Jobs

sbatch job.sh

You will get back a Job ID.
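
Slurm confirms the submission on stdout and prints the job ID you will use later with squeue and scancel (the number below is purely illustrative):

$ sbatch job.sh
Submitted batch job 123456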

Viewing the Queue

  • squeue
  • squeue --me

Job Output

If you don’t specify, the output file will by default be called slurm-<jobid>.out

To change this you can add an extra directive: #SBATCH --output=<filename>
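
Slurm also supports filename patterns in this directive, for example %x (job name) and %j (job ID); a minimal sketch:

#SBATCH --output=%x-%j.out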

Exercise 2

  • Write a job script to echo “hello world”
  • Submit the job with sbatch
  • See it in the queue with squeue --me
  • Find the output in the directory you submitted it from (ls -lrt)
  • Examine the output using cat

Exercise 3

  • add the Unix command sleep 60 to your job script
  • find your job in the queue with squeue --me
  • kill it with scancel <JOBID>

Exercise 4

  • change the sleep to 180 seconds
  • reduce the job request time to 1 minute
  • see what happens

Environment Modules

  • Multiple versions of the same software can be installed and you can choose between them
  • Two main implementations: Environment Modules (Tcl) and Lmod (Lua)
  • Module names are site specific
  • Module output is on stderr (!)
    • grep with module avail 2>&1 | grep the_thing
  • If you load a module interactively, remember to load it in the job script too
  • Purge and then load only the modules needed in the job script (see the sketch below)
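
A minimal sketch of the purge-then-load pattern inside a job script, using the hdf5/1.12.1 module from the exercise below (remember that module names are site specific):

module purge
module load hdf5/1.12.1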

Exercise 5

  • List the currently loaded modules module list
  • List the available modules module avail
  • Try running h5perf_serial
  • Load the module module load hdf5/1.12.1
  • Now try h5perf_serial

Array Jobs

  • map the array member ID onto your problem domain; it needn’t be one-dimensional (see the sketch below)
  • array members need to be independent: they can execute concurrently and their order isn’t guaranteed
#!/bin/bash
#SBATCH --account=TRAINING-CPU 
#SBATCH --reservation=iccs-summer-school2
#SBATCH --time=00:02:00
#SBATCH --job-name=array-test
#SBATCH --partition=icelake
#SBATCH --nodes=1
#SBATCH --ntasks=1

echo "Hello from array member ${SLURM_ARRAY_TASK_ID}"

sbatch --array=1-10 array-test.sh
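
A common mapping is to use the task ID to select an input file; a minimal sketch with hypothetical file names (input-1.dat, input-2.dat, …) and a hypothetical program that could replace the echo line above:

INPUT="input-${SLURM_ARRAY_TASK_ID}.dat"
./my-analysis.exe "${INPUT}"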

Exercise 6

  • submit an array job using the previous example
  • see how it appears in squeue --me
  • examine the output files

Workflows

  • Do not be tempted to write your own workflow orchestrator

  • Choose from one of the already existing ones, e.g.:

  • Snakemake

  • NextFlow

See NERSC’s advice

programming HPC

single node

  • OpenMP is a specification for parallel programming
  • Hardware independent by design (e.g., CPU, FPGA, GPU…)
  • Shared memory multiprocessing programming model

hello OpenMP

#include <stdio.h>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    int thread_id = omp_get_thread_num();
    printf("Hello from thread %d\n", thread_id);
  }
  return 0;
}

Build with

$ gcc -fopenmp -O3 hello.c -o hello.exe

hello OpenMP

  • Can you predict what the output will be with 4 threads?

Note

To change the number of threads, set the environment variable OMP_NUM_THREADS, e.g. export OMP_NUM_THREADS=1
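
For example, to run the hello example above with four threads:

export OMP_NUM_THREADS=4
./hello.exe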

Exercise 7

  • Build the OpenMP hello world
  • Run with varying numbers of threads

hello OpenMP

What happens when we run the following code?

$ gcc -fopenmp -O3 hello.c -o hello.exe
$ ./hello.exe
Hello from thread 1
Hello from thread 3
Hello from thread 0
Hello from thread 2

OpenMP Scaling

  • Try running the pi.c example (in directory example-code/)
  • To compile pi.c use command:
cd example-code
gcc -fopenmp -O3 pi.c -o pi.exe

If the scaling is worse than you expect, what might be causing it?

  • The problem is “pleasingly parallel”
  • Thread/process pinning (see the sketch below)
  • NUMA
    • Non-uniform memory access
    • differing latencies from cores to main memory
  • First-touch placement policy
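
A minimal sketch of controlling thread placement with the standard OpenMP environment variables (whether this helps depends on the node and the code):

export OMP_PROC_BIND=close   # bind threads and keep them close together
export OMP_PLACES=cores      # each place is a physical core
./pi.exe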

lstopo

lstopo (part of hwloc) displays the hardware topology of a node: sockets, cores, caches and NUMA domains.

programming HPC

distributed (multi-node)

  • MPI (Message Passing Interface)
  • One of the most common methods of distributed compute
  • Distributed memory multiprocessing programming model
  • Implementations: Open MPI, MPICH, Intel MPI

MPI Hello World

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
} 

Building MPI Programs

Build:

mpicc mpi-hello.c -o mpi-hello.exe

Run a small test on login node:

mpiexec -np 4 ./mpi-hello.exe
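
For a multi-node run the program is launched through the scheduler instead; a minimal sketch reusing the account from the earlier job scripts (reservation and partition directives are site and course specific, and depending on the MPI installation srun may be used instead of mpiexec):

#!/bin/bash
#SBATCH --account=TRAINING-CPU
#SBATCH --time=00:05:00
#SBATCH --job-name=mpi-hello
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# SLURM_NTASKS = nodes * ntasks-per-node (8 here)
mpiexec -np ${SLURM_NTASKS} ./mpi-hello.exe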

Exercise 8

  • Build mpi-hello.c
  • Write a job script to run on 2 nodes, 3 cores per node
    • (np should be 2*3 = 6)

programming HPC

GPU offloading

  • There are a range of GPU offloading programming models
  • Vendor specific (e.g. CUDA for NVIDIA, HIP for AMD)
  • Vendor agnostic (e.g. OpenMP target offload, SYCL)

The bad news: Amdahl’s law

\(S = 1 / (1 - p + p/s)\), where…

  • \(S\) is the speedup of a process
  • \(p\) is the proportion of a program that can be made parallel, and
  • \(s\) is the speedup factor of the parallel portion.
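
For example, if 90% of a program can be parallelised (\(p = 0.9\)) and that portion is sped up by a factor of 10 (\(s = 10\)), the overall speedup is \(S = 1 / (1 - 0.9 + 0.9/10) = 1/0.19 \approx 5.3\); even with \(s \to \infty\) the speedup can never exceed \(1/(1-p) = 10\).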

Amdahl’s Law

The good news: Gustafson’s law

  • Mitigates the drawbacks of Amdahl’s Law
  • Scale the problem size as you scale the number of nodes
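
In the same notation, Gustafson’s law gives the scaled speedup \(S = (1 - p) + p\,N\) on \(N\) processors, where \(p\) is the parallel fraction of the scaled-up problem: keep \(p\) high as the problem grows and the achievable speedup grows roughly linearly with \(N\) rather than saturating.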

debugging

  • Multiple strategies 🔍🐛
    • printf()
    • logging
    • debuggers (gdb, lldb, Linaro DDT…)
  • gdb
    • available on most HPC systems
    • works with C, C++, Fortran, Rust…
    • command-line interface (see the sketch below)
  • Debugging Course coming up next in this room!
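
A minimal sketch of a gdb session on the OpenMP hello example (compile with -g so the debugger can see source lines; the commands shown are standard gdb commands):

gcc -g -O0 -fopenmp hello.c -o hello.exe
gdb ./hello.exe
(gdb) break main      # stop at the start of main
(gdb) run             # start the program
(gdb) next            # step over one line at a time
(gdb) continue        # run to completion
(gdb) quit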

Profiling

Warning!

Premature Optimization Is the Root of All Evil

Donald Knuth (1974)

Profiling

IO profiling with Darshan

Green HPC

Advice:

  • Profile!
  • Use reduced precision
  • Design experiments well

Exercise

  • Go to the Green Algorithms website
  • Use the calculator to look at
    • 3 hours
    • 4 nodes (4 * 76 cores)
    • Xeon Platinum 8268
    • Memory 512 GB
    • Location Europe / United Kingdom
    • PUE 1.1

Applying for Resources

Further Resources

  • Your local HPC support
  • HPC carpentry
    • https://carpentries-incubator.github.io/hpc-intro/
  • ARCHER2
    • https://www.archer2.ac.uk/training/materials/
  • ATPESC
    • https://extremecomputingtraining.anl.gov/
  • SC, ISC Tutorials

Contact

For more information we can be reached at: