Session 1: Forward mode differentiation and source transformation

Authors

Joe Wallwork

Niccolò Zanotti

%matplotlib inline
import matplotlib.pyplot as plt
from mpltools import annotation
import numpy as np
import pandas
def print_matrix(filename):
    """Function for printing a matrix saved to a text file from Fortran."""
    with open(filename) as csv_file:
        rows = [[float(entry) for entry in line.rstrip(",\n").split(",")] for line in csv_file]
    print(np.array(rows))

Set up your codespace now. (It will take a while!)

We will show you where to find this notebook in the repo while you wait.

Motivation

Differentiable programming is an enabling technology. Given scientific code for computing some quantity, it allows us to generate derivatives of quantities involved in the computation without any need to derive or hand-code the derivative expressions ourselves.

Some motivating examples:

  • Computing the Jacobian for a nonlinear system.
  • Computing the gradient required for an ODE- or PDE-constrained optimisation method.
  • The backpropagation operation used for training machine learning models.
  • Computing Hessians for uncertainty quantification methods.
  • Solving the adjoint problems involved in data assimilation methods commonly used for weather forecasting.

Learning objectives

In today’s session we will:

  • Get a brief history of automatic differentiation.
  • Learn about forward mode and the source transformation approach.
  • Try out the Tapenade source transformation AD tool applied to some test problems.
  • Verify the code generated by Tapenade both manually and using the Taylor test.
  • Learn about compressed and matrix-free approaches for sparse problems.

Preparations

Terminology

This course introduces the concept of differentiable programming, a.k.a. automatic differentiation (AD), or algorithmic differentiation. We will use the acronym AD henceforth.

Notation

For a differentiable mathematical function \(f:A\rightarrow\mathbb{R}\) with scalar input (i.e., a single value) from \(A\subseteq\mathbb{R}\), we make use of both the Lagrange notation \(f'(x)\) and Leibniz notation \(\frac{\mathrm{d}f}{\mathrm{d}x}\) for its derivative.

Take care with the physics ‘dot’ notation \(\dot{x}\) for derivatives. In the AD context it won’t always mean what you expect! (See later.)

Similarly, for an \(m\)-dimensional (\(m\in\mathbb{N}\)), differentiable, vector-valued function \(\mathbf{f}:A\rightarrow\mathbb{R}^m\) with scalar input, we have the derivative notations \(\mathbf{f}'(x)\) and \(\frac{\mathrm{d}\mathbf{f}}{\mathrm{d}x}\).

For a differentiable function with vector input (i.e., multiple inputs), we use partial derivative notation. For example, if \(f:\mathbb{R}^2\rightarrow\mathbb{R}\) is written as \(f=f(x,y)\) then we have the partial derivatives \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\) with respect to the first and second components, respectively. For a function of \(n\) inputs, we use \[\nabla f=\left(\frac{\partial f}{\partial x_1},\dots,\frac{\partial f}{\partial x_n}\right)\] to denote the vector of all such partial derivatives. Similarly for vector-valued functions with multiple inputs.

When it comes to derivatives in code, we use the _d notation (for “derivative” or “dot”), which is standard in the AD literature. Its meaning will be described in due course.

History

  • Origins of AD lie in the 1950s.
  • Forward mode (the subject of this session) was introduced by Wengert in 1964.
  • However, AD only found a wider audience in the 1980s, when it became more relevant thanks to advances in both computer power and modern programming languages.
  • It was developed further by Griewank in the late 1980s.

Figure 1: Header of (R. E. Wengert, 1964).

Idea

The idea of AD is to treat a model as a sequence of elementary instructions (e.g., addition, multiplication, exponentiation). Here a model could be a function or subroutine, code block, or a whole program. Elementary operations are well-understood and their derivatives are known. As such, the derivative of the whole model may be computed by composing the derivatives of each operation using the chain rule.
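
To make this concrete before introducing any tooling, here is a minimal Python sketch (our own illustration, not the output of an AD tool) that evaluates \(y=\sin(x)\,x\) while propagating a derivative alongside each elementary operation:

import numpy as np

def value_and_derivative(x):
    """Evaluate y = sin(x) * x and dy/dx by pairing each intermediate value
    with its derivative and applying the chain rule one elementary operation
    at a time (this is precisely what forward mode AD automates)."""
    v1, d1 = x, 1.0                       # seed the input with dx/dx = 1
    v2, d2 = np.sin(v1), np.cos(v1) * d1  # derivative of the sine operation
    v3, d3 = v2 * v1, d2 * v1 + v2 * d1   # product rule for the multiplication
    return v3, d3

print(value_and_derivative(2.0))  # expect (2*sin(2), 2*cos(2) + sin(2))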

Recap on A-level maths: the Chain Rule

Consider two composable, differentiable (mathematical) functions, \(f\) and \(g\), with composition \(h=f\circ g\). By definition, this means \[h(x)=(f\circ g)(x)=g(f(x)).\]

Then the chain rule states that the derivative of \(h\) may be computed in terms of the derivatives of \(f\) and \(g\) using the formula \[h'(x)=(f\circ g)'(x)=(f\circ g')(x)\,f'(x)=g'(f(x))\,f'(x).\]

Equivalently, in Leibniz notation: \[\frac{\mathrm{d}h}{\mathrm{d}x}=\frac{\mathrm{d}g}{\mathrm{d}f}\frac{\mathrm{d}f}{\mathrm{d}x}.\]

For functions with multiple arguments, the analogous result holds for each partial derivative, e.g., \[\frac{\partial h}{\partial x}=\frac{\partial g}{\partial f}\frac{\partial f}{\partial x}.\]

\(f\) & \(g\) example

Consider two functions acting on real numbers: \[f(x,y)=xy\] and \[g(z)=(\sin(z),\cos(z)).\] Here \(f:\mathbb{R}^2\rightarrow\mathbb{R}\) takes two inputs and returns a single output, while \(g:\mathbb{R}\rightarrow\mathbb{R}^2\) takes a single input and returns two outputs.

Optional exercise

Convince yourself that these functions may be composed in either order, i.e., that both compositions are well defined. (Although they won’t necessarily give the same value!)

Solution

The image of \(f\) is the set of all real numbers, which is the same as the domain of \(g\) (i.e., \(\text{im}(f)=\mathbb{R}=\text{dom}(g)\)).

The image of \(g\) is \([-1,1]^2=[-1,1]\times[-1,1]\) because \(\sin\) and \(\cos\) give values between -1 and 1. Since this is a subset of \(\mathbb{R}^2\), the image of \(g\) is a subset of the domain of \(f\) (i.e., \(\text{im}(g)\subset\text{dom}(f)\)).

\(f\) & \(g\) example: Directed Acyclic Graph

We can visualise the functions in terms of DAGs.

Recalling that \[f(x_1,x_2)=x_1x_2\] and \[g(y)=(\sin(y),\cos(y))\, ,\] we have

Figure 2: Directed Acyclic Graph (DAG) for the \(f\) function in the \(f\) & \(g\) example. Generated using TikZ and \(\LaTeX\).

Figure 3: Directed Acyclic Graph (DAG) for the \(g\) function in the \(f\) & \(g\) example. Generated using TikZ and \(\LaTeX\).

\(f\) & \(g\) example: composition

Consider the composition \(h=f\circ g:\mathbb{R}^2\rightarrow\mathbb{R}^2\), which is given by \[h(x_1,x_2)=(f\circ g)(x_1,x_2)=g(f(x_1,x_2))=g(x_1x_2)=(\sin(x_1x_2),\cos(x_1x_2)).\]

For the derivative of each component, \[ \frac{\partial f}{\partial x_1}=\frac{\partial}{\partial x_1}x_1x_2=x_2, \quad\frac{\partial f}{\partial x_2}=\frac{\partial}{\partial x_2}x_1x_2=x_1, \quad\frac{\partial g}{\partial y}=\frac{\partial}{\partial y}(\sin(y),\cos(y))=(\cos(y),-\sin(y)). \]

Introduce the notation \(g(y)=(g_1(y),g_2(y))\) so that \[ \frac{\partial g_1}{\partial y}=\cos(y), \quad\frac{\partial g_2}{\partial y}=-\sin(y). \] Similarly \(h(x_1,x_2)=(h_1(x_1,x_2),h_2(x_1,x_2))=(g_1(f(x_1,x_2)),g_2(f(x_1,x_2)))\).

Let’s use the chain rule to work out the derivatives of each of the outputs with respect to each of the inputs. \[ \frac{\partial h_1}{\partial x_1}=\frac{\partial g_1}{\partial f}\frac{\partial f}{\partial x_1}=\cos(y)x_2=x_2\cos(x_1x_2), \quad\frac{\partial h_1}{\partial x_2}=\frac{\partial g_1}{\partial f}\frac{\partial f}{\partial x_2}=\cos(y)x_1=x_1\cos(x_1x_2), \] where \(y=f(x_1,x_2)\) and \[ \quad\frac{\partial h_2}{\partial x_1}=\frac{\partial g_2}{\partial f}\frac{\partial f}{\partial x_1}=-\sin(y)x_2=-x_2\sin(x_1x_2), \quad\frac{\partial h_2}{\partial x_2}=\frac{\partial g_2}{\partial f}\frac{\partial f}{\partial x_2}=-\sin(y)x_1=-x_1\sin(x_1x_2). \]

We will come back to these formulae to verify the correctness of our AD computations.
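
As a quick sanity check of the hand-derived formulae (independent of any AD tool), we can compare them against central differences; the evaluation point and tolerance below are arbitrary choices of ours:

import numpy as np

def h(x1, x2):
    """The composition h(x1, x2) = (sin(x1*x2), cos(x1*x2)) from the example."""
    return np.array([np.sin(x1 * x2), np.cos(x1 * x2)])

x1, x2, eps = 0.7, -1.3, 1e-6

# Hand-derived partial derivatives with respect to x1
exact = np.array([x2 * np.cos(x1 * x2), -x2 * np.sin(x1 * x2)])

# Central difference approximation with respect to x1
approx = (h(x1 + eps, x2) - h(x1 - eps, x2)) / (2 * eps)

print(np.allclose(approx, exact, atol=1e-6))  # expect True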

Directional derivative, a.k.a. Jacobian-vector product (JVP)

Consider a vector-valued function \(\mathbf{f}\) mapping from a subspace \(A\subseteq\mathbb{R}^n\) into \(\mathbb{R}^m\), for some \(m,n\in\mathbb{N}\): \[\mathbf{f}:A\rightarrow\mathbb R^m.\]

Given input \(\mathbf{x}\in A\) and a seed vector \(\dot{\mathbf{x}}\in\mathbb{R}^n\), forward mode AD allows us to compute the action (matrix-vector product) \[\text{JVP}(\mathbf{f},\mathbf{x},\dot{\mathbf{x}}):=\nabla\mathbf{f}(\mathbf{x})\,\dot{\mathbf{x}}.\] Here \(\nabla\mathbf{f}\) is referred to as the Jacobian for the map, so the above is known as a Jacobian-vector product (JVP). You will also hear the related term tangent linear model (TLM).

Think of the seed vector as the direction in which we want to compute the derivative. In practice, the seed vector is often a derivative of some upstream code from outside of the part of the program being differentiated. That is, the upstream code is passive, whereas the part we are interested in is active, as far as AD is concerned.

Note The computation is matrix-free. We don’t actually need to assemble the Jacobian when we compute this product.
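
For instance, a hand-written JVP for the \(h\) of the \(f\) & \(g\) example might look as follows (a sketch using our own helper names): the directional derivative is returned without ever storing the \(2\times2\) Jacobian.

import numpy as np

def h_jvp(x, xdot):
    """JVP for h(x1, x2) = (sin(x1*x2), cos(x1*x2)): contract the hand-derived
    partial derivatives with the seed vector xdot, entirely matrix-free."""
    s = x[0] * x[1]
    ds = x[1] * xdot[0] + x[0] * xdot[1]  # directional derivative of x1*x2
    return np.array([np.cos(s) * ds, -np.sin(s) * ds])

x = np.array([1.0, 2.0])
xdot = np.array([1.0, 0.0])  # seed: differentiate with respect to x1 only
print(h_jvp(x, xdot))        # expect [x2*cos(x1*x2), -x2*sin(x1*x2)]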

\(f\) & \(g\) example: Seed vectors

Let’s revisit the DAG interpretation and consider how the derivatives work.

Figure 4: Directed Acyclic Graph (DAG) for the composition of the functions in the \(f\) & \(g\) example.

Approach 1: Source transformation

High level idea: Given some (code) function f(x) and a seed vector x_d, generate the code for the function f_d(x, x_d) for its (directional) derivative.
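
The following Python sketch mimics by hand what such a tool emits (the routine names foo and foo_d are ours, purely for illustration; the real exercises below apply Tapenade to Fortran source): from the source of foo we write down a routine foo_d that takes the extra seed argument and contains one derivative statement per original statement.

import numpy as np

# Original routine, as the user might write it
def foo(x):
    return np.sin(x) * x

# What a forward-mode source transformation would generate from the source of
# foo: each assignment is preceded by the assignment for its derivative.
def foo_d(x, x_d):
    y_d = (np.cos(x) * x + np.sin(x)) * x_d
    y = np.sin(x) * x
    return y, y_d

print(foo_d(2.0, 1.0))  # value and derivative of foo at x = 2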

Notes on the source transformation approach:

  • Static analysis, done ahead of time.
  • Often the difficult part is hooking the differentiated code into the wider model/build system.
  • Limited number of tools.

Source transformation tools:

  • Tapenade (C, Fortran, Julia*)
  • OpenAD (Fortran) [no longer maintained]
  • TAF (Fortran) [commercial]
  • PSyAD* (domain-specific)

*Work in progress

\(f\) & \(g\) example: source transformation

Below we have the Fortran code for the example functions above, written as subroutines. You can find this in the repository at session1/exercises/fg/f.f90 and session1/exercises/fg/g.f90, respectively.

subroutine f(x, y)
  implicit none
  real, dimension(2), intent(in)  :: x
  real, intent(out) :: y
  y = x(1) * x(2)
end subroutine f
subroutine g(y, z)
  implicit none
  real, intent(in) :: y
  real, intent(out), dimension(2) :: z
  z = [sin(y), cos(y)]
end subroutine g

Exercise

  1. Run tapenade -h to see all the options.
  2. Apply Tapenade to each of these subroutines using the -head argument.* Inspect the output files f_d.f90 and g_d.f90 and check they are as you expect.
  3. (Optional) Inspect the message files f_d.msg and g_d.msg.
  4. (Optional) See https://gitlab.inria.fr/tapenade/tapenade/-/blob/develop/src/README.md for details on the internals of Tapenade.

*The syntax is -head "procedure_name(dependent_variables)/(independent_variables)", where here procedure_name, dependent_variables, and independent_variables all need substituting as appropriate. If multiple variables are used then they should be comma-separated.

Solution 1

$ tapenade -h
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
 Builds a differentiated program.
 Usage: tapenade [options]* filenames
  options:
   -head, -root <proc>     set the differentiation root procedure(s)
                           See FAQ for refined invocation syntax, e.g.
                           independent and dependent arguments, multiple heads...
   -tangent, -d            differentiate in forward/tangent mode (default)
   -reverse, -b            differentiate in reverse/adjoint mode
   -vector, -multi         turn on "vector" mode (i.e. multi-directional)
   -specializeactivity <unit_names or %all%>  Allow for several activity patterns per routine
   -primal, -p             turn off differentiation. Show pointer destinations
   -output, -o <file>      put all generated code into a single <file>
   -splitoutputfiles       split generated code, one file per top unit
   -outputdirectory, -O <directory>  put all generated files in <directory> (default: .)
   -I <includePath>        add a new search path for include files
   -tgtvarname <str>       set extension for tangent variables  (default %d)
   -tgtfuncname <str>      set extension for tangent procedures (default %_d)
   -tgtmodulename <str>    set extension for tangent modules and types (default %_diff)
   -adjvarname <str>       set extension for adjoint variables  (default %b)
   -adjfuncname <str>      set extension for adjoint procedures (default %_b)
   -adjmodulename <str>    set extension for adjoint modules and types (default %_diff)
   -modulename <str>       set extension for tangent&adjoint modules and types (default %_diff)
   -inputlanguage <lang>   language of  input files (fortran, fortran90,
                           fortran95, or C)
   -outputlanguage <lang>  language of output files (fortran, fortran90,
                           fortran95, or C)
   -ext <file>             incorporate external library description <file>
   -nolib                  don't load standard libraries descriptions
   -i<n>                   count <n> bytes for an integer (default -i4)
   -r<n>                   count <n> bytes for a real (default -r4)
   -dr<n>                  count <n> bytes for a double real (default -dr8)
   -p<n>                   count <n> bytes for a pointer (default -p8)
   -fixinterface           don't use activity to filter user-given (in)dependent vars
   -noinclude              inline include files
   -debugTGT               insert instructions for debugging tangent mode
   -debugADJ               insert instructions for debugging adjoint mode
   -tracelevel <n>         set the level of detail of trace milestones
   -msglevel <n>           set the level of detail of error messages
   -msginfile              insert error messages in output files
   -dump <file>            write a dump <file>
   -html                   display results in a web browser
   -nooptim <str>          turn off optimization <str> (in {activity, difftypes,
                           diffarguments, stripprimalmodules, spareinit, splitdiff, 
                           mergediff, saveonlyused, tbr, snapshot, diffliveness,
                           deadcontrol, recomputeintermediates,
                           everyoptim}
   -version                display Tapenade version information
 Report bugs to <tapenade@inria.fr>.


Solution 2

Running

$ cd session1/exercises/fg
$ ls
f.f90 g.f90
$ tapenade -head "f(y)/(x)" f.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Diff-liveness analysis turned off
@@ Created ./f_d.f90 
@@ Created ./f_d.msg
$ cat f_d.f90

gives

!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 23 Apr 2025 13:39
!
!  Differentiation of f in forward (tangent) mode:
!   variations   of useful results: y
!   with respect to varying inputs: x
!   RW status of diff variables: x:in y:out
SUBROUTINE F_D(x, xd, y, yd)
  IMPLICIT NONE
  REAL, DIMENSION(2), INTENT(IN) :: x
  REAL, DIMENSION(2), INTENT(IN) :: xd
  REAL, INTENT(OUT) :: y
  REAL, INTENT(OUT) :: yd
  yd = x(2)*xd(1) + x(1)*xd(2)
  y = x(1)*x(2)
END SUBROUTINE F_D

Running

$ tapenade -head "g(z)/(y)" g.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Diff-liveness analysis turned off
@@ Created ./g_d.f90 
@@ Created ./g_d.msg
$ cat g_d.f90

gives

!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 23 Apr 2025 13:39
!
!  Differentiation of g in forward (tangent) mode:
!   variations   of useful results: z
!   with respect to varying inputs: y
!   RW status of diff variables: y:in z:out
SUBROUTINE G_D(y, yd, z, zd)
  IMPLICIT NONE
  REAL, INTENT(IN) :: y
  REAL, INTENT(IN) :: yd
  REAL, DIMENSION(2), INTENT(OUT) :: z
  REAL, DIMENSION(2), INTENT(OUT) :: zd
  INTRINSIC COS
  INTRINSIC SIN
  zd = (/COS(y)*yd, -(SIN(y)*yd)/)
  z = (/SIN(y), COS(y)/)
END SUBROUTINE G_D

Verification: the Taylor test

Recall the Taylor expansion of a differentiable scalar function \(f:A\rightarrow\mathbb{R}\), for \(A\subseteq\mathbb{R}\): \[ f(x+\epsilon)=f(x)+\epsilon f'(x) + \frac{1}{2!}\epsilon^2f''(x) + \dots + \frac{1}{n!}\epsilon^nf^{(n)}(x) + \dots, \] for some \(\epsilon\in\mathbb{R}\), i.e., \[ f(x+\epsilon)=f(x)+\epsilon f'(x) + \mathcal{O}(\epsilon^2). \] This gives rise to the first-order forward difference approximation of the first derivative as follows: \[ f'(x) \approx \frac{f(x+\epsilon)-f(x)}{\epsilon}. \]

This is a rather crude approximation to a derivative, but it can do the job, given an appropriate choice of \(\epsilon\).

The idea of the Taylor test is to take smaller and smaller spacing values \(\epsilon\) about a given input value and to check that the residual of the first-order Taylor expansion, with the derivative supplied by AD, converges quadratically. That is, we verify that \[ \|f(x+\epsilon)-f(x)-\epsilon\:\texttt{f\_d}(x)\|=\mathcal{O}(\epsilon^2), \] where \(\texttt{f\_d}\) is the AD-generated derivative.
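
A generic helper for this test might look as follows (a minimal sketch with our own function names, written for functions of a vector input): the returned errors should drop by a factor of roughly 100 each time \(\epsilon\) is reduced by a factor of 10.

import numpy as np

def taylor_test(func, jvp, x, xdot, spacings=(1e-1, 1e-2, 1e-3, 1e-4)):
    """Return ||func(x + eps*xdot) - func(x) - eps*jvp(x, xdot)|| for each
    spacing eps. If jvp is a correct directional derivative of func, these
    errors converge quadratically as eps decreases."""
    unperturbed = func(x)
    derivative = jvp(x, xdot)
    return [np.linalg.norm(func(x + eps * xdot) - unperturbed - eps * derivative)
            for eps in spacings]

# Example usage with a simple function and its hand-coded directional derivative
print(taylor_test(np.sin, lambda x, xd: np.cos(x) * xd, np.array([1.0]), np.array([1.0])))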

\(f\) & \(g\) example: Taylor test

We return to the previous example and double-check that we are happy with the output of the AD tool.

For simplicity of demonstration, we translate f, g, f_d, and g_d into Python and conduct the Taylor test here.

def f(x):
    """Python translation of the `f` subroutine from `session1/exercises/fg/f.f90` using NumPy."""
    return x[0] * x[1]

def g(y):
    """Python translation of the `g` subroutine from `session1/exercises/fg/g.f90` using NumPy."""
    return np.array([np.sin(y), np.cos(y)])
def f_d(x, xd):
    """Python translation of the `f_d` subroutine generated by Tapenade in `session1/exercises/fg/f_d.f90` using NumPy."""
    yd = x[1] * xd[0] + x[0] * xd[1]
    y = x[0] * x[1]
    return y, yd

def g_d(y, yd):
    """Python translation of the `g_d` subroutine generated by Tapenade in `session1/exercises/fg/g_d.f90` using NumPy."""
    zd = np.array([np.cos(y) * yd, -np.sin(y) * yd])
    z = np.array([np.sin(y), np.cos(y)])
    return z, zd
# Choose arbitrary inputs for the composition
x = np.array([1.0, 1.0])

# Choose seed vector
xd = np.array([1.0, 0.0])

# Compute the derivative using forward mode AD
y, yd = f_d(x, xd)
_, derivative_ad = g_d(y, yd)

# Run the Taylor test over several spacing values
spacings = [1.0, 0.1, 0.01, 0.001]
errors = []
for spacing in spacings:

    # Compute the perturbation in the seed vector direction
    epsilon = spacing * xd

    # Compute the discrepancy
    errors.append(np.linalg.norm(g(f(x + epsilon)) - g(f(x)) - spacing * derivative_ad))

# Plot the solution, demonstrating that the expected quadratic convergence is achieved
fig, axes = plt.subplots()
axes.loglog(spacings, errors, "--x")
axes.set_xlabel(r"$\epsilon$ spacing")
axes.set_ylabel(r"$\ell_2$ error")
annotation.slope_marker((1e-2, 1e-4), 2, ax=axes, invert=True)
axes.grid()

Above, we computed the derivative of \(h=f\circ g\) with respect to the first component, \(x_1\).

Optional exercise

Perform the Taylor test to double-check you are happy with the AD-generated derivative with respect to the second component, \(x_2\), too.

Hint: You only need to change one line.

Example: ODE-constrained optimisation

Consider the scalar ordinary differential equation (ODE) \[ \frac{\mathrm{d}u}{\mathrm{d}t}=f(u),\quad u(0)=u_0, \] where \(t\in[0,T]\) is the time variable, \(T>0\) is the end time, and \(u_0\in\mathbb{R}\) is the initial condition. Given some \(f:A\rightarrow\mathbb{R}\) with \(A\subseteq\mathbb{R}\), we seek to solve the ODE for \(u:[0,T]\rightarrow\mathbb{R}\).

For simplicity, let’s consider the ODE \[ \frac{\mathrm{d}u}{\mathrm{d}t}=u,\quad u(0)=1, \] i.e., \(f(u)=u\).

Optional exercise

Convince yourself that the analytical solution of the ODE is \(u(t)=\mathrm{e}^t\).

Solutions

Plugging \(u(t)=\mathrm{e}^t\) into the LHS gives \(\frac{\mathrm{d}u}{\mathrm{d}t}=\mathrm{e}^t=u\), which satisfies the ODE. Checking the initial condition, we have \(u(0)=\mathrm{e}^0=1\), so that is satisfied too.

ODE example: forward vs backward Euler

You’re probably aware of the simplest example of an explicit timestepping method to approximate the solution of the ODE. This is the forward Euler (a.k.a. explicit Euler): \[ \frac{u_{k}-u_{k-1}}{\Delta t}=f(u_{k-1}), \] for \(k\in\mathbb{N}\) and some timestep \(\Delta t>0\).

You’re probably also aware that the simplest example of an implicit timestepping method to approximate the solution of the ODE is backward Euler (a.k.a. implicit Euler): \[ \frac{u_{k}-u_{k-1}}{\Delta t}=f(u_k), \] for \(k\in\mathbb{N}\) and some timestep \(\Delta t>0\).

These are actually special cases of a more general theta-method, \[ \frac{u_{k}-u_{k-1}}{\Delta t}=(1-\theta)f(u_{k-1})+\theta f(u_k), \] where \(\theta\in[0,1]\).

ODE example: applying forward and backward Euler

Using the method above, our problem reads \[ \frac{u_{k}-u_{k-1}}{\Delta t}=(1-\theta)u_{k-1}+\theta u_k, \] which can be rearranged to give \[ u_{k}=\frac{1+\Delta t(1-\theta)}{1-\Delta t\theta}u_{k-1}. \]
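
A minimal Python counterpart of this update (our own sketch, separate from the Fortran sources used in the exercise) shows the end-time error for the two Euler methods and for \(\theta=0.5\), anticipating the accuracy gain that the optimisation below aims for:

import numpy as np

def solve_theta(theta, dt=0.1, end_time=1.0):
    """Integrate du/dt = u, u(0) = 1 with the theta-method update derived above."""
    u, t = 1.0, 0.0
    while t < end_time - 1e-5:
        u *= (1 + dt * (1 - theta)) / (1 - dt * theta)
        t += dt
    return u

for theta in (0.0, 0.5, 1.0):
    print(theta, abs(solve_theta(theta) - np.exp(1.0)))  # error at t = 1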

The forward and backward Euler approaches have been implemented for the case \(f(u)=u\) in the session1/exercises/ode subdirectory.

$ cd exercises/ode
$ ls
backward_euler.f90  cost_mod.f90  forward_euler.f90  gradient_descent.f90  Makefile  theta_method_mod.f90

where theta_method_mod.f90 is a Fortran module containing several subroutines, including

! Single iteration of a theta method for solving the ODE initial value problem
!   du/dt = u, u(0) = 1
subroutine theta_step(u, u_, dt, theta)
  implicit none
  real, intent(out) :: u    ! Solution at current timestep
  real, intent(in) :: u_    ! Solution at previous timestep
  real, intent(in) :: dt    ! Timestep length
  real, intent(in) :: theta ! Theta parameter

  u = u_ * (1 + dt * (1 - theta)) / (1 - dt * theta)
end subroutine theta_step

Here forward_euler.f90 and backward_euler.f90 contain programs to solve the ODE, making use of code from the module above, but with values \(\theta=0\) and \(\theta=1\), respectively. In the former case, we have

! Program for running the forward Euler method to solve the ODE initial value problem
!   du/dt = u, u(0) = 1
program forward_euler
  use theta_method_mod, only: theta_method
  implicit none

  real, parameter :: theta = 0.0  ! Forward Euler corresponds to theta = 0.0
  real :: u                       ! Solution variable
  character(len=100), parameter :: filename = "forward.csv" ! Filename to save results to

  call theta_method(theta, u, filename)
end program forward_euler

Details on running the code

Compile with

$ make forward_euler
$ make backward_euler

and then run with

$ ./forward_euler
$ ./backward_euler
$ ls
backward.csv    backward_euler.f90  forward.csv    forward_euler.f90     Makefile              theta_method_mod.mod
backward_euler  cost_mod.f90        forward_euler  gradient_descent.f90  theta_method_mod.f90  theta_method_mod.o
df_forward = pandas.read_csv("exercises/ode/forward.csv")
df_backward = pandas.read_csv("exercises/ode/backward.csv")

fig, axes = plt.subplots()

times = np.linspace(0, 1, 101)

axes.plot(times, np.exp(times), "-", color="k", label="Analytical solution")
axes.plot(df_forward["t"], df_forward["u"], "--x", label="Forward Euler")
axes.plot(df_backward["t"], df_backward["u"], ":o", label="Backward Euler")
axes.legend()
axes.grid()

ODE example: source transformation

As we see from the plot above, the forward Euler method tends to underestimate the solution, whereas the backward Euler method tends to overestimate it. Let’s try to optimise the value of \(\theta\) to best match the solution using a gradient-based optimisation method. To do that, we first need the gradient. AD enables us to do this automatically.

The optimisation problem we seek to solve is to minimise some error measure \(J\) for the approximation of \(u\) by varying \(\theta\). That is, \[ \min_{\theta\in[0,1]}J(u;\theta), \] where the notation \(J(u;\theta)\) refers to the implicit dependence of the solution approximation \(u\) on \(\theta\).

Note Forward and backward Euler are first-order accurate methods. By optimising the \(\theta\) parameter, we can arrive at a second-order accurate method.

Exercise

Navigate to session1/exercises/ode and apply forward mode AD to the theta_method subroutine in theta_method_mod.f90 by differentiating the module with Tapenade.

Solutions

$ tapenade -tgtmodulename %_d -head "theta_method_mod.theta_method(theta)\(u)" theta_method_mod.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Diff-liveness analysis turned off
@@ Created ./theta_method_mod_d.f90 
@@ Created ./theta_method_mod_d.msg

$ cat theta_method_mod_d.f90

gives

!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
! Module containing subroutines for solving the ODE initial value problem
!   du/dt = u, u(0) = 1
MODULE THETA_METHOD_MOD_D
  IMPLICIT NONE
  PRIVATE 
  PUBLIC :: theta_method
  PUBLIC :: theta_method_d

CONTAINS
! Apply the initial condition for the ODE initial value problem
!   du/dt = u, u(0) = 1
  SUBROUTINE INITIAL_CONDITION(u0)
    IMPLICIT NONE
! Initial condition value
    REAL, INTENT(OUT) :: u0
    u0 = 1.0
  END SUBROUTINE INITIAL_CONDITION

!  Differentiation of theta_step in forward (tangent) mode:
!   variations   of useful results: u
!   with respect to varying inputs: u_ theta
! Single iteration of a theta method for solving the ODE initial value problem
!   du/dt = u, u(0) = 1
  SUBROUTINE THETA_STEP_D(u, ud, u_, u_d, dt, theta, thetad)
    IMPLICIT NONE
! Solution at current timestep
    REAL, INTENT(OUT) :: u
    REAL, INTENT(OUT) :: ud
! Solution at previous timestep
    REAL, INTENT(IN) :: u_
    REAL, INTENT(IN) :: u_d
! Timestep length
    REAL, INTENT(IN) :: dt
! Theta parameter
    REAL, INTENT(IN) :: theta
    REAL, INTENT(IN) :: thetad
    REAL :: temp
    temp = u_/(-(dt*theta)+1)
    ud = (dt*(1-theta)+1)*(u_d+temp*dt*thetad)/(1-dt*theta) - temp*dt*&
&     thetad
    u = (dt*(1-theta)+1)*temp
  END SUBROUTINE THETA_STEP_D

! Single iteration of a theta method for solving the ODE initial value problem
!   du/dt = u, u(0) = 1
  SUBROUTINE THETA_STEP(u, u_, dt, theta)
    IMPLICIT NONE
! Solution at current timestep
    REAL, INTENT(OUT) :: u
! Solution at previous timestep
    REAL, INTENT(IN) :: u_
! Timestep length
    REAL, INTENT(IN) :: dt
! Theta parameter
    REAL, INTENT(IN) :: theta
    u = u_*(1+dt*(1-theta))/(1-dt*theta)
  END SUBROUTINE THETA_STEP

!  Differentiation of theta_method in forward (tangent) mode:
!   variations   of useful results: u
!   with respect to varying inputs: theta
!   RW status of diff variables: u:out theta:in
! Solve the ODE initial value problem
!   du/dt = u, u(0) = 1
! using a theta timestepping method, writing the solution trajectory to file
  SUBROUTINE THETA_METHOD_D(theta, thetad, u, ud, filename)
    IMPLICIT NONE
! Theta parameter
    REAL, INTENT(IN) :: theta
    REAL, INTENT(IN) :: thetad
! Solution at the final time
    REAL, INTENT(OUT) :: u
    REAL, INTENT(OUT) :: ud
! Output filename
    CHARACTER(len=*), INTENT(IN) :: filename
! Timestep length
    REAL, PARAMETER :: dt=0.1
! End time of the simulation
    REAL, PARAMETER :: end_time=1.0
! Dummy variable for time
    REAL :: t
! Lagged solution variable (previous timestep)
    REAL :: u_
    REAL :: u_d
! Initialisation
    t = 0.0
    CALL INITIAL_CONDITION(u_)
! Create a CSV file and write out the header and initial condition
    OPEN(unit=10, file=filename) 
    WRITE(unit=10, fmt='(''t,u'')') 
    WRITE(unit=10, fmt=100) t, u_
    ud = 0.0
    u_d = 0.0
! Timestepping loop
    DO WHILE (t .LT. end_time - 1e-05)
      CALL THETA_STEP_D(u, ud, u_, u_d, dt, theta, thetad)
      u_d = ud
      u_ = u
      t = t + dt
      WRITE(unit=10, fmt=100) t, u
    END DO
! Remember to close the CSV file
    CLOSE(unit=10) 
 100 FORMAT(f4.2,',',f4.2)
  END SUBROUTINE THETA_METHOD_D

! Solve the ODE initial value problem
!   du/dt = u, u(0) = 1
! using a theta timestepping method, writing the solution trajectory to file
  SUBROUTINE THETA_METHOD(theta, u, filename)
    IMPLICIT NONE
! Theta parameter
    REAL, INTENT(IN) :: theta
! Solution at the final time
    REAL, INTENT(OUT) :: u
! Output filename
    CHARACTER(len=*), INTENT(IN) :: filename
! Timestep length
    REAL, PARAMETER :: dt=0.1
! End time of the simulation
    REAL, PARAMETER :: end_time=1.0
! Dummy variable for time
    REAL :: t
! Lagged solution variable (previous timestep)
    REAL :: u_
! Initialisation
    t = 0.0
    CALL INITIAL_CONDITION(u_)
! Create a CSV file and write out the header and initial condition
    OPEN(unit=10, file=filename) 
    WRITE(unit=10, fmt='(''t,u'')') 
    WRITE(unit=10, fmt=100) t, u_
! Timestepping loop
    DO WHILE (t .LT. end_time - 1e-05)
      CALL THETA_STEP(u, u_, dt, theta)
      u_ = u
      t = t + dt
      WRITE(unit=10, fmt=100) t, u
    END DO
! Remember to close the CSV file
    CLOSE(unit=10) 
 100 FORMAT(f4.2,',',f4.2)
  END SUBROUTINE THETA_METHOD

END MODULE THETA_METHOD_MOD_D

ODE example: optimisation with gradient descent

Let’s solve this ODE problem with one of the simplest gradient-based optimisation approaches: gradient descent. This amounts to an initial guess \(\theta_0\), followed by iterative updates \[ \theta_{k+1}=\theta_k+\alpha\:p_k, \] where \(\alpha>0\) is the step length and \(p_k\) is the descent direction. For gradient descent, we simply take \[ p_k=-\frac{\mathrm{d}J_k}{\mathrm{d}\theta_k}. \]

Since we know the analytical solution for this problem, we may make an ‘artificial’ choice of cost function such as \[ J(u;\theta)=(u(1)-\mathrm{e}^1)^2, \] where \(\mathrm{e}^1\) is the analytical solution at the end time \(t=1\). This is implemented as a subroutine in the cost_mod.f90 module:

! Cost function evaluating the l2 error at the end time against the analytical solution u(t)=exp(t)
subroutine cost_function(u, J)
  implicit none
  real, intent(in) :: u           ! Numerical solution at the end time
  real, intent(out) :: J          ! Cost function value
  real, parameter :: e = exp(1.0) ! The exponential constant, e=2.71828...

  J = (u - e) ** 2
end subroutine cost_function
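
To see the whole optimisation loop in one place, here is a minimal Python sketch (our own, purely illustrative; the exercise itself uses the Fortran sources and Tapenade-generated derivatives) that hand-codes the forward-mode derivatives of the θ-step and of the cost function and feeds them into gradient descent:

import numpy as np

def solve_and_derivative(theta, dt=0.1, end_time=1.0):
    """Theta-method solve of du/dt = u, u(0) = 1, propagating du/dtheta forward
    alongside u (hand-written forward mode)."""
    u, ud, t = 1.0, 0.0, 0.0
    while t < end_time - 1e-5:
        a = (1 + dt * (1 - theta)) / (1 - dt * theta)
        ad = dt ** 2 / (1 - dt * theta) ** 2  # derivative of a with respect to theta
        u, ud = a * u, a * ud + ad * u        # product rule
        t += dt
    return u, ud

def cost_and_gradient(theta):
    """J(theta) = (u(1) - e)^2 and dJ/dtheta via the chain rule."""
    u, ud = solve_and_derivative(theta)
    return (u - np.e) ** 2, 2 * (u - np.e) * ud

# Plain gradient descent on theta (step length and tolerance chosen arbitrarily)
theta, alpha = 0.0, 0.1
for it in range(2000):
    J, dJ = cost_and_gradient(theta)
    if abs(dJ) < 1e-5:
        break
    theta -= alpha * dJ
print(it, theta, J)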

Exercise

The gradient descent algorithm is implemented for our ODE problem in session1/exercises/ode/gradient_descent.f90.

  1. However, there is another missing piece: you will also need to differentiate the cost function. Convince yourself you are satisfied with the output.
  2. Run the gradient descent optimisation algorithm and execute the following two cells to visualise the results.
  3. Play around with the optimisation parameters defined in gradient_descent.f90. Can you achieve convergence in fewer iterations without changing the gtol convergence tolerance?

Solution 1

$ tapenade -tgtmodulename %_d cost_mod.f90 
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Command: Took subroutine cost_function as default differentiation root
Diff-liveness analysis turned off
@@ Created ./cost_mod_d.f90 
@@ Created ./cost_mod_d.msg
$ cat cost_mod_d.f90

gives

!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
! Module containing subroutines for cost functions
MODULE COST_MOD_D
  IMPLICIT NONE
  PRIVATE 
  PUBLIC :: cost_function
  PUBLIC :: cost_function_d

CONTAINS
!  Differentiation of cost_function in forward (tangent) mode:
!   variations   of useful results: j
!   with respect to varying inputs: u
!   RW status of diff variables: j:out u:in
! Cost function evaluating the l2 error at the end time against the analytical solution u(t)=exp(t)
  SUBROUTINE COST_FUNCTION_D(u, ud, j, jd)
    IMPLICIT NONE
! Numerical solution at the end time
    REAL, INTENT(IN) :: u
    REAL, INTENT(IN) :: ud
! Cost function value
    REAL, INTENT(OUT) :: j
    REAL, INTENT(OUT) :: jd
    INTRINSIC EXP
! The exponential constant, e=2.71828...
    REAL, PARAMETER :: e=EXP(1.0)
    jd = 2*(u-e)*ud
    j = (u-e)**2
  END SUBROUTINE COST_FUNCTION_D

! Cost function evaluating the l2 error at the end time against the analytical solution u(t)=exp(t)
  SUBROUTINE COST_FUNCTION(u, j)
    IMPLICIT NONE
! Numerical solution at the end time
    REAL, INTENT(IN) :: u
! Cost function value
    REAL, INTENT(OUT) :: j
    INTRINSIC EXP
! The exponential constant, e=2.71828...
    REAL, PARAMETER :: e=EXP(1.0)
    j = (u-e)**2
  END SUBROUTINE COST_FUNCTION

END MODULE COST_MOD_D


Solution 2

Compile gradient_descent.f90 and its dependencies with

$ make gradient_descent

You can then run

$ ./gradient_descent

You should get convergence in around 755 iterations.


Solution 3

Increasing the step length slightly to \(\alpha=0.199\), you should get convergence in around 378 iterations - about half as many. Interestingly, the choice \(\alpha=0.2\) fails to converge. Bonus exercise: plot the progress for \(\alpha=0.2\) in the cell below to see what went wrong.

df_opt = pandas.read_csv("exercises/ode/optimisation_progress.csv")

costs = np.array(df_opt["J"])
costs[0] = np.nan  # Remove the first entry because it's uninitialised garbage
controls = np.array(df_opt["theta"])

fig, axes = plt.subplots(ncols=2, figsize=(12, 5))
axes[0].loglog(df_opt["it"], costs, "--", label="Cost function value")
axes[0].legend()
axes[0].grid()
axes[1].plot(df_opt["it"], df_opt["theta"], "--", label="Control value")
axes[1].legend()
axes[1].grid()
df_optimised = pandas.read_csv("exercises/ode/optimised.csv")

fig, axes = plt.subplots()
axes.plot(times, np.exp(times), "-", color="k", label="Analytical solution")
axes.plot(df_forward["t"], df_forward["u"], "--x", label="Forward Euler")
axes.plot(df_backward["t"], df_backward["u"], ":o", label="Backward Euler")
axes.plot(df_optimised["t"], df_optimised["u"], "-.^", label=rf"Optimised ($\theta={controls[-1]:.4f}$)")
axes.legend()
axes.grid()

As we might hope, the optimised value of \(\theta\) gives a much better approximation.

Calculating the full Jacobian

Question

Given a map \(\mathbf{f}\), some input \(\mathbf{x}\), and some seed \(\dot{\mathbf{x}}\), we have the Jacobian vector product \[\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}.\]

How can we use this to compute the full Jacobian matrix \(\nabla\mathbf{f}(\mathbf{x})\)?

Solution

\[ \nabla\mathbf{f}(\mathbf{x}) =\nabla\mathbf{f}(\mathbf{x})\mathbf{I}_n =\nabla\mathbf{f}(\mathbf{x})\begin{bmatrix}\mathbf{e}_1,\mathbf{e}_2,\dots,\mathbf{e}_n\end{bmatrix} =\begin{bmatrix}\nabla\mathbf{f}(\mathbf{x})\mathbf{e}_1,\nabla\mathbf{f}(\mathbf{x})\mathbf{e}_2,\dots,\nabla\mathbf{f}(\mathbf{x})\mathbf{e}_n\end{bmatrix}. \]

Apply JVP to the \(n\) canonical unit vectors.
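
As a concrete illustration, reusing the Python translations f_d and g_d from the Taylor test above, we can assemble the full \(2\times2\) Jacobian of \(h=f\circ g\) by seeding each canonical unit vector in turn:

# Assemble the Jacobian of h = f∘g column by column: one JVP per unit vector
x = np.array([1.0, 1.0])
n = len(x)
jacobian = np.zeros((2, n))
for i in range(n):
    e_i = np.zeros(n)
    e_i[i] = 1.0               # i-th canonical unit vector as the seed
    y, yd = f_d(x, e_i)
    _, zd = g_d(y, yd)
    jacobian[:, i] = zd        # column i holds the derivatives with respect to x_i
print(jacobian)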

Vector mode

The expression \(\begin{bmatrix}\nabla\mathbf{f}(\mathbf{x})\mathbf{e}_1,\nabla\mathbf{f}(\mathbf{x})\mathbf{e}_2,\dots,\nabla\mathbf{f}(\mathbf{x})\mathbf{e}_n\end{bmatrix}\) above is a concatenation of \(n\) JVPs. Whilst it’s possible to apply these in a loop, most AD tools support a ‘vector’ (multi-directional) mode, which propagates a whole seed matrix in a single call. We denote this Jacobian-matrix product (labelled VJP, matching the code below) by:

\[\text{VJP}(\mathbf{f},\mathbf{x},\dot{\mathbf{X}}):=\begin{bmatrix}\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}_1,\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}_2,\dots,\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}_k\end{bmatrix},\]

for a seed matrix \(\dot{\mathbf{X}}:=\begin{bmatrix}\dot{\mathbf{x}}_1,\dot{\mathbf{x}}_2,\dots,\dot{\mathbf{x}}_k\end{bmatrix}^T\in\mathbb{R}^{k\times n}\), \(k\in\mathbb{N}\), whose rows are the seed vectors.

Example: tridiagonal matrix

Consider the classic central difference approximation to the second derivative (yes, more derivatives!): \[ \frac{\mathrm{d}^2u}{\mathrm{d}x^2}\approx\frac{u(x-h)-2u(x)+u(x+h)}{h^2}, \] for some uniform spacing \(h>0\). For a uniform discretisation \(\{x_k\}_{k=1}^n\) of an interval, this can be written as \[ \left.\frac{\mathrm{d}^2u}{\mathrm{d}x^2}\right|_{x=x_k}\approx\frac{u_{k-1}-2u_k+u_{k+1}}{h^2}, \] where \(u_k\) is an approximation of \(u(x_k)\). This is implemented as a Fortran subroutine central_diff inside the Fortran module at session1/exercises/sparse/central_diff.f90.

module central_diff_mod
  implicit none

  private
  public :: central_diff

contains

  ! Central difference approximation for the second derivative of a periodic 1D function
  subroutine central_diff(u, approx, h, n)
    implicit none

    integer, intent(in) :: n                   ! Number of grid points
    real, intent(in) :: h                      ! Uniform grid spacing
    real, dimension(n), intent(in) :: u        ! Input vector
    real, dimension(n), intent(out) :: approx  ! Central difference approximation
    integer :: i                               ! Dummy index for looping

    if (size(u, 1) /= n) then
      print *, "Invalid input array size"
      stop 1
    end if
    if (size(approx, 1) /= n) then
      print *, "Invalid output array size"
      stop 1
    end if

    do i = 1, n
      if (i == 1) then
        ! Periodic boundary on the left
        approx(i) = (u(n) - 2 * u(i) + u(i+1)) / h ** 2
      else if (i == n) then
        ! Periodic boundary on the right
        approx(i) = (u(i-1) - 2 * u(i) + u(1)) / h ** 2
      else
        ! Interior points
        approx(i) = (u(i-1) - 2 * u(i) + u(i+1)) / h ** 2
      end if
    end do

  end subroutine central_diff

end module central_diff_mod

Exercise

Apply Tapenade in vector mode to compute the derivative of the output array approx with respect to the input array u.

Solution

Running

$ tapenade -tgtmodulename %_d -head "central_diff(approx)/(u)" -vector central_diff_mod.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
@@ Options:  multiDirectional
Command: Procedure central_diff understood as central_difference.central_diff
Diff-liveness analysis turned off
@@ Created ./central_diff_mod_dv.f90 
@@ Created ./central_diff_mod_dv.msg
$ cat central_diff_mod_dv.f90

gives

!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
MODULE CENTRAL_DIFF_MOD_DV
  USE DIFFSIZES
!  Hint: nbdirsmax should be the maximum number of differentiation directions
  IMPLICIT NONE
  PRIVATE 
  PUBLIC :: central_diff
  PUBLIC :: central_diff_dv

CONTAINS
!  Differentiation of central_diff in forward (tangent) mode (with options multiDirectional):
!   variations   of useful results: approx
!   with respect to varying inputs: u
!   RW status of diff variables: approx:out u:in
! Central difference approximation for the second derivative of a periodic 1D function
  SUBROUTINE CENTRAL_DIFF_DV(u, ud, approx, approxd, h, n, nbdirs)
    USE DIFFSIZES
!  Hint: nbdirsmax should be the maximum number of differentiation directions
    IMPLICIT NONE
! Number of grid points
    INTEGER, INTENT(IN) :: n
! Uniform grid spacing
    REAL, INTENT(IN) :: h
! Input vector
    REAL, DIMENSION(n), INTENT(IN) :: u
    REAL, DIMENSION(nbdirsmax, n), INTENT(IN) :: ud
! Central difference approximation
    REAL, DIMENSION(n), INTENT(OUT) :: approx
    REAL, DIMENSION(nbdirsmax, n), INTENT(OUT) :: approxd
! Dummy index for looping
    INTEGER :: i
    INTRINSIC SIZE
    INTEGER :: nd
    INTEGER :: nbdirs
    IF (SIZE(u, 1) .NE. n) THEN
      PRINT*, 'Invalid input array size'
      STOP
    ELSE IF (SIZE(approx, 1) .NE. n) THEN
      PRINT*, 'Invalid output array size'
      STOP
    ELSE
      approxd = 0.0
      DO i=1,n
        IF (i .EQ. 1) THEN
! Periodic boundary on the left
          DO nd=1,nbdirs
            approxd(nd, i) = (ud(nd, n)-2*ud(nd, i)+ud(nd, i+1))/h**2
          END DO
          approx(i) = (u(n)-2*u(i)+u(i+1))/h**2
        ELSE IF (i .EQ. n) THEN
! Periodic boundary on the right
          DO nd=1,nbdirs
            approxd(nd, i) = (ud(nd, i-1)-2*ud(nd, i)+ud(nd, 1))/h**2
          END DO
          approx(i) = (u(i-1)-2*u(i)+u(1))/h**2
        ELSE
! Interior points
          DO nd=1,nbdirs
            approxd(nd, i) = (ud(nd, i-1)-2*ud(nd, i)+ud(nd, i+1))/h**2
          END DO
          approx(i) = (u(i-1)-2*u(i)+u(i+1))/h**2
        END IF
      END DO
    END IF
  END SUBROUTINE CENTRAL_DIFF_DV

! Central difference approximation for the second derivative of a periodic 1D function
  SUBROUTINE CENTRAL_DIFF(u, approx, h, n)
    IMPLICIT NONE
! Number of grid points
    INTEGER, INTENT(IN) :: n
! Uniform grid spacing
    REAL, INTENT(IN) :: h
! Input vector
    REAL, DIMENSION(n), INTENT(IN) :: u
! Central difference approximation
    REAL, DIMENSION(n), INTENT(OUT) :: approx
! Dummy index for looping
    INTEGER :: i
    INTRINSIC SIZE
    IF (SIZE(u, 1) .NE. n) THEN
      PRINT*, 'Invalid input array size'
      STOP
    ELSE IF (SIZE(approx, 1) .NE. n) THEN
      PRINT*, 'Invalid output array size'
      STOP
    ELSE
      DO i=1,n
        IF (i .EQ. 1) THEN
! Periodic boundary on the left
          approx(i) = (u(n)-2*u(i)+u(i+1))/h**2
        ELSE IF (i .EQ. n) THEN
! Periodic boundary on the right
          approx(i) = (u(i-1)-2*u(i)+u(1))/h**2
        ELSE
! Interior points
          approx(i) = (u(i-1)-2*u(i)+u(i+1))/h**2
        END IF
      END DO
    END IF
  END SUBROUTINE CENTRAL_DIFF

END MODULE CENTRAL_DIFF_MOD_DV

Note: The differentiated module central_diff_mod_dv depends on a module diffsizes, which specifies the number of differentiation directions. The examples that follow each provide a definition of this module (diffsizes_dense.f90, diffsizes_compressed.f90).

Tridiagonal example: compute dense Jacobian

In session1/exercises/sparse/dense_jacobian.f90 we provide a program for computing the Jacobian of the centred difference function over the uniform partition of the interval.

! Naive program for computing a tridiagonal Jacobian
program dense_jacobian
  use diffsizes, only: m => nbdirsmax  ! Number of seed vectors, from diffsizes_dense.f90
  use central_diff_mod_dv, only: central_diff_dv ! Tapenade Forward mode derivative

  implicit none

  logical, parameter :: output = .true. ! Flag for whether to write output to file
  integer, parameter :: n = 10          ! Number of grid points
  real, parameter :: h = 1.0            ! Uniform grid spacing
  real, dimension(n) :: u               ! Input vector
  real, dimension(n) :: approx          ! Central difference approximation
  real, dimension(m,n) :: seed          ! Seed matrix for the VJP
  real, dimension(m,n) :: jacobian      ! Jacobian for the central difference calculation
  integer :: i                          ! Dummy index for looping

  if (m /= n) then
    print *, "Error: number of grid points must match number of seed vectors."
    stop 1
  end if

  ! Specify some arbitrary input
  u(:) = 1.0

  ! Set up the seed matrix as the nxn identity
  seed(:,:) = 0.0
  do i = 1, n
    seed(i,i) = 1.0
  end do

  ! Propagate the seed matrix through the VJP
  call central_diff_dv(u, seed, approx, jacobian, h, n, n)

  ! Write out the result to file
  if (output) then
    open(unit=10, file="dense_jacobian.dat")
    do i = 1, n
      write(unit=10, fmt=100) jacobian(i,:)
    end do
    100 format(10(f4.1,","))
    close(unit=10)
  end if

end program dense_jacobian

Details on running the code

Compile with

$ make dense_jacobian

and then run with

$ ./dense_jacobian

Compiling and running produces a file dense_jacobian.dat, which we can read and view in the notebook. Note that the computation here required \(n\) applications of forward mode AD, where \(n\) is the number of gridpoints.

print_matrix("exercises/sparse/dense_jacobian.dat")

Sparse AD

We can compute the full Jacobian with \[\nabla\mathbf{f}(\mathbf{x})=\nabla\mathbf{f}(\mathbf{x})\mathbf{I}_n=\nabla\mathbf{f}(\mathbf{x})\begin{bmatrix}\mathbf{e}_1,\mathbf{e}_2,\dots,\mathbf{e}_n\end{bmatrix}.\]

  • But what about when \(n\) gets very large?
  • And what about when the Jacobian is sparse?

Diagonal Jacobian example

Suppose \(\nabla\mathbf{f}(\mathbf{x})\) is diagonal, say

\[\nabla\mathbf{f}(\mathbf{x})=\begin{bmatrix}J_1\\& J_2\\ & & \ddots\\ & & & J_n\end{bmatrix}.\]

Then, for a seed vector \(\dot{\mathbf{x}}=\begin{bmatrix}\dot{x}_1 & \dot{x}_2 & \dots & \dot{x}_n\end{bmatrix}^T\), we have

\[ \nabla\mathbf{f}(\mathbf{x})\:\dot{\mathbf{x}} =\begin{bmatrix}J_1\\& J_2\\ & & \ddots\\ & & & J_n\end{bmatrix} \begin{bmatrix}\dot{x}_1 \\ \dot{x}_2 \\ \vdots \\ \dot{x}_n\end{bmatrix} =\begin{bmatrix}J_1\dot{x}_1 \\ J_2\dot{x}_2 \\ \vdots \\ J_n\dot{x}_n\end{bmatrix}. \]

Making the clever choice of seed vector \(\dot{\mathbf{x}}:=\mathbf{e}=\begin{bmatrix}1 & 1 & \dots & 1\end{bmatrix}^T\), we can apply a single JVP and then back out the full Jacobian by putting each entry on the diagonal: \[ \mathrm{diag}(\nabla\mathbf{f}(\mathbf{x})\:\mathbf{e}) =\mathrm{diag}\left(\begin{bmatrix}J_1 \\ J_2 \\ \vdots \\ J_n \end{bmatrix}\right) =\begin{bmatrix}J_1\\& J_2\\ & & \ddots\\ & & & J_n\end{bmatrix} =\nabla\mathbf{f}(\mathbf{x}). \]
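
A tiny NumPy illustration of this (our own toy example, using an elementwise map whose Jacobian is diagonal and a hand-coded JVP):

import numpy as np

def f_elementwise(x):
    """Elementwise map f(x) = x**2, whose Jacobian is diag(2*x)."""
    return x ** 2

def f_elementwise_jvp(x, xdot):
    """Hand-coded forward-mode derivative of the elementwise map."""
    return 2.0 * x * xdot

x = np.array([1.0, 2.0, 3.0])
diagonal = f_elementwise_jvp(x, np.ones_like(x))  # a single JVP with seed e = (1,...,1)
print(np.diag(diagonal))                          # the full (diagonal) Jacobian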

Sparse AD: what colour is your Jacobian?

The example above is the simplest case of what is known as colouring. The idea is to group the columns of the Jacobian such that no two columns in the same group have a nonzero entry in the same row (they are ‘structurally orthogonal’), so each group can be probed with a single seed vector. This gives rise to compression approaches for sparse AD.

Figure 5: Jacobian colouring diagram taken from (Gebremedhin et al., 2005).
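
As a minimal illustration of the compression idea (our own sketch: a non-periodic tridiagonal map for simplicity, with the JVP hand-coded since the map is linear), three colours suffice to recover every Jacobian entry:

import numpy as np

def f_tridiag(u):
    """Toy tridiagonal map: f(u)_i = u_{i-1} - 2*u_i + u_{i+1}, with neighbours
    outside the array dropped. Being linear, the map is its own JVP."""
    out = -2.0 * u
    out[1:] += u[:-1]
    out[:-1] += u[1:]
    return out

n, m = 10, 3
seeds = np.zeros((m, n))
for colour in range(m):
    seeds[colour, colour::m] = 1.0  # columns with index congruent to colour (mod 3)

compressed = np.array([f_tridiag(seed) for seed in seeds])  # three JVPs, shape (3, n)

# Decompress: J[row, col] is stored in the JVP for col's colour, at position row
jacobian = np.zeros((n, n))
for row in range(n):
    for col in range(max(0, row - 1), min(n, row + 2)):
        jacobian[row, col] = compressed[col % m, row]
print(jacobian)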

Tridiagonal example: compute compressed Jacobian

In session1/exercises/sparse/compressed_jacobian.f90 we provide a program for computing the same Jacobian of the centred difference function but this time using a clever choice of seed matrix: \[ \dot{\mathbf{X}}:=\begin{bmatrix} 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1\\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \end{bmatrix}. \]

! Program for computing a tridiagonal Jacobian using compression
program compressed_jacobian
  use diffsizes, only: m => nbdirsmax  ! Number of seed vectors, from diffsizes_compressed.f90
  use central_diff_mod_dv, only: central_diff_dv ! Tapenade Forward mode derivative

  implicit none

  logical, parameter :: output = .true. ! Flag for whether to write output to file
  integer, parameter :: n = 10          ! Number of grid points
  real, parameter :: h = 1.0            ! Uniform grid spacing
  real, dimension(n) :: u               ! Input vector
  real, dimension(n) :: approx          ! Central difference approximation
  real, dimension(m,n) :: seed          ! Seed matrix for the VJP
  real, dimension(m,n) :: compressed    ! Compressed Jacobian for the central difference calculation
  real, dimension(n,n) :: jacobian      ! Jacobian for the central difference calculation
  integer :: i, j                       ! Dummy indices for looping

  ! Specify some arbitrary input
  u(:) = 1.0

  ! Set up the seed matrix as a 3xn array
  seed(:,:) = 0.0
  do i = 1, m
    seed(i,i::m) = 1.0
  end do

  ! Apply the VJP
  call central_diff_dv(u, seed, approx, compressed, h, n, m)

  ! Write out the compressed result to file
  if (output) then
    open(unit=10, file="compressed_jacobian.dat")
    do i = 1, m
      write(unit=10, fmt=100) compressed(i,:)
    end do
    100 format(10(f4.1,","))
    close(unit=10)
  end if

  ! Decompress rows and insert them into the Jacobian
  do i = 1, m
    j = i
    jacobian(i::m,:) = 0.0
    do while (j <= n)
      if (j == 1) then
        jacobian(j,n) = compressed(i,n)
      else
        jacobian(j,j-1) = compressed(i,j-1)
      end if
      jacobian(j,j) = compressed(i,j)
      if (j == n) then
        jacobian(j,1) = compressed(i,1)
      else
        jacobian(j,j+1) = compressed(i,j+1)
      end if
      j = j + m
    end do
  end do

  ! Write out the result to file
  if (output) then
    open(unit=11, file="decompressed_jacobian.dat")
    do i = 1, n
      write(unit=11, fmt=100) jacobian(i,:)
    end do
    close(unit=11)
  end if

end program compressed_jacobian
  

Details on running the code

Compile with

$ make compressed_jacobian

and then run with

$ ./compressed_jacobian

Compiling and running produces files compressed_jacobian.dat and decompressed_jacobian.dat, which we can again read and view in the notebook. This time we only need to compute three JVPs, rather than \(n\)!

Exercise

Experiment with increasing the number of gridpoints to see the impact on the overall cost of the dense approach versus the compressed approach.

Hint: Using the time command line tool should give some very basic timing information.

Solution

Make the changes

diff --git a/session1/exercises/sparse/compressed_jacobian.f90 b/session1/exercises/sparse/compressed_jacobian.f90
index 1c49ebd..86c05db 100644
--- a/session1/exercises/sparse/compressed_jacobian.f90
+++ b/session1/exercises/sparse/compressed_jacobian.f90
@@ -5,8 +5,8 @@ program compressed_jacobian
 
   implicit none
 
-  logical, parameter :: output = .true. ! Flag for whether to write output to file
-  integer, parameter :: n = 10          ! Number of grid points
+  logical, parameter :: output = .false. ! Flag for whether to write output to file
+  integer, parameter :: n = 10000          ! Number of grid points
   real, parameter :: h = 1.0            ! Uniform grid spacing
   real, dimension(n) :: u               ! Input vector
   real, dimension(n) :: approx          ! Central difference approximation
diff --git a/session1/exercises/sparse/dense_jacobian.f90 b/session1/exercises/sparse/dense_jacobian.f90
index a61111d..e04a9aa 100644
--- a/session1/exercises/sparse/dense_jacobian.f90
+++ b/session1/exercises/sparse/dense_jacobian.f90
@@ -5,8 +5,8 @@ program dense_jacobian
 
   implicit none
 
-  logical, parameter :: output = .true. ! Flag for whether to write output to file
-  integer, parameter :: n = 10          ! Number of grid points
+  logical, parameter :: output = .false. ! Flag for whether to write output to file
+  integer, parameter :: n = 10000          ! Number of grid points
   real, parameter :: h = 1.0            ! Uniform grid spacing
   real, dimension(n) :: u               ! Input vector
   real, dimension(n) :: approx          ! Central difference approximation
diff --git a/session1/exercises/sparse/diffsizes_dense.f90 b/session1/exercises/sparse/diffsizes_dense.f90
index 0fef76e..ffc198a 100644
--- a/session1/exercises/sparse/diffsizes_dense.f90
+++ b/session1/exercises/sparse/diffsizes_dense.f90
@@ -5,5 +5,5 @@ module diffsizes
   private
   public :: nbdirsmax
 
-  integer, parameter :: nbdirsmax = 10
+  integer, parameter :: nbdirsmax = 10000
 end module diffsizes

recompile and then run

$ time ./compressed_jacobian && time ./dense_jacobian

You should get output something like

real    0m0.309s
user    0m0.140s
sys     0m0.168s

real    0m0.883s
user    0m0.505s
sys     0m0.377s

indicating that the compressed approach is roughly three times as fast at this resolution. We could likely get a much better speedup by optimising the decompression strategy.

Warning! Make sure you turn off the output flag and recompile before conducting any timing experiments.

print_matrix("exercises/sparse/compressed_jacobian.dat")
print_matrix("exercises/sparse/decompressed_jacobian.dat")

Sparse AD: in practice

It’s worth noting that we used knowledge about the tridiagonal structure of the Jacobian to choose the seed vectors… so that we could compute the Jacobian. For Jacobians in the wild (e.g., those related to networks) we don’t necessarily have such intuition. In practice, we need to perform an initial step to establish the sparsity pattern before we can do the colouring. This functionality is available in Tapenade, but currently only on a branch.

Sparse AD: overall performance comparison

We have considered three approaches for sparse AD:

  1. ‘Dense approach’, which computes a JVP for each canonical unit vector.
  2. ‘Compressed approach’, which computes one JVP per colour, i.e., per group of structurally orthogonal columns.
  3. ‘Matrix-free approach’, which never assembles the Jacobian and instead applies a JVP wherever one is needed. (Only useful if the problem can be formulated in terms of JVPs.)

In the following, we walk through a performance comparison of these different approaches applied to the solution of the nonlinear Gray-Scott equations \[ \frac{\partial u}{\partial t}=D_1\Delta u-uv^2+\gamma(1-u), \quad\frac{\partial v}{\partial t}=D_2\Delta v+uv^2-(\gamma+\kappa)v, \quad u,v:[0,2.5]^2\rightarrow\mathbb{R} \] (subject to some initial conditions) using a finite difference approach with Crank-Nicolson timestepping (\(\theta=0.5\)). This discretisation corresponds to a 5-point stencil and it turns out that the Jacobian can be coloured with five colours. As such, the compressed approach involves five JVPs per Jacobian computation.

Figure 6: Performance comparison for different Jacobian computation approaches for solving the Gray-Scott equation, taken from (Wallwork et al., 2019). PETSc was used for numerical solvers and ADOL-C for AD. The \(N\times N\) notation in the subcaptions refers to the grid dimension.


Here ‘analytic’ refers to the hand-derived Jacobian, which we can’t realistically expect to beat.

Sparse AD: convergence comparison

Matrix-free works well at small problem sizes. However, it slows down at higher resolution: Jacobian assembly is done once per iteration of the nonlinear solver, whereas the matrix-free JVP must be recalculated at every iteration of the inner linear solver. It’s possible that a different choice of linear solver or preconditioner would reduce the cost of the matrix-free approach.

Figure 7: Linear solver iterations as a function of grid resolution and number of processors for the Gray-Scott problem. (Numbers of nonlinear solver iterations are much lower.) Taken from (Wallwork et al, 2019).

Summary and outlook

In today’s session we:

  • Got a brief history of automatic differentiation.
  • Learnt about forward mode and the source transformation approach.
  • Tried out the Tapenade source transformation AD tool applied to some test problems.
  • Verified the code generated by Tapenade both manually and using the Taylor test.
  • Learnt about compressed and matrix-free approaches for sparse problems.

In tomorrow’s session (at the same time) we will:

  • Learn about reverse mode and the operator overloading approach, comparing them with forward mode and source transformation, respectively.
  • Verify the consistency of code generated by Tapenade under forward and reverse mode using the dot product test.
  • Calculate higher order derivatives using Tapenade.
  • Try out the Pyadjoint operator overloading AD tool underpinning the Firedrake finite element library and see showcases of more advanced AD usage.
  • Learn about checkpointing and using AD to compute higher order derivatives.

References

  • R. E. Wengert. A simple automatic derivative evaluation program (1964). Communications of the ACM, 7(8):463–464, doi:10.1145/355586.364791.
  • A. Griewank. Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation (1992). Optimization Methods & Software, 1:35–54, doi:10.1080/10556789208805505.
  • A. H. Gebremedhin, et al. What color is your Jacobian? Graph coloring for computing derivatives (2005). SIAM Review, 47(4):629–705, doi:10.1137/S0036144504444711.
  • J. G. Wallwork, et al. Computing derivatives for PETSc adjoint solvers using algorithmic differentiation (2019). arXiv preprint, doi:10.48550/arXiv.1909.02836.