Session 1: Forward mode differentiation and source transformation

%matplotlib inline
import matplotlib.pyplot as plt
from mpltools import annotation
import numpy as np
import pandas

def print_matrix(filename):
    """Function for printing a matrix saved to a text file from Fortran."""
    print(np.array([[float(entry) for entry in line[:-2].split(",")] for line in open(filename).readlines()]))
Set up codespace now. (It will take a while!)
We will show you where to find this notebook in the repo while you wait.
Motivation
Differentiable programming is an enabling technology. Given scientific code for computing some quantity, it allows us to generate derivatives of quantities involved in the computation without any need for deriving or hand-coding the derivative expressions involved.
Some motivating examples:
- Computing the Jacobian for a nonlinear system.
- Computing the gradient required for an ODE- or PDE-constrained optimisation method.
- The backpropagation operation used for training machine learning models.
- Computing Hessians for uncertainty quantification methods.
- Solving the adjoint problems involved in data assimilation methods commonly used for weather forecasting.
Learning objectives
In today’s session we will:
- Get a brief history of automatic differentiation.
- Learn about forward mode and the source transformation approach.
- Try out the Tapenade source transformation AD tool applied to some test problems.
- Verify the code generated by Tapenade both manually and using the Taylor test.
- Learn about compressed and matrix-free approaches for sparse problems.
Preparations
Terminology
This course introduces the concept of differentiable programming, a.k.a. automatic differentiation (AD), or algorithmic differentiation. We will use the acronym AD henceforth.
Notation
For a differentiable mathematical function \(f:A\rightarrow\mathbb{R}\) with scalar input (i.e., a single value) from \(A\subseteq\mathbb{R}\), we make use of both the Lagrange notation \(f'(x)\) and Leibniz notation \(\frac{\mathrm{d}f}{\mathrm{d}x}\) for its derivative.
Be cautious with the physics-style dot notation \(\dot{x}\) for derivatives. It won’t always mean what you expect! (See later.)
Similarly, for an \(m\)-dimensional (\(m\in\mathbb{N}\)), differentiable, vector-valued function \(\mathbf{f}:A\rightarrow\mathbb{R}^m\) with scalar input, we have the derivative notations \(\mathbf{f}'(x)\) and \(\frac{\mathrm{d}\mathbf{f}}{\mathrm{d}x}\).
For a differentiable function with vector input (i.e., multiple inputs), we use partial derivative notation. For example, if \(f:\mathbb{R}^2\rightarrow\mathbb{R}\) is written as \(f=f(x,y)\) then we have the partial derivatives \(\frac{\partial f}{\partial x}\) and \(\frac{\partial f}{\partial y}\) with respect to the first and second components, respectively. For a function of \(n\) inputs, we use \[\nabla f=\left(\frac{\partial f}{\partial x_1},\dots,\frac{\partial f}{\partial x_n}\right)\] to denote the vector of all such partial derivatives. Similarly for vector-valued functions with multiple inputs.
When it comes to derivatives in code, we use the _d notation (for “derivative” or “dot”), which is standard in the AD literature. Its meaning will be described in due course.
History
- The origins of AD lie in the 1950s.
- Forward mode (the subject of this session) was introduced by Wengert in 1964.
- However, AD only found a wider audience in the 1980s, when it became more practical thanks to advances in both computer power and modern programming languages.
- The approach was further developed by Griewank in the late 1980s.
Figure 1: Header of (R. E. Wengert, 1964).
Idea
The idea of AD is to treat a model as a sequence of elementary instructions (e.g., addition, multiplication, exponentiation). Here a model could be a function or subroutine, code block, or a whole program. Elementary operations are well-understood and their derivatives are known. As such, the derivative of the whole model may be computed by composing the derivatives of each operation using the chain rule.
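To make this concrete, here is a minimal Python sketch (illustrative only, not part of the course repository; the variable names are our own) that evaluates \(y=\sin(x_1x_2)\) as a sequence of elementary operations and propagates a derivative alongside each one using the chain rule:

import numpy as np

# Evaluate y = sin(x1 * x2) and dy/dx1 by breaking the computation into
# elementary operations and propagating a derivative with each one
x1, x2 = 1.0, 2.0
dx1, dx2 = 1.0, 0.0  # seed: differentiate with respect to x1

# Elementary operation 1: multiplication, with its known derivative rule
v = x1 * x2
dv = dx1 * x2 + x1 * dx2

# Elementary operation 2: sine, with its known derivative rule
y = np.sin(v)
dy = np.cos(v) * dv

print(y, dy)  # dy should equal x2 * cos(x1 * x2)

This is exactly the bookkeeping that forward mode AD automates.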
Recap on A-level maths: the Chain Rule
Consider two composable, differentiable (mathematical) functions, \(f\) and \(g\), with composition \(h=f\circ g\). Note that throughout we use the convention that \(f\) is applied first, i.e., \[h(x)=(f\circ g)(x)=g(f(x)).\]
Then the chain rule states that the derivative of \(h\) may be computed in terms of the derivatives of \(f\) and \(g\) using the formula \[h'(x)=(f\circ g)'(x)=g'(f(x))\,f'(x).\]
Equivalently, in Leibniz notation: \[\frac{\mathrm{d}h}{\mathrm{d}x}=\frac{\mathrm{d}g}{\mathrm{d}f}\frac{\mathrm{d}f}{\mathrm{d}x}.\]
For functions with multiple arguments, the analogous result holds for each partial derivative, e.g., \[\frac{\partial h}{\partial x}=\frac{\partial g}{\partial f}\frac{\partial f}{\partial x}.\]
\(f\) & \(g\) example
Consider two functions acting on real numbers: \[f(x,y)=xy\] and \[g(z)=(\sin(z),\cos(z)).\] Here \(f:\mathbb{R}^2\rightarrow\mathbb{R}\) takes two inputs and returns a single output, while \(g:\mathbb{R}\rightarrow\mathbb{R}^2\) takes a single input and returns two outputs.
Optional exercise
Convince yourself that these functions may be composed in either order, i.e., that both compositions are well defined. (Although they won’t necessarily give the same value!)
Solution
The image of \(f\) is the set of all real numbers, which coincides with the domain of \(g\) (i.e., \(\text{im}(f)=\mathbb{R}=\text{dom}(g)\)).
The image of \(g\) is \([-1,1]^2=[-1,1]\times[-1,1]\) because \(\sin\) and \(\cos\) give values between -1 and 1. Since this is a subset of \(\mathbb{R}^2\), the image of \(g\) is a subset of the domain of \(f\) (i.e., \(\text{im}(g)\subset\text{dom}(f)\)).
\(f\) & \(g\) example: Directed Acyclic Graph
We can visualise the functions in terms of DAGs.
Recalling that \[f(x_1,x_2)=x_1x_2\] and \[g(y)=(\sin(y),\cos(y))\, ,\] we have
Figure 2: Directed Acyclic Graph (DAG) for the \(f\) function in the \(f\) & \(g\) example. Generated using tikZ and \(\LaTeX\).
Figure 3: Directed Acyclic Graph (DAG) for the \(g\) function in the \(f\) & \(g\) example. Generated using tikZ and \(\LaTeX\).
\(f\) & \(g\) example: composition
Consider the composition \(h=f\circ g:\mathbb{R}^2\rightarrow\mathbb{R}^2\), which is given by \[h(x_1,x_2)=(f\circ g)(x_1,x_2)=g(f(x_1,x_2))=g(x_1x_2)=(\sin(x_1x_2),\cos(x_1x_2)).\]
For the derivative of each component, \[ \frac{\partial f}{\partial x_1}=\frac{\partial}{\partial x_1}x_1x_2=x_2, \quad\frac{\partial f}{\partial x_2}=\frac{\partial}{\partial x_2}x_1x_2=x_1, \quad\frac{\partial g}{\partial y}=\frac{\partial}{\partial y}(\sin(y),\cos(y))=(\cos(y),-\sin(y)). \]
Introduce the notation \(g(y)=(g_1(y),g_2(y))\) so that \[ \frac{\partial g_1}{\partial y}=\cos(y), \quad\frac{\partial g_2}{\partial y}=-\sin(y). \] Similarly \(h(x_1,x_2)=(h_1(x_1,x_2),h_2(x_1,x_2))=(g_1(f(x_1,x_2)),g_2(f(x_1,x_2)))\).
Let’s use the chain rule to work out the derivatives of each of the outputs with respect to each of the inputs. \[ \frac{\partial h_1}{\partial x_1}=\frac{\partial g_1}{\partial f}\frac{\partial f}{\partial x_1}=\cos(y)x_2=x_2\cos(x_1x_2), \quad\frac{\partial h_1}{\partial x_2}=\frac{\partial g_1}{\partial f}\frac{\partial f}{\partial x_2}=\cos(y)x_1=x_1\cos(x_1x_2), \] where \(y=f(x_1,x_2)\) and \[ \quad\frac{\partial h_2}{\partial x_1}=\frac{\partial g_2}{\partial f}\frac{\partial f}{\partial x_1}=-\sin(y)x_2=-x_2\sin(x_1x_2), \quad\frac{\partial h_2}{\partial x_2}=\frac{\partial g_2}{\partial f}\frac{\partial f}{\partial x_2}=-\sin(y)x_1=-x_1\sin(x_1x_2). \]
We will come back to these formulae to verify the correctness of our AD computations.
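As a quick sanity check (a sketch for this notebook, not part of the repository), we can already compare one of these hand-derived expressions against a crude forward-difference approximation at an arbitrary point:

import numpy as np

# Compare the hand-derived partial derivative of h1 = sin(x1 * x2) with
# respect to x1 against a forward-difference approximation
x1, x2 = 0.7, 1.3
eps = 1e-6

analytic = x2 * np.cos(x1 * x2)
finite_difference = (np.sin((x1 + eps) * x2) - np.sin(x1 * x2)) / eps

print(analytic, finite_difference)  # should agree to roughly six digits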
Directional derivative, a.k.a. Jacobian-vector product (JVP)
Consider a vector-valued function \(\mathbf{f}\) mapping from a subspace \(A\subseteq\mathbb{R}^n\) into \(\mathbb{R}^m\), for some \(m,n\in\mathbb{N}\): \[\mathbf{f}:A\rightarrow\mathbb R^m.\]
Given input \(\mathbf{x}\in A\) and a seed vector \(\dot{\mathbf{x}}\in\mathbb{R}^n\), forward mode AD allows us to compute the action (matrix-vector product) \[\text{JVP}(\mathbf{f},\mathbf{x},\dot{\mathbf{x}}):=\nabla\mathbf{f}(\mathbf{x})\,\dot{\mathbf{x}}.\] Here \(\nabla\mathbf{f}\) is referred to as the Jacobian for the map, so the above is known as a Jacobian-vector product (JVP). You will also hear the related term tangent linear model (TLM).
Think of the seed vector as the direction in which we want to compute the derivative. In practice, the seed vector is often a derivative of some upstream code from outside of the part of the program being differentiated. That is, the upstream code is passive, whereas the part we are interested in is active, as far as AD is concerned.
Note The computation is matrix-free. We don’t actually need to assemble the Jacobian when we compute this product.
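For intuition, here is a hedged sketch of what ‘matrix-free’ means: we can approximate \(\nabla\mathbf{f}(\mathbf{x})\,\dot{\mathbf{x}}\) for a vector-valued function by perturbing the input in the seed direction, without ever assembling the Jacobian. (Forward mode AD computes the same quantity exactly rather than approximately; the helper name jvp_fd below is our own.)

import numpy as np

def jvp_fd(f, x, xdot, eps=1e-6):
    """Finite-difference stand-in for a JVP: (f(x + eps*xdot) - f(x)) / eps.

    Note that no Jacobian matrix is ever assembled.
    """
    x = np.asarray(x, dtype=float)
    xdot = np.asarray(xdot, dtype=float)
    return (f(x + eps * xdot) - f(x)) / eps

# Example: h maps R^2 to R^2; seed with the first canonical unit vector
h = lambda x: np.array([np.sin(x[0] * x[1]), np.cos(x[0] * x[1])])
print(jvp_fd(h, [1.0, 1.0], [1.0, 0.0]))  # approximates the first column of the Jacobian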
\(f\) & \(g\) example: Seed vectors
Let’s revisit the DAG interpretation and consider how the derivatives work.
Figure 4: Directed Acyclic Graph (DAG) for the composition of the functions in the \(f\) & \(g\) example.
Approach 1: Source transformation
High level idea: Given some (code) function f(x) and a seed vector x_d, generate the code for the function f_d(x, x_d) for its (directional) derivative.
Notes on the source transformation approach:
- Static analysis, done ahead of time.
- Often the difficult part is hooking the differentiated code into the wider model/build system.
- Limited number of tools.
Source transformation tools:
- Tapenade (C, Fortran, Julia*)
- OpenAD (Fortran) [no longer maintained]
- TAF (Fortran) [commercial]
- PSyAD* (domain-specific)
*Work in progress
\(f\) & \(g\) example: source transformation
Below we have the Fortran code for the example functions above, written as subroutines. You can find these in the repository at session1/exercises/fg/f.f90 and session1/exercises/fg/g.f90, respectively.
subroutine f(x, y)
  implicit none
  real, dimension(2), intent(in) :: x
  real, intent(out) :: y

  y = x(1) * x(2)
end subroutine f

subroutine g(y, z)
  implicit none
  real, intent(in) :: y
  real, intent(out), dimension(2) :: z

  z = [sin(y), cos(y)]
end subroutine g
Exercise
- Run tapenade -h to see all the options.
- Apply Tapenade to each of these subroutines using the -head argument.* Inspect the output files f_d.f90 and g_d.f90 and check they are as you expect.
- (Optional) Inspect the message files f_d.msg and g_d.msg.
- (Optional) See https://gitlab.inria.fr/tapenade/tapenade/-/blob/develop/src/README.md for details on the internals of Tapenade.

*The syntax is -head "procedure_name(dependent_variables)/(independent_variables)", where procedure_name, dependent_variables, and independent_variables all need substituting as appropriate. If multiple variables are used then they should be comma-separated.
Solution 1
$ tapenade -h
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Builds a differentiated program.
Usage: tapenade [options]* filenames
options:
-head, -root <proc> set the differentiation root procedure(s)
See FAQ for refined invocation syntax, e.g.
independent and dependent arguments, multiple heads...
-tangent, -d differentiate in forward/tangent mode (default)
-reverse, -b differentiate in reverse/adjoint mode
-vector, -multi turn on "vector" mode (i.e. multi-directional)
-specializeactivity <unit_names or %all%> Allow for several activity patterns per routine
-primal, -p turn off differentiation. Show pointer destinations
-output, -o <file> put all generated code into a single <file>
-splitoutputfiles split generated code, one file per top unit
-outputdirectory, -O <directory> put all generated files in <directory> (default: .)
-I <includePath> add a new search path for include files
-tgtvarname <str> set extension for tangent variables (default %d)
-tgtfuncname <str> set extension for tangent procedures (default %_d)
-tgtmodulename <str> set extension for tangent modules and types (default %_diff)
-adjvarname <str> set extension for adjoint variables (default %b)
-adjfuncname <str> set extension for adjoint procedures (default %_b)
-adjmodulename <str> set extension for adjoint modules and types (default %_diff)
-modulename <str> set extension for tangent&adjoint modules and types (default %_diff)
-inputlanguage <lang> language of input files (fortran, fortran90,
fortran95, or C)
-outputlanguage <lang> language of output files (fortran, fortran90,
fortran95, or C)
-ext <file> incorporate external library description <file>
-nolib don't load standard libraries descriptions
-i<n> count <n> bytes for an integer (default -i4)
-r<n> count <n> bytes for a real (default -r4)
-dr<n> count <n> bytes for a double real (default -dr8)
-p<n> count <n> bytes for a pointer (default -p8)
-fixinterface don't use activity to filter user-given (in)dependent vars
-noinclude inline include files
-debugTGT insert instructions for debugging tangent mode
-debugADJ insert instructions for debugging adjoint mode
-tracelevel <n> set the level of detail of trace milestones
-msglevel <n> set the level of detail of error messages
-msginfile insert error messages in output files
-dump <file> write a dump <file>
-html display results in a web browser
-nooptim <str> turn off optimization <str> (in {activity, difftypes,
diffarguments, stripprimalmodules, spareinit, splitdiff,
mergediff, saveonlyused, tbr, snapshot, diffliveness,
deadcontrol, recomputeintermediates,
everyoptim}
-version display Tapenade version information
Report bugs to <tapenade@inria.fr>.
Solution 2
Running
$ cd session1/exercises/fg
$ ls
f.f90 g.f90
$ tapenade -head "f(y)/(x)" f.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Diff-liveness analysis turned off
@@ Created ./f_d.f90
@@ Created ./f_d.msg
$ cat f_d.f90
gives
! Generated by TAPENADE (INRIA, Ecuador team)
! Tapenade 3.16 (develop) - 23 Apr 2025 13:39
!
! Differentiation of f in forward (tangent) mode:
! variations of useful results: y
! with respect to varying inputs: x
! RW status of diff variables: x:in y:out
SUBROUTINE F_D(x, xd, y, yd)
IMPLICIT NONE
REAL, DIMENSION(2), INTENT(IN) :: x
REAL, DIMENSION(2), INTENT(IN) :: xd
REAL, INTENT(OUT) :: y
REAL, INTENT(OUT) :: yd
yd = x(2)*xd(1) + x(1)*xd(2)
y = x(1)*x(2)
END SUBROUTINE F_D
Running
$ tapenade -head "g(z)/(y)" g.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Diff-liveness analysis turned off
@@ Created ./g_d.f90
@@ Created ./g_d.msg
$ cat g_d.f90
gives
! Generated by TAPENADE (INRIA, Ecuador team)
! Tapenade 3.16 (develop) - 23 Apr 2025 13:39
!
! Differentiation of g in forward (tangent) mode:
! variations of useful results: z
! with respect to varying inputs: y
! RW status of diff variables: y:in z:out
SUBROUTINE G_D(y, yd, z, zd)
IMPLICIT NONE
REAL, INTENT(IN) :: y
REAL, INTENT(IN) :: yd
REAL, DIMENSION(2), INTENT(OUT) :: z
REAL, DIMENSION(2), INTENT(OUT) :: zd
INTRINSIC COS
INTRINSIC SIN
zd = (/COS(y)*yd, -(SIN(y)*yd)/)
z = (/SIN(y), COS(y)/)
END SUBROUTINE G_D
Verification: the Taylor test
Recall the Taylor expansion of a differentiable scalar function \(f:A\rightarrow\mathbb{R}\), for \(A\subseteq\mathbb{R}\): \[ f(x+\epsilon)=f(x)+\epsilon f'(x) + \frac{1}{2!}\epsilon^2f''(x) + \dots + \frac{1}{n!}\epsilon^nf^{(n)}(x) + \dots, \] for some \(\epsilon\in\mathbb{R}\), i.e., \[ f(x+\epsilon)=f(x)+\epsilon f'(x) + \mathcal{O}(\epsilon^2). \] This gives rise to the first-order forward difference approximation of the first derivative as follows: \[ f'(x) \approx \frac{f(x+\epsilon)-f(x)}{\epsilon}. \]
This is a rather crude approximation to a derivative, but it can do the job, given an appropriate choice of \(\epsilon\).
The idea of the Taylor test is to take smaller and smaller spacing values \(\epsilon\) from a given input value and to check that the Taylor remainder computed using the AD-generated derivative converges to zero quadratically. That is, we verify that \[ \|f(x+\epsilon)-f(x)-\epsilon\:\texttt{f\_d}(x)\|=\mathcal{O}(\epsilon^2), \] where \(\texttt{f\_d}\) is the AD-generated derivative.
\(f\) & \(g\) example: Taylor test
We return to the previous example and double-check that we are happy with the output of the AD tool.
For simplicity of demonstration, we translate f, g, f_d, and g_d into Python and conduct the Taylor test here.
def f(x):
    """Python translation of the `f` subroutine from `session1/exercises/fg/f.f90` using NumPy."""
    return x[0] * x[1]

def g(y):
    """Python translation of the `g` subroutine from `session1/exercises/fg/g.f90` using NumPy."""
    return np.array([np.sin(y), np.cos(y)])

def f_d(x, xd):
    """Python translation of the `f_d` subroutine generated by Tapenade in `session1/exercises/fg/f_d.f90` using NumPy."""
    yd = x[1] * xd[0] + x[0] * xd[1]
    y = x[0] * x[1]
    return y, yd

def g_d(y, yd):
    """Python translation of the `g_d` subroutine generated by Tapenade in `session1/exercises/fg/g_d.f90` using NumPy."""
    zd = np.array([np.cos(y) * yd, -np.sin(y) * yd])
    z = np.array([np.sin(y), np.cos(y)])
    return z, zd
# Choose arbitrary inputs for the composition
x = np.array([1.0, 1.0])

# Choose seed vector
xd = np.array([1.0, 0.0])

# Compute the derivative using forward mode AD
y, yd = f_d(x, xd)
_, derivative_ad = g_d(y, yd)

# Run the Taylor test over several spacing values
spacings = [1.0, 0.1, 0.01, 0.001]
errors = []
for spacing in spacings:
    # Compute the perturbation in the seed vector direction
    epsilon = spacing * xd

    # Compute the discrepancy
    errors.append(np.linalg.norm(g(f(x + epsilon)) - g(f(x)) - spacing * derivative_ad))

# Plot the solution, demonstrating that the expected quadratic convergence is achieved
fig, axes = plt.subplots()
axes.loglog(spacings, errors, "--x")
axes.set_xlabel(r"$\epsilon$ spacing")
axes.set_ylabel(r"$\ell_2$ error")
annotation.slope_marker((1e-2, 1e-4), 2, ax=axes, invert=True)
axes.grid()
Above, we computed the derivative of \(h=f\circ g\) with respect to the first component, \(x_1\).
Optional exercise
Perform the Taylor test to double-check you are happy with the AD-generated derivative with respect to the second component, \(x_2\), too.
Hint: You only need to change one line.
Example: ODE-constrained optimisation
Consider the scalar ordinary differential equation (ODE) \[ \frac{\mathrm{d}u}{\mathrm{d}t}=f(u),\quad u(0)=u_0, \] where \(t\in[0,T]\) is the time variable, \(T>0\) is the end time, and \(u_0\in\mathbb{R}\) is the initial condition. Given some \(f:A\rightarrow\mathbb{R}\) with \(A\subseteq\mathbb{R}\), we seek to solve the ODE for \(u:[0,T]\rightarrow\mathbb{R}\).
For simplicity, let’s consider the ODE \[ \frac{\mathrm{d}u}{\mathrm{d}t}=u,\quad u(0)=1, \] i.e., \(f(u)=u\).
Optional exercise
Convince yourself that the analytical solution of the ODE is \(u(t)=\mathrm{e}^t\).
Solutions
Plugging \(u(t)=\mathrm{e}^t\) into the LHS gives \(\frac{\mathrm{d}u}{\mathrm{d}t}=\mathrm{e}^t=u\), which satisfies the ODE. Checking the initial condition, we have \(u(0)=\mathrm{e}^0=1\), so that is satisfied too.
ODE example: forward vs backward Euler
You’re probably aware of the simplest explicit timestepping method for approximating the solution of the ODE: the forward Euler (a.k.a. explicit Euler) method, \[ \frac{u_{k}-u_{k-1}}{\Delta t}=f(u_{k-1}), \] for \(k\in\mathbb{N}\) and some timestep \(\Delta t>0\).
You’re probably also aware of the simplest implicit timestepping method: the backward Euler (a.k.a. implicit Euler) method, \[ \frac{u_{k}-u_{k-1}}{\Delta t}=f(u_k), \] for \(k\in\mathbb{N}\) and some timestep \(\Delta t>0\).
These are actually special cases of a more general theta-method, \[ \frac{u_{k}-u_{k-1}}{\Delta t}=(1-\theta)f(u_{k-1})+\theta f(u_k), \] where \(\theta\in[0,1]\).
ODE example: applying forward and backward Euler
Using the method above, our problem reads \[ \frac{u_{k}-u_{k-1}}{\Delta t}=(1-\theta)u_{k-1}+\theta u_k, \] which can be rearranged to give \[ u_{k}=\frac{1+\Delta t(1-\theta)}{1-\Delta t\theta}u_{k-1}. \]
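Before turning to the Fortran implementation, here is a small Python sketch of this update rule (illustrative only; the function name theta_method_py is our own, while the timestep and end time mirror those used in the repository code below):

import numpy as np

def theta_method_py(theta, dt=0.1, end_time=1.0):
    """Solve du/dt = u, u(0) = 1 using the theta-method update derived above."""
    u, t = 1.0, 0.0
    while t < end_time - 1e-5:
        u *= (1 + dt * (1 - theta)) / (1 - dt * theta)
        t += dt
    return u

# theta = 0 gives forward Euler, theta = 1 gives backward Euler; compare with exp(1)
print(theta_method_py(0.0), theta_method_py(1.0), np.exp(1.0))

As we will see shortly, forward Euler underestimates the exact value \(\mathrm{e}\approx 2.718\), while backward Euler overestimates it.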
The forward and backward Euler approaches have been implemented for the case \(f(u)=u\) in the session1/exercises/ode subdirectory.
$ cd exercises/ode
$ ls
backward_euler.f90 cost_mod.f90 forward_euler.f90 gradient_descent.f90 Makefile theta_method_mod.f90
where theta_method_mod.f90 is a Fortran module containing several subroutines, including
! Single iteration of a theta method for solving the ODE initial value problem
! du/dt = u, u(0) = 1
subroutine theta_step(u, u_, dt, theta)
implicit none
real, intent(out) :: u ! Solution at current timestep
real, intent(in) :: u_ ! Solution at previous timestep
real, intent(in) :: dt ! Timestep length
real, intent(in) :: theta ! Theta parameter
u = u_ * (1 + dt * (1 - theta)) / (1 - dt * theta)
end subroutine theta_step
Here forward_euler.f90 and backward_euler.f90 contain programs to solve the ODE, making use of code from the module above, but with values \(\theta=0\) and \(\theta=1\), respectively. In the former case, we have
! Program for running the forward Euler method to solve the ODE initial value problem
! du/dt = u, u(0) = 1
program forward_euler
use theta_method_mod, only: theta_method
implicit none
real, parameter :: theta = 0.0 ! Forward Euler corresponds to theta = 0.0
real :: u ! Solution variable
character(len=100), parameter :: filename = "forward.csv" ! Filename to save results to
call theta_method(theta, u, filename)
end program forward_euler
Details on running the code
Compile with
$ make forward_euler
$ make backward_euler
and then run with
$ ./forward_euler
$ ./backward_euler
$ ls
backward.csv backward_euler.f90 forward.csv forward_euler.f90 Makefile theta_method_mod.mod
backward_euler cost_mod.f90 forward_euler gradient_descent.f90 theta_method_mod.f90 theta_method_mod.o
= pandas.read_csv("exercises/ode/forward.csv")
df_forward = pandas.read_csv("exercises/ode/backward.csv")
df_backward
= plt.subplots()
fig, axes
= np.linspace(0, 1, 101)
times
"-", color="k", label="Analytical solution")
axes.plot(times, np.exp(times), "t"], df_forward["u"], "--x", label="Forward Euler")
axes.plot(df_forward["t"], df_backward["u"], ":o", label="Backward Euler")
axes.plot(df_backward[
axes.legend() axes.grid()
ODE example: source transformation
As we see from the plot above, the forward Euler method tends to underestimate the solution, whereas the backward Euler method tends to overestimate it. Let’s try to optimise the value of \(\theta\) to best match the solution using a gradient-based optimisation method. To do that, we first need the gradient. AD enables us to do this automatically.
The optimisation problem we seek to solve is to minimise some error measure \(J\) for the approximation of \(u\) by varying \(\theta\). That is, \[ \min_{\theta\in[0,1]}J(u;\theta), \] where the notation \(J(u;\theta)\) refers to the implicit dependence of the solution approximation \(u\) on \(\theta\).
Note Forward and backward Euler are first-order accurate methods. By optimising the \(\theta\) parameter, we can arrive at a second-order accurate method.
Exercise
Navigate to session1/exercises/ode and apply forward mode AD to the theta_method subroutine in theta_method_mod.f90 by differentiating the module with Tapenade.
Solutions
$ tapenade -tgtmodulename %_d -head "theta_method_mod.theta_method(theta)\(u)" theta_method_mod.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Diff-liveness analysis turned off
@@ Created ./theta_method_mod_d.f90
@@ Created ./theta_method_mod_d.msg
$ cat theta_method_mod_d.f90
gives
! Generated by TAPENADE (INRIA, Ecuador team)
! Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
! Module containing subroutines for solving the ODE initial value problem
! du/dt = u, u(0) = 1
MODULE THETA_METHOD_MOD_D
IMPLICIT NONE
PRIVATE
PUBLIC :: theta_method
PUBLIC :: theta_method_d
CONTAINS
! Apply the initial condition for the ODE initial value problem
! du/dt = u, u(0) = 1
SUBROUTINE INITIAL_CONDITION(u0)
IMPLICIT NONE
! Initial condition value
REAL, INTENT(OUT) :: u0
u0 = 1.0
END SUBROUTINE INITIAL_CONDITION
! Differentiation of theta_step in forward (tangent) mode:
! variations of useful results: u
! with respect to varying inputs: u_ theta
! Single iteration of a theta method for solving the ODE initial value problem
! du/dt = u, u(0) = 1
SUBROUTINE THETA_STEP_D(u, ud, u_, u_d, dt, theta, thetad)
IMPLICIT NONE
! Solution at current timestep
REAL, INTENT(OUT) :: u
REAL, INTENT(OUT) :: ud
! Solution at previous timestep
REAL, INTENT(IN) :: u_
REAL, INTENT(IN) :: u_d
! Timestep length
REAL, INTENT(IN) :: dt
! Theta parameter
REAL, INTENT(IN) :: theta
REAL, INTENT(IN) :: thetad
REAL :: temp
temp = u_/(-(dt*theta)+1)
ud = (dt*(1-theta)+1)*(u_d+temp*dt*thetad)/(1-dt*theta) - temp*dt*&
&    thetad
u = (dt*(1-theta)+1)*temp
END SUBROUTINE THETA_STEP_D
! Single iteration of a theta method for solving the ODE initial value problem
! du/dt = u, u(0) = 1
SUBROUTINE THETA_STEP(u, u_, dt, theta)
IMPLICIT NONE
! Solution at current timestep
REAL, INTENT(OUT) :: u
! Solution at previous timestep
REAL, INTENT(IN) :: u_
! Timestep length
REAL, INTENT(IN) :: dt
! Theta parameter
REAL, INTENT(IN) :: theta
u = u_*(1+dt*(1-theta))/(1-dt*theta)
END SUBROUTINE THETA_STEP
! Differentiation of theta_method in forward (tangent) mode:
! variations of useful results: u
! with respect to varying inputs: theta
! RW status of diff variables: u:out theta:in
! Solve the ODE initial value problem
! du/dt = u, u(0) = 1
! using a theta timestepping method, writing the solution trajectory to file
SUBROUTINE THETA_METHOD_D(theta, thetad, u, ud, filename)
IMPLICIT NONE
! Theta parameter
REAL, INTENT(IN) :: theta
REAL, INTENT(IN) :: thetad
! Solution at the final time
REAL, INTENT(OUT) :: u
REAL, INTENT(OUT) :: ud
! Output filename
CHARACTER(len=*), INTENT(IN) :: filename
! Timestep length
REAL, PARAMETER :: dt=0.1
! End time of the simulation
REAL, PARAMETER :: end_time=1.0
! Dummy variable for time
REAL :: t
! Lagged solution variable (previous timestep)
REAL :: u_
REAL :: u_d
! Initialisation
t = 0.0
CALL INITIAL_CONDITION(u_)
! Create a CSV file and write out the header and initial condition
OPEN(unit=10, file=filename)
WRITE(unit=10, fmt='(''t,u'')')
WRITE(unit=10, fmt=100) t, u_
ud = 0.0
u_d = 0.0
! Timestepping loop
DO WHILE (t .LT. end_time - 1e-05)
CALL THETA_STEP_D(u, ud, u_, u_d, dt, theta, thetad)
u_d = ud
u_ = u
t = t + dt
WRITE(unit=10, fmt=100) t, u
END DO
! Remember to close the CSV file
CLOSE(unit=10)
100 FORMAT(f4.2,',',f4.2)
END SUBROUTINE THETA_METHOD_D
! Solve the ODE initial value problem
! du/dt = u, u(0) = 1
! using a theta timestepping method, writing the solution trajectory to file
SUBROUTINE THETA_METHOD(theta, u, filename)
IMPLICIT NONE
! Theta parameter
REAL, INTENT(IN) :: theta
! Solution at the final time
REAL, INTENT(OUT) :: u
! Output filename
CHARACTER(len=*), INTENT(IN) :: filename
! Timestep length
REAL, PARAMETER :: dt=0.1
! End time of the simulation
REAL, PARAMETER :: end_time=1.0
! Dummy variable for time
REAL :: t
! Lagged solution variable (previous timestep)
REAL :: u_
! Initialisation
t = 0.0
CALL INITIAL_CONDITION(u_)
! Create a CSV file and write out the header and initial condition
OPEN(unit=10, file=filename)
WRITE(unit=10, fmt='(''t,u'')')
WRITE(unit=10, fmt=100) t, u_
! Timestepping loop
DO WHILE (t .LT. end_time - 1e-05)
CALL THETA_STEP(u, u_, dt, theta)
u_ = u
t = t + dt
WRITE(unit=10, fmt=100) t, u
END DO
! Remember to close the CSV file
CLOSE(unit=10)
100 FORMAT(f4.2,',',f4.2)
END SUBROUTINE THETA_METHOD
END MODULE THETA_METHOD_MOD_D
ODE example: optimisation with gradient descent
Let’s solve this ODE problem with one of the simplest gradient-based optimisation approaches: gradient descent. This amounts to an initial guess \(\theta_0\), followed by iterative updates \[ \theta_{k+1}=\theta_k+\alpha\:p_k, \] where \(\alpha>0\) is the step length and \(p_k\) is the descent direction. For gradient descent, we simply take \[ p_k=-\frac{\mathrm{d}J_k}{\mathrm{d}\theta_k}. \]
Since we know the analytical solution for this problem, we may make an ‘artificial’ choice of cost function such as \[
J(u;\theta)=(u(1)-\mathrm{e}^1)^2,
\] where \(\mathrm{e}^1\) is the analytical solution at the end time \(t=1\). This is implemented in the cost_mod.f90 module as the subroutine
! Cost function evaluating the l2 error at the end time against the analytical solution u(t)=exp(t)
subroutine cost_function(u, J)
implicit none
real, intent(in) :: u ! Numerical solution at the end time
real, intent(out) :: J ! Cost function value
real, parameter :: e = exp(1.0) ! The exponential constant, e=2.71828...
J = (u - e) ** 2
end subroutine cost_function
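Putting the pieces together, here is a hedged Python sketch of the gradient descent loop applied to this cost function. It is illustrative only: the repository’s gradient_descent.f90 uses the Tapenade-generated derivatives, whereas here a finite-difference gradient stands in for them, and the step length and tolerance are arbitrary choices rather than the values used in the exercise.

import numpy as np

def cost(theta, dt=0.1, end_time=1.0):
    """J(u; theta) = (u(1) - e)^2 with u computed by the theta method."""
    u, t = 1.0, 0.0
    while t < end_time - 1e-5:
        u *= (1 + dt * (1 - theta)) / (1 - dt * theta)
        t += dt
    return (u - np.exp(1.0)) ** 2

theta, alpha, gtol, eps = 0.0, 0.1, 1e-5, 1e-6
for it in range(10000):
    grad = (cost(theta + eps) - cost(theta)) / eps  # stand-in for the AD-generated derivative
    if abs(grad) < gtol:
        break
    theta -= alpha * grad  # update with descent direction p = -dJ/dtheta
print(it, theta)

The optimised \(\theta\) ends up close to \(0.5\), consistent with the note above about recovering a second-order method.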
Exercise
The gradient descent algorithm is implemented for our ODE problem in session1/exercises/ode/gradient_descent.f90.
- However, there is another missing piece: you will also need to differentiate the cost function. Convince yourself you are satisfied with the output.
- Run the gradient descent optimisation algorithm and execute the following two cells to visualise the results.
- Play around with the optimisation parameters defined in gradient_descent.f90. Can you achieve convergence in fewer iterations without changing the gtol convergence tolerance?
Solution 1
$ tapenade -tgtmodulename %_d cost_mod.f90
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Command: Took subroutine cost_function as default differentiation root
Diff-liveness analysis turned off
@@ Created ./cost_mod_d.f90
@@ Created ./cost_mod_d.msg
$ cat cost_mod_d.f90
gives
! Generated by TAPENADE (INRIA, Ecuador team)
! Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
! Module containing subroutines for cost functions
MODULE COST_MOD_D
IMPLICIT NONE
PRIVATE
PUBLIC :: cost_function
PUBLIC :: cost_function_d
CONTAINS
! Differentiation of cost_function in forward (tangent) mode:
! variations of useful results: j
! with respect to varying inputs: u
! RW status of diff variables: j:out u:in
! Cost function evaluating the l2 error at the end time against the analytical solution u(t)=exp(t)
SUBROUTINE COST_FUNCTION_D(u, ud, j, jd)
IMPLICIT NONE
! Numerical solution at the end time
REAL, INTENT(IN) :: u
REAL, INTENT(IN) :: ud
! Cost function value
REAL, INTENT(OUT) :: j
REAL, INTENT(OUT) :: jd
INTRINSIC EXP
! The exponential constant, e=2.71828...
REAL, PARAMETER :: e=EXP(1.0)
jd = 2*(u-e)*ud
j = (u-e)**2
END SUBROUTINE COST_FUNCTION_D
! Cost function evaluating the l2 error at the end time against the analytical solution u(t)=exp(t)
SUBROUTINE COST_FUNCTION(u, j)
IMPLICIT NONE
! Numerical solution at the end time
REAL, INTENT(IN) :: u
! Cost function value
REAL, INTENT(OUT) :: j
INTRINSIC EXP
! The exponential constant, e=2.71828...
REAL, PARAMETER :: e=EXP(1.0)
j = (u-e)**2
END SUBROUTINE COST_FUNCTION
END MODULE COST_MOD_D
Solution 2
Compile gradient_descent.f90 and its dependencies with
$ make gradient_descent
You can then run
$ ./gradient_descent
You should get convergence in around 755 iterations.
Solution 3
Increasing the step length slightly to \(\alpha=0.199\), you should get convergence in around 378 iterations, about half as many. Interestingly, the choice \(\alpha=0.2\) fails to converge. Bonus exercise: plot the progress for \(\alpha=0.2\) in the cell below to see what went wrong.
= pandas.read_csv("exercises/ode/optimisation_progress.csv")
df_opt
= np.array(df_opt["J"])
costs 0] = np.nan # Remove the first entry because it's uninitialised garbage
costs[= np.array(df_opt["theta"])
controls
= plt.subplots(ncols=2, figsize=(12, 5))
fig, axes 0].loglog(df_opt["it"], costs, "--", label="Cost function value")
axes[0].legend()
axes[0].grid()
axes[1].plot(df_opt["it"], df_opt["theta"], "--", label="Control value")
axes[1].legend()
axes[1].grid() axes[
= pandas.read_csv("exercises/ode/optimised.csv")
df_optimised
= plt.subplots()
fig, axes "-", color="k", label="Analytical solution")
axes.plot(times, np.exp(times), "t"], df_forward["u"], "--x", label="Forward Euler")
axes.plot(df_forward["t"], df_backward["u"], ":o", label="Backward Euler")
axes.plot(df_backward["t"], df_optimised["u"], "-.^", label=rf"Optimised ($\theta={controls[-1]:.4f}$)")
axes.plot(df_optimised[
axes.legend() axes.grid()
As we might hope, the optimised value of \(\theta\) gives a much better approximation.
Calculating the full Jacobian
Question
Given a map \(\mathbf{f}\), some input \(\mathbf{x}\), and some seed \(\dot{\mathbf{x}}\), we have the Jacobian vector product \[\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}.\]
How can we use this to compute the full Jacobian matrix \(\nabla\mathbf{f}(\mathbf{x})\)?
Solution
\[ \nabla\mathbf{f}(\mathbf{x}) =\nabla\mathbf{f}(\mathbf{x})\mathbf{I}_n =\nabla\mathbf{f}(\mathbf{x})\begin{bmatrix}\mathbf{e}_1,\mathbf{e}_2,\dots,\mathbf{e}_n\end{bmatrix} =\begin{bmatrix}\nabla\mathbf{f}(\mathbf{x})\mathbf{e}_1,\nabla\mathbf{f}(\mathbf{x})\mathbf{e}_2,\dots,\nabla\mathbf{f}(\mathbf{x})\mathbf{e}_n\end{bmatrix}. \]
Apply JVP to the \(n\) canonical unit vectors.
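As a sketch (using the chain-rule expressions we derived by hand for the composition \(h\) earlier, rather than Tapenade output; the helper name h_jvp is our own), the \(2\times2\) Jacobian of \(h\) can be assembled column by column by seeding with each canonical unit vector:

import numpy as np

def h_jvp(x, xd):
    """JVP of h(x) = (sin(x1*x2), cos(x1*x2)), hand-coded from the chain rule derived above."""
    y = x[0] * x[1]
    yd = x[1] * xd[0] + x[0] * xd[1]
    return np.array([np.cos(y) * yd, -np.sin(y) * yd])

x = np.array([1.0, 1.0])
n = 2

# One JVP per canonical unit vector gives one column of the Jacobian
jacobian = np.column_stack([h_jvp(x, np.eye(n)[:, j]) for j in range(n)])
print(jacobian)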
Vector mode
The expression \(\begin{bmatrix}\nabla\mathbf{f}(\mathbf{x})\mathbf{e}_1,\nabla\mathbf{f}(\mathbf{x})\mathbf{e}_2,\dots,\nabla\mathbf{f}(\mathbf{x})\mathbf{e}_n\end{bmatrix}\) above is a concatenation of \(n\) JVPs. Whilst it’s possible to apply these in a loop, most AD tools support a ‘vector’ (multi-directional) mode, which allows this to be done in a single call by propagating a whole seed matrix at once:
\[\nabla\mathbf{f}(\mathbf{x})\,\dot{\mathbf{X}}=\begin{bmatrix}\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}_1,\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}_2,\dots,\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}_k\end{bmatrix},\]
for a seed matrix \(\dot{\mathbf{X}}:=\begin{bmatrix}\dot{\mathbf{x}}_1,\dot{\mathbf{x}}_2,\dots,\dot{\mathbf{x}}_k\end{bmatrix}\in\mathbb{R}^{n\times k}\), \(k\in\mathbb{N}\). Strictly, this is a Jacobian-matrix product (a batch of JVPs), not to be confused with the vector-Jacobian product associated with reverse mode. Note that Tapenade’s generated code stores the direction index first, so the corresponding seed arrays in the code below have shape \((k,n)\).
Example: tridiagonal matrix
Consider the classic central difference approximation to the second derivative (yes, more derivatives!): \[
\frac{\mathrm{d}^2u}{\mathrm{d}x^2}\approx\frac{u(x-h)-2u(x)+u(x+h)}{h^2},
\] for some uniform spacing \(h>0\). For a uniform discretisation \(\{x_k\}_{k=1}^n\) of an interval, this can be written as \[
\left.\frac{\mathrm{d}^2u}{\mathrm{d}x^2}\right|_{x=x_k}\approx\frac{u_{k-1}-2u_k+u_{k+1}}{h^2},
\] where \(u_k\) is an approximation of \(u(x_k)\). This is implemented as a Fortran subroutine central_diff inside the Fortran module at session1/exercises/sparse/central_diff.f90.
module central_diff_mod
implicit none
private
public :: central_diff
contains
! Central difference approximation for the second derivative of a periodic 1D function
subroutine central_diff(u, approx, h, n)
implicit none
integer, intent(in) :: n ! Number of grid points
real, intent(in) :: h ! Uniform grid spacing
real, dimension(n), intent(in) :: u ! Input vector
real, dimension(n), intent(out) :: approx ! Central difference approximation
integer :: i ! Dummy index for looping
if (size(u, 1) /= n) then
print *, "Invalid input array size"
stop 1
end if
if (size(approx, 1) /= n) then
print *, "Invalid output array size"
stop 1
end if
do i = 1, n
if (i == 1) then
! Periodic boundary on the left
approx(i) = (u(n) - 2 * u(i) + u(i+1)) / h ** 2
else if (i == n) then
! Periodic boundary on the right
approx(i) = (u(i-1) - 2 * u(i) + u(1)) / h ** 2
else
! Interior points
approx(i) = (u(i-1) - 2 * u(i) + u(i+1)) / h ** 2
end if
end do
end subroutine central_diff
end module central_diff_mod
Exercise
Apply Tapenade in vector mode to compute the derivative of the output array approx with respect to the input array u.
Solution
Running
$ tapenade -tgtmodulename %_d -head "central_diff(approx)/(u)" -vector central_diff_mod.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
@@ Options: multiDirectional
Command: Procedure central_diff understood as central_difference.central_diff
Diff-liveness analysis turned off
@@ Created ./central_diff_mod_dv.f90
@@ Created ./central_diff_mod_dv.msg
$ cat central_diff_mod_dv.f90
gives
! Generated by TAPENADE (INRIA, Ecuador team)
! Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
MODULE CENTRAL_DIFF_MOD_DV
USE DIFFSIZES
! Hint: nbdirsmax should be the maximum number of differentiation directions
IMPLICIT NONE
PRIVATE
PUBLIC :: central_diff
PUBLIC :: central_diff_dv
CONTAINS
! Differentiation of central_diff in forward (tangent) mode (with options multiDirectional):
! variations of useful results: approx
! with respect to varying inputs: u
! RW status of diff variables: approx:out u:in
! Central difference approximation for the second derivative of a periodic 1D function
SUBROUTINE CENTRAL_DIFF_DV(u, ud, approx, approxd, h, n, nbdirs)
USE DIFFSIZES
! Hint: nbdirsmax should be the maximum number of differentiation directions
IMPLICIT NONE
! Number of grid points
INTEGER, INTENT(IN) :: n
! Uniform grid spacing
REAL, INTENT(IN) :: h
! Input vector
REAL, DIMENSION(n), INTENT(IN) :: u
REAL, DIMENSION(nbdirsmax, n), INTENT(IN) :: ud
! Central difference approximation
REAL, DIMENSION(n), INTENT(OUT) :: approx
REAL, DIMENSION(nbdirsmax, n), INTENT(OUT) :: approxd
! Dummy index for looping
INTEGER :: i
INTRINSIC SIZE
INTEGER :: nd
INTEGER :: nbdirs
IF (SIZE(u, 1) .NE. n) THEN
PRINT*, 'Invalid input array size'
STOP
ELSE IF (SIZE(approx, 1) .NE. n) THEN
PRINT*, 'Invalid output array size'
STOP
ELSE
approxd = 0.0
DO i=1,n
IF (i .EQ. 1) THEN
! Periodic boundary on the left
DO nd=1,nbdirs
approxd(nd, i) = (ud(nd, n)-2*ud(nd, i)+ud(nd, i+1))/h**2
END DO
approx(i) = (u(n)-2*u(i)+u(i+1))/h**2
ELSE IF (i .EQ. n) THEN
! Periodic boundary on the right
DO nd=1,nbdirs
approxd(nd, i) = (ud(nd, i-1)-2*ud(nd, i)+ud(nd, 1))/h**2
END DO
approx(i) = (u(i-1)-2*u(i)+u(1))/h**2
ELSE
! Interior points
DO nd=1,nbdirs
approxd(nd, i) = (ud(nd, i-1)-2*ud(nd, i)+ud(nd, i+1))/h**2
END DO
approx(i) = (u(i-1)-2*u(i)+u(i+1))/h**2
END IF
END DO
END IF
END SUBROUTINE CENTRAL_DIFF_DV
! Central difference approximation for the second derivative of a periodic 1D function
SUBROUTINE CENTRAL_DIFF(u, approx, h, n)
IMPLICIT NONE
! Number of grid points
INTEGER, INTENT(IN) :: n
! Uniform grid spacing
REAL, INTENT(IN) :: h
! Input vector
REAL, DIMENSION(n), INTENT(IN) :: u
! Central difference approximation
REAL, DIMENSION(n), INTENT(OUT) :: approx
! Dummy index for looping
INTEGER :: i
INTRINSIC SIZE
IF (SIZE(u, 1) .NE. n) THEN
PRINT*, 'Invalid input array size'
STOP
ELSE IF (SIZE(approx, 1) .NE. n) THEN
PRINT*, 'Invalid output array size'
STOP
ELSE
DO i=1,n
IF (i .EQ. 1) THEN
! Periodic boundary on the left
approx(i) = (u(n)-2*u(i)+u(i+1))/h**2
ELSE IF (i .EQ. n) THEN
! Periodic boundary on the right
approx(i) = (u(i-1)-2*u(i)+u(1))/h**2
ELSE
! Interior points
approx(i) = (u(i-1)-2*u(i)+u(i+1))/h**2
END IF
END DO
END IF
END SUBROUTINE CENTRAL_DIFF
END MODULE CENTRAL_DIFF_MOD_DV
Note: The differentiated module central_diff_mod_dv depends on a module diffsizes which specifies the number of differentiation directions. The examples that follow will each define this module (diffsizes_dense.f90, diffsizes_compressed.f90).
Tridiagonal example: compute dense Jacobian
In session1/exercises/sparse/dense_jacobian.f90 we provide a program for computing the Jacobian of the centred difference function over the uniform partition of the interval.
! Naive program for computing a tridiagonal Jacobian
program dense_jacobian
use diffsizes, only: m => nbdirsmax ! Number of seed vectors, from diffsizes_dense.f90
use central_diff_mod_dv, only: central_diff_dv ! Tapenade Forward mode derivative
implicit none
logical, parameter :: output = .true. ! Flag for whether to write output to file
integer, parameter :: n = 10 ! Number of grid points
real, parameter :: h = 1.0 ! Uniform grid spacing
real, dimension(n) :: u ! Input vector
real, dimension(n) :: approx ! Central difference approximation
real, dimension(m,n) :: seed ! Seed matrix for the VJP
real, dimension(m,n) :: jacobian ! Jacobian for the central difference calculation
integer :: i ! Dummy index for looping
if (m /= n) then
print *, "Error: number of grid points must match number of seed vectors."
stop 1
end if
! Specify some arbitrary input
u(:) = 1.0

! Set up the seed matrix as the nxn identity
seed(:,:) = 0.0
do i = 1, n
seed(i,i) = 1.0
end do
! Propagate the seed matrix through the VJP
call central_diff_dv(u, seed, approx, jacobian, h, n, n)
! Write out the result to file
if (output) then
open(unit=10, file="dense_jacobian.dat")
do i = 1, n
write(unit=10, fmt=100) jacobian(i,:)
end do
100 format(10(f4.1,","))
close(unit=10)
end if
end program dense_jacobian
Details on running the code
Compile with
$ make dense_jacobian
and then run with
$ ./dense_jacobian
Compiling and running produces a file dense_jacobian.dat, which we can read and view in the notebook. Note that the computation here required \(n\) applications of forward mode AD, where \(n\) is the number of gridpoints.
"exercises/sparse/dense_jacobian.dat") print_matrix(
Sparse AD
We can compute the full Jacobian with \[\nabla\mathbf{f}(\mathbf{x})=\nabla\mathbf{f}(\mathbf{x})\mathbf{I}_n=\nabla\mathbf{f}(\mathbf{x})\begin{bmatrix}\mathbf{e}_1,\mathbf{e}_2,\dots,\mathbf{e}_n\end{bmatrix}.\]
- But what about when \(n\) gets very large?
- And what about when the Jacobian is sparse?
Diagonal Jacobian example
Suppose \(\nabla\mathbf{f}(\mathbf{x})\) is diagonal, say
\[\nabla\mathbf{f}(\mathbf{x})=\begin{bmatrix}J_1\\& J_2\\ & & \ddots\\ & & & J_n\end{bmatrix}.\]
Then, for a seed vector \(\dot{\mathbf{x}}=\begin{bmatrix}\dot{x}_1 & \dot{x}_2 & \dots & \dot{x}_n\end{bmatrix}^T\), we have
\[ \nabla\mathbf{f}(\mathbf{x})\:\dot{\mathbf{x}} =\begin{bmatrix}J_1\\& J_2\\ & & \ddots\\ & & & J_n\end{bmatrix} \begin{bmatrix}\dot{x}_1 \\ \dot{x}_2 \\ \vdots \\ \dot{x}_n\end{bmatrix} =\begin{bmatrix}J_1\dot{x}_1 \\ J_2\dot{x}_2 \\ \vdots \\ J_n\dot{x}_n\end{bmatrix}. \]
Making the clever choice of seed vector \(\dot{\mathbf{x}}:=\mathbf{e}=\begin{bmatrix}1 & 1 & \dots & 1\end{bmatrix}^T\), we can apply a single JVP and then back out the full Jacobian by putting each entry on the diagonal: \[ \mathrm{diag}(\nabla\mathbf{f}(\mathbf{x})\:\mathbf{e}) =\mathrm{diag}\left(\begin{bmatrix}J_1 \\ J_2 \\ \vdots \\ J_n \end{bmatrix}\right) =\begin{bmatrix}J_1\\& J_2\\ & & \ddots\\ & & & J_n\end{bmatrix} =\nabla\mathbf{f}(\mathbf{x}). \]
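A quick NumPy sketch of this trick (illustrative only, with a made-up elementwise function whose Jacobian is diagonal and a hand-coded forward-mode derivative; the names diag_func and diag_func_d are our own):

import numpy as np

def diag_func(x):
    """An elementwise map, so its Jacobian is diagonal."""
    return np.exp(x) + x**2

def diag_func_d(x, xd):
    """Hand-coded forward-mode derivative (JVP) of diag_func."""
    return (np.exp(x) + 2 * x) * xd

x = np.linspace(0.0, 1.0, 5)

# A single JVP with the all-ones seed recovers the entire (diagonal) Jacobian
jacobian = np.diag(diag_func_d(x, np.ones_like(x)))
print(jacobian)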
Sparse AD: what colour is your Jacobian?
The example above is the simplest case of what is known as colouring. The idea is to group columns of the Jacobian such that no two columns in the same group have a nonzero entry in the same row (the columns are structurally orthogonal), so that each group can be recovered from a single JVP. This gives rise to compression approaches for sparse AD.
Figure 5: Jacobian colouring diagram taken from (Gebremedhin et al., 2005).
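Before turning to the Fortran program in the next section, here is a NumPy sketch of the same compress/decompress idea for the periodic central-difference stencil. It is illustrative only and not the repository code; in particular it uses \(n=12\), divisible by the number of colours, so that the simple modulo-3 colouring is valid for the periodic stencil.

import numpy as np

def central_diff_d(ud, h):
    """JVP of the periodic central-difference operator; the operator is linear, so this is just the operator applied to the seed."""
    return (np.roll(ud, 1) - 2 * ud + np.roll(ud, -1)) / h**2

n, m, h = 12, 3, 1.0  # n chosen divisible by the number of colours m

# Compressed Jacobian: one JVP per colour, seeding with the sum of the unit vectors in that colour
compressed = np.array([central_diff_d((np.arange(n) % m == c).astype(float), h) for c in range(m)])

# Decompression: entry (row, col) is nonzero only for col in {row-1, row, row+1} (periodic),
# and it was accumulated into the compressed row corresponding to col's colour
jacobian = np.zeros((n, n))
for row in range(n):
    for col in ((row - 1) % n, row, (row + 1) % n):
        jacobian[row, col] = compressed[col % m, row]

# Check against the dense approach, which needs n JVPs (one per canonical unit vector)
dense = np.column_stack([central_diff_d(np.eye(n)[:, j], h) for j in range(n)])
print(np.allclose(jacobian, dense))  # True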
Tridiagonal example: compute compressed Jacobian
In session1/exercises/sparse/compressed_jacobian.f90 we provide a program for computing the same Jacobian of the centred difference function, but this time using a clever choice of seed matrix: \[
\dot{\mathbf{X}}:=\begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1\\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0
\end{bmatrix}.
\]
! Program for computing a tridiagonal Jacobian using compression
program compressed_jacobian
use diffsizes, only: m => nbdirsmax ! Number of seed vectors, from diffsizes_compressed.f90
use central_diff_mod_dv, only: central_diff_dv ! Tapenade Forward mode derivative
implicit none
logical, parameter :: output = .true. ! Flag for whether to write output to file
integer, parameter :: n = 10 ! Number of grid points
real, parameter :: h = 1.0 ! Uniform grid spacing
real, dimension(n) :: u ! Input vector
real, dimension(n) :: approx ! Central difference approximation
real, dimension(m,n) :: seed ! Seed matrix for the VJP
real, dimension(m,n) :: compressed ! Compressed Jacobian for the central difference calculation
real, dimension(n,n) :: jacobian ! Jacobian for the central difference calculation
integer :: i, j ! Dummy indices for looping
! Specify some arbitrary input
u(:) = 1.0

! Set up the seed matrix as a 3xn array
seed(:,:) = 0.0
do i = 1, m
seed(i,i::m) = 1.0
end do
! Apply the VJP
call central_diff_dv(u, seed, approx, compressed, h, n, m)
! Write out the compressed result to file
if (output) then
open(unit=10, file="compressed_jacobian.dat")
do i = 1, m
write(unit=10, fmt=100) compressed(i,:)
end do
100 format(10(f4.1,","))
close(unit=10)
end if
! Decompress rows and insert them into the Jacobian
do i = 1, m
j = i
jacobian(i::m,:) = 0.0
do while (j <= n)
if (j == 1) then
jacobian(j,n) = compressed(i,n)
else
jacobian(j,j-1) = compressed(i,j-1)
end if
jacobian(j,j) = compressed(i,j)
if (j == n) then
jacobian(j,1) = compressed(i,1)
else
jacobian(j,j+1) = compressed(i,j+1)
end if
j = j + m
end do
end do
! Write out the result to file
if (output) then
open(unit=11, file="decompressed_jacobian.dat")
do i = 1, n
write(unit=11, fmt=100) jacobian(i,:)
end do
close(unit=11)
end if
end program compressed_jacobian
Details on running the code
Compile with
$ make compressed_jacobian
and then run with
$ ./compressed_jacobian
Compiling and running produces files compressed_jacobian.dat and decompressed_jacobian.dat, which we can again read and view in the notebook. This time we only need to compute three JVPs, rather than \(n\)!
Exercise
Experiment with increasing the number of gridpoints to see the impact on the overall cost of the dense approach versus the compressed approach.
Hint: Using the time command line tool should give some very basic timing information.
Solution
Make the changes
diff --git a/session1/exercises/sparse/compressed_jacobian.f90 b/session1/exercises/sparse/compressed_jacobian.f90
index 1c49ebd..86c05db 100644
--- a/session1/exercises/sparse/compressed_jacobian.f90
+++ b/session1/exercises/sparse/compressed_jacobian.f90
@@ -5,8 +5,8 @@ program compressed_jacobian
   implicit none
-  logical, parameter :: output = .true. ! Flag for whether to write output to file
-  integer, parameter :: n = 10 ! Number of grid points
+  logical, parameter :: output = .false. ! Flag for whether to write output to file
+  integer, parameter :: n = 10000 ! Number of grid points
   real, parameter :: h = 1.0 ! Uniform grid spacing
   real, dimension(n) :: u ! Input vector
   real, dimension(n) :: approx ! Central difference approximation
diff --git a/session1/exercises/sparse/dense_jacobian.f90 b/session1/exercises/sparse/dense_jacobian.f90
index a61111d..e04a9aa 100644
--- a/session1/exercises/sparse/dense_jacobian.f90
+++ b/session1/exercises/sparse/dense_jacobian.f90
@@ -5,8 +5,8 @@ program dense_jacobian
   implicit none
-  logical, parameter :: output = .true. ! Flag for whether to write output to file
-  integer, parameter :: n = 10 ! Number of grid points
+  logical, parameter :: output = .false. ! Flag for whether to write output to file
+  integer, parameter :: n = 10000 ! Number of grid points
   real, parameter :: h = 1.0 ! Uniform grid spacing
   real, dimension(n) :: u ! Input vector
   real, dimension(n) :: approx ! Central difference approximation
diff --git a/session1/exercises/sparse/diffsizes_dense.f90 b/session1/exercises/sparse/diffsizes_dense.f90
index 0fef76e..ffc198a 100644
--- a/session1/exercises/sparse/diffsizes_dense.f90
+++ b/session1/exercises/sparse/diffsizes_dense.f90
@@ -5,5 +5,5 @@ module diffsizes
   private
   public :: nbdirsmax
-  integer, parameter :: nbdirsmax = 10
+  integer, parameter :: nbdirsmax = 10000
 end module diffsizes
recompile and then run
$ time ./compressed_jacobian && time ./dense_jacobian
You should get output something like
real 0m0.309s
user 0m0.140s
sys 0m0.168s
real 0m0.883s
user 0m0.505s
sys 0m0.377s
indicating that the compressed approach is roughly three times as fast at this resolution. We could likely get a much better speedup by optimising the decompression strategy.
Warning! Make sure you turn off the output flag and recompile before conducting any timing experiments.
"exercises/sparse/compressed_jacobian.dat") print_matrix(
"exercises/sparse/decompressed_jacobian.dat") print_matrix(
Sparse AD: in practice
It’s worth noting that we used knowledge about the tridiagonal structure of the Jacobian to choose the seed vectors… so that we could compute the Jacobian. For Jacobians in the wild (e.g., those related to networks) we don’t necessarily have such intuition. In practice, we need to perform an initial step to establish the sparsity pattern before we can do the colouring. This functionality is available in Tapenade, but currently only on a branch.
Sparse AD: overall performance comparison
We have considered three approaches for sparse AD:
- ‘Dense approach’, which computes a JVP for each canonical unit vector.
- ‘Compressed approach’, which computes one JVP per group (colour) of structurally orthogonal columns.
- ‘Matrix-free approach’, which never assembles the Jacobian and instead applies a JVP wherever the Jacobian would act on a vector. (Only useful if the surrounding algorithm only ever needs Jacobian-vector products, as an iterative linear solver does.)
In the following, we walk through a performance comparison of these different approaches applied to the solution of the nonlinear Gray-Scott equations \[ \frac{\partial u}{\partial t}=D_1\Delta u-uv^2+\gamma(1-u), \quad\frac{\partial v}{\partial t}=D_2\Delta v+uv^2-(\gamma+\kappa)v, \quad u,v:[0,2.5]^2\rightarrow\mathbb{R} \] (subject to some initial conditions) using a finite difference approach with Crank-Nicolson timestepping (\(\theta=0.5\)). This discretisation corresponds to a 5-point stencil and it turns out the Jacobian can be coloured with five colours. As such, the compressed approach involves five JVPs per Jacobian computation.
Figure 6: Performance comparison for different Jacobian computation approaches for solving the Gray-Scott equation, taken from (Wallwork et al., 2019). PETSc was used for numerical solvers and ADOL-C for AD. The \(N\times N\) notation in the subcaptions refers to the grid dimension.
Here ‘analytic’ refers to the hand-derived Jacobian, which we can’t realistically expect to beat.
Sparse AD: convergence comparison
Matrix-free works well for small dimensions. However, it slows down at higher resolution because Jacobian assembly is done at each iteration of the nonlinear solver, whereas the JVP is calculated at each iteration of the linear solver. It’s possible that a different choice of linear solver or preconditioner would reduce the cost of the matrix-free approach.
Figure 7: Linear solver iterations as a function of grid resolution and number of processors for the Gray-Scott problem. (Numbers of nonlinear solver iterations are much lower.) Taken from (Wallwork et al, 2019).
Summary and outlook
In today’s session we:
- Got a brief history of automatic differentiation.
- Learnt about forward mode and the source transformation approach.
- Tried out the Tapenade source transformation AD tool applied to some test problems.
- Verified the code generated by Tapenade both manually and using the Taylor test.
- Learnt about compressed and matrix-free approaches for sparse problems.
In tomorrow’s session (at the same time) we will:
- Learn about reverse mode and the operator overloading approach, comparing them with forward mode and source transformation, respectively.
- Verify the consistency of code generated by Tapenade under forward and reverse mode using the dot product test.
- Calculate higher order derivatives using Tapenade.
- Try out the Pyadjoint operator overloading AD tool underpinning the Firedrake finite element library and see showcases of more advanced AD usage.
- Learn about checkpointing and using AD to compute higher order derivatives.
References
- R. E. Wengert. A simple automatic derivative evaluation program (1964). Communications of the ACM, 7(8):463–464, doi:10.1145/355586.364791.
- A. Griewank. Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation (1992). Optimization Methods & Software, 1:35–54, doi:10.1080/10556789208805505.
- A. H. Gebremedhin, et al. What color is your Jacobian? Graph coloring for computing derivatives (2005). SIAM Review, 47(4):629–705, doi:10.1137/S0036144504444711.
- J. G. Wallwork, et al. Computing derivatives for PETSc adjoint solvers using algorithmic differentiation (2019). arXiv preprint, doi:10.48550/arXiv.1909.02836.