Session 2: Reverse mode differentiation and operator overloading

%matplotlib inline
import matplotlib.colors as colors
import matplotlib.pyplot as plt
from mpltools import annotation
import networkx as nx
import numpy as np
from firedrake.adjoint import pyadjoint

Set up codespace now. (It will take a while!)
We will show you where to find this notebook in the repo while you wait.
Motivation
- Were you interested in yesterday’s session but unclear how to make use of differentiable programming in practice?
- We computed first-order derivatives for various programs. However, some advanced applications, such as uncertainty quantification, require Hessians (matrices of second derivatives). How do we calculate those?
- Have you ever wondered how neural network training actually works under the hood?
Machine learning (ML) models typically have large numbers of parameters to be tuned and small numbers of outputs. Forward mode doesn’t seem like such a great fit…
Learning objectives
In today’s session we will:
- Learn about reverse mode and the operator overloading approach, comparing them with forward mode and source transformation, respectively.
- Verify the consistency of code generated by Tapenade under forward and reverse mode using the dot product test.
- Calculate higher order derivatives using Tapenade.
- Try out the Pyadjoint operator overloading AD tool underpinning the Firedrake finite element library and see showcases of more advanced AD usage.
- Learn about checkpointing and using AD to compute higher order derivatives.
Preparations
Terminology
Recall from session 1: This course introduces the concept of differentiable programming, a.k.a. automatic differentiation (AD), or algorithmic differentiation. We will use the acronym AD henceforth.
Notation
Recall from session 1: For a differentiable mathematical function \(f:A\rightarrow\mathbb{R}\) with scalar input (i.e., a single value) from \(A\subseteq\mathbb{R}\), we make use of both the Lagrange notation \(f'(x)\) and Leibniz notation \(\frac{\mathrm{d}f}{\mathrm{d}x}\) for its derivative.
Caution Take care with the physics “dot” notation \(\dot{x}\): it is used both for forward mode seeds and for forward mode derivatives.
Caution For seed vectors in reverse mode we will use the “bar” notation \(\bar{y}\). This is not a mean value.
Recall from session 1: When it comes to forward mode derivatives in code, we use the _d notation, which is standard in the AD literature.
When it comes to reverse mode derivatives in code, we use the _b notation (for “backward” or “bar”).
Brief history of reverse mode
- Reverse mode (a.k.a. backpropagation) was discovered by Linnainmaa in the 1970s.
- The terminology ‘back-propagating error correction’ had already been introduced in 1962 by Frank Rosenblatt, but he did not know how to implement this.
- Speelpenning introduced the modern formulation of reverse mode in his 1980 PhD thesis.
Figure 2: Schematic from (Speelpenning, 1980).
- Griewank improved the feasibility of reverse mode in 1992 by introducing checkpointing.
Recall: directional derivative, a.k.a. Jacobian-vector product (JVP)
Consider a vector-valued function \(\mathbf{f}\) mapping from a subspace \(A\subseteq\mathbb{R}^n\) into \(\mathbb{R}^m\), for some \(m,n\in\mathbb{N}\): \[\mathbf{f}:A\rightarrow\mathbb R^m.\]
Given input \(\mathbf{x}\in A\) and a seed vector \(\dot{\mathbf{x}}\in\mathbb{R}^n\), forward mode AD allows us to compute the action (matrix-vector product) \[\text{JVP}(\mathbf{f},\mathbf{x},\dot{\mathbf{x}}):=\nabla\mathbf{f}(\mathbf{x})\,\dot{\mathbf{x}}.\] Again, think of the seed vector as being an input from outside of the part of the program being differentiated.
Here \(\nabla\mathbf{f}\) is referred to as the Jacobian for the map, so the above is known as a Jacobian-vector product (JVP).
Note The computation is matrix-free. We don’t actually need the Jacobian when we compute this product.
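To make the matrix-free point concrete, here is a minimal NumPy sketch (not part of the course exercises; the function name f_jvp and the seed values are made up) of a hand-written JVP for the \(f\) function from the \(f\) & \(g\) example.

```python
import numpy as np

def f_jvp(x, xdot):
    """Return (f(x), JVP) for f(x) = x[0]*x[1], given a seed vector xdot."""
    y = x[0] * x[1]
    ydot = x[1] * xdot[0] + x[0] * xdot[1]  # product rule, applied directly
    return y, ydot

x = np.array([1.2, -2.3])
xdot = np.array([4.2, -0.7])  # forward mode seed
print(f_jvp(x, xdot))  # no Jacobian matrix is ever formed
```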
Jacobian-transpose-vector product (JTVP)
Consider a vector-valued function \(\mathbf{f}\) mapping from a subspace \(A\subseteq\mathbb{R}^n\) into \(\mathbb{R}^m\), for some \(m,n\in\mathbb{N}\): \[\mathbf{f}:A\rightarrow\mathbb{R}^m.\]
Given \(\mathbf{x}\in A\) and a seed vector \(\bar{\mathbf{y}}\in\mathbb{R}^m\), reverse mode AD allows us to compute the transpose action (transposed matrix-vector product) \[\text{JTVP}(\mathbf{f},\mathbf{x},\bar{\mathbf{y}}):=\nabla\mathbf{f}(\mathbf{x})^T\bar{\mathbf{y}}.\]
Optional exercise
Convince yourself that the JTVP is well defined.
Solution
We have \(\nabla\mathbf{f}(\mathbf{x})\in\mathbb{R}^{m\times n}\), so \(\nabla\mathbf{f}(\mathbf{x})^T\in\mathbb{R}^{n\times m}\). Since \(\bar{\mathbf{y}}\in\mathbb{R}^m\), the dimensions are appropriate to take the JTVP.
Note Again, the computation is matrix-free. We don’t actually need the Jacobian or its transpose when we compute this product.
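Similarly, here is a minimal NumPy sketch (again not part of the exercises; g_jtvp and the seed are made up for illustration) of a hand-written JTVP for the \(g\) function.

```python
import numpy as np

def g_jtvp(y, zbar):
    """Return the JTVP of g(y) = (sin(y), cos(y)) for a seed zbar in R^2."""
    # The Jacobian is the 2x1 matrix [cos(y); -sin(y)]; we apply its
    # transpose to the seed without ever building it.
    return np.cos(y) * zbar[0] - np.sin(y) * zbar[1]

print(g_jtvp(0.5, np.array([1.0, 2.0])))
```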
Forward mode vs. reverse mode
For seed vectors \(\dot{\mathbf{x}}\in\mathbb{R}^n\) and \(\bar{\mathbf{y}}\in\mathbb{R}^m\), forward mode and reverse mode compute
\[ \text{JVP}(\mathbf{f},\mathbf{x},\dot{\mathbf{x}}):=\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}} \quad\text{and}\quad \text{JTVP}(\mathbf{f},\mathbf{x},\bar{\mathbf{y}}):=\nabla\mathbf{f}(\mathbf{x})^T\bar{\mathbf{y}} \] respectively.
- Forward mode is more appropriate if \(n\ll m\), i.e., \(\#inputs\ll\#outputs\).
- e.g., sensitivity analysis or optimisation w.r.t. a small number of parameters.
- Reverse mode is more appropriate if \(n\gg m\), i.e., \(\#inputs\gg\#outputs\).
- e.g., ODE/PDE-constrained optimisation (cost function), machine learning training (loss function), goal-oriented error estimation (quantity of interest).
- Forward mode is computed eagerly, whereas reverse mode is done separately from the primal run.
- Reverse mode tends to have higher memory requirements.
\(f\) & \(g\) example: Directed Acyclic Graph
Recall the introductory example from yesterday and the DAG representations of the \(f\) and \(g\) functions.
Recalling that \[f(x_1,x_2)=x_1x_2\] and \[g(y)=(\sin(y),\cos(y)),\] we have
Figure 3: Directed Acyclic Graph (DAG) for the \(f\) function in the \(f\) & \(g\) example. Generated using tikZ and \(\LaTeX\).
Figure 4: Directed Acyclic Graph (DAG) for the \(g\) function in the \(f\) & \(g\) example. Generated using tikZ and \(\LaTeX\).
\(f\) & \(g\) example: reverse mode seed vectors
In reverse mode, gradient information propagates in the opposite direction.
Figure 5: Directed Acyclic Graph (DAG) for the composition of the functions in the \(f\) & \(g\) example. Generated using tikZ and \(\LaTeX\).
\(f\) & \(g\) example: reverse mode
The Fortran code for the two functions is copied in the repository at session2/exercises/fg/f_mod.f90 and session2/exercises/fg/g_mod.f90. This time, they are defined inside modules.
module f_mod
implicit none
private
public :: f
contains
subroutine f(x, y)
implicit none
real, dimension(2), intent(in) :: x
real, intent(out) :: y
y = x(1) * x(2)
end subroutine f
end module f_mod
module g_mod
implicit none
private
public :: g
contains
subroutine g(y, z)
implicit none
real, intent(in) :: y
real, intent(out), dimension(2) :: z
z = [sin(y), cos(y)]
end subroutine g
end module g_mod
Exercise
- Run tapenade -h to review the options.*
- Apply Tapenade to each of these subroutines in reverse mode, which will compute the JTVP for some seed vector. Inspect the output files f_mod_b.f90 and g_mod_b.f90 and check they are as you expect.
- (Optional) Inspect the message files f_mod_b.msg and g_mod_b.msg.
*Recall that “tangent mode” is another term for “forward mode”. Similarly, “adjoint mode” is another term for “reverse mode”.
Solution 1
$ tapenade -h
Tapenade 3.16 (develop) - 23 Apr 2025 13:39 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Builds a differentiated program.
Usage: tapenade [options]* filenames
options:
-head, -root <proc> set the differentiation root procedure(s)
See FAQ for refined invocation syntax, e.g.
independent and dependent arguments, multiple heads...
-tangent, -d differentiate in forward/tangent mode (default)
-reverse, -b differentiate in reverse/adjoint mode
-vector, -multi turn on "vector" mode (i.e. multi-directional)
-specializeactivity <unit_names or %all%> Allow for several activity patterns per routine
-primal, -p turn off differentiation. Show pointer destinations
-output, -o <file> put all generated code into a single <file>
-splitoutputfiles split generated code, one file per top unit
-outputdirectory, -O <directory> put all generated files in <directory> (default: .)
-I <includePath> add a new search path for include files
-tgtvarname <str> set extension for tangent variables (default %d)
-tgtfuncname <str> set extension for tangent procedures (default %_d)
-tgtmodulename <str> set extension for tangent modules and types (default %_diff)
-adjvarname <str> set extension for adjoint variables (default %b)
-adjfuncname <str> set extension for adjoint procedures (default %_b)
-adjmodulename <str> set extension for adjoint modules and types (default %_diff)
-modulename <str> set extension for tangent&adjoint modules and types (default %_diff)
-inputlanguage <lang> language of input files (fortran, fortran90,
fortran95, or C)
-outputlanguage <lang> language of output files (fortran, fortran90,
fortran95, or C)
-ext <file> incorporate external library description <file>
-nolib don't load standard libraries descriptions
-i<n> count <n> bytes for an integer (default -i4)
-r<n> count <n> bytes for a real (default -r4)
-dr<n> count <n> bytes for a double real (default -dr8)
-p<n> count <n> bytes for a pointer (default -p8)
-fixinterface don't use activity to filter user-given (in)dependent vars
-noinclude inline include files
-debugTGT insert instructions for debugging tangent mode
-debugADJ insert instructions for debugging adjoint mode
-tracelevel <n> set the level of detail of trace milestones
-msglevel <n> set the level of detail of error messages
-msginfile insert error messages in output files
-dump <file> write a dump <file>
-html display results in a web browser
-nooptim <str> turn off optimization <str> (in {activity, difftypes,
diffarguments, stripprimalmodules, spareinit, splitdiff,
mergediff, saveonlyused, tbr, snapshot, diffliveness,
deadcontrol, recomputeintermediates,
everyoptim}
-version display Tapenade version information
Report bugs to <tapenade@inria.fr>.
Solution 2
Running
$ cd session2/exercises/fg
$ ls
f_mod.f90 g_mod.f90
$ tapenade -b -head "f(x)\(y)" -adjmodulename _b f_mod.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Command: Procedure f understood as f_mod.f
@@ Created ./f_mod_b.f90
@@ Created ./f_mod_b.msg
$ cat f_mod_b.f90
gives
! Generated by TAPENADE (INRIA, Ecuador team)
! Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
MODULE F_MOD_B
IMPLICIT NONE
PRIVATE
PUBLIC :: f
PUBLIC :: f_b
CONTAINS
! Differentiation of f in reverse (adjoint) mode:
! gradient of useful results: y
! with respect to varying inputs: x y
! RW status of diff variables: x:out y:in-zero
SUBROUTINE F_B(x, xb, y, yb)
IMPLICIT NONE
REAL, DIMENSION(2), INTENT(IN) :: x
REAL, DIMENSION(2) :: xb
REAL :: y
REAL :: yb
xb = 0.0
xb(1) = xb(1) + x(2)*yb
xb(2) = xb(2) + x(1)*yb
yb = 0.0
END SUBROUTINE F_B
SUBROUTINE F(x, y)
IMPLICIT NONE
REAL, DIMENSION(2), INTENT(IN) :: x
REAL, INTENT(OUT) :: y
y = x(1)*x(2)
END SUBROUTINE F
END MODULE F_MOD_B
Running
$ tapenade -b -head "g(y)\(z)" -adjmodulename _b g_mod.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Command: Procedure g understood as g_mod.g
@@ Created ./g_mod_b.f90
@@ Created ./g_mod_b.msg
$ cat g_mod_b.f90
gives
! Generated by TAPENADE (INRIA, Ecuador team)
! Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
MODULE G_MOD_B
IMPLICIT NONE
PRIVATE
PUBLIC :: g
PUBLIC :: g_b
CONTAINS
! Differentiation of g in reverse (adjoint) mode:
! gradient of useful results: z
! with respect to varying inputs: y z
! RW status of diff variables: y:out z:in-zero
SUBROUTINE G_B(y, yb, z, zb)
IMPLICIT NONE
REAL, INTENT(IN) :: y
REAL :: yb
REAL, DIMENSION(2) :: z
REAL, DIMENSION(2) :: zb
INTRINSIC COS
INTRINSIC SIN
yb = COS(y)*zb(1) - SIN(y)*zb(2)
zb = 0.0
END SUBROUTINE G_B
SUBROUTINE G(y, z)
IMPLICIT NONE
REAL, INTENT(IN) :: y
REAL, DIMENSION(2), INTENT(OUT) :: z
INTRINSIC COS
INTRINSIC SIN
z = (/SIN(y), COS(y)/)
END SUBROUTINE G
END MODULE G_MOD_B
Verification: the dot product test
We have approaches for computing derivatives with both forward mode and reverse mode. But how do we know the outputs are consistent? This can be verified using the so-called dot product test.
In the dot product test, we define the inner product \(\psi:=\bar{\mathbf{y}}^T\dot{\mathbf{y}}\) of \(\bar{\mathbf{y}}\) and \(\dot{\mathbf{y}}\), where \(\dot{\mathbf{y}}=\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}\) is the output of forward mode for some seed \(\dot{\mathbf{x}}\), and \(\bar{\mathbf{y}}\) is a seed for reverse mode. We check that \(\psi=\bar{\mathbf{x}}^T\dot{\mathbf{x}}\), where \(\bar{\mathbf{x}}=\nabla\mathbf{f}(\mathbf{x})^T\bar{\mathbf{y}}\) is the output of reverse mode.
Optional exercise
Prove that we do indeed expect \(\psi=\bar{\mathbf{y}}^T\dot{\mathbf{y}}=\bar{\mathbf{x}}^T\dot{\mathbf{x}}\).
Solution
\[ \psi:=\bar{\mathbf{y}}^T\dot{\mathbf{y}} =\bar{\mathbf{y}}^T(\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}) =(\bar{\mathbf{y}}^T\nabla\mathbf{f}(\mathbf{x}))\dot{\mathbf{x}} =(\nabla\mathbf{f}(\mathbf{x})^T\bar{\mathbf{y}})^T\dot{\mathbf{x}} =\bar{\mathbf{x}}^T\dot{\mathbf{x}} \]
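The identity can also be checked numerically. Below is a hedged NumPy sketch using hand-written JVP/JTVP rules for the \(f\) function (the helper names are made up; the Fortran version of this test follows in the next cell).

```python
import numpy as np

def f_jvp(x, xdot):
    return x[1] * xdot[0] + x[0] * xdot[1]       # forward mode output ydot

def f_jtvp(x, ybar):
    return np.array([x[1] * ybar, x[0] * ybar])  # reverse mode output xbar

x = np.array([1.2, -2.3])
xdot = np.array([4.2, -0.7])  # forward mode seed
ybar = 3.0                    # reverse mode seed

psi1 = ybar * f_jvp(x, xdot)          # ybar^T ydot (y is scalar here)
psi2 = np.dot(f_jtvp(x, ybar), xdot)  # xbar^T xdot
assert np.isclose(psi1, psi2), "dot product test failed"
```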
\(f\) & \(g\) example: dot product test
We’ve already computed forward mode and reverse mode derivatives for the function \(f\) using Tapenade. We can run a dot product test using the code in session2/exercises/fg/dot_product_test.f90.
! Program for verifying the consistency of the forward mode and reverse mode derivatives of the
! function f.
program dot_product_test
use f_mod_d, only: f_d
use f_mod_b, only: f_b
implicit none
real, dimension(2) :: x ! Primal input
real, dimension(2) :: xd ! Forward mode seed
real, dimension(2) :: xb ! Reverse mode derivative
real :: y ! Primal output
real :: yd ! Forward mode derivative
real :: yb ! Reverse mode seed
real :: result1 ! LHS of dot product test
real :: result2 ! RHS of dot product test
real, parameter :: atol = 1e-05 ! Absolute tolerance for the dot product test
! Set arbitrary primal input
x(:) = [1.2, -2.3]
! Call forward mode with some arbitrary seeds
xd(:) = [4.2, -0.7]
call f_d(x, xd, y, yd)
! Choose a seed for reverse mode and evaluate the first result
yb = 3.0
result1 = dot_product([yd], [yb])
! Call reverse mode and evaluate the second result
xb(:) = [0.0, 0.0]
call f_b(x, xb, y, yb)
result2 = dot_product(xd, xb)
! Check the two results match within the prespecified tolerance
if (abs(result1 - result2) < atol) then
write(unit=6, fmt="('PASS')")
else
write(unit=6, fmt="('FAIL with atol=',e10.4)") atol
end if
end program dot_product_test
Exercise
- Regenerate the forward mode derivative code for f_mod.f90 using Tapenade.
- Build and run the dot_product_test program.
- (Optional) Why does the test fail if the result1 assignment is moved to after the call to f_b?
Solution 1
$ tapenade -d -tgtmodulename _d -head "f(x)\(y)" f_mod.f90
Solution 2
$ cd exercises/fg
$ make dot_product_test
$ ./dot_product_test
PASS
Solution 3
Because yb gets reset to zero by f_b. This is done by default because it’s almost always the Right Thing to do.
Second order
For seed vectors \(\dot{\mathbf{x}}\in\mathbb{R}^n\) and \(\bar{\mathbf{y}}\in\mathbb{R}^m\), forward mode and reverse mode compute
\[ \text{JVP}(\mathbf{f},\mathbf{x},\dot{\mathbf{x}}):=\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}} \quad\text{and}\quad \text{JTVP}(\mathbf{f},\mathbf{x},\bar{\mathbf{y}}):=\nabla\mathbf{f}(\mathbf{x})^T\bar{\mathbf{y}} \] respectively.
Question
What are two ways we can use these to compute the Hessian of \(f\)?
Solution 1
Given a seed vector \(\dot{\mathbf{x}}\), first apply forward mode to compute \(\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}\). Then apply forward mode to compute the gradient of this (i.e., apply forward mode to the forward mode derivative code). Use vector mode (preferably with compression!) to get the full Hessian.
Solution 2
Given a seed vector \(\dot{\mathbf{x}}\), first apply forward mode to compute \(\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}\). Then apply reverse mode to compute the gradient of this (i.e., apply reverse mode to the forward mode derivative code). That is, \((\nabla(\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}))^T\bar{\mathbf{y}}=\dot{\mathbf{x}}^T\nabla^T\nabla\mathbf{f}(\mathbf{x})\bar{\mathbf{y}}\). Here the Hessian \(\mathbf{H}(\mathbf{f}):=\nabla^T\nabla\mathbf{f}(\mathbf{x})\) is symmetric and so the two applications give the Hessian-vector product with the seed. Use vector mode (preferably with compression!) to get the full Hessian.
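For reference, higher-level operator overloading tools expose these compositions of modes directly. Here is a hedged sketch using JAX (one of the Python OO tools listed later in this session; it is not used in the Tapenade exercises) applied to the \(f\) function from the running example.

```python
import jax
import jax.numpy as jnp

def f(x):
    return x[0] * x[1]  # f from the running example

x = jnp.array([2.0, 3.0])
v = jnp.array([1.0, 0.0])  # seed vector

# Forward mode applied to the reverse-mode gradient (the mirror image of
# Solution 2): the JVP of the gradient is a Hessian-vector product.
hvp = jax.jvp(jax.grad(f), (x,), (v,))[1]  # -> [0., 1.]

# "Vector" forward mode over reverse mode recovers the full Hessian.
hessian = jax.jacfwd(jax.grad(f))(x)       # -> [[0., 1.], [1., 0.]]
print(hvp, hessian)
```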
Speelpenning example
We compute the Hessian for the classic test case introduced by Speelpenning, which amounts to a reduction of a vector using the product: \[ f(\mathbf{x})=\prod_{i=1}^nx_i. \] This test case is of interest because its Hessian is dense.
The Speelpenning function is implemented as a Fortran subroutine in session2/exercises/speelpenning/speelpenning.f90:
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
subroutine speelpenning(x, f, n)
implicit none
integer, intent(in) :: n ! Size of the input vector
real, dimension(n), intent(in) :: x ! Input vector
real, intent(out) :: f ! Output value, product of all entries
integer :: i ! Dummy loop index
f = 1.0
do i = 1, n
  f = f * x(i)
end do
end subroutine speelpenning
Exercises
- Compute the Hessian of the speelpenning subroutine with Tapenade, using two applications of forward mode.*
- Build and run the view_hessian program in session2/exercises/speelpenning/view_hessian.f90.
- (Optional) Derive the expected Hessian and convince yourself that the output is as you expect.
- (Optional) Compute the Hessian of the speelpenning subroutine with Tapenade, using forward mode followed by reverse mode.*
*Hint: You will need to pass the diffsizes_gradient.f90 file as an additional input file for the second derivative calculation. The diffsizes_hessian.f90 file will be used when computing Hessians.
Solution 1
Forward-then-forward approach.
$ cd session2/exercises/speelpenning
$ ls
diffsizes_gradient.f90 diffsizes_hessian.f90 expected_hessian.f90 Makefile speelpenning.f90 view_hessian.f90
$ tapenade -d -vector -tgtmodulename _d -head "speelpenning(x)\(f)" speelpenning.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
@@ Options: multiDirectional
Command: Procedure speelpenning understood as speelpenning_mod.speelpenning
Diff-liveness analysis turned off
@@ Created ./speelpenning_dv.f90
@@ Created ./speelpenning_dv.msg
$ cat speelpenning_dv.f90
gives
! Generated by TAPENADE (INRIA, Ecuador team)
! Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
MODULE SPEELPENNING_MOD_DV
USE DIFFSIZES
! Hint: nbdirsmax should be the maximum number of differentiation directions
IMPLICIT NONE
CONTAINS
! Differentiation of speelpenning in forward (tangent) mode (with options multiDirectional):
! variations of useful results: f
! with respect to varying inputs: x
! RW status of diff variables: f:out x:in
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
SUBROUTINE SPEELPENNING_DV(x, xd, f, fd, n, nbdirs)
USE DIFFSIZES
! Hint: nbdirsmax should be the maximum number of differentiation directions
IMPLICIT NONE
! Size of the input vector
INTEGER, INTENT(IN) :: n
! Input vector
REAL, DIMENSION(n), INTENT(IN) :: x
REAL, DIMENSION(nbdirsmax, n), INTENT(IN) :: xd
! Output value, product of all entries
REAL, INTENT(OUT) :: f
REAL, DIMENSION(nbdirsmax), INTENT(OUT) :: fd
! Dummy loop index
INTEGER :: i
INTEGER :: nd
INTEGER :: nbdirs
f = 1.0
fd = 0.0
DO i=1,n
  DO nd=1,nbdirs
    fd(nd) = x(i)*fd(nd) + f*xd(nd, i)
  END DO
  f = f*x(i)
END DO
END SUBROUTINE SPEELPENNING_DV
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
SUBROUTINE SPEELPENNING(x, f, n)
IMPLICIT NONE
! Size of the input vector
INTEGER, INTENT(IN) :: n
! Input vector
REAL, DIMENSION(n), INTENT(IN) :: x
! Output value, product of all entries
REAL, INTENT(OUT) :: f
! Dummy loop index
INTEGER :: i
f = 1.0
DO i=1,n
  f = f*x(i)
END DO
END SUBROUTINE SPEELPENNING
END MODULE SPEELPENNING_MOD_DV
As mentioned in the hint, we need to make use of the diffsizes module defined in diffsizes_gradient.f90:
! Module containing the nbdirsmax variable required by speelpenning_dv
module diffsizes
implicit none
integer, parameter :: nbdirsmax = 7
end module diffsizes
We can apply vector forward mode again, passing this in as an additional input:
$ tapenade -vector -head "speelpenning_dv(fd)/(x)" -tgtmodulename _d diffsizes_gradient.f90 speelpenning_dv.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
@@ Options: multiDirectional
Command: Procedure speelpenning_dv understood as speelpenning_mod_dv.speelpenning_dv
Diff-liveness analysis turned off
@@ Created ./speelpenning_dv_dv.f90
@@ Created ./diffsizes_gradient_dv.msg
$ cat speelpenning_dv_dv.f90
gives
! Generated by TAPENADE (INRIA, Ecuador team)
! Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
! Generated by TAPENADE (INRIA, Ecuador team)
! Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
MODULE SPEELPENNING_MOD_DV_DV
USE DIFFSIZES
USE DIFFSIZES
! Hint: nbdirsmax0 should be the maximum number of differentiation directions
IMPLICIT NONE
CONTAINS
! Differentiation of speelpenning_dv in forward (tangent) mode (with options multiDirectional):
! variations of useful results: fd
! with respect to varying inputs: x
! RW status of diff variables: x:in fd:out
! Differentiation of speelpenning in forward (tangent) mode (with options multiDirectional):
! variations of useful results: f
! with respect to varying inputs: x
! RW status of diff variables: f:out x:in
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
SUBROUTINE SPEELPENNING_DV_DV(x, xd0, xd, f, fd, fdd, n, nbdirs, &
& nbdirs0)
USE DIFFSIZES
USE DIFFSIZES
! Hint: nbdirsmax0 should be the maximum number of differentiation directions
IMPLICIT NONE
! Size of the input vector
INTEGER, INTENT(IN) :: n
! Input vector
REAL, DIMENSION(n), INTENT(IN) :: x
REAL, DIMENSION(nbdirsmax0, n), INTENT(IN) :: xd0
REAL, DIMENSION(nbdirsmax, n), INTENT(IN) :: xd
! Output value, product of all entries
REAL, INTENT(OUT) :: f
REAL, DIMENSION(nbdirsmax0) :: fd0
REAL, DIMENSION(nbdirsmax), INTENT(OUT) :: fd
REAL, DIMENSION(nbdirsmax0, nbdirsmax), INTENT(OUT) :: fdd
! Dummy loop index
INTEGER :: i
INTEGER :: nd
INTEGER :: nbdirs
INTEGER :: nd0
INTEGER :: nbdirs0
f = 1.0
fd = 0.0
fd0 = 0.0
fdd = 0.0
DO i=1,n
  DO nd=1,nbdirs
    DO nd0=1,nbdirs0
      fdd(nd0, nd) = fd(nd)*xd0(nd0, i) + x(i)*fdd(nd0, nd) + xd(nd&
&       , i)*fd0(nd0)
    END DO
    fd(nd) = x(i)*fd(nd) + f*xd(nd, i)
  END DO
  DO nd0=1,nbdirs0
    fd0(nd0) = x(i)*fd0(nd0) + f*xd0(nd0, i)
  END DO
  f = f*x(i)
END DO
END SUBROUTINE SPEELPENNING_DV_DV
! Differentiation of speelpenning in forward (tangent) mode (with options multiDirectional):
! variations of useful results: f
! with respect to varying inputs: x
! RW status of diff variables: f:out x:in
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
SUBROUTINE SPEELPENNING_DV(x, xd, f, fd, n, nbdirs)
USE DIFFSIZES
IMPLICIT NONE
! Size of the input vector
INTEGER, INTENT(IN) :: n
! Input vector
REAL, DIMENSION(n), INTENT(IN) :: x
REAL, DIMENSION(nbdirsmax, n), INTENT(IN) :: xd
! Output value, product of all entries
REAL, INTENT(OUT) :: f
REAL, DIMENSION(nbdirsmax), INTENT(OUT) :: fd
! Dummy loop index
INTEGER :: i
INTEGER :: nd
INTEGER :: nbdirs
f = 1.0
fd = 0.0
DO i=1,n
  DO nd=1,nbdirs
    fd(nd) = x(i)*fd(nd) + f*xd(nd, i)
  END DO
  f = f*x(i)
END DO
END SUBROUTINE SPEELPENNING_DV
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
SUBROUTINE SPEELPENNING(x, f, n)
IMPLICIT NONE
! Size of the input vector
INTEGER, INTENT(IN) :: n
! Input vector
REAL, DIMENSION(n), INTENT(IN) :: x
! Output value, product of all entries
REAL, INTENT(OUT) :: f
! Dummy loop index
INTEGER :: i
f = 1.0
DO i=1,n
  f = f*x(i)
END DO
END SUBROUTINE SPEELPENNING
END MODULE SPEELPENNING_MOD_DV_DV
Solution 2
Running
$ cd session2/exercises/speelpenning
$ make view_hessian
$ ./view_hessian
should give the output
0.0 2520.0 1680.0 1260.0 1008.0 840.0 720.0
2520.0 0.0 840.0 630.0 504.0 420.0 360.0
1680.0 840.0 0.0 420.0 336.0 280.0 240.0
1260.0 630.0 420.0 0.0 252.0 210.0 180.0
1008.0 504.0 336.0 252.0 0.0 168.0 144.0
840.0 420.0 280.0 210.0 168.0 0.0 120.0
720.0 360.0 240.0 180.0 144.0 120.0 0.0
Solution 3
It makes sense that the diagonal entries are zero, and the matrix is symmetric, as expected. For \(i\neq j\), the \((i,j)^{th}\) entry of the Hessian is given by \[ \frac{\partial^2}{\partial x_i\partial x_j}\prod_{k=1}^nx_k =\frac{\partial}{\partial x_i}\prod_{k=1,\,k\neq j}^nx_k =\prod_{k=1,\,k\neq i,\,k\neq j}^nx_k. \] With the inputs \(x_i=i\) used here, the \((6,7)^{th}\) and \((7,6)^{th}\) entries are both \[120=5!=\prod_{i=1}^5x_i=\frac{\partial^2}{\partial x_6\partial x_7}\prod_{i=1}^7x_i.\] Further checks are left as an exercise for the reader.
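As a further hedged check (assuming, as the printed values suggest, that view_hessian.f90 uses the inputs \(x_i=i\)), the analytic Hessian can be tabulated directly with NumPy.

```python
import numpy as np

x = np.arange(1.0, 8.0)  # assumed inputs x_i = i
n = len(x)
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            keep = np.ones(n, dtype=bool)
            keep[[i, j]] = False
            H[i, j] = np.prod(x[keep])  # product over all entries except i, j
print(H)  # first row: 0, 2520, 1680, 1260, 1008, 840, 720
```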
Solution 4
Forward-then-reverse approach.
$ tapenade -b -vector -head "speelpenning_dv(fd)/(x)" -adjmodulename _b diffsizes_gradient.f90 speelpenning_dv.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
@@ Options: multiDirectional
Command: Procedure speelpenning_dv understood as speelpenning_mod_dv.speelpenning_dv
@@ Created ./speelpenning_dv_bv.f90
@@ Created ./diffsizes_gradient_bv.msg
$ cat speelpenning_dv_bv.f90
gives
! Generated by TAPENADE (INRIA, Ecuador team)
! Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
! Generated by TAPENADE (INRIA, Ecuador team)
! Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
MODULE SPEELPENNING_MOD_DV_BV
USE DIFFSIZES
USE DIFFSIZES
! Hint: nbdirsmax0 should be the maximum number of differentiation directions
IMPLICIT NONE
CONTAINS
! Differentiation of speelpenning_dv in reverse (adjoint) mode (with options multiDirectional):
! gradient of useful results: fd
! with respect to varying inputs: x fd
! RW status of diff variables: x:out fd:in-zero
! Differentiation of speelpenning in forward (tangent) mode (with options multiDirectional):
! variations of useful results: f
! with respect to varying inputs: x
! RW status of diff variables: f:out x:in
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
SUBROUTINE SPEELPENNING_DV_BV(x, xb, xd, f, fd, fdb, n, nbdirs, &
& nbdirs0)
USE DIFFSIZES
USE DIFFSIZES
! Hint: nbdirsmax0 should be the maximum number of differentiation directions
IMPLICIT NONE
! Size of the input vector
INTEGER, INTENT(IN) :: n
! Input vector
REAL, DIMENSION(n), INTENT(IN) :: x
REAL, DIMENSION(nbdirsmax0, n) :: xb
REAL, DIMENSION(nbdirsmax, n), INTENT(IN) :: xd
! Output value, product of all entries
REAL :: f
REAL, DIMENSION(nbdirsmax0) :: fb
REAL, DIMENSION(nbdirsmax) :: fd
REAL, DIMENSION(nbdirsmax0, nbdirsmax) :: fdb
! Dummy loop index
INTEGER :: i
INTEGER :: nd
INTEGER :: nbdirs
INTEGER :: nd0
INTEGER :: nbdirs0
f = 1.0
fd = 0.0
DO i=1,n
  DO nd=1,nbdirs
    CALL PUSHREAL4(fd(nd))
    fd(nd) = x(i)*fd(nd) + f*xd(nd, i)
  END DO
  CALL PUSHREAL4(f)
  f = f*x(i)
END DO
fb = 0.0
xb = 0.0
DO i=n,1,-1
  CALL POPREAL4(f)
  DO nd0=1,nbdirs0
    xb(nd0, i) = xb(nd0, i) + f*fb(nd0)
    fb(nd0) = x(i)*fb(nd0)
  END DO
  DO nd=nbdirs,1,-1
    CALL POPREAL4(fd(nd))
    DO nd0=1,nbdirs0
      xb(nd0, i) = xb(nd0, i) + fd(nd)*fdb(nd0, nd)
      fb(nd0) = fb(nd0) + xd(nd, i)*fdb(nd0, nd)
      fdb(nd0, nd) = x(i)*fdb(nd0, nd)
    END DO
  END DO
END DO
fdb = 0.0
END SUBROUTINE SPEELPENNING_DV_BV
! Differentiation of speelpenning in forward (tangent) mode (with options multiDirectional):
! variations of useful results: f
! with respect to varying inputs: x
! RW status of diff variables: f:out x:in
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
SUBROUTINE SPEELPENNING_DV(x, xd, f, fd, n, nbdirs)
USE DIFFSIZES
IMPLICIT NONE
! Size of the input vector
INTEGER, INTENT(IN) :: n
! Input vector
REAL, DIMENSION(n), INTENT(IN) :: x
REAL, DIMENSION(nbdirsmax, n), INTENT(IN) :: xd
! Output value, product of all entries
REAL, INTENT(OUT) :: f
REAL, DIMENSION(nbdirsmax), INTENT(OUT) :: fd
! Dummy loop index
INTEGER :: i
INTEGER :: nd
INTEGER :: nbdirs
f = 1.0
fd = 0.0
DO i=1,n
  DO nd=1,nbdirs
    fd(nd) = x(i)*fd(nd) + f*xd(nd, i)
  END DO
  f = f*x(i)
END DO
END SUBROUTINE SPEELPENNING_DV
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
SUBROUTINE SPEELPENNING(x, f, n)
IMPLICIT NONE
! Size of the input vector
INTEGER, INTENT(IN) :: n
! Input vector
REAL, DIMENSION(n), INTENT(IN) :: x
! Output value, product of all entries
REAL, INTENT(OUT) :: f
! Dummy loop index
INTEGER :: i
f = 1.0
DO i=1,n
  f = f*x(i)
END DO
END SUBROUTINE SPEELPENNING
END MODULE SPEELPENNING_MOD_DV_BV
Note that running the code requires either PUSHREAL4 and POPREAL4 to be imported from Tapenade’s ‘ADFirstAidKit’, or these stack push/pop subroutines to be implemented manually.
Approach 2: Operator overloading
The operator overloading approach is fundamentally different from source transformation: it does not generate additional source files, but instead computes derivatives at runtime.
The general idea is to record all operations applied to variables involved in the computation so that derivatives can be computed by applying the chain rule to the derivatives of those operations.
Depending on the AD tool, the user experience is something like the following:
- Mark (input) variables to compute derivatives with respect to as independent.
- Mark (output) variables to compute derivatives of as dependent.
- Call driver methods supported by the AD tool to compute derivatives.
- Run the program.
Operator overloading
The approach fits well with another ‘OO’: object orientation. In classic implementations such as ADOL-C, basic types are enriched by introducing corresponding structs (derived types in Fortran) that hold both the standard (forward) value and also the gradient. For example,
type adouble
double precision :: val
double precision :: grad
end type adouble
In order to compute derivatives of expressions involving adoubles, we need to overload the operators that act on doubles so that they operate on the val attribute and also provide code for the derivative(s). That is, we extend the definition of such operators (+, -, *, /, **, and others) so that they may be applied to adoubles.
Forward mode evaluates eagerly (i.e., it computes the derivative at the same time as the function evaluation), and the adouble approach facilitates exactly this. It is sometimes referred to as the dual number approach.
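To illustrate the dual number idea, here is a minimal Python sketch (an illustration only, not how ADOL-C or Pyadjoint is actually implemented): each value carries its forward mode derivative, and overloaded operators propagate both eagerly.

```python
class Dual:
    """Toy dual number: a value together with its forward mode derivative."""

    def __init__(self, val, grad=0.0):
        self.val = val    # primal value
        self.grad = grad  # derivative value

    def __add__(self, other):
        return Dual(self.val + other.val, self.grad + other.grad)

    def __mul__(self, other):
        # Product rule evaluated alongside the primal multiplication
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)

# Differentiate f(x1, x2) = x1 * x2 with respect to x1 by seeding its derivative
x1 = Dual(2.0, 1.0)
x2 = Dual(3.0, 0.0)
y = x1 * x2
print(y.val, y.grad)  # 6.0 3.0 (= x2, as expected)
```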
Operator overloading: tape
In the case of reverse mode, the dual number approach is insufficient. As mentioned previously, we need to keep track of operations from the primal code, as well as primal values when executing reverse mode.
A fundamental concept in the operator overloading approach is the tape: the record of all the operations that have been executed. Consider the following lines of Fortran code for an example evaluation of the \(f\) function considered previously:
type(adouble) :: x1
type(adouble) :: x2
type(adouble) :: y
x1 = 2.0
x2 = 3.0
y = x1 * x2
Conceptually, the tape would be something like:
| Index | Operation      | Dependencies | Outputs |
|-------|----------------|--------------|---------|
| 0     | Assignment     | 2.0          | x1      |
| 1     | Assignment     | 3.0          | x2      |
| 2     | Multiplication | x1, x2       | y       |
That is, it records each operation in order, as well as the relevant dependencies and outputs. In practice, tapes are much more sophisticated than this, but hopefully the idea is clear. The tape can also be visualised as a DAG, as illustrated in subsequent cells.
Given that we know the derivatives of elementary operations such as addition, multiplication, and exponentiation, we can unroll the tape to compute the desired derivatives. As an alternative to the dual number approach, forward mode can also be implemented in this way.
Question
What’s the difference between forward and reverse mode as far as tape unrolling is concerned?
Solution
In forward mode, we unroll the tape in the same order it was written (increasing index). Values for the primal code are computed eagerly as part of this unrolling.
For reverse mode, we do need to first unroll the tape in the forward direction to compute primal variables. Then we unroll the tape in the reverse direction (decreasing index) to accumulate derivatives.
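Here is a minimal Python sketch of this unrolling (an illustration only; real tapes such as Pyadjoint’s are far more sophisticated), recording the multiplication from the example above and then sweeping backwards to accumulate adjoints.

```python
tape = []    # list of (operation, input names, output name)
values = {}  # primal values keyed by variable name

def assign(name, value):
    values[name] = value

def mul(out, a, b):
    values[out] = values[a] * values[b]
    tape.append(("mul", (a, b), out))

# Forward sweep (primal run), recording onto the tape
assign("x1", 2.0)
assign("x2", 3.0)
mul("y", "x1", "x2")

# Reverse sweep: seed the output adjoint, then walk the tape backwards
adj = {name: 0.0 for name in values}
adj["y"] = 1.0
for op, (a, b), out in reversed(tape):
    if op == "mul":
        adj[a] += values[b] * adj[out]  # d(a*b)/da = b
        adj[b] += values[a] * adj[out]  # d(a*b)/db = a
print(adj["x1"], adj["x2"])  # 3.0 2.0
```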
Operator overloading example in Pyadjoint
We demonstrate operator overloading using Pyadjoint in the following. We define some utility functions for interrogating the Tape class.
def print_tape():
    """
    Print the current working tape entry-by-entry.
    """
    tape = pyadjoint.get_working_tape()
    print("Tape:")
    for i, block in enumerate(tape.get_blocks()):
        print(f"{i}: {block}")


def plot_dag(pos=None, n=4):
    """
    Plot the DAG for the current working tape.

    :kwarg pos: list of tuples for the node positions
    :kwarg n: number of nodes in final graph for tracking sub-DAGs as they progress
    """
    tape = pyadjoint.get_working_tape()
    graph = tape.create_graph()
    fig, axes = plt.subplots(figsize=(5, 3))

    # Use the default automatic node positioning if none is specified
    if pos is not None:
        pos = {node: p for node, p in zip(graph.nodes, pos)}

    # Plot the DAG, colouring nodes by their index
    nx.draw(graph, pos, ax=axes, node_color=range(len(graph.nodes)), node_size=800, cmap=plt.cm.coolwarm)
    plt.show()
We start by enabling annotation of the tape with pyadjoint.continue_annotation. Having done so, we create two AdjFloats and assign them the values 2 and 3.
def f(x1, x2):
"""f function as defined in previous examples."""
return x1 * x2
# Enable annotation of the tape
pyadjoint.continue_annotation();

# Create some AdjFloats
x1 = pyadjoint.AdjFloat(2.0)
x2 = pyadjoint.AdjFloat(3.0)
Let’s now combine the two AdjFloats using a multiplication operation.
y = f(x1, x2)

# Interrogate the tape
print_tape()
plot_dag(pos=[(0, 0), (1, 0.5), (0, 1), (2, 0.5)])
The tape shows an entry for the product, displaying the float values for x1 and x2. Note that AdjFloat assignments are not shown on the tape in pyadjoint, although assignments for finite element fields are. The DAG shows four nodes: two for the input AdjFloats, one for the multiplication operation, and one for the output AdjFloat.
Having finished the computation of interest, we can stop annotating the tape.
pyadjoint.pause_annotation();
To compute derivatives, we need to specify which are the independent variables. These are declared as Controls, as follows.
xc = [pyadjoint.Control(x1), pyadjoint.Control(x2)]
Having marked variables as independent, we can now compute the gradient.
# Inspect the docstring for the compute_gradient driver
print(" \"\"\"\n" + pyadjoint.compute_gradient.__doc__ + "\n \"\"\"\n")
# Compute the gradient of the output with respect to the inputs
dydx = pyadjoint.compute_gradient(y, xc)
print(f"x = [x1, x2] = [{float(x1)}, {float(x2)}]")
print(f"y = x1 * x2 = {float(y)}")
print(f"dydx = [{float(dydx[0])}, {float(dydx[1])}] ?= [x2, x1] = [{float(x2)}, {float(x1)}]")
Note that we haven’t specified whether to use forward or reverse mode. Pyadjoint is set up such that it will choose which approach to use, depending on whether there are more inputs or outputs.
Exercise
We’ve computed first derivatives of \(f\). Now let’s consider its second derivatives.
1. What would you expect the Hessian of \(f\) to be?
2. Take a look at pyadjoint.compute_hessian.__doc__. Compute the compressed Hessian with a single call to the compute_hessian driver function.
Solution 1
\[ \frac{\partial^2f}{\partial x_1^2}=\frac{\partial}{\partial x_1}x_2=0, \quad\frac{\partial^2f}{\partial x_1\partial x_2}=\frac{\partial}{\partial x_1}x_1=1, \quad\frac{\partial^2f}{\partial x_2\partial x_1}=\frac{\partial}{\partial x_2}x_2=1, \quad\frac{\partial^2f}{\partial x_2^2}=\frac{\partial}{\partial x_2}x_1=0, \] so we expect \(H(f)=\begin{bmatrix}0&1\\1&0\end{bmatrix}\).
Solution 2
With the code
# Define seed vector for second derivative
xdd = [pyadjoint.AdjFloat(1.0), pyadjoint.AdjFloat(1.0)]

# Apply driver function
d2ydx2 = pyadjoint.compute_hessian(y, xc, xdd)

print(f"x = [x1, x2] = [{float(x1)}, {float(x2)}]")
print(f"y = x1 * x2")
# Decompress Hessian before printing
print(f"d2ydx2 = [0.0, {float(d2ydx2[0])}; {float(d2ydx2[1])}, 0.0] ?= [0.0, 1.0; 1.0, 0.0]")
we should get
x = [x1, x2] = [2.0, 3.0]
y = x1 * x2
d2ydx2 = [0.0, 1.0; 1.0, 0.0] ?= [0.0, 1.0; 1.0, 0.0]
Source transformation vs. operator overloading
We already evaluated the differences between modes (forward and reverse). How about the differences between approaches?
- ST is done as a preprocessing step, whereas OO is done at runtime.
- ST is fairly clear, whereas OO is somewhat of a ‘black box’ (unless you’re able to inspect the tape).
- OO’s tape requires memory.
- There are only a few ST tools, but very many OO tools! See below.
- LLVM: Enzyme
- C/C++: ADIC, ADOL-C, Torch Autograd, CoDiPack, Sacado, dco/c++ [commercial] (about two dozen!)
- Fortran: Differentia, lots of abandonware…
- Python: PyTorch Autograd, JAX, PyADOL-C.
- Julia: Enzyme, Zygote, ForwardDiff, DifferentiationInterface (about two dozen overall!)
- Domain-specific: dolfin-adjoint/pyadjoint (Python/UFL - Firedrake & FEniCS)
- And many more! https://autodiff.org/?module=Tools
Levels of abstraction
So far, we have considered AD applied to elementary operations (+, -, *, /, **) acting on intrinsic Fortran types such as reals and arrays thereof. Most of the early AD tools worked in this way, tracking operations applied to real numbers and chaining together their derivatives. However, code differentiated with a source transformation tool in this manner can quickly become difficult to read for large codes. And under the operator overloading approach, the memory footprint of the tape can become very large.
By raising the level of abstraction, some of these difficulties can be avoided.
Medium-level: API calls
Example:
- AD in PETSc
- Latest implementation.
- (Note: old draft implementation differentiated elementary operators.)
High-level: operations on fields
Examples:
- AD in Firedrake using Pyadjoint/dolfin-adjoint.
- AD in PSyclone using PSyAD.
Showcase
Open and work through the following Firedrake Jupyter notebooks that were copied over.
- session2/exercises/firedrake/06-pde-constrained-optimisation.ipynb
- session2/exercises/firedrake/11-extract-adjoint-solutions.ipynb
Goal-oriented error estimation
Navigate to https://mesh-adaptation.github.io/docs/demos/point_discharge2d.py.html.
Checkpointing
Recall our discussion about tape unrolling and how for reverse mode we first unroll the tape to run the primal code.
Question
Under what circumstances can we get away without an initial forward pass when running reverse mode?
Solution
Two possible answers:
If the variables involved in the forward computation had their values assigned exactly once and those values are still available in memory then we already have the forward data required for the reverse pass.*
If the problem is linear in the independent variables then the reverse mode code will be independent of those variables. As such, there is no need to compute them.
*Although your AD tool might not be aware of this or be able to make use of the information.
Checkpointing
In scientific programming, time-dependent problems are typically solved using a timestepping method, such as the theta-method we saw in the first session. When writing code for such methods, it’s common practice to overwrite the variable holding the approximation at a given timestep as we progress through the steps. In such a formulation, earlier values are no longer in memory when the reverse pass needs them, so they must be stored or recomputed.
In some cases, it’s possible to keep the full state on the tape at every timestep, whereby all the information required for reverse mode is available - see Figure 4.
Figure 4: Time-dependent adjoint problem solved without loading checkpoints. Generated using tikZ and \(\LaTeX\).
This quickly becomes infeasible for real world problems, with the amount of memory required at each timestep taking a large chunk of the available RAM. In the extreme case where only one timestep’s worth of data fits in memory, a reverse mode propagation requires checkpointing at each timestep as in Figure 5.
Figure 5: Time-dependent adjoint problem solved with checkpointing at every timestep. Generated using tikZ and \(\LaTeX\).
In intermediate cases, checkpointing can be done on the basis of some fixed frequency or in a more intelligent way using the revolve algorithm (Griewank & Walther, 2000).
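The following hedged Python sketch (not from the course code; the timestep update and functional are made up) illustrates the checkpoint-every-timestep pattern of Figure 5: the state is stored before each forward step and restored in reverse order to drive the adjoint steps.

```python
import numpy as np

dt = 0.1

def step(u):
    return u + dt * np.sin(u)  # made-up explicit timestep

def step_adjoint(u, ubar):
    # Transpose of d(step)/du, evaluated at the checkpointed state
    return (1.0 + dt * np.cos(u)) * ubar

def gradient_with_checkpoints(u0, nsteps):
    # Forward pass: checkpoint the state before each step
    checkpoints = []
    u = u0
    for _ in range(nsteps):
        checkpoints.append(u)
        u = step(u)
    # Reverse pass: restore checkpoints in reverse order for the adjoint steps
    ubar = u  # seed: dJ/du_N for J = 0.5 * u_N**2
    for k in reversed(range(nsteps)):
        ubar = step_adjoint(checkpoints[k], ubar)
    return ubar  # dJ/du_0

print(gradient_with_checkpoints(1.0, 50))
```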
Further applications
- Sensitivity analysis.
- Data assimilation.
- Uncertainty quantification.
- Online training in machine learning.
- PDE-constrained optimisation.
- Goal-oriented error estimation and mesh adaptation.
Summary
In today’s session we:
- Learnt about reverse mode and the operator overloading approach, comparing them with forward mode and source transformation, respectively.
- Verified the consistency of code generated by Tapenade under forward and reverse mode using the dot product test.
- Calculated higher order derivatives using Tapenade.
- Tried out the Pyadjoint operator overloading AD tool underpinning the Firedrake finite element library and saw showcases of more advanced AD usage.
- Learnt about checkpointing and using AD to compute higher order derivatives.
References
- S. Linnainmaa. Taylor expansion of the accumulated rounding error. BIT, 16(2):146–160, 1976, doi:10.1007/BF01931367.
- B. Speelpenning. Compiling fast partial derivatives of functions given by algorithms. University of Illinois, 1980, doi:10.2172/5254402.
- A. Griewank. Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation. Optimization Methods & Software, 1:35–54, 1992, doi:10.1080/10556789208805505.
- A. Griewank and Andrea Walther. Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Transactions on Mathematical Software (TOMS) 26.1: 19-45, 2000, doi:10.1145/347837.347846.
- J. Huckelheim, et al. A taxonomy of automatic differentiation pitfalls, WIREs Data Mining Knowl Discovery, 2024, doi:10.1002/widm.1555.