Session 2: Reverse mode differentiation and operator overloading

Authors

Joe Wallwork

Niccolò Zanotti

%matplotlib inline
import matplotlib.colors as colors
import matplotlib.pyplot as plt
from mpltools import annotation
import networkx as nx
import numpy as np
from firedrake.adjoint import pyadjoint

Set up codespace now. (It will take a while!)

We will show you where to find this notebook in the repo while you wait.

Motivation

  • Were you interested in yesterday’s session but unclear how to make use of differentiable programming in practice?
  • We computed first-order derivatives for various programs. However, some advanced applications, such as uncertainty quantification, require Hessians (matrices of second derivatives). How do we calculate those?
  • Have you ever wondered how neural network training actually works under the hood?
Figure 1: Neural network schematic created with https://alexlenail.me/NN-SVG/index.html.


Machine learning (ML) models typically have large numbers of parameters to be tuned and small numbers of outputs. Forward mode doesn’t seem like such a great fit…

Learning objectives

In today’s session we will:

  • Learn about reverse mode and the operator overloading approach, comparing them with forward mode and source transformation, respectively.
  • Verify the consistency of code generated by Tapenade under forward and reverse mode using the dot product test.
  • Calculate higher order derivatives using Tapenade.
  • Try out the Pyadjoint operator overloading AD tool underpinning the Firedrake finite element library and see showcases of more advanced AD usage.
  • Learn about checkpointing and using AD to compute higher order derivatives.

Preparations

Terminology

Recall from session 1: This course introduces the concept of differentiable programming, a.k.a. automatic differentiation (AD), or algorithmic differentiation. We will use the acronym AD henceforth.

Notation

Recall from session 1: For a differentiable mathematical function \(f:A\rightarrow\mathbb{R}\) with scalar input (i.e., a single value) from \(A\subseteq\mathbb{R}\), we make use of both the Lagrange notation \(f'(x)\) and Leibniz notation \(\frac{\mathrm{d}f}{\mathrm{d}x}\) for its derivative.

Caution Take care with the physics “dot” notation for derivatives, \(\dot{x}\): it can refer both to seeds for forward mode and to forward mode derivatives.

Caution For seed vectors in reverse mode we will use the “bar” notation \(\bar{y}\). This is not a mean value.

Recall from session 1: When it comes to forward derivatives in code, we use the _d notation, which is standard in the AD literature.

When it comes to reverse mode derivatives in code, we use the _b notation (for “backward” or “bar”).

Brief history of reverse mode

  • Reverse mode (a.k.a. backpropagation) was discovered by Linnainmaa in the 1970s.
  • The terminology ‘back-propagating error correction’ had already been introduced in 1962 by Frank Rosenblatt, but without a practical way to implement it.
  • Speelpenning introduced the modern formulation of reverse mode in 1980.

Figure 2: Schematic from (Speelpenning, 1980).

  • Griewank improved the feasibility of reverse mode in 1992 by introducing checkpointing.

Recall: directional derivative, a.k.a. Jacobian-vector product (JVP)

Consider a vector-valued function \(\mathbf{f}\) mapping from a subspace \(A\subseteq\mathbb{R}^n\) into \(\mathbb{R}^m\), for some \(m,n\in\mathbb{N}\): \[\mathbf{f}:A\rightarrow\mathbb R^m.\]

Given input \(\mathbf{x}\in A\) and a seed vector \(\dot{\mathbf{x}}\in\mathbb{R}^n\), forward mode AD allows us to compute the action (matrix-vector product) \[\text{JVP}(\mathbf{f},\mathbf{x},\dot{\mathbf{x}}):=\nabla\mathbf{f}(\mathbf{x})\,\dot{\mathbf{x}}.\] Again, think of the seed vector as being an input from outside of the part of the program being differentiated.

Here \(\nabla\mathbf{f}\) is referred to as the Jacobian for the map, so the above is known as a Jacobian-vector product (JVP).

Note The computation is matrix-free. We don’t actually need the Jacobian when we compute this product.

Jacobian-transpose-vector product (JTVP)

Consider a vector-valued function \(\mathbf{f}\) mapping from a subspace \(A\subseteq\mathbb{R}^n\) into \(\mathbb{R}^m\), for some \(m,n\in\mathbb{N}\): \[\mathbf{f}:A\rightarrow\mathbb{R}^m.\]

Given \(\mathbf{x}\in A\) and a seed vector \(\bar{\mathbf{y}}\in\mathbb{R}^m\), reverse mode AD allows us to compute the transpose action (transposed matrix-vector product) \[\text{JTVP}(\mathbf{f},\mathbf{x},\bar{\mathbf{y}}):=\nabla\mathbf{f}(\mathbf{x})^T\bar{\mathbf{y}}.\]

Optional exercise

Convince yourself that the JTVP is well defined.

Solution

We have \(\nabla\mathbf{f}(\mathbf{x})\in\mathbb{R}^{m\times n}\), so \(\nabla\mathbf{f}(\mathbf{x})^T\in\mathbb{R}^{n\times m}\). Since \(\bar{\mathbf{y}}\in\mathbb{R}^m\), the dimensions are appropriate to take the JTVP.

Note Again, the computation is matrix-free. We don’t actually need the Jacobian or its transpose when we compute this product.

Forward mode vs. reverse mode

For seed vectors \(\dot{\mathbf{x}}\in\mathbb{R}^n\) and \(\bar{\mathbf{y}}\in\mathbb{R}^m\), forward mode and reverse mode compute

\[ \text{JVP}(\mathbf{f},\mathbf{x},\dot{\mathbf{x}}):=\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}} \quad\text{and}\quad \text{JTVP}(\mathbf{f},\mathbf{x},\bar{\mathbf{y}}):=\nabla\mathbf{f}(\mathbf{x})^T\bar{\mathbf{y}} \] respectively.

  • Forward mode is more appropriate if \(n\ll m\), i.e., \(\#inputs\ll\#outputs\).
    • e.g., sensitivity analysis or optimisation w.r.t. a small number of parameters.
  • Reverse mode is more appropriate if \(n\gg m\), i.e., \(\#inputs\gg\#outputs\).
    • e.g., ODE/PDE-constrained optimisation (cost function), machine learning training (loss function), goal-oriented error estimation (quantity of interest).
  • Forward mode derivatives are computed eagerly, alongside the primal run, whereas reverse mode requires a separate backward pass after the primal run.
  • Reverse mode tends to have higher memory requirements.
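
To make the difference concrete, here is a minimal NumPy sketch of a JVP and a JTVP for the function \(g(y)=(\sin(y),\cos(y))\) from the introductory example (revisited below), with the derivatives hand-coded rather than generated by an AD tool:

import numpy as np

def g(y):
    """Primal: map a scalar y to the 2-vector (sin(y), cos(y))."""
    return np.array([np.sin(y), np.cos(y)])

def g_jvp(y, y_d):
    """Forward mode (hand-coded): JVP = dg/dy(y) * y_d, a 2-vector."""
    return np.array([np.cos(y) * y_d, -np.sin(y) * y_d])

def g_jtvp(y, z_b):
    """Reverse mode (hand-coded): JTVP = dg/dy(y)^T z_b, a scalar."""
    return np.cos(y) * z_b[0] - np.sin(y) * z_b[1]

y = 0.5
print(g_jvp(y, 1.0))                    # one JVP recovers the whole (2x1) Jacobian
print(g_jtvp(y, np.array([1.0, 0.0])))  # one JTVP recovers a single row

Since \(g\) has one input and two outputs, a single JVP recovers the full Jacobian, whereas recovering it from JTVPs takes two calls; the situation reverses when there are many inputs and few outputs.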

\(f\) & \(g\) example: Directed Acyclic Graph

Recall the introductory example from yesterday and the DAG representations of the \(f\) and \(g\) functions.

Recalling that \[f(x_1,x_2)=x_1x_2\] and \[g(y)=(\sin(y),\cos(y)),\] we have

Figure 3: Directed Acyclic Graph (DAG) for the \(f\) function in the \(f\) & \(g\) example. Generated using tikZ and \(\LaTeX\).

Figure 4: Directed Acyclic Graph (DAG) for the \(g\) function in the \(f\) & \(g\) example. Generated using tikZ and \(\LaTeX\).

\(f\) & \(g\) example: reverse mode seed vectors

In reverse mode, gradient information propagates through the DAG in the opposite direction, from outputs to inputs.

Figure 5: Directed Acyclic Graph (DAG) for the composition of the functions in the \(f\) & \(g\) example. Generated using tikZ and \(\LaTeX\).

\(f\) & \(g\) example: reverse mode

The Fortran code for the two functions is provided in the repository at session2/exercises/fg/f_mod.f90 and session2/exercises/fg/g_mod.f90. This time, they are defined inside modules.

module f_mod
  implicit none

  private
  public :: f

  contains

    subroutine f(x, y)
      implicit none
      real, dimension(2), intent(in)  :: x
      real, intent(out) :: y
      y = x(1) * x(2)
    end subroutine f

end module f_mod
module g_mod
  implicit none

  private
  public :: g

  contains

    subroutine g(y, z)
      implicit none
      real, intent(in) :: y
      real, intent(out), dimension(2) :: z
      z = [sin(y), cos(y)]
    end subroutine g

end module g_mod

Exercise

  1. Run tapenade -h to review the options.*
  2. Apply Tapenade to each of these subroutines in reverse mode, which will compute the JTVP for some seed vector. Inspect the output files f_mod_b.f90 and g_mod_b.f90 and check they are as you expect.
  3. (Optional) Inspect the message files f_mod_b.msg and g_mod_b.msg.

*Recall that “tangent mode” is another term for “forward mode”. Similarly, “adjoint mode” is another term for “reverse mode”.

Solution 1

$ tapenade -h
Tapenade 3.16 (develop) - 23 Apr 2025 13:39 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
 Builds a differentiated program.
 Usage: tapenade [options]* filenames
  options:
   -head, -root <proc>     set the differentiation root procedure(s)
                           See FAQ for refined invocation syntax, e.g.
                           independent and dependent arguments, multiple heads...
   -tangent, -d            differentiate in forward/tangent mode (default)
   -reverse, -b            differentiate in reverse/adjoint mode
   -vector, -multi         turn on "vector" mode (i.e. multi-directional)
   -specializeactivity <unit_names or %all%>  Allow for several activity patterns per routine
   -primal, -p             turn off differentiation. Show pointer destinations
   -output, -o <file>      put all generated code into a single <file>
   -splitoutputfiles       split generated code, one file per top unit
   -outputdirectory, -O <directory>  put all generated files in <directory> (default: .)
   -I <includePath>        add a new search path for include files
   -tgtvarname <str>       set extension for tangent variables  (default %d)
   -tgtfuncname <str>      set extension for tangent procedures (default %_d)
   -tgtmodulename <str>    set extension for tangent modules and types (default %_diff)
   -adjvarname <str>       set extension for adjoint variables  (default %b)
   -adjfuncname <str>      set extension for adjoint procedures (default %_b)
   -adjmodulename <str>    set extension for adjoint modules and types (default %_diff)
   -modulename <str>       set extension for tangent&adjoint modules and types (default %_diff)
   -inputlanguage <lang>   language of  input files (fortran, fortran90,
                           fortran95, or C)
   -outputlanguage <lang>  language of output files (fortran, fortran90,
                           fortran95, or C)
   -ext <file>             incorporate external library description <file>
   -nolib                  don't load standard libraries descriptions
   -i<n>                   count <n> bytes for an integer (default -i4)
   -r<n>                   count <n> bytes for a real (default -r4)
   -dr<n>                  count <n> bytes for a double real (default -dr8)
   -p<n>                   count <n> bytes for a pointer (default -p8)
   -fixinterface           don't use activity to filter user-given (in)dependent vars
   -noinclude              inline include files
   -debugTGT               insert instructions for debugging tangent mode
   -debugADJ               insert instructions for debugging adjoint mode
   -tracelevel <n>         set the level of detail of trace milestones
   -msglevel <n>           set the level of detail of error messages
   -msginfile              insert error messages in output files
   -dump <file>            write a dump <file>
   -html                   display results in a web browser
   -nooptim <str>          turn off optimization <str> (in {activity, difftypes,
                           diffarguments, stripprimalmodules, spareinit, splitdiff, 
                           mergediff, saveonlyused, tbr, snapshot, diffliveness,
                           deadcontrol, recomputeintermediates,
                           everyoptim}
   -version                display Tapenade version information
 Report bugs to <tapenade@inria.fr>.


Solution 2

Running

$ cd session2/exercises/fg
$ ls
f_mod.f90 g_mod.f90
$ tapenade -b -head "f(x)\(y)" -adjmodulename _b f_mod.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Command: Procedure f understood as f_mod.f
@@ Created ./f_mod_b.f90 
@@ Created ./f_mod_b.msg
$ cat f_mod_b.f90

gives

!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
MODULE F_MOD_B
  IMPLICIT NONE
  PRIVATE 
  PUBLIC :: f
  PUBLIC :: f_b

CONTAINS
!  Differentiation of f in reverse (adjoint) mode:
!   gradient     of useful results: y
!   with respect to varying inputs: x y
!   RW status of diff variables: x:out y:in-zero
  SUBROUTINE F_B(x, xb, y, yb)
    IMPLICIT NONE
    REAL, DIMENSION(2), INTENT(IN) :: x
    REAL, DIMENSION(2) :: xb
    REAL :: y
    REAL :: yb
    xb = 0.0
    xb(1) = xb(1) + x(2)*yb
    xb(2) = xb(2) + x(1)*yb
    yb = 0.0
  END SUBROUTINE F_B

  SUBROUTINE F(x, y)
    IMPLICIT NONE
    REAL, DIMENSION(2), INTENT(IN) :: x
    REAL, INTENT(OUT) :: y
    y = x(1)*x(2)
  END SUBROUTINE F

END MODULE F_MOD_B

Running

$ tapenade -b -head "g(y)\(z)" -adjmodulename _b g_mod.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
Command: Procedure g understood as g_mod.g
@@ Created ./g_mod_b.f90 
@@ Created ./g_mod_b.msg
$ cat g_mod_b.f90

gives

!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
MODULE G_MOD_B
  IMPLICIT NONE
  PRIVATE 
  PUBLIC :: g
  PUBLIC :: g_b

CONTAINS
!  Differentiation of g in reverse (adjoint) mode:
!   gradient     of useful results: z
!   with respect to varying inputs: y z
!   RW status of diff variables: y:out z:in-zero
  SUBROUTINE G_B(y, yb, z, zb)
    IMPLICIT NONE
    REAL, INTENT(IN) :: y
    REAL :: yb
    REAL, DIMENSION(2) :: z
    REAL, DIMENSION(2) :: zb
    INTRINSIC COS
    INTRINSIC SIN
    yb = COS(y)*zb(1) - SIN(y)*zb(2)
    zb = 0.0
  END SUBROUTINE G_B

  SUBROUTINE G(y, z)
    IMPLICIT NONE
    REAL, INTENT(IN) :: y
    REAL, DIMENSION(2), INTENT(OUT) :: z
    INTRINSIC COS
    INTRINSIC SIN
    z = (/SIN(y), COS(y)/)
  END SUBROUTINE G

END MODULE G_MOD_B

Verification: the dot product test

We have approaches for computing derivatives with both forward mode and reverse mode. But how do we know the outputs are consistent? This can be verified using the so-called dot product test.

In the dot product test, we define the inner product \(\psi:=\bar{\mathbf{y}}^T\dot{\mathbf{y}}\) of \(\bar{\mathbf{y}}\) and \(\dot{\mathbf{y}}\), where \(\dot{\mathbf{y}}=\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}\) is the output of forward mode for some seed \(\dot{\mathbf{x}}\) and \(\bar{\mathbf{y}}\) is a seed for reverse mode. We check that \(\psi=\bar{\mathbf{x}}^T\dot{\mathbf{x}}\), where \(\bar{\mathbf{x}}=\nabla\mathbf{f}(\mathbf{x})^T\bar{\mathbf{y}}\) is the output of reverse mode.

Optional exercise

Prove that we do indeed expect \(\psi=\bar{\mathbf{y}}^T\dot{\mathbf{y}}=\bar{\mathbf{x}}^T\dot{\mathbf{x}}\).

Solution

\[ \psi:=\bar{\mathbf{y}}^T\dot{\mathbf{y}} =\bar{\mathbf{y}}^T(\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}) =(\bar{\mathbf{y}}^T\nabla\mathbf{f}(\mathbf{x}))\dot{\mathbf{x}} =(\nabla\mathbf{f}(\mathbf{x})^T\bar{\mathbf{y}})^T\dot{\mathbf{x}} =\bar{\mathbf{x}}^T\dot{\mathbf{x}} \]
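
As a quick numerical sanity check of this identity, the sketch below uses an explicit random matrix as a stand-in for the Jacobian (purely for illustration; AD itself never forms the matrix):

import numpy as np

rng = np.random.default_rng(42)
m, n = 3, 5
J = rng.standard_normal((m, n))   # stand-in for the Jacobian of f at x

x_d = rng.standard_normal(n)      # forward mode seed
y_b = rng.standard_normal(m)      # reverse mode seed

y_d = J @ x_d                     # what forward mode would return (JVP)
x_b = J.T @ y_b                   # what reverse mode would return (JTVP)

print(np.isclose(y_b @ y_d, x_b @ x_d))  # True: the two inner products agree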

\(f\) & \(g\) example: dot product test

We’ve already computed forward mode and reverse mode derivatives for the function \(f\) using Tapenade. We can run a dot product test using the code in session2/exercises/fg/dot_product_test.f90.

! Program for verifying the consistency of the forward mode and reverse mode derivatives of the
! function f.
program dot_product_test
  use f_mod_d, only: f_d
  use f_mod_b, only: f_b
  implicit none

  real, dimension(2) :: x   ! Primal input
  real, dimension(2) :: xd  ! Forward mode seed
  real, dimension(2) :: xb  ! Reverse mode derivative
  real :: y                 ! Primal output
  real :: yd                ! Forward mode derivative
  real :: yb                ! Reverse mode seed

  real :: result1                 ! LHS of dot product test
  real :: result2                 ! RHS of dot product test
  real, parameter :: atol = 1e-05 ! Absolute tolerance for the dot product test

  ! Set arbitrary primal input
  x(:) = [1.2, -2.3]

  ! Call forward mode with some arbitrary seeds
  xd(:) = [4.2, -0.7]
  call f_d(x, xd, y, yd)

  ! Choose a seed for reverse mode and evaluate the first result
  yb = 3.0
  result1 = dot_product([yd], [yb])

  ! Call reverse mode and evaluate the second result
  xb(:) = [0.0, 0.0]
  call f_b(x, xb, y, yb)
  result2 = dot_product(xd, xb)

  ! Check the two results match within the prespecified tolerance
  if (abs(result1 - result2) < atol) then
    write(unit=6, fmt="('PASS')")
  else
    write(unit=6, fmt="('FAIL with atol=',e10.4)") atol
  end if

end program dot_product_test

Exercise

  1. Regenerate the forward mode derivative code for f.f90 using Tapenade.
  2. Build and run the dot_product_test program.
  3. (Optional) Why does the test fail if the result1 assignment is moved to after the call to f_b?

Solution 1

$ tapenade -d -tgtmodulename _d -head "f(x)\(y)" f_mod.f90


Solution 2

$ cd exercises/fg
$ make dot_product_test
$ ./dot_product_test
PASS


Solution 3

Because yb gets reset to zero by f_b. This is done by default because it’s almost always the Right Thing to do.

Second order

For seed vectors \(\dot{\mathbf{x}}\in\mathbb{R}^n\) and \(\bar{\mathbf{y}}\in\mathbb{R}^m\), forward mode and reverse mode compute

\[ \text{JVP}(\mathbf{f},\mathbf{x},\dot{\mathbf{x}}):=\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}} \quad\text{and}\quad \text{JTVP}(\mathbf{f},\mathbf{x},\bar{\mathbf{y}}):=\nabla\mathbf{f}(\mathbf{x})^T\bar{\mathbf{y}} \] respectively.

Question

What are two ways we can use these to compute the Hessian of \(f\)?

Solution 1

Given a seed vector \(\dot{\mathbf{x}}\), first apply forward mode to compute \(\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}\). Then apply forward mode to compute the gradient of this (i.e., apply forward mode to the forward mode derivative code). Use vector mode (preferably with compression!) to get the full Hessian.


Solution 2

Given a seed vector \(\dot{\mathbf{x}}\), first apply forward mode to compute \(\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}\). Then apply reverse mode to compute the gradient of this (i.e., apply reverse mode to the forward mode derivative code). That is, \((\nabla(\nabla\mathbf{f}(\mathbf{x})\dot{\mathbf{x}}))^T\bar{\mathbf{y}}=\dot{\mathbf{x}}^T\nabla^T\nabla\mathbf{f}(\mathbf{x})\bar{\mathbf{y}}\). Here the Hessian \(\mathbf{H}(\mathbf{f}):=\nabla^T\nabla\mathbf{f}(\mathbf{x})\) is symmetric and so the two applications give the Hessian-vector product with the seed. Use vector mode (preferably with compression!) to get the full Hessian.
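
To illustrate the forward-then-reverse approach, here is a minimal hand-coded sketch (no AD tool involved) for the function \(f(x_1,x_2)=x_1x_2\) from earlier: reverse mode is applied to the output of the forward mode code, yielding Hessian-vector products.

import numpy as np

def f_d(x, x_d):
    """Forward mode (hand-coded): return (f(x), y_d = grad f(x) . x_d)."""
    y = x[0] * x[1]
    y_d = x[1] * x_d[0] + x[0] * x_d[1]
    return y, y_d

def f_d_b(x, x_d, y_db):
    """Reverse mode applied to the forward mode code above:
    return x_b = grad_x(grad f(x) . x_d) * y_db, i.e. a Hessian-vector product."""
    x_b = np.zeros(2)
    x_b[0] = x_d[1] * y_db  # d(y_d)/dx1 = x_d[1]
    x_b[1] = x_d[0] * y_db  # d(y_d)/dx2 = x_d[0]
    return x_b

x = np.array([1.2, -2.3])
# Seeding with the unit vectors recovers the columns of the Hessian [[0, 1], [1, 0]]
for e in np.eye(2):
    print(f_d_b(x, e, 1.0))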

Speelpenning example

We compute the Hessian for the classic test case introduced by Speelpenning, which amounts to a reduction of a vector using the product: \[ f(\mathbf{x})=\prod_{i=1}^nx_i. \] This test case is of interest because its Hessian is dense.

The Speelpenning function is implemented as a Fortran subroutine in session2/exercises/speelpenning/speelpenning.f90:

! Classic 'speelpenning' test function, which computes the product of all entries of a vector
subroutine speelpenning(x, f, n)
  implicit none

  integer, intent(in) :: n            ! Size of the input vector
  real, dimension(n), intent(in) :: x ! Input vector
  real, intent(out) :: f              ! Output value, product of all entries
  integer :: i                        ! Dummy loop index

  f = 1.0
  do i = 1, n
    f = f * x(i)
  end do
end subroutine speelpenning

Exercises

  1. Compute the Hessian of the speelpenning subroutine with Tapenade, using two applications of forward mode.*
  2. Build and run the view_hessian program in session2/exercises/speelpenning/view_hessian.f90.
  3. (Optional) Derive the expected Hessian and convince yourself that the output is as you expect.
  4. (Optional) Compute the Hessian of the speelpenning subroutine with Tapenade, using forward mode followed by reverse mode.*

*Hint: You will need to pass the diffsizes_gradient.f90 file as an additional input file for the second derivative calculation. The diffsizes_hessian.f90 file will be used when computing Hessians.

Solution 1

Forward-then-forward approach.

$ cd session2/exercises/speelpenning
$ ls
diffsizes_gradient.f90  diffsizes_hessian.f90  expected_hessian.f90  Makefile  speelpenning.f90  view_hessian.f90
$ tapenade -d -vector -tgtmodulename _d -head "speelpenning(x)\(f)" speelpenning.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
@@ Options:  multiDirectional
Command: Procedure speelpenning understood as speelpenning_mod.speelpenning
Diff-liveness analysis turned off
@@ Created ./speelpenning_dv.f90 
@@ Created ./speelpenning_dv.msg
$ cat speelpenning_dv.f90

gives

!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
MODULE SPEELPENNING_MOD_DV
  USE DIFFSIZES
!  Hint: nbdirsmax should be the maximum number of differentiation directions
  IMPLICIT NONE

CONTAINS
!  Differentiation of speelpenning in forward (tangent) mode (with options multiDirectional):
!   variations   of useful results: f
!   with respect to varying inputs: x
!   RW status of diff variables: f:out x:in
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
  SUBROUTINE SPEELPENNING_DV(x, xd, f, fd, n, nbdirs)
    USE DIFFSIZES
!  Hint: nbdirsmax should be the maximum number of differentiation directions
    IMPLICIT NONE
! Size of the input vector
    INTEGER, INTENT(IN) :: n
! Input vector
    REAL, DIMENSION(n), INTENT(IN) :: x
    REAL, DIMENSION(nbdirsmax, n), INTENT(IN) :: xd
! Output value, product of all entries
    REAL, INTENT(OUT) :: f
    REAL, DIMENSION(nbdirsmax), INTENT(OUT) :: fd
! Dummy loop index
    INTEGER :: i
    INTEGER :: nd
    INTEGER :: nbdirs
    f = 1.0
    fd = 0.0
    DO i=1,n
      DO nd=1,nbdirs
        fd(nd) = x(i)*fd(nd) + f*xd(nd, i)
      END DO
      f = f*x(i)
    END DO
  END SUBROUTINE SPEELPENNING_DV

! Classic 'speelpenning' test function, which computes the product of all entries of a vector
  SUBROUTINE SPEELPENNING(x, f, n)
    IMPLICIT NONE
! Size of the input vector
    INTEGER, INTENT(IN) :: n
! Input vector
    REAL, DIMENSION(n), INTENT(IN) :: x
! Output value, product of all entries
    REAL, INTENT(OUT) :: f
! Dummy loop index
    INTEGER :: i
    f = 1.0
    DO i=1,n
      f = f*x(i)
    END DO
  END SUBROUTINE SPEELPENNING

END MODULE SPEELPENNING_MOD_DV

As mentioned in the hint, we need to make use of the diffsizes module defined in diffsizes_gradient.f90:

! Module containing the nbdirsmax variable required by speelpenning_dv
module diffsizes
  implicit none
  integer, parameter :: nbdirsmax = 7
end module diffsizes

We can apply vector forward mode again, passing this in as an additional input:

$ tapenade -vector -head "speelpenning_dv(fd)/(x)" -tgtmodulename _d diffsizes_gradient.f90 speelpenning_dv.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
@@ Options:  multiDirectional
Command: Procedure speelpenning_dv understood as speelpenning_mod_dv.speelpenning_dv
Diff-liveness analysis turned off
@@ Created ./speelpenning_dv_dv.f90 
@@ Created ./diffsizes_gradient_dv.msg
$ cat speelpenning_dv_dv.f90

gives

!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
MODULE SPEELPENNING_MOD_DV_DV
  USE DIFFSIZES
  USE DIFFSIZES
!  Hint: nbdirsmax0 should be the maximum number of differentiation directions
  IMPLICIT NONE

CONTAINS
!  Differentiation of speelpenning_dv in forward (tangent) mode (with options multiDirectional):
!   variations   of useful results: fd
!   with respect to varying inputs: x
!   RW status of diff variables: x:in fd:out
!  Differentiation of speelpenning in forward (tangent) mode (with options multiDirectional):
!   variations   of useful results: f
!   with respect to varying inputs: x
!   RW status of diff variables: f:out x:in
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
  SUBROUTINE SPEELPENNING_DV_DV(x, xd0, xd, f, fd, fdd, n, nbdirs, &
&   nbdirs0)
    USE DIFFSIZES
    USE DIFFSIZES
!  Hint: nbdirsmax0 should be the maximum number of differentiation directions
    IMPLICIT NONE
! Size of the input vector
    INTEGER, INTENT(IN) :: n
! Input vector
    REAL, DIMENSION(n), INTENT(IN) :: x
    REAL, DIMENSION(nbdirsmax0, n), INTENT(IN) :: xd0
    REAL, DIMENSION(nbdirsmax, n), INTENT(IN) :: xd
! Output value, product of all entries
    REAL, INTENT(OUT) :: f
    REAL, DIMENSION(nbdirsmax0) :: fd0
    REAL, DIMENSION(nbdirsmax), INTENT(OUT) :: fd
    REAL, DIMENSION(nbdirsmax0, nbdirsmax), INTENT(OUT) :: fdd
! Dummy loop index
    INTEGER :: i
    INTEGER :: nd
    INTEGER :: nbdirs
    INTEGER :: nd0
    INTEGER :: nbdirs0
    f = 1.0
    fd = 0.0
    fd0 = 0.0
    fdd = 0.0
    DO i=1,n
      DO nd=1,nbdirs
        DO nd0=1,nbdirs0
          fdd(nd0, nd) = fd(nd)*xd0(nd0, i) + x(i)*fdd(nd0, nd) + xd(nd&
&           , i)*fd0(nd0)
        END DO
        fd(nd) = x(i)*fd(nd) + f*xd(nd, i)
      END DO
      DO nd0=1,nbdirs0
        fd0(nd0) = x(i)*fd0(nd0) + f*xd0(nd0, i)
      END DO
      f = f*x(i)
    END DO
  END SUBROUTINE SPEELPENNING_DV_DV

!  Differentiation of speelpenning in forward (tangent) mode (with options multiDirectional):
!   variations   of useful results: f
!   with respect to varying inputs: x
!   RW status of diff variables: f:out x:in
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
  SUBROUTINE SPEELPENNING_DV(x, xd, f, fd, n, nbdirs)
    USE DIFFSIZES
    IMPLICIT NONE
! Size of the input vector
    INTEGER, INTENT(IN) :: n
! Input vector
    REAL, DIMENSION(n), INTENT(IN) :: x
    REAL, DIMENSION(nbdirsmax, n), INTENT(IN) :: xd
! Output value, product of all entries
    REAL, INTENT(OUT) :: f
    REAL, DIMENSION(nbdirsmax), INTENT(OUT) :: fd
! Dummy loop index
    INTEGER :: i
    INTEGER :: nd
    INTEGER :: nbdirs
    f = 1.0
    fd = 0.0
    DO i=1,n
      DO nd=1,nbdirs
        fd(nd) = x(i)*fd(nd) + f*xd(nd, i)
      END DO
      f = f*x(i)
    END DO
  END SUBROUTINE SPEELPENNING_DV

! Classic 'speelpenning' test function, which computes the product of all entries of a vector
  SUBROUTINE SPEELPENNING(x, f, n)
    IMPLICIT NONE
! Size of the input vector
    INTEGER, INTENT(IN) :: n
! Input vector
    REAL, DIMENSION(n), INTENT(IN) :: x
! Output value, product of all entries
    REAL, INTENT(OUT) :: f
! Dummy loop index
    INTEGER :: i
    f = 1.0
    DO i=1,n
      f = f*x(i)
    END DO
  END SUBROUTINE SPEELPENNING

END MODULE SPEELPENNING_MOD_DV_DV


Solution 2

Running

$ cd session2/exercises/speelpenning
$ make view_hessian
$ ./view_hessian

should give the output

   0.0 2520.0 1680.0 1260.0 1008.0  840.0  720.0
2520.0    0.0  840.0  630.0  504.0  420.0  360.0
1680.0  840.0    0.0  420.0  336.0  280.0  240.0
1260.0  630.0  420.0    0.0  252.0  210.0  180.0
1008.0  504.0  336.0  252.0    0.0  168.0  144.0
 840.0  420.0  280.0  210.0  168.0    0.0  120.0
 720.0  360.0  240.0  180.0  144.0  120.0    0.0


Solution 3

It makes sense that the diagonal entries are zero, since \(f\) is linear in each individual \(x_i\). The matrix is symmetric, as expected. For \(i\neq j\), the \((i,j)^{th}\) entry of the Hessian is given by \[ \frac{\partial^2}{\partial x_i\partial x_j}\prod_{k=1}^nx_k =\frac{\partial}{\partial x_i}\prod_{k=1,\,k\neq j}^nx_k =\prod_{k=1,\,k\neq i,\,k\neq j}^nx_k. \] As such, the \((6,7)^{th}\) and \((7,6)^{th}\) entries are both \[120=5!=\prod_{i=1}^5x_i=\frac{\partial^2}{\partial x_6\partial x_7}\prod_{i=1}^7x_i.\] Further checks are left as an exercise for the reader.
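
For a further check, the expected dense Hessian can be assembled directly from the formula above. The sketch below assumes the primal input used by view_hessian is \(\mathbf{x}=(1,2,\dots,7)\), which is consistent with the entries printed above:

import numpy as np

n = 7
x = np.arange(1, n + 1, dtype=float)  # assumed primal input x = (1, 2, ..., 7)
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            keep = np.ones(n, dtype=bool)
            keep[[i, j]] = False
            H[i, j] = np.prod(x[keep])  # product of all entries except x_i and x_j
print(H)  # should match the output of view_hessian above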


Solution 4

Forward-then-reverse approach.

$ tapenade -b -vector -head "speelpenning_dv(fd)/(x)" -adjmodulename _b diffsizes_gradient.f90 speelpenning_dv.f90
Tapenade 3.16 (develop) - 25 Jun 2025 16:38 - Java 21.0.7 Linux
@@ TAPENADE_HOME=/workspaces/differentiable-programming-summer-school-2025/tapenade_3.16/bin/..
@@ Options:  multiDirectional
Command: Procedure speelpenning_dv understood as speelpenning_mod_dv.speelpenning_dv
@@ Created ./speelpenning_dv_bv.f90 
@@ Created ./diffsizes_gradient_bv.msg
$ cat speelpenning_dv_bv.f90

gives

!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
!        Generated by TAPENADE     (INRIA, Ecuador team)
!  Tapenade 3.16 (develop) - 25 Jun 2025 16:38
!
MODULE SPEELPENNING_MOD_DV_BV
  USE DIFFSIZES
  USE DIFFSIZES
!  Hint: nbdirsmax0 should be the maximum number of differentiation directions
  IMPLICIT NONE

CONTAINS
!  Differentiation of speelpenning_dv in reverse (adjoint) mode (with options multiDirectional):
!   gradient     of useful results: fd
!   with respect to varying inputs: x fd
!   RW status of diff variables: x:out fd:in-zero
!  Differentiation of speelpenning in forward (tangent) mode (with options multiDirectional):
!   variations   of useful results: f
!   with respect to varying inputs: x
!   RW status of diff variables: f:out x:in
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
  SUBROUTINE SPEELPENNING_DV_BV(x, xb, xd, f, fd, fdb, n, nbdirs, &
&   nbdirs0)
    USE DIFFSIZES
    USE DIFFSIZES
!  Hint: nbdirsmax0 should be the maximum number of differentiation directions
    IMPLICIT NONE
! Size of the input vector
    INTEGER, INTENT(IN) :: n
! Input vector
    REAL, DIMENSION(n), INTENT(IN) :: x
    REAL, DIMENSION(nbdirsmax0, n) :: xb
    REAL, DIMENSION(nbdirsmax, n), INTENT(IN) :: xd
! Output value, product of all entries
    REAL :: f
    REAL, DIMENSION(nbdirsmax0) :: fb
    REAL, DIMENSION(nbdirsmax) :: fd
    REAL, DIMENSION(nbdirsmax0, nbdirsmax) :: fdb
! Dummy loop index
    INTEGER :: i
    INTEGER :: nd
    INTEGER :: nbdirs
    INTEGER :: nd0
    INTEGER :: nbdirs0
    f = 1.0
    fd = 0.0
    DO i=1,n
      DO nd=1,nbdirs
        CALL PUSHREAL4(fd(nd))
        fd(nd) = x(i)*fd(nd) + f*xd(nd, i)
      END DO
      CALL PUSHREAL4(f)
      f = f*x(i)
    END DO
    fb = 0.0
    xb = 0.0
    DO i=n,1,-1
      CALL POPREAL4(f)
      DO nd0=1,nbdirs0
        xb(nd0, i) = xb(nd0, i) + f*fb(nd0)
        fb(nd0) = x(i)*fb(nd0)
      END DO
      DO nd=nbdirs,1,-1
        CALL POPREAL4(fd(nd))
        DO nd0=1,nbdirs0
          xb(nd0, i) = xb(nd0, i) + fd(nd)*fdb(nd0, nd)
          fb(nd0) = fb(nd0) + xd(nd, i)*fdb(nd0, nd)
          fdb(nd0, nd) = x(i)*fdb(nd0, nd)
        END DO
      END DO
    END DO
    fdb = 0.0
  END SUBROUTINE SPEELPENNING_DV_BV

!  Differentiation of speelpenning in forward (tangent) mode (with options multiDirectional):
!   variations   of useful results: f
!   with respect to varying inputs: x
!   RW status of diff variables: f:out x:in
! Classic 'speelpenning' test function, which computes the product of all entries of a vector
  SUBROUTINE SPEELPENNING_DV(x, xd, f, fd, n, nbdirs)
    USE DIFFSIZES
    IMPLICIT NONE
! Size of the input vector
    INTEGER, INTENT(IN) :: n
! Input vector
    REAL, DIMENSION(n), INTENT(IN) :: x
    REAL, DIMENSION(nbdirsmax, n), INTENT(IN) :: xd
! Output value, product of all entries
    REAL, INTENT(OUT) :: f
    REAL, DIMENSION(nbdirsmax), INTENT(OUT) :: fd
! Dummy loop index
    INTEGER :: i
    INTEGER :: nd
    INTEGER :: nbdirs
    f = 1.0
    fd = 0.0
    DO i=1,n
      DO nd=1,nbdirs
        fd(nd) = x(i)*fd(nd) + f*xd(nd, i)
      END DO
      f = f*x(i)
    END DO
  END SUBROUTINE SPEELPENNING_DV

! Classic 'speelpenning' test function, which computes the product of all entries of a vector
  SUBROUTINE SPEELPENNING(x, f, n)
    IMPLICIT NONE
! Size of the input vector
    INTEGER, INTENT(IN) :: n
! Input vector
    REAL, DIMENSION(n), INTENT(IN) :: x
! Output value, product of all entries
    REAL, INTENT(OUT) :: f
! Dummy loop index
    INTEGER :: i
    f = 1.0
    DO i=1,n
      f = f*x(i)
    END DO
  END SUBROUTINE SPEELPENNING

END MODULE SPEELPENNING_MOD_DV_BV

Note that running the code requires either importing PUSHREAL4 and POPREAL4 from Tapenade’s ‘ADFirstAidKit’, or implementing these stack push/pop subroutines manually.

Approach 2: Operator overloading

The operator overloading approach is fundamentally different from source transformation. It does not generate additional source files; instead, derivatives are computed at runtime.

The general idea is to record all operations applied to variables involved in the computation so that derivatives can be computed by applying the chain rule to the derivatives of those operations.

Depending on the AD tool, the user experience is something like the following:

  1. Mark (input) variables to compute derivatives with respect to as independent.
  2. Mark (output) variables to compute derivatives of as dependent.
  3. Call driver methods supported by the AD tool to compute derivatives.
  4. Run the program.

Operator overloading

The approach fits well with another ‘OO’: object orientation. In classic implementations such as ADOL-C, basic types are enriched by introducing corresponding structs (derived types in Fortran) that hold both the standard (forward) value and also the gradient. For example,

type adouble
  double precision :: val
  double precision :: grad
end type adouble

In order to compute derivatives of expressions involving adoubles, we need to overload the operators that act on doubles so that they operate on the val attribute and also provide code for the derivative(s). That is, we extend the definition of such operators (+, -, *, /, **, and others) so that they may be applied to adoubles.

Since forward mode evaluates eagerly (i.e., computes the derivative at the same time as the function evaluation), it maps naturally onto the adouble approach: each overloaded operation updates both the value and the derivative as it executes. This is sometimes referred to as the dual number approach.
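
As a minimal Python sketch of this idea (a stripped-down dual number, not how ADOL-C or Pyadjoint are actually implemented), overloading multiplication alone is enough to differentiate the \(f\) function from earlier:

class ADouble:
    """Dual number holding a primal value and a forward mode derivative."""

    def __init__(self, val, grad=0.0):
        self.val = val
        self.grad = grad

    def __mul__(self, other):
        # Product rule: d(a * b) = a * db + da * b
        return ADouble(self.val * other.val,
                       self.val * other.grad + self.grad * other.val)

# Differentiate f(x1, x2) = x1 * x2 with respect to x1 by seeding its derivative with 1
x1 = ADouble(2.0, grad=1.0)
x2 = ADouble(3.0)
y = x1 * x2
print(y.val, y.grad)  # 6.0 3.0, i.e. df/dx1 = x2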

Operator overloading: tape

In the case of reverse mode, the dual number approach is insufficient. As mentioned previously, when executing reverse mode we need to keep track of the operations performed by the primal code, as well as the primal values they use.

A fundamental concept in the operator overloading AD approach is the tape. This is the name used for the record of all the operations that have been executed. Consider the following lines of Fortran code for an example evaluation of the \(f\) function considered previously:

type(adouble) :: x1
type(adouble) :: x2
type(adouble) :: y

x1 = 2.0
x2 = 3.0
y = x1 * x2

Conceptually, the tape would be something like:

Index  Operation       Dependencies  Outputs
0      Assignment      2.0           x1
1      Assignment      3.0           x2
2      Multiplication  x1, x2        y

That is, it records each operation in order, as well as the relevant dependencies and outputs. In practice, tapes are much more sophisticated than this, but hopefully the idea is clear. The tape can also be visualised as a DAG, as illustrated in subsequent cells.

Given that we know the derivatives of elementary operations such as addition, exponentiation, and multiplication, we can unroll the tape to compute the desired derivatives. As an alternative to the dual number approach, forward mode can also be implemented in this way.

Question

What’s the difference between forward and reverse mode as far as tape unrolling is concerned?

Solution

In forward mode, we unroll the tape in the same order it was written (increasing index). Values for the primal code are computed eagerly as part of this unrolling.

For reverse mode, we do need to first unroll the tape in the forward direction to compute primal variables. Then we unroll the tape in the reverse direction (decreasing index) to accumulate derivatives.
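
The following toy Python sketch (far simpler than any real tape implementation) records the multiplication from the example above and then unrolls the tape in reverse to accumulate adjoints:

tape = []  # list of recorded operations, in execution order

def record_mul(a, b):
    """Execute a multiplication eagerly and record it on the tape."""
    result = {"val": a["val"] * b["val"], "adj": 0.0}
    tape.append(("mul", a, b, result))
    return result

x1 = {"val": 2.0, "adj": 0.0}
x2 = {"val": 3.0, "adj": 0.0}
y = record_mul(x1, x2)

# Reverse sweep: seed the output adjoint, then traverse the tape backwards
y["adj"] = 1.0
for op, a, b, out in reversed(tape):
    if op == "mul":
        a["adj"] += b["val"] * out["adj"]
        b["adj"] += a["val"] * out["adj"]

print(x1["adj"], x2["adj"])  # 3.0 2.0, i.e. (x2, x1)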

Operator overloading example in Pyadjoint

We demonstrate operator overloading using Pyadjoint in the following. We define some utility functions for interrogating the Tape class.

def print_tape():
    """
    Print the current working tape entry-by-entry.
    """
    tape = pyadjoint.get_working_tape()
    print("Tape:")
    for i, block in enumerate(tape.get_blocks()):
        print(f"{i}: {block}")


def plot_dag(pos=None, n=4):
    """
    Plot the DAG for the current working tape.

    :kwarg pos: list of tuples for the node positions
    :kwarg n: number of nodes in final graph for tracking sub-DAGs as they progress
    """
    tape = pyadjoint.get_working_tape()
    graph = tape.create_graph()
    fig, axes = plt.subplots(figsize=(5, 3))

    # Convert any user-specified node positions to the dict format expected by
    # networkx; if none are given, nx.draw positions the nodes automatically
    if pos is not None:
        pos = {node: p for node, p in zip(graph.nodes, pos)}

    # Plot the DAG, colouring the nodes by their order on the tape
    nx.draw(graph, pos, ax=axes, node_color=range(len(graph.nodes)), node_size=800, cmap=plt.cm.coolwarm)
    plt.show()

We start by enabling annotation of the tape with pyadjoint.continue_annotation.

Having done so, we create two AdjFloats and assign them the values 2 and 3.

def f(x1, x2):
    """f function as defined in previous examples."""
    return x1 * x2
# Enable annotation of the tape
pyadjoint.continue_annotation();

# Create some AdjFloats
x1 = pyadjoint.AdjFloat(2.0)
x2 = pyadjoint.AdjFloat(3.0)

Let’s now combine the two AdjFloats using a multiplication operation.

y = f(x1, x2)

# Interrogate the tape
print_tape()
plot_dag(pos=[(0, 0), (1, 0.5), (0, 1), (2, 0.5)])

The tape shows an entry for the product, displaying the float values for x1 and x2. Note that AdjFloat assignments are not shown on the tape in pyadjoint, although assignments for finite element fields are. The DAG shows four nodes: two for the input AdjFloats, one for the multiplication operation, and one for the output AdjFloat.

Having finished the computation of interest, we can stop annotating the tape.

pyadjoint.pause_annotation();

To compute derivatives, we need to specify which are the independent variables. These are declared as Controls, as follows.

xc = [pyadjoint.Control(x1), pyadjoint.Control(x2)]

Having marked the variables as independent, we can compute the gradient of the output with respect to them.

# Inspect the docstring for the compute_gradient driver
print("    \"\"\"\n" + pyadjoint.compute_gradient.__doc__ + "\n    \"\"\"\n")

# Compute the gradient of the output with respect to the inputs
dydx = pyadjoint.compute_gradient(y, xc)

print(f"x = [x1, x2] = [{float(x1)}, {float(x2)}]")
print(f"y = x1 * x2 = {float(y)}")
print(f"dydx = [{float(dydx[0])}, {float(dydx[1])}] ?= [x2, x1] = [{float(x2)}, {float(x1)}]")

Note that we haven’t specified whether to use forward or reverse mode. Pyadjoint is set up such that it will choose which approach to use, depending on whether there are more inputs or outputs.

Exercise

We’ve computed first derivatives of \(f\). Now let’s consider its second derivatives.

  1. What would you expect the Hessian of \(f\) to be?
  2. Take a look at pyadjoint.compute_hessian.__doc__. Compute the compressed Hessian with a single call to the compute_hessian driver function.

Solution 1

\[ \frac{\partial^2f}{\partial x_1^2}=\frac{\partial}{\partial x_1}x_2=0, \quad\frac{\partial^2f}{\partial x_1\partial x_2}=\frac{\partial}{\partial x_1}x_1=1, \quad\frac{\partial^2f}{\partial x_2\partial x_1}=\frac{\partial}{\partial x_2}x_2=1, \quad\frac{\partial^2f}{\partial x_2^2}=\frac{\partial}{\partial x_2}x_1=0, \] so we expect \(H(f)=\begin{bmatrix}0&1\\1&0\end{bmatrix}\).


Solution 2

With the code

# Define seed vector for second derivative
xdd = [pyadjoint.AdjFloat(1.0), pyadjoint.AdjFloat(1.0)]

# Apply driver function
d2ydx2 = pyadjoint.compute_hessian(y, xc, xdd)

print(f"x = [x1, x2] = [{float(x1)}, {float(x2)}]")
print(f"y = x1 * x2")

# Decompress Hessian before printing
print(f"d2ydx2 = [0.0, {float(d2ydx2[0])}; {float(d2ydx2[1])}, 0.0] ?= [0.0, 1.0; 1.0, 0.0]")

we should get

x = [x1, x2] = [2.0, 3.0]
y = x1 * x2
d2ydx2 = [0.0, 1.0; 1.0, 0.0] ?= [0.0, 1.0; 1.0, 0.0]

Source transformation vs. operator overloading

We already evaluated the differences between modes (forward and reverse). How about the differences between approaches?

Levels of abstraction

So far, we have considered AD applied to elementary operations (+, -, *, /, **) acting on intrinsic Fortran types such as reals and arrays thereof. Most of the early AD tools worked in this way, tracking operations applied to real numbers and chaining together their derivatives. However, code differentiated with a source transformation tool in this manner can quickly become difficult to read for large codes. And under the operator overloading approach, the memory footprint of the tape can become very large.

By raising the level of abstraction, some of these difficulties can be avoided.

Medium-level: API calls

Example:

High-level: operations on fields

Examples:

Showcase

Open and work through the following Firedrake Jupyter notebooks that were copied over.

session2/exercises/firedrake/06-pde-constrained-optimisation.ipynb
session2/exercises/firedrake/11-extract-adjoint-solutions.ipynb

Goal-oriented error estimation

Navigate to https://mesh-adaptation.github.io/docs/demos/point_discharge2d.py.html.

Checkpointing

Recall our discussion about tape unrolling and how, for reverse mode, we first unroll the tape forwards to run the primal code.

Question

Under what circumstances can we get away without an initial forward pass when running reverse mode?

Solution

Two possible answers:

If the variables involved in the forward computation had their values assigned exactly once, and those values are still available in memory, then we already have the forward data required for the reverse pass.*

If the problem is linear in the independent variables then the reverse mode code will be independent of those variables. As such, there is no need to compute them.

*Although your AD tool might not be aware of this or be able to make use of the information.
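
To illustrate the second answer: for a linear map \(\mathbf{y}=\mathbf{A}\mathbf{x}\), the reverse mode code is simply \(\bar{\mathbf{x}}=\mathbf{A}^T\bar{\mathbf{y}}\), which makes no reference to \(\mathbf{x}\), so no primal values need to be stored or recomputed. A minimal hand-coded sketch:

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

def apply_b(y_b):
    """Reverse mode of y = A @ x: x_b = A.T @ y_b. No primal x required."""
    return A.T @ y_b

print(apply_b(np.array([1.0, 0.0, 0.0])))  # first row of A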

Checkpointing

In scientific programming, time-dependent problems are typically solved using a timestepping method such as the theta-method we saw in the first session. When writing code for such methods, it’s common practice to overwrite the variable holding the approximation at a given timestep as we progress through the steps. In such a formulation, earlier values are no longer in memory and need to be recomputed (or stored) for the reverse pass.

In some cases, it’s possible to keep the full state on the tape at every timestep, whereby all the information required for reverse mode is available; see Figure 6.

Figure 6: Time-dependent adjoint problem solved without loading checkpoints. Generated using tikZ and \(\LaTeX\).


This quickly becomes infeasible for real-world problems, with the memory required at each timestep taking a large chunk of the available RAM. In the extreme case where only one timestep’s worth of data fits in memory, a reverse mode propagation requires checkpointing at every timestep, as in Figure 7.

Figure 7: Time-dependent adjoint problem solved with checkpointing at every timestep. Generated using tikZ and \(\LaTeX\).


In intermediate cases, checkpointing can be done on the basis of some fixed frequency or in a more intelligent way using the revolve algorithm (Griewank & Walther, 2000).
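
The sketch below illustrates the basic recompute-from-checkpoint idea for a hypothetical fixed-frequency scheme (not the revolve algorithm): only every k-th state is stored during the forward run, and the reverse pass recomputes the missing states from the nearest stored checkpoint. The step function is a stand-in for a generic timestep update.

def step(u):
    """Hypothetical timestep update, standing in for e.g. a theta-method step."""
    return 0.5 * u * (1.0 - u)

def forward(u0, nsteps, k):
    """Forward run, storing a checkpoint every k steps (plus the initial state)."""
    checkpoints = {0: u0}
    u = u0
    for i in range(1, nsteps + 1):
        u = step(u)
        if i % k == 0:
            checkpoints[i] = u
    return u, checkpoints

def states_for_reverse(checkpoints, nsteps, k):
    """Reverse pass: for each step i = nsteps, ..., 1, recover the input state
    u_{i-1} (needed by the adjoint of that step), recomputing it from the
    nearest earlier checkpoint when it was not stored."""
    for i in range(nsteps, 0, -1):
        target = i - 1
        if target in checkpoints:
            u = checkpoints[target]
        else:
            j = (target // k) * k        # nearest checkpoint at or before target
            u = checkpoints[j]
            for _ in range(target - j):  # recompute forward from the checkpoint
                u = step(u)
        yield i, u                       # u = u_{i-1}, the input to step i

u_final, cps = forward(0.1, nsteps=10, k=4)
for i, u in states_for_reverse(cps, nsteps=10, k=4):
    pass  # here one would apply the adjoint of `step` using the recovered state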

Further applications

  • Sensitivity analysis.
  • Data assimilation.
  • Uncertainty quantification.
  • Online training in machine learning.
  • PDE-constrained optimisation.
  • Goal-oriented error estimation and mesh adaptation.

Summary

In today’s session we:

  • Learnt about reverse mode and the operator overloading approach, comparing them with forward mode and source transformation, respectively.
  • Verified the consistency of code generated by Tapenade under forward and reverse mode using the dot product test.
  • Calculated higher order derivatives using Tapenade.
  • Tried out the Pyadjoint operator overloading AD tool underpinning the Firedrake finite element library and saw showcases of more advanced AD usage.
  • Learnt about checkpointing and using AD to compute higher order derivatives.

References

  • S. Linnainmaa. Taylor expansion of the accumulated rounding error. BIT, 16(2):146–160, 1976, doi:10.1007/BF01931367.
  • B. Speelpenning. Compiling fast partial derivatives of functions given by algorithms. University of Illinois, 1980, doi:10.2172/5254402.
  • A. Griewank. Achieving logarithmic growth of temporal and spatial complexity in reverse automatic differentiation. Optimization Methods & Software, 1:35–54, 1992, doi:10.1080/10556789208805505.
  • A. Griewank and Andrea Walther. Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Transactions on Mathematical Software (TOMS) 26.1: 19-45, 2000, doi:10.1145/347837.347846.
  • J. Huckelheim, et al. A taxonomy of automatic differentiation pitfalls, WIREs Data Mining Knowl Discovery, 2024, doi:10.1002/widm.1555.