Autodiff engine

KeOps provides a simple Automatic Differentiation (AD) engine for generic formulas. This feature can be used seamlessly through the Grad instruction or the PyTorch backend: users don’t have to understand backpropagation to enjoy our “free” gradients. Nevertheless, for the sake of completeness, here is a short introduction to the inner workings of KeOps.

Backprop 101

Gradient of a vector valued function

Let \(F:\mathbb R^n \to \mathbb R^d\) be a smooth function. Around any given point \(x \in \mathbb R^n\), its variations are encoded by a linear map \(\text{d}F(x):\mathbb R^n \to \mathbb R^d\) called the differential of \(F\): for any variation \(\delta x \in \mathbb R^n\) of \(x\),

\[F(x+\delta x) = F(x)+ [\text{d}F(x)](\delta x) + o(\delta x).\]

Going further, we can define the gradient of \(F\) at \(x\) as the adjoint of \(\text{d}F(x)\): it is the unique linear map \(\partial F(x)=\text{d}F(x)^*:\mathbb R^d \to \mathbb R^n\) such that for every vector \(e \in \mathbb R^d\) and variation \(\delta x \in \mathbb R^n\) of \(x\), we have:

\[\langle [\partial F(x)](e) , \delta x \rangle_{\mathbb R^n} = \langle e , [\text{d}F(x)](\delta x) \rangle_{\mathbb R^d},\]

where \(\langle\,\cdot\,,\,\cdot\,\rangle\) denotes the standard scalar product.

When \(F\) is scalar-valued (i.e. when \(d=1\)), the gradient is a linear map from \(\mathbb{R}\) to \(\mathbb{R}^n\): it is best understood as the vector of partial derivatives of \(F\). In the general case, the matrix of the gradient in the canonical basis is the transpose of the Jacobian matrix of \(F\).
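
To make this duality concrete, here is a minimal numerical sanity check written with PyTorch (the toy map F, the sizes n and d, and the variable names are illustrative, not part of KeOps): the vector-Jacobian product computes \([\partial F(x)](e)\), the Jacobian-vector product computes \([\text{d}F(x)](\delta x)\), and both sides of the duality bracket coincide up to floating-point error.

import torch

n, d = 5, 3
x = torch.randn(n, requires_grad=True)
dx = torch.randn(n)  # variation "delta x" of x
e = torch.randn(d)   # arbitrary vector in the output space R^d

def F(x):
    # Any smooth map from R^n to R^d would do; this one is a toy example.
    return torch.stack([x.sum(), (x ** 2).sum(), x.sin().sum()])

# Reverse mode: [dF(x)]^*(e) = J^T e, via a vector-Jacobian product.
vjp = torch.autograd.grad(F(x), x, grad_outputs=e)[0]

# Forward mode: [dF(x)](dx) = J dx, via a Jacobian-vector product.
_, jvp = torch.autograd.functional.jvp(F, (x,), (dx,))

# <[dF(x)]^*(e), dx>  should match  <e, [dF(x)](dx)>  up to numerical precision.
print(torch.allclose(vjp @ dx, e @ jvp, atol=1e-5))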

Reverse mode AD = backpropagation = chain rule

Now, let’s assume that the function \(F:\mathbb R^n \to \mathbb R^d\) can be written as a composition \(F =F_p \circ \cdots \circ F_2 \circ F_1\) of \(p\) functions \(F_i:E_{i-1} \to E_{i}\), where \(E_i=\mathbb{R}^{d_i}\), so that \(d_0 = n\) and \(d_p = d\).

Evaluating the gradient of \(F\) with the backpropagation algorithm requires:

  1. A Forward pass to evaluate the functions

    \[\begin{split}\begin{array}{ccccl} F_i & : & E_{i-1} & \to & E_{i} \\ & & x & \mapsto & F_i(x) \end{array}\end{split}\]

    and thus compute the value \(F(x)\).

  2. A Backward pass to evaluate the (adjoints of the) differentials

    \[\begin{split}\begin{array}{ccccl} \partial F_i & : & E_{i-1}\times E_{i} & \to & E_{i-1} \\ & & (x_{i-1},x_i^*) & \mapsto & [\text{d} F_i^*(x_{i-1})](x_i^*) = x_{i-1}^* \end{array}\end{split}\]

    and compute the gradient of \(F\) at location \(x\), applied to an arbitrary vector \(e\) in the space of outputs.


This method relies on the chain rule, as

\[\begin{split}\begin{align*} & &\text{d}(F_p\circ\cdots\circ F_1)(x_0) &= \text{d}F_p(x_{p-1}) \circ\cdots \circ \text{d} F_1(x_0),\\ &\text{i.e.}& \text{d}(F_p\circ\cdots\circ F_1)^*(x_0) &= \text{d} F_1^*(x_0) \circ\cdots \circ \text{d}F_p^*(x_{p-1}),\\ &\text{i.e.}& \big[\partial F(x_0)\big](e) &= \big[\partial F_1(x_0)\big]\big( \cdots \big[\partial F_p(x_{p-1})\big](e) \big). \end{align*}\end{split}\]

When \(F\) is scalar-valued (i.e. \(d=1\)), this algorithm allows us to compute the vector of partial derivatives

\[\nabla F(x_0)= \big[\partial F(x_0)\big](1)\]

with a single forward-backward pass through the computational graph of \(F\), which is much cheaper than the naive evaluation of \(n\) finite differences of \(F\).
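
For illustration, here is a bare-bones sketch of this forward-backward scheme in plain Python (the chain of maps, stored as pairs \((F_i, \partial F_i)\), is a made-up example): the forward pass stores the intermediate values \(x_0,\dots,x_{p-1}\), and the backward pass composes the adjoints in reverse order.

import numpy as np

# F(x) = sum(exp(2 * x)), written as a chain F_3 ∘ F_2 ∘ F_1 of simple maps,
# each stored as a pair (F_i, ∂F_i) with ∂F_i : (x_{i-1}, e) -> [dF_i(x_{i-1})]^*(e).
layers = [
    (lambda x: 2 * x,     lambda x, e: 2 * e),                # F_1, ∂F_1
    (lambda x: np.exp(x), lambda x, e: np.exp(x) * e),        # F_2, ∂F_2
    (lambda x: x.sum(),   lambda x, e: e * np.ones_like(x)),  # F_3, ∂F_3
]

def grad_F(x0, e=1.0):
    # Forward pass: compute and *store* the intermediate values x_0, ..., x_{p-1}.
    xs = [x0]
    for f, _ in layers:
        xs.append(f(xs[-1]))
    # Backward pass: apply the adjoints ∂F_p, ..., ∂F_1 in reverse order.
    for (_, df), x_prev in zip(reversed(layers), reversed(xs[:-1])):
        e = df(x_prev, e)
    return e  # = [∂F(x_0)](e); with e = 1, this is the gradient ∇F(x_0)

x0 = np.random.randn(4)
print(np.allclose(grad_F(x0), 2 * np.exp(2 * x0)))  # matches the analytic gradient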

The KeOps generic engine

Backpropagation has become the standard way of computing the gradients of arbitrary “Loss” functions in imaging and machine learning. Crucially, any backprop engine should be able to:

  • Link together the forward operations \(F_i\) with their backward counterparts \(\partial F_i\).

  • Store in memory the intermediate results \(x_0,\dots,x_p\) before using them in the backward pass.

The Grad operator

At a low level, KeOps allows us to perform these tasks with the Grad instruction: given a formula \(F\), the symbolic expression Grad(F, V, E) denotes the gradient \([\partial_V F(x)](E)\) of \(F\) with respect to the variable \(V\), applied to the input variable \(E\).

If V is a variable placeholder that appears in the expression of F and if E has the same dimension and category as F, Grad(F,V,E) can be fed to KeOps just like any other symbolic expression. The resulting output has the same dimension and category as the variable V, and can be used directly for gradient descent or higher-order differentiation: operations such as Grad(Grad(..,..,..),..,..) are fully supported.

User interfaces

As evidenced by this example, the simple Grad syntax relieves us of the burden of differentiating symbolic formulas by hand.

Going further, our Python interface is fully compatible with the PyTorch library: feel free to use the output of a pykeops.torch routine just like any other differentiable tensor! Thanks to the flexibility of the torch.autograd engine, end-to-end automatic differentiation is at hand: see this example or this example for an introduction.
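
As a quick sketch of this workflow (the Gaussian kernel, the sizes and the variable names below are illustrative), a KeOps reduction built with pykeops.torch.LazyTensor can be differentiated with torch.autograd like any other tensor computation:

import torch
from pykeops.torch import LazyTensor

M, N, D = 1000, 2000, 3
x = torch.randn(M, D, requires_grad=True)
y = torch.randn(N, D)
b = torch.randn(N, 1)

x_i = LazyTensor(x[:, None, :])  # (M, 1, D) symbolic variable, indexed by i
y_j = LazyTensor(y[None, :, :])  # (1, N, D) symbolic variable, indexed by j

# Gaussian kernel product: a_i = sum_j exp(-|x_i - y_j|^2) * b_j.
K_ij = (-((x_i - y_j) ** 2).sum(-1)).exp()  # (M, N) symbolic kernel matrix
a = K_ij @ b                                # (M, 1) dense torch tensor

# The KeOps output behaves like any other differentiable tensor:
[g_x] = torch.autograd.grad(a.sum(), [x])
print(g_x.shape)  # torch.Size([1000, 3])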

An example

Coming back to our previous example where the formula

\[F(p,x,y,a)_i = \left(\sum_{j=1}^N (p -a_j )^2 \exp(x_i^u + y_j^u) \right)_{i=1,\cdots,M, u=1,2,3} \in \mathbb R^{M\times 3}\]
F = "Sum_Reduction(Square(Pm(0,1) - Vj(3,1)) * Exp(Vi(1,3) + Vj(2,3)), 1)"

was discussed, the symbolic expression

[grad_a F] = "Grad( Sum_Reduction(Square(Pm(0,1) - Vj(3,1)) * Exp(Vi(1,3) + Vj(2,3)), 1),
                 Vj(3,1), Vi(4,3) )"

allows us to compute the gradient of \(F\) with respect to \((a_j) \in \mathbb R^N\) (= Vj(3,1)), applied to an arbitrary test vector \(e\in\mathbb R^{M\times 3}\) given as a fifth input Vi(4,3):

\[\left[ [\partial_{a} F(p,x,y,a)] (e)\right]_j = - \sum_{i=1}^M \sum_{u=1}^3 2(p -a_j ) \exp(x_i^u + y_j^u) e^u_i \in \mathbb R.\]

With aliases, this computation simply reads:

p=Pm(0,1), x=Vi(1,3), y=Vj(2,3), a=Vj(3,1), e=Vi(4,3)
[grad_a F](e) = "Grad( Sum_Reduction(Square(p-a)*Exp(x+y), 1), a, e)"