# KeOps Lazy tensors

**Current state of the art.**
The level of performance provided by KeOps
may surprise readers who grew accustomed to
the limitations of tensor-centric frameworks.
As discussed in previous sections,
common knowledge in the machine
learning community asserts that “kernel” computations can *not* scale to
large point clouds with the CUDA backends of modern libraries:
\(\mathrm{N}\)-by-\(\mathrm{N}\) kernel matrices **stop fitting contiguously on
the Device memory** as soon as \(\mathrm{N}\) exceeds some threshold in the
10,000–50,000 range that depends on the GPU chip.

Focusing on the key operation involved, the *kernel dot product* or
*discrete convolution*:

most authors are tempted to introduce an \(\mathrm{M}\)-by-\(\mathrm{N}\) kernel matrix \(K_{ij} = k(x_i,y_j)\) and implement the operation above as a matrix dot product

To accelerate computations, a flourishing literature has then focused
on the construction of **low-rank approximations** of the linear
operator \((K_{ij})\). Common
methods rely on random sampling schemes, multiscale decompositions of
the input data or take advantage of specific properties of the kernel
function \(k\) – in our case, a convenient Gaussian blob.

## Our focus: exact Map-Reduce computations

As discussed in detail in
our conclusion, such approximation strategies have a long history and a clear
intrinsic value. Nevertheless, acknowledging the fact that
**progresses can also be made through low-level software engineering**,
we decide to
tackle this problem in a completely different way. Brushing aside the
elegant but *inefficient* matrix decomposition written above,
the KeOps package **directly optimizes** the kernel sum
by understanding it as a **Map-Reduce composition** of the
operators:

over the *indexing* indices \(i \in \left[\!\left[ 1,\mathrm{M}\right]\!\right]\) and *reduction*
indices \(j\in \left[\!\left[ 1,\mathrm{N}\right]\!\right]\).

**Parenthesis: are we re-packaging the wheel?**
This approach is **common in the computer graphics literature** but tends to
be strictly limited to C++/CUDA programming guides: with its
emphasis on real-time rendering and explicit models, the graphics
community never felt the need to develop high-level libraries that would
be suited to machine learning research.

In this context, **our scientific contribution does not lie in any new
theorem or algorithm**. Described in the next few sections, the tools on
which KeOps relies (the backpropagation algorithm, online
Map-Reduce CUDA schemes and symbolic variadic templating) are all very
well-known in their respective communities. But as we **combine** them in
a versatile framework, endowed with a transparent interface and a
comprehensive documentation, we allow them to **reach a much wider
audience** and have, hopefully, a positive and fertilizing impact.

## A generic framework

The seminal Theano library combined the flexibility of high-level
Python frameworks with a first-class support of convolutional
architectures on the GPU. In the same vein, the KeOps package
**puts the spotlight on Map-Reduce schemes for (off-grid) sampled data**, an
algorithmic structure that we deem to be relevant in many fields that
are related to data sciences and shape analysis.

Removing all the Python sugar coating, the workhorse of our
library is a **Generic Reduction** (`Genred`

) operator that supports
a wide family of formulas. Let us assume that we have:

A collection \(p^1, p^2, \dots, p^P\) of vectors.

A collection \(x^1_i, x^2_i, \dots, x^X_i\) of vector sequences, indexed by an integer \(i\) in \(\left[\!\left[ 1,\mathrm{M}\right]\!\right]\).

A collection \(y^1_j, y^2_j, \dots, y^Y_j\) of vector sequences, indexed by an integer \(j\) in \(\left[\!\left[ 1,\mathrm{N}\right]\!\right]\).

A vector-valued formula \(F(p^1, p^2, \dots, x^1_i, x^2_i, \dots, y^1_j, y^2_j, \dots)\) on these input vectors.

A \(\operatorname{Reduction}\) operation that may be a sum, an arg-min, a log-sum-exp, etc.

Then, referring to the \(p\)’s as **parameters**, the
\(x_i\)’s as \(i\)-**variables** and the \(y_j\)’s as
\(j\)-**variables**, a single KeOps `Genred`

call allows
users to compute efficiently the expression

\[\begin{aligned} a_i ~&=~ \operatorname{Reduction}_{j=1,\dots,\mathrm{N}}\limits \big[ F(p^1, p^2, \dots, x^1_i, x^2_i, \dots, y^1_j, y^2_j, \dots) \big] \qquad \text{for}~~ i=1,\dots,\mathrm{M}, \label{eq:keops_genred}\end{aligned}\]

alongside its **derivatives** with respect to all variables and
parameters. As showcased in
our gallery of tutorials, this level of generality allows KeOps to handle
K-Nearest-Neighbors classification, K-Means clustering, Gaussian
Mixture Model-fitting and many other tasks.

## The `LazyTensor`

abstraction

Implementation details are covered in the next few sections but probably
won’t interest most mathematicians. Wary of making users step outside of
the convenient tensor-centric paradigm, we give a **matrix-like interface**
to the generic reduction above. Through a new
`LazyTensor`

wrapper for NumPy arrays and PyTorch tensors,
users may specify formulas \(F\) without ever leaving the comfort of
a NumPy-like interface.

KeOps `LazyTensors`

embody the concept of
“symbolic” tensors that are not sparse in the traditional sense, but can
nevertheless be handled more efficiently than large
\(\mathrm{M}\)-by-\(\mathrm{N}\) arrays by using:

A symbolic mathematical formula\(F\), the “.formula” attribute that is encoded as a well-formed string, manipulated with Python operations and parsed at reduction time.

A collection of “small” data arrays\(p\), \(x\) and \(y\), the “.variables” list ofparameters, \(i\)- and \(j\)-variables that are needed to evaluate the formula \(F\).

Coming back to the example of the previous section,
we may display the `LazyTensor`

**K_ij** using:

```
>>> print(K_ij)
... KeOps LazyTensor
... formula: Exp((Minus(Sum(Square((Var(0,3,0) - Var(1,3,1))))) / Var(2,1,2)))
... shape: (1000, 1000)
```

Here, the **Var(index, dimension, [ i | j | parameter ] )** placeholders
refer to the data arrays **q_i**, **q_j** and **1/(2*s*s)** that are
stored in the list of **K_ij.variables**. As we call a supported
reduction operator such as the matrix dot-product “**@**” on **K_ij**,
this information is fed to the `Genred`

engine and a result is
returned as a genuine, differentiable PyTorch tensor: things just
work smoothly, with full support of **operator broadcasting** and
**batch dimensions**.