# Why using KeOps?

## Scalable kernel operations

KeOps can be used on a broad class of problems (A generic framework).
But the first motivation behind this library is simple:
we intend to accelerate the computation of Gaussian convolutions on point clouds,
also known as **RBF kernel products** on sampled data.

Working in a vector space \(\mathbb{R}^{\mathrm{D}}\), let
us consider for **large values** of \(\mathrm{M}\) and \(\mathrm{N}\):

a

**target**point cloud \(x_1, \dots, x_{\mathrm{M}} \in \mathbb{R}^{\mathrm{D}}\), encoded as an \(\mathrm{M}\times\mathrm{D}\) array;a

**source**point cloud \(y_1, \dots, y_{\mathrm{N}} \in \mathbb{R}^{\mathrm{D}}\), encoded as an \(\mathrm{N}\times\mathrm{D}\) array;a

**signal**\(b_1, \dots, b_{\mathrm{N}} \in \mathbb{R}\), attached to the \(y_j\)’s and encoded as a \(\mathrm{N}\times 1\) vector.

Then, KeOps allows us to compute efficiently the vector of \(\mathrm{M}\) values \(a_1, \dots, a_{\mathrm{M}} \in \mathbb{R}\) given by:

where \(k(x_i,y_j) = \exp(-\|x_i - y_j\|^2 / 2 \sigma^2)\)
is a Gaussian kernel of deviation \(\sigma > 0\).
Thanks to the KeOps **automatic differentiation** engine,
we can also get access to the gradient of the \(a_i\)’s with respect to the \(x_i\)’s:

without having to write the formula \(\partial_x k(x_i,y_j) = -\tfrac{1}{\sigma^2}(x_i - y_j) \exp(-\|x_i - y_j\|^2 / 2 \sigma^2)\) in our programs.

## A generic framework

Going further, KeOps supports a wide range of **mathematical** and
**deep learning computations**. Let’s say that we have at hand:

a collection \(p^1, p^2, ..., p^{\mathrm{P}}\) of vectors;

a collection \(x^1_i, x^2_i, ..., x^{\mathrm{X}}_i\) of vector sequences, indexed by an integer \(i\) that ranges from 1 to \(\mathrm{M}\);

a collection \(y^1_j, y^2_j, ..., y^{\mathrm{Y}}_j\) of vector sequences, indexed by an integer \(j\) that ranges from 1 to \(\mathrm{N}\);

a vector-valued function \(F(i, j, p^1, p^2,..., x^1_i, x^2_i,..., y^1_j, y^2_j, ...)\) on these input vectors and indices, such as a small neural network;

a \(\operatorname{Reduction}\) or “

*pooling*” operator such as a sum, a max or an arg-min.

Then, referring to the \(p\)’s as **parameters**, the \(x_i\)’s as **i-variables** and the \(y_j\)’s as **j-variables**, the KeOps library allows us to compute efficiently **any expression** \(a_i\) of the form:

alongside its **derivatives** with respect to all variables and parameters.

## Examples of applications

This type of computation is common in machine learning and applied mathematics:

A

**kernel matrix-vector product**is implemented using a sum reduction and a formula \(F(x_i,y_j,b_j)=k(x_i,y_j)\cdot b_j\) that is weighted by a suitable kernel function \(k(x,y)\). As detailed in the section above, the computation reads:\[a_i \gets \sum_{j=1}^{\mathrm{N}} k(x_i,y_j)\, b_j~, \qquad i=1,\dots,\mathrm{M}~.\]This operation is key to spline regression, kernel methods and the study of Gausian processes. In physics, we often use Newton or Coulomb kernels such as \(k(x,y)=1/\|x-y\|^2\): accelerating kernel products is the first step towards fast N-body simulations.

**K-Nearest Neighbors queries**are implemented using an “arg-K-min” reduction that returns, for all index \(i\), the indices \((j_1,\dots,j_{\mathrm{K}})\) that correspond to the K smallest values of a distance or similarity metric \(F(x_i,y_j)\). For instance, in a Euclidean space, we compute:\[a_i \gets \arg_{\mathrm{K}} \min_{j=1,\,\dots\,,\,\mathrm{N}} \|x_i - y_j\|^2 ~, \qquad i=1,\dots,\mathrm{M}~,\]where \(\| x - y \|^2 = \sum_k (x[k] - y[k])^2\) is a sum of squared distances. K-NN search is a key building block for numerous methods in data sciences, from simple classifiers to advanced methods in topological data analysis and dimensionality reduction. KeOps intends to provide fast runtimes for

**all types of metrics**, beyond the standard Euclidean distance and cosine similarity: we refer to our benchmarks for an extensive discussion.

In

**computer graphics**and**geometric deep learning**, we implement**point cloud convolutions**and**message passing layers**using a function:\[F(p,x_i,y_j,f_j)=\text{Window}(x_i,y_j)\cdot \text{Filter}(p,x_i,y_j,f_j)\]that is the product of a neighborhood \(\text{Window}(x_i,y_j)\) between point positions \(x_i\), \(y_j\) and of a parametric filter that is applied to a collection of feature vectors \(f_j\). The reduction or “pooling” operator is usually a (weighted) sum or a maximum.

Most architectures in computer vision rely on K-Nearest Neighbors graphs (”\(x_i \leftrightarrow y_j\)”) to define sparse neighborhood windows. These are equal to 1 if \(y_j\) is one of the closest neighbors of \(x_i\) and 0 otherwise. The point convolution then reads:

\[\begin{split}a_i \gets \sum_{\substack{j \text{ such that }\\ x_i \leftrightarrow y_j}} \text{Filter}(p,x_i,y_j,f_j) ~.\end{split}\]Crucially, KeOps now also lets users work with

**global point convolutions**without compromising on performances: we refer to the Section 5.3 of our NeurIPS 2020 paper and to this presentation of quasi-geodesic convolutions on protein surfaces for a detailed discussion.

In

**natural language processing**, we implement**attention layers**for transformer networks using an exponentiated dot product \(F(q_i,k_j)=\exp(\langle q_i,k_j\rangle/ \sqrt{\mathrm{D}})\) between*query*(\(q_i\)) and*key*(\(k_j\)) vectors of dimension \(\mathrm{D}\). The reduction is a normalized matrix-vector product with an array of*value*vectors \(v_j\) (a**soft maximum**) and the overall computation reads:\[a_i \gets \frac{ \sum_{j=1}^{\mathrm{N}} \exp\big[ \langle q_i,k_j\rangle / \sqrt{\mathrm{D}} \big]~\cdot~ v_j }{ \sum_{j=1}^{\mathrm{N}} \exp\big[ \langle q_i,k_j\rangle / \sqrt{\mathrm{D}}\big] }~.\]It can be implemented efficiently using the KeOps “Sum-SoftMax-Weight” reduction.

We implement the

**Fourier transform**with a sum reduction and a complex exponential:\[\widehat{f_i} = \widehat{f}(\omega_i) ~\gets~ \sum_{j=1}^{\mathrm{N}} \exp(i\langle \omega_i,x_j\rangle)~\cdot~ f_j ~.\]This expression evaluates the spectral content at frequency \(\omega_i\) of a signal \(f\) that is represented by sampled values \(f_j=f(x_j)\) at locations \(x_j\). KeOps thus allows users to implement efficient Fourier-Stieltjes transforms on

**non-uniform data**using both real- and complex-valued trigonometric functions.

In

**optimization theory**, we implement the Legendre-Fenchel transform or convex conjugate of an arbitrary function \(f(x)\) that is sampled on a point cloud \(x_1, \dots, x_\mathrm{N}\) with a vector of values \(f_j = f(x_j)\) using a dot product and a maximum reduction:\[\forall u_i \in \mathbb{R}^\mathrm{D},~~ f^*_i = f^*(u_i) ~\gets~ \max_{j=1,\, \dots\,,\,\mathrm{N}} \langle u_i, x_j\rangle - f_j.\]In

**imaging sciences**, we implement the distance transform of a binary mask \(m_j = m(y_j) \in \{0, 1\}\) that is defined on the rectangle domain \([\![1, \text{W} ]\!] \times [\![1, \text{H} ]\!]\) using a minimum reduction and a squared distance function:\[\forall x_i \in [\![1, \text{W} ]\!] \times [\![1, \text{H} ]\!],~~ d_i = d(x_i) ~\gets~ \min_{y_j \in [\![1, \text{W} ]\!] \times [\![1, \text{H} ]\!]} \|x_i-y_j\|^2 - \log(m_j) .\]We note that just like the Legendre-Fenchel transform, the distance transform is

**separable**and can be implemented efficiently on 2D and 3D grids. Just as with separable Gaussian convolution, the trick is to apply the transform**successively**on the lines and columns of the image. Thanks to its native support for batch processing, KeOps is ideally suited to these manipulations: it can be used to implement all types of fast separable transforms on the GPU.

In optimal transport theory, we implement the

**C-transform**using a “min” reduction and a formula \(F(x_i,y_j,g_j)=\text{C}(x_i,y_j) -g_j\) that penalizes the value of the ground cost function \(\text{C}\) by that of the dual potential \(g\) :\[a_i \gets \min_{j=1,\, \dots\,,\,\mathrm{N}} \big[ \text{C}(x_i,y_j) - g_j \big], \qquad i=1,\dots,\mathrm{M}~.\]Going further, numerically stable

**Sinkhorn iterations**correspond to the case where the minimum in the C-transform is replaced by a (rescaled) log-sum-exp reduction, known as a**soft minimum**at temperature \(\varepsilon > 0\):\[a_i \gets - \varepsilon \cdot \log \sum_{j=1}^{\mathrm{N}} \exp \tfrac{1}{\varepsilon} \big[ g_j - \text{C}(x_i,y_j) \big], \qquad i=1,\dots,\mathrm{M}~.\]As detailed in our NeurIPS 2020 paper, KeOps speeds up modern optimal transport solvers by

**one to three orders of magnitude**, from standard auction iterations to multiscale Sinkhorn loops. A collection of reference solvers is provided by the GeomLoss library, that now scales up to millions of samples in seconds.

Numerous

**particle**and**swarming**models rely on**interaction steps**that fit this template to update the positions and inner states of their agents. For instance, on modest gaming hardware, KeOps can scale up simulations of Vicsek-like systems to millions of active swimmers and flyers: this allows researchers to make original conjectures on their models with a minimal amount of programming effort.

Crucially, we can understand all these computations as **reductions of “symbolic” matrices** whose coefficients are given, for all indices \(i\) and \(j\), by a mathematical formula \(F(p, x_i, y_j)\).
As detailed on the front page of this website,
**the KeOps library is built around this remark**. We introduce a new type of “symbolic” tensor that lets users implement all these operations efficiently, with a small memory footprint.
Under the hood, operations on KeOps `LazyTensors`

avoid storing in memory the matrix of values \(F(p,x_i,y_j)\) and rely instead on fast C++/CUDA routines that are compiled on demand.
We refer to our guided tour of the KeOps++ engine for more details.

## High performances

KeOps fits within a thriving ecosystem of Python/C++ libraries for scientific computing. So how does it compare with other acceleration frameworks such as Numba, Halide, TVM, Julia or JAX/XLA? To answer this question, let us now briefly explain the relationship between our library and the wider software environment for tensor computing.

### Tensor computing on the GPU

**Fast numerical methods are the fuel of machine learning research.**
Over the last decade, the sustained
development of the CUDA ecosystem has driven the progress in the field:
though Python is the lingua
franca of data science and machine learning,
most frameworks rely on **efficient C++ backends** to
leverage the computing power of GPUs.
Recent advances in computer vision or natural
language processing attest to the fitness of modern libraries:
they stem from the **mix of power and flexibility**
that is provided by PyTorch,
TensorFlow and general purpose accelerators such
as JAX/XLA.

Nevertheless, **important work remains to be done.** Geometric computations present a clear gap
in performances between Python and C++: notable examples are implementations of point cloud
convolutions or of the nearest neighbor search that is discussed above.
To scale up geometric computations to
real-world data, a common practice is therefore to replace the compute-intensive parts of a Python
code by **handcrafted CUDA kernels**.
These are expensive to develop and maintain, which
leads to an unfortunate need to **compromise between ease of development and scalability**.

### KeOps: a specialized tool

**Requirements for geometric data analysis and learning.**
None of the aforementioned methods are fully suited
for exploratory research in geometric data analysis and machine learning.
Let us briefly explain why:

First of all, some acceleration schemes

**do not stream well on GPUs**or have to rely on**expensive pre-computations**: hierarchical matrices or advanced nearest neighbor finders can hardly be used in the training loop of a neural network.Other strategies make

**strong assumptions**on the properties of the convolution filter \(k\) or on the dimension and geometry of the ambient feature space. These restrictions make existing tools cumbersome to use in e.g. deep learning, where one wishes to have**modelling freedom**with respect to the choice of the embedding space geometry and dimension.Finally, most acceleration frameworks for Python expect users to be

**knowledgeable on GPU parallelism**or do not support**automatic differentiation**.

The bottomline is that most existing tools are not ready to be used by a majority of researchers in the community.

**A gap in the literature.**
In order to tackle these issues,
the developers of deep learning libraries
have recently put an emphasis on
**just-in-time compilation for neural networks**.
For instance, the recent
PyTorch JIT and
XLA engines enable operator
fusion and unlock performance speed-ups for research code.
These **general purpose compilers** are fully transparent to users
and show promise for a wide range of applications.
Nevertheless,
**they fall short** on the type of **geometric computations** that are discussed above.
This is most apparent for
nearest neighbor search,
matrix-vector products
with kernel matrices
and message passing methods on point clouds,
where one still has to develop and maintain custom CUDA kernels to achieve state-of-the-art performance.

**A unique position.**
KeOps intends to fix this
**specific but important problem** with all the convenient
features of a modern library.
We present examples of applications
in our
gallery of tutorials
and discuss its inner workings
in our
guided tour of the KeOps++ engine.
As evidenced by our benchmarks,
the KeOps routines **outperform** their standard counterparts
**by two orders of magnitude** in many settings.
On top of a reduced memory usage, they can thus bring
a considerable speed-up to numerous methods
in machine learning, computational physics and other applied fields.

### Is KeOps going to speed-up your program?

**Strengths.**
At its heart, KeOps leverages the low
Kolmogorov complexity of symbolic arrays: it can be used when the computational bottleneck
of a method is an interaction step
that fits a simple Map-Reduce template.
In practice, it is thus likely to offer gains on runtime and memory usage when
the formula \(F(x_i,y_j)\) is compact
and the numbers of samples \(\text{M}\) and \(\text{N}\) range from \(10^3\) to \(10^7\).

**Limitations.**
On the other hand, the main limitations of KeOps stem from the overflow of CUDA registers in the computation of the formula \(F(x_i,y_j)\).
These result in decreased performances on large feature vectors
of dimension D > 100.
The problem is known as
register spilling,
with some documented but non-trivial work-arounds.

Another drawback is that we do not pre-ship binaries but instead rely on C++/CUDA compilers to run our kernels. Fortunately, this weakness is now mitigated by the ubiquitous deployment of fast compilers built in e.g. the CUDA drivers. With the release of KeOps 2.0 in March 2022, installation and compilation issues have (mostly) become a thing of the past.

## Main features

Feel free to browse through our gallery of tutorials for examples of applications. Among other features, KeOps supports:

Non-radial kernels, neural networks and other arbitrary formulas.

Most common reduction operations: Summations, stabilized LogSumExp reductions, Min, Max, ArgKMin, SoftMin, Softmax…

Batch processing and block-wise sparsity masks.

High-order derivatives with respect to all parameters and variables.

The resolution of positive definite linear systems using a conjugate gradient solver.