Why using KeOps?

Scalable kernel operations

KeOps can be used on a broad class of problems (A generic framework). But the first motivation behind this library is simple: we intend to accelerate the computation of Gaussian convolutions on point clouds, also known as RBF kernel products on sampled data.

Working in a vector space \(\mathbb{R}^{\mathrm{D}}\), let us consider for large values of \(\mathrm{M}\) and \(\mathrm{N}\):

a target point cloud \(x_1, \dots, x_{\mathrm{M}} \in \mathbb{R}^{\mathrm{D}}\), encoded as an \(\mathrm{M}\times\mathrm{D}\) array;
a source point cloud \(y_1, \dots, y_{\mathrm{N}} \in \mathbb{R}^{\mathrm{D}}\), encoded as an \(\mathrm{N}\times\mathrm{D}\) array;
a signal \(b_1, \dots, b_{\mathrm{N}} \in \mathbb{R}\), attached to the \(y_j\)’s and encoded as a \(\mathrm{N}\times 1\) vector.

Then, KeOps allows us to compute efficiently the vector of \(\mathrm{M}\) values \(a_1, \dots, a_{\mathrm{M}} \in \mathbb{R}\) given by:

\[a_i \gets \sum_{j=1}^{\mathrm{N}} k(x_i,y_j)\, b_j~, \qquad i=1,\dots,\mathrm{M}~,\]

where \(k(x_i,y_j) = \exp(-\|x_i - y_j\|^2 / 2 \sigma^2)\) is a Gaussian kernel of deviation \(\sigma > 0\). Thanks to the KeOps automatic differentiation engine, we can also get access to the gradient of the \(a_i\)’s with respect to the \(x_i\)’s:

\[a_i' \gets \sum_{j=1}^{\mathrm{N}} \partial_x k(x_i,y_j)\, b_j~, \qquad i=1,\dots,\mathrm{M}~,\]

without having to write the formula \(\partial_x k(x_i,y_j) = -\tfrac{1}{\sigma^2}(x_i - y_j) \exp(-\|x_i - y_j\|^2 / 2 \sigma^2)\) in our programs.

A generic framework

Going further, KeOps supports a wide range of mathematical and deep learning computations. Let’s say that we have at hand:

a collection \(p^1, p^2, ..., p^{\mathrm{P}}\) of vectors;
a collection \(x^1_i, x^2_i, ..., x^{\mathrm{X}}_i\) of vector sequences, indexed by an integer \(i\) that ranges from 1 to \(\mathrm{M}\);
a collection \(y^1_j, y^2_j, ..., y^{\mathrm{Y}}_j\) of vector sequences, indexed by an integer \(j\) that ranges from 1 to \(\mathrm{N}\);
a vector-valued function \(F(i, j, p^1, p^2,..., x^1_i, x^2_i,..., y^1_j, y^2_j, ...)\) on these input vectors and indices, such as a small neural network;
a \(\operatorname{Reduction}\) or “pooling” operator such as a sum, a max or an arg-min.

Then, referring to the \(p\)’s as parameters, the \(x_i\)’s as i-variables and the \(y_j\)’s as j-variables, the KeOps library allows us to compute efficiently any expression \(a_i\) of the form:

\[a_i \gets \operatorname{Reduction}_{j=1,...,\mathrm{N}}\limits \big[ F(i, j, p^1, p^2,..., x^1_i, x^2_i,..., y^1_j, y^2_j, ...) \big]~, \qquad i=1,\dots,\mathrm{M}~,\]

alongside its derivatives with respect to all variables and parameters.

Examples of applications

This type of computation is common in machine learning and applied mathematics:

A kernel matrix-vector product is implemented using a sum reduction and a formula \(F(x_i,y_j,b_j)=k(x_i,y_j)\cdot b_j\) that is weighted by a suitable kernel function \(k(x,y)\). As detailed in the section above, the computation reads:

\[a_i \gets \sum_{j=1}^{\mathrm{N}} k(x_i,y_j)\, b_j~, \qquad i=1,\dots,\mathrm{M}~.\]

This operation is key to spline regression, kernel methods and the study of Gausian processes. In physics, we often use Newton or Coulomb kernels such as \(k(x,y)=1/\|x-y\|^2\): accelerating kernel products is the first step towards fast N-body simulations.
K-Nearest Neighbors queries are implemented using an “arg-K-min” reduction that returns, for all index \(i\), the indices \((j_1,\dots,j_{\mathrm{K}})\) that correspond to the K smallest values of a distance or similarity metric \(F(x_i,y_j)\). For instance, in a Euclidean space, we compute:

\[a_i \gets \arg_{\mathrm{K}} \min_{j=1,\,\dots\,,\,\mathrm{N}} \|x_i - y_j\|^2 ~, \qquad i=1,\dots,\mathrm{M}~,\]

where \(\| x - y \|^2 = \sum_k (x[k] - y[k])^2\) is a sum of squared distances. K-NN search is a key building block for numerous methods in data sciences, from simple classifiers to advanced methods in topological data analysis and dimensionality reduction. KeOps intends to provide fast runtimes for all types of metrics, beyond the standard Euclidean distance and cosine similarity.
In computer graphics and geometric deep learning, we implement point cloud convolutions and message passing layers using a function:

\[F(p,x_i,y_j,f_j)=\text{Window}(x_i,y_j)\cdot \text{Filter}(p,x_i,y_j,f_j)\]

that is the product of a neighborhood \(\text{Window}(x_i,y_j)\) between point positions \(x_i\), \(y_j\) and of a parametric filter that is applied to a collection of feature vectors \(f_j\). The reduction or “pooling” operator is usually a (weighted) sum or a maximum.

Most architectures in computer vision rely on K-Nearest Neighbors graphs (”\(x_i \leftrightarrow y_j\)”) to define sparse neighborhood windows. These are equal to 1 if \(y_j\) is one of the closest neighbors of \(x_i\) and 0 otherwise. The point convolution then reads:

\[\begin{split}a_i \gets \sum_{\substack{j \text{ such that }\\ x_i \leftrightarrow y_j}} \text{Filter}(p,x_i,y_j,f_j) ~.\end{split}\]

Crucially, KeOps now also lets users work with global point convolutions without compromising on performances: we refer to the Section 5.3 of our NeurIPS 2020 paper and to this presentation of quasi-geodesic convolutions on protein surfaces for a detailed discussion.
In natural language processing, we implement attention layers for transformer networks using an exponentiated dot product \(F(q_i,k_j)=\exp(\langle q_i,k_j\rangle/ \sqrt{\mathrm{D}})\) between query (\(q_i\)) and key (\(k_j\)) vectors of dimension \(\mathrm{D}\). The reduction is a normalized matrix-vector product with an array of value vectors \(v_j\) (a soft maximum) and the overall computation reads:

\[a_i \gets \frac{ \sum_{j=1}^{\mathrm{N}} \exp\big[ \langle q_i,k_j\rangle / \sqrt{\mathrm{D}} \big]~\cdot~ v_j }{ \sum_{j=1}^{\mathrm{N}} \exp\big[ \langle q_i,k_j\rangle / \sqrt{\mathrm{D}}\big] }~.\]

It can be implemented efficiently using the KeOps “Sum-SoftMax-Weight” reduction.
We implement the Fourier transform with a sum reduction and a complex exponential:

\[\widehat{f_i} = \widehat{f}(\omega_i) ~\gets~ \sum_{j=1}^{\mathrm{N}} \exp(i\langle \omega_i,x_j\rangle)~\cdot~ f_j ~.\]

This expression evaluates the spectral content at frequency \(\omega_i\) of a signal \(f\) that is represented by sampled values \(f_j=f(x_j)\) at locations \(x_j\). KeOps thus allows users to implement efficient Fourier-Stieltjes transforms on non-uniform data using both real- and complex-valued trigonometric functions.
In optimization theory, we implement the Legendre-Fenchel transform or convex conjugate of an arbitrary function \(f(x)\) that is sampled on a point cloud \(x_1, \dots, x_\mathrm{N}\) with a vector of values \(f_j = f(x_j)\) using a dot product and a maximum reduction:

\[\forall u_i \in \mathbb{R}^\mathrm{D},~~ f^*_i = f^*(u_i) ~\gets~ \max_{j=1,\, \dots\,,\,\mathrm{N}} \langle u_i, x_j\rangle - f_j.\]
In imaging sciences, we implement the distance transform of a binary mask \(m_j = m(y_j) \in \{0, 1\}\) that is defined on the rectangle domain \([\![1, \text{W} ]\!] \times [\![1, \text{H} ]\!]\) using a minimum reduction and a squared distance function:

\[\forall x_i \in [\![1, \text{W} ]\!] \times [\![1, \text{H} ]\!],~~ d_i = d(x_i) ~\gets~ \min_{y_j \in [\![1, \text{W} ]\!] \times [\![1, \text{H} ]\!]} \|x_i-y_j\|^2 - \log(m_j) .\]

We note that just like the Legendre-Fenchel transform, the distance transform is separable and can be implemented efficiently on 2D and 3D grids. Just as with separable Gaussian convolution, the trick is to apply the transform successively on the lines and columns of the image. Thanks to its native support for batch processing, KeOps is ideally suited to these manipulations: it can be used to implement all types of fast separable transforms on the GPU.
In optimal transport theory, we implement the C-transform using a “min” reduction and a formula \(F(x_i,y_j,g_j)=\text{C}(x_i,y_j) -g_j\) that penalizes the value of the ground cost function \(\text{C}\) by that of the dual potential \(g\) :

\[a_i \gets \min_{j=1,\, \dots\,,\,\mathrm{N}} \big[ \text{C}(x_i,y_j) - g_j \big], \qquad i=1,\dots,\mathrm{M}~.\]

Going further, numerically stable Sinkhorn iterations correspond to the case where the minimum in the C-transform is replaced by a (rescaled) log-sum-exp reduction, known as a soft minimum at temperature \(\varepsilon > 0\):

\[a_i \gets - \varepsilon \cdot \log \sum_{j=1}^{\mathrm{N}} \exp \tfrac{1}{\varepsilon} \big[ g_j - \text{C}(x_i,y_j) \big], \qquad i=1,\dots,\mathrm{M}~.\]

As detailed in our NeurIPS 2020 paper, KeOps speeds up modern optimal transport solvers by one to three orders of magnitude, from standard auction iterations to multiscale Sinkhorn loops. A collection of reference solvers is provided by the GeomLoss library, that now scales up to millions of samples in seconds.
Numerous particle and swarming models rely on interaction steps that fit this template to update the positions and inner states of their agents. For instance, on modest gaming hardware, KeOps can scale up simulations of Vicsek-like systems to millions of active swimmers and flyers: this allows researchers to make original conjectures on their models with a minimal amount of programming effort.

Crucially, we can understand all these computations as reductions of “symbolic” matrices whose coefficients are given, for all indices \(i\) and \(j\), by a mathematical formula \(F(p, x_i, y_j)\). As detailed on the front page of this website, the KeOps library is built around this remark. We introduce a new type of “symbolic” tensor that lets users implement all these operations efficiently, with a small memory footprint. Under the hood, operations on KeOps LazyTensors avoid storing in memory the matrix of values \(F(p,x_i,y_j)\) and rely instead on fast C++/CUDA routines that are compiled on demand. We refer to our guided tour of the KeOps++ engine for more details.

High performances

KeOps fits within a thriving ecosystem of Python/C++ libraries for scientific computing. So how does it compare with other acceleration frameworks such as Numba, Halide, TVM, Julia or JAX/XLA? To answer this question, let us now briefly explain the relationship between our library and the wider software environment for tensor computing.

Tensor computing on the GPU

Fast numerical methods are the fuel of machine learning research. Over the last decade, the sustained development of the CUDA ecosystem has driven the progress in the field: though Python is the lingua franca of data science and machine learning, most frameworks rely on efficient C++ backends to leverage the computing power of GPUs. Recent advances in computer vision or natural language processing attest to the fitness of modern libraries: they stem from the mix of power and flexibility that is provided by PyTorch, TensorFlow and general purpose accelerators such as JAX/XLA.

Nevertheless, important work remains to be done. Geometric computations present a clear gap in performances between Python and C++: notable examples are implementations of point cloud convolutions or of the nearest neighbor search that is discussed above. To scale up geometric computations to real-world data, a common practice is therefore to replace the compute-intensive parts of a Python code by handcrafted CUDA kernels. These are expensive to develop and maintain, which leads to an unfortunate need to compromise between ease of development and scalability.

KeOps: a specialized tool

Requirements for geometric data analysis and learning. None of the aforementioned methods are fully suited for exploratory research in geometric data analysis and machine learning. Let us briefly explain why:

First of all, some acceleration schemes do not stream well on GPUs or have to rely on expensive pre-computations: hierarchical matrices or advanced nearest neighbor finders can hardly be used in the training loop of a neural network.
Other strategies make strong assumptions on the properties of the convolution filter \(k\) or on the dimension and geometry of the ambient feature space. These restrictions make existing tools cumbersome to use in e.g. deep learning, where one wishes to have modelling freedom with respect to the choice of the embedding space geometry and dimension.
Finally, most acceleration frameworks for Python expect users to be knowledgeable on GPU parallelism or do not support automatic differentiation.

The bottomline is that most existing tools are not ready to be used by a majority of researchers in the community.

A gap in the literature. In order to tackle these issues, the developers of deep learning libraries have recently put an emphasis on just-in-time compilation for neural networks. For instance, the recent PyTorch JIT and XLA engines enable operator fusion and unlock performance speed-ups for research code. These general purpose compilers are fully transparent to users and show promise for a wide range of applications. Nevertheless, they fall short on the type of geometric computations that are discussed above. This is most apparent for nearest neighbor search, matrix-vector products with kernel matrices and message passing methods on point clouds, where one still has to develop and maintain custom CUDA kernels to achieve state-of-the-art performance.

A unique position. KeOps intends to fix this specific but important problem with all the convenient features of a modern library. We present examples of applications in our gallery of tutorials and discuss its inner workings in our guided tour of the KeOps++ engine. As evidenced by our benchmarks, the KeOps routines outperform their standard counterparts by two orders of magnitude in many settings. On top of a reduced memory usage, they can thus bring a considerable speed-up to numerous methods in machine learning, computational physics and other applied fields.

Is KeOps going to speed-up your program?

Strengths. At its heart, KeOps leverages the low Kolmogorov complexity of symbolic arrays: it can be used when the computational bottleneck of a method is an interaction step that fits a simple Map-Reduce template. In practice, it is thus likely to offer gains on runtime and memory usage when the formula \(F(x_i,y_j)\) is compact and the numbers of samples \(\text{M}\) and \(\text{N}\) range from \(10^3\) to \(10^7\).

Limitations. On the other hand, the main limitations of KeOps stem from the overflow of CUDA registers in the computation of the formula \(F(x_i,y_j)\). These result in decreased performances on large feature vectors of dimension D > 100. The problem is known as register spilling, with some documented but non-trivial work-arounds.

Another drawback is that we do not pre-ship binaries but instead rely on C++/CUDA compilers to run our kernels. Fortunately, this weakness is now mitigated by the ubiquitous deployment of fast compilers built in e.g. the CUDA drivers. With the release of KeOps 2.0 in March 2022, installation and compilation issues have (mostly) become a thing of the past.

Main features

Feel free to browse through our gallery of tutorials for examples of applications. Among other features, KeOps supports:

Non-radial kernels, neural networks and other arbitrary formulas.
Most common reduction operations: Summations, stabilized LogSumExp reductions, Min, Max, ArgKMin, SoftMin, Softmax…
Batch processing and block-wise sparsity masks.
High-order derivatives with respect to all parameters and variables.
The resolution of positive definite linear systems using a conjugate gradient solver.