Blocksparsity
Complexity of the KeOps routines. Notwithstanding their finely grained management of memory, the GPU_1D and GPU_2D schemes have a quadratic time complexity: if \(\mathrm{M}\) and \(\mathrm{N}\) denote the number of \(i\) and \(j\) variables, the time needed to perform a generic reduction scales asymptotically in \(O(\mathrm{M}\mathrm{N})\). This is most evident in our convolution benchmarks, where all the kernel coefficients
are computed to implement a discrete Gaussian convolution:
Can we do better?
To break through this quadratic lower bound, a simple idea is to skip some computations, using a sparsity prior on the kernel matrix. For instance, we could decide to skip kernel computations when points \(x_i\) and \(y_j\) are far away from each other. But can we do so efficiently?
Sparsity on the CPU. On CPUs, a standard strategy is to use sparse matrices, encoding our operators through lists of nonzero coefficients and indices. Schematically, this comes down to endowing each index \(i\in\left[\!\left[ 1,\mathrm{M}\right]\!\right]\) with a set \(J_i\subset\left[\!\left[ 1,\mathrm{N}\right]\!\right]\) of \(j\)neighbors and to restrict ourselves to the computation of
This approach is very well suited to matrices which only have a handful of nonzero coefficients per line, such as the intrinsic Laplacian of a 3D mesh. But on large, densely connected problems, sparse encoding runs into a major issue: since it relies on noncontiguous memory accesses, it scales poorly on GPUs.
Blocksparse reductions
As explained in our introduction, GPU chips are wired to rely on coalesced memory operations which load blocks of dozens of contiguous bytes at once. Instead of allowing the use of arbitrary indexing sets \(J_i\) for all lines of our sparse kernel matrix, we should thus restrict ourselves to computations of the form:
where:
The \(\left[\!\left[\text{start}_q, \text{end}_q\right[\!\right[\) intervals form a partition of the set of \(i\)indices \(\left[\!\left[ 1,\mathrm{M}\right]\!\right]\):
\[\begin{aligned} \left[\!\left[ 1,\mathrm{M}\right]\!\right] ~=~ \bigsqcup_{q=1}^{\mathrm{Q}} \,\left[\!\left[\text{start}_q, \text{end}_q\right[\!\right[. \end{aligned}\]For every segment \(q\in\left[\!\left[ 1,\mathrm{Q}\right]\!\right]\), the \(S_q\) intervals \([\![\text{start}^q_l, \text{end}^q_l[\![\) encode a set of neighbors as a finite collection of contiguous ranges of indices:
\[\begin{aligned} \forall~i\in \left[\!\left[\text{start}_q, \text{end}_q\right[\!\right[, ~ J_i~=~ \bigsqcup_{l=1}^{S_q} \,[\![\text{start}^q_l, \text{end}^q_l[\![. \end{aligned}\]
By encoding our sparsity patterns as blockwise binary masks made up of tiles
we can leverage coalesced memory operations for maximum efficiency on the GPU. As long as our index ranges are wider than the CUDA blocks, we should get close to optimal performances.
Going further. This scheme can be generalized to generic formulas and reductions. For reductions with respect to the \(i\) axis, we simply have to define transposed tiles
and restrict ourselves to computations of the form:
A decent tradeoff
This blockwise approach to sparse reductions may seem a bit too coarse, as some negligible coefficients get computed with little to no impact on the final result… But in practice, the GPU speedups on contiguous memory operations more than make up for it: implemented in the GpuConv1D_ranges.cu CUDA file, our blocksparse MapReduce scheme is the workhorse of the multiscale Sinkhorn algorithm showcased in the GeomLoss library.
As explained in our documentation,
the main user interface for KeOps
blocksparse reduction is an optional “.ranges” attribute for
LazyTensors
which encodes arbitrary blocksparsity masks. In
practice, as illustrated in the Figure below, helper
routines allow users to specify tiled sparsity patterns from clustered
arrays of samples \(x_i\), \(y_j\) and coarse
clustertocluster boolean matrices. Implementing
BarnesHutlike strategies
and other approximation rules
is thus relatively easy, up to a preliminary sorting pass which ensures
that all clusters are stored contiguously in memory.


Figure. Illustrating blocksparse reductions with 2D point clouds. When using an \(\mathrm{M}\)by\(\mathrm{N}\) “kernel” matrix to compute an interaction term between two datasets, a common approximation strategy is to skip terms which correspond to clusters of points that are far away from each other. Through a set of helper routines and optional arguments, KeOps allows users to implement these pruning strategies efficiently, on the GPU. (a) Putting our points in square bins, we compute the centroid of each cluster. Simple thresholds on centroidtocentroid distances allow us to decide that the 43rd “cyan” cluster of target points \((x_i)\) in the spiral should only interact with neighboring cells of source points \((y_j)\) in the Gaussian sample, highlighted in magenta, etc. (b) In practice, this decision is encoded in a coarse boolean matrix that is processed by KeOps, with each line (resp. column) corresponding to a cluster of \(x\) (resp. \(y\)) variables. Here, we higlight the 43rd line of our mask which corresponds to the cyanmagenta points of (a).