Solving positive definite linear systems

This benchmark compares the performance of KeOps with that of NumPy and PyTorch on the solution of a regularized kernel linear system. It relies on the pykeops.torch.KernelSolve and pykeops.numpy.KernelSolve operators.

In a nutshell, given \(x \in\mathbb R^{N\times D}\) and \(b \in \mathbb R^{N\times D_v}\), we compute \(a \in \mathbb R^{N\times D_v}\) so that

\[b = (\alpha\operatorname{Id} + K_{x,x}) a \quad \Leftrightarrow \quad a = (\alpha\operatorname{Id}+ K_{x,x})^{-1} b\]

where \(K_{x,x} = \Big[\exp(-\|x_i - x_j\|^2 / \sigma^2)\Big]_{i,j=1}^N\). The solve is performed with a conjugate gradient scheme, and the benchmark runs over problem sizes \(N \in [10, \cdots, 10^6]\).
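
To make the task concrete, here is a minimal dense sketch of the target computation (illustrative only: the sizes and the value of \(\sigma\) are made up, and we use a direct solve instead of the conjugate gradient scheme):

import numpy as np

N, D, Dv, alpha, sigma = 100, 3, 1, 0.8, 0.1
x = np.random.randn(N, D)
b = np.random.randn(N, Dv)

# Gaussian kernel matrix K_{x,x}, then solve b = (alpha * Id + K) a:
D2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # (N, N) squared distances
K = np.exp(-D2 / sigma ** 2)
a = np.linalg.solve(alpha * np.eye(N) + K, b)  # (N, Dv)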

Note

In this demo, we implement the linear operator \(K_{x,x}\) with a brute-force computation and do not leverage any multiscale or low-rank (Nyström/multipole) decomposition of the kernel matrix. First support for these approximation schemes is scheduled for May-June 2021. Going further, advanced strategies and solvers are now available through the GPyTorch and Falkon libraries, which rely on a KeOps backend whenever relevant.

Setup

Standard imports:

import numpy as np
import torch
from matplotlib import pyplot as plt

from scipy.sparse import diags
from scipy.sparse.linalg import aslinearoperator, cg
from scipy.sparse.linalg.interface import IdentityOperator

from pykeops.numpy import KernelSolve as KernelSolve_np, LazyTensor
from pykeops.torch import KernelSolve
from pykeops.torch.utils import squared_distances as sqdist_torch
from pykeops.numpy.utils import squared_distances as sqdist_np

from benchmark_utils import flatten, random_normal, unit_tensor, full_benchmark

use_cuda = torch.cuda.is_available()

Benchmark specifications:

D = 3  # Let's do this in 3D
Dv = 1  # Dimension of the vectors (= number of linear problems to solve)

# Numbers of samples that we'll loop upon:
problem_sizes = flatten(
    [[1 * 10 ** k, 2 * 10 ** k, 5 * 10 ** k] for k in [1, 2, 3, 4, 5]] + [[10 ** 6]]
)
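
Assuming that benchmark_utils.flatten simply concatenates these sub-lists, this evaluates to a log-spaced grid:

[10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000]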

Create some random input data:

def generate_samples(N, device="cuda", lang="torch", **kwargs):
    """Generates a point cloud x, a scalar signal b of size N and two regularization parameters.

    Args:
        N (int): number of point.
        device (str, optional): "cuda", "cpu", etc. Defaults to "cuda".
        lang (str, optional): "torch", "numpy", etc. Defaults to "torch".

    Returns:
        3-uple of arrays: x, y, b
    """
    randn = random_normal(device=device, lang=lang)
    ones = unit_tensor(device=device, lang=lang)

    x = randn((N, D))
    b = randn((N, Dv))
    gamma = ones((1,)) / (2 * 0.01 ** 2)  # kernel bandwidth
    alpha = ones((1,)) * 0.8  # regularization
    return x, b, gamma, alpha
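
A quick usage sketch (illustrative: it assumes that the benchmark_utils helpers behave like np.random.randn and np.ones when lang="numpy", as their names suggest):

x, b, gamma, alpha = generate_samples(1000, device="cpu", lang="numpy")
print(x.shape, b.shape, float(gamma), float(alpha))
# -> (1000, 3) (1000, 1) 5000.0 0.8, i.e. gamma = 1 / (2 * 0.01 ** 2)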

KeOps kernel

Define a Gaussian RBF kernel:

formula = "Exp(- g * SqDist(x,y)) * a"
aliases = [
    "x = Vi(" + str(D) + ")",   # First arg:  i-variable of size D
    "y = Vj(" + str(D) + ")",   # Second arg: j-variable of size D
    "a = Vj(" + str(Dv) + ")",  # Third arg:  j-variable of size Dv
    "g = Pm(1)",                # Fourth arg: scalar parameter
]

Note

This operator uses a conjugate gradient solver and assumes that formula defines a symmetric, positive definite linear reduction with respect to the alias "a", specified through the third argument.
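
For reference, here is a dense NumPy transcription of the reduction that this formula encodes, with the sum taken over the j-variables (axis=1); this is a sketch of the semantics, not of the KeOps implementation:

# Dense equivalent of "Exp(- g * SqDist(x,y)) * a", reduced over j:
def gaussian_matvec_dense(x, y, a, g):
    D2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # (N, M) squared distances
    return np.exp(-g * D2) @ a  # (N, Dv) result, one row per i-variable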

Define the kernel solver, with a ridge regularization alpha:

def Kinv_keops(x, b, gamma, alpha, **kwargs):
    Kinv = KernelSolve(formula, aliases, "a", axis=1)
    res = Kinv(x, x, b, gamma, alpha=alpha)
    return res


def Kinv_keops_numpy(x, b, gamma, alpha, **kwargs):
    Kinv = KernelSolve_np(formula, aliases, "a", axis=1, dtype="float32")
    res = Kinv(x, x, b, gamma, alpha=alpha)
    return res
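
As a sanity check (illustrative, small N, re-using the CPU/NumPy sampling assumed above), the KeOps solution should match a dense solve up to float32 precision and the conjugate gradient tolerance:

x, b, gamma, alpha = generate_samples(500, device="cpu", lang="numpy")
a_keops = Kinv_keops_numpy(x, b, gamma, alpha)

# Dense reference solution:
K_xx = alpha * np.eye(500) + np.exp(-gamma * sqdist_np(x, x))
a_dense = np.linalg.solve(K_xx, b)
print(np.allclose(a_keops, a_dense, atol=1e-3))  # expected: True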


def Kinv_scipy(x, b, gamma, alpha, **kwargs):

    x_i = LazyTensor(np.sqrt(gamma) * x[:, None, :])
    y_j = LazyTensor(np.sqrt(gamma) * x[None, :, :])

    K_ij = (-((x_i - y_j) ** 2).sum(2)).exp()

    A = aslinearoperator(diags(alpha * np.ones(x.shape[0]))) + aslinearoperator(K_ij)
    A.dtype = np.dtype("float32")
    res, info = cg(A, b)  # scipy.sparse.linalg.cg returns (solution, exit code)
    return res
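
Note that this route is matrix-free: cg only ever requests matrix-vector products, so the \(N\times N\) kernel matrix is evaluated on the fly by KeOps and never stored. A hedged sketch of the interface it relies on:

# Illustrative: a LazyTensor wrapped as a SciPy LinearOperator exposes matvec.
x, b, gamma, alpha = generate_samples(500, device="cpu", lang="numpy")
x_i = LazyTensor(np.sqrt(gamma) * x[:, None, :])
y_j = LazyTensor(np.sqrt(gamma) * x[None, :, :])
K = aslinearoperator((-((x_i - y_j) ** 2).sum(2)).exp())
print(K.matvec(b.ravel()).shape)  # (500,), without ever building K in memory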

Define the same kernel solvers, using dense tensorized implementations:

def Kinv_pytorch(x, b, gamma, alpha, **kwargs):
    K_xx = alpha * torch.eye(x.shape[0], device=x.device) + torch.exp(
        -gamma * sqdist_torch(x, x)
    )
    res = torch.linalg.solve(K_xx, b)  # dense LU solve on the full kernel matrix
    return res


def Kinv_numpy(x, b, gamma, alpha, **kwargs):
    K_xx = alpha * np.eye(x.shape[0]) + np.exp(-gamma * sqdist_np(x, x))
    res = np.linalg.solve(K_xx, b)
    return res
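
Both dense baselines must materialize the full \(N\times N\) kernel matrix: in float32, \(N = 50{,}000\) already requires \(50{,}000^2 \times 4\,\text{B} \approx 9.3\,\text{GiB}\) of memory, which is consistent with the out-of-memory failure reported below.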

Run the benchmark

routines = [
    (Kinv_numpy, "NumPy", {"lang": "numpy"}),
    (Kinv_pytorch, "PyTorch", {}),
    (Kinv_keops_numpy, "NumPy + KeOps", {"lang": "numpy"}),
    (Kinv_keops, "PyTorch + KeOps", {}),
    (Kinv_scipy, "Scipy + KeOps", {"lang": "numpy"}),
]
full_benchmark(
    "Inverse radial kernel matrix",
    routines,
    generate_samples,
    problem_sizes=problem_sizes,
)

plt.show()
[Figure: "Inverse radial kernel matrix" benchmark timings]

Out:

Benchmarking : Inverse radial kernel matrix ===============================
NumPy -------------
  1x100 loops of size   10 :   1x100x 150.5 µs
  1x100 loops of size   20 :   1x100x 297.7 µs
  1x100 loops of size   50 :   1x100x 945.6 µs
  1x100 loops of size  100 :   1x100x 281.2 ms
  1x 10 loops of size  200 :   1x 10x    1.4 s
  1x  1 loops of size  500 :   1x  1x    7.1 s
  1x  1 loops of size   1 k:   1x  1x   11.1 s
** Too slow!
PyTorch -------------
  1x100 loops of size   10 :   1x100x   4.7 ms
  1x100 loops of size   20 :   1x100x   4.8 ms
  1x100 loops of size   50 :   1x100x   8.0 ms
  1x100 loops of size  100 :   1x100x   9.3 ms
  1x100 loops of size  200 :   1x100x   8.9 ms
  1x100 loops of size  500 :   1x100x  75.5 ms
  1x 10 loops of size   1 k:   1x 10x  95.8 ms
  1x 10 loops of size   2 k:   1x 10x  26.5 ms
  1x 10 loops of size   5 k:   1x 10x 138.9 ms
  1x 10 loops of size  10 k:   1x 10x 399.3 ms
  1x  1 loops of size  20 k:   1x  1x    1.5 s
CUDA out of memory. Tried to allocate 9.31 GiB (GPU 0; 10.76 GiB total capacity; 9.31 GiB already allocated; 428.56 MiB free; 9.32 GiB reserved in total by PyTorch)
** Runtime error!
NumPy + KeOps -------------
  1x100 loops of size   10 :   1x100x 712.3 µs
  1x100 loops of size   20 :   1x100x 535.7 µs
  1x100 loops of size   50 :   1x100x 686.9 µs
  1x100 loops of size  100 :   1x100x   1.0 ms
  1x100 loops of size  200 :   1x100x 810.5 µs
  1x100 loops of size  500 :   1x100x   2.5 ms
  1x100 loops of size   1 k:   1x100x   2.8 ms
  1x100 loops of size   2 k:   1x100x   3.6 ms
  1x100 loops of size   5 k:   1x100x   6.7 ms
  1x100 loops of size  10 k:   1x100x  11.2 ms
  1x100 loops of size  20 k:   1x100x  21.9 ms
  1x 10 loops of size  50 k:   1x 10x  91.6 ms
  1x 10 loops of size 100 k:   1x 10x 346.7 ms
  1x  1 loops of size 200 k:   1x  1x    1.4 s
  1x  1 loops of size 500 k:   1x  1x   10.4 s
** Too slow!
PyTorch + KeOps -------------
  1x100 loops of size   10 :   1x100x   1.1 ms
  1x100 loops of size   20 :   1x100x   1.1 ms
  1x100 loops of size   50 :   1x100x   1.1 ms
  1x100 loops of size  100 :   1x100x   1.1 ms
  1x100 loops of size  200 :   1x100x   1.1 ms
  1x100 loops of size  500 :   1x100x   1.7 ms
  1x100 loops of size   1 k:   1x100x   2.2 ms
  1x100 loops of size   2 k:   1x100x   2.5 ms
  1x100 loops of size   5 k:   1x100x   4.9 ms
  1x100 loops of size  10 k:   1x100x   8.7 ms
  1x100 loops of size  20 k:   1x100x  16.4 ms
  1x100 loops of size  50 k:   1x100x  50.5 ms
  1x 10 loops of size 100 k:   1x 10x 207.0 ms
  1x  1 loops of size 200 k:   1x  1x 770.9 ms
  1x  1 loops of size 500 k:   1x  1x    5.3 s
  1x  1 loops of size   1 M:   1x  1x   24.7 s
** Too slow!
Scipy + KeOps -------------
  1x100 loops of size   10 :   1x100x   2.6 ms
  1x100 loops of size   20 :   1x100x   5.5 ms
  1x100 loops of size   50 :   1x100x   4.8 ms
  1x100 loops of size  100 :   1x100x   2.9 ms
  1x100 loops of size  200 :   1x100x   9.0 ms
  1x100 loops of size  500 :   1x100x   9.8 ms
  1x100 loops of size   1 k:   1x100x  11.6 ms
  1x100 loops of size   2 k:   1x100x  13.5 ms
  1x100 loops of size   5 k:   1x100x  17.1 ms
  1x100 loops of size  10 k:   1x100x  23.2 ms
  1x 10 loops of size  20 k:   1x 10x 113.5 ms
  1x 10 loops of size  50 k:   1x 10x 206.9 ms
  1x  1 loops of size 100 k:   1x  1x 579.6 ms
  1x  1 loops of size 200 k:   1x  1x    1.6 s
  1x  1 loops of size 500 k:   1x  1x    9.5 s
  1x  1 loops of size   1 M:   1x  1x   50.3 s
** Too slow!

Total running time of the script: ( 5 minutes 52.355 seconds)
