Sparse GP regression

[ ]:
# © Crown Copyright GCHQ
#
# Licensed under the GNU General Public License, version 3 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.gnu.org/licenses/gpl-3.0.en.html
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
[ ]:
# This notebook is not compiled into the documentation due to the time taken to run it to get
# a representative analysis. Please run this notebook locally if you wish to see the outputs.

In this notebook we demonstrate use of sparse inducing point kernel approximations [Titsias09] within Vanguard, including combining with warping or input uncertainty.

[ ]:
random_seed = 1_989
[ ]:
import matplotlib.pyplot as plt
import numpy as np
import torch
from gpytorch.kernels import MaternKernel, ScaleKernel
from sklearn.preprocessing import StandardScaler

from vanguard.datasets.bike import BikeDataset
from vanguard.models import InducingPointKernelGPModel
from vanguard.uncertainty import GaussianUncertaintyGPController
from vanguard.vanilla import GaussianGPController
from vanguard.warps import SetWarp, warpfunctions

Data

We will use the BikeDataset, with 13 input features. The main point of this dataset for this example notebook is that it’s large. With a 90/10 train/test split, we have ~15.5k training points. Exact GP inference in this case would be very expensive. This dataset is taken from [FanaeeT2013] and was accessed and copied to Github LFS within this repo on 1st July 2024.

[ ]:
DATASET = BikeDataset(rng=np.random.default_rng(random_seed))
[ ]:
plt.hist(DATASET.train_y.numpy(force=True))
plt.xlabel("$y$", fontsize=15)
plt.show()

The regressand is non-negative, so warping could be useful. Let’s standardise the inputs for numerical stability.

[ ]:
scaler = StandardScaler()
scaled_train_x = scaler.fit_transform(DATASET.train_x.numpy(force=True))
scaled_test_x = scaler.transform(DATASET.test_x.numpy(force=True))

Modelling

Let’s try a vanilla GP using a sparse kernel approx. To do this, we just have to use InducingPointKernelGPModel with a plain GaussianGPController. In order to use a different model, we need a new controller subclass:

[ ]:
class SparseGaussianGPController(GaussianGPController):
    gp_model_class = InducingPointKernelGPModel
[ ]:
N_INDUCING_POINTS = 50
num_iters = int(len(scaled_train_x) / 64) * 10
[ ]:
class ScaledMaternKernel(ScaleKernel):
    """A scaled matern kernel."""

    def __init__(self):
        super().__init__(MaternKernel(nu=1.5, ard_num_dims=scaled_train_x.shape[1]))
[ ]:
gp = SparseGaussianGPController(
    train_x=scaled_train_x,
    train_y=DATASET.train_y,
    kernel_class=ScaledMaternKernel,
    y_std=DATASET.train_y_std,
    gp_kwargs={"n_inducing_points": N_INDUCING_POINTS},
    optim_kwargs={"lr": 0.01},
    rng=np.random.default_rng(random_seed),
)

gp.fit(n_sgd_iters=num_iters)
[ ]:
posterior = gp.posterior_over_point(scaled_test_x)
mean, lower, upper = posterior.confidence_interval()

# Convert to numpy arrays for plotting
plt_test_y = DATASET.test_y.numpy(force=True)
mean = mean.numpy(force=True)
lower = lower.numpy(force=True)
upper = upper.numpy(force=True)

print(f"RMSE: {np.sqrt(np.mean((plt_test_y - mean) ** 2))}")
plt.errorbar(plt_test_y, mean, yerr=np.vstack([mean - lower, upper - mean]), marker="o", label="mean", linestyle="")
plt.xlabel("true y values")
plt.ylabel("predicted y values")
plt.legend()
plt.show()

Now let’s look at sparse kernels combined with compositional warping. We’ll use an affine-log warp to reflect the non-negativity of the data: \(\phi(y) = a + b\log(y)\). The code to create this GP model in Vanguard is simple. Use a SetWarp decorator to apply the warp to the same controller used above.

[ ]:
warp = warpfunctions.AffineWarpFunction() @ warpfunctions.BoxCoxWarpFunction(lambda_=0)


@SetWarp(warp_function=warp, ignore_methods=("fit", "__init__"))
class WarpedGaussianGPController(SparseGaussianGPController):
    pass
[ ]:
gp = WarpedGaussianGPController(
    train_x=scaled_train_x,
    train_y=DATASET.train_y,
    kernel_class=ScaledMaternKernel,
    y_std=DATASET.train_y_std,
    gp_kwargs={"n_inducing_points": N_INDUCING_POINTS},
    optim_kwargs={"lr": 0.01},
    rng=np.random.default_rng(random_seed),
)

gp.fit(n_sgd_iters=num_iters)
[ ]:
posterior = gp.posterior_over_point(scaled_test_x)
warp_mean, warp_lower, warp_upper = posterior.confidence_interval()

# Convert to numpy arrays for plotting
warp_mean = warp_mean.numpy(force=True)
warp_lower = warp_lower.numpy(force=True)
warp_upper = warp_upper.numpy(force=True)


print(f"RMSE: {np.sqrt(np.mean((plt_test_y - warp_mean) ** 2))}")
plt.errorbar(
    plt_test_y,
    mean,
    yerr=np.vstack([warp_mean - warp_lower, warp_upper - warp_mean]),
    marker="o",
    label="mean",
    linestyle="",
)
plt.xlabel("true y values")
plt.ylabel("predicted y values")
plt.legend()
plt.show()

Warping here is likely to be most useful for smaller \(y\) values, so let’s filter by the true \(y\) value and compare warping to no warping.

[ ]:
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
small_indices = plt_test_y < 0.5

plt.errorbar(
    plt_test_y[small_indices],
    mean[small_indices],
    yerr=np.vstack([mean - lower, upper - mean])[:, small_indices],
    marker="o",
    label="mean",
    linestyle="",
)
plt.title(f"No warping. RMSE: {np.sqrt(np.mean((plt_test_y[small_indices] - mean[small_indices]) ** 2)):.4}")
plt.xlabel("true y values")
plt.ylabel("predicted y values")

plt.subplot(1, 2, 2)
y_err = np.vstack([warp_mean - warp_lower, warp_upper - warp_mean])[:, small_indices]
plt.errorbar(plt_test_y[small_indices], warp_mean[small_indices], yerr=y_err, marker="o", label="mean", linestyle="")
plt.title(
    f"Affine-log warping. RMSE: {np.sqrt(np.mean((plt_test_y[small_indices] - warp_mean[small_indices]) ** 2)):.4}"
)
plt.xlabel("true y values")

plt.tight_layout()
plt.show()

This demonstrates nicely that the warping is working where it matters, preventing impossible negative predictions.

Finally we can demonstrate combining with input uncertainty as well, using some dummy input noise. Variational inference in batch mode is not yet supported with input uncertainty, so we’ll subset the data and switch-off batch-mode.

[ ]:
warp = warpfunctions.AffineWarpFunction() @ warpfunctions.BoxCoxWarpFunction(lambda_=0)


@SetWarp(warp_function=warp, ignore_all=True)
class WarpedGaussianUncertaintyGPController(GaussianUncertaintyGPController):
    pass
[ ]:
gp = WarpedGaussianUncertaintyGPController(
    train_x=scaled_train_x[:500],
    train_x_std=0.1,
    train_y=DATASET.train_y[:500],
    kernel_class=ScaledMaternKernel,
    y_std=0.001 * torch.mean(torch.abs(DATASET.train_y)),
    likelihood_kwargs={"learn_additional_noise": True},
    batch_size=None,
    rng=np.random.default_rng(random_seed),
)
gp.fit(n_sgd_iters=num_iters)
[ ]:
posterior = gp.posterior_over_point(scaled_test_x)
mean, lower, upper = posterior.confidence_interval()

# Convert to numpy arrays for plotting
mean = mean.numpy(force=True)
lower = lower.numpy(force=True)
upper = upper.numpy(force=True)

print(f"RMSE: {np.sqrt(np.mean((plt_test_y - mean) ** 2))}")
plt.errorbar(plt_test_y, mean, yerr=np.vstack([mean - lower, upper - mean]), marker="o", label="mean", linestyle="")
plt.xlabel("true y values")
plt.ylabel("predicted y values")
plt.legend()
plt.show()

Conclusions

This short example demonstrates that compositional warping can be combined with sparse kernel approximations in Vanguard using very little code. We have demonstrated good results on a real-world dataset with ~15.5k training items. Other datasets may exhibit a stronger preference for warping, but we have shown that, for low values of the regressand (close to zero), the warped GP is much better, as it makes no impossible negative predictions. We have shown that combining warping and sparse kernel approximations is feasible and the training is no more difficult than without the warping.

In addition, we have provided a proof-of-concept demonstration that warping, input uncertainty and sparse kernel approximations can be combined simply within Vanguard and trained successfully.