The following is a brief overview of Gaussian process regression. The GPR_Theory notebook contains a step-by-step implementation of GP regression following Algorithm 2.1 on p. 19 of Gaussian Processes for Machine Learning [1]. The GPR notebook contains roughly the same information but uses the custom gpr module, which provides all the functions and methods required for performing Gaussian process regression and takes inspiration from the Scikit-learn implementation of GPR [3, 4].
A Gaussian process defines a distribution over functions that fit a set of observed data. We can loosely think of a function in continuous space as a vector of infinite length containing the function values $f(x)$ at every possible input $x$:

$$\mathbf{f} = \big[f(x_1), f(x_2), f(x_3), \ldots\big]$$
In practice, we rarely need infinitely many outputs; we only ever care about the function's values at a finite number of points. Luckily, the marginalisation property of the Gaussian distribution means that inference over any finite set of points gives the same answer as if the infinitely many remaining points had been taken into account [1].
First, a quick recap of the Multivariate Normal (or MVN):
The MVN is given by

$$p(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

where $\boldsymbol{\mu}$ is the $D$-dimensional mean vector and $\Sigma$ is the $D \times D$ covariance matrix.
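As a quick illustrative sketch (not part of the notebooks), the density above can be evaluated and sampled with SciPy; the dimensionality and parameter values here are arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal

# A 2-variate Normal with zero mean and identity covariance (values arbitrary).
mu = np.zeros(2)
Sigma = np.eye(2)
mvn = multivariate_normal(mean=mu, cov=Sigma)

print(mvn.pdf([0.5, -0.5]))   # density at a single point
samples = mvn.rvs(size=5)     # 5 draws, shape (5, 2)
```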
The figure below shows 5 samples drawn from a 20-variate Normal distribution with zero mean and identity covariance, evaluated at 20 evenly spaced input points.
Connecting the entries of each sampled vector illustrates how the 20-dimensional vector can resemble a function. However, the identity covariance makes the resulting 'function' appear noisy and jagged: each entry is drawn independently of its neighbours.
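A minimal NumPy sketch that reproduces this kind of figure (the variable names, seed, and input range are illustrative, not those used in the notebooks):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 20
x = np.linspace(0, 1, n)   # evenly spaced 'inputs' (range arbitrary)

# 5 samples from N(0, I); each row is one 20-dimensional vector.
samples = rng.multivariate_normal(np.zeros(n), np.eye(n), size=5)

# Connecting the entries of each sample hints at a (noisy) function.
plt.plot(x, samples.T, marker="o")
plt.show()
```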
To smooth the outputs of the MVN, we can use a function to control the covariances, i.e. a covariance function (or kernel). One would expect similar inputs (inputs that are close together) to produce similar outputs, i.e. a small change in $x$ should produce a small change in $f(x)$. A popular choice that encodes this assumption is the squared exponential kernel:

$$k(x_i, x_j) = \sigma_f^2 \exp\!\left(-\frac{(x_i - x_j)^2}{2\ell^2}\right)$$

where $\sigma_f^2$ is the signal variance and the length scale $\ell$ controls how quickly the correlation decays as the distance between inputs grows.
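A sketch of the squared exponential kernel as a plain NumPy function (the name and default hyperparameter values are illustrative, not the gpr module's API):

```python
import numpy as np

def squared_exponential(xa, xb, length_scale=1.0, signal_var=1.0):
    """Squared exponential (RBF) covariance between two sets of 1-D inputs."""
    sqdist = (xa[:, None] - xb[None, :]) ** 2
    return signal_var * np.exp(-0.5 * sqdist / length_scale**2)

# Nearby inputs receive high covariance, distant inputs low covariance.
x = np.linspace(0, 1, 20)
K = squared_exponential(x, x)
```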
As mentioned before, a Gaussian process models a distribution over functions that fit a set of observed data; the mean of this distribution is the function used for regression purposes. The GP is completely specified by its mean function $m(x)$ and covariance function $k(x, x')$ and is written as

$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big)$$

where

$$m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big].$$

Without any observations, the mean is assumed to be zero, $m(x) = 0$.
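Drawing samples from this zero-mean prior with the squared_exponential helper defined above produces smooth functions, in contrast to the identity-covariance samples (a sketch; the seed and input grid are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)

# Reuses squared_exponential from the kernel sketch above;
# a small jitter keeps the covariance numerically positive definite.
K = squared_exponential(x, x) + 1e-10 * np.eye(len(x))

# 5 draws from the zero-mean GP prior f ~ GP(0, k).
prior_samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)
```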
The joint distribution of the training outputs $\mathbf{f}$ and the test outputs $\mathbf{f}_*$ under the (zero-mean) prior is

$$\begin{bmatrix} \mathbf{f} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)$$

where $K(X, X_*)$ denotes the matrix of kernel evaluations at all pairs of training inputs $X$ and test inputs $X_*$ (and analogously for the other blocks). For regression purposes, we require the conditional distribution of the test outputs given the observations:

$$\mathbf{f}_* \mid X_*, X, \mathbf{f} \sim \mathcal{N}\big(K(X_*, X) K(X, X)^{-1} \mathbf{f},\; K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*)\big).$$
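A noise-free conditioning sketch that computes this posterior directly, reusing the squared_exponential helper from above (the data and the explicit matrix inverse are for illustration only):

```python
import numpy as np

# Training data and test inputs (values arbitrary).
X_train = np.array([0.1, 0.4, 0.8])
f_train = np.sin(2 * np.pi * X_train)
X_test = np.linspace(0, 1, 50)

K = squared_exponential(X_train, X_train)    # K(X, X)
K_s = squared_exponential(X_train, X_test)   # K(X, X*)
K_ss = squared_exponential(X_test, X_test)   # K(X*, X*)

# Conditional (posterior) mean and covariance of f* | X*, X, f.
K_inv = np.linalg.inv(K + 1e-10 * np.eye(len(X_train)))  # jitter for stability
mean_post = K_s.T @ K_inv @ f_train
cov_post = K_ss - K_s.T @ K_inv @ K_s
```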
In practice, measurements often include some noise, i.e. $y = f(x) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$, so that the covariance of the noisy observations becomes $\operatorname{cov}(\mathbf{y}) = K(X, X) + \sigma_n^2 I$. Replacing the respective block in the covariance matrix of the joint distribution with this new formulation leads to the predictive equations for Gaussian process regression:

$$\bar{\mathbf{f}}_* = K(X_*, X)\big[K(X, X) + \sigma_n^2 I\big]^{-1} \mathbf{y}$$

$$\operatorname{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\big[K(X, X) + \sigma_n^2 I\big]^{-1} K(X, X_*).$$
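Since the GPR_Theory notebook follows Algorithm 2.1 of Rasmussen and Williams [1], here is a compact sketch of that algorithm, which uses a Cholesky factorisation instead of the explicit inverse for numerical stability; the function name and defaults are illustrative:

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve, solve_triangular

def gp_predict(X_train, y_train, X_test, kernel, noise_var=0.1):
    """GP predictive mean/variance per Algorithm 2.1 of Rasmussen & Williams."""
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    L = cholesky(K, lower=True)               # K + sigma_n^2 I = L L^T
    alpha = cho_solve((L, True), y_train)     # alpha = [K + sigma_n^2 I]^{-1} y
    K_s = kernel(X_train, X_test)
    mean = K_s.T @ alpha                      # predictive mean
    v = solve_triangular(L, K_s, lower=True)  # v = L^{-1} K(X, X*)
    var = np.diag(kernel(X_test, X_test)) - np.sum(v**2, axis=0)
    # Log marginal likelihood, useful for tuning kernel hyperparameters.
    lml = (-0.5 * y_train @ alpha
           - np.sum(np.log(np.diag(L)))
           - 0.5 * len(X_train) * np.log(2 * np.pi))
    return mean, var, lml
```

Passing the squared_exponential helper as the kernel argument reproduces the kind of fit shown in the figure below.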
The figure below shows an example of GP regression using the squared exponential kernel; the shaded blue area denotes $\pm 2$ standard deviations around the predictive mean (roughly a 95% confidence interval).
[1] C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
[2] J. Wang, An Intuitive Tutorial to Gaussian Processes Regression, arXiv:2009.10862, 2020.
[3] F. Pedregosa et al., Scikit-learn: Machine Learning in Python, JMLR 12, pp. 2825-2830, 2011.
[4] L. Buitinck et al., API design for machine learning software: experiences from the scikit-learn project, 2013.