Synthetic Data

Synthetic data is particularly useful for testing, as it can be tailored to the needs of each test.

class vanguard.datasets.synthetic.SyntheticDataset(functions=(<function simple_f>, ), output_noise=0.1, train_input_noise_bounds=(0.01, 0.05), test_input_noise_bounds=(0.01, 0.03), n_train_points=30, n_test_points=50, significance=0.025, rng=None)[source]

Bases: Dataset

Synthetic data with homoskedastic noise for testing.

Parameters:
  • functions (Iterable[Callable[[ndarray[tuple[Any, ...], dtype[floating]]], ndarray[tuple[Any, ...], dtype[floating]]]]) – The functions to be used to generate the synthetic data. If multiple functions are given, a multidimensional output is generated.

  • output_noise (float) – The standard deviation of the output noise, defaults to 0.1. Only applied to the training data; no output noise is applied to the testing data, but test_y_std is still set to output_noise.

  • train_input_noise_bounds (tuple[float, float]) – The lower, upper bounds of the linearly varying noise for the training input. Defaults to (0.01, 0.05).

  • test_input_noise_bounds (tuple[float, float]) – The lower, upper bounds of the linearly varying noise for the testing input. Defaults to (0.01, 0.03).

  • n_train_points (int) – The total number of training points.

  • n_test_points (int) – The total number of testing points.

  • significance (float) – The significance to be used.

  • rng (Optional[Generator]) – Generator instance used to generate random numbers.

__init__(functions=(<function simple_f>, ), output_noise=0.1, train_input_noise_bounds=(0.01, 0.05), test_input_noise_bounds=(0.01, 0.03), n_train_points=30, n_test_points=50, significance=0.025, rng=None)[source]

Initialise self.

make_sample_data(n_points, input_noise_bounds, output_noise_level, interval_length=1)[source]

Create some sample data.

Parameters:
  • n_points (int) – The number of points to create.

  • input_noise_bounds (tuple[float, float]) – The lower, upper bounds for the resulting input noise.

  • output_noise_level (Union[int, float]) – The amount of noise applied to the outputs.

  • interval_length (float) – Use to scale the exact image of the function, defaults to 1.

Return type:

tuple[tuple[ndarray[tuple[Any, ...], dtype[floating]], ndarray[tuple[Any, ...], dtype[floating]]], ndarray[tuple[Any, ...], dtype[floating]]]

Returns:

The mean and standard deviation of the input, together with the output, in the form (x_mean, x_std), y.
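As a rough illustration of the returned structure, the sketch below builds data in the same `(x_mean, x_std), y` form using NumPy alone. It is not the library's implementation: the name `sketch_sample_data`, the evenly spaced inputs on `[0, interval_length]`, and the way the noise is applied are all assumptions made for illustration.

```python
import numpy as np


def sketch_sample_data(function, n_points, input_noise_bounds, output_noise_level,
                       interval_length=1.0, rng=None):
    """Hypothetical stand-in for make_sample_data; NOT the library's code."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: inputs are evenly spaced over [0, interval_length].
    x_mean = np.linspace(0.0, interval_length, n_points)
    # The input noise standard deviation varies linearly between the two bounds.
    x_std = np.linspace(input_noise_bounds[0], input_noise_bounds[1], n_points)
    # Assumption: the function is evaluated at noisy inputs, then Gaussian
    # output noise with standard deviation output_noise_level is added.
    x_noisy = x_mean + rng.normal(size=n_points) * x_std
    y = function(x_noisy) + rng.normal(size=n_points) * output_noise_level
    return (x_mean, x_std), y


(x_mean, x_std), y = sketch_sample_data(
    lambda x: np.sin(2 * np.pi * x),
    n_points=30,
    input_noise_bounds=(0.01, 0.05),
    output_noise_level=0.1,
    rng=np.random.default_rng(0),
)
```

Passing a seeded generator, as in the call above, makes the generated data reproducible between test runs.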

class vanguard.datasets.synthetic.HeteroskedasticSyntheticDataset(functions=(<function simple_f>, ), output_noise_mean=0.1, output_noise_std=0.01, train_input_noise_bounds=(0.01, 0.05), test_input_noise_bounds=(0.01, 0.03), n_train_points=30, n_test_points=50, significance=0.025, rng=None)[source]

Bases: SyntheticDataset

Synthetic data with heteroskedastic noise for testing.

The train_y_std and test_y_std attributes are created by drawing from a normal distribution centred on the value of the output_noise_mean parameter, with standard deviation output_noise_std.

Parameters:
  • functions (Iterable[Callable[[ndarray[tuple[Any, ...], dtype[floating]]], ndarray[tuple[Any, ...], dtype[floating]]]]) – The functions to be used to generate the synthetic data.

  • output_noise_mean (float) – The mean for the output standard deviation, defaults to 0.1.

  • output_noise_std (float) – The standard deviation for the output standard deviation, defaults to 0.01.

  • train_input_noise_bounds (tuple[float, float]) – The lower, upper bounds of the linearly varying noise for the training input. Defaults to (0.01, 0.05).

  • test_input_noise_bounds (tuple[float, float]) – The lower, upper bounds of the linearly varying noise for the testing input. Defaults to (0.01, 0.03).

  • n_train_points (int) – The total number of training points.

  • n_test_points (int) – The total number of testing points.

  • significance (float) – The significance to be used.

  • rng (Optional[Generator]) – Generator instance used to generate random numbers.
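The roles of `output_noise_mean` and `output_noise_std` can be pictured with a short NumPy sketch that draws one standard deviation per point. This is only an illustration under stated assumptions: the variable names mirror the attributes described above, but how the library applies the noise to the outputs is assumed here, not taken from its source.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train_points = 30
output_noise_mean, output_noise_std = 0.1, 0.01

# One standard deviation per point, drawn from a normal distribution with
# mean output_noise_mean and standard deviation output_noise_std.
train_y_std = rng.normal(loc=output_noise_mean, scale=output_noise_std,
                         size=n_train_points)

# Assumption: each output is perturbed using its own standard deviation.
x = np.linspace(0.0, 1.0, n_train_points)
train_y = np.sin(2 * np.pi * x) + rng.normal(size=n_train_points) * train_y_std
```

In contrast, the homoskedastic case would use the same standard deviation for every point.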

__init__(functions=(<function simple_f>, ), output_noise_mean=0.1, output_noise_std=0.01, train_input_noise_bounds=(0.01, 0.05), test_input_noise_bounds=(0.01, 0.03), n_train_points=30, n_test_points=50, significance=0.025, rng=None)[source]

Initialise self.

class vanguard.datasets.synthetic.HigherRankSyntheticDataset(functions=(<function simple_f>, ), output_noise=0.1, train_input_noise_bounds=(0.01, 0.05), test_input_noise_bounds=(0.01, 0.03), n_train_points=30, n_test_points=50, significance=0.025, rng=None)[source]

Bases: Dataset

Synthetic data with rank 2 input features. In this case each x is a 2x2 matrix.

Parameters:
  • functions (Iterable[Callable[[ndarray[tuple[Any, ...], dtype[floating]]], ndarray[tuple[Any, ...], dtype[floating]]]]) – The functions to be used to generate the synthetic data.

  • output_noise (float) – The standard deviation of the output noise, defaults to 0.1.

  • train_input_noise_bounds (tuple[float, float]) – The lower, upper bounds of the linearly varying noise for the training input. Defaults to (0.01, 0.05).

  • test_input_noise_bounds (tuple[float, float]) – The lower, upper bounds of the linearly varying noise for the testing input. Defaults to (0.01, 0.03).

  • n_train_points (int) – The total number of training points.

  • n_test_points (int) – The total number of testing points.

  • significance (float) – The significance to be used.

  • rng (Optional[Generator]) – Generator instance used to generate random numbers.

__init__(functions=(<function simple_f>, ), output_noise=0.1, train_input_noise_bounds=(0.01, 0.05), test_input_noise_bounds=(0.01, 0.03), n_train_points=30, n_test_points=50, significance=0.025, rng=None)[source]

Initialise self.

make_sample_data(n_points, input_noise_bounds, output_noise_level, interval_length=1)[source]

Create some sample data.

Parameters:
  • n_points (int) – The number of points to create.

  • input_noise_bounds (tuple[float, float]) – The lower, upper bounds for the resulting input noise.

  • output_noise_level (Union[int, float]) – The amount of noise applied to the outputs.

  • interval_length (float) – Use to scale the exact image of the function, defaults to 1.

Return type:

tuple[tuple[ndarray[tuple[Any, ...], dtype[floating]], ndarray[tuple[Any, ...], dtype[floating]]], ndarray[tuple[Any, ...], dtype[floating]]]

Returns:

The mean and standard deviation of the input, together with the output, in the form (x_mean, x_std), y.

class vanguard.datasets.synthetic.MultidimensionalSyntheticDataset(functions=(<function simple_f>, <function complicated_f>), output_noise=0.1, train_input_noise_bounds=(0.01, 0.05), test_input_noise_bounds=(0.01, 0.03), n_train_points=30, n_test_points=50, significance=0.025, rng=None)[source]

Bases: Dataset

Synthetic data with multiple input dimensions.

Parameters:
  • functions (Iterable[Callable[[ndarray[tuple[Any, ...], dtype[floating]]], ndarray[tuple[Any, ...], dtype[floating]]]]) – The functions used on each input dimension (they are combined linearly to make a single output).

  • output_noise (float) – The standard deviation of the output noise, defaults to 0.1. Only applied to the training data; no output noise is applied to the testing data, but test_y_std is still set to output_noise.

  • train_input_noise_bounds (tuple[float, float]) – The lower, upper bounds of the linearly varying noise for the training input. Defaults to (0.01, 0.05).

  • test_input_noise_bounds (tuple[float, float]) – The lower, upper bounds of the linearly varying noise for the testing input. Defaults to (0.01, 0.03).

  • n_train_points (int) – The total number of training points.

  • n_test_points (int) – The total number of testing points.

  • significance (float) – The significance to be used.

  • rng (Optional[Generator]) – Generator instance used to generate random numbers.
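The idea of applying one function per input dimension and combining the results into a single output can be sketched as follows. The helper `combine_per_dimension` is hypothetical, and the equal-weight sum is an assumption: the description above only says the functions are combined linearly, without giving the weights.

```python
import numpy as np


def combine_per_dimension(x, functions):
    """Apply functions[i] to column i of x and sum the results.

    Hypothetical helper; equal weights are an assumption.
    """
    x = np.asarray(x, dtype=float)
    if x.shape[1] != len(functions):
        raise ValueError("need exactly one function per input dimension")
    return sum(f(x[:, i]) for i, f in enumerate(functions))


def first(x):
    return np.sin(2 * np.pi * x)


def second(x):
    return -x ** 1.5 + x * np.sin(2 * np.pi * x) ** 2


# Two points, each with two input dimensions.
x = np.array([[0.25, 0.25], [0.5, 1.0]])
y = combine_per_dimension(x, (first, second))
```

Here `y` has one entry per point, so two input dimensions still yield a one-dimensional output.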

__init__(functions=(<function simple_f>, <function complicated_f>), output_noise=0.1, train_input_noise_bounds=(0.01, 0.05), test_input_noise_bounds=(0.01, 0.03), n_train_points=30, n_test_points=50, significance=0.025, rng=None)[source]

Initialise self.


The following functions are included as defaults for the classes above, although the classes can be passed almost any function if required.

vanguard.datasets.synthetic.simple_f(x)[source]

Map values through a simple equation.

\[f(x) = \sin(2\pi x)\]
Parameters:

x (ndarray[tuple[Any, ...], dtype[floating]])

Return type:

ndarray[tuple[Any, ...], dtype[floating]]

vanguard.datasets.synthetic.complicated_f(x)[source]

Map values through a complicated equation.

\[f(x) = -x^{\frac{3}{2}} + x\sin^2(2\pi x)\]
Parameters:

x (ndarray[tuple[Any, ...], dtype[floating]])

Return type:

ndarray[tuple[Any, ...], dtype[floating]]

vanguard.datasets.synthetic.very_complicated_f(x)[source]

Map values through a very complicated equation.

\[f(x) = -x^{\frac{3}{2}} + x\sin^2(2\pi x) + x^2 \cos(10\pi x)\]
Parameters:

x (ndarray[tuple[Any, ...], dtype[floating]])

Return type:

ndarray[tuple[Any, ...], dtype[floating]]
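For reference, the three formulas above can be transcribed directly into NumPy. The functions below are re-implementations written from the displayed equations, with hypothetical names; they are not the library's source and may differ from it in detail.

```python
import numpy as np


def simple(x):
    # f(x) = sin(2*pi*x)
    return np.sin(2 * np.pi * x)


def complicated(x):
    # f(x) = -x^(3/2) + x * sin^2(2*pi*x)
    return -np.power(x, 1.5) + x * np.sin(2 * np.pi * x) ** 2


def very_complicated(x):
    # f(x) = -x^(3/2) + x * sin^2(2*pi*x) + x^2 * cos(10*pi*x)
    return complicated(x) + x ** 2 * np.cos(10 * np.pi * x)


x = np.array([0.0, 0.25, 1.0])
simple_values = simple(x)
complicated_values = complicated(x)
very_complicated_values = very_complicated(x)
```

Because of the fractional power, the last two formulas are real-valued only for non-negative inputs.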

Classification

class vanguard.datasets.classification.BinaryStripeClassificationDataset(num_train_points, num_test_points, rng=None)[source]

Bases: Dataset

Dataset of one-dimensional input values and binary output values (0 or 1).

Parameters:
  • num_train_points (int) – The number of training points.

  • num_test_points (int) – The number of testing points.

  • rng (Optional[Generator]) – Generator instance used to generate random numbers.

__init__(num_train_points, num_test_points, rng=None)[source]

Initialise self.

static even_split(x)[source]

Return the reals, divided into two distinct values.

Parameters:

x (ndarray[tuple[Any, ...], dtype[floating]])

Return type:

ndarray[tuple[Any, ...], dtype[floating]]
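One way to divide the real line into two values is to label alternating stripes by parity. The sketch below shows that idea only; the function `stripe_labels`, its `stripe_width` parameter, and the floor-based rule are assumptions for illustration and may not match what `even_split` actually computes.

```python
import numpy as np


def stripe_labels(x, stripe_width=1.0):
    """Hypothetical rule: label each point by the parity of its stripe index."""
    stripe_index = np.floor(np.asarray(x, dtype=float) / stripe_width)
    return (stripe_index % 2).astype(float)


labels = stripe_labels(np.array([-0.5, 0.2, 1.7, 2.1]))
```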

class vanguard.datasets.classification.BinaryGaussianClassificationDataset(num_train_points, num_test_points, *, covariance_scale=1.0, num_features=2, rng=None)[source]

Bases: MulticlassGaussianClassificationDataset

A binary dataset based on sklearn.datasets.make_gaussian_quantiles().

Parameters:
  • num_train_points (int) – The number of training points.

  • num_test_points (int) – The number of testing points.

  • covariance_scale (float) – The covariance matrix will be this value times the unit matrix. Defaults to 1.0.

  • num_features (int) – The number of features to generate for the input data.

  • rng (Optional[Generator]) – Generator instance used to generate random numbers.

__init__(num_train_points, num_test_points, *, covariance_scale=1.0, num_features=2, rng=None)[source]

Initialise self.

class vanguard.datasets.classification.MulticlassGaussianClassificationDataset(num_train_points, num_test_points, num_classes, *, covariance_scale=1.0, num_features=2, rng=None)[source]

Bases: Dataset

A multiclass dataset based on sklearn.datasets.make_gaussian_quantiles().

Parameters:
  • num_train_points (int) – The number of training points.

  • num_test_points (int) – The number of testing points.

  • num_classes (int) – The number of classes.

  • num_features (int) – The number of features to generate for the input data.

  • covariance_scale (float) – The covariance matrix will be this value times the unit matrix. Defaults to 1.0.

  • rng (Optional[Generator]) – Generator instance used to generate random numbers.
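Since the dataset is described as being based on `sklearn.datasets.make_gaussian_quantiles()`, a direct call to that function gives a feel for the data. The mapping of `covariance_scale`, `num_features` and `num_classes` onto the `cov`, `n_features` and `n_classes` arguments is an assumption, as is the simple head/tail split into training and testing sets; the class may do this differently.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

num_train_points, num_test_points = 60, 30

# Assumed mapping: covariance_scale -> cov, num_features -> n_features,
# num_classes -> n_classes.
x, y = make_gaussian_quantiles(
    cov=1.0,
    n_samples=num_train_points + num_test_points,
    n_features=2,
    n_classes=3,
    random_state=0,
)

# Assumption: the samples (already shuffled by default) are split head/tail.
train_x, test_x = x[:num_train_points], x[num_train_points:]
train_y, test_y = y[:num_train_points], y[num_train_points:]
```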

__init__(num_train_points, num_test_points, num_classes, *, covariance_scale=1.0, num_features=2, rng=None)[source]

Initialise self.

property one_hot_train_y: Tensor

Return the training data as a one-hot encoded array.

Note that if there are exactly two classes, this returns train_y.reshape((-1, 1)) instead.
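The behaviour described above, including the two-class special case, can be sketched in NumPy. The property itself returns a Tensor; the `one_hot` helper here is a hypothetical NumPy equivalent written for illustration, not the library's code.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Hypothetical NumPy analogue of the one-hot encoding described above."""
    labels = np.asarray(labels, dtype=int)
    if num_classes == 2:
        # With exactly two classes, return a single column of labels instead.
        return labels.reshape((-1, 1))
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=int)[labels]


three_class = one_hot([0, 2, 1], num_classes=3)
two_class = one_hot([0, 1, 1], num_classes=2)
```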

plot(cmap='Set1', alpha=0.5)[source]

Plot the data.

Parameters:
  • cmap (Union[str, Colormap]) – The colour map to be used.

  • alpha (float) – The transparency of the points.

Return type:

None

plot_prediction(prediction, cmap='Set1', alpha=0.5)[source]

Plot a prediction.

Return type:

None

plot_confusion_matrix(prediction, cmap='OrRd', text_size='xx-large')[source]

Plot a confusion matrix based on a specific prediction.

Return type:

None
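The counts behind a confusion matrix can be computed with a few lines of NumPy. This is a sketch of the general technique, not of `plot_confusion_matrix` itself: the helper `confusion_counts` is hypothetical, and the convention of true labels on rows and predicted labels on columns is an assumption.

```python
import numpy as np


def confusion_counts(true_labels, predicted_labels, num_classes):
    """Count (true, predicted) pairs; rows are true classes (assumed convention)."""
    counts = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(counts, (np.asarray(true_labels), np.asarray(predicted_labels)), 1)
    return counts


matrix = confusion_counts([0, 1, 2, 2], [0, 2, 2, 2], num_classes=3)
```

The diagonal entries count correct predictions, so `matrix.trace()` divided by the number of points gives the accuracy.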