Getting Started

Installation

PyPAL can be installed from PyPi using

pip install pypal

We recommend installing PyPAL in a dedicated virtual environment or conda environment.

The latest version of PyPAL can be installed from GitHub using

pip install git+https://github.com/kjappelbaum/pypal.git

Which class do i use?

If your favorite model is not listed, you can easily implement it yourself (see Implementing a new PAL class)!

Running an active learning experiment

The examples directory contains a Jupyter notebook with an example that can also be run on MyBinder.

If using a Gaussian process model built with sklearn or GPy we recommend using a pre-built class such as pypal.pal.pal_sklearn.PALSklearn, pypal.pal.pal_coregionalized.PALCoregionalized, pypal.pal.pal_gpy.PALGPy and following the subsequent steps (for more details on which class to use see Which class do i use?):

  1. For each objective create a model (if using a coregionalized Gaussian process model, only one model needs to be created)

  2. Sample a few initial points from the design space. We provide the pypal.pal.utils.get_maxmin_samples() or pypal.pal.utils.get_kmeans_samples() utilities that can help with the sampling. Our code assumes that X is a np.array.

    from pypal import get_kmeans_samples, get_maxmin_samples
    
    # This selects the 10 points closest to the centroids of a k=10 means clustering
    indices = get_kmeans_samples(X, 10)
    
    # This selects the 10 farthest points in feature space
    indices = get_maxmin_samples(X, 10)
    
  3. Now we can initialize the instance of one PAL class. If using a sklearn Gaussian process model, we would use

    from pypal import PALSklearn
    
    # Each of these models is an instance of sklearn.gaussian_process.GaussianProcessRegressor
    models = [gpr0, gpr1, gpr2]
    
    # We always need to provide the feature matrix (X), a list of models, and the number of objectives
    palinstance = PALSklearn(X, models, 3)
    
    # Now, we can also feed in the first measurements
    # this here assumes that we have all measurements for y and we now
    # provide those which are present in the indices array
    palinstance.update_train_set(indices, y[indices])
    
    # Now we can run one step
    next_idx = palinstance.run_one_step()
    

    At this level, we have a range of different optional arguements we can set.

    • epsilon: one \(\epsilon\) per dimension in a np.ndarray. This can be used to set different tolerances for each objective. Note that \(\epsilon_i \in [0,1]\).

    • delta: the \(\delta\) hyperparameter (\(\delta \in [0,1]\)). Increasing this value will speed up the convergence.

    • beta_scale: an empirical scaling parameter for \(beta\). The theoretical guarantees in the PAL paper are derived for this parameter set to 1. But in practice, a much faster convergence can be achieved by setting it to a number \(0< \beta_\mathrm{scale} \ll 1\).

    • goal: By default, PyPAL assumes that the goal is to maximize every objective. If this is not the case, this argument can be set using a list of “min” and “max” strings, with “min” specifying whether to minimize the ith objective and “max” indicating whether to maximize this objective.

    • coef_var_threshold: By default, PyPAL will not consider points with a coefficient of variation \(\ge 3\) for the classification step of the algorithm. This is meant to avoid classifying design points for which the model is entirely unsure. This tends to happen when a model is severely overfit on the training data (i.e., the training data uncertainties are very low, whereas the prediction uncertainties are very high). To change this setting, reduce this value to make the check tighter or increase it to avoid this check (as in the original implementation).

In the case of missing observations, i.e., only two of three outputs are measured, report the missing observations as np.nan. The call could look like

import numpy as np

palinstance.update_train_set(np.array([1,2]), np.array([[1, 2, 3], [np.nan, 1, 2, 0]]))

for a case in which we performed measurements for samples with index 1 and 2 of our design space, but did not measure the first target for sample 2.

Hyperparameter optimization

Usually, the hyperparameters of a machine learning model, in particular the kernel hyperparameters of a Gaussian process regression model, should be optimized as new training data is added. However, since this is usually a computationally expensive process, it may not be desirable to perform this at every iteration of the active learning process. The iteration frequency of the hyperparameter optimization is internally set by the _should_optimize_hyperparameters function, which by default uses a schedule that optimizes the hyperparameter every 10th iteration. This behavior can be changed by override this function.

Logging

Basic information such as the current iteration and the classification status are logged and can be viewed by printing the PAL object

print(palinstance)

# returns: pypal at iteration 1. 10 Pareto optimal points, 1304 discarded points, 200 unclassified points.

We also provide calculation of the hypervolume enclosed by the Pareto front with the function pypal.pal.utils.get_hypervolume()

hv = get_hypervolume(palinstance.means[palinstance.pareto_optimal])

Properties of the PAL object

For debugging there are some properties and attributes of the PAL class that can be used to inspect the progress of the active learning loop.

  • get the points in the design space, x:
    • design_space returns the full design space matrix

    • pareto_optimal_points: returns the points that are classified as Pareto-efficient

    • sampled_points: returns the points that have been sampled

    • discarded_points: returns the points that have been discarded

  • get the indicies of Pareto efficent, sampled, discarded, and unclassified points with pareto_optimal_indices, sampled_indices, discarded_indices, and unclassified_indices

  • similarly, the number of points in the different classes can be obtained using number_pareto_optimal_points, number_discarded_points, number_unclassified_points, and number_sampled_points

  • hyperrectangle_size returns the sizes of the hyperrectangles, i.e., the weights that are used in the sampling step

  • means and std contain the predictions of the model

  • sampled is a mask array. In case one objective has not been measured its cell is False

Exploring a space where all objectives are known

In some cases, we may already posess all measurements, but would like to run PAL with different settings to test how the algorithm performs. In this case, we provide the pypal.pal.utils.exhaust_loop() wrapper.

from pypal import PALSklearn, exhaust_loop
models = [gpr0, gpr1, gpr2]
palinstance = PALSklearn(X, models, 3)

exhaust_loop(palinstance, y)

This will continue calling run_one_step() until there is no unclassified sample left.

Batch sampling

By default, the run_one_step function of the PAL classes will return a np.ndarray with only one index for the point in the design space for which the next experiment should be performed. In some situations, it may be more practical to run multiple experiments as batches before running the next active learning iteration. In such cases, we provide the batch_size argument which can be set to an integer greater than one.

next_idx = palinstance.run_one_step(batch_size=10)
# next_idx will be a np.array of length 10

Note that the exhaust_loop also supports the batch_size keyword argument

palinstance = PALSklearn(X, models, 3)

# sample always 10 points and do this until there is no unclassified
# point left
exhaust_loop(palinstance, y, batch_size=10)

Caveats and tricks with Gaussian processes

One caveat to keep in mind is that \(\epsilon\)-PAL will not work if the predictive variance does not make sense. For example, when the model is overconfident and the uncertainties for the training set is significantly lower than those for the predicted set. In this case, PyPAL will untimely, and often incorrectly, label the design points. An example situation where the predictions for an overconfident model due to a training set that excludes a part of design space is shown in the figure below

Example of the predictions of an overconfident GPR model

This problem is exacerbated in conjunction with \(\beta_\mathrm{scale} < 1\). To make the model more robust we suggest trying:

  • to set reasonable bounds on the length scale parameters

  • to increase the regularization parameter/noise kernel (alpha in sklearn)

  • to increase the number of data points, especially the coverage of the design space

  • to use a kernel that suits the problem

  • to turn off ARD. Automatic relevance determination (ARD) might increase the predictive performance, but also makes the model more prone to overfitting

We also recommend to cross-validate the Gaussian process models and to check that the predicted variances make sense. When performing cross-validation, make sure that the index provided to PyPAL is the same size as the cross-validation folds. By default, the code will run a simple cross-validation only on the first iteration and provide a warning if the mean absolute error is above the mean standard deviation. The warning will look something like

The mean absolute error in cross-validation is 64.29, the mean variance is 0.36.
Your model might not be predictive and/or overconfident.
In the docs, you find hints on how to make GPRs more robust.

This behavior can changed with the cross-validation test being performed more frequently by overriding the should_run_crossvalidation function.

Another way to detect overfitting is to use pypal.plotting.make_jointplot() function from the plotting subpackage. This function will plot all objectives against each other (with errorbars and different classes indicated with colors) and histograms of the objectives on the diagonal. If the majority of predicted points tend to overlap one another and get discarded by PyPAL, this may suggest that the surrogate model is overfitted.

from pypal.plotting import make_jointplot

# palinstance is a instance of a PAL class after
# calling run_one_step
fig = make_jointplot(palinstance.means, palinstance)
Example of the output of make_jointplot