PyPAL can be installed from PyPI using
pip install pypal
We recommend installing PyPAL in a dedicated virtual environment or conda environment.
The latest version of PyPAL can be installed from GitHub using
pip install git+https://github.com/kjappelbaum/pypal.git
For Gaussian processes built with sklearn use pypal.pal.pal_sklearn.PALSklearn
For Gaussian processes built with GPy use pypal.pal.pal_gpy.PALGPy
For coregionalized Gaussian processes (built with GPy) use pypal.pal.pal_coregionalized.PALCoregionalized
For quantile regression using LightGBM gradient boosted decision trees use pypal.pal.pal_gbdt.PALGBDT
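For orientation, the classes can be imported from the module paths listed above (the examples later in this document also import PALSklearn directly from the top-level pypal namespace):

from pypal.pal.pal_sklearn import PALSklearn
from pypal.pal.pal_gpy import PALGPy
from pypal.pal.pal_coregionalized import PALCoregionalized
from pypal.pal.pal_gbdt import PALGBDT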
If your favorite model is not listed, you can easily implement it yourself (see Implementing a new PAL class)!
The examples directory contains a Jupyter notebook with an example that can also be run on MyBinder.
If using a Gaussian process model built with sklearn or GPy, we recommend using a pre-built class such as pypal.pal.pal_sklearn.PALSklearn, pypal.pal.pal_coregionalized.PALCoregionalized, or pypal.pal.pal_gpy.PALGPy and following the subsequent steps (for more details on which class to use, see Which class do I use?):
For each objective create a model (if using a coregionalized Gaussian process model, only one model needs to be created)
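For example, with sklearn, one Gaussian process regressor per objective could be set up as follows (a minimal sketch; the Matern kernel and its settings are illustrative choices, not part of PyPAL):

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# One GaussianProcessRegressor per objective (three objectives in this example);
# the Matern kernel and normalize_y=True are illustrative choices
gpr0 = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gpr1 = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gpr2 = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)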
Sample a few initial points from the design space. We provide the pypal.pal.utils.get_maxmin_samples() or pypal.pal.utils.get_kmeans_samples() utilities that can help with the sampling. Our code assumes that X is a np.array.
from pypal import get_kmeans_samples, get_maxmin_samples

# This selects the 10 points closest to the centroids of a k=10 means clustering
indices = get_kmeans_samples(X, 10)

# This selects the 10 farthest points in feature space
indices = get_maxmin_samples(X, 10)
Now we can initialize the instance of one PAL class. If using a sklearn Gaussian process model, we would use
from pypal import PALSklearn

# Each of these models is an instance of sklearn.gaussian_process.GaussianProcessRegressor
models = [gpr0, gpr1, gpr2]

# We always need to provide the feature matrix (X), a list of models, and the number of objectives
palinstance = PALSklearn(X, models, 3)

# Now, we can also feed in the first measurements
# this here assumes that we have all measurements for y and we now
# provide those which are present in the indices array
palinstance.update_train_set(indices, y[indices])

# Now we can run one step
next_idx = palinstance.run_one_step()
At this point, there is a range of optional arguments we can set (an illustrative example follows the list):
epsilon: one \(\epsilon\) per dimension in a np.ndarray. This can be used to set different tolerances for each objective. Note that \(\epsilon_i \in [0,1]\).
delta: the \(\delta\) hyperparameter (\(\delta \in [0,1]\)). Increasing this value will speed up the convergence.
beta_scale: an empirical scaling parameter for \(\beta\). The theoretical guarantees in the PAL paper are derived for this parameter set to 1, but in practice a much faster convergence can be achieved by setting it to a number \(0 < \beta_\mathrm{scale} \ll 1\).
goal: By default, PyPAL assumes that the goal is to maximize every objective. If this is not the case, this argument can be set using a list of “min” and “max” strings, where the ith entry specifies whether the ith objective should be minimized (“min”) or maximized (“max”).
coef_var_threshold: By default, PyPAL will not consider points with a coefficient of variation \(\ge 3\) for the classification step of the algorithm. This is meant to avoid classifying design points for which the model is entirely unsure. This tends to happen when a model is severely overfit on the training data (i.e., the training data uncertainties are very low, whereas the prediction uncertainties are very high). To change this setting, reduce this value to make the check tighter or increase it to avoid this check (as in the original implementation).
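For example, reusing X and models from above, these keyword arguments could be passed when constructing the instance (all values below are purely illustrative):

import numpy as np
from pypal import PALSklearn

# Illustrative settings for a three-objective problem:
# per-objective tolerances, minimization of the third objective,
# and an aggressive beta scaling
palinstance = PALSklearn(
    X,
    models,
    3,
    epsilon=np.array([0.05, 0.05, 0.1]),
    delta=0.05,
    beta_scale=0.1,
    goal=["max", "max", "min"],
    coef_var_threshold=3,
)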
In the case of missing observations, i.e., when only two of three outputs are measured, report the missing observations as np.nan. The call could look like
import numpy as np

palinstance.update_train_set(np.array([1, 2]), np.array([[1, 2, 3], [np.nan, 1, 2]]))
for a case in which we performed measurements for samples with index 1 and 2 of our design space, but did not measure the first target for sample 2.
Usually, the hyperparameters of a machine learning model, in particular the kernel hyperparameters of a Gaussian process regression model, should be optimized as new training data is added. However, since this is usually a computationally expensive process, it may not be desirable to perform it at every iteration of the active learning process. The iteration frequency of the hyperparameter optimization is internally set by the _should_optimize_hyperparameters function, which by default uses a schedule that optimizes the hyperparameters every 10th iteration. This behavior can be changed by overriding this function.
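As a minimal sketch (assuming the method takes no arguments besides self and returns a boolean), a subclass could re-optimize the hyperparameters at every iteration:

from pypal import PALSklearn

class EagerPALSklearn(PALSklearn):
    # Assumption: the scheduling hook returns True whenever the kernel
    # hyperparameters should be (re)optimized in the current iteration
    def _should_optimize_hyperparameters(self) -> bool:
        return True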
Basic information, such as the current iteration and the classification status, is logged and can be viewed by printing the PAL object
print(palinstance)
# returns: pypal at iteration 1. 10 Pareto optimal points, 1304 discarded points, 200 unclassified points.
We also provide calculation of the hypervolume enclosed by the Pareto front with the function pypal.pal.utils.get_hypervolume()
from pypal.pal.utils import get_hypervolume

hv = get_hypervolume(palinstance.means[palinstance.pareto_optimal])
For debugging, there are some properties and attributes of the PAL class that can be used to inspect the progress of the active learning loop (a short example follows the list).
design_space: returns the full design space matrix
pareto_optimal_points: returns the points that are classified as Pareto-efficient
sampled_points: returns the points that have been sampled
discarded_points: returns the points that have been discarded
get the indices of Pareto-efficient, sampled, discarded, and unclassified points with pareto_optimal_indices, sampled_indices, discarded_indices, and unclassified_indices
similarly, the number of points in the different classes can be obtained using number_pareto_optimal_points, number_discarded_points, number_unclassified_points, and number_sampled_points
hyperrectangle_size: returns the sizes of the hyperrectangles, i.e., the weights that are used in the sampling step
means and std: contain the predictions of the model
sampled: a mask array; if one objective of a point has not been measured, its cell is False
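For example, after a few iterations the state of the loop could be inspected like this:

# Print a summary of how many points fall into each class
print(palinstance.number_pareto_optimal_points)
print(palinstance.number_discarded_points)
print(palinstance.number_unclassified_points)
print(palinstance.number_sampled_points)

# Retrieve the current Pareto-efficient points and their indices
pareto_points = palinstance.pareto_optimal_points
pareto_indices = palinstance.pareto_optimal_indices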
In some cases, we may already possess all measurements but would like to run PAL with different settings to test how the algorithm performs. For this case, we provide the pypal.pal.utils.exhaust_loop() wrapper.
from pypal import PALSklearn, exhaust_loop

models = [gpr0, gpr1, gpr2]
palinstance = PALSklearn(X, models, 3)
exhaust_loop(palinstance, y)
This will continue calling run_one_step() until there is no unclassified sample left.
By default, the run_one_step function of the PAL classes will return a np.ndarray with only one index for the point in the design space for which the next experiment should be performed. In some situations, it may be more practical to run multiple experiments as batches before running the next active learning iteration. In such cases, we provide the batch_size argument which can be set to an integer greater than one.
next_idx = palinstance.run_one_step(batch_size=10) # next_idx will be a np.array of length 10
Note that the exhaust_loop also supports the batch_size keyword argument:
palinstance = PALSklearn(X, models, 3)

# sample always 10 points and do this until there is no unclassified
# point left
exhaust_loop(palinstance, y, batch_size=10)
One caveat to keep in mind is that \(\epsilon\)-PAL will not work if the predictive variance does not make sense, for example, when the model is overconfident and the uncertainties for the training set are significantly lower than those for the prediction set. In this case, PyPAL will prematurely, and often incorrectly, label the design points. An example of an overconfident model, caused by a training set that excludes part of the design space, is shown in the figure below.
This problem is exacerbated in conjunction with \(\beta_\mathrm{scale} < 1\). To make the model more robust, we suggest trying the following (a sklearn sketch follows this list):
to set reasonable bounds on the length scale parameters
to increase the regularization parameter/noise kernel (alpha in sklearn)
to increase the number of data points, especially the coverage of the design space
to use a kernel that suits the problem
to turn off ARD. Automatic relevance determination (ARD) might increase the predictive performance, but also makes the model more prone to overfitting
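A sketch of what some of these suggestions could look like for a sklearn Gaussian process regressor (all values are illustrative and need to be adapted to the problem at hand):

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Isotropic Matern kernel (i.e., no ARD) with bounds on the length scale
kernel = Matern(length_scale=1.0, length_scale_bounds=(1e-2, 1e2), nu=2.5)

# alpha adds a regularization/noise term to the diagonal of the kernel matrix
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-3, normalize_y=True)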
We also recommend cross-validating the Gaussian process models and checking that the predicted variances make sense. When performing cross-validation, make sure that the index provided to PyPAL is the same size as the cross-validation folds. By default, the code will run a simple cross-validation only on the first iteration and provide a warning if the mean absolute error is above the mean standard deviation. The warning will look something like
The mean absolute error in cross-validation is 64.29, the mean variance is 0.36. Your model might not be predictive and/or overconfident. In the docs, you find hints on how to make GPRs more robust.
This behavior can be changed, e.g., to perform the cross-validation test more frequently, by overriding the should_run_crossvalidation function.
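As a minimal sketch (again assuming the method takes no arguments besides self and returns a boolean), a subclass could run the cross-validation check at every iteration:

from pypal import PALSklearn

class AlwaysCrossvalidatePALSklearn(PALSklearn):
    # Assumption: returning True triggers the cross-validation check
    # in the current iteration
    def should_run_crossvalidation(self) -> bool:
        return True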
Another way to detect overfitting is to use the pypal.plotting.make_jointplot() function from the plotting subpackage. This function plots all objectives against each other (with error bars and the different classes indicated by colors) and histograms of the objectives on the diagonal. If the majority of predicted points overlap one another and get discarded by PyPAL, this may suggest that the surrogate model is overfit.
from pypal.plotting import make_jointplot

# palinstance is an instance of a PAL class after
# calling run_one_step
fig = make_jointplot(palinstance.means, palinstance)