Core functions for PAL
pypal.pal.core.
_get_max_wt
Returns the index in design space with the maximum size of the hyperrectangle (scaled by the mean predictions, i.e., effectively, we use the coefficient of variation). Samples only from unclassified or Pareto-optimal points.
rectangle_lows (np.array) – Lower, pessimistic, bounds of the hyperrectangles
rectangle_ups (np.array) – Upper, optimistic, bounds of the hyperrectangles
means (np.array) – Mean predictions
pareto_optimal_t (np.array) – Mask array that is True for the Pareto optimal points
unclassified_t (np.array) – Mask array that is True for the unclassified points
sampled (np.array) – Mask array that is True for the sampled points
index with maximum size of hyperrectangle
int
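The selection rule can be pictured with a short numpy sketch (illustrative only; the aggregation over objectives shown here is an assumption of this sketch, not the actual pypal implementation):

    import numpy as np

    def max_wt_sketch(rectangle_lows, rectangle_ups, means,
                      pareto_optimal_t, unclassified_t, sampled):
        # hyperrectangle size scaled by the mean prediction (coefficient-of-variation-like),
        # aggregated over the objectives (the aggregation is an assumption of this sketch)
        scores = (np.abs(rectangle_ups - rectangle_lows) / np.abs(means)).sum(axis=1)
        # only unsampled points that are Pareto optimal or unclassified are candidates
        candidates = (pareto_optimal_t | unclassified_t) & ~sampled
        scores[~candidates] = -np.inf
        return int(np.argmax(scores))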
_get_uncertainty_region
mu (float) – mean
std (float) – standard deviation
beta_sqrt (float) – scaling factor
lower bound, upper bound
Tuple[float, float]
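In other words, the confidence region for a single point and objective is simply mu +/- beta_sqrt * std; a minimal sketch:

    def uncertainty_region_sketch(mu: float, std: float, beta_sqrt: float):
        # lower and upper bound of the scaled confidence interval
        return mu - beta_sqrt * std, mu + beta_sqrt * std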
_get_uncertainty_regions
Compute the lower and upper bounds of the uncertainty regions for each dimension (=target)
mus (np.array) – means
stds (np.array) – standard deviations
beta_sqrt (float) – scaling factors
lower bounds, upper bounds
Tuple[np.array, np.array]
_pareto_classify
Performs the classification part of the algorithm (p. 4 of the PAL paper, see algorithm 1/2 of the epsilon-PAL paper)
One core concept is that once a point is classified, its class no longer changes.
When we do the comparison with +/- epsilon we always use the absolute values! Otherwise, we get inconsistent results depending on the sign!
pareto_optimal_0 (np.array) – boolean mask of points classified as Pareto optimal
not_pareto_optimal_0 (np.array) – boolean mask of points classified as non-Pareto optimal
unclassified_0 (np.array) – boolean mask of unclassified points
rectangle_lows (np.array) – lower uncertainty boundaries
rectangle_ups (np.array) – upper uncertainty boundaries
epsilon (np.array) – granularity parameter (one per dimension)
Pareto optimal, non-Pareto optimal, and unclassified points
Tuple[list, list, list]
_uncertainty
Hyperrectangle sizes
_union
Performing iterative intersection (eq. 6 in PAL paper) in all dimensions.
lows (np.array) – lower bounds from previous iteration
ups (np.array) – upper bounds from previous iteration
new_lows (np.array) – lower bounds from current iteration
new_ups (np.array) – upper bounds from current iteration
_union_one_dim
Used to intersect the confidence regions, for eq. 6 of the PAL paper. The iterative intersection ensures that all uncertainty regions are non-increasing with t.
We do not check the ordering in this function; we assume that the lower limits really are the lower limits and the upper limits really are the upper limits.
All arrays must have the same length.
lows (Sequence) – lower bounds from previous iteration
ups (Sequence) – upper bounds from previous iteration
new_lows (Sequence) – lower bounds from current iteration
new_ups (Sequence) – upper bounds from current iteration
array of lower limits, array of upper limits
Tuple[np.array, np.array]
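A minimal numpy sketch of this per-dimension intersection (the actual implementation may treat non-overlapping intervals differently):

    import numpy as np

    def union_one_dim_sketch(lows, ups, new_lows, new_ups):
        # element-wise intersection of the previous and current confidence intervals;
        # taking the max of the lows and the min of the ups keeps the regions non-increasing with t
        lows, ups = np.asarray(lows), np.asarray(ups)
        new_lows, new_ups = np.asarray(new_lows), np.asarray(new_ups)
        return np.maximum(lows, new_lows), np.minimum(ups, new_ups)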
Base class for PAL
pypal.pal.pal_base.
PALBase
Bases: object
PAL base class
__init__
Initialize the PAL instance
X_design (np.array) – Design space (feature matrix)
models (list) – Machine learning models
ndim (int) – Number of objectives
epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.
delta (float, optional) – Delta hyperparameter. Defaults to 0.05.
beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.
goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.
coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.
__repr__
Return repr(self).
__weakref__
list of weak references to the object (if defined)
_replace_by_measurements
Implements one “trick”. Instead of using the GPR predictions for the sampled points, we use the data that was actually measured and the actual uncertainty. This is different from the PAL implementation proposed by Zuluaga et al. and can cause issues when the measurements are outliers.
_update_beta
Update beta according to section 7.2. of the epsilon-PAL paper
_update_coef_var_mask
Update the mask array of elements that have variance below the coefficient of variation threshold
_update_hyperrectangles
Computes new hyperrectangles based on beta, the means and the standard deviations. If the iteration is > 0, then it uses iterative intersection to ensure that the size of the hyperrectangles is decreasing.
discarded_indices
Return the indices of the discarded points
discarded_points
Return the discarded points
hyperrectangle_sizes
Return the sizes of the hyperrectangles
number_discarded_points
Return the number of discarded points
number_pareto_optimal_points
Return the number of Pareto optimal points
number_sampled_points
Return the number of sampled points
number_unclassified_points
Return the number of unclassified points
pareto_optimal_indices
Return the indices of the Pareto optimal points
pareto_optimal_points
Return the Pareto optimal points
run_one_step
Run one step of the PAL algorithm and return the indices of the next points to evaluate in the design space.
batch_size (int, optional) – Number of indices that will be returned. Defaults to 1.
ValueError – In case the PAL instance was not initialized with measurements.
Array of indices of unclassified points that can be sampled, or None if none are left.
Union[np.array, None]
sample
Runs the sampling step based on the size of the hyperrectangle. I.e., favoring exploration.
exclude_idx (Union[np.array, None], optional) – Points in design space to exclude from sampling. Defaults to None.
ValueError – In case there are no uncertainty rectangles, i.e., when _predict has not been called successfully.
Index of next point to evaluate in design space
sampled_indices
Return the indices of the sampled points
sampled_mask
Create a mask for the sampled points. We count a point as sampled if at least one objective has been measured, i.e., self.sampled is an N * number-of-objectives array in which some columns can be False if a measurement has not been performed.
sampled_points
Return the sampled points
should_cross_validate
Override for more complex cross validation schedules
unclassified_indices
Return the indices of the unclassified points
unclassified_points
Return the unclassified points
update_train_set
Update training set following a measurement
indices (np.ndarray) – Indices of design space at which the measurements were taken
measurements (np.ndarray) – Measured values, 2D array. The length must equal the length of the indices array; the second dimension must equal the number of objectives. If an objective is missing, provide np.nan. For example, np.array([1, 1, np.nan])
measurement_uncertainty (np.ndarray) – Uncertainty in the measurements; if not provided (None), it will be zero. If it is not None, it must be an array with the same shape as the measurements. If an objective is missing, provide np.nan. For example, np.array([1, 1, np.nan])
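A hedged sketch of how update_train_set is typically combined with run_one_step in an active-learning loop (palinstance, X_design, and the measurement oracle measure() are placeholders and assumed to exist; see the model-specific examples below for how to construct a PAL instance):

    import numpy as np

    # Assumptions of this sketch: palinstance is an initialized PAL subclass that has
    # already been seeded with some initial measurements via update_train_set,
    # X_design is the feature matrix passed to its constructor, and measure() is a
    # user-supplied function that measures all objectives for one design point.
    while True:
        indices = palinstance.run_one_step(batch_size=1)
        if indices is None:  # nothing left to classify
            break
        measurements = np.array([measure(x) for x in X_design[indices]])
        palinstance.update_train_set(indices, measurements)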
PAL using GPy GPR models
pypal.pal.pal_gpy.
PALGPy
Bases: pypal.pal.pal_base.PALBase
PAL class for a list of GPy GPR models, with one model per objective
Construct the PALGPy instance
restarts (int) – Number of random restarts that are used for hyperparameter optimization. Defaults to 20.
n_jobs (int) – Number of parallel processes that are used to fit the GPR models. Defaults to 1.
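A hedged construction sketch, assuming PALGPy follows the PALBase signature documented above (random data stands in for real features and measurements):

    import numpy as np
    import GPy
    from pypal.pal.pal_gpy import PALGPy

    X_design = np.random.uniform(size=(100, 3))   # design space (feature matrix)
    init_idx = np.array([0, 17, 42, 63, 91])      # points measured up front
    y_init = np.random.uniform(size=(5, 2))       # stand-in for real measurements

    # one GPy GPRegression per objective, built on the initial measurements
    models = [
        GPy.models.GPRegression(
            X_design[init_idx],
            y_init[:, i].reshape(-1, 1),
            GPy.kern.Matern52(input_dim=X_design.shape[1], ARD=True),
        )
        for i in range(2)
    ]

    palinstance = PALGPy(X_design, models, 2, epsilon=0.01, restarts=3)
    palinstance.update_train_set(init_idx, y_init)
    next_idx = palinstance.run_one_step(batch_size=1)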
PAL for coregionalized GPR models
pypal.pal.pal_coregionalized.
PALCoregionalized
PAL class for a coregionalized GPR model
Construct the PALCoregionalized instance
parallel (bool) – If true, model hyperparameters are optimized in parallel, using the GPy implementation. Defaults to False.
PAL using Sklearn GPR models
pypal.pal.pal_sklearn.
PALSklearn
PAL class for a list of Sklearn (GPR) models, with one model per objective
Construct the PALSklearn instance
models (list) – Machine learning models. You can provide a list of GaussianProcessRegressor instances or a list of fitted RandomizedSearchCV/GridSearchCV instances with GaussianProcessRegressor models
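A hedged construction sketch with scikit-learn models, again assuming the PALBase signature documented above:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern
    from pypal.pal.pal_sklearn import PALSklearn

    X_design = np.random.uniform(size=(100, 3))
    # one (unfitted) GPR per objective; PALSklearn takes care of fitting
    models = [
        GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True, n_restarts_optimizer=5)
        for _ in range(2)
    ]

    palinstance = PALSklearn(X_design, models, 2, epsilon=[0.02, 0.02])

    # seed with a few initial measurements before calling run_one_step
    init_idx = np.array([0, 25, 50, 75])
    palinstance.update_train_set(init_idx, np.random.uniform(size=(4, 2)))
    next_idx = palinstance.run_one_step(batch_size=2)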
Implements a PAL class for GBDT models which can predict uncertainty intervals when used with quantile loss. For an example of GBDT with quantile loss see Jablonka, Kevin Maik; Moosavi, Seyed Mohamad; Asgari, Mehrdad; Ireland, Christopher; Patiny, Luc; Smit, Berend (2020): A Data-Driven Perspective on the Colours of Metal-Organic Frameworks. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.13033217.v1
For general information about quantile regression see https://en.wikipedia.org/wiki/Quantile_regression
Note that the scaling of the hyperrectangles has been derived for GPR models (with RBF kernels).
pypal.pal.pal_gbdt.
PALGBDT
PAL class for a list of LightGBM GBDT models
Construct the PALGBDT instance
models (List[Iterable[LGBMRegressor, LGBMRegressor, LGBMRegressor]]) – Machine learning models. You need to provide a list of iterables, one iterable per objective, where every iterable contains three LGBMRegressors: the first one for the lower uncertainty limit, the middle one for the median, and the last one for the upper limit. To create appropriate models, you need to use the quantile loss. If you want to parallelize training, we recommend that you use the LightGBM parallelization and fit the models for the different objectives in serial fashion.
interquartile_scaler (float, optional) – Used to convert the difference between the upper and lower quantile into a standard deviation, i.e., std = (up - low) / interquartile_scaler. Defaults to 1.35, following Wan, X., Wang, W., Liu, J. et al. Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. BMC Med Res Methodol 14, 135 (2014). https://doi.org/10.1186/1471-2288-14-135
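A hedged sketch of the expected model structure: for each objective, a triplet of LightGBM quantile regressors (lower quantile, median, upper quantile). With the default interquartile_scaler=1.35, the 0.25/0.75 quantiles are a natural choice for the outer models; the exact hyperparameters below are illustrative only:

    import numpy as np
    from lightgbm import LGBMRegressor
    from pypal.pal.pal_gbdt import PALGBDT

    def quantile_triplet():
        # lower limit, median, upper limit - all trained with the quantile loss
        return (
            LGBMRegressor(objective="quantile", alpha=0.25, n_estimators=500),
            LGBMRegressor(objective="quantile", alpha=0.50, n_estimators=500),
            LGBMRegressor(objective="quantile", alpha=0.75, n_estimators=500),
        )

    X_design = np.random.uniform(size=(200, 5))
    models = [quantile_triplet(), quantile_triplet()]  # two objectives
    palinstance = PALGBDT(X_design, models, 2)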
Provides some scheduling functions that can be used to implement the _should_optimize_hyperparameters function
pypal.pal.schedules.
exp_decay
Optimize hyperparameters at logarithmically spaced intervals
iteration (int) – current iteration
base (int, optional) – Base of the logarithm. Defaults to 10.
True if iteration is on the log scaled grid
bool
linear
Optimize hyperparameters at equally spaced intervals
frequency (int, optional) – Spacing between the True outputs. Defaults to 10.
True if iteration can be divided by frequency without remainder
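Minimal sketches of the two schedule types (the actual pypal implementations may differ in detail); such a function can be used when overriding should_cross_validate or _should_optimize_hyperparameters:

    import math

    def linear_sketch(iteration: int, frequency: int = 10) -> bool:
        # re-optimize hyperparameters every `frequency` iterations
        return iteration % frequency == 0

    def exp_decay_sketch(iteration: int, base: int = 10) -> bool:
        # re-optimize when the iteration lies on a log-spaced grid (powers of `base`)
        if iteration < 1:
            return False
        log = math.log(iteration, base)
        return math.isclose(log, round(log))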
Utilities for dealing with Pareto fronts in general
pypal.pal.utils.
dominance_check
One point dominates another if it is not worse in any objective and strictly better in at least one. This assumes that we want to maximize.
dominance_check_jitted
Check if point dominates any point in array
dominance_check_jitted_2
Check if any point in array dominates point
dominance_check_jitted_3
Check if any point in array dominates point. The ignore_me argument exists only because numba does not understand masked arrays.
exhaust_loop
Helper function that takes an initialized PAL instance and loops the sampling until there is no unclassified point left. This is useful if all measurements are already taken and one wants to test the algorithm with different hyperparameters.
palinstance (PALBase) – An initialized instance of a class that inherited from PALBase and implemented the ._train() and ._predict() functions
y (np.array) – Measurements. The number of measurements must equal the number of points in the design space.
batch_size (int, optional) – Number of indices that will be returned. Defaults to 10.
None. The PAL instance is updated in place
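Usage sketch (palinstance and the full measurement array y are assumed to exist, e.g., from the construction examples above):

    from pypal.pal.utils import exhaust_loop

    # y must contain one row of measurements per point in the design space
    exhaust_loop(palinstance, y, batch_size=10)
    print(palinstance.number_pareto_optimal_points)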
get_hypervolume
Compute the hypervolume indicator of a Pareto front. We multiply the objectives by minus one, since we assume that all objectives are to be maximized, and then calculate the area.
(An ASCII sketch in the original docstring illustrates this dominated area in the f1/f2 objective plane.)
However, the code we use for the hypervolume indicator assumes that the reference vector is larger than all the points in the Pareto front. For this reason, we then flip all the signs using a prefactor.
This indicator is not needed for the epsilon-PAL algorithm itself but only to allow tracking a metric that might help the user to see if the algorithm converges.
float
get_kmeans_samples
Get the samples that are closest to the k=n_samples centroids
X (np.array) – Feature array, on which the KMeans clustering is run
n_samples (int) – Number of samples that should be selected
**kwargs – Passed to the KMeans
selected_indices
np.array
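Usage sketch for seeding a PAL run with a diverse initial batch (the keyword argument shown is simply forwarded to KMeans; the PAL instance and the measurement step are assumed):

    import numpy as np
    from pypal.pal.utils import get_kmeans_samples

    X_design = np.random.uniform(size=(500, 4))
    init_idx = get_kmeans_samples(X_design, 10, random_state=42)  # **kwargs go to KMeans
    # measure these points first, then call palinstance.update_train_set(init_idx, measurements)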
get_maxmin_samples
Greedy maxmin sampling, also known as Kennard-Stone sampling (1). Note that a greedy sampling is not guaranteed to give the ideal solution and the output will depend on the random initialization (if this is chosen).
If you need a good solution, you can restart this algorithm multiple times with random initialization and different random seeds and use a coverage metric to quantify how well the space is covered. Some metrics are described in (2). In contrast to the code provided with (2) and (3) we do not consider the feature importance for the selection as this is typically not known beforehand.
You might want to standardize your data before applying this sampling function.
Some more sampling options are provided in our structure_comp (4) Python package. Also, this implementation is quite memory hungry.
References: (1) Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11 (1), 137–148. https://doi.org/10.1080/00401706.1969.10490666. (2) Moosavi, S. M.; Nandy, A.; Jablonka, K. M.; Ongari, D.; Janet, J. P.; Boyd, P. G.; Lee, Y.; Smit, B.; Kulik, H. J. Understanding the Diversity of the Metal-Organic Framework Ecosystem. Nature Communications 2020, 11 (1), 4068. https://doi.org/10.1038/s41467-020-17755-8. (3) Moosavi, S. M.; Chidambaram, A.; Talirz, L.; Haranczyk, M.; Stylianou, K. C.; Smit, B. Capturing Chemical Intuition in Synthesis of Metal-Organic Frameworks. Nat Commun 2019, 10 (1), 539. https://doi.org/10.1038/s41467-019-08483-9. (4) https://github.com/kjappelbaum/structure_comp
X (np.array) – Feature array, this is the array that is used to perform the sampling
n_samples (int) – number of points that will be selected, needs to be lower than the length of X
metric (str, optional) – Distance metric to use for the maxmin calculation. Must be a valid option of scipy.spatial.distance.cdist (‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘wminkowski’, ‘yule’). Defaults to ‘euclidean’
init (str, optional) – either ‘mean’, ‘median’, or ‘random’. Determines how the initial point is chosen. Defaults to ‘center’
seed (int, optional) – seed for the random number generator. Defaults to None.
**kwargs – Passed to cdist
is_pareto_efficient
Find the Pareto efficient points. Based on https://stackoverflow.com/questions/32791911/fast-calculation-of-pareto-front-in-python
costs (np.array) – An (n_points, n_costs) array
return_mask (bool, optional) – If True, return a boolean mask. Otherwise, return a (n_efficient_points,) integer array of indices. Defaults to True.
Boolean mask or integer array of indices of the Pareto efficient points, depending on return_mask
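For reference, a self-contained numpy sketch in the spirit of the linked Stack Overflow answer (minimizing the costs; the pypal helper may differ in details such as the maximization convention):

    import numpy as np

    def is_pareto_efficient_sketch(costs: np.ndarray, return_mask: bool = True):
        # iteratively remove dominated points; costs is an (n_points, n_costs) array to be minimized
        is_efficient = np.arange(costs.shape[0])
        n_points = costs.shape[0]
        next_point = 0
        while next_point < len(costs):
            nondominated = np.any(costs < costs[next_point], axis=1)
            nondominated[next_point] = True
            is_efficient = is_efficient[nondominated]  # keep the surviving original indices
            costs = costs[nondominated]
            next_point = int(np.sum(nondominated[:next_point])) + 1
        if return_mask:
            mask = np.zeros(n_points, dtype=bool)
            mask[is_efficient] = True
            return mask
        return is_efficient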
Plotting utilities
pypal.plotting.
make_jointplot
Make a jointplot of the objective space
y (np.array) – array with the objectives (measurements)
palinstance (PALBase) – “trained” PAL instance
labels (Union[List[str], None], optional) – Labels for the objectives. Defaults to None.
figsize (tuple, optional) – Figure size. Defaults to (8.0, 6.0).
fig
plot_bar_iterations
Plot stacked barplots for every step of the iteration.
pareto_optimal (np.ndarray) – Number of Pareto optimal points for every iteration.
non_pareto_points (np.ndarray) – Number of discarded points for every iteration
unclassified_points (np.ndarray) – Number of unclassified points for every iteration
ax
plot_histogram
Plot histograms, with maxima scaled to one and different categories indicated in color
y (np.ndarray) – objective (measurement)
palinstance (PALBase) – instance of a PAL class
ax (ax) – Matplotlib figure axis
plot_pareto_front_2d
Plot a 2D pareto front, with the different categories indicated in color.
y_0 (np.ndarray) – objective 0
y_1 (np.ndarray) – objective 1
std_0 (np.ndarray) – standard deviation objective 0
std_1 (np.ndarray) – standard deviation objective 1
palinstance (PALBase) – PAL instance
ax (ax, optional) – Matplotlib figure axis. Defaults to None.
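Usage sketch for the plotting helpers (y, stds, and palinstance are assumed to exist; y and stds are (n_points, n_objectives) arrays):

    import matplotlib.pyplot as plt
    from pypal.plotting import make_jointplot, plot_pareto_front_2d

    fig = make_jointplot(y, palinstance, labels=["objective 0", "objective 1"])

    fig2, ax = plt.subplots()
    plot_pareto_front_2d(y[:, 0], y[:, 1], stds[:, 0], stds[:, 1], palinstance, ax=ax)
    fig2.savefig("pareto_front.png", dpi=300)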
Methods to validate inputs for the PAL classes
pypal.pal.validate_inputs.
_validate_sklearn_gpr_model
Make sure that we deal with a GaussianProcessRegressor instance; if it is a fitted random or grid search instance, extract the model
GaussianProcessRegressor
base_validate_models
Currently performs no validation, as the predict and train functions are implemented independently of the base class
list
validate_beta_scale
beta_scale (Any) – scaling factor for beta
ValueError – If beta is smaller than 0
scaling factor for beta
validate_coef_var
Make sure that the coef_var makes sense
validate_coregionalized_gpy
Make sure that model is a coregionalized GPR model
validate_delta
Make sure that delta is in a reasonable range
delta (Any) – Delta hyperparameter
ValueError – Delta must be in [0,1].
delta
validate_epsilon
Validate epsilon and return a np.array
epsilon (Any) – Epsilon hyperparameter
ndim (int) – Number of dimensions/objectives
ValueError – If epsilon is a list there must be one float per dimension
ValueError – Epsilon must be in [0,1]
ValueError – If epsilon is an array there must be one float per dimension
Array of one epsilon per objective
np.ndarray
validate_gbdt_models
Make sure that the number of iterables is equal to the number of objectives and that every iterable contains three LGBMRegressors. Also, we check that at least the first and last models use quantile loss
List[Iterable]
validate_goals
Validate the goals and return an array with 1 for objectives that are to be maximized and -1 for objectives that are to be minimized.
goals (Any) – List of goals, typically provided as strings ‘max’ for maximization and ‘min’ for minimization
ndim (int) – number of dimensions
ValueError – If goals is a list and the length is not equal to ndim
ValueError – If goals is a list and the elements are not strings ‘min’, ‘max’ or -1 and 1
Array of -1 and 1
validate_gpy_model
Make sure that all elements of the list are GPRegression models
validate_interquartile_scaler
Make sure that the interquartile_scaler makes sense
validate_ndim
Make sure that the number of dimensions makes sense
ndim (Any) – number of dimensions
ValueError – If the number of dimensions is not an integer
ValueError – If the number of dimensions is not greater than 0
the number of dimensions
validate_njobs
Make sure that njobs is an integer >= 1
validate_number_models
Make sure that there are as many models as objectives
models (Any) – List of models
ValueError – If the number of models does not equal the number of objectives
validate_sklearn_gpr_models
Make sure that there is a list of GPR models, one model per objective
List[GaussianProcessRegressor]
Wrappers for Gaussian Process Regression models.
We typically use the GPy package as it offers most flexibility for Gaussian processes in Python. Typically, we use automatic relevance determination (ARD), where one lengthscale parameter per input dimension is used.
If your task requires training on larger training sets, you might consider replacing the models with their sparse version but for the epsilon-PAL algorithm this typically shouldn’t be needed.
For kernel selection, you can have a look at https://www.cs.toronto.edu/~duvenaud/cookbook/. Matérn, RBF, and RationalQuadratic kernels are good quick-and-dirty solutions but have their caveats.
pypal.models.gpr.
_get_matern_32_kernel
Matern-3/2 kernel without ARD
Matern32
_get_matern_52_kernel
Matern-5/2 kernel without ARD
Matern52
_get_ratquad_kernel
Rational quadratic kernel without ARD
RatQuad
build_coregionalized_model
Wrapper for building a coregionalized GPR; it will have as many outputs as y_train.shape[1]. Each output will have its own noise term
GPCoregionalizedRegression
build_model
Build a single-output GPR model
GPRegression
predict
Wrapper function for the prediction method of a GPy regression model. It returns the standard deviation instead of the variance
Tuple[array, array]
predict_coregionalized
Wrapper function for the prediction method of a coregionalized GPy regression model. It returns the standard deviation instead of the variance
set_xy_coregionalized
Wrapper to update a coregionalized model with new data
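A conceptual sketch, using GPy directly, of what build_model and predict do (the exact pypal wrapper signatures are not reproduced in this documentation and are therefore not assumed here):

    import numpy as np
    import GPy

    X_train = np.random.uniform(size=(20, 3))
    y_train = np.random.uniform(size=(20, 1))

    # ARD Matern-5/2 kernel: one lengthscale per input dimension
    kernel = GPy.kern.Matern52(input_dim=X_train.shape[1], ARD=True)
    model = GPy.models.GPRegression(X_train, y_train, kernel)
    model.optimize_restarts(num_restarts=3)  # hyperparameter optimization with random restarts

    mu, var = model.predict(X_train)
    std = np.sqrt(var)  # the pypal predict wrapper returns the standard deviation instead of the variance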