The PyPAL API reference

The PAL package

Core functions

Core functions for PAL

pypal.pal.core._get_max_wt(rectangle_lows, rectangle_ups, means, pareto_optimal_t, unclassified_t, sampled)[source]

Returns the index in design space with the maximum size of the hyperrectangle (scaled by the mean predictions, i.e., effectively, we use the coefficient of variation). Samples only from unclassified or Pareto-optimal points.

Parameters
  • rectangle_lows (np.array) – Lower, pessimistic, bounds of the hyperrectangles

  • rectangle_ups (np.array) – Upper, optimistic, bounds of the hyperrectangles

  • means (np.array) – Mean predictions

  • pareto_optimal_t (np.array) – Mask array that is True for the Pareto optimal points

  • unclassified_t (np.array) – Mask array that is True for the unclassified points

  • sampled (np.array) – Mask array that is True for the sampled points

Returns

index with maximum size of hyperrectangle

Return type

int

pypal.pal.core._get_uncertainty_region(mu, std, beta_sqrt)[source]
Parameters
  • mu (float) – mean

  • std (float) – standard deviation

  • beta_sqrt (float) – scaling factor

Returns

lower bound, upper bound

Return type

Tuple[float, float]
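In essence, the bounds are the mean shifted by the scaled standard deviation. A minimal sketch of that idea (uncertainty_region is an illustrative name, not the verbatim implementation):

def uncertainty_region(mu: float, std: float, beta_sqrt: float):
    # Symmetric confidence interval mu ± beta_sqrt * std
    return mu - beta_sqrt * std, mu + beta_sqrt * std

low, up = uncertainty_region(mu=1.0, std=0.25, beta_sqrt=2.0)  # (0.5, 1.5)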

pypal.pal.core._get_uncertainty_regions(mus, stds, beta_sqrt)[source]
Compute the lower and upper bounds of the uncertainty regions for each dimension (= target)

Parameters
  • mus (np.array) – means

  • stds (np.array) – standard deviations

  • beta_sqrt (float) – scaling factor

Returns

lower bounds, upper bounds

Return type

Tuple[np.array, np.array]

pypal.pal.core._pareto_classify(pareto_optimal_0, not_pareto_optimal_0, unclassified_0, rectangle_lows, rectangle_ups, epsilon)[source]

Performs the classification part of the algorithm (p. 4 of the PAL paper, see algorithm 1/2 of the epsilon-PAL paper)

One core concept is that once a point is classified, its class no longer changes.

When we do the comparison with +/- epsilon, we always use the absolute values. Otherwise, we get inconsistent results depending on the sign.

Parameters
  • pareto_optimal_0 (np.array) – boolean mask of points classified as Pareto optimal

  • not_pareto_optimal_0 (np.array) – boolean mask of points classified as non-Pareto optimal

  • unclassified_0 (np.array) – boolean mask of unclassified points

  • rectangle_lows (np.array) – lower uncertainty boundaries

  • rectangle_ups (np.array) – upper uncertainty boundaries

  • epsilon (np.array) – granularity parameter (one per dimension)

Returns

Binary encoded lists of Pareto optimal, non-Pareto optimal, and unclassified points

Return type

Tuple[list, list, list]

pypal.pal.core._uncertainty(rectangle_ups, rectangle_lows, means)[source]

Compute the sizes of the hyperrectangles, scaled by the mean predictions

pypal.pal.core._union(lows, ups, new_lows, new_ups)[source]

Perform the iterative intersection (eq. 6 in the PAL paper) in all dimensions.

Parameters
  • lows (np.array) – lower bounds from previous iteration

  • ups (np.array) – upper bounds from previous iteration

  • new_lows (np.array) – lower bounds from current iteration

  • new_ups (np.array) – upper bounds from current iteration

Returns

lower bounds, upper bounds

Return type

Tuple[np.array, np.array]

pypal.pal.core._union_one_dim(lows, ups, new_lows, new_ups)[source]

Used to intersect the confidence regions, for eq. 6 of the PAL paper. The iterative intersection ensures that all uncertainty regions are non-increasing with t.

We do not check the ordering in this function; we assume that the lower limits really are the lower limits and the upper limits really are the upper limits.

All arrays must have the same length.

Parameters
  • lows (Sequence) – lower bounds from previous iteration

  • ups (Sequence) – upper bounds from previous iteration

  • new_lows (Sequence) – lower bounds from current iteration

  • new_ups (Sequence) – upper bounds from current iteration

Returns

array of lower limits, array of upper limits

Return type

Tuple[np.array, np.array]
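The rule itself is simple: a lower bound can only move up and an upper bound can only move down, so the confidence regions can only shrink. A hypothetical elementwise sketch of this intersection (intersect_bounds is an illustrative name, not part of the package):

import numpy as np

def intersect_bounds(lows, ups, new_lows, new_ups):
    # Elementwise intersection of confidence intervals: bounds can only tighten
    return np.maximum(lows, new_lows), np.minimum(ups, new_ups)

lows, ups = intersect_bounds(np.array([0.0]), np.array([1.0]),
                             np.array([0.2]), np.array([1.5]))
# lows -> array([0.2]), ups -> array([1.]); never wider than either input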

Base class

Base class for PAL

class pypal.pal.pal_base.PALBase(X_design, models, ndim, epsilon=0.01, delta=0.05, beta_scale=0.1111111111111111, goals=None, coef_var_threshold=3)[source]

Bases: object

PAL base class

__init__(X_design, models, ndim, epsilon=0.01, delta=0.05, beta_scale=0.1111111111111111, goals=None, coef_var_threshold=3)[source]

Initialize the PAL instance

Parameters
  • X_design (np.array) – Design space (feature matrix)

  • models (list) – Machine learning models

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

__repr__()[source]

Return repr(self).

__weakref__

list of weak references to the object (if defined)

_replace_by_measurements()[source]

Implements one “trick”: instead of using the GPR predictions for the sampled points, we use the data that was actually measured and the actual uncertainty. This is different from the PAL implementation proposed by Zuluaga et al. It can cause issues when the measurements are outliers

_update_beta()[source]

Update beta according to section 7.2 of the epsilon-PAL paper

_update_coef_var_mask()[source]

Update the mask array of elements that have variance below the coefficient of variation threshold

_update_hyperrectangles()[source]

Computes new hyperrectangles based on beta, the means and the standard deviations. If the iteration is > 0, then it uses iterative intersection to ensure that the size of the hyperrectangles is decreasing.

property discarded_indices

Return the indices of the discarded points

property discarded_points

Return the discarded points

property hyperrectangle_sizes

Return the sizes of the hyperrectangles

property number_discarded_points

Return the number of discarded points

property number_pareto_optimal_points

Return the number of Pareto optimal points

property number_sampled_points

Return the number of sampled points

property number_unclassified_points

Return the number of unclassified points

property pareto_optimal_indices

Return the indices of the Pareto optimal points

property pareto_optimal_points

Return the Pareto optimal points

run_one_step(batch_size=1)[source]

Run one step of the PAL algorithm and return the indices of the next points to sample.

Parameters

batch_size (int, optional) – Number of indices that will be returned. Defaults to 1.

Raises

ValueError – In case the PAL instance was not initialized with measurements.

Returns

Array of indices to sample next if there are unclassified points left, otherwise None.

Return type

Union[np.array, None]

sample(exclude_idx=None)[source]

Runs the sampling step based on the size of the hyperrectangles, i.e., favoring exploration.

Parameters

exclude_idx (Union[np.array, None], optional) – Points in design space to exclude from sampling. Defaults to None.

Raises

ValueError – In case there are no uncertainty rectangles, i.e., when _predict has not been called successfully.

Returns

Index of next point to evaluate in design space

Return type

int

property sampled_indices

Return the indices of the sampled points

property sampled_mask

Create a mask for the sampled points. We count a point as sampled if at least one objective has been measured, i.e., self.sampled is an N x number-of-objectives array in which some columns can be False if a measurement has not been performed

property sampled_points

Return the sampled points

should_cross_validate()[source]

Override for more complex cross validation schedules

property unclassified_indices

Return the indices of the unclassified points

property unclassified_points

Return the unclassified points

update_train_set(indices, measurements, measurement_uncertainty=None)[source]

Update training set following a measurement

Parameters
  • indices (np.ndarray) – Indices of design space at which the measurements were taken

  • measurements (np.ndarray) – Measured values, as a 2D array. The length must equal the length of the indices array and the second dimension must equal the number of objectives. If an objective is missing, provide np.nan, e.g., np.array([1, 1, np.nan])

  • measurement_uncertainty (np.ndarray) – Uncertainty in the measurements. If not provided (None), it is assumed to be zero. If not None, it must be an array with the same shape as the measurements. If an objective is missing, provide np.nan, e.g., np.array([1, 1, np.nan])
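Put together, a typical measure-and-update loop looks roughly as follows. Here, palinstance stands for any initialized PALBase subclass that has already been seeded with a few measurements, and perform_measurement is a placeholder for your experiment or simulation:

while True:
    idx = palinstance.run_one_step(batch_size=1)
    if idx is None:  # no unclassified points left to sample
        break
    # perform_measurement (placeholder) must return a (len(idx), n_objectives)
    # array; use np.nan for objectives that were not measured
    new_y = perform_measurement(idx)
    palinstance.update_train_set(idx, new_y)

print(palinstance.number_pareto_optimal_points)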

For GPy models

PAL using GPy GPR models

class pypal.pal.pal_gpy.PALGPy(*args, **kwargs)[source]

Bases: pypal.pal.pal_base.PALBase

PAL class for a list of GPy GPR models, with one model per objective

__init__(*args, **kwargs)[source]

Construct the PALGPy instance

Parameters
  • X_design (np.array) – Design space (feature matrix)

  • models (list) – Machine learning models

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • restarts (int) – Number of random restarts that are used for hyperparameter optimization. Defaults to 20.

  • n_jobs (int) – Number of parallel processes that are used to fit the GPR models. Defaults to 1.
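A construction sketch, assuming GPy is installed and using the build_model helper documented below; the toy data and the choice of restarts=3 are illustrative only:

import numpy as np
from pypal.models.gpr import build_model
from pypal.pal.pal_gpy import PALGPy

X = np.random.uniform(size=(100, 3))      # design space (feature matrix)
y_init = np.random.uniform(size=(4, 2))   # four measured points, two objectives
init_idx = np.array([0, 1, 2, 3])

# One single-output GPRegression model per objective
# (we assume the index argument selects the objective column of y_init)
m0 = build_model(X[init_idx], y_init, index=0)
m1 = build_model(X[init_idx], y_init, index=1)

palgpy = PALGPy(X, [m0, m1], 2, epsilon=0.05, restarts=3)
palgpy.update_train_set(init_idx, y_init)
next_idx = palgpy.run_one_step(batch_size=1)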

For coregionalized GPy models

PAL for coregionalized GPR models

class pypal.pal.pal_coregionalized.PALCoregionalized(*args, **kwargs)[source]

Bases: pypal.pal.pal_base.PALBase

PAL class for a coregionalized GPR model

__init__(*args, **kwargs)[source]

Construct the PALCoregionalized instance

Parameters
  • X_design (np.array) – Design space (feature matrix)

  • models (list) – Machine learning models

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • restarts (int) – Number of random restarts that are used for hyperparameter optimization. Defaults to 20.

  • parallel (bool) – If true, model hyperparameters are optimized in parallel, using the GPy implementation. Defaults to False.
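A construction sketch, assuming the coregionalized model built with build_coregionalized_model (documented below) is passed as a single-element list:

import numpy as np
from pypal.models.gpr import build_coregionalized_model
from pypal.pal.pal_coregionalized import PALCoregionalized

X = np.random.uniform(size=(100, 3))      # design space (feature matrix)
y_init = np.random.uniform(size=(4, 2))   # four measured points, two objectives
init_idx = np.array([0, 1, 2, 3])

# One coregionalized model covering both objectives
model = build_coregionalized_model(X[init_idx], y_init)

palcoreg = PALCoregionalized(X, [model], 2, epsilon=0.05)
palcoreg.update_train_set(init_idx, y_init)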

For sklearn GPR models

PAL using Sklearn GPR models

class pypal.pal.pal_sklearn.PALSklearn(*args, **kwargs)[source]

Bases: pypal.pal.pal_base.PALBase

PAL class for a list of Sklearn (GPR) models, with one model per objective

__init__(*args, **kwargs)[source]

Construct the PALSklearn instance

Parameters
  • X_design (np.array) – Design space (feature matrix)

  • models (list) – Machine learning models. You can provide a list of GaussianProcessRegressor instances or a list of fitted RandomizedSearchCV/GridSearchCV instances with GaussianProcessRegressor models

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • n_jobs (int) – Number of parallel processes that are used to fit the GPR models. Defaults to 1.
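A minimal construction sketch with scikit-learn; the toy data and the Matérn kernel choice are assumptions for illustration:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from pypal.pal.pal_sklearn import PALSklearn

X = np.random.uniform(size=(100, 3))      # design space (feature matrix)
y_init = np.random.uniform(size=(5, 2))   # initial measurements, two objectives
init_idx = np.arange(5)

# One GaussianProcessRegressor per objective
gprs = [GaussianProcessRegressor(Matern(nu=2.5), normalize_y=True) for _ in range(2)]

palsklearn = PALSklearn(X, gprs, 2, epsilon=0.05)
palsklearn.update_train_set(init_idx, y_init)
next_idx = palsklearn.run_one_step(batch_size=2)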

For quantile regression with LightGBM

Implements a PAL class for GBDT models which can predict uncertainty intervals when used with quantile loss. For an example of GBDT with quantile loss see Jablonka, Kevin Maik; Moosavi, Seyed Mohamad; Asgari, Mehrdad; Ireland, Christopher; Patiny, Luc; Smit, Berend (2020): A Data-Driven Perspective on the Colours of Metal-Organic Frameworks. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.13033217.v1

For general information about quantile regression see https://en.wikipedia.org/wiki/Quantile_regression

Note that the scaling of the hyperrectangles has been derived for GPR models (with RBF kernels).

class pypal.pal.pal_gbdt.PALGBDT(*args, **kwargs)[source]

Bases: pypal.pal.pal_base.PALBase

PAL class for a list of LightGBM GBDT models

__init__(*args, **kwargs)[source]

Construct the PALGBDT instance

Parameters
  • X_design (np.array) – Design space (feature matrix)

  • models (List[Iterable[LGBMRegressor, LGBMRegressor, LGBMRegressor]]) – Machine learning models. You need to provide a list of iterables, one iterable per objective, and every iterable contains three LGBMRegressors: the first one for the lower uncertainty limit, the middle one for the median, and the last one for the upper limit. To create appropriate models, you need to use the quantile loss. If you want to parallelize training, we recommend that you use the LightGBM parallelization and fit the models for the different objectives in serial fashion.

  • ndim (int) – Number of objectives

  • epsilon (Union[list, float], optional) – Epsilon hyperparameter. Defaults to 0.01.

  • delta (float, optional) – Delta hyperparameter. Defaults to 0.05.

  • beta_scale (float, optional) – Scaling parameter for beta. If not equal to 1, the theoretical guarantees do not necessarily hold. Also note that the parametrization depends on the kernel type. Defaults to 1/9.

  • goals (List[str], optional) – If a list, provide “min” for every objective that shall be minimized and “max” for every objective that shall be maximized. Defaults to None, which means that the code maximizes all objectives.

  • coef_var_threshold (float, optional) – Use only points with a coefficient of variation below this threshold in the classification step. Defaults to 3.

  • interquartile_scaler (float, optional) – Used to convert the difference between the upper and lower quantile into a standard deviation, that is, std = (up - low) / interquartile_scaler. Defaults to 1.35, following Wan, X.; Wang, W.; Liu, J. et al. Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range. BMC Med Res Methodol 14, 135 (2014). https://doi.org/10.1186/1471-2288-14-135
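A construction sketch, assuming LightGBM is installed. The quantile levels 0.25 and 0.75 are an assumption chosen to match the default interquartile_scaler of 1.35; quantile_triple is an illustrative helper, not part of the package:

import numpy as np
from lightgbm import LGBMRegressor
from pypal.pal.pal_gbdt import PALGBDT

def quantile_triple():
    # Lower-quantile, median, and upper-quantile regressors for one objective
    return (
        LGBMRegressor(objective="quantile", alpha=0.25),  # lower uncertainty limit
        LGBMRegressor(objective="quantile", alpha=0.50),  # median
        LGBMRegressor(objective="quantile", alpha=0.75),  # upper uncertainty limit
    )

X = np.random.uniform(size=(100, 3))             # design space (feature matrix)
models = [quantile_triple(), quantile_triple()]  # one triple per objective

palgbdt = PALGBDT(X, models, 2, epsilon=0.05)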

Schedules for hyperparameter optimization

Provides some scheduling functions that can be used to implement the _should_optimize_hyperparameters function

pypal.pal.schedules.exp_decay(iteration, base=10)[source]

Optimize hyperparameters at logarithmically spaced intervals

Parameters
  • iteration (int) – current iteration

  • base (int, optional) – Base of the logarithm. Defaults to 10.

Returns

True if iteration is on the log scaled grid

Return type

bool

pypal.pal.schedules.linear(iteration, frequency=10)[source]

Optimize hyperparameters at equally spaced intervals

Parameters
  • iteration (int) – current iteration

  • frequency (int, optional) – Spacing between the True outputs. Defaults to 10.

Returns

True if iteration can be divided by frequency without remainder

Return type

bool
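For illustration, following the documented semantics of the two schedules:

from pypal.pal.schedules import linear, exp_decay

# linear: True whenever the iteration is divisible by `frequency`
print(linear(10, frequency=5))   # True  (10 % 5 == 0)
print(linear(7, frequency=5))    # False

# exp_decay: True only for iterations on a logarithmically spaced (base-10) grid
print(exp_decay(100, base=10))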

Utilities for multiobjective optimization

Utilities for dealing with Pareto fronts in general

pypal.pal.utils.dominance_check(point1, point2)[source]

One point dominates another if it is not worse in any objective and strictly better in at least one. This assumes that we want to maximize all objectives

Return type

bool

pypal.pal.utils.dominance_check_jitted(point, array)[source]

Check if point dominates any point in array

Return type

bool

pypal.pal.utils.dominance_check_jitted_2(array, point)[source]

Check if any point in array dominates point

Return type

bool

pypal.pal.utils.dominance_check_jitted_3(array, point, ignore_me)[source]

Check if any point in array dominates point. The ignore_me argument exists because numba does not understand masked arrays

Return type

bool

pypal.pal.utils.exhaust_loop(palinstance, y, batch_size=1)[source]

Helper function that takes an initialized PAL instance and loops the sampling until there is no unclassified point left. This is useful if all measurements are already taken and one wants to test the algorithm with different hyperparameters.

Parameters
  • palinstance (PALBase) – An initialized instance of a class that inherited from PALBase and implemented the ._train() and ._predict() functions

  • y (np.array) – Measurements. The number of measurements must equal the number of points in the design space.

  • batch_size (int, optional) – Number of indices that will be returned. Defaults to 1.

Returns

None. The PAL instance is updated in place
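A usage sketch on toy data; the objective values are illustrative, and the instance is seeded with a handful of measurements before the loop because run_one_step requires initial measurements:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from pypal.pal.pal_sklearn import PALSklearn
from pypal.pal.utils import exhaust_loop

X = np.random.uniform(size=(60, 2))
y = np.hstack([np.sin(X[:, :1]), np.cos(X[:, 1:])])   # all measurements known up front

gprs = [GaussianProcessRegressor(Matern(nu=2.5), normalize_y=True) for _ in range(2)]
palinstance = PALSklearn(X, gprs, 2, epsilon=0.05)
palinstance.update_train_set(np.arange(5), y[:5])

exhaust_loop(palinstance, y, batch_size=1)   # loops until nothing is unclassified
print(palinstance.number_pareto_optimal_points)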

pypal.pal.utils.get_hypervolume(pareto_front, reference_vector, prefactor=- 1)[source]

Compute the hypervolume indicator of a Pareto front. We multiply it by minus one, as we assume that we want to maximize all objectives, and then calculate the dominated area (in the f1 vs. f2 plane).

The code we use for the hypervolume indicator assumes that the reference vector is larger than all the points in the Pareto front. For this reason, we then flip all the signs using prefactor.

This indicator is not needed for the epsilon-PAL algorithm itself but only to allow tracking a metric that might help the user to see if the algorithm converges.

Return type

float
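A usage sketch; the Pareto front and the reference vector below are hypothetical, with the reference chosen to lie below the front when both objectives are maximized:

import numpy as np
from pypal.pal.utils import get_hypervolume

pareto_front = np.array([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]])  # both objectives maximized
reference_vector = np.array([0.0, 0.0])

hv = get_hypervolume(pareto_front, reference_vector)  # scalar hypervolume indicator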

pypal.pal.utils.get_kmeans_samples(X, n_samples, **kwargs)[source]

Get the samples that are closest to the k=n_samples centroids

Parameters
  • X (np.array) – Feature array, on which the KMeans clustering is run

  • n_samples (int) – number of samples that should be selected

  • **kwargs – additional keyword arguments passed to the KMeans

Returns

selected_indices

Return type

np.array

pypal.pal.utils.get_maxmin_samples(X, n_samples, metric='euclidean', init='mean', seed=None, **kwargs)[source]

Greedy maxmin sampling, also known as Kennard-Stone sampling (1). Note that a greedy sampling is not guaranteed to give the ideal solution and the output will depend on the random initialization (if this is chosen).

If you need a good solution, you can restart this algorithm multiple times with random initialization and different random seeds and use a coverage metric to quantify how well the space is covered. Some metrics are described in (2). In contrast to the code provided with (2) and (3) we do not consider the feature importance for the selection as this is typically not known beforehand.

You might want to standardize your data before applying this sampling function.

Some more sampling options are provided in our structure_comp (4) Python package. Also note that this implementation is quite memory hungry.

References:
(1) Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11 (1), 137–148. https://doi.org/10.1080/00401706.1969.10490666.
(2) Moosavi, S. M.; Nandy, A.; Jablonka, K. M.; Ongari, D.; Janet, J. P.; Boyd, P. G.; Lee, Y.; Smit, B.; Kulik, H. J. Understanding the Diversity of the Metal-Organic Framework Ecosystem. Nature Communications 2020, 11 (1), 4068. https://doi.org/10.1038/s41467-020-17755-8.
(3) Moosavi, S. M.; Chidambaram, A.; Talirz, L.; Haranczyk, M.; Stylianou, K. C.; Smit, B. Capturing Chemical Intuition in Synthesis of Metal-Organic Frameworks. Nat Commun 2019, 10 (1), 539. https://doi.org/10.1038/s41467-019-08483-9.
(4) https://github.com/kjappelbaum/structure_comp

Parameters
  • X (np.array) – Feature array, this is the array that is used to perform the sampling

  • n_samples (int) – number of points that will be selected, needs to be lower than the length of X

  • metric (str, optional) – Distance metric to use for the maxmin calculation. Must be a valid option of scipy.spatial.distance.cdist (‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘wminkowski’, ‘yule’). Defaults to ‘euclidean’

  • init (str, optional) – either ‘mean’, ‘median’, or ‘random’. Determines how the initial point is chosen. Defaults to ‘mean’

  • seed (int, optional) – seed for the random number generator. Defaults to None.

  • **kwargs – additional keyword arguments passed to the cdist

Returns

selected_indices

Return type

np.array
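Both helpers are convenient for picking the initial points to measure before starting PAL. A usage sketch on toy data (keyword arguments would be forwarded to KMeans and cdist, respectively):

import numpy as np
from pypal.pal.utils import get_kmeans_samples, get_maxmin_samples

X = np.random.uniform(size=(200, 4))     # feature matrix of the design space

idx_kmeans = get_kmeans_samples(X, n_samples=10)               # closest to 10 centroids
idx_maxmin = get_maxmin_samples(X, n_samples=10, init="mean")  # greedy Kennard-Stone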

pypal.pal.utils.is_pareto_efficient(costs, return_mask=True)[source]

Find the Pareto efficient points. Based on https://stackoverflow.com/questions/32791911/fast-calculation-of-pareto-front-in-python

Parameters
  • costs (np.array) – An (n_points, n_costs) array

  • return_mask (bool, optional) – If True, return a boolean mask; otherwise, return a (n_efficient_points, ) integer array of indices. Defaults to True.

Returns

Boolean mask or integer array of indices of the Pareto efficient points, depending on return_mask

Return type

np.array
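A minimal usage sketch (the data is illustrative only; whether "efficient" refers to minimal or maximal values in each column follows the linked implementation):

import numpy as np
from pypal.pal.utils import is_pareto_efficient

costs = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 3.5], [4.0, 1.0]])

mask = is_pareto_efficient(costs)                         # boolean mask, shape (n_points,)
indices = is_pareto_efficient(costs, return_mask=False)   # integer indices instead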

Utilities for plotting

Plotting utilities

pypal.plotting.make_jointplot(y, palinstance, labels=None, figsize=(8.0, 6.0))[source]

Make a jointplot of the objective space

Parameters
  • y (np.array) – array with the objectives (measurements)

  • palinstance (PALBase) – “trained” PAL instance

  • labels (Union[List[str], None], optional) – Labels for the objectives. Defaults to None.

  • figsize (tuple, optional) – Figure size. Defaults to (8.0, 6.0).

Returns

fig

pypal.plotting.plot_bar_iterations(pareto_optimal, non_pareto_points, unclassified_points, ax=None)[source]

Plot stacked barplots for every step of the iteration.

Parameters
  • pareto_optimal (np.ndarray) – Number of Pareto optimal points for every iteration.

  • non_pareto_points (np.ndarray) – Number of discarded points for every iteration

  • unclassified_points (np.ndarray) – Number of unclassified points for every iteration

Returns

ax

pypal.plotting.plot_histogram(y, palinstance, ax)[source]

Plot histograms, with maxima scaled to one and different categories indicated in color

Parameters
  • y (np.ndarray) – objective (measurement)

  • palinstance (PALBase) – instance of a PAL class

  • ax (ax) – Matplotlib figure axis

pypal.plotting.plot_pareto_front_2d(y_0, y_1, std_0, std_1, palinstance, ax=None)[source]

Plot a 2D Pareto front, with the different categories indicated in color.

Parameters
  • y_0 (np.ndarray) – objective 0

  • y_1 (np.ndarray) – objective 1

  • std_0 (np.ndarray) – standard deviation objective 0

  • std_1 (np.ndarray) – standard deviation objective 1

  • palinstance (PALBase) – PAL instance

  • ax (ax, optional) – Matplotlib figure axis. Defaults to None.

Input validation

Methods to validate inputs for the PAL classes

pypal.pal.validate_inputs._validate_sklearn_gpr_model(model)[source]

Make sure that we deal with a GaussianProcessRegressor instance; if it is a fitted random or grid search instance, extract the model

Return type

GaussianProcessRegressor

pypal.pal.validate_inputs.base_validate_models(models)[source]

Currently no validation is performed, as the predict and train functions are implemented independently of the base class

Return type

list

pypal.pal.validate_inputs.validate_beta_scale(beta_scale)[source]
Parameters

beta_scale (Any) – scaling factor for beta

Raises

ValueError – If beta_scale is smaller than 0

Returns

scaling factor for beta

Return type

float

pypal.pal.validate_inputs.validate_coef_var(coef_var)[source]

Make sure that the coef_var makes sense

pypal.pal.validate_inputs.validate_coregionalized_gpy(models)[source]

Make sure that model is a coregionalized GPR model

pypal.pal.validate_inputs.validate_delta(delta)[source]

Make sure that delta is in a reasonable range

Parameters

delta (Any) – Delta hyperparameter

Raises

ValueError – Delta must be in [0,1].

Returns

delta

Return type

float

pypal.pal.validate_inputs.validate_epsilon(epsilon, ndim)[source]

Validate epsilon and return a np.array

Parameters
  • epsilon (Any) – Epsilon hyperparameter

  • ndim (int) – Number of dimensions/objectives

Raises
  • ValueError – If epsilon is a list there must be one float per dimension

  • ValueError – Epsilon must be in [0,1]

  • ValueError – If epsilon is an array there must be one float per dimension

Returns

Array of one epsilon per objective

Return type

np.ndarray

pypal.pal.validate_inputs.validate_gbdt_models(models, ndim)[source]

Make sure that the number of iterables is equal to the number of objectives and that every iterable contains three LGBMRegressors. Also, we check that at least the first and last models use quantile loss

Return type

List[Iterable]

pypal.pal.validate_inputs.validate_goals(goals, ndim)[source]
Create a valid array of goals: 1 for objectives that are to be maximized, -1 for objectives that are to be minimized.

Parameters
  • goals (Any) – List of goals, typically provided as strings ‘max’ for maximization and ‘min’ for minimization

  • ndim (int) – number of dimensions

Raises
  • ValueError – If goals is a list and the length is not equal to ndim

  • ValueError – If goals is a list and the elements are not strings ‘min’, ‘max’ or -1 and 1

Returns

Array of -1 and 1

Return type

np.ndarray

pypal.pal.validate_inputs.validate_gpy_model(models)[source]

Make sure that all elements of the list are GPRegression models

pypal.pal.validate_inputs.validate_interquartile_scaler(interquartile_scaler)[source]

Make sure that the interquartile_scaler makes sense

Return type

float

pypal.pal.validate_inputs.validate_ndim(ndim)[source]

Make sure that the number of dimensions makes sense

Parameters

ndim (Any) – number of dimensions

Raises
  • ValueError – If the number of dimensions is not an integer

  • ValueError – If the number of dimensions is not greater than 0

Returns

the number of dimensions

Return type

int

pypal.pal.validate_inputs.validate_njobs(njobs)[source]

Make sure that njobs is an int >= 1

Return type

int

pypal.pal.validate_inputs.validate_number_models(models, ndim)[source]

Make sure that there are as many models as objectives

Parameters
  • models (Any) – List of models

  • ndim (int) – Number of objectives

Raises

ValueError – If the number of models does not equal the number of objectives

pypal.pal.validate_inputs.validate_sklearn_gpr_models(models, ndim)[source]

Make sure that there is a list of GPR models, one model per objective

Return type

List[GaussianProcessRegressor]

The models package

Helper functions for GPR with GPy

Wrappers for Gaussian Process Regression models.

We typically use the GPy package as it offers most flexibility for Gaussian processes in Python. Typically, we use automatic relevance determination (ARD), where one lengthscale parameter per input dimension is used.

If your task requires training on larger training sets, you might consider replacing the models with their sparse version but for the epsilon-PAL algorithm this typically shouldn’t be needed.

For kernel selection, you can have a look at https://www.cs.toronto.edu/~duvenaud/cookbook/. Matérn, RBF and RationalQuadratic are good quick-and-dirty solutions but have their caveats

pypal.models.gpr._get_matern_32_kernel(NFEAT, ARD=True, **kwargs)[source]

Matern-3/2 kernel (with ARD by default)

Return type

Matern32

pypal.models.gpr._get_matern_52_kernel(NFEAT, ARD=True, **kwargs)[source]

Matern-5/2 kernel (with ARD by default)

Return type

Matern52

pypal.models.gpr._get_ratquad_kernel(NFEAT, ARD=True, **kwargs)[source]

Rational quadratic kernel (with ARD by default)

Return type

RatQuad

pypal.models.gpr.build_coregionalized_model(X_train, y_train, kernel=None, **kwargs)[source]

Wrapper for building a coregionalized GPR; it will have as many outputs as y_train.shape[1]. Each output will have its own noise term

Return type

GPCoregionalizedRegression

pypal.models.gpr.build_model(X_train, y_train, index=0, kernel=None, **kwargs)[source]

Build a single-output GPR model

Return type

GPRegression

pypal.models.gpr.predict(model, X)[source]

Wrapper function for the prediction method of a GPy regression model. It returns the standard deviation instead of the variance

Return type

Tuple[array, array]
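A usage sketch combining build_model and predict on toy data; the call to optimize_restarts is standard GPy hyperparameter optimization and an assumption about the intended workflow:

import numpy as np
from pypal.models.gpr import build_model, predict

X_train = np.random.uniform(size=(20, 3))
y_train = np.random.uniform(size=(20, 2))

# Single-output GPRegression for objective 0 (column 0 of y_train)
model = build_model(X_train, y_train, index=0)
model.optimize_restarts(3)

mu, std = predict(model, X_train[:5])   # standard deviations, not variances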

pypal.models.gpr.predict_coregionalized(model, X, index=0)[source]

Wrapper function for the prediction method of a coregionalized GPy regression model. It returns the standard deviation instead of the variance

Return type

Tuple[array, array]

pypal.models.gpr.set_xy_coregionalized(model, X, y, mask=None)[source]

Wrapper to update a coregionalized model with new data