Testing the Plasticity of Reinforcement Learning-based Systems

The dataset available for pre-release training of a machine-learning based system is often not representative of all possible execution contexts that the system will encounter in the field. Reinforcement Learning (RL) is a prominent approach among those that support continual learning, i.e., learning continually in the field, in the post-release phase. No study has so far investigated any method to test the plasticity of RL-based systems, i.e., their capability to adapt to an execution context that may deviate from the training one. We propose an approach to test the plasticity of RL-based systems. The output of our approach is a quantification of the adaptation and anti-regression capabilities of the system, obtained by computing the adaptation frontier of the system in a changed environment. We visualize such frontier as an adaptation/anti-regression heatmap in two dimensions, or as a clustered projection when more than two dimensions are involved. In this way, we provide developers with information on the amount of changes that can be accommodated by the continual learning component of the system, which is key to decide if online, in-the-field learning can be safely enabled or not.

Despite the evidence that DRL is used in the real world, few works focus on testing DRL-based systems [58,88]. Some works study how DRL-based systems behave in the presence of adversarial attacks [29,41], i.e., techniques that automatically craft inputs on which the trained agent performs very poorly. Uesato et al. [83] focus, instead, on finding initial configurations of a DRL-based system that lead to catastrophic failures. Ruderman et al. [17,62], on the other hand, use procedurally generated environments and a search process to find specific configurations of the environment in which the trained agent fails. However, none of those works focus on testing the plasticity of DRL-based systems, i.e., the extent to which a DRL agent under test is able to adapt to a changing environment that is different from the environment it was initially trained on. In fact, Uesato et al. and Ruderman et al. only evaluate a trained agent on a changed environment, rather than letting the agent continually learn on a changing environment to assess its adaptation capabilities. Indeed, one of the characteristics of RL that differentiates it from supervised learning is that an RL agent can potentially learn continually and autonomously even when the initial environment it was trained on evolves, since it does not require the presence of a supervisor during the training process. Therefore, testing the plasticity of an RL agent before deployment is important to understand its strengths and weaknesses in specific environment configurations that can arise in the real world.
In this article, we propose an approach to characterize the adaptation capabilities of the DRL agent under test in its environment. In particular, our approach takes as input a DRL agent trained in a parameterized environment. Then, it samples the parameter space defined by the environment parameters and trains the agent, in a continual learning mode, on the resulting environment configurations, with the objective of characterizing the adaptation frontier of the agent. The adaptation frontier consists of pairs of environment configurations that are close to each other in the parameter space and that trigger opposite adaptation behaviors of the agent (i.e., successful vs failed adaptation). After such search process, the parameter space is approximated to interpolate the missing values, in order to compute the adaptation volume that quantifies the adaptation capabilities of the agent in the environment configurations defined by the environment parameters. In addition to the adaptation volume metric, our approach provides the users with a visualization of the adaptation capabilities of the agent. Specifically, when the environment has two parameters, or alternatively when two parameters of interest are chosen at a time, we provide the users with two heatmaps: the adaptation heatmap, which visualizes the adaptation frontier of the agent, and the anti-regression heatmap, which shows whether the agent has any regression in the regions of the parameter space where it was able to behave correctly after adaptation. The adaptation (respectively, anti-regression) heatmap shows in green the regions of the parameter space where the agent adapts (respectively, does not regress), in red the regions where the agent does not adapt (respectively, regresses), and different shades between green and red to indicate the adaptation (respectively, anti-regression) probability.
If the user chooses to analyze the behavior of the given DRL agent considering more than two environment parameters, the visualization of the adaptation frontier is made possible through a combination of dimensionality reduction and clustering techniques. Afterwards, our approach employs decision trees so that the user can understand which environment parameters are crucial in characterizing the clusters of frontier points.
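To make the projection step concrete, the following sketch reduces hypothetical n-dimensional frontier points to two dimensions with a numpy-based PCA. The actual dimensionality reduction and clustering techniques used by the approach may differ, and the frontier points below (four CartPole-like parameter vectors) are invented for illustration.

```python
import numpy as np

def project_2d(points):
    """Project n-dimensional frontier points to 2D via PCA (SVD).

    Illustrative stand-in for whatever dimensionality reduction
    technique the tool actually employs.
    """
    X = np.asarray(points, dtype=float)
    X_centered = X - X.mean(axis=0)          # center each parameter dimension
    # Rows of Vt are the principal directions; keep the first two.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T             # shape: (num_points, 2)

# Hypothetical 4-parameter frontier points
# (masspole, lengthpole, masscart, cartfriction).
frontier = [[0.1, 0.5, 1.0, 0.0],
            [0.4, 0.5, 1.0, 0.0],
            [0.1, 2.0, 1.0, 0.0],
            [0.2, 1.0, 2.0, 0.1]]
proj = project_2d(frontier)
```

The 2D projection can then be clustered, and a decision tree fit on the original parameters against the cluster labels to explain which parameters separate the clusters.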
Our article makes the following contributions:
- The first approach that tests the adaptation and regression capabilities of a DRL-based system. Our approach characterizes the adaptation frontier of the agent in the parameter space defined by the environment and provides a visualization of such frontier;
- An implementation of our approach in a tool named AlphaTest, which is publicly available as open source software [44].

Reinforcement Learning Algorithms
There are several design choices that concern RL algorithms. In the following, we describe the tradeoffs among these choices and position the RL algorithms considered in our experiments within such tradeoffs. Even though the RL problem was initially addressed without deep learning (through dynamic programming techniques, Monte Carlo methods, and temporal difference learning [75]), it was the fusion with deep learning that made RL applicable to complex practical problems [47]. In particular, deep learning made it possible to scale RL algorithms to complex, high-dimensional state and action spaces (e.g., continuous spaces or high-dimensional discrete spaces like images). In DRL, non-linear function approximators (namely, neural networks) are used to estimate quantities that depend on states and/or actions. Specifically, neural networks can be used to estimate a policy π (either stochastic or deterministic), a value function V (or an action-value function Q), and/or a model of the environment. One of the fundamental design choices for RL algorithms is the usage of a model of the environment, i.e., a component that predicts how the environment evolves when an action is taken and the reward it gives. If the algorithm uses a model of the environment, either given or learned, then it is said to be model-based; otherwise it is said to be model-free. A model of the environment can be beneficial for the agent, since it can be used to look ahead by predicting the possible outcomes of its decisions. On the other hand, model-based algorithms rely on a model of the environment that could be biased. For example, a model may represent very well a specific part of the environment, but it might fail to capture other parts of it (e.g., because those other parts are difficult to explore). When the model is biased, it does not accurately represent the real conditions of the environment and, as a consequence, the resulting agent might behave poorly on it.
Often the choice between model-free and model-based algorithms depends on the task we want to solve with RL methods. If, for example, the environment has very complex dynamics, but the pattern for optimal behavior (i.e., the policy) is simple, then model-free algorithms may be the best choice. On the other hand, if the environment is simple to represent but the strategy to solve the task is complex, then model-based algorithms are more appropriate. Moreover, although model-free methods may be inferior to model-based algorithms in terms of sample efficiency (i.e., the amount of data the algorithm needs in order to obtain a new policy), they are applicable in more situations (e.g., also in those cases in which the environment is difficult to model) and, as a consequence, they are more popular. For this reason, we decided to focus our study on model-free algorithms, leaving the study of model-based algorithms for future work.
Model-free algorithms can be divided into three categories, namely, policy gradients, value-based, and hybrid methods. We discuss them in the context of DRL since in this article we use DRL algorithms belonging to such categories.

Policy Gradients.
In policy gradient methods, the policy π is represented by a neural network whose weights are updated by maximizing the expected return. This optimization is performed on-policy, which means that each update is carried out with data (i.e., states, actions, and rewards resulting from the interaction of the agent with the environment) coming only from the policy that produced that data. The state-of-the-art policy gradient algorithm is Proximal Policy Optimization (PPO) [67], whose focus is, at each step, to improve the policy as much as possible without causing a performance collapse. In fact, a large policy update may lead to a bad policy; such a policy will then be used for learning, which, in turn, may lead to even worse subsequent policies. On the other hand, if the improvement steps are too small, then learning will be slow. In order to achieve this goal, PPO has a mechanism called clipping that prevents the policy from changing too much, and the hyperparameter ϵ controls how much the new policy can change w.r.t. its previous version. Moreover, previous policy gradient methods learn from current experience and discard past experiences after gradient updates, which makes them sample inefficient. PPO, instead, is designed to reuse even experiences coming from old versions of the policy being updated.
Regarding the exploration-exploitation tradeoff, PPO trains a stochastic policy that lets the agent explore the environment in the initial phases of training. As training progresses, the policy becomes more and more deterministic and the agent is more encouraged to exploit than to explore.
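The clipping mechanism can be illustrated with a small numerical sketch of PPO's clipped surrogate objective; the probability ratios and advantages below are hypothetical, and ϵ = 0.2 is just a commonly used default.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).

    `ratio` is pi_new(a|s) / pi_old(a|s); `eps` bounds how far the new
    policy may move from the old one in a single update step.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# With a positive advantage, pushing the ratio beyond 1 + eps yields no
# extra objective, so there is no incentive for an overly large update.
obj_small = ppo_clip_objective(np.array([1.1]), np.array([1.0]))
obj_large = ppo_clip_objective(np.array([5.0]), np.array([1.0]))
```

Here the moderate update (ratio 1.1) is rewarded in full, while the extreme one (ratio 5.0) is capped at the value obtained at ratio 1 + ϵ.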

Value-Based.
In value-based methods, it is the action-value function Q that is represented with a neural network. The weights are not updated by maximizing the expected return; rather, the optimization is based on the Bellman equation. Moreover, the optimization is performed off-policy, which means that it can be done with data coming from any policy, not only from the policy that produced it. Once the action-value function Q is trained, the policy is obtained by choosing, for each state, the action that maximizes it. The most popular example of a value-based method is DQN [47], which actually started the field of DRL and was improved over the years with various optimizations [26,66,85]. The basic idea behind the DQN algorithm is to estimate the action-value function by solving the Bellman equation iteratively. When a non-linear function approximator (e.g., a neural network) is used to represent the action-value function, the iterative algorithm is not guaranteed to converge [47,81]. The sources of instability leading to divergence are addressed by using experience replay, i.e., a buffer of the agent's experiences (sequences of states, actions, and rewards) that are replayed randomly during the optimization process, and by using a separate neural network to represent the action-value function Q when computing the update targets. Adding a separate network makes the optimization process much more similar to supervised learning and more stable.
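As a sketch of the iterative Bellman update just described, the following computes a one-step target using action values produced by the separate network; all the numbers are hypothetical.

```python
import numpy as np

def dqn_target(reward, next_q_values, done, gamma=0.99):
    """One-step Bellman target y = r + gamma * max_a Q'(s', a).

    `next_q_values` come from the separate network, kept fixed for a
    number of updates to stabilize optimization; `done` zeroes the
    bootstrap term at episode boundaries.
    """
    return reward + gamma * (1.0 - done) * np.max(next_q_values)

# Hypothetical transition: reward 1, two available actions in s'.
y = dqn_target(reward=1.0, next_q_values=np.array([2.0, 3.0]), done=0.0)
y_terminal = dqn_target(reward=1.0, next_q_values=np.array([2.0, 3.0]), done=1.0)
```

The online Q network is then regressed toward y, exactly as a supervised model is regressed toward fixed labels.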
Since DQN does not represent the policy explicitly, it needs a way to interact with the environment (often called behavior policy). Specifically, DQN uses an ϵ-greedy policy specifying that the agent selects a greedy action with probability 1 − ϵ (i.e., it selects the optimal action in a certain state according to the current Q function) and selects a random action with probability ϵ (with ϵ annealed linearly from 1.0 to 0.1 over the first K time-steps, and fixed at 0.1 thereafter). This strategy implies that the agent mostly explores in the early training phase and mostly exploits at the end of training (a minimum amount of exploration is beneficial also in the final stages of training).
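The ϵ-greedy behavior policy with linear annealing can be sketched as follows; the schedule endpoints mirror the ones mentioned above (1.0 down to 0.1 over the first K time-steps), while the value of K and the Q-values are placeholders.

```python
import random

def epsilon(step, k=1_000_000, eps_start=1.0, eps_end=0.1):
    """Linear annealing from eps_start to eps_end over the first k steps,
    then fixed at eps_end (a minimum amount of exploration remains)."""
    if step >= k:
        return eps_end
    return eps_start + (eps_end - eps_start) * (step / k)

def epsilon_greedy(q_values, step, k=1_000_000):
    """Random action with probability eps, greedy action otherwise."""
    if random.random() < epsilon(step, k):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Early in training almost every action is random (exploration); late in training the agent mostly follows the current Q function (exploitation).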

Hybrid between Policy Gradients and Value-Based.
Algorithms belonging to this category use both policy gradients and value-based methods. One notable example of an algorithm from this category is Soft Actor Critic (SAC) [24]. SAC is based on the maximum entropy RL framework, in which the objective is to maximize both the expected return and the policy entropy (i.e., the degree of "randomness" of the policy). This is desirable because policies optimized for maximum entropy are more robust to unexpected environmental changes that are common in the real world (i.e., at test time). Moreover, maximum entropy policies promote exploration, hence the acquisition of diverse behaviors by the agent.
SAC has a mechanism to control the entropy of the policy through a coefficient α. Depending on the implementation, α is either set manually or automatically adjusted during training (the implementation of SAC we used [28] supports both). Moreover, similarly to DQN, SAC performs its update steps off-policy.
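A minimal numerical sketch of an entropy-regularized (SAC-style) one-step target, showing where the coefficient α enters; the quantities below are hypothetical scalars, whereas real implementations operate on batches of transitions and use two learned critics.

```python
import numpy as np

def soft_target(reward, next_q1, next_q2, next_logp, alpha=0.2, gamma=0.99):
    """Entropy-regularized one-step target used in SAC-style updates:
    y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s')).

    `alpha` trades off return against policy entropy; implementations
    either fix it manually or tune it automatically during training.
    """
    soft_value = np.minimum(next_q1, next_q2) - alpha * next_logp
    return reward + gamma * soft_value

# Hypothetical transition: the -alpha * log pi term adds a bonus for
# actions the policy considers unlikely, encouraging exploration.
y = soft_target(reward=1.0, next_q1=2.0, next_q2=2.5, next_logp=-1.0)
```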

APPROACH
Let E be an environment, defined as a set of n parameters {p_1, . . . , p_n}, and let A_E be an RL agent trained on E until the environment is solved or, alternatively, until the desired performance level is reached. Moreover, let E′ be a new environment with parameters {p′_1, . . . , p′_n} and A_E′ be the RL agent obtained by training A_E continually on E′ for a certain period of time T. More precisely, the task the agent needs to solve remains fixed (e.g., stabilizing the pole in a cartpole environment), but the environment parameters change (e.g., the mass of the pole increases), and the agent is asked to sequentially learn how to adapt to the changes. Such adaptation process is successful if the agent A_E′ solves the environment E′ (or reaches a desired level of performance on E′, when there is no binary notion of success); it is unsuccessful otherwise. Differently, techniques for Transfer Reinforcement Learning are not focused on the issues related to sequential (or continual) learning, but rather on finding ways to transfer knowledge acquired by an agent in one domain to another domain, in order to accelerate the learning process [39,53,77,90]. In the context of continual learning, the problem we want to address is to understand the frontier between the environments E′ in which the agent A_E can successfully adapt and the closest environments E′ in which it cannot. Intuitively, the adaptation frontier is defined as the set of all environment pairs (E_T, E_F) such that adaptation is successful on E_T and unsuccessful on E_F, and E_T and E_F are ϵ-close (i.e., their distance in the parameter space is lower than a user-defined threshold ϵ). When adaptation to a new environment E′ is successful, we are also interested in studying whether adaptation affects the capabilities of the agent on the original environment E it was trained on (i.e., whether there are regressions).
The goal of our approach is to find the adaptation frontier of the agent A_E efficiently (as a consequence, regressions will also be found efficiently), as an exhaustive exploration of the environment parameter space is infeasible due to its size and to the need to conduct a complete continual training session for each parameter configuration. The adaptation and anti-regression frontiers help developers understand the strengths and weaknesses of the agent in variable environment configurations before deployment. Figure 1 shows an overview of our approach, composed of two phases. The first phase is the search phase (see ❶ in Figure 1), where we iteratively modify the environment parameters and let the agent continually learn on the modified environments, to evaluate whether it can successfully adapt to the changes or not. Since the assessment of a single environment configuration involves full continual learning of the agent in the changed environment, we adopt efficient search strategies that minimize the number of steps needed to find the "frontier" of the agent's adaptation capabilities. Then, in the second phase (see ❷ in Figure 1), we interpolate the missing points in the adaptation heatmap (see the parameter space approximation box in Figure 1), so as to make the computation of the adaptation volume possible.
The search phase (❶ in Figure 1) is divided into two sub-phases, namely, exponential search and binary search. We chose exponential and binary search for the following reasons. First of all, each search iteration requires a continual learning run of the agent in the changed environment. The high computational cost associated with a full continual learning session makes it incompatible with the use of population-based, evolutionary algorithms already proposed in the literature for frontier exploration [59]. Moreover, exponential and binary search have logarithmic computational complexity and, hence, are very efficient in terms of the number of steps required to achieve their respective objectives. Exponential search and binary search share some common operations. In particular, both of them generate search points, where a search point is defined as a 3-tuple ⟨E_{j,i}, P_A, P_R⟩. The first component of a search point is the environment configuration E_{j,i} at a particular iteration of the search: the first subscript j indicates the exponential or binary search iteration, whereas the second subscript i indicates the global search iteration. The second component is the adaptation probability P_A, which gives the probability that the agent trained on the original environment E_0 is able to adapt to the environment E_{j,i}. The choice of representing adaptation as a probability instead of a boolean predicate is due to the inherent non-determinism of DRL algorithms [27]. For each environment configuration, we train the agent multiple times with different random seeds, and we deem the adaptation successful if the agent adapts the majority of the times (i.e., P_A > 0.5, but the threshold is conventional and can be modified).
Finally, the last component is the regression probability P_R|P_A>0.5, which gives the probability that an agent that adapted successfully to the environment E_{j,i} forgets how to behave in the original environment E_0 (i.e., it regresses). To simplify the notation, we write E_k^{T|F} to indicate that, at a generic iteration k, the adaptation is either successful (i.e., P_A > 0.5, hence the adaptation predicate adapted(E_k) = T) or unsuccessful (i.e., P_A ≤ 0.5, hence the adaptation predicate adapted(E_k) = F).
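The probability P_A and the majority-vote adaptation predicate can be sketched as follows, assuming that the outcome of each of the n_tr continual-learning runs has already been reduced to a boolean.

```python
def adaptation_probability(run_outcomes):
    """P_A estimated over n_tr continual-learning runs with different seeds.

    `run_outcomes` is a list of booleans: True if the run satisfied the
    adaptation condition Ac on the changed environment.
    """
    return sum(run_outcomes) / len(run_outcomes)

def adapted(run_outcomes, threshold=0.5):
    """Majority vote: adaptation succeeds when P_A > threshold (0.5 by
    convention, but the threshold can be modified)."""
    return adaptation_probability(run_outcomes) > threshold

# Hypothetical example: 3 successes out of 5 runs -> P_A = 0.6 > 0.5.
p_a = adaptation_probability([True, True, True, False, False])
```

P_R|P_A>0.5 would be computed the same way, restricted to the runs that adapted successfully.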
The objective of the exponential search is to find an environment configuration in which the agent is not able to adapt. In other words, the output of the exponential search is a search point where P_A ≤ 0.5, namely, E_i^F. Then, the binary search component takes E_i^F as input and looks for a frontier pair, defined as follows:

Definition 1 (Frontier Pair). A frontier pair is a pair of search points E_k^F and E_h^T such that dist(E_k^F.E, E_h^T.E) < ϵ.

Hence, the objective of binary search is to find two environment configurations that are close to each other (according to the distance dist) and that trigger different behaviors of the agent, i.e., the agent is able to adapt to one environment configuration and unable to adapt to the other. In practice, for a pair of frontier points E_k^F.E = {p_1, . . . , p_n} and E_h^T.E = {q_1, . . . , q_n}, we can define a distance function dist that computes the average relative parameter change:

dist = (1/n) Σ_{i=1}^{n} |p_i.v − q_i.v| / |p_i.v|

The underlying assumption we make in Definition 1 is that the values the parameters defining the environment E can assume are real numbers, i.e., p_i.v ∈ R, q_i.v ∈ R, ∀i ∈ [1, n]. When dist measures a relative change, as in the formula above, we can choose a value of ϵ that ensures a small percentage change, such as ϵ = 0.05.
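A sketch of the ϵ-closeness check between two configurations, using an average relative parameter change as the distance; the exact normalization of the paper's dist may differ (here each change is taken relative to the first configuration's value, assumed non-zero), and the configurations below are invented.

```python
def dist(env_a, env_b):
    """Average relative parameter change between two configurations,
    each given as a list of parameter values."""
    assert len(env_a) == len(env_b)
    return sum(abs(p - q) / abs(p) for p, q in zip(env_a, env_b)) / len(env_a)

def is_frontier_pair(env_failed, env_adapted, eps=0.05):
    """epsilon-close configurations that trigger opposite adaptation
    outcomes form a frontier pair (Definition 1)."""
    return dist(env_failed, env_adapted) < eps

close = is_frontier_pair([2.0, 0.1], [2.05, 0.1])  # 1.25% average change
far = is_frontier_pair([2.0, 0.1], [4.0, 0.4])     # far apart in the space
```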
During the search phase, search points are stored in a dataset D_S, whereas the frontier pairs produced by the binary search are stored in a dataset D_f. Such datasets are the input to the second phase of our approach (❷ in Figure 1), namely, the volume computation phase. In fact, in order to quantify the adaptation capabilities of a DRL algorithm and how much it regresses when adaptation is successful, we want to compute the volume underlying the adaptation (respectively, the regression) frontier. To this aim, we first map each environment configuration in the search points onto an n-dimensional grid (n = |E_0|), where the ith dimension represents the range of values of parameter p_i and the value of each cell in the grid represents the adaptation probability (respectively, the regression probability) of the agent in that environment configuration. Then, for each grid cell that does not have a probability value, because it was not covered during exponential/binary search, we iteratively approximate its value based on the probability values of its neighbors (e.g., by majority voting), in order to get a completely filled grid (we call this sub-phase parameter space approximation). Then, the grid counting sub-phase simply counts the number of cells with P_A > 0.5, as regards the adaptation capabilities, and the number of cells with P_R|P_A>0.5 ≤ 0.5, as regards the anti-regression capabilities of the agent.
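The parameter space approximation and grid counting sub-phases can be sketched for n = 2 as follows; here unknown cells are filled with the mean of their known neighbors rather than by majority voting, and the sparse grid is invented for illustration (at least one cell must be known for the filling to terminate).

```python
import numpy as np

def fill_grid(grid):
    """Iteratively fill unknown cells (NaN) with the mean probability of
    their known 4-neighbors, until the grid is completely filled."""
    grid = grid.astype(float).copy()
    while np.isnan(grid).any():
        filled = grid.copy()
        for i, j in zip(*np.where(np.isnan(grid))):
            neighbors = [grid[i + di, j + dj]
                         for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                         if 0 <= i + di < grid.shape[0]
                         and 0 <= j + dj < grid.shape[1]]
            known = [v for v in neighbors if not np.isnan(v)]
            if known:
                filled[i, j] = sum(known) / len(known)
        grid = filled
    return grid

def adaptation_volume(grid):
    """Grid counting: number of cells where P_A > 0.5."""
    return int((grid > 0.5).sum())

nan = float("nan")
sparse = np.array([[1.0, nan, 0.0],   # cells covered by the search keep
                   [nan, 0.5, nan],   # their measured probability; the
                   [1.0, nan, 0.0]])  # rest (NaN) are approximated
full = fill_grid(sparse)
volume = adaptation_volume(full)
```

The anti-regression volume is counted analogously, over the cells where P_R|P_A>0.5 ≤ 0.5.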
If the environment E given as input is described by two parameters (i.e., p_1 and p_2, with n = 2), which might be two parameters of interest selected among all possible parameters, the grid with the mapped search points can be represented in a two-dimensional plot (see the bottom left corner in the middle of Figure 1). The points in the plot represent environment configurations (E_{ij} = (p_1.v_i, p_2.v_j)) and their shapes and colors indicate the adaptation probability (the same applies to the regression probability plot as well). In the example, red stars are environment configurations in which the adaptation probability is 0, whereas green squares are environment configurations in which the adaptation probability is 1.0 (e.g., the original environment the agent was trained on). Frontier pairs are represented by yellow diamonds and greenish circles, where the adaptation probability is, respectively, slightly below 0.5 and slightly above it.
The output of our approach is, in general, (1) the adaptation volume, which tells the users of our approach how much the given DRL algorithm (or agent) is able to adapt when the initial environment changes, together with (2) a visualization of the frontier pairs sampled by the search (when n > 2, the visualization of the frontier pairs is built using a dimensionality reduction technique that maps n-dimensional vectors to two-dimensional ones). Our approach also outputs (3) the anti-regression volume, which quantifies how much the DRL algorithm is able to remember how to behave in the original environment (i.e., does not have regressions) when adaptation is successful. When n = 2, or alternatively when two parameters of the environment are chosen at a time, the approach outputs two-dimensional heatmaps that visually show the behavior of the DRL algorithm across the parameter space, regarding both adaptation and regression, as shown on the right hand side of Figure 1. While the adaptation heatmap (see the top right corner of Figure 1) usually has a continuous frontier, indicated with a black solid line, the anti-regression heatmap, defined only within the adaptation frontier (the gray region indicates the part of the parameter space where the anti-regression heatmap is not defined), is often discontinuous. On the bottom right corner of Figure 1, such anti-regression frontier is indicated as a continuous dashed line only for illustration purposes.

Motivating Example

Figure 2 shows the CartPole environment [6,54], which is one of the environments we used in our evaluation (see Section 4). CartPole, first described by Barto et al. [5], is an inverted pendulum attached through an un-actuated joint to a cart. The cart moves on a track, which is frictionless in the default configuration, and the whole system is controlled by a force pushing it either to the right (+1) or to the left (−1) with the same magnitude. The pole starts upright and the objective is to move the cart in a way that prevents the pole from falling over. At every time-step in which the pole stays upright, the agent gets a +1 reward. The task of controlling the cart is episodic and there are three conditions that determine the end of an episode: (1) the pole falls over more than 15° from the vertical, (2) the cart moves more than 2.4 units from the center, or (3) the number of time-steps in the episode exceeds 500. The action space of the agent is discrete, and the agent can only decide to either move the cart left or right. Moreover, this task (or environment) is solvable, meaning that the developers of the environment specified a condition under which the task is considered solved. In particular, for the CartPole environment, if an agent gets an average reward over 100 testing episodes that is greater than or equal to 475, then the task is deemed solved.
One possible way of parameterizing the CartPole environment is by defining it in terms of four parameters, namely, masspole, lengthpole, masscart, and cartfriction. We define a parameter p as a 4-tuple ⟨v_0, v, L, m⟩, where v_0 is the default or initial value of the parameter, v is the current value, L is a pair ⟨l, h⟩ representing the valid range for the values of the parameter (l stands for the low limit, h for the high limit), and m > 0 is a constant value called the multiplier. The multiplier is computed before the exponential search and guides it on how to modify the initial value v_0 so as to make the environment more challenging for the agent.
The default or initial value v_0 for each parameter is already defined by the developers of the environment, whereas the current value v is computed by the search phase at each iteration of the approach. The only variable of each parameter that needs to be specified by the users of our approach is L. The limits L of a parameter can be constrained by the environment itself (e.g., physical limits of the simulator) or, alternatively, they can be specified based on the values on which the user wants to analyze their agent. Moreover, the limits determine the direction, and hence the value, of the multiplier m, following which the environment becomes more challenging for the agent. Assuming that there is only one direction that makes the environment challenging, either toward increasing |v_0| or decreasing it toward 0 (if both directions make the environment challenging, we just instantiate our approach twice, once per direction), one of the values in L is equal to v_0. If L.l = v_0 then m > 1, otherwise L.h = v_0 and m ∈ (0, 1). For example, the parameter masscart has v_0 = 1.0 in the CartPole environment and, by increasing it, the cart becomes more difficult to control, since the magnitude of the force applied to the cart remains the same. In this case, as for the other parameters of the CartPole environment, the multiplier m > 1 and l = v_0.
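The parameter 4-tuple and the constraint tying the limits to the multiplier direction can be sketched as follows; the masscart upper limit of 20.0 and the multiplier value are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Param:
    """Environment parameter as the 4-tuple <v0, v, L, m> from the text."""
    v0: float         # default/initial value, set by the environment
    v: float          # current value, updated by the search phase
    L: tuple          # (l, h): valid range; one endpoint equals v0
    m: float = -1.0   # multiplier, computed before exponential search

def challenging_direction_ok(p):
    """Constraint relating limits and multiplier direction: if L.l == v0,
    the environment gets harder as the value grows (m > 1); otherwise
    L.h == v0 and it gets harder as the value shrinks (0 < m < 1)."""
    l, h = p.L
    if l == p.v0:
        return p.m > 1.0
    return h == p.v0 and 0.0 < p.m < 1.0

# CartPole's masscart: v0 = 1.0, harder as mass increases, so l = v0
# and m > 1 (the limit 20.0 and m = 2.0 are illustrative).
masscart = Param(v0=1.0, v=1.0, L=(1.0, 20.0), m=2.0)
```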
The users of our approach need to specify both the adaptation and the regression conditions (respectively, Ac and Rc) for the environment, which determine when adaptation (respectively, regression testing) is deemed successful (respectively, failed). In practice, one can introduce a percentage performance degradation (e.g., 20% for Ac and 5% for Rc) that is considered acceptable, respectively, during adaptation and regression. For instance, for the CartPole environment, Ac returns true if the agent gets at least an average reward of 380, corresponding to 475 (solved environment) minus 20%. Regarding the regression condition Rc, it returns false (i.e., there are no regressions) if the agent gets at least an average reward of 450, corresponding to 475 (solved environment) minus 5%. More generally, the users of our approach are allowed to define such conditions based on specific knowledge of the environment of interest, for example, based on safety conditions that must not be violated.
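For CartPole, the two conditions reduce to simple threshold checks on the average reward of a trace; the thresholds below are the ones derived in the text (380 for adaptation, 450 for regression).

```python
SOLVED_REWARD = 475.0   # CartPole's "solved" threshold
AC_THRESHOLD = 380.0    # 475 minus a 20% acceptable degradation
RC_THRESHOLD = 450.0    # roughly 475 minus a 5% acceptable degradation

def Ac(trace):
    """Adaptation condition: True when the average reward of the trace on
    the changed environment reaches the degraded threshold."""
    return sum(trace) / len(trace) >= AC_THRESHOLD

def Rc(trace):
    """Regression condition: True (the agent regressed) when the average
    reward on the original environment falls below the stricter threshold."""
    return sum(trace) / len(trace) < RC_THRESHOLD
```

Users with domain knowledge can replace these averages with stricter predicates, e.g., safety conditions that must never be violated.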

Algorithm
Algorithm 1 shows the pseudocode of our approach, as implemented in AlphaTest. The algorithm takes as input the initial environment E_0, where all the limits L_i for each parameter p_i have already been set, the current values v_i are initialized to the respective original values v_{0,i}, and all the multipliers are set to −1. The agent A needs to be trained on E_0 until the desired performance is reached (when we refer to an agent trained on E_0 we simply write A instead of A_{E_0}, whereas when the agent is trained on another environment E′ ≠ E_0, we explicitly indicate that by writing A_{E′}). The user needs to specify the constant ϵ, which determines how close the two search points in a frontier pair should be (see Definition 1), and the number of training runs n_tr used to compute P_A and P_R. Moreover, the user needs to specify two functions, Ac and Rc, to check the adaptation and regression conditions. Such functions take a trace as input and output a boolean value indicating whether the respective condition is satisfied. A trace Tr is the output of the training or the evaluation of an agent (indicated in Algorithms 2 and 3 as the train and eval functions, respectively). In our empirical evaluation, the training and evaluation traces contain the rewards the agent obtains over a certain number of episodes.

Search Phase
The search phase (see Algorithm 1) starts by calling the procedure determineMultipliers, shown in Algorithm 2, which sets the multipliers for all the parameters in E_0. The objective of this procedure is, for each parameter p, to find a value p.m such that p.m × p.v gives a new environment (different from E_0) in which the agent trained on E_0 fails to adapt (Lines 8-11). In order to make this phase efficient, we skip continual learning and simply apply (i.e., evaluate) the pre-trained agent on the new environment. In fact, for the computation of the multipliers, it is enough to approximate the behavior of the agent after continual learning with the evaluation of the initially trained agent. Correspondingly, at Line 11 we improperly use the adaptation condition, as at Line 10 we evaluate the agent A without training it on the new environment. Although evaluation alone is not enough to deem the adaptation of the agent to a new environment successful or unsuccessful (since adaptation would also involve training and learning how to behave in the changed environment), it is much cheaper computationally, and it proved sufficient to find good multipliers, which is the overall goal of this procedure. In other words, we use evaluation as a proxy for training to quickly find environment configurations that potentially challenge the adaptation capabilities of the agent.
Let us now consider all the steps of Algorithm 2 in detail. At Line 3, we set the initial value of the i-th parameter to its default value v_0. Then, at Lines 4-7, we infer the challenging direction by looking at the limits ⟨l, h⟩ of the parameter p: we double p.v at every step (Line 9) if such direction is toward h (c = 2 at Line 4), while we halve it otherwise (c = 0.5 at Line 5). The loop at Lines 8-11 stops when either p.v is beyond the given limits or the adaptation condition Ac is false. Finally, the multiplier for the i-th parameter is set at Line 12.

ALGORITHM 1: Pseudocode of AlphaTest

For example, let us suppose that for our running example CartPole we have two parameters p_1 and p_2; in both cases v_0 = p.L.l, hence c = 2. To simplify the notation, we indicate an environment configuration as (p_1.v, p_2.v). Let us suppose that the configuration E = (2.0, 0.1) is such that a given DRL agent A does not satisfy Ac during evaluation (i.e., by evaluating the agent A on E ≠ E_0 over a certain number of episodes n_ea, the average reward is less than 380). Then, p_1.m = 2.0/1.0 = 2.0. As for p_2, let us suppose that the first environment configuration for which Ac is not satisfied is (1.0, 0.4); hence, p_2.m = 0.4/0.1 = 4.0. For brevity, we omitted two edge cases from Algorithm 2. The first regards the situation in which p.v_0 < 0: in such a case we still work with positive numbers (|p.v|) and change the sign of the specific parameter when creating the environment with the specific environment configuration. Second, if the limits are such that it is not possible to falsify the adaptation condition, we automatically decrease/increase the limits l, h until Ac is falsifiable.
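Under the simplifying assumption above (evaluation as a proxy for training), the core loop of Algorithm 2 can be sketched as follows; `evaluate_ac` is a hypothetical stand-in for evaluating the pre-trained agent on the changed configuration and checking Ac:

```python
# Hedged sketch of determineMultipliers (Algorithm 2), simplified:
# for each parameter, double (or halve) its value until the evaluation of the
# pre-trained agent violates the adaptation condition, then record the
# multiplier m = v / v_0.

def determine_multiplier(v0, low, high, evaluate_ac):
    # Challenging direction (Lines 4-7): double toward the high limit when the
    # default value sits at the low limit, halve otherwise (a simplification).
    c = 2.0 if v0 == low else 0.5
    v = v0
    while low <= v <= high and evaluate_ac(v):
        v = v * c
    return v / v0  # multiplier p.m (Line 12)

# Toy example: the agent "fails" once the parameter exceeds 3.5.
m = determine_multiplier(1.0, low=1.0, high=10.0,
                         evaluate_ac=lambda v: v <= 3.5)
print(m)  # 4.0: 1.0 -> 2.0 -> 4.0, and Ac fails at 4.0
```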
The multipliers computed by determineMultipliers are used to calculate the environment configurations at Line 5 of Algorithm 1. Specifically, for each parameter p, we multiply the current value p.v by the corresponding multiplier p.m until one of the limits (either p.L.l or p.L.h) is reached. By combining those values across all parameters, we construct the environment configurations to be explored during exponential search and save them into envs. Each element env_i is a pair ⟨E_i, exec⟩, where exec = T|F tells us whether the environment configuration E_i was executed or not during exponential search. Considering our running example, the values for p_1 are obtained by repeatedly multiplying its original value v_0 = 1.0 by p_1.m = 2.0, up to the limit p_1.L.h (and similarly for p_2).
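A minimal sketch of this construction follows; the limit values below are hypothetical, and we assume the limit itself is included when the progression overshoots it, which matches boundary configurations such as (10.0, 4.0) appearing in the running example:

```python
# Hedged sketch: from each parameter's multiplier, generate the candidate
# values up to its limit, then combine them into the environment
# configurations explored by exponential search (Line 5 of Algorithm 1).
from itertools import product

def candidate_values(v0, m, low, high):
    values, v = [], v0 * m
    while low <= v <= high:
        values.append(v)
        v *= m
    # Assumption: include the limit itself when the progression overshoots it,
    # so that boundary configurations such as (10.0, 4.0) are explored.
    bound = high if m > 1 else low
    if not values or values[-1] != bound:
        values.append(bound)
    return values

def build_env_configs(params):
    # params: list of (v0, m, low, high) tuples, one per parameter
    axes = [candidate_values(*p) for p in params]
    return [{"E": cfg, "exec": False} for cfg in product(*axes)]

envs = build_env_configs([(1.0, 2.0, 1.0, 10.0),   # p1: 2.0, 4.0, 8.0, 10.0
                          (0.1, 4.0, 0.1, 4.0)])   # p2: 0.4, 1.6, 4.0
print(len(envs))  # 12 configurations, none executed yet
```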

Exponential Search.
Algorithm 3 shows the exponential search function called by the main program (Algorithm 1, Line 7). The main loop at Lines 2-22 executes at least once (the function assumes that there is at least one environment to execute) and continues until a search point with P_A ≤ 0.5 is found. At Line 3, by calling the function chooseNotExec, we choose (e.g., randomly) one environment configuration that was not executed. Then, at Line 4, dominanceAnalysis is performed to look for an already executed environment (i.e., one whose corresponding search point belongs to the set D_S, computed by Algorithm 1) that dominates or is dominated by the candidate environment to be executed, env.E. More specifically, dominanceAnalysis looks for any failing environment E_F ∈ D_S that is dominated by the current environment env.E, or alternatively for any succeeding environment E_T ∈ D_S that dominates the current environment env.E. In both cases, the current environment does not need to be executed.

ALGORITHM 3: Exponential search function

Formally, we can define dominance as follows. Let us introduce the inequality symbol >_c that points to the more challenging direction of a parameter (i.e., x_1 >_c x_2 reads as x_1 > x_2 if the parameter makes the environment more challenging when its value increases; it reads as x_1 < x_2 otherwise). Figure 3 shows the dominance relationship between two environments. On the x axis we have the values of the parameter p_1, whereas on the y axis we have the values of the parameter p_2. For both parameters, the direction that makes the environment more challenging for the agent is the positive direction. In Figure 3(a), the environment E = (x_2, y_2) dominates the already executed environment E_F.E = (x_1, y_1), where the agent does not adapt (i.e., P_A ≤ 0.5). Given the dominance relation between the two environments, we can say that the agent will not be able to adapt to E without carrying out the training process. The reason is that E is more challenging for the agent than E_F.E, since x_2 > x_1 and y_2 > y_1. As a consequence, if the agent is not able to adapt to E_F.E, it will not be able to adapt to the more challenging environment E. On the other hand, Figure 3(b) shows the case in which the environment E = (x_1, y_1) is dominated by an already executed environment E_T.E = (x_2, y_2) where the agent adapts (i.e., P_A > 0.5). For a similar reason, we can conclude that the agent will be able to adapt to E without actually training it on E, since E is less challenging than E_T.E. Such dominance analysis is performed both in exponential search and in binary search, and it reduces the computation time of the search process by skipping unnecessary executions.
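The dominance check itself reduces to a per-parameter comparison along the challenging direction; a sketch (the representations and function names are our own):

```python
# Hedged sketch of the dominance check: challenging_dirs[i] is +1 if
# increasing parameter i makes the environment harder, -1 otherwise.

def more_challenging(x1, x2, direction):
    """x1 >_c x2: x1 is at least as challenging as x2 along one parameter."""
    return x1 >= x2 if direction > 0 else x1 <= x2

def dominates(e1, e2, challenging_dirs):
    """True if environment e1 is at least as challenging as e2 on every
    parameter, i.e., e1 dominates e2."""
    return all(more_challenging(a, b, d)
               for a, b, d in zip(e1, e2, challenging_dirs))

# Figure 3(a): (x2, y2) dominates the failing (x1, y1) when x2 > x1, y2 > y1
print(dominates((4.0, 1.6), (2.0, 0.4), (+1, +1)))  # True
print(dominates((2.0, 0.4), (4.0, 1.6), (+1, +1)))  # False
```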
The dominanceAnalysis function returns the first search point E_d that satisfies the dominance relation described above; otherwise it returns null. If E_d exists, then the adaptation/regression probabilities of env.E are approximated with the probabilities of the already executed search point (Lines 7-8). Otherwise, at Line 10 the train function is called: it performs n_tr training runs of the agent A on the environment env.E. Upon each training run, the function Ac is called on the training trace produced by the run; if Ac returns true, then the agent is saved in As_env.E, the list of successfully trained agents. The adaptation probability is computed at Line 11 as the ratio between the number of times the agent is able to adapt and the number of training runs n_tr. If the adaptation probability is > 0.5, to measure the regression probability P_R we perform an evaluation run in the original environment for each successful agent (A_env.E ∈ As_env.E; see the for loop at Lines 14-17). The regression probability is computed at Line 18 as the ratio between the number of times the regression predicate is true (i.e., |{Rp ∈ Rps : Rp = T}|) and the number of evaluation runs (i.e., |As_env.E|).
Considering our CartPole environment and assuming n_tr = 3, let us suppose that the first environment chosen by the exponential search is E_1 = (4.0, 1.6) and that P_A of the agent on E_1 is 1.0, i.e., the agent adapts in all three training runs: after training the agent on E_1, the agent is evaluated on E_1, and it gets at least 380 points of reward in all three runs. Since the adaptation probability is > 0.5, the regression probability is also computed at Lines 12-19. The for loop at Line 14 is executed n_tr = |As_env.E| = 3 times, in which we may get P_R = 0.0: in all such runs the Rc function returns false, which means that all successful agents, when evaluated on E_0 over a certain number of episodes n_er, get an average reward of at least 450.
Then, the do-while loop continues, and the next environment chosen at Line 3 could be E_2 = (2.0, 0.3). E_2 is dominated by E_1, in which the agent adapts. Therefore, the adaptation probability of the agent on E_2 is inferred to be 1.0 and the regression probability to be 0.0, without actually executing the training process on E_2 (Lines 7-8). Finally, let us suppose that the next environment being sampled is E_3 = (10.0, 4.0) and that in this environment the agent is not able to adapt, i.e., its adaptation probability on E_3 is P_A = 0.0. For this environment, the regression probability is not computed, since the condition at Line 12 is false: we do not compute the regression probabilities of agents whose adaptation is not successful.

Binary Search. Algorithm 4 shows the next step of the search phase, i.e., the binary search phase, called by the main program (Algorithm 1, Line 9). At Line 2, a new search point is constructed whose environment is the initial environment, which by definition has P_A = 1.0 and P_R = 0.0, since the agent A was trained on E_0. Then, the main loop at Lines 3-17 executes until the condition for a frontier pair (see Definition 1) is satisfied. For brevity, we did not include in the algorithm the timeout that is necessary for the while loop to terminate when ϵ is too small for the given environment and agent. In such a situation, the algorithm returns the frontier pair whose distance is closest to ϵ when the timeout expires.
At Line 4, either E_T or E_F is randomly chosen as the starting point for the construction of a new environment (initially, the latter is the output of exponential search). Then, the loop at Lines 5-8 performs the binary search operation by randomly choosing a parameter index (Line 6) and by assigning (Line 7) to the parameter at that index the average of the values of the corresponding parameters in the two environments E_T (where the agent is able to adapt) and E_F (where it is not). The loop ends when a new search point is found that does not belong to the set of already computed/evaluated search points. For simplicity, we omitted the further stopping condition that makes the loop terminate when all the possible combinations of environment configurations have been tried; in that case, we return f = null, which is not included in the dataset of frontier pairs. At Line 9, the new search point E is evaluated in the same way as in the exponential search (see Algorithm 3, Lines 4-20). Then, given the P_A of E (Line 10), either E_T (Line 11) or E_F (Line 13) is reassigned. At Line 16, the distance between the two candidate frontier environments E_T and E_F is computed; if the condition for a frontier pair is satisfied at Line 17, the main loop of the binary search function stops, and the frontier pair is returned, together with the dataset of search points produced in the given iteration.
Considering our running example, the environment of the search point returned by the exponential search in the previous step is E_3 = (10.0, 4.0) and the original environment is E_0 = (1.0, 0.1). Let us suppose that at Line 4 we choose E_F and that at Line 6 the index is 0. Then, the new environment is E_4 = ((1.0 + 10.0)/2 = 5.5, 4.0). Let us suppose that the adaptation probability of the agent on E_4 is P_A = 0.33, which means that it adapts once out of three times, since n_tr = 3. Since P_A ≤ 0.5, we say that the agent is not able to adapt on E_4, and E_F is reassigned to the search point associated with E_4. Assuming that we chose ϵ = 0.5, the current search points E_T and E_F do not form a frontier pair, since the distance between them, computed at Line 16, is 1.64 > ϵ = 0.5 (for reference, the two environments to consider are E_0 = (1.0, 0.1) and E_4 = (5.5, 4.0)). Therefore, the main loop continues; let us say that we again select the search point E_F at Line 4, which now contains E_4 as environment. At Line 6, this time we select the second parameter, i.e., index 1, obtaining the new environment E_5 = (5.5, (4.0 + 0.1)/2 = 2.05). Let us suppose that the agent A is able to adapt to E_5, i.e., P_A = 0.67 > 0.5 (the agent adapts twice out of three times), and therefore E_T is reassigned to the search point associated with E_5. This time, the two environments to consider for the calculation of the frontier pair are E_4 = (5.5, 4.0) and E_5 = (5.5, 2.05), and the distance between them, computed at Line 16, is 0.32 ≤ ϵ = 0.5. Hence, the current search points E_F = ⟨E_4, P_A = 0.33, P_R⟩ and E_T = ⟨E_5, P_A = 0.67, P_R⟩ form a frontier pair f, which the binary search function returns.
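The refinement loop of Algorithm 4 can be sketched as follows; `estimate_pa` is a hypothetical stand-in for the n_tr training runs that produce P_A, and the toy frontier used in the usage example is our own:

```python
# Hedged sketch of the binary-search refinement (Algorithm 4): starting from
# an adapting environment e_t and a failing one e_f, repeatedly average one
# randomly chosen parameter and reassign e_t or e_f depending on the measured
# adaptation probability.
import random

def refine(e_t, e_f, estimate_pa, steps, rng=random.Random(42)):
    for _ in range(steps):
        new_env = list(rng.choice([e_t, e_f]))   # Line 4: pick a starting point
        i = rng.randrange(len(new_env))          # Line 6: pick a parameter index
        new_env[i] = (e_t[i] + e_f[i]) / 2.0     # Line 7: midpoint on one axis
        new_env = tuple(new_env)
        if estimate_pa(new_env) > 0.5:           # agent adapts: move the T side
            e_t = new_env
        else:                                    # agent fails: move the F side
            e_f = new_env
    return e_t, e_f

# Toy frontier at p1 = 6.0: the agent adapts iff p1 < 6.0
e_t, e_f = refine((1.0, 0.1), (10.0, 4.0),
                  estimate_pa=lambda e: 1.0 if e[0] < 6.0 else 0.0,
                  steps=20)
print(e_t[0] < 6.0 <= e_f[0])  # True: the pair brackets the frontier
```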

Volume Computation Phase
The next phase of our approach is the volume computation phase, which starts at Line 14 in Algorithm 1. It consists of the interpolation of the missing adaptation/regression probabilities from the existing ones over the entire grid (parameter space approximation sub-phase), followed by the count of the grid points with P_A > 0.5 and, for the regression volume, of the points with P_R ≤ 0.5 among those with P_A > 0.5 (grid counting sub-phase).
For parameter space approximation, the main algorithm calls the nearestNeighbor function, which takes as input the dataset of search points D_S produced by the search and returns two fully filled grids, namely A_gr, the adaptation grid, and R_gr, the regression grid. First, the nearestNeighbor function creates an n-dimensional grid, where n is the number of parameters in the initial environment E_0. The grid will be filled with an adaptation probability value for each combination of parameters (for brevity we only refer to A_gr, but the same procedure applies to R_gr). In particular, for each parameter p_i, we compute the step size of the array a_i forming the i-th dimension of the grid as:

step_i = (p_i.L.h − p_i.L.l) / (g × 100),

where g is the granularity parameter that determines how fine the resolution of the values of p_i (i.e., p_i.v) in the i-th dimension of the grid is (with g = 1 we get a grid containing 100 discrete slots along each dimension). The higher the parameter g, the higher the resolution. The number of values nv = len(a_i) that p_i can assume in the i-th dimension is g × 100; therefore, the grid will be a hyper-square matrix with dimensions nv × · · · × nv = nv^n. It is initialized with null values.
Once the grid A_gr is built, the next step is to map the search points in D_S into the grid. For each search point E_j ∈ D_S, considering its associated environment E_j.E, and for each parameter p_i of the environment (i.e., E_j.E[i]), the index in the i-th dimension of the grid of the search point E_j is computed as:

index_j[i] = ⌊(E_j.E[i] − min a_i) / (max a_i − min a_i) × (nv − 1)⌉,

where nv is the length of each array a_i (we subtract 1 because the index starts from 0), min a_i is the minimum value of the parameter p_i in the grid (correspondingly, max a_i is the maximum), and ⌊·⌉ indicates that the product is approximated to the nearest integer. Once we have the index in the grid corresponding to the environment E_j.E, we insert the adaptation probability value into the associated grid cell: A_gr[index_j] = E_j.P_A. However, if the granularity g is too small, there could be collisions, i.e., multiple search points, and consequently multiple environment configurations, with the same index in the grid. In that case, the value in A_gr[index_j] is the mean of the adaptation probabilities of the colliding search points. At this point, A_gr contains the adaptation probabilities of all the search points in D_S, together with null values where the search procedure did not sample any point. To compute the volume that quantifies the capability of adaptation of the agent over the possible environment configurations, we need to approximate the missing values of the adaptation probabilities. We apply the nearest neighbor technique: we replace each null value in A_gr with the mean value of the neighboring points, when there exists at least one neighboring value different from null. First, the grid is scanned for indices with null values. Then, the values of these null cells are updated in batch, and the procedure is repeated until there are no more indices with null values.
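A sketch of the index computation (assuming g = 1, i.e., nv = 100 slots per dimension; the function name is our own):

```python
# Hedged sketch of mapping a parameter value into the grid: the value's
# position within [min, max] is rescaled to [0, nv - 1] and rounded to the
# nearest integer.

def grid_index(value, lo, hi, g=1.0):
    nv = int(g * 100)                     # number of slots per dimension
    frac = (value - lo) / (hi - lo)       # position within the limits
    return round(frac * (nv - 1))         # nearest integer, 0-based

idx = grid_index(5.5, lo=1.0, hi=10.0)
print(idx)  # 50: 4.5/9 of the range, times 99, rounded
```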
By performing a batch update of all null entries at the same time, we ensure that the nearest neighbor approximation does not depend on the order in which the indices are scanned, as would happen if the updates were in-place. The neighborhood to consider for each update is composed of 3^n − 1 points, where n is the number of parameters of the environment E, i.e., the number of dimensions of the grid A_gr. In fact, along each dimension a point has two neighbors, obtained by incrementing or decrementing its index. This means that a neighborhood consists of 3 points per dimension, or 3^n points overall, excluding the initial, unchanged point (hence, 3^n − 1). Once the grid A_gr is fully filled, we can compute the adaptation volume (Line 15 in Algorithm 1) by counting the number of points in the grid that have an adaptation probability value > 0.5, normalized by the total number of points in the grid.
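The batch nearest-neighbor fill and the volume computation can be sketched on a small two-dimensional grid as follows (a simplified version of the procedure; the grid values are our own):

```python
# Hedged sketch of the batch nearest-neighbor fill and volume computation on
# a small 2-D grid (None marks cells not sampled by the search). Assumes at
# least one cell is non-null, so that the fill eventually terminates.

def batch_fill(grid):
    n = len(grid)
    while any(v is None for row in grid for v in row):
        updates = {}
        for i in range(n):
            for j in range(n):
                if grid[i][j] is None:
                    # up to 3^2 - 1 = 8 neighbors around (i, j)
                    neigh = [grid[a][b]
                             for a in range(max(0, i - 1), min(n, i + 2))
                             for b in range(max(0, j - 1), min(n, j + 2))
                             if (a, b) != (i, j) and grid[a][b] is not None]
                    if neigh:
                        updates[(i, j)] = sum(neigh) / len(neigh)
        for (i, j), v in updates.items():  # batch update: order-independent
            grid[i][j] = v
    return grid

def adaptation_volume(grid):
    cells = [v for row in grid for v in row]
    return sum(1 for v in cells if v > 0.5) / len(cells)

g = [[1.0, None, 0.0],
     [None, None, None],
     [1.0, None, 0.0]]
batch_fill(g)
print(adaptation_volume(g))  # 0.3333...: 3 of 9 cells end up with P_A > 0.5
```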
If the environment has two parameters (Line 16 in Algorithm 1), then we also build a two-dimensional adaptation probability heatmap from A_gr, for visualization by users. The color in each two-dimensional heatmap cell represents the adaptation probability, ranging from red (P_A = 0.0) to green (P_A = 1.0). However, instead of plotting the grid directly, we apply a further interpolation function that makes the transitions (in yellow) between the adaptation part (in green) and the non-adaptation part (in red) smoother than they are on the grid, so as to make the adaptation frontier more visible in the heatmap.
On the other hand, if the environment has more than two parameters, besides the adaptation volume, we provide the users with a two-dimensional visualization of the frontier based on the t-SNE (t-distributed Stochastic Neighbor Embedding) [84] dimensionality reduction technique. t-SNE preserves the local structure, such that frontier points that are close in the n-dimensional space remain close to each other in the two-dimensional space. Although the information on the actual shape of the adaptation frontier is necessarily lost in the two-dimensional space, the clusters resulting from the application of the dimensionality reduction technique give the users an idea of the regions of the parameter space in which the frontier points represent similar environments. In order to better show the separation between the different frontier pairs, we apply clustering to the output of the dimensionality reduction method. In particular, we use the K-means clustering algorithm, and we determine the optimal number of clusters k* by performing silhouette analysis on a range of candidate values of k. The silhouette score falls within the range [−1, 1], and we choose as k* the k that gives the highest silhouette score, resulting in dense and well-separated clusters. Then, once each frontier pair has been assigned a cluster label, we train a Decision Tree that determines the critical environment parameters for which a certain frontier pair belongs to a cluster rather than another. Finally, we plot the frontier pairs as given by the dimensionality reduction technique, within colored regions indicating the respective clusters they belong to, together with the decision tree plot. Specifically, Figure 4(a) contains the search points explored during the exponential and binary search phases (including the ones discussed above); hence, we can set the value at index_3 = (19, 19) in A_gr to be E_3.P_A = 0.0, i.e., A_gr[index_3] = 0.0.
After repeating the same mapping procedure for all the other search points, we get the heatmap in Figure 4(a). The next step is to approximate the missing adaptation probability values (blank cells in the figure) using the nearest neighbor algorithm. For example, let us take the missing value with coordinate (6.0, 2.2), i.e., the central point of the black square in the middle of the heatmaps. The neighborhood of that point is composed of 3^2 − 1 = 8 points, since n = 2. These are the points delimited by the black square, excluding the central point, i.e., the point under analysis. There are two other values in the neighborhood of (6.0, 2.2); therefore, the adaptation probability value for such point is inferred to be (0.67 + 0.33)/2 = 0.5. In Figure 4(b), we can see that the point is on the frontier delimited by the yellow curved region.
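The t-SNE/K-means/silhouette pipeline described above for environments with more than two parameters can be sketched with scikit-learn as follows (the synthetic frontier points are our own; AlphaTest operates on the actual frontier pairs):

```python
# Hedged sketch of the visualization pipeline for > 2 parameters: project the
# frontier points with t-SNE, then pick the number of clusters k* by
# silhouette analysis over a range of candidate k values.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two synthetic groups of 4-D frontier points (stand-ins for frontier pairs)
frontier = np.vstack([rng.normal(0.0, 0.1, size=(15, 4)),
                      rng.normal(3.0, 0.1, size=(15, 4))])

embedded = TSNE(n_components=2, perplexity=5, init="random",
                random_state=0).fit_transform(frontier)

best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedded)
    score = silhouette_score(embedded, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k)  # likely 2 for two well-separated groups
```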

Implementation
We implemented our approach in an open-source tool called AlphaTest, written in Python [44]. The environments we considered in our empirical evaluation are available from the Gym library [10] (v. 0.16.0), and the DRL algorithms we trained on those environments are implemented in the stable-baselines library [28] (v. 2.10.1), a fork of the popular baselines [14] library from OpenAI. For plotting the heatmaps, we used seaborn and matplotlib, whereas for dimensionality reduction we used the t-SNE method [84] implemented in the scikit-learn Python library [55]. We also used scikit-learn for the k-means and decision tree implementations. For the smoothing technique, we used the radial basis function method of the scipy Python library [13].

EMPIRICAL EVALUATION
We consider the following research questions:

RQ 1 (effectiveness):
How effective is AlphaTest in finding frontier pairs that characterize the adaptation behavior of a DRL algorithm in a given environment, w.r.t. a random exploration of the parameter space?

RQ 1 aims at empirically comparing AlphaTest with the random approach on the characterization of the adaptation frontier. We deem a given approach effective when it is able to characterize the adaptation frontier accurately (i.e., with high resolution across the entire parameter space), which is important to understand the regions of the parameter space where the agent is able to adapt and the regions where it is not.

RQ 2 (discrimination):
How does AlphaTest discriminate between different DRL algorithms that exhibit different degrees of plasticity?
The goal of RQ 2 is to determine whether AlphaTest is able to discriminate a DRL algorithm that has high adaptation capabilities from another one with low adaptation capabilities. The discrimination capability is crucial in practical scenarios to guide the choice of the most appropriate algorithm, i.e., the one that adapts better overall or in a desired region of the parameter space.

RQ 3 (hyperparameters):
What are the critical hyperparameters of AlphaTest and what is the best way to fine tune them?
In RQ 3, we want to study the impact of the hyperparameters of AlphaTest on its performance and on the discrimination metrics. Such empirical assessment will guide the users of AlphaTest in choosing a proper tradeoff, depending on the computational resources available and the algorithm under test.

Table 1 shows the environments we considered in our evaluation, together with the associated parameters. We kept the same parameter names that can be found in the source code of the respective environments, except in the case of CartPole, where we added a cartfriction parameter, taken from the implementation of Patanjali et al. [54], that was not present in the original implementation. Moreover, we modified the constructors of the environment classes to receive the values of such parameters, which were previously hardcoded. Table 1 lists, for each environment, its parameters ranked by criticality:

Environment          | Parameters (ranked by criticality)
CartPole [5,6]       | length, cartfriction, masspole, masscart
Pendulum [51]        | dt, length, mass
MountainCar [48,49]  | force, gravity, goalvelocity
Acrobot [21,73,74]   | linklength_1, linkcompos_1, linkmass_2, linkmass_1

We already described the CartPole environment in the previous section (see Section 3.1) as our motivating example, together with the parameters that we vary during the search process. The Pendulum environment [51] can be described as an inverted pendulum, like the CartPole environment, but the problem is framed in a different way. In the Pendulum environment, the pendulum starts in a random position, and the goal is to swing it up so that it stays upright. The reward function depends on the angle of the pendulum, and it gives the agent the maximum reward when the pendulum is upright. The action space is continuous and consists of a single action, corresponding to the torque applied to the joint of the pendulum. The observation space is also continuous, and it is a vector of two components, namely, the pendulum angle θ w.r.t. the rest position and the angular velocity θ̇.
The Pendulum environment is episodic and the condition for the end of an episode is based on a fixed maximum number of time-steps (i.e., 200). Pendulum is an unsolved environment, as opposed to the CartPole environment, meaning that it does not have a specified reward threshold at which it is considered solved. The parameters that can be changed in such environment are dt, length, and mass. The latter two parameters are relative to the length and the mass of the pendulum pole, respectively, while the former determines the rate at which the angle θ of the pendulum can change.

Subject Systems
In the MountainCar environment [49], described for the first time by Moore et al. [48], a car starts positioned between two mountains on a one-dimensional track. The objective is to reach the goal position up the right mountain. The car engine is not powerful enough to reach the goal in a single pass; therefore, the car needs to drive back and forth between the two mountains to build up momentum. The agent receives a negative reward at every time-step, so that it is encouraged to reach the goal as soon as possible. The action space is discrete: the agent can choose to apply a force to push the car to the right, to apply the same force to push the car to the left or, alternatively, not to apply any force. The observation space, instead, is continuous, and it is a one-dimensional vector containing the position of the car and its linear velocity. The MountainCar environment is also episodic, and an episode ends when 200 time-steps have passed or the agent reaches the goal position. Like the CartPole environment, the MountainCar environment is solvable, and it is considered solved when the agent gets an average reward of at least −110.0 over 100 consecutive evaluation episodes. The parameters that can be changed in this environment are force, gravity, and goalvelocity. The force parameter represents the magnitude of the action applied to the car to move it either to the left or to the right. The gravity parameter also affects the difficulty of the task, and it can be considered equivalent to changing the mass of the car. Lastly, the goalvelocity parameter imposes a constraint on the velocity that the car needs to have when reaching the goal position.
The Acrobot environment [73], first described by Sutton et al. [74] and later refined by Geramifard et al. [21], is a system composed of two joints and two links, where the joint between the two links is actuated. The initial position of the system is with the two links hanging downwards, and the goal of the agent is to bring the end of the lower link to a given height. The agent gets a reward of −1 for each time-step and a reward of 0 when it manages to reach the goal. The action space of the agent is discrete: it can either apply a positive torque on the actuated joint between the two links, a negative torque of the same magnitude or, alternatively, no torque. The observation space, instead, is continuous, and it consists of sin(·) and cos(·) of the two rotational joint angles θ_1, θ_2, and the joint angular velocities θ̇_1, θ̇_2. The Acrobot environment is episodic, and an episode ends when either the goal is reached or 500 time-steps have passed. The Acrobot environment has a reward threshold at which it is considered solved, namely −100 over 100 evaluation episodes. The parameters that can be changed in this environment are linklength_1 (length of the first link; changing the length of the second link has no effect), linkcompos_1 (position of the center of mass of the first link), linkmass_2 (mass of the second link), and linkmass_1 (mass of the first link).

Subject Algorithms
The DRL algorithms we selected for our empirical evaluation are Proximal Policy Optimization (PPO) [67], Soft Actor Critic (SAC) [24], and DQN [47]. They belong to the categories presented in Section 2, respectively, policy gradients (PPO), hybrid (SAC), and value-based (DQN) methods. Moreover, these DRL algorithms are mature and widely used and as such many model-free RL libraries provide a stable implementation of them.
The implementation of SAC we used [28] only supports continuous action spaces, while DQN only supports discrete action spaces. The PPO implementation supports both continuous and discrete action spaces, but we always chose to use PPO in its discrete version. Therefore, we modified the Pendulum environment to also support a discrete action space for the two possible actions. This way a DQN agent, as well as a PPO agent with discrete actions, can either swing the pendulum left with maximum torque or right with the same torque magnitude (we did not include a do nothing action). Moreover, we also modified the CartPole, MountainCar, and Acrobot environments to support continuous action spaces, so that a SAC agent would be able to control them. In particular, we defined such continuous action spaces as a linear interpolation between the minimum and the maximum force that can be applied to the system. In all environments with discrete action spaces, the left and right actions correspond to the extremes of the respective continuous action spaces and the do nothing action (in the MountainCar and Acrobot environments) is the middle point in the continuous space.
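The mapping between the discrete actions and the continuous action spaces we defined can be sketched as a linear interpolation (the force range in the usage example is hypothetical, and the function name is our own):

```python
# Hedged sketch of how discrete actions map onto a continuous action space:
# the continuous space is a linear interpolation between the minimum and
# maximum force; discrete extremes map to the endpoints, and "do nothing"
# (where present) maps to the middle point.

def discrete_to_continuous(action, force_min, force_max, n_actions):
    # action in {0, ..., n_actions - 1}; endpoints are the left/right extremes
    return force_min + (force_max - force_min) * action / (n_actions - 1)

# MountainCar-style 3-action space over a hypothetical force range [-1, 1]
print([discrete_to_continuous(a, -1.0, 1.0, 3) for a in (0, 1, 2)])
# [-1.0, 0.0, 1.0]: left, do nothing, right
```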

Procedure.
We trained all DRL algorithms under test on the environments with the default parameters. The hyperparameters of the DRL algorithms were, in part, taken from the open-source repository rl-baselines-zoo, which contains both hyperparameters and trained agents for the DRL algorithms implemented in stable-baselines [28]. We changed the given hyperparameters only when it was possible to achieve a better cumulative reward and/or to decrease the training time. The final set of hyperparameters for each DRL algorithm was chosen as the one that gave the highest average reward over 10 distinct training runs (each one with a different random seed). Such set of hyperparameters was then used to train the model that we used in our evaluation, i.e., the model under test. In particular, when training the model under test, we evaluated it for 100 episodes every N time-steps, where N is a fraction (i.e., 10%) of the total training time measured in number of time-steps, which changes for every algorithm and every environment. The best model was chosen as the one with the highest average reward over all the evaluation runs. Table 2 shows the performance, both in terms of average reward and in terms of training time, of the best models produced by each DRL algorithm on the respective environments. In particular, Column Avg reward shows the average reward a trained agent gets over 100 evaluation episodes. Column Time to train indicates the time taken, in minutes, to train a certain DRL algorithm on the given environment. The table also shows that, for all the environments that have a reward threshold (i.e., they are solvable; see Column Reward threshold), the average reward over 100 episodes for all the DRL algorithms is above the threshold.
For the Pendulum environment, which does not have a reward threshold, we took the average reward obtained when training the SAC algorithm with the hyperparameters we found in the rl-baselines-zoo repository, and we trained both the PPO and DQN algorithms to achieve the same average performance. In order to enable continual learning for the selected DRL algorithms, we had to change some of their specific hyperparameters. Since no guidelines exist on how to best set the hyperparameters of DRL algorithms for continual learning, we acted mainly on those hyperparameters that control the amount of exploration of the agent during training. Indeed, when continual learning is enabled, the objective is to preserve the performance achieved in the previous training phase while also learning the new behaviors that might be necessary in the new, changed environments. In order to preserve the previous performance, we saved the replay memory resulting from the training phase on the original environment for DQN and SAC since, being off-policy algorithms (see Section 2), they learn from transitions stored in such memories. Then, when continual learning is enabled for such algorithms, the memory is restored and, as the subsequent training phase proceeds, the older knowledge is replaced by the newer one, coming from the interactions of the agent with the new environment. Regarding the specific hyperparameters of the DQN algorithm (described in Section 2.2.2), when continual learning is enabled, we set the initial ϵ of the ϵ-greedy policy to be equal to the final value of ϵ resulting from the previous training phase (usually a small value, such as 0.01). For the SAC algorithm (described in Section 2.2.3), the entropy regularization coefficient α that controls the exploration-exploitation tradeoff is automatically adjusted in the implementation we used; therefore, we did not act on this parameter when enabling continual learning.
Finally, for the PPO algorithm (described in Section 2.2.1), before enabling continual learning, we modified the parameter ϵ that controls how much the new policy can be different from the old policy. This parameter is usually in the interval [0.1, 0.3], but we noticed that in continual learning the policy changes in a way that the previous performance cannot be restored if the value of such parameter is set within this interval. Therefore, we set ϵ = 0.08 before starting the training phase on the new environment. Moreover, we checked that the hyperparameters for all the DRL algorithms were reasonably set by running 10 continual learning runs for half of the training time on the same environment where a certain DRL algorithm was originally trained. Then, we made sure that in all the runs the performance of the agent (average reward over 100 evaluation episodes) after each continual learning phase matched the initial performance (i.e., the one reached at the end of the initial training phase). In other words, we made sure that we could restore the performance of an agent on the same environment it was trained on and maintain it for a certain number of timesteps.
In order to choose the environment parameters to consider for evaluation, we instantiated each environment with all the possible parameters (i.e., CartPole 4, Pendulum 3, MountainCar 3, and Acrobot 4), computed the multipliers for each parameter according to Algorithm 2, and stopped AlphaTest (see line 2 in Algorithm 1). The adaptation condition is the same for all environments, i.e., we deem the adaptation successful if the average reward of the agent trained and evaluated in an environment does not fall below 20% of the average reward the agent gets in the original environment. Likewise, the regression condition is also the same for all environments, with a regression threshold of 5%. Since each discovered multiplier is such that the adaptation condition returns false, we could compute the precise drop in performance of the agent w.r.t. the average reward the agent had achieved after the initial training phase. Then, for each environment, we ranked the parameters that are potentially more critical in terms of adaptation based on such performance drops. The results of this analysis are shown in Table 1, where the parameters of each environment are ranked by their performance drops. For example, the parameters length and cartfriction are the most critical in the CartPole environment. The parameter masspole is the next most critical among the four and masscart is the least critical. We have chosen to prioritize the parameters based on criticality because evaluating all the possible combinations of parameters for all DRL algorithms would have been too computationally expensive.
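Under one possible reading of the conditions above (the drop in average reward w.r.t. the original environment must not exceed the threshold), they can be sketched as:

```python
# Hedged sketch of the adaptation and regression conditions: we interpret the
# thresholds as the maximum tolerated performance drop w.r.t. the original
# environment (20% for adaptation, 5% for regression). This reading is an
# assumption of the sketch, since other normalizations are possible.
def adaptation_and_regression(avg_reward, original_reward,
                              adapt_drop=0.20, regress_drop=0.05):
    drop = (original_reward - avg_reward) / abs(original_reward)
    return drop <= adapt_drop, drop <= regress_drop
```

For example, under this reading, with an original average reward of 100 a reward of 95 satisfies both conditions, 85 satisfies only the adaptation condition, and 70 satisfies neither.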
Once we had set the multipliers for each parameter, we determined the respective limits, i.e., L = ⟨l, h⟩. In particular, we obtained the limits of each parameter by dividing the lower discovered limit and multiplying the upper one by 4, in order to have a reasonably large range of values for each parameter. Given a certain parameter, the multipliers can be different for each DRL algorithm and, therefore, the respective limits also differ. Hence, to be able to compare the output of different DRL algorithms, when we construct the adaptation grids and compute the volumes, we consider the limits associated with the smallest range of values across all DRL algorithms. For example, if for the CartPole environment, considering the parameter length, the limits for SAC are L_SAC = ⟨l = 0.5, h = 10.0⟩ and the limits for PPO are L_PPO = ⟨l = 0.5, h = 5.0⟩, we construct the adaptation grids considering L_PPO. In this way, we are sure that the comparison is fair and the adaptation volumes are comparable. Regarding the anti-regression volume, we computed it as the number of points in the grid within the adaptation frontier of each algorithm (i.e., with P_A > 0.5) that have P_R ≤ 0.5. The anti-regression volume is normalized over the total number of grid points within the adaptation frontier. For this reason, the normalization factor can be different for every algorithm, even considering the same environment configuration. Therefore, anti-regression volumes of different algorithms within the same environment configuration are not comparable in an absolute way, but only relative to the respective adaptation volumes.
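The volume computation just described can be sketched as follows (normalizing the adaptation volume over the whole grid is an assumption of this sketch; only the normalization of the anti-regression volume is fixed by the text above):

```python
# Sketch of the volume computation on an adaptation grid. Each grid point
# carries an adaptation probability P_A and a regression probability P_R.
# Anti-regression volume: points with P_A > 0.5 and P_R <= 0.5, normalized
# over the points within the adaptation frontier (P_A > 0.5). Normalizing
# the adaptation volume over the whole grid is an assumption of this sketch.
def volumes(grid):
    inside = [(p_a, p_r) for p_a, p_r in grid if p_a > 0.5]
    adaptation_volume = len(inside) / len(grid)
    anti_regression_volume = (
        sum(1 for _, p_r in inside if p_r <= 0.5) / len(inside) if inside else 0.0
    )
    return adaptation_volume, anti_regression_volume
```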
To form a baseline for comparison, we devised a random exploration procedure of the parameter space. In particular, at each iteration, we randomly choose a value within the defined range of each parameter and then train the agent on the resulting environment. The resulting search point E may have an adaptation probability either below or above the threshold (we do not compute the regression probability in this case). Then, to understand whether the randomly selected search point E belongs to the adaptation frontier, we determine the neighborhood of E by modifying the environment parameters in such a way that the resulting search points are at a distance ϵ from the already executed search point E, according to the distance function in Equation (1). In particular, for each parameter, we considered two values, respectively greater and smaller than the current parameter value, that respect the distance function; therefore, the neighborhood of an executed search point is composed of n × 2 search points, where n is the dimension of the search space (number of parameters being searched). The resulting neighboring search points E_i are executed and their adaptation probabilities are computed. If a pair (E, E_i) satisfies Definition 1, i.e., their adaptation probabilities are one below and one above the threshold, a frontier pair is found.
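The neighborhood construction can be sketched as follows (treating the distance of Equation (1), which is not reproduced here, as a per-coordinate offset of ϵ is an assumption of the sketch):

```python
# Sketch of the neighborhood of a search point: for each of the n parameters,
# one value greater and one smaller than the current value, yielding n * 2
# neighbors. Treating the distance as a per-coordinate offset of eps is an
# assumption on the distance function of Equation (1).
def neighborhood(point, eps):
    neighbors = []
    for i, value in enumerate(point):
        for delta in (eps, -eps):
            neighbor = list(point)
            neighbor[i] = value + delta
            neighbors.append(tuple(neighbor))
    return neighbors
```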
In order to establish a fair comparison between the random baseline and AlphaTest, we determined the number of iterations of the random baseline in the following way. For each environment and for each DRL algorithm, we first executed AlphaTest for five repetitions and computed the maximum number of search points M resulting from the experiments. We then set the number of iterations of the random method to be ⌊M / (n × 2 + 1)⌉, where ⌊·⌉ denotes rounding to the nearest integer. In fact, for each iteration, the random method executes n × 2 + 1 search points, i.e., one search point determined at random plus the n × 2 search points in its neighborhood.
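The iteration budget of the random baseline follows directly from the formula above:

```python
# Number of iterations of the random baseline: the maximum number of search
# points M executed by AlphaTest, divided by the n * 2 + 1 points that each
# random iteration executes, rounded to the nearest integer.
def random_iterations(max_points, n_params):
    return round(max_points / (n_params * 2 + 1))
```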
For both the random method and AlphaTest, we set the number of repetitions to five for each experiment (in order to compare the two methods statistically), where each experiment is a pair (environment configuration, DRL algorithm); the number of training runs for probability estimation (rpe) to 3 (to account for the randomness of DRL algorithms, the adaptation probability is estimated with multiple continual learning runs on each environment configuration); and the continual learning time (clt) to half of the initial training time (measured in number of time-steps). For each training run, the model is evaluated every N time-steps, where N is a fraction (i.e., 20%) of the total continual learning time, for 20 episodes, in order to evaluate the adaptation condition. If the adaptation condition returns true at some point during training, then training is stopped. Otherwise, training goes on until the continual learning time expires and the adaptation condition is also evaluated at the end.
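The continual learning loop with early stopping can be sketched as follows (`train_step` and `evaluate` are hypothetical callbacks standing in for one training step and for the 20-episode evaluation of the adaptation condition):

```python
# Sketch of a continual learning run: evaluate every N time-steps, where N is
# a fraction (20%) of the continual learning budget, and stop as soon as the
# adaptation condition holds; otherwise run until the budget expires (the
# last evaluation coincides with the end of the budget).
def continual_learning(train_step, evaluate, total_steps,
                       eval_frac=0.2, episodes=20):
    n = max(1, int(total_steps * eval_frac))
    for step in range(1, total_steps + 1):
        train_step()
        if step % n == 0 and evaluate(episodes):
            return step  # adaptation condition satisfied: stop early
    return None  # budget expired without satisfying the condition
```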

Metrics.
In order to characterize the adaptation frontier accurately, it is important to obtain the highest number of frontier pairs and to maximize their sparseness across the parameter space: the more frontier pairs are found in the parameter space and the more scattered they are, the better the adaptation frontier is characterized. Therefore, to assess effectiveness (RQ 1), we ran both AlphaTest and the random exploration on all the environments, varying only two parameters (see the first column of Table 1), for all DRL algorithms. We measured the number of frontier pairs found by AlphaTest and by random exploration, and computed both the p-value with the non-parametric Wilcoxon test and the Vargha-Delaney Â12 effect size to compare the two methods statistically [4]. We also measured the sparseness of the frontier pairs. In particular, for each frontier pair, we considered a search point with parameter values corresponding to the average of the parameter values of the two search points in the frontier and measured the pairwise distances between each resulting point and all the others. Each pairwise distance is normalized by dividing it by the maximum distance between two points in the parameter space. We then considered the average of such normalized distances as the sparseness of the frontier pairs.

We evaluated discrimination (RQ 2) by computing both the adaptation volume and the anti-regression volume for all the DRL algorithms and all the combinations of environment parameters (see Table 1) that we considered in our evaluation (the granularity hyperparameter for volume computation was set to 1, i.e., g = 1). We also computed the adaptation and anti-regression heatmaps for all the environment configurations with two parameters, and built frontier visualization plots by applying the t-SNE dimensionality reduction technique together with clustering and decision trees, for all the environment configurations with more than two parameters.
Concerning the hyperparameters (RQ 3), we measured the impact of the dominance option of AlphaTest on the number of runs skipped and, hence, on the time saved during search. Moreover, we measured the impact on the volume of increasing the number of runs for probability estimation from three to five. In particular, we measured the standard error of the mean (SEM) of the adaptation volume, i.e., the standard deviation of the adaptation volume divided by the square root of the number of runs used for probability estimation. We also varied the granularity of the adaptation grid. Specifically, we doubled the original value considered in RQ 2 and we halved it, to measure the impact both on the adaptation volume metric and on the percentage of search points that have a collision in the grid. Finally, for each environment, we took the worst-performing algorithm in terms of adaptation volume and carried out five additional repetitions of AlphaTest, increasing the continual learning time to be equal to the initial training time, in order to measure the impact of the continual learning time hyperparameter on the adaptation and anti-regression volumes.
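The SEM computation above can be sketched as follows (a minimal sketch, following the definition given in the text):

```python
import math
import statistics

# SEM of the adaptation volume: sample standard deviation of the volumes
# observed across repetitions, divided by the square root of the number of
# runs used for probability estimation (rpe), per the definition above.
def adaptation_volume_sem(volumes, rpe):
    return statistics.stdev(volumes) / math.sqrt(rpe)
```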

Results
Effectiveness (RQ 1 ). Table 3 shows the comparison between AlphaTest and random for all DRL algorithms in the four environments in terms of number of search points (Columns 1-2), number of frontier pairs (Columns 3-4), and sparseness of the pairs in the frontier (Columns 5-6).
We can notice that AlphaTest and random executed roughly the same number of search points, on average, although there is a slight advantage for random, which explores more search points in all environments. These data confirm that the comparison between AlphaTest and random is fair.
In terms of number of frontier pairs, Table 3 shows that AlphaTest finds more frontier pairs than random. Bold values indicate that the difference between the two means is statistically significant (i.e., the p-value is below 0.05) and underlining indicates when the magnitude of the effect size Â12 is large. In all but one case (DQN in CartPole 2), the number of frontier pairs found by AlphaTest is larger than that found by random, the difference between the two means is statistically significant, and the magnitude of the effect size is large.
Regarding sparseness, there is a significant difference only in two out of four environments, namely, CartPole 2 (considering PPO and SAC) and Acrobot 2. In the remaining environments, i.e., Pendulum 2 and MountainCar 2, the effect size is large in favor of AlphaTest when considering the SAC algorithm. The small values of sparseness in Pendulum 2, MountainCar 2, and Acrobot 2 are due to the frontier pairs being concentrated near the origin of the two-dimensional parameter space.

RQ 1: Considering the same number of search points, AlphaTest finds significantly more frontier pairs than random. Moreover, in two out of four environments and for most of the DRL algorithms, the frontier pairs found by AlphaTest are significantly more sparse than those found by random.

Discrimination (RQ 2). The first part of Table 4 (i.e., Columns 1-4) shows how AlphaTest discriminates between different DRL algorithms in different environments. In particular, Column 1 shows the average adaptation volume over five repetitions of AlphaTest and Column 2 shows the standard deviation of the adaptation volume in percentage. Similarly, Columns 3-4 show the average anti-regression volume and its standard deviation in percentage, respectively. The adaptation and anti-regression volumes were computed with granularity g = 1.0, except for the environments CartPole 4 and Acrobot 4, where the volumes were computed with g = 0.5 in order to save computation time. The adaptation volumes of different DRL algorithms can be compared within the same environment configuration, since the adaptation grids are constructed with the same parameter ranges. Regarding the anti-regression volumes, each normalization factor depends on the number of points within the adaptation frontier of the specific DRL algorithm. Therefore, each anti-regression volume indicates how much a certain DRL algorithm is able to remember how to behave in the original environment (i.e., it does not regress) relative to its adaptation volume. For example, in CartPole 2 the PPO algorithm, on average, has regressions only within 5% of the adaptation frontier (i.e., the anti-regression volume is 95%).
Considering the average adaptation volumes, we marked in green (with suffix (1)) the best adaptation volume in each environment configuration, in yellow (with suffix (2)) the second best adaptation volume, and in red (with suffix (3)) the worst adaptation volume among the three DRL algorithms. We can see that PPO achieves the best adaptation volume in three environments out of four (i.e., in all except Pendulum) and it is never the worst among the three. SAC is the best in Pendulum but the worst in two environments, namely, CartPole and MountainCar. DQN, on the other hand, is never the best and it is the worst in Pendulum. Interestingly, the ranking is maintained in Pendulum and MountainCar when passing from two parameters, i.e., Pendulum 2 and MountainCar 2, to three parameters, i.e., Pendulum 3 and MountainCar 3. Moreover, also in CartPole and Acrobot the algorithm that obtains the best adaptation volume, i.e., PPO, remains the best when considering more than two parameters. On the other hand, in Acrobot the algorithm SAC moves from the second to the third position when considering three and four parameters, i.e., Acrobot 3 and Acrobot 4, being surpassed in the ranking by DQN (in Acrobot 2 the average adaptation volume of DQN is smaller than that of SAC only by a negligible amount). In CartPole, instead, the switch in the ranking between the second and third position takes place when moving from three to four parameters, i.e., from CartPole 3 to CartPole 4, and it involves SAC and DQN, with the latter moving from the second to the third position in CartPole 4. In summary, although PPO is the DRL algorithm that adapts best in three out of four considered environments, our experiments are not sufficient to deem PPO superior to the others in absolute terms, since the number of environments we trained PPO on is relatively small and we did not carry out any hyperparameter tuning, to which DRL algorithms have been shown to be sensitive [27].
The other interesting dimension along which we compare the selected DRL algorithms is the anti-regression volume. Figure 5 shows the tradeoff between adaptation and anti-regression volumes. In particular, Figure 5(a) shows such tradeoff in CartPole 2, whereas Figure 5(b) shows it in MountainCar 2. We can see that in CartPole 2, there is clearly a DRL algorithm, namely, PPO, that dominates the others, i.e., it has the highest adaptation volume and, at the same time, it also has the least amount of regression (i.e., the highest anti-regression volume). On the other hand, Figure 5(b) shows that there is no clear winner among the three algorithms applied to MountainCar 2. In fact, the Pareto front, indicated with a dashed black line, suggests that as the adaptation volume increases, there is also an increase in the amount of regressions that a certain algorithm has (i.e., the anti-regression volume decreases). In other words, in MountainCar 2 and for all DRL algorithms, adapting to new environments means forgetting how to behave in the original environment. For DQN, whose average adaptation volume increases by ≈39% w.r.t. the average adaptation volume of SAC, its average anti-regression volume decreases only by ≈2%. Instead, PPO has an average adaptation volume that is ≈80% higher than the one of DQN but its average anti-regression volume is ≈38% smaller. This could be due to the fact that DQN and SAC are both off-policy algorithms (see Section 2). The replay memory helps these algorithms to replay the previous experience while learning new behaviors, hence, forgetting less how to behave in the original environment than PPO which, being an on-policy algorithm, does not use a replay memory to learn. However, this tradeoff seems to be significant only when considering two parameters in an environment. 
When the number of parameters increases, correspondingly the adaptation volume of a certain algorithm decreases and its anti-regression volume tends to increase (except for DQN in Pendulum and DQN  from Acrobot 3 to Acrobot 4). One possible explanation of this phenomenon might be that, since the adaptation volume is smaller in higher dimensions, the new environments are more similar to the original environment (i.e., the origin) and, as a consequence, the algorithm has less regressions w.r.t. new environments that are not so far away from the origin.
Besides the quantification of the adaptation capabilities of a certain algorithm through the computation of the adaptation volume, our approach also produces the adaptation heatmap for a certain algorithm when the considered environment has two parameters. Figure 6 shows the adaptation heatmaps for PPO, SAC, and DQN (respectively, Figures 6(a), (b), and (c)). The adaptation frontier is the yellow continuous line between the green (adaptation successful) and the red (adaptation failed) regions of the heatmap, while the black dots indicate the search points sampled by the search procedure of AlphaTest. From the maps we can see that the parameter length is more critical in terms of adaptation capabilities than the cartfriction parameter, for all the algorithms. In particular, no DRL algorithm adapted when length = 4.0, although DQN seems to be the best at tolerating the increase of this parameter, whereas all DRL algorithms adapted when cartfriction = 51.20. The adaptation heatmaps are easily interpretable by developers, as they show the regions of the parameter space where the adaptation frontier lies and where we can expect a certain algorithm to successfully adapt or not when the initial conditions of the environment change.

Figure 7 shows the adaptation and anti-regression heatmaps of the DQN algorithm in the CartPole 2 environment (respectively, Figures 7(a) and (b), with Figure 7(a) being the same as Figure 6(c)). In the anti-regression heatmap (Figure 7(b)), the color code is reversed w.r.t. the adaptation heatmap: the heatmap is red where the regression probability is 1.0 and green where the regression probability is 0.0. The gray color indicates the region of the parameter space where the anti-regression heatmap is not defined, i.e., outside the boundary delimited by the adaptation frontier, indicated by the yellow continuous line in Figure 7(a).
We can notice that the anti-regression frontier is not continuous and that there can be islands of regressions inside regions of the parameter space where the algorithm does not regress (see the red region in Figure 7(b)).

When the environment has more than two parameters, AlphaTest provides a visualization of the adaptation frontier by means of the t-SNE dimensionality reduction technique and k-means clustering applied to the t-SNE lower-dimensional vectors. Figure 8(a) shows the frontier pairs found by AlphaTest in the CartPole 3 environment for PPO, i.e., by considering the three parameters length, cartfriction, and masspole. After reducing the search space dimensionality from 3 to 2 by means of t-SNE, we perform silhouette analysis to find the optimal number k of clusters produced by k-means. In Figure 8(a), the six resulting clusters are represented as regions with the same background color. A magenta star with a label indicating the cluster ID (class-i) is positioned at each cluster centroid. In the figure, the original environment is indicated with a blue circle and each frontier pair is displayed as two points, one green (where the algorithm adapted) and one red (where it did not). Then, we trained a decision tree to classify each point of a frontier pair according to the cluster it belongs to, based on its features, i.e., the values of the parameters for that particular environment. Figure 8(b) shows the decision tree trained on the frontier pairs, using the cluster IDs as class labels, for the clusters shown in Figure 8(a). The leaf nodes are highlighted in yellow and each leaf node is pure (i.e., its Gini impurity is 0.0). The decision tree tells us that the length parameter is not crucial for the classification of the frontier pairs, since it is present neither in the decision nodes nor in the root node. Moreover, the decision tree tells us in which regions of the parameter space the frontier pairs are clustered.
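This visualization pipeline can be sketched with scikit-learn as follows (the frontier points below are synthetic and the hyperparameters illustrative, not those used in our experiments):

```python
# Hedged sketch of the frontier visualization pipeline: t-SNE projection,
# silhouette-based choice of k, k-means clustering, and a decision tree
# trained on the original parameter values with cluster IDs as labels.
# The data below are synthetic stand-ins for frontier points.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Hypothetical points in a 3-parameter space (length, cartfriction, masspole).
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(10, 3))
                    for c in ([1, 1, 1], [4, 1, 6], [1, 8, 1])])

# 1) Reduce the 3-D parameter space to 2-D with t-SNE.
embedding = TSNE(n_components=2, perplexity=5.0, init="random",
                 random_state=0).fit_transform(points)

# 2) Pick the number of clusters k via silhouette analysis, then run k-means.
best_k = max(range(2, 6),
             key=lambda k: silhouette_score(
                 embedding,
                 KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(embedding)))
labels = KMeans(n_clusters=best_k, n_init=10,
                random_state=0).fit_predict(embedding)

# 3) Train a decision tree on the ORIGINAL parameter values, using the
#    cluster IDs as class labels, to explain clusters in terms of parameters.
tree = DecisionTreeClassifier(random_state=0).fit(points, labels)
print(export_text(tree, feature_names=["length", "cartfriction", "masspole"]))
```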
For example, from the decision tree in Figure 8(b), we can see that the six frontier pairs that have masspole ≤ 5.689 and cartfriction ≤ 4.8 are clustered together (with label class-0, in the bottom left corner of Figure 8(a)). When cartfriction ∈ (4.8, 24.0], instead, we get the four frontier pairs that belong to the cluster with ID class-2 (bottom center of Figure 8(a)).

RQ 2: The adaptation volume allows the users of AlphaTest to discriminate a DRL algorithm with high adaptation capabilities from another DRL algorithm with low adaptation capabilities. Sometimes such adaptation comes at the cost of more regressions on the original environment (i.e., a lower anti-regression volume); therefore, the user needs to decide which of the two properties is more important for the specific situation. AlphaTest also provides the users with a visualization of the adaptation frontier to better evaluate the behaviors of the DRL agent in the parameter space of the environment of interest, both when two (heatmap) and when more than two (clusters and decision tree) parameters are considered.

Table 4 shows, on its right-hand side, the results for the dominance hyperparameter. Column 5 shows the average number of search points, across five repetitions, sampled by AlphaTest during the search phase. This number increases when moving from an environment with two parameters to an environment with three parameters, and from three parameters to four parameters, since the number of possible environment configurations increases. Column 6 shows the average number of search points skipped, i.e., not executed, by enabling dominance analysis both in the exponential and in the binary search sub-phases. Column 7 shows the percentage of search points skipped out of the total number of search points, whereas Column 8 shows the search time, measured in hours, of the search phase of AlphaTest for a single repetition.
Column 9 shows the average percentage of time saved by enabling dominance analysis, measured as the ratio between the estimated execution time for the skipped points and the actual execution time of the search, with dominance analysis enabled. It gives the percentage of the actual execution time that would be added if search point skipping were disabled. In practice, we compute it by multiplying the number of skipped search points by the time to execute a single search point (estimated as the average across all the executed search points in all repetitions for a certain algorithm) and dividing the product by the actual search time.
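This time-saved metric can be sketched as follows (the numbers in the usage example are made up for illustration):

```python
# Percentage of time saved by dominance analysis: the skipped search points,
# multiplied by the average time to execute a single search point, divided by
# the actual search time (with dominance enabled), expressed as a percentage.
def time_saved_percentage(skipped_points, avg_time_per_point, search_time):
    return 100.0 * skipped_points * avg_time_per_point / search_time
```

For example, if 10 points were skipped, each point takes 2 minutes on average, and the search took 8 minutes, the saving is 250%, i.e., 2.5× more computation time would have been needed without dominance.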

Hyperparameters (RQ 3 ).
Interestingly, for all environments except MountainCar and for all algorithms, the average percentage of skipped search points increases by ≈13% when moving from the environment configuration with the lowest number of parameters (e.g., CartPole 2) to the environment configuration with the highest number of parameters (e.g., CartPole 4). In MountainCar, such increase is only 1%. Moreover, dominance analysis seems to be quite dependent on the specific environment. In particular, considering the environment configurations with two parameters, in MountainCar 2 dominance analysis is able to skip the execution of the highest number of search points (i.e., ≈83%), while in CartPole 2 it skips the lowest number of search points (i.e., ≈55%). The impact of the specific DRL algorithm on dominance is less significant but not negligible in some environments. For example, in CartPole, dominance analysis skips 54% of the total number of search points for DQN and 68% for SAC. Similarly, in MountainCar, dominance analysis skips 74% of the total number of search points for PPO and 88% for DQN. The differences in the remaining environments are less pronounced. The corresponding percentage of time saved depends on the time to execute a single search point, which can vary greatly across different algorithms. For example, in Acrobot 2, the percentage of skipped search points is higher for PPO than for DQN (respectively, 74% vs 71%) but the time saving percentage is lower for PPO than for DQN (respectively, 189% vs 207%). In fact, the time to execute a single search point is 1.5 minutes for PPO, whereas it is 5.2 minutes for DQN.

Table 5 shows the results for the remaining hyperparameters of AlphaTest.
The first five columns report the metric values obtained when running AlphaTest with the default parameters, i.e., granularity g = 1.0, runs for adaptation probability estimation rpe = 3, and continual learning time clt = half the initial training time. In particular, Column 1 shows the average percentage of search points colliding when constructing the adaptation grids with granularity g = 1.0. Such adaptation grids are constructed considering the original limits for each algorithm (not the smallest range across algorithms, as in Table 4, where different algorithms are compared). Column 2 shows the average adaptation volume when clt = half and Column 4 the average search time, in hours, for a single repetition of AlphaTest. Column 3 reports the average standard error of the mean (SEM) of the adaptation volume when considering three runs for estimating the adaptation probability. Column 5 shows the average anti-regression volume.
Columns 6-9 are related to the granularity hyperparameter. In particular, we analyze the percentage of collisions and the adaptation volume percentage variation when the granularity is half the original value (i.e., g = 0.5) and when it is twice as much (i.e., g = 2.0). As expected, the percentage of collisions approximately doubles on average when halving the granularity, i.e., g = 0.5, and it becomes half of the original value on average when g = 2.0. More interesting is what happens to the adaptation volume, which, on average, varies by ≈10% when g = 0.5 (the maximum variation of the adaptation volume, i.e., 24%, occurs in MountainCar 2 for SAC and the minimum variation, i.e., 1%, occurs in CartPole 2 for DQN), whereas when g = 2.0 the average variation is ≈5% (the maximum variation, 11.49%, occurs in Pendulum 2 for DQN and the minimum variation, 0.55%, in CartPole 2, also for DQN). We can also notice that the adaptation volume percentage difference is high when the percentage of collisions is also high. In particular, when considering g = 2.0, i.e., a grid where the adaptation volume is estimated better, with fewer collisions, than at g = 1.0, the adaptation volume percentage difference is non-negligible (>1%) only when the initial percentage of collisions is high (>10%). As Figure 9 shows, the adaptation volume percentage difference, when the granularity is doubled, increases linearly (with angular coefficient between 0 and 1) with the collision percentage. In terms of execution time, halving the granularity decreases the time to approximate the adaptation grid with the nearest neighbor algorithm by 60% (the number of points in the grid decreases from 10k to 2.5k), while doubling the granularity increases such time by 200% (the number of points in the grid increases from 10k to 40k). Columns 10-13 are about the runs of continual learning needed to estimate the adaptation probability of a certain algorithm given an environment configuration.
Specifically, Column 10 shows the SEM of the adaptation volume when rpe = 5, i.e., when we use five runs of continual learning to estimate the adaptation probability. Column 11 shows the percentage decrease of SEM when moving from rpe = 3 to rpe = 5. We can see that SEM always decreases (on average it decreases by 48%), suggesting that the adaptation volume across five repetitions is more stable, since the adaptation probability is better estimated. The maximum decrease, i.e., 71%, happens in the environment Acrobot 2 for SAC, whereas the minimum decrease, i.e., 12%, happens in Pendulum 2 for PPO. The search time percentage increase is reported in Column 13 and, on average, the time to carry out the search phase in AlphaTest increases by 71%. The maximum increase happens in CartPole 2 for SAC, where the search time more than doubles (i.e., it increases by 107%), whereas the minimum increase, i.e., 32%, happens in MountainCar 2 for PPO.
Finally, Columns 14-15 are related to the time hyperparameter, specifically the continual learning time, which was originally set to half of the initial training time. We want to study the impact on the adaptation volume and the anti-regression volume of increasing the continual learning time, making it equal to the full training time. We considered one algorithm in each environment for this study, in particular the algorithm that had the worst adaptation volume in the comparison reported in Table 4. We can see in Column 14 that, on average, the adaptation volume increases by 35%, with the maximum being 47% in CartPole 2 for PPO and the minimum being 23% in MountainCar for SAC. Column 15 shows that the anti-regression volume always decreases, on average by 10%, with the maximum being 20.5% in MountainCar for SAC and the minimum being 2.6% in CartPole, also for SAC.

RQ 3 (dominance):
The most critical hyperparameter of AlphaTest is the dominance option which should always be enabled, since it is beneficial to save computation time (on average, in our experiments, we measured a percentage saving of ≈ 250%, i.e., 2.5× more computation time would be needed without dominance).

RQ 3 (granularity):
The granularity hyperparameter д needs to be chosen as a tradeoff between the number of collisions and the time to carry out the parameter space approximation phase. We recommend decreasing the granularity hyperparameter when the number of parameters in an environment increases. Indeed, the time to compute the volume increases by an order of magnitude when adding a new parameter and considering the same granularity.

RQ 3 (number of runs):
The number of runs for adaptation probability estimation (hyperparameter rpe) is important for those algorithms that are more unstable, i.e., those that tend to produce different results when trained on the same environment multiple times. Our results show that, when increasing rpe from three to five, on average, the standard error of the adaptation volume mean decreases by 35% for PPO, by 52% for SAC, and by 58% for DQN. As a consequence, increasing rpe seems more beneficial for SAC and DQN. Correspondingly, the search time increases, on average, by 71% (considering all algorithms).

RQ 3 (continual learning time):
As expected, by increasing the continual learning time hyperparameter we can increase the adaptation capabilities of the agent but, at the same time, we also decrease its capability to perform well in the original environment when adaptation is successful. In some environments, such tradeoff is negligible (e.g., in CartPole 2 the ratio between the anti-regression volume decrease and the adaptation volume increase is only 0.05), whereas in others it can be significant (e.g., in MountainCar 2 such ratio is 0.88).

Threats to Validity and Limitations
In this section, we discuss the threats to validity that could affect our results [86]. Threats to internal validity might come from how the empirical study was carried out. To ensure a fair comparison of effectiveness, we compared AlphaTest and random search under identical parameter settings (e.g., the same number of search points), on the same environments and DRL algorithms.
Threats to conclusion validity are related to random variations and inappropriate use of statistical tests. To mitigate these threats, we ran each experiment (both with AlphaTest and random) five times and used the non-parametric Wilcoxon test and the Vargha-Delaney effect size for statistical testing. Moreover, to account for the random initialization of DRL algorithms, we ran them three times on the same environment configuration with different random seeds, so as to better estimate adaptation and anti-regression probabilities.
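The Vargha-Delaney statistic admits a compact definition: the probability that a value drawn from the first sample is larger than one drawn from the second, with ties counting half. The following is a minimal reimplementation for illustration (the companion Wilcoxon test is available in standard libraries, e.g., as scipy.stats.wilcoxon):

```python
from itertools import product

# Sketch of the Vargha-Delaney A12 effect size used in our analysis.

def a12(xs, ys):
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    ties = sum(1 for x, y in product(xs, ys) if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))

# A12 = 0.5 means no difference; values near 1.0 (or 0.0) indicate a
# large effect in favor of the first (or second) sample.
print(a12([5, 6, 7], [1, 2, 3]))  # 1.0: every x beats every y
```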
Using a limited number of environments poses an external validity threat. Although more subjects would be needed to fully assess the generalizability of our results, we chose all four classic control environments from the popular gym library, which are widely used in the DRL community.
With respect to reproducibility of our results, the source code of AlphaTest, the experimental results and all the environments are available online [44], making the evaluation repeatable and our results reproducible.
One limitation of our approach is that the time to approximate the frontier in the parameter space, needed to compute the adaptation and anti-regression volumes, increases exponentially with the number of parameters of the environment if the granularity hyperparameter stays fixed. To address this limitation, developers can consider two parameters at a time, as we did in most of our experiments, although we also conducted experiments involving three or four parameters. The latter experiments confirmed the results obtained on the reduced dimensionality space, showing that it is often possible to extrapolate the experimental outcomes. In our future work, we plan to investigate techniques to scale the parameter space approximation to higher dimensionality.
Our approach makes the assumption that a total order relation exists between the parameter values, such that the complexity of the environment increases (or decreases) monotonically in the direction of the parameter change; parameters that do not satisfy this requirement are not supported by our approach.

As an example of the decision support AlphaTest provides, consider the adaptation heatmaps of DQN and PPO in CartPole (respectively, Figure 6(a) and (b)). The DQN algorithm has its adaptation frontier for values of length above 1.20 and cartfriction ∈ [12.3, 24.5], whereas the PPO algorithm is well inside the adaptation boundary for the same ranges of values. Hence, despite the DQN adaptation area being overall bigger than that of PPO, the latter might be preferable in terms of adaptation in specific regions of the parameter space. When different algorithms are available, AlphaTest supports the decision of the software engineer with both qualitative and quantitative adaptation measures.

RELATED WORK
In this section, we first summarize the current approaches to test RL-based systems. Then, we discuss the techniques proposed to test DL-based systems in general and, finally, we present how the major issue of catastrophic forgetting is addressed in the context of continual learning. The literature on transfer learning is not discussed, since the objective of transfer learning techniques is to study how to transfer knowledge acquired in a source domain to a target domain so as to speed up learning in the latter [39,53,77,90]. This goes beyond the scope of the present article, which is to study the adaptation boundary of a learned policy when the environment changes and the policy needs to be adapted incrementally.

Testing of Reinforcement Learning Systems
The problem of testing RL-based systems is much less explored than that of testing supervised learning (SL)-based systems [58,88]. The first body of work in RL testing draws from the SL literature on adversarial attacks, i.e., techniques to craft inputs on which trained neural networks perform very poorly despite having very good average performance on the test set. For example, classifiers trained to classify images are vulnerable to perturbations of the input image added by an adversary, possibly causing misclassification [76]. DRL algorithms can likewise be vulnerable to adversarial attacks, since they can also learn end-to-end behaviors (i.e., from raw inputs, e.g., images, to actions). Huang et al. [29] explored this hypothesis, finding that policies trained to play Atari games [7] from raw pixels are also prone to adversarial attacks that can degrade their performance at test time. The authors analyzed the robustness to adversarial attacks of different DRL algorithms, considering both white-box and black-box adversarial techniques (i.e., whether or not the adversary has access to the policy network). The results show that even in black-box scenarios, i.e., when the adversary has only access to the training environment (i.e., the simulator), it is possible to confuse DRL policies in a computationally efficient way. More recently, Lin et al. [41] proposed adversarial attacks that exploit the sequential nature of RL systems. In fact, they designed a technique to perturb the observations received by the agent only when the perturbations are likely to be effective, instead of performing attacks at every time-step (i.e., uniform attacks). Such strategically-timed attacks, applied at selective time-steps, can lower the reward of DRL agents while being less likely to be detected w.r.t. uniform attacks.
Moreover, the authors propose another attack, called enchanting attack, which uses a generative model in combination with a planning algorithm to lure the agent to a certain, possibly dangerous, state. The definition of frontier pair (see Definition 1) presents some similarities with the concept of adversarial example. Indeed, a frontier pair is made of two environment configurations that are close to each other and that trigger a different adaptation behavior of the agent. Similarly, an adversarial search technique seeks the minimal perturbation of the input (e.g., a pixel, if the input space of the model is an image) that triggers a misbehavior of the model (e.g., a misclassification). However, there are some important differences that distinguish an adversarial example from a frontier pair. First of all, a frontier pair is defined on environment configurations determining the whole environment the agent will be trained on, whereas an adversarial example is defined on a single input the agent receives (e.g., through a camera) and processes for each prediction. Hence, gradient-based techniques to generate adversarial input examples are not applicable in the context of this article.
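To make the analogy concrete, the frontier-pair condition can be sketched as a simple predicate; the names, closeness threshold, and toy adaptation predicate below are hypothetical illustrations, not AlphaTest's API.

```python
# Sketch of the frontier-pair notion (Definition 1): two environment
# configurations that are neighbors in the parameter space but on which
# the agent's adaptation outcome differs.

def is_frontier_pair(cfg_a, cfg_b, adapts, eps):
    """cfg_a/cfg_b: parameter tuples; adapts: cfg -> bool; eps: closeness."""
    close = all(abs(a - b) <= eps for a, b in zip(cfg_a, cfg_b))
    return close and adapts(cfg_a) != adapts(cfg_b)

# Toy predicate: the agent adapts only while pole length stays below 1.2.
adapts = lambda cfg: cfg[0] < 1.2
print(is_frontier_pair((1.1,), (1.3,), adapts, eps=0.25))  # True
```

Unlike an adversarial example, both elements of the pair are whole training environments, which is why gradient-based input perturbation does not transfer to this setting.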
Adversarial evaluation is proposed by Uesato et al. [83] to find failures in trained DRL agents without generating out-of-distribution inputs, unlike previous work on adversarial examples in DRL [29,76]. The objective of the authors is to efficiently find inputs, i.e., initial conditions (e.g., the shape of the track in a driving scenario, as generated by the environment), that cause a catastrophic failure (e.g., the car hits a wall in the driving scenario). Toward such objective, a failure probability predictor (AVF for short) is learnt, which takes as input an initial condition and outputs a binary signal indicating a catastrophic failure. To do that efficiently, the authors propose to use data from intermediate agents taken at different stages of the training process, since such agents are less robust and, therefore, more prone to failure. The underlying assumption of the approach is that agents fail early on in the training phase in ways similar to the final agent. Their results confirm this hypothesis in two tasks, namely, the Humanoid task on the MuJoCo [79] simulator and a self-driving scenario on the TORCS [87] simulator, where their approach was able to find failures in the final agents significantly faster than a Monte Carlo method (e.g., repeated random trials).
Another active area of research somehow related to RL testing is the study of generalization and overfitting. Procedurally generated environments (PGEs) have been proposed to help alleviate both concerns [12,17]. In fact, PGEs can provide significant variation during training, so that agents are encouraged to learn general strategies to solve the problem rather than overfitting a specific instantiation of the environment. Of particular interest is the work by Ruderman et al. [17,62], which explores the question of whether specific failures can emerge when training DRL agents in such environments. Specifically, when navigating procedurally generated mazes, agents can suffer from catastrophic failures despite having a high average-case performance at evaluation time. The search for environment settings that cause catastrophic failures is a local search process. Initially, a set of mazes is sampled from the training distribution and the trained agent is evaluated on each environment. Afterwards, the maze where the agent has the lowest score is selected, new candidate mazes are generated by randomly changing the wall positions, and the process repeats. The authors found that this search procedure can effectively discover mazes that the trained agent fails to solve in a two-minute episode (i.e., a catastrophic failure, according to their definition). Moreover, the failure-causing mazes transfer among different DRL agents and different architectures.
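The local search loop described above can be sketched generically, with `sample`, `evaluate`, and `mutate` standing in for maze sampling, agent evaluation, and wall perturbation (a simplified reconstruction, not Ruderman et al.'s code):

```python
import random

# Sketch: hill-climbing toward failure-inducing environments. Keep the
# environment with the lowest agent score, mutate it, and repeat.

def local_search(sample, evaluate, mutate, iters=10, pop=5):
    candidates = [sample() for _ in range(pop)]
    worst = min(candidates, key=evaluate)
    for _ in range(iters):
        candidates = [mutate(worst) for _ in range(pop)]
        challenger = min(candidates, key=evaluate)
        if evaluate(challenger) < evaluate(worst):
            worst = challenger
    return worst  # the most failure-inducing environment found

# Toy instantiation on integers: the "score" is the value itself, so the
# search drifts toward smaller numbers.
found = local_search(sample=lambda: random.randint(50, 100),
                     evaluate=lambda m: m,
                     mutate=lambda m: m + random.randint(-3, 1))
print(found)
```

In the original setting, `evaluate` would roll out the trained agent on the maze and return its episode score.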
Drawing from the DL testing literature, which we discuss in the next section, Trujillo et al. [80] apply the concept of neuron coverage [56] to DRL systems. Given a test set of inputs, neuron coverage is defined as the proportion of activated neurons over all neurons when all available test inputs are supplied to a neural network. According to such metric, a test set is adequate if it is able to activate a high proportion of the neural network neurons, hence thoroughly exercising its "logic". Trujillo et al. measured neuron coverage during training and testing of a DQN [47] agent on the Mountain Car problem [75] and investigated the correlation between neuron coverage and cumulative reward. The preliminary results show that neuron coverage tends to be higher when the agent explores, i.e., in the early stages of training, and does not correlate, or correlates negatively, with the cumulative reward. Therefore, neuron coverage is not sufficient to reach substantial conclusions about the quality of neural networks for DRL agents, even though more extensive studies would be needed to confirm such result.
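The metric itself is straightforward to compute given recorded activations; below is a minimal pure-Python sketch (the zero threshold and the way activations are collected are simplifications of how a real tool would instrument the network):

```python
# Sketch of the neuron coverage metric [56]: the fraction of neurons
# whose activation exceeds a threshold for at least one test input.
# `activations` is a list of per-input activation vectors.

def neuron_coverage(activations, threshold=0.0):
    num_neurons = len(activations[0])
    covered = [any(row[j] > threshold for row in activations)
               for j in range(num_neurons)]
    return sum(covered) / num_neurons

acts = [[0.9, 0.0, 0.0],   # input 1 activates neuron 0
        [0.0, 0.4, 0.0]]   # input 2 activates neuron 1; neuron 2 never fires
print(neuron_coverage(acts))  # 2 of 3 neurons covered
```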
Rupprecht et al. [63], instead, focused on the visual aspect of DRL agents that learn from images, in order to understand the relationship between the actions made by an agent and its visual input, with the objective of identifying potential problems in the learnt behavior. In particular, the authors learn a generative model of the environment that aims at evaluating the agent's behavior in particular classes of states created by the optimization. The rationale is that, often, a trained DRL agent is evaluated on a set of scenarios that rarely include potential failure cases. Instead, by training a generative model, the authors were able to sample out-of-distribution states where a certain target condition is satisfied. For example, when training a DQN agent, it may be of interest to see what happens in states where the action-values are high or low. Alternatively, states where one action yields a high expected return while another one is not beneficial at all are also potentially very interesting.
Such works relate to ours because they evaluate and test RL algorithms. In particular, our work builds on the idea of changing the environment in which the agent is originally trained, similar to what Ruderman et al. [17,62] accomplish with PGEs. One difference w.r.t. such works is that the environments we consider are not procedurally generated: we instead modify critical parameters of the environment at runtime. The other fundamental difference is that we test the plasticity of the learning algorithm when exposed to the new environment: we do not just evaluate a trained agent in a new environment to find its sensitivity to the changed conditions; we rather let the agent learn in the changed environment to study its behaviors under the new conditions. None of the existing works tested the continual learning capabilities of RL agents.
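As an illustration of runtime parameter modification, the sketch below mirrors the attribute names of gym's classic-control CartPole (half pole length, cart friction), but uses a simplified stand-in class rather than the real simulator:

```python
# Sketch of runtime parameter mutation (our approach, as opposed to
# procedural generation): the same environment is re-parameterized before
# continual learning resumes. Stand-in class; not the real gym simulator.

class CartPoleParams:
    def __init__(self, length=0.5, cart_friction=0.0):
        self.length = length              # half pole length
        self.cart_friction = cart_friction

def mutated(env, **changes):
    """Return a copy of the environment parameters with some values changed."""
    new = CartPoleParams(env.length, env.cart_friction)
    for name, value in changes.items():
        setattr(new, name, value)
    return new

original = CartPoleParams()
changed = mutated(original, length=1.2, cart_friction=12.3)
print(changed.length, changed.cart_friction)  # 1.2 12.3
```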

Testing of Deep Learning Systems
Testing DL systems is a very active area of research. Research work in this context comprises all aspects of software testing, from test input generation [8,20,23,42,56,59,78,89] to test oracles [16,32,43,56,68,72], including test adequacy criteria [36,42,56]. Although none of those works specifically addresses RL-based systems, such approaches could still be applied to DRL, i.e., to RL when a neural network is used as function approximator. The works most related to ours are those on test input generation. For a more in-depth and thorough discussion of the topic, the interested reader can refer to the systematic mapping by Riccio et al. [58] and the survey by Zhang et al. [88].
According to Riccio et al. [58], the most widely applied test input generation technique is input mutation, where existing inputs to a DL-based system are mutated with the constraint of preserving their semantics. Most works focus on changing the input in a way that is imperceptible to humans [15,16], but others focus on mutating the inputs, especially images, with the objective of simulating real environmental changes [61,78], e.g., changes in weather conditions, occlusions, lens distortions, and object movements. Besides input mutation, another popular way to generate test inputs is the search-based approach [1,2,8,20,59]. The objective of such works is to generate challenging scenarios, i.e., environment configurations, for the system under test, to detect as many system failures as possible. Moreover, search-based approaches are also used to carry out boundary input analysis [57]. In fact, instead of finding single failures, these approaches aim at finding similar inputs that trigger different behaviors of a DL-based system, lying at the boundary (or frontier) of the behaviors of such system. This analysis is, for example, performed by Mullins et al. [50], Tuncali et al. [82], and Riccio et al. [59] for autonomous systems, using different search techniques. Indeed, Mullins et al. [50] use adaptive search to discover inputs at the frontier of behaviors of a system using a minimal number of samples. Tuncali et al. [82], in contrast, utilize rapidly-exploring random trees to find pairs of environment configurations at the collision boundary of an autonomous car, i.e., one environment configuration in which the collision is unavoidable and another, close to it, in which the collision is avoidable. Likewise, Riccio et al. [59] use a model-based multi-objective search technique to characterize the frontier of behaviors of both an autonomous car and a handwritten digit classifier.
The results show that the points found by this technique are spread across the frontier and that the scenarios produced by the approach are realistic.
Our work builds on the idea of characterizing the frontier of behaviors of a DL-based system [50,59,82]. We differ from such existing techniques in two ways. First, we search for the frontier of behaviors of a DRL system not just by evaluating the trained agent on different environments, but also by letting the agent learn in the new environments, in continual training mode. In this way, we characterize both how the agent learns new behaviors (i.e., the adaptivity heatmap, see Section 3), as well as how the agent behaves in the environment it was initially trained on, once it has adapted to the new environments (i.e., the anti-regression heatmap, see Section 3). Second, we use a combination of systematic search algorithms to sample the parameter space, while enabling continual learning, and we use the nearest neighbor algorithm to approximate the behaviors of the agent in the remaining parts of the parameter space. In fact, the high cost associated with a single, complete (continual) training run is not compatible with the use of population based, evolutionary algorithms, such as those used in previous works for frontier exploration [59].
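The nearest neighbor approximation step mentioned above can be sketched as follows (a simplified 1-NN with squared Euclidean distance; AlphaTest's implementation may differ in distance metric and tie handling):

```python
# Sketch: environment configurations actually explored by the search are
# labeled (adapted / not adapted), and every unexplored grid point
# inherits the label of its closest explored neighbor.

def nearest_label(point, labeled):
    """labeled: list of (config, label) pairs; returns label of closest config."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(labeled, key=lambda cl: dist2(point, cl[0]))[1]

explored = [((0.5, 10.0), "adapted"), ((1.5, 25.0), "not_adapted")]
print(nearest_label((0.6, 11.0), explored))  # adapted
```

This cheap interpolation is what makes it possible to estimate adaptation and anti-regression volumes without a full continual training run at every grid point.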

Continual Learning
The continual learning problem was first studied by McCloskey et al. [45], who investigated whether neural networks can acquire new knowledge incrementally (or sequentially). The authors explored this question in a supervised learning setting, training a neural network to perform single-digit additions. They showed that when a new task is learnt incrementally, the new knowledge interferes with the existing one, replacing it completely. McCloskey et al. referred to this failure mode of neural networks in the continual learning setting as catastrophic forgetting or catastrophic interference. The inability of neural networks to learn sequentially is a major problem in several scenarios. For instance, if training from scratch takes a long time, it might be impossible to reorder and replay all training data to ensure high performance on all the tasks, including the old ones.
Catastrophic interference is a manifestation of a more general problem of neural networks, the so-called stability-plasticity problem [11,19,22]. The fundamental question is how to design a system that is plastic (or sensitive) with respect to new inputs in order to incorporate new knowledge, but that, at the same time, does not forget the already acquired knowledge (i.e., the system is stable w.r.t. old inputs). Catastrophic forgetting is a failure of stability, in which new experience overwrites previous experience. The algorithms that address the continual learning problem, which we present next, are designed to find a tradeoff between stability and plasticity.
Approaches that address and mitigate catastrophic forgetting are divided into three main categories [52], namely, regularization approaches, dynamic networks, and complementary learning systems. Regularization approaches retrain the whole network while regularizing to prevent forgetting of the previously learnt tasks. One such approach is represented by the work of Li et al. [40], who proposed the learning without forgetting (LwF) algorithm. The LwF algorithm considers a network with shared parameters across tasks and some task specific parameters. When a new task needs to be learnt, the approach optimizes the parameters of the new task together with the shared parameters with the constraint that the predictions of the network on the old tasks do not change significantly. Another regularization approach is the elastic weight consolidation (EWC) algorithm proposed by Kirkpatrick et al. [37] in supervised and RL scenarios. In EWC, the objective is to identify the weights that are important for past tasks and, while learning a new task, penalize their updates w.r.t. the updates on weights that have less significance for past tasks. Differently from the previous two approaches, Kaplanis et al. [35] propose a policy consolidation model for continual RL that does not require the knowledge of task boundaries. Such approach can also be viewed as an extension of the PPO algorithm [67] (see Section 2), which constrains the new policy to be close to the old one, thus preventing catastrophic forgetting at a very short timescale. In the policy consolidation model [35], instead, the constraint for the policy is applied to multiple gradient steps in order to maintain the knowledge acquired at several stages in the training history.
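The EWC penalty has a compact form, L(θ) = L_new(θ) + (λ/2) Σ_i F_i (θ_i − θ*_i)², where θ* are the weights learnt on past tasks and F_i is an importance estimate (Fisher information) for weight i. A minimal sketch of the penalty term (illustrative, not the original implementation):

```python
# Sketch of the elastic weight consolidation (EWC) penalty: updates to
# weights that were important for past tasks (high Fisher value F_i) are
# penalized more strongly while learning a new task.

def ewc_penalty(theta, theta_star, fisher, lam):
    return 0.5 * lam * sum(
        f * (t - ts) ** 2 for t, ts, f in zip(theta, theta_star, fisher))

# Moving an "important" weight (F = 10) by 1.0 is penalized 100x more
# than moving an "unimportant" one (F = 0.1) by the same amount.
important = ewc_penalty([1.0, 0.0], [0.0, 0.0], [10.0, 0.1], lam=1.0)
unimportant = ewc_penalty([0.0, 1.0], [0.0, 0.0], [10.0, 0.1], lam=1.0)
print(important, unimportant)
```

During training, this term would simply be added to the loss of the new task before computing gradients.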
Dynamic networks approaches selectively train the network and expand it if necessary to learn new tasks. For instance, Rusu et al. [64] propose to freeze changes to the networks trained on previous tasks and, when a new task is presented, add novel sub-networks with fixed capacity to be trained for such next task.
Finally, complementary learning systems use mechanisms to replay old experience while learning new tasks in order to consolidate the acquired knowledge. Shin et al. [69] train a generative model to output synthetic data that follows the same distribution as the original training data. In such a way, when learning a new task, the training data for previous tasks can easily be sampled and interleaved with those for a new task even if the training data the network was trained on is no longer available. In the same way, Rolnick et al. [60] use experience replay as a way to reduce catastrophic forgetting in multi-task RL. Such approach, called CLEAR, mixes on-policy learning from novel experiences (for plasticity) and off-policy learning from replay experiences (for stability).
Continual learning in RL is mostly studied in multi-task settings [37,60,71], e.g., an agent trained to play a certain game is then challenged to sequentially learn to play another game without forgetting how to play the first one. In contrast, Fedus et al. [18] explore the question of whether catastrophic forgetting may arise within a single game environment. They show that in Atari games [7], catastrophic interference causes the agent performance to plateau, i.e., learning one segment of the game degrades the performance of the policy on previously learnt segments of the game.
In our work, we devise a novel methodology that helps developers understand the frontier of the adaptive behaviors of a given RL algorithm, when continual learning is enabled, i.e., when a trained agent learns incrementally to adapt to an environment which is different from the one it was originally trained on. Since continual learning is a key property of RL-based systems, which can continue to learn as new, unlabeled data are acquired, finding the limits where such property holds is fundamental for any practical application that involves runtime adaptation via continual learning.

CONCLUSION AND FUTURE WORK
In this article, we proposed the first approach to test the adaptation and anti-regression capabilities of a DRL-based system. We characterize the adaptation frontier of a DRL algorithm along the parameters that define the environment in which the agent operates. We provide a visualization of such frontier, and we propose a volume metric to quantify both the adaptation capabilities of an agent and its ability to remember how to perform in the original environment when the adaptation is successful. We implemented the approach in a tool called AlphaTest [44], and we carried out an extensive evaluation on three DRL algorithms and four continuous control environments, considering several parameter combinations.
AlphaTest has been successfully applied to four subjects taken from the popular gym library. Experimental results indicate that AlphaTest is effective at finding the adaptation frontier points and that it can be very useful in characterizing and discriminating the adaptation and anti-regression capabilities of alternative DRL algorithms. AlphaTest provides developers with an interpretable visual output, which consists of the adaptation/anti-regression heatmaps (when two parameters are considered) or a clustered two-dimensional projection, accompanied by a decision tree (when dealing with more than two parameters).