A Critical Review on Adaptive Sample Size Re-estimation (SSR) Designs for Superiority Trials with Continuous Endpoints



Introduction
Sample size determination is a key part of designing clinical trials. The objective of a good clinical trial design is to balance the efficient use of resources against enrolling enough patients to achieve a desired power. At the design stage of a clinical trial, there is usually only limited information available about the population, so the sample size calculated at this stage may not be sufficient to address the study objective. Assume that the data from two parallel treatment groups (e.g. treatment and control groups) are normally distributed with means $\mu_1$ and $\mu_2$ and equal within-group variance $\sigma^2$. Let the mean difference (treatment effect) be $\delta = \mu_1 - \mu_2$. The efficacy of the treatment is evaluated by testing the hypothesis $H_0: \delta = 0$ against $H_a: \delta > 0$.
Traditional fixed-size designs and group sequential designs calculate the sample size of the trial before it starts, based on assumed values of $\delta$ and $\sigma^2$ or on estimates of them from historical data. However, at the planning stage of the study we may have little information about these parameters, or the information we have may be inaccurate, which can lead to gross overestimation or underestimation of the sample size. Underestimation is especially unfavorable, because it leaves the trial under-powered and likely to miss a real treatment benefit. It is therefore helpful to re-estimate the unknown parameters, and with them the sample size, after some of the study data have been observed. Because the re-estimated parameters come from the current data, they represent the current population much better than parameters estimated from previous information.
There is no doubt that increasing the sample size increases test power, but several problems need attention. First, how can we control the type I error rate? When the re-estimated sample size depends on the observed data, it may bias the final test. Second, how can we keep the power at a desired level when the design changes? Moreover, we want the re-estimated sample size to be efficient. There is no need for the power to be as high as possible; otherwise we could simply use the maximum affordable sample size from the beginning of the trial, which might detect a difference that is not clinically meaningful.
The purpose of this paper is not to promote or discourage the use of any particular SSR design, but to offer guidance to those who want to use SSR designs (especially for the first time) on the basic ideas, advantages and drawbacks of each design. Many papers with a similar purpose have been published to summarize and review the existing SSR designs with different focuses. Some older review papers with technical details, such as [1,2], were written at least ten years ago. They do not cover many designs proposed in the more recent literature, and they focus more on summarizing the authors' own work. The review paper by [3], published five years ago, focuses on comparing some commonly used unblinded two-stage SSR designs in terms of their operating characteristics. More recently, the paper by [4] gives a thorough review of the development of SSR designs since 1945. However, it uses only a few words to summarize the basic idea of each design, providing little technical detail or comment on how the designs perform. In this paper, we focus on reviewing and commenting on the literature on two-stage adaptive SSR designs for both blinded and unblinded superiority trials with continuous endpoints, especially work published during the past two decades. Early stopping for efficacy or futility is not discussed, and we consider only increases in sample size. A variety of techniques proposed in the literature for re-estimating the sample size based on first-stage information are summarized, such as re-estimating the within-group variance or the treatment effect; adjusting the final test statistic, critical value or significance level; and constraining the adaptive region. The objective, the design details, and some key suggestions and concerns are given for each design.
The common adaptive SSR designs can be summarized by the following procedure, which is also illustrated by the flowchart in Figure 1. a) At the beginning of the trial, calculate the original planned per-group sample size $N_0$ based on assumed parameters such as the targeted treatment effect $\delta$ that the experiment wants to detect, the within-group variance $\sigma^2$, or both.

In this section, we discuss some well-known blinded SSR (BSSR) designs, which do not break the treatment code before the trial is finished. The latest FDA draft guidance, Adaptive Designs for Clinical Trials of Drugs and Biologics [5], refers to these designs as being based on non-comparative data. BSSR designs have the advantage of protecting the confidentiality of the treatment effect at the interim analysis, and they are more acceptable to regulators. The original planned per-group sample size $N_0$ of all the designs in this section is given by equation 1. The sample size is then re-estimated after observing the first-stage data, which protects trial power against an underestimated initial sample size. A two-sample t-test is used at the end of the trial with the cumulated data from the two stages.
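For orientation, the initial sample size calculation can be sketched with the usual normal-approximation formula for a one-sided two-sample comparison, $N_0 = 2\sigma^2 (z_\alpha + z_\beta)^2 / \delta^2$ per group. This textbook formula is an assumption made for illustration; the exact form and notation of equation 1 may differ, and the function name and default values below are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def per_group_sample_size(delta, sigma2, alpha=0.025, power=0.9):
    """Per-group sample size for a one-sided two-sample z-test.

    Assumed textbook formula (not necessarily the paper's equation 1):
    N0 = 2 * sigma2 * (z_alpha + z_beta)^2 / delta^2, rounded up.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # critical value of the test
    z_beta = std_normal.inv_cdf(power)       # quantile for the target power
    return ceil(2 * sigma2 * (z_alpha + z_beta) ** 2 / delta ** 2)


# Planning values: effect 0.5, variance 1, one-sided alpha 2.5%, power 90%.
print(per_group_sample_size(delta=0.5, sigma2=1.0))  # 85
# If the true variance were twice the assumed one, about twice as many
# patients would be needed, which is what SSR tries to catch mid-trial.
print(per_group_sample_size(delta=0.5, sigma2=2.0))  # 169
```

Because the formula is proportional to the variance and inversely proportional to the squared effect, a modest misjudgment of either planning value translates into a large error in the sample size.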

Objective and Design
The design proposed by [6,7] conducts an interim analysis after part of the originally planned data (e.g. $N_0$ samples for the combined data of the two groups) have been observed. The per-group sample size $N^*$ is re-estimated with the same formula as equation 1, but with the assumed within-group variance $\tilde{\sigma}^2$ replaced by the re-estimated value $\hat{\sigma}^2$, where $\hat{\sigma}^2$ is calculated by the EM algorithm from the observed first-stage data. The sample size is increased only if the re-estimated sample size $N^*$ is sufficiently larger than the original planned sample size $N_0$, say $N^*/N_0 > \lambda > 1$, where $\lambda$ is a pre-specified value. If the sample size is modified, efficacy of the treatment is claimed at significance level $\alpha$ if the final test statistic based on the cumulated data from the two stages satisfies $T(N^*) > t_{2N^*-2,\alpha}$, where $T(N^*)$ is the two-sample t-test statistic with $N^*$ samples per group and $t_{2N^*-2,\alpha}$ is the $(1-\alpha)$th quantile of the t-distribution with $2N^*-2$ degrees of freedom. Regarding protection of the blind, the authors claim that although the EM algorithm gives an accurate estimate of the within-group variance $\sigma^2$, it does not estimate the standardized treatment effect $(\mu_1 - \mu_2)/\sigma$ very well [6]. For instance, the estimate of $(\mu_1 - \mu_2)/\sigma$ gives no clear evidence about how likely the null hypothesis is to be rejected, so the blinding of the trial is protected.
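The decision rule of this design can be sketched as follows. The sketch assumes that equation 1 is proportional to the within-group variance (as in the usual normal-approximation formula), so that replacing the assumed variance by its re-estimate simply rescales $N_0$. The variance estimate is taken as an input, since the EM step is not reproduced here, and the function name and the value of `lam` are hypothetical.

```python
from math import ceil


def reestimate_sample_size(n0, sigma2_assumed, sigma2_hat, lam=1.2):
    """Return the per-group sample size to use after the interim look.

    Assumes N is proportional to the variance, so
    N* = N0 * sigma2_hat / sigma2_assumed (rounded up).
    The sample size is increased only if N* / N0 > lam, with lam > 1.
    """
    if lam <= 1:
        raise ValueError("lam must be greater than 1")
    n_star = ceil(n0 * sigma2_hat / sigma2_assumed)
    return n_star if n_star / n0 > lam else n0


# Variance clearly underestimated at the planning stage: increase.
print(reestimate_sample_size(85, sigma2_assumed=1.0, sigma2_hat=1.5))  # 128
# Variance only slightly larger than assumed: keep the original size.
print(reestimate_sample_size(85, sigma2_assumed=1.0, sigma2_hat=1.1))  # 85
```

The threshold keeps small, noisy changes in the variance estimate from triggering a design change.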

Concerns and Weaknesses of the Design
It was pointed out by [8] that with increased sample sizes, the bias and variability of the EM algorithm estimates of $\sigma^2$ and $(\mu_1 - \mu_2)/\sigma$ both decrease. This means that although the authors claim that the estimate of the standardized treatment effect $(\mu_1 - \mu_2)/\sigma$ is not accurate, it still reveals some information about the test result, especially when the sample size or the mean difference is large. Furthermore, the accuracy of the EM estimate $\hat{\sigma}^2$ depends greatly on the choice of initial values, and the procedure may sometimes stop before convergence is reached. It is also shown by [8] that when the true treatment effect is moderate, the EM algorithm dramatically underestimates the within-group variance, whereas the sample variance calculated from the combined data of the two groups (introduced in a later section of this paper) is much simpler to compute and is only slightly larger than the true within-group variance. Moreover, even if the estimate $\hat{\sigma}^2$ is accurate, re-estimating the sample size from the observed first-stage data may bias the final test. It may therefore be problematic to still compare the final test statistic with the originally planned critical value.
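The behavior of the combined-data sample variance mentioned above can be checked with a small simulation. The sketch below pools both arms without using treatment labels and compares the result with the true within-group variance. Under equal allocation the pooled-sample variance is expected to exceed $\sigma^2$ by roughly $\delta^2/4$, a standard result stated here as background rather than taken from the reviewed papers. The seed and parameter values are arbitrary.

```python
import random
from statistics import variance


def one_sample_variance(treatment, control):
    """Blinded variance estimate: sample variance of the combined data,
    computed without using the treatment labels."""
    return variance(list(treatment) + list(control))


rng = random.Random(2019)          # fixed seed for reproducibility
n, delta, sigma = 2000, 1.0, 2.0   # per-group size, true effect, true SD
treatment = [rng.gauss(delta, sigma) for _ in range(n)]
control = [rng.gauss(0.0, sigma) for _ in range(n)]

s2_blinded = one_sample_variance(treatment, control)
print("blinded estimate:      ", round(s2_blinded, 3))
print("true within-group var: ", sigma ** 2)
print("sigma^2 + delta^2 / 4: ", sigma ** 2 + delta ** 2 / 4)
```

With a moderate standardized effect the upward bias is small relative to $\sigma^2$, which is why the estimator errs on the side of a slightly larger sample size.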

A Critical Review on Adaptive Sample Size Re-estimation (SSR) Designs for Superiority Trials with Continuous Endpoints OJPSR: April-2019: Page No: 01-13
Page: 5 www.raftpubs.com

Objective and Design
The design given by [9] allows a flexible choice of the first-stage sample size $n_1 = t N_0$ $(0 < t < 1)$. After observing the first-stage data, two ways were proposed to re-estimate the within-group variance. The first estimate, denoted by $S_{OS}^2$ (one-sample variance), is simply the sample variance of the combined data from the two groups. The second estimation is ( )

Concerns and Weaknesses of the Design
The advantage of this design is that its type I error rate will not exceed the desired level $\alpha$. However, the calculation of the adjusted significance level $\alpha^*$ is quite complicated. Furthermore, some users may be uncomfortable accepting that the significance level of the final test must be changed to maintain the type I error rate only because the sample size was re-estimated. If the new significance level $\alpha^*$ is smaller than the original one, the design appears to penalize the final test.

Part Two: Unblinded SSR (UbSSR) Designs Based on Nuisance Parameters
Two commonly used unblinded SSR (UbSSR) designs based on the re-estimated within-group variance $\hat{\sigma}^2$ are introduced in this section. The latter design builds on the former, with an adjustment to the significance level of the final test. A two-sample t-test is used at the end of the trial with the cumulated data from the two stages.


Objective and Design
For some non-clinical experiments and clinical designs without a blinding requirement, the design proposed by [11] and later analyzed further by [12] was one of the earliest to recommend including the internal pilot study data (i.e. the first-stage data) in the final test. The initial planned per-group sample size $N_0$ is again given by equation 1.
The authors use a two-sided test in the original paper; without loss of generality, we adapt it here to a one-sided test. They recommend that, after the data of $N_0/2$ patients per group are observed, the sample size be increased if the pooled sample variance $s^2$ of the two groups, based on the first-stage data, is larger than $\tilde{\sigma}^2$. Because the sample sizes in their study were small, they use the t-distribution rather than its normal approximation to compute the re-estimated sample size after the internal pilot study, which makes the calculation more precise.

Concerns and Weaknesses of the Design
This design greatly improves test power over the fixed-size design if the variance $\tilde{\sigma}^2$ used for calculating $N_0$ is less than the true variance $\sigma^2$. However, because the SSR procedure biases the final test, simulation results show non-negligible type I error inflation when the sample size is relatively small, the internal pilot is conducted at around half the required sample size, and $\tilde{\sigma}^2$ is close to the true variance $\sigma^2$.

Objective and Design
The design of [13,14] is based on the design of [12], but for moderate to large trials the authors use the normal approximation of the t-distribution to calculate the re-estimated sample size: $N_0$ is the same as in equation 1, and $N^*$ is given by replacing $\tilde{\sigma}^2$ in equation 1 with the variance re-estimated from the first-stage data. To solve the type I error rate inflation problem, they again derived the exact formula of the actual type I error rate after sample size adjustment, similar to what they did for the BSSR design. The arguments of the actual type I error rate function $\alpha_{act}(\alpha, n_1, N)$ are the significance level $\alpha$, the first-stage sample size $n_1$ and the unknown actual required sample size $N$. For each fixed $n_1$ and $\alpha$, we can find the $N$ that maximizes the actual type I error rate, giving $\alpha_{max}(\alpha, n_1)$. Then, to control the type I error rate at $\alpha$, for each fixed $n_1$ we can find an adjusted significance level $\alpha^*$ so that $\alpha_{max}(\alpha^*, n_1) = \alpha$. Because this adjustment has a negative effect on test power when the pooled sample variance of the first-stage data is used directly to re-estimate the sample size, the authors also proposed using the $100(1-\gamma)\%$ upper confidence limit (UCL) of the variance estimate, so that the planned power is achieved with probability at least $1-\gamma$. If the sample size is changed, the final test claims efficacy if the test statistic exceeds $t_{2N^*-2,\alpha^*}$.

Part Three: Unblinded SSR (UbSSR) Designs Based on Treatment Effect
Three types of unblinded SSR (UbSSR) designs based on the treatment effect re-estimated after observing the first-stage data are introduced in this section. Because the within-group variance $\sigma^2$ was not re-estimated in the related literature, here we assume that $\sigma^2 = 1$ is known. Thus, the initial per-group sample size $N_0$ simplifies to the form given in equation 3. The final analysis can simply use a z-test since the variance is assumed known.
The interim analysis is conducted at information time $t$, after the data of $n_1 = t N_0$ $(0 < t < 1)$ patients per group are observed. Besides simply re-estimating the sample size $N^*$ by re-estimating $\delta$ from the first-stage data and substituting it into equation 3, a newer method, the "conditional power function", is widely used in UbSSR designs based on the re-estimated treatment effect. The re-estimated total per-group sample size $N^*$ can then be given by one of the following conditional power functions:

Objective and Design
Since re-estimating the sample size based on the data observed in the first stage could inflate the type I error rate, some designs control the type I error rate by redesigning the final test statistic.
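The design of [15] combines the stage-wise p-values with Fisher's product test: the null hypothesis is rejected at the final analysis if $p_1 p_2 \le c_\alpha = \exp\left[-\tfrac{1}{2}\chi^2_4(1-\alpha)\right]$, where $c_\alpha$ is Fisher's product criterion; $p_1$ and $p_2$ are the observed error probabilities (p-values) of the tests based on the data observed before and after the interim analysis; and $\chi^2_4(1-\alpha)$ is the $(1-\alpha)$th quantile of the central chi-squared distribution with 4 degrees of freedom.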
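As a rough numerical illustration of Fisher's product criterion, the sketch below computes $c_\alpha$ without a chi-squared quantile routine. It relies on the fact that for two independent uniform p-values $P(p_1 p_2 \le c) = c(1 - \ln c)$, which is equivalent to the chi-squared form with 4 degrees of freedom, and solves $c(1 - \ln c) = \alpha$ by bisection. The rejection rule $p_1 p_2 \le c_\alpha$ is the standard form of the product test; the function names are hypothetical.

```python
from math import log


def fisher_product_criterion(alpha, tol=1e-12):
    """Solve c * (1 - ln c) = alpha for c in (0, alpha) by bisection.

    The left-hand side is the null probability that the product of two
    independent uniform p-values is at most c; it is increasing in c.
    """
    lo, hi = 0.0, alpha
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid * (1 - log(mid)) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def product_test_rejects(p1, p2, alpha=0.05):
    """Reject the null hypothesis when p1 * p2 <= c_alpha."""
    return p1 * p2 <= fisher_product_criterion(alpha)


print(round(fisher_product_criterion(0.05), 4))   # 0.0087
print(product_test_rejects(0.05, 0.10))           # True  (0.005 <= 0.0087)
print(product_test_rejects(0.10, 0.10))           # False (0.010 >  0.0087)
```

Because the criterion depends only on the two p-values, the second-stage sample size can be chosen after the interim look without changing the null distribution of the product.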
The design proposed by [16] modifies the traditional z-test for two-sample means by fixing the weights of the independent z-scores from before and after the interim analysis (a linear combination of the z-scores from each stage). If the sample size is modified, the final test statistic is given by $Z_w = \sqrt{w_1} Z_1 + \sqrt{w_2} Z_2^*$, where $Z_1$ is the z-score calculated from the first-stage data and $Z_2^*$ is the z-score calculated from the second-stage data with the re-estimated sample size. Note that this approach is equivalent to a combination test with the inverse normal combination function in [17]. The sample size modification does not change the distribution of the test statistic under the null hypothesis, because $Z_1$ and $Z_2^*$ are independent and each follows the standard normal distribution. Therefore, as long as the weights $w_1$ and $w_2$ are pre-specified, satisfy $w_1 + w_2 = 1$, and remain unchanged when the sample size changes, $Z_w$ also follows the standard normal distribution. Thus, the rejection criterion $Z_w > z_\alpha$ results in a level-$\alpha$ test.
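A minimal sketch of the weighted test follows, assuming the weights enter through their square roots so that the combined statistic has unit variance when $w_1 + w_2 = 1$. A common choice, assumed here rather than taken from the text, is $w_1 = n_1/N_0$, the planned information fraction. Function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z(z1, z2_star, w1):
    """Combine stage-wise z-scores with pre-specified weights w1, 1 - w1.

    Var(sqrt(w1) Z1 + sqrt(1 - w1) Z2*) = w1 + (1 - w1) = 1 under H0,
    regardless of how the second-stage sample size was chosen.
    """
    if not 0 < w1 < 1:
        raise ValueError("w1 must lie strictly between 0 and 1")
    return sqrt(w1) * z1 + sqrt(1 - w1) * z2_star


def weighted_test_rejects(z1, z2_star, w1, alpha=0.025):
    """One-sided level-alpha test based on the weighted statistic."""
    return weighted_z(z1, z2_star, w1) > NormalDist().inv_cdf(1 - alpha)


# Interim planned at half the information, so w1 = 0.5.
print(round(weighted_z(1.0, 2.0, 0.5), 4))     # 2.1213
print(weighted_test_rejects(1.0, 2.0, 0.5))    # True
print(weighted_test_rejects(1.0, 1.0, 0.5))    # False
```

The weights must be fixed before the interim look; recomputing them from the realized sample sizes would reintroduce the dependence that the method is designed to remove.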

Concerns and Weaknesses of the Design
Since the distribution of these redesigned test statistics is not changed by the sample size modification, the type I error probability is preserved exactly at the desired level, and the generality and simplicity of these methods greatly facilitate their application. However, the authors of [15] report that their method has a very small loss of power compared with the optimal test in the whole sample. This is not surprising, since it is essentially a nonparametric method, which may lose power relative to parametric methods when the distribution is known. Moreover, it is well known that the method of [16] weights patients enrolled before and after the interim analysis unequally if the sample size is increased, which violates the "one patient, one vote" principle. It was also mentioned in [18] that the modified test statistic causes a loss of efficiency, since it is not a sufficient statistic for the mean difference.

Concerns and Weaknesses of the Design
With the re-estimated sample size $N^*$ and re-estimated critical value $c^*$ obtained by jointly solving equation 6 and a sample size re-estimation function, the type I error rate and the conditional power of the final test are preserved at the desired levels. However, the methods of [19][20][21] provide no constraint, or no clear criteria for finding a constraint, on the range of conditional power that allows SSR. The numerical example in [23] suggests that having no lower boundary on the adaptive region, or no upper boundary on the sample size increase, causes design inefficiency if a very small conditional power is obtained at the interim analysis, which is equivalent to having a very large re-estimated critical value $c^*$ at the final test. On the other hand, although the designs in [18,22] constrain the adaptive region, as with [19][20][21], re-estimating the sample size within their proposed region may lead the final test to be compared with a critical value larger than the originally planned critical value $z_\alpha$. As mentioned in previous sections, design users may find it hard to accept that the critical value of the final test changes only because the sample size has changed. This is especially difficult when a critical value $c^*$ larger than the original critical value $z_\alpha$ is needed, which amounts to a penalty on the final test [18]. Moreover, it was proved in [18]

Objective and Design
Because changing the final test critical value due to SSR may not be easily accepted, the designs proposed by [24][25][26] control the type I error rate by constraining the range of conditional power (given in equation 7) within which SSR is allowed. This constraint is the so-called "promising zone". Two procedures are proposed and compared by [24]. The first procedure allows increasing sample size if ( 0 , | 1 ) ∈ (0.5, ( / √1 − )) and the re-estimated sample size is calculated by replacing is between 0.5 and $1-\beta$, its new sample size can be given by equation 4. It was proved that both procedures control the type I error rate, but the simulation results in their paper show that the second procedure is more powerful than the first.
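To make the idea of a promising zone concrete, the sketch below computes the conditional power of an unweighted one-sided z-test with known unit variance, evaluated at the interim estimate of the treatment effect ("current trend"), and checks whether it falls between 0.5 and $1-\beta$. The conditional power expression is the standard one for this setting and is an assumption here, since the paper's equation 7 is not reproduced; function names and the numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def conditional_power(z1, n1, n_total, delta, crit):
    """P(final z-statistic > crit | first-stage z-score z1).

    Per-group sizes, known variance 1. The final statistic is
    (sqrt(n1) * Z1 + sqrt(n2) * Z2) / sqrt(n_total), with n2 = n_total - n1
    and Z2 ~ N(delta * sqrt(n2 / 2), 1).
    """
    n2 = n_total - n1
    needed_z2 = (crit * sqrt(n_total) - z1 * sqrt(n1)) / sqrt(n2)
    return 1 - STD_NORMAL.cdf(needed_z2 - delta * sqrt(n2 / 2))


def promising_zone_check(z1, n1, n0, alpha=0.025, power=0.9):
    """Return (in_zone, cp): cp is the conditional power at the original
    sample size n0 under the interim effect estimate."""
    crit = STD_NORMAL.inv_cdf(1 - alpha)
    delta_hat = z1 * sqrt(2 / n1)  # interim estimate of the mean difference
    cp = conditional_power(z1, n1, n0, delta_hat, crit)
    return 0.5 < cp < power, cp


for z1 in (1.0, 1.5, 2.5):
    in_zone, cp = promising_zone_check(z1, n1=50, n0=100)
    print(f"z1={z1:.1f}  conditional power={cp:.3f}  promising={in_zone}")
```

In this example only the middle interim result is "promising": the first is too weak to justify more patients, and the last is already likely to succeed at the original sample size.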
The designs given by [18,25,26] were proved to have a wider promising zone than the design of [24]. More specifically, their "promising zone" includes all the value of

The promising zone UbSSR design was further developed in [27] by setting additional constraints on the range of the promising zone, which balance the gain in conditional power against the cost of increasing the sample size. Later, [28] also proposed constraining the range of the promising zone using the maximum allowed sample size and the range of conditional power achieved with that maximum sample size, evaluated at the smallest clinically meaningful treatment benefit.

Concerns and Weaknesses of the Design
The lower boundary of the promising zone of [24]

Further Concerns for Adaptive SSR Based on Conditional Power
Although conditional power-based adaptive SSR can save a trial when the originally planned total sample size is underestimated, it is not without penalty. It can indeed save the trial when the assumed treatment effect slightly overestimates the true treatment effect, since adding more samples then improves the power to some extent. In certain situations, however, the uncertainty of the conditional power function actually reduces the efficiency of a well-designed trial. It was shown in [29] that when the expected sample size of a fixed-size design equals that of a [26] design, the power of the [26] design can be lower than that of the fixed-size design. It was also pointed out in [30] that when the true effect size is small, recalculating the sample size in mid-trial based on an interim estimate may lead to an overly large price in average sample size compared with the gain in overall power. On the other hand, if the assumed treatment effect dramatically overestimates the true value, the conditional power at the interim analysis will be too low, no sample size modification will be made, and nothing will be gained from the extra procedure. Moreover, because the conditional power is random, for small sample sizes there is still a high chance that the adaptive SSR increases the sample size and achieves an unnecessarily high power, even when the original trial is well designed. It is therefore better to examine the operating characteristics (power and type I error rate) of the entire procedure, for example by simulating the adaptive design under different values of $\delta$ in the range of interest. Through such simulations one may be able to judge whether the adaptive design is worth adopting.
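Such a simulation can be sketched as follows. This is only an illustrative setup, not any of the reviewed designs exactly: the variance is known and equal to 1, the interim look is at half of $N_0$, the second-stage size is increased (never decreased) to reach a target conditional power under the interim estimate, the increase is capped at a hypothetical `n_max`, and the final test uses a weighted z-statistic with pre-specified weights. All names and default values are assumptions.

```python
import random
from math import ceil, sqrt
from statistics import NormalDist

STD = NormalDist()


def simulate_weighted_ssr(delta, n0=100, t=0.5, n_max=200, alpha=0.025,
                          target_cp=0.9, n_sims=20000, seed=1):
    """Return (rejection rate, average per-group sample size)."""
    rng = random.Random(seed)
    crit = STD.inv_cdf(1 - alpha)
    z_cp = STD.inv_cdf(target_cp)
    n1 = int(t * n0)
    w1 = n1 / n0                     # pre-specified first-stage weight
    rejections = 0
    total_n = 0
    for _ in range(n_sims):
        # Stage 1 z-score with known unit variance.
        z1 = rng.gauss(delta * sqrt(n1 / 2), 1.0)
        delta_hat = z1 * sqrt(2 / n1)
        n_star = n0
        if delta_hat > 0:
            # Stage 2 z-score needed for the weighted test to reject.
            b = (crit - sqrt(w1) * z1) / sqrt(1 - w1)
            # Stage 2 size giving conditional power target_cp at delta_hat.
            ratio = max(b + z_cp, 0.0) / delta_hat
            n2_needed = min(2 * ratio * ratio, float(n_max))
            n_star = min(max(n0, n1 + ceil(n2_needed)), n_max)
        n2 = n_star - n1
        z2 = rng.gauss(delta * sqrt(n2 / 2), 1.0)
        if sqrt(w1) * z1 + sqrt(1 - w1) * z2 > crit:
            rejections += 1
        total_n += n_star
    return rejections / n_sims, total_n / n_sims


for delta in (0.0, 0.3, 0.46):
    rate, avg_n = simulate_weighted_ssr(delta)
    print(f"delta={delta:.2f}  rejection rate={rate:.3f}  average N*={avg_n:.1f}")
```

The row for `delta=0.0` estimates the type I error rate, and the other rows estimate power. Comparing the average $N^*$ with the sample size a fixed design would need for the same power gives a simple view of the efficiency trade-off discussed above.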

Conclusions and Remarks
There is no doubt that the designs reviewed in this paper may keep some underpowered trials from failing to find a significant treatment benefit, but, as summarized in each section, none of them is a perfect design. Before applying SSR designs to a real clinical trial, it is better to take into consideration their potential problems, such as inflation of the type I error rate, inefficiency, computational complexity and impracticality. A few additional points are also worth mentioning.
1) If the population does not follow a normal distribution, most of the designs discussed in this paper rely on the asymptotic normal distribution derived from the central limit theorem to calculate the sample size and conduct hypothesis testing, so they are applicable only with large sample sizes at both stages.
2) In this paper, we discuss only SSR designs for superiority tests. The designs may encounter more problems when they are applied to non-inferiority or equivalence hypotheses.
3) We assume equal variances for both treatment groups at the beginning of this paper, which is also the assumption made by most of the papers we reviewed. The formulation may be more complicated, and efficiency may be compromised, when the variances actually differ.
4) The designs based on the conditional power function reviewed in this paper assume known variance for both treatment groups, so they do not have to re-estimate the variance and can use a simple z-statistic for the final test. The formulation will be more complicated but also more accurate if the variance is re-estimated and a t-statistic is used for the final test.
5) When comparing different SSR designs, besides the basic statistical operating characteristics (type I error rate, power, etc.), there are many other criteria, and it is hard to identify a single most important one.
We need to weigh carefully the advantages and disadvantages of each design before using it in clinical trials.