Incorporating historical information in biosimilar trials: Challenges and a hybrid Bayesian‐frequentist approach

For the approval of biosimilars, it is, in most cases, necessary to conduct large Phase III clinical trials in patients to convince the regulatory authorities that the product is comparable in terms of efficacy and safety to the originator product. As the originator product has already been studied in several trials beforehand, it seems natural to incorporate this historical information in the demonstration of equivalent efficacy. Since all studies for the regulatory approval of biosimilars are confirmatory studies, it is required that the statistical approach has reasonable frequentist properties, most importantly, that the Type I error rate is controlled—at least in all scenarios that are realistic in practice. However, it is well known that the incorporation of historical information can lead to an inflation of the Type I error rate in the case of a conflict between the distribution of the historical data and the distribution of the trial data. We illustrate this issue and confirm, using the Bayesian robustified meta‐analytic‐predictive (MAP) approach as an example, that simultaneously controlling the Type I error rate over the complete parameter space and gaining power in comparison to a standard frequentist approach that only considers the data in the new study is not possible. We propose a hybrid Bayesian‐frequentist approach for binary endpoints that controls the Type I error rate in the neighborhood of the center of the prior distribution, while improving the power. We study the properties of this approach in an extensive simulation study and provide a real‐world example.


INTRODUCTION
A biosimilar (the test product) is approved as a copy of an already marketed biological product (the reference product) if it can be shown that the biosimilar and the reference product are highly similar and that "there are no clinically meaningful differences between the biological product and the reference product in terms of the safety, purity, and potency of the product" (FDA, 2009). Biosimilar development has gained much attention in the last ten years because the patents of several biological blockbusters have recently expired (e.g., etanercept, infliximab), which makes it possible for competing companies to bring their own versions of the drug onto the market. Currently, there are 41 approved biosimilars in Europe (Generics and Biosimilars Initiative, 2017a) and 10 approved biosimilars in the US (Generics and Biosimilars Initiative, 2017b).
In order to demonstrate that a proposed biosimilar has the same efficacy and safety as its corresponding reference product, large Phase III trials in patients are usually requested. For previously approved biosimilars in Europe, these were mostly parallel groups trials (Mielke, Jilma, Koenig, & Jones, 2016) with 89 to 759 subjects. It is well known that drug development is a costly and time-consuming process (DiMasi, Hansen, & Grabowski, 2003) and, because biosimilars are supposed to be sold at a lower price than the originator product, it is essential to keep the development costs as low as possible. Consequently, it is desirable that all available information is used to determine if the biosimilar and reference products are highly similar. In terms of available information, the development of a biosimilar differs from the development of a new drug in the sense that the reference product is already an established product at the point in time at which the studies for the biosimilar product are conducted. The efficacy of the reference product has already been investigated in several studies and, at least, summary statistics from these trials are publicly available. It seems natural to incorporate the historical information on the reference product into the assessment of biosimilarity since this may lead to a reduction in sample size. The Bayesian framework provides an intuitive way of combining the historical information with the observed data in the new study. Neuenschwander, Capkun-Niggli, Branson, and Spiegelhalter (2010) developed, for example, a meta-analytic-predictive (MAP) approach that can combine several historical trials into a prior distribution taking into account the between-trial variation. The MAP approach assumes that the parameters corresponding to different trials are not identical but similar. A random-effects meta-analytical model is used to quantify the degree of similarity.
However, Bayesian methods are mostly used in exploratory analyses in early stage development. For biosimilars, in contrast, all clinical studies are confirmatory studies in which the frequentist operating characteristics of the approaches are, in most cases, expected to fulfil certain criteria; most importantly, that the Type I error rate is controlled over the complete parameter space or, if that is not feasible, at least in all scenarios that are realistic in practice. It is well known that the use of historical information can lead to an inflation of the Type I error rate in the case where the observed data in the study do not match the historical information (e.g., Schmidli et al., 2014). A moderate inflation of the Type I error rate might be acceptable if, for example, a rare disease is studied and the development of the drug would not be feasible otherwise (e.g., Hampson, Whitehead, Eleftheriou, & Brogan, 2014). For biosimilars, however, the incorporation of historical information is motivated by savings in terms of resources and time. This might speed up the development and therefore bring the biosimilar earlier to patients and that might reduce the costs of treatment and allow more patients to be treated with the biologic. In addition, as fewer patients need to be enrolled in the trial, the burden on the patients is reduced. However, since the reference product is already approved and there is no unmet medical need, Type I error rate inflation should not occur in scenarios that are relevant in practice. Consequently, a methodology for incorporating historical information that controls the Type I error rate in all realistic scenarios, while still providing an advantage in terms of power, is needed. Here, we present such a methodology.
Bayesian approaches for biosimilar development have already been discussed previously, for example by Chiu, Liu, and Chow (2014). Also, Tsou, Chang, Hwang, and Lai (2013) proposed a consistency approach and Hsieh, Chow, Yang, and Chi (2013) discussed the use of reproducibility probabilities. However, none of these authors focused on a methodology that can control the Type I error rate in realistic scenarios, which we consider highly important.
The goal of our paper is, on the one hand, to illustrate the challenges with strict Type I error rate control, and on the other hand to propose a methodology that can control the Type I error rate in the neighborhood of the mean value of the prior distribution, while still providing an advantage in terms of power or reduction of the sample size. We focus on binary endpoints, but the general conclusions and ideas are also valid for other types of endpoint.
The rest of the paper is structured as follows: we first show in Section 2 that it is not possible to achieve strict Type I error rate control while gaining power with the Bayesian robustified MAP approach (Schmidli et al., 2014) in comparison to the frequentist two-one-sided-test (TOST) approach (Schuirmann, 1987). The TOST approach is commonly used in this setting and considers only the data in the new study. Afterwards, in Section 3, we propose a hybrid Bayesian-frequentist approach that can achieve the desired control of the Type I error rate in an area around the center of the prior distribution while still gaining power in that region. Next, the operating characteristics of the proposed methodology are discussed in an extensive simulation study (Section 4) before we show how to use the proposed methodology in practice by illustrating the planning of a study for a biosimilar with the active substance adalimumab (Section 5). We finish with some conclusions and comments in Section 6. Source code to reproduce all results presented in this paper is available as Supporting Information on the journal's web page (http://onlinelibrary.wiley.com/doi/10.1002/bimj.201700152/suppinfo).

STRICT TYPE I ERROR RATE CONTROL AND GAIN IN POWER ARE INCOMPATIBLE WITH THE INCLUSION OF HISTORICAL DATA
Before we propose our hybrid Bayesian-frequentist approach, we illustrate the challenges with strict Type I error rate control if historical information is incorporated using the robustified meta-analytic-predictive (MAP) approach that was introduced by Schmidli et al. (2014). It is important to note that this methodology only serves as an example for the challenges related to Type I error rate inflation due to the use of historical information and other approaches share the same main issues. In this paper, we focus on binary endpoints, but the general approach is transferable to other types of endpoints. The necessary notation, the standard two-one-sided-test (TOST) approach that only considers the data in the new study (Schuirmann, 1987) and the robustified MAP approach are introduced in Subsection 2.1. The operating characteristics of the robustified MAP approach are described in Subsection 2.2.

Notation and methodology
In this paper, we assume that a proposed biosimilar (the test product, T) is compared to the authorized product (the reference product, R) in a parallel groups design with equal allocation of patients to the test and reference groups (total sample size for the study: N = 2n patients). We focus on a binary endpoint, which is a typical level of measurement in biosimilar development and was, for example, used in the applications for biosimilars with the active substances etanercept and infliximab in Europe (Mielke et al., 2016). Let p_R be the response rate for the reference treatment and p_T be the response rate for the test treatment. The goal is to show equivalent response rates for T and R. More precisely, the following hypotheses are considered:

H_0: |p_T − p_R| ≥ Δ versus H_1: |p_T − p_R| < Δ,

where Δ is a pre-specified value (the equivalence margin) that is the maximum difference in response rates such that the products would not be considered different from a clinical point of view. We denote the number of observed responders in the reference group by x_R and in the test group by x_T. The observed response rates are given, respectively, by p̂_R = x_R / n and p̂_T = x_T / n. The operating characteristics of the approach that incorporates the historical information are compared to an approach that considers the new data only and can therefore serve as the benchmark. The commonly used frequentist approach that only considers the data from the new study is the two-one-sided-test (TOST) approach (Schuirmann, 1987). For that, the null hypothesis is split into two parts:

H_01: p_T − p_R ≤ −Δ and H_02: p_T − p_R ≥ Δ.

The standard TOST approach for binary data tests the previously described pair of hypotheses by use of an approximation to the normal distribution. For that, the variability needs to be estimated, and the standard error for the difference in response rates is given by

ŝe = sqrt( p̂_T (1 − p̂_T) / n + p̂_R (1 − p̂_R) / n ).

The test statistics are defined as

T_1 = (p̂_T − p̂_R + Δ) / ŝe and T_2 = (Δ − (p̂_T − p̂_R)) / ŝe.

Since both test statistics follow asymptotically a standard normal distribution, the null hypothesis is rejected if both T_1 and T_2 are larger than the (1 − α)-quantile of the standard normal distribution.
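The TOST decision rule above can be sketched in a few lines. The following is a minimal illustration in Python (the paper's supplementary code is in R; the function name here is our own):

```python
from math import sqrt
from statistics import NormalDist

def tost_binary(x_t, x_r, n, delta, alpha=0.05):
    """TOST for equivalence of two response rates (normal approximation).

    x_t, x_r: numbers of responders in the test / reference group,
    n: subjects per group, delta: equivalence margin.
    H0 is rejected if both one-sided z-statistics exceed the
    (1 - alpha)-quantile of the standard normal distribution.
    """
    p_t, p_r = x_t / n, x_r / n
    se = sqrt(p_t * (1 - p_t) / n + p_r * (1 - p_r) / n)
    z1 = (p_t - p_r + delta) / se    # tests H01: p_t - p_r <= -delta
    z2 = (delta - (p_t - p_r)) / se  # tests H02: p_t - p_r >= +delta
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return z1 > z_crit and z2 > z_crit

print(tost_binary(75, 75, n=150, delta=0.15))   # equal rates -> True
print(tost_binary(90, 68, n=150, delta=0.15))   # difference near margin -> False
```

Note that both one-sided statistics share the same estimated standard error, so the rule reduces to checking whether the approximate (1 − 2α) confidence interval for p̂_T − p̂_R lies within (−Δ, Δ).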
This frequentist approach is compared to the robustified MAP approach, which serves as a representative of a Bayesian approach. It is important to note that historical information is available for the reference product only. Although some authors (e.g., Chiu et al., 2014) proposed using an informative prior based on the historical information of the reference product both for the test and reference product, we believe that this introduces a bias toward the decision for equality, which is supposed to be shown and not assumed. Therefore, we use informative priors for the reference product only and, for the test product, we assume a non-informative prior (uniform distribution). In the following, it is described how the posterior distribution is derived for the reference product. The posterior distribution for the test product is obtained directly by applying Bayes' theorem in the same way.
The robustified MAP approach is a Bayesian, stepwise procedure that requires the derivation of a so-called MAP prior based on the historical data, robustifies it, and finally combines this prior with the observed data in the new study to give a posterior distribution using Bayes' theorem (Schmidli et al., 2014). The MAP approach is based on a hierarchical model. In the case of a binary endpoint, we assume that the data in the new study follow a binomial distribution with parameter p_R and sample size n, and that the data in the h historical studies also follow binomial models with parameters p_{R,hist,i} and sample sizes n_{hist,i} (i = 1, …, h). The parameters of the historical data and the new study are linked using the so-called exchangeability assumption, that is, the joint distribution of the parameters does not change when using a different ordering of the trials. In the following, we make the stronger assumption that

θ, θ_{hist,1}, …, θ_{hist,h} ∼ N(μ, τ²),

where θ = logit(p_R) and θ_{hist,i} = logit(p_{R,hist,i}).
The parameter μ represents the population mean, whereas τ is the between-trial standard deviation. As priors for the between-trial standard deviation in hierarchical models, half-normal or half-t distributions are recommended; see Spiegelhalter, Abrams, and Myles (2004) and Gelman (2006). For μ, for example, μ ∼ N(0, 10²) is a reasonable choice. For the reference product, we assume that a MAP prior π_H is already given; details on deriving MAP priors can be found in Schmidli et al. (2014). The MAP prior is robustified by combining it with a vague prior π_V, so that the robust MAP prior π_HR is given by

π_HR = (1 − w) π_H + w π_V,

where w ∈ [0, 1] is a chosen weight that represents the prior probability that the new data and the historical data differ and may be based on clinical judgment about the relevance of the historical data. Higher values of w put more weight on the vague part of the prior and lead to a prior distribution with heavier tails. In the case of prior-data conflicts, the historical information is then more quickly discounted. However, the gain in power is smaller for higher weights w. For binary endpoints, π_V may be chosen to be a uniform distribution. After the trial is finished, the robustified MAP prior is combined with the data that were observed during the trial using Bayes' theorem:

π_post(p_R | x_R) ∝ π_HR(p_R) L(x_R | p_R),

where π_HR is the robustified prior as described above and L(x_R | p_R) represents the likelihood for the data in the new trial (Schmidli et al., 2014).
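When the MAP prior is approximated by a Beta distribution, the robust prior is a two-component Beta mixture and, by conjugacy, the posterior is again a Beta mixture whose weights are updated with the beta-binomial marginal likelihood of the data. A small sketch under that assumption (function names are our own; the prior weight w must be strictly between 0 and 1 here):

```python
from math import lgamma, log, exp

def log_beta_fn(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def robust_map_posterior(x, n, a_h, b_h, w):
    """Posterior under the robust prior (1 - w) Beta(a_h, b_h) + w Beta(1, 1).

    The mixture weights are updated with each component's beta-binomial
    marginal likelihood of the data (x responders out of n); the binomial
    coefficient is common to both components and cancels. Requires 0 < w < 1.
    Returns a list [(weight, a, b), ...] of posterior Beta components.
    """
    comps = [(1 - w, a_h, b_h), (w, 1.0, 1.0)]
    log_m = [log(wt) + log_beta_fn(a + x, b + n - x) - log_beta_fn(a, b)
             for wt, a, b in comps]
    m = max(log_m)
    probs = [exp(v - m) for v in log_m]
    total = sum(probs)
    return [(p / total, a + x, b + n - x)
            for p, (wt, a, b) in zip(probs, comps)]

# data matching the prior mean keep most weight on the informative part ...
print(robust_map_posterior(x=75, n=150, a_h=50, b_h=50, w=0.5)[0][0] > 0.5)
# ... while a clear prior-data conflict shifts the weight to the vague part
print(robust_map_posterior(x=45, n=150, a_h=50, b_h=50, w=0.5)[0][0] < 0.5)
```

Both lines print True, illustrating the self-discounting behavior of the robust prior described above.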
In a way similar to the frequentist hypothesis testing criterion, we need to define a Bayesian success criterion that corresponds to the described null and alternative hypotheses so that the Type I error rate and power can be assessed. For that, we assume that P_T and P_R are random variables that follow the posterior distributions π_post,T and π_post,R for the test and reference product, respectively. Then, we claim equivalence of test and reference if

Pr(|P_T − P_R| < Δ) ≥ c,    (1)

where c has to be chosen such that the desired Type I error rate profile is achieved. Choosing a higher value for c leads to a smaller Type I error rate, but also to a lower power. The impact of this choice is discussed in more detail in Subsection 2.2.
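The posterior probability in this success criterion has no simple closed form for Beta-mixture posteriors, but it is straightforward to estimate by Monte Carlo sampling. A sketch, assuming the posteriors are given as lists of Beta-mixture components (our own representation):

```python
import random

def sample_beta_mixture(components, rng):
    """Draw one value from a Beta mixture given as [(weight, a, b), ...]."""
    u, acc = rng.random(), 0.0
    for w, a, b in components:
        acc += w
        if u <= acc:
            return rng.betavariate(a, b)
    w, a, b = components[-1]      # guard against rounding of the weights
    return rng.betavariate(a, b)

def equivalence_probability(post_t, post_r, delta, n_draws=50_000, seed=1):
    """Monte Carlo estimate of Pr(|P_T - P_R| < delta) under the posteriors."""
    rng = random.Random(seed)
    hits = sum(
        abs(sample_beta_mixture(post_t, rng) - sample_beta_mixture(post_r, rng)) < delta
        for _ in range(n_draws)
    )
    return hits / n_draws

# uniform prior + 75/150 responders for T; Beta(50, 50) prior + 75/150 for R
post_t = [(1.0, 76, 76)]
post_r = [(1.0, 125, 125)]
prob = equivalence_probability(post_t, post_r, delta=0.15)
print(prob >= 0.95)   # True: equivalence would be claimed at c = 0.95
```

With identical observed rates and a margin of Δ = 0.15, the posterior probability is far above the threshold c = 0.95, so the success criterion is met.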

Operating characteristics of the robustified MAP approach
For an illustration of the properties of the robustified MAP approach, we assume that the derived MAP prior for the reference product is a Beta(50, 50) distribution. We consider different weights for the robustification, w = 0, 0.1, 0.5, 0.9, in order to illustrate the impact of this choice. The value w = 0 corresponds to the situation without robustification. As described before, the prior for the test product is chosen to be non-informative (uniform distribution, Beta(1, 1)). We assume that the new study has a sample size of n = 150 subjects per group, the critical value for the Bayesian approach is set to c = 0.95, and we use α = 0.05 as the significance level for the TOST approach. We set the equivalence margin to Δ = 0.15 and calculate the rejection rates under the null hypothesis (to evaluate the Type I error rate) and under the alternative (to evaluate power when p_T = p_R) for several true response rates of the reference product. It is important to note that there are two limiting scenarios under the null hypothesis to consider: first, p_R = p_T − Δ, which we label as Situation (a) (the response rate of the reference product is smaller than the response rate of the test product), and second, p_R = p_T + Δ (Situation (b), the response rate of the reference product is larger than the response rate of the test product). Considering several true response rates for the reference product corresponds to considering different levels of prior-data conflict. The mean value of the chosen MAP prior is p̄ = 0.5. Therefore, a response rate of p_R = 0.5 is a scenario in which the data in the new study exactly match the prior belief. A response rate of p_R = 0.48, for example, relates to a scenario in which the response rate differs slightly from the prior belief, and a response rate of p_R = 0.3 indicates a clear prior-data conflict. Figure 1 shows the results. We first focus on the left and middle panels that show the empirical Type I error rates for Situations (a) and (b). The nominal significance level is α = 0.05.
The figure shows that the Type I error rate is controlled if the true response rate in the new trial matches the mean value of the prior distribution (p_R = 0.5), but can be substantially higher than the nominal significance level when the data from the new study do not match the prior belief. For the following discussion, we focus on Situation (a); however, Situation (b) is symmetric and all conclusions made for Situation (a) are also valid for Situation (b). In Situation (a), the approach is conservative if the response rate of the reference product is larger than the mean value of the prior (vertical line). On the other hand, we observe an inflation of the Type I error rate for response rates smaller than the mean value of the prior. The Type I error rate inflation is moderate if a high weight is used (maximal Type I error rate for w = 0.9: 0.0643), but increases up to 0.4654 if no robustification is used (solid line). The Type I error rate for the approach without robustification increases monotonically as the prior-data conflict increases. In contrast, for the robustified approaches, as the prior-data conflict grows, the Type I error rate first increases, reaches a maximum value, and afterwards decreases again. Thus, if the prior-data conflict is extreme, the robustified approach discounts the prior information and this reduces the Type I error rate in these situations. However, our concerns are not with the situations in which the prior-data conflict is extreme, since we consider these situations to be avoidable in a biosimilar setting, but with the more realistic situation of a minor difference between historical studies and the new study, in which MAP priors do not guarantee a reduction of the Type I error rate. For example, for w = 0.9 and a true response rate of p_R = 0.4655, we get a Type I error rate of 0.0564, which is higher than the nominal significance level of α = 0.05.
Therefore, it is important to emphasize that while robustification helps to limit the overall maximum Type I error rate, it does not control the Type I error rate at its nominal level in the region which we consider most relevant.
The panel on the right shows the empirical power. Here, we see that all approaches have higher power than the TOST approach in the area around the mean value of the prior distribution. However, the power is highest if no robustification is used (w = 0) and lowest if the highest weight is chosen (w = 0.9). This shows that there is a direct trade-off between the power and the Type I error rate profile: a more desirable Type I error rate profile comes with a smaller gain in power.
Obviously, it is possible to achieve Type I error rate control for all Bayesian approaches by increasing the critical value c (see Equation (1)). We calibrated the MAP approaches with the different weights to keep the Type I error rate at the nominal level over the shown parameter space (response rates of the reference product between 0.3 and 0.7). The Type I error rate and power are shown in Figure 2. The Type I error rate is now controlled for all approaches (left and middle panels). However, since the Type I error rate is not constant for all response rates in the new study, the approaches are very conservative for some response rates, specifically in the region around p_R = 0.5, which is the mean value of the prior distribution and therefore also the area in which we expect the response rate in the new study to lie. This conservative behavior directly translates into lower power (right panel): the power of the Bayesian approaches is now lower than the power of the TOST approach. Therefore, if strict Type I error rate control is required, the MAP approach (or any other approach using informative priors) cannot give any advantage in terms of power.

THE HYBRID BAYESIAN-FREQUENTIST APPROACH
The proposed approach combines Bayesian and frequentist ideas and uses two switching rules and a response rate-dependent critical value to improve the performance under the alternative hypotheses (higher power) while maintaining the required Type I error rate under the null hypothesis in the neighborhood of the mean value of the prior distribution. Both the switching rules and the response rate-dependent critical values are used at the analysis stage of the study. No interim analysis is required. We only aim to control the Type I error rate in a subset of the parameter space because it was shown in Subsection 2.2 that strict Type I error rate control and a gain in power are incompatible. The chosen subset of the parameter space is the region around the mean value of the prior distribution that is the area in which we expect the true response rate of the new study to lie. We consider it to be less important to control the Type I error rate for response rates in the tails of the prior distribution because, due to extensive experience with the reference product, it is realistic to assume that a new study can be set up that is similar to the historical trials. Profiles of the Type I error rate and power that we would consider acceptable are shown in Figure 3: the Type I error rate is controlled in a neighborhood of the mean value of the prior distribution that is indicated by the dotted vertical lines.
Outside of this region, we accept an inflation of the Type I error rate because we are highly confident that the true response rate will not lie in that region. More formally, let p̄ be the mean value of the prior distribution. We aim to control the Type I error rate for all response rates in the new study in the interval [p̄ − ε, p̄ + ε], where ε is the parameter that controls the width of the controlled interval. This partial control of the Type I error rate comes at a price: the gain in power in comparison to the MAP approach is expected to be lower. The goal is therefore not to develop a test with power comparable to the MAP approach, but to develop a test that achieves a higher power than the TOST approach. The approach is described in Subsection 3.1 and some computational issues and the choice of tuning parameters are discussed in Subsection 3.2.

Description of the approach
The starting point of our approach is a moment-matched Beta prior for the reference product that can be derived, for example, with the MAP approach (Schmidli et al., 2014). In this context, "moment-matched" means that the estimated first two moments of the generated sample from the distribution of the MAP prior match the first two moments of the derived Beta distribution. We do not use the concept of robustification for the following two reasons: (1) as shown in Subsection 2.2, robustification decreases the power without reducing the Type I error rate in a relevant way in the regions that matter to us, so it does not bring us closer to the desired operating characteristics (shown in Figure 3); (2) the assumption of a monotonic Type I error rate simplifies the determination of the response rate-dependent critical values (see Subsection 3.2). For the test product, we assume a uniform prior distribution. Afterwards, the new study is conducted using a parallel groups design with n subjects per group. The information from the new study is combined with the prior distribution using Bayes' theorem as stated in Subsection 2.1. However, we introduce some modifications (response rate-dependent critical values and switching rules) to achieve the power profile that is shown in Figure 3. These modifications are summarized in a flow chart in Figure 4 and are described and motivated in the following. It is important to note that, while the approach involves several tuning parameters, all of these values can be predefined during the planning stage of the new study by simulating the operating characteristics. The complete algorithm, including all critical values and decision points, is therefore chosen independently from the data of the new study and can be pre-specified in the study protocol.
We illustrate in Section 5 the steps that have to be conducted using an example and give in Subsection 3.2 an algorithm that tunes the proposed hybrid approach automatically.
The response rate-dependent critical values are motivated by Figure 1, which showed that, for the MAP approach without robustification (solid line), the rejection rate under the null hypothesis depends strongly on the true response rate of the reference treatment and has a monotonic shape. Therefore, we propose to use, instead of a constant critical value, a flexible, response rate-dependent critical value. This allows the use of higher critical values in areas in which the standard MAP approach was too liberal and smaller critical values in areas in which the standard MAP approach was too conservative. Also, the critical value depends on the ordering of the response rates of T and R: the conservative situations are located, for example, to the right of the mean value of the prior distribution in Situation (a), whereas they are to the left of the mean value of the prior in Situation (b). Therefore, different critical values shall be defined for Situations (a) and (b). More formally, let c_1 and c_2 be two functions that map the response rate of the reference product to the critical value. Function c_1 relates to the situation in which the response rate for reference is equal to or larger than that for test (Situation (b)) and c_2 to the situation in which the response rate for reference is smaller than that for test (Situation (a)). The true response rates for test and reference are not known in practice and need to be estimated, and the functions c_1 or c_2 are evaluated at the estimated response rate of the reference product. Also, the decision whether c_1 or c_2 is used is based on the estimated response rates. If the true response rate for R were known, the optimal critical value could easily be identified using the quantiles of the distribution of the test statistic for that specific response rate. The additional uncertainty of using estimated response rates makes the determination of valid critical values more difficult because this uncertainty has to be incorporated into the choice of critical values.
We discuss the identification of the functions c_1 and c_2 in Subsection 3.2.
In addition, the proposed approach uses two switching rules that are based on the estimated response rates. First, the historical information should only be used if the observed data are similar to the historical data (Bayes-frequentist switching rule, Switching rule I, see Figure 4). Similarity is here measured as the difference between the mean response rate, p̄, of the derived MAP prior and the observed response rate, p̂_R, in the new study. If

|p̄ − p̂_R| > d_1,    (3)

the historical information is ignored and the standard TOST approach as described in Subsection 2.1 is applied. Otherwise, the Bayesian approach is used. The motivation for this switching rule is that, in the case of a prior-data conflict, the historical data do not provide any useful information on the reference treatment and should be ignored. The negative impact of incorporating historical information in the case of a prior-data conflict is reflected in the operating characteristics of the Bayesian approach, which are not desirable (see the solid line in Figure 1), neither in terms of power nor in terms of the Type I error rate. Therefore, it is reasonable to switch back to the TOST approach in this case. The constant d_1 has to be chosen beforehand. It reflects the acceptable degree of prior-data conflict and influences the operating characteristics of the approach. We give some advice on its choice in Subsection 3.2. The second switching rule is used in order to increase the chance of rejection of the null hypothesis in the case where the study shows that the test and the reference products are similar, but the data of the reference product do not match the historical data. If the data do not match the prior, even if test and reference show perfect similarity in the new study, the prior knowledge for the reference product will pull the posterior mean of the reference product away from the mean value of the test product, making it difficult to reject the null hypothesis.
Therefore, if

|p̂_T − p̂_R| < d_2,    (4)

we use the constant c̄ instead of the response rate-dependent critical values c_1(p̂_R), c_2(p̂_R) as the critical value. Both d_2 and c̄ are tuning parameters and need to be chosen to fit the desired operating characteristics. Some advice on the choice is provided in Subsection 3.2. Figure 4 gives a summary of the algorithm: in the first step, priors are chosen for test (T) and reference (R) (e.g., MAP prior for R, uniform distribution for T). Afterwards, the tuning parameters and the functions c_1, c_2 are chosen (see Subsection 3.2) before the new study is conducted and the response rates for reference and test are estimated (p̂_R, p̂_T). Using the first switching rule (Equation (3)), it is decided whether the historical data should be ignored. In the case where the historical data are not used, the test decision can be made directly using the TOST approach. Otherwise, the Bayesian success criterion (Equation (1)) that was introduced in Subsection 2.1 is used, and the second switching rule (Equation (4)) is applied to decide which critical value is used. In the case where the point estimates for the response rates of test and reference are very similar, c̄ is used as the critical value. Otherwise, p̂_R is compared with p̂_T: if p̂_R is smaller than p̂_T, then c_2(p̂_R) is the critical value; otherwise c_1(p̂_R) is used.
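The decision flow just described can be written down compactly. The following sketch uses our own names and treats the TOST decision and the posterior probability Pr(|P_T − P_R| < Δ) as precomputed inputs; the threshold and curve values in the usage example are purely illustrative:

```python
def hybrid_decision(p_hat_t, p_hat_r, prior_mean, d1, d2, c_bar,
                    c1, c2, tost_reject, bayes_prob):
    """One pass through the switching rules of the hybrid approach.

    tost_reject: TOST decision based on the new data only (bool).
    bayes_prob:  posterior probability Pr(|P_T - P_R| < delta).
    c1, c2:      response rate-dependent critical value functions.
    """
    # Switching rule I: prior-data conflict -> ignore the historical data
    if abs(prior_mean - p_hat_r) > d1:
        return tost_reject
    # Switching rule II: nearly identical estimates -> constant critical value
    if abs(p_hat_t - p_hat_r) < d2:
        return bayes_prob >= c_bar
    # otherwise, the ordering of the estimates selects c1 or c2
    crit = c2(p_hat_r) if p_hat_r < p_hat_t else c1(p_hat_r)
    return bayes_prob >= crit

# toy critical value curves and thresholds (illustrative values only)
c1 = lambda p: 0.93 + 0.05 * (p - 0.5)
c2 = lambda p: 0.93 - 0.05 * (p - 0.5)
print(hybrid_decision(0.51, 0.50, prior_mean=0.5, d1=0.1, d2=0.02,
                      c_bar=0.90, c1=c1, c2=c2,
                      tost_reject=False, bayes_prob=0.92))   # True
```

In the printed example, the estimates are within d_2 of each other, so the lower constant c̄ = 0.90 applies and the posterior probability 0.92 suffices for rejection; with the curve-based critical value it would not.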

Optimal choice of response rate-dependent critical values and tuning parameters
In this subsection, we discuss the derivation of optimal response rate-dependent critical values and the choice of tuning parameters. We provide supplementary online material with further details concerning the implementation of the proposed methodology and a step-by-step example with R-code.
Before discussing the optimization, it is important to note that, due to the discrete nature of the problem, it is possible to calculate the exact Type I error rate and exact values for the power without using simulations. For that, it is necessary to calculate the test decision for all combinations of numbers of responders, x_T and x_R, that can be observed in the new study. We denote the test decision for observed values x_T and x_R by D_{x_T,x_R}. This is a binary variable with the value 1 when the test decision is for the alternative and 0 otherwise. For example, for n = 150, it is necessary to evaluate the test decision D_{x_T,x_R} for 151² = 22801 scenarios. Combining the test decisions with the probabilities of observing x_T and x_R, which are denoted by P(X_T = x_T) and P(X_R = x_R), respectively, leads to the exact rejection rates:

rejection rate = Σ_{x_T=0}^{n} Σ_{x_R=0}^{n} D_{x_T,x_R} P(X_T = x_T) P(X_R = x_R).

The probabilities P(X_T = x_T) and P(X_R = x_R) for a specific scenario can be derived using the binomial distribution with parameters p_T and p_R, respectively, and a sample size of n subjects per group. Both the functions that map the estimated response rates to the critical values (c_1, c_2) and several tuning parameters have to be chosen before the hybrid approach can be applied. Choosing the functions c_1, c_2 without any assumptions on the functional form is very difficult. Since the Type I error rate is a monotonic function of the response rate, we also use monotonic functions for c_1 and c_2. More specifically, we assume a logistic function with the parameters a (the minimal value of the function), b (the difference between the minimal and maximal value of the function), x_0 (the sigmoid's midpoint), and k (the steepness of the curve). In addition, we assume that c_1 and c_2 are complements of each other, that is,

c_1(p) = a + b / (1 + exp(−k (p − x_0))),    c_2(p) = a + b / (1 + exp(k (p − x_0))).
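The exact calculation by enumeration can be sketched as follows; the TOST rule serves here as the plug-in decision function, and the scenario values are illustrative (function names are our own):

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def exact_rejection_rate(decision, n, p_t, p_r):
    """Exact rejection rate: sum of D_{x_t, x_r} * P(X_T = x_t) * P(X_R = x_r)
    over all (n + 1)^2 possible outcomes of the trial."""
    pmf_t = [binom_pmf(k, n, p_t) for k in range(n + 1)]
    pmf_r = [binom_pmf(k, n, p_r) for k in range(n + 1)]
    return sum(pmf_t[x_t] * pmf_r[x_r]
               for x_t in range(n + 1)
               for x_r in range(n + 1)
               if decision(x_t, x_r))

def tost_decision(x_t, x_r, n=150, delta=0.15, alpha=0.05):
    """TOST decision D_{x_t, x_r}, used here as an example decision rule."""
    p_t, p_r = x_t / n, x_r / n
    se = math.sqrt(p_t * (1 - p_t) / n + p_r * (1 - p_r) / n)
    if se == 0.0:                     # all-or-none outcomes in both groups
        return abs(p_t - p_r) < delta
    z = NormalDist().inv_cdf(1 - alpha)
    return (p_t - p_r + delta) / se > z and (delta - (p_t - p_r)) / se > z

# exact Type I error of the TOST at the boundary p_T = p_R + Delta ...
t1e = exact_rejection_rate(tost_decision, n=150, p_t=0.65, p_r=0.50)
# ... and exact power at p_T = p_R = 0.5
power = exact_rejection_rate(tost_decision, n=150, p_t=0.50, p_r=0.50)
```

Replacing `tost_decision` by the hybrid decision rule yields the exact operating characteristics that are optimized in the tuning procedure.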
In addition, the tuning parameters γ_1, γ_2, and λ̄ have to be chosen. The parameter γ_1 describes the boundary at which the data-prior conflict is claimed to be so extreme that the historical data should be ignored (see Equation (3)). If γ_1 = 0, the TOST approach would always be used; if γ_1 = 1, only the Bayesian approach would be used. Therefore, a higher choice of γ_1 leads to higher power, but to a less desirable Type I error rate profile. We found that γ_1 = 3σ̂, with σ̂ = √(p̄(1 − p̄)/n), is a reasonable compromise. The parameter γ_2 is the boundary below which the critical value λ̄ is used instead of c_1(p̂) or c_2(p̂); this is supposed to make a rejection of the null hypothesis easier in case the observed response rates of T and R are very similar (see Equation (4)). For γ_2, similar arguments as for γ_1 and experience gained in simulation studies lead to the recommended choice γ_2 = σ̂. The parameter λ̄ defines the value to which the Bayesian success criterion (see Equation (1)) is compared if the absolute difference in the observed response rates of T and R in the new study is smaller than γ_2. Generally, a lower value of λ̄ leads to higher power, but if λ̄ is chosen too small, it might not be possible to control the Type I error rate in the interval I, even if c_1 and c_2 are set to the constant value 1, which makes a rejection of the null hypothesis impossible whenever the difference in response rates is not smaller than γ_2. Therefore, it is difficult to give a general recommendation for this parameter, but a suitable choice can be made during the optimization procedure. In total, five parameters (the four parameters a, b, k, x_0 of the response rate-dependent critical values and the tuning parameter λ̄) need to be chosen.
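The interplay of the three tuning parameters can be summarized in a small dispatch function. This is an illustrative Python sketch only (the names are hypothetical; the precise form of the switching rules is given in Equations (3) and (4) of the paper): the reference rate is first compared to the prior mean, then the two observed rates are compared to each other.

```python
def analysis_branch(p_hat_T, p_hat_R, prior_mean, gamma1, gamma2):
    """Illustrative dispatch between the analyses of the hybrid approach.

    Switching rule I: if the observed reference rate is too far from the
    prior mean (an extreme data-prior conflict), fall back to plain TOST.
    Switching rule II: if the observed rates of T and R are very similar,
    compare the Bayesian success criterion to the fixed threshold lambda_bar;
    otherwise compare it to the response rate-dependent critical values.
    """
    if abs(p_hat_R - prior_mean) > gamma1:
        return "TOST"
    if abs(p_hat_T - p_hat_R) < gamma2:
        return "Bayes vs lambda_bar"
    return "Bayes vs c1/c2"
```

With the case-study values reported later in the paper (prior mean 0.4813, γ_2 = 0.0567), observed rates of 0.471 and 0.474 would, for instance, route the decision to the comparison against λ̄.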

FIGURE 5 Flow chart of the automatic tuning of the parameters for the hybrid approach
The choice of the five parameters can be automated by an optimization procedure, which is described in the following. We optimize the power under the alternative that p_T = p_R = p̄ under the constraint that the Type I error rate is controlled in the interval I; other target functions are also possible (e.g., optimizing the power under the alternative that p_T = p_R ± log(1.05)) and might lead to slightly different results. The chosen target function can be visualized in Figure 3: the Type I error rate is to be controlled at the points in the left panel, while the power is optimized at the dot in the right panel. It is important to note that it is sufficient to control the Type I error rate at the boundaries of the interval I, because the Type I error rate has a monotonic shape in Situations (I) and (II) for response rates in the center of the prior distribution.
Due to the complexity of the optimization problem, we use a step-wise procedure that does not guarantee the identification of the global optimum; however, experience (e.g., the simulation study in Section 4) has confirmed that this strategy nonetheless leads to a good solution. First, we identify the parameter λ̄; then, we determine suitable starting values for optimizing c_1 and c_2 before we identify the parameters of the optimal functions c_1, c_2 (a, b, k, x_0). The procedure is summarized in the flow chart in Figure 5. For the start of the optimization procedure (Step 1), we set γ_1 = 3σ̂, γ_2 = σ̂, λ̄ = 0.9, and c_1 = c_2 = 0.99, where σ̂ = √(p̄(1 − p̄)/n).
Then, we calculate the Type I error rate for p_R = p̄ ± ε in Situation (I) and Situation (II) with these parameters (Step 2). While the Type I error rate is larger than 0.05 in Situation (I) or Situation (II), we set λ̄ := λ̄ + 0.005 and repeat the calculation of the Type I error rate. Once the Type I error rate is controlled, we aim to identify good starting values for the optimization of the functions c_1, c_2 (Step 3). For that, we determine the constant critical value c with c_1 = c_2 = c that maximizes the power at the mean value of the prior while controlling the Type I error rate in the interval I. This optimization is performed with the function crs2lm (Price, 1983; Kaelo & Ali, 2006) in the R-package nloptr (Johnson, 2014). Afterwards, the constant critical value found in this way is used as the starting value for identifying the optimal values of the parameters a, b, k, x_0 of the functions c_1, c_2 (Step 4; see the supplementary material for more details). This optimization is also first performed with the function crs2lm; the Nelder-Mead algorithm (function optim) is then applied to the solution found, in order to improve the performance. Since crs2lm is a random search algorithm, we apply both steps several times in order to avoid stopping at a local optimum.
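The two-stage pattern (a seeded global random search followed by a local polish of the best candidate) can be mimicked in a few lines of Python. This is a toy sketch only: a simple penalized objective stands in for the exact power/Type I error computation, so it demonstrates the optimization strategy, not the paper's actual target function (which uses crs2lm from nloptr and optim in R).

```python
import random

def objective(c):
    """Toy stand-in for the target function: maximize the 'power' proxy
    -(c - 0.90)**2 subject to a 'Type I error' constraint c >= 0.97,
    expressed as a quadratic penalty. (In the paper, power and Type I
    error rate are computed exactly from the binomial model.)"""
    power_proxy = -(c - 0.90) ** 2
    penalty = 1e3 * max(0.0, 0.97 - c) ** 2
    return -power_proxy + penalty  # minimized below

def random_search(lo, hi, n_draws, rng):
    """Stage 1: global random search (the paper uses crs2lm from nloptr)."""
    return min((rng.uniform(lo, hi) for _ in range(n_draws)), key=objective)

def local_polish(c, step=0.05, shrink=0.5, n_rounds=60):
    """Stage 2: local refinement with shrinking steps (the paper applies
    Nelder-Mead via optim to the stage-1 solution)."""
    for _ in range(n_rounds):
        for cand in (c - step, c + step):
            if objective(cand) < objective(c):
                c = cand
        step *= shrink
    return c

rng = random.Random(1)                   # fixed seed for reproducibility
c0 = random_search(0.5, 1.0, 400, rng)   # coarse global candidate
c_opt = local_polish(c0)                 # constrained optimum is near 0.97
```

Repeating both stages with different seeds, as the paper recommends, guards against the random search stopping near a poor local optimum.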
It should be emphasized that this approach generally leads to good results and is especially useful if several scenarios need to be evaluated in a systematic way. However, when a study is planned, it is recommended to evaluate the operating characteristics carefully and to adjust the tuning parameters manually if necessary (Step 5). This is illustrated in the case study in Section 5.

SIMULATION STUDY
We investigate the properties of the proposed approach in a simulation study. All simulations were performed with R version 3.2.3 (R Core Team, 2015). Before presenting general results, we illustrate the Type I error rate and power profile for one chosen setting. For that, we fix the sample size per group to n = 200, the parameter for the controlled region to ε = 0.04, the mean value of the prior distribution to p̄ = 0.667, and the effective sample size to ESS = 100. The effective sample size describes how informative the prior is (Morita, Thall, & Mueller, 2008); for a Beta prior with parameters α and β, for example, the ESS is given by ESS = α + β.
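Conversely, a Beta prior with a given mean and ESS is easily parameterized, since ESS = α + β and mean = α/(α + β) imply α = mean · ESS and β = (1 − mean) · ESS. A short Python sketch (the function name is illustrative):

```python
def beta_from_mean_ess(mean, ess):
    """Convert a prior mean and an effective sample size into Beta(alpha, beta)
    parameters, using ESS = alpha + beta and mean = alpha / (alpha + beta)."""
    return mean * ess, (1 - mean) * ess

# Setting illustrated in the text: prior mean 0.667, ESS = 100.
alpha, beta = beta_from_mean_ess(0.667, 100)  # approximately (66.7, 33.3)
```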
With these choices of p̄ and ε, we aim to control the Type I error rate in the interval I = [0.627, 0.707] and apply the strategy recommended in Subsection 3.2 to determine the optimal response rate-dependent critical values, which are displayed in Figure 6. Figure 7 shows the empirical power (right panel) and the Type I error rate (left panel) for different true response rates in the new study, illustrating different levels of commensurability of prior and data. The proposed hybrid approach is compared to the TOST approach that only considers the new data and to the robustified MAP approach with 90% weight on the non-informative part, which is calibrated to control the Type I error rate in the interval I (see Subsection 2.1, c = 0.96). The proposed hybrid approach keeps the Type I error rate within the defined interval I. The Type I error rate is substantially inflated outside of the controlled interval, but as we only aim to control the Type I error rate in the defined interval I, this is acceptable.

FIGURE 6 Example of the functions c_1 and c_2: optimal response rate-dependent critical values for the described scenario

FIGURE 7 Operating characteristics for the robustified MAP approach with 90% weight on the non-informative part (calibrated critical value c = 0.96), the proposed hybrid approach, and the TOST approach that only considers the data in the new study. The vertical solid line shows the mean value of the prior distribution; the vertical dotted lines indicate the boundaries of the interval I. The displayed Type I error rate is the maximum over Situations (I) and (II).

TABLE 1 Overview of considered parameter settings (n: sample size per group; ESS: effective sample size (ESS = α + β if the prior is Beta(α, β)); p̄: mean value of the prior; Δ: equivalence margin; ε: parameter that controls the width of the interval I (Equation (2)))
In contrast, power is gained within the interval I in comparison to the TOST approach that only considers the observed data of the new study: for example, at the mean value of the response rate of the reference product in the historical trials (the vertical solid line in Figure 7), the TOST approach has a power of 0.8776, whereas the proposed approach has a power of 0.9363. The power of the TOST approach could be increased by using a higher number of subjects n; it would have been necessary to include n = 241 instead of n = 200 subjects to achieve a comparable level of power with the TOST approach. In comparison to the robustified MAP approach, the Type I error rate inflation of the hybrid approach is much higher outside of the interval I. On the other hand, the robustified MAP approach achieves no relevant gain in power over the TOST approach.

Next, we discuss the operating characteristics of the hybrid approach for a broader range of scenarios; Table 1 shows the considered settings. We consider three different sample sizes n for the new study, two effective sample sizes (ESS) of the prior, and ten mean values p̄ of the prior distribution for the reference product between 0.5 and 0.8. Due to symmetry, the chosen response rates for the historical data are also representative for response rates from 0.2 to 0.5. The equivalence margin is set to Δ = 0.15. We aim to control the Type I error rate in the interval I = [p̄ − ε, p̄ + ε] and use three different values of ε, corresponding to widths of the controlled interval of 0.04, 0.08, and 0.12. In total, we consider 180 scenarios. The R-package BatchJobs (Bischl, Lang, Mersmann, Rahnenführer, & Weihs, 2015) is used for parallel computing. The optimal response rate-dependent critical values and tuning parameters are chosen using the recommendations made in Subsection 3.2.
In order to compare the results so that the impact of the different scenarios can be analyzed, Figure 8 displays the empirical power at p̄ = p_R = p_T, which corresponds to the scenario of a perfect match between historical data and new data and complete equivalence between the response rates of T and R. This value can be found in Figure 7 at the vertical solid line. It is important to keep in mind that the Type I error rate is controlled in the interval I, which makes the figure meaningful even though only part of the information is shown. The rejection rates are shown for several sample sizes, different values of ε, different mean values of the prior distribution, and an ESS of 100. The results for an ESS of 20 were comparable (not shown), except that an ESS of 100 gave a small advantage in terms of power but led to a higher inflation of the Type I error rate outside of the interval I. However, as we only aim to control the Type I error rate within the interval I, this is acceptable. The feature that the power increases with a higher ESS is clearly desirable, because having a more informative prior should be rewarded.

FIGURE 8 Operating characteristics for the proposed hybrid approach (solid lines) in comparison to the TOST approach (dotted lines) for different values of ε and different sample sizes n for the new study. Displayed is the empirical power assuming that the mean value of the prior is identical to the true response rates of reference and test product in the new study.

Figure 8 displays the rejection rates of the proposed hybrid approach (solid lines) in comparison to the TOST approach (dotted lines). The hybrid approach leads to a gain in power for all three values of ε that are explored. However, if a higher value of ε is used, the advantage of the hybrid approach decreases. This is expected because a higher value of ε leads to a larger interval in which the Type I error rate has to be controlled.
This more desirable Type I error rate profile comes at the cost of a less desirable power profile, as was already discussed for the robustified MAP priors in Subsection 2.2. However, it should be emphasized that even if a fairly large interval with a width of 0.12 is controlled, up to 5.5% of power can still be gained. Comparing the different sample sizes (n = 100, 150, 200), we see that a higher sample size in general leads, as expected, to a higher power. The gain in power appears to be rather independent of the sample size.
The optimization and the choice of tuning parameters can, as discussed in Subsection 3.2, be difficult. However, the results show that, apart from the situation with ε = 0.02 and n = 100, the power profiles look smooth, indicating that the optimization strategy proposed in Subsection 3.2 is stable. Nevertheless, if a study is planned using this approach, it is recommended to study the operating characteristics for the specific parameter setting of the study in order to check whether manually tuning the proposed method increases the performance. Manual adjustment of the parameters can also be useful in case a specific profile of power and Type I error rate is required. We discuss this in the case study in Section 5.

CASE STUDY
In this section, we illustrate the proposed method by planning a hypothetical Phase III study for a biosimilar with the active substance adalimumab (originator drug: Humira® from AbbVie Ltd.). The considered indication is psoriasis. The efficacy of treatments for psoriasis is often measured with the Psoriasis Area and Severity Index (PASI; Fredriksson & Pettersson, 1978). This score is the most widely used measurement scale for the severity of psoriasis and ranges from 0 (no disease) to 72 (maximal disease). The rate of patients with at least 90% improvement in the PASI score after week 16 (PASI 90 responders) is an important endpoint for clinical trials in psoriasis (Torres & Puig, 2015). Through a systematic search, we identified five studies with relatively homogeneous study populations, identical treatment regimens (an initial dose of 80 mg, followed by 40 mg every other week), and assessment of PASI 90 at week 16. The results of these studies are displayed in Table 2. In total, 1868 subjects were studied, of which 886 were classified as PASI 90 responders at week 16.
As shown in the flow chart in Figure 4 in Section 3, it is first necessary to derive the MAP prior for the reference treatment. We obtain the MAP prior with the R-package RBesT (Weber, 2017). For that, we chose the priors for the hyper-parameters of the MAP approach (see Subsection 2.1) as recommended in the clinical trial example in Schmidli et al. (2014): for the between-trial standard deviation τ, a weakly informative half-normal prior with a standard deviation of 1 was used; for the mean value μ of the prior distribution, a normal distribution with mean 0 and standard deviation 10, which is also weakly informative. The MAP prior was approximated with a Beta distribution, and the best fit was obtained with the parameters α = 55.0844 and β = 59.3647. The ESS is approximately 114 subjects, and the mean value of the prior distribution is p̄ = 0.4813. For the test product, we set a uniform distribution (i.e., Beta(1, 1)) as the prior distribution.

TABLE 2 Historical studies for Humira® (adalimumab): study, publication, indication, and responders/sample size (%). The shown response rates correspond to PASI 90 at week 16. *: The response rate was given in the paper and the number of responders was calculated using the information in the publication.
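The reported summaries of the fitted prior follow directly from the Beta parameters. A short Python sketch checks them (the MAP fit itself was obtained with the R-package RBesT and is not re-implemented here):

```python
# Beta approximation of the MAP prior as reported in the text.
alpha, beta = 55.0844, 59.3647

ess = alpha + beta                   # effective sample size, about 114 subjects
prior_mean = alpha / (alpha + beta)  # mean of the prior, about 0.4813

# Pooled response rate over the five historical studies (Table 2).
pooled_rate = 886 / 1868             # about 0.474, close to the prior mean
```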
The goal of the new study is to demonstrate equivalence with an equivalence margin of Δ = 0.15, that is, to test the hypotheses

H_0: |p_T − p_R| ≥ 0.15  versus  H_1: |p_T − p_R| < 0.15.

We assume that the Type I error rate is to be controlled for response rates in the interval I = [p̄ − 0.05, p̄ + 0.05] = [0.4313, 0.5313]. Note that the observed response rates of all historical trials shown in Table 2 lie clearly within this range. Before any data from the new study become available, it is necessary to choose the tuning parameters and to determine the optimal response rate-dependent critical values. We first use the proposed algorithm (see Figure 5) to obtain initial values for the response rate-dependent critical values and tuning parameters, and afterwards adjust this choice according to the desired Type I error rate profile. We assume a sample size of n = 175 subjects per group for the new study.
In the first step of the algorithm (see the flow chart in Figure 5), we set γ_1 = 3σ̂ = 0.1133, γ_2 = σ̂ = 0.0378, λ̄ = 0.9, and c_1 = c_2 = 0.99, where σ̂ = √(p̄(1 − p̄)/n). We then calculate the Type I error rate under Situation (I) and Situation (II) (Step 2) for response rates p̄ ± 0.05 (at the boundaries of the interval I); its maximum value is 0.0226 and therefore smaller than the nominal significance level of α = 0.05. We can thus proceed to Step 3, in which we search for starting values for the optimization of the target function. The optimal constant critical value is c_1 = c_2 = 0.9748. This constant critical value is used as the starting value for the optimization of the response rate-dependent critical values (Step 4). Since the chosen optimizer (crs2lm in the package nloptr) is a random search algorithm, we repeat the optimization several times with the chosen starting value and obtain for the optimal solution a power of 0.8153, which clearly exceeds the power of the TOST approach (0.7414). One could now fix the functions c_1 and c_2, run the study, and obtain results with the hybrid approach as explained in the flow chart in Figure 4. However, we first analyse the Type I error rate and power profile in order to adjust the tuning parameters manually to our desired profile and to check whether further increases in performance are possible. For that, the leftmost panels in Figure 9 show the operating characteristics of the hybrid approach (automatically optimized parameters, referred to as Choice 1 in the following) in comparison to the TOST approach. The upper panel gives the Type I error rate: it is controlled in the chosen region (indicated by the vertical dotted lines). Outside of the region, the Type I error rate first increases, reaches a maximum, and decreases again when the data do not match the prior at all. This behavior is comparable to the operating characteristics of the robustified MAP approach (see Figure 1).
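The Step 1 quantities can be reproduced in a few lines. This Python sketch assumes the convention γ_1 = 3σ̂ and γ_2 = σ̂ with σ̂ = √(p̄(1 − p̄)/n), which matches the value γ_1 = 0.1133 reported for this case study:

```python
import math

p_bar, n = 0.4813, 175  # prior mean and planned sample size per group

sigma_hat = math.sqrt(p_bar * (1 - p_bar) / n)  # approx. 0.0378
gamma_1 = 3 * sigma_hat                          # approx. 0.1133
gamma_2 = sigma_hat                              # approx. 0.0378
lambda_bar = 0.9                                 # Step 1 starting value
```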

FIGURE 9 Operating characteristics (Type I error rate: upper panels; empirical power: lower panels) of the proposed hybrid approach for three different choices of tuning parameters in comparison to the TOST approach for a hypothetical Phase III study for adalimumab. Whereas for Choice 1 (left) the automatic optimization was used, the parameters γ_1 and γ_2 were changed for Choice 2 (middle) and Choice 3 (right; see the description in the text). The dotted vertical lines indicate the boundaries of the interval I (the controlled region); the solid vertical line represents the mean value of the prior distribution. The displayed Type I error rate is the maximum error rate over Situations (I) and (II).

The lower panel shows the empirical power. It confirms that there is a high gain in the center of the prior distribution. However, when moving closer to the boundaries of the controlled interval (vertical lines), the power of the hybrid approach converges to that of the TOST approach. We would prefer a relevant gain in power over the complete interval I. The ability to gain power in situations in which the historical data and the data in the new study are not completely commensurate is controlled by the parameter γ_2: if γ_2 is set to a higher value, the easier critical value λ̄ is used more often. For Choice 2, we therefore increase the parameter γ_2 to γ_2 = 1.5 · √(0.4813 · 0.5187 / 175) = 0.0567.
The other tuning parameters (γ_1, λ̄) are kept fixed. Again, we apply the algorithm outlined in Figure 5. The operating characteristics are displayed in the middle panels of Figure 9. Comparing the empirical power of Choice 2 (lower middle panel) with that of Choice 1 (lower left panel), we can see that this choice of γ_2 brings the desired change: for this setting, the power is higher than that of the TOST approach over the complete interval I. The Type I error rate outside of the interval I is comparable. Therefore, we prefer this set of tuning parameters. Lastly, we consider decreasing the Type I error rate outside of the controlled region to provide some reassurance that, even if anything unexpected happens, the Type I error rate would not be extreme. This can be controlled by adjusting γ_1: if this value is decreased, the hybrid approach more often switches to the TOST approach and ignores the historical data. For Choice 3, we therefore decrease γ_1. The operating characteristics are displayed in the rightmost panels of Figure 9. The Type I error rate (upper panel) is clearly reduced. However, the power peak (lower panel) is much narrower than for Choice 2. Therefore, we prefer Choice 2 and propose using γ_1 = 0.1133 and γ_2 = 0.0567, and optimizing the response rate-dependent critical values with the method proposed in Subsection 3.2.
It is important to note that a formal inclusion of γ_1 and γ_2 in the algorithm that automatically tunes the hybrid approach is not possible with the chosen target function (Subsection 3.2). Of the three choices of critical values and tuning parameters shown in this paper, we preferred the operating characteristics achieved with the tuning parameters of Choice 2. However, this choice has a target function value of 0.8079, whereas the first choice has a value of 0.8153 and would therefore be preferred by the automatic tuning described in Subsection 3.2. The preference for a specific profile of Type I error rate and power is too complex to be included in a completely automatic choice of tuning parameters. We provide an algorithm for a first automatic choice, but it is highly recommended to check the operating characteristics and to adjust the tuning parameters if necessary, as shown in this case study. For the remainder of this section, we fix the parameters obtained with Choice 2.
Next, we demonstrate how a test decision is made with the proposed hybrid approach, using the information provided in the European public assessment report (EPAR) for the application of Amgevita® (Amgen; CHMP, 2017). Amgevita® is a biosimilar with the active substance adalimumab and was approved in March 2017 in Europe; it is the first approved biosimilar to Humira® in Europe. The sponsor conducted a Phase III study in patients with stable moderate to severe plaque psoriasis with the same treatment regimen as in the historical trials (see Table 2). PASI 90 was evaluated at week 16. At that point in time, 81 of 172 subjects in the test group (47.1%) and 82 of 173 subjects in the reference group (47.4%) were classified as responders. We now follow the algorithm displayed in Figure 4. After estimating the response rates under the test and reference treatments, it is first necessary to decide whether the historical data should be used (Switching rule I, see Equation (3)). For that, we calculate the difference between the mean value of the prior distribution and the observed response rate of the reference product: |0.4813 − 0.4740| = 0.0073. Since this value is smaller than γ_1 = 0.0944, we proceed with the Bayesian approach. Next, we compare the observed response rates in the test and reference groups (Switching rule II, see Equation (4)): |0.4709 − 0.4740| = 0.0031. This difference is smaller than γ_2 = 0.0567; therefore, the Bayesian success criterion (Equation (1)) is compared to λ̄ = 0.9. The parameters of the posterior distribution are given by Beta(1 + 81, 1 + 91) = Beta(82, 92) for the test product and Beta(55.0844 + 82, 59.3647 + 91) = Beta(137.0844, 150.3647) for the reference product. The Bayesian success criterion (Equation (1)) evaluates to 0.9983, which is clearly larger than λ̄ = 0.9. We therefore reject the null hypothesis and claim equivalence of test and reference. It is worth noting that the TOST approach leads to the same conclusion in this example.
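The decision path for the Amgevita® data can be traced numerically. In this Python sketch, the switching thresholds (0.0944 and 0.0567) and the posterior Beta parameters are taken from the text; the posterior probability in Equation (1) is approximated by Monte Carlo sampling rather than the exact computation used in the paper.

```python
import random

random.seed(42)  # fixed seed so the Monte Carlo estimate is reproducible

# Observed Phase III data (EPAR for Amgevita).
x_T, n_T = 81, 172  # responders / sample size, test product
x_R, n_R = 82, 173  # responders / sample size, reference product

# Switching rule I: observed reference rate vs. the prior mean 0.4813.
use_bayes = abs(x_R / n_R - 0.4813) < 0.0944          # True: keep the prior

# Switching rule II: the two observed response rates are very similar.
use_lambda_bar = abs(x_T / n_T - x_R / n_R) < 0.0567  # True: threshold 0.9

# Posteriors: Beta(1 + 81, 1 + 91) for the test product and
# Beta(55.0844 + 82, 59.3647 + 91) for the reference product.
def beta_samples(a, b, size):
    return [random.betavariate(a, b) for _ in range(size)]

samples_T = beta_samples(82, 92, 200_000)
samples_R = beta_samples(137.0844, 150.3647, 200_000)

# Bayesian success criterion: P(|p_T - p_R| < 0.15 | data), about 0.998 here.
criterion = sum(abs(t - r) < 0.15
                for t, r in zip(samples_T, samples_R)) / 200_000
reject_null = use_bayes and use_lambda_bar and criterion > 0.9
```

The Monte Carlo estimate lands close to the exact value 0.9983 reported above, and the decision is to reject the null hypothesis.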

CONCLUSION
When a biosimilar is developed, historical information on the reference product, which has been on the market for several years, is available. Incorporating this information into the confirmation of equivalent efficacy might lead to a reduction in sample size, but it can also inflate the Type I error rate. In this paper, we first showed in Section 2, using the MAP approach (Neuenschwander et al., 2010), that even with the robustified version of this approach (Schmidli et al., 2014), the Type I error rate cannot be controlled if power is to be gained in comparison to the two one-sided tests (TOST) approach (Schuirmann, 1987), the standard frequentist approach that considers only the data in the new study. Strict control of the Type I error rate thus seems to be incompatible with a gain in power. Therefore, we propose to control the Type I error rate not over the complete parameter space, but only in an interval centered around the mean value of the prior distribution. This region is the area we consider most important, because we expect the true value in the new study to lie in it. For example, for binary endpoints and a mean response rate of 0.25 in the historical data, we might aim to control the Type I error rate for true response rates in the new study between 0.2 and 0.3. We showed that the MAP approach does not provide any relevant advantage even if the Type I error rate has to be controlled only in an interval in the center of the prior distribution. Therefore, we proposed in Section 3 a hybrid Bayesian-frequentist approach for binary endpoints that borrows strength from historical data while controlling the Type I error rate in the center of the prior distribution. This is achieved by combining two switching rules and response rate-dependent critical values. We analyzed the operating characteristics of the proposed approach in a simulation study (Section 4).
It was shown that power can be gained in comparison to the TOST approach while controlling the Type I error rate in the pre-specified interval as long as the controlled interval is not too wide. If, for example, the interval is chosen to represent the complete parameter space, no gain in power is possible with the proposed hybrid approach.
The decision on the width of the region in which the Type I error rate needs to be controlled depends on how confident the sponsor of the new study is that its results will match those of the historical trials. If, for example, the sponsor was involved in one of the historical trials, it might be easy to set up an identical study using the same research centers and exactly the same inclusion/exclusion criteria. In that case, the sponsor might be able to justify that it is sufficient to control the Type I error rate in a narrow interval. On the other hand, if not much knowledge about the historical trials is publicly available, it might be difficult to set up a comparable study, and a wider interval would therefore be necessary. It is important to emphasize that there are also situations in which the proposed approach is not recommended (e.g., when there is high uncertainty about the response rate in the new study). However, using simulations, it is easy to investigate the usefulness of applying the hybrid approach at the planning stage of the new trial.
The Type I error rate and power profile can be adjusted by the choice of tuning parameters. We have given general advice on the choice of the parameters and provided an algorithm that chooses the parameters automatically. In addition, we explained the fine tuning of the parameters in Section 5 using historical data for the active substance adalimumab (originator drug name: Humira ® from AbbVie Ltd.) as a case study.
We only aim to control the Type I error rate in the region surrounding the mean value of the prior distribution. Clearly, this may lead to discussions with regulatory agencies as to whether the evidence obtained with this method is reliable enough to be included in the showing of equivalence between a biosimilar and a reference product. We believe, however, that our hybrid approach is an improvement over current practice, where weaker evidence has been used to gain regulatory approval in the past. For example, in the area of new antiepileptic drugs (monotherapy), a single-arm study was accepted in which the efficacy comparison was based on a historical control only (Jacobson et al., 2015; Sperling et al., 2015). Also in the context of biosimilar development, single-arm studies with a comparison to historical data have already been used, for example, for the application of Zarzio (Sandoz) in Europe (CHMP, 2009). Another example is noninferiority (NI) trials, for which the FDA guideline states that the noninferiority study "is dependent on knowing something that is not measured in the study, namely, that the active control had its expected effect in the NI study". In all these examples, the Type I error rate is uncontrolled and can lie anywhere between 0% and 100%. It is acknowledged that the single-arm study for Zarzio was considered supportive only and that the main confirmation of equivalence was based on a pharmacodynamic study that involved a control group. Also, biosimilar trials differ from antiepileptic trials in the sense that using placebo in patients for whom an effective treatment is already well-established is ethically questionable, whereas a control group in a biosimilar trial is less controversial.
However, we would like to emphasize that the Type I error rate profile obtained with our method is, from a statistical point of view, far preferable to some approaches already used in practice: on the one hand, the statistical properties of our method are known and can be taken into consideration when deciding whether a biosimilar is approved or not; on the other hand, we guarantee partial control of the Type I error rate. This paper has focussed on a binary endpoint. However, the general idea can be adapted to other types of endpoints, for example, normally distributed endpoints. In that case, concepts like Bayes factors (e.g., Lavine & Schervish, 1999) might be useful for combining multiple parameter estimates (e.g., the mean value and the variance for normally distributed endpoints) into the measure of similarity required for the proposed switching rules. It should also be noted that the approach was developed within the biosimilar framework, but it can be used in all situations in which strict Type I error rate control is required only for scenarios that are realistic in practice. Finally, we would like to point out that there are several options for modifying standard methodology to achieve the desired Type I error rate and power profile. We found that the proposed approach has reasonable and acceptable operating characteristics; however, we acknowledge that further improvement might be possible, which could be an area of further research.