Modeling Gene-Gene Interactions Using Graphical Chain Models

Objective: To investigate whether graphical chain models are suitable to detect gene-gene interaction under different biological models. Methods: We conducted a simulation study comparing graphical chain models with logistic regression models regarding their ability to detect underlying biological interaction models. For both methods, we attempted to capture simulation data following 12 different biological models. We used 10 statistical models for both methods. Of the 12 different biological models, four contained no interaction effects, two were multiplicative, and six were epistasis models. For each situation, the choice for a statistical model was based on global model fit as judged by two different information criteria, the BIC and the AIC. Results: Both methods failed in most of the scenarios to capture the gene-gene interaction present in the simulation data. Only in very specific cases, when disease risk was high and both genes had a dominant effect, present gene-gene interaction was detected. Conclusions: Graphical chain models are, similar to logistic regression models, not able to capture gene-gene interactions for arbitrary biological models underlying the data.


Introduction
The meaning of gene-gene interaction has led to much confusion caused by different definitions of interaction or epistasis as it is usually referred to in genetics, statistics or epidemiology. Thus, different models are used in the different disciplines to capture gene-gene interactions. The problem is that the degree to which statistical analysis can elucidate the underlying biological mechanisms may be limited and may require prior knowledge of the underlying etiology. In addition, it may also depend on the study design and the applied statistical methods. For instance, Vieland and Huang [1] have shown that data from affected sib-pairs cannot be used to distinguish heterogeneity from epistasis.
In a recent paper, North et al. [2] presented a simulation study that investigated the capability of logistic regression models to detect gene-gene interaction in a casecontrol study under different biological models. They assumed two independent susceptibility loci, i. e. the linkage disequilibrium was assumed to be zero ( LD = 0), with joint influence on the disease risk. Their results showed that logistic regression models cannot convincingly reflect the biological mechanisms of interaction. This leaves the question open whether other statistical models may be more appropriate to reflect the biological meaning of gene-gene interactions as already suggested by North et al. [2] .
Graphical chain models may offer a reasonable alternative. Although they are also based on regression models they allow in contrast to simple regressions to reflect complex multivariate association structures including intermediate variables and interactions in a graph. This would allow to capture biological pathways and to simultaneously account for environmental factors. The advantages of graphical models compared to simple logistic regression models became for instance apparent in Didelez et al. [3] . Thus, we study the properties of graphical chain models in small settings with the aim to generalize this approach to complex disease models.
The paper is organized as follows. In Section 2, graphical chain models, biological models of two-loci-interaction and the applied statistical models are introduced. Here, we focus on the different meanings of gene-gene interaction. The statistical models will then be investigated whether they are able to capture the biological interaction by means of simulation studies. Section 3 describes the design of the simulation study. The results are presented in Section 4 where it is especially investigated whether graphical chain models perform better in this context than the logistical regression models used by North et al. [2] . The above findings are critically reflected in the discussion.

Methods
Let us first define two-loci interaction in a biological sense [cf. 4 ]. Locus heterogeneity means that two genes have an influence on the disease but act separately. Epistasis means that the two genes interfere with each other in their influence on the disease. Bateson [5] originally introduced the term epistatic to describe a masking effect in which a factor at one Mendelian locus prevents another from manifesting its effect. In quantitative genetics, the term epistatic has classically been used as introduced by Fisher [6] . It refers to a deviation from additivity in the effects of the alleles of different loci with respect to prediction of a quantitative phenotype. This is similar to the usual concept of statistical interaction. Here, however, the choice of the scale becomes important, since two variables that interact, for example, on a multiplicative scale may be purely additive on a logarithmic scale.

Biological Interaction Models
Consider a disease phenotype Y and a sample of cases ( Y = 1) and controls ( Y = 0) selected from an arbitrary population. We assume that genotypes are obtained for each individual for a set of biallelic, autosomal candidate loci. For each candidate locus, say A and B , G A and G B give the number of disease alleles, i.e. G A , G B take possible values 0, 1, or 2. In the following, let allele a denote the wild-type allele, and allele A the disease allele at locus A , b and B are defined analogously. For binary traits, we are interested in the probability of being affected given a certain joint gen- is usually referred to as joint penetrance f ij , i , j = 0, 1, 2. That is, f ij is the penetrance for a genotype with i copies of allele A at locus A and j copies of allele B at locus B , respectively.
Based on theses penetrances, Risch [7] transferred the three main biological models of gene-gene interaction to mathematical models applying penetrance factors respectively summands depending on the underlying model. The additive model assumes that no interaction is present and thus describes the joint penetrance as sum of the two penetrance summands, i.e. f ij = a i + b j .
The heterogeneity model f ij = a i + b j -a i ؒ b j corresponds to the standard probability equation for calculating the probability of the union of two overlapping events where stochastic independence of these two events is assumed. Risch [7] showed that the heterogeneity model can often be well approximated by the additive model when used to model relative risks of disease in families. The additive and the heterogeneity models are assumed to represent biological models without interaction. The multiplicative model f ij = a i ؒ b j , in contrast, represents biological interaction, where the penetrance can be written as product of the two penetrance factors [7,8] . Another approach to capture genetic interactions is based on so-called pure epistasis models that mainly model potential masking effects of one or both genes. Thus, various types can be distinguished depending on the inheritance model [cf. 9 ]. For two dominant genes the biological mechanism can be described by an epistasis model if only the occurrence of at least one disease allele at both loci leads to an increased penetrance (for an example see the relative penetrance matrix for Epi dd in table 5 ). In case of one dominant and one recessive gene the biological model reflects epistasis if the recessive gene masks the dominant one. Thus, the dominant gene only leads to an increased penetrance if two disease alleles occur at the recessive gene locus (for an example see the relative penetrance matrix for Epi rd in table 5 ). For two recessive genes, an epistasis model reflects that both genes mask each other. In this case, the penetrance can only be increased if two disease alleles occur at each locus (for an example see the relative penetrance matrix for Epi rr in table 5 ). These three epistasis models are based on the definition of Bateson.

Statistical Models
The models proposed by Risch can be written as linear regression models using the parametrization as introduced by Cordell [10] and later used by North et al. [2] . At each candidate locus, G A , G B are subdivided into two design variables. The additive effects x 1 , x 2 D {-1,0,1} are linear transformations of the genotype and reflect the number of disease alleles. The dominance effects z 1 , z 2 D {-0.5,0.5} distinguish homozygous from heterozygous genotypes. For instance, x i is set to -1 and z i to -0.5 for a homozygote aa , x i = 0 and z i = 0.5 for a heterozygote aA and x i = 1, z i = 0.5 for the genotype AA . Using this parameterization, an additive biological penetrance model corresponds to the following statistical model: This relationship can be exploited such that the penetrance summands can be calculated from the regression parameters of the linear model as a 0 = -␤ 1 -0.5 ␥ 1 , a 1 = 0.5 ␥ 1 , a 2 = ␤ 1 -0.5 ␥ 1 ( b j can be derived analogously).
Graphical Chain Models Graphical chain models are an adequate tool to display a multivariate association structure in a graph. Variables are represented as vertices and associations between each pair of these variables as edges. Undirected edges represent symmetric associations, whereas directed edges pointing from an influential variable to a potential response represent asymmetric associations. In case that the so-called Markov properties hold, missing edges can be interpreted as conditional independences, as illustrated in figure  1 . A graphical chain model consists of several boxes usually ordered from right to left where the pure explanatories are summarized in the box on the right and the pure responses in the box to the left. In between are boxes with intermediates that are responses to variables on the right and explanatories to variables on the left. Edges within boxes are always undirected and edges between boxes are always directed pointing from variables in boxes on the right to variables in boxes on the left. For more details we refer to [11,12] .
Typically, a first rough structure is obtained from subject-matter knowledge where the variables are categorized as pure explanatories, pure responses and different levels of intermediates. The specific associations between the variables have then to be derived from the data using an appropriate selection strategy [12] .
Using graphical chain models, it has first to be decided on which information should be displayed in the graph. To ensure comparability with logistic regression models we follow the idea to divide each genotype into its dominance and additive effect. In addition, it has to be accounted for the fact that we are interested in case-control studies which implies that the disease status of an individual is given and the relation to the exposure has to be clarified during the analysis. Thus, although perhaps not common, the disease status Y is depicted as explanatory variable in the right box whereas the genetic exposures X as additive effects and Z as dominance effects are represented as response variables in the left box.

Model Equations
Let us now briefly summarize the statistical models to be further investigated where for the sake of simplicity we restrict ourselves to representing only the additive effects of two marker genotypes. We consider logistic regressions and graphical chain models to model the statistical association structure. For illustrative purposes, the latter are additionally displayed as graphs. Since we assume that all variables are discrete, we consider graphical loglinear models. That is, we start from an I ! J ! K table where the cell probabilities are denoted as p ijk , i = 1, ..., I, j = 1, ..., J , k = 1, ..., K . The simplest case is that of a null model, which is given as logit( y ) = ␣ as logistic regression and as graphical model. Since no associations are present in the mean model the corresponding independence graph does not contain any edges (see fig. 2 a). Main effects can be included as follows in the above models: The corresponding graph now contains edges pointing from Y to X 1 and X 2 , but still X 1 and X 2 are disconnected (see fig. 2 b). If we now include an additional interaction effect, X 1 and X 2 are no longer disconnected but connected by an undirected edge (see fig. 2 c). The corresponding models read as follows: In the following, we perform a simulation study based on the design of North et al. [2] to investigate whether statistical models of the type introduced above are able to reflect the underlying biological models.  Different designs are considered for the simulation study where in general we assume that loci A and B are in linkage equilibrium. For the sake of comparison, we set the disease allele frequencies at both loci to 0.1 ( p A = p B = 0.1) [9] and the disease prevalence to prev = 0.01 [2] . Let us denote the basic risk for developing the disease of interest with c regardless of the joint genotype. We then introduce a factor t , t D {1, ..., T }, to reflect a potential risk increase. For the Risch models, this factor depends on the number of disease alleles such that e. g. the penetrance vectors a 1 and a 2 can be derived from a 0 by multiplying a 0 with 1 and 2 , respectively.
For each biological interaction scenario, we simulate a model of low and of high risk increase. Table 5 in the ap-pendix shows the marginal genotype relative risk for each locus, which is defined e. g. for locus A as 2. Second step. Given the prevalence prev , the penetrances f ij and 1 -f ij and the joint genotype G l , l D {0, ...,8}, l = 3 i + j , i , j D {0, 1, 2}, the conditional probabilities of a certain joint genotype given the disease status, i.e. P ( G l ͉ Y = k ), k D {0, 1}, are calculated. These probabilities are each multiplied with 10,000 to obtain a population with 10,000 affected (cases) and 10,000 unaffected subjects (controls). 3. Third step. A random sample of size N = 1,000 of the subpopulation with joint genotype G l is drawn with replacement. This sampling step is replicated independently n = 100 times. Note that the simulation design of North et al. [2] did not include a random selection of the population to be investigated (step 3). The generated data samples are used to examine whether the statistical models introduced above correspond to biological interaction models. For this purpose, ten statistical candidate models with different parametrization are formulated (see table 1 ). In the following, statistical models are denoted by S with the special model type as index. The first seven models consist only of main effects representing biological models without gene-gene interactions. The last three models include interaction terms and shall therefore capture genegene interactions.
All statistical models are translated into graphical chain models and logistic regression models (see table 1 ).
Each statistical scenario is fitted using a logistic regression model and the complementary graphical chain model where the statistical software packages R, MIM [12] and the interface mimR [13] are applied.
The global model fit is assessed by the Bayesian Information Criterion (BIC) and by Akaike's Information Criterion (AIC). The results obtained from the latter are described only briefly. Since on the one hand different model parameterizations do not allow to compare the absolute BIC resp. AIC values and on the other hand only the relative values are relevant to select the best model out of a given set of candidate models, we consider the differences between the value of the actual model and the minimum of all BIC resp. AIC values in a given set, which means e. g. for the BIC According to Burnham and Anderson [14] a selected model with ⌬ m BIC ^ 2 or ⌬ m AIC ^ 2, respectively, should be considered as particularly useful. Such models are said to have a substantial level of empirical support.

Results
The results of our simulation study are presented in table 2 . Statistical models that are assumed to reflect best the biological mechanisms, briefly denoted as 'true' models are shaded gray. As a general result, the logistic regression models typically leads to a higher variety in selected statistical models as compared to graphical chain models. In addition, more models being consistent with the assumed biological mechanisms are selected based on logistic regression models compared to graphical chain models.
For the null model the mean log(OR) and the bias in log(GRR)s are close to 0 (mean log(OR) = -0.0028 using allele b as reference; bias log(GRR): -0.0027, -0.059 using genotype B/B as reference). The bias in log(GRR) is the difference between the log of simulated GRR and the log of true GRR. The type I error rate (see table 4 ) shows that the logistic regression model (LRM) is too conservative, the graphical chain model (GCM) comes up with good empirical error rates for the 0.01-level, but shows a slightly anti-conservative behavior for the interaction models.
Let us now go into more details, where we distinguish Risch models and epistasis models.
For the low risk additive model Add L we presume S A 2 and S D 2 as reasonable 'true' models, since an risk increase is observed for the marginal GRR at locus B in the relative penetrance matrix (see table 5 ). Here, the LRM mostly selects the additive effect model S A 2 (92%) whereas the GCM prefers the dominance effect model S D 2 (86%). The 'true' models for the high risk additive model Add H are S A 1 and S D 1 . The tenfold risk increase provokes the GCM to chose S D 1 in 98%.
Like the additive model, the heterogeneity model Het L is regarded as biological model without gene-gene interaction. The 'true' models for the low risk scenario are S A 1 , S D 1 and for the high risk model S D 2 and S D 12 , since Het H includes an explicit dominance effect. Looking at Het L , LRM finds again in contrast to GCM models without dominance effects. This changes for an increased risk and two recessive loci. LRM captures the 'true' models S D 2 and S D 12 in 100%. GCM fails and selects S D 2 in only 11%, but selects additive effect models in 89%.
The results for the multiplicative model show that LRM selects the additive effect model S A 12 for the low risk model and S D 12 for the high risk model. LRM finds as well some interaction models. GCM decides for S D 12 regardless of whether a low or high risk model applies.
For the dominant low risk epistasis model Epi L dd , most frequently the mean model is selected. For the high risk model Epi H dd gene-gene interaction can be found. GCM as well as LRM fully reflect the underlying biological model. Substituting one dominant genotype for a recessive genotype, LRM and GCM have both diffculties to ascertain interaction effects for the Epi L rd . Almost all model replications favor the mean model. The high risk model performs better. Although the epistasis effect cannot be revealed, both approaches select main effect models S D 1 resp. S A 1 and hence find evidence that only locus A is involved in the pathway of disease. Both epistasis models with two recessive loci, Epi L rr and Epi H rr , show only minor variation in the marginal GRR's. Since only the joint genotype AABB leads to an increase of risk, the choice of allele frequencies and the number of observations lead to a lack of statistical power for detecting any interaction effect. Both, LRM and GCM are not able to identify this relationship, neither for a 10-nor 100-fold risk increase.
In addition to the BIC, the AIC was computed for all analyses (see table 3 ). The results come up with some remarkable differences. In general, the AIC selects much more models that are consistent with the data. This leads particularly for the LRM to a broader variety of selected statistical models with less focus on one best, whereas the GCM selects a few statistical models with stronger focus on one best model. Furthermore, the AIC often prefers  Since more than one model may fulfill this criterion in each replication the total number of selected models may exceed the number 100 of model replications.
Bold numbers indicate the maximum. Gray shadowed cells mark 'true' models, that are statistical models assumed to reflect best the biological mechanism. statistical models including interaction terms, even for additive or heterogeneity models. Using the AIC, GCM reflect the underlying biological model much better than LRM.

Discussion
In our simulation study, we investigated whether graphical chain models are able to detect underlying biological models of gene-gene interaction in association studies. It should be noted that the simulated biological models are unrealistically simplistic and thus of limited value with respect to the understanding of the underlying complex biological pathway, as also pointed out by Cordell [4] . But, despite their simplicity, accounting for interaction may increase the power to detect genetic effects (see [4] for practical examples). For this reason, the statistical  The table shows the number of falsely rejected null hypothesis of 1,000 replications for the relevant statistical models. model used to capture the simplified biological model should be able to reflect the biological interaction by its statistical interaction terms. Unfortunately, like in the study of North et al. [2] , our results showed that there is a severe gap between a fitted graphical chain model and the simulated biological model. The simulation results should be interpreted with caution regarding the true biology since the whole simulation study is based on two modeling steps: first, the complex biological pathway is modeled by a simplified biological model and second, a statistical model is used to capture this biological model. Based on the simulation results, it is, therefore, not possible to assess whether a statistical model is able to capture the true biology, but only whether it is able to capture the simplified biological model.
For the Risch models of biological interaction, it can be seen that the additive and heterogeneity models can be reflected with both methods, logistic regression models and graphical chain models. The multiplicative model, however, cannot be identified by either of these. It is well known that a multiplicative model is equivalent to an additive model on a logarithmic scale, that is, in loglinear models loci A and B are modeled as statistically independent additive effects on the logarithmic scale. Since interaction effects are present on the penetrance scale, logistic regression models and graphical chain models that are based on log-linear models will have difficulties to find these interaction effects when analyzing multiplicative penetrance models. In addition, if all penetrances are small, the approximation a i + b j = log( f ij /(1 -f ij )) ; log ( f ij ) and so f ij ; exp( a i ) ؒ exp( b j ) = ã i b j indicates that an additive model for the log odds should be equivalent to a multiplicative model for the penetrances [see 15 ]. In general, the LRM frequently decides for the more parsimonious additive effect models, whereas the GCM prefers dominance effect models to additive effect models. Surprisingly, if a dominance effect is explicitly included in the data, as is the case in the biological HET H model, the LRM detects this effect while the GCM does not. The reason for this behavior is not well understood.
For the epistasis models with their underlying biological interaction, statistical interaction is only found for one scenario. Especially, if a recessive locus is involved, both approaches fail to detect interaction effects. In this case, the epistasis models are often falsely classified as mean models, which leads to the misinterpretation that none of the analyzed loci has an effect on the disease. The results from the epistasis models are therefore not convincing for both approaches.
The question is, whether this finding was due to a lack of statistical power. Especially, when using the conservative BIC, the number of observations ( N = 1,000) may be too small to identify epistasis models. The AIC might to be more appropriate in this case. Since the results of BIC and AIC are not concordant, the choice of the appropriate information criterion is crucial. Several authors have compared both criteria [14,[16][17][18] . The BIC is based on the assumption that a 'true' model exists, although it does not have to be in the set of candidate models. Moreover, the BIC is a dimension-consistent criterion which means that it searches for a true model that is independent from the sample size. In contrast, the AIC does not intend to select any 'true' model, but selects the best model within a given set of candidate models. As non-dimension-consistent information criterion this includes that the selected best model varies with the sample size. According to Hurvich and Tsai [16] , if the truth is infinite-dimensional, the AIC selects the best finite-dimensional approximating model in large samples. We decided for the BIC since we have simulated data from underlying biological models with given information about presence or absence of gene-gene interaction.
It has been shown that analyzing case-control studies using graphical chain models or logistic regression models fails in detecting interaction on a penetrance scale if the model assessment is based on AIC or BIC. Hence, other approaches may be more appropriate as for instance nonparametric methods like the multifactor dimensionality reduction [19,20] . Another alternative would be not to model interaction but to measure it directly. For this purpose, we will develop an interaction measure on the penetrance scale where directed acyclic graphs have demonstrated to be a good starting point. The models Add, Het, Mult and Epi dd , Epi rd , Epi rr are described in Biological Interaction Models in Section 2. The additional index letters L and H symbolize low and high risk increase.