Explanation Trees for Causal Bayesian Networks

Bayesian networks can be used to extract explanations about the observed state of a subset of variables. In this paper, we explicate the desiderata of an explanation and confront them with the concept of explanation proposed by existing methods. The necessity of taking causal approaches into account when a causal graph is available is discussed. We then introduce causal explanation trees, based on the construction of explanation trees using the measure of causal information flow (Ay and Polani, 2006). This approach is compared to several other methods on known networks.


INTRODUCTION
A Bayesian network (BN, Pearl, 1988) is an algebraic tool to compactly represent the joint probability distribution of a set of variables V by exploiting conditional independences amongst variables. It represents all variables in a directed acyclic graph (DAG), where the absence of arcs between nodes denotes (conditional) independence. In addition to graphically representing the structure of the dependencies between the variables, BNs allow inference tasks to be solved more efficiently.
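As an illustration of this factorization, the joint distribution of a toy rain/sprinkler/wet-grass network (the variables and numbers here are invented for illustration, not taken from the paper) can be written as a product of one conditional probability table per node:

```python
# A minimal sketch of a discrete BN as a product of CPTs (illustrative numbers).
from itertools import product

p_rain = {True: 0.2, False: 0.8}            # p(Rain)
p_sprinkler = {True: 0.1, False: 0.9}       # p(Sprinkler)
p_wet = {  # keyed by (rain, sprinkler) -> p(WetGrass = True | Rain, Sprinkler)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.05,
}

def joint(rain, sprinkler, wet):
    """Joint probability via the DAG factorization p(R) p(S) p(W | R, S)."""
    pw = p_wet[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[sprinkler] * (pw if wet else 1 - pw)

total = sum(joint(r, s, w) for r, s, w in product([True, False], repeat=3))
print(round(total, 10))  # the factorized joint sums to 1
```

Summing the factorized joint over all assignments recovers 1, a quick sanity check that the CPTs define a valid distribution.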
In this paper, we discuss the extraction of explanations in causal BNs (Pearl, 2000; Spirtes et al., 2001), i.e., BNs where the arcs depict direct cause-effect relationships between variables.
Generally, explanations in BNs can be classified in three categories (Lacave and Diez, 2002) depending on the focus of the explanation: • Explanation of evidence. Given a subset of observed (instantiated) variables O ⊆ V, what is the state of (some of) the other variables V \ O that best explains O = o?
• Explanation of the reasoning process. When we have received some evidence and belief states have been updated by probabilistic inference, what was the reasoning process by which we arrived at this state?
• Explanation of the model, which provides insight into the static components of a network such as (conditional) independence relationships, causal mechanisms, etc.
We shall focus our attention on the explanation of evidence: we wish to explain why variables in O took on specific observed values, using assignments in V \ O.
To this purpose, we discuss in section 2 the requirements of such an explanation. In section 3, we list the standard approaches to evidence explanation as well as some recent methods that make explanations more concise, and explain some of their drawbacks. We then present causal explanation trees in section 4, and detail experiments and comparisons in section 5.

NOTATION
Boldface capitals denote sets of random variables or nodes in a graph, depending on the context. V is the set of all variables in the analysis. Italicized capitals like X, X_i, Y are random variables or nodes, elements of V; calligraphic capitals such as X, Y are their respective domains. Vectors are denoted by boldface lowercase, as e or p; scalars are in italics. Unless otherwise stated, the scalars x, y are assumed to be values of their respective uppercase random variables.
The probability distribution of a random variable X is denoted by p(X), and we write p(X = x) or p(x) for the probability of x. We only work with discrete variables.

AN IDEALIZED EXPLANATION
Even though many variables O may be observed, the explanation can be focused on a specific subset E ⊆ O only. The state E = e is then called the explanandum. We insist on the distinction between the explanandum e and the observations o (Chajewska and Halpern, 1997). Observations are all our knowledge about the current state of a system, and this might not coincide exactly with what we want explained. Consider for example the case where we wish to know why the grass is wet while we know it has been raining. We do not seek an explanation for why it has rained, only for why the grass is wet. A perfectly valid explanation is that the grass is wet because it is raining, if no other factors can sufficiently explain the facts.
An algorithm respecting this should then determine, for each variable in O, whether its observed state is relevant to explain e, and for each unobserved variable in V \ O, whether knowing its state adds explanatory power to the proposed explanation. This excludes methods which marginalize out O, preventing these variables from being part of an explanation.
To explain why a given system is observed in a given state, we must intuitively convey some information about the causal mechanisms that lead to the observation made. If we observe that it is raining and some explanation tells us that it rains because the grass is wet, we do not find it a good explanation, as it contradicts our understanding of how the system works: the grass being wet cannot make it rain. Suppose we have an explanation H = h for E = e: an intuitive interpretation of this result is that manually setting H = h will be a favourable configuration to observe E = e. As Halpern and Pearl (2005) discuss, explanations need to be causal to be consistent with users' knowledge of the mechanisms of the system. It is therefore important that the explanations are given in the data-generating direction, such that users can infer interventional rules from the given explanations (for instance, "if I can make it rain somehow, then I know that the grass will be wet", as opposed to an impossible "let me make the grass wet so as to make it rain").
Causal methods are subject to the availability of causal information. In this paper, we extract the causal information from causal BNs, but in general, the approach is adaptable to any causal model that can predict the effect of interventions on certain variables.
In addition to assuming that the relationships between the variables V can be represented by a fully oriented causal BN, we assume that the corresponding joint probability distribution is faithful and causally sufficient (Pearl, 2000; Spirtes et al., 2001). Faithfulness of the distribution ensures that there is a unique graph whose arcs depict all (conditional) dependencies of the distribution, and only those. Causal sufficiency forbids hidden common causes for variables in V, such that we can build a DAG whose arcs represent direct causation. Although most expert-designed BNs are naturally oriented causally, the output of causal structure learning algorithms is often a partially directed graph that may need additional expert knowledge to be fully oriented.
To summarize, we wish our explanations to give us causal information by detailing the mechanisms that lead to the explanandum, using all the available information we have about the state of the network.

EXISTING METHODS
This section reviews and discusses some of the major techniques to find explanations.

MOST PROBABLE EXPLANATION & VARIANTS
A common noncausal measure of explanatory power is the conditional probability of the explanatory variables H given the explanandum e. The most probable explanation (MPE) approach (Pearl, 1988) looks for the assignment h maximizing p(h | e); k-MPE looks for the k best explanations by maximizing this probability. The explanandum e is in the case of MPE equal to the full set of observations O = o, and the set H is V \ E. The resulting list of assignments can be long and uninformative because of lack of conciseness; moreover, it is hard to distinguish between long explanations, whose respective probabilities are low anyway and close to one another.
In the partial abduction approach (Shimony, 1991), the set of explanatory variables is a strict subset of V \ E; the set X of variables excluded from the explanation is then marginalized out before the maximum is computed: we look for arg max_h Σ_x p(h, x | e). This is the maximum a posteriori (MAP) model approach. The excluded variables X are selected either by a user, or via automated analysis of the network. Automatically selecting the relevant explanatory variables is a nontrivial issue (Shimony, 1991).
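The difference between MPE and the MAP model can be sketched by brute-force enumeration on a toy network (variables and numbers invented for illustration, not from the paper):

```python
# MPE vs. MAP by exhaustive enumeration on an invented network R -> W <- S.
from itertools import product

p_r = {1: 0.2, 0: 0.8}   # p(Rain)
p_s = {1: 0.1, 0: 0.9}   # p(Sprinkler)
p_w = {(1, 1): 0.99, (1, 0): 0.9, (0, 1): 0.8, (0, 0): 0.05}  # p(W=1 | R, S)

def joint(r, s, w):
    pw = p_w[(r, s)]
    return p_r[r] * p_s[s] * (pw if w else 1 - pw)

# Evidence W = 1. MPE maximizes p(r, s | w=1) over full assignments of H = {R, S}.
mpe = max(product([0, 1], repeat=2), key=lambda rs: joint(rs[0], rs[1], 1))

# MAP over H = {Rain} only: the excluded variable S is marginalized out first.
map_rain = max([0, 1], key=lambda r: sum(joint(r, s, 1) for s in [0, 1]))
print(mpe, map_rain)
```

Here the MAP over Rain alone happens to agree with the Rain component of the MPE, but in general marginalizing the excluded variables can change which value of the retained variables wins.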
Partial abduction is computationally more expensive than standard MPE, because it cannot be readily solved by message-passing algorithms, but approximations exist (e.g., Park, 2002). On the other hand, it generally leads to more concise explanations than MPE.
Further efforts to make explanations more concise include de Campos et al. (2001), where the k most probable explanations are found and then simplified based on relevance and probabilistic criteria; and Henrion and Druzdzel (1990), where partial assignments are also allowed, but only within a predefined tree that limits the set of possible explanations. An explanation is then a path from the root of the tree to a leaf, denoting variable assignments for each branch taken. This is known as scenario-based explanation. The best explanation is the one with the highest posterior probability.
There are several concerns with these approaches, whether MPE/MAP or scenario-based, that maximize some conditional probability of the explanatory variables (Chajewska and Halpern, 1997). First, they do not distinguish the explanandum from the observations, such that additional state information that is not meant to be explained is excluded from a possible explanation. Furthermore, there is no distinction between observing an explanatory variable X in a certain state x, and forcing it to have the value x. 1 Thus, depending on the choice of the explanatory variables, the intuitive interpretation (as described in the previous section) stating that setting H = h* will be a favourable configuration for observing E = e does not hold.
MPEs and, to a lesser extent, MAP model explanations are not robust: small changes in the network will often change the result of the analysis, even though the changes occur in parts of the network largely independent of the explanandum (Chan and Darwiche, 2006).
Common to the methods in this subsection is that they order explanations by p(h, e) (equivalent to ordering by p(h | e), as p(e) is constant for a given e): this joint probability cannot be considered alone to determine the explanatory power of h on e. Some of these problems are illustrated by the experiments in section 5.

SE ANALYSIS
In SE analysis, Jensen (2001) additionally considers the sensitivity of an explanation h with respect to the explanandum. Less sensitive explanations ensure that small changes in the network's parameters will not lead to severely different explanations, so that the explanation is stable with respect to the specification of the network. SE analysis also works by comparing two explanations h_i and h_j, usually with Bayes' factor or the likelihood ratio (Jeffreys, 1961):

Bayes' factor = posterior ratio / prior ratio = [p(h_i | e) / p(h_j | e)] / [p(h_i) / p(h_j)].

The empirical interpretation of Bayes' factor given by Jeffreys (1961) is that if it is less than 1, it is in favor of h_j; if less than 3, it is a slight support for h_i; if it is between 3 and 12, it is a positive support; and higher than 12, it is a strong support for h_i.

1 The difference between observation and intervention is fundamental to causality and is best described with the example of Simpson's paradox in Pearl (2000), chap. 6.
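As a numeric sketch (toy network and numbers invented for illustration), Bayes' factor between two single-variable explanations can be computed directly from the joint:

```python
# Bayes' factor on an invented network R -> W <- S, comparing the explanations
# Rain = 1 and Sprinkler = 1 for the explanandum WetGrass = 1.
from itertools import product

p_r = {1: 0.2, 0: 0.8}
p_s = {1: 0.1, 0: 0.9}
p_w = {(1, 1): 0.99, (1, 0): 0.9, (0, 1): 0.8, (0, 0): 0.05}

def joint(r, s, w):
    pw = p_w[(r, s)]
    return p_r[r] * p_s[s] * (pw if w else 1 - pw)

p_e = sum(joint(r, s, 1) for r, s in product((0, 1), repeat=2))  # p(W = 1)
post_rain = sum(joint(1, s, 1) for s in (0, 1)) / p_e            # p(R=1 | W=1)
post_sprk = sum(joint(r, 1, 1) for r in (0, 1)) / p_e            # p(S=1 | W=1)

# Bayes' factor = posterior ratio divided by prior ratio
bf = (post_rain / post_sprk) / (p_r[1] / p_s[1])
print(round(bf, 3))
```

With these numbers the factor lands just above 1, i.e., in Jeffreys's "slight support" band for Rain over Sprinkler.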
In Yuan and Lu (2007), Bayes' factor is used to search for explanations consisting of only a few variables, ranking them by their Bayes' factor computed as the ratio between the probability of the explanation given the explanandum and that of its opposite. A similar criticism as before applies to these methods: additional observations are discarded, and the causal directionality is ignored in the selection of the relevant explanatory variables.

EXPLANATION TREES
The method of Flores (2005) constructs a set of best explanations while at the same time giving preference to concise explanations, summarizing the results of the analysis in an explanation tree. We describe this method in more detail, as the causal explanation tree method (described in section 4) is based on a similar representation.
Definition 1 An explanation tree for an explanandum E = e is a tree in which every node X is an explanatory variable (with X ∈ V \ E), and each branch out of X is a specific instantiation x ∈ X of X. A path from the root to a leaf is then a series of assignments X = x, Y = y, ..., Z = z, summarized as P = p, which constitutes a full explanation. Flores's (2005) algorithm, summarized in Algorithm 1, builds such an explanation tree.
Starting with an empty tree, the variable to use as the next node is selected as the one that, given the explanandum, reduces the uncertainty the most in the rest of the explanatory variables according to some measure. The nodes that are on the path being grown are added to a conditioning set, so that the part of the explanatory space they already account for is taken into account. Two stopping criteria determine when to stop growing the tree: the minimum posterior probability β of the current branch, and the minimum amount of uncertainty reduction α that must be achieved by adding a new variable. Among all explanations represented by the final tree, the best one is the one with the largest posterior probability p(p | e).
Four elements of this algorithm, however, are subject to discussion. First, on line 2 of Algorithm 1, variables are added to the tree in order of how much information they provide about the remaining variables in the set of explanatory variables. But this does not measure the information that the added variables share with the explanandum.
Moreover, the explanandum actually grows as the tree is constructed, since there is no difference between the constructed path p and the explanandum e at line 2. Thus, this maximization cannot be interpreted as selecting variables that reduce the uncertainty in the explanandum.
Second, the algorithm makes no distinction between explanandum and observations. To try to fix this, we could either additionally condition on the observations O = o, or marginalize O out altogether. The former case is no different from adding o to the explanandum e, and the latter case excludes all X ∈ O from explanations, such that both cases are unsatisfactory.
Third, the criterion to choose the best explanation is the probability of the explanation path given the evidence, p(p | e), and not how likely the system is to have produced the evidence we are trying to explain with a configuration p, p(e | p). Both measures are linked, but since several explanations can cover an almost equal share of the explanation space and often only one will be included in the explanation tree, the criterion p(p | e) will miss explanations which could have explained the evidence well, but do not cover as large a fraction of the explanation space.
Fourth, causal considerations are ignored: there is no distinction between ancestors and descendants of a variable, such that we can get explanations of the type "it rains because the grass is wet".
From an end-user perspective though, trees are a good solution for representing several competing explanations compactly and readably. We introduce in section 4 a modified approach, which addresses the issues discussed here.

CAUSAL EXPLANATION TREES
Like the previous method, causal explanation trees take advantage of a tree representation. The tree is grown so as to ensure that explanations in any path are causal: variables can be selected as explanatory only if they causally influence the explanandum.
Before defining the causal criterion used in this approach, we need to define the concept of post-intervention distribution (Pearl, 2000, p. 72). A standard conditional probability of the form p(e | x) gives the probability (or probability density) of e when X = x is observed. It does not represent, however, the probability of e if we manually force variable X to have the value x. Causally, we are interested in the intervention on X, which we denote by do(X = x), rather than the observation of x. In causal BNs, the tool used to evaluate the effect of these conditionings is Pearl's (1995) do-calculus, which uses the structure of the causal graph to evaluate the post-intervention distribution.
Definition 2 Given a causal Bayesian network B in the sense of Pearl (2000, p. 23) over variables V = {X_1, ..., X_n}, an intervention do(X_i = x_i) can be expressed as:

p(x_1, ..., x_n | do(X_i = x_i)) = Π_{j ≠ i} p(x_j | pa_j) if the assignment agrees with X_i = x_i, and 0 otherwise, (2)

where pa_j are the values of the graphical parents (i.e., direct causes) of the node X_j in B.
The truncated factorization of (2) states that the probability distribution is computed as if the manipulated variable X_i had no incoming causal influence (i.e., no direct causes), and as if X_i = x_i had probability one. This makes sense, as forcing X_i to have a certain value effectively ignores its direct causes and natural distribution.
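A minimal sketch of the truncated factorization on an invented confounded triple C → X, C → Y, X → Y (all CPTs hypothetical, not from the paper) shows how do-conditioning differs from observational conditioning:

```python
# Invented confounded triple: C -> X, C -> Y, X -> Y.
p_c = {1: 0.5, 0: 0.5}            # p(C)
p_x = {1: 0.9, 0: 0.2}            # p(X=1 | C=c)
p_y = {(1, 1): 0.9, (1, 0): 0.7,  # p(Y=1 | C=c, X=x)
       (0, 1): 0.5, (0, 0): 0.1}

def p_y_given_do_x(x):
    # Truncated factorization: drop the factor p(X | C), keep p(C) and p(Y | C, X).
    return sum(p_c[c] * p_y[(c, x)] for c in (0, 1))

def p_y_given_obs_x(x):
    # Ordinary conditioning: the confounder C is re-weighted by observing X = x.
    num = sum(p_c[c] * (p_x[c] if x else 1 - p_x[c]) * p_y[(c, x)] for c in (0, 1))
    den = sum(p_c[c] * (p_x[c] if x else 1 - p_x[c]) for c in (0, 1))
    return num / den

print(p_y_given_do_x(1), p_y_given_obs_x(1))  # the two differ under confounding
```

Observing X = 1 makes C = 1 more likely and inflates p(Y = 1 | X = 1), whereas the intervention severs the C → X arc and keeps the prior over C.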
With this concept, we can now define the causal information flow (Ay and Polani, 2006), which will be our measure of the causal contribution of explanatory variables towards our explanandum.
Definition 3 The causal information flow from X to Y imposing Z = z is defined as:

I(X → Y | ẑ) = Σ_{x ∈ X} p(x | ẑ) Σ_{y ∈ Y} p(y | x̂, ẑ) log [p(y | x̂, ẑ) / Σ_{x' ∈ X} p(x' | ẑ) p(y | x̂', ẑ)], (3)

where x̂ denotes the intervention do(X = x) and ẑ the intervention do(Z = z). The expression I(X → Y | ẑ) measures the amount of information flowing from X to Y if we intervene on Z, setting it to z (i.e., if we block the causal flow on all paths going through Z). Note that (3) is, in essence, similar to (1). For faithful probability distributions, I(X → Y | ẑ) = 0 if and only if all directed paths (if any) from X to Y go through Z in the corresponding causal graph. For binary variables, if Y is a deterministic function of X regardless of Z, then I(X → Y | ẑ) = 1.
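Definition 3 can be evaluated by direct enumeration. The sketch below uses an invented confounded triple C → X, C → Y, X → Y (hypothetical CPTs, not from the paper) and computes I(X → Y | ĉ); the result is positive because the directed path X → Y does not go through C:

```python
from math import log2

# Hypothetical CPTs for the triple C -> X, C -> Y, X -> Y.
p_x = {1: 0.9, 0: 0.2}            # p(X=1 | C=c)
p_y = {(1, 1): 0.9, (1, 0): 0.7,  # p(Y=1 | C=c, X=x)
       (0, 1): 0.5, (0, 0): 0.1}

def flow_x_to_y_do_c(c):
    """I(X -> Y | c^) following definition (3), with Z = C intervened to c."""
    px = {1: p_x[c], 0: 1 - p_x[c]}  # p(x | c^): C is X's only parent
    py_x = {x: {1: p_y[(c, x)], 0: 1 - p_y[(c, x)]} for x in (0, 1)}
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            mix = sum(px[x2] * py_x[x2][y] for x2 in (0, 1))
            total += px[x] * py_x[x][y] * log2(py_x[x][y] / mix)
    return total

print(flow_x_to_y_do_c(1))  # > 0: the path X -> Y bypasses the blocked node C
```

Had every directed path from X to Y passed through C, blocking C would drive this quantity to zero, which is exactly the graphical reading of (3).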
In our application, we use the causal information flow to decide which variable should be added to the tree being built. This is shown in Algorithm 2. At line 2, we use do-conditioning on the already built path p. Using this criterion, the explanation tree is built as follows: the root node is selected as arg max_X I(X → e | o); i.e., the node which has the maximum information flow to the state of the explanandum. The important point is that we may condition on o without confusing observation and explanandum. Furthermore, we also allow the selection of an observed variable X ∈ O in addition to unobserved variables, consistently with our desiderata. When X ∈ O, it is observed and we know its value x. We must then compute the pointwise causal information flow from x to e, with o' being o without the observation X = x.
The tree is then grown recursively: for each possible value of X, a branch is added to the root. We can further avoid unnecessary computations by using a graphical reachability criterion from a candidate node X to E, blocking paths going through O ∪ P.
The inference steps were implemented using the factor graph message-passing algorithm (Frey et al., 2001).
The complexity of this algorithm, in terms of the number of calls to an inference engine per node in the constructed tree, is O(nd), where n is the number of explanatory variables |H| and d is the average domain size of the variables, e.g., 2 for binary variables. For comparison, Flores's (2005) approach is O(n^2 d^2).

EXPERIMENTS
We compare causal explanation trees (CET) with parameter α = 0 to the most probable explanation (MPE), Bayes' factor (BF) following Yuan and Lu (2007), and standard (noncausal) explanation trees (ET) with parameters α = 0.02 and β = 0. We test the approaches on three simple networks 3 to compare the relevance of explanations. A more extended version of these experiments and comments can be found in Nielsen (2007).
Drug (Figure 1). This network comes from Pearl (2000, chap. 6). It represents the outcome of an experiment designed to check the efficiency of a new drug on male and female patients. The males have a natural recovery rate of 70%; taking the drug decreases it to 60%. Similarly, 30% of females recover naturally, but only 20% when given the drug. Thus, both the absence of drug and being a male can explain a good recovery rate. In Figure 2, we try to explain a recovery. All approaches correctly realize that Sex = m largely accounts for the recovery. However, ET selects Sex = m ∧ Drug = yes as the best explanation according to the leaves' labels, just like MPE. This contradicts the natural idea of explanation, since the drug has a negative impact on the recovery. CET labels the leaves more sensibly: branches where the drug was not given have a higher rank. Moreover, the branches where Sex = f have a negative label, indicating that they actually decrease the probability of recovery. Although the first two BF explanations are sensible, the third one mistakenly selects Drug = yes as an explanation.

3 The conditional probability tables have been omitted in the figures of the two larger BNs due to lack of space and can be found at http://www.zurich.ibm.com/~uln/causalexpl/.
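The Simpson's-paradox flavour of this network can be reproduced with a small sketch. The recovery rates below are those quoted above; the equal sex proportions and the drug-assignment rates p(Drug | Sex) are my own assumptions for illustration, since the network's tables are not reproduced here:

```python
# Drug network sketch. Recovery rates from the text; p_male and p_drug are
# assumed values chosen to exhibit Simpson's paradox.
p_male = 0.5
p_drug = {'m': 0.75, 'f': 0.25}                  # assumed p(Drug=yes | Sex)
p_rec = {('m', 'yes'): 0.6, ('m', 'no'): 0.7,    # p(Recovery | Sex, Drug)
         ('f', 'yes'): 0.2, ('f', 'no'): 0.3}

def p_rec_obs(drug):
    """p(Recovery | Drug = drug): Sex is re-weighted by the observation."""
    num = den = 0.0
    for sex, ps in (('m', p_male), ('f', 1 - p_male)):
        pd = p_drug[sex] if drug == 'yes' else 1 - p_drug[sex]
        num += ps * pd * p_rec[(sex, drug)]
        den += ps * pd
    return num / den

def p_rec_do(drug):
    """p(Recovery | do(Drug = drug)): truncated factorization keeps p(Sex)."""
    return sum(ps * p_rec[(sex, drug)]
               for sex, ps in (('m', p_male), ('f', 1 - p_male)))

print(p_rec_obs('yes'), p_rec_obs('no'))  # observationally the drug looks helpful
print(p_rec_do('yes'), p_rec_do('no'))    # causally it is harmful for both sexes
```

Under these assumptions the drug is associated with recovery observationally but lowers it under intervention for both sexes, which is exactly why conditioning and intervening must not be conflated when ranking explanations.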
Academe (Figure 3).This network depicts the relationships between various marks given to students following a course.The Final mark is determined by some Other outside factors and an intermediate mark (T.P. mark), which is in turn determined by the student's abilities in Theory and Practice as well as Extra curricular activities in this tested subject.
In Figure 4, the explanandum was set to Final mark = fail; i.e., we want to explain why a student failed the course. T.P. mark and Global mark have been excluded from the possible explanatory variables in the two tree algorithms, as they are modeling artifacts.
ET tells us that Theory = bad is the best explanation.
We could have expected Practice to also be part of the explanation.

The set of explanatory variables H ⊆ V can include both observed and unobserved variables, and an explanation is an assignment H = h (compatible with O = o for variables both in H and in O). An exhaustive search is performed over all subsets of the hypothesis, and the explanations are shown to be more concise on a sample network than MPE, Shimony's (1991) MAP, and the simplifications described by de Campos et al. (2001).
Algorithm 1 Explanation Tree (Flores, 2005)
1: function T = ExplanationTree(H, e, p; α, β)
2: X* ← arg max_{X ∈ H} Σ_{Y ∈ H} Inf(X; Y | e, p)
3: if max_{Y ∈ H\X*} Inf(X*; Y | e, p) < α or p(p | e) < β then return ∅
...
7: for each x ∈ domain(X*) do
8: T' ← ExplanationTree(H \ X*, e, p ∪ {x})
9: add a branch x to T with subtree T' and
10: assign it the label p(p, x | e)
11: end for
12: return T

The algorithm is parametrized by the measure of uncertainty reduction Inf(X; Y | e, p). 2 For our implementation, we used the conditional mutual information. The mutual information I(X; Y | z) is a symmetrical measure of how much reduction in uncertainty about Y we get by knowing X in the context Z = z, and is defined as:

I(X; Y | z) = Σ_{x ∈ X} p(x | z) Σ_{y ∈ Y} p(y | x, z) log [p(y | x, z) / p(y | z)]. (1)

If X and Y are independent given Z = z, we have I(X; Y | z) = 0. If X fully determines Y, then knowing one is enough to know the other and full information is shared. Explanation trees are interesting in that they can present many mutually exclusive explanations in a compact form. Flores (2005) also argues that explanations as constructed by Algorithm 1 are reasonable and more sensible than (k-)MPE in the sense that, on simple networks, the returned explanations are those that we expect.

2 See Flores (2005) for additional cases where max at line 3 is replaced by min or avg, and Inf is the Gini index.
We also allow inputting additional observed variables O in an additional conditioning set. We replace Y from (3) with the state of the explanandum e, suppress the corresponding summation, and divide by the prior probability p(e | o, p), so that we end up computing

I(X → e | o, p) = Σ_{x ∈ X} [p(x | o, p) p(e | o, x, p) / p(e | o, p)] log [p(e | o, x, p) / Σ_{x' ∈ X} p(x' | o, p) p(e | o, x', p)],

ensuring that the expected value E_E[I(X → e | o, p)] = Σ_{e ∈ E} p(e | o, p) I(X → e | o, p) equals I(X → E | o, p).
For each new leaf, the next explanatory variable is selected as arg max_Y I(Y → e | o, x), and so on, where the do-conditioning set always reflects the selected variable values from the root to the current leaf. We use only one stopping criterion: the minimum information flow α that we accept as a causal information contribution. The algorithm furthermore allows the search set of explanatory variables H to be restricted explicitly (defaulting to V \ {E}). Finally, each leaf is labeled with log p(e | o, p)/p(e | o) (where we make sure that variables selected in p are removed from o if needed). This measures how much performing the interventions p changes the probability of the explanandum (given the observations) with respect to the prior probability of the explanandum. Higher values indicate better explanations; negative values indicate that the probability of the explanandum actually decreases under the proposed explanation.

Using the information flow criterion brings us two advantages over standard (conditional) mutual information: first, we automatically consider only variables that can causally influence the explanandum. Second, when selecting the i-th variable on a tree branch, we take the previously selected variables 1 through i − 1 into account causally, as they enter the conditioning set of variables that have been intervened on.

Algorithm 2 Causal Explanation Tree
1: function T = CausalExplTree(H, o, e, p; α)
Input: H: set of explanatory variables; O = o: observation set; E = e: explanandum; p: path of interventions; α: stopping criterion
Output: T: a causal explanation tree
2: X* ← arg max_{X ∈ H} I(X → e | o, p)
3: if I(X* → e | o, p) < α then return ∅
4: T ← new tree with root X*
5: for each x ∈ domain(X*) do
6: T' ← CausalExplTree(H \ X*, o, e, p ∪ {x})
7: add a branch x to T with subtree T' and
8: assign it the contribution log p(e | o, p, x)/p(e | o)
9: end for
10: return T

In practice, computing a causal information flow of the type I(X → e | o, p) at line 2 of Algorithm 2 requires an inference call for each value x of the candidate variable X. The prior p(e | o) is needed to label the leaves, but as it only depends on e and o, we compute it only once.
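A compact, self-contained sketch of the root-selection step follows, on an invented three-node network C → X, C → Y, X → Y with hypothetical CPTs; `prob` is a naive enumeration "inference engine" implementing the truncated factorization, and `pointwise_flow` is the pointwise criterion with no observations and an empty intervention path:

```python
from itertools import product
from math import log2

# Invented network C -> X, C -> Y, X -> Y; CPTs are hypothetical.
P_C = {1: 0.5, 0: 0.5}
P_X = {1: 0.9, 0: 0.2}               # p(X=1 | C=c)
P_Y = {(1, 1): 0.9, (1, 0): 0.7,     # p(Y=1 | C=c, X=x)
       (0, 1): 0.5, (0, 0): 0.1}

def prob(event, do={}):
    """p(event | do) by enumeration with the truncated factorization."""
    total = 0.0
    for c, x, y in product((0, 1), repeat=3):
        vals = {'C': c, 'X': x, 'Y': y}
        if any(vals[v] != w for v, w in do.items()):
            continue  # inconsistent with the interventions
        f = 1.0
        if 'C' not in do:
            f *= P_C[c]
        if 'X' not in do:
            f *= P_X[c] if x else 1 - P_X[c]
        if 'Y' not in do:
            f *= P_Y[(c, x)] if y else 1 - P_Y[(c, x)]
        if all(vals[v] == w for v, w in event.items()):
            total += f
    return total

def pointwise_flow(var, do={}):
    """Pointwise I(var -> Y=1 | do): the score used to grow the tree."""
    pe = prob({'Y': 1}, do)
    px = {v: prob({var: v}, do) for v in (0, 1)}
    pe_dox = {v: prob({'Y': 1}, {**do, var: v}) for v in (0, 1)}
    mix = sum(px[v] * pe_dox[v] for v in (0, 1))
    return sum(px[v] * pe_dox[v] / pe * log2(pe_dox[v] / mix) for v in (0, 1))

root = max(['C', 'X'], key=lambda v: pointwise_flow(v))
print(root, round(pointwise_flow('C'), 4), round(pointwise_flow('X'), 4))
```

On this toy model the confounder C carries more causal information flow towards Y = 1 than X does, so it would become the root of the causal explanation tree; the recursion would then continue on each branch with C added to the intervention path.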
