Philosophy Department

University of Waterloo

Waterloo, Ontario, N2L 3G1

pthagard@watarts.uwaterloo.ca

© Paul Thagard, 1997

To go directly a particular section of this paper, click on a section title below.

To return to the Coherence Articles Table of Contents page, click here.

When surprising events occur, people naturally try to generate explanations of them. Such explanations usually involve hypothesizing causes that have the events as effects. Reasoning from effects to prior causes is found in many domains, including:

- Social reasoning: when friends are acting strange, we conjecture about what might be bothering them.
- Legal reasoning: when a crime has been committed, jurors must decide whether the prosecution's case gives a convincing explanation of the evidence.
- Medical diagnosis: given a set of symptoms, a physician tries to decide what disease or diseases produced them.
- Fault diagnosis in manufacturing: when a piece of equipment breaks down, a trouble shooter must try to determine the cause of the breakdown.
- Scientific theory evaluation: scientists seek an acceptable theory to explain experimental evidence.

What is the nature of such reasoning?

The many discussions of causal reasoning over the centuries can be seen
as falling under two general traditions that I shall call *explanationism*
and *probabilism*. Explanationists understand causal reasoning qualitatively,
while probabilists exploit the resources of the probability calculus to
understand causal reasoning quantitatively. Explanationism goes back at
least to Aristotle (1984, vol. 1 p. 128) who considered the inference that
the planets are near as providing an explanation of why they do not twinkle.
Some Renaissance astronomers such as Copernicus and Rheticus evaluated theories
according to their explanatory capabilities (Blake 1960). The leading explanationists
in the nineteenth century were the British scientist, philosopher, and historian
William Whewell (1967), and the American polymath C. S. Peirce (1931-58).
The most enthusiastic explanationists in this century have been epistemologists
such as Gilbert Harman (1973, 1986) and William Lycan (1988). Very recently,
computational models of inference to the best explanation have been developed
(Josephson et al. 1987; Thagard 1988, 1989, 1992b; O'Rorke, Morris, and
Schulenburg 1990; Rajamoney 1990; Ng and Mooney 1990; Ram 1990).

The probabilist tradition is less ancient than explanationism, for the mathematical theory of probability only arose in the seventeenth century through the work of Pascal, Bernoulli, and others (Hacking 1975). Laplace and Jevons were the major proponents of probabilistic approaches to induction in the eighteenth and nineteenth century, respectively (Laudan 1981, ch. 12). Many twentieth-century philosophers have advocated probabilistic approaches to epistemology, including Keynes (1921), Carnap (1950), Jeffrey (1983), Levi (1980), and Kyburg (1983). Probabilistic approaches have recently become influential in artificial intelligence as a way of dealing with uncertainty encountered in expert systems (Pearl 1988, Peng and Reggia 1990, Neapolitain 1990). Probabilistic approaches are also being applied to natural language understanding (Goldman and Charniak 1990).

The explanationist versus probabilist issue surfaces in a variety of sub-areas. Some legal scholars concerned with evidential reasoning have been probabilist (Lempert 1986, Cohen 1977), while some are explanationist and see probabilist reasoning as neglecting important aspects of how jurors reach decisions (Allen 1990, Pennington and Hastie 1986). In the philosophy of science, there is an unresolved tension between probabilist accounts of scientific inference (Hesse 1974, Horwich 1982, Howson and Urbach 1989) and explanationist accounts (Thagard 1978, 1988, 1989). Neither the probabilist nor the explanationist tradition is monolithic: there are competing interpretations of probability and inference within the former, and differing views of explanatory inference within the latter.

In recent years, it has become possible to examine the differences between explanationist and probabilist approaches at a much finer level, because algorithms have been developed for implementing them computationally. I have proposed a theory of explanatory coherence that incorporates the kinds of reasoning advocated by explanationists, and implemented the theory in a connectionist program called ECHO that shows how explanatory coherence can be computed in networks of propositions (Thagard 1989). Pearl (1988) and others have shown how probabilistic reasoning can be implemented computationally using networks. The question naturally arises of the relation between ECHO networks and probabilistic networks.

This paper shows how ECHO's qualitative input can be used to produce a probabilistic network to which Pearl's algorithms are applicable. At one level, this result can be interpreted as showing that ECHO is a special case of a probabilistic network. The production of a probabilistic version of ECHO highlights, however, several computational problems with probabilistic networks. The probabilistic version of ECHO requires the provision of many conditional probabilities of dubious availability, and the computational techniques needed to translate ECHO into probabilistic networks are potentially combinatorially explosive. ECHO can therefore be viewed as an intuitively appealing and computationally efficient approximation to probabilistic reasoning. We will also see that ECHO puts important constraints on the conditional probabilities used in probabilistic networks.

The comparison between ECHO and probabilistic networks does not in itself settle the relation between the explanationist and probabilist traditions, since there are other ways of being an explanationist besides ECHO, and there are other ways of being a probabilist besides Pearl's networks. But from a computational perspective, ECHO and Pearl networks are much more fully specified than previous explanationist and probabilist proposals, so a head-to-head comparison is potentially illuminating. After briefly reviewing my theory of explanatory coherence and Pearl's approach to probabilistic networks, I shall sketch the probabilistic interpretation of explanatory coherence and discuss the computational problems that arise. Then a demonstration of how ECHO naturally handles Pearl's central examples will support the conclusion that explanatory coherence theory is not obviated by the probabilistic approach. (2)

My theory of explanatory coherence, TEC, has been evolving in a series
of publications. Here is the current set of principles, taken from Thagard
(1992b). P, Q, and P*1* ... P*n* are propositions, and coherence
is a relation between two propositions. To ``incohere'' is to *resist*
holding together.

*Principle 1. Symmetry*.

(a) If P and Q cohere, then Q and P cohere.

(b) If P and Q incohere, then Q and P incohere.

*Principle 2. Explanation*.

If P

1 ...Pmexplain Q, then:(a) For each P

iin P1 ...Pm, Piand Q cohere.(b) For each P

iand Pjin P1 ...Pm, Piand Pjcohere.(c) In (a) and (b) the degree of coherence is inversely proportional to the number of propositions P

1... Pm.

*Principle 3. Analogy*. (3)

If P

1explains Q1, P2explains Q2, P1is analogous to P2, and Q1is analogous to Q2, then P1and P2cohere, and Q1and Q2cohere.

*Principle 4. Data Priority*.

Propositions that describe the results of observation have a degree of acceptability on their own.

*Principle 5. Contradiction*.

If P contradicts Q, then P and Q incohere.

*Principle 6. Competition*

If P and Q both explain a proposition R, and if P and Q are not explanatorily connected, then P and Q incohere. Here P and Q are explanatorily connected if any of the following conditions holds:

(a) P is part of the explanation of Q,

(b) Q is part of the explanation of P,

(c) P and Q are together part of the explanation of some proposition S.

*Principle 7. Acceptability*.

(a) The acceptability of a proposition P in a system S depends on its coherence with the propositions in S.

(b) If many results of relevant experimental observations are unexplained, then the acceptability of a proposition P that explains only a few of them is reduced.

To understand these principles, further discussion is required. (4) Principle 1, Symmetry, asserts that pairwise coherence and incoherence are symmetric relations, in keeping with the everyday sense of coherence as holding together. The coherence of two propositions is thus different from the nonsymmetric relations of entailment and conditional probability. Typically, P entails Q without Q entailing P, and the conditional probability of P given Q is different from the probability of Q given P. But if P and Q hold together, so do Q and P. The use of a symmetrical relation has advantages that will become clearer in the discussion of the connectionist implementation below.

Principle 2, Explanation, is by far the most important for assessing
explanatory coherence, since it establishes most of the coherence relations.
Part (a) is the most obvious: If a hypothesis P is part of the explanation
of a piece of evidence Q, then P and Q cohere. Moreover, if a hypothesis
P2 is explained by another hypothesis P*1*, then P*1* and P*2*
cohere. Part (a) presupposes that explanation is a more restrictive relation
than deductive implication, since otherwise we could prove that any two
propositions cohere.

Whereas part (a) of Principle 2 says that what explains coheres with
what is explained, part (b) states that two propositions cohere if together
they provide an explanation. Behind part (b) is the idea that the evaluation
of a hypothesis depends partly on the other hypotheses with which it furnishes
explanations. I call two hypotheses that are used together in an explanation
``cohypotheses.'' Again I assume that explanation is more restrictive than
implication, since otherwise it would follow that any proposition that explained
something was coherent with every other proposition, because if P1 implies
Q, then so does P*1* & P2. But any scientist who maintained at
a conference that the theory of general relativity and today's baseball
scores together explain the motion of planets would be laughed off the podium.
Principle 2 is intended to apply to explanations and hypotheses actually
proposed by scientists.

Part (c) of Principle 2 embodies the claim that if numerous propositions
are needed to furnish an explanation, then the coherence of the explaining
propositions with each other and with what is explained is thereby diminished.
Scientists tend to be skeptical of hypotheses that require myriad *ad
hoc* assumptions in their explanations. There is nothing wrong in principle
in having explanations that draw on many assumptions, but we should prefer
theories that generate explanations using a unified core of hypotheses.
I have elsewhere contended that the notion of *simplicity* most appropriate
for scientific theory choice is a comparative one preferring theories that
make fewer special assumptions (Thagard 1988). Principles 2(b) and 2(c)
together subsume this criterion. I shall not attempt further to characterize
``degree of coherence'' here, but the onnectionist algorithm described below
provides a natural interpretation.

The third criterion for the best explanation in my earlier account was analogy, and this is subsumed in principle 3. It is controversial whether analogy is of more than heuristic use, but scientists such as Charles Darwin have used analogies to defend their theories; his argument for evolution by natural selection is analyzed below. Principle 3 does not say simply that any two analogous propositions cohere. There must be an explanatory analogy, with two analogous propositions occurring in explanations of two other propositions that are analogous to each other. Recent computational models of analogical mapping and retrieval show how such correspondences can be noticed (Holyoak and Thagard 1989, Thagard, Holyoak, Nelson, and Gochfeld 1990).

Principle 4, Data Priority, stands much in need of elucidation and defense. In saying that a proposition describing the results of observation has a degree of acceptability on its own, I am not suggesting that it is indubitable, only that it can stand on its own more successfully than a hypothesis whose sole justification is what it explains. A proposition Q may have some independent acceptability and still not be accepted, if it is only coherent with propositions that are not themselves acceptable.

From the point of view of explanatory coherence alone, we should not take propositions based on observation as independently acceptable without any explanatory relations to other propositions. The coherence of such propositions is non-explanatory, based on background knowledge that observations of certain sorts are very likely to be true. From experience, we know that our observations are very likely to be true, so we should believe them unless there is substantial reason not to. Similarly, at a different level, we have some confidence in the reliability of descriptions of experimental results in carefully refereed scientific journals. Observations may be ``theory-laden'', but they are far from being theory-determined. I count as data not just individual observations such as ``the instrument dial reads .5,'' but also generalizations from such observations.

Principle 5, Contradiction, is straightforward. By ``contradictory'' here I mean not just syntactic contradictions like P \& not-P but also semantic contradictions such as ``This ball is black all over'' and ``This ball is white all over.'' In my earlier version of TEC, I tried to stretch ``contradiction'' to cover cases where hypotheses that are not strictly contradictory are nevertheless held to be incompatible, but such cases are better handled by the new principle 6, Competition.

According to Principle 6, we should assume that *hypotheses that explain
the same evidence compete with each other unless there is reason to believe
otherwise*. Hence there need be no special relation between two hypotheses
for them to be incoherent, since hypotheses that both explain a piece of
evidence are judged to incohere unless there are reasons to think that they
cohere. Not all alternative hypotheses incohere, however, since many phenomena
have multiple causes. For example, explanations of why someone has certain
medical symptoms may involve hypotheses that the patient has various diseases,
and it is possible that more than one disease is present. Normally, however,
if hypotheses are proposed to explain the same evidence, they will be treated
as competitors. For example, in the debate over dinosaur extinction (Thagard
1991), scientists generally treat as contradictory the hypotheses:

(1) Dinosaurs became extinct because of a meteorite collision.

(2) Dinosaurs became extinct because the sea level fell.

Logically, (1) and (2) could both be true, but scientists treat them as conflicting explanations. According to principle 6, they incohere because they both are claimed to explain why dinosaurs became extinct and there is no explanatory relation between them.

Principle 7, Acceptability, proposes in part 7(a) that we can make sense of the overall coherence of a proposition in an explanatory system just from the pairwise coherence relations established by principles 1--5. If we have a hypothesis P that coheres with evidence Q by virtue of explaining it, but incoheres with another contradictory hypothesis, should we accept P? To decide, we cannot merely count the number of propositions with which P coheres and incoheres, since the acceptability of P depends in part on the acceptability of those propositions themselves. We need a dynamic and parallel method of deriving general coherence from particular coherence relations; such a method is provided by the connectionist program described below.

Principle 7(b), reducing the acceptability of a hypothesis when much
of the relevant evidence is unexplained by any hypothesis, is intended to
handle cases where the best available hypothesis is still not very good,
in that it accounts for only a fraction of the available evidence. Consider,
for example, a theory in economics that could explain the stock market crashes
of 1929 and 1987 but had nothing to say about myriad other similar economic
events. Even if the theory gave the best available account of the two crashes,
we would not be willing to elevate it to an accepted part of general economic
theory. What does ``relevant'' mean here? As a first approximation, we can
say that a piece of evidence is *directly* relevant to a hypothesis
if the evidence is explained by it or by one of its competitors. We can
then add that a piece of evidence is relevant if it is directly relevant
or if it is similar to evidence that is relevant, where similarity is a
matter of dealing with phenomena of the same kind. Thus a theory of the
business cycle that applies to the stock market crashes of 1929 and 1987
should also have something to say about nineteenth century crashes and major
business downturns in the twentieth century.

I will only briefly review how the principles are implemented in the computer program ECHO. My emphasis will be on the graphical and computational properties of ECHO networks most relevant to comparison with probabilistic networks. ECHO takes as input statements such as

(`EXPLAIN '(H1 H2) 'E1)`

whose interpretation is that hypotheses H1 and H2 together explain evidence
E1. ECHO represents each proposition by a network node called a *unit*,
and constructs links between units in accord with TEC. Whenever two propositions
cohere according to TEC, ECHO places an excitatory link (with weight greater
than 0) between the units that represent them. Whenever two propositions
incohere according to TEC, ECHO places an inhibitory link (with weight less
than 0) between them.

In accord with Principle 1, Symmetry, all links are symmetric. Given the above input, Principle 2, Explanation, requires that ECHO produce excitatory links between the units representing the following pairs of propositions: H1 and E1, H2 and E1, and H1 and H2. Notice the last link, required by Principle 2(b), since it provides one of the greatest differences between ECHO networks and probabilistic networks.

In accord with principle 2(b), Simplicity, the weights among the units given the above input are lower than they would be if only one hypotheses had been needed in the explanation.

Principle 3 says that analogy can also be a source of coherence, and
ECHO constructs the appropriate excitatory links given input that says that
propositions are analogous. To implement Principle 4, Data Priority, ECHO
takes input specifying propositions as data and constructs an excitatory
link between each of those propositions and a special evidence unit. On
the basis of Principles 5 and 6, Contradiction and Competition, ECHO constructs
inhibitory links between units representing propositions that contradict
each other or compete to explain other propositions. Finally, Principle
7, Acceptability, is implemented in ECHO using a simple connectionist method
for updating the activation of a unit based on the units to which it is
linked. Units typically start with an activation level of 0, except for
the special evidence unit whose activation is always 1. Activation spreads
from it to the units representing data, and from them to units representing
propositions that explain data, and then to units representing higher-level
propositions that explain propositions that explain data, and so on. Inhibitory
links between units make them suppress each other's activation. Activation
of each unit u, a*j*, is updated in parallel using the equation:

{netj (max - aj (t)) |
if\ net_j > 0 | |

aj (t+1) = aj (t)(1 - theta ) + |
||

{cr netj (aj (t) - min) |
otherwise |

Here is a decay parameter that decrements each unit at every cycle, min
is minimum activation (-1), max is maximum activation (1), and net*j*
is the net input to a unit. This is defined by: net*j* = Sigma-*i*
w*ij*a*i*(t) Full description of the algorithms for ECHO's operation
has been given elsewhere (Thagard 1989, 1992b).

Figure 1.1 displays a typical network produced by ECHO, given that H1 and H2 together explain E1, H1 and H2 together explain E2, H4 and H5 together also explain E1 and E2, and H1 is itself explained by H3. All links are symmetric, although the flow of activation is not, since it originates in the EVIDENCE unit. Note that there are many loops in the graph in figure 1.1. Loops with 3 nodes arise both from explanation (e.g. H1, H2, and E1) and from competition (e.g. H1, H4 and E1). Loops with 4 nodes occur whenever there are two propositions that both explain two other propositions (e.g. H1, H4, E1, E2). The presence of loops brings no complications to ECHO's computation of acceptability, but we shall see that loops are problematic for probabilistic networks.

(Graphic not available)

**Figure 1.1** A typical ECHO network. Solid lines indicate excitatory
links, while dotted lines indicate inhibitory links.

Updating an ECHO network is very simple and can be performed locally
at each unit *u*, which must have access, for each *u* to which
is linked, to the weight of the link between *u* and *u* and to
the activation of *u*. ECHO networks are never completely connected
- most units are not directly linked to most other units. Even if networks
were completely connected, the maximum number of links per unit in a network
with *n* units would be *n-1*, so that *n*(n-1)* simple calculations
are all that updating requires. In principle, if each unit has its own processor,
the *n-1* calculations can be performed independently of each other.
There is, however, no guarantee that a small number of updating steps will
suffice to produce stable activation values. It is possible that activation
will pass round and round the network, producing oscillations in activations
of units so that no judgment of acceptability is produced. But computational
experiments have shown that ECHO networks can be highly stable given appropriate
weight values for excitatory and inhibitory links. For most of the major
ECHO examples, 1000 runs have been done to determine how sensitive network
performance is to excitation, inhibition, and decay, considering each in
the range of .01 to .1. The sensitivity experiments show that ECHO performs
well so long as excitation is low relative to inhibition, when the inhibitory
links between units make the network settle without the oscillations that
uncontrolled excitation would produce. Resulting activations can range between
1 (maximum acceptance) and -1 (maximum rejection).

(Graphic not available)

**Figure 1.2** Graph of settling time versus number of units for 15
ECHO applications, as described in: Thagard (1989, 1991b, 1991a); Thagard
and Nowak (1988, 1990); Nowak and Thagard (1991a, 1991b).

On the basis of these experiments, I chose the default parameter values of .04 for excitation, -.06 for inhibition, and .05 for decay, since they tend to lead to rapid settling. All runs of ECHO on different examples now use these parameter values. Figure~\ref{echo-2} shows the settling time of the most interesting ECHO applications as a function of number of units. Strikingly, larger networks with more links do not require more updating steps to reach stable activation values. This result is consistent with experiments performed using ARCS, a program for analog retrieval that uses networks similar to ECHO's but with many more units and links (Thagard, Holyoak, Nelson, and Gochfeld 1990). Thus although we can not show analytically that ECHO can efficiently compute explanatory coherence, experimental results suggest that ECHO's ability to determine the acceptability of propositions is little affected by the size or connectivity of networks. Further information on how ECHO works will be provided in section 5 which shows how ECHO handles some probabilistic cases. (5)

The theory of explanatory coherence employs vague concepts such as ``explanation'', ``coherence'', and ``acceptability'', and ECHO requires input specifying explanatory relations. It is reasonable to desire a more precise way of understanding causal reasoning, for example in terms of the mathematical theory of probability. That theory can be stated in straightforward axioms that establish probabilities as quantities between 0 and 1. From the axioms it is trivial to derive Bayes' theorem, which can be written as:

P(H/E) = P(H) P(E/H) quantity over P(E)

It says that the probability of a hypothesis given the evidence is the prior probability of the hypothesis times the probability of the evidence given the hypothesis, divided by the probability of the evidence. Bayes' theorem is very suggestive for causal reasoning, since we can hope to decide what caused an effect by considering what cause has the greatest probability given the effect. (6) In practice, however, application of the probability calculus becomes complicated. Harman (1986, p.~25) points out that in general probabilistic updating is combinatorially explosive, since we need to know the probabilities of a set of conjunctions whose size grow exponentially with the number of propositions. For example, full probabilistic information about three propositions, A, B, and C, would require knowing a total of 8 different values: P(A & B & C), P(A & B & not-C), P(A & not-B & not-C), etc. Only 30 propositions would require more than a billion probabilities.

Probabilistic networks prune enormously the required number of probabilities
and probability calculations, since they restrict calculations to a limited
set of dependencies. Suppose you know that B depends only on A, and C depends
only on B. You then have the simple network *A right arrow B right arrow
C*. This means that A can affect the probability of C only through B,
so that the calculation of the probability of C can take into account the
value of B while ignoring A.

Probabilistic networks have gone under many different names: causal networks, belief networks, Bayesian networks, influence diagrams, and independence networks. For precision's sake, I want to consider a particular kind of probabilistic network, concentrating on the elegant and powerful methods of Pearl (1988). Since methods for dealing with probabilistic networks other than his are undoubtedly possible, I shall not try to compare ECHO generally with probabilistic networks, but will make the comparison specifically with Pearl networks.

In Pearl networks, each node represents a multi-valued *variable*
such as a patient's temperature, which might take three values: high, medium,
low. In the simplest cases, the variable can be propositional, with two
values, true and false. Already we see a difference between Pearl networks
and ECHO networks, since ECHO requires separate nodes for a proposition
and its negation. But translations between Pearl nodes and ECHO nodes are
clearly possible and will be discussed below.

More problematic are the edges in the two kinds of networks. Pearl networks
are directed, acyclic graphs. Edges are directed, pointing from causes to
effects, so that *A right arrow B* indicates that A causes B and not
vice versa. In contrast, ECHO's links are all symmetric, befitting the character
of coherence and incoherence (Principle 1 of TEC), but symmetries are not
allowed in Pearl networks. The specification that the graphs be acyclic
rules out relations such as those shown in figure 1.3. Since the nodes are
variables, a more accurate interpretation of the edge *A right arrow B*
would be: the values of B are causally dependent on the values of A.

(Graphic not available)

**Figure 1.3** Examples of cyclic graphs.

(Graphic not available)

**Figure 1.4 **Sample Pearl network, in which the variable D is dependent
on A, B, and C, while E and F are dependent on D. Lines with arrows indicate
dependencies.

The structure of Pearl networks is used to localize probability calculations and surmount the combinatorial explosion that can result from considering the probabilities of everything given everything else. Figure1.4 shows a fragment of a Pearl network in which the variable D is identified as being dependent on A, B, and C, while E and F are dependent on D. The probabilities that D will take on its various values can then be calculated only by looking at A, B, C, E, and F, ignoring other variables in the network from which D is assumed to be conditionally independent, given the five variables on which it is directly dependent. The probabilities of the values of D can be expressed as a vector corresponding to the set of values. For example, if D is temperature and has values (high, medium, low), the vector (.5 .3 .2) assigned to D means that the probability of high temperature is .5, of medium temperature is .3, and low temperature is .2. In accord with the axioms of probability theory, the numbers in the vector must sum to 1, since they are the probabilities of all the exclusive value of the variable. The desired result of computing using a Pearl network is that each node should have a stable vector representing the probabilities of its values given all the other information in the network. If a measurement determines that the temperature is high, then the vector for D would be (1 0 0). If the temperature is not known, it must be inferred using information gathered both from the variables on which D is dependent and the ones that depend on D. In terms of Bayes' theorem, we can think of A, B, and C as providing prior probabilities for the values of D, while E and F provide observed evidence for it. The explanatory coherence interpretation of figure~\ref{pearl} is that A, B, and C explain D, while D explains E and F.

For each variable X, Pearl uses BEL(x) to indicate the computed degrees of belief (probabilities) that X takes on for each of its values x. BEL(x) is thus a vector with as many entries as X has values, and is calculated using the equation:

BEL(x) = alpha * lambda(x) * pi(x)

Here alpha is a normalizing constant used to ensure that the entries in the vector sum to 1. Lambda(x) is a vector representing the amount of support for particular values of X coming up from below, that is from variables that depend on X. Pi(x) is a vector representing the amount of support for particular values of X coming down from above, that is from variables on which X depends. For a variable V at the very top of the network, the value for passed down by V will be a vector of the prior probabilities that V takes on its various values. Ultimately, BEL(x) should be a function of these prior probabilities and fixed probabilities at nodes where the value of the variable is known, producing BEL vectors such as (1 0 0).

Calculating BEL values is non-trivial, because it requires repeatedly
updating the lambda, pi, and BEL values until the prior probabilities and
the known values based on evidence have propagated throughout the network.
It has been shown that the general problem of probabilistic inference in
networks is NP-hard (Cooper 1990), so we should not expect there to be a
universal efficient algorithm for updating BEL.(7) Pearl presents algorithms
for computing BEL in the special case where networks are *singly connected*,
that is where no more than one path exists between two nodes (Pearl 1988,
ch.~4; see also Neapolitain 1990). A loop here is a sequence of edges independent
of direction. If there is more than one path between nodes, then the network
contains a loop that can interfere with achievement of stable values of
BEL, lambda, and pi. Hence methods have been developed for converting multiply
connected networks into singly connected ones by clustering nodes into new
nodes with many values. For example, consider the network shown in figure
1.5 (from Pearl 1988, p. 196). In this example, metastatic cancer is a cause
of both increased total serum calcium and brain tumor, either of which can
cause a coma. This is problematic for Pearl because there are two paths
between A and D. Clustering involves collapsing nodes B and C into a new
Z representing a variable with values that are all possible combinations
of the values of B and C: increased calcium and tumor, increased calcium
and no tumor, no increased calcium and tumor, and no increased calcium and
no tumor. ECHO deals with cases such as these very differently, using the
principle of competition, as we will see in section 5. There are other ways
of dealing with loops in probabilistic networks besides clustering.

Pearl discusses two approximating alternatives, while Lauritzen and Spiegelharter (1988) have offered a powerful general method for converting any directed acyclic graph into a tree of cliques of that graph. See also Neapolitain 1990, ch.~7. Hrycej (1990) shows how approximation by stochastic simulation can be understood as sampling from the Gibbs distribution in a random Markov field.

(Graphic not available)

**Figure 1.5** Pearl's representation of a multiply connected network
that must be manipulated before probability calculations can be performed.

ECHO | Pearl | |
---|---|---|

Nodes represent | propositions | variables |

Edges represent | coherence | dependencies |

Directedness | symmetric | directed |

Loops | many | must be eliminated |

Node quantity updated | activation [-1,1] | BEL: vector of [0,1] |

Additional updating | none | lambda, pi |

Additional info used | explanations | conditional probabilities |

data | prior probabilities |

What probabilities must actually be known to compute BEL(x) even in singly connected networks? Consider again figure 1.4, where BEL values for node D are to be computed based using and values for A, B, C, E, and F. To simplify, consider only values a, b, c, d, and e of the respective variables. Pearl's algorithms do not simply require knowledge of the conditional probabilities P(d/a), P(d/b), and P(d/c). (Here P(d/a) is shorthand for the probability that D has value d given that A has value a.) Rather, the calculation considers the probabilities of d given all the possible combinations of values of the variables on which D depends. Consider the simple propositional case where the possible values of D are that it is true (d) or false (not-d). Pearl's algorithm requires knowing P(d/a & b ot-c), P(d/a & not-b & not-c) and five other conditional probabilities. The analogous conditional probabilities for not-d can be computed from the ones just given.

More generally, if D depends on *n* variables with *k* values
each, *k^n* conditional probabilities will be required for computation.
This raises two problems that Pearl discusses. First, if *n* is large,
the calculations become computational intractable, so that approximation
methods must be used. Probabilistic networks are nevertheless much more
attractive computationally than the general problem of computing probabilities,
since the threat of combinatorial explosion is localized to nodes which
we can hope to be dependent on a relatively small number of other nodes.
Second, even if *n* is not so large, there is the problem of obtaining
sensible conditional probabilities to plug into the calculations. Pearl
acknowledges that it is unreasonable to expect a human or other system to
store all this information about conditional probabilities, and shows how
it is sometimes possible to use simplified models of particular kinds of
causal interaction to avoid having to do many of the calculations that the
algorithms would normally require.

Table 1.1 summarizes the descriptions so far of ECHO and Pearl networks. The characterizations have been very terse, but for more detail the original sources can be consulted. We now have enough information to begin considering the relation between ECHO and Pearl networks.

To see why it is reasonable to consider translating ECHO networks into
Pearl networks, let us review some simple examples that illustrate ECHO's
capabilities (Thagard 1989, section 4). Given a choice between competing
hypotheses, ECHO prefers ones that explain more. A Pearl network can be
expected to show a similar preference, since the function will send more
support up to a value *v* of a variable from variables whose values
*vi* are known and where P(*vi/v*) is high. ECHO prefers hypotheses
that are explained to ones that are not, and a Pearl network will similarly
send down to a value of a variable support using the function. To determine
whether Pearl networks can duplicate other aspects of ECHO networks, we
will have to consider a possible translation in more detail.

A translation algorithm from ECHO networks to Pearl networks could take either of two forms. The most direct would be an immediate network-to-network translation. Every proposition node in the ECHO network would become a variable in a Pearl network, and every ECHO link would become a Pearl conditional probability between values of variables. This direct translation clearly fails, since ECHO's symmetric links would translate into 2-way conditional probabilities not allowed in Pearl networks. Moreover, the translation would produce many cycles that Pearl networks exclude by definition. (8) Alternatively we can bypass the creation of the ECHO network and simply use the input to ECHO directly to generate a Pearl network. We can then try to produce a program that will take ECHO's input and produce an Pearl network suitable for running Pearl's algorithms. Let us call this program PECHO.

First we must worry about creating the appropriate nodes. ECHO creates nodes when it is given input describing a proposition P. Analogously, PECHO would create a variable node with two values, TRUE and FALSE. At this point, PECHO would have to consult ECHO's input concerning contradictions, and check whether there is some proposition not-P that contradicts P. If so, there is no need to construct a new variable node, since not-P would simply be represented by the FALSE value of the variable node for which P represents the value TRUE. It becomes more complicated if there are several propositions in ECHO that all contradict each other, but these could all be amalgamated into one variable node with multiple values.

Since the vector representing the BEL values for a variable is required to sum to 1, PECHO will be able to duplicate ECHO's effect that the acceptability of a proposition counts against the acceptability of any proposition that contradicts it. Because we are not directly translating from ECHO networks to Pearl networks, we do not have to worry that ECHO activation values range from -1 to 1, rather than from 0 to 1 like probabilities; but in any case it is possible to normalize ECHO's activations into the same range as probabilities. To normalize ECHO in the simplest case, just add 1 to a unit's activation and divide the result by 2. If two units represent contradictory propositions, normalize the resulting values by multiplying each value by 1 divided by the sum of the values. When a proposition contradicts more than one other, the normalization becomes problematic unless the propositions in question are all mutually contradictory, as when they all represent different values of the same variable. In ECHO networks this is not always the case.

Simulation of Copernicus's case against Ptolemy (Nowak and Thagard 1992a) includes:

P6 The Earth is always at the center of the heavenly sphere.

P12 The sun moves eastward along a circle about the earth in one year.

C12 The sun is immobile at the center of the universe.

Here Ptolemy's propositions P6 and P12 each contradict Copernicus's C12, but they do not contradict each other.

ECHO uses the range [-1,1] for both conceptual and computational reasons.
Conceptually, ECHO is interpreted in terms of degrees of *acceptance*
(activation > 0) and degrees of *rejection* (activation < 0),
sharing with some expert systems the intuition that attitudes toward hypotheses
are better characterized in terms of accept/reject rather than degrees of
belief as probabilists hold (cf. Buchanan and Shortliffe 1984, ch. 10).
The computational reason is that activation updating in ECHO has the consequence
that if a hypothesis coheres with one that is deactivated, it will also
tend to be deactivated. Thus sets of hypotheses (theories) tend to be accepted
and rejected as wholes, as is usually the case in the history of science
(Thagard 1992b).

Now we get to the crucial question of links. When ECHO reads the input

`(EXPLAIN '(P1 P2 P3 P4) 'Q),`

it creates excitatory links between each of the explaining propositions and Q. PECHO correspondingly would note that the variable node whose TRUE value represents Q is causally dependent on the variable node whose TRUE value represents P1, and so on. More problematically, PECHO would have to contrive 4 conditional probabilities: P(Q/P1), P(Q/not-P1), P(not-Q/P1), and P(not-Q/not-P1). The first of these could perhaps be derived approximately from the weight that ECHO puts on the link between the nodes representing P1 and Q, and we could derive the third from the first since P(Q/P1) and P(not-Q/P1) sum to 1; but ECHO provides no guidance about the other two probabilities. In fact, the situation is much worse, since Pearl's algorithms actually require 32 different conditional probabilities, for example P(Q/P1 & not-P2 & P3 & not-P4). In the most complex ECHO network to date, modeling the case of Copernicus against Ptolemy (Nowak and Thagard 1992a), there are 143 propositions. A search through the units created by ECHO, counting the number of explainers, shows that the number of conditional probabilities that PECHO would need is 45,348. This is a big improvement over the 2^143 (more than 10^43) probabilities that a full distribution would require, but it is still daunting. Of the 45,348, only 469 could be directly derived from the weight on the ECHO link. Does it matter what these probabilities are? Perhaps PECHO could you give them all a simple default value and still perform as well as ECHO. We shall shortly see that in fact TEC requires some constraints on the conditional probabilities if PECHO is to duplicate ECHO's judgements.

A derivation of P(Q/P1) based on the ECHO link between Q and P1 would effectively implement TEC's simplicity principle, 2(c), since it would mean that in updating the Pearl network P1 would get less support from Q than it would if P1 explained Q without the help of other hypotheses. This is because ECHO makes the strength of such links inversely proportional to the number of hypotheses. PECHO is able to get by with a unidirectional link between the node for P and the node for Q, since the contribution of the lambda and pi functions to the BEL functions of the two nodes effectively spreads support in both directions.

The construction just described would enable PECHO to implement principles 2(a) and 2(c) of TEC, but what about 2(b), according to which P1, P2, P3 and P4 all cohere with each other? Here PECHO encounters serious difficulties. Explanatory coherence theory assumes that cohypotheses (hypotheses that participate together in an explanation) support each other, but this is impossible in Pearl networks which have to be acyclic. Here we see a deep difference in the fundamental assumptions of TEC and the probabilistic networks. Probabilistic networks gain their relative efficiency by making strong assumptions of independence. In contrast, TEC assumes that every proposition in a belief system can be affected by every other one, although the effects may be very indirect and small. To put it graph-theoretically, ECHO networks are strongly connected, by virtue of their symmetric links, but probabilistic networks with directed edges are emphatically not. Another aside: At first glance, ECHO networks might seem to be similar to the noisy or-gates for which Pearl (1988, 188ff.) provides an efficient method. However, his method requires assumptions not appropriate for ECHO, such as that an event be presumed false if all conditions listed as its causes are false.

Pearl networks can, however, get the effects of excitatory links between cohypotheses by means of the clustering methods that are used to eliminate loops. We saw in the last section that Pearl considers collapsing competing nodes into single nodes with multiple values. If H1 and H2 together explain E, then instead of creating separate variable nodes for H1 and H2, PECHO would create a variable node <H1-H2> with values representing H1 & H2, H1 & not-H2, not-H1 & H2, and not-H1 & not-H2. To implement TEC's principle 2(b) that establishes coherence between H1 and H2, PECHO would have to ensure that it has conditional probabilities such that P(E/H1 & H2) is greater than either P(E/H1 & not-H2) and P(E/not-H1 & H2). (I am simplifying the representation here: in the Pearl network, by P(E/H1 & H2) I mean the probability that variable E takes the value TRUE given that variable <H1-H2> takes the value <H1 & H2>.)

A similar method should enable PECHO to deal with the inhibitory links required by ECHO to implement TEC's Principle 6, Competition. We saw that PECHO can handle contradictions by constructing complex variables, but it has no direct way of having a negative impact of one hypothesis on another when they are competing to explain a piece of evidence. The clustering technique mentioned in the last section shows how this can be done. If an ECHO network has H1 and H2 independently explaining E, PECHO will have to replace variable nodes for H1 and H2 by a combined node with values for H1 & H2, H1 & not-H2, not-H1 & H2, and not-H1 & not-H2. PECHO can enforce competition between H1 and H2 by requiring that P(E/H1 & H2) be less than either P(E/H1 & not-H2) or P(E/not-H1 & H2). (9)

The clustering situation gets much more complicated, since H1 may compete
with other hypotheses besides H2. In the Copernicus versus Ptolemy simulation,
ECHO finds 214 pairs of competing hypotheses, and there is an important
Copernican hypothesis that competes with more than 20 Ptolemaic hypotheses.
In general, if a proposition P has *n* hypotheses participating in
explaining it, either in cooperation or in competition with each other,
then a clustered variable node with 2^*n* values will have to be created.
For example, in the Copernicus versus Ptolemy simulation, there are pieces
of evidence explained by 5 Ptolemaic hypotheses working together and by
5 Copernican hypotheses. To handle both cohypotheses supporting each other
and competition between conflicting hypotheses, PECHO would need to have
a single node with 1024 values, in place of the 10 nodes corresponding to
ECHO units. In fact, the node would have to be still more complicated because
these 10 hypotheses compete and cohere with many others because they participate
in additional explanations. Typically, the set of hypotheses formed by collecting
all those that either compete with or co-explain with some member of the
set will be virtually all the hypotheses that there are. In the Copernicus
simulation, there are 80 explaining hypotheses, including Ptolemaic ones,
and search shows that each is connected to every other by chains of coherence
or competition. We would then require a single hypothesis node with 2^80
values, more than the number of milliseconds in a billion years. Thus dealing
with competition and cohypotheses in Pearl networks can be combinatorially
disastrous, although there may be more efficient method of clustering. Pearl
1988 (p. 201) considers an alternative method akin to that of Lauritzen
and Spiegelharter (1988) in which the nodes would represent cliques in an
ECHO network such as the cluster of H1, H2 and E. (10) In the Copernicus
simulation, there are more than 4000 such cliques, so the reconstituted
Pearl network would be much larger than the original. More importantly,
it is not at all clear how to assign conditional probabilities in ways that
yield the desired results concerning cohypotheses and competitors. Thus
in principle ECHO input can be used to drive a Pearl network, but in practice
the computational obstacles are formidable.

The input to ECHO also includes information about data and analogies. PECHO could implement something like TEC Principle 6, Data Priority, by special treatment of variable nodes corresponding to evidence propositions in ECHO. It would be a mistake to instantiate an evidence variable node with a value (1 0) since that would not allow the possibility that the evidence is mistaken. (ECHO can reject evidence if it does not fit with accepted hypotheses.) Pearl (1988, p. 170) provides a method of dummy variables that allows a node to represent virtual evidence, an effective solution if updating can lead to low BEL values for such nodes. As for analogy, there is no direct way in a Bayesian network in which a hypothesis H1 can support an analogous one H2, but once again dummy nodes might be constructed to favor the hypothesis in question. Analogy is normally used to support a contested hypothesis by analogy with an established one, so little is lost if there is no symmetric link and the contested one is simply viewed as being slightly dependent on the established one. Analogy is thus viewed as contributing to the prior probability of the contested hypothesis.

In sum, the theory of explanatory coherence that is implemented in ECHO by connectionist networks can also be implemented by probabilistic networks. There is, however, a high computational cost associated with the alternative implementation. A massively greater amount of information in the form of conditional probabilities is needed to run the algorithms for updating probabilistic networks, and the problem of creating a probabilistic network is non-trivial: reconstruction is required to avoid loops, and care must be taken to retain information about cohypotheses and competitors. Combinatorial explosions must be avoided. Hence while ECHO's connectionist networks can be abstractly viewed in probabilistic terms, there are potentially great practical gains to be had by not abandoning the explanationist approach for the apparently more general probabilist one. Practically, the probabilist approach must use the explanationist one for guidance in assessing probabilities. We saw that consideration of cohypotheses and competitors puts constraints on the conditional probabilities allowable in probabilistic networks, and explanatory coherence theory can also contribute to setting prior probabilities. One can also think of the principles of analogy and data priority as giving advice on how to set prior probabilities. How far can we do with ECHO alone?

If ECHO is to qualify as an alternative to probabilistic networks, it must be able to handle cases viewed as prototypical from the probabilist perspective. Consider Pearl's (1988, p. 49) example of Mr. Holmes at work trying to decide whether to rush home because his neighbor Mr. Watson, a practical joker, has called to say that his alarm at home has sounded. If the alarm has sounded, it may be because of a burglary or because of an earthquake. If he hears a radio report of an earthquake, his degree of confidence that there was a burglarly will diminish. Appropriate input to ECHO would be:

`(EXPLAIN '(BURGLARY) 'ALARM)`

`(EXPLAIN '(EARTHQUAKE) 'ALARM)`

`(EXPLAIN '(EARTHQUAKE) 'RADIO-REPORT)`

`(EXPLAIN '(ALARM) 'WATSON-CALLED)`

`(EXPLAIN '(JOKE NO-ALARM) 'WATSON-CALLED)`

`(CONTRADICT 'ALARM 'NO-ALARM)`

`(DATA '(WATSON-CALLED RADIO-REPORT))`

The network created by ECHO using this input is shown in figure 1.6. Note that in implementing the principle of competition ECHO automatically places inhibitory links between BURGLARY and EARTHQUAKE, and between ALARM and JOKE. Given the above input, ECHO reaches the conclusion that there was an earthquake rather than a burglary.

(Graphic not available)

**Figure 1.6** ECHO network created for Pearl's burglary example for
input given in the text. Solid lines indicate excitatory links, while dotted
lines indicate inhibitory ones.

This simple qualitative information may give misleading results in cases where statistical information is available. Suppose Holmes knows that burglaries almost always set off his alarm, but earthquakes do so only rarely. ECHO need not assume that every explanation is equally good, but allows the input to include an indicator of the strength of the explanation. We could, for example, alter the above input to include the statements:

`(EXPLAIN '(BURGLARY) 'ALARM .8)`

`(EXPLAIN '(EARTHQUAKE) 'ALARM .1)`

This has the effect of making the excitatory link between BURGLARY and ALARM nine times stronger than the link between EARTHQUAKE and ALARM, so, other things being equal, ECHO will prefer the burglary hypothesis to the earthquake hypothesis.

Statistical information that provides prior probabilities can be used in similar ways. Suppose that an alarm is as likely given a burglary as given an earthquake, but Mr. Holmes knows that in his neighborhood burglaries are far more common than earthquakes. Without the radio report of the earthquake, Holmes should prefer the burglary hypothesis to the earthquake hypothesis. In Bayesian terms, the burglary base rate is higher. ECHO can implement consideration of such prior probabilities by assuming that the base rates provide a statistical explanation of the occurrences (cf. Harman 1986, p. 70). The base rates can be viewed has hypotheses that themselves explain statistical information that has been collected. We could thus have the additional input to ECHO:

`(EXPLAIN '(BURGLARY-RATE) 'BURGLARY .1)`

`(EXPLAIN '(BURGLARY-RATE) 'BURGLARY-STATISTICS)`

`(EXPLAIN '(EARTHQUAKE-RATE) 'EARTHQUAKE-STATISTICS)`

`(EXPLAIN '(EARTHQUAKE-RATE) 'EARTHQUAKE .01)`

`(DATA '(BURGLARY-STATISTICS EARTHQUAKE-STATISTICS))`

The network constructed by ECHO is shown in figure 1.7. Similar cases in which prior probabilities need to be taken into account often arise in medical diagnosis. Medical students are cautioned to prefer routine diagnoses to exotic ones with the adage: when you hear hoofbeats, think horses, not zebras.

(Graphic not available)

**Figure 1.7 **Enhanced ECHO network for burglary example including
statistical explanations.

ECHO is also capable of handling the cancer example (figure~\ref{cancer}) using the prior and conditional probabilities provided by Pearl (1988, p. 197). Of course, ECHO's final activations are not exactly equivalent to the final probabilities that Pearl calculates, but without recourse to clustering methods ECHO gets results that are qualitatively very similar. ECHO strongly accepts just the propositions to which Pearl's calculation gives high probability, and strongly rejects just the propositions to which he gives low probability.

ECHO is thus capable of using probabilistic information when it is available, but does not require it. There may well be cases in which a full probability distribution is known and ECHO can be shown to give a defective answer because activation adjustment does not exactly mirror calculation of posterior probabilities. In such cases where there are few variables and the relevant probabilities are known, it is unnecessary to muddy the clear probabilistic waters with explanatory coherence considerations. But in most real cases found in science, law, medicine, and ordinary life, the explanationist will not be open to the charge of being probabilistically incoherent, since the probabilities are sparsely available and calculating them is computationally unfeasible. What matters then are the qualitative considerations that explanatory coherence theory takes into account, and probabilities are at best epiphenomenal. I shall not discuss here the argument that probabilistic approaches are preferable because they provide a clear semantics for numerical assessment of hypotheses. (11)

At the most general level, this paper can be understood as offering a reconciliation of explanationism and probabilism. ECHO, the most detailed and comprehensive explanationist model to date, has a probabilistic interpretation. This interpretation should make explanatory coherence theory more respectable to probabilists, who should also appreciate how explanatory coherence issues such as data priority, analogy, cohypotheses, and competition place constraints on probability values.

At a more local level, however, it is an open question whether explanationist or probabilist accounts are superior. Local disputes can be epistemological, psychological or technological. If one accepts the view of Goldman (1986) that power and speed are epistemological goals as well as reliability, then explanationist models can be viewed as desirable ways of proceeding apace with causal inferences while probabilistic models are still lost in computation. Similarly, the computational cost associated with the probabilistic interpretation of explanatory coherence suggest that such models may be inappropriate as models of human psychology. ECHO and probabilistic networks can be compared with respect to human performance, with the latter apparently predicting that people should be much slower at inference tasks that require the most work to translate into probabilistic terms. We saw such cases arise when there are cohypotheses and competitors. ECHO takes such complications in stride, whereas Pearl networks require computations to realign networks and many more conditional probabilities to handle such cases. It should therefore be possible to present people with examples of varying complexity and determine whether their reasoning ability declines rapidly as the complexity of probabilistic computations suggest.

Similarly, for technological applications in expert systems, ECHO may perform better than probabilistic networks. If rich probabilistic information is not generally available, and if the domains are complex enough with cohypotheses and competitors, then ECHO may be more effective than probabilistic cases. The issue must be decided on a local basis, application by application, just as the psychological issue depends on experiments that have not yet been done. My conjecture is that the psychological and technological applicability of explanationist and probabilist techniques will vary from domain to domain, with the following approximate ordering from most appropriate for explanationist to most appropriate for probabilist approaches: social reasoning, scientific reasoning, legal reasoning, medical diagnosis, fault diagnosis, games of chance. The psychological and technological answers need not be the same: diagnosis may well be an enterprise where a non-psychological probabilistic approach can bring technological gains.

Since the theory of explanatory coherence has a probabilistic interpretation, albeit a computationally expensive one, the analysis in this paper suggests that probabilism reigns supreme as the epistemology of Eternal Beings. But explanationism survives as epistemology for the rest of us.

1. Thanks to Gilbert Harman, Steve Kimbrough, and Steve Roehrig for very helpful discussions, and to Eugene Charniak for suggesting this comparison. Michael Ranney and Patricia Schank provided useful comments on an earlier draft. Thanks to Cameron Shelley for editorial assistance. This research is supported by the National Science and Engineering Research Council of Canada.

2. One important recent contribution is Peng and Reggia (1990), which I review in Thagard (1990).

3. In the original statement of TEC in Thagard (1989), Principle 3 included a second clause concerning disanalogies that is not included here because it lacks interesting applications. The old principle 7, system coherence, has similarly been deleted because it does little to illuminate actual cases. A new principle 6, Competition, has been added to cover cases where non-contradictory hypotheses compete with each other (Thagard 1991). The old principle 6, Acceptability, becomes the new principle 7.

4. The next 11 paragraphs are adapted from Thagard (1992b), pp. 66-69, with permission of Princeton University Press.

5. Settling time is also unaffected by number of links. The total time for ALL 15 runs on a Sun 4 workstation is less than two minutes, if graphics and detailed screen output are not used.

6. Hence probabilists are often called Bayesians.

7. Note for non-computer-scientists. A problem belongs in class P if it can be solved by a deterministic automaton in polynomial time. A problem belongs in class NP if it can be solved by a non-deterministic automaton in polynomial time. Here ``non-deterministic'' means that given an automaton in a particular state and given a particular input, the automaton may reach more than one state. Anything that a non-deterministic automaton can do can be done by a deterministic one, but the non-deterministic automaton may require an exponentially increasing number of steps to do it. Because of the rapid growth of exponential functions, problems requiring an exponential number of steps are said to be intractable. A problem is NP-complete if it is in NP and all other problems in NP are reducible to it. A problem is NP-hard if there is some NP-complete problem that is reducible to it. A problem that is NP-complete or NP-hard cannot be solved in polynomial time unless P=NP, which is widely believed to be false. See Garey and Johnson (1979) for more careful and complete definitions. Some problems involving computing the best explanation non-probabilistically have also been shown to be NP-hard (Bylander 1990).

8. Clarification: in Pearl's terminology, all cycles are loops, but not
all loops are cycles. *A right arrow B right arrow C right arrow A*
is a cycle, because the direction of the path is maintained. However, *A
right arrow B right arrow C left arrow A* is a loop, since for loops
direction is ignored, but it is not a cycle.

9. In very simple situations, it is possible to enforce competition in probabilistic networks without using clustering. Pearl describes how the effect of one cause ``explaining away'' another can be modeled in singly connected networks. Often, however, in the examples to which ECHO has been applied, two hypotheses compete to explain more than one piece of evidence. Units representing those hypotheses are therefore connected by two different paths, and clustering or some other technique will be necessary to translate the network into one not multiply connected.

10. Cliques are subgraphs whose nodes are all adjacent to each other.

11. While probability theory undoubtedly has a clear syntax, the meaning or meanings of probability is an unsolved problem. All the available interpretations, in terms of frequencies, propensities, degrees of belief, and possible worlds, can be challenged. Cohen (1989) provides a concise review.

Allen, R. (1991). The nature of juridical proof. Cardozo Law Review, 13, 373--422.

Aristotle (1984). The complete works of Aristotle. Ed. J. Barnes. Princeton: Princeton University Press.

Blake, R. (1960). Theory of hypothesis among renaissance astronomers. In R. Blake, C. Ducasse, and E.H. Madden (eds.), Theories of scientific method. Seattle: University of Washington Press, pp. 22--49.

Buchanan, B., and Shortliffe, E. (Eds.), (1984). Rule-based expert systems. Reading, MA: Addison Wesley.

Bylander, T. (1990). When efficient assembly performs correct abduction and why abduction is otherwise trivial or intractable. Paper presented at AAAI Spring Symposium on Automated Abduction.

Carnap, R. (1950). Logical foundations of probability. Chicago: University of Chicago Press.

Cohen, L.J. (1977). The probable and the provable. Oxford: Clarendon Press.

Cohen, L.J. (1989). An introduction to the philosophy of induction and probability. Oxford: Clarendon.

Cooper, G. (1990). The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42, 393--405.

Garey, M., and Johnson, D. (1979). Computers and intractability. New York: Freeman.

Goldman, A. (1986). Epistemology and cognition. Cambridge, MA: Harvard University Press.

Goldman, R., and Charniak, E. (1990). Incremental construction of probabilistic models for language abduction: Work in progress. Paper presented at AAAI Spring Symposium on Automated Abduction.

Hacking, I. (1975). The emergence of probability. Cambridge: Cambridge University Press.

Harman, G. (1973). Thought. Princeton: Princeton University Press.

Harman, G. (1986). Change in view: Principles of reasoning. Cambridge, MA: MIT Press/Bradford Books.

Hesse, M. (1974). The structure of scientific inference. Berkeley: University of California Press.

Horwich, P. (1982). Probability and evidence. Cambridge: Cambridge University Press.

Howson, C., and Urbach, P. (1989). Scientific reasoning: The Bayesian tradition. Lasalle, IL: Open Court.

Hrycej, T. (1990). Gibbs sampling in Bayesian networks. Artificial Intelligence, 46, 351--363.

Jeffrey, R. (1983). The logic of decision. Second edition. Chicago: University of Chicago Press. First published 1965.

Josephson, J, Chandrasekaran, B., Smith, J., and Tanner, M. (1987) A mechanism for forming composite explanatory hypotheses. IEEE Transactions on Systems, Man, and Cybernetics, 17: 445--454.

Keynes, J.M. (1921). A treatise on probability. London: Macmillan.

Kyburg, H. (1983). Epistemology and inference. Minneapolis: University of Minnesota Press.

Laudan, L. (1981). Science and hypothesis. Dordrecht: Reidel.

Lauritzen, S., and Spiegelharter, D. (1988). Local computation with probabilities in graphical structures and

their applications to expert systems. Journal of the Royal Statistical Society B, 50.

Lempert, R. (1986). The new evidence scholarship: analyzing the process of proof. Boston University Law Review, 66, 439--477.

Levi, I. (1980). The enterprise of knowledge. Cambridge, MA: MIT Press.

Lycan, W. (1988) Judgement and justification. Cambridge: Cambridge University Press.

Neapolitain, R. (1990). Probabilistic reasoning in expert systems. New York: John Wiley.

Ng, H., and Mooney, R. (1990). The role of coherence in constructing and evaluating abductive explanations.

Paper presented at AAAI Spring Symposium on Automated Abduction.

Nowak, G., and Thagard, P., (1992a). Copernicus, Ptolemy, and explanatory coherence. In R. Giere (Ed.), Cognitive Models of Science, Minnesota Studies in the Philosophy of Science, vol. 15. Minneapolis: University of Minnesota Press, 274-309.

Nowak, G., and Thagard, P., (1992b). Newton, Descartes, and explanatory coherence. In R. Duschl and R. Hamilton (Eds)., Philosophy of science, cognitive psychology and educational theory and practice, Albany: SUNY Press, 69-115.

O'Rorke, P., Morris, S., and Schulenburg, D., (1990). Theory formation by abduction: A case study based on the chemical revolution. In J. Shrager and P. Langley (Eds.), Computational models of discovery and theory formation. San Mateo, CA: Morgan Kaufman. 197--224.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Mateo: Morgan Kaufman.

Peirce, C. (1931--1958). Collected papers. 8 vols. Edited by C. Hartshorne, P. Weiss, and A. Burks.

Cambridge, MA: Harvard University Press.

Peng, Y. and Reggia, J. (1990). Abductive inference models for diagnostic problem solving. New York: Springer-Verlag.

Pennington, N., and Hastie, R. (1986). Evidence evaluation in complex decision making. Journal of Personality and Social Psychology 51:242--258.

Rajamoney, S. (1990). A computational approach to theory revision. In J. Shrager and P. Langley (Eds.), Computational models of discovery and theory formation. San Mateo, CA: Morgan Kaufman. 225--253.

Ram, A. (1990). Decision models: A theory of volitional explanation. Proceedings of the Twelfth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum. 198--205.

Thagard, P. (1978). The best explanation: Criteria for theory choice. Journal of Philosophy, 75, 76--92.

Thagard, P. (1988). Computational philosophy of science. Cambridge, MA: MIT Press/Bradford Books.

Thagard, P. (1989). Explanatory coherence. Behavioral and Brain Sciences, 12, 435--467.

Thagard, P. (1990). Review of Y. Peng and J. Reggia, Abductive Inference Models for Diagnostic Problem Solving, SIGART Bulletin, 2, no. 1, 72--75.

Thagard, P. (1991) The dinosaur debate: Explanatory coherence and the problem of competing hypotheses. In J. Pollock and R. Cummins (Eds.), Philosophy and AI: Essays at the interface. MIT Press/Bradford Books, 279--300.

Thagard, P. (1992a). Adversarial problem solving: Modeling an opponent using explanatory coherence. Cognitive Science, 16, 123--149.

Thagard, P. (1992b). Conceptual revolutions. Princeton Unversity Press.

Thagard, P., Holyoak, K., Nelson, G., and Gochfeld, D. (1990). Analog retrieval by constraint satisfaction. Artificial Intelligence, 46, 259--310.

Thagard, P., and Nowak, G. (1988). The explanatory coherence of continental drift. In A. Fine and J. Leplin (Eds.), PSA 1988, vol. 1. East Lansing, Mich.: Philosophy of Science Association, 118--126.

Thagard, P., and Nowak, G. (1990). The conceptual structure of the geological revolution. In J. Shrager and P. Langley (Eds.), Computational models of discovery and theory formation. San Mateo, CA: Morgan Kaufman. 27--72.

Whewell, W. (1967). The philosophy of the inductive sciences. New York: Johnson Reprint Corp. Originally published 1840.