A relation between log-likelihood and cross-validation log-scores

It is shown that the log-likelihood of a hypothesis or model given some data is equivalent to an average of all leave-one-out cross-validation log-scores that can be calculated from all subsets of the data. This relation can be generalized to any $k$-fold cross-validation log-scores.

Note: Dear Reader & Peer, this manuscript is being peer-reviewed by you. Thank you.

Log-likelihoods and cross-validation log-scores
The probability calculus unequivocally tells us how our degree of belief in a hypothesis $H_h$ given data $D$ and background information or assumptions $I$, that is, $P(H_h \mid D\, I)$, is related to our degree of belief in observing those data when we entertain that hypothesis as true, that is, $P(D \mid H_h\, I)$:

$$
P(H_h \mid D\, I) = \frac{P(D \mid H_h\, I)\, P(H_h \mid I)}{P(D \mid I)} \tag{1a}
$$

$$
\phantom{P(H_h \mid D\, I)} = \frac{P(D \mid H_h\, I)\, P(H_h \mid I)}{\sum_{h'} P(D \mid H_{h'}\, I)\, P(H_{h'} \mid I)} . \tag{1b}
$$

$D$, $H_h$, $I$ denote propositions, which are usually about numeric quantities. I use the terms 'degree of belief', 'belief', and 'probability' as synonyms. By 'hypothesis' I mean either a scientific (physical, biological, etc.) hypothesis (a state or development of things capable of experimental verification, at least in a thought experiment) or more generally some proposition, often not precisely specified, which leads to quantitatively specific distributions of beliefs for any contemplated data set. In the latter case we often call $H_h$ a '(probabilistic) model' rather than a 'hypothesis'.

Expression (1b) assumes that we have a set $\{H_h\}$ of mutually exclusive and exhaustive hypotheses under consideration, which is implicit in our knowledge $I$. In fact it's only valid if

$$
P\Bigl(\bigvee_h H_h \Bigm| I\Bigr) = 1, \qquad P(H_h\, H_{h'} \mid I) = 0 \quad \text{for } h \neq h' . \tag{2}
$$

Only rarely does the set of hypotheses $\{H_h\}$ encompass and reflect the extremely complex and fuzzy hypotheses lying in the backs of our minds. They're simplified pictures. That's also why they're called 'models'. Expression (1a), by contrast, is universally valid, but it's rarely possible to quantify its denominator $P(D \mid I)$ unless we simplify our inferential problem by introducing a possibly unrealistic exhaustive set of hypotheses, thus falling back to (1b). We can bypass this problem if we are content with comparing our beliefs about any two hypotheses through their ratio, so that the term $P(D \mid I)$ cancels out. See Jaynes's [1] insightful remarks about such binary comparisons, and also Good's [2].
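As a minimal numerical sketch of expression (1b), and of the ratio comparison in which $P(D \mid I)$ cancels out, the snippet below uses made-up pre-data beliefs and likelihoods for three hypothetical hypotheses. The function names and the numbers are illustrative assumptions, not taken from the text.

```python
def posteriors(priors, likelihoods):
    """P(H_h | D I) for a mutually exclusive and exhaustive set, as in (1b)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(D | I): the denominator of (1b)
    return [j / evidence for j in joint]


def posterior_ratio(priors, likelihoods, a, b):
    """P(H_a | D I) / P(H_b | D I); P(D | I) cancels and is never computed."""
    return (likelihoods[a] * priors[a]) / (likelihoods[b] * priors[b])


priors = [0.5, 0.3, 0.2]       # hypothetical P(H_h | I), summing to 1
likelihoods = [0.1, 0.4, 0.2]  # hypothetical P(D | H_h I)

post = posteriors(priors, likelihoods)
ratio_10 = posterior_ratio(priors, likelihoods, 1, 0)
```

The ratio computed directly agrees with the ratio of the two normalized posteriors, which is the point of the binary comparison: the exhaustive set is needed only for the normalization.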
The term $P(D \mid H_h\, I)$ in eq. (1) is called the likelihood of the hypothesis given the data [3]. Its logarithm is surprisingly called log-likelihood:

$$
\log P(D \mid H_h\, I) , \tag{3}
$$

where the logarithm can be taken in an arbitrary base (Turing, Good [4], Jaynes [5] recommend base $10^{1/10}$, leading to a measurement in decibels; see the cited works for the practical advantages of such a choice). The ratio of the likelihoods of two hypotheses, called the relative Bayes factor, or its logarithm, the relative weight of evidence [6], are often used to quantify how much the data favour our belief in one versus the other hypothesis (that is, assuming at least momentarily that they be exhaustive). 'It is historically interesting that the expression "weight of evidence", in its technical sense, anticipated the term "likelihood" by over forty years' [7].
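A short sketch of the decibel convention: a logarithm in base $10^{1/10}$ is simply ten times the base-10 logarithm. The function names and the two likelihood values below are hypothetical, chosen only for illustration.

```python
import math


def decibels(x):
    """Logarithm of x in base 10**(1/10), i.e. 10 * log10(x)."""
    return 10.0 * math.log10(x)


def relative_weight_of_evidence_db(lik_a, lik_b):
    """Log of the relative Bayes factor P(D | H_a I) / P(D | H_b I), in dB."""
    return decibels(lik_a / lik_b)


# Hypothetical likelihoods: the data are four times more probable under H_a.
w_db = relative_weight_of_evidence_db(0.4, 0.1)
```

With these numbers the relative weight of evidence is about 6 dB in favour of the first hypothesis.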
Recent literature [8] seems to deal exclusively with relative Bayes factors. I'd like to recall, lest it fade from memory, the definition of the non-relative Bayes factor for a hypothesis $H_h$ provided by data $D$ [9]:

$$
\frac{O(H_h \mid D\, I)}{O(H_h \mid I)} = \frac{P(D \mid H_h\, I)\, \bigl[1 - P(H_h \mid I)\bigr]}{\sum_{h' \neq h} P(D \mid H_{h'}\, I)\, P(H_{h'} \mid I)} , \tag{4}
$$

where the odds $O$ is defined as $O := P/(1 - P)$. Looking at the expression on the right, which can be derived from the probability rules, it's clear that the Bayes factor for a hypothesis involves the likelihoods of all other hypotheses as well as their pre-data probabilities. This quantity and its logarithm, the (non-relative) weight of evidence, have important properties which relative Bayes factors and relative weights of evidence don't enjoy. For example, the expected weight of evidence for a correct hypothesis is always positive, and for a wrong hypothesis always negative [10]. See Jaynes [11] for further discussion and a numeric example.

[3-5] Good 1985, 1950, 1969; Jaynes 2003 § 4.2.
[6] Good 1950 ch. 6, 1975, 1981, 1985, and many other works in Good 1983; Osteyee et al. 1974 § 1.4; MacKay 1992; Kass et al. 1995; see also Jeffreys 1983 chs V, VI, A.
[7] Osteyee et al. 1974 § 1.4.2 p. 12.
[8] For example Kass et al. 1995.
[9] Good 1981 § 2.
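The following sketch computes the non-relative Bayes factor in two ways, assuming the usual definition as posterior odds divided by pre-data odds with $O := P/(1-P)$: once directly from the odds, and once from the likelihoods and pre-data probabilities of all hypotheses. The numbers and function names are hypothetical.

```python
def odds(p):
    """Odds O := P / (1 - P)."""
    return p / (1.0 - p)


def bayes_factor_from_odds(h, priors, likelihoods):
    """Posterior odds of H_h divided by its pre-data odds."""
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    posterior = priors[h] * likelihoods[h] / evidence
    return odds(posterior) / odds(priors[h])


def bayes_factor_explicit(h, priors, likelihoods):
    """Same quantity, written with the other hypotheses' likelihoods and priors."""
    others = sum(p * l
                 for i, (p, l) in enumerate(zip(priors, likelihoods))
                 if i != h)
    return likelihoods[h] * (1.0 - priors[h]) / others


priors = [0.5, 0.3, 0.2]       # hypothetical P(H_h | I)
likelihoods = [0.1, 0.4, 0.2]  # hypothetical P(D | H_h I)

bf_odds = bayes_factor_from_odds(0, priors, likelihoods)
bf_explicit = bayes_factor_explicit(0, priors, likelihoods)
```

Changing the pre-data probabilities of the other two hypotheses while keeping their sum fixed changes the result, which illustrates the dependence noted in the text.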
The literature in probability and statistics has also employed and debated other ad-hoc measures to quantify how the data relate to the hypotheses, or even to select one hypothesis for further use, discarding the others [12]. Here I consider one measure in particular: the leave-one-out cross-validation log-score [12], which I'll just call 'log-score' for brevity:

$$
\frac{1}{d} \sum_{i=1}^{d} \log P(D_i \mid D_{-i}\, H_h\, I) , \tag{5}
$$

where every $D_i$ is one datum in the data $D \equiv \bigwedge_{i=1}^{d} D_i$, and $D_{-i}$ denotes the data with datum $D_i$ excluded. The intuition behind this score can be colloquially expressed thus: 'let's see what my belief in one datum would be, on average, once I've observed the other data, if I consider $H_h$ as true'. 'On average' means considering such belief for every single datum in turn, and then taking the geometric mean of the resulting beliefs. Other variants of this score use more general partitions of the data into two disjoint subsets [12].
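A minimal sketch of the leave-one-out log-score for one concrete, assumed model: exchangeable binary data with a uniform prior on the underlying frequency, for which the joint probability of a sequence with $s$ ones out of $n$ is $s!\,(n-s)!/(n+1)!$. The model and the function names are illustrative choices, not the author's.

```python
import math


def log_joint(values):
    """log P(values | H I) under the assumed model (uniform prior on the
    frequency of ones): log[s! (n - s)! / (n + 1)!]."""
    n, s = len(values), sum(values)
    return math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2)


def loo_log_score(data):
    """Average over i of log P(D_i | D_{-i} H I)."""
    d = len(data)
    total = 0.0
    for i in range(d):
        rest = data[:i] + data[i + 1:]
        # Product rule: log P(D_i | D_{-i}) = log P(D) - log P(D_{-i}).
        total += log_joint(data) - log_joint(rest)
    return total / d


data = [1, 1, 0]
score = loo_log_score(data)
```

For these three data the conditional beliefs are 1/2, 1/2 and 1/4, so the score is the natural logarithm of their geometric mean, $\tfrac{1}{3}\ln\tfrac{1}{16}$.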
If you find this you can claim a postcard from me. My purpose is to show an exact relation between the log-likelihood (3) and the leave-one-out cross-validation log-score (5). This relation doesn't seem to appear in the literature, and I find it very intriguing because it portrays the log-likelihood as a sort of full-scale use of the log-score: it says that the log-likelihood is the sum of all averaged log-scores that can be formed from all data subsets. The relation can be extended to more general cross-validation log-scores, and it can be of interest for the debate about the soundness of log-scores in deciding among hypotheses.

A relation between log-likelihood and log-score
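We can obviously write the likelihood as the $d$th root of its $d$th power:

$$
P(D \mid H\, I) \equiv \bigl[\, \underbrace{P(D \mid H\, I) \times \dotsb \times P(D \mid H\, I)}_{d \text{ times}} \,\bigr]^{1/d} , \tag{6}
$$

where we have dropped the subscript $h$ for simplicity. By the rules of probability we have

$$
P(D \mid H\, I) = P(D_i \mid D_{-i}\, H\, I) \times P(D_{-i} \mid H\, I) \tag{7}
$$

no matter which specific $i \in \{1, \dotsc, d\}$ we choose (temporal ordering and similar matters are completely irrelevant in the formula above: it's a logical relation between propositions). So let's expand each of the $d$ factors in the identity (6) using the product rule (7), using a different $i$ for each of them. The result can be displayed thus:

$$
P(D \mid H\, I) \equiv \left[\;
\begin{array}{ll}
P(D_1 \mid D_{-1}\, H\, I) & {}\times P(D_{-1} \mid H\, I) \times {} \\
P(D_2 \mid D_{-2}\, H\, I) & {}\times P(D_{-2} \mid H\, I) \times {} \\
\qquad\vdots & \qquad\vdots \\
P(D_d \mid D_{-d}\, H\, I) & {}\times P(D_{-d} \mid H\, I)
\end{array}
\;\right]^{1/d} \tag{8}
$$

The left column is the one that leads to the log-score. Upon taking the logarithm of this expression, the $d$ factors vertically aligned on the left add up to the log-score (5), as indicated. But the mathematical reshaping we just did for $P(D \mid H\, I)$, that is, the root-product identity (6) and the expansion (8), can be done for each of the remaining factors $P(D_{-i} \mid H\, I)$ vertically aligned on the right in the expression above; and so on recursively. Here is an explicit example for $d = 3$:

$$
P(D \mid H\, I) \equiv \left[\;
\begin{array}{lll}
P(D_1 \mid D_2\, D_3\, H\, I) & {}\times \bigl[\, P(D_2 \mid D_3\, H\, I) & {}\times P(D_3 \mid H\, I) \times {} \\
 & \phantom{{}\times \bigl[\,} P(D_3 \mid D_2\, H\, I) & {}\times P(D_2 \mid H\, I) \,\bigr]^{1/2} \times {} \\
P(D_2 \mid D_1\, D_3\, H\, I) & {}\times \bigl[\, P(D_1 \mid D_3\, H\, I) & {}\times P(D_3 \mid H\, I) \times {} \\
 & \phantom{{}\times \bigl[\,} P(D_3 \mid D_1\, H\, I) & {}\times P(D_1 \mid H\, I) \,\bigr]^{1/2} \times {} \\
P(D_3 \mid D_1\, D_2\, H\, I) & {}\times \bigl[\, P(D_1 \mid D_2\, H\, I) & {}\times P(D_2 \mid H\, I) \times {} \\
 & \phantom{{}\times \bigl[\,} P(D_2 \mid D_1\, H\, I) & {}\times P(D_1 \mid H\, I) \,\bigr]^{1/2}
\end{array}
\;\right]^{1/3}
$$

In this example the logarithm of the three vertically aligned factors in the left column is, as already noted, the log-score (5). The logarithm of the six vertically aligned factors in the central column is an average of the log-scores calculated for the three distinct subsets of pairs of data:

$$
\frac{1}{3} \Bigl\{ \tfrac{1}{2} \bigl[ \log P(D_2 \mid D_3\, H\, I) + \log P(D_3 \mid D_2\, H\, I) \bigr]
+ \tfrac{1}{2} \bigl[ \log P(D_1 \mid D_3\, H\, I) + \log P(D_3 \mid D_1\, H\, I) \bigr]
+ \tfrac{1}{2} \bigl[ \log P(D_1 \mid D_2\, H\, I) + \log P(D_2 \mid D_1\, H\, I) \bigr] \Bigr\} .
$$

Likewise, the logarithm of the six factors vertically aligned on the right is the average of the log-scores for the three subsets of data singletons:

$$
\frac{1}{3} \bigl[ \log P(D_1 \mid H\, I) + \log P(D_2 \mid H\, I) + \log P(D_3 \mid H\, I) \bigr] .
$$

In the general case with $d$ data there are $\binom{d}{k}$ subsets with $k$ data points. We therefore obtain

$$
\log P(D \mid H\, I) \equiv \frac{1}{d} \sum_{i=1}^{d} \log P(D_i \mid D_{-i}\, H\, I)
+ \binom{d}{d-1}^{-1} \sum_{i=1}^{d} \frac{1}{d-1} \sum_{j \neq i} \log P(D_j \mid D_{-i,-j}\, H\, I)
+ \dotsb
+ \binom{d}{1}^{-1} \sum_{i=1}^{d} \log P(D_i \mid H\, I) ,
$$

where $D_{-i,-j}$ denotes the data with $D_i$ and $D_j$ excluded. This can be compactly written

$$
\log P(D \mid H\, I) \equiv \sum_{k=1}^{d} \binom{d}{k}^{-1} \sum_{\substack{S \subseteq \{1, \dotsc, d\} \\ |S| = k}} \frac{1}{k} \sum_{i \in S} \log P(D_i \mid D_{S \setminus i}\, H\, I) , \tag{11}
$$

where $D_{S \setminus i}$ denotes the conjunction of the data $D_j$ with $j \in S$, $j \neq i$ (no data at all when $k = 1$). That is, the log-likelihood is the sum of all averaged log-scores that can be formed from all (non-empty) data subsets with $k$ elements, the average for log-scores over $k$ data being taken over the $\binom{d}{k}$ subsets having the same cardinality $k$.

[12] Stone 1977; Geisser et al. 1979; Vehtari et al. 2012, 2002; Krnjajić et al. 2011; Gelman et al. 2014; Chandramouli et al. 2019.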
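The relation between the log-likelihood and the averaged log-scores of all data subsets can be checked numerically by brute force. The sketch below does so for one assumed model (exchangeable binary data with a uniform prior on the frequency of ones); the model, the data, and the function names are illustrative assumptions, and the enumeration is exponential in $d$, so it is only meant for small examples.

```python
import math
from itertools import combinations


def log_joint(values):
    """log P(values | H I) for the assumed model: log[s! (n - s)! / (n + 1)!]."""
    n, s = len(values), sum(values)
    return math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2)


def subset_log_score(subset):
    """Leave-one-out log-score computed within one subset of the data."""
    k = len(subset)
    total = 0.0
    for i in range(k):
        rest = subset[:i] + subset[i + 1:]
        total += log_joint(subset) - log_joint(rest)  # log P(D_i | rest)
    return total / k


def sum_of_averaged_log_scores(data):
    """For each size k, average the log-scores of all k-subsets; sum over k."""
    d = len(data)
    total = 0.0
    for k in range(1, d + 1):
        scores = [subset_log_score(list(s)) for s in combinations(data, k)]
        total += sum(scores) / math.comb(d, k)
    return total


data = [1, 0, 1, 1, 0]
lhs = log_joint(data)                   # log-likelihood
rhs = sum_of_averaged_log_scores(data)  # sum of averaged log-scores
```

Here `lhs` and `rhs` agree to floating-point precision. `combinations` works by position, so repeated data values are treated as distinct data points, as they should be.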
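There's also an equivalent form with a slightly different cross-validating interpretation: we take each datum $D_j$ in turn and calculate our log-belief in it conditional on all possible subsets of remaining data, from the empty subset with no data (term $k = 0$), to the only subset $D_{-j}$ with all data except $D_j$ (term $k = d - 1$). These log-beliefs are averaged over the $\binom{d-1}{k}$ subsets having the same cardinality $k$. The result can be expressed as

$$
\log P(D \mid H\, I) \equiv \frac{1}{d} \sum_{j=1}^{d} \sum_{k=0}^{d-1} \binom{d-1}{k}^{-1} \sum_{\substack{S \subseteq \{1, \dotsc, d\} \setminus \{j\} \\ |S| = k}} \log P(D_j \mid D_S\, H\, I) , \tag{12}
$$

where $D_S$ denotes the conjunction of the data $D_i$ with $i \in S$.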
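The per-datum form can be checked the same way. To stress that nothing depends on a particular model, this sketch assumes a different one: an equal-weight mixture of two coins with hypothetical biases 0.2 and 0.7. The overall factor $1/d$ in the code is the average over the $d$ data that is needed for the two sides to match; all names and numbers are illustrative.

```python
import math
from itertools import combinations


def log_joint(values, p1=0.2, p2=0.7):
    """log P(values | H I) for an assumed equal mixture of two coins."""
    n, s = len(values), sum(values)
    return math.log(0.5 * p1**s * (1 - p1)**(n - s)
                    + 0.5 * p2**s * (1 - p2)**(n - s))


def per_datum_form(data):
    """Average over data D_j of the sum over k of the mean log-belief in D_j
    conditional on the k-subsets of the remaining data."""
    d = len(data)
    total = 0.0
    for j in range(d):
        rest = data[:j] + data[j + 1:]
        for k in range(d):  # k = 0, ..., d - 1
            conds = [log_joint(list(s) + [data[j]]) - log_joint(list(s))
                     for s in combinations(rest, k)]
            total += sum(conds) / math.comb(d - 1, k)
    return total / d


data = [1, 0, 0, 1, 1, 0]
lhs = log_joint(data)
rhs = per_datum_form(data)
```

Again `lhs` and `rhs` coincide up to rounding, for this model and these data.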

Brief discussion
It's remarkable that the individual log-scores in expressions (11) and (12) above are computationally expensive, but their sum results in the log-likelihood, which is less expensive.
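To give a rough sense of that cost, one can count the conditional probabilities that appear when the log-likelihood is written as a sum over all data subsets: one for each pair (subset, datum in that subset), that is, $\sum_k k \binom{d}{k} = d\, 2^{d-1}$ of them. This count is a back-of-envelope addition, not a statement from the text; the function names are hypothetical.

```python
from itertools import combinations


def count_conditionals(d):
    """Count the (subset, datum-in-subset) pairs by explicit enumeration."""
    return sum(len(subset)
               for k in range(1, d + 1)
               for subset in combinations(range(d), k))


def closed_form(d):
    """The same count in closed form: d * 2**(d - 1)."""
    return d * 2 ** (d - 1)


counts = {d: count_conditionals(d) for d in (3, 4, 10)}
```

For $d = 3$ this gives the 12 factors of the explicit three-data example, and the count roughly doubles with every additional datum.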
The relation (11) invites us to see the log-likelihood as a refinement and improvement of the log-score. The log-likelihood takes into account not only the log-score for the whole data, but also the log-scores for all possible subsets of data. Figuratively speaking, it examines the relationship between data and hypothesis locally, globally, and on all intermediate scales. To me this property makes the log-likelihood preferable to any single log-score (besides the fact that the log-likelihood is directly obtained from the principles of the probability calculus), because our interest is usually in how the hypothesis $H$ relates to single data points as well as to any collection of them. I hope to discuss this point, which also involves the distinction between simple and composite hypotheses [13], in more detail elsewhere [14].
By applying the identity (6) and generalizing the expansion (7) to different divisions of the data (leave-two-out, leave-three-out, and so on) we see that the relation (11) can be generalized to any $k$-fold cross-validation log-scores. Thus the log-likelihood is also equivalent to an average of all conceivable cross-validation log-scores for all subsets of data, though I haven't calculated the weights of such an average.
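As a sketch of only the first step of such a generalization, for the leave-two-out case and $d \geq 2$, the product rule can be applied to a pair of data instead of a single datum, and the root-product trick can be taken over the $\binom{d}{2}$ pairs. The notation $D_{-i,-j}$ (the data with $D_i$ and $D_j$ excluded) is assumed here for illustration, and the weights of the fully recursive expansion are not worked out.

```latex
P(D \mid H\, I) = P(D_i\, D_j \mid D_{-i,-j}\, H\, I) \times P(D_{-i,-j} \mid H\, I)
\qquad \text{for any } i \neq j ,
\\[1ex]
P(D \mid H\, I) \equiv \Bigl[\, \prod_{i<j} P(D_i\, D_j \mid D_{-i,-j}\, H\, I)\; P(D_{-i,-j} \mid H\, I) \Bigr]^{1/\binom{d}{2}} .
```

Taking the logarithm, the first factors give the average over all pairs of the leave-two-out log-beliefs, and the remaining factors $P(D_{-i,-j} \mid H\, I)$ can be expanded again in the same manner.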

Thanks
... to the Kavli Foundation and the Centre of Excellence scheme of the Research Council of Norway (Yasser Roudi group) for financial support. To Aki Vehtari for some references. To the staff of the NTNU library for their always prompt assistance. To Mari, Miri, Emma for continuous encouragement and affection, and to Buster Keaton and Saitama for filling life with awe and inspiration. To the developers and maintainers of Open Science Framework, LaTeX, Emacs, AUCTeX, Python, Inkscape, Sci-Hub for making a free and impartial scientific exchange possible.