Introduction

Several readers of the discussion version of the fldgen paper have questioned whether fitting the mean field and variability models over the entire range of RCP scenarios introduces a form of bias. Here is one example of such a comment from Anonymous Reviewer #3:

I had similar thoughts as reviewer two re using different RCPs used to train the emulator, and I didn’t find the response fully convincing – some tests should be possible here? In particular, I found myself wondering whether EOF1 reflected responses to different RCPs. Did you check whether EOF1 was similar in an emulator trained on a few simulations with a single RCP (and didn’t disappear, as when you trained on a single simulation)? Why not compare the variance of the full emulator (i.e. trained on all 9 simulations) separately with the RCP2.6 simulations and with the RCP8.5 simulations to quantify any scenario bias? In any event, some discussion of this should be included in the text as I suspect many people will question it. Repeating from Reviewer Two “Section 4.2 got me thinking that as the model is trained on the RCP outputs, is there any difference in the results when taking just the set of realisations from RCP2.6 and RCP8.5? Certaintly across ESMs, the variance across models increases with increasing global mean temperature. It would therefore not be correct to use a variability model that is trained on RCP8.5 for low forcing scenarios or those with a peak and decline. I note the authors address this in section 4.3, but I wonder if they have tested this.”

We can break the reviewer’s conjecture down into several specific, testable claims.

  1. Is the mean field model fit to a single RCP significantly different to that fit to the ensemble of RCPs? Our experience with fitting more sophisticated pattern analysis algorithms suggests that it does not, but we have not formally tested that hypothesis with linear pattern scaling.
  2. Does the first EOF of the variability model represent an adjustment from compromise fit of the mean field to the specific behaviors of the individual warming scenarios? In other words, the claim here is that when we fit the mean field model to the entire collection of warming scenarios, we produce a mean field that is not exactly appropriate for any of the RCP warming scenarios, and that EOF-1 represents a correction field that brings the mean field into agreement with the individual RCP mean field behavior. If that conjecture is true, then it would be a mistake to randomize the phase of that component, as it is not truly representing random variation, but rather a component of the forced behavior that we are not capturing.
  3. Is the residual variance higher when fitting to a multi-scenario ensemble than when fitting to a single-scenario ensemble? This would be an indication that some of what we are describing as “variability” is actully forced variation not being captured by the mean field model.

I will refer to these claims collectively as the Compromise Conjecture (CC). If the CC is wholly or partly true, then our emulator is representing some of the forced variation as dynamical variability, which may require some sort of correction.

Of course, we already know that the CC is true to some extent. The assumption that the behavior of the ESMs can be cleanly separated into deterministic forcing and random variability was always an approximation; as we note in the main body of the paper. So, the real question here is, how big is the error created by that approximation? Is it large enough to require explicit correction?

Testing the Compromise Conjecture

Testing these claims is tricky for two reasons. First, the statistical properties of the ensemble depend on the number of scenarios in the ensemble, as well as on the mix or RCP pathways in the ensemble. Therefore, it would be very hard to get any interpretable result comparing the nine-run multi-rcp emulator to an emulator trained on two or three runs from a single rcp. Instead, we will have to investigate these questions by training an emulator on the three rcp8.5 scenarios that we have and comparing that to an emulator trained on three scenarios from across the rcps (I suggest 2.6, 6.0, and 8.5 as giving the most uniform coverage).

The second reason it is difficult to test these claims is that they hinge on the size of the effect. It may not be enough merely to test whether or not any effects are statistically significant because statistical significance is not the same as practical significance. If an effect is not statistically significant, then we can probably dismiss it on the grounds that we can’t even be sure it exists, let alone is important in practical applications. However, I am a bit concerned that on a large grid like we have (55,296 grid cells), even the tiny effects, which we know must be present as a consequence of our separability approximation, could be statistically detectable.

For claims where this happens, we need to define an effect size and a threshold of acceptability to compare it to. Further complicating this task is the quantitative results we are looking at are defined in individual grid cells, and the direction of the shift (if any) and scale of variability will be different between grid cells. We will have to find a way to aggregate any judgment of practical significance to the system as a whole.

Software version

These calculations were run with fldgen-v1.0.0

Simple tests evaluating the claims

library(fldgen)
library(dplyr)
library(ggplot2)
library(ggthemes)
set.seed(867-5309)
emulator_rcp85 <- train('data-rcp85', latvar='lat_2', lonvar='lon_2')
emulator_multi_rcp <- train('data-multi-rcp', latvar='lat_2', lonvar='lon_2')

Claim #1

The heart of this claim is that the mean field model for an ensemble of ESM runs from the same RCP will be different from the mean field model for an ensemble of runs covering a variety of RCP scenarios. We can examine the coefficients models directly to see how different the mean field models are.

ggplot(mapping=aes(x=as.vector(emulator_multi_rcp$pscl$w), 
                   y=as.vector(emulator_rcp85$pscl$w))) +
    geom_point(color='lightgrey', shape='.') +
    theme_solarized_2(light=FALSE, base_size=14) +
    xlab('MULTI coefficient') + ylab('RCP85 coefficient') +
    ggtitle('Comparison of linear coefficient (w)')

ggplot(mapping=aes(x=as.vector(emulator_multi_rcp$pscl$b), 
                   y=as.vector(emulator_rcp85$pscl$b))) +
    geom_point(color='lightgrey', shape='.') +
    theme_solarized_2(light=FALSE, base_size=14) +
    xlab('MULTI coefficient') + ylab('RCP85 coefficient') +
    ggtitle('Comparison of intercept coefficient (b)')

It is easy enough to see that those two models are extremely similar. We can quantify just how similar they are by fitting a linear model predicting the RCP85 coefficient from the corresponding multi-rcp coefficient.

lmw <- lm(emulator_rcp85$pscl$w~emulator_multi_rcp$pscl$w + 0)
lmb <- lm(emulator_rcp85$pscl$b~emulator_multi_rcp$pscl$b + 0)
summary(lmw)

Call:
lm(formula = emulator_rcp85$pscl$w ~ emulator_multi_rcp$pscl$w + 
    0)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.173075 -0.015505  0.003771  0.023052  0.183827 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
emulator_multi_rcp$pscl$w 0.9942681  0.0001213    8195   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04184 on 55295 degrees of freedom
Multiple R-squared:  0.9992,    Adjusted R-squared:  0.9992 
F-statistic: 6.715e+07 on 1 and 55295 DF,  p-value: < 2.2e-16
summary(lmb)

Call:
lm(formula = emulator_rcp85$pscl$b ~ emulator_multi_rcp$pscl$b + 
    0)

Residuals:
    Min      1Q  Median      3Q     Max 
-55.504  -5.284   0.438   5.852  48.225 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
emulator_multi_rcp$pscl$b 0.9866313  0.0002057    4796   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.88 on 55295 degrees of freedom
Multiple R-squared:  0.9976,    Adjusted R-squared:  0.9976 
F-statistic: 2.3e+07 on 1 and 55295 DF,  p-value: < 2.2e-16

So, the average ratio between the RCP85 and multi-rcp linear coefficients (\(w\)) is 0.994. The \(R^2\) on this relationship is 0.999. Most of the residuals are within +/- 0.02 of 0 (for a coefficient that ranges approximately from 0-3).

For the intercept coefficient (\(b\)), the relationship is ever so slightly rougher; the coefficient ratio is 0.987, with an \(R^2\) of “only” 0.998. Most of the residuals are between -5 and +6 (the scale of this variable is considerably larger: -650 - +300.)

The mean field models are virtually indistinguishable, whether fit on a single RCP or a selection of RCPs; therefore, although the long-period EOFs are telling us something about the differences between individual runs, there doesn’t seem to be any evidence that these differences are a systematic function of RCP.

To me, these results alone vitiate the Compromise Conjecture. Without any significant difference in the mean response under different climate change pathways, there is no real motivation for the other claims.

Claim #2

The second claim in the CC is that the long-period EOFs are a representation of the mismatch between the mean field model and the RCP outputs from which the residuals are calculated. If this were true, then we would expect that for emulators fit to runs representing a single rcp, there should not be a long-period mode. We can check this hypothesis by plotting the power spectrum for EOF-1 for the single-rcp emulator.

psd_rcp85_eof1 <- emulator_rcp85$fx$mag[1:47 ,2]^2
rpsd_rcp85_eof1 <- psd_rcp85_eof1 / max(psd_rcp85_eof1)
psd_freq <- (seq_along(rpsd_rcp85_eof1)-1)/95
ggplot(mapping=aes(x=psd_freq, y=rpsd_rcp85_eof1)) + geom_smooth(se=FALSE, color='lightgrey') +
    xlab('Frequency') + ylab('EOF-1 Power Spectral Density') + 
    ggtitle('RCP85 residuals with RCP85 mean field') +
    theme_solarized_2(light=FALSE, base_size = 14)

psd_multi_eof1 <- emulator_multi_rcp$fx$mag[1:47 ,2]^2
rpsd_multi_eof1 <- psd_multi_eof1 / max(psd_multi_eof1)
ggplot(mapping=aes(x=psd_freq, y=rpsd_multi_eof1)) + geom_smooth(se=FALSE, color='lightgrey') +
    xlab('Frequency') + ylab('EOF-1 Power Spectral Density') + 
    ggtitle('MULTI residuals with MULTI mean field') +
    theme_solarized_2(light=FALSE, base_size = 14)

The single-rcp power spectrum looks very similar to the corresponding figure from the paper. It is actually slightly more concentrated at low frequencies than the corresponding power spectrum for the emulator trained on all 9 ESM runs. The result of the multi-rcp fit is qualitatively similar, though with rather more relative power at higher frequencies.

Based on these results, it seems clear that the long-period EOF is not the result of anything related to the CC, as it occurs even when the entire model is fit on a single rcp.

Claim #3

This claim posits that the residuals for the ESMs in the multi-rcp emulator have a variance that is inflated relative to the corresponding residuals for the single-rcp emulator, due to the mismatch in the mean field model. In light of the results for Claim #1, this claim seems hard to support. Nevertheless, it’s worth going through how we might test a claim like this.

What makes this claim tricky to adjudicate is that although we have a sample of residuals in each grid cell, we generally expect the variance of those values to be grid cell dependent. Therefore, we cannot readily combine all of these residuals into a single statistical test. What we can do is to employ something like our statistical analysis from the paper, where we perform the F-test on a grid cell by grid cell basis and then compare the number of failures to the expected number of failures under the null hypothesis and under an alternate hypothesis of a de minimis change in variance.

Alternatively, we could apply the F-test individually in each grid cell using the Holm-Bonferroni correction. (Holm 1979). This correction adjusts for the fact that when running many independent, but related tests, some (many, in the case of the large number of tests we will be performing) will have low p-values just due to random variation. For the sake of variety, we will take the latter approach in this analysis.

## Residuals for the rcp85 model are stored in the emulator
resid85 <- emulator_rcp85$pscl$r
## To calculate the residuals for the multi-rcp model we have to apply that mean field
## model manually
calc_resid <- function(pscl, tgav, griddata) {
    tmean <- fldgen::pscl_apply(pscl, tgav)
    griddata$tas - tmean
}
residmulti <- calc_resid(emulator_multi_rcp$pscl, emulator_rcp85$tgav, emulator_rcp85$griddata)
getpval <- function(i) {
    rmulti <- residmulti[ , i]
    r85 <- resid85[ , i]
    ## Test that the ratio of var(rmulti)/var(r85) is _greater_ than 1
    var.test(rmulti, r85, alternative='greater')$p.value
}
pvals <- sapply(seq(1,ncol(resid85)), getpval)
## Accept or reject using the Holm-Bonferroni procedure.
pvals_srt <- sort(pvals)
holm_fac <- rev(seq_along(pvals_srt))
pvals_holm <- pmin(pvals_srt * holm_fac, 1.0)
## Find the number of null hypotheses to reject.  The rule is find the first one that
## _doesn't_ reject; reject all the ones before that
base_pval <- 0.05
n_null_reject <- min(which(pvals_holm > base_pval)) - 1
n_null_reject
[1] 0

From this calculation we see that none of the grid cells show evidence of a larger variance when using the multi-rcp mean field model. In fact, this test is not even close. Even without the Holm-Bonferroni correction, only 6 grid cells show a statistically significant difference in variance. Given the tiny difference in the mean field models in question, this is to be expected.

Conclusion

It turns out that we never ended up needing to worry about effect sizes or the like. The mean field models are practically identical for the two ensembles. As a result, the three claims in the CC don’t even rise to the level of statistical significance. Given the amount of data here, that would indicate that the effect sizes involved are truly tiny. Therefore, the CC, though plausible a priori, appears not to be a factor in the results presented in the paper.

References

Holm, S. (1979), “A Simple Sequentially Rejective Multiple Test Procedure”, Scandinavian Journal of Statistics, 6(2), pp. 65-70.

---
title: 'Fldgen v1.0: Analysis of the Compromise Conjecture'
author: "Robert Link"
date: '2019-03-03'
output:
  html_notebook: default
  pdf_document: default
---

## Introduction

Several readers of the discussion version of the fldgen paper have questioned
whether fitting the mean field and variability models over the entire range of
RCP scenarios introduces a form of bias.  Here is one example of such a comment
from Anonymous Reviewer \#3:

> I had similar thoughts as reviewer two re using different RCPs used to train
the emulator, and I didn’t find the response fully convincing – some tests
should be possible here? In particular, I found myself wondering whether EOF1
reflected responses to different RCPs. Did you check whether EOF1 was similar in
an emulator trained on a few simulations with a single RCP (and didn’t
disappear, as when you trained on a single simulation)? Why not compare the
variance of the full emulator (i.e. trained on all 9 simulations) separately
with the RCP2.6 simulations and with the RCP8.5 simulations to quantify any
scenario bias? In any event, some discussion of this should be included in the
text as I suspect many people will question it. Repeating from Reviewer Two
“Section 4.2 got me thinking that as the model is trained on the RCP outputs, is
there any difference in the results when taking just the set of realisations
from RCP2.6 and RCP8.5? Certaintly across ESMs, the variance across models
increases with increasing global mean temperature. It would therefore not be
correct to use a variability model that is trained on RCP8.5 for low forcing
scenarios or those with a peak and decline. I note the authors address this in
section 4.3, but I wonder if they have tested this.”

We can break the reviewer's conjecture down into several specific, testable 
claims.

1. **Is the mean field model fit to a single RCP significantly different to that
fit to the ensemble of RCPs?**  Our experience with fitting more sophisticated
pattern analysis algorithms suggests that it does not, but we have not formally
tested that hypothesis with linear pattern scaling.
3. **Does the first EOF of the variability model represent an adjustment from
compromise fit of the mean field to the specific behaviors of the individual 
warming scenarios?**  In other words, the claim here is that when we fit the 
mean field model to the entire collection of warming scenarios, we produce a
mean field that is not exactly appropriate for any of the RCP warming scenarios,
and that EOF-1 represents a correction field that brings the mean field into 
agreement with the individual RCP mean field behavior.  _If_ that conjecture is
true, then it would be a mistake to randomize the phase of that component, as
it is not truly representing random variation, but rather a component of the 
forced behavior that we are not capturing.
2. **Is the residual variance higher when fitting to a multi-scenario ensemble
than when fitting to a single-scenario ensemble?**  This would be an indication
that some of what we are describing as "variability" is actully forced variation
not being captured by the mean field model.


I will refer to these claims collectively as the _Compromise Conjecture_ (CC).
If the CC is wholly or partly true, then our emulator is representing some of
the forced variation as dynamical variability, which may require some sort of
correction.

Of course, we already _know_ that the CC is true to some extent.  The assumption
that the behavior of the ESMs can be cleanly separated into deterministic forcing
and random variability was always an approximation; as we note in the main body of the paper.
So, the real question here is, how big is the error created by that approximation?
Is it large enough to require explicit correction?

## Testing the Compromise Conjecture

Testing these claims is tricky for two reasons.  First, the statistical
properties of the ensemble depend on the number of scenarios in the ensemble, as
well as on the mix or RCP pathways in the ensemble.  Therefore, it would be very
hard to get any interpretable result comparing the nine-run multi-rcp emulator 
to an emulator trained on two or three runs from a single rcp.  Instead, we will
have to investigate these questions by training an emulator on the three rcp8.5
scenarios that we have and comparing that to an emulator trained on three 
scenarios from across the rcps (I suggest 2.6, 6.0, and 8.5 as giving the most 
uniform coverage).  

The second reason it is difficult to test these claims is that they hinge on
the size of the effect.  It may not be enough merely to test whether 
or not any effects are statistically significant because statistical significance
is not the same as _practical_ significance.  If an effect is not statistically
significant, then we can probably dismiss it on the grounds that we can't even
be sure it exists, let alone is important in practical applications.  However,
I am a bit concerned that on a large grid like we have (55,296 grid cells), even
the tiny effects, which we know must be present as a consequence of our separability
approximation, could be statistically detectable.  

For claims where this happens, we need to define an effect size and a threshold
of acceptability to compare it to.  Further complicating this task is the
quantitative results we are looking at are defined in individual grid cells, and
the direction of the shift (if any) and scale of variability will be different
between grid cells.  We will have to find a way to aggregate any judgment of
practical significance to the system as a whole.


## Software version

These calculations were run with `r paste0('fldgen-v',packageVersion('fldgen'))`

## Simple tests evaluating the claims

```{r setup}
library(fldgen)
library(dplyr)
library(ggplot2)
library(ggthemes)

set.seed(867-5309)
```

```{r loaddata}
emulator_rcp85 <- train('data-rcp85', latvar='lat_2', lonvar='lon_2')
emulator_multi_rcp <- train('data-multi-rcp', latvar='lat_2', lonvar='lon_2')
```

## Claim \#1

The heart of this claim is that the mean field model for an ensemble of 
ESM runs from the same RCP will be different from the mean field model for
an ensemble of runs covering a variety of RCP scenarios.  We can examine
the coefficients models directly to see how different the mean field models
are.
```{r plotcoefs}
ggplot(mapping=aes(x=as.vector(emulator_multi_rcp$pscl$w), 
                   y=as.vector(emulator_rcp85$pscl$w))) +
    geom_point(color='lightgrey', shape='.') +
    theme_solarized_2(light=FALSE, base_size=14) +
    xlab('MULTI coefficient') + ylab('RCP85 coefficient') +
    ggtitle('Comparison of linear coefficient (w)')

ggplot(mapping=aes(x=as.vector(emulator_multi_rcp$pscl$b), 
                   y=as.vector(emulator_rcp85$pscl$b))) +
    geom_point(color='lightgrey', shape='.') +
    theme_solarized_2(light=FALSE, base_size=14) +
    xlab('MULTI coefficient') + ylab('RCP85 coefficient') +
    ggtitle('Comparison of intercept coefficient (b)')
```

It is easy enough to see that those two models are _extremely_ similar.  We can
quantify just how similar they are by fitting a linear model predicting the RCP85
coefficient from the corresponding multi-rcp coefficient.

```{r coefmodels}
lmw <- lm(emulator_rcp85$pscl$w~emulator_multi_rcp$pscl$w + 0)
lmb <- lm(emulator_rcp85$pscl$b~emulator_multi_rcp$pscl$b + 0)

summary(lmw)
summary(lmb)
```

So, the average ratio between the RCP85 and multi-rcp linear coefficients ($w$) is
0.994.  The $R^2$ on this relationship is 0.999.  Most of the residuals are within
+/- 0.02 of 0 (for a coefficient that ranges approximately from 0-3).

For the intercept coefficient ($b$), the relationship is ever so slightly rougher;
the coefficient ratio is 0.987, with an $R^2$ of "only" 0.998.  Most of the residuals
are between -5 and +6 (the scale of this variable is considerably larger: -650 - +300.)

The mean field models are virtually indistinguishable, whether fit on a single
RCP or a selection of RCPs; therefore, although the long-period EOFs are telling
us _something_ about the differences between individual runs, there doesn't seem
to be any evidence that these differences are a systematic function of RCP.

To me, these results alone vitiate the Compromise Conjecture.  Without any
significant difference in the mean response under different climate change
pathways, there is no real motivation for the other claims.

## Claim \#2

The second claim in the CC is that the long-period EOFs are a representation of the 
mismatch between the mean field model and the RCP outputs from which the residuals 
are calculated.  If this were true, then we would expect that for emulators fit to 
runs representing a _single_ rcp, there should not be a long-period mode.  We can
check this hypothesis by plotting the power spectrum for EOF-1 for the single-rcp
emulator.

```{r lpeof}
psd_rcp85_eof1 <- emulator_rcp85$fx$mag[1:47 ,2]^2
rpsd_rcp85_eof1 <- psd_rcp85_eof1 / max(psd_rcp85_eof1)
psd_freq <- (seq_along(rpsd_rcp85_eof1)-1)/95
ggplot(mapping=aes(x=psd_freq, y=rpsd_rcp85_eof1)) + geom_smooth(se=FALSE, color='lightgrey') +
    xlab('Frequency') + ylab('EOF-1 Power Spectral Density') + 
    ggtitle('RCP85 residuals with RCP85 mean field') +
    theme_solarized_2(light=FALSE, base_size = 14)

psd_multi_eof1 <- emulator_multi_rcp$fx$mag[1:47 ,2]^2
rpsd_multi_eof1 <- psd_multi_eof1 / max(psd_multi_eof1)
ggplot(mapping=aes(x=psd_freq, y=rpsd_multi_eof1)) + geom_smooth(se=FALSE, color='lightgrey') +
    xlab('Frequency') + ylab('EOF-1 Power Spectral Density') + 
    ggtitle('MULTI residuals with MULTI mean field') +
    theme_solarized_2(light=FALSE, base_size = 14)
```

The single-rcp power spectrum looks _very_ similar to the corresponding figure
from the paper.  It is actually slightly _more_ concentrated at low frequencies
than the corresponding power spectrum for the emulator trained on all 9 ESM
runs.  The result of the multi-rcp fit is qualitatively similar, though with
rather more relative power at higher frequencies.

Based on these results, it seems clear that the long-period EOF is _not_ the 
result of anything related to the CC, as it occurs even when the entire model
is fit on a single rcp.

## Claim \#3

This claim posits that the residuals for the ESMs in the multi-rcp emulator
have a variance that is inflated relative to the corresponding residuals 
for the single-rcp emulator, due to the mismatch in the mean field model.
In light of the results for Claim \#1, this claim seems hard to support.
Nevertheless, it's worth going through how we might test a claim like this.

What makes this claim tricky to adjudicate is that although we have a sample of
residuals in each grid cell, we generally expect the variance of those values to
be grid cell dependent.  Therefore, we cannot readily combine all of these
residuals into a single statistical test.  What we can do is to employ something
like our statistical analysis from the paper, where we perform the F-test on a
grid cell by grid cell basis and then compare the number of failures to the
expected number of failures under the null hypothesis and under an alternate
hypothesis of a _de minimis_ change in variance.

Alternatively, we could apply the F-test individually in each grid cell using
the Holm-Bonferroni correction. (Holm 1979).  This correction adjusts for the
fact that when running many independent, but related tests, some (many, in the
case of the large number of tests we will be performing) will have low p-values
just due to random variation.  For the sake of variety, we will take the latter
approach in this analysis.

```{r residual_analysis}
## Residuals for the rcp85 model are stored in the emulator
resid85 <- emulator_rcp85$pscl$r
## To calculate the residuals for the multi-rcp model we have to apply that mean field
## model manually
calc_resid <- function(pscl, tgav, griddata) {
    tmean <- fldgen::pscl_apply(pscl, tgav)
    griddata$tas - tmean
}
residmulti <- calc_resid(emulator_multi_rcp$pscl, emulator_rcp85$tgav, emulator_rcp85$griddata)

getpval <- function(i) {
    rmulti <- residmulti[ , i]
    r85 <- resid85[ , i]
    ## Test that the ratio of var(rmulti)/var(r85) is _greater_ than 1
    var.test(rmulti, r85, alternative='greater')$p.value
}
pvals <- sapply(seq(1,ncol(resid85)), getpval)
```

```{r holm_bonferroni}
## Accept or reject using the Holm-Bonferroni procedure.
pvals_srt <- sort(pvals)
holm_fac <- rev(seq_along(pvals_srt))
pvals_holm <- pmin(pvals_srt * holm_fac, 1.0)
## Find the number of null hypotheses to reject.  The rule is find the first one that
## _doesn't_ reject; reject all the ones before that
base_pval <- 0.05
n_null_reject <- min(which(pvals_holm > base_pval)) - 1
n_null_reject
```

From this calculation we see that none of the grid cells show evidence of a larger variance when using the multi-rcp
mean field model.  In fact, this test is not even close.  Even without the Holm-Bonferroni correction,
only `r sum(pvals <= 0.05)` grid cells show a statistically significant difference in variance.  Given 
the tiny difference in the mean field models in question, this is to be expected.

## Conclusion

It turns out that we never ended up needing to worry about effect sizes or the like.  The mean field
models are practically identical for the two ensembles.  As a result, the three claims in the CC don't
even rise to the level of statistical significance.  Given the amount of data here, that would indicate
that the effect sizes involved are truly tiny.  Therefore, the CC, though plausible a priori, appears
not to be a factor in the results presented in the paper.


## References

Holm, S. (1979), "A Simple Sequentially Rejective Multiple Test Procedure",
_Scandinavian Journal of Statistics_, 6(2), pp. 65-70.