Extreme Metrics and Large Ensembles

We consider the problem of estimating the ensemble sizes required to characterize the forced component and the internal variability of a range of extreme metrics. While we exploit existing large ensembles contributed to the CLIVAR Large Ensemble Project, our perspective is that of a modeling center wanting to estimate such sizes a priori on the basis of an existing small ensemble (we use five members here). We therefore ask if such a small ensemble is sufficient to estimate the population variance accurately enough to apply a well-established formula that quantifies the expected error as a function of n (the ensemble size). We find that we can indeed anticipate errors in the estimation of the forced component for temperature and precipitation extreme metrics as a function of n by applying the population variance derived from five members in the formula. For a range of spatial and temporal scales, forcing levels (we use RCP8.5 simulations), and both models considered here as our proof of concept, CESM1-CAM5 and CanESM2, it appears that an ensemble size of 20 or 25 members can provide estimates of the forced component for the extreme metrics considered that remain within small absolute and percentage errors. An additional 10 members beyond 20 or 25 add only marginal precision to the estimate, and this remains true when extreme value analysis is used. We then ask about the ensemble size required to estimate the ensemble variance (a measure of internal variability) along the length of the simulation, and, importantly, about the ensemble size required to detect significant changes in such variance along the simulation with increased external forcing. When an F-test is applied to the ratio of the variances in question, one estimated on the basis of only 5 or 10 ensemble members and one estimated using the full ensemble (up to 50 members in our study), we do not obtain significant results even when the analysis is conducted at the grid-point scale.
While we recognize that there will always exist applications and metric definitions requiring larger statistical power and therefore larger ensemble sizes, our results suggest that for a wide range of analysis targets and scales an effective estimate of both forced component and internal variability can be achieved with sizes below 30 members. This invites consideration of the possibility of exploring additional sources of uncertainty, like physics parameter settings, when designing ensemble simulations.


Introduction
size, and asking if the statistical approach buys any power with respect to the simple "counting" of events across the ensemble realizations.

If a random variable z (say the temperature of the hottest day of the year, TXx) is distributed according to a GEV distribution, its distribution function has the form

G(z) = exp{ -[1 + ξ(z - μ)/σ]^(-1/ξ) },

defined where 1 + ξ(z - μ)/σ > 0, with location parameter μ, scale σ > 0 and shape ξ. We estimate the parameters of the GEVs by maximum likelihood. If p (say p = 0.01) is the tail probability to the right of level z_p under the GEV probability density function, z_p is said to be the return level associated with the 1/p-year return period (a 100-yr return period in this example), and is given by

z_p = μ - (σ/ξ)[1 - {-log(1 - p)}^(-ξ)] for ξ ≠ 0, and
z_p = μ - σ log{-log(1 - p)} for ξ = 0.
Thus z_p in our example represents the temperature on the hottest day of the year expected to occur only once every 100 years (in a stationary climate), or with probability 0.01 in any given year (a definition more appropriate in the case of a transient climate).
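As a sketch of the formulas above (with purely illustrative parameter values; the actual fitting in this study is done with the extRemes package), the return level can be computed directly from GEV parameters:

```python
import math

def gev_return_level(mu, sigma, xi, p):
    """Return level z_p with tail probability p under GEV(mu, sigma, xi):
    solves 1 - G(z_p) = p, where G is the GEV distribution function."""
    y = -math.log(1.0 - p)        # -log of the non-exceedance probability
    if abs(xi) < 1e-9:            # Gumbel limit, xi -> 0
        return mu - sigma * math.log(y)
    return mu - (sigma / xi) * (1.0 - y ** (-xi))

# 100-year event (p = 0.01) for illustrative parameters mu=30, sigma=2, xi=-0.1
z100 = gev_return_level(30.0, 2.0, -0.1, 0.01)
```

Plugging z100 back into the GEV distribution function recovers a non-exceedance probability of 1 - p, which is a convenient self-check on the algebra.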
We use the R package extRemes, available from https://CRAN.R-project.org/package=extRemes, to fit GEVs and determine return levels and confidence intervals. Milinski et al. (2020) use the ensemble mean computed on the basis of the full ensemble as a proxy for the true forced signal, and analyze how its approximation gains in precision with an increasingly larger ensemble size. By a bootstrap approach, subsets of the full ensemble of a given size n are sampled (without replacement) multiple times (in our analysis we use 100 times), their mean (for the metric of interest) is computed, and the multiple replications of this mean are used to compute the root mean square error, RMSE(n), with respect to the full ensemble mean, assumed to be the true forced signal. Note that this bootstrap approach to estimating errors is expected to become less and less accurate as n increases, as was also noted in Milinski et al. (2020). For n approaching the size of the full ensemble, the repeated sampling from a finite population introduces increasingly stronger dependencies among the samples, which share larger and larger numbers of members, therefore underestimating RMSE(n). More problematically, this approach would not be possible if we did not have a full ensemble to exploit, and if our model were thought to have different characteristics in variability than the models for which large ensembles are available. As a more realistic approach, therefore, we assume that only 5 ensemble members are available, and test how our estimate of the forced signal and the expected RMSE(n) (as n increases) may differ. We base our expectation on the statistics we can gather from the 5 members and we compare them to the "truth" that the availability of a large ensemble provides.

https://doi.org/10.5194/esd-2021-53 - Preprint. Discussion started: 6 July 2021. (c) Author(s) 2021. CC BY 4.0 License.

Identifying the forced component

It is a well-known result of descriptive statistics that the standard error of the sample mean around the true mean decreases as a function of n as σ/√n. Here σ is the true standard deviation of the population of ensemble members, and we are left with estimating it on the basis of five of them. (This result was in fact applied in Wehner (2000) to estimate the sampling size in ensemble simulations well before the advent of Large Ensembles.) We will compare the estimated RMSE(n) for n ≥ 5 with the actual departure of the mean computed by averaging an n-size ensemble from the "true" mean computed on the basis of all available members.
Since we are considering extreme metrics that can be modelled by a GEV, we also derive return levels for given multi-year return periods (e.g., 10-, 50-, 100-year events). Because of the availability of multiple ensemble members we can choose a narrow window along the simulations (we choose 11 years, short enough to satisfy the requirement of stationarity that the GEV fit postulates) centered around several dates along the 20th and 21st centuries, i.e., 1953, 2000, 2050, 2097 (the first and last chosen to allow extracting a symmetric window at the beginning and end of the simulations). On the basis of the GEV parameters we compute X = 1/p-year events (X = 2, 5, 10, 20, 50, 100) and their uncertainty, and assess when the estimates of the central value converge and what the trade-off is between sample size and width of the confidence interval. Lastly, we can use a simple counting approach to determine those same X-year events from all available ensemble members, and compare those estimates to the ones derived by the GEV. The comparison will test if fitting the GEV allows any saving (in terms of sample size) to achieve an accurate estimate of the same event obtained on the basis of the full ensemble.
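On synthetic data, the bootstrap estimate of RMSE(n) and the σ/√n formula it is compared against can be sketched as follows (a toy illustration with made-up Gaussian "members", not the metric computations used in this study):

```python
import random
import statistics

def bootstrap_rmse(members, n, reps=100, seed=0):
    """RMSE of an n-member ensemble mean relative to the full-ensemble
    mean (taken as the 'true' forced signal), via repeated subsampling
    without replacement, as in the Milinski et al. (2020) approach."""
    rng = random.Random(seed)
    truth = statistics.fmean(members)
    sq = [(statistics.fmean(rng.sample(members, n)) - truth) ** 2
          for _ in range(reps)]
    return statistics.fmean(sq) ** 0.5

# A synthetic 40-member "ensemble" for one year of some metric
rng = random.Random(1)
ens = [rng.gauss(0.0, 1.0) for _ in range(40)]

rmse5, rmse20 = bootstrap_rmse(ens, 5), bootstrap_rmse(ens, 20)
# Formula counterpart: sigma / sqrt(n), with sigma from the full ensemble
sigma = statistics.stdev(ens)
formula5 = sigma / 5 ** 0.5
```

Because subsampling is without replacement from a finite "population" of 40 members, the bootstrap RMSE shrinks faster than σ/√n as n grows, which is the underestimation discussed above.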

Characterizing internal variability
Recognizing the importance of characterizing variability besides the signal of change, we ask how many ensemble members are required to fully characterize the size of internal variability and its possible changes over the course of the simulation due to increasing anthropogenic forcing. Process-based studies are suited to tackle the question of how and why changes in internal variability manifest themselves in transient scenarios (Huntingford et al., 2013), while here we simply describe the behavior of a straightforward metric, the within-ensemble standard deviation. We look at this quantity at the grid-point scale and we investigate how many ensemble members are needed to robustly characterize the full ensemble behavior, which here again we assume to be representative of the truth, i.e., the true variability of the system. This translates into two separate questions.
First, for a number of dates along the simulation spanning the 20th and 21st centuries, we ask how many ensemble members are needed to estimate an ensemble variance that is statistically indistinguishable from that computed on the basis of the full ensemble. Second, we ask how many members are needed to detect the changes in that variance along the simulation that the full ensemble detects. We will discuss if and when the results presented in this section differ from those in the supplementary material.
Identifying the forced component

We start from time series of annual values of globally averaged TNx and Rx5Day (Figure 1, top panels). We compute them for each ensemble member separately, and average them over n ensemble members as the ensemble size n increases, applying the bootstrap approach and computing RMSE(n) (see Section 3.1) at every year along the simulation. (Figure 1 caption, in part: the y-axis range allows a clearer assessment of the relative size of the uncertainty ranges for different ensemble sizes; the bottom row plots the bootstrapped RMSE for every year and each ensemble size.)
As Figure 1 indicates, for both quantities the marginal effect of increasing the ensemble size by 5 members is not constant but rather decreases as the ensemble size increases. This is qualitatively visible in the evolution of the ranges in the panels of the first two rows, and is measured along the y-axis of the plots along the bottom row, where RMSE(n) for increasing n is shown (each n corresponding to a different color).

Table 1. Values of the RMSE in approximating the full ensemble mean by individual runs (first row, n = 1), and by ensembles of increasingly larger sizes (from 5 to 35, along the remaining rows). The estimates obtained by the bootstrap approach (columns labeled "(B)") are compared to the estimates obtained by the formula σ_t/√n, where σ_t is estimated as the ensemble standard deviation using all ensemble members (columns labelled "(F)", where the 95% confidence interval is also shown). We also compare estimates derived by plugging into the formula a value of σ_t estimated from a subset of 5 ensemble members and 5 years around the year t considered (columns labelled "(F-5)"). Results are shown for four individual years t along the simulation (column-wise), since σ_t varies along the length of the simulation.
This behavior is to be expected, as we know the RMSE of a mean behaves in inverse proportion to the square root of the size of the sample from which the mean is computed, but the actual behavior shown in the plots and the table could be misleading, as the variability of the largest means (largest in sample size n) could be underestimated by our bootstrap (see Section 3.1). More importantly, this assessment would not be possible if all we had was a 5-member ensemble for our model. We can therefore compute the formula for the standard error of a mean, σ/√n (see Section 3.1), using the full ensemble to estimate σ, which we assume to be the true standard deviation of the ensemble. We then repeat the estimation by substituting an estimate for σ derived using only 5 ensemble members. Table 1 shows RMSEs for the same increasing values of n. Each pair of columns compares side by side the bootstrap (B) and the formula (F) results, the latter also reporting the 95% confidence intervals due to having to estimate σ_t. Also shown is the result of applying the formula by estimating σ_t on the basis of a small ensemble (5 members) but, importantly for the accuracy of our results, increasing the sample size by using a window of 5 years around the individual dates. From the table entries we can assess that the bootstrap estimation is the most optimistic about the size of the RMSE for the estimation of the forced signal of these two quantities once the ensemble size exceeds about 15-20 (out of 40 available). For the larger sizes the RMSE estimated by the bootstrap falls in all cases outside of the confidence interval under the (F) column.
However, the estimates of RMSE associated with an ensemble size of 10 or 15 already quantify a high degree of accuracy for the approximation of the ensemble mean of the full 40-member ensemble. For both metrics, estimating σ_t using only 5 ensemble members (and a window of 5 years around the year t of interest) delivers accurate estimates of the RMSE and its confidence interval as soon as the ensemble size considered exceeds 5 or 10.
The lesson learned here is that (1) if global average quantities of these indices are concerned, and (2) if the formula for estimating the RMSE on the basis of a given sample size is adopted, then it is possible, on the basis of an existing 5-member ensemble, to accurately estimate the ensemble size required to identify the forced component within a given tolerance for error. Of course, the size of this tolerance would ideally be dictated by the use the analysis is put towards.
We note here that the calculation of the RMSE for increasing ensemble sizes is straightforward once σ_t is estimated. Even more straightforward is the calculation of the expected "gain" in narrowing the RMSE. A simple ratio calculation shows that for n spanning the range 5 to 45 (relevant sizes for our specific examples) the reduction in RMSE follows the sequence {100 · 1/√n}, n = 1, 5, ..., 45. If we take the RMSE affecting estimates based on only one member (in essence, our estimate of σ) as reference, we would expect an RMSE that is 45%, 32%, 22%, 17% or 15% of that for ensemble sizes of n = 5, 10, 20, 35, or 45, respectively.
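The sequence quoted above follows directly from 100/√n:

```python
import math

# Expected RMSE of an n-member ensemble mean, expressed as a percentage
# of the single-member RMSE (i.e., of our estimate of sigma): 100 / sqrt(n)
reduction = {n: round(100 / math.sqrt(n)) for n in (5, 10, 20, 35, 45)}
print(reduction)  # {5: 45, 10: 32, 20: 22, 35: 17, 45: 15}
```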

We assess how the results of the formula compare to the actual error by considering the difference between the smaller-size ensemble means and the truth (the full ensemble mean), year by year, and comparing that difference to twice the expected RMSE derived by the formula (akin to considering twice the standard deviation of a normally distributed quantity). Figure 2, for global averages of the same two quantities, shows the ratios of actual vs. expected error, indicating the 100% level by a horizontal line for reference (and indicating by a diagonal line the behavior of σ_t over time, as summarized by fitting a linear regression in t). As can be gauged, the actual error is in most cases much smaller than expected, especially for ensemble sizes greater than 5, and only occasionally does the actual error spike above the expected (above 100%) for individual years, consistent with what would be expected of a normally distributed error compared to the 2σ quantity. The diagonal line and the scale on the right axis give the actual size and behavior over time (as σ_t may vary with t) of the expected error.
We choose to show the stylized behavior of the estimated error here, as the least-squares linear regression fit of σ_t/√n onto t. Each plot corresponds to a different and increasing ensemble size: 1, 5, 10, 15, 20, 25, 30, 35. The top two rows of plots are for TNx; the bottom two rows are for Rx5Day. CESM1-CAM5 results.
In the supplementary material we report the results of applying the same analysis to the rest of the indices, and for regional results. Even if we cannot show all results, we tested country averages, zonal averages, and land and ocean regions separately, confirming that the qualitative behavior we assess here is common to all these other instances.
Here we go on to show how the same type of analysis can be applied at the grid scale, and still deliver an accurate bound on the error. (Figure captions, in part: values below 100% indicate that the estimated error is an effective upper bound for the real error in estimating the anomaly at that location; the color scale highlights in dark red the values above 100%, whose total fractions are reported in Tables B1 and B3.)
Overall, these results attest to the fact that we can use a small ensemble of 5 members to estimate the population standard deviation around the population mean, and apply the formula for the standard error of a mean as a function of sample size to decide how large an ensemble we need in order to approximate the forced component to a given degree of accuracy. This holds true across the range of spatial scales afforded by these models, from global means down to subcontinental regional averages all the way to grid-point values.

GEV results
As explained in Section 3, the extreme metrics we chose can be fit by a Generalized Extreme Value distribution, and return levels for arbitrary return periods can be derived, with their confidence intervals. In this section we ask two questions.
1. How many ensemble members are needed for the estimates to stabilize and for the size of the confidence interval not to change in a significant way? And 2. Is there any gain in applying GEV fitting rather than simply "counting" rare events across the ensemble?
We perform the analysis for a set of individual locations (i.e., grid points), as for most extreme quantities there would be little value in characterizing very rare events as means of large geographical regions. Figure C1 shows the 15 locations that we chose with the goal of testing a diverse set of climatic conditions. We choose three years along the simulations (2000, 2050 and 2095) around which we extract an 11-yr window of data. This relatively short period allows us to assume quasi-stationarity, eliminating the need for temporal covariates in the estimation of the GEV parameters. We estimate return levels (event sizes, z_p in our notation) for a number of return periods, i.e., 2, 5, 10, 20, 50 and 100 years, by concatenating the 11-year segments across the n ensemble members. Here we show results for our two metrics, choosing two different locations for each.

(Captions of Figures 5 and 6, in part: return levels based on estimating a GEV from 11-yr windows of data around each date. In each plot, for increasing ensemble sizes along the x-axis (from 5 to the full ensemble, 40), the red dots indicate the central estimate and the pink envelope the 95% confidence interval. The estimates based on the full ensemble, which we consider the truth, are also drawn across each plot for reference, as horizontal lines. The blue dots show the same quantities estimated simply by counting, i.e., computing the empirical cumulative distribution function on the basis of the n × 11 years, where n is the ensemble size. For TNx, the first three rows show results for a location in Australia and the following three rows for a location in Northern North America; for Rx5Day, the locations are in Northern Asia and Southern Africa (see Figure C1).)
Figures 5 and 6 compare, for each return level (along the columns) and across the projection dates (along the rows), the behavior of the GEV central estimates (red dots) and 95% confidence intervals (pink envelope) based on an increasing ensemble size (along the x-axis) to the ones obtained by the full ensemble (considered to be the truth), which are drawn as a reference across each plot as horizontal lines. Further, estimates of the central quantities based on computing an empirical cumulative distribution function from the data are added to each plot as blue dots for each of the ensemble sizes considered (in this case as well using 11-year windows for each ensemble member, so that the sample has the same size, n × 11, as that used for the GEV fitting). The general message can be summarized by two observations. First, an ensemble size of 20 or 25 (corresponding to a sample size of 220 to 275 years) appears to be the lower bound at which the estimates stabilize in most cases: both the central estimates and the confidence intervals do not depart significantly from the point estimate and range of the truth (horizontal lines), and gradually converge to it reliably (barring the odd result that we cannot exclude given the large number of cases examined); at these sizes it seems appropriate to compute, either empirically or by GEV fits, return levels for events as rare as having probability p = 0.01 to occur in a given year. Second, the use of a GEV approach allows us to characterize the uncertainty bounds straightforwardly, as opposed to the empirical "counting" approach, but the central estimates from the two approaches do not seem to differ significantly in most cases tested.
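The "counting" alternative to the GEV fit amounts to reading a quantile off the empirical distribution of the pooled annual maxima. A minimal sketch (nearest-rank quantile; the exact empirical estimator used for the figures may differ):

```python
import math

def counting_return_level(values, return_period):
    """Empirical ("counting") estimate of the return_period-year event:
    the (1 - 1/return_period) quantile of the pooled annual maxima.

    values: annual-maximum metric pooled across ensemble members,
    e.g. n members x 11 years -> len(values) == 11 * n.
    """
    xs = sorted(values)
    q = 1.0 - 1.0 / return_period
    # nearest-rank index on the empirical CDF; the small epsilon
    # guards against floating-point round-off in q * len(xs)
    k = min(len(xs) - 1, max(0, math.ceil(q * len(xs) - 1e-9) - 1))
    return xs[k]

# e.g., 20 members x 11 years = 220 pooled annual maxima
```

With 220 pooled years, a 100-year event sits near the top of the sample, which is why the empirical estimate only becomes usable once the ensemble (and hence the pooled sample) is large enough.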
The same statistical precision may be realized with fewer ensemble members by relaxing the quasi-stationarity assumption and extending the analysis period to contain a similar number of years. However, this then necessitates the use of temporal covariates, adding another source of fitting uncertainty.

Characterizing internal variability
After concerning ourselves with the characterization of the forced component we turn to the complementary problem of characterizing internal variability. Rather than aiming at eliminating the effects of internal variability as we have done so far in 260 the estimation of a forced signal, we take here the opposite perspective, wanting to fully characterize that internal variability.
After all, the real-world realization will not be akin to the mean of the ensemble, but to one of its members, and we want to be sure to estimate the range of variations such members may display. Thus, we ask how large the ensemble needs to be to fully characterize the variations that the full-size ensemble produces, in the form of the ensemble variance; we also ask how large an ensemble is needed to detect changes in the size of internal variability with changing external forcing. Both these questions we tackle directly at the grid-point scale, as the answer to that problem is bound to be a conservative answer to any other problem that concerns the characterization of variability at a larger spatial scale. Figures 7 through 10 synthesize our findings for both these questions.

Figure 7. Estimating the ensemble variance for TNx: Each plot corresponds to a year along the simulation length (1950, 1975, 2000, 2025, 2050, 2075, 2100). The color indicates the number of ensemble members needed to estimate an ensemble variance at that location that is statistically indistinguishable from that computed on the basis of the full 40-member ensemble. The results of the first two columns use only the specific year across the members. The results of the third and fourth columns enrich the samples by using five years around the specific date.
Figure 8. Estimating the within-ensemble variance for Rx5Day: Each plot corresponds to a year along the simulation length (1950, 1975, 2000, 2025, 2050, 2075, 2100). The color indicates the number of ensemble members needed to estimate an ensemble variance at that location that is statistically indistinguishable from that computed on the basis of the full 40-member ensemble. The results of the first two columns use only the specific year across the members. The results of the third and fourth columns enrich the samples by using five years around the specific date.
The two columns on the left-hand side of Figure 7 show, for several years along the simulation, how many ensemble members are needed (denoted by the colors of the legends) in order to estimate an ensemble variance at each grid point that is not statistically distinguishable from the same variance estimated by the full 40-member ensemble, for TNx. Note that we do this at various times along the length of the simulation (1950, 1975, 2000, 2025, 2050, 2075, 2100) because we account for the possibility that internal variability might change over its course with increasing external forcing, but for now we remain agnostic on this issue. For all dates, most of the area indicates that 5 members are sufficient, but a remaining noisy pattern shows that at some locations ten members are needed. The same type of figure for Rx5Day is shown in Figure 8.

Detecting changes in the size of the variance over time by comparing two dates over the simulation is a problem that we expect to require more statistical power than the problem of characterizing the size of the variance at a given point, as the difference between stochastic quantities is affected by larger uncertainty than the quantities individually considered, unless those are strongly correlated. Figure 9 shows the ensemble size required to detect the same changes in the ensemble variance of TNx that the full ensemble of 40 members detects. Each plot is at the intersection of a column and a row corresponding to two of the dates considered in the previous analysis, indicating that the solution applies to detecting a change in variance between those two dates.
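The variance comparison can be illustrated with a Monte Carlo analogue of the F-test on a variance ratio (a sketch under the normality assumption, with synthetic data; it is not the exact testing procedure applied to the model output):

```python
import random
import statistics

def variance_ratio_pvalue(sample_a, sample_b, sims=2000, seed=0):
    """One-sided test of Var(a) > Var(b) via a Monte Carlo analogue of
    the F-test: simulate the null distribution of the variance ratio for
    normal samples of the same sizes, then return the fraction of
    simulated ratios that exceed the observed one."""
    rng = random.Random(seed)
    f_obs = statistics.variance(sample_a) / statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    exceed = 0
    for _ in range(sims):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_a)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_b)]
        if statistics.variance(a) / statistics.variance(b) >= f_obs:
            exceed += 1
    return exceed / sims

# A 10-member subset with strongly inflated variability vs a
# 40-member "full ensemble" (synthetic, for illustration only)
rng = random.Random(42)
subset = [rng.gauss(0.0, 5.0) for _ in range(10)]
full = [rng.gauss(0.0, 1.0) for _ in range(40)]
p_value = variance_ratio_pvalue(subset, full)
```

With only 10 members in the small sample, the null distribution of the ratio is wide, which is the statistical-power limitation discussed above: modest variance changes will not register as significant.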

Signal-to-Noise considerations
Another aspect that is implicitly relevant to the establishment of a required ensemble size, if the estimation is concerned with the emergence of the forced component or, more generally, with 'detection and attribution'-type analysis, is the signal-to-noise ratio of the quantity of interest. Assuming, as we have done so far, that the quantity of interest can be regarded as the mean μ of a noisy population, the signal-to-noise ratio is defined as S_N = μ/σ, where σ is the standard deviation of the population. A critical threshold, say K, for S_N is usually set at K = 1 or 2, and it is immediate to derive the sample size required for such a threshold to be hit, by computing the value of n that makes μ/(σ/√n) ≥ K, i.e., n ≥ K^2/S_N^2. Figure 11 shows two maps of the spatially varying ensemble sizes required for the signal-to-noise ratio to exceed 1, when computing anomalies at mid- and end-of-century for the warmest night of the year, TNx. The anomalies are computed as five-year mean differences, as in Section 4.1 under RCP8.5, and clearly by the end of the century the entire Earth's surface experiences the emergence of the signal, even by averaging 1 or 2 ensemble members. The map of changes by mid-century is more interesting, as evidently some areas require more statistical power, i.e., a larger n, especially around the polar regions, as expected, where internal variability is significantly larger than at lower latitudes, translating into a smaller S_N.
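The n ≥ K^2/S_N^2 relation translates into a one-liner (illustrative function name and values, not taken from the paper):

```python
import math

def members_needed(signal, sigma, k=1.0):
    """Smallest ensemble size n satisfying signal / (sigma / sqrt(n)) >= k,
    i.e. n >= k**2 / S_N**2 with S_N = signal / sigma."""
    s_n = signal / sigma
    return max(1, math.ceil((k / s_n) ** 2))

# e.g., a mid-century anomaly half the size of internal variability
n = members_needed(signal=0.5, sigma=1.0, k=1.0)  # -> 4 members
```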

Conclusions
In this study we have addressed the need to decide a priori the size of a large ensemble, using an existing five-member ensemble as our guidance. Aware that the optimal size ultimately depends on the purpose the ensemble is used for, and in order to cover a wide range of possible uses, we chose metrics of temperature and precipitation extremes and we considered output at grid-point scale and at various scales of aggregation, up to global averages. We tackled the problem of characterizing forced changes along the length of a transient scenario simulation, and that of characterizing the system's internal variability and its possible changes. By using a high-emission scenario like RCP8.5, but considering behaviors all along the length of the simulations, we are also implicitly addressing a wide range of signal-to-noise magnitudes. Using the availability of existing large ensembles with two different models, CESM1 and CanESM2, we could compare our estimates of the expected errors that a given ensemble size would generate with actual errors, obtained using the full ensembles' estimates as our "truth".
First, we find that for the many uses that we explored, it is possible to put a ceiling on the expected error associated with a given ensemble size by exploiting a small ensemble of 5 members. We estimate the ensemble variance at a given simulation date (e.g., 2000, 2050, or 2095), which is the basis for all our error computations, on the basis of five members, "borrowing strength" by using a window of five years around that date. The results we assess are consistent with assuming that the quantities of interest are normally distributed with standard deviation σ/√n, where σ can be estimated on the basis of the 5 members available: the error estimates, and therefore the optimal sizes computed on the basis of choosing a given tolerance for such errors, provide a safe upper bound to the errors that would be committed for a given ensemble size n. This is true for all metrics considered, both models, and the full range of scales of aggregation. When we use such estimates (later verified by the availability of the actual large ensembles) there appears to be a sweet spot in the range of ensemble sizes that provides accurate estimates for both forced changes and internal variability, consisting of 20 or 25 members. The larger of these sizes also appears approximately sufficient to conduct an estimation of rare events with as low as 0.01 probability of occurrence each year, by fitting a GEV and deriving return levels and their confidence intervals. In most cases (locations around the globe, times along the simulation, and metrics considered) enlarging the sample size beyond 25 members provides only marginal improvement in the confidence intervals, while the central estimate does not change significantly from the one established using 25 members, which accurately approximates that obtained by the full ensemble.
In all cases considered, a much smaller ensemble size of 5 to 10 members, if enriched by sampling along the time dimension (that is, using a 5-year window around the date of interest), is sufficient to characterize the ensemble variability, and its changes along the course of the simulations under increasing greenhouse gases, when found significant using the full ensemble size.
Some caveats are in order. Obviously, the question of how many ensemble members are needed is fundamentally ill-posed, as the answer ultimately and always depends on the most exacting use to which the ensemble is put. One can always find a higher-frequency, smaller-scale metric, and a tighter error bound to satisfy, requiring a larger ensemble size than any previously identified. As tropical-cyclone-permitting and eventually convection-permitting climate model simulations become available, these metrics will be more commonly analyzed. Even for a specific use, the answer depends on the characteristics of internal variability, and the fact that for these two models 5 ensemble members are sufficient to obtain an accurate estimate of it is promising, but no guarantee that 5 are sufficient for all models. In fact, this could also be invalidated by a different experimental exploration of internal variability: new work is exploring different types of initialization, involving ocean states, which could uncover a dimension of internal variability that has so far been under-appreciated. This would likely change our best estimates of internal variability, and with it possibly the ensemble sizes required to accurately estimate it.
With this work, however, we have shown a way to attack the problem "bottom up", starting from a smaller ensemble and building estimates of what would be required for a given problem. One can imagine a more sophisticated set-up in which an ensemble is recursively augmented (rather than assuming a fixed 5-member ensemble as we have done here) in order to approximate the full variability incrementally better. We have also shown that for a large range of questions the size needed is actually well below what we have come to associate with "Large Ensembles". There exist other important sources of uncertainty in climate modeling, one of which is beyond the reach of any single modeling center, having to do with structural uncertainty (Knutti et al., 2010). Adopting the perspective of an individual model, however, parameter settings have at least as important a role as initial conditions. Together with scenario uncertainty, all these dimensions compete for computational resources for their exploration. Our results may offer guidance in choosing how to allocate those resources among these alternative sources of variation.
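The "bottom up" sizing logic described above can be sketched in code. The following is a minimal illustration with synthetic data (the paper's actual analyses use R and the CESM/CanESM output; the trend, noise level, and grid values here are hypothetical placeholders): σ is estimated from a 5-member subset, the a-priori error for larger sizes n is predicted via σ/√n, and the prediction is checked against a bootstrap estimate of the RMSE around the full-ensemble mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an annual, global-mean extreme metric:
# a common forced trend plus member-specific internal variability.
# Both the trend and the noise amplitude are hypothetical.
n_full, n_years = 40, 95
forced = np.linspace(0.0, 4.0, n_years)
ensemble = forced + rng.normal(0.0, 0.3, size=(n_full, n_years))

year = 50                       # one year along the simulation
values = ensemble[:, year]

# "A priori" error formula: sigma estimated from only 5 members.
sigma_5 = values[:5].std(ddof=1)
sizes = [1, 5, 10, 15, 20, 25, 30, 35]
rmse_formula = {n: sigma_5 / np.sqrt(n) for n in sizes}

# Bootstrap check: RMSE of n-member means around the full-ensemble mean.
def bootstrap_rmse(vals, n, n_boot=2000):
    means = np.array([rng.choice(vals, size=n, replace=True).mean()
                      for _ in range(n_boot)])
    return np.sqrt(np.mean((means - vals.mean()) ** 2))

rmse_boot = {n: bootstrap_rmse(values, n) for n in sizes}

for n in sizes:
    print(f"n={n:2d}  formula={rmse_formula[n]:.3f}  bootstrap={rmse_boot[n]:.3f}")
```

In this set-up the marginal gain in precision from adding members beyond 20 or 25 is visibly small, since the formula error shrinks only as 1/√n.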
Code and data availability. The output of the large ensembles is available through the CLIVAR Large Ensemble Working Group webpage, in the archive maintained through the NCAR CESM community project: cesm.ucar.edu/projects/community-projects/MMLEA/.
R code for these analyses is available from the first author on reasonable request.

Table A1. Global mean of TNn as simulated by the CESM ensemble: values of the RMSE in approximating the full ensemble mean by the individual runs (first row, n = 1), and by ensembles of increasingly larger sizes (from 5 to 35, along the remaining rows). The estimates obtained by the bootstrap approach (columns labelled "(B)") are compared to the estimates obtained by the formula σ/√n, where σ is estimated as the ensemble standard deviation using all ensemble members (columns labelled "(F)", where the 95% conf. int. is also shown). We also compare estimates derived by plugging into the formula a value of σ estimated from a subset of 5 ensemble members and 5 years around the year considered (columns labelled "(F-5)").

Table A3. Global mean of TXn as simulated by the CESM ensemble: values of the RMSE in approximating the full ensemble mean by the individual runs (first row, n = 1), and by ensembles of increasingly larger sizes (from 5 to 35, along the remaining rows). The estimates obtained by the bootstrap approach (columns labelled "(B)") are compared to the estimates obtained by the formula σ/√n, where σ is estimated as the ensemble standard deviation using all ensemble members (columns labelled "(F)", where the 95% conf. int. is also shown). We also compare estimates derived by plugging into the formula a value of σ estimated from a subset of 5 ensemble members and 5 years around the year considered (columns labelled "(F-5)").

Table A4. Global mean of Rx1Day as simulated by the CESM ensemble: values of the RMSE in approximating the full ensemble mean by the individual runs (first row, n = 1), and by ensembles of increasingly larger sizes (from 5 to 35, along the remaining rows). The estimates obtained by the bootstrap approach (columns labelled "(B)") are compared to the estimates obtained by the formula σ/√n, where σ is estimated as the ensemble standard deviation using all ensemble members (columns labelled "(F)", where the 95% conf. int. is also shown). We also compare estimates derived by plugging into the formula a value of σ estimated from a subset of 5 ensemble members and 5 years around the year considered (columns labelled "(F-5)").

Table A5. Global mean of TNx as simulated by the CanESM ensemble: values of the RMSE in approximating the full ensemble mean by the individual runs (first row, n = 1), and by ensembles of increasingly larger sizes (from 5 to 40, along the remaining rows). The estimates obtained by the bootstrap approach (columns labelled "(B)") are compared to the estimates obtained by the formula σ/√n, where σ is estimated as the ensemble standard deviation using all ensemble members (columns labelled "(F)", where the 95% conf. int. is also shown). We also compare estimates derived by plugging into the formula a value of σ estimated from a subset of 5 ensemble members and 5 years around the year considered (columns labelled "(F-5)").

Table A8. Global mean of TXx as simulated by the CanESM ensemble: values of the RMSE in approximating the full ensemble mean by the individual runs (first row, n = 1), and by ensembles of increasingly larger sizes (from 5 to 40, along the remaining rows). The estimates obtained by the bootstrap approach (columns labelled "(B)") are compared to the estimates obtained by the formula σ/√n, where σ is estimated as the ensemble standard deviation using all ensemble members (columns labelled "(F)", where the 95% conf. int. is also shown). We also compare estimates derived by plugging into the formula a value of σ estimated from a subset of 5 ensemble members and 5 years around the year considered (columns labelled "(F-5)").

Table A9. Global mean of TXn as simulated by the CanESM ensemble: values of the RMSE in approximating the full ensemble mean by the individual runs (first row, n = 1), and by ensembles of increasingly larger sizes (from 5 to 40, along the remaining rows). The estimates obtained by the bootstrap approach (columns labelled "(B)") are compared to the estimates obtained by the formula σ/√n, where σ is estimated as the ensemble standard deviation using all ensemble members (columns labelled "(F)", where the 95% conf. int. is also shown). We also compare estimates derived by plugging into the formula a value of σ estimated from a subset of 5 ensemble members and 5 years around the year considered (columns labelled "(F-5)").

Table A10. Global mean of Rx1Day as simulated by the CanESM ensemble: values of the RMSE in approximating the full ensemble mean by the individual runs (first row, n = 1), and by ensembles of increasingly larger sizes (from 5 to 40, along the remaining rows). The estimates obtained by the bootstrap approach (columns labelled "(B)") are compared to the estimates obtained by the formula σ/√n, where σ is estimated as the ensemble standard deviation using all ensemble members (columns labelled "(F)", where the 95% conf. int. is also shown). We also compare estimates derived by plugging into the formula a value of σ estimated from a subset of 5 ensemble members and 5 years around the year considered (columns labelled "(F-5)"). Results are shown for four individual years along the simulation (column-wise), since σ varies along it.

Table B1. Percentage of the global, land, or ocean surface where the actual errors exceed the errors estimated "a priori" on the basis of the formula, using 5 ensemble members to estimate σ. Results for all temperature extreme metrics, derived from the CESM ensemble, whose full size is 40 members. Calculations apply cosine-of-latitude weighting. Results for TNx are summaries of the behavior shown in Figure 3, i.e., the fraction of the surface represented by locations where the error ratio is larger than 100%. Numbers under small n's are affected by noise, as we randomly choose n members from the full ensemble only once. As can be gauged, the decreasing behavior of the fractions stabilizes for n ≥ 15.

Table B2. Percentage of the global, land, or ocean surface where the actual errors exceed the errors estimated "a priori" on the basis of the formula, using 5 ensemble members to estimate σ. Results for all temperature extreme metrics, derived from the CanESM ensemble, whose full size is 50 members. Calculations apply cosine-of-latitude weighting. Numbers under small n's are affected by noise, as we randomly choose n members from the full ensemble only once. As can be gauged, the decreasing behavior of the fractions stabilizes for n ≥ 15.

Table B3. Percentage of the global, land, or ocean surface where the actual errors exceed the errors estimated "a priori" on the basis of the formula, using 5 ensemble members to estimate σ. Results for the two precipitation extreme metrics, derived from the CESM ensemble, whose full size is 40 members. Calculations apply cosine-of-latitude weighting. Results for Rx5Day are summaries of the behavior shown in Figure 4, i.e., the fraction of the surface represented by locations where the error ratio is larger than 100%. Numbers under small n's are affected by noise, as we randomly choose n members from the full ensemble only once. As can be gauged, the decreasing behavior of the fractions stabilizes for n ≥ 15.
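The area fractions reported in Tables B1-B3 can be computed as in the following sketch, where the grid and the per-cell "actual" and "formula" errors are random placeholders (the real calculation would use the mapped error fields for each metric): each grid cell is weighted by the cosine of its latitude, and the weighted fraction of cells where the error ratio exceeds 100% is summed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical lat-lon grid and per-cell error fields (placeholders for
# the actual error and the a-priori formula error at each location).
lats = np.linspace(-89.0, 89.0, 90)
lons = np.linspace(0.0, 357.5, 144)
actual_err = rng.gamma(2.0, 0.5, size=(lats.size, lons.size))
formula_err = rng.gamma(2.0, 0.5, size=(lats.size, lons.size))

# Cosine-of-latitude weights, broadcast across longitudes and normalized
# so the weights sum to one over the whole grid.
w = np.cos(np.deg2rad(lats))[:, None] * np.ones((1, lons.size))
w /= w.sum()

# Weighted fraction of the surface where the actual error exceeds the
# a-priori one, i.e. where the error ratio is larger than 100%.
exceed = actual_err > formula_err
fraction = float((w * exceed).sum())
print(f"weighted exceedance fraction: {fraction:.3f}")
```

A land or ocean version of the same quantity would simply restrict the weights to the corresponding mask before normalizing.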