End users studying impacts and risks caused by human-induced climate change are often presented with large multi-model ensembles of climate projections whose composition and size are arbitrarily determined. An efficient and versatile method that finds a subset which maintains certain key properties from the full ensemble is needed, but very little work has been done in this area. Therefore, users typically make their own somewhat subjective subset choices and commonly use the equally weighted model mean as a best estimate. However, different climate model simulations cannot necessarily be regarded as independent estimates due to the presence of duplicated code and shared development history.

Here, we present an efficient and flexible tool that makes better use of the ensemble as a whole by finding a subset with improved mean performance compared to the multi-model mean while at the same time maintaining the spread and addressing the problem of model interdependence. Out-of-sample skill and reliability are demonstrated using model-as-truth experiments. This approach is illustrated with one set of optimisation criteria but we also highlight the flexibility of cost functions, depending on the focus of different users. The technique is useful for a range of applications that, for example, minimise present-day bias to obtain an accurate ensemble mean, reduce dependence in ensemble spread, maximise future spread, ensure good performance of individual models in an ensemble, reduce the ensemble size while maintaining important ensemble characteristics, or optimise several of these at the same time. As in any calibration exercise, the final ensemble is sensitive to the metric, observational product, and pre-processing steps used.

Multi-model ensembles are an indispensable tool for future climate projection and the quantification of its uncertainty. However, due to a paucity of guidelines in this area, it is unclear how best to utilise the information from climate model ensembles consisting of multiple imperfect models with a varying number of ensemble members from each model. Heuristically, we understand that the aim is to optimise the ensemble performance and reduce the presence of duplicated information. For such an optimisation approach to be successful, metrics that quantify performance and duplication have to be defined. While there are examples of attempts to do this (see below), there is little understanding of the sensitivity of the result of optimisation to the subjective choices a researcher needs to make when optimising.

As an example, the equally weighted multi-model mean (MMM) is most often used
as a “best” estimate for variable averages

Instead of accounting for this dependence problem, most studies use whatever
models and ensembles they can get and solely focus on selecting ensemble
members with high individual performance (e.g.

Given that climate models developed within a research group are prone to
share code and structural similarities, having more than one of those models
in an ensemble will likely lead to duplication of information. Institutional
democracy as proposed by

Only a few studies have been published that attempt to account for dependence
in climate model ensembles. A distinction can be made between approaches that
select a discrete ensemble subset and those that assign continuous weights to
the ensemble members. For example,

Another method also using continuous weights but considering climatologies
rather than time series was proposed by

In the previous paragraph we discussed approaches that assign continuous
weights to model runs. Regional dynamical downscaling presents a slightly
different problem to the one stated above, as the goal of regional climate
models is to obtain high-resolution climate simulations based on lateral
boundary conditions taken from global climate models (GCMs) or reanalyses

The problem of defining and accounting for dependence is made more
challenging by the fact that there is no uniformly agreed definition of
dependence. A canonical statistical definition of independence is that two
events A and B are considered to be independent if the occurrence of B does
not affect the probability of A,

One disadvantage of many of these studies is that they are technically challenging to implement and therefore discourage frequent use. Further, the sensitivity of each approach to the choice of metrics used, variables included, and uncertainties in observational products is largely unexplored. This leads to a lack of clarity and consensus on how best to calibrate an ensemble for a given purpose. Often, out-of-sample performance has not been tested, which we consider essential when looking at ensemble projections.

The aim of this study is to present a novel and flexible approach that selects an optimal subset from a larger ensemble archive in a computationally feasible way. Flexibility is introduced by an adjustable cost function, which allows this approach to be applied to a wide range of problems. The meaning of “optimal” varies depending on the aim of the study. As an example, we will choose a subset of the CMIP5 archive that minimises regional biases in present-day climatology, based on RMSE over space against a single observational product. The resulting ensemble subset will be optimal in the sense that its ensemble mean gives the lowest possible RMSE against this observational product of any possible combination of model runs in the archive. The more independent estimates we have, the more their errors tend to cancel, resulting in smaller present-day biases and a reduced need for bias correction. An approach with binary (0/1) rather than continuous weights is desirable because it yields a smaller subset that can drive regional models for impact studies, a task that is otherwise computationally expensive. More precisely, it is the zero weights that lead to some models being discarded from the ensemble. Out-of-sample skill of the optimal subset mean and spread is tested using model-as-truth experiments. The distribution of projections using model runs in the optimal subset is then assessed.
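The binary-weight subset selection described above can be illustrated with a deliberately tiny brute-force sketch. The study itself uses the Gurobi solver; all function names and the synthetic data here are our own, and exhaustive search is only feasible for small ensembles:

```python
import itertools
import numpy as np

def ensemble_mean_rmse(runs, obs, subset):
    """RMSE of the mean of the selected runs against the observations."""
    mean = runs[list(subset)].mean(axis=0)
    return float(np.sqrt(np.mean((mean - obs) ** 2)))

def optimal_subset(runs, obs, size):
    """Exhaustively find the subset of a given size whose ensemble mean
    has the lowest RMSE against the observations."""
    best = min(itertools.combinations(range(len(runs)), size),
               key=lambda s: ensemble_mean_rmse(runs, obs, s))
    return best, ensemble_mean_rmse(runs, obs, best)

# Toy example: 6 synthetic "model runs", each with random noise plus an
# individual constant bias, so that biases of different runs can cancel
rng = np.random.default_rng(0)
obs = rng.normal(size=100)
runs = obs + rng.normal(scale=0.5, size=(6, 100)) \
           + rng.normal(scale=0.3, size=(6, 1))

subset, rmse = optimal_subset(runs, obs, size=3)
```

For the 81 runs used in the study an exhaustive search over all subsets is intractable, which is why the paper formulates the selection as a mathematical optimisation problem for a dedicated solver.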

We then examine the sensitivity of this type of result to choices of the cost function (by adding additional terms), variable, and constraining dataset. We argue that optimally selecting ensemble members for a set of criteria of known importance to a given problem is likely to lead to more robust projections for use in impact assessments, adaptation, and mitigation of climate change.

This approach is not meant to replace or supersede any of the existing approaches in the literature. Just as there is no single best climate model, there is no universally best model weighting approach. Whether an approach is useful depends on the criteria that are relevant for the application in question. Only once the various ensemble selection approaches have been tailored to a specific use case can a fair comparison be made. Flexibility in ensemble calibration, through the choice of the cost function to be minimised and of the metric used, is key to this process.

In the following section, we introduce the model data and observational
products used throughout this study. Section

We use 81 CMIP5 model runs from 38 different models and 21 institutes that are available in the historical period (1956–2013; RCP4.5 after 2005) and the RCP4.5 and RCP8.5 period (2006–2100); see Table S1 in the Supplement. We examine gridded monthly surface air temperature (variable: tas) and total monthly precipitation (variable: pr). Results shown here are based on raw model data (absolute values), although repeat experiments using anomalies (obtained by subtracting the global mean climatological value from each grid cell) were also performed (not shown here).
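The anomaly pre-processing mentioned above, subtracting the global mean climatological value from each grid cell, can be sketched as follows. The function name and the assumption of a regular latitude–longitude grid (area weighting by the cosine of latitude) are ours:

```python
import numpy as np

def anomaly_climatology(clim, lats):
    """Subtract the area-weighted global mean from a (lat, lon) climatology."""
    w = np.cos(np.deg2rad(lats))                      # latitude area weights
    gmean = np.average(clim.mean(axis=1), weights=w)  # global mean value
    return clim - gmean
```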

Multiple gridded observation products for each variable were considered with
each having different regions of data availability (see Table S2 and
additional results in the Supplement). Model and observation data were remapped using
a first-order conservative remapping procedure

We first illustrate the technique by considering absolute surface air
temperature and total precipitation climatologies (time means at each
grid cell) based on 1956–2013. The land-only observational product, CRUTS
version 3.23

Next, we select an ensemble subset of size

Figure

Size of the CMIP5 subset on the horizontal axis and the resulting
RMSE of the ensemble mean and its improvement relative to the multi-model
mean (MMM) on the vertical axes for surface air temperature

For the optimal ensemble (black circles), RMSE is initially large, the value
representative of the single best performing model run (black dot being
behind the green one). The RMSE of the ensemble mean rapidly decreases when
more model runs are included until it reaches a minimum (red circle indicates
the optimal subset over all possible ensemble sizes). That is, the RMSE
improvement relative to the MMM (solid horizontal line) is largest at this
ensemble size. One could investigate defining the effective number of
independent models for a given application based on the optimal ensemble size

Note that as the selection of one ensemble member depends on the remaining members in the ensemble, the optimal subset is sensitive to the original set of model runs that we start out with. So, if members are added to or removed from the original ensemble, then the optimal subset is likely going to change. Any subset selection approach that does not make use of the available information about the original ensemble is most likely not optimal.

Another characteristic of the optimal ensemble is that there is not
necessarily any ensemble member consistency (with increasing subset size).
There are other methods which do maintain this consistency (e.g.

Of the three subsampling approaches, it is evident that the optimal ensemble
mean is the best performing one for all ensemble sizes if the bias of the
model subset average should be minimised – essentially indicating that the
solver is working as anticipated. Regional biases in different models cancel
out most effectively using this approach. Across different observational
products, we observe an improvement in RMSE relative to the MMM of between
10 and 20 % for surface air temperature and around 12 % for total precipitation
(see Figs. S1 and S2 in the Supplement). The size of the optimal subset is significantly
smaller than the total number of model runs considered in this study (see red
text in Fig.

We achieve similar RMSE improvement if we exclude closely related model runs a priori and start off with a more independent set of model runs (one model run per institute; see Fig. S3).

Figure

We now develop this optimisation example to highlight the flexibility of the
method. In doing so, it should become clear that calibration for performance
and dependence is necessarily problem dependent. A graphical representation
of the experimental choices we explore is shown in Fig.

The dependence (in terms of average pairwise error correlation across all possible model pairs in the ensemble) is plotted against the performance (in terms of RMSE) for three different sampling techniques. It is based on surface air temperature, and CRUTS3.23 is used as the observational product. For the circular markers, the mean of model–observation distances within the ensemble is plotted against the mean of pairwise error correlations for the individual members within an ensemble for a certain ensemble size. The diamonds are used to show the RMSE of the ensemble mean (rather than the mean RMSE of the individual members) compared to the observational product. The values on the vertical axis are the same as for the circular markers. The larger the ensemble size, the darker the fill colour. The red dotted line indicates the lowest RMSE for the optimal ensemble (based on the ensemble mean).
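The dependence diagnostic described here, the average pairwise error correlation across all model pairs, can be sketched as follows. The function name is ours; `runs` holds flattened model fields and `obs` the matching observational field:

```python
import itertools
import numpy as np

def mean_pairwise_error_correlation(runs, obs):
    """Average Pearson correlation of the error fields (model minus
    observations) over all pairs of ensemble members."""
    errors = runs - obs  # shape (n_members, n_gridpoints)
    corrs = [np.corrcoef(errors[i], errors[j])[0, 1]
             for i, j in itertools.combinations(range(len(runs)), 2)]
    return float(np.mean(corrs))
```

Two members sharing the same bias pattern yield a pairwise error correlation near 1, flagging duplicated information even if both perform equally well individually.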

Graphical representation of the method for this study and its flexibility. The different colours are used for three sections in this publication: Data, Method, and Results.

The ensembles in the previous subsection were calibrated on a single
observational product (depicted in green in Fig.

Here, we only optimise our ensemble to one observational product at a time and investigate how sensitive the optimal subset is to that choice.

The selection of the variable has a profound influence on the resulting
optimal subset. This was already briefly highlighted in Fig.

The presented approach can obtain an optimal subset for any given variable, as long as it is available across all model runs and trustworthy observational products exist. One might even consider using process-oriented diagnostics to give us more confidence in selecting a subset for the right physical reasons.

Results presented in this study are all based on absolute values rather than anomalies. Whether or not bias correction is required depends on the variable and the aim of the study. To study Arctic sea ice extent, for example, absolute values are a natural choice, as there is a clear threshold for near ice-free September conditions. Bias correction is necessary, by contrast, in the field of extreme weather: mean biases between datasets must be removed before exceedance probabilities beyond some extreme reference anomaly can be calculated.
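The mean-bias removal step mentioned for extreme-weather applications can be sketched minimally as follows. This is a simple mean-shift correction with names of our own choosing; real applications often use more elaborate methods such as quantile mapping:

```python
import numpy as np

def exceedance_probability(model, obs, threshold_anomaly):
    """Fraction of model values whose anomaly (relative to the observed
    mean) exceeds a threshold, after removing the mean bias between the
    model series and the observations."""
    corrected = model - (model.mean() - obs.mean())  # mean-bias removal
    anomalies = corrected - obs.mean()               # anomalies w.r.t. obs mean
    return float(np.mean(anomalies > threshold_anomaly))
```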

As part of the data pre-processing step, we computed climatologies
(time means at each grid cell) for the model output and observational
dataset. In addition to climatologies, we decided to consider time-varying
diagnostics (“trend” and “space

To assess whether our optimal subset has improved skill, we need to define a
benchmark. In Fig.

An essential part of the optimisation problem is the cost function.
Comparison of all the sensitivities mentioned above is made possible only
because our subsets are truly optimal with respect to the prescribed cost
function. For the results above, the cost function

Reasons to use ensembles of climate models are manifold, which goes hand in
hand with the need for an ensemble selection approach with an adjustable cost
function. Note that we do not consider the MSE of the ensemble mean as the
only appropriate optimisation target for all applications. Even though it has
been shown that the multi-model average of present-day climate is closer to
the observations than any of the individual model runs (e.g.

Of course this cost function can and should be adjusted depending on the aim
of the study, as long as the expressions are either linear or quadratic. To
illustrate this idea, we add two new terms to the cost function above that
account for different aspects of model interdependence:

The function

Based on the climatological metric, Gurobi can solve Eq. (

The cost function presented in this study solely uses MSE as a performance
metric. There are of course many more metrics available
(e.g.

For those concerned about overconfidence of the ensemble projections (due to the “unknown unknowns”), one could add another term which maximises future spread. This would result in an ensemble which allows us to explore the full range of model responses. It is also possible to start weighting the terms of the cost function differently depending on what seems more important.
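Putting the pieces discussed above together, a composite cost function could look schematically like the following. This is a sketch only, not the exact formulation of the paper: $x_j \in \{0,1\}$ are the binary selection weights, $K$ the subset size, $m_j$ the model fields, $o$ the observations, $e_j = m_j - o$ the error fields, $\rho$ the pairwise error correlation, and $\alpha$, $\beta$ tunable coefficients:

```latex
J(x) \;=\;
  \underbrace{\mathrm{MSE}\!\Big(\tfrac{1}{K}\textstyle\sum_j x_j m_j,\; o\Big)}_{\text{ensemble-mean bias}}
  \;+\; \alpha \underbrace{\textstyle\sum_j x_j\,\mathrm{MSE}(m_j,\, o)}_{\text{individual performance}}
  \;+\; \beta \underbrace{\textstyle\sum_{j<k} x_j x_k\, \rho(e_j,\, e_k)}_{\text{pairwise dependence}},
\qquad \textstyle\sum_j x_j = K .
```

All terms are linear or quadratic in the binary weights $x_j$, which is what keeps the problem tractable for a mixed-integer quadratic solver.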

The optimal selection approach is clearly successful at cancelling out
regional biases in the historical period for which observations are available.
We refer to this period as “in-sample”. Is a model that correctly simulates
the present-day climatology automatically a good model for future climate
projections? To answer this question, we need to investigate if regional
biases persist into the future and determine whether the approach is fitting
short-term variability. In other words, we have to ensure that our
subset selection approach is not overfitting on the available data in-sample,
which can potentially lead to spurious results out-of-sample. This is done by
conducting model-as-truth experiments. This should give an indication of
whether sub-selecting in this way is likely to improve future predictability
or if we are likely to be overconfident with our subset. Rigid model tuning,
for example, could cause the ensemble to be heavily calibrated on the
present-day state. An optimal subset derived from such an ensemble would not
necessarily be skillful for future climate prediction as we are dealing with
overfitting and we are not calibrating to biases that persist into the
future. This is exactly where model-as-truth experiments come into play. For
this purpose, one simulation per institute is considered to be the “truth”
as though it were observations, and then the optimal subset from the
remaining 20 runs (one per institute) is determined for the in-sample
period (1956–2013) based on the cost function in Eq. (
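The model-as-truth protocol described above can be sketched as follows, using brute-force selection in place of the solver; array shapes and names are ours:

```python
import itertools
import numpy as np

def subset_mean_rmse(runs, target, subset):
    """RMSE of the mean of the selected runs against a target field."""
    mean = runs[list(subset)].mean(axis=0)
    return float(np.sqrt(np.mean((mean - target) ** 2)))

def model_as_truth(hist, future, size):
    """For each run in turn, treat it as the 'truth': select the optimal
    subset of the remaining runs in-sample (historical period), then
    evaluate that subset out-of-sample (future period) against the same
    truth run."""
    out_of_sample = []
    n = len(hist)
    for t in range(n):
        pool = [i for i in range(n) if i != t]
        best = min(itertools.combinations(pool, size),
                   key=lambda s: subset_mean_rmse(hist, hist[t], s))
        out_of_sample.append(subset_mean_rmse(future, future[t], best))
    return out_of_sample
```

A subset that only fits in-sample noise will show no out-of-sample RMSE improvement over a random benchmark, which is exactly the overfitting symptom these experiments are designed to expose.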

Results of the model-as-truth experiment based on three different
metrics

Figure

Figure

It is evident that both the climatological metric and the space

The number of times the “model-as-truth” is within the
10th–90th percentile of ensemble spread (defined by the optimal subset for a
given size) averaged across all “truths” is plotted against the subset
size.

The trend metric is different, however. To be clear, here we obtain the
optimal subset based on a two-dimensional field with linear (58-year) trends
at each grid cell in the in-sample period. We then use this subset trained on
trend values to predict the out-of-sample trend field (using the same
simulation as “truth” as in the in-sample period). The RMSE improvement
presented in Fig.
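Computing the per-grid-cell linear trend field used by the trend metric can be sketched as follows, assuming a regular `(time, lat, lon)` array; the function name is ours:

```python
import numpy as np

def trend_field(data, years):
    """Linear trend (units per year) at each grid cell of a
    (time, lat, lon) array, via a least-squares fit in time."""
    nt = data.shape[0]
    flat = data.reshape(nt, -1)              # (time, gridcells)
    slopes = np.polyfit(years, flat, 1)[0]   # first row holds the slopes
    return slopes.reshape(data.shape[1:])
```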

Figure

Results for the ensemble spread are shown in Fig.

Such a model-as-truth experiment can also assist with the choice of an optimal subset size for the application to projections. It does not necessarily have to be the same as the in-sample ensemble size, as aspects like mean skill improvement and reduction of the risk of underdispersion have to be considered.

Similar to Fig.

Can a subset calibrated on absolute historical temperature constrain
temperature

This result is partly about the discrepancy between the metric used to derive the optimal ensemble and that used to evaluate it and reinforces how sensitive this type of calibration exercise is to the somewhat subjective choices faced by a researcher trying to post-process climate projections. It is an important limitation that should be kept in mind when using this sampling strategy to constrain future projections.

In earlier sections we presented results based on a single observational
product per variable. However, the importance of the choice of product should
not be neglected. The influence of obtaining an optimal subset based on
different observational products can be visualised with maps. To create
Fig.

We presented a method that selects a CMIP5 model subset which minimises a given cost function in a computationally efficient way. Such a calibrated smaller ensemble has important advantages over the full ensemble of opportunity: reduced computational cost when driving regional models; smaller present-day biases, which reduce the need for bias correction; reduced dependence between the members; and sufficient spread in projections. The cost function can be varied depending on the application; the simplest one presented here minimises the biases of the ensemble mean. We have shown that this method accounts to some degree for model dependence in the ensemble through the way it optimises the ensemble mean, but closely related models, or even initial-condition members of the same model, are not penalised explicitly and can still occur. This optimal subset performs significantly better than a random ensemble or an ensemble selected solely on performance. The performance-ranked ensemble sometimes even performs worse than the random ensemble in its mean, even though its individual members of course perform better. Depending on the application, one or the other will matter more.

We also illustrated the expansion of the cost function to optimise additional criteria, enabling an optimal subset that minimises the ensemble mean bias, the individual model biases, the clustering of the members, or any combination thereof. One could also, for example, add a term that maximises the ensemble projection spread to avoid overconfidence. The choice of what is constrained by the cost function clearly depends on the aim of the study (e.g. present-day bias, dependence issue, future spread). We highlight the importance of testing the sensitivity to the metric and observational product used (including varying data availability), as they can lead to quite different results.

The difference between the multi-model mean (81 runs; first averaged across initial condition members and then averaged across 38 models) and the optimal subset is shown for the RCP8.5 surface air temperature change between (2081–2100) and (1986–2005). The optimal subset is different depending on which observational product is used. The grey contours outline the region which was used to obtain the optimal subset in the historical period. The optimal ensemble size for each observational product is given in the title of each map.

Same as Fig.

The lasso regression analysis method

Model-as-truth experiments were used to investigate the potential for overconfidence, estimate the ensemble spread, and test the robustness of emergent constraints. Based on those experiments we learned that absolute present-day values constrain absolute values in the future (due to a persistent bias). However, absolute present-day values do not constrain projected changes relative to a present-day state.

There were other pertinent questions we did not address, of course. These include the question of how best to create an optimal subset across multiple variables and gridded observational products. This seems especially important if physical consistency across variables should be maintained. Having a Pareto set of ensembles (by optimising each variable separately) rather than a single optimal subset is a potential solution, but is clearly more difficult to work with.

Using model-as-truth experiments, we observed that the skill of the optimal subset relative to the unweighted ensemble mean decreases the further out-of-sample it is tested. This breakdown of predictability is not unexpected, as the climate system reaches a state it has never experienced before. This is certainly an interesting aspect that should be investigated in more depth in a future study.

Many of the points raised here are also clearly not restricted to GCMs. The same holds for regional climate models, hydrological models, and perhaps ecological models. We encourage others to apply the same approach to different kinds of physically based models.

Critically, we wish to reinforce that accounting for dependence is essentially a calibration exercise, whether through continuous or discrete weights, as was the case here. Depending on the cost function, the data pre-processing, and the observational product, one can end up with a differently calibrated ensemble. Depending on the application, bias correction of the model output might be appropriate before executing the calibration exercise. We suggest that the approach introduced in this study is an effective and flexible way to obtain an optimal ensemble for a given specified use case.

Future research will help to provide confidence in this method and enable researchers to go beyond model democracy or arbitrary weighting. This is especially important as replication and the use of very large initial condition ensembles will likely become a larger problem in future global ensemble creation exercises. An approach that attempts to reduce regional biases (and therefore indirectly dependence) offers a more plausible and justifiable projection tool than an approach that simply includes all available ensemble members.

A simplified and easily adjustable Python code (based on
the Gurobi interface) is accessible on a GitHub repository
(

CMIP5 data can be obtained from

The supplement related to this article is available online at:

NH conducted the analysis, produced the figures, and prepared the paper. GA came up with the core idea of ensemble selection to minimise regional biases, discussed results, and helped in writing the paper. RK contributed to discussions on the methodology and results and helped in writing the paper. OA helped shape the methodology and contributed to the interpretation of results. KL provided support while writing the Python code for the mathematical solver Gurobi. BMS provided useful discussions and feedback which helped shape this work.

The authors declare that they have no conflict of interest.

We would like to thank Jan Sedláček for providing access to the next-generation CMIP5 archive based at ETHZ. We are also grateful to Steve Sherwood and Ruth Lorenz for interesting discussions which helped shape this study.

We acknowledge the support of the Australian Research Council Centre of Excellence for Climate System Science (CE110001028).

The authors acknowledge support from the H2020 project CRESCENDO, “Coordinated Research in Earth Systems and Climate: Experiments, kNowledge, Dissemination and Outreach”, which received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 641816.

We acknowledge the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modelling groups (listed in Table S1) for producing and making available their model output. For CMIP the US Department of Energy's Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led the development of software infrastructure in partnership with the Global Organization for Earth System Science Portals.

Edited by: Fubao Sun

Reviewed by: Martin Leduc and two anonymous referees