General comments
In this paper the authors develop a framework to evaluate the process of selecting a subset of simulations from a large multi-model ensemble (CMIP5). This subject is very important because climate model data users often need to work with only a small subset of simulations, given limited resources for data processing. However, research groups often face many difficulties when selecting simulations because there is no widely accepted approach for doing so, and the appropriate choice depends strongly on the climate application. Overall, I found the paper to be insightful and well written, and it should ultimately be considered for publication in ESD after the following minor modifications.
The paper introduces three approaches to select a subset of N simulations from a larger ensemble. These approaches are compared within a common benchmark, namely the equally weighted ensemble average of the historical climatology over the selected simulations. The first selection method consists of randomly choosing an N-subset of simulations. This procedure can be repeated several times in order to cover the range of uncertainty associated with a random selection, for instance in terms of the error of the ensemble mean relative to the observed climatology. The second method is the "performance ranking ensemble", in which the N best simulations are selected (according to their individual error in reproducing the observed climatology). The third approach is the "optimal ensemble", by which simulations are selected by minimizing a cost function based on three terms: (1) the mean square error of the ensemble mean, (2) the mean square error of individual members, and (3) a measure of model dependence.
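For concreteness, the three strategies can be sketched as follows. This is a minimal toy illustration with invented data; the function names and, in particular, the exact form and sign conventions of the three-term cost are my own assumptions, not the authors' formulation.

```python
# Toy sketch of the three selection strategies (invented data and names,
# not the authors' implementation).
import itertools
import numpy as np

rng = np.random.default_rng(0)
M, P, N = 10, 50, 3                # models, grid points, subset size
sims = rng.normal(0, 1, (M, P))    # toy "model climatologies"
obs = rng.normal(0, 1, P)          # toy "observed climatology"

def cost(idx, w=(1.0, 1.0, 1.0)):
    """Assumed 3-term cost: (1) MSE of the ensemble mean, (2) mean MSE of
    individual members, (3) mean pairwise MSE between members as a
    dependence measure (entered with a negative sign so that larger
    inter-member differences, i.e. more independence, lower the cost)."""
    sub = sims[list(idx)]
    f1 = np.mean((sub.mean(axis=0) - obs) ** 2)
    f2 = np.mean((sub - obs) ** 2)
    f3 = np.mean([np.mean((sub[i] - sub[j]) ** 2)
                  for i, j in itertools.combinations(range(len(idx)), 2)])
    return w[0] * f1 + w[1] * f2 - w[2] * f3

# Strategy 1: random subset (repeated draws sample the uncertainty range).
random_idx = tuple(rng.choice(M, N, replace=False))

# Strategy 2: performance ranking (the N individually best members).
ranked_idx = tuple(np.argsort(np.mean((sims - obs) ** 2, axis=1))[:N])

# Strategy 3: optimal ensemble (exhaustive minimization of the cost).
optimal_idx = min(itertools.combinations(range(M), N), key=cost)
```

The exhaustive search over all subsets is only feasible for small ensembles; the paper presumably uses a more efficient search, but the structure of the comparison is the same.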
An important result in this study is that the ensemble mean of a performance ranking ensemble performs poorly compared with that of an optimal ensemble, and is only comparable with the mean of randomly selected ensembles. This is because selecting only the best simulations leaves common biases among them, which do not cancel through the averaging procedure. The optimal ensemble, on the other hand, minimizes the error of the ensemble mean, discards poorly performing models and maximizes the cancellation of errors among simulations. Hence, more independence can be expected between simulations of the same optimal ensemble, whereas several initial-condition members of a single good model can end up in the same performance ranking ensemble. The paper shows convincingly that the optimal ensemble is the best approach compared with both the random and ranking selections. The optimal ensemble is selected through a flexible cost function, in which the existing terms can be weighted and new terms can be added (e.g. maximizing the spread among future projections to reduce overconfidence).
I think this paper provides new insights into the impact of selecting a subset of simulations from a large ensemble. However, one weak point is the lack of information about how practitioners should use this tool in real life, and what an "optimal ensemble" actually means in that context. For instance, consider the example of regional climate modelling given in the manuscript. Assume a group can only afford to dynamically downscale 5 GCM simulations with their RCM and has defined its own cost function. Suppose they use the current tool to optimally select a 5-GCM ensemble to downscale with their RCM, and a few months after starting the RCM simulations they discover a bug in the experiment of GCM #5, which must be discarded. It is very likely that GCMs #1-4 will no longer form an optimal subset of size 4. A similar situation would arise if they realized they could afford one more simulation: they would need to select an additional GCM, and the new 6-GCM ensemble would not be optimal either. The concept of an optimal ensemble implies that the selection of one member depends on the other ensemble members. I think this is an important limitation in the applicability of the current method to real-life situations. Moreover, the fact that for each ensemble size there is a ranking of several ensembles is difficult to interpret. It seems to me that for slight differences in RMSE and in the cost function, many other different ensembles are possible. It should therefore be explained in more detail how practitioners should deal with the complexity of coexisting, similarly optimal ensembles.
Another point that should be improved is the last part of the introduction, which does not clearly explain the structure of the paper (as there are many subsections) or what it aims to achieve. A few explanations in the text are also unclear or lack detail. See specific comments below.
Specific comments
- P3L19-21 "Regional dynamical downscaling presents a slightly different problem to the one stated above, as the goal is to find a small subset that reproduces certain statistical characteristics of the full ensemble." I know what the authors mean, but this paragraph lacks context about regional climate models, whose goal is to obtain high-resolution climate simulations based on lateral boundary conditions taken from GCMs or reanalyses. See for instance:
+ Laprise, R. (2008) Regional Climate Modelling, Journal of Computational Physics, 227(7), 3641-3666. http://dx.doi.org/10.1016/j.jcp.2006.10.024
- P4L4-34 I think there are too many technical details in the last part of the introduction. As said previously, it should better explain the overall structure of the document. For instance, the three sub-sampling strategies are not explicitly mentioned here. Moreover, since the paper contains many subsections, giving the general plan in the introduction would help the reader.
- P5L24 How did the authors determine that 100 iterations were enough? Would the error bars change much if one used 200 or 1000 iterations instead?
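One simple way to check this would be to compare the spread of the ensemble-mean error across random draws for increasing numbers of iterations and verify that it has stabilized. A toy sketch (my own invented data and names, not the authors' code):

```python
# Convergence check for the number of random-selection iterations
# (toy data; a real check would use the actual model errors).
import numpy as np

rng = np.random.default_rng(42)
M, P, N = 38, 100, 5               # models, grid points, subset size
sims = rng.normal(0, 1, (M, P))
obs = rng.normal(0, 1, P)

def ensemble_mean_rmse(idx):
    """RMSE of the equally weighted ensemble mean of a subset."""
    return np.sqrt(np.mean((sims[list(idx)].mean(axis=0) - obs) ** 2))

def error_spread(n_iter):
    """5th-95th percentile range of ensemble-mean RMSE over random N-subsets."""
    rmses = [ensemble_mean_rmse(rng.choice(M, N, replace=False))
             for _ in range(n_iter)]
    return np.percentile(rmses, 95) - np.percentile(rmses, 5)

# If the spread is already stable at 100 iterations, 100 is enough.
spreads = {n: error_spread(n) for n in (100, 200, 1000)}
```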
- P6 What is the reason for the drop between 30 and 35 members for the performance ranking ensemble? Could it be due to the fact that some models have several members?
- P6L31 Is there any relationship between the minimum of the optimal ensemble curve (in Fig. 1; between 5 and 8 members for temperature and around 12 for precipitation) and the effective number of independent models in the ensemble?
- P10L17 Why did the authors choose f3 to be the pairwise MSE rather than the pairwise correlation of errors as shown in Figure 2?
- P10L20 "This is a way to address dependence in ensemble spread." It would be worth adding here that it also prevents selecting several members from the same model.
- P10L27 "the members of the optimal ensemble": I guess this refers to the 3-term ensemble, but it is not clear. Also, regarding "have a better average performance": it is not at all clear in Figure 1 that the 3-term optimal selection is better than the 1-term one; the RMSE of the 3-term selection even seems slightly higher (the triangles are a bit above the circles).
- P11L19 What do the authors exactly mean by "whether the approach is fitting short term variability" ?
- P12L1-2 The different metrics (trend and space+time) should be defined more explicitly.
- P12L14 "all available runs": does this mean 81 runs or one per institute? Please clarify here and elsewhere in the text.
- P13L31 "mean of all 81 model runs": as the authors use this example very often in the paper, I think it should be stated somewhere that averaging all models and realizations in a CMIP experiment is bad practice, because it arbitrarily gives more weight to the models represented by the largest number of members. It should also be made clearer why the authors use this benchmark rather than simply the multi-model ensemble mean based on one realization per model (i.e. 38 models).
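To make the weighting concern concrete, here is a toy arithmetic sketch (the numbers are invented for illustration, not taken from the paper): a model contributing many realizations pulls the all-runs mean toward itself, whereas one realization per model weights every model equally.

```python
# Toy illustration of implicit weighting when averaging all runs
# (invented values, not CMIP5 data).
import numpy as np

# Model A contributes 10 realizations, models B and C one each.
runs = {"A": [2.0 + 0.01 * i for i in range(10)], "B": [1.0], "C": [0.0]}

# Averaging all 12 runs weights model A ten times more heavily...
mean_all_runs = np.mean([v for vals in runs.values() for v in vals])

# ...whereas one realization per model weights each model equally.
mean_one_per_model = np.mean([vals[0] for vals in runs.values()])
```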
- P15L3-4 Similarly to the issue of a selection based on multiple variables and observational datasets (as pointed out in the previous reviews), many applications such as impact assessments or regional climate modelling would imply an ensemble selection based on a specific region. The fact that the optimal ensemble will depend on the region where it is calibrated should therefore be discussed as well.
- Figure 2: The label of the y-axis is misleading because a high error correlation rather implies model-model similarity.
- Figure 3: Regional downscaling should also appear in the "Application to the future" box, not only in the "historical data" blue box.

I received two sets of review comments in this round of review. Based on these comments, I have decided to return the manuscript to the authors for major revisions. We will seek another round of review from the original reviewers before making the final decision.

The editor