To weight or not to weight: assessing sensitivities of climate model weighting to multiple methods, variables, and domains
- 1South Central Climate Adaptation Science Center, University of Oklahoma, Norman, OK, 73019, USA
- 2Department of Environmental Science, Policy and Management, University of California Berkeley, Berkeley, CA, 94720, USA
- 3Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, 91109, USA
Abstract. Given the increasing use of climate projections and multi-model ensemble weighting for a diverse array of applications, this project assesses the sensitivity of climate model weights, and the resulting ensemble means, to multiple components, such as the weighting scheme, climate variable, or spatial domain of interest. The analysis makes use of global climate models from the Coupled Model Intercomparison Project Phase 5 (CMIP5) and their statistically downscaled counterparts created with the Localized Constructed Analogs (LOCA) method. This work focuses on historical and projected future mean precipitation and daily high temperatures of the south-central United States. Results suggest that model weights and the corresponding weighted projections are highly sensitive to the weighting method as well as to the selected variables and spatial domains. For instance, when model weights are estimated based on Louisiana precipitation, the weighted projections show a wetter and cooler south-central domain in the future compared to other weighting schemes; when they are estimated based on New Mexico temperature, the weighted projections show a drier and warmer south-central domain. However, when the entire south-central domain is considered in estimating the model weights, the weighted future projections show a compromise between the precipitation and temperature estimates. If future impact assessments utilize weighting schemes, our findings suggest that how the weighting scheme is derived and applied to the projections may depend on the needs of the impact assessment or adaptation plan. From the results of our analysis, we summarize our recommendations concerning multi-model ensemble weighting as follows:
- Weighted ensemble means should be used not only for national and international assessments but also for regional impact assessments and planning.
- Multiple strategies for model weighting should be employed when feasible, to ensure that uncertainties from various sources (e.g., the weighting strategy used, or the domain or variable of interest) are considered.
- Weighting should be derived for individual sub-regions (such as the NCA regions) in addition to what is derived for the continental United States.
- Domain-specific weighting should be derived using both common (e.g., precipitation) and stakeholder-specific (e.g., streamflow) variables to produce relevant analyses for impact assessments and planning.
Adrienne Wootten et al.
Status: open (until 04 Jun 2022)
RC1: 'Comment on esd-2022-15', Anonymous Referee #1, 13 May 2022
In this study, the authors set up a systematic exploration of several combinations of choices in multimodel ensemble weighting schemes, and describe the resulting projections when weighting CMIP5 models and their downscaled and bias-corrected LOCA versions.
The authors offer that the value of this work lies in this systematic exploration of the effects of weighting. The manuscript contains some very nice and thoughtful discussion of general issues (which, by the way, were treated in some depth by a guidance document for the IPCC AR5 report as early as 2010, available here: https://www.wcrp-climate.org/wgcm/references/IPCC_EM_MME_GoodPracticeGuidancePaper.pdf, and more recently in a review paper by Abramowitz et al., 2019, https://doi.org/10.5194/esd-10-91-2019), and I appreciate the large amount of work that the authors have undertaken. I am sorry to say, however, that I come away from this study only reinforcing what we all already knew: that different weighting schemes produce different results, and nobody knows how to interpret the real value of those differences or what to do about them.
In my view, there would be two ways to make this exploration more useful.
First, perform this exercise with a clear accounting of internal and inter-model variability. I don't know what to make of pictures that show me multimodel means and how they differ from one another. The question is, do they differ in a way that is significant, compared to internal variability? And do they differ in a way that is significant with respect to a measure of uncertainty around the multimodel mean, which could be taken (likely underestimating it and therefore possibly favoring the detection of significant differences, but that could be expressed as a caveat) as its standard deviation, computed as the inter-model standard deviation divided by the square root of the ensemble size (at each grid point)?
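The uncertainty measure proposed here is straightforward to compute; a minimal sketch with synthetic data (the ensemble size, grid shape, and the 2-standard-error threshold are illustrative assumptions, not values from the study) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ensemble: 30 models on a small lat x lon grid
# (shapes and values are illustrative only).
n_models, nlat, nlon = 30, 10, 12
proj_a = rng.normal(loc=1.0, scale=0.5, size=(n_models, nlat, nlon))
proj_b = proj_a + rng.normal(loc=0.05, scale=0.1, size=(n_models, nlat, nlon))

# Two multimodel means to compare (stand-ins for two weighting schemes).
mean_a = proj_a.mean(axis=0)
mean_b = proj_b.mean(axis=0)

# Standard error of the multimodel mean at each grid point:
# inter-model standard deviation / sqrt(ensemble size).
se_a = proj_a.std(axis=0, ddof=1) / np.sqrt(n_models)

# Flag grid points where the two means differ by more than 2 standard errors.
significant = np.abs(mean_a - mean_b) > 2.0 * se_a
frac_significant = significant.mean()
```

Differences below such a threshold would be hard to distinguish from noise around the multimodel mean, which is exactly the concern raised above.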
Second, perform a perfect model exercise where one model furnishes the truth, current and future, and the rest of the models undergo this exercise in variation of weights (derived using the left-out model historical portion as observations), so that besides ascertaining that the weights have diverse effects, we can start seeing something about the value of applying them: Do they produce anything more accurate than the unweighted projection? Which of the choices does that better, if any?
The need to take into account internal variability requires the "true model" to be one that has produced initial condition ensembles, but there are plenty of CMIP5-era large ensembles now available through US CLIVAR SMILEs (https://www.cesm.ucar.edu/projects/community-projects/MMLEA/), and the authors could easily choose one which has also participated in CMIP5 (e.g., CESM1, CanESM, MPI).
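The model-as-truth loop described above can be sketched as follows; this is a toy illustration with synthetic data, and the Gaussian exp(-(RMSE/sigma)^2) performance weight is an assumed stand-in for whatever scheme is actually tested, not the manuscript's method:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic historical and future fields for M models, flattened over space.
M, npts = 10, 50
hist = rng.normal(size=(M, npts))
fut = hist + rng.normal(loc=1.0, scale=0.3, size=(M, npts))

def skill_weights(hist_fields, pseudo_obs, sigma=1.0):
    """Toy performance weights: exp(-(RMSE/sigma)^2), normalised to sum to 1."""
    rmse = np.sqrt(((hist_fields - pseudo_obs) ** 2).mean(axis=1))
    w = np.exp(-((rmse / sigma) ** 2))
    return w / w.sum()

errors_weighted, errors_unweighted = [], []
for truth in range(M):  # each model takes a turn as the "truth"
    others = np.delete(np.arange(M), truth)
    w = skill_weights(hist[others], hist[truth])      # weights from historical fit
    pred_w = (w[:, None] * fut[others]).sum(axis=0)   # weighted future projection
    pred_u = fut[others].mean(axis=0)                 # unweighted future projection
    errors_weighted.append(np.abs(pred_w - fut[truth]).mean())
    errors_unweighted.append(np.abs(pred_u - fut[truth]).mean())

# A positive gain would indicate that weighting beats the unweighted mean
# out of sample; with real ensembles this is the quantity of interest.
gain = np.mean(errors_unweighted) - np.mean(errors_weighted)
```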
A study that can tell me something more than "things look different" and can distinguish differences that are simply noise from differences in the signal estimated by these various weighting schemes, then proceed to tell me which one of these weighting schemes, if any, produces projections closest to the "truth" would be really valuable and a real step forward in this old and somewhat frustrating debate.
And I realize that using the perfect model set-up pre-empts the idea of using LOCA, but I would argue that the loss would be more than balanced by the gain in interpretability of the results. Plus, the bias correction of LOCA makes the value of using performance-based weights rather debatable, and my guess is that the differences that surface in that part of the exercise would turn out to be drowned by internal variability if that was accurately accounted for (given that the observations used to bias-correct are also just one realization, heavily affected by internal variability at these grid-point scales).
I also would like to raise a point about impact modelling. The authors discuss more than once the relevance of the weighting choice for impact modelers, but I would like to be better convinced of that. My experience of impact model(er)s is that they need climate information that looks like reality (one realization of it, or multiple realizations of it), not like a big smooth mean. So I agree that the multimodel mean (weighted or unweighted) might be relevant as a synthetic "bird's-eye view" of how climate impact-drivers look in the future, and can inform discussions and produce useful catalogs of maps in documents like IPCC or NCA assessments. However, when it comes to impact modeling, my expectation is that feeding multimodel means to a process or empirical model would be nonphysical. Even a large, global-scale impact modeling exercise like ISIMIP (https://www.isimip.org/) has provided individual realizations of multiple models for use in its "children" exercises. I would think that using temperature, precipitation and whatever else is needed that behaves like reality as input to the impact model, and worrying about averaging only after having produced the impact response, is even more necessary for regional impact assessments like the ones that the authors are mostly concerned about. If I'm wrong, I will happily stand corrected, but in that case I would like to see citations of current impact modeling studies that use multimodel ensemble means.

In conclusion, my assessment of this work is that it represents a very diligent and substantial exercise, informed by thoughtful considerations, but does not help to advance the field until it takes up a better treatment of internal and model variability that could help to determine the significance of the differences resulting from the various weighting schemes, and until it can say something about the usefulness of weighting at all. I tried to suggest ways to do just that.
I would be very excited to see the new results, which I hope would not be too difficult to produce, given the efficient machinery that the authors obviously already have in place.
RC2: 'Comment on esd-2022-15', Anonymous Referee #2, 17 May 2022
Review of ‘To weight or not to weight: assessing sensitivities of climate model weighting to multiple methods, variables, and domains’ by Wootten and colleagues
The manuscript presents a comparison between what seems to be 2 different climate model weighting schemes in 4 different setups. Weights are based on 2 different variables, 3 different domains, and 2 different model ensembles resulting in 48 different sets of weights. These are applied to the same 2 variables and 3 regions, resulting in 288 sets of weighted ensemble means discussed in the paper. The differences between these setups are visualised, described and discussed. In the second part of the manuscript several recommendations are given regarding how to apply weighting methods in general.
With this manuscript the authors set out to answer a big question as stated in their title: ‘to weight or not to weight’? And in more detail (line 77): ‘Should model weights be developed separately when investigating different climate variables? Should model weights be estimated separately when investigating different domains?’
I would like to propose a somewhat provocative argument about this aim: With the setup suggested here it is impossible to answer these questions. To advocate for (or against) weighting future projections in whatever way, one would need to show an added value of the weighted ensemble compared to the unweighted one (for example increased skill by some metric). This is notoriously hard to prove (some would say impossible) as we do not know the ground truth in the future. Approaches have been suggested to circumvent this problem, at least partly. These include out-of-sample validation of weighted ensembles in the historical period, where there are still observations available, or model-as-truth approaches. None of this is done in this manuscript. The authors merely provide an extensive comparison of the effect of different weighting setups. As far as I can tell, most of the recommendations on weighting provided in the second part of the manuscript are not connected to the results presented (which mainly show relative differences between the different methods employed and as such cannot answer the questions posed).

My second main criticism is that the results presented in this work are not really new or surprising. The authors basically show that weights based on different variables and regions differ - but this is what they are designed to do. If weights for different regions and variables were all identical there would be something wrong with the model ensemble or the setup of the weights, right?

Finally, the authors give several recommendations but these are more of a general nature and I had a hard time connecting them to the specific results presented. As a matter of fact several of the arguments have been made before and are not connected to any of the work done here (for example the discussion about spatial coherence in line 363f).
In addition, I find the heavy self-citation employed in this paper, which partly ignores large chunks of other literature, somewhat strange. I would encourage the authors to put their work better into the context of the international scientific literature (for example lines 35, 56-67, and the specific comments on lines 321, 325, 380). In addition, the authors state at several points that their study is the first to ‘assess the sensitivities of the model weights and resulting ensemble means to the combinations of variables, domains, ensemble types (raw or downscaled), and weighting schemes used’ (e.g. line 284). This might be so, but what is the gain? Again, I am not surprised that the selection of the metrics used to inform the weights has an influence on the weights. If that was not so, weighting would hardly make sense, right?
The number of sets of weights (48) and the number of weighted means produced (288) are in my opinion excessive. The authors should pick a few representative and/or interesting examples to discuss and move the rest of the results into the supplement. I found it almost impossible to follow the discussion of methods, domains, variables and ensembles that are in turn applied to ensembles and domains.
Finally, I would like to urge the authors to provide at least a basic description of the methods which are at the core of this manuscript. As it is, the reader is merely referred to three papers by the authors (Wootten et al. 2020a, Massoud et al. 2020a, 2019) for more information. For a potential reader (or reviewer) it would be quite convenient to have a more self-contained paper, with at least the basic setup of the methods clearly described and only the details requiring the reading of several more papers.
Overall, this manuscript has the several major problems raised above, besides the many specific issues outlined below, and I do not think that it can be published without a major overhaul. This should include, most importantly, more clearly formulated research questions that can be addressed in the manuscript and a clear separation between conclusions based on results and general recommendations based on the authors' experience. In addition, a better representation of the already existing literature and more focused plots (showing only a subset of cases) would help the manuscript.
Minor comments
title: the quite narrow focus on parts of the United States should be reflected in the title.
line 16: At this point I am confused about the terminology. My a priori assumption is that there are different weighting schemes and in addition each scheme might use different variables to calculate the weights. Here they are mixed up so either the authors use another terminology (then they should make it clear) or this should be reformulated.
16-21: I am not sure what the authors' point is here, as this behaviour seems to be totally expected. Is the important point not rather that the metrics (including variable and region) the weights are based on need to be well-justified? With cherry-picked metrics it is probably possible to achieve any kind of weighting, right?
28: please introduce NCA
line 35: can the authors please cite a broader sample of the literature, not limiting it to their own publications (assuming that they are not the only ones publishing on that topic)?
39: I would argue that the ensemble mean is not representative for the members (one of the reasons why we need weighting)
44: model weights themselves cannot have any skill, I would argue
47: As a matter of fact, the idea of independence weighting has not only come up in the last few years and is, e.g., mentioned in Knutti 2010, which is cited by the authors in the line before.
60: ‘performance skill of atmospheric rivers globally’ again, what would be the skill of an atmospheric river? I assume the authors refer to the model skill in simulating atmospheric rivers?
69: Knutti et al. 2017 did not base their weights (only) on precipitation as seems to be suggested here
70: What is a common variable?
73: ‘Other studies have applied model weighting to a specific domain (e.g. globally) and went on to apply the developed weights on a different domain (e.g. North America or Europe) (Massoud et al., 2019).’ This sentence does not seem to make sense. Do the authors mean that they have calculated the weights based on metrics in one domain and then applied them to projections for another domain? Please reformulate this to make it clearer.
79: I am not convinced by the relevance of these research questions and their implications. For a weighting method to have skill, the weights need to be based on metrics that are physically and statistically connected to the variable that the weights are applied to. See for example the discussion about emergent constraints in Hall et al. (2019; 10.1038/s41558-019-0436-6). In the absence of a certain variable in a certain region that is informative for all other variables in all other regions, the answer to both questions has to be yes, without any further analysis, from a purely skill-based perspective I'd argue. There might be other considerations against it, but they depend on the application (and are, hence, independent of the outcome), such as physical and spatial consistency of the weighted distributions.
83: is the entire domain the combination of Louisiana and New Mexico or are there additional regions not covered by them? Maybe indicate the sub-domains in figure 1?
84: can the authors motivate why they use CMIP5 instead of the newer CMIP6?
line 106/figure 1: I am not familiar with the term ‘high temperature’. Is this the same as ‘maximum temperature’, which is (in my opinion) the more frequently used term? And what is an annual high temperature? Is it the maximum over different annual mean temperatures, or the maximum of the maximum daily temperature, or something else entirely? Over which time period?
115: Just so that I understand correctly, also the CMIP5 models are interpolated to 10km – corresponding to a resolution much finer than the native one?
section 2.4: The authors aim to provide a comparison of different weighting schemes but here these weighting schemes are not introduced at all requiring the reader to read several other papers to get any information at all about them. Please provide at least the basic properties and differences between the schemes investigated in this study.
141: ‘The Skill strategy utilises each model’s skill in representing the historical simulations’ I assume the authors mean ‘historical observations’ here?
150: If the authors write ‘weighting schemes are applied’ here they mean that weights are calculated is that correct? I find this confusing since they also write ‘applied’ for the process of calculating a weighted mean of the future projections. Could the authors try to find a less ambiguous language throughout the manuscript?
156: ‘(ensemble choice x weighting methods choice x variable choice x domain choice = 2 x 2 x 3 x 4 = 48).’ This is mixed up; please correct.
166: I am not sure I understand why the weights are applied to the sub-domains separately. The resulting maps should be identical to the corresponding region in the full domain, correct?
figure 3: ‘grey dots’ do the authors mean the red dots?
figure 3: as a general question: should weights not be normalised in order to be comparable across the different cases?
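For comparability across cases, each scheme's weights could be rescaled to sum to one; a trivial sketch (the raw weight values are made up, not taken from the manuscript):

```python
import numpy as np

def normalise(w):
    """Rescale weights to sum to 1 so schemes on different scales are comparable."""
    w = np.asarray(w, dtype=float)
    return w / w.sum()

# Hypothetical raw weights from two schemes on very different scales.
raw_bma = np.array([0.8, 2.1, 0.3, 1.5])
raw_skill = np.array([0.02, 0.07, 0.01, 0.05])

norm_bma = normalise(raw_bma)
norm_skill = normalise(raw_skill)
# After normalisation the relative ranking within each scheme is preserved,
# but the magnitudes are directly comparable across schemes.
```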
182: ‘One observation seen in these weighting combinations is that the weighting schemes themselves are all sensitive to the ensemble, variable, and domain for which they are derived.’ I do not agree with this statement in this general form, could the authors provide a bit more detail? To give just two examples: the bcc model gets consistently low weights for all cases and the low weight of NorESM1 (among many other models) is not sensitive to variable and domains but only to the ensemble.
185: what are ‘model combinations’
189: is this surprising given that (from what I understand) BMA is a structurally different method while the other three are variants of the same method?
193: I would tend to say the colour is red not orange. How is significance established for this case or is this just a qualitative statement? Then maybe use a different wording.
195: what are differences ‘within each combination of ensemble, variable, and domain’?
197: what does ‘combinations’ refer to here?
206: ‘Similar to the CMIP5 ensemble in Figure 3, the BMA weights tend to be larger for the highest weighted models in the LOCA ensemble compared to those derived with the Skill, SI-h, and SI-c schemes’ Can the authors speculate on the reason for this behaviour?
212: ‘the weights for the LOCA ensemble [tmax, Louisiana] generally range from 0.025 to 0.05’ Do the authors mean 0.25-0.5? Otherwise it is impossible to see this in figure 4. The authors might want to explain the notable exception from this. How is ‘BMA best’ calculated from the 100 iterations of BMA? How is a case like MIROC with a median of about 0.25 but a best of close to 0 possible?
223: what is ‘co-dependence between models in an ensemble’? Does ‘Skill’ account for dependence at all as seems to be suggested here?
225: ‘BMA tends to be the most sensitive’ could this somehow be quantified?
239: So why not just not use the sub-domains at all?
figure 5: is there a particular reason for selecting a base period of 25 years and a future period of 30 years? What do the boxes and whiskers represent?
271-281: I am not sure I understand why this paragraph is here? Should the reader look at and understand all the figures listed here? Or is this just an outlook? The authors might want to consider dropping it.
321: Maybe the authors could give some examples of the literature that does exist? To give just a few examples (there are more): 10.1029/2019GL083053, 10.1088/1748-9326/ab492f, 10.3389/frwa.2021.713537, 10.1029/2020JD033033
325: Again, there are counter-examples that might be good to mention here: 10.5194/acp-20-9961-2020, 10.3389/frwa.2021.713537
327: ‘Third, for situations where projections are provided to impact models, does this type of study need to be repeated using impact model results’ I don’t think I understand this question.
334: This is not correct so generally, see references above.
342: Who are these ‘others’? Please provide references
349: Why does an unweighted mean over-favor certain models? I would assume that by definition in an unweighted case all models are treated equally.
354: applying multiple methods as suggested here might lead to contradictory results; can the authors say something about what a user who tries to get a single answer should do in such a case?
380: ‘Climate model evaluations and national assessments typically focus on the continental United States or North America.’ There are assessments also for other continents.
394: Is this recommendation somehow connected to the results shown in this manuscript or just the authors opinion?
346: ‘a multi-model ensemble of climate projections should incorporate model weighting’ The ensemble itself cannot incorporate weighting, I'd argue. Weights can only be applied once the ensemble is aggregated along the model dimension (for example by calculating a multi-model mean).
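The distinction drawn in this comment, that weights act only when the ensemble is collapsed along the model dimension, can be sketched in one line with numpy (array shapes and weight values are illustrative assumptions):

```python
import numpy as np

# Hypothetical projections with a (model, lat, lon) layout. The individual
# ensemble members are untouched; weights enter only in the aggregation step.
proj = np.random.default_rng(2).normal(size=(5, 4, 4))
w = np.array([0.1, 0.3, 0.2, 0.25, 0.15])  # one weight per model, summing to 1

weighted_mean = np.average(proj, axis=0, weights=w)  # weighted multi-model mean
unweighted_mean = proj.mean(axis=0)                  # equal weights, for contrast
```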
446 (recommendations): Could the authors connect these recommendations to their results?
456: how can a domain be small compared to internal variability?