Performance-based sub-selection of CMIP6 models for impact assessments in Europe
Tamzin E. Palmer
Carol F. McSweeney
Ben B. B. Booth
Matthew D. K. Priestley
Paolo Davini
Lukas Brunner
Leonard Borchert
Matthew B. Menary
Download
- Final revised paper (published on 21 Apr 2023)
- Supplement to the final revised paper
- Preprint (discussion started on 03 Aug 2022)
- Supplement to the preprint
Interactive discussion
Status: closed
RC1: 'Comment on esd-2022-31', Anonymous Referee #1, 03 Aug 2022
Peer review of “Performance based sub-selection of CMIP6 models for impact assessments in Europe” by Palmer et al. (ESD).
This paper presents a performance assessment of CMIP6 simulations for Europe and selects a subset of models for regional climate impact studies. The performance criteria include large-scale processes such as storm tracks, circulation patterns, and temperature biases. The selection of models is based primarily on subjective assignment of each model into three categories for each performance criterion. The authors highlight that there is a strong tendency for the models with high regional performance to have higher global climate sensitivity. While the causes of this relationship are left for future investigation, the authors note that this relationship creates a tension between selecting for high regional performance and selecting an ensemble consistent with observational constraints on global ECS.
This paper is thoughtful and well executed. It will be useful for European climate impact assessments, and also as a template/benchmark for performance assessments in other regions. While the paper is acceptable with minor technical corrections, I have added some optional suggestions for improvement. The most important of these suggestions is for an assessment of the role of internal variability in the performance evaluation.
Corrections required:
There are many spelling and grammar mistakes. I noted typos in lines 10, 52, 69, 82, 114, 137, 188, 202, 211, 238, 245, 279, 319, 330, 426, 432, 436, 444, 464, 467, and 471 and in the spelling of “conceptulization.”
Table 1. ACCESS-CM2 is missing from the left column. Also, since the right column is a subset of the left, couldn’t this table be replaced with a (less space-consuming) list, with selected models highlighted in bold?
Table 2. The selected model in each cluster needs to be identified. This info isn’t available from figure 7 or anywhere else in the main text.
Suggestions for improvement (optional):
Models are evaluated on the basis of a single realization each. To what extent does internal variability affect the assessments? The paper would be more solid if it included an analysis of the robustness of the performance criteria to multiple realizations of at least one model.
This paper’s strength is in the process evaluations, which will be a useful reference for analysts creating bespoke ensembles. The 3x3 matrix of examples of models in the three subjective categories is a nice way of presenting the results in the main paper and the appendix. However, many analysts would benefit from a supplementary section showing the maps for the full set of assessed models, so they can make their own subjective assessments and better understand figures 4 and 5.
The finding that many of the high-skill models are outside the IPCC assessed ECS range is interesting and important. However, this tension between regional skill and global climate sensitivity seems somewhat overstated. There are a couple of solutions that partially resolve this tension. First, there is the option of presenting analyses relative to global warming levels instead of time, as widely practiced in the literature and advocated by Hausfather et al. (2022). While the GWL approach doesn’t fully resolve the tension (time does matter to many studies), it warrants some discussion here. Indeed, the results of this paper add further weight to the importance of the GWL approach. Second, the IPCC’s very likely ECS range is a more inclusive and defensible (66% is a high bar, given the observational uncertainties on the upper tail of ECS) criterion that would only exclude three independent models (CanESM, UKESM/HadGEM, and CESM2). Discussion of these nuances would give more direction to the reader in the face of the tension that this paper highlights.
The completeness of scenario experiments by each model is an important consideration in ensemble selection that doesn’t receive any attention here. For example, HadGEM3-GC3.1 provides only one simulation of SSP126 and no simulations of SSP370 (https://pcmdi.llnl.gov/CMIP6/ArchiveStatistics/esgf_data_holdings/ScenarioMIP/index.html), and as a result may not be viable for some study designs. The paper could benefit from some documentation and/or discussion of this and other practical considerations that will affect the utility of the recommended ensemble.
The exclusion of UKESM1 based on orange flags comes across as a bit haphazard and arbitrary, especially given that analysis of storm track performance is not available for this model. While I noted the discussion on the confluence of reasons for excluding UKESM1, the paper would benefit from a more systematic documentation of the interaction of criteria leading to model exclusion. Perhaps also there is a role for a “marginal” category of models for which exclusion wasn’t clear-cut.
Minor comments:
Line 225. Some more detail on the reanalysis/observational data would be helpful.
Lines 427-8. “The retention of higher sensitivity models is an emergent consequence of assessment of skill at reproducing regional processes.” This wording implies some functional relationship between regional skill and model sensitivity that hasn’t been established (as duly noted in the conclusion). Simpler wording would reduce the chance of misinterpretation by the reader.
Lines 459-60. Shiogama (2021) excluded models based on a criterion of high recent warming relative to observations, rather than based on ECS or TCR as implied here. Mahony (2022) (DOI:10.1002/joc.7566) would be a more direct example of ensemble selection based on the IPCC assessed ECS range.
Citation: https://doi.org/10.5194/esd-2022-31-RC1
AC1: 'Comment on esd-2022-31', Tamzin Palmer, 11 Aug 2022
Initial response to reviewer #1 from the authors
Peer review of “Performance based sub-selection of CMIP6 models for impact assessments in Europe” by Palmer et al. (ESD).
This paper presents a performance assessment of CMIP6 simulations for Europe and selects a subset of models for regional climate impact studies. The performance criteria include large-scale processes such as storm tracks, circulation patterns, and temperature biases. The selection of models is based primarily on subjective assignment of each model into three categories for each performance criterion. The authors highlight that there is a strong tendency for the models with high regional performance to have higher global climate sensitivity. While the causes of this relationship are left for future investigation, the authors note that this relationship creates a tension between selecting for high regional performance and selecting an ensemble consistent with observational constraints on global ECS.
This paper is thoughtful and well executed. It will be useful for European climate impact assessments, and also as a template/benchmark for performance assessments in other regions. While the paper is acceptable with minor technical corrections, I have added some optional suggestions for improvement. The most important of these suggestions is for an assessment of the role of internal variability in the performance evaluation.
We thank the reviewer for their overall positive and very constructive response, along with their helpful suggestions for improving the manuscript. Our initial response is given below.
Corrections required:
There are many spelling and grammar mistakes. I noted typos in lines 10, 52, 69, 82, 114, 137, 188, 202, 211, 238, 245, 279, 319, 330, 426, 432, 436, 444, 464, 467, and 471 and in the spelling of “conceptulization.”
Table 1. ACCESS-CM2 is missing from the left column. Also, since the right column is a subset of the left, couldn’t this table be replaced with a (less space-consuming) list, with selected models highlighted in bold?
Table 2. The selected model in each cluster needs to be identified. This info isn’t available from figure 7 or anywhere else in the main text.
We thank the reviewer for noting these errors. The final manuscript will be proof-read, and the typos noted above corrected. Table 1 will also be replaced with a list as suggested here.
The selected model from each cluster will be identified in table 2, either in bold or by an alternative method. All the models in figure 7 will be identified by numbering the points in the supplementary material (Fig. S4).
Suggestions for improvement (optional):
Models are evaluated on the basis of a single realization each. To what extent does internal variability affect the assessments? The paper would be more solid if it included an analysis of the robustness of the performance criteria to multiple realizations of at least one model.
We agree with the reviewer that the assessment would be more robust with an understanding of the importance of internal variability. We have had a provisional look at other realisations for some of the models (e.g., MIROC6, CAMS-CSM1-0, CanESM5) that are excluded (have a red flag) due to temperature bias and/or circulation errors. This provisional investigation indicates that the assessment would not be altered for these models by using the 2nd or 3rd realisation instead of the 1st (for example). We are still considering how best to respond, as many of the team are currently away from work. However, we anticipate carrying out a larger assessment of internal variability, perhaps with the CanESM5 model, which has a large number of available realisations, for a number of criteria that are currently used to exclude models. These results would be added to the supplementary material for the manuscript.
This paper’s strength is in the process evaluations, which will be a useful reference for analysts creating bespoke ensembles. The 3x3 matrix of examples of models in the three subjective categories is a nice way of presenting the results in the main paper and the appendix. However, many analysts would benefit from a supplementary section showing the maps for the full set of assessed models, so they can make their own subjective assessments and better understand figures 4 and 5.
We agree with the reviewer that it would be beneficial to users to make the maps for the full set of assessed models available. An accessible GitHub repository is currently under construction and will be linked from this paper. This will include the maps used in the assessment (at a minimum including temperature, SST and large-scale circulation) and plots of the precipitation annual cycle. Hosting these in a GitHub repository would enable them to be maintained as a living document that can be added to as more models or diagnostics become available.
In addition, this repository also includes a spreadsheet of all assessments carried out for the CMIP6 models to date. The sample of 31 models included in this study was selected from this spreadsheet because each model had both a minimum number of assessed criteria and ssp585 future projection data available for tas and precipitation.
The finding that many of the high-skill models are outside the IPCC assessed ECS range is interesting and important. However, this tension between regional skill and global climate sensitivity seems somewhat overstated. There are a couple of solutions that partially resolve this tension. First, there is the option of presenting analyses relative to global warming levels instead of time, as widely practiced in the literature and advocated by Hausfather et al. (2022). While the GWL approach doesn’t fully resolve the tension (time does matter to many studies), it warrants some discussion here. Indeed, the results of this paper add further weight to the importance of the GWL approach. Second, the IPCC’s very likely ECS range is a more inclusive and defensible (66% is a high bar, given the observational uncertainties on the upper tail of ECS) criterion that would only exclude three independent models (CanESM, UKESM/HadGEM, and CESM2). Discussion of these nuances would give more direction to the reader in the face of the tension that this paper highlights.
We agree with the reviewer that the tension between the IPCC assessed climate sensitivity range and the regional skill of the models is not an issue if the GWL method is a suitable approach. Some discussion of this is warranted in the manuscript, and we intend to provide a more nuanced, revised discussion in the text.
There are, however, cases where the GWL method is not suitable, such as where the distribution of the ensemble is used as a measure of likelihood. As shown in our results, the distribution of the filtered models with greater regional skill is skewed towards higher climate sensitivity, and in this case the tension between regional skill and climate sensitivity becomes relevant. It is particularly important where assessments are made by a risk-averse user who is interested in a high-impact, low-likelihood (but plausible) temperature change within a given time frame (e.g., 2030 or 2040).
The very likely IPCC range for ECS will be added to figure 6 for reference.
The completeness of scenario experiments by each model is an important consideration in ensemble selection that doesn’t receive any attention here. For example, HadGEM3-GC3.1 provides only one simulation of SSP126 and no simulations of SSP370 (https://pcmdi.llnl.gov/CMIP6/ArchiveStatistics/esgf_data_holdings/ScenarioMIP/index.html), and as a result may not be viable for some study designs. The paper could benefit from some documentation and/or discussion of this and other practical considerations that will affect the utility of the recommended ensemble.
We agree that the completeness of the scenario experiments is likely to be a consideration for users. The focus of the paper is on the process-based assessment, rather than attempting to address some of the wider potential considerations for selecting representative models for downscaling and impact assessments. However, we agree that some relevant links to the documentation, or a table of the filtered models with the available simulations, would be a useful addition to the supplementary information.
The exclusion of UKESM1 based on orange flags comes across as a bit haphazard and arbitrary, especially given that analysis of storm track performance is not available for this model. While I noted the discussion on the confluence of reasons for excluding UKESM1, the paper would benefit from a more systematic documentation of the interaction of criteria leading to model exclusion. Perhaps also there is a role for a “marginal” category of models for which exclusion wasn’t clear-cut.
This is useful feedback. We agree that the decision to remove models due to a certain percentage of orange flags is somewhat arbitrary, and the decision as to how many orange flags warrant the removal of a model has a degree of subjectivity. We suggest adding an alternative filtered sub-set that only removes models with a red (inadequate) flag and therefore includes both UKESM1 and TaiESM1 as ‘marginal’ or less preferred models, in addition to our current filtered example.
Another approach that a user may wish to consider is to follow the method of McSweeney et al. (2018), where ‘marginal’ or less preferred models (with larger numbers of orange flags) are removed only if their removal does not reduce the projection range. If the aim is to maintain the full range, the UKESM1 model is an important consideration as an outlier that otherwise does well on many of the criteria (other than a substantial winter cold temperature bias), as its removal reduces the upper tail of the projected temperature range for Europe.
Minor comments:
Line 225. Some more detail on the reanalysis/observational data would be helpful.
Lines 427-8. “The retention of higher sensitivity models is an emergent consequence of assessment of skill at reproducing regional processes.” This wording implies some functional relationship between regional skill and model sensitivity that hasn’t been established (as duly noted in the conclusion). Simpler wording would reduce the chance of misinterpretation by the reader.
Lines 459-60. Shiogama (2021) excluded models based on a criterion of high recent warming relative to observations, rather than based on ECS or TCR as implied here. Mahony (2022) (DOI:10.1002/joc.7566) would be a more direct example of ensemble selection based on the IPCC assessed ECS range.
Thank you for these points; they will be addressed in the manuscript.
McSweeney, C., et al. (2018): Selection of CMIP5 members to augment a perturbed-parameter ensemble of global realisations of future climate for the UKCP18 scenarios.
Citation: https://doi.org/10.5194/esd-2022-31-AC1
RC2: 'Comment on esd-2022-31', Anonymous Referee #2, 16 Sep 2022
Comments on the manuscript entitled “Performance based sub-selection of CMIP6 models for impact assessments in Europe” by Palmer et al. submitted to Earth System Dynamics
Recommendation: Major revision
The authors assessed CMIP6 models in terms of their performance and diversity in simulating several variables, e.g. temperature, precipitation, and circulation, over Europe. Based on the assessment, they created sub-sets of CMIP6 models, which can be used for downscaling or impact assessments. The approach can also be applied to other regions of the world. The topic is important and falls within the scope of the journal. The manuscript is generally well written. My major concerns are as follows: the assessment of CMIP6 models does not adequately consider the link between the models' ability to simulate historical climate and future climate change, and the assessments are overly dependent on subjective assessment criteria. Detailed comments are laid out below.
Major comments:
- No link was established between the models' ability to simulate the historical climate and the projected changes. Thus, the models that better reproduce the historical climate may not necessarily generate a more reliable projection of future climate. After excluding the least realistic models, the filtered CMIP6 models show higher sensitivity. Is this result reasonable?
- Quantitative measures are preferred for model evaluation. Visual inspection hinders the inter-comparison of various studies to a certain degree as different people may have different judgments on “satisfactory”, “unsatisfactory”, and “Inadequate”. I’m wondering to what extent the results will be different if the authors use objective assessment criteria only.
- How was the RMSE of the zonal mean track calculated? It seems that the authors calculated the zonal mean track and obtained a time series, and that the RMSE is calculated using the time series derived from models and observation. Please note it makes no sense to compare the year-to-year variation of the unforced internal variability derived from AOGCMs against the observed one. In this case, the RMSE is largely determined by the phase discrepancy between simulation and observation. Please also check the use of RMSE elsewhere.
Other comments:
Section 2: It is not clear to me how the CMIP6 models are grouped into classifications. Please clarify how the quantitative and qualitative measures were used and what is the threshold of quantitative measures to group the models. I suggest the authors introduce the “criteria” first and explain the classification definitions based on the criteria.
L64: “processed based” -> “process-based”
L70: How are the regional processes linked to future changes?
L137: “process base” -> “process-based”, “ does not use and regional or global warming trends”->”does not use regional or global warming trends”. Please carefully read throughout the manuscript and correct the typos or grammar mistakes. E.g. L202 …
L217: What is the temporal resolution of the dataset, monthly mean or daily mean? Which CMIP6 experiment was used for the baseline period? Both the baseline and future periods are only 20 years. The climatological means averaged over 20 years may still contain internal climate variability, e.g. AMO or PDO, which may affect the evaluation and selection of the models to a certain extent.
L225: Please clarify what reanalysis and observational data were used in this study.
L254-255: How is the circulation pattern measured? Is the RMSE calculated using two wind speed fields or as an RMS vector error between two vector fields? If the RMSE is calculated with wind speed, it does not reflect the errors in wind direction. Instead, the RMSE for a vector field can reflect errors in both wind speed and wind direction. Therefore, I suggest the authors use the latter. Similarly, the difference in wind speed illustrated in Fig. 1 can only describe the errors in wind speed. The same wind speed does not mean the same wind direction. The authors may consider using a vector difference between the model and ERA5. The magnitude of the vector difference takes differences in both wind speed and wind direction into account.
Xu et al, 2016: A diagram for evaluating multiple aspects of model performance in simulating vector fields. Geosci. Model Dev., 9, 4365–4380
L270: Please explain how the “track density” is defined. Please use the degree symbol “°” to represent latitude and longitude here and elsewhere.
L321: “depending to on” -> “depending on”
L334: “with with” -> “with”
L343-345: How about the range of other quantities, e.g. precipitation and storm track density?
L362: Please clarify what numerical score was given for each group of models.
L644: “35°N-75°” -> “35°N-75°N”
Fig. S4: What does the “??” refer to in the figure caption?
Citation: https://doi.org/10.5194/esd-2022-31-RC2
AC3: 'Reply on RC2', Tamzin Palmer, 28 Oct 2022
Reviewer 2 response
Please note that a clearer version of this response is attached as a pdf.
Recommendation: Major revision
The authors assessed CMIP6 models in terms of their performance and diversity in simulating several variables, e.g., temperature, precipitation, and circulation, over Europe. Based on the assessment, they created sub-sets of CMIP6 models, which can be used for downscaling or impact assessments. The approach can also be applied to other regions of the world. The topic is important and falls within the scope of the journal. The manuscript is generally well written. My major concerns are as follows: the assessment of CMIP6 models does not adequately consider the link between the models' ability to simulate historical climate and future climate change, and the assessments are overly dependent on subjective assessment criteria. Detailed comments are laid out below.
We thank the reviewer for this overall positive and constructive response.
Major comments:
- No link was established between the models' ability to simulate the historical climate and the projected changes. Thus, the models that better reproduce the historical climate may not necessarily generate a more reliable projection of future climate. After excluding the least realistic models, the filtered CMIP6 models show higher sensitivity. Is this result reasonable?
It is correct that we do not attempt to explicitly link baseline performance to the credibility of future projections. What we do suggest is that there are a number of issues around using climate model projections from models which do not behave realistically in terms of key large-scale regional climate characteristics in the baseline climate. Here the question is not whether well-performing (better) models can offer a (more) reliable projection, but whether those models that we know to be particularly unrealistic in terms of the key large-scale climate characteristics that determine the regional weather and its variability can offer useful information about projected future climate to the climate impacts community. An increasing body of literature does link shortcomings in a model's ability to realistically represent the observed baseline to its future projections being less reliable (e.g., Whetton et al., 2007; Overland et al., 2011; Lutz et al., 2016; Ruane and McDermid, 2017; Jin, Wang and Liu, 2020; Chen et al., 2022).
Having identified models that we consider particularly unrealistic to arrive at a filtered subset, we then explore what that means for the range of future projections. We find that the better-performing filtered subset happens to contain a higher proportion of higher-sensitivity models. This study is not intended to present an emergent constraint, but rather an exploration of how the performance-based filtering affects the projection range compared with other sub-selection approaches. We do not conclude that the upper end of the projection range is more credible for Europe (indeed, as the reviewer notes, this would not be a reasonable result), but we do think that the identified relationship between filtered ensembles and climate sensitivity highlights a tension with other potential selection approaches, such as selecting models based on global historical trends or matching IPCC distributions of climate sensitivity. Our intention is to expose this tension for potential users of these simulations over Europe.
These findings are complemented by recent studies that take account of regional temperature trends and find that, for some European areas (e.g., France), constraining the CMIP6 ensemble based on regional temperature trends, or a combination of regional and global temperature trends, shifts projected summer temperature changes towards higher sensitivities rather than the lower sensitivities suggested by global analyses (Qasmi and Ribes, 2022; Ribes et al., 2022). We find that the higher-sensitivity models that are part of our filtered ensemble may still provide a useful projection for the European region.
We propose to clarify in our manuscript that our aim is to highlight this tension between selecting subsets based on regional performance and selecting subsets based on other criteria, e.g., representing the IPCC's plausible range of climate sensitivity.
Chen, Z. et al. (2022) ‘Observationally constrained projection of Afro-Asian monsoon precipitation’, Nature Communications, 13(1), p. 2552. doi: 10.1038/s41467-022-30106-z.
Jin, C., Wang, B. and Liu, J. (2020) ‘Future Changes and Controlling Factors of the Eight Regional Monsoons Projected by CMIP6 Models’, Journal of Climate. Boston MA, USA: American Meteorological Society, 33(21), pp. 9307–9326. doi: 10.1175/JCLI-D-20-0236.1.
Lutz, A. F. et al. (2016) ‘Selecting representative climate models for climate change impact studies: an advanced envelope-based selection approach’, International Journal of Climatology. John Wiley & Sons, Ltd, 36(12), pp. 3988–4005. doi: https://doi.org/10.1002/joc.4608.
Overland, J. E. et al. (2011) ‘Considerations in the Selection of Global Climate Models for Regional Climate Projections: The Arctic as a Case Study’, Journal of Climate. Boston MA, USA: American Meteorological Society, 24(6), pp. 1583–1597. doi: 10.1175/2010JCLI3462.1.
Qasmi, S. and Ribes, A. (2022) ‘Reducing uncertainty in local temperature projections’, Science Advances. American Association for the Advancement of Science, 8(41), p. eabo6872. doi: 10.1126/sciadv.abo6872.
Ribes, A. et al. (2022) ‘An updated assessment of past and future warming over France based on a regional observational constraint’, Earth Syst. Dynam. Discuss., 2022(March), pp. 1–29. doi: 10.5194/esd-13-1397-2022.
Ruane, A. C. and McDermid, S. P. (2017) ‘Selection of a representative subset of global climate models that captures the profile of regional changes for integrated climate impacts assessment’, Earth Perspectives, 4(1), p. 1. doi: 10.1186/s40322-017-0036-4.
Whetton, P. et al. (2007) ‘Assessment of the use of current climate patterns to evaluate regional enhanced greenhouse response patterns of climate models’, Geophysical Research Letters. John Wiley & Sons, Ltd, 34(14). doi: https://doi.org/10.1029/2007GL030025.
- Quantitative measures are preferred for model evaluation. Visual inspection hinders the inter-comparison of various studies to a certain degree as different people may have different judgments on “satisfactory”, “unsatisfactory”, and “Inadequate”. I’m wondering to what extent the results will be different if the authors use objective assessment criteria only.
We understand the reviewer's point that, in the case of more subjective criteria, people may to a certain degree have different judgements. We have used a combination of quantitative and qualitative measures where we have found them appropriate. One point to note is that ‘quantitative’ is not always synonymous with ‘objective’ – e.g., the choice of metric and threshold for classification involves subjective judgements.
There are two main reasons for our use of qualitative measures. The first is to account for the variety of characteristics in the errors that different models display and to allow us to judge their implications and significance. If we look at the climatological circulation assessment, we find that the RMSE, calculated in parallel with the qualitative assessment, doesn't always lead to the same classification as the visual inspection – in this case because some patterns of error are more concerning to us than others: errors in the magnitude of the mean circulation (features in broadly correct locations but with errors in magnitude) are less concerning than cases where features are incorrectly located. Visual inspection allows us to understand the characteristic of the error and consider its impact on other aspects of the model.
Note: Please see attached PDF file for Figure 1.
Figure 1 shows some examples of where bias alone and/or an RMSE threshold for wind speed would not be suitable to determine the classification of the models. BCC-CSM2-MR is classified as satisfactory for DJF circulation (Fig. 1b and e). This is because, although there are some errors in wind speed magnitude over western and central Europe, the pattern of large-scale circulation is reasonably well captured (as compared to ERA5 in Fig. 1a). The BCC-CSM2-MR model has a similar regional RMSE to BCC-ESM1 (Fig. 1f); however, the latter model is classified as unsatisfactory due to a lack of south-westerly winds over the northern UK and Scandinavia (Fig. 1c and f). This is also highlighted by the negative bias in wind speed over these areas, indicating that the winds are too weak (Fig. 1f). The ACCESS-ESM1-5 model (Fig. 1d and g) is also classified as unsatisfactory despite a lower regional RMSE than BCC-CSM2-MR; this is due to the wind direction being too westerly in the North Atlantic and over the UK and northern Europe. The wind speeds over Scandinavia are too weak, while the wind speed over the UK and central Europe is too strong (Fig. 1g).
A quantitative metric might be designed to capture these characteristics on which our judgement is made, but this may ‘miss’ another error characteristic that subsequently appears in another model.
The second reason for using visual inspection is that the process of examining the fields offers us a much better understanding of model characteristics, which does not arise from summary statistics. In the study presented we have often shown quantitative metrics which were used in parallel with visual inspection.
How was the RMSE of the zonal mean track calculated? It seems that the authors calculated the zonal mean track and obtained a time series, and that the RMSE is calculated using the time series derived from models and observation. Please note it makes no sense to compare the year-to-year variation of the unforced internal variability derived from AOGCMs against the observed one. In this case, the RMSE is largely determined by the phase discrepancy between simulation and observation. Please also check the use of RMSE elsewhere.
Thank you for bringing to our attention that this part of the methodology requires further clarification. The RMSE was not calculated using a time series or via consideration of each model's internal variability; this is the case for all the variables. The zonal mean of each model's mean track density from 20° W-20° E was taken to obtain a profile of storm number by latitude, and the RMSE of each model's profile relative to the profile obtained from ERA5 was then calculated over 25° N-80° N. There is no time-series element to this; it is simply the RMSE of the zonal-mean, model-mean track density. At no point is the unforced internal variability of the models compared or used in the RMSE calculations. We will clarify this in the paper's text.
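For concreteness, a minimal sketch of this calculation is given below, assuming the climatological-mean track density of one model and of ERA5 are available as 2-D NumPy arrays on a common latitude-longitude grid; the grid, array and function names are illustrative assumptions and not the code used in the study.

```python
import numpy as np

lats = np.arange(25.0, 80.1, 1.0)   # hypothetical 1-degree grid over 25-80N
lons = np.arange(-20.0, 20.1, 1.0)  # 20W-20E

def zonal_mean_profile(track_density, lons, lon_bounds=(-20.0, 20.0)):
    """Average the (lat, lon) track-density field over 20W-20E to get a profile by latitude."""
    mask = (lons >= lon_bounds[0]) & (lons <= lon_bounds[1])
    return track_density[:, mask].mean(axis=1)

def track_density_rmse(model_density, era5_density, lats, lons, lat_bounds=(25.0, 80.0)):
    """RMSE of the model's zonal-mean profile against the ERA5 profile over 25-80N."""
    model_profile = zonal_mean_profile(model_density, lons)
    era5_profile = zonal_mean_profile(era5_density, lons)
    keep = (lats >= lat_bounds[0]) & (lats <= lat_bounds[1])
    return float(np.sqrt(np.mean((model_profile[keep] - era5_profile[keep]) ** 2)))

# Dummy climatological-mean fields, purely to show the call signature.
rng = np.random.default_rng(0)
model_density = rng.random((lats.size, lons.size))
era5_density = rng.random((lats.size, lons.size))
print(track_density_rmse(model_density, era5_density, lats, lons))
```

Note that only climatological-mean fields enter the calculation, so no year-to-year phase comparison is involved.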
Other comments:
Section 2: It is not clear to me how the CMIP6 models are grouped into classifications. Please clarify how the quantitative and qualitative measures were used and what is the threshold of quantitative measures to group the models. I suggest the authors introduce the “criteria” first and explain the classification definitions based on the criteria.
Models were classified for individual criteria and not grouped into an overall classification (figures 4 and 5 in the manuscript). Models were then sub-selected based on whether they had any red flags (inadequate) and the percentage of orange (unsatisfactory) flags. This is presented as only one example of how the assessment can be used to sub-select models; an alternative approach, for example, would be to remove only models with an inadequate flag. Thank you for this suggestion; we will improve the clarity of the main text.
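As an illustration of how such a rule can be applied in practice, the short sketch below filters a hypothetical flag table; the model names, flag labels and the orange-flag threshold are assumptions for illustration only and are not the values used in the manuscript.

```python
def filter_models(flags_by_model, max_orange_fraction=0.2):
    """Keep models with no red (inadequate) flag and an acceptable share of orange flags."""
    kept = []
    for model, flags in flags_by_model.items():
        if "red" in flags:
            continue  # any inadequate criterion excludes the model outright
        if flags.count("orange") / len(flags) > max_orange_fraction:
            continue  # too large a fraction of unsatisfactory criteria
        kept.append(model)
    return kept

# Hypothetical example: MODEL-B is excluded by its red flag,
# MODEL-C by its share of orange flags.
example = {
    "MODEL-A": ["green", "green", "orange", "green", "green"],
    "MODEL-B": ["green", "red", "green", "green", "green"],
    "MODEL-C": ["orange", "orange", "green", "green", "green"],
}
print(filter_models(example))  # -> ['MODEL-A']
```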
L64: “processed based” -> “process-based”
L70: How the regional processes are linked to future changes?
L137: “process base” -> “process-based”, “does not use and regional or global warming trends”->”does not use regional or global warming trends”. Please carefully read throughout the manuscript and correct the typos or grammar mistakes. E.g. L202 …
Thank you for noting these errors; they will be corrected, and the final manuscript will be proofread.
L217: What is the temporal resolution of the dataset, monthly mean or daily mean? Which CMIP6 experiment was used for the baseline period? Both the baseline and future periods are only 20 years. The climatological means averaged over 20 years may still contain internal climate variability, e.g., AMO or PDO, which may affect the evaluation and selection of the models to a certain extent.
Thank you for highlighting that this is not clear; we use monthly datasets and the historical experiment for the baseline. We have selected the time periods used in the assessment to align with the European Projections Project (e.g., Brunner et al., 2020).
Brunner, L. et al. (2020) ‘Comparing Methods to Constrain Future European Climate Projections Using a Consistent Framework’, Journal of Climate, 33(20), pp. 8671–8692. doi: 10.1175/JCLI-D-19-0953.1
L225: Please clarify what reanalysis and observational data were used in this study.
ERA5 was the reanalysis dataset used for the assessment criteria; the exception is precipitation, for which E-OBS data were used. This will be clarified in the text.
L254-255: How is the circulation pattern measured? Is the RMSE calculated using two wind speed fields or as an RMS vector error between two vector fields? If the RMSE is calculated with wind speed, it does not reflect the errors in wind direction. Instead, the RMSE for a vector field can reflect errors in both wind speed and wind direction. Therefore, I suggest the authors use the latter. Similarly, the difference in wind speed illustrated in Fig. 1 can only describe the errors in wind speed. The same wind speed does not mean the same wind direction. The authors may consider using a vector difference between the model and ERA5. The magnitude of the vector difference takes differences in both wind speed and wind direction into account.
The wind speed was used as a measure of the magnitude of the error, while the circulation pattern of wind direction and magnitude was assessed visually. Thank you for this suggestion; it may be interesting to use the vector error in addition to the wind speed and see whether this is a better indicator of errors in the circulation pattern.
Xu et al, 2016: A diagram for evaluating multiple aspects of model performance in simulating vector fields. Geosci. Model Dev., 9, 4365–4380
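To make the distinction concrete, the sketch below contrasts a wind-speed RMSE with an RMS vector error of the kind the reviewer suggests; the array names and dummy fields are illustrative assumptions, not the diagnostics used in the assessment.

```python
import numpy as np

def speed_rmse(u_mod, v_mod, u_ref, v_ref):
    """RMSE of wind speed only; insensitive to errors in wind direction."""
    speed_mod = np.hypot(u_mod, v_mod)
    speed_ref = np.hypot(u_ref, v_ref)
    return float(np.sqrt(np.mean((speed_mod - speed_ref) ** 2)))

def rms_vector_error(u_mod, v_mod, u_ref, v_ref):
    """RMS magnitude of the vector difference; reflects both speed and direction errors."""
    return float(np.sqrt(np.mean((u_mod - u_ref) ** 2 + (v_mod - v_ref) ** 2)))

# A wind field rotated by 90 degrees has zero speed error but a large vector error.
u_ref, v_ref = np.ones((10, 10)), np.zeros((10, 10))
u_mod, v_mod = np.zeros((10, 10)), np.ones((10, 10))
print(speed_rmse(u_mod, v_mod, u_ref, v_ref))        # 0.0
print(rms_vector_error(u_mod, v_mod, u_ref, v_ref))  # ~1.41
```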
L270: Please explain how the “track density” is defined. Please use the degree symbol “°” to represent latitude and longitude here and elsewhere.
Hodges, K. I. (1994): A general method for tracking analysis and its application to meteorological data. Mon. Wea. Rev., 122, 2573–2586, https://doi.org/10.1175/1520-0493(1994)122<2573:AGMFTA>2.0.CO;2.
Hodges, K. I. (1995): Feature tracking on the unit sphere. Mon. Wea. Rev., 123, 3458–3465, https://doi.org/10.1175/1520-0493(1995)123<3458:FTOTUS>2.0.CO;2.
Priestley, M. D. K. et al. (2020) ‘An Overview of the Extratropical Storm Tracks in CMIP6 Historical Simulations’, Journal of Climate. Boston MA, USA: American Meteorological Society, 33(15), pp. 6315–6343. doi: 10.1175/JCLI-D-19-0928.1
L321: “depending to on” -> “depending on”
L334: “with with” -> “with”
L343-345: How about the range of other quantities, e.g. precipitation and storm track density?
The authors agree that it would be interesting to investigate other variables; however, considering projections from the filtered ensemble for all the criteria that have been assessed would extend the scope and length of the existing paper considerably. This is something that the authors are interested in exploring further, and more thoroughly, in a follow-up paper.
L362: Please clarify what numerical score was given for each group of models.
This information can be added to the manuscript in the supplementary info. Each model was scored individually. Models were classified for individual criteria and not given an overall classification. Models were then sub-selected based on whether they had any red flags (inadequate) and the percentage of orange (unsatisfactory) flags. This is presented as one example of how the assessment can be used to sub-select models. Model sub-selection is always subjective to some extent and the approach will depend on the application.
L644: “35°N-75°” -> “35°N-75°N”
Fig. S4: What does the “??” refer to in the figure caption?
This is a typo; it refers to Table 2 in the main manuscript. Thank you for noting this; it will be corrected.
CC1: 'Comment on esd-2022-31', Dave Rowell, 27 Sep 2022
You could check that the reduction in spread in Fig. 6 is statistically significant, particularly because random sub-sampling will also inevitably reduce uncertainty. If it is significant, that would strengthen the paper. E.g., there's a simple bootstrapping approach for this in Sect. 5 of Rowell et al. (2016, Climatic Change).
Citation: https://doi.org/10.5194/esd-2022-31-CC1
AC2: 'Reply on CC1', Tamzin Palmer, 30 Sep 2022
Thank you for this suggestion; this will be applied.
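As an indication of what such a test might look like, the sketch below follows the spirit of the suggested bootstrap: the spread of the filtered subset (measured here as the standard deviation, purely as an assumption) is compared against the spreads of many random subsets of the same size drawn from the full ensemble. All names, values and the choice of statistic are illustrative and not the exact method of Rowell et al. (2016).

```python
import numpy as np

def spread_reduction_pvalue(full_ensemble, filtered_subset, n_boot=10000, seed=0):
    """Fraction of random same-size subsets whose spread is <= the filtered subset's spread."""
    rng = np.random.default_rng(seed)
    full = np.asarray(full_ensemble, dtype=float)
    subset = np.asarray(filtered_subset, dtype=float)
    target = subset.std(ddof=1)
    hits = 0
    for _ in range(n_boot):
        sample = rng.choice(full, size=subset.size, replace=False)
        if sample.std(ddof=1) <= target:
            hits += 1
    return hits / n_boot

# Hypothetical projected warming values (degC) for a full ensemble and a filtered subset.
full = [2.1, 2.4, 2.8, 3.0, 3.3, 3.6, 3.9, 4.2, 4.6, 5.0]
filtered = [3.0, 3.3, 3.6, 3.9, 4.2]
print(spread_reduction_pvalue(full, filtered))  # chance that a random subset is at least this tight
```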
Citation: https://doi.org/10.5194/esd-2022-31-AC2