Comment on esd-2021-65

The abstract is too general. You need to be more specific and give detailed numerical results of the study (the same comment also applies to the conclusion – no numerical values...). In the current version, no numerical values (temperature or precipitation changes, with uncertainties) are given even though they are an important outcome of this study (especially the CMIP5/CMIP6 comparison, and the model weighting). Please consider modifying the abstract (and conclusion) to include them (e.g. ll. 11-12 summer precipitation change, ll. 9-10 how much more warming in CMIP6 vs. CMIP5, etc.). Also, some of the wording is quite general, like l. 14 "in some regions" -> please specify.

Methods section. I find this section a bit difficult to read. Most of the elements are there but the structure could be improved. For instance, you begin with "all computations were performed with..." But what computations do you perform? They are only introduced later at l. 122. It would be clearer to begin by saying what computations you perform, and then specify all the technical details (regridding, land vs. ocean, etc.). Maybe also consider having a separate sub-section for the weighting approach.

Weights definition. The paragraph on the calculation of weights is not very clear. How exactly are Di and Sij calculated? Is Di = sqrt((TAS-DIFF)**2 + (TAS-STD)**2 + (TAS-TREND)**2 + (PSL-DIFF)**2 + (PSL-STD)**2)? Appendix B is not very helpful either as it does not contain the equation. Similarly, the definition of Sij is unclear; showing an equation would be much better. Additionally, in one sentence the diagnostics are said to be "the 20-year PSL and TAS climatologies" and in the next you say that the diagnostics are computed over the 35-year period. Two different time periods are also used for Di and Sij, which is confusing (see also my next comment). Also, what observational reference do you choose (in DIFF)? The mean calculated over all observation/reanalysis products?
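To make the question concrete: if the weighting follows the commonly used performance-plus-independence form (e.g. as in ClimWIP), Di and Sij would enter as in the sketch below. This is a minimal Python illustration with hypothetical diagnostic names and placeholder sigma values, not the manuscript's actual code; stating the equivalent equations in Appendix B would remove the ambiguity.

```python
from math import exp, sqrt

def distance(diags_model, diags_ref):
    """RMS distance over a set of diagnostics (e.g. TAS/PSL climatology,
    std and trend), each assumed already area-averaged and normalised."""
    return sqrt(sum((diags_model[k] - diags_ref[k]) ** 2 for k in diags_ref))

def weights(models, obs, sigma_d=0.5, sigma_s=0.5):
    """Performance/independence weights of the form
    w_i ~ exp(-D_i^2 / sigma_D^2) / (1 + sum_{j != i} exp(-S_ij^2 / sigma_S^2)).
    `models` maps model name -> dict of diagnostics; `obs` is the reference.
    sigma_d, sigma_s are placeholder shape parameters."""
    names = list(models)
    d = {i: distance(models[i], obs) for i in names}            # D_i (performance)
    s = {(i, j): distance(models[i], models[j])                 # S_ij (independence)
         for i in names for j in names if i != j}
    raw = {}
    for i in names:
        perf = exp(-d[i] ** 2 / sigma_d ** 2)
        indep = 1.0 + sum(exp(-s[i, j] ** 2 / sigma_s ** 2)
                          for j in names if j != i)
        raw[i] = perf / indep
    total = sum(raw.values())
    return {i: w / total for i, w in raw.items()}               # normalised to 1
```

Whatever the actual shapes of Di and Sij, writing them out this explicitly (equation or pseudo-code) is what the Methods section or Appendix B currently lacks.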
Baseline periods. The fact that you use two different periods is confusing. First, 20 years is a bit short to calculate averages; 30 years is usually preferred. You mention that 20 years of data are heavily influenced by inter-annual variability; that is true for trends, but for averages also. The extra 10 years of observations should also be used to assess GCM performance. Since you have to merge the historical and RCP8.5 simulations in CMIP5 to calculate trends for the 1980-2014 period, why not merge them to calculate averages also? The issue with having those two reference periods is that you mix them in the calculation of weights, which is not very consistent.

Trend significance. It seems that to detect statistical significance the authors implement a t-test to determine whether the ensemble-mean trend (in TAS or PR) is significantly different from zero. But that is not really appropriate. Trend statistical significance should be assessed for each model separately based on its interannual variability. The spread in trend values across models is not related to the magnitude of the trends themselves. For instance, there can be a large spread across models (+1, +2, +5, +10°C) but each trend may be statistically significant for the corresponding model (because inter-annual variability is smaller for the +1°C model than for the +10°C model). A better definition of significance in this context might be the fraction of models for which the trend is significant (or, like robustness, to impose that the trend is significant for at least 80% of models). This could change the conclusions for HighResMIP.

Results. The section is a bit difficult to read. Maybe you could try to structure it a bit more? For instance, in section 3.2 you move from temperature to precipitation, back to temperature, and to precipitation again, switching between scenarios, seasons and periods.
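The per-model significance test suggested in the trend-significance comment above can be sketched in a few lines. This is a plain-Python toy, not the authors' method: the critical value of about 2 assumes a reasonably long annual series, and the 80% threshold is the robustness-style choice mentioned above.

```python
from math import sqrt

def trend_t_statistic(y):
    """OLS trend of one model's annual series and its t-statistic, computed
    from the residual (interannual) variability of that same series."""
    n = len(y)
    t = list(range(n))
    tm, ym = sum(t) / n, sum(y) / n
    sxx = sum((ti - tm) ** 2 for ti in t)
    slope = sum((ti - tm) * (yi - ym) for ti, yi in zip(t, y)) / sxx
    resid = [yi - (ym + slope * (ti - tm)) for ti, yi in zip(t, y)]
    se = sqrt(sum(r * r for r in resid) / (n - 2) / sxx)  # std. error of slope
    return slope, slope / se

def fraction_significant(series_per_model, t_crit=2.0):
    """Fraction of models whose own trend is significant (|t| > t_crit);
    one could then require e.g. >= 80% of models, analogous to robustness."""
    flags = [abs(trend_t_statistic(y)[1]) > t_crit for y in series_per_model]
    return sum(flags) / len(flags)
```

The point is that significance is decided model by model against each model's own interannual noise, so a small but quiet trend can be significant while a large but noisy one is not; the ensemble spread never enters the test.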
Discussion. Despite the emphasis on the "hotspot" aspect, the discussion contains no information on the physical mechanisms responsible for the existence of the Mediterranean hotspot. Some literature exists on the topic (e.g., Brogli et al., https://doi.org/10.1175/JCLI-D-18-0431.1; Tuel et al., https://doi.org/10.1175/JCLI-D-20-0429.1). Please consider adding a short discussion on the comparison of the hotspot between CMIP5 and CMIP6, and the links to the known/likely physical mechanisms.

ll.17-19 This sentence is unclear. Can you please rephrase?
l.24 "global warming mean" -> "global-mean warming"
l.26 add "are" before "projected"
ll.26-28 Unclear what this sentence refers to here…
l.37 "tools" -> you mean GCMs?
l.59 "assumption" -> "criteria"?
l.62 "presented in section 3"
l.69 It is not just PSL that is used to calculate model weights; TAS is used also, correct?
l.79 "mangnitudes"
l.82 "has" -> "have"
l.89 "initial conditions" (no "-")
l.96 "containing" -> "including"
l.106 "differences in the thermodynamic properties of the surfaces" -> "differences in surface thermodynamic properties"
l.149 "weight" -> "more weight"?
l.167 "30-45N latitudinal belt mean" -> Why not all land regions? One could argue that to make it a global hotspot one should compare against all other land areas (say of the same size). One issue also is that both the Mediterranean and the 30-45N belt contain many grid points with very small precipitation averages -> potentially large relative changes which may bias the analysis.
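The small-denominator issue can be seen with a toy calculation (hypothetical numbers, not values from the manuscript):

```python
def relative_change(baseline, future):
    """Relative precipitation change in percent."""
    return 100.0 * (future - baseline) / baseline

# Dry grid point: a tiny absolute change becomes a huge relative change.
dry = relative_change(0.1, 0.2)   # +0.1 mm/day on 0.1 mm/day -> ~100 %
# Wet grid point: the same absolute change is barely visible.
wet = relative_change(5.0, 5.1)   # +0.1 mm/day on 5.0 mm/day -> ~2 %
```

Averaging such percentages over a belt containing many dry grid points lets those points dominate the regional-mean relative change.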
Also, you compare to 30-45N values but only over land, right? In that case Figure S3 should not have data over the oceans. For the sake of readers who are not used to land-only values ("global" often means land and ocean), I suggest you specify "global land mean", e.g., at l.166.
l.176 "projects larger precipitation increases in regions where the hotspot has a negative sign such as the southeast of the domain" -> unclear. Larger increases where the change is negative?
l.179 "larger scale means" -> "global average"
Figure S5: What are OBS? It would be better to show here the values for the different observation/reanalysis products, or at least their mean and the range across products (maybe that is what is currently shown; if yes, please specify in the caption). The different markers for the HighResMIP values are also a bit too small to tell apart; maybe make them bigger?
l.206 "for the remaining seasons" -> "for MAM and SON" (or specify in the previous sentence that you look at DJF and JJA).
l.210 "trend" -> "trends"
l.211 "but the PR high-resolution (HR) models trends display outliers in summer" -> "but some of the high-resolution (HR) models exhibit trends outside the CMIP6 range for PR in summer"
Figures 3 and S6: Could you please add horizontal grid lines? Right now it is difficult to look at these figures and see the differences between weighted and unweighted results.
l.220 "under for" -> "under"
l.227 "cannot be drawn". Still, you could compare the HighResMIP values with those of the corresponding low-resolution climate model versions.
l.229 delete "respectively"
l.237 Figure S5, not S3
ll.237-238 "Generally, the signal is weak and the inter-model spread is wide for all multimodel ensembles" -> what does this refer to? Precipitation projections only? If yes, "weak" is not really appropriate: mid-to-long-term trends in JJA precipitation are large (-15% or below).
ll.240-244 What is the conclusion here? If you constrain model ECS then you will get a smaller spread in projections.
l.248 "Student's t-test"
l.253 "CMIP6 systematically projects" instead of "keeps projecting"
l.260 "precipitation changes only get more robust and significant with time" -> does this mean that temperature changes don't get more robust and significant with time?
ll.265-267 Please rephrase.
l.267 "concord" -> "agree"
ll.272-273 It sounds like you are saying that precipitation both increases and declines in the Balkans.
ll.276-277 This sentence comes a bit out of nowhere. Also, what is the 90% range? Please clarify.
l.278 suggest "Weighted projections" to be consistent with section 3.2
l.293 "The mean signal in CMIP6 decrease whereas it increases in CMIP5" -> "decreases"
ll.297-298 "Nevertheless, even if the probability of a future extreme-warming decreases, such temperature increases are still considered valid by the weighted ensemble" -> I suggest rephrasing along the lines of "Nevertheless, even though the weighting approach reduces the probability of the most extreme warming values, they remain possible in the weighted ensemble".
l.304 "Mediterranean"
ll.306-307 suggest rephrasing as "We have shown that average Mediterranean temperature changes were larger than the global-mean average during summer, but close to it during winter, for all scenarios, time periods and model ensembles."
l.324 "no clear improvement could be seen from the increased resolution" -> did you compare the HighResMIP models with the lower-resolution versions of the same GCMs?
l.330 "The largest source of uncertainty to determine the warming and precipitation change by the mid and long-term periods is the emission scenario." Where did you show that? Is it true for both TAS and PR?
l.365 "Precipitation weighted projections are not shown in this study as we have no proof that the diagnostics used to assess temperature are relevant to evaluate the models' precipitation response." -> you could still weight models based on their past precipitation trends, no?