Climate model projections from the Scenario Model Intercomparison Project (ScenarioMIP) of CMIP6

The Scenario Model Intercomparison Project (ScenarioMIP) defines and coordinates the main set of future climate projections, based on concentration-driven simulations, within the Coupled Model Intercomparison Project phase 6 (CMIP6). This paper presents a range of its outcomes by synthesizing results from the participating global coupled Earth system models. We limit our scope to the analysis of strictly geophysical outcomes: mainly global averages and spatial patterns of change for surface air temperature and precipitation. We also compare CMIP6 projections to CMIP5 results, especially for those scenarios that were designed to provide continuity across the CMIP phases, at the same time highlighting important differences in forcing composition, as well as in results. The range of future temperature and precipitation changes by the end of the century (2081–2100) encompassing the Tier 1 experiments based on the Shared Socioeconomic Pathway (SSP) scenarios (SSP1-2.6, SSP2-4.5, SSP3-7.0 and SSP5-8.5) and SSP1-1.9 is wider than in CMIP5, due to higher warming (by close to 1.5 °C) reached at the upper end of the 5 %–95 % envelope of the highest scenario (SSP5-8.5). This is due to both the wider range of radiative forcing that the new scenarios cover and the higher climate sensitivities in some of the new models compared to their CMIP5 predecessors. Spatial patterns of change for temperature and precipitation averaged over models and scenarios have familiar features, and an analysis of their variations confirms model structural differences to be the dominant source of uncertainty. Models also differ with respect to the size and evolution of internal variability as measured by individual models’ initial condition ensemble spreads, according to a set of initial condition ensemble simulations available under SSP3-7.0.
These experiments suggest a tendency for internal variability to decrease over the course of the century in this scenario, a result that will benefit from further analysis over a larger set of models. Benefits of mitigation, all else being equal in terms of societal drivers, appear clearly when comparing scenarios developed under the same SSP but to which different degrees of mitigation have been applied. It is also found that a mild overshoot in temperature of a few decades around mid-century, as represented in SSP5-3.4OS, does not affect the end outcome of temperature and precipitation changes by 2100, which return to the same levels as those reached by the gradually increasing SSP4-3.4 (not erasing the possibility, however, that other aspects of the system may not be as easily reversible). Central estimates of the time at which the ensemble means of the different scenarios reach a given warming level might be biased by the inclusion of models that have shown faster warming in the historical period than observed. Those estimates show all scenarios reaching 1.5 °C of warming compared to the 1850–1900 baseline in the second half of the current decade, with the time span between slow and fast warming covering between 20 and 27 years from present. The 2 °C warming level is reached as early as 2039 by the ensemble mean under SSP5-8.5 but as late as the mid-2060s under SSP1-2.6. The highest warming level considered (5 °C) is reached by the ensemble mean only under SSP5-8.5 and not until the mid-2090s.
Earth Syst. Dynam., 12, 253–293, 2021, https://doi.org/10.5194/esd-12-253-2021

previous scenario set for both baseline and mitigation scenarios). Yet another was chosen to address new policy objectives (SSP1-1.9, designed to meet the 1.5°C target at the end of the century). The request to prioritize initial condition ensemble members for only one of the scenarios (SSP3-7.0) was aimed at gathering sizable ensembles (10 members or more) from various modelling centers. This was decided in recognition of the important role of internal variability in contributing to future changes, whose exploration is facilitated by initial condition ensembles (Santer et al., 2019). It was also recognized that the spread in aerosol scenarios in the four RCPs used in CMIP5 was too narrow, as all assumed a large reduction in atmospheric aerosol emissions (Moss et al., 2010; Stouffer et al., 2017). The new SSP-based scenarios better address this uncertainty by sampling a larger range of aerosol pathways consistent with the corresponding GHG emissions (Riahi et al., 2017). Scenario experiments were enabled by another community effort, input4MIPs: based on the IAMs' emission trajectories, and after harmonization of those to historical emission levels (Gidden et al., 2019), those emission time series were translated and amended with additional input fields for use by ESMs. These range from providing land-use patterns (https://doi.org/10.22033/ESGF/input4MIPs.1127),
Given the multi-model focus of CMIP and the overview purpose of this paper, the results reported here aim at giving a broad-scale representation of ensemble results (mean and ranges, or other measures of variability). The ScenarioMIP design responded to many complex objectives and science questions, among which a high priority was the need to lay the foundation for integrated research across the geophysical, mitigation, impacts, adaptation and vulnerability research communities (O'Neill et al., 2020).
The focus of this paper is to provide physical climate context for these more detailed analyses. Other Model Intercomparison Projects within CMIP6 have prescribed experiments that complement the ScenarioMIP design to address questions about the effects of small radiative forcing differences, specific (and often local) forcings such as those from land use and short-lived climate forcers (SLCFs), the differential effects of emission-driven versus concentration-driven experiments testing the strength of the carbon cycle (Arora et al., 2019), and the effectiveness of emergent constraints in reshaping the uncertainty ranges of the new multi-model ensemble (Nijsse et al., 2020; Tokarska et al., 2020). They are the Land Use MIP (LUMIP, Lawrence et al., 2016), the Aerosol Chemistry MIP (AerChemMIP, Collins et al., 2017), the Coupled Climate-Carbon Cycle MIP (C4MIP, Jones et al., 2016), the Geoengineering MIP (GeoMIP, Kravitz et al., 2015) and the Carbon Dioxide Removal MIP (CDRMIP, Keller et al., 2018).
In this study, we focus the analysis on the future evolution of average temperatures and precipitation. We show time series over the 21st century of means computed globally and over land-only vs. ocean-only areas. We also look at spatial patterns of change with a focus on detecting similarities and differences across models and scenarios. In addition, for three of the new SSP-based scenarios designed to correspond to three CMIP5-era RCPs, we show a comparison of outcomes. Questions about internal variability and benefits of mitigation are also addressed.

ScenarioMIP experiments and participating models
As described in detail in O'Neill et al. (2016) and summarized in the matrix display of Figure A1 in the Appendix, the ScenarioMIP design consists of the following concentration-driven scenario experiments, subdivided into two tiers to guide prioritization of computing resources. Tier 1 consists of four 21st century scenarios. Three of them provide continuity with CMIP5 RCPs by targeting a similar level of aggregated radiative forcing (but we highlight important differences in the coming discussion): SSP1-2.6, SSP2-4.5 and SSP5-8.5. An additional scenario, SSP3-7.0, fills a gap in the medium to high end of the range of future forcing pathways with a new baseline scenario, assuming no additional mitigation beyond what is currently in force. The same scenario also prescribes larger SLCF concentrations and land-use changes compared to the other trajectories.
Only Tier 1, which can be satisfied by one realization per model, is required for participation in ScenarioMIP.
Tier 2 completes the design by adding
• SSP1-1.9, informing the Paris Agreement target of 1.5°C above pre-industrial;
• SSP4-3.4, a gap-filling mitigation scenario;
• SSP4-6.0, an update of the CMIP5-era RCP6.0;
• SSP5-3.4OS (overshoot), which tests the efficacy of an accelerated uptake of mitigation measures after a delay in curbing emissions until 2040: the scenario tracks SSP5-8.5 until that date, then decreases to the same radiative forcing as SSP4-3.4 by 2100;
• three extensions to 2300, two of them continuing on from SSP1-2.6 and SSP5-8.5 and one extending the overshoot pathway towards the lower radiative forcing level of 2.6 W m-2, to inform the analysis of long-memory processes, like ice-sheet melting and corresponding sea level rise;
• nine additional initial condition ensemble members under SSP3-7.0 to explore internal variability and signal-to-noise characteristics of the different participating models.
A list of the participating models, with references for documentation and data, is shown in Table A1 in the Appendix. Table A2 lists the CMIP5 models used in the comparisons.
For the results shown in this section we extracted monthly mean near-surface air temperature (TAS) and precipitation (PR) from the models listed in Table A1. These were averaged globally or separately over land and oceans for time series analysis (no correction for drift was performed), and regridded to a common 1-degree grid by linear interpolation for pattern analysis.
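The global, land-only and ocean-only averaging described above can be illustrated with a minimal sketch. The text does not spell out the weighting used, so the cosine-of-latitude weights for a regular latitude-longitude grid, the function name `area_mean` and the toy land mask are assumptions made for illustration only.

```python
import math


def area_mean(field, lats_deg, mask=None):
    """Cosine-latitude weighted mean of a field on a regular lat-lon grid.

    field    -- list of rows (one per latitude), each a list of values per longitude
    lats_deg -- latitude of each row's cell centre, in degrees
    mask     -- optional grid of booleans of the same shape; only True cells are
                used (e.g. a hypothetical land mask for a land-only average)
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for i, (row, lat) in enumerate(zip(field, lats_deg)):
        # Cell area on a regular grid is proportional to cos(latitude).
        weight = math.cos(math.radians(lat))
        for j, value in enumerate(row):
            if mask is None or mask[i][j]:
                weighted_sum += weight * value
                weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no grid cells selected")
    return weighted_sum / weight_total


# Toy 2 x 2 field: one row at the equator, one at 60 degrees north.
tas = [[1.0, 1.0], [3.0, 3.0]]
lats = [0.0, 60.0]
land = [[True, True], [False, False]]
global_tas = area_mean(tas, lats)        # equatorial row weighs twice as much
land_tas = area_mean(tas, lats, land)    # only the equatorial row is used
```

On the native model grids the cell areas provided with each model would be the natural weights; the cosine weighting is only a stand-in for a regular grid.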
All figures of this paper are produced with the Earth System Model Evaluation Tool (ESMValTool) version 2.0 (v2.0) (Righi et al., 2020; Eyring et al., 2020; Lauer et al., 2020), a tool specifically designed to improve and facilitate the complex evaluation and analysis of CMIP models and ensembles.

Global Temperature and Precipitation
(See Figure A2 in the Appendix for time series of the same variables disaggregated into land-only and ocean-only area averages; also see Tables A3 and A4 for changes under the different scenarios at mid-century and end-of-the-century.) The historical baseline is taken as 1995-2014 (2014 being the last year of the CMIP6 historical simulations). The five scenarios presented in these plots consist of the Tier 1 experiments (SSP1-2.6, SSP2-4.5, SSP3-7.0 and SSP5-8.5) and the additional scenario designed to limit warming to 1.5°C above 1850-1900 (a period often used as a proxy for pre-industrial conditions), SSP1-1.9.
In the plots the thick line traces the ensemble average (see legend or tables for the number of models included in each scenario calculation) and the shaded envelopes represent the 5-95% ranges (assuming a normal distribution, these are obtained as the ensemble mean ± 1.64σ, where σ is the inter-model standard deviation of annual means). Only one ensemble member (in the majority of cases r1i1p1f1) is used even when more runs are available for some of the models. By the end of the century (i.e., as the mean of the period 2081-2100) the range of warming spanned by the ensemble means is between 0.80°C and 4.03°C relative to 1995-2014 (0.84°C more when using the 1850-1900 baseline). Considering the multi-model ensemble means as the best estimates of the forced response under each scenario, the range spanned by them can be interpreted as an estimate of scenario uncertainty. When considering the shaded envelopes around the ensemble mean trajectories, reflecting the compound effects of model-response uncertainty and the (likely conservative) measure of internal variability in the individual model trajectories, about 0.7°C at the lower end and 1.6°C at the upper end are added to this range. Using the 5-95% confidence intervals as ranges, we find that by the end of the 21st century (2081-2100 average, always compared to the 1995-2014 average) global mean temperatures are projected to increase by 2.42°C to 5.64°C for SSP5-8.5, by 1.84°C to 4.48°C under SSP3-7.0, and by 1.14°C to 3.08°C for SSP2-4.5. Global temperatures stabilize or even somewhat decline in the second half of the century in SSP1-1.9 and SSP1-2.6, which span a range from 0.1°C to 1.49°C and 0.41°C to 1.92°C, respectively, whereas they continue to increase to the end of the century in all other SSPs. The ensemble spread appears to consistently increase with higher forcing and over time.
This suggests that the model response uncertainty increases for stronger responses, an expected result that appears robust, given the number of models involved in this synthesis (around 30 for all Tier 1 experiments). Only the number of models contributing to the lowest scenario (SSP1-1.9) is considerably smaller, i.e., 10 at the time of writing, but the analysis of ensemble behavior of Section 3.2.1 below suggests that, for these two global quantities, ten ensemble members provide a representative sample of the internal climate variability.
https://doi.org/10.5194/esd-2020-68 Preprint. Discussion started: 16 September 2020 © Author(s) 2020. CC BY 4.0 License.
The same qualitative behavior appears for land-only and ocean-only averages (Figure A2), with the faster warming over land than ocean reaching on average up to 5.58°C under SSP5-8.5 (compared to the global average reaching 4.03°C) and some models reaching a much larger value under this scenario, as the shading indicates. For the lower scenarios, limiting warming in 2100 to 0.80°C and 1.16°C globally translates to an average warming on land of 1.10°C and 1.59°C, respectively, for SSP1-1.9 and SSP1-2.6 (see Table A3 for all projections, and their ranges referenced to the historical baseline).
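The construction of the thick line and shaded envelope described for the time series plots (ensemble mean and a 5-95% range of ± 1.64σ under a normal assumption) can be sketched as follows. The function name, the toy numbers and the use of the sample (n-1) standard deviation are assumptions; the text only specifies the 1.64σ rule.

```python
import statistics

Z_5_95 = 1.64  # half-width of the 5-95% range in units of sigma, as in the text


def ensemble_envelope(model_series):
    """Per-year ensemble mean and 5-95% bounds.

    model_series -- list of equally long annual series, one per model
    Returns three lists: mean, lower bound, upper bound.
    """
    mean, lower, upper = [], [], []
    for values in zip(*model_series):          # all models at one year
        m = statistics.fmean(values)
        sigma = statistics.stdev(values)       # sample standard deviation (assumption)
        mean.append(m)
        lower.append(m - Z_5_95 * sigma)
        upper.append(m + Z_5_95 * sigma)
    return mean, lower, upper


# Three hypothetical models, two years of anomalies (degrees C).
mean, lower, upper = ensemble_envelope([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
```

Read this way, the SSP5-8.5 end-of-century numbers quoted above (mean 4.03°C, range 2.42°C to 5.64°C) correspond to a half-width of about 1.61°C, i.e. an inter-model σ of roughly 0.98°C, if the envelope is taken to be symmetric about the mean.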
In order to characterize when pairs of scenarios diverge, we define separation, as in Tebaldi and Friedlingstein (2013), as the first occurrence of a positive difference between two time series, one under the higher and one under the lower forcing scenario, which is then maintained for the remainder of the century (note that the definition would need to be modified if overshoot scenarios, which cross their reference as they decrease, were the main focus of this analysis). We use time series of global mean surface air temperature (GSAT) after applying a 21-year running mean, as we are concerned with differences in climate rather than in individual years, whose average temperatures are affected by large variability. We also choose 0.1°C as the threshold above which we consider the difference "positive". In Table A5 we report the precise years when the ensemble means of the smoothed GSAT time series under the various scenario pairs separate according to this definition, and, in parentheses, when the last of all individual models' pairs of trajectories separate. Here we discuss the results in more qualitative terms. The ensemble average trajectory of GSAT under SSP5-8.5 separates from the lower scenarios' ensemble average trajectories between 2030 and 2035, with the later date, as expected, applying to the separation from SSP3-7.0. SSP3-7.0 separates from the two scenarios at the lower end of the range just after 2035, and ten years later from SSP2-4.5. The ensemble average trajectory of global temperature under SSP2-4.5 separates from those under the two lower scenarios by 2040, while five additional years are needed for the ensemble average GSAT under the two lower scenarios, SSP1-1.9 and SSP1-2.6, to separate from one another (in Figure A3 the differences between ensemble averages for each pair of scenarios appear as red lines).
When considering individual models' trajectories under the different scenarios, and therefore defining the time of separation as the time when the last of all individual trajectories separates, model structural differences and a larger effect of internal variability cause a significant delay compared to the ensemble mean separation (Figure A3, black lines). For the lowest scenario, SSP1-1.9, 5 more years are required for the last of the 10 model trajectories to separate (Figure A3, left panels). This result, however, may not be robust, given the small number of models available. For the larger ensembles (of 30 members) available under all the pairs of scenarios from Tier 1, requiring every individual model to satisfy the criterion delays separation by between 7 and 25 years. For example, about 7 additional years are needed for SSP5-8.5 to separate from SSP1-2.6, 12 more to separate from SSP2-4.5, and about 25 more to separate from SSP3-7.0.
Ensemble mean precipitation change by 2081-2100 (as a percentage of the 1995-2014 baseline) is between 2.4 and 2.9% for the lowest scenarios (SSP1-1.9 and SSP1-2.6), 4.1 and 4.8% for SSP2-4.5 and SSP3-7.0, and close to 6.5% for SSP5-8.5. As expected, the larger variability of precipitation changes (relative to temperature changes), both from internal sources and model response uncertainty, is such that only the highest scenario ensemble mean trajectory separates from the lower ones before 2050 while the remaining four scenario ensemble means overlap until close to 2070. The multi-model spread and year-to-year variations confound the trajectories under the different scenarios until the end of the century (Figure 1, right panel).
Both the magnitude of the changes and their variability are larger for precipitation averages over land than over oceans (Figure A2; see also Table A4 for a more complete list of mid- and late-century changes).
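The separation criterion used above for pairs of GSAT trajectories (21-year running mean, 0.1°C threshold, difference maintained for the remainder of the series) can be written as a short sketch. The centred window, the function names and the synthetic series are assumptions for illustration; the text does not state how the running mean is aligned.

```python
def running_mean(series, window):
    """Centred running mean; only years with a full window are returned."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    return [sum(series[i - half:i + half + 1]) / window
            for i in range(half, len(series) - half)]


def separation_year(years, high, low, window=21, threshold=0.1):
    """First smoothed year after which high - low stays above threshold."""
    half = window // 2
    smoothed_years = years[half:len(years) - half]
    diffs = [h - l for h, l in zip(running_mean(high, window),
                                   running_mean(low, window))]
    separated_since = None
    for year, diff in zip(smoothed_years, diffs):
        if diff > threshold:
            if separated_since is None:
                separated_since = year     # candidate start of separation
        else:
            separated_since = None         # difference fell back: reset
    return separated_since


# Synthetic example: the higher scenario pulls away by 0.01 degrees C per year.
years = list(range(2015, 2101))
low = [0.0 for _ in years]
high = [0.01 * (y - 2015) - 0.005 for y in years]
sep = separation_year(years, high, low)
```

Applying the same function to each model's pair of trajectories and taking the latest year returned would give the "last of all individual models" date reported in parentheses in Table A5, under the same assumptions.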

Normalized Patterns
In Figure A4 we show ensemble average patterns of change by the end of the century under the five scenarios for both variables. In this section we focus our discussion on the general features emerging from the average normalized patterns.
Normalized patterns are computed as the end-of-century (percent) change compared to the historical baseline, divided by the corresponding change in global mean temperature. This computation is first performed for each individual model/scenario, at each grid point, after regridding temperature and precipitation output to a common 1°x1° grid. The individual normalized patterns are then averaged across models and the five scenarios. As we will show, the total variation among the population of normalized patterns that form this grand average is mainly driven by inter-model variability, rather than inter-scenario differences. Thus we choose to synthesize patterns of change across all scenarios by presenting regional changes per degree of global warming. More in-depth analyses, also exploiting complementary experiments from LUMIP and AerChemMIP, may provide a more refined view of the inter-scenario differences possibly arising from different regional forcings.
Figure 2 shows the spatial characteristics of warming, and of wetting and drying. For temperature changes, the left panel confirms the well-established gradient of warming decreasing from Northern high latitudes (with the Arctic regions warming at twice the pace of the global average) to the Southern Hemisphere, and the enhanced warming in the interior of the continents compared to ocean regions (which consistently warm slower than the global average). This differential is particularly pronounced in the Northern Hemisphere (and would be muted if the normalized pattern were computed at equilibrium). The familiar cooling spot in the North Atlantic appears as well, the only region with a negative sign of change. Studies have suggested that the cooling signal is an effect of the slowing of the Atlantic Meridional Overturning Circulation, which creates a signal of slower northward surface-heat transport, resulting in an apparent local cooling (Caesar et al., 2018; Keil et al., 2020).
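The normalization described above (local change divided by the change in global mean temperature, then averaged across model/scenario pairs) can be illustrated with a minimal sketch on plain nested lists. The function names and the one-row toy grid are hypothetical; the regridding step to the common grid is assumed to have been done already.

```python
def normalized_pattern(baseline, future, global_warming, percent=False):
    """Local change per degree of global warming on a common grid.

    baseline, future -- grids (lists of rows) of period means
    global_warming   -- change in global mean temperature between the two periods
    percent          -- if True, express the local change as % of the baseline
    """
    if global_warming == 0:
        raise ValueError("global warming must be non-zero")
    pattern = []
    for base_row, future_row in zip(baseline, future):
        row = []
        for base, fut in zip(base_row, future_row):
            change = fut - base
            if percent:
                change = 100.0 * change / base
            row.append(change / global_warming)
        pattern.append(row)
    return pattern


def average_patterns(patterns):
    """Grid-point mean over a list of equally shaped normalized patterns."""
    n = len(patterns)
    return [[sum(values) / n for values in zip(*rows)]
            for rows in zip(*patterns)]


# Toy one-row grid with 2 degrees C of global warming.
tas_pattern = normalized_pattern([[10.0, 20.0]], [[12.0, 24.0]], 2.0)
pr_pattern = normalized_pattern([[10.0, 20.0]], [[12.0, 24.0]], 2.0, percent=True)
grand_mean = average_patterns([[[1.0, 2.0]], [[3.0, 4.0]]])
```

In the percent case the baseline sits in the denominator, which is why regions with very small climatological precipitation can show large normalized changes.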
For precipitation, the strongest positive changes are in the equatorial Pacific and the highest latitudes of both hemispheres, especially the Arctic region. The large changes in subtropical Africa and Asia are due more to the small precipitation amounts of the climatological averages in these regions (in the denominator of these percent changes) than to a truly substantial increase in precipitation (see also below, for variability considerations). A strong drying signal continues to be projected for the Mediterranean together with Central America, the Amazon region, Southern Africa and Western Australia.
Similar to Tebaldi and Arblaster (2014), we give a measure of robustness of these patterns by computing the standard deviation at each grid point across individual model/scenario patterns (Figure A5). We further distinguish the relative contribution of scenario and model variability by computing standard deviations after averaging across models separately for each individual scenario, and across scenarios for each individual model, respectively. Figure A5, top row, highlights in darker colors regions where the standard deviation is higher and patterns are less robust. For temperature patterns, as has been found in earlier studies of pattern scaling (starting from Santer et al. (1990) and in more recent work, like Herger et al. (2015)), the edges of sea ice retreat at both poles are areas where models disagree, and scenarios, to a lesser degree, can be at odds due to their different timing of persistent ice melt. The variability and therefore uncertainty of the precipitation pattern mirrors the signal of change at low latitudes in the Pacific and over Africa and Asia. The comparison of patterns in the middle and bottom rows of the figure elucidates the role of inter-model variability rather than scenario variability for both temperature and precipitation normalized changes, with scenario uncertainty only contributing to a small area of sea ice variability in the Arctic for temperature change, and a subregion of the Sahara for precipitation change. Given the radically different sample sizes used to compute the averages from which scenario-driven standard deviations are derived compared to model-driven ones (on the order of 30 for the former, and only 5 for the latter), we can also infer that internal variability is a likely contributor to model-driven standard deviation, while it is mostly eliminated before the computation of the scenario-driven standard deviation.
The robustness of these multi-model average patterns and the sources of their variability can be assessed by considering the same type of graphics computed from the four RCPs from the CMIP5 model ensemble.
We deem a rigorous quantification of the differences between patterns beyond the scope of this paper, and focus on a qualitative assessment of the similarities that surface. As mentioned, the use of these experiments in conjunction with their variants by LUMIP and AerChemMIP could further attribute some of these scenario-dependent features to differences in regional forcing like land use or aerosols. Also, a subset of CMIP6 models are running the CMIP5 RCPs, and results from those experiments will allow a clean analysis of variance, partitioning sources between model and scenario "generations".
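The split between scenario-driven and model-driven standard deviations described for Figure A5 can be sketched for a single grid point as follows. The function name and the toy table are hypothetical, and the sketch assumes a complete model-by-scenario table, which is a simplification since not every model ran every scenario.

```python
import statistics


def variability_sources(values):
    """Split the spread of normalized changes at one grid point.

    values -- values[m][s] is the normalized change for model m and scenario s
              (a complete model-by-scenario table is assumed)
    Returns (total_sd, scenario_sd, model_sd).
    """
    n_scenarios = len(values[0])
    total_sd = statistics.stdev([v for row in values for v in row])
    # Average across models for each scenario, then take the spread of those means.
    scenario_means = [statistics.fmean([row[s] for row in values])
                      for s in range(n_scenarios)]
    # Average across scenarios for each model, then take the spread of those means.
    model_means = [statistics.fmean(row) for row in values]
    return (total_sd,
            statistics.stdev(scenario_means),
            statistics.stdev(model_means))


# Three hypothetical models (rows) under two scenarios (columns).
table = [[1.0, 1.1], [3.0, 3.1], [5.0, 5.1]]
total_sd, scenario_sd, model_sd = variability_sources(table)
```

Each scenario mean averages over many models, so most internal variability cancels before the scenario-driven spread is taken; each model mean averages over only a handful of scenarios, so more of it survives in the model-driven spread, as noted in the text.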

Comparison of climate projections from CMIP6 and CMIP5 for three updated scenarios
In the previous section the comparison of normalized patterns was by construction scenario independent. The design of ScenarioMIP, however, deliberately included scenarios aimed at updating CMIP5 RCPs, and three of those are in Tier 1.
Updates in the historical point of departure (2015 for CMIP6 rather than 2006 for CMIP5) together with updates in the models forming the ensemble are obvious differences that hamper a straightforward comparison. In addition, the emission composition of the scenarios also changed with the update, and we summarize how after presenting the projection comparison.
We show time series of global temperature for the three updated scenarios and the corresponding results from their CMIP5 counterparts: SSP1-2.6 vs. RCP2.6, SSP2-4.5 vs. RCP4.5, and SSP5-8.5 vs. RCP8.5 from CMIP6 and CMIP5, respectively. We show warming relative to the same historical baseline of 1986-2005 used by CMIP5 (Taylor et al., 2012) and to 1850-1900. We further show how observational constraints applied to the range of trajectories from the new models based on recently published work (Tokarska et al., 2020) result in lower and narrower projections at the end of the century, and have the effect of bringing CMIP6 projections into closer alignment with CMIP5 end-of-century warming.
As in Figure 1 (left panels of both rows), but as anomalies/percent changes from the period 1986-2005, i.e., the last 20 years of the CMIP5 historical period (Taylor et al., 2012). The right-hand side panels show CMIP5 results for the three corresponding RCPs (see Table A2 for a list of the models used), also using the 1986-2005 baseline. The right axis on the temperature plots allows an assessment of changes compared to the 1850-1900 baseline. Table A6 lists mid- and late-century changes for all model ensembles under the different scenarios. The new unconstrained results reach on average warmer levels, and have a larger inter-model spread, especially when comparing SSP5-8.5 to RCP8.5.
There is between 0.35°C (for the scenarios reaching 2.6 W m-2) and 0.50°C (for the 4.5 and 8.5 W m-2 scenarios) more mean warming, while the upper end of the shading for SSP5-8.5 reaches 1.15°C higher than the CMIP5 results (Table A6). The larger warming resulting from the CMIP6 experiments is a combination of different forcings and the presence among the new ensemble of models with higher climate sensitivities than the members of the previous generations. The higher climate sensitivities in CMIP6 compared to CMIP5 (Meehl et al., 2020; Zelinka et al., 2020) become more critical for higher forcings, explaining the differential in the higher warming across the range of new scenarios, with the largest difference evident for SSP5-8.5.
Tokarska et al. (2020) and Liang et al. (2020) are at the time of writing the only published studies that sought to constrain the ensemble projections according to the evaluation of the ensemble historical behavior (Ribes et al., 2020, adopts a similar approach and is currently in revision). All studies find a strong correlation between the simulated warming trends over the observed historical period and the warming in SSP scenarios, which suggests constraining future warming using observed warming trends estimated from several observational products. Here in the top left panel of Figure 4 (and in Table A6) we show constrained ranges from Tokarska et al. (2020) as 2081-2100 means and note that the result is to bring CMIP6 projections closer to CMIP5 ranges in both mean and spread (especially the upper bound). In other words, the models that project the most warming tend to do the least well in reproducing historical warming trends. After constraining, the difference in the mean changes by 2081-2100 is 0.08°C and 0.15°C for the two lower scenarios, respectively, and a negative 0.17°C (i.e., CMIP5 warming more than CMIP6) under SSP5-8.5/RCP8.5. The upper ranges are now in all cases within less than a tenth of a degree.
A similar result is produced by applying the approach of the second study (Liang et al., 2020, not shown). A fourth study (Brunner et al., 2020, available as a discussion paper) supports these conclusions as well. Note, however, that the CMIP5 projections were not subjected to the same constraints, which arguably would have changed their statistics as well and possibly recovered at least some of the differential seen in the constrained projections.

Global precipitation projections follow temperature projections (O'Gorman et al., 2012), and therefore we see (unconstrained) CMIP6 trajectories reaching higher percent changes than CMIP5, with the same increasing differential across the three scenarios from lowest to highest. In particular, we see up to 1% more change in the ensemble mean by the end of the century for SSP5-8.5 compared to RCP8.5. Consistent with the relatively larger means, the spread of trajectories along individual scenarios, which combines internal variability with model uncertainty, is larger for the new models and scenarios.
As mentioned, part of the differences described are due to forcing differences between the corresponding scenarios in CMIP5 and CMIP6. These are by design small in terms of aggregate radiative forcing, when radiative forcing is defined as IPCC-AR5-consistent total global stratospheric adjusted radiative forcing (AR5-SARF). By this measure of forcing, scenarios differ by less than 6% in 2100 for the RCP2.6-SSP1-2.6 pair, 5% for the RCP4.5-SSP2-4.5 pair and around 0.3% at 8.9 W m-2 for the RCP8.5-SSP5-8.5 pair. Differences over the full pathway from 2015 to 2100 are below 15%, 5% and 4%, respectively. However, the literature in recent years has moved away from the AR5-SARF definition (in particular, Etminan et al., 2016; see also implementation in Meinshausen et al., 2020), towards the use of effective radiative forcing (ERF), which differs from AR5-SARF in that it includes any non-temperature mediated feedbacks (see e.g., Smith et al., 2018). Given that CMIP5 and CMIP6 concentration pathways differ with respect to their composition across gases and other radiatively active species, whose respective ERFs can be very different despite a similar AR5-SARF, the similarity between RCP and SSP scenarios in terms of forcing deteriorates when moving away from an AR5-SARF definition. For example, in SSP5-8.5 the AR5-SARF contribution of CH4 is by 2100 about 0.5 W m-2 lower than in the CMIP5 RCP8.5 pathway. This is offset by the difference in CO2 AR5-SARF, where SSP5-8.5 is around 0.5 W m-2 higher. In contrast, these compensating effects no longer hold when using ERF. In fact, because ERF is higher than AR5-SARF for CO2 and even more so for CH4, the 2100 radiative forcing levels after which both the RCP and SSP pathways are named are no longer met precisely when measured by ERF. Another pronounced difference between the CMIP5 RCPs and the new generation of SSP-RCP scenarios is that the latter span a wider range of aerosol emissions and corresponding forcings.
The main reason for this difference is a wider consideration of the possible development of air pollution policies, ranging from major failure to address air pollution in the SSP3-7.0 pathway to very ambitious reductions of air pollution in the SSP1-2.6, SSP1-1.9 as well as SSP5-8.5 pathways (Rao et al., 2017). By comparison, all the CMIP5 RCPs followed a more "middle of the road" pollution policy path. Last, the effective radiative forcing levels reached by both sets of pathways can differ, depending on each climate model's processes, from the nominal AR5-SARF values labeling the pathway, which are usually obtained by running the emission pathways through simple models, such as MAGICC in its AR5-consistent setup (Riahi et al., 2017). A recent study with the EC-Earth model finds that about half of the difference in warming by the end of the century when comparing CMIP5 RCPs and their updated CMIP6 counterparts is due to differences in effective radiative forcing at 2100 of up to 1 W m-2 (Wyser et al., 2020). Figure A7, adapted from Meinshausen et al. (2020), shows a breakdown of the comparison into the three main forcing agents among greenhouse gases, CO2, CH4 and N2O, from which the significant differences in the composition can be assessed. Next to the AR5-consistent SARF time series, we also show effective radiative forcing ranges under the SSPs for the end of the 21st century for comparison using a newer version of MAGICC,
Here we note that, in an effort to make the comparison more direct, CMIP5 RCP forcings are available to be run with CMIP6 models, and at the time of writing several modeling centers have started these experiments, which have been added to the Tier 2 design of ScenarioMIP since the description in O'Neill et al. (2016). If enough models contribute these results, a cleaner comparison of the effects of the updated forcing pathways, controlling for the effect of the updated models, will be possible. Preliminary results with the Canadian model, CanESM2, confirm the significant role of higher radiative forcings found with EC-Earth.

Scenarios and Warming Levels
The ever-increasing attention to warming levels as policy targets, also due to the recognition that strong relations are found between them and a large set of impacts, motivates us to identify the time windows at which the new scenarios' global temperature trajectories reach 1.5, 2.0, 3.0, 4.0 and 5.0°C above the 1850-1900 average. Table 1 shows the timing of the first crossing of each threshold by the ensemble average and the 5-95% uncertainty range around that date. This is derived by computing the 5-95% range for the ensemble of trajectories of GSAT, and identifying the dates at which the upper and lower bounds of the range cross the threshold. The range is computed by assuming a Normal distribution for the ensemble, as the ensemble mean plus or minus 1.64 times the inter-model standard deviation. Considering this range rather than the minimum and maximum bounds of the ensembles mitigates the fact that the different scenarios have been run with different ensemble sizes, some as small as 10 members.
The analysis is conducted after smoothing each of the individual models' time series by an 11-year running average, to smooth out interannual variability. The width of the intervals would change if constraints based on the observed warming trends were applied to the ensemble along the whole century (as shown in Figure 4 for the end of the century), but here the unconstrained ensemble is used. The anomalies from 1850-1900 are computed as described in section 3.1.1, by computing anomalies with respect to the historical baseline (1995-2014) and then adding the offset value of 0.84°C.
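The procedure described above can be illustrated with a minimal sketch. It is not the code used for Table 1: the function names and the input layout (one list of annual GSAT anomalies per model, relative to 1995-2014) are assumptions made for illustration, and only the Python standard library is used. The 11-year smoothing, the 0.84°C offset and the 1.64 multiplier are taken from the text.

```python
from statistics import mean, stdev

OFFSET_1850_1900 = 0.84  # degC offset added to 1995-2014 anomalies (from the text)
Z_90 = 1.64              # half-width of a 5-95% Normal range, in standard deviations


def running_mean(series, window=11):
    """Centered running mean; returns (index of first smoothed value, values)."""
    half = window // 2
    smoothed = [mean(series[i - half:i + half + 1])
                for i in range(half, len(series) - half)]
    return half, smoothed


def first_crossing(years, values, threshold):
    """First year at which values reach the threshold, or None if never reached."""
    for year, value in zip(years, values):
        if value >= threshold:
            return year
    return None


def crossing_times(years, ensemble, threshold, window=11):
    """Crossing years for the ensemble mean and the 5-95% range bounds.

    ensemble: hypothetical input, one annual anomaly series per model,
    relative to the 1995-2014 baseline.
    """
    smoothed = []
    for series in ensemble:
        start, values = running_mean(series, window)
        smoothed.append([v + OFFSET_1850_1900 for v in values])
    kept_years = years[start:start + len(smoothed[0])]

    central, lower, upper = [], [], []
    for column in zip(*smoothed):  # one column per year, across models
        m, sd = mean(column), stdev(column)
        central.append(m)
        lower.append(m - Z_90 * sd)
        upper.append(m + Z_90 * sd)

    # The upper bound crosses first, so it gives the earliest date of the range.
    return {
        "earliest": first_crossing(kept_years, upper, threshold),
        "central": first_crossing(kept_years, central, threshold),
        "latest": first_crossing(kept_years, lower, threshold),
    }
```

A value of None plays the role of the "NA" entries in the table: the corresponding trajectory does not reach the threshold within the smoothed period.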
We first synthesize results from the Tier 1 experiments, for which a similar ensemble of between 28 and 33 models is available, and for which we can therefore robustly draw similarities and contrasts.

The lowest warming level of 1.5°C from pre-industrial is reached on average between 2025 and 2028 across SSP1-2.6, SSP2-4.5, SSP3-7.0 and SSP5-8.5, with largely overlapping confidence intervals that start from 2020 as the shortest waiting time and extend until 2048 at the latest under SSP2-4.5. Note however that the lower bound of the ensemble trajectories under SSP1-2.6 does not warm to 1.5°C for the whole century (the NA as the upper bound signifies "not reached"). The next level of 2.0°C is reached as soon as 13 years later on average under SSP5-8.5, and as late as 33 years later under SSP1-2.6, a striking reminder of how different the pace of warming is in these scenarios. The confidence intervals have similar lower bounds between 2028 and 2030 but extend to 2085 for SSP2-4.5, while they are significantly shorter for the higher scenarios.

Only 10 models among the 30-plus running Tier 1 experiments are available at the time of writing under the lowest scenario, specifically designed to meet the Paris Agreement target of 1.5°C warming by the end of the century. Of those, one remains below that target for the entire century, while others have a small overshoot of the target, which was expected by design. The ensemble mean reaches 1.5°C already by 2026. The lower bound never crosses that level, while the upper bound is already at 1.5°C currently, i.e., by 2020 (as a reminder, CMIP6 future simulations start in 2015, so it is not impossible for a warm model to warm by the fraction of a degree needed to reach the target in 5 years).
In Table A7 in the Appendix, a comparison of CMIP5 and CMIP6 for the three corresponding scenarios (SSP1-2.6, SSP2-4.5 and SSP5-8.5 compared to RCP2.6, RCP4.5 and RCP8.5) shows dates compatible with the warmer characteristics of the CMIP6 models and scenarios. On average, the same target is reached from 2 to 9 years earlier by the CMIP6 ensemble. A more in-depth analysis than is within our scope is necessary to fully characterize the causes of this acceleration. Here we note that we are using unconstrained projections, where high climate sensitivity models, including those less adherent to historical trends, play a role in the behavior of the ensemble mean and of course of the upper bound of the range. In addition, as we discussed in the previous section, even scenarios having the same AR5-SARF label see different forcings at play. The result is to make the pace of warming faster, and, in several cases, a target that was not reached by the CMIP5 models under a given scenario is instead reached by the corresponding CMIP6

SSP3-7.0 Initial Condition Ensembles
Five models (CanESM5, IPSL-CM6A-LR, MPI-ESM1-2-HR, MPI-ESM1-2-LR and UKESM1) contributed at least ten initial condition ensemble members under SSP3-7.0. We focus here on the behavior of the ensemble spread over the 21st century, as measured by the values of the inter-realization standard deviations. In the following, the phrase "ensemble spread" is to be interpreted as the value of this standard deviation. Figure 5 shows the time evolution (over 1980-2100) of the ensemble spreads for global temperature and precipitation computed on an annual basis (top row) and after smoothing the individual time series by an 11-year running mean (bottom row). One of the models, CanESM5, provides 50 ensemble members, from which we randomly select subsets of 10 members to form a background "distribution" of the time series of ensemble spreads, shown in grey in Figure 5. This is not meant to provide a quantitative assessment but rather a qualitative representation of the variability of "10-member ensembles", which is what most models provide. When we compute trends for the time series of the temperature ensemble spread, all show a negative slope, indicating that the ensemble spread has a tendency to narrow over time. In the case of the spread computed among annual values, only two of the models pass a significance test at the 5% level, while for decadal averages all models show significantly decreasing spreads (significantly negative trends). Trends of the ensemble spreads for precipitation are non-significant for all models when the spread is computed from annual values, while all are significantly negative, indicating a decrease in the spread, when it is computed from decadal means. This result appears new, and confirmation with a larger number of models providing sizeable initial condition ensembles will be important.

https://doi.org/10.5194/esd-2020-68 Preprint. Discussion started: 16 September 2020. © Author(s) 2020. CC BY 4.0 License.
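The ensemble-spread diagnostics described in this section can be sketched as follows. This is an illustrative outline, not the analysis code behind Figure 5: the function names and the input layout (one time series per realization) are assumptions, and the significance test on the trend mentioned in the text is not reproduced here, only the least-squares slope.

```python
import random
from statistics import mean, stdev


def ensemble_spread(members):
    """Inter-realization standard deviation at each time step.

    members: hypothetical input, a list of equally long time series,
    one per initial condition ensemble member.
    """
    return [stdev(values) for values in zip(*members)]


def ols_slope(x, y):
    """Least-squares slope of y against x (trend per unit of x)."""
    x_bar, y_bar = mean(x), mean(y)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den


def subsample_spreads(members, size=10, draws=100, seed=0):
    """Spread time series for random subsets of a large ensemble.

    Mimics the idea of drawing 10-member subsets from a 50-member ensemble
    to build a background distribution of spreads.
    """
    rng = random.Random(seed)
    return [ensemble_spread(rng.sample(members, size)) for _ in range(draws)]
```

A negative value of `ols_slope(years, ensemble_spread(members))` corresponds to a spread that narrows over time; for the decadal case, each member would first be smoothed with an 11-year running mean before the spread is computed.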
After detrending the values, we compare the distribution of the ensemble spreads for an individual model to that of the other models in order to assess whether models produce ensembles with significantly different spreads. We use a Kolmogorov-Smirnov test (at the 5% level), which measures differences in distribution. For several pairs of models, ensemble spreads based on annual values turn out to be indistinguishable: for temperature, the CanESM5 ensemble spread is not significantly different from those of the MPI-ESM model at Low Resolution and of the UKESM1 model. The latter in turn has an ensemble spread that is not different from that of the IPSL-CM model. For precipitation, CanESM5 and IPSL-CM produce comparable spreads, as do the two MPI-ESM models, and the MPI-ESM at Low Resolution compared to UKESM1. When we test the spreads of decadal means, all models appear significantly different from one another. Last, we can exploit the CanESM5 large ensemble in order to assess the number of ensemble members necessary to estimate the forced response of globally averaged TAS and PR, assuming that the mean response obtained by averaging the full ensemble of 50 members is representative of the true forced response. It is found that, for temperature, ten ensemble members produce an ensemble mean trajectory indistinguishable from the one obtained by averaging 50 members. For precipitation, the only difference is that year-to-year variability is not completely smoothed out by averaging ten rather than 50 ensemble members, but filtering by an 11-year running mean effectively cancels out annual "wiggles".

The ScenarioMIP design includes two pairs of scenarios, each of which is derived from the same SSP and integrated assessment model and consists of one baseline scenario without mitigation and one scenario assuming mitigation policies that reduce radiative forcing. They can therefore be used to cleanly attribute differences in climate outcomes to mitigation efforts.
The two sets of scenarios are SSP4-6.0 and SSP4-3.4 (produced with the GCAM model, Calvin et al., 2017), and SSP5-8.5 and SSP5-3.4OS (produced with the ReMIND-MagPIE model, Kriegler et al., 2017). Figures 6 and 7 show time series of global temperature and percent precipitation anomalies with respect to the baseline period of 1995-2014 for the two pairs, and the patterns of differences in temperature and percent precipitation changes by the end of the century, which we can characterize as the benefits of mitigation within the two SSP worlds. For reference, the pattern of change for the lower scenario in the pair is also shown. Figure 6 shows these outcomes for the pair of scenarios developed under SSP5. One of them is the unmitigated pathway already featured in the previous sections, SSP5-8.5, which assumes high reliance on fossil fuels to support economic development and reaches 8.5 W m-2 by the end of the century. The other scenario, SSP5-3.4OS, follows the same path of emissions until 2040, when it enforces a steep decline in greenhouse gas emissions, which become negative after 2070 and therefore create an overshoot in concentrations, radiative forcing and global average temperature, to end up at 3.4 W m-2 in 2100. Note that the end point of this scenario, according to these global measures, coincides with the end point of SSP4-3.4, the lower scenario of the other pair considered in this section, which is however reached along a traditional non-exceed pathway. Figure 7 shows results for the other pair, developed under SSP4, which by the end of the century reach 6.0 W m-2 (without mitigation) and 3.4 W m-2 (with mitigation), respectively. Their greenhouse gas emissions start diverging immediately, by 2020, with those of the lower scenario already decreasing by that time, while those of the baseline scenario continue to increase for two more decades, plateauing and then decreasing only after 2060. Both scenarios have a non-decreasing shape in radiative forcing and temperature.
At global scales, Figure 6 and Figure A8 Figure A8) shows that separation takes place even earlier for this pair of scenarios, by 2040 (2045 for the last of the individual models), consistently with the earlier start of the mitigation. A large majority of the precipitation trajectories still overlap at the end of the century.
The differential patterns of temperature and precipitation change have strikingly similar spatial features when comparing Figures 6 and 7, only modulated by the strength of the changes, which is proportional to the gap in radiative forcings. Temperature changes benefit from mitigation over the whole globe, but more significantly and increasingly so the higher the latitude in the Northern Hemisphere. All land regions see a benefit of mitigation (in terms of the forced signal, again represented by the difference in ensemble mean changes) of at least 2°C to 3°C in annual average temperatures at the end of the century, larger in most of the NH land regions and reaching 8°C in the Arctic for the SSP5-3.4OS/SSP5-8.5 scenario pair. For precipitation changes, the larger differences translate into a more than doubled intensity (note that the colors are the same or stronger in the difference plot than in the scenario change plot) in both directions of change, over the high latitudes (wetting) and the subtropics (drying). It is worth pointing out that patterns of change under the individual scenarios and patterns of differences between scenarios are similar, a further indication of the stable nature of the patterns of future change across different forcing scenarios.
Last, we use Figures 6 and 7, together with the third panel of Figure A8, for an additional comparison: the presence of two scenarios ending at the same level of radiative forcing (AR5-SARF), SSP4-3.4 and SSP5-3.4OS, allows us to compare the effects of the overshoot, after performing the same differencing for the 5 models that ran both of these scenarios.

Summary and Discussion
This paper provides an overview of ScenarioMIP results for surface temperature and precipitation projections under both Tier 1 and Tier 2 experiments, in addition to a comparison to CMIP5 outcomes for a subset of experiments that updated three of the RCPs. The number of models contributing results for the simulations of 21st century scenarios ranges from more than 30 for experiments in Tier 1 to only 7 for some of the experiments in Tier 2. At the time of writing, too few results from the long-term simulations are available to provide a robust multi-model ensemble perspective, and we have not included those results.
Ensemble mean trajectories of global temperature under the Tier 1 and the 1.5°C scenarios (SSP1-1.9, SSP1-2.6, ) span values between 0.8°C and 4.03°C above the historical baseline (1995-2014) (1.64°C-4.87°C above the 1850-1900 average), but individual models reach significantly larger warming levels under the highest scenario, beyond 5.6°C (above 6.4°C from . A comparison with the three CMIP5 RCPs (RCP2.6, RCP4.5 and RCP8.5) reaching the same nominal level of radiative forcing (in terms of AR5-SARF) shows a wider range covered in the newest simulations, especially with respect to the upper end. Studies have confirmed the interplay of both higher radiative forcings by 2100 in the scenarios, when measured by the currently preferred metric, ERF, and higher climate sensitivities in a subset of the CMIP6 models. We have shown that if constraints based on historical warming rates are applied, which end up downweighting models on the basis of their performance, ensemble means and ranges of the CMIP6 experiments are brought closer to the corresponding means and ranges from CMIP5 model results, as many of the models with higher climate sensitivities also tend to perform less well over the historical period in terms of regional and aggregate warming trends. This better agreement could change, however, if the same constraints were applied to the CMIP5 ensemble. A recent assessment makes a thorough attempt at constraining the distribution of climate sensitivity based on multiple lines of evidence, independently of climate model characteristics (Sherwood et al., 2020).
If the resulting distribution of ECS were to be used to downweight or cull models whose ECS is deemed an outlier, we would see changes in the CMIP6 ensemble projections in the same direction as those obtained by historical warming constraints, but formal studies applying this alternative type of constraint have not yet been published. According to the Tier 1 scenarios and SSP1-1.9, the 1.5°C target (above 1850-1900) is reached on average (across models and scenarios) in the second half of the current decade. The scenario determines whether the next level of 2.0°C is reached after only 13 more years (SSP5-8.5) or after more than 30 (SSP1-2.6).
Only under SSP3-7.0 and SSP5-8.5 does a majority of models reach 4°C, while 5°C is reached by the majority only under pace of the global average in the Arctic region. The cooling North Atlantic upwelling region emerges clearly. Precipitation change appears with the (by now familiar) patterns of wetting and drying, with the high latitudes and the equatorial Pacific seeing increases, and the semi-arid regions of the Mediterranean, Australia and South Africa expecting further drying. As was the case for CMIP5 and previous multi-model ensembles, the average response across models is very robust to changes in the size and trajectory of well-mixed GHG forcings, and therefore similar across scenarios. However, individual models' regional behavior may deviate from the average behavior significantly, especially in the regions at the edges of sea-ice melt for temperature.
The availability of ten (or more) ensemble members under SSP3-7.0, prescribed under Tier 2 and completed by 5 models at the time of writing, allows us to detect a tendency towards decreasing internal variability over time for both temperature and precipitation on decadal scales in all models. At the annual time scale only 2 of the models show significantly decreasing spread, and only for global temperature. For several pairs of models, ensemble spreads based on annual values turn out to be indistinguishable, while after computing running decadal means all models show significantly different spreads from one another, confirming that the representation of the climate system's internal noise characteristics remains model dependent.
CanESM5 provides 50 members, and a subsampling of its ensemble confirms that ten realizations are sufficient to robustly estimate the forced signal of global temperature and precipitation by their averages. Lastly, a new feature of ScenarioMIP's design builds on the matrix framework combining SSPs with different radiative forcing levels and therefore allows estimates of the benefits of mitigation for two pairs of scenarios, one pair under SSP4, the other under SSP5, and also an evaluation of the path dependency of warming in the presence of an overshoot. The comparison of SSP5-8.5 to the overshoot pathway that departs from it in 2040 to strongly mitigate radiative forcing down to 3.4 W m-2 by 2100 (SSP5-3.4OS) shows that the warming and absolute changes in precipitation avoided could be up to half the expected changes under the high scenarios. The comparison of the other pair, SSP4-6.0 and SSP4-3.4, shows a similar geography of avoided physical impacts, but with smaller absolute differences, given the smaller reduction in radiative forcing between these two scenarios. We also compare the end points of SSP4-3.4, which follows a traditional non-decreasing path over the century, and of SSP5-3.4OS, which overshoots the late-century levels in radiative forcing and temperature, and therefore reaches them from above. Both temperature and precipitation changes (averaged over the last 20 years of the 21st century) appear comparable in magnitude, suggesting a short memory of the climate system (with regard to global average temperature and precipitation), at least after it exceeds the ultimate target for up to 4 decades and by not more than half a degree, as in this comparison.
A more general analysis of the time it takes for the various scenarios to see a persistent separation of GSAT trajectories shows that the ensemble averages can show the effects of mitigation already within 15 years from the divergence of forcings when comparing SSP5-8.5 to the two lower scenarios, SSP1-1.9 and SSP1-2.6. "Adjacent" scenarios take longer to separate, but they all do so, in the mean, by the mid-2040s. Individual pairs of trajectories from the ensemble members can take between about 5 and 25 years longer than the ensemble means (the larger number corresponding to the comparison between the two higher scenarios, SSP3-7.0 and SSP5-8.5). We have limited this analysis to two variables and simple descriptive statistics of their behavior. The ScenarioMIP design, together with the presence of complementary experiments in several other MIPs and the richness of the archived data (Jukes et al., 2020) from the ESM simulations, is going to provide the basis for many more in-depth analyses of the physical system behavior. This will be further supported by a subset of CMIP6 models that are running CMIP5 RCPs, thus enabling a rigorous separation of the sources of variation between the two generations of experiments. Importantly, the ScenarioMIP effort aims at supporting integrated analyses of Earth and human systems' responses to future changes. These studies will integrate socio-economic changes described by SSPs with climate system changes characterized by ESM outcomes to assess risks and possible mitigation and adaptation response options. While we do not address the integration of ScenarioMIP outcomes in interdisciplinary studies within this overview, that integration remains the overarching motivation for the coordinated ScenarioMIP effort.

Data and Code Availability
CMIP5 (see Table A2) and CMIP6 (see Table A1) model output is available through the Earth System Grid Federation (ESGF) and can be directly used within the ESMValTool (e.g. https://esgf-data.dkrz.de/projects/esgf-dkrz/). The corresponding recipe that can be used to reproduce the figures of this paper will be included in ESMValTool v2.0 (Righi et al., 2020; Eyring et al., 2019a; Lauer et al., 2020; Weigel et al., 2020) as soon as the paper is published. The ESMValTool is released under the Apache License, Version 2.0. The ESMValTool code is available from the ESMValTool webpage at https://www.esmvaltool.org/ and from GitHub (https://github.com/ESMValGroup/ESMValTool). As of August 2020, 23 modeling centers participated in ScenarioMIP by running at a minimum its Tier 1 experiments and provided their output through the ESGF. Table A1 lists them, together with their model(s) and the DOI referencing the data.

Author Contributions
C. Tebaldi, V. Eyring, J. Fyfe and E. Fischer designed and organized the analysis. K. Debeire performed data processing and analysis, and drew all figures and most of the tables. C. Tebaldi wrote the first draft of the paper. All authors provided input, comments and editing on the various parts of the analysis. In addition, modeling center representatives (from S. Bauer to T. Ziehn in the authors' list) were responsible for performing the ScenarioMIP simulations and publishing their model output to the ESGF.

The authors declare that they have no conflict of interest.

trajectories takes place. Separation is defined as the emergence of a positive difference (we use 0.1°C as threshold to eliminate the effects of noisy emergences) that persists for the remainder of the century. We first apply a 21-year running mean to the GSAT time series in order to characterize separation "of climates".

Tokarska et al. (2020). For the latter the number of models remains the same as for the unconstrained projections. All changes are relative to the CMIP5 baseline period, 1986-2005.

take additional adjustments into account that are non-temperature induced and differ from stratospheric-adjusted radiative forcings. Shown are 2080-2100 probabilistic results of SSP ERFs, using MAGICC7.3. These ERFs differ from SARFs and tend to be higher for CO2 and total radiative forcings (see panels b and f). Given that the efficacy and rapid adjustments are different for different forcing agents, the match between RCPs and SSP scenarios also differs when comparing them in the effective radiative forcing space, rather than in terms of their stratospheric-adjusted radiative forcings.
Figure A8: As in Figure A3, year-by-year GSAT differences for the two pairs of scenarios differing only by the amount of mitigation assumed (left and center panels) and for the two scenarios that achieve the same level of radiative forcing by 2100, one by overshooting it in the middle of the century (right panel). From left to right: year-by-year differences for the seven models that ran SSP5-8.5 and SSP5-3.4OS, for the seven models that ran SSP4-6.0 and SSP4-3.4, and for the 5 models that ran SSP4-3.4 and SSP5-3.4OS. Black lines are differences computed between pairs of GSAT trajectories for each of the models. Red lines are differences between the two ensemble mean trajectories.
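The separation criterion quoted in the appendix captions (a 21-year running mean, then a positive difference above 0.1°C that persists for the remainder of the century) can be sketched as below. This is a minimal illustration under assumed inputs, two annual GSAT series on a common list of years; the function names are hypothetical and this is not the code used to produce the figures.

```python
from statistics import mean


def running_mean(series, window=21):
    """Centered running mean; returns (index of first smoothed value, values)."""
    half = window // 2
    smoothed = [mean(series[i - half:i + half + 1])
                for i in range(half, len(series) - half)]
    return half, smoothed


def separation_year(years, higher, lower, threshold=0.1, window=21):
    """Earliest year after which the smoothed difference stays above threshold.

    higher, lower: annual GSAT series for the higher and lower scenario.
    Returns None if the difference at the end of the record is not above
    the threshold, i.e. no persistent separation is found.
    """
    start, high_s = running_mean(higher, window)
    _, low_s = running_mean(lower, window)
    kept_years = years[start:start + len(high_s)]
    diffs = [h - l for h, l in zip(high_s, low_s)]

    separated = None
    # Walk backwards from the end of the century: separation must persist,
    # so stop at the first year where the difference drops to the threshold.
    for year, diff in zip(reversed(kept_years), reversed(diffs)):
        if diff > threshold:
            separated = year
        else:
            break
    return separated
```

Applied to the two ensemble mean trajectories this gives the separation of the means; applied to each model's pair of trajectories it gives the per-model dates, which the text reports as later than those of the ensemble means.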