Potential for bias in effective climate sensitivity from state-dependent energetic imbalance

. To estimate equilibrium climate sensitivity from a simulation where a step change in carbon dioxide concentrations is imposed, a common approach is to linearly extrapolate temperatures as a function of top of atmosphere energetic imbalance to estimate the equilibrium state ("Effective Climate Sensitivity"). In this study, we find that this estimate may be biased in some models due to state-dependent energetic leaks. Using

an ensemble of multi-millennial simulations of climate model response to a constant forcing, we estimate equilibrium climate sensitivity through Bayesian calibration of simple climate models which allow for responses from subdecadal to multi-millennial timescales. Results suggest potential biases in Effective Climate Sensitivity in the case of particular models where radiative tendencies imply energetic imbalances which differ between pre-industrial and quadrupled CO 2 states, whereas for other models even multi-thousand year experiments are 15 insufficient to predict the equilibrium state. These biases draw into question the utility of effective climate sensitivity as a metric of warming response to greenhouse gases and underline the requirement for operational climate sensitivity experiments on millennial timescales to better understand committed warming following a stabilisation of greenhouse gases. 20 Equilibrium Climate Sensitivity (ECS) is the theoretical equilibrium increase in global mean temperature experienced in response to an instantaneous doubling in Earth's carbon dioxide concentrations over pre-industrial levels. Introduced as a metric of response of the Earth System to greenhouse gases in the early years of computational climate science (Charney et al., 1979;Hansen et al., 1984), it remains a very common metric of the sensitivity of the Earth to greenhouse gas forcing (Knutti et al., 2017;Masson-Delmotte et al., 2021). 25 Measuring ECS in a coupled climate model, however, is difficult owing to the time required for the equilibration of the system to a change in forcing (Wetherald et al., 2001;Solomon et al., 2010;Jarvis and Li, 2011) necessitating simulations of multiple millennia to obtain a near-equilibrated estimate of temperature response (Rugenstein et al., 2020). The computational burden of conducting such simulations implies that standard practise for model assessment is to measure an "Effective Climate Sensitivity" (EffCS) using feedbacks extrapolated from those simulated in the first 150 years simulation forced with a step-wise 30 quadrupling of CO 2 (Gregory et al., 2004;Murphy, 1995;IPCC, 2013;Forster, 2016;Andrews et al., 2012).

Introduction
A core assumption in the calculation of EffCS is that the system will ultimately stabilise in a state of energetic balance (Gregory et al., 2004). However, in practise a number of models exhibit energetic radiative top of atmosphere imbalances in the control state in both CMIP5 (Hobbs et al., 2016) and CMIP6 (Irving et al., 2021), and as such the Effective Climate Sensitivity is calculated using net flux anomalies relative to the control mean top of atmosphere net radiative fluxes. However, 35 it remains untested as to whether such models will ultimately converge to the same state of imbalance.
In the present study, we consider an alternative approach for calculating climate sensitivity from a climate simulation in which there is a step change in carbon dioxide concentrations. We consider how the method of calculating effective climate sensitivity, either from initial response or from millennial scale simulations, may be potentially subject to biases arising from assumptions on the equilibrated radiative state. Finally, we consider how these uncertainties relate to our confidence in the relationship between transient and equilibrium climate feedbacks. 5 We consider the role of non-equilibrated models in the context of recent research, which has highlighted potential uncertainties in the EffCS approximation of ECS -studies have found that net radiative feedbacks can exhibit both timescale and state dependencies (e.g., Senior and Mitchell 2000;Armour et al. 2013;Andrews et al. 2015;Rugenstein et al. 2016;Proistosescu and Huybers 2017;Pfister and Stocker 2017;Dunne et al. 2020;Andrews et al. 2018;Bloch-Johnson et al. 2021) both of which draw into question the implicit constant feedback assumption used to calculate EffCS. 10 The LongrunMIP project set out in part to quantify this error by running a subset of ESMs in idealised carbon dioxide perturbation experiments with simulations of millennial timescale response (Rugenstein et al., 2019). Initial studies compared the EffCS as derived using the first 150 years of the simulation with that derived using the last 15 percent of warming in multi-thousand year experiments -finding that the accuracy of the EffCS varied by model, but the two methods differed by 5-37% in the estimate of ECS (Rugenstein et al., 2020). A follow-up study (Rugenstein and Armour, 2021) considered a range 15 of approaches for characterising feedbacks on different timescales, and found that feedbacks assessed in the period 100-400 years after the initial quadrupling of CO 2 concentrations may provide a practical prediction of equilibrium response accurate within 5% or less. They found also, however, there were large inconsistencies in some models between estimates of climate sensitivity derived from extrapolation to radiative equilibrium and those methods which relied on a fitting of exponentially decaying temperature trend, leaving uncertainty on the best practise for integrating model-derived EffCS distributions into 20 uncertainty in long term warming trajectories.
A general assessment of the likely range of EffCS (Sherwood et al., 2020) (which itself informed the Forster et al. (2021) assessed likely EffCS range) rested strongly on combined historical and paleo evidence, contributing to the headline result that values of EffCS of greater than 4.7K are unlikely. These findings somewhat challenge the use of the CMIP6 ensemble of climate models as a proxy for climate projection uncertainty in assessment, given approximately 1/3 of the ensemble have 25 apparent EffCS values of greater than 4.7K (O'Neill et al., 2016;Eyring et al., 2016;Meehl et al., 2020;Zelinka et al., 2020)leading to arguments that such 'hot models' should be excluded from assessment (Hausfather et al., 2022).
So can these models be ruled out? Although studies suggest that post-1980 warming may help constrain the Transient Climate Response (Jiménez-de-la Cuesta and Mauritsen, 2019; Nijsse et al., 2020;Tokarska et al., 2020), recent historical warming alone is only weakly correlated with EffCS in the CMIP5 and CMIP6 ensemble (Tokarska and Gillett, 2018). In the present study, we find that this might in part be due to the fact that a key assumption in EffCS (that the model will return to the radiative balance observed in the control simulation) may not hold in a number of CMIP-class models.

Methods
We consider fits of a simple multi-timescale model to idealised climate change experiments from LongRunMIP (Rugenstein et al., 2019), which provide in general an estimate of the multi-millennial response of the Earth System to a constant radiative 35 forcing level. The supplementary material also illustrates results from CMIP5  and CMIP6 , but in general these simulations are insufficiently long to constrain the simple model response.
We assume that the temperature and radiative response to a step change in forcing can be modelled by a sum of exponential decay terms, a basis set which is consistent with the general solution of two layer simple climate models and one which holds for the solution of a number of proposed multi-layer linear energy balance models in response to constant forcings (Caldeira 40 and Myhrvold, 2013;Proistosescu and Huybers, 2017;Sanderson, 2020;Geoffroy et al., 2013a;Winton et al., 2010;Smith et al., 2018;Geoffroy et al., 2013b) . It has been shown also that some non-linear models have a solution set which can also be expressed in the same exponential basis (Proistosescu and Huybers, 2017;Bastiaansen et al., 2021). We consider N exponential response modes, such that: where T p (t) and R p (t) are the global annual mean surface temperature and net top of atmosphere radiative flux timeseries in response to a unit step change in forcing respectively, τ n is the decay time associated with the timescale n, S n and R n are scaling factors and T 0 and R 4x extrap are constant terms. T 0 represents the pre-pulse temperature, taken here as the mean temperature in the last available 500 years of the control simulation. R 4x extrap is the radiative flux imbalance as t → ∞ in the forced simulation and is calibrated during the calculation. 5 We distinguish between the radiative flux imbalance in the PICTRL (R CT RL0 ) and in the asymptotic limit of the ABRUPT4X simulations (R 4x extrap ). For models which provided constant forcing extensions of transient experiments, we assume R 4x extrap is a fixed property of the fitted pulse-response function. R CT RL0 is calculated as the time average of net Top of Atmosphere (TOA) flux from the last 500 years of PICTRL. In fully equilibrated models with no energetic leaks, it would be expected that R CT RL 0 = 0, but it has been noted previously that this is not always the case and small energetic imbalances 10 remain in some models even after the model global mean temperature trends have ceased (Rugenstein et al., 2019).
Existing studies differ in the number of independent equilibration timescales (N ) which describe the joint evolution of top of atmosphere net radiative balance (R p (t)) and the global mean surface temperature (T p (t)) in response to a step change in forcing, generally using 2 (Smith et al., 2018;Rugenstein and Armour, 2021) or 3 timescales (Proistosescu and Huybers, 2017;Rugenstein and Armour, 2021;Caldeira and Myhrvold, 2013) timescales. Here we consider solutions ranging from 2 15 to 5 timescales allowing for a range of thermal responses corresponding approximately to sub-decadal, decadal, centennial, millennial and multi-millennial timescales (see Tables 1 and 3).
For LongrunMIP models which provide an experiment with an abrupt quadrupling of CO 2 (ABRUPT4X hereon), we take T p (t) and R p (t) as global annual mean values from ABRUPT4X simulations to directly calibrate the parameters in Equations 1a and 1b. Some models, however, do not provide ABRUPT4X, instead providing constant forcing extensions of other climate 20 change experiments (see Rugenstein et al. 2020). For these models, we further assume a linear pulse-response formulation to represent the thermal global mean response to the corresponding forcing time-series as the convolution of the thermal response to a step change in forcing, combined with the forcing timeseries itself (Joos et al., 2013).
where F (t) is the forcing time series of the corresponding experiment. Here we assume approximate logarithmic forcing dependencies (Myhre et al., 1998) for carbon dioxide (a dependency which is an empirical outcome of more complex radiative transfer models; Huang and Bani Shahabadi 2014) and integrated forcing estimates (Meinshausen et al., 2011) for the one model (ECEARTH) which extended a multi-forcer future scenario experiment in LongRunMIP. The latter forcing estimate is an approximation with central estimates for aerosol and greenhouse gas forcing rather than model-specific values, but the 30 effective forcing timeseries experienced by ECEARTH under RCP85 is not knowable without dedicated simulations (Pincus et al., 2016).

Bayesian Calibration of model response parameters
We fit the response equations detailed in Eqs. 2a and 2b to the output of each ensemble member's global mean radiative flux and surface temperature timeseries using a Markov Chain Monte Carlo optimizer (Foreman-Mackey et al. 2013; as implemented 35 in the 'lmfit' Python module), sampling models which allow for a range of N = [2, 3, 4, 5] representative decay timescales.  Myhre et al. (1998). (**) F historical (t), FRCP 85(t) forcing is taken according to Meinshausen et al. (2011).

Assessment of model response timescale
The following section is used to assess the simplest acceptable multi-timescale model for the emulation of different ESMs in the LongRunMIP archive. We quantify this using the Root Mean Square Error (RMSE) associated with the least-square fit optimization (assessed as the best performing member of the MCMC posterior solution). If the addition of an additional, longer 5 timescale in the fit corresponds to a reduction in combined RMSE of 0.5% or more -the longer timescale model is used. The performance of fitted multi-timescale models for GM T (Global Annual Mean Surface Temperature) and N ET (Global Annual mean net Top of Atmosphere radiative imbalance) timeseries is summarised in Figure 1, which shows the combined error in the fits for GM T and N ET associated with the absolute least-square fit for each of the model variants described in Table 1. The associated timeseries for the best fitted model in the context of the original model data for GM T and N ET are 10 shown in supplemental Figures (Figures A1 and A2).
We find that for all LongRunMIP models, the N=2 timescale model performs significantly worse than N≥3 timescale models allowing for centennial and longer response timescales. This is both evident by the significantly larger best fit errors ( Figure 1) as well as visibly poor fits (Supplemental Figures A1 and A2). Illustration of the Root Mean Square Error for the fit to global mean temperature and net TOA radiative balance using models allowing for a range of timescales. Dec., Cen., Mil., and m.m. are Decadal, Centennial, Millenial, and Multi-millenial timescale models respectively. RMSE values for each variable (NET and GMT) are normalised relative to the best overall fit for that variable, each multiplied by 0.5 to give a combined error. The shortest timescale model with errors within a .5 percent tolerance of the overall best performing model is illustrated in red. Included modes and parameter priors are detailed in Table 1 and 3). In cases where error is truncated by the vertical axis, value is printed in white.
Differences between the N=3, 4 and 5 timescale models are dependent on the ESM being fitted. For some models (CCSM3, CNRMCM61, ECEARTH, ECHAM5MPIOM, GISSE2R, HadCM3L, IPSLCM5A, MPIESM12), no significant improvement in fit is seen beyond the centennial timescale model (Figure 1). For other models, fits are further improved by allowing a millennial (CESM104,FAMOUS, GFDLCM3) or multi-millennial timescale (HadGEM2, MIROC32). Parameters associated with the best fitting models are listed in Supplemental Table A1, and fitted MCMC ensembles corresponding the selected class 5 of model illustrated in red in Figure 1 are carried through for the remainder of the study.

Assessment of climate sensitivity
The conventional effective climate sensitivity (EffCS) is calculated using the first 150 years of simulation, linearly extrapolating GMT as a function of NET to R CT RL 0 . Control global mean temperatures and TOA energetic imbalances are expressed as anomalies relative to T 0 . We assess errors EffCS due to state-dependent radiative imbalance by calculating EffCS corr , where 10 feedbacks in the first 150 years are instead linearly extrapolated to R extrap 4x .
A third estimate of equilibrium warming, ∆T best−est , follows Rugenstein et al. (2020), by calculating the effective climate sensitivity based on the years corresponding to the last 15% of warming in the simulation (that is, for all years following the point when the simulation first exceeds 85% of the average global mean temperature anomaly in the last 20 years of the ABRUPT4X simulation). For models which do not directly provide ABRUPT4X (GFDLCM3, GFDLESM2M and MIROC32),

15
∆T best−est is calculated by scaling by the ratio of radiative forcing in ABRUPT4X relative to that in the multi-thousand year constant forcing period in the experiment provided (following Rugenstein et al. (2020), see Table 2) We finally calculate a fourth estimate of climate sensitivity ∆T extrap as in Eq. 3 in the equilibrated (ABRUPT4X) simulation using the ensemble of fitted parameters from Bayesian calibration of Equation 1a, using again global mean temperature anomalies from ABRUPT4X relative to T 0 (taken as mean temperatures over the last 100 years of PICTRL).

20
We estimate the long term radiative imbalance in the ABRUPT4X simulation from the fitted values for R 4x extrap (along with R n , the amplitude of the decay in forcing at the timescale corresponding to τ n ) from Eq. 1b. Previous studies have assumed in the calculation of ∆T best−est that R 4x extrap = R CT RL 0 (Rugenstein et al., 2020), an assumption we test here. We follow convention by reporting climate sensitivities for a doubling of carbon dioxide from pre-industrial levels. As such, 25 we follow standard practice in dividing ABRUPT4X sensitivities by 2 to obtain (EffCS, ∆T extrap and ∆T best−est ) (Meehl et al., 2020), though we note that in some models this approximation introduces minor errors (Jonko et al. (2012); Bloch-Johnson et al. (2021), these are not the focus of the present study.

Relevance of energetic leakages
We consider first the radiative tendencies of the models in the climate change experiments, compared with the control state. 30 Figure 2 shows the evolution of the top of atmosphere net radiative imbalance in the LongRunMIP climate change experiments, as well as the control simulation -together with the projected evolution of a simulated ABRUPT4X simulation using the fitted multi-timescale model. We note that there is significant model diversity in the behaviour of models in the approach to equilibrium. Some models (CESM104, GISSE2R, GFDLESM2M, GFDLCM3 and MPIESM11) behave as expected, showing R CT RL 0 = 0 and R extrap 4x = 0 (Figure 2). 35 A second class of model exhibits a radiative imbalance in the control simulation, but the ABRUPT4X simulation converges to the same state (R CT RL 0 = R extrap 4x ̸ = 0 e.g. MIROC32, MPIESM11). Finally, a third class appears to converge to different states in PICTRL and ABRUPT4X (R CT RL 0 ̸ = R extrap 4x e.g. CCSM3, CNRMCM61, ECEARTH, HadCM3L, MIROC32, MPIESM12 and IPSLCM5A) -implying that Effective Climate Sensitivity may be biased in these models if calculated assuming that the ABRUPT4X simulation is tending towards the equilibrium radiative state of the PICTRL simulation. 40 Figures 3 and 4 show the impact on these biases on the derived value for Equilibrium Climate Sensitivity. The relationship between temperature and TOA fluxes for the fitted multi-timescale models for ABRUPT4X simulations in the LongRunMIP archive are presented in Figure 3, while Figure 4 shows the temperature evolution as a function of time.
Models with exact agreement between R CT RL 0 and R extrap 4x also tend to exhibit similar values for ∆T best−est and ∆T extrap and in cases where there is little or difference in feedbacks in the early and late stages of the simulation (e.g. CESM104, GISSE2R, MPIESM11), EffCS is also similar to ∆T best−est and ∆T extrap . Other models (e.g. ECEARTH, ECHAM5MPIOM, FAMOUS, GFDLCM3, GFDLESM2M, IPSLCM5A) show significant differences in early and late stage feedbacks -manifested as a ∆T best−est which differs from EffCS. , while dashed black line shows the predicted ABRUPT4X radiative imbalance (R 4x extrap ). Semi-transparent blue and green points show annual mean upgoing net radiative flux from PICTRL and the submitted simulation (printed in blue text) respectively. Thick blue line shows the MCMC posterior median TOA timeseries for the submitted simulation using the chosen multi-timescale model (see Table 1). Black line shows the simulated response to ABRUPT4X for the multi-timescale model, while shaded grey regions and thin lines show the 10th and 90th percentiles of the fitted ensemble projections for ABRUPT4X. Vertical axis shows absolute top of atmosphere net radiative imbalance, horizontal axis shows surface temperature relative to the final 500 years of the control simulation. Models marked '*' did not provide ABRUPT4X directly (see Table 2). Solid black lines show the median simulation of ABRUPT4X for the fitted MCMC posterior of the multi-timescale model, shaded grey areas show 5-95% confidence intervals. Light blue points are individual years from ABRUPT4X (if available). For * models, grey points show years in the latter portion of the simulation after which forcing is constant, scaled according to Table 2   (CNRMCM61, FAMOUS, ECEARTH, HadCM3L, IP-SLCM5A, MPIESM12), exhibit similar biases in both ∆T best−est and EffCS. For example, CNRMCM61 exhibits relatively constant feedbacks on century and millennial timescales, so ∆T best−est and EffCS are similar (5.42K, 5.51K respectively), but ∆T extrap , which is well fitted by the data is significantly lower (4.47±0.01K) (Figure 4 and Table 4) due to the differing estimated equilibrium energetic imbalance in ABRUPT4X and PICTRL simulations. The fitting process for HadGEM2 determined 5 that a multi-millennial response mode was necessary, which remains unconstrained by the fit so it is not possible to estimate ∆T extrap with confidence for this model (the simulation length for HadGEM2 is 1299 years, so it remains possible that a 5000 year simulation as provided by a number of other models could rule out the need for the multi-millennial response mode).
Of these models with apparently state-dependent energetic balance, some (HadCM3L, FAMOUS, ECEARTH) appear to show a control simulation where R CT RL 0 ≈ 0, but an ABRUPT4X simulation which converges to a state of energetic imbalance 10 ( Figure 2). This, in turn introduces a source of potential bias in the estimate of effective climate sensitivity if the system is converging to a non-equilibrated state, implying that the control simulation may be tuned to exhibit energetic balance but the equilibrated 4xCO 2 state is subject to an energy leak. A particularly extreme example is FAMOUS, where a small difference in extrapolated energetic balance, combined with large feedback parameter results in a much larger values of ∆T best−est (9.27K) than ∆T extrap (6.99K, see Table 4, Figure 4) or EffCS (7.13K) 1 . Similarly for HadCM3L, the fitted extrapolated sensitivity 15 ∆T extrap (3.03K, see Table 4 and Figure 4) is lower than ∆T best−est (3.49K) and EffCS (3.29K).
∆T extrap differs from EffCS both due to the presence of state dependent energetic biases, but also due to feedbacks which occur over the multi-thousand year timescales resolved in the LongrunMIP experiments. We can isolate the bias in EffCS induced by state-dependent energetic imbalance in the LongrunMIP cases by using a different extrapolated energetic state ( Figures 5, 3). As in the standard calculation of EffCS, we take a least-squares linear fit of temperature as a function of N in the 20 first 150 years, but instead linearly extrapolating to N = R 4x extrap rather than N = R CT RL 0 in the standard calculation to produce a bias corrected EffCS corr . We find that 2 models in LongRunMIP are significantly impacted by this correction (see Figure 5 and Supplemental Figure A3) -CNRMCM61 (Ef f CS = 5.42K, Ef f CS corr = 4.42K) and IPSLCM5A (Ef f CS = 4.33K, Ef f CS corr = 3.65K). A number of other models are impacted to a lesser extent (see Table 4).
The analysis was repeated for the wider CMIP5 and CMIP6 ensembles. However, the standard CMIP5 and CMIP6 simula- 25 tions are insufficiently long to fit response timescales of centennial or longer -hence ∆T extrap (or R 4x extrap ) are not constrained using the multi-timescale fitting approach (see Figure 5). It is notable that flux imbalances are present in the control state of a number of models in both CMIP5 and CMIP6, but longer simulations are required to assess if these represent structural imbalances or an insufficiently long spinup. The centennial and longer timescales are not constrained in 150 year simulations, hence it is not possible to estimate ∆T extrap and R 4x extrap with any confidence. We note, however, that in most cases the uncertainties 30 in the fitted 3 timescale solution generally allow for equilibrium values which are higher then the effective climate sensitivity 1 Using R 4x extrap = −0.16W m −2 rather than R CT RL 0 = −.01W m −2 would result in a value of ∆T best−est = 7.01K, broadly consistent with EffCS and ∆Textrap as assessed over the first 150 years of simulation. Only a small number of models allow for fitted solutions which have a lower ∆T extrap than the EffCS (CESM2, CCSM4, MIROC5, CNRMESM2.1, ACCESS-CM2). One of these cases (CNRMCM6.1) is a close relative of the CNRMESM2.1 -the LongrunMIP simulation which we identified to be potentially subject to biases owing to energetic imbalances in the 4xCO 2 equilibrium state. 5 We have considered an alternative approach for calculating long term tendencies of temperature and planetary energetic imbalance from simulations in which atmospheric carbon dioxide concentrations are instantaneously perturbed. This approach relies on the assumption that the evolution of the system can be represented as a sum of decaying exponential terms with differing timescales. An existing project, LongrunMIP, provides multi-millennial simulations which allow for the fitting a multi-timescale simple model which allows for annual, decadal, centennial and millennial responses. 10 We find that this approach highlights some potential limitations and biases associated with using effective climate sensitivity to predict equilibrium warming. It has been observed before that energetic imbalances exist in some models in the CMIP archive (Rugenstein et al., 2019;Hobbs et al., 2016;Irving et al., 2021), and in this study we show that such control state radiative imbalances are relatively widespread in CMIP5 and CMIP6. The conventional assumption used to calculate effective climate sensitivity in these cases is that such imbalances remain constant, such that radiative anomalies from the control state 15 can be used to calculate the effective climate sensitivity. Critically, in some LongrunMIP simulations, we observe that energetic imbalances are themselves state-dependent. This undermines the concept of effective climate sensitivity -if we do not know what the radiative imbalance will be when temperatures stabilise in an ABRUPT4X simulation, we in turn cannot predict the climate sensitivity (using this method) with precision.

Conclusions
In practice, only some models in CMIP5 and CMIP6 appear to exhibit significant radiative imbalances in the control state 20 (see Figure 5), and although the 150 year ABRUPT4X simulations are insufficient to assess if these energetic imbalances are state-dependent, these are the cases where we might be least confident in the effective climate sensitivity value. Models may exhibit non-equilibrium fluxes in the control state for a number of different reasons -either the model has not been run for sufficiently long in the control configuration to reach a state of energetic balance, or there is a persistent energetic leak in the model, which may be constant or evolving (Hobbs et al., 2016). In either case, the results presented in this study draw into 25 doubt whether such imbalances can be assumed to remain constant in a climate perturbed through alteration of climate forcers.
Further, we find that some models which are in or close to energetic balance in the control state do not converge to energetic balance following the step change in climate forcing. This implies that models fall into two potential categories: those where the energetic budget of the model is structurally closed through the elimination of all leaks, and those where the model parameters have been adjusted to produce near-zero net TOA fluxes in the control state. The latter case is still potentially subject to errors 30 in the estimation of effective climate sensitivity, because if energetic imbalances are dependent on climate forcers, then the calibrated minimisation of net TOA fluxes may be inappropriate for the perturbed climate state. A simple analysis of the net fluxes in the control simulation cannot distinguish between structurally balanced models and tuned balanced models -but centers which operationally adjust parameters to minimize energetic losses should be aware of this potential bias in effective climate sensitivity. 35 Models with state dependent energetic imbalance will not reach true energetic equilibrium (as defined by a state of radiative balance of the system) in response to a climate forcing . This still allows for the model to reach an asymptotic stable state (effectively including an energy leak) but it does not allow for the derivation of effective climate sensitivity which requires prior knowledge of the asymptotic equilibrium TOA balance. The method suggested here presents an alternative approach for deriving climate sensitivity, but it is clearly less than ideal -requiring simulations of 5000 years of simulation to produce a 40 stable estimate for some models. We must also consider the possibility for these models that there is no stable state. If energy leaks are a function of the climate state, and the system is not tending towards a state of radiative equilibrium, our evidence that models are converging to a stable temperature is empirical and longer simulations will be required to investigate these multi-millennial dynamics and confirm that a stable asymptotic solution exists.
Our results highlight the potential for error in the estimation of effective climate sensitivity through the assumptions on 45 the asymptotic radiative balance of climate models. In the case of LongrunMIP -there is a significant difference between the distribution of fitted asymptotic values of energetic imbalance in ABRUPT4X compared with the mean energetic balance in PICTRL in 11 of 15 models (see Table 4). In 5 out of 15 cases, this results in a bias in Effective Climate Sensitivity of 0.3K or more, but this bias is not universally in the same direction. Quantifying the presence of such biases in the wider CMIP6 ensemble is not possible without multi-thousand year control and ABRUPT4X simulations. However, their relatively common 50 occurrence in LongrunMIP suggests that more models could be impacted. This directly impacts our ability to accurately measure EffCS from short simulations, and draws into question whether EffCS should be used as a factor at all in assessing the fidelity of climate models (Hausfather et al., 2022). Effective climate sensitivity has known limitations that it describes effective feedbacks at a certain representative timescale following a change in forcing (Rugenstein and Armour, 2021), but our results here highlight another issue that EffCS can only be used if we can be confident in the asymptotic energetic balance of the model. Such confidence can arise either from a ground-up demonstration of 5 structural energy conservation in the model (Hobbs et al., 2016), or by running sufficiently long simulations to be empirically confident both in the pre-industrial energetic balance and in the asymptotic multi-millennial tendencies of the model following a change in climate forcing. Such experiments are currently difficult to achieve for CMIP class models, the multi-millennial year simulations conducted in Rugenstein et al. (2020) were significantly longer than any experiments conducted previously -and we find in the present study that even a 1300 year simulation is too short to have confidence in the asymptotic state for 10 some models.
Given this, our study has multiple recommendations. Firstly, a greater emphasis in climate model design and quality checking needs to be placed on structural closure of the energy budget in the climate system. Models which can demonstrate that energy is conserved in the model equations can allow confidence that the system as a whole will converge to a state of true radiative equilibrium following a perturbation, which would allow a robust calculation of EffCS. For models which cannot demonstrate 15 this, longer simulations are required to be confident in the asymptotic state. These simulations may be prohibitively time and resource consuming. but such limits could potentially be alleviated through the use of lower resolution configurations (Kuhlbrodt et al., 2018;Shields et al., 2012) (with the risk that such models will exhibit different feedbacks from their high resolution counterparts) or by considering analytical approaches to accelerate convergence of complex systems (Xia et al., 2012). 20 However, in the short term, a more practical approach may be to consider alternative climate metrics which do not require assumptions about the equilibrium state of the system. Transient Climate Response does not require assumptions about radiative flux, but it does not provide direct information on the warming expected under stabilising forcing. A possible alternative is A140 (the warming observed 140 years after a step quadrupling in CO2 concentrations; Sanderson 2020; Gregory et al. 2015), which requires no assumption on equilibrated state -and is more informative on the warming expected under high mitigation 25 scenarios than EffCS itself (even if EffCS is known without bias due to energetic leaks). In conclusion, the use of Effective Climate Sensitivity as a metric in assessing the response of the climate system should be treated with caution, both due to its lack of relevance to projected warming under mitigation scenarios (Knutti et al., 2017;Frame et al., 2006;Sanderson, 2020) but also due to the fact that its derivation requires assumptions about the asymptotic state of the climate system which do not hold in a number of Earth System Models. 30 Figure A1. Results fitting multi-timescale models to output of LongRunMIP multi-thousand year experiments for global mean surface temperature. Different colors represent different models as detailed in Table 1, shaded areas indicate the 5-95th percentile range in the MCMC fit to the timeseries. Text indicates the model scenario used in the fit (as detailed in table 2).    Table A4.