What could we learn about climate sensitivity from variability in the surface temperature record?

We examine what can be learnt about climate sensitivity from variability in the surface air temperature record over the instrumental period, from around 1880 to the present. While many previous studies have used trends in observational time series to constrain equilibrium climate sensitivity, it has also been argued that temporal variability may also be a powerful constraint. We explore this question in the context of a simple widely used energy balance model of the climate system. We consider 5 two recently-proposed summary measures of variability and also show how the full information content can be optimally used in this idealised scenario. We find that the constraint provided by variability is inherently skewed and its power is inversely related to the sensitivity itself, discriminating most strongly between low sensitivity values and weakening substantially for higher values. It is only when the sensitivity is very low that the variability can provide a tight constraint. Our investigations take the form of “perfect model” experiments, in which we make the optimistic assumption that the model is structurally perfect 10 and all uncertainties (including the true parameter values and nature of internal variability noise) are correctly characterised. Therefore the results might be interpreted as a best case scenario for what we can learn from variability, rather than a realistic estimate of this. In these experiments, we find that for a moderate sensitivity of 2.5◦C, a 150 year time series of pure internal variability will typically support an estimate with a 5–95% range of around 5◦C (e.g. 1.9–6.8◦C). Total variability including that due to the forced response, as inferred from the detrended observational record, can provide a stronger constraint with 15 an equivalent 5–95% posterior range of around 4◦C (eg 1.8–6.0◦C) even when uncertainty in aerosol forcing is considered. Using a statistical summary of variability based on autocorrelation and the magnitude of residuals after detrending proves somewhat less powerful as a constraint than the full time series in both situations. Our results support the analysis of variability as a potentially useful tool in helping to constrain equilibrium climate sensitivity, but suggest caution in the interpretation of precise results. 20


Introduction
For many years, researchers have analysed the warming of the climate system as observed in the modern instrumental temperature record (spanning the mid 19th to early 21st century), in order to understand the response of the climate system to external forcing. For the most part, the focus has been on the long-term energy balance as constrained by the warming trend in atmospheric and oceanic temperatures (e.g. Gregory et al., 2002;Otto et al., 2013;Lewis and Curry, 2015). However, some research has focused more specifically on the temporal variability exhibited in the surface air temperature record (Schwartz, 2007;Cox et al., 2018a), which is the focus of this paper. Schwartz (2007) argued on the basis of a simple zero-dimensional energy balance model that an analysis based on the fluctuation-dissipation theorem (Einstein, 1905) could be used to directly diagnose the sensitivity of the Earth's climate sys-5 tem S -here conventionally defined as the equilibrium surface air temperature response to a doubling of the atmospheric CO 2 concentration -from variability in the observed record of annually and globally averaged surface air temperature observations over the observational record. While we do not wish to repeat the arguments here, we will note that several researchers disputed this analysis, demonstrating inter alia that this method did not reliably diagnose the sensitivity of climate models, and also arguing why it could not be expected to do so, given their complexity (Foster et al., 2008;Knutti et al., 2008;Kirk-10 Davidoff, 2009). Perhaps as a consequence of these arguments, this line of research was largely ignored for the subsequent decade.
More recently however, Cox et al. (2018a) reopened this question with an analysis based on an emergent constraint approach.
That is, rather than following the directly diagnostic approach of Schwartz (2007), they instead observed that a quasi-linear relationship existed across an ensemble of CMIP5 models (Taylor et al., 2012), between the sensitivities of these models, and 15 their interannual temperature variabilities as summarised in a statistic which they denoted Ψ. It has been cogently argued that an emergent constraint should only be taken seriously if supported by some theoretical basis (Caldwell et al., 2014), andCox et al. (2018a) did indeed present an analysis -again based on simple zero-dimensional energy balance modelling -which qualitatively underpinned this linear relationship. Using the value of Ψ obtained from observations of surface air temperature, together with the empirical relationship between Ψ and S they had derived from the climate models, they produced a best 20 estimate of the equilibrium climate sensitivity of 2.8 • C with a likely (66% probability) range of 2.2-3.4 • C, a substantially tighter range than most previous research. However, questions have also been raised about this result (Brown et al., 2018;Rypdal et al., 2018;Po-Chedley et al., 2018;Cox et al., 2018b).
In this paper, we explore the question of to what extent temporal variability in the globally and annually averaged temperature record can be used to constrain equilibrium climate sensitivity. We consider both the internal variability of the climate system 25 itself, and also the total variability including deviation from a linear trend due to the forced response. Our investigations are performed in the paradigm of a simple idealised modelling framework, using a two-layer energy balance model which has been widely used to simulate the climate system and which generalises and improves on the performance of the zero-dimensional model. As part of our investigations, we examine the relationship between the Ψ statistic and the equilibrium sensitivity in the model. We also show how the full time series of variability can be used to constrain climate sensitivity, under a variety of 30 idealised scenarios. Our results are based on "perfect model" experiments and therefore may be more readily interpreted as a best case scenario for what we can learn from variability, rather than a realistic estimate of this.
In the next section, we present the two-layer energy balance model and briefly outline the experimental methods used in this paper. We first focus on internal variability, that is to say, the temporal variability arising entirely from internal dynamics of the climate system in the absence of forcing. We evaluate the power of the Ψ statistic in constraining equilibrium sensitivity, 35 and also consider the more general question of what could in principle be learnt from the full time series. We then consider variability over the period of the observational record (primarily the 20th century, but with some extension into the 19th and 21st centuries). This includes forced variability due to temporal changes in both natural and anthropogenic forcings as well as the internal variability of the climate system. Throughout the paper, the term variability refers simply to all temporal variation in the annually-averaged temperature time series after any linear trend is removed. 5 2 Methods

Model
The basic underpinning of previous work is energy balance modelling of the climate system, from which it is anticipated that interannual variability may be informative regarding the equilibrium sensitivity. While previous research was based on analysis of the simplest possible zero-dimensional single layer planetary energy balance, there is evidence that the behaviour 10 of the climate system over the historical period is poorly modelled by such a system (e.g. Rypdal and Rypdal, 2014). Therefore, we use here a slightly more complex two-layer model based on Winton et al. (2010);Held et al. (2010). This model has been shown to reasonably replicate the transient behaviour of the CMIP5 ensemble of complex climate models (Geoffroy et al., 2013b, a). The model is defined by the two equations: This is a two-layer globally-averaged energy balance model which simulates the mixed (T m ) and deep (T d ) ocean temperature anomalies in the presence of time-varying forcing F t . λ = −F 2× /S is the radiative feedback parameter where S is the equilibrium sensitivity and F 2× is the forcing due to a doubling of the atmospheric CO 2 concentration. C m and C d are the heat 20 capacities of the mixed-layer and deep ocean respectively and γ represents the ocean heat transfer parameter. The values of the various adjustable parameters are listed in Table 1. We assume depths of 75m and 1000m for the mixed and deep ocean layers respectively which are used to calculate the heat capacities C m and C d respectively based on ocean coverage of 70% of the planetary surface area. The default values for adjustable parameters are given in Table 1 and the values used here lie close to the mean of those obtained by fitting the model to CMIP5 simulations by Geoffroy et al. (2013a). Our aim here is not specifically to replicate or mimic this ensemble but to allow for a reasonable range of parameter values. If we set γ = 0 and 5 ignore the deep ocean then we recover the single layer model of Hasselmann (1976) which was used by both Schwartz (2007) and Cox et al. (2018a) in their theoretical analyses.

Bayesian estimation
Our investigations are performed within the paradigm of Bayesian estimation. In general, the Bayesian approach provides us with a way to estimate a set of unknown parameters Θ from a set of observations O via Bayes' Theorem, 10 P (Θ|O) = P (O|Θ)P (Θ)/P (O). ( Here P (Θ|O) is the posterior probability distribution of Θ conditioned on a set of observations O, P (O|Θ) is the likelihood function that indicates the probability of obtaining observations O for any particular set of parameters Θ, which in this paper will always contain S and may include other parameters. P (Θ) is a prior distribution for the parameters Θ, and P (O) is the probability of the observations which is required as a normalising constant in the calculation of the posterior probability 15 distribution.
Formally, the value of the observations is fully summarised by the likelihood function P (O|Θ), but we primarily present our results as posterior pdfs in order to provide an easily interpreted output which can be directly compared to previously published results. We therefore use a uniform prior in S as this is typically the implicit assumption in emergent constraint analyses (Williamson et al., 2019). This choice results in the posterior being visually equivalent to the likelihood even though their interpretation is somewhat different. In some experiments, we will consider that only the sensitivity is unknown, but in 5 others we will consider a wider range of parametric uncertainties. The priors that we use for all uncertain parameters are shown in Table 1.

Additional data
While this study primarily focusses on the behaviour of the simple energy balance model, we also use and present some data from external sources. In order to perform simulations of the historical period, we force our climate model with annual time ∆t and therefore the estimation of τ was not presented in an entirely objective algorithmic form, but for the simple one-layer climate model that was considered, the expected value of τ calculated from a long unforced time series is independent of 30 lag. Cox et al. (2018a) argued that the function Ψ = σ T / √ − ln ρ 1 should be linearly related to the equilibrium sensitivity. In this function, ρ 1 is the lag-1 autocorrelation of the time series of annual mean surface temperatures, and σ T is the magnitude of interannual variability of these temperatures. Ψ and τ are closely related and co-vary very similarly over a wide range of sensitivity when other model parameters are held fixed (not shown). Henceforth in this section we focus solely on Ψ as it is more precisely defined and has been recently discussed in some detail (Williamson et al., 2019). However very similar results are also obtained when equivalent experiments are performed using τ .

5
We now present some investigations into the relationship between Ψ and S in unforced simulations of the two-layer model introduced in Section 2. We perform a multifactorial experiment in which 1000-member ensembles of simulations are integrated for both 150 and 1000 year duration, over a range of S from 0 to 10C, and with γ set to either the default value of 0.7 or alternatively set to 0 in which case we recover the single-layer version of the model. All other model parameters are held fixed at standard values in these experiments. Since there is no forced trend in these experiments, we do not include any explicit 10 detrending step in these analyses. However the results are insensitive to detrending. is also strong heteroscedasticity, that is to say, the variance of each ensemble of Ψ values increases markedly with sensitivity.
This variation arises from the sequence of noise terms which generate the internal variability in each simulation of the model and is therefore an intrinsic aspect of the theoretical framework relating Ψ to S. For these unforced simulations, it seems quite possible for a model with its sensitivity set to a value of 5 • C or even higher to generate a time series which has a modest value 20 for Ψ of say 0.1, even though the expected value of Ψ from such model simulations would be much higher. Similarly to the results shown by Kirk-Davidoff (2009), an accurate diagnosis of Ψ could in principle be made with a sufficiently long time series of internal variability, but the sampling uncertainty only decreases with the square root of the duration of the time series (as expected from the Central Limit Theorem), so this is unlikely to be of use in practical applications with real data.
When we consider the two-layer model using the standard parameter value of γ = 0.7 then the situation is a little different. In also presented a theoretical analysis of this two-layer model in which they argued that the response of Ψ was close to linear across the GCM parameter range, and our result confirms this for sensitivity values from around 2 to 4 or even 5 • C. However, the gradual curvature for larger values results in a saturation of the response of Ψ to increases in S and this, together with the increasing sampling uncertainty, has consequences that will be shown in subsequent experiments. In fact the relationship between Ψ and the transient climate response (i.e. the warming observed at the time of CO 2 doubling under a 1% per annum 35 analysis that they presented, but augments it with additional details. We now directly consider the question of how useful an observed value Ψ o can be as a constraint on the equilibrium climate sensitivity through Bayesian estimation. It is not trivial to directly calculate the exact value of the likelihood P (Ψ o |Θ) for a given observed value Ψ o , as Ψ is itself a random variable arising from the stochastic model and thus depends on the sequence of known as Approximate Bayesian Computation (Diggle and Gratton, 1984;Beaumont et al., 2002). This is a rejection-based Our experiments take the form of a 'perfect model' scenario, where the model is assumed to be a perfect representation of the system under consideration, with no structural imperfections. Our uncertainties here are due solely to unknown parameter values and internal variability noise. In these experiments, we assume that Ψ for the true system is calculated from a 150 year temperature time series of the unforced system, without any observational error whatsoever. The results of four experiments Strictly, when considering the strength of the constraint obtained from the variability, we should focus on the likelihood 20 P (Ψ|S) rather than the posterior pdf P (S|Ψ), since the latter depends also on the prior which is in principle independent of the observations. The likelihood for different values of S, which tells us the relative probability of any particular sensitivity value generating the observation, can be directly read off from Figure 2 as the height of the appropriate density curve at the specific sensitivity value. In the experiment performed with Ψ o = 0.1, the maximum likelihood value is achieved at a value of S = 2.5 • C, and the likelihood drops by a factor of 10 at both S=1.3 • C and S=7.9 • C. Kass and Raftery (1995) suggest that a 25 likelihood ratio of 10 or more between two competing hypotheses could be taken to represent "strong" evidence in favour of one over the other, so if we adopt this linguistic calibration we could say that the observation of Ψ o = 0.1 represents strong evidence in favour of S = 2.5 • C versus all values outside of the range 1.3-7.9 • C (but conversely, does not represent strong evidence to discriminate between any pair of values inside that range). It is somewhat coincidental that this range seems quite similar to the 5-95% range of the posterior pdf as the philosophical interpretation of the ranges is rather different. There is a 30 strong skew in this range, which extends much further towards higher values of S than lower ones, compared to the maximum likelihood estimate. We stress that this skew is a fundamental property of the physical model and is not due to the Bayesian analysis paradigm.

Additional uncertainties
The pdfs plotted in the top panel of Figure 2 assume that all model parameters other than S are known with certainty. In reality, we have significant uncertainty as to what values we should assign to several other parameters. We consider just three of these: the ocean heat uptake parameter γ, the efficacy or pattern effect parameter and the internal noise parameter σ. Geoffroy et al. (2013a) fitted the two-layer model to various GCM outputs in order to estimate parameter values including γ and and 5 based on these results we use as priors for these parameters the distributions N (0.7, 0.2) and N (1.3, 0.3) respectively which generously encapsulate their results. Geoffroy et al. (2013a) did not consider internal variability and thus we do not have such a solid basis for a prior in σ and assume a comparable relative uncertainty of 20%, i.e. a prior of N (0.05, 0.01). When we repeat the previous experiments but include these additional parametric uncertainties, then for the experiment where we use Ψ o = 0.05 as a constraint, the posterior for S widens substantially from the previous spread of 0.6-1.7 • C, to 0.6-4.9 • C as also 10 shown as the dashed blue line in Figure 2. The largest factor generating this substantial increase in uncertainty is due to the uncertainty in σ. The equivalent posteriors using the larger observational values for Ψ o also broaden somewhat but this is less visible in the results as they are of course always constrained by the prior range.

Using the full time series
Although the results in Figure 2 show that an observation of Ψ taken from a short unforced simulation cannot tightly constrain 15 equilibrium sensitivity in this model (except perhaps in the most exceptional of circumstances), it could still be hoped that a more precise constraint could possibly be gleaned by a more advanced analysis that uses some different diagnostic of the time series. In this section, we show how the total information of the time series can be used. By doing this, we create the most optimistic possible scenario for using internal variability to constrain equilibrium sensitivity of this simple climate model. generating the observed sequence is exactly the probability of the required noise sequence being sampled. This value is readily calculated, since the joint density of these independent and identically distributed δ i is simply the product of their individual densities. This unusual approach which we do not believe has been previously implemented in this context is possible here as we are assuming zero observational uncertainty. With this exact likelihood calculation, the Bayesian estimation process is 30 straightforward. In the case where only sensitivity is considered uncertain, it can be performed by direct numerical integration, sampling the sensitivity on a fine regular grid and calculating the likelihood (and therefore posterior) directly at these values.
It is worth emphasising that this calculation represents an absolute best case scenario for using the time series of temperature anomalies as a constraint. There can be no diagnostic or statistical summary of the observations that provides more information than the full set of observations themselves contain. Thus, we cannot hope to obtain a better constraint by some alternative analysis of the temperature time series.  Figure 3 shows results obtained from this approach, in the case where only S is considered uncertain. To aid visual comparability with Figure 2, the y-axis scale is fixed at the same value despite cutting the peaks of some pdfs. It is not possible to define what a "typical" noise sequence might look like and therefore we plot 20 replications with different randomly generated instances of internal variability for each sensitivity value tested. It seems that the results may be a little more precise than was obtained using Ψ alone (as shown by the pdfs generally having higher peak densities), though this depends on the 5 specific sample of internal variability that was obtained. It is still only in the case of the lowest sensitivity value of 1 • C that we reliably obtain a tight constraint. With the true sensitivity of 2.5 • C, the posterior 5-95% range, averaged over the samples, is 1.9-7.0 • C, marginally narrower than the 1.8-8.1 • C range obtained previously when an equivalent Ψ value of 0.1 was used.

Density from full time series
When additional parametric uncertainties are considered in this unforced scenario, the constraints again weaken, though not to quite such an extent as in Section 3.2 when only Ψ is used as a constraint. We do not show these results here.

10
Thus, there appears to be the potential for internal variability, as represented by the full temperature time series, to provide a slightly better constraint than that obtained by a summary statistic alone, but the improvement is marginal and even our optimal calculation which uses the exact likelihood of the full time series cannot accurately diagnose equilibrium sensitivity except when the true value is very low. These results again show a skew similar to that obtained when Ψ was used as the into their calculations. In order to calculate their Ψ statistic, they applied a windowed detrending method in order to focus on variability of both model simulations and observations of the historical period. However, the window length of 55 years that they used was justified primarily in empirical terms and cannot remove shorter-term variations in forced response.
In this section, we perform a series of analyses based on historical forced simulations, in order to investigate more fully the 25 potential for such forced effects to improve the constraint. We force the climate model with annual time series for the major forcing factors based on IPCC (Annex II: Climate System Scenario Tables 2013). Our two-layer model with a one-year time step (and Euler method of numerical integration) reacts rather too strongly to short-term spikes in forcing and thus we scale the volcanic forcing to 70% of the nominal value in order to give more realistic simulations. In some of the following experiments, we consider aerosol forcing as a source of uncertainty in addition to that arising from the internal parameters of the model. This 30 uncertainty is implemented via a scaling factor denoted by α which is uncertain but constant in time, applied to the original aerosol forcing time series. In these cases, our prior distribution for α is N (1, 0.5 2 ).  parameter sets which were chosen to give reasonable agreement with observational data, and additionally one simulation for each of these parameter sets in which internal variability was not included in order to show the pure forced response. These simulations are shown merely to indicate the typical behaviour of the model under historical forcing estimates and are not directly relevant to our analyses. Note that the observations of the real climate system which are also plotted here include 5 observational error (estimated to be roughly ± 0.05 • C at the 1 standard deviation level) whereas the model output is presented as an exact global temperature. Thus it is to be expected that the model results are somewhat smoother and less variable than the observations, although it may also be the case that the two layer model has insufficient variability. For each of these three simulations without internal variability, the RMS differences between model output and observations is 0.13 • C.
When we hold other parameters at default levels, best agreement between model and data (defined here simply by RMS 5 difference between the two time series) is achieved for a rather low sensitivity of 1.78 • C. If the γ and parameters are increased slightly above their defaults then we can achieve an equally good simulation (again as measured by RMS difference) with a higher value for sensitivity of 2.5 • C. If, additionally, aerosol forcing is also increased above the default value, then a higher sensitivity still of 5 • C achieves an equally good match to the observed temperature time series at the global scale. While there are hemispheric differences in these forcings that may provide some additional information (Aldrin et al., 2012), these 10 simulations help to illustrate why it has been so challenging to effectively constrain equilibrium sensitivity from the long-term observed warming.

Using Ψ
In order to assess what we can learn about sensitivity from the variability of historical temperature observations, we first consider the utility of Ψ calculated from simulations over this period. Cox et al. (2018a) proposed that the effect of forcing 15 over this interval could be effectively removed by a process of windowed detrending. Figure 5 shows results analogous to the two-layer simulations in Figure 1, but using forced simulations of the historical period rather than unforced control simulations, and therefore with Ψ calculated via the windowed detrending method of Cox et al. (2018a). One very minor discrepancy with the calculations presented in that paper is that our simulations only extend to 2012 (this being the limit of the forcing time series that we are using) and therefore we omit the last 4 years of the time period that they used. This does not significantly 20 affect any of our results. We do not consider one-layer simulations here as this version of the model is known to provide poor simulations of historical changes (Rypdal and Rypdal, 2014).
Grey dots and error bars indicate results obtained when only S is considered uncertain. These results are qualitatively similar to those obtained by the unforced two-layer simulations in Figure  Blue and red crosses also shown on this figure show results obtained from the CMIP5 and CMIP6 ensembles. The CMIP models appear to generate slightly lower values of Ψ than the two-layer model does with the same sensitivity, although the results seem broadly compatible. Reasons for the difference may include biases in parameters of the two-layer model, structural limitations, or differences in the forcings used. Although we are not replicating the emergent constraint approach here, we do note that there is a significant correlation in the CMIP5 ensemble results shown here between their sensitivities and Ψ values.
However, the relationship seen here is is weaker than that obtained by Cox et al. (2018a) for a different (though overlapping) set of models, and explains a lower proportion of the variance. This remains true whether we perform the regression with S as predictor, as suggested by our Figure, or  Finally, we repeat the approach of Section 3.3, and use the full information of the time series, by the same method of inverting the model to diagnose the internal variability noise that is required to generate the observed temperature time series. Since we are interested solely in variability, we only consider the temperature residuals after detrending. We use a simple linear detrending over the period 1880-2012, which will leave a signature primarily due both to volcanic events and also the contrasting temporal evolution of negative aerosol forcing and positive greenhouse gas forcing, which both generally increase throughout 20 this period but which exhibit different multidecadal patterns.
The calculation is similar to that of Section 3.3, but there are some minor details which are worth mentioning. Although the detrending is performed over the interval 1880-2012, we initialise the simulations in 1850 to allow for a spin-up. In contrast to Section 3.3 where detrending was not performed, knowledge of the residuals after detrending does not actually enable an exact reconstruction of the internal variability of the model simulation, as any random trend in this internal variability will have been 25 removed by detrending. In fact a whole family of different model simulations will be aliased onto the same residuals. Therefore our inversion only truly calculates the noise perturbations after the removal of the component that generates any linear trend. This is not a significant problem for the likelihood calculation as the effect of this aliasing is minor and its dependence on model parameters is negligible.
The results of multiple replicates are shown in Figure 3, in which we consider first the case where only S is uncertain, 30 and then our larger set of parametric uncertainties. In the case where only S is unknown, the full time series of detrended residuals provides strong evidence on S which can as a result be tightly constrained except perhaps when it takes a high value such as 5 • C. For S = 2.5 • C, the posterior 5-95% range is typically under 2C in width, with the average of our samples being 1.9-3.5 • C. When multiple uncertainties are considered, the constraint is however markedly weakened. For a true sensitivity of S = 2.5 • C, the posterior 5 -95% range is typically over 4 degrees, at 1.6 -6.0 • C, with a "likely" 17-83% range of 2.2-4.1 • C.
As in the unforced experiments, these optimal constraints appear somewhat narrower than can be obtained by using the Ψ statistic, but are not necessarily tight enough to be compelling in themselves.

Conclusions
We have explored the potential for using interannual temperature variability in estimating equilibrium sensitivity. While - temperatures, by calculating the exact likelihood for the complete set of these observations. However, even in this scenario of a 15 perfect model with a few well-characterised parametric uncertainties and no observational uncertainty on the temperature time series, the constraint on sensitivity is seriously limited by the variability inherent to the model. It is only in the case where the true value of the sensitivity is very low that such an approach can generate a tight constraint. For example, if the true sensitivity takes a moderate value of 2.5 • C, then we could only expect to generate a constraint with a typical 5-95% range of around 1.9-6.8 • C. As was the case when using Ψ, estimates generated from the full time series are rarely close to symmetric and 20 instead are typically skewed with a long tail to high values. This skew is an inherent property of the physical model that defines the likelihood and not an artefact of our analysis methods.
Forced variability, such as that occurring during the instrumental period, does provide additional information in our experiments, and therefore we could in theory hope to calculate a narrower posterior range, with a typical width of around 4 • C (e.g. 1.8-6.0 • C) when the true sensitivity is 2.5 • C. It must however be emphasised that these calculations rely on very optimistic 25 assumptions and therefore represent a best case that is unlikely to be realised in reality. Nevertheless, our results do suggest that variability can inform on the sensitivity and may generate a useful constraint in addition to that arising from the longer-term observed trend.