On the Future Role of the Most Parsimonious Climate Module in Integrated Assessment

In the following, we test the validity of a 1-box climate model as an emulator for Atmosphere-Ocean General Circulation Models (AOGCMs). The 1-box climate model is currently employed in the integrated assessment models FUND, MIND and PAGE, which are widely used in policy making. Our findings are two-fold. Firstly, when directly prescribing AOGCMs’ respective equilibrium climate sensitivities (ECSs) and transient climate responses (TCRs) to the 1-box model, global mean temperature (GMT) projections are generically too high by 0.5 K at peak temperature for peak-and-decline forcing scenarios resulting in a maximum global warming of approximately 2 K. Accordingly, corresponding integrated assessment studies might tend to overestimate mitigation needs and costs. We semi-analytically explain this discrepancy as stemming from the information loss incurred by the reduction of complexity. Secondly, the 1-box model offers a good emulator of these AOGCMs (accurate to within 0.1 K for representative concentration pathways (RCPs), namely RCP2.6, RCP4.5, and RCP6.0), provided the AOGCM’s ECS and TCR values are universally mapped onto effective 1-box counterparts and a certain time horizon (on the order of the time to peak radiative forcing) is not exceeded. Results that are based on the 1-box model and have already been published are still just as informative as intended by their respective authors; however, they should be re-interpreted as being influenced by a larger climate response to forcing than intended.


Introduction
Climate-economy integrated assessment models (IAMs) are used to derive welfare-optimal climate policy scenarios (Kunreuther et al., 2014) or constrained welfare-optimal scenarios that comply with a prescribed policy target (Clarke et al., 2014). Most of them employ relatively simple climate modules emulating sophisticated climate models, namely Atmosphere-Ocean General Circulation Models (AOGCMs). These climate modules (hereafter: 'simple climate models' (SCMs)) offer computational efficiency and hence allow researchers to project a broader set of scenarios in orders of magnitude less time.
For IAMs based on a decision-analytic framework involving intertemporal welfare optimization, SCMs are in fact indispensable, as these IAMs' numerical solvers may need to access the climate module anywhere from ten to one hundred thousand times before numerical convergence is flagged.
The need to qualify the degree of accuracy with which SCMs mimic AOGCMs or properly represent ensembles of AOGCMs is increasingly being recognized (Calel & Stainforth, 2017; van Vuuren et al., 2011a), as this aspect might have immediate monetary consequences in connection with derived policy scenarios (Calel & Stainforth, 2017). In previous work, van Vuuren et al. (2011a) found that IAMs tend to underestimate the effects of greenhouse gas emissions.
Due to the centennial-scale quasi-linear properties of AOGCMs' global mean temperature (GMT) dynamics, SCMs have proven capable of emulating AOGCMs' behavior regarding GMT change, deviations being a function of spread of forcing, SCM complexity and quality of SCM calibration. The climate component of MAGICC represents the most complex SCM currently in use. In some sense one could even call MAGICC an Earth System Model of Intermediate Complexity. It has demonstrated its capacity to emulate all AOGCMs' GMT even more precisely than the standard deviation of interannual GMT variability, with a fixed set of parameters, utilized for the whole range of representative concentration pathways (RCPs) (see van Vuuren et al., 2011b). This represents the current gold standard of AOGCM emulation using SCMs.
The most extreme opposite end of the scale of complexity within the model category of SCMs is provided by the 1-box model as introduced by Petschel-Held et al. (1999) (hereafter: 'PH99'), converting a radiative forcing time series into a GMT time series. The current role of this model as assessed in the literature is as follows: by fitting PH99 to GMT time series, it can be used as a diagnostic instrument, as Andrews & Allen (2008) have done. However, its main application is as an emulator of AOGCMs. In conjunction with the most parsimonious carbon cycle model (described in Petschel-Held et al. (1999) as well), PH99 has been used to derive 'admissible' greenhouse gas emission scenarios in view of prescribed GMT targets (Bruckner et al., 2003; Kriegler & Bruckner, 2004). Furthermore, the following climate-economic IAMs are currently utilizing PH99: FUND (Anthoff & Tol, 2014), MIND (Edenhofer et al., 2005) and PAGE (Hope, 2006), the last of which was used in the 'Stern Review' to the UK government (Stern, 2007). While MIND has since been succeeded by the IAM REMIND (Luderer et al., 2011) when it comes to spatial resolution or representing the energy sector by dozens of technologies, it currently serves as a state-of-the-art IAM for decision-making under uncertainty (Held et al., 2009; Lorenz et al., 2012; Neubersch et al., 2014; Roth et al., 2015) or joint mitigation-solar radiation management analyses (Roshan et al., 2018; Stankoweit et al., 2015). Kriegler and Bruckner (2004) validated PH99 in conjunction with a simple carbon cycle model. When diagnosing the effect of the IS92a emissions scenario (Kattenberg et al., 1996) on GMT, they demonstrated deviations of less than 0.2 K for the 21st century (see their Fig. 5). Recently, Calel & Stainforth (2017) highlighted the potential future role of PH99 and hence further validation of its behavior is warranted.
'… in the global average temperature to well below 2°C above pre-industrial levels and pursuing efforts to limit the temperature increase to 1.5°C above pre-industrial levels…' Hence in the policy domain, a difference of 0.5 K does matter. In fact we believe that further validation is both necessary and possible at a higher level of consistency. Firstly, the respective GMT time series as checked in Kriegler and Bruckner (2004) is convexly increasing. However, in the context of scenario generation in keeping with the well-below 2° target (UNFCCC, 2016), validation along GMT stabilization or even peaking scenarios is crucial, as these display a qualitatively different shape from IS92a. Secondly, in Kattenberg et al. (1996) the forcing was reconstructed by the additional assumption that non-CO2 greenhouse gas forcing approximately balances aerosol cooling.
Here we employ recently diagnosed forcings for 14 CMIP5-AOGCMs by Forster et al. (2013). As a main finding we diagnose that in the context of 2° stabilization scenarios, it would be necessary to implement a smaller ECS value in PH99 compared to the ECS value of the very AOGCM which PH99 is supposed to emulate. Hence previous work based on PH99 (see Hope (2006), Anthoff & Tol (2014) and all the MIND-based work on decision-making under ECS uncertainty (see citations above)) might require a re-interpretation. Needless to say, we are not claiming that the previously published IAM-based work mentioned above is 'worthless'. Rather, we argue that the parameters and probability density distributions need to be interpreted as transformed ones, essentially because a response has been sampled which is higher than that of the corresponding AOGCM. Hence we propose calibrating PH99 by mapping AOGCMs' ECS and TCR to respective effective values, which are suitable for a centennial time horizon, before using them in PH99.
In this way, PH99 could complement the use of increasingly complex climate modules, ranging from DICE's 2-box model (Nordhaus, 2013) to the complex upwelling-diffusion climate module used in MAGICC. The potential benefits of doing so are two-fold: firstly, the most parsimonious SCM, PH99, ensures maximum comprehensibility.
Secondly, in the context of numerically solving decision-making under climate response uncertainty (Kunreuther et al., 2014), having to simultaneously deal with dozens, hundreds or even thousands of alternate climate 'states of the world' (the economist's term for the uncertain system property) poses a significant challenge for numerical solvers and memory. In this regard, PH99 appears particularly attractive. Keeping the state space as slim as possible proves particularly relevant for decision-making under uncertainty with endogenous learning. For that reason, Traeger (2014) utilizes a 1-box rather than a 2-box model, however with an exogenously given time series somewhat mimicking the existence of a deep ocean layer.
Finally, our article represents a warning: if PH99 is to be used in the future, it should be done in a re-scaled manner, adjusted to the time horizon under investigation.
This article is organized as follows. Section 2 introduces the data-based part of our analysis. We call for a 3-step procedure, including: (i) a conventional, though not naïve calibration of PH99 with regard to climate sensitivity and transient climate response (i.e. the GMT change in response to a 1%/yr. increase in the CO2 concentration until doubling compared to the pre-industrial value); (ii) an AOGCM-specific calibration; and (iii) the validation of (ii). In Sect. 3 we first demonstrate that (i) would lead to emulation errors of up to 0.5 K for scenarios approximately compatible with the 2° target. We then show that this emulation error can generically be reduced to 0.1 K when choosing AOGCM-specific calibrations of PH99. This calibration is subsequently validated by independent scenarios. Note that, in Sect. 3, we focus only on the RCP2.6 scenario for calibration and use RCP4.5 and RCP8.5 for validation; for the sake of brevity, we leave further analyses, which show that PH99 can generally be calibrated to and validated by a variety of scenarios, to Appendix 2. In Sect. 4 we present a scheme of how to calibrate PH99 for a given ECS, thereby avoiding AOGCM-specific calibrations. This results in a larger emulation error than achieved in Sect. 3, but one that would nevertheless suffice for most applications. In Sect. 5 we explain the observed discrepancy between PH99 and AOGCMs as reported for step one of Sect. 2 by pursuing a semi-analytical, physically-based approach. In Sect. 6 we discuss the implications of our findings for the integrated assessment community, while Sect. 7 presents our conclusions and outlines further research needs.
Before we proceed, a brief note on the role of AOGCM data in our article is in order. We compare PH99 to AOGCM data because we utilize AOGCMs here as the entities closest to 'reality' available on the 'model market'. We do not, however, claim that IAM modelers were using them or should be using them. AOGCM data is used to demonstrate how ECS and TCR data can skew the calibration of PH99, and how it should be corrected. The same correction should in principle be used for ECS data inferred from any source, e.g. abstract distributions such as those presented in Bindoff et al. (2013). Mirroring PH99 in AOGCM data, however, is currently the most direct way to infer the quality of a (not) re-calibrated PH99.

Method
This section introduces the analytic structure of PH99 and relates it to ECS and TCR, before describing a three-step scheme for a PH99 / AOGCM intercomparison.
PH99 projects the atmospheric GMT anomaly compared to its preindustrial level. Petschel-Held et al. (1999) specified the model for a CO2-only forcing scenario and accordingly PH99 reads

$$\frac{dT}{dt} = \mu \ln(c) - \alpha T. \qquad (1)$$

Here T denotes the GMT anomaly, c is the CO2 concentration in units of its pre-industrial level, and α and µ are constant tuning parameters.
From Eq. (1) we can readily read the ECS, the equilibrium temperature anomaly in response to a doubling of the CO2 concentration compared to its pre-industrial value:

$$\mathrm{ECS} = \frac{\mu}{\alpha} \ln 2, \qquad (2)$$

also in line with Petschel-Held et al. (1999) and Kriegler and Bruckner (2004). In Appendix 1 we briefly derive the TCR for this model, i.e. the GMT anomaly in a stylized experiment at the moment the CO2 concentration has doubled after having been exponentially increased with the rate γ (of 1%/yr.):

$$\mathrm{TCR} = \frac{\mu}{\alpha}\left[\ln 2 - \frac{\gamma}{\alpha}\left(1 - 2^{-\alpha/\gamma}\right)\right]. \qquad (3)$$

In the following we propose a 3-step validation approach to clarify PH99's range of applicability.
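As a numerical cross-check of the two diagnostics, the following Python sketch evaluates ECS and TCR for given α and µ. It assumes that PH99 takes the form dT/dt = µ ln(c) − αT, that the concentration grows as c(t) = exp(γt), and that T(0) = 0. The function names and the example values in the usage note are illustrative and not taken from the original.

```python
import math


def ecs(alpha, mu):
    """Equilibrium warming for doubled CO2: steady state of dT/dt = mu*ln(c) - alpha*T at c = 2."""
    return mu * math.log(2.0) / alpha


def tcr(alpha, mu, gamma=0.01):
    """Warming at the time of CO2 doubling for c(t) = exp(gamma*t) and T(0) = 0 (closed form)."""
    t_double = math.log(2.0) / gamma
    lag_term = (gamma / alpha) * (1.0 - math.exp(-alpha * t_double))
    return (mu / alpha) * (math.log(2.0) - lag_term)


def tcr_numeric(alpha, mu, gamma=0.01, steps=100000):
    """Cross-check of tcr(): explicit Euler integration of the same ODE up to doubling."""
    t_double = math.log(2.0) / gamma
    dt = t_double / steps
    temperature = 0.0
    for i in range(steps):
        # ln(c) = gamma * t for exponential concentration growth
        temperature += dt * (mu * gamma * (i * dt) - alpha * temperature)
    return temperature
```

For instance, with the hypothetical values α = 0.03 per year and µ = 0.1 K per year, `tcr(0.03, 0.1)` stays below `ecs(0.03, 0.1)`, as expected for a system that has not yet equilibrated at the time of doubling.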

Step One
We first check whether simply calibrating PH99 from AOGCM-specific ECS and TCR data would deliver good emulations (i.e. accurate to within 0.1 K) for 2°-target-compatible scenarios. After a technical derivation, we summarize this method of mapping AOGCMs' ECS and TCR onto PH99's two parameters.
Some difficulty arises due to the fact that AOGCMs have not been run for 2°-target-compatible scenarios for CO2-only forcing, but solely for a plethora of simultaneous forcings that would add up to a total forcing. Hence we generalize Eq. (1) to its total-forcing counterpart (see Eqs. (4)-(7)) to be driven by total forcing time series as reconstructed in Forster et al. (2013).

In order to generalize Eq. (1), we recall its derivation from an energy balance approach, as summarized in Kriegler and Bruckner (2004), allowing for a physical interpretation of the model. We start by introducing the general energy balance equation, expressing the change in oceanic heat content as the difference of ingoing (F) and outgoing (λT) radiative flux, while h denotes the constant effective oceanic heat capacity (see also Geoffroy et al., 2013, Eqs. 1-4):

$$h \frac{dT}{dt} = F - \lambda T. \qquad (4)$$
F also represents the total radiative forcing as applied in Forster et al. (2013). However, the equation cannot yet be integrated, as h and λ are still to be determined. In order to solve the posed problem (CO2-only versus total forcing) we note that h and λ represent universal parameters of PH99 in the sense that their numerical values do not depend on the mix of substances (i.e. CO2, other greenhouse gases, aerosols, etc.) causing the total radiative forcing. Therefore, h and λ can be determined by considering the CO2-only case and, hence, by tracing them back to the already determined α and µ. For the CO2-only case, the energy balance reads

$$h \frac{dT}{dt} = \frac{Q_2}{\ln 2} \ln(c) - \lambda T, \qquad (5)$$

where Q2 denotes the additional forcing from the doubling of the CO2 concentration compared to its pre-industrial value and is listed for all of the AOGCMs (see Forster et al., 2013, Table 1).
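The energy balance can be stepped forward for an arbitrary forcing series. The sketch below assumes the balance h dT/dt = F − λT with a forcing series sampled at a fixed step and held constant within each step, for which the update is exact. The function name `run_one_box`, the argument names and the units (W m⁻² for F, W m⁻² K⁻¹ for λ, W yr m⁻² K⁻¹ for h) are assumptions made for illustration.

```python
import math


def run_one_box(forcing, heat_capacity, feedback, t_init=0.0, dt=1.0):
    """Integrate h*dT/dt = F - lambda*T for a forcing series sampled every dt years.

    The forcing is held constant within each step, so each update is the exact
    exponential relaxation towards F/lambda. Returns len(forcing) + 1 values:
    the initial anomaly followed by the anomaly after every step.
    """
    decay = math.exp(-feedback / heat_capacity * dt)
    temperatures = [t_init]
    for f in forcing:
        equilibrium = f / feedback  # temperature the box relaxes towards
        temperatures.append(equilibrium + (temperatures[-1] - equilibrium) * decay)
    return temperatures
```

Under a constant forcing the trajectory converges to F/λ, which offers a simple sanity check before driving the model with a reconstructed forcing series.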
If we then divide by h, we obtain:

$$\frac{dT}{dt} = \frac{Q_2}{h \ln 2} \ln(c) - \frac{\lambda}{h} T. \qquad (6)$$

A comparison with Eq. (1) readily reveals

$$\alpha = \frac{\lambda}{h} \quad \text{and} \quad \mu = \frac{Q_2}{h \ln 2}. \qquad (7)$$

These equations allow for determining h = Q2 / (µ ln 2) and λ = αh. Utilizing these equations and Eq. (4), we generate PH99's temperature response to the total radiative forcing as specified in Forster et al. (2013).
The derivation displayed so far can be summarized in terms of the following recipe to generate PH99's parameters on the basis of AOGCMs' ECS and TCR:
1. Set PH99's ECS and TCR equal to the selected AOGCM's ECS and TCR.
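One way to turn a prescribed ECS and TCR into PH99 parameters is sketched below. It assumes ECS = (µ/α) ln 2, TCR = (µ/α)[ln 2 − (γ/α)(1 − 2^(−α/γ))], h = Q2/(µ ln 2) and λ = αh. Since the ratio TCR/ECS then depends on α alone and increases monotonically with it, α is found by bisection. The function names, the search bounds and the dictionary keys are hypothetical choices, not part of the original recipe.

```python
import math

LN2 = math.log(2.0)


def tcr_to_ecs_ratio(alpha, gamma=0.01):
    """TCR/ECS of the 1-box model; depends only on alpha/gamma and rises from 0 to 1."""
    x = alpha / gamma
    return 1.0 - (1.0 - 2.0 ** (-x)) / (x * LN2)


def calibrate(ecs, tcr, q2, gamma=0.01):
    """Map (ECS, TCR, Q2) onto alpha, mu, heat capacity h and feedback lambda."""
    if not 0.0 < tcr < ecs:
        raise ValueError("requires 0 < TCR < ECS")
    target = tcr / ecs
    lo, hi = 1e-8, 1e3  # assumed search bounds for alpha in 1/yr
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # bisection on a logarithmic scale
        if tcr_to_ecs_ratio(mid, gamma) < target:
            lo = mid
        else:
            hi = mid
    alpha = math.sqrt(lo * hi)
    mu = alpha * ecs / LN2
    heat_capacity = q2 / (mu * LN2)
    return {
        "alpha": alpha,
        "mu": mu,
        "heat_capacity": heat_capacity,
        "feedback": alpha * heat_capacity,  # equals q2 / ecs
    }
```

The resulting feedback parameter equals Q2/ECS irrespective of the TCR, so the TCR only fixes the time scale 1/α.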
Finally, to avoid differences occurring over the historical period (pre-2006 for the RCPs), we need to initialize PH99 with each AOGCM's 2006 temperature anomaly with respect to the pre-industrial value. To do this, for each AOGCM we calculate the mean temperature over the period 1881-1910 and set this as the pre-industrial value. We then calculate the mean temperature over the period 1991-2020 and use this as an indicator for the 2006 temperature level. The difference between these two values is fixed as the initial temperature anomaly for PH99.
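The initialization described above amounts to a difference of two 30-year means. A minimal sketch follows, assuming the AOGCM output is available as a mapping from calendar year to annual mean GMT; the data layout and the function names are assumptions.

```python
def period_mean(series, start, end):
    """Mean of a {year: value} mapping over the inclusive year range [start, end]."""
    values = [series[year] for year in range(start, end + 1)]
    return sum(values) / len(values)


def initial_anomaly(gmt_by_year):
    """Anomaly used to initialize the 1-box model in 2006.

    Difference between the 1991-2020 mean (proxy for the 2006 level) and the
    1881-1910 mean (taken as the pre-industrial level).
    """
    return period_mean(gmt_by_year, 1991, 2020) - period_mean(gmt_by_year, 1881, 1910)
```

Because the 1991-2020 window is centered near 2006, a linear trend over that window leaves the estimate of the 2006 level essentially unbiased.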
Each temperature trajectory should be compared to the temperature data from the corresponding AOGCM. For GMT-target-constrained economic optimizations (Clarke et al., 2014; Edenhofer et al., 2005), the maximum GMT (rather than the whole time series) is of special importance. Hence we use the difference between the respective 2071-2100 GMT time averages of PH99 and the AOGCM as an error metric. If the deviations are tolerable (accurate to within 0.1 K), the climate module is validated; if they are intolerable, we proceed with steps two and three.
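The error metric and the acceptance criterion can be written down in a few lines. The sketch assumes both trajectories are stored as mappings from year to GMT anomaly; the function names are hypothetical, and the sign convention (emulator minus AOGCM) is a choice made here.

```python
def late_century_deviation(emulated, reference, start=2071, end=2100):
    """Difference of the start-end time averages (emulator minus reference), in K."""
    years = range(start, end + 1)
    mean_emulated = sum(emulated[year] for year in years) / len(years)
    mean_reference = sum(reference[year] for year in years) / len(years)
    return mean_emulated - mean_reference


def is_validated(deviation, tolerance=0.1):
    """True if the absolute deviation stays within the tolerance (default 0.1 K)."""
    return abs(deviation) <= tolerance
```

A positive value indicates that the emulator runs warmer than the AOGCM at the end of the century.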

Step Two
For each AOGCM, α and µ are tuned such that the difference between PH99 and the AOGCM GMT anomaly for the RCP2.6 scenario in the period 2006-2100 is minimized using a least squares approach. For further diagnostics we then determine the new 'effective' ECS and TCR from Eq. (2) and Eq. (3). As in step one, the deviations in 2071-2100 means of GMT between PH99 and the respective AOGCM are determined as an accuracy check.
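A least-squares tuning of this kind can be sketched without external libraries. The sketch assumes the total-forcing form dT/dt = µ (ln 2 / Q2) F − αT with annual steps and forcing held constant within each year. For fixed α the response is linear in µ, so µ has a closed-form optimum and only α needs to be scanned. The grid bounds, the grid size and all function names are assumptions; the original does not state which optimizer was used.

```python
import math

LN2 = math.log(2.0)


def unit_response(forcing, alpha, q2):
    """Annual response per unit mu (g) and decay factors of the initial anomaly."""
    decay = math.exp(-alpha)
    g = [0.0]
    decay_powers = [1.0]
    for f in forcing[:-1]:
        g.append(g[-1] * decay + LN2 * f / (q2 * alpha) * (1.0 - decay))
        decay_powers.append(decay_powers[-1] * decay)
    return g, decay_powers


def alpha_grid(n_grid=400, lowest=0.005, highest=1.0):
    """Logarithmically spaced candidate values for alpha (1/yr)."""
    ratio = highest / lowest
    return [lowest * ratio ** (i / (n_grid - 1)) for i in range(n_grid)]


def fit_alpha_mu(forcing, target, q2, t_start):
    """Return (alpha, mu, sse) minimizing the squared deviation from target.

    forcing and target hold one value per year, starting in the initial year;
    t_start is the prescribed temperature anomaly in that initial year.
    """
    best = None
    for alpha in alpha_grid():
        g, decay_powers = unit_response(forcing, alpha, q2)
        # remove the part of the signal explained by the decaying initial anomaly
        forced = [obs - t_start * d for obs, d in zip(target, decay_powers)]
        mu = sum(gi * fi for gi, fi in zip(g, forced)) / sum(gi * gi for gi in g)
        sse = sum((fi - mu * gi) ** 2 for gi, fi in zip(g, forced))
        if best is None or sse < best[2]:
            best = (alpha, mu, sse)
    return best
```

The effective ECS then follows from the fitted pair as µ ln 2 / α. A finer grid or a subsequent local refinement would be a natural extension if the grid resolution of roughly one percent in α is not sufficient.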

Step Three
Lastly, we validate the PH99 model versions generated in step two. For this purpose, independent temperature and forcing paths must be run as a nontrivial test to check whether the trained climate module can accurately project other temperature data trajectories. To do so, the values for α and µ determined in step two are implemented in PH99, the latter then being driven by the total climate forcing of the RCP4.5 and RCP8.5 scenarios. Similar to steps one and two, the deviations in 2071-2100 means of GMT between PH99 and the respective AOGCM are determined as an accuracy check.
One might be interested in seeing whether the calibrated module is capable of mimicking other scenarios such as RCP6.0, or what would happen if PH99 were calibrated to RCP4.5 or others. In general, the procedure outlined above brings about similar results; for the sake of brevity of the main text, we present the respective results in Appendix 2.
Table 1 shows the calculated α and µ together with the feedback response time 1/α in step one. For all of the indicators we also compute the mean values and standard deviations of the samples. The mean value of the ECS for GCM data is 3.35 K, with a minimum and maximum of 2.11 K and 4.67 K, respectively. The mean value of the time scales is roughly 35 years. Figure 1 represents the projected PH99 temperature evolution for the scenario RCP2.6 of each GCM in 2006-2100, using the data from Table 1. […] '… increase in the global average temperature to well below 2°C above pre-industrial levels and pursuing efforts to limit the temperature increase to 1.5°C above pre-industrial levels…' Hence a difference of 0.5 K does matter. Accordingly, we must proceed with step two.
In step two, for each of the GCMs, we tune α and µ such that the GMT deviations for the whole period 2006-2100 are minimized in a least squares manner as represented in Figure 3 and Figure 4. From the thereby adjusted α and µ we derive the ECS and TCR, which are presented in Table 2. MTDs for the various AOGCMs are shown in Figure 2.
The results tell us three main things. Firstly, the average of the absolute values of deviations is significantly reduced when α and µ are tuned. Indeed, the MTD average drops to below 0.02 K. Secondly, while the average ECS decreases by 0.9 K (from 3.35 K to 2.46 K), the average TCR increases by 0.14 K (from 1.90 K to 2.04 K). Thirdly, the mean value of feedback response times decreases significantly, from roughly 35 years to less than 12 years.
For validation we move on to step three. We utilize the RCP4.5 temperature and forcing data as provided by Forster et al. (2013). In Figure 3 and Figure 4 the respective GMT trajectories for any AOGCM are contrasted with the PH99-generated ones, where α and µ are fixed to their values as determined in step two. The MTDs are shown in Figure 2. The results confirm that the climate module is sufficiently well trained in the second step that it can suitably mimic the actual temperatures (accurate to within 0.1 K) for RCP4.5 and RCP8.5. As shown, the average MTD is approximately 0.05 K for RCP4.5 and about 0.14 K for RCP8.5. For RCP4.5, the deviations for three of the GCMs, namely CCSM4, CNRM-CM5 and NorESM1-M, are even smaller than those diagnosed for RCP2.6 in step two. See Appendix 2 for further analyses.

A mapping of ECS onto their PH99-specific counterparts α and µ
Finally, we attempt to abstract from fitting PH99 to individual AOGCMs and provide an approximate way to calibrate PH99 within the cloud of AOGCMs simply by knowing the ECS. Then PH99 could be utilized for any ECS in analyses where the ECS is uncertain.

An existing mapping for PH99
Before diving into our suggestions, we examine one of the existing options (a reader solely interested in our improved method of utilizing PH99 can move straight on to Subsection 4.2). We inspect the curve suggested by Lorenz et al. (2012), which correlates α and µ to ECS. Using a sample from Frame et al. (2005) and assuming a strict relationship between 1/µ and ECS, Lorenz et al. (2012) suggest the following approximation:

[…] (8)

where µ̄ is the mean value of µ in the sample (see Fig. 7 in Lorenz et al., 2012; all quantities measured in the units utilized in Kriegler & Bruckner, 2004). With µ known, Eq. (2) and Eq. (8) have been repeatedly used in studies employing MIND that concern uncertainties and ECS (Neubersch et al., 2014; Roshan et al., 2018; Roth et al., 2015).
We employ Eq. (2) and Eq. (8) for all ECSs from Table 1 and show the MTDs for the RCP2.6 scenario in Figure 5. Notice that TCR can readily be calculated using Eq. (3). Clearly, on average, employing Lorenz's curve does not result in a better situation than step one. However, this might not necessarily be a case of comparing like with like. At the time of Frame et al. (2005), the two-dimensional uncertainty information was obtained by reconstructing the 20th century's warming signal from fingerprinting by means of a single AOGCM and then using this observational data as a constraint. It is well known that observational constraints may lead to different distributions than ensembles of AOGCMs do (Andrews & Allen, 2008).
Nevertheless we include this piece of information here for the sake of completeness.

A multiple AOGCM-based mapping for PH99
Given the inferred estimates in Table 2, one can directly relate α and µ to the ECS. To do so, we generate polynomial fits (of orders 2 and 3) of α and µ against all AOGCMs' ECSs. Predicting a two-dimensional manifold from ECS alone implicitly exploits the fact that AOGCMs' TCRs can be predicted well using ECSs (see e.g. Meinshausen et al., 2009) in a statistical sense. Another option would be to derive α and µ analytically (as in the first step) when the inferred ECS and TCR are correlated to the ECS and TCR of AOGCMs.
Figure 6 shows the fits of α and µ (from Table 2) to the ECS (from Table 1), using linear, quadratic and cubic polynomial approximations. For the case of a linear approximation, we exclude the model GISS_E2_R as an outlier. Figure 5 indicates that on average all approximations mimic the actual temperature paths better than a non-fitted one. The cubic estimation projects significantly smaller deviations compared to the quadratic approximation and slightly smaller deviations compared to the linear approximation. The maximum MTD in the cubic approximation is 0.3 K for IPSL-CM5A-LR, which is roughly a third of the maximum in the quadratic approximation that is revealed for CSIRO-Mk3-6-0.
We also consider alternative ways to map ECS and TCR from the 14 utilized AOGCMs onto PH99-intrinsic properties, going beyond the scheme displayed in Figure 6. As one option, shown in Figure 7, we linearly regress the ECS and TCR values inferred from step two against their original AOGCM counterparts respectively and obtain

$$\mathrm{ECS}_{\mathrm{PH99}} \approx a\,\mathrm{ECS}_{\mathrm{AOGCM}} + b, \qquad (9)$$

with a = 0.5846, b = 0.5095 K, and R² = 0.8158, as long as ECS_PH99 < ECS_AOGCM, and

$$\mathrm{TCR}_{\mathrm{PH99}} \approx c\,\mathrm{TCR}_{\mathrm{AOGCM}} + d, \qquad (10)$$

with c = 0.9763, d = 0.1829 K, and R² = 0.667.
The other option consists in using Eq. (9) along with a linearly regressed TCR_PH99 over ECS_AOGCM, that is

$$\mathrm{TCR}_{\mathrm{PH99}} \approx m\,\mathrm{ECS}_{\mathrm{AOGCM}} + n, \qquad (11)$$

with m = 0.4582, n = 0.5044 K, and R² = 0.7876.
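The linear mappings can be applied directly once their form is fixed. The sketch below assumes that the three regressions read ECS_PH99 ≈ a·ECS_AOGCM + b, TCR_PH99 ≈ c·TCR_AOGCM + d and TCR_PH99 ≈ m·ECS_AOGCM + n, with the coefficients quoted above. The function names are hypothetical, and the threshold function simply solves a·ECS + b < ECS for the AOGCM's ECS.

```python
A, B = 0.5846, 0.5095  # effective ECS from AOGCM ECS (slope, intercept in K)
C, D = 0.9763, 0.1829  # effective TCR from AOGCM TCR (slope, intercept in K)
M, N = 0.4582, 0.5044  # effective TCR from AOGCM ECS (slope, intercept in K)


def effective_ecs(ecs_aogcm):
    """Effective 1-box ECS from the AOGCM's ECS (assumed linear form)."""
    return A * ecs_aogcm + B


def effective_tcr_from_tcr(tcr_aogcm):
    """Effective 1-box TCR from the AOGCM's TCR (assumed linear form)."""
    return C * tcr_aogcm + D


def effective_tcr_from_ecs(ecs_aogcm):
    """Effective 1-box TCR from the AOGCM's ECS alone (assumed linear form)."""
    return M * ecs_aogcm + N


def smallest_ecs_with_reduction():
    """AOGCM ECS above which the effective ECS is smaller than the original one."""
    return B / (1.0 - A)
```

Evaluated at the sample means reported earlier (ECS of 3.35 K, TCR of 1.90 K), these mappings return roughly 2.47 K and 2.04 K, in line with the step-two averages of 2.46 K and 2.04 K. The condition ECS_PH99 < ECS_AOGCM holds for any ECS above about 1.23 K.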
The respective MTDs are shown in Figure 5. Although both approximations mimic the actual temperature paths better than a non-fitted one, regressing both the inferred effective ECS and TCR solely against AOGCMs' ECS (hereafter: ETE) clearly offers the best overall approximation.
Using the ETE has four major advantages over all other options dealt with here, especially for the IAM community. Firstly, its approximation is better than all options but the cubic fit. Secondly, the ETE still has an advantage over the cubic fit because one can easily use a broader range of climate sensitivities, for example, from 1 K to 9 K, which may not be accurately determined by the cubic fit. Even though the cubic fit may yield a better approximation, in our analysis it is only better by 0.03 K at the expense of a non-intuitive shape that might result in even worse deviations for out-of-sample data. Thirdly, prior knowledge regarding the TCR is no longer a decisive factor. Note that prior knowledge regarding the TCR can make approximations better. However, as we tested, for example, in the case of linearly regressing both the inferred effective ECS and TCR against both AOGCMs' ECS and TCR, the R² values for Eq. (9) and Eq. (11) only improve by 6% and 7% respectively, and the MTD is no better than the ETE. Finally, in the case of ETE, we do not need to re-evaluate our sample and possibly drop any model as an outlier. Given the explorations already done and their performance, we leave explorations beyond the linear approximation for future research.
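For use in an IAM, the ETE ultimately has to deliver α and µ for a given ECS. A possible end-to-end sketch is shown below. It assumes the linear forms ECS_PH99 ≈ 0.5846·ECS + 0.5095 K and TCR_PH99 ≈ 0.4582·ECS + 0.5044 K, the 1-box relations ECS = (µ/α) ln 2 and TCR = (µ/α)[ln 2 − (γ/α)(1 − 2^(−α/γ))], and a bisection on α. The function names and search bounds are hypothetical.

```python
import math

LN2 = math.log(2.0)
ECS_SLOPE, ECS_INTERCEPT = 0.5846, 0.5095  # effective ECS from AOGCM ECS
TCR_SLOPE, TCR_INTERCEPT = 0.4582, 0.5044  # effective TCR from AOGCM ECS


def tcr_over_ecs(alpha, gamma):
    """TCR/ECS of the 1-box model as a function of alpha (monotonically increasing)."""
    x = alpha / gamma
    return 1.0 - (1.0 - 2.0 ** (-x)) / (x * LN2)


def ete_parameters(ecs_aogcm, gamma=0.01):
    """Return (alpha, mu, effective ECS, effective TCR) for a given AOGCM-type ECS."""
    ecs_eff = ECS_SLOPE * ecs_aogcm + ECS_INTERCEPT
    tcr_eff = TCR_SLOPE * ecs_aogcm + TCR_INTERCEPT
    if not 0.0 < tcr_eff < ecs_eff:
        raise ValueError("effective TCR must lie between 0 and the effective ECS")
    target = tcr_eff / ecs_eff
    lo, hi = 1e-8, 1e3  # assumed search bounds for alpha in 1/yr
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # bisection on a logarithmic scale
        if tcr_over_ecs(mid, gamma) < target:
            lo = mid
        else:
            hi = mid
    alpha = math.sqrt(lo * hi)
    mu = alpha * ecs_eff / LN2
    return alpha, mu, ecs_eff, tcr_eff
```

For an ECS of 3 K this sketch yields a feedback response time 1/α of close to 12 years, of the same order as the mean response time found after the AOGCM-specific tuning.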

An analytic interpretation of the AOGCM-PH99 intercomparison
In the following, we explain why PH99 systematically overestimates maximum GMT for peaking scenarios when fitted for exponentially growing scenarios. As an AOGCM is analytically not accessible, we investigate an intermediate step of model replacement by moving from a 1-box to a 2-box SCM (as utilized in DICE (Nordhaus, 2013)). In fact we qualitatively trace back the effects reported so far to the information loss incurred by replacing a 2-box SCM with a 1-box SCM like PH99. We then also investigate the quality of alternative fitting schemes based on our semi-analytic analysis, which complements our previously mentioned AOGCM-based validation.
Following Geoffroy et al. (2013) we introduce a 2-box SCM as a more universal emulator of AOGCMs' mapping from radiative forcing onto temperature.
T2B denotes the 2-box analogue of the 1-box temperature T in Eq. (1). The upper and the lower equation represent the upper and the lower ocean, respectively.
In order to contrast PH99 with this 2-box model, we search for analytic approximations of generic shapes of the forcing F(t) and examine the long-term projections under various RCPs as depicted in Meinshausen et al. (2011b); an excerpt is included in Figure 8 for the reader's convenience. Particularly in view of the peaking, mitigation-oriented lowest forcing scenario, we approximate forcing paths in three phases: zero forcing, linear increase, and linear decrease, under a continuity assumption:

$$F(t) = \begin{cases} 0, & t < 0 \\ k_1 t, & 0 \le t \le t_1 \\ k_1 t_1 + k_2 (t - t_1), & t > t_1 \end{cases} \qquad (13)$$
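The three-phase forcing can be written as a small piecewise-linear function. In the sketch below, the slope of the rising phase is called `k1`, the time of peak forcing `t1`, and the declining slope is expressed as −ε·k1; these argument names and the default values (a ramp-up time of 100 years and ε = 0.2) are assumptions for illustration.

```python
def three_phase_forcing(t, k1, t1=100.0, eps=0.2):
    """Continuous piecewise-linear forcing: zero, linear rise, slower linear decline.

    t   : time in years since the onset of forcing
    k1  : slope of the rising phase (forcing units per year)
    t1  : time of peak forcing in years
    eps : ratio of the declining to the rising slope (absolute value)
    """
    if t <= 0.0:
        return 0.0
    if t <= t1:
        return k1 * t
    return k1 * t1 - eps * k1 * (t - t1)
```

With a hypothetical slope of 0.04 W m⁻² per year, the forcing peaks at 4 W m⁻² after 100 years and has declined to 3.6 W m⁻² after a further 50 years.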
We approximately identify t1 with the year 2035 and t=0 with 100 years earlier, i.e. we assume a ramp-up time t1 for the forcing of roughly 100 years. Furthermore, k2<0 and |k2/k1| =: ε ≪ 1. From Figure 8 we approximate a generic value of ε=0.2.

[…] (14)

This represents two linear modes of amplitudes a_f and a_s (with sum equal to 1), delayed by the characteristic time scales of a fast and a slow mode, τ_f and τ_s, respectively, and continuously matched to the initial condition '0' by an exponential. In […] Table 4; for centennial effects, this mode would nearly match the equilibrium response). Furthermore we can see that τ_s ranges from 100 yrs. to 300 yrs. for 15 out of 16 AOGCMs. Hence the 2-box model is characterized by a marked time-scale separation between the two linear modes. With the aid of these two approximations, the last equation can be simplified to

[…] (15)

We then extend the analytic range of that formula, given the two approximations above, for t > t1 (for a derivation, see Appendix 3):

[…] (16)

The analogous expressions for the 1-box model read

[…] (17)

and

[…] (18)

Explaining the PH99-AOGCM discrepancy for equal ECS and TCR values
We are now prepared to mimic Step One in Section 2: we calibrate the 1-box model such that it is characterized by the same ECS and TCR as the 2-box model. As λ=Q2/ECS2B , equal ECS values for both models deliver λ=λ2B.
Determining the second degree of freedom of PH99 (e.g. as expressed by θ) from some transient property proves more intricate. We request

$$T(t_{\mathrm{TCR}}) = T_{2B}(t_{\mathrm{TCR}}), \qquad (19)$$

whereby we introduce tTCR as the moment in time when T needs to be evaluated in order to determine TCR. In Appendix 1 we note, by definition, that tTCR = (ln 2)/γ ≈ 70 yrs for a growth rate γ = 1%/yr of the carbon dioxide concentration, hence 0 < tTCR < t1.
Therefore, when exploiting Eq. 19, Eqs. 15 and 17 (rather than 16 and 18) apply and result in the expression

[…] (20)

with h denoting the auxiliary function (see Figure 9)

[…] (21)

where $\lim_{x \to 0} h(x) = 0$, lim […]. From this, we can already get a first impression of the scale of θ, prior to numerical inversion: as τ is generically markedly larger than tTCR, the right-hand side of the defining equation above approximates ½. Further, if we boldly assume a slight time-scale separation between θ and tTCR, the former being smaller than the latter, then the linear approximation of h would apply and θ ≈ tTCR/2 ≈ 35 yrs. For a centered value of τ = 250 yrs, this approximation is confirmed in a direct numerical treatment of Eq. (20).
We are now equipped to compare the two models' temperature projections and apply the 3-phase forcing as defined above for ε = 0.2. a1/λ is chosen such that peak temperatures enter the 2° regime for illustrative purposes. We exploit the coincidence that tTCR just happens to approximately correspond to our starting year 2006 for PH99 (because 2035-100+70=2005). Hence the formulas for the 1-box model do not need to be adapted for an explicit initial condition for this purpose. Figure 10 shows that by construction, both temperature responses match at tTCR ≈ 70 yrs., although the 1-box model's maximum exceeds the 2-box model's maximum by 0.5 K. This phenomenon can be explained as follows. As the 1-box model responds with a finite time scale, its derivative must be continuous in response to a continuous forcing. Hence the leading term is quadratic when the forcing starts.
In contrast, the 2-box model contains a virtually degenerate time scale (the fast one); hence its leading term is linear. If the two curves are to nevertheless match at tTCR, the 1-box model's derivative at tTCR must exceed the 2-box model's derivative.
This, together with the right-bending kink in the 2-box model's response at t1, leads to a larger maximum in the 1-box model.
In summary, on time-scales much smaller than the slow mode, the slow mode, compared to the fast mode, cannot develop yet; hence the fast mode will dominate the slow mode. As such, fitting a 1-modal model in a convex regime is likely to yield poor predictions of a temperature maximum for mitigation-based forcings.
This explains the discrepancies found in our PH99-AOGCM comparison when directly transferring AOGCMs' ECS and TCR onto PH99. Figure 10 further suggests that if PH99 were used to predict correct maxima and emulate AOGCMs in this time regime, it would need to be used with a markedly smaller time scale. However, a simple reduction in time scale would lead to a new inter-model discrepancy before the kink; hence the overall amplitude of PH99's response would need to be reduced as well. The latter scales with the ECS; hence the ECS must be reduced by a certain factor towards a new 'effective ECS,' which could also be called a 'transient climate sensitivity.'
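The mechanism can be illustrated numerically under a set of simplifying assumptions that are ours rather than the original's: the fast mode of the 2-box model is treated as instantly equilibrated, the slow mode has amplitude a_s = 0.5 and time scale τ = 250 yrs, the auxiliary function is taken as h(x) = x(1 − exp(−1/x)), and the 1-box time scale θ is fixed by matching both ramp responses at tTCR, i.e. h(θ/tTCR) = a_s h(τ/tTCR). The scaling `s` (equilibrium warming per year of ramp, in K/yr) and all function names are hypothetical.

```python
import math


def h(x):
    """Assumed auxiliary function: tends to 0 for x -> 0 and to 1 for x -> infinity."""
    return x * (1.0 - math.exp(-1.0 / x))


def solve_theta(t_tcr, tau_slow, a_slow):
    """1-box time scale for which both ramp responses coincide at t_tcr (bisection)."""
    target = a_slow * h(tau_slow / t_tcr)
    lo, hi = 1e-9, 1e6
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if h(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) * t_tcr


def equilibrium_response(t, s, t1, eps):
    """F(t)/lambda for the three-phase forcing, with s = k1/lambda in K per year."""
    if t <= 0.0:
        return 0.0
    if t <= t1:
        return s * t
    return s * (t1 - eps * (t - t1))


def simulate(s=0.03, t1=100.0, eps=0.2, tau_slow=250.0, a_slow=0.5,
             gamma=0.01, t_end=300.0, dt=0.01):
    """Return theta and a list of (time, 1-box temperature, 2-box temperature)."""
    t_tcr = math.log(2.0) / gamma
    theta = solve_theta(t_tcr, tau_slow, a_slow)
    one_box, slow = 0.0, 0.0
    rows = []
    for i in range(int(round(t_end / dt)) + 1):
        t = i * dt
        eq = equilibrium_response(t, s, t1, eps)
        # fast mode follows the equilibrium response instantly (assumption)
        two_box = (1.0 - a_slow) * eq + a_slow * slow
        rows.append((t, one_box, two_box))
        one_box += dt * (eq - one_box) / theta   # explicit Euler step, 1-box
        slow += dt * (eq - slow) / tau_slow      # explicit Euler step, slow mode
    return theta, rows
```

Under these assumptions θ comes out close to tTCR/2, the two trajectories coincide near tTCR, and the 1-box maximum lies clearly above the 2-box maximum. The size of the gap depends on the assumed amplitudes and scaling, so the sketch is meant to illustrate the sign of the effect rather than to reproduce the reported 0.5 K.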

Testing the validity of a recalibrated PH99 for a 2-box model
In Sect. 5.1 we derived an analytic explanation for why a naïve transfer of an AOGCM's ECS and TCR to PH99 results in a maximum GMT which is too large when driven by a mitigation forcing scenario. However, we showed in Sections 3 and 4 that PH99 is in fact a good emulator of an AOGCM to within 0.1 K if it is either directly fitted to that AOGCM or if the AOGCM's ECS and TCR are transformed into effective quantities for PH99. Here 'good emulator' expresses the fact that the same parameter set can be utilized for any RCP (2.6, 4.5, 6.0, 8.5). From a practical point of view, we could stop our analysis here and suggest that this type of validation might be sufficient to generate trust in PH99 as an emulator for any forcing scenario.
However, for further validation, in this Subsection we would like to exploit the fact that for a 2-box / 1-box intercomparison we can validate PH99 for an order-of-magnitude larger set of forcing scenarios (again presupposing that a 2-box model would emulate an AOGCM qualitatively better than a 1-box model). We systematically test the previously suggested adjustment formulas Eqs. (9) to (11) for a range of t1 and ε values, hence varying mitigation scenarios, given alternative ECS and slow mode's time scale τ for the 2-box model. We find numerically that θ is on the order of 10 years, and the ECS needs to be reduced by 1/4 to 1/3. We test for the centred ECS values of 3 K and 4 K and a slow mode's time scale, which generically ranges from 100 yrs. to 300 yrs. (see Geoffroy et al., 2013).
In principle, for any forcing scenario characterized by varying t1 and ε, we would need to compare GMT as calculated by Eqs. […]
By contrast, our validation scheme as utilized in Sections 3-4 would fix PH99 to the AOGCM at the year 2006. We denote this point in time by t0 (≈tTCR). Having transformed ECS and TCR according to Eqs. (9)-(11), we can no longer expect T(t0)=T2B(t0). Therefore we have to force the solution of PH99 to the solution of the 2-box model at t0, and we call the solution of PH99 initialized in this way 'Tinit':
(24)
Figure 11 shows the relative deviations of the GMT maxima of the 1-box and the 2-box model for the extrapolation scheme ETE (Eqs. (9) and (11)). In a certain regime the extrapolation delivers sufficiently accurate results, but not everywhere.
When utilizing the mapping scheme represented by Eqs. (9) and (10), the results look similar. The overall impression is that the mapping removes the bias. However, it does not deliver a universal correction as found for the direct intercomparison between PH99 and AOGCMs. Hence we cannot exclude the possibility that AOGCMs are easier to emulate, as they contain many more time scales than the 2-box model and their effects might in part cancel. While we observe a qualitative gain, Figure 11 reveals that there is still room for improvement. Accordingly, we further transform the ECS to enforce perfect matching for t1=100 yrs, ε=0.2; the results can be seen in Figure 12. The fit is much improved, such that a major fraction of (t1, ε) values lead to a relative error of <5%, and another large fraction to a relative error of <10%. As the standard deviation of annual GMT is between 0.1°C and 0.2°C and a typical application might be a cost-effectiveness analysis of the 2°C target, such errors might still seem tolerable. However, we observe structural problems for very small values of ε, which imply that the maximum is reached very late. In this case, the slow mode becomes more relevant, and hence the quality of the calibration deteriorates. The calibration is valid for a time horizon on the order of t1 to 2 t1, i.e. on the order of the time to peak forcing.
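The kind of scan over (t1, ε) described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure behind Figures 11 and 12: the forcing is taken as piecewise linear (slope k1 up to t1, slope −ε k1 afterwards), the 2-box model is represented by two first-order modes, and the 1-box model uses an assumed ECS reduction of 30% and θ = 10 yr in place of the mapping of Eqs. (9)-(11). The re-initialization at t0 is omitted, and all numerical values are hypothetical.

```python
import math

# Hypothetical illustration values (not the calibrated ones of the article):
F2X = 3.7                           # forcing for CO2 doubling, W m^-2
ECS_2B = 3.0                        # ECS of the 2-box model, K
A_FAST, A_SLOW = 0.6, 0.4           # mode weights, assumed to sum to 1
TAU_FAST, TAU_SLOW = 4.0, 200.0     # mode time scales, yr
THETA = 10.0                        # assumed 1-box time scale, yr
ECS_EFF = 0.7 * ECS_2B              # assumed effective ECS (reduced by 30%)
K1 = 0.04                           # pre-peak forcing slope, W m^-2 yr^-1

def mode(t, t1, eps, tau):
    """First-order mode (tau * dT/dt = F - T, T(0) = 0), in forcing units,
    for F = K1*t up to t1 and F = K1*t1 - eps*K1*(t - t1) afterwards."""
    def ramp(s, k):
        return k * (s - tau * (1.0 - math.exp(-s / tau)))
    if t <= t1:
        return ramp(t, K1)
    s = t - t1
    decay = math.exp(-s / tau)
    # post-peak ramp + constant forcing K1*t1 + decaying state reached at t1
    return ramp(s, -eps * K1) + K1 * t1 * (1.0 - decay) + ramp(t1, K1) * decay

def peak(temperature, t1, eps, dt=0.5):
    """Maximum temperature between t = 0 and the time the forcing returns to zero."""
    horizon = t1 * (1.0 + 1.0 / eps)
    return max(temperature(dt * i) for i in range(int(horizon / dt) + 1))

def relative_deviation(t1, eps):
    """Relative deviation of the 1-box peak from the 2-box peak."""
    def two_box(t):
        return ECS_2B / F2X * (A_FAST * mode(t, t1, eps, TAU_FAST)
                               + A_SLOW * mode(t, t1, eps, TAU_SLOW))
    def one_box(t):
        return ECS_EFF / F2X * mode(t, t1, eps, THETA)
    return peak(one_box, t1, eps) / peak(two_box, t1, eps) - 1.0

grid = {(t1, eps): relative_deviation(t1, eps)
        for t1 in (50.0, 100.0, 150.0) for eps in (0.1, 0.2, 0.4)}
for (t1, eps), dev in sorted(grid.items()):
    print(f"t1 = {t1:5.0f} yr, eps = {eps:.1f}: relative deviation = {dev:+.3f}")
```

Replacing the fixed ECS_EFF and THETA by values obtained from a mapping would turn this into a test of that mapping; the fixed values here only serve to make the sketch runnable.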

Discussion
The previous section offers a key mechanism to explain why, for given ECS and TCR, GMT responses generated by PH99 in response to peak-and-decline forcing scenarios are biased towards higher temperatures. How does this relate to the observation that PH99 tends to underestimate the effect of greenhouse gas emissions (van Vuuren et al.  Fig. 3)? Here FUND, based on PH99, displays a GMT lower than that of MAGICC-4 by more than 0.8 K at certain times during the most transient phase, although both models share the same ECS. This can be explained by the lack of time [...] underestimates GMT is hence a strong function of the functional shape of forcing. Our article highlights the effects of a naively calibrated PH99 on mitigation scenarios. However, one should not forget about potential additional mechanisms. Firstly, the statistical errors in determining AOGCMs' ECS, TCR and Q2 may lead, mediated through the nonlinear mapping onto PH99's parameters, to an overall bias in PH99's GMT. Furthermore, diagnosing the total radiative forcing active in an AOGCM is a complex undertaking (see e.g. Meinshausen et al., 2011a, for a discussion). A bias to the high end here would also result in inaccurately large GMT responses by PH99.
However, in the context of this article, we contend that the information loss when moving from a 2-box to a 1-box model is the key source of the observed discrepancy; last but not least, we find Figure 10 compelling in this regard. Accepting the latter interpretation raises a key question: Can PH99 be seen as a 'physical model' and if so, what are the implications for users? It is readily apparent that a 1-box model cannot mimic a 2-box model characterized by a marked time-scale separation for all forcings at all times. However, it is equally clear that the simplest temperature equation is in fact the one that treats the ocean as a single box. It would still explain warming with forcing in a quasi-linear manner, though with some delay. If we are willing to accept that the calibration of PH99 is time-horizon-specific, then PH99 still holds some semi-physical meaning. If, however, this is seen as unacceptable, then we would have to recognize that PH99 is more an efficient emulator than a physical model. In this context we would like to recall that virtually every model has a limited range of validity, and as such, PH99 is no different from most other models.
When investigating the differences between the 1-box and 2-box models, our research also suggests that within the class of peak-and-decline scenarios PH99 provides a good emulation (accurate to within 0.2 K for a generic AOGCM setting such as ECS=4 K, a peaking of forcing between 2020 and 2100, and a ratio of slopes of pre- and post-peaking forcing of 0.1 to 0.4). For the AOGCM/PH99 intercomparison, PH99 performs even better: for RCP2.6, 4.5, 6.0 (~0.1 K) and, to a lesser extent, 8.5. What are the ramifications of our findings for previous publications based on PH99? Those authors who claimed to have worked with PH99 in conjunction with ECS=3°C have effectively worked with a more complex model in conjunction with ECS≈4°C for the centennial time horizon. Much of the work done based on MIND in conjunction with PH99 and the log-normal distribution for ECS by Wigley & Raper (2001) has essentially been based on a log-normal distribution shifted to larger ECS values. The 5%, 50% and 95% quantiles of the log-normal distribution by Wigley & Raper (2001) are 1.2 K, 2.6 K and 5.8 K, respectively. When interpreting these values as PH99 values, as they have in fact been utilized in PH99 for the MIND model since Lorenz et al. (2012), one could ask, as a rough estimate, what the corresponding effective ECS values of a more complex model would be according to our Figure 7. The respective values are 1.2 K, 3.6 K and 9.0 K. From Figure 13, which reflects IPCC AR5's synopsis of current knowledge regarding ECS (Bindoff et al., 2013), we can see that these are still in line with the range spanned by instrumental studies. Hence the results obtained by PH99 in conjunction with the distribution by Wigley & Raper (2001) are not erroneous, but simply need to be re-interpreted as rather high-end representatives within the collection of ranges as seen in IPCC AR5.
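As a small consistency check on the quantiles quoted above, the following sketch recovers the parameters of a log-normal distribution from the 5% and 95% quantiles alone and compares the implied median with the quoted 2.6 K. Fitting from two quantiles is an assumption made here for illustration; it is not necessarily how the distribution was originally specified. The ratios in the last step simply restate the re-interpreted values quoted in the text.

```python
import math
from statistics import NormalDist

# Quantiles of the ECS distribution as quoted in the text (in K).
Q05, Q50, Q95 = 1.2, 2.6, 5.8

# For a log-normal variable, ln(ECS) is normal with mean mu and std sigma,
# so the 5% and 95% quantiles sit symmetrically around mu in log space.
z95 = NormalDist().inv_cdf(0.95)
mu = 0.5 * (math.log(Q05) + math.log(Q95))
sigma = (math.log(Q95) - math.log(Q05)) / (2.0 * z95)
implied_median = math.exp(mu)

print(f"sigma of ln(ECS): {sigma:.3f}")
print(f"implied median:   {implied_median:.2f} K (quoted: {Q50} K)")

# Ratio of the re-interpreted values (1.2, 3.6, 9.0 K) to the original quantiles.
reinterpreted = (1.2, 3.6, 9.0)
ratios = [new / old for new, old in zip(reinterpreted, (Q05, Q50, Q95))]
print("scaling per quantile:", [round(r, 2) for r in ratios])
```

The implied median agrees with the quoted value to within rounding, and the per-quantile ratios show that the re-interpretation stretches the upper tail more strongly than the centre of the distribution.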
For future applications we can conclude that PH99 must be applied and interpreted with greater care than in the past, utilizing transformed values for ECS and TCR, if it is not to be replaced by at least a 2-box model as suggested by Geoffroy et al. [...] under uncertainty and anticipated future learning. As an illustration, execution of the MIND model currently demands between hours and days for 20 different values of climate sensitivity in conjunction with one learning step (E. Roshan, pers. comm.).
The execution time needed will grow exponentially with the number of learning steps and at least linearly with the number of state variables influenced by uncertainty. For endogenous learning in a recursive design, computation time scales factorially with the numerical resolution per state variable. The change from a 1-box to a 2-box model might hence imply an execution time that is an order of magnitude larger (C. Traeger, pers. comm. in conjunction with Traeger (2014)). So a 1-box model will remain an attractive alternative in numerical applications addressing decision-making under anticipated future learning. Users who would like to go down that road might, however, also consider the augmented 1-box model by Traeger (2014) as an alternative to PH99, employing an additional exogenous forcing of that single box to somewhat emulate two boxes.

Summary and Conclusion
We utilize recent data on total radiative forcing (Forster et al., 2013) from 14 state-of-the-art CMIP5 Atmosphere-Ocean General Circulation Models (AOGCMs) in order to test the validity of the 1-box climate module by Petschel-Held (1999, 'PH99') for scenarios approximately compatible with the 2°C target. PH99 is currently utilized within the integrated assessment models FUND, MIND and PAGE. We find that when prescribing the equilibrium climate sensitivity (ECS) and transient climate response (TCR) of these AOGCMs to the emulator PH99, global mean temperature (GMT) is generically projected 0.5 K higher. In contrast, by directly fitting PH99 to the RCP2.6 time series and validating with the RCP4.5 and RCP6.0 series, we find that PH99 can emulate AOGCMs to a degree of accuracy better than 0.1 K. Even for RCP8.5 the error is on the same order of magnitude, although somewhat larger (up to 0.2 K). We numerically demonstrate that PH99 can be used to emulate AOGCMs very well (accurate to within 0.1 K on average) within centennial-scale integrated assessment of the 2°C target, provided its ECS and TCR are re-interpreted as effective values and mapped from original ECS and TCR values. We suggest such a mapping.
Furthermore, we explain the observed discrepancies and the need to reduce PH99's ECS compared to the AOGCM's ECS as being due to the information loss produced by approximating a 2-box-based energy balance model with a 1-box-based model (assuming that a 2-box model mimics an AOGCM better than a 1-box model). The key point is that PH99 has a fundamentally different response shape to an AOGCM and hence ECS alone does not allow one to easily move between the two. The transformation we propose adjusts PH99's ECS, sacrificing agreement in the long-term response in order to gain agreement in the centennial response (which is useful given that it is more often than not the timescale of interest).
In fact the slow mode of the 2-box model is so slow that in a climate-policy-relevant context it can unfold only to a relatively small extent; hence for practical purposes the 2-box model's ECS cannot fully develop. Accordingly, adjusting the ECS to lower values also proves to be compatible with reducing PH99's response time. When comparing PH99 and AOGCMs, the match is even better, a phenomenon for which the explanation is beyond the scope of this article.
Hence older work based on PH99, executed within FUND, MIND and PAGE, may need to be re-interpreted in the sense that a response had been sampled which is higher than that of the corresponding AOGCM. This effect, in turn, proves equivalent to utilizing higher ECS values in the more complex model. Even when having dealt with distributions of ECS as for the MIND model, ECS values re-interpreted in that sense are still within the range outlined by IPCC AR5 (see Figure 13). We see this 're-interpretation' as a mere numerical fix. In terms of the underlying physics, we stress that using ECS alone to characterise climate response on a timescale of a few hundred years is fundamentally flawed, given that ECS takes on the order of a thousand years to emerge.
For future work, we propose the following steps: (i) By comparison with more sophisticated, multi-box climate modules it should be tested again whether the effect of a transient climate sensitivity (and TCR) alone could explain our observed PH99-AOGCM discrepancy; (ii) Future discussions with the AOGCM community should illuminate to what extent the further explanations we suggested might also apply, thereby potentially reducing the need to correct for PH99; (iii) An AOGCM- and scenario-class-independent, yet centennial time-scale-specific two-dimensional mapping from ECS/TCR onto effective ECS/TCR, designed for PH99, should be derived in conjunction with two-dimensional distributions inferred from observations as done in Frame et al. (2005). The IAM community could then be offered both options for emulation: the one presented here, trained by AOGCMs, and one based on observational data and mediated by more complex SCMs.
In summary, PH99 could continue to be used as the most parsimonious emulator of AOGCMs, and is especially efficient for decision-making under climate response uncertainty. However, its calibration proves to be much more involved than previously assumed. Future users should carefully consider whether they actually want to use PH99, or whether they prefer a less parsimonious solution.

Appendix 1: An Analytic Expression of TCR in PH99
We rearrange Eq. (1) as
(A1)
TCR is defined as the temperature change in response to a 1%/yr increase in CO2 concentration, starting from preindustrial conditions. Hence the concentration c, expressed in units of the pre-industrial concentration, reads
c(t) = exp(γ t) (A2)
with γ denoting the above rate of change. As Eq. (A1) represents a linear ordinary differential equation with constant coefficients, and the initial temperature anomaly is to vanish, its solution reads
(A3)
Temperature should be evaluated at t2 when the concentration is doubled. t2 is determined by c(t2)=2 ⇒ t2=ln2/γ. From this and Eq. (A3) we conclude Eq. (3). (In fact we find the same result using an expression provided in Andrews & Allen, 2008, when we plug in our expression for t2 into theirs, which is phrased in terms of ECS.)
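The steps above can be checked numerically with a minimal sketch. It assumes that the 1-box model takes the form dT/dt = µ ln c − α T, with c the concentration in units of its pre-industrial value, so that ECS = µ ln2 / α; this form, the closed-form solution derived from it below, and the parameter values are assumptions made for illustration and should be checked against Eqs. (1) and (3) of the main text.

```python
import math

# Hypothetical illustration values (not calibrated to any AOGCM):
ALPHA = 0.02                          # inverse response time, 1/yr
ECS = 3.0                             # equilibrium climate sensitivity, K
MU = ALPHA * ECS / math.log(2.0)      # follows from ECS = MU * ln2 / ALPHA
GAMMA = math.log(1.01)                # 1 %/yr growth: c(t) = exp(GAMMA * t)

def tcr_analytic(alpha, mu, gamma):
    """T(t2) for dT/dt = mu*gamma*t - alpha*T with T(0) = 0 and t2 = ln2/gamma."""
    t2 = math.log(2.0) / gamma
    return mu * gamma / alpha * (t2 - (1.0 - math.exp(-alpha * t2)) / alpha)

def tcr_numeric(alpha, mu, gamma, steps=20000):
    """Same quantity from a classical fourth-order Runge-Kutta integration."""
    t2 = math.log(2.0) / gamma
    dt = t2 / steps
    def rhs(t, temp):
        return mu * gamma * t - alpha * temp
    temp = 0.0
    for n in range(steps):
        t = n * dt
        s1 = rhs(t, temp)
        s2 = rhs(t + 0.5 * dt, temp + 0.5 * dt * s1)
        s3 = rhs(t + 0.5 * dt, temp + 0.5 * dt * s2)
        s4 = rhs(t + dt, temp + dt * s3)
        temp += dt / 6.0 * (s1 + 2.0 * s2 + 2.0 * s3 + s4)
    return temp

tcr_a = tcr_analytic(ALPHA, MU, GAMMA)
tcr_n = tcr_numeric(ALPHA, MU, GAMMA)
print(f"TCR analytic: {tcr_a:.4f} K, numeric: {tcr_n:.4f} K, ECS: {ECS:.1f} K")
```

Under the assumed model form, the closed-form value and the numerical integration agree, and the resulting TCR stays below the ECS because the temperature lags the forcing.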

Appendix 2: Further Analysis on Calibration and Validation
As further validation of PH99 calibrated to RCP2.6, Figure 14 shows the GMT trajectories of the AOGCMs for the RCP6.0 scenario, contrasted with the respective PH99-generated ones, where α and µ are fixed to their values as determined in step two. MTDs are shown in the third column of Table 3. Models are missing owing to a lack of either temperature trajectories or total forcing for the AOGCM. Notice that the first, second, and fourth columns are exactly the numbers related to Figure 2. The results confirm that the climate module is trained well enough in the second step that it can appropriately mimic the actual temperatures (accurate to within 0.1 K) for RCP6.0. As shown, the average value of MTD is about 0.06 K for RCP6.0. Column 5 and the following columns of Table 3 show MTDs for the cases in which PH99 is calibrated to one of the other RCP scenarios and validated against the remaining ones.

Appendix 3: Derivation of Eqs. (16)-(18)
We start by rewriting Eq. (14) in such a way that it is consistently decomposed into the contributions from the two modes i ∈ {f, s} (for 'fast' and 'slow' mode, respectively).
(A4)
One could derive Eq. (16) from an intuitive perspective by noticing that for any of the modes i, its contribution to the temperature response would consist of an equilibrium response, delayed by τi, and a summand of exponential decay which would ensure continuity with respect to the initial condition. This very principle can be followed again for the time horizon beyond t1.
However, for those readers who would like to see a more formal derivation, we provide the following ansatz: For t>t1, we decompose T2B into three contributions, according to the superposition principle for linear differential equations:
1. T1, induced by a forcing k2 (t-t1) with T1(t1)=0. This contribution can be treated analogously to T2B(0<t<t1) when noticing the replacements k1→k2, t→t-t1. From Eq. (A4) we infer
(A5)
2. T2, induced by a constant forcing k1 t1 with T2(t1)=0. This problem has also been solved by Geoffroy et al. (2013) in terms of their Eq. (9), which we rewrite in our notation:
T2(t ≥ t1) = k1 t1 [...] (A6)
3. T3 as the decaying initial condition at t = t1. For reasons of continuity, this initial condition is identical to the terminal condition according to Eq. (A4). Hence,
T3(t ≥ t1) = k1 [...] λ2B [...] (A7)
When we add these three components, we obtain
(A8)
Allowing for the limit τf → 0 and noticing that k2 = −ε k1, we verify Eq. (16) by a summand-by-summand comparison.
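The superposition argument can be verified numerically with a minimal sketch. It assumes that the 2-box temperature is a weighted sum of two first-order modes and that the forcing is k1 t up to t1 and k1 t1 + k2 (t − t1) afterwards, with k2 = −ε k1; the closed-form contributions below follow from that assumption and are not a transcription of Eqs. (A5)-(A8). All parameter values and the names used are hypothetical.

```python
import math

# Hypothetical illustration values:
LAMBDA = 1.2                        # feedback parameter, W m^-2 K^-1
WEIGHT = {"f": 0.6, "s": 0.4}       # mode weights, assumed to sum to 1
TAU = {"f": 4.0, "s": 200.0}        # mode time scales, yr
K1, EPS, T1 = 0.04, 0.2, 100.0      # pre-peak slope, slope ratio, peak time
K2 = -EPS * K1                      # post-peak slope

def forcing(t):
    return K1 * t if t <= T1 else K1 * T1 + K2 * (t - T1)

def ramp(s, k, tau):
    """Mode response to a ramp forcing k*s started from zero temperature."""
    return k / LAMBDA * (s - tau * (1.0 - math.exp(-s / tau)))

def t2b_superposed(t):
    """2-box temperature for t >= T1 as the sum of three contributions."""
    total = 0.0
    for i in WEIGHT:
        tau, s = TAU[i], t - T1
        decay = math.exp(-s / tau)
        part1 = ramp(s, K2, tau)                    # ramp k2*(t - t1), zero start
        part2 = K1 * T1 / LAMBDA * (1.0 - decay)    # constant forcing k1*t1
        part3 = ramp(T1, K1, tau) * decay           # decaying state reached at t1
        total += WEIGHT[i] * (part1 + part2 + part3)
    return total

def t2b_numeric(t_end, dt=0.01):
    """Reference solution: Runge-Kutta integration of each mode from t = 0."""
    state = {i: 0.0 for i in WEIGHT}
    for n in range(round(t_end / dt)):
        t = n * dt
        for i in WEIGHT:
            def rhs(time, temp, tau=TAU[i]):
                return (forcing(time) / LAMBDA - temp) / tau
            x = state[i]
            s1 = rhs(t, x)
            s2 = rhs(t + 0.5 * dt, x + 0.5 * dt * s1)
            s3 = rhs(t + 0.5 * dt, x + 0.5 * dt * s2)
            s4 = rhs(t + dt, x + dt * s3)
            state[i] = x + dt / 6.0 * (s1 + 2.0 * s2 + 2.0 * s3 + s4)
    return sum(WEIGHT[i] * state[i] for i in WEIGHT)

closed_form = t2b_superposed(150.0)
reference = t2b_numeric(150.0)
print(f"superposed: {closed_form:.6f} K, integrated: {reference:.6f} K")
```

Agreement between the two numbers indicates that the three contributions indeed add up to the full solution beyond t1 under the assumptions stated.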
Allowing for τf = τs = τ (i.e. simulating a 1-box setting by a 2-box approach), we obtain Eq. (17) from Eq. (A4) and Eq. (18) from Eq. (A8).

[...] where PH99 is calibrated to the RCP2.6 scenario.

[...] Table 2 are related to ECS and TCR in Table 1. Using linear (yellow bars), quadratic (light green bars), and cubic functions (dark green bars), α and µ are related to ECS, with the outlier excluded in the linear case. Using linear fits, ECS and TCR are related to ECS (blue bars). Using linear fits, ECS and TCR are related to ECS and TCR respectively (light blue bars). The dark blue bars show the deviations for RCP2.6 when α and µ are from Table 1 and not fitted (the same as Fig. 2). The orange bars indicate MTD using Lorenz's curve.

[...] Table 2 to ECS in Table 1. Notice that in the linear case the model GISS_E2_R (the upper left sample), as an outlier, is excluded.

Figure 7: Inferred effective TCR [K] vs. AOGCMs' TCR [K] (a), inferred effective ECS [K] vs. AOGCMs' ECS [K] (b), and inferred effective TCR [K] vs. AOGCMs' ECS [K] (c). While the TCRs differ by less than 0.2 K, the ECSs differ by up to 2 K. This opens the door for a discussion as to whether PH99 should be calibrated using scenario-class-adjusted effectively lower ECS