Evaluating Climate Emulation: Unit Testing of Simple Climate Models

Simple climate models (SCMs) are numerical representations of the Earth’s gas cycles and climate system. SCMs are easy to use and computationally inexpensive, making them an ideal tool in both scientific and decision-making contexts (e.g., complex climate model emulation; parameter estimation experiments; climate metric calculations; and probabilistic analyses). Despite their prolific use, the fundamental responses of SCMs are often not directly characterized. In this study, we use unit tests of three chemical species (CO2, CH4, and BC) to understand the fundamental gas cycle and climate system responses of several SCMs (Hector v2.0, MAGICC 5.3, MAGICC 6.0, FAIR v1.0, and AR5-IR). We find that while idealized SCMs are widely used, they fail to capture important global mean climate response features, which can produce biased temperature results. Comprehensive SCMs, which have non-linear forcing and physically-based carbon cycle representations, show improved responses compared to idealized SCMs. Even some comprehensive SCMs fail to capture response timescales of more complex models under BC or CO2 forcing perturbations. These results suggest where improvements should be made to SCMs. Further, we provide a set of fundamental tests that we recommend as a standard validation suite for any SCM. Unit tests allow users to understand differences in model responses and the impact of model selection on results.

SCMs use even fewer equations, which do not necessarily correspond to individual physical processes, to parametrically represent the climate system (Millar et al., 2017).

SCMs are widely used in scientific and decision-making contexts largely because of their advantageous features, including their ease of use and low computational cost. In particular, SCMs are traditionally used within human-Earth system models. These models couple the climate system with representations of the dynamics within the human system (e.g., energy systems and land-use changes) (Hartin et al., 2015; Ortiz and Markandya, 2009; Schneider and Thompson, 2000; Strassmann and Joos, 2018) and are used to assess global forcing or temperature targets (e.g., Representative Concentration Pathways (van Vuuren et al., 2011b), Shared Socioeconomic Pathways (Moss et al., 2010)). Several studies investigated potential sources of human-Earth system model uncertainty by exploring the climate components driving the models (Calel and Stainforth, 2017; Harmsen et al., 2015; van Vuuren et al., 2008, 2011a). Van Vuuren et al. (2011a) concluded that in most cases the results from human-Earth system models and SCMs were similar to those of the more complex, coupled Earth System Models (ESMs). The authors further noted that differences in SCM results can have implications for decision makers informed by such results, illustrating the need for improvements in uncertainty analysis (e.g., carbon cycle feedbacks or inertia in climate response). Harmsen et al. (2015) extended van Vuuren's analysis to investigate emission reduction scenarios by including non-CO2 radiative forcing. The authors concluded that many models may underestimate forcing differences after applying emission reduction scenarios, due to the omission of important short-lived climate forcers, such as black carbon (BC).
Few studies utilize idealized SCMs in human-Earth system models because of their inability to represent nonlinear forcings, such as air-sea exchanges (Khodayari et al., 2013) or ocean chemistry (Hooss et al., 2001; Tanaka and Kriegler, 2007). With simple extensions of the carbon cycle (e.g., ocean carbonate chemistry), both Hooss et al. (2001) and Tanaka and Kriegler (2007) found improved responses from their respective impulse response models, applicable when coupling to human-Earth system models.
Arctic temperature response to regional short-lived climate forcer emissions (e.g., BC), and compared these responses to more complex models.
Climate indicators, such as transient climate response (TCR) (Allen et al., 2018; Millar et al., 2017), are also calculated using SCMs. TCR measures the climate response to a 1% yr⁻¹ increase in CO2 concentration until CO2 doubles relative to the pre-industrial level. TCR is useful for understanding the climate response on shorter time scales, as CO2 concentration doubling takes place in about 70 years, a time frame relevant for many planning decisions (Flato et al., 2013). Used in combination with TCR, the equilibrium climate sensitivity (ECS) can also be used to attribute the fraction of observed warming to anthropogenic influences, called the realized warming fraction (RWF). TCR and ECS have been investigated within a global climate-calibrated impulse-response model to show the implications of these values for future climate projections, specifically by looking at the RWF.
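The roughly 70-year doubling time follows from compound growth at 1% per year. The short Python sketch below reproduces that arithmetic and also computes an RWF, which is assumed here to be the ratio TCR/ECS (a commonly used definition that this text does not spell out). The TCR and ECS values are hypothetical placeholders, not results from this study.

```python
import math

def doubling_time(growth_rate_per_year):
    """Years needed for a quantity growing at a fixed fractional rate to double."""
    return math.log(2.0) / math.log(1.0 + growth_rate_per_year)

def realized_warming_fraction(tcr, ecs):
    """RWF, assumed here to be the ratio TCR / ECS."""
    return tcr / ecs

# 1% per year increase in CO2 concentration, as in the TCR experiment.
years_to_double = doubling_time(0.01)

# Hypothetical TCR and ECS values in K, chosen only for illustration.
rwf = realized_warming_fraction(tcr=1.8, ecs=3.0)

print(f"doubling time: {years_to_double:.1f} yr, RWF: {rwf:.2f}")
```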
Sums of exponentials are also commonly used to calculate other climate metrics, such as global warming potential (GWP) and global temperature potential (GTP) (Aamaas et al., 2013; Berntsen and Fuglestvedt, 2008; Fuglestvedt et al., 2010; Peters et al., 2011; Sarofim and Giordano, 2018). Stylized SCMs, however, often do not account for carbon cycle feedbacks, important for more realistic representations of climate. Both Millar et al. (2017) and Gasser et al. (2017) investigated the effects of adding carbon cycle feedbacks on these metrics produced with stylized SCMs, and found that accounting for feedbacks improved model responses (at least modestly, Gasser et al., 2017).
Despite their wide use, the fundamental responses of SCMs have not been fully characterized (Thompson, 2018).

Unit tests.
We use impulse-response tests, a type of unit test, to address this gap, as recently suggested by the US National Academies (National Academies of Sciences, Engineering, and Medicine, 2016). An impulse-response test characterizes the SCMs' climate and gas-cycle response to a forcing or emission impulse (Good et al., 2011; Joos et al., 2013).
Here, we take a comprehensive approach, evaluating several SCMs using forcing and emission impulse tests to understand the response of the climate system and gas cycles in the models. We use three main impulse tests: (a) a concentration impulse of CO2, (b) emissions impulses of BC, CH4, or CO2, and (c) a 4×CO2 step increase in CO2 concentration. We carry out these experiments by instantaneously increasing emissions or forcing values in 2015 to avoid the model base years of our SCMs (see SI1).

Background.
Our unit tests are conducted against a changing CO2 concentration background using the Representative Concentration Pathway (RCP) 4.5 scenario (Thomson et al., 2011). For each unit test, therefore, we run a reference scenario in the SCMs, followed by each perturbation case described above. We report the response, which is obtained by subtracting the reference from the perturbation results. The changing CO2 concentration background is more realistic and also reveals biases not otherwise apparent under constant concentration conditions, for example, in SCMs insensitive to changing background concentrations. Further, for emissions impulses this methodology is more readily implemented as a standard unit test (see SI1), as we recommend below.

Earth Syst. Dynam. Discuss., https://doi.org/10.5194/esd-2018-63. Manuscript under review for journal Earth Syst. Dynam. Discussion started: 27 September 2018. © Author(s) 2018. CC BY 4.0 License.
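The reference-minus-perturbation procedure can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: `toy_scm` is a hypothetical one-box energy balance model, the linear forcing ramp is a stand-in for a changing background scenario, and none of the parameter values come from the SCMs tested here.

```python
import numpy as np

def toy_scm(forcing, dt=1.0, heat_capacity=8.0, feedback=1.2):
    """Hypothetical one-box model: C dT/dt = F - lambda * T (forward Euler)."""
    temperature = np.zeros(len(forcing))
    for i in range(1, len(forcing)):
        tendency = (forcing[i - 1] - feedback * temperature[i - 1]) / heat_capacity
        temperature[i] = temperature[i - 1] + dt * tendency
    return temperature

years = np.arange(2000, 2301)

# Stand-in for a changing background forcing pathway (not RCP4.5 data).
reference_forcing = 0.03 * (years - years[0])

# Perturbation case: a one-year forcing impulse of 1 W m-2 in 2015.
perturbed_forcing = reference_forcing.copy()
perturbed_forcing[years == 2015] += 1.0

# The reported response is the perturbation run minus the reference run.
response = toy_scm(perturbed_forcing) - toy_scm(reference_forcing)
```

Because this toy model is linear, its response does not depend on the background pathway. The state-dependent SCMs discussed in the text need not behave that way, which is why the background is specified as part of the test.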

Models.
Three comprehensive SCMs are used in this study (SI2): Hector v2.0 (Hartin et al., 2015; Kriegler, 2005), MAGICC 5.3 BC-OC (Smith and Bond, 2014), and MAGICC 6.0 (Meinshausen et al., 2011). The models were selected based on their availability, use in the literature, and their applicability to decision making. We also include two idealized SCMs that employ sums of exponentials to represent the climate or gas-cycle responses, a general approach often used in the literature (Aamaas et al., 2013; Fuglestvedt et al., 2003) and referred to as impulse response functions (IRFs). IRFs linearly approximate the response of a system to a given forcing (Hooss et al., 2001). A widely used version tested here is the impulse response (IR) model used in the Intergovernmental Panel on Climate Change Fifth Assessment Report (Myhre et al., 2013), referred to as AR5-IR. Additionally, we test version 1.0 of the Finite Amplitude IR (FAIR) model, an extension of AR5-IR that includes a representation of carbon cycle feedbacks and nonlinear forcing (Millar et al., 2017).
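To make the sum-of-exponentials idea concrete, the following Python sketch builds a generic two-term IRF and applies it to a forcing series by discrete convolution, which is what a linear approximation of the response amounts to. The coefficients and timescales are placeholder values chosen for illustration; they are not the AR5-IR or FAIR parameters.

```python
import numpy as np

def exponential_irf(t, coefficients, timescales):
    """Sum-of-exponentials IRF: R(t) = sum_i (c_i / d_i) * exp(-t / d_i)."""
    t = np.asarray(t, dtype=float)[:, None]
    c = np.asarray(coefficients, dtype=float)
    d = np.asarray(timescales, dtype=float)
    return np.sum((c / d) * np.exp(-t / d), axis=1)

def convolve_response(forcing, irf, dt=1.0):
    """Linear response to an arbitrary forcing series via discrete convolution."""
    return np.convolve(forcing, irf)[: len(forcing)] * dt

t = np.arange(0, 300)

# Placeholder coefficients (K per W m-2) and timescales (years), not from AR5.
irf = exponential_irf(t, coefficients=[0.6, 0.4], timescales=[8.0, 400.0])

# Temperature response to a sustained 1 W m-2 forcing step.
step_forcing = np.ones(len(t))
temperature = convolve_response(step_forcing, irf)
```

Doubling the forcing doubles the response exactly in this formulation, which is the linearity that limits idealized SCMs under large perturbations.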

Results
We highlight differences in model responses to a suite of unit tests to support an informed model selection (see Table 1). We begin by testing the fundamental dynamics of the temperature response to a well-mixed greenhouse gas forcing impulse by perturbing CO2 concentrations (Fig. 1), bypassing the carbon cycle (if present).
We report both time-series responses (Fig. 1a) and time-integrated responses (Fig. 1b; SI9). Integrated responses form the basis of commonly used metrics, such as GWP and GTP (Fuglestvedt et al., 2010).
The idealized SCMs show varied responses to a CO2 concentration impulse. AR5-IR has a much stronger response than the comprehensive SCMs; its integrated response is 6% larger than that of the comprehensive SCMs 20 years after the pulse, increasing to 30% by the end of the model runs. This large difference is due to the absence of feedbacks and nonlinearities in the AR5-IR model. FAIR represents such nonlinearities, responding similarly to the comprehensive SCMs in the near term, but has a 7% weaker response 285 years after the impulse. The approximations used to represent the carbon cycle and non-linear forcing might account for this, but it is unclear from these results.
horizons. AR5-IR, an idealized SCM, responds 11% more strongly than the comprehensive SCM average 20 years after the pulse, increasing to a 17% difference 285 years after the impulse.
We complete the model response sequence by examining the temperature response from emissions perturbations, which is conceptually the combination of the temperature response from a concentration impulse (Fig. 1) and the forcing response from an emissions impulse (Fig. 2). Similarities in the comprehensive SCM responses in Figs. 1 and 2 are reflected in the <5% difference in the temperature response from a CO2 emissions perturbation 20 years after the impulse (Fig. 3b). AR5-IR responds 30% more strongly and FAIR <10% more weakly than the comprehensive SCM average 20 years after the perturbation (Fig. 3a). FAIR introduces a state-dependent carbon cycle representation
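The percentage comparisons of integrated responses quoted above can be illustrated with a short Python sketch. It assumes a simple trapezoidal integral from the pulse to a chosen horizon; the two response curves are hypothetical exponentials, not output from any of the SCMs tested, and the function names are illustrative only.

```python
import numpy as np

def integrated_response(time, response, horizon):
    """Trapezoidal time integral of a response from the pulse to a horizon."""
    time = np.asarray(time, dtype=float)
    response = np.asarray(response, dtype=float)
    mask = time <= horizon
    t, r = time[mask], response[mask]
    return float(np.sum(0.5 * (r[1:] + r[:-1]) * np.diff(t)))

def percent_difference(value, baseline):
    """Difference of value relative to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline

years_after_pulse = np.arange(0, 286, dtype=float)

# Hypothetical temperature responses (K) with different decay timescales.
baseline = 0.10 * np.exp(-years_after_pulse / 50.0)
candidate = 0.10 * np.exp(-years_after_pulse / 80.0)

differences = {}
for horizon in (20, 285):
    base = integrated_response(years_after_pulse, baseline, horizon)
    cand = integrated_response(years_after_pulse, candidate, horizon)
    differences[horizon] = percent_difference(cand, base)
    print(f"{horizon} yr horizon: {differences[horizon]:+.1f}% relative to baseline")
```

In this toy case the slower-decaying response diverges further from the baseline as the horizon lengthens, the same qualitative pattern as a difference that grows between 20 and 285 years.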

The airborne fraction is, therefore, higher in our results. Despite the difference in methodology, comparing the  SCMs and to a lesser extent, FAIR, offer an improved response compared to AR5-IR (Millar et al., 2017). 180 years after the pulse (Fig. 2b). As in the CO2 emissions perturbations, AR5-IR has a much stronger response (22%) to a CH4 emissions perturbation 20 years after the pulse, with no meaningful increase 50 years after the pulse (SI8).
Finally, we look at the models' temperature responses to aerosols by perturbing black carbon (BC) forcing (Fig. 3).
The BC response increases quickly in both MAGICC models compared to the other SCMs (SI9). Differences in these responses to a BC perturbation derive from model design. Both versions of MAGICC have differential and faster forcing responses over land, where most BC is located, compared to oceans, termed the geometrical effect (Meinshausen et al., 2011). This results in MAGICC responding faster than Hector v2.0, which does not differentiate forcing over land and ocean. Because AR5-IR represents the aerosol forcing as an exponential decay, the integrated temperature response is 20% stronger 20 years after the pulse compared to the other SCMs.

Due to the geometrical effect, we presume that the faster response in MAGICC is more realistic. However, models vary in the representations of aerosol effects (SI2). The greenhouse gas-like representation of aerosols in AR5-IR, for example, results in the unrealistically long response time scale found in this test. We do not explicitly conduct other aerosol perturbations (e.g., sulfate), but we would expect results showing similar responses.

BC has a unique set of atmospheric interactions, acting as an absorbing aerosol and causing inhomogeneous warming (Stjern et al., 2017). The response to a step in BC has been found to have a flat long-term temperature response (Sand et al., 2016). We find that comprehensive simple models respond over a much longer time scale than an ESM experiment investigating the climate response to BC (SI12). This is an indication that SCM responses to BC, in particular, should be reevaluated.

Responses to 4xCO2 Concentration Step.
Finally, we compare our SCMs with complex models using the abrupt 4xCO2 concentration experiment from Phase 5 of the Coupled Model Intercomparison Project (CMIP5) (Taylor et al., 2012) (see SI1 and SI3). We find that Hector, MAGICC 5.3, and FAIR have initially quicker responses to an abrupt 4xCO2 concentration increase (Fig. 4). This is also reflected in their long-term RWF, which is larger than that of most of the complex models (see SI9). Compared to the other SCMs, AR5-IR has a slower response to an abrupt 4xCO2 concentration increase, which does not substantially increase 25 years after the pulse, reflected in a lower RWF. Differences between the model responses to a finite pulse (Fig. 1) and a large concentration step (Fig. 4) demonstrate the expected bias in AR5-IR under larger perturbations. The insensitivity of idealized SCMs to changing background concentrations will bias results if used under realistic future pathways (Millar et al., 2017).
Compared to the other comprehensive SCMs, MAGICC 6.0 initially responds more strongly under a CO2 concentration impulse (Fig. 1). In the non-linear abrupt 4xCO2 concentration regime, MAGICC 6.0 responds more slowly, similar to the complex model responses, especially in the first 20 years after the pulse. MAGICC 6.0 appears to respond more reasonably under stronger forcing conditions than the other SCMs.
representations. Fundamental forcing tests, such as a 4xCO2 concentration step, show that some SCMs (Hector, MAGICC 5.3, and FAIR) have a faster warming rate in this strong forcing regime compared to more complex models.
However, comprehensive SCM responses are similar to more complex models under smaller, more realistic perturbations (Joos et al., 2013).

There are numerous benefits to using simplified models, but the selection of the model should be rooted in a clear understanding of the model responses (see Table 1). Our work illustrates the necessity of using fundamental unit tests to evaluate SCMs. Given that idealized SCMs are biased in their response patterns, more comprehensive SCMs could be used for many applications without compromising on accessibility or computational requirements.

Table column headings: Impulse; Species; Hector v2.0; MAGICC 5.3; MAGICC 6.0; FAIR v1.0; AR5-IR.

The views and opinions expressed in this paper are those of the authors alone.

All model input files generated for our experiments, and the resulting impulse response functions, are provided in the Supplementary Materials. The authors ask that any use of these data be attributed.