An investigation of weighting schemes suitable for incorporating large ensembles into multi-model ensembles

Multi-model ensembles can be used to estimate uncertainty in projections of regional climate, but this uncertainty often depends on the constituents of the ensemble. The dependence of uncertainty on ensemble composition is clear when single-model initial condition large ensembles (SMILEs) are included within a multi-model ensemble. SMILEs allow for the quantification of internal variability, a non-negligible component of uncertainty on regional scales, but may also serve to inappropriately narrow uncertainty by giving a single model many additional votes. In advance of the mixed multi-model and SMILE Coupled Model Intercomparison Project Phase 6 (CMIP6) ensemble, we investigate weighting approaches to incorporate 50 members of the Community Earth System Model (CESM1.2.2-LE), 50 members of the Canadian Earth System Model (CanESM2-LE), and 100 members of the MPI Grand Ensemble (MPI-GE) into an 88-member Coupled Model Intercomparison Project Phase 5 (CMIP5) ensemble. The weights assigned are based on ability to reproduce observed climate (performance) and scaled by a measure of redundancy (dependence). Surface air temperature (SAT) and sea level pressure (SLP) predictors are used to determine the weights, and relationships between present and future predictor behavior are discussed. The estimated residual thermodynamic trend is proposed as an alternative predictor to replace 50-year regional SAT trends, which are more susceptible to internal variability. Uncertainty in estimates of northern European winter and Mediterranean summer end-of-century warming is assessed in a CMIP5 and a combined SMILE–CMIP5 multi-model ensemble. Five different weighting strategies to account for the mix of initial condition (IC) ensemble members and individually represented models within the multi-model ensemble are considered.
Allowing all multi-model ensemble members to receive either equal weight or solely a performance weight (based on the root mean square error (RMSE) between members and observations over nine predictors) is shown to lead to uncertainty estimates that are dominated by the presence of SMILEs. A more suitable approach includes a dependence assumption, scaling either by 1/N, where N is the number of constituents representing a “model”, or by the same RMSE distance metric used to define model performance. SMILE contributions to the weighted ensemble are smallest (< 10 %) when a model is defined as an IC ensemble and increase slightly (< 20 %) when the definition of a model expands to include members from the same institution and/or development stream. SMILE contributions increase further when dependence is defined by RMSE (over nine predictors) amongst members because RMSEs between SMILE members can be as large as RMSEs between SMILE members and other models. We find that an alternative RMSE distance metric, derived from global SAT and hemispheric SLP climatology, is able to better identify IC members in general and SMILE members in particular as members of the same model. Further, more subtle dependencies associated with resolution differences and component similarities are also identified by the global predictor set.
Published by Copernicus Publications on behalf of the European Geosciences Union. 808 A. L. Merrifield et al.: An investigation of weighting schemes suitable for incorporating large ensembles

dependent is discarded, or through a weighting scheme, where information is scaled by degree of dependence. In this study, we evaluate whether a performance and independence weighting scheme (Lorenz et al., 2018; Brunner et al., 2019) can be used to include three SMILEs in a CMIP5 multi-model ensemble and provide a justifiably constrained estimate of European regional end-of-century warming uncertainty. Northern European winter and Mediterranean summer SAT changes between the 1990-2009 and 2080-2099 mean states are considered. We discuss details of the weighting method, including emergent predictor relationships and optimal parameter choices for attempting to comprehensively characterize member performance while separating independent information from information known to have common origin (SMILE members). We highlight a new metric, the estimated residual thermodynamic trend, which can be used as an alternative to trend-based metrics that do not optimally reflect a model's performance on regional scales. We compare how five different weighting strategies, based on different independence assumptions, constrain uncertainty in a CMIP5 multi-model ensemble with and without the SMILEs included. Weighted SMILE contributions in each CMIP5-SMILE "ALL" ensemble are explicitly computed. The five weighting strategies come from the continuum of assumptions that can arise in multi-model ensemble construction: (1) all members are independent and equally plausible (equal weighting), (2) some members are more realistic than others (performance weighting), (3, 4) members from the same model are dependent (1/N scaling, N being the number of IC members or modelling center contributions), and (5) all members are dependent to some degree (RMSE distance metric scaling).
For the last approach, we demonstrate that an RMSE dependence scaling that groups SMILE members and distinguishes them from other models can be obtained using large-scale, long-term SAT and sea level pressure (SLP) climatology fields. The SMILEs, CMIP5, and observational datasets used in the weightings are described in Section 2, while the weighting schemes are detailed in Section 3. The influence of SMILE inclusion on the weighting under different independence assumptions and the predictor set that identifies SMILE members as dependent entities based on RMSE distance are discussed in Section 4. To close, conclusions and discussion are presented in Section 5.

Data
The multi-model ensemble used in this study comprises members from the CMIP5 archive and three SMILEs: a 50-member ensemble generated using the Community Earth System Model version 1.2.2 (CESM1.2.2-LE), the 50-member Canadian Earth System Model version 2 large ensemble (CanESM2-LE), and the 100-member Max Planck Institute for Meteorology Grand Ensemble (MPI-GE). This combined CMIP5-SMILE ensemble is summarized in Table 1, which lists the name of each model and the members used. A similar CMIP5 multi-model ensemble was used in Lorenz et al. (2018) and Brunner et al. (2019) and features 88 members from 40 (named) model setups, including 13 initial condition ensembles ranging from 2 to 10 members. Additionally, for the GISS-E2-H and GISS-E2-R experiments, NASA GISS provides members from 3 physics-version ("p") setups that differ in atmospheric composition (AC) and aerosol indirect effects (AIE) (Miller et al., 2014). We treat the 3 setups as follows: p1 (prescribed AC and AIE) and p3 (prognostic AC and partial AIE) members are treated as 2-member IC ensembles and the p2 member (prognostic AC and AIE) is treated as a single-member representation (Table 1).
In Table 1, initial condition ensembles are indicated in italic and SMILEs are indicated in bold with a star. Horizontal lines denote modelling center/known development streams that are grouped as dependent entities under the fourth independence assumption we investigated.

The CESM1.2.2-LE used in this study was derived from a 4700-year CESM control simulation with constant preindustrial forcing generated at ETH Zürich (Sippel et al., 2019) and has a horizontal atmospheric resolution of 1.9° × 2.5° with 30 vertical levels (Hurrell et al., 2013). The preindustrial control run was branched at 20-year intervals, starting from the year 580, to create an ensemble with "macro" initial conditions, i.e., different coupled initial conditions picked from well-separated start dates (Stainforth et al., 2007; Hawkins et al., 2016).

Members of the macro initial condition ensemble were run from 1850-1940 driven by historical CMIP5 forcing (Meinshausen et al., 2011). At year 1940, each macro initial condition member was branched into four different realizations, each subject to an atmospheric temperature perturbation of $10^{-13}$ to create "micro" initial condition ensembles (Hawkins et al., 2016). From these micro initial condition ensembles, 50 members were selected for the CESM1.2.2-LE (specifically, 4 micro ensemble members from macro ensemble members 1 through 12 and 2 micro ensemble members from macro ensemble member 13).

The MPI-GE was generated using the low-resolution setup of the MPI Earth System Model (MPI-ESM1.1) (Giorgetta et al., 2013). The 100-member ensemble has macro initial conditions: a preindustrial control simulation was branched on the first of January for selected years between 1874 and 3524 to sample different states of a stationary and volcano-free 1850 climate (Maher et al., 2019). The MPI-GE uses ECHAM6.3 run in a T63L47 configuration (Stevens et al., 2013) as its atmospheric component for a horizontal resolution of approximately 1.8°.

The CanESM2-LE (Arora et al., 2011) was initiated from the 5 CanESM2 members contributed to CMIP5 (which are included in our CMIP5 basis multi-model ensemble). As with CESM1.2.2, the CanESM2 large ensemble has a combination of macro and micro initial conditions. Macro initial conditions were taken from year 1950 of the 5 original CanESM2 members.
Each was then branched 10 times with micro initial conditions (a random permutation to the seed used in the random number generator for cloud physics) to give a total of 50 members (Swart et al., 2018). The CanESM2-LE uses the CanAM4 atmosphere The NEU and MED regions used are the SREX regions defined in Seneviratne (2012 (Meinshausen et al., 2011). The multi-model CMIP5 ensemble (Fig. 1, blue) has a larger spread than the single-model SMILEs, demonstrating that model uncertainty does rise above well-defined estimates of internal variability in the two European regions and seasons considered. The combined macro-micro perturbation CESM1.2.2-LE (Fig. 1, red) has a larger ensemble spread than the CanESM2-LE (Fig. 1, yellow), but, on average, warms less by end-of-century. The MPI-GE (Fig. 1, green) (Compo et al., 2011; Slivinski et al., 2019). BEST was created to be an independent estimate of global temperature, obtained through spatiotemporal interpolation of in situ temperature measurements (Rohde et al., 2013).
The weighting can comprehensively account for observational uncertainty (Brunner et al., 2019), but for this study, we chose to use the average of two observational estimates in order to have a simple and straight-forward definition of climate within which the sensitivity of the weighting scheme can be interrogated. ERA-20C and NOAA-20C reanalyses were chosen because they provide temporally and spatially complete fields that extend back to 1950. Additionally, as reanalysis products are, after all, model-based, we chose a reanalysis product with both SLP and SAT available (ERA-20C), as well as SAT and SLP fields from different sources (NOAA-20C and BEST). We further used the SLP-SAT relationship to obtain the circulation-induced component of SAT, which is removed to obtain the estimated residual thermodynamic SAT trends (see Appendix A). Though all products are observational estimates, we henceforth refer to them as "observations" or "OBS" to distinguish them from members of the multi-model ensemble.

Weighting Schemes
The weighting strategies used to constrain uncertainty in this study are rooted in a combined performance and independence weighting metric developed by Knutti et al. (2017), following on the work of Sanderson et al. (2015a, b). Summarized in the subsections below, the five strategies considered arise from common assumptions surrounding plausibility and similarity made about constituents of multi-model ensembles. With the exception of the first strategy, which assigns each member an equal weight, the basic principle of the weighting is as follows: a member will receive a performance weight based on how closely it resembles observed climate (based on nine chosen predictors; detailed in the following section). That performance weight will then be divided by an independence scaling that represents whether (or to what degree) a member is identified as a "duplicate" of another member over the historical period. It is important to note that independence in this study is never determined by future behavior. Doing so would jeopardize the "agreement suggests robustness" paradigm by penalizing convergence. Rather, independence is either a model property decided upon beforehand or determined through RMSE distances between historical aspects of climate.

Equal Weighting
The first way in which the multi-model ensemble is weighted is by all members receiving a weight, $w_i^{I}$, of 1.
This equal weighting follows from the assumption that all multi-model ensemble members are independent and equally plausible and is sometimes referred to as a "model democracy" assumption (Knutti et al., 2010a;Knutti, 2010). In instances where SMILEs are incorporated into a multi-model ensemble, the equal weighting strategy is clearly flawed; 50-100 members from the same model is a clear voting advantage within the model democracy. However, equal weighting serves as a baseline handling of multi-model ensemble information against which other weighting strategies can be compared.

Performance Weighting
The second weighting strategy builds upon the first in that all members are still assumed to be independent, but some members are identified to be more realistic than others. Members are thus weighted ($w_i^{II}$) by a measure of performance, here based on the numerator of the Knutti et al. (2017) weighting function:
$$w_i^{II} = e^{-(D_i/\sigma_D)^2}$$
The term $D_i$ represents the RMSE distance between a multi-model ensemble member and observations; $w_i^{II}$ decreases exponentially as members increasingly differ from observations ($D_i \gg 0$). A shape parameter $\sigma_D$ dictates the width of the performance weight Gaussian, determining how far apart a member and observations must be for the member to be down-weighted. For a smaller value of $\sigma_D$, models are more rapidly down-weighted as they diverge from observed climate, which often results in a weighting where few models receive weights of meaningful magnitude. For a larger value of $\sigma_D$, models are not as strongly penalized for not resembling observations, which often results in a more even distribution of weights within the ensemble. Here, we select $\sigma_D$ to be 0.32 for the DJF NEU weighting and 0.4 for the JJA MED weighting (further discussion in Appendix B).
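The role of the shape parameter can be illustrated with a minimal sketch, assuming the Gaussian form exp(-(D_i/σ_D)²) for the performance weight. The distances in `D` are hypothetical values chosen for illustration, and the value 1.0 for σ_D is included only as a contrast to the two values quoted in the text.

```python
import numpy as np

def performance_weights(D, sigma_D):
    """Gaussian performance weight exp(-(D_i / sigma_D)^2) for each member."""
    D = np.asarray(D, dtype=float)
    return np.exp(-(D / sigma_D) ** 2)

# Hypothetical normalized RMSE distances to observations for four members.
D = np.array([0.2, 0.4, 0.6, 0.8])

# 0.32 and 0.4 are the sigma_D values quoted in the text; 1.0 is illustrative only.
for sigma_D in (0.32, 0.4, 1.0):
    w = performance_weights(D, sigma_D)
    # Normalize so the weights sum to 1 before comparing their spread.
    print(sigma_D, np.round(w / w.sum(), 3))
```

With the smaller σ_D, nearly all of the normalized weight goes to the member closest to observations, whereas the larger σ_D spreads weight more evenly across members.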

1/N scaling, IC members
The third weighting strategy extends the performance weighting by including an independence assumption, making it suitable for the combined CMIP5-SMILE ensemble we evaluate. Each "model" gets a unique weight. The independent entity, "model", is assumed to be determinable by name (as listed in Table 1), which renders members of IC ensembles within the multi-model ensemble (the 13 within the CMIP5 ensemble and the 3 SMILEs) dependent entities. To achieve the model weighting, models that are represented by one member receive their performance weight $w_i^{II}$. Models that are represented by IC members receive an average of the performance weights of their N constituents, $\frac{1}{N}\sum_{j=1}^{N} w_j^{II}$. That average performance weight, scaled by 1/N, is assigned to each IC member. Therefore, the weight each member receives, $w_i^{III}$, is:
$$w_i^{III} = \frac{1}{N}\left(\frac{1}{N}\sum_{j=1}^{N} w_j^{II}\right)$$
Each IC member is assigned the average performance weight of the IC ensemble (rather than its individually computed performance weight) to reflect the assumption that all IC members represent an equally likely outcome of the model. This choice accounts for the fact that, when computed by RMSE, performance weights differ between IC members due to internal variability.
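The 1/N scaling can be sketched as follows, under the assumption that each member carries a model label and that the weights are normalized to sum to 1 at the end. The function name, labels, and performance weights are hypothetical and chosen only to illustrate the bookkeeping.

```python
import numpy as np
from collections import defaultdict

def one_over_n_weights(perf_weights, model_labels):
    """Give each member (mean performance weight of its model) / N, then normalize."""
    perf_weights = np.asarray(perf_weights, dtype=float)
    members = defaultdict(list)
    for idx, label in enumerate(model_labels):
        members[label].append(idx)
    w = np.empty_like(perf_weights)
    for idxs in members.values():
        n = len(idxs)  # N: number of members representing this "model"
        w[idxs] = perf_weights[idxs].mean() / n
    return w / w.sum()

# Hypothetical example: model B is a 3-member IC ensemble, A and C are single members.
labels = ["A", "B", "B", "B", "C"]
perf = [0.5, 0.2, 0.4, 0.6, 0.5]
w_III = one_over_n_weights(perf, labels)
print(np.round(w_III, 3))
```

In this example the three members of model B together receive the same share that a single-member model with performance weight 0.4 would receive. Passing modelling-center labels instead of model names would give a grouping of the kind used in the fourth strategy.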

1/N scaling, modelling center
The fourth weighting strategy is identical to the third, but has a different definition of "model". The independent entity is determined not by name, but by a conjecture about model origin. Similar to the "same-center hypothesis" (Leduc et al., 2016), we group all members provided by a modelling center and/or in a known development stream (e.g., the CESM1.2.2-LE is grouped with the NCAR models, though it was run at ETH Zürich) as dependent entities. The weight of each model, $w_i^{IV}$, is computed as in the IC case, with averages taken over N, the number of members that constitute a model.

RMSE distance scaling
Finally, the fifth weighting strategy operates under the assumption that independence cannot necessarily be determined by model name, but shared biases in simulating historical climate can give an idea of dependence that comes from differently named models sharing ideas and code. Instead of relying on knowledge of model origin, the RMSE weighting ($w_i^{V}$) initially proposed by Knutti et al. (2017) relies solely on model output to determine a model's overall weight. It features an independence scaling based on RMSE distance metrics in addition to the RMSE-derived performance weights. For results to be compatible with past assessments of this weighting scheme (e.g. Lorenz et al., 2018; Brunner et al., 2019), we assign each member its unique performance weight (as computed in $w_i^{II}$) even if it is an IC ensemble member. This puts the RMSE weighting in contrast to the 1/N scaling approaches, which ensure IC ensemble members have identical weights.
The RMSE weight is
$$w_i^{V} = \frac{e^{-(D_i/\sigma_D)^2}}{1 + \sum_{j \neq i} e^{-(S_{ij}/\sigma_S)^2}}$$
where $S_{ij}$ represents the distance between multi-model ensemble member i and multi-model ensemble member j. Unlike in the 1/N strategies, the independence scaling is based solely on $S_{ij}$, how far a member is from all the other members in the ensemble, and not on any prior knowledge of the multi-model ensemble member's origin. As with the performance weight, a shape parameter $\sigma_S$ dictates the width of the Gaussian that is applied to the member pairs. $\sigma_S$ represents how close a member must be to another member before they are considered dependent entities. For a member with no close neighbors ($S_{ij} \gg \sigma_S$), the independence scaling tends to 1, preserving the member's overall weight. For a member with many close neighbors ($S_{ij} \ll \sigma_S$), the independence scaling is greater than 1 and reduces its overall weight. For the CMIP5-SMILE ensemble, the goal is to select a $\sigma_S$ that is large enough that members of a SMILE are considered dependent entities, but not so large that the majority of multi-model ensemble members are considered dependent as well. Here, we select DJF NEU $\sigma_S$ to be 0.25 and JJA MED $\sigma_S$ to be 0.26. Sensitivity to the choice of $\sigma_S$ and further details on selection strategies are discussed in Appendix B.
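The behaviour of the independence scaling can be sketched as below, assuming the Knutti et al. (2017) form in which the Gaussian performance weight is divided by one plus the summed Gaussian similarities to all other members. The distance values are hypothetical: three near-identical members stand in for an IC ensemble, and a fourth member is far from all of them.

```python
import numpy as np

def rmse_weights(D, S, sigma_D, sigma_S):
    """Performance weight divided by an RMSE-based independence scaling, normalized."""
    D = np.asarray(D, dtype=float)
    S = np.asarray(S, dtype=float)
    performance = np.exp(-(D / sigma_D) ** 2)
    similarity = np.exp(-(S / sigma_S) ** 2)
    np.fill_diagonal(similarity, 0.0)  # exclude the j == i term from the sum
    independence = 1.0 + similarity.sum(axis=1)
    w = performance / independence
    return w / w.sum(), independence

# Hypothetical distances: members 0-2 are near-duplicates, member 3 stands alone.
D = np.full(4, 0.3)
S = np.array([[0.00, 0.01, 0.01, 1.0],
              [0.01, 0.00, 0.01, 1.0],
              [0.01, 0.01, 0.00, 1.0],
              [1.0,  1.0,  1.0,  0.0]])
w_V, scaling = rmse_weights(D, S, sigma_D=0.32, sigma_S=0.25)
print(np.round(scaling, 3), np.round(w_V, 3))
```

Here the scaling for each near-duplicate is close to 3, so the three together receive roughly the same weight as the single distant member, while the distant member's scaling stays close to 1.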
Upon computation of the weights in each strategy, each weight is normalized by $\sum_i w_i$ such that the weights sum to 1.

Defining "Climate": Predictor Selection
Both the performance weight used in weighting strategies two through five and the independence scaling used in strategy five are based on a chosen definition of climate. A model's performance is based on its ability to reproduce observed climate.
Under assumption five, a member's independence is based on how much its climate differs from the climate in other members.
When defining climate, the aim is to optimize the "fitness for purpose", which should include choosing predictors that are physically associated with the target and will indicate if a model is biased in a way that would make it unsuitable for realistic simulation of the target. For example, in Knutti et al. (2017), aspects of climate relevant for September sea ice extent, such as the climatological mean and trend in hemispheric mean September Arctic sea ice extent and the gridded climatological mean and standard deviation in SAT for each month, were chosen. The chosen predictors reflected that models with almost no sea ice in the present day or significantly more sea ice in the future than presently observed were less suitable for the task of projecting changes in sea ice extent. It is also good practice to avoid using a single predictor to define climate, so as to avoid an over-confident uncertainty estimate. No one model property can comprehensively reflect if the model is "good" for a particular purpose, and it is dangerous to constrain uncertainty by dismissing models that do not match observations for a particular statistical definition in favor of those that happen to be tuned to match that definition. Lorenz et al. (2018) discuss a more holistic strategy for choosing predictors and ultimately selected from a set of 24 predictors deemed relevant for projecting North American maximum temperature, based on known physical relationships, predictor-target correlations, and variance inflation considerations.
Here "fitness for purpose" is a relatively simple and straight-forward definition of climate within which the sensitivity of the weighting scheme can be interrogated. We base the performance weighting and the RMSE independence scaling on nine predictors: the climatology and interannual variability (represented by standard deviation) of SAT and SLP during the periods of 1950-1969 and 1990-2009 and a 50-year derived SAT trend (estimated residual thermodynamic trend; described in more detail in subsequent paragraphs) for the period of 1960-2009. We chose predictors to be aspects of regional temperature and pressure in a domain that encompasses modes of atmospheric circulation variability relevant to European climate, because they are (1) physically associated with the target (end-of-century warming) and (2) fields that may reflect model biases that would affect realistic simulation of future climate. For example, a model with a warmer-than-observed mean state in the Mediterranean may experience an enhanced land-atmosphere feedback mechanism that amplifies drying and warming of the region (e.g. Christensen and Boberg, 2012; Mueller and Seneviratne, 2014; Vogel et al., 2018). SAT and SLP are found to be highly relevant predictors by earlier studies (Brunner et al., 2019) and are among the most comprehensively measured atmospheric fields prior to the satellite era (Trenberth and Paolino, 1980
To compute the aggregate distance metrics from nine predictors, all predictor and observational fields are bilinearly interpolated to a shared 2.5° × 2.5° latitude-longitude grid. The predictors are then time-aggregated, with the mean or standard deviation computed over the periods 1950-1969 and 1990-2009, and the estimated residual thermodynamic trend computed over the period 1960-2009.
For each time-aggregated predictor, the differences between the observed mean value and member value (or member value and member value in the case of the RMSE independence scaling) are computed at each grid point and subsequently squared. The squared differences are then area-averaged over the predictor domain and square-rooted to obtain an RMSE distance for observed-member and member-member pairs. For each predictor, the resulting distributions of observed-member and member-member RMSEs are then normalized by their mid-range value ([maximum + minimum]/2), such that the distances for each of the nine predictors are on the same order of magnitude and can be combined into a single $D_i$ (Figure B1) and $S_{ij}$ (Figure B2) for each member.
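The distance computation described above can be sketched for a single predictor as follows. This is a minimal illustration and not the authors' code: the cos(latitude) area weighting, the function names, the grid bounds, and the synthetic fields are all assumptions.

```python
import numpy as np

def area_weighted_rmse(field_a, field_b, lat):
    """RMSE between two (lat, lon) fields, area-averaged with cos(latitude) weights."""
    sq_diff = (np.asarray(field_a, dtype=float) - np.asarray(field_b, dtype=float)) ** 2
    # Assumed area weighting: each grid row is weighted by the cosine of its latitude.
    weights = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(sq_diff)
    return np.sqrt((sq_diff * weights).sum() / weights.sum())

def midrange_normalize(rmse_values):
    """Divide a set of RMSEs by their mid-range value, (max + min) / 2."""
    rmse_values = np.asarray(rmse_values, dtype=float)
    return rmse_values / ((rmse_values.max() + rmse_values.min()) / 2.0)

# Hypothetical 2.5-degree grid and synthetic "observed" field.
lat = np.arange(30.0, 80.0, 2.5)
lon = np.arange(-10.0, 40.0, 2.5)
rng = np.random.default_rng(0)
obs = rng.normal(size=(lat.size, lon.size))

# Three synthetic members that differ from observations by constant offsets.
members = [obs + bias for bias in (0.5, 1.0, 2.0)]
rmse = np.array([area_weighted_rmse(m, obs, lat) for m in members])
normalized = midrange_normalize(rmse)
print(np.round(rmse, 3), np.round(normalized, 3))
```

Repeating this for each predictor and then combining the normalized distances (for example by averaging, which is an assumption here) would give one aggregate distance per member.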

A final consideration in predictor selection is the relationship between past and future predictor behavior. A model's performance weight is based on its ability to reproduce observed climate, and this methodological choice follows from the concept of emergent constraints (e.g. Hall and Manabe, 1999; Allen and Ingram, 2002; Borodina and Knutti, 2017). The assumption is that if a model accurately represents an aspect of historical climate, it is likely to realistically represent relevant physical processes and therefore is likely to provide a reliable future projection. If a model is significantly biased with respect to observed climate, its future representation of climate may be cause for concern. For these tendencies to hold, a statistical relationship between the historical and future climate feature of interest must exist. In the absence of a strong relationship, predictors serve to add degrees of difference between members that help to guard against overconfident weighting.
Statistical relationships between historical and future climate can be obscured by internal variability, and the inclusion of In particular, internal variability is shown to influence trends in regional SAT even on the 50-year predictor timescales we have selected (Deser et al., 2016). Because of this, a member may have a similar-to-observed SAT trend (and thus a higher performance weight) by chance, simply because it has similar-to-observed climate variability over the trend period (i.e. a similar set of El Niño and La Niña events or similar phasing of the Atlantic Multi-decadal Oscillation). Because internal variability is inherently random in temporal phase (Deser et al., 2012), a member's match to observations over one trend period does not guarantee a match in the future. This issue is demonstrated in Figure 2ai, which shows that there is no discernible relationship ($R^2 \sim 0$) between the DJF EUR SAT trend from 1960-2009 and from 2050-2099 in CMIP5 with (black line) or without (blue line) the SMILEs. Even the two observational estimates differ in European winter trend by more than a degree over 50 years. In summer, a season with less midlatitude climate variation, a relationship emerges between 1960-2009 and 2050-2099 European SAT trends. The linear relationship between past and future trend is reinforced by the SMILEs in a model mean sense, i.e., the three new models added to the CMIP5 ensemble support the relationship (Fig. 2bi). It is not evident within the SMILEs themselves, which reflects that the relationship is due to model differences and not the behavior of individual IC members.
The removal of the estimated influence of internal atmospheric variability from regional SAT, however, provides an alternative performance metric on which observations and models can be compared. Using a method of dynamical adjustment (described in Appendix A and in further detail in Deser et al., 2016), we construct an estimate of the component of SAT variability induced by large-scale atmospheric circulation patterns, remove it from the SAT record, and obtain the estimated residual thermodynamic trend for 1960-2009 and 2050-2099. The estimated residual thermodynamic trend is an estimate of both the influence of surface processes (i.e., land-atmosphere interactions; Merrifield et al., 2017) and the influence of the radiative forcing, an influence often defined as the forced response. In the model world, the forced response of a field is often defined as the ensemble mean or average across multiple ensemble members. However, there is no observational equivalent to the ensemble mean; there is only one observed realization of climate. Therefore, we use the estimated residual thermodynamic trend as a predictor because it can be computed in the same manner through dynamical adjustment in both observations and each multi-model ensemble member.
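The idea of removing a circulation-induced component before computing a trend can be illustrated with a deliberately simplified sketch. It does not implement the dynamical adjustment method of Appendix A or Deser et al. (2016); instead it substitutes a single linear regression of SAT on one hypothetical circulation index. The synthetic series, the 0.03 per year "forced" trend, and all names are assumptions made for illustration.

```python
import numpy as np

def linear_trend(years, series):
    """Least-squares slope of series against years (units per year)."""
    return np.polyfit(years, series, 1)[0]

def residual_trend(years, sat, circulation_index):
    """Trend of SAT after removing the part linearly congruent with a circulation index."""
    years = np.asarray(years, dtype=float)
    sat = np.asarray(sat, dtype=float)
    circ = np.asarray(circulation_index, dtype=float)
    # Detrend both series so the regression captures co-variability only.
    sat_detr = sat - np.polyval(np.polyfit(years, sat, 1), years)
    circ_detr = circ - np.polyval(np.polyfit(years, circ, 1), years)
    beta = np.dot(circ_detr, sat_detr) / np.dot(circ_detr, circ_detr)
    # Circulation-congruent SAT component, including any trend in the index.
    dynamic = beta * (circ - circ.mean())
    return linear_trend(years, sat - dynamic)

# Synthetic example: a 0.03 per year forced trend plus a circulation-driven part.
years = np.arange(1960, 2010)
rng = np.random.default_rng(1)
circ = rng.normal(size=years.size) + 0.05 * (years - years[0])
sat = 0.03 * (years - years[0]) + 1.5 * circ

raw = linear_trend(years, sat)
resid = residual_trend(years, sat, circ)
print(round(raw, 3), round(resid, 3))
```

In this synthetic case the raw trend is inflated by the trend in the circulation index, while the residual trend recovers the prescribed forced trend because the series was constructed to be exactly linear in the index.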
Internal atmospheric variability serves to amplify both observed SAT trends in winter by approximately 0.6 °C. Removing the influence of dynamics results in an average observed estimated residual thermodynamic trend that falls centrally within the CMIP5 and SMILE distributions (Fig. 2aii). In summer, dynamical adjustment also centers the estimated residual thermodynamic trend and slightly reduces the difference between observational datasets (Fig. 2bii). In terms of weighting, the shift of observed values to the center of the model distribution will lead to more models "performing" in their simulation of trend, which will, in turn, allow more models to contribute to the uncertainty estimate. The estimated residual thermodynamic trend can also be thought of as a property of each model, a measure that includes the response to the shared forcing analogous to climate sensitivity. We find that SMILE members, which share both model setup and forcing, also tend to have similar estimated residual thermodynamic trends (Fig. 2aii, bii). In winter, the clustering of SMILE estimated residual thermodynamic trends is striking in comparison with SMILE trends: CESM1.2.2-LE members tend to have the least EUR warming in both periods, while CanESM2-LE members tend to warm the most. The addition of the SMILEs then introduces a slightly positive relationship between past and future responses (Fig. 2aii, black trend line) not apparent in the CMIP5 ensemble (Fig. 2aii, blue trend line), though no strong relationship emerges from variability in either case. In summer, the positive relationship seen between past and future Mediterranean SAT trends (Fig. 2bi) is robust to the combination of removing internal atmospheric variability and adding the SMILEs (Fig. 2bii). CanESM2 has the most JJA MED warming in both the past and future periods, while MPI-GE has the least.
Because estimated residual thermodynamic SAT trends in the broader European region are more comparable between members and observations due to the removal of an estimate of the influence of atmospheric variability that manifests on multi-decadal time-scales, we use them as the ninth predictor in the definition of climate used in our performance weightings and RMSE independence weighting. Emergent relationships within the other eight predictors are discussed in Appendix C.

Results
To assess the influence of the weightings, we evaluate the magnitude of regional European end-of-century warming in terms shown by solid horizontal lines within the box elements. Weighted ensemble spread is illustrated by the box, which indicates the 25th and 75th percentile, and the whisker, which indicates the 5th and 95th percentile.
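Weighted box and whisker elements of this kind require percentiles of a weighted sample. A minimal sketch is given below, assuming one common convention (linear interpolation of the weighted cumulative distribution evaluated at interval mid-points); the function name and the example values are hypothetical, and other weighted-percentile conventions would give slightly different numbers.

```python
import numpy as np

def weighted_percentile(values, weights, percentiles):
    """Percentiles of values under the given weights, via the weighted empirical CDF."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    v, w = values[order], weights[order]
    # Mid-point cumulative weight of each sorted value, scaled to [0, 1].
    cdf = (np.cumsum(w) - 0.5 * w) / w.sum()
    return np.interp(np.asarray(percentiles, dtype=float) / 100.0, cdf, v)

# Hypothetical end-of-century warming values (degrees C) for five members.
warming = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
equal = np.ones(5)
skewed = np.array([1.0, 1.0, 1.0, 1.0, 10.0])  # one member dominates

print(weighted_percentile(warming, equal, [5, 25, 50, 75, 95]))
print(weighted_percentile(warming, skewed, [5, 25, 50, 75, 95]))
```

With equal weights the median is the middle value, while concentrating weight on the warmest member pulls the median and upper percentiles towards it.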
For each weighting strategy, comparisons between the CMIP5 and ALL distributions help to elucidate (i) how the weighting constrains uncertainty in the magnitude of end-of-century regional European warming and (ii) how the inclusion of SMILE members influences the distribution. To explicitly determine the contribution of the SMILEs, we also show the fraction of total weight received by each SMILE and the CMIP5 in Fig. 3c and d. Contributions are determined by summing the normalized weights of the 50 CESM1.2.2-LE members (red bar), 50 CanESM2-LE members (yellow bar), 100 MPI-GE members (green bar), and the remaining 88 CMIP5 members (blue bar). For the most part, the weighting strategies introduce only modest distributional shifts; both Northern European winters and Mediterranean summers are projected to warm, most likely by about 5-6 °C, by end-of-century (Fig. 3a,b). What is more at issue than the distributional statistics, though, is what the distribution actually represents.
An equal weighting results in a distribution representative of warming in the models with the most votes; in this case, the SMILEs. In both seasons, the equal weighting demonstrates why it is important to treat SMILE members as dependent entities within a multi-model ensemble. The CMIP5 ensemble projects an ensemble mean end-of-century warming of 5.9 °C and an interquartile spread of 2.2 °C for Northern European winter (Fig. 3a), and an ensemble mean end-of-century warming of 5.5 °C and an interquartile spread of 1.5 °C for Mediterranean summer (Fig. 3b). The addition of 200 SMILE members to the 88-member CMIP5 ensemble shifts the end-of-century warming distributions towards less DJF NEU end-of-century warming and more JJA MED end-of-century warming, and reduces interquartile spread by approximately 25% in both cases. The large contributions of the three added SMILEs artificially constrain uncertainty: the CESM1.2.2-LE and CanESM2-LE each receive 17.4% of the total ALL ensemble weight, while the MPI-GE receives the largest share, 34.7% (Fig. 3c,d).
(Figure panel a: Change in DJF NEU SAT, [2080-2099] - [1990-2009])
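The equal-weighting contributions quoted above follow directly from the ensemble sizes, and the same summation of normalized weights by group applies to any of the weighting strategies. A short sketch, with variable names chosen for illustration:

```python
import numpy as np

# Member counts of the ALL ensemble as given in the text.
group_sizes = {"CESM1.2.2-LE": 50, "CanESM2-LE": 50, "MPI-GE": 100, "CMIP5": 88}

# One group label per member, and equal weights normalized to sum to 1.
labels = np.repeat(list(group_sizes), list(group_sizes.values()))
weights = np.ones(labels.size)
weights /= weights.sum()

# Contribution of each group: sum of the normalized weights of its members.
contribution = {g: weights[labels == g].sum() for g in group_sizes}
for group, frac in contribution.items():
    print(f"{group}: {100 * frac:.1f}%")
# CESM1.2.2-LE: 17.4%
# CanESM2-LE: 17.4%
# MPI-GE: 34.7%
# CMIP5: 30.6%
```

Under equal weighting the three SMILEs therefore hold about 69% of the total weight, which is why they dominate the ALL distribution.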

A performance weighting results in a distribution representative of warming in the models that historically get things right.
By diminishing the contribution of members that differ from observational estimates, the performance weight acts to constrain uncertainty in both the CMIP5 and the ALL ensemble. For DJF NEU SAT change, the performance weight shifts the CMIP5 ensemble mean downwards by 0.75 °C, the 75th percentile downwards by 1.2 °C, and the 25th percentile downwards by 0.44 °C.
This distributional shift towards less end-of-century warming is due, in part, to members with SAT ∆ greater than 8 °C receiving weights that are two orders of magnitude smaller than the average assigned weight. Uncertainty in the DJF NEU ALL ensemble is constrained both by the performance weight diminishing the contribution of CMIP5 members and because MPI is one of the highest performing models based on the chosen DJF predictors. The high performing MPI-GE receives 65.8% of the total ALL ensemble weight, though individual MPI-GE members only receive up to three times more weight than the average assigned weight. The aggregate impact of 100 high performing members, however, is outsized and results in the narrowing of the performance weighted end-of-century warming distribution. The narrowing does not reflect the increased certainty that comes from the agreement of independent entities within the ensemble. Instead, it shows the need for an independence assumption in order to avoid the outsized influence that comes from being both historically realistic and numerously represented in the ensemble.
For JJA MED SAT change, the performance weight reduces the contribution of the three SMILEs to the ALL distribution in comparison to the equal weighting case, with the largest reduction made to the CanESM2-LE contribution (17.4% to 7.4%; Fig. 3d).
However, the three SMILEs (three independent entities) still receive 51% of the total JJA MED ALL ensemble weight, their contributions again augmented by numerous representations. As in the equal weighting case, the JJA MED ALL performance-weighted ensemble mean is still modestly shifted towards more end-of-century warming than its JJA MED CMIP5 counterpart.
This reflects the above CMIP5-average SAT change of the CESM1.2.2-LE and the CanESM2-LE in Mediterranean summer.
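The outsized aggregate influence of a well-performing large ensemble under performance-only weighting can be illustrated with a small sketch. It assumes the Gaussian form exp(-(D_i/σ_D)^2) commonly used with this weighting scheme; the distances below are invented for illustration and are not values from this study, and only σ_D = 0.32 is the DJF value reported in Appendix B.

```python
import numpy as np

def performance_weights(distances, sigma_d):
    """Normalized performance weights, assuming a Gaussian form exp(-(D_i / sigma_D)^2)."""
    w = np.exp(-(np.asarray(distances, dtype=float) / sigma_d) ** 2)
    return w / w.sum()

# Hypothetical RMSE distances to observations (illustrative only):
# five singly represented models and ten members of one well-performing large ensemble.
d_single = np.array([0.35, 0.45, 0.55, 0.70, 0.90])
d_large_ensemble = np.full(10, 0.30)

weights = performance_weights(np.concatenate([d_single, d_large_ensemble]), sigma_d=0.32)
le_share = weights[len(d_single):].sum()   # aggregate weight of the large ensemble
ratio_to_mean = weights.max() / weights.mean()
print(f"large-ensemble share: {le_share:.2f}, largest weight / mean weight: {ratio_to_mean:.2f}")
```

In this toy case no individual member receives more than a small multiple of the mean weight, yet the large ensemble as a whole dominates the total, which mirrors the behaviour described for the MPI-GE.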

In an effort to more appropriately handle the mix of models and IC members present in the ALL ensemble, we next explore three scalings that reflect different member dependence assumptions: that IC members are dependent (Fig. 3a,b; 1/N, IC members), that modelling center contributions are dependent (Fig. 3a,b; 1/N, modelling center), and that members with similar historical climate are dependent (Fig.3a,

The scaling of IC ensemble member weight within the CMIP5 ensemble (blue element) decreases DJF NEU end-of-century warming uncertainty and slightly increases JJA MED end-of-century warming uncertainty with respect to equal weighting. It is therefore evident that the IC ensembles within CMIP5, which range from 2 to 10 members, exert influence on the performance weighted DJF NEU distribution in the same way the SMILEs influence the corresponding performance weighted ALL distribution. While this is not seen in the corresponding JJA MED CMIP5 equal and performance weightings, it is important to note that even two or three extra votes for a high performing model are enough to influence uncertainty. The reduction of IC member influence is even more striking in the ALL distribution; the three SMILEs contribute 11.4% of the total weight in the DJF NEU and 3.1% in the JJA MED, down from performance weight contributions of 81.6% and 50.7%, respectively. As with other strategies, the 1/N IC member scaled DJF NEU ALL distribution is shifted towards less end-of-century warming with respect to its CMIP5 counterpart. The ALL and CMIP5 1/N IC member scaled JJA MED distributions are almost identical.
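The 1/N scaling can be sketched as follows: each member's performance weight is divided by the number of members that represent the same "model", and the result is renormalized. The labels and weights here are hypothetical, and the function name is illustrative rather than part of the weighting package.

```python
from collections import Counter

def one_over_n_weights(model_labels, performance):
    """Scale each member's performance weight by 1/N, then normalize.

    model_labels: one label per member naming the "model" it represents
    performance:  unnormalized performance weight per member
    """
    n_per_model = Counter(model_labels)
    scaled = [w / n_per_model[m] for m, w in zip(model_labels, performance)]
    total = sum(scaled)
    return [w / total for w in scaled]

# Hypothetical ensemble: one 100-member large ensemble plus three singly
# represented models, all with equal performance so only 1/N matters.
labels = ["LE"] * 100 + ["A", "B", "C"]
weights = one_over_n_weights(labels, [1.0] * len(labels))
le_share = sum(w for m, w in zip(labels, weights) if m == "LE")
print(f"large-ensemble share: {le_share:.2f}")
```

Under this assumption the 100-member ensemble counts as one of four models, so its aggregate share is one quarter regardless of its size. Changing the labels (for example, grouping by modelling center) changes N and therefore the shares.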

In addition to IC members, it is reasonable to assume that members of the same model that differ in resolution (e.g., MPI-ESM-LR and MPI-ESM-MR) or in component module used (e.g., MIROC-ESM and MIROC-ESM-CHEM) are dependent entities. However, determining where to draw the line between dependence and independence is difficult; models from different modelling centers share components, while models in a modelling center's development chain can differ from each other in most major parameterizations. Here, we chose to take a logical approach to the dependent entity grouping, based largely on model name or knowledge of institution of origin (Table 1,

Finally, in the instance that dependence is not known a priori, an RMSE-based metric can be used to assign dependence. The idea is that because of model biases, dependent entities can be identified by their similar climates. Using the same set of predictors as used for performance, each member receives a unique weight: RMSE-based performance scaled by RMSE-based dependence. The RMSE dependence scaling allows for more SMILE contribution than the 1/N dependence scaling approaches (Fig. 3c,d) because internal variability distinguishes SMILE members from one another and thus allows them to be treated as separate entities. With more entities in the ensemble, it follows that the degree of dependence of the existing CMIP5 models increases (CMIP5 models become more dependent) in tandem with SMILE members' degree of dependence decreasing (SMILE members become less-than-fully dependent). In the DJF NEU, it is striking that the high performing MPI-GE again contributes over 40% of the total weight. In the JJA MED, the RMSE dependence scaling leads to comparable CMIP5 and ALL distributions, with the ALL distribution projecting slightly less warming than the CMIP5 distribution.
This is in contrast to the performance weighted case where the ALL distribution is narrower than and features more warming than the CMIP5 distribution. This addresses the issue that we may not truly know how independent a model is based on name or modelling center of origin alone. However, when dependent entities (i.e., SMILE members) are known, the RMSE metric must be able to identify them as dependent and scale their influence appropriately. In practice, this means we seek an RMSE scaling that approaches (or exceeds) 1/N for the SMILEs and the IC ensembles within the CMIP5 ensemble. The goal of an RMSE scaling proportional to ensemble size comes with the understanding that scaling may be larger if the IC ensemble is very similar to other models or smaller if the IC ensemble is not fully identified as one model (as was the case with the nine predictor RMSE scaling).
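To see why an RMSE scaling can approach 1/N, consider a sketch that assumes the Gaussian similarity form 1 / (1 + Σ_{j≠i} exp(-(S_ij/σ_S)^2)) commonly used with this scheme. The distance matrix is synthetic; σ_S = 0.26 is the DJF value reported in Appendix B, and the function name is illustrative.

```python
import numpy as np

def independence_scaling(S, sigma_s):
    """Per-member scaling 1 / (1 + sum_{j != i} exp(-(S_ij / sigma_S)^2)).

    S is a symmetric inter-member distance matrix with a zero diagonal, so the
    diagonal contributes exp(0) = 1 and supplies the leading "1 +" term.
    """
    similarity = np.exp(-(np.asarray(S, dtype=float) / sigma_s) ** 2)
    return 1.0 / similarity.sum(axis=1)

# Synthetic case: four near-identical IC members (mutual distance 0.02)
# and one unrelated model at distance 1.0 from all of them.
S = np.full((5, 5), 1.0)
S[:4, :4] = 0.02
np.fill_diagonal(S, 0.0)

scaling = independence_scaling(S, sigma_s=0.26)
print(scaling)
```

When members of an IC ensemble sit much closer to one another than σ_S, each similarity term is near one and the scaling tends to 1/N (here 1/4), while the isolated model keeps a scaling near unity. If internal variability inflates the IC inter-member distances towards σ_S, the scaling rises above 1/N, which is the behaviour described for the nine predictor set.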

One way to achieve an RMSE scaling that identifies IC members as dependent is to remove internal variability from the metric through predictor choice. While it would not be good practice to base member performance on few predictors because of over-confidence concerns, member dependence may be more accurately reflected by fewer predictors that distinguish models from one another. The advantage of choosing different sets of predictors for determining dependence and performance is two-fold. First, by selecting for ability to distinguish models rather than realism, dependence predictors can achieve a more substantial separation between SMILE-SMILE distances and SMILE-model distances. This reduces reliance on and sensitivity to the independence shape parameter σ_S (Appendix B). Second, the "convergence to reality" paradox is no longer an issue; models will not be down-weighted for moving closer to observations (and thus each other) based on performance predictors.
We find that large-scale, long-term climatological averages are the most suitable predictors for this purpose because, in general, the influence of internal variability increases on smaller spatial-scales and shorter time-scales (Hawkins and Sutton,

In contrast to the nine predictor RMSE scaling (Fig. 4a,b), the global land SAT-Northern Hemisphere SLP RMSE scaling allows for SMILE members to distinguish themselves and to approach or exceed 1/N values (Fig. 4c). In both DJF and JJA, no member of the ALL ensemble has a nine predictor RMSE scaling that exceeds 1/45. Inter-member RMSE distances, shown in panels a and b of Fig. B2, reflect why this occurs; SMILE members can be as different from one another as CMIP5 models are. The nine predictor independence scaling is better able to distinguish SMILE members from CMIP5 members in JJA than in DJF (Fig. 4b).
With the global land SAT and Northern Hemisphere SLP CLIM predictors, SMILE members are clearly closer to each other than to other models, with the exception of the CanESM2-LE. Because the CanESM2-LE is created using the five CanESM2 contributions to CMIP5, the SMILE and CMIP5 contributions cluster as a 55-member CanESM2 ensemble within the ALL ensemble (Fig. 4c). In terms of scaling, the 55 CanESM2 members are scaled by an average of 1/55.0, while the CESM1.2.2-LE and the MPI-GE are scaled by an average of 1/48.7 and 1/100.5, respectively (Fig. 4c). In addition to the SMILEs, other IC ensembles within CMIP5, such as the 10-member CSIRO-Mk3-6-0 ensemble, also achieve a 1/N scaling.
Individually represented models, such as FGOALS-g2, are considered more independent and are thus scaled by factors that approach unity. On the other end of the independence continuum, the four MPI-M contributions to CMIP5 are identified to have a high degree of similarity to the MPI-GE and are scaled accordingly by factors exceeding 1/60.

To understand why large-scale, long-term CLIM predictors are able to group SMILE members and set a degree of independence for CMIP5 members, we investigate where each member falls in the global land SAT and Northern Hemisphere SLP climatology predictor space in Figure 5. Each member is labelled either by color (SMILEs) or by model name (CMIP5), and IC ensembles within CMIP5 are circled. Circling IC ensembles within CMIP5 is possible because, along with the SMILEs, the IC members also tend to cluster. This phenomenon is in line with the assumption that IC members are dependent entities; the two large-scale, long-term CLIM predictors reflect this dependence. Notable IC clusters include MIROC5 (3 members) and EC-EARTH (5 members). The bifurcation in GISS-E2-H and GISS-E2-R ensembles reflects the p3 (top) vs. the p1 and p2

The assumption that members from the same modelling center are dependent entities, however, is not as clear cut in the global land SAT and Northern Hemisphere SLP climatology predictor space. GISS contributions share a response (lower Northern Hemisphere average SLP and higher global land average SAT), while the contributions from CMCC, GFDL, and IPSL feature markedly different responses (Fig. 5). Another clustering feature present in both seasons is that of several separate clusters for a modelling center. This can be seen for the NCAR modelling center grouping: CCSM4 and CESM1-BGC form a cluster separate from both the CESM1-CAM5 cluster and the CESM1.2.2-LE cluster.
The NCAR case illustrates that new models in a modelling center's development stream can be distinct from their predecessors and should not necessarily be considered dependent based on their shared name. On the other hand, there are also instances where models of different names are similar to each other. Bcc-csm1-1 falls within the CCSM4-CESM1-BGC cluster (Fig. 5), which suggests that with shared components (Knutti, 2010), models can have similar responses and be identified as more dependent than their name would suggest. Ultimately, discrepancies between model name and model response suggest that assigning each member a degree of independence is a useful way to handle the continuum of dependence assumptions. Provided care is taken to select an appropriate set of predictors for independence scaling, IC members cluster in an anticipatable way while an interplay between named and unnamed model dependence remains.
We find that the performance and independence weighting scheme pioneered by Knutti et al. (2017) can be used to incorporate regional climate information from three single-model initial condition large ensembles into a CMIP5 multi-model ensemble and return a justifiably constrained estimate of European regional end-of-century warming uncertainty. The performance weighting, which accounts for an ensemble member's ability to reproduce selected aspects of observed climate, is based on regional surface air temperature and sea level pressure climatology and interannual variability over two 20-year intervals during the historical period (1950-1969 and 1990-2009) and a 50-year estimated residual thermodynamic SAT trend computed using a method of dynamical adjustment (Deser et al., 2016). These predictors bring to the definition of performance both emergent relationships between past and future climate and aspects of climate that a model must simulate historically in order to realistically project future warming. The principle of emergent constraints underpins the choice to use estimated residual thermodynamic SAT trend over SAT trend, as the former is an estimate of a model-specific property that can be compared with observations and the latter is influenced by internal variability even on 50-year timescales.

to the distribution now containing 20 uniquely weighted entities. Finally, by acknowledging dependencies may not always be clearly determinable a priori, the independence scaling based on inter-member RMSE distances from the same nine predictors used to determine performance allows for reasonable levels of SMILE contribution to Mediterranean summer end-of-century warming uncertainty. However, the high performing MPI-GE contributes approximately 40% of the total weight to the Northern European winter distribution as a result of predictor internal variability distinguishing SMILE members as independent models.

The advantages of the RMSE-based independence scaling, which include allowing for degrees of independence, are subverted somewhat by the inability of performance predictors to distinguish known dependent entities (i.e., IC members) from (presumed) independent ones. To address this issue, we show that a set of two predictors, 60-year annual average global land SAT climatology and 60-year annual average Northern Hemisphere SLP climatology, is capable of rendering an RMSE scaling of 1/N for SMILE members while assigning a degree of dependence to the rest of CMIP5. A notable achievement for these large-scale, long-term predictors is their ability to identify the CanESM2 members from CMIP5 as being from the same model version as the CanESM2-LE and scale the 55-member ensemble accordingly. A deeper look into groupings in the global land SAT and Northern Hemisphere SLP climatology predictor space reveals clustering of IC ensembles within the CMIP5 ensemble, in addition to the SMILEs, in both seasons. MPI-ESM-MR and MPI-ESM-LR contributions cluster near the MPI-GE, while the NCAR model group separates into three distinct clusters consistent with NCAR's model development over time.

The interplay between model name and model response does exhibit some complexity; models from the same center (e.g., GFDL) can have markedly different responses and models from different centers (e.g., bcc-csm1-1 and CCSM4) can have similar responses. This suggests that assigning degrees of independence is a useful way to represent the information in an ensemble of opportunity like CMIP5.
It is important to note that while the weighting has a relatively straightforward functional form, it requires application-specific sets of predictors and appropriate shape parameters. Strategies to select optimal shape parameters are discussed in Appendix B of this study, and we advise that emergent predictor relationships are explored, as in Appendix C, to provide justification for the performance metric. When defining model skill for performance, it is important to carefully consider whether predictors are relevant to a model's ability to project the future target realistically. Different targets, such as hydrological changes, may require predictors to capture a more complex set of physical processes. It is also important to assess RMSE distance to observations of known dependent entities such as SMILEs to ensure internal variability in the selected set of predictors does not assign them skill of different orders of magnitude. Because SMILE members had relatively similar RMSE distance to observations over the nine original predictors, we did not require members of the SMILE to have identical performance weights under the performance and RMSE case assumptions evaluated. We do, however, see the merit in fixing IC member performance to an ensemble average value to ensure model skill is appropriately assigned. We also recommend that different sets of predictors be used for determining performance weight and independence scaling to avoid down-weighting independent models with historical climate that converges to reality. Independence predictors should be fields with minimal internal variability, such as large-scale, long-term averages, and ideally fields that model developers do not explicitly tune, such as absolute global temperature (Mauritsen et al., 2012; Hourdin et al., 2017).
We assess a relatively unconventional multi-model ensemble in this study, which comprises 200 members from three models and only 88 members from the remaining 40 named models. This is a deliberate choice made to test and improve the independence scaling. Best practices for representing uncertainty in a multi-model ensemble that includes initial condition ensemble members must be determined in advance of CMIP6, to which modelling centers are slated to submit more ensemble members than were submitted to CMIP5 (Eyring et al., 2016; Stouffer et al., 2017). For more conventional multi-model ensembles that may include just a few initial condition ensemble members amongst the models, results may be less sensitive to choices underpinning the independence scaling. When large ensembles are included, however, it becomes clear that an independence scaling that scales known dependencies appropriately (i.e., 1/N for IC ensemble members), such as the RMSE global predictor scaling presented here, is necessary. Such an independence scaling will be a useful tool with which to assess uncertainty in the combined multi-model, multi-initial condition ensemble member CMIP6 ensemble.

To obtain estimated forced trends in SAT, a method of dynamical adjustment, based on constructed circulation analogues, is used (Deser et al., 2016; Merrifield et al., 2017; Guo et al., 2019). Dynamical adjustment provides an empirically-derived estimate of the SAT trends induced by atmospheric circulation variability; removal of this circulation-driven component from a SAT record thus reveals an estimate of the SAT trend associated with thermodynamic processes and radiative effects. Dynamical adjustment relies on the ability to reconstruct a monthly mean circulation field, which we represent  (Deser et al., 2012, 2016).
It is important to acknowledge that because of the paucity of analogue choices in leave-one-out dynamical adjustment, the term "analogue" is a bit of a misnomer. The term evokes the idea of a match, though in practice, analogues may not closely resemble the target. For convenience, we will continue to refer to the months used in target SLP construction as "analogues", but we do so with the understanding that target and analogue patterns may differ over the selection domain.

A month is determined to be an analogue of the target month if the Euclidean distance between target and analogue SLP is small. Euclidean distance is computed at each grid point and averaged over the European sector domain also used for SLP predictors (25-90 °N, 60 °W-100 °E). This selection metric, therefore, does not require an analogue to match the target month spatially over the whole domain. This is necessary because, with 60 possible options, it is statistically unlikely that a "perfect" analogue will exist for a particular target month. van den Dool (1994) found that it would take on the order of 10^30 years to find two Northern Hemisphere circulation patterns that match within observational uncertainty. With this in mind, a smaller than hemispheric domain and an iterative averaging scheme are employed to make the most of the "imperfect" analogues available (Wallace et al., 2012; Deser et al., 2014, 2016).

Once the Euclidean distances are determined, the 50 closest SLP analogues are chosen, and the iterative process of selecting 30 of 50 SLP analogues and optimally reconstructing a target SLP field X_h commences. The optimal reconstruction of the target SLP is mathematically equivalent to multivariate linear regression; each analogue is assigned a weight (β) such that a weighted linear combination of analogues produces a least-squares estimate of the target SLP. β is computed through a singular value decomposition of a column vector matrix X_c containing the 30 selected analogues and can also be estimated through a Moore-Penrose pseudoinverse. The analogue weighting scheme ensures that analogues which are further from (closer to) the target, in a Euclidean distance sense, contribute less (more) to the constructed SLP field.
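One iteration of the analogue selection and reconstruction described above can be sketched with NumPy. This is an illustration on synthetic data, not the study's implementation: the function names are hypothetical, the grid-point distance is taken as the absolute difference averaged over the domain, and the 30-of-50 subset is assumed to be drawn at random because the text does not specify how it is chosen.

```python
import numpy as np

def closest_analogues(candidates, target, n_closest=50):
    """Rank candidate months by domain-averaged grid-point distance to the target.

    candidates: (n_gridpoints, n_months) SLP fields; target: (n_gridpoints,)
    """
    dist = np.abs(candidates - target[:, None]).mean(axis=0)
    return np.argsort(dist)[:n_closest]

def analogue_weights(X_c, x_h):
    """Least-squares weights beta via the Moore-Penrose pseudoinverse of X_c."""
    return np.linalg.pinv(X_c) @ x_h

rng = np.random.default_rng(0)
n_grid, n_months = 200, 60                      # 60 candidate months, as in the text
candidates = rng.standard_normal((n_grid, n_months))
target = rng.standard_normal(n_grid)

pool = closest_analogues(candidates, target)    # the 50 closest analogues
subset = rng.choice(pool, size=30, replace=False)  # assumption: random 30-of-50 draw
X_c = candidates[:, subset]
beta = analogue_weights(X_c, target)
slp_hat = X_c @ beta                            # constructed target SLP field
```

`np.linalg.pinv` computes the pseudoinverse through a singular value decomposition, so the two routes mentioned in the text coincide here. In the full procedure this step would be repeated and the reconstructions averaged.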
After the target SLP field is constructed, the β values derived for each SLP analogue are applied to their corresponding monthly-averaged SAT fields. Prior to the application of weights, a quadratic trend representing anthropogenic warming is removed from the SAT record at each point in space. The purpose of this detrending is to ensure that months picked from the end of the record do not contribute higher SAT anomalies simply because of the anthropogenically forced warmer background climate, even if the SLP patterns are the same. Detrending strategies are further discussed in Deser et al. (2016). The weighted, detrended SAT fields are then used to construct a dynamic SAT anomaly field for the target month. SLP, which is representative of low-level atmospheric circulation, and SAT are physically related; SLP-derived weights are applied to SAT to empirically construct that relationship. Conceptually, dynamic SAT anomalies are those that would occur given the

Appendix B: Selecting σ_D and σ_S

Determining the shape parameters σ_D and σ_S is an important step in the RMSE weighting process. σ_D can be set using a perfect model test, as described in Lorenz et al. (2018). Here, a simplified perfect model test is performed on a 47-member ensemble, which includes only the first initial condition member from the SMILEs and from each of the CMIP5 model ensembles (40 named models with an additional 4 members from GISS-E2-R and GISS-E2-H physics-version ensembles). This is done because having multiple IC members (or a SMILE) in the ensemble could bias the perfect model test, which is based on predicting one member using a weighted distribution of the rest. We use member 1 for each initial condition ensemble because, often, when multiple initial condition members are available, the first member is selected (e.g., Liu et al., 2012; Karlsson and Svensson, 2013; Sillmann et al., 2013).
During the perfect model test, each member is assumed to be the "truth" once, and a weighting is performed using the remaining members to predict the "true" SAT change. RMSE distances (based on nine predictors) are computed with respect to the truth for the remaining members and used in the performance weighting function w II i described in section 3.2. The performance weights are computed for σ_D values ranging between 0 and 2 (on 0.01 intervals). For each σ_D, the weighted mean SAT change is computed and compared to the "true" SAT change.
The optimal σ_D for each truth is chosen to be where the difference between the weighted mean SAT change and the true SAT change is minimized. In the few cases when the weighted mean exhibits asymptotic behavior with no clear minimum difference prior to σ_D = 2, the σ_D value is selected at the point where the leveling off begins (as determined by the intersection between a threshold value and the weighted mean curve). For the nine predictor RMSE weightings, we set σ_D values to the mean of the 47 optimal σ_D values computed during the perfect model test. It is important to note that this choice is ultimately subjective and further parameter sensitivity testing is recommended in studies focused on model performance.
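The core loop of the simplified perfect model test can be sketched as below. This is an illustration on synthetic data under stated assumptions: a Gaussian performance weight exp(-(D/σ_D)^2), a σ_D grid that starts at 0.01 to avoid division by zero, and no handling of the asymptotic fallback case. All names are hypothetical.

```python
import numpy as np

def optimal_sigma_d(d_to_truth, delta_sat, true_delta, sigmas):
    """Return the sigma_D whose weighted-mean SAT change is closest to the truth."""
    # Subtracting the smallest squared distance rescales all weights by a common
    # factor, which cancels in the weighted mean and avoids 0/0 for small sigma_D.
    d2 = d_to_truth ** 2 - np.min(d_to_truth ** 2)
    errors = []
    for s in sigmas:
        w = np.exp(-d2 / s ** 2)
        errors.append(abs(np.sum(w * delta_sat) / np.sum(w) - true_delta))
    return float(sigmas[int(np.argmin(errors))])

def perfect_model_sigma_d(distances, delta_sat, sigmas):
    """Treat each member as the truth once and average the optimal sigma_D values."""
    n = len(delta_sat)
    best = []
    for truth in range(n):
        rest = np.arange(n) != truth
        best.append(optimal_sigma_d(distances[truth, rest], delta_sat[rest],
                                    delta_sat[truth], sigmas))
    return float(np.mean(best))

# Synthetic ensemble: one predictor per member, SAT change loosely tied to it.
rng = np.random.default_rng(1)
predictor = rng.uniform(0.0, 1.0, size=8)
distances = np.abs(predictor[:, None] - predictor[None, :])
delta_sat = 5.0 + 2.0 * predictor + 0.1 * rng.standard_normal(8)

sigmas = np.linspace(0.01, 2.0, 200)   # 0.01 spacing
sigma_d = perfect_model_sigma_d(distances, delta_sat, sigmas)
print(f"mean optimal sigma_D: {sigma_d:.2f}")
```

Averaging the per-truth optima mirrors the choice described in the text; other summary statistics could be substituted when testing sensitivity.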

The RMSE distances between multi-model ensemble members and observations (D_i) are shown in Figure B1. Members of the ALL ensemble are plotted in ascending order with the position of SMILE members indicated in red for the CESM1.2.2-LE, in yellow for the CanESM2-LE, and in green for the MPI-GE. In winter (Fig. B1a), distances between CMIP5 members and observations are distributed in a positively skewed fashion with the mode of the distribution at approximately D_i = 0.40 and a tail of larger D_i values. In contrast, CMIP5 distances in summer (Fig. B1b) are approximately normally distributed about a mean of D_i = 0.85. The addition of the SMILEs to the distribution contributes to both of these distributional tendencies. σ_D is set to 0.32 in DJF and 0.4 in JJA in both the CMIP5 and ALL ensembles to eliminate a degree of freedom of the method.
Members are more strongly weighted by performance in winter than in summer, due to the different distance distributions.

σ_S can be determined using initial condition ensembles present in the multi-model ensemble, including SMILEs. The inclusion of SMILE members in a multi-model ensemble emphasizes the need for σ_S to be carefully selected, as SMILEs add redundant information and the purpose of σ_S is to reduce the influence of redundant information. However, not all information added by a SMILE is distinguishable from information in other models; inter-member distances in an initial condition ensemble can be as large as inter-model distances in the multi-model ensemble (Fig. B2a,b). Checking inter-member vs. inter-model distances is an important first step in determining σ_S; too much overlap between the distributions can lead to outsized con

If σ_S is too small or too large, there are implications for the nine predictor RMSE weighted ensemble mean and spread.

This sensitivity to σ_S is shown in Figure B3. We assess the characteristics of the nine predictor RMSE weighted CMIP5 distributions (Fig. B3a,bi) and RMSE weighted ALL distributions (Fig. B3a,bii) for different values of σ_S, varying from 0.05 to 0.8.
For small σ_S, only members that are very close to each other in predictor space are considered dependent; most members of the multi-model ensemble will therefore be considered independent. In this case, the RMSE weighting tends toward the performance weighted approach. If σ_S is set on the order of the largest inter-member distances in a SMILE (σ_S ≳ 0.4), few members of the multi-model ensemble will be considered independent from each other, despite coming from different models.
The systematic scaling of performance weights in the ensemble at large tends to also lead to a narrowing of uncertainty. Only members that are very far from other members will not have a scaled performance weight, but these "independent" members tend to also be far from observations and therefore have little performance weight to begin with. For σ_S between approximately 0.2 and 0.4, uncertainty in the RMSE weighted distributions increases in all but the JJA MED CMIP5 case. The JJA MED CMIP5 distribution is relatively insensitive to σ_S because 50% of the RMSE distances between CMIP5 members are between 0.56 and 0.71 (Fig. B2b). For the ALL distributions, the RMSE weighted mean shifts up modestly in DJF and down in JJA. In order to avoid an underestimate of uncertainty, either due to redundancy or from down-weighting independent information, we propose that σ_S should be set carefully. For the set of nine predictors, we set σ_S based on the S_ij distribution in initial condition ensembles present within the multi-model ensemble. We compute the S_ij within each of the three SMILEs and, for each, take the value two standard deviations below the SMILE S_ij mean (Fig. B2). The three values are then averaged. By this metric, DJF NEU σ_S is 0.26 and JJA MED σ_S is 0.25.
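The nine predictor σ_S rule (two standard deviations below each SMILE's mean S_ij, then averaged over the SMILEs) can be sketched as follows. The distance matrices are synthetic, the function name is hypothetical, and the population standard deviation (NumPy's default) is an assumption since the text does not specify the estimator.

```python
import numpy as np

def sigma_s_from_smiles(distance_matrices, n_std=2.0):
    """Average over SMILEs of (mean - n_std * std) of the inter-member distances S_ij."""
    per_smile = []
    for S in distance_matrices:
        S = np.asarray(S, dtype=float)
        pairs = S[np.triu_indices_from(S, k=1)]   # each member pair counted once
        per_smile.append(pairs.mean() - n_std * pairs.std())
    return float(np.mean(per_smile))

def synthetic_distances(rng, n_members, center, spread):
    """Symmetric toy S_ij matrix with a zero diagonal (illustrative only)."""
    upper = np.triu(rng.normal(center, spread, size=(n_members, n_members)), k=1)
    return upper + upper.T

rng = np.random.default_rng(0)
smiles = [synthetic_distances(rng, 50, 0.45, 0.08),
          synthetic_distances(rng, 50, 0.40, 0.07),
          synthetic_distances(rng, 100, 0.42, 0.09)]
sigma_s = sigma_s_from_smiles(smiles)
print(f"sigma_S: {sigma_s:.2f}")
```

Placing σ_S below most intra-SMILE distances is what lets SMILE members retain some independence under the nine predictor set; with predictors that cleanly separate IC members from other models, a value above all intra-ensemble distances would be the natural counterpart.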
Another, more robust option, as discussed in the main text, is to select a set of independence predictors that explicitly differentiate inter-IC member distances from inter-model distances. In this case, σ_S should not be set to two standard deviations below the SMILE S_ij mean; rather, it should be set to a value greater than all IC member S_ij but less than inter-model S_ij (particularly for differently named models). For the large-scale CLIM predictor set explored in Figure 4, σ_S can be computed based on initial condition member inter-member distances as described in Brunner et al. (2019); σ_S in this instance is 0.22.

Appendix C: Emergent Predictor Relationships
In addition to relationships between past and future (estimated residual thermodynamic) trend (Fig. 2), emergent relationships among the remaining predictors we use to represent climate are shown in Figure C1  spatially in Figure C3 and C4. Mean states within SMILEs tend to cluster together. With the exception of JJA MED SLP climatology (Fig. C2c), the addition of the SMILEs does not change the linear relationship found in the CMIP5 multi-model ensemble.
For variability (standard deviation over the given period), members of SMILEs differ as much from each other as from other multi-model ensemble members in DJF (Fig.C1b,d). In JJA (Fig.C2b,d),  to the CMIP5 multi-model ensemble reduces correlations between historical and future variability for SAT and SLP in both seasons. This is particularly striking in JJA where the correlations tend to be due to the CMIP5 multi-model ensemble outliers.
Because the SLP predictor domain has a larger spatial extent than the SAT predictor domains, we also assess spatial patterns of climatological SLP which average to the lowest and highest domain-average values in the 1990-2009 climatological period (Figures C3 and C4). The "end-members" illustrate the climatological emergent constraint relationship seen in Figures C1 and C2 in terms of pattern, which is important for a field like SLP that tends to feature dipoles on basin and continental scales. For simplicity, we compare the end-members to one observational estimate from ERA-20C.
In winter, multi-model ensemble members tend to feature similar-to-observed spatial patterns of climatological SLP in the predictor domain, with a low pressure center over the high latitude North Atlantic and a region of high pressure over the Eurasian continent (Fig. C3). For the member with the lowest domain-average, the difference arises from a further extension of the low pressure center across Northern Europe and a weaker high pressure center than observed, especially in the vicinity of the Tibetan plateau (Fig. C3 ii,v). For the member with the highest domain-average, the difference arises from high pressure features over high altitude regions, such as Greenland and the Tibetan plateau (Fig. C3 iii,vi).

Figure C2. As in Figure C1, but for JJA. a) MED SAT climatology (°C), b) MED SAT standard deviation (°C), c) SLP climatology over the predictor region (hPa) and d) SLP standard deviation over the predictor region (hPa) are eight of the nine predictors used to determine member performance and independence.
In summer, members differ in spatial patterns of climatological SLP in the predictor domain, though most feature the high pressure center over the subtropical North Atlantic and the lower pressure over the Eurasian continent seen in ERA-20C (Fig. C4). The member with the lowest domain-average features the aforementioned spatial pattern, but with a higher-than-observed amplitude, i.e., both a higher North Atlantic subtropical high pressure center and a lower region of continental low pressure (Fig. C4 ii,v). In contrast, the member with the highest domain-average has high pressure over the entire Atlantic basin as well as over Greenland and the Tibetan plateau (Fig. C4 iii,vi). Most importantly, in all cases, the climatological behavior of the past continues into the future, which supports the primary tenet of an emergent constraint.

Figure C3. The spatial pattern of DJF SLP climatology for 1950-1969 (i-iii), 1990-2009 (iv-vi), and 2080-2099 (vii-viii). The observational estimate of SLP climatology (ERA-20C) is shown in the left column (i,iv). The ensemble member with the lowest domain-average SLP climatology for the two historical periods (GISS-E2-Rr1i1p2) is shown in the center column (ii,v,vii). The ensemble member with the highest domain-average SLP climatology for the two historical periods (IPSL-CM5B-LRr1i1p1) is shown in the right column (iii,vi,viii).
Author contributions. RK, RL, and LB conceived of and wrote the weighting scheme Python package. ALM and LB implemented the weighting scheme with contributions from RL. ALM, LB, and IK analyzed the output. ALM wrote the paper with contributions from all co-authors.
Competing interests. We declare that we have no conflict of interest.
Figure C4. As in Figure C3, but for JJA SLP climatology. The observational estimate of SLP climatology (ERA-20C) is shown in the left column (i,iv). The ensemble member with the lowest domain-average SLP climatology for the two historical periods (CanESM2-LEr35i1p1) is shown in the center column (ii,v,vii). The ensemble member with the highest domain-average SLP climatology for the two historical periods (IPSL-CM5B-LRr1i1p1) is shown in the right column (iii,vi,viii).
Acknowledgements. We would like to thank Drs.