Evaluation of Global Teleconnections in CMIP6 Climate Projections using Complex Networks

. In climatological research, the evaluation of climate models is one of the central research subjects. As an expression of large-scale dynamical processes, global teleconnections play a major role in interannual to decadal climate variability. Their realistic representation is an indispensable requirement for the simulation of climate change, both natural and anthropogenic. Therefore, the evaluation of global teleconnections is of utmost importance when assessing the physical plausibility of climate projections. 5 We present an application of the graph-theoretical analysis tool δ -MAPS, which constructs complex networks on the basis of spatio-temporal gridded data sets, here sea surface temperature and geopotential height in 500 hPa. Complex networks complement more traditional methods in the analysis of climate variability, like the classification of circulation regimes or empirical orthogonal functions, assuming a new non-linear perspective. While doing so, a number of technical tools and metrics, borrowed from different fields of data science, are implemented into the δ -MAPS framework in order to overcome specific challenges 10 posed by our target problem. Those are trend-EOFs, distance correlation and distance multicorrelation, and the Structural Similarity Index. δ -MAPS is a two-stage algorithm. In the first place, it assembles grid cells with highly coherent temporal evolution into so-called domains. In a second step, the teleconnections between the domains are inferred by means of the non-linear distance correlation. We construct two unipartite and one bipartite network for 22 historical CMIP6 climate projections and two century- 15 long coupled reanalyses (CERA-20C and 20CRv3). Potential non-stationarity is taken into account by the use of moving time windows. The networks derived from projection data are compared to those from reanalyses. Our results indicate that no single climate projection outperforms all others in every aspect of the evaluation. But there are indeed


Introduction
The evaluation of general circulation models (GCM) is one of the key topics of climate sciences. This evaluation is indispensable in the assessment of uncertainties in the projection of climate change. At the same time, it serves as a guideline for further model development.  NorESM2-LM r1i1p1f1 Seland et al. (2020) NorESM2-MM r1i1p1f1 Seland et al. (2020) TaiESM1 r1i1p1f1 Lee et al. (2020) UKESM1-0-LL r1i1p1f2 Sellar et al. (2019) of such parameters are only recently becoming more reliable and less sparse, such that the fidelity of their reanalyses fields is impossible to verify.
The SST (Z500) data was remapped to a common grid of 2.25 • × 2.25 • (2.5 • × 2.5 • ) resolution. Regions with sea ice are 95 avoided in SST as well as circles of 5 • radius around the poles in Z500 because of possibly biased representation of the polar vortices. The analysis is carried out for seasonal anomalies on the overlapping time period from 1901 to 2010.

Methods
The procedure used to assign an assessment score to each model run comprises a number of algorithmic stages that build on each other. As they are not yet well-known in the climatological community, we present them in detail in the following -Distance multivariance and distance multicorrelation (Sect. 3.3.1) -Network comparison with structural similarity index and multivariate network quality score (Sect. 3.4)

Detrending with trend-EOF
Prior to the construction of the δ-MAPS networks, the data has to be detrended to avoid the correlations being distorted by long-110 term trends. Although it is still the most widely used technique, linear detrending has been shown little appropriate to remove the effects of external forcing (anthropogenic and natural) from climatic time series (Frankignoul et al., 2017), given its nonlinear structure and the dynamical response mechanisms including long-range memory. Conventional Empirical Orthogonal Function (EOF) decomposition is not well suited for trend detection either for a number of reasons (Hannachi, 2007), which often cause the spreading of long-term trends between several modes of internal variability. Instead, we will apply a non-115 parametric technique, so-called trend-EOF (Hannachi, 2007), which identifies spatial patterns of trends defined as common non-linear, but monotone increase. The method is based on the Singular Value Decomposition (SVD) of the matrix of inverse ranks, instead of the direct observations as in conventional EOF-analysis. Since sequences of inverse ranks provide a robust measure of monotonicity, trend-EOFs are able to separate patterns associated with monotone (non-linear) trends, albeit small, from patterns not associated with trends.

120
Trend-EOFs have been applied since in a number of studies (e.g. Barbosa and Andersen (2009) PCA-based techniques, which are designed to extract space-time patterns maximizing criteria like persistence, predictability or autocorrelation. In contrast to conventional EOFs, all the tested methods very robustly detect a leading EOF pattern with a respective principal component (PC) that presents a distinct non-linearly increasing trend. We consider trend-EOFs therefore 125 an appropriate technique for identifying anthropogenic Greenhouse Gas (GHG)-forced trends.
Let X = ((x it )) be the matrix of anomaly data at grid cells i = 1 . . . n (numbered consecutively) and times t = 1 . . . T . The time series x i at grid cell i is transformed to the vector of inverse ranks q i by setting q it equal to the time position of the t th largest value in x i . The sequence q i indeed reflects the total monotonicity of x i : in monotone series the inverse ranks are ordered according to the trend. The stronger the trend in x i , the stronger the pattern in q i . By maximizing the correlation pattern is composed of the regression coefficients between the trend-PC w 1 and the anomaly time series of the original field To allow for an annual cycle in the trend patterns, we extend the trend-EOFs in analogy to season-reliant EOFs (Wang 145 and An (2005), see also cyclo-stationary EOFs in Yeo et al. (2017)), Q = (Q MAM |Q JJA |Q SON |Q DJF ) (seasonally centered, inverse ranks calculated for each season individually), which extract a recurrent sequence of seasonal trend patterns with one associated trend-PC for the magnitude of the whole cycle as opposed to one common pattern for all seasons as in non-seasonal EOF analysis or four individual patterns with their associated individual PCs as in seasonal EOFs, respectively. At this stage it would be possible to apply a secondary SVD to the seasonal warming patterns to obtain a smoother annual cycle. While such 150 procedure seems undue for seasonal data, it would be a reasonable approach in the case of monthly data. Instead of applying two sequential EOFs to Q, a tensor decomposition like HOSVD (De Lathauwer et al., 2000) would serve this purpose more elegantly.
After having detrended the time series, we are able to standardize the seasonal variances without the interference of the seasonal trends, which would otherwise bias our estimates. On their part, seasonally varying variances could degrade the 155 estimated correlations between grid cells in the first stage of the δ-MAPS algorithm, giving increased weight to seasons with higher variance. In turn, the spatial component of the variance will be important in the second stage of δ-MAPS, therefore we augment the deseasonalized time series again with their overall (non-seasonal) variance.

160
To infer the functional interactions within and between spatio-temporal gridded datasets of climatological parameters, we adopt the δ-MAPS algorithm proposed by Fountalis et al. (2018). This algorithm is rooted in network sciences/graphical modelling, in which graphs are used to express the dependence structure between random variables. A graph or network consists of a set of nodes connected by a set of edges, which describe the interactions between the nodes. Networks can be classified depending on their topology: simple networks like lattices and fully-connected networks; complex networks like scale-free and small-165 world networks. Small-world networks are often observed in climate and other earth sciences, in the human brain and in social networks. Their nodes are strongly clustered into semi-autonomous components and the average shortest path length between any two nodes is small.
In contrast to structural networks or flow networks, where the edges are physically observable (like wired connections or trajectories of particles, respectively), functional networks are inferred from the behaviour of the nodes. We consider the grid 170 cells of a selected climatological field as the nodes of the graph. The spatial embedding is naturally given by the locations of the grid cells. In Fountalis et al. (2018) the edges of a fully-connected grid cell-level network are defined using the unpruned Pearson correlation ϱ of the time series as association measure between any pair of nodes. Based on this weighted network, the δ-MAPS algorithm identifies semi-autonomous components D 1 . . . D K , called domains. A domain is a spatially contiguous set of grid cells with highly correlated temporal activity. Fountalis et al. (2018) propose an iterative algorithm that alternately 175 expands and merges a preliminary set of domain-seeds S (neighborhoods with locally maximal correlation, 3 × 3 grid cells in our case) so as to find the maximum possible sets of grid cells that satisfy the homogeneity constraint δ: Let D be a spatially contiguous set of grid cells with cardinality |D| where ϱ ij is the correlation between the time series at grid cells i and j and δ is a chosen parameter to regulate the number 180 and size of the domains. The domains are expanded to neighboring grid cells (one at a time) as long as ϱ D ≤ δ. Two domains D i and D j are merged if they contain at least one pair of adjacent grid cells and their union still satisfies the threshold δ. The algorithm stops, when no more domains can be merged or expanded.
The number of domains K generated by this algorithm is not predefined. Overlapping domains are allowed in δ-MAPS, because grid cells might be influenced by more than one physical process. If a grid cell does not satisfy the homogeneity 185 constraint with any of its neighbors, it remains unassigned. Deviating from Fountalis et al. (2018), we use Spearman's Rank correlation to determine the similarity between grid cells to allow for monotone, yet non-linear association. Furthermore, we set the threshold δ for minimal average correlation within a domain to equal a selected high quantile of all pairwise correlations (our δ is not based on a significance test, therefore there is no need to correct for auto-correlation). Lower thresholds allow the domains to expand and merge further resulting in a smaller number of spatially larger domains, which means lower parcellation, 190 and vice versa. In Sect. 4, we will choose δ so as to produce "intuitive" domains evocative of known teleconnection patterns.
In Falasca et al. (2020), the identification of domains was further refined: grid cells are assigned to a common domain

Network of domains
Subsequently, the domains identified above serve as super-nodes in the second stage of δ-MAPS. A functional weighted network is inferred between the domains on the basis of a dependence measure (in Fountalis et al. (2018) the lagged Pearson correlation is used; we will use distance correlation, see Sect. 3.3). The time series of a domain is defined as where φ i is the latitude of grid cell i. In contrast to Falasca et al. (2019), we use the means instead of the sums of the grid cells for domain time series. We do so, because otherwise the variances of the domains would grow with their size, something that would hinder interpretation. On the other hand, the spatial correlation within the domains, the precondition for grid cells to form a domain, impedes the decrease of the variance of the domain mean following the Central Limit Theorem at the rate of |D|. Instead, the variances of the domain means are of comparable magnitude regardless of the domain size.

215
Every possible link with every possible lag −L ≤ l ≤ L is tested for significance, which constitutes a multiple-testing problem such that the cumulative probability of type I errors increases. One way to control the false discovery rate FDR to be smaller than a predefined level α was proposed by Benjamini (2010): the p-levels of the individual tests are ordered ascendingly, p (1) ≤ · · · ≤ p ( 1 2 K(K−1)(2L+1)) , and the hypothesis (H 0 : link is insignificant) is rejected only for those tests, where

220
The network consists of two maps D and W . D : set of nodes (grid cells) −→ power set of domains P (D 1 . . . D K ), which assigns one/several/no domains to every grid cell, and W : set of pairs of domains {D 1 . . . D K } × {D 1 . . . D K } −→ maximal (lagged) dependence ∈ R, which assigns every pair of domains a link that equals the maximal (lagged) dependency between them (we allow lags up to 10 seasons).
The distinction between grid cells that are dependent within the same domain and grid cells that are dependent across two 225 different domains allows δ-MAPS to differentiate between local diffusion phenomena and remote interactions as for instance an atmospheric bridge or an oceanic tunnel (Liu and Alexander, 2007).
Since the techniques to construct the δ-MAPS network are statistical, long time series are convenient in order to obtain robust estimates of the dependence measures. In the case of non-stationarity, such estimates would be biased and reflect only a temporal average connectivity between the components of the network. The time dependence can be addressed using evolving 230 networks, which are constructed over sliding time windows (see for instance Kittel et al. (2021) and Novi et al. (2021)). The present study considers a time-constant network for the period 1901-2010, and a shorter period network for 1951-2010, where more observations are available for assimilation into the reanalyses. To investigate the temporal evolution, a third network is constructed for 1901-1955.
The complex networks framework offers a lot more approaches in order to exploit the richness of the data, as for instance 235 multi-scale, causal, and multi-layer networks. Wavelet multi-scale networks were proposed for investigating interactions in the climate system simultaneously at different temporal scales, revealing features which usually remain hidden when looking at one particular time scale only (Agarwal et al., 2018(Agarwal et al., , 2019. Interactions between processes evolving on different time scales are investigated by Jajcay et al. (2018). Moreover, as the number of identified domains within a climatological field is drastically smaller than the number of original grid cells, this also opens up the possibility of investigating the causal relationships between 240 them (Nowack et al., 2020), although the basic assumption of causal networks inference, that the dependence structure can be represented by a directed acyclic graph, is questionable in the climate context. The construction of both dependence based and causal networks can naturally be extended to cross-networks, which include multiple fields (Feng et al., 2012;Ekhtiari et al., 2021).

Distance covariance and distance correlation 245
As physical processes in climate are highly dynamical and mostly non-linear (Donges et al., 2009), we decided to substitute the Pearson correlation in the second step of network inference by a non-linear dependence measure: distance correlation proposed by Székely et al. (2007). To begin with, distance covariance, calculated from the pairwise Euclidean distances within each sample, is an analogue to the product-moment covariance, but it is zero if and only if the random vectors are independent.
The intuition of distance covariance is that if there exists a dependence between the random variables X and Y , then for two 250 similar realizations of X, say x s and x t , the two corresponding realizations of Y , y s and y t , should be similar as well. Note that the opposite (x s , x t unsimilar =⇒ y s , y t unsimilar) is true for linear dependence, but not true in general.
Unlike the widely used information measures, distance covariance has a compact representation, is computationally fast, and reliable in a statistical sense for sample sizes common in climatology, because it is not necessary to estimate the density of the samples. We use the unbiased version of distance covariance given in Székely and Rizzo (2014). Let (x t ), (y t ), t = 1 . . . T be a 255 statistical sample from a pair of real or vector valued random variables X, Y . First, compute all pairwise Euclidean distances: Then distance covariance dCov is definded as Distance variance dVar and distance correlation dCor are defined analogously to moment variance and moment correlation, resepctively: the need to correct for autocorrelation, as it was done in Fountalis et al. (2018). The correction of autocorrelation involves the estimation of a rather large number of autocorrelation coefficients. This might add to statistical uncertainty and its expendebility is therefore statistically advantageous.
An efficient test of distance correlation based on the χ 2 -distribution was proposed by Shen et al. (2022), which is universally consistent, and valid for α ≤ 0.05: Distance correlation is defined between vectors of arbitrary dimension. One way to take advantage of this property in the construction of networks would be to assign the measurement of more than one climatological variable to every node, e.g. sea surface temperature and salinity, or 500 hPA geopotential height and temperature.
We will apply distance correlation in the network inference between the domains, but not in the construction of the domains.

280
The reason is that in domain construction we are looking for similar temporal behavior between grid cells. We choose Spearman's Rank correlation, because it accounts for non-linear, yet monotone association. In contrast, in network inference we are expressly interested in non-linear dependence including non-monotonicity.

Distance multivariance and distance multicorrelation
Distance correlation has also been generalized to distance multivariance/multicorrelation by Böttcher et al. (2019) to measure 285 the dependence between an arbitrary number n of random variables in the sense of Lancaster interaction (Lancaster, 1969;Streitberg, 1990). The Lancaster interaction ∆F quantifies the fraction of dependence between them that is not explained by factorization, their synergy. For n = 3, let F 123 be the 3-dimensional joint distribution function of X 1 , X 2 , X 3 , F 12 , F 13 and F 23 the pairwise joint and F 1 , F 2 and F 3 the marginal distribution functions. Then the Lancaster interaction is defined as the fraction of F 123 that is not explained by pair-wise dependence. Lancaster interaction excludes, in particular, linear dependence as this is indeed explained by pair-wise dependence.
The concept of higher-order dependence is related to joint cumulants and higher-order moments, in that κ n (X 1 . . . X n ) = x 1 . . . x n d∆F (Streitberg, 1990). Joint cumulants are traditionally applied in multiple-point statistics and hyper-spectral analysis to describe non-linear interaction and non-gaussian multidimensional distributions. Climate science has seen only 295 a small number of implementations, including the contributions of C. A. L. Pires related to teleconnections (e.g. Hannachi (2017, 2021)). As a feature of complex systems, higher-order interactions have already been recognized as critical for the emergence of complex behavior such as synchronization and bifurcation in scientific fields as diverse as social networks science, ecology, molecular biology, quantum physics, neurosciences, epidemics, geodesy, image processing and genetics (Battiston et al., 2020), and tools for the construction of hypergraphs (graphs with links that comprise more than two nodes) 300 are increasingly available. To our knowledge, hypergraphs have not yet been introduced in climatology.
Distance multivariance is defined analogously to distance variance (Equation 3) and is a strongly consistent estimator of Lancaster interaction (Böttcher et al., 2019). For n = 3, with C st the analogue to A st and B st for a third random variable Z: and likewise distance multicorrelation with a slightly differing normalisation: Obviously, distance covariance between two random variables is covered by distance multivariance for n = 2. Significance tests for distance multivariance are also given in Böttcher et al. (2019). As the asymptotic test is conservative and furthermore, in the case of non-zero pairwise dependence, the test statistic is not guaranteed to diverge, it is convenient to choose a larger FDR level than the usually employed significance levels between 0.1 and 0.01.
Alternatively, M could be rearranged in a 4-modal hypermatrix or tensor made of the Kronecker product of the lat-lon field times itself containing the dependencies between the grid cells.
Apart from replacing Pearson by distance correlation, our definition of M differs from the one in Falasca et al. (2019) in three 320 aspects. Firstly, our links are undirectional, because distance correlation is much less sensitive to temporal lag than Pearson correlation. The distance correlation coefficients for lags −10 ≤ L ≤ 10 differ only marginally from the value for L = 0.
So although we do construct M using maximum lagged distance correlation, we do not venture to infer the direction of the interaction from it. Secondly, we have defined W (D k , D k ) = 1, causing M ij = 1 if x i and x j pertain to the same domain (and no other) to emphasize that grid cells within one domain are more strongly linked to each other than to the grid cells of netCorr between two networks measures the spatial correlation between the respective adjacency matrices, not considering the overall level and variability within the networks. We propose to augment netCorr to netSSIM. SSIM is the Structural Simi-330 larity Index, a measure very popular in image processing, which combines terms for brightness (mean), contrast (variance) and structure (pattern correlation) of images (Wang et al., 2004). It was introduced to the hydrological/meteorological community by Mo et al. (2014). Let X, Y be two gridded fields: where µ X , µ Y are the means, σ 2 X , σ 2 Y are the variances and σ XY is the Pearson covariance between X and Y , and small 335 constants c 1 = c 2 = c 3 (we choose 0.00001) to ensure regularity. The SSIM ranges from −1 to 1, it equals 1 only in case of identity and −1 for an anti-analogue (equal mean and variance, but correlation = −1). SSIM = 0 means no similarity. Note that the SSIM is not invariant under translation and rotation, which corresponds to our requirements, because we want the teleconnections to sit in the right place. SSIM is not a distance metric, but a distance metric can be constructed from it (Brunet et al., 2011). We apply the (latitude weighted) SSIM to two adjacency matrices M (Equation 8) constructed from the significant distance correlations in two reference and/or model networks. In this way, we calculate netSSIM indices for the unipartite networks for SST and Z500, and for the cross-networks between the SST and Z500 domains.
Alternatively, we could calculate the SSIM between adjacency matrices in a point-wise manner, comparing the slices of the In order to combine the three NQSs wrt. SST, Z500 and SST-Z500, we take the geometric mean (equal to the exponential of the arithmetic mean of the squared differences (1 − netSSIM) 2 ). This shall be the Multivariate Network Quality Score MNQS: The MNQS corresponds to the exponential transform of the squared Euclidean distance between the 3-dimensional vector-netSSIM to the ideal vector-netSSIM value (1, 1, 1), which would be attained by a network identical to the reference, normalized with the distance between (1, 1, 1) and the value (0, 0, 0) indicating no similarity.
Any other vector norm could be utilized for the construction of MNQS, for instance an L p -norm with p ̸ = 2 or some weight-365 ing of the directions. netSSIMs for additional parameters can be incorporated into the MNQS in a straight forward way. Finally, the considered models can be ranked with respect to these scores.
The netSSIM is also useful, when exploring the differences between networks in more detail. As mentioned above, the slices of M wrt. a single grid cell or domain can be compared one by one. It is further possible to calculate the netSSIM for all pairwise links in a certain region, excluding the rest of the globe, or for all links from one region to another. This way, 370 differences across models or time periods can be tracked down directly to their origin.

Results and discussion
We will demonstrate the functioning of every sub-procedure considering the CERA-20C ensemble mean over the whole period 1901-2010 as an example. All procedures are furthermore applied to the periods 1901-1955 and 1951-2010. Individual runs of CERA-20C as well as 20CRv3 and CMIP6 model realisations will be discussed depending on special interest. Analogous plots for geopotential height anomalies in 500 hPa for the CERA-20C ensemble mean can be found in Figure   390 1(c) and (d). Unfortunately, we were not able to find any comparable study in the literature, where Z500 was analysed for trend over the 20 th century. Gillett et al. (2013), Knutson and Ploshay (2021), Garreaud et al. (2021), and Raible et al. (2005) considered SLP trends over different time periods and regions. Although not fully comparable, there is a certain similarity. The projected trends as well as the loading patterns in the 20CRv3 best estimate are somewhat different for the period 1901-2010 ( Figure S2) , but agree much better for 1951-2010 (not shown). This might well be related to low observational 395 coverage during the first half of the century, we thus take this disagreement as a signal for caution.
When subject to the same procedure, the CMIP6 model output SST and Z500 anomalies produce trend-EOFs and loading patterns roughly similar to CERA-20C and 20CRv3 (not shown). Differences are more or less obvious, though, such that an evaluation of the GMSST anomaly time series in the spirit of Papalexiou et al. (2020) would be an obvious choice, but is out of the scope of this paper.     Oscillation (NAO) with a link a13↔a14 (and much weaker a1↔a14). Other complex teleconnections also seem to involve the Spensberger et al. (2020). Lagged dependence is irrelevant in the Z500 network.
We notice that many known atmospheric teleconnections are defined as higher-order modes of some EOF decomposition.
As such they exist only as additive modulations of their corresponding leading modes. We would therefore not expect to find many of them in our networks.
Network methods allow the investigation of interactions between different climatological fields in a straight forward way, 465 constructing cross-networks between (in our case) SST and Z500 domains that describe the coupled ocean-atmosphere variability (Liu and Alexander, 2007). We notice that the inference of links between the domains of two unipartite networks is different from the construction of bipartite communities in multi-layer networks as in Ekhtiari et al. (2021). Here, we just calculate the distance correlations between pairs of one SST and one Z500 domain. The inferred CERA-20C SST-Z500 cross-links are shown in Figure 2(d). The connectivity is mostly quite weak, except for the cross-links from the Tropical Belt 470 (a15) and the northern and southern Pacific Z500 domains (a9, a10, a12) to the ENSO-related SST domains (o7, o8, o9, o10, o11), the tropical Indian Ocean (o3), but also the Great Australian Bight (o6). This feature was also observed by Feng et al. (2012), who related it to the Walker circulation. Z500 domains a4 and a7 participate in this pattern, but to a lesser extent. Z500 domains over oceans are usually connected to their underlying SST counterparts (a4↔o2, a7↔o6, a9/a10↔o8, a14↔o13 [SST modulating the NAO], a15↔o3/o11), although in the Atlantic this dependence is exceptionally weak (a15↔o1/o15, a16↔o14, 475 a3↔o12). But teleconnections to more distant SST domains are, in some instances, as strong as or even stronger than those proximate cross-links (a4↔o3/o10/o11/o12, a7↔o12, a14↔o15). Interestingly, the Arctic Z500 domain (a13) is weakly linked to the AMO-domain (o15), but not to the North Pacific. Except for slightly increased overall connectivity levels, supposedly mediated by the SST, allowing for lagged dependence does not change the network.
The analogous plot for 20CRv3 can be found in Figure S3.

3 rd -order interactions
As before, this subsection presents only results for CERA-20C over the time period 1901-2010. The overall high level of connectivity between SST domains motivated us to take a deeper look into the dependence structure of the climate system. In a modest first attempt, we search for interacting triples in the sense of Lancaster, in graph theory termed as 2-hyperedges, taking all combinations of three SST domains and calculating their distance 3 rd -order multicorrelation as introduced in Sect. 3.3.1

485
Equation 7. As discussed there, we choose a large FDR level α = 0.2 in order not to suppress too many distance multicorrelations. To avoid cumbersome evaluations with different lag combinations, we stick to instantaneous networks.
Only a small number (13) of significant 3 rd -order dependencies is detected (we list them in Table 2   interaction. Even though, one distance multicorrelation is also detected in the North Atlantic: the SST triple (o14,o15,o16), which corresponds to the AMO.

495
Note that not every triple with strong pairwise dependencies also has a significant 3 rd -order dependence. Table 2 shows the hyperedges along with their distance multicorrelation and the sum of their pairwise distance correlations. As distance multicorrelation is symmetric, every significant hyperedge is listed only once in the table. Note also that the sum of pairwise distance correlations is not bounded by 1, because the pairwise dependencies are not mutually exclusive. Although the detected distance multicorrelations are significant, they are at most 20% of the sum of the respective pairwise distance correlations. That 500 means 3 rd -order interactions complement, but not outweigh pairwise dependence in the 3-dimensional joint dependence.
The same comments essentially apply to cross-hyperedges consisting of two SST and one Z500 domains or one SST and two Z500 domains. We detected 15 and 5 significant cross-hyperedges, respectively, in the Pacific, which all resemble some ENSO interaction. The Z500 domains a13 and a14 (NAO) have no notable distance multicorrelation with North Atlantic SST domains, indicating that the North Altlantic is linked to the NAO-domains on a pairwise basis (o15↔a13/a14), but no 505 higher-order interaction is taking place. There is no hyperedge of three Z500 domains with significant multicorrelation. Known atmospheric tripoles like the Arctic Oscillation (a9, a13, a14) and the Pacific North America Pattern (a10, a11, a14) apparently lack significant 3 rd -order dependence.
We believe that the construction of higher-order networks including hyperedges by means of distance multicorrelation might well be one step towards understanding the synergies emerging from multivariate coupling of large-scale oceanic/atmospheric 510 teleconnections.

Reference networks
We turn to the comparison of reference networks in terms of the NQS and MNQS criteria (see Sect. 3.4), calculated from the adjacency matrices M containing the regionally distributed distance correlation links between all pairs of domains (Equation 515 8). As CERA-20C was produced as a 10-member ensemble representing the inevitable sampling and modelling uncertainty inherent in the production process, we take this opportunity to construct the δ-MAPS networks individually for each member.
The results are matched to the networks derived for the CERA-20C ensemble mean.   -20C 20CRv3 1901-19551951-20101901-20101901-19551951-20101901-2010CERA-20C 1901-1955   strength from ENSO to the tropical Indian Ocean, but also from ENSO to the North Tropical Atlantic, to the North Pacific, and to the Southern Ocean. The same is visible in the network slices for the AMO-domain (Figure 3(d),(e),(f)).
In contrast, the CERA-20C Z500 network slices for the Tropical Belt (Figure 3(g),(h),(i)) bear more apparent similarity than the SST network slices, which was already apparent in the Network Quality Scores above. Although the shape of the Tropical Belt differs slightly more than the shape of the ENSO-domain, the links to the rest of the globe resemble each other more 545 strongly. However, the domains over the North Pacific and the southern Indian Ocean seem somewhat ambiguous.
The cross-links from ENSO to the Z500 domains (Figure 3(j),(k),(l)) and from the Tropical Belt to the SST domains (not shown) show similar differences like the unipartite networks. Yet, the stabilizing effect of the self-links (large patches with distance correlation 1) does not apply to the SST-Z500 cross-networks, such that the network scores may turn out a little lower.

550
As regards the second reanalysis 20CRv3, we observe strong similarity to the CERA-20C ensemble mean on the two shorter time periods 1951-2010and 1901-1955 and MNQS= 0.88, Table 3 and Figure S4), where disagreement within the same time period is mainly restricted to higher southern latitudes (remember that the SSIM includes an area weighting). But dissimilarities between the first and the second half of the century are stronger in 20CRv3 than in CERA-20C (MNQS= 0.81 and MNQS= 0.87, Table 3). Notably, in 1901-1955 shows the same strong connection between SST domains around 555 the whole tropics as CERA-20C, which is lost in 1951-2010 in both reanalyses. In contrast, the similarity between 20CRv3 and CERA-20C is slightly reduced in 1901-2010 (MNQS= 0.82, Table 3 and Figure S4) mainly due to differing atmospheric interactions and the weaker cross-links in 20CRv3 compared to CERA-20C (NQS SST = 0.94, NQS Z500 = 0.81, NQS SST-Z500 = 0.72). In all three networks (SST, Z500, SST-Z500) we observe that regional unsimilarity increases with latitude. Table S1 shows pairs of most similar domains between CERA-20C and 20CRv3 along with their domain-wise network quality score.  Table 3 and Figure S4). For example, in contrast to CERA-20C, the link between ENSO and the South Pacific vanishes after 1950 in 20CRv3. This might be a consequence of sparse observations in the first half of the century and thus a stronger dynamical heritage from the models used to produce the reanalyses. On the other hand, there might have been changes in connectivity driven by increasing GHG levels, which are 565 not equally reflected in CERA-20C and 20CRv3 (they are model results after all). Caution leads us therefore to restrict the comparison of CMIP6 data sets to reanalyses to the period 1951-2010.

CMIP6 networks
The networks belonging to the CMIP6 historical projections (listed in Table 1) are compared in Figure 4 to the CERA-20C ensemble mean (bold black cross marks) and to the 20CRv3 best estimate (bold red cross marks) on the time period and in terms of MNQSs for each reference respectively (d). Finally we take the average of both MNQSs to account for the uncertainty inherent in the reanalyses: 1/2(MNQS(CERA-20C) + MNQS(20CRv3)) ((e), bold crossmarks). As expected, the similarity between models and references is generally weaker than between references, although in the Z500 networks some The differences between the reanalyses are also reflected in the MNQSs of the models, where the reanalyses agree very well upon some models (HadGEM3-GC31-LL, IPSL-CM6A-LR, MPI-ESM11-2-HR, MIROC-ES2L, MIROC6), but less upon others (MRI-ESM2-0, TaiESM1, CNRM-CM6-1, CNRM-ESM2-1). But altogether, a tendency to differentiate between more/less similar models wrt. reanayses is clearly visible. We conclude that, when combining several references from independent 585 sources, the average MNQS over these references is a valid evaluation instrument for assessing, whether the teleconnec-tions between large climate components in a general circulation model are realistically represented. Still, as our evaluation is restricted to a single run per model, we are not able to differentiate between good runs and good models as such.
Using the example of four of the highest-ranking GCM runs wrt. MNQS, we illustrate in short the opportunities offered by the δ-MAPS approach to detect model deficiencies. We examine some of the point-wise adjacency maps of EC-Earth3, 590 UKESM1-0-LL, MPI-ESM1-2-HR and IPSL-CM6A-LR in comparison to CERA-20C and20CRv3 over 1951-2010 (Figures S5-S9). In the SST networks, we notice that differences are not restricted to higher latitudes, as was the case for the two reanalyses. Even in the main feature of interannual variability, ENSO, spatial connectivity deviates significantly. In all models the tropical Indian Ocean depends much more strongly, although to varying degrees, on ENSO than in both reanalyses ( Figure   S5). EC-Earth3 and IPSL-CM6A-LR not at all reproduce the northern extension of the ENSO domain seen in both reanalyses 595 ( Figure S5(a),(d),(e) and (f)), which reflects the widely recognized low-frequency interdependency between ENSO and the Pacific Decadal Oscillation (PDO) (Henley et al., 2015). The links to the southern Indian Ocean and the South Atlantic differ considerably across models, but no model shows better performance in all domains. In MPI-ESM1-2-HR the dependence between AMO and ENSO is exaggerated, whereas in IPSL-CM6A-LR the Norwegian Sea is nearly disconnected from the Tropical North Atlantic, which is not consistent with AMO ( Figure S6(c) and (d)). As regards Z500, UKESM1-0-LL shows 600 an unrealistic link between the Tropical Belt and the Antarctic domain ( Figure S7(b)). At the same time, the dependence of the Arctic domain is matched well only in UKESM1-0-LL ( Figure S8(b)). In contrast, the cross-links from ENSO to Z500 are well-represented in all four models ( Figure S9).
Continuing the analysis of all point-wise adjacency maps, it would be possible to identify regions/climate phenomena of higher and lower confidence in any model. An exercise that might be instructive for both modelling groups and downstream 605 users of climate projections.

Conclusion
In order to evaluate the physical plausibility of CMIP6 GCM output, we have constructed functional interaction networks within and between the SST and Z500 multivariate time series of two reanalyses (CERA-20C and 20CRv3) and 22 GCM output data sets using the δ-MAPS procedure. In response to several theoretical challenges related to the nature of long term 610 climate data, a number of innovations were introduced into δ-MAPS: -Detrending with season-reliant trend-EOFs -Network construction using distance correlation -Distance multicorrelation for higher-order interactions -Network comparison with the structural similarity index 615 -Construction of a multi-reference multivariate network quality score First of all, the two reanalyses were compared to one another in considerable detail, including the temporal evolution of the interactions in the course of the 20 th century. It could not be excluded that inconsistencies between the first and second half of the century arise at least partly from data uncertainty. The evaluation of CMIP6 model output against the references revealed a very high general similarity of the atmospheric connectivity, though with gradual differences. Oceanic teleconnections are less 620 accurately reflected and the model differences more pronounced. The strongest deviations are found in the cross-networks between Z500 and SST, which co-occur sometimes, but not always, with lower network quality scores in the unipartite networks.
We combined the three network quality scores for each CMIP6 model on an equal basis, emphasizing the equivalent importance of all considered geophysical subsystems in the generation of the earth's climate. Taking into account the uncertainty inherent in any reference, the average multivariate network quality score over several, preferably independent, references can 625 certainly be considered a suitable criterion to assess the similarity of physical interactions between climate components in a model to those in observations.
In addition, the proposed complex networks framework combined with the distance correlation measure offers many promising multivariate extensions of δ-MAPS as, for example, node definition based on multivariate time series, consideration of higher-order dependence, interactions on multiple time scales and time-evolving networks. Such comparisons could be very 630 useful to investigate subtle differences between various reanalyses. Besides, the characterization of network evolution from past to future could add a new facet to the understanding of climate change.
Code and data availability. Author contributions. CD developed the concept, processed the data, prepared the manuscript, and produced all figures; KW and AW contributed with in-depth discussions, interpretation and review.
Competing interests. The contact author has declared that neither she nor her co-authors have any competing interests.