Reconstructing coupled time series in climate systems using three kinds of machine learning methods

Abstract
Despite the great success of machine learning, its application in climate dynamics has not been well developed. One concern might be how well trained neural networks can learn a dynamical system, and what the potential applications of this kind of learning are. In this paper, three machine learning methods are used: reservoir computer (RC), back propagation based artificial neural network (BP), and long short-term memory neural network (LSTM). It is shown that the coupling relations or dynamics among variables in linear or nonlinear systems can be inferred by RC and LSTM, which can be further applied to reconstruct one time series from another. Specifically, we analyzed climatic toy models to address two questions: (i) what factors significantly influence machine learning reconstruction; and (ii) how to select suitable explanatory variables for machine learning reconstruction. The results reveal that both linear and nonlinear coupling relations between variables influence the reconstruction quality of machine learning. If there is a strong linear coupling between two variables, then the reconstruction can be bi-directional, and each of the two variables can serve as an explanatory variable for reconstructing the other. When the linear coupling among variables is absent but the nonlinear coupling is significant, the machine learning reconstruction becomes direction-dependent.


Introduction
Applying neural network-based machine learning in climate fields has attracted great attention (Reichstein et al., 2019). Machine learning approaches can be applied to downscaling and data mining analyses (Mattingly et al., 2016; Racah et al., 2017), and are also used to predict the time series of climate variables, such as temperature, humidity, runoff and air pollution (Zaytar and Amrani, 2016; Biancofiore et al., 2017; Kratzert et al., 2019; Feng et al., 2019). Besides, previous studies found that some temporal dynamics of the underlying complex systems can be encoded in these climatic time series. For example, chaos is a crucial property of climatic time series (Lorenz, 1963; Patil et al., 2001). Thus, there is significant interest in the ability of machine learning algorithms to reconstruct the temporal dynamics of the underlying complex systems (Pathak et al., 2017; Du et al., 2017; Lu et al., 2018; Carroll, 2018; Watson, 2019). The chaotic attractors of the Lorenz system and the Rossler system can be reconstructed by machine learning (Pathak et al., 2017; Lu et al., 2018; Carroll, 2018), and the Poincare return map and Lyapunov exponent of the attractor can be recovered as well (Pathak et al., 2017; Lu et al., 2017). These results are important for a deep understanding of the applicability of machine learning in climate fields.
However, it remains unclear what the key factor is that determines the performance of machine learning approaches on climatic time series. This is crucial for investigating why machine learning cannot perform well with some datasets, and how to improve its performance on them. One possible key factor is the coupling between different variables, because different climate variables are coupled with one another in different ways (Donner and Large, 2008), and coupled variables share their information content with one another through information transfer (Takens, 1981; Schreiber, 2000; Sugihara et al., 2012). Furthermore, a coupling often results in statistically correlated observational time series (Brown, 1994). Correlation is a crucial property of the climate system, and it often influences the analysis of climatic time series. The Pearson coefficient is often used to detect correlation, but it can only detect linear correlation. It is known that when the Pearson correlation coefficient is weak, most tasks based on traditional regression methods, such as fitting, reconstruction and prediction, will fail in dealing with climatic data (Brown, 1994; Sugihara et al., 2012; Emile-Geay and Tingley, 2016). However, a weak linear correlation does not mean that there is no coupling relation between the variables. Previous studies (Sugihara et al., 2012; Emile-Geay and Tingley, 2016) have suggested that, although the linear correlation of two variables may be absent, they might be nonlinearly coupled. For instance, the linear cross-correlations of sea air temperature series observed in different tropical areas are overall weak, but they can be strong locally and vary with time (Ludescher et al., 2014), and such time-varying correlation is an indicator of nonlinear correlation (Sugihara et al., 2012). These nonlinear correlations of the sea air temperature series have been found to be conducive to better El Niño predictions (Ludescher et al., 2014; Conti et al., 2017). The linear correlations between the ENSO/PDO indices and some proxy variables are also overall weak, but nonlinear coupling relations between them can be detected, and they contribute greatly to reconstructing longer paleoclimate time series (Mukhin et al., 2018). These studies indicate that nonlinear coupling relations can contribute to better analysis, reconstruction, and prediction (Hsieh et al., 2006; Donner, 2012; Schurer et al., 2013; Badin et al., 2014; Drótos et al., 2015; Van Nes et al., 2015; Comeau et al., 2017; Vannitsem and Ekelmans, 2018). Accordingly, when applying machine learning to climatic series, is it necessary to pay attention to the linear or nonlinear relationships induced by the physical couplings? This is what we want to address in this study.
In a recent study (Lu et al., 2017), a machine learning method called reservoir computer was used to reconstruct unmeasured time series in the Lorenz 63 model (Lorenz, 1963). It was found that the Z variable can be well reconstructed from the X variable by a reservoir computer, but the reconstruction of X from Z failed. Lu et al. (2017) demonstrated that the nonlinear coupling dynamics between X and Z was responsible for this asymmetry in the reconstruction. This can be explained by the nonlinear observability in control theory (Hermann and Krener, 1977; Lu et al., 2017): for the Lorenz 63 equations, both (X(t), Y(t), Z(t)) and (−X(t), −Y(t), Z(t)) can be solutions. Therefore, when Z(t) acts as an observer, it cannot distinguish X(t) from −X(t), and the information content of X in Z(t) is incomplete, which explains why X cannot be reconstructed by machine learning. The nonlinear observability of a nonlinear system with a known equation can be analyzed easily (Hermann and Krener, 1977; Schumann-Bischoff et al., 2016; Lu et al., 2017). But for observational data from a complex system without any explicit equation, the nonlinear observability is hard to infer, and few studies have investigated this question. Furthermore, does such asymmetric nonlinear observability in the reconstruction also exist in nonlinearly coupled climatic time series? This is still an open question.
In this paper, we apply machine learning approaches to learn the coupling relation between climatic time series (training period), and then reconstruct the series (testing period). Specifically, we aim to make progress on how the machine learning approach is influenced by the physical couplings of climatic series, so that the abovementioned questions can be addressed. There are several variants of machine learning methods (Reichstein et al., 2019), and recent studies (Lu et al., 2017; Reichstein et al., 2019; Chattopadhyay et al., 2020) suggest that three of them are particularly applicable to sequential data (like time series): reservoir computer (RC), back propagation based artificial neural network (BP), and long short-term memory neural network (LSTM). Here we adopt these three methods to carry out our study, and provide a performance comparison among them. We first investigate how their performance depends on different coupling dynamics by analyzing a hierarchy of climatic conceptual models. Then we use a novel method to select explanatory variables for machine learning, which can further detect the nonlinear observability (Hermann and Krener, 1977; Lu et al., 2017) of a complex system without known explicit equations.
Finally, we will discuss a real-world example from the climate system. It is known that there is atmospheric energy transport between the tropics and the Northern Hemisphere, which can result in a coupling between the climate systems of these two regions (Farneti and Vallis, 2013).
Due to the underlying complicated processes, it is difficult to use a set of formulas to describe the coupling relation between the tropical average surface air temperature (TSAT) series and the Northern Hemispheric surface air temperature (NHSAT) series. We employ machine learning methods to investigate whether the NHSAT time series can be reconstructed from the TSAT time series, and whether the TSAT time series can also be reconstructed from the NHSAT time series. In this way, the conclusions from our model simulations can be further tested and generalized.
Our paper is organized as follows. In section 2, the methods for reconstructing time series and detecting coupling relations are introduced. The analyzed data and climate conceptual models are described in section 3. In section 4, we investigate the association between the coupling relation and the reconstruction quality of machine learning, and present an application to real-world climate series. Finally, a summary is given in section 5.

Learning coupling relations and reconstructing coupled time series
Firstly, we introduce our workflow for learning the couplings of dynamical systems by machine learning and reconstructing the coupled time series. The total time series is divided into two parts: the training series (with duration denoted as t) and the testing series (with duration denoted as t').
For the toy-model systems, the coupling relation or dynamics is stable and unchanged with time, i.e., there is a stable coupling or dynamical relation between the series, which can be written generically as b(t) = F[a(t)], where F represents the coupling dynamics. The neural network learns F from the training series, and is then applied to reconstruct b(t') from a(t') in the testing period.

Reservoir computer
A newly developed neural network called RC (Du et al., 2017; Lu et al., 2017; Pathak et al., 2018) has three layers: the input layer, the reservoir layer and the output layer (see Fig. 2). The input series a(t) (a vector with length L) is fed into the input layer and the reservoir layer. There are four components in these layers: the reservoir state r(t) (a vector with dimension N, representing the N neurons), the adjacency matrix M (size N×N) representing the connectivity of the N neurons, the input-to-reservoir weight matrix W_in (size N×L), and the unit matrix I (size N×N), which is crucial for modulating the bias in the training process (Lu et al., 2018). The elements of M and W_in are randomly chosen from a uniform distribution in [−1, 1], and we set N = 1000 here (we have tested that this choice yields good performance). These components are employed to derive the updated reservoir state r(t + Δt):

r(t + Δt) = tanh(M r(t) + W_in a(t)). (1)

The reservoir state r(t) then enters the output layer, which consists of the reservoir-to-output matrix W_out, and is used to estimate the value b̂(t):

b̂(t) = W_out r(t). (2)

W_out is a trainable matrix that fits the relation between r(t) and b(t) in the training process, and it is obtained by ridge regression:

W_out = argmin_W { Σ_t ‖W r(t) − b(t)‖² + β ‖W‖² }, (3)

where ‖·‖ denotes the L2-norm of a vector (L2 represents the least-squares method) and β is the ridge regression coefficient, whose value is determined after the training. After this reservoir neural network is trained, we can use it to estimate b(t'), and the estimated value is denoted as b̂(t').

Back propagation based artificial neural network
Here, the BP neural network used is a traditional neural computing framework, and it has been widely used in climate research (Watson, 2019; Reichstein et al., 2019; Chattopadhyay et al., 2020).
There are six layers in the BP neural network: the input layer with 8 neurons; 4 hidden layers with 100 neurons each; and the output layer with 8 neurons. In each layer, the connection weights of the neurons are computed during the training process, using back propagation optimization with the gradient descent algorithm (Dueben and Bauer, 2018). A crucial difference between the BP and RC neural networks is as follows: unlike RC, the neuron states of the BP neural network are independent of the temporal variation of the time series (Reichstein et al., 2019; Chattopadhyay et al., 2020), while the neurons of RC can track the temporal evolution (such as the neuron state r(t) in Fig. 2) (Chattopadhyay et al., 2020). If a(t) and b(t) are two time series of a system, the BP neural network can be used to reconstruct b(t) from a(t).
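The layer structure described above can be sketched as a forward pass; this is only an illustrative skeleton (the tanh activations and weight initialization are our assumptions, and training by back propagation is omitted).

```python
import numpy as np

rng = np.random.default_rng(1)

# Layer sizes as described in the text: input 8, four hidden layers of 100
# neurons each, output 8.
sizes = [8, 100, 100, 100, 100, 8]
weights = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    """Forward pass of the BP network: tanh hidden layers, linear output."""
    for W, c in zip(weights[:-1], biases[:-1]):
        x = np.tanh(W @ x + c)
    return weights[-1] @ x + biases[-1]
```

Note that the output depends only on the current input vector x: nothing in the forward pass carries state between time steps, which is exactly the difference from RC noted above.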

Long short-term memory neural network
The LSTM neural network is an improved recurrent neural network designed to deal with time series (Reichstein et al., 2019; Chattopadhyay et al., 2020). As Fig. 3 shows, LSTM has a series of components: a memory cell, an input gate, an output gate, and a forget gate. When a time series a(t) is input to train this neural network, the information of a(t) flows through all these components, and the parameters at the different components are computed to fit the relation between a(t) and b(t). The crucial improvement of LSTM over the traditional recurrent neural network (Reichstein et al., 2019) is that LSTM has a forget gate, which controls how the information of previous times flows into the neural network. This gives the neuron states of LSTM the ability to track the temporal evolution of a time series (Kratzert et al., 2019; Reichstein et al., 2019; Chattopadhyay et al., 2020), and this is the crucial difference between the LSTM and BP neural networks.
Here, we also test the LSTM neural network without the forget gate, and call it LSTM*. This means that the information of previous times cannot flow into the LSTM* neural network, which therefore has no memory of past information. We will compare the performance of LSTM with that of LSTM*, so that the role of the neural network's memory of previous information can be assessed.
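A schematic, untrained LSTM cell can illustrate the gate structure. The weight shapes and the reading of LSTM* (dropping the forget-gate term f·c from the cell-state update, so the cell state carries no past information) are our assumptions; the paper does not specify its exact LSTM* formulation.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """One step of a schematic (untrained) LSTM cell. With
    use_forget_gate=False the f * c_prev term is dropped, one plausible
    reading of the LSTM* variant described above."""

    def __init__(self, n_in, n_hidden, use_forget_gate=True):
        self.use_forget_gate = use_forget_gate
        shape = (n_hidden, n_in + n_hidden)
        # One random weight matrix per gate: forget, input, candidate, output.
        self.Wf, self.Wi, self.Wg, self.Wo = (
            0.1 * rng.standard_normal(shape) for _ in range(4)
        )

    def step(self, x, h, c):
        z = np.concatenate([x, h])
        f = sigmoid(self.Wf @ z)   # forget gate: how much old cell state to keep
        i = sigmoid(self.Wi @ z)   # input gate
        g = np.tanh(self.Wg @ z)   # candidate cell state
        o = sigmoid(self.Wo @ z)   # output gate
        c_new = i * g + (f * c if self.use_forget_gate else 0.0)
        h_new = o * np.tanh(c_new)
        return h_new, c_new
```

In the full variant, past information accumulates in the cell state c through the f·c term; removing it forces the cell state to be rebuilt from scratch at every step.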

Evaluation of reconstruction quality
The Root Mean Square Error (RMSE) of residuals is used here to evaluate the quality of reconstruction (Hyndman and Koehler, 2006). The residual represents the difference between the real series b(t') and the reconstructed series b̂(t'), and it is defined as

res(t') = b(t') − b̂(t'). (4)

In order to fairly compare the errors of reconstructing different processes with different variability and units (Hyndman and Koehler, 2006; Pennekamp et al., 2018; Huang and Fu, 2019), the RMSE is normalized by the standard deviation of the real testing series:

nRMSE = sqrt(mean[res(t')²]) / std(b(t')). (5)
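The metric above amounts to a one-line function (the normalization by the standard deviation of the real series follows the definition given here):

```python
import numpy as np

def nrmse(real, reconstructed):
    """RMSE of residuals, normalized by the std of the real series so that
    errors of processes with different variability and units are comparable."""
    real = np.asarray(real, float)
    res = real - np.asarray(reconstructed, float)
    return np.sqrt(np.mean(res ** 2)) / np.std(real)
```

A perfect reconstruction gives nRMSE = 0, while predicting the series mean everywhere gives nRMSE = 1, which is a convenient baseline.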

Linear correlation
As mentioned in the introduction, the linear Pearson correlation is a commonly-used method to quantify the linear relationship between two observational variables. The Pearson correlation between two series a(t) and b(t) is defined as

r = mean[(a(t) − mean(a)) (b(t) − mean(b))] / [std(a) std(b)], (6)

where the symbols "mean" and "std" denote the average and standard deviation of the series a(t) and b(t), respectively.
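This definition translates directly into code (equivalent to NumPy's built-in `np.corrcoef`, which we use as a cross-check):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation as defined above: the covariance of the two series
    normalized by the product of their standard deviations."""
    a = np.asarray(a, float)
    b = np.asarray(b, float)
    return np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std())
```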

Convergent cross mapping
To measure the nonlinear coupling relation between two observational variables, we choose the convergent cross mapping (CCM) method that has been demonstrated to be useful for many complex nonlinear systems (Sugihara et al., 2012;Tsonis et al., 2018;Zhang et al. 2019).
Considering a(t) and b(t) as two observational time series, we begin with the cross mapping (Sugihara et al., 2012) from a(t) to b(t) through the following steps: i) Embedding a(t) (with length L) into the phase space with the delay vectors M_a(t) = (a(t), a(t − τ), …, a(t − (m − 1)τ)) (each vector represents a historical moment in the observations), where the embedding dimension m and the time delay τ can be determined through the false nearest neighbor algorithm (Hegger and Kantz, 1999).
ii) Estimating the weight parameter w_i, which denotes the weight associated with the pair of vectors M_a(t) and M_a(t_i) (t denotes the expected time in this cross mapping, and t_1, …, t_{m+1} are the times of the m + 1 nearest neighbors of M_a(t)); it is defined as w_i = u_i / Σ_{j=1}^{m+1} u_j, with u_i = exp[−d(M_a(t), M_a(t_i)) / d(M_a(t), M_a(t_1))], where d(M_a(t), M_a(t_i)) denotes the Euclidean distance between the vectors M_a(t) and M_a(t_i). The nearest neighbor to M_a(t) generally corresponds to the largest weight.
iii) Estimating b(t) by the weighted sum over the neighbors, b̂(t)|M_a = Σ_{i=1}^{m+1} w_i b(t_i), where b̂(t)|M_a denotes the estimated value of b(t) from this phase-space cross mapping. Then, we evaluate the cross mapping skill (Sugihara et al., 2012; Tsonis et al., 2018) as the correlation between the real and estimated series, ρ_{a→b} = corr(b(t), b̂(t)|M_a). The cross mapping skill from b to a is measured according to the same steps, denoted as ρ_{b→a}.
Sugihara et al. (2012) and Tsonis et al. (2018) defined the causal inference based on ρ_{a→b} and ρ_{b→a} as follows: (i) if ρ_{a→b} converges as L is increased, and ρ_{a→b} is of high magnitude, then b is suggested to be a cause of a; (ii) if ρ_{b→a} also converges as L is increased, and is of high magnitude, then the causal relationship between a and b is bi-directional (a and b cause each other). In our study, all values of the CCM indices are measured once they have converged with the data length (Tsonis et al., 2018).
According to previous studies (Sugihara et al., 2012; Ye et al., 2015), the CCM index is related to the ability to reconstruct one variable from another: if b influences a but a does not influence b, the information content of b is encoded in a (through the information transfer from b to a), but the information content of a is not encoded in b (there is no information transfer from a to b). In this way, the time series of b can be reconstructed from the records of a. For the CCM index ρ_{a→b}, its magnitude represents how much of the information content of b is encoded in the records of a. Therefore, a high magnitude of ρ_{a→b} means that b causes a, and we can obtain a good reconstruction of b from a. In this paper, we will test the association between the CCM index and the reconstruction performance of machine learning.
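The cross-mapping steps i)–iii) can be sketched compactly. This is a simplified illustration: we fix m and τ by hand instead of using the false-nearest-neighbor algorithm, and the function names are ours.

```python
import numpy as np

def embed(x, m, tau):
    """Delay embedding: rows are (x(t), x(t - tau), ..., x(t - (m-1)tau))."""
    n = len(x) - (m - 1) * tau
    return np.array([x[i : i + (m - 1) * tau + 1 : tau][::-1] for i in range(n)])

def ccm_skill(a, b, m=3, tau=1):
    """Cross-mapping skill rho_{a->b}: estimate b from the delay embedding of
    a, then correlate the estimate with the real b."""
    Ma = embed(np.asarray(a, float), m, tau)
    b_target = np.asarray(b, float)[(m - 1) * tau :]  # align b with the vectors
    n = len(Ma)
    b_hat = np.empty(n)
    for t in range(n):
        d = np.linalg.norm(Ma - Ma[t], axis=1)
        d[t] = np.inf                         # exclude the point itself
        idx = np.argsort(d)[: m + 1]          # the m + 1 nearest neighbours
        u = np.exp(-d[idx] / d[idx[0]])       # exponential distance weights
        w = u / u.sum()
        b_hat[t] = w @ b_target[idx]          # weighted-neighbour estimate
    return np.corrcoef(b_target, b_hat)[0, 1]
```

For a real analysis one would also check convergence of the skill with the library length L, as described above; the sketch computes the skill at a single L only.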
A linearly coupled model: The ARFIMA (p, d, q) process (Granger and Joyeux, 1980) can be written as Φ(B)(1 − B)^d x(t) = Θ(B) ε(t), where B is the backshift operator and ε(t) is a white-noise series. In this model, d is a fractional differencing parameter, and p and q are the orders of the autoregressive and moving average components. Here, the parameters are set as: p = 3, d = 0.2 and q = 3.
Hence x(t) is a time series composed of three components: the third-order autoregressive process with coefficients 0.6, 0.2 and 0.1, the fractional differencing process with Hurst exponent 0.7, and the third-order moving average process with coefficients 0.3, 0.2 and 0.1 (Granger and Joyeux, 1980). The two time series ε(t) and x(t) are used for the reconstruction analysis.
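A simulation of this ARFIMA (3, 0.2, 3) process can be sketched as below; the construction (fractional integration via the binomial expansion of (1 − B)^(−d), then the MA and AR filters) follows the standard definition, while the series length and random seed are our own choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Parameters from the text: AR(3) and MA(3) coefficients, d = 0.2.
ar = [0.6, 0.2, 0.1]
ma = [0.3, 0.2, 0.1]
d = 0.2
T = 5000

eps = rng.standard_normal(T)   # the white-noise series epsilon(t)

# Fractional integration (1 - B)^(-d) via its binomial expansion:
# psi_0 = 1, psi_k = psi_{k-1} * (k - 1 + d) / k.
psi = np.ones(T)
for k in range(1, T):
    psi[k] = psi[k - 1] * (k - 1 + d) / k
frac = np.convolve(eps, psi)[:T]

# Moving-average part: add the lagged terms (filters commute, so the order
# of applying the fractional, MA and AR filters does not matter).
u = frac.copy()
for j, th in enumerate(ma, start=1):
    u[j:] += th * frac[:-j]

# Autoregressive recursion: x(t) = sum_i ar_i * x(t - i) + u(t).
x = np.zeros(T)
for t in range(T):
    x[t] = u[t] + sum(c * x[t - i] for i, c in enumerate(ar, start=1) if t - i >= 0)
```

The pair (eps, x) mirrors the two series ε(t) and x(t) used in the reconstruction analysis: x carries extra low-frequency structure from the AR and fractional-differencing components.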
A nonlinearly coupled model: The Lorenz 63 chaotic system (Lorenz, 1963) reads

dX/dt = σ(Y − X), dY/dt = X(μ − Z) − Y, dZ/dt = XY − BZ.

When the parameters are fixed at (σ, μ, B) = (10, 28, 8/3), the system is chaotic. We employed the fourth-order Runge-Kutta integrator to acquire the output series from this Lorenz 63 system, with the time step taken as 0.01. The time series X(t) and Z(t) are used for the reconstruction analysis.
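Generating the X(t) and Z(t) series with the stated parameters and a fourth-order Runge-Kutta integrator is straightforward (the initial condition and trajectory length below are our own choices):

```python
import numpy as np

def lorenz63(s, sigma=10.0, mu=28.0, B=8.0 / 3.0):
    """Right-hand side of the Lorenz 63 equations."""
    x, y, z = s
    return np.array([sigma * (y - x), x * (mu - z) - y, x * y - B * z])

def rk4_step(f, s, dt):
    """One fourth-order Runge-Kutta step."""
    k1 = f(s)
    k2 = f(s + 0.5 * dt * k1)
    k3 = f(s + 0.5 * dt * k2)
    k4 = f(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

dt = 0.01                      # time step used in the text
state = np.array([1.0, 1.0, 1.0])
traj = np.empty((10000, 3))
for i in range(10000):
    state = rk4_step(lorenz63, state, dt)
    traj[i] = state
X, Z = traj[:, 0], traj[:, 2]  # the two series used for reconstruction
```

Note that Z is invariant under (X, Y) → (−X, −Y), which is exactly the symmetry behind the nonlinear-observability argument discussed in the introduction.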
A high-dimensional model: The two-layer Lorenz 96 model (Lorenz, 1996) is a high-dimensional chaotic system, and it is commonly used to mimic mid-latitude atmospheric dynamics (Chorin and Lu, 2015; Hu and Franzke, 2017; Vissio and Lucarini, 2018; Chen and Kalnay, 2019; Watson, 2019). It reads

dX_k/dt = X_{k−1}(X_{k+1} − X_{k−2}) − X_k + F + (h_1/J) Σ_{j=1}^{J} Y_{k,j},
dY_{k,j}/dt = (1/ε) [Y_{k,j+1}(Y_{k,j−1} − Y_{k,j+2}) − Y_{k,j} + h_2 X_k].

In the first layer of the Lorenz 96 system, there are 18 variables marked as X_k (k is an integer ranging from 1 to 18), and each X_k is coupled with the second-layer variables Y_{k,j}. The parameters are set as follows: J = 20, h_1 = 1, h_2 = 1, and F = 10. The time-scale parameter ε can alter the coupling strength: when ε is decreased, the coupling strength between X_k and Y_{k,j} is enhanced. The fourth-order Runge-Kutta integrator and periodic boundary conditions (that is, X_{k+18} = X_k, with the Y variables forming one cyclic chain) are adopted, and the integration time step is taken as 0.05. The time series X_1(t) and Y_{1,1}(t) are used for the reconstruction analysis.
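An integration of this two-layer system can be sketched as follows. Caveat: the exact two-scale formulation is garbled in our source, so the tendencies below use a common Lorenz-96 two-scale form consistent with the stated parameters (h_1, h_2, F, and a time-scale parameter ε); treat it as illustrative rather than as the paper's exact model, and note we use a smaller step than 0.05 to resolve the fast layer.

```python
import numpy as np

K, J = 18, 20
h1, h2, F, eps = 1.0, 1.0, 10.0, 0.7

def l96_two_layer(s):
    """Two-scale Lorenz 96 tendencies: K slow variables X_k, each coupled to
    J fast variables Y_{k,j} (the Y's form one cyclic chain of K*J values)."""
    X, Y = s[:K], s[K:]
    dX = (np.roll(X, 1) * (np.roll(X, -1) - np.roll(X, 2)) - X + F
          + (h1 / J) * Y.reshape(K, J).sum(axis=1))
    dY = (np.roll(Y, -1) * (np.roll(Y, 1) - np.roll(Y, -2)) - Y
          + h2 * np.repeat(X, J)) / eps
    return np.concatenate([dX, dY])

def rk4_step(f, s, dt):
    k1 = f(s)
    k2 = f(s + 0.5 * dt * k1)
    k3 = f(s + 0.5 * dt * k2)
    k4 = f(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

dt = 0.005   # smaller than the text's 0.05, to resolve the fast layer
state = 0.1 * np.random.default_rng(4).standard_normal(K * (J + 1))
for _ in range(4000):
    state = rk4_step(l96_two_layer, state, dt)
X1, Y11 = state[0], state[K]   # samples of the series X_1(t) and Y_{1,1}(t)
```

The `np.roll` calls implement the periodic boundary conditions, and dividing the fast-layer tendency by ε makes the X → Y forcing (and the Y dynamics generally) stronger as ε is decreased.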

Real-world climatic time series
TSAT, NHSAT and the Niño3.4 index are chosen as the examples of real-world climatic time series used for the reconstruction analysis. The original data were obtained from the National Centers for Environmental Prediction (https://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis2.html) and the KNMI Climate Explorer (http://climexp.knmi.nl). The series of TSAT and NHSAT were obtained from the regional averages of the gridded daily data in NCEP Reanalysis 2. The selected spatial ranges are 20°N-20°S for the tropics and 20°N-90°N for the Northern Hemisphere. The selected temporal range is from 1981/09/01 to 2018/12/31.
Training and testing datasets: Before analysis, all the time series used are standardized to zero mean and unit variance, so that any possible impact of the mean and variance on the statistical analysis is avoided (Brown, 1994; Hyndman and Koehler, 2006; Chattopadhyay et al., 2020). The total series is divided into two parts: 60% of the time series for training the neural network, and the remaining 40% as the testing series. The specific data lengths of the training and testing series are listed in the results section.
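The preprocessing described above amounts to two small helpers (the function names are ours):

```python
import numpy as np

def standardize(x):
    """Remove the mean and scale to unit variance."""
    x = np.asarray(x, float)
    return (x - x.mean()) / x.std()

def train_test_split(series, train_frac=0.6):
    """Chronological split: the first 60% trains the network, the rest tests."""
    n = int(len(series) * train_frac)
    return series[:n], series[n:]
```

The split is chronological rather than random: the network is trained on the earlier part of the record and tested on the later part, matching the training/testing-period setup of section 2.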

Linear coupling relation and machine learning
We first consider the simplest case: a linear coupling relation between two variables. Here, the two time series x(t) and ε(t) in the ARFIMA (3, 0.2, 3) model are analyzed. Obviously, there are different temporal structures in x(t) and ε(t), especially in their large-scale trends (Fig. 4a) and power spectra (Fig. 4b). The marked difference between x(t) and ε(t) lies in their low-frequency variations: there are more low-frequency and larger-scale structures in x(t) than in ε(t). We employ the neural networks (RC, LSTM, LSTM*, and BP) to learn the dynamics of this model (Eq. (11)) through the procedure shown in Fig. 1. Detailed comparisons between the real and reconstructed series are shown in Figs. 4c and 4d.
When ε(t) is input, the trained RC and LSTM neural networks can accurately reconstruct x(t). When x(t) is reconstructed from ε(t) through LSTM, the minimum nRMSE (0.01) is reached; all reconstructed series nearly overlap with the real ones and cannot be visually differentiated (see Fig. 4c). For RC, the reconstruction quality is also good. The good performance of LSTM benefits from its memory function for past information (Reichstein et al., 2019; Chattopadhyay et al., 2020). When the memory function of LSTM is disabled, the reconstruction of LSTM* is no longer better than that of RC (see Table 1). The reconstruction by BP is successful in this linear system (Fig. 4), but its performance is not as good as that of LSTM and RC (Table 1). This performance difference may arise because, unlike LSTM and RC, the neuron states of BP cannot track the temporal evolution of a time series (Chattopadhyay et al., 2020).

Nonlinear coupling relation and machine learning
It is known that a strong linear correlation is useful for training neural networks and reconstructing time series. When the linear correlation between variables is very weak, can these machine learning methods still learn the underlying coupling dynamics? To address this question, the two nonlinearly coupled time series X(t) and Z(t) in the Lorenz 63 system (Lorenz, 1963) are analyzed. There is a very weak overall linear correlation between the variables X and Z (Pearson correlation: 0.002) in the Lorenz 63 model (Table 2), and such a weak linear correlation results from the time-varying local correlation between X and Z (see Fig. 5a): for example, X and Z are negatively correlated in the time interval 0-200, but positively correlated in 200-400. This alternation of negative and positive correlation appears over the whole temporal evolution of X(t) and Z(t), which leads to an overall weak linear correlation. In this case, we cannot use a linear regression model between X(t) and Z(t) to reconstruct one from the other, since there is no such good linear dependency as found in the ARFIMA (p, d, q) system (see Figs. 6a and 6b). In a nonlinearly coupled system, it is known that the coupling strength between two variables cannot be estimated by the linear Pearson correlation (Brown, 1994; Sugihara et al., 2012). Here, we use CCM to estimate the coupling strength between X and Z, and it shows a high magnitude of the CCM index: ρ_{X→Z} = 0.91. According to the CCM theory (see Methods), such a high magnitude of the CCM index indicates that the information content of Z is encoded in the time series of X.
Therefore, we conjecture that when X(t) is input to the neural network, not only the information content of X(t), but also that of Z(t), can be learned by the neural network, so that it is possible to reconstruct Z(t) from the trained neural network. We test this in the following.
Figure 5b shows the results of reconstructing the Z time series through RC, LSTM and BP, respectively. Different from the case of the linear system, the successful reconstruction of the Lorenz 63 time series depends on the machine learning method used. The series reconstructed by LSTM nearly overlaps with the real series (Fig. 5b) and has the minimum nRMSE (0.004, see Table 2); moreover, RC performs quite well, with only small differences found at some peaks and dips (Fig. 5b). These reconstruction results suggest that, even though the linear correlation is very weak, a strong nonlinear correlation allows RC and LSTM to fully capture the underlying coupling dynamics. However, BP and LSTM* perform poorly, and their reconstruction results have large errors (nRMSE = 0.17 for BP, and nRMSE = 0.24 for LSTM*). The reconstructed series depart heavily from the real series, especially at the peaks and dips, where the reconstructed values of the extreme points are underestimated (Fig. 5b). This means that neither BP nor LSTM* can learn the nonlinear coupling.
As mentioned in section 2.2, a BP neural network does not track the temporal evolution, since its neuron states are independent of the temporal variation of the time series. LSTM*, in turn, does not include the information of previous times. Previous studies have revealed that temporal evolution and memory are very important properties of a nonlinear time series (Kantz and Schreiber, 2003; Franzke et al., 2015), and they should not be neglected when modeling nonlinear dynamics. These properties might be responsible for the failure of BP and LSTM* in dealing with the nonlinear Lorenz 63 system.
The application of BP to other kinds of nonlinear relationships needs to be further investigated in the future.

Reconstruction quality and impact factors
The above results reveal that RC and LSTM are able to learn both linear and nonlinear coupling relations, so that the coupled time series can be well reconstructed. In this section, we further investigate what factors can influence the reconstruction quality.

Direction dependence and variable dependence
When reconstructing the time series of the linear model of Eq. (11), it can be found that the reconstruction is bi-directional (see Fig. 4d and Table 1): one variable can be taken as the explanatory variable to reconstruct the other well, and conversely, it can also be well reconstructed from the other. Furthermore, when the linear correlation is weak but the nonlinear coupling is strong, is bi-directional reconstruction still possible? The answer is usually NO. For example, when comparing the quality of reconstructing Z(t) from X(t) (Fig. 5b) with that of reconstructing X(t) from Z(t) (Fig. 5c), all of the machine learning methods used fail in reconstructing X(t) from Z(t) (the values of nRMSE are all close to 0.3). This result is consistent with the nonlinear observability mentioned by Lu et al. (2017). The reconstruction is no longer bi-directional in this nonlinear system; instead, the reconstruction quality is direction-dependent and variable-dependent.
Therefore, we further discuss how to select the suitable explanatory variable, or equivalently the reconstruction direction. Tables 1 and 2 show that the reconstruction quality in a linearly coupled system highly depends on the Pearson correlation; however, this is different for a nonlinear system. For the Lorenz 63 system, the bi-directional CCM coefficients between the variables X and Z are asymmetric (with the stronger ρ_{X→Z} = 0.91), and accordingly the variable Z can be well reconstructed from the variable X by machine learning, but X cannot be reconstructed from Z (Figs. 5b and 5c). The CCM index can thus be taken as a potential indicator to determine the explanatory variable and the reconstructed variable for this nonlinear system. Here the asymmetric reconstruction quality results from the asymmetric information transfer between the two nonlinearly coupled variables (Hermann and Krener, 1977; Sugihara et al., 2012; Lu et al., 2017). In the coupling relation between X and Z, much more information content of Z is encoded in X, so that reconstructing Z from X performs well (Lu et al., 2017), which can be detected by the CCM index (Sugihara et al., 2012; Tsonis et al., 2018).

Generalization to a high-dimensional chaotic system
Choosing the direction and variable is important for the application of neural networks in reconstructing nonlinear time series, but this conclusion was derived from the low-dimensional Lorenz 63 system.
In this subsection, we present results from a high-dimensional chaotic system, the Lorenz 96 model. Furthermore, we investigate the association between the CCM index and the reconstruction quality in the machine learning frameworks.
Firstly, we use the variables X_1 and Y_{1,1} in Eq. (13) to illustrate the direction dependence in the high-dimensional system. Details of X_1 and Y_{1,1} are shown in Fig. 7a, and the Pearson correlation between X_1 and Y_{1,1} is weak (only −0.11, see Table 3). In Eq. (13), the forcing from X_1 to Y_{1,1} is much stronger than the forcing from Y_{1,1} to X_1. The CCM index shows ρ_{Y1,1→X1} > ρ_{X1→Y1,1}, which indicates that reconstructing X_1 from Y_{1,1} may obtain a better quality than reconstructing Y_{1,1} from X_1. As expected, by means of RC, the error of reconstructing X_1 from Y_{1,1} is nRMSE = 0.01, whereas it is nRMSE = 0.06 in the opposite direction (Table 3). The result of LSTM is similar to that of RC in this case. Thus, direction dependence does exist in reconstructing this high-dimensional system, and the result is consistent with the indication of the CCM index. In this case, the performance of the reconstruction through BP and LSTM* is not good, as analyzed in section 4.2.3. (Figure 7 shows, through RC (panel b) and LSTM (panel c), the reconstructed X_1 series (red) against the real X_1 series (blue) when Y_{1,1}, X_2, and multiple variables (X_2, X_17 and X_18) act as the explanatory variable, respectively.) The reconstruction between X_1 and X_2 in the same layer of the Lorenz 96 system is also examined.
There is an asymmetric causal relation between X_1 and X_2 (ρ_{X2→X1} > ρ_{X1→X2}), and their linear correlation is very weak (see Table 3). RC gives a better result for reconstructing X_1 from X_2 (nRMSE = 0.13) than for reconstructing X_2 from X_1 (nRMSE = 0.17). LSTM also gives different results for the reconstructed X_1 and X_2 (Table 3), where the error of reconstructing X_1 from X_2 (nRMSE = 0.16) is lower than that of reconstructing X_2 from X_1 (nRMSE = 0.20). In this case, the reconstruction quality of LSTM is worse than that of RC, but the reconstruction results of LSTM are consistent with the indication of the CCM index. Chattopadhyay et al. (2020) also suggested that LSTM performs worse than RC in some cases, which might be related to the use of a simple variant of the LSTM architecture. We tested this variant of LSTM and found that the time-varying local mean in a time series can sometimes influence its performance; however, further investigation is required for a deeper understanding of the real reason. In this high-dimensional system, the reconstruction quality is also influenced by the chosen explanatory variable: the quality of reconstructing X_1 from Y_{1,1} is better than that of reconstructing X_1 from X_2 through RC and LSTM (see Figs. 7b and 7c).
Besides, the number of chosen explanatory variables also influences the reconstruction quality. If more than one explanatory variable in the same layer is used, the reconstruction of X_1 from X_2 can be greatly improved (see Figs. 7b and 7c). For example, when X_2, X_17 and X_18 all act as explanatory variables, the nRMSE of the reconstructed X_1 is reduced from 0.13 to 0.08 (Table 3). For both RC and LSTM, the multivariate reconstruction reaches a lower error than the univariate reconstructions. In the above results, the CCM index is used to select the explanatory variable for RC and LSTM.
Now we employ more variables to test the association between the CCM index of the data and the performance of RC and LSTM. The CCM index is calculated between X1 and each of X2, X3, …, X18; meanwhile, X1 is reconstructed from each of X2, X3, …, X18. We find a significant correspondence between the nRMSE and the CCM index (Fig. 8) for the results of both RC and LSTM. Here we only use a simple LSTM architecture; with one of the many other variants of this architecture, the abnormal point of LSTM in Fig. 8 might be reduced. The result of Fig. 8 reveals a robust association between the CCM index and the reconstruction quality in the machine learning frameworks of RC and LSTM. For other machine learning methods, such an association deserves further investigation.
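The CCM index used above can be estimated, in its simplest form, by delay-embedding the source series and predicting the target from nearest neighbors on the resulting shadow manifold (Sugihara et al., 2012). The sketch below is a minimal illustration, not the implementation used in this study; the embedding dimension E, delay tau, and exponential neighbor weighting are standard but assumed choices, and the coupled logistic maps in the usage example are the classic test case rather than data from this paper.

```python
import numpy as np

def ccm_index(source, target, E=3, tau=1):
    """Sketch of the CCM skill of 'source cross maps target'.

    The shadow manifold of `source` is built by delay embedding; the
    E+1 nearest neighbors of each point are used to estimate `target`,
    and the CCM index is the Pearson correlation between the estimates
    and the true values.
    """
    source = np.asarray(source, float)
    target = np.asarray(target, float)
    n = len(source) - (E - 1) * tau
    # Rows of `manifold` are delay vectors (points on the shadow manifold).
    manifold = np.column_stack([source[i * tau: i * tau + n] for i in range(E)])
    aligned = target[(E - 1) * tau:]
    estimates = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(manifold - manifold[i], axis=1)
        d[i] = np.inf                        # exclude the point itself
        nn = np.argsort(d)[:E + 1]           # E+1 nearest neighbors
        w = np.exp(-d[nn] / (d[nn][0] + 1e-12))
        w /= w.sum()
        estimates[i] = np.sum(w * aligned[nn])
    return np.corrcoef(estimates, aligned)[0, 1]

# Illustration on two unidirectionally coupled logistic maps: x forces y,
# so y should successfully cross map x.
N = 1000
x = np.empty(N); y = np.empty(N)
x[0], y[0] = 0.4, 0.2
for k in range(N - 1):
    x[k + 1] = x[k] * (3.8 - 3.8 * x[k] - 0.02 * y[k])
    y[k + 1] = y[k] * (3.5 - 3.5 * y[k] - 0.1 * x[k])
print(ccm_index(y, x))   # skill of "y cross maps x"
```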

Performance of BP and LSTM* in Lorenz 96 system
In nonlinear systems, the performance of reconstruction through BP and LSTM* is much worse than that of RC and LSTM (Fig. 5). Here we present a simple experiment to illustrate what might influence the performance of BP and LSTM* in a nonlinear system.
The experiment is set as follows: in Eq. (13), the value of h1 is set to 0, and the value of θ is decreased from 0.7 to 0.3. When θ is equal to 0.7, the forcing from X1 to Y1,1 is weak (the Pearson correlation between X1 and Y1,1 is only 0.48), and the performances of BP and LSTM* are not good.
When θ is equal to 0.3, the forcing is dramatically magnified. As the second panel in Fig. 9a shows, the strong forcing synchronizes Y1,1 to X1, and the Pearson correlation between X1 and Y1,1 is greatly increased, to 0.8. When the forcing strength is magnified, the performance of machine learning is also enhanced (Fig. 9b): the series reconstructed by BP and by LSTM* are much closer to the real target series. This means that the reconstruction quality of BP and LSTM* is greatly improved when the linear correlation is increased. This experiment reveals that the coupling strength in a nonlinear system can alter the Pearson correlation of two time series, which in turn influences the performance of BP and LSTM*. In contrast, RC and LSTM are not restricted by the Pearson correlation in this nonlinear system.
When θ is altered from 0.7 to 0.3, the Pearson correlation changes considerably, but the values of the CCM index remain consistently above 0.9. For all values of θ, RC produces an equally good reconstruction of X1; Fig. 9b shows that the series reconstructed through RC and LSTM always overlap with the real time series. These results indicate that the performance of both RC and LSTM follows the value of the CCM index rather than the Pearson correlation, which is in line with the results given in section 4.2.2.
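For readers who wish to reproduce a comparable experiment, a generic two-level Lorenz 96 integrator is sketched below. The exact form of Eq. (13), including how h1 and θ enter the coupling, is not reproduced in this excerpt, so the standard two-level formulation (Lorenz, 1996) and all parameter values here are assumptions; K = 18 matches the slow variables X1, …, X18 discussed above, while J and the forcing are illustrative.

```python
import numpy as np

K, J = 18, 5                           # slow variables; fast variables per slow (J assumed)
F, h, c, b = 10.0, 1.0, 10.0, 10.0     # forcing and coupling parameters (assumed values)

def tendencies(X, Y):
    """Standard two-level Lorenz 96 tendencies for slow X and fast Y."""
    dX = (np.roll(X, 1) * (np.roll(X, -1) - np.roll(X, 2))
          - X + F - (h * c / b) * Y.reshape(K, J).sum(axis=1))
    dY = (c * b * np.roll(Y, -1) * (np.roll(Y, 1) - np.roll(Y, -2))
          - c * Y + (h * c / b) * np.repeat(X, J))
    return dX, dY

def rk4_step(X, Y, dt=0.002):
    """One fourth-order Runge-Kutta step for the coupled system."""
    k1x, k1y = tendencies(X, Y)
    k2x, k2y = tendencies(X + 0.5 * dt * k1x, Y + 0.5 * dt * k1y)
    k3x, k3y = tendencies(X + 0.5 * dt * k2x, Y + 0.5 * dt * k2y)
    k4x, k4y = tendencies(X + dt * k3x, Y + dt * k3y)
    return (X + dt / 6 * (k1x + 2 * k2x + 2 * k3x + k4x),
            Y + dt / 6 * (k1y + 2 * k2y + 2 * k3y + k4y))

rng = np.random.default_rng(0)
X, Y = rng.normal(size=K), 0.1 * rng.normal(size=K * J)
traj = []
for _ in range(3000):
    X, Y = rk4_step(X, Y)
    traj.append(X.copy())
traj = np.array(traj)                  # (time, K) record of the slow variables
```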

Application to real-world climate series: reconstructing SAT
Natural climate series are usually nonstationary and encode the information of many physical processes in the Earth system. In the following, we illustrate the utility of the above methods and conclusions with a real-world example.
The daily NHSAT and TSAT time series are shown in Fig. 10a. The NHSAT and TSAT series exhibit quite different temporal patterns, with a weak linear correlation (0.08, see Table 4) between them. The scatter plot of NHSAT and TSAT (Fig. 10b) shows a marked nonlinear structure between the two. Such a weak linear correlation makes linear regression fail to reconstruct one series from the other, and there is no explicit physical expression that can transform TSAT and NHSAT into each other. We therefore use machine learning to learn the coupling between them and then reconstruct these climate series. The CCM index of NHSAT cross mapping TSAT is 0.70, and the CCM index of TSAT cross mapping NHSAT is 0.24 (Table 4). This means that the information content of TSAT is well encoded in the records of NHSAT, and that the information transfer might be mainly from TSAT to NHSAT, consistent with previous studies (Farneti and Vallis, 2013). Further, the CCM analysis indicates that the reconstruction from NHSAT to TSAT might obtain a better quality than that in the opposite direction. The results validate this conjecture: the nRMSE of the reconstruction from NHSAT to TSAT is lower than that from TSAT to NHSAT (Table 4). By using RC, the TSAT time series is relatively well described by the reconstructed one (Fig. 11a), with an nRMSE of 0.13. This nRMSE is somewhat high because some extremes of the TSAT time series are not well captured (Fig. 11b). When TSAT is used to reconstruct the time series of NHSAT, the reconstructed time series cannot describe the real NHSAT series (Fig. 11c), and the corresponding nRMSE is 0.21. Besides, we also use LSTM and BP to reconstruct these natural climate series; the performances of these two neural networks are worse than that of RC (Table 4). For BP, this worse performance may be due to its inability to deal with nonlinear coupling; LSTM performing worse than RC in this real-world case might be induced by the simple variant of the LSTM architecture used here.

We can further improve the reconstruction quality of TSAT. Considering that the tropical climate system interacts not only with the Northern Hemisphere climate system but also with other parts of the Earth system, we can use the information of other systems to improve the reconstruction. Looking at the time series of the Nino 3.4 index (Fig. 10a), some of its extremes occur in the same time intervals as the extremes of TSAT.
Moreover, when the Nino 3.4 index is included in the scatter plot (Fig. 10c), a nonlinear attractor structure is revealed. We combine NHSAT with the Nino 3.4 index to reconstruct the time series of TSAT through RC. The reconstructed TSAT (Fig. 11e) is much closer to the real TSAT series, and the corresponding nRMSE is reduced to 0.08. Finally, we make a further comparison between the real TSAT and the reconstructed TSAT: (i) the annual variations of the reconstructed TSAT are close to those of the real TSAT (Fig. 12a); (ii) the power spectra of the TSAT and the reconstructed TSAT are compared in Fig. 12b, and the main deviation occurs in the 0-15 day band. The reason might be that local weather processes are not input into this RC reconstruction. This conjecture can be further confirmed through a red-noise test with a response time of 15 days applied to the residual series (the same red-noise test as the method used in Roe, 2009). All data points of the residual series lie within the confidence intervals (Fig. 12c). This means that the residual is possibly induced by local weather processes, whose information is not input into RC for the reconstruction.
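The red-noise test just mentioned can be sketched as follows: fit an AR(1) null model to the residual series through its lag-1 autocorrelation, and compare the periodogram against a scaled theoretical AR(1) spectrum. This is a simplified stand-in for the procedure of Roe (2009), not the paper's exact implementation; in particular, the confidence factor of 2 is a rough placeholder for the proper chi-squared quantile.

```python
import numpy as np

def rednoise_bound(residual, conf_factor=2.0):
    """Periodogram of `residual` plus an approximate red-noise bound.

    The lag-1 autocorrelation fixes an AR(1) null model whose theoretical
    spectrum, scaled to the series variance, is inflated by `conf_factor`
    (a placeholder for a chi-squared confidence quantile).
    """
    x = np.asarray(residual, float)
    x = x - x.mean()
    r1 = np.corrcoef(x[:-1], x[1:])[0, 1]       # lag-1 autocorrelation
    n = len(x)
    freqs = np.fft.rfftfreq(n, d=1.0)           # cycles per time step
    pgram = np.abs(np.fft.rfft(x)) ** 2 / n     # raw periodogram
    # Theoretical AR(1) spectrum, rescaled to match the periodogram level.
    ar1 = (1 - r1 ** 2) / (1 + r1 ** 2 - 2 * r1 * np.cos(2 * np.pi * freqs))
    ar1 *= pgram.mean() / ar1.mean()
    return freqs, pgram, conf_factor * ar1

# Example: a pure AR(1) series should lie (mostly) below its own bound.
rng = np.random.default_rng(1)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.7 * x[t - 1] + rng.normal()
freqs, pgram, bound = rednoise_bound(x)
print(np.mean(pgram <= bound))   # fraction of frequencies under the bound
```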

Conclusions and discussions
In this study, three kinds of machine learning methods are used to reconstruct the time series of toy models and real-world climate systems. One series can be reconstructed from another by machine learning when the two are governed by a common coupling relation. For a linear system, variables are coupled through a linear mechanism, and a large Pearson coefficient benefits machine learning, allowing bi-directional reconstruction. For a nonlinear system, the coupled time series often have a small Pearson coefficient, but machine learning can still reconstruct the time series well when the CCM index is strong; moreover, the reconstruction quality is direction-dependent and variable-dependent, being determined by the coupling strength and the causality between the dynamical variables.
Choosing suitable explanatory variables is crucial for obtaining a good reconstruction quality, and the results show that machine learning performance cannot be explained by linear correlation alone. In this study, we suggest using the CCM index to select explanatory variables. Especially for the time series of nonlinear systems, a strong CCM index can be taken as a benchmark for selecting an explanatory variable. In the presented results, when the CCM index is higher than 0.5, the nRMSE is often smaller than 0.1 and the reconstructed series is very close to the real series. Thus, a CCM index higher than 0.5 may be considered a criterion for choosing appropriate explanatory variables. It is well known that atmospheric and oceanic motions are nonlinearly coupled over most time scales; therefore, natural climate series would contain nonlinear coupling relations similar to those found in the Lorenz 63 and Lorenz 96 systems (a weak Pearson correlation but a high CCM coefficient). If only the Pearson coefficient is used to select the explanatory variable, some useful nonlinearly correlated variables may be left out.
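The CCM > 0.5 criterion proposed above reduces to a one-line filter over candidate variables. In the sketch below, the candidate names and CCM values are purely illustrative, not taken from the paper's tables, and `select_explanatory` is a hypothetical helper name.

```python
# Sketch of the proposed selection rule: keep candidate explanatory
# variables whose CCM index with the target exceeds a threshold of 0.5.
def select_explanatory(candidates, ccm_values, threshold=0.5):
    """Return the names of candidates whose CCM index exceeds the threshold."""
    return [name for name, ccm in zip(candidates, ccm_values) if ccm > threshold]

names = ["X2", "X3", "Y11"]
ccm = [0.93, 0.41, 0.97]                # illustrative CCM values, not from the paper
print(select_explanatory(names, ccm))   # → ['X2', 'Y11']
```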
Finally, it is worth noting the potential applications of machine learning in climate studies. For instance, suppose a series b(t) is unmeasured during some periods because of instrument failure, but other kinds of variables are observed without gaps. Then CCM can be applied to select suitable variables coupled with b(t), and RC or LSTM can be employed to reconstruct the unmeasured part of b(t) (following Fig. 1). This is useful for climate studies such as paleoclimate reconstruction (Brown, 1994; Donner, 2012; Emile-Geay and Tingley, 2016), interpolation of missing points in measurements (Hofstra et al., 2008), and parameterization schemes (Wilks, 2005; Vissio and Lucarini, 2018). Our study in this article is only a beginning for reconstructing climate series by machine learning, and more detailed investigations will be reported soon.

Appendix
Governing equations for the LSTM neural network

If a(t) and b(t) denote two time series, and a(t) is input into the LSTM to estimate b(t), then the governing equations for the LSTM architecture (Fig. 3) are as follows:

f_t = σ(W_f · [h_{t-1}, a_t] + b_f),  (14)
i_t = σ(W_i · [h_{t-1}, a_t] + b_i),  (15)
c̃_t = tanh(W_c · [h_{t-1}, a_t] + b_c),  (16)
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t,  (17)
o_t = σ(W_o · [h_{t-1}, a_t] + b_o),  (18)
h_t = o_t ∘ tanh(c_t),  (19)
b̂_t = W_h h_t + b_h,  (20)

where [h_{t-1}, a_t] denotes the concatenation of the previous hidden state and the current input, and ∘ denotes element-wise multiplication. f, i, and o denote the forget gate, input gate, and output gate, respectively. h and c represent the hidden state and the cell state; the dimension of the hidden layers is set to 200, which yields good performance in our experiments. All these components can be found in Fig. 3, and the information flow among them is realized by Eqs. (14)-(20). There are many parameters in the LSTM architecture: σ denotes the sigmoid activation function; b_f, b_i, b_c, and b_o are the biases of the forget gate, the input gate, the candidate cell state, and the output gate; the weight matrices W_f, W_i, W_c, and W_o denote the neuron connectivity in each layer. These parameters are computed during training (Chattopadhyay et al., 2020). a(t) and b̂(t) represent the input and the output time series.
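A single forward step of Eqs. (14)-(20) can be written out directly. The sketch below uses a small hidden dimension (4 rather than the 200 used in the paper) and random weights purely for illustration; the readout weights W_h and bias b_h follow the output equation as reconstructed above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_t, h_prev, c_prev, p):
    """One forward step of Eqs. (14)-(20): the input a(t) is combined with
    the previous hidden state, the three gates update the cell state, and
    a linear readout produces the estimate of b(t). `p` holds the weight
    matrices W_* and biases b_* named in the Appendix."""
    z = np.concatenate((h_prev, [a_t]))            # [h(t-1), a(t)]
    f = sigmoid(p["W_f"] @ z + p["b_f"])           # forget gate, Eq. (14)
    i = sigmoid(p["W_i"] @ z + p["b_i"])           # input gate, Eq. (15)
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])     # candidate state, Eq. (16)
    c = f * c_prev + i * c_tilde                   # cell update, Eq. (17)
    o = sigmoid(p["W_o"] @ z + p["b_o"])           # output gate, Eq. (18)
    h = o * np.tanh(c)                             # hidden state, Eq. (19)
    b_hat = p["W_h"] @ h + p["b_h"]                # readout, Eq. (20)
    return h, c, b_hat

# Tiny random instance with a 4-unit hidden layer.
rng = np.random.default_rng(0)
n = 4
params = {k: rng.normal(scale=0.1, size=(n, n + 1))
          for k in ("W_f", "W_i", "W_c", "W_o")}
params.update({k: np.zeros(n) for k in ("b_f", "b_i", "b_c", "b_o")})
params["W_h"] = rng.normal(scale=0.1, size=(1, n))
params["b_h"] = np.zeros(1)
h, c, b_hat = lstm_step(0.5, np.zeros(n), np.zeros(n), params)
```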

Figure 2
Figure 2 Schematic of the RC neural network: the three layers are input layer, reservoir layer, and output layer.
…(t) and b(t). The governing equations for the LSTM architecture are shown in the Appendix. After the training is accomplished, a(t) can be used to reconstruct b(t) by this neural network.

Figure 3
Figure 3 Schematic of the LSTM architecture. LSTM has a memory cell, an input gate, an output gate, and a forget gate.

Figure 5
Figure 5 (a) The X time series (black) and the Z time series (blue) of the Lorenz 63 model. (b) Comparison of the …

Figure 6
Figure 6 (a) Scatter plot of x(t) and ε(t) of the ARFIMA(3, 0.2, 3) model. (b) Scatter plot of the X time series and the Z time …

Figure 7
Figure 7 (a) The Y1,1 time series (black), X2 time series (black) and X1 time series (blue) of the Lorenz 96 system.

Figure 8
Figure 8 Scatter plot of nRMSE values and CCM index values. Blue and grey dashed lines are the fitted linear …

Figure 9
Figure 9 Influence of strong nonlinear coupling on the linear Pearson correlation and machine learning performance.

Figure 10
Figure 10 (a) Daily time series of TSAT, NHSAT and the Nino 3.4 index. (b) Scatter plot of normalized NHSAT and …

Figure 12
Figure 12 (a) Comparison of the annual mean values between the reconstructed TSAT and the real TSAT. (b) …

Figure 1 Diagram illustration for reconstructing time series by machine learning. If this inherent coupling relation can be inferred by machine learning from the training series, the inferred coupling relation should be reflected by machine learning in the testing series. Therefore, the workflow of our study can be summarized as follows (Fig. 1): (1) the available part of the …

The training parts of ε(t) are selected from the gray shadow in Fig. 4a. RC, LSTM, LSTM*, and BP are trained to learn the coupling relation between x(t) and ε(t). Then, the trained neural networks together with ε(t) are used to reconstruct x(t). The reconstruction results and the performance of the different neural networks are presented in Table 1. There is a strong linear correlation (0.88) between x(t) and ε(t), and this reconstruction result suggests that strong linear coupling can be well captured by these neural networks, since all values of nRMSE are low.

Table 2
Details of Lorenz 63 system reconstruction

Table 3
Details of reconstructing the Lorenz 96 model

Table 4
Details of temperature records' reconstruction