Climate indices for Baltic States from principal component analysis

We used principal component analysis (PCA) to derive climate indices that describe the main spatial features of the climate in the Baltic States (Estonia, Latvia and Lithuania). Monthly mean temperature and total precipitation values derived from an ensemble of bias-corrected regional climate models (RCMs) were used. Principal components were derived for the years 1961-1990. The first three components describe 92 % of the variance of the initial data and were chosen as climate indices for further analysis. Spatial patterns of these indices and their correlation with the initial variables were analyzed, and it was observed that higher values of each index corresponded to (1) less distinct seasonality, (2) a warmer climate, and (3) a wetter climate. The loadings from the chosen principal components were then used to calculate values of the climate indices for the years 2071-2100. An overall increase was found for all three indices, with minimal changes in their spatial pattern.


Introduction
Spatial representation of the climate, e.g., the mapping of climatic zones, is a useful tool in climate analysis. First, it can be used to better convey information about the climate features of the region for applications in climate change adaptation and mitigation. Second, the spatial patterns can give insight into both the possible relationship between climate and other fields and the impact of climate on them, e.g., phenological processes and vegetation distribution (Feng et al., 2012). Third, they illustrate geographical features that influence climate, such as hillsides and coastal zones.
There is a wide variety of approaches for creating spatial representations of climate, but usually they belong to either rule-driven or data-driven methods. Rule-driven methods are used more often, the most popular being the Köppen-Geiger classification (Peel et al., 2007). These methods are based on certain predefined rules; for example, thresholds of meteorological variables or frequency of events. Climate zones derived from classifications of this type usually correspond to vegetation distributions in the sense that each climate type is dominated by one vegetation zone or eco-region (Belda et al., 2014). However, predefined rules make these methods subjective.
Alternatively, the spatial pattern can be derived from data-driven or analytical methods. These include principal component analysis (PCA; Benzi et al., 1997; Estrada et al., 2009), cluster analysis (Bieniek et al., 2012), or a combination of both methods (Briggs and Lemin, 1992; Fovell and Fovell, 1993; Baeriswyl and Rebetez, 1997; Malmgren et al., 1999; Fan et al., 2014; Forsythe et al., 2015). Analytical methods, depending on the chosen variables, can give results that are similar to those of rule-driven methods, but the results are more homogeneous (Netzel and Stepinski, 2016). Analytical methods provide a spatial pattern that must be interpreted before it can be linked with possible applications.
Principal component analysis or empirical orthogonal function analysis has two important applications. First, it can reduce the number of variables that are used to describe regional climate while still retaining most of the variation seen in the initial data. Second, principal components provide new indices that are a linear combination of the chosen variables. The loadings of the chosen principal components are the coefficients that define the newly created indices, which then describe the main features of climate. Variables for PCA can be chosen and indices calculated with a specific purpose in mind; for example, indices for the classification of different types of winters (Hagen and Feistel, 2005) or estimation of crop yield based on the climate (Cai et al., 2013). Indices can also be chosen to describe the climate of the region in general (Estrada et al., 2009). However, the problem with the indices that are derived using analytical methods is that their meaning is not known beforehand, so their interpretation may require further analysis.
Published by Copernicus Publications on behalf of the European Geosciences Union.
For many practical applications, temperature and precipitation are the two main variables of interest for a certain region. They are usually sufficient for representing vegetation types in corresponding climate zones (Zhang and Yan, 2014). Vegetative production, organic matter decomposition, and the cycling of nutrients are strongly influenced by temperature and moisture (Briggs and Lemin, 1992). Distinct changes in temperature and precipitation are to be expected in the future (BACC II, 2015). Thus, any climate patterns based on these two variables will consequently be affected, with a significant impact on living organisms. For instance, plant species inhabiting regions subjected to climate change might have too little time to adapt (Mahlstein et al., 2013).
The Baltic States region exhibits significant spatial and temporal climatic variability, with an influence from air masses of arctic to subtropical origin (Jaagus and Ahas, 2000; Rutgersson et al., 2014). The terrain is mostly flat, with the highest elevations extending slightly above 300 m. The Baltic Sea and the shape of its coastline have an important role in the climate of the region. PCA has been used to describe precipitation patterns in the Baltic countries with atmospheric and landscape variables (Jaagus et al., 2010).
To study the effects of climate change on climate patterns, regional climate model (RCM) data can be used (Castro et al., 2007; Mahlstein and Knutti, 2010; Tapiador et al., 2011; Fan et al., 2014). RCMs are continuously improving and correspond rather well to climate observations (Tapiador et al., 2011). Other advantages of using RCM data are that (a) their data are regularly spaced, while PCA applied to irregularly spaced data can produce distorted loading patterns (Karl et al., 1982), and (b) RCM data are also available as future projections, giving insight into the manifestation of climate change. Additionally, the spatial representativeness of the network of observation stations in the Baltic states has been reported to be problematic (Remm and Jaagus, 2011).
The aim of this work is to define climate indices that represent the main features of the Baltic States climate in a compact form. The study consists of several parts. First, RCM data for temperature and precipitation were bias corrected. Second, monthly average values for the reference period 1961-1990 were calculated and standardized. Third, PCA was performed and the main principal components were identified. The acquired principal components and their spatial patterns were analyzed. Fourth, the loadings of the chosen principal components were used to calculate indices for the years 2071-2100, which were compared to the reference data.

Climate data and methods
The source of the RCM ensemble data is the ENSEMBLES project (van der Linden and Mitchell, 2009). Model data sets for the A1B scenario are given for the time period 1961-2100, and 22 model runs were considered (shown in Table 1). We used time series of daily average air temperature at a height of 2 m and daily precipitation. RCMs are known to be prone to systematic biases (Teutschbein and Seibert, 2012). A bias correction method (Sennikovs and Bethers, 2009) that uses quantile mapping was chosen, and the cumulative distribution function was calculated for each day of the year using an 11-day running average, i.e., the data for 5 days before and 5 days after the day of interest. The ensemble median was then used for PCA. The control period for bias correction was . Bias-corrected data were then interpolated to a regular grid because it has been shown that PCA applied to irregularly spaced data can produce distorted loading patterns (Karl et al., 1982). The bias correction method and model resolution are described in detail in Sennikovs and Bethers (2009).
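The idea behind quantile mapping can be illustrated with a minimal sketch. This is a generic empirical quantile mapping on synthetic numbers, not the implementation of Sennikovs and Bethers (2009): the function name `quantile_map`, the 101 sampled quantiles, and the synthetic control samples are all assumptions made for illustration, and the 11-day pooling of days is omitted.

```python
import numpy as np

def quantile_map(model_values, model_control, observed_control, n_quantiles=101):
    """Map model values onto the observed distribution via matched quantiles."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    # Sample both control-period distributions at the same probabilities.
    model_q = np.quantile(model_control, probs)
    observed_q = np.quantile(observed_control, probs)
    # A model value sitting at probability p of the model distribution is
    # replaced by the observed value at the same p. Values between sampled
    # quantiles are linearly interpolated; values outside are clamped.
    return np.interp(model_values, model_q, observed_q)

# Synthetic control-period samples (assumed): the "model" is 2 degrees too warm.
observed_control = np.linspace(-10.0, 20.0, 300)
model_control = observed_control + 2.0

corrected = quantile_map(np.array([0.0, 5.0, 12.0]), model_control, observed_control)
```

With a constant warm bias, as in this toy setup, the mapping reduces to subtracting the bias; with real data the correction differs across the distribution, which is the reason for mapping quantiles rather than shifting the mean.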
Two time periods were chosen: 1961-1990 (as a reference climate) and 2071-2100 (as future climate projections). For each time period, monthly average temperature and precipitation were calculated for each grid point. In total 24 climatic variables were used for each time period: 12 monthly precipitation and 12 monthly average temperatures. This is an "R-mode" analysis according to Cattell (1952). The spatial distribution of these variables for the reference period is shown in Figs. 1 and 2. Figure 1 shows a north-south gradient of monthly precipitation during April-June and an east-west gradient of monthly precipitation during October-January. Figure 2 shows an east-west gradient of monthly temperatures during October-February and a north-south gradient of monthly temperatures during April-June. This implies that some of the variables can be combined in seasons (as done by Malmgren et al., 1999, and Forsythe et al., 2015) and that for some months temperature and precipitation are correlated. A better understanding of variables with similar patterns can be gained by examining the correlation matrix in Fig. 3. The matrix areas that represent strongly correlated variables are marked in this figure, and they show the following relationships.
1. Very strong correlation (above 0.8) between precipitation levels in winter months. Locations with more precipitation in, e.g., December also have more precipitation in January (compared to the rest of the territory).
2. Strong correlation (above 0.5) between precipitation and temperature in spring months. Thus, locations with colder springs are also drier, whilst locations with warmer springs also have more spring precipitation.
3. Strong negative correlation (below −0.5) between precipitation in autumn and late spring/early summer temperature. Locations with more precipitation in autumn also have colder springs.
4. Very strong correlation (above 0.8) between temperatures of autumn and winter months. Locations with warmer autumns also have warmer winters.
Figure 3 shows that the 24 monthly variables contain redundant information, and through PCA we can summarize the information and create new variables.
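A correlation matrix of the kind described above can be computed in a few lines. The following sketch uses synthetic data in which 12 columns share one spatial signal and 12 share another; the signals, noise level, and grid size are assumptions for illustration and do not reproduce Fig. 3.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 500  # hypothetical number of grid points

# Two hypothetical spatial signals shared by groups of variables.
west_east = rng.normal(size=n_points)
north_south = rng.normal(size=n_points)

columns = []
for _ in range(12):  # 12 synthetic "precipitation" variables
    columns.append(west_east + 0.3 * rng.normal(size=n_points))
for _ in range(12):  # 12 synthetic "temperature" variables
    columns.append(north_south + 0.3 * rng.normal(size=n_points))
X = np.column_stack(columns)  # rows are grid points, columns are variables

# 24 x 24 matrix of Pearson correlations between variables (columns).
corr = np.corrcoef(X, rowvar=False)

# Count distinct variable pairs with very strong correlation (|r| > 0.8).
strong = np.abs(corr) > 0.8
np.fill_diagonal(strong, False)
n_strong_pairs = int(strong.sum() // 2)
```

Many strongly correlated pairs indicate that far fewer than 24 independent patterns are present, which is what makes a reduction by PCA worthwhile.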

PCA method
The aim of PCA is to create a new set of uncorrelated variables that are a linear combination of the initial variables and explain as much of the initial variation as possible. An extensive description of PCA can be found in Jolliffe (2002), and its applications to climate are described in Preisendorfer (1988).
Although PCA is a widely used methodology, the terminology in the literature can vary (Wilks, 2011).We will briefly describe the terminology used in this article.
Suppose that X is an n × p data matrix, where n is the number of objects and p is the number of variables. The means of the p variables have been subtracted. In our case we have p = 24 climatic variables at n = 7143 grid points. A typical PCA is applied to the p × p covariance (or correlation) matrix calculated by Eq. (1). By solving Eq. (2) we can find the eigenvectors e_i, i = 1, ..., 24, and the corresponding eigenvalues λ_i, i = 1, ..., 24. As a result we obtain uncorrelated linear combinations of the initial climatic variables, calculated by Eq. (3).
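The equations cited in this paragraph are not reproduced in the present text. Under the definitions given above, standard PCA would correspond to expressions of the following form; the 1/(n − 1) normalization is an assumption (1/n is also common and changes only the scale of the eigenvalues), and the numbering is matched to the citations for convenience only.

```latex
\begin{align}
  C &= \frac{1}{n-1}\, X^{\mathsf{T}} X, \tag{1} \\
  C\, e_i &= \lambda_i\, e_i, \qquad i = 1, \dots, p, \tag{2} \\
  Y_i &= X e_i, \qquad\ \ i = 1, \dots, p. \tag{3}
\end{align}
```

Here C is the p × p covariance matrix of the mean-centered data, and each Y_i is a vector of length n holding the value of the i-th principal component at every grid point.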
Values λ_i represent the explained variance of each "principal component" Y_i. The linear weights e_i that define each principal component will be called "loadings". "Indices" describe Y_i values that are calculated using loadings from the reference period (but not necessarily reference period data). For the reference period, principal components coincide with indices, but indices can also be calculated using future period data and reference period loadings.
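The terminology above (eigenvalues as explained variance, eigenvectors as loadings, projected values as indices) can be made concrete with a short sketch. It uses NumPy's symmetric eigensolver on synthetic data; the function names `pca` and `compute_indices` and the data are hypothetical and not taken from the study.

```python
import numpy as np

def pca(X):
    """Return eigenvalues (descending) and loadings (as columns) of cov(X)."""
    centered = X - X.mean(axis=0)
    covariance = centered.T @ centered / (centered.shape[0] - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # for symmetric matrices
    order = np.argsort(eigenvalues)[::-1]                   # largest variance first
    return eigenvalues[order], eigenvectors[:, order]

def compute_indices(X, loadings, reference_mean):
    """Project data onto loadings after removing the reference mean."""
    return (X - reference_mean) @ loadings

# Synthetic correlated data: 200 "grid points", 6 "variables" (assumed sizes).
rng = np.random.default_rng(1)
X_ref = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))

eigenvalues, loadings = pca(X_ref)
Y_ref = compute_indices(X_ref, loadings[:, :3], X_ref.mean(axis=0))
```

For the reference data the projected values are the principal components themselves: their covariance matrix is diagonal, with the leading eigenvalues on the diagonal.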
An important choice must be made when applying PCA: whether to use a correlation matrix or covariance matrix in the calculation of loadings. If the covariance matrix is used then a second choice must be made: whether to use standardization and what type. The scaling process has a significant impact on the PCA process. When performing data standardization, the following issues should be taken into account.
1. Variables should be of a similar scale; otherwise, variables with considerably larger variance will dominate the principal components. Different scales are usually a consequence of different units of measurement. In our case the variance for precipitation measured in millimeters is considerably larger than that for temperature measured in degrees Celsius.
2. In the case of variables measured in the same units, variances contain useful information and can improve the interpretation of PCA (Overland and Preisendorfer, 1982). Therefore, for variables that are measured in the same units (for example, average temperature in different months) we wish to keep the ratio between variances of different months. This means that the correlation matrix, in which each variable is divided by the square root of its variance, should not be used as it would bring the variances of all 24 variables to 1.
3. As we are planning to use the acquired loadings as coefficients for the calculation of climate indices for the future time period and compare them with the reference climate, it is necessary that the same standardization process be used for the data of the future time period.
4. It is important to note that subtraction of the mean (or a similar constant) for each variable does not impact the result of PCA as it does not impact the covariance between variables. However, if the initial values have a zero mean (the mean is subtracted from each variable) then the resulting principal components have a similar scale, and spatial patterns are more convenient to review.
Taking into account the issues described above we propose using standardization as defined by Eq. (4), in which the spatial mean is subtracted for each variable as usual, but the average variance of all temperature or precipitation variables is used for scaling. Here V(T) and V(P) denote the average variance of the 12 temperature and 12 precipitation variables, respectively, for the reference period.
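Equation (4) itself is not reproduced in the present text. A form consistent with the description would be the following, where the starred symbols for standardized values and the use of the square root of the average variance (so that standardized values are dimensionless) are assumptions:

```latex
\begin{equation}
  T_k^{*} = \frac{T_k - \overline{T}_k}{\sqrt{V(T)}}, \qquad
  P_k^{*} = \frac{P_k - \overline{P}_k}{\sqrt{V(P)}}, \qquad k = 1, \dots, 12,
\end{equation}
\begin{equation}
  V(T) = \frac{1}{12} \sum_{k=1}^{12} \operatorname{Var}(T_k), \qquad
  V(P) = \frac{1}{12} \sum_{k=1}^{12} \operatorname{Var}(P_k).
\end{equation}
```

Because every temperature variable is divided by the same constant, the ratio Var(T_j)/Var(T_k) between any two months is unchanged, while temperature and precipitation are brought to a comparable scale.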
The variances before and after such standardization for the reference period are shown in Table 2. The ratio of variances for different months is retained. For data representing the future time period, the standardization is performed by using the mean values and average variances from the reference period. The results of data standardization for the future time period are shown in Table 3. It can be seen that in the future the variance in precipitation data will increase and the variance in temperature data will decrease. However, the distribution of variances over the year is similar.
Earth Syst. Dynam., 8, 951-962, 2017, www.earth-syst-dynam.net/8/951/2017/
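The standardization procedure described above can be sketched as follows. The sketch assumes scaling by the square root of the group-average variance and uses synthetic data; the function names, the dictionary keys, and the "future" perturbation are hypothetical.

```python
import numpy as np

def reference_statistics(temperature, precipitation):
    """Per-variable spatial means and one average variance per variable group."""
    return {
        "t_mean": temperature.mean(axis=0),
        "p_mean": precipitation.mean(axis=0),
        "t_var": temperature.var(axis=0).mean(),    # average over 12 months
        "p_var": precipitation.var(axis=0).mean(),
    }

def standardize(temperature, precipitation, stats):
    """Standardize any period using statistics of the reference period."""
    t = (temperature - stats["t_mean"]) / np.sqrt(stats["t_var"])
    p = (precipitation - stats["p_mean"]) / np.sqrt(stats["p_var"])
    return np.hstack([t, p])  # columns: 12 temperature, then 12 precipitation

# Synthetic reference data with month-dependent spread (assumed values).
rng = np.random.default_rng(2)
month_spread = np.linspace(0.5, 2.0, 12)
temp_ref = 5.0 + rng.normal(size=(400, 12)) * month_spread
prec_ref = 60.0 + rng.normal(size=(400, 12)) * 10.0 * month_spread

stats = reference_statistics(temp_ref, prec_ref)
X_ref = standardize(temp_ref, prec_ref, stats)
# A hypothetical "future": 3 degrees warmer and 10 % wetter everywhere.
X_future = standardize(temp_ref + 3.0, prec_ref * 1.1, stats)
```

After this step the reference variables have zero mean and an average variance of 1 within each group, while future variables keep a nonzero mean that reflects the change relative to the reference climate.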
Another detail that must be considered when using PCA is the choice of method for determining the number of principal components that describe data variation sufficiently well and can be used in further analysis. There are multiple methods to choose from (Preisendorfer, 1988); however, in our case one of the most common methods, the scree plot, gives clear results. A scree plot is a graph of the variances explained by the acquired principal components, and the number of principal components is decided based on the break point in such a graph. Components to the left of the break point are retained.

Principal components for the control period (1961-1990)
The explained variance and loadings of the first three principal components are shown in Table 4. The scree plot of all principal components is shown in Fig. 4. The first two components already describe 78 % of the variance in the initial variables, while the first three components describe 92 % of the variance. According to Jolliffe (2002) the cutoff point should be between 70 and 90 % of the explained variance. However, the scree plot clearly shows that the first three principal components can be retained, so we chose to further analyze the first three components.
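The quantities plotted in a scree plot and the cumulative shares quoted above follow directly from the eigenvalues. The sketch below uses hypothetical eigenvalues chosen for easy arithmetic, not the values of Table 4, and the helper names are assumptions.

```python
import numpy as np

def explained_variance(eigenvalues):
    """Fraction of total variance per component and its running total."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    fraction = eigenvalues / eigenvalues.sum()
    return fraction, np.cumsum(fraction)

def components_needed(cumulative, threshold):
    """Smallest number of leading components whose share reaches the threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical eigenvalues (illustration only).
hypothetical_eigenvalues = [5.0, 3.0, 1.5, 0.3, 0.2]
fraction, cumulative = explained_variance(hypothetical_eigenvalues)
n_for_70 = components_needed(cumulative, 0.70)
n_for_90 = components_needed(cumulative, 0.90)
```

A threshold rule and a visual break point can disagree, as in the text above, where a 70 to 90 % rule would allow two components but the scree plot supports three.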
Figure 5 shows the spatial pattern of the first three principal components for the reference climate. They should be analyzed together with the correlation coefficients between the new variables and initial variables shown in Table 5, in which the bright red or blue colors mark high positive or negative correlation. One can see that variables that were initially highly correlated (positively or negatively; Fig. 3) show similar (or in the case of negative correlation, the opposite) values in Table 5.
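A table of correlations between components and initial variables can be computed as in the following sketch, which uses synthetic data and a hypothetical helper name. It also illustrates the known identity that the correlation between variable j and component i equals the loading e_ij times the square root of λ_i, divided by the standard deviation of variable j.

```python
import numpy as np

def variable_component_correlations(X, Y):
    """Pearson correlations: rows are initial variables, columns are components."""
    p = X.shape[1]
    full = np.corrcoef(np.hstack([X, Y]), rowvar=False)
    return full[:p, p:]

# Synthetic mean-centered data: 300 "grid points", 5 "variables" (assumed sizes).
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, loadings = eigenvalues[order], eigenvectors[:, order]

Y = X @ loadings[:, :3]                          # first three components
table = variable_component_correlations(X, Y)    # shape (5, 3)
```

Unlike raw loadings, these correlations are bounded between −1 and 1 and are independent of the scale of each variable, which makes them easier to read as a color-coded table.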
Correlation coefficient values (Table 5) show that the first principal component (PC1) has a high positive correlation with the autumn-winter temperature and precipitation and a high negative correlation with temperature and precipitation in late spring and early summer months.This means that higher values of PC1 correspond to places with warmer winters with more precipitation (snow or rain) and colder summers with less precipitation.However, it is also important to note that the total sum of the loadings is above 1, which implies that a constant increase in all variables would also result in higher values of PC1.From the spatial distribution (Fig. 5) we can see that PC1 has an east-west gradient implying less distinction between seasons at the seaside.It can be concluded that PC1 reflects the continentality of climate, and it represents the influence of the Baltic Sea.
The second principal component (PC2) is positively correlated with all monthly temperatures and negatively correlated with precipitation in autumn.This means that high PC2 values correspond to regions that are generally warmer than others and have low precipitation in autumn.For PC2 a north-south gradient is evident with the warmer climate in the south.This means that PC2 represents the influence of latitude.This pattern is also slightly influenced by geographical features (elevation) and the shape of the coast.
PC3 is mainly positively correlated with precipitation for most of the year (December-August) and spring temperature (April-May).This means that high PC3 values correspond to places with overall high precipitation or, in other words, an overall wetter year.PC3 mainly reflects the terrain, i.e., the distribution of elevation.When the spatial patterns of PC2 and PC3 are analyzed the effect of orography can be seen.The location of the highlands is especially visible, while for PC1 the terrain seems to have little impact.

Climate indices for future climate (2071-2100)
Loadings (linear weights) acquired through PCA from the reference data (Table 4) can be used as coefficients that define new climate indices. We can use these coefficients to calculate climate indices from different data (other time periods or other geographical locations). It is also important to note that statistics (mean values and variances) from the reference data used in data standardization should also be applied to other data for comparison to be possible. In our case we calculated such climate indices for future climate (corresponding to the period 2071-2100) and analyzed the change in climate patterns. The standardization of the variables is shown by Eq. (5), and the calculation of the climate indices is shown by Eq. (6). In Eq. (5), T_k and P_k denote temperature and precipitation values for the future period, their overbarred counterparts denote mean temperature and precipitation values for the reference period, and V(T) and V(P) denote the average variance of the 12 temperature and 12 precipitation variables for the reference period.
In Eq. (6), X_i represents temperature and precipitation data for the future period, c_i represents coefficients (loadings) from the reference period, and Y_i represents climate indices for the future period.
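Equations (5) and (6) are not reproduced in the present text. Based on the symbol definitions given here, they plausibly take the following form; the superscripts "fut" and "ref", the starred notation, the square root in the scaling, and the double index on the coefficients are assumptions introduced for readability.

```latex
\begin{equation}
  T_k^{*} = \frac{T_k^{\mathrm{fut}} - \overline{T}_k^{\mathrm{ref}}}{\sqrt{V(T)}}, \qquad
  P_k^{*} = \frac{P_k^{\mathrm{fut}} - \overline{P}_k^{\mathrm{ref}}}{\sqrt{V(P)}}, \qquad k = 1, \dots, 12, \tag{5}
\end{equation}
\begin{equation}
  Y_i = \sum_{k=1}^{24} c_{ik}\, X_k, \qquad i = 1, 2, 3, \tag{6}
\end{equation}
```

where X_1, ..., X_24 are the 24 standardized variables from Eq. (5) and c_ik is the loading of variable k in the i-th reference-period principal component.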
It is important to note that Y_i values should not be called "principal components" even though they hold a similar meaning as principal components from the reference data. Y_i values are not derived using PCA directly and they do not use eigenvectors from future data.
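The distinction between principal components and indices can be sketched in code: the loadings are computed once from reference data and then applied unchanged to another period. The data, the uniform shift, and the function name `climate_indices` are hypothetical.

```python
import numpy as np

def climate_indices(X_standardized, reference_loadings):
    """Apply fixed reference-period loadings to already standardized data."""
    return X_standardized @ reference_loadings

# Synthetic reference data with zero spatial mean (assumed sizes).
rng = np.random.default_rng(4)
X_ref = rng.normal(size=(250, 6)) @ rng.normal(size=(6, 6))
X_ref = X_ref - X_ref.mean(axis=0)

# Loadings come from the reference period only.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_ref, rowvar=False))
reference_loadings = eigenvectors[:, np.argsort(eigenvalues)[::-1][:3]]

# Hypothetical future data: a uniform increase in every standardized variable.
shift = np.full(6, 1.5)
X_future = X_ref + shift

Y_ref = climate_indices(X_ref, reference_loadings)
Y_future = climate_indices(X_future, reference_loadings)
index_change = Y_future.mean(axis=0) - Y_ref.mean(axis=0)
```

Because the indices are linear in the data, a uniform increase changes each index by the increase multiplied by the sum of that index's loadings. No eigen-decomposition of the future data is involved at any point.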
In Fig. 6 the correlation coefficients between indices and initial variables are shown and it can be seen that they are similar to those for past climate. Therefore, they have the same interpretation and it is possible to analyze the change in spatial patterns between the past and future climate. The spatial distributions of future indices are shown in Fig. 7. Statistical descriptors, e.g., the minimum, maximum, and mean value of past and future indices, are summarized in Table 6. In addition, as we have used the same standardization (subtraction of the reference period mean) and climate index calculation process (loadings from the reference period), we can derive conclusions about increases or decreases in these climate indices. However, it is important to note that no conclusions can be derived about the value by which the increase or decrease has happened. All indices have higher values in future climate. This can be interpreted as an overall warmer climate (increase in PC2) and wetter climate (increase in PC3). The interpretation of PC1 is more complicated as coefficients (Table 4) are positive for some variables and negative for others. An increase in PC1 would be observed in the case of a constant increase in all variables. However, an increase would also be observed in the case of a temperature and precipitation decrease in spring and summer. The average increase in the "standardized" (by Eq. 5) mean values is 1.4 units for temperature and 4.5 units for precipitation. Such a constant increase with the coefficients in Table 4 would result in a 6.5 unit increase for PC1. As we can see from the index statistics in Table 6, an increase of 8.4 units is observed for PC1, so we suspect that the additional increase can be attributed to changes in seasonality.
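The decomposition of the PC1 change used in this argument can be written out explicitly. The notation is an assumption: c_1k stands for the PC1 loading of variable k from Table 4, and the sums run over the 12 temperature and 12 precipitation variables. Only the numbers quoted in the text are used.

```latex
\begin{align}
  \Delta Y_1^{\mathrm{const}}
    &= \Delta T \sum_{k \in \mathrm{temp.}} c_{1k}
     + \Delta P \sum_{k \in \mathrm{prec.}} c_{1k},
     \qquad \Delta T = 1.4,\ \ \Delta P = 4.5, \\
    &\approx 6.5, \\
  \Delta Y_1^{\mathrm{obs}} - \Delta Y_1^{\mathrm{const}}
    &\approx 8.4 - 6.5 = 1.9 .
\end{align}
```

The residual of about 1.9 units is the part of the PC1 increase that a uniform shift cannot explain, which is why it is attributed to a change in the seasonal distribution of temperature and precipitation.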
For PC1 it is shown that the values corresponding to coastal regions in the reference climate will "move" to the eastern part of the Baltic states in the future projections.The expected changes in PC2 are the largest, and the maximum values of PC2 for the reference climate (in southern Lithuania) are lower than the minimum values for the future climate (in central Estonia).The statistics in Table 6 show that the reference range of this index does not overlap with the range of future values.The climate corresponding to the reference values of PC3 in western Lithuania (the Zemaiciai Highland) will in the future be observable on plateaus in the central and northeastern parts of the Baltic states.

Discussion
The methodology used in this study has been able to reduce 24 climate variables to three new indices that more efficiently and compactly represent the main features of the climate in the Baltic countries.The methodology can also be applied to future climate data and therefore the impacts of climate change can be analyzed.Additional analysis is needed for the interpretation of the acquired indices.
Some insight into the possible interpretation of the acquired climate indices can be gained from the literature.The spatial distribution of PC1 is similar to the spatial patterns of the mean start date of winter (see results for Estonia in Jaagus and Ahas, 2000) with higher PC1 values corresponding to later winters.
As PC2 is mainly linked to temperature, the patterns exhibited by PC2 can be expected to be similar to the spatial distribution of phenological events for which temperature is the main driving factor. For example, the spatial pattern of PC2 shows similarities to spring and summer start dates in the Baltic Sea region and to more specific phenological events, such as apple tree blossoming and the beginning of the vegetation of rye (Jaagus and Ahas, 2000) or strawberry blooming and harvest (Bethere et al., 2016). In gen-
High values of winter precipitation and high temperatures in spring can be interpreted in the context of spring floods; however, additional analysis is needed to account for the snow cover. The spatial distribution of PC3 is similar to the map of average annual precipitation in the study region (Jaagus et al., 2010). Interestingly, the precipitation in autumn months (September-October) has a small contribution to PC3 (Table 5).
Conclusions based on spatial pattern and correlation coefficient analysis are summarized in Table 7.
The methodology could be further improved to better link the acquired indices with phenological processes or seasons by either rotating the acquired principal components (Jolliffe, 2002) or performing correlation or regression analysis with other variables, such as crop yield (Cai et al., 2013).This approach would be especially useful in the case of PC1, for which analysis is currently complicated due to both changes in seasonality and the constant increase affecting PC1 values.Another approach that could be used to describe the spatial variability of the climate in the Baltic states is clustering based on the chosen principal component values (Fovell and Fovell, 1993;Forsythe et al., 2015).
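As an illustration of the clustering approach mentioned above, the following sketch groups grid points by their principal component values with a plain k-means (Lloyd's algorithm). The choice of k-means is an assumption, since the cited studies may use other clustering methods, and the two synthetic groups, their labels, and the function name are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns cluster labels and cluster centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every center, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each center to the mean of its points (keep it if it has none).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two synthetic groups of grid points described by three index values each.
rng = np.random.default_rng(5)
coastal = rng.normal(loc=0.0, scale=0.5, size=(50, 3))   # hypothetical group
inland = rng.normal(loc=10.0, scale=0.5, size=(50, 3))   # hypothetical group
pc_values = np.vstack([coastal, inland])

labels, centers = kmeans(pc_values, k=2)
```

Clustering in the space of a few retained components rather than all 24 variables keeps the distance measure from being dominated by redundant, strongly correlated variables.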
If variables other than temperature or precipitation are used for the principal component analysis, in some cases the standardization procedure should be modified.However, it should be taken into account that when more than one data set is used, e.g., when past and future climate is compared, the same values used for standardization should be applied to all of them.

Conclusions
Most of the spatial variability in monthly average temperature and precipitation over the Baltic countries can be represented by three principal components for both past and future climate.These components can be considered climate indices, in which higher values correspond to locations with (1) climate with less distinct seasons, (2) warmer climate, and (3) climate with more precipitation.Each component has a distinct spatial pattern.The index related to seasonality exhibits a clear east-west (or inland) gradient with less distinct seasonality at the seaside (west).The second index (warmer climate) shows a north-south gradient with a warmer climate in the south.This index also reflects orography with colder climate in hilly regions.The third index reflects the overall precipitation.Its spatial distribution is mainly dominated by elevation, with maxima at the highlands and less precipitation in the plains and at the seaside.A specific standardization of the data also allows for the calculation of such indices for the future climate.Change in the climate indices in the future implies less distinct seasons and a warmer and wetter climate.
Although there is significant change in the magnitude of the indices between the future and reference periods, the change in spatial distribution is relatively small.For the first and third components, regions can be identified in which the future climate will be similar to the current climate in other regions.

Figure 3. Temperature-precipitation correlation matrix; bias-corrected data. Marked and numbered features show especially high absolute correlation: (1) strong correlation between precipitation levels in winter months; (2) strong correlation between precipitation and temperature in spring months; (3) strong negative correlation between precipitation in autumn and spring temperature; (4) strong correlation between temperatures in autumn and winter months.

Figure 5. Spatial pattern of the first three principal components based on monthly temperature and precipitation data for the years 1961-1990.

Figure 6. Correlation coefficients between indices (principal components) and initial variables for the reference and future climates.

Table 1. List of the regional climate model (RCM) ensemble members used (ENSEMBLES) showing the originating institution, the name of the RCM, and the driving general circulation model (GCM). For an explanation of abbreviations, see van der Linden and Mitchell (2009).

Table 2. Variances of climate variables before and after standardization for the years 1961-1990.

Table 3. Variances of climate variables before and after standardization for the years 2071-2100.

Table 4. Explained variance and loadings of the first three principal components calculated from temperature and precipitation data for the years 1961-1990.

Table 5. Correlation coefficients between principal components and standardized initial data for the years 1961-1990. High positive correlation corresponds to darker red color and high negative correlation corresponds to darker blue color.

Table 6. Statistics of climate indices (based on PCA) for past and future data.

Table 7. Description and interpretation of climate indices based on PCA.