Groundwater storage dynamics in the world's large aquifer systems from GRACE: uncertainty and role of extreme precipitation

. Under variable and changing climates groundwater storage sustains vital ecosystems and enables freshwater withdrawals globally for agriculture, drinking water, and industry. Here, we assess recent changes in groundwater storage ( (cid:49) GWS) from 2002 to 2016 in 37 of the world’s large aquifer systems using an ensemble of datasets from the Gravity Recovery and Climate Experiment (GRACE) and land surface models (LSMs). Ensemble GRACE-derived (cid:49) GWS is well reconciled to in situ observations ( r = 0 . 62–0.86, p value < 0 . 001) for two tropical basins with regional piezometric networks and contrasting climate regimes. Trends in GRACE-derived (cid:49) GWS are overwhelmingly non-linear; indeed, linear declining trends adequately ( R 2 > 0 . 5, p value < 0 . 001) explain variability in only two aquifer systems. Non-linearity in (cid:49) GWS derives, in part, from the episodic nature of groundwater replenishment associated with extreme annual ( > 90th percentile, 1901–2016) precipitation and is inconsistent with prevailing narratives of global-scale groundwater depletion at the scale of the GRACE footprint ( ∼ 200000 km 2 ). Substantial uncertainty remains in estimates of GRACE-derived (cid:49) GWS, evident from 20 realisations presented here, but these data provide a regional context to changes in groundwater storage observed more through piezometry.


Introduction
Groundwater is estimated to account for between a quarter and a third of the world's annual freshwater withdrawals to meet agricultural, industrial, and domestic demand (Döll et al., 2012;Wada et al., 2014;Hanasaki et al., 2018). As the world's largest distributed store of fresh water, groundwater plays a vital role in sustaining ecosystems and enabling adaptation to increased variability in rainfall and river discharge brought about by climate change (Taylor et al., 2013a). Sustained reductions in the volume of groundwater (i.e. groundwater depletion) resulting from human withdrawals or changes in climate have historically been observed as declining groundwater levels recorded in wells (Scanlon et al., 2012a;Castellazzi et al., 2016;MacDonald et al., 2016). The limited distribution and duration of piezometric records hinder, however, direct observation of changes in from land surface models (LSMs) either exclusively (Rodell et al., 2009;Famiglietti et al., 2011;Scanlon et al., 2012a;Famiglietti and Rodell, 2013;Richey et al., 2015;Thomas et al., 2017) or in combination with in situ observations (Rodell et al., 2007;Swenson et al., 2008;Shamsudduha et al., 2012).
Substantial uncertainty persists in the quantification of changes in terrestrial water stores from GRACE measurements that are limited in duration (2002 to 2016) and the application of uncalibrated, global-scale LSMs (Shamsudduha et al., 2012;Döll et al., 2014;Scanlon et al., 2018). Computation of GWS from GRACE TWS is argued, nevertheless, to provide evaluations of large-scale changes in groundwater storage where regional-scale piezometric networks do not currently exist (Famiglietti, 2014).
Previous assessments of changes in groundwater storage using GRACE in the world's 37 large aquifer systems (Richey et al., 2015;Thomas et al., 2017) (Fig. 1, Table 1) have raised concerns about the sustainability of human use of groundwater resources. One analysis (Richey et al., 2015) employed a single GRACE TWS product (CSR) in which changes in subsurface storage ( SMS + GWS) were attributed to GWS. This study applied linear trends without regard to their significance to compute values of GRACEderived GWS over 11 years from 2003 to 2013 and concluded that the majority of the world's aquifer systems (n = 21) are either "overstressed" or "variably stressed". A subsequent analysis (Thomas et al., 2017) employed a different GRACE TWS product (Mascons) and estimated SWS from LSM data for both surface and subsurface runoff, though the latter is normally considered to be groundwater recharge (Rodell et al., 2004). Using performance metrics normally applied to surface water systems including dams, this latter analysis classified nearly a third (n = 11) of the world's aquifer systems as having their lowest sustainability criterion.
Here, we update and extend the analysis of GWS in the world's 37 large aquifer systems using an ensemble of three GRACE TWS products (CSR, Mascons, GRGS) over a 14-year period from August 2002 to July 2016. To isolate GRACE-derived GWS from GRACE TWS, we employ estimates of SMS, SWS, and SNS from five LSMs (CLM, Noah, VIC, Mosaic, Noah v.2.1) run by NASA's Global Land Data Assimilation System (GLDAS). As such, we explicitly account for the contribution of SWS to TWS, which has been commonly overlooked (Rodell et al., 2009;Richey et al., 2015) despite evidence of its significant contribution to TWS (Kim et al., 2009;Shamsudduha et al., 2012;Getirana et al., 2017). Further, we characterise trends in time series records of GRACE-derived GWS by employing a non-parametric seasonal trend decomposition procedure based on Loess (STL) (Cleveland et al., 1990) that allows for the resolution of seasonal, trend, and irregular components of GRACE-derived GWS for each large aquifer system. In contrast to linear or multiple linear regression techniques, STL does not assume that data are normally distributed or that the underlying trend is linear (Shamsudduha et al., 2009;Humphrey et al., 2016;Sun et al., 2017).

Global large aquifer systems
We use the World-wide Hydrogeological Mapping and Assessment Programme (WHYMAP) Geographic Information System (GIS) dataset for the delineation of world's 37 large aquifer systems (Fig. 1, Table 1) (WHYMAP and Margat, 2008). The WHYMAP network, led by the German Federal Institute for Geosciences and Natural Resources (BGR), serves as a central repository and hub for global groundwater data, information, and mapping with a goal of assisting regional, national, and international efforts toward sustainable groundwater management (Richts et al., 2011). The largest aquifer system in this dataset (Table S1 in the Supplement) is the East European Aquifer System (WHYMAP no. 33; area: 2.9 million km 2 ), and the smallest is the California Central Valley Aquifer System (WHYMAP no. 16; area: 71 430 km 2 ), which is smaller than the typical sensing area of GRACE (∼ 200 000 km 2 ). However, Longuevergne et al. (2013) argue that GRACE satellites are sensitive to total mass changes at a basin scale, so TWS measurements can be applied to smaller basins if the magnitude of temporal mass changes is substantial due to mass water withdrawals (e.g. intensive groundwater-fed irrigation). The mean and median sizes of these large aquifers are ∼ 945 000 and ∼ 600 000 km 2 , respectively.
The CSR land solution (version RL05.DSTvSCS1409) is post-processed from spherical harmonics released by the Centre for Space Research (CSR) at the University of Texas at Austin. CSR gridded datasets are available at a monthly time step and a spatial resolution of 1 • × 1 • (∼ 111 km at Equator), though the actual spatial resolution of the GRACE footprint (Scanlon et al., 2012a) is 450 km ×450 km or ∼   Table 1 and correspond to the numbers shown on this map for reference. Grey shading shows the aridity index based on CGIAR's database of the Global Potential Evapotranspiration (Global-PET) and Global Aridity Index (https://cgiarcsi.community/, last access: 5 August 2020); the proportion (as a percentage) of long-term trends in GRACE-derived GWS of these large aquifer systems that is explained by linear trend fitting is shown in colour (i.e. linear trends toward dark red and non-linear trends toward light brown-yellow).
200 000 km 2 . To amplify TWS signals we apply the dimensionless scaling factors provided as 1 • × 1 • bins that are derived from minimising differences between TWS estimated from GRACE and the hydrological fields from the Community Land Model (CLM4.0) (Landerer and Swenson, 2012). JPL Mascons (version RL05M_1.MSCNv01) data processing involves the same glacial isostatic adjustment correction but applies no spatial filtering as JPL RL05M directly relates inter-satellite range-rate data to mass concentration blocks (Mascons) to estimate monthly gravity fields in terms of equal-area 3 • × 3 • mass concentration functions in order to minimise measurement errors. Gridded Mascon fields are provided at a spatial sampling of 0.5 • in both latitude and longitude (∼ 56 km at the Equator). Similar to the CSR product, dimensionless scaling factors are provided as 0.5 • ×0.5 • bins (Shamsudduha et al., 2017) to apply to the JPL Mascons product that also derive from the Community Land Model (CLM4.0) . The scaling factors are multiplicative coefficients that minimise the difference between the smoothed and unfiltered monthly TWS variations from the CLM4.0 hydrology model . Finally, GRGS GRACE (version RL03-v1) monthly gridded solutions with a spatial resolution of 1 • × 1 • are extracted, and aggregated time series data are generated for each aquifer system. A description of the estimation method for GWS from GRACE and in situ observations is provided below.

Estimation of ∆GWS from GRACE
We apply monthly measurements of terrestrial water storage anomalies ( TWS) from Gravity Recovery and Climate Experiment (GRACE) satellites and simulated records of soil moisture storage ( SMS), surface runoff or surface water storage ( SWS), and snow water equivalent ( SNS) from NASA's Global Land Data Assimilation System (GLDAS version 1.0) at 1 • × 1 • grids for the period of August 2002 to July 2016 to estimate (Eq. 1) groundwater storage changes ( GWS) in the 37 WHYMAP large aquifer systems. This approach is consistent with previous global (Thomas et al., 2017) and basin-scale (Rodell et al., 2009;Asoka et al., 2017;Feng et al., 2018) analyses of GWS from GRACE. We apply three gridded GRACE products (CSR, JPL Mascons, GRGS), an ensemble mean of TWS and individual storage components of SMS and SWS from four land surface models (LSMs: CLM, Noah, VIC, Mosaic), and a single SNS from Noah model (GLDAS version 2.1) to derive a total of 20 realisations of GWS (Table S5) for each of the 37 aquifer systems. We then averaged all the GRACE-derived GWS estimates to generate an ensemble mean GWS time series record for each aquifer system. GRACE and GLDAS LSM-derived datasets are processed and analysed in the R programming language (R Core Team, 2017).

Global precipitation datasets
To evaluate the relationships between precipitation and GRACE-derived GWS, we use a high-resolution (0.5 • ) gridded, global precipitation dataset (version 4.01) (Harris et al., 2014) available from the Climatic Research Unit (CRU) at the University of East Anglia (https://crudata.uea. ac.uk/cru/data/hrg/, last access: 5 August 2020). In light of uncertainty in observed precipitation datasets globally, we test the robustness of the relationship between precipitation and groundwater storage using the GPCC (Global Precipitation Climatology Centre) precipitation dataset (Schneider et al., 2017) (https://www.esrl.noaa.gov/psd/data/gridded/data. gpcc.html) from 1901 to 2016. Time series (January 1901 to July 2016) of monthly precipitation from CRU and GPCC datasets for the WHYMAP aquifer systems were analysed and processed in the R programming language (R Core Team, 2017).

Seasonal trend decomposition (STL) of GRACE ∆GWS
Monthly time series records (August 2002 to July 2016; Figs. S1-S36) of the ensemble mean GRACE TWS and GRACE-derived GWS were decomposed to seasonal, trend, and remainder or residual components using a non-parametric time series decomposition technique known as the seasonal trend decomposition procedure based on a locally weighted regression method called Loess (STL) (Cleveland et al., 1990). Loess is a nonparametric method so that the fitted curve is obtained empirically without assuming the specific nature of any structure that may exist within the data (Jacoby, 2000). A key advantage of the STL method is that it reveals relatively complex structures in time series data that could easily be overlooked using traditional statistical methods such as linear regression. STL decomposition techniques have previously been used to analyse GRACE TWS regionally (Hassan and Jin, 2014) and globally (Humphrey et al., 2016). GRACE-derived GWS time series records for each aquifer system were decomposed using the STL method (see Eq. 2) in the R programming language (R Core Team, 2017) as where Y t is the monthly GWS at time t, T t is the trend component, S t is the seasonal component, and R t is a remainder (residual or irregular) component. The STL method consists of a series of smoothing operations with different moving window widths chosen to extract different frequencies within a time series, and it can be regarded as an extension of classical methods for decomposing a series into its individual components (Chatfield, 2003). The nonparametric nature of the STL decomposition technique enables the detection of non-linear patterns in longterm trends that cannot be assessed through linear trend analyses (Shamsudduha et al., 2009). For STL decomposition, it is necessary to choose values of smoothing parameters to extract trend and seasonal components. The selection of parameters in STL decomposition is a subjective process. The choice of the seasonal smoothing parameter determines the extent to which the extracted seasonal component varies from year to year: a large value will lead to similar components in all years, whereas a small value will allow the extracted component to track the observations more closely. Similar comments apply to the choice of smoothing parameter for the trend component. We experimented with several different choices of smoothing parameters (see Fig. S37) and checked the residuals (i.e. remainder component) for the overall performance of the STL decomposition model. We conducted the Shapiro-Wilk normality test on the residuals after fitting the STL smooth line with a range of trend cycle (t.window) and seasonal (s.window) windows and compared the p values. Visualisation of the results with several smoothing parameters (Fig. S37) and the corresponding smaller p values (i.e. p value <0.01) of the normality test suggested that the overall structure of time series at all sites could be captured reasonably well using window widths of 13 for the seasonal component and 37 for the trend. We apply the STL decomposition with a robust fitting of the Loess smoother (Cleveland et al., 1990) to ensure that the fitting of the curvilinear trend does not have an adverse effect due to extreme outliers in the time series data (Jacoby, 2000). Finally, to make the interpretation and comparison of non-linear trends across all 37 aquifer systems, smoothing parameters were then fixed for all subsequent STL analyses.

GRACE ∆GWS and evidence from in situ piezometry
Evaluations of computed GRACE-derived GWS using in situ observations are limited spatially and temporally by the availability of piezometric records (Swenson et al., 2006;Strassberg et al., 2009;Scanlon et al., 2012b;Shamsudduha et al., 2012;Panda and Wahr, 2015;Feng et al., 2018). Consequently, comparisons of GRACE and in situ GWS remain opportunity-driven and, here, comprise the Limpopo Basin in southern Africa and the Bengal Basin in Bangladesh where we possess time series records of adequate duration and density. The Bengal Basin is a part of the Ganges-Brahmaputra aquifer system (aquifer no. 24), whereas the Limpopo Basin is located between the Lower Kalahari-Stampriet Basin (aquifer no. 12) and the Karoo Basin (aquifer no. 13). The two basins feature contrasting climates (i.e. tropical humid versus tropical semi-arid) and geologies (i.e. unconsolidated sands versus weathered crystalline rock) that represent key controls on the magnitude and variability expected in GWS. Both basins are in the tropics and, as such, serve less well to test the computation of GRACE-derived GWS at middle and high latitudes.
In the Bengal Basin, computed GRACE and in situ GWS values demonstrate an exceptionally strong seasonal signal associated with monsoonal recharge that is amplified by dryseason abstraction (Shamsudduha et al., 2009(Shamsudduha et al., , 2012 and high storage of the regional unconsolidated sand aquifer, represented by a bulk specific yield (S y ) of 10 % (Fig. S38a). is one-fifth of that in the Bengal Basin (2276 mm), the seasonal signal in GWS, primarily in weathered crystalline rocks with a bulk S y of 2.5 %, is smaller (Fig. S38b). Time series of GRACE and LSMs are shown in Fig. S40. Comparison of in situ GWS, derived from a network of 40 piezometers (mean density of one piezometer per 1175 km 2 ), and computed GRACE-derived GWS shows broad correspondence (r = 0.62, p value <0.001), though GRACEderived GWS is "noisier"; intra-annual variability may result from uncertainty in the representation of other terrestrial stores using LSMs that are used to compute GRACEderived GWS from GRACE TWS. The magnitude of uncertainty in monthly SWS, SMS, and SNS, which are estimated by GLDAS LSMs to compute GRACE-derived GWS in each large-scale aquifer system, is depicted in Figs. 2 and S1-S36. The favourable, statistically significant correlations between the computed ensemble mean GRACEderived GWS and in situ GWS shown in these two contrasting basins indicate that, at large scales (∼ 200 000 km 2 ), the methodology used to compute GRACE-derived GWS has merit.

Trends in GRACE ∆GWS time series
Computation of GRACE-derived GWS for the 37 largescale aquifers globally is shown in Figs. 2 and 5. Figure 2 shows the ensemble GRACE TWS and GLDAS LSM datasets used to compute GRACE-derived GWS for the High Plains Aquifer System in the USA (aquifer no. 17 in Fig. 1); datasets used for all other large-scale aquifer systems are given in the Supplement (Figs. S1-S36). In addition to the ensemble mean, we show uncertainty in GRACE- derived GWS associated with 20 realisations from GRACE products and LSMs. Monthly time series data on ensemble GRACE-derived GWS for the other 36 large-scale aquifers are plotted (absolute scale) in Fig. 5 (in black) and fitted with a Loess-based trend (in blue). For all but five large aquifer systems (e.g. Lake Chad Basin WHYMAP no. 7, Umm Ruwaba 8, Amazon 19, West Siberian Basin 25, and East European 33), the dominant time series component explaining variance in GRACE-derived GWS is trend (Figs. 3b and S41-S77). Trends in GRACE-derived GWS are, however, overwhelmingly non-linear (curvilinear); linear trends adequately (R 2 >0.5, p value <0.05) explain variability in GRACE-derived GWS in just 5 of 37 large-scale aquifer systems, and of these, only two (Arabian 22, Canning 37) are declining. GRACE-derived GWS values for three intensively developed, large-scale aquifer systems (Table S1: California Central Valley 16, Ganges-Brahmaputra 24, North China Plain 29) show episodic declines (Fig. 5), though in each case their overall trend from 2002 to 2016 is declining but non-linear (Fig. 1).

Computational uncertainty in GRACE ∆GWS
For several large aquifer systems primarily in arid and semiarid environments, we identify anomalously negative or positive estimates of GRACE-derived GWS that deviate substantially from underlying trends (Figs. 6 and S78). For example, the semi-arid Upper Kalahari-Cuvelai-Zambezi Basin (11) features an extreme negative anomaly in GRACEderived GWS (Fig. 6a)  In the snow-dominated, humid Angara-Lena Basin (27), a strongly positive combined signal of SNS+ SWS exceeding TWS leads to a very negative estimation of GWS when groundwater is following a rising trend (Figs. 6c and S26).

GRACE ∆GWS and extreme precipitation
Non-linear trends in GRACE-derived GWS (i.e. difference in STL trend component between two consecutive years) demonstrate a significant association with precipitation anomalies from the CRU dataset for each hydrological year (i.e. percent deviations from mean annual precipitation between 2002 and 2016) in semi-arid environments ( Fig. 7; Pearson correlation, r = 0.62, p<0.001). These associations over extreme hydrological years are particularly strong in a number of individual aquifer systems ( Fig. 5; Tables S3 and S4) including the Great Artesian Basin (36) (r = 0.93), California Central Valley (16) (r = 0.88), North Caucasus Basin (34) (r = 0.65), Umm Ruwaba Basin (8) (r = 0.64), and Ogallala (High Plains) Aquifer (17) (r = 0.64). In arid aquifer systems, overall associations between GRACE GWS and precipitation anomalies are statistically significant but moderate (r = 0.36, p<0.001); a strong association is found only for the Canning Basin (37) (r = 0.52). In humid (and sub-humid) aquifer systems, no overall statistically significant association is found, yet strong correlations are noted for two temperate aquifer systems (Northern Great Plains Aquifer (14) (Tables S3 and S4 and Figs. S2, S8, S12, S16, S22). Similar rises in GRACE-derived GWS in response to extreme annual rainfall in arid basins include the Lake Chad Basin (7) in 2012 and Ogaden-Juba Basin (9) in 2013 (Table S3 and Figs. S7, S9). In the Canning Basin, a substantial rise in GRACE-derived GWS occurred in 2010-2011 (Tables S3  and S4 and Fig. S36) in response to extreme annual rainfall, though the overall trend is declining.
Non-linear trends that feature substantial rises in GRACEderived GWS in response to extreme annual precipitation under humid climates are observed in the Maranhão Basin   (Tables S3 and S4 and Fig. S24). In the sub-humid Northern Great Plains (14), distinct rises in GRACE-derived GWS occurred in 2010 (Tables S3 and S4 and Fig. S14) in response to extreme annual precipitation, though the overall trend is linear and rising. The overall agreement in mean annual precipitation between the CRU and GPCC datasets for the period of 1901 to 2016 is strong (median correlation coefficient in 37 aquifer systems; r = 0.92).

Uncertainty in GRACE-derived ∆GWS
We compute the range of uncertainty in GRACE-derived GWS associated with 20 potential realisations from applied GRACE (CSR, JPL Mascons, GRGS) products and LSMs (CLM, Noah, VIC, Mosaic). Uncertainty is generally higher for aquifer systems located in arid to hyper-arid environments (Table 2, see Fig. S79). Computation of GRACEderived GWS relies upon uncalibrated simulations of individual terrestrial water stores (i.e. SWS, SWS, SNS) from LSMs to estimate GWS from GRACE TWS. A recent global-scale comparison of TWS estimated by GLDAS LSMs and GRACE (Scanlon et al., 2018) indicates that LSMs systematically underestimate water storage changes. Further, the absence of river routing and representation of lakes and reservoirs in the estimation of SWS by LSMs constrains the computation of GRACE GWS as similarly recognised by Scanlon et al. (2019). Finally, substantial variability in SMS among GLDAS models and the limited depth (<3.5 m b.g.l.) to the deepest soil layer over which these LSMs simulate SMS also hamper the estimation of GRACE GWS, especially in drylands where the thickness of unsaturated zones may be an order of magnitude greater .
We detect probable errors in GLDAS LSM data from events that produce large deviations in GWS (Fig. 5). These errors occur because GRACE-derived GWS is computed as a residual (Eq. 1); overestimation (or underestimation) of these combined stores produces negative (or positive) values of GRACE-derived GWS when the aggregated value of other terrestrial water stores is strongly positive (or negative) and no lag is assumed (Shamsudduha et al., 2017). Evidence from the limited piezometric data presented here and elsewhere (Panda and Wahr, 2015;Feng et al., 2018) suggests that the dynamics in computed GRACE-derived GWS are nonetheless reasonable, yet the amplitude of GWS from piezometry is scalable due to uncertainty in the applied S y (Shamsudduha et al., 2012).
Assessments of GWS derived from GRACE are constrained by both their limited time span (2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)   and course spatial resolution (>200 000 km 2 ). For example, centennial-scale piezometry in the Ganges-Brahmaputra aquifer system (no. 24) reveals that recent groundwater depletion (i.e. groundwater withdrawals that are unlikely to be replenished within a century as per Bierkens and Wada, 2019) in NW India traced by GRACE (Figs. 5 and S23) (Rodell et al., 2009;Chen et al., 2014) follows more than a century of groundwater accumulation (see Fig. S80) through the leakage of surface water via a canal network constructed primarily during the 19th century (MacDonald et al., 2016). Long-term piezometric records from central Tanzania and the Limpopo Basin of southern Africa (Fig. S81) show dramatic increases in GWS associated with extreme seasonal rainfall events that occurred prior to 2002 and thus provide a vital context to the more recent period of GWS estimated by GRACE. At regional scales, GRACE-derived GWS can differ substantially from more localised, in situ observations of GWS from piezometry. In the Karoo Basin (aquifer no. 13), GRACE-derived GWS is also rising (Figs. 5 and S13) over periods during which groundwater depletion has been reported in parts of the basin (Rosewarne et al., 2013). In the Guarani Aquifer System (21), groundwater depletion is reported for 2005 to 2009 in Ribeirão Preto near São Paulo as a result of intensive groundwater withdrawals for urban water supplies and the irrigation of sugarcane (Foster et al., 2009), yet GRACE-derived GWS over this same period is rising.

Variability in GRACE ∆GWS and role of extreme precipitation
Non-linear trends in GRACE-derived GWS arise, in part, from inter-annual variability in precipitation which has similarly been observed in analyses of GRACE TWS (Humphrey et al., 2016;Sun et al., 2017;Bonsor et al., 2018). Annual precipitation in the Great Artesian Basin (aquifer no. 36) provides a dramatic example of how years (2009-2010 and 2010-2011 from both CRU and GPCC datasets) of extreme precipitation can generate anomalously high groundwater recharge that arrests a multi-annual declining trend (Fig. 5), increasing variability in GRACE-derived GWS over the relatively short period (15 years) of GRACE data. The disproportionate contribution of episodic, extreme rainfall to groundwater recharge has previously been shown by Taylor et al. (2013b) from long-term piezometry in semi-arid central Tanzania where nearly 20 % of the recharge observed over a 55-year period resulted from a single season of extreme rainfall, associated with the strongest El Niño event Figure 7. Relationships between precipitation anomaly and annual changes in non-linear trends of GRACE GWS in the 37 large aquifer systems grouped by aridity indices. Annual precipitation is calculated based on hydrological year (August to July) for 12 of these aquifer systems and the other 25 following the calendar year (January to December); the highlighted (red) circles on the scatterplots are the years of statistically extreme (> 90th percentile; period: 1901 to 2016) precipitation. (1997)(1998) of the last century (Fig. S81a). Further analysis from multi-decadal piezometric records in drylands across tropical Africa  confirms this bias in response to intensive precipitation.
The dependence of groundwater replenishment on extreme annual precipitation indicated by GRACE-derived GWS for many of the world's large aquifer systems is consistent with evidence from other sources. In a pan-tropical comparison of stable-isotope ratios of oxygen ( 18 O: 16 O) and hydrogen ( 2 H: 1 H) in rainfall and groundwater, Jasechko and Taylor (2015) show that recharge is biased to intensive monthly rainfall, commonly exceeding the 70th percentile. In humid Uganda, Owor et al. (2009) demonstrate that groundwater recharge observed from piezometry is more strongly correlated with daily rainfall exceeding a threshold (10 mm) than all daily rainfall. Periodicity in groundwater storage indicated by both GRACE and in situ data has been associated with large-scale synoptic controls on precipitation (e.g. El Niño-Southern Oscillation, Pacific Decadal Oscillation) in southern Africa (Kolusu et al., 2019) and has been shown to amplify recharge in major US aquifers (Kuss and Gurdak, 2014) and groundwater depletion in India (Mishra et al., 2016).
In some large-scale aquifer systems, GRACE-derived GWS exhibits comparatively weak correlations with precipitation. In the semi-arid Iullemmeden-Irhazer Aquifer (6) variance in rainfall over the period of GRACE observations following the multi-decadal Sahelian drought is low (Table S1), and the net rise in GRACE-derived GWS is associated with changes in the terrestrial water balance resulting from land cover change (Ibrahim et al., 2014). In the Amazon (16), rising trends in GRACE-derived GWS, which are aligned with TWS reported previously by Scanlon et al. (2018) and Rodell et al. (2018), occur during a period (2010-2016; see Table S18) that is the driest since the 1980s (Chaudhari et al., 2019); analyses over the longer term  nevertheless point to an overall intensification of the Amazonian hydrological cycle.

Trends in GRACE ∆GWS under global change
Our analysis identifies non-linear trends in GRACE-derived GWS for the vast majority (32 of 37) of the world's large aquifer systems (Figs. 1, 5, and 8). Non-linearity reflects, in part, the variable nature of groundwater replenishment observed at the scale of the GRACE footprint that is consistent with more localised emerging evidence from multi-decadal piezometric records (Taylor et al., 2013b) (Fig. S81a). The variable and often episodic nature of groundwater replenishment complicates assessments of the sustainability of groundwater withdrawals and highlights the importance of long-term observations over decadal timescales in undertaking such evaluations. Dramatic rises in freshwater withdrawals, primarily associated with the expansion of irrigated agriculture in semi-arid environments, have nevertheless led to groundwater depletion, as computed globally from hydrological models (e.g. Wada et al., 2010;de Graaf et al., 2017) and volumetric-based calculations (Konikow, 2011). Further, groundwater depletion globally has been shown to contribute to sea level rise (e.g. Wada et al., 2016). However, as recognised in a comprehensive review by Bierkens and Wada (2019), groundwater depletion is often localised, occurring below the footprint (200 000 km 2 ) of GRACE as has been well demonstrated by detailed modelling studies in the California Central Valley (Scanlon et al., 2012a) and North China Plain (Cao et al., 2013).
Projections of the sustainability of groundwater withdrawals under global change are complicated, in part, by uncertainty in how radiative forcing will affect large-scale regional controls on extreme annual precipitation like the El Niño-Southern Oscillation (Latif and Keenlyside, 2009). Globally, Reager et al. (2016) show a trend towards enhanced precipitation on land under climate change. Given this trend and the observed intensification of precipitation on land from global warming (Allan et al., 2010;Westra et al., 2013;Zhang et al., 2013;Myhre et al., 2019), groundwater recharge to many large-scale aquifer systems may increase under climate change as revealed by the statistical relationships found in this study between GWS and extreme annual precipitation. The magnitude of this increase is, however, unlikely to offset the impact of human withdrawals in areas of intensive abstraction for irrigated agriculture as shown in NW India by Xie et al. (2020). The developed set of GRACE-derived GWS time series data for the world's large aquifer systems provided here offers a consistent, additional benchmark alongside long-term piezometry to assess not only large-scale climate controls on groundwater replen-ishment but also opportunities to enhance groundwater storage through managed aquifer recharge.

Conclusions
Changes in groundwater storage ( GWS) computed from GRACE satellite data continue to rely upon uncertain, uncalibrated estimates of changes in other terrestrial stores of water found in soil, surface water, and snow-ice from globalscale models. The application here of ensemble mean values of three GRACE TWS processing strategies (CSR, JPL Mascons, GRGS) and five land surface models (GLDAS 1: CLM, Noah, VIC, Mosaic; GLDAS 2: Noah) is designed to reduce the impact of uncertainty in an individual model or GRACE product on the computation of GRACE-derived GWS. We nevertheless identify a few instances in which erroneously high or low values of GRACE-derived GWS are computed; these occur primarily in arid and semi-arid environments where uncertainty in the simulation of terrestrial water balances is greatest. Over the period of GRACE observations (2002 to 2016), we show favourable comparisons between GRACE-derived GWS and piezometric observations (r = 0.62 to 0.86) in two contrasting basins (i.e. semi-arid Limpopo Basin, tropical humid Bengal Basin) for which in situ data are available. This study thus contributes to a growing body of research and observations reconciling computed GRACE-derived GWS and ground-based data.
GRACE-derived GWS from 2002 to 2016 for the world's 37 large-scale aquifer systems shows substantial variability as explicitly revealed by 20 potential realisations from GRACE products and LSMs computed here; trends in ensemble mean GRACE-derived GWS are overwhelmingly (87 %) non-linear. Linear trends adequately explain variability in GRACE-derived GWS in just five aquifer systems for which linear declining trends, indicative of groundwater depletion, are observed in two aquifer systems (Arabian, Canning); overall trends for three intensively developed, large-scale aquifer systems (California Central Valley, Ganges-Brahmaputra, North China Plain) are declining but non-linear. This non-linearity in GRACE-derived GWS for the vast majority of the world's large aquifer systems is inconsistent with previous analyses at the scale of the GRACE footprint (∼ 200 000 km 2 ), asserting globalscale groundwater depletion. Groundwater depletion, more commonly observed by piezometry, is experienced at scales well below the GRACE footprint and is likely to be more pervasive than suggested by the presented analysis of large-scale aquifers. Non-linearity in GRACE-derived GWS arises, in part, from episodic recharge associated with extreme (> 90th percentile) annual precipitation. This episodic replenishment of groundwater, combined with natural discharges that sustain ecosystem functions and human withdrawals, produces highly dynamic aquifer systems that complicate assessments of the sustainability of groundwater withdrawals from large aquifer systems. These findings highlight, however, potential opportunities for sustaining groundwater withdrawals through induced recharge from extreme precipitation and managed aquifer recharge. Data availability. Time series data for a monthly anomaly (August 2002 to July 2016) from an ensemble mean of three GRACE products, GLDAS land surface models, and monthly precipitation data from CRU for the 37 world's large aquifer systems can be accessed at the UK's National Geoscience Data Centre (NGDC) at https://doi.org/10.5285/0387fd53-9fed-468e-bda9-0bde8f5fda13 .

Supplement.
Supplementary information is available for this paper as a single PDF file. The supplement related to this article is available online at: https://doi.org/10.5194/esd-11-755-2020supplement.
Author contributions. MS and RGT co-developed the study. MS processed and analysed datasets; both MS and RGT discussed and interpreted results. RGT drafted the paper, and both authors made comments and edited to finalise the paper.
Competing interests. The authors declare that they have no conflict of interest.

Acknowledgements. Mohammad Shamsudduha and Richard G.
Taylor acknowledge support from NERC-ESRC-DFID UPGro Gro-Futures (http://grofutures.org/, last access: 5 August 2020). Richard G. Taylor also acknowledges the support of a Royal Society Leverhulme Trust Senior Fellowship. This study acknowledges the World-wide Hydrogeological Mapping and Assessment Programme (WHYMAP) of the German Federal Institute for Geosciences and Natural Resources (BGR) and UNESCO for world's large aquifer system Geographic Information System (GIS) database. Gridded CSR and JPL Mascons GRACE data from NASA's GRCTellus repository and GRGS GRACE data from the French National Centre for Space Studies (CNES) are acknowledged. NASA's Goddard Earth Sciences Data and Information Services Centre (GES DISC) is duly acknowledged for GLDAS (version 1 and 2) land surface model data. This study also acknowledges the University of East Anglia (UK) for CRU and the National Center for Atmospheric Research (USA) for GPCC precipitation data. Finally, support from the Bangladesh Water Development Board (BWDB) and the Department of Water and Sanitation (South Africa) for groundwaterlevel data, as well as CGIAR for the Global Aridity Index data, is duly acknowledged.
Financial support. This research has been supported by the UK NERC-ESRC-DFID (grant no. NE/M008932/1 (UPGro GroFutures)) and the UK Royal Society (grant no. LT170004 (Leverhulme Trust Senior Fellowship)). Review statement. This paper was edited by Richard Betts and reviewed by Marc Bierkens and Soumendra Bhanja.