Future hydrological extremes: the uncertainty from multiple global climate and global hydrological models

. Projections of changes in the hydrological cycle from global hydrological models (GHMs) driven by global climate models (GCMs) are critical for understanding future occurrence of hydrological extremes. However, uncertainties remain large and need to be better assessed. In particular, recent studies have pointed to a considerable contribution of GHMs that can equal or outweigh the contribution of GCMs to uncertainty in hydrological projections. Using six GHMs and ﬁve GCMs from the ISI-MIP multi-model ensemble, this study aims: (i) to assess future changes in the frequency of both high and low ﬂows at the global scale using control and future (RCP8.5) simulations by the 2080s, and (ii) to quantify, for both ends of the runoff spectrum, GCMs and GHMs contributions to uncertainty using a two-way ANOVA. Increases are found in high ﬂows for northern latitudes and in low ﬂows for several hotspots. Globally, the largest source of uncertainty is associated with GCMs, but GHMs are the greatest source in snow-dominated regions. More speciﬁcally, results vary depending on the runoff metric, the temporal (annual and seasonal) and regional scale of analysis. For instance, uncertainty contribution from GHMs is higher for low ﬂows than it is for high ﬂows, partly owing to the different processes driving the onset of the two phenomena (e.g. the more direct effect of the GCMs’ precipitation variability on high ﬂows). This study provides a comprehensive synthesis of where future hydrological extremes are projected to increase and where the ensemble spread is owed to either GCMs or GHMs. Finally, our results underline the need for improvements in modelling snowmelt and runoff processes to project future hydrological extremes and the importance of using multiple GCMs and GHMs to encompass the uncertainty range provided by these two sources.


Introduction
The ongoing intensification of the water cycle at the global scale is expected to continue in the coming decades (Huntington, 2006;Stott et al., 2010). Projected changes in climate variables from global climate models (GCMs) indicate an increase in the frequency of hydrological extremes (Tebaldi et al., 2006;Seneviratne et al., 2012;Sillmann et al., 2013;Kharin et al., 2013). These hydrological shifts go hand in hand with a growing world population that will become ever more vulnerable with respect to access to water and food, and resilience to natural hazards (Lavell et al., 2012). In this context, global multi-model ensembles yield a valu-able opportunity for climate projections and impact assessments. In hydrology, multi-model ensemble experimentsconsisting of global hydrological models (GHMs) fed by input forcing simulated by GCMs -can be used to project future changes in the water cycle and future hydrological extremes, using modelled variables such as precipitation, runoff and soil moisture. In recent years, a number of studies have assessed the future changes in the global water cycle (e.g. Nohara et al., 2006;Hirabayashi et al., 2008;Sheffield and Wood, 2008). Although many of these studies have a representative number of GCMs in their ensembles, they rarely comprise more than one GHM, and this presents a limitation considering that GHMs provide more uncertainty than pre-viously thought (Haddeland et al., 2011;Hagemann et al., 2013;Schewe et al., 2013;Prudhomme et al., 2014). In addition, the coarse temporal and spatial resolution of the climate signal used in these studies does not reflect well the potential changes in sub-monthly extreme events at the regional and local scale (Forzieri et al., 2014).
Recently, model inter-comparison projects like WaterMIP (Haddeland et al., 2011) and ISI-MIP  have made it possible to include multiple GCMs and GHMs in global impact studies at unprecedented temporal (up to daily) and spatial (0.5 • ) resolution, thereby providing frameworks for consistent assessments of the terrestrial water cycle.
The ISI-MIP data set has been used to assess future changes in runoff at global and regional scales. Dankers et al. (2013) explored changes in a 30-year return period of river flow showing that flood hazard is projected overall to increase globally, although not uniformly, and that decreases occur mainly in areas where the hydrograph is dominated by spring snowmelt. Schewe et al. (2013) assessed future water scarcity by analysing changes in mean annual runoff together with global population patterns, showing how the number of people living in water scarcity is projected to increase globally. Davie et al. (2013) investigated runoff changes across models by grouping GHMs into hydrological and biome (including CO 2 and vegetation dynamics) models, showing that while both types agree on the sign of runoff change for most regions of the world (with contrasting exceptions like West Africa where biome models moisten and hydrological models dry), models accounting for varying CO 2 yield more runoff than those with constant CO 2 . Prudhomme et al. (2014) examined the future frequency of droughts using a variable threshold method on daily runoff. They identified drought hotspots globally and observed, similarly to Davie et al. (2013), how biome models accounting for varying CO 2 concentrations tend to project more runoff with increasing CO 2 than the hydrological models. All of these studies emphasize how both GCM and GHM uncertainty contribute to the spread in projected changes in the hydrological cycle. Their findings highlight the importance of including different types of GHMs and GCMs for making comprehensive assessments of uncertainty in climate impact studies.
In this context, modelling-induced uncertainty (i.e. intermodel spread of GCMs and GHMs) has been expressed by looking at the variance across both types of models. For example, Schewe et al. (2013) and Dankers et al. (2013) used the ratio of the variances of GCM and GHM results (for GCMs: variance of the change across all GCMs for each GHM, then averaged over all of the GHMs; and vice versa for GHMs). Similarly, using WaterMIP data, Hagemann et al. (2013) expressed the spread due to the choice of model type using the standard deviation of GCMs and GHMs (for GCMs: the mean across all GHMs for each GCM, and standard deviation of the GCMs; and vice versa for GHMs). Prudhomme et al. (2014) omit the partition into GCM/GHM and express the uncertainty through the signal-to-noise ratio (by grouping results per type of model) in order to infer which global model type in the ensemble brings about highest agreement.
The studies cited above have provided useful knowledge on climate change impacts on the water cycle using the ISI-MIP data set, however, a synthesis of future projections for high and low flows along with a consistent estimation of uncertainties is still missing. The present study builds on the work on low flows of Prudhomme et al. (2014), but introduces several new aspects. Firstly, low flows (Q 10 ) are now analysed using an improved index extraction. The variable threshold method used in Prudhomme et al. (2014) has been revisited to overcome a limitation of the 30-day moving window for which grid cells were assigned lower threshold values than the theoretical threshold assigned (Q 10 ) (i.e. a tendency to capture fewer occurrences, an effect perhaps attributable to GHMs' slow emptying of reservoirs during the recession phase). A shorter 5-day fixed time window eliminates this effect. Note that, in order to gather further data for the estimate of the quantile flow, the period of analysis was increased from 30 to 34 years, starting 4 years earlier (1972 for control and 2066 for future). Secondly, we now analyse high flows (Q 95 ), with the same method used for low flows (5-day fixed-window variable-threshold method). Dankers et al. (2013), who also analysed high flows, have focused on a different metric (annual extreme monthly flood peak with 30-year return level), as their aim was to describe changes in flood hazard, while our focus is on change in frequency of high flow days. In our study high and low flows are hence identified jointly with the same ensemble of five GCMs and six GHMs. While comprising the same number of GCMs, the ensemble used by Prudhomme et al. (2014) uses one additional GHM (JULES) and Dankers et al. (2013) uses three additional GHMs (JULES, LPJmL, MATSIRO). We did not use these additional GHMs as they showed large areas with long pools of zero values hindering the index extraction, making them unsuitable for our analysis, especially for the low flows; additionally, JULES was run at a coarser resolution (1.25-1.875 • vs. 0.5-0.5 • ) that would potentially influence the uncertainty analysis. Thirdly, we assess systematically the relative contribution of GHMs and GCMs to uncertainty using an analysis of variance (ANOVA) framework as in e.g. Yip et al. (2011) and Sansom et al. (2013). This uncertainty assessment moves beyond the signal-to-noise ratio by Prudhomme et al. (2014), as the quantification of each source (GCM/GHM) to total uncertainty allows us to describe the spatial variability of the contributions grid cell per grid cell. While Dankers et al. (2013) and Schewe et al. (2013) partition GCM/GIM uncertainty using ratios between the variances, our ANOVA approach adds the contribution of the error (or residual) to the partition of the variance along with post hoc testing on the residuals for model adequacy. We thus describe how high and low flows and inherent uncertainty vary at the seasonal and spatial scale, identifying areas where we have more confidence in the climate or in the hydrology (i.e. uncertainty is owed to GCMs or GHMs). Finally, to understand how the variance of the changes differs regionally, we carry out analysis at the regional scale expressing the ANOVA sum-of-squares of each source using homogeneous geo-climate regions (Köppen-Geiger). This allows for an improved understanding of how the climate and hydrological processes drive uncertainty for both runoff ends.
By comparing an ensemble of GCMs (5) and GHMs (6) for future projections (2066-2099) against the historical period , this study aims (i) to assess future high and low flows changes at global and annual and seasonal scales, and (ii) to quantify the uncertainty attributable to GHMs and GCMs using ANOVA. In the next section, the data set and the different steps of the methodology are detailed. The results of projected hydrological extremes and respective uncertainty are presented in Sect. 3 before discussing the important and wider implications of this research in the fourth and final section.

Data and methods
The data set used herein comes from the Inter-Sectorial Impact Model Intercomparison Project (ISI-MIP)  and consists of daily total un-routed runoff at a spatial resolution of 0.5 degrees from an ensemble of six GHMs forced with five CMIP5 GCMs' bias-corrected climate (Hempel et al., 2013) for the historical  and future (2066-2099) periods under the RCP8.5 scenario. The six GHMs are: H08, MPIHM, MacPDM, VIC, WBM, PcrGLOBWB (see Table A1 in Appendix A for a summary of the main characteristics), and the five GCMs are: HadGEM2-ES, IPSL-CM5A-LR, MIROC-ESM-CHEM, GFDL-ESM2M, NorESM1-M (refer to Warszawski et al., 2014, for further details on the models and to www. isi-mip.org to access the simulation protocol). It should be noted that the selection of GHMs was dictated by temporal (daily runoff) resolution and time series tractability: models with lengthy pools of runoff equal to zero over large portions of the globe imposing constraints to the index extraction were not included (this aspect is described further in Appendix B). The selected model combinations form an ensemble of 30 experiments, each consisting of a historical and future period; none of the GHMs include varying CO 2 .
Our analytical framework was composed of four steps: (i) time series of days classified as high and low flows were extracted from daily total runoff record; (ii) high and low flow indices (i.e. change in frequency of high/flow flows) were calculated (future minus historical period) and mapped; (iii) ANOVA was carried out on the high and low flow indices considering GCMs and GHMs as factors; and (iv) the dominant uncertainty factors were explored for high and low flows across different climate regions based on the Köppen-Geiger classification.
To quantify high and low flow inter-annual variability, daily binary series (zero or one) were extracted for every land grid cell: high flow days, HFD; and low flows days, LFD. The series extraction uses daily varying threshold curves obtained from the daily runoff series for the historical period , which are then applied to the historical period and future projections to identify days above (for HFD) or below thresholds (for LFD), as in e.g. (for low flows) Prudhomme et al. (2014). High flows are characterized by the 95th percentile (Q 95 -runoff equaled or exceeded 5 % of the time) and low flows by the 10th percentile (Q 10 -runoff equaled or exceeded 90 % of the time). For HFD, a value of 1 (high flow) is assigned to each cell if the cell's runoff exceeds the Q 95 value, otherwise a value of 0 (no high flow) is assigned. For LFD, a value of 1 (low flows) is assigned to each cell if the cell's runoff is below the Q 10 value, otherwise a value of 0 (no low flow) is assigned. A comprehensive description of the threshold and binary series extraction together with an explanatory picture (Fig. B1) are provided in Appendix B. Grid cells showing little or no seasonal change in the daily runoff of the control period  were screened-out and represented in grey on the maps (for a comprehensive explanation of the masking see Appendix B). These screenedout grid cells are often located in arid or frozen regions where there is little or no runoff during long periods of the year and so the index extraction becomes intractable due to the presence of repeated zero values in the series.
We use indices to express the change in the frequency (in %) of: future high (HFI) and low (LFI) flows. These indices are calculated as follows: for each ensemble member HFI (LFI) is equal to the difference between the frequency (in %) of high (low) flows days (100× mean of HFD (LFD)) from the future (2066-2099) and historical period , for the whole year and per season (DJF and JJA). Both HFI and LFI are composed of 30 series (i.e. six GHMs fed by five GCMs each). The agreement in the change across ensemble members is expressed by the signal-to-noise ratio, S2N, calculated by dividing the median of the ensemble flow indices (HFI and LFI) by the inter-quartile range (75th percentile minus 25th percentile). The higher the S2N, the higher the members' agreement in the signal, assuming signal greater than noise if S2N > 1.
In this study, the uncertainty is reflected by the spread of the flow indices due to the choice of GCM or GHM. To quantify the individual contribution of GCMs and GHMs to total uncertainty, a 2-factor ANOVA was carried out on the flow indices HFI and LFI for each grid cell. For this data set, model runs had no replicates, therefore the ANOVA model considers one case per treatment (Neter et al., 1999, Chap. 21), so no interactions (αβ ij = 0) and fixed factors levels (n = 1): where Y ij is the mean change for GCM i and GHM j , µ is a constant (the overall mean), α i is the main effect for GCM at the ith level, β j is the main effect for GHM at the j th level, ε ij is the residual ≈ N(0, σ 2 )iid. Thus, the variance is partitioned into two factors, GCMs and GHMs, plus the residuals. The results, expressed in terms of sum of squares, are used to quantify the factors' contributions to the total variance, here considered as uncertainty as in e.g. Sansom et al. (2013). ANOVA models are reasonably robust against certain types of departures from the model (e.g. error terms not being exactly normally distributed). Nonetheless, the suitability of the ANOVA model with the data at hand should be checked for serious departures from the conditions assumed by the model by looking at the residuals (Neter et al., 1999, Chap. 18) and testing their normality (e.g. Lilliefors test) and constancy of variance (e.g. Hartley test). Unsatisfactory results would require remedial measures like data transformation or a modification of the model. To understand how variance differs between climate regions, the ANOVA sum of squares for all model combinations are shown per Köppen-Geiger class. We used the Köppen-Geiger data classification based on the present day as proposed by Kottek et al. (2006) (a link to the map is provided at the end of Table 1). A total of 15 (out of 31) regions are considered leaving out under-represented regions with too few grid cells (< 1000).

Results
Annual mean changes and associated S2N across all GHMs and GCMs are shown for HFI and LFI in Fig. 1a and b. For high and low flow indices, the mean changes vary spatially and in magnitude ( Fig. 2)   The results of the ANOVA across the 30 members of HFI and LFI are shown in Fig. 1c; they are expressed, for each factor, as the proportion of sum of squares divided by the total sum of squares (refer to Appendix C for residuals testing for model adequacy). For the high flows, the vari-ance is explained mostly by the GCMs (yellow, 47 % of unmasked land, Fig. 1c), although the GHMs are the major factor over western Europe and central Canada (green, 28 % of unmasked land, Fig. 1c). For low flows, the proportions change: the GCMs (43 %) remain the major contributors over the globe, but GHMs (35 %) increase to a relative influence closer to the GCMs, and become the major factor in some northern (e.g. northeastern Russia) and southern (e.g. southern Africa, southwestern Australia) regions. Seasonal results (Figs. 3c and 4c) are very similar to their annual counterparts in the case of high flows in DJF and low flows in JJA, whereas for high flows in JJA and for low flows in DJF higher residual rates (i.e. decreased overall GHM and GCM contributions) are found, perhaps owing to fewer events occurring in these seasons for both low and high flow indices.
To capture better the spatial distribution of the major sources of uncertainty, ANOVA results are aggregated by climatic homogeneous regions based on the climatological Köppen-Geiger classification. Scatterplots in Fig. 5 show the proportions of sums of squares of GHMs (y-axis) vs. GCMs (x-axis); medians for each climatic region are shown as their class letter and summarize the prominent factor of uncertainty. For both high and low flows calculated over the year and seasonally, uncertainty in equatorial regions (A) is dominated by GCMs (median closest to the x-axis); while in snow-dominated climate (D) it is dominated by GHMs (median closest to the y-axis). In warm temperate regions Table 1. Summary of mean changes, signal-to-noise S2N, and sources of variance for high and low flows at the annual and seasonal (DJF, JJA) scale, and at the global and climate region scale. The first source of variance is shown in bold, the second one in italic font. (C), uncertainty is slightly higher for GCMs than GHMs. In arid regions (B), the variance is not well explained by either GCMs or GHMs (median farthest from 1; i.e. residuals explain most of the variance), suggesting that reproducing hydroclimatology over these regions represents a challenge for both GCMs and GHMs. The ANOVA results for the whole year and those for winter and summer seasons (DJF and JJA shown in Figs. 3c and 4c) are quantified further in Table 1. This table provides a breakdown with both the regional and global results expressed for mean changes, S2N and percentage of sum of squares per factor at the annual and seasonal (DJF and JJA) scale. Looking jointly at the annual and seasonal results in Table 1, it is clear that the widespread dominance of the GCMs' contribution to uncertainty is outweighed by the GHMs in the snow-and ice-dominated regions (D). This pattern is visible also on the scatterplots (Figs. 5 and 6) with the GHM uncertainty-dominated regions (near the y-axis) often populated by D regions for both HFI and LFI (although to a lesser extent for the former).

Discussion and conclusions
Using six global hydrological models (GHMs) fed by five global climate models (GCMs) under the RCP8.5 scenario, this study aimed to assess future high and low flow changes globally by the 2080s, and to quantify the uncertainty attributable to GHMs and GCMs. We decided to focus solely on the uncertainty coming from GHMs and GCMs using as many ensemble members (from the ISI-MIP project data set) as possible under the RCP8.5, in which change signals are expected to be larger (i.e. emissions continue to rise leading to global radiative forcing levels of 8.5 W m −2 by the end of the 21st century). The hydrological simulations used in this study do not account for anthropogenic influences (e.g. water abstraction, augmentation and artificial storage) or land-use changes.
High and low flow changes in the future (2066-2099) relative to the control period ) exhibit a number of robust large-scale features. Increases in high flow days were found at northern latitudes, with a strong signal over eastern Canada, Scandinavia, northwestern Russia, and around the Bering Sea (eastern Russia and Alaska). Increases in low flow days were found in southern Europe, southwestern and central Latin America, southeastern USA, more southerly parts of Central Africa, and southwestern Australia. These patterns are largely consistent with the few other studies carried out on runoff at the global scale with several GHM-GCM combinations: e.g. for high flows (Hirabayashi et al., 2013), low flows (Van Huijgevoort et al., 2013;Prudhomme et al., 2014) and for mean flows (Davie et al., 2013;Schewe et al., 2013;Hagemann et al., 2013). More specifically, the comparison of flood hazard patterns by Dankers et al. (2013) with the changes in the occurrence of high flow days from our study reveals some similarities, mostly northern North America and Northern Asia, while in some regions like northeastern Europe patterns are opposite. Low flow patterns are similar to Prudhomme et al. (2014) although they find a weaker S2N.
In this study we provide for the first time a comprehensive assessment of both ends of the runoff spectrum at the same time using the same data set globally. Moreover, we undertake a consistent partition of uncertainty via ANOVA for both high and low flows, showing that GCMs provide the largest uncertainty, although the GHM contribution can be substantial in particular regions. The results from our ANOVA framework are consistent with other global studies based on the ratios between the variances (or standard deviations) of ensemble members averaged per type of model Schewe et al., 2013;Hagemann et al., 2013). In particular, uncertainty results that Dankers et al. (2013) expressed with GCM/GHM variance are in agreement with our findings for high flows in the Southern Hemisphere, mainly driven by GCM uncertainty, whereas there is less agreement for the Northern Hemisphere (in North America, Central Canada is GCM-driven uncertainty, whereas it is GHM driven in our results). Uncertainty results for low flows from Prudhomme et al. (2014), expressed as S2N ratio, are not directly comparable, but as will be discussed later, the inclusion of the JULES GHM in their ensemble has pointed to lower model agreement (i.e. increased uncertainty). At the regional level, the uncertainty partition enables us to delineate in which climate region each factor (GCMs or GHMs) provides the largest uncertainty at the annual and seasonal scales. Notably, for snow-and ice-dominated polar regions, and for arid zones, GHMs bring about the largest portion of uncertainty, especially for low flows. This is likely to reflect uncertainty in the way the hydrological storage-release processes can modify the climate signal, particularly where these storage components are relatively large or water residence times high -hence the importance of considering several GHMs in studying changes in high and low flows. GCM and GHM uncertainty shares are similar for HFI and LFI globally, although the spatial patterns differ slightly (e.g. northeastern Russia, southwestern Australia and Alaska are GCM driven in HFI, and GHM driven in LFI). This could reflect different dominant processes for high and low flow generation, with high flow events mainly driven by precipitation inputs or snow/ice-melt (i.e. atmospheric-driven processes); whereas low flows event develop over longer durations and are influenced more by land-surface processes like evaporation, infiltration and storage, which are simulated by the GHMs, each one with its own scheme and parametrization: e.g. for evapotranspiration, Penman-Monteith, Hamon (Haddeland et al., 2011 and Table A1 in Appendix A). Haddeland et al. (2011) have identified in the snow scheme employed by different GHMs a major source of difference between the model runoff simulations, and recent studies at global (e.g. Hagemann et al., 2013) and regional scale (e.g. Jung et al., 2012) hint at an increase in uncertainty in snow-dominated regions. Our study shows that in snow-dominated and arid regions GHM uncertainty equals or outweighs GCM uncertainty for both high and low flows, highlighting the importance of comprising balanced sets of both global hydrological and climate models to encompass the overall uncertainty in these regions.
To put the current study in context and to provide suggestions for further studies, it is worth making a few considerations on the hydrological index extraction and clarify a few aspects of the uncertainty partition concerning the method and the data set we used.
The identification of high and low flows over long time series, and particularly over climate projections, is nontrivial. As an illustration, van Huijgevoort et al. (2014) in their multi-model ensemble study on droughts report that applying the threshold level method to the future period using a threshold derived from the control period can lead to spurious pooling of drought events. They suggest that future changes could be accounted for by linking the drought threshold to adaptation scenarios like Vidal et al. (2012) did over France. Wanders et al. (2014) used a transient threshold level method for a moving reference period, in order to reflect the changes in hydrological regime over time, finding that the nontransient threshold method projected larger shares of areas in drought (except in snow-dominated regions). For our study, the threshold was calculated over the control period, as changes in future extremes with respect to present day were sought. In general, the selection of threshold approach should consider that if, on the one hand, a consistent pooling of extreme events may be hampered by incremental shifts or shape changes of the hydrograph throughout the future; on the other hand, when assessing the changes in frequency with respect to the present, information on the present used for comparison is lost when the threshold adapts throughout the projections. The model runs used in this study have no replicates; therefore, our ANOVA partition set-up poses some limitations as it assumes that the factors do not interact (no degrees of freedom are available for the estimation of the experimental error). However, interactions between the factors may indeed be present and, as pointed out by Bosshard et al. (2013), these interactions may represent uncertainty contributions that do not behave linearly: e.g. a snowmelt bias of a GHM may depend on the temperature projection of the driving GCM that could lead to a nonlinear response in the simulated runoff. This could in part explain the high rate of residuals' contribution seen in some grid cells for which potential interactions hinder the ANOVA to properly disclose the factors main effects. To avoid this drawback multiple model runs would be necessary.
Bias correction and CO 2 and vegetation dynamics represent other sources of uncertainty that were not accounted for in this study, though their influence should be further investigated in future works. Bias correction is commonly used to overcome bias inconsistencies between GCMs and impact models (i.e. GHMs) in climate impact studies; however, this technique alters the model output by e.g. reducing the inter-GCM variability and potentially their contribution to total uncertainty in climate projections Wada et al., 2013), and it is argued that its use is not always justified (Ehret et al., 2012). Hagemann et al. (2011) even found that uncertainty due to bias-correction can be of the same order of magnitude as that related to the choice of GCM or GHM. As Huber et al. (2014) points out, findings on relative contributions of GCMs and GHMs to total impact uncertainty would need to stand the test of using non-I. Giuntoli et al.: Future global hydrological extremes uncertainty bias-corrected runs, but runs that have not been bias corrected (with a method designed to preserve the long-term trends in temperature and precipitation projections, Hempel et al., 2013) are unavailable within ISI-MIP or with the same GCM/GHM combinations.
As mentioned in the Introduction, biome models have shown a larger spread than GHMs without varying CO 2 and vegetation dynamics processes, and it is argued that, due to the additional processes that they simulate, the inclusion of biome models in multi-model ensemble studies is important to capture a comprehensive range of uncertainty (Davie et al., 2013;Prudhomme et al., 2014). Within our study specifically, biome models with runs at daily resolution were JULES and LPJmL. These models were excluded primarily for intractability in low flow analysis. Therefore, uncertainty from varying CO 2 is not sampled and could suggest overconfidence (or bias) in favour of nonbiome GHMs, which simulate less runoff than biome models. During our exploratory analysis we actually included JULES in the ensemble and found that the uncertainty was driven towards the GHM source (in agreement with Prudhomme et al., 2014, who found higher S2N, i.e. stronger agreement between the models, when considering the ensemble without JULES). However, the inclusion of models in the ensemble must be compatible with the applicability of the method, and the biome models available through ISI-MIP proved to hamper the global comparison assessment for the heavy masking over large areas with zero-rich time series. As shown in Table B1, low flow index extraction was vetoed over large areas of the globe, ultimately leaving 61 and 20 % of land cells for JULES and LPJmL respectively (note that the masking is formed by superimposing masking from each GHM-GCM combination). Also, JULES' coarser resolution (7558 vs. 67 420 total land grid cells for JULES and the other GHMs respectively, i.e. a ratio of 1 to 9 cells) may contribute to more uncertainty, although lower-resolution runs would be necessary to assess such contribution. Index extraction for high flows proved more favourable, but we adopted the pragmatic approach of using the largest possible ensemble of models common to both high and low flows. We are aware that the inclusion of multiple models is not sufficient to fully scope model uncertainty due to resolution and structural errors that are common across models and place a limit to the confidence we obtain from robustness (Knutti, 2010). However, our results demonstrated that, even excluding biome models and other model structure differences in the ISI-MIP ensemble, large uncertainty in the signal of changes in high and low flows is attributable to GHMs and not only on GCMs.
Were biome models' shortcomings not present, their inclusion in our ensemble would have required a modification of our uncertainty partition strategy because the presence of outliers (likely introduced by biome models) would limit our ANOVA model (whose assumptions include no or minimal presence of outliers). For their distinct behaviour from the other GHMs, biome models could be considered as a factor level in a two-way ANOVA framework with unequal sample sizes (Neter et al., 1999, Chap. 23), i.e. the spread of future hydrological extremes would be examined as the function of factor 1 -the type of hydrological model (level 1: six GHMs; level 2: two biome models) and factor 2 -the GCMs.
Finally, the focus of our uncertainty analysis was on GCMs and GHMs, therefore the effect of emission scenarios (RCPs) was neglected. The few studies that have considered this aspect hint at a relatively small role of emission scenarios Wada et al., 2013) all throughout the 21st century when compared to GCMs and GHMs, which play a stronger role in uncertainty contribution over most of the globe.
To conclude, knowledge of the dominant source of uncertainty in climate-to-hydrology signal is critical to modellers for improving modelling of the terrestrial water cycle and to scientists for putting together targeted multi-model ensembles for climate impact studies. In addition to GHMs and GCMs, further work is needed to assess the degree to which internal variability, bias correction, biome models (i.e. GHMs that simulate vegetation dynamics and varying CO 2 ) and emission scenarios contribute to total uncertainty.  Infiltration excess, Degree-day saturation excess, groundwater a All of the six models were run at the spatial resolution of 0.5 • × 0.5 • . b LW: downwelling long-wave radiation; LWn: net long-wave radiation; P : precipitation rate (rain and snow calculated in the model); Q: air specific humidity; R: rainfall rate; RH: relative humidity; S: snowfall rate; SP: surface pressure; SW: downwelling shortwave radiation; T : air temperature; Tmax(min): daily maximum (minimum) air temperature; W : wind speed.

Appendix B: High and low flow binary series extraction and masking
The schematic of extraction of binary series of days under high (HFD) and low (LFD) flows is shown in Fig. B1. The threshold curves are obtained by linearly interpolating percentiles calculated over fixed 5-day windows (e.g. 1-5 December, 6-10 December, and so forth, i.e. 73 for the whole year) of the historical period runoff (i.e. December 1971 to December 2005), having considered the hydrological year from December to November.
The percentiles are Q 95 (runoff equaled or exceeded 5 % of the time) for HFD, and Q 10 (runoff equaled or exceeded 90 % of the time) for LFD. In general, the identification of high and low flows at the global scale imposes the selection of a universal threshold level serving many hydrological regimes and climate regions at once (thereby pooling events that may not always be extreme) and it is based on physical processes: low flows are generally characterized by a slower onset, and a longer duration, and high flows by a sudden onset, and a shorter duration. Accordingly, high and low flows are not necessarily symmetric with respect to the median flow (Q 50 ). For low flows in particular, the choice of Q 10 comes from seeking a sufficiently low quantile without compromising the analysis, as quantiles lower than 10 % become intractable for the large presence of zero pools in some time series. This is in agreement with e.g. Gudmundsson et al. (2011) who showed how the performance of a similar set of WaterMIP global models decreased systematically from high Q 95 to low Q 5 runoff percentile over Europe.
The choice of a fixed 5-day time window with interpolation was preferred over the 30-day moving average used in e.g. Prudhomme et al. (2014) because the latter had shown some limitations with regards to the low flow quantile extraction. The effect of levelling out over 30 days could lead to lower values than expected in the control period (10 % by design). In addition, we wanted to use the same framework for high and low flows and considered 5 days to be appropriate to identify both types of events. The choice of a linear interpolation was preferred over the moving window approach to minimize dependence (i.e. inertia) within quantile estimates with the following rationale: (i) moving average aims to smooth out wiggles for a less spiky identification of hydrological events like droughts that could result in erratic threshold crossings, thereby pooling several times over the same event; however, its quantile estimates use the same information from neighbouring days (as many as the time window), resulting in a quantile series holding a correlation that is higher the longer the time window, potentially leading to inadvertent effects of large inertia during the extraction of the hydrological index. (ii) In our case, as we count high (low) flow days (as opposed to single events), smoothing the threshold is unnecessary. (iii) A 1-day window would assure a series of independent quantile estimates, but the computation over 34 points (i.e. 34 years of the control period) was considered insufficient for quantile estimation. (iv) Seeking a representative number of points for quantile extraction (170, i.e. 5 days×34 years), we decided to compute the quantile by extracting a point every 5 days and extrapolating values for intermediate days to the next 5-day point; as a result threshold values were obtained with a nonrecursive use of data, thereby minimizing dependence.
The index extraction described above is not applicable when the runoff is very low, i.e. when long periods of the year have the same value. Therefore, with reference to the control period , grid cells showing little or no seasonal change in daily runoff were screened out (represented in grey on the maps) using the 5-day percentiles series that form the threshold curves (i.e. one mask for HF and one for LF) following these rules: (i) percentiles are equal to zero for more than one-third of the year (ii) standard deviation of percentiles of first and/or second half-year equals zero (iii) annual percentiles Q 10 and Q 95 series are equal. Table B1 shows percentages of available land grid cells after screening for the different GCM-GHM combinations and runoff percentile. Although screened grid cells could become seasonal through the climate projection -e.g. Alessandri et al. (2014) investigated the expansion and retreat of specific climate boundaries (Mediterranean climate in Europe and western USA) using CMIP5 data -we neglect this aspect as our base reference for changes in projections is the control period.
Mean changes for each ensemble member (GHM-GCM combination) are shown in Figs. B2 and B3, for high and low flows respectively.