These authors contributed equally to this work.

Understanding Earth system dynamics in light of ongoing human intervention and dependency remains a major scientific challenge. The unprecedented availability of data streams describing different facets of the Earth now offers fundamentally new avenues to address this quest. However, several practical hurdles, especially the lack of data interoperability, limit the joint potential of these data streams. Today, many initiatives within and beyond the Earth system sciences are exploring new approaches to overcome these hurdles and meet the growing interdisciplinary need for data-intensive research; using data cubes is one promising avenue. Here, we introduce the concept of Earth system data cubes and how to operate on them in a formal way. The idea is that treating multiple data dimensions, such as spatial, temporal, variable, frequency, and other grids alike, allows effective application of user-defined functions to co-interpret Earth observations and/or model–data integration. An implementation of this concept combines analysis-ready data cubes with a suitable analytic interface. In three case studies, we demonstrate how the concept and its implementation facilitate the execution of complex workflows for research across multiple variables, and spatial and temporal scales: (1) summary statistics for ecosystem and climate dynamics; (2) intrinsic dimensionality analysis on multiple timescales; and (3) model–data integration. We discuss the emerging perspectives for investigating global interacting and coupled phenomena in observed or simulated data. In particular, we see many emerging perspectives of this approach for interpreting large-scale model ensembles. The latest developments in machine learning, causal inference, and model–data integration can be seamlessly implemented in the proposed framework, supporting rapid progress in data-intensive research across disciplinary boundaries.

Predicting the Earth system's future trajectory given ongoing human intervention into the climate system and land surface transformations requires a deep understanding of its functioning

With regard to the acquisition of sensor measurements and the derivation of downstream data products, Earth system sciences are well prepared. But can this multitude of data streams be used efficiently to diagnose the state of the Earth system? In principle, our answer would be affirmative, but in practical terms we perceive high barriers to interconnecting multiple data streams and further linking these to data analytic frameworks

As long as we do not overcome data interoperability limitations, Earth system sciences cannot fully exploit the promises of novel data-driven exploration and modelling approaches to answer key questions related to rapid changes in the Earth system

This paper has two objectives: first, we aim to formalize the idea of an Earth system data cube (ESDC) that is tailored to explore a variety of Earth system data streams together and thus largely complements the existing approaches. The proposed mathematical formalism intends to illustrate how one can efficiently operate such data cubes. Second, the paper aims at introducing the Earth System Data Lab (ESDL;

The remainder of the paper is organized as follows: Sect.

Our vision is that multiple spatiotemporal data streams shall be treated as a singular yet potentially very high-dimensional data stream. We call this singular data stream an Earth system data cube. For the sake of clarity, we introduce a mathematical representation of the Earth system data cube and define operations on it. Further details on an efficient implementation are provided in Sect.

Suppose we observe

The symbol

A data cube

In this view, the data can be treated as a collection

Typical sets of data cubes

To exploit an Earth system data cube efficiently, scientific workflows need to be translated into operations executable on data cubes as described above. More specifically, the output of each operation on a data cube should yield another data cube. The entire workflow of a project, possibly a succession of analyses performed by different collaborators, can then be expressed as a composition of several UDFs performed on a single (input) data cube. Besides unifying all statistical data analyses into a common concept, the idea of expressing workflows as functional operations on data cubes comes with another important advantage: as soon as a workflow is implemented as a suitable set of UDFs, it can be reused on any other sufficiently similar data cube to produce the same kind of output.
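The idea that a workflow is a composition of UDFs, each mapping a data cube to a data cube, can be illustrated with a minimal sketch. It is not the ESDL interface: the dictionary-based cube layout and the names `compose`, `make_cube`, `temporal_anomalies`, and `absolute` are hypothetical and chosen only for illustration.

```python
from functools import reduce
from statistics import fmean


def compose(*udfs):
    """Chain UDFs left to right; each one maps a cube to a cube."""
    return lambda cube: reduce(lambda c, f: f(c), udfs, cube)


def make_cube(axes, data):
    """Toy cube (hypothetical layout): axis names plus {index tuple: value}."""
    return {"axes": tuple(axes), "data": dict(data)}


def temporal_anomalies(cube):
    """Subtract each grid cell's temporal mean; the axes are unchanged."""
    t = cube["axes"].index("time")
    groups = {}
    for idx, value in cube["data"].items():
        cell = idx[:t] + idx[t + 1:]  # index without the time position
        groups.setdefault(cell, []).append(value)
    means = {cell: fmean(vals) for cell, vals in groups.items()}
    data = {idx: v - means[idx[:t] + idx[t + 1:]]
            for idx, v in cube["data"].items()}
    return make_cube(cube["axes"], data)


def absolute(cube):
    """Element-wise absolute value; the axes are unchanged."""
    return make_cube(cube["axes"],
                     {idx: abs(v) for idx, v in cube["data"].items()})


# The workflow is itself a cube-to-cube function and can be reused as such.
workflow = compose(temporal_anomalies, absolute)

cube = make_cube(("time", "lat"),
                 {(0, 0): 1.0, (1, 0): 3.0, (0, 1): 10.0, (1, 1): 20.0})
result = workflow(cube)
```

Because `workflow` only relies on the presence of a `time` axis, it can be applied unchanged to any other toy cube that has one, which mirrors the reuse argument made above.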

In its most general form, a user-defined function

A major advantage of thinking in data cube workflows is that low-dimensional functions can be applied to higher-dimensional cubes by simple functional extensions: a function can be acting along a particular set of dimensions while looping across all unspecified dimensions. For example, the function that computes the temporal mean of a univariate time series should allow for an input data cube, which, in addition to a temporal grid, contains spatial information. The output of such an operation should then be a cube of spatially gridded temporal means. Similarly, the function should be applicable to cubes containing multivariate observations. Here, we expect the output to contain one temporal mean per supplied variable.
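The functional extension described here can be sketched in a few lines. The example below is a toy under stated assumptions, not the ESDL implementation: the `Cube` class and the `apply_along` helper are hypothetical names, and the cube values are synthetic.

```python
from itertools import product
from statistics import fmean


class Cube:
    """Toy data cube: named axes, their sizes, and {index tuple: value}."""

    def __init__(self, axes, sizes, data):
        self.axes, self.sizes, self.data = tuple(axes), tuple(sizes), data


def apply_along(cube, func, dims):
    """Apply `func` to the values along `dims`, looping over all other axes."""
    inner = [i for i, a in enumerate(cube.axes) if a in dims]
    outer = [i for i, a in enumerate(cube.axes) if a not in dims]
    out = {}
    for o in product(*(range(cube.sizes[i]) for i in outer)):
        values = []
        for n in product(*(range(cube.sizes[i]) for i in inner)):
            idx = [0] * len(cube.axes)
            for pos, i in zip(o, outer):
                idx[i] = pos
            for pos, i in zip(n, inner):
                idx[i] = pos
            values.append(cube.data[tuple(idx)])
        out[o] = func(values)
    return Cube([cube.axes[i] for i in outer],
                [cube.sizes[i] for i in outer], out)


# Synthetic cube with spatial, temporal, and variable axes.
axes, sizes = ("lat", "lon", "time", "var"), (2, 2, 3, 2)
data = {idx: 1000 * idx[3] + 100 * idx[0] + 10 * idx[1] + idx[2]
        for idx in product(*(range(s) for s in sizes))}
cube = Cube(axes, sizes, data)

# A function written for a univariate time series is applied to the full cube.
means = apply_along(cube, fmean, dims={"time"})
```

The output `means` has the axes `("lat", "lon", "var")`, i.e. one temporal mean per grid cell and per variable, although `fmean` itself knows nothing about space or variables.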

In general, a function

Schematic illustration of the “apply” functionality: a function

The approach outlined above is very convenient to describe workflows, i.e. recursive chains of UDFs. Let

Overall, the definition of an Earth system data cube and associated operations on it do not only guide the implementation strategy but also help us summarize potentially complicated analytic procedures in a common language. For the sake of readability, in the following, we will not distinguish between a function

In the following, we present some special operations that are routinely needed in explorations of Earth system data cubes:

“Reducing” describes a function that calculates some scalar measure (e.g. the sample mean). Consider, for instance, the need to estimate the mean of a univariate data cube, of course weighted by the area of the spatial grid cells. An operation of this kind expects a cube with dimensions

“Cropping” is subsetting a data cube while maintaining the order of a cube. A cropping operation typically reduces certain axes of a data cube to only contain specified grid points (and therefore requires the input cube to contain these grid points). For instance, a function that extracts a certain “cropped” fraction

“Slicing” refers to a subsetting operation in which one dimension of the cube degenerates, so that the order of the cube is reduced; it can be interpreted as a special form of cropping. For instance, if we only select a singular time instance

“Expansions” are operations where the order of the output cube is higher than the order of the corresponding input cube. A discrete spectral decomposition of time series, for example, generates a new dimension with characteristic frequency classes:

“Multiple cube handling” is often needed, for instance, when fitting a regression model where response and predictions are stored in different cubes. Also, we may be interested in outputting the fitted values and the residuals in two separate cubes. This amounts to an atomic operation:

The concept as described in Sect.

Workflow putting the ESDL concept into practice: selected data sets are preprocessed to common grids and saved in cloud-ready data formats (Zarr). Based on these cubed data sets, a global Earth system data cube can be produced that is either stored locally or in the cloud. Via appropriate application programming interfaces (APIs), users can efficiently access the ESDC in their native language. Users can fully focus on designing user-defined functions and workflows.

The data streams included so far were chosen to enable research on the following topics (a complete list is provided in Appendix

Ecosystem states at the global scale in terms of relevant biophysical variables. Examples are leaf area index (LAI), the fraction of photosynthetically active radiation (fAPAR), and albedo

Biosphere–atmosphere interactions as encoded in land fluxes of

Terrestrial hydrology requires a wide range of variables. We mainly ingest data from the Global Land Evaporation Amsterdam Model

State of the atmosphere is described using data generated by the Climate Change Initiative (CCI) by the European Space Agency (ESA) in terms of aerosol optical depth at different wavelengths

Meteorological conditions are described via the reanalysis data, i.e. the ERA5 product. Additionally, precipitation is ingested from the Global Precipitation Climatology Project

Together, these data streams form data cubes of intermediate spatial and temporal resolutions (0.25, 0.083

Visualization of the implemented Earth system data cube (an animation is provided online at

To show the portability of the approach, we have developed a regional data cube for Colombia. This work supports the Colombian Biodiversity Observational Network activities within the Group on Earth Observations Biodiversity Observation Network (GEO BON). This regional data cube has a 1 km (0.083

To put the concept of an Earth system data cube as outlined in Sect.

Given some large data cube

Knowledge of the desired

Depending on the exact needs,

Of course there are also alternatives to Julia.

The ESDL has been built as a generic tool. It is prepared to handle very large volumes of data. Storage techniques for large raster geodata are generally split into two categories: database-like solutions like Rasdaman

One disadvantage of the traditional file formats used for storing gridded data is that their data chunks are contained in single files that may become impossible to handle efficiently. This is not problematic when the data are stored on a regular file system where the file format library can read only parts of the file. In cloud-based storage systems, it is not common to have an API for accessing only parts of an object, so these file formats are not well suited for being stored in the cloud. Recently, novel solutions for this issue were proposed, including modifications to existing storage formats, e.g. HDF5 cloud, or cloud-optimized GeoTiff, among others, as well as completely new storage formats, in particular Zarr (

At present, the ESDL provides the same data cube in different spatial resolutions and different chunkings to speed up data access for different applications. In chunked data formats, a large data set is split into smaller chunks, which can be seen as separate entities where each chunk is represented by an object in an object store. There are several ways to chunk a data cube. Consider the case of a multivariate spatiotemporal cube

Resolutions and chunkings of the currently implemented global Earth system data cube per variable. Here, the cubes with chunk size 1 in the time coordinate are optimized for accessing global maps at a time, while the other cubes are more suited for processing time series or regional subsets of the data cube. The cubes are currently hosted on the Object Storage Service by the Open Telecom Cloud under

The overarching motivation for building an Earth system data cube is to support the multifaceted needs of Earth system sciences. Here, we briefly describe three case studies of varying complexity (estimating seasonal means per latitude, dimensionality reduction, and model–data integration) to illustrate how the concept of the Earth system data cube can be put into practice. Clearly, these examples emerge from our own research interest, but the concepts should be portable across different branches of science (the code for producing the results on display is provided as Jupyter notebooks at

Data exploration in the Earth system sciences typically starts with inspecting summary statistics. Global mean patterns across variables can give an impression on the long-term system behaviour across space. In this first use case, we aim to describe mean seasonal dynamics of multiple variables across latitudes.

Consider an input data cube of the form

Polar diagrams of median seasonal patterns per latitude (land only). The values of the variables are displayed as grey gradients and scale with the distance to the centroid. For each latitude, we have a median seasonal cycle specified with the central colour code. Panels

For temperature, the observed seasonal dynamics are less complex. We essentially find the constantly high temperature conditions near the Equator and visualize the pronounced seasonality at high latitudes. However, Fig.

This example analysis is intended to illustrate how the sequential application of two basic functions on this Earth system data cube can unravel global dynamics across multiple variables. We suspect that applications of this kind can lead to new insights into apparently known phenomena, as they allow a large number of data streams to be investigated simultaneously and with consistent methodology.
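A sequence of two reductions of this kind can be sketched as follows. The sketch assumes that the two steps are a median seasonal cycle per grid cell (median across years) followed by a median across longitudes; the helper `median_over`, the index layout `(lat, lon, year, month)`, and the values are hypothetical and do not reproduce the code behind the figures.

```python
from statistics import median


def median_over(data, axis):
    """Collapse one position of the index tuples by taking the median."""
    groups = {}
    for idx, value in data.items():
        key = idx[:axis] + idx[axis + 1:]
        groups.setdefault(key, []).append(value)
    return {key: median(vals) for key, vals in groups.items()}


# Toy cube indexed by (lat, lon, year, month); the values are made up.
cube = {(lat, lon, year, month): 10 * lat + month + year * (lon + 1)
        for lat in range(2) for lon in range(3)
        for year in range(3) for month in range(12)}

# Step 1: median seasonal cycle per grid cell, indexed by (lat, lon, month).
msc = median_over(cube, axis=2)

# Step 2: median across longitudes, indexed by (lat, month).
per_latitude = median_over(msc, axis=1)
```

Both steps use the same generic reduction, applied along different axes, so the workflow would remain unchanged if a variable axis were added to the index.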

The main added value of the ESDL approach is its capacity to jointly analyse large numbers of data streams in integrated workflows. A long-standing question arising when a system is observed based on multiple variables is whether these are all necessary to represent the underlying dynamics. The question is whether the data observed in

When thinking about an Earth system data cube, the question about its intrinsic dimensionality could be investigated along the different axes. In this study, we ask if the multitude of data streams, grid(var), contained in our Earth system data cube is needed to grasp the complexity of the terrestrial surface dynamics. If the compiled data streams were highly redundant, it could be sufficient to concentrate on only a few orthogonal variables and design the development of the study accordingly. Starting from a cube

Intrinsic dimension of 18 land ecosystem variables. The intrinsic dimension is estimated by counting how many principal components would be needed to explain at least 95 % of the variance in the Earth system data cube. The results for the original data are shown in panel

Estimating the intrinsic dimension of high-dimensional data sets has been a matter of research for multiple decades, and we refer the reader to the existing reviews on the subject

Figure

To verify that seasonality is the main source of variability in our analysis, we extend the workflow by decomposing each time series (by variable and spatial location) into a series of subsignals via a discrete fast Fourier transform (FFT). We then binned the subsignals into short-term, seasonal, and long-term modes of variability

Long-term modes of land surface variability show a rather complex spatial pattern in terms of intrinsic dimensions: overall, we find values between 6 and 7 (see also the summary in Fig.

Histogram of the intrinsic dimension estimated from 18 land ecosystem variables in the Earth system data cube. The highest intrinsic dimension emerges in the short-term variability, while the original data are enveloped by the complexity of seasonal and long-term subsignals.

The analysis shows how a large number of variables can be seamlessly integrated into a rather complex workflow. However, the results should be interpreted with caution: one criticism of the PCA approach is its tendency to overestimate the correct intrinsic dimensions in the presence of nonlinear dependencies between variables. A second limitation is that the maximum intrinsic dimensions depend on the number of Fourier coefficients used to construct the signals, leading to different theoretical maximum intrinsic dimensions per timescale.
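For reference, the PCA-based criterion used here (the number of principal components needed to explain at least 95 % of the variance) can be written compactly. The notation is ours and introduced only for illustration: λ_1 ≥ … ≥ λ_p denote the eigenvalues of the covariance matrix of the p variables at one grid cell, and d is the estimated intrinsic dimension.

```latex
d \;=\; \min\left\{\, k \in \{1,\dots,p\} \;:\;
      \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \,\ge\, 0.95 \right\},
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0 .
```

As a made-up numerical example, eigenvalues (5, 3, 1.5, 0.4, 0.1) sum to 10 and give cumulative variance fractions of 0.50, 0.80, 0.95, 0.99, and 1.00, so that d = 3.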

The question of the underlying dimensionality could also be investigated in a different way. While this study investigates the intrinsic dimensionality locally, i.e. along the dimensions of latitude and longitude, another recent study based on the ESDL by

Global patterns of locally estimated temperature sensitivities of ecosystem respiration

Another key element in supporting Earth system sciences with the ESDL (and related initiatives) is to enable model development, parameterization, and evaluation. To explore this potential, we present a parameter estimation study that considers two variables only, but it helps to illustrate the approach. In fact, the approach could be extended to exploit multiple data streams in complex models. The example presented here quantifies the sensitivities of ecosystem respiration – the natural release of

However, it has been shown theoretically

One generic solution to the problem is to exploit the variability of respiratory processes at short-term modes of variability. Specifically, one can apply a timescale-dependent parameter estimation

Bivariate histograms summarizing the joint distribution of surface moisture and gross primary production. The estimates are computed over the entire time series for the different Intergovernmental Panel on Climate Change (IPCC) regions. The density is square root transformed to emphasize areas of higher density. In arid regions (e.g. CAM, NEB, WAF, SAFM, EAF), the tight relation between surface water and primary production is evident.

From a more methodological point of view, this research application shows that it is entirely feasible to implement a multistep analytic workflow in the ESDL that combines time series analysis and parameter estimation. Once the analysis is implemented, it requires essentially two sequential atomic functions. The results obtained have the form of a data cube and could be integrated into subsequent analyses. Examples include comparisons with in situ data, ecophysiological parameter interpretations, or assessment of parameter uncertainty in more detail. As mentioned above, this case study only considers two variables and thereby does not exploit the wider multivariate potential of the ESDL. The example of temperature sensitivity could easily be combined with further estimations of water stress, linked to primary production, or even become part of a simple terrestrial surface scheme.
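The parameter estimation step can be illustrated with a minimal sketch. It assumes the widely used Q10 form R = R_b · Q10^((T − T_ref)/10), which is our assumption here and may differ in detail from the model used in the study. It also covers only the per-cell fit: the time series decomposition that isolates short-term variability, which the text identifies as necessary to avoid confounded estimates, is deliberately omitted. The names `fit_q10` and `q10_cube`, the reference temperature of 15 °C, and the synthetic data are hypothetical.

```python
import math


def fit_q10(temps, resp, t_ref=15.0):
    """Least-squares fit of log(R) = log(R_b) + (T - t_ref) / 10 * log(Q10)."""
    x = [(t - t_ref) / 10.0 for t in temps]
    y = [math.log(r) for r in resp]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return math.exp(intercept), math.exp(slope)  # (R_b, Q10)


def q10_cube(temp_cube, resp_cube):
    """Apply the fit per grid cell; both inputs map (lat, lon) to a time series."""
    return {cell: fit_q10(temp_cube[cell], resp_cube[cell])
            for cell in temp_cube}


# Synthetic single-cell example generated with R_b = 3 and Q10 = 2.
temps = {(0, 0): [5.0, 10.0, 15.0, 20.0, 25.0]}
resp = {(0, 0): [3.0 * 2.0 ** ((t - 15.0) / 10.0) for t in temps[(0, 0)]]}
params = q10_cube(temps, resp)
```

The function takes two input cubes (temperature and respiration) and returns a cube of parameters per grid cell, which corresponds to the multiple-cube handling described earlier.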

The original idea of the data cube concept emerged from the need for working with large multivariate gridded data sets. However, the idea of data cubes can possibly be extended to other types of geographical data. One example is vector data cubes, where, e.g. polygons form an axis in their own right and each polygon points to a complex spatial shape. Consider, for instance, the need for statistical inferences on the spatial polygons often used in Intergovernmental Panel on Climate Change (IPCC) reports. One relevant question is, for example, understanding the relations of GPP and surface moisture. Figure

In the following, we describe the insights gained during the development of the concept and the implementation of the ESDL, addressing issues arising and critiques expressed during our community consultation processes. We also briefly discuss the ESDL in light of other developments in the field. Finally, we highlight some challenges ahead and proposed future applications.

During a community consultation process across various workshops and summer schools, users expressed confusion about the equitable treatment of data cube dimensions (Sect.

One of the most commonly expressed practical concerns is the choice of a unique data grid. The curation of multiple data streams within such a data cube grid requires that many data have to undergo reformatting and/or remapping. Of course, this can be problematic at times, in particular when data have been produced for a given spatial or temporal resolution and cannot be remapped without violating basic assumptions. For instance, keeping mass balances, integrals of flux densities, and global moments of intensive properties as consistent as possible should always be a priority. However, for the data cube approach implemented here, we decided to accept certain simplifications. The availability of a multitude of relevant data to study Earth system dynamics is a key incentive to use the ESDL and goes far beyond many disciplinary domains. But, as we have learned in this discussion, it comes at the price of some pragmatic trade-offs. A fundamental advancement of our approach would be to natively deal with data streams from unequal grids.

The current notation of the concept has been criticized for being unsuitable for dealing with so-called vector data cubes

One of the main concerns expressed by users, in particular by 30 young researchers who participated in the project during an early adopter phase, is the demand for the latest data in the ESDL. This is why the concept presented here and its implementation should be further developed into a persistent infrastructure. Such a step is challenging and there is a trade-off to be made between wishing to include latest data streams (ideally even in near-real time) and constantly expanding the access API and portfolio of example workflows. The ESDL thus depends on the enduring enthusiasm of the user community and funding agencies to support the idea in this respect and grow steadily into new domains, help us add data streams, and actively co-develop the approach.

Over the past few years, several initiatives, platforms, and software solutions

Among the other existing initiatives, the Climate Data Store (CDS) of the Copernicus Climate Change Service (

First, we note that most of the data cube initiatives were motivated by the need to access and/or analyse big, e.g. very-high-resolution, data

Second, most initiatives intend to preserve the resolutions of the underlying data. The ESDL, instead, is built around singular data cubes that then include variables as an additional dimension. The inevitable trade-off, as discussed above, is the need for a data curation and remapping process prior to the analyses.

Third, there is a wide consensus that data cube technologies need to enable the application of UDFs. However, at this stage, this aspect often appears not to be a priority of other data cube initiatives and, consequently, users are restricted in their analysis by the available tools. In this context, we see the strength of the ESDL, as it allows for the development of complex workflows and adding arbitrary functionalities efficiently. This is one reason why we decided to implement the ESDL in Julia, a relatively young language for scientific computing (side by side with the more commonly used Python tools).

Taken together, the ESDL has probably developed (and implemented) the conceptually most radical cubing principle, following a strictly dimension-agnostic approach. We envisage that the ESDL front end could be coupled to a data cube technology as proposed by any of the other initiatives to combine its analytic strength with the efficiencies achieved by others in dealing with high-resolution data streams.

During the development of the ESDL, we identified several methodological challenges on the one hand and, on the other, application domains that could be addressed. With regard to potentially relevant methodological paths, we can only briefly mention, with no claim to completeness, some of the most ardently and widely discussed topics:

In terms of application domains, we see high potential in the following areas:

In summary, we have demonstrated that the ESDL is a flexible and generic framework that can allow various different communities to explore and analyse large amounts of gridded data efficiently. Thinking about the potential paths ahead, the ESDL could become a valuable tool in various fields of Earth system sciences, biodiversity research, computer sciences, and other branches of science. The widespread social and political uptake of the concept of planetary boundaries

Exploiting the synergistic potential of multiple data streams in the Earth sciences beyond disciplinary boundaries requires a common framework to treat multiple data dimensions, such as spatial, temporal, variable, frequency, and other grids alike. This idea leads to a data cube concept that opens novel avenues to efficiently deal with data in the Earth system sciences. In this paper, we have formalized the concept of data cubes and described a way to operate on them. The outlined dimension-agnostic approach is implemented in the Earth System Data Lab, which enables users to apply a wide range of functions to all conceivable combinations of dimensions. We believe that this idea can dramatically lower the barrier to exploiting Earth system data and can serve multiple research purposes. The ESDL complements a range of emerging initiatives that differ in architectures and specific purposes. However, the ESDL is probably the most radical data cubing approach, offering novel opportunities for cross-community data-intensive exploration of contemporary global environmental changes. Future developments in related branches of science and the latest methodological developments need to be considered and addressed soon. In its current state of implementation, the ESDL can already contribute to the deeper understanding and more effective implementation of policy-relevant concepts such as the planetary boundaries, essential variables in different subsystems of the Earth, and global assessment reports. We see a particularly high future potential for data cube concepts as presented for, firstly, interpreting large-scale model ensembles, and secondly, analysing new multispectral satellite remote sensing data with their constantly increasing spatial, temporal, and spectral resolutions.

In the following, we give an overview of the variables currently available in the Earth System Data Lab. The list is constantly being updated.

Data streams in the current implementation of the ESDL.

All code necessary to build and analyse the ESDL is available from

All data are available via

MDM, FG, and MR developed the concept; FG implemented the

The authors declare that they have no conflict of interest.

This paper was funded by the European Space Agency (ESA) via the Earth System Data Lab (ESDL) project. The authors also thank the Integrated Land Ecosystem Atmosphere Processes Study (iLEAPS), a FutureEarth Global Research Project for constant support. Special thanks are given to Anca Anghelea, Eleanor Blyth, Carsten Brockmann, Diego Fernández, Garry Hayman, Toby R. Marthews, Pierre-Philippe Mathieu, Espen Volden, and Uli Weber for continuous support and feedback. We also thank everyone participating in the various workshops and summer schools, and especially the young scientists participating in the “early adopters” call, for providing invaluable feedback on the development of the ESDL. Marius Appel, Edzer Pebesma, Alexander Winkler, and two anonymous referees provided excellent comments on the manuscript. The implementation of the regional Earth data cube for Colombia was done under the project “Champion user phase; Supporting the Colombia BON in GEO BON” with the ESDL project. The original idea emerged at the iLEAPS–ESA–MPG-funded workshop in Frascati 2011

This research has been supported by the European Space Agency (project Earth System Data Lab). The article processing charges for this open-access publication were covered by the Max Planck Society.

This paper was edited by Kirsten Thonicke and reviewed by two anonymous referees.