I’m going to be upfront that I’m very torn about what to recommend with respect to this paper. On the one hand, I acknowledge the incredible amount of work that went into this project and believe that there is important and interesting science coming out of it. On the other hand, based on the responses to the questions raised, it is now clear that there are things here that I don’t think were done correctly. What complicates this is that many of the things done wrong (especially with respect to model process error) were also done wrong in previous papers on the Bayesian calibration of terrestrial carbon models (both by this team and others). This helps explain such mistakes, but it doesn’t justify them, and I worry that continuing to allow papers to make the same mistakes just perpetuates the situation.
The crux of the issue is how the authors are treating the error term in their likelihood. First, they are ascribing 100% of the error to the observations, and not acknowledging (statistically) that their model is imperfect (though their own Results and Discussion clearly demonstrate that the model is far from perfect). By incorrectly ascribing 100% of the error to observations, and none to process error (model misspecification, stochastic events, unaccounted-for heterogeneity), the authors are also missing that (unlike observation error) process error propagates forward into model predictions. This means that the uncertainties on modeled fluxes and pools are going to be consistently overconfident by an unknown (but potentially nontrivial) amount.
Second, not only do the authors ascribe all the error to observations, but they treat that observation error as a known parameter, despite acknowledging that the data products used don’t have error estimates. This is a significant departure from standard statistical modeling, where the variance is an unknown fit parameter. For example, when you fit a linear regression the model has three unknown parameters (slope, intercept, sigma), and sigma is virtually never treated as a quantity known a priori. While treating sigma as known shouldn’t have large effects on the mean values of the model parameters (though this is far from guaranteed when dealing with nonlinear models; Jensen’s Inequality), more important is that it can have a real effect on the uncertainties about the model parameters. By subjectively choosing the observation error, one is also subjectively choosing the confidence intervals on the parameters. And since in CARDAMOM the only uncertainties that are included in predictions are parameter uncertainties, this also means you are subjectively choosing the width of the predictive confidence intervals. Ideally, these models should be refit including an unknown, fitted model process error, and then that process error should be propagated into predictions/hindcasts. This process error ideally should also be in addition to, not instead of, an observation error (which may not be known, but may have an informative prior on it).
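To make the point about sigma concrete, here is a minimal sketch using the linear regression analogy above. It is not CARDAMOM or DALEC2 code; the synthetic data, the true sigma of 2.0, and the two "assumed" sigma values are all hypothetical choices for illustration. It shows that when sigma is fixed by hand rather than estimated, the standard error of the slope simply scales with whatever value was chosen.

```python
import numpy as np

def slope_standard_error(x, sigma):
    """Standard error of the least-squares slope for a given residual sd."""
    return sigma / np.sqrt(np.sum((x - x.mean()) ** 2))

# Hypothetical synthetic data: intercept 1.0, slope 0.5, true sigma 2.0
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 0.5 * x + rng.normal(0.0, 2.0, size=x.size)

# Fit intercept and slope by ordinary least squares
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# sigma estimated from the data (n - 2 degrees of freedom)
sigma_hat = np.sqrt(np.sum(residuals ** 2) / (x.size - 2))

se_fitted = slope_standard_error(x, sigma_hat)      # sigma treated as unknown
se_assumed_small = slope_standard_error(x, 0.5)     # sigma fixed by hand (too small)
se_assumed_large = slope_standard_error(x, 4.0)     # sigma fixed by hand (too large)

print(f"sigma_hat = {sigma_hat:.2f}")
print(f"slope SE, fitted sigma:      {se_fitted:.4f}")
print(f"slope SE, assumed sigma=0.5: {se_assumed_small:.4f}")
print(f"slope SE, assumed sigma=4.0: {se_assumed_large:.4f}")
```

The slope estimate itself is identical in all three cases; only the interval width changes, and it changes in direct proportion to the assumed sigma. In a nonlinear model the same effect holds qualitatively, though not necessarily with exact proportionality.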
Additional points of concern:
1) Neither the DALEC2 model nor the CARDAMOM system appears to be publicly archived. This means this work can’t be reproduced or expanded upon by others. I don’t know if such lack of openness is within the letter of the law of this journal, but it’s definitely a deviation from the current norms of the community.
2) As noted in my original review, I’m not comfortable with this system being called data assimilation, at least not without some additional qualifier (e.g. “parameter data assimilation”) to make it clear that the outputs are deterministic forward model simulations, not a reanalysis. To me, calling this data assimilation is like calling linear regression “machine learning.” Sure, people do it, but it makes the term pretty meaningless.
3) After clearly diagnosing your photosynthesis scheme (ACM) as being at the root of model biases and compensating errors, the decision to not include any ACM parameters in the calibration (and to chalk the issue up to a lack of acclimation rather than simple miscalibration) strikes me as odd, and I cannot understand why the authors are digging in their heels on this.
4) Similar to (3), since NPP in DALEC is very tightly tied to GPP, and TT = Cstock/NPP, it sure seems like systematic biases in GPP will translate to systematic biases in TT. As noted earlier, I find some of the reported TT estimates to be implausible and don’t understand the authors’ resistance to even considering comparing their results to independent field estimates.
5) The differences between DALEC and observations are greater than the differences between DALEC and the ISI-MIP models, so why are the authors so hard on the ISI-MIP models?
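As a back-of-the-envelope illustration of point (4), the sketch below applies TT = Cstock/NPP with a fractional bias in NPP. The stock and flux values are hypothetical round numbers, not values from the manuscript, and the sketch assumes for simplicity that Cstock is held fixed (e.g. because it is constrained by data) while NPP inherits the GPP bias.

```python
def turnover_time(c_stock, npp):
    """Steady-state turnover time: carbon stock divided by NPP."""
    return c_stock / npp

# Hypothetical values for illustration only
c_stock = 10000.0   # gC m-2
npp_true = 500.0    # gC m-2 yr-1

tt_true = turnover_time(c_stock, npp_true)

# A fractional NPP bias b changes TT by a factor of 1 / (1 + b)
for bias in (-0.20, 0.0, 0.25):
    tt = turnover_time(c_stock, npp_true * (1.0 + bias))
    print(f"NPP bias {bias:+.0%}: TT = {tt:.1f} yr "
          f"({tt / tt_true - 1.0:+.0%} relative to unbiased)")
```

Under these assumptions a 25% overestimate of NPP shortens TT by 20%, and a 20% underestimate lengthens it by 25%, so a systematic GPP bias does not average out of the TT estimate.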
L60: The authors’ responses suggested that a more complex calculation of TT was actually performed that relaxed the assumption of steady state. I would include that here (along with the steady state calculation) as I suspect a number of readers (myself included) would prefer to know that you’re not relying on a steady state assumption to assess a system that’s clearly not in steady state.
L160: This line refers to DALEC2 as an ‘intermediate complexity’ model, but later arguments actually hinge on it being a simple model, and most of us would consider DALEC to really be on the simple end of the process-model spectrum.
L171: MODIS LAI reports an uncertainty estimate. How did you aggregate those uncertainties when aggregating the observations? This is nontrivial, as neither the MODIS products nor the MODIS LAI validation papers report anything about the spatial or temporal autocorrelation in the product’s errors.
L188: Table S2 looks like it just contains a bunch of uniform priors for all other parameters. I think that should be stated here so that readers don’t need to find the supplement to learn that. It’s perfectly fair, however, to make readers go to the supplement to see the exact numerical values of the priors.
L194: This sentence states that MODIS doesn’t report an uncertainty estimate, but that’s not accurate.
L206: I’m concerned about the way the statistics are being reported here. For example, the RMSE of a model is traditionally based on the model error (difference between the model and the observations). Here, the authors are defining the model’s RMSE as the RMSE after applying both a multiplicative and additive bias correction (i.e. the predicted/observed regression). Similarly, the R2 isn’t the variance explained by the model, but the variance jointly explained by the model and a linear bias correction to that model. This results in a very optimistic view of the model’s actual performance.
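A minimal sketch of the distinction, on synthetic data, may help. The observations, the model bias (additive offset of 2.0, multiplicative factor of 1.5), and the function names are all hypothetical, and I am assuming the regression is observed-on-predicted; the point is only that statistics computed after a linear bias correction can never look worse than those computed against the 1:1 line, and can look much better.

```python
import numpy as np

def raw_rmse(obs, pred):
    """RMSE of the model itself: difference between prediction and observation."""
    return np.sqrt(np.mean((pred - obs) ** 2))

def bias_corrected_rmse(obs, pred):
    """RMSE after regressing observed on predicted (slope + intercept correction)."""
    slope, intercept = np.polyfit(pred, obs, 1)
    corrected = intercept + slope * pred
    return np.sqrt(np.mean((corrected - obs) ** 2))

def r2_one_to_one(obs, pred):
    """Variance explained by the model relative to the 1:1 line (can be negative)."""
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_regression(obs, pred):
    """Squared correlation: variance explained by model plus linear correction."""
    return np.corrcoef(obs, pred)[0, 1] ** 2

# Hypothetical biased model: additive and multiplicative bias plus noise
rng = np.random.default_rng(1)
obs = rng.uniform(0.0, 10.0, size=200)
pred = 2.0 + 1.5 * obs + rng.normal(0.0, 1.0, size=obs.size)

rmse_raw = raw_rmse(obs, pred)
rmse_corrected = bias_corrected_rmse(obs, pred)
r2_raw = r2_one_to_one(obs, pred)
r2_reg = r2_regression(obs, pred)

print(f"RMSE: raw = {rmse_raw:.2f}, after bias correction = {rmse_corrected:.2f}")
print(f"R2:   1:1 line = {r2_raw:.2f}, regression = {r2_reg:.2f}")
```

Reporting both sets of numbers side by side would let readers separate how well the model tracks the pattern in the data from how well it actually predicts the observed values.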
L251: Just want to continue to express my skepticism about some of these pool and flux estimates. For example, in my own experiences in Alaska, the boreal forest has WAY more than 160% more structural tissue than the tundra. There needs to be some independent plot-scale validation of this.
L258: Likewise, this stem turnover time seems much too fast and needs independent validation. I understand that grid cell to plot- or plant-scale validation isn’t perfect, but it’s better to report the performance explicitly, and then cushion it based on possible scale mismatch, rather than to ignore whether these estimates are consistent with prior research.
L294: typo on “uncertainties”
L313: It would be good to have some sort of quantification of spatial coherence beyond RMSE & R2 (which are nonspatial). Look to the GIS and remote sensing literature for examples of what sort of statistics are available to do this.
L328: Don’t introduce new Methods in the Results. Please document what this analysis is and why you are doing it earlier in the paper.
L378: Consistent with my previous concerns, DALEC appears to be running too fast. That said, this is still a comparison to other models, not to data.
L391: Here you say you had a ‘strong prior on photosynthesis’ but as far as I can tell the photosynthetic parameters were fixed at defaults, not assigned priors. According to Eqn 2, the only two parameters assigned non-uniform priors were canopy efficiency (which in Tables 2 and S2 is labeled as a phenology parameter) and autotrophic respiration.
L397: If you’ve demonstrated a bias in your photosynthetic model, I’m not sure I agree that this could be resolved with more precise data if you’re not updating the parameters in the photosynthetic submodel.
L427: I fundamentally disagree that models should be benchmarked against highly-derived, model-based data products. But this isn’t the central point of the paper and thus I won’t hold up this paper over that disagreement.
L459: While it’s true that brute-force MCMC is not feasible for complex models, there are other options available that do work with larger models, such as emulators (Fer et al. 2018, Biogeosciences) and ensemble or particle filters.
L477: For the record, if you didn’t fit every grid cell independently then you wouldn’t need to upscale/interpolate field observations.
L495: Where are the DALEC2 and CARDAMOM code repositories?
Table 2: I find it interesting that, given the paper’s focus on turnover times, turnover parameters are the least constrained part of the model.