Combining machine learning and SMILEs to classify, better understand, and project changes in ENSO events
- 1Max Planck Institute for Meteorology, Hamburg, Germany
- 2Cooperative Institute for Research in Environmental Sciences (CIRES) and Department of Atmospheric and Oceanic Sciences (ATOC), University of Colorado at Boulder, Boulder, CO 80309, USA
- 3Freelancer, Boulder, CO 80303, USA
- 4Climate and Global Dynamics Division, National Center for Atmospheric Research, Boulder, CO 80307, USA
- 5Cooperative Programs for the Advancement of Earth System Science, University Corporation for Atmospheric Research, Boulder, CO 80307, USA
Abstract. The El Niño Southern Oscillation (ENSO) occurs in three phases: neutral, warm (El Niño) and cool (La Niña). While classifying El Niño and La Niña is relatively straightforward, El Niño events can be broadly classified into two types: Central Pacific (CP) and Eastern Pacific (EP). Differentiating between CP and EP events currently depends on both the method and the observational dataset used. In this study, we create a new classification scheme using supervised machine learning trained on 18 observational and reanalysis products. This builds on previous work by identifying classes of events using the temporal evolution of sea surface temperature in multiple regions across the tropical Pacific. By applying this new classifier to seven single model initial-condition large ensembles (SMILEs) we investigate both the internal variability and forced changes in each type of ENSO event, with the identified events behaving similarly to those observed. It is currently debated whether the observed increase in the frequency of CP events after the late 1970s is due to climate change; we find it to be within the range of internal variability in the SMILEs. When considering future changes, we do not project a change in CP frequency or amplitude under a strong warming scenario (RCP8.5/SSP370), and we find model differences in EP El Niño and La Niña frequency and amplitude projections. Finally, we find that models show differences in projected precipitation and SST pattern changes for each event type that do not seem to be linked to the Pacific mean state SST change, although the SST and precipitation changes in individual SMILEs are linked. Our work demonstrates the value of combining machine learning with climate models, and highlights the need to use SMILEs when evaluating ENSO in climate models, given the large spread of results found within a single model due to internal variability alone.
Nicola Maher et al.
Status: final response (author comments only)
RC1: 'Comment on esd-2021-105', Anonymous Referee #1, 23 Feb 2022
general comments
This paper integrates 18 observational datasets and machine learning algorithms (supervised classification) to classify the CP (Central Pacific), EP (Eastern Pacific), and LN (La Niña) events in the past ~120 years. The trained/tuned model was then applied to SMILEs (single model initial-condition large ensembles) to investigate both the internal variability and forced changes in each ENSO event type. The main findings from this study are: 1) machine learning (ML) does a nice job of reconstructing past ENSO events; 2) the observed increase in the frequency of CP events after the late 1970s is within the range of internal variability in the SMILEs (thus arguing against climate change as the cause); and 3) the ML algorithm does not project a change in CP frequency or amplitude in the coming decades.
I find this paper well written, of important scientific merit, and a nice integration of climate models and machine learning. However, I do have several concerns that I hope the authors can address.
specific comments
ML related.
1) Metrics and scoring. The authors used precision as their main metric to check model performance. However, as successfully detecting the CP and EP events is the most critical part, I think the authors should use the recall rate. Imagine this scenario: if we have 20 total EP events, the ML successfully categorizes 5 of them as EP, the other 15 are categorized as other event types, and no other events are categorized as EP. Then, based on the precision formula, the precision for EP will be 5/(5+0) = 1. However, the 15 missed EP events are not captured by this metric. If using recall, then 5/(5+15) = 0.25, which indicates the model needs to be improved. Although recall also has its own issues, I think at least a thorough explanation of why the authors chose precision should be given. I also recommend the authors compare the results obtained using recall with those using precision.
2) The authors used several methods to determine/evaluate the ML model (e.g., a train and evaluation/test split, and 10-fold cross validation on the training dataset). I think the authors also need to explain how they tuned the training model. For example (in Table 2), why did they choose k = 1 for the KNN, and why did they use those specific hidden layers and maximum iterations in their NN? More importantly, for the random forest algorithm, the max depth seems to be too big (500). A more detailed description of how they tuned (not only evaluated) the models is needed, as this determines the final model structure.
3) Can the author explain why they first use HadISST as the test data set? As the author mentioned, this will cover all the events through time and is not the ideal way to evaluate the model performance. I think their second approach is more appropriate (randomly split the events across all augmentation data sets). I suggest the authors delete the HadISST part (unless I miss something…).
4) The authors need to discuss whether the range of the feature values during the training will also cover the ranges for future predictions. One example is the random forest, whose prediction results will be capped by the data used for training. In a future warming world, will the features have values that are out of the scope of the current observational ones?
5) In line 90, for those that don’t quite know ML, I suggest the authors add a sentence or two to explain labelled dataset vs. unlabelled data.
6) line 180, “We additionally complete this split 100 times and manually choose 10 data splits that take CP and EP (the classes with the lowest numbers of events) from across the time-series, ensuring that not all events in the split come from the same part of the observational record.”. I feel a bit lost here. Does it mean the events in any split need to cover the whole time period? This needs to be reworded or more details added.
7) Line 165: There are 14 CP events in total (see line 110), so why are there only 13 here (12/13)?
Model interpretation related.
1) A very interesting finding (and important!) from this study is that, due to the internal variability of the SMILEs, the assumed change in the frequency and amplitude of the ENSO events can be covered by the models themselves (instead of requiring further forcing such as climate change). This is great, but I am wondering if this could simply serve as the explanation for the observed trend in ENSO events. For example, in Figure 3, for the CP events, HadISST shows a significant increase in CP frequency. Although the SMILEs cover this increase, this is mainly due to the wide band between minimum and maximum; the SMILE trends themselves are relatively flat (or slightly decreasing or increasing). The authors need a more nuanced explanation.
2) Line 75, The author needs to explain “undersampling internal variability” here.
technical corrections
Line 40: change to “is uncertain, sparse, and intermittent”
Line 100: in 5. change to “to use the evaluation set to assess”. In 6. Add “better” before performance
Line 130: change “but chose” to “we chose”
Line 330: “too located too far west” does not read well
Line 345: delete “are“ before needed to evaluate ENSO
AC1: 'Reply on RC1', Nicola Maher, 21 Apr 2022
general comments
This paper integrates 18 observational datasets and machine learning algorithms (supervised classification) to classify the CP (Central Pacific), EP (Eastern Pacific), and LN (La Niña) events in the past ~120 years. The trained/tuned model was then applied to SMILEs (single model initial-condition large ensembles) to investigate both the internal variability and forced changes in each ENSO event type. The main findings from this study are: 1) machine learning (ML) does a nice job of reconstructing past ENSO events; 2) the observed increase in the frequency of CP events after the late 1970s is within the range of internal variability in the SMILEs (thus arguing against climate change as the cause); and 3) the ML algorithm does not project a change in CP frequency or amplitude in the coming decades.
I find this paper well written, of important scientific merit, and a nice integration of climate models and machine learning. However, I do have several concerns that I hope the authors can address.
We thank the reviewer for their positive and constructive comments on this manuscript.
specific comments
ML related.
1) Metrics and scoring. The authors used precision as their main metric to check model performance. However, as successfully detecting the CP and EP events is the most critical part, I think the authors should use the recall rate. Imagine this scenario: if we have 20 total EP events, the ML successfully categorizes 5 of them as EP, the other 15 are categorized as other event types, and no other events are categorized as EP. Then, based on the precision formula, the precision for EP will be 5/(5+0) = 1. However, the 15 missed EP events are not captured by this metric. If using recall, then 5/(5+15) = 0.25, which indicates the model needs to be improved. Although recall also has its own issues, I think at least a thorough explanation of why the authors chose precision should be given. I also recommend the authors compare the results obtained using recall with those using precision.
We will add the recall metric to the revised manuscript.
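To illustrate the difference between the two metrics, a minimal Python sketch using scikit-learn and the reviewer's hypothetical 20-event example (not our actual classifier output) is given below:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Reviewer's toy example: 20 true EP events, only 5 of which are labelled EP;
# the remaining 15 are assigned to other classes, and no non-EP events are
# labelled EP.
y_true = np.array(["EP"] * 20)
y_pred = np.array(["EP"] * 5 + ["other"] * 15)

precision = precision_score(y_true, y_pred, pos_label="EP", average="binary")
recall = recall_score(y_true, y_pred, pos_label="EP", average="binary")

print(f"precision = {precision:.2f}")  # 1.00: no false positives
print(f"recall    = {recall:.2f}")     # 0.25: 15 of the 20 EP events are missed
```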
2) The authors used several methods to determine/evaluate the ML model (e.g., a train and evaluation/test split, and 10-fold cross validation on the training dataset). I think the authors also need to explain how they tuned the training model. For example (in Table 2), why did they choose k = 1 for the KNN, and why did they use those specific hidden layers and maximum iterations in their NN? More importantly, for the random forest algorithm, the max depth seems to be too big (500). A more detailed description of how they tuned (not only evaluated) the models is needed, as this determines the final model structure.
This information is provided on GitHub in the form of Jupyter notebooks. We will additionally add more detail to the Supplementary Material describing the tuning completed to arrive at the final parameters.
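For illustration, a minimal sketch of the kind of cross-validated grid search used for tuning is shown below; the parameter grids and data arrays here are hypothetical placeholders, and the values actually searched are in the notebooks on GitHub:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Placeholder training data: 60 events x 5 region features, 4 classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 5))
y_train = np.array(["CP", "EP", "LN", "neutral"] * 15)

# Hypothetical grids for two of the classifiers.
searches = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}),
    "rf": (RandomForestClassifier(random_state=0),
           {"n_estimators": [100, 250, 500], "max_depth": [5, 10, 50, None]}),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, (estimator, grid) in searches.items():
    search = GridSearchCV(estimator, grid, cv=cv, scoring="precision_macro")
    search.fit(X_train, y_train)
    print(name, search.best_params_, round(search.best_score_, 3))
```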
3) Can the author explain why they first use HadISST as the test data set? As the author mentioned, this will cover all the events through time and is not the ideal way to evaluate the model performance. I think their second approach is more appropriate (randomly split the events across all augmentation data sets). I suggest the authors delete the HadISST part (unless I miss something…).
This is done due to the limited number of events in the observed record. If we split via event rather than dataset we may lose events that are important to the training phase of the classification algorithm. As such we choose to keep this as is in the manuscript, but add text to better explain this choice.
4) The authors need to discuss whether the range of the feature values during the training will also cover the ranges for future predictions. One example is the random forest, whose prediction results will be capped by the data used for training. In a future warming world, will the features have values that are out of the scope of the current observational ones?
This is a limitation of the method that will be added to the discussion.
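As a first diagnostic, the fraction of SMILE events whose feature values fall outside the observed training range could be reported; a minimal sketch is below (array names and values are hypothetical placeholders, not our data):

```python
import numpy as np

def fraction_extrapolating(X_train, X_future):
    """Fraction of future samples with at least one feature outside the
    per-feature min/max range of the training data."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    outside = (X_future < lo) | (X_future > hi)
    return outside.any(axis=1).mean()

# Hypothetical example: 60 observed events vs 500 projected SMILE events,
# with the projected features shifted warmer and slightly more variable.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(60, 5))
X_future = rng.normal(0.5, 1.2, size=(500, 5))
print(f"{fraction_extrapolating(X_train, X_future):.0%} of projected events extrapolate")
```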
5) In line 90, for those that don’t quite know ML, I suggest the authors add a sentence or two to explain labelled dataset vs. unlabelled data.
This will be added.
6) line 180, “We additionally complete this split 100 times and manually choose 10 data splits that take CP and EP (the classes with the lowest numbers of events) from across the time-series, ensuring that not all events in the split come from the same part of the observational record.”. I feel a bit lost here. Does it mean the events in any split need to cover the whole time period? This needs to be reworded or more details added.
Apologies for the confusion; this will be reworded. The algorithm sometimes chooses, for example, CP events that all occur at the beginning of the observational record. As data quality depends on when in the observational record an event occurs, it is only appropriate to use data splits that include events from across the time period.
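To make the criterion concrete, a minimal sketch of the procedure is given below; the variable names, the 1900–2020 period, and the three-segment check are hypothetical placeholders, and in practice we inspect the candidate splits manually:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder event metadata: year and class of each observed event.
rng = np.random.default_rng(0)
event_years = rng.integers(1900, 2020, size=60)
event_labels = np.array(["CP", "EP", "LN", "neutral"] * 15)

def spans_record(years, n_bins=3, start=1900, end=2020):
    """True if the events are spread over more than one segment of the record."""
    edges = np.linspace(start, end, n_bins + 1)
    return len(np.unique(np.digitize(years, edges))) > 1

kept = []
for seed in range(100):
    train_idx, eval_idx = train_test_split(
        np.arange(len(event_labels)), test_size=0.2,
        stratify=event_labels, random_state=seed)
    # Keep only splits whose held-out CP and EP events span the record.
    if all(spans_record(event_years[eval_idx[event_labels[eval_idx] == c]])
           for c in ("CP", "EP")):
        kept.append((train_idx, eval_idx))
kept = kept[:10]  # retain 10 acceptable splits
```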
7) Line 165: There are 14 CP events in total (see line 110), so why are there only 13 here (12/13)?
Apologies, this is a typo that will be corrected in the manuscript.
Model interpretation related.
1) A very interesting finding (and important!) from this study is that, due to the internal variability of the SMILEs, the assumed change in the frequency and amplitude of the ENSO events can be covered by the models themselves (instead of requiring further forcing such as climate change). This is great, but I am wondering if this could simply serve as the explanation for the observed trend in ENSO events. For example, in Figure 3, for the CP events, HadISST shows a significant increase in CP frequency. Although the SMILEs cover this increase, this is mainly due to the wide band between minimum and maximum; the SMILE trends themselves are relatively flat (or slightly decreasing or increasing). The authors need a more nuanced explanation.
This is why we added ensemble member 1 to the figure. We will assess the maximum and minimum trends across the ensembles to better evaluate this in the revised manuscript.
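A minimal sketch of the per-member trend assessment we have in mind is given below; the frequency array is a synthetic placeholder, not SMILE output:

```python
import numpy as np

# Placeholder: CP-event counts in sliding windows for each of the 50 members
# of one SMILE (synthetic Poisson counts, not model output).
rng = np.random.default_rng(0)
n_members, n_windows = 50, 120
cp_freq = rng.poisson(lam=6, size=(n_members, n_windows)).astype(float)

time = np.arange(n_windows)
# Least-squares linear trend for each ensemble member (slope per window step).
trends = np.polyfit(time, cp_freq.T, deg=1)[0]

print(f"trend range across members: {trends.min():+.3f} to {trends.max():+.3f}, "
      f"ensemble mean: {trends.mean():+.3f}")
```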
2) Line 75, The author needs to explain “undersampling internal variability” here.
This will be clarified in the revised text. We refer to CMIP archives, where a single run of each climate model is used. In this case internal variability is undersampled due to the lack of ensemble members.
technical corrections
All technical corrections will be applied to the revised manuscript as suggested.
Line 40: change to “is uncertain, sparse, and intermittent”
Line 100: in 5. change to “to use the evaluation set to assess”. In 6. Add “better” before performance
Line 130: change “but chose” to “we chose”
Line 330: “too located too far west” does not read well
Line 345: delete “are“ before needed to evaluate ENSO
RC2: 'Comment on esd-2021-105', Anonymous Referee #2, 24 Feb 2022
This paper applies an integrated observational dataset to train a classification of EP El Niño, CP El Niño, and La Niña events with supervised learning, and investigates ENSO diversity/complexity changes in multi-model large ensembles. Specifically, the authors find that supervised machine learning can reasonably classify ENSO events/types and that the observed increase in CP El Niño events is within the range of internal variability, as are the ENSO amplitude and frequency changes. The research topic is interesting and necessary; however, there are issues in the machine learning setup, and the goal/findings are not unique to machine learning. Therefore, this paper should not be accepted in Earth System Dynamics before major revisions.
A few major comments follow:
ML related
- The setup of the supervised learning uses a combination of 18 observational datasets. However, this combination may overweight a few events and offer limited diversity. For instance, the events after 1980 are covered by most datasets, but earlier events are covered by only half of them. The authors should discuss this issue and provide additional analyses in the supplementary material. One suggestion is to test with a subgroup of the datasets. Another issue with the integrated observational datasets is their lack of independence: even though the reconstructions are all slightly different, the SSTs still represent the same events. That is, the actual number of events considered in this study is only 14 CP, 20 EP, and 26 LN. This issue should be mentioned in the manuscript and needs to be tested with a small subgroup of the datasets (or, at the extreme, just one dataset).
- The setup of the supervised learning uses features from 5 regions from October to March. However, limited dynamical reasoning is provided, and other regions and times should be mentioned (or even tested). For example, the authors show results from smaller regions and time windows in the supplementary material, but not larger ones. The north subtropical region is known to be important for the onset of CP El Niño, and recent papers have found an improvement from including it (Tseng et al., 2021), while the summer season is related to how a specific ENSO type develops (Yu & Fang, 2018). The authors should provide dynamical reasons for the choice of regions and times; otherwise, the study should examine more regions and times to show that the current choice is an optimal one.
Yu, J. Y., & Fang, S. W. (2018). The distinct contributions of the seasonal footprinting and charged–discharged mechanisms to ENSO complexity. Geophysical Research Letters, 45(13), 6611–6618.
Tseng, Y. H., Huang, J. H., & Chen, H. C. Improving the Predictability of Two Types of ENSO by the Characteristics of Extratropical Precursors. Geophysical Research Letters, e2021GL097190.
Writing-related
- The introduction is a little bit lengthy. It would be easier to read if the authors made the description more succinct. For example, the paragraph on the observed CP increase (lines 55–63) should be combined with the EP/CP introduction at the beginning. It would be great if the introduction could be better organized.
- ENSO complexity has recently been considered from a broader perspective (Timmermann et al., 2018). Besides the EP/CP types of ENSO, the transition, propagation, and duration of ENSO are all part of ENSO complexity (Chen et al., 2017; Fang et al., 2020). Although these are not the focus of this paper, ENSO complexity should be mentioned at least in the discussion section.
Timmermann, A., An, S. I., Kug, J. S., Jin, F. F., Cai, W., Capotondi, A., ... & Zhang, X. (2018). El Niño–southern oscillation complexity. Nature, 559(7715), 535-545.
Fang, S. W., & Yu, J. Y. (2020). Contrasting transition complexity between El Niño and La Niña: observations and CMIP5/6 models. Geophysical Research Letters, 47(16), e2020GL088926.
Chen, C., Cane, M. A., Wittenberg, A. T., & Chen, D. (2017). ENSO in the CMIP5 simulations: Life cycles, diversity, and responses to climate change. Journal of Climate, 30(2), 775-801.
Interpretation-related
- The study uses the classification of CP El Niño from Pascolini-Campbell et al. (2015) for the past 120 years, which combines various CP classification methods, but no such classification is applied to the multi-model large ensembles. That is, the original CP classification is not compared with the supervised learning method in the SMILEs. If the method in Pascolini-Campbell et al. (2015) is too complicated, the authors should at least choose one or two existing methods to show how an existing classification in the SMILEs differs from the one obtained with supervised learning.
- The goal/findings are not unique to machine learning and have been discussed in previous studies. The authors classify ENSO events and compare the results across SMILEs. However, this can also be done by simply using an existing ENSO classification method (Ng et al., 2021). The findings of this study should focus more on the uniqueness of the supervised learning. For example, since the classification method is trained on observational datasets, how does each modelled ENSO in the SMILEs differ from observations? Or does machine learning classify events better than existing methods?
Ng, B., Cai, W., Cowan, T., & Bi, D. (2021). Impacts of low-frequency internal climate variability and greenhouse warming on El Niño–Southern Oscillation. Journal of Climate, 34(6), 2205-2218.
- The authors compare the changes in SST pattern for EP and CP El Niño under global warming. The interpretation should be more dynamical, as this change in pattern is seldom mentioned in other studies (maybe due to the difficulty of dynamical interpretation). I suggest the authors eliminate this result if no dynamical explanation is provided, as it is only discussed in one paragraph (lines 292–302). Instead, the authors could focus on the change in the zonal SST gradient of the mean state and compare it with the frequency or amplitude.
- The comparison of the increased CP El Niño frequency with the SMILEs should be more precise. The authors use the ensemble spread in each year as the range of change due to internal variability; however, this is different from the increase in CP El Niño frequency over a certain period. The authors should check how much the CP El Niño frequency can change in each ensemble member and discuss the spread of these changes across all SMILEs.
Minor comments are provided below:
- Does the training and classification use the original SST or SST anomalies? Please clearly describe this in the text.
- The calculation of frequency should also be mentioned in the methods section, not only in the caption of Figure 3.
- Figure 6 is a bit difficult to read as there are many colors and lines.
- Line 205, ‘to far’
- Line 48, ‘niños’
AC2: 'Reply on RC2', Nicola Maher, 21 Apr 2022
This paper applies an integrated observational dataset to train a classification of EP El Niño, CP El Niño, and La Niña events with supervised learning, and investigates ENSO diversity/complexity changes in multi-model large ensembles. Specifically, the authors find that supervised machine learning can reasonably classify ENSO events/types and that the observed increase in CP El Niño events is within the range of internal variability, as are the ENSO amplitude and frequency changes. The research topic is interesting and necessary; however, there are issues in the machine learning setup, and the goal/findings are not unique to machine learning. Therefore, this paper should not be accepted in Earth System Dynamics before major revisions.
A few major comments follow:
We thank the reviewer for their constructive comments.
ML related
- The setup of the supervised learning uses a combination of 18 observational datasets. However, this combination may overweight a few events and offer limited diversity. For instance, the events after 1980 are covered by most datasets, but earlier events are covered by only half of them. The authors should discuss this issue and provide additional analyses in the supplementary material. One suggestion is to test with a subgroup of the datasets. Another issue with the integrated observational datasets is their lack of independence: even though the reconstructions are all slightly different, the SSTs still represent the same events. That is, the actual number of events considered in this study is only 14 CP, 20 EP, and 26 LN. This issue should be mentioned in the manuscript and needs to be tested with a small subgroup of the datasets (or, at the extreme, just one dataset).
We will retest the algorithm with only HadISST and add the results to Table 3. However, we will continue to use this method because, as shown by Pascolini-Campbell et al. (2015), events are classified quite differently when using different data products. As such, we believe the use of many products is justified in this study.
- The setup of the supervised learning uses features from 5 regions from October to March. However, limited dynamical reasoning is provided, and other regions and times should be mentioned (or even tested). For example, the authors show results from smaller regions and time windows in the supplementary material, but not larger ones. The north subtropical region is known to be important for the onset of CP El Niño, and recent papers have found an improvement from including it (Tseng et al., 2021), while the summer season is related to how a specific ENSO type develops (Yu & Fang, 2018). The authors should provide dynamical reasons for the choice of regions and times; otherwise, the study should examine more regions and times to show that the current choice is an optimal one.
We believe that keeping this result is important because it provides context by comparing with other studies, and it shows that there is no clear relationship between the amplitude change and the zonal gradient change when evaluating many SMILEs. While we do not explain the dynamics, our approach is a purely statistical exercise that makes use of more data than other studies, and is therefore a robust and valuable contribution.
- The comparison of the increased CP El Niño frequency with the SMILEs should be more precise. The authors use the ensemble spread in each year as the range of change due to internal variability; however, this is different from the increase in CP El Niño frequency over a certain period. The authors should check how much the CP El Niño frequency can change in each ensemble member and discuss the spread of these changes across all SMILEs.
This is why we added ensemble member 1 to the figure. We will assess the maximum and minimum trends across the ensembles to better evaluate this in the revised manuscript.
Minor comments are provided below:
- Does the training and classification use the original SST or SST anomalies? Please clearly describe this in the text.
- The calculation of frequency should also be mentioned in the methods section, not only in the caption of Figure 3.
- Figure 6 is a bit difficult to read as there are many colors and lines.
- Line 205, ‘to far’
- Line 48, ‘niños’
All minor comments will be addressed in the revision. For the third comment, we will revise Figure 6 to make it clearer and easier to interpret.