the Creative Commons Attribution 4.0 License.
Classification of Leading Edge Erosion Severity via Machine Learning Surrogate Models
Abstract. As the number and size of wind turbines have increased, manual observation and maintenance of the turbines have become increasingly dangerous and time consuming for human operators. One key form of turbine deterioration is leading-edge erosion, which degrades the blade laminate over time. This erosion is caused by environmental factors such as blowing sand, rain, and bug accumulation. Blade damage reduces aerodynamic efficiency and shortens the operational lifespan of wind turbines, motivating the need for structural health monitoring systems. Ideally, one would use a digital twin, which couples a physical device (the turbine) with a computer model through bidirectional passage of information between the physical and digital twins. In a digital twin, sensor data from the turbine continually update the computer model, which then predicts the state of the system for future maintenance and operation decisions, potentially eliminating the need for frequent manual inspections. Machine learning-based classifiers trained on simulation data accurately detect damage but require large training data sets, highlighting the need for computationally efficient alternatives to full physics simulation. A Gaussian process (GP) surrogate model can be trained from a small set of full simulation data points. Once trained, GPs make predictions very quickly (1000 times faster than a simulator evaluation) while also providing information about the uncertainty in the emulator prediction relative to the full physical simulator. The GP emulator methodology we employ includes two extensions to the standard GP. First, the output quantity of interest is vector-valued (rather than a scalar). In our case the vector contains statistics of relevant outputs such as lift, drag, generator power, etc. Second, the range of the outputs is constrained to fit specifications of the blade (so the outputs are not defined over the usual full-space domain required of Gaussian distributions).
In this work we test two random forest classifiers developed to quantify levels of leading-edge erosion. The classifiers differ in whether they are trained on full simulation data or data from the GP surrogate. We find that the classifier trained on surrogate data is at least as accurate as the classifier trained on full simulation data. Using the surrogate-generated dataset, the classifier distinguishes between five erosion severity levels with 87 % accuracy, surpassing the simulation-trained classifier’s accuracy of 83 %. These results highlight the promise of using GP surrogates to train classifiers for leading-edge erosion, a key component of a digital twin for wind turbine maintenance.
Status: open (until 01 Jun 2026)
- RC1: 'Comment on wes-2025-289', Anonymous Referee #1, 19 Mar 2026
- RC2: 'Comment on wes-2025-289', Anonymous Referee #2, 23 May 2026
The paper "Classification of Leading Edge Erosion Severity via Machine Learning Surrogate Models" by Gettemy et al. presents a surrogate-model-based workflow for classifying leading-edge erosion severity in wind turbine blades. The authors combine OpenFAST simulations, Morris screening, parallel-partial zero-censored Gaussian process emulation, and random forest classification. The topic is relevant to Wind Energy Science, particularly because computationally efficient surrogate models could help reduce the simulation burden associated with structural-health-monitoring and digital-twin workflows.
The manuscript contains a potentially useful methodological combination. The comparison between a classifier trained on full OpenFAST data and one trained on emulator-generated data is also a relevant question, and the public code and data statement is welcome. The paper is generally readable and gives substantial background on leading-edge erosion, Gaussian process emulation, and machine-learning-based classification.
Nonetheless, the manuscript requires major revisions before it can be considered for publication. The main issues concern the validation of the synthetic erosion model, the simplified simulation environment, the interpretation of the classifier comparison, the statistical support for the reported accuracy improvement, and several internal inconsistencies in the reported emulator metrics. Some conclusions about digital twins, real-time decision-making, and operational deployment are currently stronger than the evidence provided.
More specifically:
- The scientific contribution should be stated more precisely. The claimed novelty appears to be the combination of PPzGP emulation with leading-edge erosion classification, rather than a fundamentally new classifier or erosion model. The authors should distinguish clearly between methodological novelty, application novelty, and engineering proof-of-concept. The present text sometimes makes the contribution appear broader than what has actually been demonstrated.
- The erosion model is highly heuristic and needs stronger justification. The severity classes are generated through a multivariate normal erosion-level model and then translated into lift and drag changes through simple linear scaling. The covariance structure in Eq. (9), the assumed spatial correlation along the blade, and the linear use of the 53 % lift-loss and 500 % drag-increase limits should be justified with experimental, CFD, or literature-based evidence. If this is only a synthetic benchmark, this limitation should be stated explicitly throughout.
- The manuscript states that regional erosion levels lie in [0, 1], but the multivariate normal distribution in Eq. (8) is not naturally bounded. The authors should explain whether samples are clipped, rejected, transformed, or otherwise constrained. This is important because the erosion labels and the lift/drag perturbations depend directly on these sampled values.
- The simulation setting is considerably simplified. The study uses steady uniform wind files, a fixed nacelle, constant environmental inputs, and ultimately fixes wind shear after sensitivity screening. Turbulence, inflow variability, wave effects, yaw-control behavior, sensor noise, and operational transients are not considered. This is acceptable for an initial proof-of-concept, but the conclusions should not be framed as evidence for operational digital-twin deployment unless these limitations are addressed or explicitly qualified.
- The observability of the proposed damage predictors requires more discussion. Several highly ranked inputs are lift and drag coefficient quantities extracted from OpenFAST at a blade node. In a real turbine these are not directly available in the same form and would require pressure instrumentation, calibration, and noise handling. The authors should explain how such measurements would be obtained in practice, and should test the classifier under realistic sensor noise, bias, missing data, or reduced sensor availability.
- The data-partitioning strategy is not sufficiently transparent. The manuscript refers to a primary dataset, a separate emulator-training dataset, 10-fold cross-validation, feature selection with an initial random forest, and emulator-generated data. The exact relationship between these datasets should be described with a table or schematic. It must be clear which simulations are used for Morris screening, feature ranking, emulator training, random-forest training, hyperparameter tuning, and final testing.
- The feature-selection and hyperparameter-optimization procedures may introduce optimistic bias if they are not nested within the cross-validation loop. The authors should state whether predictor ranking, hyperparameter selection, and model evaluation were performed inside each training fold. If not, the classification results should be recomputed using a fully nested evaluation protocol.
- The reported improvement from the emulator-trained classifier is modest and may not be statistically meaningful. Table 9 reports an accuracy increase from 83.00 % to 86.74 %, but the reported standard deviations are 4.55 % and 6.69 %. The authors should provide paired confidence intervals, repeated cross-validation, or an appropriate statistical test. Macro-F1, balanced accuracy, per-class precision and recall, and calibration metrics would also be more informative than accuracy and AUC alone.
- The comparison between classifiers is confounded by training-set size. The simulation-trained classifier uses 500 simulated samples, while the emulator-trained classifier uses 5000 emulator-generated samples. The observed gain could be due to more training data rather than the surrogate approach itself. The paper would benefit from learning curves and equal-size comparisons, for example, 500 emulated samples versus 500 simulated samples, and progressively larger emulator-generated datasets.
- There are inconsistencies between the abstract, the results, and the tables. The abstract states that NRMSE values are below 10 % and credible-interval coverage exceeds 88 %. However, Table 7 shows NRMSE values above 10 % for at least two outputs, and Table 6 shows coverage of 80 % for root moment mean and tip acceleration standard deviation. These claims should be corrected, and the implications of undercoverage and higher errors should be discussed.
- The computational-speed claims need to distinguish single-prediction speed from full-workflow speed. Table 8 indicates that a single PPzGP prediction is much faster than one OpenFAST run, but the full surrogate workflow includes 173 simulations and imputation. The authors should report the total cost needed to generate the 5000-sample classifier-training dataset and compare it with the cost of generating the same dataset directly. Hardware, parallelization, software versions, and wall-clock versus CPU time should be specified. The text currently alternates between about 1000x, 4 orders of magnitude, and other implied speedups.
- The sensitivity analysis justifies fixing wind shear only within the simplified scenario tested. Since wind shear may interact with turbulence, yaw, and operating regime in realistic conditions, the authors should avoid generalizing this conclusion. They should also clarify whether the Morris screening was performed on raw time-series outputs or on the statistical moments later used for classification.
- The severe false-negative cases should be investigated. The text notes that the most serious misclassification occurs when a fully eroded blade is classified as clean. Even if rare, this is critical from a maintenance and risk perspective. The authors should report per-class recall, confusion costs, and whether the classifier can be tuned to reduce dangerous false negatives at the expense of less critical false positives.
- Several tables and figures need correction or improvement. Table 6 appears to repeat the label "Coefficient of Lift standard deviation" twice (rows 2 and 6). Figure 1 refers to regions described in Table 4, although, if I am not wrong, the region definitions are in Table 3. Figures 7 and 8 are too small for the confusion-matrix values and ROC labels to be read comfortably.
- The literature review contextualizing this work should be better defined in terms of general SHM/NDT applications to wind turbines (see, e.g., https://doi.org/10.3390/s22041627 and similar ones).
- The conclusions should be moderated. The present study establishes a synthetic proof-of-concept for surrogate-assisted classifier training under simplified conditions. It does not yet demonstrate real-time decision-making, field deployment, damage progression tracking, or robust digital-twin operation under realistic sensor and environmental variability. These points can be framed as future work rather than current achievements.
- The English is generally understandable, but the manuscript needs careful proofreading. Examples include "classifers", "leafs" instead of "leaves", inconsistent use of GP/GPs, inconsistent capitalization of OpenFAST module names, and occasional awkward or missing articles. The notation and unit formatting should also be made consistent throughout.
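To make the point about the unbounded multivariate normal concrete, the following is a minimal sketch of the three ways samples could be forced into [0, 1]: clipping, rejection, and transformation. It is not the authors' method. The mean, standard deviation, and neighbour correlation are hypothetical values, and the AR(1) correlation structure is a stand-in for the manuscript's Eq. (9), which is not reproduced here. Only the count of six blade regions comes from the review discussion.

```python
import math
import random

N_REGIONS = 6   # six blade regions, as mentioned in the review discussion
MEAN = 0.5      # hypothetical mean erosion level (assumption)
SIGMA = 0.3     # hypothetical standard deviation (assumption)
RHO = 0.7       # hypothetical correlation between neighbouring regions (assumption)


def sample_unbounded(rng):
    """Draw one correlated Gaussian vector with correlation RHO**|i-j| (AR(1))."""
    z = rng.gauss(0.0, 1.0)
    levels = [MEAN + SIGMA * z]
    for _ in range(N_REGIONS - 1):
        # AR(1) update keeps unit variance while correlating neighbours.
        z = RHO * z + math.sqrt(1.0 - RHO**2) * rng.gauss(0.0, 1.0)
        levels.append(MEAN + SIGMA * z)
    return levels


def sample_clipped(rng):
    """Option 1: clip to [0, 1]; this piles probability mass on the bounds."""
    return [min(1.0, max(0.0, x)) for x in sample_unbounded(rng)]


def sample_rejected(rng, max_tries=10_000):
    """Option 2: redraw until every region lies in [0, 1] (truncated normal)."""
    for _ in range(max_tries):
        levels = sample_unbounded(rng)
        if all(0.0 <= x <= 1.0 for x in levels):
            return levels
    raise RuntimeError("rejection sampling did not find a valid draw")


def sample_transformed(rng):
    """Option 3: treat the Gaussian as latent and map it through a logistic."""
    return [1.0 / (1.0 + math.exp(-(x - MEAN) / SIGMA))
            for x in sample_unbounded(rng)]


rng = random.Random(0)  # fixed seed so the demo is reproducible
for sampler in (sample_clipped, sample_rejected, sample_transformed):
    print(sampler.__name__, [round(x, 2) for x in sampler(rng)])
```

The three options do not yield the same distribution: clipping creates point masses at 0 and 1, rejection changes the effective mean and covariance, and the logistic map changes the marginal shape. Because the severity labels and the lift/drag perturbations are derived from these samples, the choice affects the class balance and should be reported.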
In summary, I find the topic relevant and the proposed workflow promising, but the current manuscript needs major revisions. The authors should strengthen the validation of the synthetic erosion model, clarify and harden the experimental protocol, correct the inconsistencies in the reported metrics, add stronger baselines and statistical tests, and moderate operational claims before the work is suitable for publication.
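As a rough illustration of why statistical tests are requested, the reported accuracies (83.00 % with standard deviation 4.55 %, and 86.74 % with standard deviation 6.69 %) can be run through a back-of-the-envelope Welch-type comparison. This sketch assumes that the standard deviations are taken over the 10 cross-validation folds and that folds can be treated as independent; neither assumption is confirmed by the manuscript, and a paired test on per-fold results would be preferable.

```python
import math


def welch_summary(mean_a, sd_a, mean_b, sd_b, n):
    """Return (difference, standard error, t statistic, Welch degrees of freedom)."""
    var_a = sd_a**2 / n   # variance of the mean for classifier A
    var_b = sd_b**2 / n   # variance of the mean for classifier B
    diff = mean_b - mean_a
    se = math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (var_a**2 / (n - 1) + var_b**2 / (n - 1))
    return diff, se, diff / se, dof


# Values from Table 9 as quoted in the report; n = 10 folds is an assumption.
diff, se, t_stat, dof = welch_summary(83.00, 4.55, 86.74, 6.69, n=10)
print(f"difference = {diff:.2f} points, SE = {se:.2f}, t = {t_stat:.2f}, dof = {dof:.1f}")
```

Under these assumptions the t statistic is about 1.46 with roughly 16 degrees of freedom, which is below the two-sided 5 % critical value of about 2.1. The calculation is only indicative, since cross-validation folds share training data and are not independent, which is why repeated cross-validation or paired confidence intervals are the more appropriate evidence.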
Citation: https://doi.org/10.5194/wes-2025-289-RC2
Data sets
Classification of Leading Edge Erosion Severity Via Machine Learning Surrogate Models Aidan Gettemy https://zenodo.org/records/16729170
Model code and software
Classification of Leading Edge Erosion Severity Via Machine Learning Surrogate Models Aidan Gettemy https://zenodo.org/records/16729170
Viewed
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 336 | 173 | 24 | 533 | 49 | 48 |
The manuscript addresses an important problem in wind turbine monitoring by exploring the use of surrogate models to generate training data for erosion classification. The approach is interesting and the paper is generally well written, but several aspects of the data generation, methodology, and positioning of the contribution could be strengthened to better reflect the complexity of the real monitoring problem and to clarify the novelty of the proposed framework.
Major Points
- The overall contribution is not yet clearly positioned with respect to the existing literature on surrogate modeling and wind turbine condition monitoring. Gaussian-process surrogates, sensitivity analysis, and random forest classifiers are all well-established techniques. The manuscript should clarify more explicitly what methodological advance is introduced beyond applying these tools to a specific erosion-monitoring scenario.
- The claim of novelty regarding the PPzGP surrogate is not sufficiently demonstrated. While combining parallel partial emulation and range-censored Gaussian processes is technically interesting, the manuscript does not clearly show why this combination enables capabilities that standard GP surrogates would not provide for this problem.
- The erosion model used to generate the data is highly simplified. Blade erosion is represented through a parametric scaling of lift and drag coefficients across six blade regions. While this may be suitable for a proof-of-concept study, the manuscript should discuss the limitations of this representation and justify why it captures the key aerodynamic effects of real leading-edge erosion.
- The erosion process is modeled as discrete severity classes rather than a continuous degradation process. In reality erosion evolves gradually and spatially across the blade surface. The use of five artificial classes may simplify the classification task and should be justified more clearly.
- The classification problem may be artificially easy because the erosion perturbations are directly embedded in the aerodynamic coefficients and the classifier is trained on outputs that are strongly linked to those coefficients (e.g., lift and drag sensor statistics). This raises the possibility that the model is learning the synthetic perturbation rather than identifying erosion signatures that would be observable in practice.
- The strong performance of a relatively simple random forest classifier suggests that the generated dataset may be too clean or too easily separable. In practice, leading-edge erosion detection is known to be challenging due to turbulence, operational variability, sensor noise, and confounding effects. The simulations appear to lack these disturbances, which may make the classification task unrealistically simple.
- The simulations assume uniform wind conditions rather than turbulent inflow. Since turbulence strongly influences turbine loads and vibration signals, the use of uniform wind fields likely underestimates the variability present in real monitoring data. Including turbulent wind realizations would significantly improve realism.
- The operational variability of the turbine is limited. Real turbines experience controller transitions, yaw adjustments, and varying operating regimes that influence measured signals. The current simulation setup may not capture these effects.
- The sensor configuration used in the study may not reflect practical monitoring systems. In particular, lift and drag pressure sensors are rarely available in operational wind turbines. The manuscript should discuss the feasibility of the assumed sensing setup or consider signals more commonly available in SCADA or structural monitoring systems.
- The feature extraction strategy reduces time-series signals to simple statistical moments (mean, standard deviation, skewness, kurtosis). This discards potentially important information contained in the temporal and spectral structure of the signals. The authors should justify this choice or explore richer feature representations.
- The surrogate model is trained on a relatively small number of simulations compared to the dimensionality and nonlinearity of the system. Although the reported prediction errors are moderate, the manuscript should discuss potential model bias and the limits of extrapolation.
- The reported classification improvement between simulator-trained and surrogate-trained models (83% vs 87%) is relatively modest. It would be helpful to provide statistical analysis or repeated experiments to assess whether this improvement is significant.
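The concern about reducing time series to statistical moments can be illustrated with a small sketch. The function below computes the four moments named above using population estimators (the exact estimators used in the manuscript are an assumption), and the example shows that reordering a signal in time leaves all four features unchanged, so any temporal or spectral information is invisible to a classifier trained on them.

```python
import math


def moment_features(signal):
    """Mean, standard deviation, skewness, and (non-excess) kurtosis of a signal."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    m2 = sum(c**2 for c in centered) / n   # population variance
    if m2 == 0.0:
        raise ValueError("constant signal: skewness and kurtosis are undefined")
    m3 = sum(c**3 for c in centered) / n
    m4 = sum(c**4 for c in centered) / n
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "skewness": m3 / m2**1.5,
        "kurtosis": m4 / m2**2,
    }


# A 5-cycle sine wave and the same samples sorted in ascending order:
# very different temporal structure, identical sample values.
sine = [math.sin(2.0 * math.pi * 5.0 * k / 1000.0) for k in range(1000)]
ramp = sorted(sine)

print(moment_features(sine))
print(moment_features(ramp))
```

Both calls return the same features up to floating-point rounding, although one signal oscillates and the other rises monotonically. Spectral or time-domain features would separate the two, which is the kind of richer representation the comment asks the authors to consider or to rule out with justification.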
Minor Points
- The manuscript frequently refers to digital twins, but the work primarily demonstrates surrogate modeling and classification using simulated data. Since essential digital twin elements such as data assimilation, state estimation, or online updating are not included, the connection to digital twins should be described more cautiously.
- The introduction is somewhat lengthy and could be shortened. Several sections summarizing wind-energy background material could be condensed to focus more directly on the methodological contribution.
- Some terminology is used interchangeably throughout the manuscript (e.g., emulator, surrogate model). Consistent terminology would improve clarity.
- Figures illustrating emulator predictions are informative but could be improved in readability, particularly through larger axis labels and clearer legends.
- The manuscript would benefit from a clearer discussion of the gap between simulation-based validation and deployment on real turbine monitoring data.
- Minor typographical issues and formatting inconsistencies appear throughout the manuscript and should be corrected during revision.