Utilizing Physics-Based Input Features within a Machine Learning Model to Predict Wind Speed Forecasting Error

Machine learning is quickly becoming a commonly used technique for wind speed and power forecasting. Many of these methods utilize exogenous variables as input features, but there remains the question of which atmospheric variables provide the most predictive power, especially in handling non-linearities that lead to forecasting error. This investigation addresses this question via creation of a hybrid model that utilizes an autoregressive integrated moving average (ARIMA) model to make an initial wind speed forecast, followed by a random forest model that attempts to predict the ARIMA forecasting error using knowledge of exogenous atmospheric variables. Variables conveying information about atmospheric stability and turbulence, as well as inertial forcing, are found to be useful in dealing with non-linear error prediction. Wind direction (θ) and temperature (T) are found to be the most beneficial individual input features. Streamwise wind speed (U), time of day (t), turbulence intensity (TI), turbulent heat flux (w′T′), θ, and T are found to be particularly useful when used in conjunction. The prediction accuracy of the ARIMA-RF hybrid is compared to that of the persistence and bias-corrected ARIMA models. The ARIMA-RF model is shown to improve upon these commonly employed modeling methods, reducing hourly forecasting error by approximately 30% below that of the bias-corrected ARIMA model.

and random forest) are likely better suited to provide accurate power forecasts. Since wind power production is heavily reliant upon environmental conditions, improvements in wind speed forecasting would allow for more reliable wind power forecasts.
If we simplify our wind speed prediction process down to its core (which has no true relation to atmospheric motions), we can imagine a system of atmospheric flow without external forcing. This would result in a constant streamwise wind speed U (i.e. U_τ = U_{τ−1}, where U is streamwise wind speed and τ a timestep; discrete timesteps are assumed for simplicity). In this case, a persistence or autoregressive forecast would have zero forecasting error and uncertainty. However, uncertainty increases once we add an external force that we may represent by some variable x_1. Now future wind speed may be seen to be U_τ = f(U_{τ−1}, x_{1,τ−1}). Assuming the external force is notable in strength and coupled with the inertia associated with winds, the previous autoregressive model will now struggle to predict U_τ because it does not take into account our external forcing x_{1,τ−1}, resulting in an error ε (ε_τ is abbreviated to ε for simplicity). We can then break down our future wind speed into two parts: U_τ = Û_τ + ε, where Û_τ is our autoregressive forecast that depends only on U_{τ−1} (i.e. Û_τ = f(U_{τ−1})). The prediction error is thus skewed to represent the effects of the external force x_{1,τ−1} upon U_{τ−1}.
If we continue to add external forces (x_1, x_2, … x_n; n is the number of external forcing variables), our atmospheric system becomes much more complex and non-linear due to interactions between forcing mechanisms. We can again obtain our forecasting error as ε = f(U_{τ−1}, x_{1,τ−1}, x_{2,τ−1}, … x_{n,τ−1}), which we can decompose as ε = μ_ε + ε′ (μ_ε is the error bias, ε′ the error fluctuations about μ_ε) given that we have a statistically significant sample size and the process is stationary. Squaring this equation and taking the average gives the mean squared error

$$\overline{\varepsilon^2} = \mu_\varepsilon^2 + \overline{\varepsilon'^2}$$

with $\overline{\varepsilon'^2}$ representing the error variance and overlines denoting the average over all samples (Lange, 2005). μ_ε² represents the bias and may be removed via a simple bias-correction. The true concern is the error fluctuations (ε′), which comprise the error variance.
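This decomposition of mean squared error into a squared bias and an error variance can be verified numerically on synthetic data (all values below are illustrative, not from the study):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "observed" wind speeds and a biased, noisy forecast of them.
true = rng.normal(8.0, 2.0, 10_000)                   # m/s
forecast = true + 0.5 + rng.normal(0.0, 1.0, 10_000)  # biased forecast

eps = true - forecast          # forecasting error
mu_eps = eps.mean()            # error bias (mu_eps)
eps_prime = eps - mu_eps       # fluctuations about the bias (eps')

# Mean squared error equals squared bias plus error variance.
mse = np.mean(eps**2)
decomposed = mu_eps**2 + np.mean(eps_prime**2)
```

The identity holds exactly for any finite sample, which is why a simple bias-correction removes only the μ_ε² portion of the error and leaves the fluctuations untouched.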
Assuming the external forcing variables (x's) are normally distributed, we can break down $\overline{\varepsilon'^2}$ into two constituents (Ku et al., 1966):

$$\overline{\varepsilon'^2} = \sum_{j=1}^{n}\left(\frac{\partial \varepsilon}{\partial x_j}\right)^2 \sigma_{x_j}^2 + \sum_{j=1}^{n}\sum_{k\neq j}\frac{\partial \varepsilon}{\partial x_j}\frac{\partial \varepsilon}{\partial x_k}\,\sigma_{x_j,x_k} \qquad (1)$$

where $\sigma_{x_j}^2$ is the variance of x_j and $\sigma_{x_j,x_k}$ is the covariance between x_j and x_k (subscript τ removed for simplicity). Unless external forcing (or its coupling with U_{τ−1}) is minimal, the error is likely highly non-linear and chaotic (i.e. large $\overline{\varepsilon'^2}$). Therefore, it behooves us to discover which forcing mechanisms and atmospheric variables are the best predictors of individual fluctuations ε′, which we will call "exogenous error".
Many studies that use machine learning (ML) techniques for wind speed or power forecasting utilize a handful of unadulterated atmospheric variables such as wind speed, pressure, and temperature as input features (Mohandes et al., 2004; Ramasamy et al., 2015; Lazarevska, 2018; Chen et al., 2019). Recently, a handful of investigations have begun to determine which variables may be most useful for these models. Vassallo et al. (2019) showed that turbulence intensity (TI) can vastly improve vertical wind speed extrapolation accuracy. Similarly, Li et al. (2019) showed that TI improves wind speed forecasting on multiple timescales, while Optis and Perr-Sauer (2019) showed that both atmospheric stability and turbulence levels are important indicators for wind power forecasting. Markedly, it has been shown by Cadenas et al. (2016) that multivariate statistical models consistently outperform univariate models for wind speed forecasting. However, to the authors' knowledge, the question of which atmospheric variables are most useful in predicting exogenous error has not been addressed in the literature.
This investigation aims to determine if exogenous error may be, at least in part, predicted via a list of common meteorological measurements by following a methodology similar to that of Cadenas and Rivera (2010). The autoregressive integrated moving average (ARIMA) model first obtains an autoregressive forecast, and the forecasting error is extracted and bias-corrected. A random forest model is then utilized to discover patterns in the exogenous variables (and their relations to the endogenous variable U) that are predictive of exogenous error. The ARIMA-random forest hybrid model so constructed is referred to as the ARIMA-RF model.
This study is not intended to provide a catch-all list of input features that should or should not be used for every future study.
Rather, it aims to inform future researchers and industry professionals as to what types of meteorological information must be used as ML inputs to predict the non-linear interactions between various atmospheric forces. Section 2 provides an overview of the models utilized, the testing process, and the feature extraction/selection methodology. Section 3 describes the Perdigão field campaign (the data source for this work), site characteristics, and the instrumentation used for data collection. Section 4 provides testing results and Section 5 includes a brief discussion of the obtained results. Finally, conclusions can be found in Section 6.

Methodology
This investigation utilizes two modeling methods, ARIMA and random forest regression, to create a hybrid model (ARIMA-RF) wherein the ARIMA model is first used to obtain a linear, univariate wind speed forecast. The ARIMA forecast is bias-corrected and the exogenous error is then extracted and used as the target variable for the random forest. The random forest's goal (and the goal of the study) is to determine which atmospheric variables and forcing categories are useful for the prediction of exogenous error. After the most important individual variables have been established, combinations of these input features are tested in an effort to determine whether multiple variables and/or informational categories can be coupled to improve exogenous error prediction. Finally, the ARIMA-RF results are compared with those of the persistence method and bias-corrected ARIMA model. 75% of the data (the training set) is randomly selected and used for model construction and bias calculation. The remaining 25% of the data is set aside for testing to enable a direct, blind comparison between all models. Section 2.1 details the ARIMA model, while Section 2.2 describes random forest regression. Sections 2.3 and 2.4 provide more detail on the feature extraction and selection methodology as well as the tests performed.
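A minimal sketch of this pipeline on synthetic data might look like the following, with an AR(1) least-squares fit standing in for the full ARIMA(2,0,1) model and a single hypothetical exogenous variable x (the variable names and the chronological split are this example's assumptions; the study shuffles its data randomly):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic AR(1) wind speed driven by an exogenous forcing x that a
# univariate linear model cannot see.
n = 3000
x = np.sin(np.arange(n) / 50.0)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.8 * u[t - 1] + 0.5 * x[t - 1] + rng.normal(0.0, 0.1)

# Step 1: univariate linear forecast (AR(1) via least squares).
phi = np.linalg.lstsq(u[:-1, None], u[1:], rcond=None)[0][0]
u_hat = phi * u[:-1]
err = u[1:] - u_hat                        # forecasting error

# Step 2: bias-correct on the training portion, then train a random
# forest to predict the remaining "exogenous" error from lagged inputs.
split = int(0.75 * len(err))               # 75/25 train/test split
bias = err[:split].mean()
X = np.column_stack([x[:-1], u[:-1]])      # features at timestep tau-1
rf = RandomForestRegressor(n_estimators=200, min_samples_split=100,
                           random_state=0)
rf.fit(X[:split], err[:split] - bias)

# Step 3: hybrid forecast = linear forecast + bias + predicted error.
hybrid = u_hat[split:] + bias + rf.predict(X[split:])
rmse_linear = np.sqrt(np.mean((u[1:][split:] - u_hat[split:] - bias) ** 2))
rmse_hybrid = np.sqrt(np.mean((u[1:][split:] - hybrid) ** 2))
```

On this toy series the hybrid RMSE falls below the bias-corrected linear RMSE because the random forest recovers the error component driven by x, which is the mechanism the study exploits with real atmospheric variables.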

ARIMA
ARIMA (Box et al., 2015) is a univariate statistical model that is often used for time series forecasting. It is predicated on the combination of three functions: an autoregressive function that uses lagged values as inputs, a moving average function that uses past forecasting errors as inputs, and a differencing function used to make a time series stationary. In its simplest form, the next term in a time series sequence, y_τ, is given by

$$y_\tau = \sum_{i=1}^{p} \phi_i\, y_{\tau-i} + \sum_{j=1}^{q} \Theta_j\, \varepsilon_{\tau-j} + \varepsilon_\tau \qquad (2)$$

where p and q are the orders of the autoregressive and moving average functions, respectively, φ_i and Θ_j the i-th autoregressive and j-th moving average parameters, respectively, y_{τ−i} the i-th lagged value, ε_{τ−j} the j-th past prediction error, and ε_τ the error term at time τ. The order of differencing is given by the parameter d and does not appear directly in Eqn. 2.
The dataset was tested for non-stationarity using the Augmented Dickey-Fuller test (Dickey and Fuller, 1979), which rejected non-stationarity to a statistically significant degree, indicating that the data are stationary. Therefore, the differencing parameter d was set to 0 (this turns the ARIMA model into an ARMA model, but we retain the term ARIMA for uniformity). The autoregressive and moving average orders used, p = 2 and q = 1, were determined via minimization of the Akaike information criterion (Shibata, 1976) and empirical testing. Increasing the orders beyond this point did not lead to improved ARIMA accuracy. Although the wind speed data is stationary, general atmospheric seasonality (Ramana et al., 2004; Chervin, 1986) is expected to have an impact on multiple input features, requiring training and testing data to be randomly shuffled.

Random Forest Regression
Random forest regression (Breiman, 2001) is an ensemble method that is made up of a population of decision trees. Bootstrap aggregation (bagging) is used so that each tree can randomly sample from the dataset with replacement, while only a random subset of the total feature set is given to each individual tree. The trees can be pruned (truncated) to add further diversification.
After construction, the population's individual predictions are averaged to give a final prediction of the target variable. Ideally, this process results in a diversified and decorrelated set of trees whose predictive errors cancel out, producing a more robust final prediction. An advantage of random forests is their ability to determine the importance of all input features for the predictive process. This is done by calculating the mean decrease in impurity, or the decrease in variance that is achieved during a given split in each decision tree. The decrease in impurity for each input feature can be averaged over the entire forest, providing an approximation of the feature's importance for the prediction (feature importance estimates sum to 100% to ease interpretability). However, if two input variables are highly correlated (as is expected when testing atmospheric forcing), it is highly unlikely that the reported importance values will accurately represent each variable's significance (Breiman, 2001). Therefore, each variable is first tested individually to determine its individual benefits prior to coupling with other exogenous variables. To assist the random forest in representing the dynamic nature of atmospheric processes, input variables are taken from the previous two timesteps (i.e. input feature U comprises U_{τ−1} and U_{τ−2}).
The constructed random forest model contains 1,000 trees for tests of individual variables and 1,500 trees for tests of variable combinations. This was found to be sufficiently large to ensure prediction stability (to within a root mean square error of ±0.001 m s⁻¹), and the inclusion of additional trees does not result in higher prediction accuracy. To ease concerns of overfitting, each internal node was required to have at least 100 samples in order to split (this truncation is a form of regularization). The random forest model was built using the scikit-learn Python library (Pedregosa et al., 2011).
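The importance-estimation mechanism can be illustrated with a toy regression in scikit-learn; the two features below are hypothetical stand-ins (one informative, one pure noise), and the tree count is reduced from the study's 1,000-1,500 for brevity:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)

# Two candidate input features: one informative, one pure noise.
n = 3000
informative = rng.normal(size=n)
noise_feat = rng.normal(size=n)
y = 2.0 * informative + rng.normal(0.0, 0.1, n)
X = np.column_stack([informative, noise_feat])

rf = RandomForestRegressor(
    n_estimators=300,         # the paper uses 1,000-1,500 trees
    min_samples_split=100,    # a node must hold >= 100 samples to split
    random_state=0,
).fit(X, y)

# Mean-decrease-impurity importances; they sum to 1 (i.e. 100%).
imp = rf.feature_importances_
```

With correlated atmospheric inputs the individual importance values become less trustworthy, which is why the study tests each variable alone before combining them.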

Feature Extraction and Selection
In an effort to ensure that the findings are applicable to real-world campaigns, we limit our sources of information to those which may be measured by a typical meteorological mast containing sonic anemometers alongside temperature sensors. Using this information, we can write our future wind speed U_τ as a function of the following variables, which were broken down into their mean and fluctuating values:

$$U_\tau = f\left(U_i,\ \theta_i,\ W_i,\ T_i,\ t_i,\ u'_i,\ \theta'_i,\ w'_i,\ T'_i\right) \qquad (3)$$

where U_i and θ_i are the mean streamwise wind speed and direction, respectively, W_i the mean vertical wind speed, T_i the mean temperature, t_i the time of day, u′_i the fluctuating horizontal velocity, θ′_i the fluctuating wind direction, w′_i the fluctuating vertical velocity, and T′_i the fluctuating temperature at each previous timestep i. Unfortunately, θ′ was not available within the dataset utilized (which had already been 5-minute averaged) and is therefore ignored for this study. Previous analysis, however, has shown that θ′ varies inversely with U in complex terrain (Papadopoulos et al., 1992), and we may therefore assume its influence is largely captured by U.
Although these unadulterated features give us an idea as to how the system is working at the moment, they may not explicitly represent the relevant atmospheric forcing mechanisms. Our list of measurements allows us to break down our system into two principal forcing components: buoyancy and inertial forcing. Each of these forces can be further discretized into large and small scales (also called mean and fluctuating values; typically separated by at least one order of magnitude). The definitions and formulations of all non-obvious extracted variables used in this study can be found in Appendix A. From Fig. 1, it is clear that the variables in Eqn. 3, when manipulated, are able to describe both the inertial and buoyant forces at multiple scales. Large-scale inertial forcing can be described by the local mean wind speed and direction (U and θ) or vertical velocity W, while small-scale inertial forcing can be described by variables such as the fluctuating (standard deviation of) velocity σ_u, friction velocity u_*, and the turbulence kinetic energy TKE. Likewise, large-scale buoyancy forcing can be described by the squared buoyancy frequency N², the temperature gradient ∂T/∂z, or by proxy values such as the time of day t or temperature T (which, on average, is higher during the day and lower at night). Small-scale buoyancy can be described by the turbulent heat flux w′T′. The interactions between forces and scales can also be described by non-dimensional variables such as the gradient Richardson number Ri_g, flux Richardson number Ri_f, turbulence intensity TI, and normalized friction velocity u_*/U. These derived non-dimensional variables, or extracted features, are typically ignored by current ML models in lieu of raw features such as those listed in Eqn. 3.
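A sketch of how several of these extracted features could be computed from raw sonic-anemometer samples over one averaging window follows; the helper name `extracted_features` and its simplifications (e.g. plain temperature in place of virtual potential temperature, a two-point gradient for N²) are this example's assumptions, not the study's exact processing:

```python
import numpy as np

def extracted_features(u, v, w, T, T_low, T_high, dz, g=9.81):
    """Derive turbulence/stability features from raw velocity components
    (u, v, w) and temperature T sampled over one averaging window, plus
    mean temperatures at two heights separated by dz."""
    U = u.mean()                                        # mean streamwise speed
    up, vp, wp, Tp = u - u.mean(), v - v.mean(), w - w.mean(), T - T.mean()

    TI = up.std() / U                                   # turbulence intensity
    TKE = 0.5 * (np.mean(up**2) + np.mean(vp**2) + np.mean(wp**2))
    wT = np.mean(wp * Tp)                               # turbulent heat flux
    u_star = (np.mean(up * wp) ** 2 + np.mean(vp * wp) ** 2) ** 0.25
    N2 = (g / T.mean()) * (T_high - T_low) / dz         # buoyancy freq. squared
    return {"TI": TI, "TKE": TKE, "wT": wT, "u_star": u_star, "N2": N2}
```

Each derived quantity condenses a forcing category (small-scale inertial, small-scale buoyant, large-scale buoyant) into a single number per averaging period, which is exactly the form a random forest consumes as an input feature.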

Extracted variables like those in Fig. 1 may not provide any more information than the raw variables in Eqn. 3. However, they may ease the burden on the model by discretizing (or directly relating) informational categories, thereby reducing informational overlap and noise, providing more periodic/predictive power, and more accurately describing the underlying system. Further, such well-conceived meteorological variables have been seen to be useful for atmospheric prediction (Li et al., 2019; Kronebach, 1964). In theory, given enough data, the model should be able to decipher and interpret these extracted features on its own. Unfortunately, there often is not enough collected data for this to happen organically. Instead, by providing better information we can create a simpler, cheaper, more robust model that requires less training data and construction time.

Selected features will ideally represent the underlying system as accurately as possible without providing noisy or redundant information.

Testing
In an effort to understand the predictive capabilities of each variable, initial tests only include individual atmospheric input features. Once each input feature has been tested separately, a feature set is tested that utilizes all input features. Feature importance estimates are then extracted from the random forest model and various user-selected combinations of the most important input features are tested. It must be noted that only select input feature sets were tested in this investigation due to the sheer multitude of potential feature sets.
In order to relieve any timescale bias, forecasts are made across multiple timescales. Typically, wind power utility operators require single-step short-range power forecasts run hour-by-hour for a few days to reduce unit commitment costs. The forecast skill of observation-based methods generally reduces with forecast lead time within an hour, while numerical models have higher skill at longer lead times (> 3 hours) (Haupt et al., 2014). Statistical learning methods have proved to be particularly effective from about 30 minutes to approximately three hours ahead (Mellit, 2008; Wang et al., 2012; Yang et al., 2012; Morf, 2014), and roughly this time frame is thus the focus for this study. The shortest forecast predicts wind speeds 10 minutes ahead, roughly within the turbulent spectral band (Van der Hoven, 1957). Forecasts are also made one and three hours ahead, which are within the spectral gap between the turbulent and synoptic spectra and approach the 6-hour period wherein NWP models become particularly useful (Dupré et al., 2019). These are all single-step forecasts, which is to say that the averaging timescale increases with the forecasting timescale (e.g. a 10-minute forecast predicts 10-minute averaged wind speed, whereas a three-hour forecast predicts three-hour averaged wind speed). Each test is performed 10 times to ensure forecasting stability.
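The single-step convention, where the averaging window equals the forecast horizon, can be expressed with pandas resampling (the dates and values below are synthetic placeholders):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Five-minute wind speed samples over one week (hypothetical values).
idx = pd.date_range("2017-03-10", periods=7 * 24 * 12, freq="5min")
ws = pd.Series(8 + rng.normal(0, 1, len(idx)), index=idx)

# Single-step forecasting: the averaging window equals the forecast
# horizon, so each horizon gets its own resampled series.
ws_10min = ws.resample("10min").mean()   # 10-minute-ahead target series
ws_1h = ws.resample("1h").mean()         # hourly target series
ws_3h = ws.resample("3h").mean()         # three-hour target series

# A persistence forecast on each series predicts the next averaged value
# from the current one (one of the baselines the ARIMA-RF is compared to).
persistence_1h = ws_1h.shift(1)
```

Each resampled series is then forecast one step ahead on its own timescale, so longer-horizon forecasts also benefit from heavier averaging of turbulent fluctuations.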
Two metrics are utilized to determine how well the random forest predicts exogenous error. The root mean squared error (RMSE) of the bias-corrected ARIMA model is found, giving a metric of the true exogenous error. The random forest model is then trained to predict the exogenous error and the RMSE of the ARIMA-RF model is found. The reduction in RMSE (which comes exclusively from the random forest's prediction of exogenous error) is then found for the test set. The coefficient of determination (R²) between the true and predicted exogenous error is used to determine the amount of error variability captured by the random forest model. Eqn. 4 and Eqn. 5 describe both metrics, wherein U_m is the target wind speed, Û_m the predicted wind speed, ε_m the true exogenous error, ε̂_m the predicted exogenous error, ε̄ the mean exogenous error (approximately zero), m each individual sample, and M the sample size.

$$RMSE = \sqrt{\frac{1}{M}\sum_{m=1}^{M}\left(U_m - \hat{U}_m\right)^2} \qquad (4)$$

$$R^2 = 1 - \frac{\sum_{m=1}^{M}\left(\varepsilon_m - \hat{\varepsilon}_m\right)^2}{\sum_{m=1}^{M}\left(\varepsilon_m - \overline{\varepsilon}\right)^2} \qquad (5)$$
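Both metrics can be written directly from their definitions (a straightforward sketch; the function names are this example's own):

```python
import numpy as np

def rmse(target, predicted):
    """Root mean squared error between target and predicted wind speed."""
    target, predicted = np.asarray(target), np.asarray(predicted)
    return np.sqrt(np.mean((target - predicted) ** 2))

def r2(true_err, pred_err):
    """Coefficient of determination between true and predicted exogenous
    error: the fraction of error variability captured by the model."""
    true_err, pred_err = np.asarray(true_err), np.asarray(pred_err)
    ss_res = np.sum((true_err - pred_err) ** 2)
    ss_tot = np.sum((true_err - true_err.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Here `rmse` is applied to the wind speed forecasts, while `r2` compares true and predicted exogenous error, so an R² of 0 means the random forest adds no information beyond the bias-corrected ARIMA forecast.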

Site, Data, & Instrumentation
Data for this study were taken from the Perdigão campaign, a multinational project located in central Portugal that took place in the spring of 2017. The project site is characterized by two parallel ridges, both about 5 km in length, with a 1.5 km wide valley between them. These ridges, which may be seen in Fig. 2a, run northwest to southeast and rise about 250 m above the surrounding topography, making the site highly complex and increasing forecasting difficulty. The ridges will be referred to as the northern and southern ridge.
A variety of remote and in situ sensors were positioned in and around the valley to provide an accurate and thorough description of the surrounding flow field. Foremost among these sensors was a grid of meteorological towers which ran both parallel and normal to the ridges. One 100 m tower located on top of the northern ridge (black marker in Fig. 2a) is utilized in this study. This tower had sonic anemometers at 10, 20, 30, 40, 60, 80, and 100 m above ground level (AGL) as well as temperature sensors at 2, 10, 20, 40, 60, 80, and 100 m AGL. The tower data was quality controlled; sonic winds have been corrected for boom orientation and tilt and are thus rotated into a geographic coordinate system. Further details about the data and quality control techniques can be found in NCAR/UCAR (2019). The combination of sensors on this tower provided sufficient information to measure both the inertial and buoyant forcing for flow passing over this location on the northern ridge. Sensors at 20 and 100 m AGL were chosen based on data availability. The data utilized spans three months, running from 10 March to 16 June 2017. Data at 100 m were correlated with those at 20 m, and missing data were filled using the variance ratio measure-correlate-predict method (Rogers et al., 2005). Any periods unavailable at both heights were filled using linear interpolation with Gaussian noise. Manually filled periods (all periods are required for proper functionality and assessment of the ARIMA model) account for less than 1% of the total periods in the study and are not expected to make a noticeable difference in the findings. All data were calculated at a 5-minute moving average in order to create a robust dataset (over 28,000 samples). To ease concerns of the model overfitting the overlapping dataset, each internal node in the random forest model (which already has built-in mechanisms that severely hinder overfitting, as described by Breiman (2001) and James et al. (2013)) was required to contain at least 100 samples in order to split (i.e. each branch of every decision tree stops splitting once there are fewer than 100 samples).

The target streamwise wind speed, or that to be forecasted, is located at 100 m AGL. N², Ri_g, Ri_f, and ∂T/∂z were calculated between 100 and 20 m AGL. u_* was found at 20 m, just above surface roughness height. All other input variables utilized were found only at 100 m AGL.

Results

Results were first obtained for each individual input feature. Specific RMSE and R² values obtained for these cases may be found in Table B1 in Appendix B. The variables are broken down into three distinct categories: inertial (large-scale dimensional variables signifying inertial forces in Fig. 1), stability (blue and purple regions in Fig. 1, which are akin to atmospheric stability), and turbulence variables (small-scale and non-dimensional inertial variables in Fig. 1). It is immediately clear that there is a distinction between the results for the 10-minute forecast and those for the hourly and three-hour forecasts. Each random forest prediction of 10-minute exogenous error using individual input features resulted in an increase in RMSE (or negative RMSE reduction; Fig. 3a), indicating that exogenous error at such small timescales is highly chaotic and unpredictable based off of the information from any single atmospheric variable. In fact, these tests show that any correlative patterns observed between the utilized meteorological variables and exogenous error are likely circumstantial and lead to deleterious predictions. Results for tests combining multiple input features may be found in Table B2. The relation between the true and predicted exogenous error can be seen in Fig. 5. It is clear that, as prediction timescale increases, the correlation between true and predicted exogenous error increases, with the three-hour prediction having an R² value of 0.801.
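The variance-ratio measure-correlate-predict gap filling applied to the tower data can be sketched as follows (the helper name `variance_ratio_mcp` and the synthetic series are this example's assumptions; see Rogers et al. (2005) for the full method):

```python
import numpy as np

def variance_ratio_mcp(ref, target):
    """Fit the variance-ratio MCP relation on concurrent data (NaNs in
    `target` excluded), returning a function that predicts the target
    (e.g. 100 m) series from the reference (e.g. 20 m) series."""
    ok = ~np.isnan(target)
    mu_r, mu_t = ref[ok].mean(), target[ok].mean()
    ratio = target[ok].std() / ref[ok].std()   # variance-ratio slope
    return lambda r: mu_t + ratio * (r - mu_r)

# Fill gaps at the upper level from the lower record (synthetic data).
rng = np.random.default_rng(5)
ws20 = 6 + rng.normal(0, 1, 500)
ws100 = 1.2 * ws20 + rng.normal(0, 0.3, 500)
ws100[50:60] = np.nan                          # a missing period aloft

predict = variance_ratio_mcp(ws20, ws100)
filled = np.where(np.isnan(ws100), predict(ws20), ws100)
```

The variance-ratio form preserves the target series' mean and standard deviation, which matters here because the filled values feed a variance-sensitive error decomposition.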

Feature importance estimates were also obtained from the all-input test cases and can be seen in Fig. 6. A handful of variables, namely θ, U, TI, t, T, and w′T′, are particularly useful for the hourly and three-hour predictions. Because U, θ, and t are all variables that can be obtained from a simple cup anemometer and wind vane, they are used as the "base variables" when testing discriminate input feature combinations. The results of these tests, which may be found in Table B2 in Appendix B, prove that a large majority of the model's predictive power (i.e. a majority of the relevant input information) is contained within these six variables.

[Figure 6: Feature importance for a) the 10-minute prediction, b) the hourly prediction, and c) the three-hour prediction. Blue bars denote inertial variables, orange bars denote stability variables, and grey bars denote turbulence variables. Importance values for each test sum to 100%.]

Discussion
There is a clear distinction between the results obtained for the 10-minute exogenous error predictions and those obtained for the hourly and three-hour predictions. All atmospheric input features, when used individually for the 10-minute forecasts, resulted in a faulty prediction of error. This is likely due to the turbulent nature of wind speeds at the 10 minute timescale.
Typically the large-eddy turnover timescale for the lower atmosphere is 10-20 minutes, and averaging timescales approaching or less than this timescale exclude information on more stable and deterministic large eddies, thus making predictions more prone to random errors. This is exemplified by the work of Van der Hoven (1957), who shows that a 10-minute average is within the turbulent peak of the wind speed spectrum. The lack of large-eddy influence results in a wind speed signal that is replete with random fluctuations originating in the inertial subrange, adding substantial noise to the prediction. These fluctuations overwhelm the ML model's pattern recognition capabilities, reducing the random forest prediction to a noisy guess. Such ML models will always make predictions based on patterns in the training data, even when those patterns are erroneous and do not hold for the testing dataset. This results in error predictions that are not correlated with the true exogenous error (as indicated by the 10-minute R² values in Table B1).
As the forecasting timescales increase, smaller-scale turbulent fluctuations average out and the random forest model can recognize predictive patterns between atmospheric input features and the non-linear exogenous error. Tests involving individual atmospheric variables effectively represent the magnitude of the first term on the right side of Eqn. 1. These tests show that predictions involving individual variables (or at least those tested) can only reduce exogenous error by approximately 4% and 12% for the hourly and three-hour predictions, respectively. While this is a considerable error reduction, the meteorological variables are most beneficial when combined.
A list of feature importance estimates, as determined by a test incorporating all input features, is shown in Fig. 6. Many of the features are correlated, meaning that exact importance values are likely misleading. Nevertheless, the reported importance estimates are likely a good indicator as to which features, when used in combination with others, are most useful in predicting exogenous error. θ is both the best individual predictor and the most important feature for all tests, likely because our measurements are taken atop an asymmetric ridge in complex terrain. As is detailed in Fernando et al. (2019), the complex terrain leads to an ensemble of topographically induced ridge-top flow features such as jetting, mountain waves, and reversed flows which have a large impact at the measurement location. Hence, local atmospheric conditions are some of the principal variables which improve predictability of the exogenous error. The relative importance (and even the order of importance) of these variables is expected to be highly site-specific.
The six most important features for the hourly and three-hour predictions are identical (although ordered differently), and were therefore used to test discriminate feature set combinations. All tests with multiple input features contained U, θ, and t.
There are two reasons for prioritizing these three variables: they prove to be some of the most important input features for all timescales (Fig. 6) and they can all be captured by a simple cup anemometer and wind vane rather than a more expensive sonic anemometer. These three features, when used in conjunction, were able to capture about 66% of the maximum error reduction seen for all timescales. Discriminate input sets incorporating only U, θ, t, TI, w′T′, and T are able to capture over 90% of the exogenous error caught by the tests incorporating all input features, indicating that almost all of the relevant information in our inputs can be retrieved from these six variables. Notably, many of the most important input features (U, θ, t, and W) are directly measurable and need not be extracted (although W cannot be captured by a cup anemometer).
The most important variables that require extraction (i.e. values that are not direct measurements), TI, TKE, and w′T′, all contain small-scale (fluctuating) forcing components, indicating that small-scale processes may be more easily captured by ML models after domain-specific interpretation. These small-scale variables provide significant predictive power, even at a multi-hour timescale. The testing results from the study show that, in order to achieve an optimal forecast of exogenous error, these small scales must be included as an input for the predictive model.
Tests combining multiple atmospheric variables are particularly useful because they incorporate the second term on the right side of Eqn. 1, an indication of how the exogenous error changes depending on the input features' co-variance. This is especially true for the testing case incorporating all input features. As expected, this case provided the best predictions of exogenous error.
The correlation between the predicted and true exogenous error (Fig. 5) dramatically increases with increasing timescales, with the best three-hour random forest prediction capturing 80% of exogenous error variability. As Fig. 4 shows, the best ARIMA-RF error is roughly 0.5 m s⁻¹ for all timescales, even though both the persistence and bias-corrected ARIMA models get worse as timescales increase. This is an encouraging result, in that meteorological forecasting models need not necessarily get worse with time (although the averaging timescales likely must increase proportionately). Exogenous error prediction gets far better with increasing timescales, with the best random forest prediction reducing forecasting RMSE by over 50%. There appears to be a floor (0.5 m s⁻¹) on the predictability of exogenous error, indicating that there may be certain atmospheric information missing from the set of input features. This information could come from other external forces or could be a result of forcing at scales that have not been captured by our current input feature set.

Conclusions
Exogenous error arises from atmospheric forcing that is ignored or misrepresented in the modeling process. It has been shown that this error, or a portion thereof, can be predicted by an ML model given relevant atmospheric information. θ and T were found to be particularly beneficial as individual inputs, while the combination of U, θ, and t, features which may be derived from a simple cup anemometer and wind vane, was able to provide a majority of the maximum error reduction seen at every timescale. Domain-specific feature extraction was found to be particularly useful for input features relating to small-scale forcing, and these turbulence variables were found to have significant predictive power even for multi-hour forecasts. The lowest RMSE value was relatively constant at all prediction timescales, indicating that there is additional relevant atmospheric information that this list of inputs does not capture. The results are promising, however, in that they illustrate that forecasting accuracy need not decrease at large timescales. In fact, at large timescales turbulent fluctuations average out, allowing mesoscale and synoptic forces to provide a clearer signal for exogenous error prediction.
While the exact results of this investigation are site-specific, many of the findings are expected to be generally applicable to numerous wind projects, especially those located in complex terrain. Accurate implementation of atmospheric forcing information, particularly that which is non-linear or derived via coupling of multiple forces, is crucial for the prediction of exogenous error and must be addressed to obtain optimal forecasting results. This study supports the supposition that a hybrid model using ML techniques to correct a simpler statistical predictor (such as an ARIMA model) can be effective for wind speed forecasting.
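The hybrid structure described above can be sketched end-to-end on synthetic data. This is a minimal illustration only: the autoregressive step is a simple AR(1) least-squares fit standing in for the ARIMA model, and a k-nearest-neighbour regressor stands in for the random forest; the synthetic forcing variable x is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series: wind speed with persistence plus a nonlinear
# exogenous effect (x is a stand-in for e.g. wind direction)
n = 500
x = rng.uniform(0.0, 2.0 * np.pi, n)
u = np.empty(n)
u[0] = 8.0
for t in range(1, n):
    u[t] = 0.9 * u[t - 1] + np.sin(x[t - 1]) + rng.normal(0.0, 0.1) + 0.8

# Step 1: autoregressive forecast U_hat_t = a*U_{t-1} + b (ARIMA stand-in)
A = np.column_stack([u[:-1], np.ones(n - 1)])
a, b = np.linalg.lstsq(A, u[1:], rcond=None)[0]
u_hat = a * u[:-1] + b
eps = u[1:] - u_hat  # exogenous error left behind by the AR model

# Step 2: predict eps from the exogenous variable (random-forest stand-in)
def knn_predict(x_train, y_train, x_query, k=15):
    out = np.empty(len(x_query))
    for i, xq in enumerate(x_query):
        idx = np.argsort(np.abs(x_train - xq))[:k]
        out[i] = y_train[idx].mean()
    return out

split = (n - 1) // 2
eps_hat = knn_predict(x[:split], eps[:split], x[split:n - 1])

# Hybrid forecast = autoregressive prediction + predicted exogenous error
hybrid = u_hat[split:] + eps_hat
rmse_ar = float(np.sqrt(np.mean(eps[split:] ** 2)))
rmse_hybrid = float(np.sqrt(np.mean((u[1:][split:] - hybrid) ** 2)))
```

Because the exogenous forcing is deterministic given x, the error-correction step recovers most of what the purely autoregressive forecast misses, mirroring the paper's ARIMA-RF improvement over the bias-corrected ARIMA model.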
Further improvements are still required to more accurately represent atmospheric forcing. Gridded meso- or synoptic-scale information would allow the model to predict transitional periods including weather fronts and drastic wind ramp events. Multiple scales of forcing should also be incorporated to improve the pattern recognition capabilities of ML techniques. Additional information about microscale, mesoscale, and synoptic events would better depict atmospheric forcing and momentum, and the effects of seasonality must be accounted for when possible. It is also worth exploring the model's capabilities when the dataset is not randomly shuffled (i.e. whether a model trained on past years' data can accurately predict exogenous error over an entire year). Hopefully, this study will be used as a forerunner for the improved incorporation of atmospheric physics within ML modeling.
Code and data availability. Data from the Perdigão campaign may be found at https://perdigao.fe.up.pt/. Due to the multiplicity of cases analyzed in this study, example processing and modeling codes can be found at https://github.com/dvassall/.

Appendix A: Input Features
Atmospheric variables were measured using sonic anemometers and temperature sensors along a single 100 m tower. When possible, missing data from the 100 m sensors were filled via correlation with the 20 m sensors using the variance ratio measure-correlate-predict method (Rogers et al., 2005). There were no periods with functional 100 m sensors and nonfunctional 20 m sensors. All periods without any measurements from both sets of sensors (15 5-minute periods) were filled using linear interpolation.

Friction velocity is defined as

u_* = \left( \overline{u'w'}^2 + \overline{v'w'}^2 \right)^{1/4}

and was measured at 20 m AGL, just above canopy height. Turbulence kinetic energy is defined as

TKE = \frac{1}{2} \left( \overline{u'^2} + \overline{v'^2} + \overline{w'^2} \right)

and was measured at both heights. Buoyancy frequency squared is typically defined as (see Kaimal and Finnigan (1994) for details of all parameters that appear below)

N^2 = -\frac{g}{\rho_0} \frac{\partial \rho}{\partial z} \approx \frac{g}{T_{pv,0}} \frac{\partial T_{pv}}{\partial z},

where g is the gravitational acceleration, ρ the air density, T_{pv} the virtual potential temperature, z the vertical coordinate (height AGL), and subscript 0 indicates reference variables used in the Boussinesq approximation. The gradient Richardson number is defined as

Ri_g = \frac{N^2}{(\partial u / \partial z)^2 + (\partial v / \partial z)^2},

where u and v are the two horizontal wind speed components. The flux Richardson number is defined as

Ri_f = \frac{(g / T_v)\, \overline{w'T_v'}}{\overline{u'w'}\, (\partial u / \partial z) + \overline{v'w'}\, (\partial v / \partial z)},

where T_v is the virtual temperature while \overline{u'w'} and \overline{v'w'} are the Reynolds stresses that indicate the flow's vertical momentum flux. Ri_f is typically used in conjunction with a stably stratified atmosphere. However, it is used here in the general sense, as a measure of the ratio between buoyant energy production and mechanical energy production (associated with inertial forces) related to Fig. 1. Negative N² values, corresponding to convective atmospheric conditions, are set to 0. Ri_g and Ri_f are limited to a maximum of 5 and to minimum values of 0 and −5, respectively, to remove extremes in both variables. Turbulence intensity is the ratio of the fluctuating to the mean wind speed, TI = σ_u / U.
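As a concrete illustration, the definitions above (including the clipping applied to N², Ri_g, and Ri_f) can be computed from time-averaged sonic-anemometer statistics. This is a sketch under stated assumptions only: the function names and the two-level finite-difference inputs are illustrative, not the study's processing code:

```python
import numpy as np

G = 9.81  # gravitational acceleration [m s^-2]

def friction_velocity(uw, vw):
    """u* = (uw'^2 + vw'^2)^(1/4) from kinematic momentum fluxes."""
    return (uw ** 2 + vw ** 2) ** 0.25

def tke(uu, vv, ww):
    """Turbulence kinetic energy, 0.5*(u'u' + v'v' + w'w')."""
    return 0.5 * (uu + vv + ww)

def n_squared(tpv_low, tpv_high, dz, tpv_ref):
    """Buoyancy frequency squared from a two-level virtual potential
    temperature difference; negative (convective) values set to 0."""
    n2 = (G / tpv_ref) * (tpv_high - tpv_low) / dz
    return max(n2, 0.0)

def richardson_gradient(n2, dudz, dvdz):
    """Gradient Richardson number, limited to [0, 5] as in the appendix."""
    shear2 = dudz ** 2 + dvdz ** 2
    return float(np.clip(n2 / shear2, 0.0, 5.0))

def richardson_flux(wtv, uw, vw, dudz, dvdz, tv):
    """Flux Richardson number, limited to [-5, 5] as in the appendix."""
    rif = (G / tv) * wtv / (uw * dudz + vw * dvdz)
    return float(np.clip(rif, -5.0, 5.0))

def turbulence_intensity(sigma_u, u_mean):
    """TI = sigma_u / U, fluctuating over mean wind speed."""
    return sigma_u / u_mean
```

The clipping bounds mirror those stated in the text; in practice such limits prevent near-zero shear denominators from producing unbounded Richardson numbers that would dominate a tree-based model's splits.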
Both hour of the day and wind direction were broken into two oscillating (sine and cosine) components in order to eliminate any temporal or directional discontinuity.

Table B1 presents the RMSE and R² obtained by the bias-corrected ARIMA model (total exogenous error) and those obtained by the ARIMA-RF using individual features. Features are separated into inertial, stability, and turbulence inputs as described in Section 4. Table B2 presents the RMSE and R² values obtained by the persistence and bias-corrected ARIMA models alongside those obtained by the ARIMA-RF while utilizing input feature combinations of interest. The final row in Table B2 shows the results of the ARIMA-RF when all input features are utilized.

Table B1. The top row shows the RMSE obtained by the bias-corrected ARIMA model. Below are the resulting RMSE and R² (between true and predicted exogenous error) values from ARIMA-RF predictions utilizing individual inputs for all forecasting timescales. Input features are separated into inertial, stability, and turbulence variables, as described in Section 4.
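The decomposition of cyclical quantities into two oscillating components can be sketched as follows (the function name is an illustrative assumption; the standard sine/cosine encoding is what the text describes):

```python
import numpy as np

def cyclical_encode(value, period):
    """Encode a cyclical quantity (hour of day, wind direction) as a
    sine/cosine pair so that values near the wrap point (e.g. 23:00 and
    01:00, or 359 deg and 1 deg) map to nearby feature vectors."""
    angle = 2.0 * np.pi * np.asarray(value, float) / period
    return np.sin(angle), np.cos(angle)

# Hour of day (period 24 h) and wind direction in degrees (period 360)
hour_sin, hour_cos = cyclical_encode(np.array([0.0, 6.0, 12.0, 23.0]), 24.0)
dir_sin, dir_cos = cyclical_encode(np.array([0.0, 90.0, 359.0]), 360.0)
```

Without this encoding, a tree or regression model would treat 359° and 1° as maximally distant, even though they are nearly identical wind directions.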

Author contributions. Daniel Vassallo prepared the manuscript with the help of all co-authors. Data processing was performed by Daniel Vassallo, with technical assistance from Raghavendra Krishnamurthy. All authors contributed equally to the manuscript review process.

Table B2. RMSE obtained by the persistence and bias-corrected ARIMA models (exogenous error is defined as the bias-corrected ARIMA error), as well as the RMSE obtained by the ARIMA-RF when utilizing select input feature combinations. R² values between true and predicted exogenous error are also reported for each test case. The final row shows the final test, which uses all input features.