Evaluating the mesoscale spatio-temporal variability in simulated wind speed time series over northern Europe

. As wind energy increases its share of total electricity generation and its integration into the power system becomes more challenging, accurately representing the spatio-temporal variability in wind data becomes crucial. Wind ﬂuctuations impact power and energy systems, e.g. energy system planning, vulnerability to storm shutdowns, and available voltage stability support. To analyse such ﬂuctuations and their spatio-temporal dependencies, time series of wind speeds at an hourly or higher frequency are needed. We provide a comprehensive evaluation of the global and mesoscale-model-derived wind time series against observations by using a set of metrics that we present as requirements for wind energy integration studies. We also perform a sensitivity analysis to ﬁnd the best model setup of the Weather Research and Forecasting (WRF) model, focusing on evaluating the wind speed ﬂuctuation metrics. The results show that using higher spatial resolution in the WRF model simulations improves the representation of temporal ﬂuctuations; however, higher-spatial-resolution simulations often lower the correlations of wind time series with measurements. Thus, we recommend ﬁner-spatial-resolution simulations for modelling power ramp or voltage stability studies but ERA5 rather than mesoscale simulations for studies where correlations with measurements are essential. We also show that the nesting strategy is an important consideration, and a smoother transition from the forcing data to the nested domains improves the correlations with measurements. All mesoscale model simulations


Introduction
Many wind energy applications use meteorological data derived from atmospheric models; for example, in the production of wind resource atlases (Tammelin et al., 2013;Dörenkämper et al., 2020;Solbrekke et al., 2021), extreme wind atlases (Larsén et al., 2012), and wind turbine icing in cold climates (Hämäläinen and Niemelä, 2017). In general, most studies use time series of wind speed, but different applications require distinct qualities in meteorological data, and, therefore, different evaluations are necessary. For a wind resource atlas, for instance, accurate wind speed distributions are necessary (Dörenkämper et al., 2020;Knoop et al., 2020); for wind power forecasting, accurate timing is vital (Das et al., 2017;Olson et al., 2019). Energy and power system modelling, including optimal energy system planning (e.g. Gea-Bermúdez et al., 2020;Brown et al., 2018;Malvaldi et al., 2017), the study of power system ramp rates and vulnerability to storm shutdowns (e.g. , and available voltage stability support (e.g. Souxes et al., 2019), requires an accurate representation of temporal dependencies and spatial correlations, as well as accurate wind speed distributions. Time synchronization with measurements (i.e. the high correlation between measured and simulated data) is important when the wind time series are used in conjunction with other data, e.g. electricity load time series (Gea-Bermúdez et al., 2020).
Several studies have validated meteorological datasets for modelling weather-dependent wind power generation G. Luzia et al.: Spatio-temporal variability in wind time series and its highly fluctuating behaviour. These works use data provided by global atmospheric reanalysis (e.g., Cannon et al., 2015;González-Aparicio et al., 2017;Gruber et al., 2022), mesoscale numerical weather prediction (NWP) models (Murcia Leon et al., 2021;Koivisto et al., 2021), or both (Jourdier, 2020;Murcia et al., 2022). Because mesoscale NWP models cannot represent the effects of the most detailed microscale processes, extra information, such as the effect of the terrain in the wind speed distribution, can be added by combining (i.e., adjusting) mesoscale with microscale data (e.g., Staffell and Pfenninger, 2016;Ruiz et al., 2019;Murcia et al., 2022). Due to their relatively low temporal resolution (usually available from 30 min to 1 h resolution) and intrinsic numerical smoothing, data from mesoscale models cannot include minute-to second-scale variability, and even hourly variability may be too low compared to measurements . Therefore, it may be necessary to combine synthetic data through statistical methods (e.g. Hawker et al., 2016;Larsén et al., 2012; to represent wind fluctuations at shorter timescales. Validation studies of time series from existent highspatial-resolution datasets (of the order of a few kilometres) produced by mesoscale NWP models can be found in the literature for wind power integration studies. Jourdier (2020) compared datasets from several sources, including large-scale and regional downscaled reanalyses, with mesoscale data such as the NWP model AROME and the New European Wind Atlas (NEWA) (Hahmann et al., 2020b;Dörenkämper et al., 2020) in simulating mean wind speed, power production, and its temporal correlations over France. Murcia et al. (2022) performed a large-scale validation study, comparing the ERA5 reanalysis (Hersbach et al., 2020) and two datasets based on the Weather Research and Forecasting (WRF), NEWA, and an ERA-Interim-based (Dee et al., 2011) European-level atmospheric reanalysis (Nuño et al., 2018) with and without the addition of microscale details provided by the Global Wind Atlas (Badger et al., 2015). Regarding wind speed, validations were done using wind measurements over northern Europe on various time-series metrics, such as errors in autocorrelations, spatial correlations, and wind speed distribution. Both studies found that ERA5 is skilled but presents deficiencies in simulating wind speed in areas with more complex terrain if not corrected. Jourdier (2020) found that the higher-resolution regional models (i.e. AROME and COSMO-REA6) also show very good skills and reduced bias, especially in complex topography. Murcia et al. (2022) found that NEWA is significantly better than ERA5 in representing the temporal properties of the wind speed time series at individual locations or wind power plants.
Fewer articles present model development focused on time series for wind integration studies. Draxl and Clifton (2015) discussed that many efforts have been made to model the wind distributions for wind resource assessments with NWP models. Creating datasets for wind integration studies for modelling wind power variability, on the other hand, is a not straightforward step from the NWP outputs. Some studies utilize the mesoscale model to generate time series and validate it for an individual application. For example, Nuño et al. (2018) developed hourly time series of European transcontinental wind and solar photovoltaic generation using the WRF model to dynamically downscale a global reanalysis and analysed regionally aggregated power variability on different timescales, which is relevant for system planning, market studies, and others. Mehrens et al. (2016) assessed the WRF model's ability to simulate coherence, spatial correlation, and power spectra for a large range of distances over the North and Baltic seas. Focusing on local wind power generation variability, Koivisto et al. (2021) validated the WRF model time series aggregated for several European countries. The results of this article show that combining mesoscale and microscale data and the addition of missing power plant technical parameters through machine learning improve the representation of the annual capacity factors and hourly generation distributions for most countries. However, no significant differences are shown for the autocorrelations and spatial correlations. Draxl and Clifton (2015) generated a subhourly high-spatial-resolution dataset target for application in wind integration studies over the United States, but besides the high temporal resolution, the work focused on the validations only with daily and seasonal variability, in addition to wind speed distribution. By focusing on several essential aspects of time series, this paper adds to the development of time-series modelling of wind speeds for wind integration studies and evaluation techniques. The model comparison results allow users to select the most appropriate modelling and datasets for different applications.
This work focuses on modelling wind speed time series suitable for power and energy system applications. It adds to the literature by (1) investigating the impact of the interaction between the mesoscale model and its forcing data on the quality of the resultant time series and (2) providing a comprehensive evaluation of the different datasets, with a focus on how well they can represent temporal and spatial correlations over northern Europe. We perform a sensitivity study of the WRF model in multiple configurations, varying the influence of the forcing global reanalysis in the simulations to understand its role and distinguish the model configuration that outperforms various time series aspects. The results are also compared to ERA5 and NEWA mesoscale data. We hypothesize that these modelling aspects, defined by the nesting choice, size, and position of the domains, impact the accuracy of the time series more than the horizontal resolution of the model simulations.
This paper is structured as follows: Sect. 2 describes the simulations, the measured data, and the methodology used to generate and validate the time series; Sect. 3 presents the results of the time series comparisons; Sect. 4 presents the discussion and experiments ranking and Sect. 5, the conclusions and perspectives.

Simulated data
The simulations used in this work were produced by the WRF mesoscale model in two different versions, using the configuration for the model physics and model dynamics as in the New European Wind Atlas (NEWA) (Dörenkämper et al., 2020). The analyses are presented for the WRF model version 4.2.1 (Skamarock et al., 2019), which was the latest available version when running the experiments. However, the WRF model version 3.8 (Skamarock et al., 2008), used for the NEWA production, was utilized as a control simulation for all experiments to check for possible impacts on the results due to the model modifications. The model updates could lead to changes in the simulated wind due to changes in parameterizations that can affect the results. The configuration of both versions was checked to ensure some degree of consistency. Still, tracking the exact combination of modifications in the model version and its possible effects can be difficult.
All simulations use nudging to the forcing reanalysis in the outer domain with time tendencies computed from reanalysis data at 6 h, as outlined in Hahmann et al. (2015). The WRF model has been broadly validated and used for wind resource assessment and sensitivity studies for northern Europe; such as Hahmann et al. (2015Hahmann et al. ( , 2020b and Li et al. (2021) have shown that generally, the simulated wind profile matches the observations despite systematic biases and that the WRF model downscaling adds value to the reanalyses for wind energy applications.
The purpose of the nesting experiments (hereafter named "10 km", "6 km", "5 km", and "3.3 km") is to vary the influence of the forcing data in the innermost domain by changing the grid arrangement of the simulation or to verify the impact of the grid spacing for similar arrangements. The WRF model domains used in the simulations are presented in Fig. 1 using two nesting approaches: (1) using one nested domain (blue) and (2) using two nested domains (orange). The simulations were configured to have the innermost domains covering the same geographical area. The parent domain (d1) has the same dimensions and position for both arrangements. However, for nested domains, the number of grid points must be proportional to the parent grid ratio (Skamarock et al., 2008), which means that even for similar arrangements (i.e. single nest with ratio 1/3 or 1/5), the innermost domains have slightly (a few km) different geographical extension. These differences were neglected for the comparisons, as shown in Fig. 1. The horizontal resolution of the innermost domain is a result of the nesting ratio and the resolution jump used, as can be seen in Table 1. We included one additional experiment to test the impact of different forcing data. The forcing data provide initial and boundary conditions for the simulations. All experiments were forced by the ERA5 reanalysis (Hersbach et al., 2020) as in the NEWA configu- ration, except experiment "10 km_erai", which was forced by the ERA-Interim reanalysis (Dee et al., 2011). All experiments produced with WRF version 3.8 (as in the NEWA production) for checking the consistency among the results were analysed, but only that analogous to the 10 km experiment from version 4.2.1 (named "10 km_v3") is presented here. This arrangement makes all experiments comparable to the 10 km experiment with only one aspect modified, in the nesting arrangement, resolution jump, forcing data, or model version. The multiple experiments were limited to 1 year of simulations due to computational limitations, and the year 2009 was selected as a function of the observations availability (see Sect. 2.2). We analysed the ERA5 wind speed anomaly at 100 m in 2009 related to the 30-year climatology (see Fig. A1) to ensure that the selected year does not affect the results by being an atypical year in terms of wind speed over the region studied. Compared to the long-term average, 2009 was slightly more windy (around 0-0.25 m s −1 ) in northern Europe, which is within 1 standard deviation (around 0.2-0.4 m s −1 ) for all mast locations.
Time series from NEWA and ERA5 reanalyses were included in the comparisons. All time series (from the simulations and the existent reanalyses) were extracted using a horizontal linear interpolation and a logarithmic vertical interpolation for each measurement location and its respective height. The WRF model simulations and ERA5 and NEWA reanalyses use the two closest levels for the vertical logarithmic interpolation. The NEWA dataset and the WRF model simulations have outputs in several fixed levels ranging from 25 to 250 m. From the ERA5 dataset, two fixed height levels (10 and 100 m) were used, and extrapolation is assumed for sites taller than 100 m. No assumption about the atmospheric stability condition is used, and the same process is applied at every time step throughout the entire year.

Measured data
Data from 14 met masts over northern Europe ( Fig. 1) were processed and filtered using an adapted version of the quality control routine described in Ramon et al. (2020). The filter eliminates suspicious data or sequences of data, including implausible values or extreme variations, freezing, or stuck instrument readings. When necessary, a rough attempt was made to minimize the effect of flow distortion caused by the mast on the wind speed measurements. When wind speed measurements are available from only one boom at one height on the mast, winds originating at ±10 • of the boom direction are filtered. At other sites, where wind speeds at one height are measured with more than one boom direction, the wind speed measurements are combined according to the wind direction to minimize flow distortion. Wind speed time series measured at heights ranging from 30 to 140 m and originally at 10 min averaged resolution were aggregated to hourly resolution. To ensure that all measured data have been reported at UTC times (information not always provided by the data source), an inspection of the cross-correlations between measured data and reanalysis (Fig. 2f) has been done for checking suspicious shifted lags. Missing or invalid data identified in the measured time series were marked NaN (not a number) also in the simulated data. The completeness of the series is shown in

Evaluation metrics
Different qualities in wind speed time series are required for applications in power and energy system studies, as described in Sect. 1. We analyse five aspects of data quality using error metrics defined in Murcia et al. (2022). The comparisons against observations include the six WRF model experiments, NEWA, and ERA5 datasets. The time series evaluation metrics include the following: 1. Correlation to measurements. Given an observed time series, X(t, i), and a simulated time series, Y (t, i), at time t and location i, the Pearson correlation coefficient ρ(X(t, i), Y (t, i)) is calculated for each simulation with respect to the measured data. This metric is important when the wind speed data (or the resulting generation data) has to be correctly correlated with other time-stamped datasets (e.g. electricity price).
2. Error in the autocorrelation function. The autocorrelation functions (ACF), ρ(X(t, i), X(t − t, i)), and ρ(Y (t, i), Y (t − t, i)), at location i, are calculated for the observed and simulated time series, respectively. The error in the metric is computed as ACF lag= t X minus ACF lag= t Y . The ACF metric represents how well the simulated data can represent temporal variability seen in the measured data.
3. Error in wind speed distribution. This metric quantifies the difference between simulated and observed wind speed distribution. It is computed using the Earth mover's distance (EMD) technique, introduced in Hahmann et al. (2020b). The EMD is defined by the area between two cumulative density functions (CDFs) and, therefore, is always positive. An accurate wind speed distribution is essential for estimating a site's potential annual energy production, with a good fit also at the higher percentiles important for understanding storm shutdown risks. It is acknowledged that because of the shapes of power curves, certain parts of the wind speed distributions matter more than others when considering the wind generation output. However, in this study, the  focus was kept on modelling wind speeds, so all wind speed ranges are considered equally in EMD.

4.
Error in the standard deviation of the first difference.
The standard deviation of the first-difference time series σ (X(t, i)−X(t − t, i)) and σ (Y (t, i)−Y (t − t, i)) at location i is calculated for the observed and simulated time series, respectively. The error is computed as the difference between the simulated and observed standard deviations. This metric describes how well the simulated data can represent the 1 h ramps seen in the measured data.

5.
Error in the spatial correlation among measurements. The correlations ρ(X(t, i), X(t, j )) and ρ(Y (t, i), Y (t, j )) are calculated for the observed and simulated time series for all pairs of sites i, j . The metric is computed by fitting equation ρ ij = exp(−d ij /L), with d ij and L in kilometres, to the correlations and distances (d) between the locations for both the measured and the simulated data and taking the ratio of the characteristic length scales, L Y /L X . The smaller the length scale, the faster the correlations decay to zero as distance increases. Modelling spatial correlations well is relevant for system integration studies, as the probability of wind speeds being low or high or ramping up or down simultaneously in multiple locations impacts the aggregate wind generation variability in the system. Figure 2 illustrates all the metrics used in this work but includes only three datasets for simplicity. Figure 2a shows an example of a time series for Børglum, a short mast at a coastal location in Denmark, with observed (OBS, in black), ERA5 (orange), and NEWA (blue) datasets. Figure 2b shows the wind speed distribution for the three datasets. The CDFs for ERA5 and NEWA, in which the EMD metric is computed with respect to OBS, are illustrated in Fig. 2c. Figure 2d shows the ACF for the first 24 h, although only the ACF at lag = 1 h is used in the comparisons. Figure 2e illustrates the correlations between simulated and OBS. Figure 2f shows the cross-correlation function of simulated data with OBS, which was used for checking the correctness of the timestamps in every observed time series. Lastly, Fig. 2g illustrates the comparison between parameter L computed from the spatial correlations over a distance of all 14 pairs of locations. Each metric described is computed for every location and experiment, as well as the median over different site types (onshore, coastal, and offshore) and the median of all locations for each metric. The final rank among all datasets for each metric is based on the medians of all sites.

Results
This section presents the results for each metric described in Sect. 2.3. The tables contain the results for each dataset (rows) and every site (columns). Columns 1 to 4 are inland, 5 to 9 are inland but close to the sea (hereafter named "coastal"), and 10 to 14 are offshore sites. The four last columns present the median over sites "onshore", "coastal", "offshore", and "all" 14 sites, respectively. The colour palette represents the best results in dark purple and the worst results in brown. The rows are sorted by the column median "all", with the most accurate results on the top and the least accurate results on the bottom of each table.

Correlation to measurements
In Fig. 3, time series from ERA5 reanalysis have higher correlation with measurements (median all = 0.91), followed by the 5 km experiment (0.89). The 3.3 km experiment is the least correlated among all datasets (0.80). The type of location impacts the correlations. Sites offshore have higher correlations than coastal sites, which have higher correlations than onshore sites. The worst correlations for all simulations compared to observations are Sorø and Lindenberg, both met masts (43 and 98 m tall, respectively) located in forested sites. This reveals the difficulties of mesoscale models in simulating the effects of the forest on the flow dynamics, e.g. due to oversimplification and an unappropriated representation of roughness length, as discussed in Dellwik et al. (2014).

Autocorrelation function
The ACF results (Fig. 4) show a clear spatial-resolution impact. NEWA (3 km horizontal grid spacing) and the 3.3 km experiment present the smaller errors in ACF (∼ 0.031), while experiments with 10 km grid spacing (10 km_erai, 10 km, and 10 km_v3) and ERA5 (∼ 30 km) present the larger errors (0.047, 0.047, 0.049, and 0.051, respectively). As in the previous metrics, all simulations contain larger errors over forested sites (Sorø and Lindenberg). The ACF is simulated more accurately in offshore sites, followed by coastal and onshore sites.

Standard deviation of first difference
As in the ACF analysis, the standard deviation (SD) of the first difference (Fig. 5) is impacted by the grid spacing of the simulations. The 3.3 km experiment and NEWA simulations show smaller errors in this metric (−0.30 for both simulations), while experiments with 10 km grid spacing (10 km_erai, 10 km, and 10 km_v3) and ERA5 time series present the larger errors (−0.54, −0.55, −0.58, and −0.66, respectively). All simulations underestimate the metric (negative values). Unlike all previous metrics, the SD of the first difference is more accurately represented over inland sites (especially for Sorø). The reason for larger errors in coastal and offshore sites can be due to the difficulties of mesoscale models in simulating turbulence over and close to the sea (Floors et al., 2018). On the other hand, the displacement height applied in the simulated time series over Sorø can mask the 1 h step changing errors. Originally, the time series interpolated at Sorø height (43 m) overestimates the wind speed above the canopy of the trees at this forested site (Dellwik et al., 2014) and, therefore, produces errors in other metrics, such as correlations with measurements and wind speed distribution. We lowered the level of the interpolated time series using the fixed displacement height of −20.5 m without taking into account that displacement height depends, among other things, on the wind speed (Dellwik et al., 2006). It is  possible that the simulated time series exaggerates the turbulence at the displaced height and alleviates the underestimation in the SD of first difference metrics only for Sorø. For comparison, Lindenberg is a forest site, but we have not applied displacement height, and the errors are consistent with other inland sites. Even when the Sorø results are disregarded from the median onshore, the onshore sites are more accurately simulated with respect to the standard deviation of the first difference than coastal and offshore sites.

Wind speed distribution
The analysis of the wind speed distribution (Fig. 6) shows more homogeneous results over all experiments. The 3.3 km simulation has slightly smaller EMD (0.27 m s −1 ), while its more equivalent simulation, NEWA, presents an intermediate result (0.29 m s −1 ). Larger EMDs are found for inland sites (especially Sorø and Tystofte); however, the coastal sites DockingShoal and Lillgrund also present large EMD values. There is no clear sequence for the quality of wind speed distribution concerning the type of location, although onshore   Table 1 and reanalyses. The darkest purple colours are the best results; the darkest brown is the worst. The rows are sorted by the column median all, with the most accurate results on the top and the least accurate results on the bottom of the table.
sites have larger EMD values in all datasets. All datasets underestimate (overestimate) the low (high) wind speed values for inland observations at lower heights (less than 50 m tall). Figure 7 highlights this issue.

Spatial correlations
The metric to assess the spatial correlations (Fig. 8) is computed as described in Sect. 2.3. All simulations overestimate the correlations for most points, leading to larger parameters L, which agrees with previous studies (e.g. Murcia et al., 2022;Mehrens et al., 2016). The observed value of L is 410 km, while the ones derived from the simulations vary between 496 km in the NEWA simulations and 541 km in the ERA5 time series. Mehrens et al. (2016) discuss the WRF models inability to resolve wind variability sufficiently at higher frequencies due to the numerical smoothing, resulting in exaggerated correlations. Except for the NEWA time series, all simulations produce similar results despite horizontal grid spacing. The coefficient of determination r-squared (r 2 ) for all fitted curves and the standard error (e) of the estimated parameters L are also shown in Fig. 8.  Figure 7. Probability density function (PDF) and Earth mover's distance (EMD) between simulations and measurements (OBS) for the various experiments in Table 1 and reanalyses and two inland sites measured at lower heights.  (Table 2) and the fitted curves for the measurements (black) and the various model simulations in Table 1 and reanalyses. The estimated de-correlation length L is also shown for each simulation.

Impacts from model setup
The boxplots in Fig. 9 represent the ranking among all simulated datasets in the first four presented metrics. The plots show the median (50th percentile), the first quartile (25th percentile), the third quartile (75th percentile), and the maximum and minimum values, as well the outliers. The boxplots are ordered from best (lowest median, left) to worst (highest median, right) in all metrics.
The ranking for correlation to measurement (Fig. 9a) does not indicate a clear impact from spatial resolution. However, ERA5 (coarser resolution) presents a higher correlation with measurements, which can be in part due to spatial smoothness. Also, ERA5 has the advantage of data assimilation, which periodically adjusts and approximates the simulation to the observations. Regarding the nesting arrangement and the way correlations are passed into inner domains, the "5 km" experiment was more similar to the observed time series than its comparable "6 km" experiment. The 5 km experiment uses a nesting ratio of 1/3 (domain 1/domain 2), but setting domain 1 using 15 km and the 6 km experiment uses a ratio of 1/5 and the resolution of domain 1 equal to 30 km (close to the reanalysis resolution). The 10 km experiment (also ratio 1/3 but the resolution of domain 1 set to 30 km) correlates technically the same as the 6 km experiment. As the WRF model freely develops the flow in the inner domains, it loses the correlation to the reanalysis. Double nesting amplifies this effect, such as the 3.3 km experiment ratio 1/3/3). Using a smoother resolution jump, such as 15 to 5 km instead of 30 to 6 km, could be an advantage in keeping the correlations in the inner domain consistent with the driving reanalysis. Further tests are needed to confirm this behaviour. Nevertheless, the ratio 1/3 -15 to 5 km is double as computationally expensive as the 1/5 -30 to 6 km or the 1/3 -30 to 10 km. The comparison to the NEWA time series (1/3/3, 27 to 9 to 3 km) and the 3.3 km experiment (1/3/3, 30 to 10 to 3.3 km) supports this hypothesis. However, the NEWA simulations are not directly comparable to the 3.3 km simulations because NEWA uses different choices of domains size and placement (e.g. domain 1 in NEWA is much larger than in 3.3 km, and domain 3 is longitudinally longer in NEWA, while in the 3.3 km experiment it is larger in latitude).
The boxplot of error in autocorrelation function (Fig. 9b) shows a clear impact from spatial grid spacing. The NEWA and "3.3 km" simulations present smaller errors than the measured time series. The 10 km grid spacing experiments and the ERA5 time series show the most significant errors. Coarser-resolution simulations exaggerate the correlations due to the inherent spatial smoothness of the atmospheric models, which can be seen in the results for all simulations (Fig. 4). The same interpretation can be made from the boxplot for the standard deviation of the first difference (Fig. 9c), although in this metric experiments with very similar results, such as NEWA and the 3.3 km experiment, and the 10 km experiment and 10 km_erai, have inverted its ranking positions.
The EMD boxplot (Fig. 9d) has the least conclusive ranking order among the metrics. There is no apparent influence from the spatial resolution in the wind speed distribution since the ranking alternates finer-and coarser-resolution experiments. Both the 5 km and 6 km present intermediate results and nearly identical values. From these results, there is also no significant indication of an impact on the simulated wind speed distribution, neither positive nor negative, from the complexity of nesting (e.g. single versus two nested domains), the choice of nesting ratio, or the resolution jump.
The fitted spatial correlations (Fig. 8) show a clear distinction between NEWA and the rest of the simulations. The NEWA time series presents the smaller value of parameter L and the closest to the parameter determined for the measured points. A ranking of the simulations can be seen in Fig. 10, showing the NEWA simulations with the smallest simulated L y over observed L x , followed by all the Figure 10. Ratio of simulated and observed characteristic length scales L Y /L X , both parameters fitted from curves in Fig. 8, and the standard error of the estimated parameters (bars). Winter refers to months January-March; summer refers to July-September. L > 1 means that simulations overestimate the parameter L.
other simulations with close overestimated results. As for the EMD, the spatial correlations explain the ranking order, on neither the nesting choice nor the resultant spatial resolution. Part of the spread among the results from a single experiment comes from using various measurement heights. The results found for this metric agree with Murcia et al. (2022) that all simulations overestimate the spatial correlations and that the NEWA time series modelled this aspect of the time series more accurately than the ERA5 dataset.
To check the consistency of the results in different periods, we recalculated all the metrics for the winter (January-March) and summer (July-September) months. The results (not shown) keep a similar ranking order to the annual time series for all metrics except for the EMD and spatial correlations. In any seasonal period considered, the EMD values range from approximately 0.3-0.4 m s −1 , but the ranking of the simulations is different (not shown). For the spatial correlations, Fig. 10 shows a different order for each considered period. Almost all simulations shown have higher correlations during winter months than summer months. For all simulations and the observed time series, the de-correlation length L is larger during winter than during summer (not shown). This could be explained by the larger spatial scale of winter versus summer atmospheric processes and their variability. Nevertheless, this result contradicts Solbrekke et al. (2020), although that study only includes correlations versus distances over the northern North Sea and the Norwegian Sea and a limited number of measurement sites. All five WRF model experiments were repeated using the WRF model version 3.8, although only the 10 km_v3 one was included in the plots for comparison with the 10 km (WRF version 4.2.1). The rank is unchanged from that with WRF V4.2.1 (with minor differences) for the correlations to measurements, error in ACF, and the error in SD of the first difference (not shown). However, the ranking order is changed for the EMD and the spatial correlations. Nevertheless, the conclusions for these two metrics do not show a clear impact from grid spacing or model nesting. The direct comparison between the 10 km and 10 km_v3 shows no clear effect from the two WRF model versions, and both experiments present a similar position in the rank for most metrics (Fig. 9).
Lastly, an experiment testing different forcing data (10 km_erai) was included to compare simulations forced by ERA5 versus ERA-Interim reanalyses. For all metrics, the 10 km and 10 km_erai present a close position in the ranking (Fig. 9). The three experiments with 10 km grid spacing are clustered among the ranks due to the same resultant grid spacing and model nesting.

Conclusions
To investigate how to improve the mesoscale modelling of wind time series over northern Europe for power and energy system purposes, we performed a sensitivity study to various WRF model setups, including varying nesting configuration (one or two inner domains), nesting ratio (1/3 or 1/5), and resolution of the innermost domain (10, 6, 5, or 3.3 km). Simulations using different model versions and forced by different reanalyses are also explored. Five metrics relevant to wind power integration studies are presented for the time series derived from the WRF model simulations and compared to those from the New European Wind Atlas and the ERA5 reanalysis. We also ranked the time series simulation's metrics to identify significant factors controlling the simulation performance in their generated wind speed time series. Measured data from 14 sites over land, coastal, and offshore locations in northern Europe were used.
We found that the model configuration affects the value of the wind time series correlations with measurements metrics more than the grid spacing. Thus, we recommend ERA5 reanalysis over the mesoscale simulations for studies where the correlations with measurements are essential. However, when producing mesoscale simulations for power and energy system purposes, a smoother resolution jump from outer to inner domains benefits the simulations by keeping them more correlated to the forcing reanalysis. This is especially relevant when the wind speed time series are combined with other series data (e.g. electric load or price time series). Finer-spatial-resolution simulations such as NEWA and the 3.3 km experiment may be best for applications where temporal variability has to be well modelled, such as power ramp analyses or voltage stability studies. For more accurate simulations in terms of wind speed distribution and spatial correlations, NEWA presents more favourable results than ERA5.
The value of the metrics at the considered sites shows more accurate results for offshore and coastal than for inland locations in all metrics, except for the standard deviation of the first difference. Simulated sites in forest landscapes generally have more significant errors, especially when measurements are taken at lower heights (i.e. less than 50 m tall). This could be due to model deficiencies in simulating boundary layer processes near the ground in more complex terrain, in agreement with Hahmann et al. (2020b).
The evaluation of correlations to measurements indicates that strengthening the influence of the forcing from the reanalysis data on the mesoscale model simulation can be achieved by using a smooth transition between the computational domains. Thus, a nest transition from 15 to 5 km (domain 1 to domain 2) is more effective than using 30 to 6 km (considering the forcing data resolution close to 30 km) for maintaining the high correlations from the reanalysis in the inner domain. A comparison between simulations using three domains (30/10/3.3 and 27/9/3 km) confirms this result. However, the NEWA and 3.3 km simulations are not entirely comparable because they differ in the size of the outer domain. A large nudged outer domain appears to be important for improving the correlation with observations in the inner domains. Still, our results do not provide a systematic validation of this hypothesis. The ERA5-derived wind speed time series has the largest correlation to measurements for all sites, and this behaviour is in agreement with Jourdier (2020). From experiences in weather forecasting, it is known that higher resolution does not always produce improved statistics (Mass et al., 2002) because the various metrics are sensitive to the smoothness of the time series.
The ranking order in the autocorrelation function and standard deviation of the first difference is a function of decreasing spatial grid spacing rather than the nesting arrangement. This is probably a consequence of the higher frequency of occurrence of convective processes in finer grid spacing domains, as it is discussed in Mass et al. (2002) and Vincent et al. (2013). For the wind speed distribution, the results are inconclusive regarding the impact of the model configuration or the spatial resolution on the quality of the time series. The analysis of the spatial correlations confirmed results from previous papers that all simulations exaggerate the spatial correlations (Murcia et al., 2022;Mehrens et al., 2016) and that NEWA time series can simulate this aspect more accurately than does the time series derived from the ERA5 reanalysis (Murcia et al., 2022). Mass et al. (2002) show that finer horizontal resolution leads to lower correlations due to a higher spatial variability. However, our results for spatial correlations do not find an explanation in the model setup and are sensitive to the period of the year. Simulated time series longer than 1 year are needed to investigate these findings better. Also, this could be because our tested grid spacings are very similar (from 10 to 3.3 km), while in Mass et al. (2002), the simulation resolutions have a larger range from 36 to 4 km.
Due to computational cost, many other details related to the model setup have not been tested. For example, we used the same size and position of the innermost domain for all simulations. Therefore, we did not test the sensitiv-ity of the simulated time series to these aspects. Hahmann et al. (2020a) found that smaller domains in the WRF simulation tend to show smaller wind speed biases but higher root mean square errors (RMSEs) compared to observations. They claim, however, that it was unclear if this resulted from the domain size or the location of the boundaries in relation to the large-scale flow. Further tests, including these two model setup aspects, could point to improvements in modelling time series correlated with measurements since RMSE and correlations are related metrics. Additional numerical experiments on grid spacing could be carried out to clarify the potential impacts of horizontal resolution on the simulated spatial correlations and wind speed distribution. Simulations using a much larger outer domain than the one in Fig. 1 and the same inner domain could explain the different performances between NEWA and 3.3 km (Fig. 10) in representing spatial correlations. Finally, because of the shape of the power curves, a further analysis focusing on the errors (i.e. EMD) in certain parts of the wind speed distribution that contribute the most to energy production could be carried out by assigning higher weights to values between the cut-in and cut-out wind speeds.
Appendix A Figure A1. (a) Long-term mean wind speed and (b) standard deviation of the annual mean wind speed  and (c) wind speed anomaly during 2009 with regards to the long-term mean. All statistics are computed using ERA5 at 100 m a.g.l. (above ground level).