<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing with OASIS Tables v3.0 20080202//EN" "https://jats.nlm.nih.gov/nlm-dtd/publishing/3.0/journalpub-oasis3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://docs.oasis-open.org/ns/oasis-exchange/table" xml:lang="en" dtd-version="3.0" article-type="review-article">
  <front>
    <journal-meta><journal-id journal-id-type="publisher">WES</journal-id><journal-title-group>
    <journal-title>Wind Energy Science</journal-title>
    <abbrev-journal-title abbrev-type="publisher">WES</abbrev-journal-title><abbrev-journal-title abbrev-type="nlm-ta">Wind Energ. Sci.</abbrev-journal-title>
  </journal-title-group><issn pub-type="epub">2366-7451</issn><publisher>
    <publisher-name>Copernicus Publications</publisher-name>
    <publisher-loc>Göttingen, Germany</publisher-loc>
  </publisher></journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5194/wes-11-1185-2026</article-id><title-group><article-title>Review of deep reinforcement learning for offshore wind farm maintenance planning</article-title><alt-title>Review of deep reinforcement learning for offshore wind farm maintenance planning</alt-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author" corresp="yes" rid="aff1">
          <name><surname>Borsotti</surname><given-names>Marco</given-names></name>
          <email>m.borsotti@tudelft.nl</email>
        <ext-link>https://orcid.org/0000-0002-9424-7404</ext-link></contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Jiang</surname><given-names>Xiaoli</given-names></name>
          
        </contrib>
        <contrib contrib-type="author" corresp="no" rid="aff1">
          <name><surname>Negenborn</surname><given-names>Rudy R.</given-names></name>
          
        </contrib>
        <aff id="aff1"><label>1</label><institution>Department of Maritime &amp; Transport Technology, Delft University of Technology, Delft, the Netherlands</institution>
        </aff>
      </contrib-group>
      <author-notes><corresp id="corr1">Correspondence: Marco Borsotti (m.borsotti@tudelft.nl)</corresp></author-notes><pub-date><day>13</day><month>April</month><year>2026</year></pub-date>
      
      <volume>11</volume>
      <issue>4</issue>
      <fpage>1185</fpage><lpage>1204</lpage>
      <history>
        <date date-type="received"><day>28</day><month>October</month><year>2025</year></date>
           <date date-type="rev-request"><day>7</day><month>November</month><year>2025</year></date>
           <date date-type="rev-recd"><day>13</day><month>March</month><year>2026</year></date>
           <date date-type="accepted"><day>16</day><month>March</month><year>2026</year></date>
      </history>
      <permissions>
        <copyright-statement>Copyright: © 2026 Marco Borsotti et al.</copyright-statement>
        <copyright-year>2026</copyright-year>
      <license license-type="open-access"><license-p>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</ext-link></license-p></license></permissions><self-uri xlink:href="https://wes.copernicus.org/articles/11/1185/2026/wes-11-1185-2026.html">This article is available from https://wes.copernicus.org/articles/11/1185/2026/wes-11-1185-2026.html</self-uri><self-uri xlink:href="https://wes.copernicus.org/articles/11/1185/2026/wes-11-1185-2026.pdf">The full text article is available as a PDF file from https://wes.copernicus.org/articles/11/1185/2026/wes-11-1185-2026.pdf</self-uri>
      <abstract><title>Abstract</title>

      <p id="d2e97">Offshore wind farms face unique challenges in maintenance due to harsh weather, remote locations, and complex logistics. Traditional maintenance strategies often fail to optimize operations, leading to unplanned failures or unnecessary servicing. In recent years, deep reinforcement learning (DRL) has shown clear potential to tackle these challenges through a data-driven approach. This paper provides a critical review of representative DRL models for offshore wind farm maintenance planning, elaborating on both single- and multi-agent frameworks, diverse training algorithms, various problem formulations, and the integration of domain-specific knowledge. The review compares the benefits and limitations of these methods, identifying a significant gap in the widely adopted use of simplistic binary maintenance decisions, rather than including multi-level or imperfect repairs in the action space. In conclusion, this work suggests directions for future research to overcome current limitations and enhance the applicability of DRL methods in offshore wind maintenance.</p>
  </abstract>
    
<funding-group>
<award-group id="gs1">
<funding-source>Nederlandse Organisatie voor Wetenschappelijk Onderzoek</funding-source>
<award-id>KICH1.ED02.20.004</award-id>
</award-group>
</funding-group>
</article-meta>
  </front>
<body>
      

<sec id="Ch1.S1" sec-type="intro">
  <label>1</label><title>Introduction</title>
      <p id="d2e109">Offshore wind farm maintenance presents unique challenges due to harsh weather conditions, remote locations, and the intricate coordination of logistical resources <xref ref-type="bibr" rid="bib1.bibx36" id="paren.1"/>. Storms, high winds, and unpredictable sea states create uncertainty in scheduling, while the limited accessibility of offshore sites further complicates intervention planning. Additionally, the need to allocate maintenance crews, vessels, and spare parts increases operational complexity, often leading to delays and higher costs. Conventional maintenance strategies, such as corrective or scheduled preventive measures, struggle to mitigate these challenges, often resulting in unplanned failures or unnecessary servicing <xref ref-type="bibr" rid="bib1.bibx18" id="paren.2"/>.</p>
      <p id="d2e118">Traditionally, offshore wind O&amp;M has been supported by deterministic or stochastic optimization models, rule-based policies, and predictive maintenance strategies informed by condition-monitoring and SCADA data. However, these approaches typically require predefined decision rules or restrictive modelling assumptions, limiting their ability to adapt to the dynamic and uncertain offshore environment <xref ref-type="bibr" rid="bib1.bibx8" id="paren.3"/>.</p>
      <p id="d2e124">In recent years, DRL has shown promising results as a data-driven approach to tackle these challenges. DRL is a class of algorithms that combine the sequential decision-making framework of reinforcement learning (RL) with the representational power of deep neural networks. In a typical RL setting, an agent interacts with an environment defined as a Markov decision process (MDP), observing a state <inline-formula><mml:math id="M1" display="inline"><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>, selecting an action <inline-formula><mml:math id="M2" display="inline"><mml:mrow><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> according to a policy <inline-formula><mml:math id="M3" display="inline"><mml:mrow><mml:mi mathvariant="italic">π</mml:mi><mml:mo>(</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, and receiving a scalar reward <inline-formula><mml:math id="M4" display="inline"><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>. 
The objective is to learn a policy that maximizes the expected cumulative discounted return <inline-formula><mml:math id="M5" display="inline"><mml:mrow><mml:msub><mml:mi>E</mml:mi><mml:mi mathvariant="italic">π</mml:mi></mml:msub><mml:mo>[</mml:mo><mml:msub><mml:mo>∑</mml:mo><mml:mi>t</mml:mi></mml:msub><mml:msup><mml:mi mathvariant="italic">γ</mml:mi><mml:mi>t</mml:mi></mml:msup><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math id="M6" display="inline"><mml:mrow><mml:mi mathvariant="italic">γ</mml:mi><mml:mo>∈</mml:mo><mml:mo>[</mml:mo><mml:mn mathvariant="normal">0</mml:mn><mml:mo>,</mml:mo><mml:mn mathvariant="normal">1</mml:mn><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> is the discount factor determining how future rewards are weighted relative to immediate ones <xref ref-type="bibr" rid="bib1.bibx48" id="paren.4"/>. Classical RL struggles when the state or action space is high-dimensional or continuous, as in offshore wind maintenance, where turbine states, weather, and logistics create vast combinations. DRL overcomes this limitation by using deep networks to approximate policy and value functions, enabling end-to-end learning directly from high-dimensional or partially observed inputs such as condition monitoring or weather data. These features make DRL particularly suitable for complex, stochastic decision problems in offshore wind O&amp;M, where the agent must learn adaptive, long-horizon maintenance policies under uncertainty.</p>
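<p>As a minimal numerical illustration of this objective (reward values are hypothetical and not taken from any reviewed study), the discounted return of a short maintenance episode can be computed as follows:</p>

```python
# Illustrative sketch with hypothetical values: the discounted return
# G = sum_t gamma^t * r_t that the agent's policy seeks to maximize.

def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative return of one episode's reward sequence."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Example episode: -1 per time step of turbine downtime, +10 once the
# repair succeeds; gamma < 1 down-weights the delayed repair reward.
rewards = [-1.0, -1.0, 10.0]
print(discounted_return(rewards, gamma=0.9))
```

<p>A lower discount factor shrinks the contribution of the delayed repair reward, which is why the choice of the discount factor matters for long-horizon maintenance policies.</p>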
      <p id="d2e238">Figure <xref ref-type="fig" rid="F1"/> summarizes four recurring challenges in offshore wind O&amp;M – harsh and uncertain weather, unplanned failures, remote locations with limited accessibility, and complex logistics requiring resource coordination – and highlights four corresponding opportunities where data-driven decision support, and DRL in particular, can add value. First, <italic>adaptive decision-making and learning</italic> directly target uncertainty: by learning policies that update actions based on newly observed information (e.g. condition-monitoring signals or revised weather forecasts), DRL can move beyond static decision rules and react to changing offshore conditions. Second, <italic>proactive scheduling and resource allocation</italic> address all challenges by exploiting forecasts and operational data to time interventions when access windows are likely (e.g. using metocean forecasts and, where available, operational sensing such as lidar-informed wind estimates) and to prioritize tasks before expected inaccessibility or risk escalation, thereby reducing weather-driven waiting time and avoidable downtime. Third, <italic>data-driven optimization</italic> provides a mechanism to jointly trade off competing objectives (e.g. cost, energy yield, risk, availability) under uncertainty, which is essential when failures and maintenance actions have long-horizon consequences. Finally, <italic>scalable frameworks</italic> respond to the growth in decision complexity as wind farms scale (more turbines, more components, more interacting constraints): function approximation, state abstraction (e.g. spatial or graph representations), and decentralized or hierarchical extensions provide pathways to maintain tractability when the state and decision spaces expand, increasing the complexity of logistics and resource coordination.</p>

      <fig id="F1" specific-use="star"><label>Figure 1</label><caption><p id="d2e258">Challenges of offshore wind maintenance and opportunities for deep reinforcement learning methods.</p></caption>
        <graphic xlink:href="https://wes.copernicus.org/articles/11/1185/2026/wes-11-1185-2026-f01.png"/>

      </fig>

      <p id="d2e267">Despite these advantages, DRL also comes with well-recognized limitations that are particularly relevant for safety-critical infrastructure such as offshore wind. First, policies are usually represented by deep neural networks, which behave as black-box models and make it difficult for operators to understand or audit the rationale behind individual decisions; this lack of transparency is identified as a barrier to deployment and has motivated a dedicated line of work on explainable and safe reinforcement learning <xref ref-type="bibr" rid="bib1.bibx43 bib1.bibx9" id="paren.5"/>. Second, state-of-the-art DRL algorithms tend to be data- and computation-hungry, often requiring millions of interactions with an environment for training; this sample inefficiency and dependence on high-fidelity simulators are highlighted as key obstacles to applying DRL in real-world systems where experiments are costly or risky <xref ref-type="bibr" rid="bib1.bibx17" id="paren.6"/>. Third, recent reviews of RL in power and energy systems emphasize the challenges of transferring policies from simulations to real assets, enforcing strict safety and reliability constraints, and encoding operational limits and human oversight in reward functions, all of which have so far limited large-scale industrial adoption <xref ref-type="bibr" rid="bib1.bibx41 bib1.bibx9" id="paren.7"/>.</p>
      <p id="d2e279">A substantial body of literature has reviewed different aspects of data-driven O&amp;M, but none provides a dedicated synthesis of DRL for maintenance planning. For instance, <xref ref-type="bibr" rid="bib1.bibx18" id="text.8"/> review predictive and prescriptive O&amp;M strategies, highlighting how data-driven prognostics can support maintenance planning but without examining learning-based sequential decision frameworks. <xref ref-type="bibr" rid="bib1.bibx50" id="text.9"/> provide a systematic review of maintenance cost minimization models for offshore wind farms, focusing on optimization formulations and cost structures rather than on adaptive control or online decision-making. Similarly, reviews on machine-learning-based condition monitoring and prognostics, such as <xref ref-type="bibr" rid="bib1.bibx46" id="text.10"/>, <xref ref-type="bibr" rid="bib1.bibx49" id="text.11"/>, <xref ref-type="bibr" rid="bib1.bibx39" id="text.12"/>, and <xref ref-type="bibr" rid="bib1.bibx53" id="text.13"/>, synthesize diagnostic and remaining useful life (RUL) estimation techniques but do not address how such prognostic information interacts with maintenance-scheduling policies.</p>
      <p id="d2e301">Reinforcement learning itself has also been surveyed within the wind and power-system domain. <xref ref-type="bibr" rid="bib1.bibx35" id="text.14"/> reviews reinforcement-learning applications in wind energy, primarily in the context of control, forecasting, and wake steering. <xref ref-type="bibr" rid="bib1.bibx4" id="text.15"/> survey RL approaches for wind farm flow control, while <xref ref-type="bibr" rid="bib1.bibx29" id="text.16"/> discuss DRL for modern power-system control problems, including frequency and voltage regulation. Yet, none of these reviews analyse O&amp;M decision processes, nor do they compare DRL architectures, modelling assumptions, or their integration with offshore wind maintenance requirements.</p>
      <p id="d2e313">To the best of our knowledge, this is the first review that focuses specifically on DRL for offshore wind farm maintenance planning. In contrast to prior reviews, this work provides a focused synthesis of DRL approaches that have been proposed for, or are transferable to, offshore wind farm maintenance planning. Specifically, we (i) compare single- and multi-agent DRL frameworks and the algorithms they employ (value based, policy gradient, and actor critic) in relation to the maintenance decision problems they target; (ii) analyse the problem formulations adopted in these studies, including MDP, POMDP, graph-based, and hierarchical representations, and how they encode uncertainty, PHM information, wake effects, weather, and logistical constraints; and (iii) discuss the role of domain-specific knowledge and the remaining modelling gaps, with particular emphasis on the prevailing use of binary repair decisions instead of more realistic multi-level maintenance actions. By organizing the review along these dimensions, we aim to clarify what current DRL models can and cannot do for offshore wind maintenance planning and to outline research directions needed to move from promising simulation results toward practical adoption in real offshore wind operations.</p>
      <p id="d2e316">Rather than attempting an exhaustive survey of all works in the DRL domain, we have deliberately narrowed our focus to a select group of key studies that exemplify the state-of-the-art in this area.</p>
      <p id="d2e320">To ensure transparency and reproducibility, the corpus analysed in this review was identified through a structured search process. Searches were conducted in Scopus, Web of Science, ScienceDirect, ResearchGate, and Google Scholar using combinations of the following terms: “offshore wind”, “maintenance”, “reinforcement learning”, “deep reinforcement learning”, “predictive maintenance”, and “O&amp;M optimization”. The search window covered publications up to 2025.</p>
      <p id="d2e324">Inclusion criteria were (i) studies proposing or evaluating DRL-based methods for maintenance or inspection planning of offshore or onshore wind systems, (ii) papers applying DRL to components or systems representative of wind turbine O&amp;M, and (iii) articles providing sufficient methodological detail to characterize the learning formulation. Exclusion criteria included purely theoretical RL work, asset-management models without learning components, and high-level conceptual papers.</p>
      <p id="d2e327">This protocol yielded a total of 54 papers after full-text screening. The term “deliberately narrowed” refers to the intentional focus on works with explicit DRL formulations for maintenance decision-making, excluding broader O&amp;M optimization domains (e.g. dispatch, routing, or power forecasting) unless they directly informed maintenance planning.</p>
      <p id="d2e330">The primary goal of this paper is to synthesize the advancements demonstrated by these representative models, critically assessing their strengths and identifying the limitations that still exist.</p>
      <p id="d2e333">By doing so, we aim to demonstrate how DRL can effectively address the challenges of offshore wind maintenance planning, while also pinpointing areas that require further refinement. This targeted review not only clarifies the current state of research in this field but also offers insights into future research directions, particularly in the development of decision frameworks that move beyond simplistic binary repair actions.</p>
      <p id="d2e336">The remainder of this paper is organized as follows. In Sect. <xref ref-type="sec" rid="Ch1.S2"/>, we review single-agent DRL approaches for offshore wind maintenance, highlighting key algorithms and their performance in various decision-making settings. Section <xref ref-type="sec" rid="Ch1.S3"/> extends the discussion to multi-agent DRL frameworks, which address scalability by distributing decisions across multiple agents. In Sect. <xref ref-type="sec" rid="Ch1.S4"/>, we detail the formulations used to represent the maintenance problem, including Markov decision processes (MDPs), partially observable MDPs (POMDPs), and hierarchical and graph-based methods. In Sect. <xref ref-type="sec" rid="Ch1.S5"/>, we discuss the applications of DRL and the integration of domain-specific knowledge, such as wind farm aerodynamics, weather constraints, logistics, and PHM data, and recap the reviewed models by presenting their key features in a summary table. Section <xref ref-type="sec" rid="Ch1.S6"/> summarizes the key contributions and performance improvements achieved by the reviewed DRL models, compares simulation-based studies with real-world applications, and discusses the integration of nuanced repair types in the models. In Sect. <xref ref-type="sec" rid="Ch1.S7"/>, we focus on what we believe is the main gap in current DRL models for maintenance planning, i.e. their reliance on a binary maintain-or-not decision, and argue that incorporating multiple levels of repair actions is necessary to reflect real-world maintenance scenarios more accurately. Finally, Sect. <xref ref-type="sec" rid="Ch1.S8"/> concludes with insights and directions for future research.</p>
</sec>
<sec id="Ch1.S2">
  <label>2</label><title>Single-agent DRL approaches for offshore wind O&amp;M</title>
      <p id="d2e363">Most DRL-based maintenance planners for offshore wind adopt a <italic>single-agent</italic> paradigm, where one agent learns an optimal policy for the entire system (e.g. a wind farm or a single turbine). This structure is suitable when a central decision-maker can coordinate all maintenance actions and information is aggregated at the farm or turbine level.</p>
      <p id="d2e369">In DRL, two key functions define the learning objective: the <italic>state action value function</italic> <inline-formula><mml:math id="M7" display="inline"><mml:mrow><mml:mi>Q</mml:mi><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, which estimates the expected cumulative reward of taking action <inline-formula><mml:math id="M8" display="inline"><mml:mi>a</mml:mi></mml:math></inline-formula> in state <inline-formula><mml:math id="M9" display="inline"><mml:mi>s</mml:mi></mml:math></inline-formula>, and the <italic>state value function</italic> <inline-formula><mml:math id="M10" display="inline"><mml:mrow><mml:mi>V</mml:mi><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, which measures the expected reward of being in state <inline-formula><mml:math id="M11" display="inline"><mml:mi>s</mml:mi></mml:math></inline-formula> and following the current policy. Different algorithm families approximate and use these functions in distinct ways; thus, among single-agent methods, three main families of DRL algorithms are most frequently applied:</p>
      <p id="d2e432"><def-list>
          <def-item><term><italic>Value-based</italic>.</term><def>

      <p id="d2e442">Value-based methods, such as the deep Q network (DQN) algorithm <xref ref-type="bibr" rid="bib1.bibx33" id="paren.17"/>, learn an approximation of <inline-formula><mml:math id="M12" display="inline"><mml:mrow><mml:mi>Q</mml:mi><mml:mo>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> using a deep neural network and select actions that maximize this value. Stability is achieved through experience replay and a target network. DQN and its variants (double DQN, duelling DQN) are effective for discrete maintenance decisions such as <italic>maintain</italic> versus <italic>not maintain</italic>.</p>
          </def></def-item>
          <def-item><term><italic>Policy gradient</italic>.</term><def>

      <p id="d2e480">Instead of estimating value functions, policy-gradient methods learn a parameterized policy <inline-formula><mml:math id="M13" display="inline"><mml:mrow><mml:msub><mml:mi mathvariant="italic">π</mml:mi><mml:mi mathvariant="italic">θ</mml:mi></mml:msub><mml:mo>(</mml:mo><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> by adjusting the parameters <inline-formula><mml:math id="M14" display="inline"><mml:mi mathvariant="italic">θ</mml:mi></mml:math></inline-formula> in the direction of the performance gradient. Proximal policy optimization (PPO) <xref ref-type="bibr" rid="bib1.bibx45" id="paren.18"/> is a widely used variant that constrains policy updates within a clipped trust region, improving stability and sample efficiency for large or continuous decision spaces, such as allocating multiple maintenance crews.</p>
          </def></def-item>
          <def-item><term><italic>Actor critic</italic>.</term><def>

      <p id="d2e522">Hybrid algorithms, such as actor-critic methods, combine both paradigms by maintaining a policy (the <italic>actor</italic>) and a value estimator (the <italic>critic</italic>). Deep deterministic policy gradient (DDPG) <xref ref-type="bibr" rid="bib1.bibx31" id="paren.19"/> and soft actor critic (SAC) <xref ref-type="bibr" rid="bib1.bibx19" id="paren.20"/> extend DRL to continuous control using deterministic or stochastic policies with off-policy learning, while asynchronous advantage actor critic (A3C) <xref ref-type="bibr" rid="bib1.bibx34" id="paren.21"/> accelerates training through parallel environment instances.</p>
          </def></def-item>
        </def-list></p>
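<p>To make the distinctions between the three families concrete, the following sketch (with hypothetical values, not drawn from any reviewed model) contrasts their core update signals for a discrete two-action maintenance choice; real implementations replace the plain floats with neural-network outputs:</p>

```python
# Minimal sketch (hypothetical values) of the three update signals:
# a value-based bootstrap target, PPO's clipped surrogate, and the
# TD advantage an actor-critic's critic supplies to its actor.

GAMMA = 0.95  # discount factor

def dqn_target(r, next_q, gamma=GAMMA):
    """Value based: target y = r + gamma * max_a' Q(s', a')."""
    return r + gamma * max(next_q)

def ppo_surrogate(ratio, advantage, eps=0.2):
    """Policy gradient (PPO): clipped objective
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def td_advantage(r, v_next, v_now, gamma=GAMMA):
    """Actor critic: A = r + gamma * V(s') - V(s); a positive value
    tells the actor to reinforce the chosen action."""
    return r + gamma * v_next - v_now

next_q = [4.0, 6.0]                  # hypothetical Q(s', .) per action
print(dqn_target(-1.0, next_q))      # bootstrapped Q-learning target
print(ppo_surrogate(1.5, 2.0))       # update capped at (1 + eps) * A
print(td_advantage(-1.0, 8.0, 6.5))  # small positive advantage
```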
      <p id="d2e543">The flowchart in Fig. <xref ref-type="fig" rid="F2"/> illustrates these single-agent DRL approaches: all methods share common initial steps (environment setup, state observation) before diverging by learning logic – DQN (green) selects actions via <inline-formula><mml:math id="M15" display="inline"><mml:mi mathvariant="italic">ϵ</mml:mi></mml:math></inline-formula>-greedy exploration based on <inline-formula><mml:math id="M16" display="inline"><mml:mi>Q</mml:mi></mml:math></inline-formula> values, policy-gradient methods (purple) sample actions from a learned distribution and update directly via performance gradients, and actor-critic methods (orange) use a critic network to evaluate actions and guide policy improvement.</p>

      <fig id="F2" specific-use="star"><label>Figure 2</label><caption><p id="d2e564">Overview of single-agent deep reinforcement learning algorithm families: DQN (green), policy gradient (purple), and actor critic (orange).</p></caption>
        <graphic xlink:href="https://wes.copernicus.org/articles/11/1185/2026/wes-11-1185-2026-f02.png"/>

      </fig>

      <p id="d2e573">The following subsections review how each family has been applied to specific O&amp;M formulations and performance objectives.</p>
<sec id="Ch1.S2.SS1">
  <label>2.1</label><title>Deep Q networks (DQN)</title>
      <p id="d2e583">Value-based methods like DQN are popular for discrete maintenance decisions (e.g. whether to service a component now or later). For instance, <xref ref-type="bibr" rid="bib1.bibx26" id="text.22"/> combined DQN with graph neural networks to take into account asset topology. In their framework, a single agent uses a graph convolutional network (GCN) to group maintenance actions on geographically proximate pipes, yielding more efficient schedules. The DQN+GCN approach produced more reliable networks and higher maintenance grouping compared to a plain DQN and to conventional preventive/corrective policies.</p>
      <p id="d2e589">Similarly, <xref ref-type="bibr" rid="bib1.bibx28" id="text.23"/> developed a domain-informed DQN ensemble to schedule offshore wind farm maintenance tasks. They formulated maintenance scheduling as an MDP and incorporated wind wake effect models and weather variability into the state. By using convolutional layers to process spatial–temporal features (like turbine–wake interactions), their DQN agent improved power generation by 11.1 % compared to a baseline schedule.</p>
      <p id="d2e595">These studies chose DQN for its stability in discrete action spaces and supplemented it with domain-informed neural architectures, such as convolutional neural networks (CNNs) or graph convolutional networks (GCNs), to accelerate learning and capture dependencies. Double DQN (DDQN) and duelling DQN are further value-based variants designed to handle large state spaces more stably. <xref ref-type="bibr" rid="bib1.bibx58" id="text.24"/> tested DDQN on a multi-component maintenance problem with a large state space and showed better performance than simple threshold policies.</p>
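<p>As a minimal sketch of the DDQN idea mentioned above (Q values are hypothetical, not from the cited study), double DQN reduces the overestimation bias of plain DQN by letting the online network select the next action while the target network evaluates it:</p>

```python
# Double-DQN target: y = r + gamma * Q_target(s', argmax_a Q_online(s', a)).
# Lists of hypothetical Q values stand in for the two networks.

GAMMA = 0.95

def double_dqn_target(reward, online_next_q, target_next_q, done, gamma=GAMMA):
    """Decouple action selection (online net) from evaluation (target net)."""
    if done:
        return reward  # terminal state: no bootstrapping
    best_action = max(range(len(online_next_q)), key=lambda a: online_next_q[a])
    return reward + gamma * target_next_q[best_action]

online_q = [3.0, 5.0]   # Q_online(s', .): prefers action 1 (maintain)
target_q = [2.5, 4.0]   # Q_target(s', .): evaluates that same action
print(double_dqn_target(-1.0, online_q, target_q, done=False))
```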
      <p id="d2e601">However, value-based methods can struggle when the action space or planning horizon grows large or when the state is partially observed (e.g. uncertain component health) <xref ref-type="bibr" rid="bib1.bibx20" id="paren.25"/>. These challenges have led researchers to explore policy-gradient methods as well.</p>
</sec>
<sec id="Ch1.S2.SS2">
  <label>2.2</label><title>Policy-gradient methods</title>
      <p id="d2e615">Policy-based DRL algorithms directly optimize a parameterized policy <inline-formula><mml:math id="M17" display="inline"><mml:mrow><mml:mi mathvariant="italic">π</mml:mi><mml:mo>(</mml:mo><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> and are particularly advantageous in long-horizon, stochastic, or partially observable environments, which are characteristic of offshore maintenance planning. Because policy-gradient updates do not require explicit enumeration of all Q values, these methods handle continuous or large multi-discrete action spaces more naturally than value-based approaches. They also tend to yield smoother and more stable optimization when rewards are sparse or delayed <xref ref-type="bibr" rid="bib1.bibx45" id="paren.26"/>, which helps explain the successful use of PPO in discrete maintenance-planning studies such as <xref ref-type="bibr" rid="bib1.bibx42" id="text.27"/> and <xref ref-type="bibr" rid="bib1.bibx13" id="text.28"/>.</p>
      <p id="d2e645"><xref ref-type="bibr" rid="bib1.bibx42" id="text.29"/> developed a PPO-based agent to optimize maintenance dispatch for a wind farm with multiple crews. They formulated the problem as a sequential decision process and included prognostic information in the state (predicted RULs of turbines from PHM systems and even forecasted power production for upcoming days). The PPO agent learned a policy that outperformed corrective, scheduled, and threshold-based predictive maintenance benchmarks in profit maximization. Notably, the DRL policy automatically scheduled maintenance during low-power periods and anticipated failures using RUL predictions, something a simple RUL threshold policy could not do. PPO effectively handles continuous decision-making under uncertainty and yields better long-horizon rewards than static policies.</p>
      <p id="d2e650">In another example, <xref ref-type="bibr" rid="bib1.bibx13" id="text.30"/> applied both DQN and PPO to learn cost-optimal condition-based maintenance for an offshore turbine component. They investigated policies with dynamic inspection intervals and adaptive repair thresholds, formulating the decision as an MDP and comparing DRL algorithms. Both a DQN agent and a PPO agent were able to discover optimal policies under varying conditions, outperforming fixed-interval or fixed-threshold strategies by reducing lifecycle costs, although, in their case, PPO performed better than DQN.</p>
</sec>
<sec id="Ch1.S2.SS3">
  <label>2.3</label><title>Actor critic and others (DDPG, SAC, A3C)</title>
      <p id="d2e665">In problems with continuous-action decisions (e.g. scheduling exact timing), actor-critic methods like the deep deterministic policy gradient (DDPG) <xref ref-type="bibr" rid="bib1.bibx31" id="paren.31"/> and soft actor critic (SAC) become relevant. In a broader manufacturing maintenance context, for example, A3C has been used to optimize resource allocation problems where quick convergence was needed <xref ref-type="bibr" rid="bib1.bibx34" id="paren.32"/>.</p>
      <p id="d2e674">In the context of wind energy, <xref ref-type="bibr" rid="bib1.bibx59" id="text.33"/> used an SAC algorithm to solve a maintenance scheduling problem formulated as an MDP. SAC handles continuous actions and maximizes a trade-off between reward and policy entropy, which makes it suitable for maintenance problems requiring fine timing control. Their implementation allowed the agent to decide whether to perform maintenance on a turbine in each time period, aiming to minimize long-term cost. The SAC-based scheduler showed improved adaptability to random wind and failure events, outperforming greedy or periodic policies in simulation. Asynchronous advantage actor-critic (A3C) and related algorithms have also been mentioned in the context of maintenance optimization because they benefit from parallel training. While we are not aware of specific applications of A3C to offshore wind maintenance, the algorithm’s ability to run multiple environment instances in parallel can accelerate training for complex simulations.</p>
      <p id="d2e680">In principle, A3C/A2C could be applied to wind farm O&amp;M planning to speed up learning across many simulated weather scenarios or failure realizations. However, stability can be an issue, so recent works have gravitated more to PPO for O&amp;M problems.</p>
      <p id="d2e683">To provide a clearer overview of how these families compare in terms of strengths, limitations, and typical use cases in O&amp;M decision-making, Table <xref ref-type="table" rid="T1"/> summarizes their key characteristics.</p>

<table-wrap id="T1" specific-use="star"><label>Table 1</label><caption><p id="d2e692">Comparison of single-agent deep reinforcement learning algorithm families for offshore wind operations and maintenance.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="3">
     <oasis:colspec colnum="1" colname="col1" align="justify" colwidth="120pt"/>
     <oasis:colspec colnum="2" colname="col2" align="justify" colwidth="170pt"/>
     <oasis:colspec colnum="3" colname="col3" align="justify" colwidth="155pt"/>
     <oasis:thead>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">Algorithm family</oasis:entry>
         <oasis:entry colname="col2" align="left">Strengths</oasis:entry>
         <oasis:entry colname="col3" align="left">Limitations</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"><italic>Value based</italic> (DQN, DDQN, duelling DQN)</oasis:entry>
         <oasis:entry colname="col2" align="left">Effective for low-dimensional discrete decisions  Sample efficient with experience replay  Simple and stable for small action spaces</oasis:entry>
         <oasis:entry colname="col3" align="left">Less stable in long-horizon problems  Sensitive to partial observability  Harder to scale to multi-discrete decisions</oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"><italic>Policy gradient</italic> (PPO)</oasis:entry>
         <oasis:entry colname="col2" align="left">Stable updates in long-horizon, stochastic settings  Robust in partially observable environments  Naturally handles multi-discrete decision elements</oasis:entry>
         <oasis:entry colname="col3" align="left">Less sample efficient than value-based methods  Performance sensitive to reward shaping  No inherent advantage in simple discrete tasks</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"><italic>Actor critic</italic>  (A2C/A3C, SAC, DDPG)</oasis:entry>
         <oasis:entry colname="col2" align="left">Combines policy stability with value-based guidance  Efficient for complex states or long dependency chains</oasis:entry>
         <oasis:entry colname="col3" align="left">Training can be unstable without tuning  Off-policy AC methods require careful replay-buffer design  Computationally heavier than DQN variants</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      <p id="d2e791">The discussion on single-agent DRL approaches has highlighted how tailored algorithms can effectively optimize maintenance decisions. However, as wind farms scale up and the complexity of interdependent maintenance decisions increases, a shift toward distributed decision-making might become necessary. The following section explores multi-agent DRL frameworks, which distribute the decision-making process across multiple agents, thereby addressing scalability challenges while enabling coordinated maintenance planning.</p>
</sec>
</sec>
<sec id="Ch1.S3">
  <label>3</label><title>Multi-agent DRL in maintenance planning</title>
      <p id="d2e803">As wind farms scale up and the environment increases in size, a multi-agent DRL (MADRL) approach can be considered to distribute the decision-making across multiple agents <xref ref-type="bibr" rid="bib1.bibx32" id="paren.34"/> (e.g. one agent per turbine or per subsystem).</p>
      <p id="d2e809">Figure <xref ref-type="fig" rid="F3"/> contrasts the single-agent and multi-agent DRL approaches. In the single-agent framework (left), one centralized policy observes the environment and makes all maintenance decisions, whereas in the multi-agent framework (right), each agent receives local observations and collectively coordinates actions.</p>

      <fig id="F3" specific-use="star"><label>Figure 3</label><caption><p id="d2e816">Comparison of single-agent and multi-agent deep reinforcement learning approaches.</p></caption>
        <graphic xlink:href="https://wes.copernicus.org/articles/11/1185/2026/wes-11-1185-2026-f03.png"/>

      </fig>

      <p id="d2e826">Multi-agent frameworks address the curse of dimensionality that a single agent faces when managing many components simultaneously <xref ref-type="bibr" rid="bib1.bibx38" id="paren.35"/>. In a cooperative MADRL setting, agents must learn policies that jointly optimize the overall maintenance outcome. A common approach is centralized training with decentralized execution: during learning the agents share information, but in execution each acts independently <xref ref-type="bibr" rid="bib1.bibx44" id="paren.36"/>.</p>
      <p id="d2e835">A multi-agent perspective is taken by <xref ref-type="bibr" rid="bib1.bibx6" id="text.37"/>, who developed a deep centralized multi-agent actor-critic (DCMAC) algorithm. This algorithm treats each component as an “actor” with individual actions and uses a centralized critic to evaluate the joint outcome. Their method allowed individualized component-level decisions (like a multi-agent system) but maintained a single value function for the overall system. This hybrid approach achieved strong results on high-dimensional maintenance problems, outperforming time-based and condition-based benchmarks. DCMAC can be seen as bridging single- and multi-agent methods: it is centrally trained on the whole state, but action vectors are factorized per component. The success of DCMAC and similar centralized-training approaches again underlines that decomposing the action space among multiple decision-makers is a powerful strategy for scalability.</p>
      <p id="d2e841">Another relevant study is <xref ref-type="bibr" rid="bib1.bibx37" id="text.38"/> <xref ref-type="bibr" rid="bib1.bibx15" id="paren.39"><named-content content-type="pre">expanded in</named-content></xref>, who optimized maintenance for a 13-component system using a weighted QMIX algorithm. Here, each component is controlled by an agent, and a mixing network combines their action values into a global Q value to enforce cooperation.</p>
      <p id="d2e852">The “weighted” QMIX (W-QMIX) variant addresses limitations of the standard QMIX by improving credit assignment to each agent’s actions <xref ref-type="bibr" rid="bib1.bibx30" id="paren.40"/>. By customizing QMIX, they overcame the exponential growth of joint action space and achieved cost-effective policies that significantly reduced total maintenance cost compared to independent or rule-based policies. In fact, their multi-agent policy outperformed a traditional threshold-based maintenance strategy, yielding 20 % lower cost in the case study. The agents learned to coordinate, essentially performing opportunistic maintenance on other components when one component required service, something that is hard-coded in opportunistic heuristics but emerged naturally via learning. Notably, they simplified the architecture by using a branching neural network (one network outputting multi-component actions).</p>
      <p id="d2e858">Figure <xref ref-type="fig" rid="F4"/> presents a high-level overview of how local Q values from multiple agents are combined into a single joint Q value through a central mixing network. At the top, global state observations and individual agents’ local Q values feed into the network. Within the mixing network, these local estimates are merged. The output at the bottom is a single joint Q value, enabling decentralized agents to learn coordinated policies that account for global objectives.</p>

      <fig id="F4" specific-use="star"><label>Figure 4</label><caption><p id="d2e866">Multi-agent mixing network architecture.</p></caption>
        <graphic xlink:href="https://wes.copernicus.org/articles/11/1185/2026/wes-11-1185-2026-f04.png"/>

      </fig>

      <p id="d2e875">While most offshore wind DRL studies to date use a single centralized agent, the multi-agent perspective is highly relevant. A wind farm can be seen as a team of turbines (or a team of maintenance crews) that could each be an agent. Multi-agent DRL can explicitly model interactions like shared resources (e.g. a vessel cannot fix two turbines at once) and learn decentralized policies. For example, one could have a DRL agent for each maintenance vessel coordinating via a mixing network to maximize farm availability.</p>
      <p id="d2e878">Nevertheless, multi-agent approaches introduce challenges such as non-stationarity during training and coordination complexity. To avoid these issues, for example, <xref ref-type="bibr" rid="bib1.bibx42" id="text.41"/> effectively used a single PPO agent to dispatch multiple crews, which could be interpreted as a centralized multi-action policy rather than fully decentralized agents.</p>
      <p id="d2e884">An interesting future direction is to combine multi-agent RL with the physical layout of a wind farm, e.g. treating turbines as agents that learn when to request maintenance or crews as agents learning which turbine to service to further scale O&amp;M optimization. In summary, multi-agent DRL is a promising direction that can address scalability and modularity in offshore wind maintenance, ensuring that solutions remain effective as the number of assets grows.</p>
      <p id="d2e887">Having examined both single-agent and multi-agent DRL frameworks for maintenance planning, it is clear that the choice of algorithm is intricately linked to how the underlying problem is modelled. The next section focuses on the core methodologies and decision frameworks used, ranging from Markov decision processes (MDPs) and partially observable MDPs (POMDPs) to hierarchical and graph-based representations, which provide the foundation for these DRL approaches. This discussion will clarify how the characteristics of the problem formulation drive the design and performance of the DRL models.</p>
</sec>
<sec id="Ch1.S4">
  <label>4</label><title>Problem formulation</title>
      <p id="d2e898">Formulating the maintenance planning problem correctly is a core requirement for DRL. In the literature, we see formulations as MDPs, POMDPs, and even graph-based representations, each chosen to capture the nature of the decision environment.</p>
<sec id="Ch1.S4.SS1">
  <label>4.1</label><title>Markov decision process (MDP)</title>
      <p id="d2e908">Many works assume the system state is fully observable. For example, if one uses direct sensor readings or known component health states as the state, the decision process can be modelled as an MDP.</p>
      <p id="d2e911">The reward design typically combines negative maintenance costs and downtime losses into a single scalar reward (or profit) to guide the agent toward cost-optimal decisions, as in <xref ref-type="bibr" rid="bib1.bibx42" id="text.42"/>.</p>
      <p id="d2e917">The CBM decision can also be formulated as an MDP where the state might include the current damage level or time since last inspection as in <xref ref-type="bibr" rid="bib1.bibx13" id="text.43"/>. They defined actions such as whether to inspect or repair, with rewards based on cost.</p>
      <p id="d2e924">Similarly, the DQN-based scheduling by <xref ref-type="bibr" rid="bib1.bibx28" id="text.44"/> uses an MDP where the state includes turbine power outputs (affected by wake and weather) and maintenance statuses, assuming those are known to the agent.</p>
      <p id="d2e931">MDP formulations are simpler and allow the use of standard DRL algorithms, but they rely on having a reliable estimator of the system’s health state.</p>
</sec>
<sec id="Ch1.S4.SS2">
  <label>4.2</label><title>Partially observable MDP (POMDP)</title>
      <p id="d2e942">Several offshore wind O&amp;M decision processes are more naturally modelled as POMDPs because the agent typically does not observe the true underlying system state but only noisy and delayed measurements (SCADA/CM signals, inspection outcomes, imperfect weather and access forecasts). In a POMDP, the agent maintains a belief to make decisions under uncertainty <xref ref-type="bibr" rid="bib1.bibx24" id="paren.45"/>.</p>
      <p id="d2e948">Beyond SCADA-based monitoring, a complementary body of literature focuses on non-destructive evaluation (NDE) techniques that can provide inspection- and SHM-driven observations for wind turbine components and structures (e.g. blades, tower, foundations). For instance, <xref ref-type="bibr" rid="bib1.bibx14" id="text.46"/> review non-destructive techniques for wind turbine condition and structural health monitoring over the last 2 decades, covering approaches such as visual inspection, acoustic emission, ultrasonic testing, infrared thermography, radiographic and electromagnetic methods, and oil monitoring, with deployments ranging from human inspections to robotic and UAV-based surveys. Such sensing and inspection modalities can enrich the observation space available to data-driven O&amp;M decision support, but they also highlight why offshore maintenance planning is often partially observable in practice: measurements can be sparse in time (inspection driven), noisy, and operationally constrained by access windows and logistics, motivating POMDP formulations and memory-/belief-based policy representations.</p>
      <p id="d2e954">In practical DRL implementations, three families of remedies are commonly used to mitigate partial observability and approximate belief-state reasoning:</p>
      <p id="d2e957"><def-list>
            <def-item><term><bold>History-based state augmentation.</bold></term><def>

      <p id="d2e966">A simple approach is to concatenate a fixed window of past observations and actions to the agent input, thereby providing short-term memory of recent dynamics. While straightforward, this can be insufficient when relevant dependencies extend over long horizons (e.g. degradation accumulation, delayed maintenance effects) <xref ref-type="bibr" rid="bib1.bibx24" id="paren.47"/>.</p>
            </def></def-item>
            <def-item><term><bold>Recurrent DRL (implicit belief state via memory).</bold></term><def>

      <p id="d2e979">Recurrent neural networks (typically LSTMs/GRUs) can compress the observation action history into a hidden state that acts as an implicit belief surrogate. This idea has been adopted in value-based learning (e.g. deep recurrent Q networks, DRQNs) <xref ref-type="bibr" rid="bib1.bibx20" id="paren.48"/> and in actor-critic/policy-gradient settings, where recurrent policies are trained end to end to improve performance under partial observability <xref ref-type="bibr" rid="bib1.bibx55 bib1.bibx45" id="paren.49"/>.</p>
            </def></def-item>
            <def-item><term><bold>Transformer-based memory (long-range dependencies).</bold></term><def>

      <p id="d2e995">When long-range temporal structure matters, transformer architectures offer an alternative to RNNs by using attention mechanisms to retrieve relevant past information. Transformer-based RL agents have demonstrated improved stability and credit assignment in partially observable and long-horizon tasks by enabling flexible access to historical context <xref ref-type="bibr" rid="bib1.bibx40" id="paren.50"/>. This is conceptually aligned with offshore maintenance planning, where optimal actions may depend on events far in the past (e.g. prior repairs, earlier inspections), although transformers may require careful regularization and substantial training data.</p>
            </def></def-item>
          </def-list></p>
      <p id="d2e1004">Beyond implicit memory, explicit belief-state estimation can also be pursued by combining filtering with RL, for instance using Bayesian filters or particle filters when a tractable transition/observation model is available or by learning latent-state models that infer hidden degradation dynamics from observations before or during policy learning <xref ref-type="bibr" rid="bib1.bibx24 bib1.bibx22" id="paren.51"/>. Overall, expanding offshore wind DRL formulations from MDP to POMDP settings primarily affects the policy representation: memory-augmented policies (recurrent or transformer based) and/or explicit belief estimators provide practical mechanisms to cope with uncertainty in health information, accessibility, and delayed maintenance outcomes.</p>
      <p id="d2e1010">A first example of such a formulation for an O&amp;M planning problem is <xref ref-type="bibr" rid="bib1.bibx26" id="text.52"/>, who formulate maintenance planning as a POMDP on a graph, where the underlying deterioration states are partially observed through inspections.</p>
      <p id="d2e1016">Similarly, <xref ref-type="bibr" rid="bib1.bibx27" id="text.53"/> treat their predictive maintenance problem for aircraft components as a POMDP, using a CNN to process raw sensor data into an observation for the DRL agent.</p>
      <p id="d2e1022">The choice of POMDP acknowledges that maintenance decisions must be made with imperfect information, and DRL agents in this setting are trained to be more robust to uncertainty (e.g. scheduling maintenance a bit earlier to hedge against uncertain failure times). Methodologically, solving a POMDP with DRL often means using belief state features or statistical features of uncertainty (such as predicted failure probability) in the state.</p>
</sec>
<sec id="Ch1.S4.SS3">
  <label>4.3</label><title>Graph- and network-based representations</title>
      <p id="d2e1033">Some complex systems benefit from graph-based formulations. The study by <xref ref-type="bibr" rid="bib1.bibx26" id="text.54"/> is an example of where the asset network structure (a sewer network in their case) is encoded as a graph, and a GCN is used to inform the DRL agent.</p>
      <p id="d2e1039">While their domain was not wind, one can imagine an offshore wind farm graph where nodes are turbine components and edges could represent spatial proximity or electrical/functional dependencies. This approach could let the agent learn policies that consider component interactions (e.g. if neighbouring turbines’ maintenance can be combined). In their framework, the graph-based state and GCN embedding encouraged the agent to group maintenance geographically, improving efficiency.</p>
</sec>
<sec id="Ch1.S4.SS4">
  <label>4.4</label><title>Hierarchical and interpretable models</title>
      <p id="d2e1051">To tackle the black-box nature of DRL, <xref ref-type="bibr" rid="bib1.bibx3" id="text.55"/> introduced a hierarchical DRL framework for turbofan engine maintenance that combines an input–output hidden Markov model (IOHMM) with a DQN-like agent. The high-level IOHMM module interprets sensor data to detect the likely fault mode or degradation state (providing a human-understandable diagnostic), while the low-level DRL module learns the optimal replacement or repair policy given that inferred state. This two-level approach achieved performance on par with end-to-end DRL but with the added benefit that decisions could be traced to identifiable health-state estimates.</p>
      <p id="d2e1057">In safety-critical domains like aerospace or offshore energy, such interpretability is valuable for gaining trust in the AI’s recommendations. Moreover, by narrowing the policy search to focus on critical decisions (informed by the HMM), the agent avoids spurious actions in sparse failure domains.</p>
      <p id="d2e1060">Although turbofan engines differ from wind turbines, the concept is relevant: they use a probabilistic model to identify health states and provide an interpretable layer, and the DRL agent makes maintenance decisions at a higher level. This yields a more transparent policy, which could be valuable in offshore wind where operators demand an understanding of the AI’s decisions <xref ref-type="bibr" rid="bib1.bibx5" id="paren.56"/>. Such hierarchical or hybrid frameworks (combining physics-based or expert models with learning) are a way to inject domain knowledge directly into the DRL algorithm, improving learning speed and trustworthiness.</p>
      <p id="d2e1066">Figure <xref ref-type="fig" rid="F5"/> illustrates different frameworks commonly used to represent state information in deep reinforcement learning (DRL) models for maintenance decision-making. Starting from an initial state <inline-formula><mml:math id="M18" display="inline"><mml:mi>S</mml:mi></mml:math></inline-formula>, the flowchart shows four distinct ways the state can be represented or processed before making a decision. The MDP formulation (blue) assumes full access to the true system state. The POMDP approach (yellow) highlights the partial observability of real-world systems, requiring the agent to infer underlying states from limited observations <inline-formula><mml:math id="M19" display="inline"><mml:mrow><mml:mi>O</mml:mi><mml:mo>(</mml:mo><mml:mi>S</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> of the true state. The graph-based method (green) structures state information using node and edge relationships to capture spatial or topological dependencies, allowing the agent to consider how interconnected assets affect one another’s maintenance needs. Finally, the hierarchical approach (red) incorporates domain-specific insights, from either expert knowledge or physics-based models. After selecting an action <inline-formula><mml:math id="M20" display="inline"><mml:mi>A</mml:mi></mml:math></inline-formula> based on the chosen representation, the system receives a reward <inline-formula><mml:math id="M21" display="inline"><mml:mi>R</mml:mi></mml:math></inline-formula> and transitions to a new state <inline-formula><mml:math id="M22" display="inline"><mml:mrow><mml:msup><mml:mi>S</mml:mi><mml:mo>′</mml:mo></mml:msup></mml:mrow></mml:math></inline-formula>, looping back to begin a new decision-making cycle.</p>

      <fig id="F5" specific-use="star"><label>Figure 5</label><caption><p id="d2e1121">Overview of problem formulations in deep reinforcement learning models: Markov decision process (blue), partially observable Markov decision process (yellow), graph-/network-based formulation (green), and hierarchical approaches (pink).</p></caption>
          <graphic xlink:href="https://wes.copernicus.org/articles/11/1185/2026/wes-11-1185-2026-f05.png"/>

        </fig>

</sec>
<sec id="Ch1.S4.SS5">
  <label>4.5</label><title>Decision frameworks and objectives</title>
      <p id="d2e1138">Across these formulations, the decision-making frameworks can vary in objective. Some agents aim to maximize availability or energy production (profit) <xref ref-type="bibr" rid="bib1.bibx42 bib1.bibx28" id="paren.57"/>. Others aim to minimize total cost (including repair costs, downtime costs, and a possibly penalty for using resources) <xref ref-type="bibr" rid="bib1.bibx26" id="paren.58"/>.</p>
      <p id="d2e1147">These are effectively two sides of the same coin, since maximizing uptime or output will implicitly minimize downtime losses. The reward function should, in fact, include terms for lost revenue when a turbine is down, crew/vessel dispatch costs, spare part costs, and maybe even penalties for equipment degradation.</p>
      <p id="d2e1150">For example, <xref ref-type="bibr" rid="bib1.bibx42" id="text.59"/> designed a reward equal to the short-term profit (energy revenue minus O&amp;M costs) at each decision step, so the PPO agent’s cumulative reward corresponds to total profit. <xref ref-type="bibr" rid="bib1.bibx13" id="text.60"/> focused on cost rates, giving negative rewards for inspection or repair costs and for failures, to encourage the agent to find the policy with minimum average cost.</p>
      <p id="d2e1159">In multi-agent settings, a global reward is often used for full cooperation (as in QMIX, where agents maximize a shared cost-saving metric).</p>
      <p id="d2e1164">Some works also impose constraint handling in the formulation. For instance, maintenance actions might be invalid under certain weather conditions; a realistic environment simulator will simply not allow those actions (or will assign a large negative reward if attempted). Thus, ensuring decisions are feasible (e.g. not sending a crew when waves are too high) can be done by action masking or via constraint penalty in the reward. Recent research even explores incorporating optimization constraints into neural network design (e.g. using attention masks to enforce constraints) as in <xref ref-type="bibr" rid="bib1.bibx25" id="text.61"/>, though this is still emerging in O&amp;M applications.</p>
      <p id="d2e1170">In summary, the methodologies range from straightforward MDP models with fully observable states to sophisticated POMDP and graph-based models that embrace the complexities of offshore wind maintenance. The trend is toward more realistic problem formulations, acknowledging partial observability, incorporating spatial and logistical structure, and aligning the reward with business metrics (cost or profit). The following section provides a structured summary of these methodologies based on their application and the domain-specific knowledge considered, finally highlighting key aspects such as agent design, algorithm choices, and problem formulations in a comparative table. This summary aims to offer a clear and concise reference for understanding the variations in DRL-based maintenance planning approaches and their defining characteristics.</p>
</sec>
</sec>
<sec id="Ch1.S5">
  <label>5</label><title>Applications of DRL approaches for offshore wind O&amp;M</title>
      <p id="d2e1184">A fundamental aspect of training and deploying DRL agents for maintenance planning is the integration of domain-specific knowledge. The following subsections explore how different forms of domain knowledge can enhance decision-making: (i) <italic>wind farm aerodynamic</italic> interactions that affect energy yield; (ii) <italic>weather and sea state</italic> constraints that determine accessibility; (iii) <italic>logistics</italic> and crew-related considerations; (iv) <italic>prognostic and health management data</italic> for predictive maintenance; and, finally, (v) <italic>economic factors</italic> such as market prices and budget limits.</p>
<sec id="Ch1.S5.SS1">
  <label>5.1</label><title>Wind farm aerodynamics</title>
      <p id="d2e1209">In a wind farm, turbines cast aerodynamic wakes that reduce the output of downwind turbines <xref ref-type="bibr" rid="bib1.bibx52" id="paren.62"/>. In <xref ref-type="bibr" rid="bib1.bibx28" id="text.63"/>, the authors considered this by making their DRL agent wake-aware. They fed a wake interaction model into the state and used an ensemble of DQN models to capture different wake scenarios. By incorporating multiple wake models, their agent learned maintenance decisions that minimize farm power loss due to wakes, leading to higher overall energy production.</p>
      <p id="d2e1218">This domain knowledge is particularly important for tightly spaced wind farms where wake losses are significant; a purely self-learned agent without wake inputs might need enormous training experience to “discover” such effects, whereas providing a wake model upfront accelerates learning.</p>
      <p id="d2e1222">The result is a policy that schedules maintenance such that either the waking effect is minimal (e.g. performing maintenance when winds are low or from directions that do not affect other turbines) or  multiple affected turbines are maintained together to collapse the wake loss into a single period.</p>
</sec>
<sec id="Ch1.S5.SS2">
  <label>5.2</label><title>Weather and sea state</title>
      <p id="d2e1233">Offshore maintenance is heavily weather dependent, as high waves, strong winds, or storms can prevent crew transfers and repairs <xref ref-type="bibr" rid="bib1.bibx23" id="paren.64"/>. Some DRL studies explicitly include weather in the environment.</p>
      <p id="d2e1239"><xref ref-type="bibr" rid="bib1.bibx11" id="text.65"/> demonstrated a DRL approach for planning vessel transfers to turbines, integrating real SCADA data and weather conditions. Their agent could prioritize critical repairs and navigate the stochastic availability of weather windows, something traditional scheduling struggles with <xref ref-type="bibr" rid="bib1.bibx7" id="paren.66"/>. By training on historical weather patterns, the DRL policy learns, for example, to take advantage of a calm sea state to perform a repair even if it is slightly early because waiting might mean a long weather delay.</p>
      <p id="d2e1247">Similarly, Pinciroli’s state included predicted power (which indirectly reflects wind forecast) to help the agent plan around periods of low wind (often correlated with calmer weather) <xref ref-type="bibr" rid="bib1.bibx42 bib1.bibx38" id="paren.67"/>.</p>
      <p id="d2e1254">In practice, one can input wave height forecasts or wind speed forecasts into the DRL state; the agent will then learn not to “choose” an action that requires travel during bad weather. Domain knowledge here ensures feasibility and robustness: the agent that knows about weather will inherently develop a maintenance schedule that aligns with seasonal weather patterns, reducing cancellations and idle times.</p>
</sec>
<sec id="Ch1.S5.SS3">
  <label>5.3</label><title>Logistics</title>
      <p id="d2e1266">Offshore O&amp;M involves vessels, helicopters, crews, and spare parts – logistical aspects that greatly affect cost. DRL models have started to include these. In the multi-crew PPO model by <xref ref-type="bibr" rid="bib1.bibx42" id="text.68"/>, the state and action were designed to capture crew positions and availability. The environment simulation accounted for travel times to turbines and repair durations. This domain realism meant the learned policy actually coordinates crew movements: e.g. sending Crew A to a turbine that will finish repair soon, while Crew B waits at the depot until a large failure occurs. By encoding travel time and multiple crews, the DRL agent learned to avoid wasted trips and to keep crews busy, emulating optimal routing and scheduling decisions.</p>
      <p id="d2e1272">In <xref ref-type="bibr" rid="bib1.bibx26" id="text.69"/>, logistics is addressed in a spatial sense by using a GCN to group geographically close maintenance. In an offshore wind context, that could translate to handling nearby turbines in one outing to minimize transit.</p>
      <p id="d2e1279">Even without an explicit graph, a DRL agent can learn from cost feedback that doing maintenance on neighbouring turbines back to back saves the transit cost of multiple separate trips. Future research should incorporate more detailed logistics, such as vessel capacity, fuel cost, and inventory of spare parts, into the DRL state/reward. The benefit of doing so is that the learned policy becomes a holistic O&amp;M schedule that respects not just failure risks but also supply chain and labour constraints.</p>
</sec>
<sec id="Ch1.S5.SS4">
  <label>5.4</label><title>Prognostics and health management (PHM) data</title>
      <p id="d2e1290">Almost all DRL approaches in this area use PHM outputs (like condition monitoring and RUL predictions), which is essential domain knowledge for predictive maintenance. The difference is in how they incorporate it. For instance, one could use a discretized health state (e.g. “good”, “degraded”, “critical”) as part of the state space, which is easier for an agent to handle than raw sensor readings. The approach of <xref ref-type="bibr" rid="bib1.bibx27" id="text.70"/> of using a CNN on sensor data before the DRL agent is another way, essentially letting a deep learning model extract relevant features (e.g. vibration patterns) which portray the health knowledge.</p>
      <p id="d2e1296">For different applications, <xref ref-type="bibr" rid="bib1.bibx3" id="text.71"/> and <xref ref-type="bibr" rid="bib1.bibx2" id="text.72"/> integrate an input–output hidden Markov model to classify health states of turbofan engines and feed this into the DRL agent.</p>
      <p id="d2e1305">By combining PHM and DRL, these frameworks close the loop from condition monitoring to decision-making. The advantage is clear: the better the agent’s awareness of actual component condition (even if inferred), the better it can time maintenance. PHM can also be embedded in the model by shaping the reward; e.g. a large penalty for a failure effectively encodes the idea that “a catastrophic gearbox failure is very bad”, which the agent learns to avoid.</p>
      <p id="d2e1308">PHM-driven rewards can also be used, such as giving a small penalty for running a component in a highly degraded state (reflecting higher wear or risk).</p>
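<p>The two PHM ideas above, a discretized health state and a PHM-driven shaped reward, can be sketched together as follows; the thresholds and penalty magnitudes are assumed values for illustration only.</p>

```python
# Illustrative sketch: discretized PHM health states and PHM-driven reward
# shaping. Thresholds and penalty magnitudes are assumed values.
def health_state(rul_fraction: float) -> str:
    """Map a continuous RUL fraction (1.0 = new, 0.0 = failed) to a
    coarse label that is easier for an agent to exploit than raw sensors."""
    if rul_fraction > 0.6:
        return "good"
    if rul_fraction > 0.2:
        return "degraded"
    return "critical"

def shaped_reward(base_reward: float, rul_fraction: float, failed: bool) -> float:
    """Add PHM-driven terms: a large failure penalty and a small penalty
    for running a component in a highly degraded state."""
    r = base_reward
    if failed:
        r -= 10000.0     # encodes "a catastrophic failure is very bad"
    elif health_state(rul_fraction) == "critical":
        r -= 50.0        # mild nudge away from risky operation
    return r
```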
</sec>
<sec id="Ch1.S5.SS5">
  <label>5.5</label><title>Economic factors</title>
      <p id="d2e1319">Domain knowledge also includes economic factors like energy price and maintenance cost structure. Some works incorporate dynamic electricity prices or contractual penalties into the reward. For instance, if energy price forecasts are available, a DRL agent could decide to do maintenance when energy price (hence lost revenue) is low.</p>
      <p id="d2e1322">While not explicitly seen in the reviewed models, attention-based approaches are starting to include market price as part of the input to O&amp;M scheduling models <xref ref-type="bibr" rid="bib1.bibx25" id="paren.73"/>.</p>
      <p id="d2e1328">In offshore wind, including such factors could align maintenance not just with technical needs but also with business cycles (e.g. perform maintenance during low-demand seasons or when subsidy/price is low). Another economic constraint is the budget or resource limit. DRL can consider these by capping certain rewards or through state variables (like remaining budget).</p>
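<p>A minimal sketch of how an energy price and a budget limit might enter the reward; all prices, costs, and the hard-penalty mechanism for the budget are assumptions for illustration.</p>

```python
# Illustrative sketch: energy price and a budget limit entering the reward.
# All prices, costs, and the hard-penalty mechanism are assumptions.
def maintenance_reward(price_eur_mwh: float, lost_mwh: float,
                       repair_cost: float = 3000.0,
                       budget_left: float = float("inf")) -> float:
    """Lost revenue scales with the spot price, so maintaining in a
    low-price window costs less; exceeding the remaining budget is
    discouraged via a large penalty."""
    cost = repair_cost + price_eur_mwh * lost_mwh
    if cost > budget_left:
        return -1e6      # budget constraint as a hard penalty
    return -cost
```

<p>With these assumed numbers, maintaining during a 20 EUR MWh<sup>−1</sup> window yields −4000, versus −9000 at 120 EUR MWh<sup>−1</sup>, so the agent is steered towards low-price periods.</p>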
      <p id="d2e1331">Integrating domain-specific information can make DRL formulations more realistic and better aligned with the physical and operational characteristics of offshore wind systems. Figure <xref ref-type="fig" rid="F6"/> summarizes the types of knowledge commonly incorporated in existing studies, including wake interactions, weather accessibility, logistics constraints, degradation physics, and PHM-derived indicators. The role of these elements is to provide structure that may help the agent focus on features that are relevant to decision-making; nevertheless, it is worth noting that increased model complexity, additional training effort, or conflicting inputs may offset potential gains, and over-specifying the environment can make optimization more difficult rather than easier. Thus, domain knowledge can support DRL when carefully selected and validated, but its contribution depends on the quality, relevance, and reliability of the information provided.</p>

      <fig id="F6"><label>Figure 6</label><caption><p id="d2e1339">Domain-specific knowledge for offshore wind.</p></caption>
          <graphic xlink:href="https://wes.copernicus.org/articles/11/1185/2026/wes-11-1185-2026-f06.png"/>

        </fig>

      <p id="d2e1348">All the studies reviewed have, in one way or another, blended domain knowledge into their DRL approach: <xref ref-type="bibr" rid="bib1.bibx28" id="text.74"/> added wake/convective layers, Chatterjee added weather data, Kerkkamp added graph relations, Pinciroli added RUL and power forecasting, Abbas added HMM interpretable layers, etc. This synergy of domain expertise and reinforcement learning is key to developing trustworthy, high-performance maintenance policies that industry can adopt.</p>
      <p id="d2e1354">To compare the DRL-based maintenance planning approaches reviewed, Table <xref ref-type="table" rid="T2"/> highlights their key features, such as agents, algorithms, and problem formulations. All the mentioned approaches aim to optimize maintenance scheduling but differ in the algorithms and formulations used. The column <italic>agent</italic> refers to whether one policy controls the whole system or multiple coordinating policies exist, <italic>algorithm</italic> shows which training method was used, <italic>problem formulation</italic> indicates how the maintenance decision problem is modelled for the agent, and <italic>domain-specific knowledge</italic> indicates which categories of offshore-wind-specific knowledge are explicitly integrated in each study (e.g. wake/aerodynamic effects, weather, logistics, and PHM information).</p>

<table-wrap id="T2" specific-use="star"><label>Table 2</label><caption><p id="d2e1374">Comparison of deep reinforcement learning approaches for offshore wind farm maintenance.</p></caption><oasis:table frame="topbot"><oasis:tgroup cols="18">
     <oasis:colspec colnum="1" colname="col1" align="justify" colwidth="3.5cm"/>
     <oasis:colspec colnum="2" colname="col2" align="justify" colwidth="4.5cm"/>
     <oasis:colspec colnum="3" colname="col3" align="center"/>
     <oasis:colspec colnum="4" colname="col4" align="center" colsep="1"/>
     <oasis:colspec colnum="5" colname="col5" align="center"/>
     <oasis:colspec colnum="6" colname="col6" align="center"/>
     <oasis:colspec colnum="7" colname="col7" align="center"/>
     <oasis:colspec colnum="8" colname="col8" align="center"/>
     <oasis:colspec colnum="9" colname="col9" align="center"/>
     <oasis:colspec colnum="10" colname="col10" align="center" colsep="1"/>
     <oasis:colspec colnum="11" colname="col11" align="center"/>
     <oasis:colspec colnum="12" colname="col12" align="center"/>
     <oasis:colspec colnum="13" colname="col13" align="center"/>
     <oasis:colspec colnum="14" colname="col14" align="center" colsep="1"/>
     <oasis:colspec colnum="15" colname="col15" align="center"/>
     <oasis:colspec colnum="16" colname="col16" align="center"/>
     <oasis:colspec colnum="17" colname="col17" align="center"/>
     <oasis:colspec colnum="18" colname="col18" align="center"/>
     <oasis:thead>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry colname="col2" align="left">Reported gain(s)</oasis:entry>
         <oasis:entry namest="col3" nameend="col4" colsep="1">Agent </oasis:entry>
         <oasis:entry namest="col5" nameend="col10" colsep="1">Algorithm </oasis:entry>
         <oasis:entry namest="col11" nameend="col14" colsep="1">Problem </oasis:entry>
         <oasis:entry namest="col15" nameend="col18">Domain-specific </oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"/>
         <oasis:entry colname="col2" align="left"/>
         <oasis:entry rowsep="1" colname="col3"/>
         <oasis:entry rowsep="1" colname="col4"/>
         <oasis:entry rowsep="1" colname="col5"/>
         <oasis:entry rowsep="1" colname="col6"/>
         <oasis:entry rowsep="1" colname="col7"/>
         <oasis:entry rowsep="1" colname="col8"/>
         <oasis:entry rowsep="1" colname="col9"/>
         <oasis:entry rowsep="1" colname="col10"/>
         <oasis:entry rowsep="1" namest="col11" nameend="col14" colsep="1">formulation </oasis:entry>
         <oasis:entry rowsep="1" namest="col15" nameend="col18">knowledge </oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left">Reference</oasis:entry>
         <oasis:entry colname="col2" align="left"/>
         <oasis:entry colname="col3">Single</oasis:entry>
         <oasis:entry colname="col4">Multi</oasis:entry>
         <oasis:entry colname="col5">DQN</oasis:entry>
         <oasis:entry colname="col6">PPO</oasis:entry>
         <oasis:entry colname="col7">SAC</oasis:entry>
         <oasis:entry colname="col8">QMIX</oasis:entry>
         <oasis:entry colname="col9">W-QMIX</oasis:entry>
         <oasis:entry colname="col10">DCMAC</oasis:entry>
         <oasis:entry colname="col11">MDP</oasis:entry>
         <oasis:entry colname="col12">PO-MDP</oasis:entry>
         <oasis:entry colname="col13">Graph</oasis:entry>
         <oasis:entry colname="col14">Hierarchical</oasis:entry>
         <oasis:entry colname="col15">Aerodynamics</oasis:entry>
         <oasis:entry colname="col16">Weather</oasis:entry>
         <oasis:entry colname="col17">Logistics</oasis:entry>
         <oasis:entry colname="col18">PHM</oasis:entry>
       </oasis:row>
     </oasis:thead>
     <oasis:tbody>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"><xref ref-type="bibr" rid="bib1.bibx28" id="text.75"/></oasis:entry>
         <oasis:entry colname="col2" align="left"><inline-formula><mml:math id="M23" display="inline"><mml:mrow><mml:mo>+</mml:mo><mml:mn mathvariant="normal">11.1</mml:mn></mml:mrow></mml:math></inline-formula> % power generation vs. baseline schedule</oasis:entry>
         <oasis:entry colname="col3"><inline-formula><mml:math id="M24" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M25" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"><inline-formula><mml:math id="M26" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14"/>
         <oasis:entry colname="col15"><inline-formula><mml:math id="M27" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col16"/>
         <oasis:entry colname="col17"/>
         <oasis:entry colname="col18"/>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"><xref ref-type="bibr" rid="bib1.bibx3" id="text.76"/></oasis:entry>
         <oasis:entry colname="col2" align="left">NA</oasis:entry>
         <oasis:entry colname="col3"><inline-formula><mml:math id="M28" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M29" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14"><inline-formula><mml:math id="M30" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col15"/>
         <oasis:entry colname="col16"/>
         <oasis:entry colname="col17"/>
         <oasis:entry colname="col18"><inline-formula><mml:math id="M31" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"><xref ref-type="bibr" rid="bib1.bibx15" id="text.77"/></oasis:entry>
         <oasis:entry colname="col2" align="left"><inline-formula><mml:math id="M32" display="inline"><mml:mrow><mml:mo>∼</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula> % more cost-effective vs. threshold-based maintenance (total maintenance cost, case study)</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"><inline-formula><mml:math id="M33" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"><inline-formula><mml:math id="M34" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col9"><inline-formula><mml:math id="M35" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"><inline-formula><mml:math id="M36" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14"/>
         <oasis:entry colname="col15"/>
         <oasis:entry colname="col16"/>
         <oasis:entry colname="col17"/>
         <oasis:entry colname="col18"/>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"><xref ref-type="bibr" rid="bib1.bibx27" id="text.78"/></oasis:entry>
         <oasis:entry colname="col2" align="left">95.6 % reduction in unplanned downtime and maintenance cost vs. periodic maintenance</oasis:entry>
         <oasis:entry colname="col3"><inline-formula><mml:math id="M37" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"><inline-formula><mml:math id="M38" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14"/>
         <oasis:entry colname="col15"/>
         <oasis:entry colname="col16"/>
         <oasis:entry colname="col17"/>
         <oasis:entry colname="col18"><inline-formula><mml:math id="M39" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"><xref ref-type="bibr" rid="bib1.bibx13" id="text.79"/></oasis:entry>
         <oasis:entry colname="col2" align="left">23 % life-cycle cost reduction: EUR <inline-formula><mml:math id="M40" display="inline"><mml:mrow><mml:mn mathvariant="normal">2.32</mml:mn><mml:mo>×</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mn mathvariant="normal">4</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> vs. EUR <inline-formula><mml:math id="M41" display="inline"><mml:mrow><mml:mn mathvariant="normal">3.01</mml:mn><mml:mo>×</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mn mathvariant="normal">4</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> (DIAR vs. UIFR)</oasis:entry>
         <oasis:entry colname="col3"><inline-formula><mml:math id="M42" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M43" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col6"><inline-formula><mml:math id="M44" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"><inline-formula><mml:math id="M45" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14"/>
         <oasis:entry colname="col15"/>
         <oasis:entry colname="col16"/>
         <oasis:entry colname="col17"/>
         <oasis:entry colname="col18"/>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"><xref ref-type="bibr" rid="bib1.bibx59" id="text.80"/></oasis:entry>
         <oasis:entry colname="col2" align="left">NA</oasis:entry>
         <oasis:entry colname="col3"><inline-formula><mml:math id="M46" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"><inline-formula><mml:math id="M47" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"><inline-formula><mml:math id="M48" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14"/>
         <oasis:entry colname="col15"/>
         <oasis:entry colname="col16"/>
         <oasis:entry colname="col17"/>
         <oasis:entry colname="col18"/>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"><xref ref-type="bibr" rid="bib1.bibx26" id="text.81"/></oasis:entry>
         <oasis:entry colname="col2" align="left">NA</oasis:entry>
         <oasis:entry colname="col3"><inline-formula><mml:math id="M49" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M50" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"><inline-formula><mml:math id="M51" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col13"><inline-formula><mml:math id="M52" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col14"/>
         <oasis:entry colname="col15"/>
         <oasis:entry colname="col16"/>
         <oasis:entry colname="col17"><inline-formula><mml:math id="M53" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col18"/>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"><xref ref-type="bibr" rid="bib1.bibx37" id="text.82"/></oasis:entry>
         <oasis:entry colname="col2" align="left"><inline-formula><mml:math id="M54" display="inline"><mml:mrow><mml:mo>∼</mml:mo><mml:mn mathvariant="normal">20</mml:mn></mml:mrow></mml:math></inline-formula> % more cost-effective vs. threshold-based maintenance (total maintenance cost, case study)</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"><inline-formula><mml:math id="M55" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"><inline-formula><mml:math id="M56" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col9"><inline-formula><mml:math id="M57" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"><inline-formula><mml:math id="M58" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14"/>
         <oasis:entry colname="col15"/>
         <oasis:entry colname="col16"/>
         <oasis:entry colname="col17"/>
         <oasis:entry colname="col18"/>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"><xref ref-type="bibr" rid="bib1.bibx42" id="text.83"/></oasis:entry>
         <oasis:entry colname="col2" align="left">NA</oasis:entry>
         <oasis:entry colname="col3"><inline-formula><mml:math id="M59" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"><inline-formula><mml:math id="M60" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"><inline-formula><mml:math id="M61" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14"/>
         <oasis:entry colname="col15"/>
         <oasis:entry colname="col16"><inline-formula><mml:math id="M62" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col17"><inline-formula><mml:math id="M63" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col18"/>
       </oasis:row>
       <oasis:row rowsep="1">
         <oasis:entry colname="col1" align="left"><xref ref-type="bibr" rid="bib1.bibx11" id="text.84"/></oasis:entry>
         <oasis:entry colname="col2" align="left">NA</oasis:entry>
         <oasis:entry colname="col3"><inline-formula><mml:math id="M64" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col4"/>
         <oasis:entry colname="col5"><inline-formula><mml:math id="M65" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"/>
         <oasis:entry colname="col11"><inline-formula><mml:math id="M66" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col12"/>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14"/>
         <oasis:entry colname="col15"/>
         <oasis:entry colname="col16"><inline-formula><mml:math id="M67" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col17"/>
         <oasis:entry colname="col18"/>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1" align="left"><xref ref-type="bibr" rid="bib1.bibx6" id="text.85"/></oasis:entry>
         <oasis:entry colname="col2" align="left">NA</oasis:entry>
         <oasis:entry colname="col3"/>
         <oasis:entry colname="col4"><inline-formula><mml:math id="M68" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col5"/>
         <oasis:entry colname="col6"/>
         <oasis:entry colname="col7"/>
         <oasis:entry colname="col8"/>
         <oasis:entry colname="col9"/>
         <oasis:entry colname="col10"><inline-formula><mml:math id="M69" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col11"/>
         <oasis:entry colname="col12"><inline-formula><mml:math id="M70" display="inline"><mml:mo>✓</mml:mo></mml:math></inline-formula></oasis:entry>
         <oasis:entry colname="col13"/>
         <oasis:entry colname="col14"/>
         <oasis:entry colname="col15"/>
         <oasis:entry colname="col16"/>
         <oasis:entry colname="col17"/>
         <oasis:entry colname="col18"/>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table><table-wrap-foot><p id="d2e1377">NA means not available.</p></table-wrap-foot></table-wrap>

      <p id="d2e2370">In Fig. <xref ref-type="fig" rid="F7"/>, we provide a comparative view of the distribution of key features across the reviewed literature. The pie chart on the left shows the proportion of single- versus multi-agent frameworks. The central pie chart outlines the distribution of algorithms, such as DQN, PPO, SAC, and QMIX variants, showing that DQN remains the most commonly used DRL algorithm even though its action space grows combinatorially with the number of components or maintenance tasks, which limits its applicability in larger-scale problems. Finally, the pie chart on the right highlights the prevalence of MDP- and POMDP-based modelling.</p>

      <fig id="F7" specific-use="star"><label>Figure 7</label><caption><p id="d2e2378">Comparative view of single vs. multi-agent methods, main deep reinforcement learning algorithms, and problem formulations in the reviewed literature.</p></caption>
          <graphic xlink:href="https://wes.copernicus.org/articles/11/1185/2026/wes-11-1185-2026-f07.png"/>

        </fig>

      <p id="d2e2387">With a clear understanding of the diverse methodologies used to model offshore wind maintenance planning, we now turn to the practical impact of these approaches. The following section summarizes the key contributions and performance improvements achieved through the application of DRL, illustrating how these methodological choices translate into tangible benefits such as cost reduction, improved reliability, and enhanced operational efficiency.</p>
</sec>
</sec>
<sec id="Ch1.S6">
  <label>6</label><title>Discussion</title>
      <p id="d2e2399">The application of DRL to offshore wind O&amp;M has shown clear advantages over conventional maintenance strategies, achieving lower costs, higher availability, and more adaptive planning. The reviewed studies highlight five recurring benefits: (i) <italic>cost and downtime reduction</italic>, as agents learn to time interventions “just in time” before failure; (ii) <italic>enhanced predictive maintenance</italic>, through integration of remaining useful life (RUL) data and operational context; (iii) <italic>opportunistic maintenance</italic>, by grouping actions and exploiting favourable conditions; (iv) <italic>improved reliability and safety</italic>, via proactive scheduling and embedded risk constraints; and (v) <italic>computational scalability</italic>, enabling optimization of large, stochastic systems.</p>
      <p id="d2e2417">Finally, we address the remaining gaps: testing the models on real-world scenarios and including multi-level repair actions that reflect actual operational practice.</p>
      <p id="d2e2420">The following subsections discuss each benefit in detail, illustrating how DRL contributes to more cost-efficient and resilient offshore wind maintenance planning.</p>
<sec id="Ch1.S6.SS1">
  <label>6.1</label><title>Cost and downtime reduction</title>
      <p id="d2e2430">A primary goal is lowering maintenance costs and turbine downtime compared to baseline strategies (reactive or scheduled). DRL agents have demonstrated the ability to significantly reduce unplanned failures and associated costs.</p>
      <p id="d2e2433">For example, <xref ref-type="bibr" rid="bib1.bibx13" id="text.86"/> report that their PPO-based adaptive inspection and repair policy (DIAR) achieves an expected life-cycle cost of EUR <inline-formula><mml:math id="M71" display="inline"><mml:mrow><mml:mn mathvariant="normal">2.32</mml:mn><mml:mo>×</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mn mathvariant="normal">4</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> compared to EUR <inline-formula><mml:math id="M72" display="inline"><mml:mrow><mml:mn mathvariant="normal">3.01</mml:mn><mml:mo>×</mml:mo><mml:msup><mml:mn mathvariant="normal">10</mml:mn><mml:mn mathvariant="normal">4</mml:mn></mml:msup></mml:mrow></mml:math></inline-formula> for the best uniform-interval, fixed-threshold strategy (UIFR), i.e. a 23 % reduction in their case study.</p>
      <p id="d2e2469">In <xref ref-type="bibr" rid="bib1.bibx27" id="text.87"/>, the DRL-based policy reduced unplanned downtime and maintenance cost in a multi-component system by 95.6 % versus conventional periodic maintenance.</p>
      <p id="d2e2475">These improvements come from the agent’s ability to optimize the timing of maintenance: servicing components “just in time” before failure, yet not so early that remaining useful life is wasted. DRL policies find this sweet spot through continuous learning and adjustment.</p>
</sec>
<sec id="Ch1.S6.SS2">
  <label>6.2</label><title>Predictive (PHM-based) strategies</title>
      <p id="d2e2487">Many wind operators use condition-based triggers (like RUL thresholds from prognostics) to plan maintenance. DRL can enhance these predictive strategies by adding dynamic decision-making. As noted in <xref ref-type="bibr" rid="bib1.bibx42" id="text.88"/>, the DRL policy outperformed a pure RUL threshold policy by considering not only the component health but also external factors such as power demand and crew availability. In other words, whereas a standard predictive maintenance strategy says “replace component <inline-formula><mml:math id="M73" display="inline"><mml:mi>X</mml:mi></mml:math></inline-formula> when its RUL <inline-formula><mml:math id="M74" display="inline"><mml:mo>&lt;</mml:mo></mml:math></inline-formula> <inline-formula><mml:math id="M75" display="inline"><mml:mi>Y</mml:mi></mml:math></inline-formula> days”, a DRL agent might learn “replace component <inline-formula><mml:math id="M76" display="inline"><mml:mi>X</mml:mi></mml:math></inline-formula> when RUL <inline-formula><mml:math id="M77" display="inline"><mml:mo>&lt;</mml:mo></mml:math></inline-formula> <inline-formula><mml:math id="M78" display="inline"><mml:mi>Y</mml:mi></mml:math></inline-formula> days <italic>and</italic> a maintenance team is idle <italic>and</italic> a low-wind period is coming up”, thereby minimizing impact. This kind of contextual decision-making led to higher reward (profit) in their experiments. We see a similar theme in <xref ref-type="bibr" rid="bib1.bibx28" id="text.89"/>: their DQN agent, augmented with wake effect knowledge, boosted energy production beyond what a wake-unaware strategy achieved. By learning a near-optimal policy through trial and error, DRL approaches can exceed the performance of both corrective maintenance (which incurs high downtime) and simple predictive rules (which might be myopic or inflexible).</p>
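<p>The contrast between a fixed RUL threshold and the kind of contextual rule a DRL agent might converge to can be written out as follows; the predicates and the 14-day threshold are illustrative assumptions, not a policy reported in the cited work.</p>

```python
# Illustrative contrast between a fixed RUL-threshold rule and the kind of
# contextual rule a DRL agent might converge to; the predicates and the
# 14-day threshold are assumptions.
def threshold_policy(rul_days: float, y: float = 14.0) -> bool:
    """Standard predictive rule: maintain when RUL < Y days."""
    return rul_days < y

def contextual_policy(rul_days: float, crew_idle: bool,
                      low_wind_forecast: bool, y: float = 14.0) -> bool:
    """Learned-style rule: maintain only when the component is near end of
    life AND a crew is free AND a low-production window limits lost revenue."""
    return rul_days < y and crew_idle and low_wind_forecast
```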
</sec>
<sec id="Ch1.S6.SS3">
  <label>6.3</label><title>Opportunistic maintenance</title>
      <p id="d2e2553">DRL has shown strength in exploiting opportunistic maintenance opportunities that humans or simple policies might miss. For instance, in multi-component scenarios, an agent can coordinate maintenance on multiple turbines in one go to avoid repeated downtime. The PPO agent in Pinciroli’s work learned to wait for low-wind output days to schedule maintenance, which is an opportunistic behaviour yielding higher rewards. <xref ref-type="bibr" rid="bib1.bibx21" id="text.90"/> observed that, even when not explicitly programmed, DRL agents naturally learn to group multiple repairs together, thereby reducing repetitive downtime and sharing high logistics costs over several interventions. Similar findings can be found in <xref ref-type="bibr" rid="bib1.bibx16" id="text.91"/> and <xref ref-type="bibr" rid="bib1.bibx51" id="text.92"/>. <xref ref-type="bibr" rid="bib1.bibx15" id="text.93"/> demonstrated this with their multi-agent approach, where the learned policy effectively grouped maintenance tasks to save on shared downtime costs, beating a policy that treats components independently. Even in single-agent setups, agents learn to use times of low production or existing outages to perform additional repairs. In the context of offshore wind, for example, an agent might schedule a minor repair on one turbine when a vessel is already en route for a major repair on a neighbouring turbine, effectively reducing additional transit costs.</p>
</sec>
<sec id="Ch1.S6.SS4">
  <label>6.4</label><title>Reliability and safety</title>
      <p id="d2e2577">An important point is that DRL policies can improve reliability metrics (e.g. mean time between failures, availability) by preventing failures proactively. <xref ref-type="bibr" rid="bib1.bibx27" id="text.94"/> noted fewer failures in their DRL-maintained system than under a conventional approach. Additionally, DRL can incorporate safety constraints (such as not allowing maintenance deferral beyond a limit) via rewards or state features. The hierarchical approach by <xref ref-type="bibr" rid="bib1.bibx2" id="text.95"/> is aimed at safety-critical maintenance: by integrating an interpretable model, they ensure that the DRL decisions for turbofan engines remain within safe bounds and can be understood by engineers. In offshore wind, ensuring that a DRL policy does not inadvertently run turbines to catastrophic failure is essential. Studies so far show that, with proper reward design (heavily penalizing failures), agents naturally learn to avoid risky deferrals.</p>
</sec>
<sec id="Ch1.S6.SS5">
  <label>6.5</label><title>Computational feasibility</title>
      <p id="d2e2594">Another finding across the literature is that DRL can handle high-dimensional problems that were previously intractable by brute-force optimization. Maintenance scheduling for a wind farm with many turbines, each with several components, is a huge combinatorial problem over a long horizon. Some studies also accelerate learning by transfer or imitation. For example, <xref ref-type="bibr" rid="bib1.bibx42" id="text.96"/> initialized their PPO agent via imitation learning from a heuristic policy to shorten training time. This hybrid approach marries human insight with AI optimization. The result is a practical decision-support tool that can quickly recommend which turbine to maintain and when, given the current observations.</p>
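A minimal sketch of such an imitation warm start follows, assuming a tabular softmax policy cloned from a hypothetical heuristic teacher via cross-entropy updates. The heuristic, RUL discretization, and learning rate are all illustrative; the cited work's actual setup may differ.

```python
import math
import random

def heuristic(rul: int) -> int:
    """Hypothetical heuristic teacher: repair (1) when RUL drops below 30 days."""
    return 1 if rul < 30 else 0

# Tabular softmax policy: one logit pair per discretized RUL bin.
N_BINS, N_ACTIONS, LR = 10, 2, 0.5
logits = [[0.0] * N_ACTIONS for _ in range(N_BINS)]

def policy_probs(bin_idx: int) -> list[float]:
    """Softmax over the two actions (wait=0, repair=1) for one RUL bin."""
    z = logits[bin_idx]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Behaviour cloning: cross-entropy gradient steps toward the heuristic's choice.
random.seed(0)
for _ in range(2000):
    rul = random.randrange(100)
    b, a_star = rul // 10, heuristic(rul)
    p = policy_probs(b)
    for a in range(N_ACTIONS):
        logits[b][a] -= LR * (p[a] - (1.0 if a == a_star else 0.0))
# The cloned policy now mimics the heuristic and could seed PPO fine-tuning.
```

After cloning, the policy already reproduces the heuristic's behaviour, so subsequent RL training starts from sensible decisions rather than random exploration.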
      <p id="d2e2600">Model-based or brute-force methods, by contrast, struggle to consider all contingencies and long-term effects <xref ref-type="bibr" rid="bib1.bibx12" id="paren.98"/>, whereas DRL, with its experience-driven learning, provides feasible solutions. Nevertheless, training can be computationally intensive (e.g. W-QMIX took 12 h in the 13-component case, <xref ref-type="bibr" rid="bib1.bibx37" id="altparen.97"/>), although once trained, the policy can execute decisions in real time.</p>
      <p id="d2e2612">The consensus of recent work is that DRL-based maintenance planning consistently outperforms static strategies in simulation, often by a wide margin in cost savings or uptime. These improvements stem from DRL’s ability to learn optimal scheduling under uncertainty, adapt to varying conditions (weather, load demands), and coordinate multiple decisions in a way that human-designed rules cannot easily mimic.</p>
      <p id="d2e2616">While the performance gains of DRL-based maintenance strategies are evident, their success is further amplified by the integration of domain-specific knowledge. The next section focuses on how incorporating elements such as wind farm aerodynamics, weather constraints, logistical considerations, and PHM data not only enriches the state representations but also guides the learning process toward more realistic and reliable maintenance policies.</p>
</sec>
<sec id="Ch1.S6.SS6">
  <label>6.6</label><title>Simulation-based studies vs. real-world applications</title>
      <p id="d2e2628">Simulation-based research has been the predominant way to develop and evaluate DRL for wind farm maintenance. All the studies reviewed above rely on simulated environments, typically a stochastic model of turbine degradation and failure, combined with models of maintenance actions (costs, durations, effects) and often weather or logistics simulators.</p>
      <p id="d2e2631">For instance, <xref ref-type="bibr" rid="bib1.bibx28" id="text.99"/> used a simulation of wind dynamics and turbine wakes to train their DQN ensemble; <xref ref-type="bibr" rid="bib1.bibx42" id="text.100"/> built a custom simulator for turbine failures and crew assignments to test PPO. Simulation is essential because it provides a safe and flexible sandbox to train DRL agents (which often require millions of decision steps for convergence). Researchers can speed up or repeat scenarios (like many years of operation) to expose the agent to rare events, something impossible to do quickly in the real world.</p>
      <p id="d2e2640">Key insights from simulation studies include the significant performance gains of DRL policies over traditional maintenance strategies. Many papers report that their learned policies yield lower cost or higher availability than periodic (time-based) or reactive (run-to-failure) maintenance. For example, the agent of <xref ref-type="bibr" rid="bib1.bibx42" id="text.101"/> outperformed corrective and age-based schedules, with fewer failures and lower cost, mirroring the results of <xref ref-type="bibr" rid="bib1.bibx6" id="text.102"/> against periodic policies. Such results, consistently observed in simulation, build a compelling case for applying DRL in practice.</p>
      <p id="d2e2649">That said, real-world applications of DRL for offshore wind O&amp;M remain at an early stage, and published demonstrations typically rely on simplified physical and operational assumptions. A recent example is the work by <xref ref-type="bibr" rid="bib1.bibx28" id="text.103"/>, developed in collaboration with an industry research institute (KEPCO). Their DQN-based scheduling framework illustrates how DRL can, in principle, coordinate maintenance actions and exploit wake interactions to improve farm-level energy production. However, their study relies on the Jensen kinematic wake model, which, while computationally efficient, provides only coarse accuracy in the near-wake region and may misrepresent wake recovery dynamics. Consequently, the reported power-gain improvements (i.e. an 11.1 % increase over their baseline scheduling strategy) should be interpreted as gains within the constraints of a simplified simulation environment <xref ref-type="bibr" rid="bib1.bibx28" id="paren.104"/>. Additional assumptions, such as fixed maintenance durations, deterministic task lists, and the absence of vessel or access constraints, further limit the generalizability of the results.</p>
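For background on the wake model in question, the Jensen (Park) kinematic model predicts a top-hat axial velocity deficit that decays quadratically with downstream distance. A minimal sketch, with illustrative turbine parameters (the thrust coefficient, rotor diameter, and wake decay constant below are assumed values):

```python
import math

def jensen_deficit(x: float, ct: float = 0.8, d_rotor: float = 120.0,
                   k: float = 0.04) -> float:
    """Fractional axial velocity deficit at downstream distance x (m) behind
    the rotor, per the Jensen (Park) model:
        (1 - sqrt(1 - Ct)) / (1 + 2*k*x/D)**2
    k is the wake decay constant (around 0.04 offshore). The top-hat profile
    and this simple decay are what make the model coarse in the near wake."""
    return (1.0 - math.sqrt(1.0 - ct)) / (1.0 + 2.0 * k * x / d_rotor) ** 2
```

The deficit is largest at the rotor plane and decays monotonically downstream; the model carries no near-wake physics or wake-meandering effects, which is the fidelity limitation noted above.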
      <p id="d2e2659">These studies therefore highlight both the potential and the current limitations of DRL for maintenance planning: DRL can identify useful scheduling patterns in controlled testbeds, but its effectiveness in operational settings remains dependent on the fidelity of the underlying simulator. More comprehensive validation using higher-fidelity wake models, realistic metocean variability, and operational logistics will be essential to assess the practical applicability of DRL-based maintenance planning.</p>
      <p id="d2e2662">While the deployment of a trained agent in a live wind farm has not been reported yet, the involvement of an industrial stakeholder and the real-world fidelity of the simulation (including measured wake effects) indicate a step toward actual adoption.</p>
      <p id="d2e2665">In fields like aviation and manufacturing, some DRL-based maintenance planners have been tested on real data if not deployed directly. For instance, the turbofan case by <xref ref-type="bibr" rid="bib1.bibx3" id="text.105"/> used NASA engine datasets to train and validate their approach. This kind of validation on real-world data builds confidence that the policies will translate from simulation to reality. Moreover, ongoing advances in digital twins for wind farms <xref ref-type="bibr" rid="bib1.bibx57" id="paren.106"/> (high-fidelity replicas of turbines and operations) may enable DRL agents to be trained or at least fine-tuned in an environment that closely mimics reality.</p>
</sec>
<sec id="Ch1.S6.SS7">
  <label>6.7</label><title>Multi-level repairs</title>
      <p id="d2e2682">Traditional maintenance models in offshore wind operations typically reduce decisions to a simple binary choice: repair or do not repair. However, real-world observations show that turbine downtime arises from a spectrum of failures. For instance, industry data reveal that roughly 70 % of downtime is caused by major repairs, about 17 % by minor repairs, and the remainder by simple resets <xref ref-type="bibr" rid="bib1.bibx10" id="paren.107"/>.</p>
      <p id="d2e2688">This suggests that maintenance actions in the field are inherently multi-level, ranging from minor fixes that temporarily restore performance to major overhauls or full component replacements.</p>
      <p id="d2e2691">To make the notion of <italic>multi-level</italic> maintenance actions more concrete in a wind context, <xref ref-type="bibr" rid="bib1.bibx1" id="text.108"/> study preventive maintenance for a wind turbine gearbox under temperature-based condition monitoring. They compare two strategies: (i) a commonly adopted industrial strategy in which, whenever a threshold temperature is exceeded, the turbine is temporarily derated while the gearbox is cooled and, once such events become frequent enough, the gearbox is ultimately renewed (replaced or overhauled); and (ii) a multi-level strategy in which each threshold exceedance triggers an imperfect preventive maintenance (PM) action that partially restores the gearbox condition by reducing its failure rate to a value between the current rate and that of a new gearbox, with renewal enforced only after <inline-formula><mml:math id="M79" display="inline"><mml:mi>N</mml:mi></mml:math></inline-formula> imperfect PM actions. Their numerical comparison shows that neither “renewal only” nor “imperfect PM then renewal” dominates universally: the more economical strategy depends on gearbox reliability and on the relative magnitudes of the production loss, cooling, preventive maintenance, and renewal logistics costs. This case illustrates why binary “repair/replace” abstractions can be limiting for offshore wind O&amp;M: introducing intermediate action levels (e.g. partial restoration actions before renewal) enables the policy to express realistic trade-offs between short-term production impacts and long-term degradation management.</p>
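The imperfect-PM mechanism can be sketched with a simple restoration-factor parameterization; the restoration factor, failure rates, and renewal rule below are assumptions chosen for illustration, not the cited paper's exact model.

```python
def pm_failure_rate(current_rate: float, new_rate: float,
                    restoration: float = 0.5) -> float:
    """Illustrative imperfect-PM effect: the post-maintenance failure rate
    lies between the as-new rate and the current rate, controlled by a
    restoration factor in (0, 1]. restoration=1 would mean as-good-as-new."""
    return new_rate + (1.0 - restoration) * (current_rate - new_rate)

def strategy_step(rate: float, pm_count: int, n_max: int,
                  new_rate: float = 0.01):
    """One threshold exceedance: imperfect PM until N actions have been
    taken, after which renewal resets the rate (and the PM counter)."""
    if pm_count >= n_max:
        return new_rate, 0  # renewal: back to the as-new failure rate
    return pm_failure_rate(rate, new_rate), pm_count + 1
```

Each imperfect PM only partly closes the gap to the as-new failure rate, so the rate ratchets upward across cycles until renewal, which is exactly the trade-off the multi-level strategy lets a policy exploit.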
      <p id="d2e2707">While incorporating these multi-level actions enriches the state–action representation, it also exposes a significant gap in current DRL models for offshore wind O&amp;M: most existing studies remain confined to binary decision frameworks.</p>
      <p id="d2e2711">The next section examines this lack of integration in greater detail, discussing how DRL-based planning models can incorporate multi-level maintenance strategies that allow agents to choose among different maintenance tasks. Furthermore, we explore the implications of this refined modelling for cost, reliability, and overall operational efficiency.</p>
</sec>
</sec>
<sec id="Ch1.S7">
  <label>7</label><title>Future directions</title>
      <p id="d2e2723">Traditional maintenance models in offshore wind often reduce decisions to a binary choice (e.g. to repair or not). In a DRL formulation, the maintenance planning problem can instead be recast as a Markov decision process whose state includes asset features such as component condition, component age, and weather windows, while the action space is expanded beyond a simple “maintain or not” decision.</p>
      <p id="d2e2726">This expanded action space enables the agent to choose among multiple repair options. Such an approach would allow the agent to balance cost and reliability by, for example, deploying a less expensive interim repair when system health is marginally degraded or committing to a full repair only when necessary.</p>
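A minimal sketch of such a formulation, with hypothetical state features and action levels (the feature set and the four-level action space are illustrative, not drawn from any one reviewed study):

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    """Maintenance action set expanded beyond a binary "maintain or not"
    choice (illustrative levels)."""
    DO_NOTHING = 0
    MINOR_REPAIR = 1
    MAJOR_REPAIR = 2
    REPLACE = 3

@dataclass
class State:
    """One possible state encoding for the maintenance MDP (assumed features)."""
    component_condition: float  # normalized health indicator in [0, 1]
    age_days: int               # component age
    weather_window_days: int    # length of the upcoming accessible window
```

A policy over this MDP maps each `State` to one of the four `Action` levels, which is precisely what lets it prefer an interim minor repair over an immediate full replacement when conditions warrant.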
      <p id="d2e2729">Incorporating minor repairs as a viable action can prevent small degradations from escalating into catastrophic failures. DRL agents trained on multi-level maintenance tasks can learn to execute low-level fixes when early signs of degradation appear, thereby extending component life and improving overall system reliability. <xref ref-type="bibr" rid="bib1.bibx54" id="text.109"/>, for example, demonstrated that a DRL-based policy for structural maintenance maintained high reliability by optimally balancing minor and major interventions across numerous components.</p>
      <p id="d2e2735">Potential avenues for enabling a DRL agent to consider multi-action maintenance planning include the following:</p>
      <p id="d2e2739"><def-list>
          <def-item><term><bold>Expanded action spaces.</bold></term><def>

      <p id="d2e2748">Researchers have begun explicitly modelling a range of actions, such as “do nothing”, “perform minor repair”, “conduct major repair”, or “replace component”. For example, <xref ref-type="bibr" rid="bib1.bibx56" id="text.110"/> formulate an infinite-horizon DRL maintenance problem where a component’s health can be partially recovered through an imperfect repair or fully restored via corrective maintenance. This expanded action space helps the agent learn which level of intervention yields the optimal long-term cost and reliability trade-off.</p>
          </def></def-item>
          <def-item><term><bold>Parameterized and hybrid actions.</bold></term><def>

      <p id="d2e2761">To manage the complexity arising from multiple discrete repair choices combined with continuous variables (e.g. the timing of intervention), advanced approaches employ parameterized action spaces. In one instance, a parameterized PPO algorithm was developed to handle mixed discrete–continuous decisions, effectively allowing the agent to adjust both the type of repair and the timing simultaneously <xref ref-type="bibr" rid="bib1.bibx56" id="paren.111"/>. This structured action space helps maintain convergence stability while exploring complex repair policies.</p>
          </def></def-item>
          <def-item><term><bold>Multi-agent decomposition.</bold></term><def>

      <p id="d2e2774">Another promising strategy is to decompose the large-scale maintenance problem by modelling each turbine or even each component as an individual RL agent within a cooperative framework. For example, <xref ref-type="bibr" rid="bib1.bibx47" id="text.112"/> address the explosion of the action space in multi-level preventive maintenance by treating each machine as an independent agent that coordinates with others. Such a decomposition not only alleviates scalability issues but also enables the agents to learn localized strategies that can later be integrated for holistic wind farm maintenance.</p>
          </def></def-item>
        </def-list></p>
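The distinction between imperfect and full restoration, and the pairing of a discrete repair type with a continuous timing variable, can be sketched as follows; the recovery dynamics and efficiency value are illustrative assumptions rather than the cited formulations.

```python
def apply_action(health: float, action: str, efficiency: float = 0.6) -> float:
    """Illustrative transition for multi-level repairs (assumed dynamics):
    an imperfect repair recovers a fraction `efficiency` of the lost health,
    while corrective replacement restores the component to as-new."""
    if action == "replace":
        return 1.0
    if action == "imperfect_repair":
        return health + efficiency * (1.0 - health)
    return health  # "do_nothing"

# A parameterized (hybrid) action pairs a discrete repair type with a
# continuous timing variable, e.g. ("imperfect_repair", 4.5) meaning
# "perform an imperfect repair after a delay of 4.5 days".
hybrid_action = ("imperfect_repair", 4.5)
```

The agent's long-term trade-off then emerges from these dynamics: repeated imperfect repairs are cheap but never fully restore health, while replacement is expensive but resets degradation entirely.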
      <p id="d2e2782">Despite the promise of these approaches, challenges remain in training and deployment. The high dimensionality of both state and action spaces, especially in a wind farm with hundreds of turbines, can lead to a combinatorial explosion of decision possibilities. Techniques such as multi-agent reinforcement learning are being explored to address these scalability concerns. Cross-industry insights from manufacturing, aerospace, and civil infrastructure further suggest that lessons learned in one domain (e.g. mission-aware planning or adaptive grouping strategies) can be effectively translated to offshore wind maintenance planning.</p>
      <p id="d2e2785">In summary, by moving beyond binary choices, these models could capture the inherent complexity of real-world O&amp;M practices and facilitate the development of policies that more accurately balance short-term fixes with long-term reliability. This advancement would represent a significant step toward more realistic and effective DRL-based maintenance planning in offshore wind operations.</p>
      <p id="d2e2788">The concluding section synthesizes the insights gathered from the reviewed models, underscoring the strengths, current limitations, and future directions for integrating DRL into operational decision-making frameworks that could revolutionize maintenance planning in the renewable energy sector.</p>
</sec>
<sec id="Ch1.S8" sec-type="conclusions">
  <label>8</label><title>Conclusions</title>
      <p id="d2e2799">The literature demonstrates that DRL is a promising approach for offshore wind farm maintenance planning. By learning from interactions in a simulated environment, DRL agents can devise maintenance policies that outperform traditional corrective, time-based, and other predictive strategies on key metrics like cost, downtime, and energy production. Both single-agent and multi-agent frameworks have been explored: single-agent DRL has succeeded in optimizing complex maintenance schedules by considering long-term consequences, while multi-agent DRL offers a path to scaling these solutions to larger systems by decentralizing decisions. Moreover, the most successful studies embed domain-specific knowledge, from wake physics and weather patterns to prognostic models, into the DRL process, creating hybrid solutions that learn efficiently and behave realistically.</p>
      <p id="d2e2802">The reviewed works highlight several advantages of DRL in offshore wind O&amp;M: adaptive scheduling that responds to the actual condition of turbines, opportunistic maintenance that smartly times actions to minimize impact on operations, and the ability to handle the high dimensionality of scheduling problems that defy mathematical or brute-force optimization. For example, DRL agents have learned to schedule maintenance during low-wind periods <xref ref-type="bibr" rid="bib1.bibx38" id="paren.113"/>, to cluster repairs and save vessel trips <xref ref-type="bibr" rid="bib1.bibx26" id="paren.114"/>, and to prevent failures by reacting to prognostic alarms better than fixed rules. These capabilities translate into quantifiable gains: double-digit percentage improvements in cost savings or energy output in case studies are common <xref ref-type="bibr" rid="bib1.bibx28 bib1.bibx37" id="paren.115"/>.</p>
      <p id="d2e2814">However, challenges remain to be solved before DRL becomes commonplace in live offshore wind operations. Safety and trust are critical aspects: operators need assurance that an AI agent would not recommend catastrophic decisions. This is why interpretability (as in the hierarchical HMM approach) and extensive testing are crucial. Computational efficiency is also a concern: multi-agent or long-horizon DRL can be computationally intensive in the training phase, though improvements in algorithms and hardware mitigate this. Additionally, integration with existing maintenance management systems requires user-friendly interfaces and perhaps human-in-the-loop designs (where human planners can review or override AI suggestions).</p>
      <p id="d2e2817">Moreover, a critical limitation in current DRL applications is the overly simplistic treatment of repair actions. Most methods consider only one kind of repair, typically reducing the decision to whether to replace a component. In practice, maintenance is a multifaceted process that often involves choosing among various repair strategies, each with distinct implications for system performance and cost. For instance, an optimal policy should not only determine the optimal timing of an intervention but also decide which specific repair tasks to undertake based on the current condition of components. Addressing this gap requires developing agents capable of discerning a richer set of actions that reflect the complexities of real-world maintenance tasks.</p>
      <p id="d2e2821">For practitioners and researchers, these findings support moving toward DRL-based decision-support tools for wind farm O&amp;M. As offshore wind farms continue to grow and operational data accumulate, DRL approaches are poised to become integral to optimizing maintenance planning, ultimately lowering the cost of renewable energy and improving the reliability of wind power generation.</p>
      <p id="d2e2824">Despite efforts to follow a transparent and reproducible literature selection protocol as defined in Sect. <xref ref-type="sec" rid="Ch1.S1"/>, several threats to validity remain. First, <italic>publication bias</italic> may affect the evidence base: studies reporting positive performance gains of DRL approaches are more likely to be published than negative or inconclusive results, and industrial deployments are often underreported due to confidentiality. As a consequence, the reviewed corpus may over-represent successful proof-of-concept demonstrations and under-represent failure cases or practical limitations. Second, <italic>reproducibility</italic> is constrained by limited access to high-fidelity O&amp;M simulators and proprietary data (e.g. SCADA/CM and maintenance logs) and incomplete reporting of experimental details (e.g. hyperparameters, reward shaping, baseline tuning, random seeds). These issues complicate rigorous replication and can make cross-paper comparisons sensitive to implementation choices rather than underlying algorithmic differences. Third, <italic>generalizability</italic> is limited because most reviewed studies evaluate DRL in stylized simulation environments with simplified wake, metocean, and logistics assumptions; consequently, reported gains may not transfer to real offshore operations or to farms with different layouts, failure modes, contractual constraints, and accessibility regimes. Finally, while many insights into DRL modelling choices (e.g. partial observability remedies, multi-agent coordination, multi-level actions) are transferable beyond offshore wind, their effectiveness may vary substantially across domains and should not be assumed without domain-specific validation.</p>
</sec>

      
      </body>
    <back><app-group>

<app id="App1.Ch1.S1">
  <label>Appendix A</label><title>Nomenclature and abbreviations</title>

        <table-wrap position="anchor"><oasis:table><oasis:tgroup cols="2">
     <oasis:colspec colnum="1" colname="col1" align="left"/>
     <oasis:colspec colnum="2" colname="col2" align="left"/>
     <oasis:tbody>
       <oasis:row>
         <oasis:entry colname="col1">A2C</oasis:entry>
         <oasis:entry colname="col2">Advantage actor critic</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">A3C</oasis:entry>
         <oasis:entry colname="col2">Asynchronous advantage actor critic</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">CNN</oasis:entry>
         <oasis:entry colname="col2">Convolutional neural network</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">DCMAC</oasis:entry>
         <oasis:entry colname="col2">Deep centralized multi-agent actor critic</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">DDPG</oasis:entry>
         <oasis:entry colname="col2">Deep deterministic policy gradient</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">DDQN</oasis:entry>
         <oasis:entry colname="col2">Double deep Q network</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">DQN</oasis:entry>
         <oasis:entry colname="col2">Deep Q network</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">DRL</oasis:entry>
         <oasis:entry colname="col2">Deep reinforcement learning</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">GCN</oasis:entry>
         <oasis:entry colname="col2">Graph convolutional network</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">IOHMM</oasis:entry>
         <oasis:entry colname="col2">Input–output hidden Markov model</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">MDP</oasis:entry>
         <oasis:entry colname="col2">Markov decision process</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">O&amp;M</oasis:entry>
         <oasis:entry colname="col2">Operations and maintenance</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">PHM</oasis:entry>
         <oasis:entry colname="col2">Prognostics and health management</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">POMDP</oasis:entry>
         <oasis:entry colname="col2">Partially observable Markov decision process</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">PPO</oasis:entry>
         <oasis:entry colname="col2">Proximal policy optimization</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">QMIX</oasis:entry>
         <oasis:entry colname="col2">Multi-agent value-based RL algorithm</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">RL</oasis:entry>
         <oasis:entry colname="col2">Reinforcement learning</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">RUL</oasis:entry>
         <oasis:entry colname="col2">Remaining useful life</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">SAC</oasis:entry>
         <oasis:entry colname="col2">Soft actor critic</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">SCADA</oasis:entry>
         <oasis:entry colname="col2">Supervisory control and data acquisition</oasis:entry>
       </oasis:row>
       <oasis:row>
         <oasis:entry colname="col1">W-QMIX</oasis:entry>
         <oasis:entry colname="col2">Weighted QMIX</oasis:entry>
       </oasis:row>
     </oasis:tbody>
   </oasis:tgroup></oasis:table></table-wrap>

      
</app>
  </app-group><notes notes-type="dataavailability"><title>Data availability</title>

      <p id="d2e3061">No underlying research data are associated with this study. This manuscript is a literature review and does not report original experimental, observational, or simulation data generated by the authors. Accordingly, no data repository or dataset archive applies to this work.</p>
  </notes><notes notes-type="authorcontribution"><title>Author contributions</title>

      <p id="d2e3067">MB: conceptualization, data curation, formal analysis, investigation, methodology, software, validation, visualization, writing – original draft, writing – review and editing. XJ: supervision, writing – review and editing. RRN: supervision, writing – review and editing.</p>
  </notes><notes notes-type="competinginterests"><title>Competing interests</title>

      <p id="d2e3074">The contact author has declared that none of the authors has any competing interests.</p>
  </notes><notes notes-type="disclaimer"><title>Disclaimer</title>

      <p id="d2e3080">Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. The authors bear the ultimate responsibility for providing appropriate place names. Views expressed in the text are those of the authors and do not necessarily reflect the views of the publisher.</p>
  </notes><notes notes-type="financialsupport"><title>Financial support</title>

      <p id="d2e3086">This research has been supported by the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (Holi-DOCTOR project, grant no. KICH1.ED02.20.004).</p>
  </notes><notes notes-type="reviewstatement"><title>Review statement</title>

      <p id="d2e3092">This paper was edited by Yolanda Vidal and reviewed by three anonymous referees.</p>
  </notes><ref-list>
    <title>References</title>

      <ref id="bib1.bibx1"><label>Aafif et al.(2022)Aafif, Chelbi, Mifdal, Dellagi, and Majdouline</label><mixed-citation>Aafif, Y., Chelbi, A., Mifdal, L., Dellagi, S., and Majdouline, I.: Optimal preventive maintenance strategies for a wind turbine gearbox, Energy Reports, 8, 803–814, <ext-link xlink:href="https://doi.org/10.1016/j.egyr.2022.07.084" ext-link-type="DOI">10.1016/j.egyr.2022.07.084</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx2"><label>Abbas(2024)</label><mixed-citation>Abbas, A.: A Hierarchical Framework for Interpretable, Safe, and Specialised Deep Reinforcement Learning, Doctoral thesis, Technological University Dublin, <ext-link xlink:href="https://doi.org/10.21427/p05p-az54" ext-link-type="DOI">10.21427/p05p-az54</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx3"><label>Abbas et al.(2024)Abbas, Chasparis, and Kelleher</label><mixed-citation>Abbas, A. N., Chasparis, G. C., and Kelleher, J. D.: Hierarchical framework for interpretable and specialized deep reinforcement learning-based predictive maintenance, Data Knowl. Eng., 149, 102240, <ext-link xlink:href="https://doi.org/10.1016/j.datak.2023.102240" ext-link-type="DOI">10.1016/j.datak.2023.102240</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx4"><label>Abkar et al.(2023)</label><mixed-citation>Abkar, M., Zehtabiyan-Rezaie, N., and Iosifidis, A.: Reinforcement learning for wind farm flow control: Current state and future actions, Renew. Energ., 205, 271–289, <ext-link xlink:href="https://doi.org/10.1016/j.renene.2023.01.001" ext-link-type="DOI">10.1016/j.renene.2023.01.001</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx5"><label>Adadi and Berrada(2018)</label><mixed-citation>Adadi, A. and Berrada, M.: Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI), IEEE Access, <ext-link xlink:href="https://doi.org/10.1109/ACCESS.2018.2870052" ext-link-type="DOI">10.1109/ACCESS.2018.2870052</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx6"><label>Andriotis and Papakonstantinou(2021)</label><mixed-citation>Andriotis, C. P. and Papakonstantinou, K. G.: Deep reinforcement learning driven inspection and maintenance planning under incomplete information and constraints, Reliab. Eng. Syst. Safe., 212, 107551, <ext-link xlink:href="https://doi.org/10.1016/j.ress.2021.107551" ext-link-type="DOI">10.1016/j.ress.2021.107551</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx7"><label>Borsotti et al.(2024)Borsotti, Negenborn, and Jiang</label><mixed-citation>Borsotti, M., Negenborn, R., and Jiang, X.: Model predictive control framework for optimizing offshore wind O&amp;M, in: Advances in Maritime Technology and Engineering, CRC Press, 533–546, <ext-link xlink:href="https://doi.org/10.1201/9781003508762-65" ext-link-type="DOI">10.1201/9781003508762-65</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx8"><label>Borsotti et al.(2026)Borsotti, Negenborn, and Jiang</label><mixed-citation>Borsotti, M., Negenborn, R., and Jiang, X.: A review of multi-horizon decision-making for operation and maintenance of fixed-bottom offshore wind farms, Renew. Sust. Energ. Rev., 226, 116450, <ext-link xlink:href="https://doi.org/10.1016/j.rser.2025.116450" ext-link-type="DOI">10.1016/j.rser.2025.116450</ext-link>, 2026.</mixed-citation></ref>
      <ref id="bib1.bibx9"><label>Bui and Hollweg(2024)</label><mixed-citation>Bui, V. and Hollweg, G. V.: A Critical Review of Safe Reinforcement Learning Techniques in Smart Grid Applications, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.2409.16256" ext-link-type="DOI">10.48550/arXiv.2409.16256</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx10"><label>Carroll et al.(2017)Carroll, McDonald, Dinwoodie, McMillan, Revie, and Lazakis</label><mixed-citation>Carroll, J., McDonald, A., Dinwoodie, I., McMillan, D., Revie, M., and Lazakis, I.: Availability, operation and maintenance costs of offshore wind turbines with different drive train configurations, Wind Energy, 20, 361–378, <ext-link xlink:href="https://doi.org/10.1002/we.2011" ext-link-type="DOI">10.1002/we.2011</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx11"><label>Chatterjee and Dethlefs(2021)</label><mixed-citation>Chatterjee, J. and Dethlefs, N.: Scientometric review of artificial intelligence for operations &amp; maintenance of wind turbines: The past, present and future, Renew. Sust. Energ. Rev., 144, 111051, <ext-link xlink:href="https://doi.org/10.1016/j.rser.2021.111051" ext-link-type="DOI">10.1016/j.rser.2021.111051</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx12"><label>Chen et al.(2024)Chen, Kang, Li, Li, and Zhao</label><mixed-citation>Chen, M., Kang, Y., Li, K., Li, P., and Zhao, Y.-B.: Deep reinforcement learning for maintenance optimization of multi-component production systems considering quality and production plan, Qual. Eng., 1–12, <ext-link xlink:href="https://doi.org/10.1080/08982112.2024.2373362" ext-link-type="DOI">10.1080/08982112.2024.2373362</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx13"><label>Cheng et al.(2023)Cheng, Liu, Li, and Li</label><mixed-citation>Cheng, J., Liu, Y., Li, W., and Li, T.: Deep reinforcement learning for cost-optimal condition-based maintenance policy of offshore wind turbine components, Ocean Eng., 283, 115062, <ext-link xlink:href="https://doi.org/10.1016/j.oceaneng.2023.115062" ext-link-type="DOI">10.1016/j.oceaneng.2023.115062</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx14"><label>Civera and Surace(2022)</label><mixed-citation>Civera, M. and Surace, C.: Non-Destructive Techniques for the Condition and Structural Health Monitoring of Wind Turbines: A Literature Review of the Last 20 Years, Sensors-Basel, 22, 1627, <ext-link xlink:href="https://doi.org/10.3390/s22041627" ext-link-type="DOI">10.3390/s22041627</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx15"><label>Do et al.(2024)Do, Nguyen, Voisin, Iung, and Neto</label><mixed-citation>Do, P., Nguyen, V.-T., Voisin, A., Iung, B., and Neto, W. A. F.: Multi-agent deep reinforcement learning-based maintenance optimization for multi-dependent component systems, Expert Syst. Appl., 245, 123144, <ext-link xlink:href="https://doi.org/10.1016/j.eswa.2024.123144" ext-link-type="DOI">10.1016/j.eswa.2024.123144</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx16"><label>Dong et al.(2021)Dong, Zhao, and Wu</label><mixed-citation>Dong, W., Zhao, T., and Wu, Y.: Deep Reinforcement Learning Based Preventive Maintenance for Wind Turbines, in: 2021 IEEE 5th Conference on Energy Internet and Energy System Integration (EI2), 2860–2865, <ext-link xlink:href="https://doi.org/10.1109/EI252483.2021.9713457" ext-link-type="DOI">10.1109/EI252483.2021.9713457</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx17"><label>Dulac-Arnold et al.(2021)Dulac-Arnold, Levine, Mankowitz, Li, Paduraru, Gowal, and Hester</label><mixed-citation>Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Gowal, S., and Hester, T.: Challenges of Real-World Reinforcement Learning: Definitions, Benchmarks and Analysis, Mach. Learn., 110, 2419–2468, <ext-link xlink:href="https://doi.org/10.1007/s10994-021-05961-4" ext-link-type="DOI">10.1007/s10994-021-05961-4</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx18"><label>Fox et al.(2022)Fox, Pillai, Friedrich, Collu, Dawood, and Johanning</label><mixed-citation>Fox, H., Pillai, A. C., Friedrich, D., Collu, M., Dawood, T., and Johanning, L.: A Review of Predictive and Prescriptive Offshore Wind Farm Operation and Maintenance, Energies, 15, 504, <ext-link xlink:href="https://doi.org/10.3390/en15020504" ext-link-type="DOI">10.3390/en15020504</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx19"><label>Haarnoja et al.(2018)Haarnoja, Zhou, Abbeel, and Levine</label><mixed-citation>Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S.: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.1801.01290" ext-link-type="DOI">10.48550/arXiv.1801.01290</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx20"><label>Hausknecht and Stone(2015a)</label><mixed-citation>Hausknecht, M. and Stone, P.: Deep Recurrent Q-Learning for Partially Observable MDPs, in: AAAI Fall Symposium Series, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.1507.06527" ext-link-type="DOI">10.48550/arXiv.1507.06527</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx21"><label>Huang et al.(2020)Huang, Chang, and Arinez</label><mixed-citation>Huang, J., Chang, Q., and Arinez, J.: Deep reinforcement learning based preventive maintenance policy for serial production lines, Expert Syst. Appl., 160, 113701, <ext-link xlink:href="https://doi.org/10.1016/j.eswa.2020.113701" ext-link-type="DOI">10.1016/j.eswa.2020.113701</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx22"><label>Igl et al.(2018)Igl, Zintgraf, Le, Wood, and Whiteson</label><mixed-citation>Igl, M., Zintgraf, L., Le, T. A., Wood, F., and Whiteson, S.: Deep Variational Reinforcement Learning for POMDPs, in: Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80, PMLR, 2117–2126, <uri>http://proceedings.mlr.press/v80/igl18a.html</uri> (last access: 28 October 2025), 2018.</mixed-citation></ref>
      <ref id="bib1.bibx23"><label>Jenkins et al.(2021)Jenkins, Prothero, Collu, Carroll, McMillan, and McDonald</label><mixed-citation>Jenkins, B., Prothero, A., Collu, M., Carroll, J., McMillan, D., and McDonald, A.: Limiting Wave Conditions for the Safe Maintenance of Floating Wind Turbines, J. Phys. Conf. Ser., 2018, 012023, <ext-link xlink:href="https://doi.org/10.1088/1742-6596/2018/1/012023" ext-link-type="DOI">10.1088/1742-6596/2018/1/012023</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx24"><label>Kaelbling et al.(1998)Kaelbling, Littman, and Cassandra</label><mixed-citation>Kaelbling, L. P., Littman, M. L., and Cassandra, A. R.: Planning and acting in partially observable stochastic domains, Artif. Intell., 101, 99–134, <ext-link xlink:href="https://doi.org/10.1016/S0004-3702(98)00023-X" ext-link-type="DOI">10.1016/S0004-3702(98)00023-X</ext-link>, 1998.</mixed-citation></ref>
      <ref id="bib1.bibx25"><label>Kazemian et al.(2024)Kazemian, Yildirim, and Ramanan</label><mixed-citation>Kazemian, I., Yildirim, M., and Ramanan, P.: Attention is All You Need to Optimize Wind Farm Operations and Maintenance, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.2410.24052" ext-link-type="DOI">10.48550/arXiv.2410.24052</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx26"><label>Kerkkamp et al.(2022)Kerkkamp, Bukhsh, Zhang, and Jansen</label><mixed-citation>Kerkkamp, D., Bukhsh, Z., Zhang, Y., and Jansen, N.: Grouping of Maintenance Actions with Deep Reinforcement Learning and Graph Convolutional Networks, in: Proceedings of the 14th International Conference on Agents and Artificial Intelligence, vol. 2, SciTePress Digital Library, 574–585, <ext-link xlink:href="https://doi.org/10.5220/0000155600003116" ext-link-type="DOI">10.5220/0000155600003116</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx27"><label>Lee and Mitici(2023)</label><mixed-citation>Lee, J. and Mitici, M.: Deep reinforcement learning for predictive aircraft maintenance using probabilistic Remaining-Useful-Life prognostics, Reliab. Eng. Syst. Safe., 230, 108908, <ext-link xlink:href="https://doi.org/10.1016/j.ress.2022.108908" ext-link-type="DOI">10.1016/j.ress.2022.108908</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx28"><label>Lee et al.(2025)Lee, Woo, and Kim</label><mixed-citation>Lee, N., Woo, J., and Kim, S.: A deep reinforcement learning ensemble for maintenance scheduling in offshore wind farms, Appl. Energ., 377, 124431, <ext-link xlink:href="https://doi.org/10.1016/j.apenergy.2024.124431" ext-link-type="DOI">10.1016/j.apenergy.2024.124431</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx29"><label>Li et al.(2023)Li, Lin, Yu, Du, Li, and Fu</label><mixed-citation>Li, Q., Lin, T., Yu, Q., Du, H., Li, J., and Fu, X.: Review of Deep Reinforcement Learning and Its Application in Modern Renewable Power System Control, Energies, 16, 4143, <ext-link xlink:href="https://doi.org/10.3390/en16104143" ext-link-type="DOI">10.3390/en16104143</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx30"><label>Liang et al.(2025)Liang, Miao, Li, Tan, Wang, Luo, and Jiang</label><mixed-citation>Liang, J., Miao, H., Li, K., Tan, J., Wang, X., Luo, R., and Jiang, Y.: A Review of Multi-Agent Reinforcement Learning Algorithms, Electronics, 14, 820, <ext-link xlink:href="https://doi.org/10.3390/electronics14040820" ext-link-type="DOI">10.3390/electronics14040820</ext-link>, 2025.</mixed-citation></ref>
      <ref id="bib1.bibx31"><label>Lillicrap et al.(2016)Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, and Wierstra</label><mixed-citation>Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D.: Continuous Control with Deep Reinforcement Learning, in: International Conference on Learning Representations (ICLR), arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.1509.02971" ext-link-type="DOI">10.48550/arXiv.1509.02971</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bibx32"><label>Lowe et al.(2017)Lowe, Wu, Tamar, Harb, Abbeel, and Mordatch</label><mixed-citation>Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I.: Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.1706.02275" ext-link-type="DOI">10.48550/arXiv.1706.02275</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx33"><label>Mnih et al.(2015)Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski, Petersen, Beattie, Sadik, Antonoglou, King, Kumaran, Wierstra, Legg, and Hassabis</label><mixed-citation>Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D.: Human-level control through deep reinforcement learning, Nature, 518, 529–533, <ext-link xlink:href="https://doi.org/10.1038/nature14236" ext-link-type="DOI">10.1038/nature14236</ext-link>, 2015.</mixed-citation></ref>
      <ref id="bib1.bibx34"><label>Mnih et al.(2016)Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu</label><mixed-citation>Mnih, V., Badia, A., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K.: Asynchronous Methods for Deep Reinforcement Learning, in: Proceedings of the 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 48, PMLR, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.1602.01783" ext-link-type="DOI">10.48550/arXiv.1602.01783</ext-link>, 2016.</mixed-citation></ref>
      <ref id="bib1.bibx35"><label>Narayanan(2023)</label><mixed-citation>Narayanan, S.: Reinforcement Learning in Wind Energy: A Review, Int. J. Green Energy, 20, 443–465, <ext-link xlink:href="https://doi.org/10.1080/15435075.2023.2281329" ext-link-type="DOI">10.1080/15435075.2023.2281329</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx36"><label>National Renewable Energy Laboratory(2022)</label><mixed-citation>National Renewable Energy Laboratory: Offshore Wind Energy Market Assessment 2022, National Renewable Energy Laboratory, <uri>https://www.nrel.gov/wind/offshore-market-assessment.html</uri> (last access: 28 October 2025), 2022.</mixed-citation></ref>
      <ref id="bib1.bibx37"><label>Nguyen et al.(2022)Nguyen, Do, Voisin, and Iung</label><mixed-citation>Nguyen, V.-T., Do, P., Voisin, A., and Iung, B.: Weighted-QMIX-based Optimization for Maintenance Decision-making of Multi-component Systems, in: Proceedings of the European Conference of the PHM Society 2022, vol. 7, 360–367, <ext-link xlink:href="https://doi.org/10.36001/phme.2022.v7i1.3319" ext-link-type="DOI">10.36001/phme.2022.v7i1.3319</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx38"><label>Ogunfowora and Najjaran(2023)</label><mixed-citation>Ogunfowora, O. and Najjaran, H.: Reinforcement and deep reinforcement learning-based solutions for machine maintenance planning, scheduling policies, and optimization, J. Manuf. Syst., 70, 244–263, <ext-link xlink:href="https://doi.org/10.1016/j.jmsy.2023.07.014" ext-link-type="DOI">10.1016/j.jmsy.2023.07.014</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx39"><label>Pandit and Wang(2024)</label><mixed-citation>Pandit, R. and Wang, J.: A comprehensive review on enhancing wind turbine applications with advanced SCADA data analytics and practical insights, IET Renew. Power Gen., 18, 722–742, <ext-link xlink:href="https://doi.org/10.1049/rpg2.12920" ext-link-type="DOI">10.1049/rpg2.12920</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx40"><label>Parisotto et al.(2020)Parisotto, Song, Rae, Pascanu, Gulcehre, Jayakumar, Jaderberg, Lopez Kaufman, Clark, Noury, Botvinick, Heess, and Hadsell</label><mixed-citation>Parisotto, E., Song, H. F., Rae, J. W., Pascanu, R., Gulcehre, C., Jayakumar, S. M., Jaderberg, M., Lopez Kaufman, R., Clark, A., Noury, S., Botvinick, M. M., Heess, N., and Hadsell, R.: Stabilizing Transformers for Reinforcement Learning, in: Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 119, PMLR, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.1910.06764" ext-link-type="DOI">10.48550/arXiv.1910.06764</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx41"><label>Pesántez et al.(2024)Pesántez, Guamán, Córdova, Torres, and Benalcazar</label><mixed-citation>Pesántez, G., Guamán, W., Córdova, J., Torres, M., and Benalcazar, P.: Reinforcement Learning for Efficient Power Systems Planning: A Review of Operational and Expansion Strategies, Energies, 17, 2167, <ext-link xlink:href="https://doi.org/10.3390/en17092167" ext-link-type="DOI">10.3390/en17092167</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx42"><label>Pinciroli et al.(2021)Pinciroli, Baraldi, Ballabio, Compare, and Zio</label><mixed-citation>Pinciroli, L., Baraldi, P., Ballabio, G., Compare, M., and Zio, E.: Deep Reinforcement Learning Based on Proximal Policy Optimization for the Maintenance of a Wind Farm with Multiple Crews, Energies, 14, 6743, <ext-link xlink:href="https://doi.org/10.3390/en14206743" ext-link-type="DOI">10.3390/en14206743</ext-link>, 2021.</mixed-citation></ref>
      <ref id="bib1.bibx43"><label>Qing et al.(2022)Qing, Tong, Qi, and Li</label><mixed-citation>Qing, Y., Tong, Y., Qi, Z., and Li, Y.: A Survey on Explainable Reinforcement Learning: Concepts, Algorithms, and Challenges, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.2211.06665" ext-link-type="DOI">10.48550/arXiv.2211.06665</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx44"><label>Rashid et al.(2018)Rashid, Samvelyan, de Witt, Farquhar, Foerster, and Whiteson</label><mixed-citation>Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J. N., and Whiteson, S.: QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.1803.11485" ext-link-type="DOI">10.48550/arXiv.1803.11485</ext-link>, 2018.</mixed-citation></ref>
      <ref id="bib1.bibx45"><label>Schulman et al.(2017)Schulman, Wolski, Dhariwal, Radford, and Klimov</label><mixed-citation>Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O.: Proximal Policy Optimization Algorithms, arXiv [preprint], <ext-link xlink:href="https://doi.org/10.48550/arXiv.1707.06347" ext-link-type="DOI">10.48550/arXiv.1707.06347</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx46"><label>Stetco et al.(2019)Stetco, Dinmohammadi, Zhao, Robu, Flynn, Barnes, Keane, and Nenadic</label><mixed-citation>Stetco, A., Dinmohammadi, F., Zhao, X., Robu, V., Flynn, D., Barnes, M., Keane, J., and Nenadic, G.: Machine learning methods for wind turbine condition monitoring: A review, Renew. Energ., 133, 620–635, <ext-link xlink:href="https://doi.org/10.1016/j.renene.2018.10.047" ext-link-type="DOI">10.1016/j.renene.2018.10.047</ext-link>, 2019.</mixed-citation></ref>
      <ref id="bib1.bibx47"><label>Su et al.(2022)Su, Huang, Adams, Chang, and Beling</label><mixed-citation>Su, J., Huang, J., Adams, S., Chang, Q., and Beling, P. A.: Deep multi-agent reinforcement learning for multi-level preventive maintenance in manufacturing systems, Expert Syst. Appl., 192, 116323, <ext-link xlink:href="https://doi.org/10.1016/j.eswa.2021.116323" ext-link-type="DOI">10.1016/j.eswa.2021.116323</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx48"><label>Sutton and Barto(2018)</label><mixed-citation>Sutton, R. S. and Barto, A. G.: Reinforcement Learning: An Introduction, 2nd edn., MIT Press, Cambridge, MA, ISBN 978-0-262-03924-6, <uri>http://incompleteideas.net/book/the-book-2nd.html</uri> (last access: 28 October 2025), 2018.</mixed-citation></ref>
      <ref id="bib1.bibx49"><label>Tautz-Weinert and Watson(2017)</label><mixed-citation>Tautz-Weinert, J. and Watson, S.: Using SCADA data for wind turbine condition monitoring – a review, IET Renew. Power Gen., 11, 382–394, <ext-link xlink:href="https://doi.org/10.1049/iet-rpg.2016.0248" ext-link-type="DOI">10.1049/iet-rpg.2016.0248</ext-link>, 2017.</mixed-citation></ref>
      <ref id="bib1.bibx50"><label>Tusar and Sarker(2022)</label><mixed-citation>Tusar, M. I. H. and Sarker, B. R.: Maintenance cost minimization models for offshore wind farms: A systematic and critical review, Int. J. Energ. Res., 46, 3739–3765, <ext-link xlink:href="https://doi.org/10.1002/er.7425" ext-link-type="DOI">10.1002/er.7425</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx51"><label>Valet et al.(2022)Valet, Altenmüller, Waschneck, May, Kuhnle, and Lanza</label><mixed-citation>Valet, A., Altenmüller, T., Waschneck, B., May, M. C., Kuhnle, A., and Lanza, G.: Opportunistic maintenance scheduling with deep reinforcement learning, J. Manuf. Syst., 64, 518–534, <ext-link xlink:href="https://doi.org/10.1016/j.jmsy.2022.07.016" ext-link-type="DOI">10.1016/j.jmsy.2022.07.016</ext-link>, 2022.</mixed-citation></ref>
      <ref id="bib1.bibx52"><label>Vermeer et al.(2003)Vermeer, Sørensen, and Crespo</label><mixed-citation>Vermeer, L. J., Sørensen, J. N., and Crespo, A.: Wind turbine wake aerodynamics, Prog. Aerosp. Sci., 39, 467–510, <ext-link xlink:href="https://doi.org/10.1016/S0376-0421(03)00078-2" ext-link-type="DOI">10.1016/S0376-0421(03)00078-2</ext-link>, 2003.</mixed-citation></ref>
      <ref id="bib1.bibx53"><label>Wang et al.(2026)Wang, Vidal, and Pozo</label><mixed-citation>Wang, S., Vidal, Y., and Pozo, F.: Recent advances in wind turbine condition monitoring using SCADA data: A state-of-the-art review, Reliab. Eng. Syst. Safe., 267, 111838, <ext-link xlink:href="https://doi.org/10.1016/j.ress.2025.111838" ext-link-type="DOI">10.1016/j.ress.2025.111838</ext-link>, 2026.</mixed-citation></ref>
      <ref id="bib1.bibx54"><label>Wei et al.(2019)Wei, Jin, Bao, and Li</label><mixed-citation>Wei, S., Jin, X., Bao, Y., and Li, H.: Reinforcement Learning in Maintenance of Civil Infrastructures, in: Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Workshop on Reinforcement Learning for Real Life (RL4RealLife), <uri>https://proceedings.mlr.press/v97/</uri> (last access: 28 October 2025), 2019.</mixed-citation></ref>
      <ref id="bib1.bibx55"><label>Williams(1992)</label><mixed-citation>Williams, R. J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning, Mach. Learn., 8, 229–256, <ext-link xlink:href="https://doi.org/10.1007/BF00992696" ext-link-type="DOI">10.1007/BF00992696</ext-link>, 1992.</mixed-citation></ref>
      <ref id="bib1.bibx56"><label>Zhang et al.(2023)Zhang, Li, and Coit</label><mixed-citation>Zhang, C., Li, Y.-F., and Coit, D. W.: Deep Reinforcement Learning for Dynamic Opportunistic Maintenance of Multi-Component Systems With Load Sharing, IEEE T. Reliab., 72, 863–877, <ext-link xlink:href="https://doi.org/10.1109/TR.2022.3197322" ext-link-type="DOI">10.1109/TR.2022.3197322</ext-link>, 2023.</mixed-citation></ref>
      <ref id="bib1.bibx57"><label>Zhang et al.(2024)Zhang, Shen, Liu, Chen, Zhang, and Li</label><mixed-citation>Zhang, E., Shen, F., Liu, S., Chen, G., Zhang, F., and Li, S.: Offshore wind power digital twin modeling system for intelligent operation and maintenance applications, E3S Web Conf., 546, 02010, <ext-link xlink:href="https://doi.org/10.1051/e3sconf/202454602010" ext-link-type="DOI">10.1051/e3sconf/202454602010</ext-link>, 2024.</mixed-citation></ref>
      <ref id="bib1.bibx58"><label>Zhang and Si(2020)</label><mixed-citation>Zhang, N. and Si, W.: Deep reinforcement learning for condition-based maintenance planning of multi-component systems under dependent competing risks, Reliab. Eng. Syst. Safe., 203, 107094, <ext-link xlink:href="https://doi.org/10.1016/j.ress.2020.107094" ext-link-type="DOI">10.1016/j.ress.2020.107094</ext-link>, 2020.</mixed-citation></ref>
      <ref id="bib1.bibx59"><label>Zhao and Zhou(2022)</label><mixed-citation>Zhao, F. J. and Zhou, Y.: Wind Farm Maintenance Scheduling Using Soft Actor-Critic Deep Reinforcement Learning, in: 2022 Global Reliability and Prognostics and Health Management (PHM-Yantai), 1–6, <ext-link xlink:href="https://doi.org/10.1109/PHM-Yantai55411.2022.9942116" ext-link-type="DOI">10.1109/PHM-Yantai55411.2022.9942116</ext-link>, 2022.</mixed-citation></ref>

  </ref-list></back>
    <!--<article-title-html>Review of deep reinforcement learning for  offshore wind farm maintenance planning</article-title-html>
<abstract-html/>
<ref-html id="bib1.bib1"><label>Aafif et al.(2022)Aafif, Chelbi, Mifdal, Dellagi, and Majdouline</label><mixed-citation>
      
Aafif, Y., Chelbi, A., Mifdal, L., Dellagi, S., and Majdouline, I.:
Optimal preventive maintenance strategies for a wind turbine gearbox, Energy Reports, 8, 803–814, <a href="https://doi.org/10.1016/j.egyr.2022.07.084" target="_blank">https://doi.org/10.1016/j.egyr.2022.07.084</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib2"><label>Abbas(2024)</label><mixed-citation>
      
Abbas, A.: A Hierarchical Framework for Interpretable, Safe, and Specialised Deep Reinforcement Learning, Doctoral thesis, Technological University Dublin, <a href="https://doi.org/10.21427/p05p-az54" target="_blank">https://doi.org/10.21427/p05p-az54</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib3"><label>Abbas et al.(2024)Abbas, Chasparis, and Kelleher</label><mixed-citation>
      
Abbas, A. N., Chasparis, G. C., and Kelleher, J. D.:
Hierarchical framework for interpretable and specialized deep reinforcement learning-based predictive maintenance, Data Knowl. Eng., 149, 102240, <a href="https://doi.org/10.1016/j.datak.2023.102240" target="_blank">https://doi.org/10.1016/j.datak.2023.102240</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib4"><label>Abkar et al.(2023)</label><mixed-citation>
      
Abkar, M., Zehtabiyan-Rezaie, N., and Iosifidis, A.: Reinforcement learning for wind farm flow control: Current state and future actions, Renew. Energ., 205, 271–289, <a href="https://doi.org/10.1016/j.renene.2023.01.001" target="_blank">https://doi.org/10.1016/j.renene.2023.01.001</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib5"><label>Adadi and Berrada(2018)</label><mixed-citation>
      
Adadi, A. and Berrada, M.:
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI), IEEE Access, <a href="https://doi.org/10.1109/ACCESS.2018.2870052" target="_blank">https://doi.org/10.1109/ACCESS.2018.2870052</a>, 2018.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib6"><label>Andriotis and Papakonstantinou(2021)</label><mixed-citation>
      
Andriotis, C. P. and Papakonstantinou, K. G.:
Deep reinforcement learning driven inspection and maintenance planning under incomplete information and constraints, Reliab. Eng. Syst. Safe., 212, 107551, <a href="https://doi.org/10.1016/j.ress.2021.107551" target="_blank">https://doi.org/10.1016/j.ress.2021.107551</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib7"><label>Borsotti et al.(2024)Borsotti, Negenborn, and Jiang</label><mixed-citation>
      
Borsotti, M., Negenborn, R., and Jiang, X.:
Model predictive control framework for optimizing offshore wind O&amp;M, in: Advances in Maritime Technology and Engineering, CRC Press, 533–546, <a href="https://doi.org/10.1201/9781003508762-65" target="_blank">https://doi.org/10.1201/9781003508762-65</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib8"><label>Borsotti et al.(2026)Borsotti, Negenborn, and Jiang</label><mixed-citation>
      
Borsotti, M., Negenborn, R., and Jiang, X.:
A review of multi-horizon decision-making for operation and maintenance of fixed-bottom offshore wind farms, Renew. Sust. Energ. Rev., 226, 116450, <a href="https://doi.org/10.1016/j.rser.2025.116450" target="_blank">https://doi.org/10.1016/j.rser.2025.116450</a>, 2026.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib9"><label>Bui and Hollweg(2024)</label><mixed-citation>
      
Bui, V. and Hollweg, G. V.: A Critical Review of Safe Reinforcement Learning Techniques in Smart Grid Applications, arXiv [preprint], <a href="https://doi.org/10.48550/arXiv.2409.16256" target="_blank">https://doi.org/10.48550/arXiv.2409.16256</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib10"><label>Carroll et al.(2017)Carroll, McDonald, Dinwoodie, McMillan, Revie, and Lazakis</label><mixed-citation>
      
Carroll, J., McDonald, A., Dinwoodie, I., McMillan, D., Revie, M., and Lazakis, I.:
Availability, operation and maintenance costs of offshore wind turbines with different drive train configurations, Wind Energy, 20, 361–378, <a href="https://doi.org/10.1002/we.2011" target="_blank">https://doi.org/10.1002/we.2011</a>, 2017.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib11"><label>Chatterjee and Dethlefs(2021)</label><mixed-citation>
      
Chatterjee, J. and Dethlefs, N.:
Scientometric review of artificial intelligence for operations &amp; maintenance of wind turbines: The past, present and future, Renew. Sust. Energ. Rev., 144, 111051, <a href="https://doi.org/10.1016/j.rser.2021.111051" target="_blank">https://doi.org/10.1016/j.rser.2021.111051</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib12"><label>Chen et al.(2024)Chen, Kang, Li, Li, and Zhao</label><mixed-citation>
      
Chen, M., Kang, Y., Li, K., Li, P., and Zhao, Y.-B.:
Deep reinforcement learning for maintenance optimization of multi-component production systems considering quality and production plan, Qual. Eng., 1–12, <a href="https://doi.org/10.1080/08982112.2024.2373362" target="_blank">https://doi.org/10.1080/08982112.2024.2373362</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib13"><label>Cheng et al.(2023)Cheng, Liu, Li, and Li</label><mixed-citation>
      
Cheng, J., Liu, Y., Li, W., and Li, T.:
Deep reinforcement learning for cost-optimal condition-based maintenance policy of offshore wind turbine components, Ocean Eng., 283, 115062, <a href="https://doi.org/10.1016/j.oceaneng.2023.115062" target="_blank">https://doi.org/10.1016/j.oceaneng.2023.115062</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib14"><label>Civera and Surace(2022)</label><mixed-citation>
      
Civera, M. and Surace, C.:
Non-Destructive Techniques for the Condition and Structural Health Monitoring of Wind Turbines: A Literature Review of the Last 20 Years, Sensors-Basel, 22, 1627, <a href="https://doi.org/10.3390/s22041627" target="_blank">https://doi.org/10.3390/s22041627</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib15"><label>Do et al.(2024)Do, Nguyen, Voisin, Iung, and Neto</label><mixed-citation>
      
Do, P., Nguyen, V.-T., Voisin, A., Iung, B., and Neto, W. A. F.:
Multi-agent deep reinforcement learning-based maintenance optimization for multi-dependent component systems, Expert Syst. Appl., 245, 123144, <a href="https://doi.org/10.1016/j.eswa.2024.123144" target="_blank">https://doi.org/10.1016/j.eswa.2024.123144</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib16"><label>Dong et al.(2021)Dong, Zhao, and Wu</label><mixed-citation>
      
Dong, W., Zhao, T., and Wu, Y.:
Deep Reinforcement Learning Based Preventive Maintenance for Wind Turbines, in: 2021 IEEE 5th Conference on Energy Internet and Energy System Integration (EI2), 2860–2865, <a href="https://doi.org/10.1109/EI252483.2021.9713457" target="_blank">https://doi.org/10.1109/EI252483.2021.9713457</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib17"><label>Dulac-Arnold et al.(2021)Dulac-Arnold, Levine, Mankowitz, Li, Paduraru, Gowal, and Hester</label><mixed-citation>
      
Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Gowal, S., and Hester, T.:
Challenges of Real-World Reinforcement Learning: Definitions, Benchmarks and Analysis, Mach. Learn., 110, 2419–2468, <a href="https://doi.org/10.1007/s10994-021-05961-4" target="_blank">https://doi.org/10.1007/s10994-021-05961-4</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib18"><label>Fox et al.(2022)Fox, Pillai, Friedrich, Collu, Dawood, and Johanning</label><mixed-citation>
      
Fox, H., Pillai, A. C., Friedrich, D., Collu, M., Dawood, T., and Johanning, L.:
A Review of Predictive and Prescriptive Offshore Wind Farm Operation and Maintenance, Energies, 15, <a href="https://doi.org/10.3390/en15020504" target="_blank">https://doi.org/10.3390/en15020504</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib19"><label>Haarnoja et al.(2018)Haarnoja, Zhou, Abbeel, and Levine</label><mixed-citation>
      
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S.:
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, arXiv [preprint], <a href="https://doi.org/10.48550/arxiv.1801.01290" target="_blank">https://doi.org/10.48550/arxiv.1801.01290</a>, 2018.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib20"><label>Hausknecht and Stone(2015a)</label><mixed-citation>
      
Hausknecht, M. and Stone, P.:
Deep Recurrent Q-Learning for Partially Observable MDPs, in: AAAI Fall Symposium Series, arXiv, <a href="https://doi.org/10.48550/arxiv.1507.06527" target="_blank">https://doi.org/10.48550/arxiv.1507.06527</a>, 2015.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib21"><label>Huang et al.(2020)Huang, Chang, and Arinez</label><mixed-citation>
      
Huang, J., Chang, Q., and Arinez, J.:
Deep reinforcement learning based preventive maintenance policy for serial production lines, Expert Syst. Appl., 160, 113701, <a href="https://doi.org/10.1016/j.eswa.2020.113701" target="_blank">https://doi.org/10.1016/j.eswa.2020.113701</a>, 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib22"><label>Igl et al.(2018)Igl, Zintgraf, Le, Wood, and Whiteson</label><mixed-citation>
      
Igl, M., Zintgraf, L., Le, T. A., Wood, F., and Whiteson, S.: Deep Variational Reinforcement Learning for POMDPs, in: Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 80, PMLR, 2117–2126, <a href="http://proceedings.mlr.press/v80/igl18a.html" target="_blank"/> (last access: 28 October 2025), 2018.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib23"><label>Jenkins et al.(2021)Jenkins, Prothero, Collu, Carroll, McMillan, and McDonald</label><mixed-citation>
      
Jenkins, B., Prothero, A., Collu, M., Carroll, J., McMillan, D., and McDonald, A.:
Limiting Wave Conditions for the Safe Maintenance of Floating Wind Turbines, J. Phys. Conf. Ser., 2018, 012023, <a href="https://doi.org/10.1088/1742-6596/2018/1/012023" target="_blank">https://doi.org/10.1088/1742-6596/2018/1/012023</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib24"><label>Kaelbling et al.(1998)Kaelbling, Littman, and Cassandra</label><mixed-citation>
      
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R.:
Planning and acting in partially observable stochastic domains, Artif. Intell., 101, 99–134, <a href="https://doi.org/10.1016/S0004-3702(98)00023-X" target="_blank">https://doi.org/10.1016/S0004-3702(98)00023-X</a>, 1998.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib25"><label>Kazemian et al.(2024)Kazemian, Yildirim, and Ramanan</label><mixed-citation>
      
Kazemian, I., Yildirim, M., and Ramanan, P.:
Attention is All You Need to Optimize Wind Farm Operations and Maintenance, arXiv [preprint], <a href="https://doi.org/10.48550/arxiv.2410.24052" target="_blank">https://doi.org/10.48550/arxiv.2410.24052</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib26"><label>Kerkkamp et al.(2022)Kerkkamp, Bukhsh, Zhang, and Jansen</label><mixed-citation>
      
Kerkkamp, D., Bukhsh, Z., Zhang, Y., and Jansen, N.:
Grouping of Maintenance Actions with Deep Reinforcement Learning and Graph Convolutional Networks, in: Proceedings of the 14th International Conference on Agents and Artificial Intelligence, vol. 2, SciTePress Digital Library, 574–585, <a href="https://doi.org/10.5220/0000155600003116" target="_blank">https://doi.org/10.5220/0000155600003116</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib27"><label>Lee and Mitici(2023)</label><mixed-citation>
      
Lee, J. and Mitici, M.:
Deep reinforcement learning for predictive aircraft maintenance using probabilistic Remaining-Useful-Life prognostics, Reliab. Eng. Syst. Safe., 230, 108908, <a href="https://doi.org/10.1016/j.ress.2022.108908" target="_blank">https://doi.org/10.1016/j.ress.2022.108908</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib28"><label>Lee et al.(2025)Lee, Woo, and Kim</label><mixed-citation>
      
Lee, N., Woo, J., and Kim, S.:
A deep reinforcement learning ensemble for maintenance scheduling in offshore wind farms, Appl. Energ., 377, 124431, <a href="https://doi.org/10.1016/j.apenergy.2024.124431" target="_blank">https://doi.org/10.1016/j.apenergy.2024.124431</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib29"><label>Li et al.(2023)Li, Lin, Yu, Du, Li, and Fu</label><mixed-citation>
      
Li, Q., Lin, T., Yu, Q., Du, H., Li, J., and Fu, X.:
Review of Deep Reinforcement Learning and Its Application in Modern Renewable Power System Control, Energies, 16, 4143, <a href="https://doi.org/10.3390/en16104143" target="_blank">https://doi.org/10.3390/en16104143</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib30"><label>Liang et al.(2025)Liang, Miao, Li, Tan, Wang, Luo, and Jiang</label><mixed-citation>
      
Liang, J., Miao, H., Li, K., Tan, J., Wang, X., Luo, R., and Jiang, Y.:
A Review of Multi-Agent Reinforcement Learning Algorithms, Electronics, 14, 820, <a href="https://doi.org/10.3390/electronics14040820" target="_blank">https://doi.org/10.3390/electronics14040820</a>, 2025.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib31"><label>Lillicrap et al.(2016)Lillicrap, Hunt, Pritzel, Heess, Erez, Tassa, Silver, and Wierstra</label><mixed-citation>
      
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D.:
Continuous Control with Deep Reinforcement Learning, in: International Conference on Learning Representations (ICLR), arXiv [preprint], <a href="https://doi.org/10.48550/arxiv.1509.02971" target="_blank">https://doi.org/10.48550/arxiv.1509.02971</a>, 2016.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib32"><label>Lowe et al.(2017)Lowe, Wu, Tamar, Harb, Abbeel, and Mordatch</label><mixed-citation>
      
Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I.:
Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments, arXiv [preprint], <a href="https://doi.org/10.48550/arXiv.1706.02275" target="_blank">https://doi.org/10.48550/arXiv.1706.02275</a>, 2017.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib33"><label>Mnih et al.(2015)Mnih, Kavukcuoglu, Silver, Rusu, Veness, Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski, Petersen, Beattie, Sadik, Antonoglou, King, Kumaran, Wierstra, Legg, and Hassabis</label><mixed-citation>
      
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D.:
Human-level control through deep reinforcement learning, Nature, 518, 529–533, <a href="https://doi.org/10.1038/nature14236" target="_blank">https://doi.org/10.1038/nature14236</a>, 2015.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib34"><label>Mnih et al.(2016)Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, and Kavukcuoglu</label><mixed-citation>
      
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K.:
Asynchronous Methods for Deep Reinforcement Learning, in: Proceedings of the 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 48, PMLR, arXiv [preprint], <a href="https://doi.org/10.48550/arXiv.1602.01783" target="_blank">https://doi.org/10.48550/arXiv.1602.01783</a>, 2016.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib35"><label>Narayanan(2023)</label><mixed-citation>
      
Narayanan, S.: Reinforcement Learning in Wind Energy: A Review, Int. J. Green Energy, 20, 443–465, <a href="https://doi.org/10.1080/15435075.2023.2281329" target="_blank">https://doi.org/10.1080/15435075.2023.2281329</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib36"><label>National Renewable Energy Laboratory(2022)</label><mixed-citation>
      
National Renewable Energy Laboratory: Offshore Wind Energy Market Assessment 2022, National Renewable Energy Laboratory, <a href="https://www.nrel.gov/wind/offshore-market-assessment.html" target="_blank">https://www.nrel.gov/wind/offshore-market-assessment.html</a> (last access: 28 October 2025), 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib37"><label>Nguyen et al.(2022)Nguyen, Do, Voisin, and Iung</label><mixed-citation>
      
Nguyen, V.-T., Do, P., Voisin, A., and Iung, B.:
Weighted-QMIX-based Optimization for Maintenance Decision-making of Multi-component Systems, in: Proceedings of the European Conference of the PHM Society 2022, vol. 7, 360–367, <a href="https://doi.org/10.36001/phme.2022.v7i1.3319" target="_blank">https://doi.org/10.36001/phme.2022.v7i1.3319</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib38"><label>Ogunfowora and Najjaran(2023)</label><mixed-citation>
      
Ogunfowora, O. and Najjaran, H.:
Reinforcement and deep reinforcement learning-based solutions for machine maintenance planning, scheduling policies, and optimization, J. Manuf. Syst., 70, 244–263, <a href="https://doi.org/10.1016/j.jmsy.2023.07.014" target="_blank">https://doi.org/10.1016/j.jmsy.2023.07.014</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib39"><label>Pandit and Wang(2024)</label><mixed-citation>
      
Pandit, R. and Wang, J.:
A comprehensive review on enhancing wind turbine applications with advanced SCADA data analytics and practical insights, IET Renew. Power Gen., 18, 722–742, <a href="https://doi.org/10.1049/rpg2.12920" target="_blank">https://doi.org/10.1049/rpg2.12920</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib40"><label>Parisotto et al.(2020)Parisotto, Song, Rae, Pascanu, Gulcehre et al.</label><mixed-citation>
      
Parisotto, E., Song, H. F., Rae, J. W., Pascanu, R., Gulcehre, C., Jayakumar, S. M., Jaderberg, M., Lopez Kaufman, R., Clark, A., Noury, S., Botvinick, M. M., Heess, N., and Hadsell, R.: Stabilizing Transformers for Reinforcement Learning, in: Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 119, PMLR, arXiv [preprint], <a href="https://doi.org/10.48550/arXiv.1910.06764" target="_blank">https://doi.org/10.48550/arXiv.1910.06764</a>, 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib41"><label>Pesántez et al.(2024)Pesántez, Guamán, Córdova, Torres, and Benalcazar</label><mixed-citation>
      
Pesántez, G., Guamán, W., Córdova, J., Torres, M., and Benalcazar, P.:
Reinforcement Learning for Efficient Power Systems Planning: A Review of Operational and Expansion Strategies, Energies, 17, 2167, <a href="https://doi.org/10.3390/en17092167" target="_blank">https://doi.org/10.3390/en17092167</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib42"><label>Pinciroli et al.(2021)Pinciroli, Baraldi, Ballabio, Compare, and Zio</label><mixed-citation>
      
Pinciroli, L., Baraldi, P., Ballabio, G., Compare, M., and Zio, E.:
Deep Reinforcement Learning Based on Proximal Policy Optimization for the Maintenance of a Wind Farm with Multiple Crews, Energies, 14, 6743, <a href="https://doi.org/10.3390/en14206743" target="_blank">https://doi.org/10.3390/en14206743</a>, 2021.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib43"><label>Qing et al.(2022)Qing, Tong, Qi, and Li</label><mixed-citation>
      
Qing, Y., Tong, Y., Qi, Z., and Li, Y.: A Survey on Explainable Reinforcement Learning: Concepts, Algorithms, and Challenges, arXiv [preprint], <a href="https://doi.org/10.48550/arXiv.2211.06665" target="_blank">https://doi.org/10.48550/arXiv.2211.06665</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib44"><label>Rashid et al.(2018)Rashid, Samvelyan, de Witt, Farquhar, Foerster, and Whiteson</label><mixed-citation>
      
Rashid, T., Samvelyan, M., de Witt, C. S., Farquhar, G., Foerster, J. N., and Whiteson, S.:
QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning, arXiv [preprint], <a href="https://doi.org/10.48550/arXiv.1803.11485" target="_blank">https://doi.org/10.48550/arXiv.1803.11485</a>, 2018.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib45"><label>Schulman et al.(2017)Schulman, Wolski, Dhariwal, Radford, and Klimov</label><mixed-citation>
      
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O.:
Proximal Policy Optimization Algorithms, arXiv [preprint], <a href="https://doi.org/10.48550/arxiv.1707.06347" target="_blank">https://doi.org/10.48550/arxiv.1707.06347</a>, 2017.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib46"><label>Stetco et al.(2019)Stetco, Dinmohammadi, Zhao, Robu, Flynn, Barnes, Keane, and Nenadic</label><mixed-citation>
      
Stetco, A., Dinmohammadi, F., Zhao, X., Robu, V., Flynn, D., Barnes, M., Keane, J., and Nenadic, G.:
Machine learning methods for wind turbine condition monitoring: A review, Renew. Energ., 133, 620–635, <a href="https://doi.org/10.1016/j.renene.2018.10.047" target="_blank">https://doi.org/10.1016/j.renene.2018.10.047</a>, 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib47"><label>Su et al.(2022)Su, Huang, Adams, Chang, and Beling</label><mixed-citation>
      
Su, J., Huang, J., Adams, S., Chang, Q., and Beling, P. A.:
Deep multi-agent reinforcement learning for multi-level preventive maintenance in manufacturing systems, Expert Syst. Appl., 192, 116323, <a href="https://doi.org/10.1016/j.eswa.2021.116323" target="_blank">https://doi.org/10.1016/j.eswa.2021.116323</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib48"><label>Sutton and Barto(2018)</label><mixed-citation>
      
Sutton, R. S. and Barto, A. G.: Reinforcement Learning: An Introduction, 2nd edn., MIT Press, Cambridge, MA, ISBN 978-0-262-03924-6, <a href="http://incompleteideas.net/book/the-book-2nd.html" target="_blank">http://incompleteideas.net/book/the-book-2nd.html</a> (last access: 28 October 2025), 2018.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib49"><label>Tautz-Weinert and Watson(2017)</label><mixed-citation>
      
Tautz-Weinert, J. and Watson, S.:
Using SCADA data for wind turbine condition monitoring – a review, IET Renew. Power Gen., 11, 382–394, <a href="https://doi.org/10.1049/iet-rpg.2016.0248" target="_blank">https://doi.org/10.1049/iet-rpg.2016.0248</a>, 2017.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib50"><label>Tusar and Sarker(2022)</label><mixed-citation>
      
Tusar, M. I. H. and Sarker, B. R.:
Maintenance cost minimization models for offshore wind farms: A systematic and critical review, Int. J. Energ. Res., 46, 3739–3765, <a href="https://doi.org/10.1002/er.7425" target="_blank">https://doi.org/10.1002/er.7425</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib51"><label>Valet et al.(2022)Valet, Altenmüller, Waschneck, May, Kuhnle, and Lanza</label><mixed-citation>
      
Valet, A., Altenmüller, T., Waschneck, B., May, M. C., Kuhnle, A., and Lanza, G.:
Opportunistic maintenance scheduling with deep reinforcement learning, J. Manuf. Syst., 64, 518–534, <a href="https://doi.org/10.1016/j.jmsy.2022.07.016" target="_blank">https://doi.org/10.1016/j.jmsy.2022.07.016</a>, 2022.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib52"><label>Vermeer et al.(2003)Vermeer, Sørensen, and Crespo</label><mixed-citation>
      
Vermeer, L. J., Sørensen, J. N., and Crespo, A.:
Wind turbine wake aerodynamics, Prog. Aerosp. Sci., 39, 467–510, <a href="https://doi.org/10.1016/S0376-0421(03)00078-2" target="_blank">https://doi.org/10.1016/S0376-0421(03)00078-2</a>, 2003.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib53"><label>Wang et al.(2026)Wang, Vidal, and Pozo</label><mixed-citation>
      
Wang, S., Vidal, Y., and Pozo, F.:
Recent advances in wind turbine condition monitoring using SCADA data: A state-of-the-art review, Reliab. Eng. Syst. Safe., 267, 111838, <a href="https://doi.org/10.1016/j.ress.2025.111838" target="_blank">https://doi.org/10.1016/j.ress.2025.111838</a>, 2026.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib54"><label>Wei et al.(2019)Wei, Jin, Bao, and Li</label><mixed-citation>
      
Wei, S., Jin, X., Bao, Y., and Li, H.:
Reinforcement Learning in Maintenance of Civil Infrastructures, in: Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Workshop on Reinforcement Learning for Real Life (RL4RealLife), <a href="https://proceedings.mlr.press/v97/" target="_blank">https://proceedings.mlr.press/v97/</a> (last access: 28 October 2025), 2019.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib55"><label>Williams(1992)</label><mixed-citation>
      
Williams, R. J.:
Simple statistical gradient-following algorithms for connectionist reinforcement learning, Mach. Learn., 8, 229–256, <a href="https://doi.org/10.1007/BF00992696" target="_blank">https://doi.org/10.1007/BF00992696</a>, 1992.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib56"><label>Zhang et al.(2023)Zhang, Li, and Coit</label><mixed-citation>
      
Zhang, C., Li, Y.-F., and Coit, D. W.:
Deep Reinforcement Learning for Dynamic Opportunistic Maintenance of Multi-Component Systems With Load Sharing, IEEE T. Reliab., 72, 863–877, <a href="https://doi.org/10.1109/TR.2022.3197322" target="_blank">https://doi.org/10.1109/TR.2022.3197322</a>, 2023.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib57"><label>Zhang et al.(2024)Zhang, Shen, Liu, Chen, Zhang, and Li</label><mixed-citation>
      
Zhang, E., Shen, F., Liu, S., Chen, G., Zhang, F., and Li, S.:
Offshore wind power digital twin modeling system for intelligent operation and maintenance applications, E3S Web Conf., 546, 02010, <a href="https://doi.org/10.1051/e3sconf/202454602010" target="_blank">https://doi.org/10.1051/e3sconf/202454602010</a>, 2024.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib58"><label>Zhang and Si(2020)</label><mixed-citation>
      
Zhang, N. and Si, W.:
Deep reinforcement learning for condition-based maintenance planning of multi-component systems under dependent competing risks, Reliab. Eng. Syst. Safe., 203, 107094, <a href="https://doi.org/10.1016/j.ress.2020.107094" target="_blank">https://doi.org/10.1016/j.ress.2020.107094</a>, 2020.

    </mixed-citation></ref-html>
<ref-html id="bib1.bib59"><label>Zhao and Zhou(2022)</label><mixed-citation>
      
Zhao, F. J. and Zhou, Y.:
Wind Farm Maintenance Scheduling Using Soft Actor-Critic Deep Reinforcement Learning, in: 2022 Global Reliability and Prognostics and Health Management (PHM-Yantai), 1–6, <a href="https://doi.org/10.1109/PHM-Yantai55411.2022.9942116" target="_blank">https://doi.org/10.1109/PHM-Yantai55411.2022.9942116</a>, 2022.

    </mixed-citation></ref-html>--></article>
