Multi-Agent Reinforcement Learning Control of a Hydrostatic Wind Turbine-Based Farm

This paper leverages multi-agent reinforcement learning (MARL) to develop an efficient control system for a wind farm comprising a new type of wind turbine with hydrostatic transmission. The primary motivation for hydrostatic wind turbines (HWT) is increased reliability and reduced manufacturing, operating, and maintenance costs, achieved by removing troublesome components and reducing nacelle weight. Nevertheless, the high system complexity of HWT and the wake effect pose significant challenges for the control of HWT-based wind farms. We therefore propose a MARL algorithm named multi-agent policy optimization (MAPO), which allows agents (turbines) to gradually improve their control policies by repeatedly interacting with the environment to learn an optimal operation curve for wind farms. Simulation results based on a wind farm simulator, FAST.Farm, show that MAPO outperforms the greedy policy and a popular learning-based method, multi-agent deep deterministic policy gradient (MADDPG), in terms of power generation.


I. INTRODUCTION
DEVELOPING renewable energy to substitute traditional fossil energy is one of the most promising ways to reduce environmental pollution. In Europe, wind energy accounts for the highest share of clean energy generation and is also the fastest-growing electricity source in the market [1]. Nonetheless, offshore wind farms comprising gearbox-based wind turbines have an intractable drawback: their maintenance is costly. Hydrostatic wind turbines (HWT) can help tackle this problem [2] because the hydrostatic transmission system is more robust than the gearbox-based transmission and can offer a longer life cycle. In addition, HWTs allow the heavy motor and generator to be shifted to the platform (Fig. 1), so the mass of the nacelle can be significantly reduced, which greatly eases the installation and maintenance of wind turbines. Furthermore, the frequency/inertial response exhibited by HWTs is of clear value to large-scale power systems because they are installed with synchronous generators. These economic advantages motivate us to study the HWT-based wind farm. We focus on its control in this paper.
As with traditional wind turbines and farms, the control method for a single HWT is not suitable for a HWT-based wind farm due to the wake effect. Specifically, the optimal control policy for an isolated HWT is maximum power point tracking (MPPT [3], see Fig. 2): when the wind speed is below rated, the objective is to control the generator torque to maximize its power output. When the wind speed is sufficient to drive the full-power operation of HWTs, the goal becomes to maintain the output at the rated level to alleviate the structural load via the joint control of blade pitch and generator torque. In wind farms, turbines are normally installed in arrays, and thus the actions of upstream turbines affect the environmental state of their downstream counterparts through the wake effect. Although MPPT can achieve optimal solutions for upstream turbines, the power outputs of HWTs within the wake planes of upstream turbines are reduced greatly, causing a decline in the power generation of the entire wind farm. Therefore, designing a control policy for wind farms that can overcome the wake effect remains an open problem.
Yubo Huang and Xiaowei Zhao are with the Intelligent Control & Smart Energy (ICSE) Research Group, School of Engineering, University of Warwick, CV4 7AL Coventry, U.K. (e-mail: yubo.huang@warwick.ac.uk; xiaowei.zhao@warwick.ac.uk).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TSTE.2023.3270761.
For farm-level control, model-free methods may offer more benefits than model-based methods due to the high system complexity and environmental uncertainty of wind farms. Firstly, model-based control methods (e.g., model predictive control) require an accurate wind farm model, but the high environmental uncertainty of wind farms inevitably introduces considerable modelling errors, and control policies designed on a model with such errors are likely to be sub-optimal. Additionally, the algorithmic complexity of model-based methods is usually higher than that of model-free methods, which leads to a greater computational cost. For example, when the task has a long horizon, as in wind farm control, it may be difficult for model predictive control to run in real time because of its expensive computation. Thus, many recent studies have leveraged model-free data-driven methods to approach a better wind farm control policy, including dynamic programming [5], genetic algorithms [6], and swarm optimization [7].
Among the many model-free methods, model-free reinforcement learning (RL) has distinct advantages in solving the wind farm control task. For example, dynamic programming is impractical for large-scale wind farm control because its memory requirement is high when the state space is large. Genetic algorithms and swarm optimization cannot guarantee the convergence or stability of the control policy during the optimization process. Model-free RL [8] can effectively tackle these challenges with the assistance of deep neural networks and has achieved excellent results in wind farm control. Dong et al. integrated deep deterministic policy gradient (DDPG) with a high-fidelity wind farm model to learn the control policy [9]. Zhao et al. used knowledge-assisted DDPG to optimize the control policy while ensuring safety during training [10]. Bay et al. introduced a distributed RL-based method for wind farm power capture maximization using yaw control [11]. These works demonstrated that model-free RL can be applied to wind farm control and can achieve better results than many of the data-driven methods selected for comparison.
Almost all existing model-free RL control methods for wind farms (which consist of multiple turbines) regard the wind farm as a single agent, but using multi-agent RL (MARL) to train the wind farm control policy is more rational than using single-agent RL (SARL). SARL encounters two limitations in application:
• SARL is not scalable, since the dimension of the joint action space grows exponentially with the number of HWTs in a wind farm.
• In execution, each HWT needs to acquire the states of its teammates to generate its action based on the control policy. This high degree of communication cannot be satisfied in real-world scenarios.
Both limitations can be addressed by introducing the centralized training with decentralized execution (CTDE) principle [12] in MARL. This implies that, during training, the concatenation of the states of all HWTs is input to the value network to estimate the future return (power) of each HWT, whereas in execution each HWT uses only its private state to sample its own low-dimensional action from its individual policy, rather than the joint action, and therefore needs no communication.
On the other hand, there are also several challenges in designing the control system of a HWT-based wind farm within the MARL framework. Firstly, to bridge the simulation-to-reality gap, the wind farm simulator should consider not only the aerodynamics of the wind farm but also the dynamics of the various substructures of HWTs, which are typically ignored in existing wind farm control research. Moreover, there are significant differences in RL-based control design between wind farms consisting of gearbox-based wind turbines and those consisting of HWTs. For example, to formulate the control task as a complete Markov decision process (MDP, a prerequisite for RL design), the former only needs to include the rotor speed in the state space because gearbox-based wind turbines have constant gearbox ratios between the rotors and generators. The latter, however, must consider the dynamics of the hydrostatic transmission of each HWT in addition to the rotor speed. Last but not least, the developed MARL algorithm needs to effectively enhance the coordination between HWTs to overcome the wake effect. This paper makes the following contributions to address these issues:
• We developed a HWT-based wind farm model based on FAST.Farm [13], in which the gearbox transmission of the wind turbine is replaced by the hydrostatic transmission. This model includes both the aerodynamics of large-scale wind farms and the mechanical dynamics of the substructures of a HWT. FAST.Farm driven by the proposed model is then integrated with Python to build a high-fidelity HWT-based wind farm simulator used for training MARL algorithms.
• We proposed a novel CTDE-based MARL algorithm named multi-agent policy optimization (MAPO) to learn the wind farm control policy. MAPO balances the collective return and the individual return with a dynamical weight, which induces agents to explore new policies early in training and then to exploit the explored information to maximize the group return. By encouraging agents to maximize the collective return, MAPO promotes the coordination between HWTs and further reduces the negative effect of wakes on power generation.
• Simulation results show that the control policy trained by MAPO achieves high performance in different wind farm layouts and fluctuating environments. The structural dynamic analysis shows that MAPO does not cause unusual vibrations of the main sub-structures.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

II. CONSTRUCTING A HWT-BASED WIND FARM SIMULATOR FOR MARL
Before we train control policies for HWTs with MARL algorithms, a high-fidelity simulator should be developed. This simulator includes models of the aerodynamics of the wind farm and the elastic-servo dynamics of the HWT. Unlike traditional control methods, which use the HWT-based wind farm model to design the control policy, MARL aims to let each agent (turbine) learn the control policy through interacting with the simulator; please refer to Section III-A for details. Below, we introduce the hydrostatic wind turbine model used and its control modules.

A. Modeling the Dynamics of Hydrostatic Wind Turbines
At the farm level, the aerodynamic torque $T_r^i$ of the rotor and the thrust force $F_{thrust}^i$ exerted by turbine $i$ can be described through the quasi-static model (1) [14], where $i = 1, 2, \ldots, n$ and $n$ is the number of HWTs in a farm; $\rho$, $R$, $\omega_r^i$, and $\beta^i$ are the air density, blade length, rotor speed, and pitch angle of turbine $i$, respectively; $v^i$ is the wind speed at the $i$-th turbine; $\lambda^i = \omega_r^i R / v^i$ is the tip speed ratio; and $C_p$ and $C_T$ are the power coefficient and the disk-based thrust coefficient [15], respectively.
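As a minimal sketch of the quasi-static model, assuming the standard rotor-disc relations that match the symbols defined above (aerodynamic power $\frac{1}{2}\rho\pi R^2 v^3 C_p$ divided by rotor speed for the torque, and $\frac{1}{2}\rho\pi R^2 v^2 C_T$ for the thrust), the quantities in (1) could be evaluated as follows. The function names are illustrative, and the coefficients $C_p$ and $C_T$ are passed in as plain numbers rather than looked up from $\lambda$ and $\beta$; neither is taken from the paper or from FAST.Farm.

```python
import math


def tip_speed_ratio(omega_r, R, v):
    """Tip speed ratio lambda = omega_r * R / v."""
    return omega_r * R / v


def aero_torque(rho, R, v, omega_r, c_p):
    """Rotor torque [N*m]: aerodynamic power 0.5*rho*pi*R^2*v^3*C_p over rotor speed.

    Assumed standard form; c_p would normally be a function of (lambda, beta).
    """
    return 0.5 * rho * math.pi * R**2 * v**3 * c_p / omega_r


def thrust_force(rho, R, v, c_t):
    """Rotor thrust [N]: 0.5*rho*pi*R^2*v^2*C_T (assumed standard form)."""
    return 0.5 * rho * math.pi * R**2 * v**2 * c_t
```

Passing the coefficients as arguments keeps the sketch independent of any particular $C_p(\lambda, \beta)$ table.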
FAST.Farm uses a gearbox-based turbine model to simulate the operation of a wind farm. The main task in this subsection is to embed the HWT model into the farm-level aerodynamics model introduced above to construct a complete HWT-based wind farm simulator.
For the $i$-th HWT, the rate of change of its rotor speed is proportional to the difference between $T_r^i$ obtained from (1) and the pump torque $T_p^i$, as expressed in (2), where $J_r^i$ and $J_p^i$ are the rotational mass moments of inertia of the rotor and pump, respectively.
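A form of the rotor dynamics consistent with the sentence above, assuming the rotor and pump sit on one rigid low-speed shaft so that their inertias simply add, would be the following. It is a reconstruction under that assumption rather than a quotation of the paper's equation.

```latex
\left(J_r^i + J_p^i\right)\dot{\omega}_r^i = T_r^i - T_p^i
```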
A hydrostatic drivetrain transmits the mechanical power on the low-speed rotor side to the high-speed generator side for electricity generation. As shown in Fig. 1, this hydrostatic drivetrain comprises a hydraulic pump, high-pressure and low-pressure lines, and a hydraulic motor. First, the rotation of the low-speed shaft with the rotor-pump assembly pumps the hydraulic oil from the low-pressure transmission line to the high-pressure line, and the pump torque is given by (3) [16], where $D_p$ is the pump displacement, meaning the volume of fluid pumped per revolution, $P_p^i$ represents the pressure difference across the pump, $B_p$ is the viscous damping, and $C_{fp}$ is the Coulomb friction coefficient of the pump. The net volumetric flow of the pump $Q_p^i$ is computed by (4), where $C_{sp}$ is the laminar leakage coefficient of the pump. Then, we use a dissipative model to describe the dynamics of the transmission lines [17]. Specifically, this model describes how changes in the net volumetric flows of the pump $Q_p^i$ and motor $Q_m^i$ drive the state transition of the hydraulic lines (5), and in turn the variation of the pressure differences across the pump and motor (6), where $P_m^i$ denotes the pressure difference across the motor.
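The pump and line relations described above can be sketched as below. The sketch assumes the commonly used loss model in which viscous damping and Coulomb friction add to the ideal pump torque $D_p P_p$ and laminar leakage subtracts from the ideal flow $D_p \omega_r$, and it assumes a linear state-space line model $\dot{x} = Ax + BQ$, $P = Cx$ advanced with a forward-Euler step. The function names, the Euler integrator, and the treatment of $D_p$ as displacement per radian are illustrative assumptions; the actual simulator may integrate differently.

```python
import numpy as np


def pump_torque(D_p, P_p, omega_r, B_p, C_fp):
    """Pump torque: ideal term plus viscous and Coulomb friction losses (assumed form)."""
    return D_p * P_p + B_p * omega_r + C_fp * D_p * P_p


def pump_flow(D_p, omega_r, C_sp, P_p):
    """Net pump flow: ideal flow minus laminar leakage (assumed form)."""
    return D_p * omega_r - C_sp * P_p


def line_step(x, Q, A, B, C, dt):
    """One forward-Euler step of x' = A x + B Q; returns (x_next, P = C x_next).

    Q stacks the pump and motor flows, P stacks the pump and motor pressures.
    """
    x_next = x + dt * (A @ x + B @ Q)
    return x_next, C @ x_next
```

The matrices `A`, `B`, and `C` are left as inputs because their values depend on the line geometry and oil properties.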
The presented model uses a state-space form to represent the dynamics of the fluid in a hydrostatic drivetrain. Here, $A$, $B$, and $C = [C_1; C_2]$ are the state matrix, input matrix, and output matrix, respectively, and their values are determined by the length $L$ and inner diameter $r$ of the transmission lines, and the density $\rho$, kinematic viscosity $\nu$, and effective bulk modulus $E$ of the hydraulic oil (please see [18] for specific calculations). $x^i$ is the state vector, $[Q_p^i, Q_m^i]^T$ is the input vector, and $P^i = [P_p^i, P_m^i]^T$ is the output vector.
Similar to the pump, the motor can also be characterized by its volumetric displacement $D_m^i$, but the function of the motor is to convert hydraulic power into mechanical power. Thus, for the hydraulic motor, we only reverse the sign of the leakage flow and friction torques in the pump model [16]. The net volumetric flow $Q_m^i$ and torque $T_m^i$ of the motor are given by (7), where $\omega_m^i$ is the motor speed, $C_{sm}$ is the laminar leakage coefficient of the motor, $B_m$ is the viscous damping, and $C_{fm}$ is the Coulomb friction coefficient of the motor.
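Following the statement that the motor model is the pump model with the signs of the leakage flow and friction torques reversed, a minimal sketch of the motor relations could look like this. It assumes the pump loss model takes the common form with leakage subtracted from the flow and friction added to the torque, so the motor has leakage added and friction subtracted; the function names are illustrative.

```python
def motor_flow(D_m, omega_m, C_sm, P_m):
    """Net motor flow: ideal flow plus laminar leakage (sign reversed w.r.t. the pump)."""
    return D_m * omega_m + C_sm * P_m


def motor_torque(D_m, P_m, omega_m, B_m, C_fm):
    """Motor torque: ideal term minus viscous and Coulomb friction losses."""
    return D_m * P_m - B_m * omega_m - C_fm * D_m * P_m
```

Because the ideal torque scales with `D_m`, changing the displacement is a direct handle on the torque delivered to the generator.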
In a hydrostatic transmission system, we can control the motor torque by changing its displacement $D_m$ (7). The response of the motor displacement is characterized by a time constant $t_m = 0.5$ and a displacement reference $\bar{D}_m^i$, as given in (8), and the power produced by the generator is given by (9), where $\eta$ is the generator efficiency. At this point, we have integrated the aerodynamic model of the wind farm and the hydrostatic transmission model of the turbine. Next, we implement them in FAST.Farm.
We replace the gearbox-based drivetrain with the hydrostatic drivetrain by modifying the ServoDyn module in FAST.Farm. Firstly, the drivetrain rotational-flexibility DOF is disabled in the ElastoDyn input file (.dat) and GBRatio is set to 1. Then, we regard the generator in gearbox-based wind turbines as the hydraulic pump in HWTs and modify its inertia in the FAST input file (.fst). The transmission dynamics (5)-(6) of the hydraulic system in HWTs are modeled as a function in the ServoDyn module, which is called before the state update of the servo system. Finally, in the UserVSCont_KP.f90 file, we provide an interface for writing the trained MARL control policy, and the MARL training samples can be collected from the .out files. The HWT-based wind farm simulator is now constructed and can perform its core function shown in Fig. 3.

B. The Control Framework of an Individual HWT
Having constructed a simulator of the HWT-based wind farm, we next introduce the torque control and blade pitch control regimes of HWTs in the simulator.
1) Torque Control: For a single variable-speed HWT, its operation curve (MPPT: maximum power point tracking, also called the greedy policy) can be divided into three regions, shown in Fig. 2. In region 2, below the rated wind speed, the wind is not sufficient to drive the turbine at its full-power point. The blade pitch angle is kept at its minimum to capture as much wind energy as possible. The primary task in region 2 is to control the motor torque so that the HWT runs on its optimal torque curve (Fig. 2), maximizing the output power. Considering the motor displacement actuator, the closed-loop torque control system is shown in Fig. 4. It is worth mentioning that, in preliminary experiments, we found the response of motor displacement control to be noticeably faster than that of pump control, since the pump affects the generator torque by changing the pressure and flow rate of the hydraulic oil, whereas the motor directly determines the input mechanical torque of the generator.
Fig. 4. The torque control system of HWTs. In the simulator, the AeroDyn module computes the load of the HWT according to the inflow wind. The ElastoDyn module determines the kinematics of each substructure of a wind turbine. The ServoDyn module describes the dynamics of the servo system, and the control system is also embedded in this module. $\bar{D}_m$ is the displacement command of the hydraulic motor.
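To make the idea of an optimal torque curve in region 2 concrete, the sketch below uses the textbook MPPT torque law $T = K_{opt}\,\omega_r^2$ with $K_{opt} = \frac{1}{2}\rho\pi R^5 C_{p,max} / \lambda_{opt}^3$. Whether the paper's controller uses exactly this law is not stated in the text, so the law, the function names, and the symbols `c_p_max` and `lambda_opt` are all assumptions made for illustration.

```python
import math


def optimal_torque_gain(rho, R, c_p_max, lambda_opt):
    """Gain K_opt of the textbook region-2 law (assumed, not quoted from the paper)."""
    return 0.5 * rho * math.pi * R**5 * c_p_max / lambda_opt**3


def region2_torque_reference(omega_r, k_opt):
    """Torque reference that keeps the rotor at the optimal tip speed ratio."""
    return k_opt * omega_r**2
```

With this law the torque reference depends only on the measured rotor speed, which is why it needs no wind speed measurement and no coordination with other turbines.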
2) Blade Pitch Control: According to MPPT, in region 3 (see Fig. 2), the output power of a HWT should be kept at its nominal value via the blade pitch control [2]. The dynamics of the pitch actuator can be represented by the first-order differential equation (10), where $\beta$ and $\bar{\beta}$ are the real-time pitch angle and its reference determined by MPPT and the pitch controller, respectively, and $t_\beta = 0.1$ is the time constant of the blade pitch actuator.
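A first-order actuator of this kind can be sketched as a lag $\dot{y} = (\bar{y} - y)/\tau$, which would apply both to the pitch angle (with $\tau = t_\beta = 0.1$) and to the motor displacement (with $\tau = t_m = 0.5$). The lag form, the forward-Euler integration, and the function names are assumptions for illustration; the step size 0.00625 s used in the checks is the FAST.Farm time step reported in Section IV.

```python
def first_order_step(y, y_ref, tau, dt):
    """One forward-Euler step of dy/dt = (y_ref - y) / tau."""
    return y + dt * (y_ref - y) / tau


def simulate(y0, y_ref, tau, dt, steps):
    """Apply first_order_step repeatedly toward a constant reference."""
    y = y0
    for _ in range(steps):
        y = first_order_step(y, y_ref, tau, dt)
    return y
```

After one time constant (16 steps of 0.00625 s for $\tau = 0.1$) the output of such a lag has covered roughly 63 to 64 percent of the step, which is a quick sanity check on any implementation.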
As introduced above, for a single wind turbine, the torque and pitch references are calculated by MPPT during its operation. This coordination-free control policy is optimal for an isolated turbine but is unsuitable for a wind farm due to the wake effect. For instance, if all upstream wind turbines adopted this greedy control strategy, they could maximize their own power output, but the wind speed within their wake planes would drop rapidly and the power generation of the turbines situated there would plummet. As a result, the power production of the entire wind farm would remain at a relatively low level. To tackle this problem, in the next section, a novel MARL method is proposed to train a collaborative control policy for all the HWTs in a wind farm to overcome the wake effect. The real-time torque and pitch angle references of the HWTs are then generated by the trained policy.

III. MULTI-AGENT REINFORCEMENT LEARNING CONTROL OF A HWT-BASED WIND FARM
In Section II-B, we introduced the greedy control policy (MPPT), which uses the optimal operation curve to calculate the control references of a single HWT. For wind farm control, however, there is no one-size-fits-all optimal operation curve, but the policy network in RL can approximate it through interacting with the simulator. In this section, we propose the Multi-Agent Policy Optimization (MAPO) algorithm to control the wind farm. We also illustrate how MAPO trains a collaborative control policy for a HWT-based wind farm by using the simulator introduced in Section II, and how the control policy guides the actions of HWTs to alleviate the wake effect and further boost the power generation of the whole wind farm.

A. Modeling the HWT-Based Wind Farm Control Task as a Markov Decision Process
In MAPO, we regard each HWT in the wind farm as an agent with an independent policy network/function $\pi^i$ and agent value network/function $V^i$, $\forall i \in \{1, 2, \ldots, n\}$. In addition, there is a group value network/function $V^{gru}$ used for estimating the future return of the wind farm based on its state $s_t$. The policy network $\pi^i$ outputs the action $a_t^i$ (control reference signals) for turbine $i$ given its observation $o_t^i$, and the agent value network $V^i$ estimates the future return of turbine $i$ (12). The concrete simulator state, agent action, and reward are defined as follows:
• State: the observation $o^i$ of turbine $i$ includes not only its external information (e.g., the wind speed on the rotor, the turbine location) but also its internal status: the rotor speed $\omega_r^i$ and the pump and motor pressure differences ($P_p^i$ and $P_m^i$). The group (farm) state $s$ is the concatenation of the observations of all agents (11).
• Action: the action $a^i$ consists of the control reference signals (torque reference and pitch reference) that the corresponding substructures of wind turbine $i$ should track to maximize the output power.
• Reward: the reward $r^i$ should be proportional to the power generated by turbine $i$. Hence the reward function is designed as shown in Fig. 5. We expect all turbines to work in their rated state, so the reward of turbine $i$ is maximal at its rated point. When the power exceeds its rated value, the reward is set to 0 to penalize the agent. The group reward $r$ is the sum of all agent rewards (11).
The state and reward satisfy (11), where $\oplus$ is the concatenation operator and $n$ is the number of turbines in the simulator.
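The definitions above can be sketched in a few lines. The exact reward shape is given in Fig. 5, which is not reproduced in the text, so the linear ramp up to the rated power used here is an assumption; only the properties stated above (maximal at the rated point, zero above it, group reward as the sum, group state as the concatenation) are taken from the text. All function names are illustrative.

```python
import numpy as np


def agent_reward(power, rated_power):
    """Reward of one turbine: linear in power up to rated (assumed shape), 0 above rated."""
    if power > rated_power:
        return 0.0
    return max(power, 0.0) / rated_power


def group_reward(agent_rewards):
    """Group reward r: the sum of all agent rewards."""
    return float(sum(agent_rewards))


def group_state(observations):
    """Group state s: concatenation of all agents' observation vectors."""
    return np.concatenate([np.asarray(o, dtype=float) for o in observations])
```

For example, three turbines with three-element observations (rotor speed and two pressure differences) give a nine-element group state.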
Based on these concepts, the agent state value function $V^{\pi^i}(o_t^i)$ under policy $\pi^i$ and the group state value function $V^{\pi}(s_t)$ under policy $\pi$ can be defined as in (12) (hereinafter, $V^{\pi^i}(o_t^i)$ and $V^{\pi}(s_t)$ are abbreviated as $V_t^i$ and $V_t$, respectively), where $\gamma$ is the discount coefficient.
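Assuming the usual discounted-return definition of a state value function, the two quantities would take the form below. This is the standard RL definition written with the symbols of this section, offered as a reading aid rather than as the paper's exact equation.

```latex
V^{\pi^i}(o_t^i) = \mathbb{E}_{\pi^i}\!\left[\sum_{l=0}^{\infty} \gamma^{l}\, r_{t+l}^{i} \,\middle|\, o_t^i\right],
\qquad
V^{\pi}(s_t) = \mathbb{E}_{\pi}\!\left[\sum_{l=0}^{\infty} \gamma^{l}\, r_{t+l} \,\middle|\, s_t\right]
```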
The interaction between the RL agents and the HWT-based wind farm simulator can be formulated as a partially observable MDP. Initially, the weights of all policy networks are randomly initialized, and thus the corresponding farm control policy is of low quality. At each discrete time $t$, as shown in Fig. 6, agent $i$ (turbine) observes its private status $o_t^i \in O^i$ from the simulator. The concatenation of the observations of all agents is the group state $s_t \in S$ (11). Based on the observation $o_t^i$, the policy network $\pi^i$ of agent $i$ samples an action $a_t^i$ (control reference signal) for the different turbine substructures ($O^i \to A^i$). Then all turbines take their actions (e.g., torque reference), and the simulator feeds back a reward $r^i \in \mathbb{R}$ to each agent while jumping to the next state $s_{t+1}$ (refer to Fig. 3). The sample $(s_t, a_t, r_t, s_{t+1})$ is collected to train the policy and value networks (see the next subsection for details) to improve the performance of the control policy, and then this interaction continues. At each iteration, the quality of the policy $\pi^i$, $\forall i \in \{1, 2, \ldots, n\}$, can be evaluated by the expected return (power generated by the wind farm) in (13), where $s_0$ is the start state of the simulator and $\rho$ is its probability distribution.
After this process is iterated enough times, the original random control policy will converge to a superior solution that can be deployed to real-world machines. Additionally, as illustrated in Fig. 6, we input the private observation $o^i$ and the group state $s$ to the value networks $V^i(o^i)$ and $V(s)$ to estimate the future return of agent $i$ and the group future return, respectively. However, in the policy network $\pi^i$, only the private observation $o^i$ is leveraged to sample the action references. This setting satisfies the principle of CTDE, which avoids the communication and environment non-stationarity issues in MARL.
In the HWT-based wind farm control task, if all turbines aim to maximize their own return, the ultimate control policy will probably fall into a locally optimal solution. Conversely, if the objective of all agents is to maximize the group return throughout the training, then in the initial stage agents tend to exploit the explored information to increase the collective return rather than discovering new states. This limits the exploration of each agent, and thus learning is extremely slow at this stage. We expect the agents to focus on increasing their own return at the beginning of the training but to dedicate themselves to accumulating the group return in the later stage to find the best collaborative control policy. To achieve this, we introduce into (13) a dynamical parameter $\eta$, whose value gradually grows from 0 at the beginning of training to 1 at the end. The objective of the policy network $\pi^i$ then changes from (13) to (14). Accordingly, for the $i$-th agent, under the policy $\pi^i$, the advantage of action $a^i$ over other actions is given by (15). To enhance the stability and improve the performance of RL algorithms, in this paper, we use the generalized advantage estimator (GAE) [19], [20] to calculate the advantage (16), where $\lambda$ is a constant less than 1.
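The two ingredients described above can be sketched as follows. The GAE routine implements the standard estimator, a discounted sum of temporal-difference residuals with decay $\gamma\lambda$. The blending function is only one plausible reading of how $\eta$ enters (14): a convex combination that equals the agent signal at $\eta = 0$ and the group signal at $\eta = 1$. That form, the function names, and the default values of `gamma` and `lam` are assumptions, not values from the paper.

```python
import numpy as np


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates.

    rewards: length T; values: length T + 1 (last entry is the bootstrap value).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step temporal-difference residual at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def blend(agent_signal, group_signal, eta):
    """Assumed convex combination: pure agent term at eta=0, pure group term at eta=1."""
    return (1.0 - eta) * agent_signal + eta * group_signal
```

A usage pattern under these assumptions would be to compute one advantage from the agent value network and one from the group value network, then pass both through `blend` with the current $\eta$.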

B. Training the Multi-Agent RL Functional Networks
In this subsection, we present the training method of the functional networks in MAPO. During the interaction between agents and the simulator, the operation trajectory $D$ of the wind farm (including the trajectory $D^i$ of turbine $i$, $\forall i = 1, 2, \ldots, n$) can be collected for training. The samples in these trajectories are $(s_t, a_t, r_t, s_{t+1}) \in D$, used to train the group value network, and $(o_t^i, a_t^i, r_t^i, o_{t+1}^i) \in D^i$, used to train the policy and value networks of agent $i$.
At the $k$-th iteration, the weight matrix of agent $i$'s policy network $\pi_k^i$ is $\theta_k^i$. The objective of $\pi_k^i$ is to maximize (14). In practice, however, (14) cannot be used to optimize $\pi_k^i$ directly. Instead, [21] proposed a surrogate objective (17) to update it based on the collected samples $D_k^i$, where $T$ is the total number of time steps of an episode $\tau$, $A_G^i$ is the advantage function calculated by (16), and $g(\epsilon, A_G^i)$ is the clip function (18). The update rule (19) of agent $i$'s value network $V^i$ ($\phi_k^i$ denotes the weight matrix of network $V^i$ at the $k$-th iteration) uses $R_t^i$, the discounted return of agent $i$ at time $t$ (20). After all agents' value and policy networks are updated, we train the group value network $V^{gru}$ ($\phi_k^{gru}$ denotes the weight matrix of network $V^{gru}$ at the $k$-th iteration) by (21), where $R_t^{gru}$ is the discounted return of the wind farm at time $t$.
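The update rules above can be sketched numerically, assuming the clip function has the form commonly used for PPO-style surrogate objectives, $g(\epsilon, A) = (1+\epsilon)A$ for $A \ge 0$ and $(1-\epsilon)A$ otherwise, and assuming the value networks are fitted by mean squared error against the discounted return. Both forms, the default $\epsilon = 0.2$, and the function names are assumptions; `ratio` stands for the probability ratio between the new and old policies on the sampled actions.

```python
import numpy as np


def g(eps, adv):
    """Clip function: caps how far the surrogate can move in the favourable direction."""
    return np.where(adv >= 0, (1.0 + eps) * adv, (1.0 - eps) * adv)


def clipped_surrogate(ratio, adv, eps=0.2):
    """Mean over samples of min(ratio * A, g(eps, A)); to be maximized."""
    return float(np.mean(np.minimum(ratio * adv, g(eps, adv))))


def discounted_returns(rewards, gamma):
    """R_t = r_t + gamma * R_{t+1}, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def value_loss(v_pred, returns):
    """Mean squared error between value predictions and discounted returns."""
    return float(np.mean((np.asarray(v_pred) - np.asarray(returns)) ** 2))
```

The same `discounted_returns` and `value_loss` would serve both the agent value networks (with agent rewards) and the group value network (with group rewards).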
The complete training process of MAPO is shown in Algorithm 1.

IV. RESULTS
In our simulations, the observation $o^i$ of turbine $i$ includes its rotor speed $\omega_r^i$ and its pump and motor pressure differences ($P_p^i$ and $P_m^i$). The group state $s$ is formed by concatenating all agents' observations. In the training curves, the solid line represents the average episode return of 5 trials started from random seeds, and the standard deviation of the episode return over the 5 trials bounds the shaded region of a curve. There are two criteria for evaluating the performance of RL algorithms in wind farm control tasks: cumulative return (the solid line) and stability (the shaded region). A high return shows that the tested control policy is effective in wind farm power generation, and a small shaded region signifies that the corresponding agents achieve similar performance under fluctuating initial conditions, and vice versa. To reproduce the results, we provide the parameters used in the HWT-based farm simulator and the hyper-parameters of MAPO in Tables I-III, respectively. The pseudo-code of MAPO is shown in Algorithm 1. In addition, we employ two useful techniques, namely policy smoothing regularization and a dual value network, to reduce the variance of results during training.
During training and testing, the time step in FAST.Farm is set to 0.00625 s. The total simulation time of one episode (the period from turbine launch to stop) is 3600 s in final testing and 250 s in training. The inflow surface (left) of the wind field follows the normal distributions $V_x = N(10, 4)$, $V_y = N(0, 5)$, $V_z = N(0, 1)$ (m/s), where $N$ denotes the normal distribution. Prior to calculating the wake dynamics, the ambient wind is generated by the inflow module in FAST.Farm at the beginning of each episode. The parameters of the NREL 5-MW reference wind turbine used in our simulations are listed in Table I.

TABLE I PARAMETERS OF THE WIND FARM AND WIND TURBINE

[...] results of [...] and the greedy control policy (MPPT). We conclude that MAPO can substantially raise the wind farm power generation, which suggests the agents have learned how to cope with the wake effect in turbine arrays. As shown in Fig. 9, the RL agents' strategy involves slightly reducing the power output of the upstream turbine (WT1) to weaken its wake effect on downstream turbines. During the training process, upstream turbines seek an equilibrium that maximizes the power output of their downstream turbines while minimizing their own losses.

Algorithm 1
For all $i = 1, 2, \ldots, n$, initialize the weight vectors $\phi_0^{gru}$, $\phi_0^i$ and $\theta_0^i$ of $V_0^{gru}$, $V_0^i$ and $\pi_0^i$, respectively.
for $k = 0, 1, 2, \ldots$
[...] (21).
end for

A. Comparative Evaluations
In both Figs. 8 and 9, the variance (shaded region) of MAPO is relatively large in the initial training stage because we encourage agents to explore new states and policies at this stage.Afterward, all agents focus on maximizing the group return, implying that the objectives of agents are consistent now (coordination).As a result, the variance gradually diminishes to a low level.In contrast, the variance of MADDPG remains high even at the end of training.Thus the policy learned by MAPO is more stable than MADDPG for deployment in real-world HWT-based wind farms.The curves of MAPO and MADDPG have both converged after being sufficiently trained by samples collected from FAST.Farm.Notably, the convergence value of MAPO is significantly greater than that of MADDPG, indicating that MAPO can increase the power generation of HWT-based wind farms more than MADDPG.
To illustrate how MAPO captures wind changes and maximizes the power output of the wind farm, we generated heat maps during the training process.Fig. 10(a) shows the wake effect of upstream wind turbines on downstream turbines, indicating that without additional control, the turbines located in the wake planes would experience a significant decrease in the wind energy captured.In contrast, Fig. 10(b) shows the learned strategy that controls turbines to avoid wakes during the training process under the similar state, where the wind direction is mainly along the x-axis.In this strategy, each turbine selects a suitable yaw angle to minimize the impact of its wake on the surrounding turbines.Fig. 10(c) and (d) demonstrate the control strategies learned by the turbines to adapt to changes in the wind direction along the y-axis.As observed, all turbines have adjusted their yaw angles to align with the direction of the inflow wind, thereby maximizing wind speed on their rotational planes.Moreover, they have also been rotated to an optimal angle, directing their wakes towards a direction that has minimal effect on surrounding turbines.
We also test the final trained MAPO control policy by embedding it into wind farms and [...]. In the table, the mean column shows the average power output of the wind farm over five episodes, each lasting 3600 seconds. This value directly reflects the amount of power generated by the wind farm. The std column gives the standard deviation of the mean power output across the five episodes, which helps to evaluate the effect of different initial conditions on the performance of the controllers. The max and min columns give the highest and lowest power output values during the five episodes, respectively, and the difference between them, $|max - min|$, measures the power fluctuations. Based on the results presented in this table, the MAPO controller is the most effective at driving wind turbines to generate power, and it is more stable across the different episodes than the other controllers. Additionally, the wind farm controlled by MAPO exhibits less power output fluctuation, indicating higher power quality. Fig. 11 shows the variations in power output of the nine-turbine wind farm. Compared with the greedy control policy and wake steering, an established industrial method derived from a relatively low-fidelity wind farm model named FLORIS [22], the wind farm controlled by MAPO generated more power, which is consistent with the training curves. Moreover, the power output of the MAPO-driven wind farm is more stable, thanks to a fourth-order filter that smooths the control actions.
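The table columns described above can be computed from logged power traces as in the sketch below. It assumes one power time series per test episode, takes the std column as the population standard deviation of the per-episode means, and takes max and min over all samples of all episodes; these choices and the function name are assumptions, since the text does not specify them. The numbers in the usage note are made-up placeholders, not results from the paper.

```python
import numpy as np


def summarize(episodes):
    """Summary statistics over a list of per-episode power time series."""
    episode_means = np.array([np.mean(ep) for ep in episodes])
    all_samples = np.concatenate([np.asarray(ep, dtype=float) for ep in episodes])
    hi = float(all_samples.max())
    lo = float(all_samples.min())
    return {
        "mean": float(episode_means.mean()),  # average of the per-episode means
        "std": float(episode_means.std()),    # spread across episodes
        "max": hi,
        "min": lo,
        "range": abs(hi - lo),                # |max - min|, the fluctuation measure
    }
```

For instance, `summarize([[1.0, 3.0], [2.0, 4.0]])` treats the two short lists as two placeholder episodes and reports their mean, spread, and range.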
Since our HWT-based wind farm model, adapted from FAST.Farm, includes the sub-structural dynamics of HWTs, which is an advantage over other wind farm models, we analyzed the flapwise tip deflection of one blade and the fore-aft displacement of the tower of the front-left HWT in a six-turbine farm layout under MAPO, MADDPG, and the greedy control policy (Fig. 12).The results show that none of these three control strategies cause unusual vibrations of the blade and tower, and other HWTs have similar results.This implies that HWTs operate within safe structural limits under these three controllers.

B. Parameter Analysis
In MAPO, we use a group value network, whose input is the group state s, to estimate the future wind farm return, and an individual value network for each agent, whose input is its observation o, to estimate that agent's future return. Without violating the principle of CTDE, the input of the individual value networks can also be the group state s; we refer to this variant as MAPO-v2. Intuitively, MAPO-v2 should predict the agent return more precisely and faster, as the network acquires more state information about the wind farm. However, Fig. 13 shows that, in terms of both variance and cumulative return, the performance of MAPO-v2 is distinctly worse than that of MAPO. Based on this result, we conjecture that the observations of other HWTs do not help estimate the return of the target agent and may even act as noise. Therefore, using local information to estimate the individual return is more appropriate when training the RL agents.
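The difference between the two variants lies only in what is fed to each agent's individual value network, as the following minimal sketch illustrates. The function name `individual_critic_input` and the observation values are hypothetical. The sketch also assumes that the group state s is the concatenation of all turbine observations, which the text above does not state explicitly.

```python
def individual_critic_input(observations, agent_idx, variant="MAPO"):
    """Return the input vector for one agent's individual value network.

    observations: list of per-turbine observation vectors (lists of floats).
    variant: "MAPO" uses the local observation o_i;
             "MAPO-v2" uses the group state s.
    Assumption: s is the concatenation of all turbine observations.
    """
    if variant == "MAPO":
        return list(observations[agent_idx])
    if variant == "MAPO-v2":
        return [x for obs in observations for x in obs]
    raise ValueError(f"unknown variant: {variant}")


# Hypothetical three-turbine farm with two features per turbine.
obs = [[8.0, 0.1], [6.5, -0.2], [6.0, 0.0]]
local_in = individual_critic_input(obs, 1, "MAPO")      # o_2 only
global_in = individual_critic_input(obs, 1, "MAPO-v2")  # all observations
```

Under this assumption, the MAPO-v2 input dimension grows with the number of turbines, while the MAPO input stays fixed per agent.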
The core idea of MAPO is to use a dynamic parameter η to balance the agent return and the group return. There are two alternative options: 1) Fixed weight: η in (14) is set to a fixed value. 2) Agent weight: η in (14) is set to 0. The fixed weight method assigns equal weights to agents exploring their own policies and to boosting the group return. As a result, a large variance persists throughout the training process (Fig. 14). With the agent weight method, the objective remains unchanged during training, which leads to low variance in the results. However, the learned control policy eventually falls into a locally optimal solution (Fig. 14). In conclusion, the dynamic weight method is superior because it properly balances the exploration-exploitation dilemma.
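The three weighting options can be contrasted with a minimal sketch. Equation (14) is not reproduced in this section, so the sketch assumes a convex combination in which η = 0 recovers the agent-only objective, consistent with the agent weight option above. The function names, the default fixed value of 0.5, and especially the linear ramp used for the dynamic case are hypothetical and are not the schedule used in MAPO.

```python
def eta_schedule(mode, episode, total_episodes, fixed_value=0.5):
    """Return the weight eta for a given training episode.

    mode: "agent" (eta = 0), "fixed" (constant eta), or "dynamic".
    Assumption: the dynamic schedule is a linear ramp from 0 to 1;
    this is a placeholder, not the schedule defined in the paper.
    """
    if mode == "agent":
        return 0.0
    if mode == "fixed":
        return fixed_value
    if mode == "dynamic":
        return min(1.0, episode / max(1, total_episodes - 1))
    raise ValueError(f"unknown mode: {mode}")


def mixed_objective(agent_term, group_term, eta):
    """Assumed convex combination of the agent and group objectives."""
    return (1.0 - eta) * agent_term + eta * group_term


# Hypothetical objective values for one agent at one update step.
agent_only = mixed_objective(2.0, 4.0, eta_schedule("agent", 10, 200))
equal_mix = mixed_objective(2.0, 4.0, eta_schedule("fixed", 10, 200))
```

With η = 0 the group term is ignored entirely, and with the fixed value 0.5 both terms contribute equally throughout training.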

V. CONCLUSION
In this paper, we developed an HWT-based wind farm model by adapting FAST.Farm. HWTs have the potential to reduce the maintenance cost of wind farms. We also proposed MAPO (multi-agent policy optimization) to optimize the wind farm control policy and boost the power generation of HWT-based farms. Our simulation results show that MAPO performs well across different wind farm layouts and in fluctuating environments. In addition, the control policy trained by MAPO did not cause any unusual vibrations in the substructures of the HWTs, indicating that it does not affect the safe operation of the turbines. Moreover, the CTDE paradigm used in MAPO is beneficial for real-world deployment, as it avoids the real-time communication issue between turbines within a wind farm.

Manuscript received 1 May 2022; revised 20 October 2022 and 18 February 2023; accepted 19 April 2023. Date of publication 26 April 2023; date of current version 20 September 2023. This work was supported by European Union's Horizon 2020 Research and Innovation Program through the Marie Sklodowska-Curie under Grant 861398. Paper no. TSTE-00442-2022. (Corresponding author: Xiaowei Zhao.)

Fig. 3. Sub-model hierarchy of the HWT-based farm simulator for MARL. Note that we only illustrate one turbine in this figure for convenience. In fact, this simulator can include multiple turbines during operation.

Fig. 5. The reward functions in the wind farm control task.

Fig. 8 compares the training curves of MAPO traced by the cumulative returns in 200 episodes, with the benchmark

Fig. 8. Comparison of MAPO with MADDPG and the greedy control policy. Left: results of the wind farm composed of three hydrostatic wind turbines. Middle: results of the wind farm composed of six hydrostatic wind turbines. Right: results of the wind farm composed of nine hydrostatic wind turbines. Please see Fig. 7 for the layouts of the three wind farms.

Fig. 9. Training curves of each HWT in a wind farm consisting of three HWTs.The sequence of them is: WT1, WT2 and WT3.

Algorithm 1: Multi-agent policy optimization for a wind farm with n HWTs.

Fig. 10.The yaw control policy of MAPO for overcoming the wake effect.

Fig. 12. Displacements of Blade 1 and the tower of the front-left HWT in the six-turbine wind farm under different control policies. Top: Blade 1 flapwise tip deflections. Bottom: Tower fore-aft displacements.

Fig. 13. Results of using local state or global state to estimate the agent return.

TABLE IV. TEST RESULTS OF THREE CONTROLLERS IN FOUR WIND FARMS (UNIT: W)