International Journal of Electrical and Computer Engineering (IJECE)
Vol. …, No. …, October 2025, pp. 4542-4554
ISSN: 2088-8708, DOI: 10.11591/ijece…

Discount factor-based data-driven reinforcement learning cascade control structure for unmanned aerial vehicle systems

Ngoc Trung Dang, Quynh Nga Duong
Faculty of Electrical Engineering, Thai Nguyen University of Technology, Thai Nguyen, Vietnam

ABSTRACT
This article investigates a discount factor-based data-driven reinforcement learning control (DDRLC) algorithm for completely uncertain unmanned aerial vehicle (UAV) quadrotors. The proposed cascade control structure of the UAV is divided into two control loops, for the attitude and position subsystems, each of which is equipped with the proposed discount factor-based DDRLC. By analyzing the time derivative of the Bellman function from two perspectives, a revised Hamilton-Jacobi-Bellman (HJB) equation including a discount factor is developed. Then, from an off-policy viewpoint, an equation is formulated that simultaneously solves for the approximate Bellman function and the approximate optimal control law in the proposed DDRLC algorithm with guaranteed convergence. Using a modified state-variable vector, the discount factor-based DDRLC algorithm in each control loop is implemented indirectly by transforming the time-varying tracking-error model into a time-invariant one. Finally, a simulation study of the proposed discount factor-based DDRLC algorithm is provided to validate its effectiveness. To assess the tracking performance of the quadrotor, four performance indices are considered: IAE_p = 3.0527, IAE_Θ = 0.1175, ITAE_p = 1.…, and ITAE_Θ = 0.0144, where the subscript p denotes the position tracking error and Θ denotes the attitude tracking error.
Article history: Received Oct 25, 2024; Revised Jun 18, 2025; Accepted Jul 12, 2025

Keywords: Approximate/adaptive dynamic programming; Data-driven reinforcement learning; Model-free control; Quadrotor; Unmanned aerial vehicles

This is an open access article under the CC BY-SA license.

Corresponding Author: Ngoc Trung Dang, Faculty of Electrical Engineering, Thai Nguyen University of Technology, 3-2 Street, Tich Luong Commune, Thai Nguyen City, Vietnam. Email: trungcsktd@tnut.

Journal homepage: http://ijece.

INTRODUCTION
In recent decades, unmanned aerial vehicles (UAVs) have been increasingly used to perform various tasks, such as surveillance, military missions, air traffic control, and agricultural management […]. To perform these tasks effectively, both the trajectory tracking problem and optimal control performance must be addressed. In practical applications, these two control requirements must be met in spite of external disturbances and dynamic uncertainties. Because the UAV model is complex, with a high number of variables, a model-separation approach is adopted, yielding rotational and translational subsystems […]. In […], the control designs for the position and attitude subsystems were implemented in a similar manner using the sliding mode control (SMC) technique; with the addition of a state observer, neural networks (NNs) were employed to handle external disturbances and dynamic uncertainties. Extensions were developed for multi-rotor UAV models with unknown bounded time-varying disturbances via an augmented disturbance observer (DO)-based controller, implemented under the appointed-time prescribed performance (ATPP) technique […]. In […], an adaptive trajectory tracking control was proposed for experimental UAV systems after estimating the necessary variables from image and inertial measurements. Moreover, for the general robotics control designs studied in […]
, an output-feedback law with a state observer was presented for surface vessels (SVs) according to an event-triggered rule. To further handle backlash-like hysteresis and external disturbances, an adaptive fuzzy dynamic memory event-triggered mechanism was studied for a six-rotor UAV within a backstepping recursive framework using a first-order filtering technique […]. However, as far as we know, little research attention has been paid to the optimal control of UAV systems. Given the complexity of the UAV model and the diversity of practical tasks, it is difficult to achieve complex control objectives with a single UAV agent. Hence, UAV research has put forward the concept of multi-agent systems (MASs), which involves two research hotspots: the consensus and formation control problems […]. In […], a consensus control law was developed for multiple UAV systems with time delay and a cascade model. However, the Kronecker product and linear matrix inequalities (LMIs) were employed in […] because of the simplified UAV model. The research conducted in […] concerned a consensus controller with a sign function; consequently, the stability analysis requires Filippov theory. Additionally, a bearing persistence-of-excitation (PE) based leader-follower formation control strategy was proposed for multiple double integrators in three-dimensional (3D) space using the projection of a vector onto the plane orthogonal to the 2-sphere […]. When each agent is modeled in more detail as an Euler-Lagrange system, a state representation can be used to obtain an event-triggered consensus controller with the Kronecker product […]. The fault-tolerant consensus control problem for nonstrict-feedback nonlinear MASs with intermittent actuator faults was investigated using a state observer and the backstepping technique […].
Moreover, the formation control of multiple UAVs has also been considered via model predictive control (MPC) with an affine tracking-error model […]. Despite this, the studies […] did not examine the stability properties of the closed-loop system operating under the MPC framework. For the formation tracking control problem, addressing time-varying formation (TVF) is also extremely crucial for meeting application requirements […]. For a linear UAV model, TVF tracking control was investigated using the Kronecker product and the LMI technique […]. Although the cost function was mentioned in […], the optimal control law was not studied in that work. On the other hand, an extended state observer (ESO) based backstepping controller was proposed for the second-order attitude subsystem […]. Furthermore, the yaw angle of the virtual leader was estimated in connection with the time-varying communication topology, and the distributed formation tracking control was addressed in the position subsystem […]. Based on the linear model of fixed-wing UAVs, TVF tracking control was discussed by employing the solution of a Riccati equation […]. Notably, […] tackled TVF tracking control for multiple linear systems by extending the event-triggered mechanism. Although there has been some research on distributed control schemes for MASs, especially consensus and formation systems, most recent references have focused on simple UAV models and have rarely considered the cascade UAV structure or an optimization-based control formulation. Implementing the optimal control law in real-world systems requires iterative algorithms to compute solutions of the Hamilton-Jacobi-Bellman (HJB) equation for nonlinear systems or the Riccati equation for linear systems, since analytical solutions are typically not available.
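The iterative computation just mentioned can be illustrated for the linear/Riccati case with Kleinman's policy iteration: starting from a stabilizing gain, each step solves only a linear (Lyapunov) equation, avoiding any analytic solution of the Riccati equation. The double-integrator plant and the initial gain below are illustrative choices of ours, not taken from this article.

```python
import numpy as np

# Kleinman's policy iteration for the CARE  A'P + PA - P B R^{-1} B' P + Q = 0.
# Each iteration: solve a Lyapunov equation for the current closed loop, then
# update the gain. Illustrative double-integrator example.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

def lyap(Ac, W):
    """Solve Ac' P + P Ac + W = 0 via Kronecker products (small systems)."""
    n = Ac.shape[0]
    M = np.kron(np.eye(n), Ac.T) + np.kron(Ac.T, np.eye(n))
    return np.linalg.solve(M, -W.reshape(n * n)).reshape(n, n)

K = np.array([[1.0, 1.0]])  # stabilizing initial gain: eig(A - B K) in left half-plane
for _ in range(20):
    Ac = A - B @ K
    P = lyap(Ac, Q + K.T @ R @ K)      # policy evaluation
    K = np.linalg.solve(R, B.T @ P)    # policy improvement

# Residual of the CARE at the converged P (should be ~0).
residual = np.linalg.norm(A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q)
```

For this plant the closed-form LQR gain is K = [1, sqrt(3)], which the iteration reaches in a handful of steps; the same evaluate/improve pattern underlies the data-driven algorithms developed later in the article.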
To advance the implementation of optimal control in robotic systems, it is essential to combine reinforcement learning control (RLC) with methods from approximate and adaptive dynamic programming (ADP), as highlighted in […]. In […], the actor/critic structure was realized via neural network (NN) approximation, with learning algorithms for weight adaptation proposed alongside optimization strategies, enabling the closed-loop system to satisfy both tracking performance and optimality requirements. However, external disturbances and dynamic uncertainties in the practical model must be eliminated, which is handled by traditional robust control design […]. A different approach, which handles external disturbances and dynamic uncertainties directly within the optimal control law, is found in zero-sum and non-zero-sum game methods […]. On the other hand, unlike the simultaneous learning of the actor/critic framework in […], the authors in […] developed a sequential-learning value iteration (VI) algorithm to obtain the Bellman function and the optimal control law. Some researchers have focused on data-driven RL to obtain optimal control strategies for uncertain systems […]. With data collected over a time interval, the approximate optimal function can be computed from the approximate optimal control input without knowledge of the model. However, to handle complete uncertainty in the inverse direction, an off-policy technique or Q-learning must additionally be considered […]. A data-driven reinforcement learning control strategy was recently introduced for quadrotors, demonstrating the capability to achieve optimal control while ensuring trajectory tracking, which is closely related to the focus of this article […]. However, the data-driven RL approach in […]
was applied solely to the attitude subsystem of a UAV, and the associated cost function did not incorporate a discount factor. On account of the above results, we further explore the cascade UAV control structure, which involves two data-driven RL algorithms with a discount factor-based performance index; this is another interest of this study.

This study investigates a cascade control architecture for a fully uncertain quadrotor UAV by employing two data-driven RL algorithms based on a performance index with a discount factor. By constructing a data set tailored to this general class of affine continuous-time systems and integrating an RL strategy using an off-policy algorithm, a control framework is formulated for UAVs with unknown dynamics. The contributions of this study are summarized as follows:
- Based on the optimal control scheme with a discount factor-based performance index, we introduce an RL algorithm for an affine continuous-time system that guarantees a finite value of the integral cost function with an infinite terminal time.
- We propose a novel data-driven RL based cascade control structure covering both subsystems of completely uncertain UAVs via the off-policy method. Compared with the current results […], which consider the RL algorithm only for the attitude subsystem without a discount factor, a data-driven RL based cascade control structure is proposed for the first time for completely uncertain UAVs with a discount factor-based performance index.
- Finally, simulation results are presented to validate the effectiveness of the proposed model-free, data-driven RL algorithm.

CONTROLLER METHODOLOGY FOR QUADROTOR
As shown in Figure 1, the Earth-fixed frame and the body-fixed frame are established to describe the dynamic model of the quadrotor.
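The transformation between these two frames is the roll-pitch-yaw (ZYX) Euler rotation used throughout the modeling below. As a minimal sketch (the function name and test angles are our own, and we assume the ZYX convention that matches the rotation matrix given in the next section), the Earth-to-body rotation can be built and checked for orthonormality:

```python
import numpy as np

def rotation_matrix(phi, theta, psi):
    """Earth-fixed to body-fixed rotation for roll phi, pitch theta, yaw psi
    (ZYX Euler convention, assumed to match the paper's rotation matrix)."""
    c, s = np.cos, np.sin
    return np.array([
        [c(theta) * c(psi),                        c(theta) * s(psi),                        -s(theta)],
        [s(phi) * s(theta) * c(psi) - c(phi) * s(psi), s(phi) * s(theta) * s(psi) + c(phi) * c(psi), s(phi) * c(theta)],
        [c(phi) * s(theta) * c(psi) + s(phi) * s(psi), c(phi) * s(theta) * s(psi) - s(phi) * c(psi), c(phi) * c(theta)],
    ])

R = rotation_matrix(0.1, -0.2, 0.3)
# A valid rotation matrix is orthonormal with determinant +1.
orthonormal = np.allclose(R @ R.T, np.eye(3))
det_ok = np.isclose(np.linalg.det(R), 1.0)
```

Within the Euler-angle bounds stated below, this map is invertible, which is what allows attitude references to be extracted from the position loop later in the cascade design.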
The movements of the quadrotor shown in Figure 1 are produced by changes in the four lift forces, which are generated by adjusting the angular velocities of the four rotors. A vertical movement is obtained by varying the sum of the four lift forces on the four rotors. The yaw movement is produced by the difference between the counter-torques of one rotor group (Rotor 1 and Rotor 3) and the other group (Rotor 2 and Rotor 4). Additionally, the pitch and roll movements are generated by changing the lift forces within each pair, which results in the longitudinal and lateral motions, as shown in Figure 1. The position of the UAV quadrotor and the quadrotor attitude are given as $p = [p_x, p_y, p_z]^T \in \mathbb{R}^3$ and $\Theta = [\phi, \theta, \psi]^T \in \mathbb{R}^3$, respectively. The Roll-Pitch-Yaw Euler angles satisfy the bound conditions $-\pi/2 < \phi < \pi/2$, $-\pi/2 < \theta < \pi/2$, and $-\pi < \psi < \pi$. Moreover, the UAV quadrotor parameters are listed in Table 1.

Figure 1. Quadrotor model in North-East-Down (NED) coordinates

Table 1. UAV parameters and variables
$m$ — mass (weight) of the quadrotor
$g$ — acceleration of gravity
$\omega_1, \omega_2, \omega_3, \omega_4$ — angular velocity of each rotor
$l$ — arm length
$J = \mathrm{diag}\{J_\phi, J_\theta, J_\psi\} \in \mathbb{R}^{3 \times 3}$ — inertia matrix, symmetric and positive definite
$k_f, k_l, k_\tau$ — positive parameters

The rotation matrix $R \in SO(3)$, representing the transformation from the Earth-fixed frame to the body-fixed coordinate system, is given as (1):

$$R = \begin{bmatrix} c_\theta c_\psi & c_\theta s_\psi & -s_\theta \\ s_\phi s_\theta c_\psi - c_\phi s_\psi & s_\phi s_\theta s_\psi + c_\phi c_\psi & s_\phi c_\theta \\ c_\phi s_\theta c_\psi + s_\phi s_\psi & c_\phi s_\theta s_\psi - s_\phi c_\psi & c_\phi c_\theta \end{bmatrix} \quad (1)$$

where $c_{(\cdot)} = \cos(\cdot)$ and $s_{(\cdot)} = \sin(\cdot)$.

Int J Elec & Comp Eng, Vol. …, No. …, October 2025: 4542-4554

In the view of […]
, the complete quadrotor dynamic model can be represented as (2):

$$m\ddot{p} = R^T F, \qquad J\ddot{\Theta} = -C(\Theta, \dot{\Theta})\dot{\Theta} + \tau \quad (2)$$

where the parameters are given in Table 1 and the Coriolis matrix $C(\Theta, \dot{\Theta}) \in \mathbb{R}^{3 \times 3}$ is described in […]. Additionally, the force $F \in \mathbb{R}^{3 \times 1}$ relative to the body-fixed frame of the quadrotor can be obtained as (3):

$$F = [0, 0, f]^T - mR[0, 0, g]^T \quad (3)$$

where the lifting force $f \in \mathbb{R}$ and the torque $\tau = [\tau_\phi, \tau_\theta, \tau_\psi]^T \in \mathbb{R}^3$ are given as (4), (5):

$$f = k_f(\omega_1^2 + \omega_2^2 + \omega_3^2 + \omega_4^2) \quad (4)$$

$$\tau_\phi = k_l k_f(\omega_2^2 - \omega_4^2), \quad \tau_\theta = k_l k_f(\omega_1^2 - \omega_3^2), \quad \tau_\psi = k_\tau(\omega_1^2 - \omega_2^2 + \omega_3^2 - \omega_4^2) \quad (5)$$

Accordingly, the control signals of the quadrotor are defined as (6):

$$u_f = \omega_1^2 + \omega_2^2 + \omega_3^2 + \omega_4^2, \quad u_\phi = \omega_2^2 - \omega_4^2, \quad u_\theta = \omega_1^2 - \omega_3^2, \quad u_\psi = \omega_1^2 - \omega_2^2 + \omega_3^2 - \omega_4^2 \quad (6)$$

The control objective of this paper is to develop a data-driven RL algorithm based on the optimal control scheme to achieve an optimized tracking control law for the quadrotor, enabling it to track the desired trajectory with high accuracy. The optimal control signal ensures trajectory tracking while simultaneously achieving approximate optimality by minimizing the objective function. Additionally, the data-driven RL-based optimal control law is developed not only for the position subsystem but also for the attitude subsystem, without knowledge of the UAV model.

Remark 1. Unlike the conventional trajectory tracking control objective in UAV control systems […], the control objective in this paper considers both the trajectory tracking performance and the optimal control problem. In addition, both subsystems shown in Figure 2 achieve a unified framework of optimal control and stability, which is typically difficult to attain due to the time-varying dynamics of the closed-loop systems.

Figure 2. The quadrotor control schematic
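The map from squared rotor speeds to the four control signals above is linear and invertible, so commanded controls can always be allocated back to rotor speeds. A minimal sketch (the matrix rows follow the channel definitions above; variable names are our own):

```python
import numpy as np

# Mixing matrix: [u_f, u_phi, u_theta, u_psi]^T = M @ [w1^2, w2^2, w3^2, w4^2]^T
M = np.array([
    [1.0,  1.0,  1.0,  1.0],   # u_f    : total squared rotor speed
    [0.0,  1.0,  0.0, -1.0],   # u_phi  : roll channel  (rotor 2 vs rotor 4)
    [1.0,  0.0, -1.0,  0.0],   # u_theta: pitch channel (rotor 1 vs rotor 3)
    [1.0, -1.0,  1.0, -1.0],   # u_psi  : yaw channel
])

def allocate(u):
    """Recover squared rotor speeds from the four control signals."""
    return np.linalg.solve(M, u)

w_sq = np.array([4.0, 3.0, 2.0, 1.0])   # example squared rotor speeds
u = M @ w_sq                            # resulting control signals
w_sq_back = allocate(u)                 # round trip through the allocation
```

Invertibility of this mixing matrix is what lets the cascade controller work entirely in the $(u_f, u_\phi, u_\theta, u_\psi)$ coordinates and defer rotor-speed allocation to the last step.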
In this section, a data-driven reinforcement learning approach is introduced to address the trade-off between tracking performance and optimality within the quadrotor control system. The control architecture illustrated in Figure 2 integrates both position and attitude control strategies under the application of a discount factor. These controllers are updated concurrently using the collected data to handle system uncertainties effectively.

Discount factor-based RL control design for the augmented quadrotor system
First of all, we consider a nonlinear affine system (7):

$$\dot{x}(t) = f(x(t)) + g(x(t))u(t) \quad (7)$$

and the associated cost function is defined by (8):

$$V(x(t), u) = \int_t^\infty \left[ x(\tau)^T Q x(\tau) + u(\tau)^T R u(\tau) \right] d\tau \quad (8)$$

where $Q \in \mathbb{R}^{n \times n} > 0$ and $R \in \mathbb{R}^{m \times m} > 0$ are both symmetric positive definite matrices. The tracking-error model of the nonlinear affine system (7) with the desired trajectory $x_d(t)$, generated by the command generator model $\dot{x}_d(t) = h_d(x_d(t))$ with $h_d(0) = 0$, can be formulated as (9):

$$\dot{e}(t) = f(x(t)) - h_d(x_d(t)) + g(x(t))u(t) \quad (9)$$

where $e(t) = x(t) - x_d(t)$ and $f(\cdot)$ is an unknown function. Hence, according to the tracking-error model (9) and the command generator model, we achieve the following augmented system (10):

$$\dot{X}(t) = F(X(t)) + G(X(t))u(t), \quad X = \begin{bmatrix} e \\ x_d \end{bmatrix}, \quad F(X) = \begin{bmatrix} f(e + x_d) - h_d(x_d) \\ h_d(x_d) \end{bmatrix}, \quad G(X) = \begin{bmatrix} g(e + x_d) \\ 0 \end{bmatrix} \quad (10)$$

The optimal control law $u^*(t)$ is designed to minimize the discounted cost function associated with the augmented system (10):

$$V(X(t), u) = \int_t^\infty e^{-\gamma(\tau - t)} r(X(\tau), u(\tau)) \, d\tau \quad (11)$$

where $\gamma > 0$ is a discount factor, $r(X, u) \triangleq X^T \bar{Q} X + u^T R u$, and $\bar{Q} = \mathrm{diag}\{Q, 0\}$. The addition of the discount factor $\gamma$ in the cost function (11) guarantees a finite value even though the integration terminal is infinite. Therefore, it is unnecessary to explicitly define the admissible control set, as discussed in […]. The set $\Psi(\Omega)$ is defined as the constraint set of control inputs $u(X)$
such that the discounted cost function is finite. Based on the dynamic programming principle, the tracking Bellman function for the augmented system can be expressed as the following static function (12):

$$V^*(X(t)) = \min_{u(X) \in \Psi(\Omega)} V(X(t), u) \quad (12)$$

Based on two approaches for computing the time derivative of the Bellman function $V^*(X(t))$ in (12), the associated Hamiltonian function under a discount factor $\gamma > 0$ is formulated. The first approach is a direct computation, as detailed in (13):

$$\frac{d}{dt} V^*(X(t)) = \left( \frac{\partial V^*}{\partial X} \right)^T \left[ F(X(t)) + G(X(t))u^*(t) \right] \quad (13)$$

where $u^*(t)$ denotes the optimal control input. According to the Bellman principle, a second approach for computing the time derivative of $V^*(X(t))$ is formulated by utilizing the static Bellman function in (12), as in (14):

$$V^*(X(t)) = \int_t^{t + \Delta t} e^{-\gamma(\tau - t)} r(X(\tau), u^*(\tau)) \, d\tau + e^{-\gamma \Delta t} V^*(X(t + \Delta t)) \quad (14)$$

The representation (14) obtains (15):

$$\frac{V^*(X(t + \Delta t)) - V^*(X(t))}{\Delta t} = -\frac{1}{\Delta t} \int_t^{t + \Delta t} e^{-\gamma(\tau - t)} r(X(\tau), u^*(\tau)) \, d\tau + \frac{1 - e^{-\gamma \Delta t}}{\Delta t} V^*(X(t + \Delta t)) \quad (15)$$

In the view of (15) as $\Delta t \to 0$, we achieve that the static Bellman function $V^*(X(t))$ can be solved with the optimal control signal $u^*(t)$ using the following partial differential equation (16):

$$r(X(t), u^*(t)) - \gamma V^*(X(t)) + \left( \frac{\partial V^*}{\partial X} \right)^T \left[ F(X(t)) + G(X(t))u^*(t) \right] = 0 \quad (16)$$

Conversely, to determine the optimal control input $u^*(t)$ using the static Bellman function $V^*(X(t))$ and based on the Bellman principle, the corresponding optimization problem can be formulated as (17):

$$V^*(X(t)) = \min_{u(X) \in \Psi(\Omega)} \left( \int_t^{t + \Delta t} e^{-\gamma(\tau - t)} r(X(\tau), u(\tau)) \, d\tau + e^{-\gamma \Delta t} V^*(X(t + \Delta t)) \right) \quad (17)$$

Since $\Delta t \to 0$, (17) leads to the corresponding optimization problem (18):

$$\min_{u(X) \in \Psi(\Omega)} \left[ r(X(t), u(t)) - \gamma V^*(X(t)) + \left( \frac{\partial V^*}{\partial X} \right)^T \left( F(X(t)) + G(X(t))u(t) \right) \right] = 0 \quad (18)$$
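For a scalar linear plant the discounted HJB equation above can be verified explicitly: with a quadratic value-function ansatz it collapses to a scalar Riccati equation. The sketch below uses our own illustrative numbers and checks that the HJB residual vanishes at the computed solution.

```python
import math

# Scalar system dx/dt = a x + b u with discounted cost  ∫ e^{-γτ}(q x² + r u²) dτ.
# With V*(x) = p x² and u*(x) = -(p b / r) x, the discounted HJB
#   q x² + r u*² - γ V*(x) + dV*/dx · (a x + b u*) = 0
# reduces to the scalar Riccati equation  q + (2a - γ) p - (b²/r) p² = 0.
a, b, q, r, gamma = 1.0, 1.0, 1.0, 1.0, 0.1

# Positive root of  A2 p² + A1 p + A0 = 0  with A2 < 0.
A2, A1, A0 = -(b ** 2) / r, 2 * a - gamma, q
p = (-A1 - math.sqrt(A1 ** 2 - 4 * A2 * A0)) / (2 * A2)

def hjb_residual(x):
    u = -(p * b / r) * x          # optimal feedback from the quadratic value
    dV = 2 * p * x                # gradient of V*(x) = p x²
    return q * x ** 2 + r * u ** 2 - gamma * (p * x ** 2) + dV * (a * x + b * u)

residual = max(abs(hjb_residual(x)) for x in (-2.0, -0.5, 1.0, 3.0))
```

Note how the discount enters only through the shift $2a - \gamma$: discounting acts like moving the pole of the plant, which is why the discounted problem stays solvable by the same Riccati machinery.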
Defining the modified Hamiltonian function in the presence of a discount factor $\gamma > 0$ as (19):

$$H(X, u, \nabla V, V) = X^T \bar{Q} X + u^T R u - \gamma V(X) + \nabla V^T(X) \left[ F(X) + G(X)u \right] \quad (19)$$

where $\nabla V(X) \triangleq \partial V(X) / \partial X$, it follows that the optimal control solution is obtained by (20):

$$u^*(X) = \arg\min_{u \in \Psi(\Omega)} H(X, u, \nabla V^*, V^*) = -\frac{1}{2} R^{-1} G^T(X) \nabla V^*(X) \quad (20)$$

Additionally, substituting the optimal control law (20) into (19) implies the partial differential equation (PDE) (21):

$$H^*(X, u^*, \nabla V^*, V^*) = X^T \bar{Q} X - \frac{1}{4} \nabla V^{*T}(X) G(X) R^{-1} G^T(X) \nabla V^*(X) - \gamma V^*(X) + \nabla V^{*T}(X) F(X) = 0 \quad (21)$$

Remark 2. Including a positive discount factor $\gamma > 0$ ensures that the cost function remains finite, even when the state variable $X(t)$ does not converge to zero as $t \to \infty$. This consideration leads to the appearance of the term $\gamma V^*(X)$ in (21), resulting in necessary adjustments within the discount factor-based RL control framework described in sections 2.2 and 2.3.

Data-driven proportional-integral position controller
In this section, a cascade control framework for the quadrotor UAV shown in Figure 2 is formulated following the model separation above, where each subsystem applies a discount factor-based optimal control approach. However, due to the inherent uncertainties and nonlinearities present in the model, obtaining a direct analytical solution is infeasible. As a result, a data-driven RL algorithm is employed to estimate the static Bellman function $V^*(X)$ corresponding to the optimal control policy $u^*(X)$ for each subsystem. The dynamic model of the position subsystem can be modified as (22):

$$\ddot{p} = \frac{1}{m} u_p, \qquad u_p \triangleq f R^T(\Theta)[0, 0, 1]^T - m[0, 0, g]^T \quad (22)$$

where $u_p$ is the virtual position control input. For developing the control design of the position subsystem (22)
, the tracking-error model must be made time invariant. Therefore, the state-variable vector $x_p = [p_x, \dot{p}_x, p_y, \dot{p}_y, p_z, \dot{p}_z]^T \in \mathbb{R}^6$ is applied to reduce the order of (22). Hence, the model (22) can be transformed into the first-order system (23):

$$\dot{x}_p = A_p x_p + B_p u_p, \quad A_p = \mathrm{diag}\{A_0, A_0, A_0\} \in \mathbb{R}^{6 \times 6}, \quad A_0 = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad B_p = \mathrm{diag}\{b_0, b_0, b_0\}, \quad b_0 = \begin{bmatrix} 0 \\ 1/m \end{bmatrix} \quad (23)$$

Moreover, because the desired trajectory $p_d(t) = [p_{x,d}(t), p_{y,d}(t), p_{z,d}(t)]^T \in \mathbb{R}^3$ is time varying, transforming the tracking-error model of the position subsystem into a time-invariant model requires the following assumptions:

Assumption 1. The desired trajectory $p_d(t)$ is a Lipschitz function.

Assumption 2. The reference vector $x_p^d = [p_{x,d}, \dot{p}_{x,d}, p_{y,d}, \dot{p}_{y,d}, p_{z,d}, \dot{p}_{z,d}]^T \in \mathbb{R}^6$ is bounded, and its time derivative can be completely expressed as $\dot{x}_p^d(t) = A_d x_p^d(t)$.

Therefore, in the view of (23), we obtain the time-invariant model (24):

$$\dot{X}_p = \begin{bmatrix} \dot{e}_p \\ \dot{x}_p^d \end{bmatrix} = \begin{bmatrix} A_p & A_p - A_d \\ 0 & A_d \end{bmatrix} X_p + \begin{bmatrix} B_p \\ 0 \end{bmatrix} u_p, \quad e_p = x_p - x_p^d, \quad X_p = \begin{bmatrix} e_p \\ x_p^d \end{bmatrix} \quad (24)$$

The tracking cost function is modified as (25):

$$V_p(X_p(t)) = \int_t^\infty e^{-\gamma_p(\tau - t)} \left[ X_p(\tau)^T Q_p X_p(\tau) + u_p(\tau)^T R_p u_p(\tau) \right] d\tau, \quad Q_p = \begin{bmatrix} Q_{ep} & 0 \\ 0 & 0 \end{bmatrix} \quad (25)$$

where $Q_{ep} \in \mathbb{R}^{6 \times 6}$ and $R_p \in \mathbb{R}^{3 \times 3}$ are symmetric positive definite matrices. Note that the discount term $e^{-\gamma_p(\tau - t)}$ is added to (25) to ensure a finite cost function even though $X_p$ does not converge to zero as time approaches infinity. According to (25) and the off-policy technique […], the following data-driven algorithm is proposed to develop the position controller:

Algorithm 1.
Data-driven algorithm for the position controller
Step 1 (Initialization): Employ a stabilizing policy $u_p^0(X_p)$ and an additional noise $e_p(t)$ satisfying the PE condition; collect the input-output data of the quadrotor system and establish the threshold $\epsilon_p$.
Step 2 (Policy evaluation): Based on the applied control input $u_p(t) = u_p^i(X_p) + e_p(t)$ and the control policy $u_p^i(X_p)$, solve (26) to find $V_p^{i+1}(X_p)$ and $u_p^{i+1}(X_p)$ simultaneously:

$$e^{-\gamma_p \Delta t} V_p^{i+1}(X_p(t + \Delta t)) - V_p^{i+1}(X_p(t)) = -\int_t^{t + \Delta t} e^{-\gamma_p(\tau - t)} \left[ X_p^T Q_p X_p + (u_p^i)^T R_p u_p^i \right] d\tau - \int_t^{t + \Delta t} e^{-\gamma_p(\tau - t)} \, 2 (u_p^{i+1})^T R_p \, e_p \, d\tau \quad (26)$$

Step 3 (Policy improvement): Set the control policy $u_p^i(X_p) = u_p^{i+1}(X_p)$, $i \to i + 1$, and go to step 2 until $\| u_p^{i+1} - u_p^i \| < \epsilon_p$.

In Algorithm 1, the solution of the Bellman equation (26) is improved iteratively using the collected data, without requiring knowledge of the system matrices.

After achieving the position control signal $u_p$ in the quadrotor control structure shown in Figure 2, we proceed to compute the reference of the attitude control scheme $[\phi_d, \theta_d, \psi_d]^T$ as follows. According to $u_p = f R^T(\Theta)[0, 0, 1]^T - m[0, 0, g]^T$, it follows that (27):

$$u_p + m[0, 0, g]^T = f \begin{bmatrix} \cos\phi \sin\theta \cos\psi + \sin\phi \sin\psi \\ \cos\phi \sin\theta \sin\psi - \sin\phi \cos\psi \\ \cos\phi \cos\theta \end{bmatrix} \quad (27)$$

By setting the yaw angle reference $\psi_d(t)$ as a constant to synchronize in practical applications, based on (27), we can achieve the desired $f_d$, $\phi_d$, $\theta_d$ as (28):

$$f_d = \left\| u_p + m[0, 0, g]^T \right\|, \quad \phi_d = \arcsin\!\left( \frac{u_{p,x} \sin\psi_d - u_{p,y} \cos\psi_d}{f_d} \right), \quad \theta_d = \arcsin\!\left( \frac{u_{p,x} \cos\psi_d + u_{p,y} \sin\psi_d}{f_d \cos\phi_d} \right) \quad (28)$$
Data-driven RL based attitude controller
In this part, a data-driven RL-based attitude control law is designed analogously to obtain the input signal $u_a$ satisfying optimal tracking performance with respect to the desired trajectory $[\phi_d, \theta_d, \psi_d]^T$. The attitude dynamic model can be rewritten as (29):

$$\ddot{\Theta} = J^{-1} \tau - J^{-1} C(\Theta, \dot{\Theta}) \dot{\Theta} \quad (29)$$

By considering the attitude state vector $x_a = [\phi, \dot{\phi}, \theta, \dot{\theta}, \psi, \dot{\psi}]^T$ and referring to the attitude control structure illustrated in Figure 2, the design approach mirrors the position control strategy described in subsection 2.2. Based on (29), the augmented attitude dynamics can be reformulated as (30):

$$\dot{X}_a = \begin{bmatrix} \dot{e}_a \\ \dot{x}_a^d \end{bmatrix} = \begin{bmatrix} A_a & A_a - A_{ad} \\ 0 & A_{ad} \end{bmatrix} X_a + \begin{bmatrix} B_a \\ 0 \end{bmatrix} u_a \quad (30)$$

Accordingly, the attitude control strategy is summarized in Algorithm 2:

Algorithm 2. Data-driven RL based attitude control scheme
Step 1 (Initialization): Employ a stabilizing policy $u_a^0(X_a)$ and an additional noise $e_a(t)$ satisfying the PE condition; collect the input-output data of the quadrotor system.
Step 2 (Policy evaluation): Based on the applied control signal $u_a(t) = u_a^i(X_a) + e_a(t)$ and the control policy $u_a^i(X_a)$, solve (31) to find $V_a^{i+1}(X_a)$ and $u_a^{i+1}(X_a)$ simultaneously:

$$e^{-\gamma_a \Delta t} V_a^{i+1}(X_a(t + \Delta t)) - V_a^{i+1}(X_a(t)) = -\int_t^{t + \Delta t} e^{-\gamma_a(\tau - t)} \left[ X_a^T Q_a X_a + (u_a^i)^T R_a u_a^i \right] d\tau - \int_t^{t + \Delta t} e^{-\gamma_a(\tau - t)} \, 2 (u_a^{i+1})^T R_a \, e_a \, d\tau \quad (31)$$

Step 3 (Policy improvement): Set the control policy $u_a^i(X_a) = u_a^{i+1}(X_a)$, $i \to i + 1$, and go to step 2 until $\| u_a^{i+1} - u_a^i \| < \epsilon_a$.

Remark 3.
Two data-driven RL algorithms incorporating a discount factor are proposed for the quadrotor, addressing both the attitude and position subsystems. This work extends the study in […], which focused solely on RL control for the attitude subsystem without considering a discount factor.

SIMULATION RESULTS
In this section, the quadrotor example is used to illustrate the proposed data-driven RL algorithm with the following parameters: $m = 2.$… kg, $J = 10^{-3} \mathrm{diag}\{1.5, 1.5, \ldots\}$ kg·m², $g = 9.8$ m/s², and $k_\tau = 0.$…. The desired trajectory of the position controller is chosen as $p_d(t) = [0.5t, 0.5t, 1.5\ldots]^T$, for which Assumption 2 is guaranteed with a constant matrix $A_d$. Moreover, the cost function utilizes the weight matrices $Q_{ep} = 100 I_6$, $R_p = I_3$, $Q_{ea} = 100 I_6$, $R_a = I_3$, a discount factor $\gamma = 0.01$, and a sampling period $T_s = 0.01$ s. During the initial data collection phase, two proportional-derivative (PD) controllers are applied to the position and attitude loops to gather data for the learning process. To ensure the persistence of excitation (PE) conditions required by the proposed algorithms, noise signals defined as $e_p(t) = \sum_{k=1}^{100} \sin(\omega_k t)$ and $e_a(t) = \sum_{k=1}^{100} \sin(\omega_k t)$, where each $\omega_k$ is randomly selected within $[-100, \ldots]$, are injected into the position and attitude control inputs. For the critic and actor neural networks, second-order and first-order polynomial activation functions are employed, respectively. The tracking performance of the proposed data-driven RL-based position and attitude controllers is illustrated in Figures 3 to 7, demonstrating fast convergence: only four iterations are required for the algorithm weights to stabilize.
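The convergence behavior just described can be reproduced in miniature. The sketch below runs the off-policy evaluate/improve cycle shared by Algorithms 1 and 2 on a scalar plant, under the quadratic-value/linear-policy ansatz $V_i(x) = p_i x^2$, $u_i(x) = -k_i x$; the plant numbers, exploration signal, and window sizes are all our own illustrative assumptions. The learner never uses the plant coefficients: it only regresses on recorded $(x, u)$ data, and its gain converges to the closed-form discounted-LQR gain.

```python
import numpy as np

# Data-driven off-policy policy iteration on dx/dt = a x + b u with
# discounted cost ∫ e^{-γτ}(q x² + r u²) dτ.  Per window [t, t+δ]:
#   e^{-γδ} p x(t+δ)² - p x(t)²
#     = -∫ e^{-γs}(q x² + r u_i²) ds + 2 r k_{i+1} ∫ e^{-γs} x (u - u_i(x)) ds,
# a linear relation in the unknowns (p, k_{i+1}) solved by least squares.
a, b, q, r, gamma = 1.0, 1.0, 1.0, 1.0, 0.1
dt, win = 1e-3, 50                      # Euler step; samples per data window
rng = np.random.default_rng(0)
freqs = rng.uniform(0.5, 20.0, 8)       # exploration frequencies (PE)

def explore(t):
    return 0.5 * np.sum(np.sin(freqs * t))

k = 2.0                                 # initial stabilizing gain (a - b k < 0)
for _ in range(8):
    # Simulate with the behavior policy u = -k x + exploration noise.
    T = 6000
    x = np.empty(T + 1); u = np.empty(T)
    x[0] = 1.0
    for n in range(T):
        u[n] = -k * x[n] + explore(n * dt)
        x[n + 1] = x[n] + dt * (a * x[n] + b * u[n])
    # Assemble least-squares rows over non-overlapping windows.
    rows, rhs = [], []
    for s in range(0, T - win, win):
        idx = np.arange(s, s + win)
        w = np.exp(-gamma * (idx - s) * dt)          # discount weights
        e_expl = u[idx] + k * x[idx]                 # u - u_i(x) = noise
        phi1 = np.exp(-gamma * win * dt) * x[s + win] ** 2 - x[s] ** 2
        phi2 = 2.0 * r * np.sum(w * x[idx] * e_expl) * dt
        y = -np.sum(w * (q * x[idx] ** 2 + r * (k * x[idx]) ** 2)) * dt
        rows.append([phi1, -phi2]); rhs.append(y)
    p, k = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]

# Closed-form discounted-LQR gain for comparison (uses a, b only here).
A1 = 2 * a - gamma
p_star = (A1 + np.sqrt(A1 ** 2 + 4 * q * b ** 2 / r)) / (2 * b ** 2 / r)
k_star = p_star * b / r
```

The same structure scales to the 12-state augmented position and attitude models: only the feature vectors grow (quadratic monomials for the critic, linear terms for the actor), which is consistent with the polynomial activation functions mentioned above.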
Moreover, the position tracking errors converge to zero within 4 seconds, while the attitude tracking errors reach zero within 5 seconds, as illustrated in Figures 3 and 5, respectively. Furthermore, Figure 7 demonstrates the quadrotor's trajectory tracking performance relative to a predefined reference path, showing that the quadrotor's position closely follows the reference trajectory with high accuracy. Furthermore, to evaluate the effectiveness of the tracking performance, several performance indices, including the integral of absolute error (IAE) and the integral of time-weighted absolute error (ITAE), are presented in Table 2.

Figure 3. The position tracking error
Figure 4. The convergence of training weights in the position controller
Figure 5. The tracking of orientation angles
Figure 6. The convergence of training weights in the attitude controller
Figure 7. The trajectory tracking of RL control

Table 2. Performance indices
Performance index    Value
IAE_p                3.0527
IAE_Θ                0.1175
ITAE_p               1.…
ITAE_Θ               0.0144

CONCLUSION
A novel data-driven reinforcement learning algorithm incorporating a discount factor was proposed for application in the two subsystems of a UAV quadrotor to address performance challenges in fully uncertain UAV systems. Utilizing the off-policy approach, the model-free cascade control framework was constructed to simultaneously obtain the optimal control law and the corresponding Bellman function. The network weights were adjusted to approximate the solution of the modified Hamilton-Jacobi-Bellman (HJB) equation, with theoretical guarantees of both convergence and stability. A numerical example was provided to demonstrate the effectiveness of the proposed discount factor-based data-driven RL algorithm in the UAV control context.
ACKNOWLEDGEMENTS
This research was supported by the Research Foundation funded by Thai Nguyen University of Technology.

REFERENCES