Indonesian Journal of Electrical Engineering and Informatics (IJEEI), Vol., No., September 2025, ISSN: 2089-3272, DOI: 10.52549/ijeei

Enhancing GRU-Based DRL with Delta-LiDAR for Robust UAV Navigation in Partially Observable Dynamic Environments

Maryam Allawi Haddad 1, Dhayaa Raissan Khudher 2
1 Department of Computer Engineering, University of Basrah, Basrah, Iraq
2 Department of Computer Engineering, University of Basrah, Basrah, Iraq

ABSTRACT

Partial observability and sensor limitations are challenging for the navigation of autonomous Unmanned Aerial Vehicles (UAVs). Deep Reinforcement Learning (DRL) algorithms have emerged as potential tools for advancing this field. However, their effectiveness degrades in challenging environments, particularly in the presence of dynamic obstacles. Recent research trends emphasize the need for new DRL variants that guarantee robustness, real-time adaptability, and improved generalization under uncertainty. This paper proposes a lightweight DRL architecture that combines Proximal Policy Optimization (PPO) with a Gated Recurrent Unit (GRU), extended with a temporal LiDAR differencing feature called Delta-LiDAR. The difference between consecutive LiDAR scans is computed to provide velocity and directional cues without the computational burden of Long Short-Term Memory (LSTM) networks. We evaluate three models, PPO-LSTM, PPO-GRU, and Delta-LiDAR-augmented PPO-GRU, in a 3D simulated UAV navigation environment characterized by noise, clutter, and dynamic obstacles. We considered several metrics, including success rate, collision frequency, trajectory smoothness, and computational efficiency, to determine the effectiveness of each architecture. The experimental results demonstrate that Delta-LiDAR improves GRU-based temporal reasoning.
The deployment complexity is reduced compared with the LSTM-based architecture, which makes it ideal for real-time UAV operation in partially observable environments.

Article history: Received Aug 4, 2025; Revised Sep 19, 2025; Accepted Sep 27, 2025

Keywords: UAV autonomous navigation; PPO algorithm; GRU network; Delta-LiDAR; Feature encoder

Copyright © 2025 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author: Maryam Allawi Haddad, M.Sc. student, Department of Computer Engineering, University of Basrah, Iraq. Email: pgs.allawi@uobasrah.

Journal homepage: http://section.com/index.php/IJEEI/index

INTRODUCTION

Unmanned aerial vehicles (UAVs) have become indispensable tools in numerous applications, such as surveillance, environmental monitoring, flood detection, and disaster relief. In these domains, autonomous flight in unstructured and dynamic environments is critical. However, robust and adaptable navigation in such scenarios remains challenging, particularly under conditions of partial observability, sensor noise, and environmental uncertainty. Classic RL algorithms work well in stable and clear environments, but they often struggle and fail in situations with moving obstacles or fluctuating sensor input.

Early attempts to apply RL algorithms to UAV navigation were value-based methods, such as Deep Q-Networks (DQN). In this approach, the decision-making process consists of choosing the action with the highest value. While this method is effective in a discrete action space (e.g., simple grid-world navigation), it struggles in the continuous action spaces required in UAV applications. Policy-based approaches can handle this by directly optimizing the policy using gradient ascent without requiring a value function. In high-dimensional state spaces, however, policy-based methods suffer from high variance and unstable training. Actor-critic approaches (e.g., Asynchronous Advantage Actor-Critic (A3C) and PPO
) have been introduced to stabilize learning by combining the learning of a policy (the actor) and a value function (the critic). In traditional RL, agents merely adhere to predetermined states and actions within a sequential decision-making framework, i.e., a Markov decision process (MDP). This approach is, however, confronted with serious constraints when applied to autonomous systems: tabular RL methods cannot handle high-dimensional observation spaces directly and fail to address partial observability effectively.

Recently, RL has been integrated with deep learning to mitigate some of these issues by enabling end-to-end learning, which is essential in tasks like autonomous navigation in unknown environments. This synergy enhances the ability of an autonomous agent to handle high-dimensional sensory data (e.g., Red-Green-Blue (RGB) images, Light Detection and Ranging (LiDAR) scans, and Inertial Measurement Unit (IMU) readings) to effectively interpret complex environments. The principal role of the deep neural networks is to replace the tabular representation of RL by approximating the policy function, the value function, or other models related to the environment.

In dynamic environments, the critical challenge that an agent faces is the ability to infer obstacle motion from sequential observations. Recurrent neural networks (RNNs), such as LSTM, have demonstrated strong performance in capturing time-related patterns in sequential data. DRL models with memory augmentation, like PPO-LSTM, have been suggested to counter the problems caused by limited visibility as well as temporal change. These models employ LSTM units to retain historical context, enabling better decision-making in time-correlated scenarios. LSTM-based PPO models are better at handling sequences of data and learning from diverse obstacles. Zheng et al.
demonstrated that PPO-LSTM achieves strong performance in UAV navigation, but its high computational cost and large parameter count hinder real-time implementation on UAV platforms. These constraints render LSTM models less appropriate for dynamic, resource-limited contexts that necessitate rapid and consistent decision-making. The GRU is a lighter option with faster training and lower computational cost, but it has limited effectiveness in responding to fast, dynamic change due to shorter memory retention. Zhang et al. conducted a comparative study between GRU and LSTM networks within a memory-based DRL framework. The results show that GRU achieves faster convergence and lower computational cost. This finding supports the growing preference for GRU in resource-constrained environments. Nonetheless, it requires an additional embedding network to preprocess the observations.

Despite significant advancements in the application of RL and DRL to UAV navigation, certain gaps still exist. Memory-augmented models such as PPO-LSTM capture temporal correlations, but high computational requirements and enormous parameter counts prohibit real-time deployment on resource-constrained UAV platforms. In comparison, lightweight models such as GRU conserve processing budget and converge faster, but fall short in depicting rapidly changing obstacle dynamics due to shorter memory retention. More importantly, recent methods primarily employ raw sensory observations without explicitly modeling motion cues across time steps. This absence of representation degrades the agent's ability to foresee obstacle motion in dynamic environments. Therefore, there is a clear gap in developing a method that (1) is computationally light enough for real-time UAV deployment, (2) can handle temporal dynamics in the presence of partial observability, and (3)
employs motion-related information that is not embedded explicitly in raw observations.

To address these limitations, we propose a new temporal-augmentation technique called Delta-LiDAR, which captures the temporal difference between LiDAR scans at consecutive time steps. This augmentation provides explicit motion cues, such as obstacle approach speed and direction, that are not represented explicitly in raw observations. In contrast to prior works, our Delta-LiDAR augmentation feeds motion-related temporal information directly into the GRU, simplifying the architecture while improving dynamic awareness. This study investigates the effectiveness of PPO-GRU with Delta-LiDAR compared to standard PPO-LSTM models in dynamic simulated environments, assessing their autonomous flight capabilities. The experimental results demonstrate that our model achieves comparatively better performance with faster convergence, lower computational complexity, and improved navigation success rates. Gazebo 11 is used to simulate real-world UAV navigation environments. The world contains dynamic obstacles that move randomly and constantly, posing a challenging navigation scenario.

The main contribution of this work is to propose Delta-LiDAR, which captures motion dynamics by computing differences between consecutive LiDAR scans. Further, we design a lightweight PPO-GRU framework to improve temporal reasoning ability without increasing model complexity. Moreover, we compare the proposed architecture against PPO-LSTM in complex environments with dynamic obstacles, partial observability, and sensor noise.

RESEARCH METHOD

This section presents our proposed GRU-enhanced framework for autonomous UAV navigation under partial observability. The architecture integrates Delta-LiDAR, a multilayer perceptron (MLP), GRUs, and a PPO-based policy optimization pipeline.
The objective of this approach is to capture both spatial and temporal information for robust decision-making under uncertainty. The method starts by modeling the task as a Partially Observable Markov Decision Process (POMDP), followed by a detailed explanation of both the observation and action spaces. Next, we explain the reward function design, the PPO-GRU architecture, the training procedure, the neural network configuration, and the computational considerations. To improve temporal sensitivity, Delta-LiDAR is fused with raw sensor inputs. The GRU is selected for its lightweight memory efficiency without sacrificing temporal modeling capacity. Further, we adopt PPO to ensure stable learning via a clipped surrogate objective and to improve convergence speed.

Problem Formulation

The UAV navigation task in dynamic, partially observable environments is formulated as a POMDP and defined by the tuple (S, A, O, T, R, Ω, γ), where:
- S: the state space of the environment.
- A: the set of possible actions (e.g., thrust, yaw rate).
- O: the observation space (from onboard sensors).
- T(s_{t+1} | s_t, a_t): the transition dynamics.
- R(s_t, a_t): the reward function.
- Ω(o_t | s_t): the observation probability function.
- γ ∈ [0, 1): the discount factor.

At each time step t, the UAV receives sensory input signals from a 2D 360° LiDAR scan LiDAR_t, the temporal difference ΔLiDAR_t = LiDAR_t − LiDAR_{t−1}, a velocity vector v_t = [v_x, v_y, v_z], and an attitude vector Θ_t = [roll, pitch, yaw]. These input signals are first normalized and then concatenated into a unified observation vector. Equation 1 defines the resulting vector:

o_t = [LiDAR_t, ΔLiDAR_t, v_t, Θ_t]    (1)

In the feature encoder stage, o_t is passed through a multilayer perceptron (MLP) to project it into a latent feature space. The MLP consists of two dense layers with a nonlinear activation function (ReLU) applied between them.
The first layer transforms the normalized observation vector into an intermediate representation, enabling the network to learn abstract features; this pre-activation vector is given in Equation 2. The second layer compresses the activated features into a compact latent representation, as defined in Equation 3:

h_1 = W_1 o_t + b_1    (2)
h_2 = W_2 ReLU(h_1) + b_2    (3)

where W_1, W_2 and b_1, b_2 are learnable weight parameters, and h_2 ∈ R^d is the latent embedding of the observation. This compact feature vector h_2 captures the current sensory state and is forwarded to the recurrent network for temporal modeling (see the GRU-based temporal feature modeling subsection). Figure 1 illustrates the overall system pipeline.

Figure 1. Architecture of the proposed PPO-GRU framework with Delta-LiDAR integration. The system processes raw sensor inputs, encodes features via an MLP, and models temporal dependencies through a GRU.

IJEEI, Vol., No., September 2025: 798–813

Observation and Action Space

The observation space consists of a 360° LiDAR scan with 360 beams, the drone's current velocity vector (v_x, v_y, v_z), and the drone's current position and acceleration vectors (p_x, p_y, p_z, a_x, a_y, a_z). These inputs are combined and normalized before being fed to the PPO network. This sensor configuration enables robust perception and motion sensing without maintaining a large state representation. Depending on obstacles and sensor range, the environment may be observed either fully or partially.

The action space is continuous and consists of five parameters, (v_x, v_y, v_z, ω_1, ω_2), representing three linear velocity commands and two angular velocity commands. The velocity commands are directly converted to MAVROS velocity setpoints and sent to the PX4 flight controller in OFFBOARD mode. The connection between MAVROS and the PX4-SITL (PX4 Software-In-The-Loop) autopilot is shown in Figure 2.

Figure 2. Communication pipeline between ROS, MAVROS, and PX4-SITL.
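Returning to the feature encoder (Equations 1 through 3), the snippet below is a minimal NumPy sketch of building the observation vector and projecting it through the two dense layers. The range-based normalization scheme and the layer widths (256 and 128 units, matching Table 1) are assumptions about implementation details the paper does not spell out.

```python
import numpy as np

def encode_observation(lidar_t, lidar_prev, velocity, attitude,
                       W1, b1, W2, b2, lidar_range=5.0):
    """Build o_t = [LiDAR_t, dLiDAR_t, v_t, Theta_t] (Eq. 1) and project
    it through the two-layer MLP encoder (Eqs. 2-3)."""
    lidar_n = np.clip(lidar_t / lidar_range, 0.0, 1.0)   # normalize ranges to [0, 1]
    delta = (lidar_t - lidar_prev) / lidar_range         # Delta-LiDAR motion cue
    o_t = np.concatenate([lidar_n, delta, velocity, attitude])
    h1 = W1 @ o_t + b1                                   # Eq. 2
    h2 = W2 @ np.maximum(h1, 0.0) + b2                   # Eq. 3: ReLU between layers
    return h2
```

With 360 beams, the concatenated vector has 360 + 360 + 3 + 3 = 726 entries, so W_1 is 256 by 726 and W_2 is 128 by 256 under the assumed layer sizes.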
Sensor data and control commands are exchanged between PPO and the PX4 flight controller by MAVROS in OFFBOARD mode.

Reward Function Design

The reward function is designed to encourage goal-reaching through safe behaviors, collision avoidance, and the choice of shorter paths at a suitable speed. It incorporates a combination of sparse and dense rewards, each targeting a specific navigation objective. To align these overarching objectives of the task, penalties apply in the event of collisions or unproductive motions. The reward function consists of four terms.

The first term is the goal-reaching reward, which provides a substantial positive numerical value when the agent successfully attains the target. Equation 4 represents this type of reward:

R_goal = { r_g,  if ||p_t − p_goal|| ≤ 0.5 m
         { 0,   otherwise    (4)

where ||p_t − p_goal|| is the Euclidean distance from the drone position to the goal at time t, and r_g is a large positive constant.

The second term is the collision penalty R_collision, defined in Equation 5, which imposes a significantly adverse reward on the agent when a collision with an obstacle occurs. This motivates the agent to learn safe trajectories and avoid dangerous behaviors:

R_collision = { −200,  if a collision is detected at time t
              { 0,     otherwise    (5)

The third term is the stability penalty, which aims to reduce abrupt drone motion and promote smooth trajectory tracking. It penalizes the agent based on roll and pitch errors at each step, as represented in Equation 6. This is important for keeping the flight safe, especially when flying through dense or constricted areas where abrupt angular turns would result in instability or crashes:

R_stability = −(|roll_t| + |pitch_t|)    (6)

To promote stability of control and discourage abrupt changes in motion, the agent is also penalized for velocities that change quickly over recent timesteps.
Instead of operating on instantaneous velocity differences, the penalty is computed as the average difference in linear velocity within a short time horizon, as illustrated in Equation 7:

R_smooth = −(1/H) Σ_{k=1}^{H} ||v_{t−k+1} − v_{t−k}||    (7)

where H is the smoothing window size and v_t is the velocity vector at time t. The total reward is shown in Equation 8:

R_total = R_goal + R_collision + R_stability + R_smooth    (8)

This reward-shaping method combines sparse rewards (arrival at the goal and collision) with dense feedback (stability and smoothness) to encourage the agent to achieve robust, safe, and efficient navigation. This reward design is effective when integrated with GRU-based policies that retain temporal context from past observations and actions.

Delta-LiDAR Fusion Mechanism

The 2D spatial snapshot of the environment at time t alone cannot detect change in dynamic scenes. Therefore, we compute a differential representation over time steps, as illustrated in Equation 9:

ΔLiDAR_t = LiDAR_t − LiDAR_{t−1}    (9)

The agent uses the resulting vector, together with other sensory data, to infer object motion. A growing magnitude of ΔLiDAR_t indicates that the UAV is approaching an obstacle or that the obstacle itself is moving towards the UAV (the case of a dynamic obstacle). As a result, the GRU model can easily distinguish between static and dynamic obstacles. Figure 3 illustrates the mechanism for computing ΔLiDAR_t and its role in enhancing temporal awareness for UAV navigation. The left-side section shows two consecutive 2D LiDAR scans, LiDAR_{t−1} and LiDAR_t, captured as the drone observes its environment over time. The gray bar chart represents the older scan, and the blue bars outline the current scan.
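To make the shaping concrete, the sketch below combines Equations 4 through 8 into a single reward routine. The collision penalty of -200 and the 0.5 m goal radius follow Equations 4 and 5; the goal bonus value (here 100) and the handling of the velocity window are assumptions, since the paper only describes r_g as a substantial positive value.

```python
import numpy as np

GOAL_BONUS = 100.0          # assumed value of r_g; the paper only calls it "substantial"
GOAL_RADIUS = 0.5           # metres, from Eq. 4
COLLISION_PENALTY = -200.0  # from Eq. 5

def total_reward(pos, goal, collided, roll, pitch, recent_velocities):
    """R_total = R_goal + R_collision + R_stability + R_smooth (Eq. 8)."""
    r_goal = GOAL_BONUS if np.linalg.norm(pos - goal) <= GOAL_RADIUS else 0.0
    r_collision = COLLISION_PENALTY if collided else 0.0
    r_stability = -(abs(roll) + abs(pitch))                 # Eq. 6
    v = np.asarray(recent_velocities)                       # window of H+1 velocity vectors
    diffs = np.linalg.norm(np.diff(v, axis=0), axis=1)      # per-step velocity changes
    r_smooth = -diffs.mean() if diffs.size else 0.0         # Eq. 7
    return r_goal + r_collision + r_stability + r_smooth
```

For example, a hovering drone exactly at the goal collects the full bonus, while a colliding drone whose velocity jumped by 1 m/s in the window is charged both the collision and smoothness penalties.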
The notation ΔLiDAR_t ≈ ∂L/∂t treats this difference as a discrete temporal derivative, providing a signal for motion or approaching obstacles. In the right-side section, the drone processes both LiDAR_t and ΔLiDAR_t to obtain a richer temporal understanding of its environment. This fusion allows the drone to detect moving obstacles, approaching threats, and dynamic structural changes. Before differencing, we register LiDAR_{t−1} to the ego frame at time t through onboard odometry (SE(2) translation and rotation).

Figure 3. Mechanism for computing Delta-LiDAR by subtracting consecutive 2D LiDAR scans. The resulting temporal perception enables the GRU to distinguish between static and moving entities.

The LiDAR sensor used in this study has a range of 0.06 m to 5 m and allows accurate calculation of distance within this range. The sensor has a horizontal angular resolution of 1 degree with full 360-degree coverage, allowing it to observe the whole environment surrounding the UAV. To emulate realistic readings, the sensor model also adds Gaussian noise with a mean of 0 and a small standard deviation. The sensor operates at a frequency of 20 Hz, refreshing its measurements 20 times per second. These specifications enable the LiDAR to effectively detect and measure obstacles within its range without a high level of error due to noise.

GRU-Based Temporal Feature Modeling

The GRU allows the agent to build a belief state from sequential input by maintaining a hidden state h_t. This hidden state captures the spatial features and temporal context necessary under partial observability. The GRU encodes both recent and past information by updating its hidden state vector using the previous hidden state h_{t−1} and two internal gates. The reset gate r_t indicates how much of the previous memory to forget, and the update gate z_t determines how much of the new candidate state should be combined with the previous state
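A minimal registration-then-difference step might look like the following. Compensating only the yaw rotation, by re-indexing beams at the sensor's 1-degree resolution, and omitting the translation part of the SE(2) correction is a simplifying assumption of this sketch, not the paper's full registration procedure.

```python
import numpy as np

def delta_lidar(scan_t, scan_prev, yaw_t, yaw_prev):
    """dLiDAR_t = LiDAR_t - LiDAR_{t-1} (Eq. 9), after rotating the
    previous 360-beam scan into the current ego frame."""
    # A yaw change of d degrees shifts which beam sees each world direction,
    # so re-index the previous scan before subtracting (1 degree per beam).
    dyaw_deg = int(round(np.degrees(yaw_t - yaw_prev)))
    registered_prev = np.roll(scan_prev, -dyaw_deg)
    return scan_t - registered_prev
```

With this convention, a static scene observed by a hovering, non-rotating drone yields a near-zero difference vector, while an approaching obstacle produces negative entries on the beams facing it.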
, see Equations 10 and 11:

r_t = σ(W_r h_2 + U_r h_{t−1} + b_r)    (10)
z_t = σ(W_z h_2 + U_z h_{t−1} + b_z)    (11)

where σ(·) is the sigmoid function, and W_r, W_z, U_r, U_z, b_r, b_z are learnable GRU parameters. The new input and the reset-modulated previous state are combined into a candidate state, as represented in Equation 12:

h̃_t = tanh(W_h h_2 + U_h (r_t ⊙ h_{t−1}) + b_h)    (12)

where tanh is the hyperbolic tangent and ⊙ is element-wise multiplication. The final hidden state h_t is updated as a convex combination of the previous and candidate hidden states, as in Equation 13:

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t    (13)

This mechanism enables the GRU to remember relevant temporal features (e.g., obstacle movement) and reject irrelevant information such as noise. As a result, the hidden state h_t becomes a temporally aware embedding that integrates both environmental dynamics and internal UAV states. Instead of retaining every raw sensor reading, the GRU keeps a compact summary that highlights key events. This summary is efficient (lower-dimensional than raw sensor data) and useful for decision-making under partial observability.

Integration of GRU for Detection of Obstacles and Temporal Context

The GRU has been included in the model to learn temporal relationships in sequential data, an essential capability because autonomous UAV navigation operates in dynamic environments where future actions rely on past actions and states. The GRU allows the model to remember previous states, e.g., velocity changes, orientation, and sensor readings, and aggregate this information to make more informed decisions over time. This is especially useful in partially observable environments, where the system does not always observe the whole context.
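The gate updates of Equations 10 through 13 can be sketched directly; this is a plain NumPy transcription of the standard GRU cell, with the parameter names (W_r, U_r, b_r, and so on) chosen to mirror the equations rather than any particular framework's layout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU update following Eqs. 10-13; p maps parameter names
    (W_r, U_r, b_r, W_z, U_z, b_z, W_h, U_h, b_h) to arrays."""
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])             # Eq. 10: reset gate
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])             # Eq. 11: update gate
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])  # Eq. 12: candidate
    return (1.0 - z) * h_prev + z * h_cand                               # Eq. 13: convex mix
```

Because Equation 13 is a convex combination and the candidate passes through tanh, the hidden state stays bounded, which is part of why the GRU summary remains numerically well behaved over long rollouts.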
The addition of GRU allows the model to learn temporal dependencies in the data without heavy computation, in contrast to other types of recurrent networks such as LSTM, making real-time UAV applications possible. The system also reacts well to orientation changes because orientation data (roll, pitch, yaw) is added directly to the input observations. Since the UAV undergoes continuous orientation change in flight, these values are needed to understand the motion and location of the UAV in 3D space. The GRU considers this data, together with other sensing inputs (e.g., velocity and LiDAR), so that it can account for how changes in orientation affect the UAV's navigation decisions and course. This allows the model to realign its navigation strategy in real time whenever the orientation changes, leading to smooth control during tilting or rotation. Therefore, the GRU gives the system the capacity to handle orientation changes smoothly in real time.

The model can also accommodate variations in external forces such as wind drift and mobile obstacles. Delta-LiDAR (providing temporal differences of LiDAR data) allows the model to keep track of variations in the environment around the UAV, e.g., wind-induced drift or mobile obstacles. The GRU accumulates this sequential data so that the model can learn to adjust to such variations and make appropriate adjustments to its navigation decisions. For instance, if the wind causes the UAV to drift, the model can detect the deviation from the LiDAR measurements and correct its direction. The GRU also tracks dynamic obstacles and adjusts decision-making in response.
However, the responsiveness of the system to rapid environmental change is limited by the data sampling and processing rates; very fast-moving obstacles or rapidly changing conditions may fall outside the model's scope if the sensor data is not processed quickly enough.

PPO Training Mechanism

We employ the PPO algorithm to optimize the memory-augmented policy under partial observability. PPO restricts the policy update to stay within a clipped range, leading to improved training stability. At each training iteration, PPO maximizes the clipped objective function, defined in Equation 14:

L_PPO(θ) = E_t[ min( ρ_t(θ) Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε) Â_t ) ]    (14)

where Â_t is the advantage estimate at time t, ρ_t(θ) is the probability ratio between the new and old policies, and ε is the clipping threshold. The advantage function is computed using Generalized Advantage Estimation (GAE), as in Equation 15:

Â_t = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}    (15)

where γ ∈ [0, 1) is the discount factor and λ ∈ [0, 1] controls the bias-variance trade-off. δ_t is the temporal-difference error for value estimation, computed from the immediate reward and the value estimates of the current and next hidden states as δ_t = r_t + γV(h_{t+1}) − V(h_t).

The final training objective consists of three components: policy loss, value loss, and entropy bonus, as shown in Equation 16:

L_total = L_PPO − c_1 L_value + c_2 L_entropy    (16)

with:
- L_value = (V(h_t) − V_target)^2
- L_entropy = −Σ_a π_θ(a_t | h_t) log π_θ(a_t | h_t)
- c_1, c_2: scalar coefficients balancing the value and entropy terms.
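The objective in Equations 14 and 16 can be sketched as follows. The clipping threshold ε = 0.2 and value coefficient c_1 = 0.5 are assumed defaults, since their values are not legible in the source; c_2 = 0.01 matches the entropy coefficient reported in the training setup.

```python
import numpy as np

def ppo_losses(ratios, advantages, values, value_targets, entropy,
               eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate (Eq. 14) and combined objective (Eq. 16).
    Returns (L_PPO, L_value, L_total); L_total is the quantity maximized."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    l_ppo = np.minimum(unclipped, clipped).mean()       # pessimistic bound of Eq. 14
    l_value = np.mean((values - value_targets) ** 2)    # critic MSE term
    l_total = l_ppo - c1 * l_value + c2 * entropy       # Eq. 16
    return l_ppo, l_value, l_total
```

Taking the element-wise minimum is what keeps updates conservative: a ratio of 1.5 with a positive advantage is credited only as 1.2, while a ratio of 0.5 with a negative advantage is still charged the clipped, more pessimistic value.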
Training Setup and Hyperparameters

In this subsection, we define the training configuration, environment parameters, and hyperparameters used in our PPO-based memory-augmented policy. The first module of the neural network is the feature encoder (MLP), which has two hidden layers: the first hidden layer is 256-dimensional and the second is 128-dimensional, both with ReLU activation. They are trained to extract the high-level features of the input data required for decision-making in reinforcement learning, as shown in Table 1. The GRU module is a single layer of 128 hidden units that preserves temporal relationships in the sequential data. The GAE (Generalized Advantage Estimation) parameter λ = 0.95 is used to estimate the value of each action taken, balancing bias and variance in policy updates for convergent learning. The Adam optimizer, with a learning rate of 3e-4, is employed to update the policy, using the batch size listed in Table 1. Entropy regularization with an entropy coefficient of c_2 = 0.01 is also employed to prevent premature convergence, encourage exploration, and facilitate generalization of the acquired policy. Table 1 shows the training hyperparameters used for the PPO algorithm. To ensure a robust balance between policy stability, learning efficiency, and generalization under dynamic conditions, we empirically tuned all parameters through iterative testing.

Table 1. The training hyperparameters of the PPO algorithm.

    Parameter                    Value
    Network type                 GRU-based Actor-Critic
    MLP hidden layers            256, 128
    GRU hidden size              128
    Activation function          ReLU
    Discount factor γ
    GAE parameter λ              0.95
    Clipping threshold ε
    Learning rate                3e-4
    Optimizer                    Adam
    Entropy coefficient c_2      0.01
    Value loss coefficient c_1
    Batch size
    Epochs per update
    Max training steps           1M – 3M

The training was performed on a computer equipped with an NVIDIA RTX 3-series GPU, an Intel i7 CPU, and 32 GB of RAM. The software environment included Ubuntu 20, Python 3, ROS Noetic, Gazebo 11, and PyTorch 1. Algorithm 1 summarizes the training procedure of our proposed PPO-GRU agent with Delta-LiDAR.

Algorithm 1: PPO-GRU with Delta-LiDAR training pipeline
Require: policy network π_θ, value function V_φ
Require: discount factor γ, GAE parameter λ, clipping threshold ε
Require: coefficients c_1 (value loss), c_2 (entropy bonus)
 1: for t = 1 to T do
 2:     δ_t ← r_t + γ V(h_{t+1}) − V(h_t)
 3: end for
 4: for t = 1 to T do
 5:     Â_t ← Σ_{l=0}^{∞} (γλ)^l δ_{t+l}
 6: end for
 7: for t = 1 to T do
 8:     ρ_t(θ) ← π_θ(a_t | h_t) / π_θold(a_t | h_t)
 9:     L_PPO,t ← min( ρ_t(θ) Â_t, clip(ρ_t(θ), 1 − ε, 1 + ε) Â_t )
10:     L_value,t ← (V_φ(h_t) − V_target,t)^2
11:     L_entropy,t ← −E[ log π_θ(· | h_t) ]
12: end for
13: L_total ← (1/N) Σ_{t=1}^{N} [ L_PPO,t − c_1 L_value,t + c_2 L_entropy,t ]
14: Update parameters θ and φ by gradient steps on L_total
15: return updated π_θ and V_φ

RESULTS AND DISCUSSION

To evaluate the performance of the proposed memory-augmented learning for UAV navigation, we conducted three key experiments. The first experiment established a baseline using the PPO-LSTM structure. The second experiment replaced the LSTM with a GRU to test whether it learns temporal dependencies with fewer parameters and less training time. The third experiment used the developed PPO-GRU with Delta-LiDAR, which augments the GRU-based memory by adding the difference between consecutive LiDAR scans, Delta-LiDAR, to the raw LiDAR input.
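The advantage-estimation passes of Algorithm 1 (the TD-error and GAE loops) can be sketched over a recorded rollout as below. The backward recursion is an equivalent, O(T) way of evaluating the geometric sum of Equation 15; γ = 0.99 is an assumed discount value, while λ = 0.95 follows the text.

```python
import numpy as np

def process_rollout(rewards, values, gamma=0.99, lam=0.95):
    """Compute TD errors, GAE advantages, and the critic's regression
    targets for one rollout. `values` holds V(h_1) .. V(h_{T+1})."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]   # delta_t (Algorithm 1, line 2)
    advantages = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(deltas))):                # backward recursion of Eq. 15
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    value_targets = advantages + values[:-1]              # targets for the value loss
    return deltas, advantages, value_targets
```

Note the off-by-one convention: `values` must contain one more entry than `rewards` so the final TD error can bootstrap from V(h_{T+1}).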
This study evaluates whether combining GRU-based temporal memory with the detection of active environmental change further improves the decision-making ability of the UAV in partially observable, dynamic environments. The same quantitative measures were used across all three settings, including total reward, collisions, value loss, policy entropy, trajectory length, and trajectory smoothness. Training time and model parameter count were also compared to highlight the computational efficiency of the GRU and the additional benefit of Delta-LiDAR. The smoothness and length of the path were especially noted to demonstrate how the Delta-LiDAR input enables the UAV to follow shorter and smoother paths, which translates into more stable and intelligent control policies.

The simulation environment is built using Gazebo 11, integrated with the PX4-SITL autopilot (for low-level flight control) and MAVROS (for ROS communication). Figure 4 illustrates the simulated world from both a top view and a front view. This subsection focuses on the simulation environment, UAV configuration, and system-level integration. Figure 5 illustrates the training environments, including a corridor-like world designed to simulate indoor or constrained drone navigation. The corridor is modeled as a rectangular space with dimensions of 30 m by 6 m by 6 m. The left and right walls bound the width, and we left the ceiling open to avoid z-axis constraints. The environment contains moving circular objects and dynamically walking human agents, distributed randomly. The dynamic obstacles are used to challenge the UAV's perception and collision avoidance under partial observability. These obstacles maintain a minimum clearance of 0.4 m on both sides and are positioned at various locations along the y-axis to assess lateral movement.

Figure 4.
Top and front views of the designed UAV simulation environment in Gazebo. The figure depicts a corridor structure with various static and dynamic obstacles of different shapes and sizes.

Figure 5. Training environments with six different obstacle layouts in a 3D corridor of 30 m by 6 m by 6 m. These configurations are used to assess the UAV's adaptability and collision avoidance capability.

Total Reward Per Episode

Policy learning performance is usually quantified in terms of cumulative reward across a single training episode. Learning curves of PPO-GRU, PPO-GRU with Delta-LiDAR, and the baseline PPO-LSTM across 30,000 training episodes are given in Figure 6. The episode number is shown on the x-axis, and the total reward is represented on the y-axis. PPO-GRU with Delta-LiDAR shows a smooth, consistent rise in reward, converging to around 129,000 at episode 25,000 and outperforming the other two methods. This indicates that GRU-based temporal modeling and Delta-encoded LiDAR data make the environment easier to interpret and act upon. The PPO-GRU baseline, on the other hand, fluctuates heavily during initial training and ends at a lower reward level of roughly 122,000. Such behavior illustrates the difficulty of policy convergence under partial observability when more informative temporal cues are not available. The PPO-LSTM model also learns quickly to a reward threshold around 120,000 but exhibits a flattened learning curve and thus early saturation. While LSTM is initially powerful, it may be less effective at the sharper temporal sensitivity needed for long-run reward improvement in dynamic worlds. Overall, the results clearly show that combining Delta-LiDAR with GRU helps the agent reason over time and stabilizes the learning process, leading to better performance.

Figure 6. Cumulative training reward per episode for PPO-LSTM,
PPO-GRU, and PPO-GRU with Delta-LiDAR. The Delta-LiDAR variant shows faster convergence and a higher final reward. Number of Collisions per Episode Figure 7 illustrates the average collision rate per episode for the three architectures; the x-axis represents episode number, and the y-axis represents the average number of collisions. PPO-GRU with Delta-LiDAR exhibits a smooth, consistent reduction in collision rate and converges to approximately 2 collisions per episode. The baseline PPO-GRU has high variance and cannot drop below 10 collisions per episode, reflecting poor stability and lower adaptability. PPO-LSTM converges sooner than the GRU-based policies, but at approximately 7 collisions. These results support the hypothesis that Delta-LiDAR enhances temporal-spatial awareness, leading to safer and more stable navigation performance. Figure 7. Average number of collisions per episode, PPO-GRU vs PPO-GRU with Delta-LiDAR. The proposed framework achieves a lower collision rate. Trajectory Length over Time Figure 8 shows that employing PPO-GRU with Delta-LiDAR data enhances trajectory efficiency; the x-axis represents episode number and the y-axis the average trajectory length (m). The proposed method exhibited the shortest average trajectory length compared to standard PPO-GRU and PPO-LSTM. The time-aware delta-LiDAR input significantly facilitates navigation in areas with limited visibility. Figure 8. Average trajectory length (m) per episode for PPO-LSTM, PPO-GRU, and PPO-GRU with Delta-LiDAR. The Delta-LiDAR model follows shorter paths. Policy Entropy During Training Policy entropy measures how broadly the agent explores the action space; higher entropy corresponds to wider exploration and reduces premature convergence to sub-optimal behaviors. In Figure 9, the x-axis represents episode number, and the y-axis represents average policy entropy.
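Because the UAV's action space is continuous, the plotted policy entropy is a differential entropy of the action distribution, which, unlike discrete entropy, can legitimately dip below zero as the action standard deviations shrink. A minimal sketch of this computation, assuming the common diagonal-Gaussian policy head (an illustrative assumption, not taken from the authors' code):

```python
import math

def gaussian_policy_entropy(stds):
    """Differential entropy of a diagonal Gaussian policy:
    H = sum_i 0.5 * ln(2 * pi * e * sigma_i^2).
    Note: this can be negative when the sigmas are small."""
    return sum(0.5 * math.log(2 * math.pi * math.e * s * s) for s in stds)

# Wide action distribution early in training: high entropy (broad exploration).
early = gaussian_policy_entropy([1.0, 1.0, 1.0])
# Narrow distribution after convergence: low, here negative, entropy.
late = gaussian_policy_entropy([0.1, 0.1, 0.1])
assert early > 0 > late
```

Per-dimension entropy crosses zero at sigma ≈ 0.24, which is why entropy curves of well-converged continuous-control policies often settle below zero.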
One can see that the PPO-GRU entropy curve is extremely unstable: the orange line shows massive spikes and sharp drops into negative values, indicating an erratic, hard-to-optimize exploration schedule. Although some stability is reached around the 200k-timestep mark, it comes only after a very long period of instability that undermines the overall consistency of learning. In contrast, PPO-GRU with Delta-LiDAR (green line) shows a smooth, steadily decreasing policy entropy, indicating a much more stable and better-organized learning process: exploration is reduced gradually as the policy gains confidence, which supports reliable reward propagation and value-function approximation. These results demonstrate that incorporating Delta-LiDAR further enhances training stability, so the policy improves progressively with each step, leading to better overall training performance and deployment. Figure 9. Average policy entropy during training. PPO-GRU with Delta-LiDAR exhibits smoother entropy. Value Loss over Time This metric measures how accurately the critic estimates the expected returns; a low and stable value loss is especially crucial for reliable advantage estimates and stable policy improvement. Figure 10 compares the value loss of the PPO-GRU, PPO-LSTM, and PPO-GRU-Delta-LiDAR models; the x-axis represents episode number, and the y-axis represents average value loss. All three begin high during the initial exploration phase. However, PPO-GRU falls sharply and prematurely to 0.25, which indicates rapid exploitation and potential convergence to sub-optimal policies. PPO-LSTM experiences a gradual fall, stabilizing near 0, while PPO-GRU-Delta-LiDAR also undergoes a gradual fall, stabilizing near 1.
Such a smoother transition signals enhanced temporal and spatial perception, allowing better exploration and more stable learning. Overall, the Delta-LiDAR-enhanced model balances exploration and exploitation more effectively. Figure 10. Average value-loss progression per episode for PPO-GRU vs PPO-GRU with Delta-LiDAR. Trajectory Smoothness over Time Figure 11 compares how smoothly the agents of the three models, PPO-GRU, PPO-LSTM, and PPO-GRU-Delta-LiDAR, move; the x-axis represents episode number, and the y-axis represents average trajectory smoothness. Lower values mean smoother, steadier movement. The proposed PPO-GRU with Delta-LiDAR attains the best final value, with a faster and more stable improvement than PPO-GRU and PPO-LSTM. This shows that adding Delta-LiDAR substantially improves motion stability and control performance during training. Figure 11. Average trajectory smoothness metric per episode. PPO-GRU with the Delta-LiDAR model yields a smoother and more stable flight path than the baseline model and PPO-GRU. In this study, smoothness is assessed through the angular motion and the change of curvature along the trajectory: it measures how gradual the changes of direction are. Curvature is a measure of how rapidly the direction of motion is changing; for a 2D path it is computed as

\kappa = \frac{\lvert \dot{x}\,\ddot{y} - \dot{y}\,\ddot{x} \rvert}{\left(\dot{x}^{2} + \dot{y}^{2}\right)^{3/2}}

where x(t), y(t) are the trajectory coordinates, \dot{x}, \dot{y} are the velocities along the x and y directions, and \ddot{x}, \ddot{y} are the accelerations along the x and y directions. Smoothness threshold: the movement is considered smooth if the curvature \kappa is below a threshold. In our experiments, smooth motion is defined by a curvature of less than 0.5, which ensures gradual direction changes without sudden turns.
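The curvature criterion can be evaluated numerically from sampled trajectory points; the sketch below uses central finite differences and is an illustrative reconstruction, not the authors' implementation:

```python
import math

def curvature(xs, ys, dt):
    """Discrete curvature kappa_i = |x' y'' - y' x''| / (x'^2 + y'^2)^(3/2)
    at the interior samples of a 2D path, using central finite differences."""
    kappa = []
    for i in range(1, len(xs) - 1):
        dx = (xs[i + 1] - xs[i - 1]) / (2 * dt)          # velocity x
        dy = (ys[i + 1] - ys[i - 1]) / (2 * dt)          # velocity y
        ddx = (xs[i + 1] - 2 * xs[i] + xs[i - 1]) / dt**2  # acceleration x
        ddy = (ys[i + 1] - 2 * ys[i] + ys[i - 1]) / dt**2  # acceleration y
        speed3 = (dx * dx + dy * dy) ** 1.5
        kappa.append(abs(dx * ddy - dy * ddx) / max(speed3, 1e-9))
    return kappa

# A straight line has near-zero curvature everywhere, so it passes
# the kappa < 0.5 smoothness threshold used in this study.
ts = [0.1 * i for i in range(30)]
line = curvature(ts, [2 * t for t in ts], dt=0.1)
assert all(k < 0.5 for k in line)

# A circle of radius 0.5 m has constant curvature 1/r = 2, i.e. "not smooth".
dth = 2 * math.pi / 400
cx = [0.5 * math.cos(i * dth) for i in range(400)]
cy = [0.5 * math.sin(i * dth) for i in range(400)]
assert all(k > 0.5 for k in curvature(cx, cy, dt=dth))
```

Using only interior points avoids the less accurate one-sided differences at the trajectory endpoints.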
Motions with a curvature greater than 0.5 are considered less smooth. The upper bound on angular velocity for a smooth trajectory is set at 10 degrees per second, allowing the UAV ample time to follow a smooth path without abruptly changing its heading. These metrics provide an exact characterization of path smoothness and are employed to assess the model's motion stability during training. Generalization to Noisy Tunnel Environment The robustness of the proposed framework is evaluated by introducing an unseen environment with moving obstacles to test the PPO-GRU with Delta-LiDAR agent. The agents were trained in simulated corridor-like environments with dynamic obstacles and clean LiDAR input, conditions that are optimal for simulation and controlled learning. To properly test performance beyond the training distribution, the learned models were transferred to a novel, highly challenging tunnel environment with a very different structure and sensing conditions. The tunnel environment in Figure 12 imposes additional strict spatial constraints and sensor corruption, with Gaussian noise added to each LiDAR range measurement. The PPO-GRU with Delta-LiDAR agent was still able to fly safely through the tunnel despite these difficulties and showed strong robustness to both spatial and sensory perturbations: it succeeded 92.7% of the time, whereas the baseline PPO-GRU agent succeeded only 82.5% of the time under the same conditions. These results validate the stability and generalization of the proposed method. The Delta-LiDAR component enhances temporal understanding in the recurrent policy, helping it distinguish sensor noise from critical environmental signals. Conversely, the PPO-GRU agent that lacked this temporal delta information struggled to differentiate noise from meaningful cues, resulting in more frequent navigation errors and a higher number of missed targets.
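This distinction between noise and meaningful cues can be illustrated with a toy version of the Delta-LiDAR feature (the beam count, noise level, and obstacle closing distance below are illustrative assumptions, not the paper's settings): subtracting consecutive scans cancels static structure, so a moving obstacle produces a large localized delta while zero-mean Gaussian range noise yields only small fluctuations.

```python
import random

def delta_lidar(prev_scan, scan):
    """Delta-LiDAR feature: element-wise difference between consecutive
    LiDAR range scans. Static structure cancels; an approaching obstacle
    yields a negative delta proportional to its closing speed."""
    return [b - a for a, b in zip(prev_scan, scan)]

def noisy(scan, sigma=0.01):
    """Add zero-mean Gaussian range noise, as in the tunnel test."""
    return [r + random.gauss(0.0, sigma) for r in scan]

random.seed(0)
n_beams = 36                      # illustrative angular resolution
walls = [6.0] * n_beams           # static corridor walls at 6 m

scan_t0 = noisy(walls)
scan_t1 = noisy(walls)
scan_t1[18] -= 0.4                # obstacle on beam 18 closed 0.4 m

delta = delta_lidar(scan_t0, scan_t1)
assert delta[18] < -0.3                       # moving obstacle stands out
assert all(abs(d) < 0.1 for i, d in enumerate(delta) if i != 18)  # noise stays small
```

The delta channel thus carries approximate velocity and direction cues without any recurrent state, which is why it complements the GRU rather than replacing it.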
This experiment shows the benefit of combining a memory-based structure (GRU) with Delta-LiDAR's temporal information, particularly when transferring from corridor-style training scenarios to tunnel deployments with higher uncertainty. Figure 12. Visualization of an unseen noisy tunnel environment used for generalization testing. The scenario includes dynamic obstacles and LiDAR noise to evaluate the robustness of the trained policy. Figure 13 illustrates the drone's navigation trajectory from start to endpoint, demonstrating a smooth and consistent flight path. As the figure shows, there are no erratic deviations or sharp corrections, which suggests robust integration of temporal LiDAR data with the memory-based policy, enabling anticipatory maneuvering rather than reactive path adjustments. Figure 13. Sample flight trajectory of the PPO-GRU with Delta-LiDAR agent in the tested environment. The smooth and direct path highlights stable control and anticipatory decision-making. The detailed comparison between PPO-LSTM, PPO-GRU, and PPO-GRU with Delta-LiDAR is summarized in Table 2. The complete experimental results, including all figures, tables, and demonstration videos, are available in the project's GitHub repository: https://github.com/Maryamallawi96/PPO_GRU_Delta_LiDAR/tree/main/Media.

Table 2. Comparative analysis of PPO variants' performance (mean ± std over 50 test episodes).

Metric                    | PPO-LSTM | PPO-GRU  | PPO-GRU-Delta-LiDAR
Success rate (%)          | …        | 82.5 ± … | 92.7 ± …
Avg. distance to goal     | …        | …        | …
Avg. collisions           | …        | …        | …
Trajectory length (m)     | …        | …        | 12.9 ± …
Smoothness                | …        | …        | …
Training time             | …        | …        | …
Model parameters          | …        | …        | …

CONCLUSION This study presented a novel approach to enhance UAV navigation in partially observable and dynamic environments.
This method integrates the Delta-LiDAR feature with a GRU-based PPO policy. The temporal difference between consecutive LiDAR scans is computed to provide motion cues about environmental changes. This difference enhances the agent's temporal awareness of obstacle motion without increasing computational complexity. The experimental results demonstrate that the proposed PPO-GRU with Delta-LiDAR outperformed both standard PPO-GRU and PPO-LSTM across the performance metrics considered. It achieved the highest success rate of 92.7%, surpassing both PPO-LSTM and PPO-GRU (82.5%). The proposed method also reduces the trajectory length to approximately 12.9 meters by episode 15,000, shorter than both PPO-GRU (converging at episode 25,000) and PPO-LSTM (converging at episode 20,000). These results reflect faster convergence, learning efficiency, and safer navigation, all while maintaining computational effectiveness. The generalization capability of the model was further validated in an unseen, noisy tunnel environment, where the Delta-LiDAR-enhanced agent maintained robust performance despite the spatial and sensory perturbations. Overall, the integration of Delta-LiDAR with PPO-GRU provides a practical, lightweight, and high-performing solution for real-time UAV navigation in partially observable and dynamic environments. In future work, we aim to extend this architecture by incorporating hierarchical or Transformer-based temporal models with delta encoding for complex mission planning in dynamic and uncertain environments. The perception and planning capabilities may be enhanced by incorporating additional sensory input (e.g., vision, radar, or event-based cameras). While the proposed PPO-GRU with Delta-LiDAR architecture significantly improves UAV navigation under partial observability, several limitations remain.
This work is based on 2D LiDAR, which is well suited to detecting planar obstacles but does not fully capture vertical features. Although ego-motion correction was employed, small residual discrepancies can still influence the Delta-LiDAR signal during aggressive maneuvers. Additionally, sensor sampling rates and onboard processing speeds inherently limit the system's responsiveness to extremely rapid changes in the environment. Finally, since the technique has been validated only in simulation, eventual field deployment will require further hardware integration and robustness testing. ACKNOWLEDGMENTS The authors would like to express their sincere gratitude to the professors and colleagues at the Department of Computer Engineering, University of Basrah, for their valuable support and guidance throughout the development of this study. REFERENCES