International Journal of Electrical and Computer Engineering (IJECE), Vol., No., October 2025, pp. 4705-4713. ISSN: 2088-8708, DOI: 10.11591/ijece.

An efficient direction-oriented block-based video inpainting using morphological operations and adaptively dimensioned search region with direction-oriented block-based inpainting

Shyni Shajahan1, Jacob Vetha Raj2
1 Department of Computer Science and Engineering, Adi Shankara Institute of Engineering and Technology, Kalady, India
2 Department of Computer Science and Engineering, Nesamony Memorial Christian College, Tamilnadu, India

Article Info

ABSTRACT

Video inpainting is a computer-vision technique used to remove unwanted objects from video sequences while preserving visual consistency, so that the modifications remain unnoticeable to the human eye. This paper presents an accurate video inpainting model based on the adaptively dimensioned search region with direction-oriented block-based inpainting (ADSR-DOBI). The model operates in five main phases: preprocessing, background separation, morphological operations, object removal, and video inpainting. Initially, the input video is converted into frames, followed by preprocessing steps such as denoising and resizing. These frames are then processed by a background subtraction module, where object localization and foreground detection are performed using the binomially distributed foreground segmentation network (BDFgSegNet) and morphological operations. This results in segmented foreground objects tracked across frames. The object removal phase eliminates the identified foreground objects and defines the missing regions to be filled. The ADSR-DOBI algorithm is then applied to inpaint these regions seamlessly. Experimental results demonstrate that this approach outperforms existing state-of-the-art methods in both accuracy and efficiency.
Article history:
Received Jul 25, 2024
Revised Jul 3, 2025
Accepted Jul 12, 2025

Keywords:
Binomially distributed foreground segmentation
Adaptively dimensioned search region with direction-oriented block-based inpainting
Morphological operations
Sum of squared differences
Video inpainting

This is an open access article under the CC BY-SA license.

Corresponding Author:
Shyni Shajahan
Department of Computer Science and Engineering, Adi Shankara Institute of Engineering and Technology, Kalady, Kerala, India
Email: shyni.cs@adishankara.

INTRODUCTION
Inpainting involves filling missing pixels in images with visually plausible content, but it is inherently ill-posed with no unique solution. The need for inpainting has increased with the growth of high-resolution multimedia content. Video inpainting extends this task to temporal data, aiming to fill missing regions across frames with coherent and realistic content. Challenges arise from camera motion and complex object movements in real-world videos. Applying image inpainting models frame-by-frame often leads to temporal inconsistencies such as flickering. This naive approach overlooks video dynamics and fails to capture motion-driven appearance changes over time, highlighting the importance of spatiotemporal coherence in high-quality video inpainting. As videos exhibit temporal regularity, it is natural to inpaint a given frame using data from other frames, since the data in other frames may correspond to parts of the scene behind the masked region. Both spatial structure and temporal coherence must be considered for high-quality video inpainting. Recovering missing video content requires understanding not only the spatial context of each frame but also the motion context across frames. For any missing pixels that lack good correspondence due to occlusion, the video inpainting method must hallucinate reasonable content.
The state-of-the-art methods tend to capture long-term correspondences with an attention mechanism, so that the available content in distant frames can be globally propagated to the unknown regions. Traditional patch-based methods find similar spatio-temporal patches in the known regions of videos to fill the holes, formulating the problem as a patch-based optimization task. These methods rely heavily on the hypothesis that the missing content of the corrupted region appears in neighboring frames, which greatly limits their generalization ability. In addition, patch-based methods assume there is a reference for the missing part and often fail to recover non-repetitive and complex regions; for example, they cannot recover a missing face well. In recent years, a number of deep learning-based video inpainting methods have been proposed. These existing deep video inpainting methods can be summarized as two key modules: temporal feature aggregation, and single-frame inpainting for temporal consistency. Siddavatam et al. proposed a video inpainting method using autoencoders that learns the background first, then object features, followed by object removal and background reconstruction. They used a pre-trained YOLO model for object detection. Although the method showed improved performance, it faced limitations related to deepfake tasks. Ke et al. introduced an occlusion-aware video object inpainting approach with the YouTube-VOI benchmark for realistic occlusions. Their video object inpainting network (VOIN) used temporal GANs and spatio-temporal attention for shape completion and texture generation. While effective for complex objects, VOIN's performance degraded with inaccurate input. Szeto et al. proposed a temporally-aware interpolation network for video frame inpainting, using a video prediction subnetwork to generate intermediate frames and blending them with temporally-aware interpolation (TAI).
Their method outperformed state-of-the-art approaches but produced blurry results under heavy camera motion. Huang and Lin introduced a video inpainting method based on object motion rate and color variance, using an adaptive foreground model and exemplar-based inpainting for unpaired areas. While their approach yielded visually pleasing results, it struggled to accurately estimate motion rates when moving objects overlapped. In pace with the remarkable success of video inpainting methods, inpainted videos have become increasingly difficult to distinguish even by eye. The difficulty of video inpainting is inherently tied to the content of the videos and masks being inpainted, so content-informed diagnostic evaluation is performed, which identifies the strengths and weaknesses of modern inpainting methods. Most of the existing techniques developed for video inpainting have complexities in terms of computation and accuracy. Although there are several techniques, there is a constant demand for reliable and efficient video inpainting systems. Therefore, this paper proposes an efficient direction-oriented fast iterative block-based video inpainting model using ADSR-DOBI. The rest of the paper is organized as follows: section 1 surveys the existing works related to video inpainting, section 2 explains the proposed methodology, the experimental evaluation of the proposed methodology is given in section 3, and section 4 concludes the paper with future enhancements.

PROPOSED VIDEO INPAINTING SYSTEM
In this paper, direction-oriented fast iterative block-based video inpainting using morphological operations and SSD is performed. The proposed method first detects the foreground object that needs to be removed and the target region to be inpainted. Then the ADSR-DOBI algorithm is utilized for inpainting, where the target region is inpainted with an efficient block-matching mechanism. The block diagram of the proposed methodology is shown in Figure 1.
Pre-processing
In this methodology, the input video $V_{in}$ is taken from publicly available datasets. As part of pre-processing, frames are extracted from the captured video for further processing, and the converted frames are initialized as

$F = \{F_1, F_2, \dots, F_n\}$

where $F_i$ denotes the $i$-th frame and $n$ is the number of frames. The converted frames are further pre-processed as follows.
Resize image: the frames $F$ are resized to reduce the computational time of the network. Image resizing refers to the scaling of images; it reduces the number of pixels in an image and also allows zooming in on images. Because large images fed into a learning algorithm vary in size, training time might otherwise increase.
Noise removal: the noise caused by blur and illumination in the frames is removed. The proposed method uses Gaussian smoothing to enhance the image structures at different scales. The visual effect of this blurring technique is a smooth blur resembling viewing the image through a translucent screen; the degree of smoothing is determined by the standard deviation of the Gaussian. The preprocessed frames are initialized as

$F_{pre} = \{F_{pre,1}, F_{pre,2}, \dots, F_{pre,n}\}$

where $F_{pre,i}$ denotes the $i$-th preprocessed frame.

Figure 1. Block diagram of the proposed methodology

Background subtraction
Background subtraction is a widely used method for detecting moving objects in videos captured by static cameras. It helps in segmenting foreground blobs into individual objects and tracking them across frames. In this work, the foreground segmentation network (FgSegNet) is used for this purpose. FgSegNet is a recently developed, high-performing neural network that employs an encoder-decoder architecture. The encoder, made up of convolutional neural networks (CNNs),
extracts image features, while the decoder, using a transposed CNN (TCNN), reconstructs the feature maps for object segmentation. This architecture enables accurate background subtraction and object identification. To ensure stable training and avoid issues such as vanishing or exploding gradients, the network's weights are initialized using a binomial distribution. The architecture of FgSegNet is shown in Figure 2. Initially, the input image frames are fed into three CNNs, and the outputs are concatenated and passed to the TCNN. Before segmentation, the weights of the networks must be initialized. Here the weights are initialized using the binomial distribution function rather than assigning random numbers. Thus, the weights are initialized as

$P(k) = \binom{n_t}{k}\, p^k (1-p)^{n_t - k}$

where $P(k)$ denotes the binomial probability, $p$ the probability of success, $(1-p)$ the probability of failure, $n_t$ the number of trials, and $k$ the specific outcome, i.e., the weight value.

Encoder network
The encoder network consists of three copies of a CNN, each of which contains the first four blocks of the VGG-16 net and dropout layers. The input image frames are fed to each CNN, where the convolution layers transform the inputs into feature maps. The transformation of the feature maps can be expressed as

$F_{conv} = \sigma(W \ast F_{pre} + b), \qquad \sigma(u) = \max(0, u)$

where $F_{conv}$ denotes the output feature maps of the convolution, $\sigma$ denotes the ReLU activation function, $W$ the weight values, and $b$ the bias of the network. The extracted feature maps are then down-sampled using the pooling layer. The pooling layer applies the max-pooling operation, which stores only the max-pooling indices, i.e., the locations of the maximum feature value in each pooling window, to capture and store boundary information.
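As an illustration of the binomial weight initialization described above, the following sketch draws layer weights from a Binomial(n, p) distribution and standardizes them to be small and zero-mean. This is a minimal interpretation, not the authors' implementation; the trial count, success probability, and scaling factor are assumed values.

```python
import numpy as np

def binomial_init(shape, n_trials: int = 20, p: float = 0.5,
                  scale: float = 0.1, seed: int = 0) -> np.ndarray:
    """Initialize weights by sampling k ~ Binomial(n_trials, p), whose
    probability mass is P(k) = C(n, k) p^k (1-p)^(n-k), then centering and
    rescaling so the weights are small and zero-mean."""
    rng = np.random.default_rng(seed)
    k = rng.binomial(n_trials, p, size=shape)
    # Standardize: mean n*p, variance n*p*(1-p) for a binomial variable.
    centered = (k - n_trials * p) / np.sqrt(n_trials * p * (1 - p))
    return scale * centered
```

Sampling from a discrete distribution with a fixed seed gives reproducible, bounded initial weights, which is one plausible reading of "initialized using the binomial distribution function rather than assigning random numbers".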
The output of the pooling layer is computed as

$F_{pool} = \mathrm{maxpool}_s(F_{conv})$

where $F_{pool}$ denotes the output feature maps of the pooling layer and $s$ denotes the stride of the kernel. At the end of this process, a dropout layer is used to avoid the problem of overfitting.

Figure 2. Architecture of FgSegNet

Decoder network
The output of the encoder network is concatenated to form feature maps of different scales, which are then fed to the TCNN for decoding. Here the feature maps are processed with transposed convolutions to enlarge them. Finally, a sigmoid function is applied in the last layer to obtain a probability value for each pixel, yielding discrete binary class labels for foreground and background. The segmented image is obtained as

$F_{seg} = \frac{1}{1 + e^{-F_{dec}}}$

where $F_{dec}$ denotes the decoder output. Cross-entropy is used as the loss function, expressed as

$L = -\frac{1}{m} \sum_{i=1}^{m} \left[ F_{gt,i} \log(F_{seg,i}) + (1 - F_{gt,i}) \log(1 - F_{seg,i}) \right]$

where $L$ denotes the loss function, $m$ the number of pixels in the frame, $F_{gt,i}$ the ground-truth label, and $F_{seg,i}$ the label segmented by the network. The segmented frames are further enhanced using morphological operations to remove imperfections in the segmentation.

Object removal
The segmented target objects specified by the morphological process, $F_{morph}$, are removed from each frame, forming the hole region to be inpainted. The static portion of the hole can be filled with available background information using the video inpainting algorithm; otherwise, image inpainting is performed based on the surrounding image statistics. Hence, the frames with hole regions are initialized as

$F_{hole} = \{F_{hole,1}, F_{hole,2}, \dots, F_{hole,n}\}$

where $F_{hole}$ denotes the frames with formed hole regions and $F_{hole,i}$
denotes the $i$-th such frame.

Video inpainting
This section explains the video inpainting process using the direction-oriented block-based inpainting (DOBI) algorithm. The target region, identified as the hole area, is filled with matching background content. Initially, the boundary points of the target region are determined in order to search for suitable blocks. While a fixed patch size increases the number of search points, adapting the search-region size improves efficiency without sacrificing quality. When motion varies among adjacent blocks, a larger search area is required. Thus, an adaptively dimensioned search region is used, leading to the proposed adaptively dimensioned search region-based DOBI (ADSR-DOBI) method, as shown in Figure 3.

Figure 3. Frames with target and source region for inpainting

Initially, the target region $B_{tar}$ is selected and the nearest boundary points are detected. From the outside of the detected boundary, the block with the highest matching probability is selected. During this process, the dimension of the search region is chosen adaptively as

$D_{ASR} = \{\Delta_x, \Delta_y\}$

$\Delta_x = \min\{\Delta_{max}, \max(|x - x_1|, |x - x_2|, |x - x_3|)\}$

$\Delta_y = \min\{\Delta_{max}, \max(|y - y_1|, |y - y_2|, |y - y_3|)\}$

where $D_{ASR}$ denotes the dimension of the search region, $\Delta_x$ and $\Delta_y$ are the adaptive displacements in the horizontal ($x$) and vertical ($y$) directions, and $(x_i, y_i)$ denote the search points. The adaptive search region is bounded such that $\Delta_x \le \Delta_{max}$ and $\Delta_y \le \Delta_{max}$. Moreover, in order to determine the most accurate pixel to be repaired, the confidence of the repaired pixels needs to be updated.
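The adaptive sizing of the search region described above, where the displacement along each axis is the largest distance from the current coordinate to the neighboring search points, clamped by an upper bound, can be sketched as follows. The function names and the bound value are illustrative, not taken from the paper.

```python
def adaptive_displacement(center: int, neighbors, delta_max: int) -> int:
    """Adaptive displacement along one axis: the largest absolute distance
    from the current coordinate to the neighboring search points, clamped
    by the upper bound delta_max."""
    return min(delta_max, max(abs(center - nb) for nb in neighbors))

def search_region_dims(x, y, x_pts, y_pts, delta_max):
    """Dimensions (dx, dy) of the adaptively sized search region."""
    return (adaptive_displacement(x, x_pts, delta_max),
            adaptive_displacement(y, y_pts, delta_max))
```

For example, with search points close to the current block the region stays small, while distant points enlarge it only up to the `delta_max` cap, which keeps the number of candidate blocks bounded.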
An increase in both the confidence and the priority of the neighboring pixels identifies the most accurate pixel to be repaired, so that an improved output can be obtained. Therefore, the priority of a given block is defined as

$P(p) = C(p) \cdot D(p)$

$C(p) = \frac{\sum_{q \in \Psi_p \cap (I - \Omega)} C(q)}{|\Psi_p|}, \qquad D(p) = \frac{|\nabla I_p^{\perp} \cdot n_p|}{\alpha}$

with $0 \le C(p) \le 1$ and $0 \le D(p) \le 1$, where $P(p)$ denotes the priority, $C(p)$ and $D(p)$ are the confidence term and data term, $\nabla I_p^{\perp}$ is the unit vector orthogonal to the image gradient, $n_p$ is the unit vector orthogonal to the boundary at point $p$, and $\alpha$ is the normalization factor. Thus, the pixel $p$ with the highest priority is treated as the initial search centre for choosing the target patch to be filled. Then the selection of the best matching block $B_{mt}$ is done based on the sum of squared differences (SSD) calculated between the known pixels of the target region and the search region. This step identifies the area that satisfies the criterion

$B_{mt} = \arg\min_{B \subset D_{ASR}} S(B, B_{tar})$

where $S(\cdot)$ is the SSD between a candidate block $B$ and the target block $B_{tar}$. The position corresponding to each unknown pixel $p$ of the target region is then filled by assigning the known pixels of the matching block $B_{mt}$. After filling, the confidence value is updated as

$C(q) = C(p), \quad \forall q \in B_{mt} \cap B_{tar}$

These steps are repeated until the target region is filled. The pseudo-code of the proposed ADSR-DOBI is shown in Figure 4.

Figure 4. Pseudo-code of the ADSR-DOBI algorithm

RESULTS AND DISCUSSION
In this section, the performance of the proposed video inpainting method is analyzed. The proposed methodology is implemented in Python.
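As a hedged sketch of how the core SSD block-matching step described above might look in Python: the masked-SSD helper, block size, and search window below are illustrative assumptions, not the authors' code. The SSD is computed only over the known pixels of the target patch, and the candidate with minimal SSD inside the search window is returned.

```python
import numpy as np

def masked_ssd(block: np.ndarray, target: np.ndarray,
               known: np.ndarray) -> float:
    """Sum of squared differences restricted to the known pixels of the target."""
    d = (block - target)[known]
    return float((d * d).sum())

def best_match(frame: np.ndarray, top: int, left: int, size: int,
               known: np.ndarray, search: int):
    """Scan size x size candidate blocks inside a +/-search window around
    (top, left) and return the offset and SSD of the best-matching block."""
    h, w = frame.shape
    target = frame[top:top + size, left:left + size]
    best, best_off = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            # Skip the target block itself and out-of-frame candidates.
            if (dy, dx) == (0, 0) or r < 0 or c < 0 or r + size > h or c + size > w:
                continue
            cand = frame[r:r + size, c:c + size]
            s = masked_ssd(cand, target, known)
            if s < best:
                best, best_off = s, (dy, dx)
    return best_off, best
```

In the full algorithm the known-pixel mask comes from the hole boundary and the window size from the adaptive search-region computation; here both are passed in explicitly to keep the sketch self-contained.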
Database description
For the performance analysis, the proposed work uses the YouTube video object segmentation (YouTube-VOS) dataset, which is publicly available on the internet. YouTube-VOS contains 4,453 videos; 80% of the data was used for training and 20% for testing. The dataset has more than 7,800 unique objects, 190k high-quality manual annotations, and more than 340 minutes of total duration.

Performance analysis
The proposed BDFgSegNet method is evaluated with quality metrics such as sensitivity, specificity, accuracy, precision, recall, F-measure, false positive rate (FPR), false negative rate (FNR), and Matthews correlation coefficient (MCC). Table 1 shows the comparative analysis of the proposed ADSR-DOBI algorithm and the existing DOS-based algorithm; the analysis was done by varying the size of the removed area. For a 28541-pixel frame, the time taken for inpainting is 0.018 seconds less than that of the existing DOS-based method. Similarly, for a 54300-pixel frame, the time taken for inpainting is 1.2267 seconds less than that of the DOS algorithm. The shorter time taken for inpainting a given area shows the efficiency of the proposed method. Table 2 presents a performance comparison of the proposed ADSR-DOBI method with existing techniques using quality metrics such as PSNR, SSIM, MSE, and RMSE. Higher PSNR and SSIM indicate better image quality, while lower MSE and RMSE reflect reduced error. The proposed method shows a 0. dB improvement in PSNR over FFBMA and a 0.07 increase in SSIM compared to BBGDS. Additionally, MSE and RMSE values are consistently lower than those of existing methods. Overall, ADSR-DOBI outperforms the other techniques across all metrics, demonstrating its effectiveness in video inpainting. Table 3 presents a performance comparison between the proposed BDFgSegNet and existing methods using various quality metrics.
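The scalar image-quality metrics used in the comparison (MSE, RMSE, and PSNR for 8-bit frames) can be computed as in this minimal sketch; SSIM is omitted for brevity, and the 255 peak value is the usual assumption for 8-bit imagery.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two frames of equal shape."""
    return float(np.mean((a.astype(float) - b.astype(float)) ** 2))

def rmse(a: np.ndarray, b: np.ndarray) -> float:
    """Root mean squared error."""
    return mse(a, b) ** 0.5

def psnr(a: np.ndarray, b: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    e = mse(a, b)
    return float("inf") if e == 0 else 10.0 * np.log10(peak ** 2 / e)
```

Higher PSNR and lower MSE/RMSE between the inpainted frame and the original (pre-removal) frame indicate a more faithful reconstruction, which is how these values are interpreted in Tables 2 and 3.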
Higher values of accuracy, specificity, sensitivity, precision, F-measure, and MCC indicate better performance, while lower FPR and FNR values reflect greater reliability. As observed, existing methods such as CNN, ANN, DNN, and SegNet show relatively low performance across these metrics. In contrast, the proposed BDFgSegNet demonstrates superior results in all evaluated metrics, highlighting its effectiveness with optimal feature selection.

Table 1. Performance comparison of the proposed ADSR-DOBI algorithm with the DOS-based algorithm (columns: total frame size in pixels, removed area size in pixels, and time taken in seconds for the DOS-based algorithm and the proposed ADSR-DOBI)

Table 2. Performance analysis of the proposed ADSR-DOBI method based on different image quality metrics (techniques: DOS, BBGDS, FFBMA, proposed ADSR-DOBI; metrics: PSNR, SSIM, MSE, RMSE)

Table 3. Performance analysis of the proposed BDFgSegNet method based on quality metrics (metrics: accuracy, specificity, sensitivity, precision, F-measure, FPR, FNR, MCC; techniques: SegNet, CNN, ANN, DNN, proposed BDFgSegNet)

CONCLUSION
Video inpainting removes or restores missing regions in a video using spatial and temporal information. This paper proposes an efficient ADSR-DOBI block-matching algorithm for inpainting, where the target region is identified and filled with matching background content. The proposed method is evaluated against existing techniques using quality metrics. Results show that ADSR-DOBI achieves superior performance, with a PSNR value of 22.19 and an inpainting time of 3.43 seconds, both better than existing methods. These findings demonstrate the efficiency of ADSR-DOBI. Future work can improve the method further by addressing occluded regions and enhancing motion-handling capabilities.

REFERENCES