Vol.
16 No.
Jurnal Riset Teknologi Pencegahan Pencemaran Industri Journal homepage: https://w.
Addressing Missing Data in Environmental Technologies: Economic and Environmental Optimizing Air Quality Monitoring with Random Forest and MissForest
Titin Agustin Nengsih*1, Indrawata Wardhana2, Nazori Madjid3
1,2,3 UIN Sulthan Thaha Saifuddin Jambi, Indonesia
ARTICLE INFO
Article history:
Received: February 27, 2025
Received in revised form: March 29
Accepted: April 25, 2025
Available online: May 28, 2025
Keywords: Air Quality, Imputation, Missing Values, Random Forest, missForest
ABSTRACT
Air quality monitoring often encounters missing data issues due to technical glitches, equipment malfunctions, or other causes.
This study employs PM2.5 and PM10 datasets from station 6, calculating multiple weighted probabilities for imputation.
The methodology includes the simulation of missing data patterns using multivariate amputation techniques (MCAR, MAR, and MNAR), followed by the application of two machine learning-based imputation methods, Random Forest (RF) and missForest (MF).
The performance of each method was assessed using three statistical evaluation metrics, Root Mean Square Error (RMSE), Nash-Sutcliffe Efficiency (NSE), and Kling-Gupta Efficiency (KGE), with missing values introduced at rates of 10, 40, and 70 percent.
The results show that missForest consistently outperforms Random Forest across all missingness levels and amputation types.
For example, in the low missing data scenario (10%), MF achieves RMSE values as low as 0.83 (PM2.5) and [...]76 (PM10), with perfect NSE and KGE scores, while RF yields higher RMSEs and slightly lower scores.
Even under high missing data conditions (70%), MF maintains strong performance with RMSE values of 10.54 and NSE above 0.[...].
These findings highlight MF's superior accuracy and robustness for handling missing air quality data.
INTRODUCTION
Ambient air pollution poses a significant environmental concern, exerting adverse effects on human health. Exposure to particulate pollutants, including fine particulate matter and ozone, increases the risk of mortality.
Notably, in 2015, the Global Burden of Disease report indicated a staggering toll, attributing 4.2 million deaths and 1 million disability-adjusted life years to PM2.5
worldwide (Burnett et al.
, 2.
To address this issue, ground-based air quality monitoring stations have been established, enabling real-time surveillance of air quality.
However, these stations are often geographically dispersed, leading to gaps in the dataset.
This gap is particularly prominent during the initial stages of station deployment (Chu & Bilal, 2.
The evaluation of a missing data methodology in a practical context involves applying it to a realistic missing data issue.
In this regard, (Brand, 1.
proposed a multivariate amputation concept that replicates the absence of actual data.
While this idea was briefly mentioned twice in previous literature, (Schouten et al.
, 2.
expanded on schematic concepts and introduced an amputation procedure capable of generating complex missing data. To address missing data patterns across various data types, including continuous, discrete, binary, unordered categorical, and ordered categorical variables, the Multiple Imputation by Chained Equations (MICE) approach has been effective (White et al.
, 2.
(Deng et al.
, 2.
(Zhao & Long, 2.
*Corresponding author.
E-mail: nengsih@uinjambi. id (Titin Agustin Nengsih)
doi: https://10. 21771/jrtppi.
2503-5010/2087-0965 © 2025 Jurnal Riset Teknologi Pencegahan Pencemaran Industri-BBSPJPPI (JRTPPI-BBSPJPPI).
This is an open access article under the CC BY-NC-SA license (https://creativecommons.org/licenses/by-nc-sa/4.0/).
Accreditation number: (Ristekdikti) 158/E/KPT/2021
Nengsih et al. / Jurnal Riset Teknologi Pencegahan Pencemaran Industri Vol 16 No 1, 23-31
Furthermore, in the context of Missing at Random (MAR) and Missing Not at Random (MNAR) scenarios, the Bayesian Imputation approach has shown superior performance (Halme & Tannenbaum, 2.
For high-dimensional data with a predictive focus, Bayesian Linear Regression has been successful (Castillo et al.
, 2.
A notable solution, MissForest (Stekhoven & Bühlmann, 2.
, offers a promising avenue for handling missing data.
To maximize the benefits derived from imputed data in MICE, where reimputing the data would not alter standard error estimates, it is recommended to perform multiple imputations.
According to (Graham et al.
, 2.
the optimal number of imputations is suggested as 3 to 5.
Addressing the challenge of determining the number of imputations, (von Hippel, 2.
introduces a two-step approach.
In the quest for identifying the best parameter for multiple imputation, (von Hippel & Bartlett, 2.
employ the Maximum Likelihood method, which is less computationally intensive, faster, and yields slightly more efficient point estimates.
To assess the MCAR, MAR, and MNAR missingness mechanisms, we introduce missingness into the variable Y. Subsequently, we categorize the MAR and MNAR methods by the part of the incomplete variable's distribution they target: the left tail (LEFT/L), the right tail (RIGHT/R), both tails (TAIL/T), or the distributional center (MID).
The occurrence of MAR-induced missingness in Y relies on X, as indicated in Figure 2, which portrays the four distribution functions (LEFT, RIGHT, MID, and TAIL) (Schouten & Vink, 2.
(Wardhana et al.
, 2.
In scenarios involving MNAR missingness, the true value of Y itself influences the probability of Y being missing.
Additionally, we generate three levels of missingness proportions: 0.
1, 0.
5, and 0.
It is important to emphasize that these proportions represent the sampled ratio of incomplete cases in Y while keeping X constant. To generate missing values across all conditions, we employ the multivariate amputation technique (Schouten et al.
, 2.
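The weighting idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the logistic form of the weights, the 0.75 offset for the MID and TAIL types, and the helper names `standardize`, `type_weight`, and `ampute` are all assumptions made for illustration. Passing X as the driver gives MAR-style missingness in Y, while passing Y itself gives MNAR-style missingness.

```python
import math
import random


def standardize(values):
    """Return z-scores; assumes the values are not all identical."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def type_weight(z, kind):
    """Logistic weight for a z-score; a larger weight means more likely to go missing."""
    if kind == "RIGHT":
        score = z
    elif kind == "LEFT":
        score = -z
    elif kind == "TAIL":
        score = abs(z) - 0.75  # assumed offset, for illustration only
    elif kind == "MID":
        score = 0.75 - abs(z)  # assumed offset, for illustration only
    else:
        raise ValueError(f"unknown type: {kind}")
    return 1.0 / (1.0 + math.exp(-score))


def ampute(y, driver, prop, kind, seed=0):
    """Return a copy of y with round(prop * n) entries replaced by None.

    driver = X gives MAR-style missingness; driver = y gives MNAR-style.
    """
    rng = random.Random(seed)
    weights = [type_weight(z, kind) for z in standardize(driver)]
    # Weighted sampling without replacement (Efraimidis-Spirakis keys).
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    k = round(prop * len(y))
    chosen = {i for _, i in sorted(keys, reverse=True)[:k]}
    return [None if i in chosen else v for i, v in enumerate(y)]
```

With `kind="RIGHT"` and `driver=y`, large values of Y are preferentially removed, which is the right-tail MNAR case; the proportion of incomplete cases is fixed by `prop`.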
In the context of data characterized by non-linearity and non-normality, a comparison was conducted between Random Forest imputation and predictive mean matching (Hong & Lynn, 2.
The application of machine learning imputation techniques extends to diverse fields, including meteorological observation (Boomgard-Zagrodnik & Brown, 2.
as well as geostatistics (Li et al.
, 2.
and (Avalos & Ortiz, 2.
Handling missing values in air quality data is a critical aspect of ensuring accurate and reliable analyses.
In the realm of air quality assessment, missing data can arise for various reasons, such as sensor malfunctions (Zainuri et al., 2.)
, equipment downtime (Norazian et al.
, 2.
, or data transmission issues (Junger & Ponce de Leon, 2.
These gaps in the data can potentially lead to biased or incomplete conclusions if not properly addressed.
To tackle this challenge, several imputation techniques are commonly used. These techniques involve replacing missing values with estimated values based on the available data.
Methods like mean imputation (Junger & Ponce de Leon, 2.
, interpolation (Norazian et al.
, 2.
, nearest neighbor imputation (Zhou et al.
, 2.
, regression-based imputation (Quinteros et al.
, 2.
, and multiple imputation (Zainuri et al.
, 2.
provide ways to fill in the gaps and enable more comprehensive analyses.
The choice of imputation method depends on factors such as the nature of the data, the extent of missingness, and the specific goals of the analysis.
This study introduces a novel approach by integrating multivariate amputation (MCAR, MAR, MNAR) with weighted probability distributions to generate more realistic missing data scenarios.
The use of two imputation methods, Random Forest and missForest, was intentional, as they represent advanced tree-based algorithms with different strengths.
Random Forest is widely recognized for its effectiveness in handling high-dimensional, nonlinear datasets, while missForest builds upon this by incorporating an iterative, non-parametric framework that enhances imputation accuracy.
By comparing the two under various missingness conditions, this research provides a more comprehensive understanding of their performance in air quality data contexts.
METHODS
Dataset
The dataset comprises air quality observations collected from 135 monitoring stations throughout Uganda.
This continuous dataset encompasses calibrated hourly PM2.5 and PM10 data derived from air quality monitoring devices and a reference-grade monitoring apparatus between 2019 and 2020.
It consists of two files, namely "hourly air quality data.csv" and "reference grade monitor hourly air quality data."
These files contain timestamps in UTC, PM2.5 and PM10 concentrations, unique site IDs for monitoring sites, and site coordinates (latitude and longitude).
Analysis of the monitor dataset reveals mean PM2.5 and PM10 concentrations of 37.39 µg/m³ and 49.61 µg/m³, respectively.
The reference-grade monitor employed for this data collection is the Met One Beta Attenuation Monitor Model 1022, specifically designed for hourly PM2.5 concentration measurement. In contrast, the low-cost monitors utilize laser scattering technology and dual Plantower sensors (PMS 5.
(Sserunjogi et al.
, 2.
Imputation Methods
Imputing missing air quality data is crucial for air pollution research and monitoring.
Various methods exist:
simple single imputation replaces missing values with estimates such as the mean (Hirabayashi & Kroll, 2.), the median, or a regression prediction;
multiple imputation creates multiple simulated values to capture uncertainty (Schouten et al.);
spectral methods employ discrete sampling for nonstationary time series (Alsaber et al., 2.);
logistic regression handles non-linear relationships in time series (Chen et al.) but needs substantial data.
Choosing a method depends on data specifics and resources.
A comparative study can help identify the most effective approach.
Random Forest (RF) is an ensemble technique based on decision trees, designed to fill in missing data by consolidating outcomes from several decision trees (Deng et al.
, 2.
These trees differ due to their creation from diverse datasets, leading to distinct outcome predictions for the same inputs.
RF then combines these predictions through a voting process to yield a final result.
This imputation method boasts strong classification capabilities and is well-suited for managing high-dimensional data.
The RF process is shown in Algorithm 1.
Algorithm 1 Random Forest
Input: Data matrix X = {X_obs, X_miss}
Output: Imputed data matrix X = {X_obs, X_imputed}
for i = 1 -> 4 (multiple imputation) do
    {μ, σ²} ← X_obs
    Initial imputation: X_miss ← N(μ, σ)
    for t = 1 -> N do
        Estimate W_t (Equation 6)
        Estimate X_miss (Equation 7)
        X_miss^t ← P(X_miss | X_obs, X_miss^(t-1))
    end for
    X_imputed ← X_miss
end for
return X_imputed ← Aggregate(X_imputed)
MissForest (MF) employs an iterative method that utilizes the Random Forest algorithm to predict missing values. In each iteration, a Random Forest model is constructed for individual variables, leveraging observed data to estimate missing values.
This process iterates until convergence, progressively refining imputed values with each iteration.
(Zhang et al.
, 2.
Algorithm 2 missForest
Require: X an n × p matrix, stopping criterion γ
Sort X by amount of missing values of stations
Make an initial guess for missing values using another method
while not γ do
    X_old ← store previously imputed matrix
    for s in 1, ..., p do
        Fit a random forest: y_obs ~ x_obs
        Predict y_mis using x_mis
        X_new ← update imputed matrix, using predicted y_mis
    end for
    Update γ
end while
Return the imputed matrix X_imp
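The control flow of Algorithm 2 can be sketched in plain Python as follows. To keep the example free of external libraries, a k-nearest-neighbour regressor stands in for the random forest, so this is not missForest itself; it only illustrates the loop (initial guess, column ordering, per-column refit, stopping when the change between successive imputations stops decreasing). The function names and the defaults `k=3` and `max_iter=10` are hypothetical.

```python
def mean_fill(col):
    """Initial guess: replace None with the mean of the observed entries."""
    obs = [v for v in col if v is not None]
    m = sum(obs) / len(obs)
    return [m if v is None else v for v in col]


def knn_predict(train_x, train_y, query, k=3):
    """Stand-in learner (NOT a random forest): mean target of the k nearest rows."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def iterative_impute(columns, max_iter=10, k=3):
    """columns: list of equal-length lists, with None marking missing entries."""
    p, n = len(columns), len(columns[0])
    miss = [[v is None for v in col] for col in columns]
    imp = [mean_fill(col) for col in columns]
    order = sorted(range(p), key=lambda s: sum(miss[s]))  # fewest missing first
    prev_diff = float("inf")
    for _ in range(max_iter):
        old = [col[:] for col in imp]
        for s in order:
            if not any(miss[s]):
                continue
            others = [j for j in range(p) if j != s]

            def row(i):
                return [imp[j][i] for j in others]

            obs_idx = [i for i in range(n) if not miss[s][i]]
            train_x = [row(i) for i in obs_idx]
            train_y = [imp[s][i] for i in obs_idx]
            for i in range(n):
                if miss[s][i]:
                    imp[s][i] = knn_predict(train_x, train_y, row(i), k)
        num = sum((imp[s][i] - old[s][i]) ** 2 for s in range(p) for i in range(n))
        den = sum(imp[s][i] ** 2 for s in range(p) for i in range(n))
        diff = num / den if den else 0.0
        if diff >= prev_diff:  # change no longer decreasing: keep previous matrix
            return old
        prev_diff = diff
    return imp
```

Only the entries that were originally missing are ever overwritten; observed values pass through unchanged, which mirrors the role of X_obs in the algorithm.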
Process of Missing Values
The process of generating missing values involves deliberately creating gaps in data to assess how a model performs on different complete datasets.
To achieve this, missing values are introduced only to the testing instances, while the training instances remain unaffected. In cases where the original dataset has missing entries, the training instances with missing values are excluded, resulting in the construction of an RF model using fully observed training data.
Three distinct missing mechanisms are introduced, each serving a unique purpose, and their specifics are outlined in (Karmitsa et al.
(Alsaber et al.
, 2.
MAR : indicates that the likelihood of an attribute having missing values is influenced by the values of other attributes (Schouten & Vink, 2.
MNAR : In this scenario, the probability of an attribute having missing values is connected to the attribute's own value.
Specifically, missing values are introduced in one attribute, with higher attribute values being removed at a certain proportion (Khan & Hoque, 2.
MCAR: Under this mechanism, a specific number of locations are chosen randomly, and the values at these chosen locations are removed.
Importantly, the decision to introduce missing values is independent of the values of other attributes or the attribute itself (Idri et al.
, 2.
To examine how the rate of missing data influences classification outcomes, portions of values within the datasets are randomly removed at fractions of 10%, 40%, and 70%, respectively.
By systematically altering the missing rates, the study aims to gain insights into how the presence of missing data impacts classification results.
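The three mechanisms and the three missing rates can be made concrete with a short Python sketch. It assumes the simplest deterministic reading of the descriptions above: MCAR removes uniformly random positions, MAR removes the entries of one attribute where another attribute is largest, and MNAR removes the largest values of the attribute itself. The function names `mcar`, `mar_by`, and `mnar_high` are hypothetical.

```python
import random


def mcar(values, rate, seed=0):
    """Remove round(rate * n) entries at uniformly random positions."""
    rng = random.Random(seed)
    k = round(rate * len(values))
    drop = set(rng.sample(range(len(values)), k))
    return [None if i in drop else v for i, v in enumerate(values)]


def mar_by(y, x, rate):
    """Remove entries of y at the positions where the other attribute x is largest."""
    k = round(rate * len(y))
    order = sorted(range(len(y)), key=lambda i: x[i], reverse=True)
    drop = set(order[:k])
    return [None if i in drop else v for i, v in enumerate(y)]


def mnar_high(values, rate):
    """Remove the largest entries of the attribute itself."""
    return mar_by(values, values, rate)


if __name__ == "__main__":
    pm25 = [12.0, 48.5, 33.1, 75.2, 20.4, 61.0, 39.9, 27.3, 55.6, 18.8]
    for rate in (0.10, 0.40, 0.70):
        amputed = mcar(pm25, rate, seed=42)
        print(rate, sum(v is None for v in amputed))
```

The sample values in the `__main__` block are made up for illustration and are not taken from the study's dataset.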
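Evaluation Criteria
The performance of missForest and Random Forest on the multiply imputed air quality data was tested by comparing the imputed data at each missingness percentage with the observed data, using the Normalized Root Mean Square Error (NRMSE), the Nash-Sutcliffe efficiency (NSE), and the Kling-Gupta efficiency (KGE):

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{\frac{1}{N}\sum_{i=1}^{N} X_i}$$

$$\mathrm{NSE}(X, \hat{X}) = 1 - \frac{\sum_{i=1}^{N} (X_i - \hat{X}_i)^2}{\sum_{i=1}^{N} (X_i - \mathrm{mean}(X))^2}$$

$$\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\beta - 1)^2 + (\gamma - 1)^2}, \qquad \beta = \frac{\mu_s}{\mu_o}, \qquad \gamma = \frac{\sigma_s / \mu_s}{\sigma_o / \mu_o}$$

Here $X_i$ and $\hat{X}_i$ are the observed and imputed values, $r$ is their correlation, and $\mu$ and $\sigma$ are the mean and standard deviation of the imputed ($s$) and observed ($o$) series.
Figure 1. Framework of Multi Weight Probabilities
RESULT AND DISCUSSION
The multivariate continuous data, which was distributed across monitoring stations in Uganda with 0% missing data, was analyzed using Random Forest and missForest.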
The variables PM2.5 and PM10 and the MCAR missingness mechanism were used to create the amputation.
Because the missingness is completely random, a missing value cannot be predicted from any other value in the data; the Missing Completely at Random (MCAR) algorithm is used for this reason.
Multiple weight probabilities, along with the distributional center (DC), right tail (RT), and left tail (L), are employed to create incomplete variables (M).
The amputation percentages are also broken down into three groups: low (10%), middle (40%), and high (70%).
Figure 2 illustrates that among all stations, station 6 exhibited the lowest outlier values for both PM2.5 and PM10. Building upon this observation, station 6 was selected as a key input source, aggregating data from 135 monitoring stations. As detailed in Table 1, station 6 maintained a mean PM2.5 concentration of 37.39 µg/m³ without any data gaps, and a mean PM10 concentration of 49.61 µg/m³, also without any missing values.
From fig 4, we see that the kurtosis of both data PM 2.
and PM 10 was leptokurtic with values: 3.
3029 and 3.
The skewness was positive for both data with values :
4682 and 0.
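For readers who want to reproduce shape statistics of this kind, the following Python sketch computes moment-based skewness and kurtosis. It assumes the Pearson definition of kurtosis, under which a normal distribution scores 3 and values above 3 are called leptokurtic; whether the study used this definition or excess kurtosis is an assumption, and the function names are hypothetical.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness; positive means a longer right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Pearson kurtosis; a normal distribution gives 3, larger is leptokurtic."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2
```

A positive skewness together with a Pearson kurtosis slightly above 3 describes a distribution with a heavier right tail and a somewhat sharper peak than the normal distribution.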
Table 2 offers a comprehensive analysis of the imputation techniques, MF and RF, across levels of missing data categorized as Low, Middle, and High.
In the Low missing data scenario, both MF and RF exhibit favorable results with generally low RMSE values, indicating effective imputation.
The NSE values are consistently high, implying a strong alignment between observed and imputed values.
Additionally, the KGE values are notably high, reflecting robust model performance and accurate imputation.
Moving to the Middle missing data scenario, a nuanced trend emerges.
Although MF tends to yield slightly higher RMSE values compared to RF, both methods maintain high NSE values, signifying proficient representation of observed values.
The KGE values remain elevated, reinforcing the notion of reliable model efficiency even in moderately incomplete datasets.
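To make the three scores discussed above concrete, a minimal Python sketch is given here. The KGE variability term is taken as the ratio of coefficients of variation; the original 2009 formulation uses the ratio of standard deviations instead, and which variant the study applied is an assumption. The function names are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _sd(values):
    m = _mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def rmse(obs, sim):
    """Root mean square error between observed and imputed values."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency; 1 is a perfect match."""
    mean_o = _mean(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_o) ** 2 for o in obs)
    return 1.0 - num / den


def kge(obs, sim):
    """Kling-Gupta efficiency with a coefficient-of-variation ratio (assumed variant)."""
    mean_o, mean_s = _mean(obs), _mean(sim)
    sd_o, sd_s = _sd(obs), _sd(sim)
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim)) / len(obs)
    r = cov / (sd_o * sd_s)
    beta = mean_s / mean_o
    gamma = (sd_s / mean_s) / (sd_o / mean_o)
    return 1.0 - math.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)
```

An imputation that reproduces the observations exactly gives RMSE 0 and NSE and KGE of 1, which is the sense in which scores of one are described as perfect.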
Figure 1. Station number with PM2.5 and PM10 (µg/m³)
Figure 3. Data distribution of Station 6
Figure 2. Outlier in Data Station 6
Fig. 3 illustrates the distribution of data spanning between the lower and upper thresholds.
The majority of the data points are within this range, with only a minimal portion classified as outliers.
To detect outliers, a mean-based approach is employed.
This technique involves calculating the mean of the data and identifying data points that deviate beyond a specific range from this mean.
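A mean-based rule of this kind can be sketched in a few lines of Python. The text does not state how wide the "specific range" is, so the sketch assumes thresholds of the mean plus or minus k standard deviations with k = 3; both the multiplier and the function names are hypothetical.

```python
import math


def outlier_bounds(values, k=3.0):
    """Lower and upper thresholds: mean -/+ k standard deviations (k is assumed)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean - k * sd, mean + k * sd


def find_outliers(values, k=3.0):
    """Return the values that fall outside the thresholds."""
    lower, upper = outlier_bounds(values, k)
    return [v for v in values if v < lower or v > upper]
```

Counting the result of `find_outliers` per station would give one way to compare stations by their number of outlying readings.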
Conversely, the high missing data scenario introduces more significant challenges.
Imputation errors escalate for both MF and RF, as evidenced by elevated RMSE values.
The NSE values experience a decline, particularly pronounced for RF, indicating a diminished concordance with observed data.
A similar pattern emerges with the KGE values, illustrating reduced model efficiency in capturing variability under increased missing data.
Table 1. Descriptive statistics of the air quality data for all stations
Var: PM2.5, PM10
Unit: (µg/m³), (µg/m³)
% miss
Range: 77 – 214. 1 - 499.
In summary, for PM2.5, the analysis underscores the effectiveness of both MF and RF imputation techniques, particularly in scenarios with lower levels of missing data.
The outcomes in the High missing data scenario highlight the inherent difficulty of imputing highly incomplete datasets, with RF showing a marginally greater impact.
Thus, tailored approaches may be necessary for addressing imputation challenges in varying missing data scenarios to ensure accurate and reliable results.
For PM10, Table 3 shows that RF attains NSE and KGE values equal to one for types L, M, and R at 10 percent amputation.
Across all performance evaluations, MF still outperforms RF even at 40 percent amputation for types L, M, and R.
The table provides a comprehensive comparison between the MF and RF models, examining their performance metrics across distinct categories and positions.
Notably, MF consistently outperforms RF in terms of RMSE and KGE, showcasing its ability to achieve a higher level of agreement between predicted and observed values.
This superiority is evident across various categories, with MF demonstrating a marginal advantage in RMSE and KGE, particularly noteworthy in the high missingness scenario. Additionally, both models exhibit robust predictive accuracy, as evidenced by consistently high NSE scores across most cases.
An intriguing observation emerges when considering data imputation under significant challenges.
Despite a substantial 70% data amputation, MF shows remarkable resilience in imputing data accurately, as reflected by KGE scores nearing the ideal value of 1.
Meanwhile, RF's imputation performance remains noteworthy, achieving up to 86% accuracy for types L, M, and R at the 70% amputation threshold.
In light of these findings, MF consistently demonstrates superior accuracy compared to RF across diverse types and percentages of data amputation.
Its capacity to sustain high precision in data imputation even under severe conditions further reinforces its effectiveness.
Ultimately, the analysis underscores MF's efficacy and reliability in predictive modeling and data imputation.
Figure 6 shows that most of the type L, M, and R cases with 70 percent missing values can be handled by MF.
Most of the imputation prediction errors fall in the range 40 to 60, as shown by the orange circle.
Table 2. PM2.5 imputation with missForest and Random Forest, evaluated by RMSE, NSE, and KGE
Percentage: Low, Middle, High
Columns: RMSE, NSE, KGE
Table 3. PM10 imputation with missForest and Random Forest
Percentage | Type | RMSE (MF, RF) | NSE (MF, RF) | KGE (MF, RF)
M 14.
CONCLUSION
In conclusion, this research underscores the significance of accurate data imputation techniques, with missForest proving to be a reliable and robust method for addressing missing data across varying levels of complexity in the Air Quality Index.
The findings emphasize the importance of tailored approaches and shed light on the limitations and strengths of different imputation strategies for enhancing data integrity and analysis.
Overall, this study highlights the critical role of accurate imputation in air quality monitoring.
The missForest method consistently demonstrated superior performance across all missingness levels and types, outperforming Random Forest in terms of RMSE, NSE, and KGE.
Notably, missForest achieved near-perfect results in low missingness scenarios, with RMSE as low as 0.83 (PM2.5) and [...]76 (PM10), and perfect NSE and KGE values. Even at a high missingness level of 70%, missForest maintained strong performance with RMSE up to 10.54 and NSE above 0.[...].
These findings underscore missForest's robustness and reliability in handling complex missing data, making it a highly recommended method for environmental data imputation.
REFERENCE