J-Icon : Jurnal Informatika dan Komputer Vol. 13 No. October 2025, pp. DOI: 10.35508/jicon.
CONSTRUCTING A DATASET FOR INFECTIOUS DISEASE PREDICTION AND
SPATIAL CLUSTER ANALYSIS
Husni Iskandar Pohan1*
School of Computer Science, Bina Nusantara University, Jakarta, Indonesia 11480
Email*: husni.pohan@binus.
ABSTRACT
This study presents a structured methodology for developing a custom dataset from patient visit records collected between January 1, 2019, and December 31, 2021, at a healthcare facility in Bandung Regency, Indonesia.
The raw medical records were transformed into a machine learning-ready dataset through processes such as feature extraction, labeling, and geospatial enrichment using longitude and latitude coordinates.
Personally identifiable information was removed, and clinical symptoms were standardized into structured variables to support both supervised and unsupervised learning tasks, including disease classification, referral prediction, and spatial cluster detection.
The final dataset consisted of 1,015 Covid cases (COV), 1,356 Dengue cases (DHF), and 308 Varicella cases (VAR).
It has been applied in advanced experiments involving feature importance analysis with SHAP and LIME, geospatial clustering, and synthetic data generation to address privacy and data availability concerns.
This methodology is designed to support future research in healthcare analytics and the development of decision support systems and public health planning tools.
However, since the dataset was constructed using records from a single healthcare facility in Bandung, the findings and patterns identified may not be generalizable to other regions that could exhibit different disease trends or healthcare-seeking behaviors.
Keywords: Covid, Dengue, Varicella, Dataset, Cluster
INTRODUCTION
The Department of Health places special attention on diseases whose transmission is based on geographic clustering.
Unlike heart disease and diabetes, which are typically experienced individually, cluster-based diseases can spread rapidly among patients located in close geographic proximity.
One of the challenges in working with such datasets is that the available data is not always in a ready-to-use format.
Therefore, it is necessary to develop an approach to understand the data and transform it into a form suitable for processing with machine learning algorithms.
This requires a solid understanding of the characteristics of the diseases involved: in this case, Covid (hereafter called COV), Dengue (hereafter called DHF), and Varicella (hereafter called VAR).
The objective of this study is to develop a structured methodology for transforming raw patient visit records into an anonymized, machine learning-ready dataset enriched with spatial attributes.
By focusing on three infectious diseases (COV, DHF, and VAR), this research aims to support predictive modeling and geospatial cluster analysis in a clinical context.
The contribution of this study lies in the creation of a publicly reproducible framework for preparing healthcare datasets with both clinical and spatial features.
This includes processes such as disease labeling, referral prediction, and geolocation tagging, which are key components often omitted or only partially addressed in prior works.
Furthermore, the resulting dataset can facilitate subsequent research involving model interpretability techniques such as SHAP and LIME, as well as synthetic data generation using CTGAN, thus expanding its applicability in advanced analytical and privacy-preserving contexts.
Research Gap
This study addresses a notable research gap: the scarcity of integrated datasets and methodologies that combine symptom-level clinical data with geographic coordinates for infectious disease modelling in low- and middle-income country settings.
Most existing works either focus solely on classification without spatial awareness or require pre-cleaned datasets.
In contrast, this research provides end-to-end guidance, from raw data ingestion to usable analytic datasets, while highlighting technical and ethical considerations in real-world implementation.
*) Corresponding Author
Submitted: July 15, 2025; Accepted: August 5, 2025; Published: August 12, 2025
ISSN: 2337-7631 (Print), ISSN: 2654-4091 (Online)
COV is caused by SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), which spreads through droplets from coughing or sneezing.
Transmission can also occur through direct interaction with infected individuals or by touching contaminated surfaces.
Common symptoms include fever, dry cough, and shortness of breath.
In some cases, it can also lead to fatigue, loss of smell, and diarrhea.
DHF is caused by the dengue virus, which is transmitted through the bite of the Aedes aegypti mosquito.
Symptoms may include high fever, headache, pain behind the eyes, and joint or muscle pain.
In more severe cases, it can cause bleeding in the nose, gums, or internal organs, which may lead to fatal outcomes.
Based on its transmission pattern, DHF cases typically increase during the rainy season.
Figure 1 illustrates the virus transmission cycle between humans and mosquitoes.
Figure 1. DHF Transmission Cycle
VAR is caused by the Varicella-Zoster virus and is transmitted through fluid from skin lesions.
Early symptoms may include fever, fatigue, loss of appetite, skin rashes, and red spots, which later develop into fluid-filled itchy vesicles that eventually dry and form scabs.
It can be fatal for individuals with weakened immune systems.
VAR has received relatively little attention as a contagious disease from both regional and global public health perspectives, especially in low- to middle-income countries.
In Indonesia, a developing country with limited public awareness and low vaccination coverage for VAR, cases of VAR remain common.
In contrast to countries like the United States and China, where detailed statistical data on VAR is available, Indonesia still lacks sufficient epidemiological data on VAR.
The characteristics of the three diseases are shown in Table 1.
Table 1. Characteristics of the Three Diseases
Aspect | DHF | VAR | COV
Etiology | Dengue Virus | Varicella-Zoster Virus | Coronavirus
Transmission Medium | Mosquito (Aedes) | Close Contact | Close Contact
Anamnesis (Nurse) | Fatigue; Muscle pain; Red skin spots; Sudden high fever; Nosebleeds | Fever; Skin damage; Malaise | Cough; Fever; Shortness of breath; Sore throat; Diarrhea and vomiting
Physical Exam (Doctor) | Epistaxis; Black stool; Upper abdominal pain/nausea; Bloody urine; Facial flushing; Red eyes; Moderate to severe epigastric tenderness; Petechiae (skin rash) | Scabs; Tear drop lesions; Vesicular skin eruption | Low oxygen saturation; High fever; Rapid breathing; Diminished breath; Rattling breath sounds
Supporting Tests (Lab) | NS1 Test; Platelet Count; Hematocrit Test; Hemoglobin/Leukocyte Test; Serological Test; Leukocyte Type Count | PCR Test; Tzanck Test; Serological Test | Nose/throat infection PCR Swab Test; Antigen Swab Test
After identifying the characteristics of the three diseases, a total of 68,666 patient records in SQL Server format were first evaluated.
The original data format is shown in Table 2.
Table 2. The Original Data Format
Description | Data Type | Example
Visit ID | Varchar | KJ3100251616
Visit Date | Datetime | 2021-12-31 13:58:28
Diagnosis | Varchar | Others
Patient ID | Varchar |
Patient Name | Varchar | Johny Iskandar
Patient Address | Varchar | PBB IV C-71
Date of Birth | Datetime | 2019-06-05 08:09:06
Gender | Varchar | Male
Anamnesis | Varchar | Fever 3 days. Cough. Flu
Physical Examination | Varchar | Temp 38.9 ISPA BP DHF
Body Weight | Varchar |
Hemoglobin | Varchar |
Leukocytes | Varchar | 2,100
Platelets | Varchar | 70,000
Cholesterol | Varchar |
Triglycerides | Varchar |
Uric Acid | Varchar |
Fasting Blood Glucose | Varchar |
2-Hour Postprandial Glucose | Varchar |
Random Blood Glucose | Varchar |
Other Information | Varchar | HT:37
Radiology Result | Varchar | Atelaktasis
ECG Result | Varchar | DBN
ECHO Result | Varchar | Nephrolithiasis
Therapy | Varchar | Dexta Lab Norages ZenirexErfasal
Wound Treatment (Yes/No) | Bit |
Sutures (Yes/No) | Bit |
Physiotherapy (Yes/No) | Bit |
Nebulizer Treatment (Yes/No) | Bit |
Additional Lab Info (Yes/No) | Bit |
Additional Lab Notes | Varchar | Desc
Active Visit Entry (Yes/No) | Bit |
Age at Visit | Varchar | 2 Years, 6 Month, 28 Days
Action Description | Varchar | Others
The conversion process must remove any patient-identifying information to comply with medical confidentiality regulations.
Once the conversion is completed, the final dataset for storing transactional data is organized as shown in Table 3.
Table 3. Final Dataset
Field | Description | Scale | Type | Source | Sample
VisitDate | Visit Date | Interval | Numerical | Registration | 01/12/2021
Longitude | Coordinate | Interval | Numerical | Registration |
Latitude | Coordinate | Interval | Numerical | Registration |
vCode | Disease Code | Nominal | Categorical | Diagnosis | COV/DHF/VAR
vReferral | Referred or Not | Nominal | Categorical | Therapy | Y/N
vSpot | Presence of Spots | Nominal | Categorical | Anamnesis | Y/N
vRed | Redness Signs | Nominal | Categorical | Anamnesis | Y/N
vCongested | Shortness of Breath | Nominal | Categorical | Anamnesis | Y/N
vCough | Presence of Cough | Nominal | Categorical | Anamnesis | Y/N
vFlu | Presence of Flu | Nominal | Categorical | Anamnesis | Y/N
vFeverish | Feverish Sensation | Nominal | Categorical | Anamnesis | Y/N
vStomach | Stomach Issues | Nominal | Categorical | Anamnesis | Y/N
vNauseous | Nausea | Nominal | Categorical | Anamnesis | Y/N
vVomit | Vomiting | Nominal | Categorical | Anamnesis | Y/N
vDizzy | Dizziness | Nominal | Categorical | Anamnesis | Y/N
vItchy | Itching | Nominal | Categorical | Anamnesis | Y/N
vSwallow | Difficulty Swallowing | Nominal | Categorical | Anamnesis | Y/N
vBlister | Blisters or Lesions | Nominal | Categorical | Anamnesis | Y/N
vSore | Body Aches | Nominal | Categorical | Anamnesis | Y/N
vWeak | Weakness | Nominal | Categorical | Anamnesis | Y/N
vRheumaticPain | Joint/Muscle Pain | Nominal | Categorical | Anamnesis | Y/N
vCold | Runny Nose | Nominal | Categorical | Anamnesis | Y/N
vFever | Presence of Fever | Nominal | Categorical | Anamnesis | Y/N
vTemp | Body Temperature | Interval | Numerical | Examination |
vThrombocyte | Thrombocyte Count | Interval | Numerical | Examination |
MATERIAL AND METHODS
Unlike other attributes that are generally well-known in the medical domain, longitude and latitude require a more specific explanation.
Every location on Earth is defined by two values: latitude, which gives the north-south position (lines of latitude run horizontally), and longitude, which gives the east-west position (lines of longitude run vertically).
Latitude has been recognized since ancient times by civilizations such as the Greeks and Romans.
Eratosthenes (276-194 BC), a Greek scholar, used latitude to estimate the Earth's circumference.
Similarly, another Greek scholar, Ptolemy (2nd century AD), produced a monumental work titled Geographia to map the Earth.
Unlike latitude, which can be determined based on the sun's position, longitude is more difficult to determine, as it requires time as a reference parameter.
In 1884, the International Meridian Conference established the Prime Meridian at zero degrees longitude, passing through the Greenwich Observatory in London, creating a standardized global coordinate system that remains in use today.
The coordinate lookup function utilizes Nominatim, a geocoding service based on OpenStreetMap (OSM), to transform address strings into geographic coordinates.
This process was implemented in Step 11 of the dataset preparation workflow to enrich the data with spatial attributes.
This approach allows address-based location data to be converted into precise longitude and latitude values, enabling subsequent geospatial analyses such as clustering and disease mapping.
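The lookup described above can be sketched as follows. This is a minimal illustration, not the study's actual code: it assumes the geopy package as the Nominatim client, and the function names, the user agent string, and the stand-in coordinates are hypothetical. The geocoder is passed in as a callable so the lookup logic can be exercised without network access.

```python
from typing import Callable, NamedTuple, Optional, Tuple


def make_nominatim_geocoder(user_agent: str) -> Callable:
    """Build a rate-limited Nominatim geocoder (requires the geopy package)."""
    # Imported lazily so the rest of this sketch runs without geopy installed.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # Wait at least one second between requests to the public service.
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)


def lookup_coordinates(address: str, geocode: Callable) -> Optional[Tuple[float, float]]:
    """Return (longitude, latitude) for an address, or None if nothing is found."""
    if not address or not address.strip():
        return None
    location = geocode(address.strip())
    if location is None:
        return None
    return (location.longitude, location.latitude)


class FakeLocation(NamedTuple):
    """Stand-in for a geocoder result; only the two fields used above."""
    latitude: float
    longitude: float


def fake_geocode(address: str) -> Optional[FakeLocation]:
    """Offline stand-in geocoder with one illustrative, approximate entry."""
    known = {"Bandung, Indonesia": FakeLocation(latitude=-6.9, longitude=107.6)}
    return known.get(address)
```

With network access, `lookup_coordinates(address, make_nominatim_geocoder("dataset-prep-example"))` would query the live service; addresses that cannot be resolved return `None` and need manual review.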
To ensure compliance with ethical standards and protect patient confidentiality, all personally identifiable information (PII) was removed from the dataset during preprocessing.
This included patient names, identification numbers, exact residential addresses, and any other attributes that could directly or indirectly reveal an individual's identity.
Additionally, sensitive fields were anonymized or transformed into categorical representations where appropriate.
The final dataset only retains de-identified clinical and spatial attributes, such as symptom indicators, laboratory results, and approximate location coordinates (longitude and latitude), which are limited to facility-level granularity.
These measures were implemented in accordance with data protection principles to minimize re-identification risks while maintaining the analytical value of the dataset for research purposes.
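A minimal sketch of the PII removal step is shown below. The field names are hypothetical renderings of the columns in Table 2, and the paper does not state how the removal was implemented, so this only illustrates the idea of dropping identifying columns before any further processing.

```python
from typing import Any, Dict, Iterable, List, Set

# Hypothetical column names, loosely based on the identifying fields in Table 2.
PII_FIELDS: Set[str] = {"PatientID", "PatientName", "PatientAddress", "DateOfBirth"}


def strip_pii(records: Iterable[Dict[str, Any]],
              pii_fields: Set[str] = PII_FIELDS) -> List[Dict[str, Any]]:
    """Return copies of the records with every identifying field removed."""
    return [
        {key: value for key, value in record.items() if key not in pii_fields}
        for record in records
    ]


raw_records = [
    {"VisitID": "V001", "PatientName": "Example Name", "PatientAddress": "Example Street 1",
     "Diagnosis": "DHF", "Anamnesis": "Fever 3 days"},
]
clean_records = strip_pii(raw_records)
```

The input records are left untouched, so the raw data can stay inside the secured environment while only `clean_records` moves on to later steps.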
There are 11 steps involved in producing this dataset, starting from importing SQL data to completing the data with longitude and latitude (Figure 2).
Figure 2. Dataset Preparation Process
Step 1: Importing SQL Data
The initial step in the dataset preparation process is importing data from Microsoft SQL Server, the primary storage for patient medical records.
This involves establishing a secure connection to the server using ODBC or tools like SQL Server Management Studio (SSMS), Python (with pyodbc or sqlalchemy), or Power BI.
The imported data includes key information such as patient ID, chief complaints, diagnoses, symptoms, and test results.
During import, data quality must be ensured from the outset; this includes consistent date formats, valid diagnosis codes, and alignment between the extracted columns and analytical needs.
This raw data serves as the foundation for all subsequent processes, so a secure connection and well-optimized queries are essential to avoid disrupting production systems.
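The import step could look roughly like the sketch below. The table name `PatientVisit` and its columns are hypothetical, and an in-memory SQLite database stands in for SQL Server so the example is self-contained. The function only relies on the standard Python DB-API, so a connection opened with pyodbc, which also uses `?` placeholders, should be usable in the same way.

```python
import sqlite3
from typing import Any, Dict, List

# Hypothetical table and column names; the real schema is summarized in Table 2.
VISIT_QUERY = """
    SELECT VisitID, VisitDate, Diagnosis, Anamnesis
    FROM PatientVisit
    WHERE VisitDate >= ? AND VisitDate < ?
"""


def fetch_visits(conn, start: str, end: str) -> List[Dict[str, Any]]:
    """Read visits in [start, end) with a parameterized query; return one dict per row."""
    cursor = conn.cursor()
    cursor.execute(VISIT_QUERY, (start, end))
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def demo_connection() -> sqlite3.Connection:
    """In-memory stand-in for the production database, with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE PatientVisit (VisitID TEXT, VisitDate TEXT, Diagnosis TEXT, Anamnesis TEXT)"
    )
    conn.executemany(
        "INSERT INTO PatientVisit VALUES (?, ?, ?, ?)",
        [
            ("V001", "2019-03-01", "DHF", "Fever 3 days"),
            ("V002", "2021-12-31", "Others", "Cough"),
            ("V003", "2022-01-05", "Others", "Headache"),
        ],
    )
    return conn


visits = fetch_visits(demo_connection(), "2019-01-01", "2022-01-01")
```

Restricting the query to the needed columns and date range keeps the load on the production server small, in line with the concern raised above.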
Step 2: Extracting Data Based on Disease Keywords
Once the data is successfully imported, the next step is extracting entries based on relevant disease-related keywords.
This helps filter records to focus only on data indicating specific diseases of interest, such as infectious, non-infectious, or chronic illnesses.
The keywords may include disease names, medical terms, or ICD (International Classification of Diseases) codes.
This extraction process is performed using text pattern matching on diagnosis or symptom columns, employing SQL's LIKE operator or regular expressions in Python.
The result is a more targeted subset of data, allowing downstream processes to proceed more efficiently and aligned with the analysis objectives.
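As a rough illustration of keyword-based extraction with regular expressions, consider the sketch below. The keyword patterns are assumptions made for the example; the study's actual keyword list is not given in the text.

```python
import re
from typing import Dict, List, Optional

# Illustrative patterns only; the real keyword list is not stated in the paper.
DISEASE_PATTERNS = {
    "COV": re.compile(r"\b(covid|corona)", re.IGNORECASE),
    "DHF": re.compile(r"\b(dhf|dengue)\b", re.IGNORECASE),
    "VAR": re.compile(r"\b(varicella|chicken\s?pox)\b", re.IGNORECASE),
}


def match_disease(diagnosis: Optional[str]) -> Optional[str]:
    """Return the first disease code whose pattern matches the text, else None."""
    for code, pattern in DISEASE_PATTERNS.items():
        if pattern.search(diagnosis or ""):
            return code
    return None


def filter_by_keywords(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only records whose Diagnosis field matches one of the disease patterns."""
    return [r for r in records if match_disease(r.get("Diagnosis")) is not None]


sample = [
    {"Diagnosis": "Susp. Dengue fever"},
    {"Diagnosis": "COVID-19 confirmed"},
    {"Diagnosis": "Others"},
]
selected = filter_by_keywords(sample)
```

The code returned by `match_disease` could also serve as a first-pass label in the next step, subject to manual review for ambiguous entries.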
Step 3: Processing Data with Disease Code Labels
The filtered dataset is then labeled with disease codes, often referring to standardized classifications such as ICD-10.
These labels serve as target variables for building disease classification models, ensuring each data entry has a clearly defined disease tag.
Labeling can be automated using ICD reference tables or done semi-manually for unstructured entries.
Maintaining consistency in labeling is critical to avoid biased analysis and to ensure the results are valid for research, prediction, or epidemiological reporting.
Step 4: Processing Data with Referral Code Labels
The next step involves labeling the data with referral codes, identifying whether a patient was treated at the current facility or referred elsewhere.
The label can be binary (treated/referred) or more detailed based on the referral destination (e.g., general hospital, specialty hospital, another clinic).
This label is crucial for building clinical decision-support systems and managing patient flow.
By analyzing referral patterns in relation to symptoms and diagnoses, the system can better predict referral needs and help plan healthcare resources effectively.
Step 5: Processing Data with Symptom Features
Following labeling, the focus shifts to processing disease symptoms.
These are typically captured as free-text entries or checkbox options selected by medical staff.
This step involves extracting, normalizing, and converting symptoms into numerical formats suitable for machine learning.
Techniques such as one-hot encoding or embeddings can be used depending on symptom complexity and variation.
The result is a structured feature set representing a patient's clinical condition, which significantly influences model accuracy in disease and referral prediction.
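One simple way to turn free-text anamnesis into the Y/N indicators of Table 3 is keyword matching, sketched below. The keyword lists are hypothetical, only a few of the Table 3 fields are shown, and negations such as "no cough" are not handled.

```python
import re
from typing import Dict, Optional

# Hypothetical keyword lists; field names follow Table 3.
SYMPTOM_KEYWORDS = {
    "vCough": ["cough"],
    "vFever": ["fever"],
    "vNauseous": ["nausea", "nauseous"],
    "vVomit": ["vomit"],
    "vDizzy": ["dizzy", "dizziness"],
}


def extract_symptom_flags(anamnesis: Optional[str]) -> Dict[str, str]:
    """Map free text to Y/N flags, one per symptom field."""
    text = (anamnesis or "").lower()
    flags = {}
    for field, keywords in SYMPTOM_KEYWORDS.items():
        # Match each keyword at a word start so that "vomiting" also counts.
        found = any(re.search(r"\b" + re.escape(word), text) for word in keywords)
        flags[field] = "Y" if found else "N"
    return flags


def encode_flags(flags: Dict[str, str]) -> Dict[str, int]:
    """Convert Y/N flags to 1/0 so they can be fed to a learning algorithm."""
    return {field: 1 if value == "Y" else 0 for field, value in flags.items()}


flags = extract_symptom_flags("Fever 3 days. Cough. Flu")
encoded = encode_flags(flags)
```

For binary symptoms this 1/0 encoding plays the role of one-hot encoding; richer free text would need a more careful extraction approach.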
Step 6: Imputing Missing Data for Incomplete Features
Medical datasets often contain missing or incomplete fields, such as unrecorded blood pressure or unfilled symptom checklists.
To address this, imputation is applied, filling missing values using statistical or machine learning-based approaches.
Common imputation methods include using the mean/mode for numerical data, or more advanced techniques like k-NN or regression-based imputation.
The goal is to preserve data integrity and maximize the usable data pool for model training and evaluation.
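The simplest of these options, mean and mode imputation, can be sketched as follows. The field names come from Table 3, the values are invented for illustration, and the text does not say which imputation method the study finally used.

```python
from statistics import mean, mode
from typing import Any, Dict, List


def impute_column(rows: List[Dict[str, Any]], field: str, strategy: str) -> List[Dict[str, Any]]:
    """Fill missing values (None) in one field with the column mean or mode."""
    observed = [row[field] for row in rows if row.get(field) is not None]
    if not observed:
        # Nothing to learn a fill value from; return unchanged copies.
        return [dict(row) for row in rows]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [
        {**row, field: row.get(field) if row.get(field) is not None else fill}
        for row in rows
    ]


rows = [
    {"vTemp": 38.0, "vCough": "Y"},
    {"vTemp": None, "vCough": "Y"},
    {"vTemp": 37.0, "vCough": None},
]
filled = impute_column(rows, "vTemp", "mean")    # numerical field: mean
filled = impute_column(filled, "vCough", "mode")  # categorical field: mode
```

In practice the fill values should be computed from the training portion only and then applied to the test portion, so that no information leaks across the split described in the next step.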
Step 7: Splitting the Dataset
Once the data is cleaned and complete, it is divided into two sets: training and testing datasets.
A typical ratio is 70:30 or 80:20 depending on dataset size and model complexity.
The split is performed randomly, often using stratified sampling to maintain balanced label distributions across both sets.
This ensures that the trained model can be objectively evaluated on previously unseen data.
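A stratified split can be sketched in plain Python as below. The 80:20 ratio and the seed are example choices, and the small dataset is made up; libraries such as scikit-learn provide equivalent functionality.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def stratified_split(rows: List[Dict[str, Any]], label_field: str,
                     test_ratio: float = 0.2, seed: int = 42
                     ) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split rows into (train, test) while keeping label proportions similar."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_field]].append(row)

    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        # Take the same fraction of every class for the test set.
        n_test = round(len(shuffled) * test_ratio)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Made-up example: 10 COV, 20 DHF, and 5 VAR records.
dataset = ([{"vCode": "COV"}] * 10) + ([{"vCode": "DHF"}] * 20) + ([{"vCode": "VAR"}] * 5)
train_rows, test_rows = stratified_split(dataset, "vCode", test_ratio=0.2)
```

Splitting per class matters here because VAR is much rarer than COV and DHF in the final dataset; a purely random split could leave very few VAR cases in the test set.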
Step 8: Training Data with Disease and Referral Labels
The training dataset is then used to build predictive models with two target labels: disease code and referral status.
Depending on the approach, either two separate models or a single multi-label model may be developed.
Algorithms like decision trees, random forests, or neural networks are commonly used.
During training, the model learns to associate input features (symptoms, age, history) with output labels.
The outcome is a model capable of predicting both the disease classification and referral likelihood for new patient records.
Step 9: Testing Data with Disease and Referral Labels
The testing dataset is used to evaluate the trained model's performance.
Each model prediction is compared against the actual label in the test set.
Evaluation is done using metrics like accuracy, precision, recall, F1-score, and AUC.
This step is essential to assess the model's ability to generalize beyond the training data.
The evaluation results help determine whether the model needs retuning, feature enhancement, or data re-imputation.
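The sketch below shows how accuracy and per-class precision, recall, and F1-score follow from the comparison of predicted and actual labels. The label lists are invented for illustration and are not results from the study; AUC is omitted because it needs predicted probabilities rather than hard labels.

```python
from typing import Dict, List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_metrics(y_true: List[str], y_pred: List[str], positive: str) -> Dict[str, float]:
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented labels for illustration only.
actual = ["DHF", "DHF", "COV", "VAR", "DHF", "COV"]
predicted = ["DHF", "COV", "COV", "VAR", "DHF", "DHF"]
overall = accuracy(actual, predicted)
dhf = class_metrics(actual, predicted, "DHF")
```

Reporting the metrics per class is useful for an imbalanced dataset such as this one, since a high overall accuracy can hide weak performance on the smallest class.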
Step 10: Completing Data for Cluster Analysis
Beyond classification, cluster analysis is conducted to group patient data based on shared features.
This is done using unsupervised learning algorithms like K-Means, DBSCAN, or Hierarchical Clustering.
Clustering helps reveal hidden patterns in the data, such as patient groups with similar symptoms or high referral rates.
These insights can guide the development of targeted health interventions for specific clusters.
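To make the clustering idea concrete, the following is a compact, unoptimized DBSCAN sketch that groups points by geographic distance. It is written from scratch for illustration rather than taken from the study; the coordinates, the 0.5 km radius, and the minimum cluster size are made-up example values, and a library implementation would normally be preferred for real data.

```python
import math
from typing import List, Optional, Tuple

EARTH_RADIUS_KM = 6371.0
NOISE = -1

Point = Tuple[float, float]  # (longitude, latitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two (longitude, latitude) points in km."""
    lon1, lat1 = a
    lon2, lat2 = b
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def dbscan(points: List[Point], eps_km: float, min_samples: int) -> List[int]:
    """Label each point with a cluster index (0, 1, ...) or NOISE (-1)."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbors(i: int) -> List[int]:
        # A point counts as its own neighbor, as in common implementations.
        return [j for j, q in enumerate(points) if haversine_km(points[i], q) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return [label if label is not None else NOISE for label in labels]


# Made-up coordinates: two tight groups and one isolated point.
cases = [
    (107.600, -6.900), (107.601, -6.901), (107.602, -6.899),
    (107.700, -7.000), (107.701, -7.001),
    (108.500, -6.000),
]
cluster_labels = dbscan(cases, eps_km=0.5, min_samples=2)
```

Unlike K-Means, DBSCAN does not need the number of clusters in advance and marks isolated cases as noise, which is one reason it is often considered for hotspot detection.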
Step 11: Completing Data with Longitude and Latitude
Finally, the dataset is enriched with geographic information: longitude and latitude coordinates.
These may come from healthcare facility addresses or administrative-level patient locations.
Adding spatial data enables geospatial analyses such as disease spread mapping, referral hotspot detection, or interactive dashboard development.
By combining clinical and geographic perspectives, the analysis becomes more holistic, supporting better planning and delivery of equitable healthcare services.
RESULT AND DISCUSSION
The preparation and processing of the dataset yielded a well-structured and anonymized collection of patient records related to three infectious diseases: COV, DHF, and VAR.
After careful filtering, labeling, and feature engineering, the final dataset consisted of 1,015 COV cases, 1,356 DHF cases, and 308 VAR cases.
These cases were extracted from a pool of over 68,000 visit records spanning three years (2019-2021), originating from a healthcare facility in Bandung Regency.
The first result was the successful classification of diseases using supervised learning models.
Experiments using traditional classifiers such as Decision Trees, Random Forests, and XGBoost showed high accuracy in identifying disease types based on patient symptoms, examination data, and additional information.
The disease code (vCode) became the primary label, and symptoms like fever, cough, shortness of breath, red spots, and nausea were among the most significant predictors.
Feature importance analysis using SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) confirmed these variables as the most influential in decision-making by the models.
In addition, a second predictive model focused on referral status (vReferral), determining whether a patient required referral to a more advanced healthcare facility.
Using similar features as inputs, the model successfully predicted referral decisions, aiding the potential development of clinical decision support systems (CDSS).
This experiment is particularly valuable for healthcare systems with limited resources, enabling smarter patient routing.
Further analysis was conducted through cluster-based modeling.
Using geospatial data (longitude and latitude), unsupervised algorithms like K-Means and DBSCAN were applied to identify patterns in the distribution of cases.
The clustering revealed geographic hotspots of infection, supporting epidemiological surveillance and localized health interventions.
This spatial understanding is crucial in managing outbreaks where disease transmission is closely linked to proximity and mobility patterns.
Another key outcome was the application of synthetic data generation to address the challenge of small sample sizes and privacy concerns in medical data.
Using techniques such as CTGAN and basic generative models, the team was able to augment the dataset, preserving data structure without compromising patient confidentiality.
This synthetic data supported further model training and validation.
Lastly, the inclusion of interpretable models and geospatial dimensions elevated the usefulness of the dataset.
By combining clinical features with location data, this research not only supports predictive analytics but also opens the door to the development of public health dashboards, early warning systems, and policy planning tools.
The methods and results shown here are expected to contribute to future research on disease prediction and outbreak control in developing regions.
Figure 3 shows the disease quantity of the three disease types relative to the total number of cases obtained.
Figure 3. Disease Quantity
The results of this dataset development have been utilized in seven experiments, four of which have already been published in two journals and two conference proceedings.
Table 4 describes several uses of the dataset in research, some of which have already been published in journals and conference proceedings.
Table 4. Experiment
Experiment Name | Experiment Objective | Result
DHF Disease Prediction | Dataset trial for predicting Dengue Hemorrhagic Fever | Achieved satisfactory results
Disease Type Prediction | Determining patient's disease type among three alternatives: COV, DHF, and VAR | Achieved satisfactory results
Referral Prediction | Predicting whether a patient should be referred to a facility equipped to handle infectious diseases | Achieved satisfactory results
Cluster Analysis | Using longitude and latitude to identify the number of cases in a specific geographic area, proximity, etc. | Achieved satisfactory results
Dominant Feature Prediction using LIME and SHAP | Utilizing model interpreters LIME and SHAP to identify the most dominant features in predictions | Achieved satisfactory results
Synthetic Data Generation | To overcome limited medical data populations due to privacy regulations | Achieved satisfactory results
Prediction with XGBoost and Feature Interpretation (SHAP) | XGBoost implementation with SHAP-based dominant feature identification | Achieved satisfactory results
CONCLUSION
This study potentially contributes to the development of decision support systems, although further validation with healthcare stakeholders is required.
As part of this effort, it presents a structured approach to the development of a disease-specific dataset derived from electronic medical records, with an emphasis on three geographically transmissible infectious diseases: COV, DHF, and VAR.
The methodology encompasses data extraction, transformation, anonymization, feature engineering, and geospatial enrichment, ensuring that the dataset adheres to both analytical rigor and ethical standards regarding patient confidentiality.
The experimental results underscore the dataset's utility in supporting various machine learning tasks, including disease classification, referral prediction, and cluster-based spatial analysis.
The integration of model interpretability techniques (SHAP and LIME) further enhances transparency and trust in predictive models.
Moreover, the generation of synthetic data addresses constraints related to data availability and privacy, offering a viable path for model training and validation in sensitive healthcare contexts.
Overall, the findings affirm that the structured and geotagged dataset developed in this study holds significant potential for advancing data-driven epidemiological research and clinical decision support, particularly in settings with limited health data infrastructure.
Future investigations may build upon this foundation by incorporating longitudinal analyses, real-time data integration, and broader population-level health indicators.
BIBLIOGRAPHY