J-Icon : Jurnal Informatika dan Komputer Vol. 13, October 2025. DOI: 10.35508/jicon.

Hyperparameter Optimization in Machine Learning Models on Sky Survey Data Classification

Efraim Kurniawan Dairo Kette*
Fakultas Sains dan Teknik, Universitas Nusa Cendana, Kupang, Indonesia
Email*: efraim.kette@staf.

ABSTRACT
Discovering the optimal model amid today's popularity of machine learning applications remains an essential challenge. Besides data dependency, the performance of classification models is also affected by selecting a suitable algorithm with optimal hyperparameter settings. This study conducted a hyperparameter optimization process and compared the resulting accuracies of various classification models applied to observational datasets. The data come from the Sloan Digital Sky Survey Data Release 18 (SDSS-DR18) and the Sloan Extension for Galactic Understanding and Exploration (SEGUE-IV). Both surveys provide observational data on space objects, such as stellar spectra with the corresponding positions and magnitudes of galaxies or stars. The SDSS-DR18 dataset contains magnitude and redshift data of celestial objects with three target classes: stars, Quasi-Stellar Objects (QSOs), and galaxies. The SEGUE-IV dataset contains equivalent-width parameters, line indices, and other features related to the radial velocity of the corresponding stellar spectrum. This study applied several machine learning models: k-Nearest Neighbor (KNN), Gaussian Naive Bayes, eXtreme Gradient Boosting (XGBoost), Random Forest, Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP). Bayesian-, grid-, and random-based approaches were used to find the optimal hyperparameters that maximize the performance of each classification model. This study shows that several classification models improved their accuracy scores under Bayesian-based hyperparameter optimization.
This study finds that the XGBoost model shows the highest classification results after hyperparameter optimization compared with the other models for both datasets, with average accuracies of 99.10% and 95.11%, respectively.
Keywords: Machine Learning, Hyperparameter Optimization, Sky Object Classification.

INTRODUCTION
Classification with machine learning has become a powerful tool for various applications. Machine learning capabilities allow computers to perform classification based on patterns and characteristics learned from data. Despite the rising popularity of machine learning applications, identifying which classification models consistently produce the most accurate results remains a crucial challenge. The quality, amount, and relevance of the data used in the training process have a significant impact on a classification model's performance. Beyond the data, however, the efficiency of classification models is also determined by selecting the best algorithm with optimal hyperparameter settings. Therefore, a systematic approach is essential to identify the best-suited algorithms and model settings for achieving optimal classification performance. Machine learning algorithms are often used in science, especially astronomy. Over the past few years, the Sloan Digital Sky Survey (SDSS) project, which attempts to map a quarter of the sky, has produced a large amount of observational data. SDSS provides freely accessible observational data in the form of physical, atmospheric, and spectral parameters of various celestial objects. Astronomers will gradually become unable to manually categorize and label celestial objects because of the vast amount of data on newly discovered objects. Machine learning classification algorithms are therefore meant to help overcome this problem. Several classification algorithms that efficiently process vast amounts of data have already been developed into classification models.
They are used to predict a target variable for new data. Classification models with diverse methodologies are often used for sky object classification problems. In this study, several classification models are applied to classify sky object observation data: k-Nearest Neighbor (KNN), Gaussian Naive Bayes, Support Vector Machine (SVM), Random Forest, eXtreme Gradient Boosting (XGBoost), and Multi-Layer Perceptron (MLP).

*) Corresponding Author. Submitted: September 19, 2024. Accepted: August 15, 2025. Published: September 26, 2025.
ISSN: 2337-7631 (Printed); ISSN: 2654-4091 (Online)

The models used in this study were selected based on their simplicity and ease of interpretation (Random Forest, Gaussian Naive Bayes, and KNN), as well as their ability to capture the underlying patterns of the data and produce high accuracy (XGBoost and MLP). The performance of each classification model is shown by comparing the classification results of the various algorithms. This study aims to obtain optimal performance through configuration variables known as hyperparameters. They differ from parameters, which are variables derived from the data. Machine learning hyperparameters are tuned manually before the training process and regulate the algorithm's structure and complexity. Grid-based and random-based searches are the most common approaches to determining optimal hyperparameters. A grid-based search systematically evaluates all possible hyperparameter settings, requiring high computational time. A random-based search evaluates a subset of samples drawn from all possible hyperparameter settings, which is efficient but not comprehensive.
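As a minimal sketch of the difference between these two strategies, the following example contrasts the number of candidate settings each approach evaluates. The parameter grid below is illustrative only, not the search space used in this study:

```python
import itertools
import random

# Illustrative hyperparameter grid (example values, not the study's).
grid = {
    "n_neighbors": [3, 4, 5, 10],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}

# Grid search: every combination is evaluated (exhaustive but slow).
grid_candidates = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]

# Random search: only a fixed-size sample of the combinations is evaluated.
rng = random.Random(0)
random_candidates = rng.sample(grid_candidates, k=5)

print(len(grid_candidates))    # 4 * 2 * 2 = 16 evaluations
print(len(random_candidates))  # 5 evaluations
```

With four, two, and two candidate values, grid search must evaluate all sixteen combinations, while random search evaluates only the requested five, trading comprehensiveness for speed.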
A Bayesian-based hyperparameter search applies a probabilistic approach over the space of possible hyperparameter settings, showing consistency and faster convergence toward the optimal settings. Therefore, a Bayesian-based search is applied in this study to find the optimal hyperparameters for each model.

MATERIAL AND METHODOLOGY

Data
This study uses observational data from the Sloan Digital Sky Survey 18th release (SDSS-DR18) and the Sloan Extension for Galactic Understanding and Exploration (SEGUE-IV). The SDSS dataset contains magnitude and redshift data of celestial objects with target classes of stars, Quasi-Stellar Objects (QSOs), and galaxies. This study obtained 10,000 data samples from the SDSS-DR18 source, divided into three classes: 4795 galaxy samples, 4089 star samples, and 1116 QSO samples. Each sample has eighteen features: objid, ra, dec, u, g, r, i, z, run, rerun, camcol, field, specobjid, redshift, plate, mjd, and fiberid, with class as the classification target. The SEGUE-IV dataset, included in SDSS-V, contains the equivalent widths of several stellar spectral lines. This study obtained 4148 data samples from the SDSS-V Stellar Parameter Pipeline (sppLine). Each sample in the SEGUE dataset has 78 parameters, with the teffadopt feature as the classification target; teffadopt is the average stellar effective temperature calculated in various ways. As shown in Table 1, the teffadopt value of each sample is mapped to a spectral class that serves as the target of the classification model.

Table 1. Spectral classes of teffadopt
Spectral Class | teffadopt (K)
O | 28,000 - 50,000
B | 10,000 - 28,000
A | 7,500 - 10,000
F | 6,000 - 7,500
G | 4,900 - 6,000
K | 3,500 - 4,900
M | 2,000 - 3,500
L | < 2,000

Preprocessing
This study removed the features unrelated to the classification process, such as objid, run, rerun, camcol, and field, from the first dataset (SDSS-DR18). The features u, g, r, i, z (the better of the DeV/Exp magnitude fits) are Thuan-Gunn astronomical magnitudes representing the response of the five-band telescope. This study finds a high correlation among these features, so they were reduced with the Principal Component Analysis (PCA) method into three new features, called PCA_1, PCA_2, and PCA_3, to accelerate the convergence of the classification process, as done in previous studies. Feature reduction was not performed on the second dataset (SEGUE-IV) because the correlation between its features was low; all features of the second dataset were therefore retained so the models could learn more information. This study uses the minimum-maximum scaling method (MinMaxScaler), with 0 as the minimum limit and 1 as the maximum limit, to reduce the impact of the different value ranges in the two datasets. Samples whose missing values exceed 20% of the total data are removed from the dataset; otherwise, the median of the valid values is used to impute the missing entries.

Model
This research uses several classification models: k-Nearest Neighbor (KNN), Gaussian Naive Bayes, eXtreme Gradient Boosting (XGBoost), Random Forest, Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP).

Gaussian Naive Bayes
Naive Bayes is a classification algorithm based on Bayesian concepts and is usually more appropriate for categorical datasets. This study uses the Gaussian (normal) distribution function to improve the performance of the Naive Bayes method in handling continuous data, following equations (1)-(3):

$\mu_c = \frac{1}{n_c} \sum_{j:\, y_j = c} x_j$  (1)

$\sigma_c^2 = \frac{1}{n_c} \sum_{j:\, y_j = c} (x_j - \mu_c)^2$  (2)

$P(x_i \mid y_c) = \frac{1}{\sqrt{2\pi \sigma_c^2}} \exp\!\left(-\frac{(x_i - \mu_c)^2}{2\sigma_c^2}\right)$  (3)

Equation (3) gives the probability of an attribute/feature value in a particular class, where $y_c \in \{y_1, y_2, \ldots, y_l\}$ denotes the c-th known label or class and $x_i \in \{x_1, x_2, \ldots, x_m\}$ denotes the i-th attribute or feature. In equations (1) and (2), $\mu_c$ and $\sigma_c^2$ denote the mean and variance of the $n_c$ training values $x_j$ belonging to class c. Here m and n are the numbers of attributes/features and data samples, respectively, and l denotes the number of labels or classes.

k-Nearest Neighbors (KNN)
KNN is a simple non-parametric classification algorithm that categorizes a new sample vector $y \in \mathbb{R}^m$ by the majority class among the k nearest training samples $x_n$, $n = 1, \ldots, N$, where N is the number of data samples:

$d(y, x_n) = \lVert y - x_n \rVert_2$  (4)

$\hat{c} = \arg\max_c \sum_{n \in N_k(y)} \delta(c = c_n)$  (5)

In equation (4), $d(y, x_n)$ denotes the distance function between the samples; the $\ell_2$ norm is a form of Euclidean distance that measures how close the new sample is to each training sample. In equation (5), $\hat{c}$ denotes the predicted class label of sample y, $\arg\max$ selects the class with the highest count over the set $N_k(y)$ of the k nearest neighbors, and $\delta(c = c_n)$ denotes the Dirac delta (indicator) function, equal to 1 if the condition is met and 0 otherwise.

Support Vector Machine
SVM is a classification algorithm that uses the concept of hyperplanes separating classes in feature space. SVM is effective for handling high-dimensional problems with limited training data:

$f(x) = w^T x + b$  (6)

In equation (6), $f(x)$ denotes the decision function that measures the signed distance from a data point to the hyperplane boundary. $f(x)$ is positive if the data point is classified into the positive class ($w^T x + b \geq 0$) and negative if it is classified into the negative class ($w^T x + b < 0$). $w^T$ denotes the transpose of the weight vector perpendicular to the hyperplane, which determines the hyperplane's orientation, and b denotes the intercept that shifts the hyperplane boundary.

Random Forest and eXtreme Gradient Boosting (XGBoost)
Random Forest and XGBoost are ensemble classification algorithms that use the concept of decision trees to make predictions. In the Random Forest model, each decision tree is developed independently, and the final prediction is usually made by majority voting over the individual tree predictions. In the XGBoost model, the decision trees are built sequentially with a gradient-descent-based optimization process:

$f_i(x) = \sum_{k=1}^{K} w_{k,i}\, I(x \in R_{k,i})$  (7)

In equation (7), $f_i(x)$ denotes the predicted probability or class label of the i-th tree for data point x. K is the number of leaves in the decision tree, $w_{k,i}$ denotes the weight of leaf k of tree i, and $I(x \in R_{k,i})$ denotes an indicator function that takes the value one if the input x belongs to leaf k of tree i and zero otherwise.

Multi-Layer Perceptron (MLP)
MLP is a classification algorithm that is usually trained with the back-propagation method combined with gradient descent to adjust the weights and biases that minimize the loss function:

$h_k = \varphi\!\left(\sum_i W_{i,k}\, x_i + b_k\right)$  (8)

In equation (8), $h_k$ denotes the k-th node of a hidden layer connecting the input to the output layer, $x_i$ denotes the i-th component of the input vector, $W_{i,k}$ denotes the weight from the i-th input to the k-th hidden node, $b_k$ denotes the bias, and $\varphi$ denotes the activation function.
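The Gaussian class-conditional likelihood described above can be sketched in a few lines of numpy. The data points here are illustrative toys, not the SDSS features, and the prediction rule combines equations (1)-(3) with a uniform-prior argmax:

```python
import numpy as np

# Toy training set: two features, two classes (illustrative values only).
X = np.array([[1.0, 2.0], [1.2, 1.9], [3.0, 4.1], [2.9, 4.0]])
y = np.array([0, 0, 1, 1])

def gaussian_nb_predict(X_train, y_train, x_new):
    classes = np.unique(y_train)
    log_posteriors = []
    for c in classes:
        Xc = X_train[y_train == c]
        mu = Xc.mean(axis=0)                  # per-class feature means, eq. (1)
        var = Xc.var(axis=0) + 1e-9           # per-class variances, eq. (2)
        prior = len(Xc) / len(X_train)
        # log of the independent Gaussian likelihood of eq. (3), summed over features
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x_new - mu) ** 2 / var)
        log_posteriors.append(np.log(prior) + log_lik)
    return int(classes[int(np.argmax(log_posteriors))])

print(gaussian_nb_predict(X, y, np.array([1.1, 2.0])))  # -> 0
print(gaussian_nb_predict(X, y, np.array([3.1, 4.0])))  # -> 1
```

Working in log space avoids numerical underflow when many per-feature likelihoods are multiplied together, which is the standard implementation trick behind libraries' Gaussian Naive Bayes.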
Evaluation methods and metrics
This study uses cross-validation to evaluate model performance. Cross-validation provides a better approach to distributing training and testing data and is fair in distributing the data evenly for the evaluation process. The accuracy score is used as the metric for the cross-testing of each classification model; it is calculated as the ratio of correct classification results to the total data. Ten cross-tests were used to evaluate each classification procedure, producing an average accuracy across all tests.

RESULTS AND DISCUSSION
This study conducts the modeling experiments and compares the classification results of the optimized models on a personal computer with an AMD Ryzen 5 5500U CPU (2.1 GHz), 16 GB of memory, and the Windows 11 operating system, using the Python programming language with the Jupyter Notebook environment. Several tools used to help build the models were obtained from TensorFlow.

Tables 2 and 3 compare the classification results for the sky object data consisting of stars, galaxies, and QSOs from the SDSS-DR18 observational dataset. Table 2 shows the results of each model using hyperparameter settings obtained by grid-based (KNN, Gaussian Naive Bayes, and SVM) and random-based (Random Forest, XGBoost, and MLP) searches.

Table 2. Classification of sky objects (stars, galaxies, and QSOs) with grid-based and random-based hyperparameter optimization.
Model | Avg. Accuracy (%) | Optimization time (s) | Training duration (s)
MLP
Random Forest
XGBoost
Gaussian Naive Bayes
SVM
KNN

Based on the results and computation times in Table 2, the MLP model shows the highest accuracy among the models for the classification of celestial objects (stars, galaxies, and QSOs), with an average cross-test score of 98. Meanwhile, the KNN model shows the lowest accuracy, with an average cross-test score of 91.54% for the same classification. Table 3 compares the classification results for the SDSS-DR18 dataset using hyperparameter settings obtained by a Bayesian search. Table 3 shows that a few model accuracies increased, although there is no significant difference compared with the results in Table 2.

Table 3. Classification of sky objects (stars, galaxies, and QSOs) with Bayesian-based hyperparameter optimization.
Model | Avg. Accuracy (%) | Optimization time (s) | Training duration (s)
XGBoost
MLP
Random Forest
Gaussian Naive Bayes
SVM
KNN

The XGBoost model shows the highest average score of 99.10%. Meanwhile, the KNN model still shows the lowest average score of 91. The optimal hyperparameter settings found by the Bayesian-based search for each model on the SDSS-DR18 dataset are shown in Table 4.

Table 4. Optimal hyperparameter settings for sky object classification (stars, galaxies, and QSOs).
Model | Hyperparameter settings
XGBoost | 'colsample_bytree': 0.7, 'gamma': 0.3, 'learning_rate': 0.05, 'max_depth': 4, 'min_child_weight': 3
MLP | 'solver': 'lbfgs', 'max_iter': 10000, 'learning_rate_init': 0. , 'hidden_layer_sizes': ( , ), 'alpha': 0. , 'activation': 'relu'
Random Forest | 'max_depth': 80, 'max_features': 3, 'min_samples_leaf': 3, 'min_samples_split': 8
Gaussian Naive Bayes | 'var_smoothing': 6.
SVM | 'C': 30, 'kernel': 'rbf'
KNN | 'metric': 'manhattan', 'n_neighbors': 4, 'weights': 'distance'

Tables 5 and 6 compare the classification results for the stellar (star) spectrum data (SEGUE-IV). Table 5 shows the results of each model using hyperparameter settings obtained by grid-based (KNN, Gaussian Naive Bayes, and SVM) and random-based (Random Forest, XGBoost, and MLP) searches.
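The Bayesian-based search that produced settings such as those in Table 4 can be illustrated with a toy surrogate-model loop. This sketch assumes scikit-learn's GaussianProcessRegressor as the surrogate and an illustrative one-dimensional objective standing in for cross-validated accuracy; it is not the study's actual tooling or search space:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(h):
    # Illustrative stand-in for cross-validated accuracy; peaks at h = 0.3.
    return 0.9 - (h - 0.3) ** 2

candidates = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
H = [[0.0], [1.0]]                     # initial hyperparameter evaluations
scores = [objective(h[0]) for h in H]

for _ in range(8):
    # Fit a Gaussian-process surrogate to the (hyperparameter, score) pairs
    # observed so far, then pick the next candidate by an upper-confidence
    # bound: explore where the surrogate is uncertain, exploit where it
    # predicts high scores.
    gp = GaussianProcessRegressor(alpha=1e-6).fit(H, scores)
    mean, std = gp.predict(candidates, return_std=True)
    h_next = float(candidates[int(np.argmax(mean + 1.96 * std))][0])
    H.append([h_next])
    scores.append(objective(h_next))

best_h = H[int(np.argmax(scores))][0]
print(round(float(best_h), 2))
```

Because each new evaluation updates the surrogate, the search concentrates its budget near promising regions instead of spreading it uniformly, which is why Bayesian search tends to converge to good settings in fewer evaluations than grid or random search.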
Table 5. Classification of the stellar (star) spectrum classes with grid-based and random-based hyperparameter optimization.
Model | Avg. Accuracy (%) | Optimization time (s) | Training duration (s)
XGBoost
SVM
Random Forest
MLP
KNN
Gaussian Naive Bayes

The Table 5 results show that the XGBoost model has the highest accuracy among the models for classifying the stellar (star) spectrum classes, with an average score of 94. Meanwhile, the Gaussian Naive Bayes model shows the lowest accuracy, with an average score of 88. Table 6 compares the classification results for the SEGUE-IV dataset using hyperparameter settings obtained by a Bayesian search.

After conducting hyperparameter optimization experiments using grid-based, random-based, and Bayesian approaches, MLP, SVM, and Gaussian Naive Bayes (GNB) showed minimal accuracy improvements due to inherent model limitations and data compatibility. GNB's assumption of feature independence made it fundamentally incompatible with the second dataset, as real-world data rarely meet this assumption. SVM performance is heavily dependent on the kernel; with the second dataset's 78 features, the kernel likely already operated near-optimally, leaving little room for improvement through tuning parameters like C or gamma. Similarly, MLP flexibility was constrained by the smaller size (4,148 samples) of the second dataset, limiting its ability to benefit from deeper architectural tuning.
Therefore, we conclude that the performance of these models was bottlenecked by their design and data compatibility, making hyperparameter optimization less impactful than for models like XGBoost and Random Forest, which are inherently more adaptable to the datasets' structures.

Table 6. Classification of the stellar (star) spectrum classes with Bayesian-based hyperparameter optimization.
Model | Avg. Accuracy (%) | Optimization time (s) | Training duration (s)
XGBoost
SVM
Random Forest
MLP
KNN
Gaussian Naive Bayes

Table 7. Optimal hyperparameter settings for the stellar (star) spectrum classes.
Model | Hyperparameter settings
XGBoost | 'colsample_bytree': 0.4, 'gamma': 0.4, 'learning_rate': 0.2, 'max_depth': 10, 'min_child_weight': 1
SVM | 'C': 10, 'kernel': 'linear'
Random Forest | 'max_depth': 80, 'max_features': 3, 'min_samples_leaf': 3, 'min_samples_split': 12
MLP | 'solver': 'adam', 'max_iter': 10000, 'learning_rate_init': 0. , 'hidden_layer_sizes': ( , ), 'alpha': 0. , 'activation': 'relu'
KNN | 'metric': 'manhattan', 'n_neighbors': 10, 'weights': 'distance'
Gaussian Naive Bayes | 'var_smoothing': 0.

SUMMARY
This study applied six supervised learning algorithms to classify two observational datasets obtained from the Sloan Digital Sky Survey 18th release (SDSS-DR18) and the Sloan Extension for Galactic Understanding and Exploration (SEGUE-IV). The SDSS-DR18 dataset contains 10,000 samples with features such as the magnitudes and redshifts of stellar objects and target classes consisting of stars, QSOs, and galaxies. The SEGUE-IV dataset contains 4148 samples with features such as equivalent-width parameters, line indices, and other radial velocity features of the corresponding stellar spectrum.
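For illustration, the binning of teffadopt into the spectral classes of Table 1 can be sketched as follows. The class letters are an assumption here, following the standard O-B-A-F-G-K-M-L temperature sequence from hottest to coolest:

```python
import numpy as np

# Temperature bin edges (K) from Table 1, coolest to hottest, and the
# assumed class letters for the intervals they delimit.
edges = [2000, 3500, 4900, 6000, 7500, 10000, 28000, 50000]
labels = ["L", "M", "K", "G", "F", "A", "B", "O"]  # coolest -> hottest

def spectral_class(teff):
    # np.digitize returns how many edges lie at or below teff,
    # i.e. the index of the interval the temperature falls into.
    i = int(np.digitize(teff, edges))
    return labels[min(i, len(labels) - 1)]  # clamp values above the top edge

print(spectral_class(5500))   # G (4,900 - 6,000 K)
print(spectral_class(9000))   # A (7,500 - 10,000 K)
print(spectral_class(30000))  # O (28,000 - 50,000 K)
```

Vectorizing this with np.digitize over the whole teffadopt column produces the categorical target used to train the classifiers on the SEGUE-IV data.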
This study used a Bayesian-based hyperparameter search to determine the optimal settings that maximize the performance of the classification models. The XGBoost algorithm provides the best performance compared with the other models on both datasets, with average accuracies of 99.10% and 95.11%, respectively. Despite the resource constraints of using a personal computer, the experiments achieved high accuracy on large-scale data, demonstrating the models' effectiveness for large-scale sky survey classification. While optimization of the XGBoost model took forty-two times longer, this is an acceptable trade-off because accuracy is often mission-critical in scientific applications. The optimal hyperparameter settings of the XGBoost model for the SDSS-DR18 dataset are a colsample_bytree of 0.7, gamma of 0.3, learning_rate of 0.05, max_depth of 4, and min_child_weight of 3. Meanwhile, the optimal settings of the XGBoost model for the SEGUE-IV dataset are a colsample_bytree of 0.4, gamma of 0.4, learning_rate of 0.2, max_depth of 10, and min_child_weight of 1. This study showed that some classification models improved their accuracy scores through Bayesian-based hyperparameter optimization. We intentionally avoided handling data imbalance to maintain the robustness and real-world applicability of the models. By keeping the datasets in their original imbalanced state, we ensured that the results reflect how the models would perform on primary data, which is critical for evaluating their suitability for real-world sky object classification tasks. Introducing imbalance-handling techniques (e.g., over-sampling, under-sampling, or class weighting) could artificially inflate accuracy or mask weaknesses, leading to biased conclusions about model performance.
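For reference, the class weighting mentioned above would look like the following sketch, computed from the SDSS-DR18 class counts reported earlier (4795 galaxies, 4089 stars, 1116 QSOs); this technique was deliberately not applied in this study:

```python
# "Balanced" class weights: n_total / (n_classes * n_c), so the minority
# class (QSO) would be up-weighted roughly in proportion to its scarcity.
counts = {"galaxy": 4795, "star": 4089, "qso": 1116}
n_total = sum(counts.values())        # 10,000 samples
n_classes = len(counts)

weights = {c: n_total / (n_classes * n) for c, n in counts.items()}
for c in sorted(weights):
    print(c, round(weights[c], 3))
# galaxy 0.695, qso 2.987, star 0.815
```

A QSO sample would count nearly three times as much as an average sample in a weighted loss, which is precisely the kind of adjustment that could mask a model's weakness on minority classes if applied uncritically.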
However, future work could explore these techniques to address potential skews in the class distributions, as imbalance handling might improve minority-class recall and overall generalizability, especially for models like GNB or KNN that struggle with imbalanced data. This could also provide a more comprehensive understanding of model behavior across different data scenarios. Further improvement can explore other ensemble classification techniques that may outperform the XGBoost algorithm.

REFERENCES