Boy Setiawan, et.
: Multi View Natural Network for A (April 2.
Multi View Natural Network for Cross-Project Software Defect Prediction
Boy Setiawan1 and Agus Subekti1
1Faculty of Computer Science, Nusa Mandiri University, Jakarta, Indonesia
Corresponding author: Boy Setiawan. E-mail: 14230023@nusamandiri.
ABSTRACT Software Defect Prediction (SDP) plays a critical role in software engineering by enabling early identification of potentially defective modules, to assist developers and testers in prioritizing testing and inspection efforts to improve software quality and reliability.
Driven by rapidly changing business requirements, defect prediction models have become increasingly essential in quality assurance workflows.
Traditional approaches to SDP focused on Within-Project Defect Prediction (WPDP), where models are trained on historical data from the same project and are effective only when sufficient data are available.
This limitation motivates the adoption of Cross-Project Defect Prediction (CPDP), which leverages data from different projects.
However, CPDP faces notable challenges, including distributional differences between datasets and class imbalance, which can degrade prediction performance and introduce bias.
To address these issues, recent studies have proposed data transformation, resampling, and domain adaptation techniques.
In this study, we explore a multi-view learning approach using Neural Networks (NN) to enhance generalization and performance in CPDP scenarios.
By leveraging multiple views of the same dataset, generated through concatenation of heterogeneous software metrics, imputation of missing values, normalization using the Box-Cox transformation, and embedding-based feature transformation, we aim to construct a robust Multi-View Neural Network (MVNN).
This architecture enables the integration of diverse information while mitigating the limitations of single-view learning in CPDP.
Our method preserves more information than conventional approaches that rely only on shared features.
Experimental validation using benchmark SDP repositories demonstrates the competitiveness of our approach, offering improved performance over existing CPDP models and highlighting the potential of multi-view learning in defect prediction tasks.
KEYWORDS Cross-Project Defect Prediction, Multi-View Learning, Software Defect Prediction
I.
INTRODUCTION
Software defect prediction (SDP) is a complex and critical task in the field of software engineering, focused on identifying potential defects in software systems at the early stages of development.
Its primary objective is to assist developers and testers in allocating their efforts and resources more efficiently by targeting the components of the codebase that are most likely to contain faults.
At a time when software is crucial to and influences much of daily life, accurate SDP helps to minimize the time and effort necessary for testing software products by automating the detection of defect-prone areas early in the software development life cycle (SDLC).
Defects in software include errors, flaws, mistakes, faults, or bugs, which may stem from a lack of skill or experience, misconceptions about requirements, an uncontrolled development phase, and similar causes.
As software projects continue to grow in size and complexity to accommodate rapidly evolving business requirements, the demand for accelerated development has intensified in recent years.
VOLUME 07, No 01, 2025 DOI: 10.52985/insyst.
Consequently, quality assurance practices, particularly fault prediction models, have become increasingly vital.
These models are primarily designed to enable the efficient allocation and prioritization of quality assurance activities, such as testing and code inspection.
The main challenge with CPDP is the difference in distribution between datasets from various software projects, which in most cases cannot satisfy the assumption of similar distributions.
As a result, studies have been conducted to meet this challenge, starting from the CPDP study by Zimmermann et al. on 12 real-world applications. CPDP remains an active area of SDP research, where most methods are based on transferring knowledge across different but related domains.
Another major challenge in CPDP is class imbalance, caused by the small number of defective modules in a software project.
This condition degrades performance by favouring the majority class and introducing bias.
As a result, strategies to address class imbalance are also a common area of study in CPDP, ranging from oversampling and undersampling to generative approaches.
When data are insufficient and acquiring data from other sources is non-trivial, especially when confidentiality is a concern in software projects, maximizing the use of existing data becomes crucial.
Standard machine learning (ML) techniques used in SDP usually consume a single input view; however, SDP can also be solved using multiple views, that is, multiple feature vectors.
Multi-view learning, also known as data fusion or data integration, is an emerging field in ML that learns from multiple views to improve generalization.
Its aim is to learn a model for each view and to jointly optimize all the models to improve generalization performance.
A notable advantage of multi-view learning is that performance can be improved even when the multiple views are generated manually from the same data.
Although many multi-view ML methods, such as sparse multi-view time Support Vector Machine (SVM) and multi-view discriminant analysis (DA), have proven capable on classification problems, we propose an NN approach because multi-view learning is widely applied in NN.
Based on the problems stated above, we focus our study on using multiple views with NN: we select sample datasets with different software metrics from SDP repositories and construct a high-quality MVNN.
To maximize the use of existing datasets, we generate a different view from the same datasets to increase the performance and generalization of the proposed MVNN.
Whereas many CPDP methods handle differing software metrics by using only the features common to all datasets, we opted to preserve as much information as possible by concatenating the datasets' software metrics and using imputation to handle the resulting missing values.
To address differences in dataset distribution, a Box-Cox transformation is used to normalize the concatenated datasets so that they resemble a normal distribution.
To generate another view of the datasets, we use an expand-and-reduce method: a tree-based embedding first produces a high-dimensional vector, and a dimensionality reduction algorithm then shrinks it to a manageable size, yielding a new view of the dataset in a different latent space.
Finally, we validate our findings through an empirical comparison with previous studies to show the competitiveness of our proposed method.
The key contributions of this work are as follows:
1. This study proposes a novel CPDP model based on MVNN to construct a prediction model that enhances the contribution of SDP to software engineering.
2. We introduce a novel dataset pre-processing step for CPDP to overcome the challenge of differing distributions across datasets.
3. We propose a novel way to generate a different view of the datasets by expanding and then reducing their dimensionality, producing the same datasets in a different latent space.
4. To verify the performance of the proposed method, we conducted experiments on various SDP dataset repositories and compared the results with existing CPDP methods.
This paper is structured as follows.
Section 1 introduces the problem domain.
Section 2 provides an overview of the theoretical framework relevant to CPDP.
Section 3 discusses previous studies.
Section 4 presents our research methodology and experimental setup. The experimental results and discussion are presented in Section 5, along with the threats to the internal, external, and construct validity of our study, and conclusions are covered in Section 6.
II.
THEORETICAL FRAMEWORK
In this section, we briefly introduce the theoretical framework that underlies our proposed MVNN method, along with related work on CPDP.
MULTI-VIEW NEURAL NETWORK
With the increasing volume and variety of data in recent years, interest in multi-modal and heterogeneous representations has grown as a way to enhance learning. MVNN have emerged as a suitable approach for fusing multiple representations of data in a unified predictive model.
MVNN refers to neural network architectures that integrate multiple feature representations of the same data instance to improve learning performance by leveraging both redundant and complementary information across all modalities.
One of the fundamental challenges is representing and summarizing multimodal data in a way that exploits the complementarity and redundancy of the multiple modalities in the dataset.
The simplest way to handle multiple modalities is to concatenate the individual modality features, known as early fusion, which integrates features immediately after they are extracted. This results in a joint, or unimodal, data representation, a setting in which NN excels and has become a popular method of choice.
Although there are ML-based multi-view algorithms for classification problems, such as kernel SVM, NN has demonstrated outstanding performance with MVNN in a variety of tasks such as face recognition, object detection, and classification.
This superior performance, combined with the ability to pre-train the representations in an unsupervised manner, has increased the popularity of NN-based joint representations. However, performance depends strongly on the amount of data available during training.
Despite these advantages, one disadvantage of NN is its inability to handle missing data, although there are ways to alleviate this issue.
MIN-MAX SCALER
The Min-Max scaler adjusts the scale of an attribute by shifting and rescaling its values along the x-axis, ensuring that the transformed attribute's values fall within the interval [0, 1], according to:
\[ x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]
In this equation, the scaling factor is determined by the attribute's range, while the translational term is set to its minimum value.
This approach guarantees that the attribute's values are transformed to a minimum of zero and a maximum of one, a range well suited to NN input.
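As a small illustration of the min-max formula, the following sketch rescales one attribute in plain Python. The function name `min_max_scale`, the sample values, and the handling of a constant attribute are assumptions made for this example and are not taken from the study.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant attribute: map every value to 0.0 (an assumption of this sketch)
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical lines-of-code values for four modules
loc = [10, 20, 40, 110]
scaled = min_max_scale(loc)
print(scaled)
```

The smallest value maps to 0 and the largest to 1, with the remaining values placed proportionally in between.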
STANDARD SCALER
Standard scaling plays a crucial role in the SDP domain by addressing the varying feature scales commonly found in datasets, which result from the different metrics used across software projects. Metrics such as code complexity and lines of code often have diverse distributions and scales, which, if not properly normalized, can negatively impact the effectiveness of learning algorithms.
As emphasized in prior work, optimizing ML performance requires appropriate preprocessing steps, such as standard scaling, to ensure stable training and improve model accuracy.
The standard scaler adjusts the scale of an attribute by centering the data around zero (the mean μ becomes 0) and scaling it to have a standard deviation (σ) of 1, ensuring that features are on the same scale, according to:
\[ x_{\text{scaled}} = \frac{x - \mu}{\sigma} \]
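The standardization step can be sketched in a few lines of Python using only the standard library. The function name `standard_scale` and the sample values are hypothetical. The population standard deviation is used here as an assumption, since the text does not say which estimator the study used.

```python
import statistics


def standard_scale(values):
    """Center values on their mean and divide by the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # Constant attribute: nothing to scale (an assumption of this sketch)
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical cyclomatic complexity values for five modules
complexity = [1, 2, 3, 4, 10]
z = standard_scale(complexity)
print([round(v, 3) for v in z])
```

After scaling, the values have a mean of 0 and a standard deviation of 1, regardless of the original unit of the metric.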
BOX-COX TRANSFORMATION
The Box-Cox transformation for positive responses is a function of the parameter λ, where the aim is to make the transformed data resemble a normal (Gaussian) distribution; after standardization, the result has a mean of 0 and a standard deviation of 1.
Since many ML models assume or benefit from approximately Gaussian inputs, this transformation helps correct skewed distributions in the dataset.
The transformed response is given by:
\[ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases} \]
For λ = 1, the data are only shifted and their shape is unchanged.
When λ = 1/2, a square root transformation is applied.
For λ = 0, a logarithmic transformation is applied.
For λ = -1, a reciprocal transformation is applied. Defining the λ = 0 case as the logarithm keeps the transformation continuous in λ.
One of the main problems in SDP is that each dataset has a different distribution or feature space, which hinders optimal performance across projects in CPDP.
Applying the Box-Cox transformation to a skewed dataset helps to normalize its distribution and improve prediction results.
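To make the piecewise definition concrete, the sketch below evaluates the Box-Cox formula for a fixed λ on a few positive values. The function name `box_cox` and the sample data are illustrative assumptions. In practice λ is usually estimated from the data (for example by maximum likelihood) rather than fixed by hand, and metrics containing zeros need a positive shift first; neither step is shown here.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of one strictly positive value for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        return math.log(y)          # limiting case of the general formula
    return (y ** lam - 1.0) / lam   # general case, lambda != 0


# Hypothetical right-skewed metric values
skewed = [1.0, 2.0, 4.0, 8.0, 64.0]
for lam in (1, 0.5, 0, -1):
    print(lam, [round(box_cox(y, lam), 3) for y in skewed])
```

Smaller values of λ compress the long right tail more strongly, which is why the transformation helps with the skewed metric distributions described above.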
ISOMETRIC FEATURE MAPPING
Isometric Feature Mapping (Isomap) is a widely used non-linear dimensionality reduction technique for overcoming high dimensionality in a dataset, in contrast to Principal Component Analysis (PCA), which excels on linear data. A distinctive feature of Isomap is its versatility, demonstrated across applications ranging from image processing and fault prediction in electromechanical systems to anomaly detection in hyperspectral imagery.
Isomap was introduced in 2000 by Tenenbaum et al. as an improvement on multidimensional scaling (MDS) that uses geodesic distances in place of Euclidean distances, which allows it to capture the true manifold structure of the dataset.
Despite these advantages, Isomap performs suboptimally when processing data that encompass multiple clusters or manifold structures. This drawback has spurred the development of extensions of the original Isomap, such as FastIsomap and Landmark Isomap, aimed at enhancing computational efficiency and handling more complex datasets effectively.
SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE
Synthetic Minority Over-Sampling Technique (SMOTE) is a commonly used method for addressing class imbalance by generating synthetic data for the minority class. Unlike Random Over-Sampling (ROS), which simply duplicates minority instances, SMOTE synthesizes new data points by interpolating between existing minority-class instances until the classes are balanced.
An important goal of SDP is to identify defective modules, yet a high proportion of SDP datasets suffer from significant class imbalance: non-defective instances form the majority class, which often leads to predictions biased towards that class.
By applying SMOTE, the representation of the defective class is bolstered, helping the model to better differentiate between defective and non-defective instances and improving its predictions.
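The interpolation idea behind SMOTE can be illustrated with a minimal, simplified sketch. This is not the reference algorithm or the implementation used in the study: the function name `smote`, the neighbour count `k`, the seed, and the toy data are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows, each lying on the segment between a
    minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base row among the other minority rows
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda r: math.dist(base, r))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Hypothetical data: 3 defective modules versus 8 non-defective ones
defective = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]]
non_defective_count = 8
new_rows = smote(defective, n_new=non_defective_count - len(defective))
balanced_minority = defective + new_rows
print(len(balanced_minority))
```

Because every synthetic row lies between two real minority rows, the new points stay inside the region already occupied by the defective class, unlike random duplication, which only repeats existing points.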
K-NEAREST NEIGHBOUR IMPUTER
Missing values are a common and well-known problem in real-world datasets.
They negatively affect the performance of ML models, disrupt the training process, and introduce errors into the model.
Missing values arise for various reasons and from various sources, but handling them is crucial to avoid bias and poor final performance.
A well-known technique for overcoming missing values is imputation, which estimates missing values from the available data and replaces them, producing a complete view of the dataset.
The effectiveness of imputation strategies is influenced by the underlying missing data mechanism, which is categorized into three types:
1. Missing Completely at Random (MCAR): the missingness is unrelated to both observed and unobserved data, which is the ideal condition for imputation.
2. Missing at Random (MAR): the missingness depends only on observed data; most imputation methods assume this condition.
3. Missing Not at Random (MNAR): the missingness is related to unobserved data, where the missing value itself or an unknown, unmeasured factor influences the likelihood of missingness.
Advanced imputation techniques such as K-Nearest Neighbours (KNN) do not rely on simple statistics such as the mean, median, or mode, which may not capture the underlying data structure, but instead use the relationships between variables to predict the missing values. Its simplicity and proven effectiveness in many imputation problems have made KNN a popular choice for handling missing values.
KNN finds the K nearest neighbours of an instance and imputes each missing value with a weighted average of the values observed in those neighbours, where K is set before fitting the imputer. The main difficulty in using KNN is finding the K that produces the best imputed values, which can be computationally expensive if done by trial and error.
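A simplified version of KNN imputation is sketched below in plain Python. It is an illustration under stated assumptions, not the imputer used in the study: the function name `knn_impute`, the uniform (unweighted) averaging, the distance computed only over features observed in both rows, and the toy data are all choices made for this example.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.
    Distance is computed only over the features observed in both rows."""
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    completed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that have feature j observed
            donors = [r for n, r in enumerate(rows) if n != i and r[j] is not None]
            donors.sort(key=lambda r: distance(row, r))
            nearest = donors[:k]
            if nearest:  # leave the gap untouched if no donor exists
                completed[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return completed


# Hypothetical metric table; the second row lacks its third metric
data = [
    [1.0, 10.0, 100.0],
    [1.1, 11.0, None],
    [1.2, 12.0, 120.0],
    [9.0, 90.0, 900.0],
]
filled = knn_impute(data, k=2)
print(filled[1])
```

The missing entry is filled from the two rows that resemble the incomplete row most closely, so the distant fourth row does not distort the estimate the way a column mean would.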
RANDOM TREE EMBEDDING
Random Tree Embedding (RTE) is a fundamental concept in computer science and mathematics, particularly in the representation and analysis of tree structures within different graph frameworks.
This concept plays a crucial role in various applications, such as network design, data structure optimization, and algorithm development.
The central idea behind RTE is to map tree-like structures into geometric spaces or graphs in a manner that maintains key properties, including the distance relationships between nodes .
RTE transforms a dataset into a sparse, high-dimensional representation by mapping each data point to a binary vector that indicates which leaf node it falls into within each tree. This expands the dataset efficiently and can uncover hidden, complex patterns that ML models can exploit to improve overall prediction performance.
SOFTWARE DEFECT DATASETS
SDP is one of the most active research fields in software engineering and plays an important role in software quality assurance by helping to avoid incorrect results or unexpected behaviour in the software being developed.
Among the various datasets available, the NASA Metrics Data Program (MDP) stands out and has garnered significant attention, becoming a foundational resource for SDP studies. Other notable datasets include SOFTLAB, AeM, ReLink, and the Predictive Models in Software Engineering (PROMISE) repository of various open-source projects.
The NASA MDP dataset consists of data from various NASA software projects, recording historical software metrics and the corresponding defect data, which makes it a benchmark for empirical evaluations of defect prediction methodologies.
However, a prior study emphasizes data quality issues in the MDP dataset, ranging from inconsistencies between versions and implausible values to missing data and insufficient documentation of data pre-processing.
These concerns are significant because the MDP datasets are extensively used in SDP research, and they affect the comparability and reliability of the studies.
In this study, the .
release of the NASA corpus as presented in Table I was utilized, which includes comprehensive datasets tailored for SDP tasks.
The NASA MDP consists of software metrics derived from static code analysis, a method that examines software without executing it to ensure the detection of potential defects early in the development cycle.
This method often measures software metrics such as cyclomatic complexity, lines of code, coupling, cohesion, and others that are critical indicators of software quality and maintainability .
These metrics are crucial in developing reliable defect prediction models and include recognized metrics such as McCabe's Cyclomatic Complexity and Halstead's software metrics, which play critical roles in assessing software quality and predicting potential defects.
One commonly analyzed metric is McCabe's Cyclomatic Complexity, which quantifies the control flow complexity of a program by measuring the number of linearly independent paths through the code.
This metric helps determine the potential difficulty in understanding and maintaining the code, which correlates with defect rates .
Halstead's metrics extend this assessment by considering the complexity of the program through measures such as the number of operators and operands, which provide insight into the cognitive load required for software comprehension .
The variety of the datasets, spanning multiple software projects, metrics, and defect patterns, ensures that the proposed model is evaluated for real-world generalization and robustness.
Another SDP dataset used in this study is SOFTLAB, a prominent resource designed to assist researchers and practitioners in identifying defect-prone components within software systems. It contains data from five distinct projects developed by a Turkish software company that specializes in embedded controllers for home appliances.
These projects (AR1, AR3, AR4, AR5, and AR6) record a consistent set of 29 software metrics, facilitating a comprehensive analysis aimed at predicting defects within software systems.
This offers researchers an opportunity to explore defect prediction methodologies that can enhance the accuracy and reliability of SDP models across different contexts.
The significance of the ReLink dataset for SDP is underscored by its use in studies that apply advanced machine learning techniques such as federated learning and prototype learning. Recent research has tested federated prototype learning on projects within the ReLink dataset, aiming to improve defect prediction while maintaining data privacy through decentralized model training.
Moreover, the ReLink dataset is instrumental for comparative studies in defect prediction.
Its varied project metrics allow for in-depth performance evaluations against other prominent datasets such as AeM and NASA MDP.
The PROMISE repository is a well-established and publicly accessible dataset collection that plays a vital role in software engineering research, particularly in software defect prediction, where it hosts SDP datasets from various Java-based open-source projects such as Apache Ant, Camel, Ivy, JEdit, Log4j, Lucene, Poi, Synapse, and Xerces.
The datasets were assembled with the help of tools like BugInfo and CKJM. Each record in the dataset represents a software module (commonly a class file), described by a set of software quality metrics such as the CK metrics, Halstead metrics, McCabe cyclomatic complexity, and LOC-based metrics.
Besides being used for WPDP and CPDP, the datasets provide different versions of the same projects, which enables their extensive use in Cross-Version SDP (CVDP).
The last SDP dataset used in this study is the Apache Eclipse Evolution Metrics (AeM) dataset, which is specifically designed to enhance the understanding and methodologies related to software quality.
It originates from various projects at the Apache Software Foundation and the Eclipse Foundation and provides detailed software metrics derived from real-world projects, comprising 61 distinct metrics that are crucial for defect prediction analysis.
These metrics encompass object-oriented measures, prior defect metrics, and code change metrics, collectively facilitating a comprehensive examination of defect-prone software components.
This makes AeM a pivotal dataset for CPDP, where models are trained on one project's dataset and validated on another.
A brief description of all the datasets used in this study is shown in Table I.
TABLE I
SDP DATASETS
Columns: Dataset, Instances, Features, Defective Instances, Non-Defective Instances.
Rows: PROMISE, NASA MDP, SOFTLAB, ReLink, AeM.
CROSS-PROJECT DEFECT SOFTWARE METRICS
While WPDP uses the same project's dataset for both training and testing, Jing et al. differentiate CPDP by the similarity and number of the metrics used across the datasets of various projects.
CPDP methods are generally based on the assumption that the source and target data share the same software metrics.
When no common metrics exist between source and target projects, existing CPDP methods generally cannot be used for defect prediction.
In this study, we opted to concatenate the software metrics rather than reduce them, in order to keep as much information as possible, and to apply imputation to any resulting missing values.
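The concatenation of metrics described above can be pictured as stacking the rows of several projects over the union of their metric names. The sketch below is one possible way to do this in plain Python; the function name `concatenate_metrics`, the project names, and the metric names are hypothetical and not taken from the study.

```python
def concatenate_metrics(datasets):
    """Stack rows from several projects over the union of their metric names.
    Metrics a project does not record are left as None for later imputation."""
    columns = []
    for rows in datasets.values():
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    table = [
        [row.get(name) for name in columns]
        for rows in datasets.values()
        for row in rows
    ]
    return columns, table


# Two hypothetical projects that share only the "loc" metric
projects = {
    "project_a": [{"loc": 120, "cyclomatic": 7}],
    "project_b": [{"loc": 45, "wmc": 3}],
}
columns, table = concatenate_metrics(projects)
print(columns)
print(table)
```

Keeping only the shared metric would discard two of the three columns here, whereas the union keeps all of them at the cost of gaps that an imputer must fill afterwards.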
STRATIFIED CROSS-VALIDATION
Stratified cross-validation (CV) is a prevalent and essential technique when class imbalance exists in the datasets.
By maintaining the relative distribution of the classes, stratification ensures that each fold preserves the class distribution during training and testing. Compared with plain K-fold CV, this benefits underrepresented classes and improves the generalizability of the models and their performance on metrics such as precision, recall, and F1 score.
As shown in Table 1 and 2, class imbalance is a major challenge in SDP datasets, where the minority class (defective equals Yes) is only a small fraction of the overall instances and is the primary target class to predict. Training a classifier on imbalanced datasets will most likely produce a classifier that over-predicts the majority class and has a lower probability of predicting the minority class, that is, the faulty modules.
When the model does predict the minority class, it often has a higher error rate than for predictions of the majority class.
Although there are various methods to overcome the disadvantages of training a model on an imbalanced dataset, such as oversampling and undersampling with either random or synthetic data, very few studies have focused on them in the area of SDP.
The most notable study concludes that sampling techniques improved the prediction performance of linear and logistic models, whilst NN and tree-based classifiers did not perform better when sampling techniques were applied.
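How stratification preserves the class ratio in every fold can be shown with a short sketch. This is a simplified illustration, not the routine used in the study: the function name `stratified_folds`, the seed, and the 20/80 toy label distribution are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=42):
    """Return n_folds lists of row indices; each fold receives a
    proportional share of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of one class round-robin across the folds
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical imbalanced labels: 20% defective, 80% clean
labels = ["defective"] * 20 + ["clean"] * 80
folds = stratified_folds(labels, n_folds=10)

# One CV iteration: hold out the first fold, train on the other nine
held_out = folds[0]
train = [i for fold in folds[1:] for i in fold]
print(len(train), len(held_out))
```

Every fold ends up with the same 20/80 class ratio as the full dataset, so no test fold is left without defective examples, which can happen with plain K-fold splitting on a small minority class.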
III.
PREVIOUS RESEARCH
Over the years, a variety of approaches have been proposed and applied within SDP to assist practitioners in optimizing the use of limited testing resources by focusing on modules that are more likely to be defective.
Initial research efforts primarily emphasized WPDP, where models are trained using historical data from the same project, commonly organized in the form of datasets, and subsequently employed to predict defects in upcoming releases.
Findings from early studies suggest that when sufficient training data from the same project is available, the resulting prediction models tend to perform well within that context.
As a result, researchers naturally consider using sample data from well-known software projects to train a model and apply it to predict defects in other software projects. This is the design principle of the CPDP model, which is especially useful when data are insufficient or non-existent for building quality defect predictors.
A number of studies have been conducted over time to improve the efficacy of SDP, which can spot defects early in the SDLC.
Researchers have experimented with techniques such as ML, data mining, and statistical analysis, with the majority of research focused on WPDP.
Recently, greater attention has been directed towards CPDP, which leverages training data from other projects and uses diverse methods to construct the best CPDP model.
In the area of ML, one study used active learning to choose representative unlabelled modules from the target project, combined with TrAdaBoost to weight the source and target projects, and applied the weights to an SVM.
Another study enhanced the effectiveness of cross-project defect prediction by employing a kernel twin support vector machine (DA-KTSVM) to learn the domain adaptation model.
Jin et al. attempted to maximize the similarity between the feature distributions of the source and target projects.
Multi-source CPDP (MSCPDP) has also been studied, tackling the challenge of using multiple source datasets to build a high-performance model.
Liu et al. proposed a two-phase transfer learning model (TPTL) that builds two defect predictors independently on two selected projects using TCA and combines their prediction probabilities to improve performance.
A novel hybrid NN approach called SMOTE Correlation and Attention Gated recurrent unit based Long Short-Term Memory optimization (SCAG-LSTM) extends SMOTE with edited nearest neighbours (ENN) to rebalance class distributions and mitigate the issues caused by noisy and irrelevant instances in both source and target domains.
Another approach to handling class imbalance and differing data distributions, two-phase feature importance amplification (TFIA), yielded significant improvement for CPDP.
Abdu et al. presented GB-CPDP, a graph-based feature learning model for CPDP that uses LSTM networks to develop predictive models and Node2Vec to convert CFGs and DDGs into numerical vectors.
Another approach to tackling class imbalance in transfer learning, called Weighted Balanced Distribution Adaptation (W-BDA), not only considers the distribution adaptation between domains but also adaptively changes the weight of each class.
A subsequent study, also referred to as WBDA, improved on the earlier balanced distribution adaptation work by addressing the performance degradation caused by increasing data or variance in the data sampling.
IV.
RESEARCH METHOD
This section details the experimental steps undertaken in this study to evaluate the proposed SDP methods.
Figure 1 provides a schematic overview of the experimental framework used to validate the effectiveness of the suggested approaches.
The framework was designed to ensure an empirical evaluation of the models, leveraging software metric datasets sourced from the NASA MDP, SOFTLAB, ReLink, PROMISE, and AeM repositories. Arbitrary project datasets are chosen from each repository for training, while the rest are used as the testing datasets.
Figure 1. Experimental Framework
The pre-processing steps start with feature selection and metrics matching. In this study, we selected all the features from the training datasets and used them as the final features for the model.
As the number of features differs between datasets, null values are introduced in the new dataset; these are filled by a KNN imputer with n_neighbors = 2.
Unlike statistical imputation, which uses the mean, mode, or a similar statistic of the target feature, KNN imputation calculates the value as the mean over nearby instances and tends to produce more realistic imputed values.
A study of the major challenges of SDP datasets and their impact on low prediction performance concludes that one of the causes is the differing data distribution of each dataset, especially for CPDP and Heterogeneous SDP (HDP).
To overcome the differing distributions of the datasets, we use Box-Cox normalization to transform each dataset's distribution to resemble a Gaussian distribution, as shown in Figure 2 for dataset kc1.
Prior to normalization, the feature distributions of kc1 vary, showing multiple differently skewed shapes.
The impact of the Box-Cox transformation is clearly seen in the bell shape of the feature distribution graphs, most of which are centred near 0 with a similar spread.
Figure 2. Box-Cox Transformation on kc1
NN is able to extract features and complex relationships from a dataset and has proven effective with or without FE across various domains, as detailed in prior work comparing ML and NN. Even so, the transformed dataset helps the NN to process input data, learn to recognize patterns, and assimilate high-level features in a hierarchical manner, effectively managing complex relationships among features.
The last pre-processing step applies scaling to equalize the datasets and to prevent features with larger values from dominating the calculations during training.
To overcome the class imbalance in the dataset, SMOTE was used during the training phase to balance the classes by interpolating existing instances of the minority class until it is balanced with the majority class, before the data is fed to the next phase.
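The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified pure-Python illustration, not the library implementation used in the study; the function name `smote`, the neighbour count `k`, the seed, and the toy points are assumptions made for the example.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical fold: 4 defective modules against 10 non-defective ones.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]]
majority_count = 10
new_samples = smote(minority, majority_count - len(minority), k=2, seed=1)
```

Because every synthetic point lies on a segment between two real minority instances, the new samples stay inside the region already occupied by the defective class.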
In this study, we employed 10-fold stratified CV in the training phase with a constant random state for reproducibility.
In each fold, the class distribution is maintained so that it is the same for all training runs, with a 9:1 split between training and testing data.
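The stratified splitting can be illustrated with a short sketch. It is a minimal pure-Python version written for this example (the function name `stratified_folds`, the seed value and the toy labels are assumptions); it only shows how a fixed seed and per-class assignment keep the class ratio identical in every 9:1 split.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=10, seed=42):
    """Yield (train_indices, test_indices) pairs that preserve the class ratio."""
    rng = random.Random(seed)  # constant seed so the folds are reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the members of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    for held_out in range(n_splits):
        test = sorted(folds[held_out])
        train = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
        yield train, test

# Hypothetical dataset: 20 defective (1) and 80 non-defective (0) modules.
labels = [1] * 20 + [0] * 80
splits = list(stratified_folds(labels, n_splits=10, seed=42))
```

With these labels every test fold holds 10 modules, 2 of which are defective, which matches the 20% defect ratio of the whole dataset.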
Each fold is fed to an RTE (refer to Table II for hyperparameters), whose output is a sparse, high-dimensional one-hot encoded matrix with 43197 dimensions, which is then reduced to a dense matrix with 100 dimensions using Isomap (refer to Table II for hyperparameters).
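The sketch below illustrates why a random tree embedding yields a sparse one-hot matrix: every sample falls into exactly one leaf per tree, and the leaf positions are concatenated. It is a simplified pure-Python illustration, not the study's code: the trees are complete binary trees with thresholds drawn from the global feature range, the function names and hyperparameter values are placeholders rather than the Table II settings, and the Isomap reduction step is not shown. In practice a library such as scikit-learn, which provides `RandomTreesEmbedding` and `Isomap`, would typically be used.

```python
import random

def build_tree(bounds, depth, rng):
    """A complete binary tree of random (feature, threshold) splits, heap-ordered."""
    nodes = []
    for _ in range(2 ** depth - 1):
        feature = rng.randrange(len(bounds))
        low, high = bounds[feature]
        nodes.append((feature, rng.uniform(low, high)))
    return nodes

def leaf_index(tree, depth, row):
    """Route a sample from the root to a leaf and return the leaf number."""
    node = 0
    for _ in range(depth):
        feature, threshold = tree[node]
        node = 2 * node + (1 if row[feature] <= threshold else 2)
    return node - (2 ** depth - 1)

def random_trees_embedding(rows, n_estimators=4, max_depth=3, seed=0):
    """One-hot encode the leaf reached in each tree and concatenate the codes."""
    rng = random.Random(seed)
    bounds = [(min(col), max(col)) for col in zip(*rows)]
    trees = [build_tree(bounds, max_depth, rng) for _ in range(n_estimators)]
    n_leaves = 2 ** max_depth
    embedded = []
    for row in rows:
        code = [0] * (n_estimators * n_leaves)
        for t, tree in enumerate(trees):
            code[t * n_leaves + leaf_index(tree, max_depth, row)] = 1
        embedded.append(code)
    return embedded

# Hypothetical pre-processed samples with two features each.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0], [1.0, 10.0]]
codes = random_trees_embedding(rows, n_estimators=4, max_depth=3, seed=0)
```

Each code has exactly one non-zero entry per tree, so the dimensionality grows with the number of trees and leaves while the matrix stays sparse, which is what motivates the subsequent dimensionality reduction.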
A descriptive transformation of the dataset is shown in Figure 3.
Figure 3. Original, RTE and Isomap Transformation on the training dataset formed by concatenating cm1, mw1 and pc1 of NASA MDP
The upper-left image shows the original training dataset, where the defective class is clearly cluttered with the non-defective class.
The RTE transformation stretches the dataset so that there is a clear and visible gap between both classes, as shown in the upper-right image, but since the transformed dataset has a high dimensionality, training an NN at this stage would consume a huge amount of computing resources.
Since there is no clear dividing line to differentiate both classes in the RTE-transformed dataset, we concluded that the dataset is closer to being non-linear.
The last phase is to bring down the dimensionality to a more manageable size so that training can be done efficiently; the bottom-left image shows the effect of Isomap, where a considerable number of defective instances form a cluster in the middle.
TABLE II
HYPERPARAMETERS
Name      Hyperparameter       Value
RTE       n_estimators
RTE       max_depth
RTE       min_samples_split
RTE       min_samples_leaf
Isomap    n_components
Isomap    n_neighbors
An NN with two views as inputs is used to predict SDP, with the same architecture as shown in Figure 2.
Both meta-models consist of one layer with 64 nodes with ReLU activation, batch normalization and dropout.
The first meta-model takes the output of the pre-processing phase as input, while the output of the input transformation is fed to the second meta-model.
The outputs of both meta-models are concatenated and fed to the final classifier, which predicts the final result.
A depiction of the proposed MVNN is shown in Figure 4.
Figure 4. Our proposed MVNN
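The two-view wiring described for the MVNN can be sketched as an inference-only forward pass. This is an illustration under assumptions, not the trained model: the class name `MVNNSketch`, the random untrained weights, the sigmoid output and the input sizes are hypothetical, and batch normalization and dropout are left out because the sketch only shows how the two 64-node meta-models are concatenated into the final classifier.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def init_layer(n_out, n_in, rng):
    """Random untrained weights (placeholder initialization) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class MVNNSketch:
    def __init__(self, n_view1, n_view2, hidden=64, seed=0):
        rng = random.Random(seed)
        self.meta1 = init_layer(hidden, n_view1, rng)   # view 1: pre-processed features
        self.meta2 = init_layer(hidden, n_view2, rng)   # view 2: transformed features
        self.head = init_layer(1, 2 * hidden, rng)      # final classifier

    def merge(self, view1, view2):
        """Run both meta-models and concatenate their outputs."""
        return relu(dense(view1, *self.meta1)) + relu(dense(view2, *self.meta2))

    def predict_proba(self, view1, view2):
        """Probability-like score for the defective class (sigmoid is an assumption)."""
        logit = dense(self.merge(view1, view2), *self.head)[0]
        return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical sample: 5 original metrics and a 100-dimensional transformed view.
model = MVNNSketch(n_view1=5, n_view2=100, seed=0)
sample_rng = random.Random(1)
view1 = [sample_rng.gauss(0.0, 1.0) for _ in range(5)]
view2 = [sample_rng.gauss(0.0, 1.0) for _ in range(100)]
score = model.predict_proba(view1, view2)
```

Keeping each view in its own meta-model lets the network learn a separate representation per view before the final classifier combines the 128 concatenated activations.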
Figure 5. Our proposed MVNN
A depiction of the weights of both meta-models and the final classifier is shown in Figure 5.
The upper-left image shows the weights of meta-model 2, whose input is the Isomap-transformed dataset; compared with meta-model 1, whose input is the original dataset, the weights for the Isomap transformation are spread out and show randomness.
Both views (the original and the transformed dataset) are concatenated and fed to the final SDP classifier layer, which transforms them into a more distinct, clearly visible solid line to produce the final MVNN.
TABLE III
MULTI VIEW EFFECT ON PERFORMANCE
Dataset     Model                     Train            Test
NASA MDP    MVNN                      kc1, pc2, mc2
            Mean
NASA MDP    Single View Meta Model    kc1, pc2, mc2
            Mean
NASA MDP    Single View Meta Model    kc1, pc2, mc2
            Mean
RESULT AND DISCUSSION
In this section, the results of the study are presented, summarizing the performance metrics used, namely accuracy, precision, recall, Area Under the Curve (AUC) and F1-Score, as well as the effect of multiple views as shown in Table III.
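For reference, the evaluation metrics listed above can be computed as in the following sketch. It is a minimal pure-Python illustration with hypothetical labels and scores; the function names and the 0.5 decision threshold are assumptions, and AUC is computed as the probability that a defective module is scored above a non-defective one.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with the defective class encoded as 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def auc(y_true, scores):
    """Fraction of (defective, non-defective) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test fold: 3 defective and 7 non-defective modules.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

On imbalanced SDP data, accuracy alone can look high even when defective modules are missed, which is why recall, F1-score and AUC are reported alongside it.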
RESULT
Table III shows the results of our proposed multi-view model and of the single views from our proposed MVNN.
The first meta-model, with the original dataset as input, on average has the same performance as the second meta-model with the Isomap-transformed dataset, with the exception of its AUC of 0.77, which is quite small compared to the 0.84 AUC of the second meta-model.
The worst performance of both meta-models comes from the jm1 test dataset.
Even though our proposed MVNN managed to increase the performance results on all f1-score evaluation metrics, it still performs worst on average.
Even so, on average, the use of both views has increased the performance results.
Using our proposed pre-processing with the kc1, pc2 and mc2 datasets of NASA MDP as training, the mean final results for accuracy, precision, recall, AUC and f1-score are 0.91, 0.91, 0.85 and 0.90 respectively.
The best result on the testing datasets was achieved on pc2 with AUC and f1-score of 0.95 and 0.96 respectively, while the worst result was against jm1 where the AUC was 0.
A closer look at the confusion matrix in Figure 6 clearly shows a significant number of wrong predictions made by the MVNN, with 0. of false negatives and 0.16 of false positives, which impacted the final performance results.
On the SOFTLAB datasets, the model was retrained with ar1 and ar3 as training and the rest as testing, with an average result of 0.87 AUC and 0.94 f1-score.
The worst result is against ar4 with an AUC of 0.73, the result of 0.22 false negatives and 0.07 false positives shown in Figure 6.
TABLE IV
OUR PROPOSED MVNN RESULTS ON VARIOUS SDP DATASETS
Dataset     Train             Test
NASA MDP    kc1, pc2, mc2
            Mean
SOFTLAB     ar1, ar3
            Mean
ReLink      apache, safe
            Mean
AeM         eq, jdt
            Mean
PROMISE     7, camel-1.
            Mean
The same model was retrained on ReLink using the apache and safe datasets as training and zxing as the testing dataset, and achieved a result below the average of the training datasets with an AUC of 0.85.
The confusion matrix of zxing shows a low 0.27 of false negatives and 0.21 of false positives.
The final results on AeM, using eq and jdt as training, showed that the worst result was against pde with an f1-score of 0.81 and 0.53 of false negatives.
On PROMISE, using 1-7 and camel.1-6 as training datasets for our proposed MVNN, the worst result on the testing datasets was against xalan.2-7 with an AUC of 0.69, 0.09 of false negatives and 0.35 of false positives.
DISCUSSION
Overall, our proposed method has shown promising results in predicting defective software modules by learning from some projects and inferring on other projects, with average f1-score performance above 0.
To validate our proposed method, we compare it with various previous CPDP studies by conducting the same experiments.
Where a dataset was used for training in our study, we exclude the corresponding results for fairness and to avoid bias.
Figure 6. Test datasets confusion matrix
TABLE V
OUR PROPOSED MVNN PERFORMANCE WITH AeM
Columns: Source, Target, then AUC and F1-Score for each of ALTRA, MSCPDP, TFIA, SCAG-LSTM and Ours. Sources and targets include JDT and PDE; the last row is the Average.
Table V shows the performance of various CPDP methods such as ALTRA, MSCPDP, TFIA, and SCAG-LSTM.
On average, SCAG-LSTM performs the best, followed by TFIA, compared with ours and the other methods.
Our lower average performance occurred when training on a single dataset, which hinders our proposed method of concatenating multiple input datasets; on the other datasets, however, our proposed method is on par with SCAG-LSTM and TFIA, and the absence of concatenation has little effect on the overall performance, as shown in Table IV.
Table VI compares the performance of our proposed method with WBDA and WBDA, where our proposed method excels over both.
Compared with the experiment in Table V, where only a single dataset is used for training, the results in Table VI show the performance of our proposed pre-processing method when trained with multiple input datasets.
Table VII compares the performance of our proposed method with TPTL, DA-KTSVM, and GB-CPDP, where the average performance shows that our proposed method outperforms the other methods.
THREATS TO VALIDITY
As with every empirical experiment, the results of our work are subject to some threats to validity.
CONSTRUCT VALIDITY
We acknowledge that only a subset of the various SDP repositories was used in our experiments and that, in the case of PROMISE, not all datasets were included.
Although it would be best to include all of them, limited resources prevented us from doing so.
For objectivity, we only used datasets from the same repository for benchmarking or testing, and we refrained from modifying them unless it was necessary to conduct the experiment.
Since most studies on SDP use open and public datasets, we consider the datasets complete, adequately fixed and reliable to be used.
INTERNAL VALIDITY
Although there are questions regarding the validity of the datasets, especially NASA MDP, we found them to be constructive, and the necessary adjustments have been made and verified by previous studies.
Therefore, threats from the validity of the datasets should be minor and have little effect on the results.
Another validity issue is the complex nature of software engineering, especially human factors such as developers' skills and experience, which might affect the likelihood of defects being introduced.
EXTERNAL VALIDITY
We validated our findings using open and public datasets from different sources and with different software metrics to gain more confidence in the external validity of our study.
By doing so, we hope to achieve generalization with our proposed method, and any replication studies using our method will be a step towards improving it.
TABLE VI
OUR PROPOSED MVNN PERFORMANCE WITH RELINK, SOFTLAB AND NASA
Dataset     Source
ReLink      apache, zxing
SOFTLAB     ar1, ar4, ar5, ar6
SOFTLAB     ar1, ar3, ar4, ar6
NASA        cm1, pc1, pc3, pc5
NASA        mw1, pc1, pc3, pc5
Average
Remaining columns: Target, then AUC and F1-Score for each of WBDA, WBDA and Ours.
TABLE VII
OUR PROPOSED MVNN PERFORMANCE WITH PROMISE
Columns: Source, Target, then AUC and F1-Score for each of TPTL, DA-KTSVM, GB-CPDP and Ours. Entries include jedit_4.; the last row is the Average.
VI. CONCLUSION
In this article, we propose MVNN for CPDP, which proves to be reliable compared with previous studies.
The challenge of CPDP, besides the different distributions between datasets, is to find suitable features or software metrics that can be accepted and used universally among software projects.
Although there are methods and techniques to overcome the challenges of CPDP, the complex nature of software projects still proves to be a challenging field in the future to improve software engineering, by finding efficient tools to find and predict defects during the software development life cycle.
In this method, we used multiple pre-processing steps prior to training, ranging from concatenation of features, imputation, and distribution normalization to scaling, to overcome the different software metrics and to handle the different distribution of each dataset.
Although our proposed method was meant to concatenate more than one dataset, it still adapts to a single dataset and gives satisfactory results compared with previous CPDP methods.
Besides the original dataset as the primary view, we opted to create a different view of the dataset by using a high-dimensional embedding based on Random Tree, where given enough trees the model will eventually fit the dataset.
The use of Isomap for dimensionality reduction reduces the size of the MVNN input, so a smaller yet effective NN classifier can be trained using stratified CV.
Empirical studies with some notable SDP datasets show the effectiveness of our proposed method compared with previous methods on CPDP.
In the future, we would like to extend our research to HDP, incorporating other pre-processing methods, and move towards deeper NN layers or attention mechanisms to fully exploit the high dimensionality of the Random Tree output and further improve the performance of our proposed MVNN.
Agus Subekti: Supervision, Validation, Original Draft Writing Preparation, Review Writing & Editing.
COPYRIGHT
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
REFERENCES