JOIV : Int. Inform. Visualization, 8 : IT for Global Goals: Building a Sustainable Tomorrow - November 2024, 1536-1544
INTERNATIONAL JOURNAL
ON INFORMATICS VISUALIZATION
journal homepage : w. org/index. php/joiv

Skin Lesion Classification: A Deep Learning Approach with Local Interpretable Model-Agnostic Explanations (LIME) for Explainable Artificial Intelligence (XAI)

Sin Yi Hong a, Lih Poh Lin b,*

a Faculty of Engineering and Technology, Tunku Abdul Rahman University of Management and Technology, Kuala Lumpur, Malaysia
b Centre for Multimodal Signal Processing, Biomedical and Bioinformatics Engineering Research Group, Faculty of Engineering and Technology, Tunku Abdul Rahman University of Management and Technology, Kuala Lumpur, Malaysia

Corresponding author: *linlp@tarc.
Abstract: The classification of skin cancer is crucial as the chance of survival increases significantly with timely and accurate diagnosis. Convolutional Neural Networks (CNNs) have proven effective in classifying skin cancer. However, CNN models are often regarded as "black boxes" due to the lack of transparency in their decision-making. Therefore, explainable artificial intelligence (XAI) has emerged as a tool for understanding AI decisions. This study employed a CNN model, VGG16, to classify five skin lesion classes. The hyperparameters were adjusted to optimize its classification performance. The best hyperparameter settings were 50 epochs, a 0.1 dropout rate, and the Adam optimizer with a 0.001 learning rate. The VGG16 model demonstrated satisfactory classification performance. The Local Interpretable Model-Agnostic Explanations (LIME) method was implemented as the XAI tool to justify the predictions made by VGG16. The LIME explanations revealed that the correct predictions made by VGG16 were owing to its truthful extraction of the cancer or lesion area, especially for the "vascular lesion" class. Meanwhile, inaccurate classifications were attributed to VGG16's extraction of the background and insignificant parts of the skin as core features. In conclusion, LIME allowed visual inspection of the features selected by VGG16, paving the way for improving the CNN model for better feature extraction and classification of skin lesions and offering a promising direction for future research.
Keywords: CNN; deep learning; explainable AI; skin cancer; local interpretable model-agnostic explanations; VGG16; XAI.
Manuscript received 16 Jun.; revised 21 Aug.; accepted 17 Oct. Date of publication 30 Nov.
International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.
I. INTRODUCTION
A skin lesion is an abnormal growth of skin cells that could be caused by exposure to ultraviolet light, environmental hazards, or genetic risk factors. Skin lesions such as actinic keratoses, dermatofibroma, and seborrheic keratoses are benign, while melanoma, basal cell carcinoma, and squamous cell carcinoma are usually malignant. About 104,930 new melanoma cancer cases are estimated for the year 2023 in the United States. The 5-year survival rate of skin cancer is as high as 94% when the cancer is still localized but is drastically reduced to 32% after metastasis, emphasizing the importance of early diagnosis. The early diagnosis of skin lesions typically involves visual examination before histopathological examination. Nevertheless, classifying skin lesions is challenging due to the high variability in the appearance of the skin.

The outcomes of diagnostics frequently rely on the dermatologist's skill and expertise, which can sometimes be subjective. The accuracy of skin cancer diagnosis was about 60% (non-dermoscopic images) to 84% (dermoscopic images). This shortcoming has motivated the introduction of artificial intelligence (AI) in skin lesion diagnosis. Deep neural networks like Convolutional Neural Networks (CNNs) possess remarkable feature extraction and classification capabilities. As a result, they have witnessed extensive adoption in skin lesion classification in recent years. Table 1 summarizes several studies that employ CNNs for classifying and detecting skin diseases. Works from the literature suggest that CNNs excel at learning complicated patterns in images, rendering CNNs ideal for analyzing skin lesions where features are sometimes subtle. In other words, CNNs can capture details that may not be easily discernible to the human eye.
TABLE I
REPORTED WORKS OF CNNS IN SKIN DISEASE DETECTION
Method | Dataset | Key Results
Models used: VGG16, ResNet50, Inception-v3 and Ensemble. Binary cross-entropy loss function. | Monkey Pox Skin Lesion Dataset | VGG16: Accuracy 81.[...], Precision 0.[...], Recall 0.[...], F1 score 0.[...]. ResNet50: Accuracy 82.[...], Precision 0.[...], Recall 0.[...], F1 score 0.[...].
Models used: AlexNet, VGG16, ResNet-18 and fusion CNN models with Support Vector Machine. Fuse the deep features from various layers of CNNs. | ISIC | AlexNet: Avg. AUC 89.[...]. VGG16: Avg. AUC 88.[...]. ResNet-18: Avg. AUC 88.[...].
Models used: Inceptionv3, VGG19, SqueezeNet and ResNet50. Use the .NET framework with the C# language to enable a web service for users. | ISIC | Inception V3: Avg. accuracy 93%. VGG19: Avg. accuracy 94.[...]. ResNet50: Avg. accuracy 97%. SqueezeNet: Avg. accuracy 96%.
Models used: ResNet50, Xception, Inception-ResNet-v2, DenseNet121 and Inception-v3. Employ global average pooling followed by a 1x1 convolution instead of a fully connected layer. | XiangyaDerm | ResNet50: Avg. precision 63%. Inception-v3: Avg. precision 64%. DenseNet121: Avg. precision 69%. Xception: Avg. precision 68%. InceptionResNetv2: Avg. precision 71%.

The implementation of CNNs has proven to be effective in skin lesion classification. However, despite their success, CNN models come with certain limitations, most notably being regarded as "black boxes", which refers to the low level of transparency and interpretability in the decision-making of CNNs. CNNs do not provide explicit explanations or justifications for their classification. When dealing with critical applications such as medical diagnosis, being informed of the logic behind a model's decision is crucial for building the trust and confidence of medical practitioners in AI. Furthermore, more advanced and robust CNN architectures have been developed in recent years for various tasks. Recently published CNN models continue to grow more complex, with some models already operating with trillions of parameters. As a consequence, the clarity and interpretability of these emerging CNNs are compromised, making it harder to comprehend the prediction process and internal mechanisms. Additionally, in situations where a CNN model makes incorrect predictions, it is often challenging to diagnose the exact cause of the failure.

Amid all these limitations, explainable artificial intelligence (XAI) has arisen as a framework and tool for understanding and interpreting AI decisions. XAI generally refers to all techniques and solutions that enable humans to understand AI models. In other words, XAI allows the understanding of the reasons behind the decisions made by the AI. Some common XAI techniques include Gradient-weighted Class Activation Mapping (Grad-CAM), Local Interpretable Model-Agnostic Explanations (LIME) and Shapley Additive Explanations (SHAP). Nguyen et al. have compared LIME, SHAP, and CAM in describing the ResNet50 CNN in the image classification problem. Considering the hummingbird shown in Fig. 1, which has 49 clusters, all the tested XAI methods were able to highlight the core feature used by ResNet50 in the classification. As shown in Fig. 2, LIME identifies core features using superpixel regions, SHAP employs weighted color regions, and CAM utilizes heatmaps to highlight essential features.

Fig. 1 A hummingbird image with 49 clusters for classification.

Fig. 2 XAI explanation using (a) LIME, (b) SHAP and (c) CAM.
XAI in medicine has been presented in the literature. Yong et al. employed both Grad-CAM and Kernel SHAP techniques on a CNN-based classification of melanoma and benign naevus. This work's significance lies in using Grad-CAM and Kernel SHAP to perform sanity check experiments, including reproducibility, model dependence, and sensitivity tests. Yong et al. conducted an extensive analysis, performing 30 model training runs on 15 randomly selected subclasses of 818 data each. The obtained performance metrics showcase a good 85% mean Area Under Curve (AUC), 1.8% variance, and an 87% recall. The sanity check experiments showed that Grad-CAM and SHAP were reproducible, model-dependent, and mostly sensitive, but occasionally marked irrelevant features as important. The study introduces initial insights connecting accuracy and interpretability in the classification of skin lesions.

In 2021, Ong et al. applied and compared LIME and SHAP to explain COVID-19 diagnoses via X-ray images. A 14-layer SqueezeNet CNN model was first applied to classify the X-rays into three classes, namely pneumonia, COVID-19, and normal lung, followed by the implementation of XAI based on visual evaluation. Both LIME and SHAP were able to mark the region of interest (ROI) that leads to accurate or inaccurate classification. LIME used superpixels to mark the ROI, while SHAP used green and red to mark areas that contributed positively and negatively to the prediction. The authors conclude that SHAP is a relatively better XAI method for the application, as SHAP could always identify the lung region, which is the core feature of an X-ray. The work has provided insight into essential features marked by the CNN.

Grad-CAM has also been applied to image segmentation models, as demonstrated by Xiao et al. This study utilized Grad-CAM with several deep learning segmentation models (MCGU-Net, R2U-Net, and Double U-Net) to segment datasets containing images of colorectal polyps, liver, and skin melanoma. The ROI was presented in the form of heatmaps. The heatmaps generated by Grad-CAM revealed distinct insights. For example, in the case of colorectal polyp images, Double U-Net emphasized pixels at the polyp's edge during segmentation, assigning the highest importance to the edge's left and right ends. In contrast, pixels across the whole polyp region were prioritized by R2U-Net, with increasing significance towards the polyp's center. MCGU-Net focused on the center alongside the pixels on the lower edges of the polyp. Regardless, Grad-CAM has effectively identified the image regions that command the primary focus of medical image segmentation models. In short, XAI is emerging in medical image analysis.
Table 2 summarizes some reported work of XAI in medical applications.

TABLE II
REPORTED WORKS OF XAI IN MEDICAL APPLICATIONS
Application | Description | XAI
COVID-19 | Dataset from Covidx. Employed SqueezeNet. | LIME and SHAP to give a qualitative visual inspection
Melanoma | Dataset from HAM10000. Employed Inception. | Grad-CAM and KernelSHAP to give qualitative visual inspection and quantitative Structural Similarity Index (SSIM) comparison
Segmentation of medical images consisting of polyps, liver CT and melanoma | Datasets: CVC-Clinic, 3Dircadb, Lesion Boundary Segment. Employed MCGU-Net, R2U-Net and Double U-Net. | Grad-CAM to give a qualitative visual inspection
Adverse Drug Event (ADE) | Dataset from Swedish Health Record Research Bank. Employed RNN: RETAIN and RNN-GRU. | SHAP for ...
Previous studies have established the potential of XAI. Nevertheless, its application in skin lesion classification remains limited. This challenge is amplified by the absence of extensive datasets that could fully represent the variations in skin lesions. Considering these gaps, our study developed a CNN model for skin lesion classification. The evaluation of the CNN model's performance is followed by the implementation of an XAI method, LIME, to provide justifications for the decisions made by the CNN model. To ensure the robustness of our approach, we employed a well-established CNN architecture: VGG16. To address the need for sufficient training and validation data, we used a multiple-dataset approach in which skin lesion images from four different sources were included. The LIME explanation presented in this study offers an understanding of the reasoning behind the CNN model's decisions.
II. MATERIALS AND METHOD
Dataset and Image Pre-Processing
Four datasets were incorporated into our project. The first dataset was sourced from the Kaggle public dataset 'Skin Cancer ISIC' by Andrey Katanskiy, extracted from The International Skin Imaging Collaboration (ISIC). The remaining datasets of gold standard lesion diagnosis images, accessible from the ISIC archives, were derived from the combination of the HAM10000 dataset, the MSK dataset and the BCN_20000 dataset. The combined dataset was then split into train-validation data and test data. The number of test images per class was set at 200. At the same time, the train-validation data went through augmentation (including RandomApply, RandomCrop, RandomRotation, GaussianBlur, RandomAdjustSharpness, etc.) or random deletion to reach a uniform 1800 images per class. The train:validation:test ratio of the dataset was 1600:200:200, giving 2000 images per class, as shown in Table 3.
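The balancing step described above can be illustrated with a minimal Python sketch. It only plans which images are kept, deleted, or selected for augmentation; it does not apply the image transforms themselves. The function names, the seed, the example class sizes, and the choice to split after balancing are assumptions made for illustration and are not taken from the study.

```python
import random

TARGET_PER_CLASS = 1800  # train-validation images per class, as stated in the text
N_TRAIN = 1600           # train part of the 1600:200 train:validation split


def balance_class(image_ids, target=TARGET_PER_CLASS, seed=0):
    """Return (kept_ids, ids_to_augment); together they hold `target` items."""
    rng = random.Random(seed)
    ids = list(image_ids)
    if len(ids) >= target:
        # Over-represented class: random deletion down to the target size.
        return rng.sample(ids, target), []
    # Under-represented class: choose source images to augment (with replacement).
    extra = [rng.choice(ids) for _ in range(target - len(ids))]
    return ids, extra


def split_train_val(items, n_train=N_TRAIN, seed=0):
    """Shuffle the balanced items and split them into train and validation lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical class sizes, for illustration only.
kept_big, aug_big = balance_class(range(2500))
kept_small, aug_small = balance_class(range(700))
train, val = split_train_val(kept_small + aug_small)
print(len(kept_big), len(aug_big), len(kept_small), len(aug_small), len(train), len(val))
```

In a real pipeline, each entry of `ids_to_augment` would be passed through the augmentation transforms named in the text to produce a new image.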
TABLE III
NUMBER OF DATA PER SKIN LESION CLASS
Skin Lesion Class | Train | Valid | Test | Total
Actinic Keratosis (AK) | 1600 | 200 | 200 | 2000
Basal Cell Carcinoma (BCC) | 1600 | 200 | 200 | 2000
Melanoma (MEL) | 1600 | 200 | 200 | 2000
Melanocytic Nevus (NV) | 1600 | 200 | 200 | 2000
Vascular Lesion (VASC) | 1600 | 200 | 200 | 2000
Total | 8000 | 1000 | 1000 | 10000

The VGG16 architecture was selected for this study due to its remarkable top-5 test accuracy of 93% on ImageNet, a dataset with over 14 million images distributed among 1000 classes. Furthermore, its robustness has been extensively demonstrated in various detection and classification tasks. Images were resized to 224 × 224 to suit the VGG16 input requirements. The dataset was normalized, a common pre-processing step in CNNs. The dataset normalization involved scaling the input pixel values to achieve a mean of 0 and a variance of 1. Normalizing data offers several advantages, including faster convergence, improved generalization performance, and reduced sensitivity to input changes.
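As a small illustration of the normalization described above, the sketch below standardizes a list of pixel values to zero mean and unit variance. The pixel values and the function name are hypothetical, and whether the study computed the statistics per channel, per image, or over the whole dataset is not stated, so the sketch simply standardizes one flat list.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale values to zero mean and unit (population) variance."""
    mean = fmean(values)
    std = pstdev(values, mu=mean)
    if std == 0:
        # A constant input has no spread; map every value to 0.
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


pixels = [0, 64, 128, 255]  # hypothetical 8-bit intensities
z = standardize(pixels)
print([round(value, 3) for value in z])
```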
Hyperparameters Selection and Performance Metrics
Several hyperparameters, including the dropout rate, number of epochs, and type of optimizer, were evaluated to enhance the performance of the CNN models. The dropout rate refers to the ratio of randomly eliminated neurons during training to reduce overfitting. The number of epochs refers to how often the complete dataset is fed through the CNN during training. Optimizers update the network weights to minimize the loss function, helping the CNN model to converge. These hyperparameters play critical roles in optimizing CNNs for better performance. The values assessed and the description of the hyperparameters are shown in Table 4.

Quantitative analysis was performed to evaluate the performance of the CNN models. A range of performance metrics, including accuracy, recall, precision, F1 score, and the Matthews correlation coefficient (MCC), were employed. Descriptions of these metrics are provided in Table 5. Additionally, the loss function of the training and validation sets, which characterizes the training process and the potential for overfitting or underfitting, was also examined.

TABLE IV
HYPERPARAMETERS EVALUATED
Hyperparameter | Description | Value
Epochs | The number of epochs refers to how often the complete dataset is fed through the CNN during training. It affects the training time and the model's performance. The number of epochs has to be sufficient to allow the CNN to learn the data features, but an overly high number of epochs can lead to overfitting. | 10, 30, 50, 90
Dropout | Dropout is a regularization technique that randomly drops nodes or neurons in CNN layers to reduce the dependence on specific neurons. Tuning the dropout rate helps to prevent the overfitting of the CNN. | 0.1, 0.5, 0.7
Type of optimizer | The optimizer is in control of updating the weights of the neural network based on the calculated gradients of the loss function. Different optimizers use various algorithms to perform weight updates, which can affect the performance of the CNN and therefore require evaluation. | Adam, SGD, NAdam

TABLE V
PERFORMANCE METRICS
Analysis | Equation and Description
Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$. To measure the correctly classified data across all data.
Recall | $\frac{TP}{TP + FN}$. To measure total actual positive data detected over total positive data. Likewise understood as true positive rate or sensitivity.
Precision | $\frac{TP}{TP + FP}$. To measure actual predicted positive data out of all predicted positive data.
F1 score | $\frac{2 \times Precision \times Recall}{Precision + Recall}$. To measure the balance between recall score and precision score.
MCC | $\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$. To measure the correlation between the actual and predicted classifications, considering all four outcomes (false/true negatives and false/true positives) to provide a balanced assessment.
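The metrics listed in Table 5 can be computed from the four confusion counts. The sketch below uses the standard binary (one-vs-rest) definitions; the function name and the example counts are hypothetical and are not results from this study. How the study aggregated the scores over its five classes is not stated, so no averaging scheme is shown.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from true/false positive and negative counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical one-vs-rest counts for a single class in a 1000-image test set.
m = binary_metrics(tp=150, fp=50, fn=50, tn=750)
print(m)
```

Unlike accuracy, the MCC numerator penalizes both kinds of error against both kinds of correct prediction, which is why it stays informative when one class dominates the negatives.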
In Table V, TN = True Negative, FN = False Negative, TP = True Positive, and FP = False Positive.

XAI Application: LIME
LIME was introduced in 2016 by M. Ribeiro, S. Singh, and C. Guestrin in their publication titled "Why Should I Trust You? Explaining the Predictions of Any Classifier". LIME aims to approximate the black-box model locally, making it a post hoc XAI method that does not influence the CNN's training process. Instead, it is applied after training to explain individual predictions. When applying LIME to skin lesion classification, the XAI method begins by segmenting the skin lesion image into superpixels, which are pixel clusters with similar characteristics such as colour and intensity, as shown in Fig. 3(a). Next, perturbed versions of the original image are generated by randomly masking out a subset of superpixels, creating images with masked regions, as shown in Fig. 3(b). These perturbed images are employed in LIME model training. The superpixels with the highest positive coefficients are considered to have contributed significantly to the prediction of skin lesion type. Local explanations from LIME are accurate within the immediate context or vicinity of the skin lesion image under consideration.

Fig. 3 (a) Skin lesion image segmented into superpixels and (b) perturbed image to investigate the importance of a specific region.

III. RESULTS AND DISCUSSION
Hyperparameter Selection
Hyperparameter selection is imperative for a CNN as a suboptimal setting could significantly reduce the classification performance. Parameters, including the number of epochs, rate of dropout, and type of optimizer, were evaluated to enhance VGG16's performance in classifying skin cancers.

Firstly, dropout rates of 0.1, 0.5, and 0.7 (representing 10%, 50%, and 70% of discarded nodes, respectively) were tested. Fig. 4 illustrates the impact of the dropout rate on accuracy, precision, recall, F1-score, and MCC. It was observed that the 0.1 dropout rate gave the best performance metrics. Dropout randomly eliminates neurons and connections during training to prevent rapid adjustment and overfitting. A dropout rate of 1.0 indicates that all neurons are dropped and no training occurs, while a rate of 0 means no neuron is dropped, possibly leading to overfitting. For this study, a 0.1 dropout rate appeared to be a reasonable balance where some neurons were eliminated to prevent overfitting while still enabling the model to learn essential features.

Fig. 4 The effect of dropout rate on classification performance. A 0.1 dropout rate balances preventing overfitting and allowing sufficient learning.

Fig. 5 demonstrates the effect of epochs on classification performance. An epoch represents one complete pass through the training dataset. A CNN model trained for too few epochs may struggle with underfitting, as evidenced by the subpar classification performance observed with only 10 epochs. On the contrary, excessive epochs could lead to overfitting, where the CNN memorizes the training set and cannot generalize to unseen data. For this study, both 50 and 90 epochs reached convergence, but 50 epochs presented the best classification performance and was thus chosen for subsequent analysis.

Fig. 5 The impact of epoch on classification performance. Training for 50 epochs yielded optimal convergence and performance.

Optimizers are used to minimize errors between predicted and actual outputs. An optimizer computes the gradient based on the loss function to adjust the weights in the CNN model for better performance. Different optimizers present varying weight adjustments that can substantially affect the classification performance. The optimizers evaluated in this project were Adam, SGD, and Nadam. The Adam optimizer updates the CNN model by looking at past gradients. The SGD optimizer uses gradient descent on a random point from the entire dataset for parameter updates, reducing redundant work on large datasets, while the Nadam optimizer combines Adam and Nesterov's accelerated gradient descent method, boosting learning by considering both past and current gradient trends. A comparison of these optimizers shows that Adam outperformed the other optimizers, as illustrated in Fig. 6.

Fig. 6 The influence of optimizer on classification performance. Adam is effective in practice and performs favorably compared to other stochastic optimization methods.

This observation is consistent with works reported in the literature, such as that from Kingma et al., which concluded that Adam works well with sparse gradients and is robust for a wide range of non-convex optimization problems in deep learning. Lastly, it is worth noting that the learning rate was consistently set to 0.001 for this study, which is the default value in the PyTorch implementation of the Adam optimizer. This value is commonly used in deep learning frameworks and was observed to be optimal in previous works.
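To make the role of the 0.001 learning rate concrete, the following is a minimal sketch of a single Adam update for one scalar parameter, following the update rule published by Kingma et al. The values beta1 = 0.9, beta2 = 0.999 and eps = 1e-8 are the commonly used defaults and are assumed here, since the text only states the learning rate. The quadratic toy loss is purely illustrative and is not the VGG16 training objective.

```python
from math import sqrt


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (sqrt(v_hat) + eps)
    return param, m, v


# Toy problem (assumption): minimize f(x) = (x - 3)^2 starting from x = 0.
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 101):
    grad = 2.0 * (x - 3.0)
    x, m, v = adam_step(x, grad, m, v, t)
print(x)
```

Because the gradient is divided by the root of its own running second moment, each step has a magnitude of roughly `lr` while the gradient sign is stable, which is why the learning rate directly controls how fast the parameter moves.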
Performance Analysis of CNNs in Skin Lesion Classification
The classification performance of the CNN model was assessed quantitatively using the individual class classification report and measurements, including accuracy, recall, precision, MCC, and F1 score. According to the results in Fig. 7, the VASC class was the easiest to classify, displaying consistently high (> 0.[...]) precision, recall, and F1-score. This ease of classification is likely due to its distinct blushing and flushing area that a trained VGG16 can easily recognize. On the other hand, the performance of VGG16 in classifying the MEL category was mediocre, with performance metrics ranging between 0.57 and 0.[...]. This could be due to the dynamic nature of melanoma lesions, which change over time and have several variants. This makes it challenging to capture all stages and variants in a dataset for comprehensive training of VGG16 to classify MEL.

Fig. 7 The performance of the CNN in classifying different skin lesions (precision, recall, and F1-score per class). The VASC class has the highest classification accuracy.

Despite this, it is safe to conclude that the CNN model's classification of skin cancer/lesions is satisfactory, with an overall precision, recall, F1 score, and accuracy of 0.716, 0.712, 0.724, and 0.711, respectively. The MCC was adopted to measure the overall quality of the classification, as a high score is only produced when the classifier accurately predicts the majority of the negative and positive cases. The CNN's MCC was 0.639, indicating a moderately strong level of agreement between the predictions and the actual labels of the skin images. Overall, the CNN model performs reasonably well, with room for improvement.

The training and validation losses were also examined as they visually represent how the VGG16 was learning over time. The decreasing loss shown in Fig. 8 indicates that the VGG16 model was improving its performance on the training data. The validation loss has an overall decreasing trend alongside the training loss, showing that the model was capable of generalizing to new data. The fluctuation observed in the validation loss suggested the CNN model was adjusting its parameters to prevent overfitting and to reach optimal solutions with the unseen data.

Fig. 8 The loss function of the VGG16 model. Decreasing loss indicates that the CNN model was learning and improving its ability to make accurate predictions.

The Application of XAI
The VGG16's ability to classify skin cancer/lesions was poorly understood; it was unclear why some images were misclassified. To investigate the features VGG16 used for classification, the test set from the dataset was used for XAI analysis. The saved VGG16 checkpoint was loaded for prediction, and the LIME model explained the predictions by highlighting segmented areas using superpixel segmentation and feature importance. The regions segmented by LIME made the features used by VGG16 to identify skin cancer classes explainable.

According to the classification report discussed in Section B, VGG16 could classify the VASC class well, while it performed poorly on the MEL class. This observation aligns with the LIME segmented explanation, which showed that VGG16 was trained to extract the core area of VASC, as shown in Fig. 9, for accurate classification but struggled with MEL, where irrelevant background was extracted, as depicted in Fig. 10. Table 6 provides additional correct and incorrect predictions with remarks. Overall, LIME enabled visual inspection to explain VGG16's decisions. Despite some misclassifications, the CNN model was on the right track in distinguishing between different skin lesion classes. Improvements to the CNN architecture or adding a feature extraction algorithm before CNN classification could enhance classification performance.
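The LIME procedure outlined in the method section (mask random subsets of superpixels, query the classifier, then fit a locally weighted linear surrogate) can be sketched as follows. This is a simplified illustration and not the implementation used in the study: the black box is a hypothetical toy function standing in for the VGG16 class probability, the proximity measure (fraction of superpixels masked) and the kernel width are assumptions, and no real image or segmentation is involved.

```python
import numpy as np


def lime_weights(predict_fn, n_superpixels, n_samples=500, kernel_width=0.25, seed=0):
    """Fit a weighted linear surrogate over on/off superpixel masks."""
    rng = np.random.default_rng(seed)
    # Each row is a perturbed image: 1 keeps a superpixel, 0 masks it out.
    masks = rng.integers(0, 2, size=(n_samples, n_superpixels))
    masks[0, :] = 1  # include the unperturbed image itself
    preds = np.array([predict_fn(mask) for mask in masks], dtype=float)
    # Proximity to the original image: fewer masked superpixels means closer.
    distances = 1.0 - masks.mean(axis=1)
    sample_weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # Weighted least squares with an intercept column.
    design = np.hstack([np.ones((n_samples, 1)), masks])
    root_w = np.sqrt(sample_weights)
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], preds * root_w, rcond=None)
    return coef[1:]  # one coefficient per superpixel


def toy_black_box(mask):
    """Hypothetical classifier score: superpixel 2 matters most, then superpixel 0."""
    return 0.1 + 0.6 * mask[2] + 0.2 * mask[0]


weights = lime_weights(toy_black_box, n_superpixels=6)
ranking = np.argsort(weights)[::-1]  # most influential superpixel first
print(ranking, np.round(weights, 3))
```

The superpixels with the largest positive coefficients are the ones that would be highlighted in the explanation. Reporting the coefficients themselves, rather than only the highlighted regions, is one way to expose the relative weight of each region.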
Fig. 9 Superpixels graph and mask graph for VASC correct prediction. LIME shows that VGG16 extracted core features.

Fig. 10 Superpixels graph and mask graph for MEL incorrect prediction. The image was classified as BCC due to the irrelevant feature extraction by VGG16.

TABLE VI
XAI EXPLANATION OF CNN CLASSIFICATION
Prediction | Actual | Predicted | Superpixels Graph vs Masked Graph | Remarks
Correct | BCC | BCC | [image] | LIME segmented the areas of the image that VGG16 utilized as the core feature. Despite not having the entire carcinoma extracted, the feature extracted (probably the redness) was sufficient to allow VGG16 to predict the BCC class correctly.
Correct | MEL | MEL | [image] | LIME segmentation shows that VGG16 has extracted a large area of the skin lesion for the correct classification of MEL.
Incorrect | [...] | [...] | [image] | LIME shows that irrelevant background and a pigmented skin area were extracted, leading to incorrect classification. VGG16 may have viewed the pigmented skin (bottom left corner) area as NV.
Incorrect | [...] | BCC | [image] | LIME shows that irrelevant background was extracted, leading to incorrect classification. VGG16 may have viewed the lighter skin area as BCC.

Some improvements can be made to this study. Firstly, different CNN models, such as EfficientNet and DenseNet, could be employed in skin lesion classification to realize better results. Further research should also explore additional hyperparameter tuning to optimize model performance. Bayesian hyperparameter tuning could be explored to efficiently examine the hyperparameter space and automate finding the best hyperparameters with minimal manual intervention. Additionally, segmenting the skin lesion outline before training could enhance training.

Besides, incorporating various XAI models, such as SHAP and Grad-CAM, into CNN models should be considered for future studies. SHAP takes a team-like approach to explain how models make predictions. It helps to explain how each feature contributes to the decisions made, giving AI users valuable insights using colored weighted zones. CAM, being an intrinsic model, analyzes the final convolutional layer of a CNN to understand which parts of the image activate the neurons corresponding to the skin lesion class. It can enhance the interpretability of CNNs by providing heatmaps highlighting essential features.

In addition, quantitative evaluation of XAI models using metrics like fidelity and stability scores should also be explored in future work. The fidelity score measures how closely the explanation matches the CNN's decision-making process. In contrast, the stability score measures how consistent the XAI explanation is across data of the same class. Another recommendation for future work is to present XAI explanations using a weightage for each feature instead of only visual explanations. While visual explanations can be helpful, they do not always clearly indicate the relative importance of specific features in the decision-making process. Therefore, by explicitly stating the weightage of each feature, the most influential features affecting the decision of CNNs are made known to the users.

In short, exploring different CNN models, implementing Bayesian optimization, comparing different XAI methods, and introducing quantitative XAI explanations can collectively enhance the interpretability of CNN models in skin lesion classification, paving the way for more transparent AI systems in healthcare.
IV. CONCLUSION
In conclusion, this study has evaluated the classification performance of a VGG16 CNN model in classifying five classes of skin conditions, including three classes of cancerous lesions and two classes of non-cancerous lesions. Upon the optimization of hyperparameters including dropout rate, number of epochs, and optimizer, the VGG16 model achieved satisfactory classification performance, with an average accuracy, precision, recall, F1 score, and MCC of 0.711, 0.716, 0.712, 0.724, and 0.639, respectively. LIME, a post hoc XAI method, was applied to explain the decisions made by VGG16 in skin lesion classification. The LIME explanations use superpixels to visually demonstrate the features extracted by the VGG16 model. LIME showed that accurate predictions by VGG16, particularly for the VASC class, were attributed to the truthful extraction of cancer or lesion areas. At the same time, inaccurate classifications were linked to the extraction of background and insignificant parts of the skin as core features. Overall, integrating LIME with the VGG16 model has allowed visual inspection and justification of the model's predictions, providing a pathway for enhancing the CNN model's performance, particularly in feature extraction. Future work could use quantitative XAI to give a specific weight to each feature and highlight its importance in classification.
ACKNOWLEDGMENT
The authors gratefully acknowledge the support provided by Tunku Abdul Rahman University of Management and Technology.
This work was not supported by any grant.
REFERENCES