Journal of Applied Engineering and Technological Science, Vol. 6, 2025: 984-996

COCOA RIPENESS CLASSIFICATION USING VISION TRANSFORMER

Febryanti Sthevanie1*, Untari Novia Wisesty2, Gia Septiana Wulandari3, Kurniawan Nur Ramadhani4
School of Computing, Telkom University, Bandung 40257, Indonesia1234
Center of Excellence Artificial Intelligence for Learning and Optimization, Telkom University, Bandung 40257, Indonesia124
sthevanie@telkomuniversity.id1, untarinw@telkomuniversity.id2, giaseptiana@ id3, kurniawannr@telkomuniversity.

Received: 01 December 2024; Revised: 30 April 2025; Accepted: 05 May 2025
*Corresponding Author

ABSTRACT
Manual assessment of cocoa pod ripeness is subjective and varies from person to person because of the intense labor required and the variation of lighting and background conditions in the field. This research implemented an automated approach for cocoa ripeness classification using a Vision Transformer (ViT) with Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to improve classification accuracy. The proposed model achieved an accuracy of 82.65% and a macro F1-score of 82.71% on a test set drawn from 1,559 images captured under varying illumination, backgrounds, and complex scenes. The model also outperformed baseline CNN architectures such as VGG, MobileNet, and ResNet in identifying visually progressive stages of ripeness and demonstrated greater generalization in cocoa ripeness classification. The findings indicate that manual inspection effort can be reduced without compromising quality assurance standards in cocoa production. This work demonstrates new ways of applying transformer models to computer vision problems in agriculture, a step toward precision and smart farming.

Keywords: Cocoa Ripeness Classification, Vision Transformer, Shifted Patch Tokenization, Locality Self-Attention,
Agricultural Computer Vision.

Introduction

In 2021, Indonesia stood third in global cocoa production with 728,046 tons produced annually (Food and Agriculture Organization (FAO)). The cocoa sector is a strategic sub-industry in Indonesia, as more than 95% of the cocoa production of Sulawesi, Sumatra, and Papua comes from smallholder farmers (International Cocoa Organization (ICCO)). These farmers have very low uptake of mechanisation and agricultural technology and thus apply traditional methods, which leads to a range of quality issues as well as post-harvest inefficiency. One of the most important steps in cocoa processing is harvesting. The degree of pod maturity is a key factor in determining the fermentation quality, chemical composition, and flavor profile of chocolate products. If harvested too early, beans have a high tendency of being under-fermented, resulting in a bitter taste. Meanwhile, overmature pods are more susceptible to fungal infestation, internal germination, or envelopment in mucilage, degrading both bean quality and yield (Siregar et al.). This accentuates the need to maintain certain standards and highlights the importance of timely and accurate cocoa maturity assessment. There is still no automated system for maturity classification in Indonesia, since it is typically done manually through observation of the pod surface color. This method relies on human labor and is prone to human error, especially in fields where lighting and scenery are inconsistent. There is a need to design a classification system that works accurately within the constraints of the real world and can decrease post-harvest loss. Other fruits have seen positive outcomes with computer vision and deep learning for detecting maturity levels. Previous works include the use of color-based features for papaya with k-nearest neighbor (k-NN) (Suban et al.), support vector machines (SVM) for banana (Juncai et al.)
, oil palm (Siregar et al.), and durian with convolutional neural networks (CNNs) (Kharamat et al., 2020). However, CNNs are poorly suited to modeling long-range dependencies and tend to overfit small, unbalanced training datasets, which are typical in agriculture. Moreover, CNN-based models often fail to perform adequately in uncontrolled scenarios. For these reasons, this research aims to develop an image-based cocoa maturity classification system using the ViT architecture. ViT treats images as sequences of patches and employs self-attention to capture global patterns, which aids in interpreting color gradients, texture, and pod surface details (Dosovitskiy et al., 2020). To improve model performance on small datasets and in complicated environmental conditions, two complementary modules are added: Shifted Patch Tokenization (SPT), which rearranges the spatial order of patches and simulates data augmentation (Lee et al., 2021), and Locality Self-Attention (LSA), which strengthens local feature emphasis by minimizing attention to non-local feature tokens (Zhou et al.). The effectiveness of Transformer-based models has been explored in agriculture for fruit grading, weed and crop classification, and plant disease detection, achieving high performance even with limited data (Charco et al., 2024; Chitta et al., 2024). In the specific context of cocoa, Lopes et al. (2022) demonstrated the potential of deep learning for cocoa bean grading using CNNs. However, ViT-based models for cocoa maturity classification remain underexplored and lack integration of modules that enhance data efficiency. To better understand the maturity classes of cocoa fruits targeted in this research, Figure 1 illustrates the visual differences across the Immature, Mature, and Overmature stages.
Each stage shows distinct characteristics in pod surface color and texture, providing observable cues for automated classification systems.

Fig. 1. Cocoa fruit maturity stages: (a) Immature, (b) Mature, (c) Overmature.

The objective of this research is to develop an image-based cocoa fruit maturity classification model using the Vision Transformer architecture, enhanced with Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), in order to improve classification accuracy under limited-data and field-based image conditions. The contributions of this research are summarized as follows:
- Proposing a novel ViT-based approach for classifying cocoa maturity stages from real-environment images.
- Integrating SPT and LSA to address challenges related to small datasets and subtle feature differences.
- Demonstrating that the proposed method outperforms baseline models such as ViT, ResNet, and MobileNet in terms of accuracy and class-wise performance.
- Providing a practical solution to assist cocoa farmers in making harvesting decisions and improving post-harvest outcomes through automation.

Related Works

Image-based classification is a fundamental approach in smart agriculture, supporting applications such as maturity prediction (El Sakka et al., 2024; Khaki & Wang), disease diagnosis (Liu & Wang, 2021), and yield estimation (Khaki et al., 2020). CNNs have been widely used in classifying crops, detecting diseases, and grading fruit maturity due to their strong spatial feature extraction capabilities (El Sakka et al., 2024; Liu & Wang, 2021). For example, MobileNet and EfficientNet have been applied on lightweight edge devices (Paneru et al., 2024; Yasin & Fatima), while ResNet and Inception architectures were effective in complex background conditions (Khaki et al., 2020; Liu & Wang, 2021). Initial work relied on handcrafted color features classified by traditional algorithms like SVM and k-NN (Ala'a & Ibrahim, 2024; Alimjan et al., 2018;
Joshi et al., 2023), but deep learning has since replaced them with robust end-to-end CNN pipelines (El Sakka et al., 2024). Transfer learning with pretrained models such as VGG16 and ResNet50 has improved generalization under limited data conditions (Khaki et al., 2020; Khaki & Wang). Despite these advances, CNNs often overfit small datasets and struggle to capture global context (Brigato & Iocchi, 2021; Gal & Ghahramani, 2016; Yu et al.). Vision Transformers (ViTs) have gained attention as a potential solution by modeling images as sequences of patches and applying self-attention for global information encoding (Dosovitskiy et al., 2020; Khan et al., 2022). ViTs have outperformed CNNs in leaf disease classification (El Sakka et al., 2024), maturity prediction (Ergün, 2025), and weed detection via UAV (Reedha et al., 2022; Zhao et al.). However, standard ViTs are data-hungry and computationally intensive (Lee et al., 2021). To address this, structural improvements like Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) have been proposed (Lee et al., 2021; Zhou et al.). SPT introduces overlapping local views, while LSA narrows the attention field for improved locality modeling (Zhou et al.). These methods have enabled ViTs to generalize better on small datasets while maintaining global sensitivity (Lee et al., 2021). In agricultural domains, ViTs with SPT/LSA have shown promise in high-resolution tasks such as plant health monitoring (Borhani et al., 2022; De Silva & Brown, 2023) and tomato leaf recognition (Nahak et al.). UAV-based ViT applications have also reported improved segmentation under irregular lighting (Zhao et al.). ViTs are further enhanced by explainability tools like Grad-CAM (Kulkarni et al., 2022; Mishra & Malhotra). Recent studies confirm ViTs are competitive with or superior to CNNs in grape quality grading (Pothen & Nuske, 2016; Shimazu et al.), banana defect detection (Ergün, 2025)
, and rice crop classification (Ulukaya & Deari). Several works integrated multispectral inputs into ViT pipelines for more robust modeling under climate variability (Lin et al., 2023; Rad). Others employed multi-task learning to combine maturity prediction and yield estimation in parallel (Lin et al., 2023). While these advancements are encouraging, there remains a lack of ViT-based studies targeting cocoa fruit classification. Existing CNN approaches focus on post-harvest bean sorting (Eric et al., 2023; Essah et al.) and often exclude the critical field-stage maturity identification. No prior work to date has evaluated ViTs with SPT and LSA on real-world cocoa fruit images. To bridge this gap, our research introduces a ViT-SPT-LSA model specifically designed for cocoa maturity classification in uncontrolled environments. This architecture balances global attention with local detail sensitivity to enhance performance with limited image data.

Research Methods

Dataset and Preprocessing

Our research used a dataset consisting of 1,559 labeled images of cocoa fruits captured under natural lighting conditions with varying backgrounds, angles, and maturity levels. The dataset includes three maturity classes: immature, mature, and overmature (the class proportions are given in Table 1). Input images were resized to 224×224 pixels to match the input size of the Vision Transformer (ViT). Data augmentation techniques such as horizontal flipping, rotation, brightness manipulation, and zoom were applied to increase training set diversity and reduce overfitting. All images were normalized using the ImageNet mean and standard deviation. We used 1,000 images for the training set and 559 images for the testing set. During training, 10% of the training set was held out for validation. The class-wise distribution across each subset is presented in Table 1.

Table 1 - Class distribution across dataset subsets (rows: Immature, Mature, Overmature; columns: Training Set, Validation Set, Testing Set, Total).
Model Architecture

Figure 2 shows the design of the system built in this research. The system starts with an image input. The first process is patch embedding, which breaks the image into a collection of patches. These patches are then encoded to produce a sequence of features, which is processed by the Transformer Encoder in the form of multi-head attention blocks. The output of the Transformer Encoder is a feature representation that serves as input to the softmax classifier at the end of the system.

Fig. 2. Proposed cocoa ripeness classification system.

Vision Transformer

The Vision Transformer (ViT) architecture was selected for its ability to model long-range dependencies via self-attention, which is beneficial for detecting subtle maturity cues spread across fruit surfaces. CNNs, although effective in local pattern recognition, struggle to capture global context, especially in images with diverse lighting and background conditions. To enhance ViT performance in limited-data settings, we integrated two architectural improvements: Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA). SPT modifies the patch embedding process by shifting input images in fixed directions before tokenization. This encourages the model to capture more diverse and local contextual patterns within the input, thereby improving feature representation. Meanwhile, LSA restricts the self-attention mechanism to focus on local neighborhoods rather than the entire image, embedding an inductive bias that is useful for learning localized features. This modification reduces overfitting and increases model robustness on datasets with limited training samples.
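To make the two modules concrete, the following pure-Python sketch illustrates their core ideas. It is only an illustration of the mechanisms, not the authors' implementation: the function names and the tiny 2D inputs are hypothetical, and the temperature is fixed here although it is learnable in the actual model.

```python
import math

def shift2d(img, dy, dx):
    """Shift a 2D grid by (dy, dx), zero-padding the vacated cells."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out

def spt_channels(img):
    """SPT idea: stack the original image with shifted copies (up, down,
    left, right) as extra channels before patch embedding."""
    shifts = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]
    return [shift2d(img, dy, dx) for dy, dx in shifts]

def lsa_attention(scores, temperature=1.0):
    """LSA idea: mask the diagonal of the Query-Key similarity matrix with
    -inf (no self-token attention), then apply a temperature-scaled softmax
    per row. A low temperature sharpens the attention distribution."""
    n = len(scores)
    out = []
    for i in range(n):
        row = [scores[i][j] / temperature if j != i else float('-inf')
               for j in range(n)]
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out
```

In the full model these operations act on image tensors and token embeddings rather than raw 2D grids; the sketch only shows how shifting enriches the tokenized input and how diagonal masking plus temperature scaling reshape the attention scores.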
The patch size was empirically set to 16×16 pixels, balancing spatial granularity and computational cost. This patch size ensures sufficient detail is retained while maintaining compatibility with standard ViT pretraining checkpoints. Figure 3 depicts the architecture of the Vision Transformer (ViT), in which an input image is separated into a series of fixed-size patches, linearly embedded, and combined with positional encoding before being sent through several transformer encoder layers. These layers consist of multi-head self-attention and feed-forward networks, which capture global visual dependencies and enable robust classification.

Fig. 3. Vision Transformer architecture (M.-H. Guo et al., 2022).

Shifted Patch Tokenization (SPT)

A Vision Transformer needs a large amount of data in the training process, which is a problem when the available data is limited. To overcome this, we used Shifted Patch Tokenization (Emmamuel et al.). This process enriches the variation of the data by shifting the images before the patching process and adding the shifted variants to the tokenization. Figure 4 depicts the SPT mechanism, in which input images are spatially shifted in multiple directions (e.g., up, down, left, right). Each shifted image is then tokenized, and the resulting tokens are aggregated to produce a richer patch representation. This enhances the model's capacity to learn local variations in spatial structures. Figure 5 shows an example of the SPT result.

Fig. 4. Shifted Patch Tokenization process (Emmamuel et al.).

Fig. 5. Example of SPT result.

Locality Self-Attention (LSA)

Another technique to overcome the limited-dataset problem is Locality Self-Attention (LSA) (Q. Guo et al.). LSA improves the distribution of attention scores by learning the temperature parameter of the softmax function.
The learnable temperature scaling enables the ViT to determine the softmax temperature throughout the learning phase. When the softmax temperature is low, the score distribution sharpens; thus, learnable temperature scaling refines the distribution of attention scores. In addition, diagonal masking is applied to eliminate self-token relations by suppressing the diagonal elements of the similarity matrix computed from the Query and Key. This ensures that inter-token relationships receive higher importance by excluding self-token relations from the softmax computation. Diagonal masking achieves this by assigning −∞ to the diagonal elements, directing the Vision Transformer's attention mechanism to prioritize other tokens rather than each token attending to itself. This masking enhances the relative attention scores between distinct tokens, leading to a sharper distribution of attention scores. Consequently, LSA strengthens the locality inductive bias by encouraging the Vision Transformer's attention to concentrate on local regions. Figure 6 shows a schematic comparison between standard self-attention and LSA. In the standard version, attention is computed globally across all patches. In contrast, LSA applies a localized attention mask, limiting the scope to adjacent patches and encouraging the model to focus on spatially relevant regions, which is particularly beneficial on small datasets.

Fig. 6. Comparison of standard self-attention and LSA.

Experiment Setup

The model was trained using the Adam optimizer for 100 epochs with an initial learning rate of 1×10⁻⁴ and a batch size of 32. Initial experiments indicated that a learning rate of 1×10⁻⁴ resulted in faster convergence and less overfitting. We used 100 epochs because system performance tends to be stable beyond that point. The batch size of 32 was chosen to balance training speed with the available computing power.
The Adam optimizer was chosen because it combines AdaGrad and RMSprop, adjusting the learning rate for each parameter and improving convergence and performance, especially on noisy and sparse data. A learning rate scheduler with cosine annealing was employed. Cross-entropy loss was used as the objective function. Evaluation metrics included accuracy, precision, recall, and F1-score to assess classification performance across all classes. Training was conducted on a workstation equipped with an NVIDIA RTX 4070 GPU (12 GB VRAM), 12.6 GB RAM, and an Intel Core i9 processor. Average training time per epoch was approximately 90 seconds. The model converged by the 60th epoch, after which performance saturated.

Results and Discussion

In this section, we present a detailed analysis of the experimental results, including training performance, class-specific performance analysis with a confusion matrix, comparative evaluation, cross-validation results, and an ablation study. The training performance of the Vision Transformer and of our proposed method is shown in Figures 7 and 8, respectively. The effect of utilizing SPT and LSA in the training process can be seen by comparing the training and validation accuracy. By combining the Vision Transformer with SPT and LSA, the model obtained a higher validation accuracy (about 75%) than with the Vision Transformer alone (about 60%), and it reached its highest validation accuracy faster. Because the gap between training and validation accuracy was smaller, the combination of Vision Transformer with SPT and LSA handled the overfitting problem better.

Fig. 7. Training and validation accuracy of Vision Transformer.

Fig. 8. Training and validation accuracy of our proposed method.
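The cosine-annealing schedule used in the experiment setup can be written down directly. A minimal sketch follows, using the stated settings (100 epochs, initial learning rate 1×10⁻⁴); the floor value lr_min = 0 is an assumption, as the paper does not state one:

```python
import math

def cosine_annealed_lr(epoch, total_epochs=100, lr_max=1e-4, lr_min=0.0):
    """Cosine annealing: the learning rate decays from lr_max at epoch 0
    to lr_min at total_epochs along a half cosine wave."""
    cos_term = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos_term)
```

Under these assumptions the rate starts at 1×10⁻⁴, passes through half that value at epoch 50, and approaches the floor at epoch 100; most frameworks ship this schedule built in (e.g. a CosineAnnealingLR-style scheduler).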
For the testing performance, our proposed ViT-SPT-LSA model was evaluated using four metrics: accuracy, precision, recall, and F1-score, across three cocoa maturity classes: immature, mature, and overmature. The model achieved an overall accuracy of 82.64% and a macro-average F1-score of 82.71%. The proposed model showed marked improvements in the classification of the mature class, which often exhibits overlapping visual traits with the overmature class. The inclusion of SPT and LSA enabled the model to generalize better to these subtle inter-class differences, outperforming the baseline ViT in both precision and recall. Class-wise recall based on the confusion matrix in Figure 9 is 0.79 for immature, 0.89 for mature, and 0.82 for overmature.

Fig. 9. Confusion matrix of the testing result.

While most misclassifications occur between the immature and mature classes, these findings validate the model's strength in identifying immature and overmature pods. Despite class overlap, the confusion matrix in Figure 9 verifies that 207 of 263 immature pods, 142 of 159 mature pods, and 113 of 137 overmature pods were accurately classified, demonstrating reliable prediction. Moreover, the ROC curves in Figure 10 revealed consistent sensitivity-specificity trade-offs across every class. Suggesting a high degree of class separability, the area under the ROC curve (AUC-ROC) values are 0.83 for immature, 0.80 for mature, and 0.84 for overmature.

Fig. 10. TPR-FPR curves for each class with AUC scores.

Inference time was measured to assess the model's practical deployment capability. The average inference time per image was 21 milliseconds on an NVIDIA RTX 4070 GPU, demonstrating feasibility for real-time deployment in environments such as smart harvesters or quality control systems.
Although training time was intensive, largely due to the complexity of transformer-based architectures and the integration of SPT and LSA, the model compensates with a fast inference time that is competitive with traditional CNN-based models such as ResNet and EfficientNet in similar agricultural classification tasks. While some lightweight CNNs can achieve inference times below 10 ms, many reported CNN-based approaches operate within the 20–45 ms range depending on hardware and model complexity; our model's latency of 21 milliseconds therefore makes it suitable for real-time deployment. Misclassified examples were predominantly found at the immature–mature and mature–overmature boundaries, where lighting inconsistency and fruit surface shadowing confused the visual signals. Some fruits displayed multiple maturity indicators on a single pod, leading to ambiguity during labeling and training. These results highlight the persistent challenge of environmental variability in field-collected data. Nevertheless, the use of SPT contributed to improved local texture modeling, while LSA helped focus attention on contextually relevant features. Figure 11 presents several examples of such misclassified images, illustrating the visual ambiguity that challenged the model.

Fig. 11. Examples of misclassified cocoa fruit images.

Despite its success, the research faces several limitations. The dataset, while diverse, remains small and may not encompass all variations in cocoa pod appearance. Lighting conditions were not standardized, which may have introduced noise during model training. Moreover, the class imbalance between mature and non-mature classes could still influence decision boundaries. Addressing these limitations will require a larger and more balanced dataset, advanced augmentation strategies, or the integration of multimodal data such as hyperspectral or thermal imaging.
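The per-class figures implied by the confusion-matrix counts reported above (207/263, 142/159, 113/137 correct) can be checked in a few lines; note that the overall accuracy they imply, 462/559 ≈ 82.6%, is consistent with the reported test accuracy:

```python
# Correctly classified counts and class totals from the confusion matrix (Figure 9).
correct = {"immature": 207, "mature": 142, "overmature": 113}
totals  = {"immature": 263, "mature": 159, "overmature": 137}

# Recall per class: correct predictions divided by the true number of samples.
recall = {c: correct[c] / totals[c] for c in correct}

# Overall test accuracy implied by the diagonal of the confusion matrix.
overall_accuracy = sum(correct.values()) / sum(totals.values())
```

Per-class precision and F1 cannot be recomputed from these counts alone, since they also depend on how the off-diagonal errors are distributed.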
We also conducted 10-fold cross-validation for ViT and our method to show the improvement provided by SPT and LSA. The results are shown in Table 2. Our method consistently outperformed ViT, achieving a higher mean accuracy with a lower variance (0.24%) compared to ViT (mean accuracy 86.5%, variance 0.69%). The reduced variance indicates that the proposed method performs more consistently across different data splits, highlighting its robustness. This shows the effect of utilizing SPT and LSA on ViT model performance.

Table 2 - 10-fold cross-validation results (columns: Fold, ViT Accuracy (%), Proposed Method Accuracy (%); rows: Fold 1 through Fold 10, Mean, Variance).

To further demonstrate the effects of SPT and LSA, we conducted an ablation study evaluating their contributions to system performance. Table 3 summarizes the results. Adding SPT improved accuracy by 1.48% over the baseline ViT, showing its effectiveness in enriching the training set. Adding LSA improved accuracy by 0.75% by enhancing localized feature learning. Combining SPT and LSA gave the best performance (82.65%), demonstrating their complementary roles in enhancing ViT performance for classifying cocoa fruit ripeness.

Table 3 - Ablation study results (columns: Model, Accuracy (%), Macro F1-Score (%); rows: Model 1 (Baseline ViT), Model 2 (ViT + SPT), Model 3 (ViT + LSA), Model 4 (Proposed Method)).

We evaluated our proposed method against four other methods: VGG, MobileNet, ResNet, and the Vision Transformer (ViT). Table 4 summarizes their performance metrics. Our proposed method achieved the highest accuracy (82.65%), outperforming ViT and the other models. MobileNet and VGG had the lowest accuracies, indicating limitations of their architectures for this dataset.
The proposed method also exhibited the best balance of performance across all classes (macro-average F1-score of 82.71%), followed by ViT. Furthermore, our method outperformed all other methods across all classes. The most significant improvement was observed for the challenging mature class, where our method's F1-score clearly exceeded ViT's 76.37%.

Table 4 - Performance comparison (columns: Accuracy, Macro avg F1-score, Immature F1-score, Mature F1-score, Overmature F1-score; rows: VGG, MobileNet, ResNet, ViT, Proposed method).

In summary, our proposed ViT-SPT-LSA model achieved high classification accuracy and robustness, particularly in difficult conditions involving inter-class similarity and lighting variation. Its fast inference time and generalization potential make it well suited for real-time agricultural deployment and adaptation to related smart farming tasks.

Conclusion

This research proposed a cocoa ripeness classification model based on the Vision Transformer (ViT), enhanced with Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA). The model achieved a classification accuracy of 82.65% and a macro-average F1-score of 82.71%, outperforming the baseline ViT and other CNN-based architectures. These results highlight the effectiveness of combining SPT and LSA in enhancing ViT's generalization on small and variable agricultural datasets. From a theoretical standpoint, the work demonstrates the adaptability of transformer-based architectures to complex agricultural image recognition tasks. Practically, the proposed system offers strong potential for automating cocoa ripeness classification during harvesting or quality control, especially under inconsistent field conditions. Nevertheless, limitations include the relatively small dataset size, susceptibility to lighting variations, and class imbalance, particularly in the mature class.
Future work should focus on scaling the dataset, incorporating advanced augmentation or hyperspectral data, and deploying the model on edge computing devices for real-time, in-field applications to fully realize the benefits of smart agriculture technologies.

References

Ala'a, & Ibrahim. (2024). Classification of tomato leaf images for detection of plant disease using conformable polynomials image features. MethodsX, 13, 102844. https://doi.org/10.1016/j.

Alimjan, Sun, Liang, Jumahun, & Guan. (2018). A new technique for remote sensing image classification based on combinatorial algorithm of SVM and KNN. International Journal of Pattern Recognition and Artificial Intelligence, 32, 1850004. https://doi.org/10.1142/S0218001418590127

Borhani, Khoramdel, & Najafi. (2022). A deep learning based approach for automated plant disease classification using vision transformer. Scientific Reports, 12, 11554. https://doi.org/10.1038/s41598-022-15163-0

Brigato, & Iocchi. (2021). A close look at deep learning with small data. 2020 25th International Conference on Pattern Recognition (ICPR), 2490–2497.

Charco, Yanza-Montalvan, Zumba-Gamboa, Alonso-Anguizaca, & Basurto-Cruz. (2024). ViTSigat: Early Black Sigatoka detection in banana plants using Vision Transformer. Conference on Information and Communication Technologies of Ecuador, 117–130. https://doi.org/10.1007/978-3-031-75431-9_8

Chitta, Yandrapalli, & Sharma. (2024). Deep learning for precision agriculture: Evaluating CNNs and Vision Transformers in rice disease classification. 2024 OPJU International Technology Conference (OTCON) on Smart Computing for Innovation and Advancement in Industry 4.0, 1–6. https://doi.org/10.1109/OTCON60325.

De Silva, & Brown. (2023). Multispectral plant disease detection with Vision Transformer–convolutional neural network hybrid approaches. Sensors, 23, 8531. https://doi.org/10.3390/s23208531

Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer,
Heigold, Gelly, Uszkoreit, & Houlsby. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010. https://doi.org/10.48550/arXiv.

El Sakka, Mothe, & Ivanovici. (2024). Images and CNN applications in smart agriculture. European Journal of Remote Sensing, 57. https://doi.org/10.1080/22797254.

Emmamuel, Asim, Yu, Kim, et al. 3D-CNN method over shifted patch tokenization for MRI-based diagnosis of Alzheimer's disease using segmented. Journal of Multimedia Information System, 9, 245–252. https://doi.org/10.33851/JMIS.

Ergün. (2025). High precision banana variety identification using vision transformer based feature extraction and support vector machine. Scientific Reports, 15, 10366. https://doi.org/10.1038/s41598-025-95466-0

Eric, Gyening, Appiah, Takyi, & Appiahene. (2023). Cocoa beans classification using enhanced image feature extraction techniques and a regularized Artificial Neural Network model. Engineering Applications of Artificial Intelligence, 125. https://doi.org/10.1016/j.

Essah, Anand, & Singh. An intelligent cocoa quality testing framework based. Measurement: Sensors. https://doi.org/10.1016/j.

Food and Agriculture Organization (FAO). Indonesia: Upgrading bulk cocoa into fine. https://openknowledge.org/server/api/core/bitstreams/684e2bd3-6b91-48f5-a7cd-4125c5c74cab/content

Gal, & Ghahramani. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. International Conference on Machine Learning, 1050–1059. https://dl.acm.org/doi/10.5555/3045390.

Guo, M.-H., Xu, T.-X., Liu, J.-J., Liu, Z.-N., Jiang, P.-T., Mu, T.-J., Zhang, S.-H., Martin, Cheng, M.-M., & Hu, S.-M. (2022). Attention mechanisms in computer vision: A survey. Computational Visual Media, 8, 331–368. https://link.springer.com/article/10.1007/s41095-022-0271-y

Guo, Q., Qiu, Xue, & Zhang. Low-rank and locality constrained self-attention for sequence modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27, 2213–2222. https://doi.org/10.1109/TASLP.
International Cocoa Organization (ICCO). Top 10 cocoa-producers and the issue of. https://w.org/newsstream/post/176254/top-10-cocoa-producers

Joshi, Bansal, & Sharma. (2023). Classification of tomato leaf disease using feature extraction with KNN classifier. 2023 Seventh International Conference on Image Information Processing (ICIIP), 541–546. https://doi.org/10.1109/ICIIP61524.

Juncai, Yaohua, Lixia, Kangquan, & Satake. Classification of ripening stages of bananas based on support vector machine. International Journal of Agricultural and Biological Engineering, 8, 99–103. https://doi.org/10.3965/j.

Khaki, & Wang. Crop yield prediction using deep neural networks. Frontiers in Plant Science, 10, 621. https://doi.org/10.3389/fpls.

Khaki, Wang, & Archontoulis. (2020). A CNN-RNN framework for crop yield prediction. Frontiers in Plant Science, 11, 621. https://doi.org/10.3389/fpls.

Khan, Naseer, Hayat, Zamir, Khan, & Shah. Transformers in vision: A survey. ACM Computing Surveys (CSUR), 54, 1–41. https://doi.org/10.1145/3505244

Kharamat, Wongsaisuwan, & Wattanamongkhol. (2020). Durian ripeness classification from the knocking sounds using convolutional neural network. 2020 8th International Electrical Engineering Congress (iEECON), 1–4. https://doi.org/10.1109/iEECON48109.

Kulkarni, Shivananda, & Sharma. (2022). Explainable AI for computer vision. In Computer Vision Projects with PyTorch: Design and Develop Production-Grade Models (pp. 325–). Springer. https://doi.org/10.1007/978-1-4842-8273-1_10

Lee, Lee, & Song. (2021). Vision transformer for small-size datasets. arXiv preprint arXiv:2112. https://doi.org/10.48550/arXiv.

Lin, Crawford, Guillot, Zhang, Chen, Yuan, Chen, Williams, Minvielle, Xiao, et al. (2023). MMST-ViT: Climate change-aware crop yield prediction via multi-modal spatial-temporal vision transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision, 5774–5784. https://doi.org/10.1109/ICCV51070.

Liu, & Wang. (2021).
Plant diseases and pests detection based on deep learning: A review. Plant Methods, 17, 22. https://doi.org/10.1186/s13007-021-00722-9

Lopes, da Costa, Barbin, Cruz-Tirado, Baeten, & Barbon Junior. (2022). Deep computer vision system for cocoa classification. Multimedia Tools and Applications, 81, 41059–41077. https://doi.org/10.1007/s11042-022-13097-3

Mishra, & Malhotra. A dual approach with Grad-CAM and Layer-Wise Relevance Propagation for CNN models explainability. International Conference on Innovation and Emerging Trends in Computing and Information Technologies, 116–129. https://doi.org/10.1007/978-3-031-80842-5_10

Nahak, Pratihar, & Deb. Tomato maturity stage prediction based on vision transformer and deep convolution neural networks. International Journal of Hybrid Intelligent Systems, 21, 61–78. https://doi.org/10.3233/HIS-240021

Paneru, Paneru, & Shah. (2024). Analysis of Convolutional Neural Network-based image classifications: A multi-featured application for rice leaf disease prediction recommendations