JOIV : Int. J. Inform. Visualization, 9 – March 2025, 690–698

Multi-Head Voting based on Kernel Filtering for Fine-grained Visual Classification

Mutiarahmi Khairunnisa a,b, Suryo Adhi Wibowo a,b,*

a School of Electrical Engineering, Telkom University, Bandung, Indonesia
b Central of Excellence Artificial Intelligence for Learning and Optimization (CoE AILO), Telkom University, Bandung, Indonesia

Corresponding author: *suryoadhiwibowo@telkomuniversity.

Abstract— Research on Fine-Grained Visual Classification (FGVC) faces a significant challenge in distinguishing objects with subtle differences, since large intra-class variations and strong inter-class similarities are critical obstacles to accurate classification. To address this complexity, many advanced methods have been proposed that use feature encoding, part-based components, and attention-based mechanisms to support the different stages of classification. Vision Transformers (ViT) have recently emerged as a promising competitor to other, more complex methods in FGVC, mainly because they can capture fine-grained details and subtle inter-class differences with higher accuracy. While these advances have improved various tasks, existing methods still suffer from inconsistent learning performance across the heads and layers of the multi-head self-attention (MHSA) mechanism, which results in suboptimal classification performance. To enhance the performance of ViT, we propose an approach that modifies the convolutional kernel. By exploring an array of kernels, our method considerably improves the model's capacity to identify and highlight the crucial characteristics required for classification. Experimental results show that kernel sharpening outperforms other state-of-the-art approaches in accuracy across several datasets, including Oxford-IIIT Pet, CUB-200-2011, and Stanford Dogs. Our findings show that the suggested approach improves overall classification performance by concentrating more precisely on the discriminative areas inside images. By using kernel adjustments to improve the Vision Transformer's ability to differentiate visually complicated features, our strategy offers a strong response to the problem of fine-grained categorization.

Keywords— Fine-grained visual classification; vision transformer; multi-head self-attention.

Manuscript received 30 Jul.; revised 11 Sep.; accepted 1 Dec. Date of publication 31 Mar. International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.
I. INTRODUCTION

Due to the requirement for highly detailed object identification, computer vision research on fine-grained visual classification (FGVC) has attracted considerable interest. However, improving performance in fine-grained visual classification remains a significant challenge. FGVC is more complex than coarse-grained classification, as shown in Fig. 1 and Fig. 2. The main challenge lies in two factors: (1) the limitation in acquiring training data and (2) the existence of subtle differences between objects in the same class, while objects in different classes may show striking similarities. FGVC requires recognizing small, specific details of objects in images, which are often difficult to distinguish even by human observers because of their visual similarity, such as classifying various species of birds, cats, or dogs. This entails a good knowledge of their specific morphological and textural characteristics.

Different studies have attempted to construct more effective methods to address this complexity, and researchers have adopted deep learning methods to improve accuracy in this task. While some approaches based on CNN architectures perform reasonably well, they still have shortcomings: they lead to high computational cost and noisy outcomes, especially as the number of networks used increases. Three primary types of approaches are now in use: attention-based methods, feature encoding, and part-based methods. Part-based approaches identify discriminative sections of objects and categorize them, whereas feature encoding methods extract high-level features from images for classification. Meanwhile, attention-based techniques employ attention mechanisms to assess how vital specific object components are with respect to one another.

Initially developed for natural language processing (NLP), transformers have also been adapted for use in image recognition. Dosovitskiy et al. introduced the Vision Transformer (ViT), a substantial modification of the transformer design for these tasks, which has proven effective in various object identification, segmentation, and classification tasks. ViT takes segments of image patches and transforms them into patch tokens. Like character sequences in NLP, these tokens are used in a multi-head self-attention mechanism during training. The self-attention mechanism is an appropriate strategy for FGVC, as it effectively extracts and weights information from the full visual map for the classification token.

Nonetheless, when applying the ViT method to FGVC, two primary problems need to be resolved. First, when processing all patch tokens at once, ViT might not adequately draw attention to crucial locations in complicated datasets or in images with crowded backgrounds. Second, ViT's receptive-field extension is limited, which may cause the loss of locally significant information.

Furthermore, TransFG uses the attention weights that ViT has built to try to remove extraneous inputs from the final transformer layer; still, it does not fully use the attention from all transformer levels. Zhang et al. presented the AFTrans technique as a solution to this problem. This technique employs a Siamese design to offer a selective attention module with shared weight parameters. Nevertheless, this approach has limitations in that it draws attention away from highly identifiable local regions, which leads to uncertainty during the training phase over the trustworthiness of the attention map. Mutual Attention Weight Selection (MAWS) is a token selection strategy that Wang et al. suggested for choosing the most informative tokens. FFVT aims to improve feature representation by merging information from several locations and levels in an image. However, applying fixed-size patches introduces noise, which makes the final class token emphasize global information instead of local features across layers. SIM-Trans presented the Structure Information Learning (SIL) module.
It leverages the Multilevel Feature Boosting (MFB) module with self-attention weights to enable contrastive learning and extract robust features. Xu et al. introduced the Internal Ensemble Learning Transformer (IELT) to overcome the uneven learning performance in FGVC. This method selects the necessary tokens, treats each attention head as a weak learner, and assists cross-layer feature learning. However, despite improving the model's ability to process image details, IELT still faces redundancy and noise problems, where irrelevant information reduces the model's efficiency.

In this paper, we propose a modified method that changes the convolutional kernel of the Internal Ensemble Learning Transformer (IELT). This change is motivated by previous research to overcome the redundancy and noise problems and to improve the method's capacity to identify the essential characteristics for classification. We explored and analyzed various kernels to find the one that performs best.

This paper is structured as follows: Section II discusses related works, some literature on the Vision Transformer, and the adopted method. Section III presents experimental results and extensive analysis. Finally, the conclusion is given in Section IV.

Fig. 1 Example of coarse-grained and fine-grained visual classification.

II. MATERIALS AND METHOD

A. Related Works

Fine-grained visual classification research focuses on two areas: local identification and global identification. Local identification selects essential parts of the object and creates intermediate-level representations for final classification. Depending on how bounding box/part annotations are incorporated into the technique, local identification methods can be either strongly or weakly supervised. Strongly supervised learning requires part annotations, while weak supervision uses only image labels. More recently, studies have concentrated on identifying discriminative areas and extracting features for more in-depth visual categorization. Nevertheless, an essential source of classification mistakes is the disregard by many current approaches for the holistic structural information of an object, which is crucial for precisely localizing the complete object.

Fig. 2 These images illustrate the challenges of FGVC, where birds of the same species can differ in color variation and individual appearance (rows 1 and …), while birds of different species can appear very similar (rows 2 and …). One of the most important aspects of FGVC is the ability to distinguish bird species based on fine details in their patterns and colors.

In recent years, FGVC research employing ViT has concentrated on optimizing the ViT architecture to make the most of local and global information. To detect essential patch tokens, for example, TransFG advocated using attention weights in the ViT method and multiplying them before the final transformer layer. While helpful, this method might not work well for complicated datasets or low-resolution images when combining particular tokens with the general classification token.

In contrast, global discrimination methods employ specific distance metrics to learn deep feature embeddings for an entity. Other examples include bilinear methods for learning interaction features between two independent Convolutional Neural Networks (CNNs). The process allows each global feature extracted from the whole input image to interact, helping to learn better or more separable representations for fine-grained classification.
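As a concrete illustration of this bilinear interaction, the following minimal PyTorch sketch (our own illustration with assumed function and variable names, not code from any cited bilinear method) pools the outer product of two CNN feature streams over all spatial locations:

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Pool the outer product of two CNN feature maps over all spatial locations.

    feat_a: (B, C1, H, W) features from the first CNN stream.
    feat_b: (B, C2, H, W) features from the second CNN stream.
    Returns a (B, C1 * C2) bilinear descriptor per image.
    """
    b, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    fa = feat_a.reshape(b, c1, h * w)
    fb = feat_b.reshape(b, c2, h * w)
    # Average of outer products over locations: the channel interaction matrix.
    x = torch.bmm(fa, fb.transpose(1, 2)) / (h * w)          # (B, C1, C2)
    x = x.reshape(b, c1 * c2)
    # Signed square-root and L2 normalization are commonly used with bilinear features.
    x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-12)
    return F.normalize(x, dim=1)
```

A linear classifier trained on the resulting descriptor then performs the fine-grained prediction.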
Global methods prioritize a holistic understanding of objects or images and rarely involve precise localization steps. This differs from local identification methods, which focus on understanding the detailed parts of an object. Recent research in FGVC has established Vision Transformers (ViT) as a front-runner technique through image patching and the transformer architecture. While ViTs work well for various tasks, they fall short in capturing the crucial local features necessary for in-depth classification. To overcome this constraint, various techniques have been devised, such as employing attention maps to mitigate background noise, integrating features from various layers using cross-layer filters, and strengthening feature robustness through fusion techniques that employ graph networks and contrastive learning to recognize and emphasize significant picture characteristics.

B. Vision Transformer Backbone

The Vision Transformer (ViT) is a computer vision method based on the Transformer, which was initially designed for natural language processing (NLP). Patch embedding transforms an image x ∈ R^{H×W×C} into a 2D patch sequence x_p ∈ R^{N×(P²·C)}, where H is the image height, W is the image width, C is the number of channels, and P is the resolution of each image patch. The original image is thus divided into N = HW/P² smaller pieces (patches), each with a resolution of P × P, and the resulting sequence of patches is used in the following process. Patch embedding feeds this information into the transformer encoder through a trainable linear projection. A learnable class (cls) token is prepended to the patch embeddings, and learnable position embeddings are added to encode positional information. This approach follows BERT, which only utilizes the final representation associated with the class token (the output of the transformer) in the classification layers. The procedure is shown in Eq. (1):

z_0 = [x_class; x_p^1 E; x_p^2 E; …; x_p^N E] + E_pos        (1)

Layer normalization (LN), multi-head self-attention (MHSA), and multi-layer perceptrons (MLP) are integral to the transformer encoder. The input from the previous stage is normalized through layer normalization, and the encoder's MLP block consists of two fully connected layers. The output of the l-th layer can be expressed as Eq. (2):

z'_l = MHSA(LN(z_{l−1})) + z_{l−1},   z_l = MLP(LN(z'_l)) + z'_l,   l = 1, …, L        (2)

The MLP head is responsible for the final categorization stage; it operates on the transformer encoder's output and produces the classification result.
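To make Eqs. (1)–(2) concrete, the following minimal PyTorch sketch shows a patch embedding and one pre-norm encoder layer. It is a simplified illustration with assumed default hyperparameters (448 × 448 input, 16 × 16 patches, 768-dimensional embeddings), not the authors' implementation:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and project each patch to D dimensions (Eq. 1)."""

    def __init__(self, img_size=448, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # N = HW / P^2
        # A strided convolution is equivalent to flattening patches + linear projection E.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))     # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # E_pos

    def forward(self, x):                                          # x: (B, C, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)           # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)            # prepend class token
        return torch.cat([cls, tokens], dim=1) + self.pos_embed    # z_0

class EncoderLayer(nn.Module):
    """One transformer encoder layer: MHSA and MLP, each preceded by LN (Eq. 2)."""

    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]          # z'_l
        return z + self.mlp(self.norm2(z))                         # z_l
```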
C. Multi-Head Voting

The multi-head self-attention mechanism efficiently learns complicated correlations between the tokens in an input sequence. During multi-head self-attention, each attention "head" learns its own representation of the input and evaluates the relevance of each token with respect to the others in the sequence, and each head generates its own attention map. This strategy allows the method to obtain more precise and complex information about the relationships between tokens. However, the effectiveness of each attention head in identifying discriminative regions can vary. A technique introduced by Xu et al. is used to address this variability and enhance performance. Inspired by ensemble learning, notably the bagging algorithm, the method treats each attention head in the multi-head self-attention (MHSA) mechanism as a weak learner. The multi-head voting (MHV) module aims to selectively collect tokens from the various attention heads to improve the detection of discriminative areas inside each layer.

Assume that the l-th layer of the transformer (where l is between 1 and L) has input and output tokens z_in^l and z_out^l (i.e., z_{l−1} and z_l in Eq. (2)), respectively. We refer to the collection of attention scores for the class token as A_l = [a_l^1, a_l^2, …, a_l^h], where a_l^i represents the attention score of the i-th head taken from the MHSA-generated attention map. To select valuable tokens based on the attention scores A_l, a score map S_l is generated as

S_l(i, j) = 1 if A_l(i, j) is among the top-O scores of the i-th head, and 0 otherwise,        (3)

where O is a hyperparameter that controls the number of votes per head. To produce the total score map, the votes in S_l are accumulated over all heads, and the tokens that receive the most votes in each layer are selected.
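The head-wise voting can be sketched as follows. This is our own simplified illustration (tensor names and shapes are assumptions); it omits the convolution step inside the MHV module, the component whose kernel is modified in this paper, whose exact placement follows the IELT design. The default of 24 votes per head matches the setting reported in the implementation details:

```python
import torch

def multi_head_vote(attn_cls: torch.Tensor, votes_per_head: int = 24) -> torch.Tensor:
    """Accumulate top-O votes from every attention head for each patch token.

    attn_cls: (B, h, N) attention scores of the class token over N patch tokens,
              one row per head (A_l in the text).
    votes_per_head: the hyperparameter O controlling how many tokens each head votes for.
    Returns: (B, N) vote counts, i.e. the total score map.
    """
    b, num_heads, n = attn_cls.shape
    votes = torch.zeros(b, n, device=attn_cls.device)
    top_idx = attn_cls.topk(votes_per_head, dim=-1).indices        # (B, h, O)
    for head in range(num_heads):
        # Each head casts one vote for each of its top-O tokens (S_l in the text).
        votes.scatter_add_(1, top_idx[:, head],
                           torch.ones(b, votes_per_head, device=attn_cls.device))
    return votes

# Tokens with the highest accumulated votes are then selected in each layer, e.g.:
# selected = votes.topk(k, dim=-1).indices
```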
D. Cross-Layer Refinement

Equation (2) describes how the l-th layer's output z_out^l is obtained using the standard transformations. The cross-layer features are retrieved from the class tokens of these outputs, denoted z_out^l(class); these features carry valuable information from several layers. To obtain refined features, we process the layer outputs with the MHV module, without size-altering operations or additional convolutions, and record the indices of the selected tokens as idx, where K is the number of refined tokens. Furthermore, an additional transformer layer, the (L+1)-th layer, is used to extract refined features from the refined tokens. However, to minimize noise effects and quality degradation, the refined tokens are not fed into the (L+1)-th layer on their own. Instead, the class tokens gathered across layers are combined with the refined tokens, so that the input to the (L+1)-th layer is the concatenation of the class tokens and the selected refined tokens, which yields a cleaner and more informative input for that layer.

To improve the final prediction, a logit assistance operation is utilized, which makes use of the earlier predictions. The preceding prediction p is computed from the cross-layer features by a fully connected (FC) layer followed by a SoftMax operation:

p = SoftMax(FC(LN(z_class)))

where z_class denotes the cross-layer class tokens. During training, the cross-entropy loss is applied to both p and the final prediction p̂ using the ground-truth labels y and a balance parameter α, so the loss function L is defined as

L = α · CrossEntropy(p, y) + (1 − α) · CrossEntropy(p̂, y)

The CLR module's integration of cross-layer and refined features helps to minimize noise and improves the feature representation for efficient classification.

E. Dynamic Selection

Inspired by the boosting algorithm, Xu et al. proposed a dynamic selection (DS) module that treats each transformer layer as a weak learner. The DS module is a crucial part of the Internal Ensemble Learning Transformer (IELT): it controls how many tokens to keep from each transformer layer according to the importance of that layer for the final feature quality. The contribution of each layer to the final prediction is determined by comparing the number of tokens chosen from each layer in the Cross-Layer Refinement (CLR) module. For the l-th layer, the number of selected tokens is recorded as e(l), and the incremental selection ratio of that layer is

r̂(l) = e(l) / n

where n is the total number of tokens selected across all layers. Layers that make a large contribution obtain a higher selection ratio, while layers that make a small contribution obtain a lower one. For each transformer layer l, the selection ratio is then updated from its previous value r(l) using the latest contribution r̂(l):

r(l) ← (1 − β) · r(l) + β · r̂(l)

where β is the moving rate that governs how strongly the selection ratio is updated. The number of tokens selected from each layer is computed from the updated selection ratio as

k(l) = M · r(l)

where M is the total number of tokens to be selected from all layers. The tokens selected from each layer occupy a dedicated interval, which ensures that each layer contributes a specific number of tokens and that there is no overlap between layers. The interval is determined by an initial index s(l) and a final index f(l):

s(l) = Σ_{i=1}^{l−1} k(i),   f(l) = s(l) + k(l)

where s(l) is 0 for l = 1 and otherwise equals the number of tokens contributed by the previous layers. With dynamic selection, the system automatically increases the number of tokens selected from better-performing layers, improving the resulting feature representation while reducing the influence of noise and less relevant features.
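The ratio update and token-budget bookkeeping described above can be summarized in a few lines. This is a schematic NumPy sketch with assumed variable names (the moving-rate value and rounding behaviour are illustrative, not the authors' implementation); the default budget of 126 tokens matches the setting reported in the implementation details:

```python
import numpy as np

def update_selection_ratios(ratios, selected_per_layer, moving_rate=0.1):
    """Update each layer's selection ratio r(l) toward its observed contribution.

    ratios:             current selection ratios, shape (L,), summing to ~1.
    selected_per_layer: e(l), how many refined tokens came from each layer this round.
    moving_rate:        beta, how strongly the new contribution overrides the old ratio.
    """
    contributions = selected_per_layer / max(selected_per_layer.sum(), 1)   # r_hat(l) = e(l)/n
    return (1.0 - moving_rate) * ratios + moving_rate * contributions       # r(l) update

def allocate_tokens(ratios, total_tokens=126):
    """Turn ratios into per-layer budgets k(l) = M * r(l) and non-overlapping index intervals.

    Any rounding remainder is simply ignored in this sketch.
    """
    budgets = np.floor(total_tokens * np.asarray(ratios)).astype(int)        # k(l)
    starts = np.concatenate(([0], np.cumsum(budgets)[:-1]))                  # s(l)
    ends = starts + budgets                                                  # f(l) = s(l) + k(l)
    return budgets, list(zip(starts, ends))
```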
III. RESULTS AND DISCUSSION

A. Datasets

The fine-grained benchmark used in this study consists of three datasets: CUB-200-2011, Stanford Dogs, and Oxford-IIIT Pet. These datasets are commonly used for testing fine-grained classification algorithms and are the datasets utilized in the IELT approach. Furthermore, all three provide a range of visual challenges that can be used to assess the robustness of a system, ultimately leading to a more precise and reliable visual method. CUB-200-2011, a fine-grained dataset created exclusively for bird classification, includes not only bird labels but also bounding boxes and part annotations, which are valuable for accurate classification. The Stanford Dogs dataset contains images of 120 dog breeds, including 12,000 training and 8,580 testing images. The Oxford-IIIT Pet dataset comprises images of cats and dogs from 37 distinct breeds, with around 200 images per class. Table I summarizes these datasets.

[Table I — Fine-grained datasets. Columns: Dataset, Class, Training, Testing; rows: CUB-200-2011, Stanford Dogs, Oxford-IIIT Pet.]

B. Implementation Details

Images were resized to 448 × 448 pixels and fed to the ViT-B_16 backbone network, which was pretrained on the ImageNet-21k dataset. The images underwent random cropping, horizontal flipping, and color adjustment for training, while central cropping was used for testing. The method was optimized using stochastic gradient descent (SGD) with a momentum of 0.9 and cosine annealing for learning rate scheduling. The initial learning rate was set to 0.002 for the Stanford Dogs dataset and 0.02 for the other datasets. Training spanned 50 epochs with a batch size of 8 for all datasets. The implementation was carried out in PyTorch on an NVIDIA DGX100 server, with top-1 accuracy as the evaluation metric for all experiments. The experimental conditions for the IELT approach remain the same as in the original paper, so that any changes in the outcomes arise from testing with various kernel types. The MHV module sets 24 as the maximum number of votes for each head. The loss proportion is set to 0.4, and the number of refined tokens is 24 in the CLR module. The DS module's selection ratio per layer is initialized to 1/L, and the total number of selections is set to 126. Because of the domain gap between the pre-training dataset and the more specialized fine-grained datasets, the DS module was not employed during the first 10 epochs, during which low-level characteristics are more helpful.

C. Comparison of the Type of Enhanced Convolution Kernel

This study conducted experiments using various convolution kernels in the MHV module. Table II shows the results obtained with several kernels on the Stanford Dogs dataset.

[Table II — Comparison result on the type of enhanced convolution kernel on the Stanford Dogs dataset. Columns: Kernel Type, Kernel Size, Accuracy (%); kernel types include Gauss-like, Laplacian, Box, Linear, Sharpening, and Modified Laplacian.]

The experimental results show that convolution kernels of different types and sizes give different classification accuracies. For instance, the 3 × 3 Gaussian kernel yields slightly higher accuracy than its 5 × 5 counterpart, which suggests that a Gaussian kernel with a smaller size is more effective in improving classification accuracy on this dataset. Table II also shows that some kernel types, such as Laplacian, Sharpening, and Modified Laplacian, achieve higher accuracy than the others. Compared to the Gaussian kernel, the sharpening kernel delivered superior accuracy by 0.213% and 0.165%, reaching 92.031% for the 3 × 3 size and 91.935% for the 5 × 5 size. This shows that the sharpening kernel's ability to improve image clarity and contrast contributes significantly to the improved performance and makes it easier for the method to distinguish the features important for classification.
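For reference, the sketch below lists common 3 × 3 instances of some of the kernel families compared in Table II and shows how such a kernel can be convolved with a 2D map. The coefficients are standard textbook choices and the application point inside the MHV module is an assumption based on the IELT design, so the paper's exact kernels may differ:

```python
import torch
import torch.nn.functional as F

# Common 3 x 3 examples of kernel families compared in Table II
# (illustrative textbook coefficients; the paper's exact kernels may differ).
KERNELS = {
    "gaussian":   torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0,
    "box":        torch.ones(3, 3) / 9.0,
    "laplacian":  torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]),
    "sharpening": torch.tensor([[0., -1., 0.], [-1., 5., -1.], [0., -1., 0.]]),
}

def filter_score_map(score_map: torch.Tensor, kernel_name: str = "sharpening") -> torch.Tensor:
    """Convolve a batch of 2D score maps with one of the kernels above.

    score_map: (B, H, W), e.g. a per-layer vote map reshaped onto the patch grid.
    """
    k = KERNELS[kernel_name].view(1, 1, 3, 3)
    x = score_map.unsqueeze(1)                      # (B, 1, H, W)
    out = F.conv2d(x, k, padding=1)                 # same-size filtering
    return out.squeeze(1)
```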
D. Comparison with the State-of-the-Art

In this section, we use the IELT method with a 3 × 3 sharpening kernel, based on the accuracy comparison in Table II. Methods trained with kernel sharpening achieve higher classification accuracy because they can better spot important patterns and object attributes in the picture. The improvement in accuracy achieved by kernel sharpening corroborates the effectiveness of this approach, reinforcing the method's ability to deal with spatially complex image perturbations and reducing the effect of the noise frequently encountered in images. The MHV module is therefore more effective when kernel sharpening is used as its convolution kernel.

1) Result on CUB-200-2011: Table III compares the classification accuracy of the different techniques applied to the CUB-200-2011 dataset.

[Table III — Comparison result on the CUB-200-2011 dataset. Columns: Method, Backbone, Accuracy (%); compared methods include ViT, AFTrans, FFVT, TransFG, SIM-Trans, HAVT, MP-FGVC, IELT, and the proposed KR-MHV.]

The data clearly show that the suggested approach outperforms existing state-of-the-art (SOTA) techniques. With kernel sharpening, the proposed approach obtains the highest accuracy of 91.…%, a noteworthy improvement over the plain Vision Transformer (ViT). Compared to TransFG, FFVT, and AFTrans, our method's accuracy improvement is 0.2%, 0.3%, and 0.…%, respectively. Furthermore, it is 0.1% higher than the HAVT, MP-FGVC, and SIM-Trans approaches. This increase is comparable to other sophisticated methods, even though the gain over the baseline IELT is modest, since the improvement was achieved only by adjusting the convolution kernel while the whole method architecture remained unchanged. This implies that small refinements, such as fine-tuning the convolutional kernel, can significantly influence the method's performance without requiring structural adjustments. On the CUB-200-2011 dataset, kernel sharpening helps improve classification accuracy: it enables the technique to produce more accurate predictions by clarifying and refining the features created by the convolution process, which indicates that applying kernel sharpening both improves classification accuracy and enhances overall method performance.

2) Result on Stanford Dogs: Table IV compares classification accuracy on the Stanford Dogs dataset.

[Table IV — Comparison result on the Stanford Dogs dataset. Columns: Method, Backbone, Accuracy (%); compared methods include ResNet-50, FDL, API-Net, ViT, TransFG, HAVT, MP-FGVC, FFVT, IELT, and the proposed KR-MHV.]

The KR-MHV method outperformed the baseline IELT by 0.2%, achieving the highest accuracy of 92.0% on the Stanford Dogs dataset. KR-MHV exhibits clear performance gains over the other top methods, indicating its strength in fine-grained image categorization. In particular, KR-MHV shows a 0.5% increase in accuracy compared to the FFVT method, which reached 91.5%. Likewise, KR-MHV exhibits a noteworthy 1.4% accuracy gain over the TransFG method, which attained 90.6% accuracy. Additionally, KR-MHV performs 1.7% better than the API-Net method and 1.0% better than the HAVT and MP-FGVC methods, each with an accuracy of 91.0%. These results demonstrate KR-MHV's improved performance in fine-grained image classification and highlight the contribution of kernel sharpening to its increased performance.
Fig. 4 Visualization results from our method on each dataset. The first column shows the input image. The second and third columns display the attention maps generated by the baseline IELT and by our modified kernel, respectively. The fourth and fifth columns show the tokens selected by the baseline IELT and by our modified kernel, respectively. The sixth column indicates the locations where the MHV module selects tokens in the image.

Fig. 4 provides a qualitative comparison between the two methods. The attention map produced by the baseline approach is less concentrated, which implies that the baseline may take less critical information or noise in the image into account. In contrast, the attention map generated by the KR-MHV method is more defined and concentrated, which suggests that KR-MHV is more successful at focusing on the regions crucial for classification. With more accurate and concentrated attention maps on discriminative regions, our KR-MHV method outperforms the reference method: more relevant tokens can be chosen by the MHV module for classification, improving prediction accuracy and reliability. Overall, the visualization results show that our proposed method, which builds on the reference method, yields a more concentrated attention map on the significant regions within the image.

3) Results on Oxford-IIIT Pet: A comprehensive comparison of the classification accuracy obtained by different methods on the Oxford-IIIT Pet dataset is presented in Table V.

[Table V — Comparison result on the Oxford-IIIT Pet dataset. Columns: Method, Backbone, Accuracy (%); compared methods include SEER, NAC, OPAM, ViT, CvT, TNT-B, Bamboo, IELT, and the proposed KR-MHV.]

It is evident from the results that the other state-of-the-art (SOTA) methods were not as successful on this dataset as the KR-MHV method, which achieved the highest accuracy of 95.…%. This remarkable performance underscores the effectiveness of the KR-MHV method in improving classification on the Oxford-IIIT Pet dataset. Compared to the IELT method, which obtained an accuracy slightly lower than KR-MHV's, the suggested approach shows a 0.2% improvement. Even though this difference might not seem large, it matters in the context of high-performance methods, where small improvements can have a significant impact. Our method's accuracy improvement over Bamboo, TNT-B, and SEER is 0.…%, 0.4%, and 10.1%, respectively. Moreover, the KR-MHV method achieves a striking 1.5% greater accuracy, demonstrating a considerable performance advantage over the NAC, OPAM, and ViT methods. This difference highlights the efficacy of the KR-MHV approach and of the kernel sharpening method, which helps explain why it performs so well in this classification challenge.

IV. CONCLUSION

This study shows that by changing the convolution kernel of the multi-head voting module, the Internal Ensemble Learning Transformer (IELT) can improve its performance on the Fine-Grained Visual Classification (FGVC) task. In particular, kernel sharpening can improve classification accuracy for FGVC. On the evaluated datasets, the proposed KR-MHV method outperforms other state-of-the-art methods and obtains the highest accuracy. However, compared to the original method, the accuracy improvement is small because only the convolution kernel is modified. The visualization results demonstrate how KR-MHV improves prediction reliability, minimizes noise, and focuses attention on critical discriminative regions. This approach improves accuracy and strengthens the method's ability to handle complex image variations. To evaluate the generality and transferability of the proposed approach, we will conduct further tests on additional datasets and explore the integration of advanced combining strategies with other FGVC techniques in future research. To enhance training methods and overall interpretability, a better understanding of the method's mechanism and internal representations will be pursued by extending the visualization analysis.

ACKNOWLEDGMENT