JOIV : Int. Inform. Visualization, 9. - January 2025 1-12
INTERNATIONAL JOURNAL ON INFORMATICS VISUALIZATION
journal homepage : w. org/index. php/joiv

Enhancing Vision-Based Vehicle Detection and Counting Systems with the Darknet Algorithm and CNN Model

Abdul Haris Rangkuti a,*, Varyl Hasbi Athala a
a Computer Science Department, School of Computer Science, Bina Nusantara University, Bandung Campus, Jakarta, Indonesia
Corresponding author: *rangku2000@binus.

Abstract: This study focuses on developing an algorithm that accurately calculates the volume of vehicles passing through a busy crossroads in Indonesia using object recognition. The high density of vehicles and their proximity often pose a challenge when distinguishing between vehicle types using a camera. Therefore, the proposed algorithm is designed to assign a unique identity (ID) to each vehicle and other objects, such as pedestrians, ensuring that volume calculations are not repeated. The objective is to provide an equitable comparison of road density and the total number of detected vehicles, enabling the determination of whether the road is congested. To accomplish this, the algorithm incorporates the Non-Max Suppression function, which displays bounding boxes around objects with confidence values and counts the objects within each box. Even when objects are close together, the algorithm tracks them effectively, thanks to the support of the Darknet Algorithm. The main capabilities of this algorithm for improving vehicle detection include enhanced accuracy, speed, and generalization ability. Typically, it is used in conjunction with the You Only Look Once (YOLO) object detection framework. Five convolutional neural network models are tested to assess the algorithm's accuracy: YOLOv3, YOLOv4, CSResNext50, DenseNet201-YOLOv4, and YOLOv7-tiny. The training process utilizes the Darknet Algorithm. The best-performing models,
YOLOv3 and YOLOv4, achieve exceptional accuracy and F1 scores of up to 99%. They are followed by CSResNext50 and DenseNet201-YOLOv4, which achieve accuracy rates of 92% and 98% and F1 scores of 94% and 98%, respectively. The YOLOv7-tiny model achieves an accuracy rate and F1 score of 86% and 88%, respectively. Overall, the results demonstrate the algorithm's success in accurately detecting and calculating the volume of vehicles and other objects in a busy intersection. This makes it a valuable tool for regional government decision-making.

Keywords: Volume; object recognition; CNN; F1 Score.

Manuscript received 12 Feb. revised 6 Apr. accepted 3 Jun. Date of publication 31 Jan.
International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

I. INTRODUCTION
The government has demonstrated its commitment to developing infrastructure for road construction throughout Indonesia. By providing transportation facilities, the government aims to facilitate interaction between local communities and their surrounding environment, encompassing social, economic, and cultural aspects. Roads are crucial in accommodating various vehicles and pedestrians, including cars, public transport, trucks, bicycles, and motorbikes. They have become an indispensable component of transportation systems. Modern society faces serious problems with transportation systems, including but not limited to traffic congestion, safety, and pollution. Information communication technologies have gained increasing attention and importance in modern transportation systems. To address this, transport authorities have increasingly turned to CCTV cameras to monitor traffic flows and gather valuable data for various purposes. One such application is automatic vehicle classification, which involves specialized software identifying different types of vehicles (small, medium, and large) in recorded footage. This technology offers numerous benefits, from optimizing traffic management to informing infrastructure planning.

Different current road loads can cause inefficiency in using lanes at the intersection. Traffic regulation at the intersection governs each group of vehicle movements so that they can move alternately and do not interfere with each other or disrupt existing flows. However, traffic lights in urban areas are still less effective due to the unbalanced volume of vehicles. All traffic flow values (per direction and total) are converted into passenger car units using the passenger car equivalent, which is derived from each type of vehicle as follows:
a. Light vehicles (LV), namely two-axle, 4-wheeled motorized vehicles with 2.0 m (including passenger cars, microbuses, pick-ups, and small trucks).
b. Heavy vehicles (HV), namely motorized vehicles with more than 3.5 m and typically with more than four wheels (including buses, two-axle trucks, three-axle trucks, and combination trucks).
c. Motorcycles (MC), namely two or three-wheeled motorized vehicles.

Traffic volume is the number of vehicles that pass a certain point or line. Vehicles are typically classified into several types, including heavy vehicles, light vehicles, motorcycles, and non-motorized vehicles. The traffic volume on a road varies over time, forming a traffic flow pattern. Traffic flow patterns indicate changes in traffic volume over a given period. In practice, traffic flow patterns reveal peak and non-peak hours and their intervals. Density is the number of vehicles per unit length of the road (vehicles/km). Density can be observed from aerial photos and Closed-Circuit Television (CCTV) installed at several intersection points. Describing the short-term traffic flow is essential for studying intelligent transportation systems.
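The two quantities defined above, traffic volume in passenger car units and density in vehicles/km, can be illustrated with a minimal sketch. The equivalence factors in `EXAMPLE_PCE` are hypothetical placeholders chosen only for illustration (the text does not list the actual values), and the function names are assumptions of this sketch.

```python
from typing import Dict

# Hypothetical passenger-car-equivalent factors, for illustration only.
# The source does not state the values; substitute those of the applicable standard.
EXAMPLE_PCE: Dict[str, float] = {"LV": 1.0, "HV": 1.3, "MC": 0.4}


def volume_in_pcu(counts: Dict[str, int], pce: Dict[str, float]) -> float:
    """Convert per-class vehicle counts into a single volume in passenger car units."""
    return sum(counts[vehicle_class] * pce[vehicle_class] for vehicle_class in counts)


def density(vehicle_count: int, road_length_km: float) -> float:
    """Density as the number of vehicles per kilometre of road."""
    if road_length_km <= 0:
        raise ValueError("road_length_km must be positive")
    return vehicle_count / road_length_km


if __name__ == "__main__":
    counts = {"LV": 10, "HV": 2, "MC": 5}
    print(volume_in_pcu(counts, EXAMPLE_PCE))  # weighted volume in passenger car units
    print(density(30, 0.5))                    # vehicles per km on a 500 m segment
```

Passing the factors as a parameter keeps the conversion independent of any particular table of equivalents.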
Knowing beforehand the real-time density of a road or an intersection could make the road less crowded, because drivers can avoid potentially high traffic. Detecting vehicle objects is the first step in obtaining traffic flow information at the intersection. Object detection aims to get the location and classification of objects from an image. The goal is to acquire the features of the object. In this study, observations are made of six object classes around the intersection. The object classes include cars, trucks, public transportation, bicycles, motorcycles, and people who are at the intersection location. Five convolutional neural network models are evaluated on these objects during the testing process. However, the models undergo a training process using Darknet before detecting the targeted objects.

This paper aims to use object detection on traffic in urban areas and to determine experimentally which convolutional neural network models are best suited for this case. The crossroads used as the experimental site of this research are in the Bandung area, West Java province. Fig. 1 shows four intersections with dense characteristics, such as the intersection of the Buah Batu and Batu Nunggal highways, during a test experiment using the Darknet framework. At the depicted intersection, the volume of each object class is calculated automatically. Knowing the volume of objects for each class can also improve the supervisory function of vehicle objects while supporting local government decision-making. The idea is to achieve fast and timely traffic flow that can effectively reduce traffic jams, reduce accidents, and provide a comfortable traffic environment.

The traffic conditions at the intersection are described by traffic volume data taken during peak hours in the Bandung city area. The traffic volume data used as research material was taken from several intersections in the city of Bandung.
At this intersection, every day, there is a tremendous amount of traffic. For this reason, regulating traffic lights at crossroads is needed to keep vehicles moving and accommodate every road user. The problem with regulating the traffic system using a fixed-time model is that it can make changes in traffic density unpredictable.

Fig. 1 Intersection areas at four locations during an experiment using the Darknet framework

II. MATERIALS AND METHODS
Related Works
An intelligent traffic light control system must be implemented dynamically with real-time traffic. Studies using deep reinforcement learning techniques for traffic light control show reasonably good results. Ultimately, using smart transportation (e.g., smart traffic lights) will make trips more comfortable and efficient and help avoid congestion on one side of the road. In general, Intelligent Transportation System (ITS) applications have become an essential component and have been widely implemented in smart cities to overcome the limitations of traditional transportation systems. The existing traffic light control system divides the traffic light signals into fixed durations and operates inefficiently. The need for ITS has become a concern in recent years. In addition, with the rapid development of vehicle computing hardware, vehicle sensor systems, and city-wide infrastructure, many such applications continue to be developed, such as Vehicular Cloud (VC) and intelligent traffic control. Traffic demand forecasting is essential for transport management and public safety. Still, it is very challenging because of the complex spatial-temporal dependence and consequent uncertainties created by the road network and traffic conditions. Traffic flow prediction is the central part of ITS research. Road traffic data shows the same trend on successive days.
Accurate traffic flow prediction helps ensure public safety and ease traffic jams. The increasing demand for faster travel, severe traffic congestion, and its adverse impact on traffic safety and environmental conditions have attracted significant attention from countries worldwide. Expanding road capacity is difficult due to the limited land resources, construction costs, and time-consuming processes. Furthermore, a highway expansion project cannot wholly and effectively solve this problem. In addition, potential traffic demand is also generated due to increased vehicle traffic capacity. For this reason, the research is focused on knowing the volume of vehicles in an area so that it will be an input for local governments to find appropriate and efficient solutions in dealing with increasingly severe traffic jams.

This study detects vehicle volume using artificial intelligence technology through machine learning. The problem that necessitates the development of object detection applications is the need to accurately determine road density, particularly at intersections, in real time. By addressing the density issue, drivers can avoid road congestion and reduce the likelihood of encountering heavy traffic. Several additional benefits are associated with an application capable of detecting the density of vehicle objects. Firstly, drivers can actively seek alternative routes to avoid being trapped in time-consuming traffic jams, thus saving valuable travel time. Moreover, this technology empowers drivers to proactively circumvent potential high-traffic areas, further enhancing their ability to avoid congestion. Target detection technology, as one of the core technologies in computer vision, provides basic technical support for many tasks, such as target tracking, semantic segmentation of vehicles under heavy traffic conditions, discovering in-vehicle alcohol to prevent road accidents, and detecting cyberattacks on autonomous vehicles.

The typical approach for determining the number of vehicles traversing a highway involves employing detection and tracking techniques. By analyzing the tracking trajectories of vehicles, it becomes possible to calculate the total count of vehicles passing through a specific area. The vehicle calculation process based on detection and tracking methods can be subdivided into background reduction and DNN-based methods. Background reduction technology is used to design the background model and extract the moving vehicles in the videos. Several morphological operations are usually applied to the vehicle segment to count the traffic vehicles. The background model is specifically used for a limited region within the video frame. Subsequently, morphological processing is applied to the extracted target to amplify its features and mitigate the impact of obstructing vehicles. Various techniques are employed for detecting moving objects. The process involves several post-processing steps, which are crucial in establishing optimal thresholds for distinguishing between foreground and background. These steps significantly enhance the detection accuracy achieved through this technique. However, identifying and adapting a suitable threshold, particularly in environments with limited visibility, has proven unsuccessful thus far.

One comprehensive review paper delves into deep learning-based object detection frameworks, thoroughly reviewing their capabilities and advancements. Recognizing the diverse nature of specific detection tasks, the review extends to a brief survey of notable tasks such as salient object detection, face detection, and pedestrian detection. By analyzing the unique characteristics of each task, it aims to provide valuable insights into the evolving landscape of object detection within the realm of deep learning. The utilization of deep learning object detection algorithms, specifically designed for analyzing 2D images, has emerged as a formidable force in road object detection within autonomous driving. The remarkable success achieved by deep learning methods in the context of road vehicle detection is indisputable. These advancements have solidified the critical role of deep learning in enhancing the accuracy and efficiency of road object detection and paved the way for unprecedented progress in autonomous driving. Nevertheless, rapidly and accurately detecting and classifying vehicles faces challenges arising from the limited spacing between vehicles on the road and interference features in photos or video frames containing vehicle images. A novel vehicle detection and classification model has been developed by optimizing the YOLOv4 model to address this. This model incorporates an attention mechanism that effectively suppresses image interference features by considering both channel and spatial dimensions.

The CNN model detects moving vehicles using various techniques. One common approach is frame difference, where the model compares consecutive frames in a video to identify the differences in object positions. When an object, such as a vehicle, moves in a video, its location changes from frame to frame. By detecting these changes, the model can identify the presence of moving vehicles. Another method involves training a deep learning model specifically for object detection. This approach requires labeled data to train the model, either by collecting and annotating a dataset or by fine-tuning a pre-trained model on specific data. The trained model can then detect moving vehicles in videos.

This research focuses on detecting street objects using camera surveillance. There are six objects in focus: cars, motorcycles, trucks, bicycles, humans, and public transportation. These objects are likely to be seen throughout the city streets of Indonesia. This experiment uses road video footage from CCTV surveillance cameras in Bandung City. The video quality and lighting of the footage are not the focus. Therefore, the footage is used as obtained from its source. There are three phases of the experiment: the preparation, training, and testing phases. Each phase uses a different method of processing. This research also compares five different convolutional neural network (CNN) models. The five CNNs are You Only Look Once Version 3 (YOLOv3), Version 4 (YOLOv4), Version 7 tiny (YOLOv7-tiny), CSResNext50-Panet-SPP, and DenseNet201-YOLOv4. These CNNs were chosen to differentiate this study from previously published work. Using CNN models for object detection and comparing them to determine the best performance has been tried in some earlier work. Therefore, the main topic of this research is to compare the performance of the five different CNN models for each street location. A more thorough explanation can be seen in the following section.

The preparation phase uses ffmpeg for video cutting, followed by image labeling (Fig. 2). Once the preparation has been completed, the training phase uses the Darknet algorithm to train on the prepared data. Lastly, the testing phase measures the performance and outputs the results using OpenCV. OpenCV is an open-source library of programming functions mainly for image processing. OpenCV was chosen over Darknet for testing because it allows the appropriate algorithm for the experiment to be written.

Preparation Phase
Fig. 2 shows the preparation phase diagram. In the first step, a system was created to detect the objects within a CCTV frame, using images obtained from a road CCTV video. These videos were mainly 44 seconds long, and five public CCTV videos were collected from the Area Traffic Control System of Bandung DISHUB website.

TABLE I IMAGE LABEL SAMPLE AND CATEGORIES IN ENGLISH AND INDONESIAN
Category (English)       Category (Indonesian)
Car                      Mobil
Motorcycle               Motor
Truck                    Truk
People                   Orang
Bicycle                  Sepeda
Public Transportation    Angkot

Once collected, frames are extracted every few seconds of video and saved as new images using FFMPEG. These images are used to train the machine. Each image requires labeling so that the machine can understand it. Image labeling is constructing a map of visual features with semantic and spatial labels that describe the objects in the image. Image labeling has the function of teaching the machine to understand the given image. This process outputs a class definition text file and an image label text file for each image. These files are the references for the machine during training to differentiate between objects and non-objects.

This experiment needs pre-trained convolutional weights and configuration files for each CNN model, in addition to the image labels and training data. After completing the preparation process, there will be four types of data.

Dataset Setup
There are 43 combined street images extracted from the obtained road CCTV videos. These images have sizes ranging from 516 to 832 kilobytes each. They also have roughly the same dimensions and aspect ratio. This research does not apply any preprocessing methods to the images. However, all images have sufficient lighting and contrast for the experiment. Therefore, the machine learns from unprocessed image data to detect real-world objects. Each image has one label text file that contains data on the roadway objects in the image. The dataset also contains one class definition text file. These files are created during the labeling process by a program called labeling. Therefore, in total, there are 87 files in the dataset for the training phase. These data are important to help the machine understand objects in the image. This research used CCTV video from one of the obtained video collections for the testing phase.
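As a rough illustration of the label files described above, the sketch below parses one image label text file and checks the dataset file count. It assumes the plain-text layout commonly used with Darknet (one line per object: class index, then box centre and size normalized to the image dimensions); the source does not spell out the file layout, and the `Box` type, the function names, and the class order in `CLASS_NAMES` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical class order; the real order is whatever the class definition file lists.
CLASS_NAMES = ["Mobil", "Motor", "Truk", "Orang", "Sepeda", "Angkot"]


@dataclass
class Box:
    class_id: int
    x_center: float  # normalized to [0, 1] by image width
    y_center: float  # normalized to [0, 1] by image height
    width: float
    height: float


def parse_label_text(text: str) -> List[Box]:
    """Parse the contents of one label file (assumed: 'class cx cy w h' per line)."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 5:
            raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
        boxes.append(Box(int(parts[0]), *(float(p) for p in parts[1:])))
    return boxes


def to_pixel_corners(box: Box, img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Convert a normalized centre/size box to (x1, y1, x2, y2) in pixels."""
    x1 = (box.x_center - box.width / 2) * img_w
    y1 = (box.y_center - box.height / 2) * img_h
    x2 = (box.x_center + box.width / 2) * img_w
    y2 = (box.y_center + box.height / 2) * img_h
    return x1, y1, x2, y2


def expected_file_count(num_images: int) -> int:
    """One image and one label file per image, plus one class definition file."""
    return 2 * num_images + 1
```

With 43 images, `expected_file_count(43)` gives the 87 files reported for the training dataset.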
An overview of the training for detecting the volume of vehicles at the intersection using the Darknet algorithm can be seen in Fig. 3.

Fig. 2 Preparation Phase Diagram in detecting the vehicle and People

As stated before, the image labeling process uses labeling as software. This software reads all the images in the given folder, and the bounding box and class name are drawn manually within each image. There are six types of objects that the CNN models must detect. These objects are cars, motorcycles, trucks, people, bicycles, and Indonesian public transportation called angkot. Each CNN model must correctly detect and identify as many of these objects as possible for the object detection system to score highly. Table I shows an example of precisely labeled objects with their correct object category.

Fig. 3 Diagram of Training Phase Using Darknet

The diagram in Fig. 4 describes the training process of an automatic system designed to detect traffic with Darknet algorithms. The aim is to detect and report the traffic volume at road intersections. This research aims to improve the detection of darknet traffic by exploring a series of machine learning and deep learning techniques to classify such traffic and accurately show the related application. The 100- and 1000-iteration marks therefore serve solely to save the training progress at those iteration numbers. An overview of the monitoring process stages for several vehicles at bustling crossroads was obtained using the Darknet algorithm. Some peripherals that support the training process can be seen in Table II.
TABLE II DEVICE SPECIFICATION USED FOR TRAINING

No.   Component     Specification
1     Processor     AMD Ryzen 5 5600H Up To 4. GHz
2     RAM           16 GB DDR4
3     GPU           RTX 3060 Mobile 6GB
4     Disk Space    5 TB

Training Phase
In Fig. 4, the data serves as the foundation for the training process. The training process was conducted five times as a part of this experiment, as five different CNN models were used. Each of these CNN models possesses its own unique architecture: YOLOv3, YOLOv4, CSResNext50, YOLOv7-tiny, and DenseNet201-YOLOv4. Only these five CNN models were carried through the full training process; several other models were attempted but did not yield satisfactory accuracy. By employing a variety of CNN models, it becomes possible to identify the most suitable model to achieve optimal accuracy. Consequently, the number of extracted features and the output size may differ among the models.

As Table II indicates, the training phase for all CNN models was run on the same device, so that differences between models are not caused by the hardware. Table II lists the specifications of the device used in the training stage for the six classes. The device is a personal computer with recent-generation components that can detect the 6 classes of objects at road intersections and count the vehicles passing through them. GPU memory and processor specifications are the dominant requirements for processing the object classes in the training phase.

Testing Phase
The next step after the training process is the testing phase. Fig. 5 explains the testing process stages, which start with inputting data on vehicle objects and people. Next, predictions are made for object detection using one of the five CNN models used in the experiment.
The prediction results are in the form of calculated bounding boxes. Non-max suppression is then performed so that only the most optimal bounding box is displayed for each object, reducing the number of unimportant bounding boxes. The bounding boxes are then drawn and saved for later use by developers or users. Next, the vehicle and person class data are retrieved and the volume calculation function is called for each object. This step is crucial so that the vehicle calculation can be carried out on the frame being processed at that time. Calculating the object volume begins by taking the number of object labels detected in the frame and adding them up. This becomes the vehicle volume data detected in the frame. Then, the algorithm performs object tracking, which begins with giving an ID to the object in question. However, the object tracking process has two "if" cases. If two objects are adjacent to the main object and the main object is new, the algorithm gives the object a new ID. Once object tracking is complete, the object volume is also calculated for each label. This is done for 12 seconds. When the 12-second time limit is reached, the calculation algorithm outputs the model testing performance data, including the traffic volume, the number of objects detected in each category, and the inference time. In addition, the algorithm also generates output images.

Fig. 4 Training Phase Diagram Using Darknet algorithm

However, this discrepancy is not problematic, as the models still adhere to the fundamental CNN architecture. Specifically, CNNs in this case consist of three primary layers: convolutional, pooling, and fully connected layers.

Another study addresses these issues concerning accidents and aims to find solutions to reduce road accidents resulting from traffic-related incidents.
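The non-max suppression step of the testing phase can be sketched as follows. This is a generic greedy formulation written in plain Python, not the authors' code; the function names and the default thresholds are assumptions of this sketch. Boxes are `(x1, y1, x2, y2)` corner tuples.

```python
from typing import List, Sequence, Tuple

BoxT = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: BoxT, b: BoxT) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: Sequence[BoxT],
    scores: Sequence[float],
    score_threshold: float = 0.5,  # assumed value, not taken from the source
    iou_threshold: float = 0.4,    # assumed value, not taken from the source
) -> List[int]:
    """Return indices of the boxes kept, highest confidence first."""
    # Drop low-confidence boxes, then visit the rest in descending score order.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

In an OpenCV-based pipeline this step is commonly delegated to `cv2.dnn.NMSBoxes`; the source does not state which implementation was used.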
The main challenge faced in computer vision lies in obtaining effective results when dealing with variations in data shapes and colors.

The training process saves the last weight every 100 iterations to prevent complete data loss if something happens. After 1000 iterations, the final weight files were created. There are five final weight files, one for each of the five CNN models.

The output images contain information about the traffic volume at that time, the number of objects that appear in each category, and the resulting bounding boxes.

Precision and Recall: Precision measures the ability of a model to identify only the targeted objects. Recall measures the ability of a model to find all the targeted objects. Precision is the percentage of correct positive predictions, while recall is the percentage of correct positive predictions among all given ground truths (the total number of objects). To obtain the precision and recall values, each detected bounding box needs to be classified as:
True positive (TP): A correctly detected ground-truth bounding box.
False positive (FP): An incorrect detection of a nonexistent object, or a misplaced detection of an existing object.
False negative (FN): An undetected ground-truth bounding box.
Suppose a dataset with G ground truths and a model that produces N detections, S of which are correct (S ≤ G). The concepts of precision and recall can be formally expressed as follows:

\[ \text{Precision} = \frac{TP}{TP + FP} = \frac{S}{N} \]

\[ \text{Recall} = \frac{TP}{TP + FN} = \frac{S}{G} \]

F1-Score and mean Average Precision: The F1-score is the harmonic mean of precision and recall, which can be expressed as:

\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

The values of the F1-score range from 0 to 1, where 1 means the highest accuracy, reached when both precision and recall are 1, and 0 is obtained if precision or recall (or both) have the value of 0. Mean Average Precision or mAP is the average AP over all classes, which is expressed as the following formula:

\[ \text{mAP} = \frac{1}{C} \sum_{i=1}^{C} AP_i \]

where AP_i is the AP value for the i-th class and C is the total number of classes being evaluated.
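The formulas above translate directly into a few lines of code. The following is a minimal sketch with hypothetical function names; it takes the TP, FP, and FN counts and the per-class AP values as given and does not compute AP itself.

```python
from typing import Dict, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1-score from detection counts.

    Empty denominators are mapped to 0.0 so the function never divides by zero.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_average_precision(ap_per_class: Dict[str, float]) -> float:
    """mAP as the plain average of the per-class AP values."""
    if not ap_per_class:
        raise ValueError("at least one class is required")
    return sum(ap_per_class.values()) / len(ap_per_class)


if __name__ == "__main__":
    # Illustrative counts only, not results from the experiment.
    print(precision_recall_f1(tp=8, fp=2, fn=2))
    print(mean_average_precision({"car": 1.0, "truck": 0.5}))
```

In terms of the notation above, `tp` corresponds to S, `tp + fp` to N, and `tp + fn` to G.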
Intersection over Union: Intersection over Union or IoU is a metric used in object detection to compare the similarity between two bounding boxes: the predicted bounding box and the ground reference bounding box (the box the developer previously labeled). IoU encodes the shape properties of the objects under comparison, such as the widths, heights, and locations of the two bounding boxes, into the region property. Then, it computes a normalized measure focusing on their areas or volumes. Fig. 6 illustrates how IoU works.

Fig. 5 Diagram of Testing Phase Using OpenCV

Measuring Model Performance
Several measurements determine how well a model performs in a specific case. This experiment used five measurements: precision, recall, F1-score, mean average precision, and traffic count. The data collected for these measurements is used to compare the performance of the five chosen CNNs. This comparison determines which CNN has the most optimal performance in the case of CCTV object detection. A more detailed explanation of these calculations is given below.

Fig. 6 Training Graph of Each CNN Model

Fig. 8 illustrates the model's training graph. This graph shows the most unsteady training process. Unlike other training graphs that have a steadier line, this graph shows that the loss percentage tends to increase and decrease every 100 iterations. However, it still resulted in a lower loss percentage overall for every 3000 iterations after the 2400th iteration.

Traffic Count: Traffic count is a crucial measurement used to assess the level of traffic in a video at a specific point in time. This measurement plays a vital role in evaluating the performance of a model in real-world scenarios. The traffic count measurement involves two calculations: the number of objects detected in a particular time span and the number of objects in the last captured frame. In this experiment, 12 seconds were selected, and all objects were counted during this period.
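The two traffic count calculations, with each object assigned a unique ID so that it is counted only once within the time window, can be sketched as below. The matching rule used here (nearest centroid of the same class within `max_distance` pixels) is a simple stand-in chosen for illustration; the source does not specify how detections are matched between frames, and the class and parameter names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Detection = Tuple[str, Tuple[float, float, float, float]]  # (label, (x1, y1, x2, y2))


class CentroidCounter:
    """Counts each object once by giving it an ID that persists across frames."""

    def __init__(self, max_distance: float = 50.0) -> None:
        self.max_distance = max_distance  # assumed matching radius in pixels
        self.next_id = 0
        self.tracks: Dict[int, Tuple[str, Tuple[float, float]]] = {}
        self.totals: Counter = Counter()  # objects counted per class over the window

    def update(self, detections: List[Detection]) -> List[int]:
        """Process one frame and return the ID assigned to each detection."""
        unmatched = dict(self.tracks)
        new_tracks: Dict[int, Tuple[str, Tuple[float, float]]] = {}
        ids: List[int] = []
        for label, (x1, y1, x2, y2) in detections:
            centre = ((x1 + x2) / 2, (y1 + y2) / 2)
            best_id, best_dist = None, self.max_distance
            for track_id, (track_label, track_centre) in unmatched.items():
                dist = math.dist(centre, track_centre)
                if track_label == label and dist <= best_dist:
                    best_id, best_dist = track_id, dist
            if best_id is None:
                # No nearby track of the same class: a new object enters the count.
                best_id = self.next_id
                self.next_id += 1
                self.totals[label] += 1
            else:
                del unmatched[best_id]
            new_tracks[best_id] = (label, centre)
            ids.append(best_id)
        # Tracks without a match in this frame are dropped (the object disappeared).
        self.tracks = new_tracks
        return ids
```

After the last frame of the window, `sum(counter.totals.values())` gives the total traffic volume, `counter.totals` the per-class volume, and the length of the list returned by the final `update` call the number of objects in the last captured frame.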
The machine must successfully detect the objects to calculate the total traffic for a given time. For each frame, the machine only counts new objects, assigning them a unique ID until they disappear. This process continues until the last frame at 12 seconds, accumulating objects for their respective classes. Once the total traffic volume is calculated, the next step is to count the number of objects in the last captured frame. This calculation identifies and evaluates flaws in the model's vehicle object detection performance by determining the number of objects in a single frame.

III. RESULTS AND DISCUSSION
Results
This experiment uses five types of CNN models to detect cars, motorcycles, trucks, bicycles, public transportation, and people as their targeted objects. Each CNN model has a different performance result when used in this case. The performance of the models can be measured through five metrics: precision, recall, F1-score, average IoU percentage, and inference time relative to mAP@0.50. These five metrics are automatically calculated by the Darknet algorithm and generated as an output. Two additional metrics were also calculated during the training process: duration and the average model training loss.

Fig. 7 shows the training process of CSResNext50-Panet-SPP in a graph. The graph shows that the loss percentage began to fall after around the 300th iteration and continuously dropped until the 3000th iteration. From the 3000th iteration, the loss percentage maintained a steady reduction, staying between 0% and 2% until the last iteration. The next CNN model is DenseNet201-YOLOv4.

Fig. 8 Training Graph of DenseNet201-YOLOv4

Fig. 9 shows the training graph of YOLOv3. The training process of YOLOv3 has many similarities with that of CSResNext50-Panet-SPP, with a steady loss percentage. The last iteration resulted in a loss percentage between 0.2% and 6%.

Fig. 9 Training Graph of YOLOv3

Fig. 10 presents the next CNN model for training six classes of objects: YOLOv4.
Fig. 10 illustrates the training process in a graph. After the 2400th iteration, the loss percentage kept increasing and decreasing by 1% for every 100 iterations. Eventually, the last iteration stopped at between 1% and 2%.

Fig. 7 Training Graph of CSResnext50-Panet-SPP

Fig. 10 Training Graph of YOLOv4

Fig. 11 presents the result of the experiment using YOLOv7-tiny and shows its training process graph. Unlike the other models, whose loss curves are steady and differ only in how constant the loss percentage is, this training process has a particular case. After around the 300th iteration, the loss percentage increases again until the 700th iteration. This case only takes place in this training process. Then, the loss percentage kept decreasing steadily, like the training process of CSResNext50-Panet-SPP, until the 9600th iteration. From there on, the loss percentage stays at the same level until the very end of the iterations.

Fig. 11 Training Graph of YOLOv7-tiny

Every CNN model resulted in a different training graph, with a different training duration and training loss percentage. The only similarity is the total number of iterations.

Fig. 12 Training Duration of CNN Models

The training graph shows the training duration of each CNN model. Fig. 12 presents the training duration of every CNN model. In this case, the lower the duration, the better the CNN model, because of the low wait time. YOLOv7-Tiny has the best training duration, with only 3.561 hours. The worst training duration came from CSResNext50, with 17.434 hours. The rest of the models have almost the same duration, ranging from 5 to 10.5 hours.

Fig. 13 Average Model Training Loss Graph

The training graph also shows the training loss of each CNN model. Fig. 13 depicts the average training loss of every CNN model. The lower the value of the training loss, the better the quality of the model. Three models have a score 0: YOLOv3, CSResNext50, and YOLOv4. The lowest training loss score is that of YOLOv3, 0.37, while DenseNet201-YOLOv4 has the highest score.

The first three metrics that could affect the model's performance are precision, recall, and F1-score. Fig. 14 shows each CNN model's precision, recall, and F1-score metrics. Compared to other models, YOLOv4 has the highest value in all three metrics, with a score of 0.99 for precision, 1.0 for recall, and 1.0 for F1-score. All other CNN models have decent values, with a score over 0.9 in all metrics, except for YOLOv7-Tiny, which has a value of less than 0.9 for all metrics.

Fig. 14 Combined Precision, Recall, and F1-Score Performance Graph

DenseNet201 could detect a person who was a lot further from the road. In this frame, the other motorcycle (Motor) was a lot more unclear than in Fig. 17, possibly because the vehicle was moving too fast. The car (Mobil) that was on the left side of the road and was the furthest from the camera could not be detected. The problem could be the tree that blocks some parts of the car.

Fig. 15 Average IoU Graph of CNN Models

The next metric that could affect the performance of a CNN model is the average Intersection over Union or IoU. For this metric, the higher the percentage value, the higher the performance of a CNN model. Fig. 15 shows the average IoU of the CNN models. All models have a high IoU percentage of over 75%. The highest average IoU value of all five CNN models is that of YOLOv4, with a score of 93. In comparison, YOLOv7-Tiny has the lowest average IoU.

Fig. 17 CSResNext50 Prediction Results to detect the object

Fig. 17 showcases the results obtained with the CSResNext50 CNN model. In this particular frame, the model successfully recognizes a total of six cars (Mobil), eight motorcycles (Motor), and three individuals (Orang).
Its primary focus lies in identifying moving vehicles, disregarding stationary parked ones. Furthermore, the model targets explicitly individuals situated close to the road. Within the context of this figure, three classes of objects are observable, namely people . , cars . , and motorbikes . However, there exist three other classes of objects that remain unseen, including trucks . , bicycles . , and public transportation . The detection of public transportation . and cars . proves to be a challenging task for this model, resulting in a somewhat similar visual appearance between the two classes. A model could encounter problems detecting objects if the target object is blocked by another object in the frame or if the target object moves too fast for the model to detect. Fig. shows the prediction when using the DenseNet201 CNN Fig. 16 Inference Time. Relative to mAP@0. 50 Performance Graph The last metric is the inference time relative to mAP@0. Fig. 16 illustrates the inference time relative to the mAP@0. 50 performance graph of every CNN model. The inference time represents how fast a model detects the targeted object. The lower the inference time, the faster the detection becomes. However, mAP also needs to be considered to know more about compatibility on such devices. The CNN model with the fastest inference time and high compatibility is DenseNet201-YOLOv4. YOLOv3 and YOLOv4 have a high mAP but have a lower inference time than DenseNet201-YOLOv4. YOLOv4-Tiny has the lowest inference time with a mAP of around 98%, while CSResNext50-Panet SPP has the lowest mAP percentage with a value of around 97%. Discussion During the experiment, a total of five different CNN models were tested to evaluate their performance in detecting cars . and motorcycles . as the targeted objects. Upon completion of the experiment, it was observed that each CNN model produced varying predictions when presented with the same test video. 
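As a point of reference for the values in Fig. 14, precision, recall, and F1-score follow from the counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions and is not Darknet's own code. The counts are made up so that the result matches the precision and recall reported for YOLOv4, and the function name is an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from detection counts.

    tp: correct detections, fp: false alarms, fn: missed objects.
    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 99 correct detections, 1 false alarm, no misses.
p, r, f1 = precision_recall_f1(tp=99, fp=1, fn=0)
print(round(p, 2), round(r, 2), round(f1, 3))
```

With these illustrative counts the F1-score is about 0.995, which rounds to the reported 1.0 at one decimal place.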
The DenseNet201 model makes precise predictions for objects in three categories: cars, motorcycles, and people. An illustration of these predictions is presented in Fig. 18. In this frame, the model detects a total of five cars, seven motorcycles, and four people. Compared to CSResNext50, DenseNet201 detects one more person but one fewer motorcycle and one fewer car.

Fig. 18 DenseNet201 Prediction Results to detect the object

In recent object detection research, a CNN model was used to process faces when two target objects were too close to each other. One advantage of such a model is its ability to detect multiple objects in close proximity. For example, Fig. 19 shows the prediction of the YOLOv3 CNN model. In this frame, the model detected a total of five cars, seven motorcycles, and four people. It detected a person who was far from the road and all seven visible motorcycles. However, it failed to detect the car on the right side of the road that was trying to overtake the car in front.

Object detection research using CNN models has shown promising results in detecting multiple objects, even when they are close to each other. For example, the YOLOv3 CNN model could detect various objects in a given frame. However, challenges remain in accurately detecting all objects, especially in complex scenarios or when objects are occluded by others. Further research and advancements in object detection models are continuously being made to improve their performance, address certain limitations, and optimize the models for future use.

In its frame (Fig. 21), the YOLOv7-tiny model successfully detects four cars, three motorcycles, one public vehicle, and four people. However, similar to YOLOv3, YOLOv7-tiny has a specific issue: it fails to detect the motorcycle next to the car on the left side, closest to the camera.
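The abstract attributes part of the handling of nearby objects to Non-Max Suppression. As a generic illustration only, and not Darknet's implementation, greedy suppression based on IoU can be sketched as follows. The box format `(x1, y1, x2, y2)`, the threshold of 0.45, and the example boxes are all assumptions.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.45):
    """Greedy NMS over a list of (box, confidence) pairs.

    Keeps the most confident box, drops every remaining box whose IoU
    with it exceeds the threshold, and repeats on what is left.
    """
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if iou(best[0], d[0]) <= iou_threshold]
    return kept


# Made-up boxes: two overlapping boxes on one car, one adjacent motorcycle.
detections = [
    ((100, 100, 200, 200), 0.95),
    ((105, 102, 205, 198), 0.80),
    ((190, 120, 240, 200), 0.70),
]
kept = non_max_suppression(detections)
print(len(kept))
```

In this example the duplicate box on the car is suppressed, while the adjacent motorcycle survives because its overlap with the car is small. A lower threshold would make the suppression more aggressive and could remove genuinely separate objects that stand close together.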
YOLOv7-tiny is a variant of the YOLO object detection model. It is optimized for edge GPU devices and designed to be lightweight, making it suitable for real-world computer vision applications and distributed systems. While YOLOv7-tiny offers faster inference times, it may struggle to detect objects that are small or far away. Fig. 21 showcases the prediction results obtained using the YOLOv7-tiny CNN model.

Fig. 21 YOLOv7-tiny Prediction Results to detect the object

Fig. 19 YOLOv3 Prediction Results to detect the object

In addition to the challenge of detecting target objects, CNN models can also make incorrect predictions. An example can be seen in Fig. 20, which shows the prediction of the YOLOv4 CNN model. In this frame, the model detected a total of five cars, eight motorcycles, one public vehicle, and five people. However, unlike the previous model, which struggled with object detection, YOLOv4 misclassified the car on the left road, furthest from the camera, as a person. CNN models like YOLOv4 are designed to learn and recognize image patterns through extensive training on large datasets.

Fig. 20 YOLOv4 Prediction

TABLE i
TOTAL OBJECTS DETECTED IN THE LAST FRAME (VOLUME)
(Columns: Class, Car, Motorcycle, Truck, People, Bicycle, Public Transportation, SUM; one row per CNN model.)

TABLE IV
TOTAL OBJECTS DETECTED IN THE LAST 12 SECONDS
(Columns: Class, Car, Motorcycle, Truck, People, Bicycle, Public Transportation, SUM; one row per CNN model.)

Table i shows the total number of objects detected in the last frame of the video. The last frame shows neither trucks nor bicycles, so no model detects those classes. YOLOv4 detected the most objects of all five CNN models: five cars, eight motorcycles, five people, and one public transportation vehicle. YOLOv7-tiny detected the fewest, with four cars, three motorcycles, and one public transportation vehicle.
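The SUM column of Table i is a per-model total over the classes. The sketch below rebuilds such totals from the counts stated in the prose, not from the table cells. It assumes that the frames described for Figs. 17 to 19 are the same last frame; the YOLOv4 and YOLOv7-tiny counts follow the Table i summary. The dictionary layout and key names are illustrative.

```python
# Per-class counts transcribed from the text (assumption: all five
# descriptions refer to the same last frame of the test video).
last_frame_counts = {
    "CSResNext50": {"car": 6, "motorcycle": 8, "person": 3},
    "DenseNet201-YOLOv4": {"car": 5, "motorcycle": 7, "person": 4},
    "YOLOv3": {"car": 5, "motorcycle": 7, "person": 4},
    "YOLOv4": {"car": 5, "motorcycle": 8, "person": 5,
               "public_transport": 1},
    # The Table i summary lists no people for YOLOv7-tiny, although the
    # Fig. 21 description mentions four; the summary is followed here.
    "YOLOv7-tiny": {"car": 4, "motorcycle": 3, "public_transport": 1},
}

# Total objects per model, then the models with the most and fewest.
totals = {model: sum(counts.values())
          for model, counts in last_frame_counts.items()}
most = max(totals, key=totals.get)
least = min(totals, key=totals.get)
print(totals, most, least)
```

Under these assumptions YOLOv4 has the largest total and YOLOv7-tiny the smallest, which agrees with the ranking stated for Table i. Counting the four people mentioned for Fig. 21 would raise the YOLOv7-tiny total without changing that ranking.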
Table IV reports the total objects detected in the last 12 seconds of the test video. No bicycle appears in those 12 seconds, so every model detected zero bicycles. The table shows that YOLOv4 detects the most objects, with 21 cars, 81 motorcycles, 7 people, and 1 public transportation vehicle. YOLOv7-tiny detects the fewest, with only 22 cars and 47 motorcycles.

Table i and Table IV suggest that YOLOv4 is the most accurate at detecting targeted objects compared to the other CNN models. However, not all of the objects that YOLOv4 detected are correctly predicted; for example, the model misclassifies a car as a person. Therefore, other metrics such as precision, recall, and F1-score, which measure the model's accuracy, are needed to determine overall performance.

Besides the number of objects a model can detect in a given length of time, the inference time relative to the mAP of a CNN model and the average IoU also affect performance. In Fig. 16, all CNN models are highly compatible, with a mAP above 97%. YOLOv4 has the highest compatibility but the second-highest inference time, and it has the highest average IoU according to Fig. 15.

In conclusion, among the five CNN models, YOLOv4 performs best at detecting objects and has high compatibility, but its high inference time could slow down detection.

REFERENCES