Stanley A. Dewangga, et. : Implementation of Hand GestureA (October 2.

Implementation of Hand Gesture Recognition as Smart Home Devices Controller

Stanley A. Dewangga1, Mochamad Subianto1, and Windra Swastika1
1Informatics Department, Faculty of Technology and Design, Ma Chung University, Malang, Indonesia
Corresponding author: Stanley A. Dewangga, e-mail: stanleyadidewangga@gmail.

ABSTRACT Some current virtual assistant products, such as Alexa, Siri, and Google Home, offer features to control smart home devices through voice input, which has become increasingly popular in recent years. In addition to voice input, smart home devices can also be monitored and controlled through smartphone or computer applications that give users flexibility. However, both control methods are inefficient: they consume time, and voice input may not be recognized in crowded environments. Therefore, this research introduces an application that recognizes hand gestures in real time and uses them as a new control method intended to save time and energy. The application processes images using the Mediapipe framework, which generates hand landmark outputs. These landmark outputs are used to determine the combination of raised and lowered fingers, which is then used to control smart home devices. The application is developed with ESP32 and ESP-01S modules as receivers of gesture recognition data, and an ESP32-CAM for image acquisition. The gesture recognition computation itself is executed on a Raspberry Pi 3 Model B. The gesture recognition application achieves a good accuracy of 88%, but occasionally fails on certain gestures. In addition, the response time of the smart home control system is still relatively long, averaging 7.88 seconds.

KEYWORDS Hand Gesture Recognition, Hand Landmark, Mediapipe,
Smart Home

INTRODUCTION
A smart home is a system designed to control and automate household electronic devices intelligently using integrated sensors. Smart home devices offer numerous benefits in providing solutions for both routine and specific human needs. Research on smart home control systems for individuals with physical disabilities states that smart home technology helps people with physical disabilities manage their household devices. Smart home devices can be controlled through current virtual assistant products. However, according to Mtshali & Khubisa, controlling smart home devices using virtual assistants is a complex and expensive solution for individuals with physical disabilities or the elderly. These virtual assistants are also vulnerable to spoofing attacks. A study on the design of a smart home as an IoT application based on voice recognition and Arduino also found failures in recognizing voice input instructions as controllers for smart home devices. Voice recognition failures can be time-consuming, which makes this approach less efficient.

VOLUME 06, No 02, 2024 DOI: 10.52985/insyst.

In noisy environments, various noises can interfere with the voice recognition capabilities of virtual assistants. These noises typically originate from sounds not intended to control smart home devices. This issue can hinder the smooth operation of control systems and waste time.

Research on hand gesture recognition to control household electronic devices has been conducted. The research design used IMU sensors to obtain accelerometer data, which was processed and classified into 4 hand gestures: up, down, left, and right. The results of gesture recognition were then used to trigger the control of electronic devices in the home. Another study on smart home control devices using hand gesture recognition was also carried out.
With depth camera acquisition and the application of the Hidden Markov Model (HMM) method for gesture recognition, an average recognition rate of 98.50% was achieved. However, the control device was limited to only 4 types of hand gestures. In previous studies, the gesture recognition system that used input from the APDS-9960 sensor was limited to basic swipe gestures: left, right, up, and down. Instructions were executed using upward and downward hand movements to navigate menus on an LCD. This research provides a solution to the time-consuming and labor-intensive process of menu selection. By employing camera-based gesture recognition, this approach overcomes the limited variety of gestures that can be recognized with the APDS-9960 sensor.

Therefore, this research introduces a smart home control device that uses hand gesture recognition as its trigger. Hand gesture recognition is expected to address various issues found in previous research. This approach has the further advantage of requiring no sensors or attachments on the hands, offering a more seamless and convenient user experience. Instead of relying on traditional manual switches or remote controls, users can perform simple gestures to operate lighting, temperature settings, and other devices, allowing quicker and more intuitive control over home automation systems. This method reduces the need for physical contact and movement, saving the energy typically spent on conventional methods and shortening the time taken to perform routine tasks.

II. APPLICATION DESIGN
As shown in Figure 1, the application flow begins with the image acquisition process. In real-time image processing, the term "image" refers to the frame unit acquired using the camera. This image is then used as input for the hand gesture recognition model.
The output from hand gesture recognition is subsequently used to trigger the control of smart home devices.

Figure 1. Application Flowchart

IMAGE ACQUISITION
Image acquisition is performed on the ESP32-CAM using the OV2640 camera module. Upon activation, the ESP32-CAM automatically creates a web server and sends real-time image data to it. Both creating the web server and sending image data to it use the built-in sketch from the ESP32 add-on in the Arduino IDE called "cameraWebServer". On this web server, the resolution and image quality can be adjusted, and there are also features to apply effects to the image, such as Grayscale, Binary, and others.

HAND GESTURE RECOGNITION
The hand gesture recognition process is executed by passing each frame captured by the camera, in real time, to the Mediapipe model for inference. Mediapipe, a versatile framework developed by Google, provides a suite of image processing functions that facilitate the accurate detection and recognition of hand gestures as they occur. This real-time processing allows the system to provide feedback based on the gestures as it recognizes them.

Before hand gestures can be recognized, the hand in the input image is first detected using the Palm Detection model from Mediapipe, as shown in Figure 2. This model is built using the Single Shot Detector (SSD) architecture and the Convolutional Neural Network (CNN) method. The output of the detection is a bounding box stored as x-minimum, y-minimum, x-maximum, and y-maximum values, which represent the coordinates of the detection box on the image.

Figure 2. Hand Gesture Recognition Flowchart

The image is then cropped based on the bounding box values and passed to the hand landmark prediction model. Hand landmarks are the points of the joints, or the framework, of the palm, as shown in Figure 3. For example,
INDEX_FINGER_MCP represents the coordinate point of the metacarpophalangeal (MCP) joint on the index finger. The output of the hand landmark prediction used in the subsequent process consists of the x and y coordinates of each landmark on the image.

Figure 3. Hand Landmarks

The landmark coordinates in Figure 3 are used to determine whether each finger is raised or lowered. For example, if the y-coordinate of a finger's TIP in the image is smaller than that of its MCP (that is, the TIP is positioned above the MCP), the finger is interpreted as raised. Each raised finger is assigned a value of 1, while each lowered finger is assigned a value of 0. These values of 1 and 0 represent the raised and lowered states of the 5 fingers on both hands, as illustrated in Figure 4. Using these values for the 5 fingers, a total of 2^5 = 32 combinations, or 32 types of static hand gestures, can be recognized. In this study, only 5 of the 32 hand gestures were tested to measure the response time of the system.

Figure 4. Gesture Illustration

CONTROLLER USING RELAY
The ESP-01S is used to access gesture data and control the relay based on the received gesture information. The ESP-01S is connected to the Raspberry Pi's access point and retrieves gesture data from the Raspberry Pi's web server. If the gesture data corresponds to an instruction to turn the light on or off, the relay operates accordingly by opening or closing the electrical circuit. Figure 6 illustrates the configuration scheme of the controller using a relay and ESP-01S.

SMART HOME DEVICES CONTROLLER
In Figure 5, the ESP32-CAM is connected to the access point created by the Raspberry Pi and sends image data to the web server that the ESP32-CAM itself establishes. Subsequently, the Raspberry Pi accesses this data through the web server and uses it as input for the gesture recognition model.
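The raised/lowered rule described in the hand gesture recognition section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `finger_states` is hypothetical, the landmarks are assumed to arrive as 21 (x, y) pairs indexed as in the Mediapipe hand landmark model, the same TIP-versus-MCP comparison is assumed to apply to the thumb, and the thumb-first digit order is inferred from the '01100' example (index and middle fingers raised).

```python
# Landmark indices follow the Mediapipe hand landmark model:
# (TIP, MCP) pairs for thumb, index, middle, ring, and pinky.
FINGER_JOINTS = [(4, 2), (8, 5), (12, 9), (16, 13), (20, 17)]


def finger_states(landmarks):
    """Return a 5-character string such as '01100' (thumb first).

    `landmarks` is a sequence of 21 (x, y) image coordinates, where a
    smaller y means higher up in the image.
    """
    bits = []
    for tip, mcp in FINGER_JOINTS:
        tip_y = landmarks[tip][1]
        mcp_y = landmarks[mcp][1]
        # Rule from the text: TIP above MCP (smaller y) means "raised".
        bits.append("1" if tip_y < mcp_y else "0")
    return "".join(bits)


# Hypothetical hand: every MCP at y = 0.5, index and middle tips above it.
example = [(0.5, 0.5)] * 21
for tip in (4, 16, 20):
    example[tip] = (0.5, 0.8)  # lowered: tip below its MCP
for tip in (8, 12):
    example[tip] = (0.5, 0.2)  # raised: tip above its MCP
print(finger_states(example))  # 01100
```

Because the rule compares only y-coordinates, it assumes an upright hand; a tilted or upside-down hand would need a different comparison.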
The Raspberry Pi serves as the main data processing center: it handles image processing, runs inference with the gesture recognition model, and provides output data in the form of gesture information. Additionally, the Raspberry Pi opens an access point, allowing other components to connect to it. The Raspberry Pi also establishes a web server, so that after each image or frame is passed through the model, the model's output is sent to that web server. The inference process and the web server run concurrently using the threading library in the Python programming language, while the web server itself is built with the Flask library. Both processes need to run simultaneously so that the web server can operate and update its data while the Raspberry Pi processes image data. The gesture output on the web server is accessed by the ESP-01S and ESP32, serving as the trigger to control devices.

Figure 6. Installation Scheme of ESP01S and Relay

CONTROLLER USING IR TRANSMITTER
One limitation of the relay in this application is that it can only open and close the electrical circuit, so it supports only on and off instructions. To address this issue, an infrared (IR) transmitter is employed. Various household electronic devices such as ACs, TVs, and projectors use infrared remote controls to execute additional instructions like adjusting temperature, changing channels, and controlling volume. By replicating these functions, the given instructions can be diverse and not limited to just on or off. However, the infrared transmitter is constrained by its limited range, its directional emission, which requires a line of sight to the target, and its inability to penetrate objects.

To mimic the operation of a remote control, the infrared signals emitted by the remote for each instruction are first recorded using the HS1838 IR receiver. Remote instruction signals are in hexadecimal form.
For example, the signal to turn on the AC is '81C08F70 C1AA09F6'. Each hexadecimal character represents 4 bits of the digital signal, which is transmitted at a frequency of 38 kHz. For instance, the character 'A' represents the value '1010', while '9' represents the value '1001' in the digital signal. When the infrared signal is emitted from the remote, the signal received by the receiver or AC is in reversed bit order. For example, if the transmitted signal value is '11001001', the signal received by the receiver is '10010011'. Therefore, after the code signal for each instruction is recorded, the code is reversed. If the code obtained during recording was '81C08F70 C1AA09F6', the original signal is '6F905583 0EF10381'. This original signal is what the infrared transmitter emits to turn on the AC.

Figure 5. System Illustration

Figure 7. Installation Scheme of IR Transmitter and Receiver

The infrared transmitter is connected to the ESP32. The ESP32 connects to the Raspberry Pi's access point to access gesture data from the Raspberry Pi's web server. When the gesture data meets the conditions for a specific instruction, the infrared transmitter emits an infrared signal to the target device, which executes the instruction based on the hexadecimal code of the previously recorded original signal. As illustrated in Figure 7, an IR receiver connected to an Arduino UNO validates the signal code emitted by the transmitter, ensuring there are no errors in the code.

APPLICATION TESTING
The testing phase involves designing a prototype simulation of the application's operation.
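As an aside on the signal conversion described in the IR transmitter section, the reversal of a recorded code can be sketched as follows. This is an illustrative sketch only: the function name `reverse_signal` is hypothetical, and it assumes the conversion is a reversal of the bit order across the whole recorded code, which is what the two worked examples in the text imply. The word grouping of the converted values listed in Table I may differ from the output of this sketch.

```python
def reverse_signal(code_hex: str) -> str:
    """Reverse the bit order of an IR code given as hexadecimal text.

    Spaces in the input are ignored; the result is returned in
    groups of up to 8 hexadecimal digits.
    """
    digits = code_hex.replace(" ", "")
    width = len(digits) * 4  # each hexadecimal digit carries 4 bits
    bits = format(int(digits, 16), f"0{width}b")
    reversed_hex = format(int(bits[::-1], 2), f"0{len(digits)}X")
    return " ".join(
        reversed_hex[i:i + 8] for i in range(0, len(reversed_hex), 8)
    )


print(reverse_signal("C9"))                 # 11001001 -> 10010011, i.e. 93
print(reverse_signal("81C08F70 C1AA09F6"))  # 6F905583 0EF10381
```

Padding the binary string to four bits per hexadecimal digit matters here: without it, leading zero bits would be dropped before the reversal and the result would shift.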
The simulation is created on a smaller scale than a typical smart home system, with the aim of depicting the real system. Additionally, this simulation is used to test the solutions implemented for previously identified issues and the smooth functioning of additional features. Two aspects are tested in the simulation: response time and gesture recognition accuracy. Response time testing records the time from when a gesture is recognized on the Raspberry Pi until the instructed device responds. The test is conducted 10 times for each gesture and measured in seconds.

Meanwhile, accuracy is tested using a Confusion Matrix. A Confusion Matrix is a method for calculating accuracy, precision, and recall values in a classification system. Four terms are used to calculate performance in the Confusion Matrix, namely TP, FN, FP, and TN, as shown in Figure 8. TP, or True Positive, is the number of data predicted as class 1 whose actual result is also class 1. FN, or False Negative, is the number of data predicted as class 0 whose actual result is class 1. FP, or False Positive, is the number of data predicted as class 1 whose actual result is class 0. TN, or True Negative, is the number of data predicted as class 0 whose actual result is also class 0. Equation (1) measures the number of cases predicted correctly relative to all cases. Equation (2) indicates how accurate the predictions are for the positive class: the precision value is the ratio of True Positives to the total positive predictions. Equation (3) measures the ratio of True Positives to the total actual positive cases.

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]

RESULT AND DISCUSSIONS
The camera resolution for image acquisition is set to VGA, which is 640x480 pixels. The resulting frames are obtained in the
JPEG file format with a file size ranging from approximately 8,000 to 12,000 bytes, or 8-12 KB. The Raspberry Pi receives frame data from the ESP32-CAM at an average of 52.6 frames per second, using 2 frame buffers. A frame buffer is additional memory for storing graphic information that has not yet been displayed on the screen. The image data is then sent to the web server created by the ESP32-CAM. This data can be accessed using the IP address of the ESP32-CAM.

TABLE I
INFRARED SIGNAL RESULTS
Devices and commands: Projector (Power On/Off, Freeze/Unfreeze); Power On; Power Off
Signal Code | Converted Signal
81C08F70 C1AA09F6 | 03810EF1 55836F90
81C08F70 C1AA49B6 | 03810EF1 55836D92
96BFBCDC CEC65E15 | FD693B3D 6373A87A
96BFBCDC 9A664E43 | FD693B3D 6659C272

Table 1 lists the infrared signals that were successfully recorded using an IR receiver and emitted using an IR transmitter. Some signals from AC remotes of other brands could not be recorded because the hexadecimal codes emitted for a single instruction on a single device varied. To address this issue, attempts were made to record the pulse width and retransmit it, but the results were still unsuccessful.

Figure 8. Confusion Matrix Example

With the developed application, five recognizable hand gestures have been defined to execute various instructions. The executable instructions include turning the lights, projector, and AC on and off, and freezing/unfreezing the projector, as illustrated in Figure 9. The number 1 represents a raised finger, with digits ordered from right to left on the right hand; for example, the gesture data '01100' signifies that the index and middle fingers are raised. This gesture data is published through the web server within the local network, making it accessible to both the ESP-01S and the ESP32. Figure 10 demonstrates an example of using the application for gestures 11111 and 10001. The instructions for the performed gestures were executed successfully.

Figure 9. Hand gestures and its commands

Figure 10. Example of Application Usage

Testing was conducted under specified camera resolution, distance, and lighting conditions. The distance between the camera and the hand was set at 1 meter. The lighting conditions were measured using a lux meter; measurements recorded in the area where hand movements were tested showed a light intensity of 245 lux. Variations in lighting conditions, either darker or brighter, can significantly affect the detail captured in the image of the hand, making gestures more difficult to recognize. The distance between the hand and the camera also plays a crucial role in the system's ability to detect hand gestures. Therefore, further testing of both lighting and distance factors is necessary to optimize the performance of the recognition system.

ACCURACY TESTING RESULTS
As shown in Figure 11, the hand gesture recognition system achieved a good accuracy of 88%. However, the gesture 00110 frequently experiences prediction failures and is often misclassified as 01100, with a precision value of 0. This may be attributed to instances where the hand is far from the camera and therefore has a relatively low resolution. In such cases, the texture or shape of the hand may not be clearly recognizable, leading to inaccurate landmark outputs. Some similar gestures can be recognized accurately at short distances, which avoids misclassification but may reduce the quality of the user experience. Higher-resolution cameras that capture more detailed images could improve the system's ability to distinguish similar gestures at long distances.

Figure 11. Hand gestures and its commands (axis labels: Actual, Prediction; Accuracy)

RESPONSE TIME TESTING RESULTS
The response speed of the smart home control system is still slow.
The average time required from gesturing to the camera until the smart home device fully turns on is 7.88 seconds. The average response times of the relay and infrared controllers also do not differ significantly, with a minimal average speed difference . and a comparable difference in standard deviation . 07 second.

TABLE II
RESPONSE TIME TESTING RESULTS
(Response time in seconds; columns: two relay-controlled gestures and three IR-controlled gestures; rows: trials 1 to 10, Mean, Standard Deviation)

The resulting deviation indicates that the speed of the smart home controller still fluctuates. As observed in Table 2, the response times for trials 1 to 5 and trials 6 to 10 differ significantly, with an average gap of 3. When the program is running, the number of frames per second captured by the Raspberry Pi drops drastically from the initial 52.6 fps to around 1 to 3 fps. This is because each frame captured by the camera is processed and passed through the model before the next frame is captured. Therefore, the computational speed of the Raspberry Pi in processing images and running them through the model remains inconsistent and inefficient.

IV. CONCLUSION
The smart home device control system developed in this research has an average overall response time of 7.88 seconds over 10 tests per gesture. This response time is still too long for use in a product. The prolonged response time is due to the computational limitations of the Raspberry Pi 3 Model B in handling images and running them through the gesture recognition model. Therefore, this research cannot yet solve the time and energy efficiency issues identified in previous studies. However, it has the potential to be improved and to address these issues by replacing the central computing unit with another small computer that offers higher computational speed. The hand gesture recognition system achieved an overall accuracy of 88%, which is satisfactory.
However, some gesture recognition errors still occur, and certain gestures remain challenging to use. The system can recognize up to 32 gestures, but this study used only 5, which indicates that the limitation on the number of recognized gestures in previous research can be addressed with some improvements to this study.

Based on the experiments and tests conducted, several suggestions can be applied to enhance and further develop this research, including:
1. Replacing the Raspberry Pi 3 Model B as the main computing unit with a faster computer for image processing. This decision should also consider the amount of input to be processed on that computing resource, as multiple inputs may be required for more than one room.
2. Considering the option of not running gesture recognition in real time, and instead providing an interface feature for a gesture recording mode, which would make power consumption more efficient.
3. Implementing a feature to record and store infrared signals from the original remote and to emit them when specific instructions are called.

AUTHORS CONTRIBUTION
Stanley Adi Dewangga: Writing original draft, conceptualization, methodology, editing writing, software, validation, and data curation.
Mochamad Subianto: Supervision, analysis, investigation, resources, system validation.
Windra Swastika: Supervision, conceptualization, system validation, investigation.

COPYRIGHT
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

REFERENCES