International Journal of Electrical and Computer Engineering (IJECE)
Vol. No. October 2025, pp. 4907-4915
ISSN: 2088-8708. DOI: 10.11591/ijece.
Journal homepage: http://ijece.

Enhancing facial landmark detection with ControlNet-based data augmentation

Kritaphat Songsri-in1, Munlika Rattaphun1, Sopee Kaewchada2, Sunisa Kidjaideaw2, Sangjun Ruang-On2, Wichit Sookkhathon2, Patompong Chabplan2
1Department of Computer Science, Faculty of Science and Technology, Nakhon Si Thammarat Rajabhat University, Tha Ngio, Thailand
2Department of Information Technology and Digital Innovation, Faculty of Science and Technology, Nakhon Si Thammarat Rajabhat University, Tha Ngio, Thailand

Article Info

Article history:
Received Feb 3, 2025
Revised Jun 12, 2025
Accepted Jul 3, 2025

Keywords:
ControlNet
Deep learning
Face image generation
Face landmark detection
Machine learning

ABSTRACT

Facial landmark detection plays a pivotal role in various computer vision applications, including face recognition, expression analysis, and augmented reality. However, existing approaches often struggle with accuracy due to variations in lighting, pose, and occlusion. To address these challenges, this study explores the integration of ControlNet with Stable Diffusion to enhance facial landmark detection via data augmentation. ControlNet, an advanced extension of diffusion models, improves image generation by conditioning outputs on structured inputs such as landmark coordinates, enabling precise control over image attributes. By leveraging annotated landmark data from the 300 W dataset, ControlNet synthesizes diverse facial images that supplement traditional training datasets. Experimental results demonstrate that ControlNet-based augmentation reduces the interocular normalized mean error (INME) in landmark detection from a baseline of 67 to a range of 4.63 to 4.74, with optimal parameter tuning yielding further accuracy gains. These findings highlight the potential of generative models in complementing discriminative approaches and improving robustness and precision in facial landmark detection. The proposed method offers a scalable solution for enhancing model generalization, particularly in applications requiring high-fidelity facial analysis. Future research can extend this framework to broader computer vision tasks that demand detailed feature localization and structured data augmentation.

This is an open access article under the CC BY-SA license.

Corresponding Author:
Patompong Chabplan
Department of Information Technology and Digital Innovation, Faculty of Science and Technology, Nakhon Si Thammarat Rajabhat University
Tha Ngio, Thailand
Email: patompong_cha@nstru.

INTRODUCTION

Facial landmark detection is a critical area in computer vision, supporting numerous applications, including facial recognition, expression analysis, 3D facial modeling, and augmented reality. These applications rely on accurately identifying specific facial points, or landmarks, that represent essential facial features.

Over the past decades, a variety of algorithms have been proposed to localize facial keypoints accurately under diverse conditions. Early approaches were often built on statistical shape models or graphical representations of facial structure. Active shape models (ASM) and active appearance models (AAM) are seminal model-based frameworks that iteratively fit a parametric shape and appearance to face images by enforcing learned shape constraints. These methods and other deformable models provided a foundation for face alignment, but their performance degrades on unconstrained images with large pose or expression variation. To better handle such variability, part-based graphical models were introduced. For example, the mixture-of-trees model
represented facial landmarks as tree-structured parts with global and local mixtures, enabling joint face detection, pose estimation, and landmark localization in in-the-wild images. While these graphical techniques increased robustness to pose, their accuracy was limited by the rigidity of the underlying shape assumptions.

Subsequently, direct regression methods gained popularity for their efficiency and accuracy, bypassing explicit shape modeling. Cascaded shape regression frameworks emerged as a dominant approach, in which an initial coarse landmark estimate is iteratively refined by a sequence of learned regressors. By learning shape update transformations, these methods can rapidly converge to the target landmarks. Explicit shape regression was the first to map image features directly to landmark displacements without any parametric model. Numerous enhancements followed: the supervised descent method (SDM) was formulated to minimize a nonlinear least-squares alignment objective, and random forests with conditional regressors were applied to predict facial keypoints in real time while accounting for head pose. Later, ensemble-based regressors further improved reliability. An ensemble of regression trees enabled one-millisecond face alignment with competitive accuracy. To reduce overfitting and improve generalization, gradient-boosted trees were combined with Gaussian processes in a cascade (cGPRT), which acted as a form of regularized ensemble and achieved state-of-the-art results on challenging benchmarks. These regression and ensemble methods significantly improved alignment speed and accuracy, yet their data-driven nature meant that generalization to extreme poses or expressions was still constrained by the availability and diversity of training data.

With the rise of deep learning, convolutional neural network (CNN) approaches have dramatically advanced the state of the art in many vision tasks, including facial landmark detection. Deep neural networks can learn robust feature representations and implicit shape constraints from large datasets. A CNN cascade for facial point detection was demonstrated first, outperforming earlier cascaded regressors by a large margin. Subsequent works leveraged increasingly sophisticated deep models and training strategies. Multi-task learning frameworks were introduced to improve robustness: for example, Zhang et al. trained a CNN to predict landmarks together with head pose and facial attributes, gaining resilience to occlusions and pose changes through shared feature learning. Other researchers integrated 3D face modeling into the learning process to handle profile views: one approach combined a cascaded CNN with a 3D Morphable Model to align faces across large poses, and another proposed a 3D-assisted solution that fits a dense 3D face to 2D landmarks, thereby improving alignment under self-occlusion. Fully convolutional architectures and heatmap regression techniques have also yielded excellent accuracy. A very deep residual network for landmark localization nearly saturated the performance on several 2D and 3D face alignment datasets, achieving remarkably low normalized mean errors. In addition, improved loss functions and data handling have enhanced CNN-based alignment. Notably, Feng et al. introduced the Wing loss to better penalize small errors while tolerating outliers, leading to more robust convergence. Boundary-aware features were also incorporated to explicitly model face contour information, which boosted landmark accuracy on challenging cases such as profile faces and exaggerated expressions.

Thanks to these advances, modern neural methods can achieve high accuracy under controlled conditions. However, their performance can still degrade in unconstrained environments due to the inherent diversity of real-world faces. A key remaining challenge is the reliance of deep models on abundant and varied labeled data.
In practice, collecting and manually annotating a sufficiently diverse facial landmark dataset is costly and labor-intensive. Many existing datasets have biased distributions, such as limited extreme poses, occlusions, or ethnic diversity, causing models trained on them to generalize poorly to new domains. Data augmentation is therefore crucial to improve model robustness. Conventional augmentation techniques such as random cropping, flipping, rotation, and noise injection can expand a dataset, but they produce only limited perturbations of existing images and may not introduce truly novel face appearances or geometries. This has motivated the use of generative models to synthetically enlarge training data. More recently, diffusion models have emerged as a powerful class of generative models, achieving state-of-the-art image quality and diversity in synthesis tasks. By leveraging a pretrained diffusion prior, one can guide image synthesis using additional inputs such as text, sketches, or keypoint maps. This suggests an opportunity: by conditioning a generative model on facial landmark configurations, we can produce synthetic face images that come with free landmark labels, thereby creating virtually unlimited training data with precise ground truth.

In this work, we present a novel data augmentation framework that integrates ControlNet with Stable Diffusion to synthesize photorealistic face images conditioned on input landmark layouts. Our contributions are threefold. First, we develop the first diffusion model that uses conditional augmentation for facial landmark detection. Second, we provide empirical evidence that our method reduces normalized mean error compared to baseline models. Third, we show how structural generative augmentation can apply to other vision tasks such as human pose estimation and hand keypoint detection, where labeled data are scarce.
By providing a scalable way to create large volumes of accurately labeled data, our method enables the training of more robust and generalizable models in facial analysis and related fields.

METHOD

To improve facial landmark detection, the proposed method integrates ControlNet with Stable Diffusion for synthetic data augmentation. By conditioning the image generation process on predefined facial landmark configurations, it generates varied training images that enhance the robustness and accuracy of facial landmark detection. The following subsections describe the datasets, model architecture, loss function, training strategy, and implementation details.

Datasets

This study uses two primary datasets for training and evaluating the facial landmark detection model: the 300 W dataset, a widely established benchmark for facial landmark detection, and a ControlNet-based augmented dataset, which consists of synthetic images conditioned on facial landmarks from the 300 W dataset. Together, these datasets provide both real and synthetic data, allowing a systematic examination of model performance across various data configurations.

The 300 W dataset

The 300 W dataset is a key benchmark in facial landmark detection, offering a diverse collection of facial images curated to challenge and evaluate detection algorithms. It includes various subsets designed to simulate real-world scenarios, capturing a broad spectrum of facial conditions, such as different lighting environments, facial expressions, and levels of occlusion. This dataset serves as the primary source of annotated real-world data for training and evaluating facial landmark detection models. It includes 3,148 training images and 600 testing images, providing a substantial volume of data for robust model training and analysis.
Figure 1(a) displays sample images from the 300 W dataset, illustrating the diversity of facial variations and the detailed landmark annotations that make this dataset valuable for rigorous testing and validation.

ControlNet-based augmented dataset

To supplement the 300 W dataset, a synthetic dataset was created using ControlNet, an image generation model capable of producing realistic facial images conditioned on specific landmark configurations. ControlNet was applied to the 300 W landmark annotations to generate synthetic images that closely adhere to the structural features of the original dataset, enhancing diversity in the training data by introducing new variations in lighting, pose, and facial expression. This augmented dataset was generated at varying ratios λ relative to the original dataset, from 0% to 100% in steps of 10%, allowing experimental evaluation of different real-to-synthetic data combinations. By integrating ControlNet-based synthetic images, the augmented dataset provides a scalable way to boost model generalization and robustness across a range of facial landmark detection scenarios. Figure 1(b) shows examples from the ControlNet-based augmented dataset, illustrating how the synthetic data resembles real-world conditions and adds training diversity.

Model architecture

For computational efficiency, the model is designed for the single objective of facial landmark detection. The network takes a 64×64×3 color image as input. This input is passed sequentially through five 3×3 convolutional layers, each using a rectified linear unit (ReLU) activation function to introduce non-linearity and mitigate issues such as the vanishing gradient.
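As a rough illustration of one such layer, the sketch below applies a single 3×3 convolution followed by ReLU to a single-channel image in plain Python. The function name `conv3x3_relu`, the zero "same" padding, and the single-channel simplification are assumptions made for illustration; the paper's TensorFlow implementation and its padding choice are not shown in the text.

```python
def conv3x3_relu(image, kernel, bias=0.0):
    """Apply one 3x3 convolution (zero "same" padding) followed by ReLU.

    image:  2D list (H x W) of floats, single channel for simplicity.
    kernel: 3x3 list of floats, applied as cross-correlation, which is
            how deep learning frameworks usually implement convolution.
    The padding scheme is an assumption; the source does not state it.
    """
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            acc = bias
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    # Pixels outside the image contribute zero.
                    if 0 <= rr < height and 0 <= cc < width:
                        acc += kernel[dr + 1][dc + 1] * image[rr][cc]
            # ReLU keeps positive responses and zeroes out the rest.
            output[r][c] = max(0.0, acc)
    return output
```

With this padding the output has the same height and width as the input, so any reduction in spatial size would come from a separate pooling step rather than from the convolution itself.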
After each convolutional layer, a max-pooling operation halves the spatial dimensions, which improves the model's translational invariance and condenses information. The kernels of each of the five convolutional layers have the shape Width×Height×Input×Output, so the kernel shape specifies each layer's input and output channels. Following these layers, the network includes fully connected layers to process the extracted features. These fully connected layers transform the spatial information into a final output vector of 2L values, where each pair of values represents the x and y coordinates of one of the L facial landmarks. In this setup, L is 68, so the output covers 68 landmark points that capture detailed facial features. This structure lets the model capture the details required for precise landmark localization. The architecture of the model is depicted in Figure 2.

Figure 1. Examples from the datasets used for training and evaluation: (a) sample images from the 300 W dataset displaying diverse facial expressions, lighting conditions, and occlusions with annotated landmarks, and (b) synthetic images from the ControlNet-based augmented dataset, generated using 300 W landmark configurations to introduce additional variations in pose, lighting, and expression

Figure 2. Overall architecture: a sequence of five 3×3 conv, ReLU, max-pool blocks, followed by fully connected layers that output 2×68 landmark coordinates

Loss function

The model's training objective is to minimize localization error for facial landmark detection. The mean absolute error (MAE) is used to quantify the discrepancy between the predicted and actual landmark positions. The loss function is defined in (1).
$$\mathcal{L}_{\mathrm{Landmark}} = \frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} \left| x_{ij} - \hat{x}_{ij} \right| \tag{1}$$

where $N$ is the number of images, $L$ is the total number of landmarks in each image, $x_{ij}$ is the ground truth location of the $j$-th landmark in image $i$, and $\hat{x}_{ij}$ is the predicted location generated by the model. This MAE-based loss function penalizes errors linearly across the predicted landmarks.

Model training strategy

To assess the effects of synthetic data on facial landmark detection, the model was trained with datasets containing different ratios λ of ControlNet-generated images to original images, ranging from 0.0 to 1.0 in steps of 0.1. Each ratio was treated as a separate experiment, with the proportion of synthetic to real images held constant throughout the training process. Systematically varying these ratios enables a comparative analysis of how different levels of synthetic data influence model performance, providing insight into the optimal dataset composition for accuracy and robustness in facial landmark detection.

As illustrated in Figure 3, each experimental setup represents a unique dataset composition that balances real and synthetic data according to the designated ratio. This structure allows the model to learn from both natural and augmented facial variations, so that the contribution of synthetic data to generalization across diverse facial conditions can be examined. By comparing performance across these configurations, the experiments aim to identify the most effective ratio of synthetic augmentation for improving the model's ability to detect facial landmarks accurately.

Figure 3.
Dataset augmentation strategy: for each experiment, a fraction λ of ControlNet-generated synthetic images is added on top of the real 300 W images

Implementation details

The facial landmark detection model was implemented in Python with TensorFlow, leveraging its flexibility for deep learning tasks. Input images were normalized to the range [0, 1] by dividing pixel values by 255. The model was trained using the Adam optimizer with a piecewise constant learning rate schedule. The initial learning rate of $1\times10^{-3}$ was reduced to $1\times10^{-4}$ after the first third of the training epochs and further to $1\times10^{-5}$ after the second third, ensuring gradual refinement of model parameters. Training was conducted for 1000 epochs with a batch size of 64. L2 weight decay of $5\times10^{-4}$ was applied as regularization to mitigate overfitting. Augmentation techniques, including random rotation, flipping, cropping, and Gaussian blurring, were employed to increase data diversity and robustness. This implementation strategy, combining an efficient architecture, a stepped learning rate, and augmentation, facilitated accurate and robust prediction of facial landmarks under varied conditions.

RESULTS AND DISCUSSION

This section presents an experimental evaluation of the proposed method, focusing on the impact of ControlNet-based synthetic data augmentation on facial landmark detection performance. The interocular normalized mean error (INME) is employed as the primary evaluation metric, providing a scale-independent assessment of landmark localization accuracy. Comparative analyses are conducted across various augmentation ratios and parameter settings to determine the optimal configurations for robust and precise facial landmark detection.

Metrics

The INME is a metric specifically suited to evaluating facial landmark detection.
This measure calculates the average distance between the predicted and actual landmark positions, normalized by the interocular distance, defined as the distance between the two outermost points of the eyes. This normalization ensures a scale-independent assessment. The formula for INME is presented in (2).

$$\mathrm{INME} = \frac{1}{N} \sum_{i=1}^{N} \frac{\frac{1}{L} \sum_{j=1}^{L} \left\| x_{ij} - \hat{x}_{ij} \right\|_2}{D_i} \tag{2}$$

where $N$ represents the number of images, $L$ is the total number of landmarks in each image, $x_{ij}$ and $\hat{x}_{ij}$ are the ground truth and predicted landmark positions, respectively, and $D_i$ is the distance between the outer corners of the eyes in image $i$.

Methods comparison

The results of the experiments, presented in Table 1, show the performance of the facial landmark detection model with varying ratios of ControlNet-based augmented data, ranging from 0 to 1. The INME is used as the key performance indicator, where lower INME values indicate higher accuracy in landmark localization. From Table 1, it can be observed that the baseline model, without any synthetic augmentation, achieves an INME of 4. As the augmentation ratio increases from 0.1 to 1, the INME fluctuates slightly between 4.63 and 4.74, indicating that different levels of augmented data have varied effects on model performance.

Further insight into the effect of ControlNet-augmented data on model learning is given in Figures 4(a) and 4(b), which display the raw and moving-average INME values over training iterations and clarify long-term performance trends. During the initial third of the training iterations, INME decreases sharply from approximately 6. 0, demonstrating that the model rapidly adapts to the training data. After this initial drop, INME stabilizes between 4.8 and 5.4 during the second third of the iterations, with a general downward trend, indicating continued model improvement. The impact of varying the Lambda parameter on INME is also notable.
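The INME defined in the Metrics subsection can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the function name `inme`, the list-of-tuples data layout, and the outer-eye-corner indices 36 and 45 (the common 0-based convention for 68-point annotations) are not taken from the paper.

```python
import math


def inme(pred, truth, left_eye_outer=36, right_eye_outer=45):
    """Interocular normalized mean error over a batch of images.

    pred, truth: lists of images, each a list of (x, y) landmark tuples.
    left_eye_outer, right_eye_outer: indices of the outer eye corners.
    The defaults follow the usual 0-based 68-point layout and are an
    assumption, not something the source specifies.
    """
    total = 0.0
    for pred_img, truth_img in zip(pred, truth):
        # Interocular distance D_i, taken from the ground-truth annotation.
        d_i = math.dist(truth_img[left_eye_outer], truth_img[right_eye_outer])
        # Mean Euclidean error over the L landmarks of this image.
        mean_err = sum(
            math.dist(p, t) for p, t in zip(pred_img, truth_img)
        ) / len(truth_img)
        total += mean_err / d_i
    # Average the normalized per-image errors over the N images.
    return total / len(truth)
```

The function returns a plain ratio. If the reported values are percentages of the interocular distance, which the text does not state, the result would need to be multiplied by 100 before comparison.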
Lower Lambda values are associated with lower INME, suggesting that selecting an optimal Lambda value can significantly enhance model performance. After the final third of the training iterations, INME converges to a steady range of 4.6 to 4.8 across all Lambda values, demonstrating that the model has achieved stable landmark prediction accuracy. The moving average in Figure 4(b) smooths out raw INME fluctuations, making the trend of performance improvement more apparent.

Table 1. Interocular normalized mean error (INME) of the facial landmark detection model for varying ControlNet-based augmentation ratios (λ). Lower INME indicates higher landmark prediction accuracy
Columns: Ratios, starting at 0 (Baseline); INME

Figure 4. Impact of Lambda (λ) on INME during training: (a) raw INME values per iteration for different Lambda settings, and (b) corresponding moving-average curves, highlighting how tuning Lambda influences convergence and landmark localization accuracy

In summary, the experimental results highlight the effectiveness of using ControlNet-augmented data and the importance of tuning Lambda to achieve optimal performance in facial landmark detection. The analysis underscores that carefully chosen synthetic data ratios, along with an optimal Lambda, can enhance model robustness and precision in landmark localization. Additionally, balancing the amount of synthetic and real data ensures diverse training samples without introducing excessive noise, further stabilizing model convergence.
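The smoothing behind the moving-average curves of Figure 4 can be illustrated with a short sketch. The trailing window and its default size of 5 are assumptions, since the text does not give the window used for the figure, and `moving_average` is a hypothetical helper name.

```python
def moving_average(values, window=5):
    """Trailing moving average of a sequence of per-iteration values.

    The first few points average over however many values are
    available, so the output has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for k, value in enumerate(values):
        running += value
        if k >= window:
            # Drop the value that just left the window.
            running -= values[k - window]
        smoothed.append(running / min(k + 1, window))
    return smoothed
```

A larger window gives a smoother curve but reacts more slowly to real changes in INME, so the window size is a trade-off between readability and responsiveness.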
CONCLUSION

This study highlights the effectiveness of ControlNet-based data augmentation in enhancing the accuracy and robustness of facial landmark detection. By integrating ControlNet-generated synthetic images with real data from the 300 W dataset, the proposed approach addresses critical challenges in landmark detection, including variations in lighting, pose, and facial expression. The experimental results demonstrate that augmenting training datasets with synthetic data significantly reduces the INME, thereby improving landmark localization accuracy.

Furthermore, the findings emphasize the importance of optimizing the ratio of synthetic to real data and fine-tuning model parameters, such as Lambda, to achieve maximum performance gains. Careful selection of synthetic-to-real data proportions ensures that the model learns from diverse conditions without being overwhelmed by artificial samples. In addition, adjusting Lambda allows for controlling the trade-off between reconstruction accuracy and regularization, which ultimately helps stabilize training and prevents overfitting.

This methodology holds considerable promise for broader applications in computer vision tasks that require precise feature localization. In particular, fields such as facial expression recognition benefit from reliable landmark positioning, and improved 3D facial modeling depends on accurate feature correspondence. Future research should focus on refining synthetic data generation techniques, exploring more advanced generative models, and extending this approach to other areas of facial analysis to validate it further.

ACKNOWLEDGMENTS

The authors gratefully acknowledge the use of the services and facilities of the Faculty of Science and Technology, Nakhon Si Thammarat Rajabhat University.

FUNDING INFORMATION

This study received funding from the National Science and Technology Development Agency, Thailand, under the Prototype Research Grant Scheme JRA-CO-2565-17792-TH.

REFERENCES