Institut Riset dan Publikasi Indonesia (IRPI)
MALCOM: Indonesian Journal of Machine Learning and Computer Science
Journal Homepage: https://journal.irpi.or.id/index.php/malcom
Vol. 5 Iss. 3 July 2025, pp: 990-999
ISSN(P): 2797-2313 | ISSN(E): 2775-8575

Harnessing Machine Learning to Decode YouTube Subscriber Dynamics: Regression Predictive Models and Correlations

Sri Mulyati1*, Samidi2
1 Master of Science in Management Program, Faculty of Economics and Business, Padjajaran University, Bandung, Indonesia
2 Master of Computer Science, Budiluhur University, Jakarta, Indonesia
Email: 1sri23020@mail.unpad.ac.id, 2samidi@budiluhur.ac.id

Received May 04th 2025; Revised Jun 16th 2025; Accepted Jul 21st 2025; Available Online Jul 31st 2025, Published Jul 31st 2025
Corresponding Author: Sri Mulyati
Copyright © 2025 by Authors, Published by Institut Riset dan Publikasi Indonesia (IRPI)

Abstract
YouTube has grown into a digital media giant, yet content creators continue to struggle with predicting subscriber growth. Because viewers' interests change and the volume of content is vast, it is challenging to determine which factors most influence subscription behavior. Optimizing content strategy and sustaining channel growth require an understanding of these factors. This study uses linear regression (LR), neural network (NN), and Gaussian process (GP) models to predict YouTube subscribers, and examines category correlations using video data from various topics. The NN model achieved the lowest root mean square error (RMSE), 26256351, and thus outperformed the LR and GP models. The correlation matrix indicates a slight positive correlation of 0.067 for the YouTube category variable. The correlation coefficients for population, unemployment rate, and urban population are 0.080, -0.012, and 0.082, respectively.
These findings point to future research on creating more intentional content and on identifying the factors that significantly increase viewership and audience growth.

Keywords: Machine Learning, Networks, Regression, Subscribers, YouTube

1. INTRODUCTION

YouTube has grown steadily since 2005 and has come to dominate the digital world. It has become one of the most popular social media networks, with an estimated 2 billion or more monthly users by 2023. Demographically, 2 million users are aged 25-34, 7 million are aged 35-44, and 377 million are aged 18-24 [1]. This massive user base has affected the world economy through contributions from content development, digital content, distribution, and digital advertising turnover, which was predicted to reach $518.4 billion in 2023 [2]. In addition, YouTube has encouraged many people, including bloggers, broadcasters, artists, celebrities, musicians, and even beginners from remote areas, to create content and seek their fortune by monetizing their talents through digital content. McKinsey stated that YouTube videos can generate sustainable income if they capture customer engagement [3].

Although a large subscriber base brings many benefits for creators, many still fail to predict their subscriber growth. Because of the volume of content and changing viewer preferences, identifying the main factors that affect subscriber behavior is difficult. Understanding these factors and optimizing content strategy accordingly is essential for growth. The digital content ecosystem is highly competitive, with artists competing for audience attention [4]. The effect of genre on subscribers is worth researching because some genres attract viewers across different demographics, whereas others attract subscribers through their appeal to specific interest groups.
Deloitte found that content classifications affect audience engagement and loyalty, emphasizing the significance of strategic content planning for channel effectiveness [5].

The primary objective of this study is to develop and compare predictive models for estimating YouTube subscriber growth using three distinct machine learning algorithms: Linear Regression (LR), Neural Networks (NN), and Gaussian Processes (GP). In addition to model development, the research investigates how categorical YouTube content genres and key demographic indicators, including population size, urban population density, and unemployment rate, relate to subscriber counts. Furthermore, the study evaluates the predictive performance of each model through Root Mean Square Error (RMSE) metrics and correlation coefficients, thereby offering insights into their implications for strategic content planning and audience segmentation within digital media platforms.

DOI: https://doi.org/10.57152/malcom.v5i3.2084

Among related works, Rui et al. [6] previously used Ordinary Least Squares (OLS) and Online Gradient Descent (OGD) models. Additionally, Prachi et al. [7] used general linear models and LR to analyze YouTube videos. Unlike previous research that focused solely on video-level metrics such as duration, likes, or comments [6], this study brings regional socio-demographic variables (population, urbanization, and unemployment rate) into the predictive context alongside video-level data. The study also examines and compares the performance of three different machine learning models (LR, NN, and GP) on the same dataset, and thus offers a contextual discussion of their performance and interpretability.
Combining socio-demographic, quantitative (video), and qualitative (topic) data in different formats allows a more nuanced representation of subscribers and of subscriber changes across content categories, which constitutes the novelty of this research (see Table 1).

Table 1. Research gaps and hypothesis

| References | Rui et al. (2019) | Prachi et al. (2024) | Our research |
|---|---|---|---|
| Variable | Video duration, upload date, and number of likes and comments | Video duration, upload date, and number of likes and comments | Categories, Subscribers, Population, Urban population, Unemployment |
| Type of Data | YouTube video statistics | Dataset YouTube Statistic | Dataset YouTube Statistic |
| Regression Models | 1. OLS method; 2. OGD method | General linear models, LR models | LR models, NN, and GP; Correlation Matrix |
| Result | The OLS method outperforms the OGD method in predicting YouTube views; video duration and number of likes are significant predictors of views [6] | 1. Generalized linear models outperform LR models; 2. Video duration and number of likes are significant predictors of views [7] | Hypothesis: 1. NN outperforms LR and GP; 2. Positive correlation between YouTube categories, population, urban population, and unemployment rate and YouTube subscribers |

The research questions are:
RQ1: How does the predictive accuracy of NN compare to LR and GP in modeling YouTube subscriber growth?
RQ2: What is the strength and significance of the correlation between YouTube categories, population, urban population, and unemployment rate and YouTube subscribers?

2. METHODOLOGY

2.1. Method

This study employs machine learning techniques to predict the number of subscribers on YouTube. In line with the research objectives, it investigates links between video categories and uses regression models to enhance the accuracy and interpretability of the predictions. Three algorithms (LR, NN, and GP) were used for prediction, together with a correlation matrix, to analyze YouTube subscriber growth.
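The correlation-matrix part of this method can be illustrated with a minimal Python sketch. It is not the authors' implementation (the study used RapidMiner); the function names `pearson_r` and `correlation_matrix` are hypothetical, and the sample uses only five rows taken from Table 2, so its output is not comparable to the coefficients reported for the full 836-row dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of name -> list of values."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}


# Illustrative sample only: rows 1, 3, 8, 11, and 15 of Table 2.
sample = {
    "unemployment": [12.08, 4.69, 5.36, 14.7, 3.04],
    "urban_population": [183241641, 151509724, 471031528, 270663028, 64324835],
    "subscribers": [21600000, 12900000, 28400000, 38600000, 19800000],
}

matrix = correlation_matrix(sample)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Each entry of the resulting matrix lies between -1 and 1, and the row for `subscribers` corresponds to the kind of coefficient RQ2 asks about.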
The selection of the LR, NN, and GP models was guided by their complementary strengths in predictive analytics. LR offers interpretability and baseline performance for linear relationships. NN captures complex, nonlinear patterns and interactions among variables. GP provides probabilistic predictions with uncertainty quantification, which is suitable for modeling noisy and high-dimensional data. These models were chosen to assess both predictive accuracy and explanatory power. Their performance was benchmarked using RMSE, allowing a robust comparison across modeling paradigms.

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a popular data science methodology. According to surveys, it is used 49% of the time in data science initiatives, followed by Scrum and Kanban [8]. Furthermore, most CRISP-DM research in data science centers on data mining, artificial intelligence, machine learning, and deep learning. The fields of big data, data analysis, and data analytics exhibit distinct differences. Modeling is needed for data mining, AI, machine learning, and deep learning [9]. Thus, data science projects can use CRISP-DM as a research method and process model. The CRISP-DM methodology is applicable even in highly specialized domains such as deep learning within artificial intelligence and machine learning. The process model begins with business understanding, followed by data understanding, data preparation, modeling, evaluation, and deployment [10]. Data analysis depends on data cleaning and preparation, including standardization and reduction, to maintain data accuracy, completeness, and consistency (see Figure 1).

Figure 1. Research Methodology

The initial step, business understanding, established the core research problem: content creators find it difficult to forecast subscriber growth because of the complex interrelationship between content categories and socio-demographic factors. This step also established the study objective, namely developing forecasting models to inform strategic content planning and audience targeting. In the data understanding phase, the dataset was found to contain variables for YouTube content category, subscriber count, total population, urban population, and unemployment rate. Data preparation involved transforming categorical data into numerical values (18 categories), scaling continuous data, and splitting the dataset into training (90%) and testing (10%) sets. These actions prepared the data for machine learning and reduced the chance of bias during model assessment. Three machine learning methods were executed in the modeling phase: LR, NN, and GP. The evaluation phase employed RMSE as the primary metric of predictive validity. The deployment phase was considered conceptually, in terms of practical application.

2.2. Datasets and Tools

This research used descriptive quantitative data sourced from Kaggle's machine learning datasets. Kaggle's Global YouTube Statistics 2023 dataset is used in many machine learning projects (https://www.kaggle.com/datasets/nelgiriyewithana/global-youtube-statistics-2023). The dataset, provided as a CSV file, includes categories, subscribers, unemployment rate, country population, and urban population. There are 836 distinct data points in total. Of these, 90% (752 data points) were used for training and 10% (84 data points) for testing. The tool used is RapidMiner 10.3, a machine learning application that can improve the results of data analysis and machine learning and is reliable for various performance analyses of AI projects [10].
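The preparation steps described above (category encoding, scaling, and the 90/10 split) can be sketched in plain Python. This is a sketch under stated assumptions, not the RapidMiner workflow used in the study: the paper does not say which encoding or scaling scheme was applied, so integer label codes and min-max scaling are assumed here, and the helper names and the random seed are hypothetical.

```python
import random


def encode_categories(labels):
    """Map each distinct label to an integer code (assumed encoding scheme)."""
    codes = {label: i for i, label in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels], codes


def min_max_scale(values):
    """Rescale values to the range [0, 1] (assumed scaling scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def train_test_split(rows, train_fraction=0.9, seed=42):
    """Shuffle a copy of the rows and split it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Encoding and scaling on a few values that appear in the dataset description.
category_codes, code_map = encode_categories(["Music", "Comedy", "Gaming", "Music"])
scaled_unemployment = min_max_scale([12.08, 4.69, 5.36, 14.7, 3.04])

# With 836 data points, a 90/10 split gives 752 training and 84 testing rows.
train_rows, test_rows = train_test_split(range(836))
print(len(train_rows), len(test_rows))  # 752 84
```

Fixing the seed keeps the split reproducible, which matters when several models are compared on the same held-out rows.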
RapidMiner offers advanced data processing, analysis, and integration operators. Its methods for managing missing values and normalizing data help ensure that the data are error-free and ready for modeling. The independent variables are the YouTube category (X1), consisting of 18 genre categories, population (X2), unemployment (X3), and urban population (X4), whereas the dependent variable is the number of subscribers (Y) (see Table 2).

Table 2. YouTube Data

| No | YouTube Category (X1) | Population (X2) | Unemployment (X3) | Urban Population (X4) | Subscriber (Y) |
|---|---|---|---|---|---|
| 1 | Autos & Vehicles | 212559417 | 12.08 | 183241641 | 21600000 |
| 2 | Comedy | 212559417 | 12.08 | 183241641 | 44200000 |
| 3 | Education | 270203917 | 4.69 | 151509724 | 12900000 |
| 4 | Entertainment | 270203917 | 4.69 | 151509724 | 25000000 |
| 5 | Film & Animation | 212559417 | 12.08 | 183241641 | 14200000 |
| 6 | Gaming | 270203917 | 4.69 | 151509724 | 20200000 |
| 7 | How to & Style | 212559417 | 12.08 | 183241641 | 14500000 |
| 8 | Movies | 1366417754 | 5.36 | 471031528 | 28400000 |
| 9 | Music | 212559417 | 12.08 | 183241641 | 66500000 |
| 10 | News & Politics | 270203917 | 4.69 | 151509724 | 15000000 |
| 11 | Nonprofits & Activism | 328239523 | 14.7 | 270663028 | 38600000 |
| 12 | People & Blogs | 1366417754 | 5.36 | 471031528 | 14400000 |
| 13 | Pets & Animals | 328239523 | 14.7 | 270663028 | 23700000 |
| 14 | Sports | 212559417 | 12.08 | 183241641 | 12300000 |
| 15 | Science & Technology | 83132799 | 3.04 | 64324835 | 19800000 |
| 16 | Shows | 1366417754 | 5.36 | 471031528 | 70500000 |
| 17 | Trailers | 1366417754 | 5.36 | 471031528 | 36600000 |
| 18 | Travel & Events | 126014024 | 3.42 | 102626859 | 12500000 |
| … | … | … | … | … | … |
| 836 | People & Blogs | 126014024 | 3.42 | 102626859 | 12400000 |

3. RESULTS AND DISCUSSION

3.1. Data Processing

The YouTube data were analyzed using LR, NN, and GP as predictive models, together with a correlation model (see Figure 2 and Figure 3).

Figure 2. Predictive Model

Figure 2 presents a workflow diagram constructed in RapidMiner that illustrates the procedural architecture of the predictive modeling framework applied to YouTube analytics. The model integrates multiple machine learning techniques (LR, NN, and GP) to evaluate and compare predictive performance across different algorithmic paradigms.

Figure 3. Correlation Model

Figure 3 illustrates a data processing workflow designed to analyze YouTube-related data through correlation analysis. It accesses a predefined dataset containing YouTube metrics such as YouTube category, population, urban population, and unemployment rate. In data processing, the initial steps handle missing values and convert categorical variables to numbers. The data are then split into a training set (90%) and a testing set (10%). LR, NN, and GP are first employed to assess YouTube category, population, urban population, and unemployment rate. These features are then classified and linked. Next, each regression model is trained on the training data and evaluated by comparing RMSE values and the correlation matrix to find the best-performing regression model.

3.2. RMSE Comparatives

The experiments measure RMSE and the coefficient of determination. RMSE is the square root of the average squared difference between predicted and actual values (Chicco et al., 2021). The formula for RMSE is:

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - X_i\right)^2} \qquad (1)$$

where m is the number of observations, and y_i and X_i are the paired actual and predicted values for observation i.

The RMSE results are shown in Table 3.

Table 3. RMSE Result

| Algorithm | LR | NN | GP |
|---|---|---|---|
| RMSE | 26340816.061 +/- 0.000 | 26256351.509 +/- 0.000 | 27048822.217 +/- 0.000 |
| Value | NN | | |
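As an illustration of Equation (1) and of how the values in Table 3 are compared, the following minimal Python sketch computes RMSE and selects the model with the lowest reported value. The function name `rmse` and the four toy observations are assumptions made for illustration; only the three RMSE figures come from Table 3, and the study itself computed them in RapidMiner.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between paired actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(y - x) ** 2 for y, x in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example (hypothetical numbers): every prediction is off by 2.
toy_error = rmse([10, 20, 30, 40], [12, 22, 32, 42])
print(toy_error)  # 2.0

# RMSE values as reported in Table 3; a lower RMSE means a better fit.
reported_rmse = {"LR": 26340816.061, "NN": 26256351.509, "GP": 27048822.217}
best_model = min(reported_rmse, key=reported_rmse.get)
print(best_model)  # NN
```

Because RMSE is expressed in the same unit as the target, these values are in subscribers, which is why they are on the order of tens of millions.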