International Journal of Computer and Information System (IJCIS)
Peer Reviewed - International Journal
Vol : Vol. Issue 03. August 2025
e-ISSN : 2745-9659
https://ijcis.net/index.php/ijcis/index

Evaluating Machine Learning Algorithms for Predictive Modeling of Large-scale Event Attendance

Deni Kurnianto Nugroho1, Marwan Noor Fauzy2, Kardilah Rohmat Hidayat3
Department of Information System, Faculty of Computer Science
Universitas Amikom Yogyakarta
Yogyakarta, Indonesia
deni@amikom. id1 , marwannoorfauzy@amikom. id2, kardilah. rh@amikom.
Abstract: Predicting attendance at large-scale public events is a critical task to support better resource planning, logistics, and safety. This study investigates the performance of various machine learning models in forecasting event attendance using metadata features such as event type, venue, location, date, and duration. The dataset comprises over 19,526 event records obtained from a U.S. government open data repository, covering multiple years and diverse event categories. Model performance was evaluated using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the Coefficient of Determination (R²). Among the models tested, ensemble methods, particularly Gradient Boosting Regressor and XGBoost, outperformed others, achieving the lowest MAE (61.37 and 59.52) and the highest R² values (0.22 and 0.15). These results suggest superior generalization capability in capturing complex nonlinear patterns in the data. In contrast, linear models and simpler non-parametric methods such as Decision Trees and K-Nearest Neighbors (KNN) exhibited relatively weaker predictive accuracy, with R² scores close to or below 0. While the R² values indicate that metadata alone provides a limited view of attendance dynamics, the relatively low MAE across models implies that reasonable point predictions are still achievable. These findings highlight the potential of ensemble-based methods for baseline forecasting tasks. Furthermore, the study underscores the importance of incorporating richer feature sets such as pricing, weather, promotional activity, and social sentiment for future model improvement. This research provides a foundational benchmark for data-driven attendance forecasting and offers practical implications for event organizers seeking scalable, automated prediction tools to support strategic decision-making.

Keywords: forecasting, attendance, prediction, regression, optimization

I. INTRODUCTION

Accurate forecasting of event attendance is essential for the successful planning and execution of large-scale public events. Whether it involves concerts, sports competitions, festivals, political rallies, or exhibitions, predicting attendance levels has direct implications for operational efficiency, safety planning, resource allocation, and the overall attendee experience.
Overestimating attendance can lead to underutilized venues, wasted resources, and financial losses, whereas underestimating can cause overcrowding, logistical failure, and potential safety hazards.
Traditionally, event organizers have relied on qualitative judgment and past experience to estimate attendance. These methods, however, are often subjective, inconsistent, and difficult to scale, especially when dealing with novel events, changing audience behaviors, or external shocks (e.g., pandemics, extreme weather, or socio-political change). As events become increasingly complex and data-driven decision-making becomes the norm across industries, there is growing demand for more robust and automated forecasting.
With the advancement of machine learning (ML) and the increasing availability of public data from ticketing platforms, event listings, social media, and transportation networks, it is now feasible to adopt predictive analytics for attendance forecasting. These methods can leverage structured event metadata such as event title, date, time, category, location, venue capacity, and historical context to identify underlying patterns and make informed predictions about future attendance.
Several studies have shown promising results in this area. For instance, Li et al. developed a deep learning model to predict stadium attendance for football matches, incorporating both historical attendance and contextual features such as time, venue, and competing events. Similarly, Ahmed et al. proposed a hybrid machine learning framework combining metadata and social signals for concert attendance prediction, demonstrating improved accuracy over traditional regression models. Another study by Liu et al. employed gradient boosting trees to forecast demand for citywide public events, emphasizing the importance of location and weather variables.
Despite these advances, many existing models are either domain-specific or rely heavily on proprietary or real-time features (e.g., social media buzz, ticket price), limiting their applicability across broader contexts.
This research proposes a simpler yet scalable approach: to investigate how effectively attendance can be predicted using only basic, publicly available event metadata. The core research questions are: (1) How accurate are different machine learning models in forecasting attendance using minimal inputs? (2) Which features contribute most to the predictive power? and (3) How does model performance vary across event types, locations, and timeframes? By answering these questions, this study contributes toward developing generalizable, low-cost predictive tools that can be applied across various domains, especially in contexts where rich historical or behavioral data is unavailable. The findings can benefit city governments, event organizers, ticketing platforms, and public safety agencies by improving planning precision.
This study aims to compare the performance of several machine learning models in predicting event attendance, and contributes a large-scale benchmark using public datasets for practical deployment in event planning.

II. RESEARCH METHODS
This research aims to develop and compare various regression models to predict attendance at large-scale events based on contextual features such as time, location, and event type. The approach used is quantitative and based on historical data available in tabular form.
The methodological process involves several stages, as follows:
Data Collection and Preprocessing
This study uses the Parks' Special Events dataset, which is publicly available through Data.gov and provided by the Department of the Interior. The dataset was selected due to its availability, coverage of diverse event types, and inclusion of key metadata fields. However, it does not include real-time user behavior data or promotional campaign variables, which may affect attendance in real-world scenarios. The dataset contains curated information about one-time special events facilitated by NYC Parks' Public Programs.
The dataset was cleaned by removing rows with empty or null Attendance values. Entries with invalid time values were also filtered out. Time features such as Month, DayOfWeek, Hour, and a weekend indicator (IsWeekend) were extracted from the date column. Categorical features such as location, event category, and event type were encoded into numeric form using one-hot encoding, resulting in a numeric dataset ready for use in model training.
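The cleaning and encoding steps described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' pipeline: the sample records, the `Date` format, and the helper names `clean_and_extract` and `one_hot` are assumptions introduced here, and only the feature names Month, DayOfWeek, Hour, and IsWeekend come from the text.

```python
from datetime import datetime

# Hypothetical raw records; the field layout is illustrative, not the dataset's real schema.
RAW_EVENTS = [
    {"Date": "2019-06-15 14:00", "Borough": "Brooklyn", "Category": "Festival", "Attendance": "250"},
    {"Date": "2019-06-17 09:30", "Borough": "Queens", "Category": "Sports", "Attendance": ""},
    {"Date": "not a date", "Borough": "Bronx", "Category": "Concert", "Attendance": "80"},
    {"Date": "2019-06-16 18:00", "Borough": "Queens", "Category": "Concert", "Attendance": "120"},
]


def clean_and_extract(rows):
    """Drop rows with missing attendance or unparseable dates, then add time features."""
    cleaned = []
    for row in rows:
        if not row["Attendance"].strip():
            continue  # empty Attendance value
        try:
            when = datetime.strptime(row["Date"], "%Y-%m-%d %H:%M")
        except ValueError:
            continue  # invalid time value
        cleaned.append({
            "Borough": row["Borough"],
            "Category": row["Category"],
            "Attendance": float(row["Attendance"]),
            "Month": when.month,
            "DayOfWeek": when.weekday(),  # Monday=0 ... Sunday=6
            "Hour": when.hour,
            "IsWeekend": int(when.weekday() >= 5),
        })
    return cleaned


def one_hot(rows, columns):
    """Replace each listed categorical column with 0/1 indicator columns."""
    levels = {col: sorted({r[col] for r in rows}) for col in columns}
    encoded = []
    for r in rows:
        out = {k: v for k, v in r.items() if k not in columns}
        for col in columns:
            for level in levels[col]:
                out[f"{col}_{level}"] = int(r[col] == level)
        encoded.append(out)
    return encoded


events = one_hot(clean_and_extract(RAW_EVENTS), ["Borough", "Category"])
for event in events:
    print(event)
```

In practice a dataframe library would likely handle these steps more compactly; the plain-Python version is used here only to make each step explicit.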
Exploratory Data Analysis
Data exploration was conducted to understand emerging patterns in event attendance based on time, location, and event type. Several visualizations were used, such as histograms of the attendance distribution and boxplots by day of the week, event location, and popular event category. This analysis provides an initial overview of how these variables influence attendance.
Feature Engineering and Encoding
After data exploration, feature engineering was performed to prepare the data for input into the machine learning models. Categorical variables such as DayOfWeek, Borough, and Category were converted into numeric representations using one-hot encoding. The binary variable IsWeekend was also converted to the values 0 and 1. Furthermore, a Pearson correlation analysis was performed to identify the extent to which each feature correlated with the target variable, Attendance. The correlation results were presented in the form of a heatmap to aid in selecting relevant features for the modeling process.
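As an illustration of the correlation step, the sketch below computes the Pearson coefficient between each feature and the target and ranks features by absolute correlation. The feature values and attendance figures are invented for the example, and `pearson` and `rank_features` are hypothetical helper names; a heatmap would visualize the full feature-by-feature matrix rather than this single column.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one column is constant
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Return (name, r) pairs sorted by absolute correlation with the target."""
    scores = []
    for name, values in features.items():
        r = pearson(values, target)
        if not math.isnan(r):
            scores.append((name, r))
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Made-up values for illustration only.
features = {
    "IsWeekend": [0, 0, 1, 1, 0, 1],
    "Hour": [9, 18, 14, 10, 12, 19],
    "Month": [1, 6, 6, 7, 12, 8],
}
attendance = [40, 60, 200, 180, 50, 220]

for name, r in rank_features(features, attendance):
    print(f"{name:<10} r = {r:+.3f}")
```

Pearson correlation only captures linear association, so a feature with a weak coefficient may still be useful to a tree-based model.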
Prediction Model Development
To build an event attendance prediction system, ten different regression algorithms were used to compare performance and identify the most appropriate model for ticket and capacity prediction. Model selection was based on the diversity of approaches they represent, ranging from simple linear models to complex ensemble-based and nonlinear algorithms. The following is a list of the models used, along with their explanations:
Linear Regression
Linear Regression is the most basic regression method, assuming a linear relationship between the features and the target. This model is often used as a baseline due to its high interpretability.
Ridge Regression
Ridge Regression adds an L2 penalty to the linear regression loss function, which is useful for addressing multicollinearity and overfitting.
Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator) uses L1 regularization to produce a sparser model with automatic feature selection.
ElasticNet Regression
ElasticNet combines L1 and L2 penalties, making it suitable when features are correlated and the model benefits from both forms of regularization.
Decision Tree Regression
This model recursively partitions the data on the most informative features and is effective at capturing nonlinear relationships without the need for feature transformation.
Random Forest Regression
Random Forest is an ensemble of many decision trees trained on subsets of the data and features, resulting in a model that is robust against overfitting.
Gradient Boosting Regressor
Gradient Boosting Regressor builds a predictive model in stages, with each stage minimizing the error of the previous model. This technique has proven highly effective in various machine learning competitions.
XGBoost Regression
XGBoost is an optimized implementation of gradient boosting that adds regularization and is engineered for computational efficiency, making it well suited to large datasets.
Support Vector Regression (SVR)
SVR works by finding the optimum margin in a higher-dimensional feature space and is known to handle cases with significant noise or outliers.
K-Nearest Neighbors Regression (KNN)
KNN predicts a target value based on the average of its k nearest neighbors. Although simple, it can be effective if the data distribution is sufficiently dense and uniform.
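For reference, the penalties mentioned for the regularized linear models, and the averaging rule used by KNN, can be written in a generic textbook form. The notation is an assumption introduced here: design matrix $X$, target vector $y$, coefficients $\beta$, penalty weights $\lambda, \lambda_1, \lambda_2 \ge 0$, and $N_k(x)$ for the set of the $k$ training points nearest to $x$. Library implementations typically scale these terms differently.

```latex
\begin{align}
\hat{\beta}^{\text{ridge}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \\
\hat{\beta}^{\text{lasso}} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \\
\hat{\beta}^{\text{enet}}  &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2 \\
\hat{y}_{\text{KNN}}(x)    &= \frac{1}{k} \sum_{i \in N_k(x)} y_i
\end{align}
```

Setting the penalty weights to zero recovers ordinary least squares in the first three cases, which is why plain Linear Regression serves as the natural baseline for them.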
Each model was trained on the training subset and tested on the held-out test subset. Training used the default parameters of the scikit-learn or xgboost libraries, without further hyperparameter tuning, to ensure a fair and objective baseline comparison. The primary goal of this approach was to evaluate the baseline performance of each model on the event attendance prediction problem before considering further optimization.
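A baseline comparison of this kind might be set up along the following lines. This is a sketch under stated assumptions, not the authors' code: the data is synthetic (`make_regression` stands in for the encoded event features), the 80/20 split and the seed are arbitrary choices for illustration, and XGBoost is included only if the `xgboost` package happens to be installed. `build_models` and `evaluate_baselines` are hypothetical helper names.

```python
import math

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


def build_models(seed=0):
    """Regressors with library defaults; the seed only fixes randomness for repeatability."""
    models = {
        "Linear Regression": LinearRegression(),
        "Ridge": Ridge(),
        "Lasso": Lasso(),
        "ElasticNet": ElasticNet(),
        "Decision Tree": DecisionTreeRegressor(random_state=seed),
        "Random Forest": RandomForestRegressor(random_state=seed),
        "Gradient Boosting": GradientBoostingRegressor(random_state=seed),
        "SVR": SVR(),
        "KNN": KNeighborsRegressor(),
    }
    try:
        from xgboost import XGBRegressor  # optional dependency
        models["XGBoost"] = XGBRegressor(random_state=seed)
    except ImportError:
        pass
    return models


def evaluate_baselines(X, y, test_size=0.2, seed=0):
    """Fit every model on the training split and score it on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    results = {}
    for name, model in build_models(seed).items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "MAE": mean_absolute_error(y_test, pred),
            "RMSE": math.sqrt(mean_squared_error(y_test, pred)),
            "R2": r2_score(y_test, pred),
        }
    return results


# Synthetic stand-in for the encoded feature matrix and attendance target.
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
results = evaluate_baselines(X, y)
for name, scores in sorted(results.items(), key=lambda kv: kv[1]["MAE"]):
    print(f"{name:<18} MAE={scores['MAE']:8.2f} RMSE={scores['RMSE']:8.2f} R2={scores['R2']:6.2f}")
```

Because the synthetic target is linear by construction, the ranking printed here will not resemble the paper's results; the sketch only shows the shape of a default-parameter comparison.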
To evaluate the prediction performance, we used standard regression metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R²). These metrics capture both the average magnitude of errors and the goodness of fit.
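The three metrics can be computed directly from their definitions, as in the short sketch below. The function names and the sample attendance values are illustrative assumptions, and `r2` assumes the true values are not all identical (otherwise the denominator is zero).

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors, in attendance units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: squaring weights large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up attendance figures for illustration.
actual = [100, 200, 300]
predicted = [110, 190, 330]

print(f"MAE  = {mae(actual, predicted):.2f}")
print(f"RMSE = {rmse(actual, predicted):.2f}")
print(f"R2   = {r2(actual, predicted):.3f}")
```

RMSE is never smaller than MAE on the same predictions, and the gap widens when a few errors are much larger than the rest, which is why the two metrics can rank models differently.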
III. RESULT AND ANALYSIS
After conducting experiments on the 10 selected methods, the following results were obtained:
3.1 Mean Absolute Error (MAE) Evaluation
Figure 1. MAE score comparison across different regression models for event attendance prediction
Figure 1 illustrates the performance of various regression models in predicting event attendance, evaluated using the Mean Absolute Error (MAE) metric. Lower MAE values indicate better model accuracy in estimating actual attendance. As shown in the figure, Random Forest and XGBoost achieved the lowest MAE scores (both below 60), indicating superior predictive accuracy. These models are both ensemble-based tree learners that can capture non-linear relationships and complex feature interactions effectively. Their dominant performance suggests that ensemble techniques are well-suited for attendance forecasting, particularly when dealing with diverse event metadata such as category, venue, or scheduled time.
The Gradient Boosting Regressor and Support Vector Regressor (SVR) also demonstrated relatively strong performance, with MAE values in the range of 61-62. These models, while slightly less accurate than Random Forest and XGBoost, still manage to capture non-linearity in the data. SVR, in particular, is known for its robustness against outliers and overfitting in high-dimensional settings.
On the other hand, traditional linear models such as Linear Regression, Ridge Regression, and Lasso Regression performed moderately, with MAE values between 70 and 72. These results indicate that linear methods may struggle to model the complexities in attendance patterns, especially when important non-linear effects or feature interactions are present. The ElasticNet model produced the highest MAE, approximately 76, making it the least accurate among the evaluated models. This could be due to the double-penalty mechanism (L1 and L2 regularization), which may oversimplify relationships between features and outcomes when applied to sparse or non-linearly distributed data.
Overall, the trend observed across models shows a clear performance gap between linear and non-linear approaches. Models capable of learning complex patterns, particularly tree-based ensembles, consistently outperform those relying on linear assumptions. This suggests that future work in event attendance prediction should emphasize non-linear modeling approaches, especially when using limited but structured input data.

3.2 RMSE-Based Model Evaluation
Figure 2. RMSE scores of different regression models for predicting event attendance.
Figure 2 compares the predictive performance of all evaluated regression models based on the Root Mean Squared Error (RMSE). RMSE is particularly useful for emphasizing large prediction errors due to its squared error term. A lower RMSE value indicates that a model generates predictions with smaller deviations from actual values, especially in high-variance scenarios.
As observed in Figure 2, the Gradient Boosting Regressor achieves the lowest RMSE, underscoring its ability to minimize both minor and major deviations in prediction. This result complements its MAE-based performance and confirms the model's robustness in handling the non-linear relationships embedded in event data. The model's dominance across multiple evaluation metrics reinforces the findings of prior studies that favor ensemble learning in structured prediction tasks.
Other models with competitive RMSE values include XGBoost, Random Forest, and Lasso Regression, all ranging between approximately 165 and 170. This suggests a consistent performance advantage among ensemble-based models and regularized linear regressors when dealing with high-dimensional tabular data.
Conversely, the Decision Tree Regressor records the highest RMSE, indicating a tendency to overfit the training data while failing to generalize. Support Vector Regressor and K-Nearest Neighbors Regressor also exhibit relatively poor RMSE values (starting at approximately 180), signaling their sensitivity to feature scaling and potential limitations in modeling heterogeneous metadata features.
Interestingly, traditional linear models such as Ridge Regression, Linear Regression, and ElasticNet fall within a middle range (starting at approximately 170), confirming their moderate capacity to generalize while lacking the expressiveness of more complex models.
A visible trend in this evaluation is that ensemble-based models consistently outperform single estimators or simpler regressors across all metrics. Moreover, models that tend to overfit or rely on distance-based assumptions (like Decision Tree and KNN) show elevated RMSE, highlighting the challenge of generalizing from sparse metadata alone. These results substantiate the earlier MAE findings and suggest that RMSE, while more sensitive to outliers, reaffirms the dominance of Gradient Boosting Regressor as the most suitable model under current feature constraints.

3.3 R²-Based Model Evaluation
This section assesses each model's ability to explain the variance in event attendance based solely on event-level metadata. In contrast to MAE and RMSE, which quantify prediction error magnitude, R² represents the proportion of variance in the dependent variable that is accounted for by the model. A value of 1.0 denotes perfect fit, 0.0 indicates no explanatory power beyond predicting the mean, while negative values suggest worse performance than a baseline mean predictor.
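The three cases above follow from the standard definition of the coefficient of determination. The symbols are introduced here for illustration: $y_i$ is the observed attendance, $\hat{y}_i$ the model's prediction, and $\bar{y}$ the mean observed attendance over the $n$ test events.

```latex
\[
R^2 \;=\; 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]
\[
\hat{y}_i = y_i \;\Rightarrow\; R^2 = 1, \qquad
\hat{y}_i = \bar{y} \;\Rightarrow\; R^2 = 0, \qquad
\sum_{i}(y_i - \hat{y}_i)^2 > \sum_{i}(y_i - \bar{y})^2 \;\Rightarrow\; R^2 < 0
\]
```

Read this way, an R² of 0.22 means the model's squared error is about 78% of the squared error of a predictor that always outputs the mean attendance.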
Among the models tested, Gradient Boosting Regressor achieves the highest R² score (0.22), indicating its superior capability to extract relevant patterns from the feature set and explain approximately 22% of the variance in attendance outcomes. This performance reinforces its status as the most consistently effective model across the evaluation metrics, as evidenced by its lowest RMSE and competitive MAE in the preceding analyses.
Models such as XGBoost, Ridge Regression, Linear Regression, and Lasso Regression show relatively comparable R² scores. These values suggest moderate performance; for the linear models, this potentially stems from their reliance on linear assumptions that limit their ability to fully capture the non-linear dynamics often present in real-world attendance behavior.
Random Forest exhibits a slightly lower R², despite competitive error-based scores (MAE/RMSE). This suggests a scenario where the model generates accurate point predictions without effectively explaining the overall variance, likely due to its ensemble architecture averaging out individual feature effects.
On the lower end of the spectrum, models such as ElasticNet, K-Nearest Neighbors Regressor, and Support Vector Regressor yield R² values below 0.1, indicating minimal explanatory utility. The Decision Tree Regressor, notably, reports a negative R² value, which reveals that it performs worse than a naive model predicting the mean attendance across all events. This result is a clear indication of severe overfitting and lack of generalization to unseen data.
Overall, the generally low R² scores suggest that event metadata alone, including categorical and temporal features, is insufficient to model the full behavioral complexity underlying attendance decisions. This limitation points to the potential benefit of incorporating additional data sources, such as ticket pricing, artist popularity, promotional activity, social sentiment, or competing events, to increase model informativeness and predictive utility.
Figure 3. R² scores across different regression models for event attendance prediction.
Figure 3 reports the coefficient of determination (R²) for all evaluated regression models.
Despite these constraints, Gradient Boosting Regressor remains the most reliable model in this study, offering the best trade-off between error minimization and variance explanation. These findings align with the broader literature on machine learning for tabular data, which consistently favors tree-based ensemble methods in scenarios involving sparse, heterogeneous, or weakly predictive input features.

3.4 Model Performance Evaluation
Table 1. Performance Metrics for Attendance Prediction Models
Model             MAE     RMSE
Grad Boost Reg
XGBoost
Ridge Reg
Linear Reg
Lasso Reg
Random Forest
ElasticNet
KNN Regressor
SVR
Decision Tree
The Gradient Boosting Regressor outperformed other models across multiple metrics, achieving the highest R² score (0.22), indicating that it explains 22% of the variance in the attendance data. It also achieved a competitive MAE (61.37) and RMSE, showing strong accuracy and error minimization relative to most other models.
Interestingly, while Random Forest and XGBoost had slightly lower MAE values than Gradient Boosting (XGBoost: 59.52), their R² values were lower (0.12 and 0.15, respectively), suggesting that although their average prediction errors were smaller, they were less effective at capturing overall variance.
Classical linear models such as Linear Regression, Ridge, and Lasso all yielded similar performance, with R² around 0.14 and RMSE above 166. This implies that these models have limited ability to model complex nonlinear patterns in the data, likely due to the simplicity of the features (e.g., date, location, and category).
Models such as ElasticNet, K-Nearest Neighbors (KNN), and Support Vector Regressor (SVR) demonstrated weaker performance, with lower R² scores and higher RMSE. The SVR and Decision Tree models, in particular, showed negative R² values (-0.02 for SVR), indicating poor generalization and worse performance than a naive mean predictor.
Overall, ensemble-based models (Gradient Boosting Regressor, XGBoost, and Random Forest) showed superior performance compared to individual regressors, supporting findings from prior studies which emphasize their robustness in handling heterogeneous and sparse feature spaces.
The findings highlight the practical advantages of utilizing Gradient Boosting Regressor in predicting event attendance using metadata features. With the highest R² score among all tested models, Gradient Boosting Regressor demonstrates a superior capability to learn from complex patterns, non-linear relationships, and subtle feature interactions, characteristics often inherent in real-world event data.
From an operational perspective, this insight is especially valuable for event organizers, ticketing platforms, and digital marketing teams. By deploying Gradient Boosting Regressor based predictive models early in the event lifecycle, stakeholders can make data-informed decisions on resource allocation, marketing budget distribution, and ticket inventory management. For instance, events flagged as high-demand by the model can trigger earlier promotional campaigns or dynamic pricing strategies.
Additionally, the consistently poor performance of simpler models such as k-Nearest Neighbors or Decision Tree regressors underscores the importance of model selection in production environments. While those models offer ease of interpretation and lower computational cost, their inability to generalize well to the event attendance problem renders them less suitable for practical deployment.
Furthermore, the performance gap between Gradient Boosting Regressor and linear models (like Ridge and Lasso) implies that non-linear modeling approaches should be prioritized in domains where feature interactions and saturation effects are likely, such as when predicting human behavior influenced by time, location, pricing tiers, and genre.

VI. CONCLUSION
This study investigated the feasibility of predicting event attendance using machine learning models trained on publicly available event metadata. The models were evaluated based on three primary performance metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the Coefficient of Determination (R²). Our findings reveal that among the tested models, Gradient Boosting Regressor and XGBoost achieved the best overall performance, with MAE scores of 61.37 and 59.52, and R² scores of 0.22 and 0.15, respectively.
These results suggest that ensemble-based models are more effective at capturing the underlying patterns in event attendance data, likely due to their ability to model complex nonlinear relationships.
Traditional linear models, including Ridge, Lasso, and ElasticNet regression, yielded lower R² values (0.14 or below), indicating limited capacity to explain the variance in attendance based solely on basic event metadata. Similarly, non-parametric models like K-Nearest Neighbors and Decision Trees exhibited suboptimal performance, further reinforcing the importance of advanced ensemble methods in this context.
Despite the modest R² scores across all models, the relatively low MAE indicates that the models were still able to generate reasonably accurate point estimates of attendance. However, the overall predictive performance also reflects the limitations of using metadata alone (e.g., event date, location, category) without incorporating richer contextual features such as ticket price, social media engagement, weather conditions, or promotional campaigns.
From a practical standpoint, the findings imply that organizers can rely on ensemble-based models like Gradient Boosting to obtain baseline attendance forecasts using minimal input features, particularly when real-time or granular data is unavailable. This approach can support preliminary decision-making in event logistics, including early resource allocation, crowd management strategies, and staff scheduling.
Nevertheless, the study has some limitations. The current models are trained exclusively on structured metadata without accounting for external or dynamic factors that may strongly influence attendance. Consequently, the relatively low R² values suggest that much of the variance remains unexplained. Future work should incorporate richer contextual and temporal information, such as real-time weather, social sentiment, pricing strategies, event marketing efforts, and historical attendance trends, to improve predictive accuracy. Additionally, model transparency and fairness should be considered, particularly if such predictive tools are to be deployed in high-stakes settings like public safety planning or ticket pricing optimization.

ACKNOWLEDGMENT
The authors would like to express their deepest gratitude to Universitas Amikom Yogyakarta for the continuous support, research facilities, and academic environment that enabled the successful completion of this study. Special thanks also go to the Department of Information Systems, Faculty of Computer Science, for their guidance, technical input, and encouragement throughout the research process.
We extend our sincere appreciation to the U.S. Department of the Interior for providing access to valuable open data through the Parks & Special Events Dataset, which significantly contributed to our modeling and analysis of public event attendance patterns. We also acknowledge the contributions of fellow researchers, academic peers, and anonymous reviewers whose constructive feedback helped improve the quality and clarity of this paper.
Furthermore, we are grateful to the broader data science and open-source community whose tools, libraries, and platforms greatly facilitated the data processing, modeling, and visualization tasks throughout this study. The availability of reproducible and well-documented software frameworks was instrumental in ensuring the robustness and transparency of our findings.

REFERENCES
Conference on Knowledge Discovery & Data Mining, 2019, 2497-2505.
Y. Li, Yu, and H. Gao, "Forecasting sports attendance using neural networks: A study of football matches," Knowledge-Based Systems, vol. 191, 2020.
Ahmed, Khan, and H. Malik, "Ensemble-based attendance prediction for live music events using hybrid data sources," in Proceedings of the 2022 ACM Web Conference, pp. 1856-1865.
Liu, Chen, and X. Ma, "Predicting public event attendance with urban sensing and gradient boosted models," Journal of Big Data, vol. 6, no. 1, 2019.
Montgomery et al., "Linear regression analysis," Journal of Statistical Software, vol. 8, no. 2, pp. 1-20, 2003.
A. Hoerl and R. Kennard, "Ridge Regression: Biased Estimation for Nonorthogonal Problems," Technometrics, vol. 12, no. 1, pp. 55-67, 1970.
R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267-288, 1996.
H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society: Series B, vol. 67, no. 2, pp. 301-320, 2005.
L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001.
J. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, 2001.
T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794, 2016.
A. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199-222, 2004.
C. Zhang and Y. Ma, Ensemble Machine Learning: Methods and Applications. Springer, 2012, ch. 3 (KNN regression).