Journal of Natural Resources and Environmental Management
http://dx.doi.org/10.29244/jpsl.

RESEARCH ARTICLE

Modeling Landslide Hazard Using Machine Learning: A Case Study of Bogor, Indonesia

Boedi Tjahjono (a), Indah Firdania (a), Bambang Hendro Trisasongko (a, b)

(a) Department of Soil Science and Land Resources, Faculty of Agriculture, IPB University, IPB Darmaga Campus, Dramaga, Bogor, 16680, Indonesia
(b) Geospatial Information and Technologies for the Integrative and Intelligent Agriculture (GITIIA), Center for Regional System Analysis, Planning and Development (CRESTPENT), IPB University, IPB Baranangsiang Campus, Bogor, 16153, Indonesia

Article History
Received 07 July 2023
Revised 13 October 2023
Accepted 13 June 2024

Keywords
landslide, machine learning, modeling, random forest

ABSTRACT

Landslides occur in many parts of the world. Well-known drivers, such as geological activity, are often amplified by intense precipitation in tropical regions, creating complex multi-hazard phenomena that complicate mitigation strategies. This research investigated the utility of spatial data, especially the SRTM digital elevation model and Landsat 8 remotely sensed data, for estimating landslide distribution using a machine learning approach. Bogor Regency was chosen to demonstrate the approach because of its extensive hilly and mountainous terrain and high rainfall. This study aimed to model landslide hazards in Sukajaya District using random forests and to analyze the key variables that contribute to identifying areas of high landslide probability. The initial model, using the default random forest settings, achieved an overall accuracy of 93%, with a confidence interval of 91 to 94%. The three main predictors of landslides were rainfall, elevation, and slope. Landslides were found to occur primarily in areas with high rainfall (…,668–3,228 mm), elevations of 500 to 1,500 m, and steep slopes (…–45%).
Approximately 4,536 ha were potentially prone to landslides, while the remaining area (> 12,000 ha) appeared relatively safe.

Introduction

Natural disasters vary significantly, and each has specific characteristics that lead to complex mitigation strategies. Hydrometeorological and geological disasters are two types of natural disasters commonly found in Indonesia. The former recur because Indonesia is located in a tropical region with high, often torrential, rainfall of distinctive amplitude. This leads to disasters in the form of droughts, landslides, tornadoes, and floods. With shifting climate patterns, prior expectations of droughts and floods are no longer completely valid, a situation that warrants suitable adaptive mitigation planning.

Geological disasters are closely associated with the geographical location of Indonesia at the junction of the Eurasian, Indo-Australian, and Pacific plates. These plates have been acknowledged to be very active compared with the rest of the world. Their movement drives tectonic activity, which results in diverse and frequent natural disasters, such as earthquakes, tsunamis, volcanic eruptions, and landslides. Landslides have a specific occurrence scheme that has become an emerging research focus in disaster studies. Although many consider landslides to be geologically related disasters, they can also be triggered by intensive, high-volume precipitation in specific regions. Landslide studies that consider multiple disaster sources have attracted considerable research attention.

Given the nature of hazard and disaster research, the spatial context is critical in data analysis, information extraction, and visualization. Spatial data contribute to wall-to-wall studies of landslides and other hazards. In general, these data can be classified into vector and raster data. Vector data generally provide baseline information, such as a final map for public use.
This type of data does not suit frequently updated information, for which raster data are more appropriate.

Corresponding Author: Boedi Tjahjono (boetjah@apps…), Department of Soil Science and Land Resources, Faculty of Agriculture, IPB University, IPB Darmaga Campus, Dramaga, Bogor, Indonesia.
© 2024 Tjahjono et al. This is an open-access article distributed under the terms of the Creative Commons Attribution (CC BY) license, allowing unrestricted use, distribution, and reproduction in any medium, provided proper credit is given to the original authors.

In hazard studies, both digital elevation model (DEM) and Earth observation data are useful. DEM data, which represent the morphology of a region, have been demonstrated to be vital for estimating and mapping landslide-affected areas. The resolution of the DEM contributes significantly to the quality of such analyses. Thus, better DEM data in the future, in terms of both spatial and temporal resolution, would make a significant contribution to landslide monitoring and other hazard studies.

Combining both types of raster data has proven useful. Earlier research summarized the utility of remote sensing images. Extending this context, multiple hazards have recently gained increasing attention. Shafique et al. analyzed remote sensing data for landslide events triggered by the Kashmir earthquake. Further extensions have used multi-temporal data to understand the history of landslides.

Irrespective of the type or amount of data, thematic information extraction has generally been performed using machine learning for either classification or regression problems. The main advantages of this approach include optimization of the model through subsampling and options for model reimplementation with minimum data inputs. Accordingly, studies on machine learning applications for hazards, especially landslides, have been conducted.
Commonly used machine learning methods such as random forests have been demonstrated to provide accurate predictions of landslide vulnerability in Taiwan. Tengtrairat et al. indicated the benefits of a bidirectional long short-term memory (BiLSTM) model in disaster risk analysis in Thailand. Contemporary research shows strong performance of machine learning methods compared with conventional methods.

Although various machine learning techniques have been used in previous studies, the geographical context indicates the need for studies in specific geographies, either to better understand landslide characteristics or as part of the development of generic machine learning models. This research was designed to develop machine-learning-based modeling as an initial step toward generic models for similar landscapes. Specifically, it aimed to study the random forest approach to produce geographical estimates of landslide occurrence and to examine the important variables yielded by the selected models. In addition, parameter optimization was performed to obtain a statistically better model.

Methodology

Study Area

This research was conducted in Sukajaya District, Bogor Regency (Figure 1). The research area has an undulating to mountainous topography, which includes the upper and middle slopes of the Halimun-Salak Mountains. With complex terrain conditions, the study area has a record of past landslides and a high likelihood of future land surface movements.

Figure 1. Study area.

This journal is © Tjahjono et al. JPSL, 14.

Data Collection and Analysis

The landslide point inventory was compiled through observation and interpretation of high-resolution Google Earth images from the last three years (Figure 2). Some of these observation points were validated using field surveys conducted between February and March 2021. Road signs related to landslide occurrences were recorded as evidence of past disasters.
In addition, landslide scars or prior scars (usually non-woody vegetation) were also collected; some were reported by locals. All data (…,040 sample points), in either the landslide (…,021 points) or non-landslide (…,019 points) category, were combined into a spatial database, which served as the sample set for data analysis using a machine learning approach. Figure 3 shows the spatial coverage of the sample dataset.

Figure 2. Identifying landslide events from Google Earth.

Figure 3. Spatial distribution of the landslide point dataset.

This study assessed proxies of landslide events derived from three types of primary data. Terrain conditions were represented by Shuttle Radar Topography Mission (SRTM) DEM data with a spatial resolution of 30 m, downloaded from the US Geological Survey (USGS) website. This resolution was chosen to match the Landsat 8 data. The Landsat images were downloaded with Level 2 processing; hence, standard radiometric correction had already been performed by the USGS. Landsat data represent the dynamics of land cover, which is an essential element in landslide modeling. The third dataset was rainfall obtained from the Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS), downloaded from their website (https://w…edu/data/chirps, accessed on 2 July 2…). All datasets were spatially co-registered to the Landsat data. Figure 4 illustrates the research framework used in this study.

DEM data were processed using QGIS software to generate five terrain variables: elevation, slope, aspect, Topographic Wetness Index (TWI), and slope curvature. Różycka et al. showed that although common terrain parameters, such as slope, were quite successful, the application of TWI in modeling offers opportunities for more detailed information. Changes in slope appearance due to landslides were indicated by high TWI values.
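For reference, TWI is commonly written in the standard form below. This is offered as an assumption, since the text does not state which formulation or QGIS tool was used; the symbols $a$ and $\beta$ are introduced here only for illustration.

```latex
\mathrm{TWI} = \ln\left(\frac{a}{\tan\beta}\right)
```

Here $a$ is the specific catchment area (upslope contributing area per unit contour width) and $\beta$ is the local slope angle, so gentle, convergent terrain with a large contributing area yields high TWI values.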
Previous research indicated that terrain with a specific plan curvature carried a higher potential hazard than concave or convex slopes.

Figure 4. Research framework.

The availability of Landsat 8 Surface Reflectance (SR, Level 2) data accelerates data processing, considering that this type of data has already undergone radiometric calibration. Thus, SR data can be directly inserted into the Normalized Difference Vegetation Index (NDVI) equation, as proposed by Rouse et al., using bands 4 (red) and 5 (near-infrared). As a good representation of land cover conditions, NDVI has become one of the main variables in spatial analyses related to natural disasters, including landslides. The use of NDVI in landslide analysis has been demonstrated in previous studies.

Rainfall is highly correlated with landslide events. Heavy, frequent downpours destabilize soils on sloping land surfaces; hence, precipitation was included in the analysis. Hong et al. presented a global analysis of the influence of rainfall on landslide events and concluded that rainfall is a potential proxy for landslide modeling.

Data analysis for classifying landslide and non-landslide classes was carried out in R, with source code written in RStudio. All raster and vector data were converted using the 'raster' package. All sample data and the raster attributes at corresponding locations were collected in a single data frame, the R structure used for data preparation prior to modeling. Shortly before modeling, the samples were partitioned into training and testing data at a ratio of 70:30, with ten-fold cross-validation resampling. Random forest modeling was carried out using a commonly used R package, 'randomForest'. This approach was developed as an extension of conventional decision trees.
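The 70:30 partition described above can be illustrated with a minimal sketch. It is written in Python rather than the R code used in the study, and the function name `stratified_split`, the seed, and the toy labels are hypothetical. Splitting each class separately is also an assumption, since the text does not say whether the partition was stratified.

```python
import random

def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # random order within the class
        cut = round(train_fraction * len(idx)) # size of the training share
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 10 landslide (1) and 10 non-landslide (0) points.
labels = [1] * 10 + [0] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 14 6
```

Because each class is shuffled and cut on its own, both subsets keep the same landslide to non-landslide proportion as the full sample set.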
While decision-making in decision tree methods is fairly robust for simple classification problems, a single decision-making process does not suit complex problems. Thus, an ensemble approach would theoretically reduce the bias. The ensemble approach has been adopted in many newly developed decision tree algorithms, including Extreme Gradient Boosting.

The random forest base model was first examined with default parameterization and then revised by analyzing how accuracy varied across several arbitrary parameter values. This procedure is known as tuning. This research focused on tuning the n-tree parameter, considering that the number of decision trees has been shown to be very important for enhancing overall accuracy.

The overall accuracy was computed using validation data independent of the training data, which comprised 30% of the total sample pixels. First, we assessed the confusion matrix to better understand the deviations. Next, we computed the overall accuracy by summing the diagonal of the confusion matrix and dividing by the total number of validation samples. To identify the variables with the highest contribution, this study used the variable importance approach. The model with the highest accuracy was subsequently inverted (applied to the raster predictor layers) to generate estimates of the landslide susceptibility distribution. The inversion was implemented using the 'raster' package in R.

Results and Discussion

Model Accuracy

The random forest model with default settings yielded an overall accuracy of 93%. The confidence interval ranged from 91 to 94%, suggesting that a random forest model is an excellent choice for the initial modeling process. The results compared favorably with similar models reported in India, Japan, and Greece. However, the number of samples plays a significant role. This initial outcome should therefore be extended to more diverse landscapes, especially areas with different precipitation levels.
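The accuracy computation described in the methodology can be sketched as follows. The confusion matrix counts are purely illustrative and are not the study's results. The Wilson score interval is used only as one common way to attach a confidence interval to an accuracy, since the text does not state which interval method was applied; the function names are hypothetical.

```python
import math

def overall_accuracy(matrix):
    """Return (accuracy, n): diagonal sum divided by the total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total, total

def wilson_interval(p, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion p of n samples."""
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Illustrative confusion matrix (rows: reference, columns: predicted).
# These counts are made up for the example, not taken from the study.
confusion = [
    [45, 5],   # reference landslide: 45 correct, 5 missed
    [3, 47],   # reference non-landslide: 3 false alarms, 47 correct
]

accuracy, n = overall_accuracy(confusion)
low, high = wilson_interval(accuracy, n)
print(f"overall accuracy = {accuracy:.2f}")  # overall accuracy = 0.92
print(f"95% interval = {low:.3f} to {high:.3f}")
```

The interval narrows as the number of validation samples grows, which is one reason the sample size matters when comparing reported accuracies.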
Machine learning algorithms generally have several parameters that need to be tuned. Tuning is typically used to optimize the performance of the initial model. Compared with equivalent algorithms such as Support Vector Machine (SVM), the random forest method has fewer parameters to adjust. Therefore, this method is generally advisable, particularly when standard processing yields weak models. In this study, only the n-tree parameter was investigated. The accuracies obtained with different n-tree settings are presented in Table 1.

Table 1. Configuration of n-tree number and accuracy.
N-tree setting    Overall accuracy

The tuning experiment indicated that the optimal n-tree for the data was 500. Although the overall accuracies at the various values were not significantly different (roughly under 1%), the pattern of accuracy indicated diminishing returns. This indicates that setting excessive values can reduce accuracy, a condition similar to that in a previous report.

The results of model inversion are presented in Figure 5. The landslide estimation map indicated clustered locations of landslide hazard in the study area, particularly on the northern flank of the mountain range. However, the likelihood of landslides was low in most of the study area. Further spatial analysis of the affected areas showed that approximately 4,536 ha had a high chance of landslides, whereas safe areas covered more than 12,000 ha.

Figure 5. Estimation of landslide occurrence using the best model.

Contributing Variables

Figure 6 ranks the variables responsible for the targeted classes. The three most influential landslide predictors in the study area were rainfall, elevation, and slope. Landslide events were suspected to occur at locations with very high rainfall (…,668–3,228 mm), elevations ranging from 500 to 1,500 m above sea level, and steep slopes (…–45%).
These proxies are widely understood to be important variables in landslide events. A very high level of precipitation can rapidly increase soil moisture and add weight to the soil column. While this has less impact over flat terrain, in hilly or mountainous regions gravity exerts a substantial pull, leading to landslide events. Steep slopes also initiate landslides when gravitational force permits, even under substantial vegetation cover (see also the case presented in Figure …).

Figure 6. Response variables.

Conclusion

The random forest method, as a machine learning approach, was shown to be excellent for landslide hazard modeling, yielding > 90% accuracy. Rainfall, elevation, and slope were the dominant predictors of landslides in the study area. Tuning the n-tree parameter indicated slightly diminishing returns. The best outcome was obtained with n-tree = 500, although the difference was less than 1% compared with the other settings. Model inversion indicated landslide hazard areas covering approximately 4,536 ha, which should be considered in future land use planning and in developing mitigation strategies.

Although high classification confidence was obtained in this study, implementing the model remains fairly challenging. The primary consideration is the extent to which the model is applicable to different environments. This requires extensive near-future research in similar landscapes to investigate the variations and model …. In terms of possible implementation in mitigation planning, an extension of this research covering the entire area of Halimun-Salak National Park would significantly contribute to better spatial planning at the regency level.

Acknowledgments

We would like to thank the Division of Remote Sensing and Spatial Information, Department of Soil Science and Land Resources,
IPB University, for their support through the provision of computational facilities. We also acknowledge the contributions of the reviewers in improving the manuscript.

References