Interpolation and Machine Learning Methods for Sub-Hourly Missing Rainfall Data Imputation in a Data-Scarce Environment: One- and Two-Step Approaches

Boukdire, Mohamed; Inan, Çagrı Alperen; Varra, Giada; Della Morte, Renata; Cozzolino, Luca

doi:10.3390/hydrology12110297

Complete sub-hourly rainfall datasets are critical for accurate flood modeling, real-time forecasting, and understanding short-duration rainfall extremes. However, these datasets often contain missing values due to sensor or transmission failures. Filling these data gaps at high temporal resolution is challenging because of the imbalance between rain and no-rain periods. In this study, we developed and tested two approaches for the imputation of missing 10-min rainfall data by means of machine learning (Multilayer Perceptron and Random Forest) and interpolation methods (Inverse Distance Weighting and Ordinary Kriging). The direct approach (a) feeds the raw data straight into the imputation models, while the two-step approach (b) first classifies time steps as rain or no-rain with a Random Forest classifier and then applies an imputation model to predict rainfall depth only for the time steps classified as rain. Each approach was tested under three spatial scenarios: using all nearby stations, using stations within the same cluster, and using the three most highly correlated stations. An additional test compared the results obtained using data from the imputed time interval only with those obtained using a time window containing several time intervals before and after the imputed time interval. The methods were evaluated in two different environments, mountainous and coastal, in the Campania region (Southern Italy), under data-scarce conditions where rainfall depth is the only available variable. In the two-step approach, the Random Forest classifier performs well in both the mountainous and the coastal area, with average weighted F1 scores of 0.961 and 0.957 and average Accuracy values of 0.928 and 0.946, respectively.
The highest performance in the regression step is obtained by the Random Forest in the mountainous area, with an R² of 0.541 and an RMSE of 0.109 mm, using the spatial configuration that includes all stations. The comparison with the direct approach shows that the two-step approach consistently improves accuracy across all scenarios, highlighting the benefit of breaking the data imputation process into stages in which different physical conditions (in this case, rain and no-rain) are managed separately. Another important finding is that using time windows containing data lagged with respect to the imputed time interval makes it possible to capture the atmospheric dynamics by connecting rainfall instances at different times and at distant stations. Finally, the study confirms that machine learning models outperform spatial interpolation methods, thanks to their ability to handle data with a complex internal structure.
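
The two-step approach described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn and NumPy are available, uses synthetic data in place of the Campania station records, and picks a Random Forest regressor for the second step. The function name `two_step_impute`, the 0 mm wet threshold, and all hyperparameters are hypothetical choices made only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic 10-min rainfall depths (mm): 4 neighbouring stations as features,
# one target station to impute. Most time steps are dry (assumed 15% wet).
n_steps, n_neighbors = 2000, 4
wet = rng.random(n_steps) < 0.15
base = np.where(wet, rng.gamma(shape=1.2, scale=0.8, size=n_steps), 0.0)
noise = np.clip(rng.normal(1.0, 0.2, size=(n_steps, n_neighbors)), 0.0, None)
X = base[:, None] * noise
y = base * np.clip(rng.normal(1.0, 0.1, size=n_steps), 0.0, None)


def two_step_impute(X_train, y_train, X_new, wet_threshold=0.0):
    """Step 1: classify rain/no-rain. Step 2: regress depth on rain steps only."""
    is_wet = y_train > wet_threshold

    # Step 1: rain / no-rain classifier trained on all time steps.
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_train, is_wet)

    # Step 2: depth regressor trained only on time steps observed as rain.
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    reg.fit(X_train[is_wet], y_train[is_wet])

    # Time steps predicted as no-rain are imputed as 0 mm.
    pred_wet = clf.predict(X_new).astype(bool)
    y_hat = np.zeros(len(X_new))
    if pred_wet.any():
        y_hat[pred_wet] = reg.predict(X_new[pred_wet])
    return y_hat, pred_wet


# Hold out the last 500 time steps as if they were missing at the target station.
train, test = slice(0, 1500), slice(1500, None)
y_hat, pred_wet = two_step_impute(X[train], y[train], X[test])
rmse = float(np.sqrt(np.mean((y_hat - y[test]) ** 2)))
print(f"predicted wet steps: {int(pred_wet.sum())}, RMSE: {rmse:.3f} mm")
```

Training the regressor only on rain time steps keeps the many zero values from dominating the fit, which is the motivation the abstract gives for separating the two physical conditions. A direct approach would correspond to fitting a single regressor on all of `X[train]` and `y[train]`.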