Missing data in time series is a frequent problem in the environmental sciences. It seriously limits statistical analysis, so imputation (the process of filling in missing data) is a keystone task. Several imputation methods have been proposed and implemented in software; however, their performance is data-dependent. No imputation method is universally best for all time series; instead, each method suits the structure of particular groups of time series. Which imputation method is best for filling a given time series? The main problem is that the target time series (the one of interest for imputation) already contains missing data, so methods cannot be validated directly on it. Validation instead requires a complete time series (one with no missing data) in which to simulate missing data, perform imputations, and compare actual values to imputed ones. However, the best imputation method for the complete time series is not necessarily the best for the target time series. The Known Sub-Sequence Algorithm (KSSA) is a novel approach that solves this problem by validating imputation methods directly on the target time series. It uses the information contained in the sub-sequences between missing-data gaps to decide which imputation method is best for a particular target time series, whatever its structure. It does so through iterative bootstrapping: it randomly samples sub-sequences of the target time series and learns from them to find the best method from a set of candidates. This is a promising machine learning algorithm that will help environmental scientists and decision-makers working with time series. KSSA will soon be released as the ‘kssa’ R package on CRAN and is currently available on GitHub.
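The idea described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the interface of the ‘kssa’ R package. All function names, the three simple candidate methods, the use of RMSE as the error measure, and the parameters `n_iter`, `gap_frac` and `min_len` are assumptions made for this example. The sketch finds the known sub-sequences between gaps, repeatedly samples one, hides some of its interior values, imputes them with each candidate, and ranks the candidates by mean error.

```python
import numpy as np

def known_subsequences(y):
    """Return (start, stop) index pairs of maximal runs without NaN."""
    runs, start = [], None
    for i, ok in enumerate(~np.isnan(y)):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(y)))
    return runs

# Three deliberately simple candidate methods (assumed for illustration).
def impute_linear(y):
    idx, obs = np.arange(len(y)), ~np.isnan(y)
    return np.interp(idx, idx[obs], y[obs])

def impute_locf(y):
    out = y.copy()
    first = int(np.argmax(~np.isnan(out)))  # index of first observed value
    out[:first] = out[first]                # back-fill any leading gap
    for i in range(first + 1, len(out)):
        if np.isnan(out[i]):
            out[i] = out[i - 1]             # carry last observation forward
    return out

def impute_mean(y):
    out = y.copy()
    out[np.isnan(out)] = np.nanmean(y)
    return out

def rank_methods(y, methods, n_iter=200, gap_frac=0.2, min_len=5, seed=0):
    """Rank candidate methods by mean RMSE on simulated gaps (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    runs = [(a, b) for a, b in known_subsequences(y) if b - a >= min_len]
    errors = {name: [] for name in methods}
    for _ in range(n_iter):
        a, b = runs[rng.integers(len(runs))]     # sample a known sub-sequence
        truth = y[a:b]
        # Hide interior points only, so both endpoints stay observed.
        n_hide = max(1, int(gap_frac * (len(truth) - 2)))
        hide = rng.choice(np.arange(1, len(truth) - 1), size=n_hide, replace=False)
        masked = truth.copy()
        masked[hide] = np.nan
        for name, impute in methods.items():
            filled = impute(masked)
            rmse = np.sqrt(np.mean((filled[hide] - truth[hide]) ** 2))
            errors[name].append(rmse)
    scores = [(name, float(np.mean(e))) for name, e in errors.items()]
    return sorted(scores, key=lambda item: item[1])  # best (lowest error) first

# Synthetic target series with three gaps (made-up data for the demo).
t = np.linspace(0, 8 * np.pi, 400)
y = np.sin(t)
y[50:60] = np.nan
y[200:220] = np.nan
y[330:335] = np.nan

methods = {"linear": impute_linear, "locf": impute_locf, "mean": impute_mean}
ranking = rank_methods(y, methods)
for name, score in ranking:
    print(f"{name}: mean RMSE = {score:.4f}")
```

The method at the top of `ranking` would then be applied to the real gaps of the target series. Keeping the endpoints of each sampled sub-sequence observed is a simplification, so that every candidate has anchor values on both sides of the simulated gaps.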