Malaysia is highly vulnerable to the effects of climate change [1]. The selection of climatic indicators for regional analysis is constrained by modeling assumptions and by the availability of datasets. Previous studies have demonstrated the impacts of rainfall and temperature on fish landings in the country [2, 3]. Similarly, sea surface temperature (SST) is an essential indicator of the coastal upwelling events that influence fish production in the region [4]. Prior studies have also shown that relative humidity is a significant climatic factor in fisheries studies because of its indirect impact on some environmental stressors [5, 6]. Therefore, the effects of climatic variables such as rainfall and SST on Malaysia’s marine fish landings must be investigated.

Forecasting marine fish landings depends heavily on the analysis of past and current behavior [6]. Autoregressive integrated moving average (ARIMA), seasonal ARIMA, vector autoregression, neural network, nonlinear autoregressive network, and wavelet models are among the well-known approaches that researchers have used to forecast short-term fish catches [6, 7]. However, these statistical models will not produce satisfactory results if the time series data have nonlinear components [8]. Machine learning (ML) models use only historical data to learn the stochastic dependency between the past and the future [9, 10]. Previous studies have used ecological variables to estimate marine fish catches in Malaysia [11, 12]; however, none of them implemented ML models.

Researchers typically use the ML-based linear regression (LR) technique to predict time-series data. This approach works well when the variables in the dataset are correlated, because the algorithm can then predict values accurately. However, algorithms such as the decision tree (DT)-based regression technique can also handle data with different measurement scales. DT-based algorithms are, to a fair degree, not influenced by outliers and missing values, and they simplify the building of rules for predictions about individual cases and complex relationships [13]. Moreover, the random forest (RF) algorithm can be used instead of a single DT to reduce overfitting, which yields better results than a single optimized DT [14].

In this research, we considered different ML-based predictive models to demonstrate the impact of climatic variables on marine fish landings in Malaysia’s five central states: Kedah, Pahang, Perak, Selangor, and Terengganu. Two error objective functions, the coefficient of determination (R2) and the Nash–Sutcliffe efficiency (NSE), were used to evaluate the performance of the ML models.
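For reference, the two error objective functions can be written as follows. This is a sketch under stated assumptions: the symbols are introduced here for illustration and are not the authors' notation, with O_i the observed landing, P_i the predicted landing, and bars denoting means over the n data points. The NSE is given in its standard form; R2 is assumed to denote the squared Pearson correlation between observed and predicted values, which the text does not state explicitly.

```latex
\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} (O_i - P_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2},
\qquad
R^2 = \left(
  \frac{\sum_{i=1}^{n} (O_i - \bar{O})(P_i - \bar{P})}
       {\sqrt{\sum_{i=1}^{n} (O_i - \bar{O})^2}\,\sqrt{\sum_{i=1}^{n} (P_i - \bar{P})^2}}
\right)^{2}
```

Under these definitions, the NSE penalizes systematic bias in the predictions, whereas the squared correlation does not, so the two values can differ for the same model.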

We considered the maximum and minimum air temperature, SST, and humidity to build models using ML. We collected data for 18 consecutive years (2000–2017): the temperature, rainfall, and humidity data were obtained from the Department of Statistics Malaysia, and the SST data were obtained from the Malaysian Meteorological Department. Marine fish landing data were collected from the Department of Fisheries, Malaysia. So that the ML models could process the state labels, the individual states were combined into one dataset by mapping the states to numbers, where Selangor is 1, Terengganu is 2, Pahang is 3, Kedah is 4, and Perak is 5.

We used the first 16 years (2000–2015) of data to train and validate the models and the latest 2 years (2016–2017) of data to test them. Of the 80 data points from 2000–2015 (5 states × 16 years), 65 randomly selected points were used for training and 15 for validation; the split was stratified by state so that all of the states were represented in both datasets. We implemented the LR, DT, and RF algorithms to generate predictions. For the DT and RF algorithms, the maximum depth was set to 7 to reduce overfitting [15]. We used the Python scikit-learn library to implement the models and measured the R2 and NSE values to determine the predictive accuracy [16]. Both of these error objective functions have a maximum value of 1, and a value closer to 1 indicates a more accurate prediction. The NSE is the best objective function for evaluating the overall fit between the predicted and observed values [17].

Figure 1 compares the 15 validation data points for the 3 ML algorithms. Three values (years) from each state are plotted on the x-axis, and the observed values and those predicted by the 3 ML models are plotted on the y-axis. The RF- and DT-based regression models produced values closer to the observed values and had better R2 (0.88 and 0.89, respectively) and NSE (0.7 and 0.8, respectively) values than the LR model (R2 = 0.6 and NSE = 0.3).
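The data preparation and model fitting described above can be sketched in scikit-learn roughly as follows. This is a minimal illustration, not the authors' code. The data are synthetic stand-ins generated at random, and the column names, the helper functions `make_synthetic_data`, `nse`, and `r2_corr`, the random seeds, and the use of the state code as a model feature are all assumptions made here. Only the state mapping, the 65/15 stratified split, the three algorithms, and the maximum depth of 7 come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# State-to-number mapping as given in the text.
STATE_CODES = {"Selangor": 1, "Terengganu": 2, "Pahang": 3, "Kedah": 4, "Perak": 5}
# Hypothetical column names; using the state code as a feature is an assumption.
FEATURES = ["state_code", "t_max", "t_min", "sst", "humidity"]


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - SSE / (variance of observations * n)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    sse = np.sum((observed - predicted) ** 2)
    return float(1.0 - sse / np.sum((observed - observed.mean()) ** 2))


def r2_corr(observed, predicted):
    """Squared Pearson correlation (assumed definition of R2)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.corrcoef(observed, predicted)[0, 1] ** 2)


def make_synthetic_data(seed=0):
    """Made-up data: 5 states x 16 years (2000-2015) = 80 rows."""
    rng = np.random.default_rng(seed)
    rows = []
    for code in STATE_CODES.values():
        for year in range(2000, 2016):
            t_max = rng.normal(32.0, 1.0)
            sst = rng.normal(29.0, 0.5)
            rows.append({
                "state_code": code,
                "year": year,
                "t_max": t_max,
                "t_min": rng.normal(24.0, 1.0),
                "sst": sst,
                "humidity": rng.normal(80.0, 3.0),
                # Arbitrary relationship, for illustration only.
                "landings_tonnes": 50_000.0 * code + 4_000.0 * (sst - 29.0)
                - 1_500.0 * (t_max - 32.0) + rng.normal(0.0, 5_000.0),
            })
    return pd.DataFrame(rows)


data = make_synthetic_data()
X, y = data[FEATURES], data["landings_tonnes"]

# 65 training and 15 validation points, stratified so every state is in both sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=65, test_size=15, stratify=data["state_code"], random_state=0
)

models = {
    "LR": LinearRegression(),
    "DT": DecisionTreeRegressor(max_depth=7, random_state=0),
    "RF": RandomForestRegressor(max_depth=7, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    scores[name] = {"R2": r2_corr(y_val, pred), "NSE": nse(y_val, pred)}
    print(name, scores[name])
```

With 16 rows per state, the stratified split places 13 rows of each state in the training set and 3 in the validation set, which matches the three values per state plotted in Figure 1. The scores printed for the synthetic data carry no meaning for the real landings.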

Fig. 1 Three ML-based outputs for marine fish landings prediction using the validation dataset

Table 1 shows the predicted and observed results for the test dataset (2016–2017) as well as the error metrics. We found that the RF model output most closely resembled the observed dataset. Table 1 indicates that the LR model has a high bias, whereas the DT and RF models give improved predictions with low bias. For the 2017 data, the LR model produced negative predicted values, indicating that it has low predictive accuracy (R2 = 0.64 and NSE = 0.082). We also found that in 2016 and 2017, the DT model predicted the same values for different states (R2 = 0.89 and NSE = 0.84). This is one of the drawbacks of employing a single DT: similar or identical inputs fall into the same leaf and therefore yield the same predicted value. Therefore, the RF model was used to average multiple DTs to improve the accuracy and reduce overfitting. The R2 and NSE values of the RF model were both 0.86; its NSE was the highest among the ML models on the testing dataset, although its R2 was slightly lower than that of the DT model. Thus, according to this research, the RF regression model is suitable for predicting marine fish landings (tonnes) in the abovementioned Malaysian states.
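The contrast between a single DT and an RF can be illustrated with a small sketch. The data below are random numbers rather than fish landings, and the shallow depth of 2 (instead of the depth of 7 used in the study), the number of trees, and the variable names are assumptions chosen to make the effect easy to see. The example shows that a single tree can return only as many distinct values as it has leaves, while scikit-learn's `RandomForestRegressor` returns the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 4))
y = 1000.0 * X[:, 0] + 500.0 * X[:, 1] ** 2 + rng.normal(scale=50.0, size=80)

# A deliberately shallow tree (depth 2 is an assumption) has at most 4 leaves.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
forest = RandomForestRegressor(n_estimators=50, max_depth=2, random_state=0).fit(X, y)

X_new = rng.normal(size=(10, 4))
tree_pred = tree.predict(X_new)
forest_pred = forest.predict(X_new)

# Every input that reaches the same leaf receives the same prediction,
# so the 10 new inputs share at most 4 distinct predicted values.
n_distinct_tree = len(np.unique(tree_pred))
print("distinct single-tree predictions:", n_distinct_tree)

# The forest prediction is the average over its individual trees.
manual_mean = np.mean([est.predict(X_new) for est in forest.estimators_], axis=0)
print("forest equals mean of trees:", np.allclose(manual_mean, forest_pred))
```

Because each tree in the forest is trained on a different bootstrap sample, the trees place their splits differently, and averaging them smooths out the step-like output of any single tree.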

Table 1 Observed and predicted fish landing (tonnes) values and the corresponding error

Here, the NSE value for the RF model was 0.86, indicating a good fit [18]. The dataset contained all five states in both the validation and testing phases. Thus, this research successfully predicted marine fish landings in five central states of Malaysia. Decision-makers in the fishery industry typically plan based on the fishing market’s resource requirements, which depend heavily on accurate 1- to 2-year forecasts of fish landings [19]. Therefore, this predictive model can be a valuable component of decision support systems for Malaysia’s fisheries sector.