1 Introduction

A flood is a complex phenomenon that links the natural environment, people, and social systems (Slobodan 2012). Global flood exposure and flood frequency are projected to increase, especially in many low-latitude regions of Asia and Africa (Hirabayashi et al. 2013). The global disaster database EM-DAT (Fig. 1) shows that the frequency of flood and storm events is correlated with flood fatalities.

Fig. 1 Flood fatalities and flood events between 1980 and 2016 at the global scale (compiled from the EM-DAT database, http://emdat.be)

Disaster data collection and analysis are increasingly prominent (UNISDR 2015). Such analysis can provide necessary information for policy-setting and decision-making in disaster risk reduction (IRDR 2014). Various disaster databases are available at national and global scales; they are summarized and listed by Grasso and Dilley (2013) and Simpson et al. (2014). Exploring these databases could contribute to a better understanding of disaster risk, which is the first priority of the Sendai Framework (UNISDR 2015).

The Centre for Research on the Epidemiology of Disasters supported Vietnam and other Asian countries in developing their national disaster databases (Below et al. 2010). The Vietnamese national disaster database is the Damage Assessment and Needs Analysis (DANA) database, which the Central Committee for Flood and Storm Control of Vietnam has jointly developed (MARD 2006; Hughey et al. 2011).

Analysis of flood damage data could provide crucial information to decision-makers in the field of flood risk management and adaptation planning (Merz et al. 2010). Although previous research, such as Hughey et al. (2011) and Nhu et al. (2011), has investigated DANA, none of it has examined flood fatalities.

There are many studies on flood-related deaths in developed countries (e.g., Coates 1999; Ashley and Ashley 2008; FitzGerald et al. 2010; Di Mauro and de Bruijn 2012; Sharif et al. 2015) but only a few in developing countries (e.g., Paul and Mahmood 2016). Studies on flood-related mortality in developing economies remain scarce even though the mortality rate in these countries is significant. There has been no systematic research on flood-related fatalities in Vietnam.

Studies on flood fatalities generally follow one of two main approaches: (1) developing predictive models (Di Mauro et al. 2012; Jonkman and Vrijling 2008; Zhai et al. 2006; Di Mauro and de Bruijn 2012; Jonkman et al. 2002) and (2) analyzing the causes of fatalities (Ashley and Ashley 2008; Paul and Mahmood 2016; Sharif et al. 2015; Coates 1999; Jonkman et al. 2009; Jonkman and Kelman 2005).

Statistical learning combines classical statistics and computer science (James et al. 2013a). It is increasingly applied in genetics, medical science, business, and flood risk management (Merz et al. 2013; Hasanzadeh Nafari et al. 2016). However, there is a lack of research applying statistical learning techniques to analyze the relationship between flood damage attributes and flood fatalities.

The present study aims to explore flood fatalities on a national scale and to analyze the damage-influencing attributes related to flood fatalities, using the national disaster database of Vietnam and a statistical learning approach.

2 Disaster Database

The Central Committee for Flood and Storm Control of Vietnam has maintained the national disaster database, DANA, since 1989. Flood damage data are collected via a single template comprising 12 categories with many flood impact indicators, for example, fatalities, agricultural impacts, housing damage, infrastructure damage, and economic loss. The database provides hydrometeorological disaster damage information at the national level (Hughey et al. 2011). It stores only direct losses, expressed as the monetary cost of reconstruction and recovery of damaged property and infrastructure, and does not report indirect losses such as business disruption and production interruption (Wang et al. 2010).

3 Flood Fatalities in Vietnam

EM-DAT defines deaths or fatalities as including persons confirmed dead and missing persons presumed dead (Below et al. 2010). The loss of human life is considered one of the most critical indicators in assessing flood risk (Maaskant et al. 2009). Flood mortality is very low in developed countries such as Australia (Mojtahedi and Oo 2016) and Scotland (Crichton 2004). In contrast, flood fatalities in Vietnam are unacceptably high, with at least 14,972 flood mortalities between 1989 and 2015 (Fig. 2).

Fig. 2 Spatial patterns of flood fatalities by province in Vietnam from 1989 to 2015

The spatial patterns of flood fatalities by province in Vietnam (Fig. 2) were generated from the compiled DANA flood damage data using ArcGIS 10.1. Quang Nam, Ca Mau, and Quang Ngai provinces fall in the highest class, with death tolls of more than 800 people during the observation period. The second highest class, from 401 to 800 fatalities, comprises nine provinces: Nghe An, Thanh Hoa, Da Nang, Thua Thien Hue, Binh Dinh, Khanh Hoa, Kien Giang, Dong Thap, and An Giang.

4 Statistical Learning Methods

Based on the DANA database, this study uses two statistical learning methods, a multiple linear regression model and random forest, to measure the relative influence of flood damage attributes on fatalities. This section presents the methodology underpinning both methods.

4.1 Multiple Linear Regression Model

Multiple linear regression is employed to analyze the independent factors related to flood fatalities, which are set as the dependent variable or outcome. The LMG method is then used to obtain the relative importance of the flood fatality determinants based on the regression model.

A multiple linear regression model aims to find an equation that describes the relationship between X (independent variables) and Y (dependent variable). A (population) multiple linear regression model with p independent variables, given in Eq. (1), describes the linear relationship between one dependent variable (Y) and two or more independent variables (x_i).

$$ {Y}_i={\beta}_0+{\beta}_1{x}_{1i}+{\beta}_2{x}_{2i}+\dots +{\beta}_p{x}_{pi}+\varepsilon $$
(1)

where

  • β_0: intercept

  • β_1, β_2, …, β_p: regression coefficients

  • Y_i: dependent variable

  • x_{1i}, x_{2i}, …, x_{pi}: independent variables

  • ε: residuals
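As an illustration of how the coefficients in Eq. (1) can be estimated, the following minimal sketch fits the model by ordinary least squares through the normal equations. It is not the code used in this study (which relied on R); the function names `solve_linear_system` and `fit_ols` and the data are hypothetical and exist only for this example.

```python
def solve_linear_system(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so each row operation acts on both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [beta_0, beta_1, ..., beta_p] minimising the squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 carries the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated exactly from Y = 1 + 2*x1 + 3*x2 (no residual term).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
beta = fit_ols(rows, y)  # approximately [1.0, 2.0, 3.0]
```

Because the made-up response contains no noise, the fitted coefficients recover the intercept and slopes used to generate it, up to floating-point error.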

Weights of independent variables can be generated from the relative importance of the variables (Hair Jr. et al. 2014). Bi (2012) reviewed new methods for deriving relative importance from regression models and recommended the LMG method for raw data. The LMG indicator, proposed by Lindeman, Merenda, and Gold in 1980 (Lindeman et al. 1980), is applied in this study to assess the relative importance of the variables in the multiple linear regression model. LMG is defined as follows (Lindeman et al. 1980; Bi 2012):

$$ \mathrm{LMG}\left({x}_k\right)=\frac{1}{p!}\sum \limits_{r\ \mathrm{permutation}}\mathrm{seq}{R}^2\left(\left\{{x}_k\right\}\mid r\right) $$
(2)

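where r = 1, 2, …, p! indexes the permutations of the p regressors, and seqR^2({x_k} | r) denotes the sequential sum of squares for the regressor x_k in the ordering of regressors given by the r-th permutation, that is, the increase in R^2 when x_k enters the model after the regressors that precede it in that ordering.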
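Eq. (2) can be illustrated with a brute-force sketch that averages the sequential R^2 contribution of each regressor over all p! orderings. This is only a didactic sketch on made-up data, not the algorithm implemented in the "relaimpo" package; the helper names `solve`, `r_squared`, and `lmg` are hypothetical.

```python
from itertools import permutations
from math import factorial


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 of an OLS fit of y on the given predictor columns plus an intercept."""
    n = len(y)
    mean_y = sum(y) / n
    tss = sum((v - mean_y) ** 2 for v in y)
    if not columns:
        return 0.0  # the intercept-only model explains no variance
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
              for d, yi in zip(design, y))
    return 1.0 - rss / tss


def lmg(columns, y):
    """Average each regressor's sequential R^2 gain over all p! orderings."""
    p = len(columns)
    shares = [0.0] * p
    for order in permutations(range(p)):
        entered, r2_before = [], 0.0
        for k in order:
            entered.append(k)
            r2_after = r_squared([columns[j] for j in entered], y)
            shares[k] += r2_after - r2_before  # seqR^2({x_k} | r)
            r2_before = r2_after
    return [s / factorial(p) for s in shares]


# Made-up data with three regressors.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [5, 3, 8, 1, 9, 2, 7, 4]
y = [3.1, 3.9, 7.2, 6.8, 11.1, 10.2, 15.0, 14.1]
columns = [x1, x2, x3]
shares = lmg(columns, y)
```

A useful property of this decomposition is that the shares sum to the R^2 of the full model, so they can be normalized into weights. The number of orderings grows as p!, so the brute-force form is suitable for illustration only.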

4.2 Random Forest

A second method, random forest for regression, is applied to the same database. The random forest algorithm is a widely used ensemble method for both regression and classification. It constructs a multitude of decision trees and aggregates their predictions into the final result, which can be used to build predictive models. The random forest algorithm can be presented in the following steps (Liaw and Wiener 2002):

  1. Draw n_tree bootstrap samples from the original dataset. Each sample is drawn at random with replacement.

  2. For each of these samples, grow an unpruned regression tree: at each node, randomly sample m_try of the predictors and select the best split among those variables.

  3. Predict new data by aggregating the predictions of the n_tree trees (the average for regression).
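The three steps above can be sketched as a toy implementation. This is a simplified illustration on made-up data and not the implementation of Liaw and Wiener (2002) or of the "party" package; all function names are hypothetical, and the parameter names `n_tree` and `m_try` simply mirror the notation of the steps.

```python
import random


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, y, features):
    """Return the (feature, threshold) with the lowest summed squared error."""
    best, best_sse = None, float("inf")
    for f in features:
        # Dropping the largest value keeps the right-hand child non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [yi for r, yi in zip(rows, y) if r[f] <= t]
            right = [yi for r, yi in zip(rows, y) if r[f] > t]
            sse = sum_sq(left) + sum_sq(right)
            if sse < best_sse:
                best, best_sse = (f, t), sse
    return best


def grow_tree(rows, y, m_try, rng, min_size=2):
    """Grow an unpruned regression tree; a leaf stores the mean response."""
    if len(y) < min_size or len(set(y)) == 1:
        return sum(y) / len(y)
    features = rng.sample(range(len(rows[0])), m_try)  # step 2: m_try predictors
    split = best_split(rows, y, features)
    if split is None:  # the sampled predictors are constant in this node
        return sum(y) / len(y)
    f, t = split
    left = [(r, yi) for r, yi in zip(rows, y) if r[f] <= t]
    right = [(r, yi) for r, yi in zip(rows, y) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [v for _, v in left], m_try, rng, min_size),
            grow_tree([r for r, _ in right], [v for _, v in right], m_try, rng, min_size))


def predict_tree(node, row):
    """Follow the splits until a leaf (a plain number) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def random_forest(rows, y, n_tree, m_try, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_tree):
        idx = [rng.randrange(n) for _ in range(n)]  # step 1: bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx], [y[i] for i in idx], m_try, rng))
    return forest


def predict_forest(forest, row):
    """Step 3: average the predictions of all trees."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)


# Made-up data: three predictors, response driven by the first one.
rows = [(i, (7 * i) % 5, (3 * i) % 4) for i in range(20)]
y = [10.0 * r[0] for r in rows]
forest = random_forest(rows, y, n_tree=25, m_try=2)
prediction = predict_forest(forest, (12, 1, 0))
```

Because every leaf is a mean of observed responses and the forest averages the trees, a prediction always lies within the range of the training responses.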

A random forest for regression is constructed by growing trees that depend on a random vector, such that each tree predictor takes numerical values as opposed to class labels. Strobl et al. (2008) suggested that conditional importance for random forests should be applied when predictor variables are highly correlated, because it better reflects the actual impact of each variable. The conditional importance technique is therefore chosen for the dataset of this study.
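The idea behind permutation-based importance, and its conditional variant, can be sketched as follows. This is a strong simplification under stated assumptions: the strata are supplied by hand here, whereas Strobl et al. (2008) derive them from the fitted trees, and the "model" is a hypothetical stand-in function rather than a fitted forest. All names and data are made up for illustration.

```python
import random


def mse(model, rows, y):
    """Mean squared error of the model's predictions."""
    return sum((model(r) - yi) ** 2 for r, yi in zip(rows, y)) / len(y)


def permutation_importance(model, rows, y, j, rng, groups=None):
    """Increase in MSE after shuffling column j, optionally only within groups."""
    if groups is None:
        groups = [0] * len(rows)  # a single stratum: marginal permutation
    shuffled = [list(r) for r in rows]
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        values = [rows[i][j] for i in idx]
        rng.shuffle(values)  # permute column j inside this stratum only
        for i, v in zip(idx, values):
            shuffled[i][j] = v
    return mse(model, shuffled, y) - mse(model, rows, y)


def model(row):
    """Hypothetical fitted model that uses column 0 and ignores column 1."""
    return 2.0 * row[0]


rng = random.Random(0)
# Column 1 is strongly correlated with column 0 but has no effect of its own.
rows = [(float(i), float(i) + (i % 3)) for i in range(20)]
y = [2.0 * r[0] for r in rows]

marginal_x0 = permutation_importance(model, rows, y, 0, rng)
marginal_x1 = permutation_importance(model, rows, y, 1, rng)
strata = [int(r[1] // 5) for r in rows]  # coarse bins of the correlated column
conditional_x0 = permutation_importance(model, rows, y, 0, rng, groups=strata)
```

Shuffling within strata of a correlated predictor preserves much of the dependence between the two columns, which is the motivation for the conditional scheme when predictors are highly correlated.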

5 Application

5.1 Multiple Linear Regression Analysis

The multiple linear regression model is applied to the DANA flood damage data. The collected dataset covers 63 provinces over 27 years (1989 to 2015). Each province-year is treated as one observation, giving 63 × 27 = 1701 observations. Flood fatality is set as the dependent variable, and the flood damage attributes are set as independent variables X1 to X10. Because of the large variation in the flood damage data, a logarithmic transformation is applied to both the outcome and the independent variables so that they better fit the normal distribution (Zhou et al. 2017).
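The size of the dataset and the effect of the transformation can be checked with a short sketch. The source does not state how zero counts were handled, so the use of log(1 + x) below is an assumption made only for this example, and the fatality counts are made up.

```python
import math

N_PROVINCES = 63
YEARS = range(1989, 2016)  # 1989 to 2015 inclusive, i.e. 27 years

n_observations = N_PROVINCES * len(YEARS)  # one observation per province-year


def log_transform(values):
    """Assumed variant log(1 + x), which keeps zero counts finite."""
    return [math.log1p(v) for v in values]


fatalities = [0, 3, 12, 150, 820]  # made-up counts spanning several magnitudes
transformed = log_transform(fatalities)
```

The transformation compresses the large counts much more than the small ones, which reduces the skew that a few extreme province-years would otherwise introduce.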

Cross-validation is used to validate the multiple linear regression model, that is, to evaluate whether the model has good predictive ability on new data. K-fold cross-validation and leave-one-out cross-validation (LOOCV) can provide the best cross-validation estimates (James et al. 2013b). The k-fold cross-validation and LOOCV results are generated using the “boot” package (Canty and Ripley 2016) in R statistical software (R Core Team 2016). The results in Fig. 3 show that the mean square errors of both the LOOCV and k-fold CV models with polynomial degrees from 2 to 10 are low and approximately equal. The model is therefore considered validated.
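The two cross-validation schemes differ only in the number of folds, as the following minimal sketch shows. It fits a straight line with a single predictor rather than the polynomial models of this study, does not reproduce the "boot" package, and uses made-up data; the names `fit_line` and `cv_mse` are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def cv_mse(x, y, k, seed=0):
    """Mean squared prediction error over k held-out folds (k = n gives LOOCV)."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all points
    sq_errors = []
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        sq_errors += [(y[i] - (b0 + b1 * x[i])) ** 2 for i in fold]
    return sum(sq_errors) / len(sq_errors)


# Made-up data: a linear trend with alternating residuals of +/- 0.5.
x = [float(i) for i in range(20)]
y = [1.0 + 2.0 * xi + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]
tenfold = cv_mse(x, y, k=10)
loocv = cv_mse(x, y, k=len(x))
```

Every observation is held out exactly once in both schemes, so the two estimates are computed from the same number of squared errors and differ only in how much data each fit sees.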

Fig. 3 Error curves of tenfold CV and LOOCV with different random splits and ten repetitions of the two cross-validation methods

We use R statistical software to run the multiple linear regression model on the transformed data. The model has an adjusted R-squared of 0.601 and a residual standard error of 0.82. We then run the “relaimpo” package (Grömping 2006) in R (R Core Team 2016) to generate the weights, or relative importance, of the attributes. The weights of the independent variables are based on the LMG indicator in Eq. (2). The result is shown in Table 1.

Table 1 The relative importance of flood fatality determinants based on multiple linear regression model and LMG indicator after normalization

5.2 Conditional Importance for Random Forest

The random forest algorithm is used to find the relative influence of the independent factors (flood impacts X1 to X10) on the dependent outcome (flood fatalities). The conditional importance for random forests is computed using the “party” package (Hothorn et al. 2006; Strobl et al. 2007, 2008) in R statistical software. The result is shown in Fig. 4. The housing impact factor (variable X1) has the highest influence on fatalities in this model.

Fig. 4 Variable importance scores for the ten predictors in the regression tree model, obtained by conditional variable importance for random forest

6 Discussion and Conclusion

The interrelation between sustainable development and disaster risk reduction is recognized in the Sendai Framework (UNISDR 2015). As part of that, the objective of reducing disaster mortality is stated under Goal 11 of the Sustainable Development Goals (UN 2015). The summary of flood fatalities in Vietnam and the investigation of damage-influencing attributes in this study can support future efforts to mitigate fatalities in flood disasters and have implications for flood risk management activities.

The recorded damage data suggest that flood hazards have had severe impacts on Vietnamese people, with over 14,927 dead and missing, or at least 553 flood mortalities per year, between 1989 and 2015 (Fig. 2). This susceptibility has continued, with 264 flood fatalities in 2016 (UN Country Team in Vietnam 2016). The flood fatalities are unexpectedly high, which we attribute to the passive response of local authorities and residents to flood and storm events. Flood risk management therefore needs to focus on a proactive approach, that is, on mitigation and preparedness activities.

The results of the multiple linear regression analysis (Table 1) and the conditional importance for random forest (Fig. 4) show that the housing damage factor has the most significant influence on flood mortality. The more houses that are damaged and flooded, the more fatalities there are. People who live in a house that is destroyed or flooded are more likely to be killed. The houses of farmers (over 70% of Vietnamese are farmers) are mostly one-story and in poor condition. They are not strong enough to withstand storms or floods, and they provide no room to escape when flood depths are high.

Floods and storms affect low-income communities disproportionately, especially in the riverine and coastal areas of Vietnam. The rural poor are particularly vulnerable to flood events. Affected households often received extremely limited financial support from the local government, as stipulated in Decree 67/2007/ND-CP of the government. Poverty, as is common on a global scale, is closely linked to disaster impact.

The results may provide information on community awareness and safety regulations. In particular, our results can be used to recommend government policies that focus on supporting the poor in upgrading their houses in flood-prone areas to mitigate flood fatalities.

This study used the data available in the national disaster database for the analysis. It is, therefore, limited to analyzing the relative influence of damage attributes on fatalities in Vietnam. We call for more detailed studies on flood fatalities, such as predictive models and analyses of the causes of flood fatalities. Such studies, however, would require the disaster loss database to document details on the cause of each flood fatality and on the age and gender of the victims.

In conclusion, the present study proposes an approach to investigate the damage-influencing attributes related to flood fatalities using statistical learning techniques and a national disaster database. The results show that the housing damage factor has the most significant influence on flood mortality in Vietnam. Our research provides a better understanding of flood fatalities in Vietnam by analyzing and reporting on flood mortality at the national scale. The output can serve as a reference for the decision-making process in flood risk management in Vietnam.