1 Introduction

A tsunami can be described as a series of sea waves associated with earthquakes, underwater volcanic eruptions, and changes in fault stability (Cho 1995). Tsunami waves are distinct from general currents or ordinary wind-driven sea waves due to their longer wavelength, which indicates that the waves can carry more energy, leading to potentially large wave heights along the shoreline. Although the direct effects of tsunami events are likely to be limited to coastal areas, their destructive power can result in deadly and devastating floods in entire ocean basins (Sohn et al. 2009; Mimura et al. 2011).

Deterministic numerical models that specify the earthquake source can provide scenarios associated with tsunami run-up, propagation, and inundation at particular locations of interest along the coast. The numerical approach is conducive to understanding the spatiotemporal dynamics of tsunami waves and their effects on coastal areas (Imamura et al. 1988; Cho and Yoon 1998; Yoon 2002; Cho et al. 2007). However, substantial amounts of bathymetric, coastline, and earthquake data are required to estimate tsunami heights, so tsunami hazard analyses are typically applied to only a single tsunami event (Choi et al. 2007; Kim et al. 2012; Cho et al. 2013). Moreover, numerical models do not provide the probability distribution information needed for the likelihood estimation step of hazard analysis.

Estimating the probability of tsunami occurrence is a key step in tsunami hazard assessment, in that the probability distribution provides an efficient way of describing tsunami heights. Evaluating the likelihood of a large tsunami carries significance for the design of coastal structures and disaster mitigation strategies, such as a tsunami warning system and evacuation routes. Since Van Dorn (1965) applied a statistical method to estimate the exceedance probability associated with tsunami heights on the coast of the Hawaiian Islands, a statistical approach to tsunami events has been of great interest from scientific and practical viewpoints (Liu et al. 1995a; Choi et al. 2002, 2005, 2006, 2011; Grezio et al. 2010; Cho et al. 2013; Yadav et al. 2013b). Historical tsunami observations are ideally suited to constructing a hazard curve; however, a sufficiently long record is not readily available in most cases. In certain cases, it may be appropriate to consider performing numerical simulations for all relevant tsunami sources and combining the results to develop the hazard curve (Geist and Lynett 2014). In these contexts, a formal probabilistic tsunami hazard analysis (PTHA) has been employed over the last decade to combine the likelihood of a tsunami occurring with the probability that tsunami heights exceed a certain threshold for a given tsunami (McCloskey et al. 2008; Geist and Parsons 2009; González et al. 2009; Grezio et al. 2010; Kumar 2012a, b; Mitsoudis et al. 2012; Sørensen et al. 2012; Lorito et al. 2015). Most PTHAs are directly based on the estimation of seismic return periods or the likelihood of a given seismic intensity, based on Gutenberg-Richter relationships. Numerical simulation for each tsunami source and its parameters is then performed to evaluate tsunami risk at a particular location. In cases when several seismic sources of information are available, these are combined in a logic-tree to yield the PTHA.
More generally, probability distributions of tsunami heights are separately estimated for all relevant tsunami sources, and the logic-tree is typically introduced to deal with uncertainty in the model parameters and processes due to limited data and knowledge (Geist and Lynett 2014). The logic-tree PTHA can combine different sources of information to produce a probabilistic quantity, which can be used as a basis for a risk assessment. However, the logic-tree PTHA is constructed for each of the tsunami sources in a discrete manner, which may lead to less efficient simulation and complexity in the modelling stage. In this article, we suggest a framework for simultaneously specifying potential tsunami sources and probability distributions of tsunami heights in a continuous manner.

In the PTHA, numerous statistical models based on different probability distributions have been proposed to estimate the return periods (or likelihoods) of tsunamis in different regions of the world (Orfanogiannaki and Papadopoulos 2007; Geist and Parsons 2008; Hasumi et al. 2009; Grezio et al. 2010; Yadav et al. 2013b; Clare et al. 2014; Knighton and Bastidas 2015; Omira et al. 2015; Shin et al. 2015). On the other hand, relatively little attention has been given to the use of statistical models for tsunami heights (Pelinovsky et al. 1997a; Choi et al. 2002; Geist 2002; Kulikov et al. 2005; Choi et al. 2006, 2011; Kim et al. 2014).

In the PTHA, the probability distributions for the tsunami magnitude and fault location (including the slip distribution) are key elements for describing tsunami risk. Moreover, nearshore wave heights are substantially influenced by bathymetry and resonance effects in bays and harbours (Rabinovich 1997; Geist and Parsons 2006). Among the various factors influencing tsunami generation, in this study, the location and magnitude of an earthquake are primarily considered as key factors in identifying link functions of the parameters in the probability distribution. On the other hand, other earthquake factors such as fault parameters and slip distributions are mainly taken to be constant, based on the virtual earthquakes established by the Korea Peninsula Energy Development Organization (KEDO 1999). Although the PTHA has been an active research field over the last decade, a limited number of studies have been conducted to simultaneously estimate the probability distributions for multiple tsunami events and to explicitly utilize their relationships with earthquake characteristics to produce rapid estimates of the tsunami height distribution along a coast. In addition, the estimation of uncertainty in the parameters of distributions used in tsunami hazard assessments is not often addressed, and it needs to be taken into consideration to provide practical guidance for decision-making. Therefore, formal hazard analyses that rely on probability distributions are of limited use in formulating risk management plans.

Given that background, we explore the following questions:

  1. Can the probability distributions of tsunami heights from multiple tsunami events be simultaneously estimated, and can their parameters be functionalized in a regression framework with both the locations and magnitudes of earthquakes?

  2. How can the uncertainty in the identified functions of the parameters be quantified within a Bayesian framework? What are advantages and potential limitations of the proposed Bayesian model for quantifying the tsunami hazard?

We here develop a Bayesian model for tsunami heights to investigate those questions in a statistical framework with numerically generated time series along the eastern coast of the Korean Peninsula. Following the brief introduction provided in this section, we present an overview of key data in Sect. 2. The numerical scheme for tsunami modelling and our Bayesian approach to parameter estimation are described in Sect. 3. We summarize our results and discussion in Sect. 4. Finally, we provide our conclusions and suggestions for future work in Sect. 5.

2 Study area and numerical simulation

Concern about seismic risk in Korea has not been as high as in Japan. However, a recent earthquake measured a magnitude of 5.8 on the Richter scale, the largest recorded in the Korean Peninsula since 1978. Therefore, researchers have shown great interest in examining how seismic activity is potentially related to hazards such as tsunamis and landslides. It is difficult to assess the comprehensive effect of a tsunami on coastal areas because of insufficient historic tsunami data. Only about 20 tsunamis resulting from earthquakes have been recorded and studied in Korea (Pelinovsky et al. 1997a; Choi et al. 2006; Tanioka et al. 2006).

2.1 Study area

The study area includes the eastern coast of the Korean Peninsula, the East Sea, and Japan, as shown in Fig. 1. We performed numerical simulations of the wave propagation with a Boussinesq model for 30 virtual tsunamis to obtain the tsunami height along the eastern coast of the Korean Peninsula (rectangular boxes in Fig. 1). Three historical tsunami events affected the eastern coast: the 1964 Niigata tsunami in Japan, the 1983 Central East Sea tsunami, and the 1993 Hokkaido tsunami. In all three cases, substantial property damage and loss of life were officially reported. Since the tsunami in 1993, a tsunami warning system has become a regular part of risk mitigation work along the eastern coast of the Korean Peninsula.

Fig. 1 A map showing the study area, the bathymetric contour lines and the locations of the 10 virtual and the 3 historical earthquakes for the numerical simulation in this study

2.2 Numerical simulation of tsunami

Because our aim is to investigate the underlying distributions of tsunami heights based on numerical simulation results, we next provide a brief summary of the theoretical background and numerical simulation procedure for tsunami propagation.

The Boussinesq equations are generally used in numerical models for the simulation of tsunami propagation in shallow seas. In this study, we mainly used a tsunami wave propagation model developed by Cho et al. (2007) to simulate the tsunami heights. The modelling framework is based on the dispersion–correction and leap-frog schemes (Imamura et al. 1988; Cho and Yoon 1998), and the modelling scheme we used allows us to efficiently control the spatiotemporal resolution and provides more practical solutions than conventional algorithms. The utility and accuracy of the model were assessed and verified with virtual and historical events, and the enhanced accuracy in the tsunami run-up estimation was confirmed by Sohn et al. (2009). For more details about the governing equations and physical parameterization, see Liu et al. (1995b), and for the numerical algorithms, see Cho (1995), Cho et al. (2007), and Sohn et al. (2009).

In this study, we considered 30 virtual earthquakes at different locations and magnitudes (\(\varvec{M}_{\varvec{w}}\)) to investigate distributional changes in tsunami heights on the eastern coast of the Korean Peninsula. In the 1900s, four tsunamis occurred on the east coast of Korea. The most damaging of these was generated by the Akita earthquake, which occurred in 1983 on the west side of Japan; a maximum wave height of about 4.2 m was recorded at Imwon harbour in Korea. In this study, only the tsunami sources in that area were considered for the tsunami propagation analysis. The locations of the virtual earthquakes were established by the Korea Peninsula Energy Development Organization (KEDO 1999) as a set of plausible scenarios for treatment as boundary conditions for free surface displacement. Moreover, we use data from the three historical tsunamis, induced by the Niigata Earthquake \((\varvec{M}_{\varvec{w}} :\,7.5)\) in 1964, Central East Sea Earthquake \((\varvec{M}_{\varvec{w}} :\,7.2)\) in 1983, and Hokkaido Earthquake \((\varvec{M}_{\varvec{w}} :\,7.8)\) in 1993, to validate our model. Thus, we consider 30 virtual and 3 historical tsunamis. The virtual tsunamis occur in 10 locations at three different magnitudes: 7.1, 7.7, and 8.3. The earthquake locations for the 10 virtual tsunamis are areas of initial free surface displacement, and their fault parameters are summarized and displayed in Table 1 and Fig. 1, respectively.

Table 1 The locations and fault parameters of the earthquakes for 30 virtual tsunamis and 3 historical tsunamis considered in this study

In Table 1, \(\varvec{S}\) is the depth of the fault plane, \(\varvec{\theta}\) is the strike angle, \(\varvec{\omega}\) is the dip angle of the fault, \(\varvec{\zeta}\) is the slip angle of the fault, \(\varvec{L}\) is the length of the fault, \(\varvec{W}\) is the width of the fault, \(\varvec{D}\) is the displacement of the fault, and \(\varvec{M}_{\varvec{w}}\) is the moment magnitude of the earthquake.

A more realistic representation of bathymetry was found to be crucial for simulating certain features of tsunami propagation, and thus the grid size of the finite difference scheme should be small enough to represent local bathymetric features. However, for a large domain including the East Sea, using such a small grid size throughout would not be practical for simulating tsunami propagation (Kim et al. 2014; Sohn et al. 2009). We therefore used a nested multi-grid approach in which the spatial resolution gradually increased from the open sea toward the East Sea in order to better simulate tsunami wave propagation and inundation near the shore. Four computational domains were used in the numerical analysis, labelled A–D in Fig. 1. Region A used the free transmission condition, and the other regions used the dynamic linking method as a boundary condition for the open sea. Fully reflected conditions for regions A to D are given in Table 2. The maximum tsunami heights (H) under the 30 virtual tsunami scenarios considered in this study are illustrated along with the grid index in Fig. 2. The maximum tsunami heights are extracted from each grid along the eastern coast of the Korean Peninsula. The maximum tsunami height at the coastline ranges broadly from 0.1 to 9.9 m and tends to decrease with distance from the earthquake location, which generally agrees with Cho et al. (2013) and Kim et al. (2014). In the figure, the tsunami heights under the earthquake scenarios #8–10, #18–20 and #28–30 are markedly lower than those under the other scenarios. This phenomenon arises mainly due to the Russian Far East coast, which decreases the wave heights of earthquake-induced tsunamis. The spatial patterns of the maximum tsunami heights were largely similar across the different scenarios.

Table 2 Computational information and boundary conditions for each domain
Fig. 2 Maximum tsunami heights along with the grid index for the different tsunami scenarios considered in this study. The 30 virtual tsunamis are characterized by 10 different locations and 3 different magnitudes: 7.1 (case #1–10), 8.3 (case #11–20) and 7.7 (case #21–30)

3 Statistical methods for modelling tsunami heights

3.1 Overview of probability distributions for tsunami heights

The log-normality of the tsunami heights was theoretically represented by Go (1987, 1997) and was further validated by other researchers (Choi et al. 2002). Several studies have suggested that the tsunami heights can be adequately described by the log-normal distribution (Van Dorn 1965; Kajiura 1983; Go 1987, 1997; Choi et al. 2006). On the other hand, other studies have shown that the underlying distribution of tsunami heights differs from the log-normal distribution (Mazova et al. 1989; Choi et al. 1994; Pelinovsky et al. 1997a, b; Kim et al. 2014). The log-normality of the tsunami height along the eastern coast of the Korean Peninsula has been extensively investigated as a function of the length of the coastline (Kim et al. 2014). It was found that the log-normality assumption may be inappropriate as the length of the coastline increases. From this perspective, we reviewed different types of probability distributions (Weibull, normal, log-normal, log-logistic, logistic, inverse Gaussian, gamma, generalized extreme value, and exponential) to represent tsunami heights. Among them, the Weibull distribution was identified as the best fit, using the BIC (Bayesian information criterion) as shown in Table S1. The distribution with the lowest BIC is preferred for model selection. For the eastern coast of the Korean Peninsula, the selected distributions exhibit similarities across all scenarios. The results, which align with those reported in a previous study (Kim et al. 2014), suggest that both the characteristics of the undersea earthquakes and the bathymetry throughout the propagation path are equally effective in terms of characterizing the relative magnitude of the tsunami height.
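The BIC-based selection step can be sketched as follows. This is a minimal illustration under stated assumptions: the log-likelihood values, the sample size, and the function names `bic` and `select_by_bic` are hypothetical and are not taken from Table S1.

```python
import math

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k * ln(n) - 2 * ln(L_hat)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def select_by_bic(candidates: dict, n_obs: int) -> str:
    """candidates maps a distribution name to
    (maximised log-likelihood, number of fitted parameters)."""
    scores = {name: bic(ll, k, n_obs) for name, (ll, k) in candidates.items()}
    # The candidate with the lowest BIC is preferred.
    return min(scores, key=scores.get)

# Hypothetical log-likelihoods for one scenario (illustration only).
example = {
    "weibull": (-410.2, 2),
    "log-normal": (-418.9, 2),
    "exponential": (-455.0, 1),
}
best = select_by_bic(example, n_obs=470)
```

The penalty term `k * ln(n)` only matters when candidates differ in their number of parameters, as the one-parameter exponential does here.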

The Weibull distribution (Weibull 1951) is a continuous probability distribution, a special case of the generalized extreme value distribution (GEV). The Weibull distribution has been successfully applied in many fields, including hydrologic engineering (Singh 1987; Vogel and Kroll 1989; Wilks 1989), wind engineering (Justus et al. 1978; Seguro and Lambert 2000; Pishgar-Komleh et al. 2015), earthquake modelling (Hagiwara 1974; Hasumi et al. 2009; Pasari and Dikshit 2014) and tsunami hazard modelling (Muraleedharan et al. 2006; Yadav et al. 2013a; Fukutani et al. 2016). The probability density function and cumulative distribution function for the Weibull distribution can be written as follows:

$$f\left( {\mathbf{H}} \mid \lambda ,\nu \right) = \lambda \nu {\mathbf{H}}^{\nu - 1} \exp \left( - \lambda {\mathbf{H}}^{\nu } \right)$$
(1)
$$F\left( {\mathbf{H}} \mid \lambda ,\nu \right) = 1 - \exp \left( - \lambda {\mathbf{H}}^{\nu } \right)$$
(2)

where \(\varvec{\lambda}\) is the rate parameter, \(\varvec{\nu}\) is the shape parameter of the distribution, and H is the maximum tsunami height.
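As a minimal sketch of Eqs. (1) and (2), the functions below evaluate the density and distribution function in the rate and shape parameterization used here. The function names are illustrative, and `rate_to_scale` is an added convenience for readers whose software expects the more common scale parameterization \(\exp[-(H/s)^{\nu}]\).

```python
import math

def weibull_pdf(h: float, lam: float, nu: float) -> float:
    """Eq. (1): density with rate lam and shape nu, for h > 0."""
    return lam * nu * h ** (nu - 1.0) * math.exp(-lam * h ** nu)

def weibull_cdf(h: float, lam: float, nu: float) -> float:
    """Eq. (2): probability that the maximum height does not exceed h."""
    return 1.0 - math.exp(-lam * h ** nu)

def exceedance(h: float, lam: float, nu: float) -> float:
    """Probability that the maximum height exceeds h, i.e. 1 - F(h)."""
    return math.exp(-lam * h ** nu)

def rate_to_scale(lam: float, nu: float) -> float:
    """Scale s of the equivalent form exp(-(h / s) ** nu): s = lam ** (-1 / nu)."""
    return lam ** (-1.0 / nu)
```

For example, `exceedance(2.0, lam, nu)` gives the probability that the maximum height exceeds 2 m for a given parameter pair.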

The shape of the density function given in Eq. (1) changes significantly with the value of the shape parameter \(\varvec{\nu}\); the skewness depends only on the shape parameter. The mean and variance of the Weibull distribution with rate and shape parameters can be expressed as:

$$E\left( {\mathbf{H}} \right) = \lambda^{ - 1/\nu }\Gamma \left( {1 + 1/\nu } \right)$$
(3)
$$var\left( {\mathbf{H}} \right) = \lambda^{ - 2/\nu } \left[ {\Gamma \left( {1 + 2/\nu } \right) - \left( {\Gamma \left( {1 + 1/\nu } \right)} \right)^{2} } \right]$$
(4)

where \(\Gamma\) is the gamma function.
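The moment expressions can be checked numerically with a short sketch. It uses the standard Weibull variance, \(\lambda^{-2/\nu}[\Gamma(1 + 2/\nu) - \Gamma(1 + 1/\nu)^{2}]\), together with the Python standard library; the function names and the sampling helper are illustrative assumptions.

```python
import math
import random

def weibull_mean(lam: float, nu: float) -> float:
    """Mean of the Weibull distribution with rate lam and shape nu."""
    return lam ** (-1.0 / nu) * math.gamma(1.0 + 1.0 / nu)

def weibull_var(lam: float, nu: float) -> float:
    """Variance: lam^(-2/nu) * [Gamma(1 + 2/nu) - Gamma(1 + 1/nu)^2]."""
    g1 = math.gamma(1.0 + 1.0 / nu)
    g2 = math.gamma(1.0 + 2.0 / nu)
    return lam ** (-2.0 / nu) * (g2 - g1 ** 2)

def sample_height(lam: float, nu: float, rng: random.Random) -> float:
    """Draw one height; random.weibullvariate takes (scale, shape),
    so the rate is first converted to a scale lam ** (-1 / nu)."""
    return rng.weibullvariate(lam ** (-1.0 / nu), nu)
```

With \(\nu = 1\) the distribution reduces to the exponential case, for which the mean is \(1/\lambda\) and the variance is \(1/\lambda^{2}\); this gives a quick sanity check of both functions.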

We used a Bayesian approach to estimate the parameters and their uncertainty. The posterior distribution \(p({\varvec{\Theta}}|{\mathbf{H}})\) of the parameter vector \({\varvec{\Theta}}\) is given by Bayes' theorem:

$$p\left( {{\varvec{\Theta}}|{\mathbf{H}}} \right) = \frac{{p\left( {\varvec{\Theta}} \right) \times p\left( {{\mathbf{H}}|{\varvec{\Theta}}} \right)}}{{p\left( {\mathbf{H}} \right)}} = \frac{{p\left( {\varvec{\Theta}} \right) \times p\left( {{\mathbf{H}}|{\varvec{\Theta}}} \right)}}{{\int p\left( {\varvec{\Theta}} \right) \times p\left( {{\mathbf{H}}|{\varvec{\Theta}}} \right)d{\varvec{\Theta}}}} \propto p\left( {\varvec{\Theta}} \right) \times p\left( {{\mathbf{H}}|{\varvec{\Theta}}} \right)$$
(5)

where \({\varvec{\Theta}}\) is a set of parameters (\(\lambda\) and \(\nu\)) of the distribution to be fitted, \(p({\mathbf{H}}|{\varvec{\Theta}})\) is the likelihood function, and \(p({\varvec{\Theta}})\) is the prior distribution.

The random variables \({\mathbf{H}}\) can be regarded as exchangeable if their joint distribution is invariant under permutations of the variables. More specifically, exchangeability (or similarity) holds when the joint distribution of every finite subset of the random variables is unchanged under all permutations. The exchangeability condition from a Bayesian perspective is comparable to the independent and identically distributed (iid) condition in frequentist theory. Assuming that a sequence of random variables is exchangeable allows us to make inferences about unobserved members of the sequence from those that were observed, within an inductive statistical paradigm.

A conjugate prior is a prior distribution for which the posterior distribution belongs to the same family. Conjugate priors are favourable for computational reasons. When the shape parameter is unknown, the Weibull distribution is known not to have a continuous conjugate joint prior distribution. In this case, the same gamma prior distributions can be assumed for both the shape and rate parameters (Berger and Sun 1993; Kundu and Joarder 2006). Accordingly, the parameters \(\varvec{\nu}\) and \(\varvec{\lambda}\) were given vague gamma priors, i.e. gamma priors of (0.1, 0.1), which have a mean of 1 and a variance of 10. The joint posterior distribution of the parameters for individual cases can be estimated by combining the prior distributions and the likelihood function as follows:
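For reference, the quoted prior moments follow from the gamma distribution \(G(a, b)\) if \(a\) is read as the shape and \(b\) as the rate. This reading is an assumption, although it is the one consistent with the stated mean and variance:

$$E\left( x \right) = \frac{a}{b} = \frac{0.1}{0.1} = 1,\quad var\left( x \right) = \frac{a}{b^{2}} = \frac{0.1}{0.1^{2}} = 10,\quad x \sim G\left( {a = 0.1,b = 0.1} \right)$$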

$$p\left( {{\varvec{\Theta}}|{\mathbf{H}}} \right) \propto p\left( {\varvec{\Theta}} \right)p\left( {{\mathbf{H}}|{\varvec{\Theta}}} \right) = \left[ {\prod\limits_{g = 1}^{N} \lambda \nu H_{g}^{\nu - 1} \exp \left( { - \lambda H_{g}^{\nu } } \right)} \right] \times G\left( {\lambda |0.1,0.1} \right) \times G\left( {\nu |0.1,0.1} \right)$$
(6)

where N is the number of grid points at which the wave heights are extracted.
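To make the single-event posterior concrete, the sketch below draws from it with a random-walk Metropolis sampler on the log scale. This is not the Gibbs sampler used in the study but a simpler stand-in chosen so that the example needs only the Python standard library. The step size, chain length, seed, and function names are illustrative assumptions, and the gamma priors are read as shape and rate 0.1.

```python
import math
import random

A, B = 0.1, 0.1  # gamma prior shape and rate for both lam and nu

def log_posterior(lam, nu, heights):
    """Unnormalised log posterior: Weibull likelihood times gamma priors."""
    if lam <= 0.0 or nu <= 0.0:
        return -math.inf
    n = len(heights)
    loglik = (n * math.log(lam * nu)
              + (nu - 1.0) * sum(math.log(h) for h in heights)
              - lam * sum(h ** nu for h in heights))
    logprior = ((A - 1.0) * math.log(lam) - B * lam
                + (A - 1.0) * math.log(nu) - B * nu)
    return loglik + logprior

def metropolis(heights, n_iter=4000, burn_in=1000, step=0.08, seed=1):
    """Random-walk Metropolis on (log lam, log nu).
    Returns a list of (lam, nu) draws after burn-in."""
    rng = random.Random(seed)
    u, v = 0.0, 0.0  # start at lam = nu = 1
    # Adding u + v is the Jacobian of the log transformation.
    current = log_posterior(math.exp(u), math.exp(v), heights) + u + v
    draws = []
    for i in range(n_iter):
        u_new = u + rng.gauss(0.0, step)
        v_new = v + rng.gauss(0.0, step)
        proposal = (log_posterior(math.exp(u_new), math.exp(v_new), heights)
                    + u_new + v_new)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal - current)):
            u, v, current = u_new, v_new, proposal
        if i >= burn_in:
            draws.append((math.exp(u), math.exp(v)))
    return draws
```

Sampling on the log scale keeps both parameters positive without rejecting proposals at the boundary. In practice the draws would be checked for convergence before being summarized.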

3.2 Integrated statistical model for multiple tsunami events

We further explored an integrated model for jointly analysing the probability distributions of tsunami heights in multiple tsunami events. More specifically, we investigated functional aspects in the estimation of distributional parameters for tsunami height, which might be functionally or physically related to the location and magnitude of an earthquake, in a Bayesian regression framework. Thus, the parameters \(\varvec{\nu}\) and \(\varvec{\lambda}\) in Eq. (6) become functions of the location and magnitude of earthquakes, as \(\varvec{F}_{\varvec{\nu}}\) and \(\varvec{F}_{\varvec{\lambda}}\). More specifically, the dependences of the parameters \(\varvec{\nu}\) and \(\varvec{\lambda}\) on the predictors X are specified via a link function, which is assumed to be related in a nonlinear way to the response’s distribution. The joint posterior distribution for multiple tsunami events can be formally reformulated as follows:

$$p\left( {{\varvec{\Theta}}|{\mathbf{H}}} \right) \propto p\left( {\varvec{\Theta}} \right)p\left( {{\mathbf{H}}|{\varvec{\Theta}}} \right) = \prod\limits_{c = 1}^{30} \left\{ {\left[ {\prod\limits_{g = 1}^{N} \lambda_{c} \nu_{c} H_{c,g}^{\nu_{c} - 1} \exp \left( { - \lambda_{c} H_{c,g}^{\nu_{c} } } \right)} \right] \times G\left[ {\nu_{c} |F_{\nu } \left( {\alpha ,{\varvec{\upbeta}},{\mathbf{X}}_{c} } \right),\theta_{\nu } } \right] \times G\left[ {\lambda_{c} |F_{\lambda } \left( {\gamma ,{\varvec{\updelta}},{\mathbf{X}}_{c} } \right),\theta_{\lambda } } \right]} \right\} \times N\left( {\alpha |0,10^{ - 4} } \right) \times N\left( {{\varvec{\upbeta}}|0,10^{ - 4} } \right) \times N\left( {\gamma |0,10^{ - 4} } \right) \times N\left( {{\varvec{\updelta}}|0,10^{ - 4} } \right) \times G\left( {\theta_{\nu } |0.1,0.1} \right) \times G\left( {\theta_{\lambda } |0.1,0.1} \right)$$
(7)

where c is the case number, and \({\mathbf{X}}\) is a vector of independent variables (the latitude, longitude and magnitude of earthquakes). Note that a nonlinear relationship between the parameters and predictors is described by the terms \({\varvec{\upbeta}}{\mathbf{X}}\) and \({\varvec{\updelta}}{\mathbf{X}}\), which are composed of low-order polynomial and power functions. The functional forms of the two parameters are provided in the following section. The terms α and γ are constants and the parameters \({\varvec{\upbeta}}\) and \({\varvec{\updelta}}\) are 4 × 1 vectors of regression coefficients. With about 14,000 maximum tsunami heights (\({\mathbf{H}}\)) from the 30 tsunami scenarios, the data provide ample information for estimating the twelve parameters in Eq. (7). Hence, non-informative prior distributions for the regression parameters are assigned, as suggested in the literature (Gelman 2006; Gelman et al. 2014). In other words, the regression coefficients are assumed to be Gaussian with zero mean and precision \(10^{-4}\), and prior distributions for the shape and rate parameters (\(\theta_{\nu } \,\,{\text{and}}\,\, \theta_{\lambda }\)) are assumed to be gamma distributions.

The posterior distribution of the model parameters \({\varvec{\Theta}} = \left( {\alpha ,{\varvec{\upbeta}},\gamma ,{\varvec{\updelta}},\theta_{\nu } ,\theta_{\lambda } } \right)\) is simultaneously estimated using the Markov chain Monte Carlo (MCMC) method, specifically the Gibbs sampler (Gelman and Hill 2006). Please refer to Gilks et al. (1995) for further information on the construction of the Gibbs sampler for Bayesian MCMC.

4 Results and discussion

4.1 Distributional changes in tsunami heights

In the Bayesian framework, parameters are treated as random variables conditional on evidence inferred from the data. The Bayesian model is specified by a prior distribution over a set of the parameters and a likelihood, which results in a posterior distribution through Bayesian updating for Eq. (6). The posterior medians with 95% credible intervals for the parameters of a Weibull distribution based on 30 virtual earthquakes are summarized in Table S2. The credible intervals that are reported in Table S2 are based on independently generated chains of parameter estimates and thus normality is not assumed. The estimates with a narrow uncertainty bound relative to their medians are statistically significant.
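A posterior median and an equal-tailed credible interval can be read directly from the MCMC draws without a normality assumption, as in the following minimal sketch. The function name and the linear interpolation between order statistics are illustrative choices, not a description of how Table S2 was produced.

```python
def credible_interval(samples, level=0.95):
    """Return (lower, median, upper) from posterior draws
    using empirical quantiles of the sorted chain."""
    xs = sorted(samples)

    def quantile(q):
        # Linear interpolation between neighbouring order statistics.
        pos = q * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    tail = (1.0 - level) / 2.0
    return quantile(tail), quantile(0.5), quantile(1.0 - tail)
```

Because the bounds are empirical quantiles, the interval can be asymmetric about the median, which suits skewed posteriors for positive parameters such as \(\lambda\) and \(\nu\).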

Tsunami height depends on various factors, of which earthquake magnitude and distance from the epicentre are considered the most important. As a way of delineating distributional changes in tsunami height, we explored functional relationships between the estimated parameters and the earthquake location and magnitude. As illustrated in Fig. 3, the parameters (\(\upnu \,\,{\text{and}}\,\,\uplambda\)) show clear patterns across locations and magnitudes. The rate parameter \(\uplambda\) shows a concave upward curve corresponding to the longitude, with higher values along the edges around 139°E due to the long distance from the earthquakes. For the magnitude, a similar profile is obtained, but the lower rate parameter corresponds to the higher magnitude. For the most part, patterns in the rate parameter with latitude are similar to those with longitude. The shape parameter \(\upnu\) shows broadly similar associations with the location and magnitude of earthquakes, except that it varies less and forms a concave downward curve corresponding to longitude. The shape of the Weibull distribution changes significantly with the shape parameter (ν). For ν > 1, the density function increases monotonically until it reaches the mode, after which it decreases. It is interesting to note that the variance of the Weibull distribution tends to decrease as the value of the shape parameter increases. Moreover, the skewness depends only on the shape parameter. The concave relationships between the parameters and earthquake attributes (e.g., location and magnitude) reflect the fact that the distance from the earthquake epicentre affects the overall characteristics of the tsunami height.
Specifically, the statistical moments (e.g., mean, variance, and skewness) associated with tsunami height all become higher both as the distance from the earthquake epicentre decreases and as the magnitude becomes higher. As illustrated in Fig. 4, the functional forms of the mean and variance of the Weibull distribution exhibit patterns largely similar to those of the Weibull parameters. The maximum mean tsunami height is obtained around 138.5°E and 40°N, which is also where the Weibull parameters reach their minimum (or maximum) values. The relationships identified in this study therefore appear to be physically interpretable. We will further explore the implications of those patterns to increase the practical use of the information within a Bayesian regression framework.

Fig. 3 Scatterplots representing functional relationships between the parameters (\(\varvec{\lambda}\) and \(\varvec{\nu}\)) in a Weibull distribution and the epicentre of earthquakes of different magnitudes. The parameters used here are median values estimated from posterior distributions

Fig. 4 The estimated mean and variance of the Weibull distribution for 30 virtual tsunamis, along with the epicentre of earthquakes for different magnitudes

4.2 Functional model for estimating parameters

We next present a functional model for the statistical analysis of tsunami heights simulated by numerical modelling. We developed a Bayesian generalized linear model (GLM) for estimating the parameters of a Weibull distribution by allowing the distribution of tsunami heights to depend on earthquake characteristics through a link function. The GLM can be regarded as a generalization of ordinary regression that allows the dependent variable to follow a non-Gaussian distribution. We used stepwise regression, an iterative procedure for constructing a model, to identify the best combination of the individual explanatory variables (latitude, longitude, and magnitude) and to minimize the difference between the observed and modelled data. The identified functional forms of the two parameters, ν and λ, can be written as follows:

$$F_{\nu } = \left( {\alpha + \beta_{1} \times X_{lon,c} \times X_{int,c} + \beta_{2} \times X_{lon,c}^{2} + \beta_{3} \times X_{lat,c}^{2} } \right)/\left( {X_{int,c}^{{\beta_{4} }} } \right)$$
(8a)
$$F_{\lambda } = \left( {\gamma + \delta_{1} \times X_{lon,c} \times X_{int,c} + \delta_{2} \times X_{lon,c}^{2} + \delta_{3} \times X_{lat,c}^{2} } \right)/\left( {X_{int,c}^{{\delta_{4} }} } \right)$$
(8b)

where \(X_{lon}\), \(X_{lat}\), and \(X_{int}\) are the longitudes, latitudes, and magnitudes of the earthquakes, respectively. Again, note that the subscript “c” indexes the cases (i.e. the 30 virtual tsunami events).
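Both functional forms share one structure, which can be sketched as a single function. The function name is hypothetical, and the coefficient values in the usage line are placeholders for illustration, not the estimates reported in Table 3.

```python
def link_function(intercept, coef, lon, lat, mag):
    """Common form of Eqs. (8a) and (8b):
    (a + b1 * lon * mag + b2 * lon^2 + b3 * lat^2) / mag^b4,
    where a is alpha or gamma and coef holds the four
    regression coefficients (beta or delta)."""
    b1, b2, b3, b4 = coef
    numerator = intercept + b1 * lon * mag + b2 * lon ** 2 + b3 * lat ** 2
    return numerator / mag ** b4

# Placeholder coefficients (illustration only).
f_example = link_function(1.0, (0.0, 0.0, 0.0, 0.0), lon=138.5, lat=40.0, mag=7.7)
```

Calling it once with \((\alpha, {\varvec{\upbeta}})\) and once with \((\gamma, {\varvec{\updelta}})\) gives \(F_{\nu}\) and \(F_{\lambda}\) for a given earthquake location and magnitude.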

Substituting Eq. (8) into (7) allows simultaneous estimation of the joint posterior distribution of the model parameters \({\varvec{\Theta}}\) for multiple tsunami events. The estimated parameters and their uncertainties are summarized in Table 3. The estimates with stable posterior means and narrow 95% credible intervals can be considered statistically significant. Figure 5a compares the jointly estimated parameters and their uncertainty in the Weibull distribution across cases, using Eq. (8), with the individual estimates of the parameters across the cases, using Eq. (6). The posterior median values of the jointly estimated parameters are almost identical to the individual estimates, within the 95% credible bounds for most cases. We further evaluated the degree of similarity between them using statistics of efficiency, such as the correlation coefficient, Nash–Sutcliffe model efficiency coefficient, and index of agreement. For more details on efficiency statistics, please see Legates and McCabe (1999). There is good agreement among the parameters, with correlation measures over 0.9 for the entire parameter range, as summarized in Table S3. Additionally, we calculated the mean and variance using the theoretically derived moment equations (Eqs. 3, 4) along with the estimated parameters, and compared the results with the observed mean and variance. A scatter plot of the observed (abscissa) and estimated (ordinate) moments (i.e. mean and variance) for all cases is illustrated in Fig. 5b. The near-perfect match along the reference line indicates that the integrated model can reproduce the statistical moments associated with tsunami heights for all of the cases at the different magnitudes and locations. Consequently, the results demonstrate that the proposed model can be effective and informative about tsunami height from an earthquake of a given magnitude at a particular location.
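The three efficiency statistics mentioned above can be computed as in the following sketch, which uses their standard definitions. The function names are illustrative, and the inputs are assumed to be equal-length sequences of reference values (here, the individual estimates) and estimated values (here, the joint estimates).

```python
import math

def pearson_r(obs, est):
    """Pearson correlation coefficient between two sequences."""
    mo = sum(obs) / len(obs)
    me = sum(est) / len(est)
    cov = sum((o - mo) * (e - me) for o, e in zip(obs, est))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    se = math.sqrt(sum((e - me) ** 2 for e in est))
    return cov / (so * se)

def nash_sutcliffe(obs, est):
    """1 - sum of squared errors / sum of squared deviations from the mean."""
    mo = sum(obs) / len(obs)
    num = sum((o - e) ** 2 for o, e in zip(obs, est))
    den = sum((o - mo) ** 2 for o in obs)
    return 1.0 - num / den

def index_of_agreement(obs, est):
    """Willmott's index: 1 - SSE / sum((|e - mean_obs| + |o - mean_obs|)^2)."""
    mo = sum(obs) / len(obs)
    num = sum((o - e) ** 2 for o, e in zip(obs, est))
    den = sum((abs(e - mo) + abs(o - mo)) ** 2 for o, e in zip(obs, est))
    return 1.0 - num / den
```

All three equal 1 for a perfect match. Unlike the correlation coefficient, the Nash–Sutcliffe coefficient and the index of agreement penalize systematic bias, which is why they are usually reported alongside it.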

Table 3 The estimated regression coefficients for two parameters in a Weibull distribution and their credible bounds
Fig. 5 a The predicted rate and shape parameters corresponding to different earthquakes and their credible bounds. The blue dotted line represents independently estimated parameters, and the red solid line indicates jointly estimated parameters within a Bayesian framework. b Scatter plots for the observed and estimated statistical moments (i.e. mean and variance) for 30 virtual tsunamis

For further validation, we used the location and magnitude of three historical tsunami events to infer the parameters of the Weibull distribution through the proposed Bayesian GLM framework. We sampled the parameters from their full conditional posterior distributions, which are derived from Eqs. (7)–(8), and compared the results with the independent estimates. The boxplots in Fig. 6 illustrate the uncertainty bounds for the rate and shape parameters of the historical tsunamis in 1964, 1983, and 1993. Most of the true values (i.e., independent estimates) fall within the interquartile range of the simulated values, suggesting that the parameters obtained through the proposed Bayesian GLM offer accuracy comparable to the independent estimates. However, the true value of the shape parameter ν for the Niigata Earthquake in 1964 clearly falls outside the range of the simulated values. This difference may arise because the Niigata Earthquake occurred far from the earthquakes used to fit the model. On the other hand, the overall range of the shape parameter is narrower than that of the rate parameter, so the deviation is expected to have only a limited effect on the tsunami height predictions. Note that additional virtual cases covering a wide range of tsunami mechanisms, in terms of magnitude and distance from the epicentre, need to be explored to better understand the distributional changes in tsunami height, globally as well as locally.

Fig. 6 Boxplots for the rate and shape parameters for historical tsunamis that occurred in 1964, 1983, and 1993. The “x” indicates the independently estimated parameters

4.3 Inundation probability estimation

The failure probability of a system can be defined as the probability that the loading exceeds the corresponding threshold (i.e. resistance). In this study, we adopted a simple definition of the overall inundation probability for the entire coastal area: the probability that the tsunami height (i.e. loading) exceeds the ground elevation (i.e. resistance). A conceptual illustration of the estimation of failure probability, using two normal distributions for the loading and resistance, is shown in Fig. 7. Intuitively, the failure probability is the cumulative probability over the region where the loading and resistance density functions overlap. The failure probability can thus be formulated as follows:

$$p_{f} = \Pr \left( {r \le l} \right) = \Pr \left( {r - l \le 0} \right)$$
(9)

where l, r, and pf are the loading, the resistance, and the failure probability, respectively.

Fig. 7 A conceptual illustration of the estimation of risk using two normal distributions for the loading and resistance
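For the conceptual two-normal case of Fig. 7, the failure probability has a closed form if the loading and resistance are additionally assumed to be independent, since r − l is then itself normally distributed. The sketch below compares that closed form with a simple Monte Carlo estimate; the function names and parameter values are hypothetical and serve only to illustrate Eq. (9).

```python
import math
import random
from statistics import NormalDist

def failure_probability_normal(mu_l, sigma_l, mu_r, sigma_r):
    """Closed form of Pr(r - l <= 0) for independent normal loading and resistance."""
    # The margin r - l is normal with mean mu_r - mu_l and variance sigma_r^2 + sigma_l^2.
    margin = NormalDist(mu_r - mu_l, math.sqrt(sigma_r ** 2 + sigma_l ** 2))
    return margin.cdf(0.0)

def failure_probability_mc(mu_l, sigma_l, mu_r, sigma_r, n=100_000, seed=0):
    """Monte Carlo estimate of Pr(r <= l) from n independent (r, l) pairs."""
    rng = random.Random(seed)
    failures = sum(
        rng.gauss(mu_r, sigma_r) <= rng.gauss(mu_l, sigma_l) for _ in range(n)
    )
    return failures / n

# Illustrative values only: loading centred at 1.0 m, resistance at 2.0 m.
print(failure_probability_normal(1.0, 0.5, 2.0, 0.5))
print(failure_probability_mc(1.0, 0.5, 2.0, 0.5))
```

The two estimates should agree to within Monte Carlo sampling error. The closed form no longer applies once the loading or resistance is non-normal, which is why numerical integration is needed in the general case.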

Hence, in our case, the failure probabilities for all tsunami heights are determined using the following integral representation (Ang and Tang 1984). Specifically, the integration is performed over the failure region (i.e. r − l ≤ 0) to estimate the failure probability, as illustrated in Fig. 7.

$$p_{f} = \mathop \int \limits_{0}^{\infty } f_{r} \left( r \right)\left[ {\mathop \int \limits_{r}^{\infty } f_{l} \left( l \right)dl} \right]dr$$
(10)
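Equation (10) can be evaluated numerically once the two densities are specified. The following is a minimal sketch for a Weibull loading and a log-normal resistance. It assumes a scale-shape Weibull form, which may differ from the rate-shape parameterization used in this paper, and all names and values are hypothetical. The inner integral is the Weibull survival function, and the outer integral is evaluated with the substitution r = exp(mu + sigma·z), where z is standard normal.

```python
import math

def weibull_sf(x, scale, shape):
    """Inner integral of Eq. (10): Pr(l > x) for a scale-shape Weibull (assumed form)."""
    return math.exp(-((x / scale) ** shape))

def failure_probability(scale, shape, mu, sigma, z_max=8.0, n=4000):
    """Eq. (10) for Weibull loading and log-normal resistance (midpoint rule).

    With r = exp(mu + sigma * z), the log-normal density f_r(r) dr becomes the
    standard normal density phi(z) dz, so the outer integral runs over z.
    """
    h = 2.0 * z_max / n
    total = 0.0
    for i in range(n):
        z = -z_max + (i + 0.5) * h                         # midpoint of the i-th cell
        r = math.exp(mu + sigma * z)                       # ground elevation at this z
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += phi * weibull_sf(r, scale, shape)
    return total * h

# Illustrative values only (not fitted to any data in this study).
print(failure_probability(scale=1.5, shape=1.2, mu=0.5, sigma=0.6))
```

Integrating over z instead of r keeps the grid well resolved even when the resistance distribution is narrow or strongly skewed.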

To carry out the loading-resistance analysis, the probability distributions of the loading (fl) and the resistance (fr) need to be estimated. For the loading, we used the joint posterior distributions of the two Weibull parameters, estimated from the functional form presented in Sect. 4.2. For the resistance (i.e. ground elevation), fr, the log-normal distribution was selected as the best fit based on the negative log-likelihood and BIC values. For presentation purposes, the probability density functions of the tsunami height and ground elevation for 6 scenarios (i.e. #4, 8, 14, 18, 28 and 30) are shown in Fig. 8. The grey-shaded band represents the 95% Bayesian credible interval, and the cyan-shaded area corresponds to the failure probability. The cyan-shaded area is clearly larger for magnitude Mw-8.3 (#14 and #18) than for Mw-7.1 (#4 and #8) and Mw-7.7 (#24 and #28). A direct numerical integration over the failure region was then performed to estimate the failure probability. The failure probability estimates and their credible intervals for the 30 virtual tsunamis are presented in Fig. 9. The credible interval for the failure probability was obtained by repeating the integration over the failure domain for each realization of the tsunami height distribution, thereby reflecting the uncertainty in that distribution. In Fig. 9, the lower and upper edges of each box indicate the 25th and 75th percentiles (i.e., the interquartile range) of the failure probability, respectively, and the line in the middle represents the median. The vertical lines (whiskers) extending from the edges of the box reach the most extreme values lying within 1.5 times the interquartile range of the box. The failure probability and its uncertainty range differ substantially among scenarios, suggesting that the estimation of tsunami hazard is highly sensitive to the loading conditions.
The failure probabilities for magnitude Mw-8.3 are notably higher than those for Mw-7.1 and Mw-7.7, especially for latitudes between 38° and 42° along the fault zone. The results confirm that the proposed modelling scheme can translate the uncertainty in the model parameters into uncertainty in the estimated tsunami hazard, which could in turn support improved tsunami hazard mitigation strategies.
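The propagation of parameter uncertainty into the failure probability can be sketched as follows: the integration is repeated for each posterior draw of the loading parameters, and the resulting failure probabilities are summarized by percentiles. This is only an illustration under stated assumptions. The posterior draws below are synthetic stand-ins for MCMC output, the Weibull loading is written in a scale-shape form that may differ from the paper's rate-shape form, and all names are hypothetical.

```python
import math
import random

def failure_probability(scale, shape, mu, sigma, z_max=8.0, n=400):
    """Failure probability for Weibull loading (assumed scale-shape form) and
    log-normal resistance, integrated over z with r = exp(mu + sigma * z)."""
    h = 2.0 * z_max / n
    total = 0.0
    for i in range(n):
        z = -z_max + (i + 0.5) * h
        r = math.exp(mu + sigma * z)
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += phi * math.exp(-((r / scale) ** shape))
    return total * h

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of sorted values, with q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def failure_probability_summary(draws, mu, sigma, probs=(0.025, 0.25, 0.5, 0.75, 0.975)):
    """Repeat the integration for every (scale, shape) draw and summarize."""
    pf = sorted(failure_probability(s, k, mu, sigma) for s, k in draws)
    return [percentile(pf, q) for q in probs]

# Synthetic posterior draws (placeholders for real MCMC samples).
rng = random.Random(1)
draws = [
    (rng.lognormvariate(math.log(1.5), 0.10), rng.lognormvariate(math.log(1.2), 0.05))
    for _ in range(500)
]
print(failure_probability_summary(draws, mu=0.5, sigma=0.6))
```

The 25th, 50th, and 75th percentiles correspond to the box edges and median line in a boxplot such as Fig. 9, while the 2.5th and 97.5th percentiles give a 95% credible interval.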

Fig. 8 Comparison of the probability density functions between the tsunami height and ground elevation for 6 scenarios. The shaded area corresponds to the failure region

Fig. 9 The failure probability estimates and their uncertainty bounds for 30 virtual tsunamis. A direct numerical integration over the failure region was performed to estimate the failure probability

5 Concluding remarks

In this study, we investigated distributional changes in tsunami heights on the eastern coast of the Korean Peninsula using virtual earthquakes. We further examined a possible association between the distribution parameters of tsunami height and the earthquake location and magnitude by exploring a functional relationship within a Bayesian generalized linear regression framework. We focused on developing a practical statistical tool that allows rapid evaluation of tsunami hazard using a regression analysis with a small number of predictors. Our primary results are summarized as follows.

  1. We confirmed significant distributional changes in tsunami height depending on earthquake location and magnitude in the East Sea. The rate parameter has a concave upward (or downward) trend along the longitude and latitude. We also identified an increasing trend in the parameters as magnitude increased. In general, the statistical moments of the tsunami heights increase as the distance from the earthquake epicentre decreases and as the magnitude increases.

  2. We developed a Bayesian GLM to jointly analyse tsunami heights in multiple events. The proposed model explicitly considers functional relationships with earthquake characteristics through a link function for the estimation of the parameters of a Weibull distribution. The results show that the proposed model can practically and effectively estimate the tsunami hazard from an earthquake of a given magnitude at a particular location. Specifically, the correlation coefficients between the true and estimated values for both the rate and shape parameters were over 0.9. In addition, as an experimental study, we applied the Bayesian GLM to historical tsunami events to estimate the distributional parameters. This application generally confirmed that the proposed model effectively estimates the parameters with a small number of predictors.

  3. The joint posterior distribution was used to measure the failure region against the ground elevation for each scenario. The failure probability was then computed by directly integrating the probability over the identified failure region. The failure probability (or region) differed significantly from scenario to scenario, implying that the failure probability estimation is rather sensitive to the loading scenarios. In this regard, the applications presented in this study showed that the proposed Bayesian approach has the advantage of conveying the uncertainty of the parameter estimates and its substantial effect on the modelling results. In particular, the proposed model can effectively translate the uncertainty in the model parameters into uncertainty in the estimated tsunami hazard.

  4. On the other hand, it should be noted that the proposed model showed limitations in accurately estimating the shape parameter for earthquakes located far from the virtual earthquakes used to fit the model (e.g., the Niigata Earthquake in 1964). Furthermore, the proposed model does not consider other earthquake factors, such as fault parameters. Thus, the proposed model may not be applicable to earthquakes that differ significantly from those modelled in this study.

Future work could focus on further exploring other earthquake factors and integrating the proposed model into a tsunami hazard analysis tool to handle the lack of adequate tsunami records in coastal areas.