1 Introduction

Probabilistic seismic hazard (PSH) estimates are a basic element of effective earthquake-resistant design regulations. Hazard maps are commonly developed at national scale and then implemented in seismic codes [e.g., “NEHRP Recommended Seismic Provisions” (Building Seismic Safety Council 2009), “Norme Tecniche per le Costruzioni” (Ministero delle Infrastrutture e dei Trasporti 2008)]. Because of their role in design, testing the effectiveness of such estimates is becoming increasingly important, and several procedures have been proposed [Albarello and D’Amico (2015) and references therein]. The interest of the scientific community in this problem is also enhanced by the coexistence of several alternative seismic hazard maps provided by different computational models that are equally reliable (at least in principle). Therefore, selecting the best performing maps for seismic regulations becomes a mandatory task for scientists and practitioners.

In principle, hazard estimates can be scored by comparing them against observations available at accelerometric sites that have been operating for sufficiently long time intervals (e.g., Albarello and D’Amico 2008; Albarello et al. 2015a; Barani et al. 2016). A critical aspect of this kind of analysis is that, in conventional PSH models for national (or regional) assessments, the effect of local soil deposits is generally neglected. Hence, hazard maps for rock conditions define only a basic level for the definition of the expected ground motion at a site. On the other hand, observations available for scoring are generally collected at sites where the seismic bedrock is overlain by soil deposits, and this makes any comparison between hazard estimates and observations quite problematic. To overcome this problem, one may correct observations to obtain the hypothetical ground motion on reference rock (e.g., V S,30 > 800 m/s) through the application of site (de-)amplification factors derived from seismic codes (Albarello and Peruzza 2016) or via deconvolution of the ground motion time histories recorded at each reference accelerometric station with the site transfer function (Castellaro and Albarello 2016). Alternatively, site effects may be directly included in the PSH models.

The scopes of the S2-2012 and S2-2014 DPC-INGV Projects are both probabilistic soil hazard assessments in Italy (Task 4 Working Group 2013; Barani and Spallarossa 2015a, b) and the evaluation of the consistency of the assessed hazard with observations by means of empirical scoring procedures and statistical testing methods (Albarello et al. 2013, 2015b). Besides the deliverables of the S2 DPC-INGV Projects (cited above), a comparison of probabilistic methods that incorporate site effects into ground motion hazard calculations can be found in the article of Barani and Spallarossa (2016). In principle, all PSHA methods that account for site-specific characteristics in a rigorous manner imply the determination of the ground response for a variety of target soil models representative of the actual local site conditions (e.g., Bazzurro and Cornell 2004a; Barani et al. 2013, 2014a, b). Therefore, the application of methods for site-specific PSHA to large areas requires an extensive knowledge of regional geology both to define representative soil models for the numerical ground response analyses and to associate the results from these analyses to areas with similar depositional history and properties. This kind of approach was applied by Pelli et al. (2006) and Haase et al. (2011) to compute the hazard in relatively small areas in northern Italy and Indiana (USA), respectively. Similarly, it could be applied in regions of the world where extensive seismic microzonation studies have been carried out, leading to large-scale evaluations of seismic amplification effects. However, the application of site-specific methods appears unfeasible (or, at least, unsuitable) for large-scale PSHAs. Attempts of large-scale hazard mapping inclusive of site effects are those by Petersen et al. (1997), Romeo et al. (2000), Cramer (2006), and Kalkan et al. (2010). 
All these studies incorporate site effects by simply implementing ground motion prediction equations (GMPEs) defined for specific soil classes (few classes are generally considered), which in most cases correspond to those provided by national building codes. Basically, this approach assumes that the soil conditions at each node of the calculation grid resemble those at the stations in the database considered for the development of the GMPEs selected for the PSHA. Clearly, this approach ignores detailed site-specific information and, therefore, produces only an approximate assessment of the local hazard. Nevertheless, it could be useful to verify the actual feasibility of this approach, at least to provide first-order hazard estimates to be used for regional-scale risk assessments.

In line with the scopes of the S2 DPC-INGV Projects, this study presents an attempt to evaluate the feasibility of the latter approach by evaluating the agreement (in terms of likelihood) between the outcomes from different hazard models, each one based on a specific GMPE, and the observations at a set of reference accelerometric sites. The effectiveness of different hazard models corresponding to seven alternative GMPEs suitable for PSHA in Italy (Table 1) has been examined via a scoring test. Indirectly, this analysis allows the performance of the selected GMPEs to be evaluated. As is well known, GMPEs play a crucial role in the hazard results, particularly due to the impact of the aleatory variability in the ground motion prediction (e.g., Strasser et al. 2009; Barani et al. 2015). Hence, recent research related to PSHA has led to the release of an increasing number of models, with the result of improving the accuracy of the predictions but with no tangible reduction of the aleatory variability in ground motions (Barani et al. 2016).

Table 1 List of GMPEs considered in the present study

The analysis presented here is similar, to some extent, to the one proposed in Barani et al. (2016). First, a PSHA is carried out by performing different computational runs sharing the same input models and parameter values (e.g., geometry of earthquake sources, seismicity rates, b-value, M max value) except for the GMPEs, which are applied one at a time. Then, scores are assigned to each hazard model by applying a probabilistic test, which calculates the likelihood of the outcomes from each model in relation to available observations at a set of 56 recording sites operating in Italy for at least 25 years (Table 2; Fig. 1). The analysis has been limited to these sites as a V S profile down to 30 m is available for each of them. Unlike in Barani et al. (2016), the scoring test adopted here takes into account the correlation among the hazard estimates at the reference recording sites (Albarello and Peruzza 2016). This approach, repeated for each one of the GMPEs considered, allows us to compare the performance of each model.

Table 2 List of accelerometric sites considered in the scoring test
Fig. 1 Distribution of the accelerometric sites considered in the scoring test. Contour lines of the Italian PGA hazard map for a mean return period of 475 years (MPS Working Group 2004) are also displayed. MRN (Mirandola), SNC (Soncino), and PGL (Peglio) indicate the sites where we calculated the uniform hazard spectra shown in Fig. 6. SNC is not included in the list of stations used for the scoring test

2 Empirical scoring procedure

The approach applied here is the same as the one presented in the article of Albarello and Peruzza (2016). It consists of comparing the outcomes of a PSH model against observations at a set of S recording sites, each one in operation for a Δt s period of time. To this end, the time span covered by both should be the same. Commonly, the outcome of a PSHA at a given site is expressed in terms of ground motion value \(y^{*}\) with a certain probability P (e.g., 10%) of exceedance in a specified time period Δt ≠ Δt s (e.g., Δt = 50 years). Assuming that the seismic process underlying the hazard assessment is Poissonian (as in this study), the mean annual rate of exceeding \(y^{*}\) is given by (e.g., Kramer 1996):

$$\lambda_{{y^{*} }} = - \frac{{\ln \left( {1 - P\left[ {Y > y^{*} } \right]} \right)}}{\Delta t}$$
(1)

The reciprocal of \(\lambda_{{y^{*} }}\) is the mean return period (MRP, hereinafter).
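As a brief numerical illustration of Eq. 1, the following Python sketch converts an exceedance probability over a reference time window into the corresponding annual rate and MRP. The function names are hypothetical and are introduced here only for illustration.

```python
import math


def annual_exceedance_rate(p_exceed: float, dt_years: float) -> float:
    """Eq. 1: mean annual rate of exceeding y* under the Poisson assumption."""
    return -math.log(1.0 - p_exceed) / dt_years


def mean_return_period(p_exceed: float, dt_years: float) -> float:
    """Mean return period (MRP): reciprocal of the annual exceedance rate."""
    return 1.0 / annual_exceedance_rate(p_exceed, dt_years)


# 10% probability of exceedance in 50 years
print(round(mean_return_period(0.10, 50.0)))  # 475
```

The familiar 475-year value used in many seismic codes is thus recovered from P = 10% and Δt = 50 years.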

Hence, given the i-th PSH model (H i ) and the resulting ground motion value \(y^{*}\) associated with a given MRP at the s-th site, the probability P s,i of exceeding \(y^{*}\) during an exposure time Δt s is:

$$P_{s,i} = P_{s,i} \left[ {Y > y^{*} } \right] = 1 - e^{{ - \lambda_{{y^{*} }} \Delta t_{s} }}$$
(2)

Since each PSH model provides different values of \(y^{*}\), each one corresponding to a particular MRP, models can be scored by considering each realization as an independent forecast. Since lower exceedance probabilities correspond to longer MRPs and to higher \(y^{*}\) values, different scores can be obtained for different points along a hazard curve.
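Equation 2 can be illustrated with a short Python sketch that evaluates the probability of at least one exceedance at a station during its exposure time, for several MRPs. The function name is hypothetical, and the 39-year exposure is used purely as an example value.

```python
import math


def site_exceedance_probability(mrp_years: float, exposure_years: float) -> float:
    """Eq. 2: probability of at least one exceedance of y* during the
    exposure time, given the MRP (i.e., the reciprocal of the annual rate)."""
    return 1.0 - math.exp(-exposure_years / mrp_years)


# Longer MRPs (higher y*) give lower exceedance probabilities
for mrp in (30.0, 475.0, 2475.0):
    print(mrp, site_exceedance_probability(mrp, 39.0))
```

Each MRP defines a separate point along the hazard curve and, therefore, a separate forecast that can be scored.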

Given a model H i , scores are computed by comparing the number \(N_{i}^{*}\) of sites where the threshold \(y^{*}\) is exceeded (i.e., the observed ground motion g s  ≥ \(y^{*}\)), which can be simply determined by counting, with the number μ(N i ) of exceedances expected on the basis of that hazard model. If S is large enough, according to the central limit theorem, numerical simulations show that N i is a normal random variate with mean μ(N i ) and standard deviation σ(N i ) (Albarello and Peruzza 2016). Specifically:

$$\mu (N_{i} ) = \sum\limits_{s = 1}^{S} {P_{s,i} }$$
(3)

and

$$\sigma (N_{i} ) = c_{i} \sqrt {\sum\limits_{s = 1}^{S} {P_{s,i} (1 - P_{s,i} )} }$$
(4)

where c i , which is defined in the next section, is a correction coefficient that accounts for the mutual correlation of the hazard estimates at the S sites; c i is equal to 1 if model outcomes are mutually uncorrelated at all S sites and increases as the correlation among the hazard estimates increases.
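A minimal Python sketch of Eqs. 3 and 4 is given below. The function names and the example probabilities are hypothetical; the value of c is simply passed in, since its estimation is addressed separately.

```python
import math
from typing import Sequence


def expected_exceedances(p_sites: Sequence[float]) -> float:
    """Eq. 3: expected number of sites where y* is exceeded."""
    return sum(p_sites)


def std_exceedances(p_sites: Sequence[float], c: float = 1.0) -> float:
    """Eq. 4: standard deviation of the number of exceedances.
    c = 1 corresponds to mutually uncorrelated hazard estimates."""
    return c * math.sqrt(sum(p * (1.0 - p) for p in p_sites))


# Illustrative case: 56 sites sharing the same exceedance probability
p_sites = [0.08] * 56
print(expected_exceedances(p_sites), std_exceedances(p_sites, c=1.2))
```

A value of c greater than 1 widens the distribution of N i, so that a given mismatch between observed and expected exceedances is penalized less severely.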

Hence, the new random variate Z i

$$Z_{i} = \frac{{N_{i}^{*} - \mu (N_{i} )}}{{\sigma (N_{i} )}}$$
(5)

follows the standardized normal distribution. This implies that the likelihood L i of observing at least \(N_{i}^{*}\) ground motion exceedances given the i-th PSH model is:

$$L_{i} = L(N_{i}^{*} |H_{i} ) = L\left( {\left| {Z_{i} } \right|} \right) = \frac{4}{\sqrt \pi }\int\limits_{{\left| {Z_{i} } \right|}}^{\infty } {e^{{ - t^{2} }} dt}$$
(6)

The likelihood L i can be interpreted as the degree of belief in the hypothesis that the observed number of exceedances \(N_{i}^{*}\) is an outcome of the seismicity process described by the i-th model.
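The test statistic can be sketched in Python as follows. As an assumption of this sketch, the likelihood is evaluated as the two-sided tail probability of the standard normal distribution, which equals 1 for Z i = 0; this normalization may differ by constant factors from Eq. 6 as printed and should be checked against Albarello and Peruzza (2016). Function names are hypothetical.

```python
import math


def z_score(n_observed: float, mu: float, sigma: float) -> float:
    """Eq. 5: standardized difference between observed and expected
    numbers of sites with exceedances."""
    return (n_observed - mu) / sigma


def two_sided_likelihood(z: float) -> float:
    """Two-sided tail probability of a standard normal variate,
    P(|Z| >= |z|) = erfc(|z| / sqrt(2)). Assumed normalization."""
    return math.erfc(abs(z) / math.sqrt(2.0))


z = z_score(n_observed=8, mu=4.5, sigma=2.4)
print(z, two_sided_likelihood(z))
```

With this convention, likelihood values below 0.05 correspond to |Z i| larger than about 1.96.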

It is worth noting that the hypothesis about the earthquake occurrence process assumed in the hazard assessment does not necessarily apply to observations. The general structure of the test (Eqs. 5, 6) still holds in the case that a “non-Poissonian” earthquake occurrence model is adopted to compute the expected number of sites where exceedances occur.

To evaluate the degree of belief in a hazard model (i.e., the confidence in the model H i given the observation of \(N_{i}^{*}\) exceedances), \(Q(H_{i} |N_{i}^{*} )\), additional restrictive hypotheses are required. Under the assumption that the set of n H hazard models is exhaustive and that at least one of these models is representative of the actual seismogenic process, one obtains, by Bayes’ theorem:

$$Q_{i} = Q\left( {H_{i} |N_{i}^{*} } \right) = \frac{{p\left( {H_{i} } \right)}}{{\sum\nolimits_{j = 1}^{{n_{H} }} {p\left( {H_{j} } \right)L\left( {N_{j}^{*} \left| {H_{j} } \right.} \right)} }}L\left( {N_{i}^{*} \left| {H_{i} } \right.} \right)$$
(7)

where p(H i ) is the ex-ante degree of belief in the i-th model (see Albarello and D’Amico 2015). The likelihood L i and the confidence Q i are inherently different. The former quantifies the “absolute” performances of the i-th model irrespective of the other models. Very low values of L i indicate that model outcomes are apparently incompatible with the observations since the probability of observing a similar outcome is low. On the other hand, Q i provides the relative “score” of the i-th model conditioned to the performances of the other competing models, assuming that at least one of them is representative of the underlying natural process. Thus, the likelihood L i allows testing the feasibility of calculating Q i in the sense that, if none of the considered models provide results compatible with observations (e.g., L i  < 5% for all n H models), Eq. 7 cannot be applied (for a detailed discussion, see Albarello and D’Amico 2015).

Equation 7 allows estimating the values of Q i for one specific realization of the model H i (e.g., a spectral acceleration value corresponding to a specific MRP). This implies that the observation \(N_{i}^{*}\) (i.e., the number of observed exceedances) relative to the i-th model depends on both the number of MRPs (n MRP) and response periods (n T ). In principle, we can consider \(M = n_{\text{MRP}} \cdot n_{T}\) combinations of MRPs and T values. For the m-th combination, we get the observation \(N_{i,m}^{*}\). In the case of n H hazard models, we get \(M \cdot n_{H}\) values of Q i where m and i are in the ranges [1, M] and [1, n H], respectively. By iteratively applying Eq. 7, one can obtain the overall degree of belief associated with the i-th model (hereinafter called “score”) given that \(N_{i,M}^{*}\) exceedances occur, \(Q_{i} = Q(H_{i} |N_{i,M}^{*} )\):

$$\left\{ \begin{aligned}\begin{array}{*{20}l} Q_{i} = Q\left( {H_{i} \left| {N_{i,1}^{*} } \right.} \right) = \frac{{p\left( {H_{i} } \right)}}{{\sum\nolimits_{j = 1}^{{n_{H} }} {p\left( {H_{j} } \right)L\left( {N_{j,1}^{*} \left| {H_{j} } \right.} \right)} }}L\left( {N_{i,1}^{*} \left| {H_{i} } \right.} \right)\quad m = 1 \hfill \\ Q_{i} = Q\left( {H_{i} \left| {N_{i,m}^{*} } \right.} \right) = \frac{{Q\left( {H_{i} \left| {N_{i,m - 1}^{*} } \right.} \right)}}{{\sum\nolimits_{j = 1}^{{n_{H} }} {Q\left( {H_{j} \left| {N_{j,m - 1}^{*} } \right.} \right)L\left( {N_{j,m}^{*} \left| {H_{j} } \right.} \right)} }}L\left( {N_{i,m}^{*} \left| {H_{i} } \right.} \right)\quad m = 2,3, \ldots ,M \hfill \\ \end{array}\end{aligned} \right.$$
(8)

In this study, the values of Q i are computed assuming that p(H i ) = 1/n H .
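The iterative updating of Eq. 8 can be sketched in Python as below. The function name and the layout of the likelihood table (one row per combination of MRP and spectral period, one column per hazard model) are assumptions made for illustration.

```python
from typing import Optional, Sequence, List


def q_scores(likelihoods: Sequence[Sequence[float]],
             priors: Optional[Sequence[float]] = None) -> List[float]:
    """Eq. 8: sequential Bayesian update of the model scores.

    likelihoods[m][i] is L(N*_{i,m} | H_i) for the m-th combination of MRP
    and spectral period and the i-th hazard model.
    """
    n_models = len(likelihoods[0])
    # Uniform ex-ante degree of belief, p(H_i) = 1/n_H, unless priors are given
    q = list(priors) if priors is not None else [1.0 / n_models] * n_models
    for row in likelihoods:
        norm = sum(q_j * l_j for q_j, l_j in zip(q, row))
        if norm == 0.0:
            raise ValueError("all models have zero likelihood; Eq. 7 cannot be applied")
        q = [q_i * l_i / norm for q_i, l_i in zip(q, row)]
    return q


# Two hypothetical models scored against two combinations
print(q_scores([[0.5, 0.5], [0.1, 0.3]]))
```

Because each step renormalizes over all models, the final scores sum to 1 and express only the relative performance of the competing models.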

3 Numerical simulations to estimate c i values

In order to score the PSH models, the relevant c i values must be determined in advance. These values depend on the relative positions of the S sites, of the seismogenic sources, and on the GMPEs considered in the i-th computational model. In the case of the models considered here, the position of the accelerometric sites, as well as that of the seismogenic sources, is kept fixed. Thus, differences among the c i values are expected to depend on GMPEs.

For each GMPE considered, the value of c is computed following a procedure similar to the one proposed by Albarello and Peruzza (2016). In that study, the authors generate a specified number of virtual seismic catalogues by randomly picking a number of epicenters from a given earthquake catalogue. A random magnitude is then assigned to each event by assuming a specified recurrence relationship for the whole of Italy. Clearly, this procedure does not account for the role of seismogenic areas (where the generation of earthquakes is assumed to be equiprobable) in the PSH model.

To overcome this limitation, a different approach has been considered here. This approach is essentially a Monte Carlo simulation-based PSH assessment similar to the one proposed by Musson (2009). For each source area of the seismogenic zonation used in the PSHA of Italy (MPS Working Group 2004), we generate K (=1000) earthquake catalogues of Δt k years (=1000 years). In each source area, the generation of earthquakes above a threshold magnitude m min is assumed to be a random Poissonian process with a mean annual rate υ(m min). A truncated Gutenberg–Richter distribution with parameter b is assumed to characterize magnitudes in the range [m min , M max]. The values of υ(m min), b, m min, and M max for each source area are provided in Barani et al. (2009). To generate the seismic catalogues for each single seismogenic area, we first compute incremental annual activity rates, n(m), for each magnitude bin in the interval m min–M max [bins with a width of 0.23 magnitude units are assumed according to the MPS Working Group (2004)]. For each bin, we randomly generate a number of events proportional to the activity rate associated with that bin (i.e., corresponding to the nearest integer of the product \(n(m) \cdot \Delta t_{k}\)). A random magnitude value m, compatible with that magnitude bin, is then attributed to each virtual event. The epicenter location of each event is determined by discretizing each seismogenic area into cells with a spacing of 0.01° in both latitude and longitude: each earthquake is then attributed to the barycenter of a cell selected randomly (with uniform distribution) from the pool of cells falling in that area. Similarly, a random depth is assigned to each epicenter assuming that the seismogenic thickness of each source zone is described by a normal density with mean value and standard deviation equal to those assumed by the MPS Working Group (2004). The mechanism of faulting assigned to each event is the same as that of the pertinent source zone.
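The catalogue generation for a single source area can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the doubly truncated exponential form used for the Gutenberg–Richter rates, the uniform sampling of magnitudes within each bin, the clipping of depths at zero, and all function names and numerical values are hypothetical and are not taken from the input model of the study. The magnitude range is assumed to be an integer multiple of the bin width.

```python
import numpy as np


def incremental_rates(nu_min, b, m_min, m_max, bin_width):
    """Incremental annual rates n(m) per magnitude bin for a truncated
    Gutenberg-Richter model (assumed doubly truncated exponential form)."""
    n_bins = int(round((m_max - m_min) / bin_width))
    edges = m_min + bin_width * np.arange(n_bins + 1)
    decay = 10.0 ** (-b * (edges - m_min))
    tail = 10.0 ** (-b * (m_max - m_min))
    cumulative = nu_min * (decay - tail) / (1.0 - tail)  # rate of M >= edge
    return edges, cumulative[:-1] - cumulative[1:]


def synthetic_catalogue(rng, nu_min, b, m_min, m_max, bin_width, years,
                        cell_lon, cell_lat, depth_mean, depth_std):
    """One virtual catalogue for one source area."""
    edges, rates = incremental_rates(nu_min, b, m_min, m_max, bin_width)
    # Number of events per bin: nearest integer of n(m) * duration
    counts = np.rint(rates * years).astype(int)
    mag = np.concatenate([rng.uniform(lo, hi, n)
                          for lo, hi, n in zip(edges[:-1], edges[1:], counts)])
    # Epicenter: barycenter of a cell drawn uniformly from the zone's cells
    idx = rng.integers(0, len(cell_lon), size=mag.size)
    # Depth from a normal density; clipped at zero as a guard (assumption)
    depth = np.clip(rng.normal(depth_mean, depth_std, size=mag.size), 0.0, None)
    return {"mag": mag, "lon": cell_lon[idx], "lat": cell_lat[idx], "depth": depth}


# Hypothetical rectangular zone discretized with 0.01 degree spacing
lon, lat = np.meshgrid(np.arange(10.0, 10.5, 0.01), np.arange(44.0, 44.3, 0.01))
rng = np.random.default_rng(0)
cat = synthetic_catalogue(rng, nu_min=0.05, b=1.0, m_min=4.76, m_max=6.14,
                          bin_width=0.23, years=1000.0,
                          cell_lon=lon.ravel(), cell_lat=lat.ravel(),
                          depth_mean=10.0, depth_std=3.0)
print(len(cat["mag"]))
```

In a full application, the zone geometry would be an arbitrary polygon and the procedure would be repeated K times for every source area.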

Once all K virtual catalogues are generated, a number of events compatible with a period of time \(\overline{{\Delta t_{s} }}\) of interest (in this study, we assume \(\overline{{\Delta t_{s} }} = 39\) years, which nearly corresponds to the average lifetime of the 56 reference recording stations since their installation) is randomly selected from each of them (this implies that K random extractions are performed). More precisely, for each source zone, \(\upsilon (m_{\hbox{min} } ) \cdot \overline{{\Delta t_{s} }}\) events are selected from each one of the K virtual data sets. Alternatively, for each source area, one may select \(\upsilon (m_{\hbox{min} } ) \cdot \overline{{\Delta t_{s} }}\) events from a single data set including \(K \cdot \Delta t_{k}\) years of virtual data and repeat the random extraction K times. For each event extracted from the k-th catalogue, we determine the ground motion level at the s-th site [for the computation of c, a rock condition (i.e., V S,30 > 800 m/s) is assumed for all S sites] by randomly picking a ground motion value from the probability distribution \(P\left[ {Y > y^{*} |m,r} \right]\) associated with a specified GMPE (ground motion distributions are truncated at ε = 3 standard deviations), where Y indicates the median value of the ground motion predicted given m and r, and \(y^{*}\) is a certain ground motion threshold given as input. Note that r indicates the epicentral, hypocentral, rupture, or Joyner and Boore (1981) distance, according to the GMPE used (see Table 1).

From here on, the procedure for the computation of c is analogous to that described in the article of Albarello and Peruzza (2016). Specifically, the value of c for the i-th model is computed as:

$$c_{i} = \frac{{\sigma_{corr} }}{{\sigma_{uncorr} }} = \frac{{\sqrt {\sum\nolimits_{{n_{S} = 0}}^{S} {\left[ {n_{S} - n_{S} p_{i} (n_{S} )} \right]^{2} p_{i} (n_{S} )} } }}{{\sqrt {\sum\nolimits_{s = 1}^{S} {P_{s,i} (1 - P_{s,i} )} } }}$$
(9)

where σ corr and σ uncorr have the meaning of σ(N i ) (Eq. 4) in the case of mutual correlation among the hazard estimates at the S benchmark sites and in the case of no correlation (i.e., c i = 1), respectively, n S indicates the number of sites where the threshold \(y^{*}\) is jointly exceeded D k,i (n S ) times, and p i (n S ) indicates the probability associated with n S

$$p_{i} (n_{S} ) = \frac{1}{K}\sum\limits_{k = 1}^{K} {D_{k,i} (n_{S} )}$$
(10)

and P s,i indicates again the probability of exceeding \(y^{*}\) at the s-th site

$$P_{s,i} = \frac{1}{K}\sum\limits_{k = 1}^{K} {E_{s,k} }$$
(11)

where E s,k  = 1 when \(y^{*}\) is exceeded (at least once) during the k-th run, otherwise E s,k  = 0.
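Given the matrix of simulated exceedances E s,k, the coefficient c can be estimated as in the following Python sketch. As an assumption of this sketch, σ corr is taken as the standard deviation of the number of sites with exceedances per run, computed about the mean of the simulated counts, which is our reading of Eq. 9. The function name is hypothetical.

```python
import numpy as np


def correlation_coefficient_c(exceed) -> float:
    """Estimate c from a K x S array, where exceed[k, s] is 1 when y* is
    exceeded (at least once) at site s during run k, and 0 otherwise."""
    exceed = np.asarray(exceed, dtype=float)
    p_site = exceed.mean(axis=0)        # Eq. 11: P_s for each site
    counts = exceed.sum(axis=1)         # number of sites exceeded in each run
    sigma_corr = counts.std()           # spread of the simulated counts
    sigma_uncorr = np.sqrt(np.sum(p_site * (1.0 - p_site)))  # Eq. 4 with c = 1
    if sigma_uncorr == 0.0:
        raise ValueError("threshold is never or always exceeded at every site")
    return float(sigma_corr / sigma_uncorr)


# Fully correlated toy case: all four sites are exceeded together
print(correlation_coefficient_c([[1, 1, 1, 1], [0, 0, 0, 0]]))  # 2.0
```

In the fully correlated toy case, c equals the square root of the number of sites, whereas independent sites give c = 1.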

In this study, the procedure is repeated for different ground motion thresholds \(y^{*}\) (i.e., c is computed for each ground motion value \(y^{*}\) specified in input) of the following ground motion parameters: peak ground acceleration (PGA), 0.15, 1, and 2 s 5%-damped spectral acceleration [S a (T)]. Note that, for each value \(y^{*}\), we compute the average frequency (probability) of exceedance during \(\overline{{\Delta t_{s} }}\) years over the S sites. Assuming a Poisson process, the average annual rate of exceeding \(y^{*}\) can be therefore determined. Thus, one can compute the value of c corresponding to a particular mean return period (MRP).

As shown in Fig. 2, c varies not only as a function of the GMPE but also as a function of the ground motion level \(y^{*}\) and spectral period. In particular, c tends to increase with increasing spectral period. At short periods [i.e., PGA and S a (0.15 s)], c varies around unity for both very low and high ground motion levels, while it assumes the greatest values for moderate ground motions [i.e., 0.04 g < PGA < 0.15 g and 0.1 g < S a (0.15 s) < 0.3 g]. At longer periods [i.e., S a (1 s) and S a (2 s)], c is greater for moderate to low acceleration values [i.e., 0.03 g < S a (1 s) < 0.08 g and S a (2 s) < 0.03 g]. Among all GMPEs considered, ITA10 is the one providing the highest correlation levels, particularly at longer spectral periods. As shown in Fig. 3, this behavior may be attributed to the slower attenuation of the ground motions provided by this GMPE for larger magnitude events (i.e., approaching the upper limit of applicability, M W = 6.9) which, as is known, tend to control the long-period hazard (e.g., Barani et al. 2009).

Fig. 2 Variation of the parameter c as a function of PGA (a), S a (0.15 s) (b), S a (1 s) (c), and S a (2 s) (d). Note that distinction is made between BND14v30 and BND14gt but not in the case of CZZ14. For CZZ14, distinction is not necessary as, contrary to BND14, the regression coefficients related to the magnitude, distance, and style of faulting functions are independent of the site amplification term

Fig. 3 Scaling of ground motion with distance computed for the GMPEs listed in Table 1 considering an earthquake with M W = 6.5 and V S,30 = 800 m/s. a S a (1 s) and b S a (2 s)

The values of c i employed in the scoring test are provided as electronic supplementary material.

4 Scoring test

4.1 A note on seismic hazard computation

In order to assess the effectiveness of the hazard models based on the GMPEs in Table 1 in providing reliable (i.e., compatible with observations) hazard estimates, we carried out different computational runs by varying one GMPE at a time while keeping constant the remaining components of the PSH model. In each run, the ground motion hazard is computed at each site according to the site classifications presented in Table 2. The reference seismic hazard model considers the same set of input models and parameters adopted by Barani et al. (2009) for the disaggregation of the Italian ground motion hazard maps. In order to avoid conversions between ground motion values, hazard estimates are computed either in terms of ENVxy or GMxy depending on the GMPE (see Table 1 for acronym definitions). Only BOR14 was corrected to predict GMxy (instead of RotD50) ground motions by applying the empirical conversion models of Barani et al. (2015) in conjunction with those of Beyer and Bommer (2006). According to Bommer et al. (2005), the aleatory variability related to the conversion of the median ground motion was carried across into the aleatory variability of each GMPE. Concerning the source-to-site distance, each GMPE is applied according to the relevant distance metrics (see Table 1). Note that CZZ14 and BND14 model site amplification either as a continuous function of V S,30 or by means of ground categories. Both options are considered (the suffixes “v30” and “gt” are used after the GMPE acronym) in order to explore their effects on the hazard.

4.2 Results of the scoring test

In order to compare the outcomes of each PSH model with observations, a set of S accelerometric sites is considered, each operating for a Δt s time interval (Table 2). As stated previously, the set of benchmark sites includes stations that have been operating in Italy for at least 25 years. Specifically, the largest PGA and S a (T) values (either in terms of ENVxy or GMxy depending on the GMPE) have been considered for each station (Luzi et al. 2008; Pacor et al. 2011).

Figure 4 summarizes the results of the scoring test showing the overall score Q i associated with the hazard models corresponding to the GMPEs in Table 1 for separate ground shaking parameters [PGA, S a (0.15 s), S a (1 s), and S a (2 s)]. To this end, the values of L i obtained through Eq. 6 for different MRPs (30, 50, 72, 101, 140, 201, 475, 975, and 2475 years) are used with Eq. 8. The values of L i are provided as electronic supplementary material. Note that, for an MRP of 30 years and T = 0.15 s, L i is found to be always lower than a 5% significance level. This behavior may indicate that, for this spectral period, one component of the PSH model (e.g., recurrence model, GMPE) makes the hazard estimates for very short MRPs incompatible with the observations. Hence, this MRP is excluded from the computation of the scores presented in Fig. 4 for T = 0.15 s.

Fig. 4 Q-scores (see Eq. 8) associated with the PGA-, S a (0.15 s)-, S a (1 s)-, and S a (2 s)-hazard models corresponding to the GMPEs in Table 1. GMPEs developed for the same region (or by the same research group) are displayed using different tones of the same color (e.g., Italian GMPEs are in green)

Among the PSH models considered, those implementing ITA10 and BND14 (particularly in the “gt” form) are clearly the best performing. The model based on BND14 gets higher scores at 0.15 and 2 s. On the other hand, the model implementing ITA10 appears more effective in forecasting the PGA and the spectral acceleration corresponding to a 1 s period. This model also performs well for a 2 s response. For this spectral period and for T = 1 s, the model based on AB10 performs very similarly to that based on ITA10. The remaining models get lower scores and, consequently, appear less suitable for PSHAs that include site effects. It is worth noting that, although AKK14 and BND14 are both based on the same strong motion archive (Akkar et al. 2013), the former always provides lower scores. Although the original archive is the same, the final regression data sets are not. In addition, the functional forms also differ. Both these factors contribute to differences in the predicted median ground motions and, more generally, in the ground motion probability distributions. In particular, major differences concern the attenuation with distance (particularly at longer response periods and increasing magnitudes) and site conditions. Other minor differences are related to magnitude scaling for M W greater than 6.75 (for details, see Douglas et al. 2014).

Combining the likelihood values computed for all the MRPs and ground motion parameters considered (again by applying Eq. 8) leads to the diagram in Fig. 5, which summarizes the overall performance of all models tested. BND14 is definitely the dominating (i.e., best performing) model. Note that the hazard model based on ITA10 is significantly penalized by the low score at 0.15 s (see Fig. 4), which, in turn, is due to the very low likelihood value determined for an MRP of 2475 years (see the L i values in the electronic supplementary material). Repeating the scoring test without taking the hazard results for that MRP into account, the model incorporating ITA10 gets indeed the largest overall score. Comparing the scores of BND14v30 with those associated with BND14gt and those of CZZ14v30 with those corresponding to CZZ14gt indicates that modeling site effects by means of a V S,30-dependent soil coefficient instead of using ground types does not lead to improved performances. Using ground types provides slightly higher scores and, therefore, appears more appropriate for large-scale hazard evaluations.

Fig. 5 Overall scores associated with the hazard models considered in Fig. 4

5 Drawbacks of using GMPEs for generic site conditions in PSHA

Previous results have shown that, although GMPEs provide only a generic evaluation of site amplification effects, reliable large-scale hazard maps can be computed when suitable models are selected. However, some spatial variations in the results presented above may exist. Moreover, given the approximate nature of the site amplification term included in the GMPEs, some overestimation or underestimation of the hazard may occur in particular areas due to variations in regional geology. Due to the limited sample of benchmark sites, we cannot verify this statistically. Thus, it may be useful to compare the hazard estimates determined through the application of GMPEs for generic site conditions with those resulting from a more rigorous site-specific hazard approach. The comparison is presented here for three sites representative of different soil conditions and hazard levels in Italy (see Fig. 1). Specifically, we have considered the same three sites examined in the article of Barani and Spallarossa (2016): Mirandola (MRN), Soncino (SNC) (this site is not included in the list of stations in Table 2), and Peglio (PGL). MRN and SNC are both located in the Po Plain (northern Italy). However, they present different soil conditions. MRN presents a thick layer of approximately 100 m of soil deposits (mainly sand and clay) above the bedrock (i.e., rock/soil with average shear wave velocity greater than or equal to 800 m/s) while SNC only 17 m (mainly gravel). The third site, PGL, is located in central Italy, in an area characterized by a higher hazard level. Here, the bedrock is overlaid by approximately 40 m of soil and soft rock (clays, clayey marls and gypsum) with shear wave velocity lower than 800 m/s. Although the comparison is provided for three sites only, it may serve to give the reader a clearer idea of the strengths and weaknesses of the method for soil hazard assessment examined in this work.

Figure 6 compares the soil uniform hazard spectra (UHSs) for an MRP of 475 years resulting from the application of the GMPEs in Table 1 with the average UHS determined through the application of the multi-SAF (SAF is for Soil Amplification Function) approach for site-specific PSHA proposed by Barani and Spallarossa (2016). This approach is based on the convolution method described in the article of Bazzurro and Cornell (2004b). Specifically, for each site, 200 numerical ground response analyses, corresponding to 200 randomizations of the reference soil model, are performed (see Barani and Spallarossa 2016). The results from each analysis are then used to define a regression model for soil amplification (termed SAF) for each spectral period of interest. The average multi-SAF UHS is then computed by assuming that each one of the seven GMPEs corresponds to a branch of a logic tree. In this exercise, the GMPEs are assigned equal weights. For a given spectral period, each rock hazard curve from the logic tree is convolved with each one of the 200 SAFs. In total, 1400 site-specific hazard curves are computed for each spectral period of interest for each site. From this set of hazard curves, we calculate the mean hazard (i.e., the mean annual rate of exceedance of specified ground motion intensities) and the hazard corresponding to different percentiles. Note that the hazard curves are computed by removing the ergodic assumption in the ground motion variability (e.g., Anderson and Brune 1999; Al Atik et al. 2010; Rodriguez-Marek et al. 2011, 2014), so as to avoid double counting of the uncertainty related to site amplification, which is already considered in the numerical ground response analysis via Monte Carlo simulations. This was achieved here by roughly reducing the standard deviations of the logarithmic ground motion by an amount of 15% (Luzi et al. 2014; Rodriguez-Marek et al. 2011).
The average multi-SAF UHSs are used as terms of comparison to analyze the effect of each GMPE on the hazard. Specifically, we calculate the difference (Δ) between the spectral acceleration values displayed by each UHS in Fig. 6 and those computed for the same MRP using the multi-SAF approach. The values of Δ derived for the PGA, S a (0.15 s), S a (1 s), and S a (2 s) hazard are shown in Figs. 7, 8, and 9 for MRN, SNC, and PGL, respectively. At MRN, where thick soil deposits are present and moderate soil nonlinearity has been observed in the numerical simulations performed by Barani and Spallarossa (2016), all GMPEs, with the exception of ITA08 at shorter periods, lead to hazard estimates that are greater than those resulting from the multi-SAF approach, which takes into account the nonlinear behavior of soils. Among the GMPEs considered, only AKK14 and BOR14 include a nonlinear site amplification term. BOR14, along with ITA08, ITA10, and AB10, provides a better agreement with the multi-SAF hazard estimates, particularly at short periods (Fig. 7a, b). At this site, using V S,30 to model site effects instead of ground types does not seem beneficial: the values of Δ associated with BND14gt and CZZ14gt are always lower than those obtained by applying BND14v30 and CZZ14v30.

For the site of Soncino (Fig. 8), the agreement between the hazard estimates from each single GMPE and those obtained with the multi-SAF method is generally better than at MRN. Except for a period of 0.15 s (Fig. 8b), which is close to the site resonance (Barani and Spallarossa 2016), the differences Δ in the hazard estimates are always negligible. At this period, all GMPEs except ITA08, CZZ14v30, and BOR14 lead to hazard results compatible with those obtained with the multi-SAF method.

Concerning PGL, which is the site with the highest hazard among those considered, Fig. 9 clearly shows that, at shorter periods (Fig. 9a, b), all GMPEs underestimate the multi-SAF hazard by significant amounts. This is attributable to the inability of GMPEs to properly capture the actual site amplification, which at this site is concentrated between 3 Hz and 10 Hz (Barani and Spallarossa 2016). The opposite occurs when longer periods are considered (Fig. 9c, d). For the 1 s and 2 s response, all GMPEs lead to hazard values that are greater than those provided by the multi-SAF method. At these periods, the lowest values of Δ are provided by CZZ14v30. This finding, along with the observation that BND14v30 leads to smaller Δ values than BND14gt, indicates that, at least for this site, incorporating V S,30 in the site amplification term helps to improve hazard estimates at medium-to-long spectral periods.
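The difference Δ plotted in Figs. 7, 8, 9 can be illustrated with a short sketch. It assumes that Δ is the plain arithmetic difference between the GMPE-based spectral acceleration and the multi-SAF value at the same period and MRP, so that positive values indicate overestimation. The text does not state whether Δ is absolute or relative, so this convention is an assumption. The function names and the spectral values (in g) are hypothetical and are not taken from the figures.

```python
def hazard_difference(gmpe_uhs, reference_uhs):
    """Delta = GMPE-based value minus reference (multi-SAF) value, per period."""
    return {period: gmpe_uhs[period] - reference_uhs[period] for period in reference_uhs}


def classify(delta):
    """Label a difference as over- or underestimation of the reference hazard."""
    if delta > 0:
        return "overestimation"
    if delta < 0:
        return "underestimation"
    return "no difference"


# Hypothetical 475-year spectral accelerations (g) at the four plotted periods
multi_saf = {"PGA": 0.30, "0.15 s": 0.70, "1 s": 0.25, "2 s": 0.10}
one_gmpe = {"PGA": 0.36, "0.15 s": 0.62, "1 s": 0.31, "2 s": 0.13}

deltas = hazard_difference(one_gmpe, multi_saf)
for period, delta in deltas.items():
    print(f"{period}: delta = {delta:+.2f} g ({classify(delta)})")
```

A relative difference, obtained by dividing each Δ by the multi-SAF value, would make sites with different hazard levels easier to compare; this is a design choice rather than something prescribed by the text.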

Fig. 6 Soil UHSs for an MRP of 475 years resulting from the application of the GMPEs in Table 1 for the sites of Mirandola (MRN), Soncino (SNC), and Peglio (PGL). The black curves and the shaded areas indicate the average site-specific UHSs and the uncertainty bands between the 2nd and 98th percentile UHSs determined by applying the multi-SAF approach proposed by Barani and Spallarossa (2016)

Fig. 7 Difference (Δ) between the PGA (a), 0.15 s (b), 1 s (c), and 2 s (d) spectral acceleration hazard values for an MRP of 475 years computed by using the GMPEs in Table 1 and the hazard values obtained from the multi-SAF approach for the site of Mirandola (MRN). Underestimations with respect to the multi-SAF estimates are indicated by black bars. Bars in gray indicate overestimation

Fig. 8 Same as Fig. 7 but for the site of Soncino (SNC)

Fig. 9 Same as Fig. 7 but for the site of Peglio (PGL)

6 Discussion and conclusions

Based on our results, reliable large-scale hazard mapping of Italy inclusive of site effects can be achieved through a conventional PSHA based on a restricted number of GMPEs. In particular, if the same input assumptions considered in this work are adopted, two GMPEs, BND14gt and ITA10, appear appropriate. Their relative scores might serve as ‘weights’ expressing the degree of belief in the selected models which, as a first step, were assumed equiprobable. Note that in the application presented here, we have not taken into account the epistemic uncertainties related, for instance, to earthquake sources, activity rates, b-values, and maximum magnitude. We have simply varied one GMPE at a time while keeping constant a particular set of input assumptions, namely the set providing hazard results closest to the median values obtained using the entire logic tree adopted for the PSHA of Italy (Barani et al. 2009). In logic tree applications, a multi-factorial approach (e.g., Rabinowitz and Steinberg 1991; Barani et al. 2007) is advisable, so that all epistemic uncertainties are propagated into the annual rates of exceedance associated with each branch to be examined. Indeed, the results of a logic-tree-based PSHA are influenced by all assumptions considered in the logic tree. As a consequence, the results of the scoring test depend on the entire hazard model adopted. In other words, the score associated with a particular assumption of the PSH model is influenced by the remaining logic-tree branches. The advantage of a multi-factorial approach is that it allows one to evaluate the effect on the hazard of the uncertainty in a single assumption while simultaneously varying all other input parameters and models. In this way, all epistemic uncertainties are carried across into the annual rate of exceedance of a certain level of ground motion (or alternatively, into the ground motion that is exceeded with a certain probability) associated with a particular branch of the logic tree. In this perspective, the empirical scoring test presented here can be profitably applied to evaluate the likelihood of the various components of a PSH model. This can be helpful to prune unlikely branches from “leafy” logic trees.
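As a purely illustrative sketch of how relative scores might be turned into logic-tree weights and used to prune unlikely branches, consider the following. It assumes that the scores are non-negative and that larger values indicate better agreement with observations; if the scores were log-likelihoods, they would first need to be exponentiated. The branch names, score values, pruning threshold, and function name are all hypothetical and do not reproduce results from this study.

```python
def scores_to_weights(scores, min_weight=0.0):
    """Normalize non-negative scores to weights, dropping branches below min_weight.

    The surviving weights are renormalized so that they sum to one.
    """
    total = sum(scores.values())
    if total <= 0.0:
        raise ValueError("scores must be non-negative with a positive sum")
    weights = {name: value / total for name, value in scores.items()}
    kept = {name: w for name, w in weights.items() if w >= min_weight}
    if not kept:
        raise ValueError("pruning threshold removes every branch")
    kept_total = sum(kept.values())
    return {name: w / kept_total for name, w in kept.items()}


# Hypothetical scores for four logic-tree branches
branch_scores = {"branch_A": 0.40, "branch_B": 0.35, "branch_C": 0.20, "branch_D": 0.05}

# Branches whose normalized weight falls below 10% are pruned
branch_weights = scores_to_weights(branch_scores, min_weight=0.10)
for name, weight in branch_weights.items():
    print(f"{name}: {weight:.3f}")
```

Because the score of one branch depends on the remaining branches, as noted above, weights obtained in this way are conditional on the rest of the hazard model and would change if other input assumptions were varied.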

Although the use of GMPEs for generic site conditions is found to be appropriate for large-scale soil hazard applications, some caution is required in handling the assessed hazard. At specific sites or in particular areas, such an approach could lead to hazard estimates that are either over-conservative (e.g., in the case of nonlinear soil response) or under-conservative with respect to those determined by more rigorous site-specific PSH approaches, particularly in the frequency range around the site fundamental frequency. This should be carefully considered by the final users of large-scale soil hazard maps, especially when handling such estimates in regions characterized by particular geological conditions and/or for the design of important facilities, for which site-specific approaches are in any case strongly recommended (e.g., Abrahamson et al. 2004; Rodriguez-Marek et al. 2014).