Introduction

The literature reports several studies on early warning systems for the prediction of rainfall-induced landslides. They can be employed at “local” or “regional” scale (ICG 2012; Thiebes et al. 2012; Intrieri et al. 2013; Calvello and Piciullo 2016). Local warning systems address individual landslides (e.g. Lollino et al. 2002; Blikra 2008; Iovine et al. 2010; Intrieri et al. 2012; Michoud et al. 2013; Thiebes et al. 2013; Manconi and Giordan 2015), while regional warning systems deal with populations of landslides in a region (e.g. Alfieri et al. 2012; Capparelli and Tiranti 2010; Martelloni et al. 2012; Rossi et al. 2012; Segoni et al. 2014, 2015; Calvello et al. 2015a, b; Rosi et al. 2012; Stähli et al. 2015).

Regional landslide early warning systems are used to assess the probability of occurrence of rainfall-induced landslides over large areas, typically through the prediction and monitoring of meteorological variables, in order to warn authorities, civil protection personnel and the population. They can be schematized by distinguishing between warning models and warning management strategies (Calvello and Piciullo 2016). A regional landslide early warning model (ReLWaM) includes a regional correlation law (ReCoL) and a decision algorithm. A ReCoL is defined as a functional relationship between rainfall and landslides that can lead to the definition of rainfall thresholds for possible landslide occurrence (Guzzetti et al. 2007). A decision algorithm contains a set of assumptions for defining the number of warning levels and a set of procedures linking rainfall thresholds to warning levels. ReCoL and warning models refer to the technical sphere of a regional landslide early warning system (Calvello et al. 2015b), whereas warning management considers aspects oriented to the social sphere, i.e. warning dissemination, communication strategy and emergency plan. Once the procedures to define and operate the ReLWaM are established, a periodic analysis of the performance and an update of the ReCoL (i.e. of the rainfall thresholds, Rosi et al. 2015) are needed to improve the performance of the system and its reliability.

The evaluation of the performance of a ReLWaM is based on 2 × 2 contingency tables computed for the joint frequency distribution of observed and predicted landslides (e.g. Giannecchini et al. 2012; Martelloni et al. 2012; Peres and Cancelliere 2014; Staley et al. 2013; Lagomarsino et al. 2015; Greco et al. 2013; Segoni et al. 2014; Gariano et al. 2015; Rosi et al. 2015; Stähli et al. 2015). Segoni et al. (2015), Lagomarsino et al. (2015) and Gariano et al. (2015) have proposed similar approaches to evaluate the reliability of rainfall thresholds for the prediction of rainfall-induced landslides, using back analyses, contingency tables and skill scores. However, in these cases, the model performance is assessed neglecting some important aspects which are peculiar to ReLWaM, among which (Calvello and Piciullo 2016) (i) the possible occurrence of multiple landslides in a warning area, (ii) the duration of the warning, (iii) the level of the warning in relation to the landslide spatial density in the warning area and (iv) the relative importance that the system managers attribute to different types of errors (e.g. false positives and false negatives). Recently, Sättele et al. (2015, 2016) have proposed a framework for the evaluation of the effectiveness of an early warning system for all kinds of natural hazards. The framework starts from the assessment of the technical and the inherent reliability of the system, evaluated differently for automated and non-automated systems, and leads to an effectiveness analysis.

The main topic covered by this paper is how to integrate rainfall thresholds into a reliable ReLWaM. To this aim, several questions need to be answered, such as (i) Which rainfall thresholds should be used in the landslide early warning system? (ii) How should the thresholds be selected? (iii) What is the optimal number of warning levels? (iv) To which warning level should each rainfall threshold correspond?

In an attempt to answer these questions, we propose a method based on a process chain that provides an objective procedure for the definition and the evaluation of a reliable threshold-based operational early warning system. First, we adopt a consolidated approach (Brunetti et al. 2010; Peruccacci et al. 2012; Gariano et al. 2015; Melillo et al. 2015, 2016) to define and validate empirical cumulated event rainfall–rainfall duration (ED) thresholds for possible landslide occurrences. Afterwards, we propose a method for issuing warning levels based on the comparison between measured rainfall and established thresholds. Finally, we assess the performance of the ReLWaM employing the “event, duration matrix, performance” (EDuMaP) method proposed by Calvello and Piciullo (2016). We test the process chain in an area of 1619 km2 in the Campania region, southern Italy.

A process chain method

Warning model: from rainfall thresholds to warning levels

The technical procedures of a reliable ReLWaM necessary to define and issue a certain warning level (WL) can be summarized in five steps (Fig. 1). Step 1 consists of defining and validating a set of rainfall thresholds with different exceedance probabilities (“Definition and validation of empirical rainfall thresholds” section). Step 2 refers to the selection of rainfall thresholds for the activation of increasing WLs in the ReLWaM. The higher the WL, the larger the probability of landslide occurrence. In step 3, cumulated rainfall over different time intervals is calculated from rainfall measurements and compared with the rainfall thresholds associated with pre-identified WLs, so that the appropriate WL can be issued in step 4. Finally, step 5 is an evaluation of the ReLWaM performance (“Performance evaluation of the warning model” section), which is necessary to increase the reliability of the model through a periodic update of the WLs.

Fig. 1

Steps of the procedure proposed to define and to issue warning levels within a regional early warning model for rainfall-induced landslides

Figure 2 shows a hypothetical application of the procedure for issuing a WL. Hyetographs in the figure show the measured hourly rainfall, while inset graphs display three ED thresholds (black lines) that identify four increasing WLs. The areas in green, yellow, orange and red represent the combinations of cumulated rainfall, E, and duration, D, which belong to each WL. Starting from the evaluation time t = 0, the cumulated rainfall E is calculated for fixed antecedent intervals: 6 (Fig. 2a), 12 (Fig. 2b), 24, 36 and 48 h (Fig. 2c). The resulting (D,E) condition of each antecedent interval (blue dot) falls within a certain WL area in the inset graph. The maximum WL reached is issued for the next 6 h (e.g. orange in Fig. 2d). The procedure is repeated after 6 h, with a new reference time t = 0 (Fig. 2e). Again, the highest WL resulting from the antecedent rainfall conditions is issued for the following 6 h (e.g. red in Fig. 2f). The procedure applied herein is based, as shown in Fig. 2d–f, on 6-h-long steps, thus allowing an evaluation of the landslide WL four times per day.
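The procedure illustrated in Fig. 2 can be sketched for a single rain gauge as follows. This is only a minimal illustration, assuming hourly rainfall stored in a list and power-law ED thresholds; the function names (`warning_level`, `issue_wl`) and the threshold parameters are hypothetical placeholders and are not the thresholds defined in this study.

```python
from typing import Sequence

# Antecedent intervals (h) considered at each evaluation time, as in Fig. 2.
WINDOWS_H = (6, 12, 24, 36, 48)

# Hypothetical (alpha, gamma) pairs for three ED thresholds, lowest first.
# Placeholder values for illustration only, not thresholds from the study.
THRESHOLDS = ((8.0, 0.3), (12.0, 0.3), (18.0, 0.3))


def warning_level(e_mm: float, d_h: float) -> int:
    """Return the WL (0-3) as the number of thresholds exceeded by (D, E)."""
    return sum(e_mm > alpha * d_h ** gamma for alpha, gamma in THRESHOLDS)


def issue_wl(hourly_mm: Sequence[float], t: int) -> int:
    """Return the highest WL over all antecedent windows ending at hour index t."""
    levels = []
    for d in WINDOWS_H:
        start = max(0, t - d)
        e = sum(hourly_mm[start:t])  # cumulated rainfall E over the last d hours
        levels.append(warning_level(e, d))
    return max(levels)
```

Calling `issue_wl` every 6 h on the updated rainfall record reproduces the four daily evaluations described above. Counting exceeded thresholds works here only because the assumed thresholds are ordered and do not cross.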

Fig. 2

Example of the algorithm adopted to determine warning levels (WLs) using rainfall thresholds. a Cumulated rainfall (blue bars) for the antecedent period of 6 h before the evaluation time t = 0 and related WL (blue dot in the inset graph). b Cumulated rainfall (blue bars) for the antecedent periods of 6 and 12 h before the evaluation time t = 0 and related WL (blue dots in the inset graph). c Cumulated rainfall (blue bars) for the antecedent periods of 6, 12, 24, 36 and 48 h before the evaluation time t = 0 and related WL (blue dots in the inset graph). d Maximum warning level in c for the period from t i to t i + 6 h. e Cumulated rainfall (blue bars) for the antecedent periods of 6, 12, 24, 36 and 48 h before the new evaluation time t = 0, 6 h later, and related WL (blue dots in the inset graph). f Maximum warning level in e for the period from t i + 6 h to t i + 12 h

Definition and validation of empirical rainfall thresholds

Empirical rainfall thresholds for possible landslide occurrence are defined through statistical analyses of past rainfall events that have resulted in landslides in a given study area. To obtain reliable thresholds, two inputs are needed: a large number of landslides for which the location and the time (or period) of the failure are known, and sufficiently accurate information on the rainfall responsible for each landslide. The geographical location and the occurrence time of the failure are usually affected by uncertainty. Thus, a class of geographic and of temporal accuracy is assigned to each landslide. Following a consolidated approach, three classes of geographic accuracy, G, are used, where the accuracy depends on the type and quality of the available information (Gariano et al. 2012). The first class (G 1) is attributed to landslides mapped with a geographic accuracy of 1 km2 or less. The second (G 10) and the third (G 100) classes are attributed to landslides that are located with an accuracy of less than 10 km2 and less than 100 km2, respectively. Moreover, three classes of temporal accuracy, T, are defined. The first class (T 1) includes landslides for which the exact time of occurrence is known. The second and the third classes include landslides for which the part of the day (T 2) or the day of occurrence (T 3) was inferred, respectively.

To reconstruct the rainfall conditions responsible for landslides, the procedure proposed by Melillo et al. (2015, 2016) is applied to hourly rainfall measurements. First, starting from a rainfall record and considering a minimum dry period (i.e. a period without rainfall or with a negligible amount of rainfall) between two consecutive rainfall periods, all the rainfall events are singled out. A minimum dry period of 96 h is used to separate rainfall events in the wet season (from November to March), and of 48 h to separate rainfall events in the dry season (from April to October). The rainfall conditions responsible for each landslide are automatically calculated using the rainfall record from a representative rain gauge located within a buffer of 12 km from the landslide location. Criteria to select the rain gauge include proximity, the elevation difference between the rain gauge and the landslide and the local morphological setting. The procedure calculates single or multiple rainfall conditions responsible for each landslide listed in the catalogue (Melillo et al. 2016).
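The separation of rainfall events by a seasonal minimum dry period can be illustrated with the simplified sketch below. It is not the algorithm of Melillo et al. (2015, 2016), which includes further steps; it only shows the dry-period rule. The function names, the `negligible_mm` cutoff and the choice of evaluating the season at the hour in which rainfall resumes are assumptions made for illustration.

```python
from typing import List, Tuple

WET_MONTHS = {11, 12, 1, 2, 3}  # November to March, as stated in the text


def min_dry_period_h(month: int) -> int:
    """Minimum dry period (h): 96 h in the wet season, 48 h in the dry season."""
    return 96 if month in WET_MONTHS else 48


def split_events(hourly_mm: List[float], months: List[int],
                 negligible_mm: float = 0.0) -> List[Tuple[int, int, float]]:
    """Return (start_index, duration_h, cumulated_mm) for each rainfall event.

    negligible_mm is an assumed cutoff at or below which an hour counts as dry.
    """
    events = []
    start = last_wet = None
    for i, (rain, month) in enumerate(zip(hourly_mm, months)):
        if rain <= negligible_mm:
            continue
        # Close the current event if the dry gap before this hour is long enough.
        if start is not None and i - last_wet - 1 >= min_dry_period_h(month):
            events.append((start, last_wet - start + 1,
                           sum(hourly_mm[start:last_wet + 1])))
            start = None
        if start is None:
            start = i
        last_wet = i
    if start is not None:
        events.append((start, last_wet - start + 1,
                       sum(hourly_mm[start:last_wet + 1])))
    return events
```

With the same rainfall record, a 50-h dry gap splits the record into two events in July but leaves a single event in January, because the wet-season dry period is longer.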

To determine empirical rainfall thresholds, the frequentist method proposed by Brunetti et al. (2010) and modified by Peruccacci et al. (2012) is applied to all the reconstructed rainfall conditions responsible for the landslides. In this approach, the threshold curve is a power law equation linking the cumulated event rainfall E (in mm) to the rainfall duration D (in h),

$$ E=\left(\alpha \pm \varDelta \alpha \right)\cdot {D}^{\left(\gamma \pm \varDelta \gamma \right)} $$
(1)

where α is a scaling constant (the intercept), γ is the shape parameter (defining the slope of the power law curve), and Δα and Δγ are the uncertainties of α and γ calculated using a “bootstrap” non-parametric statistical technique. The uncertainties associated with the thresholds depend on the number and on the distribution of the empirical data points and decrease as the number of the empirical data increase in the dataset (Vennari et al. 2014).
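As a small worked illustration of Eq. 1, the sketch below evaluates a threshold curve and its uncertainty band for a given duration. The parameter values are those reported later in the text for the 5 % threshold of the whole Campania region (α = 10.1, γ = 0.25, Δα = 1.1, Δγ = 0.02); the function names are hypothetical, and combining the extreme values of both parameters to form the band is an assumption that holds for D ≥ 1 h.

```python
from typing import Tuple


def threshold_e(d_h: float, alpha: float, gamma: float) -> float:
    """Cumulated event rainfall E (mm) on the threshold curve for duration D (h)."""
    return alpha * d_h ** gamma


def threshold_band(d_h: float, alpha: float, gamma: float,
                   d_alpha: float, d_gamma: float) -> Tuple[float, float]:
    """Lower and upper E (mm) obtained from the parameter uncertainties.

    Assumes D >= 1 h, so that a larger exponent gives a larger E.
    """
    lower = (alpha - d_alpha) * d_h ** (gamma - d_gamma)
    upper = (alpha + d_alpha) * d_h ** (gamma + d_gamma)
    return lower, upper


# Parameters reported in the text for the 5 % threshold of the Campania region.
e_24 = threshold_e(24.0, 10.1, 0.25)
band_24 = threshold_band(24.0, 10.1, 0.25, 1.1, 0.02)
```

A measured (D,E) condition exceeds the threshold when its cumulated rainfall is larger than `threshold_e` at the same duration.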

To validate the thresholds, the method proposed by Gariano et al. (2015) that exploits a contingency table (Wilks 1995), a receiver operating characteristic (ROC) analysis (Fawcett 2006) and the related skill scores is adopted. In the contingency table, a “true positive” (TP) is an empirical (D,E) pair located above the threshold that has resulted in (at least) one landslide, and a “true negative” (TN) is an empirical (D,E) point below the threshold that has not resulted in known landslides. “False positives” (FP) occur when the (D,E) rainfall conditions exceeded the threshold and landslides did not occur (or were not reported). A “false negative” (FN) occurs when the (D,E) rainfall conditions were below the threshold and landslides occurred. The four contingencies are affected by biases caused by the lack of information on rainfall and/or landslide data (Gariano et al. 2015). Using the total number of TP, TN, FP and FN, four skill scores are calculated, namely (i) the probability of detection score, \( \mathrm{POD}=\frac{TP}{TP+FN} \), (ii) the probability of false detection score, \( \mathrm{POFD}=\frac{FP}{FP+TN} \), (iii) the probability of false alarm score, \( \mathrm{POFA}=\frac{FP}{TP+FP} \), and (iv) the Hanssen and Kuipers (1965) skill score, \( HK=\frac{TP}{TP+FN}-\frac{FP}{FP+TN}=\mathrm{POD}-\mathrm{POFD} \). The ROC analysis is usually performed by constructing a ROC plot that shows the probability of detection (POD) against the probability of false detection (POFD). In the ROC plane, the ROC curve is obtained by varying the exceedance probability of the rainfall threshold, and each point represents the prediction capability of a single threshold. For each rainfall threshold, the Euclidean distance δ between the point representing the threshold on the ROC curve and the upper left corner of the ROC plot, known as “perfect classification” (neither FN nor FP), is calculated. The shorter the distance δ, the more suitable the threshold and, consequently, the better the model prediction skill.
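The skill scores and the distance δ defined above can be computed from the four contingencies as in the following sketch. The function and class names are hypothetical, and the code assumes that none of the denominators is zero.

```python
import math
from typing import NamedTuple


class Scores(NamedTuple):
    pod: float    # probability of detection
    pofd: float   # probability of false detection
    pofa: float   # probability of false alarm
    hk: float     # Hanssen and Kuipers skill score
    delta: float  # distance from the perfect classification point (0, 1)


def skill_scores(tp: int, fp: int, fn: int, tn: int) -> Scores:
    """Compute POD, POFD, POFA, HK and delta from a 2 x 2 contingency table."""
    pod = tp / (tp + fn)
    pofd = fp / (fp + tn)
    pofa = fp / (tp + fp)
    hk = pod - pofd
    # In the ROC plane the x axis is POFD and the y axis is POD.
    delta = math.hypot(pofd, 1.0 - pod)
    return Scores(pod, pofd, pofa, hk, delta)
```

A perfect threshold (no FP and no FN) gives POD = 1, POFD = 0, HK = 1 and δ = 0.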

To establish the optimal, “best-performing” threshold that maximizes and/or minimizes the skill scores, the index Λ is defined as a linear combination of HK, POFA and δ:

$$ \Lambda ={\lambda}_1\cdot HK-{\lambda}_2\cdot \mathrm{POFA}-{\lambda}_3\cdot \delta $$
(2)

where λ 1, λ 2 and λ 3 are positive scalar coefficients representing the weights of the individual skill scores and λ 1 + λ 2 + λ 3 = 1. The combination of the skill scores that maximizes Λ represents the optimal compromise between the minimization of incorrect landslide predictions and the maximization of the correct predictions.
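Equation 2 and the selection of the best-performing threshold can be sketched as follows. The equal weights used as a default are an assumption for illustration only (the text requires only positive weights summing to one), and the function names and the layout of the input dictionary are hypothetical.

```python
from typing import Dict, Tuple


def lambda_index(hk: float, pofa: float, delta: float,
                 weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Linear combination of HK, POFA and delta as in Eq. 2."""
    if min(weights) <= 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be positive and sum to 1")
    l1, l2, l3 = weights
    return l1 * hk - l2 * pofa - l3 * delta


def best_threshold(scores: Dict[str, Tuple[float, float, float]]) -> str:
    """Return the label of the threshold whose (HK, POFA, delta) maximizes Lambda."""
    return max(scores, key=lambda label: lambda_index(*scores[label]))
```

Because POFA and δ enter with a negative sign, a threshold is rewarded for a high HK and penalized for false alarms and for its distance from the perfect classification point.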

Performance evaluation of the warning model

To evaluate the performance of the ReLWaM adopted within a regional landslide early warning system, the “event, duration matrix, performance” (EDuMaP) method proposed by Calvello and Piciullo (2016) is applied. EDuMaP comprises the following three successive steps:

  • The analysis of the landslide (LE) and warning (WE) events

  • The definition and computation of a “duration matrix” that lists the time intervals associated with the occurrence of landslide events in relation to the emission of warning events

  • The evaluation of the performance of the early warning model using an established set of performance indicators

In the first step, the values of the following ten input parameters need to be specified:

  1. Number of WL used by the model

  2. Thresholds used to differentiate among the k classes of LE on the basis of their spatial characteristics, L den(k)

  3. Value of the time interval between the sending out of the first WL identified within a WE and the assumed beginning of the WE, t LEAD

  4. Type of landslides addressed by the warning model, L typ

  5. Time quantifying the maximum temporal gap among landslides included within a single LE, Δt LE

  6. Time interval between the last landslide identified within a LE and the assumed ending of the LE, t OVER

  7. Area of analysis for which both landslides and warnings data are available, A

  8. Subdivision of the area of analysis in m classes on the basis of the spatial criteria adopted to issue the warnings, ΔA

  9. Temporal length of the databases for which both landslides and warnings data are available, ΔT

  10. Minimum unit of time used to identify LE and WE, Δt (see also Appendix).

The assessment of the model performance requires the preliminary identification of the LE and WE from analyses carried out on the landslide and warning databases. A LE is retrieved from the landslide database according to data, classification, spatial and temporal characteristics of the landslide records. In particular, a LE is obtained by grouping the collected landslides as a function of the following (Calvello and Piciullo 2016): the landslide types (L typ), the minimum interval between successive landslide events (Δt LE), the temporal discretization of the analysis (Δt) and the over time (t OVER) (Appendix). A WE is defined as a set of WLs issued within a given warning zone (ΔA), grouped considering their temporal characteristics (Δt).

Concerning the second step, the number of rows and columns in the duration matrix is equal to the number of classes and levels defined for the LE and the WE, respectively. Figure 3 portrays a 4 × 4 duration matrix related to four levels of WE (no warning, WL0; moderate warning, WL1; high warning, WL2; very high warning, WL3) and four classes of LE (no landslides, no; few landslides, small event, S; several landslides, intermediate event, I; many landslides, large event, L). Each element d ij of the duration matrix is computed, within the time frame of the analysis ΔT, as follows:

$$ {d}_{ij}=\sum_{\varDelta T}{t}_{ij} $$
(3)

where i identifies the WE level, j identifies the LE class, and t ij is the time during which a WE of level i is concurrent with a LE of class j.

Fig. 3

Structure of the duration matrix with four levels of WE (key: no, no warning; M, moderate warning; H, high warning; VH, very high warning) and four classes of LE (key: no, no landslides; few landslides, small event, S; several landslides, intermediate event, I; many landslides, large event, L). Modified from Calvello and Piciullo (2016)
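The computation of the duration matrix in Eq. 3 can be sketched as follows, assuming that the WE level and the LE class are available as two aligned sequences with one value per time unit Δt. The function name and the integer encoding (0 to 3 for both WE levels and LE classes) are assumptions made for illustration; the first index follows the WE level i and the second the LE class j, as in Eq. 3.

```python
from typing import List, Sequence


def duration_matrix(we_levels: Sequence[int], le_classes: Sequence[int],
                    n_levels: int = 4, n_classes: int = 4,
                    dt_h: float = 1.0) -> List[List[float]]:
    """Accumulate the time (h) during which WE level i is concurrent with LE class j."""
    if len(we_levels) != len(le_classes):
        raise ValueError("the two series must cover the same time steps")
    d = [[0.0] * n_classes for _ in range(n_levels)]
    for i, j in zip(we_levels, le_classes):
        d[i][j] += dt_h  # one time unit spent in cell (i, j)
    return d
```

By construction, the sum of all elements equals the length of the analyzed period ΔT expressed in hours.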

In the final step, two performance criteria (Fig. 4) are applied to assign a meaning to the elements of the duration matrix and to carry out the performance analysis. The “alert classification” criterion (Fig. 4a) employs an alert classification scheme derived from a standard contingency table, and it identifies correct predictions (CP), false alerts (FAs), missed alerts (MAs) and true negatives (TNs). The issuing of the two highest levels of warning (WL2 and WL3) concurrently with the occurrence of the greatest classes of LE (I and L) is assumed as CP of the ReLWaM. It is the same for the issuing of the two lowest levels of warning (WL0 and WL1) and the simultaneous occurrence of the smallest classes of LE (no and S). FA and MA are incorrect predictions of the system, and TN represents the absence of both warning and landslide occurrences. The “grade of accuracy” criterion (Fig. 4b) assigns a colour code to the components of the duration matrix in relation to the agreement between a given WE and a given LE. For instance, if the maximum WL is issued (i.e. WL3) and only few landslides occur (i.e. the LE class is S), this should be considered a significant error of the warning model. Using this criterion, the elements are classified in four colour-coded classes, as follows: green (Gre) for the elements which are assumed to be representative of the best model response, yellow (Yel) for elements representative of minor model errors, red (Red) for elements representative of a significant model error and purple (Pur) for elements representative of a severe model error. Starting from the two performance criteria, several performance indicators can be derived (Calvello and Piciullo 2016). Table 1 lists the indicators considered in this work.

Fig. 4

Alert classification (a) and grade of accuracy (b) performance criteria used for the analysis of the duration matrix with four classes of WE (key: no, no warning; M, moderate warning; H, high warning; VH, very high warning) and four classes of LE (key: no, no landslides; few landslides, small event, S; several landslides, intermediate event, I; many landslides, large event, L). Modified from Calvello and Piciullo (2016)

Table 1 Performance indicators used for the analysis

Case study

Landslide early warning system in Campania

The Campania region extends for 13,671 km2 in southern Italy. The southern Apennines mountain range dominates the orography, exceeding 2000 m of elevation. A hilly landscape characterizes the eastern side of the region, whereas large plains separating isolated limestone and volcanic reliefs are present in the western part of the region. In the region, the mean annual rainfall ranges from 1000 to 2000 mm (Longobardi et al. 2016). Due to the rugged orography, severe storms are frequent in the region and result in abundant flash floods, debris flows and shallow landslides (Cascini et al. 2008; Vennari et al. 2016, and references therein) that cause casualties and serious damage to urban areas and infrastructures. In the period 1950–2014, 286 persons were killed or went missing, 406 were injured, and more than 23,000 people were evacuated due to landslides in the region (http://polaris.irpi.cnr.it).

In Campania, a regional landslide early warning system exists as a part of the regional warning system developed and managed by the regional civil protection agency to deal with “hydraulic and geo-hydrological risks” (DPGR 299/2005). The system includes two phases: weather forecast and environmental monitoring. The first phase consists of issuing warnings based on numerical rainfall forecasts. For this purpose, the Campania region is subdivided into eight alert zones (AZ, Fig. 5) according to homogeneity criteria, which consider the following factors: hydrography, morphology, rainfall, geology, land use, and hydraulic, hydrogeological and administrative boundaries. The monitoring phase includes (i) the evaluation of meteorological and hydrological events and (ii) the hydrological and weather forecast at steps of 6 h, through now-casting techniques and rainfall-runoff modelling using real-time parameters. The rainfall monitoring network encompasses 154 rain gauges and a meteorological radar.

Fig. 5

Map of the Camp-3 alert zone (Sorrentino-Amalfitana peninsula, Pizzo d’Alvano massif, Picentini mountains) showing shaded relief, classes of altitude (m a.s.l.), 58 rain gauges used in this study (blue triangles) and 140 rainfall-induced landslides (red circles). The main toponyms are also indicated. The insets show the location of Campania region in Italy and the subdivision of the region into eight alert zones for civil protection purposes

The test area

Our test area is the Camp-3 AZ (marked with number 3 in Fig. 5), extending for 1619 km2 and encompassing 109 municipalities and 58 rain gauges. It includes the Lattari mountains, the Pizzo d’Alvano massif and the Picentini mountains (Fig. 5). Due to the presence of pyroclastic soil deposits mantling the carbonate bedrock, the area is highly susceptible to rainfall-induced shallow landslides and debris flows (Calcaterra et al. 2003; Di Crescenzo and Santo 2005; Cascini et al. 2008; Terranova et al. 2015; Napolitano et al. 2015). Indeed, it suffered some of the most catastrophic rainfall-induced landslide events in Europe. The most damaging events occurred on 25 October 1954 and caused, in the area of the Sorrentino-Amalfitana peninsula, 482 casualties, including 318 deaths, and more than 12,000 evacuees (http://polaris.irpi.cnr.it). The most recent catastrophic event occurred on 4–5 May 1998, when more than 100 slope failures occurred over the slopes of the Pizzo d’Alvano massif and about 2 million m3 of material was mobilized, causing 159 deaths, more than 6400 evacuees and €500 million damage to buildings and infrastructure (Cascini 2004).

Catalogue of landslides

A catalogue of 305 rainfall-induced shallow landslides that occurred between January 2003 and December 2013 (11-year period) was compiled for the Campania region. Information on landslide occurrences was gathered from newspapers, internet and technical reports provided by local Fire Brigades and the Civil Protection agency. The authors are aware that additional landslides may have occurred in the area in the analyzed time frame without being reported; due to the lack of information, such landslides are not included in the catalogue.

Regarding the Camp-3 alert zone, 140 rainfall-induced landslides were collected in the period from 2003 to 2013. The landslides archived in the catalogue occurred almost exclusively from September to March (129 out of 140) and were most abundant in January (32 landslides). The years with the greatest number of recorded landslides (25) were 2009 and 2010. More than half of the landslides in the area (82 out of 140, 59 %) were localized with a high geographic accuracy (G 1), 53 failures (38 %) with a medium accuracy (G 10) and only 5 landslides (3 %) with a low accuracy (G 100). The exact time of occurrence (T 1) is known for 86 landslides (61 % of the total), whereas the part of the day was inferred (T 2) for 32 landslides (23 %). For the remaining 22 landslides (16 %) in the catalogue, only the day of occurrence is known (T 3). Information on the landslide type is not available for about half of the documented failures in the catalogue (67 out of 140). The remaining landslides were classified as rock falls (38), earth flows (12), debris flows (13) and mudflows (10) (sensu Hungr et al. 2014). The catalogue of 140 rainfall-induced landslides that occurred in the Camp-3 AZ was divided into two subsets: (i) a calibration set, listing 96 landslides that occurred between January 2003 and December 2010, used to define the rainfall thresholds, and (ii) a validation set, listing 44 landslides that occurred between January 2011 and December 2013, used to validate the thresholds.

Results and discussion

Rainfall thresholds

Adopting the procedure presented in “Definition and validation of empirical rainfall thresholds” section and using information on 96 landslides that occurred in the Camp-3 AZ between January 2003 and December 2010 (calibration set) and rainfall data recorded by 58 rain gauges, empirical rainfall thresholds for several exceedance probabilities (percentiles) were determined. Following Melillo et al. (2015, 2016), 201 multiple (D,E) rainfall combinations responsible for the 96 documented landslides were reconstructed. Figure 6 shows, in log-log coordinates, the 201 multiple combinations (blue points, calibration set) and the related rainfall thresholds at 1 % (T 1,AZ3) and 5 % (T 5,AZ3) exceedance probability levels. Multiple combinations associated with the landslides cover the range of duration 1 ≤ D ≤ 650 h, which is the range of validity for the thresholds, and the range of cumulated rainfall 5.6 ≤ E ≤ 249.5 mm. Threshold parameters α, γ, Δα and Δγ (Eq. 1) for different exceedance probabilities (from 1 to 90 %) are reported in Table 2. The table also lists the parameters for the threshold at 5 % exceedance probability level (T 5,Cam, α = 10.1, γ = 0.25, Δα = 1.1, Δγ = 0.02) calculated using the dataset for the whole Campania region: 627 multiple conditions responsible for 305 landslides in the period 2003–2013. This threshold is reported in order to make a comparison with thresholds defined for other regions in southern Italy for similar periods (Calabria, Vennari et al. 2014; Sicily, Gariano et al. 2015). In particular, T 5,Cam is very similar to the one defined for Sicily for the period 2002–2011, whose parameters are α = 10.4, γ = 0.27, Δα = 1.4 and Δγ = 0.03. On the other hand, T 5,Cam is less steep (i.e. is characterized by a lower value of the γ parameter) than the threshold defined for Calabria for the period 1996–2011, whose parameters are α = 8.6, γ = 0.41, Δα = 1.1 and Δγ = 0.03.

Fig. 6

Multiple ED rainfall conditions (multiple combinations) responsible for 96 landslides in the Camp-3 AZ and related rainfall thresholds at 1 % (T 1,AZ3) and 5 % (T 5,AZ3) exceedance probability levels. Shaded areas portray uncertainty associated with the threshold curves. Data are in log-log coordinates

Table 2 Parameters α and γ of ED thresholds at the 5 % exceedance probability level and related uncertainties Δα and Δγ determined using different subsets of data

The thresholds defined for the Camp-3 AZ were validated using 43 triggering rainfall conditions (red points in Fig. 7, validation set) responsible for 44 landslides that occurred in the analyzed alert zone between January 2011 and December 2013. Two landslides were associated with the same rainfall condition. For validation purposes, only one rainfall condition is associated with each landslide for the calculation of the values in the contingency table, as done by Gariano et al. (2015). The 43 rainfall conditions are in the range of duration 2 ≤ D ≤ 274 h and in the range of cumulated rainfall 17.4 ≤ E ≤ 142.6 mm. In addition, 3995 rainfall events were reconstructed in the same period (green points in Fig. 7). These rainfall events are in the ranges of 1 ≤ D ≤ 274 h and 1.2 ≤ E ≤ 190.2 mm.

Fig. 7

Rainfall duration vs. cumulated event rainfall conditions in Camp-3 AZ in the period 2011–2013, compared with thresholds (blue solid lines) at 1, 5, 10, 20 and 50 % exceedance probability levels (indicated by the numbers in the labels), determined using the calibration set. Red points are 43 ED rainfall conditions associated with the triggering of shallow landslides in the validation period. Green points are 3995 rainfall events for which information on triggered landslides is not available. Gray points are 159 rainfall events with duration exceeding the range of validity of the thresholds (D > 650 h). Data are in log-log coordinates

Table 3 summarizes the four contingencies (TP, FP, FN, TN) and the four skill scores (POD, POFD, POFA, HK) for ten thresholds, at different exceedance probabilities or percentiles (from 1 to 90 %). The largest values of the HK and Λ indices and the smallest value of δ were obtained by T 10,AZ3, which can be considered the optimal threshold, representing the best compromise between the minimum number of incorrect landslide predictions (FP, FN) and the maximum number of correct predictions (TP, TN).

Table 3 Contingencies (TP, FP, FN, TN) and skill scores (POD, POFD, POFA, HK, δ, Λ) calculated for thresholds at different exceedance probabilities

From rainfall thresholds to warning levels

The method proposed in “Warning model: from rainfall thresholds to warning levels” section was applied to the case study of Camp-3 AZ for the period 2003–2013. Nine combinations, P, of thresholds at different exceedance probabilities (i.e. threshold percentiles, Table 3) were considered for the issuing of the WLs, as reported in Fig. 8. The first warning level (WL0) can be defined by ED conditions not exceeding the lowest threshold in the combination. Then, ED conditions included between the first and the second thresholds activate the second warning level (WL1). Consequently, ED conditions, exceeding the second threshold and remaining below the third one, activate the WL2. Finally, ED conditions exceeding the third threshold determine the issuing of the highest warning level (WL3).

Fig. 8

Extents of the four warning levels (WL0, WL1, WL2, WL3) for the nine combinations of threshold percentiles (P)

Starting from 1 January 2003 at 00:00 and proceeding in steps of 6 h, the antecedent rainfall conditions at time intervals of 6, 12, 24, 36 and 48 h were evaluated for each rain gauge of the Camp-3 AZ. The values obtained were compared with the percentile combinations associated with the four WLs. The highest WL threshold exceeded in at least one rain gauge defined the WL to be issued for the following 6-h period to the entire Camp-3 AZ. The procedure was employed at 6-h steps for the whole period of the analysis, obtaining nine different sets of warnings, one for each combination of percentiles considered.

Table 4 lists the hours of activation per WL for each combination of percentiles (P) in the period 2003–2013. As expected, the higher the percentile employed for a single WL, the lower the number of hours of alert (defined as the hours of WL1, WL2 and WL3). Raising the percentile associated with WL i , with i ≠ 0, while keeping the others unchanged reduces the hours of WL i and increases those of WL i − 1.

Table 4 Number of hours (and related percentages) of activation for the four warning levels per each combination of threshold percentiles
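
For illustration, hours and percentages of activation of the kind reported in Table 4 could be derived from a sequence of 6-h warnings as in the sketch below. The function name and the sequence of levels are hypothetical.

```python
from collections import Counter


def activation_hours(levels, step_h=6):
    """Hours and percentage of activation for WL0..WL3.

    levels: warning level (0-3) issued at each step of the analysis.
    step_h: duration of validity of each warning, in hours.
    """
    counts = Counter(levels)
    total_h = len(levels) * step_h
    result = {}
    for wl in range(4):
        hours = counts.get(wl, 0) * step_h
        result[wl] = (hours, 100.0 * hours / total_h)
    return result


# Hypothetical sequence of eight 6-h warnings (48 h in total)
table = activation_hours([0, 0, 0, 1, 1, 2, 0, 3])
alert_hours = sum(table[wl][0] for wl in (1, 2, 3))  # hours of WL1, WL2 and WL3
```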

Landslide and warning event analysis

As described in “Performance evaluation of the warning model” section, a set of ten parameters must be defined to carry out the first step of the EDuMaP method, i.e. the event analysis. The parameters used to define and characterize the LEs were kept constant for all the considered percentile combinations. The recorded landslides were grouped into LEs considering all rainfall-induced landslides, a minimum interval between successive landslide events Δt LE = 24 h, a temporal discretization for the analysis Δt = 1 h and no over time (t OVER = 0). With these parameters, 89 landslide events were defined in the Camp-3 AZ (parameters A and ΔA) in the period 2003–2010 (parameter ΔT), derived from the 140 landslides collected in the catalogue. Table 5 lists the number of reconstructed LEs per number of landslides. Most of the LEs (62) comprise only one landslide (i.e. one preceded and followed by 24 h without landslides). The highest number of landslides composing an LE is seven. LEs were grouped into four classes, based on the number of landslides belonging to each event (L den). LEs with up to two landslides were classified as small events (S), LEs with between three and nine landslides as intermediate events (I) and LEs with more than nine landslides as large events (L). In the considered period, 75 LEs were classified as small and 14 as intermediate, and no LE was classified as large (Table 6). Regarding the WEs, nine different datasets were obtained from the nine combinations of percentiles (Table 4), each of which produced a different duration matrix and, consequently, different values of the performance indicators. For all the combinations, the lead time was set to zero (t LEAD = 0).

Table 5 Number of landslide events (LEs) as a function of the number of landslides
Table 6 Number of landslide events (LEs) pertaining to the four classes
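
The grouping of landslides into LEs and their classification by size can be sketched as follows. This is a minimal illustration, assuming that a new LE starts whenever at least Δt LE = 24 h have elapsed since the previous landslide; the function names and the dates are hypothetical, and only the three size classes described in the text (S, I, L) are handled.

```python
from datetime import datetime, timedelta


def group_events(times, min_gap=timedelta(hours=24)):
    """Group landslide times into landslide events (LEs).

    A landslide joins the current LE if it occurs less than min_gap
    after the previous landslide; otherwise it opens a new LE.
    """
    events = []
    for t in sorted(times):
        if events and t - events[-1][-1] < min_gap:
            events[-1].append(t)
        else:
            events.append([t])
    return events


def classify(event):
    """Size class of an LE based on its number of landslides."""
    n = len(event)
    if n <= 2:
        return "S"  # small: up to two landslides
    if n <= 9:
        return "I"  # intermediate: three to nine landslides
    return "L"      # large: more than nine landslides


# Hypothetical catalogue entries
times = [
    datetime(2005, 1, 1, 10), datetime(2005, 1, 1, 20),
    datetime(2005, 1, 2, 15), datetime(2005, 1, 5, 8),
]
events = group_events(times)
classes = [classify(e) for e in events]
```

In this example the first three landslides are chained into a single intermediate event because each one follows the previous by less than 24 h, while the fourth forms a separate small event.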

Performance evaluation with EDuMaP

In order to define the optimal percentile combination to be employed for the WLs in a reliable ReLWaM, i.e. the combination that provides the best ReLWaM performance, the EDuMaP method was applied (Piciullo et al. 2016) as the last step of the proposed process chain (Step 5 of Fig. 1).

Table 7 and Fig. 9 show the results obtained for the nine percentile combinations, considering the elements of the duration matrix in terms of the alert classification (Fig. 4a) and grade of accuracy (Fig. 4b) criteria. The two pairs of percentile combinations P 3,10,50-P 1,10,50 and P 3,35,50-P 1,35,50 differ in the percentile used as threshold for WL1. In terms of hours (Table 7), this affects CP and TN (alert classification criterion) and Yel and Gre (grade of accuracy criterion), owing to the way the elements of the duration matrix were defined for each criterion. Higher values of CP and Yel were obtained for lower WL1 percentiles (e.g. comparing P 3,10,50 to P 1,10,50 and P 3,35,50 to P 1,35,50, Table 7, Fig. 9). This behaviour is due to a relocation of t ij durations (see Eq. 3) from the first to the second row of the matrix. The combinations P 1,10,50-P 1,35,50, P 3,10,50-P 3,35,50 and P 1,50,90-P 1,65,90-P 1,80,90 have different thresholds for WL2 (Fig. 8). An increase of the WL2 threshold resulted, in terms of hours, in a reduction of FA and Red, an increase of CP and Yel and a slight variation of MA and Gre (Table 7, Fig. 9). The combinations P 1,50,65-P 1,50,90 and P 1,65,80-P 1,65,90 differ in the percentile considered for WL3 (Fig. 8). An increase of the WL3 threshold implied only a slight variation in the hours of CP, FA, Gre and Yel. By contrast, a substantial difference, of one order of magnitude, was obtained for Red and Pur errors, with a reduction of the number of hours of severe model errors.

Table 7 Values in hours of the duration matrix elements in terms of “alert classification” (criterion a, Fig. 4a) and the “grade of accuracy” (criterion b, Fig. 4b) criteria
Fig. 9 Percentage of CP, MA, FA and TN (criterion A) and of Pur, Red, Yel and Gre (criterion B) obtained for the nine considered percentile combinations

The performance indicators were evaluated neglecting the element d 11 of the duration matrix, which represents the number of hours without either landslides or warnings. Typically, the value of this element is orders of magnitude higher than the other elements of the matrix because it also includes all the hours without rainfall, which a ReLWaM is not specifically designed to deal with. Thus, the d 11 element is neglected in our analysis in order to avoid an overestimation of the performance. Table 8 and Figs. 10 and 11 show the results in terms of performance indicators for the nine percentile combinations. Success (Fig. 10) and error (Fig. 11) performance indicators are plotted separately. Concerning the success indicators and, in particular, the efficiency index (I eff), a general increase is observed when the percentile of WL2 is raised, as is evident comparing P 3,10,50-P 3,35,50, P 1,10,50-P 1,35,50 and P 1,50,90-P 1,65,90-P 1,80,90 (Fig. 8). In particular, a 25 % increment in the percentile related to the activation of WL2, passing from P 3,10,50 to P 3,35,50 or from P 1,10,50 to P 1,35,50, corresponds to an increase of about 35 % in I eff. Raising the percentile by 15 %, from P 1,50,90 to P 1,65,90 and from P 1,65,90 to P 1,80,90, increases I eff by about 5 % (Table 8, Fig. 10). The percentile of WL3 does not influence I eff (compare P 1,50,65-P 1,50,90 and P 1,65,80-P 1,65,90) because CP and FA undergo only a very small variation and the MA value is orders of magnitude lower than both. Regarding WL1, reducing its percentile has a positive effect on I eff, because CP increases.

Table 8 Performance indicators obtained for each percentile combination
Fig. 10 Bar chart showing the values of success indicators for each percentile combination. Efficiency index (I eff), hit rate (HRL), predictive power (PPW) and threat score (TS) values are shown as percentages (green bars). The absolute values for the odds ratio (OR) are also reported (brown bars, on secondary vertical axes in inverse order)

Fig. 11 Bar chart showing the percentage values of error indicators for each percentile combination: misclassification rate (MR), missed alert rate (R MA), false alert rate (R FA), error rate (ER) and probability of serious errors (P SM)

The hit rate (HRL) is very high for all the percentile combinations (Fig. 10), only slightly lower than 100 % (Table 8), because the hours of MA are few compared to those of CP (Table 7). The positive predictive power (PPW) shows variations similar to those of I eff, because the two indicators differ only in how MA enters the calculation and, in this case, MA is very low compared to the other elements of the duration matrix (Fig. 10). The odds ratio (OR), which can be considered as the ratio between correct and incorrect predictions, increases as FA and MA decrease and CP increases (Fig. 11).

Among the error indicators, the missed alert rate (R MA) and the false alert rate (R FA) depend, respectively, on the hours of MA and FA. The former is very low, probably also because of the low number of LEs of intermediate and large class. The FA hours decrease substantially as the percentiles increase (Table 8, Fig. 11). The error rate (ER) and the probability of serious mistakes (P SM) are evaluated excluding the element d 11 in order to evaluate exclusively the errors due to the functioning of the system, avoiding their underestimation. For this case study, these indicators depend mainly on the FA hours, which produce high values of Red and Pur errors for combinations with low WL2 and WL3 percentiles (Figs. 9 and 11). It is important to point out that, for our case study and period of analysis, d 4,1 is the only contribution to Pur errors, because the components d 1,4 and d 2,4 are null for all nine percentile combinations (Table 7), since there are no LEs classified as large.
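
The dependence of these indicators on the hours of CP, FA and MA can be illustrated with the sketch below. The definitions used here are assumptions consistent with the descriptions in the two preceding paragraphs (hit rate as CP over CP plus MA, predictive power as CP over CP plus FA, odds ratio as correct over incorrect predictions, and the two error rates as their complements); they should be checked against the formal definitions of the EDuMaP method. The function name and the input hours are hypothetical and are not taken from Table 7.

```python
def alert_indicators(cp, fa, ma):
    """Selected success and error indicators from duration-matrix hours.

    cp, fa, ma: hours of correct predictions, false alerts and missed
    alerts. Formulas are assumed, not quoted from the EDuMaP method.
    """
    hr_l = cp / (cp + ma)        # hit rate
    pp_w = cp / (cp + fa)        # positive predictive power
    odds_ratio = cp / (fa + ma)  # correct over incorrect predictions
    return {
        "HRL": hr_l,
        "PPW": pp_w,
        "OR": odds_ratio,
        "R_MA": 1.0 - hr_l,      # missed alert rate
        "R_FA": 1.0 - pp_w,      # false alert rate
    }


# Hypothetical hours, for illustration only
ind = alert_indicators(cp=900.0, fa=90.0, ma=10.0)
```

With few MA hours, as in this example, HRL stays close to 1 while PPW and OR are controlled almost entirely by the FA hours, which mirrors the behaviour discussed above.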

Among the nine combinations of percentiles, P 1,80,90 provides the best results in terms of both success and error performance indicators (Table 8). However, the performance analysis was conducted with a landslide database covering an 11-year period, during which no large LEs and few intermediate LEs occurred. Thus, the performance analysis was essentially oriented towards identifying the percentile combinations with both low FA and high CP. This aim is achieved by raising the percentiles for WL2 and WL3 (i.e. reducing FA) and decreasing the percentile of WL1 (i.e. increasing CP).

Conclusions

As a general goal, this paper focuses on the definition of an operational and reliable regional landslide early warning model (ReLWaM). The model is “operational” in that it considers all the assumptions and procedures needed to technically operate a regional early warning system, including (i) the definition of rainfall thresholds and WLs, (ii) the evaluation of monitored rainfall, (iii) the comparison between rainfall and thresholds and (iv) the production and issuing of warnings; it is “reliable” in that it is based on an optimal definition of WL thresholds, resulting in the best early warning performance. To deal with these issues, a process chain in five steps is proposed for the definition and the performance assessment of an operational regional warning system for rainfall-induced landslides based on rainfall thresholds. The method defined in “A process chain method” section can be used to issue a certain level of warning at 6-h steps, by comparing the monitored rainfall with WL thresholds. The highest threshold exceeded defines the WL to be issued for the following 6 h in a certain warning zone. This work does not address some important warning management issues, e.g. risk perception, the policy adopted to communicate with the people at risk, evacuation procedures, and the monitoring network and instruments used to issue the warnings.

As a specific target, an operational and reliable ReLWaM for rainfall-induced landslides was conceived for the Camp-3 AZ, in the Campania region, southern Italy, through the application of the process chain herein proposed. Empirical rainfall thresholds at different exceedance probabilities (percentiles) were defined applying a well-known frequentist method and validated using ROC analysis and skill scores. The optimal threshold defined with the ROC analysis (i.e. the one with exceedance probability equal to 10 %) cannot be employed by itself in a ReLWaM, due to the high probability of FAs. For this reason, nine percentile combinations were separately considered as thresholds for the activation of three WLs. Each percentile combination resulted in a distinct WE database. Finally, to define the optimal percentile combination, i.e. the one that provides the best ReLWaM performance, the EDuMaP method was applied. The performance analysis carried out for the different percentile combinations highlights a strong influence of the percentile related to the activation of WL2 on the I eff index. The OR index is probably the most effective indicator to describe the results, as it relates correct predictions to incorrect ones. The percentile combination P 1,80,90 proved to be the best solution for the ReLWaM employed in the Camp-3 AZ, because it yields good results in terms of both success and error performance indicators, as well as the highest value of OR. It is worth highlighting that the database, for the period of analysis, has a low number of intermediate LEs and no large LEs; thus, the choice of the best performance was principally oriented towards FA reduction and CP increment. The high number of hours of FA can be explained by the way the WLs are defined: a certain WL is issued if the related threshold is exceeded in at least one rain gauge in the area of analysis. This approach can be considered conservative, as it leads to a high number of FA but results in fewer MA.

Moreover, in the analyses herein proposed, only the monitored rainfall was compared with WL thresholds. More generally, other variables could be relevant for triggering landslides and could be taken into account in the process of landslide forecasting and performance analysis. The best percentile combination obtained represents the optimal solution for the database available at the time the performance analysis was carried out. Therefore, a continuous collection of data, an update of the thresholds and a periodic performance assessment are necessary to maintain a high reliability of the ReLWaM.

In conclusion, our work provides the following important insights:

  • The definition of a set of rainfall thresholds at different exceedance probabilities (percentiles) is a fundamental issue.

  • A decisional algorithm is needed for passing from rainfall thresholds to WL to be issued in a certain warning zone.

  • A percentile combination, without a performance evaluation, is not sufficient to obtain a reliable and well-performing ReLWaM.

  • The definition of a single threshold is not the most reliable solution to be employed in a ReLWaM.

  • The performance evaluation revealed the importance of OR in selecting the optimal combination of percentiles to be employed as WLs in a ReLWaM.

  • A performance evaluation is strictly connected to the availability of a landslide catalogue and to the accuracy of the information included in it.