Introduction

Landslides are frequent and widespread natural phenomena that cause casualties and damages to public and private properties worldwide. Rainfall is the main trigger of landslides and rainfall thresholds are the most used tools to forecast the possible occurrence of a landslide in a given study area; they are defined as the rainfall conditions that when reached or exceeded, are likely to trigger landslides (Guzzetti et al. 2008). In general, White et al. (1996) defined a threshold as a condition—expressed in quantitative terms by a mathematical law—whose overcoming results in a change of state of a system. For what concerns landslides, a threshold represents the lower bound of known hydrological conditions (e.g., rainfall, infiltration, soil moisture) that resulted in landslides (Reichenbach et al. 1998). In a Cartesian plane, thresholds are expressed in terms of curves that delimit a portion of the plane containing the hydrological conditions related to known slope failures. An upgrading to this approach is obtained by including in the analysis (and in the Cartesian plane) also the known hydrological conditions not related to landslide occurrences. In these cases, thresholds are defined as the best separators among triggering and non-triggering known conditions (Crozier 1997). A further improvement consists in dividing the Cartesian plane in three parts, by means of two thresholds: a lower threshold, below which no landslides occur, and an upper threshold, above which landslides always occur (Wilson et al. 1993). Between the two thresholds, different probabilities of occurrence are defined, with uncertainties related to the incompleteness of knowledge on the physical process (Crozier 1997) and on the landslide database.

The first author introducing the concept of a minimum amount of rainfall necessary to trigger a landslide was Endo (1969). Five years later, Onodera et al. (1974) proposed the first quantitative rainfall threshold for landslide triggering. Afterwards, Campbell (1975) and Caine (1980) published the two most famous pioneering works about rainfall thresholds. In particular, analyzing the rainfall conditions responsible for the initiation of soil slips in California, Campbell (1975) pointed out that the cause of the failures were the combination of antecedent cumulated rainfall and event rainfall intensity. Further, Caine (1980) proposed the first global threshold, expressed by a power-law equation and representing the minimum boundary of 73 rainfall intensity (I) vs. rainfall duration (D) conditions that have triggered landslides in several parts of the World. Since those pioneering works—and despite criticisms—rainfall thresholds were widely used to characterize the relationship between rainfall and the triggering of landslides (De Vita et al. 1998; Reichenbach et al. 1998; Corominas 2000; Crosta and Frattini 2001; Aleotti 2004; Wieczorek and Glade 2005). Guzzetti et al. (2007, 2008) published two works proposing an extensive review of the international literature. They highlighted that, since then, rainfall thresholds were broadly used considering: different scale of analysis (global, regional, local), a wide variety of rainfall parameters, various physiographic settings, and different landslide types. Guzzetti et al. (2007) also stated that for defining rainfall thresholds, physically based (process-based, conceptual thresholds) or empirical (historical, statistical thresholds) approaches can be used. Among the latter, three kinds of rainfall measurements were more frequently used: rainfall measurements obtained for specific rainfall events; antecedent rainfall conditions; other thresholds, including hydrological thresholds. Finally, they proposed a global threshold based on a global database of 2626 rainfall events resulted in shallow landslides and debris flows.

In the decade following the works by Guzzetti et al. (2007, 2008), the topic was further investigated, producing abundance of case studies at different scales of analysis, and significant technical and scientific advances. Therefore, we felt necessary to further review the recent international literature, collecting information about the definition, the employment, and the validation of landslide rainfall thresholds worldwide, in order to highlight the best practices, the main drawbacks still affecting recent case studies, the most common critical problems, and the most effective solutions adopted. The aim of the paper is to describe the main characteristics of the reviewed thresholds and to produce a complete analysis of the most used approaches followed in every step of the procedure for defining reliable rainfall thresholds. The contribution of this manuscript could be seen not only as a review of the state of the art, but also an effective method to address the future trends of the research and to disseminate the best practices among scientists and stakeholders involved in landslide hazard management.

The paper is organized in four sections. The “Method” section describes the method used to collect information in order to compare the different thresholds reviewed. The “Results and discussion” section presents the result obtained from the analysis and discusses the main findings. Finally, the “Concluding remarks: lessons learnt and advices” section concludes and summarizes the lesson learnt. For the sake of brevity, in the following sections we often use just the word “threshold” letting it mean “rainfall threshold for landslide occurrence.”

Method

Data gathering

The main purpose of this review is to analyze all recent advances in the definition and employment of rainfall thresholds for landslide occurrence. To this aim, we examined all published papers presenting studies on the definition of thresholds, including the contributions on the employment of such thresholds into prototypal or operational landslide early warning systems (LEWSs). We concentrated our research in the period 2008–2016, because all the works published before 2008 were included in the reviews made by Guzzetti et al. (2007, 2008). We decided to restrict our review exclusively to peer-reviewed papers published in journals indexed in SCOPUS or ISI Web of knowledge databases, because the literature on the subject is very broad and diversified. We excluded from the analysis all proceedings of technical conferences, multi-authored books, as well as all the gray literature. Peer-reviewed papers in indexed international journals are the most common means to disseminate significant scientific advances, contemporary ensuring relevance, and reliability of the scientific content. Moreover, we considered only papers written in English, for guaranteeing accessibility and readability to all works reviewed herein. With the aforementioned filters adopted to collect information on rainfall thresholds, we identified 107 papers, for a total number of 115 thresholds. Some articles described two or more thresholds, using both different approaches and datasets; for these reason, we treated them separately.

Collected information

For each of the 105 thresholds, we collected and grouped technical and scientific information according to four categories: publication details, geographical distribution and uses, dataset features, and threshold definition. In each category, we selected several descriptive information to describe thoroughly the threshold characteristics. We acknowledge that, for many articles, not all the abovementioned information were clearly expressed. However, for the completeness of the review, we tried to fill the information for each parameter, but at the same time, we only included the data clearly expressed and described by the authors of the papers.

Publication details

The first category contains information on the information necessary for the identification of the paper presenting the threshold, such as authors, title, year, title, journal, DOI, complete reference, and identification code composed as “Year_FirstAuthor_JournalName” (e.g., 2018_Segoni_Landslides). Moreover, a specific field for notes and other relevant information was planned.

Geographical distribution and use

The second category comprises information on the location, extent, and use of the threshold (Table 1). We collected information on the location and the extent of the study area for which a threshold was defined. Concerning the spatial scale of analysis, we defined the following classes, as a function of geographical and administrative constraints: global, national, regional, basin, local, and slope. The “global” scale refers to thresholds established worldwide as a minimum level below which landslides do not occur (Guzzetti et al. 2008); the “national” to thresholds developed for a whole nation; “regional” is intended here as an administrative subdivision of a nation (i.e., a province), typically extending for thousands square kilometers. The “basin” scale refers to thresholds defined for a hydrographic basin of a river and “local”, for thresholds developed in a very restricted area (from a few to some hundreds of km2) with local climatic regime and geomorphological setting. The “slope” scale is used for studies dealing with a single slope or a single landslide.

Table 1 Information included in our analysis. Threshold purpose and use

Concerning the implementation of the thresholds in a LEWS, we defined three classes. The first (Y, yes) consists of thresholds implemented into an operational or prototypal LEWS. The second (N, no) entails the thresholds not deemed to be part of a LEWS: this mainly happens for a “minimum rainfall threshold,” which aims at assessing the minimum rainfall conditions for the landslide triggering or to characterize a landslide prone area. The third (P, preliminary) comprises the thresholds not yet integrated into an operational or prototypal LEWS, but defined—as stated in the investigated papers—to this objective. The number of warning levels (i.e., the number of states defined by the threshold system) was also considered. The minimum value is 2, obtained in the case of one threshold defining the states “no landslide”/“landslide.” Moreover, we analyzed the minimum territorial extent (MTE) employed by a LEWS for issuing a warning. We evaluated the average extension (in km2) of the MTE and defined three classes: territorial unit, when the area interested is divided in smaller parts for which different levels of warning can be issued simultaneously; whole area, if only one level of warning at a time can be issued for the area; single slope, when the LEWS is employed at slope scale. Finally, we collected information on the target of the warnings, e.g., local administrations, civil protection, decision makers, and citizens.

Dataset features

The third category includes information about the main features of the dataset used to prepare the thresholds (Table 2). First, we studied the period of analysis, i.e., the time lapse of rainfall and landslide data used for threshold calibration and/or validation. Then, we collected the number and the type (according to Cruden and Varnes 1996) of landslides used for the threshold definition. In addition, we stored the sources used to gather information on landslide occurrences (e.g., database, newspapers, technical and scientific reports, field surveys) and source(s) of rainfall data (e.g., rain gauge, satellite, radar). About rainfall data, we analyzed the number of rainfall events used for the analysis, together with the temporal and spatial resolution of the measurement. We categorized temporal information in four classes (subhourly, hourly, daily, monthly). Regarding spatial information, we extracted from the analyzed papers the number of measuring stations and the average spatial density of measures. Finally, in case the threshold definition and employment required additional instruments (e.g., to measure snowpack thickness, soil moisture, or temperature), we also collected information on the secondary monitoring instruments and other monitored variables.

Table 2 Information included in our analysis. Dataset features

Threshold definition

Finally, the fourth category comprises details regarding the definition and validation/evaluation of the thresholds (Table 3). The criteria used for rain gauge selection were classified as “nearest,” when the rain gauge is the closest to each landslide; “manual/expert judgment,” when the rain gauges are selected after an expert evaluation of the pluviographs; “automatic selection,” when a criterion was set up in advance to automatically relate each landslide to a rain gauge among many possible ones; “reference rain gauge,” when a single rain gauge was a-priori chosen, among many that were selectable, to characterize all landslides in a given area; and “only reference,” when there was only a rain gauge in the area and the choice was forced. Moreover, we listed several classes to describe the procedures used for defining rainfall events and for passing from a series of pluviographs to a point cloud in a graph. In particular, “manual” procedures consist in analyzing every pluviograph or other rainfall measurements and adopting an expert judgment to define the extents of the rainfall events. “Standard” procedures consist in setting in advance some standards to define the rainfall event or its attributes or else in defining the values based on standard durations (e.g., daily intensity, 3-day total rainfall). Furthermore, an automated method (e.g., an algorithm) can be used to analyze rain gauges and rainfall data and quantify the optimal attributes to be entered in the graph; this was named the “algorithm” approach, and it is based on the values of information inside the rainfall event (e.g., peak intensity or event total rainfall amount) or on other variables (e.g., return period).

Table 3 Information included in our analysis. Threshold definition

Furthermore, we catalogued two relevant features: the method used to draw the thresholds (also determining the threshold type) and the rainfall variables or parameters used to define the thresholds. The first information was classified as “manual,” “statistical technique,” “probabilistic approach,” and “physically based calculation.” For what concerns the rainfall variables, we explicitly considered only the most common combinations, namely “intensity-duration,” “rainfall event-duration,” and “antecedent condition,” and grouped all the other approaches in the “other” class.

Finally, we collected some details about validation of the thresholds. This is a fundamental issue for rainfall definition and it is intended to be performed against an independent dataset that was not used for calibration. Nevertheless, the predictive capability of the thresholds often undergoes a check on historical data already used for the threshold definition. For this reason, we considered three classes of validation: Y/s (Yes) with the same (or part of the same) dataset used for threshold definition; Y/d (Yes) with a different dataset; N (No). In addition, we analyzed the methods used for validation: count (the authors just enumerate the occurrences above/below the threshold), skill scores (usually derived from a contingency matrix), and receiver operating characteristic (ROC) curves (Hosmer and Lemeshow 1989; Fawcett 2006), comparison with other thresholds.

Results and discussion

Publication details

Journals

Our review identified 107 papers presenting 115 thresholds, published on 26 different international journals. Figure 1a shows the number of papers on rainfall thresholds published on the main journals during the investigated period. Landslides, Geomorphology, and Natural Hazards journals together represent about half of the total scientific production. However, the topic has drawn the attention of 26 journals pertaining to the subject areas “Earth and Planetary Sciences” and “Environmental sciences,” demonstrating a widespread interest in thresholds as research topic.

Fig. 1
figure 1

a Geographical distribution of the analyzed rainfall thresholds. Countries colored based on the number of published thresholds. In the inset in the bottom left, the number of papers per scientific journal in which they were published. b Bar chart showing the number of thresholds published in scientific journals from 2008 to 2016, according to SCOPUS and ISI Web of knowledge databases. Each year, the number of thresholds implemented in a LEWS (Yes), the preliminary thresholds (Preliminary), and thresholds not deemed to be part of a LEWS (No) are also shown by means of different colored bars. c Pie chart showing the percentages of thresholds per spatial scale (see text for explanation). The total number of the thresholds is reported in square brackets in the top left. d Pie chart showing the percentages of thresholds in classes of study area extension. Key: global scale, entire world; n.s., not specified. The total number of the thresholds is reported in square brackets in the top left. e Bar chart showing the number of threshold levels per analyzed work. Bars colored as in b. f Bar chart showing the targets for the warning issued by the operational or prototype LEWS in which the thresholds are implemented. Key: n.s., not specified; none, thresholds not deemed to be part of a LEWS. g Bar chart showing the types of the minimum territorial unit for warning (MTE) in which the warning messages of the operational or theoretical LEWS is issued. Key: n.s., not specified; none, thresholds not deemed to be part of a LEWS. h Distribution of the areas of MTE: 25, 50, and 75% percentiles are reported each type; the number of MTE in each class is also reported

Spatial distribution

The reviewed literature reports rainfall thresholds for a wide variety of countries, as showed in Fig. 1a. Europe and Asia are the most represented continents (52 and 36%, respectively); 9% of the thresholds analyzed are located in the Americas, while Africa and Oceania are only marginally represented (1% each). This distribution is partially in accordance with the distributions of landslide hazard and risk across the world (Nadim et al. 2006, 2013; Petley 2012). Moreover, it is also linked to the progress of scientific advances in this field: Africa and central/south America started to focus on the issue of landslide forecasting only recently. In North America, the number of articles is low (3%), but they all are related to thresholds implemented (or to be implemented) in functional LEWS, demonstrating that the studies on landslide thresholds are advanced and their low number is mainly related to a landslide hazard lower than in other continents.

About half of the papers pertains to thresholds located in two countries: Italy (35%) and China (14%). Again, this overwhelming concentration of thresholds reflects directly the exposition of these two countries to landslide related risk (Nadim et al. 2006; Haque et al. 2016), supported by a long tradition of studies in the field of landslides (Gokceoglu and Sezer 2009; Wu et al. 2015a). The urge to cope with landslide risks in these two countries induces an intense research activity applied to forecasting tools and LEWSs. Moreover, about half (53%) of the thresholds implemented into LEWSs pertain to test sites located in Italy, while half of the works related to study areas in China describe operational or prototype LEWSs. Regarding other countries, India and Portugal are well represented in our review, each with 5% of the analyzed thresholds, followed by Malaysia and Taiwan (4%), Japan, Slovenia and the USA (3%), Brazil, France, Honduras, Nepal, Philippines, and Spain (2%).

Temporal distribution

Figure 1b shows the number of thresholds published each year in international journals. The publishing activity in the field of rainfall thresholds has recently improved: splitting the surveyed time interval into three 3-year periods, more than half of the works (53%) were published in the last 3 years: 26 in the 2008–2010, 25 in the 2011–2013, and 63 in the 2014–2016. This outcome can be considered a proof that rainfall threshold research is a very pressing and up-to-date issue in the scientific research community.

Geographical distribution and use

Spatial scale and areal extension

The scientific community is mainly focusing on threshold analysis applied to restricted and specifically defined areas. Indeed, only two global-scale thresholds were published, both in 2008 (Guzzetti et al. 2008; Hong and Adler 2008). National scale approaches are only a few (8.3% of the total), while most of the works are at regional (23.5%), basin (39.2%), and local scale (24.2%). Furthermore, 4.2% of the thresholds reviewed herein were conceived for even smaller test sites, at the slope scale (Fig. 1c). The spatial detail is even greater if we consider that in one of the national scale studies and in eight of the regional ones, the authors did not propose a single threshold for the whole study area, but subdivided the study area into several subzones, defining a specific threshold for each of them (Lagomarsino et al. 2013; Segoni et al. 2014b; Rosi et al. 2016). All authors support their decision by observing that the subdivision leads to territorial units with higher meteorological and geomorphological homogeneity, where more reliable rainfall thresholds can be defined. However, the scale of analysis classification does not reflect perfectly the spatial extent of the study area. As instance, the national scale analysis performed by Rosi et al. (2016) in Slovenia and the regional scale analysis performed by Segoni et al. (2015a) in Tuscany have similar extension (slightly above 20,000 km2). Therefore, we considered significant to take into account also the extensions of the study areas (Fig. 1d): all orders of magnitude from 102 to 105 km2 are quite uniformly represented (Zêzere et al. 2015; Robbins 2016). This outcome is in accordance with the previous reviews of Guzzetti et al. (2007, 2008) and it is a poof that rainfall thresholds are suitable for every extent of analysis. Remarkably, a relevant percentage of works (13%) does not report the extension of the study area and in some articles we had to estimate the extension of the study areas from the figures in which they were shown.

Objective of the thresholds

Usually two categories of thresholds are identified based on their specific objective: thresholds aimed at forecasting landslide occurrences and thresholds aimed at identifying the minimum rainfall conditions likely triggering landslides. While the first category encompasses thresholds that try to find a balance between correct and incorrect prediction, the second category identifies the minimum possible rainfall conditions associated with landslides. These approaches are equally represented in our review. Exactly 50% of the thresholds analyzed are aimed at identifying the minimum rainfall conditions for landslides (e.g., Chen and Wang 2014). Concerning thresholds for landslide forecasting or warning, about 34% of them are part of preliminary works on LEWSs, i.e., the thresholds are not yet integrated into a real or prototype system, but in the paper the authors declare that the threshold was defined with this objective (e.g., Mathew et al. 2014). On the other hand, in 16% of the works analyzed, thresholds are implemented into a LEWS, or at least an operational use is depicted, simulated and evaluated in the paper (e.g., Baum and Godt 2010). Figure 1b shows the yearly distribution of the thresholds implemented in a LEWS, preliminary works, and thresholds not conceived to be incorporated in a LEWS. The percentages do not exhibit evident variations among the years of analysis. This outcome demonstrates that both thresholds aimed at forecasting landslide and minimum rainfall thresholds are up-to-date approaches aimed at pursuing different objectives: the first ones represent a useful tool to be employed in landslide risk mitigation strategies; the second ones are useful to characterize cases of study and to investigate scientific issues related to landslide triggering.

Number of threshold levels

The objective of a threshold is to separate rainfall conditions leading to (at least) two possible states: instability (above the threshold) and stability (below the threshold). Since the transition between the stability and the instability field cannot be devised so sharply, in many works, more than two thresholds are depicted, for increasing levels of instability/warning. This approach is often used in thresholds to be used in a LEWS, where different levels of warning can correspond to the exceeding of different thresholds. As showed in Fig. 1e, 68 works (60%) present a single threshold that discriminates between stability and instability conditions (e.g., Miller et al. 2009; Kanungo and Sharma 2014). This simple approach can be more frequently observed in thresholds that are not directly devised with the purpose of warning system implementation (38 out of 68, e.g., Khan et al. 2012; Hasnawir and Kubota 2008). Conversely, complex threshold systems accounting for 3, 4, or more states of the systems are typically used in warning systems or in preliminary works (e.g., Jaiswal and van Westen 2013; Huang et al. 2015; Floris et al. 2012).

Targets for warning

Among all the articles describing operational or prototypal versions of LEWSs, only 28 clearly state which is (or is expected to be) the main target for the issued warnings. In 14 cases (23.7%, Fig. 1f), Civil Protection authority is mentioned as the main subject expected to receive warnings. It is not surprisingly that 12 out of these 14 works are based in Italy (e.g., Martelloni et al. 2012), where Civil Protection was in charge of managing natural hazards at all levels since at least 1992. We include in this class also two Slovenian works, which address Civil Protection and National Geological Service as the main target for warnings (Rosi et al. 2016; Jemec Auflič et al. 2016). In six cases (10.2%), local authorities are the main receivers of the warnings. This can include both local government (e.g., Zhou and Tang 2014; Althuwaynee et al. 2015) or technical offices (e.g., Mathew et al. 2014). In two cases (3.4%), the warnings arrive directly to the population (Calvello et al. 2015), while in six cases (10.2%), the target for warning are both population and local authorities (e.g., Baum and Godt 2010; Chen and Huang 2010; Tiranti and Rabuffetti 2010; Jaiswal and van Westen 2013; Zhuang et al. 2014; Huang et al. 2015).

Minimum territorial extent for warnings

The application of rainfall thresholds to warning systems requires a spatial resolution compatible with operational purposes. The aim is to divide an area into territorial units with meteorological and hydrogeological homogeneity, to reduce the number of false and missed warning. Therefore, many works introduced a partition of the study area into different territorial units, monitored and alerted independently (Jaiswal et al. 2010; Martelloni et al. 2012). As can be seen from Fig. 1g, h, this approach brings the consequence of reducing the area adopted for issuing a warning: when the study area is partitioned into smaller territorial units instead of being managed as a single entity, the mean area drops from 435 to 302 km2 and the 25th percentile from 100 to 15 km2. This approach is particularly relevant in large study areas: as instance, Zhuang et al. (2014) worked on a 10,106 km2 wide area, but dividing it into three distinct MTEs a mean areal extension of 3368 km2 was obtained. Moreover, Segoni et al. (2014b) designed a regional scale warning system, subdividing a 23,000 km2 wide study area into 25 different alert zones with an average extension of about 900 km2 and demonstrated that this approach enhances the forecasting effectiveness of the system. It is worth highlighting also experimental studies in which rainfall thresholds and susceptibility maps are coupled in an effort to further reduce the spatial resolution of regional scale warning systems (Segoni et al. 2015b; Jemec Auflič et al. 2016).

Dataset features

Period of analysis

The starting point for most part of the threshold analyses is collecting landslide and rainfall data. The period of analysis of a given work is the time span for which it was possible to gather both landslide and rainfall data in order to find an empirical correlation.

Since it is quite established that the longer the timespan of the dataset used and the more robust and performing is the threshold (e.g., Lagomarsino et al. 2013; Rosi et al. 2015), the objective of every researcher is to collect as long as possible datasets of rainfall and landslides. However, both datasets are intrinsically limited and subject to varying degrees of completeness. We found that typically (46% of the occurrences) the dataset used for the works spans through 5 to 20 years. In 17% of occurrences, it is lower than 5 years and sometimes (6%), it is 1 year or even less. In 12% of the works, the authors use a very long dataset (> 50 years) but argue about the completeness of the landslide inventory since usually historical landslide archives keep track only of the major events and neglect the ones with limited impact on the society.

We observed a consistent time gap between the most recent data used in the threshold analysis and the publication date of the article. Only half of the articles have a gap lower than 3 years, while for 30% of them the gap is at least 5 years. A relevant number of works rely on even older data: in 18% of the cases, the gap is higher than 7 years and in 12% is 10 years or more. This gap could point out that in many cases rainfall and landslide databases are not immediately available to the researchers. The main reasons could be that the compilation of landslide databases is time consuming; therefore, catalogues are not constantly and timely updated.

Since it is demonstrated that a bigger and updated calibration dataset enhances the effectiveness of thresholds and consequently of the LEWS (Rosi et al. 2015), local and central governments should develop the political will of keeping databases updated and foster scientific and technical programs that allow to have complete and updated landslide catalogues. As instance, Ma et al. (2015) report that in their test site in China, groups of experts perform field surveys immediately after each rainfall event and populate the landslide database with high-quality and constantly updated data. The same practices are pursued for the LEWS operational in Rio de Janeiro, Brazil (Calvello et al. 2015). To obtain the same objective with reduced workforce, some researchers are developing methods to provide automated, fast, and timely updates by means of semantic search engines on online news (Battistini et al. 2013, 2017). Another point, which is common to many scientific research fields, is the development of methods and wills to share data. In many works, the initial databases are often created from the scratch or consist in updated versions of databases previously used in other works by the same authors. We found that only 7% of the works have a gap of 1 year between the most recent data used in the threshold analysis and the date of publication of the article. These numbers demonstrate that the availability of updated databases is an issue that should be urgently addressed by research community and decision makers.

Source of landslide data

Regarding the most used methods to compile the landslide databases, one third of the reviewed works makes use of two or more sources of information, trying to compile a database as much complete as possible, not only in terms of number of events reported, but also in terms of information to be used in the analyses. As an instance, local newspapers usually report with good temporal precision landslides that had a relevant impact on human infrastructures but very rarely provide reliable technical/scientific information (e.g., landslide type). On the contrary, scientific reports may be rich of details but sometimes they can be written after a long lasting event and they may not state clearly the exact moment or day of landslide occurrence.

Figure 2a reports the distribution of landslide information source. The most used sources of information are reports (used in 50 cases), which can have various different origins: fire brigades, civil protection, local administration, technical offices, and scientific reports. Another relevant source of information are news (e.g., Peruccacci et al. 2012; Gariano et al. 2012, 2015; Vennari et al. 2014), which can be found in newspapers archives (27 occurrences) and/or on internet (7 occurrences). Furthermore, 26 studies make use of official databases released by different organizations, mainly governmental organizations, local authorities, or research institutions. Reports, newspapers, internet, and official databases are the main sources of information for regional or national scale thresholds (e.g., Lagomarsino et al. 2015; Piciullo et al. 2017). Direct post-event field surveys are quite used (18 occurrences), especially in small areas or in landslide studies limited to a few relevant rainfall events. In two circumstances, field surveys were also accompanied by interviews to local inhabitants to better ascertain the exact timing of the landslides (e.g., Erener and Düzgün 2013). Remote sensing can be a valuable tool to compile post-event catalogues or to constantly update large inventories (Tofani et al. 2013), indeed surveys performed by remote sensing techniques were used in 10 circumstances, both in large scale and small scale studies (e.g., Chang et al. 2008; Shieh et al. 2009; Papa et al. 2013). In 7 works, historical records of various origin, including tree ring series (Saez et al. 2103), were taken into account, while 6 works declare to include datasets used by previous works (e.g., Winter et al. 2010; Terranova et al. 2015). In 3 circumstances, the threshold analysis is based on monitoring data coming from different instruments: Staley et al. (2013) used channel monitoring equipment to set up a comprehensive database of post-fire debris flows; Vallet et al. (2016) used displacement sensors to correlate rainfall to the reactivation of a monitored landslide; Burtin et al. (2009) used seismometers to detected debris flows activations.

Fig. 2
figure 2

a Bar chart showing information sources used to define the analyzed thresholds. bi Pie charts showing the percentages of b landslide types considered to define the thresholds (key: n.s., not specified); c sources of rainfall data used to define thresholds; d rainfall time scales adopted in the analysis; e spatial density of used rain gauges, in classes (key: n.s., not specified; n.e., not expected; n.a., not available); f methods adopted for the selection of rain gauges (rg); g methods adopted for the extraction of rainfall parameters; h main parameters adopted for defining the thresholds; i types of thresholds, i.e., methods used for drawing or defining the thresholds (key: n.s., not specified). The total number of the thresholds is reported in square brackets in the top right of each panel

Landslide types

Regarding the landslide types considered in the threshold analysis, the reviewed articles followed two radically different approaches: while some of them are focused on few landslide types (sometimes even a single type); others include different landslide types, up to cases where all landslide types are encompassed. The distribution of landslide types is reported in Fig. 2b. Most of the thresholds analyzed (43%) are built for wide groups of landslides, in particular 38.3% for shallow landslides (Brunetti et al. 2010; Peruccacci et al. 2012; Cepeda et al. 2010) and 4.7% for deep seated landslides (Uchida et al. 2013). On the other hand, some works focus on a very specific type of landslide, such as debris flows for the 25.8% of cases (e.g., Winter et al. 2010; Huang and Tang 2014; Nikolopoulos et al. 2015), rockslides (2.3%; Sengupta et al. 2010; Terranova et al. 2015), and earth flows (3.9%; Greco et al. 2013). A few works are even more restrictive and specific, addressing a landslide type triggered or favored by particular contour conditions. It is the case, as instance, of slope-cut-related landslides (Zhuang and Peng 2014), post-fire debris flows (Cannon et al. 2008), and post-seismic debris flows (Shieh et al. 2009; Tang et al. 2012; Zhou and Tang 2014; Guo et al. 2016; Li et al. 2016). Conversely, 21.9% of the works declare to take into account every type of landslide encountered in their study area (Marques et al. 2008), even if sometimes a few prevailing landslide types are identified (Segoni et al. 2014a; Zêzere et al. 2015).

A deeper analysis highlighted some approaches of considering few landslide types even when all the landslides available were taken into account. This is the case of works in which a very widespread study area is partitioned into subzones that have homogeneous geomorphological characteristics (e.g., Martelloni et al. 2012; Segoni et al. 2014b; Ma et al. 2015, Tiranti and Rabuffetti 2010). Each subzone is independently analyzed for the definition of a specific threshold equation. In this way, instead of having a large and heterogeneous study area with different landslides type, the authors obtain a mosaic of smaller subzones characterized by more homogenous landslide catalogues. Finally, it should be highlighted that 3.1% of the works analyzed do not state clearly the landslide type considered in the analyses.

Landslide number

Only the 88% of the analyzed works clearly state the number of landslides used for defining thresholds. In 40 cases detailed information, on the number of both landslides and landslide events (a single landslide event could be composed by one or more landslides) were provided. Regrettably, 12% of the works leave this important aspect unspecified.

Only few attempts were made to establish the minimum number of landslides necessary to define a reliable threshold (e.g., Peruccacci et al. 2012; Vennari et al. 2014). In our review, we found a very wide number of landslides used to define thresholds. Some works declare to use hundreds of landslides (Berti et al. 2012; Lee et al. 2015; Ma et al. 2015; Saito et al. 2010; Lagomarsino et al. 2013; Segoni et al. 2015a; Rosi et al. 2015) or hundreds of rainfall events responsible for the landslides (Guzzetti et al. 2008; Saito et al. 2017; Peruccacci et al. 2012; Vennari et al. 2014; Gariano et al. 2015). Other works perform a threshold analysis using only a few (i.e., below 10) landslides (Sengupta et al. 2010; Chen and Wang 2014; Chien-Yuang et al. 2008; Tang et al. 2012; Greco et al. 2013). In addition, some local scale studies focus on a single landslide considering many events of reactivation (Terranova et al. 2015; Vallet et al. 2016), or cases in which the analysis is based on physically based theoretical calculations of the slope stability and no actual landslide events were taken into account (Alvioli et al. 2014; Bovolo and Bathurst 2012).

Source of rainfall data

Our review highlights that still the rain gauges are by far the most used instrument to obtain rainfall data for threshold analysis: 79.5% of the works rely on rain gauges, among which 7% combines rain gauges and radar measurements (Fig. 2c). On the other hand, 6.6 and 4.1% of the rainfall measurements are provided only by radar and satellite measurements, respectively. In some circumstances, this is a forced choice, because no reliable rain gauge network exists in the studied area and satellite or radar measurements are the best choices available (Posner and Georgakakos 2015; Robbins 2016). Conversely, in other cases, the choice of using radar (mainly ground-based radar) is an attempt to obtain rainfall measurement with higher spatial and temporal resolution and reduced uncertainties (Saito et al. 2010; Marra et al. 2016). Moreover, 3.3% of the works do not clearly state the source of rainfall data while the 2 works presenting global thresholds make use of literature data. Finally, 5.7% of the thresholds are not based on rainfall measurements and make use of theoretical values to calculate theoretical rainfall thresholds, mainly by means of physically based models (Salciarini et al. 2012; Papa et al. 2013; Wu et al. 2015b).

Rainfall time scale

As shown in Fig. 2d, almost half of the thresholds analyzed (52.2%) are defined using hourly rainfall data (e.g., Garcia-Urquia and Axelsson 2015; Lainas et al. 2016). A higher temporal detail was used in only 4.3% of the works (Cannon et al. 2008; Huggel et al. 2010; Staley et al. 2013; Ciabatta et al. 2016; Iadanza et al. 2016), which relied on the use of advanced rain gauges. Daily rainfall data (Nolasco-Javier et al. 2015) are also widely used (39.1%) and in two circumstances (1.7%, Saez et al. 2013; Kanjanakul et al. 2016), thresholds were defined using monthly measures. However, it should be specified that these two cases are very peculiar, consisting respectively in an investigation of historical landslides through tree-ring series to assess climate changes (Saez et al. 2013) and a work located in a poorly instrumented area in Thailand (Kanjanakul et al. 2016). Eventually, 2.6% of the works analyzed do not specify clearly the time scale used in the threshold analysis. We acknowledge that the abovementioned percentages are referred to the majority of measures used in each work. However, in 6.1% of the cases, time resolutions of mixed origin are used, mainly in case of scarce density of instrumented networks. In these cases, data are gathered from instruments of mixed origin (different types of rain gauges or rain gauges and radar) and in case of thresholds devised for very large areas (up to global thresholds) collecting published data of heterogeneous origin (Guzzetti et al. 2008; Mathew et al. 2014; Calvello et al. 2015). The temporal resolution is clearly correlated to the objective of the thresholds. All thresholds used for early warning purposes are based on data of at least hourly time scale, while in preliminary works, also daily measures were considered. Minimum thresholds were proposed with datasets of every temporal resolution.

Spatial density of rainfall measurements

Figure 2e shows the distribution of the density of rain gauges used for defining the analyzed thresholds. The number of used stations greatly varies from a minimum of 1 (a single rain gauge used in 25.2% of the cases, e.g., Sengupta et al. 2010; Khan et al. 2012; Huang and Tang 2014; Napolitano et al. 2016) to a maximum of 2457 (Ma et al. 2015). Less than 15% of the analyzed thresholds were defined relying on data recorded from more than 100 rain gauges; on the other hand, for more than 60%, data gathered from up to 100 gauges were exploited. Oddly, in 12 cases (10.4% of the total), we were not able to found the number of used rain gauges.

Regarding the spatial density of rain gauges in the study area, only 7.9% of the analyzed cases have a very high density (≥ 500 gauges per 100 km2; e.g., Dahal and Hasegawa 2008). About a third of the occurrences (31.6%) have a coarser spatial gauge resolution, using up to 100 rain gauges per 100 km2, while 23.7% of the thresholds were defined considering between 100 and 500 rain gauges per 100 km2 (Fig. 2e). Overall, decrease in gauge density leads to increased underestimation of rainfall, which in turn leads to large underestimation of the thresholds, especially in those based on intensity and duration (Nikolopoulos et al. 2014, 2015). High spatial density of rain gauges is usually encountered in very small test site; however, sometimes relevant densities are reported even for relatively large areas with well-equipped rain gauge networks. It is the case, as instance, of the works by Rosi et al. (2012), Ma et al. (2015), and Piciullo et al. (2017). These cases took advantage respectively of 332 gauges in an Italian region extending 23,000 km2 (1.44 gauges per 100 km2), 2457 gauges for a 101,800 km2 area in China (2.41 per 100 km2), and 58 gauges in a 1619 km2 area in Italy (3.58 per 100 km2).

In some cases, other rainfall sources (apart from rain gauges) were considered. In particular, when radar measurements are used, the spatial resolution can be greatly enhanced, depending on the pixel resolution, which can arrive even to a few km2 in case of ground-based radar. This is one of the most evident technical advantages of using radar instruments rather than rain gauge networks (Peleg et al. 2013; Borga et al. 2014; Marra et al. 2014, 2016).

Other monitoring instruments and other variables monitored

In 20 cases (17.4%), additional monitoring instruments, besides those used to obtain rainfall data for the threshold analysis, were used. In particular, in 7 cases, they consist in other instruments used to measure rainfall, e.g., to integrate rainfall measures or to have redundancy of rainfall data. In 13 circumstances, instead, the work describes the use of instruments that measure physical phenomena other than rainfall. This circumstance is mostly verified in small test sites that include monitoring stations equipped with video cameras, pore pressure transducers, tensiometers, watermark pressure head sensors, geophones and, more in general, multi sensors monitoring stations. In 6 cases, the test sites is equipped with instruments for temperature measurements. The use of temperature data is due to cope with snowmelt induced landslides (Bíl and Müller 2008), to take into account snow accumulation/melting phenomena in regional scale threshold analysis (Martelloni et al. 2013), to model the degree of saturation in order to adjust operational rainfall thresholds (Ponziani et al. 2012) or to assess the impact of climate change in landslide hazard (Saez et al. 2013; Ciabatta et al. 2016). Only in two cases (Baum and Godt 2010; Napolitano et al. 2016) a correlation between rainfall data and pore pressure measures were used to define thresholds; this approach, despite not simple and not always feasible, would be useful for enhancing the reliability of the thresholds.

Threshold definition

Rain gauge selection

The decision on the rain gauges selection is an important step in the rainfall threshold analysis (when based on gauges). The method adopted influences the results of the analysis and the operational activities of a LEWS; therefore, we focused on this crucial aspect to review all possible solutions adopted in the recent literature (Fig. 2f).

First, we remark that this important information is not described, or not clearly specified, in 29 cases out of 115 (25.2%). This is a major point of weakness, because the method proposed in these works cannot be adequately evaluated and replicated elsewhere. In 14 cases (12.2%), the rain gauge selection is not a part of the method, because of the inherent nature of the work. As instance, rainfall data are based on radar measurements (e.g., Saito et al. 2010) or theoretical rainfall values are used to define physically based rainfall thresholds (e.g., Salciarini et al. 2012). The remaining 72 cases, for which the rain gauge selection is described, can be grouped into six main classes, as reported in Fig. 2f: automatic selection, manual/expert judgment, nearest, reference, single, and others.

In 21 cases (18.3%), the choice was obligated because in the study area, there was only a rain gauge, therefore chosen as the only reference rain gauge for the whole area. We mainly encountered this approach in very small areas, from a few km2 to a few decades of km2 (e.g., Floris et al. 2012; Napolitano et al. 2016), or in developing countries where only few rain gauges have suitable technical characteristics such as an adequate time resolution and a long record of rainfall data (e.g., Afungang and Bateira 2016). In 18 cases (15.7%), the nearest rain gauge is selected by default to provide the rainfall data to be correlated to each landslide. In some works, minor refinements of this technique are proposed. As an instance, Palenzuela et al. (2016), in case of rain gauges with similar distance, select the rain gauge with the same mean annual precipitation class of the landslide spot. Winter et al. (2010) selected the nearest rain gauges but for landslides where no rain gauge was available within 20 km, radar data were used. Althuwaynee et al. (2015) used the nearest rain gauge technique only if the landslide was nearer than 6 km, while all other landslides were discarded from the threshold analysis. However, many works point out that the nearest rain gauge may not be the best option to capture the triggering rainfall, especially in complex geographic settings and in case of localized convective storms. Therefore, more refined approaches were developed by the research community. In 12 cases (10.4%), the procedure of rain gauge selection is completely automated and based on objective and quantitative elements. In these cases, a fundamental part of the threshold analysis is the development of automated algorithms that associates to each landslide a rain gauge according to some criteria. The advantage of this method is that it is standardized and replicable, it reduces subjectivity and it can be applied to big datasets in small amounts of time. As instance, different algorithms test all possible reference rain gauges and select the one performing better in terms of return times, weights, and skill scores (Lagomarsino et al. 2013; Staley et al. 2013; Segoni et al. 2014a; Melillo et al. 2015, 2016; Iadanza et al. 2016). In 11 cases (9.6%), no particular procedures are implemented and the researchers selected the most suitable rain gauge, between many possible ones, according to manual/expert judgements, therefore with an a-posteriori and subjective selection. For each landslide, the pluviographs recorded by several rain gauges are evaluated and the authors selected the pluviometer that, according to their judgment, better captures the triggering event. In some articles, the expert/manual judgment is strengthened providing further details about the criteria that guided the choice of the authors (e.g., Gariano et al. 2012; Giannecchini et al. 2012; Jaiswal and van Westen 2013; Vennari et al. 2014). Most of the times, these criteria included one of the following: proximity, similar elevation, and same mountainside of the landslide. In six cases (5.2%), a single rain gauge is a-priori chosen among many possible ones, to characterize all the landslides in the study area or in a subzone of a wide study area. The criteria of the selection include, e.g., the accordance with the rainfall regime of the study area (de Oliveira et al. 2016) or a combination of geographic and technical features (Martelloni et al. 2012). In two cases, Thiessen polygons were drawn for all the rain gauges available and each rain gauge is used to characterize the landslides located inside its polygon. This system was used in two regional scale approaches with hundreds of rain gauges and thousands of landslide records distributed in two Italian regions, namely Tuscany (Rosi et al. 2012) and Emilia Romagna (Berti et al. 2012). In other cases, a mixed approach was considered, e.g., Jemec and Komac (2013) used an expert judgment approach to select for each landslide one out of three possible reference rain gauges previously identified. These latter methods were grouped together in the class “other” (Fig. 2f), representing the 3.4% of analyzed works.

Methods for the extraction of rainfall parameters

Another crucial step in the threshold definition is the analysis of the pluviograph and the definition of rainfall values (or events) responsible for landslides. In the past, this part of the threshold analysis was often characterized by a relevant degree of subjectivity and uncertainty. This flaw was addressed by most part of the herein reviewed works.

In 15 cases (13.0% of the total, Fig. 2g), the expert judgment was used to manually analyze the pluviographs and assess the rainfall conditions to be associated with landslides (e.g., triggering rainfall events). These occurrences concern thresholds based on intensity-duration (Dahal and Hasegawa 2008; Rosi et al. 2012) or total event rainfall duration (Sengupta et al. 2010; Vennari et al. 2014). In 23 occurrences (20.0%), software substitutes the expert judgment and identifies the most appropriate values by means of objective and reproducible approaches (algorithm approach). As instance, Chang et al. (2008) and Giannecchini et al. (2016) used logistic regression, while other authors purposely developed software (Segoni et al. 2014a; Vessia et al. 2014, 2016). Melillo et al. (2015) proposed an algorithm that reconstructs the rainfall events, identifies the rainfall conditions that have resulted in landslides, and measures the duration and the cumulated rainfall for the events based on a reduced set of parameters to account for different physical settings and operational conditions. However, the most used approach to get objective and fully reproducible measures is to set in advance a standardized criterion (59 occurrences, 51.3%). The most straightforward method is to take into account a standard duration to calculate the cumulated rainfall (Calvello et al. 2015; Garcia Urqia 2016; Bai et al. 2014) or to set standard amount of time without rainfall to define the start/end of each rainfall event (Tiranti and Rabuffetti 2010). To conclude, in 8 cases (7.0%), the points in the graph are not derived from pluviographs but are derived from physically based calculations (e.g., Salciarini et al. 2012; Papa et al. 2013; Van Asch et al. 2014; Wu et al. 2015b; Napolitano et al. 2016), while in other 8 cases (7.0%), the approach is not clearly specified (e.g., Huang and Tang 2014; Guo et al. 2016).

Threshold parameters

The most common couple of parameters used for rainfall threshold definition is intensity-duration, I-D (48.6%, Fig. 2h). This approach follows a consolidated tradition that dates back to the work of Caine (1980). It is worth highlighting that different approaches can be used to define intensity and duration, and sometimes, significant differences exist from a work to another. As an instance, Cannon et al. (2008) uses the peak intensity, Brunetti et al. (2010) uses the mean intensity, while another series of works tries to identify the I-D combination associated to the higher return period (Segoni et al. 2014a, b).

The second most used approach relies on antecedent rainfall conditions (26.8%). A more in-depth look at this category reveals that antecedent rainfall is taken into account in an overwhelming variety of ways. It is not an exaggeration stating that almost every research group uses its own approach. As an instance, Tien Bui et al. (2013) combined daily rainfall and 15-day antecedent rainfall, Saadatkhah et al. (2015) considered 3- and 30-day antecedent rainfall, Lee et al. (2015) used daily and 3-day cumulated rainfall. Moreover, some authors do not use directly the rainfall measures, but process them to calculate antecedent rainfall indexes, trying to better account for the degree of saturation of the terrain (Jaiswal and van Westen 2009; Lee et al. 2014; Ma et al. 2014; Kanjanakul et al. 2016). Another category well represented in our review uses the total rainfall accumulated during the event and its duration (15.9%). This category is more homogeneous than the previous ones and the only differences can be found in the approaches used to define the extents of the rainfall events. As instance, Peruccacci et al. (2012) defined a rainfall event as preceded/followed by a dry period of 48 or 96 h (in the dry and wet season, respectively).

Out of these categories, the reviewed literature presents a wide variety of different parameters used to define rainfall thresholds (8.7%). As instance, Martelloni et al. (2012) introduced the use of the standard deviation from the mean rainfall amount accumulated during progressively increasing time steps. Greco et al. (2013) used a mobility function defined as the convolution integral of rainfall intensity with an empirical transfer function. Vallet et al. (2016) defined a multidimensional threshold that includes groundwater recharge and several rainfall indexes.

Methods for threshold drawing/definition

We found numerous methods adopted in literature for drawing the thresholds and we decided to group them into four broad classes: manual, statistical, probabilistic, and physically based. The first two were the most adopted classes, both chosen by the 40.9% of the analyzed records (Fig. 2i). In the “manual” class, we found 47 cases in which the thresholds were actually drawn manually by delimiting the lower bound of the point cloud representing the triggering rainfall conditions (e.g., Floris and Bozzano 2008; Li et al. 2011; Ma et al. 2013; Posner and Georgakakos 2015), or by searching the best fit of the lower part of the cloud (e.g., Kanungo and Sharma 2014; Althuwaynee et al. 2015; Lainas et al. 2016). Moreover, in 20 cases, a regression was adopted to manually draw the thresholds (e.g., Marques et al. 2008; Saez et al. 2013; Nikolopoulos et al. 2014, 2015). Among the 47 thresholds defined by means of a statistical method, it is worth mentioning the frequentist method proposed by Brunetti et al. (2010) and applied, with further improvements, in several study areas in Italy (Peruccacci et al. 2012; Vennari et al. 2014; Gariano et al. 2015; Melillo et al. 2016; Piciullo et al. 2017). Among the remaining statistical approaches, partial duration series (Palenzuela et al. 2016) return time calculations (Chen and Huang 2010) or point density analysis (Garcia Urquia 2016) were used.

Furthermore, five thresholds (4.3%) were defined using probabilistic approaches, among which Bayesian analysis (Berti et al. 2012; Robbins 2016). Finally, in 7.8% of the cases, thresholds were defined using physically based approaches (e.g., Ruiz-Villanueva et al. 2011; Salciarini et al. 2012; Papa et al. 2013; Alvioli et al. 2014; Wu et al. 2015b; Napolitano et al. 2016). Regrettably, in six cases (5.2%), we were not able to identify the method adopted to draw the thresholds.

Validation

The validation of the predictive capability of the thresholds represents one of the most important issues in the process of rainfall threshold definition. Figure 3 reports, each year, the number of thresholds for which a validation procedure was adopted. Deplorably, 46 thresholds (40.0%) are presented without any analysis about their predictive capability. Conversely, 31 (27.0%) works present a validation, despite made by considering the same dataset used for rainfall calibration, and only in 38 works (33.0%) the validation were correctly evaluated with an independent dataset. For the latter works, we could assess the calibration/validation ratio (defined as the ratio between the time lengths of the two datasets). Usually, most part of the data are used to calibration, while only a limited part is left for validation. We found only 5 works with similar duration in calibration and validation datasets (Tiranti and Rabuffetti 2010; Greco et al. 2013; Martelloni et al. 2013; Staley et al. 2013; Saadatkhah et al. 2015). On the other side, in 5 works (18%), the thresholds were validated using a very limited sample (sometimes just a couple of events pertaining to a very restricted time span). In two circumstances (7%), we found a validation dataset equal to 40% of the total dataset available and typically (54% of the occurrences), the validation period was between 10 and 33% of the total.

Fig. 3
figure 3

Bar chart showing, each year, the number of thresholds for which a validation (in green) or a performance evaluation (in blue) procedure was conducted, as parts of the total number of thresholds published per year (gray, black-bordered bars). The inset in the top left shows a bar chart reporting the most used criteria for validation. Key: n.a., not available

The validation process is particularly widespread for thresholds studies with cogent applicative aims, such as the implementation in LEWSs. In these cases, the predictive effectiveness of the threshold is a key point of the research, and thus should be adequately analyzed in the articles and supported by data. Indeed, thresholds that were validated were usually implemented into operational or prototypal LEWS (75%). Conversely, thresholds that are neither evaluated nor validated are most commonly found in studies that are not related with LEWSs (68%).

A common method to validate thresholds performance was not established yet. The most used method (24.4% of the cases, Fig. 3) is to compile a contingency matrix and to combine true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP) to calculate different skill scores (Giannecchini et al. 2012; Martelloni et al. 2012; Staley et al. 2013; Segoni et al. 2014a; Peres and Cancelliere 2014; Gariano et al. 2015), like threat score (also named critical success index, Gilbert 1884; Schaefer 1990) or true skill statistics (Hanssen and Kuipers 1965). This method has a high degree of objectivity and standardization and provides a quantitative method to assess and compare the performances of different thresholds. In 5.7% of the cases, ROC analysis is devised and area under curve or other parameters are computed to assess the overall performance of the threshold. In 14.6% of the cases, the evaluation performed is very thorough and includes a quantitative comparison with different thresholds, aimed to demonstrate analytically that the threshold presented is the best one possible for the case of study at hand. The “alternate thresholds” used for comparison may include thresholds obtained with other configurations of the same threshold model, alternate rainfall threshold parameters, or even with completely different threshold models. As an instance, Althuwaynee et al. (2015) compared the performances of different antecedent rainfall durations; Frattini et al. (2009) performed a comparison between physically based and logistic regression derived ID thresholds; Lagomarsino et al. (2015) applied two completely different threshold models to the same datasets in order to define which would be more effective for a warning system. In 23 (18.7%) circumstances, we found that the model outputs and real landslide data are compared but a real confusion matrix was not built, and the work is limited to “count” only one or two parameters, such as “hits” (Gioia et al. 2015). The drawback of this method is that it cannot provide an overall overview of the performance of the threshold. In 5 cases (3.9%), other methods are used, alone or in addition to one of the aforementioned ones. Martelloni et al. (2012) compared in terms of skill scores the outputs of their models to the warning levels issued randomly; Chen and Huang (2010) computed a casualties/event ratio to demonstrate the effectiveness of the threshold based LEWS they had implemented. Furthermore, Piciullo et al. (2016, 2017) evaluated the best rainfall thresholds combination to be employed in a LEWS employing the EDuMaP method (Calvello and Piciullo 2016). Other works provide visual and qualitative assessments of threshold performances, e.g., by comparing calculated thresholds and rain paths during representative events (Papa et al. 2013) or by a visual comparison with the landslide inventory (Salciarini et al. 2012).

All these standard metrics used to perform validation have inherent uncertainties, in particular when applied in wide areas, which must be acknowledged. Gariano et al. (2015) analyzed the epistemic uncertainty related to the lack of information on landslide occurrence at regional scale and found that an underestimation of only 1% in the number of considered landslides can result in a significant decrease in the performance of validation procedure.

Major sources of incompleteness

We found many works lacking relevant information about the data used, the method of analysis, and the definition of the thresholds. Figure 4 quantifies the degree of incompleteness for every threshold information. Surprisingly, in about 25% of the analyzed thresholds, a clear description of the method adopted for the rain gauge selection is missing. In 12.2 and 10.4% of the cases, we were not able to found respectively the number of landslides and rain gauges considered for rainfall/landslides correlations. Even a simple information as the extension of the study area is missing in 13% of the analyzed cases, even if for several articles, we appraised it from the figures. Moreover, considering only the thresholds implemented in a LEWS, we found that, in 10.5% of the cases, some important details, e.g., the targets or the territorial unit subject of the warnings, were not described. Finally, the parameter with the highest degree of incompleteness is the validation: 40.0% of the thresholds were not accompanied by a validation process, thus making impossible an adequate assessment of their predictive capability and their correspondence with experimental data.

Fig. 4
figure 4

Percentage of unspecified information about threshold features. aThese percentages are calculated considering only the thresholds implemented in a LEWS, for which those parameters are evaluable

Concluding remarks: Lessons learnt and advices

The process of rainfall thresholds definition presents critical issues. There are important steps that cannot be neglected in order to obtain replicable and reliable thresholds with a high predictive capability, more so if an implementation in a LEWS is proposed. The description of the criterion used for landslide-rain gauge association, as well as the clear explanation of rainfall parameters extraction from the rainfall records, needs particular attention. Regrettably, a lack of these details was highlighted respectively in the 25.2 and 7% of analyzed papers. Our review highlights that a consistent part of the scientific research is progressing towards the development of methods aiming at automatizing the most relevant phases of the threshold analysis for reducing uncertainties, avoiding subjectivity and fostering the full reproducibility of the method. In the years before 2007, this was one of the drawbacks of literature on rainfall thresholds, but this gap is going to be filled by algorithms and standardized procedures that were successfully tested and consolidated through many cases of study. A standardized procedure is recommended in particular if the thresholds will be implemented into a LEWS, in order to ensure objectivity and reproducibility of the proposed method and to foster a rapid and periodic update of the thresholds.

However, the reliability of a threshold does not depend solely on the applied method, but also on the quantity and quality of the input data. Therefore, every research article or technical repot should define clearly: the number and type of landslides, the features of the study area (including area extension and geomorphological context), the rainfall and landslide data sources, the period of analysis, the rainfall measurements density and their temporal resolution. These details are also important in order to assess the feasibility to apply a threshold, or at least the method used for its definition, to other case studies.

Regarding the landslide database, only few attempts were made to establish a minimum number of landslides to be used for threshold analysis. Thus, we recommend that every possible effort should be made in compiling the widest and most complete database possible for the study area. Accessing multiple sources of information can be very useful in this regard, even if the validation process of the data, in order to avoid repetition and/or errors, could be time consuming. In this regard, a branch of the research has already started to define methods to compile landslide inventories with powerful remote sensing instruments, with datamining on internet news, or with timely and well-planned post-event field surveys. Unfortunately, most of the data are available only for a restricted number of scientists, technicians, and insiders. A consistent progress in the research could be fostered simply by sharing data and providing open access to local and governmental databases.

Regarding the landslide type, our review revealed that thresholds were defined both for landslides of a very specific kind and for populations of landslides of mixed kind, even if the latter case was used mainly where not enough information were available and/or to increase the number of landslides to consider for rainfall/landslide correlations. Moreover, our review identified some other techniques to increase the reliability of thresholds, by partitioning the study area into subzones with geomorphological homogeneity. We recommend a partition of the study area in particular for the definition of regional thresholds to be employed over large areas.

Another improvement in rainfall threshold research consists in the use of the most recent technological advances concerning instruments capable of rain measurements with fine and finer spatial and temporal resolution. The advances in satellite and ground-based radar technology are providing a great support to rainfall threshold research; however, at present, rain gauges are still the most common instrument used. Rainfall measurements with very fine temporal resolutions are useful only in case of thresholds based on peak intensity, while for thresholds based on total event rainfall or rainfall accumulated over daily (or larger) intervals, hourly resolution is sufficient. Conversely, regarding spatial resolution, the highest possible density of measurements is desirable, because it is possible to better account for the spatial variability of rainfall patterns, especially during convective rainstorms striking a complex orographic system. Even if approaches exist to define a single reference rain gauge for large areas, the possibility of selecting the best instrument among many different ones helps obtaining good results. In case a rain gauge network does not guarantee an adequate spatial and temporal resolution of measurements, we suggest considering the possibility of using satellite data, whose availability has greatly increased worldwide.

Finally, a critical key point in the scientific papers regarding rainfall thresholds published in the period 2008–2016 concerns the validation procedure. Indeed, only in 38 works (33.0%) out of 115 analyzed herein, a validation was performed against an independent dataset, and for the 40.0% of the thresholds reviewed, the validation process was not carried out at all. Validation is an essential part of the scientific research, necessary to verify the results of a method, and rainfall thresholds cannot be an exception, especially if used for applicative purpose in LEWSs. Thus, we recommend a quantitative validation based on single or multiple parameters that could easily measure the predictive capability of the thresholds. Concerning the most used criteria for validation, our review revealed that ROC curves and skill scores derived from confusion matrixes are becoming a standard. Comparisons with different thresholds present in literature are useful to comment the physical and geomorphological meaning of the rainfall-landslide correlation, but cannot be considered as a method of validation. On the contrary, a robust method to demonstrate the reliability of the thresholds is the application of a different method (i.e., different rainfall variables, alternate approaches or different statistical models) to the same data set and to compare the results obtained.

Lastly, we point out that the research on rainfall threshold is still ongoing and there is need to define objective and reproducible thresholds, and to strengthen the societal perception of their reliability in managing landslide hazard. The analysis of the recent literature allowed defining some standards and good practices, which we propose to use as guidelines both for further research and for technical applications.