1 Introduction

Moist deep convection in the atmosphere (Stevens 2005), manifesting in storms of all scales, is responsible for the most severe precipitation events. However, as convection is by nature scarce in space and time, it is challenging to describe its properties, namely the fluxes of heat, momentum and water, appropriately. It is thus common to estimate convective heavy or extreme precipitation by, e.g., the 99th or 99.9th percentile of the hourly precipitation rate, or to apply thresholds to the precipitation rate, such that the frequency of their exceedance is representative of the frequency of extreme events (Ban et al. 2020; Pichelli et al. 2021). However, statistical analyses in the Eulerian frame of reference remain limited to the description of the time series of each grid cell separately, and no information is retained about the underlying events and their spatial structure. Instead, information about the convective events themselves can be obtained through the application of a tracking algorithm, here referred to simply as a tracker. In this way, precipitation events are identified and tracked in time, meaning that the analysis is transferred from the Eulerian into the Lagrangian frame of reference. Through the use of trackers, the precipitation events themselves and their properties are in focus, rather than the conditions at specific locations. Many studies (Prein et al. 2017a; Crook et al. 2019; Purr et al. 2019; Guo et al. 2022) have shown the benefit of this idea.
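
As a minimal sketch of the Eulerian statistics mentioned in this paragraph, the following Python snippet computes a high percentile and a threshold exceedance frequency for the hourly series of a single grid cell. The function names, the nearest-rank percentile definition, the 5 mm/h example threshold and the sample values are illustrative assumptions and are not taken from the cited studies.

```python
import math

def percentile_nearest_rank(values, q):
    """Return the q-th percentile (0 < q <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100.0)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def exceedance_frequency(values, threshold):
    """Fraction of time steps whose precipitation rate exceeds the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)

# Hypothetical hourly precipitation rates [mm/h] at one grid cell (100 hours).
hourly_pr = [0.0] * 95 + [0.4, 1.2, 3.0, 7.5, 12.0]

p99 = percentile_nearest_rank(hourly_pr, 99.0)
freq = exceedance_frequency(hourly_pr, 5.0)
print(p99, freq)
```

Both numbers describe one grid cell only; they carry no information on which hours belong to the same storm, which is the gap a tracker is meant to fill.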

Tracking algorithms were originally developed in order to evaluate precipitation events in numerical weather predictions (Davis et al. 2006a, b; Wernli et al. 2008; Johnson et al. 2013). Using a tracker, modelled precipitation objects can be compared against observations regarding their spatial structure, intensity, propagation and location. Any model output or observational field that is typically associated with a precipitation event may serve as the tracker input field. Although the precipitation rate itself is typically used, indirect proxies such as outgoing longwave radiation (Morel and Senesi 2002a; Chen et al. 2019), or mid-level vertical velocity or vorticity, are also suitable. At even finer resolutions, individual updrafts of convective systems can be analyzed, along with their merging and splitting dynamics (Moseley et al. 2013). Another important application of tracking algorithms is the detection and observation of tropical and extra-tropical cyclones (Neu et al. 2013). Furthermore, droughts are also operationally monitored using trackers (Abatzoglou et al. 2017).

The dynamical downscaling approach allows the impact of climate change to be investigated on local scales and actionable information to be derived for a variety of sectors (Giorgi 2019, 2020). During the last decades, convection-permitting Regional Climate Models (cpRCMs) were established, solving the non-hydrostatic equations of the atmosphere on grids with horizontal grid spacing smaller than 4 km and allowing flawed parameterizations of deep convection to be switched off (Prein et al. 2015). At first, year-long integrations of distinct regions were realized (Grell et al. 2000), then decade-long integrations (Rasmussen et al. 2011) and decade-long integrations of entire continents (Prein et al. 2017a), and recently robust ensembles of cpRCMs (Coppola et al. 2020; Kendon et al. 2021) have been achieved. cpRCMs brought great advances and their capabilities still need further exploration (Lucas-Picher et al. 2021): significant added value (Rummukainen 2016; Ciarlo et al. 2020) lies in the representation of precipitation, in particular regarding its diurnal cycle, intensity and extremes (Ban et al. 2015; Kendon et al. 2017; Fumière et al. 2020; Reder et al. 2022) and over complex orography (Reder et al. 2020; Adinolfi et al. 2020). Moreover, owing to a more realistic orography, improvements are also found concerning surface temperature (Hohenegger et al. 2008) and mesoscale wind systems (Belušić et al. 2018). Through the application of trackers to cpRCMs, the climate change signal of convective storms can be analyzed in great detail (Prein et al. 2017a, b; Purr et al. 2019). Along with advances in model development, novel observational precipitation datasets based on Doppler radar measurements have emerged; they provide comparable spatial and temporal resolution and allow for a rigorous evaluation of cpRCMs. Still, their impact on the evaluation of cpRCMs must be considered carefully (Prein and Gobiet 2017d).

The present study uses an ensemble of cpRCMs developed by the CORDEX Flagship Pilot Study on Convective phenomena at high resolution over Europe and the Mediterranean (CORDEX-FPSCONV; Coppola et al. 2020), conducted on a domain covering the Alps and the northern Mediterranean Sea. This region is renowned for its severe precipitation events (Drobinski et al. 2014) and for being a climate change hotspot (Giorgi 2006). Several research groups set up trackers in order to evaluate the behaviour of cpRCMs in simulating precipitation events and storms. This study takes advantage of this opportunity and carries out an inter-comparison study on a set of four trackers. They are applied both to the model ensemble’s evaluation runs, driven by ERA-Interim reanalysis data, and to a composite of high-resolution observational datasets. Thus the scientific objective of this study is two-fold:

  1. Tracker-Inter-comparison: we inter-compare four different trackers in order to find out about their reliability: do different trackers yield the same scientific conclusions?

  2. Model-Evaluation: we evaluate an ensemble of convection-permitting regional climate models against observations in the Lagrangian frame of reference by using the trackers, and focus on the following aspects:

    • How good are cpRCMs at simulating the basic properties of precipitation events (intensity, spatial and temporal scales, rain volume)?

    • How good are cpRCMs at simulating the spatial patterns and the annual cycle of basic properties of precipitation events?

The paper is organized as follows. In Sect. 2 we introduce the model ensemble and observational datasets used, as well as the two historic events that serve as case studies. In Sect. 3 we explain the workflow of the tracking algorithms and motivate the setup chosen in order to identify the precipitation events of interest. In Sect. 4 we present our results concerning the tracker inter-comparison and in Sect. 5 we present results on the model ensemble evaluation. Finally we summarize our findings and give conclusions in Sect. 6.

2 Model ensemble, observational datasets and historical events

In this section we briefly describe the CORDEX-FPSCONV ensemble of cpRCMs as well as the composite of datasets of precipitation measurements that serve as the input fields for the tracker analyses in this study. Moreover, we introduce two historical heavy precipitation events that we use as case studies for the tracker inter-comparison.

2.1 Ensemble of convection-permitting regional climate models

The CORDEX-FPSCONV community produced a first-of-its-kind ensemble of cpRCMs for the domain studied herein, covering the Alps and the northern Mediterranean (Coppola et al. 2020). Its Eulerian evaluation of precipitation is found in Ban et al. (2020) and Pichelli et al. (2021), which we here build upon and extend into the Lagrangian frame of reference. Importantly, both studies demonstrate how the cpRCMs reduce model biases in comparison to the driving RCMs. Detailed information on the models can also be found therein. We here analyze the evaluation runs, whose boundary conditions are derived from ERA-Interim reanalysis data through intermediate driving simulations at coarser resolution (RCM) (Ban et al. 2020). The ensemble contains several members using the COSMO-CLMcom and WRF model, which differ in their nesting strategy and physics parameterizations, respectively. The model ensemble is summarized in Table 1.

Table 1 Summary of numerical models used in this study

Prior to the tracker analysis we remapped each of the models from its native grid onto the analysis grid ALP-3i, using distance-weighted average remapping. ALP-3i is a “regular lat-lon grid”, spanning in longitude from \(1^{\circ }\hbox {E}\) to \(17^{\circ }\hbox {E}\) in 582 grid cells, and in latitude from \(40^{\circ }\hbox {N}\) to \(50^{\circ }\hbox {N}\) in 364 grid cells. This results in a grid spacing of about \(0.0275^{\circ }\) in both latitude and longitude, which translates on average to about 3 km.
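
The grid spacing quoted above follows directly from the domain extent and the number of cells. The short Python sketch below reproduces this arithmetic and converts the angular spacing to kilometres. The spherical-Earth radius and the reference latitude of \(45^{\circ }\hbox {N}\) are assumptions made here for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation (assumption)

def grid_spacing_deg(start_deg, end_deg, n_cells):
    """Angular spacing of a regular grid spanning [start_deg, end_deg] in n_cells cells."""
    return (end_deg - start_deg) / n_cells

def spacing_km(spacing_deg, latitude_deg=None):
    """Convert an angular spacing to km; pass latitude_deg for east-west spacing."""
    km = math.radians(spacing_deg) * EARTH_RADIUS_KM
    if latitude_deg is not None:
        km *= math.cos(math.radians(latitude_deg))  # meridians converge poleward
    return km

dlon = grid_spacing_deg(1.0, 17.0, 582)   # longitude spacing [deg]
dlat = grid_spacing_deg(40.0, 50.0, 364)  # latitude spacing [deg]
dy_km = spacing_km(dlat)                      # north-south spacing [km]
dx_km = spacing_km(dlon, latitude_deg=45.0)   # east-west spacing at 45N [km]
print(round(dlon, 4), round(dlat, 4), round(dy_km, 2), round(dx_km, 2))
```

Under these assumptions the north-south spacing is close to 3 km, while the east-west spacing is somewhat smaller in the middle of the domain because it scales with the cosine of latitude.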

2.2 High-resolution observational datasets of precipitation

We use a composite of four observational datasets of precipitation covering France, Germany, Switzerland and Italy, respectively, over a common time period from 2001 to the end of 2009. Their original spatial resolution is comparable to that of the convection-permitting models, with native grid spacings ranging from 1 to 3 km, and their temporal resolution is hourly. Thus the observational datasets can be compared directly to the hourly precipitation rates of the models. All of the datasets except one (GRIPHO) are based upon Doppler radar measurements adjusted with rain gauge measurements. The spatial and temporal resolutions of these datasets are the highest currently available for the respective regions. Still, Doppler radar observations are known to systematically underestimate precipitation amounts over mountainous terrain, e.g. through the shielding effect (Germann et al. 2022), and to underestimate particularly heavy precipitation rates (Schleiss et al. 2020). Further, rain gauges also under-catch orographic precipitation and are moreover affected by windy conditions (La Barbera et al. 2002). Furthermore, interpolation methods used to map station data onto regular grids induce an underestimation of high intensities (smoothing effect) and an overestimation of low intensities (moist extension into dry areas) (Isotta et al. 2014). A brief summary of the individual datasets, including their specific spatial resolution and references, is given in Table 2.

Table 2 Summary of observational datasets of hourly precipitation rate used in this study

Prior to the tracker analysis each of the datasets was remapped onto the analysis grid ALP-3i, again using distance-weighted average remapping. We then merged them, using their arithmetic mean in regions along the national borders where measurements overlap. In this way, both the observations and the models were mapped onto the same grid.
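
The merging step can be illustrated with a minimal Python sketch: for every grid cell of the common analysis grid, the composite takes the arithmetic mean over all datasets that provide a value there. Representing missing data by `None`, the function name and the sample values are assumptions made for this illustration.

```python
def merge_composite(fields):
    """Cell-wise arithmetic mean over all datasets with valid data (None = no data)."""
    n_rows, n_cols = len(fields[0]), len(fields[0][0])
    merged = [[None] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            valid = [f[i][j] for f in fields if f[i][j] is not None]
            if valid:  # cells covered by no dataset stay None
                merged[i][j] = sum(valid) / len(valid)
    return merged

# Two hypothetical national datasets on the same 1 x 3 analysis grid [mm/h].
france = [[2.0, 4.0, None]]
italy = [[None, 6.0, 1.0]]

composite = merge_composite([france, italy])
print(composite)
```

Only the middle cell is covered by both datasets, so it is the only one that is actually averaged; elsewhere the single available value is passed through.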

GRIPHO over Italy and a posteriori masking

We here point out two shortcomings of our analysis and show how we deal with them when interpreting the results.

Firstly, the observational dataset covering Italy, GRIPHO, is based solely on quality-controlled rain gauge measurements. The station density is greater in the north than in the south of Italy, and on average it is estimated at about one station per \(9\times 9\,\hbox {km}^2\). The data are then remapped onto a convection-permitting grid with a \(3\,\hbox {km}\) grid spacing. In comparison, the other datasets are based upon Doppler radar measurements and rain gauges, and their original spatial resolution is even finer than that of the analysis grid. We must therefore expect differences in the spatio-temporal characterization of the precipitation field observed by GRIPHO with respect to the other datasets. Nonetheless, GRIPHO is the most accurate observational dataset available for Italy, and in particular the representation of extreme events was found to be improved (Fantini 2019), especially over Northern Italy, where the station density is higher and where the most extreme precipitation events occur. In terms of domain complexity, we note that Italy is surrounded by the Mediterranean Sea and the Alps, is intersected by the Apennine Mountains, and features both steep coastlines and a large plain area (Po Valley). Due to this high degree of complexity, which translates into very complex and local interactions, the precipitation events are renowned for being particularly severe and their modelling for being particularly challenging (see e.g. Morgan 1973; Buzzi and Alberoni 1992; Medina and Houze Jr 2003; Rotunno and Houze 2007; Panziera et al. 2015; Miglietta et al. 2016; Pichelli et al. 2017).

Secondly, the observations do not cover the entire domain simulated by the models, in particular not the Mediterranean Sea. We account for this by applying a mask a posteriori to the tracker analyses of the models, meaning that only tracks whose centroid is located within the domain of the observations are considered. This implies that, in the models, events entering the observational domain, and particularly those making landfall, are expected to be overestimated in their scales, but only little in their averaged properties. Note further that for the Swiss dataset RDisaggH no data are available for the period up to June 2003, which is also accounted for through masking.
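
A minimal sketch of such an a posteriori mask is given below. It keeps a track only if its centroid falls into a grid cell covered by the observations. The data layout, the function names and the use of an unweighted centroid over all cells of a track are assumptions for illustration; the trackers may define the centroid differently (e.g. weighted by precipitation).

```python
def centroid(track_cells):
    """Unweighted centroid (row, col) of all grid cells belonging to a track."""
    n = len(track_cells)
    return (sum(r for r, _ in track_cells) / n,
            sum(c for _, c in track_cells) / n)

def filter_tracks(tracks, obs_mask):
    """Keep only tracks whose centroid lies in a cell covered by observations."""
    kept = {}
    for track_id, cells in tracks.items():
        row, col = centroid(cells)
        if obs_mask[round(row)][round(col)]:
            kept[track_id] = cells
    return kept

# Hypothetical 2 x 4 domain: observations cover the two western columns only.
obs_mask = [[True, True, False, False],
            [True, True, False, False]]
tracks = {
    1: [(0, 0), (0, 1), (1, 0)],  # centroid over the observed area
    2: [(0, 2), (0, 3), (1, 3)],  # centroid over the unobserved sea
}

kept = filter_tracks(tracks, obs_mask)
print(sorted(kept))
```

Note that a retained track may still contain cells outside the observed area, which is why landfalling events can appear larger in the models than in the observations.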

We account for both shortcomings by presenting the relative biases of the model ensemble not only for the entire domain of observations, but also separately without GRIPHO as well as for GRIPHO exclusively, which we consider representative of the most extreme precipitation events within the domain. In this way we can distinguish the specific model biases due to the complex Italian domain from the specific biases associated with the GRIPHO drawbacks. Further, the overestimation of landfalling tracks can be estimated in this way, because by excluding GRIPHO we also exclude the greatest part of the coastline.

2.3 Historical heavy precipitation events

In the following we introduce two historical heavy precipitation events that share several characteristics: both occurred along the Mediterranean coastline, both affected regions show steep orographic features and both happened in autumn. Coincidentally, they both occurred along the same degree of latitude, and one happened just a little more than one year after the other.

2.3.1 Gard, France in September 2002

The first case study is a heavy precipitation event that occurred in south-eastern France, in the Gard region, during the 8th and 9th of September 2002 (Delrieu et al. 2005; Chancibault et al. 2006). Lasting more than a day, the event was particularly remarkable due to its rain amounts greater than 200 mm within 24 h spread over an area of 5500 \(\hbox {km}^2\). The maximum rainfall totals of 600–700 mm observed locally by rain gauges are among the highest daily records in the region. The propitious slow-evolving synoptic-scale situation combined an upper-level south-westerly diffluent flow over south-eastern France with a moist and warm low-level south-easterly flow. The rain event can be characterized by three phases (Delrieu et al. 2005): first, a Mesoscale Convective System (MCS) developed over the Gard plains; second, the MCS was displaced toward the Cévennes mountain ridge; and third, the passage of a cold front with embedded convection swept the convective activity out of the region. This catastrophic event resulted in 24 fatalities and an economic damage estimated at 1.2 billion € (Sauvagnargues-Lesage 2004).

2.3.2 Carrara, Italy in September 2003

The second case study is a heavy precipitation event that happened in Carrara, Italy, in September 2003, and which caused severe flash flooding and landslides. Cortopassi and Daddi (2008) investigated how the extensive quarrying activities of the region destabilize the terrain and promote hydro-geological hazards. The event may be described as a landfalling convective system. A trough extending from a well-structured low pressure system centered over Northern Europe advected hot and humid air from the Mediterranean Sea and provided large-scale lifting. At the steep orography of the Apuan Alps the convective instability was triggered and the propagation of the storms was blocked. As a consequence, the region was affected by torrential rain, accumulating up to about \(200\,\hbox {mm}\) within a period of only 2 h. The event claimed two lives and caused major damage to the local infrastructure.

3 Trackers

We here describe the basic functionality of the four tracking algorithms investigated in this study, which are referred to as MODE, OSIRIS, DYMECS and celltrack. Their functionalities are summarized in Table 5 and we provide a detailed description of each tracker in the appendix (Sec. 1). The trackers are completely independent developments and are here applied with setups that are as similar as possible, in order to compare the same precipitation objects. In Sect. 4, we compare the trackers individually against each other, while in Sect. 5, we evaluate the model ensemble by the mean of all four trackers, which we refer to as the “tracker ensemble mean”.

The 1-hour accumulated precipitation fields (from observations or models) are used as input for all trackers. The principal operations of all trackers investigated comprise a spatial smoothing of the input field, followed by a step of masking through a specified threshold, a step of clustering in space to form objects, and finally the tracking of these objects in time to form tracks. The treatment of cell merging and splitting is done by the allocation of metatracks, which can be understood as the smaller branches of merged or split tracks. This functionality is not available in all trackers.

We designed the tracker setup such that it is able to identify precipitation events that cause high-impact weather situations like flash floods. For this reason we chose the (relatively large) precipitation threshold of \(5\,\hbox {mm}\,\hbox {h}^{-1}\). On the other hand, we want to investigate the small-scale isolated thunderstorms that cpRCMs are capable of resolving, and to this end we chose a (relatively small) minimum space-time volume threshold of 100 cells. The input field is smoothed across \(3\times 3\) grid cells prior to the analysis.
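
The generic workflow described above (smoothing, thresholding, clustering, tracking, minimum volume filter) can be sketched in a few lines of Python. This is not the implementation of any of the four trackers: the box-average smoothing truncated at the domain edges, the `>=` comparison, and the use of face-connected regions in (time, y, x) as a combined clustering and tracking step are simplifying assumptions, and all names are hypothetical. Merging and splitting (metatracks) are not handled.

```python
from collections import deque

# Face-connected neighbours in (t, y, x); an assumption of this sketch.
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def smooth_3x3(field):
    """Box-average each cell over its 3x3 neighbourhood (truncated at the edges)."""
    ny, nx = len(field), len(field[0])
    out = [[0.0] * nx for _ in range(ny)]
    for y in range(ny):
        for x in range(nx):
            window = [field[j][i]
                      for j in range(max(y - 1, 0), min(y + 2, ny))
                      for i in range(max(x - 1, 0), min(x + 2, nx))]
            out[y][x] = sum(window) / len(window)
    return out

def track_objects(fields, threshold=5.0, min_volume=100):
    """Return {track_id: [(t, y, x), ...]} for connected regions above threshold.

    fields: list over hourly time steps of 2-D lists of precipitation [mm/h].
    Regions with fewer than min_volume space-time cells are discarded.
    """
    masks = [[[v >= threshold for v in row] for row in smooth_3x3(f)]
             for f in fields]
    nt, ny, nx = len(masks), len(masks[0]), len(masks[0][0])
    seen, tracks = set(), {}
    for t in range(nt):
        for y in range(ny):
            for x in range(nx):
                if not masks[t][y][x] or (t, y, x) in seen:
                    continue
                # Breadth-first search collects one space-time connected region.
                queue, cells = deque([(t, y, x)]), []
                seen.add((t, y, x))
                while queue:
                    ct, cy, cx = queue.popleft()
                    cells.append((ct, cy, cx))
                    for dt, dy, dx in NEIGHBOURS:
                        n = (ct + dt, cy + dy, cx + dx)
                        if (0 <= n[0] < nt and 0 <= n[1] < ny and 0 <= n[2] < nx
                                and masks[n[0]][n[1]][n[2]] and n not in seen):
                            seen.add(n)
                            queue.append(n)
                if len(cells) >= min_volume:
                    tracks[len(tracks) + 1] = cells
    return tracks

# Hypothetical test case: a stationary 4 x 4 block of 10 mm/h over three hours.
fields = [[[10.0 if 2 <= y <= 5 and 2 <= x <= 5 else 0.0 for x in range(8)]
           for y in range(8)] for _ in range(3)]

tracks = track_objects(fields, threshold=5.0, min_volume=30)
print({track_id: len(cells) for track_id, cells in tracks.items()})
```

In this example the smoothing lowers the corner cells of the block below the threshold, so each hourly object consists of 12 cells. With the setup of this study (minimum volume of 100 cells) the same synthetic event would be discarded, which illustrates how the two thresholds jointly determine which events are retained.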

The common tracker setup is summarized in Table 3:

Table 3 Summary of the tracker setup

The characteristic track properties we are investigating in this study are defined in Table 4, with \(\mathrm {pr}\) describing the precipitation field of a track (Table 5):

Table 4 Definitions of characteristic track properties
Table 5 Summary of Trackers investigated in this study: “Institute” denotes the group executing the analysis using “Tracker”, representing the abbreviation of the respective tracker
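
To make the track properties used in the following sections more tangible, the Python sketch below derives them for a single track. The formulas are plausible assumptions chosen for illustration and should not be read as the definitions of Table 4; in particular, the constant cell area of 9 \(\hbox {km}^2\), the hourly time step and the data layout are hypothetical.

```python
def track_properties(cells, pr, cell_area_km2=9.0):
    """Summarize one track.

    cells: list of (t, y, x) tuples with hourly time index t.
    pr:    dict mapping (t, y, x) to precipitation rate [mm/h].
    """
    rates = [pr[c] for c in cells]
    duration_h = len({t for t, _, _ in cells})  # number of hourly steps covered
    space_time_volume = len(cells) * cell_area_km2  # [km^2 h]
    return {
        "duration_h": duration_h,
        "mean_area_km2": space_time_volume / duration_h,
        "space_time_volume_km2h": space_time_volume,
        # 1 mm of rain on 1 km^2 corresponds to 1000 m^3 of water.
        "rain_volume_m3": sum(rates) * cell_area_km2 * 1000.0,
        "mean_pr_mmh": sum(rates) / len(rates),
        "max_pr_mmh": max(rates),
    }

# Hypothetical track: two cells in each of two consecutive hours.
cells = [(0, 0, 0), (0, 0, 1), (1, 0, 1), (1, 0, 2)]
pr = {(0, 0, 0): 6.0, (0, 0, 1): 8.0, (1, 0, 1): 10.0, (1, 0, 2): 6.0}

props = track_properties(cells, pr)
print(props)
```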

4 Results on tracker inter-comparison

In this section we inter-compare the four trackers in two steps: first, we compare their performance at analyzing the two historic events, Gard 2002 and Carrara 2003 (Sect. 4.1), and second, we compare their climatological properties, derived from the tracker-analyses of the entire 9-year periods of observations and model ensemble (Sect. 4.2).

4.1 Tracker inter-comparison using case studies Gard 2002 and Carrara 2003

We apply the four trackers to the observational dataset and investigate only the region and time periods of the respective historic events. In Fig. 1, we show for both events the accumulated total precipitation along with the location of tracks and their respective rain volume. Note that we do not show the full path of propagation of the tracks: for the stationary precipitation systems investigated here, the paths of propagation appear erratic and add no relevant insight. In general we recommend that propagation features of multi-celled convective systems (e.g. distance travelled or propagation velocity) be interpreted with caution, as the correct identification of their center is challenging. In Table 6, we summarize the properties of all tracks associated with the two historic events.

Table 6 Averaged and integrated track properties, as defined in Table 4, associated with historic events Gard 2002 and Carrara 2003

We first note that all of our trackers identify both historic events and attribute several tracks to them. OSIRIS and particularly DYMECS identify more tracks than MTD and celltrack, which can be explained for DYMECS by the allocation of metatracks in case of track merging or splitting. We find that the number of tracks identified is reflected in the sum of duration and the sum of mean track area. For each event, one main track is responsible for the major part of the rain volume, and all the trackers agree well on these most severe tracks. Focusing on intensities, the lower intensities obtained with OSIRIS can be explained by the calculation of the diagnostics on the smoothed precipitation field.

Still overall and as listed in Table 7, we find that all trackers agree on the following relations:

Table 7 Qualitative attribution of track properties to the two historical events, Carrara 2003 and Gard 2002

Thus, all trackers describe the two events with equivalent track properties and moreover the properties attributed and the relations found agree well with the description of the events in literature: the smaller and less intense event, Carrara 2003, is attributed with smaller spatial scales and less intensity than the larger and more intense event, Gard 2002. Based upon these results, the choice of the trackers seems irrelevant, meaning that from any of the trackers’ analyses, equivalent scientific conclusions would be derived. In other words, with the differences being only of quantitative nature, the scientific conclusions are found to be independent of the choice of the tracker.

Fig. 1

Historical events Gard 2002 (left panel) and Carrara 2003 (right panel) using the observations. Shading illustrates accumulated total precipitation, P(total) [mm]. Filled circles indicate the location of the centroid of a track and their radius is proportional to their respective rain volume. Filled contours indicate the elevation of the model terrain in intervals of 250 m

4.2 Tracker inter-comparison using the climatologies of model ensemble and observations

We continue the tracker inter-comparison on the climatological scale, using the ensemble of cpRCM simulations and the observations for the common time period 2001–2009. Table 8 shows the climatological means of characteristic track properties of each tracker and, in Fig. 2, we show the relative biases of each tracker for the mean and 90th percentile of track properties with respect to the observations.
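
The relative bias used throughout the evaluation is \((ModEns - Obs)/Obs\), expressed in percent. A minimal Python sketch follows; the property names and the numbers are hypothetical and serve only to show the sign convention.

```python
def relative_bias(model_value, obs_value):
    """Relative bias of a model statistic with respect to observations [%]."""
    return 100.0 * (model_value - obs_value) / obs_value

# Hypothetical climatological means of two track properties.
model_ensemble = {"duration_h": 4.5, "mean_area_km2": 360.0}
observations = {"duration_h": 3.0, "mean_area_km2": 400.0}

biases = {name: relative_bias(model_ensemble[name], observations[name])
          for name in observations}
print(biases)
```

A positive value indicates an overestimation by the models, a negative value an underestimation. The same function applies unchanged to the 90th percentile of a property instead of its mean.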

Table 8 Climatology of track properties derived from both the model ensemble and the observations
Fig. 2

Relative bias of a the mean and b the 90th percentile of characteristic track properties of each tracker with respect to the observations, \(\frac{ModEns\,\,-\,\,Obs}{Obs}\) [%]. Black solid lines are increments of \(\pm 5\%\), with the thick black line representing the tracker ensemble mean of the observations, i.e. 0%

With respect to characteristic track properties, for both their mean and 90th percentile, all trackers identify the following qualitative biases, shown in Table 9, when comparing the model ensemble against observations:

Table 9 Qualitative biases of characteristic track properties that are consistent across all trackers
Fig. 3

Annual Cycle of track occurrence frequency, OF [month\(^{-1}\)], for all trackers analyzing the observations. Error bars indicate inter-annual variability by the temporal standard deviation

This means that all trackers derive for all properties of precipitation events the same qualitative biases, but these can differ in magnitude.

Looking only at tracker results of the observation-based climatology (Table 8), the characteristic track properties are overall similar between trackers. Particularly mean area, mean and maximum precipitation rates and duration are estimated similarly by the trackers. Some pronounced quantitative differences can still be found and attributed to tracker characteristics: firstly, due to the allocation of metatracks at track merging and splitting, the number of tracks is highest with DYMECS, whereas the space-time volume is smallest. Secondly, because of the calculation of the characteristics from the smoothed field, the intensities are lowest with OSIRIS.

Figure 3 shows the annual cycle of track occurrence frequency identified in the observations. For all four trackers, we find that the distribution is unimodal, with a peak in August and a minimum in February. Similarly to what we found for the two single historic events in Sect. 4.1, a tracker that allocates metatracks at splitting and merging (DYMECS) identifies more tracks in total than those that do not (MTD, celltrack and OSIRIS). It also shows greater inter-annual variability. Again, despite the differences of quantitative nature among trackers, the scientific conclusions when comparing climatologies of model ensemble and observations are mainly independent of the choice of the tracker.
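
An annual cycle of occurrence frequency with its inter-annual variability can be obtained by counting tracks per calendar month and year, and then taking the mean and the standard deviation across years for each month. The sketch below illustrates this. Assigning each track to the month in which it starts and using the population standard deviation are assumptions of this example, and the sample data are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def annual_cycle(track_start_dates, years):
    """Monthly mean occurrence frequency and inter-annual standard deviation.

    track_start_dates: iterable of (year, month) tuples, one per track.
    Returns {month: (mean count per month, standard deviation across years)}.
    """
    counts = Counter(track_start_dates)
    cycle = {}
    for month in range(1, 13):
        per_year = [counts.get((year, month), 0) for year in years]
        cycle[month] = (mean(per_year), pstdev(per_year))
    return cycle

# Hypothetical sample: many August tracks, a single February track.
dates = [(2001, 8)] * 4 + [(2002, 8)] * 6 + [(2001, 2)]

cycle = annual_cycle(dates, years=[2001, 2002])
peak_month = max(cycle, key=lambda m: cycle[m][0])
print(peak_month, cycle[8], cycle[2])
```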

Fig. 4

Annual cycles for the tracker ensemble mean of both the model ensemble (dashed line) and observations (solid line), with panel a showing track occurrence frequency, OF [\(\hbox {month}^{-1}\)], panel b accumulated precipitation of tracks, P(tracks) [mm \(\hbox {month}^{-1}\)], panel c heavy precipitation fraction, P(tracks)/P(total) [%], and panel d accumulated total precipitation, P(total) [mm \(\hbox {month}^{-1}\)]. Error bars for the model ensemble indicate the temporal standard deviation of the model ensemble mean across years, and likewise for the observations error bars display inter-annual variability by the temporal standard deviation across years

5 Results on model evaluation

In this section, we evaluate the representation of precipitation events in the cpRCM ensemble. To this end, we use the mean of the tracker analyses (tracker ensemble mean) and compare the entire 9-year period of the model ensemble against the composite of observations.

In Fig. 4a we show the annual cycle of track occurrence for the tracker ensemble mean, of both the model ensemble and observations. We find that the number of events occurring in spring and fall is overestimated, whereas for July and August the occurrence frequency of tracks in the models is close to that of the observations. The annual cycle of the model ensemble shows two peaks, one in June and one in August, whereas the annual cycle of the observations is unimodal. With respect to the estimate of inter-annual variability, given by the temporal variance across years, we find that the model ensemble exceeds the observations.

Figure 4b and d show the accumulated precipitation of tracks (\(\hbox {P}_\mathrm {T}\), labelled P(tracks) in Fig. 4) and the total accumulated precipitation (P(total)), whereas panel c shows their fraction. In this domain and time period there is no pronounced annual cycle in P(total). If anything, it is the models that show a dry summer relative to a wet winter. In other words, the model ensemble overestimates P(total) in winter and underestimates it in summer. In contrast, \(\hbox {P}_\mathrm {T}\) shows a strong seasonality, with the model ensemble showing a broad peak from May to November and the observations peaking from July to October. Here we find an overestimation of \(\hbox {P}_\mathrm {T}\) throughout the whole year. Consequently their fraction, \(\hbox {P}_\mathrm {T}\)/P(total), also shows strong seasonality, again with a peak in summer, and once more we identify a substantial overestimation by the model ensemble. Moreover, we see from this that the chosen setup attributes only about 5% (observations) and 10% (model ensemble) of the annual precipitation amount to tracks. This overestimation of intense precipitation was already found by Berthou et al. (2018) and Meredith et al. (2020). Finally, we also see that this tracker setup serves well in identifying heavy precipitation events, as the fraction of the precipitation amount identified is relatively low.
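
The heavy precipitation fraction is simply the ratio of the precipitation accumulated within tracks to the total accumulated precipitation, evaluated per month. A minimal Python sketch with hypothetical monthly values is given below; the function name and the handling of dry months are assumptions.

```python
def heavy_precipitation_fraction(p_tracks, p_total):
    """Monthly share [%] of the total precipitation that falls within tracks."""
    return [100.0 * pt / ptot if ptot > 0 else float("nan")
            for pt, ptot in zip(p_tracks, p_total)]

# Hypothetical monthly accumulations [mm per month] for a winter and a summer month.
p_tracks = [2.0, 10.0]
p_total = [80.0, 100.0]

fraction = heavy_precipitation_fraction(p_tracks, p_total)
print(fraction)
```

Because P(total) varies little over the year while the precipitation of tracks peaks in summer, the fraction inherits its seasonality almost entirely from the numerator.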

Fig. 5

Panel a shows the track occurrence frequency density, \(\hbox {OFD}\,[\hbox {month}^{-1}\,\hbox {pixel}^{-1}]\), of the tracker ensemble mean for the model ensemble and panel b shows the same for the observations. Panel c shows their difference and panel d shows the difference with model ensemble and observations each normalized by their respective total number of tracks. A pixel is here defined as the reference area of \(0.36^{\circ }\,\times \,0.36^{\circ }\). The green iso-line shows the model elevation at 1000 m.a.s.l. Panel e shows the model-observation difference in track occurrence by model elevation

Figure 5 shows the track occurrence frequency density of the tracker ensemble mean for the model ensemble and observations, as well as their difference. Panel d shows the normalized difference, i.e. as if there were as many tracks in the model ensemble as in the observations, and thereby emphasizes qualitative differences. Moreover, in panel e we show the difference in track occurrence for different seasons and different elevations. It is evident from the observations (Fig. 5b) that track occurrence is strongly correlated with the topography, meaning that orographic forcing plays a major role in the occurrence of precipitation events; this is well captured by the models as well (Fig. 5a). Prominent hotspots of heavy precipitation are the Julian Alps (North-East Italy), the Western Alps (especially the Italian side and the southern Maritime Alps between Italy and France) and the Massif Central (South France). Corsica and the Apennines can also be identified as hotspots. In addition, dry spots located in the interior of the Alps, like in Tyrol in Austria, or in the Western Alps are prominent in the observations and reproduced well by the cpRCMs. Track occurrence appears overestimated over orography, particularly in the Maritime Alps, the Tyrolean Alps, the Apennines, the Black Forest and, to a lesser extent, in the southern Massif Central. In contrast, in plains ahead of mountains, like in northern Italy, occurrence frequency is underestimated. We have already seen in the annual cycle of occurrence frequency (Fig. 4a) that cpRCMs most strongly overestimate track occurrence in spring (MAM) and estimate OF well in summer (JJA). In panel e of Fig. 5 we now clearly identify that cpRCMs overestimate tracks above 1000 m.a.s.l. in all seasons, whereas in summertime OF is underestimated below 1000 m.a.s.l. This behaviour may be explained through the following considerations: numerical models easily trigger convection through orographic lifting. However, they appear to struggle to trigger thunderstorms or to resolve complex thunderstorm dynamics over plain terrain in summer, even at convection-permitting resolution (see also Craig et al. 2012; Heim et al. 2020; Prein et al. 2021). On the other hand, observational datasets under-catch rainfall amounts in mountainous regions. Therefore, model performance over orography is expected to be better than it seems. This finding is in line with Lundquist et al. (2019), who propose that well-tuned cpRCMs may outperform observational datasets over complex mountain terrain in terms of total precipitation amounts.
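
The occurrence frequency density and its normalized difference can be sketched as follows: track centroids are binned into pixels of \(0.36^{\circ }\,\times \,0.36^{\circ }\), and for the normalized difference each density is first divided by its total number of tracks. The function names, the (lon, lat) centroid layout and the scaling to unit total are assumptions of this illustration; the division by the number of months is omitted for brevity.

```python
from collections import Counter

def occurrence_density(centroids, lon0, lat0, pixel_deg=0.36):
    """Count track centroids (lon, lat) per pixel of size pixel_deg x pixel_deg."""
    density = Counter()
    for lon, lat in centroids:
        ix = int((lon - lon0) // pixel_deg)
        iy = int((lat - lat0) // pixel_deg)
        density[(iy, ix)] += 1
    return density

def normalized_difference(model, obs):
    """Model minus observed density after scaling each to unit total track count."""
    n_mod, n_obs = sum(model.values()), sum(obs.values())
    pixels = set(model) | set(obs)
    return {p: model.get(p, 0) / n_mod - obs.get(p, 0) / n_obs for p in pixels}

# Hypothetical centroids: the model has more tracks, spread evenly over two pixels.
model_centroids = [(1.1, 40.1), (1.1, 40.1), (1.5, 40.1), (1.5, 40.1)]
obs_centroids = [(1.1, 40.1), (1.5, 40.1), (1.5, 40.1)]

model_ofd = occurrence_density(model_centroids, lon0=1.0, lat0=40.0)
obs_ofd = occurrence_density(obs_centroids, lon0=1.0, lat0=40.0)
diff = normalized_difference(model_ofd, obs_ofd)
print(dict(model_ofd), dict(obs_ofd), diff)
```

In this example the plain difference is positive or zero in both pixels because the model simply produces more tracks, whereas the normalized difference reveals a relative shift of occurrence from the second pixel to the first.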

The statistics of characteristic track properties of the tracker ensemble mean, for the climatologies of both model ensemble and observations, are listed in Table 8. Relative model biases of mean track properties are illustrated in panels a) and b) of Fig. 6 and are summarized in Table 10.

Table 10 Relative biases [%] of characteristic track properties, as defined in Table 4, using the tracker ensemble mean, for the whole domain, as well as without GRIPHO and exclusively for GRIPHO

We find that the biases of mean track properties are mostly positive (see Fig. 6b and Table 10). Biases of the 90th percentile of track properties are still much larger (see Fig. 6d), suggesting that extreme events are strongly overestimated regarding their scales and intensity. Considering the complex orography of the domain investigated (Rotunno and Houze 2007), in combination with the known issue of under-catchment of orographic precipitation in observations, the positive biases were to be expected. Despite this, we find it important to note that mean precipitation rate of tracks is well-estimated (+6% allover the domain, +3% w/o GRIPHO-Italy). The absolute number of events, maximum precipitation rate, track duration and rainfall volume are considerably overestimated (\(>17\%\)). Only the mean area of tracks is underestimated. Model biases with respect to GRIPHO-Italy differ from those of the other datasets and regions qualitatively only in terms of number of tracks, showing here an underestimation. It is worth to note that the model spread is particularly large in terms of number of tracks (Fig. 6a). For all other properties we find smaller biases over regions with radar-based datasets, i.e. France, Germany and Switzerland (w/o GRIPHO-Italy) than over Italy (only GRIPHO-Italy). It is particularly the spatio-temporal properties (duration, mean area, space-time volume) and maximum precipitation rate, that show the greatest differences. We assume that the bias reduction in spatio-temporal properties, particularly in mean area, is associated with improvements that the spatially continuous radar measurements ensure. Larger model biases over Italy might be also attributed to some higher degree of complexity not well captured by some or all cpRCMs within the ensemble. Certainly the optimal spatial-temporal representation of precipitation events in radar-based datasets constitutes an advantage in the evaluation of models in a context of Lagrangian analysis. 
Our findings confirm the key role of observational datasets with comparable spatial resolution in evaluation studies of RCMs (Torma et al. 2015; Prein and Gobiet 2017d).
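The relative bias used throughout this section (and stated in the caption of Fig. 6) can be sketched in a few lines. The following is a minimal illustration, not the code used in this study: the function name, the tracker labels and all numerical values are hypothetical, and the inputs are assumed to be one value of a track property per tracker, already averaged over the model ensemble or over the observations.

```python
from statistics import mean


def relative_bias_percent(model_by_tracker, obs_by_tracker):
    """Relative bias [%] of the tracker ensemble mean: 100 * (model - obs) / obs."""
    # Average over trackers first (the "tracker ensemble mean").
    model_mean = mean(model_by_tracker.values())
    obs_mean = mean(obs_by_tracker.values())
    if obs_mean == 0:
        raise ValueError("observed mean is zero; relative bias is undefined")
    return 100.0 * (model_mean - obs_mean) / obs_mean


# Hypothetical mean track durations [h] per tracker (illustrative values only).
model_duration = {"tracker_a": 5.5, "tracker_b": 6.5, "tracker_c": 6.0}
obs_duration = {"tracker_a": 4.5, "tracker_b": 5.5, "tracker_c": 5.0}

bias = relative_bias_percent(model_duration, obs_duration)
print(f"{bias:+.1f}%")  # +20.0%
```

The same function could be applied to the 90th percentile of a property instead of its mean, provided the percentile is computed per tracker beforehand.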

Fig. 6

Panels a and c: the purple shaded area illustrates the relative bias of (a) the tracker ensemble mean of the model ensemble mean and (c) the 90th percentile, with respect to the observations: \(\frac{\overline{ModEns}^{Tr}\,\,-\,\,\overline{Obs}^{Tr}}{\overline{Obs}^{Tr}}\)[%], while purple lines indicate the individual models and green lines individual years of the observations. Panels b and d also show \(\frac{\overline{ModEns}^{Tr}\,\,-\,\,\overline{Obs}^{Tr}}{\overline{Obs}^{Tr}}\)[%] for the mean and 90th percentile of characteristic properties, with the Italian dataset GRIPHO excluded as well as for GRIPHO only. Black solid lines mark increments of ±5%, with the thick black line denoting 0%. Panels e–j: probability density functions of Duration [h], Area [\(\hbox {km}^2\)], Rain Volume [\(\hbox {m}^{3}\,\hbox {E6}\)], Space–Time Volume [\(\hbox {km}^{2}\,\hbox {h}\)], Maximum Precipitation Rate [\(\hbox {mm}\,\hbox {h}^{-1}\)] and Mean Precipitation Rate [\(\hbox {mm}\,\hbox {h}^{-1}\)]

The probability density functions in Fig. 6 give more detailed insight into the models’ behaviour. Looking at track duration, we find that cpRCMs simulate precipitation events lasting longer than 50 h, which are not found in observations. In turn, simulated precipitation events are generally too small in area, with only a minor overestimation in the tail of the distribution. As a combination of the biases of duration and area, the bias of geometrical volume is still positive. It is mostly the overestimation of track duration that causes the positive biases in geometrical volume, and biases in rain volume are likewise found mostly in the tails of the distribution. In other words, cpRCMs are found to simulate precipitation events of large scales that are not seen in observations. This finding is also reflected in the high biases of the 90th percentile of track properties shown in Fig. 6c and d.
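How an underestimated area and an overestimated duration can still combine to a positive bias in geometrical volume may be illustrated with a short worked example. We assume here, purely for illustration, that the space-time volume of a single track is the product of its mean area and its duration (the definitions actually used are those of Table 4), and we denote the relative biases of area and duration by the hypothetical symbols \(\delta_A\) and \(\delta_D\):

\[
V = \overline{A}\,D, \qquad \overline{A}_{mod} = (1+\delta_A)\,\overline{A}_{obs}, \qquad D_{mod} = (1+\delta_D)\,D_{obs}
\]

\[
\frac{V_{mod}-V_{obs}}{V_{obs}} = (1+\delta_A)(1+\delta_D)-1 = \delta_A+\delta_D+\delta_A\,\delta_D
\]

With made-up values \(\delta_A=-0.10\) and \(\delta_D=+0.40\), the volume bias is \(0.90\times 1.40-1=+0.26\), i.e. positive despite the negative area bias. For ensemble or climatological means the relation holds only approximately, since the mean of a product is in general not the product of the means.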

In Fig. 8 (Supplementary Material, Section 9) we provide the relative biases of mean properties for each model individually, and we would like to address here the two cpRCM families WRF and CCLM. While the WRF models differ in their physics parameterizations, the CCLM models differ only in their nesting strategy. Among the WRF models the variability in mean biases is considerable, with e.g. IPSL-WRF and UHOH being particularly different. In turn, the biases within the CCLM model family are much more similar. From this we conclude that physics parameterizations have a large effect on model behaviour and can thus generate greater ensemble variability than differing nesting strategies.

Fig. 7

The spatial biases of the tracker ensemble mean of the model ensemble w.r.t. observations. Panels a–f: duration [h], (mean) area [\(\hbox {km}^2\)], rain volume [\(\hbox {m}^3\) E6], (geometrical) volume [\(\hbox {km}^{2}\,\hbox {h}\)], mean and maximum precipitation (rate) [\(\hbox {mm}\,\hbox {h}^{-1}\)]. Again, a pixel is defined here as the reference area of \(0.36^{\circ }\,\times \,0.36^{\circ }\), and the green iso-line shows the model elevation at 1000 m a.s.l.

The spatial mapping of model biases in Fig. 7 allows us to further understand the impacts of the technical shortcoming mentioned in Sect. 2.2. Note that Fig. 9 (Supplementary Material, Section 9) shows the mapping of the respective properties for the observations and the model ensemble. The a posteriori masking of tracks means that landfalling tracks are overestimated in their spatial and temporal scales. In fact, we find that rain volume and geometrical volume are overestimated along the coasts, whereas the other averaged variables appear unaffected. Over Italy we again find the pronounced underestimation of track mean area and overestimation of track duration. We can speculate here that GRIPHO’s interpolation method smooths the field strongly, enlarging the spatial extent of events. Positive biases in mean and maximum precipitation are also pronounced over Italy, but are not dramatically different from those in the other sub-regions of the domain.

6 Summary and conclusions

The present study has a two-fold scientific objective: on the one hand we provide an inter-comparison of tracking algorithms and on the other hand we present an evaluation of an ensemble of convection-permitting regional climate models in terms of Lagrangian precipitation events. We here summarize our findings and give conclusions.

With respect to the tracker inter-comparison (see Sect. 4), we were able to show, through both the comparison of two historic events and the comparison of climatologies of model ensemble and observations, that all trackers investigated produce the same relations among characteristic track properties and the same relations among model biases (see Tables 7 and 9 and Figs. 1, 2 and 3). Thus all trackers produce qualitatively equal results. In other words, differences among the trackers were found to be of a quantitative nature only, and these could be attributed to certain specifications of the algorithms. From this we infer that an equivalent scientific conclusion would be derived from the analysis of each tracker. This result suggests that all trackers investigated are reliable analysis tools for atmospheric research.

The choice of tracker here depends largely on whether metatracks, which are allocated when tracks split or merge, are of interest. Furthermore, code availability, portability and user support also play a major role.

We find that the setup chosen here, defined by the smoothing, the precipitation rate threshold and the volume threshold (Table 3), identifies an abundance of precipitation events all over the domain, of which only a fraction would be considered extreme events. In our analysis of two historical events, we see that each is represented by several tracks. Users should keep in mind that a tracker would identify fewer tracks, or only a single one, if the thresholds on precipitation rate and minimum volume were raised and the smoothing strengthened. The choice of setup depends upon the user-specific application. In any case, though, the most intense events are retained. In turn, reducing the thresholds and weakening the smoothing will result in a setup that identifies more and larger tracks, and a greater fraction of precipitation will be attributed to the events.
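The sensitivity to thresholds and smoothing described above can be illustrated with a toy example. The sketch below is not one of the trackers compared in this study: it only identifies objects in a single time step of a small synthetic field, it uses a simple 3x3 box-mean filter as a stand-in for the smoothing, and it takes the sum of precipitation rates within an object as a proxy for its volume. All function names, field values and thresholds are hypothetical.

```python
def smooth(field, passes):
    """Apply `passes` rounds of a 3x3 box-mean filter.

    Border cells are averaged over the neighbours that exist.
    """
    rows, cols = len(field), len(field[0])
    for _ in range(passes):
        smoothed = [[0.0] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols):
                window = [
                    field[a][b]
                    for a in range(max(0, i - 1), min(rows, i + 2))
                    for b in range(max(0, j - 1), min(cols, j + 2))
                ]
                smoothed[i][j] = sum(window) / len(window)
        field = smoothed
    return field


def count_objects(field, rate_threshold, min_volume):
    """Count 4-connected regions with rate >= rate_threshold whose
    summed rate (a proxy for volume) is >= min_volume."""
    rows, cols = len(field), len(field[0])
    seen = set()
    count = 0
    for i in range(rows):
        for j in range(cols):
            if (i, j) in seen or field[i][j] < rate_threshold:
                continue
            # Flood fill from this seed cell to collect one object.
            stack = [(i, j)]
            seen.add((i, j))
            volume = 0.0
            while stack:
                a, b = stack.pop()
                volume += field[a][b]
                for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    na, nb = a + da, b + db
                    if (0 <= na < rows and 0 <= nb < cols
                            and (na, nb) not in seen
                            and field[na][nb] >= rate_threshold):
                        seen.add((na, nb))
                        stack.append((na, nb))
            if volume >= min_volume:
                count += 1
    return count


# Synthetic precipitation rates [mm/h]: one intense cell cluster, one weak
# cluster and one isolated moderate cell (made-up values).
field = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 12.0, 14.0, 0.0, 0.0, 3.0, 0.0, 0.0],
    [0.0, 13.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]

n_low = count_objects(field, rate_threshold=2.0, min_volume=0.0)
n_high_rate = count_objects(field, rate_threshold=5.0, min_volume=0.0)
n_high_volume = count_objects(field, rate_threshold=5.0, min_volume=10.0)
n_smoothed = count_objects(smooth(field, 1), rate_threshold=5.0, min_volume=0.0)
print(n_low, n_high_rate, n_high_volume, n_smoothed)  # 3 2 1 0
```

In this toy setting, raising the rate threshold removes the weak cluster, adding a minimum volume removes the isolated cell, and smoothing at a fixed threshold suppresses all objects, because the box mean lowers the peak values. The most intense cluster is the last one to disappear.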

With respect to model evaluation (see Sect. 5) we summarize the following findings. Looking into the spatial representation of precipitation events (see Fig. 5), we found that cpRCMs perform well in reproducing hotspots of heavy precipitation, which are generally associated with orographic features. At the same time, though, cpRCMs appear to overestimate the occurrence of precipitation events over orography. However, the under-catchment of orographic precipitation in radar-based and rain gauge observations (Creutin et al. 1997; La Barbera et al. 2002; Prein and Gobiet 2017d; Germann et al. 2022) suggests that cpRCMs perform better than it seems. The idea of cpRCMs outperforming observations in complex terrain, particularly in terms of total precipitation amounts, is strongly supported by Lundquist et al. (2019). In contrast, we found the occurrence of precipitation events to be underestimated over flat terrain and ahead of orographic features, particularly in summer. The same model behaviour was found by Prein et al. (2017a) for North America, where the occurrence frequency of MCSs was underestimated in the central plains but overestimated over the Appalachians. We assume here that, despite the convection-permitting resolution, complex thunderstorms (e.g. supercells or squall lines) over flat terrain are either not triggered or their dynamics are still under-resolved (see also Bryan and Morrison (2012), Pichelli et al. (2017), Prein et al. (2021)). Moreover, the correct prescription of sea surface temperatures is crucial for the intensity and evolution of characteristic landfalling Mediterranean heavy precipitation events (Lebeaupin et al. 2006). Looking into the seasonal representation of precipitation events, we find that cpRCMs overestimate the occurrence of tracks and the associated precipitation amounts particularly in late spring (AMJ), and also in fall.
In the late summer months (JAS) the domain-wide occurrence appears to be well estimated, as the overestimation in regions above 1000 m a.s.l. is compensated by the underestimation in regions below 1000 m a.s.l.

In terms of characteristic properties of precipitation events we found the following biases (listed in Table 10 and illustrated in Fig. 6), which we relate to tracker studies using convection-permitting models. The occurrence frequency of events is overestimated with respect to radar-based observations (in line with Clark et al. (2014), Prein et al. (2017a), Caillaud et al. (2021)) and underestimated over Italy, although the model spread is large for this property. The mean area of tracks is underestimated (in line with Crook et al. (2019), but in contrast to Caillaud et al. (2021)), while their duration is overestimated (in line with Crook et al. (2019), Purr et al. (2019)). Still, we have identified that both of these biases are particularly pronounced over Italy. In turn, the cpRCMs agree much better with the radar-based observational datasets in terms of track area and duration. The tracks’ space-time volume, that is the combination of area and duration, as well as the rain volume, are overestimated. However, we find here a considerable impact of the differing representation of landfalling tracks in models and observations, and excluding a major part of the coastline (the Italian sub-region) reduces the biases substantially. Mean precipitation rates show only small positive biases, with cpRCMs again aligning even better with radar-based observations. Maximum precipitation rate is overestimated in the models, and here again biases are much reduced when radar-based observations are used as the benchmark (in line with Davis et al. (2006b), Prein et al. (2017a), Crook et al. (2019), Caillaud et al. (2021)). Still, we find that cpRCMs simulate precipitation events of scales and intensities that are not seen in observations, which means an overestimation of extreme event properties. Overall, the results on cpRCM evaluation are encouraging.
On the one hand, the (mostly) positive biases we find are to be expected when assuming underestimated precipitation amounts in observations in a region of such complex orography. On the other hand, we find that biases of the spatio-temporal properties of precipitation events in cpRCMs appear much reduced when using high-resolution observational datasets based upon Doppler radar measurements.