1 Introduction

The consequences of using a small data set for prediction are not difficult to surmise. The most significant threat to the modeler is the extremely high standard error of the estimates: the point forecasts will not be reliable and the confidence intervals will be wide, both hallmarks of poor inference. To set our notions straight, let us adopt the following terminological convention.

Rare events Events in nature or elsewhere that do not occur frequently, although we do not insist on quantifying “frequently” objectively; this is in keeping with our intuitive notion of rarity. For instance, a large-scale volcanic eruption, a significant strike from a strong hurricane, an earthquake of magnitude seven or more, a lottery win or an aircraft disaster can all be classified as rare events. Despite their low probability of occurrence, deriving efficient and accurate forecasts for such events remains imperative, especially in view of their profound and often devastating impact on human lives and property. Such events have been studied in considerable detail by Ho (1991, 2008, 2010) and Maguire et al. (1952), among others.

Sparse events Irrespective of whether an event is rare, a sufficiently rich historical record may at times be unavailable because of financial or technological constraints. For example, accurate rainfall data (rainfall not being a rare event) for a given city might only be available since 2000; we agree to categorize such cases as sparse events.

This classification is intrinsically subjective, but it will still be immensely instrumental in understanding the nature of the events we intend to investigate. Events belonging to either category can produce small data sets, and the forecasting problem is only compounded when the two combine. A classic instance is the following. The Parkfield Experiment [http://earthquake.usgs.gov/research/parkfield/index.php] is a comprehensive, long-term earthquake research project on the San Andreas fault. Led by the US Geological Survey (USGS) and the State of California, the experiment’s purpose is to better understand the physics of earthquakes: what actually happens on the fault and in the surrounding region before, during and after an earthquake. Ultimately, scientists hope to better understand the earthquake process and, if possible, to provide a scientific basis for earthquake prediction. Since its inception in 1985, the experiment has involved more than 100 researchers at the USGS and collaborating universities and government laboratories. Their coordinated efforts have produced a dense network of instruments designed to “capture” the anticipated earthquake and reveal the earthquake process in unprecedented detail. The National Earthquake Prediction Evaluation Council (NEPEC) issued a statement in 1985 that an earthquake of about M 6 would probably occur before 1993 on the San Andreas Fault near Parkfield (Shearer 1985). However, no such event occurred until September 28, 2004. Statistically speaking, the chief characteristic of the Parkfield earthquake prediction experiment is that it yields a small data set, which unavoidably poses a modeling challenge.

Tan et al. (2014) showed how the method introduced here can successfully model strong sandstorms on the basis of historical records from 1954 to 2002. The dearth of a sufficiently rich and reliable history called for more data points to arrive at reasonable forecasts, and the empirical recurrence rate (ERR) statistic, introduced below, served that purpose adequately. A cursory glance at the underlying process, however, revealed a marked seasonal pattern (higher frequency during specific months of the year); that event was therefore sparse, but not rare. The present work endeavors to show that the technique is powerful enough to tackle an event belonging to both classes with the same ease and elegance.

1.1 The strategy

The inter-event times of earthquakes at Parkfield can be described as a small point process with somewhat periodic recurrence rates. A nonhomogeneous Poisson process (NHPP) generalizes the homogeneous Poisson process (HPP) and is often appropriate for modeling a series of events that occur over time in a nonstationary fashion. NHPPs have been used to model event occurrences in a variety of applications in the earth sciences, ranging from mining accidents (e.g., Maguire et al. 1952) to volcanic hazard and risk assessment studies (e.g., Ho 1991). A major difficulty with the NHPP is that its intensity function can take infinitely many forms. Ho (2008, 2010) proposed a bridge between a point process and classical time series analysis via a sequence of ERRs calculated sequentially at equidistant time intervals. The technique commences with an empirical recurrence rate plot (ERR-plot), designed to record the temporal pattern of the targeted stochastic process. Autoregressive integrated moving average (ARIMA) modeling techniques (Box and Jenkins 1976; Ljung and Box 1978) are then applied to the ERR time series, referred to as the “training sample.” Three steps are distinguished: (i) model identification, (ii) parameter estimation and (iii) prediction and comparison of future values with a set of holdout ERRs, referred to as the “prediction set.” The predicted ERRs are in turn used to estimate the mean number of events corresponding to the prediction set. The predictability of all candidate models can thus be assessed, and the pool of selected models is fine-tuned to produce the most useful model, namely the one that fits the training sample and best fits the prediction set.

2 ARIMA modeling

2.1 Empirical recurrence rate

Let $t_1, \ldots, t_n$ be the times of the $n$ ordered events during an observation period $[0, T]$, running from the first occurrence to the last. A discrete time series $\{z_l\}$ is then generated sequentially at the equidistant time points $h, 2h, \ldots, lh, \ldots, Nh\,(=T)$. With 0 adopted as the time-origin and $h$ as the time-step, we regard $z_l$ as the observation at time $t = lh$. The time series of ERRs (Ho 2008) is defined as follows:

$$z_{l} = \frac{n_{l}}{lh},$$

where $n_l$ is the total number of events in $[0, lh)$ and $l = 1, 2, \ldots, N$. Note that $z_l$ evolves over time and is simply the maximum likelihood estimator (MLE) of the mean rate per unit time, $\lambda$, if the underlying process observed in $[0, lh)$ is an HPP.

The ERRs are deliberately constructed to be cumulative in order to avoid generating, through the discretization, a time series containing many detrimental zero values. This is particularly relevant for point processes with small recurrence rates or, especially, with strong seasonality and many off-season zero counts, such as sand–dust storm and hurricane records.
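
To make the construction concrete, the following minimal Python sketch computes the ERR series directly from the event times; the function and variable names are ours and are not taken from the original paper.

```python
import numpy as np

def err_series(event_times, h, N):
    """Empirical recurrence rate z_l = n_l / (l * h) for l = 1, ..., N.

    event_times : event times measured from the time-origin (same unit as h)
    h           : time-step
    N           : number of steps, with N * h = T (the observation period)
    """
    t = np.sort(np.asarray(event_times, dtype=float))
    l = np.arange(1, N + 1)
    # n_l = number of events in the half-open interval [0, l*h)
    n_l = np.searchsorted(t, l * h, side="left")
    return n_l / (l * h)
```

Because the counts are cumulative, $z_l$ never returns to zero once the first event has occurred, which is precisely the property discussed above.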

2.2 Data

Since the great earthquake of January 9, 1857, earthquake sequences with main shocks of about magnitude (M) 6 have occurred near Parkfield on February 2, 1881, March 3, 1901, March 10, 1922, June 8, 1934, June 28, 1966, and September 28, 2004 (Bakun et al. 2005). Our initial data treatment is as follows: (i) set the observing time-origin at January 9, 1857, and let December 3, 2014, be the ending date, giving a total of 57,670 days (= 365 × 158) for the entire observation period; (ii) use a time-step of h = 730 days (2 years) and express the ERRs as annual rates; and (iii) take as the training sample the revised time series that excludes the first twelve (burn-in period) and the last ten data points, leaving 57 lags (= 79 − 12 − 10) covering five earthquakes. Note that the last peak of the entire ERR time series (Fig. 1) occurs at lag 74, reflecting the September 28, 2004, earthquake, which falls in the prediction set (lags 70–79) used to further evaluate the reasonableness and predictive ability of the candidate models. The very nature of the event of interest qualifies it as rare, while the relatively short historical record of 158 years (compared with other such studies) ensures that it is sparse as well.

Fig. 1 Behavior of the ERR series over the burn-in period, training and prediction sets
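
As an illustration of this data treatment, the sketch below builds the 79-point ERR series in annual rate and splits it into the burn-in, training and prediction segments. It reuses the hypothetical `err_series` helper defined earlier and assumes that the 1857 origin event is included in the cumulative counts.

```python
import numpy as np
import pandas as pd

# Main-shock dates of the Parkfield sequence (Bakun et al. 2005)
quakes = pd.to_datetime([
    "1857-01-09", "1881-02-02", "1901-03-03", "1922-03-10",
    "1934-06-08", "1966-06-28", "2004-09-28",
])
origin = pd.Timestamp("1857-01-09")
days = (quakes - origin).days.to_numpy(dtype=float)  # event times in days from the origin

h, N = 730.0, 79                    # 2-year time-step; 79 * 730 = 57,670 days
z = err_series(days, h, N) * 365.0  # ERR expressed as an annual rate

burn_in    = z[:12]    # lags 1-12, discarded
training   = z[12:69]  # lags 13-69, the 57-point training sample
prediction = z[69:]    # lags 70-79, the holdout set containing the 2004 event
```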

2.3 Model

The initial ACF and PACF graphs of the ERR training sample are shown below (Figs. 2, 3); both exhibit a very slow decay, indicative of nonstationary behavior.

Fig. 2 Raw ACF

Fig. 3 Raw PACF

To address this nonstationarity and to expose any seasonal pattern, we take the first-order differenced series $\nabla z_t = z_t - z_{t-1}$ and plot the corresponding ACF and PACF curves (Figs. 4, 5):

Fig. 4 ACF of the first-differenced series

Fig. 5 PACF of the first-differenced series

Significant peaks can be seen at lags 1s, 2s, 3s, 4s, …, where s = 10 (approximately), with relatively slow decay, indicating the need for seasonal differencing. The ACF and PACF curves for the new series

$$\nabla_{10} \nabla z_{t} = \left( {1 - B^{10} } \right)\left( {1 - B} \right)z_{t}$$

are shown below (Figs. 6, 7).

These graphs show a strong peak at lag 1s and relatively smaller peaks at 2s, 3s and 4s. So it appears that one of the following holds:

Fig. 6 ACF of the seasonally differenced series

Fig. 7 PACF of the seasonally differenced series

1. the ACF is cutting off after lag 1s and the PACF is tailing off at the seasonal lags,

2. the ACF is cutting off after lag 2s and the PACF is tailing off at the seasonal lags, or

3. both the ACF and the PACF are tailing off at the seasonal lags.

These possibilities suggest either (i) a seasonal MA (SMA) of order Q = 1, (ii) an SMA of order Q = 2 or (iii) a seasonal ARMA (SARMA) with P = 1 or 2 (owing to one or two significant peaks in the final PACF) and Q = 1.

Next, to identify the nonseasonal orders, we focus on the within-season lags, where it appears that either

(i) both the ACF and the PACF are tailing off, or

(ii) the ACF and the PACF are tailing off at lags 2 or 3.

These suggest either (i) p = q = 1 or (ii) p = 2 or 3, q = 2 or 3.
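
The correlogram inspection underlying this identification can be reproduced along the following lines. This is only a sketch using statsmodels' plotting helpers (our choice of tool; the paper does not specify software), applied to the `training` series constructed earlier.

```python
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

zt  = pd.Series(training)
d1  = zt.diff().dropna()        # (1 - B) z_t, the first-differenced series
d10 = d1.diff(10).dropna()      # (1 - B^10)(1 - B) z_t, with an additional seasonal difference

fig, axes = plt.subplots(3, 2, figsize=(10, 8))
for row, (series, label) in enumerate([(zt, "raw ERR"),
                                       (d1, "first difference"),
                                       (d10, "seasonal + first difference")]):
    plot_acf(series, lags=20, ax=axes[row, 0], title="ACF: " + label)
    plot_pacf(series, lags=20, ax=axes[row, 1], title="PACF: " + label)
plt.tight_layout()
plt.show()
```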

Having narrowed the search domain in this way, we now choose the best of these competing “nearby” models according to the minimum-AIC criterion. The best model found has p = 2, d = 1, q = 0 and P = 1, D = 1, Q = 1, s = 10, with an AIC of −402.81. Thus, our final model is the following SARIMA$(2, 1, 0) \times (1, 1, 1)_{10}$:

$$\varPhi\left(B^{10}\right)\phi\left(B\right)\nabla_{10}\nabla z_{t} = \varTheta\left(B^{10}\right)\theta\left(B\right)w_{t},$$

where $\phi(B)$ and $\varPhi(B^{10})$ denote the nonseasonal and seasonal autoregressive polynomials, $\theta(B)$ and $\varTheta(B^{10})$ the corresponding moving-average polynomials, $B$ the backshift operator, $\nabla$ and $\nabla_{10}$ the ordinary and seasonal difference operators, and $w_t$ white noise. The parameter estimates and summary statistics are shown in Table 1.

Table 1 Parameter estimates from the fitted model
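
A hedged sketch of the model search and the final fit, using statsmodels' SARIMAX (again our choice of tool). The grid below spans the orders suggested by the identification step; the AIC reported by a particular package may differ slightly from the value quoted above because of differing likelihood conventions.

```python
import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

s = 10
results = []
for p, q in itertools.product(range(4), range(4)):        # nonseasonal orders 0-3
    for P, Q in itertools.product(range(3), range(3)):     # seasonal orders 0-2
        try:
            fit = SARIMAX(training, order=(p, 1, q),
                          seasonal_order=(P, 1, Q, s)).fit(disp=False)
            results.append(((p, 1, q), (P, 1, Q, s), fit.aic))
        except Exception:
            continue          # skip combinations that fail to converge

results.sort(key=lambda r: r[2])   # smallest AIC first
print(results[:5])

# Final model reported above: SARIMA(2, 1, 0) x (1, 1, 1)_10
final = SARIMAX(training, order=(2, 1, 0),
                seasonal_order=(1, 1, 1, 10)).fit(disp=False)
print(final.summary())
```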

Next, we subject this model to the usual diagnostic tests and present our findings in the following figures (Figs. 8, 9). The standardized residuals from the fit are well within acceptable limits, the ACF of the residuals is negligible, and the Ljung–Box tests have high p values (close to 0.9), thereby failing to reject the hypothesis that the residuals are independent. The normal Q–Q plot of the residuals also supports the normality assumption.

Fig. 8 Model diagnostics

Fig. 9 Q–Q plot
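
The residual checks reported above can be reproduced as follows; this is a sketch only, `final` being the fitted model from the previous snippet, and the exact p values will depend on the software and the residuals retained.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

# Built-in panel: standardized residuals, histogram, normal Q-Q plot and residual ACF
final.plot_diagnostics(figsize=(10, 8))
plt.show()

# Ljung-Box test on the residuals, skipping the first d + D*s = 11 values
# affected by differencing; large p values fail to reject residual independence
resid = final.resid[11:]
print(acorr_ljungbox(resid, lags=[5, 10, 15]))

# Stand-alone normal Q-Q plot of the residuals
sm.qqplot(resid, line="q")
plt.show()
```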

2.4 Forecast

The ten ERRs forecast within this coherent methodology are reasonably close to the actual values in the prediction set (Table 2). Specifically, Fig. 10 indicates a new earthquake between December 5, 2002, and December 4, 2004 (lag 74, the fifth time-step in the prediction set), which captures the actual event recorded on September 28, 2004. Thus, the final model passes the diagnostic testing procedures and is able to predict the earthquake recorded in the prediction set.

Table 2 ERRs of the forecast and those in the prediction set for the Parkfield seismicity
Fig. 10 Complete Parkfield data (training sample and prediction set) with ten forecasts appended to the training sample for model validation
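
The ten holdout-period forecasts and their conversion back to expected event counts can be sketched as follows (variable names ours; `training` and `final` come from the earlier snippets). Since $z_l = n_l/(lh)$, a forecast $\hat{z}_l$ implies an estimated cumulative mean count $\hat{n}_l = \hat{z}_l \, lh$; reading a jump of about one between consecutive implied counts as a predicted new event is our interpretation of how the forecast flags the 2004 earthquake.

```python
import numpy as np

h_years = 2.0                       # time-step expressed in years (h = 730 days)
lags = np.arange(70, 80)            # lags of the prediction set

fc = final.get_forecast(steps=10)
z_hat = fc.predicted_mean           # forecast ERRs (annual rate) for lags 70-79
ci = fc.conf_int(alpha=0.05)        # 95% prediction intervals

# Back-transform to cumulative expected event counts, n_hat_l = z_hat_l * l * h
n_hat = z_hat * lags * h_years
n_last = training[-1] * 69 * h_years          # implied count at the end of the training sample
jumps = np.diff(np.concatenate(([n_last], n_hat)))
print(np.round(jumps, 2))                     # a jump near one flags a predicted new event
```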

The forecasts obtained from a model derived in the same manner after pooling the training and prediction sets (Fig. 11) indicate that the next severe earthquake is likely to occur around 2025 (lag 84).

Fig. 11 Forecasts from the full model

3 Conclusions

The purpose of this paper has been to emphasize that the ERR is an invaluable weapon in the modeler’s arsenal, especially when dealing with events that are sparse, rare or both. Figure 2 depicts a slowly decaying ACF curve, a classic signature of a long-memory process, and interested researchers might explore fitting fractionally differenced time series models in this situation. Here, following the principle of parsimony, we have refrained from building an unnecessarily complex model. A relentless search for a better model would invariably bring in more advanced technical machinery, but it is our firm conviction that, as long as the underlying structure remains the ERR, better forecasts can be achieved without paying a hefty price in terms of model complexity.

As a welcome corollary, we have accomplished the goal of successfully modeling the historical earthquake data of Parkfield. The key is that the ERR time series not only enhances the pattern while smoothing the raw data, but also transforms a small point process into a discrete time series of a size suitable for ARIMA modeling. The fitted model passes the diagnostic testing procedures and correctly captures the 2004 event, while delivering a challenging message: with a small sample size, we will not be able to reject even a poor model. Informed by the experience of the NEPEC prediction (Shearer 1985), this is probably as far as our model can claim. The proposed model, however, could serve as a reference for more advanced modeling in long-term seismic risk assessment studies (e.g., Amei et al. 2012). Building on this foundation, future research can also address the applicability of the method to other systems, in particular to processes with several regimes.