Abstract
Rare events are plentiful in nature, and most of them have devastating consequences for human lives and property. Modeling such events is intrinsically challenging precisely because of their rarity, and reliable forecasts are at times immensely difficult to obtain owing to the dearth of available data. Such situations are commonplace and typically arise when we are not equipped with a sufficiently rich historical record faithfully following the event of interest. The purpose of this endeavor is to promote the use of a smoothing statistic termed the empirical recurrence rate, which can generate pseudodata over barren observation periods, and to demonstrate that such a method can effectively enlarge the size of the data set and thereby yield better predictive power. Its simple method of construction appeals to intuition, and hence it can be profitably applied to analyze events originating from disciplines as diverse as meteorology, medical science, oceanography, volcanology, and seismology. We illustrate the applicability of our method with the aid of historical records of strong earthquakes at Parkfield, California, and describe how it triumphs over more established methods from a forecasting viewpoint.
1 Introduction
Consequences of using a small data set for prediction purposes are not difficult to surmise: the most significant threat to modelers is the extremely high standard error of the estimates, meaning that point forecasts will be unreliable and confidence intervals will be wide, both hallmarks of poor inference. To set our notions straight, let us adopt the following terminological convention.
Rare events Those events in nature or elsewhere that do not occur frequently, although we do not insist on objectively quantifying “frequently.” This is in keeping with our intuitive notion of “rarity.” For instance, a large-scale volcanic eruption, a significant strike from a strong hurricane, an earthquake of magnitude seven or more, winning a lottery, or an aircraft disaster can all be classified as rare events. Despite their low occurrence probability, efficient and accurate forecasts for these events remain imperative to derive, especially in view of their profound and often devastating influence on human lives and property. Such events have been studied in considerable detail by Ho (1991, 2008, 2010) and Maguire et al. (1952), among others.
Sparse events Irrespective of whether or not an event is rare, financial or technological constraints may at times make the availability of a sufficiently rich historical record an unreasonable expectation. For example, accurate rainfall data (rainfall being by no means a rare event) at a certain city might only be available since 2000; we hereby agree to categorize such cases as sparse events.
This classification is intrinsically subjective, but will still be immensely instrumental in our understanding of the nature of the events we intend to investigate. Events belonging to either category can contribute to small data sets, and the forecasting problem only gets compounded when the two combine. A classic instance is as follows: The Parkfield Experiment [http://earthquake.usgs.gov/research/parkfield/index.php] is a comprehensive, long-term earthquake research project on the San Andreas fault. Led by the US Geological Survey (USGS) and the State of California, the experiment’s purpose is to better understand the physics of earthquakes—what actually happens on the fault and in the surrounding region before, during and after an earthquake. Ultimately, scientists hope to better understand the earthquake process and, if possible, to provide a scientific basis for earthquake prediction. Since its inception in 1985, the experiment has involved more than 100 researchers at the USGS and collaborating universities and government laboratories. Their coordinated efforts have led to a dense network of instruments designed to “capture” the anticipated earthquake and reveal the earthquake process in unprecedented detail. The National Earthquake Prediction Evaluation Council (NEPEC) issued a statement in 1985 that an earthquake of about M 6 would probably occur before 1993 on the San Andreas Fault near Parkfield (Shearer 1985). However, no such event occurred until September 28, 2004. Statistically speaking, the chief characteristic of the Parkfield earthquake prediction experiment is that it is a small data set, which unavoidably poses a challenge.
Tan et al. (2014) have shown how the method introduced here can successfully model strong sandstorms, based on historical records from 1954 to 2002. The dearth of a sufficiently rich and reliable history called for more data points to arrive at reasonable forecasts, and the empirical recurrence rate (ERR) statistic, introduced later, served the purpose adequately. A cursory glance at the underlying process, however, revealed a marked seasonal pattern (higher frequency during specific months of the year); hence, that event was sparse but not rare. The present work endeavors to show that the technique is powerful enough to tackle an event embracing both classes with identical ease and elegance.
1.1 The strategy
The inter-event times of earthquakes at Parkfield can be described as a small point process with somewhat periodic recurrence rates. A nonhomogeneous Poisson process (NHPP) generalizes a homogeneous Poisson process (HPP) and is often appropriate for modeling a series of events that occur over time in a nonstationary fashion. NHPPs have been used to model event occurrences in a variety of applications in earth sciences, ranging from mining accidents (e.g., Maguire et al. 1952) to volcanic hazard and risk assessment studies (e.g., Ho 1991). A major difficulty with the NHPP is that it admits infinitely many forms for the intensity function. Ho (2008, 2010) proposed a linking bridge between a point process and classical time series via a sequence of ERRs, calculated sequentially at equidistant time intervals. The technique commences with an empirical recurrence rate plot (ERR-plot), designed to record the temporal pattern of a targeted stochastic process. Autoregressive integrated moving average (ARIMA) modeling techniques (Box and Jenkins 1976; Ljung and Box 1978) are applied to an ERR time series, referred to as the “training sample.” Three stages are distinguished: (i) model identification, (ii) parameter estimation, and (iii) prediction and comparison of future values with a set of holdout ERRs, referred to as the “prediction set.” The predicted ERRs are in turn used to estimate the mean number of events corresponding to the prediction set. The predictability of all candidate models can be assessed, and, consequently, the pool of selected models is fine-tuned to produce the most useful model that fits the training sample and best fits the prediction set.
2 ARIMA modeling
2.1 Empirical recurrence rate
Let t_1, …, t_n be the times of the n ordered events during an observation period [0, T] from the first occurrence to the last. A discrete time series {z_ℓ} is then generated sequentially at equidistant time intervals h, 2h, …, ℓh, …, Nh (= T). If 0 is adopted as the time-origin and h as the time-step, then we regard z_ℓ as the observation at time t = ℓh. The time series of ERRs (Ho 2008) is defined as follows:

z_ℓ = n_ℓ / (ℓh), ℓ = 1, 2, …, N,

where n_ℓ is the total number of observations in [0, ℓh). Note that z_ℓ evolves over time and is simply the maximum likelihood estimator (MLE) of the mean unit rate, λ, if the underlying process observed in [0, ℓh) is an HPP.
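To make the construction concrete, here is a minimal Python sketch (ours, not from the paper; the function name err_series and the toy event times are invented for illustration) that computes the ERR series from a list of event times:

```python
def err_series(event_times, h, N):
    """Empirical recurrence rate z_l = n_l / (l*h), where n_l counts the
    events in [0, l*h), for l = 1, ..., N."""
    z = []
    for l in range(1, N + 1):
        n_l = sum(1 for t in event_times if t < l * h)  # cumulative count
        z.append(n_l / (l * h))
    return z

# Toy example: three events on [0, 10] with time-step h = 2, so N = 5
z = err_series([1.5, 3.2, 7.8], h=2, N=5)
```

Because the counts are cumulative, the series never collapses to long runs of zeros even when individual bins are empty, which is precisely the smoothing property exploited below.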
The ERRs are deliberately constructed to be cumulative, which offsets the risk of the discretization process producing a time series riddled with detrimental zero values. Point processes with small recurrence rates, or those exhibiting seasonality with many off-season zero counts (such as sand–dust storm and hurricane records), are thereby accommodated as well.
2.2 Data
Since the great earthquake of January 9, 1857, earthquake sequences with main shocks of magnitude (M) 6 have occurred near Parkfield on February 2, 1881, March 3, 1901, March 10, 1922, June 8, 1934, June 28, 1966, and September 28, 2004 (Bakun et al. 2005). Our initial data treatments are: (i) set the observing time-origin at January 9, 1857, and let December 3, 2014, be the ending date, giving a total of 57,670 days (= 365 × 158) for the entire observation period; (ii) use a time-step of h = 730 days (2 years) and model the ERRs as annual rates; and (iii) take as the training sample a revised time series excluding the first twelve (burn-in period) and the last ten data points, which has 57 lags (= 79 − 12 − 10) covering five earthquakes. Note that the last peak of the entire ERR time series (Fig. 1) occurs at lag 74, reflecting the September 28, 2004, earthquake, which falls in the prediction set (lags 70–79) used to further evaluate the reasonableness and predictive ability of the candidate models. The very nature of the event of interest qualifies it as a rare event, while the relatively short (compared with other such studies) historical record of 158 years ensures that it is a sparse event, too.
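As an illustration of these treatments, the following Python sketch (ours, not the authors') locates each main shock in its two-year bin. It uses calendar day counts, whereas the paper works with 365-day years; the small leap-day discrepancy does not move any event across a bin boundary, and the 2004 shock indeed lands at lag 74:

```python
from datetime import date

# Main-shock dates near Parkfield listed by Bakun et al. (2005)
quakes = [date(1857, 1, 9), date(1881, 2, 2), date(1901, 3, 3),
          date(1922, 3, 10), date(1934, 6, 8), date(1966, 6, 28),
          date(2004, 9, 28)]

origin = date(1857, 1, 9)   # observing time-origin
h = 730                     # time-step in days (2 years)

# Day offset of each event from the origin, and the two-year bin (lag)
# containing it: bin l covers the interval [(l-1)*h, l*h)
offsets = [(q - origin).days for q in quakes]
lags = [d // h + 1 for d in offsets]
```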
2.3 Model
The initial ACF and PACF graphs are shown below (Figs. 2, 3); they clearly exhibit a very slow decay, indicative of nonstationary behavior.
To remove this nonstationarity, we take the first-order differenced series ∇z_t = z_t − z_{t−1} and plot the corresponding ACF and PACF curves (Figs. 4, 5):
We can see significant peaks at lags 1s, 2s, 3s, 4s, …, where s = 10 (approximately), with relatively slow decay, indicating the need for seasonal differencing. The ACF and PACF curves for the seasonally differenced series are then examined:
These graphs tend to show a strong peak at lag 1s and relatively smaller peaks at 2s, 3s, 4s. So it appears that either:

1. the ACF is cutting off after lag 1s and the PACF is tailing off in the seasonal lags,
2. the ACF is cutting off after lag 2s and the PACF is tailing off in the seasonal lags, or
3. both are tailing off in the seasonal lags.
These suggest either (i) an SMA of order Q = 1, (ii) an SMA of order Q = 2 or (iii) a SARMA of order P = 1 or 2 (due to 1 or 2 significant peaks in the final PACF) and Q = 1.
Next, to identify the nonseasonal parameters, we focus on the within-season lags, and it seems that either:

(i) both the ACF and PACF are tailing off, or
(ii) the ACF and PACF are cutting off at lags 2 or 3.
These suggest either (i) p = q = 1 or (ii) p = 2 or 3, q = 2 or 3.
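The two differencing operations underlying the identification above, first-order and seasonal at s = 10, can be sketched in a few lines of Python (ours; the helper diff and the toy series are invented for illustration):

```python
def diff(series, lag=1):
    """Difference a series at the given lag: y_t = z_t - z_{t-lag}."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# Toy ERR-like series of length 12
z = [0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.7, 0.5, 0.8, 0.9, 1.0, 0.9]

dz = diff(z)             # first-order difference, removes the trend
sdz = diff(dz, lag=10)   # seasonal difference at s = 10
```

Each differencing pass shortens the series by its lag, which is one reason a reasonably long ERR series is needed before a seasonal model can be identified.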
Narrowing down our search domain this way, we now choose the best among these competing “nearby” models according to the minimum AIC criterion. The best model found was a SARIMA with p = 2, d = 1, q = 0 and P = 1, D = 1, Q = 1, s = 10, with an AIC value of −402.81. Thus, our final model is SARIMA(2, 1, 0) × (1, 1, 1)_10:
(1 − φ_1B − φ_2B^2)(1 − Φ_1B^10)(1 − B)(1 − B^10) z_t = (1 − Θ_1B^10) w_t,

where B is the backshift operator, φ_1 and φ_2 are the nonseasonal autoregressive parameters, Φ_1 and Θ_1 are the seasonal autoregressive and moving average parameters, and {w_t} is a Gaussian white noise sequence. The parameter estimates and the summary statistics are shown in Table 1.
Next, we subject this model to the usual diagnostic tests and detail our findings in the next few figures (Figs. 8, 9). We find that the standardized residuals from the fit are well within acceptable limits, the ACF of the residuals is negligible, and the Ljung–Box tests have high p values (close to 0.9), thereby failing to reject the hypothesis of independent residuals. The normal Q–Q plot of the residuals also supports the normality assumption.
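For reference, the Ljung–Box statistic used in these diagnostics can be sketched in plain Python (ours; in practice one would compare Q with a chi-square quantile, e.g., via scipy.stats.chi2, with degrees of freedom reduced by the number of fitted ARMA parameters):

```python
def acf(x, k):
    """Sample autocorrelation of x at lag k."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[t] - m) * (x[t + k] - m) for t in range(n - k))
    return num / denom

def ljung_box_q(resid, max_lag):
    """Ljung-Box (1978) statistic Q = n(n+2) * sum_k rho_k^2 / (n-k).
    Under H0 (independent residuals), Q is approximately chi-square."""
    n = len(resid)
    return n * (n + 2) * sum(acf(resid, k) ** 2 / (n - k)
                             for k in range(1, max_lag + 1))
```

A strongly alternating residual series, for instance, has lag-1 autocorrelation near −1 and hence a large Q, correctly flagging the leftover structure.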
2.4 Forecast
Ten ERRs, forecasted within this coherent methodology, are reasonably close to the actual values in the prediction set (Table 2). Specifically, Fig. 10 predicts a new earthquake between December 5, 2002, and December 4, 2004 (= lag 74, the fifth time-step in the prediction set), which captures the actual event recorded on September 28, 2004. Thus, the final model passes the diagnostic testing procedures and is able to predict the upcoming earthquake recorded in the prediction set.
The forecasts obtained from a model derived similarly (Fig. 11), after pooling together the training and prediction sets, indicate that the next severe earthquake is likely to occur around 2025 (at lag 84).
3 Conclusions
The purpose of this paper was to emphasize that the ERR can be an invaluable weapon in every modeler’s arsenal, especially when dealing with events that are sparse, rare, or both. Figure 2 depicts a slowly decaying ACF curve, a classic signature of a long-memory process, and interested researchers might explore the possibility of fitting fractionally differenced time series models in this situation. Here, following the principle of parsimony, we have refrained from creating a model that is unnecessarily complex. A relentless search for a better model would invariably bring in more advanced technical machinery, but it is our firm conviction that, as long as the underlying structure remains the ERR, better forecasts can be achieved without paying a hefty price in model complexity.
As a welcome corollary, we have accomplished the goal of successfully modeling the historical earthquake data of Parkfield. The key is that the ERR time series not only enhances the pattern while smoothing the raw data, but is also able to transform a small point process into a discrete time series of a size suitable for ARIMA modeling. The fitted model passes the diagnostic testing procedures and correctly captures the event of 2004, while delivering the challenging message that, with a small sample size, we would be unable to reject even a silly model. Educated and supported by the drama produced by the NEPEC (Shearer 1985), this is probably as far as our model can claim. The proposed model, however, could potentially serve as a reference for more advanced modeling in long-term seismic risk assessment studies (e.g., Amei et al. 2012). Based on the foundation laid, future research can also address the applicability of the method to other systems, in particular to processes with several regimes.
References
Amei A, Fu W, Ho CH (2012) Time series analysis for predicting the occurrences of large scale earthquakes. Int J Appl Sci Technol 2(7):64–75
Bakun WH, Aagaard B, Dost B, Ellsworth WL, Hardebeck JL, Harris RA, Ji C, Johnston MJS, Langbein J, Lienkaemper JJ, Michael AJ, Murray JR, Nadeau RM, Reasenberg PA, Reichle MS, Roeloffs EA, Shakal A, Simpson RW, Waldhauser F (2005) Implications for prediction and hazard assessment from the 2004 Parkfield earthquake. Nature 437:969–974. doi:10.1038/nature04067
Box GEP, Jenkins GM (1976) Time series analysis: forecasting and control. Holden-Day, San Francisco, p 592
Ho CH (1991) Nonhomogenous Poisson model for volcanic eruptions. Math Geol 23(2):167–173
Ho CH (2008) Empirical recurrence rate time series for volcanism: application to Avachinsky volcano, Russia. J Volcanol Geotherm Res 173:15–25
Ho CH (2010) Hazard area and recurrence rate time series for determining the probability of volcanic disruption of the proposed high-level radioactive waste repository at Yucca Mountain, Nevada, USA. Bull Volc 72:205–219
Ljung GM, Box GEP (1978) On a measure of lack of fit in time series models. Biometrika 65:297–303
Maguire BA, Pearson ES, Wynn AHA (1952) The time intervals between industrial accidents. Biometrika 39:168–180
Shearer R (1985) Minutes of the National Earthquake Prediction Evaluation Council (NEPEC). US Geological Survey Open-File Repository, 85507
Tan S, Bhaduri M, Ho CH (2014) A statistical model for long-term forecasts of strong sand dust storms. J Geosci Environ Prot 2:16–26. doi:10.4236/gep.2014.23003
Ho, CH., Bhaduri, M. On a novel approach to forecast sparse rare events: applications to Parkfield earthquake prediction. Nat Hazards 78, 669–679 (2015). https://doi.org/10.1007/s11069-015-1739-1