1 Introduction

The consequences of using a small data set for prediction are not difficult to surmise. The most significant threat to the modeler is the extremely high standard error of the estimates: the point forecasts will not be reliable and the confidence intervals will be wide, both hallmarks of poor inference. To set our notions straight, let us adopt the following terminological convention.

Rare events Events in nature or elsewhere that do not occur frequently, although we do not insist on quantifying “frequently” objectively; this is in keeping with our intuitive notion of rarity. For instance, a large-scale volcanic eruption, a significant strike from a strong hurricane, an earthquake of magnitude seven or more, a lottery win or an aircraft disaster can all be classified as rare events. Despite their low probability of occurrence, deriving efficient and accurate forecasts for such events remains imperative, especially in view of their profound and often devastating impact on human lives and property. Such events have been studied in considerable detail by Ho (1991, 2008, 2010) and Maguire et al. (1952), among others.

Sparse events Irrespective of whether an event is rare, a sufficiently rich historical record may at times be unavailable because of financial or technological constraints. For example, accurate rainfall data (rainfall not being a rare event) for a given city might only be available since 2000; we agree to categorize such cases as sparse events.

This classification is intrinsically subjective, but it will still be immensely instrumental in understanding the nature of the events we intend to investigate. Events belonging to either category can produce small data sets, and the forecasting problem is only compounded when the two combine. A classic instance is the following. The Parkfield Experiment [http://earthquake.usgs.gov/research/parkfield/index.php] is a comprehensive, long-term earthquake research project on the San Andreas fault. Led by the US Geological Survey (USGS) and the State of California, the experiment’s purpose is to better understand the physics of earthquakes: what actually happens on the fault and in the surrounding region before, during and after an earthquake. Ultimately, scientists hope to better understand the earthquake process and, if possible, to provide a scientific basis for earthquake prediction. Since its inception in 1985, the experiment has involved more than 100 researchers at the USGS and collaborating universities and government laboratories. Their coordinated efforts have produced a dense network of instruments designed to “capture” the anticipated earthquake and reveal the earthquake process in unprecedented detail. The National Earthquake Prediction Evaluation Council (NEPEC) issued a statement in 1985 that an earthquake of about M 6 would probably occur before 1993 on the San Andreas Fault near Parkfield (Shearer 1985). However, no such event occurred until September 28, 2004. Statistically speaking, the chief characteristic of the Parkfield earthquake prediction experiment is that it yields a small data set, which unavoidably poses a modeling challenge.

Tan et al. (2014) showed how the method introduced here can successfully model strong sandstorms on the basis of historical records from 1954 to 2002. The dearth of a sufficiently rich and reliable history called for more data points to arrive at reasonable forecasts, and the empirical recurrence rate (ERR) statistic, introduced below, served that purpose adequately. A cursory glance at the underlying process, however, revealed a marked seasonal pattern (higher frequency during specific months of the year); that event was therefore sparse, but not rare. The present work endeavors to show that the technique is powerful enough to tackle an event belonging to both classes with the same ease and elegance.

1.1 The strategy

The inter-event times of earthquakes at Parkfield can be described as a small point process with somewhat periodic recurrence rates. A nonhomogeneous Poisson process (NHPP) generalizes the homogeneous Poisson process (HPP) and is often appropriate for modeling a series of events that occur over time in a nonstationary fashion. NHPPs have been used to model event occurrences in a variety of applications in the earth sciences, ranging from mining accidents (e.g., Maguire et al. 1952) to volcanic hazard and risk assessment studies (e.g., Ho 1991). A major difficulty with the NHPP is that its intensity function can take infinitely many forms. Ho (2008, 2010) proposed a bridge between a point process and classical time series analysis via a sequence of ERRs calculated sequentially at equidistant time intervals. The technique commences with an empirical recurrence rate plot (ERR-plot), designed to record the temporal pattern of the targeted stochastic process. Autoregressive integrated moving average (ARIMA) modeling techniques (Box and Jenkins 1976; Ljung and Box 1978) are then applied to the ERR time series, referred to as the “training sample.” Three steps are distinguished: (i) model identification, (ii) parameter estimation and (iii) prediction and comparison of future values with a set of holdout ERRs, referred to as the “prediction set.” The predicted ERRs are in turn used to estimate the mean number of events corresponding to the prediction set. The predictability of all candidate models can thus be assessed, and the pool of selected models is fine-tuned to produce the most useful model, namely the one that fits the training sample and best fits the prediction set.

2 ARIMA modeling

2.1 Empirical recurrence rate

Let $t_1, \ldots, t_n$ be the times of the $n$ ordered events during an observation period $[0, T]$, running from the first occurrence to the last. A discrete time series $\{z_l\}$ is then generated sequentially at the equidistant time points $h, 2h, \ldots, lh, \ldots, Nh\,(=T)$. With 0 adopted as the time-origin and $h$ as the time-step, we regard $z_l$ as the observation at time $t = lh$. The time series of ERRs (Ho 2008) is defined as follows:

$$z_{l} = \frac{n_{l}}{lh},$$

where $n_l$ is the total number of events in $[0, lh)$ and $l = 1, 2, \ldots, N$. Note that $z_l$ evolves over time and is simply the maximum likelihood estimator (MLE) of the mean rate per unit time, $\lambda$, if the underlying process observed in $[0, lh)$ is an HPP.

The ERRs are deliberately constructed to be cumulative in order to avoid generating, through the discretization, a time series containing many detrimental zero values. This is particularly relevant for point processes with small recurrence rates or, especially, with strong seasonality and many off-season zero counts, such as sand–dust storm and hurricane records.
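
To make the construction concrete, the following minimal Python sketch computes the ERR series directly from the event times; the function and variable names are ours and are not taken from the original paper.

```python
import numpy as np

def err_series(event_times, h, N):
    """Empirical recurrence rate z_l = n_l / (l * h) for l = 1, ..., N.

    event_times : event times measured from the time-origin (same unit as h)
    h           : time-step
    N           : number of steps, with N * h = T (the observation period)
    """
    t = np.sort(np.asarray(event_times, dtype=float))
    l = np.arange(1, N + 1)
    # n_l = number of events in the half-open interval [0, l*h)
    n_l = np.searchsorted(t, l * h, side="left")
    return n_l / (l * h)
```

Because the counts are cumulative, $z_l$ never returns to zero once the first event has occurred, which is precisely the property discussed above.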

2.2 Data

Since the great earthquake of January 9, 1857, earthquake sequences with main shocks of about magnitude (M) 6 have occurred near Parkfield on February 2, 1881, March 3, 1901, March 10, 1922, June 8, 1934, June 28, 1966, and September 28, 2004 (Bakun et al. 2005). Our initial data treatment is as follows: (i) set the observing time-origin at January 9, 1857, and let December 3, 2014, be the ending date, giving a total of 57,670 days (= 365 × 158) for the entire observation period; (ii) use a time-step of h = 730 days (2 years) and express the ERRs as annual rates; and (iii) take as the training sample the revised time series that excludes the first twelve (burn-in period) and the last ten data points, leaving 57 lags (= 79 − 12 − 10) covering five earthquakes. Note that the last peak of the entire ERR time series (Fig. 1) occurs at lag 74, reflecting the September 28, 2004, earthquake, which falls in the prediction set (lags 70–79) used to further evaluate the reasonableness and predictive ability of the candidate models. The very nature of the event of interest qualifies it as rare, while the relatively short historical record of 158 years (compared with other such studies) ensures that it is sparse as well.

Fig. 1 Behavior of the ERR series over the burn-in period, training and prediction sets
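
As an illustration of this data treatment, the sketch below builds the 79-point ERR series in annual rate and splits it into the burn-in, training and prediction segments. It reuses the hypothetical `err_series` helper defined earlier and assumes that the 1857 origin event is included in the cumulative counts.

```python
import numpy as np
import pandas as pd

# Main-shock dates of the Parkfield sequence (Bakun et al. 2005)
quakes = pd.to_datetime([
    "1857-01-09", "1881-02-02", "1901-03-03", "1922-03-10",
    "1934-06-08", "1966-06-28", "2004-09-28",
])
origin = pd.Timestamp("1857-01-09")
days = (quakes - origin).days.to_numpy(dtype=float)  # event times in days from the origin

h, N = 730.0, 79                    # 2-year time-step; 79 * 730 = 57,670 days
z = err_series(days, h, N) * 365.0  # ERR expressed as an annual rate

burn_in    = z[:12]    # lags 1-12, discarded
training   = z[12:69]  # lags 13-69, the 57-point training sample
prediction = z[69:]    # lags 70-79, the holdout set containing the 2004 event
```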

2.3 Model

The initial ACF and PACF graphs of the ERR training sample are shown below (Figs. 2, 3); both exhibit a very slow decay, indicative of nonstationary behavior.

Fig. 2 Raw ACF

Fig. 3 Raw PACF

To address this nonstationarity and to expose any seasonal pattern, we take the first-order differenced series $\nabla z_t = z_t - z_{t-1}$ and plot the corresponding ACF and PACF curves (Figs. 4, 5):

Fig. 4 ACF of the first-differenced series

Fig. 5 PACF of the first-differenced series

Significant peaks can be seen at lags 1s, 2s, 3s, 4s, …, where s = 10 (approximately), with relatively slow decay, indicating the need for seasonal differencing. The ACF and PACF curves for the new series

$$\nabla_{10} \nabla z_{t} = \left( {1 - B^{10} } \right)\left( {1 - B} \right)z_{t}$$

are shown below (Figs. 6, 7).

These graphs show a strong peak at lag 1s and relatively smaller peaks at 2s, 3s and 4s. So it appears that one of the following holds:

Fig. 6 ACF of the seasonally differenced series

Fig. 7 PACF of the seasonally differenced series

1. the ACF is cutting off after lag 1s and the PACF is tailing off at the seasonal lags,

2. the ACF is cutting off after lag 2s and the PACF is tailing off at the seasonal lags, or

3. both the ACF and the PACF are tailing off at the seasonal lags.

These possibilities suggest either (i) a seasonal MA (SMA) of order Q = 1, (ii) an SMA of order Q = 2 or (iii) a seasonal ARMA (SARMA) with P = 1 or 2 (owing to one or two significant peaks in the final PACF) and Q = 1.

Next, to identify the nonseasonal orders, we focus on the within-season lags, where it appears that either

(i) both the ACF and the PACF are tailing off, or

(ii) the ACF and the PACF are tailing off at lags 2 or 3.

These suggest either (i) p = q = 1 or (ii) p = 2 or 3, q = 2 or 3.
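
The correlogram inspection underlying this identification can be reproduced along the following lines. This is only a sketch using statsmodels' plotting helpers (our choice of tool; the paper does not specify software), applied to the `training` series constructed earlier.

```python
import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

zt  = pd.Series(training)
d1  = zt.diff().dropna()        # (1 - B) z_t, the first-differenced series
d10 = d1.diff(10).dropna()      # (1 - B^10)(1 - B) z_t, with an additional seasonal difference

fig, axes = plt.subplots(3, 2, figsize=(10, 8))
for row, (series, label) in enumerate([(zt, "raw ERR"),
                                       (d1, "first difference"),
                                       (d10, "seasonal + first difference")]):
    plot_acf(series, lags=20, ax=axes[row, 0], title="ACF: " + label)
    plot_pacf(series, lags=20, ax=axes[row, 1], title="PACF: " + label)
plt.tight_layout()
plt.show()
```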

Having narrowed the search domain in this way, we now choose the best of these competing “nearby” models according to the minimum-AIC criterion. The best model found has p = 2, d = 1, q = 0 and P = 1, D = 1, Q = 1, s = 10, with an AIC of −402.81. Thus, our final model is the following SARIMA$(2, 1, 0) \times (1, 1, 1)_{10}$:

$$\varPhi\left(B^{10}\right)\phi\left(B\right)\nabla_{10}\nabla z_{t} = \varTheta\left(B^{10}\right)\theta\left(B\right)w_{t},$$

where $\phi(B)$ and $\varPhi(B^{10})$ denote the nonseasonal and seasonal autoregressive polynomials, $\theta(B)$ and $\varTheta(B^{10})$ the corresponding moving-average polynomials, $B$ the backshift operator, $\nabla$ and $\nabla_{10}$ the ordinary and seasonal difference operators, and $w_t$ white noise. The parameter estimates and summary statistics are shown in Table 1.

Table 1 Parameter estimates from the fitted model
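
A hedged sketch of the model search and the final fit, using statsmodels' SARIMAX (again our choice of tool). The grid below spans the orders suggested by the identification step; the AIC reported by a particular package may differ slightly from the value quoted above because of differing likelihood conventions.

```python
import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

s = 10
results = []
for p, q in itertools.product(range(4), range(4)):        # nonseasonal orders 0-3
    for P, Q in itertools.product(range(3), range(3)):     # seasonal orders 0-2
        try:
            fit = SARIMAX(training, order=(p, 1, q),
                          seasonal_order=(P, 1, Q, s)).fit(disp=False)
            results.append(((p, 1, q), (P, 1, Q, s), fit.aic))
        except Exception:
            continue          # skip combinations that fail to converge

results.sort(key=lambda r: r[2])   # smallest AIC first
print(results[:5])

# Final model reported above: SARIMA(2, 1, 0) x (1, 1, 1)_10
final = SARIMAX(training, order=(2, 1, 0),
                seasonal_order=(1, 1, 1, 10)).fit(disp=False)
print(final.summary())
```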

Next, we subject this model to the usual diagnostic tests and present our findings in the following figures (Figs. 8, 9). The standardized residuals from the fit are well within acceptable limits, the ACF of the residuals is negligible, and the Ljung–Box tests have high p values (close to 0.9), thereby failing to reject the hypothesis that the residuals are independent. The normal Q–Q plot of the residuals also supports the normality assumption.

Fig. 8 Model diagnostics

Fig. 9 Q–Q plot
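
The residual checks reported above can be reproduced as follows; this is a sketch only, `final` being the fitted model from the previous snippet, and the exact p values will depend on the software and the residuals retained.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.diagnostic import acorr_ljungbox

# Built-in panel: standardized residuals, histogram, normal Q-Q plot and residual ACF
final.plot_diagnostics(figsize=(10, 8))
plt.show()

# Ljung-Box test on the residuals, skipping the first d + D*s = 11 values
# affected by differencing; large p values fail to reject residual independence
resid = final.resid[11:]
print(acorr_ljungbox(resid, lags=[5, 10, 15]))

# Stand-alone normal Q-Q plot of the residuals
sm.qqplot(resid, line="q")
plt.show()
```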

2.4 Forecast

The ten ERRs forecast within this coherent methodology are reasonably close to the actual values in the prediction set (Table 2). Specifically, Fig. 10 indicates a new earthquake between December 5, 2002, and December 4, 2004 (lag 74, the fifth time-step in the prediction set), which captures the actual event recorded on September 28, 2004. Thus, the final model passes the diagnostic testing procedures and is able to predict the earthquake recorded in the prediction set.

Table 2 ERRs of the forecast and those in the prediction set for the Parkfield seismicity
Fig. 10 Complete Parkfield data (training sample and prediction set) with ten forecasts appended to the training sample for model validation
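
The ten holdout-period forecasts and their conversion back to expected event counts can be sketched as follows (variable names ours; `training` and `final` come from the earlier snippets). Since $z_l = n_l/(lh)$, a forecast $\hat{z}_l$ implies an estimated cumulative mean count $\hat{n}_l = \hat{z}_l \, lh$; reading a jump of about one between consecutive implied counts as a predicted new event is our interpretation of how the forecast flags the 2004 earthquake.

```python
import numpy as np

h_years = 2.0                       # time-step expressed in years (h = 730 days)
lags = np.arange(70, 80)            # lags of the prediction set

fc = final.get_forecast(steps=10)
z_hat = fc.predicted_mean           # forecast ERRs (annual rate) for lags 70-79
ci = fc.conf_int(alpha=0.05)        # 95% prediction intervals

# Back-transform to cumulative expected event counts, n_hat_l = z_hat_l * l * h
n_hat = z_hat * lags * h_years
n_last = training[-1] * 69 * h_years          # implied count at the end of the training sample
jumps = np.diff(np.concatenate(([n_last], n_hat)))
print(np.round(jumps, 2))                     # a jump near one flags a predicted new event
```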

The forecasts obtained from a model derived in the same manner after pooling the training and prediction sets (Fig. 11) indicate that the next severe earthquake is likely to occur around 2025 (lag 84).

Fig. 11 Forecasts from the full model

3 Conclusions

The purpose of this paper has been to emphasize that the ERR is an invaluable weapon in the modeler’s arsenal, especially when dealing with events that are sparse, rare or both. Figure 2 depicts a slowly decaying ACF curve, a classic signature of a long-memory process, and interested researchers might explore fitting fractionally differenced time series models in this situation. Here, following the principle of parsimony, we have refrained from building an unnecessarily complex model. A relentless search for a better model would invariably bring in more advanced technical machinery, but it is our firm conviction that, as long as the underlying structure remains the ERR, better forecasts can be achieved without paying a hefty price in terms of model complexity.

As a welcome corollary, we have accomplished the goal of successfully modeling the historical earthquake data of Parkfield. The key is that the ERR time series not only enhances the pattern while smoothing the raw data, but also transforms a small point process into a discrete time series of a size suitable for ARIMA modeling. The fitted model passes the diagnostic testing procedures and correctly captures the 2004 event, while delivering a challenging message: with a small sample size, we will not be able to reject even a poor model. Informed by the experience of the NEPEC prediction (Shearer 1985), this is probably as far as our model can claim. The proposed model, however, could serve as a reference for more advanced modeling in long-term seismic risk assessment studies (e.g., Amei et al. 2012). Building on this foundation, future research can also address the applicability of the method to other systems, in particular to processes with several regimes.