Introduction and background

Land subsidence and uplift are global phenomena that are induced or triggered by natural and anthropogenic causes. The natural causes are likely initiated by processes such as erosion and consolidation of sediment, tectonics, glacial isostatic adjustment, volcanic and geodynamic processes, seasonal hydrological effects, and dissolution of rock such as limestone (Stewart et al. 2000; Ebmeier et al. 2016). The anthropogenic causes are directly related to human activities such as urban development (e.g., weights from infrastructure, building materials, and groundwater consumption), fluid injection related to hydraulic fracturing, extraction of natural resources (e.g., hydrocarbons and minerals), land use such as agricultural practices, irrigation, diversion of surface-water supplies, and other activities that alter components of the hydrological cycle or climatic patterns (Ellsworth 2013; Galloway et al. 2016; Carisi et al. 2017).

The main hydrological mechanisms that cause vertical displacement include elastic and poroelastic deformations. Both deformations are influenced by physiographical settings and seasonal changes in hydrological surface and near-surface loadings (Puskas et al. 2007; Herring et al. 2016; Birhanu et al. 2018). The elastic deformation is caused by changes in water mass loading, where higher weights cause subsidence (e.g., winter snowpack or the rainy season) and lower weights cause uplift (e.g., summer or the dry season, when evaporation and runoff exceed precipitation). On the other hand, the poroelastic deformation is produced by changes in groundwater storage that affect pore spaces, causing subsidence when groundwater is extracted and uplift when groundwater is recharged from seasonal precipitation (Sun et al. 1999; Fu et al. 2013; Amos et al. 2014; Galloway et al. 2016).

In the USA, pervasive land subsidence has been observed throughout different regions including Phoenix, Arizona (Miller and Shirzaei 2015), Santa Clara Valley, California (Schmidt and Bürgmann 2003), San Joaquin Valley, California (Amos et al. 2014), southern New Jersey (Sun et al. 1999), New Orleans, Louisiana (Dixon et al. 2006), and Houston, Texas (Miller and Shirzaei 2019). The subsidence rates across the USA vary by region and time period. For example, magnitudes of land subsidence in the San Joaquin Valley between the mid-1920s and 1970 exceeded 8.5 m and in 1972 reached 9 m (Poland et al. 1972). Since then, subsidence rates in the San Joaquin Valley have further increased, especially during California’s most extreme drought in recorded history (between October 2012 and September 2016), when some areas subsided at rates as high as 0.6 m per year (NASA/JPL 2019). Another example of long-term rates between 1947 and 2006 reported for Grand Isle, Louisiana and Galveston, Texas showed subsidence of 7.59 ± 0.23 mm/year and 4.71 ± 0.21 mm/year, respectively (Kolker et al. 2011). Subsidence rates related to natural disasters and floods have also been measured. In the city of Houston, following the flooding from Hurricane Harvey in 2017, which remained nearly stationary over Texas for 3 days as a tropical storm, the observed correlation between flooded areas and subsided zones showed that a total of 85% of the flooded areas subsided at a rate of more than 5 mm/year (Milliner et al. 2018; Miller and Shirzaei 2019).

Assessment of vertical displacement of the Earth’s crust is accomplished by different techniques including high-precision surveying, implementation of interferometric synthetic aperture radar (InSAR), use of permanent global positioning system (GPS) stations, aerial digital photogrammetry, and laser scanning (Bürgmann et al. 2000; Fu and Freymueller 2012; Fabris et al. 2014; Liwei et al. 2014; Simonetto et al. 2014; Miller and Shirzaei 2015, 2019). The products acquired by those techniques generate continuous and discrete outputs with various accuracies which range between a few millimeters and a few centimeters.

The advantages of GPS include continuous coverage at high sampling rates that generate hourly, daily, weekly, or monthly solutions; high observational precision in horizontal and vertical components; long-term stability; potential for detection of rapid displacements and horizontal strain components; and an expanding global coverage of observation stations within GNSS (Global Navigation Satellite System) (Argus et al. 2014; Cherniak and Zakharenkova 2017; Hammond et al. 2012). The recent increase in the number of permanent GNSS stations installed around the world provides high-rate and real-time or near real-time geodetic data that increase the accuracy of classical geodetic techniques and improve understanding of vertical displacement at different spatial scales (Wang et al. 2015; Maciuk and Szombara 2018). For example, an accuracy of 1–2 mm/year for seasonal ground movements has been reported using GNSS techniques (Wang et al. 2014, 2015). The Plate Boundary Observatory (PBO), operated by the University NAVSTAR Consortium (UNAVCO) (https://www.unavco.org/), is one of the providers of GPS/GNSS data products that are collected from numerous globally distributed permanent stations.

Recent literature addresses the understanding and representation of complex interactions between seasonal hydrological effects and periodic crustal deformation (Dixon et al. 2006; Fu and Freymueller 2012; Fu et al. 2013; Argus et al. 2014; Fabris et al. 2014; Hammond et al. 2016). To date, many different approaches have been used to quantify hydrological measurements of mass loading (i.e., variation from the surface water and ice), such as field surveys, satellite radar, and optical images. For example, spatio-temporal changes in terrestrial water storage have been acquired from satellite gravimetry using the Gravity Recovery and Climate Experiment (GRACE) (e.g., Rodell et al. 2009; Famiglietti 2014). In addition, predictions from the North American Land Data Assimilation System (NLDAS) hydrologic model (Mitchell et al. 2004; Milliner et al. 2018) and other approaches such as the WaterGAP (Water Global Assessment and Prognosis) hydrological model have been applied. In particular, Rajner and Liwosz (2011) used the WaterGAP Global Hydrology Model (WGHM) to model water mass changes for calculating crustal loading deformation based on the elastic preliminary reference Earth model (PREM). The approach applies a convolution of water masses with appropriate Green’s functions that represent the loading Love numbers in the spatial domain (Döll et al. 2003; Rajner and Liwosz 2011).

At global and broad spatial scales, some of the shortcomings of vertical displacement assessment arise from computational approaches that use sparse GPS measurements, which are often correlated with the coarse spatial (~ 300 km) and temporal (1 month) measurements from GRACE (Fu and Freymueller 2012; Fu et al. 2013; Tan et al. 2016; Karegar et al. 2018). Indeed, taking advantage of denser multi-sensor/network integration (e.g., GPS and precipitation networks) for continuous monitoring at broad spatial scale (e.g., the contiguous USA) requires new geocomputational approaches for data collection, processing, and integrated analysis. Therefore, in an effort to deal with those challenges, this research presents an integrated R and Hadoop-GIS framework. The role of Hadoop is to efficiently process and integrate large amounts of data across a network of connected computers using parallel and distributed processing. GIS, on the other hand, is used for visualization and other analysis to explore trends and model predictions in space and time.

Specifically, the proposed approach uses R software (https://www.r-project.org) that bridges Hadoop, GIS software, spatial libraries, and web mapping services (Ihaka and Gentleman 1996; Neteler and Mitasova 2008; Hengl and Reuter 2009; Prajapati 2013; Conrad et al. 2015). The broad objective of this exploratory research is to evaluate the relationship between time-series data collected from the US GPS network stations and estimated precipitation. The data processing workflow implements (1) a multi-node Hadoop cluster (a total of three computers) for processing of data; (2) custom-built R scripts for spatial integration with GIS; and (3) visualization and analysis of relationships between vertical displacement from GPS and hydrological loading from precipitation using an interactive web-based implementation.

Study area and methodology

Study area and datasets

The study uses data for the continental USA for 48 months between January 1, 2013 and December 31, 2016. The crustal vertical deformation is characterized by time-series estimated from a GPS network with a total of 4347 stations. The data are processed by the Nevada Geodetic Laboratory (NGL) at the University of Nevada, Reno (Blewitt et al. 2018) and published on their file transfer protocol (FTP) server (ftp://gneiss.nbmg.unr.edu/rapids). The main data components that are collected from the GPS network include attributes such as easting, northing, and vertical positional coordinates. The stations in the network measure movements of the surface with millimeter to centimeter level positional accuracies. The vertical component (up-direction) of daily GPS measurements contains higher error (~ 3–5 mm) than the horizontal components (~ 1–2 mm). The NGL site offers several types of processed data packages (i.e., solutions), available at different frequencies, such as 24-h sample rate and 5-min sample rate (Blewitt et al. 2016). The GPS time-series used in this study are GPS daily solutions.

The NGL data processing is based on an exhaustive and complex strategy that uses non-fiducial daily products from the National Aeronautics and Space Administration (NASA) Jet Propulsion Laboratory (JPL) archive (https://www.jpl.nasa.gov), which include parameters such as GPS satellite orbit estimates, satellite clock estimates, time-pole parameter estimates, and satellite eclipse times. The parameters support the processing of GPS data in RINEX (Receiver Independent Exchange) format that integrates different archives such as UNAVCO (ftp://data-out.unavco.org), CDDIS (ftp://cddis.gsfc.nasa.gov), and CORS (ftp://cors.ngs.noaa.gov/cors). The NASA JPL GIPSY OASIS-II software uses the ‘Precise Point Positioning’ (PPP) strategy to process GPS data and to provide daily solutions with a high level of positional accuracy (Blewitt et al. 2016, 2018; Hammond et al. 2016).

The monthly precipitation data that match the GPS time-series dataset were acquired from the PRISM (Parameter-elevation Relationships on Independent Slopes Model) Climate Group (http://www.prism.oregonstate.edu/) (Daly et al. 2008). The PRISM data are compiled from climatic observations from a wide range of monitoring networks (e.g., National Weather Service Cooperative Observer Program (COOP), USDA NRCS Snow Telemetry (SNOTEL), and USDA Forest Service and Bureau of Land Management Remote Automatic Weather Stations (RAWS)) that are used for determining seasonal hydrological loadings such as from snow and rain over the USA and some areas outside the conterminous USA. The PRISM interpolation method calculates different climatic elements using an elevation-regression relationship and weights that are based on the physiographic similarity (i.e., elevation, coastal proximity, topographic position and orientation, vertical atmospheric layer, and orographic effectiveness of the terrain) at approximately 800-m spatial resolution. The quality-controlled network of surface stations used in the analysis includes nearly 13,000 precipitation and 10,000 temperature stations. The estimates from PRISM are available at multiple spatial and temporal resolutions for the period between 1895 and the present.

An integrated R and Hadoop-GIS framework

The analysis of GPS time-series was implemented with Apache Hadoop (http://hadoop.apache.org). The Hadoop framework is an open source implementation that enables parallel processing of “big data” on commodity hardware distributed as multiple nodes for data analytics and compute-intensive applications. The Hadoop Distributed File System (HDFS) is the main component for storing the data, while the processing framework implements the MapReduce model, which reduces the complexity of the problem into a parallelized, distributed solution across the cluster (Uskenbayeva et al. 2015; Triguero et al. 2015; Landset et al. 2015; Babar et al. 2019). MapReduce is a programming model which consists of two main elements: (1) a mapper used for splitting data for parallel processing across the cluster and (2) a reducer which partitions and combines the data into a single value or a set of values. The Apache Hive extension is a high-level MapReduce framework which is used in this research for reducing the GPS daily solutions to monthly averages (Rodger 2015; Kukreja 2016). Hive acts as an abstraction for Hadoop and is known as a data warehouse infrastructure. It supports HiveQL, which is a SQL-like language for processing queries (Bansal et al. 2016).
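The mapper and reducer roles described above can be illustrated with a minimal in-memory sketch in Python. This is not the Hive or Hadoop code used in the study: the record layout (station, ISO date, vertical component in mm), the station names, and the function names are assumptions made purely for illustration.

```python
from collections import defaultdict

def mapper(record):
    """Emit ((station, 'YYYY-MM'), up_mm) for one daily record (hypothetical layout)."""
    station, date, up_mm = record
    return (station, date[:7]), up_mm

def reducer(key, values):
    """Combine all daily values that share a key into one monthly mean."""
    return key, sum(values) / len(values)

def monthly_means(records):
    groups = defaultdict(list)  # shuffle step: group mapped values by key
    for key, value in map(mapper, records):
        groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

# Illustrative daily solutions (not data from the study)
daily = [
    ("STA1", "2014-12-01", 2.0),
    ("STA1", "2014-12-02", 4.0),
    ("STA1", "2015-01-01", -1.0),
    ("STA2", "2014-12-01", 0.5),
]
result = monthly_means(daily)
```

On a cluster, the grouping step is carried out by the framework across nodes; in a SQL-like language such as HiveQL the same aggregation would typically be written as a GROUP BY query over station and month.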

While Hadoop can be used to handle large datasets through distributed storage and batch computing, the R software offers statistical capabilities and a wide variety of geospatial packages that are used for the integration of the R and Hadoop-GIS framework. The R and Hadoop integration can be achieved in several ways, including R Hadoop packages; Hadoop Streaming, which allows programming languages other than the native Java to be used; the R and Hadoop Integrated Programming Environment (RHIPE); and ORCH, the Oracle R Connector for Hadoop. In this research, the Hadoop Streaming integration is used, which required installation of R on individual nodes in the Hadoop cluster. In addition, the geospatial functionality of the Geographic Resources Analysis Support System (GRASS) environment was directly accessed from R scripts and packages (Fig. 1).

Fig. 1 Integrated R and Hadoop-GIS framework

Methodology

The Hadoop distributed multi-node cluster environment included a total of three computers (i.e., nodes) for processing of the data. The GPS data were uploaded directly to the HDFS in their raw comma-separated value (CSV) format. The workflow for the uploaded data on the HDFS included data preprocessing and aggregation through HiveQL query statements and storing of the results for subsequent spatial analysis, mapping, and visualization. Initially, the data preprocessing involved screening for potential outliers in the GPS time-series and noise removal using interquartile range (IQR) techniques. Afterwards, the daily GPS time-series were aggregated into monthly averages for representing vertical displacements.
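The outlier screening step can be sketched as follows. The text does not state which IQR rule was applied, so this example assumes the common Tukey fences (values farther than 1.5 × IQR from the quartiles are dropped); the multiplier `k`, the quartile method, and the sample values are assumptions, not details from the study.

```python
import statistics

def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey fences, assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]

# Illustrative daily vertical positions in mm, with one obvious spike
daily_up_mm = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0]
cleaned = iqr_filter(daily_up_mm)
```

A larger `k` makes the screen more permissive; the filtered daily values would then be averaged by month as described above.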

The regularized spline with tension interpolation method was then used to convert the monthly GPS averages into continuous raster datasets using spatial libraries in R in conjunction with GRASS GIS. The interpolated data represented the monthly vertical crustal deformation from the GPS dataset. The hydrological loading from the precipitation datasets was downloaded as monthly rasters and integrated with the interpolated GPS maps. Both datasets were detrended (i.e., removal of trends from a time-series) using a linear method. Also, for subsequent data exploration and examination of spatial relationships, both datasets were standardized using the standard z-score method. Lastly, the RStudio Server (https://www.rstudio.com) was used to provide an interactive web-based environment and a better understanding of relationships between vertical crustal deformation from the GPS time-series and the hydrological loadings from the precipitation dataset.
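Linear detrending and z-score standardization can be sketched for a single time-series as below. This assumes an ordinary least-squares line fitted against the month index and a population standard deviation for the z-score; the study does not specify either choice, and the synthetic series is illustrative only.

```python
import math
import statistics

def detrend_linear(series):
    """Subtract the least-squares line fitted against the time index 0..n-1."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return [y - (intercept + slope * t) for t, y in enumerate(series)]

def zscore(series):
    """Standardize to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(series)
    sd = statistics.pstdev(series)
    return [(y - mean) / sd for y in series]

# Synthetic 48-month series: linear trend plus an annual cycle (illustrative)
series = [0.5 * t + math.sin(2 * math.pi * t / 12) for t in range(48)]
residuals = detrend_linear(series)
z = zscore(residuals)
```

After both steps, precipitation and displacement are on a common dimensionless scale, which is what allows them to be plotted on a shared axis and compared directly.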

Results and discussion

The spatial distribution of the monthly precipitation average varies across the USA (Fig. 2a). The precipitation average shown in the map reflects a total of 48 months of interpolated precipitation datasets expressed in millimeters (mm). The amount of precipitation affects the elastic surface of the Earth’s crust, which moves upward and downward in response to decreases or increases in the loadings. The changes in precipitation, which are mostly driven by seasonal oscillations, are depicted in the standard deviation map (Fig. 2b). In the map, the precipitation average is higher in the east and lower in the west, except for the northwest, which receives the highest precipitation. Also, the standard deviation map of the precipitation shows that the southeast and the northwest have the highest variability. The average vertical displacement map generated from the GPS time-series shows that below average values are mostly concentrated in the northern central region (Fig. 2c). This region was covered by glaciers more than 10,000 years ago and still undergoes isostatic rebound (Sella et al. 2007). However, the standard deviation map of vertical displacement shows that the western region, which is the most mountainous (i.e., Rocky Mountains), has the least variation. This is likely related, in part, to a different geologic setting (i.e., GPS seasonal deformation dominated by surface mass loading) (Fig. 2d). For instance, this region has a much higher proportion of exposed bedrock (i.e., less sediment cover) and a lower density of vegetation due to climate and elevation; therefore, it shows less GPS seasonal deformation dominated by surface mass loading relative to the northwest. Interestingly, the northeastern region has more mountains but also shows lower variability. This region was also covered by glaciers, like the north central area.

Fig. 2 Precipitation maps (mm) of (a) monthly average and (b) standard deviation (not detrended), and GPS maps (mm) of (c) monthly average and (d) standard deviation (detrended)

Locally weighted scatterplot smoothing (LOWESS) curves for the monthly time-series averages were generated from standardized monthly distributions (Fig. 3). Although the monthly distributions of the time-series were highly variable, the curves depict the seasonal patterns of averages associated with the precipitation and GPS displacement for the entire USA. Additional analysis from individual monthly comparisons between the precipitation and the vertical displacement suggested that some months were highly correlated. For instance, the monthly correlations between the time-series using a 1-year period between January and December 2014 showed that the highest Pearson’s correlation was r = − 0.6 for the month of December (Fig. 4). The figure also shows a presence of both positive and negative correlation trends that are most likely influenced by different monthly precipitation patterns and crustal displacement responses. For example, the trends are directly influenced by precipitation patterns across the USA where along the west coast most of the precipitation falls during the winter while the least precipitation occurs in the summer months. In the central USA, the wettest periods are during the summer months, while in the eastern portion of the country, the distribution of precipitation occurs evenly throughout the year.
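As a brief illustration of the correlation measure used here, Pearson’s r between paired standardized precipitation and displacement values can be computed as sketched below. The pairing of values (for example, the same locations within one month) and the sample numbers are assumptions for illustration, not data from the study.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative standardized values only (not data from the study)
precip_z = [1.2, 0.4, -0.3, -1.1, 0.8]
displacement_z = [-0.9, -0.2, 0.1, 1.3, -0.5]
r = pearson_r(precip_z, displacement_z)
```

A negative r, as in this toy example, corresponds to the case where higher precipitation coincides with lower (more subsided) vertical positions.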

Fig. 3 LOWESS curves from monthly precipitation and GPS displacement from entire USA

Fig. 4 Monthly correlations between precipitation and GPS displacement from entire USA

In the eastern USA, Florida is an exception that has low winter precipitation and high summer averages that exceed 15 cm per month (Ward and Elliot 1995). In addition, other factors such as precipitation type and amounts, duration, intensity, snowmelt and snowpack, temperature, terrain, and distribution of different geological properties can influence the relationship between precipitation and the vertical displacement.

The relationship from Fig. 3 is further explored by wavelet coherence analysis. The wavelet coherence is similar to a correlation measurement, where a value of 1 (red color) represents high correlation and a value of 0 (blue color) represents no correlation (Fig. 5a). With the exception of a few gaps in the red band centered between 0.25 and 2 periods (months), there is high correlation between the time-series. The high correlation is also shown in the average (global) coherence plot (Fig. 5b), which suggests that high correlation is associated with periods of 4, 6, 9, and 12 months. This analysis indicates that there is seasonal, semi-annual, and annual correlation between precipitation and GPS displacement. The phase is represented by arrows (Fig. 5a) indicating the leading/lagging of the two time-series. A zero-phase difference means synchronized time-series that move together. Arrows point to the east (west) when the time-series are positively (negatively) correlated, while arrows pointing southward mean that the first time-series leads the second, whereas northward-pointing arrows show the opposite. Most of the arrows point southward, which means that precipitation preceded the vertical displacement.

Fig. 5 Wavelet coherence (a) and average coherence (b) between average monthly precipitation and average GPS displacement

The interactive web-based graphical user interface (GUI) developed with R and the Shiny package (Beeley 2016; Chang et al. 2020) enhances the traditional analysis generated by the Hadoop-GIS framework and shows the potential for extended exploration of the processed precipitation and vertical displacement time-series maps (Fig. 6). In particular, a total of 48 corresponding map sets are loaded for visualization and exploration of spatial and temporal variation and trends across the USA (Fig. 6). The interactive web interface in Fig. 6 shows the maps associated with the highest Pearson’s correlation from Fig. 4, which was r = − 0.6 for the month of December 2014. The correlations suggest that low precipitation is associated with higher vertical displacement. In Fig. 6, the selected point near Santa Rosa, California (i.e., longitude = − 122.199° and latitude = 38.128°) shows a plot of both time-series at local scale. The x-axis shows the time period of the comparison, starting in January 2013 and ending in December 2016. The y-axis represents the standardized values from both time-series. The correlation that corresponds to this location is r = − 0.325, representing the relationship between precipitation and vertical displacement for a total period of 48 months. However, at local scale, different places across the USA would experience correlations in different directions (i.e., positive and negative) with different magnitudes.

Fig. 6 Interactive web-based module for visualization and exploration of monthly precipitation and GPS displacement relationships

For instance, agricultural areas that are heavily irrigated (e.g., San Joaquin Valley, California) would have different correlations compared with areas that are less irrigated. Additional factors such as watershed topography, catchment area, slope, evapotranspiration, geologic settings, and susceptibility to extreme weather events can also affect surface loadings at different locations and produce different patterns of correlations.

Additional features of the interactive web-based GUI include tools for enhancing and smoothing data plots in two dimensions, such as a weighted moving average filter and a Whittaker smoother, as well as tools for correlation analysis, including a wavelet module and a cross-correlation function (CCF). An example of an application of the Whittaker filter for smoothing the data is shown in Fig. 7a using the data from Fig. 6. The plot shows the raw data in transparent colors which correspond to the same colors of the smoothed time-series. In Fig. 7a, although the correlation from the smoothed solution has decreased to r = − 0.047, the visualization clearly illustrates the precipitation seasonality and an uplift of the crust around August 2014 surrounded by subsidence at both ends of the time-series.
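For readers unfamiliar with the Whittaker smoother, a minimal sketch is given below. It solves (I + λ DᵀD) z = y, where D is a difference matrix, using a dense Gaussian elimination that is adequate for short series such as 48 months. The smoothing parameter `lam`, the difference `order`, and the sample series are assumptions; the settings used in the GUI are not stated in the text.

```python
def whittaker_smooth(y, lam=10.0, order=2):
    """Whittaker smoother: solve (I + lam * D^T D) z = y for the smooth series z."""
    n = len(y)
    # Build the difference matrix D (n - order rows) by differencing the identity.
    d = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(order):
        d = [[d[i + 1][j] - d[i][j] for j in range(n)] for i in range(len(d) - 1)]
    # Assemble A = I + lam * D^T D.
    a = [[(1.0 if i == j else 0.0) + lam * sum(row[i] * row[j] for row in d)
          for j in range(n)] for i in range(n)]
    # Gaussian elimination with partial pivoting on the augmented matrix [A | y].
    aug = [row[:] + [float(v)] for row, v in zip(a, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z

def roughness(series):
    """Sum of squared second differences, a simple measure of wiggliness."""
    return sum((series[i + 2] - 2 * series[i + 1] + series[i]) ** 2
               for i in range(len(series) - 2))

# Illustrative jagged series (not data from the study)
noisy = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
smooth = whittaker_smooth(noisy, lam=10.0)
```

Larger values of `lam` give a smoother curve at the cost of fidelity to the raw values, which is one reason a correlation computed on smoothed series can differ from the one computed on the raw series.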

Fig. 7 Correlation visualization and analysis using (a) a Whittaker filter for smoothing the time-series and (b) the cross-correlation function (CCF) for identifying the time lags between both time-series. The spatial point corresponds to Fig. 6 (Long = − 122.199, Lat = 38.128)

In the GUI, the relationship between standardized precipitation and vertical displacement can also be explored by cross-correlation function (CCF) analysis (Fig. 7b). The CCF represents the correlation between the observations of two time-series, here the precipitation and the vertical displacement, that are separated by time units (lags) expressed in months (Jayawardhana and Gorsevski 2019). In the CCF plot (Fig. 7b), the horizontal blue-dashed lines represent the upper and the lower confidence levels for significance (i.e., confidence of 95%). The confidence intervals are computed from the number of observations and the lag interval between the time-series. The main assumptions to satisfy the CCF requirement include the following: (1) the time-series are uncorrelated, (2) the processes are not autocorrelated, (3) the populations are normally distributed, and (4) the sample size is large. Large correlation estimates (labeled ACF in the plot) that exceed the confidence intervals at a given lag are used for determining whether a relationship exists between the two series.
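A minimal sketch of a sample CCF with an approximate 95% significance bound is shown below. It assumes the convention in which the value at lag k correlates x[t + k] with y[t], and it uses the common large-sample bound ± 1.96/√n, which does not depend on the lag; whether the GUI computes its limits exactly this way is an assumption. The synthetic series are illustrative only.

```python
import math

def ccf(x, y, max_lag):
    """Sample cross-correlation: lag k correlates x[t + k] with y[t]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in x) / n)
    sy = math.sqrt(sum((v - my) ** 2 for v in y) / n)
    out = {}
    for k in range(-max_lag, max_lag + 1):
        pairs = [(x[t + k], y[t]) for t in range(n) if 0 <= t + k < n]
        cov = sum((a - mx) * (b - my) for a, b in pairs) / n
        out[k] = cov / (sx * sy)
    return out

def confidence_bound(n, z=1.96):
    """Approximate 95% bound for a zero cross-correlation (large-sample assumption)."""
    return z / math.sqrt(n)

# Synthetic example: y repeats x two months later (x leads y by 2 months)
x = [math.sin(2 * math.pi * t / 12) for t in range(48)]
y = [math.sin(2 * math.pi * (t - 2) / 12) for t in range(48)]
values = ccf(x, y, max_lag=6)
peak_lag = max(values, key=values.get)
significant = {k: v for k, v in values.items() if abs(v) > confidence_bound(len(x))}
```

Under this convention, a peak at a negative lag indicates that the first series leads the second; with a 12-month cycle, strong negative values appear about half a cycle away from the peak, which mirrors the seasonal pattern discussed for Fig. 7b.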

For example, a peak above the upper ACF line at lag zero indicates an overall positive cross-correlation between the time-series (Fig. 7b). Lag zero is the most significant cross-correlation at the 95% level of significance in the figure, but lags − 6 and 6 are also significant. Although the interpretation from the wavelet arrows suggested that precipitation preceded the vertical displacement, the CCF significance at lag zero may imply otherwise, namely that it is difficult to predict which time-series would precede the other. However, the wavelet analysis results from a broad geographical scale while the CCF results from a local scale using a single point. In addition, Fig. 7b reflects the seasonal pattern of the series, which is noticeable at lags − 6 and 6, where the values fall below the lower ACF line and show significant negative cross-correlations. Thus, there is a clear pattern that both time-series evolve concurrently, where a positive correlation is generated when one time-series increases as the other increases, and vice versa. In this example, the correlations at lags − 6 and 6 indicate that the peak period for the occurrence is most likely associated with subsidence.

Conclusions

The implementation of the integrated R and Hadoop-GIS framework shows the potential of this research to be implemented at local and regional scales (i.e., continental USA) in tandem with additional heterogeneous sources of spatio-temporal data using free and open source software (FOSS) deployed on commodity computers. The presented computational framework integrated a multi-node Hadoop and Hive cluster environment for processing the GPS data from permanent stations through distributed and parallel processing, while the geospatial implementation for producing monthly relationships between the GPS data and precipitation used GRASS GIS and R packages. In the prototype, the interactive web-based GUI for geovisualization was fully developed in the R software environment.

The analytical potential of the presented near real-time assessment tool for monitoring vertical displacement was demonstrated by different modules, such as the wavelet coherence and CCF analysis. The wavelet coherence demonstration suggested that seasonal, semi-annual, and annual correlations occurred between precipitation and GPS displacement and that vertical displacement was influenced by seasonal hydrologic loading using mean estimates. The leading time-series in the wavelet coherence analysis was the precipitation, which represented the hydrological loading that influences the behavior of the seasonal ground deformation. Also, the analysis from the CCF showed the potential for exploring seasonal patterns at local scale that were generated by both time-series.

Future development of this framework can focus on the integration of auxiliary datasets such as InSAR products, Light Detection and Ranging (LiDAR), GRACE datasets, NLDAS datasets, land use/land cover from remote sensing, geologic materials and lithology, and other coverages that relate to urban development, water consumption, and oil and gas extraction. Such integration of big data and application of spatio-temporal analytics to real-time data can improve our understanding and support long-term monitoring of surface deformations, especially those induced by human activities.