Introduction

The Region of Waterloo (RoW) in Ontario (Fig. 1a,b) is one of the largest municipalities in Canada, which relies mostly (>75%) on groundwater supplies for its drinking water. The dramatic growth of the region and increasing water demand have prompted the development of municipal wellfields as well as the need to sustainably manage groundwater resources.

Fig. 1
a The location of the study area in relation to Canada and the US. b The location of the Mannheim East Well Field within the Region of Waterloo. c Conceptual hydrogeologic model of the Waterloo Moraine (after Blackport et al. 2014)

There are more than 40 wellfields, consisting of more than 120 water supply wells within the region, which supply in excess of 269,000 m³/day of groundwater to urban citizens (Lake Erie Region Source Protection Committee 2012). Groundwater is extracted from a complex multi-aquifer-aquitard system within the Waterloo Moraine (Fig. 1c), which was formed by interlobate glacial activity, with seven well fields pumping groundwater from the upper aquifer units, while ten additional well fields pump groundwater from the deeper units (Bajc and Shirota 2007).

The complexity and susceptibility of the Waterloo Moraine to overexploitation of groundwater resources and its potential contamination requires a sound understanding of local hydrogeology, including the reliable estimation of hydraulic parameters such as hydraulic conductivity (K) and specific storage (Ss).

A number of hydraulic parameter estimation approaches have been developed and studied during the past several decades. These include: (1) the analysis of small-scale data, including grain size distributions (Hazen 1911; Kozeny 1927; Shepherd 1989), laboratory permeameter analyses of core samples (Sudicky 1986; Sudicky et al. 2010; Alexander et al. 2011), slug tests (Hvorslev 1951; Bouwer and Rice 1976), and flowmeter tests (Rehfeldt et al. 1992); and (2) pumping tests, in which data are fitted to analytical solutions to determine the large-scale hydraulic properties of the aquifer (Theis 1935; Cooper and Jacob 1946). However, the large area of a municipal well field raises the question of whether small-scale hydraulic parameter estimates are suitable and reliable for predicting groundwater levels and flow. Another concern is that it is difficult to conduct dedicated pumping tests within a municipal well field where pumping/injection schedules cannot be readily modified or terminated. When dedicated pumping tests can be conducted, existing analytical solutions that treat the subsurface as homogeneous are typically utilized, which can yield biased and questionable parameter estimates (Wu et al. 2005; Berg and Illman 2011a, b, 2013, 2015).

As part of their management and operation, most municipal well fields contain a network of observation wells to monitor the response of the aquifer system to pumping, and to inform optimal extraction rates. In addition to monitoring the response of the aquifer and informing pumping/injection rates, these monitoring data can potentially be used to better characterize regional groundwater flow and estimate hydraulic parameters (Yeh and Lee 2007; Harp and Vesselinov 2011).

In a previous study by Luo and Illman (2016), long-term pumping/injection events and water-level variation records were used to estimate hydraulic parameters including transmissivity (T) and storativity (S) for the shallow aquifer (AFB2) within the Waterloo Moraine. A set of T and S values was estimated between each production and monitoring borehole by fingerprinting the water-level variations to pumping/injection rate changes. The fingerprinting process was accomplished through the Theis (1935) model implemented in the WELLS code (Mishra and Vesselinov 2011) coupled with the nonlinear parameter estimation code PEST (Doherty 2005). The Luo and Illman (2016) study showed that long-term municipal water-level records were amenable to hydraulic parameter estimation, as the geometric means of the individual T and S estimates were similar to previous pumping test results at the study site. However, the wide range of estimated T (9–55,335 m²/day) and S (0.002–0.736) indicated the high degree of heterogeneity of the investigated aquifer. Furthermore, poor validation results using data that were not used for calibration suggested that T and S estimates from individual pairs of pumping and monitoring boreholes may not be suitable for predicting drawdown at other monitoring wells. Luo and Illman (2016) concluded that obtaining more accurate parameter estimates at this site requires a more sophisticated groundwater flow model that considers heterogeneity in K and Ss and better accounts for the forcing functions (i.e., initial and boundary conditions as well as source/sink terms).
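The fingerprinting idea can be illustrated with a minimal sketch of Theis (1935) drawdown under a stepwise pumping/injection schedule, using superposition in time. This is not the WELLS code: the function names, the numerical evaluation of the well function, and the values of T, S, r and the rate schedule are hypothetical and chosen only for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329


def well_function(u):
    """Theis well function W(u), i.e. the exponential integral E1(u)."""
    if u <= 0.0:
        raise ValueError("u must be positive")
    if u <= 1.0:
        # Power series: W(u) = -gamma - ln(u) + sum (-1)^(n+1) u^n / (n * n!)
        total = -EULER_GAMMA - math.log(u)
        u_pow_over_fact = 1.0
        for n in range(1, 100):
            u_pow_over_fact *= u / n  # now equals u^n / n!
            term = u_pow_over_fact / n
            total += term if n % 2 == 1 else -term
            if term < 1e-17:
                break
        return total
    # Continued fraction (modified Lentz method) for larger u
    b = u + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, 500):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-12:
            break
    return h * math.exp(-u)


def theis_drawdown(r, t, T, S, rate_steps):
    """Drawdown at distance r and time t for a stepwise rate schedule.

    rate_steps: list of (start_time, rate) sorted by start_time;
    rate > 0 is pumping, rate < 0 is injection. Units must be consistent
    (here: m, day, m^2/day, m^3/day).
    """
    s = 0.0
    previous_rate = 0.0
    for t_start, rate in rate_steps:
        if t <= t_start:
            break  # later steps have not started yet
        u = r * r * S / (4.0 * T * (t - t_start))
        s += (rate - previous_rate) / (4.0 * math.pi * T) * well_function(u)
        previous_rate = rate
    return s


# Hypothetical schedule: pump, rest, then inject
schedule = [(0.0, 2000.0), (10.0, 0.0), (20.0, -1000.0)]
s_pumping = theis_drawdown(100.0, 5.0, 500.0, 0.01, schedule)
s_recovery = theis_drawdown(100.0, 15.0, 500.0, 0.01, schedule)
s_injection = theis_drawdown(100.0, 30.0, 500.0, 0.01, schedule)
```

In this sketch the simulated drawdown rises during pumping, recovers after shutdown, and becomes negative (mounding) during injection. Matching that pattern to the observed water levels is what the T and S estimation exploits.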

There are a number of approaches to map K heterogeneity. A conceptually simple approach is to map the K heterogeneity through the interpolation of small-scale K estimates including permeameter tests, slug tests and single-hole tests, but a large amount of data is required. For example, Rehfeldt et al. (1992) estimated that about 400,000 K measurements would be required to accurately predict the transport of tracers in an alluvial aquifer at the Columbus Air Force Base in Mississippi, USA, commonly known as the Macrodispersion Experiment (MADE) site. Thus, it would be prohibitively expensive and time-consuming to perform such analyses at a municipal wellfield.

An integrated stochastic-deterministic approach is another method to map K heterogeneity through calibrating a flow model, especially for large-scale systems (Traum et al. 2014; Sampath et al. 2015, 2016; Liao et al. 2020). Generally, the transition probability (TP) approach generates separate “TP zones” and three-dimensional (3D) material distribution based on stratigraphy information from borehole records to create conditional realizations (Carle and Fogg 1997). The realizations are then incorporated into the flow model and calibrated using long-term hydraulic head data. The accuracy of the TP approach is dependent on the density of the boreholes, while the pumping rate is usually simplified to annual/monthly pumping rates (Traum et al. 2014) or calibrated based on the well type (domestic use, public supply, etc.). Thus, such a method is more suitable for large-scale systems with sufficient borehole records and without precise consideration of municipal well operations. Berg and Illman (2015) also examined the TP approach and compared its performance to other approaches such as kriging, geological modeling, and hydraulic tomography (HT) for capturing heterogeneity at a small-scale field site.

Hydraulic tomography has been developed to map subsurface heterogeneity through inverse modeling of hydraulic heads recorded at multiple locations from sequential pumping tests. Over the last several decades, it has been tested under synthetic (e.g., Yeh and Liu 2000; Bohling et al. 2002; Cardiff et al. 2013a, b), laboratory (e.g., Illman et al. 2007, 2008, 2009; Illman et al. 2010a, b; Illman et al. 2015; Berg and Illman 2011a; Zhao et al. 2016), and field conditions (e.g., Illman et al. 2009; Berg and Illman 2011b; Berg and Illman 2015; Zhao and Illman 2018; Zhao et al. 2019). HT relies on the joint inverse modeling of multiple sets of hydraulic heads from different observation intervals, while pumping and/or injecting water at multiple locations in geologic media. Due to the difficulty in performing sequential pumping tests at municipal wellfields, HT has not been applied previously at such sites.

Previous studies conducted in the study area have mostly relied on pumping tests to estimate hydraulic parameters of various units (e.g., CH2M HILL 2003a, b; Golder Associates 2011; Matrix Solutions Inc, S.S. Papadopulos and Associates 2014a, b), but such tests typically require shutting down the municipal wells and the interpretation is more difficult when the municipal wells are in operation. In order to overcome this difficulty and apply HT analysis without conducting sequential pumping tests, long-term pumping/injection events and water-level variation records obtained during municipal well field operation are used in this study to jointly calibrate a groundwater flow model consisting of homogeneous geological units to estimate the K and Ss. As an initial attempt, geological models are used for HT analysis in this study, as previous studies have shown the importance of geological data in parameter estimation when wells are far apart (Illman et al. 2015; Zhao et al. 2016; Luo et al. 2017; Zhao and Illman 2017, 2018). However, since perfect knowledge of stratigraphy is not available, there is a critical need to assess the impact of various conceptualizations of site geology on groundwater model calibration and HT at the field scale.

Geological uncertainty in groundwater modeling normally originates from: (1) the geological structure; (2) the use of effective model parameters; and (3) model parameters including local-scale heterogeneity (Refsgaard et al. 2012). Zhao et al. (2016) compared the performance of four geological models of different accuracies using laboratory sandbox data by model calibration and validation. Results revealed that it was possible to calibrate geological models both with and without accurate hydrostratigraphy because of the parametric compensation effect (Refsgaard et al. 2012). However, Zhao et al. (2016) found that the calibration of inaccurate geological models led to unrealistic parameter estimates in some geological units and poor model validation results; thus, while an inaccurate model can be well calibrated, this does not necessarily result in a robust model that is suitable for making accurate predictions of groundwater flow.

The overall goal of this study is to examine the feasibility of conducting HT analysis of existing water level records influenced through municipal well field operations. Specifically, the objectives of the study are to: (1) demonstrate the usefulness of long-term pumping/injection and monitoring well records obtained through municipal well field operations for estimating hydraulic parameters (i.e., K and Ss) of geological units; (2) investigate the impact of different geological conceptualizations on the performance of groundwater model calibration and validation; and (3) explore the importance of geological data in improving the results of HT analysis at a large-scale field site.

Site description and geology

This study focuses on the Mannheim East Municipal Well Field (Fig. 1b) located in the southwest area of the city of Kitchener, Ontario, Canada. In order to minimize the effect of boundary conditions on simulated groundwater levels, the model is constructed for a larger area (5 km × 5 km) with the Mannheim East Well Field located approximately in the center of the simulation domain.

Information on more than 500 wells is recorded in the Region of Waterloo’s WRAS+ database (Regional Municipality of Waterloo 2014), which contains various data sets such as well depths, screen intervals, static water levels, lithology, etc. There are 13 water-supply wells and 19 monitoring wells with 28 screens completed at various depths within the study area, with continuously monitored pumping/injection rates and water-level records. The Mannheim East Well Field (around 3.6 km²) is subdivided into three smaller well sites, which are identified as Mannheim East, Peaking, and Aquifer Storage and Recovery (ASR; shown in Fig. 2), while the distribution of water supply and monitoring wells utilized in this study is also shown in Fig. 2. A detailed description of these wells is provided in Table S1 of the electronic supplementary material (ESM).

Fig. 2
The distribution of water-supply and monitoring boreholes in the study area. The red triangles indicate water-supply wells, rectangles indicate municipal well fields, and black circles indicate monitoring wells. Names of water supply and monitoring wells are modified for security purposes

The site sits on top of the Waterloo Moraine (Fig. 1c), a Quaternary kame and kettle complex that was formed by numerous advances and retreats of ice lobes during the Wisconsinan glaciation and has been studied extensively by Karrow (1993). The resulting glaciofluvial sediments consist of a variety of materials including clay, interbedded tills, fine sand, sandy gravel, and coarse gravel, which are normally stratified and poorly sorted (Martin and Frind 1998; Golder Associates 2011).

Four relatively continuous till units have been identified within the Waterloo Moraine, including Pre-Catfish Creek Tills, Catfish Creek Till, Maryhill Till and Tavistock/Port Stanley Till. The Pre-Catfish Creek Tills, which include the first till units deposited in the area, are generally hard, stony silts to clayey silt tills (Karrow 1993). These till units, including the Canning Till and several other tills, were formed during the Wisconsinan glacial events and locally overlie the bedrock (Martin and Frind 1998).

The Catfish Creek Till, the next oldest unit, was deposited by a major glacial advance across southern Ontario. It is an extremely dense, stony silt till that is commonly referred to as “hardpan” by experienced local water-well drillers (Golder Associates 2011).

The Maryhill Till, which separates the upper and deeper aquifers, is a clay-rich low-K natural infiltration barrier. Previous studies have identified three separate ice advances which resulted in the Upper, Middle and Lower Maryhill Till (Karrow 1993; Paloschi 1993; Bajc and Shirota 2007).

The youngest till units, within the Tavistock/Port Stanley Till, overlie large portions of the upper aquifer. The Tavistock Till is a dark brown clayey silt till, similar to the Maryhill Till, while the Port Stanley Till is recognized as a sandy silt to silty sand till (Golder Associates 2011).

Description of geological models

Modeling can provide valuable insights on the Waterloo Moraine groundwater system and practical advice for source water protection and management for the Region of Waterloo. As models evolved from a simple, layer-cake concept to a fully 3D distribution of geological units, the focus has changed in scale from the well field scale to the scale of the entire Waterloo Moraine system to solve more sophisticated problems such as the assessment of well vulnerability and wellhead protection areas (Frind et al. 2014).

45 years of modeling the Waterloo Moraine

The first groundwater flow model of the Waterloo Moraine was developed by Emil Frind in 1973 as a simple, two-dimensional (2D), finite element, layer-cake system (International Water Supply Ltd. 1973). The model was calibrated to hydraulic head values at different observation wells and then used for the prediction of aquifer responses under various pumping conditions.

A quasi-3D model was then successfully developed and utilized by Rudolph (1985) and Rudolph and Sudicky (1990) at the Greenbrook well field to capture the complexity of the Waterloo Moraine system. The Waterloo North Aquifer System Study (Terraqua Investigations Ltd 1992) and the Study of the Hydrogeology of the Waterloo Moraine (Terraqua Investigations Ltd 1995) were conducted to define the major aquifer and aquitard units and regional recharge zones. Then, a fully 3D Waterloo Moraine model was created by Martin and Frind (1998) based on the application of WATFLOW (Molson et al. 1995). The groundwater model utilized triangular, prismatic, finite elements and allowed for grid refinement, which resulted in the better handling of complex geometries and representation of irregular and sloping layers (Callow 1996). The boundaries of the model were defined as natural features including rivers, creeks, and swamps.

Bajc and Shirota (2007) constructed a new geological model of the Waterloo Moraine, applying a basin analysis approach to data collection and interpretation, which provided details to various geological units, including information on the distribution, thickness, geometry and other attributes. The model was built mainly based on geological information and subsurface sediment structure including geological data from a regional borehole database (Farvolden et al. 1987; Bajc and Newton 2007), published information on Quaternary geology, downhole geophysical logs, and identification of available sediment exposures. Since hydrogeological data including hydraulic head and hydraulic test observation data were not used in model layer interpretation, the model layers were considered stratigraphic layers, which may not be consistent with hydrogeological data at each well field (Blackport et al. 2014). Refinements to this model were made within various municipal well fields through subsequent studies (Stantec Consulting Ltd. 2009, 2012a, b, c; Golder Associates 2011; Blackport Hydrogeology Inc. 2012a, b; Matrix Solutions Inc, S.S. Papadopulos and Associates 2014a, b).

Development of a new geological model

In this study, a new geological model was constructed based on the lithology of wells installed within the study area using Leapfrog Geo (ARANZ Geo Ltd. 2015). Leapfrog Geo constructs 3D geological models using borehole records and GIS data based on the Fast Radial Basis Function (RBF) method. RBF is an interpolation method first used by Hardy (1971) to interpolate scattered topographic data. The smooth interpolation surface is created by Leapfrog Geo using available data points based on weighted linear combinations of covariance functions, an approach that is widely used in geological modelling (Carr et al. 2001; Frank et al. 2007). In total, lithology information from 250 wells was utilized for the construction of the new geological model. The distribution of these borehole records at the study site is illustrated in Fig. 3. For each borehole record, lithology information was obtained from the RoW’s WRAS+ database and summarized based on the three main materials identified for each core sample. The topography of the geological model was imported into Leapfrog Geo using the digital elevation model (DEM) data (30-m resolution) from the Ontario Ministry of Natural Resources and Forestry. Leapfrog Geo creates a smooth interpolation surface between two adjacent geological layers based on lithology data from boreholes. In total, 11 groups of geological units are identified based on the conceptual hydrogeologic model of the Waterloo Moraine constructed by Bajc and Shirota (2007) and Matrix Solutions Inc, S.S. Papadopulos and Associates (2014a, b). The nomenclature of the Ontario Geological Survey (OGS) is adopted here for layer identification, in which AT refers to an aquitard, while AF refers to an aquifer. Following AT or AF, letters and numbers are used to identify the sequence of units, with “A” as the youngest grouped sequence followed by “B”, and “1” as the youngest unit in the group followed by “2”. For example, ATB1 refers to the youngest aquitard in the B sequence, whereas AFF1 refers to an older aquifer of the F sequence.
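The naming convention described above can be made explicit with a small parser. The following is only an illustrative sketch: the `UnitLabel` type and `parse_unit` function are hypothetical helpers, not part of any OGS or Leapfrog Geo tooling.

```python
import re
from typing import NamedTuple


class UnitLabel(NamedTuple):
    kind: str      # "aquitard" (AT) or "aquifer" (AF)
    sequence: str  # grouped sequence letter; "A" is the youngest
    rank: int      # position within the sequence; 1 is the youngest


_LABEL = re.compile(r"^(AT|AF)([A-Z])(\d+)$")


def parse_unit(label):
    """Split an OGS-style label such as 'ATB1' into its three parts."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"not an AT/AF unit label: {label!r}")
    prefix, sequence, rank = match.groups()
    kind = "aquitard" if prefix == "AT" else "aquifer"
    return UnitLabel(kind, sequence, int(rank))


atb1 = parse_unit("ATB1")  # youngest aquitard in the B sequence
aff1 = parse_unit("AFF1")  # aquifer in the older F sequence
```

Labels that do not follow the AT/AF pattern, such as "Bedrock", are rejected by this sketch and would need separate handling.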

Fig. 3
a 3D geological model of the site constructed using Leapfrog Geo. b Distribution of selected wells within the study area along with locations where cross-sections are provided. c Cross-sections along A–A′ and B–B′. d Cross-sections along C–C′ and D–D′

Figure 3 shows the resulting 3D geological model with four cross-sections. The dimensions of the geological model are 5 km × 5 km in the X (east) and Y (north) directions, with an elevation of 200 masl (meters above sea level) as the bottom and the topography as the top. The bottom of the model is set at 200 masl because no data are available below this elevation and aquitards ATE1 and ATG1 are regionally extensive. It is assumed that the well field is not affected by deeper groundwater flow in the bedrock since there are two extensive aquitards (ATB3 and ATE1/ATG1) covering nearly the entire study area. However, due to the lack of data in the deeper units, further studies are necessary to test this assumption. In total, 11 geological layers were identified. Listed from youngest to oldest, these units are: ATB1, AFB1, ATB2, AFB2, ATB3, ATC1, AFC1, ATE1, AFF1, ATG1 and Bedrock.

In comparison to the conceptual hydrogeologic model (Fig. 1c) of the Waterloo Moraine, some layers were merged in the newly constructed geological model (ATC1 and ATC2 were combined as ATC1, and AFF1 and AFD1 were combined as AFF1). This is because: (1) these geological layers are thin and consist of similar materials, and (2) they are located at low elevations where geological data from borehole logs are too limited to separate these layers accurately.

Examination of Figs. 1c and 3 reveals that ATB1 is a thin and patchy aquitard that lies on top of the study area, while AFB1 is an unconfined aquifer present throughout the study area with considerable recharge from precipitation that appears to take place in the central and eastern portions of the site. ATB2 is a thin aquitard that separates AFB2 and AFB1 in most of the study area. AFB2 is the primary water-supply aquifer in the Mannheim East wellfield. The AFB2 aquifer is evident in the central area of the study site, where the municipal well field was developed with a maximum thickness of approximately 40 m; however, the thickness decreases as it extends to the edges of the geological model. Beneath the AFB2 aquifer, the ATB3 aquitard is continuous across the study area followed by the aquitard ATC1. These two aquitards with extremely low K separate the upper aquifers (AFB1 and AFB2) from the lower aquifer/aquitard system. Between the ATC1 aquitard and the bedrock, four geological layers (AFC1, ATE1, AFF1, and ATG1) have been further identified. These layers are found to be thin and discontinuous within the study area.

Description of geological models

In total, four geological models are utilized in this study for model calibration as well as model validation: (1) the 5-layer model; (2) the 11-layer model; (3) the Waterloo model; and (4) the Regional model. Detailed layer information for each model is provided in Table S2 of the ESM, and cross-sections of each model with screen information are shown in Fig. 4. The 11-layer geological model, the Waterloo model and the Regional model all divide the study domain into 11 layers, with the main differences being the layer thicknesses of the upper aquifer/aquitard units and the layer classification of the lower aquifer/aquitard units. Groundwater modelers typically fixate on a single conceptual model, but in this study, four geological models are utilized for model calibration and validation. The results from the four models are compared and discussed next.

Model 1. 5-layer geological model

A simplified geological model has been developed by merging layers of similar material: ATB1 is retained as AT1; AFB1, ATB2 and AFB2 are merged as AF1; ATB3 and ATC1 as AT2; AFC1, ATE1 and AFF1 as AF2; and ATG1 and bedrock as AT3 (shown in Table S2 of the ESM).
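The merging rule can be written as a lookup table. The sketch below applies it to a single borehole column. The mapping follows the text, whereas the `merge_column` helper and the layer thicknesses are hypothetical and serve only to show how adjacent units collapse.

```python
# Mapping from 11-layer units to 5-layer units, as described in the text
MERGE_11_TO_5 = {
    "ATB1": "AT1",
    "AFB1": "AF1", "ATB2": "AF1", "AFB2": "AF1",
    "ATB3": "AT2", "ATC1": "AT2",
    "AFC1": "AF2", "ATE1": "AF2", "AFF1": "AF2",
    "ATG1": "AT3", "Bedrock": "AT3",
}


def merge_column(column):
    """Collapse a top-to-bottom list of (unit, thickness_m) into 5-layer units.

    Consecutive units that map to the same merged unit are summed.
    """
    merged = []
    for unit, thickness in column:
        target = MERGE_11_TO_5[unit]
        if merged and merged[-1][0] == target:
            merged[-1] = (target, merged[-1][1] + thickness)
        else:
            merged.append((target, thickness))
    return merged


# Hypothetical thicknesses (m) for one column, listed from top to bottom
column = [
    ("ATB1", 3.0), ("AFB1", 12.0), ("ATB2", 2.0), ("AFB2", 35.0),
    ("ATB3", 15.0), ("ATC1", 10.0), ("AFC1", 4.0), ("ATE1", 5.0),
    ("AFF1", 3.0), ("ATG1", 6.0), ("Bedrock", 20.0),
]
five_layer = merge_column(column)
```

Note that the merged AF1 unit in this scheme absorbs the thin ATB2 aquitard, so the five-layer model cannot represent any hydraulic separation between AFB1 and AFB2.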

Fig. 4
Cross-sections along D-D′ for: a the 5-layer geological model; b the 11-layer geological model; c the Waterloo model; and d the Regional model with screen midpoint elevation information. The black circles indicate water supply wells, while red squares indicate monitoring boreholes

Model 2. 11-layer geological model

The 11-layer geological model is described in detail in the previous section. While the 5-layer geological model mainly reflects the contrast between low and high K zones, the 11-layer geological model incorporates more detailed stratigraphic information, resulting in a higher-resolution representation of local heterogeneity and hydraulic connectivity compared with the other large-scale models treated in this study.

Model 3. Waterloo and Model 4. Regional models

Two additional models are used in the study for groundwater flow model calibration and validation, including the Waterloo model, built by Bajc and Shirota (2007), and the Regional model (Matrix Solutions Inc, S.S. Papadopulos and Associates 2014a, b), refined based on the Waterloo model. The Waterloo model is constructed based on subsurface information from RoW monitoring wells, urban geological database, field mapping data, cored boreholes, Ministry of the Environment, Conservation, and Parks (MECP) water well records and geophysical databases, while refinements have been made to the Regional model with available hydrogeological data including municipal pumping data, hydraulic head data, water quality data, isotopic data, and wellfield shutdown data (Blackport et al. 2014).

Data used for groundwater flow model calibration and validation

In this study, the dataset used by Luo and Illman (2016) was first studied and a decision was made to utilize a shorter record for model calibration. In particular, the pumping/injection rate records in water supply wells from 1 January 2005 to 31 December 2013 were utilized by Luo and Illman (2016), while this study utilizes one year of data (2013) to achieve computational efficiency. The importance of the length and number of data from observation records to include in inverse models is discussed by Luo et al. (2020) using a synthetic model based on five different simulation durations. Their results reveal that periods with large water-level variations and continuous data points need to be included to better interpret municipal well data for site heterogeneity characterization. They also found that longer simulation durations do not always yield better results since the pumping/injection influence from the water-supply wells may propagate beyond the investigated area.

The pumping/injection rate records from 13 water-supply wells (shown in Fig. 5) and water-level records from 19 monitoring locations with 28 screens at different times during the year 2013 are obtained from the WRAS+ database for groundwater flow model calibration. Due to the resolution of the pumping data in the database, pumping/injection rates in water-supply wells are expressed as daily pumped volume in m³. In reality, pumping/injection events normally last only a couple of hours within a single day. However, the actual operating times are not provided, so the pumping/injection rates are simplified as daily rates. Therefore, for each water-supply well, 365 records are extracted from the database within the selected period. It should be noted that pumping/injection rates in these water-supply wells are not constant; instead, they vary frequently in most wells (shown in Fig. 5). Typically, municipal wells are used as water-supply boreholes, while some ASR wells are designed to inject and store treated water from the Grand River during low water demand periods, and the stored water is extracted during high demand periods. Thus, negative pumping rates shown in Fig. 5 indicate injection.
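The effect of this simplification can be illustrated with a short calculation. The helper names, the 2,000 m³ daily volume and the 4-hour operating window below are hypothetical; only the sign convention (negative for injection) comes from the text.

```python
SECONDS_PER_DAY = 86_400.0


def daily_average_rate(volume_m3):
    """Convert a daily volume (m^3) to a day-averaged rate (m^3/s).

    The sign is preserved: negative values indicate injection.
    """
    return volume_m3 / SECONDS_PER_DAY


def classify(volume_m3):
    """Label a daily record according to the sign convention."""
    if volume_m3 > 0:
        return "pumping"
    if volume_m3 < 0:
        return "injection"
    return "idle"


# Hypothetical day: 2,000 m^3 pumped during a 4-hour window
volume = 2000.0
hours_on = 4.0
actual_rate = volume / (hours_on * 3600.0)  # rate while the pump is running
averaged_rate = daily_average_rate(volume)  # rate used in a daily-step model
```

In this hypothetical case the averaged rate is one sixth of the rate while the pump runs, so sub-daily drawdown peaks near the well are smoothed out, whereas the daily volume extracted is preserved.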

Fig. 5
Pumping or injection rates in water supply wells during the year 2013. Negative pumping rates indicate injection

The screen midpoints of all water-supply wells are located between 315 and 325 masl, while the screen midpoints of the water-monitoring boreholes are located between 185.82 and 368.88 masl. The screens of both water-supply and water-monitoring wells are mainly located at the bottom of AFB2, with a few wells installed within the AFB1, AFC1 and Bedrock units based on the 11-layer geological model.

Although the depths of the screened intervals vary widely, well screens installed in the AFB1, AFC1 and bedrock units lack continuous monitoring records, with fewer than 40 data points available throughout 2013 (Table S1 of the ESM). More generally, many wells have a large number of water-level measurements, but in some wells the monitoring record is quite sparse. Most wells are located at the bottom of AFB2 for both the 11-layer geological model and the Regional model, but at the bottom of AFB2/upper and middle of ATB3 for the Waterloo model. OW16 is located in AFC1 for the 11-layer geological model, in ATG1 for the Waterloo model, and in ATC1 for the Regional model.

Water levels in water-supply and monitoring wells are measured manually and electronically with pressure transducers. The transducers automatically record the water level every hour, and the value recorded at the beginning of each day (12:00 am) is used as the water level for that day, giving 365 records for 2013. At some wells, water levels are recorded monthly or bi-monthly through manual measurements, so the number of manual measurements available in 2013 ranges from 3 to 12. The flow rates in all water-supply wells are electronically recorded, so 365 data points are available for 2013. Table S1 in the ESM summarizes the number of data points used from each well for model calibration as well as model validation.

In this study, drawdown is defined as the hydraulic head change from the initial head and is expressed as:

$$ \Delta h\left(x,y,z,t\right)={h}_0\left(x,y,z\right)-h\left(x,y,z,t\right) $$
(1)

where ∆h(x, y, z, t) is drawdown at some point in the simulation domain defined by the Cartesian coordinate system (x, y, z) at time t due to municipal well operations and h0(x, y, z) is the initial hydraulic head when t is equal to 0, while h(x, y, z, t) is hydraulic head at time t.

It is important to keep in mind that since groundwater is constantly pumped from or injected into water-supply wells, the initial static water level is unknown for each screen. Based on the comparison of simulated and measured drawdown using empirical hydraulic parameters (Martin and Frind 1998) for various geological models during the forward simulation process, the simulated drawdown curves fit well with the measured drawdown data after 20 days for most of the monitoring wells. Therefore, the first 20 data points from monitoring wells are not used for model calibration and validation in order to simulate water level fluctuation as a result of various pumping/injection events. The same strategy was applied by Luo and Illman (2016), which provided optimal matching between simulated and observed data. In total, 4,985 data points are used for model calibration and 5,085 data points are used for model validation (see Table S1 of the ESM) in this study. Data from January to June 2013 are used for model calibration, while data from July to December 2013 are used for model validation.
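The data preparation described above (drawdown from Eq. 1, removal of the first 20 data points, and the January to June / July to December split) might be sketched as follows. The function names and the synthetic head record are hypothetical, and taking the first record as h0 is an assumption made here for illustration, not a statement of how the study handled the unknown static level.

```python
from datetime import date, timedelta


def drawdown_series(records, skip=20):
    """Convert (date, head) records to (date, drawdown), dropping early points.

    Drawdown follows Eq. 1: initial head minus head at time t.
    Assumption: the first record is taken as the initial head h0.
    """
    if not records:
        return []
    h0 = records[0][1]
    return [(day, h0 - head) for day, head in records[skip:]]


def split_by_half_year(series):
    """January to June for calibration, July to December for validation."""
    calibration = [(d, s) for d, s in series if d.month <= 6]
    validation = [(d, s) for d, s in series if d.month > 6]
    return calibration, validation


# Synthetic daily record for 2013: head declining by 1 cm per day
start = date(2013, 1, 1)
records = [(start + timedelta(days=i), 330.0 - 0.01 * i) for i in range(365)]

series = drawdown_series(records)
calibration, validation = split_by_half_year(series)
```

For a screen with a complete daily record, this leaves 161 calibration points and 184 validation points; sparsely monitored screens contribute far fewer.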

Description of groundwater models

The groundwater flow model has the same dimension as the geological model. Prior to constructing the 3D groundwater flow model, a 2D grid with 3210 nodes was generated based on the plan view of the simulation domain, as shown in Fig. 6a. Triangular elements with a size of 200 m were applied to discretize the simulation domain. At locations where there are water-supply and monitoring wells, the grid was refined by a factor of five. The element size is determined based on the geological model resolution and computational efficiency. The DEM data are 30-m resolution and the geological model (5 km × 5 km) is built based on the lithology information from 250 wells. As shown later, simulated drawdowns generally capture the observed drawdown curves, so the current model discretization sufficiently captures the observed records.

Fig. 6
a Generated two-dimensional grid of the study area (plan view). Generated three-dimensional grids for: b the 11-layer geological model; c the Waterloo model; and d the Regional model

Layer information identified in the constructed geological model was then introduced to generate the 3D groundwater flow model, as shown in Fig. 6b–d for each geological model. Each geological layer was subdivided into several layers based on the approximate thickness as well as the distance to the pumping wells. In particular, fine grids were assigned to the layer of the water-supply aquifer, while coarse grids were assigned in upper and lower layers. In total, the 3D hydrogeologic model was discretized into 30 layers for both 5- and 11-layer geological models with 188,460 computational elements and 99,510 nodes, 27 layers for the Waterloo model with 144,482 elements and 76,842 nodes, and 27 layers for the Regional model with 139,412 elements and 74,196 nodes. All four geological models were discretized using the same 2D grid with uniform and isotropic K values of the elements located in the same layer.

All groundwater flow simulations were conducted using the groundwater flow and transport simulator HydroGeoSphere (HGS; Aquanty Inc. 2018) coupled with the model-independent parameter estimation code PEST (Doherty 2005). PEST is unique in that it wraps around any model allowing for automatic calibration using a variant of the Levenberg-Marquardt algorithm to minimize the objective function ∅, which represents a weighted sum of squared differences between computed and measured calibration targets. Here, the objective function is expressed as:

$$ \varnothing \left(\mathbf{a}\right)={\left[{\mathbf{h}}^{\ast}\ \left(x,y,z,t\right)-\mathbf{h}\ \left(x,y,z,t;\mathbf{a}\right)\right]}^{\mathrm{T}}\ \mathbf{W}\left[{\mathbf{h}}^{\ast}\ \left(x,y,z,t\right)-\mathbf{h}\left(x,y,z,t;\mathbf{a}\right)\right] $$
(2)

where a is a vector of M model parameters to be optimized, h and h* are vectors of simulated and measured hydraulic head values, respectively, at N match points in space-time, and W is an N × N diagonal weight matrix. Equation (2) is implemented using unit weights, which makes it equivalent to ordinary least squares. The parameters of interest for this paper are K and Ss.
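For a diagonal weight matrix, Eq. (2) reduces to a weighted sum of squared head residuals. The following Python sketch illustrates the calculation; the function name `objective` and the head values are hypothetical, and the sketch is not the PEST implementation.

```python
def objective(measured, simulated, weights=None):
    """Weighted sum of squared residuals for a diagonal weight matrix W."""
    if weights is None:
        # Unit weights reduce the objective to ordinary least squares.
        weights = [1.0] * len(measured)
    return sum(w * (m - s) ** 2 for w, m, s in zip(weights, measured, simulated))


measured = [320.0, 319.5, 318.0]   # illustrative measured heads (masl)
simulated = [320.5, 319.5, 317.0]  # illustrative simulated heads (masl)
phi = objective(measured, simulated)                          # unit weights
phi_weighted = objective(measured, simulated, [4.0, 1.0, 1.0])
```

With unit weights the residuals of 0.5, 0 and 1.0 m give an objective function value of 1.25 m2; increasing the weight on the first match point to 4 raises it to 2.0.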

There are five cases considered in the study, including: (1) the five-layer model with uniform initial hydraulic parameters for all layers; (2) the five-layer model with different hydraulic parameters for each layer; (3) the 11-layer model with different hydraulic parameters for each layer; (4) the Waterloo model with different hydraulic parameters for each layer; and (5) the Regional model with different hydraulic parameters.

In case 1 (the five-layer uni model), the initial hydraulic parameters for all layers are set to a uniform value. Specifically, the initial K value for calibrating the five-layer geological model was set as 6.00 × 10−5 m/s with a minimum bound of 1.00 × 10−9 m/s and a maximum bound of 0.01 m/s. Likewise, the initial Ss value was set as 0.0006 m−1 with a minimum bound of 1.0 × 10−8 m−1 and a maximum bound of 0.1 m−1. Since most observation points are located in AFB2, it is essential to set appropriate initial K and Ss values for the other layers in order to increase the computational efficiency and reliability of results. Doherty (2005) stated that properly selected initial parameter values will not only increase the optimization efficiency, but also transform a highly nonlinear model into a reasonably linear model through parameter transformation. In particular, the predominant materials for each layer were identified and used to assign initial K and Ss values, as shown in Table S2 of the ESM. The corresponding K and Ss values for each material were based on Martin and Frind (1998), provided as Table S4 of the ESM. The representative values were identified by Martin and Frind (1998) from the literature and also calibrated based on previous pumping and slug test results at the same wellfield site. Since there are several water-supply wells located within the ATB3 unit of the Waterloo model, and in order to increase the computational efficiency, the initial K and Ss values of ATB3 were set the same as those of AFB2 in the Waterloo model.

Four geological models with appropriate initial K and Ss values for each layer were calibrated and validated as cases 2–5. The minimum and maximum bounds of the estimated parameters in cases 2–5 are set the same as in case 1.

In terms of boundary conditions, the bottom face was defined as a no-flow boundary, since the bottom layer of all four models is bedrock, which underlies the aquitard layer ATG1 for both the 11-layer geological model and the Waterloo model. The four side faces were set as constant head boundaries, implying that the hydraulic heads on the boundary faces were not affected by pumping/injection events. This was examined by plotting the simulated drawdown distribution at the end of the pumping/injection records (see Fig. S3 of the ESM).

The static water levels in the monitoring boreholes at the beginning of the pumping records from 284 wells (mainly located in AFB2) within the study area were selected and used for kriging of hydraulic head with Tecplot (2011). The resulting hydraulic heads were generally higher in the northwest portion of the study area and lower in the southeast part, indicating that groundwater flows from the northwest to the southeast, which is consistent with historical records.

The hydraulic heads range from 306.28 to 359.25 masl within the study area and were used as initial head values for the simulation domain and as constant boundary heads for the four side faces. It is noted that the hydraulic heads along the vertical direction on the side boundaries were set to be the same since there were no available hydraulic head data for lower aquifers in the study area.

In order to set the boundary condition for the top of the simulation domain, daily precipitation data from 2013 were obtained from the weather station located on the University of Waterloo campus, which is 9 km away from the study site. These data were converted to net precipitation (45% of the total precipitation for each day) and used as nodal fluxes to define the boundary condition at the top face. It was assumed that the effect of evapotranspiration (ET) was constant through the 6-month simulation period. The current study focused on groundwater flow in the phreatic zone and neglected variably saturated flow in the vadose zone. This is because most data are collected from AFB2, a geological unit that is 20–50 m below the ground surface and overlain by ATB1 and ATB2, both of which are aquitards that limit vertical groundwater flow; thus, the influence of variably saturated flow can be safely neglected. In addition, transient effects in recharge were not considered in this study because ATB1 and ATB2 have very low K values and the aquitard ATB2 is quite extensive in the study area (Matrix Solutions Inc, S.S. Papadopulos and Associates 2014a, b), which dampens the transient effect in the underlying aquifer.

Guo (2017) studied the relationship between precipitation and ET during the years of 2010, 2011, 2012 and 2014 at the Laurel Creek Watershed near the study area, and found that the average simulated and measured annual ET accounted for 56.5 and 54.3% of the annual rainfall, respectively. Thus, 45% of the total precipitation was used in this study as the net precipitation for each day during the simulation period considering the proximity of the study area to the Laurel Creek Watershed.
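The arithmetic behind the 45% net precipitation factor can be illustrated with a short Python sketch. The function name `net_precipitation`, the units and the daily precipitation values are hypothetical; only the ET fractions and the 45% factor come from the text.

```python
NET_FRACTION = 0.45  # fraction of daily precipitation applied as net precipitation


def net_precipitation(daily_precip_mm):
    """Scale each daily precipitation total by the constant net fraction."""
    return [NET_FRACTION * p for p in daily_precip_mm]


# Annual ET fractions reported by Guo (2017): simulated and measured.
et_fractions = [0.565, 0.543]
# Fraction of rainfall remaining after ET, averaged over both estimates.
mean_remaining = 1.0 - sum(et_fractions) / len(et_fractions)

daily_precip_mm = [0.0, 10.0, 2.0]  # illustrative values only
net_mm = net_precipitation(daily_precip_mm)
```

The mean remaining fraction is about 0.446, which is consistent with the rounded value of 45% applied to each day of the simulation period.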

Results and discussion

Model calibration results

Inverse modeling of pumping/injection records from 13 water-supply wells was performed on a PC with a six-core CPU and 16 GB of random access memory (RAM) for model calibration. All inverse models for the five cases ran until a termination criterion was met: either the maximum number of iterations was reached or no significant improvement in the match between simulated and observed heads was obtained. The improvement was evaluated using: (1) the difference between the current objective function and the lowest objective function achieved to date, and (2) the magnitude of the maximum relative parameter change between optimization iterations.

Calibration of the five-layer geological model took about 24 h to estimate 10 unknowns with 361 model calls for cases 1 and 2, while the calibrations of the other three geological models were all completed within 72 h to estimate 22 unknowns, with total model calls ranging from 627 for the Regional model (case 5) to 849 for the 11-layer geological model (case 3).

Although cases 3–5 have the same number of unknowns with the same amount of data, the number of model calls varied from case to case. This is because appropriate initial parameter values are critical to PEST: the closer the initial parameter values are to the optimal values, the faster PEST will converge. They also make optimization possible, especially for highly nonlinear models. Thus, the different number of model calls among cases may be due to the difference between the initial and the optimal parameter values and the optimization process (Doherty 2005).

The estimated K and Ss values for each geological layer are plotted in Figs. 7 and 8, respectively. The estimated K and Ss values and their 95% confidence intervals are summarized in Fig. 9 with numerical values provided in Table S3 of the ESM.

Fig. 7

Estimated K fields from the inversion of pumping and injection events by 13 water-supply wells during January to June, 2013 for: a the 5-layer uni model; b the 5-layer geological model; c the 11-layer geological model; d the Waterloo model; and e the Regional model

Fig. 8

Estimated Ss fields from the inversion of pumping and injection events by 13 water-supply wells during January to June, 2013 for: a the 5-layer uni model; b the 5-layer geological model; c the 11-layer geological model; d the Waterloo model; and e the Regional model

Fig. 9

Estimated K and Ss values as well as their corresponding 95% confidence intervals through PEST calibrations for different geological models: a the 5-layer uni model; b the 5-layer geological model; c the 11-layer geological model; d the Waterloo model; and e the Regional model. Some confidence intervals exceed the range of the vertical axis due to their large size. Values are provided in Table S3 of the ESM

The estimated K and Ss fields shown in Figs. 7 and 8 and the corresponding K and Ss values for each geological layer shown in Fig. 9 reveal large differences among the four models, which highlights the importance of geological information for groundwater modeling results. In particular, utilizing different geological conceptualizations for inverse modeling with the same data results in dramatic differences in estimated K and Ss values (e.g., the K value for ATB3 ranges from 7.89 × 10−12 to 4.71 × 10−4 m/s among cases 3, 4 and 5). The estimated K and Ss values in case 1 are less realistic compared with the ones for the same geological layer in other cases, which highlights the importance of the initial parameter values to the model calibration.

As mentioned previously, a highly nonlinear model may be hard to optimize without properly selected initial parameter values (Doherty 2005). Doherty (2005) also stated that observations can be insensitive to parameters if initial parameter values are not chosen wisely; for example, a parameter may have little effect on model results over a part of its domain, while it may have a much larger effect over other parts of the domain. Thus, the initial parameter values for case 1 may not be within the sensitive part of the parameter domain to yield reasonable results. It was also noticed that the corresponding 95% confidence intervals are extremely large, especially for the lower geological layers. The extremely large 95% confidence intervals may be due to the merging of layers with different K values and to insufficient observation points, both of which have been suggested by Zhao and Illman (2018) in relation to a different site.

The remaining four geological models with appropriate initial K and Ss values (cases 2–5) are all well calibrated with more realistic estimations and much smaller 95% confidence intervals compared with case 1. Thus, the reliability of the estimated hydraulic parameters can be greatly increased by using appropriate initial K and Ss values as prior information. As previously noted, most data points are collected from the monitoring boreholes located at the bottom of AFB2, so similar K and Ss values were obtained from model calibration for the 11-layer geological model (case 3) and the Regional model (case 5).

The K estimated for AFB2 is 8.14 × 10−4 m/s with the 11-layer geological model (case 3), while a value of 5.17 × 10−4 m/s is estimated for the 5-layer geological model (case 2), which is a consequence of using one layer to represent multiple soil types. The estimated K of AFB2 (1.21 × 10−3 m/s), ATB3 (4.71 × 10−4 m/s), and ATC1 (2.12 × 10−6 m/s) for the Waterloo model (case 4) is relatively high, which may be due to the inaccurate classification of the geological layers. Since most of the screens are located within AFB2 and ATB3, the large estimated value of K for the aquitard enables the groundwater to be pumped from it and match the corresponding observed drawdowns at monitoring wells. This is also one drawback of the geological approach in groundwater modeling when the layer morphology is fixed, as unrealistic hydraulic parameters could be estimated (Zhao et al. 2016). Golder Associates (2011) summarized the K values at each water-supply well (Table S5 of the ESM), ranging from 9 × 10−4 to 1 × 10−1 m/s from well to well, which is similar to the estimated K values for AFB2 in this study.

The 95% confidence intervals for the estimated K for the upper part of the domain, including ATB1, AFB1, ATB2, AFB2, ATB3, and ATC1, are relatively small in cases 2–5. The lower aquifer layer of the five-layer geological model is constructed by combining AFC1, ATE1 and AFF1 and assigning the initial K based on the properties of the two aquifers, but in reality, AFC1 and AFF1 are two discontinuous shallow aquifers with a thicker aquitard (ATE1) lying between them. Such a merged layer would affect the reliability of the K estimate, which is evident through the extremely large 95% confidence intervals.

The 95% confidence intervals are calculated in PEST on the basis of the linearity assumption encapsulated in the Jacobian matrix, so they are highly dependent on the assumptions underlying the model. PEST does not truncate the confidence intervals at the parameter bounds, in order to avoid giving an unduly optimistic impression of parameter certainty. Consequently, when the linearity assumption breaks down, the reported confidence intervals can be exaggerated and very large.
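To illustrate how linearized confidence intervals depend on parameter sensitivity, the following Python sketch applies the standard linear least-squares formula (parameter covariance equal to the residual variance times the inverse of the normal matrix) to a two-parameter problem with unit weights. It is a generic sketch and not PEST code; the function name, the Jacobian entries, the residuals, the parameter estimates (taken here as log10 values) and the caller-supplied Student-t quantile are all hypothetical.

```python
import math


def linear_confidence_intervals(jacobian, residuals, estimates, t_value):
    """Linearized intervals for a two-parameter least-squares fit (unit weights).

    jacobian: rows [d(sim_i)/d(p1), d(sim_i)/d(p2)] at the optimum.
    t_value: Student-t quantile for n - 2 degrees of freedom (caller-supplied).
    """
    n = len(residuals)
    # Entries of the 2 x 2 normal matrix J^T J.
    a = sum(row[0] * row[0] for row in jacobian)
    b = sum(row[0] * row[1] for row in jacobian)
    d = sum(row[1] * row[1] for row in jacobian)
    det = a * d - b * b
    # Residual variance: objective function divided by degrees of freedom.
    sigma2 = sum(e * e for e in residuals) / (n - 2)
    # Diagonal of sigma2 * (J^T J)^-1 gives the parameter variances.
    variances = [sigma2 * d / det, sigma2 * a / det]
    return [(p - t_value * math.sqrt(v), p + t_value * math.sqrt(v))
            for p, v in zip(estimates, variances)]


# The second parameter barely influences the simulated values (small derivatives).
jacobian = [[1.0, 0.0], [1.0, 0.01], [1.0, 0.02], [1.0, 0.03]]
residuals = [0.1, -0.1, 0.1, -0.1]
estimates = [-4.0, -6.0]   # hypothetical log10 parameter estimates
t_975_df2 = 4.303          # Student-t 97.5% quantile for 2 degrees of freedom
intervals = linear_confidence_intervals(jacobian, residuals, estimates, t_975_df2)
```

In this example the interval for the weakly sensitive second parameter is more than an order of magnitude wider than that of the first, and nothing in the calculation restricts it to a prescribed parameter range.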

There are three potential reasons for high levels of parameter uncertainty, including a poor fit between model outcomes and field observations, a high level of parameter correlation, or insensitivity on the part of certain parameters (Doherty 2005). Since most data are concentrated in AFB2, the parameters of other layers may not be sufficiently sensitive to observation data, which results in large confidence intervals. Doherty (2005) stated that the high levels of parameter uncertainty resulting from excessive correlation or from insensitivity of parameters can be reduced by including more measurement data, so more data are needed from the deeper layers for more accurate parameter estimation.

The estimated Ss values are similar to K in that more reliable results are obtained for the shallower part of the models and large confidence intervals are mainly found in the lower layers of the five-layer uni model (case 1), the five-layer geological model (case 2) and the Waterloo model (case 4). The estimated Ss values are found to vary in the range of 3.32 × 10−6 to 1.95 × 10−3 m−1 for the 11-layer geological model (case 3), while for the 5-layer geological model (case 2), Ss varies between 6.78 × 10−6 and 1.81 × 10−3 m−1. The Ss values estimated for AFB2 vary from 8.35 × 10−5 m−1 for the Waterloo model to 3.16 × 10−4 m−1 for the Regional model.

It is of interest to note that, compared with previous estimates of Ss by Luo and Illman (2016), which ranged from 1.0 × 10−4 to 3.6 × 10−2 m−1 with a geometric mean of 4.0 × 10−3 m−1, the range is greatly reduced and the values are much smaller in this study. This suggests that the Ss estimates of Luo and Illman (2016) could be impacted by a scale effect, which could be due to their use of the Theis (1935) solution, which neglects heterogeneity and borehole storage effects (Luo and Illman 2017; see also Neville 2017). In contrast, flow parameters estimated through this study considered a larger simulation domain and the estimation was achieved simultaneously for all pumping/injection and observation well pairs considering layer-by-layer heterogeneity. It is also important to note that flow processes are simulated more accurately by the numerical model, which may have caused the estimated storage parameters to be smaller. Similar findings were reported by Vesselinov et al. (2001), who discovered that scale effects in flow parameters became suppressed when the model used to conduct parameter estimation more rigorously considered heterogeneity in estimated parameters.

Performance of model calibration

The effect of different geological conceptualizations on model calibration is evaluated by comparing the simulated drawdowns versus observed drawdowns from 28 observation locations used for the model calibrations, as plotted in Fig. 10. A linear model is fit for each geological model case for performance evaluation.
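The linear model fit used for this evaluation can be sketched as an ordinary least-squares regression. The Python example below is an illustration only: the function name `linear_fit` and the drawdown values are hypothetical, and the observed drawdown is assumed to be the independent variable.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept, with R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination from residual and total sums of squares.
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2


observed = [0.0, 1.0, 2.0, 3.0]    # illustrative drawdowns (m)
simulated = [0.1, 0.9, 1.7, 2.5]   # constructed as 0.8 * observed + 0.1
slope, intercept, r2 = linear_fit(observed, simulated)
```

A slope below 1, as in this constructed example, indicates that the simulated drawdowns are systematically smaller than the observed ones at large drawdowns, even when R2 is high.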

Fig. 10

Scatterplots of observed versus simulated drawdowns for model calibration (black) and validation (red) based on 28 observation locations for: a the 5-layer uni model; b the 5-layer geological model; c the 11-layer geological model; d the Waterloo model; and e the Regional model. The solid line is a 1:1 line indicating a perfect match. The dashed lines are the best fit lines for model calibration (black) and validation (red). The linear model fit results are also included on each plot

Generally, the fit greatly improves from case 1 to case 5, with the slopes of the linear model ranging from 0.76 to 1.15 and values of the coefficient of determination (R2) increasing from 0.42 to 0.77. The R2 value alone cannot be used to assess the fit, but it shows how close the data are to the fitted regression line and it is also used as an indication of data scatter. It is noted that the fits of the four geological models with appropriate initial K and Ss values (cases 2–5) are quite similar, with the slopes of the linear model ranging from 0.76 to 0.86 and values of R2 ranging from 0.74 to 0.77. Although the estimated K and Ss values are quite different in the four cases, the simulated drawdown matches the observed drawdown for all four geological models quite well. Quantitative assessment is conducted by computing the mean absolute error norm (L1) and the mean square error norm (L2). Those quantities are computed as:

$$ {L}_1=\frac{1}{n}\ {\sum}_{i=1}^n\left|{\chi}_i-\hat{\chi_i}\right| $$
(3)
$$ {L}_2=\frac{1}{n}\ {\sum}_{i=1}^n{\left({\chi}_i-\hat{\chi_i}\right)}^2 $$
(4)

where n is the total number of drawdown data, i indicates the data number, and \( {\chi}_i\ \mathrm{and}\ \hat{\chi_i} \) represent the simulated and measured drawdowns, respectively.
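Equations (3) and (4) can be evaluated directly, as in the following Python sketch. The function name `error_norms` and the drawdown values are hypothetical and serve only to illustrate the calculation.

```python
def error_norms(simulated, measured):
    """Return the mean absolute error (L1) and mean square error (L2)."""
    n = len(simulated)
    l1 = sum(abs(s - m) for s, m in zip(simulated, measured)) / n
    l2 = sum((s - m) ** 2 for s, m in zip(simulated, measured)) / n
    return l1, l2


simulated = [1.0, 2.0, 4.0]   # illustrative drawdowns (m)
measured = [1.5, 2.0, 3.0]
l1, l2 = error_norms(simulated, measured)
```

Because L2 squares each residual, it penalizes a few large mismatches (such as those at wells with rapid water-level changes) more heavily than L1 does.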

The calculated L1 and L2 values are shown in Fig. 10. In particular, the calibration result based on the Regional model yields the smallest L1 and L2, while the five-layer uni model yields the largest L1 and L2. The L1 and L2 values for the five-layer geological model (case 2) and the 11-layer geological model (case 3) are similar, which may be due to the similar hydraulic properties of AF1 and AFB2, where most of the observations are located.

Large errors mainly occur at wells where rapid changes in water levels are observed, and none of the models captures these fluctuations. In a previous study, Luo and Illman (2016) explained such poor matches by suggesting the potential existence of a high K pathway between some of the water-supply wells and water-monitoring boreholes.

Hydraulic tomography based on geological models, as presented in this study, treats each geological layer as homogeneous and isotropic, but in reality, both aquifers and aquitards are highly heterogeneous and could be anisotropic (Berg and Illman 2011b, 2013, 2015; Zhao and Illman 2017, 2018), thus a more sophisticated groundwater model that considers heterogeneity and anisotropy in each geological layer may be needed for future work to overcome this difficulty.

Another factor that may be contributing to the inconsistency is that the pumping/injection events normally extend for a couple of hours, but this study used daily pumping/injection rates for each water-supply well, so the rate applied in the simulation could be lower than the actual pumping/injection rate during an event.
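The effect of daily averaging can be illustrated with a simple calculation, sketched here in Python. The function name `daily_average_rate`, the pumping rate and the event duration are hypothetical values chosen for illustration.

```python
def daily_average_rate(rate_m3_per_h, hours_pumped):
    """Constant rate (m3/h) over 24 h that yields the same daily volume."""
    return rate_m3_per_h * hours_pumped / 24.0


actual_rate = 120.0   # m3/h while the pump is running (illustrative)
hours = 4.0           # duration of the event within the day (illustrative)
averaged = daily_average_rate(actual_rate, hours)
```

In this example a 4-h event at 120 m3/h is represented by a constant 20 m3/h over the day, so the daily volume is preserved but the peak stress that drives rapid water-level changes is reduced by a factor of six.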

Model validation results

The performances of different models in their ability to predict drawdown at monitoring wells were evaluated using the pumping/injection records from July to December, 2013. Simulated drawdowns are compared with corresponding observed drawdowns to provide quantitative evaluation. Similar to the calibration results, the validation results for four geological models with appropriate initial K and Ss values (cases 2–5) are similar in their overall shape in terms of the point distribution and the slope of the fit lines, while the five-layer uni (uniform) model in case 1 yields the worst results with biased prediction.

In a previous study, Zhao et al. (2016) found that as the number of pumping and monitoring points decreased, the performance gap among these models was reduced. Thus, the data set used in the calibration and validation processes may not be large enough to produce dramatic differences in the scatterplots among individual models. The slope of the linear model for the five-layer uni model is highest among the models, and it also yields the largest L1 and L2 values. The 11-layer geological model and the Regional model both have the smallest L1 and L2, with a higher slope and a larger R2 for the 11-layer geological model.

While the validation scatterplots show great similarity among individual models (cases 2–5), the simulated and observed drawdowns for each observation well are provided (see Figs. S1 and S2 in the ESM) to better examine the performance of calibrated models in predicting drawdowns. The calibrated models for cases 2–5 can generally capture the drawdown curves for most observation wells, while larger differences are mainly shown for manually measured observation wells. Few data are available from these wells for model calibration, and some of these wells are located in deeper layers (e.g., AFC1) without reliable estimates of hydraulic parameters. Cases 2–5 share more similarities with one another than with case 1, while the five-layer uni model (case 1) and the five-layer geological model (case 2) can better capture the rapid changes in water levels. The reason behind this may be the higher K values for AT2 and the merging of AFB1, ATB2 and AFB2 into AF1, which resulted in higher-K pathways between the water-supply wells and observation boreholes. Another reason may be the use of an averaged daily schedule of pumping/injection rates, when in reality, the rates vary more abruptly and can impact water level records and corresponding parameter estimates (Luo et al. 2020).

Summary and conclusions

This study investigated the usefulness of hydraulic tomography (HT) analysis based on geological models at a municipal well field to estimate the spatial distribution of hydraulic parameters (i.e., K and Ss) using long-term pumping/injection and monitoring well records. Pumping/injection rate data from 13 water-supply wells and water-level data from 19 water-monitoring boreholes with 28 screens during the year 2013 are selected and used for model calibration and validation. Four different geological conceptualizations with varying accuracy are used to examine the importance of geological data in HT analysis.

The study resulted in the following findings and conclusions:

  • Model calibration and validation both revealed that hydraulic parameters (i.e., K and Ss) can be estimated using long-term pumping/injection rates and corresponding water-level records from municipal well fields. The estimated parameters are compared with those estimated through independently conducted pumping tests and the hydraulic parameters from both studies are consistent.

  • Compared with traditional K and Ss estimation methods, which are difficult to conduct at well fields, the HT approach is successfully applied in this study through the use of long-term pumping/injection rates and corresponding water-level records. The observed drawdown curves are well captured by the calibrated models (cases 2–5) with appropriate initial parameter values. The use of such data for inverse modeling results in reliable hydraulic parameter estimates, while enhancing cost and time efficiency in terms of site characterization. Therefore, it is suggested that these data be collected and used in future studies at other municipal wellfields.

  • Simplified daily long-term pumping/injection rates and corresponding water-level records are used in the study. However, it is advisable to perform HT analysis based on real-time data that capture finer-scale fluctuations in pumping records and corresponding drawdown measurements at observation wells. In addition, such pumping and observation data should also be collected from aquitard layers and deeper layers to better examine the heterogeneity patterns of complex groundwater systems and other municipal well fields.

  • Dramatic differences are observed among the four models in hydraulic parameter estimation, which emphasizes the importance of geological data and geological model construction for hydraulic estimates and groundwater modeling. Although good matches are obtained for both model calibration and validation for the four models, rapid water-level variations observed in the monitoring wells are not fully captured by any of the models. The large differences in the hydraulic estimates among the four models and the poor matches for the rapid water-level variations may be caused by the simplification of geological information and inaccurate classification of the geological layers. Thus, it is essential to apply a more sophisticated inverse model which considers the heterogeneity of individual geological units to better predict water-level changes and to obtain more reliable hydraulic estimates.

  • The density of observation points can greatly affect the reliability of estimated hydraulic parameters. The large 95% confidence intervals and inconsistent parameter estimates for deeper geological layers highlight the need for deeper well installation as well as hydraulic investigation. In addition, prior information on the estimated parameters used in the model can reduce confidence interval widths. In future work, assigning each unit a preferred value as prior information in PEST is recommended instead of a uniform starting estimate for all layers.

  • Hydrogeological data are critical for geological model construction used for the purpose of groundwater modeling. This study found that the geological model constructed with hydrogeological information (case 5) yields the smallest error norms for both groundwater model calibration and validation, so hydrogeological data are essential for HT analysis based on geological models to yield reliable hydraulic parameters and to better capture local heterogeneity.

  • The variability in evapotranspiration throughout the year is not considered in this study, but it could become more important for longer simulation periods. It may also be necessary to more rigorously consider evapotranspiration and other complexities (e.g., surface-water/groundwater interaction, variably saturated flow in the vadose zone, long-term decline of groundwater levels due to dewatering operations, etc.) at other municipal well fields that are not factored into the present study. If important processes are left out in the model used to estimate parameters, it is conceivable that the estimated parameters will be affected. This important topic will require additional studies in the future.

  • Finally, this study applied a new approach to estimate hydraulic parameters for a municipal well field using existing long-term pumping/injection rates and water-level monitoring records. A more sophisticated inverse model which considers each aquifer/aquitard unit to be heterogeneous is currently being built for the study site. It is anticipated that improved parameter estimates will be obtained, which should result in more robust predictions of groundwater level variations due to municipal well operations. All of this should benefit well field management and contaminant transport predictions, as well as improve source water protection.