1 Introduction

To characterize the true spatial distribution of geological conditions, three-dimensional geological modeling employs a series of techniques and three-dimensional structural or geostatistical models for applications such as geological mapping, civil engineering, reservoir appraisal, mineral targeting, and risk assessment (Bistacchi et al. 2008; Collon et al. 2015; Royse 2010; Wu et al. 2015). In such applications, error in the three-dimensional model can translate into serious consequences of risk, failure, or cost. To minimize this error, measurements of geological attributes are made at points, along transects, as depth fields, or as two-dimensional mapped surfaces. Three-dimensional models of a subsurface or geological attribute field are then assembled from different datasets, such as borehole data, cross-sections, remotely sensed variables, and geological maps. These data help geologists to visualize, understand, and quantify both the geological processes under way and the current state of the geological forms that these processes have created. Ideally, all available multi-source data are reconciled or conflated to construct a reliable three-dimensional geological model (Fasani et al. 2013; Kaufmann and Martin 2008; Li et al. 2013; Lindsay et al. 2012). However, each type of data carries its own uncertainties. Errors and uncertainties arising from combining these data can be analyzed in addition to the uncertainties of the individual datasets. Such combinatorial errors and uncertainty may, consequently, affect the precision of three-dimensional models and the results of any subsequent assessment or calculation. This problem is widely recognized as a major challenge in every step of three-dimensional geological model creation (Caumon et al. 2009; Jessell et al. 2010; Jones et al. 2004; Li et al. 2013; Tacher et al. 2006; Wellmann and Regenauer-Lieb 2012).

In recent years, strategies for error and uncertainty assessment of raw data and model structures, including topology, have been proposed for different types of applications (Jones et al. 2004; Li et al. 2013; Lindsay et al. 2012, 2013; Tacher et al. 2006; Thiele et al. 2016; Wellmann and Regenauer-Lieb 2012). A framework and quality detection technology for error analysis of three-dimensional geological modeling data have also been presented (Zhu and Zhuang 2010). However, the assessment and accounting of the impact of errors during the construction of a three-dimensional geological model remain unclear because the errors in the input data are random, diversely distributed, and vague, whereas existing geostatistical methods usually assume that all errors follow a single distribution. Therefore, it is necessary to build an aggregated theoretical model for multiple-error analysis that can reveal the complete impact of error and uncertainty on constructed three-dimensional geological models.

While constructing three-dimensional geological structural models, input data are usually converted into geological contact information according to rock type or other geological properties. Picking contacts from cuttings, cores, or logs is often performed by geologists or geophysicists. Many studies by petrophysicists describe methods to analyze errors or uncertainties of velocity and geological structures while processing seismic images (Fomel 1994; Fomel and Landa 2014; Pon and Lines 2005; Thore et al. 2002; Grubb et al. 2001). When constructing a three-dimensional geological structure, the impact on the structure of errors and uncertainties from geophysical interpretation, field observation, and other sources deserves special attention. Therefore, this study discusses methods for error analysis of the input data, rather than error analysis approaches for picking contacts.

From the viewpoint of mapping, a novel approach is presented to simulate the impact of varying distributions of multiple errors in geological contacts obtained from different types of data sources on the construction of three-dimensional geological structural models. Two assumptions are made: (1) each type of data has spatial errors regardless of data consistency, and its error follows one type of probability density function (pdf); and (2) the errors that exist in different data sources are independent, which means that input data may be provided by different geologists or agencies; therefore, Bayes’ theorem cannot be directly utilized in aggregating errors. The proposed method is based on the relationship between a geological attribute (such as lithological character) distribution and the spatial error distribution. Additionally, the method focuses on analyzing error distributions without multiple z-values for the same x and y location. This study aims to (1) construct the error distribution relationship between the spatial errors and their related geological attributes within geological databases and (2) construct an aggregation model for one-dimensional multiple-error analysis at cross-points. A cross-point is the intersection in three-dimensional space of planar or linear geological features. The proposed approach is novel in that (1) it addresses multiple errors in one workflow simultaneously and (2) the error distribution is generated by a stochastic variable around all the boundaries instead of only at the contact points. This provides a more realistic model of the actual pattern of error and uncertainty surrounding the features in a geological model, for example, along fault lines and contacts.

2 Related Research

Error in three-dimensional geological modeling is mainly attributed to three components: error in the measurements in the input data, error generated during the modeling process, and error caused by imprecise knowledge (Bárdossy and Fodor 2004; Mann 1993; Wellmann et al. 2010). Error from the input data is a type of observation error that exists in most raw data. Examples of measurement error in raw data include incorrect orientation of a structure and imprecise interpretation, such as the stratigraphic information of borehole cores or the geological interpretation along cross-sections using seismic data or borehole information. This study focuses on analyzing the distribution of this type of error. Error generated during the modeling process is a stochastic, inherently random error that appears during interpolation of the subsurface or is caused by combining data. The last category of uncertainty refers to the result of incomplete or imprecise knowledge of structural existence or persistence, general conceptual ambiguities, or the need for generalizations. Thus, for example, a sedimentary outcrop may be defined by visually interpreting vegetation patterns or by interpolating a boundary from a spatially distributed set of measurements. The precise rock type may be poorly defined or rely on arbitrary definitions, such as percent sand. Geological contacts are often classified according to values of different geological properties, such as velocity, rock type, density, and electrical resistivity. In the process of seismic imaging, the issue of structural uncertainty in the domain of velocity estimation has been discussed previously (Pon and Lines 2005; Thore et al. 2002). Researchers have also studied the impact of velocity uncertainties on migrated images and AVO attributes (Fomel and Landa 2014; Grubb et al. 2001).

This study only discusses methods for analyzing the spatial error or uncertainty that exists in input data. Except in dynamic modeling methods (Chauvin and Caumon 2015), input data are converted into geological contacts and are extracted as spatial data, such as points and segments, in most approaches to constructing three-dimensional geological structures. Hence, the first type of error manifested is the positional error of geological contacts or boundaries, which is usually regarded as a kind of spatial error. Therefore, positional error methods for simulating spatial error are used to analyze errors in geological data. Based on the stage of modeling, approaches to the analysis of error or uncertainty can be classified into two types. One class of error analysis methods for three-dimensional geological modeling focuses on the impact of variabilities introduced by humans or computationally during data collection, processing, and representation in the final model (Bond et al. 2012; Jones et al. 2004; Lindsay et al. 2012; Tacher et al. 2006; Thore et al. 2002). This type of solution strives to reduce the effects of the errors before uncertainties are integrated into a final model. The other class of methods quantifies the impacts of errors and uncertainties on the final three-dimensional model (Caumon et al. 2009; Jessell et al. 2010; Jones et al. 2004; Tacher et al. 2006; Viard et al. 2011; Wellmann et al. 2010; Wellmann and Regenauer-Lieb 2012). These methods simultaneously evaluate the impacts of the input data and all possible three-dimensional models.

On the other hand, recent research has approached the problem from the viewpoint of the underlying spatial data model (Chilès et al. 2004; Shi 1994; Shi and Liu 2000; Wellmann et al. 2010; Zhang et al. 2009): the vector data model or the raster model. When the input data and the final three-dimensional model are organized in a vector data model using points, lines, areas, and volumes, the main objective is to simulate the random error distribution in the input data or in the modeling process and its impact on the final model based on the random error theorem. Based on the assumption of error independence between sampling points, the error characteristics of digitized geological input data are measured with aggregated statistics, such as the mean squared positional error, the mean directional error, the error interval, or the error ellipsoid. Errors related to interpolation are usually described by the semivariogram or covariance matrix (Bárdossy and Fodor 2004; Guillen et al. 2008; Tacher et al. 2006; Zhang et al. 2009). Approaches to the analysis of errors along curves or segments have been developed based on randomness in point error analysis. Assuming that the errors of segment endpoints are independent and follow a normal distribution, error distributions along a line segment can be inferred from the error ellipsoid and its confidence body at arbitrary points based on different parameters, such as the ε-band (Shi 1994; Zhang et al. 2009), error bands (Caspary and Scheuring 1993; Dutton 1992; Zhang and Tulip 1990), G-band (Shi and Liu 2000), and H-band (Li et al. 2002). When the input data and the final three-dimensional model are represented with raster models, such as voxel or grid models, the approach to error analysis is fundamentally different: geostatistical methods or information entropy-based methods are generally preferred for the error or uncertainty analysis of each vertex within the whole model space (Calcagno et al. 2008; Chilès et al. 2004; Guillen et al. 2008; Wellmann et al. 2010; Wellmann and Regenauer-Lieb 2012).

Another key problem that must be solved in this study is how to aggregate multiple error distributions in the input data. From a statistical perspective, this kind of problem is treated as combining prior and preposterior probabilities into a posterior probability based on the assumption of data independence or conditional independence (Caers and Hoffman 2006). Geological data, however, usually cannot satisfy this assumption. The approaches taken in previous studies to aggregate probabilities for geological data are diverse because each method was proposed for a specific application. For example, Tarantola and Valette (1982) proposed the concept of conjunction and disjunction of probabilities for inverse problems. Journel (2002) presented the Tau model from a broad perspective, and Polyakova and Journel (2007) improved it with the Nu model. Allard et al. (2012) reviewed most of the available techniques for aggregating probability distributions and found that methods based on the product of probabilities are preferred. However, a method suitable for the specific problem of this study must still be selected from among these product-based approaches.

A significant improvement in the accuracy of a three-dimensional geological model can be made by fusing multiple input data (Collon et al. 2015; Kaufmann and Martin 2008; Travelletti and Malet 2012). However, across, and even within, different kinds of raw data (such as borehole logs, user-defined geological sections, and geological maps), different types of positional error distributions may exist. For example, error along the contact between two formations in a drill hole or outcrop may be expressed by a normal distribution with a standard deviation around an expected value, a lognormal distribution, or a discrete probability distribution (Davis 2002; Wellmann et al. 2010). Therefore, the assumption of a single distribution cannot reflect the impacts of all of the types of errors on the modeling process and the final three-dimensional model.

3 Methodology

3.1 Overview

As mentioned above, the assumption of a single error distribution can neither reflect the aggregate quality of multiple source data nor reveal the impact of the differences among multiple errors on three-dimensional model construction. Current research has focused on the analysis of positional error for points. In reality, subsurface feature positions and shapes for three-dimensional geological modeling are determined by sets of related geological attributes and their distributions. Variation in point positions and their error distributions is part of the variation in the geological attributes over the entire space or volume. Therefore, some type of inherent relationship may exist between the spatial error and the geological attributes of contacts. Within error bounds defined by a threshold value, the probability of a point location is the probability that the contact appears in measurements or samples. This probability is also the likelihood of encountering a certain type of rock or soil at a given location (Tacher et al. 2006). When the spatial errors of contacts, defined by the three coordinates x, y, and z, are transferred into the probability of a particular geological attribute, such as rock type, the different errors can be aggregated by linear weighted summation because multiple errors are then represented by a single variable.

This research introduces a method for mapping the relationships among multi-source error distributions in three-dimensional geological data and attributes. The procedure for error analysis in the proposed method can be summarized by the following stages (Fig. 1):

Fig. 1 Simplified workflow of the proposed method

1. The method begins with a best-guess geological interpretation of observation data, including stratigraphic information in the borehole, cross-sections, or other data. The contact points on each type of source data are assigned an error distribution and confidence interval according to the observations. In this study, we are principally concerned with the contact information for boreholes and cross-sections.

2. For each contact point, the boundary error value is given according to the confidence interval, which captures areas where the interface is uncertain.

3. An error band of a line segment in a cross-section is estimated from the error distribution at the endpoints of the segment, where the error types might be different. The notion of distributed probability is introduced. We infer formulations for mapping geological attribute probabilities and the spatial error distribution of contacts for different geological formations. Then, the error distribution in three-dimensional space is transferred into a probability field for each of the contact points.

4. The probability values in the error band of line segments in geological cross-sections are calculated by the same method as in stage 3. Based on the information entropy method, the entropy weighting is calculated for each type of error in the source data. All types of errors on the same contact point are aggregated by a linear summation with entropy weighting. Then, the error distribution of each rock type in the entire three-dimensional space can be estimated and visualized by the probability field.

It is important to note that multiple errors are directly represented in detail by probability fields on the basis of the prior model. Unlike the approach of Tacher et al. (2006), this method does not assume a single error distribution at each contact point.

3.2 Spatial Error and Geological Attribute Probability

Every point in space falls within the boundaries of a stratum delineated by geologists according to proximal geological attributes, such as geological age, depth, and lithological characteristics. Therefore, errors appear along geological boundaries because uncertainties exist in the measurements and definitions of geological attributes. Accordingly, the problem of positional errors at spatial points can be treated as a set of uncertainties in the related geological attributes. Probability is used to express the spatial error distribution, as presented by Tacher et al. (2006). In this study, errors in the subsurface or in the interpretation of the strata are expressed as the chance of the appearance of a particular rock type at a spatial point. This chance is the probability of the event that “rock type A is found at the location z(u)”, which is termed the “geological attribute probability”. The geological attribute probability is a set of probabilities of the geological attributes that decide the classification of the strata, not just the rock type. The cumulative distribution function F(u) of the probability density function f(z) is often used to calculate the probability P(A) within a confidence interval

$$ P(A) = F(u) = \int_{a}^{b} {f(z){\text{d}}z} , $$
(1)

where a and b are the bounds of the confidence interval at a chosen level. Here, P(A) is a function of position, that is, of the z-value.
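As a numerical illustration of Eq. (1), the following minimal sketch evaluates the geological attribute probability for a contact depth with a normally distributed error; the mean depth, the standard deviation, and the confidence bounds are illustrative assumptions, not values prescribed by the method.

```python
# Hedged sketch of Eq. (1): geological attribute probability as an integral of
# the error density over a confidence interval, assuming a normal depth error.
from scipy.stats import norm

mu = -7.0      # assumed expected contact depth (m)
sigma = 0.667  # assumed standard deviation of the depth error (m)

# P(A): probability mass of the contact inside the confidence interval [a, b]
a, b = mu - 2.0 * sigma, mu + 2.0 * sigma
p_interval = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)

# As a function of position, P(A) at depth z is the cumulative probability that
# the contact lies below z, i.e. that a point at z belongs to the upper stratum
# (cf. Sect. 3.3).
z = -6.5
p_at_z = norm.cdf(z, loc=mu, scale=sigma)

print(f"P(A) over [{a:.2f}, {b:.2f}] = {p_interval:.3f}")
print(f"P(A) at z = {z} m = {p_at_z:.3f}")
```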

3.3 Geological Attribute Probability of an Error on a Contact Line

In addition to contact points from borehole data, contact lines extracted from cross-sections are often used as source data for three-dimensional geological modeling because they represent the geological understanding and consensus of geologists. Here, the cross-sections around boreholes are discussed without taking into consideration sections interpreted from geophysical data, such as reflection or refraction seismic sections. In this situation, contact lines are decomposed into a series of line segments. The error distribution of a line segment is determined by the errors at its endpoints.

In the geographic information system (GIS) literature, the error distribution of a line segment has been investigated for half a century. Based on the epsilon band (Perkal 1966) and statistical theory, other error models have been developed, such as the error band (Caspary and Scheuring 1993; Zhang and Tulip 1990), the confidence region (Dutton 1992), the G-band (Shi and Liu 2000), and the H-band (Li et al. 2002) models. These methods make the same assumption that the endpoints of a line segment follow the same error distribution, such as the normal distribution (Tong et al. 2013). However, in the three-dimensional geological multi-source context, the endpoints of a line segment may follow different error distributions, such as the bimodal distribution or the P-norm distribution with p = 1.60 (Tong and Liu 2004). In this study, the error distribution of an arbitrary point on a segment is derived based on the error band concept.

Given that the depth positions of the endpoints (x1, y1, z1) and (x2, y2, z2) of a line segment on a cross-section are independent random variables z1 and z2 (Fig. 2a), for which the errors follow distributions f1(z) and f2(z), the depth value of an arbitrary point on the segment (other than the endpoints) can be expressed as z = C1z1 + C2z2, where \( C_{1} + C_{2} = 1 \) and \( C_{1} = \frac{{\sqrt {(x - x_{2} )^{2} + (y - y_{2} )^{2} + (z - z^{\prime}_{ 2} )^{2} } }}{{\sqrt {(x_{1} - x_{2} )^{2} + (y_{1} - y_{2} )^{2} + (z^{\prime}_{ 1} - z^{\prime}_{ 2} )^{2} } }} \). The error density function at an arbitrary point z0 on the line segment is

$$ f(z_{0} ) = \frac{1}{{C_{2} }}\int_{ - \infty }^{\infty } {f_{1} (z_{1} )f_{2} \left( {\frac{{z_{0} - C_{1} z_{1} }}{{C_{2} }}} \right){\text{d}}z_{1} } . $$
(2)

For an arbitrary point P(x, y, z) directly above a point P0(x, y, z0) on the line segment (Fig. 2a), the necessary and sufficient condition for P to belong to the upper stratum A is that the boundary of stratum A lies beneath point P, which means that z0 ≤ z. Therefore, the probability value P(A, z) equals the cumulative distribution function of the probability density function f(z0) from negative infinity to z. Different results are therefore obtained for different distribution functions at the endpoints, such as the normal distribution, the uniform distribution, or a discrete distribution. Here, the normal distribution and the continuous uniform distribution are discussed in detail. A continuous uniform distribution is considered when a contact between two formations does not crop out or when a segment lacks a borehole sample. Usually, the strata boundaries in the cross-sections or virtual boreholes used in three-dimensional geological modeling do not crop out. Three types of combinations are presented, as follows (Fig. 2b–d):

1. If the errors of the two endpoints both follow a normal distribution, the probability value of P(A, z) at an arbitrary point (x, y, z) can be expressed as

    $$ P(A,z) = F(z) = \frac{1}{{\sqrt {2\pi } \sigma }}\int_{ - \infty }^{z} {e^{{ - \frac{{(z_{0} - \mu )^{2} }}{{2\sigma^{2} }}}} dz_{0} } , $$
    (3)

    where \( \mu \) is the expected value and \( \sigma \) is the standard deviation.

2. Given that the errors of the two endpoints follow continuous uniform distributions \( f_{1} (z) = \left\{ {\begin{array}{*{20}c} {\frac{1}{2a}} &\quad {\left| z \right| \le a} \\ 0 &\quad {\left| z \right| > a} \\ \end{array} } \right. \) and \( f_{2} (z) = \left\{ {\begin{array}{*{20}c} {\frac{1}{2b}} &\quad {\left| z \right| \le b} \\ 0 &\quad {\left| z \right| > b} \\ \end{array} } \right. \), the probability value of P(A, z) at an arbitrary point \( P(x,y,z) \) can be represented in three conditional instances. If \( aC_{1} = bC_{2} \), P(A, z) should be

    $$ P(A,z) = \left\{ {\begin{array}{*{20}l} 0 \hfill &\quad {z < - aC_{1} - bC_{2} } \hfill \\ {\frac{{[z + (aC_{1} + bC_{2} )]^{2} }}{{8abC_{1} C_{2} }}} \hfill &\quad { - aC_{1} - bC_{2} \le z < 0} \hfill \\ {\frac{{ - z^{2} + 2(aC_{1} + bC_{2} )z + (aC_{1} + bC_{2} )^{2} }}{{8abC_{1} C_{2} }}} \hfill &\quad {0 \le z < aC_{1} + bC_{2} } \hfill \\ 1 \hfill &\quad {z \ge aC_{1} + bC_{2} } \hfill \\ \end{array} } \right.. $$
    (4)

    If \( aC_{1} > bC_{2} \), P(A, z) should be

    $$ P(A,z) = \left\{ {\begin{array}{*{20}l} 0 \hfill &\quad {z < - aC_{1} - bC_{2} } \hfill \\ {\frac{{[z + (aC_{1} + bC_{2} )]^{2} }}{{8abC_{1} C_{2} }}} \hfill &\quad { - aC_{1} - bC_{2} \le z < bC_{2} - aC_{1} } \hfill \\ {\frac{1}{2} + \frac{z}{{2aC_{1} }}} \hfill &\quad {bC_{2} - aC_{1} \le z < aC_{1} - bC_{2} } \hfill \\ {1 - \frac{{[(aC_{1} + bC_{2} ) - z]^{2} }}{{8abC_{1} C_{2} }}} \hfill &\quad {aC_{1} - bC_{2} \le z < aC_{1} + bC_{2} } \hfill \\ 1 \hfill &\quad {z \ge aC_{1} + bC_{2} } \hfill \\ \end{array} } \right.. $$
    (5)

    If \( aC_{1} < bC_{2} \), P(A, z) should be

    $$ P(A,z) = \left\{ {\begin{array}{*{20}l} 0 \hfill &\quad {z < - aC_{1} - bC_{2} } \hfill \\ {\frac{{[z + (aC_{1} + bC_{2} )]^{2} }}{{8abC_{1} C_{2} }}} \hfill &\quad { - aC_{1} - bC_{2} \le z < aC_{1} - bC_{2} } \hfill \\ {\frac{1}{2} + \frac{z}{{2bC_{2} }}} \hfill &\quad {aC_{1} - bC_{2} \le z < bC_{2} - aC_{1} } \hfill \\ {1 - \frac{{[(aC_{1} + bC_{2} ) - z]^{2} }}{{8abC_{1} C_{2} }}} \hfill &\quad {bC_{2} - aC_{1} \le z < aC_{1} + bC_{2} } \hfill \\ 1 \hfill &\quad {z \ge aC_{1} + bC_{2} } \hfill \\ \end{array} } \right.. $$
    (6)
3. Given that the error at one endpoint follows a continuous uniform distribution \( f_{1} (z) = \left\{ {\begin{array}{*{20}c} {\frac{1}{2a}} & {\left| z \right| \le a} \\ 0 & {\left| z \right| > a} \\ \end{array} } \right. \) and the other follows the normal distribution \( f_{2} (z) = \frac{1}{{\sqrt {2\pi } \sigma }}e^{{\frac{{ - z^{2} }}{{2\sigma^{2} }}}} \), the error distribution function at an arbitrary point \( P(x,y,z) \) can be expressed as (Liu 1999)

    $$ f(z_{0} ) = \frac{1}{{C_{2} }}\int_{ - \infty }^{\infty } {f_{1} (z_{1} )f_{2} \left( {\frac{{z_{0} - C_{1} z_{1} }}{{C_{2} }}} \right){\text{d}}z_{1} } = \frac{1}{{C_{2} }}\int_{ - a}^{a} {\frac{1}{{2a\sigma \sqrt {2\pi } }}e^{{ - \frac{{(\frac{{z_{0} - C_{1} z_{1} }}{{C_{2} }})^{2} }}{{2\sigma^{2} }}}} {\text{d}}z_{1} } . $$
    (7)

    Then, P(A, z) should be

    $$ P(A,z) = \frac{1}{2} + \frac{{C_{2} }}{{2aC_{1} }}\left\{ {\left( {\frac{{z + C_{1} a}}{{C_{2} }}} \right)\varphi_{0} \left( {\frac{{\frac{{z + C_{1} a}}{{C_{2} }}}}{\sigma }} \right) - \left( {\frac{{z - C_{1} a}}{{C_{2} }}} \right)\varphi_{0} \left( {\frac{{\frac{{z - C_{1} a}}{{C_{2} }}}}{\sigma }} \right)} \right.\left. { + \frac{\sigma }{{\sqrt {2\pi } }}\left( {e^{{\frac{{ - (\frac{{z + C_{1} a}}{{C_{2} }})^{2} }}{{2\sigma^{2} }}}} - e^{{\frac{{ - (\frac{{z - C_{1} a}}{{C_{2} }})^{2} }}{{2\sigma^{2} }}}} } \right)} \right\}, $$
    (8)

    where \( \varphi_{0} \left( {\frac{{\frac{{z + C_{1} a}}{{C_{2} }}}}{\sigma }} \right) = \varphi \left( {\frac{{\frac{{z + C_{1} a}}{{C_{2} }}}}{\sigma }} \right) - \frac{1}{2} \), \( \varphi_{0} \left( {\frac{{\frac{{z - C_{1} a}}{{C_{2} }}}}{\sigma }} \right) = \varphi \left( {\frac{{\frac{{z - C_{1} a}}{{C_{2} }}}}{\sigma }} \right) - \frac{1}{2} \), and \( \varphi \) is the standard normal cumulative distribution function. The geological attribute probabilities at an arbitrary point on a line segment can be obtained using this method, and the geological attribute probability distribution of an entire line feature is composed of the probabilities of a finite number of connected segments.

Fig. 2 Experiment of error distributions of a line segment. a Line segment between two contact points of boreholes. b–d Distributions of the probability that a point belongs to stratum B, in which the endpoint errors follow different kinds of error distribution: b errors of both endpoints follow a normal distribution; c errors of both endpoints follow a continuous uniform distribution; d errors of the left endpoint follow a normal distribution and those of the right endpoint follow a continuous uniform distribution. e Legend of the probability that a point belongs to stratum B

When the error distributions of the two endpoints of a line segment both follow normal distributions, the probabilities that a point belongs to stratum B are distributed as an error band (Fig. 2b), as proven by Shi and Liu (2000). In Fig. 2c, the width of the probability distribution of belonging to stratum B around the segment also resembles an error band, in which the maximum values are located at the endpoints and the minimum values are located in the middle of the segment; this occurs because the error distribution is a linear combination of two identical distribution functions. From Eqs. (7) and (8) in case 3, the geological attribute probability is a cumulative function of the normal distribution over the confidence interval [− a, a], although the errors at the two endpoints follow different distributions. The shape of the probability distribution of belonging to stratum B resembles the result of two normal distributions with different confidence intervals.
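To make the behavior of Eqs. (2), (7), and (8) concrete, the following minimal Monte Carlo sketch draws independent endpoint errors from a uniform and a normal distribution (case 3) and estimates P(A, z) at an interior point of the segment; the half-width, the standard deviation, and the weights C1 and C2 are illustrative assumptions only.

```python
# Hedged numerical sketch of Eq. (2): the error at an interior point of a
# segment is z0 = C1*z1 + C2*z2 with independent endpoint errors. Here case 3
# is simulated (uniform error at one endpoint, normal error at the other).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

a = 1.0              # assumed half-width of the uniform endpoint error (m)
sigma = 0.5          # assumed std. dev. of the normal endpoint error (m)
C1, C2 = 0.4, 0.6    # assumed interpolation weights (C1 + C2 = 1)

z1 = rng.uniform(-a, a, n)        # error samples at endpoint 1
z2 = rng.normal(0.0, sigma, n)    # error samples at endpoint 2
z0 = C1 * z1 + C2 * z2            # combined error at the interior point

# Empirical P(A, z) = P(z0 <= z); this can be checked against Eq. (8).
for z in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"P(A, z = {z:+.1f}) ~ {np.mean(z0 <= z):.3f}")
```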

3.4 Aggregating Multiple Errors by Geological Attribute Probabilities Using Entropy-Based Weighting

After the conversion using the method proposed above, every type of error is represented by the related geological attribute probabilities. The next step is to combine the information from the different sources in a probabilistic framework, for which previous studies have presented diverse approaches. A general formulation of all pooling methods was presented by Allard et al. (2012)

$$ T_{G} (A) = Z + U(A) + \left( {1 - \sum\limits_{i = 1}^{n} {w_{i} } } \right)T_{0} (A) + \sum\limits_{i = 1}^{n} {w_{i} T_{i} (A)} , $$
(9)

in which T is related to probabilities in the following way: T ≡ P for all linear pooling methods; T ≡ lnP for methods based on the product of probabilities; and T ≡ lnO = lnP − ln(1 − P) for methods based on the product of odds. U(A) is an updating likelihood when considering the general log-linear pooling; it is the logarithm of the Nu parameter for the Nu model. T0(A) is the prior probability, and Z is a normalizing constant. The weight w0 is set equal to \( \left( {1 - \sum\nolimits_{i = 1}^{n} {w_{i} } } \right) \) to respect external Bayesianity. Equation (9) is thus a generalized linear pooling formulation, and aggregated probabilities can be calculated with any member of this family (Allard et al. 2012). All of the methods that can be included in Eq. (9) have parameters, such as weights, that must be estimated or preset by users, except for entropy-based methods, such as the maximum entropy method; estimated or preset parameters may introduce new error or uncertainty into the final results. Since entropy was introduced in geostatistics as a measure of the uncertainty of a prior distribution model (Christakos 1990), entropy-based methods, including the maximum entropy approach, have been widely used in the earth sciences, especially for estimating an unknown target probability distribution.
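For a binary event such as “the point belongs to rock type A”, the two pooling families subsumed by Eq. (9) can be sketched as follows; the probabilities and weights are illustrative assumptions, the weights sum to one, and the prior and updating terms of Eq. (9) are omitted for simplicity.

```python
# Hedged sketch of linear pooling (T = P) versus log-linear pooling (T = ln P)
# for a binary event, ignoring the prior T0(A) and the updating term U(A).
import numpy as np

p = np.array([0.8, 0.6])   # assumed source probabilities for event A
w = np.array([0.5, 0.5])   # assumed pooling weights, summing to 1

# Linear pooling: weighted arithmetic average of the probabilities.
p_linear = float(np.sum(w * p))

# Log-linear pooling: weighted geometric average, renormalised against the
# complementary event so that the result is a valid probability.
num = np.prod(p ** w)
p_loglinear = float(num / (num + np.prod((1.0 - p) ** w)))

print(f"linear pooling:     {p_linear:.3f}")
print(f"log-linear pooling: {p_loglinear:.3f}")
```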

In this study, an additive method with entropy-based weighting is presented to aggregate multiple errors. Suppose that N types of source data are involved in the cross-area, that n possible stratigraphic attributes exist at the cross-point (x, y, z), and that the probability that the cross-point belongs to stratum i according to source data j is Pij(x, y, z), which can be calculated using Eqs. (1) and (2). The information entropy of point (x, y, z) can be interpreted as the amount of missing information with respect to the lithological probabilities at the point. An entropy-based weight value for each kind of source data can then be described as

$$ \begin{aligned} \omega_{j} &= \frac{{1 - H_{j} }}{{\sum\limits_{j = 1}^{N} {(1 - H_{j} )} }} \hfill \\ H_{j} &= - \sum\limits_{i}^{n} {P_{ij} (x,y,z)} \log P_{ij} (x,y,z), \hfill \\ \end{aligned} $$
(10)

where Hj is the entropy value for source data j. The wj value is equal to 0 when n is 1. Note that the wj here is not dictatorial, as described by Allard et al. (2012), because wj ≠ 0 when i ≠ j. For the same type of source data, the entropy-based weight value of each point in space may be different because it is determined by the corresponding geological attribute probabilities rather than by the error type. The error in each type of data source has an entropy value that is conditioned on the probabilities, according to Eq. (10). Therefore, the final value of the geological attribute probability at a position is calculated as

$$ P_{i} (x,y,z) = \sum\limits_{j}^{N} {\omega_{j} P_{ij} (x,y,z)} . $$
(11)

When a new data source is introduced, the entropy value of the point will change because the new data source provides another probability of the geological attribute. Equation (11) is a linear pooling method with a nonlinear weight value wj. The aggregated probability distribution is often clearly multi-modal.
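A minimal sketch of Eqs. (10) and (11) for two hypothetical data sources and two possible strata is given below; the probability values are illustrative assumptions, and the logarithm is taken to base 2 so that an entropy of 1.0 corresponds to a probability of 0.5, as discussed in Sect. 3.5.1.

```python
# Hedged sketch of the entropy-based weighting (Eq. 10) and the linear
# aggregation (Eq. 11) for one cross-point with two possible strata.
import numpy as np

# Rows: source j; columns: stratum i. Assumed probabilities; each row sums to 1
# and is strictly positive (zeros would need clipping before taking logs).
P = np.array([
    [0.9, 0.1],   # e.g. a borehole, fairly certain about the stratum
    [0.6, 0.4],   # e.g. a basement rock map, less certain
])

H = -np.sum(P * np.log2(P), axis=1)      # entropy per source, Eq. (10)
w = (1.0 - H) / np.sum(1.0 - H)          # entropy-based weights, Eq. (10)
P_agg = w @ P                            # aggregated probabilities, Eq. (11)

print("entropy per source:", np.round(H, 3))
print("weights:           ", np.round(w, 3))
print("aggregated P_i:    ", np.round(P_agg, 3))
```

In this illustrative case the less ambiguous source receives most of the weight, which mirrors the behavior described above: a source whose probability approaches 0.5 contributes little to the classification.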

3.5 Experimental Data and Method Testing

The method was tested by analyzing the error distribution at the intersections of different types of data. Several hundred borehole samples, a map of the basement rock, and six intersected cross-sections were collected, as shown in Fig. 3 (Hou et al. 2016). In Fig. 3, there are six cross-sections and nearly 200 boreholes. Boreholes DC043 and DC091 are on the line of section EW2. Sections EW2 and SN1 cross at the position of DC091. Cross-sections were digitized from a scanned image. The basement rock map was digitized according to the isobaths of the basement rocks. The stratigraphic divisions were given for each borehole according to lithological information and test data. To simplify the presentation of the data, only the Quaternary layer and basement rock are specifically identified in Fig. 3. In the intersections, ideally, the contacts of the basement rock and the Quaternary layer extracted from the borehole, the cross-section, and the basement rock map should be at the same point. In fact, the contact points between the Quaternary layer and basement rock from the borehole are not always consistent with those extracted from the basement rock map, such as the point at borehole DC037 shown in Fig. 3.

Fig. 3 Testing data. To simplify the problem, strata from borehole cores are classified into two types: the Quaternary stratum (marked in yellow) and the basement rock. Sections EW2 and SN1 cross at DC091. Probabilities along boreholes DC043, DC091, and DC037, along section EW2, and on the geological map are specifically discussed in the testing section

3.5.1 Attribution Aggregation for a Single Point

In this study, the contacts between the Quaternary layer and the basement rock at three cross-points, shown in Fig. 3, are extracted to test the proposed method. Among these data, the contacts of borehole DC037 were not used to interpolate the basement rock map. The position of the contact point from the basement rock in borehole DC037 is inconsistent with that from the basement rock map, as shown in Fig. 4a. The contacts of the other two boreholes, DC091 and DC043, are consistent with the contacts from the basement rock map. Errors in the geological contacts from the boreholes and the basement rock map are assumed to follow a normal distribution with a standard deviation of 0.667 m.

Fig. 4 Error aggregation for different source data at a cross-point: a positions of the cross-points in the boreholes, the cross-section, and the geological map; b attribution distribution of the basement rock in the borehole; c attribution distribution of the basement rock of the contact points extracted from the basement rock map; d aggregated results of b and c; and e legend of the probability of the basement rock distribution

In forming the attribution probability, Fig. 4b, c show the error distributions around the contact points in the three boreholes and the cross-points on the basement rock map. Borehole DC037 was added after the basement rock map had been interpolated with the contacts from the other boreholes and cross-sections. The expected aggregated value of the contact along borehole DC037 lies between the contact position extracted from borehole DC037 and that extracted from the basement rock map (Fig. 4d). As shown in Fig. 5a, the curve of the aggregated attribution probability of the basement rock contact point in DC037 is approximately the average of the other two probabilities because the contact position of the basement rock extracted from borehole DC037 (approximately − 7 m) is far from the corresponding position from the basement rock map (approximately − 10 m). Note that the aggregated curve has two points that are tangential to the other two curves where the probability value is 0.5, marked by dashed lines in Fig. 5. For example, the aggregated probability equals the probability value of the basement rock in DC091 where the probability of the basement rock at the point on the basement rock map is 0.5 (Fig. 5c). If the base value is 2 in Eq. (10), the entropy value is 1.0 when the probability value is equal to 0.5, and the entropy weight value of the corresponding source is then 0. This result means that a data source cannot provide any information to judge which stratum the contact point belongs to when its probability value is 0.5.

Fig. 5 Comparison of the attribution probability curves of the basement rock before and after aggregation. The left, middle, and right panels show the attribution probability curves of the basement rock for the contact points at the positions of boreholes DC037, DC043, and DC091, respectively. The point at which the dashed lines cross marks the position where the aggregated probability is 0.5

3.5.2 Attribution Aggregation Around the Cross-Section Using the Segment

An experiment on the error distributions of the basement rock boundaries is presented to demonstrate the implementation of the proposed method. Basement rock boundaries were extracted from cross-section EW2 and from the basement rock map along the EW2 cross-section. The error of the basement rock boundaries extracted from cross-section EW2 was assumed to follow a continuous uniform distribution because these data are not obtained from outcrops, as noted by Tong et al. (2013), and the errors at the cross-points between the cross-section and the boreholes follow a normal distribution. Boundaries of the basement rock with errors that follow a normal distribution were extracted according to the cross-over lines between the cross-section and the basement rock map. Figure 6b, c show the probability distributions of the basement rock boundaries from the EW2 cross-section (Fig. 6a) and from the basement rock map. In the cross-section, the segments that compose the strata boundaries were treated separately, and the errors at the endpoints of one segment can follow different distributions. Although the expected value of the basement rock depth in the different source data follows the same trend, large differences exist at several local positions, such as the position marked by the black arrow in Fig. 6b, c. The attribute probability distribution of the basement rock appears jagged in Fig. 6c because the contact points are densely distributed on the basement rock map. The yellow area in Fig. 6d is much wider than that in Fig. 6b, c, as shown by the position marked with the black arrow, because the boundaries of the basement rock in the two types of data do not coincide with each other. This result illustrates that the aggregated probability value is, essentially, a weighted average of the two data sources. The attribution probability curves at a depth of − 21 m were extracted from Fig. 6b–d (Fig. 7). The probabilities of the basement rock boundaries extracted from cross-section EW2 (red curves in Fig. 7) and from the basement rock map (green curves in Fig. 7) do not coincide with each other. The entropy-weighted aggregated probability curve is located between the curves of the two sources. Additionally, as shown in the left image of Fig. 8, the probabilities of the basement rock contacts on the cross-sections and borehole cores were calculated. The probability values are aggregated at those contact points where several kinds of source data cross, such as the contact point along DC091.

Fig. 6 Probability distribution of the basement rock depth along cross-section EW2, where the black line marks the depth of − 21 m: a cross-section EW2 represented by the Quaternary layer and the basement rock; b the probability distribution of the basement rock boundaries extracted from cross-section EW2; c the probability distribution of the basement rock boundaries extracted from the basement rock map along cross-section EW2; d the aggregated result of b and c with entropy-based weighting; e the aggregated probability of the basement rock depth along cross-section EW2 using the maximum entropy method; and f the legend of the probability that a point belongs to the basement rock

Fig. 7 Attribution probability curves of the basement rock using the weighted average and maximum entropy methods at a depth of − 21 m, extracted from Fig. 6b–e. The purple curve represents the result of the maximum entropy method, and the navy blue curve is the aggregated result of the weighted average method

Fig. 8 Results of the probability distribution of the basement rock. The left figure shows the aggregated probability of the basement rock, extracted from boreholes, cross-sections, and the basement rock map, along with the boreholes and cross-sections. The right figure illustrates the probability distribution of the basement rock surface without a zero value

The method presented here is a one-dimensional approach. To extend the method to a three-dimensional situation, a very simple workflow is presented, and the analytical solution for a triangle is derived. Consider a triangle, as shown in Fig. 9, whose vertices are A(x1, y1, z1), B(x2, y2, z2), and C(x3, y3, z3). Point P is an arbitrary point in the triangle, and point D is the intersection point of line AP and line CB, with coordinates D(xD, yD, zD). The lengths of the line segments in the figure are denoted |CD| = a, |BD| = b, |AP| = c, and |PD| = d. Let C1 = b/(a + b), C2 = a/(a + b), C3 = d/(c + d), and C4 = c/(c + d). The depth of point P can then be represented as

$$ Z_{0} = C_{3} z_{1} + C_{4} z_{D} \,{\text{and}}\,z_{D} = C_{1} z_{3} + C_{2} z_{2} . $$
(12)
Fig. 9 Diagram of a triangle

The error distribution of points in the triangle can be calculated according to the method presented in this paper. The basement rock surface is represented by a triangulated network, and the attribute probabilities of every point on the surface can then be calculated. Figure 8 illustrates the integrated probability distribution of the basement rock along the cross-sections and boreholes (left) and the probability distribution of the basement rock surface without a zero value (right).
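The nested interpolation of Eq. (12) can be sketched as follows; the vertex coordinates and the query point are illustrative assumptions, and the length ratios are measured in the horizontal plane, which for a planar triangle yields the same ratios as the three-dimensional segment lengths.

```python
# Hedged sketch of Eq. (12): the depth at an interior point P of a triangle is
# obtained by interpolating first along edge CB to the auxiliary point D (the
# intersection of line AP with CB) and then along AD to P.
import numpy as np

A = np.array([0.0, 0.0, -10.0])    # vertex A (x1, y1, z1), assumed values
B = np.array([4.0, 0.0, -12.0])    # vertex B (x2, y2, z2)
C = np.array([2.0, 3.0, -11.0])    # vertex C (x3, y3, z3)
P_xy = np.array([2.0, 1.0])        # horizontal position of the query point P

# Intersection D of line AP with edge CB in the horizontal plane:
# A + s*(P - A) = C + t*(B - C)  =>  solve for s and t.
M = np.column_stack((P_xy - A[:2], C[:2] - B[:2]))
s, t = np.linalg.solve(M, C[:2] - A[:2])
D_xy = C[:2] + t * (B[:2] - C[:2])

a = np.linalg.norm(D_xy - C[:2])   # |CD|
b = np.linalg.norm(D_xy - B[:2])   # |BD|
c = np.linalg.norm(P_xy - A[:2])   # |AP|
d = np.linalg.norm(D_xy - P_xy)    # |PD|
C1, C2 = b / (a + b), a / (a + b)
C3, C4 = d / (c + d), c / (c + d)

z_D = C1 * C[2] + C2 * B[2]        # depth at D along CB
z_P = C3 * A[2] + C4 * z_D         # depth at P along AD, Eq. (12)
print(f"interpolated depth at P: {z_P:.3f} m")
```

The endpoint error distributions can then be propagated through the same weights, in the same way as for the one-dimensional segment case.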

4 Discussion

In this study, the entropy value is used as the weighting value for the different error distributions of the input data. This is a key factor that influences the final probability. The weight value is calculated from the probabilities themselves, rather than by introducing new parameters as in some other methods. When the probability at a point is far from 0.5, the geological attribute at that point is relatively certain; in this circumstance, the contribution of this type of data source to determining the geological attribute is greater than that of other data sources. When the probability equals 0.5 and the entropy is 1.0, the geological property at the point is the most unclear, and this data source contributes the least to classifying the geological attributes. Therefore, the weight value is determined by the error characteristics, including the error distribution function and its corresponding parameters, such as the confidence interval and standard deviation.

Essentially, the method presented in this study is a generalized linear pooling method for aggregating multiple errors from different data sources. Many disciplines, including statistics, the management sciences, and geostatistics, have studied the aggregation of information from distinct sources. The methods found in the literature can be classified into multiplicative-based and additive-based aggregation operators (Allard et al. 2012). Among the methods that can be included in Eq. (9), the maximum entropy method is the only approach that does not introduce any extra parameters. Therefore, its aggregated probabilities were compared with those of the proposed method (Figs. 6e and 7). The maximum entropy method is a log-linear pooling method and is more sensitive than the entropy-based weighting method. The range of the error distribution (yellow area shown in Fig. 6e) in the maximum entropy method is narrower than that in the entropy-based weighting method. For example, at the position marked by the black arrow in Fig. 6, the aggregated probability is approximately 0.5 with the entropy-based weighting method, while the probability calculated by the maximum entropy method is 1.0. In Fig. 7, the probability curve extracted from the result of the entropy-based weighting method appears turbulent from 1500 to 2000 m and from 3500 to 4000 m, whereas the curve calculated by the maximum entropy method is much smoother. When the probability is less than 0.5, the aggregated probability is greater than that of the entropy-based weighting method, and vice versa.

The method presented in this study is actually a type of linear method in which the weight values are not preset by users or estimated by introducing some other parameters. The weight value is decided by the characteristics of the error distribution itself. Although the linear method is intrinsically sub-optimal, the presented method can reveal the error range and details of different input data.

In the presented method, an error type is assigned to each contact position from the different sources. In this study, the error types of contact points are not classified by the stratigraphic information around the contact points. However, this does not affect the aggregation process or its applications because the probabilities around the segments of all the boundaries are calculated separately. Therefore, the error type at each contact point affects the probability distributions rather than the workflow. The additive method used in this study carries an implicit assumption that the error distribution of an arbitrary point of a segment should be limited by the error distributions of the endpoints, which is why the uncertainties along a segment appear smaller than those at its endpoints in Figs. 2 and 6.

In Fig. 7, the integrated probability curve based on the maximum entropy method (purple curve) has the same waveform as the curve obtained using the entropy-based method, but the maximum entropy result is much smoother. Notice that most values of the integrated result with the maximum entropy method are smaller than those obtained using the entropy-based method (navy blue curve in Fig. 7), except at some positions, such as 1500 m.

The limitations of our method are mainly related to the error types and to the conversion from error distributions to geological attribute probabilities. Classification of the error types for contact points is a time-consuming and tedious task that reduces efficiency; it could be reasonable to build an error-type assignment rule based on the stratigraphic information. In this paper, the error distribution has been discussed in the depth direction rather than in two or more dimensions, which avoids the overlap problem at the junction points between segments. In future work, conversion formulas for higher-dimensional error distribution functions should be derived based on the positional error of line features or curves, which would help to analyze topological error in the source data. Additionally, the implementation is limited to possible outcomes. This study presents a one-dimensional method and does not analyze cases with multiple z-values, such as overturned folds, dome structures, and thrust faults.

5 Conclusions

In this study, a one-dimensional multiple-error analysis method is presented for geological contacts that are typically used to build geological models. The foundation of the approach is that the spatial errors of contacts or boundaries between strata can be represented by geological attribute probabilities, under the assumption that errors are caused by imprecise measurement or vague classification of geological attributes, such as rock type. The error distribution functions and prior guesses are accounted for, since the analysis is based on statistical manipulation of these distributions. At each location of the input data, the probability of each kind of rock type can be evaluated using this method. The weighting of different sources and types of errors using entropy-based weights is presented. Therefore, in this method, the final chance of finding a certain rock type at one particular point relies on the probabilities of different types of error distributions.

The approach is simple and intuitive for analyzing and visualizing errors in multi-source data. At a contact point, the error distribution of a rock type is calculated using the error of the depth of the contact point, and multiple error models are integrated if more than one input datum exists for the contact point. Similar to the error band used in geographic information systems (GIS), integrated one-dimensional error density distribution functions along a segment were derived for cases where the errors at the endpoints follow different distribution functions, such as a normal distribution or a continuous uniform distribution. By assuming that the error at a random point of a segment is a combination of the error distributions of the endpoints, the shape of the probability distributions appears similar to the error bands presented by Tong and Liu (2004) and Tong et al. (2013). Because the presented method is based on error assessment with a probabilistic model (Tacher et al. 2006), it is not limited by the assumption that only one kind of error distribution can be defined for all source data. In the two experiments, we have shown how multiple types of errors can be aggregated at cross-points. An advantage of our method is that it is not limited to a specific error distribution and that the workflow itself is always performed in the same way. The test results illustrate that the integrated probability is a weighted average result, although the weights are nonlinear. Like the maximum entropy method, the presented method does not estimate parameters for integration because the weight value is decided by the probabilities of the rock type based on the input data.

In the method proposed by Wellmann and Regenauer-Lieb (2012), uncertainties are characterized and measured by information entropy values under the assumption that the uncertainty away from an observation should be higher than the uncertainty at an observation. In this study, the error distribution at an arbitrary point of a segment is assumed to follow, or be limited by, the error distributions of the endpoints of the segment, yet it is still measured by the error distribution function. Therefore, the uncertainty at a random point of a segment is smaller than that at the endpoints, and the two approaches yield quite different results because of their different assumptions.

Several aspects of this research are to be developed further, including an analysis method for higher dimensional error distributions. Additional deterministic vector information, such as dip measurements, should be taken into consideration in the process. Regardless, the proposed method expands and refines the treatment of error and uncertainty in geology and opens new possibilities for more sophisticated uses of three-dimensional geological models in instances in which errors in the model can translate into serious consequences of risk, failure, or cost.