1 Introduction

Benford’s law (BL), also known as the first-digit or significant-digit law, was first noted in 1881 by Simon Newcomb. Newcomb observed that the ten digits do not occur with equal frequency, as “must be evident to any one making much use of logarithmic tables, and noticing how much faster the first pages wear out than the last ones” (Newcomb 1881). BL states that in many naturally occurring sets of observations the leading digits follow a monotonically decreasing logarithmic distribution rather than the uniform distribution one might expect (Berger and Hill 2015). Under BL, the first leading (i.e., first non-zero) decimal digit is not equally likely to be any of the nine possible digits. Instead, numbers starting with 1 occur close to 30% of the time and those starting with 2 close to 18%, whereas numbers starting with 8 or 9 each occur close to 5% of the time.

In general, it is hard to expect a common pattern among the huge numerical data sets found in nature or produced by humans (Mir 2012). Benford (1938) tested the law on a wide range of fields, such as the heights of mountains, the lengths and surface areas of rivers, physical constants, molecular weights, death rates and cost data, and found that they follow BL. In the literature, there has been a wide range of applications of BL, from social to numerical fields. Mir (2014) referred to the “Benford Online Bibliography” (Berger et al. 2009) for comprehensive information on the numerous data sets that obey BL and on its applications across multiple disciplines. Some applications examining the presence of BL in different fields are as follows. An early study was conducted by Becker (1982), who found a logarithmic distribution when failure rates and mean time to failure (MTTF) values were read from left to right for the first non-zero digit. Nagasaka (1984) reviewed various sampling procedures, examining whether BL holds for the resulting sampled integers. It was proved that randomly sampled integers do not necessarily obey BL, but that their Banach limit does for polynomial sampling procedures; the same was proved for geometrical and linear recurrence sampling procedures. Hill (1995) stated that many tables of numerical data do not follow the logarithmic distribution: lists of telephone numbers in a given region typically begin with the same few digits, and even “neutral” data such as square-root tables of integers are not good fits. However, a surprisingly diverse collection of empirical data does seem to obey the significant-digit law. More empirical evidence may be found in Hill (1995).
In the field of demographic data on world religion distribution, Mir (2012) investigated numerical data on the country-wise adherent distribution of seven major world religions, i.e., Christianity, Islam, Buddhism, Hinduism, Sikhism, Judaism and Baha’ism, to see whether the proportions of leading digits in the distribution conform to BL. Mir (2012) showed that the adherent data of all the religions except Christianity conform excellently to BL. Ausloos et al. (2015) reviewed long birth time series for Romania from the BL point of view, distinguishing between families with Orthodox and Non-Orthodox religious affiliations, and found a drastic breakdown of BL in the results.

In this study, the aim is to investigate the applicability of BL to residuals and normalized residuals estimated by the Least Squares Estimation (LSE) method. Moreover, BL is applied to geodetic data sets with normally distributed errors to demonstrate the consistency of the law. In geodesy, the data observed in the field contain random errors drawn from a normal distribution (N(0, σ)), and the most probable values of the unknown parameters are generally estimated by LSE. The residuals are the differences between the estimated values and the observations; the normalized residuals are also considered, as they are used in the outlier detection problem. Here, three different data sets are studied: (1) simulated data, (2) real data observed by the global positioning system (GPS), and (3) leveling network data containing random errors. According to the results, the residuals and normalized residuals of these data estimated by LSE follow BL.

2 Benford’s law

Like Newcomb before him, Benford (1938) observed that the first pages of logarithmic tables were consulted more than the last pages, and he presented an empirical law showing that the distribution of leading digits is not uniform among naturally occurring phenomena such as river areas, city populations, addresses, death rates, etc. (Jamain 2001). Although the basic form introduced by Benford (1938) was based on empirical observations, several mathematical series, i.e., binomial coefficients and factorials (Sarkar 1973) and Fibonacci and Lucas numbers (Wlodarski 1971), are related to this law. BL states that smaller digits are found more frequently than larger digits, which can be expressed as an expected frequency distribution of the first digits for many numerical data sets (Chandra Das et al. 2017). In its most common formulation, the special case of the first significant (i.e., first non-zero) decimal digit, Benford’s law asserts that the leading digit is not equally likely to be any one of the nine possible digits 1, 2, …, 9. The first digit of a number is its leftmost non-zero digit. According to this law, the frequency of the first significant digit k can be computed as (Benford 1938):

$$P\left( k \right) = \log_{10} \left( {k + 1} \right) - \log_{10} \left( k \right) = \log_{10} \left( {1 + \frac{1}{k}} \right)$$
(1)

where P(k) is the probability that the first significant digit equals k, and k is any digit in the set \(\left\{ {1,2,3, \ldots ,9} \right\}\).

Considering the BL distribution for the smaller digits 1 and 2: if the first leading digit is 1, the probability of occurrence is \(P\left( {k = 1} \right) = \log_{10} \left( 2 \right) = 0.3010\), i.e., about 30% of the time; if the first leading digit is 2, the probability is \(P\left( {k = 2} \right) = \log_{10} \left( {\frac{3}{2}} \right) = 0.1761\), i.e., about 18% of the time. For the larger digits, \(P\left( {k = 8} \right) = \log_{10} \left( {9/8} \right) = 0.0512\), about 5% of the time, and \(P\left( {k = 9} \right) = \log_{10} \left( {10/9} \right) = 0.0458\), less than 5%. Table 1 shows the distribution of first digits derived from BL.
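The expected proportions of Eq. (1) are straightforward to compute; a minimal sketch (the function name `benford_pmf` is ours, not from the paper):

```python
import math

def benford_pmf():
    """First-digit probabilities under Benford's law, Eq. (1):
    P(k) = log10(1 + 1/k) for k = 1, ..., 9."""
    return {k: math.log10(1 + 1 / k) for k in range((1), 10)}

pmf = benford_pmf()
# pmf[1] ≈ 0.3010, pmf[2] ≈ 0.1761, pmf[9] ≈ 0.0458
```

Since the nine intervals [k, k+1) cover all possible leading digits, the probabilities sum to 1.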

Table 1 The first digit distribution of Benford’s law

3 Data

Surveying is an important field of application of geodesy. In a general manner, surveying is the measurement of horizontal and vertical distances between objects and of angles between lines, the determination of the directions of lines, and the establishment of points by predetermined angular and linear measurements. Apart from classical surveys, satellite-based positioning is the determination of the positions of observing sites on land or at sea, in the air and in space by means of artificial satellites (Hofmann-Wellenhof et al. 2001). Both classical and satellite-based surveying generate measurements of certain physical quantities. However, in both cases, due to various influences, repeated measurements of the same quantity will in general differ. An error is the difference between a measured value and the true value of a quantity. The sources of errors in measurements are humans, instruments and circumstances, and these errors introduce uncertainty into the measurements. Improving the instruments, training the observer or providing better circumstances reduces this uncertainty, but does not yield error-free measurements: all measurements contain errors. Typically, errors are classified into three groups: systematic errors, random errors and gross errors. Among them, gross errors can be detected by repeated measurements. Systematic errors, which generally pertain to survey instruments or external influences (circumstances), affect the measurements with the same direction and magnitude; they can be eliminated by calibrating the instruments and applying the proper corrections for circumstantial influences (Berber 2008; Teunissen 2003). Random errors, however, cannot be eliminated by such procedures and follow a normal distribution. At the LSE stage, these errors are reflected in the residuals.

BL proportions are compared with the proportions of residuals and normalized residuals. The data used in this study have a normal distribution, and the equations given below for LSE are used to compute the residuals and normalized residuals. It is supposed that there are n independent, equally weighted measurements, denoted by the vector l. The residuals, denoted by v, are the differences between the most probable values of the measurements and the measured values. The normalized residuals w, a specific form of the residuals, are computed by Eq. (9).

$${\mathbf{A}} = \left[ {\begin{array}{*{20}c} 1 & 1 & {\begin{array}{*{20}c} \cdots & 1 \\ \end{array} } \\ \end{array} } \right]_{1 \times n}^{T} \quad \left( {{\text{for}}\;{\text{univariate}}\;{\text{model}}} \right)$$
(2a)
$${\mathbf{A}} = \left[ {\begin{array}{*{20}c} 1 & \ldots & 1 \\ {x_{1} } & \ldots & {x_{n} } \\ \end{array} } \right]_{2 \times n}^{T} \quad \left( {{\text{for}}\;{\text{linear}}\;{\text{regression}}\;{\text{model}}} \right)$$
(2b)
$$\hat{\varvec{x}} = \left( {{\mathbf{A}}^{{\mathbf{T}}} {\mathbf{PA}}} \right)^{ - 1} \left( {{\mathbf{A}}^{{\mathbf{T}}} {\mathbf{Pl}}} \right)$$
(3)
$${\mathbf{v}} = {\mathbf{A}}{\hat{\mathbf{x}}} - {\mathbf{l}}$$
(4)
$${\mathbf{Q}}_{{{\mathbf{xx}}}} = \left( {{\mathbf{A}}^{{\mathbf{T}}} {\mathbf{PA}}} \right)^{ - 1}$$
(5)
$${\mathbf{Q}}_{{{\mathbf{ll}}}} = {\mathbf{AQ}}_{{{\mathbf{xx}}}} {\mathbf{A}}^{{\mathbf{T}}}$$
(6)
$${\mathbf{Q}}_{{{\mathbf{vv}}}} = {\mathbf{P}}^{ - 1} - {\mathbf{Q}}_{{{\mathbf{ll}}}}$$
(7)
$$\sigma_{o} = \sqrt {\frac{{{\mathbf{v}}^{{\mathbf{T}}} {\mathbf{Pv}}}}{n - u}}$$
(8)
$$w_{i} = \frac{{\left| {v_{i} } \right|}}{{\sigma_{o} \sqrt {Q_{vivi} } }}$$
(9)

Here l is the observation vector, \({\hat{\mathbf{x}}}\) is the vector of unknown parameters, A is the coefficient matrix, P is the weight matrix (taken as the identity matrix), \({\mathbf{v}}\) is the residual vector, \({\mathbf{w}}\) is the vector of normalized residuals, \(\sigma_{o}\) is the standard deviation of unit weight, n is the number of observations and u is the number of unknown parameters.
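Equations (3)–(9) can be sketched compactly with NumPy; the following is a minimal illustration under the paper's assumptions (identity weight matrix, full-rank design matrix), with the function name `lse_residuals` being ours:

```python
import numpy as np

def lse_residuals(A, l, P=None):
    """Least-squares estimation following Eqs. (3)-(9).

    A : (n, u) design matrix, l : (n,) observation vector,
    P : (n, n) weight matrix (identity if None, as in the paper).
    Returns the estimate x_hat, residuals v and normalized residuals w.
    """
    n, u = A.shape
    if P is None:
        P = np.eye(n)
    Qxx = np.linalg.inv(A.T @ P @ A)                   # Eq. (5)
    x_hat = Qxx @ (A.T @ P @ l)                        # Eq. (3)
    v = A @ x_hat - l                                  # Eq. (4)
    Qll = A @ Qxx @ A.T                                # Eq. (6)
    Qvv = np.linalg.inv(P) - Qll                       # Eq. (7)
    sigma0 = np.sqrt(v @ P @ v / (n - u))              # Eq. (8)
    w = np.abs(v) / (sigma0 * np.sqrt(np.diag(Qvv)))   # Eq. (9)
    return x_hat, v, w
```

For the univariate model (A a column of ones), Eq. (3) reduces to the arithmetic mean of the observations, and the residuals sum to zero.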

4 Regression models

BL has been applied to two geodetic survey data sets simulated from the regression models with normally distributed errors explained below, to test whether the residuals and normalized residuals estimated by LSE follow the BL distribution. For generating the data based on the regression models, the standard deviation of the models is taken as 0.02 m with expected value 0. Formann (2010) reviewed some common distributions to explore their relation to BL. One of the distributions examined by Formann (2010) is the normal distribution, \(N\left( {\mu ,\sigma } \right)\). In Formann (2010), simulated random data were generated from a normal distribution \(\left( {\mu = 1.1,\sigma = 0.25} \right)\), and it was concluded that the normal distribution does not follow the BL distribution. However, the first digit of the expected value dominates the results, so the majority of the data start with the same first digit as the expected value, as can be seen in Formann (2010). In this study, we therefore chose the expected value as 0 to eliminate the effect of the first digit of the expected value on the results. Moreover, the expected values of the residuals and normalized residuals estimated by LSE are 0, so they do not affect the results.

4.1 Univariate model

The first model is the univariate model. The random errors ei are generated by a random number generator, and the measurement y is obtained as y = a + ei. The random errors follow a normal distribution (N(0; 0.02 m)). Three cases with different numbers of observations (100, 1000 and 10,000) have been simulated. For the simulation, a was chosen as 5 m, and each observation was obtained by adding a random error to this value. LSE is applied to the data to estimate the unknown parameter, residuals and normalized residuals.
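The simulation and the extraction of first significant digits can be sketched as follows; this is an illustrative reconstruction (the seed, the helper name `first_digit` and the use of the sample mean as the univariate LSE estimate are our assumptions):

```python
import numpy as np

def first_digit(x):
    """First significant (non-zero) decimal digit of |x|."""
    x = abs(x)
    while x < 1:
        x *= 10
    while x >= 10:
        x /= 10
    return int(x)

# Univariate model: y = a + e, with a = 5 m and e ~ N(0, 0.02 m)
rng = np.random.default_rng(42)
y = 5.0 + rng.normal(0.0, 0.02, size=1000)

# For the univariate model, the LSE estimate is the mean, so v = mean(y) - y
v = y.mean() - y
digits = [first_digit(r) for r in v if r != 0.0]
```

The list `digits` can then be tallied against the Benford proportions of Eq. (1).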

4.2 Linear regression model

The second simulation model is the linear regression model, obtained as y = a + bx + ei. For the simulation, a and b were both chosen as 1 m, and the random errors e follow a normal distribution (N(0; 0.02 m)). For the analysis, LSE is applied to estimate the unknown parameters, residuals and normalized residuals.

4.3 GPS data

To generate the second data group, data from an IGS station (ISTA) are processed by the kinematic Precise Point Positioning (PPP) approach using the GIPSY OASIS II v6.4 software. The observation duration of the GPS data is 4 days (1st–4th of January 2016) (Fig. 1). LSE is applied to estimate the mean values of the coordinates over the 4 days. The residuals and normalized residuals of the three-dimensional coordinates were also estimated for the BL analysis.

Fig. 1
figure 1

X, Y, Z coordinates of real data

4.4 Leveling network

The third data group is simulated for a leveling network, which involves 55 points and 110 observations. For the leveling network, the height differences Δhk are computed from the fixed points; these are free of random errors. Random errors are then generated from a normal distribution and added to the height differences. The precision is taken as \(\sigma_{h} = \sigma_{0} \sqrt S \left( {\sigma_{0} = 1\frac{\text{mm}}{{\sqrt {1\,{\text{km}}} }}} \right)\) for the leveling network, where S is the total length of the leveling lines in km. The height differences were considered as observations, and the unknown parameters (the heights of the points) were estimated by the LSE method (Koch 1999). The residuals and normalized residuals of the height differences were used for the BL analysis.

4.5 Data analysis

The null hypothesis (Ho) is formulated to test the compliance of the observed first-digit distribution with the expected one. Under Ho, the frequencies obtained from the observations are the same as the expected frequencies based on BL. If the observed frequencies are close to the corresponding expected frequencies, the \(\chi^{2}\) value will be small, indicating a good fit. If the observed frequencies differ considerably from the expected frequencies, the \(\chi^{2}\) value will be large and the fit will be poor. A good fit leads to the acceptance of Ho, whereas a poor fit leads to its rejection. To test whether Ho is accepted, the Chi-square goodness-of-fit statistic is calculated as (Walpole et al. 1998):

$$\chi^{2} = N\mathop \sum \limits_{k = 1}^{9} \frac{{\left( {P\left( k \right) - B\left( k \right)} \right)^{2} }}{B\left( k \right)}$$
(10)

where \(\chi^{2}\) is a value of a random variable whose sampling distribution is approximated very closely by the Chi-squared distribution with v = 9 − 1 = 8 degrees of freedom, N is the sum of the frequencies, P(k) is the observed proportion from the data and B(k) is the expected proportion from BL. The test is based on how good a fit there is between the frequencies of occurrence in the observed sample and the expected frequencies obtained from the hypothesized distribution (Walpole et al. 1998).
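The statistic of Eq. (10) can be computed directly from a list of observed first digits; a minimal sketch (the function name `benford_chi2` is ours):

```python
import math

def benford_chi2(digits):
    """Chi-square goodness-of-fit statistic of Eq. (10).

    digits : list of first significant digits (each in 1..9).
    The value is compared against the chi2(8) critical value
    15.507 at the 95% confidence level.
    """
    N = len(digits)
    chi2 = 0.0
    for k in range(1, 10):
        B = math.log10(1 + 1 / k)   # expected Benford proportion B(k)
        P = digits.count(k) / N     # observed proportion P(k)
        chi2 += N * (P - B) ** 2 / B
    return chi2
```

A sample whose digit counts track the Benford proportions yields a small statistic, while a uniform digit distribution is rejected for moderate N.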

In addition, the numerical range of the data set in terms of order of magnitude (OOM) may be used as a strong indicator of compliance with BL (Brown 2005; Kossovsky 2014). Brown (2005) emphasized that data sets spanning a large numerical range can be expected to show a good correlation with BL; an OOM value equal to or larger than 4 would be suitable. Kossovsky (2014) and Whyman et al. (2016) state that the data configuration requirement for compliance with BL should be sufficient, and Kossovsky (2014) suggests that the OOM of the data set should be approximately over 3. The numerical range of the data set, used to assess conformity or non-conformity with BL, can be estimated by the OOM:

$${\text{OOM}} = \log \left( {\frac{{{\text{x}}_{\hbox{max} } }}{{{\text{x}}_{\hbox{min} } }}} \right)$$
(11)

where \(x_{min}\) and \(x_{max}\) are the minimum and maximum values of the data set, respectively. Here, the OOM is computed from the absolute values of the data set, excluding zeroes.
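Eq. (11), applied to absolute values with zeroes excluded as described above, can be sketched as (the function name `oom` is ours):

```python
import math

def oom(data):
    """Order of magnitude of a data set, Eq. (11):
    OOM = log10(x_max / x_min), computed from the non-zero
    absolute values as described in the text."""
    vals = [abs(x) for x in data if x != 0]
    return math.log10(max(vals) / min(vals))

# A data set spanning 0.001 to 10 (ignoring the zero) has OOM = 4
print(oom([0.001, -0.5, 10.0, 0.0]))  # 4.0
```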

5 Results

To test compliance with BL, the goodness-of-fit test was applied to the proportions. As there are 9 possible first digits, the number of degrees of freedom is 9 − 1 = 8. At the 95% confidence level, the critical value is \(\chi^{2}\) = 15.507. If this value is exceeded in any case, compliance with BL is rejected; otherwise, it is accepted.

5.1 Regression models

The distributions of residuals and normalized residuals calculated from the univariate model for 100, 1000 and 10,000 data points are shown in Fig. 2, and the statistical results are given in Tables 2 and 3. For each data group simulated by the univariate model, the v and w proportions approximate the BL proportions. For the first digits from 1 to 9, BL proposes occurrence proportions from 30.10 to 4.58%, respectively. For 100 measurements, the proportions of v range from 37 to 3% and those of w from 34 to 7%. Similar results are obtained for 1000 and 10,000 measurements (see columns 5 and 7 of Tables 2, 3).

Fig. 2
figure 2

Comparisons of residuals (a) and normalized residuals (b) first digit frequency estimated by univariate model and expected frequencies according to BL

Table 2 Frequencies and proportions for univariate model using residuals
Table 3 Frequencies and proportions for univariate model using normalized residuals

In the same manner, when the first-digit occurrences of the residuals and normalized residuals estimated by the linear regression model are compared with BL (Fig. 3), the graphical representations resemble the expected BL proportions. Tables 4 and 5 show the proportions for 100, 1000 and 10,000 observations in columns 3, 5 and 7, which approximate the BL proportions.

Fig. 3
figure 3

Comparisons of residuals (a) and normalized residuals (b) first digit frequency estimated by linear regression model and expected frequencies according to BL

Table 4 Frequencies and proportions for linear regression model using residuals
Table 5 Frequencies and proportions for linear regression model using normalized residuals

Since the computed \(\chi^{2}\) values are less than \(\chi_{0.05}^{2} = 15.507\) for 8 degrees of freedom, we can conclude that when the LSE method is applied to the geodetic data used in this study, which have a normal distribution with expected value 0 and standard deviation 0.02, the computed residuals and normalized residuals, whose expected values are 0, provide a good fit to the BL distribution.

Moreover, we changed the standard deviations of the simulated data to explore their effect on the results. To simulate the data, the standard deviation was varied from 0.001 to 1, keeping the expected value at 0. Tables 6 and 7 present the expected and observed proportions obtained from the univariate model together with the calculated \(\chi^{2}\) values. When the calculated \(\chi^{2}\) values (the last column of Tables 6, 7) are compared with the critical value \((\chi_{0.05}^{2} = 15.507)\), it is seen that the calculated \(\chi^{2}\) values are smaller than the critical value in all cases; changing the standard deviation does not meaningfully alter the outcomes. Similar results are obtained for the normalized residuals (see Tables 8, 9). In all cases of changing the standard deviation, larger data sizes give lower \(\chi^{2}\) values for the v and w values of both simulated models.

Table 6 Frequencies of univariate model for residuals
Table 7 Frequencies of univariate model for normalized residuals
Table 8 Frequencies of linear regression model for residuals
Table 9 Frequencies of linear regression model for normalized residuals

5.2 GPS data

Apart from the simulated data, the residuals and normalized residuals of GPS data were used as the real data set. The number of epochs of GPS measurements is 11,036. LSE is applied to estimate the mean values of the coordinates over 4 days. The residuals and normalized residuals are estimated separately for the three coordinate components (denoted X, Y, Z in the tables and figures). Graphical comparisons of the residuals and normalized residuals of the X, Y, Z coordinates with the BL proportions are shown in Fig. 4, and the statistical results are given in Tables 10 and 11. The proportions of v and w for the X coordinate component range from 36.29 to 5.62% and from 33.65 to 6.14%, respectively. Similar results are obtained for the Y and Z coordinates (see columns 5 and 7 of Tables 10, 11).

Fig. 4
figure 4

Comparisons of residuals (a) and normalized residuals (b) first digit frequency computed for X, Y, Z and expected frequencies according to BL

Table 10 Frequencies and proportions for real data using residuals (X, Y, Z)
Table 11 Frequencies and proportions for real data using normalized residuals (X, Y, Z)

5.3 Leveling network

As simulated data, we generated a leveling network containing 55 points, and LSE was applied to the height differences of 110 observations. The residuals and normalized residuals were obtained from LSE. The height differences were considered as observations, and the coefficient matrix was formed according to the network configuration. Figure 5 shows the BL proportions for the residuals and normalized residuals, respectively. The statistical results are given in Table 12. According to the results, the proportions of v and w range from 33.64 to 4.55% and from 30 to 6.76%, respectively. These findings fit the BL proportions.

Fig. 5
figure 5

Comparisons of residuals (a) and normalized residuals (b) first digit frequency estimated by leveling network and expected frequencies according to BL

Table 12 Frequencies and proportions for leveling network

According to the results given in Table 13, the \(\chi^{2}\) values of each data set were computed, and the critical value is greater than all of them, which indicates compliance with BL. Moreover, larger data sizes give lower \(\chi^{2}\) values for the v and w values of both simulated models, and varying the standard deviation when generating the random data does not change the distribution significantly. Comparing the real data with the critical value of \(\chi^{2}\) (see Table 13), the \(\chi^{2}\) values of the data are below the critical value (15.507) in all cases, which satisfies compliance with BL. The \(\chi^{2}\) values computed for the real data are lower than those of the simulated models, and larger data sizes provide lower \(\chi^{2}\) values.

Table 13 \(\chi^{2}\) values of data sets for residuals and normalized residuals

Table 14 shows the OOM values of the data sets computed from the absolute values of the residuals and normalized residuals by Eq. (11). For all data sets, the OOM values are larger than 3 (or 4). For the data set simulated by the linear regression model, the OOM values increase as the numerical range of the data sets increases. Although there is a slight decrease between 100 and 1000 data points, the OOM values are still over 3 (or 4).

Table 14 OOM values of data sets for residuals and normalized residuals

6 Conclusion

In this study, we have shown for the first time that the residuals and normalized residuals of geodetic data sets estimated by LSE follow BL. The first data set was simulated according to the regression models, with randomly distributed errors added. Without introducing outliers, the data, which we assume involve the types of errors that come from the nature of surveying, conform to this law with high statistical significance. To confirm this with real data, we chose GPS data and a leveling network; the statistical results support the same outcomes with higher accuracy. One point common to the data sets is that larger numbers of data provide lower \(\chi^{2}\) values. This is also supported by the OOM values: the numerical range of the data sets increases when the number of data points is increased, and the computed OOM values are over 3 (or 4), suggesting a good correlation with BL. BL has been applied as a useful tool for fraud detection in financial fields. Accordingly, this paper presents the first outcomes of applying BL to residuals and normalized residuals estimated by the LSE method in geodetic studies.