1 Introduction

Measurements contain several error sources arising from inherent uncertainties such as the distribution of physical properties, setting errors, equipment resolution, and environmental effects. Some errors shift the nominal accuracy and cause bias, while others disperse repeated measurements and cause deviation. The literature offers several methodologies commonly used for measurement error treatment. As one representative method, smoothing has been widely used and studied to suppress invalid fluctuations in data. Smoothing has been implemented by kernel smoothing [1], process convolution [2, 3], Gaussian process regression (GPR) with multiple kernels [4], GPR with random inputs [5, 6], spline smoothing [7, 8], and various filters [9]. A common problem of almost all smoothing methods is determining the level of smoothness, which is controlled by their model parameters. For example, studies have addressed the selection of the frame length of the Savitzky–Golay (SG) filter [10,11,12,13,14,15] and the kernel bandwidth of kernel smoothing [16,17,18,19]. A general-purpose criterion for determining smoothness regardless of the smoothing method also exists [8]. In addition, studies on boundary problems [19,20,21,22] have addressed the boundary errors occurring in smoothing.

Regression, in turn, is often used for deviation treatment: it statistically synthesizes multiple measurements into a single prediction model. GPR, one of the best-known statistical regression methods [23,24,25,26,27,28], constructs a regression model by optimizing hyperparameters through likelihood maximization. Since GPR often overfits by tracking meaningless spiky oscillations in the data instead of smoothing them out, smoothing must precede the regression.

Despite these developments, measurement error treatment remains an unresolved area. First, the aforementioned methods cannot guarantee that the error is removed while the actual signal is preserved. Second, most smoothing methods cannot remove noise cleanly, so overfitting or under-smoothing can occur; this is because the smoothness selection criteria are based on model accuracy measures such as the cross-validation error or the mean squared error. Finally, the appropriate error treatment method depends on the numerical or physical properties of the data, which makes it difficult to select a suitable method for a given dataset.

To address the last problem, this study considers a case that occurs commonly but is hard to handle: estimating an unmeasurable physical property from other measurable properties through an analytic function. This case is difficult because errors in the measured properties can propagate in unpredictable and complex ways to the accuracy of the estimated property. For the second problem, this study proposes a new criterion for local intensive smoothing that maintains a specified level of accuracy. Finally, for the first problem, model selection with prior knowledge is applied. In a high-uncertainty situation, several hypothetical models can be established, and the most valid model among the candidates must be selected; here, physical knowledge and Occam's razor-based model selection [29,30,31] are utilized. The whole process is illustrated with the problem of estimating the refractive index of water.

The remainder of this article is organized as follows: Sect. 2 introduces the optical property, the characteristics of the measured raw data, and the problems of the existing method. Sect. 3 describes the proposed process to remove measurement noise, including smoothing, GPR, and model selection. Sect. 4 illustrates the application of the proposed process to the refractive index estimation of water as a case study. Finally, Sect. 5 summarizes the results.

2 Optical property estimation and measured data characteristics

This section presents the fundamental physics of optical property estimation, an investigation of the measured data, and the problems of the existing method, which estimates the optical property without preprocessing [32]. In particular, the refractive index estimation of water and its experimental conditions, used as a case study in Sect. 4, are described.

2.1 Introduction of optical property estimation

Optical properties of materials have practical importance in many engineering fields, such as solar thermal collectors and glass fibers. Recently, various studies on tuning the optical properties of materials with nanoparticles have been intensively reported [33, 34]. A plasmonic nanofluid is a suspension of plasmonic nanoparticles whose free electrons can couple with light. With an extremely small amount of metal nanoparticles, it is possible to improve the absorption efficiency of a solar thermal collector and simultaneously minimize the pumping loss.

In applications of plasmonic nanofluids, it is important to accurately predict the absorption and scattering of the nanofluid, which depend on the material, shape, and size of the nanoparticles and on the properties of the base fluid [35]. The prediction requires the refractive index of the base fluid, but little information on the optical constants of base fluids for direct solar thermal collectors is available. Therefore, measuring the refractive index of an unknown fluid is essential for predicting absorption and scattering in nanofluids.

Two methods have been used to obtain the refractive index: (1) measuring the refraction angle or beam displacement [36,37,38] and (2) solving an inverse problem of Airy's formulae using measured transmission (T) and reflection (R) spectra [39, 40]. With the second method, the refractive index (n) is directly determined through a simple calculation from the ultraviolet–visible to the infrared range as

$$n(\lambda) = {\text{function}}(T(\lambda ),R(\lambda )) $$
(1)

at a given wavelength λ. A detailed explanation of Eq. (1) is found in Kim et al. [32].

If T and R are accurately measured, the refractive index of the base fluid can be precisely estimated. However, estimation via Eq. (1) is quite sensitive to measurement errors, as shown in the next section. Consequently, an uncertainty treatment method is essential for estimating an accurate refractive index with this method when T and R contain measurement errors.

2.2 Experimental conditions and problems of the existing method

This section presents the measurement results of T and R of water, the base fluid of the nanofluid, and the refractive index estimated using Eq. (1) from the mean of the measured T and R without any preprocessing, following the existing method [32]. For the refractive index estimation of water, T and R were measured 30 times along the wavelength, as shown in Fig. 1a and b. As Fig. 1b shows, R fluctuates severely at short wavelengths, and the deviation between measurements is very large at long wavelengths. As shown in Fig. 1a, T spans a large interval between its maximum and minimum values compared with R, and it shows relatively small fluctuation along the wavelength and small deviation between measurements.

Fig. 1 Measured data of a T and b R from 30 experiments

In the previous study [32], the mean of the repeated measurements of T and R, with no preprocessing, was used for the refractive index estimation, which results in large errors, as shown in Fig. 2. The estimation accuracy can be quantified using Palik's data, which is accepted as the exact refractive index of water. The error between the exact refractive index and the estimate from the mean of 30 measurements is 3.95% at a wavelength of 575 nm, a very low level of accuracy even though the multiple measurements cancel out most deviation errors. Hence, additional treatment is needed to reduce the estimation error below 2%.

Fig. 2 Estimated refractive index of water using the existing method [32]

2.3 Measured data characteristics

The large estimation error arises from (1) measurement bias and (2) noisy fluctuation of the measured data. Regarding the first cause, the region where T + R > 1, shown in Fig. 3, must contain a measurement bias that needs correction: physically, T + R must be less than or equal to one, since T + R + absorptance equals one and the absorptance is always non-negative.

Fig. 3 T + R obtained using the mean of measured T and R

Regarding the second cause, as shown in Fig. 1b, spiky fluctuations along the wavelength are observed in R. The estimated refractive index is sensitive to the measurement error of R because the scale of R is small compared with that of T, so a change of the same absolute magnitude has a larger effect in R than in T.

However, the measurement error cannot be reduced by careful manipulation alone since it originates from nature and from the experimental equipment. Therefore, Sect. 3 proposes a process that resolves the problems above by introducing appropriate error treatment methods.

3 Methods and process for data treatment

Proper data treatment that reduces the measurement errors in T and R enhances the accuracy of the refractive index estimation. For this purpose, a data treatment process is proposed that consists of smoothing of fluctuations, statistical regression of multiple measurements, and physics-based model selection, explained in Sects. 3.2, 3.3, and 3.4, respectively. Before the process is performed, data segmentation for local intensive smoothing according to the data characteristics must be carried out, as explained in Sect. 3.1.

3.1 Data segmentation for local intensive smoothing

Data segmentation is essential for deciding in which regions to perform local intensive smoothing. In this research, the third derivative of the data is used as the fluctuation level index. The relative third derivatives of T and R are shown in Fig. 4, where the relative third derivative of Y is calculated as \(\left| \frac{\text{3rd derivative of } \mathbf{Y}}{\max(\mathbf{Y}) - \min(\mathbf{Y})} \right|\). As shown in Fig. 4, T fluctuates little over the whole wavelength range, whereas R fluctuates strongly in the short- and long-wavelength regions relative to the criterion of 0.002. Therefore, region segmentation must be performed first to properly pick out the regions of R to be smoothed intensively. A minimal sketch of this fluctuation index is given below.
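For equally spaced wavelength samples, the index can be computed from finite differences, as in the sketch below; the exact differentiation scheme is an assumption, since the paper does not specify one.

```python
import numpy as np

def relative_third_derivative(y):
    """Absolute third finite difference of y, normalized by the data range."""
    d3 = np.abs(np.diff(y, n=3))       # third-order finite difference (len(y) - 3 values)
    return d3 / (y.max() - y.min())
```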

Fig. 4 Relative third derivatives of T and R

To establish the data segmentation standard, a Gaussian mixture [41, 42] is utilized in this research. The Gaussian mixture, a data clustering method based on data similarity and dissimilarity, is applied to region segmentation according to the likelihood of the third-derivative level of R for a specified number of regions. Using the Gaussian mixture, the split positions with the highest log-likelihood (ln L) are selected as the best positions, and their number is determined by the corrected Akaike information criterion (AICc) [43, 44]. ln L and AICc are calculated as

$$ \ln L = \ln p({\mathbf{Y}}|t_{1} ,t_{2} , \ldots ,t_{k - 1} ) = \ln \left( {\sum\limits_{j = 1}^{k} {N({\mathbf{Y}}_{j} |\mu_{j} ,\sigma_{j} )} } \right) $$
(2)

and

$$ AICc = - 2\ln L + 2k + \frac{2k(k + 1)}{{n - k - 1}}{,} $$
(3)

where Y is the data to be segmented, k is the number of segment groups, tk−1 is the (k−1)th split position, Yj is the jth segment of Y, μj and σj are the parameters of the normal distribution fitted to the jth segment, and n is the number of elements in the vector Y. In the region segmentation process, only the split positions (t1 to tk−1) are passed to the next procedure. The AICc results obtained using the relative third derivative of the measured R are shown in Fig. 5, which indicates that the best number of regions is four because it minimizes the AICc. Detailed region segmentation results are presented in Sect. 4 using the case study of the refractive index estimation of water.
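A minimal sketch of the segmentation search behind Eqs. (2) and (3) is given below: contiguous segments, a Gaussian fit per segment, and AICc to choose the number of regions. The coarse candidate grid and the exhaustive search over split positions are assumptions of this sketch; the paper does not state how the split positions are optimized.

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm

def segment_loglik(y, splits):
    """Sum of per-segment Gaussian log-likelihoods (Eq. (2)-style)."""
    bounds = [0, *splits, len(y)]
    return sum(norm.logpdf(y[a:b], y[a:b].mean(), y[a:b].std() + 1e-12).sum()
               for a, b in zip(bounds[:-1], bounds[1:]))

def best_segmentation(y, k_max=5):
    """Return (AICc, k, splits) minimizing the AICc of Eq. (3)."""
    n = len(y)
    step = max(1, n // 30)                        # coarse grid of candidate splits
    candidates = range(step, n - step + 1, step)
    best = (np.inf, 1, ())
    for k in range(1, k_max + 1):
        for splits in combinations(candidates, k - 1):
            aicc = (-2 * segment_loglik(y, splits) + 2 * k
                    + 2 * k * (k + 1) / (n - k - 1))
            if aicc < best[0]:
                best = (aicc, k, splits)
    return best

# Usage sketch (r_mean is a hypothetical averaged R curve):
# aicc, k, splits = best_segmentation(relative_third_derivative(r_mean))
```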

Fig. 5 AICc results according to the number of regions of R

3.2 Strategies for local intensive smoothing

As mentioned above, local intensive smoothing is required for locally and highly fluctuating regions such as those observed in R. The SG filter, an existing smoothing method that has been extensively studied including its bandwidth selection [16,17,18,19], is introduced in Sect. 3.2.1. However, since the SG filter has limitations in local intensive smoothing, a new smoothing method applied exclusively to high-fluctuation regions is proposed in Sect. 3.2.2.

3.2.1 Theoretical background of SG filter

The SG filter, a least squares fitting method applied inside a moving frame with a predetermined frame length and polynomial order, is selected as the basic smoothing method in this research. The filtered value is a weighted sum of the neighboring data within a frame of length len. With a chosen len, which must be an odd number, and the raw data before filtering y = [y1, y2, y3,…], the filtered data Ynew at the jth point is calculated with a coefficient matrix C as

$$ {\mathbf{Ynew}}_{j} = \left( {{\mathbf{C}} \otimes {\mathbf{y}}} \right)_{j} = \sum\limits_{{i = \frac{{ - {\text{len}} + 1}}{2}}}^{{\frac{{{\text{len}} - 1}}{2}}} {{\mathbf{C}}_{i} y_{j + i} } {.} $$
(4)

For example, if len = 5, then i = −2, −1, …, 2, where a negative i implies that the point yj+i lies to the left of yj. The coefficient matrix C is obtained by a least squares fit of yj−2, yj−1, …, yj+2 with a low-order polynomial. For equally spaced data with z = [−2, −1, 0, 1, 2]T, the mth-order polynomial approximation of the data within the frame can be expressed as

$$ {\mathbf{Ynew}} = a_{0} + a_{1} {\mathbf{z}} + a_{2} {\mathbf{z}}^{2} + \cdots + a_{m} {\mathbf{z}}^{m} . $$
(5)

The coefficient vector a = [a0, a1,…, am]T in Eq. (5) is obtained by solving the normal equations of

$$ {\mathbf{y}} = a_{0} + a_{1} {\mathbf{z}} + a_{2} {\mathbf{z}}^{2} + \cdots + a_{m} {\mathbf{z}}^{m} $$
(6)

whose solution is given by

$$ {\mathbf{a}} = \left( {{\mathbf{Z}}^{{\text{T}}} {\mathbf{Z}}} \right)^{ - 1}{{\mathbf{Z}}^{{\text{T}}} {\mathbf{y}}} $$
(7)

with

$$ {\mathbf{Z}} = \left[ {\begin{array}{*{20}c} {{\mathbf{z}}_{(1)}^{0} } & {{\mathbf{z}}_{(1)}^{1} } & \cdots & {{\mathbf{z}}_{(1)}^{m} } \\ {{\mathbf{z}}_{(2)}^{0} } & {{\mathbf{z}}_{(2)}^{1} } & \cdots & {{\mathbf{z}}_{(2)}^{m} } \\ \vdots & \vdots & \cdots & \vdots \\ {{\mathbf{z}}_{{({\text{len}})}}^{0} } & {{\mathbf{z}}_{{({\text{len}})}}^{1} } & \cdots & {{\mathbf{z}}_{{({\text{len}})}}^{m} } \\ \end{array} } \right]. $$

Therefore, the convolution coefficient C is expressed as

$$ {\mathbf{C}} = \left( {{\mathbf{Z}}^{{\text{T}}} {\mathbf{Z}}} \right)^{ - 1} {\mathbf{Z}}^{{\text{T}}} $$
(8)
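As a numerical check, the coefficients of Eq. (8) for len = 5 and a quadratic fit can be reproduced and compared against SciPy's savgol_coeffs; the SciPy cross-check is an illustration added here, not part of the paper.

```python
import numpy as np
from scipy.signal import savgol_coeffs

length, order = 5, 2
half = (length - 1) // 2
z = np.arange(-half, half + 1)                 # z = [-2, -1, 0, 1, 2]
Z = np.vander(z, order + 1, increasing=True)   # columns z^0, z^1, ..., z^m

C = np.linalg.solve(Z.T @ Z, Z.T)              # Eq. (8): (Z^T Z)^-1 Z^T
weights = C[0]                                 # a0 row: fitted value at z = 0

print(np.allclose(weights, savgol_coeffs(length, order)))  # True
```

In practice, `scipy.signal.savgol_filter(y, length, order)` applies these weights directly, following the convolution of Eq. (4).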

As shown in Fig. 6, the smoothing effect becomes stronger as the frame length increases, which means that the choice of frame length directly affects noise removal.

Fig. 6 Effect of frame length in SG filter

3.2.2 Proposed intensive smoothing method

The SG filter is appropriate because its smoothness can be controlled by the frame length and it does not distort the front and rear parts of the data. The key problem is how to determine the frame length that controls the smoothing level. To resolve this problem, the existing spline smoothing method [7, 8] utilizes the regularization equation [43, 44] given by

$$ g = \sum {\left( {{\mathbf{Y}}_{i} - f({\mathbf{x}}_{i} )} \right)^{2} } + \alpha \int {\left( \frac{{\partial^{2} f({\mathbf{x}})}}{{\partial {\mathbf{x}}^{2} }} \right)^{2} {\text{d}}{\mathbf{x}}}, $$
(9)

where the first term measures the conformity between the observations and the prediction f, and the second term is a roughness penalty. This regularization has been widely used to impose smoothness on other types of smoothing models, for example in bandwidth selection for kernel smoothing. In Eq. (9), α is determined by minimizing the cross-validation error [8]. Since the smoothing parameter of the SG filter is len, Eq. (9) can be rewritten in terms of the two parameters len and α as

$$ g({\text{len}},\alpha ) = \sum {\left( {{\mathbf{Y}}_{i} - f({\mathbf{x}}_{i} ;{\text{len}},\alpha )} \right)^{2} } + \alpha \int {\left( \frac{{\partial^{2} f({\mathbf{x}};{\text{len}},\alpha )}}{{\partial {\mathbf{x}}^{2} }} \right)^{2} {\text{d}}{\mathbf{x}}} $$
(10)

which is minimized to obtain the optimal len for a specified α. The α and len that minimize the cross-validation error (cve) given by

$$ {\text{cve}}({\text{len}}) = \sum\limits_{i = 1}^{N} {\left( {y_{i} - f_{\sim i} (x_{i} ;{\text{len}},\alpha_{*} )} \right)^{2} } $$
(11)

with specified α = α* are selected, where f∼i denotes the model fitted without the ith point. As can be seen in Fig. 7, the smoothing effect becomes stronger as α increases. In particular, when there is no roughness penalty in Eq. (10), the smoothed result almost follows the measurement data.

Fig. 7 Effect of α in the proposed intensive smoothing method

The overall smoothing process is shown in Fig. 8. The process includes two optimization loops: an inner loop for lenopt and an outer loop for αopt. That is, each α has its own lenopt, and the α with the minimum cve is selected as αopt. A schematic sketch of this double loop is given after Fig. 8.

Fig. 8 The parameter selection process of the existing method using Eqs. (10) and (11)
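The double loop can be sketched as below, using the SG filter as the smoother f(x; len, α). The squared second differences standing in for the roughness integral of Eq. (10), and the linear-smoother leave-one-out residuals (y − f)/(1 − h) approximating the cve of Eq. (11) away from the boundaries, are both numerical assumptions of this sketch, not the paper's exact implementation; `r` in the usage line stands for one hypothetical measured R curve.

```python
import numpy as np
from scipy.signal import savgol_filter, savgol_coeffs

def g(y, length, alpha, order=3):
    """Regularized objective of Eq. (10) for an odd frame length."""
    f = savgol_filter(y, length, order)
    return np.sum((y - f) ** 2) + alpha * np.sum(np.diff(f, 2) ** 2)

def loo_cve(y, length, order=3):
    """Approximate leave-one-out cve of Eq. (11) for a linear smoother."""
    f = savgol_filter(y, length, order)
    h = savgol_coeffs(length, order)[length // 2]  # center weight ~ hat-matrix diagonal
    return np.sum(((y - f) / (1.0 - h)) ** 2)

def existing_selection(y, alphas, lengths):
    """Outer loop over alpha, inner loop over frame length (Fig. 8)."""
    best = (np.inf, None, None)                    # (cve, alpha_opt, len_opt)
    for alpha in alphas:
        length = min(lengths, key=lambda ln: g(y, ln, alpha))
        cve = loo_cve(y, length)
        if cve < best[0]:
            best = (cve, alpha, length)
    return best

# Usage sketch: existing_selection(r, np.logspace(-3, 2, 12), range(5, 51, 2))
```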

However, this parameter selection process does not guarantee sufficient smoothness for spiky oscillating data since the cve criterion in Eq. (11) focuses on prediction accuracy. Moreover, in the case of R, the oscillation along the wavelength is uneven, which requires locally adaptive smoothing. Existing locally adaptive smoothing methods have the limitations that the frame length cannot exceed a user-defined maximum and that the result still tends to be under-smoothed. To resolve these problems, a new criterion for smoothness control is suggested in this study as

$$ {\text{len}}_{{{\text{opt}}}} = \mathop {\arg \min }\limits_{{{\text{len}}}} \left| {\frac{{\sqrt {\frac{1}{n}\sum {\left( {{\mathbf{Y}}_{i} - {\text{pred}}^{ - i} ({\mathbf{x}}_{i} ,{\text{len}},\alpha_{*} )} \right)^{2} } } }}{{\max ({\mathbf{Y}}) - \min ({\mathbf{Y}})}} - S_{{{\text{cve}}}} } \right| $$
(12)

with a specified cve criterion Scve. This criterion maximizes smoothness as long as Scve is satisfied. The difference between the proposed and existing criteria is shown in Fig. 9: the existing method selects the α that minimizes the cve, while the proposed method selects the α that brings the normalized cve closest to the given Scve. The αopt and lenopt of the proposed method are larger than those of the existing method, which results in stronger smoothing. Scve = 0.03 is adopted in the proposed local intensive smoothing. As for the polynomial order, the intensive smoothing regions adopt order one and the other regions adopt order six. A minimal implementation sketch of the proposed criterion follows.
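The sketch below scans frame lengths and keeps the one whose normalized leave-one-out error is closest to Scve, reusing the linear-smoother residual approximation from the previous sketch. Treating the frame length as the only free parameter is an assumption of this sketch; order one is used by default, as in the paper's intensive smoothing regions.

```python
import numpy as np
from scipy.signal import savgol_filter, savgol_coeffs

def proposed_len_opt(y, s_cve=0.03, order=1, lengths=range(5, 201, 2)):
    """Eq. (12): frame length whose normalized cve is closest to s_cve."""
    scale = y.max() - y.min()
    best_len, best_gap = None, np.inf
    for length in lengths:
        if length >= len(y):
            break
        f = savgol_filter(y, length, order)
        h = savgol_coeffs(length, order)[length // 2]
        ncve = np.sqrt(np.mean(((y - f) / (1.0 - h)) ** 2)) / scale
        if abs(ncve - s_cve) < best_gap:           # closeness to S_cve
            best_len, best_gap = length, abs(ncve - s_cve)
    return best_len
```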

Fig. 9 The proposed method for the determination of a α and b frame length

3.3 Data integration of multiple measurements

This section explains a method to integrate multiple measurements into a single prediction model using GPR, a well-known statistical regression method with high accuracy. In the derivation of GPR, the data y are assumed to follow y(x) = f(x) + ε with latent function f and noise ε ~ N(0, σ2), where σ2 is the Gaussian noise variance [23,24,25,26,27,28]. The Gaussian noise assumption is justified by the central limit theorem: the sum of a large number of independent error sources converges to a normal distribution. In the Bayesian approach,

$$ {\mathbf{y}} = {\mathbf{f}} + {\boldsymbol{\varepsilon}}, \quad {\boldsymbol{\varepsilon}} \sim N({\mathbf{0}},\sigma^{2} {\mathbf{I}}) $$
(13)

where ε collects the noise terms and I is the identity matrix, with a Gaussian prior given as

$$ p({\mathbf{f}}) \sim N({\mathbf{m}}_{0} ,{\mathbf{K}}_{0} ), $$
(14)

where m0 = m(x) is the mean function value with mean function m, and (K0)ij = k(xi, xj) is the covariance matrix with covariance function k. To calculate p(f*|y), the prediction at a new input x* given training data {x, y}, the joint posterior p(f, f*|y) is marginalized over the latent function value f as

$$ p({\mathbf{f}}_{*} |{\mathbf{y}}) = \int {p({\mathbf{f}},{\mathbf{f}}_{*} |{\mathbf{y}})} {\text{d}}{\mathbf{f}} = \frac{1}{{p({\mathbf{y}})}}\int {p({\mathbf{f}},{\mathbf{f}}_{*} )p({\mathbf{y}}|{\mathbf{f}})} {\text{d}}{\mathbf{f}} $$
(15)

since, according to Bayes' rule, the joint posterior is written as

$$ p({\mathbf{f}},{\mathbf{f}}_{*} |{\mathbf{y}}) = \frac{{p({\mathbf{f}},{\mathbf{f}}_{*} )p({\mathbf{y}}|{\mathbf{f}})}}{{p({\mathbf{y}})}}. $$
(16)

Each term in the integrand of Eq. (15) is expressed as

$$ p({\mathbf{f}},{\mathbf{f}}_{*} ) \sim N\left( {\left[ {\begin{array}{*{20}c} {{\mathbf{m}}_{0} } \\ {{\mathbf{m}}_{*} } \\ \end{array} } \right],\left[ {\begin{array}{*{20}c} {{\mathbf{K}}_{0} } & {{\mathbf{K}}_{*} } \\ {{\mathbf{K}}_{*}^{{\text{T}}} } & {{\mathbf{K}}_{**} } \\ \end{array} } \right]} \right) $$
(17)

and

$$ \, p({\mathbf{y}}|{\mathbf{f}}) \sim N({\mathbf{f}},\sigma^{2} {\mathbf{I}}), $$
(18)

respectively, where (K*)ij = k(xi, x*j), (K**)ij = k(x*i, x*j), and m* = m(x*) with new input x*.

Following previous research [45], Eq. (15) with Eqs. (17) and (18) is rewritten as

$$ p({\mathbf{f}}_{*} |{\mathbf{y}}) = N({\mathbf{m}}_{{{\text{post}}}} ,{\mathbf{K}}_{{{\text{post}}}} ) = N({\mathbf{m}}_{*} + {\mathbf{K}}_{*}^{{\text{T}}} ({\mathbf{K}}_{0} + \sigma^{2} {\mathbf{I}})^{ - 1} ({\mathbf{y}} - {\mathbf{m}}_{0} ), \, {\mathbf{K}}_{**} - {\mathbf{K}}_{*}^{{\text{T}}} ({\mathbf{K}}_{0} + \sigma^{2} {\mathbf{I}})^{ - 1} {\mathbf{K}}_{*} ). $$
(19)

When the mean function is defined using polynomials, m0 and m* are expressed as H0Tβ and H*Tβ with basis matrices H0 and H*, respectively.

All the model hyperparameters, including β, σ, and the covariance function parameters, are obtained by maximizing the marginal likelihood p(y), computed by integrating over the latent function value f as

$$ \log p({\mathbf{y}}) = \log \int {p({\mathbf{y}}|{\mathbf{f}})p({\mathbf{f}}){\text{d}}{\mathbf{f}}} = - \frac{1}{2}{\mathbf{y}}^{{\text{T}}} ({\mathbf{K}}_{0} + \sigma^{2} {\mathbf{I}})^{ - 1} {\mathbf{y}} - \frac{1}{2}\log \left| {{\mathbf{K}}_{0} + \sigma^{2} {\mathbf{I}}} \right| - \frac{n}{2}\log 2\pi $$
(20)

using Eqs. (14) and (18). A more detailed calculation process is found in references [23,24,25,26,27,28].
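As a concrete reference, the posterior of Eq. (19) can be computed directly in a few lines. The zero mean function, the squared-exponential kernel, and the fixed hyperparameters below are assumptions of this sketch; the paper obtains the hyperparameters by maximizing Eq. (20).

```python
import numpy as np

def rbf(a, b, ell=1.0, sf=1.0):
    """Squared-exponential covariance k(a_i, b_j)."""
    return sf**2 * np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gpr_posterior(x, y, x_star, ell=1.0, sf=1.0, sigma=0.1):
    """Posterior mean and covariance at x_star, Eq. (19), zero prior mean."""
    K0 = rbf(x, x, ell, sf) + sigma**2 * np.eye(len(x))   # K0 + sigma^2 I
    Ks = rbf(x, x_star, ell, sf)                          # K_*
    Kss = rbf(x_star, x_star, ell, sf)                    # K_**
    A = np.linalg.solve(K0, Ks)                           # (K0 + sigma^2 I)^-1 K_*
    mean = A.T @ y
    cov = Kss - Ks.T @ A
    return mean, cov
```

In practice, ell, sf, and sigma would be tuned by maximizing the log marginal likelihood of Eq. (20), for example with a gradient-based optimizer.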

3.4 Model selection through physical validity-based bias correction

This section presents a bias correction method for the region where T + R > 1, in which there is no way to determine whether T or R causes the bias. Therefore, three cases are considered, as shown in Fig. 10: Case 1 assumes that all the bias is caused by R, Case 2 assumes that all the bias is caused by T, and Case 3 assumes that the bias is caused equally by T and R. After smoothing and GPR for each case, the best case is selected by the simplicity of the final model, based on the principle of Occam's razor [29,30,31], which selects the simplest model as the best. In the proposed process, the case whose final estimated refractive index has the highest linearity is selected.

Fig. 10 The overall process of data treatment

According to the process in Fig. 10, Eq. (1) is modified as

$$ n({\text{case}},\lambda ) = {\text{function}}({\text{GPR\_T}}({\text{case}},\lambda ),{\text{GPR\_R}}({\text{case}},\lambda )),\quad {\text{case}} = 1,2,3, $$
(21)

whose inputs are the noise-removed and statistically synthesized regression data obtained from the proposed process.
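A sketch of the selection step is given below; measuring linearity by the R² of a first-order polynomial fit is an assumption, since the paper does not state its exact linearity metric.

```python
import numpy as np

def linearity(lam, n):
    """R^2 of a degree-1 polynomial fit of n over the wavelength lam."""
    residual = n - np.polyval(np.polyfit(lam, n, 1), lam)
    return 1.0 - np.sum(residual ** 2) / np.sum((n - n.mean()) ** 2)

def select_case(lam, candidates):
    """candidates: dict mapping case id -> estimated n(lam) array."""
    return max(candidates, key=lambda c: linearity(lam, candidates[c]))
```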

4 Case study: refractive index estimation of water

This section illustrates the refractive index estimation results for water using (1) the mean of 30 measurements, (2) smoothing only, and (3) bias correction plus smoothing. GPR is applied to all smoothed data, and the most appropriate bias case among the three hypothetical cases of Fig. 10 is selected in the final step. The region segmentation results with the best split points obtained using the method in Sect. 3.1 are shown in Fig. 11. In the case of R, the best number of regions is four, as shown in Fig. 5. Figure 11a and b shows, for the second dataset, which has the smallest AICc value, the probability density functions (PDFs) of the third derivatives in the four regions and the corresponding segmentation results, respectively. Figure 11b also shows that the four regions are appropriately divided according to the fluctuation level. In addition, Fig. 11c shows that the regions of all datasets are divided at similar locations; the 31st dataset in this figure is the result of applying the Gaussian mixture method to the average of the 30 datasets. Since regions 1, 3, and 4 exceed the third-derivative criterion shown in Fig. 4, these regions are intensively smoothed using the proposed method in Sect. 3.2.

Fig. 11 a Distributions of 3rd derivative values in each region of the second data, b region segmentation of R according to the 3rd derivative of the second data, c region segmentation of R according to the 3rd derivative of all 30 data

The smoothing results for T and R are shown in Fig. 12a and b, respectively. Following the region segmentation result in Fig. 11, local intensive smoothing with the proposed criterion is applied to regions 1, 3, and 4 of R, while region 2 of R is smoothed using the existing criterion because of its low third-derivative value. T does not need intensive smoothing in any region, so the existing criterion is adopted throughout, as shown in Fig. 12a.

Fig. 12 Data smoothing results of a T and b R using the proposed smoothing in Sect. 3.2

For comparison, the Nadaraya–Watson method [1] and spline smoothing [7, 8] are applied in regions 1 and 4 of R, where the third-derivative value is high. Given that the total range of R is from 0.03 to 0.08, regions 1 and 4 need to be intensively smoothed because the range of R within these regions is very small. However, as shown in Fig. 13, spline smoothing and the Nadaraya–Watson method fail to smooth the given data, whereas the proposed method produces clear smoothing results because the user can adjust the smoothing intensity.

Fig. 13 Comparison of other methods in regions 1 and 4 of R

Figure 14 illustrates the refractive index estimation results for the three bias correction cases assumed in Fig. 10 after smoothing and GPR. Case 2 is selected as the most probable bias correction case since it shows the highest linearity, following the principle of Occam's razor.

Fig. 14 Refractive index estimation from bias correction according to three cases

Figure 15 shows that the error is reduced step by step, from smoothing only to bias correction with smoothing. Smoothing suppresses the noisy peaks and enables stable prediction, while bias correction gives the best result by relocating physically invalid shifted values close to their correct positions through model selection. Moreover, as shown in Fig. 15, the error estimated from a single measurement is very high in the long-wavelength domain; if the refractive index were estimated from only one measurement, the error would be very large, which demonstrates the importance of the statistical approach.

Fig. 15 Comparison of the refractive index estimation accuracy using mean of measurements, smoothing only, and bias correction and smoothing

The level of error reduction is quantified in Fig. 16. Bias correction with smoothing shows the largest improvement: compared with the result from the mean of 30 measurements, the absolute error is reduced from 3.90% to 1.95%.

Fig. 16 Estimation error of the refractive index of water using the proposed method

5 Conclusion

In this paper, a theoretically reliable process for measurement error treatment is proposed and applied to the refractive index estimation of water, a property that cannot be measured directly but can be estimated from other measurable properties such as the transmittance T and reflectance R. The measurement errors of T and R propagate to the refractive index, and an appropriate treatment of these errors enhances the resulting estimation accuracy. The proposed error treatment consists of (1) smoothing of spiky fluctuations, (2) synthesis of multiple measurements into a single prediction model using GPR, and (3) bias correction based on physical validity. In addition, a local intensive smoothing criterion with a data segmentation method based on the data fluctuation level is proposed. The process is validated through a case study of the refractive index estimation of water, whose true refractive index is known. The validation shows that the proposed method reduces the estimation error by 50% compared with the existing method, and that the regions requiring intensive smoothing are well selected, yielding clean smoothing results. The novelty of the proposed method is therefore its capability to smooth any type of data by analyzing the data characteristics and optimizing the smoothing parameters. A weakness of the proposed method is that users must empirically select Scve according to the characteristics of the data; an Scve selection method will be a topic for future study.