1 Introduction

The localization problem in communication systems has been a focus of research for many years. Node position estimation enables a large set of location-based services [1], boosts biological studies [2], and improves defense capabilities [3].

In a multipath scenario, the primary signal of an emitter and/or its reflections arrive at the sensors, producing a varying number of apparent targets. Non-line-of-sight (NLOS) conditions between the emitter and the sensors degrade position estimation in an outdoor scenario [4, 5], because only reflections can be used to estimate the emitter location.

However, it is still possible to use these signals for transmitter localization and tracking if multipath exploitation is performed and information about the environment is available [6]. In [7], multipath information is combined with image theory to locate the emitter using only one sensor. The hybrid approach presented in [8] uses multipath information, machine learning, and propagation simulation tools to enhance the performance of outdoor TDOA systems [9]. The work presented in [10] uses evolutionary algorithms to optimize target localization in a wireless sensor network. A two-stage position estimation system is presented in [11]: the first step ignores the additional path error and estimates an initial position; the second step includes the additional path error in the estimation problem and refines the result using a variable projection method. The research presented in [12] describes a method to estimate the node location using deep learning in outdoor environments, more specifically in a heterogeneous network environment.

In this work, we consider an area of 788 m × 736 m = 579,968 m². In this area, we need to locate a target (for example, a cell phone used by a hostile, non-collaborative emitter). Since we assume an adverse (sometimes hostile) area, it is not possible to collect real data for locations with NLOS. However, it is possible to obtain real data collected by a reconnaissance patrol at a known position, as well as simulated data based on ray-tracing for the region of interest. This scenario defines three datasets: the Simulation Dataset, which contains simulated data from the ray-tracing simulation; the Emitter Dataset, which contains real data from a reconnaissance patrol at known positions; and the Target Dataset, which contains real data from a target. We refer to this scenario as outdoor adverse position localization. To our knowledge, none of the related works has addressed this type of scenario.

Many machine learning methods can be used for node localization; however, random forest and gradient boosting are ensemble methods with several advantages that make them suitable for outdoor fingerprinting localization:

  • Random Forest uses decision trees, which tend to perform well [13]; typically, it achieves higher accuracy by combining trees built on different sets of attributes and samples. Random forest is known for fast training, fast matching, high classification accuracy and good performance on high-dimensional input data [14, 15]. An approach to position fingerprints using random forest can be found in [16]. An algorithm and a synthetic volume cross-correlation (VCC) function to extract the multipath features from TDOA measurements are available in [17].

  • Gradient Boosting is a boosting method built on weak classifiers. It was applied to indoor localization and performed equally or better than k-nearest neighbors (kNN) in [18]. It is based on weak learners (high bias, low variance). The base predictors, or weak learners, are shallow trees, sometimes as small as decision stumps (trees with only one level of decision). Boosting approaches reduce error mainly by reducing bias and by aggregating the output from many models.

Finally, both algorithms in [19, 20] can handle discrepancies in data quality, which cause over-fitting and under-representative observations. However, a radio-map is also required to allow multipath exploitation; since it is not always possible to build the dataset with real-world signals, a simulation tool can be used instead. Therefore, mismatches may occur between the dataset produced with simulation tools and the real-world measurements provided by sensors. Also, the machine learning engine can produce a few outliers in the position estimation, and parameter tuning alone cannot cope with intense noise in the real-world measurement setup.

For this reason, it is essential to improve the estimates even when dealing with the differences caused by noise and by the mismatch between synthetic and real-world data in outdoor hostile position localization.

The contributions of this work are listed in the following.

  1. Enhancement of the machine learning model with a small amount of real-world data, a common requirement for law-enforcement agencies and defense forces;

  2. The use of synthetic data and a limited amount of real-world data to train the model and to predict real-world data;

  3. Evaluation of the performance of gradient boosting and random forest models under mismatched and noisy measurements of multipath fingerprints in outdoor hostile position localization; and

  4. Guidelines to optimize the hyperparameters of machine learning algorithms applied to emitter position estimation in an outdoor hostile scenario, using mainly simulation data.

The rest of the paper is organized as follows: Sect. 2 presents the methods for localization systems using multipath exploitation and machine learning. Section 3 presents the main approaches to localization and the effect of NLOS between emitter and sensors, and explains how the ray-tracing simulation tool is used to produce the dataset employed by the machine learning algorithms. Section 4 presents the main aspects of multipath in localization systems. Section 5 presents how to apply machine learning to localization problems and describes the main aspects of the machine learning fingerprint framework, explaining the general characteristics of the random forest and gradient boosting methods. Section 6 details the proposed scenario and presents and analyses the results obtained for both random forest and gradient boosting; it also presents the results of two experiments that evaluate how the machine learning algorithms behave under noisy and mismatched measurements. Finally, Sect. 7 summarizes conclusions and perspectives for future work.

2 Machine learning fingerprints in localization problems

This section explains the method used to include multipath information in an outdoor localization system, using simulation data and signal processing of real measurements. A ray-tracing electromagnetic simulation tool can extract the channel impulse response (CIR) of the signal arriving at each TDOA sensor [21]. We used the AWE WinProp ray-tracing tool to extract CIR fingerprints. We created a fingerprint using the same design employed in [22] and used machine learning algorithms to predict the target position. The dataset was created with amplitudes and delays of the wavefront components arriving at the sensors. We used two other datasets from real-world data: (1) an emitter dataset corresponding to 1000 measurements using four sensors with the emitter at a fixed known position, and (2) a target dataset corresponding to measurements using four sensors with the emitter at different positions [22].

Algorithm 1 gives an overview of the machine learning training process: the dataset generated from the synthetic volume cross-correlation (VCC) produced by ray-tracing (simulation dataset) was the only data used to train the model, whereas the real-world data (emitter dataset) was used for the validation step (to set the hyperparameters). Once the hyperparameters are selected, both datasets are used to generate the final model.

figure a
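A minimal sketch of this training flow (Algorithm 1), assuming scikit-learn-style estimators and that the datasets are already loaded as NumPy arrays; the array and function names below are illustrative, not taken from the original implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_with_emitter_validation(X_sim, y_sim, X_emit, y_emit, n_estimators_grid):
    """Train on the simulation dataset, tune n_estimators on the emitter dataset,
    then refit the final model on both datasets."""
    best_n, best_err = None, np.inf
    for n in n_estimators_grid:
        model = RandomForestRegressor(n_estimators=n, n_jobs=-1, random_state=0)
        model.fit(X_sim, y_sim)  # training: simulation data only
        # validation: mean Euclidean distance error on the emitter dataset
        err = np.mean(np.linalg.norm(y_emit - model.predict(X_emit), axis=1))
        if err < best_err:
            best_n, best_err = n, err
    # final model: both datasets, best hyperparameter
    final = RandomForestRegressor(n_estimators=best_n, n_jobs=-1, random_state=0)
    final.fit(np.vstack([X_sim, X_emit]), np.vstack([y_sim, y_emit]))
    return final, best_n, best_err
```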

After generating a model with random forest and with gradient boosting, we used the target dataset to predict locations (\({\mathbf {p}}_t=[x_t,y_t]^{T}\)) and compared the performance of each model. Two simulations were performed: one evaluating the model response to the addition of Gaussian noise to the features of the target dataset, and another estimating the models' response when nullifying features of the target dataset.

Figure 1 presents the proposed framework. First, the machine learning engine is trained on the Simulation Dataset, the hyperparameters are optimized using the Emitter Dataset, and the prediction model is created. Then, both machine learning models (gradient boosting and random forest) are used to perform position prediction. In parallel, we add noise and create an artificial mismatch in the features of the target dataset, and we evaluate the prediction error as the Euclidean distance between the predicted and the actual positions.

Fig. 1
figure 1

Overview of the proposed approach

3 Using NLOS in localization systems

This section reviews existing approaches to deal with NLOS measurements in outdoor localization systems; most of them simply discard multipath-corrupted position estimates, while others include multipath fingerprints built only from real measurement data.

Traditional TDOA techniques are strongly affected by reflected and diffracted rays in the environment; when measurements contain NLOS paths, the location errors can be substantial. The performance of localization systems depends on the signal processing algorithms and on the channel characteristics, which are severely affected by NLOS rays [23].

There are several approaches to deal with multipath in localization systems; classical TDOA localization approaches have relied only on line-of-sight propagation, with degradation in performance whenever NLOS rays occur. Owing to antenna simplicity and small size, weight and power requirements (SWaP) [24], TOA and TDOA techniques are the most popular schemes for emitter localization in wireless networks.

NLOS multipath propagation introduces an inherent error into the localization because it alters the propagation paths and adds additional channel delays. The propagation information needed to deal with NLOS can be obtained through "radio frequency fingerprinting," either by performing extensive measurement campaigns or by using ray-tracing to simulate the environment.

Ray-tracing models the signal propagation by finding the ray propagation trajectories in a defined scenario, under a geometric optics approximation.

In [25], a multipath database characterization was created with a grid of possible emitter positions, where the angle of arrival (azimuth and elevation) and the time of arrival were recorded, giving a signature for each possible transmitter location in an area of interest that is populated via ray-tracing simulations. The received signal with NLOS components is compared with the values in this database, to estimate the emitter position with the same multipath information.

The approach described in [26] uses multipath characteristics of the scenario to build a database of NLOS rays, applying a clustering procedure to match real-world measurements with simulated ones and locate the emitter. In [27], the authors presented different localization fingerprints using received signal strength (RSS) and the k-nearest neighbour (kNN) algorithm. There are different types of fingerprints, as discussed in [28] and [29], where a performance improvement was observed when the position estimation was evaluated using the channel state information (CSI) of long term evolution (LTE) signals. In [30], several approaches to enhance indoor and outdoor multipath fingerprint localization systems in 5G and IoT are presented. A modified version of the random forest algorithm [22] is used to implement a localization system based on Wi-Fi access point information in indoor scenarios; the system gives the target position as the output of a classification process.

4 Ray tracing to exploit multipath in localization systems

The machine learning fingerprint framework presented in [16] used a ray-tracing simulation tool to extract multipath features for all possible emitter positions in the scenario. Sometimes, due to operational or physical restrictions, it is challenging to build a dataset with real-world measurements; thus some studies, like [8], use simulation tools. In [21], the term "ray tracing fingerprint" refers to a "radio map" of received signal strength (RSS) from a coverage prediction, which takes into account the output power of the emitter to perform the position estimation [31].

Figure 2 shows the typical performance of a TDOA localization system in an outdoor scenario, as described in [16]. The estimates differ considerably from the actual emitter position, with multipath and NLOS effects severely degrading system performance.

Fig. 2
figure 2

Performance of a Localization System in Outdoor Scenario [16]

Figure 3 shows that the multipath fingerprints also give information about the propagation mechanism (reflection, diffraction or scattering). The simulation output traces each path from each point of a scenario grid, describing the amplitude, delay (\(\alpha _i,\tau _i\)), reflection points and angular information of each ray. These paths represent the emitter-sensor interactions. Therefore, the ray tracing (RT) software gives information about each ray path, describing the edges and walls touched by the rays on the emitter-sensor path.

Fig. 3
figure 3

Extraction of CIR fingerprints using ray tracing

The performance is highly dependent on the level of detail in the scenario description. In practical outdoor implementations, buildings are represented only by simple structures, without details such as windows and doors. For a suburban outdoor scenario with simple buildings, ray-tracing still gives reasonable information about the main specular components of the propagation channel.

A ray tracing simulation provides the site-specific channel impulse response, which means that, as soon as the position of each sensor is defined, it is possible to obtain the multipath information for each point of the scenario.

The specular component paths, including all reflection and diffraction points, are available at the end of the ray-tracing simulation. Depending on the desired number of ray interactions, image theory allows identifying the reflection points and the virtual nodes, so that all possible rays in the simulation domain can be estimated.

Following the approach of [32], the ray-tracing scenario can be decomposed into walls and edges; the “view tree” represents the emitter-sensor interactions, and it is the basis for the visibility matrix, where each interaction is considered a layer in a multilayer scheme as depicted in Fig. 4.

Fig. 4
figure 4

Visibility tree from building walls and edges in ray tracing simulation, adapted from [32]

With the output file of the ray tracing software, it is possible to identify, for a given position, the main reflectors off which the rays bounce before arriving at the sensor.

The inputs of the localization problem are the positions of the sensors, usually known, the received signal and the scenario description or characterization. With this information, it is possible to improve the performance of the localization system by adding a multipath fingerprint using NLOS patterns.

At this point, approximations, not only in the scenario description but also in the RT multipath information, can play an essential role in the machine learning framework. For this reason, the ray description should be good enough to establish the model, but not so precise that it loses generalization. Generalization refers to how well the target function learned from the simulation training data transfers to unseen data.

The data model and the position estimation problem are a generalization of the approach introduced in [33]:

$$\mathbf{r }=\mathbf{f }(\mathbf{x })+\mathbf{n },$$
(1)

where \(\mathbf{r }\) is the measurements vector, \(\mathbf{x }\) is the vector with the unknown source position that we want to estimate, \(\mathbf{f }(\mathbf{x })\) is a non-linear function that maps the position vector into the measurements, and \(\mathbf{n }\) is a zero-mean noise vector that corrupts the measurements. TDOA localization is carried out using the range differences, assuming that the received signals are synchronized.

When the source emits a signal at the (unknown) instant \(t_0\), the lth sensor receives the signal at time \(t_l\), \(l=1,2,\ldots ,L\). It is possible to obtain \(L(L-1)/2\) distinct TDOAs. With four sensors, there are six delays, of which only three are usually employed: \(\tau _{21},\tau _{31}\), and \(\tau _{41}\). Time differences and range differences are related by a constant, the speed of light. Using the range difference formulation with the TDOA values, we have:

$$r_{{\rm TDOA},l}=d_{l,1}+n_{{\rm TDOA},l},\quad l=2,3,\ldots , L,$$
(2)

where \(d_{l,1}=d_l-d_1\) and \(n_{{\rm TDOA},l}\) is the error in the lth range difference. It is possible to use the following compact matrix notation:

$$\begin{aligned}\begin{array}{l} {\mathbf{r }_{{\rm TDOA}}} = {\left[ {{r_{{\rm TDOA},2}},{r_{{\rm TDOA},3}} \ldots {r_{{\rm TDOA},L}}} \right] ^{T}}\\ {\mathbf{n }_{{\rm TDOA}}} = {\left[ {{n_{{\rm TDOA},2}},{n_{{\rm TDOA},3}} \ldots {n_{{\rm TDOA},L}}} \right] ^{T}} \end{array} \end{aligned}$$
$$\begin{aligned} \begin{aligned} \mathbf{f }_{{\rm TDOA}}(X)=&\left[ {\begin{array}{*{20}{c}} {\sqrt{{{\left( {x - {x_2}} \right) }^2} + {{\left( {y - {y_2}} \right) }^2}} - \sqrt{{{\left( {x - {x_1}} \right) }^2} + {{\left( {y - {y_1}} \right) }^2}} }\\ {\sqrt{{{\left( {x - {x_3}} \right) }^2} + {{\left( {y - {y_3}} \right) }^2}} - \sqrt{{{\left( {x - {x_1}} \right) }^2} + {{\left( {y - {y_1}} \right) }^2}} }\\ \vdots \\ {\sqrt{{{\left( {x - {x_L}} \right) }^2} + {{\left( {y - {y_L}} \right) }^2}} - \sqrt{{{\left( {x - {x_1}} \right) }^2} + {{\left( {y - {y_1}} \right) }^2}} } \end{array}} \right] \end{aligned} \end{aligned}$$
(3)
Assuming the noise is zero-mean Gaussian with covariance matrix \(\mathbf{C }_{{\rm TDOA}}\), the probability density of the TDOA measurement vector is
$$\begin{aligned} \begin{aligned} p\left( {{\mathbf{r }_{{\rm TDOA}}}} \right) =&\frac{1}{{{{\left( {2\pi } \right) }^{(L - 1)/2}}{{\left| {{\mathbf{C }_{{\rm TDOA}}}} \right| }^{1/2}}}} \times \\&\exp \left( { - \frac{1}{2}{{\left( {{\mathbf{r }_{{\rm TDOA}}} - {\mathbf{d }_1}} \right) }^T}\mathbf{C }_{{\rm TDOA}}^{ - 1}\left( {{\mathbf{r }_{{\rm TDOA}}} - {\mathbf{d }_1}} \right) } \right) \end{aligned} \end{aligned}$$
(4)

The position estimation is, therefore, a process of dealing with the nonlinear formulation of vector \(\mathbf{f }\), using either a least squares (LS) or a weighted least squares (WLS) formulation to evaluate the error between the estimated and the actual positions,

$$\begin{aligned} \mathbf{e }_{{\rm nonlinear}}=\mathbf{e }_{{\rm noise}}=\mathbf{r }-\mathbf{f }(\tilde{\mathbf{x }}). \end{aligned}$$
(5)

In case of multipath, time differences or range differences present an extra error caused by the NLOS signals:

$$\begin{aligned} \mathbf{e }_{{\rm nonlinear}}=\mathbf{e }_{{\rm noise}}+\mathbf{e }_{{\rm NLOS}}. \end{aligned}$$
(6)

The effects of the multipath are included in the signal data model and in the Cramér–Rao bound, presented in [23], as an "extra error" in the estimation. TDOA estimation in an NLOS scenario is, therefore, a standard system with extra noise, which leads to inaccurate position estimates.
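For concreteness, a small sketch of Eqs. (1)–(3) and the residual of Eq. (5), with arbitrary sensor coordinates chosen only for illustration:

```python
import numpy as np

def f_tdoa(p, sensors):
    """Range differences d_{l,1} = d_l - d_1 for a candidate position p = [x, y] (Eq. 3)."""
    d = np.linalg.norm(sensors - p, axis=1)   # distance from p to each sensor
    return d[1:] - d[0]                       # L-1 range differences w.r.t. sensor 1

# four sensors at the corners of the region (illustrative positions, in metres)
sensors = np.array([[0.0, 0.0], [788.0, 0.0], [0.0, 736.0], [788.0, 736.0]])
p_true = np.array([350.0, 420.0])                                      # true emitter position
r_tdoa = f_tdoa(p_true, sensors) + np.random.normal(0.0, 5.0, size=3)  # noisy measurements (Eq. 2)
p_candidate = np.array([340.0, 430.0])                                 # some candidate estimate
e_nonlinear = r_tdoa - f_tdoa(p_candidate, sensors)                    # residual of Eq. (5)
```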

5 Machine learning algorithms

Machine learning algorithms approximate the position estimate as a regression function f. This function f should map the input variables (\(\alpha _i,\tau _i\)) to an output variable \({\mathbf {p}}_t=[x_t,y_t]^T\), the target position of an observation,

$$\begin{aligned} {\mathbf {p}}_t=f(\alpha _i,\tau _i). \end{aligned}$$
(7)

Assuming there is little fluctuation in the z coordinate, we use only x and y in our simulations.

Several machine learning algorithms can be used to solve the proposed problem. We focus on meta-algorithms based on ensemble methods, which cover different areas of the problem and, through a voting scheme, tend to provide better solutions. Originally developed to reduce variance and thus improve accuracy, ensemble methods have since been successfully used to address a variety of machine learning problems. We have selected two well-known ensemble-based machine learning algorithms: random forest and gradient boosting.

Random Forest was introduced in [34], based on an earlier work described in [35], and uses decision trees and bagging [36]. Random forest can be used for either categorical labels (classification) or continuous labels (regression).

Bootstrap aggregating, or bagging, is a method for fitting multiple versions of a prediction model and then combining them into an aggregated prediction (ensemble model) [36]. In bagging, b bootstrap copies of the original training data are created, the regression or classification algorithm is applied to each bootstrap sample and, in the regression context, new predictions are made by averaging the predictions of the individual regressors. The bagged prediction \({\tilde{f}}_{{\rm bag}}\) is given as

$$\begin{aligned} {\tilde{f}}_{{\rm bag}}= {\tilde{f}}_{1}(X) + {\tilde{f}}_{2}(X) + {\tilde{f}}_{3}(X) +\cdots + {\tilde{f}}_{b}(X), \end{aligned}$$
(8)

where X is the data for which we want to generate a prediction and \({\tilde{f}}_{1}(X),{\tilde{f}}_{2}(X), \ldots ,{\tilde{f}}_{b}(X)\) are the predictions of the individual regressors. Because of the aggregation process, bagging effectively reduces the variance of an individual regressor, but it does not always improve upon an individual base learner. Since the base learners are completely independent of one another, they can be trained in parallel. Figure 5 shows a bagging example in which random subsets of the original dataset are drawn with replacement. Random forest's base estimators are built on subsets of both samples and features from the original dataset.
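A toy sketch of this bagging procedure for the regression case, assuming scikit-learn decision trees (the helper name is ours):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_query, b=50, seed=0):
    """Fit b regressors on bootstrap copies of the training data and
    aggregate (average) their predictions, as described around Eq. (8)."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(b):
        idx = rng.integers(0, n, size=n)          # bootstrap sample (drawn with replacement)
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_query))
    return np.mean(preds, axis=0)                 # aggregated (bagged) prediction
```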

Fig. 5
figure 5

Bagging example (based on [36])

According to [37], gradient boosting is a class of machine learning methods based on the idea that a combination of simple classifiers, obtained by a weak learner, can perform better than any of the simple classifiers alone. A weak learner is a learning algorithm capable of producing classifiers with a probability of error strictly (but only slightly) lower than that of random guessing. The same idea extends to the regression task. Gradient boosting builds a model from weak learners (typically decision trees) in a stage-wise fashion like other boosting methods, but it identifies the shortcomings of the weak learners through the gradients of the loss function. In most cases, the decision trees used in gradient boosting are composed of one internal node (the root) immediately connected to the terminal nodes or leaves. These small decision trees are called stumps.

The main idea of boosting is to add new models to the ensemble sequentially; it approaches the bias-variance trade-off by starting with a weak learner and sequentially boosting its performance by building new trees (from weak learners), where each new tree in the sequence tries to fix the biggest mistakes of the previous one. Figure 6 shows this approach.

Fig. 6
figure 6

Ensemble sequentially in boosting (source [38])

Gradient boosting may use a decision tree as weak learner, and since the training of each tree depends on the results of the previous one, it is not possible to parallelize the training process; it is, therefore, a sequential process.
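A toy illustration of this sequential residual fitting for a single output coordinate, assuming squared-error loss and decision stumps as weak learners (a didactic sketch, not the full XGBoost algorithm):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, n_estimators=100, learning_rate=0.1):
    """Each stump is fitted to the residuals (negative gradient of the
    squared-error loss) left by the ensemble built so far."""
    f0 = float(np.mean(y))                    # initial constant prediction
    pred = np.full(len(y), f0)
    stumps = []
    for _ in range(n_estimators):
        residual = y - pred                   # where the previous trees made the biggest mistakes
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        pred += learning_rate * stump.predict(X)
        stumps.append(stump)
    return f0, stumps

def boost_predict(X, f0, stumps, learning_rate=0.1):
    return f0 + learning_rate * sum(s.predict(X) for s in stumps)
```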

Random forest models mostly depend on the number of estimators, i.e. the number of trees used to fit the model. We have chosen the same approach used in [16], keeping most of the hyperparameters at their default values and varying the number of estimators. By default, random forest trains fully grown trees, which can be done in parallel; the model size is therefore limited by the available computer memory.

Gradient boosting models based on decision trees also mostly depend on the number of estimators, i.e. the number of trees used in the fitting process. But since gradient boosting only creates weak learners, we can limit the maximum depth of the trees; increasing the maximum depth would make the model more complex and more likely to overfit.
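A hedged sketch of how the two models could be configured with scikit-learn and XGBoost (only the numbers of estimators come from this work; the other settings reflect library defaults and the stump-based weak learners discussed above):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

# Random forest: fully grown trees (max_depth=None by default), trained in parallel
rf = RandomForestRegressor(n_estimators=80, n_jobs=-1, random_state=0)

# Gradient boosting: shallow trees (weak learners), trained sequentially;
# wrapped in MultiOutputRegressor so it can predict the 2-D position [x, y]
gb = MultiOutputRegressor(XGBRegressor(n_estimators=100, max_depth=1, learning_rate=0.1))
```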

Random Forest is especially attractive when using noisy real-world data, while gradient boosting is more sensitive to overfitting if the data is noisy.

6 Results and discussion

In this section, we present the results for two scenarios: a simulation scenario, which uses only the simulation dataset, and a real scenario, which uses both simulation and real-world datasets. Then, results of the noise and mismatch experiments in the real-world scenario are presented.

6.1 Datasets

In this work, we have three different datasets available.

  • The simulation dataset: 127,000 observations, each consisting of 20 pairs of amplitude and delay (\(\alpha _i,\tau _i\)) for each of the 4 simulated receivers, summing up to 160 features plus the respective localizations \({\mathbf {p}}\) (labels).

  • The target dataset: 2973 real-world observations, each consisting of 20 pairs of amplitude and delay (\(\alpha _i,\tau _i\)) for each of the 4 receivers, summing up to 160 features plus the respective localizations \({\mathbf {p}}\) (labels).

  • The emitter dataset: 1000 measurements using 4 sensors with the emitter at a known position; each observation consists of 20 pairs of amplitude and delay (\(\alpha _i,\tau _i\)) for each of the 4 receivers, summing up to 160 features plus the respective localizations \({\mathbf {p}}\) (labels).

6.2 Simulation and real scenarios

We have set up two main scenarios to apply the machine learning methods and compare the results in real and simulated environments:

  1. Simulation scenario: in this scenario, we used only the simulation dataset for the training, validation and test processes of both the random forest and gradient boosting models. The training dataset, consisting of 64% of the samples from the simulation dataset, is used to fit the parameters. The validation dataset, consisting of 16% of the samples, is used to tune the hyperparameters. Finally, the test dataset, consisting of 20% of the samples, is used to assess the performance (see the split sketch after this list). The goal was to understand how the machine learning algorithms would perform under ideal conditions. Figure 7 presents the machine learning training process for the simulation scenario;

  2. Real-world scenario: in this scenario, we used the simulation dataset and the emitter dataset during the training and validation processes to optimize the machine learning hyperparameters, and real-world data from the target dataset to evaluate the models' performance. The main goal was to estimate the performance of both random forest and gradient boosting under real-world conditions, where the ray-tracing fingerprint (simulation dataset) generates most of the data and a small amount of real-world data from a source at a fixed, known location (emitter dataset) is used for training and validation. Finally, we used real-world data (target dataset) for performance evaluation. Figure 8 presents the machine learning training process for the real scenario. Using the real scenario, we can evaluate how both random forest and gradient boosting behave in a noisy environment and under mismatched measurements where a subset of features is nullified.
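A minimal sketch of the 64%/16%/20% split used in the simulation scenario (item 1 above), assuming the simulation dataset is held in arrays X_sim (n × 160 features) and y_sim (n × 2 positions):

```python
from sklearn.model_selection import train_test_split

# 80/20 split, then 80/20 again on the remainder: 64% train, 16% validation, 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(X_sim, y_sim, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.20, random_state=0)
```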

Fig. 7
figure 7

Training process using only simulation dataset

Fig. 8
figure 8

Training process using simulation dataset and emitter dataset

The loss function used to measure the model performance in both scenarios was the mean squared error (MSE) between the actual and the predicted localization; Eq. (9) presents its definition:

$$\begin{aligned} {\rm MSE} = \frac{1}{n}\sum _{i=1}^{n} \big \Vert {\mathbf {p}}_i - \tilde{{\mathbf {p}}}_i\big \Vert ^2, \end{aligned}$$
(9)

where \({\mathbf {p}}_i\) is the actual localization of emitter i and \(\tilde{{\mathbf {p}}}_i\) is the predicted localization of emitter i.
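In code, Eq. (9) and the mean Euclidean distance error reported in the following subsections could be written as follows (helper names are ours):

```python
import numpy as np

def mse(p_true, p_pred):
    # Eq. (9): mean squared Euclidean distance between actual and predicted positions
    return np.mean(np.sum((p_true - p_pred) ** 2, axis=1))

def mean_distance_error(p_true, p_pred):
    # mean Euclidean distance in metres, as reported in Sects. 6.3 and 6.4
    return np.mean(np.linalg.norm(p_true - p_pred, axis=1))
```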

After hyperparameter tuning, the performance was evaluated for both scenarios. Figure 9 presents (a) how the performance is evaluated using the test subset of the simulation dataset in the simulation scenario, and (b) how it is evaluated using the target dataset in the real-world scenario.

Fig. 9
figure 9

Estimating model’s error on target dataset

6.3 Results from simulation scenario

Since we want to minimize the mean distance error between the predicted and the actual localization using the simulation dataset, we varied the number of base estimators for both random forest and gradient boosting using the training and validation datasets.

Figure 10 depicts the relationship between the number of base estimators (from 10 to 100) and the mean distance error achieved on the validation dataset with the random forest model. We chose 80 estimators. This configuration does not yield the smallest error, but the difference between 80 and 100 base estimators is about 0.2 m, which is small enough to be ignored, while the time required to generate a model with 80 base estimators is around 70% of the time required with 100 base estimators. The mean distance error on the test dataset using 80 base estimators was 1.38 m.

Fig. 10
figure 10

Estimating model’s error on test dataset for random forest

Figure 11 presents how the mean distance error between predictions and true values on the validation dataset varies for 10 to 400 base estimators with XGBoost. We chose 100 estimators, with a mean distance error of 3 m, as a trade-off between the number of estimators and the error. Using 400 estimators, for instance, yields a mean distance error of 2.5 m, but an XGBoost model with 400 estimators takes more than 8 times longer to generate than one with 100 base estimators. The mean distance error on the test dataset using the chosen configuration was 2.52 m.

Fig. 11
figure 11

Estimating model’s error on test dataset for XGBoost

The results of the simulation scenario show that both random forest and gradient boosting (implemented as XGBoost) obtain suitable performance, evaluated as the mean distance error from the true value. These results, 1.38 m for random forest and 2.52 m for gradient boosting, are consistent with results that used only simulation data, as found in [25, 39,40,41].

The conclusion is that, when using only simulation data, both random forest and XGBoost obtain very good results. These results would apply when real data are available for training, validation and test. But, as presented, in outdoor hostile position localization there are no real data available for the region of interest.

6.4 Results from the real scenario

In the real scenario, the simulation dataset was used to train the algorithm and the emitter dataset was used for validation (hyperparameter tuning) and, later, when generating the final model. For the random forest model, we used a range of base estimators from 10 to 1200; the best trade-off between mean distance error and number of base estimators was obtained with 80 estimators and a mean distance error of 120.98 m. Using 1200 base estimators yields a mean distance error of 120.49 m, a very modest improvement over 80 estimators. Figure 12 presents the mean distance error as a function of the number of base estimators for random forest.

Fig. 12
figure 12

Estimating model’s error on emitter dataset for random forest

For XGBoost, the best mean distance error, 87.83 m, was obtained using 80 base estimators. Figure 13 presents the mean distance error as a function of the number of base estimators for XGBoost.

Fig. 13
figure 13

Estimating model’s error on emitter dataset for XGBoost

Having defined the number of base estimators for both random forest and gradient boosting (XGBoost), a new model was trained using both the simulation dataset and the emitter dataset. The generated model was then used to estimate the target position \({\mathbf {p}}_t\) from real-world data in the target dataset. For random forest, the mean distance error obtained on the target dataset was 148.51 ± 115.84 m; for XGBoost, it was 145.63 ± 121.20 m. If only simulation data were used for training and validation and the resulting model were applied to the target dataset, the mean distance error would be 237.54 ± 138.78 m for the random forest model and 219.31 ± 134.95 m for the XGBoost model. Thus, by combining simulation data (simulation dataset) with real data from one known position (emitter dataset), we observed a mean distance error improvement of 89 m for random forest and 74 m for XGBoost.

As expected, the mean distance errors in the real scenario are larger than those in the simulation scenario. Nevertheless, by using just 1000 measurements from a known position (emitter dataset) in the training process, we obtained a considerable improvement over using only simulation data to predict real data. Moreover, considering the simulation dataset's total area (788 m × 736 m = 579,968 m²), the random forest model obtained a mean distance error of 148.51 m, which defines an area \(A = \pi r^2\) of 69,288.5 m², or 11.94% of the total area. For the XGBoost model, the mean distance error of 145.63 m defines an area of 66,627.2 m², or 11.48% of the total area. In both cases, therefore, the target's search area is reduced by a factor of about 9 relative to the total area.
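The search-area percentages above follow directly from the mean errors; a quick check:

```python
import math

total_area = 788 * 736                        # 579,968 m^2 (simulation region)
for name, err in [("random forest", 148.51), ("XGBoost", 145.63)]:
    search_area = math.pi * err ** 2          # circle of radius = mean distance error
    print(f"{name}: {search_area:,.1f} m^2 = {100 * search_area / total_area:.2f}% of the total area")
```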

In both cases, about 46% of all predictions had a distance error of less than 75 m, which would reduce the target's search area by a factor of almost 33 relative to the total area. Figure 14 shows an example of a 75 m radius search area bounded by the simulation region (black-edged rectangle). The red region is the circle with a radius of 75 m, and the red point in its center is the actual target position, usually not known. When using only simulation data to predict real data, 46% of all predictions fell within 127 m, represented by the outer blue circle.

Fig. 14
figure 14

An example of a 75 m radius search area from the proposed method and a 127 m radius search area when using just simulation data to predict real data

Considering the models' time and CPU performance in the real scenario, we make the following observations:

  • The random forest algorithm clearly used many more computing resources; on the other hand, it allows parallel execution because the individual decision trees are independent. A random forest model with 80 estimators used more than 6 GB of RAM during training and took 30 min to be created using Google Colaboratory's basic default CPU configuration;

  • Gradient boosting uses less RAM and, in this particular case, takes less time to create the model (mostly because the base estimators are stumps). The gradient boosting model with 80 base estimators used about 2.5 GB of RAM and took 10 min to generate on Google Colaboratory's basic default CPU configuration.

Therefore, the gradient boosting implementation (XGBoost) achieved better time and CPU performance. The main reason seems to be that the number of base estimators in both models was not large enough for random forest's parallel training to surpass gradient boosting's sequential training.

6.4.1 Experiment with noisy features

To evaluate how the random forest and XGBoost algorithms behave in the outdoor hostile position localization scenario under added noise, we added zero-mean Gaussian noise with standard deviation \(\gamma _i\) to each of the 160 features of the target dataset. Since each feature has its own mean and standard deviation, we define the standard deviation \(\gamma _i\) as

$$\begin{aligned} \gamma _i=level\times \sigma _i, \end{aligned}$$
(10)

where \(\sigma _i\) is the standard deviation of feature i, with i ranging from 1 to 160. Since there are 160 features, the noise applied to each feature is proportional (via the factor level) to that feature's own standard deviation.

Thus, for each feature i, the generated Gaussian noise will have the following distribution:

$$\begin{aligned} P(x) = \frac{1}{{\gamma _i \sqrt{2\pi } }}e^{{{ - x^2 } / {2\gamma _i^2 }}}. \end{aligned}$$
(11)

In this experiment, the factor level ranged from 0.1 to 1.0 in increments of 0.05, and from 1 to 9 in increments of 1. Figure 15 shows the noise experiment process.
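A sketch of this noise-injection step, assuming the 160 target-dataset features are stored in a NumPy array (names are illustrative):

```python
import numpy as np

def add_proportional_noise(X, level, seed=0):
    """Add zero-mean Gaussian noise with standard deviation gamma_i = level * sigma_i
    to each feature i, following Eqs. (10) and (11)."""
    rng = np.random.default_rng(seed)
    sigma = X.std(axis=0)                                   # per-feature standard deviation sigma_i
    return X + rng.normal(0.0, 1.0, size=X.shape) * (level * sigma)

# range of the factor 'level' used in the experiment
levels = np.concatenate([np.arange(0.1, 1.0, 0.05), np.arange(1, 10)])
```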

Fig. 15
figure 15

Noise experiment process

Figure 16 shows the results of this experiment. The main red line is the average Euclidean distance error between the actual and predicted positions of all 2973 observations of the target dataset using random forest. Denoting by \(\mu\) the mean and by \(\sigma\) the standard deviation of the random forest positioning errors, the red area spans from \(\mu -\sigma\) to \(\mu +\sigma\). The main green line is the average Euclidean distance error between actual and predicted positions of all 2973 observations using gradient boosting, and the green area spans its \(\mu -\sigma\) to \(\mu +\sigma\) limits.

Evaluating both models with features corrupted by the same type of noise, random forest and gradient boosting obtained about the same results when the noise level was low. Random forest obtained slightly better results when the noise level was greater than 2. The first important finding, then, is that in a noisier environment one should prefer random forest models. This is expected: random forest uses bagging as an ensemble method, and although the predictions of a single tree are highly sensitive to noise, the average of many trees is not, since the trees are not correlated.

Fig. 16
figure 16

Noise effects over euclidean distance error for random forest model and gradient boosting model

Still from Fig. 16, we notice that the standard deviations of the Euclidean distance errors are high, which means that the methods are not robust to noise. The main cause of these high values is that both models were generated mainly from simulation data and evaluated on real-world data (a different data distribution). Nevertheless, several locations were obtained with errors lower than 50 m. In terms of the localization task, a 50-m error may be considered satisfactory, because the model was created using synthetic data and optimized with only a small amount of real-world data.

Fig. 17
figure 17

Position estimation experiment (with noise level equal to 2) where the error was less than 50 m for a random forest (blue) and b gradient boosting (red); the yellow line shows the positions assumed by the target

Figure 17 presents the performance of both the (a) random forest and (b) gradient boosting models when the noise level is set to 2. The yellow points are all positions from the target dataset. The blue points are position estimates with random forest error below 50 m, adding up to 172 points, and the red points are position estimates with gradient boosting error below 50 m, adding up to 133 points.

The second finding is that, considering all target observations with error less than 50 m, there is no clear zone where random forest is better than gradient boosting or vice versa.

6.4.2 Mismatching experiment

In the mismatching experiment, a subset of features is nullified and the performance of the model is evaluated. The noise level selected for this experiment was 2, the value at which both models have comparable errors when the emitter dataset is used in the training process. Figure 18 presents this process.

Fig. 18
figure 18

Mismatching experiment process

Since the 4 sensors receive 20 rays (parameter pairs of amplitude and delay, \(\alpha _i,\tau _i\)), at each step we selected the n most significant pairs of features, with n ranging from 1 to 20, and nullified the remaining ones (a sketch of this step follows). Figure 19 presents the results. The main red line is the average Euclidean distance error between actual and predicted positions of all 2973 observations using random forest; the red area spans from \(\mu -\sigma\) to \(\mu +\sigma\). The main green line is the average Euclidean distance error between actual and predicted positions of all 2973 observations using gradient boosting; the green area spans the region between \(\mu -\sigma\) and \(\mu +\sigma\).
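A sketch of this feature-nullification step, assuming the 160 features are ordered sensor by sensor as consecutive (amplitude, delay) pairs already sorted by significance (this ordering is our assumption for illustration):

```python
import numpy as np

def keep_n_pairs(X, n, n_sensors=4, n_pairs=20):
    """Keep only the first n (amplitude, delay) pairs per sensor and nullify the rest."""
    X_out = np.zeros_like(X)
    for s in range(n_sensors):
        start = s * n_pairs * 2                        # first feature of sensor s
        X_out[:, start:start + 2 * n] = X[:, start:start + 2 * n]
    return X_out
```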

Fig. 19
figure 19

Mismatching effects over euclidean distance error for random forest model and for gradient boosting model

Both models dealt well with mismatching: their performance (mean error) did not vary significantly as the number of available features changed. One reason is that many of the features in the simulation dataset are already zero.

Figure 20 presents the performance of both the random forest and gradient boosting models when the number of feature pairs is set to 1 (one amplitude and one delay), which is the worst case for the NLOS scenario. The yellow points are from the target dataset. The blue points are positions where the random forest error was less than 50 m, and the red points are locations where the gradient boosting error was less than 50 m.

Fig. 20
figure 20

Position estimation experiment where the error was less than 50 m for a random forest (blue) and b gradient boosting (red) for the mismatching experiment

Another finding, similar to the noise experiment, is that when considering all target observations with error less than 50 m, some localizations are better modelled by random forest and others by gradient boosting; again, there is no clear region where one is better than the other. Finally, both random forest and XGBoost suffered almost no loss from the decrease in the number of available parameter pairs, indicating that most of the information is in the first amplitude-delay pair (\(\alpha _1,\tau _1\)) of the 4 sensors.

7 Conclusion

This paper presents a comparison of the random forest and gradient boosting algorithms when employed to enhance a kernel-based machine learning localization scheme using TDOA fingerprinting in an outdoor hostile position localization problem.

The results presented herein can serve as guidelines in similar problems for adjusting the number of estimators, for choosing which machine learning implementation is more suitable, and for deciding how to use it when there is limited real data for the area of interest, as in the outdoor hostile position localization problem.

Fingerprinting localization methods can deal with measurement errors by continuously improving the estimation based on the real-world measurements available in the area of interest. In this work, we used real-world data from a fixed, known location to help improve the machine learning performance when estimating positions from real-world measurements. To validate the proposed algorithms, we evaluated them in two scenarios: the simulation scenario, which uses only simulation data, and the real scenario, which uses both simulation data and a limited amount of real-world data to predict actual positions from real-world data. The latter scenario is much more challenging.

In the real scenario, which we call outdoor hostile position localization, using only 4 passive sensors and a model trained on simulated data from an area of 579,968 m² plus real-world data from an emitter at a fixed, known location, it was possible to estimate the position of a moving target \({\mathbf {p}}_t=[x_t,y_t]^{{T}}\) while reducing the search area by a factor of about 9 for both random forest and XGBoost. Moreover, in almost 50% of the cases the reduction was about 33 times, which is very considerable. Hence, the proposed algorithms and method make our approach very appealing for practical applications in NLOS propagation environments.

Geo-information is a promising field in signal processing for localization, because it can reduce the error of the radio-map created by simulation tools. For future work, we consider building a data fusion engine combining a cartographic database with signal processing, and using optimization tools to deal with the raw information produced by multipath reflection in a TDOA system deployment.