1 Introduction

In this work we are interested in the following problem: given a dataset \({\mathcal {R}}=\{(t_i,y_i), i=1,\ldots ,r\}\) of points in \({\mathbb {R}}^m\times {\mathbb {R}}\), resulting from some experiment, we want to find a model \(\varphi :{\mathbb {R}}^{m}\rightarrow {\mathbb {R}}\) that fits this dataset free from the influence of possible outliers. More precisely, given a model \(\varphi (t)\) depending on n parameters (\(x\in {\mathbb {R}}^n\)), that is, \(\varphi (t)=\phi (x,t)\), we want to find a set \({\mathcal {P}}\subset {\mathcal {R}}\) with p elements and parameters \({\overline{x}}\in {\mathbb {R}}^{n}\) such that \(\phi ({\overline{x}},t_i)\approx y_i\), \(\forall (t_i,y_i)\in {\mathcal {P}}\) (in the least squares sense). The \(r-p\) remaining elements in \({\mathcal {R}}-{\mathcal {P}}\) are the possible outliers.

There are several definitions of what an outlier is. The definition that best suits the present work concerns errors in \(y_i\), that is, gross errors in the evaluation of some measure for a given and reasonably precise \(t_i\). This is somewhat different from the geometric interpretation of outliers, in the sense that the point \((t_i, y_i)\) is (geometrically) very far from the graph of the function that one wants to find. Typically, in our tests, outliers arise from measurement errors in some experiment. As a consequence, their presence may contaminate the obtained model and, therefore, deteriorate or limit its use. There are several strategies to handle the presence of outliers in datasets [9, 12, 22, 23]. In a more recent approach, as highlighted by [13] and references therein, techniques based on machine learning are exploited to deal with large amounts of data, lack of models and categorical variables.

In order to obtain a fitting model free from the influence of outliers, we use an approach based on low order-value optimization (LOVO) [5], which is defined as follows. Consider \(R_i:{\mathbb {R}}^n \rightarrow {\mathbb {R}}\), \(i=1,\ldots ,r\). Given \(x \in {\mathbb {R}}^n\), we can sort \(\{R_i(x), i=1,\ldots ,r\}\) in ascending order:

$$\begin{aligned} R_{i_1(x)}(x) \le R_{i_2(x)}(x)\le \cdots \le R_{i_k(x)}(x) \le \dots \le R_{i_r(x)}(x), \end{aligned}$$
(1)

where \(i_k(x)\) is the index of the k-th smallest element of that set, for the given value of x. Given \(0<p\le r\), the LOVO function is defined by

$$\begin{aligned} S_p(x)=\sum _{k=1}^{p} R_{i_k(x)}(x) \end{aligned}$$
(2)

and the LOVO problem is

$$\begin{aligned} \min S_p(x). \end{aligned}$$
(3)

Essentially, this problem can be seen as a generalization of nonlinear least squares, as elucidated in [5]. To support this claim, we can consider \(\varphi (t)=\phi (x,t)\) as the model selected for fitting and define \(R_i(x)= \dfrac{1}{2}(F_i(x))^2\), where \(F_i(x)=y_i-\phi (x,t_i), i=1,\ldots ,r\). Thus, we have the particular LOVO problem

$$\begin{aligned} \min S_p(x)=\min \sum _{k=1}^{p}R_{i_k(x)}(x)= \min \sum _{k=1}^p \frac{1}{2}\left( F_{i_k(x)}(x)\right) ^2. \end{aligned}$$
(4)

Each \(R_i\) is a residual function. Consequently, if we take \(p=r\), the LOVO problem is the classical least squares problem. When \(p<r\), the parameter \({\bar{x}}\in {\mathbb {R}}^n\) that solves (4) defines a model \(\phi ({\bar{x}},t)\) free from the influence of the worst \(r - p\) deviations. Throughout this work, p is also known as the number of trusted points.
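To make the definition concrete, the following Julia sketch evaluates \(S_p(x)\) directly from (1)–(2) for the least-squares residuals above; the function and variable names are illustrative only and do not belong to any package mentioned later.

```julia
# Minimal sketch of the evaluation of the LOVO function S_p(x) in (2) for the
# least-squares residuals R_i(x) = (1/2) * (y_i - φ(x, t_i))^2.
function lovo_value(φ::Function, x, t::AbstractVector, y::AbstractVector, p::Int)
    R = [0.5 * (y[i] - φ(x, t[i]))^2 for i in eachindex(t)]   # residuals R_i(x)
    sort!(R)                        # ascending order, as in (1)
    return sum(R[1:p])              # sum of the p smallest residuals
end

# Example with the linear model φ(x, t) = x[1] * t + x[2]; the fourth point
# plays the role of an outlier and is ignored when p = 4 < r = 5.
φ(x, t) = x[1] * t + x[2]
t = collect(1.0:5.0)
y = [2.1, 4.0, 5.9, 30.0, 10.1]
println(lovo_value(φ, [2.0, 0.0], t, y, 4))
```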

Several applications can be modeled in the LOVO context, as illustrated in [4, 5, 7, 16, 18]. LOVO problems originated in the studies of Order Value Optimization (OVO) problems [2, 3]. For more details on the relationship between these problems, see Ref. [5]. An excellent survey about LOVO problems and variations is given in [17]. Although it is well known that LOVO deals with the detection of outliers, there is a limitation: the mandatory definition of the value p, which is associated with the number of possible outliers. This is the main gap that this paper intends to fill. We present a new method that combines a voting schema and an adaptation of the Levenberg–Marquardt algorithm in the context of LOVO problems.

Levenberg–Marquardt algorithms can be viewed as a particular case of trust-region algorithms, using specific models to solve nonlinear equations. In [4], a LOVO trust-region algorithm is presented with global and local convergence properties and an application to protein alignment problems. Second-order derivatives were needed in the algorithm for the local convergence analysis. In least-squares problems, as the objective function has a well known structure, Levenberg–Marquardt algorithms use a linear model for the adjustment function, instead of a quadratic model for the general nonlinear function. This approach eliminates the necessity of using second-order information for the model while still having second-order information about the function to be minimized. More details of Levenberg–Marquardt methods can be found in [21].

Another approach that also uses first-order information of the models to obtain a second-order approximation for least-squares functions is the Gauss–Newton method. Gauss–Newton in the context of LOVO functions for image recognition was discussed in [1]. The authors presented a LOVO approach to the detection of lines and circles with fixed radii. Line search was used to obtain global convergence. The main drawbacks of this approach were that the near-singularity of the Gauss–Newton system had to be fixed in a heuristic way and that the number p of trusted points had to be previously estimated and fixed for each problem. The algorithm was shown to be very efficient against state-of-the-art algorithms when the number of parameters to be estimated started to increase.

In [24], outlier detection techniques for data streaming problems are classified into seven groups: statistic-based, depth-based, deviation-based, distance-based, clustering-based, sliding-window-based and autoregression-based. In [26], the classification is divided only between geometric and algebraic algorithms for robust curve and surface fitting. The approach used in this work is clearly algebraic, strongly based on the fact that the user knows what kind of model is to be used. Although models are used, we make no assumption on the distribution of the points, so we do not fit clearly into any of the types described in [24]. We also assume that the values \(t_i\) are given exactly, which is called a fixed-regressor model in [21].

This work deals with the robust adjustment of models to data. A new version of the Levenberg–Marquardt algorithm for LOVO problems is developed, so the necessity of second-order information about the functions \(R_i\) is avoided. In addition, the number of possible outliers is estimated by a voting schema. The main difference of the proposed voting schema is that it is based on the values of p, which has, by definition, a discrete domain. In other techniques, such as the Hough Transform [10, 14, 15, 25], continuous intervals of the model's parameters are discretized. Also, an increase in the number of parameters to adjust does not impact the proposed voting system. The main improvements of this work can be stated as follows:

  • a Levenberg–Marquardt algorithm with global convergence for LOVO problems is developed, which avoids the use of second-order information such as in [4] or heuristic strategies to improve conditioning as in [1];

  • a voting schema based on the values of p is developed, whose size does not increase with the size or discretization of the parameters of the model, such that the number of trusted points does not have to be previously estimated as in [1];

  • extensive numerical results are presented, which show the behavior of the proposed method and are also freely available for download.

This work is organized as follows. In Sect. 2 we describe the Levenberg–Marquardt algorithm in the LOVO context and demonstrate its convergence properties. In Sect. 3 the voting schema is discussed, which will make the LOVO algorithm independent of the choice of p and will be the basis of the robust fitting. Section 4 is devoted to the discussion of the implementation details and comparison against other algorithms for robust fitting. Finally, in Sect. 5 we draw some conclusions on the presented strategy. Throughout this paper we use the notation \({\mathbb {R}}_+:=\{x\in {\mathbb {R}}\ | \ x\ge 0\}\).

2 The Levenberg–Marquardt method for LOVO problems

Following [5], let us start this section by pointing out an alternative definition of LOVO problems, useful for theoretical purposes. Denoting by \({\mathcal {C}}=\{{\mathcal {C}}_1,\ldots ,{\mathcal {C}}_q\}\) the set of all combinations of the elements of \(\{1,2,\ldots ,r\}\) taken p at a time, we can define, for each \(i\in \{1,\ldots ,q\} \), the following functions

$$\begin{aligned} f_i(x)=\sum _{k \in {\mathcal {C}}_i} R_k(x) \end{aligned}$$
(5)

and

$$\begin{aligned} f_{min}(x)=\min \{f_i(x),i=1,\ldots ,q\}. \end{aligned}$$
(6)

It is simple to note that \(S_p(x)=f_{min}(x)\), which is a useful characterization in our context. Moreover, \(S_p\) is a continuous function if \(f_i\) is continuous for all \(i=1,\ldots ,q\) but, even assuming differentiability of the \(f_i\), we cannot guarantee the same for \(S_p\). In addition, since \(R_k(x) = \dfrac{1}{2}(F_k(x))^2,k\in {\mathcal {C}}_i,i=1,\ldots ,q\), we can write

$$\begin{aligned} f_{i}(x)=\frac{1}{2}\sum _{k\in {\mathcal {C}}_i} F_{k}(x)^2=\dfrac{1}{2}\Vert F_{{\mathcal {C}}_i}(x)\Vert ^2_2. \end{aligned}$$
(7)

Throughout this work, following (7), given a set \({\mathcal {C}}_i \in {\mathcal {C}}\), \(F_{{\mathcal {C}}_i}(x): {\mathbb {R}}^n \rightarrow {\mathbb {R}}^p\) will always refer to the map that takes x to the p-sized vector composed of the functions \(F_k(x)\) defined in (4), for \(k \in {\mathcal {C}}_i\) in any fixed order. Similarly, \(J_{{\mathcal {C}}_i}(x)\) is defined as the Jacobian of this map. Additionally, we assume that the \(F_i\), \(i=1,\ldots ,r\), are continuously differentiable.

The goal of this section is to define a version of the Levenberg–Marquardt method (LM) to solve the specific problem (4), for a given p, and to establish a global convergence result. The new version will be called, for simplicity, LM-LOVO. It is well known that the Levenberg–Marquardt method proposed in [19] is closely related to trust-region methods, and our approach is based on it. Consequently, some definitions and remarks are necessary.

Definition 1

Given \(x \in {\mathbb {R}}^n\), we define the minimal function set of \(f_{min}\) at x by

$$\begin{aligned} I_{min}(x)=\{i\in \{1,\dots ,q\} \ | \ f_{min}(x)=f_i(x)\}. \end{aligned}$$

In order to define a search direction for LM-LOVO at the current point \(x_k\), we choose an index \(i \in I_{min}(x_k)\) and compute the direction defined by the classical Levenberg–Marquardt method using \(f_i(x)\), that is, the search direction \(d_k \in {\mathbb {R}}^n\) is defined as the solution of

$$\begin{aligned} \min _{d \in {\mathbb {R}}^n}m_{k,i}(d)=\dfrac{1}{2}\Vert F_{{\mathcal {C}}_i}(x_k)+J_{{\mathcal {C}}_i}(x_k)d\Vert _2^2 +\dfrac{\gamma _k}{2}\Vert d \Vert _2^2, \end{aligned}$$
(8)

where \(\gamma _k \in {\mathbb {R}}_+\) is the damping parameter. Equivalently, the direction d can be obtained by

$$\begin{aligned} \left( J_{{\mathcal {C}}_i}(x_k)^T J_{{\mathcal {C}}_i}(x_k) +\gamma _k I \right) d=-\nabla f_i(x_k), \end{aligned}$$
(9)

where \(\nabla f_i(x_k)=J_{{\mathcal {C}}_i}(x_k)^T F_{{\mathcal {C}}_i}(x_k)\) and \(I \in {\mathbb {R}}^{n \times n}\) is the identity matrix.

To ensure sufficient decrease along the computed search direction, we consider a strategy similar to that of trust-region methods, which involves monitoring the ratio between the actual decrease (given by \(f_{min}\)) and the predicted decrease (given by \(m_{k,i}\)) along the direction \(d_k\):

$$\begin{aligned} \rho _{k,i}=\dfrac{f_{min}(x_k)-f_{min}(x_k+d_k)}{m_{k,i}(0)-m_{k,i}(d_k)}. \end{aligned}$$
(10)

We formalize the conceptual algorithm LM-LOVO as Algorithm 1.

A noteworthy property of Levenberg–Marquardt (and also Gauss–Newton) coupled with the LOVO approach is that, assuming that the exact model was chosen, the number of trusted points p was correctly identified, \(i \in I_{min}(x_k)\) and there is no relevant noise in the observations \(y_j\), \(j \in {\mathcal {C}}_i\), the Hessian of the model, \(\nabla ^2 m_{k, i}(d) = J_{{\mathcal {C}}_i}(x_k)^T J_{{\mathcal {C}}_i}(x_k) + \gamma _k I\), approximates the Hessian of \(f_i\), \(\nabla ^2 f_i(x_k) = J_{{\mathcal {C}}_i}(x_k)^T J_{{\mathcal {C}}_i}(x_k) + \sum _{j \in {\mathcal {C}}_i} \nabla ^2 F_j(x_k) F_j(x_k)\), as the algorithm converges. This occurs because \(F_j(x_k) \rightarrow 0\), \(j \in {\mathcal {C}}_i\), and \(\gamma _k \rightarrow 0\) (see the definition of \(\gamma _k\) and Theorem 1). This property would not hold if a traditional Levenberg–Marquardt algorithm were applied to data with outliers.

[Algorithm 1 (LM-LOVO): pseudocode figure not reproduced here]
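Since the pseudocode is presented as a figure, the following Julia sketch reconstructs the main loop of LM-LOVO from the description in this section: the damping parameter is \(\gamma _k = \lambda \Vert \nabla f_{i_k}(x_k)\Vert _2^2\), the step is accepted when the ratio (10) satisfies \(\rho _{k,i_k}\ge \mu \), and \(\lambda \) is multiplied by \({\overline{\lambda }}\) otherwise. The helper names, the default value of \(\mu \) and the update of \(\lambda \) after an accepted step are assumptions of the sketch and may differ from the exact Algorithm 1; the defaults \(\varepsilon = 10^{-4}\) and \({\overline{\lambda }} = 2\) follow Sect. 4.

```julia
using LinearAlgebra

# Hedged sketch of the LM-LOVO iteration, reconstructed from the text.
# resid!(F, x) fills the r-vector of residuals F_i(x); jac!(J, x) fills the
# full r x n Jacobian. These helper names are ours, not from RAFF.jl.
function lm_lovo_sketch(resid!, jac!, x, r, p; ε = 1e-4, μ = 1e-1,
                        lbar = 2.0, maxit = 400)
    n = length(x)
    F = zeros(r)
    J = zeros(r, n)
    λ = 1.0
    fmin = z -> (resid!(F, z); sum(partialsort(F .^ 2 ./ 2, 1:p)))
    for k in 1:maxit
        resid!(F, x)
        jac!(J, x)
        C  = partialsortperm(F .^ 2, 1:p)   # index set C_{i_k}, i_k in I_min(x_k)
        Fc = F[C]
        Jc = J[C, :]
        g  = Jc' * Fc                        # gradient of f_{i_k} at x_k
        norm(g) ≤ ε && return x              # stopping criterion (25)
        fx = 0.5 * dot(Fc, Fc)               # f_{i_k}(x_k) = f_min(x_k)
        while true
            γ = λ * dot(g, g)                # damping parameter γ_k
            d = (Jc' * Jc + γ * I) \ (-g)    # direction from (9)
            pred = fx - (0.5 * norm(Fc + Jc * d)^2 + 0.5 * γ * dot(d, d))
            ρ = (fx - fmin(x + d)) / pred    # ratio (10)
            if ρ ≥ μ
                x = x + d                    # accept the step
                λ = max(λ / lbar, 1e-16)     # assumed relaxation of λ
                break
            end
            λ *= lbar                        # reject: increase the damping
        end
    end
    return x
end
```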

In what follows, we show that Algorithm 1 is well defined and converges to stationary points of the LOVO problem. We begin with some basic assumptions on the boundedness of the points generated by the algorithm and on the smoothness of the involved functions.

Assumption 1

The level set

$$\begin{aligned} C(x_0)=\{x\in {\mathbb {R}}^n \ | \ f_{min}(x)\le f_{min}(x_0)\} \end{aligned}$$

is a bounded set of \({\mathbb {R}}^n\) and the functions \(f_i\), \(i = 1,\dots ,q\), have Lipschitz continuous gradients with Lipschitz constants \(L_i>0\) in an open set containing \(C(x_0)\).

The next proposition is classical in the trust-region literature and ensures a decrease of \(m_{k,{i_k}}(\cdot )\) along the Cauchy direction. It is adapted here to the LOVO context.

Proposition 1

Given \(x_k\in {\mathbb {R}}^n\), \(\gamma _k\in {\mathbb {R}}_+\) and \({i_k}\in \{1,\dots ,q\}\), the Cauchy step obtained from

$$\begin{aligned} {\widehat{t}}={{\,\mathrm{\hbox {arg min}}\,}}_{t\in {\mathbb {R}}} \ \ \{m_{k,i_k}(-t\nabla f_{i_k}(x_k))\} \end{aligned}$$

and expressed by \(d^C(x_k)=-{\widehat{t}}\nabla f_{i_k}(x_k)\in {\mathbb {R}}^n\), satisfies

$$\begin{aligned} m_{k,i_k}(0)-m_{k,i_k}\left( d^C(x_k)\right) \ge \dfrac{\theta \Vert \nabla f_{i_k}(x_k)\Vert ^2_2}{2\left( \Vert J_{{\mathcal {C}}_{i_k}}(x_k)\Vert ^2_2 + \gamma _k \right) }, \end{aligned}$$
(11)

for some \(\theta > 0\), independent of k.

Proof

The Cauchy step is explicitly given by

$$\begin{aligned} d^C(x_k) = - \frac{\Vert \nabla f_{i_k}(x_k) \Vert _2^2}{\Vert J_{{\mathcal {C}}_{i_k}}(x_k) \nabla f_{i_k}(x_k) \Vert _2^2 + \gamma _k \Vert \nabla f_{i_k}(x_k) \Vert _2^2} \nabla f_{i_k}(x_k). \end{aligned}$$

By simple substitution in \(m_{k, i_k}\) (defined by (8)), and observing that \(\nabla f_{i_k}(x_k) = J_{{\mathcal {C}}_{i_k}}(x_k)^T F_{{\mathcal {C}}_{i_k}}(x_k)\) and \(\Vert J_{{\mathcal {C}}_{i_k}}(x_k) \nabla f_{i_k}(x_k) \Vert _2 \le \Vert J_{{\mathcal {C}}_{i_k}}(x_k)\Vert _2 \Vert \nabla f_{i_k}(x_k) \Vert _2\), it is not hard to show that

$$\begin{aligned} m_{k,i_k}(0)-m_{k,i_k}\left( d^C(x_k)\right) \ge \dfrac{1}{2} \dfrac{\Vert \nabla f_{i_k}(x_k)\Vert ^2_2}{\left( \Vert J_{{\mathcal {C}}_{i_k}}(x_k)\Vert ^2_2 + \gamma _k \right) }, \end{aligned}$$

and (11) holds for any \(\theta \in (0, 1]\). \(\square \)
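For completeness, the substitution mentioned in the proof can be carried out explicitly. Writing \(g = \nabla f_{i_k}(x_k)\), \(J = J_{{\mathcal {C}}_{i_k}}(x_k)\), \(F = F_{{\mathcal {C}}_{i_k}}(x_k)\) and \(\gamma = \gamma _k\),

$$\begin{aligned} m_{k,i_k}(0)-m_{k,i_k}(-tg)=\dfrac{1}{2}\Vert F\Vert _2^2-\dfrac{1}{2}\Vert F-tJg\Vert _2^2-\dfrac{\gamma }{2}t^2\Vert g\Vert _2^2 = t\Vert g\Vert _2^2-\dfrac{t^2}{2}\left( \Vert Jg\Vert _2^2+\gamma \Vert g\Vert _2^2\right) , \end{aligned}$$

which is maximized at \({\widehat{t}}=\Vert g\Vert _2^2/(\Vert Jg\Vert _2^2+\gamma \Vert g\Vert _2^2)\), yielding

$$\begin{aligned} m_{k,i_k}(0)-m_{k,i_k}\left( d^C(x_k)\right) =\dfrac{1}{2}\,\dfrac{\Vert g\Vert _2^4}{\Vert Jg\Vert _2^2+\gamma \Vert g\Vert _2^2}\ge \dfrac{1}{2}\,\dfrac{\Vert g\Vert _2^2}{\Vert J\Vert _2^2+\gamma }. \end{aligned}$$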

Since the Cauchy step is obtained from the scalar that minimizes the model \(m_{k,i_k}(\cdot )\) along the direction of the gradient vector, by Proposition 1 we can conclude that there exists \(\theta > 0\) such that

$$\begin{aligned} m_{k,i_k}(0)-m_{k,i_k}(d_k) \ge \dfrac{\theta \Vert \nabla f_{i_k}(x_k)\Vert ^2_2}{2\left( \Vert J_{{\mathcal {C}}_{i_k}}({x_k})\Vert ^2_2 + \gamma _k \right) }, \end{aligned}$$
(12)

since \(d_k\in {\mathbb {R}}^n\) from (9) is the global minimizer of \(m_{k,i_k}\).

Inspired by [6], we present Lemma 1, which shows that Step 5 of Algorithm 1 is always executed if \(\lambda \) is chosen large enough.

Lemma 1

Let \(x_k\in {\mathbb {R}}^n\) and \(i_k\in I_{min}(x_k)\) be a vector and an index, respectively, both fixed in Step 1 of Algorithm 1. Then Step 3 of Algorithm 1 will be executed a finite number of times.

Proof

To achieve this goal, we will show that

$$\begin{aligned} \lim _{\lambda \rightarrow \infty } \rho _{k,i_k} \ge 2. \end{aligned}$$

For each \(\lambda \) fixed in Step 1 of Algorithm 1, we have that

$$\begin{aligned} \begin{aligned} 1-\dfrac{\rho _{k,i_k}}{2}&= 1-\dfrac{f_{min}(x_{k})-f_{min}(x_{k}+d_{k})}{2(m_{k,i_k}(0)-m_{k,i_k}(d_{k}))} \\&= \dfrac{2m_{k,i_k}(0)-2m_{k,i_k}(d_{k}) - f_{min}(x_{k})+f_{min}(x_{k}+d_{k})}{2(m_{k,i_k}(0)-m_{k,i_k}(d_{k}))} \\&= \dfrac{f_{min}(x_{k}+d_{k}) +f_{min}(x_{k})-2m_{k,i_k}(d_{k})}{2(m_{k,i_k}(0)-m_{k,i_k}(d_{k}))}. \end{aligned} \end{aligned}$$
(13)

From the Taylor series expansion and the Lipschitz continuity of \(\nabla f_{i_k}\),

$$\begin{aligned} f_{i_k}(x_{k}+d_{k}) \le f_{i_k}(x_{k}) + \nabla f_{i_k}(x_{k})^Td_{k}+ \dfrac{L_{i_k}}{2}\Vert d_{k}\Vert ^2_2. \end{aligned}$$
(14)

By Eq. (14) and the definition of \(f_{min}\), we obtain

$$\begin{aligned} f_{min}(x_{k}+d_{k}) \le f_{i_k}(x_{k}+d_{k}){\mathop {\le }\limits ^{\tiny {(14)}}} f_{i_k}(x_{k}) + \nabla f_{i_k}(x_{k})^Td_{k} + \dfrac{L_{i_k}}{2}\Vert d_{k}\Vert ^2_2. \end{aligned}$$
(15)

Using expressions (7) and (9), we have

$$\begin{aligned} \begin{aligned} \Vert F_{{\mathcal {C}}_{i_k}}(x_{k})&+J_{{\mathcal {C}}_{i_k}}(x_{k})d_{k}\Vert ^2_2 +\gamma _k\Vert d_{k}\Vert ^2_2 \\&= \Vert F_{{\mathcal {C}}_{i_k}}(x_{k})\Vert ^2_2 + 2 F_{{\mathcal {C}}_{i_k}}(x_{k})^TJ_{{\mathcal {C}}_{i_k}}(x_{k})d_{k} + \Vert J_{{\mathcal {C}}_{i_k}}(x_{k})d_{k}\Vert ^2_2 +\gamma _k\Vert d_{k}\Vert ^2_2\\&= \Vert F_{{\mathcal {C}}_{i_k}}(x_{k})\Vert ^2_2 + 2 F_{{\mathcal {C}}_{i_k}}(x_{k})^TJ_{{\mathcal {C}}_{i_k}}(x_{k})d_{k} \\&\qquad + d_{k}^T\left( J_{{\mathcal {C}}_{i_k}}(x_{k})^TJ_{{\mathcal {C}}_{i_k}}(x_{k})+\gamma _{k}I\right) d_{k}\\&{\mathop {=}\limits ^{\tiny {(7),(9)}}} 2f_{i_k}(x_{k}) + 2 \nabla f_{i_k}(x_{k})^Td_{k} - d_{k}^T\nabla f_{i_k}(x_{k})\\&= 2f_{i_k}(x_{k}) + \nabla f_{i_k}(x_{k})^Td_{k}. \end{aligned} \end{aligned}$$
(16)

Using (15), (16) and the definition of \(m_{k,i_k}\) in (13), we get

$$\begin{aligned} \begin{aligned} 1 - \dfrac{\rho _{k,i_k}}{2}&= \dfrac{f_{min}(x_{k}+d_{k}) + f_{i_k}(x_{k}) - \Vert F_{{\mathcal {C}}_{i_k}}(x_{k})+J_{{\mathcal {C}}_{i_k}}(x_{k})d_{k}\Vert ^2_2 - \gamma _k\Vert d_{k}\Vert ^2_2}{2(m_{k,i_k}(0)-m_{k,i_k}(d_{k}))} \\&{\mathop {=}\limits ^{\tiny {(16)}}} \dfrac{f_{min}(x_{k}+d_{k}) - f_{i_k}(x_{k})- \nabla f_{i_k}(x_{k})^Td_{k}}{2(m_{k,i_k}(0)-m_{k,i_k}(d_{k}))} \\&{\mathop {\le }\limits ^{\tiny {(15)}}} \dfrac{L_{i_k}\Vert d_k\Vert ^2_2}{4(m_{k,i_k}(0)-m_{k,i_k}(d_{k}))} . \end{aligned} \end{aligned}$$
(17)

From (9) and the definition of \(\gamma _k\), we note that

$$\begin{aligned} \Vert d_{k}\Vert _2 \le \dfrac{\Vert \nabla f_{i_k}(x_{k})\Vert _2}{\sigma _{k}+\gamma _{k}} \le \dfrac{\Vert \nabla f_{i_k}(x_{k})\Vert _2}{\gamma _{k}} = \dfrac{1}{\Vert \nabla f_{i_k}(x_k)\Vert _2\lambda }, \end{aligned}$$
(18)

where \(\sigma _{k}=\sigma _{min}(J_{{\mathcal {C}}_{i_k}}(x_{k})^T J_{{\mathcal {C}}_{i_k}}(x_{k}))\) and \(\sigma _{min}(B)\) represents the smallest eigenvalue of B.

Replacing (18) in (17), we obtain

$$\begin{aligned} \begin{aligned} 1-\dfrac{\rho _{k,i_k}}{2}&\le \dfrac{\dfrac{L_{i_k}}{\Vert \nabla f_{i_k}(x_{k})\Vert ^2_2\lambda ^2}}{4(m_{k,i_k}(0)-m_{k,i_k}(d_{k}))} {\mathop {\le }\limits ^{(12)}} \dfrac{\dfrac{L_{i_k}}{\Vert \nabla f_{i_k}(x_{k})\Vert ^2_2\lambda ^2}}{\dfrac{4\theta \Vert \nabla f_{i_k}(x_{k})\Vert ^2_2}{2(\Vert J_{{\mathcal {C}}_{i_k}}(x_{k})\Vert ^2_2 + \gamma _{k})}} \\&= \dfrac{(\Vert J_{{\mathcal {C}}_{i_k}}(x_{k})\Vert ^2_2 + \gamma _{k}) L_{i_k} }{2\theta \Vert \nabla f_{i_k}(x_{k})\Vert ^4_2 \lambda ^2} \le \left( \dfrac{\Vert J_{{\mathcal {C}}_{i_k}}(x_{k})\Vert ^2_2}{ \Vert \nabla f_{i_k}(x_{k})\Vert ^4_2} + \dfrac{1}{\Vert \nabla f_{i_k}(x_{k})\Vert ^2_2} \right) \dfrac{L_{i_k}}{2 \theta \lambda }, \end{aligned} \end{aligned}$$
(19)

where the last inequality comes from the definition of \(\gamma _k\) in Algorithm 1 and assuming that \(\lambda \ge 1\), which can always be enforced.

Using (19), we conclude that

$$\begin{aligned} \lim _{\lambda \rightarrow \infty } 1-\dfrac{\rho _{k,i_k}}{2} \le 0, \end{aligned}$$

or equivalently

$$\begin{aligned} \lim _{\lambda \rightarrow \infty } \rho _{k,i_k}\ge 2, \end{aligned}$$

which proves the result. \(\square \)

We now move toward convergence results for Algorithm 1 to stationary points. At this point, we should be aware that LOVO problems admit two types of stationarity condition: weak and strong [4].

Definition 2

A point \(x^*\) is a weakly critical point of (3) when \(x^*\) is a stationary point of \(f_i\) for some \(i\in I_{min}(x^*)\). A point \(x^*\) is a strongly critical point of (3) if \(x^*\) is a stationary point of \(f_i\) for all \(i\in I_{min}(x^*)\).

Global convergence to weakly critical points is given by Theorem 1. This type of convergence is less expensive to verify in practice and therefore more common to deal with. Convergence to strongly critical points is theoretically interesting and can be accomplished by using the concept of \(\delta \)-active indexes, defined in [5]. The algorithm has to be slightly modified for this type of convergence. We provide the modified algorithm and all the technical details in “Appendix 1”.

Theorem 1

Let \(\{x_k\}_{k\in {\mathbb {N}}}\) be a sequence generated by Algorithm 1 with \(\varepsilon =0\), and let \(x^*\) be a limit point of that sequence. Consider an infinite subset of indexes \({\mathcal {K}}'=\{ k \ | \ i_k=i \}\subset {\mathbb {N}}\), for some \(i\in \{1,\dots ,q\}\), such that \(\lim _{k \in {\mathcal {K}}'} x_k = x^*\), and assume that Assumption 1 holds. Then, we have

$$\begin{aligned} \lim _{k \in {\mathcal {K}}'} \Vert \nabla f_i(x_{k})\Vert _2=0 \end{aligned}$$

and \(i\in I_{min}(x^*)\).

Proof

Clearly, there is an index i chosen an infinite number of times by Algorithm 1, since \(\{1,\dots ,q\}\) is a finite set.

Let us suppose by contradiction that, for this index i, there exist \(\beta >0\) and an infinite subset \({\mathcal {K}}_1\subset {\mathcal {K}}'\) such that \(\Vert \nabla f_i(x_{k})\Vert _2\ge \beta \), for all \(k\in {\mathcal {K}}_1\).

Using the continuity of the Jacobian \(J_{{\mathcal {C}}_{i}}(x)\) and the boundedness of \(C(x_0)\), we ensure that

$$\begin{aligned} \Vert J_{{\mathcal {C}}_{i}}(x_{k})\Vert _2 \le \sup _{k\in {\mathcal {K}}_1}\{\Vert J_{{\mathcal {C}}_i}(x_k)\Vert _2\} =J_i, \end{aligned}$$
(20)

for all \(k \in {\mathcal {K}}_1\).

By (19), for \(\lambda \ge 1\) we obtain

$$\begin{aligned} \begin{aligned}&1 - \dfrac{\rho _{k, i}}{2} \le \left( \dfrac{\Vert J_{{\mathcal {C}}_{i}}(x_{k})\Vert ^2_2}{ \Vert \nabla f_{i}(x_{k})\Vert ^4_2} + \dfrac{1}{\Vert \nabla f_{i}(x_{k})\Vert ^2_2} \right) \dfrac{L_{i}}{2 \theta \lambda }\\ \Rightarrow&1 - \dfrac{\rho _{k, i}}{2} {\mathop {\le }\limits ^{(20)}} \left( \dfrac{J_i^2}{\beta ^4} + \dfrac{1}{\beta ^2} \right) \dfrac{L_{i}}{2 \theta \lambda }\\ \Rightarrow&\rho _{k, i} \ge 2 - \left( \dfrac{J_i^2}{\beta ^4} + \dfrac{1}{\beta ^2} \right) \dfrac{L_{i}}{\theta \lambda }, \end{aligned} \end{aligned}$$
(21)

for all \(k\in {\mathcal {K}}_1\).

Through expression (21), we have that Step 5 of Algorithm 1 will certainly be executed when \(\lambda \ge b = \max \left\{ 1, \left( \dfrac{J_i^2}{\beta ^4} + \dfrac{1}{\beta ^2} \right) \dfrac{L_{i}}{\theta } \right\} \) since

$$\begin{aligned} \rho _{k, i} \ge 2 - \left( \dfrac{J_i^2}{\beta ^4} + \dfrac{1}{\beta ^2} \right) \dfrac{L_{i}}{\theta \lambda } \ge 2 - \left( \dfrac{J_i^2}{\beta ^4} + \dfrac{1}{\beta ^2} \right) \dfrac{L_{i}}{\theta b} \ge 1 >\mu , \end{aligned}$$
(22)

for all \(k\in {\mathcal {K}}_1\). Therefore, based on Step 4 of Algorithm 1, the value of \(\lambda _k\) will be upper bounded by \(\lambda _k\le M ={\overline{\lambda }} b\), for all \(k\in {\mathcal {K}}_1\).

Then, for all \(k\in {\mathcal {K}}_1\), we get

$$\begin{aligned} \begin{aligned}&\dfrac{f_{min}(x_k)-f_{min}(x_{k+1})}{m_{k,i}(0)-m_{k,i}(d_k)} \ge \mu \\&\Leftrightarrow f_{min}(x_k)-f_{min}(x_{k+1}) \ge \mu (m_{k,i}(0)-m_{k,i}(d_k))\\&\Rightarrow f_{min}(x_k)-f_{min}(x_{k+1}) {\mathop {\ge }\limits ^{(12)}} \mu \left( \theta \dfrac{\Vert \nabla f_i(x_k)\Vert _2^2}{2 (\Vert J_{{\mathcal {C}}_i}(x_k)\Vert ^2_2+\gamma _{k})}\right) \\&\Rightarrow f_{min}(x_k)-f_{min}(x_{k+1}) \ge \dfrac{\mu \theta \Vert \nabla f_i(x_k)\Vert _2^2}{2 (\Vert J_{{\mathcal {C}}_i}(x_k)\Vert ^2_2+ \lambda _k\Vert \nabla f_i(x_k)\Vert _2^2 )}\\&\Rightarrow f_{min}(x_k)-f_{min}(x_{k+1}) \ge \dfrac{\mu \theta }{2 \left( \dfrac{\Vert J_{{\mathcal {C}}_i}(x_k)\Vert ^2_2}{\Vert \nabla f_i(x_k)\Vert _2^2} + \lambda _k \right) }\\&\Rightarrow f_{min}(x_k)-f_{min}(x_{k+1}) \ge \dfrac{\mu \theta }{2 \left( \dfrac{\Vert J_{{\mathcal {C}}_i}(x_k)\Vert ^2_2}{\beta ^2} + \lambda _k \right) }\\&\Rightarrow f_{min}(x_k)-f_{min}(x_{k+1}) \ge \dfrac{\mu \theta \beta ^2}{2 (\Vert J_{{\mathcal {C}}_i}(x_k)\Vert ^2_2+\beta ^2\lambda _k)}\\&\Rightarrow f_{min}(x_k)-f_{min}(x_{k+1}) \ge \dfrac{\mu \theta \beta ^2}{2 (J_i^2+\beta ^2M)}\\&\Leftrightarrow f_{min}(x_{k+1})-f_{min}(x_k) \le - \dfrac{\mu \theta \beta ^2}{2c}, \end{aligned} \end{aligned}$$
(23)

where \(c=J_i^2+\beta ^2 M\).

Expression (23) and the fact that \(f_{min}(x_{k+1})\le f_{min}(x_k)\), for all \(k\in {\mathcal {K}}'\), contradict the fact that \(f_{min}\) is bounded from below (indeed, \(f_{min}(x)\ge 0\) for all x). We conclude that there is no such \({\mathcal {K}}_1\) and, therefore,

$$\begin{aligned} \lim _{k \in {\mathcal {K}}'} \Vert \nabla f_i(x_{k})\Vert _2 = 0. \end{aligned}$$

To prove the second statement of the theorem, we use the fact that, in Algorithm 1, \(i_k\in I_{min}(x_{k})\), for all k, obtaining that

$$\begin{aligned} f_{i_k}(x_{k}) = f_i(x_{k}) \le f_j(x_{k}), \ \forall \ k\in {\mathcal {K}}' \ \text{ and } \ \forall \ j\in \{1,\dots ,q\}. \end{aligned}$$
(24)

By (24) and the continuity of the function \(f_i\), for all \(i\in \{1,\dots ,q\}\), we have

$$\begin{aligned} f_i(x^*) \le f_j(x^*), \ \forall \ j\in \{1,\dots ,q\}, \end{aligned}$$

which means that \(i\in I_{min}(x^*)\), concluding the proof. \(\square \)

It is not hard to show that, if \(i \in I_{min}(x_k)\), then all the previous theoretical results remain valid if we replace \(\rho _{k, i}\) by

$$\begin{aligned} {\hat{\rho }}_{k, i} = \dfrac{f_{i}(x_k) - f_{i}(x_k+d_k)}{m_{k,i}(0)-m_{k,i}(d_k)}, \end{aligned}$$

where \(f_i\) is used instead of \(f_{min}\). The main reason for using \({\hat{\rho }}_{k, i}\) is practical: when computing \(f_i(x_k + d_k)\) we can use the same index set \({\mathcal {C}}_i\) of functions \(F_j\) that was used when \(f_i(x_k) = f_{min}(x_k)\) was computed. Recall that the computation of \(f_{min}(x)\) requires the evaluation of all \(F_j\), \(j = 1, \dots , r\), sorting their values and taking the smallest p functions. On the other hand, since \(f_i(x_k + d_k) \ge f_{min}(x_k + d_k)\) and \(f_i(x_k) = f_{min}(x_k)\), we have that

$$\begin{aligned} {\hat{\rho }}_{k, i} = \dfrac{f_{i}(x_k)-f_{i}(x_k+d_k)}{m_{k,i}(0)-m_{k,i}(d_k)} \le \dfrac{f_{min}(x_k)-f_{min}(x_k+d_k)}{m_{k,i}(0)-m_{k,i}(d_k)} = \rho _{k, i}, \end{aligned}$$

which means that more executions of Step 3 might be necessary before the condition \({\hat{\rho }}_{k, i_k} \ge \mu \) is satisfied. Each execution of Step 3 involves the solution of a linear system. If the number r of functions is so large that sorting becomes the bottleneck of the algorithm, the use of \({\hat{\rho }}_{k, i}\) can be an interesting alternative.
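As an illustration of this cost difference (in Julia, with illustrative names and toy data), the numerator of \({\hat{\rho }}_{k, i}\) reuses the index set \({\mathcal {C}}_i\) already at hand, while the numerator of \(\rho _{k, i}\) requires evaluating all r residuals and partially sorting them:

```julia
# Toy residual function; `F_at`, `C` and the data below are ours, not from RAFF.jl.
F_at(x) = [x[1] * t - y for (t, y) in zip(1.0:5.0, [2.0, 4.1, 6.0, 30.0, 9.9])]
x, d, p = [2.0], [0.05], 4
C = partialsortperm(abs2.(F_at(x)), 1:p)            # index set fixed at x_k

# Numerator ingredient of the ratio with f_i: reuses C, no sorting of all residuals.
f_i_trial   = 0.5 * sum(abs2, F_at(x + d)[C])
# Numerator ingredient of the ratio with f_min: all residuals plus a partial sort.
f_min_trial = 0.5 * sum(partialsort(abs2.(F_at(x + d)), 1:p))
```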

3 The voting system

The main drawback of Algorithm 1 is the need to know the number p of trusted points, which is used by \(S_p\) (2) (or, equivalently, by \(f_{min}\)). The exact number of trusted points is rarely known in an experiment.

To overcome this difficulty, an algorithm that tests different values of p was created, detailed in Algorithm 2. The main idea of the method is to call Algorithm 1 for several different values of p and store the obtained solutions. The solutions are then preprocessed, and stationary points that are not global minimizers of their respective problems are eliminated. This elimination is based on the fact that, if \(\bar{x}_{p}\) and \({{\bar{x}}}_{q}\), \(p < q\), are both global minimizers of their respective problems, then \(S_{p}({{\bar{x}}}_{p})\) cannot be greater than \(S_{q}({{\bar{x}}}_{q})\). Therefore, if \(S_p({{\bar{x}}}_p) > S_q({{\bar{x}}}_q)\), then \({{\bar{x}}}_{p}\) is not a global minimizer and can be safely eliminated. The last steps (Steps 4 and 5) compute the similarity between each pair of solutions and obtain the most similar ones. Element \(C_p\) of vector C stores the number of times that some other solution was considered similar to \({{\bar{x}}}_p\), in the sense of a tolerance \(\epsilon \). Among the most similar solutions, the one with greatest p is considered the robust adjustment model for the problem. Algorithm 2 is a proposal of a voting system, where the solution that was not eliminated by the preprocessing and that occurred with highest frequency (in the similarity sense) is selected.

[Algorithm 2 (RAFF): pseudocode figure not reproduced here]
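Since the pseudocode is presented as a figure, the Julia sketch below outlines the voting system as described above: solve the subproblem for each p, discard solutions that cannot be global minimizers, count similarity votes and return the most voted solution with greatest p. The interface solve_lovo(p) and the omission of the additional preprocessing heuristic described in Sect. 4 are assumptions of the sketch, not part of RAFF.jl.

```julia
using LinearAlgebra

# Hedged sketch of the voting system. `solve_lovo(p)` stands for a call to the
# LM-LOVO solver returning the best multistart solution for p and its S_p value.
function voting_sketch(solve_lovo, pmin::Int, pmax::Int, ϵ::Real)
    ps   = collect(pmin:pmax)
    sols = Dict(p => solve_lovo(p) for p in ps)        # p => (solution, S_p value)

    # Preprocessing: if S_p > S_q for some q > p, the solution for p cannot be
    # a global minimizer of its subproblem and is discarded.
    valid = Dict(p => all(sols[p][2] <= sols[q][2] for q in ps if q > p)
                 for p in ps)

    # Similarity M_{pq} = Euclidean distance between solutions; count votes C_p.
    votes = Dict(p => 0 for p in ps)
    for p in ps, q in ps
        (p != q && valid[p] && valid[q]) || continue
        norm(sols[p][1] - sols[q][1]) < ϵ && (votes[p] += 1)
    end

    # Most voted solution; ties are broken by taking the largest p.
    best = -1
    for p in ps
        valid[p] || continue
        if best == -1 || (votes[p], p) > (votes[best], best)
            best = p
        end
    end
    return best, sols[best][1]
end
```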

The execution of Algorithm 2 can be easily parallelized. Each call of Algorithm 1 with a different value of p can be performed independently at Step 2. All the convergence results from Sect. 2 remain valid, so Algorithm 2 is well defined. The specific implementation details of the algorithm are discussed in Sect. 4.

4 Numerical implementation and experiments

In this section we discuss the implementation details of Algorithms 1 and 2. From now on, Algorithm 1 will be called LM-LOVO and Algorithm 2 will be called RAFF. Both algorithms were implemented in the Julia language, version 1.0.4, and are available in the official Julia repository. See [8] for information about the installation and usage of the RAFF.jl package.

Algorithm LM-LOVO is a sequential nonlinear programming algorithm, which means that only traditional parallelization techniques can be applied. Since fitting problems have small dimension and large datasets, the main gain would come from parallelizing the objective function evaluation, not the full algorithm. Matrix and vector operations are also eligible for parallelization.

Following traditional LOVO implementations [1], the choice of the index \(i_k \in I_{min}(x_k)\) is performed by simply evaluating the functions \(F_i(x_k)\), \(i = 1, \dots , r\), sorting them in ascending order and then dropping the \(r - p\) largest values. Any sorting algorithm can be used, but we used our own implementation of the selection sort algorithm. This choice is interesting, since the computational cost is linear when the vector is already in ascending order, which is not unusual if LM-LOVO is converging and \(i_{k + 1} = i_k\), for example.
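A minimal Base Julia illustration of this choice is given below (the package itself uses its own selection sort, as stated above; data and names are illustrative):

```julia
# Choice of i_k in I_min(x_k): keep the indices of the p smallest squared residuals.
Fk = [0.3, -2.5, 0.1, 0.05, 7.0]           # residuals F_i(x_k), i = 1, ..., r (toy values)
p  = 3
C  = partialsortperm(abs2.(Fk), 1:p)       # index set C_{i_k}
dropped = setdiff(1:length(Fk), C)         # the r - p largest residuals (candidate outliers)
```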

The convergence theory requires the sufficient decrease parameter \(\rho _{k, i_k}\) to be computed in order to decide step acceptance and the update of the damping parameter. In practice, LM-LOVO uses the simple decrease test at Step 4

$$\begin{aligned} f_{min}(x_k + d_k) < f_{min}(x_k), \end{aligned}$$

which was shown to work well in practice.

The computation of the direction \(d_k\) is performed by solving the linear system (9) via the Cholesky factorization of the matrix \(J_{{\mathcal {C}}_{i_k}}(x_k)^T J_{{\mathcal {C}}_{i_k}}(x_k) + \gamma _k I\). In the case where Steps 3 and 4 are repeated in the same iteration k, the QR factorization is more appropriate, since it can be reused when the iterate \(x_k\) remains the same and only the damping factor is changed. See [19] for more details about the use of QR factorizations in the Levenberg–Marquardt algorithm. If there is no interest in using the QR factorization, then the Cholesky factorization is recommended.
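In Julia, the Cholesky variant of this step can be sketched as follows (toy values; this is not the package code):

```julia
using LinearAlgebra

# Solve (J'J + γ I) d = -∇f via Cholesky, as in (9); Jc and Fc are the Jacobian
# and residual vector restricted to the index set C_{i_k}.
Jc = [1.0 2.0; 3.0 4.0; 5.0 6.0]
Fc = [0.1, -0.2, 0.05]
γ  = 1e-3                             # damping parameter γ_k
g  = Jc' * Fc                         # gradient of f_{i_k} at x_k
d  = cholesky(Symmetric(Jc' * Jc + γ * I)) \ (-g)
```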

LM-LOVO was carefully implemented, since it is used as a subroutine of RAFF for solving adjustment problems. A solution \({{\bar{x}}} = x_k\) is declared as successful if

$$\begin{aligned} \Vert \nabla f_{i_k}({{\bar{x}}}) \Vert _2 \le \varepsilon \end{aligned}$$
(25)

for some \(i_k \in I_{min}({{\bar{x}}})\), where \(f_{i_k}\) is given by (5). The algorithm also stops if the gradient cannot be computed due to numerical errors or if the limit of 400 iterations has been reached. We also set \({\overline{\lambda }} = 2\) as the default.

In order to show the behavior of LM-LOVO, we solved the problem of adjusting some data to the one-dimensional logistic model, widely used in statistics:

$$\begin{aligned} \phi (x, t) = x_1 + \frac{x_2}{1 + \exp (- x_3 t + x_4)}, \end{aligned}$$

where \(x \in {\mathbb {R}}^4\) represents the parameters of the model and \(t \in {\mathbb {R}}\) represents the variable of the model. In order to generate random data for the test, the procedures detailed in Sect. 4.1 were used. The produced data is displayed in Fig. 1, where \(r = 10\), \(p = 9\) and the exact solution was \(x^* = (6000, -5000, -0.2, -3.7)\). This example has only \(r - p = 1\) outlier.

Fig. 1 Test problem simulating an experiment following the logistic model. The continuous line represents the adjusted model, while the dashed line is the “exact” solution. LM-LOVO correctly identifies and ignores the outlier.

LM-LOVO was run with its default parameters, using \(x = (0, 0, 0, 0)\) as starting point and \(p = 9\), indicating that 9 points can be trusted for adjusting the model. The solution found, \({{\bar{x}}} = (795.356, 5749.86, 0.161791, 3.02475)\), is also shown in Fig. 1 as a continuous line, while the “exact” solution is depicted as a dashed line. Note that the exact solution \(x^*\) is not expected to be recovered, since the points were perturbed. The outlier is correctly identified as the dark/red triangle.

The example in Fig. 1 has an outlier that is visually easy to identify, so the correct number of \(p = 9\) trusted points was used. However, that might not be the case, especially if there is an automated process that needs to perform the adjustments or if the model is multi-dimensional. Algorithm RAFF was designed to overcome this drawback.

RAFF was also implemented in the Julia language and is the main method of the RAFF.jl package [8]. As already mentioned in Sect. 2, RAFF is easily parallelizable, so serial and parallel/distributed versions are available through the Distributed.jl package. The algorithm (or the user) defines an interval of values of p to test and calls LM-LOVO to solve each subproblem for a given value of p. It is known that LOVO problems have many local minimizers, but we are strongly interested in global ones. Therefore, the traditional multi-start technique is applied to generate random starting points: the larger the number of different starting points, the greater the chance of finding global minimizers, at the price of an increased computational cost. The parallel/distributed version of RAFF mitigates this cost by distributing problems with different values of p among different cores, processors or even computers.

For the computation of the similarity between solutions \({{\bar{x}}}_p\) and \({{\bar{x}}}_q\) in Step 4, the Euclidean norm of the vector of differences was used

$$\begin{aligned} M_{pq} = \Vert {{\bar{x}}}_p - {{\bar{x}}}_q \Vert _2. \end{aligned}$$

For each p in the interval, the best solution \({{\bar{x}}}_p\) obtained among all the runs of LM-LOVO for that p is stored. In order to avoid considering points for which LM-LOVO has not converged for some value of p, we set \(M_{ip} = M_{pi} = \infty \) for all \(i = p_{min}, \dots , p_{max}\) in case of failure.

In the preprocessing phase (Step 3 of RAFF), solutions \({{\bar{x}}}_q\) that clearly are not minimizers are also eliminated by setting \(M_{iq} = M_{qi} = \infty \) for all \(i = p_{min}, \dots , p_{max}\). To detect such points, we check whether \(S_q({{\bar{x}}}_q) > S_p({{\bar{x}}}_p)\) for some \(q < p \le p_{max}\). The idea is that the fewer points are considered in the adjustment, the smaller the residual should be at the global minimizer. The preprocessing phase also tries to eliminate solution \(\bar{x}_{p_{max}}\). To do that, the valid solution \({{\bar{x}}}_p\), \(p < p_{max}\), with smallest value of \(S_p({{\bar{x}}}_p)\) among those not eliminated by the previous strategy is chosen. Solution \(\bar{x}_{p_{max}}\) is eliminated if \(S_p({{\bar{x}}}_p) < S_{p_{max}}(\bar{x}_{p_{max}})\) and the number of observed points \((t_i, y_i)\) such that \(|y_i - \phi ({{\bar{x}}}_p, t_i)| < |y_i - \phi ({{\bar{x}}}_{p_{max}}, t_i)|\), for \(i = 1, \dots , r\), is greater than or equal to r/2.

The last implementation detail of RAFF that needs to be addressed is the choice of \(\epsilon \). Although this value can be provided by the user, we found it very hard to select a number that results in a correct adjustment. Very small or very large values of \(\epsilon \) result in the selection of \({{\bar{x}}}_{p_{max}}\) as the solution, since each solution will be similar only to itself or similar to every other solution, and we always select the largest p in such cases. To solve this issue, the following calculation has been observed to work well in practice:

$$\begin{aligned} \epsilon = \min (M) + {{\,\mathrm{avg}\,}}(M) / (1 + p_{max}^{1/2}), \end{aligned}$$
(26)

where M is the similarity matrix and the function \({{\,\mathrm{avg}\,}}\) computes the average similarity by considering only the lower triangular part of M and ignoring \(\infty \) values (which represent eliminated solutions). If there is no convergence for any value of \(p \in [p_{min}, p_{max}]\), then \({{\bar{x}}}_{p_{max}}\) is returned, regardless of whether it has successfully converged or not.
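In code, (26) can be computed as in the sketch below; reading \(\min (M)\) as the minimum over the same finite lower-triangular entries used by \({{\,\mathrm{avg}\,}}\) is an assumption of the sketch:

```julia
# Hedged sketch of the threshold (26); M is the similarity matrix, with Inf
# marking eliminated or non-converged solutions.
function similarity_threshold(M::AbstractMatrix, pmax::Int)
    lower = [M[i, j] for i in 1:size(M, 1) for j in 1:i-1 if isfinite(M[i, j])]
    return minimum(lower) + (sum(lower) / length(lower)) / (1 + sqrt(pmax))
end
```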

4.1 Experiments for outlier detection and robust fitting

In the first set of tests, we verified the ability and efficiency of RAFF in detecting outliers for well-known statistical and mathematical models:

  • Linear model: \(\phi (x, t) = x_1 t + x_2\)

  • Cubic model: \(\phi (x, t) = x_1 t^3 + x_2 t^2 + x_3 t + x_4\)

  • Exponential model: \(\phi (x, t) = x_1 + x_2 \exp ( - x_3 t )\)

  • Logistic model: \(\phi (x, t) = x_1 + \frac{x_2}{1 + \exp (- x_3 t + x_4)}\)

The large number of parameters to be adjusted increases the difficulty of the problem, since the number of local minima also increases. For these tests, we followed some of the ideas described in [20]. For each model, we created 1000 randomly generated problems having: 10 points and 1 outlier, 10 points and 2 outliers, 100 points and 1 outlier, and 100 points and 10 outliers. For each combination, we also tested the effect of the multistart strategy using 1, 10, 100 and 1000 random starting points.

The procedure for generating each random instance is described as follows; it is also part of the RAFF.jl package [8]. Let \(x^*\) be the exact solution of the fitting problem. First, r uniformly spaced values \(t_i\) are selected in the interval [1, 30]. Then, a set \(O \subset \{1, \dots , r\}\) with \(r - p\) elements is randomly selected to be the set of outliers. For all \(i = 1, \dots , r\) a perturbed value is computed, simulating the result of an experiment. Therefore, we set \(y_i = \phi (x^*, t_i) + \xi _i\), where \(\xi _i \sim {{\,\mathrm{{\mathcal {N}}}\,}}(0, 200)\), if \(i \not \in O\) and, otherwise, \(y_i = \phi (x^*, t_i) + 7 s \xi '_i \xi _i\), where \(\xi _i \sim {{\,\mathrm{{\mathcal {N}}}\,}}(0, 200)\), \(\xi '_i\) is a uniform random number between 1 and 2 and \(s \in \{-1, 1\}\) is randomly selected at the beginning of this process (so that all outliers lie on the “same side” of the curve). The exact solutions used to generate the instances are given in Table 1. The example illustrated in Fig. 1 was also generated by this procedure.
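The generator just described can be sketched in Julia as follows (RAFF.jl provides its own generator; reading \({{\,\mathrm{{\mathcal {N}}}\,}}(0, 200)\) as a normal distribution with standard deviation 200 and taking the absolute value of \(\xi _i\) for outliers, so that they all fall on the same side of the curve, are assumptions of this sketch):

```julia
using Random

# Hedged sketch of the random instance generator; names are illustrative.
function generate_instance(φ::Function, xstar, r::Int, p::Int; a = 1.0, b = 30.0)
    t = collect(range(a, stop = b, length = r))     # uniformly spaced t_i in [a, b]
    O = Set(randperm(r)[1:r-p])                     # indices of the r - p outliers
    s = rand([-1, 1])                               # side of the curve for all outliers
    y = similar(t)
    for i in 1:r
        ξ = 200 * randn()                           # ξ_i, read as N(0, 200)
        if i in O
            ξu = 1 + rand()                         # uniform random number in [1, 2]
            y[i] = φ(xstar, t[i]) + 7 * s * ξu * abs(ξ)
        else
            y[i] = φ(xstar, t[i]) + ξ
        end
    end
    return t, y, O
end
```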

Table 1 Exact solutions used for each model in order to generate random instances

The parallel version of RAFF was run with its default parameters on an Intel Xeon E3-1220 v3 3.10GHz with 4 cores and 16GB of RAM, running the Lubuntu 18.04 Linux operating system. The obtained results are displayed in Tables 2 and 3. In those tables, r is the number of points representing the experiments, p is the number of trusted points, FR is the ratio of problems in which all the outliers have been found (but other points may also have been declared as outliers), ER is the ratio of problems where exactly the \(r - p\) outliers have been found, TP is the average number of correctly identified outliers, FP is the average number of incorrectly identified outliers, Avg. is the average number of points declared as outliers by the algorithm and Time is the total CPU time in seconds to run all the 1000 tests, measured with the @elapsed Julia macro. By default, \(p_{min} = 0.5 r\) and \(p_{max} = r\) are set in the algorithm. The success criterion (25) of LM-LOVO was set to \(\varepsilon = 10^{-4}\), while \({\overline{\lambda }}\) was set to 2. For each combination (Model, r, p) there are 4 rows in Tables 2 and 3, representing the different numbers of multistart trials: 1, 10, 100 and 1000.

Table 2 Results of RAFF for the detection of outliers for linear and cubic models
Table 3 Results for RAFF for the detection of outliers for exponential and logistic models

Some conclusions can be drawn from Tables 2 and 3. We can see that RAFF attains its best performance for outlier detection when the number of correct points is not small, even though the percentage of outliers is high. For the exponential and logistic models, we can also clearly see the effect of the multistart strategy in increasing the ratio of identified outliers. In problems with 100 experiments, we observe that in almost all cases the number of outliers has been overestimated on average: although the ratio of outlier identification is high (FR), the ratio of runs where only the exact outliers have been detected (ER) is very low, being below 20% of the runs. For small test sets, this ratio increases up to 50%, but difficult models, such as the exponential and logistic ones, have very low ratios. However, as we can observe in Fig. 2, the shape and the parameters of the model are clearly free from the influence of outliers. This observation suggests that the perturbation added to all the values may be causing the algorithm to classify correct points as outliers. Increasing the number of multi-start runs linearly increases the runtime of the algorithm, but improves the adjustment, especially for the logistic model. The exponential model shows an unexpected behavior, where the ER ratio decreases when the number of multi-start runs increases, although the ratio of problems where all the outliers have been detected (FR) increases. This might indicate that the tolerance (26) could be improved. We can also observe that the runtime for the exponential model is ten times higher than for the other models.

When the size of the problem is multiplied by 10 (from 10 points to 100), we observe that the CPU time is multiplied by 5. This occurs because the time spent on communication in the parallel runs becomes less important for larger datasets. Again, the exponential model is an exception.

Fig. 2 Selected instances of test problems with \(r = 100\) and \(p = 90\) and the solutions obtained by RAFF. All the outliers have been correctly identified in those cases (dark/red triangles). Non-outliers are described by circles, where the dark/red ones represent points incorrectly classified as outliers by the algorithm.

In a second round of experiments, the same procedure was used to generate random test problems simulating results from 100 experiments (\(r = 100\)), where a cluster of 10% of the points are outliers (\(p = 90\)). The default interval used for the values of t is [1, 30], and the clustered outliers always belong to [5, 10]. Selected instances for each type of model are shown in Fig. 3, as well as the solution found by RAFF. Again, 1000 random problems were generated for each type of model and the multi-start procedure was fixed at 100 starting points per problem. The obtained results are shown in Table 4. A clustered set of outliers can strongly affect the model but is also easier to detect when the number of outliers is not very large. As we can observe in Table 4, the ratio of instances where all the outliers have been successfully detected has increased for all models. The logistic model is the hardest to fit since, on average, RAFF detects 17 points as outliers and 9 of them are correctly classified (TP). All the other models are able to correctly identify the 10 outliers, on average, and have a higher FR ratio.

This set of experiments also shows another benefit of the present approach. If the user roughly knows the number of points that belong to a given model, such information can be used in the elimination of random (not necessarily Gaussian) noise. The clustered examples will also be shown to be an advantage over traditional robust least-squares algorithms in Sects. 4.2 and 4.3.

Fig. 3 Selected instances of test problems containing a clustered set of outliers and the solutions obtained by RAFF. All the outliers have been correctly identified in those cases (dark/red triangles). Non-outliers are described by circles, where the dark/red ones represent points incorrectly classified as outliers by the algorithm.

Table 4 Numerical results for problems with \(r = 100\) data points and 10% of clustered outliers

4.2 Comparison against robust algorithms

We compared the fitting obtained by RAFF against classical and robust fitting algorithms provided by the SciPy library, version 1.3.1, in Python. The robust fitting algorithms in SciPy consist of using different loss functions in the least squares formulation. The following loss functions were used: linear (usual least squares formulation), soft_l1 (smooth approximation of the \(\ell _1\) loss function), huber and cauchy. The PyCall.jl Julia library was used to load and call SciPy.

Two more algorithms, based on RANSAC (Random Sample Consensus) [11] and implemented in C++ in the Theia Vision Library, version 0.8, were considered. The first one, called here RANSAC, is the traditional version of RANSAC, and the second one, LMED, is based on the work [22] and does not need an error threshold, unlike RANSAC (whose threshold is problem dependent).

All the algorithms from SciPy were run with their default parameters. The best model among 100 runs was selected as the solution of each algorithm, and the starting points were randomly generated following the normal distribution with \(\mu = 0\) and \(\sigma = 1\). Algorithms RANSAC and LMED were run only 10 times, due to the higher CPU time used and the good quality of the solutions achieved. RANSAC and LMED were run with a maximum of 1000 iterations, sampling 10% of the data and with the MLE score parameter activated. In order to adjust models to the sampled data, the Ceres least squares solver, version 1.13.0, was used, since Theia has a natural interface to it. All the scripts used in the tests are available at https://github.com/fsobral/RAFF.jl. Once again, the parallel version of RAFF was used. The test problems were generated by the same procedures discussed in Sect. 4.1. However, only one problem (instead of 1000) for each configuration (Model, r, p) was used.

Unlike RAFF, traditional fitting algorithms do not return the possible outliers of a dataset. Robust algorithms such as least squares with the \(\ell _1\) or Huber loss functions are able to ignore the effect of outliers, but not to easily detect them. Therefore, for the tests we selected one instance of each test of type (model, r, p), where the models and the values of r and p are the same as those used in Tables 2, 3 and 4. The results are displayed in Table 5. For each problem p and each algorithm a, we measured the adjustment error \(A_{a, p}\) between the model obtained by the algorithm, \(\phi _p(x_{a, p}^\star , t)\), and the points that are non-outliers, which is given by

$$\begin{aligned} A_{a, p} = \sqrt{\sum \limits _{\begin{array}{c} i \in {\mathcal {P}}\\ i\ \text{ non-outlier } \end{array}} (\phi _p(x_{a,p}^\star , t_i) - y_i)^2}, \end{aligned}$$

where \(\phi _p\) was the model used to adjust problem p. Each row of Table 5 represents one problem and contains the relative adjustment error for each algorithm, which is defined by

$$\begin{aligned} {{\bar{A}}}_{a, p} = \frac{A_{a, p}}{\min _{i} \{A_{i, p}\}} \end{aligned}$$
(27)

and the time taken to find the model (in parentheses). The last row contains the number of times that each algorithm found a solution with adjustment error within 1%, 10% and 20%, respectively, of the best adjustment error found for each problem, over the whole test set. We can observe that RAFF, LMED and soft_l1 were the best algorithms. RAFF was the solver that found the best models in most of the problems (11/24), followed by LMED (9/24). Its parallel version was consistently the fastest solver among all. It is important to observe that RAFF was the only one that was easily adapted to run in parallel. However, the parallelism is related only to the solution of the subproblems for different values of p, not to the multistart runs, which are performed sequentially. Therefore, RAFF solves considerably more problems per core than the other algorithms in a very competitive CPU time. When parallelism is turned off, the CPU time is very similar to that of the traditional least squares algorithm (linear). Also, RAFF was the only one that directly outputs the list of possible outliers, without the need for any threshold parameter. The clustered instance (cubic, 100, 90), the instance (logistic, 10, 9) and the models obtained by each algorithm are shown in Fig. 4.
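For reference, the adjustment error over the non-outliers and the relative error (27) can be computed as follows (illustrative names):

```julia
using LinearAlgebra

# Adjustment error A_{a,p} over the non-outliers and relative errors (27);
# `model`, `x`, `t`, `y` and `nonoutliers` are illustrative inputs.
adjustment_error(model, x, t, y, nonoutliers) =
    norm([model(x, t[i]) - y[i] for i in nonoutliers])

relative_errors(A) = A ./ minimum(A)        # one entry of A per algorithm
```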

Fig. 4 Two problems and the models found for each algorithm. On the left, a cubic model with 100 points and a set of 10 clustered outliers. On the right, a logistic model with 10 points and only one outlier.

4.3 Experiments for circle detection

The problem of detecting patterns in images is widely discussed in the computer vision literature. LOVO algorithms have also been applied to such problems, as a nonlinear programming alternative to traditional techniques [1]. The drawback, again, is the necessity of providing a reasonable number of trusted points. RAFF allows the user to provide an interval of possible numbers of trusted points, so the algorithm can efficiently save computational effort when trying to find patterns in images. Since LOVO problems need a model to be provided, circle detection is a perfect application for the algorithm.

We followed tests similar to [26], using a circular model

$$\begin{aligned} \phi (x, t) = (t_1 - x_1)^2 + (t_2 - x_2)^2 - x_3^2 \end{aligned}$$

instead of the ellipse model considered in that work. Two test sets were generated. In the first set, \(r = 100\) points were uniformly distributed on the border of the circle with center \((-10, 30)\) and radius 2. If a point is not an outlier, a random perturbation \(\xi \sim {{\,\mathrm{{\mathcal {N}}}\,}}(0, 0.1)\) is added to each of its coordinates \(t_1\) and \(t_2\). For outliers, the random noise is given by \(\xi \sim {{\,\mathrm{{\mathcal {N}}}\,}}(0, 2)\), as suggested in [26]. In the second set, \(r = 300\) was considered. The same circumference was used and p points (the non-outliers) were uniformly distributed on the circumference and slightly perturbed with a noise \(\xi \sim {{\,\mathrm{{\mathcal {N}}}\,}}(0, 0.1)\) as before. The remaining \(300 - p\) points (the outliers) were randomly distributed, with uniform distribution, in a square whose side was 4 times the radius of the circle. Nine problems were generated in each test set, with outlier ratio ranging from 10% up to 90% (i.e., the ratio of non-outliers decreasing from 90% to 10%).
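The first test-set generator can be sketched in Julia as follows (illustrative names; whether the points are equally spaced or randomly drawn on the circumference is an assumption of the sketch):

```julia
using Random

# Hedged sketch of the first circle test set: r points on the circle with
# center (-10, 30) and radius 2, perturbed with σ = 0.1 (inliers) or σ = 2 (outliers).
function circle_instance(r::Int, p::Int; center = (-10.0, 30.0), radius = 2.0)
    θ = 2π * rand(r)                        # angles on the circumference
    O = Set(randperm(r)[1:r-p])             # indices of the outliers
    T = zeros(r, 2)
    for i in 1:r
        σ = (i in O) ? 2.0 : 0.1
        T[i, 1] = center[1] + radius * cos(θ[i]) + σ * randn()
        T[i, 2] = center[2] + radius * sin(θ[i]) + σ * randn()
    end
    return T, O
end
```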

The same algorithms were compared; the only differences from Sect. 4.2 are that the error threshold of RANSAC was reduced to 10 and that 100 random starting points near (1, 1, 1) were used for all algorithms, except RANSAC and LMED. For those two algorithms, we kept the number of trials at 10. Also, we tested two versions of RAFF. In pure RAFF, we decreased the lower bound \(p_{min}\) from its default value 0.5r to the value of p whenever p fell below the default. In RAFFint, we used the option of providing upper and lower bounds for the number of trusted points: if p is the current number of non-outliers in the instance, the interval given to RAFFint is \([p - 0.3r, p + 0.3r] \cap [0, r]\). The measure (27) was used and the results are shown in Fig. 5.

We can observe that RAFF, LMED and cauchy achieved the best results. RAFF found worse models than most of the robust algorithms in the problems of the first test set, although the results are still very close. Its relative performance improves as the outlier ratio increases. This can be explained by the strong attraction that RAFF has to solutions similar to those of traditional least squares algorithms. In Fig. 6 we can see that RAFF has difficulty in finding outliers that belong to the interior of the circle. To overcome this drawback, RAFF also accepts a lower bound on the number of outliers, rather than only an upper bound. This ability is useful for large datasets with a lot of noise, as is the case of the second test set, and allows the detection of inner outliers. This is represented by RAFFint. We can see in Fig. 5 that the performance of both versions of RAFF is better than that of the traditional robust algorithms in the case of a large number of outliers.

Fig. 5 Relative adjustment error in the circle detection problem for increasing outlier ratio and two different types of perturbation: by normal and uniform distributions.

Fig. 6 Difference of outlier detection when upper bounds on the outlier ratio are provided. The inner outliers are harder to detect, since the error they cause in the model is smaller.

4.4 Comparison against another LOVO approach for model adjustment

We also compared RAFF against the LOVO Gauss–Newton line-search algorithm described in [1]. We were unable to obtain the code used in that paper, so a Julia implementation was written following the description of the algorithm. This implementation is also available in the RAFF.jl package as the function gnlslovo.

As previously mentioned, the algorithm in [1] needs a reasonable choice of the parameter p, the number of trusted points. A direct comparison between the LOVO Gauss–Newton algorithm and Algorithm 1 is not very informative, since both are well-known approaches for solving nonlinear least-squares problems. Gauss–Newton algorithms are usually faster and more accurate than the Levenberg–Marquardt approach, but they might suffer when the matrix used to compute the descent direction is ill conditioned. Therefore, in this subsection, our aim is to show that bad choices of p in the LOVO Gauss–Newton algorithm [1] lead to poorer adjustments and are usually avoided when the voting system used by RAFF is applied with an inexact interval estimate around p.

The same set of clustered problems from Sect. 4.2 was used, but only for \(r = 100\) and \(p = 90\). We also generated a circle detection problem with random noise (\(r = 300\) and \(p = 100\)), in the same fashion as in Sect. 4.3. To compare the two approaches, the adjustment error (27) was used. For each model, different values of p were given to the LOVO Gauss–Newton algorithm and an interval around each p was given to RAFF. The interval was defined as \(\left[ \max \{0, p/r - 0.3\} r, \min \{1, p/r + 0.3\} r \right] \). The Gauss–Newton algorithm was allowed to use 100 random initial points, while RAFF was allowed to use 30 random initial points for each p in the interval. The values of p used and the adjustment errors obtained are displayed in Table 6.

Table 5 Comparison against different robust fitting algorithms using one instance of each test problem and 100 random starting points as a multistart strategy
Table 6 Adjustment error for the LOVO Gauss–Newton algorithm [1] (GN) with fixed p and RAFF using the voting system with an interval around p

We can clearly see the benefits of using the voting system. Even if poor values of p are provided, RAFF is usually able to find better adjustments than the Gauss–Newton algorithm with fixed p. The voting system provides a way to compare the solutions for different values of p, which is not a trivial task. We also observe that the LOVO Gauss–Newton algorithm is able to find reasonable solutions using values of p not too close to the true ones. For the logistic model, the results of Gauss–Newton are worse, since the matrix that appears in the subproblems is very ill conditioned. The LOVO Gauss–Newton algorithm could also be used inside the voting system of RAFF, replacing the LOVO Levenberg–Marquardt algorithm, but the results would be very similar.

5 Conclusions

In this paper, we have described a LOVO version of the Levenberg–Marquardt algorithm for solving nonlinear equations, which is specialized to the adjustment of models when the data contain outliers. The theoretical properties of the algorithm were studied, and convergence to weakly and strongly critical points has been proved. To overcome the necessity of providing the number of outliers to the algorithm, a voting system has been proposed. A complete framework for the robust adjustment of data was implemented in the Julia language and compared to publicly available and well-tested robust fitting algorithms. The proposed algorithm was shown to be competitive, being able to find better adjusted models in the presence of outliers in most of the problems. In the circle detection problem, the proposed algorithm was also shown to be competitive and had a good performance even when the outlier ratio exceeds 50%. By comparing against a LOVO Gauss–Newton algorithm for model adjustment, the voting system was shown to be a good strategy when the number of outliers cannot be estimated beforehand. The implemented algorithm and all the scripts used for testing and for the generation of the tests are freely available and constantly updated at https://github.com/fsobral/RAFF.jl.