1 Introduction

In various situations, one is faced with a need to combine \(n\ge 2\) numeric values in the unit interval, so that a single representative output is produced. Usually, some idempotent aggregation function, see, e.g., [7, 13], is the required data fusion tool.

Definition 1

We say that \({\mathsf {F}}:[0,1]^n\rightarrow [0,1]\) is an idempotent aggregation function, whenever it is nondecreasing in each variable, and for all \(x\in [0,1]\) it holds \({\mathsf {F}}(x,\dots ,x)=x\).

Among useful idempotent aggregation functions we find weighted quasi-arithmetic means.

Definition 2

Let \(\varphi :[0,1]\rightarrow \bar{\mathbb {R}}\) be a continuous and strictly monotonic function and \({\mathbf {w}}\) be a weighting vector of length n, i.e., one such that for all i it holds \(w_i\ge 0\) and \(\sum _{i=1}^n w_i=1\). Then a weighted quasi-arithmetic mean generated by \(\varphi \) and \({\mathbf {w}}\) is an idempotent aggregation function \({\mathsf {WQAMean}}_{\varphi ,{\mathbf {w}}}:[0,1]^{n}\rightarrow [0,1]\) given for \({\mathbf {x}}\in [0,1]^n\) by:

$$\begin{aligned} {\mathsf {WQAMean}}_{\varphi ,{\mathbf {w}}}({\mathbf {x}}) = \varphi ^{-1}\left( \sum _{i=1}^n w_i\varphi (x_i)\right) = \varphi ^{-1}\left( {\mathbf {w}}^T \varphi ({\mathbf {x}}) \right) . \end{aligned}$$
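For concreteness, below is a minimal R sketch of how such a mean can be evaluated; the function name wqamean and the arguments phi and phi_inv (the generator and its inverse) are purely illustrative and not taken from any particular package.

    # Minimal sketch: evaluate a weighted quasi-arithmetic mean for a given
    # generator phi, its inverse phi_inv, and a weighting vector w
    wqamean <- function(x, w, phi, phi_inv) {
      stopifnot(all(w >= 0), abs(sum(w) - 1) < 1e-12)  # w must be a weighting vector
      phi_inv(sum(w * phi(x)))
    }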

Here are a few interesting cases in the above class (see also the code sketch following this list):

  • \({\mathsf {WAMean}}_{\mathbf {w}}({\mathbf {x}})=\sum _{i=1}^n w_i x_i={\mathbf {w}}^T {\mathbf {x}}\), (weighted arithmetic mean, convex combination of inputs, \(\varphi (x)=x\))

  • \({\mathsf {WHMean}}_{\mathbf {w}}({\mathbf {x}})=\frac{1}{\sum _{i=1}^n {w_i/x_i}}\), (weighted harmonic mean, \(\varphi (x)=1/x\))

  • \({\mathsf {WGMean}}_{\mathbf {w}}({\mathbf {x}})=\prod _{i=1}^n x_i^{w_i}\), (weighted geometric mean, \(\varphi (x)=\log x\))

  • \({\mathsf {PMean}}_{r,{\mathbf {w}}}({\mathbf {x}})=\left( \sum _{i=1}^n w_i x_i^r \right) ^{1/r}\) for some \(r\ne 0\), (weighted power mean, \(\varphi (x)=x^r\))

  • \({\mathsf {EMean}}_{\gamma ,{\mathbf {w}}}({\mathbf {x}})=\frac{1}{\gamma } \log \left( \sum _{i=1}^n w_i e^{\gamma x_i}\right) \) for some \(\gamma \ne 0\). (weighted exponential mean, \(\varphi (x)=e^{\gamma x}\))
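Assuming the illustrative wqamean helper sketched after Definition 2, these special cases can be obtained by plugging in the respective generators and their inverses, e.g.:

    x <- c(0.2, 0.6, 0.9)
    w <- c(0.5, 0.3, 0.2)
    wqamean(x, w, identity, identity)                    # weighted arithmetic mean
    wqamean(x, w, function(t) 1 / t, function(t) 1 / t)  # weighted harmonic mean
    wqamean(x, w, log, exp)                              # weighted geometric mean
    wqamean(x, w, function(t) t^3, function(t) t^(1/3))  # weighted power mean, r = 3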

Let us presume that we observe \(m\ge n\) input vectors \({\mathbf {X}}=[{\mathbf {x}}^{(1)},\dots ,{\mathbf {x}}^{(m)}]\in [0,1]^{n\times m}\) together with m desired output values \({\mathbf {Y}}=[y^{(1)},\dots ,y^{(m)}]\in [0,1]^{1\times m}\) and that we would like to fit a model that determines the best functional relationship between the input values and the desired outputs. Generally, such a task is referred to as regression in machine learning. Nevertheless, classical regression models do not guarantee any preservation of important algebraic properties like the mentioned nondecreasingness or idempotence. Therefore, in our case, for a fixed generator function \(\varphi \), we shall focus on the task concerning fitting a weighted quasi-arithmetic mean \({\mathsf {WQAMean}}_{\varphi ,{\mathbf {w}}}\), compare, e.g., [4, 7, 12, 21], to an empirical data set.

Given a loss function \(E:\mathbb {R}^m\rightarrow [0,\infty )\) that is strictly decreasing towards \({\mathbf {0}}\), \(E(0,\dots ,0)=0\), the task of our interest may be expressed as an optimization problem:

$$ \mathrm {minimize}\ E\left( \varphi ^{-1}\left( \sum _{i=1}^n w_i \varphi (x_i^{(1)})\right) - y^{(1)}, \dots , \varphi ^{-1}\left( \sum _{i=1}^n w_i \varphi (x_i^{(m)})\right) - y^{(m)} \right) $$

with respect to \({\mathbf {w}}\), under the constraints that \({\mathbf {1}}^T{\mathbf {w}}=1\) and \({\mathbf {w}}\ge {\mathbf {0}}\). Typically, E is an \(L_p\) norm, in particular: \(E(e_1,\dots ,e_m)=\sqrt{\sum _{i=1}^m e_i^2}\) (least squared error fitting, LSE), \(E(e_1,\dots ,e_m)=\sum _{i=1}^m |e_i|\) (least absolute deviation fitting, LAD), or \(E(e_1,\dots ,e_m)=\bigvee _{i=1}^m |e_i|\) (least maximum absolute deviation fitting, LMD).
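For illustration, these three error measures may be written down in R as follows (a minimal sketch; the function names lse, lad, and lmd are ours):

    # The three loss functions named above, each applied to a vector of residuals e
    lse <- function(e) sqrt(sum(e^2))   # least squared error
    lad <- function(e) sum(abs(e))      # least absolute deviation
    lmd <- function(e) max(abs(e))      # least maximum absolute deviation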

In the weighted arithmetic mean case \((\varphi (x)=x)\), it is well-known that an LSE fit can be expressed as a quadratic programming task, and both LAD and LMD fits may be solved by introducing a few auxiliary variables and then by applying some linear programming solvers, see [6, 9, 12].
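As an illustration, here is a minimal R sketch of the LSE fit in the \(\varphi (x)=x\) case expressed as a quadratic program via the quadprog package. The function name fit_wamean_lse is ours; we assume \({\mathbf {X}}\) is an \(n\times m\) matrix, \({\mathbf {Y}}\) holds the m desired outputs, and \({\mathbf {X}}{\mathbf {X}}^T\) is positive definite (as solve.QP requires).

    # Sketch: LSE fit of a weighted arithmetic mean as a quadratic program,
    # i.e., minimize 0.5 w'(X X')w - (X Y')'w  subject to  1'w = 1, w >= 0
    fit_wamean_lse <- function(X, Y) {
      n <- nrow(X)
      Dmat <- X %*% t(X)                       # quadratic term
      dvec <- as.numeric(X %*% as.numeric(Y))  # linear term
      Amat <- cbind(rep(1, n), diag(n))        # columns: 1'w = 1, then w_i >= 0
      bvec <- c(1, rep(0, n))
      quadprog::solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
    }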

More generally, for arbitrary but fixed \(\varphi \) (note that if \(\varphi \) is unknown, one may rely on spline functions to model the generator, see [2, 3, 5, 6, 9, 10]), the weight fitting task has up to now been solved approximately via a technique called linearization, compare, e.g., [5, 6, 8, 9, 19]. If \(y^{(j)}\) is not subject to any measurement error, i.e., if \(\varphi ^{-1}\left( \sum _{i=1}^n w_i \varphi (x_i^{(j)})\right) = y^{(j)}\), then instead of minimizing a function of \(\varphi ^{-1}\left( \sum _{i=1}^n w_i \varphi (x_i^{(j)})\right) - y^{(j)}\) we may consider a function of \(\sum _{i=1}^n w_i \varphi (x_i^{(j)}) - \varphi (y^{(j)})\). Thus, the input and output values can be transformed prior to applying a weight fitting procedure, and then we may proceed in the same manner as when \(\varphi (x)=x\) (and in fact deal with a linear interpolation problem). However, in practice the assumption of error-free outputs rarely holds.
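Implementation-wise, the linearized fit simply reuses the arithmetic-mean procedure on the transformed data; a sketch reusing the illustrative fit_wamean_lse from above (assuming phi is vectorized, as base R functions such as log are):

    # Sketch: linearization -- transform X and Y by phi, then fit as if phi(x) = x
    fit_wqamean_linearized <- function(X, Y, phi) {
      fit_wamean_lse(phi(X), phi(Y))
    }
    # e.g., for the weighted geometric mean:
    # w_hat <- fit_wqamean_linearized(X, Y, log)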

What is more, as noted recently in [12], we usually observe that a model may be overfit to a training data set and thus perform poorly on test or validation samples. Also, sometimes we would like to fit a function which is nondecreasing and idempotent, but the input values need to be properly normalized prior to aggregation so that these two important properties are meaningful.

The aim of this two-part paper is to complement the mentioned results (and extend the preliminary outcomes listed in [12]) concerning fitting of weighted quasi-arithmetic means (weighted arithmetic means in particular). In Sect. 2 we discuss possible ways to fit a weighted quasi-arithmetic mean without relying on the linearization technique. Moreover, we perform various numerical experiments that enable us to indicate in which cases linearization leads to a significant decrease in the fit quality. In Sect. 3 we discuss different ways of regularizing a model so as to prevent overfitting. One of the possible approaches consists of adding a special penalty, which is a common procedure in machine learning. However, as parameters of our model fulfill specific constraints (non-negative weights adding up to 1), other ways are possible in our setting too. Section 4 concludes this part of the contribution.

Moreover, in the second part [1] we deal with the problem of properly normalizing (transforming) discordant input values in such a way that idempotent aggregation functions may be fit. We present an application of such a procedure in a classification task dealing with the identification of pairs of similar R [17] source code chunks.

2 Linearization

From now on let us assume that \(E(e_1,\dots ,e_m)=\sqrt{\sum _{i=1}^m e_i^2}\), i.e., we would like to find a least squares error fit. Such an approach is perhaps the most common one in the machine learning literature [14]. What is more, most of the ideas presented in this paper can quite easily be applied in other settings.

Torra in [19, 20] (compare also, e.g., [6, 8, 9]) already discussed weighted quasi-arithmetic mean fitting tasks. Nevertheless, it was noted that the problem is difficult in general, so one may simplify it by assuming that the desired outputs are not subject to errors. In such a case, noting that a fixed generator function \(\varphi \) is surely invertible, we have for all j:

$$ \sum _{i=1}^n w_i \varphi (x_i^{(j)}) = \varphi (y^{(j)}). $$

Using this assumption, instead of minimizing:

$$ \Vert \varphi ^{-1}\left( {\mathbf {w}}^T \varphi ({\mathbf {X}}) \right) - {\mathbf {Y}} \Vert _2^2 $$

one may decide to minimize a quite different (in general) goodness of fit measure:

$$ \Vert {\mathbf {w}}^T \varphi ({\mathbf {X}}) - \varphi ({\mathbf {Y}}) \Vert _2^2. $$

Such an approach is often called linearization of inputs.

Let us suppose, however, that we would like to solve the original weight fit problem and not the simplified (approximate) one.

Example 1

([12]). Suppose that \(n=5\) and we are given \(m=9\) toy data points as below. Here \({\mathbf {Y}}\) was generated using \({\mathbf {w}}=(0.33, 0.43, 0.10, 0.08, 0.06)\) and \(\varphi (x)=x^2\), with white noise added (\(\sigma =0.05\)).

\(j\)             1     2     3     4     5     6     7     8     9
\(x_1^{(j)}\)   0.12  0.48  0.65  0.07  0.37  0.22  0.29  0.57  0.84
\(x_2^{(j)}\)   0.73  0.41  0.45  0.79  0.92  0.23  0.90  0.40  0.57
\(x_3^{(j)}\)   0.43  0.84  0.70  0.96  0.81  0.86  0.72  0.53  0.42
\(x_4^{(j)}\)   0.52  0.75  0.48  0.40  0.62  0.28  0.80  0.92  0.79
\(x_5^{(j)}\)   0.69  0.70  0.24  0.22  0.92  0.34  0.15  0.50  0.50
\(y^{(j)}\)     0.65  0.58  0.70  0.51  0.82  0.56  0.70  0.64  0.75

Here are the true \(\mathfrak {d}_1\), \(\mathfrak {d}_2\), and \(\mathfrak {d}_\infty \) error measures in the case of the linearized and the exact LSE and LAD minimization tasks.

E                      \(\mathfrak {d}_1\)   \(\mathfrak {d}_2\)   \(\mathfrak {d}_\infty \)
LAD – linearization    0.7385    0.4120    0.2798
LSE – linearization    0.7423    0.2859    0.1626
LAD – optimal          0.7157    0.3170    0.2044
LSE – optimal          0.7587    0.2817    0.1501

In the above example, the differences are relatively small, but not negligible. While the use of the linearization technique for least-squared error fitting of quasi-arithmetic means will often lead to reliable results, there are clearly some situations where such a technique may not be justified. Fitting to the transformed data set \(\varphi (\mathbf X), \varphi (\mathbf Y)\) essentially stretches the space along which the residuals are distributed, and for some generator functions this has a larger impact than for others. As an example, consider the geometric mean with generator \(\varphi (x) = \log x\). For lower values of y, differences in the transformed residuals can become disproportionately large and pull the weights towards these data points. Further on we shall perform a few numerical experiments to indicate the generator functions that lead to much greater discrepancies.

2.1 Algorithms

We aim to:

$$\begin{aligned} \mathrm {minimize}\ \sum _{j=1}^m \left( \varphi ^{-1}\left( \sum _{i=1}^n w_i \varphi \left( x_i^{(j)}\right) \right) - y^{(j)} \right) ^2 \quad \text {w.r.t. }{\mathbf {w}} \end{aligned}$$

subject to \({\mathbf {w}}\ge {\mathbf {0}}\) and \({\mathbf {1}}^T {\mathbf {w}}=1\). By homogeneity and triangle inequality of \(\Vert \cdot \Vert _2\) we have that this is a convex optimization problem.

A Solution Based on a Nonlinear Solver. First of all, we may consider the above as a generic nonlinear optimization task. To drop the constraints on \({\mathbf {w}}\), we can use an approach considered (in a different context) by Filev and Yager [11], see also [20]. We take a different parameter space, \(\varvec{\lambda }\in \mathbb {R}^n\), such that:

$$ w_i = \frac{\exp (\lambda _i)}{\sum _{k=1}^n \exp (\lambda _k)}. $$

Assuming that \(\varphi ^{-1}\) is differentiable, let us determine the gradient \(\nabla E(\varvec{\lambda })\). For any \(k=1,\dots ,n\) it holds:

$$\begin{aligned}&\frac{\partial }{\partial \lambda _k} E(\varvec{\lambda }) = 2\frac{\exp (\lambda _k)}{\sum _{i=1}^n \exp (\lambda _i)} \sum _{j=1}^m \left( \varphi ^{-1}\left( \frac{\sum _{i=1}^n \exp (\lambda _i) \varphi \left( x_i^{(j)}\right) }{\sum _{i=1}^n \exp (\lambda _i)}\right) - y^{(j)} \right) \\&\cdot (\varphi ^{-1})'\left( \frac{\sum _{i=1}^n \exp (\lambda _i) \varphi \left( x_i^{(j)}\right) }{\sum _{i=1}^n \exp (\lambda _i)} \right) \cdot \left( \varphi \left( x_k^{(j)}\right) - \frac{\sum _{i=1}^n \exp (\lambda _i) \varphi \left( x_i^{(j)}\right) }{\sum _{i=1}^n \exp (\lambda _i)} \right) . \end{aligned}$$

Assuming that \({\mathbf {Z}}={\mathbf {w}}^T\varphi ({\mathbf {X}})\) and \({\mathbf {w}}=\exp (\varvec{\lambda })/{\mathbf {1}}^T\exp (\varvec{\lambda })\), we have:

$$\begin{aligned} \nabla E(\varvec{\lambda })= & {} 2\cdot {\mathbf {w}}\cdot \Bigg ( \left( \left( \varphi ^{-1}({\mathbf {Z}})-{\mathbf {Y}}\right) \cdot (\varphi ^{-1})'\left( {\mathbf {Z}}\right) \right) \times \left( \varphi ({\mathbf {X}})^T-{\mathbf {Z}}\right) \Bigg ), \end{aligned}$$

where \(\cdot , -\) stand for elementwise vectorized multiplication and subtraction, respectively, \(\times \) denotes matrix multiplication, and \(\varphi ({\mathbf {X}})^T-{\mathbf {Z}}\) means that we subtract \({\mathbf {Z}}\) from each column in \(\varphi ({\mathbf {X}})^T\). The solution may be computed using, e.g., the quasi-Newton nonlinear optimization method of Broyden, Fletcher, Goldfarb, and Shanno (the BFGS algorithm, see [16]). However, let us note that under the mentioned reparametrization the BFGS algorithm may occasionally fail to converge, as the solution is no longer unique (the weights are invariant to adding the same constant to all \(\lambda _i\)).
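A minimal R sketch of this approach, using optim()'s BFGS method with numerical (rather than the above analytic) gradients; the function name fit_wqamean_bfgs is ours, and phi and phi_inv are assumed to be vectorized:

    # Sketch: exact (non-linearized) LSE fit via the softmax reparametrization
    # w = exp(lambda) / sum(exp(lambda)) and the BFGS method of optim()
    fit_wqamean_bfgs <- function(X, Y, phi, phi_inv) {
      n <- nrow(X)
      y <- as.numeric(Y)
      obj <- function(lambda) {
        w <- exp(lambda) / sum(exp(lambda))
        z <- as.numeric(t(w) %*% phi(X))     # w^T phi(X), one value per observation
        sum((phi_inv(z) - y)^2)              # squared residuals on the original scale
      }
      res <- optim(rep(0, n), obj, method = "BFGS")
      exp(res$par) / sum(exp(res$par))       # map back onto the weight simplex
    }
    # e.g., for the weighted power mean with r = 2:
    # w_hat <- fit_wqamean_bfgs(X, Y, function(t) t^2, sqrt)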

Another possible way of solving the above problem would be to rewrite the objective as a function of \(n-1\) variables \(v_1,\dots ,v_{n-1}\) such that \(w_i=v_i\) for \(i=1,\dots ,n-1\) and \(w_n=1-\sum _{i=1}^{n-1} v_i\). This is a constrained optimization problem, but the constraints on \({\mathbf {v}}\) are linear: \({\mathbf {v}}\ge {\mathbf {0}}\) and \({\mathbf {1}}^T {\mathbf {v}} \le {1}\). With generic nonlinear solvers, such an optimization task is usually handled by adding an appropriate barrier function term (e.g., a logarithmic barrier, see [16]) to the objective function.

Compensation Factors in Linearization. Another possible way is to apply a compensation factor such that for any residual in the linearization technique, \(v^{(j)} = \left( \sum _{i=1}^n w_i \varphi (x_i^{(j)})\right) - \varphi (y^{(j)})\), we estimate the true residual \(r^{(j)} = \varphi ^{-1}\left( \sum _{i=1}^n w_i \varphi (x_i^{(j)})\right) - y^{(j)}\).

We begin with an estimate of our true residual, which we denote by \(r_\mathrm {est}\). The estimated residual for any known \(v^{(j)}\) is then calculated as:

$$\begin{aligned} \hbox {est}(r^{(j)}) = \frac{v^{(j)}r_\mathrm {est}}{\varphi (y^{(j)}+r_\mathrm {est}) - \varphi (y^{(j)})}. \end{aligned}$$

In other words, we calculate the average rate of change between \(y^{(j)}\) and \(y^{(j)}+r_\mathrm {est}\) and then use the reciprocal of this as our compensation factor. A visual illustration of this process is shown in Fig. 1.
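In code, this amounts to dividing the linearized residual by the average rate of change of \(\varphi \) over \([y^{(j)}, y^{(j)}+r_\mathrm {est}]\); a small R sketch (the helper name compensate is ours):

    # Sketch: scale a linearized residual v by the reciprocal of the average rate of
    # change of phi between y and y + r_est, estimating the residual on the original scale
    compensate <- function(v, y, r_est, phi) {
      v * r_est / (phi(y + r_est) - phi(y))
    }
    # e.g., compensate(0.05, 0.2, 0.1, log)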

Fig. 1. Illustration of how the compensation factor is calculated: from the generating function \(\varphi \) we estimate \(r^{(j)}\) using the average rate of change between \(y^{(j)}\) and \(y^{(j)}+r_\mathrm {est}\).

Where \(y^{(j)}+r_\mathrm {est}\) is outside our domain [0, 1], we can instead use the average rate of change between \(y^{(j)}\) and the boundary point (or a point very close to the boundary if \(\varphi \) is unbounded there).

Obviously the average rate of change will differ depending on whether \(v^{(j)}\) is positive or negative, and so we can split \(v^{(j)}\) into its positive and negative components so that a different compensation factor can be applied.

We let \(v^{(j)} = v^{(j)}_+ + v^{(j)}_-\), where \(v^{(j)}_+\ge 0\) and \(v^{(j)}_-\le 0\), and use these components as decision variables in our quadratic programming task. We optimize with respect to them and add the constraints:

$$ \left( \sum _{i=1}^n w_i \varphi (x_i^{(j)})\right) + v^{(j)}_+ + v^{(j)}_-= \varphi (y^{(j)}).$$

In summary, we have the following quadratic programming task.

$$\begin{aligned} \mathrm {minimize}&\sum \limits _{j=1}^m \left( \frac{v^{(j)}_+r_\mathrm {est}}{\varphi (y^{(j)}+r_\mathrm {est}) - \varphi (y^{(j)})}\right) ^2 + \left( \frac{v^{(j)}_-r_\mathrm {est}}{\varphi (y^{(j)}-r_\mathrm {est}) - \varphi (y^{(j)})} \right) ^2 \quad \text {w.r.t. }{\mathbf {w}},{\mathbf {v}}_{-},{\mathbf {v}}_{+}\\ \nonumber \hbox {such that}&\sum \limits _{i=1}^n w_i =1, w_i \ge 0, i = 1, \ldots , n, \\ \nonumber&\left( {\sum }_{i=1}^n w_i \varphi (x_i^{(j)})\right) + v^{(j)}_+ + v^{(j)}_-= \varphi (y^{(j)}), \\ \nonumber&v^{(j)}_+ \ge 0, -v^{(j)}_- \ge 0, j=1,\dots ,m. \end{aligned}$$

Since the usefulness of this method will depend on \(r_\mathrm {est}\), we can also set up a bilevel optimization problem such that it is optimized for the given training set. Nevertheless, note that this time we deal with an approximate method. The following experiments show that the method is useful for compensating for the stretching effect of the generating function and also help us identify some specific instances where linearization by itself performs poorly.

2.2 Experiments

For each of the generating functions \(\varphi (x)=x^2, x^3, x^{1/2}, \log x\) we created data sets with \(m=20\) and \(m=100\) test points. For each trial, we generated \(\mathbf w\) randomly and then after calculating the desired output values we added Gaussian noise with \(\sigma = 0.05\) and 0.1. We then measured the LSE using:

  (i) the linearization technique, where the data are transformed using \(\varphi (\mathbf X), \varphi (\mathbf Y)\), i.e., we minimize the sum of \(\left( \left( \sum _{i=1}^n w_i \varphi (x_i^{(j)})\right) - \varphi (y^{(j)})\right) ^2\),

  (ii) the method proposed here with \(r_\mathrm {est}=0.1\),

  (iii) the method proposed here with \(r_\mathrm {est}\) optimized,

  (iv) a general nonlinear optimization solver.

The \(\mathbf x^{(j)}\) vectors were generated both from the uniform distribution on \([0,1]^n\) and from an exponential distribution scaled to the interval [0, 1]. The uniform data would be expected to result in y values distributed around the middle of the interval, while exponentially distributed data would often result in outputs closer to the lower end of the interval. We expect the latter case to result in worse performance for linearization.

After obtaining the fitted weighting vectors, we calculated the total LSE and normalized these values by expressing them as a proportion of the optimal LSE obtained from the weighting vector returned by the general nonlinear solver. The results are shown in Table 1.

Table 1. Relative LSE, calculated as a proportion of the total LSE obtained using a general nonlinear solver (iv). Results represent averages over 10 trials.

A number of observations can be made from these data. Firstly, we note that linearization for \(\varphi (x) = x^2\) tended to produce increases in LSE of only about 2–7 %, regardless of the data distribution. On the other hand, \(\varphi (x) = x^3\) showed increases in LSE of 9–24 % when there were only \(m=20\) data instances.

The most dramatic results were obtained when fitting the geometric mean. For exponentially distributed input vectors, the method of linearization increased the error by up to 277 % when the noise added to y was generated using \(\sigma = 0.1\). There was large variability in these trials: in the best case linearization increased the error by 45 %, while at other times the difference was 10-fold (an LSE of 0.8752, compared with 0.0940 when \(r_\mathrm {est}=0.1\) was used, the latter being only about a 3.2 % increase on the optimal LSE). The worst results seemed to occur where the data set included y values equal to zero, as the weight associated with the lowest input for that instance is pulled up to try to reduce the very large error. The compensation factor, however, did seem to yield decent improvements even when the data were exponentially distributed.

While a setting of \(r_\mathrm {est}=0.1\) tended to result in significant improvements for the smaller noise level (\(\sigma = 0.05\)), for the larger noise level (\(\sigma = 0.1\)) the optimal values of \(r_\mathrm {est}\) were often closer to 0.2.

3 Regularization

In this section we discuss a few possible ways to prevent a model from being overfit to a training sample. In other words, we would like the model's performance not to decrease drastically on other samples of the same kind (e.g., ones following the same statistical distribution).

3.1 Tikhonov Regularization

Tikhonov regularization [18] is the basis of the ridge regression method [15]. It takes the form of an additional penalty term that depends on a scaled squared \(L_2\) norm of the weighting vector.

In our case we may consider, for some \(\lambda \), an optimization task:

$$ \mathrm {minimize}\ \sum _{j=1}^m \left( \varphi ^{-1}\left( \sum _{i=1}^n w_i \varphi (x_i^{(j)})\right) - y^{(j)} \right) ^2 + \lambda \sum _{i=1}^n w_i^2 \quad \text {w.r.t. }{\mathbf {w}} $$

subject to \({\mathbf {1}}^T{\mathbf {w}}=1\) and \({\mathbf {w}}\ge {\mathbf {0}}\). Note that due to the usual constraints on \({\mathbf {w}}\), the use of the \(L_1\) norm instead of squared \(L_2\) (like, e.g., in Lasso regression) does not make much sense at this point: we always have \(\Vert {\mathbf {w}}\Vert _1=1\).

In the simplest case (\(\varphi (x)=x\)), the above optimization problem can be written in terms of the following quadratic programming task:

$$\begin{aligned} \mathrm {minimize}\ 0.5\, {\mathbf {w}}^T ({\mathbf {X}} {\mathbf {X}}^T+\lambda {\mathbf {I}}) {\mathbf {w}} - ({\mathbf {X}}{\mathbf {Y}}^T)^T {\mathbf {w}} \quad \text {w.r.t. }{\mathbf {w}} \end{aligned}$$

subject to \({\mathbf {w}}\ge {\mathbf {0}}\), \({\mathbf {1}}^T{\mathbf {w}}=1\), which minimizes the squared error plus a \(\lambda \Vert {\mathbf {w}}\Vert _2^2\) penalty term. Note that for other generator functions \(\varphi \) we can easily incorporate an appropriate penalty into the optimization task considered in the previous section.
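A sketch of this ridge-like quadratic program in R, following the illustrative fit_wamean_lse given earlier (the function name fit_wamean_tikhonov is ours; solve.QP requires the quadratic term to remain positive definite):

    # Sketch: Tikhonov-regularized LSE fit of a weighted arithmetic mean,
    # i.e., minimize 0.5 w'(X X' + lambda I)w - (X Y')'w  subject to  1'w = 1, w >= 0
    fit_wamean_tikhonov <- function(X, Y, lambda) {
      n <- nrow(X)
      Dmat <- X %*% t(X) + lambda * diag(n)
      dvec <- as.numeric(X %*% as.numeric(Y))
      Amat <- cbind(rep(1, n), diag(n))
      bvec <- c(1, rep(0, n))
      quadprog::solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
    }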

Remark 1

Unlike in regression problems, where we always presuppose that \(\lambda \ge 0\), in our framework we are bound by additional constraints on \({\mathbf {w}}\), which, for large \(\lambda \), tend to generate weighting vectors such that \(w_i\rightarrow 1/n\). On the other hand, in the current framework the case of \(\lambda <0\) may also lead to useful outcomes. Yet, we should note that for \(\lambda \rightarrow -\infty \) we observe that \(w_j\rightarrow 1\) for some j.

Example 2

Let us consider a data set generated randomly with R (the source code listing is not reproduced here).

The data points are divided into two groups: a training sample (80 % of the observations, used to estimate the weights) and a test sample (20 %, used to compute the error). Figure 2 depicts the squared error measures as a function of the Tikhonov regularization coefficient \(\lambda \). We see that we were able to improve the error measure by ca. 9 % by using \(\lambda \simeq 4.83\).

Fig. 2. Three error measures on a test data set from Example 2 as a function of the regularization penalty \(\lambda \).

3.2 Weights Dispersion Entropy

Similarly to minimizing the sum of squared weights, maximizing the weights dispersion entropy can also have benefits; it measures the degree to which the function takes all the inputs into account, compare [9], where it is used for a different purpose. It is given by:

$$\begin{aligned} \mathsf {Disp}(\mathbf w) = - \sum \limits _{i=1}^n w_i \log w_i, \end{aligned}$$

with the convention \(0 \cdot \log 0 = 0\).
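A one-line R sketch of this measure, with the stated convention (the name disp is ours):

    # Weights dispersion (Shannon entropy of w), using the convention 0 * log(0) = 0
    disp <- function(w) -sum(ifelse(w > 0, w * log(w), 0))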

In cases where fitting results in multiple solutions, for example when there are too few data points for there to be a unique minimizer, Torra proposed the use of weights dispersion as an additional criterion to determine the best solution [19]. It is implemented as a second level of optimization: after obtaining the minimal value A of the objective in the standard least squares fitting problem, one then solves:

$$\begin{aligned} \mathrm {minimize}&\sum \limits _{i=1}^n w_i \log w_i + \lambda \left( \sum \limits _{j=1}^m \left( \varphi ^{-1}\left( \sum \limits _{i=1}^n w_i \varphi (x_{i}^{(j)})\right) - y^{(j)}\right) ^2 - A\right) ^2\ \text {w.r.t. }{\mathbf {w}}\\ \nonumber \hbox {such that}&\sum \limits _{i=1}^n w_i =1, w_i \ge 0, i = 1, \ldots , n, \end{aligned}$$

for some \(\lambda >0\).

One may alternatively consider a one-level task like:

$$ \mathrm {minimize}\ \sum _{j=1}^m \left( \varphi ^{-1}\left( \sum _{i=1}^n w_i \varphi (x_i^{(j)})\right) - y^{(j)} \right) ^2 + \lambda \sum \limits _{i=1}^n w_i \log w_i \quad \text {w.r.t. }{\mathbf {w}} $$

subject to the standard constraints.
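A sketch of this one-level variant, reusing the softmax reparametrization from the illustrative fit_wqamean_bfgs above (lambda_reg plays the role of \(\lambda \); all names are ours):

    # Sketch: one-level entropy-penalized LSE fit; for lambda_reg > 0 the term
    # lambda_reg * sum(w * log(w)) pushes the weights towards an even dispersion
    fit_wqamean_entropy <- function(X, Y, phi, phi_inv, lambda_reg) {
      n <- nrow(X)
      y <- as.numeric(Y)
      obj <- function(lam) {
        w <- exp(lam) / sum(exp(lam))        # weights are strictly positive here
        z <- as.numeric(t(w) %*% phi(X))
        sum((phi_inv(z) - y)^2) + lambda_reg * sum(w * log(w))
      }
      res <- optim(rep(0, n), obj, method = "BFGS")
      exp(res$par) / sum(exp(res$par))
    }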

Remark 2

In fact, weights dispersion and sum of squared weights are both examples of functions used to model income inequality in economics and evenness in ecology. There are numerous other functions used in these fields that could also be used as secondary objectives to achieve the task of weight regularization (e.g., the Gini index).

4 Conclusion

We have considered some practical issues concerning fitting weighted quasi-arithmetic means to empirical data using supervised learning-like approaches. First of all, we pointed out the drawbacks of the commonly applied linearization technique, which may lead to far-from-optimal solutions. Moreover, we analyzed some ways to prevent model overfitting.

Note that the discussion can be easily generalized to other error measures and fitting other aggregation functions parametrized via a weighting vector, e.g., weighted Bonferroni means and generalized OWA operators. In future applications, the compensation and regularization techniques proposed here can be used to learn more useful and informative models.