1 Introduction

Skewness-based projection pursuit looks for interesting data projections by means of skewness maximization, where the skewness of a data projection is measured by its third standardized moment. Skewness maximization is often paired with skewness removal, to ease the search for interesting structures. Skewness-based projection pursuit has been used in normality testing (Malkovich and Afifi 1973), point estimation (Loperfido 2010), cluster analysis (Loperfido 2019) and stochastic ordering (Arevalillo and Navarro 2019).

There has been a renewed interest in skewness-based projection pursuit, with focus on its parametric interpretation when the sampled distribution is either a finite mixture (Loperfido 2013, 2015, 2019), a skew-normal (Loperfido 2010; Balakrishnan and Scarpa 2012; Tarpey and Loperfido 2015), or a scale mixture of skew-normal distributions (Kim and Kim 2017; Arevalillo and Navarro 2015, 2020, 2021a, b). Loperfido (2018) used a generalized skew-normal distribution to illustrate the connection between skewness maximization and tensor eigenvectors.

A tensor is symmetric if it remains unchanged when permuting its subscripts. Its dimension is the number of distinct values that a subscript can take. The third moment \({\mathcal {M}}_{3,{\textbf{x}}}=\left\{ \textrm{E}\left( X_{i}X_{j}X_{k}\right) \right\} \in {\mathbb {R}}^{p}\times {\mathbb {R}}^{p}\times {\mathbb {R}}^{p}\) of a p-dimensional random vector \({\textbf{x}}=\left( X_{1},...,X_{p}\right) ^{\top }\) satisfying \({\textrm{E}}\left( \left| X_{i}^{3}\right| \right) <\infty \) for \(i\in \left\{ 1,...,p\right\} \) is a symmetric third-order tensor of dimension p. The third cumulant \({\mathcal {K}}_{3,{\textbf{x}} }\) of \({\textbf{x}}\) is the third moment of \({\textbf{x}}-\varvec{\mu }\), where \( \varvec{\mu }\) is the mean of \({\textbf{x}}\). The third standardized moment \({\mathcal {M}}_{3,{\textbf{z}}}\) of \({\textbf{x}}\) is the third moment of \( {\textbf{z}}=\varvec{\Sigma }^{-1/2}\left( {\textbf{x}}-\varvec{\mu }\right) \), where \(\varvec{\Sigma }\) is the positive definite covariance matrix of \({\textbf{x}}\).

Tensor unfolding is the process which rearranges the tensor’s elements into a matrix according to the index which is most meaningful for the problem at hand. Each row of the resulting matrix contains the tensor elements identified by the same value of the unfolding index. More formally, let \({\textbf{A}}_{\left( u\right) }\) be the matrix whose i-th row contains all elements of the tensor \({\mathcal {A}}\) whose u-th index equals i, with the elements within each row ordered according to the reflected lexicographic ordering of the remaining indices. For example, the third-order tensor \({\mathcal {A}}=\left\{ a_{ijk}\right\} \in {\mathbb {R}}^{3}\times {\mathbb {R}}^{4}\times {\mathbb {R}}^{2}\) can be unfolded in three different ways, yielding the matrices \({\textbf{A}}_{\left( 1\right) }\in {\mathbb {R}}^{3}\times {\mathbb {R}}^{8}\), \({\textbf{A}}_{\left( 2\right) }\in {\mathbb {R}}^{4}\times {\mathbb {R}}^{6}\) and \({\textbf{A}}_{\left( 3\right) }\in {\mathbb {R}}^{2}\times {\mathbb {R}}^{12}\), one for each unfolding mode.
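A minimal NumPy sketch of these unfoldings may help fix ideas; it assumes that the reflected lexicographic ordering of the remaining indices corresponds to Fortran-order reshaping, as in the Kolda–Bader convention.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode unfolding: row i collects all entries whose `mode`-th index
    equals i; the remaining indices follow the reflected lexicographic
    ordering (earlier indices vary fastest), i.e. Fortran-order reshaping."""
    return np.reshape(np.moveaxis(tensor, mode, 0),
                      (tensor.shape[mode], -1), order="F")

# A generic 3 x 4 x 2 tensor, as in the example above.
A = np.random.default_rng(0).standard_normal((3, 4, 2))
A1, A2, A3 = unfold(A, 0), unfold(A, 1), unfold(A, 2)
print(A1.shape, A2.shape, A3.shape)   # (3, 8) (4, 6) (2, 12)
```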

The unfolding of a symmetric tensor does not depend on the unfolding index. We therefore denote with \({{\textbf {A}}}\) the unfolding of the symmetric tensor \( {\mathcal {A}}\), without mentioning the unfolding index. The coskewness of a p-dimensional random vector \({{\textbf {x}}}\) with finite third moments and mean \( \varvec{\mu }\) is the unfolding of the third cumulant of \({{\textbf {x}}}\): \(\varvec{\Gamma }=\text {E}\left\{ \left( {{\textbf {x}}}-\varvec{\mu } \right) \otimes \left( {{\textbf {x}}}-\varvec{\mu }\right) ^{\top }\otimes \left( {{\textbf {x}}}-\varvec{\mu }\right) ^{\top }\right\} \in {\mathbb {R}} ^{p}\times {\mathbb {R}}^{p^{2}}\text {,}\) where “\(\otimes \)” denotes the Kronecker product. Similarly, the standardized coskewness of \({{\textbf {x}}}\) is the unfolding of the third standardized cumulant of \({{\textbf {x}}}\): \(\varvec{\Pi }=\text {E}\left( {{\textbf {z}}}\otimes {{\textbf {z}}}^{\top }\otimes {{\textbf {z}}}^{\top }\right) \in {\mathbb {R}}^{p}\times {\mathbb {R}}^{p^{2}}\text {.}\)

There are other ways to denote and arrange multivariate moments and cumulants (De Luca and Loperfido 2015; Doss et al. 2023; Rao Jammalamadaka et al. 2021; Pereira et al. 2022). In this paper, we favor the coskewness due to its close connection with the eigenpairs of third-order tensors.

Consider now the problem of finding the stationary points of a homogeneous polynomial of degree k in p variables, under the constraint that the sum of the squares of the variables is one. When k equals 2 the polynomial is a quadratic form and the problem reduces to the derivation of the eigenpairs of the symmetric matrix which characterizes the polynomial itself. Eigenvalues and eigenvectors of symmetric tensors generalize eigenvalues and eigenvectors of symmetric matrices to polynomials of degree greater than 2.

More formally, let \({\mathcal {A}}\) be a symmetric tensor of order k and dimension p. Also, let \({{\textbf {A}}}\) be the matrix obtained by unfolding \( {\mathcal {A}}\) along one of its modes. A scalar \(\lambda \) and a p-dimensional, nonnull vector \({{\textbf {x}}}\) are an eigenvalue and the corresponding eigenvector of \({\mathcal {A}}\) if they satisfy \({\textbf {Ax}} ^{\otimes \left( k-1\right) }=\lambda {{\textbf {x}}}\), where \({{\textbf {x}}} ^{\otimes \left( k-1\right) }\) denotes the Kronecker product \({{\textbf {x}}}\otimes \cdots \otimes {{\textbf {x}}}\) of \(k-1\) copies of \({{\textbf {x}}}\). In particular, if \({\mathcal {A}}\) is a third-order tensor, \(\lambda \) and \({{\textbf {x}}}\) satisfy \({{\textbf {A}}}\left( {{\textbf {x}}}\otimes {{\textbf {x}}}\right) =\lambda {{\textbf {x}}}\). The eigenvectors of a tensor are the stationary points of the homogeneous polynomial uniquely associated with the tensor itself.
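In code, the defining relation can be checked, and a dominant eigenpair approximated, by a higher-order power iteration on the unfolded tensor. The sketch below is only illustrative: the plain iteration is not guaranteed to converge for every symmetric tensor (shifted variants such as SS-HOPM are usually preferred), and the number of iterations is an arbitrary choice.

```python
import numpy as np

def tensor_apply(A_unf, x):
    """A(x ⊗ x) for a symmetric third-order tensor given its p x p^2 unfolding."""
    return A_unf @ np.kron(x, x)

def dominant_eigenpair(A_unf, n_iter=200, seed=0):
    """Plain higher-order power iteration; shifted variants (SS-HOPM)
    are often preferred because their convergence is guaranteed."""
    p = A_unf.shape[0]
    x = np.random.default_rng(seed).standard_normal(p)
    x /= np.linalg.norm(x)
    for _ in range(n_iter):
        y = tensor_apply(A_unf, x)
        x = y / np.linalg.norm(y)
    lam = x @ tensor_apply(A_unf, x)   # generalized Rayleigh quotient
    return lam, x
```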

Lim (2005) and Qi (2005) independently introduced tensor eigenvalues and tensor eigenvectors. Sturmfels (2016) thoroughly reviews the topic and states some open problems. Eigenvalues and eigenvectors are defined for any real tensor, including the asymmetric ones. In such cases, however, tensor eigenvalues and eigenvectors depend on the choice of the unfolding index and may not be real. Moreover, such cases are not directly connected to skewness-based projection pursuit and are therefore ignored in the rest of the paper.

The tensor eigenvalue with the largest absolute value is the dominant tensor eigenvalue, while the associated tensor eigenvector of unit length is the dominant tensor eigenvector. The constraint on the eigenvector’s norm is necessary because if \({\mathcal {A}}\) is a symmetric tensor of order k and \(\lambda \) is an eigenvalue of \({\mathcal {A}}\), then \( \lambda c^{k-2}\) is the eigenvalue of \({\mathcal {A}}\) associated with the eigenvector \(c{{\textbf {x}}}\), where c is a nonnull scalar. Clearly, this constraint is not necessary for ordinary matrix eigenvectors nor for base eigenvectors, that is tensor eigenvectors associated with null eigenvalues. As an example, let \({\mathcal {A}}=\left\{ a_{ijk}\right\} \) be a tensor of order 3 and dimension 3 such that \(a_{ijk}\) equals one when the indices i, j and k differ from each other, and zero otherwise. Its unfolding is

$$\begin{aligned} {{\textbf {A}}}=\left( \begin{array}{ccccccccc} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 \\ 0 &{} 1 &{} 0 &{} 1 &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \end{array} \right) . \end{aligned}$$

As shown in Loperfido (2018), the dominant eigenvector and the dominant eigenvalue of \({\mathcal {A}}\) are

$$\begin{aligned} {{\textbf {v}}}=\frac{1}{\sqrt{3}}\left( \begin{array}{c} 1 \\ 1 \\ 1 \end{array} \right) \text { and }\lambda =\frac{2}{\sqrt{3}}\text {.} \end{aligned}$$

Other, nondominant eigenvectors are proportional to one of the following vectors:

$$\begin{aligned} \left( \begin{array}{r} 1 \\ -1 \\ 1 \end{array} \right) \text {,}\left( \begin{array}{r} -1 \\ 1 \\ 1 \end{array} \right) \text {, }\left( \begin{array}{r} 1 \\ 1 \\ -1 \end{array} \right) \text {. } \end{aligned}$$

Base eigenvectors, that is tensor eigenvectors associated with null tensor eigenvalues, are proportional to one of the following vectors:

$$\begin{aligned} \left( \begin{array}{r} 1 \\ 0 \\ 0 \end{array} \right) \text {,}\left( \begin{array}{r} 0 \\ 1 \\ 0 \end{array} \right) \text {, }\left( \begin{array}{r} 0 \\ 0 \\ 1 \end{array} \right) \text {. } \end{aligned}$$
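These eigenpairs are easy to verify numerically from the unfolding displayed above; the short check below, a NumPy sketch, confirms that A(x ⊗ x) = λx holds for a dominant, a nondominant and a base eigenvector.

```python
import numpy as np

# Unfolding of the tensor with a_ijk = 1 when i, j and k are all distinct.
A = np.array([[0, 0, 0, 0, 0, 1, 0, 1, 0],
              [0, 0, 1, 0, 0, 0, 1, 0, 0],
              [0, 1, 0, 1, 0, 0, 0, 0, 0]], dtype=float)

def apply3(A, x):                                  # A(x ⊗ x)
    return A @ np.kron(x, x)

v = np.ones(3) / np.sqrt(3)                        # dominant eigenvector
print(np.allclose(apply3(A, v), 2 / np.sqrt(3) * v))       # True

u = np.array([1.0, -1.0, 1.0]) / np.sqrt(3)        # a nondominant eigenvector
print(np.allclose(apply3(A, u), (u @ apply3(A, u)) * u))   # True

e1 = np.array([1.0, 0.0, 0.0])                     # a base eigenvector
print(np.allclose(apply3(A, e1), 0.0))                     # True
```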

The connection between dominant tensor eigenpairs and skewness-based projection pursuit becomes apparent when considering the directional skewness of a random vector, that is the maximal skewness achievable by a linear projection of the random vector itself:

$$\begin{aligned} \gamma _{D}\left( {{\textbf {x}}}\right) =\underset{{{\textbf {a}}}\in {\mathbb {S}}^{p-1} }{\max }\frac{\text {E}\left\{ \left( {{\textbf {a}}}^{\top }{{\textbf {x}}}-{{\textbf {a}}}^{\top }\varvec{\mu }\right) ^{3}\right\} }{\left( {{\textbf {a}}}^{\top } \varvec{\Sigma a}\right) ^{3/2}}\text {,} \end{aligned}$$

where \({\mathbb {S}}^{p-1}\) is the unit sphere in \({\mathbb {R}}^{p}\). As shown in Section 3 of Loperfido (2018), the projection achieving maximal skewness is an affine function of \({{\textbf {v}}}^{\top }\varvec{\Sigma }^{-1/2}{{\textbf {x}}}\), where \( {{\textbf {v}}}\) is the dominant eigenvector of the third standardized moment \( {\mathcal {M}}_{3,{{\textbf {z}}}}\) of \({{\textbf {x}}}\), while the skewness of \({{\textbf {v}}}^{\top }\varvec{\Sigma }^{-1/2}{{\textbf {x}}}\) is the dominant tensor eigenvalue \(\lambda \) of \({\mathcal {M}}_{3,{{\textbf {z}}}}\): \(\varvec{\Pi }\left( {{\textbf {v}}}\otimes {{\textbf {v}}}\right) =\lambda {{\textbf {v}}} \text {.}\) On the other hand, the third cumulant of \({{\textbf {u}}}^{\top }{{\textbf {x}}}\) is zero if \({{\textbf {u}}}\) is a base eigenvector of the third cumulant of \({{\textbf {x}}}\): \(\varvec{\Gamma }\left( {{\textbf {u}}}\otimes {{\textbf {u}}}\right) ={{\textbf {0}}}_{p} \text {,}\) where \({{\textbf {0}}}_{p}\) is the p-dimensional null vector.
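The following small simulation sketches this connection: it standardizes skewed data, estimates the standardized coskewness, extracts a dominant eigenvector by power iteration, and compares the skewness of the resulting projection with the skewness achieved along many random directions. The mixture parameters and the number of iterations are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 3
heavy = rng.random(n) < 0.15                     # a small, shifted component
X = rng.standard_normal((n, p)) + 4.0 * np.outer(heavy, np.ones(p))

Xc = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(Xc.T @ Xc / n)
S_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
Z = Xc @ S_inv_sqrt                              # standardized data

Q = np.einsum('ni,nj,nk->ijk', Z, Z, Z).reshape(p, p * p) / n   # sample coskewness of Z

v = np.ones(p) / np.sqrt(p)                      # power iteration (illustrative)
for _ in range(300):
    y = Q @ np.kron(v, v)
    v = y / np.linalg.norm(y)
lam = v @ (Q @ np.kron(v, v))                    # approximate maximal skewness

def skew(y):
    yc = y - y.mean()
    return np.mean(yc ** 3) / np.mean(yc ** 2) ** 1.5

dirs = rng.standard_normal((2000, p))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
print(lam, skew(Z @ v))                          # the two coincide
print(max(skew(Z @ d) for d in dirs))            # typically no larger
```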

The present paper contributes to the literature on projection pursuit by using tensor concepts to investigate the statistical properties of skewness maximization related to model-based clustering and large sample inference. The results in the paper support a tensor approach to projection pursuit both in the exploratory and the inferential steps of the statistical analysis. The paper is interdisciplinary in nature, since it bridges tensor algebra and projection pursuit. The rest of the paper is organized as follows: Section 2 applies skewness maximization, and therefore dominant tensor eigenvectors, to cluster separation. Section 3 investigates the asymptotic properties of dominant and base eigenvectors of sample third-order cumulants. Section 4 illustrates the results of the previous sections with six well-known data sets. Section 5 contains some concluding remarks and hints for future research. The Appendix contains the proofs.

2 Clustering

Friedman and Tukey (1974) proposed to use projection pursuit to isolate a cluster and then to repeat the procedure on the remaining data. Independently, Hennig (2004) proposed a similar approach, aimed at clustering data where one cluster is homogeneous and well separated from the remaining, possibly more scattered, clusters. The theoretical results in this section support both proposals, when projection pursuit is based on skewness maximization.

The following proposition states that a function of a finitely supported random variable maximizes absolute skewness if it maps every outcome of the random variable which does not have minimal probability onto the same value, thus yielding a dichotomous distribution.

Proposition 1

Let X be a random variable with finite support \( {\mathcal {X}}=\left\{ x_{1},...,x_{k}\right\} \). Also, let \(Y=g(X)\) be a real, nondegenerate function of X: \(var(Y)>0\). Finally, let \(x_{j}\) be the unique element of \({\mathcal {X}}\) occurring with minimal probability: \(0<\text {Pr}\left( X=x_{j}\right) <\text {Pr}\left( X=x_{i}\right) \) for every \(i\ne j\). Then the third standardized cumulant of Y attains its maximum absolute value if and only if Y is dichotomous with \( \text {Pr}\left\{ Y=g\left( x_{j}\right) \right\} =\text {Pr}\left( X=x_{j}\right) \).

Proposition 1 is instrumental in proving Theorem 1, but it is also of interest in itself. As seen in the proof of Proposition 1 in the Appendix, the third standardized cumulant of Y is

$$\begin{aligned} \gamma _{1}\left( Y\right) =\frac{1-2p_{1}}{\sqrt{p_{1}\left( 1-p_{1}\right) }}\text {, where }p_{1}=\underset{i}{\min \; }\text {Pr}\left( X=x_{i}\right) \text {.} \end{aligned}$$

Consider now a (not necessarily random) sample \(X_{1}\),..., \(X_{n}\), whose mean, variance and skewness are

$$\begin{aligned} {\overline{X}}=\frac{1}{n}\overset{n}{\underset{i=1}{\sum }}X_{i}\text {, } S^{2}=\frac{1}{n}\overset{n}{\underset{i=1}{\sum }}\left( X_{i}-{\overline{X}} \right) ^{2}\text { and }G_{1}=\frac{1}{n}\overset{n}{\underset{i=1}{\sum }} \left( \frac{X_{i}-{\overline{X}}}{S}\right) ^{3}\text {.} \end{aligned}$$

Proposition 1 implies that \(G_{1}\) achieves its maximum value when all observations but one are equal to each other, that is when \(p_{1}=1/n\text { and }G_{1}=(n-2)/\sqrt{n-1}\text {.}\)
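A quick numerical check of this bound, for an arbitrarily chosen sample size:

```python
import numpy as np

def sample_skewness(x):
    xc = x - x.mean()
    return np.mean(xc ** 3) / np.mean(xc ** 2) ** 1.5

n = 20
x = np.zeros(n)
x[0] = 1.0                         # all observations but one coincide
print(sample_skewness(x))          # ~ 4.13
print((n - 2) / np.sqrt(n - 1))    # ~ 4.13
```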

Since the third sample standardized cumulant is a continuous function of the observations, it tends to be close to \(\left( n-2\right) / \sqrt{n-1}\) when one observation is very different from the remaining ones, while the latter are very close to each other. This reasoning motivates the use of skewness when testing for the presence of outliers, as argued by Ferguson (1961) under the more restrictive normality assumption.

A weakly symmetric distribution is a distribution whose third cumulant is a null matrix (Loperfido 2014). Symmetric distributions with finite third moments are weakly symmetric, but the converse is not necessarily true. Loperfido (2013), Loperfido (2015) and Loperfido (2019) use skewness-based projection pursuit for cluster detection when data come from finite mixtures of weakly symmetric distributions. In particular, Loperfido (2013) and Loperfido (2015) dealt with finite weakly symmetric location mixtures, that is finite mixtures of weakly symmetric distributions only differing in their means. The following theorem shows that, for finite weakly symmetric location mixtures, the component with the smallest weight is best separated from the remaining ones by the projection attaining maximal skewness, when the component’s mean is far away from the other components’ means. The theorem supports skewness maximization as a tool for the iterative detection and removal of clusters, as suggested in Friedman and Tukey (1974).

Theorem 1

Let the distribution of the random vector \({{\textbf {x}}}\) be a finite location mixture of weakly symmetric distributions with linearly independent means. Also, let the mean of the component with the smallest weight have norm \(\textit{c}>0\). Finally, let \({{\textbf {u}}}^{\top } {{\textbf {x}}}\) and \({{\textbf {v}}}^{\top }{{\textbf {x}}}\) be the best discriminating projection of \({{\textbf {x}}}\) and the projection of \( {{\textbf {x}}}\) which maximizes skewness. Then

$$\begin{aligned} \underset{c\rightarrow +\infty }{\lim }\rho ^{2}\left( {{\textbf {u}}}^{\top } {\textbf {x,v}}^{\top }{{\textbf {x}}}\right) =1\text {.} \end{aligned}$$

Theorem 1 provides the mathematical background for the following sequential clustering procedure. Data are projected onto the direction which maximizes skewness in order to separate a cluster from the others. The detected cluster is then removed from the data and the procedure is repeated until no clusters are left. Theorem 1 might also be used for detecting outliers, which might be regarded as limiting cases of small-sized, well-separated clusters (Hou and Wentzell 2014) and have been modelled by means of finite normal location mixtures (Archimbaud et al. 2018). We illustrate the use of skewness maximization for the iterative detection and removal of clusters with a mixture of three normal distributions with identical covariance matrices. Let C be the random variable representing the cluster memberships. It takes the values 1, 2 and 3 with probabilities 0.1, 0.4 and 0.5: \(\text {P}\left( C=1\right) =0.1\text {, }\text {P}\left( C=2\right) =0.4 \text {, }\text {P}\left( C=3\right) =0.5\text {.}\) Also, let \({{\textbf {x}}}|C=i\sim N\left( \varvec{\mu }_{i},{{\textbf {I}}} _{2}\right) \) be the distribution of \({{\textbf {x}}}\) in the i-th cluster, where \({{\textbf {I}}}_{2}\) is the \(2\times 2\) identity matrix and

$$\begin{aligned} \varvec{\mu }_{1}=\left( \begin{array}{c} 10 \\ 10 \end{array} \right) \text {, }\varvec{\mu }_{2}=\left( \begin{array}{r} 5 \\ -5 \end{array} \right) \text {, }\varvec{\mu }_{3}=\left( \begin{array}{r} -4 \\ 4 \end{array} \right) \text {.} \end{aligned}$$

The distribution of \({{\textbf {x}}}\) is then a location normal mixture with three components, where the mean of the component with the smallest weight has a norm much greater than the other ones: \({{\textbf {x}}}\sim 0.1\cdot N\left( \varvec{\mu }_{1},{{\textbf {I}}}_{2}\right) +0.4\cdot N\left( \varvec{\mu }_{2},{{\textbf {I}}}_{2}\right) +0.5\cdot N\left( \varvec{\mu }_{3},{{\textbf {I}}}_{2}\right) \text {.}\) The mean, the within-group covariance, the between-group covariance and the total covariance are

$$\begin{aligned} \varvec{\mu }=\left( \begin{array}{c} 1 \\ 1 \end{array} \right) \text {, }{{\textbf {W}}}=\left( \begin{array}{cc} 1 &{} 0 \\ 0 &{} 1 \end{array} \right) \text {, }{{\textbf {B}}}=\left( \begin{array}{cc} 27 &{} -9 \\ -9 &{} 27 \end{array} \right) \text { and }\varvec{\Sigma }=\left( \begin{array}{cc} 28 &{} -9 \\ -9 &{} 28 \end{array} \right) \text {.} \end{aligned}$$
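These quantities follow from the usual mixture formulas, namely \(\varvec{\mu }=\sum _{i}w_{i}\varvec{\mu }_{i}\), \({{\textbf {B}}}=\sum _{i}w_{i}\left( \varvec{\mu }_{i}-\varvec{\mu }\right) \left( \varvec{\mu }_{i}-\varvec{\mu }\right) ^{\top }\) and \(\varvec{\Sigma }={{\textbf {W}}}+{{\textbf {B}}}\); a short numerical check:

```python
import numpy as np

w = np.array([0.1, 0.4, 0.5])                        # mixing weights
mus = np.array([[10.0, 10.0], [5.0, -5.0], [-4.0, 4.0]])
W = np.eye(2)                                        # common within-group covariance

mu = w @ mus                                         # overall mean
D = mus - mu                                         # centred component means
B = (w[:, None] * D).T @ D                           # between-group covariance
Sigma = W + B                                        # total covariance

print(mu)      # [1. 1.]
print(B)       # [[27. -9.] [-9. 27.]]
print(Sigma)   # [[28. -9.] [-9. 28.]]
```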

Fisher’s discriminating direction is the dominant eigenvector of the matrix

$$\begin{aligned} \varvec{\Sigma }^{-1}{{\textbf {B}}}=\left( \begin{array}{cc} 28 &{} -9 \\ -9 &{} 28 \end{array} \right) ^{-1}\left( \begin{array}{cc} 27 &{} -9 \\ -9 &{} 27 \end{array} \right) =\frac{1}{703}\left( \begin{array}{cc} 675 &{} -9 \\ -9 &{} 675 \end{array} \right) \approx \left( \begin{array}{cc} 0.960 &{} -0.013 \\ -0.013 &{} 0.960 \end{array} \right) \text {,} \end{aligned}$$

which is proportional to the bidimensional vector of ones \({{\textbf {1}}}_{2}\). The Fisher linear discriminant projection is \({{\textbf {1}}} _{2}^{\top }{{\textbf {x}}}\), and it is also the linear projection which best separates Cluster 1 from Clusters 2 and 3 merged together: \({\varvec{1}}_{2}^{\top }{} {\textbf {x}}\sim 0.1\cdot N\left( 20,2\right) +0.9\cdot N\left( 0,2\right) \text {.}\) The coskewness of \({{\textbf {x}}}\) and the positive definite square root of the concentration matrix \(\varvec{\Sigma }^{-1}\) are

$$\begin{aligned} cos\left( {{\textbf {x}}}\right) =\left( \begin{array}{rrrr} -29.61 &{} 6.39 &{} 6.39 &{} 42.39 \\ 6.39 &{} 42.39 &{} 42.39 &{} -65.61 \end{array} \right) \text { and }\varvec{\Sigma }^{-1/2}=\left( \begin{array}{cc} 0.197 &{} 0.033 \\ 0.033 &{} 0.197 \end{array} \right) \text {.} \end{aligned}$$

The standardized coskewness of \({{\textbf {x}}}\) is

$$\begin{aligned} cos\left( {{\textbf {z}}}\right) =\varvec{\Sigma }^{-1/2}cos\left( {{\textbf {x}}} \right) \left( \varvec{\Sigma }^{-1/2}\otimes \varvec{\Sigma } ^{-1/2}\right) =\left( \begin{array}{rrrr} -0.174 &{} 0.110 &{} 0.110 &{} 0.274 \\ 0.110 &{} 0.274 &{} 0.274 &{} -0.338 \end{array} \right) \text {.} \end{aligned}$$

The bidimensional vector of ones is the dominant eigenvector of the third standardized cumulant of \({{\textbf {x}}}\):

$$\begin{aligned} cos\left( {{\textbf {z}}}\right) \left( {\varvec{1}}_{2}\otimes {\varvec{1}} _{2}\right) =\left( \begin{array}{rrrr} -0.174 &{} 0.110 &{} 0.110 &{} 0.274 \\ 0.110 &{} 0.274 &{} 0.274 &{} -0.338 \end{array} \right) \left( \begin{array}{c} 1 \\ 1 \\ 1 \\ 1 \end{array} \right) =0.32\cdot {\varvec{1}}_{2}\text {.} \end{aligned}$$
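This eigen-equation is straightforward to check from the rounded entries displayed above:

```python
import numpy as np

cos_z = np.array([[-0.174, 0.110, 0.110,  0.274],
                  [ 0.110, 0.274, 0.274, -0.338]])
ones2 = np.ones(2)
print(cos_z @ np.kron(ones2, ones2))   # [0.32 0.32], i.e. 0.32 times the vector of ones
```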

As remarked in the Introduction, the projection of \({{\textbf {x}}}\) with maximal skewness is \({\varvec{1}}_{2}^{\top }\varvec{\Sigma }^{-1/2} {{\textbf {x}}}\). Since \({\varvec{1}}_{2}\) is an eigenvector of \(\varvec{\Sigma }\), it is also an eigenvector of \(\varvec{\Sigma }^{-1/2}\). The projection of \({{\textbf {x}}}\) with maximal skewness is then proportional to \({\varvec{1}}_{2}^{\top } {{\textbf {x}}}\), which coincides with the Fisher linear discriminant function.

In order to separate Cluster 2 from Cluster 3, we assume that Cluster 1 has been removed, so that we are left with the distribution \({{\textbf {x}}}|C\ne 1\sim (4/9)\cdot N\left( \varvec{\mu }_{2},{{\textbf {I}}} _{2}\right) +(5/9)\cdot N\left( \varvec{\mu }_{3},{{\textbf {I}}}_{2}\right) \text {,}\) which is a mixture, with unequal weights, of two normal distributions with a common covariance matrix. As shown in Loperfido (2013), the linear projection which maximizes skewness is \(\left( \varvec{\mu }_{2}-\varvec{\mu }_{3}\right) ^{\top }cov\left( {{\textbf {x}}}|C\ne 1\right) ^{-1}{{\textbf {x}}}\propto \varvec{\mu }_{2}^{\top } {{\textbf {x}}}\propto X_{1}-X_{2}\text {,}\) where \(X_{1}\) and \(X_{2}\) are the first and the second component of \({{\textbf {x}}}\). The projection \(X_{1}-X_{2}\) coincides, up to location and scale changes, with the Fisher linear discriminant projection. We used the projection \(X_{1}+X_{2}\) to separate the first cluster from the other two, and then the projection \(X_{1}-X_{2}\) to separate the second cluster from the third one. The example suggests that Theorem 1 might hold under more general assumptions, since \(\varvec{\mu } _{2}\) and \(\varvec{\mu } _{3}\) are proportional to each other (\(-0.2\varvec{\mu } _{2}=0.25\varvec{\mu } _{3})\), thus violating the assumptions of Theorem 1.
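A short check of this direction, using the formula just recalled and the covariance of the reduced mixture computed from its component means and common within-group covariance:

```python
import numpy as np

w2, w3 = 4 / 9, 5 / 9
mu2, mu3 = np.array([5.0, -5.0]), np.array([-4.0, 4.0])
delta = mu2 - mu3

# Covariance of the two-component mixture x | C != 1 (common within covariance I_2).
cov_cond = np.eye(2) + w2 * w3 * np.outer(delta, delta)

direction = np.linalg.solve(cov_cond, delta)
print(direction / np.abs(direction).max())   # [ 1. -1.], i.e. the projection X1 - X2
```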

3 Asymptotics

Let \({{\textbf {X}}}\) be an \(n\times p\) data matrix and let \({{\textbf {x}}}_{i}^{\top }\) be its i-th row, \(i\in \left\{ 1,...,n\right\} \). The mean, the covariance and the coskewness of \( {{\textbf {X}}}\) are

$$\begin{aligned} {{\textbf {m}}}=\frac{1}{n}\overset{n}{\underset{i=1}{\sum }}{{\textbf {x}}}_{i}\text {, }{{\textbf {S}}}=\frac{1}{n}\overset{n}{\underset{i=1}{\sum }}\left( {{\textbf {x}}}_{i}-{{\textbf {m}}}\right) \left( {{\textbf {x}}}_{i}-{{\textbf {m}}}\right) ^{\top }\text { and }{{\textbf {G}}}=\frac{1}{n}\overset{n}{\underset{i=1}{\sum }}\left( {{\textbf {x}}}_{i}-{{\textbf {m}}}\right) \otimes \left( {{\textbf {x}}}_{i}-{{\textbf {m}}}\right) ^{\top }\otimes \left( {{\textbf {x}}}_{i}-{{\textbf {m}}}\right) ^{\top }\text {.} \end{aligned}$$

Let \({{\textbf {z}}}_{i}^{\top }\) be the i-th row of the standardized data \( {{\textbf {Z}}}={{\textbf {H}}}_{n}{} {\textbf {XS}}^{-1/2}\), \(i\in \left\{ 1,...,n\right\} \), where \({{\textbf {H}}}_{n}={{\textbf {I}}}_{n}-{{\textbf {1}}}_{n}{{\textbf {1}}}_{n}^{\top }/n \) is the \(n\times n\) centring matrix, \({{\textbf {1}}}_{n}\) is the n-dimensional vector of ones, \({{\textbf {I}}}_{n}\) is the \(n\times n\) identity matrix, and \({{\textbf {S}}}^{-1/2}\) is the symmetric, positive definite square root of the sample concentration matrix \({{\textbf {S}}}^{-1}\). The standardized coskewness of \({{\textbf {X}}}\) is just the coskewness of \({{\textbf {Z}}}\):

$$\begin{aligned} {{\textbf {Q}}}=\frac{1}{n}\overset{n}{\underset{i=1}{\sum }}{{\textbf {z}}} _{i}\otimes {{\textbf {z}}}_{i}^{\top }\otimes {{\textbf {z}}}_{i}^{\top }\text {.} \end{aligned}$$
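These sample quantities can be computed directly from the data matrix; the sketch below follows the Kronecker arrangement of Section 1 and uses the symmetric inverse square root of \({{\textbf {S}}}\).

```python
import numpy as np

def sample_moments(X):
    """Sample mean m, covariance S, coskewness G and standardized
    coskewness Q of an n x p data matrix X."""
    n, p = X.shape
    m = X.mean(axis=0)
    Xc = X - m
    S = Xc.T @ Xc / n
    vals, vecs = np.linalg.eigh(S)                      # symmetric square root of S^{-1}
    S_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Z = Xc @ S_inv_sqrt
    # p x p^2 unfoldings: element (i, j*p + k) averages the products x_i x_j x_k.
    G = np.einsum('ni,nj,nk->ijk', Xc, Xc, Xc).reshape(p, p * p) / n
    Q = np.einsum('ni,nj,nk->ijk', Z, Z, Z).reshape(p, p * p) / n
    return m, S, G, Q

# Example usage on skewed simulated data (arbitrary choice of distribution).
X = np.random.default_rng(0).exponential(size=(500, 3))
m, S, G, Q = sample_moments(X)
print(G.shape, Q.shape)   # (3, 9) (3, 9)
```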

The dominant eigenvalue \(l_{1}\) of \({{\textbf {Q}}}\) is also the maximal skewness achievable by a linear projection of \({{\textbf {X}}}\). Inferential projection pursuit investigates the connections between \(l_{1}\) and its population counterpart, that is the dominant tensor eigenvalue of the third standardized moment of the underlying distribution. As mentioned in the Introduction, the first inferential use of moment-optimizing projections dates back to Malkovich and Afifi (1973), within a multivariate normality testing framework. Machado (1983) shows that these statistics have an asymptotic distribution under normality. Baringhaus and Henze (1991) relate the asymptotic distribution of the same statistics to the maximum of a Gaussian process, under the assumption of elliptical symmetry. Naito (1997) uses the results in Baringhaus and Henze (1991) and Sun (1993) to approximate the tail probabilities of a generalized moment index which includes the one proposed by Jones and Sibson (1987). Kuriki and Takemura (2008) use a geometric approach to derive exact formulae for the tail probabilities of the Malkovich and Afifi (1973) statistics and other maxima of multilinear forms. Loperfido (2018), supported by both theoretical and empirical arguments, conjectures that the asymptotic distribution of maximal skewness might be conveniently approximated by a skew-normal distribution, under the null hypothesis of normality.

All of the above papers deal with hypothesis testing, and none of them with point estimation. We address the latter inferential issue by showing that the dominant eigenpair of the third sample moment converges almost surely to its population counterpart, under mild assumptions.

Theorem 2

Let \(\lambda \) and \({{\textbf {v}}}\) be the simple, dominant tensor eigenvalue of the third moment of the p-dimensional random vector \({{\textbf {x}}}\) and the corresponding tensor eigenvector. Also, let the n-th elements of the sequences \(\left\{ {{\textbf {X}}}_{n}\right\} \), \(\left\{ {\mathcal {M}}_{n}\right\} \), \(\left\{ \lambda _{n}\right\} \) and \(\left\{ {{\textbf {v}}}_{n}\right\} \) be the \( n\times p\) data matrix whose rows are independent outcomes of \({{\textbf {x}}}\), the third moment of \({{\textbf {X}}}_{n}\), the dominant tensor eigenvalue of \( {\mathcal {M}}_{n}\) and the tensor eigenvector associated with \(\lambda _{n}\). Then \(\left\{ \lambda _{n}\right\} \) and \(\left\{ {{\textbf {v}}}_{n}\right\} \) converge almost surely to \(\lambda \) and \({{\textbf {v}}}\) as n tends to infinity: \(\lambda _{n}\overset{a.s.}{\longrightarrow }\lambda \) and \({{\textbf {v}}}_{n}\overset{a.s.}{\longrightarrow }{{\textbf {v}}}\).

Skewness-based projection pursuit is also concerned with base tensor eigenvectors, given their close connection with weakly symmetric projections, that is projections whose coskewnesses are null matrices. Weakly symmetric projections may be used before skewness-based projection pursuit as tools for data reduction, following the approach in Jones and Sibson (1987), Hui and Lindsay (2010), Ray (2010), Lindsay and Yao (2012) and Loperfido (2023). Weakly symmetric projections may also be used after skewness-based projection pursuit, to facilitate the search for interesting structures other than skewness, as proposed by Huber (1985) and Daszykowski (2007). Statistical applications of weakly symmetric projections are not limited to projection pursuit. For example, they may also be useful in multivariate mean testing (Loperfido 2014, 2019).

Weakly symmetric sampled distributions and weakly symmetric data projections are characterized by a null Mardia’s skewness (Mardia 1970). Let \( {{\textbf {x}}}\) and \({{\textbf {y}}}\) be two p-dimensional, independent and identically distributed random vectors with mean \(\varvec{\mu }\), nonsingular covariance matrix \(\varvec{\Sigma }\) and finite third moments. The Mardia’s skewness of \({{\textbf {x}}}\) (and of \({{\textbf {y}}}\)) is

$$\begin{aligned} \beta _{1,M}\left( {{\textbf {x}}}\right) =\text {E}\left[ \left\{ \left( {{\textbf {x}}}-\varvec{\mu }\right) ^{\top }\varvec{\Sigma }^{-1}\left( {{\textbf {y}}}-\varvec{\mu }\right) \right\} ^{3}\right] . \end{aligned}$$

Its sample counterpart is

$$\begin{aligned} b_{1,M}\left( {{\textbf {X}}}\right) =\frac{1}{n^{2}}\overset{n}{\underset{i=1}{ \sum }}\overset{n}{\underset{j=1}{\sum }}\left\{ \left[ \left( {{\textbf {x}}} _{i}-{{\textbf {m}}}\right) ^{\top }{{\textbf {S}}}^{-1}\left( {{\textbf {x}}}_{j}-{{\textbf {m}}}\right) \right] \right\} ^{3}\text {.} \end{aligned}$$
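The statistic can be computed from the double sum above or, equivalently (as recalled in the next paragraph), as the squared Frobenius norm of the standardized coskewness; a numerical check on simulated data:

```python
import numpy as np

def mardia_skewness(X):
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(Xc.T @ Xc / n)
    D = Xc @ S_inv @ Xc.T              # entries (x_i - m)' S^{-1} (x_j - m)
    return np.sum(D ** 3) / n ** 2

def standardized_coskewness(X):
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(Xc.T @ Xc / n)
    Z = Xc @ vecs @ np.diag(vals ** -0.5) @ vecs.T
    return np.einsum('ni,nj,nk->ijk', Z, Z, Z).reshape(p, p * p) / n

X = np.random.default_rng(2).exponential(size=(300, 3))
print(mardia_skewness(X))                          # b_{1,M}(X)
print(np.sum(standardized_coskewness(X) ** 2))     # the same value
```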

Mardia’s skewness equals the squared norm of the standardized coskewness, so that Mardia’s skewness equals zero if and only if the coskewness is a null matrix, that is under weak symmetry. In particular, a projection onto the direction of a base eigenvector of the coskewness is weakly symmetric. However, due to sampling variability, the sample coskewness might not have base eigenvectors even when the coskewness of the underlying distribution does. In such situations, almost weakly symmetric projections, that is projections having the smallest Mardia’s skewness, are intuitively appealing. The following theorem supports this approach.

Theorem 3

Let the third cumulant of the p-dimensional random vector \({{\textbf {x}}}\) have base eigenvectors constituting a linear space of dimension \(q<p\). Also, let the elements of the sequences \(\left\{ {{\textbf {X}}}_{n}\right\} \) and \( \left\{ {{\textbf {B}}}_{n}\right\} \) be \(n\times p\) data matrices whose rows are independent outcomes of \({{\textbf {x}}}\) and \(p\times q\) matrices of full rank minimizing the Mardia’s skewness of \({{\textbf {X}}}_{n}{{\textbf {B}}}_{n}\). Then each row of \(\left\{ {{\textbf {X}}}_{n}{{\textbf {B}}}_{n}\right\} \) converges almost surely to a weakly symmetric random vector.

Base matrix eigenvectors constitute a linear space, but the same does not necessarily happen for base tensor eigenvectors. As an example, consider the generalized skew-normal distribution \(2\phi \left( z_{1}\right) \phi \left( z_{2}\right) \phi \left( z_{3}\right) \Phi \left( \theta z_{1}z_{2}z_{3}\right) \), where \(\phi \left( \cdot \right) \) is the pdf of a standard normal distribution, \(\Phi \left( \cdot \right) \) is the cdf of a standard normal distribution and \(\theta \) is a nonnull, real value. As shown in Loperfido (2018, 2019), the distribution is standardized and its only nonnull third moment is \(\text {E}\left( Z_{1}Z_{2}Z_{3}\right) =\gamma =\gamma \left( \theta \right) \), a function of \(\theta \), so that its coskewness is

$$\begin{aligned} \varvec{\Gamma }=\left( \begin{array}{ccccccccc} 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} \gamma &{} 0 &{} \gamma &{} 0 \\ 0 &{} 0 &{} \gamma &{} 0 &{} 0 &{} 0 &{} \gamma &{} 0 &{} 0 \\ 0 &{} \gamma &{} 0 &{} \gamma &{} 0 &{} 0 &{} 0 &{} 0 &{} 0 \end{array} \right) \text {.} \end{aligned}$$

The base eigenvectors of \(\varvec{\Gamma }\) are

$$\begin{aligned} \left( \begin{array}{c} 1 \\ 0 \\ 0 \end{array} \right) \text {, }\left( \begin{array}{c} 0 \\ 1 \\ 0 \end{array} \right) \text { and }\left( \begin{array}{c} 0 \\ 0 \\ 1 \end{array} \right) \text {.} \end{aligned}$$

However, no nontrivial linear combination of them is a base eigenvector of \( \varvec{\Gamma }\).
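A quick numerical check, with \(\gamma \) set to an arbitrary nonzero value, confirms that the standard basis vectors are base eigenvectors while their normalized sum is not even a tensor eigenvector:

```python
import numpy as np

g = 0.3                                     # arbitrary nonzero value for gamma(theta)
Gamma = np.array([[0, 0, 0, 0, 0, g, 0, g, 0],
                  [0, 0, g, 0, 0, 0, g, 0, 0],
                  [0, g, 0, g, 0, 0, 0, 0, 0]])

def apply3(G, x):                           # Gamma(x ⊗ x)
    return G @ np.kron(x, x)

e1 = np.array([1.0, 0.0, 0.0])
u = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)  # a nontrivial combination of e1 and e2

print(apply3(Gamma, e1))   # [0. 0. 0.]   -> base eigenvector
print(apply3(Gamma, u))    # [0. 0. 0.3]  -> not proportional to u, hence not a base eigenvector
```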

4 Examples

In this section we use six well-known data sets to assess the practical usefulness of skewness-based projection pursuit as a clustering method. Each of them is divided into two groups, so the group membership of each sample unit is known. The data sets differ with respect to the skewnesses of their groups and the performance of the linear discriminant function in separating the groups themselves. We first classified the observations using the linear discriminant function, which relies on the knowledge of group memberships. Then we classified the same data with skewness-based projection pursuit, which does not rely on the knowledge of group memberships. The classification procedure based on projection pursuit consists of two steps. First, the data are projected onto the direction which maximizes their skewness. Second, the projected data are classified into two groups using k-means clustering, which is quite efficient when applied to univariate data; a sketch of the procedure is given below. As expected, the former method outperforms the latter, since it uses more information. However, the difference is small enough to encourage the use of skewness-based projection pursuit for classifying data when group memberships are unknown. We also visually inspected the data with scatterplots of the two most skewed projections, which provided further insight into the clustering structure of the data. Table 1 summarizes the performances of the two classification methods.
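A minimal sketch of the two-step procedure, with a higher-order power iteration on the sample standardized coskewness for the first step and an exact univariate 2-means split for the second; the starting value and the number of iterations are illustrative choices rather than the implementation used for the results below.

```python
import numpy as np

def max_skew_direction(X, n_iter=300, seed=0):
    """Approximate the direction maximizing sample skewness via power
    iteration on the standardized coskewness (a shifted iteration would
    guarantee convergence; the plain version suffices for illustration)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(Xc.T @ Xc / n)
    S_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    Z = Xc @ S_inv_sqrt
    Q = np.einsum('ni,nj,nk->ijk', Z, Z, Z).reshape(p, p * p) / n
    v = np.random.default_rng(seed).standard_normal(p)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        y = Q @ np.kron(v, v)
        v = y / np.linalg.norm(y)
    return S_inv_sqrt @ v                  # direction in the original coordinates

def two_means_1d(y):
    """Exact 2-means for univariate data: scan every split point."""
    ys = np.sort(y)
    best, labels = np.inf, None
    for m in range(1, len(ys)):
        cut = (ys[m - 1] + ys[m]) / 2
        g = y > cut
        if g.all() or not g.any():
            continue
        ss = np.var(y[g]) * g.sum() + np.var(y[~g]) * (~g).sum()
        if ss < best:
            best, labels = ss, g
    return labels.astype(int)

# Usage: project onto the most skewed direction, then split into two clusters.
# X = ...                                  # an n x p data matrix
# clusters = two_means_1d(X @ max_skew_direction(X))
```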

Table 1 The first and the second row of the table contain the percentages of correctly classified units by linear discrimination and skewness maximization

                         Athletes  Crabs  Cancer  Sparrows  Returns  Banknotes
Linear discrimination        92.6  100.0    97.4      65.3     57.3       78.0
Skewness maximization        74.8   80.0    77.5      55.0     55.7       63.0

Next, we give a more detailed description of the data and the classification results.

Australian athletes. The Australian Institute of Sports collected several body measurements from 202 elite athletes of both genders competing in different disciplines. Since the seminal paper by Azzalini and Dalla Valle (1996) the data are known to be skewed. We aim at classifying the 100 female athletes and the 102 male athletes by means of their body fat and lean body mass indices. The linear discriminant function correctly classifies 187 athletes, that is about \(92.6\%\) of them. Skewness-based projection pursuit correctly classifies 151 athletes, that is about \( 74.8\% \) of them. The scatterplot of the two most skewed projections (Fig. 1a) shows a clear separation of the two groups, together with their marked non-elliptical shapes.

Fig. 1 a Australian athletes (dots represent female athletes, pluses represent male athletes); b Australian crabs (dots represent blue crabs, pluses represent orange crabs); c Breast cancer (dots represent benign tumors, pluses represent malignant tumors); d Female sparrows (dots represent deceased sparrows, pluses represent survived sparrows); e Financial returns (dots represent negative signs, pluses represent positive signs); f Swiss banknotes (dots represent forged bills, pluses represent genuine bills) (colour figure online)

Australian crabs. Campbell and Mahon (1974) collected 5 morphological measurements (frontal lobe size, rear width, carapace length, carapace width, and body depth) of the blue and orange species of crabs living in Fremantle, Western Australia. More precisely, there are 100 specimens of blue crabs and 100 specimens of orange crabs. Measurements in both groups are often modelled by normal mixtures with equal or proportional covariances. The linear discriminant function correctly classifies all crabs. Skewness-based projection pursuit correctly classifies 160 crabs, that is exactly \(80\%\) of them. The separation between the two groups becomes even more apparent from the scatterplot of the two most skewed projections (Fig. 1b), which also shows a much smaller scatter in the blue crabs group.

Breast cancer. Street et al. (1993) computed ten integer-valued features from digitized images of fine needle aspirates of breast masses belonging to 699 women diagnosed with breast cancer. The features describe characteristics of the cell nuclei present in the image. The tumor was benign for 458 women in the sample, and malignant for 241. We found data in both groups to be significantly skewed. The linear discriminant function correctly classifies 681 women, that is about \(97.4\%\) of them. Skewness-based projection pursuit correctly classifies 542 women, that is about \(77.5\%\) of them. The difference in performances between the methods might be due to the presence of potential outliers, as hinted by the scatterplot of the two most skewed projections (Fig. 1c).

Female sparrows. Manly and Navarro Alberto (2016) considered total length, alar extent, length of beak and head, length of humerus, and length of keel of sternum of 49 female sparrows. Data were collected after a severe storm, which 21 of the sparrows survived. The sample sizes of both groups are too small to test the symmetry hypothesis. However, an exploratory data analysis (not reported here) hints that skewness may be negligible. The linear discriminant function correctly classifies 32 sparrows, that is about \(65.3\%\) of them. Skewness-based projection pursuit correctly classifies 27 sparrows, that is about \(55\%\) of them. The poor performance of both methods, and especially of the latter, could have been anticipated by looking at the scatterplot of the two most skewed projections (Fig. 1d), where the groups are not well separated.

Financial returns. Morgan Stanley Capital International Inc. recorded 1291 percentage logarithmic daily returns (henceforth, simply returns) in the financial markets of France, Netherlands and Spain. De Luca and Loperfido (2015) clustered the returns according to the sign of the previous day’s U.S. return, obtaining two groups of 597 and 694 returns, which were found to be significantly skewed. The linear discriminant function correctly classifies 740 returns, that is about \(57.3\%\) of them. Skewness-based projection pursuit correctly classifies 719 returns, that is about \(55.7\%\) of them. As in the previous data set, the two groups are very poorly separated in the scatterplot of the two most skewed projections (Fig. 1e), with the exception of a few outliers, which constitute a well-known stylized fact of financial returns.

Swiss banknotes. Flury (1988) reported several measurements from 100 genuine and 100 forged old Swiss 1000 franc bills. Greselin et al. (2011) focused on their width, measured on both sides, and found them to be bivariate normal in the forged group, but not in the genuine one. They also rejected the homoscedasticity hypothesis. Other statistical analyses, not shown here, clearly suggest that some skewness is present in the genuine group, but not in the forged one. The linear discriminant function correctly classifies 156 bills, that is \(78\%\) of them. Skewness-based projection pursuit correctly classifies 126 bills, that is about \(63\%\) of them. The two groups appear to be even better separated in the scatterplot of the two most skewed projections (Fig. 1f), which also hints at the presence of some possible outliers in the genuine group.

5 Conclusions

This paper investigated some connections between third-order tensor eigenvectors and skewness-based projection pursuit. The former concept belongs to multilinear algebra, while the latter belongs to multivariate analysis. The theoretical results in the paper support the use of skewness-based projection pursuit both in the exploratory and in the inferential stages of statistical analysis. The practical usefulness of the method is illustrated with six data sets which have already appeared in the statistical literature: the Australian athletes, Australian crabs, breast cancer, female sparrows, financial returns and Swiss banknotes data sets. They all suggest that skewness-based projection pursuit might be used to recover the linear discriminant function when the group memberships are unknown.

On the other hand, the above examples are limited in several ways. Firstly, they only consider two clusters, while the theorem and the example in Section 2 support the use of skewness-based projection pursuit in the presence of more clusters. Secondly, the optimal discriminant function might not be linear, as happens with two multivariate normal distributions having different means and covariances (see, e.g., Mardia et al. 1979, page 312). Thirdly, the comparison between the performances of the two approaches should not rely on the misclassification rate only, but should include other performance measures, such as the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). Space constraints prevented us from investigating these issues in the present paper, but we are planning to address them in the future by means of both real and synthetic data.

Maximally skewed projections of some well-known distributions admit simple and insightful interpretations (Arevalillo and Navarro 2019, 2020). It is then worth asking which widely used multivariate probability distributions have third-order cumulants whose eigenvectors admit a simple tractable analytical form. This would simplify both their computation and their interpretation. It would also give more insight into the asymptotic properties of skewness-based projection pursuit. Similar remarks also hold for kurtosis-based projection pursuit, which relies on kurtosis optimization and is closely related to the eigenvectors of fourth-order symmetric tensors (Loperfido 2017). Moreover, the joint use of skewness and kurtosis optimization might lead to some additional insight into data features (Arevalillo and Navarro 2021a). We are currently investigating these topics.