1 Introduction

The statistical modelling of many problems must, in most cases, account for "errors" both in the data and in the solution. These errors may be, for example, measurement errors, computation errors, or errors due to uncertainty in estimating parameters. Interval algebra provides a powerful tool for determining the effects of uncertainties or errors and for accounting for them in the final solution.

Interval mathematics deals with numbers which are not single values but sets of numbers ranging between a minimum and a maximum value. These sets are the sets of all possible determinations of the errors. A form of interval algebra appeared for the first time in the literature in (Burkill 1924), (Young 1931); then in (Sunaga 1958). Modern developments of such an algebra were started by R.E. Moore (Moore 1966). The main results may be found in (Alefeld & Herzberger 1983), (Kearfott & Kreinovich 1996), (Neumaier 1990).

The methods which have been proposed for treating errors in the data may also be applied to other kinds of data that in real life are of interval type. For example:

  • Financial data, e.g., the opening and closing values in a trading session.

  • Customer satisfaction data (expected or perceived characteristic of the quality of a product).

  • Tolerance limits in quality control.

  • Confidence intervals of estimates from sample surveys.

  • Queries on a database.

It is known that statistical methods have been primarily developed for single-valued variables. However, in real life there are many situations in which the adoption of single-valued variables causes a loss of information. This has prompted the development of new methodologies of statistical analysis for treating interval-valued variables, that is, variables that may assume not just a single value on the unit on which they have been measured, but an interval of values. Statistical indexes for interval-valued variables have been defined in (Canal & Pereira 1998) as scalar statistical summaries. These scalar indexes may cause a loss of the information inherent in the interval data. For preserving the information contained in the interval data, many researchers, and in particular Diday and his school of Symbolic Data Analysis (SDA), have developed methodologies for interval data which provide interval index solutions that sometimes appear oversized, as they include unsuitable elements. An approach which is typical for handling imprecise data is proposed by (Marino & Palumbo 2003): the centre and the radius of each considered interval, and the relations between these two quantities, are taken into account. An alternative approach for treating interval-valued variables is proposed in (Gioia & Lauro 2005); the methodology consists in using both interval algebra and optimization theory.

Methods for Factorial Analysis, and in particular for Principal Component Analysis (PCA) on interval data, have been proposed by (Cazes et al. 1997), (Chouakria 1998), (Chouakria et al. 1998), (Gioia 2001), (Lauro & Palumbo 2000), (Lauro et al. 2000), (Palumbo & Lauro 2003), (Rodriguez 2000). Statistical units described by interval data can be regarded as a special case of Symbolic Objects (SOs). In Symbolic Data Analysis (SDA), these data are represented as boxes. The purpose of the present work is the extension of Principal Component Analysis (PCA) to obtain a visualisation of such boxes on a lower dimensional space, pointing out the relationships among the variables, among the units, and between both of them. The approach that we propose, having previously analysed the applicability of the interval algebra tools (Alefeld & Herzberger 1983), (Neumaier 1990), (Kearfott & Kreinovich 1996), is to adapt the mathematical models, on the basis of classical PCA, to the case in which an interval data matrix is given. In contrast to other approaches proposed in the literature, which work on scalar recodings of the intervals using classical tools of analysis, we make extensive use of the interval algebra tools combined with some optimization techniques. The introduced methodology, named Interval Principal Component Analysis (IPCA), will embrace classical PCA as a special case.

In section 2 of the present work some definitions, notations and main results of interval algebra are introduced. In section 3 the IPCA methodology is presented. In sections 4 and 5 the interpretation of the obtained interval solutions and some numerical results on a real data set are presented.

2 Definitions, notations and basic facts

2.1 Interval algebra

An interval [a,b], with a ≤ b, is defined as the set of real numbers between a and b:

$$[a, b]=\{x / a \leq x \leq b\}$$

Degenerate intervals of the form [a,a], also named thin intervals, are equivalent to real numbers. The symbols ∈, ⊂, ∪, ∩ will be used in the common sense of set theory. For example, by [a,b] ⊂ [c,d] we mean that the interval [a,b] is included as a set in the interval [c,d]. Furthermore, [a,b]=[c,d] ⇔ a=c, b=d.

Let ℑ be the set of intervals. Thus if I∈ℑ then I=[a,b] for some a ≤ b. Let us introduce an arithmetic on the elements of ℑ. The arithmetic will be an extension of real arithmetic. If ● is one of the symbols +, −, ×, /, we define arithmetic operations on intervals by:

$$[a, b] \bullet[c, d]=\{x \bullet y / \quad a \leq x \leq b, c \leq y \leq d\}$$
((2.1.1))

except that we do not define [a,b]/[c,d] if 0 ∈ [c,d].

The sum, the difference, the product, and the ratio (when defined) between two intervals are the sets of the sums, the differences, the products, and the ratios between any two numbers taken from the first and the second interval respectively.

Let us write an equivalent set of definitions in terms of formulas for the endpoints of the resultant intervals.

Let [a,b], [c,d] be elements of ℑ; then:

$$\begin{array}{l}{[a, b]+[c, d]=[a+c, b+d]} \\ {[a, b]-[c, d]=[a-d, b-c]} \\ {[a, b] \times[c, d]=[min (a c, a d, b c, b d), max (a c, a d, b c, b d)]}\end{array}$$
((2.1.2))

if 0 ∉ [c,d], then [a,b]/[c,d] = [a,b] × [1/d, 1/c].

It can easily be proved that the addition and the product in (2.1.2) are associative and commutative. The real numbers 0 and 1 act as neutral elements for addition and product respectively. Other properties may be found in (Moore 1966).
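The endpoint formulas (2.1.2) translate directly into code. The following minimal Python sketch uses (lo, hi) tuples as the interval representation (our own convention; a rigorous implementation would also use outward rounding):

```python
# A minimal sketch of the interval arithmetic (2.1.1)-(2.1.2),
# with intervals represented as (lo, hi) tuples of floats.

def i_add(x, y):
    # [a,b] + [c,d] = [a+c, b+d]
    return (x[0] + y[0], x[1] + y[1])

def i_sub(x, y):
    # [a,b] - [c,d] = [a-d, b-c]
    return (x[0] - y[1], x[1] - y[0])

def i_mul(x, y):
    # [a,b] x [c,d] = [min(ac,ad,bc,bd), max(ac,ad,bc,bd)]
    p = [x[0]*y[0], x[0]*y[1], x[1]*y[0], x[1]*y[1]]
    return (min(p), max(p))

def i_div(x, y):
    # defined only when 0 is not in the divisor interval [c,d]
    if y[0] <= 0 <= y[1]:
        raise ZeroDivisionError("0 in divisor interval")
    return i_mul(x, (1.0 / y[1], 1.0 / y[0]))

print(i_add((1, 2), (3, 5)))   # (4, 7)
print(i_mul((-1, 2), (3, 5)))  # (-5, 10)
```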

Definition 2.1.1

A rational expression F(X1, X2,…,Xn) in the intervals X1, X2, … Xn, is a finite combination, with the interval arithmetic operations, of X1, X2, …, Xn and a finite set of constant intervals.

Theorem 2.1.1

If F(X1,X2,…,Xn) is a rational expression in the intervals X1, X2, …, Xn, then

$$X_{1}^{\prime} \subset X_{1}, \ldots, X_{n}^{\prime} \subset X_{n} \Rightarrow F\left(X_{1}^{\prime}, X_{2}^{\prime}, \ldots, X_{n}^{\prime}\right) \subset F\left(X_{1}, X_{2}, \ldots, X_{n}\right)$$

for every set of interval numbers X1, X2, …, Xn for which the interval arithmetic operations in F are defined.

From Theorem 2.1.1 it follows that, by computing a finite number of interval arithmetic operations, it is possible to bound the range of values of a real rational function over intervals of values for each of its arguments.

Proposition 2.1.1

If F(x1, …, xn) is a real rational function in which each variable xi occurs only once and only at the first power, then the corresponding interval expression F(X1,X2,…,Xn) will compute the actual range of values of F for xi in Xi: \(F\left(X_{1}, X_{2}, \ldots, X_{n}\right)=\left\{y / y=F\left(x_{1}, x_{2}, \ldots, x_{n}\right),\ x_{i} \in X_{i},\ i=1, \ldots, n\right\}\).
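The distinction between Theorem 2.1.1 (an enclosure) and Proposition 2.1.1 (the exact range) can be seen on a small example: in f(x)=x(1−x) the variable x occurs twice, so interval evaluation overestimates the true range. A sketch, assuming the (lo, hi) tuple representation introduced above:

```python
import numpy as np

def i_sub(x, y):
    # [a,b] - [c,d] = [a-d, b-c]
    return (x[0] - y[1], x[1] - y[0])

def i_mul(x, y):
    # endpoint formula from (2.1.2)
    p = [x[0]*y[0], x[0]*y[1], x[1]*y[0], x[1]*y[1]]
    return (min(p), max(p))

X = (0.0, 1.0)
# f(x) = x(1-x): x occurs twice, so Theorem 2.1.1 only guarantees an enclosure
naive = i_mul(X, i_sub((1.0, 1.0), X))

xs = np.linspace(0.0, 1.0, 1001)
vals = xs * (1 - xs)
print(naive)                   # (0.0, 1.0): a valid but oversized enclosure
print(vals.min(), vals.max())  # 0.0, ~0.25: the actual range
```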

2.2 Interval matrices

Definition 2.2.1

An n×n interval matrix is the following set:

$$X^{I}=[\underline{X}, \overline{X}]=\{X : \underline{X} \leq X \leq \overline{X}\}$$
((2.2.1))

where \(\underline{X}\) and \(\overline{X}\) are n×n matrices which verify:

$$\underline{X} \leq \overline{X}$$

The inequalities are understood componentwise.

Introducing the centre matrix and the radius matrix:

$$X_{c}=\frac{1}{2}(\underline{X}+\overline{X}), \quad \Delta=\frac{1}{2}(\overline{X}-\underline{X})$$

(2.2.1) may be expressed as follows:

$$X^{I}=\left[X_{c}-\Delta, X_{c}+\Delta\right]$$

Definition 2.2.2

An n×n interval matrix XI is called symmetric if:

$$X^{I}=X_{s}^{I}$$

where:

$$X_{s}^{I}=\left[\frac{1}{2}\left( \underline{X}+\underline{X}^{T}\right), \frac{1}{2}\left(\overline{X}+\overline{X}^{T}\right)\right ]$$

From the definition follows the characterisation:

$$X^{I} \text { is symmetric } \Leftrightarrow \underline{X} \text { and } \overline{X}\text { are symmetric }$$

Hence a symmetric interval matrix may contain non-symmetric matrices. Let us indicate by Mnp(R) the set of interval matrices of order n × p. An interval matrix XI ∈ Mnp(R) will be represented, in analogy to the case of scalar matrices, by its components as follows: XI=(Xij), where Xij is an interval.

Definition 2.2.3

  • Let XI =(Xij), YI =(Yij) ∈ Mnp(R). Then:

    $$X^{I} \pm Y^{I} :=\left(X_{i j} \pm Y_{i j}\right)$$

    defines the sum interval matrix and the difference interval matrix respectively.

  • Let XI=(Xij) ∈ Mnr(R) and YI =(Yij) ∈ Mrp(R). Then:

    $$X^{I} Y^{I} :=\left(\sum_{v=1}^{r} X_{i v} Y_{v j}\right)$$

    defines the product interval matrix (a computational sketch is given after this definition).

    In particular:

    let XI=(Xij) ∈ Mnr(R) and uI ∈ Mr1(R) (an interval vector of r interval components); then:

    $$X^{I} \boldsymbol{u}^{I}=\left(\sum_{v=1}^{r} X_{i v} u_{v}\right)$$
  • Let XIMnp(R) and K be an interval. Then:

    $$K X^{I}=X^{I} K :=\left(K X_{i j}\right).$$
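The componentwise operations of Definition 2.2.3 may be sketched as follows, assuming each interval matrix is stored as a pair (lo, hi) of NumPy arrays with lo ≤ hi (our own representation):

```python
import numpy as np

def i_mat_mul(Xlo, Xhi, Ylo, Yhi):
    """Interval matrix product of Definition 2.2.3 on (lo, hi) array pairs."""
    n, r = Xlo.shape
    p = Ylo.shape[1]
    Zlo = np.zeros((n, p)); Zhi = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            for v in range(r):
                # interval product X_iv * Y_vj by the endpoint formula (2.1.2),
                # then interval sum over v
                prods = [Xlo[i, v]*Ylo[v, j], Xlo[i, v]*Yhi[v, j],
                         Xhi[i, v]*Ylo[v, j], Xhi[i, v]*Yhi[v, j]]
                Zlo[i, j] += min(prods); Zhi[i, j] += max(prods)
    return Zlo, Zhi

# toy 2x2 interval matrix (our own data); note that transposing an interval
# matrix simply transposes both endpoint matrices
Xlo = np.array([[1.0, -0.5], [0.0, 2.0]])
Xhi = np.array([[1.5,  0.5], [0.5, 2.5]])
print(i_mat_mul(Xlo.T, Xhi.T, Xlo, Xhi))  # an enclosure of (X^I)' X^I
```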

2.3 Interval eigenvalues and interval eigenvectors

Given an interval data matrix XIMnp(R), a lot of research has been done in characterizing solutions of the following interval eigenvalues problem:

$$X^{I} \boldsymbol{u}^{I}=\lambda \boldsymbol{u}^{I}$$
((2.3.1))

which has interesting properties (Deif 1991a), (Rohn 1993), and serves a wide range of applications in physics and engineering.

In more detail, the interval eigenvalue problem (2.3.1) is solved by determining two sets \(\lambda_{\alpha}^{I}\) and \(\boldsymbol{u}_{\alpha}^{I}\) given by:

$$\lambda_{\alpha}^{I}=\left[\lambda_{\alpha}(X) : X \in X^{I}\right]\quad \text{and}\quad \boldsymbol{u}_{\alpha}^{I}=\left[\boldsymbol{u}_{\alpha}(X) : X \in X^{I}\right] \quad \alpha=1, \ldots, r$$

where (λα(X), uα(X)) is an eigenpair of X ∈ XI. The pair \(\left(\lambda_{\alpha}^{I}, \boldsymbol{u}_{\alpha}^{I}\right)\) will be the α-th eigenpair of XI; it represents the set of all α-th eigenvalues and the set of the corresponding eigenvectors of all matrices belonging to the interval matrix XI.

Definition 2.3.1

For x ∈ Rn the vector z = sign x may be defined as:

$$z_{i}=\left\{\begin{array}{rl}{1} & if \quad x_{i} \geq 0 \\ {-1} & if \quad x_{i}<0\end{array} \qquad i=1, \ldots, n\right.$$

S=diag(sgn x) will indicate the diagonal matrix with sgn x on the principal diagonal.

The above definitions are necessary to state the following theorem (Deif 1991a), which gives an important instrument for calculating the eigenvalues of an interval matrix.

Let XI be an n×n real interval matrix, with Xc and ΔX its centre and radius matrix respectively, and let uα(Xc), α=1,…,n, be the eigenvectors of Xc.

Theorem 2.3.1

If XI is symmetric and if Sα = diag(sgn uα(X)), (α=1,…,n), calculated for Xc, is constant on XI, then the eigenvalue λα of X, X ∈ XI, ranges over the interval:

$$\lambda_{\alpha}^{I}=\left[\underline{\lambda}_{\alpha}\left(X_{c}-S^{\alpha} \Delta X S^{\alpha}\right), \overline{\lambda}_{\alpha}\left(X_{c}+S^{\alpha} \Delta X S^{\alpha}\right)\right], \quad \alpha=1, \ldots, n$$
((2.3.2))
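In practice the bounds (2.3.2) are straightforward to compute. The following is a minimal Python sketch, assuming a symmetric interval matrix given by its centre Xc and radius DX (toy data, our own) and assuming the sign-invariance hypothesis of the theorem holds:

```python
import numpy as np

def interval_eigenvalues(Xc, DX):
    """Bounds (2.3.2): eigenvalue intervals of a symmetric interval matrix."""
    _, U = np.linalg.eigh(Xc)            # eigh orders eigenvalues ascending
    bounds = []
    for a in range(Xc.shape[0]):
        # S^alpha = diag(sgn u_alpha(Xc)), assumed constant over X^I
        S = np.diag(np.where(U[:, a] >= 0, 1.0, -1.0))
        lam_lo = np.linalg.eigvalsh(Xc - S @ DX @ S)[a]
        lam_hi = np.linalg.eigvalsh(Xc + S @ DX @ S)[a]
        bounds.append((lam_lo, lam_hi))
    return bounds

Xc = np.array([[2.0, 0.5], [0.5, 1.0]])    # toy centre matrix
DX = np.array([[0.1, 0.05], [0.05, 0.1]])  # toy radius matrix
print(interval_eigenvalues(Xc, DX))
```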

Theorem 2.3.1 allows one to calculate the interval \(\lambda_{\alpha}^{I}\) in which the α-th eigenvalue of the matrix X, X∈XI, lies. The theorem gives an interesting interpretation of \(\lambda_{\alpha}^{I}\), which may be regarded as the α-th eigenvalue of the given interval matrix. The further novelty lies in the fact that, while previously the problem of the search for the bounds of \(\lambda_{\alpha}^{I}\) was limited to their "estimate", now the approach is different inasmuch as it is possible to determine exactly the interval \(\lambda_{\alpha}^{I}\) without resorting to any approximation. The interval eigenvectors may be computed by solving a linear programming problem as described in (Seif et al. 1992):

Theorem 2.3.2

A necessary and sufficient condition for u α (X) to be an eigenvector of X corresponding to λ α (X) is:

$$-\Delta X\left|\boldsymbol{u}_{\alpha}(X)\right| \leq\left(\lambda_{\alpha}(X) I-S^{\alpha} X_{c} S^{\alpha}\right)\left|\boldsymbol{u}_{\alpha}(X)\right| \leq \Delta X\left|\boldsymbol{u}_{\alpha}(X)\right|$$
((2.3.3))

where I is the identity matrix and \(\underline{\lambda}_{\alpha}(X) \leq \lambda_{\alpha}(X) \leq \overline{\lambda}_{\alpha}(X)\).

To obtain bounds for the components of uα(X), we write (2.3.3) as:

$$\left[\begin{array}{c}{\lambda_{\alpha}(X) I-S^{\alpha} X_{c} S^{\alpha}-\Delta X} \\ {S^{\alpha} X_{c} S^{\alpha}-\Delta X-\lambda_{\alpha}(X) I}\end{array}\right]\left|\boldsymbol{u}_{\alpha}(X)\right| \leq 0$$

where \(\underline{\lambda}_{\alpha}(X) \leq \lambda_{\alpha}(X) \leq \overline{\lambda_{\alpha}}(X)\).

To compute lower and upper bounds for uα(X), we minimize and maximize |ui| subject to (2.3.3) for i=1,…,n−1, while keeping |un| equal to unity. This type of constrained optimization problem is known as a Linear Parametric Programming Problem, the solution of which is obtained via numerical techniques. Bounds for uα(X) are readily obtained by multiplying those for |uα(X)| by the matrix Sα.
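A rough numerical sketch of this bounding step is given below, using scipy.optimize.linprog; sweeping λα(X) over a grid in \([\underline{\lambda}_{\alpha}, \overline{\lambda}_{\alpha}]\) is our own simplification of the parametric programming step, not the exact procedure of (Seif et al. 1992):

```python
import numpy as np
from scipy.optimize import linprog

def eigvec_component_bounds(Xc, DX, S, lam_lo, lam_hi, n_grid=21):
    """Grid-sweep LP bounds on |u_alpha(X)| from the inequalities (2.3.3)."""
    n = Xc.shape[0]
    lo = np.full(n, np.inf); hi = np.full(n, -np.inf)
    for lam in np.linspace(lam_lo, lam_hi, n_grid):
        # stacked form of (2.3.3): A |u| <= 0
        A = np.vstack([lam*np.eye(n) - S @ Xc @ S - DX,
                       S @ Xc @ S - DX - lam*np.eye(n)])
        for i in range(n - 1):            # |u_n| is kept equal to unity
            for sign in (1.0, -1.0):      # minimize, then maximize |u_i|
                c = np.zeros(n); c[i] = sign
                res = linprog(c, A_ub=A, b_ub=np.zeros(2*n),
                              A_eq=np.eye(n)[[n-1]], b_eq=[1.0],
                              bounds=[(0, None)]*n)
                if res.success:
                    lo[i] = min(lo[i], res.x[i])
                    hi[i] = max(hi[i], res.x[i])
    lo[n-1] = hi[n-1] = 1.0
    return lo, hi   # bounds on |u_alpha|; multiply by S to recover signs
```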

2.4 Interval singular values

The interval singular values of an interval matrix XI can be computed directly from the eigenvalue problem for the matrices XTX, X ∈ XI (Deif 1991b).

Thus the problem of computing the interval singular values of XI becomes the following: given an interval matrix XI with centre matrix Xc ∈ Rn×p, find a description of the set:

$$\Sigma=\left\{\sigma : X^{T} X \boldsymbol{u}=\sigma^{2}\boldsymbol{ u}, \quad \boldsymbol{u} \neq 0, \quad X \in X^{I}\right\}$$

Rather than computing bounds for the set Σ, we confine ourselves to the single interval singular values of XI:

$$\sigma_{\alpha}^{I}, \alpha=1, \ldots, p, \quad \forall X \in X^{I}$$

the following three assumptions must be introduced:

Assumption 1

sign(uα(X)), α=1,…,p, is invariant for each X∈XI, and therefore equals sign (uα(Xc)) evaluated at the centre matrix Xc.

Assumption 2

$$\left|\delta X \boldsymbol{u}_{\alpha}\right|<2\left|X_{c} \boldsymbol{u}_{\alpha}\right|$$

where ∣δX∣ ≤ ΔX

Assumption 3

sign(Xcuα), α=1,…,p, is invariant for each X ∈ XI and is therefore equal to sign(Xcuα(Xc)), evaluated at the centre matrix Xc.

Conditions for the validity of Assumptions 1,2,3 may be found in (Deif & Rohn 1994). Indicating by:

$$S_{1}^{\alpha}=diag\left(sign\left(\boldsymbol{u}_{\alpha}\right)\right) \quad, \quad S_{2}^{\alpha}=diag\left(sign\left(X_{c} \boldsymbol{u}_{\alpha}\right)\right)$$

it can be proved:

Lemma

Values of δX which extremize the singular value σα of the matrix Xc+δX, ∀∣δX∣ ≤ ΔX are given by:

$$\delta X=\pm S_{2}^{\alpha} \Delta X S_{1}^{\alpha}.$$

Theorem 2.4.1

Under Assumptions 1, 2, 3, the squared singular values σ2 of Xc + δX, ∀|δX| ≤ ΔX, range over the interval:

$$\lambda_{\alpha}^{I}=\left[\underline{\lambda}_{\alpha}, \overline{\lambda}_{\alpha}\right] \quad \alpha=1, \ldots, r$$

where:

$$\begin{array}{l}\underline{\lambda}_{\alpha}=\lambda_{\alpha}\left(X_{c}^{T} X_{c}-2\left(S_{1}^{\alpha} \Delta X^{T} S_{2}^{\alpha} X_{c}\right)_{s}+S_{1}^{\alpha} \Delta X^{T} \Delta X S_{1}^{\alpha}\right) \\ \overline{\lambda}_{\alpha}=\lambda_{\alpha}\left(X_{c}^{T} X_{c}+2\left(S_{1}^{\alpha} \Delta X^{T} S_{2}^{\alpha} X_{c}\right)_{s}+S_{1}^{\alpha} \Delta X^{T} \Delta X S_{1}^{\alpha}\right)\end{array}$$

Thus, once the singular values of an interval matrix XI have been computed, a description of the set:

$$\Sigma=\left\{\sigma : X^{T} X \boldsymbol{u}=\sigma^{2} \boldsymbol{u}, \quad \boldsymbol{u} \neq 0, \quad X \in X^{I}\right\}$$

is provided and, in particular, a description of the set of the eigenvalues of any matrix of the form XTX, when X∈XI, is computed.
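Under the stated assumptions, Theorem 2.4.1 can be transcribed directly. The sketch below (our own, in Python) takes the centre Xc and radius DX and returns bounds on the squared singular values, with (·)s computed as the symmetric part (A+A′)/2:

```python
import numpy as np

def interval_squared_singvals(Xc, DX):
    """Bounds of Theorem 2.4.1, assuming Assumptions 1-3 hold."""
    p = Xc.shape[1]
    _, _, Vt = np.linalg.svd(Xc)        # rows of Vt: right singular vectors
    base = Xc.T @ Xc
    bounds = []
    for a in range(p):
        u = Vt[a]                                        # u_alpha(Xc)
        S1 = np.diag(np.where(u >= 0, 1.0, -1.0))        # diag(sign(u_alpha))
        S2 = np.diag(np.where(Xc @ u >= 0, 1.0, -1.0))   # diag(sign(Xc u_alpha))
        M = S1 @ DX.T @ S2 @ Xc
        Ms = 0.5 * (M + M.T)                             # symmetric part (.)_s
        Q = S1 @ DX.T @ DX @ S1
        lam_lo = np.linalg.eigvalsh(base - 2*Ms + Q)
        lam_hi = np.linalg.eigvalsh(base + 2*Ms + Q)
        # take the alpha-th eigenvalue in descending order, matching sigma_alpha
        bounds.append((lam_lo[::-1][a], lam_hi[::-1][a]))
    return bounds

Xc = np.array([[1.0, 0.2], [0.3, 0.9], [0.1, 0.4]])  # toy data (our own)
DX = 0.05 * np.ones_like(Xc)
print(interval_squared_singvals(Xc, DX))
```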

3 Principal component analysis on interval data

Let us consider an interval data matrix of n units on which p interval-valued variables \(X_{1}^{I}, X_{2}^{I}, \ldots, X_{p}^{I}\), with \(X_{j}^{I}=\left(X_{i j}=\left[\underline{x}_{i j}, \overline{x}_{i j}\right]\right)_{i}\),i = 1,…,n, have been observed:

$$X^{I}=\begin{bmatrix}\left[\underline{x}_{11}, \overline{x}_{11}\right]& \ldots &\left[\underline{x}_{1 j}, \overline{x}_{1 j}\right]& \ldots &\left[\underline{x}_{1 p}, \overline{x}_{1 p}\right] \\ \vdots & & \vdots & & \vdots \\ \left[\underline{x}_{i 1}, \overline{x}_{i 1}\right] &\ldots &\left[\underline{x}_{i j}, \overline{x}_{i j}\right]& \ldots &\left[\underline{x}_{i p}, \overline{x}_{i p}\right] \\ \vdots & & \vdots & & \vdots \\ \left[\underline{x}_{n 1}, \overline{x}_{n 1}\right] &\ldots & \left[\underline{x}_{n j}, \overline{x}_{n j}\right]& \ldots &\left[\underline{x}_{n p}, \overline{x}_{n p}\right]\end{bmatrix}$$
((3.1))

XI may be visualized as a set of n boxes in a p-dimensional space.

The task is to extend Principal Component Analysis to XI, obtaining a visualisation, on a lower dimensional space, of the relationships among the variables, among the units, and between both of them.

The aim is to use, when possible, the interval algebra instruments to adapt the mathematical models, on the basis of the classical PCA, to the case in which an interval data matrix is given. Let us suppose that the interval-valued variables have been previously standardized (see Appendix).

It is known that classical PCA on a real matrix X, in the space spanned by the variables, solves the problem of determining m ≤ p axes uα, α=1,…,m, such that the sum of the squared projections of the point-units on uα is maximum:

$$\boldsymbol{u}_{\alpha}^{\prime} X^{\prime} X \boldsymbol{u}_{\alpha}=\operatorname{Max} \quad 1 \leq \alpha \leq m$$
((3.2))

under the constraints:

$$\begin{cases}u^\prime_\alpha{u}_\beta=0 & for\;\alpha\neq\beta\\u^\prime_\alpha{u}_\beta=1 & for\;\alpha=\beta\end{cases}$$

The above optimization problem may be reduced to the eigenvalue problem:

$$X^{\prime} X \boldsymbol{u}_{\alpha}=\lambda \boldsymbol{u}_{\alpha} \quad \quad 1 \leq \alpha \leq m$$
((3.3))

When the data are of interval type, XI may be substituted in (3.3) and the interval algebra may be used for the products; equation (3.3) becomes an interval eigenvalue problem of the form:

$$\left(X^{I}\right)^{\prime} X^{I} \boldsymbol{u}_{\alpha}^{I}=\lambda^{I} \boldsymbol{u}_{\alpha}^{I}$$
((3.4))

which has the following interval solutions:

$$\left[\lambda_{\alpha}(Z) : Z \in\left(X^{I}\right)^{\prime} X^{I}\right],\left[\boldsymbol{u}_{\alpha}(Z) : Z \in\left(X^{I}\right)^{\prime} X^{I}\right] \quad \alpha=1, \ldots, p$$
((3.5))

i.e., the set of α-th eigenvalues of any matrix Z contained in the interval product (XI)′ XI, and the set of the corresponding eigenvectors respectively. The intervals in (3.5) may be computed by Theorem 2.3.1.

Using interval algebra for solving problem (3.4), the interval solutions can be computed but, what is worse, those intervals are oversized with respect to the intervals of solutions that we are searching for, as will be discussed below. For the sake of simplicity, let us consider the case p=2; thus two interval-valued variables:

$$X_{1}^{I}=\left(X_{i 1}=\left[\underline{x}_{i 1}, \overline{x}_{i 1}\right]\right), i=1, \ldots, n, \quad X_{2}^{I}=\left(X_{i 2}=\left[\underline{x}_{i 2}, \overline{x}_{i 2}\right]\right), i=1, \ldots, n$$

have been observed on the n considered units.

\(X_{1}^{I}\) and \(X_{2}^{I}\) assume an interval of values on each statistical unit: we do not know the exact value of the components xi1 or xi2 for i=1,…,n, but only the range in which this value falls. In the proposed approach the task is to contemplate all possible values of the components xi1, xi2, each in its own interval of values \(X_{i 1}=\left[\underline{x}_{i 1}, \overline{x}_{i 1}\right], X_{i 2}=\left[ \underline{x}_{i 2}, \overline{x}_{i 2}\right]\), for i=1,…,n. Furthermore, for each different set of values x11,x21,…,xn1 and x12,x22,…,xn2, where \(x_{i j} \in\left[\underline{x}_{i j}, \overline{x}_{i j}\right]\), i=1,…,n, j=1,2, a different cloud of points in the plane is univocally determined, and the PCA of that set of points must be computed. Thus, by interval PCA (IPCA) we mean determining the set of solutions of the classical PCA on each set of point-units, a set which is univocally determined for any different choice of the point-units, each in its own rectangle of variation.

Therefore, the interval solutions we are looking for are the set of α-th axes, each of which maximizes the sum of squared projections of a set of points in the plane, and the set of variances of those sets of points, respectively. This is equivalent to solving the optimization problem (3.2), and so the eigenvalue problem (3.3), for each matrix X ∈ XI.

In the light of the above considerations, the drawback in approaching directly the interval eigenvalue problem (3.4) comes out by observing that the following inclusion holds:

$$\left(X^{I}\right)^{\prime} X^{I}=\left\{X Y \quad / X \in\left(X^{\prime}\right)^{\mathrm{I}}, Y \in X^{I}\right\} \supset\left\{X^{\prime} X \quad / \quad X \in X^{I}\right\}$$
((3.6))

this means that the interval matrix (XI)′XI also contains matrices which are not of the form X′X. Thus the interval eigenvalues and the interval eigenvectors of (3.4) will be oversized and, in particular, will include the set of all eigenvalues and the set of the corresponding eigenvectors of any matrix of the form X′X contained in (XI)′XI.

This drawback may be solved by computing an interval eigenvalue problem considering in place of the product:

$$\left(X^{\prime}\right)^{I} X^{I}=\left\{X Y \quad / \quad X \in\left(X^{\prime}\right)^{I}, Y \in X^{I}\right\}$$

the following set of matrices:

$$\Theta^{I}=\left\{X^{\prime} X \quad / X \in X^{I}\right\}$$

i.e., the set of all matrices given by the product of the transpose of a matrix with the matrix itself.

For computing the α-th eigenvalue and the corresponding eigenvector of the set ΘI, which will still be denoted by \(\lambda_{\alpha}^{I}\) and \(\boldsymbol{u}_{\alpha}^{I}\), Theorem 2.4.1 may be used.

It is important to remark that Theorem 2.4.1 may be applied only under strong hypothesesFootnote 1 on the input matrix, as described in §2.4. When the above hypotheses are not verified, considering that the variables have been previously standardized, the eigenvalues and eigenvectors of the interval correlation matrix may be computed by Theorem 2.3.1, which is subject to fewer hypotheses than Theorem 2.4.1. The interval correlation matrix will be indicated by \(\Gamma^{I}=\left(\operatorname{corr}_{i j}^{I}\right)\), where \(\operatorname{corr}_{i j}^{I}\) is the interval of correlations between \(X_{i}^{I}\), \(X_{j}^{I}\) (Gioia & Lauro 2005). Notice that while the ij-th component of ΓI is the interval of correlations between \(X_{i}^{I}\), \(X_{j}^{I}\), the ij-th component of (X′)IXI is an interval which includes that interval of correlations and also contains redundant elements.

It is important to remark that ΘI ⊂ ΓI; hence the eigenvalues/eigenvectors of ΓI will also be oversized with respect to those of ΘI.
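The oversizing caused by the inclusion (3.6) can also be observed numerically: sampling matrices X ∈ XI gives an inner approximation of the eigenvalue ranges over ΘI, which any enclosure derived from the interval product (XI)′XI must contain and typically exceeds. A small Monte Carlo sketch (toy data, our own):

```python
import numpy as np

rng = np.random.default_rng(0)
Xlo = np.array([[0.9, 1.8], [2.7, 0.4]])   # toy 2x2 interval matrix
Xhi = np.array([[1.1, 2.2], [3.3, 0.6]])

samples = []
for _ in range(5000):
    X = Xlo + rng.random(Xlo.shape) * (Xhi - Xlo)   # one X in X^I
    samples.append(np.linalg.eigvalsh(X.T @ X))     # eigenvalues of X'X
samples = np.array(samples)

for a in range(2):
    print(f"lambda_{a+1} over Theta^I (inner approx.): "
          f"[{samples[:, a].min():.3f}, {samples[:, a].max():.3f}]")
```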

The α-th interval axis or interval factor will be the α-th interval eigenvector associated with the α-th interval eigenvalue in decreasing orderFootnote 2.

The orthonormality between pairs of interval axes must be interpreted according to:

$$\forall \boldsymbol{u}_\alpha\in \boldsymbol{u}^I_\alpha\ \text{such that}\ \boldsymbol{u}^\prime_{\alpha}\boldsymbol{u}_\alpha=1,\quad \exists \boldsymbol{u}_\beta\in \boldsymbol{u}^I_\beta,\ \beta\neq\alpha,\ \text{with}\ \boldsymbol{u}^\prime_\beta \boldsymbol{u}_\beta=1\ \text{such that}\ \boldsymbol{u}^\prime_{\alpha}\boldsymbol{u}_\beta=0$$

Thus two interval axes are orthonormal to one another if, taking a unitary vector in the first interval axis there exists a unitary vector in the second one so that their scalar product is zero.

In the classical case the share of variance explained by the α-th factor is computed by \(\lambda_{\alpha} / \sum_{\beta=1}^{p} \lambda_{\beta}\). In the interval case the importance of each interval factor is the interval:

$$\left[\frac{\underline{\lambda}_{\alpha}}{\underline{\lambda}_{\alpha}+\sum\limits_{\beta=1 \atop \beta \neq \alpha}^{p} \overline{\lambda}_{\beta}}\qquad,\qquad \frac{\overline{\lambda}_{\alpha}}{\overline{\lambda}_{\alpha}+\sum\limits_{\beta=1 \atop \beta \neq \alpha}^{p} \underline{\lambda}_{\beta}}\right]$$
((3.7))

i.e., the set of all ratios of variance explained by each real factor uα belonging to the interval factor\(\boldsymbol{u}_{\alpha}^{I}\). The analytical form of the bounds in (3.7) has been computed by considering the following chain of equalities:

$$f\left(\lambda_{1}, \lambda_{2}, \ldots, \lambda_{p}\right)=\frac{\lambda_{\alpha}}{\sum\limits_{\beta=1}^{p} \lambda_{\beta}}=\frac{1}{1+\frac{1}{\lambda_{\alpha}} \sum\limits_{\beta=1 \atop \beta \neq \alpha}^{p} \lambda_{\beta}}$$

f has been transformed into a real rational function in which each variable occurs only once and only at the first power; therefore, according to Proposition 2.1.1, the corresponding interval expression \(f\left(\lambda_{1}^{I}, \lambda_{2}^{I}, \ldots, \lambda_{p}^{I}\right)\) will compute the actual range of values of f for \(\lambda_{\alpha} \in \lambda_{\alpha}^{I}\), \(\forall \alpha=1, \ldots, p\).
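Formula (3.7) is then a direct transcription exercise; a minimal sketch, assuming the interval eigenvalues are stored as parallel arrays (toy values, our own):

```python
import numpy as np

def interval_explained_variance(lam_lo, lam_hi, alpha):
    """Interval share of variance (3.7) of the alpha-th interval factor."""
    others = [b for b in range(len(lam_lo)) if b != alpha]
    lo = lam_lo[alpha] / (lam_lo[alpha] + sum(lam_hi[b] for b in others))
    hi = lam_hi[alpha] / (lam_hi[alpha] + sum(lam_lo[b] for b in others))
    return lo, hi

lam_lo = np.array([2.0, 0.5, 0.3])   # toy interval eigenvalues
lam_hi = np.array([2.6, 0.9, 0.5])
print(interval_explained_variance(lam_lo, lam_hi, 0))  # ~(0.588, 0.765)
```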

Analogously to what has already been seen for the space Rp, in the space spanned by the units (Rn) the eigenvalues and the eigenvectors of the set:

$$\left(\Theta^{\prime}\right)^{I}=\left\{X X^{\prime} \quad / \quad X \in X^{I}\right\}$$

must be computed; the α-th interval axis will be the α-th interval eigenvector associated with the α-th interval eigenvalue in decreasing order.

Also in this case, Theorem 2.4.1 applied to the interval matrix (X′)I may be used if all its hypotheses are satisfied; otherwise the eigenvalues/eigenvectors of the standardized interval matrix (SS′)I:

$$\left(S S^{\prime}\right)^{I}=\left(\left(s s_{i j}^{\prime}\right)^{I}\right) \quad where \quad\left(s s_{i j}^{\prime}\right)^{I}=\left[\underline{s s}_{i j}^{\prime}, \overline{s s_{i j}^{\prime}}\right]$$

(see Appendix for details) may be computed.

Considering that (Θ′)I ⊂ (SS′)I, the eigenvalues/eigenvectors of (SS′)I will be oversized with respect to those of (Θ′)I.

It is known that the real matrices X′X and XX′ have the same non-zero eigenvalues, with the corresponding eigenvectors connected by a particular relationship. Let us indicate again with \(\lambda_{1}^{I}\), \(\lambda_{2}^{I}\), …, \(\lambda_{p}^{I}\) the interval eigenvalues of (Θ′)I and with \(\boldsymbol{v}_{1}^{I}\), \(\boldsymbol{v}_{2}^{I}\),…,\(\boldsymbol{v}_{p}^{I}\) the corresponding eigenvectors, and let us see how the above relationship also applies in the "interval" case. Let us consider, for example, the α-th interval eigenvalue \(\lambda_{\alpha}^{I}\) and let \(\boldsymbol{u}_{\alpha}^{I}\), \(\boldsymbol{v}_{\alpha}^{I}\) be the corresponding eigenvectors of ΘI and (Θ′)I associated with \(\lambda_{\alpha}^{I}\) respectively.

Taking an eigenvector \(\boldsymbol{v}_{\alpha} \in \boldsymbol{v}_{\alpha}^{I}\) of some \(XX^{\prime} \in\left(\Theta^{\prime}\right)^{I}\), then:

$$\exists \boldsymbol{u}_{\alpha} \in \boldsymbol{u}_{\alpha}^{I} \quad / \quad \boldsymbol{u}_{\alpha}=k_{\alpha} X^{\prime} \boldsymbol{v}_{\alpha}$$
((3.8))

where the constant kα is introduced to enforce the unit-norm condition on the vector X′vα.

4 Representation and interpretation

4.1 Units

From classical theory, given an n×p real matrix X we know that the α-th principal component cα is the vector of the coordinates of the n units on the α-th axis. Two different approaches may be used to compute cα:

  1. cα may be computed by multiplying the standardized matrix X by the α-th computed axis uα: cα = Xuα.

  2. from the relationship (3.8) between the eigenvectors of X′X and XX′, cα may be computed as the product \(\sqrt{\lambda_{\alpha}} \cdot \boldsymbol{v}_{\alpha}\) of the square root of the α-th eigenvalue of XX′ with the corresponding eigenvector.

When an n×p interval-valued matrix XI is given, the interval coordinate of the i-th interval unit on the α-th interval axis is an interval which comes out from a linear combination of the original intervals of the i-th unit by p interval weights; the weights are the interval components of the α-th interval eigenvector. A box in a bi-dimensional space of representation is a rectangle whose dimensions are the interval coordinates of the corresponding unit on the pair of computed interval axes. For computing the α-th interval principal component \(\boldsymbol{c}_{\alpha}^{I}=\left(c_{1 \alpha}^{I}, c_{2 \alpha}^{I}, \ldots, c_{n \alpha}^{I}\right)\) two different approaches may be used:

  1. compute it by the interval row-column product: \(\boldsymbol{c}_{\alpha}^{I}=X^{I} \boldsymbol{u}_{\alpha}^{I}\);

  2. compute the product between a constant interval and an interval vector: \(\boldsymbol{c}_{\alpha}^{I}=\sqrt{\lambda_{\alpha}^{I}} \cdot \boldsymbol{v}_{\alpha}^{I}\).

In both cases the interval algebra product is used; thus, the i-th component \(c_{i \alpha}^{I}\) of \(\boldsymbol{c}_{\alpha}^{I}\) will include the interval coordinate, as defined above, of the i-th interval unit on the α-th interval axis.

We resort to the first approach for computing principal components when the theorem for solving the eigenvalue problem (needed for computing \(\boldsymbol{v}_{\alpha}^{I}\)) cannot be applied because its hypotheses are not verified. Classical PCA gives a representation of the results by means of graphs, which permit us to represent the units on projection planes spanned by pairs of factors. The methodology (IPCA) that we have introduced permits us to visualize on planes how the coordinates of the units vary when each component of the considered interval-valued variable ranges in its own interval of values, or equivalently when each point-unit describes the box to which it belongs.

Indicating with UI the interval matrix whose α-th column is the interval eigenvector \(\boldsymbol{u}_{\alpha}^{\mathrm{I}}\) (α=1,…,p), the coordinates of all the interval units on the computed interval axes are represented by the interval product XIUI.
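As an illustration of approach 2) above, the product of the constant interval \(\sqrt{\lambda_{\alpha}^{I}}\) by the interval vector \(\boldsymbol{v}_{\alpha}^{I}\) may be sketched as follows, assuming \(\lambda_{\alpha}^{I} \geq 0\) and the (lo, hi) array representation used earlier (toy data, our own):

```python
import numpy as np

def interval_scale(lam_lo, lam_hi, vlo, vhi):
    """c^I = sqrt(lambda^I) * v^I: interval scalar times interval vector."""
    s_lo, s_hi = np.sqrt(lam_lo), np.sqrt(lam_hi)
    # endpoint formula (2.1.2) applied componentwise
    prods = np.stack([s_lo * vlo, s_lo * vhi, s_hi * vlo, s_hi * vhi])
    return prods.min(axis=0), prods.max(axis=0)

lam_lo, lam_hi = 2.45, 3.40                 # toy interval eigenvalue
vlo = np.array([0.30, -0.60])               # toy interval eigenvector
vhi = np.array([0.45, -0.40])
print(interval_scale(lam_lo, lam_hi, vlo, vhi))
```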

4.2 Interval variables

In the classical case, the coordinate of the i-th variable on the α-th axis is the correlation coefficient between the considered variable and the α-th principal component. Thus variables with greater coordinates (in absolute value) are those which best characterize the factor under consideration.

Furthermore, the standardization of each variable makes the variables, represented in the factorial plane, fall inside the correlation circle.

In the interval case the interval coordinate of the i-th interval-valued variable on the α-th interval axis is the interval correlation coefficient (Gioia & Lauro 2005) between the variable and the α-th interval principal component. The interval variables in the factorial plane, however, are represented not in the circle but in the rectangle of correlations. In fact, computing all possible pairs of elements, each in its own interval correlation, it may happen that pairs whose coordinates are not in relation to one another are also represented, i.e., pairs of elements which are correlations of different realizations of the two single-valued variables whose correlation is considered.

The interval coordinates of the i-th interval-valued variable on the first two interval axes \(\boldsymbol{u}_{\alpha}^{I}\), \(\boldsymbol{u}_{\beta}^{I}\), namely the interval correlations between the variable and the first and second interval principal components respectively, will be computed according to the procedure in (Gioia & Lauro 2005) and indicated as follows:

$$\begin{array}{l}{corr\left(\left(X \boldsymbol{u}_{\alpha}\right)^{I}, X_{i}^{I}\right)=\left[\underline{corr}\left(\boldsymbol{u}_{\alpha}, i\right), \overline{corr}\left(\boldsymbol{u}_{\alpha}, i\right)\right]} \\ {corr\left(\left(X \boldsymbol{u}_{\beta}\right)^{I}, X_{i}^{I}\right)=\left[\underline{corr}\left(\boldsymbol{u}_{\beta}, i\right), \overline{corr}\left(\boldsymbol{u}_{\beta}, i\right)\right]}\end{array}$$

Naturally the rectangle of correlations will be restricted, in the representation plane, to its intersection with the circle with centre in the origin and unitary radius.

4.3 Contributions

In the case of single-valued variables, the weight of the i-th unit on the variability of the α-th axis, named absolute contribution, is given by:

$$\frac{c_{i \alpha}^{2}}{\sum\limits_{h=1}^{n} c_{h \alpha}^{2}}$$
((4.3.1))

where \(c_{i \alpha}^{2}\) is the squared coordinate of the i-th unit on the α-th axis and \(\sum_{h=1}^{n} c_{h \alpha}^{2}\) is the variance of the projected units on that axis. In the case of interval-valued variables, (4.3.1) must be considered as a function g of the coordinates, which may be transformed as follows:

$$g\left(c_{1 \alpha}, c_{2 \alpha}, \ldots, c_{n \alpha}\right)=\frac{c_{i \alpha}^{2}}{\sum\limits_{h=1}^{n} c_{h \alpha}^{2}}=\frac{1}{1+\frac{1}{c_{i \alpha}^{2}} \sum\limits_{h=1 \atop h \neq i}^{n} c_{h \alpha}^{2}}$$

Proposition 2.1.1 applies to function g, thus the interval:

$$\left[\frac{\underline{c}_{i \alpha}^{2}}{\underline{c}_{i \alpha}^{2}+\sum\limits_{h=1 \atop h \neq i}^{n} \overline{c}_{h \alpha}^{2}}\quad, \quad \frac{\overline{c}_{i \alpha}^{2}}{\overline{c}_{i \alpha}^{2}+\sum\limits_{h=1 \atop h \neq i}^{n} \underline{c}_{h \alpha}^{2}}\right]$$
((4.3.2))

is the set of all absolute contributions of the i-th unit on the α-th axis as the squared projections \(c_{h \alpha}^{2}\) vary in their intervals of values. Interval (4.3.2) is the interval absolute contribution of the i-th interval unit on the α-th interval axis.
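A sketch of (4.3.2), assuming interval coordinates stored as (lo, hi) arrays; note that the square of an interval must itself be treated as a range, with lower endpoint 0 when the interval contains 0 (toy data, our own):

```python
import numpy as np

def sq_interval(lo, hi):
    """Range of x^2 for x in [lo, hi]."""
    if lo <= 0.0 <= hi:
        return 0.0, max(lo*lo, hi*hi)
    return min(lo*lo, hi*hi), max(lo*lo, hi*hi)

def interval_contribution(clo, chi, i):
    """Interval absolute contribution (4.3.2) of unit i on a given axis."""
    n = len(clo)
    sq = [sq_interval(clo[h], chi[h]) for h in range(n)]
    num_lo, num_hi = sq[i]
    others_hi = sum(sq[h][1] for h in range(n) if h != i)
    others_lo = sum(sq[h][0] for h in range(n) if h != i)
    # (assumes the denominators are non-zero, i.e. not all coordinates
    # have intervals collapsing to 0)
    return num_lo/(num_lo + others_hi), num_hi/(num_hi + others_lo)

clo = np.array([0.8, -1.2, 0.1])   # toy interval coordinates on one axis
chi = np.array([1.4, -0.5, 0.6])
print(interval_contribution(clo, chi, 0))  # ~(0.262, 0.883)
```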

The contribution of the j-th variable on the α-th axis may be computed analogously. Interval indexes for the quality of representation on an axis might be calculated by substituting the denominator in (4.3.1) with the sum of the squared coordinates of the units or of the variables. This procedure, however, would not furnish a good solution for measuring the "quality" of the reconstruction of the original data matrix. To this purpose the introduction of the singular value decomposition for interval matrices is necessary.

5 Numerical results

This section illustrates the proposed methodology on a real data set: the Oil data set (Ichino 1988), reported in Table 1 below. The data set presents eight different classes of oils described by four quantitative interval-valued variables: "Specific gravity", "Freezing point", "Iodine value" and "Saponification".

Table 1: The interval data set

The first step of the IPCA consists in calculating the following interval correlation matrix:

Table 2: The interval-correlation matrix

The interpretation of the interval correlations must take into account both the location and the span of the intervals. Intervals containing zero are not of interest because they indicate that "everything may happen". An interval with a smaller radius than another is more interpretable: as the radius of the interval correlations decreases, the stability of the correlations improves and a better interpretation of the results is possible. In the considered example the interval correlations are well interpretable because none of the intervals contains zero; thus each pair of interval-valued variables is either positively or negatively correlated. For example, we observe a strong positive correlation between Iodine value and Specific gravity and a strong negative correlation between Freezing point and Specific gravity. At equal lower bounds, the interval correlation between Iodine value and Freezing point is more stable than that between Iodine value and Saponification.

Eigenvalues and explained variance:

λ1=[2.45,3.40], Explained variance on the 1st axis: [61%, 86%]

λ2=[0.68,1.11], Explained variance on the 2nd axis: [15%, 32%]

λ3=[0.22,0.33], Explained variance on the 3rd axis: [4%, 9%]

λ4=[0.00,0.08], Explained variance on the 4th axis: [0%, 2%].

The choice of the eigenvalues, and hence of the interval principal components, may be made using the interval eigenvalue-one criterion [1,1]. In the numerical example, only the first principal component is of interest because the lower bound of the corresponding eigenvalue is greater than 1. The second eigenvalue satisfies the interval eigenvalue-one condition only partially and, moreover, its interval is not symmetric with respect to 1. Thus the representation on the second axis is not of great interest, even though the first two eigenvalues reconstruct most of the initial variance; the second axis is therefore not well interpretable.

Interval variables representation:

The principal components are interpreted by analysing the correlations between the interval-valued variables and the axes, as illustrated below:

Table 3: Interval-correlations Variables/Axes

The first axis is well explained by the contraposition of the variable Freezing point, on the positive quadrant, with respect to the variables Specific gravity and Iodine value on the negative quadrant. The second axis is less interpretable because all the correlations vary between −0.99 and 0.99.Footnote 3

The graphical results achieved by IPCA on the input data table are shown below. Figure 1 presents the graphical representation of the units; in Figure 2 only two variables, Specific gravity and Freezing point, are represented:

Figure 1: Representation of the units on the 1st factorial plane

Figure 2: Representation of the variables on the 1st factorial plane

The objects (Fig. 1) have a position on the first axis which is strictly connected to the "influence" that the considered variables have on that axis. It can be noticed that Beef and Hog are strongly influenced by Saponification and Freezing point; on the contrary, Linseed and Perilla are strongly influenced by Specific gravity and Iodine value. The other oils, Camellia and Olive, are positioned in the central zone, so they are not particularly characterized by the interval-valued variables.

It is important to remark that the different oils are characterized not only by the positions of the boxes but also by their size and shape. A bigger size of a box along the first axis indicates a greater variability of the characteristics of the oil represented by that axis. The shape and the position of the box also give information on the variability of the characteristics of the oil with respect to the first and second axes.

Computational cost: the computational cost of each optimization problem refers to the cost of a constrained nonlinear optimization or nonlinear programming problem. For computing the correlation matrix, p×p optimization problems must be solved. The computational cost for computing the j-th eigenvector refers to the cost of a linear parametric programming problem.