1 Introduction

In an age of “big data”, we are faced with the challenge of developing statistical methodology that can analyse the data accurately. One approach is to aggregate the data in some meaningful way, even if only to reduce the size of the data set. There are a myriad of ways to effect this aggregation. Clearly, it is preferable to carry out any such aggregation according to the scientific questions driving the analysis. The resulting aggregated data will perforce be in the form of so-called symbolic data, such as lists, intervals, histograms, and the like. While the aggregation of data sets (small or large) will produce symbolic data, symbolic data can also arise naturally (e.g., species data, many medical entities such as blood pressure, etc.). A detailed description of symbolic data along with numerous examples can be found in Bock and Diday (2000) and Billard and Diday (2003, 2006); Billard (2011) provides a non-technical illustrated introduction to symbolic data. It is important to note the distinction between symbolic data and fuzzy data. In symbolic data, an observation contains multiple values, either by natural occurrence or by data aggregation, and the values within a symbolic observation follow a probability distribution. By contrast, the values in a fuzzy set arise from the imprecision or fuzziness of the definition of the event, and have associated grades of membership. See Zadeh (1965, 1968) and Shapiro (2009) for further information about fuzzy sets. The focus of this paper is on symbolic data and not on fuzzy data.

Oftentimes, in the absence of knowing what else to do, analysts have taken the average of the aggregated values, or some other seemingly suitably selected “representative” value, as a classical surrogate for a given category. However, it is known that this approach, while giving answers, gives answers that are not necessarily correct, since some of the variations present in the data are ignored. For example, Billard (2008) has shown that for a data set of interval- or histogram-valued observations, the total variation equals the sum of the within-observation variation and the between-observation variation. The between-observation variation is a measure of the variation obtained when using the average as a classical surrogate to represent the aggregated values of a category (the precise decomposition can change slightly under different underlying assumptions, but the sense remains). The within-observation variation is a measure of the internal variation of each observation. It is this within-observation variation that is ignored when using classical surrogates only. To illustrate this further, suppose we have two samples each of size \(n=1\); the first sample contains the observation \(X_{(1)}=[9,11]\) and the second is \(X_{(2)} = [0, 20]\). Both have the same average value (\({=}10\)). Clearly, using the average as the basis for the analysis will give the same answer for both samples, yet just as clearly, the two samples have differing values and so should produce differing answers. That is, symbolic data require symbolic methods which use all the information contained in the data.
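To see this numerically, the within-observation variance of a single interval \([a,b]\) follows from the \(n=1\), \(s_{ij}=1\) special case of the sample variance in Eq. (3) below (under within-interval uniformity):

$$\begin{aligned} S^2 = \frac{a^2+ab+b^2}{3} - \left( \frac{a+b}{2}\right) ^2 = \frac{(b-a)^2}{12}, \end{aligned}$$

so \(X_{(1)}=[9,11]\) has within-observation variance \(1/3\) while \(X_{(2)}=[0,20]\) has \(100/3\): identical means, but very different variations.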

Our focus is on principal component analysis for histogram-valued observations. The basic principles of principal component analysis are unchanged from those for classically valued observations. See any of the many texts, e.g., Jolliffe (2004) and Johnson and Wichern (2002) for an applied approach, and Anderson (1984) for a theoretical approach. Recently, Le-Rademacher and Billard (2012) developed methodology for interval data; this included obtaining representations of the projections of the observed hyperrectangles in \(\mathcal {R}^p\) as polytopes in principal component space. These contrast sharply with the point projections in principal component space obtained for classical point data. Le-Rademacher and Billard (2012) also provided an illustrative comparison of their method with previous attempts to analyse interval data, including, e.g., Cazes et al. (1997), Chouakria (1998), Lauro and Palumbo (2000), Irpino et al. (2003), Palumbo and Lauro (2003), Lauro et al. (2008), and Douzal-Chouakria et al. (2011). A brief non-illustrative description of these differences is given in Billard and Le-Rademacher (2013). This comparison showed that these earlier efforts (while advancing the science) had failed in different ways to meet the difficult challenge of capturing all the variations inherent in the data. To extend the projected polytopes beyond visualization, Le-Rademacher and Billard (2013) introduced an algorithm to translate the resulting polytopes into histogram-valued output that can be used as numerical input in further statistical analyses.

To date, no comparable methodology exists for handling histogram-valued data. Note that the histogram-valued data here are different from the modal categorical data of Cazes (2002), Ichino (2011), and Makosso-Kallyth and Diday (2012). In this work, we extend the interval PCA methodology of Le-Rademacher and Billard (2012, 2013) to obtain principal components for histogram-valued observations. This method uses all the variations contained in the data set by using the symbolic covariance matrix; see Sect. 2. Also in Sect. 2, a visualization of the resulting principal components and computation of the histogram-valued output are suggested. Since hyperrectangles for histogram data can be viewed as consisting of sets of weighted sub-hyperrectangles, an expansion of the polytope methods developed for interval data is proposed. However, the generalization of the methods from interval data to histogram data is not trivial. Challenges encountered in the proposed expansion are addressed in Sect. 2. The algorithms for constructing the polytopes and the histogram-valued output are summarized in Sect. 3 (detailed algorithms are given in the Appendix). In Sect. 4, the new methodology is illustrated on a real data set describing monthly temperatures at weather stations in China. Conclusions are given in Sect. 5.

2 Basics and methodology

2.1 Histogram-valued data: basic statistics

Let \(\mathbf{X} = (X_1,\dots , X_p)\) be a p-dimensional random variable taking values in \(\mathcal {R}^p\). When a realization is a histogram-valued (or, simply, histogram) observation, it takes the form, for each \(X_j\), \(j=1,\dots , p\),

$$\begin{aligned} X_{ij} = \{[a_{ij1}, b_{ij1}),p_{ij1}; \dots ; [a_{ijs_{ij}}, b_{ijs_{ij}}],p_{ijs_{ij}}\},~~i = 1,\dots ,n, \end{aligned}$$
(1)

with \(\sum _{k_j=1}^{s_{ij}} p_{ijk_j} = 1\). The disjoint histogram subintervals \([a_{ijk_j}, b_{ijk_j})\) in Eq. (1), with \(a_{ijk_j} \le b_{ijk_j}\), can be open or closed at either end, and occur with probability or relative frequency \(p_{ijk_j}\), \(i = 1, \dots , n\), \(j=1,\dots , p\), and \(k_j = 1, \dots , s_{ij}\). Typically, the number of subintervals \(s_{ij}\) varies across different observations i and variables j.
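For concreteness, one way to encode a single histogram-valued component \(X_{ij}\) of Eq. (1) in R (the language of the supplemental program) is as a vector of subinterval endpoints plus a vector of relative frequencies, assuming, as in Sect. 2.3 below, that consecutive subintervals share endpoints; the field names here are illustrative, not those of the supplemental R program:

```r
# Histogram-valued realization X_ij = {[0,10),0.3; [10,20),0.5; [20,30],0.2},
# encoded via its s_ij + 1 ordered endpoints and s_ij relative frequencies.
x_ij <- list(breaks = c(0, 10, 20, 30),  # a_ij1, ..., a_ij(s_ij + 1)
             probs  = c(0.3, 0.5, 0.2))  # p_ij1, ..., p_ijs_ij
stopifnot(abs(sum(x_ij$probs) - 1) < 1e-12,  # sum_k p_ijk = 1
          !is.unsorted(x_ij$breaks))         # subintervals disjoint and ordered
```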

The sample mean and sample variance for each \(X_j\), \(j = 1,\dots , p\), were obtained by Billard (2008), respectively, as

$$\begin{aligned} \bar{X}_j = \frac{1}{2n}\sum _{i=1}^n\sum _{k_j=1}^{s_{ij}}(a_{ijk_j}+ b_{ijk_j})p_{ijk_j} \end{aligned}$$
(2)

and

$$\begin{aligned} S^2_j = \frac{1}{3n}\sum _{i=1}^n\sum _{k_j=1}^{s_{ij}}(a_{ijk_j}^2 + a_{ijk_j}b_{ijk_j} + b_{ijk_j}^2)p_{ijk_j} - \frac{1}{4n^2}\left[ \sum _{i=1}^n\sum _{k_j=1}^{s_{ij}}(a_{ijk_j}+b_{ijk_j})p_{ijk_j}\right] ^2. \end{aligned}$$
(3)

The sample symbolic covariance (or, simply, sample covariance) between two variables \(X_{j_1}\) and \(X_{j_2}\) was obtained in Billard (2008) as

$$\begin{aligned} Cov(X_{j_1}, X_{j_2})&= \frac{1}{6n}\sum _{i=1}^n \sum _{k_1=1}^{s_{ij_1}} \sum _{k_2=1}^{s_{ij_2}}\{[2(a_{ij_1k_1} - \bar{X}_{j_1})(a_{ij_2k_2} - \bar{X}_{j_2}) \nonumber \\&\quad +(a_{ij_1k_1} - \bar{X}_{j_1})(b_{ij_2k_2} - \bar{X}_{j_2}) \nonumber \\&\quad +(b_{ij_1k_1} - \bar{X}_{j_1})(a_{ij_2k_2} - \bar{X}_{j_2}) \nonumber \\&\quad +2(b_{ij_1k_1} - \bar{X}_{j_1})(b_{ij_2k_2} - \bar{X}_{j_2})]p_{ij_1k_1j_2k_2} \} \end{aligned}$$
(4)

where \(p_{ij_1k_1j_2k_2}\) is the relative frequency of the rectangle formed by subinterval \([a_{ij_1k_1}, b_{ij_1k_1})\) of \(X_{j_1}\) and subinterval \([a_{ij_2k_2}, b_{ij_2k_2})\) of \(X_{j_2}\). For simplicity of notation in Eq. (4), \(k_1 = k_{j_1}\) and \(k_2 = k_{j_2}\). When \(j_1 = j_2 = j\), then \(Cov(X_{j_1}, X_{j_2}) = S^2_j\). When \(s_{ij} =1\), and hence \(p_{ij1}=1\), for all \(i = 1, \dots , n\), \(j=1,\dots ,p\), the histogram value of Eq. (1) reduces to interval data as a special case. In this case, the sample statistics \(\bar{X}_j\), \(S_j^2\) and \(Cov(X_{j_1}, X_{j_2})\) of Eqs. (2)−(4) reduce to their respective formulae for interval-valued observations (obtained by Bertrand and Goupil (2000) and Billard (2008)); likewise, when the data are classically valued, where now \(s_{ij}=1\) and \(b_{ij1} = a_{ij1}\) for all i, j, these statistics reduce to their well-known classical counterparts, as a second special case. An underlying assumption in these formulae is that, within a given subinterval, values of \(X_{ij}\) are uniformly distributed (even though the random variable \(X_j\) itself is not necessarily uniformly distributed; indeed, \(\mathbf{X}\) is often assumed to follow a multivariate normal distribution, at least asymptotically).
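As a minimal sketch, the sample mean and variance of Eqs. (2) and (3) can be computed in R from a list of histograms encoded as above (the covariance of Eq. (4) follows the same pattern, with the joint relative frequencies \(p_{ij_1k_1j_2k_2}\) supplied for each pair of subintervals); the function names are ours:

```r
# Sample mean and sample variance of one histogram-valued variable, Eqs. (2)-(3).
# `obs` is a list of n histograms, each a list with $breaks and $probs as above.
hist_mean <- function(obs) {
  n <- length(obs)
  sum(vapply(obs, function(x) {
    a <- head(x$breaks, -1); b <- tail(x$breaks, -1)  # a_ijk, b_ijk
    sum((a + b) * x$probs)
  }, numeric(1))) / (2 * n)
}

hist_var <- function(obs) {
  n  <- length(obs)
  t1 <- sum(vapply(obs, function(x) {
    a <- head(x$breaks, -1); b <- tail(x$breaks, -1)
    sum((a^2 + a * b + b^2) * x$probs)
  }, numeric(1)))
  t2 <- sum(vapply(obs, function(x) {
    a <- head(x$breaks, -1); b <- tail(x$breaks, -1)
    sum((a + b) * x$probs)
  }, numeric(1)))
  t1 / (3 * n) - t2^2 / (4 * n^2)
}

# The interval example of Sect. 1: variance 1/3 for [9,11], 100/3 for [0,20].
hist_var(list(list(breaks = c(9, 11), probs = 1)))  # 0.3333...
```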

Note that, although \(X_j\) is symbolic-valued (a histogram in this case), the mean and the (co)variance of \(X_j\) are classically valued. Using conditional expectation and an internal parameters approach, Le-Rademacher and Billard (2011) showed that the variance of a symbolic-valued random variable is a sum of the mean and the variance of classically valued internal parameters. Similarly, Xu (2010) showed that the covariance of interval-valued random variables is a sum of the mean and the covariance of classically valued internal parameters. These (co)variances are similar to the (co)variances of classically valued random variables with a mixture distribution. Hence, the properties of the covariance matrix for classical data carry through to symbolic data, including the fact that the values of the correlation are between \((-1)\) and \((+1)\). Furthermore, the sample mean \(\bar{X}_j\) of Eq. (2) and the sample variance of Eq. (3) are maximum likelihood estimators (MLE) of the mean and the variance of the histogram-valued random variable \(X_j\) (Le-Rademacher and Billard 2011). The covariance \(Cov(X_{j_1}, X_{j_2})\) of Eq. (4) is a histogram data generalization of the MLE of the covariance of interval data from Xu (2010) and Billard et al. (2011). Since the symbolic sample covariance matrix, with elements shown in Eq. (4), accounts for the total variation of histogram data, we propose constructing the principal components using this covariance matrix, as described in the following sections.

2.2 Principal components

When the dimension p is large and some of the variables are correlated, principal component analysis is often used to reduce the dimension and the collinearity in the data by creating uncorrelated linear combinations of the p variables, called principal components. In particular, the \(\nu \)th principal component, \({PC}_{\nu }\), is the linear transformation of an observation \(\mathbf{X}\) satisfying

$$\begin{aligned} {PC}_{\nu } = e_{\nu _1} X_1 +\dots + e_{\nu _p} X_p, ~~\nu = 1,\dots ,p, \end{aligned}$$
(5)

where \(\lambda _{\nu }\) and \(\mathbf{e}_{\nu } = ( e_{\nu _1}, \dots , e_{\nu _p})\) are the \(\nu \)th eigenvalue and \(\nu \)th eigenvector, respectively, of the covariance matrix \(\mathbf{\Sigma }\) with \(\lambda _{1} \ge \dots \ge \lambda _{p}\), \(\sum _j e^2_{\nu _j} = 1\), and where \(Var({PC}_{\nu }) = \lambda _{\nu }\) and \(Cov({PC}_{\nu }, {PC}_{\nu '}) = 0, ~\nu \ne \nu '\). That is, the eigenvalue \(\lambda _{\nu }\) represents the amount of variation explained by \({PC}_{\nu }\). In practice, the first two or three principal components often account for most of the total variation in the data. Hence, subsequent analyses can be based on the first few principal components while still accounting for most of the data variability.

Instead of using the covariances directly, the data can be normalized; in this case, the covariance matrix \(\mathbf{\Sigma }\) is replaced by the correlation matrix, also denoted by \(\mathbf{\Sigma }\), with elements

$$\begin{aligned} \Sigma _{j_1j_2} = Cov(X_{j_1},X_{j_2})/[Cov(X_{j_1},X_{j_1})Cov(X_{j_2},X_{j_2})]^{1/2},~~ j_1,j_2 = 1,\dots , p. \end{aligned}$$
(6)

Anderson (1963, 1984), Mardia (1979), Johnson and Wichern (2002), and Jolliffe (2004) provide comprehensive treatments of classical PCA.

Once the covariance matrix, or equivalently the correlation matrix, is obtained, the derivation of the principal components through Eq. (5) is analogous to that for classical data and for interval data. The difference across these data formats lies in the relevant formula for calculating the covariance/correlation matrix \(\mathbf{\Sigma }\). For histogram-valued data, the elements of \(\mathbf{\Sigma }\) are given in Eq. (4). Since the covariance matrix \(\mathbf{\Sigma }\) is classically valued, the eigenvalues and the eigenvectors of \(\mathbf{\Sigma }\) have the same properties and the same interpretation as those of classical data. Specifically, the eigenvectors are orthogonal; hence, the principal components are orthogonal linear transformations of the original variables. Moreover, the eigenvector \(\mathbf{e}_{\nu }\) represents the direction with the \(\nu \)th largest variation in the observed data. See Jolliffe (2004) for a discussion of mathematical and statistical properties of PCs. Note that in this work we propose using the symbolic sample covariance matrix to calculate the principal components. Therefore, the principal component axes align in directions of maximum total variation for histogram data.
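Since \(\mathbf{\Sigma }\) is a classical \(p \times p\) matrix once assembled from Eq. (4), its eigenstructure is obtained by the standard computation; a minimal sketch in R (the function name is ours):

```r
# Principal components from the symbolic covariance matrix.
# Sigma: p x p symbolic sample covariance matrix with elements from Eq. (4).
pca_from_sigma <- function(Sigma, use_correlation = TRUE) {
  if (use_correlation) Sigma <- cov2cor(Sigma)   # normalization, Eq. (6)
  dec <- eigen(Sigma, symmetric = TRUE)          # lambda_1 >= ... >= lambda_p
  list(values   = dec$values,
       vectors  = dec$vectors,                   # columns are e_1, ..., e_p
       prop_var = dec$values / sum(dec$values))  # variation explained per PC
}
```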

Analogous to the classical case, the principal components of histogram-valued variables are linear combinations of those variables. Whereas for classical data, these linear combinations are linear transformations of single data points, for histogram data, they are transformations of all the data points in the histograms. In other words, since the histogram observations are hyperrectangles which are convex sets, the linear combinations of histogram observations are linear transformations of these convex sets. Section 2.3 describes the construction of the transformed convex sets (polytopes) for visualization.

2.3 Visualization

When the data are classical points in \(\mathcal {R}^p\), the projections onto the principal component space are points; see any text on multivariate statistics, e.g., Jolliffe (2004). When the data are intervals, so forming hyperrectangles in \(\mathcal {R}^p\), the projections onto principal component space are polytopes; see Le-Rademacher and Billard (2012). They further showed that these polytopes are convex sets with boundaries defined by the transformed vertices of the hyperrectangles. The polytopes can be reconstructed in the principal component space by connecting the transformed vertices of the observed hyperrectangles. The vertices are used as the first step in the reconstruction of the observations because they define the boundaries of the hyperrectangles. After the vertices are transformed, the entire polytope representing a transformed interval-valued observation can be reconstructed in the principal component space. This allows reconstruction of the polytopes without having to transform each individual data point within the observations, of which there are infinitely many. The vertices are part of the hyperrectangle representing an observation; they are not by themselves the entire observation. This same idea is extended to reconstructing histogram-valued observations in the principal component space, as described in the following.

To construct polytopes for histogram data, we first recognize that each histogram is in effect a weighted set of sub-hyperrectangles in \(\mathcal {R}^p\). Within each sub-hyperrectangle, the density is assumed to be uniform, but these densities vary across the sub-hyperrectangles. For example, when the number of subintervals \(s_{ij}=2\) and \(p=2\), we have the hyperrectangle as shown in Fig. 1a, with the differing weights of the sub-rectangles reflected by their differing colors. This is contrasted with the single uniformly weighted rectangle for interval data in Fig. 1b. In general, there will be \(r_i\) sub-hyperrectangles for the histogram \(\mathbf{X}_i\) where

$$\begin{aligned} r_i = \prod _{j=1}^p s_{ij}. \end{aligned}$$
(7)
Fig. 1
figure 1

a Histogram sub-rectangles. b Interval rectangle

Since a histogram-valued observation can be represented by a hyperrectangle in the sample space \(\mathcal {R}^p\), an observation \(\mathbf{X}_i\) can be expressed in terms of the vertices of its sub-hyperrectangles. Denote the set of subinterval endpoints by \(\mathbf{Q}_i = (Q_{i1},\dots , Q_{ip})\) with \(Q_{ij} = \{a_{ij1}, b_{ij1},\dots , a_{ijs_{ij}}, b_{ijs_{ij}}\}\), \(j = 1,\dots ,p\). Let \(\mathbf{W}_i^v\) be the \((2^pr_i\times p)\) matrix whose rows consist of the vertex coordinates formed from all possible combinations of the elements of \(\mathbf{Q}_i\), taking one endpoint per variable. From Eq. (7), it follows that \(\mathbf{W}_i^v\) has \(2^pr_i = 2^p(\prod _{j=1}^ps_{ij})\) rows.

However, usually the upper endpoint of a subinterval equals the lower endpoint of the next subinterval. Therefore, without loss of generality, for each \(X_{ij}\), assume \(b_{ij(k_j-1)}= a_{ijk_j}\) for \(k_j=2,\ldots ,s_{ij}\) and write \(b_{ijs_{ij}} = a_{ij(s_{ij}+1)}\). Then, the set of endpoints \(\mathbf{Q}_i\) can be replaced by \(\mathbf{E}_i = (E_{i1},\dots , E_{ip})\) with \(E_{ij} = \{a_{ij1},\dots , a_{ij(s_{ij}+1)}\}\), \(j = 1,\dots , p\), \(i = 1,\dots ,n\). Let \(\mathbf{X}^{v}_i\) be the matrix whose rows consist of all possible combinations of the elements of \(\mathbf{E}_i\), again taking one endpoint per variable. Then, \(\mathbf{X}^{v}_i\) has \(N_i=\prod _{j=1}^p{(s_{ij}+1)}\) rows. The quantity \(\prod _{j=1}^p{(s_{ij}+1)}\) is less than or equal to \(2^p(\prod _{j=1}^p{s_{ij}})\). The magnitude of the difference between these two quantities depends on the number of subintervals for each histogram \(X_{ij}\) and the number of variables, p. When \(s_{ij}=1\) for all \(j =1, \ldots , p\), as for interval data,

$$\begin{aligned} \frac{\prod _{j=1}^p{(s_{ij}+1)}}{2^p(\prod _{j=1}^p{s_{ij}})}=\frac{\prod _{j=1}^p{2}}{2^p(\prod _{j=1}^p{1})}=\frac{2^p}{2^p}=1. \end{aligned}$$

When \(s_{ij}\) is large for most \(X_{ij}\), it is easily shown that

$$\begin{aligned} \frac{\prod _{j=1}^p{(s_{ij}+1)}}{2^p(\prod _{j=1}^p{s_{ij}})}\rightarrow \frac{1}{2^p}. \end{aligned}$$

Hence, when p is large, it is more efficient to use the matrix of vertices \(\mathbf{X}^{v}_i\) which is given by

$$\begin{aligned} \mathbf{X}^v_i = \left[ \begin{array}{cccc} a_{i11} &{} a_{i21} &{} \ldots &{} a_{ip1}\\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ a_{i11} &{} a_{i21} &{} \ldots &{} a_{ip(s_{ip}+1)}\\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ a_{i1(s_{i1}+1)} &{} a_{i2(s_{i2}+1)} &{} \ldots &{} a_{ip1}\\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ a_{i1(s_{i1}+1)} &{} a_{i2(s_{i2}+1)} &{} \ldots &{} a_{ip(s_{ip}+1)} \end{array} \right] . \end{aligned}$$
(8)

Figure 2a shows these vertices when \(s_{ij} = 2\) and \(p=2\) (see Fig. 1a) using the endpoints of \(\mathbf{E}_i\); and Fig. 2b gives the vertices for the interval data with \(s_{ij}=1\) and \(p=2\) (see Fig. 1b) using the endpoints of \(\mathbf{Q}_i\). The matrix of endpoints for Fig. 2a has \(N_i = \prod _{j=1}^2(2+1) = 9\) rows and is given by, from Eq. (8),

$$\begin{aligned} \mathbf{X}^{v}_i = \left[ \begin{array}{ccccccccc} a_{i11} &{} a_{i11} &{} a_{i11} &{}a_{i12} &{}a_{i12} &{}a_{i12} &{}a_{i13} &{}a_{i13} &{}a_{i13} \\ a_{i21} &{}a_{i22} &{}a_{i23} &{}a_{i21} &{}a_{i22} &{}a_{i23} &{}a_{i21} &{}a_{i22} &{}a_{i23} \end{array} \right] ^T \end{aligned}$$
(9)

where T denotes the transpose. Construction of \(\mathbf{X}^v_i\), and hence of \(\mathbf{X}^h_i\), is given in “Constructing the matrix of vertices” in the Appendix; a short sketch follows.
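Since the rows of \(\mathbf{X}^v_i\) form the Cartesian product of the endpoint sets \(E_{ij}\), with the last variable cycling fastest as in Eq. (8), the matrix can be built in R with `expand.grid`; a sketch with illustrative endpoint values (the function name is ours):

```r
# Matrix of vertices X_i^v of Eq. (8) for observation i.
# `endpoints` is a list of p vectors E_ij = (a_ij1, ..., a_ij(s_ij + 1)).
vertex_matrix <- function(endpoints) {
  p <- length(endpoints)
  # expand.grid varies its first argument fastest, whereas Eq. (8) varies
  # the last variable fastest, so build on the reversed list and flip back.
  g <- expand.grid(rev(endpoints), KEEP.OUT.ATTRS = FALSE)
  as.matrix(g[, p:1, drop = FALSE])   # N_i = prod(s_ij + 1) rows
}

# Example of Fig. 2a / Eq. (9): p = 2, s_i1 = s_i2 = 2 gives N_i = 9 rows.
Xv <- vertex_matrix(list(c(1, 2, 3), c(10, 20, 30)))
```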

Fig. 2
figure 2

a Vertices \(\mathbf{E}\) for Histogram sub-rectangles. b Vertices \(\mathbf{Q}\) for Interval rectangle

Equation (5) can now be applied to the set of vertices in \(\mathbf{X}^v_i\) transforming them into points in the principal component space. Let \(\mathbf{Y}_i^v\) be the matrix of transformed vertices of observation i. Then, from Eq. (5), we have

$$\begin{aligned} \mathbf{Y}_i^v = \mathbf{X}_i^v\mathbf{e}. \end{aligned}$$
(10)

Le-Rademacher and Billard (2012) proved that a hyperrectangle representing an interval-valued observation becomes a polytope when transformed into a principal component space. Similarly, each sub-hyperrectangle belonging to a histogram-valued observation i becomes a polytope in principal component space. The vertices of the polytope are the transformed vertices, through Eq. (10), of its corresponding sub-hyperrectangle in the sample space. To construct the polytopes for a histogram-valued observation i, we treat the \(r_i\) sub-hyperrectangles that belong to histogram-valued observation i as a set of \(r_i\) interval-valued observations, but with weights, i.e., with densities \(d_i^h\) (say), \(h = 1,\dots , r_i\). Let the matrix of vertices for the hth sub-hyperrectangle be \(\mathbf{X}^h_i\). These sub-matrices are extracted from \(\mathbf{X}^v_i\). Consider the example of Fig. 2a and its associated matrix of vertices in Eq. (9). For \(h=1\) (the lower left-hand sub-rectangle of Fig. 2a), the matrix of vertices is

$$\begin{aligned} \mathbf{X}^{h=1}_i = \left[ \begin{array}{cccc} a_{i11} &{} a_{i11} &{} a_{i12} &{}a_{i12} \\ a_{i21} &{}a_{i22} &{}a_{i21} &{}a_{i22} \end{array} \right] ^T \end{aligned}$$

corresponding to the \(1\mathrm{st}, 2\mathrm{nd}, 4\mathrm{th}, 5\mathrm{th}\) rows of \(\mathbf{X}^v_i\) in Eq. (9); for \(h=2\) (the upper left-hand sub-rectangle of Fig. 2a), we have

$$\begin{aligned} \mathbf{X}^{h=2}_i = \left[ \begin{array}{ccccccccc} a_{i11} &{} a_{i11} &{}a_{i12} &{}a_{i12} \\ a_{i22} &{}a_{i23} &{}a_{i22} &{}a_{i23} \end{array} \right] ^T \end{aligned}$$

corresponding to the \(2\mathrm{nd}, 3\mathrm{rd}, 5\mathrm{th}, 6\mathrm{th}\) rows of \(\mathbf{X}^v_i\) in Eq. (9); and so on. Analogous to Eq. (10), the matrix of vertices of the sub-polytope representing sub-hyperrectangle h, \(h = 1,\dots ,r_i\), is

$$\begin{aligned} \mathbf{Y}_i^h = \mathbf{X}_i^h\mathbf{e}. \end{aligned}$$
(11)

For computational purposes, let \(\mathbf{Y}_i\) be a three-dimensional array composed of the collection of \(\mathbf{Y}_i^h\) of Eq. (11), i.e., \(\mathbf{Y}_i = (\mathbf{Y}_i^1,\dots , \mathbf{Y}_i^{r_i})\). Construction of \(\mathbf{Y}_i\) is given by the algorithm in the Appendix. The sub-polytope representing sub-hyperrectangle h is then constructed by connecting corresponding vertices of \(\mathbf{Y}_i^h\) as described in Le-Rademacher and Billard (2012). Figure 3a shows the polytope projection of a two-dimensional hyperrectangle with four sub-rectangles onto a principal component plane. Figure 3b shows the projection of a three-dimensional hyperrectangle.
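Continuing the sketch above, the vertex transformation of Eq. (10) and the sub-matrix extraction of Eq. (11) are direct matrix operations in R; here `pc` is the output of the hypothetical `pca_from_sigma` of Sect. 2.2, and the sub-rectangle row indices for \(h=3,4\) continue the pattern described in the text:

```r
# Transform all vertices into PC space, Eq. (10): Y_i^v = X_i^v e.
Yv <- Xv %*% pc$vectors            # columns are PC coordinates of each vertex

# For p = 2 with s_i1 = s_i2 = 2 (Fig. 2a), the rows of Xv giving the
# vertices of each sub-rectangle h = 1, ..., 4 are:
sub_rows <- list(c(1, 2, 4, 5),    # h = 1: lower left
                 c(2, 3, 5, 6),    # h = 2: upper left
                 c(4, 5, 7, 8),    # h = 3: lower right
                 c(5, 6, 8, 9))    # h = 4: upper right
# Eq. (11): Y_i^h = X_i^h e for each sub-rectangle.
Yh <- lapply(sub_rows, function(r) Xv[r, , drop = FALSE] %*% pc$vectors)
```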

Fig. 3
figure 3

a Subintervals of PC Histogram for \(p=2\). b Subintervals of PC Histogram for \(p=3\)

Although the densities of the sub-hyperrectangles, equivalently the densities of the sub-polytopes, in observation i vary, illustrating this variability presents some challenges. The first challenge is that many of the sub-polytopes in an observation are hidden behind other sub-polytopes; therefore, only the densities of the sub-polytopes on the boundary can be visualized. The second challenge is computational complexity. If densities are specified by different colors, then all polytopes must be filled with the color associated with their density. Figure 3b illustrates that this complexity exists even for a simple three-dimensional observation. Writing a program to automate this process is a time-consuming and challenging task, and is left as potential future work.

With these challenges in mind, we propose separately constructing polytopes (without densities) for visualization and constructing a matrix of densities associated with the sub-polytopes as an alternative way to understand the variability in their densities. The resulting densities are further taken into account when computing the histogram-valued output. Computation of the sub-polytope densities and the output histograms are described in the following section.

2.4 Numerical output

Let \(\mathbf{d}_i\) be an \(r_i\)-vector of densities. Then, the \(h\mathrm{th}\) element of \(\mathbf{d}_i\) is the density of the \(h\mathrm{th}\) sub-hyperrectangle of observation i. For \(j=1,\ldots , p\) and \(k_j=1,\ldots , s_{ij}\),

$$\begin{aligned} d_i^h=\prod _{j=1}^p{p_{ijk_j}} \end{aligned}$$
(12)

where

$$\begin{aligned} h=\sum _{j=1}^{p-1}\left[ (k_j -1)\prod _{l=j+1}^{p}s_{il}\right] + k_p. \end{aligned}$$
(13)

That is, e.g., the first sub-hyperrectangle of observation i is formed by the first subinterval of histograms \(X_{ij}\) for all \(j=1,\ldots ,p\). Then, \(k_j=1\) for all j. Thus, from Eq. (13),

$$\begin{aligned} h=\sum _{j=1}^{p-1}\left[ (k_j -1)\prod _{l=j+1}^{p}s_{il}\right] + k_p=0+1=1. \end{aligned}$$

Hence, the first element of \(\mathbf{d}_i\) is, from Eq. (12),

$$\begin{aligned} d_i^1=\prod _{j=1}^p{p_{ijk_j}}=\prod _{j=1}^p{p_{ij1}}. \end{aligned}$$

Similarly, the second sub-hyperrectangle of observation i is formed by the first subinterval of histograms \(X_{ij}\) for all \(j=1,\ldots ,p-1\) and the second subinterval of \(X_{ip}\). Then, \(k_j=1\) for \(j=1,\ldots , p-1\) and \(k_p=2\). Therefore,

$$\begin{aligned} h=\sum _{j=1}^{p-1}\left[ (k_j -1)\prod _{l=j+1}^{p}s_{il}\right] + k_p=0+2=2. \end{aligned}$$

Hence, the second element of \(\mathbf{d}_i\) is

$$\begin{aligned} d_i^2=\prod _{j=1}^p{p_{ijk_j}}=\left( \prod _{j=1}^{p-1}{p_{ij1}}\right) p_{ip2}. \end{aligned}$$

Other elements of \(\mathbf{d}_i\) can be found in the same way. The value \(d_i^h\) is the density of the sub-hyperrectangle whose vertices are expressed in the matrix \(\mathbf{X}^h_i\); equivalently, \(d_i^h\) is the density of the sub-polytope whose vertices are expressed in the matrix \(\mathbf{Y}^h_i\) of Eq. (11).
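Because the index h of Eq. (13) orders the sub-hyperrectangles with \(k_p\) varying fastest, the same ordering as the vertex matrix of Eq. (8), the entire vector \(\mathbf{d}_i\) of Eq. (12) can be produced with the same `expand.grid` device; a sketch with illustrative frequencies (the function name is ours):

```r
# Density vector d_i of Eq. (12), one entry per sub-hyperrectangle,
# in the order of Eq. (13) (index k_p varies fastest).
# `probs` is a list of p vectors (p_ij1, ..., p_ijs_ij), j = 1, ..., p.
density_vector <- function(probs) {
  g <- expand.grid(rev(probs), KEEP.OUT.ATTRS = FALSE)  # same ordering trick
  apply(as.matrix(g), 1, prod)        # d_i^h = prod_j p_ijk_j; r_i entries
}

d <- density_vector(list(c(0.3, 0.7), c(0.4, 0.6)))
# d = (0.12, 0.18, 0.28, 0.42): e.g., d[2] = p_i11 * p_i22 = 0.3 * 0.6.
stopifnot(abs(sum(d) - 1) < 1e-12)    # the densities sum to one
```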

Prior to incorporating the sub-polytope densities into the relative frequencies of the histogram output, construct the principal component (PC) histogram corresponding to each sub-polytope of observation i as detailed in Le-Rademacher and Billard (2013). Let

$$\begin{aligned} Z^h_{i \nu }=\{[z_1, z_2), p_1; \ldots ; [z_s, z_{s+1}], p_s \} \end{aligned}$$
(14)

denote the resulting \({PC}_{\nu }\) histogram representing sub-polytope h of observation i, where s is the number of subintervals in \(Z^h_{i \nu }\) and \(p_k\) is the relative frequency of subinterval \([z_k, z_{k+1})\) ignoring the sub-polytope density. Then, the relative frequency of \([z_k, z_{k+1})\) is \(d^h_ip_k\) after adjusting for the density. Now, the adjusted \({PC}_{\nu }\) histogram for sub-polytope h is \(Z^h_{i \nu }=\{[z_1, z_2), d^h_ip_1; \ldots ; [z_s, z_{s+1}], d^h_ip_s \}\).

Next, we propose combining the \(r_i\) adjusted histograms \(Z^h_{i \nu }\), \(h=1, \ldots , r_i\), into one histogram representing \({PC}_{\nu }\) of observation i. For convenience, we propose creating a combined histogram with subintervals of equal width. Due to the linear transformation, each subinterval along the \(PC_{\nu }\) axis may contain parts of several sub-polytopes. For example, in Fig. 3a the first subinterval \([z_1, z_2)\) includes part of sub-polytopes 1 and 2, whereas subinterval \([z_2, z_3)\) includes part of all four sub-polytopes. The subinterval relative frequencies for the combined histogram must correctly account for the proportions of the various sub-polytopes that contribute to each subinterval. Figure 3b shows that the process is more complex for higher-dimensional data. Steps to compute the combined histograms are described in “Constructing the PC Histograms” in the Appendix; a simplified sketch of the combination step follows.
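Assuming the \(r_i\) adjusted sub-polytope histograms \(Z^h_{i\nu }\) have already been computed (as in Le-Rademacher and Billard 2013), the final combination step can be sketched in R as follows; this simplified version apportions the mass of each subinterval to the new equal-width bins by overlap length, which relies on the within-subinterval uniformity assumption, and the function name is ours:

```r
# Combine r_i density-adjusted sub-polytope histograms into one
# equal-width PC_nu histogram for observation i.
# `sub_hists` is a list of histograms ($breaks, $probs); `d` is d_i.
combine_pc_histograms <- function(sub_hists, d, n_bins = 5) {
  rng  <- range(unlist(lapply(sub_hists, `[[`, "breaks")))
  brk  <- seq(rng[1], rng[2], length.out = n_bins + 1)
  freq <- numeric(n_bins)
  for (h in seq_along(sub_hists)) {
    z  <- sub_hists[[h]]$breaks
    pk <- sub_hists[[h]]$probs * d[h]       # adjusted frequencies d_i^h p_k
    for (k in seq_along(pk)) {              # spread subinterval k over the bins
      lo <- z[k]; hi <- z[k + 1]
      if (hi > lo) {                        # overlap length with each bin
        ov   <- pmax(0, pmin(hi, brk[-1]) - pmax(lo, brk[-(n_bins + 1)]))
        freq <- freq + pk[k] * ov / (hi - lo)
      }
    }
  }
  list(breaks = brk, probs = freq)          # probs sums to one
}
```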

3 Algorithm

The algorithm needed to carry out the proposed analysis can be divided into four main components. For each observation, do the following:

A.1 Construct the matrix of vertices \(\mathbf{X}_i^v\) of the observed sub-hyperrectangles.

A.2 Construct the polytopes by:

• computing the transformed vertices \(\mathbf{Y}_i^v (=\mathbf{X}_i^v\mathbf{e})\) and then connecting the appropriate vertices to build the polytopes, and

• computing the density vector \(\mathbf{d}_i\).

A.3 Plot the two- or three-dimensional projection of the polytopes from A.2 onto a principal component space.

A.4 Compute the principal component histograms by:

• computing the principal component histogram for each sub-polytope from A.2, and then

• combining the sub-polytope histograms into one histogram using the densities computed in A.2.

Details of the algorithm are given in the Appendix. An R program and an example data set are given as supplemental materials.
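Tying the four components together, a top-level driver might be sketched as follows; all function names are the illustrative sketches from Sect. 2 rather than the interface of the supplemental R program, and `Sigma`, `endpoints`, `probs`, and `n` are assumed inputs:

```r
# Driver sketch for one histogram-valued data set (steps A.1-A.4).
pc <- pca_from_sigma(Sigma)               # eigenstructure of Eq. (4)/Eq. (6)
for (i in seq_len(n)) {
  Xv <- vertex_matrix(endpoints[[i]])     # A.1: vertices, Eq. (8)
  Yv <- Xv %*% pc$vectors                 # A.2: transformed vertices, Eq. (10)
  d  <- density_vector(probs[[i]])        # A.2: densities, Eq. (12)
  # A.3: plot the polytopes by connecting appropriate rows of Yv
  # A.4: compute per-sub-polytope PC histograms, then combine them with
  #      combine_pc_histograms(sub_hists, d)
}
```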

Although the structure of histogram observations can be complex, especially with a large number of variables, we carefully define the matrix of vertices in Eq. (8) to reduce the size of the data matrix and thereby increase the efficiency of the algorithm. Computation of the polytopes and the principal component histograms is straightforward. The algorithm involves enumeration of the vertices (part A.1) and direct matrix computations (parts A.2−A.4). With current computing resources, these procedures are routinely done and do not take much time. As a reference, it took 30 s to produce the results shown in Tables 1, 2, 3, 4 and Fig. 4 of Sect. 4 using a personal computer with a 3.00 GHz processor and 8 GB of RAM.

Table 1 Covariance matrix for China weather variables
Table 2 Eigenvalues \(\lambda _{\nu }\) and eigenvectors \(\mathbf{e}_{\nu } = (e_{\nu 1},\dots ,e_{\nu p})\)
Table 3 Correlation between PCs and original variables
Fig. 4
figure 4

Polytope projections onto \(P{C_{1}} \times P{C_2}\) space

The efficiency of this algorithm follows the development of Le-Rademacher and Billard (2012, 2013). First, the polytopes are constructed by connecting the vertices of the polytopes for all observations in the same sequence; see Le-Rademacher and Billard (2012) for a detailed exposition. This allows us to simply recreate the true shape of the observations in the PC space without using a search procedure, e.g., the parallel edge connected shapes of Irpino et al. (2003), which is often much more computationally intensive. Second, the output histograms are computed using a two-dimensional projection of the polytopes onto the \(PC_1 \times PC_\nu \) plane to ensure the resulting histograms reflect the largest source of variation in the data while keeping the computation manageable. Furthermore, the symmetry of the sub-polytopes means it is only necessary to compute the subinterval endpoints and the subinterval frequencies for the first half of each sub-histogram; the endpoints and the frequencies of the second half can be directly copied from the first half; see Le-Rademacher and Billard (2013) for details. This further improves the efficiency of our algorithm.

4 Illustration

The foregoing theory is illustrated on a data set consisting of temperatures at a subset of 15 weather stations in China, available from the website <http://dss.ucar.edu/datasets/ds578.1>. At each location, the average monthly temperatures from 1951 to 2000 were aggregated into histograms with five subintervals. Months represented the variables; however, since the temperatures for the months November–March were essentially the same, and so not useful as a distinguishing feature, these five months were combined into one variable “Nov–Mar” (\(X_1\)). Similarly, the temperatures for April and October were combined into “Apr & Oct” (\(X_2\)), May and September into “May & Sep” (\(X_3\)), and June through August into “Jun–Aug” (\(X_4\)). Elevation for each station was also added as a \(5\mathrm{th}\) variable (\(X_5\)), with the single classical value expressed as an interval \(z=[z,z]\). Thus, we have \(p=5\) variables and \(n = 15\) histogram-valued observations, with \(s_{ij} = 5\) subintervals for each \(i = 1,\dots , n\), \(j = 1,\ldots ,4\), and \(s_{ij}=1\) when \(j=5\). For illustrative purposes, we make the not-unreasonable assumption that all variables and observations are equally weighted.

Fig. 5
figure 5

Latitude \(\times \) longitude location of weather stations

The sample covariance matrix can then be calculated from Eq. (4); see Table 1.

Then, from Eq. (6), we can obtain the correlation matrix \(\mathbf{\Sigma }\) as

$$\begin{aligned} \mathbf{\Sigma } = \left[ \begin{array}{rrrrr} 1.00 &{} 0.98 &{} 0.93 &{} 0.77 &{} -0.06 \\ 0.98 &{} 1.00 &{} 0.98 &{} 0.87 &{} -0.22 \\ 0.93 &{} 0.98 &{} 1.00 &{} 0.94 &{} -0.38 \\ 0.77 &{} 0.87 &{} 0.94 &{} 1.00 &{} -0.64 \\ -0.06 &{} -0.22 &{} -0.38 &{} -0.64 &{} 1.00 \end{array} \right] . \end{aligned}$$

The eigenvalues of \(\mathbf{\Sigma }\) are as shown in Table 2, along with the associated eigenvectors. Hence, the principal components can be calculated through Eq. (5). Table 2 also shows that together the first two principal components explain \(99~\%\) of the total variance in the data with \(PC_1\) accounting for \(77.5~\%\) and \(PC_2\) an additional \(21.5~\%\). The correlation between these two principal components and the original variables (Table 3) shows a strong correlation between \(PC_1\) and monthly temperatures whereas \(PC_2\) is strongly correlated with station elevation.
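As a check, the eigenstructure of Table 2 can be reproduced directly from the printed correlation matrix; since \(\mathbf{\Sigma }\) above is rounded to two decimals, the computed proportions agree with Table 2 only approximately:

```r
Sigma <- matrix(c( 1.00,  0.98,  0.93,  0.77, -0.06,
                   0.98,  1.00,  0.98,  0.87, -0.22,
                   0.93,  0.98,  1.00,  0.94, -0.38,
                   0.77,  0.87,  0.94,  1.00, -0.64,
                  -0.06, -0.22, -0.38, -0.64,  1.00), nrow = 5)
dec <- eigen(Sigma, symmetric = TRUE)
dec$values / sum(dec$values)  # ~ (0.775, 0.215, ...), cf. Table 2
```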

Fig. 6
figure 6

Projections onto \(P{C_1} \times P{C_2}\) space of classical surrogates

The algorithm of Sect. 3 can now be applied to produce the polytopes in the principal component space corresponding to each observation. These are displayed in Fig. 4. The latitude \(\times \) longitude locations of these stations are shown in Fig. 5, where the differing sizes of the circles reflect the differing station elevations. From Fig. 4, it is clear that observations \(i = 1, \dots , 6\) (i.e., stations 1 \(=\) Mudanjiang, 2 \(=\) Harbin, 3 \(=\) Qiqihar, 4 \(=\) Bugt, 5 \(=\) Nenjiang, and 6 \(=\) Hailar, respectively) form one cluster; observation \(i=7\) (i.e., station 7 \(=\) Lasa) is a one-station cluster; observations \(i=8,9\) (i.e., 8 \(=\) Baoshan, and 9 \(=\) Kunming, respectively) form a third cluster; and observations \(i = 10,\dots , 15\) (i.e., 10 \(=\) Wuzhou, 11 \(=\) Guangzhou, 12 \(=\) Nanning, 13 \(=\) Shantou, 14 \(=\) Haikou, and 15 \(=\) Zhanjiang, respectively) comprise the fourth cluster.

When the locations of Fig. 5 are considered, it is evident that the first cluster consists of stations all located in the north-east, which can reasonably be expected to have somewhat similar weather (the elevations, especially for Bugt and Hailar, are not all comparable, but this difference is reflected in the polytopes for these two stations lying on the left side of the cluster of polytopes for this region). Likewise, the stations in the fourth cluster are all located in the south-east part of the country (see Fig. 5), all with the same low elevations. This similarity is reflected in the comparable polytopes in the \(P{C_{1}} \times P{C_2}\) space in Fig. 4.

The stations Baoshan and Kunming (\(\#\)8–9), though located in the south, are further west and at higher elevations; hence, the separate cluster matches their actual geography. Here, the elevations as well as the temperature histograms are very similar, with the result that the polytopes are almost identical. Lasa is situated even further west and is much higher in elevation than any of the other 14 stations; therefore, its temperature patterns would be quite distinct. Thus, its polytope is quite different from, and isolated from, those of the other stations; see the top-left of Fig. 4.

If, instead of keeping the variations in the temperatures across years (as in the histogram observations), overall averages were used as classical surrogates, then the resulting projections of these surrogates give the point projections onto \(PC_1 \times PC_2\) space shown in Fig. 6. The clusters are again clear, and match those obtained from the histogram values, thus validating the methodology proposed herein.

Finally, the relative sizes of the polytopes provide an additional interpretation. For example, consider the stations Hailar (\(\#6\)) and Kunming (\(\#\)9). The polytope for Hailar is larger than that for Kunming in \(PC_1 \times PC_2\) space. This difference is explained by the fact that the range of temperatures and the internal variations of the histograms for Hailar are larger than those for Kunming. (This difference in variation is evident in Fig. 7, which shows the March temperature histograms for these two stations; the same holds for other months.) In contrast, the classical principal component values for these two stations are but points on \(PC_1 \times PC_2\), reflecting that classical observations themselves have no internal variation. Note that although Kunming is at a higher elevation than Hailar, the elevation variable is classically valued and so has no internal variation. That is, the elevation value here is not a factor in the size of the polytope (only in its position in \(PC_1 \times PC_2\) space); it is the differing internal values of the histograms that are reflected in the differing sizes of the polytopes. This additional interpretation from a histogram-valued principal component analysis is not possible from classically valued analyses.

Fig. 7
figure 7

a March histogram for Hailar (\(\#6\)). b March histogram for Kunming (\(\#9\))

Table 4 Histogram output for principal component 1

We further computed the PC histograms for each weather station. The histograms for the first principal component, with five subintervals of equal width, are shown in Table 4. Note that the purpose of the PC histograms is to provide numerical output for the resulting principal components; the polytopes of Fig. 4 are a better visualization tool. Figure 8 provides a graphical depiction of the \(PC_1\) histograms to illustrate the internal distribution of the output histograms. In addition to the clustering pattern seen in Fig. 4, the resulting PC histograms indicate similarity in the distribution of the \({PC}_1\) histograms within each cluster. The distributions of \({PC}_1\) values for observations \(i = 1, \ldots , 6\) have larger spread and are slightly skewed to the right, with more than \(90~\%\) of the values spread across the second and third subintervals. In contrast, the distributions for observations \(i = 10, \ldots , 15\) are narrower and somewhat more symmetric, with 50–60 % of values located near the center of the histograms. The distributions for observations \(i=8, 9\) are again similar to each other. When these PC histograms are input into models for further analysis, they reflect the varying distributions within observations along the PC axes.

Fig. 8
figure 8

Graphical depiction of \(PC_1\) histogram output of Table 4

5 Conclusions

In this paper, we propose a principal component analysis methodology for histogram-valued data by expanding the symbolic covariance method that Le-Rademacher and Billard (2012, 2013) proposed for interval-valued data. Currently, there is no comparable PCA methodology for histogram-valued observations within the symbolic data domain.

The principal components in this method are computed from the symbolic covariance matrix to account for the total variance of histogram data. The polytopes of Le-Rademacher and Billard (2012) are adapted to represent the observations in the principal component space by treating the observed histograms as collections of weighted sub-hyperrectangles. We further propose computing a density matrix associated with the sub-hyperrectangles to reflect their varying densities within each observation. The density matrix is also used when computing the numerical output to ensure that the relative frequencies of the output histograms, adapted from Le-Rademacher and Billard (2013), correctly account for the sub-hyperrectangle densities.

By basing our method on the polytopes of Le-Rademacher and Billard (2012) and the histogram output of Le-Rademacher and Billard (2013), our method inherits the computational efficiency of their methods, using routine matrix computation and avoiding search algorithms. However, generalizing the polytopes and the histogram output to histogram-valued observations is not a trivial process, as described in Sect. 2 and as illustrated by the detailed algorithm in the Appendix. By carefully defining the matrix of vertices, our method requires little additional computing power and retains computational efficiency.

6 Supplemental materials

R program: The R program to compute the principal components, the polytopes, and the output histograms as proposed in the article.

Example data: The temperature data used for illustration in the article.