1 Introduction

While symbolic data exist in their own right as small or large datasets, the advent of the modern computer has brought with it classical (and/or symbolic) datasets that are too large to be analysed by traditional statistical methodologies, even with the computational assistance of those same computers that generated the data. Therefore, in order to elicit reasonable and appropriate analyses and conclusions from the data, it becomes necessary to aggregate the data in some meaningful manner before analyses can proceed. How this aggregation occurs will depend on the underlying questions being asked and/or the answers being sought. For example, suppose a dataset consists of the medical records for a country (say), and suppose that, apart from the more directly medically related variables, there are also demographic variables such as the individual’s age, gender, town of residence, and so on. One basic question may relate to what happens across towns or residence sites, while another may be concerned with age × gender differences. These questions lead to aggregations by town, or by age × gender, respectively. The number of possible aggregations is limited only by the number of such basic questions. Whether the original data were classical or symbolic, the aggregated values will now be lists, and/or intervals, and/or modal values, regardless of the nature of the aggregation method adopted. For example, a list could be the types of cancer observed, Y = {lung, colon, …}; an interval value could be the pulse rate, Y = 64 ± 1 = (63, 65); a modal value could be a histogram, Y = {(red, p1), (green, p2), …} with Σpi = 1. For a review of symbolic data, see Billard and Diday (2003) and, for a more detailed description, see Bock and Diday (2000).

In this paper our focus will be on interval-valued data in the presence of rules, and in particular on obtaining basic descriptive statistics such as frequency histograms, joint frequency histograms and sample means and variances. Rules, so-called, can arise in two (or three) broadly defined ways. The first relates to underlying conditions that exist, be the data classical- or symbolic-valued. For example, interest may center on children, in which case any analysis conditions the data to contain only children; or the variables Y1 and Y2 may be required to satisfy a condition such as Y1 + Y2 = β (say); and so on. In contrast, when aggregating data into symbolic-valued variables, the very action of aggregation may produce data that require the adoption of rule(s) to maintain data integrity. For example, suppose we have values for Y1 = age and Y2 = number of children, and suppose we have particular classical values Ya = (21, 2), Yb = (10, 0), Yc = (16, 1),…, where Y = (Y1, Y2); and suppose further that the concept of interest, after appropriate aggregation, placed these three individuals into the same category and produced the symbolic interval-valued observation ξ = {(10, 21), (0, 1, 2)}. As it stands, the value ξ implies that persons in the age interval (10, 21) years had (0, 1, 2) children, including the possibility that the 10-year-old had 1 (or 2) children. To maintain data integrity here, it is necessary to include a rule such as ν = {If Y1 < 14 (say), then Y2 = 0}. The need for this type of rule is unique to symbolic data. The precise nature of such rules will vary with the description of the symbolic data value. A possible third type of rule amounts to a form of data cleaning; e.g., a “rule” such as age = Y1 > 0 could be used to catch observed (classical or symbolic) values of Y1 = −15 (say, an obvious miskeying situation). In some circumstances, data cleaning rules are absorbed into either of the first two categories defined above. Data cleaning rules, however, do need to be present for datasets too large to be “eye-balled” for correctness.

Since classical data are but single points in p-dimensional space (where p is the number of variables), rules are relatively easy to manage. However, since symbolic values are p-dimensional hypercubes and/or Cartesian products of distributions in p-dimensional space, rules can and do create difficulties. We focus on rules for interval-valued data; conceptually, the methodology extends reasonably easily to histogram-valued data (though less easily computationally!).

Bertrand and Goupil (2000) derived formulae for finding the univariate histogram and sample mean and variance for a single interval-valued variable Y without rules. They also developed the corresponding results for multi-valued (list) data with and without rules. To accommodate rules, their basic approach was to convert each actual possible symbolic data value into a so-called virtual data value, where the virtual values were those that satisfied the given rule(s). Billard and Diday (2003) alluded to extending Bertrand and Goupil’s virtual data idea to interval data with rules, but gave no details. Our aim here is to develop this concept further and also to extend it to finding joint histograms for (Y1, Y2) where Y1 and Y2 are each interval-valued variables and where rules exist. We develop our basic approach through rules applied to the interval-valued data of Table 1.

Table 1 At-Bats and Hits by Team

Therefore, in Section 2, we consider the nature of the virtual observation space in the presence of rules and show how the virtual observation values can be determined. Then, in Section 3, we use these virtual observations to obtain univariate histograms under a variety of specific rules. Calculating the sample mean and variance in the presence of rules is studied in Section 4. Derivation of a joint histogram for the bivariate Y = (Y1, Y2) is considered in Section 5. The basic principles involved are discussed and summarized in Section 6. These form the nucleus of the methodology required to obtain basic statistics for interval-valued data in the presence of rules. In the course of these derivations, the need arises for an algorithm for calculating a histogram of histogram-valued data and an algorithm for calculating a joint histogram for interval-valued data; these algorithms are outlined in Section 7.

2 Observed and Virtual Symbolic Intervals

The data of Table 1 represent two random variables, viz., Y1 = Number of At-Bats; and Y2 = Number of Hits, for baseball players over a season. Players are aggregated by teams, so that the resulting team statistics are now intervals. The results shown in Table 1 are based on actual (Y1, Y2) statistics for a sample of players from a variety of baseball teams obtained from Vanessa and Vanessa (2004). Some additional results have been inserted for illustrative purposes.

We denote a particular realization of Y = (Y1, Y2) by ξ = (ξ1, ξ2) with ξi = (ai, bi), i = 1, 2. Following Bertrand and Goupil (2000), we make the assumption that specific (point) values of Yi are uniformly distributed across the interval (ai, bi). Further, ξ takes values in the p = 2-dimensional hypercube (i.e., rectangle) bounded by (a1, b1) × (a2, b2). We denote a specific observation by ξ(u), which is bounded by the rectangle R(u) = (a1u, b1u) × (a2u, b2u) for u = 1, …, n, where n is the number of observations.

To examine these data more closely, we first make the logical deduction that the Number of At-Bats cannot be less than the Number of Hits, i.e., Y1 ≥ Y2. Consider the second observation ξ(2). Each of the ξ1(2) and ξ2(2) values is possible. The resulting rectangle R(2) has vertices at (x1, x2) = (88, 49), (88, 149), (422, 49) and (422, 149). All (x1, x2) values contained in this rectangle appear as possible values. This includes the vertex value (x1, x2) = (88, 149), i.e., the number of hits is 149 from 88 at-bats — clearly not a logical possibility. However, another player can have 149 hits from 422 at-bats, for example, and so on. Here, the logical rule ν: Y1 ≥ Y2 implies that the actual apparent hypercube R(u) has to be transformed to a virtual hypercube V(u) containing only those values of R(u) that satisfy the rule ν. In contrast, the observation u = 6, with ξ(6) = {(24, 26), (133, 141)}, would suggest that the ξ1 and ξ2 values have been transposed. The logical rule catches this, as part of a data cleaning process for example.

Formally, we adapt the definition of virtual data, from Bertrand and Goupil (2000), as follows.

Definition: The virtual observation space V ≡ V(u) of an actual observation space R ≡ R(u) consists of all possible values x in R which satisfy all the rules ν = {ν1, ν2, …} operating on R. That is, for the observation u,

$$V(u)=\left\{x \in R(u), \quad \nu_{i}(x)=1, \quad \text { for all }\ \nu_{i} \text { in } \nu\right\}$$
((1))

where νi(x) = 1 if the rule is true for the vector-value x and is 0 if the rule is not true for x. Let us denote the virtual observation by ξ′ = (ξ′1, …, ξ′p) with ξ′i = (a′i, b′i), i = 1, …, p.
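As an aside, equation (1) translates directly into an indicator-function check. The following minimal Python sketch (the paper’s own algorithms in Section 7 are SAS macros; the names Point, Rect and in_virtual_space here are ours) tests whether a point x of R(u) survives into V(u) under the Y1 ≥ Y2 deduction made above for the baseball data:

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]               # x = (x1, x2), for p = 2
Rect = Tuple[float, float, float, float]  # R(u) as (a1, b1, a2, b2)

def in_virtual_space(x: Point, rect: Rect,
                     rules: List[Callable[[Point], bool]]) -> bool:
    """True iff x lies in R(u) and nu_i(x) = 1 for every rule nu_i in nu."""
    a1, b1, a2, b2 = rect
    in_rect = a1 <= x[0] <= b1 and a2 <= x[1] <= b2
    return in_rect and all(nu(x) for nu in rules)

# The logical rule of Section 2, Y1 >= Y2, applied to R(2):
nu = lambda x: x[0] >= x[1]
print(in_virtual_space((422, 149), (88, 422, 49, 149), [nu]))  # True
print(in_virtual_space((88, 149), (88, 422, 49, 149), [nu]))   # False: 149 hits from 88 at-bats
```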

To illustrate this further, suppose that for the Table 1 data, there is a logical rule

$$\nu : Y_{2} \leq \alpha Y_{1}.$$
((2))

Setting α = 1.0 allows for the removal of x values that are not logically possible; while setting α = 0.400, say, is acknowledging that batting averages (= Y2/Y1) above 0.400 are unlikely and therefore in this present sense also not logically possible. The impact of this rule ν on the observed rectangle R will produce a virtual hypercube V which has one of the nine patterns, denoted by A, B, …, I, displayed in Figure 1. Those values which fall above the line Y2 = αY1 are not logically possible values and so are excluded from R to produce the virtual value V. The shaded regions correspond to the virtual values V. The conditions that give ν(x) = 1 for the respective patterns are given in Table 2.

Figure 1 Patterns for Virtual V — shaded regions

The pattern A corresponds to those observations for which V(u) is empty; e.g., ξ(6) in Table 1. In this case, the underlying condition from equation (1) that generates ν(x) = 0 is {αb1 < a2}. The pattern B represents those observations that are unaffected by the rule ν, i.e., V(u) = R(u). The condition that gives ν(x) = 1 in equation (1) for this pattern translates, in terms of the (ai, bi) values, i = 1, 2, to {αa1 ≥ b2}.

The four patterns C, D, E, F are similar in that the virtual observation hypercube is a triangle, though they differ as to whether or not particular triangle vertices fall on the line Y2 = αY1. Therefore, the virtual ξ values differ accordingly. Thus, for pattern D, the virtual value for the original observation ξ is ξ′1 = ξ1 = (a1, b1) and ξ′2 = (a2, αb1). Notice that the virtual observation for Y1 (alone) is unaffected by ν. In contrast, for pattern E, Y2 is unaffected, ξ′2 = ξ2, but the Y1 values are affected, giving the virtual value ξ′1 = (a2/α, b1). For pattern F, both Y1 and Y2 values are affected by ν; whereas in pattern C, neither is. Table 3 displays these virtual values ξ′i, i = 1, 2.

Table 3 Virtual Observation Space Values — by Pattern

Table 2 Virtual Patterns and Conditions

Also shown in Table 3 are the apparent virtual values for the bivariate pair (ξ′1, ξ′2) for the C, D, E and F patterns. When calculating the histogram for Y1 (or Y2) alone, these virtual ξ′i values are used in the usual manner via the Bertrand and Goupil (2000) methodology. However, when calculating the joint histogram for (Y1, Y2), routine application of the methodology (see Billard and Diday, 2003) would in this case produce answers as though the hypercube (ξ′1, ξ′2) were the rectangle (a′1, b′1) × (a′2, b′2) with area (b′1 − a′1)(b′2 − a′2), instead of the triangle whose vertices are {(a′1, a′2), (b′1, a′2), (b′1, b′2)}, and with area ∣V∣ = (b′1 − a′1)(b′2 − a′2)/2, where ∣A∣ denotes the area of the region A. Clearly, this feature has to be accommodated, and is addressed further in Section 5. The corresponding areas ∣V∣ for each pattern C, D, E, F are also displayed in Table 3.

The two patterns G and H have the common feature that their 4-sided (non-rectangular) hypercube can be viewed as the union of a triangle and a rectangle. For the pattern G, the virtual description for Y1 (alone) is now a histogram-valued variable (and not the interval-valued observation of the original data); while for the pattern H, it is the variable Y2 (considered alone) which has a histogram-valued virtual description. Thus, we can show that in pattern G, the virtual observation becomes

$$\xi_{1}^{\prime}=\left\{\left(a_{2} / \alpha, b_{2} / \alpha\right) p_{1}, \quad\left(b_{2} / \alpha, b_{1}\right) p_{2}\right\}$$
((3))

where the relative frequencies pi, i = 1, 2, are given by

$$p_{i}=\left|R_{i}\right| /|V|$$
((4))

with

$$\left|R_{1}\right|=\left(b_{2}-a_{2}\right)^{2} /(2 \alpha), \quad\left|R_{2}\right|=\left(\alpha b_{1}-b_{2}\right)\left(b_{2}-a_{2}\right) / \alpha$$
((5))

and

$$|V|=\left|R_{1}\right|+\left|R_{2}\right|;$$
((6))

and where the virtual description of Y2 (alone) is unaffected, with ξ′2 = ξ2 = (a2, b2). These are displayed in Table 3 for both patterns G and H. Then, by using the methodology developed in Billard and Diday (2003) for obtaining a histogram of histograms, the respective (univariate) histograms can be obtained. Also shown in Table 3 is the apparent virtual description of the bivariate pair (Y1, Y2). These too are now histogram-valued, rather than interval-valued, observations. However, again as cautioned above for the patterns C, D, E, F, care is required for the “triangle” pieces, viz., R1 ≡ [(a2/α, b2/α), (a2, b2)] in pattern G, and R2 ≡ [(a1, b1), (αa1, αb1)] in pattern H.
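To make the pattern-G calculation concrete, the following short Python sketch evaluates equations (3)–(6); the observation used is hypothetical, chosen only so that the pattern-G configuration holds, and the function name is ours:

```python
def pattern_g_virtual_y1(a1, b1, a2, b2, alpha):
    """Virtual value xi'_1 under pattern G, per equations (3)-(6)."""
    r1 = (b2 - a2) ** 2 / (2 * alpha)           # triangle area, eq. (5)
    r2 = (alpha * b1 - b2) * (b2 - a2) / alpha  # rectangle area, eq. (5)
    v = r1 + r2                                 # |V|, eq. (6)
    return [((a2 / alpha, b2 / alpha), r1 / v),  # eq. (3), weights from eq. (4)
            ((b2 / alpha, b1), r2 / v)]

# Hypothetical observation xi = {(50, 300), (100, 200)} with alpha = 1,
# so that alpha*a1 < a2 and alpha*b1 > b2 (the pattern-G configuration):
print(pattern_g_virtual_y1(50, 300, 100, 200, 1.0))
# [((100.0, 200.0), 0.333...), ((200.0, 300.0), 0.666...)]
```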

Finally, we consider the pattern I, reproduced in Figure 2a. In this case, the virtual observation space V is a 5-sided hypercube which can be partitioned into the triangle R1 and three rectangles R2, R3, R4 with respective vertices as indicated in Figure 2a. For data that follow this pattern, the virtual values of both the Y1 and Y2 variables (each considered alone) differ from the actual observed values; and in each case the virtual values become histogram-valued instead of the original interval-valued. It follows that for Y1 (alone) the virtual observation is

Figure 2 Pattern I Detail: (a) General, (b) Observation ξ(2)

$$\xi_{1}^{\prime}=\left\{\left(a_{1}, b_{2} / \alpha\right) p_{1}, \quad\left(b_{2} / \alpha, b_{1}\right) p_{2}\right\}$$
((7))

where

$$p_{1}=\left(\left|R_{1}\right|+\left|R_{2}\right|\right) /|V|, \quad p_{2}=\left(\left|R_{3}\right|+\left|R_{4}\right|\right) /|V|$$
((8))

with

$$\begin{array}{c}{\left|R_{1}\right|=\left(b_{2}-\alpha a_{1}\right)^{2} /(2 \alpha),\ \ \left|R_{2}\right|=\left(b_{2}-\alpha a_{1}\right)\left(\alpha a_{1}-a_{2}\right) / \alpha}, \\ {\left|R_{3}\right|=\left(\alpha b_{1}-b_{2}\right)\left(\alpha a_{1}-a_{2}\right) / \alpha, \quad\left|R_{4}\right|=\left(\alpha b_{1}-b_{2}\right)\left(b_{2}-\alpha a_{1}\right) / \alpha}\end{array}$$
((9))

and

$$|V|=\left|R_{1}\right|+\cdots+\left|R_{4}\right|.$$
((10))

The virtual observation for Y2 (alone) is

$$\xi_{2}^{\prime}=\left\{\left(a_{2}, \alpha a_{1}\right) p_{1}^{*}, \quad\left(\alpha a_{1}, b_{2}\right) p_{2}^{*}\right\}$$
((11))

where

$$p_{1}^{*}=\left(\left|R_{2}\right|+\left|R_{3}\right|\right) /|V|, \quad p_{2}^{*}=\left(\left|R_{1}\right|+\left|R_{4}\right|\right) /|V|.$$
((12))

These values are summarized in Table 3. The table also shows the corresponding apparent virtual observation for (Y1, Y2) taken together as a bivariate pair. Here, we can show that the virtual value ξ′ of ξ is

$$\begin{array}{c}{\xi^{\prime}=\left\{\left[\left(a_{1}, b_{2} / \alpha\right),\left(\alpha a_{1}, b_{2}\right)\right] p_{1}^{* *},\left[\left(a_{1}, b_{2} / \alpha\right),\left(a_{2}, \alpha a_{1}\right)\right] p_{2}^{* *}\right.}, \\ {\qquad\quad \left[\left(b_{2} / \alpha, b_{1}\right),\left(a_{2}, \alpha a_{1}\right)\right] p_{3}^{* *},\left[\left(b_{2} / \alpha, b_{1}\right),\left(\alpha a_{1}, b_{2}\right)\right] p_{4}^{* *} \}}\end{array}$$
((13))

where

$$p_{i}^{* *}=\left|R_{i}\right| /|V|, \quad i=1, \ldots, 4,$$
((14))

with ∣Ri∣ and ∣V∣ as given in equations (9) and (10). Again, the “triangle” piece (R1 ≡ [(a1, b2/α), (αa1, b2)]) requires special care when calculating a joint histogram function.
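Gathering the results of this section, the following Python sketch classifies an observation into one of the patterns of Figure 1 and, for pattern I, evaluates equations (7)–(14). The conditions for patterns A and B are those quoted above; the remaining branches are our reading of Figure 1 and Table 3 (boundary cases as resolved in Table 2), so the classifier is illustrative rather than definitive. Applied to ξ(2) with α = 1, it reproduces the pattern-I quantities used in Section 3:

```python
def pattern(a1, b1, a2, b2, alpha):
    """Classify R(u) = (a1,b1) x (a2,b2) under nu: Y2 <= alpha*Y1."""
    lo, hi = alpha * a1, alpha * b1       # height of the line at x = a1, b1
    if hi < a2:               return "A"  # V(u) empty
    if lo >= b2:              return "B"  # V(u) = R(u)
    if lo == a2 and hi == b2: return "C"  # triangle; xi'_1, xi'_2 unchanged
    if lo == a2 and hi < b2:  return "D"  # triangle; xi'_1 unchanged
    if lo < a2 and hi == b2:  return "E"  # triangle; xi'_2 unchanged
    if lo < a2 and hi < b2:   return "F"  # triangle; both affected
    if lo < a2 and hi > b2:   return "G"  # triangle + rectangle
    if lo > a2 and hi <= b2:  return "H"  # rectangle + triangle
    return "I"                            # a2 < alpha*a1 < b2 < alpha*b1

def pattern_i_quantities(a1, b1, a2, b2, alpha):
    """Areas |R1|,...,|R4|, |V| and the weights of eqs (8), (12), (14)."""
    r1 = (b2 - alpha * a1) ** 2 / (2 * alpha)           # triangle, eq. (9)
    r2 = (b2 - alpha * a1) * (alpha * a1 - a2) / alpha  # eq. (9)
    r3 = (alpha * b1 - b2) * (alpha * a1 - a2) / alpha  # eq. (9)
    r4 = (alpha * b1 - b2) * (b2 - alpha * a1) / alpha  # eq. (9)
    v = r1 + r2 + r3 + r4                               # eq. (10)
    p = ((r1 + r2) / v, (r3 + r4) / v)                  # eq. (8)
    p_star = ((r2 + r3) / v, (r1 + r4) / v)             # eq. (12)
    p_2star = tuple(r / v for r in (r1, r2, r3, r4))    # eq. (14)
    return (r1, r2, r3, r4, v), p, p_star, p_2star

# Observation xi(2) = {(88, 422), (49, 149)} under nu2 (alpha = 1):
print(pattern(88, 422, 49, 149, 1.0))                   # 'I'
areas, p, p_star, p_2star = pattern_i_quantities(88, 422, 49, 149, 1.0)
print(areas)    # (1860.5, 2379.0, 10647.0, 16653.0, 31539.5)
print(p)        # ~(0.134, 0.866): the weights in eq. (7)
print(p_star)   # ~(0.413, 0.587): the weights in eq. (11)
print(p_2star)  # ~(0.059, 0.075, 0.338, 0.528): the weights in eq. (13)
```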

3 Construction of Histograms

When, after application of the rule ν = (ν1, ν2, …), the resulting virtual dataset consists entirely of interval-valued data, the histogram of the virtual dataset can be constructed by using the Bertrand and Goupil methodology which is available computationally in the SODAS software (and can be found on the web at www.ceremade.dauphine.fr/%7Etouati/sodas-pagegarde.htm).

For comparative purposes, we first give the histogram for the baseball dataset of Table 1 when there are no rules. Suppose we build the histogram for Y1 = Number of At-Bats on the r1 = 7 intervals [0, 100),…, [600, 650]; and suppose the histogram for Y2 = Number of Hits is constructed on the r2 = 9 intervals [0, 50), [50, 75),…, [200, 225), [225, 275]. The resulting histograms are given in column (a) of Table 4 for Y1 and Table 5 for Y2, respectively.

Table 4 Histogram for Y1 = # At-Bats

Suppose now interest is restricted to those situations with 120 or more at-bats. This translates to the rule

$$\nu_{1} :\left\{Y_{1} \geq 120\right\}.$$
((15))

Under this rule, observations ξ(6) and ξ(16) are deleted entirely. Observations u = 2, 10, 15, and 17 are truncated, so that the virtual observation for u = 2 becomes ξ′(2) = {(120, 422), (49, 149)}; likewise, ξ′(10), ξ′(15), and ξ′(17) can be found. After application of the rule ν1, all virtual observations are interval-valued. Then, by building the relevant histograms on the same histogram intervals used in column (a), we obtain the frequencies of column (b) in Table 4 for Y1 and Table 5 for Y2, respectively. Comparing the two histograms of columns (a) and (b), we see the impact of this rule. For the variable Y1, since this rule directly truncates Y1 values, the two histogram intervals \(I_{g_{1}}\), g1 = 1, 2, are clearly affected. However, other histogram intervals are affected as well (in contrast to the corresponding comparison for classical data, where these latter intervals would not be affected). Take, e.g., the histogram interval \(I_{g_{1}}=I_{3}=[200,300)\), all of whose internal values are valid under ν1, and consider the contribution of the u = 2 observation to this I3 interval. Then, the virtual data value ξ′1(2) contributes a portion equal to (300 − 200)/(422 − 120) = 100/302 to the frequency of I3, while the original data value ξ1(2) contributes the amount (300 − 200)/(422 − 88) = 100/334 (≠ 100/302) to the frequency in I3. A comparison of columns (a) and (b) in Table 5 for the histogram of the Y2 variable also reveals differences. This occurs even though the rule ν1 does not involve Y2 directly, and even though for every retained observation u the virtual ξ′2(u) = ξ2(u). The impact of ν1 on the histogram for Y2 is a reflection of the u = 6 and u = 16 observations being deleted.
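The arithmetic in the preceding paragraph is an instance of the Bertrand-Goupil contribution ∥ξ ∩ Ig∥/∥ξ∥, sketched minimally in Python below (the function name is ours):

```python
def contribution(xi, ig):
    """Bertrand-Goupil share ||xi ∩ I_g|| / ||xi|| of interval xi in I_g."""
    (a, b), (ha, hb) = xi, ig
    overlap = max(0.0, min(b, hb) - max(a, ha))
    return overlap / (b - a)

# Contribution of the u = 2 observation to I_3 = [200, 300):
print(contribution((88, 422), (200, 300)))   # 100/334 ~ 0.299: original xi_1(2)
print(contribution((120, 422), (200, 300)))  # 100/302 ~ 0.331: virtual xi'_1(2) under nu_1
```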

Table 5 Histogram for Y2 = # Hits

Suppose now we apply the rule of equation (2) with α = 1.0, viz.,

$$\nu_{2} : Y_{1} \geq Y_{2},$$
((16))

i.e., the number of hits cannot exceed the number of at-bats. Table 6, column (a), identifies the pattern of the virtual observation in the presence of this rule. Columns (b) and (c) give the virtual observation value for Y1 and Y2, respectively, for each case by utilizing Table 3. For example, clearly when u = 1, pattern B pertains. Hence, ξ′1 = ξ1, ξ′2 = ξ2; also, ξ′ = (ξ′1, ξ′2) = ξ. The u = 2 observation under ν2 reduces to a virtual observation space with pattern I (see Figure 2b). It is readily verified that the areas ∣Ri∣, i = 1, …, 4, and ∣V∣ are

Table 6(i) Virtual ξ′1 and ξ′2 under rule ν2 : Y1 ≥ Y2
Table 6(ii) Virtual ξ′ = (ξ′1, ξ′2) under rule ν2 : Y1 ≥ Y2
$$\left|R_{1}\right|=1860.5,\quad\left|R_{2}\right|=2379, \quad\left|R_{3}\right|=10647, \quad\left|R_{4}\right|=16653, \quad|V|=31539.5.$$

Hence, the virtual values for this observation become ξ′(2) = (ξ′1(2), ξ′2(2)) where, for Y1 (considered alone), by substitution into equations (7)–(9), we have

$$\xi_{1}^{\prime}(2)=\{(88,149), 0.134 ; \quad(149,422), 0.866\};$$

for Y2 considered alone, from equations (9)–(12),

$$\xi_{2}^{\prime}(2)=\{(49,88), 0.413 ; \quad(88,149), 0.587\};$$

and that for (Y1, Y2) the virtual value is, from equations (9), (10) and (13),

$$\begin{aligned} \xi^{\prime}(2)=&\{[(88,149),(88,149)], 0.059 ;\quad [(88,149),(49,88)], 0.075;\\ &[(149,422),(49,88)], 0.338 ;\quad[(149,422),(88,149)], 0.528 \} \end{aligned}$$

Under this rule, only the u = 6 observation fails entirely, being identified as a pattern A virtual observation. However, only the nine observations corresponding to u = 1, 4, 5, 7, 9, 11, 13, 14, 18 are unaffected by this rule, each being identified as a pattern B value. The remaining eight observations are affected in various ways (with a variety of patterns occurring), with all eight having some portion of the original R(u) space eliminated as not being logically possible under ν2. The virtual values for all the observations in the dataset of Table 1 after application of the rule ν2 are displayed in Table 6 in columns (b), (c), and (d) for the variables Y1, Y2 and (Y1, Y2), respectively. Clearly, the virtual dataset contains histogram-valued observations. An algorithm for the determination of a histogram from histogram-valued observations is outlined in Section 7. Therefore, building our histograms for Y1 (or Y2) on the same histogram intervals as were used previously, we can obtain the histograms for Y1 (and Y2) as displayed in column (c) of Table 4 (and Table 5, respectively).

Column (d) of Tables 4 and 5 gives the corresponding histograms for Y1 and Y2, respectively, when, in equation (2), α = 0.350, i.e., under the rule

$$\nu_{3} : Y_{2} \leq 0.350 Y_{1}.$$
((17))

In this case, several more of the original observations have virtual values which follow pattern A, as would be expected; and the resulting histograms reflect this restriction. This is especially evident for the \(I_{g_{2}}=I_{8}=[200,225)\) interval of the histogram for the number of hits Y2. Under ν3, the u = 3 and u = 12 observations are deleted by virtue of their becoming pattern A values in their virtual space. Yet both of these observations contributed nonzero frequencies to this I8 interval for the histograms of columns (a), (b) and (c) in Table 5. We can show that ξ2(3) = (201, 254) and ξ2(12) = (189, 238) contributed a frequency equal to 24/53 = 0.453 and 25/49 = 0.510, respectively, for a total contribution of 0.963 when there were no rules.

Finally, column (e) in Tables 4 and 5 provides the histogram results for the set of rules

$$\nu_{4} :\left(\nu_{1}, \nu_{2}\right) \equiv\left\{Y_{1} \geq 120 \text { and } Y_{2} \leq Y_{1}\right\}$$
((18))

for Y1 alone and Y2 alone, respectively. The details are omitted.

4 Sample Means and Variances

Formulae for calculating the empirical mean and variance for interval-valued data were given by Bertrand and Goupil (2000), and for histogram-valued data by Billard and Diday (2003). We have seen how rules in effect transform the actual interval-valued data R(u) into virtual data V(u), u = 1,…, n, with these virtual data also being interval-valued or histogram-valued. Thus, the Bertrand-Goupil or Billard-Diday formulae subsequently apply. For completeness, we provide here the formulae for histogram-valued data.

Suppose our random variable Y has histogram values ξ(u) = {(auj, buj), puj; j = 1,…, su} with ∑j puj = 1, where, for observation u, puj is the relative frequency (or probability) of taking values on the jth interval (auj, buj), j = 1,…, su, and su is the total number of histogram intervals. Note that when su = 1, and hence pu1 = 1, we have an interval-valued observation. Then, from Billard and Diday (2003), the sample mean is given by

$$\overline{Y}=\frac{1}{2 n} \sum_{u=1}^{n}\left\{\sum_{j=1}^{s_{u}}\left(a_{u j}+b_{u j}\right) p_{u j}\right\}$$
((19))

and the sample variance S2 and standard deviation S are found from

$$S^{2}=\frac{1}{3 n} \sum_{u=1}^{n}\left\{\sum_{j=1}^{s_{u}}\left(a_{u j}^{2}+a_{u j} b_{u j}+b_{u j}^{2}\right) p_{u j}\right\}-\frac{1}{4 n^{2}}\left\{\sum_{u=1}^{n} \sum_{j=1}^{s_{u}}\left(a_{u j}+b_{u j}\right) p_{u j}\right\}^{2}.$$
((20))

Therefore, by using equations (19) and (20) on the original data of Table 1, we obtain the Ȳ and S values shown in Table 4 for the Number of At-Bats Y1, and in Table 5 for the Number of Hits Y2. Likewise, under the rules ν1,…, ν4, we can apply equations (19) and (20) to the relevant virtual data to obtain the corresponding values for Ȳ and S; these are also displayed in Tables 4 and 5 for Y1 and Y2, respectively.
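As an illustration of equations (19) and (20), the following Python sketch (function name ours) computes Ȳ and S for a toy dataset that mixes an interval-valued observation with the histogram-valued virtual value ξ′1(2) obtained in Section 3:

```python
def symbolic_mean_var(data):
    """Sample mean and variance per equations (19) and (20); 'data' is a
    list of observations, each a list of (a_uj, b_uj, p_uj) triples."""
    n = len(data)
    mean = sum(p * (a + b) for obs in data for (a, b, p) in obs) / (2 * n)  # eq. (19)
    m2 = sum(p * (a * a + a * b + b * b)
             for obs in data for (a, b, p) in obs) / (3 * n)
    return mean, m2 - mean ** 2                                             # eq. (20)

# Toy dataset: the interval-valued xi_1(1) = (289, 538) (s_u = 1, p = 1)
# and the histogram-valued virtual xi'_1(2) of Section 3:
data = [[(289, 538, 1.0)],
        [(88, 149, 0.134), (149, 422, 0.866)]]
mean, var = symbolic_mean_var(data)
print(mean, var ** 0.5)  # sample mean Y-bar and standard deviation S
```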

5 Joint Histograms

Principles underlying the univariate case apply to constructing histograms for p ≥ 2 variables. For illustrative clarity, let us take p = 2 and construct the joint histogram for Y = (Y1, Y2) on the histogram rectangles R(g1, g2) = {[ha1, hb1) × [ha2, hb2)}, g1 = 1,…, r1, g2 = 1,…, r2. Then, when there are no rules present, we have from Billard and Diday (2003) that the frequency with which observations lie in the rectangle R(g1, g2) is

$$O\left(g_{1}, g_{2}\right)=\sum_{u} \frac{\left|R(u) \cap R\left(g_{1}, g_{2}\right)\right|}{|R(u)|}.$$
((21))

The relative frequency is \(p_{g_{1} g_{2}}=O\left(g_{1}, g_{2}\right) / n\).

An algorithm for calculating these \(p_{g_{1} g_{2}}\) and O(g1, g2) terms is given in Section 7. To illustrate this, we construct a joint histogram for Y = (Y1, Y2) using the baseball data of Table 1. Suppose we take histogram intervals on Y1 as [0, 50), [50, 200),…, [500, 650] and histogram intervals on Y2 as [0, 75), [75, 125), …, [225, 275]. Thus, e.g., for g1 = 3, g2 = 4, we have the histogram rectangle R(3, 4) = [200, 350) × [175, 225). The observed frequencies are shown in Table 7. The corresponding relative frequencies \(p_{g_{1} g_{2}}\) are plotted in Figure 3.
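The overlap computation in equation (21) reduces to a product of one-dimensional interval overlaps, as in the following minimal Python sketch (names ours), applied here to the observation ξ(1) = [(289, 538), (75, 162)] of Table 1:

```python
def joint_contribution(rect_u, rect_g):
    """|R(u) ∩ R(g1,g2)| / |R(u)| in equation (21); each rectangle is
    given as ((a1, b1), (a2, b2))."""
    frac = 1.0
    for (a, b), (ha, hb) in zip(rect_u, rect_g):
        frac *= max(0.0, min(b, hb) - max(a, ha)) / (b - a)
    return frac

# Observation xi(1) against two of the histogram rectangles defined above:
print(joint_contribution(((289, 538), (75, 162)), ((200, 350), (175, 225))))  # 0.0: no Y2 overlap
print(joint_contribution(((289, 538), (75, 162)), ((200, 350), (75, 125))))   # (61/249)(50/87)
```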

Figure 3 Joint Histogram (Y1, Y2) — No Rules

Table 7 Joint Histogram for (Y1, Y2) — No Rules (ν0)

When rules are present we replace the actual observation R(u) by its virtual observation V(u). When the V(u) values are themselves rectangles, then the same computational algorithm used for equation (21) pertains. It is often the case that this virtual space V(u), itself a multi-sided hypercube, can be partitioned into components. The patterns G, H, I for the baseball data are examples of such partitioning. When these components are themselves rectangular, then again use of the basic joint histogram algorithm of Section 7 pertains.

Some of the virtual observations in the baseball example have components that are triangles. Routine use of the basic algorithm would treat these triangular pieces as though they were rectangles; for such components, appropriate adjustment has to be made. We omit the details.

Thus, to illustrate, we construct the joint histogram for the baseball data when the rule ν2 holds, on the same histogram intervals as were used above in Table 7 and Figure 3. The observed frequencies are displayed in Table 8, and the relative frequencies are plotted in Figure 4.

Figure 4 Joint Histogram (Y1, Y2), under Rule ν2

Table 8 Joint Histogram for (Y1, Y2) — Rule ν2 : Y2 ≤ Y1

6 Basic Principles

Suppose we have observations on the p-dimensional variable Y = (Y1,…, Yp) with Yj taking values on the interval ξj = (aj, bj), j = 1,…, p, for each observation u = 1,…, n. Let R(u) be the p-dimensional rectangle that represents the observation u. Let there be a set of rules ν = {ν1, ν2, …}.

The basic issue is to find that subspace V(u) of R(u) which represents those values of R(u) for which the rules ν hold; i.e., those x = (x1,…, xp) such that νi(x) = 1 for all rules νi; see equation (1). For some u, V(u) = ∅, the empty set; for other u, V(u) = R(u), the original observation; and for others, V(u) is a nonempty p-dimensional rectangle R*(u) contained in R(u), i.e., V(u) ≡ R*(u) ⊆ R(u). For these observations, the adjustment to the relevant calculation for the descriptive statistic of interest is routine.

Frequently, the virtual observation V(u) is a p-dimensional non-rectangular hypercube. However, it is usually the case that this virtual space can be partitioned into components each of which is itself a rectangle (or a shape such as a triangle, which is half a rectangle). For example, pattern I observed for the baseball data under rule ν2 (see Table 3) can be partitioned into components Rj, j = 1,…, 4. Each component Rj is a proportion pj of the whole virtual space V, with \(V=\bigcup_{j} R_{j}\) and Σj pj = 1, for a given u. Each Rj component is then added to the dataset as though it were an “observation”, but an observation with probability weight pj. This necessitates a probability weight of p = 1 for those observations u for which V(u) is itself a p-dimensional rectangle. When all virtual components {Rj, j = 1, 2, …} are rectangles, the methodologies presented herein apply directly.

When (as in the baseball example) an Rj is a triangle, adjustment has to be made to ensure that the calculated area of the “triangle” is indeed that, and not the area of the corresponding rectangle. Components Rj that are neither rectangles nor triangles are a different matter. In some instances, such a non-rectangular shape is not in itself a problem, though calculating the corresponding probability might (but should not in general) be tricky. Situations that are otherwise difficult will be treated elsewhere. The present work assumes V(u) can be partitioned into rectangular components (with appropriate adjustment for triangular pieces).

7 Histogram Algorithms

7.1 Univariate Histograms of Histogram-Valued Data

An algorithm for calculating the histogram of a set of histogram-valued data is briefly outlined as follows. Suppose the random variable Y has realizations

$$\xi(u)=\left\{\xi_{u j}=\left[a_{u j}, b_{u j}\right), p_{u j} ;\ \ j=1, \ldots, s_{u}\right\}$$

for each u = 1,…, n, where puj is the observed relative frequency on the interval \(\left[a_{u j}, b_{u j}\right) \text { with } \sum_{j} p_{u j}=1\), and where su is the number of histogram intervals for the data value u. In the virtual descriptions of patterns G, H, I in Section 2, su = 2. Note that when su = 1, and hence pu1 = 1, the observation is interval-valued (as a special case of histogram-valued variables). Suppose we want to construct a histogram of these {ξ(u), u = 1,…, n} observations. Let there be r histogram intervals Ig = [ha, hb), g = 1,…, r − 1, and Ir = [ha, hb], where clearly in I1, ha ≤ minu,j auj, and in Ir, hb ≥ maxu,j buj. Then, from Billard and Diday (2005), the observed frequency for the histogram interval Ig is given by

$$O(g)=\sum_{u} \sum_{j \in Z(g)} p_{u j} \frac{\left\|\xi_{u j} \cap I_{g}\right\|}{\left\|\xi_{u j}\right\|}$$
((22))

where, for each u = 1,…, n, Z(g) is the set of all ξuj intervals which overlap with Ig and where ∥A∥ is the length of the interval A. The relative frequency is pg = O(g)/n.
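A Python analogue of equation (22) is sketched below; the Appendix A implementation itself is a SAS macro, so this rendering and its names are ours. Applied to the virtual ξ′1(2) of Section 3, the frequencies it returns sum to 1, as they must for a single observation:

```python
def histogram_of_histograms(data, breaks):
    """O(g) and p_g = O(g)/n of equation (22), over the intervals
    I_g = [breaks[g], breaks[g+1])."""
    n = len(data)
    O = [0.0] * (len(breaks) - 1)
    for obs in data:                      # obs = [(a_uj, b_uj, p_uj), ...]
        for (a, b, p) in obs:
            for g in range(len(O)):
                ha, hb = breaks[g], breaks[g + 1]
                overlap = max(0.0, min(b, hb) - max(a, ha))
                O[g] += p * overlap / (b - a)   # eq. (22)
    return O, [o / n for o in O]

# The virtual xi'_1(2) of Section 3, on the Y1 intervals [0,100),...,[600,650]:
data = [[(88, 149, 0.134), (149, 422, 0.866)]]
O, pg = histogram_of_histograms(data, [0, 100, 200, 300, 400, 500, 600, 650])
print([round(o, 3) for o in O])  # frequencies; sum to 1 for this single observation
```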

The basic algorithm for computing O(g) from equation (22) essentially requires ascertaining the precise nature of the (ξuj ∩ Ig) term across all j = 1,…, su values, taking specific care of the exact endpoint values (auj, buj) and (ha, hb) and their relative relationships with each other. Once all possibilities have been identified, the process is reasonably straightforward. The algorithm is presented as a SAS macro; using SAS is not essential.

The algorithm itself is presented in Appendix A. This algorithm has in effect three components. The first (identified as A) relates to the various initial commands that set up the computer program (such as options, titles, …), including reading in the data. This version of the algorithm assumes the data are inputted as

$$\xi=\left\{\left[a_{j}, b_{j}\right), p_{j} ;\ \ j=1, \ldots, n s u\right\}$$

where nsu = maxu su. Adjustments to the data to accommodate observations with su < nsu (commonly the case) can also be made at this data manipulation stage (or the appropriate terms in the core macro of Part B can be adjusted if preferred). For ease of presentation, we assume these are appropriately handled in the first stage A. The core macro utilizes generic terms for the maximum number of data-histogram intervals (nsu), the first and last histogram ha values (first_ha and last_ha), and the histogram interval length (hinc). Thus, these are also set in Part A.

Part B is the core macro, here called ‘hist’. Part B1 addresses the values of the frequencies to be added (the add term) for each data histogram entry and its relationship to the histogram interval (ha, hb). Part B2 adds these frequencies over all data values and calculates the relative frequencies. This part also includes a simple format for outputting the resulting frequencies and relative frequencies (here referred to as ‘probabilities’); whatever format suits the reader should be substituted. This macro can then be invoked to calculate the O(g) and pg for a given g.

Rather than repeatedly invoking the ‘hist’ macro for each Ig, we can alternatively use Part C, a simple macro called ‘histall’, which calculates all the histogram frequencies over all Ig inside a simple do-loop routine. This ‘histall’ macro can be called once to give O(g) and pg for all g = 1,…, r. This is particularly useful when all histogram intervals Ig are of the same length.

Clearly, this is a basic algorithm to calculate O(g) and pg. Variations to accommodate different features (e.g., histograms with different Ig interval lengths) can readily be made.

7.2 Joint Histograms for Interval-Valued Data

Let the p-dimensional interval-valued observation be Y = (Y1,…, Yp) with Yv taking values on the interval (av, bv), v = 1,…, p. We want to construct the joint histogram for the two variables (Yi, Yj); for illustrative clarity we take (Y1, Y2). We may rewrite equation (21) as

$$O\left(g_{1}, g_{2}\right)=\sum_{u}\left(\frac{b_{1}^{*}-a_{1}^{*}}{b_{1}-a_{1}}\right)\left(\frac{b_{2}^{*}-a_{2}^{*}}{b_{2}-a_{2}}\right)$$
((23))

where \(R^{*}=\left\{\left(a_{1}^{*}, b_{1}^{*}\right) \times\left(a_{2}^{*}, b_{2}^{*}\right)\right\}\) is the rectangle which represents the intersection of the data rectangle R(u) and the histogram rectangle R(g1, g2). This R* rectangle can be empty. We note that the interval \(\left(a_{i}^{*}, b_{i}^{*}\right)\), i = 1, 2, may or may not overlap with the relevant (ha, hb) interval, and that the various possibilities observed when calculating the histogram of histogram data in Section 7.1 pertain here also (see Appendix A); but they pertain for both the Y1 and Y2 dimensions. More specifically, O(g1, g2) is the sum (over all observations) of cross-product terms, with each cross-product term equal to the product of one term from each of Y1 and Y2. A basic algorithm is given in Appendix B and proceeds as follows. Part A, as before, relates to the relevant preliminary program statements, including the input of the data.

Calculation of the O(g1, g2) of equation (23) consists of two parts, presented here as macros under Parts B and C, respectively. The macro of Part B, called ‘hist’ (comparable to, but different from, the ‘hist’ macro of Section 7.1), calculates the term \(\left(b_{v}^{*}-a_{v}^{*}\right) /\left(b_{v}-a_{v}\right)\) for a single v value. This macro is written to allow for any specified v value (as shown in, e.g., the a&v term). This term is called prod&k. The macro is invoked twice, once for each k value (k = 1 and k = 2), to give product term values prod1 and prod2 (for Y1 and Y2, respectively) for each observation u.

The second macro of Part C, called ‘hist2’, reads in the calculated prod1 and prod2 terms, takes their product, and sums these over all observations; i.e., it completes the calculation of equation (23). The cross-products and their summation are achieved via an IML routine, as shown. This particular macro calculates the observed joint frequency and the corresponding joint probability for a single histogram rectangle. Simple format lines for printing these results are also included. Thus, invoking the ‘hist2’ macro will produce the joint histogram value for a given histogram rectangle. A third macro, along the lines of the ‘histall’ macro shown in Part C of Section 7.1 (see Appendix A), could also be written to enable all histogram rectangles to be considered with one invocation. The details are omitted.

There is one final, but important, feature. Let us first consider standard rectangular ξ spaces; i.e., consider 2-dimensional interval-valued rectangular data R(u) for all u, such as when there are no rules, or 2-dimensional virtual data V(u) which are also rectangles. For these situations, the algorithm as described thus far proceeds without any problem. However, when, as often occurs, the virtual observation V(u) is the union of smaller rectangles each with some probability pj < 1, then appropriate adjustment must be made. For example, in the baseball example, under the rule ν2, we see from Table 6(ii) that the virtual observation for u = 2 contains the rectangle [(149, 422), (88, 149)] with probability p = 0.528 (≠ 1). In contrast, the virtual observation for u = 1 is the same as the actual observation, viz., the rectangle [(289, 538), (75, 162)] with probability p = 1. We saw in Section 6 that in general a non-rectangular V(u) can be decomposed into nonoverlapping rectangles Rj(u), j = 1,…, k, each with probability pj, Σj pj = 1. Therefore, in the data input and manipulation stage (of Part A), these rectangles and probabilities are calculated. We treat each of these Rj(u) as though it were a whole “observation” but with probability p = pj (instead of the initially set p = 1 value). This is reflected in the ‘hist’ macro by summing these probabilities to obtain the sample size n (n&v = n&v + p&v, of line 4, instead of the more intuitive n = n + 1). It is also reflected in the ‘hist2’ macro by taking the product prod1 * prod2 * p1 in the IML routine.
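This weighting scheme can be sketched in Python as follows (names ours). The virtual observation for u = 2 is entered as its four weighted rectangles from Table 6(ii); as cautioned in Sections 5 and 6, the triangular component is treated here as though it were a rectangle, so the triangle adjustment discussed earlier would still be required:

```python
def joint_frequency(observations, rect_g):
    """O(g1, g2): observations are lists of weighted components
    [(rect, p_j), ...] with rect = ((a1, b1), (a2, b2))."""
    total = 0.0
    for obs in observations:
        for rect, p in obs:
            frac = 1.0
            for (a, b), (ha, hb) in zip(rect, rect_g):
                frac *= max(0.0, min(b, hb) - max(a, ha)) / (b - a)
            total += p * frac            # prod1 * prod2 * p, as in 'hist2'
    return total

# u = 1: one rectangle with p = 1; u = 2: its four weighted rectangles
# from Table 6(ii) under rule nu_2:
obs1 = [(((289, 538), (75, 162)), 1.0)]
obs2 = [(((88, 149), (88, 149)), 0.059), (((88, 149), (49, 88)), 0.075),
        (((149, 422), (49, 88)), 0.338), (((149, 422), (88, 149)), 0.528)]
print(joint_frequency([obs1, obs2], ((200, 350), (75, 125))))
```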

8 Conclusion

Rules can have many forms and can impact data in various ways. While our study herein focused on logical dependency rules on interval-valued data, other forms of data may require different types of rules. There can also be situations where the rule itself varies depending on the “value” of the symbolic data (interval-valued or not). For example, outlier values may induce their own dependency rules. In another situation, it may be that one variable is correlated with another variable (as is prevalent with medical and/or biologically based variables), with the resulting need for rules that are themselves observation-dependent. In a different direction, once histograms (for example) have been developed in the presence of rules, other parametric distribution procedures (such as fitting, estimation, and so forth) can be developed. The field is wide open for more research.