1 Introduction

While symbolic data exist in their own right as small or large datasets, the advent of the modern computer has brought with it classical (and/or symbolic) datasets that are too large to be analysed by traditional statistical methodologies, even with the computational assistance of those same computers that generated the data. Therefore, in order to elicit reasonable and appropriate analyses and conclusions from the data, it becomes necessary to aggregate the data in some meaningful manner before analyses can proceed. How this aggregation occurs will depend on the underlying questions being asked and/or the answers being sought. For example, suppose a dataset consists of the medical records for a country (say), and suppose that, apart from the more directly medically related variables, there are also demographic variables such as the individual’s age, gender, town of residence, and so on. One basic question may relate to what happens across towns or residence sites, while another may be concerned with age × gender differences. These questions lead to aggregations by town, or by age × gender, respectively. The number of possible aggregations is limited only by the number of such basic questions. Whether the original data were classical or symbolic, the aggregated values will now be lists, and/or intervals, and/or modal values, regardless of the nature of the aggregation method adopted. For example, a list could be the types of cancer observed, Y = {lung, colon, …}; an interval value could be the pulse rate, Y = 64 ± 1 = (63, 65); a modal value could be a histogram, Y = {(red, p1), (green, p2), …} with Σpi = 1. For a review of symbolic data, see Billard and Diday (2003) and, for a more detailed description, see Bock and Diday (2000).

In this paper our focus will be on interval-valued data in the presence of rules, and in particular on obtaining basic descriptive statistics such as frequency histograms, joint frequency histograms and sample means and variances. Rules, so-called, can arise in two (or three) broadly defined ways. The first relates to underlying conditions that exist, be the data classical- or symbolic-valued. For example, interest may center on children, in which case any analysis conditions the data to contain only children; or the variables Y1 and Y2 may be required to satisfy a condition such as Y1 + Y2 = β (say); and so on. In contrast, when aggregating data into symbolic-valued variables, the very action of aggregation may produce data that require the adoption of rule(s) to maintain data integrity. For example, suppose we have values for Y1 = age and Y2 = number of children, and suppose we have particular classical values Ya = (21, 2), Yb = (10, 0), Yc = (16, 1),…, where Y = (Y1, Y2); and suppose further that the concept of interest, after appropriate aggregation, placed these three individuals into the same category and produced the symbolic interval-valued observation ξ = {(10, 21), (0, 1, 2)}. As it stands, the value ξ implies that persons in the age interval (10, 21) years had (0, 1, 2) children, including the possibility that the 10-year-old had 1 (or 2) children. To maintain data integrity here, it is necessary to include a rule such as ν = {If Y1 < 14 (say), then Y2 = 0}. The need for this type of rule is unique to symbolic data. The precise nature of such rules will vary with the description of the symbolic data value. A possible third type of rule amounts to a form of data cleaning; e.g., a “rule” such as age = Y1 > 0 could be used to catch observed (classical or symbolic) values of Y1 = −15 (say, an obvious miskeying situation). In some circumstances, data cleaning rules are absorbed into either of the first two categories defined above. Data cleaning rules, however, do need to be present for datasets too large to be “eye-balled” for correctness.

Since classical data are but single points in p-dimensional space (where p is the number of variables), rules are relatively easy to manage. However, since symbolic values are p-dimensional hypercubes and/or Cartesian products of distributions in p-dimensional space, rules can and do create difficulties. We focus on rules for interval-valued data; conceptually, the methodology extends reasonably easily to histogram-valued data (though less easily computationally!).

Bertrand and Goupil (2000) derived formulae for finding the univariate histogram and sample mean and variance for a single interval-valued variable Y without rules. They also developed the corresponding results for multi-valued (list) data with and without rules. To accommodate rules, their basic approach was to convert each actual possible symbolic data value into a so-called virtual data value, where the virtual values were those that satisfied the given rule(s). Billard and Diday (2003) alluded to extending Bertrand and Goupil’s virtual data idea to interval data with rules, but gave no details. Our aim here is to develop this concept further and also to extend it to finding joint histograms for (Y1, Y2) where Y1 and Y2 are each interval-valued variables and where rules exist. We develop our basic approach through rules applied to the interval-valued data of Table 1.

Table 1 At-Bats and Hits by Team

Therefore, in Section 2, we consider the nature of the virtual observation space in the presence of rules and show how the virtual observation values can be determined. Then, in Section 3, we use these virtual observations to obtain univariate histograms under a variety of specific rules. Calculating the sample mean and variance in the presence of rules is studied in Section 4. Derivation of a joint histogram for the bivariate Y = (Y1, Y2) is considered in Section 5. The basic principles involved are discussed and summarized in Section 6. These form the nucleus of the methodology required to obtain basic statistics for interval-valued data in the presence of rules. In the course of these derivations, the need arises for an algorithm for calculating a histogram of histogram-valued data and an algorithm for calculating a joint histogram for interval-valued data; these algorithms are outlined in Section 7.

2 Observed and Virtual Symbolic Intervals

The data of Table 1 represent two random variables, viz., Y1 = Number of At-Bats; and Y2 = Number of Hits, for baseball players over a season. Players are aggregated by teams, so that the resulting team statistics are now intervals. The results shown in Table 1 are based on actual (Y1, Y2) statistics for a sample of players from a variety of baseball teams obtained from Vanessa and Vanessa (2004). Some additional results have been inserted for illustrative purposes.

We denote a particular realization of Y = (Y1, Y2) by ξ = (ξ1, ξ2) with ξi = (ai, bi), i = 1, 2. Following Bertrand and Goupil (2000), we make the assumption that specific (point) values of Yi are uniformly distributed across the interval (ai, bi). Further, ξ takes values in the p = 2-dimensional hypercube (i.e., rectangle) bounded by (a1, b1) × (a2, b2). We denote a specific observation by ξ(u), which is bounded by the rectangle R(u) = (a1u, b1u) × (a2u, b2u) for u = 1, …, n, where n is the number of observations.

To examine these data more closely, we first make the logical deduction that the Number of At-Bats cannot be less than the Number of Hits, i.e., Y1 ≥ Y2. Consider the second observation ξ(2). Each of the ξ1(2) and ξ2(2) values is possible. The resulting rectangle R(2) has vertices at (x1, x2) = (88, 49), (88, 149), (422, 49) and (422, 149). All (x1, x2) values contained in this rectangle appear as possible values. This includes the vertex value (x1, x2) = (88, 149), i.e., the number of hits is 149 from 88 at-bats — clearly not a logical possibility. However, another player can have 149 hits from 422 at-bats, for example, and so on. Here, the logical rule ν: Y1 ≥ Y2 implies that the actual apparent hypercube R(u) has to be transformed to a virtual hypercube V(u) containing only those values of R(u) that satisfy the rule ν. In contrast, the observation u = 6, with ξ(6) = {(24, 26), (133, 141)}, would suggest that the ξ1 and ξ2 values have been transposed. The logical rule catches this, as part of a data cleaning process for example.

Formally, we adapt the definition of virtual data, from Bertrand and Goupil (2000), as follows.

Definition: The virtual observation space V ≡ V(u) of an actual observation space R ≡ R(u) consists of all possible values x in R which satisfy all the rules ν = {ν1, ν2, …} operating on R. That is, for the observation u,

$$V(u)=\left\{x \in R(u), \quad \nu_{i}(x)=1, \quad \text { for all }\ \nu_{i} \text { in } \nu\right\}$$
((1))

where νi(x) = 1 if the rule is true for the vector-value x and is 0 if the rule is not true for x. Let us denote the virtual observation by ξ′ = (ξ′1, …, ξ′p) with ξ′i = (a′i, b′i), i = 1, …, p.
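As an aside, equation (1) translates directly into an indicator-function check. The following minimal Python sketch (the paper’s own algorithms in Section 7 are SAS macros; the names Point, Rect and in_virtual_space here are ours) tests whether a point x of R(u) survives into V(u) under the Y1 ≥ Y2 deduction made above for the baseball data:

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]               # x = (x1, x2), for p = 2
Rect = Tuple[float, float, float, float]  # R(u) as (a1, b1, a2, b2)

def in_virtual_space(x: Point, rect: Rect,
                     rules: List[Callable[[Point], bool]]) -> bool:
    """True iff x lies in R(u) and nu_i(x) = 1 for every rule nu_i in nu."""
    a1, b1, a2, b2 = rect
    in_rect = a1 <= x[0] <= b1 and a2 <= x[1] <= b2
    return in_rect and all(nu(x) for nu in rules)

# The logical rule of Section 2, Y1 >= Y2, applied to R(2):
nu = lambda x: x[0] >= x[1]
print(in_virtual_space((422, 149), (88, 422, 49, 149), [nu]))  # True
print(in_virtual_space((88, 149), (88, 422, 49, 149), [nu]))   # False: 149 hits from 88 at-bats
```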

To illustrate this further, suppose that for the Table 1 data, there is a logical rule

$$\nu : Y_{2} \leq \alpha Y_{1}.$$
((2))

Setting α = 1.0 allows for the removal of x values that are not logically possible; while setting α = 0.400, say, is acknowledging that batting averages (= Y2/Y1) above 0.400 are unlikely and therefore in this present sense also not logically possible. The impact of this rule ν on the observed rectangle R will produce a virtual hypercube V which has one of the nine patterns, denoted by A, B, …, I, displayed in Figure 1. Those values which fall above the line Y2 = αY1 are not logically possible values and so are excluded from R to produce the virtual value V. The shaded regions correspond to the virtual values V. The conditions that give ν(x) = 1 for the respective patterns are given in Table 2.

Figure 1 Patterns for Virtual V — shaded regions

The pattern A corresponds to those observations for which V(u) is empty; e.g., ξ(6) in Table 1. In this case, the underlying condition from equation (1) that generates ν(x) = 0 is {αb1 < a2}. The pattern B represents those observations that are unaffected by the rule ν, i.e., V(u) = R(u). The condition that gives ν(x) = 1 in equation (1) for this pattern translates, in terms of the (ai, bi) values, i = 1, 2, to {αa1 ≥ b2}.

The four patterns C, D, E, F are similar in that the virtual observation hypercube is a triangle, though they differ as to whether or not particular triangle vertices fall on the line Y2 = αY1. Therefore, the virtual ξ values differ accordingly. Thus, for pattern D, the virtual value for the original observation ξ is ξ′1 = ξ1 = (a1, b1) and ξ′2 = (a2, αb1). Notice that the virtual observation for Y1 (alone) is unaffected by ν. In contrast, for pattern E, Y2 is unaffected, ξ′2 = ξ2, but the Y1 values are affected, giving the virtual value ξ′1 = (a2/α, b1). For pattern F, both Y1 and Y2 values are affected by ν; whereas in pattern C, neither is. Table 3 displays these virtual values ξ′i, i = 1, 2.

Table 3 Virtual Observation Space Values — by Pattern

Table 2 Virtual Patterns and Conditions

Also shown in Table 3 are the apparent virtual values for the bivariate pair (ξ′1, ξ′2) for the C, D, E and F patterns. When calculating the histogram for Y1 (or Y2) alone, these virtual ξ′i values are used in the usual manner via the Bertrand and Goupil (2000) methodology. However, when calculating the joint histogram for (Y1, Y2), routine application of the methodology (see Billard and Diday, 2003) would in this case produce answers as though the hypercube (ξ′1, ξ′2) were the rectangle (a′1, b′1) × (a′2, b′2) with area (b′1 − a′1)(b′2 − a′2), instead of the triangle whose vertices are {(a′1, a′2), (b′1, a′2), (b′1, b′2)}, and with area ∣V∣ = (b′1 − a′1)(b′2 − a′2)/2, where ∣A∣ denotes the area of the region A. Clearly, this feature has to be accommodated, and is addressed further in Section 5. The corresponding areas ∣V∣ for each pattern C, D, E, F are also displayed in Table 3.

The two patterns G and H have the common feature that their 4-sided (non-rectangular) hypercube can be viewed as the union of a triangle and a rectangle. For the pattern G, the virtual description for Y1 (alone) is now a histogram-valued variable (and not the interval-valued observation of the original data); while for the pattern H, it is the variable Y2 (considered alone) which has a histogram-valued virtual description. Thus, we can show that in pattern G, the virtual observation becomes

$$\xi_{1}^{\prime}=\left\{\left(a_{2} / \alpha, b_{2} / \alpha\right) p_{1}, \quad\left(b_{2} / \alpha, b_{1}\right) p_{2}\right\}$$
((3))

where the relative frequencies pi, i = 1, 2, are given by

$$p_{i}=\left|R_{i}\right| /|V|$$
((4))

with

$$\left|R_{1}\right|=\left(b_{2}-a_{2}\right)^{2} /(2 \alpha), \quad\left|R_{2}\right|=\left(\alpha b_{1}-b_{2}\right)\left(b_{2}-a_{2}\right) / \alpha$$
((5))

and

$$|V|=\left|R_{1}\right|+\left|R_{2}\right|;$$
((6))

and where the virtual description of Y2 (alone) is unaffected, with ξ′2 = ξ2 = (a2, b2). These are displayed in Table 3 for both patterns G and H. Then, by using the methodology developed in Billard and Diday (2003) for obtaining a histogram of histograms, the respective (univariate) histograms can be obtained. Also shown in Table 3 is the apparent virtual description of the bivariate pair (Y1, Y2). These too are now histogram-valued, rather than interval-valued, observations. However, again as cautioned above for the patterns C, D, E, F, care is required for the “triangle” pieces, viz., R1 ≡ [(a2/α, b2/α), (a2, b2)] in pattern G, and R2 ≡ [(a1, b1), (αa1, αb1)] in pattern H.
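To make the pattern-G calculation concrete, the following short Python sketch evaluates equations (3)–(6); the observation used is hypothetical, chosen only so that the pattern-G configuration holds, and the function name is ours:

```python
def pattern_g_virtual_y1(a1, b1, a2, b2, alpha):
    """Virtual value xi'_1 under pattern G, per equations (3)-(6)."""
    r1 = (b2 - a2) ** 2 / (2 * alpha)           # triangle area, eq. (5)
    r2 = (alpha * b1 - b2) * (b2 - a2) / alpha  # rectangle area, eq. (5)
    v = r1 + r2                                 # |V|, eq. (6)
    return [((a2 / alpha, b2 / alpha), r1 / v),  # eq. (3), weights from eq. (4)
            ((b2 / alpha, b1), r2 / v)]

# Hypothetical observation xi = {(50, 300), (100, 200)} with alpha = 1,
# so that alpha*a1 < a2 and alpha*b1 > b2 (the pattern-G configuration):
print(pattern_g_virtual_y1(50, 300, 100, 200, 1.0))
# [((100.0, 200.0), 0.333...), ((200.0, 300.0), 0.666...)]
```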

Finally, we consider the pattern I, reproduced in Figure 2a. In this case, the virtual observation space V is a 5-sided hypercube which can be partitioned into the triangle R1 and three rectangles R2, R3, R4 with respective vertices as indicated in Figure 2a. For data that follow this pattern, the virtual values of both the Y1 and Y2 variables (each considered alone) differ from the actual observed values; and in each case the virtual values become histogram-valued instead of the original interval-valued. It follows that for Y1 (alone) the virtual observation is

Figure 2 Pattern I Detail: (a) General, (b) Observation ξ(2)

$$\xi_{1}^{\prime}=\left\{\left(a_{1}, b_{2} / \alpha\right) p_{1}, \quad\left(b_{2} / \alpha, b_{1}\right) p_{2}\right\}$$
((7))

where

$$p_{1}=\left(\left|R_{1}\right|+\left|R_{2}\right|\right) /|V|, \quad p_{2}=\left(\left|R_{3}\right|+\left|R_{4}\right|\right) /|V|$$
((8))

with

$$\begin{array}{c}{\left|R_{1}\right|=\left(b_{2}-\alpha a_{1}\right)^{2} /(2 \alpha),\ \ \left|R_{2}\right|=\left(b_{2}-\alpha a_{1}\right)\left(\alpha a_{1}-a_{2}\right) / \alpha}, \\ {\left|R_{3}\right|=\left(\alpha b_{1}-b_{2}\right)\left(\alpha a_{1}-a_{2}\right) / \alpha, \quad\left|R_{4}\right|=\left(\alpha b_{1}-b_{2}\right)\left(b_{2}-\alpha a_{1}\right) / \alpha}\end{array}$$
((9))

and

$$|V|=\left|R_{1}\right|+\cdots+\left|R_{4}\right|.$$
((10))

The virtual observation for Y2 (alone) is

$$\xi_{2}^{\prime}=\left\{\left(a_{2}, \alpha a_{1}\right) p_{1}^{*}, \quad\left(\alpha a_{1}, b_{2}\right) p_{2}^{*}\right\}$$
((11))

where

$$p_{1}^{*}=\left(\left|R_{2}\right|+\left|R_{3}\right|\right) /|V|, \quad p_{2}^{*}=\left(\left|R_{1}\right|+\left|R_{4}\right|\right) /|V|.$$
((12))

These values are summarized in Table 3. The table also shows the corresponding apparent virtual observation for (Y1, Y2) taken together as a bivariate pair. Here, we can show that the virtual value ξ′ of ξ is

$$\begin{array}{c}{\xi^{\prime}=\left\{\left[\left(a_{1}, b_{2} / \alpha\right),\left(\alpha a_{1}, b_{2}\right)\right] p_{1}^{* *},\left[\left(a_{1}, b_{2} / \alpha\right),\left(a_{2}, \alpha a_{1}\right)\right] p_{2}^{* *}\right.}, \\ {\qquad\quad \left[\left(b_{2} / \alpha, b_{1}\right),\left(a_{2}, \alpha a_{1}\right)\right] p_{3}^{* *},\left[\left(b_{2} / \alpha, b_{1}\right),\left(\alpha a_{1}, b_{2}\right)\right] p_{4}^{* *} \}}\end{array}$$
((13))

where

$$p_{i}^{* *}=\left|R_{i}\right| /|V|, \quad i=1, \ldots, 4,$$
((14))

with ∣Ri∣ and ∣V∣ as given in equations (9) and (10). Again, the “triangle” piece (R1 ≡ [(a1, b2/α), (αa1, b2)]) requires special care when calculating a joint histogram function.
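Gathering the results of this section, the following Python sketch classifies an observation into one of the patterns of Figure 1 and, for pattern I, evaluates equations (7)–(14). The conditions for patterns A and B are those quoted above; the remaining branches are our reading of Figure 1 and Table 3 (boundary cases as resolved in Table 2), so the classifier is illustrative rather than definitive. Applied to ξ(2) with α = 1, it reproduces the pattern-I quantities used in Section 3:

```python
def pattern(a1, b1, a2, b2, alpha):
    """Classify R(u) = (a1,b1) x (a2,b2) under nu: Y2 <= alpha*Y1."""
    lo, hi = alpha * a1, alpha * b1       # height of the line at x = a1, b1
    if hi < a2:               return "A"  # V(u) empty
    if lo >= b2:              return "B"  # V(u) = R(u)
    if lo == a2 and hi == b2: return "C"  # triangle; xi'_1, xi'_2 unchanged
    if lo == a2 and hi < b2:  return "D"  # triangle; xi'_1 unchanged
    if lo < a2 and hi == b2:  return "E"  # triangle; xi'_2 unchanged
    if lo < a2 and hi < b2:   return "F"  # triangle; both affected
    if lo < a2 and hi > b2:   return "G"  # triangle + rectangle
    if lo > a2 and hi <= b2:  return "H"  # rectangle + triangle
    return "I"                            # a2 < alpha*a1 < b2 < alpha*b1

def pattern_i_quantities(a1, b1, a2, b2, alpha):
    """Areas |R1|,...,|R4|, |V| and the weights of eqs (8), (12), (14)."""
    r1 = (b2 - alpha * a1) ** 2 / (2 * alpha)           # triangle, eq. (9)
    r2 = (b2 - alpha * a1) * (alpha * a1 - a2) / alpha  # eq. (9)
    r3 = (alpha * b1 - b2) * (alpha * a1 - a2) / alpha  # eq. (9)
    r4 = (alpha * b1 - b2) * (b2 - alpha * a1) / alpha  # eq. (9)
    v = r1 + r2 + r3 + r4                               # eq. (10)
    p = ((r1 + r2) / v, (r3 + r4) / v)                  # eq. (8)
    p_star = ((r2 + r3) / v, (r1 + r4) / v)             # eq. (12)
    p_2star = tuple(r / v for r in (r1, r2, r3, r4))    # eq. (14)
    return (r1, r2, r3, r4, v), p, p_star, p_2star

# Observation xi(2) = {(88, 422), (49, 149)} under nu2 (alpha = 1):
print(pattern(88, 422, 49, 149, 1.0))                   # 'I'
areas, p, p_star, p_2star = pattern_i_quantities(88, 422, 49, 149, 1.0)
print(areas)    # (1860.5, 2379.0, 10647.0, 16653.0, 31539.5)
print(p)        # ~(0.134, 0.866): the weights in eq. (7)
print(p_star)   # ~(0.413, 0.587): the weights in eq. (11)
print(p_2star)  # ~(0.059, 0.075, 0.338, 0.528): the weights in eq. (13)
```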

3 Construction of Histograms

When, after application of the rule ν = (ν1, ν2, …), the resulting virtual dataset consists entirely of interval-valued data, the histogram of the virtual dataset can be constructed by using the Bertrand and Goupil methodology which is available computationally in the SODAS software (and can be found on the web at www.ceremade.dauphine.fr/%7Etouati/sodas-pagegarde.htm).

For comparative purposes, we first give the histogram for the baseball dataset of Table 1 when there are no rules. Suppose we build the histogram for Y1 = Number of At-Bats on the r1 = 7 intervals [0, 100),…, [600, 650]; and suppose the histogram for Y2 = Number of Hits is constructed on the r2 = 9 intervals [0, 50), [50, 75),…, [200, 225), [225, 275]. The resulting histograms are given in column (a) of Table 4 for Y1 and Table 5 for Y2, respectively.

Table 4 Histogram for Y1 = # At-Bats

Suppose now interest is restricted to those situations with 120 or more at-bats. This translates to the rule

$$\nu_{1} :\left\{Y_{1} \geq 120\right\}.$$
((15))

Under this rule, observations ξ(6) and ξ(16) are deleted entirely. Observations u = 2, 10, 15, and 17 are truncated, so that the virtual observation for u = 2 becomes ξ′(2) = {(120, 422), (49, 149)}; likewise, ξ′(10), ξ′(15), and ξ′(17) can be found. After application of the rule ν1, all virtual observations are interval-valued. Then, by building the relevant histograms on the same histogram intervals used in column (a), we obtain the frequencies of column (b) in Table 4 for Y1 and Table 5 for Y2, respectively. Comparing the two histograms of columns (a) and (b), we see the impact of this rule. For the variable Y1, since this rule directly truncates Y1 values, the two histogram intervals \(I_{g_{1}}\), g1 = 1, 2, are clearly affected. However, other histogram intervals are affected as well (in contrast to the corresponding comparison for classical data, where these latter intervals would not be affected). Take, e.g., the histogram interval \(I_{g_{1}}=I_{3}=[200,300)\), all of whose internal values are valid under ν1, and consider the contribution of the u = 2 observation to this I3 interval. Then, the virtual data value ξ′1(2) contributes a portion equal to (300 − 200)/(422 − 120) = 100/302 to the frequency of I3, while the original data value ξ1(2) contributes the amount (300 − 200)/(422 − 88) = 100/334 (≠ 100/302) to the frequency in I3. A comparison of columns (a) and (b) in Table 5 for the histogram of the Y2 variable also reveals differences. This occurs even though the rule ν1 does not involve Y2 directly, and even though for every retained observation u the virtual ξ′2(u) = ξ2(u). The impact of ν1 on the histogram for Y2 is a reflection of the u = 6 and u = 16 observations being deleted.
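The arithmetic in the preceding paragraph is an instance of the Bertrand-Goupil contribution ∥ξ ∩ Ig∥/∥ξ∥, sketched minimally in Python below (the function name is ours):

```python
def contribution(xi, ig):
    """Bertrand-Goupil share ||xi ∩ I_g|| / ||xi|| of interval xi in I_g."""
    (a, b), (ha, hb) = xi, ig
    overlap = max(0.0, min(b, hb) - max(a, ha))
    return overlap / (b - a)

# Contribution of the u = 2 observation to I_3 = [200, 300):
print(contribution((88, 422), (200, 300)))   # 100/334 ~ 0.299: original xi_1(2)
print(contribution((120, 422), (200, 300)))  # 100/302 ~ 0.331: virtual xi'_1(2) under nu_1
```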

Table 5 Histogram for Y2 = # Hits

Suppose now we apply the rule of equation (2) with α = 1.0, viz.,

$$\nu_{2} : Y_{1} \geq Y_{2},$$
((16))

i.e., the number of hits cannot exceed the number of at-bats. Table 6, column (a), identifies the pattern of the virtual observation in the presence of this rule. Columns (b) and (c) give the virtual observation value for Y1 and Y2, respectively, for each case by utilizing Table 3. For example, clearly when u = 1, pattern B pertains. Hence, ξ′1 = ξ1, ξ′2 = ξ2; also, ξ′ = (ξ′1, ξ′2) = ξ. The u = 2 observation under ν2 reduces to a virtual observation space with pattern I (see Figure 2b). It is readily verified that the areas ∣Ri∣, i = 1, …, 4, and ∣V∣ are

Table 6(i) Virtual ξ′1 and ξ′2 under rule ν2 : Y1 ≥ Y2
Table 6(ii) Virtual ξ′ = (ξ′1, ξ′2) under rule ν2 : Y1 ≥ Y2
$$\left|R_{1}\right|=1860.5,\quad\left|R_{2}\right|=2379, \quad\left|R_{3}\right|=10647, \quad\left|R_{4}\right|=16653, \quad|V|=31539.5.$$

Hence, the virtual values for this observation become ξ′(2) = (ξ′1(2), ξ′2(2)) where, for Y1 (considered alone), by substitution into equations (7)–(9), we have

$$\xi_{1}^{\prime}(2)=\{(88,149), 0.134 ; \quad(149,422), 0.866\};$$

for Y2 considered alone, from equations (9)–(12),

$$\xi_{2}^{\prime}(2)=\{(49,88), 0.413 ; \quad(88,149), 0.587\};$$

and that for (Y1, Y2) the virtual value is, from equations (9), (10) and (13),

$$\begin{aligned} \xi^{\prime}(2)=&\{[(88,149),(88,149)], 0.059 ;\quad [(88,149),(49,88)], 0.075;\\ &[(149,422),(49,88)], 0.338 ;\quad[(149,422),(88,149)], 0.528 \} \end{aligned}$$

Under this rule, only the u = 6 observation fails entirely, being identified as a pattern A virtual observation. However, only the nine observations corresponding to u = 1, 4, 5, 7, 9, 11, 13, 14, 18 are unaffected by this rule, each being identified as a pattern B value. The remaining eight observations are affected in various ways (with a variety of patterns occurring), with all eight having some portion of the original R(u) space eliminated as not being logically possible under ν2. The virtual values for all the observations in the dataset of Table 1 after application of the rule ν2 are displayed in Table 6 in columns (b), (c), and (d) for the variables Y1, Y2 and (Y1, Y2), respectively. Clearly, the virtual dataset contains histogram-valued observations. An algorithm for the determination of a histogram from histogram-valued observations is outlined in Section 7. Therefore, building our histograms for Y1 (or Y2) on the same histogram intervals as were used previously, we can obtain the histograms for Y1 (and Y2) as displayed in column (c) of Table 4 (and Table 5, respectively).

Column (d) of Tables 4 and 5 gives the corresponding histograms for Y1 and Y2, respectively, when, in equation (2), α = 0.350, i.e., under the rule

$$\nu_{3} : Y_{2} \leq 0.350 Y_{1}.$$
((17))

In this case, several more of the original observations have virtual values which follow pattern A, as would be expected; and the resulting histograms reflect this restriction. This is especially evident for the \(I_{g_{2}}=I_{8}=[200,225)\) interval of the histogram for the number of hits Y2. Under ν3, the u = 3 and u = 12 observations are deleted by virtue of their becoming pattern A values in their virtual space. Yet both of these observations contributed nonzero frequencies to this I8 interval for the histograms of columns (a), (b) and (c) in Table 5. We can show that ξ2(3) = (201, 254) and ξ2(12) = (189, 238) contributed a frequency equal to 24/53 = 0.453 and 25/49 = 0.510, respectively, for a total contribution of 0.963 when there were no rules.

Finally, column (e) in Tables 4 and 5 provides the histogram results for the set of rules

$$\nu_{4} :\left(\nu_{1}, \nu_{2}\right) \equiv\left\{Y_{1} \geq 120 \text { and } Y_{2} \leq Y_{1}\right\}$$
((18))

for Y1 alone and Y2 alone, respectively. The details are omitted.

4 Sample Means and Variances

Formulae for calculating the empirical mean and variance for interval-valued data were given by Bertrand and Goupil (2000), and for histogram-valued data by Billard and Diday (2003). We have seen how rules in effect transform the actual interval-valued data R(u) into virtual data V(u), u = 1,…, n, with these virtual data also being interval-valued or histogram-valued. Thus, the Bertrand-Goupil or Billard-Diday formulae subsequently apply. For completeness, we provide here the formulae for histogram-valued data.

Suppose our random variable Y has histogram values ξ(u) = {(auj, buj), puj; j = 1,…, su} with ∑j puj = 1, where, for observation u, puj is the relative frequency (or probability) of taking values on the jth interval (auj, buj), j = 1,…, su, and su is the total number of histogram intervals. Note that when su = 1, and hence pu1 = 1, we have an interval-valued observation. Then, from Billard and Diday (2003), the sample mean is given by

$$\overline{Y}=\frac{1}{2 n} \sum_{u=1}^{n}\left\{\sum_{j=1}^{s_{u}}\left(a_{u j}+b_{u j}\right) p_{u j}\right\}$$
((19))

and the sample variance S2 and standard deviation S are found from

$$S^{2}=\frac{1}{3 n} \sum_{u=1}^{n}\left\{\sum_{j=1}^{s_{u}}\left(a_{u j}^{2}+a_{u j} b_{u j}+b_{u j}^{2}\right) p_{u j}\right\}-\frac{1}{4 n^{2}}\left\{\sum_{u=1}^{n} \sum_{j=1}^{s_{u}}\left(a_{u j}+b_{u j}\right) p_{u j}\right\}^{2}.$$
((20))

Therefore, by using equations (19) and (20) on the original data of Table 1, we obtain the Ȳ and S values shown in Table 4 for the Number of At-Bats Y1, and in Table 5 for the Number of Hits Y2. Likewise, under the rules ν1,…, ν4, we can apply equations (19) and (20) to the relevant virtual data to obtain the corresponding values for Ȳ and S; these are also displayed in Tables 4 and 5 for Y1 and Y2, respectively.
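As an illustration of equations (19) and (20), the following Python sketch (function name ours) computes Ȳ and S for a toy dataset that mixes an interval-valued observation with the histogram-valued virtual value ξ′1(2) obtained in Section 3:

```python
def symbolic_mean_var(data):
    """Sample mean and variance per equations (19) and (20); 'data' is a
    list of observations, each a list of (a_uj, b_uj, p_uj) triples."""
    n = len(data)
    mean = sum(p * (a + b) for obs in data for (a, b, p) in obs) / (2 * n)  # eq. (19)
    m2 = sum(p * (a * a + a * b + b * b)
             for obs in data for (a, b, p) in obs) / (3 * n)
    return mean, m2 - mean ** 2                                             # eq. (20)

# Toy dataset: the interval-valued xi_1(1) = (289, 538) (s_u = 1, p = 1)
# and the histogram-valued virtual xi'_1(2) of Section 3:
data = [[(289, 538, 1.0)],
        [(88, 149, 0.134), (149, 422, 0.866)]]
mean, var = symbolic_mean_var(data)
print(mean, var ** 0.5)  # sample mean Y-bar and standard deviation S
```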

5 Joint Histograms

Principles underlying the univariate case apply to constructing histograms for p ≥ 2 variables. For illustrative clarity, let us take p = 2 and construct the joint histogram for Y = (Y1, Y2) on the histogram rectangles R(g1, g2) = {[ha1, hb1) × [ha2, hb2)}, g1 = 1,…, r1, g2 = 1,…, r2. Then, when there are no rules present, we have from Billard and Diday (2003) that the frequency with which observations lie in the rectangle R(g1, g2) is

$$O\left(g_{1}, g_{2}\right)=\sum_{u} \frac{\left|R(u) \cap R\left(g_{1}, g_{2}\right)\right|}{|R(u)|}.$$
((21))

The relative frequency is \(p_{g_{1} g_{2}}=O\left(g_{1}, g_{2}\right) / n\).

An algorithm for calculating these \(p_{g_{1} g_{2}}\) and O(g1, g2) terms is given in Section 7. To illustrate this, we construct a joint histogram for Y = (Y1, Y2) using the baseball data of Table 1. Suppose we take histogram intervals on Y1 as [0, 50), [50, 200),…, [500, 650] and histogram intervals on Y2 as [0, 75), [75, 125), …, [225, 275]. Thus, e.g., for g1 = 3, g2 = 4, we have the histogram rectangle R(3, 4) = [200, 350) × [175, 225). The observed frequencies are shown in Table 7. The corresponding relative frequencies \(p_{g_{1} g_{2}}\) are plotted in Figure 3.
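The overlap computation in equation (21) reduces to a product of one-dimensional interval overlaps, as in the following minimal Python sketch (names ours), applied here to the observation ξ(1) = [(289, 538), (75, 162)] of Table 1:

```python
def joint_contribution(rect_u, rect_g):
    """|R(u) ∩ R(g1,g2)| / |R(u)| in equation (21); each rectangle is
    given as ((a1, b1), (a2, b2))."""
    frac = 1.0
    for (a, b), (ha, hb) in zip(rect_u, rect_g):
        frac *= max(0.0, min(b, hb) - max(a, ha)) / (b - a)
    return frac

# Observation xi(1) against two of the histogram rectangles defined above:
print(joint_contribution(((289, 538), (75, 162)), ((200, 350), (175, 225))))  # 0.0: no Y2 overlap
print(joint_contribution(((289, 538), (75, 162)), ((200, 350), (75, 125))))   # (61/249)(50/87)
```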

Figure 3 Joint Histogram (Y1, Y2) — No Rules

Table 7 Joint Histogram for (Y1, Y2) — No Rules (ν0)

When rules are present we replace the actual observation R(u) by its virtual observation V(u). When the V(u) values are themselves rectangles, then the same computational algorithm used for equation (21) pertains. It is often the case that this virtual space V(u), itself a multi-sided hypercube, can be partitioned into components. The patterns G, H, I for the baseball data are examples of such partitioning. When these components are themselves rectangular, then again use of the basic joint histogram algorithm of Section 7 pertains.

Some of the virtual observations in the baseball example have components that are triangles. Routine use of the basic algorithm would treat these triangular pieces as though they were rectangles; for such components, appropriate adjustment has to be made. We omit the details.

Thus, to illustrate, we construct the joint histogram for the baseball data when the rule ν2 holds, on the same histogram intervals as were used above in Table 7 and Figure 3. The observed frequencies are displayed in Table 8, and the relative frequencies are plotted in Figure 4.

Figure 4 Joint Histogram (Y1, Y2), under Rule ν2

Table 8 Joint Histogram for (Y1, Y2) — Rule ν2 : Y2 ≤ Y1

6 Basic Principles

Suppose we have observations on the p-dimensional variable Y = (Y1,…, Yp) with Yj taking values on the interval ξj = (aj, bj), j = 1,…, p, for each observation u = 1,…, n. Let R(u) be the p-dimensional rectangle that represents the observation u. Let there be a set of rules ν = {ν1, ν2, …}.

The basic issue is to find that subspace V(u) of R(u) which represents those values of R(u) for which the rules ν hold; i.e., those x = (x1,…, xp) such that νi(x) = 1 for all rules νi; see equation (1). For some u, V(u) = ∅, the empty set; for other u, V(u) = R(u), the original observation; and for others, V(u) is a nonempty p-dimensional rectangle R*(u) contained in R(u), i.e., V(u) ≡ R*(u) ⊆ R(u). For these observations, the adjustment to the relevant calculation for the descriptive statistic of interest is routine.

Frequently, the virtual observation V(u) is a p-dimensional non-rectangular hypercube. However, it is usually the case that this virtual space can be partitioned into components each of which is itself a rectangle (or a shape such as a triangle, which is half a rectangle). For example, pattern I observed for the baseball data under rule ν2 (see Table 3) can be partitioned into components Rj, j = 1,…, 4. Each component Rj is a proportion pj of the whole virtual space V, with \(V=\bigcup_{j} R_{j}\) and Σj pj = 1, for a given u. Each Rj component is then added to the dataset as though it were an “observation”, but an observation with probability weight pj. This necessitates a probability weight of p = 1 for those observations u for which V(u) is itself a p-dimensional rectangle. When all virtual components {Rj, j = 1, 2, …} are rectangles, the methodologies presented herein apply directly.

When (as in the baseball example) an Rj is a triangle, adjustment has to be made to ensure that the calculated area of the “triangle” is indeed that, and not the area of the corresponding rectangle. Components Rj that are neither rectangles nor triangles are a different matter. In some instances, such a non-rectangular shape is not in itself a problem, though calculating the corresponding probability might (but should not in general) be tricky. Situations that are otherwise difficult will be treated elsewhere. The present work assumes V(u) can be partitioned into rectangular components (with appropriate adjustment for triangular pieces).

7 Histogram Algorithms

7.1 Univariate Histograms of Histogram-Valued Data

An algorithm for calculating the histogram of a set of histogram-valued data is briefly outlined as follows. Suppose the random variable Y has realizations

$$\xi(u)=\left\{\xi_{u j}=\left[a_{u j}, b_{u j}\right), p_{u j} ;\ \ j=1, \ldots, s_{u}\right\}$$

for each u = 1,…, n, where puj is the observed relative frequency on the interval \(\left[a_{u j}, b_{u j}\right) \text { with } \sum_{j} p_{u j}=1\), and where su is the number of histogram intervals for the data value u. In the virtual descriptions of patterns G, H, I in Section 2, su = 2. Note that when su = 1, and hence pu1 = 1, the observation is interval-valued (as a special case of histogram-valued variables). Suppose we want to construct a histogram of these {ξ(u), u = 1,…, n} observations. Let there be r histogram intervals Ig = [ha, hb), g = 1,…, r − 1, and Ir = [ha, hb], where clearly in I1, ha ≤ minu,j auj, and in Ir, hb ≥ maxu,j buj. Then, from Billard and Diday (2005), the observed frequency for the histogram interval Ig is given by

$$O(g)=\sum_{u} \sum_{j \in Z(g)} p_{u j} \frac{\left\|\xi_{u j} \cap I_{g}\right\|}{\left\|\xi_{u j}\right\|}$$
((22))

where, for each u = 1,…, n, Z(g) is the set of all ξuj intervals which overlap with Ig and where ∥A∥ is the length of the interval A. The relative frequency is pg = O(g)/n.
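A Python analogue of equation (22) is sketched below; the Appendix A implementation itself is a SAS macro, so this rendering and its names are ours. Applied to the virtual ξ′1(2) of Section 3, the frequencies it returns sum to 1, as they must for a single observation:

```python
def histogram_of_histograms(data, breaks):
    """O(g) and p_g = O(g)/n of equation (22), over the intervals
    I_g = [breaks[g], breaks[g+1])."""
    n = len(data)
    O = [0.0] * (len(breaks) - 1)
    for obs in data:                      # obs = [(a_uj, b_uj, p_uj), ...]
        for (a, b, p) in obs:
            for g in range(len(O)):
                ha, hb = breaks[g], breaks[g + 1]
                overlap = max(0.0, min(b, hb) - max(a, ha))
                O[g] += p * overlap / (b - a)   # eq. (22)
    return O, [o / n for o in O]

# The virtual xi'_1(2) of Section 3, on the Y1 intervals [0,100),...,[600,650]:
data = [[(88, 149, 0.134), (149, 422, 0.866)]]
O, pg = histogram_of_histograms(data, [0, 100, 200, 300, 400, 500, 600, 650])
print([round(o, 3) for o in O])  # frequencies; sum to 1 for this single observation
```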

The basic algorithm for computing O(g) from equation (22) essentially requires ascertaining the precise nature of the (ξuj ∩ Ig) term across all j = 1,…, su values, taking specific care of the exact endpoint values (auj, buj) and (ha, hb) and their relative relationships with each other. Once all possibilities have been identified, the process is reasonably straightforward. The algorithm is presented as a SAS macro; using SAS is not essential.

The algorithm itself is presented in Appendix A. This algorithm has in effect three components. The first (identified as A) relates to the various initial commands that set up the computer program (such as options, titles, …), including reading in the data. This version of the algorithm assumes the data are inputted as

$$\xi=\left\{\left[a_{j}, b_{j}\right), p_{j} ;\ \ j=1, \ldots, n s u\right\}$$

where nsu = maxu su. Adjustments to the data to accommodate observations with su < nsu (commonly the case) can also be made at this data manipulation stage (or the appropriate terms in the core macro of Part B can be adjusted if preferred). For ease of presentation, we assume these are appropriately handled in the first stage A. The core macro utilizes generic terms for the maximum number of data-histogram intervals (nsu), the first and last histogram ha values (first_ha and last_ha), and the histogram interval length (hinc). Thus, these are also set in Part A.

Part B is the core macro, here called ‘hist’. Part B1 addresses the values of the frequencies to be added (the add term) for each data histogram entry and its relationship to the histogram interval (ha, hb). Part B2 adds these frequencies over all data values and calculates the relative frequencies. This part also includes a simple format for outputting the resulting frequencies and relative frequencies (here referred to as ‘probabilities’); whatever format suits the reader should be substituted. This macro can then be invoked to calculate the O(g) and pg for a given g.

Rather than repeatedly invoking the ‘hist’ macro for each Ig, we can alternatively use Part C, a simple macro called ‘histall’, which calculates all the histogram frequencies over all Ig inside a simple do-loop routine. This ‘histall’ macro can be called once to give O(g) and pg for all g = 1,…, r. This is particularly useful when all histogram intervals Ig are of the same length.

Clearly, this is a basic algorithm to calculate O(g) and pg. Variations to accommodate different features (e.g., histograms with different Ig interval lengths) can readily be made.

7.2 Joint Histograms for Interval-Valued Data

Let the p-dimensional interval-valued observation be Y = (Y1,…, Yp) with Yv taking values on the interval (av, bv), v = 1,…, p. We want to construct the joint histogram for the two variables (Yi, Yj); for illustrative clarity we take (Y1, Y2). We may rewrite equation (21) as

$$O\left(g_{1}, g_{2}\right)=\sum_{u}\left(\frac{b_{1}^{*}-a_{1}^{*}}{b_{1}-a_{1}}\right)\left(\frac{b_{2}^{*}-a_{2}^{*}}{b_{2}-a_{2}}\right)$$
((23))

where \(R^{*}=\left\{\left(a_{1}^{*}, b_{1}^{*}\right) \times\left(a_{2}^{*}, b_{2}^{*}\right)\right\}\) is the rectangle which represents the intersection of the data rectangle R(u) and the histogram rectangle R(g1, g2). This R* rectangle can be empty. We note that the interval \(\left(a_{i}^{*}, b_{i}^{*}\right)\), i = 1, 2, may or may not overlap with the relevant (ha, hb) interval, and that the various possibilities observed when calculating the histogram of histogram data in Section 7.1 pertain here also (see Appendix A); but they pertain for both the Y1 and Y2 dimensions. More specifically, O(g1, g2) is the sum (over all observations) of cross-product terms, with each cross-product term equal to the product of one term from each of Y1 and Y2. A basic algorithm is given in Appendix B and proceeds as follows. Part A, as before, relates to the relevant preliminary program statements, including the input of the data.

Calculation of the O(g1, g2) of equation (23) consists of two parts, presented here as macros under Parts B and C, respectively. The macro of Part B, called ‘hist’ (comparable to, but different from, the ‘hist’ macro of Section 7.1), calculates the term \(\left(b_{v}^{*}-a_{v}^{*}\right) /\left(b_{v}-a_{v}\right)\) for a single v value. This macro is written to allow for any specified v value (as shown in, e.g., the a&v term). This term is called prod&k. The macro is invoked twice, once for each k value (k = 1 and k = 2), to give product term values prod1 and prod2 (for Y1 and Y2, respectively) for each observation u.

The second macro of Part C, called ‘hist2’, reads in the calculated prod1 and prod2 terms, takes their product, and sums these over all observations; i.e., it completes the calculation of equation (23). The cross-products and their summation are achieved via an IML routine, as shown. This particular macro calculates the observed joint frequency and the corresponding joint probability for a single histogram rectangle. Simple format lines for printing these results are also included. Thus, invoking the ‘hist2’ macro will produce the joint histogram value for a given histogram rectangle. A third macro, along the lines of the ‘histall’ macro shown in Part C of Section 7.1 (see Appendix A), could also be written to enable all histogram rectangles to be considered with one invocation. The details are omitted.

There is one final, but important, feature. Let us first consider standard rectangular ξ spaces; i.e., consider 2-dimensional interval-valued rectangular data R(u) for all u, such as when there are no rules, or 2-dimensional virtual data V(u) which are also rectangles. For these situations, the algorithm as described thus far proceeds without any problem. However, when, as often occurs, the virtual observation V(u) is the union of smaller rectangles each with some probability pj < 1, then appropriate adjustment must be made. For example, in the baseball example, under the rule ν2, we see from Table 6(ii) that the virtual observation for u = 2 contains the rectangle [(149, 422), (88, 149)] with probability p = 0.528 (≠ 1). In contrast, the virtual observation for u = 1 is the same as the actual observation, viz., the rectangle [(289, 538), (75, 162)] with probability p = 1. We saw in Section 6 that in general a non-rectangular V(u) can be decomposed into nonoverlapping rectangles Rj(u), j = 1,…, k, each with probability pj, Σj pj = 1. Therefore, in the data input and manipulation stage (of Part A), these rectangles and probabilities are calculated. We treat each of these Rj(u) as though it were a whole “observation” but with probability p = pj (instead of the initially set p = 1 value). This is reflected in the ‘hist’ macro by summing these probabilities to obtain the sample size n (n&v = n&v + p&v, of line 4, instead of the more intuitive n = n + 1). It is also reflected in the ‘hist2’ macro by taking the product prod1 * prod2 * p1 in the IML routine.
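This weighting scheme can be sketched in Python as follows (names ours). The virtual observation for u = 2 is entered as its four weighted rectangles from Table 6(ii); as cautioned in Sections 5 and 6, the triangular component is treated here as though it were a rectangle, so the triangle adjustment discussed earlier would still be required:

```python
def joint_frequency(observations, rect_g):
    """O(g1, g2): observations are lists of weighted components
    [(rect, p_j), ...] with rect = ((a1, b1), (a2, b2))."""
    total = 0.0
    for obs in observations:
        for rect, p in obs:
            frac = 1.0
            for (a, b), (ha, hb) in zip(rect, rect_g):
                frac *= max(0.0, min(b, hb) - max(a, ha)) / (b - a)
            total += p * frac            # prod1 * prod2 * p, as in 'hist2'
    return total

# u = 1: one rectangle with p = 1; u = 2: its four weighted rectangles
# from Table 6(ii) under rule nu_2:
obs1 = [(((289, 538), (75, 162)), 1.0)]
obs2 = [(((88, 149), (88, 149)), 0.059), (((88, 149), (49, 88)), 0.075),
        (((149, 422), (49, 88)), 0.338), (((149, 422), (88, 149)), 0.528)]
print(joint_frequency([obs1, obs2], ((200, 350), (75, 125))))
```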

8 Conclusion

Rules can have many forms and can impact data in various ways. While our study herein focused on logical dependency rules on interval-valued data, other forms of data may require different types of rules. There can also be situations where the rule itself varies depending on the “value” of the symbolic data (interval-valued or not). For example, outlier values may induce their own dependency rules. In another situation, it may be that one variable is correlated with another variable (as is prevalent with medical and/or biologically based variables), with the resulting need for rules that are themselves observation-dependent. In a different direction, once histograms (for example) have been developed in the presence of rules, other parametric distribution procedures (such as fitting, estimation, and so forth) can be developed. The field is wide open for more research.