Classification or categorization involves grouping stimuli into distinct classes or categories based upon common functions or physical features (Astley, Peissig, & Wasserman, 2001; Zentall, Galizio, & Critchfield, 2002). In hierarchical classification, classes are themselves categorized into higher order classes (Greene, 1994; Murphy, 2002; Slattery & Stewart, 2014; Slattery, Stewart, & O’Hora, 2011). An example of the latter would be classifying budgie into the category bird and classifying bird into the category animal.

The bulk of research examining classification as a skill set has been conducted within the field of cognitive–developmental psychology (e.g., Arterberry & Bornstein, 2012; Blewitt, Michnick, Golinkoff, & Alioto, 2000; Bornstein & Arterberry, 2010; Deneault & Ricard, 2005, 2006; Hajibayova, 2013; Manders & Hall, 2002; Mompeán, 2006; Rostad, Yott, & Poulin-Dubois, 2012; Valentin & Chanquoy, 2012). Research within this field suggests that this skill set is acquired gradually, beginning in early infancy (3–4 months; see Arterberry & Kellman, 2016; Eimas & Quinn, 1994; Eimas, Quinn, & Cowan, 1994; Quinn, 2016; Quinn & Eimas, 1996; Song, Pruden, Michnick Golinkoff, & Hirsh-Pasek, 2016) with basic classification skills and culminating later in childhood (approximately 11 years old; see Blewitt, 1989, 1994; Carneiro, Albuquerque, & Fernandez, 2009; Deneault & Ricard, 2006) with advanced “hierarchical” classification responding.

Although cognitive–developmental research on classification has identified several crucial features of classification and has generated data regarding variables that may affect it, this approach is largely descriptive. The principal aim of such research is to gain a theoretical comprehension of the phenomenon in question rather than to isolate environment–behavior interactions that can facilitate prediction and influence over behavior. As such, theoretical accounts offered within the cognitive–developmental approach can be argued to be less than optimal for realizing practical change within the applied domain (Margolis & Laurence, 2000; Murphy, 2002; Palmer, 2002).

The approach to classification taken in the current study is based on a contextual behavioral approach referred to as relational frame theory (RFT; Dymond & Roche, 2013; Hayes et al., 2001a). This approach conceptualizes language and cognition as patterns of generalized, contextually controlled relational responding that are learned via multiple exposures to behavior–environment contingencies operating in the socioverbal environment. These patterns are known as relational frames. Acquisition of these frames begins with learning to respond to nonarbitrary or physical relations between objects in the presence of particular contextual cues (e.g., choosing the physically bigger or smaller of two objects in the presence of the cues more and less, respectively). With sufficient exemplars of the pattern, however, this responding can come under the control of the contextual cues alone and is then applicable in contexts without nonarbitrary relational support (e.g., if told that X is more than Y, a child might be able to derive that Y is less than X in the absence of any nonarbitrary or formal differences between the relata). The latter is referred to as arbitrarily applicable relational responding (AARR) because the relational response can be applied in any circumstance no matter what the actual physical relation between the relata is (Hayes et al., 2001b; O’Toole, Barnes-Holmes, Murphy, O’Connor, & Barnes-Holmes, 2009; Stewart & McElwee, 2009).

Over the last number of decades, researchers have shown that patterns of AARR (i.e., relational frames) come in a variety of forms (see Stewart, 2016, and Stewart & Roche, 2013, for an overview). Despite the variety of patterns of AARR, all are characterized by three properties: mutual entailment, combinatorial entailment, and transformation of function. Mutual entailment (ME) refers to the feature of relational framing by which a unidirectional relation from Stimulus A to Stimulus B in a specific context entails a second unidirectional relation from Stimulus B to Stimulus A. For instance, in the earlier example, a child told that “X is more than Y” could derive the mutually entailed relation that “Y is less than X.” Combinatorial entailment (CE) is the phenomenon whereby two stimulus relations are combined to allow derivation of a third relation. For example, a child told that “X is more than Y” and “Y is more than Z” might derive that “X is more than Z” and “Z is less than X.” Finally, the transformation of functions (ToF) is seen when the functions of a stimulus change or transform as a result of its being in a derived relation with another stimulus. For example, consider the case of the child who has derived through CE that X is more than Z. If Z was already a directly conditioned reinforcer for this child, then he or she might work for it; however, if offered a choice between X and Z, the child might choose the former (i.e., X) over the latter (i.e., Z), even though he or she might have had no previous experience of X as a reinforcer. Such a preference would demonstrate that the functions of X had been transformed through its derived relations with Z (e.g., Hayes et al., 2001b).
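The derivations in the "more than"/"less than" example can be made concrete with a short sketch. The code below is only an illustration of the logic of ME and CE for comparative relations; the tuple representation, the `CONVERSE` table, and both function names are assumptions introduced here and are not part of RFT or of the protocol used in this study.

```python
# Hypothetical illustration of mutual and combinatorial entailment for
# comparative relations. All names here are assumptions for this sketch.
CONVERSE = {"more than": "less than", "less than": "more than"}


def mutual_entailment(relation):
    """Given (a, rel, b), derive the entailed relation (b, converse(rel), a)."""
    a, rel, b = relation
    return (b, CONVERSE[rel], a)


def combinatorial_entailment(first, second):
    """Combine (a, rel, b) and (b, rel, c) into (a, rel, c).

    Returns None when the two premises do not share a middle term or do not
    use the same comparative relation, because nothing can then be derived
    by this simple rule.
    """
    a, rel1, middle1 = first
    middle2, rel2, c = second
    if rel1 == rel2 and middle1 == middle2:
        return (a, rel1, c)
    return None


# Premises from the running example: "X is more than Y", "Y is more than Z".
x_y = ("X", "more than", "Y")
y_z = ("Y", "more than", "Z")

derived_me = mutual_entailment(x_y)                  # Y is less than X
derived_ce = combinatorial_entailment(x_y, y_z)      # X is more than Z
derived_ce_reverse = mutual_entailment(derived_ce)   # Z is less than X
```

ToF is not modeled here, because it concerns changes in the psychological functions of stimuli rather than the derivation of relations themselves.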

The generativity of relational framing illustrated by these properties is one source of evidence suggesting the link between AARR and human language and cognition. However, there is by now substantive evidence from many other quarters also showing this link and, further, showing that AARR in its multiple forms appears to be the key functional process underlying these phenomena (e.g., Cassidy, Roche, Colbert, Stewart, & Grey, 2016; Gore, Barnes-Holmes, & Murphy, 2010; Hayes & Stewart, 2016; Healy, Barnes-Holmes, & Smeets, 2000; Luciano et al., 2007; O’Hora et al., 2008; Viscaíno-Torres et al., 2015).

From this perspective, classification as a form of verbal behavior may also be approached in terms of particular patterns of framing. Perhaps the most relevant frames in this regard are those of containment (e.g., A contains B; B is within A) and hierarchy (e.g., A is a member of B; B is a class containing A; see, e.g., Hayes et al., 2001a; Slattery & Stewart, 2014). These patterns of framing, as with others, can be theorized to originate in more basic nonarbitrary relations. For example, in the case of these particular patterns of relational responding, a child might be taught to put one object physically inside another in the presence of a cue such as in or inside or to accurately state which object in a container–contained dyad is the former and which the latter. Such responding, based as it is on the presence of actually physically related objects, might then lay the foundation for relatively more abstract patterns of responding, such as classification (i.e., responding to “members” as being contained in “classes”) or hierarchical classification (classes being contained within classes) in which immediate physical relations (e.g., of containment) are no longer present.

Gil et al. (2012) provided the first empirical demonstration of hierarchical relational responding. The participants were 10 university students, and the experiment was conducted across five phases. In Phase 1, four arbitrary shape stimuli were established as includes, belongs to, same, and different contextual cues, respectively. In Phase 2, participants were trained and tested for the derivation of arbitrary sameness (i.e., “equivalence”) relations between arbitrary nonsense-word stimuli. These stimuli thus made up the lowest level of two separate hierarchical relational networks. Next, in Phase 3, responding in accordance with higher levels of these networks was induced by having participants derive relations of containment between lower level stimuli and (novel, additional) stimuli further up the network by using the includes and belongs to cues. In Phase 4, the researchers established particular functions in particular stimuli at different levels of the hierarchical network, and in the final phase they reported particular resultant patterns of ToF whereby stimuli acquired novel untrained functions by virtue of their position in the hierarchical network.

Slattery and Stewart (2014) subsequently argued that hierarchical relational framing subsumes more than one pattern of hierarchical responding and that these patterns include both hierarchical classification—in which stimuli are related as being in classes and subclasses—as well as hierarchical containment—in which stimuli are related as simply containing and being contained, without full implications of classification being involved (see Markman & Seibert, 1976). They sought to demonstrate a model of hierarchical relational framing as classification per se. Accordingly, they used nonarbitrarily related stimuli (i.e., sharing certain physical properties, as typically seen within classification hierarchies) to establish two arbitrary shapes as contextual cues for member of and includes relational responding with four undergraduates. Thereafter, these cues were then used to establish arbitrary stimuli in particular hierarchical relations. Finally, the derivation of additional untrained hierarchical relations and ToF was assessed. Subsequent patterns of responding showed properties of unilateral property induction, transitive class containment, and asymmetrical class containment, as predicted and in keeping with properties ascribed to hierarchical classification by mainstream theorists.

The aforementioned studies constitute an important starting point with respect to the RFT induction and investigation of hierarchical relational framing. They provide evidence that humans engage in this particular pattern of contextually controlled relational responding and that it can be brought under contextual control in the laboratory. One important direction in the further empirical investigation of this pattern of relational framing is to begin to examine its historical origins. That is the aim of the current study, which seeks to measure the emergence of relational framing related to classification, including both containment and hierarchy, in young children from 3 to 8 years old and to correlate performance in this regard with performance on standardized measures of intellectual potential.

The current study involved using a protocol developed specifically to track repertoires of relational framing related to hierarchical classification from simple and concrete up to complex and abstract; specifically, nonarbitrary containment, arbitrary containment, and arbitrary hierarchy, each assessed in terms of both ME and CE, as well as, in the case of arbitrary hierarchy, ToF. We anticipated that doing so might provide information regarding the typical acquisition of relational framing repertoires relevant to classification. This might further inform researchers and practitioners regarding the development of relational framing in general as well as framing related to hierarchical classification specifically. It might provide information regarding typical hierarchically relevant framing repertoires at different ages as well as potential deficits at particular ages that might be amenable to training. As indicated earlier, the results of previous research suggest that relational framing is positively correlated with performance on measures of intellectual potential. It might be hypothesized that the framing repertoires of containment and hierarchy specifically would be similarly correlated. Assuming that this is the case, such a finding might point to the potential importance of training relational framing skills of categorization using a protocol based on that developed for the current study.

To assess nonarbitrary containment, we presented children with differently colored boxes in which smaller boxes were physically contained inside larger boxes and asked the children whether one particular box was inside another or contained the other. To assess arbitrary containment, we presented the children with a number of differently colored circles that were physically the same size as each other. We then told them about one or more “containment” relations between the circles and probed to see whether they could correctly derive further relations in the absence of any physical containment relations being demonstrated (hence, these tasks probed for arbitrary relations). For example, for ME tasks, we told them that one particular colored circle (e.g., the green one) contained another particular colored circle (e.g., the red one) and then probed to see whether they could derive the correct entailment relation (i.e., in this case, that the red one was inside the green one). Finally, to assess arbitrary hierarchy, we presented children with real and nonsense words on a computer screen, told them about one or more “hierarchical class”–type relations between the stimuli, and probed to see whether they could correctly derive further relations. For example, for ME tasks, we told them that one particular nonsense word (e.g., tol) was a type of animal and then probed to see whether they could endorse the correct entailment relation (i.e., in this case, that the class animal contained tols as members). As for the previous task, this was an arbitrary relational task because the stimuli representing classes and members did not show any physical relationship of relevance to the task.

To assess whether the relational repertoires measured in this study correlated with intellectual performance, a number of standardized tests of intellectual skill were used, including the Peabody Picture Vocabulary Test, Fourth Edition (PPVT–4; Dunn & Dunn, 2007), the Stanford–Binet Intelligence Scales—Fifth Edition (SB5; Roid, 2003), and the Children’s Category Test (CCT; Boll, 1993). A test of Piagetian class inclusion (CI; Piaget, 1952) was also used to examine whether scoring on the relational framing protocol might predict performance on this well-known test of classification.

Method

Participants

Fifty typically developing children (23 female, 27 male) between 3 and 8 years of age (range of 3 years and 0 months to 7 years and 9 months) took part. The mean age and standard deviation for each group (n = 10 each) were as follows: 3-year-olds—M = 3.56 years, SD = 0.26 years; 4-year-olds—M = 4.56 years, SD = 0.27 years; 5-year-olds—M = 5.54 years, SD = 0.29 years; 6-year-olds—M = 6.31 years, SD = 0.28 years; and 7-year-olds—M = 7.27 years, SD = 0.27 years.

Participants were recruited from rural primary schools and play schools within the local area. Prior ethical approval for recruitment of participants for this study was obtained from the research ethics committee of the host institution. Consent for conducting the study was obtained from the principal in each respective school. Parental consent was obtained for each child who participated, and verbal consent was also obtained from each of the participants.

Materials

Participants were exposed to a number of different assessments. These included preassessment of (a) color tacting and (b) yes–no responding. They were also assessed on a number of standardized measures as follows: (a) the PPVT–4 (Dunn & Dunn, 2007), (b) the SB5 (Roid, 2003), and (c) the CCT (Boll, 1993). The assessments also included a test of CI (Piaget, 1952) and a test of relational responding involving three parts: (a) nonarbitrary containment, (b) arbitrary containment, and (c) arbitrary hierarchy.

Color Tacting

This test involved nine paper circles, each of which was colored with a different color used during the relational assessment procedure (i.e., black, white, orange, green, blue, yellow, purple, red, or pink). During the test, the child was exposed to three arrays of three colored circles each and asked to tact each particular color displayed. Children had to answer correctly on all nine trials to proceed.

Yes–No Responding

This test involved 10 pictures of commonly seen objects or animals (e.g., a dog). On each trial, the child was shown a picture of an object or animal and asked a yes–no question. Children were asked five questions to which the answer yes was appropriate (e.g., they were shown a picture of a dog and asked “Is this a dog?”) and five questions to which the answer no was appropriate (e.g., they were shown a picture of a circle and asked “Is this a square?”). Children had to answer correctly on all 10 trials to proceed.

PPVT–4

This test assesses receptive vocabulary in respondents from ages 2 years and 6 months to late adulthood and is often used as a test of scholastic aptitude. It is administered individually and is presented in a multiple-choice format in which the respondent is presented with four pictures and asked to select the one that best illustrates the definition of a particular word. The stimuli include items representing up to 20 content areas (e.g., actions, vegetables, tools) and components of speech (nouns, verbs, or attributes), encompassing a broad range of difficulty. Test–retest reliability has been shown to be between .92 and .96 (Community-University Partnership for the Study of Children, Youth, and Families, 2011), and internal consistency has been noted to be similarly high (between .94 and .95). Construct and convergent validity have been assessed by comparing the PPVT–4 to the Expressive Vocabulary Test, Second Edition (EVT–2; Williams, 2007). Correlations between the two were high (r = .80–.84) for all age groups (Community-University Partnership for the Study of Children, Youth, and Families, 2011).

SB5

This is an intelligence test for use with individuals from ages 2 to 85 years that measures five weighted factors (fluid reasoning, knowledge, quantitative reasoning, visual–spatial processing, and working memory) and consists of both verbal and nonverbal subtests. Participants in the current study were evaluated for the full-scale intelligence quotient (FSIQ), the verbal intelligence quotient (VIQ), and the nonverbal intelligence quotient (NVIQ). Reliability coefficients have been found to be extremely high for the FSIQ (r = .98), the NVIQ (r = .95), and the VIQ (r = .96), showing excellent stability. In addition, the five factor index scores were all above .90 and were higher than the subtest scales, which were, however, comparable to other cognitive tests with ranges from .84 to .89 (Roid, 2003). Validity has been assessed through comparison with other tests of cognitive and intellectual ability, including the Wechsler Preschool and Primary Scale of Intelligence—Revised (WPPSI–R; Wechsler, 1989), the Wechsler Intelligence Scale for Children—Third Edition (WISC–III; Wechsler, 1991), the Wechsler Adult Intelligence Scale—Third Edition (WAIS–III; Wechsler, 1997), the Woodcock–Johnson III Test of Cognitive Ability, and the Woodcock–Johnson III Test of Achievement. Correlations ranged from .78 to .84 for both FSIQ and VIQ with comparable indices on other major IQ batteries (Roid, 2003).

CCT

This test was designed to assess categorization skills for children between 5 and 8 years of age (Level 1) and between 9 and 16 years of age (Level 2). It is individually administered and is designed to assess concept formation and problem-solving abilities with novel material. In the current study, only Level 1 was used. This level consists of five subtests and includes 80 questions. Within each subtest, children are asked to determine the principle underlying correct performance (i.e., matching based on color, size, or proportion) using examiner feedback. Previous research has indicated that the test has adequate test–retest reliability (r = .75) in addition to strong internal consistency for both Levels 1 (r = .88) and 2 (r = .86; MacNeill Horton, 1996).

Piagetian CI

This is a classic test of categorization ability that assesses the capacity to respond to a stimulus as simultaneously belonging to both a class and a superordinate class (Thomas & Horton, 1997). It has its origins in Piagetian developmental psychology (Piaget, 1952), where it was seen as determining if a child had reached the “concrete operational” stage of development, hypothesized to occur by age 7 or 8. In the test, the child is first shown an array of stimuli that belong to a particular category. These stimuli form two different subclasses, with a greater number of stimuli within one category than the other (e.g., five apples and two pears). The child is asked if there are more of the larger subclass or of the category—for example, “Are there more apples, or is there more fruit?” Materials used in the current study included 5.5 × 5.5 cm colored flashcards depicting a variety of animals (horse, dog, pig, cow, cat, dog), fruit (strawberry, apple, pear, banana, lemon, orange), clothing (socks, skirt, T-shirt, dress, pants, coat), and vehicles (car, fire engine, truck, motorcycle, tractor, bus). Multiple examples of each of these stimuli were used.

Relational Responding Test

The current study assessed three different patterns of relational responding, including (a) nonarbitrary containment, (b) arbitrary containment, and (c) arbitrary hierarchy. These were assessed as follows:

  1. Nonarbitrary containment: Materials included both square and triangular boxes of different sizes (large, medium, or small) and colors (red, yellow, green, blue, purple, orange, black, white, and pink).

  2. Arbitrary containment: Materials included same-sized circles and same-sized triangles of differing colors (red, yellow, green, blue, purple, orange, black, white, and pink).

  3. Arbitrary hierarchy: This test used a laptop to show arrays of stimuli that were used as the basis for asking particular questions to probe for derived relations.

Procedure

In the case of each school setting, the research was conducted in a separate classroom or resource room within the school building. In the first (screening) session, children were exposed to the color tacting and yes–no pretests, which together took a total of 10 min per child. All the children passed these tests. Thereafter, they were tested individually on the main assessments in sessions lasting between 30 and 45 min, with the length of a session depending on the test being administered and the age of the child. These sessions involved breaks every 10 or 15 min, also depending on the age of the child. Children were exposed to the main assessments in the following order, with the number of sessions involved for each measure and the range of testing time per session, including breaks, given in parentheses:

  1. Relational Responding Test 1 (three sessions lasting 40–45 min each, one for nonarbitrary containment, one for arbitrary containment, and one for arbitrary hierarchy).

  2. SB5 (two to three sessions, with 40 min for Session 1, 25–30 min for Session 2, and 20–25 min for Session 3, with the number of sessions and exact duration depending on the age of the child).

  3. PPVT–4 (one session, 30–40 min).

  4. CCT (one session, 30 min).

  5. CI (included in the same session as the CCT, 10 min).

  6. Relational Responding Test 2 (number and length of sessions were the same as for Relational Responding Test 1).

Testing sessions took place over two phases. Phase 1, which included Assessments 1–5, involved approximately seven to eight days of testing per child spread over approximately 2 weeks. Phase 2, which included Assessment 6 (i.e., the second relational responding test), took place 2 weeks after the initial phase ended and involved 3 successive days of testing.

Relational Responding Test

In the sessions following the screening tests, participants were assessed for the three relational responding repertoires seen as involved in classification: nonarbitrary containment (i.e., relating stimuli on the basis of an observed physical containment relationship), arbitrary containment (i.e., relating stimuli on the basis of an abstract or arbitrary containment relationship), and arbitrary hierarchy (i.e., relating stimuli on the basis of an abstract hierarchical class relationship). These three relational repertoires were assessed over the course of three sessions (one repertoire per session) based on 128 questions (32 for each of the first two repertoires and 64 for the third).

In the case of nonarbitrary and arbitrary containment, relational framing was assessed in terms of ME (16 questions) and CE (16 questions; see Tables 1 and 2, respectively). On all trials, participants were first shown the relevant stimuli, and then the experimenter described the relationship between the stimuli. For the assessment of nonarbitrary containment responding, the experimenter further demonstrated the relationship by manipulating the stimuli (e.g., following the description “A red box is inside a blue box,” the experimenter would then place a red box inside a blue box). Participants were then asked a question about the relationship between certain stimuli that could typically be answered in the form of a yes or no response (more specific details regarding format are provided later in subsections corresponding to each repertoire). No contingent feedback was provided at any time.

Table 1 Trial Types Used in the Nonarbitrary Containment Phase
Table 2 Trial Types Used in the Arbitrary Containment Phase

In the case of arbitrary hierarchy, relational framing was assessed in terms of ME (16 questions), ToF through ME (16 questions), CE (16 questions), and ToF through CE (16 questions). On all trials, participants were first shown an array of stimuli on a computer screen, and then the experimenter described the relationship between the stimuli, elaborating slightly for ToF questions. As in previous sections, participants were then presented with a follow-up question that could typically be answered in the form of a yes or no response (again, more specific details regarding format are provided in later sections), and no contingent feedback was provided at any time.

As described previously, for all three relational repertoires, questions were scored in groups of 16. Each item in a 16-question section was scored as either 1 (correct) or 0 (incorrect), giving a score from 0 to 16 for that section. A score of 13/16 or higher was deemed a pass for that section (it was calculated that the probability of achieving a score within this range by chance was approximately 1 in 100). Section scores were summed to give a cumulative score for each relational repertoire and an overall relational framing score.
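The chance figure for the 13/16 pass criterion can be checked with a binomial tail calculation. The sketch below assumes that a child guessing on a yes–no item is correct with probability .5 and that items are independent; the source does not state how the figure was calculated, so this is one plausible reconstruction, and the function name is hypothetical.

```python
from math import comb


def chance_pass_probability(n_items=16, pass_mark=13, p_correct=0.5):
    """Probability of scoring at least `pass_mark` out of `n_items` by guessing.

    Assumes independent items, each answered correctly with probability
    `p_correct` (0.5 for a yes-no question answered at random).
    """
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct) ** (n_items - k)
        for k in range(pass_mark, n_items + 1)
    )


p_chance = chance_pass_probability()
print(f"P(score >= 13 of 16 by chance) = {p_chance:.4f}")  # 0.0106
```

Under these assumptions, 697 of the 65,536 equally likely answer patterns reach the criterion, which is close to 1 in 100.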

Participants were first assessed for nonarbitrary containment responding across 32 trials. Table 1 shows a generic representation of the trial types involved. In tests for ME, the participant was presented with two differently colored boxes (one inside another), and all questions within this phase of assessment were focused on the relationship between these two stimuli. A total of 16 questions (including eight questions that could be answered yes and eight that could be answered no) were presented in random order during this phase. In tests for CE, the participant was presented with three differently colored boxes (one inside a second and the second inside a third), and all 16 questions in this phase focused on the relationship between the first and third boxes.

Participants were next assessed for arbitrary containment responding across 32 trials. The trial types for this assessment were similar to those for the assessment of nonarbitrary containment, but this assessment used identically sized colored-circle stimuli between which no physical containment relationship was demonstrated (see Table 2 for a generic representation of the trial types involved).

In tests for ME of arbitrary containment, the participant was presented with two shapes (e.g., circles) of equal size but different color, and all questions in this assessment phase were focused on the relationship between these two stimuli. A total of 16 questions were presented in random order during this phase. In tests for CE, the participant was presented with three shapes (designated as A, B, and C) of equal size but different color, and all 16 questions in this phase focused on the derivation of a relation between Stimuli A and C based on the combination of given relations between A and B and between B and C, respectively.

In the third session of relational testing, participants were assessed for arbitrary hierarchical relational responding across a total of 64 questions (see Table 3 for a list of the trial types involved).

Table 3 Trial Types Used in the Arbitrary Hierarchy Phase

In all tests for arbitrary hierarchy, the participant was presented (on a laptop computer) with on-screen textual descriptions of the relationship(s) among one or more pairs of stimuli (see Fig. 1), and these descriptions were also read aloud. The questions in this phase were presented in eight groups of eight items each (see Table 3), with the items within each group presented in random order. The groups were presented in the following order:

  1. ME questions including the cue type of.

  2. ME questions including the cue type of that assessed ToF.

  3. ME questions including the cue contains.

  4. ME questions including the cue contains that assessed ToF.

  5. CE questions including the cue type of.

  6. CE questions including the cue type of that assessed ToF.

  7. CE questions including the cue contains.

  8. CE questions including the cue contains that assessed ToF.

Fig. 1 Sample stimuli layouts for mutual entailment (top panels) and combinatorial entailment (bottom panels) trials in the arbitrary hierarchy testing phases

In each item in the basic ME phases (i.e., Groups 1 and 3), an initial statement involving a hierarchical relational cue was presented (e.g., “A tol is a type of animal”), and the participant was then required to respond correctly to a follow-up question based on this provided statement (see Table 3). Correctly answered items in Groups 1 and 3 were combined to give a score for ME. In each item in the ME ToF phases (i.e., Groups 2 and 4), an initial statement involving a hierarchical relational cue was presented, a function (e.g., “big eyes”) was (verbally) associated with one or other of the two stimuli, and the participant was required to respond correctly to a follow-up question based on this. Correctly answered items in Groups 2 and 4 were combined to give a score for ME with ToF. In each item in the basic CE phases (i.e., Groups 5 and 7), two statements involving a hierarchical relational cue were presented, and the participant was then required to respond correctly to a follow-up question based on the combination of those statements. Correctly answered items in Groups 5 and 7 were combined to give a score for CE. In each item in the CE ToF phases (i.e., Groups 6 and 8), two statements involving a hierarchical relational cue were presented, a function was associated with one of the stimuli involved, and the participant was required to respond correctly to a follow-up question based on this. Correctly answered items in Groups 6 and 8 were combined to give a score for CE with ToF.
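The way the eight 8-item groups combine into four 16-item section scores can be sketched as follows. The group-to-section mapping and the 13/16 pass mark come from the description of the procedure; the dictionary layout, the function name, and the example scores are hypothetical and are not data from the study.

```python
# Mapping of question groups to sections, as described in the procedure.
SECTION_GROUPS = {
    "ME": (1, 3),
    "ME with ToF": (2, 4),
    "CE": (5, 7),
    "CE with ToF": (6, 8),
}
PASS_MARK = 13  # out of 16 items per section


def score_arbitrary_hierarchy(group_scores):
    """Combine per-group scores (0-8 each) into per-section scores (0-16)."""
    sections = {}
    for name, groups in SECTION_GROUPS.items():
        total = sum(group_scores[g] for g in groups)
        sections[name] = {"score": total, "passed": total >= PASS_MARK}
    return sections


# Hypothetical scores for one child, keyed by group number.
example_group_scores = {1: 8, 2: 7, 3: 7, 4: 5, 5: 6, 6: 4, 7: 7, 8: 3}

sections = score_arbitrary_hierarchy(example_group_scores)
hierarchy_total = sum(s["score"] for s in sections.values())  # out of 64

for name, result in sections.items():
    print(f"{name}: {result['score']}/16, passed={result['passed']}")
```

In this made-up case the child passes the ME and CE sections but not the two ToF sections, for a cumulative arbitrary hierarchy score of 47 out of 64.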

As noted previously in the first paragraph of the procedure, two versions of the relational responding test were given: Relational Responding Tests 1 and 2. Test 2 was similar to Test 1, except that a different set of stimuli was involved. More specifically, in the nonarbitrary and arbitrary containment tests, differently shaped boxes and different shapes, respectively, were used, whereas in the arbitrary hierarchy test, a different set of textual stimuli (i.e., nonsense and real words) was presented (see Table 8 for the full set of arbitrary hierarchy retest questions).

CI

This test involved presenting eight questions focused on CI as previously described. These included the following question types:

  1. Are there more [smaller subclass] or more [category]?
  2. Are there more [larger subclass] or more [category]?
  3. Are there more [category] or more [smaller subclass]?
  4. Are there more [category] or more [larger subclass]?
  5. Are there less [smaller subclass] or less [category]?
  6. Are there less [larger subclass] or less [category]?
  7. Are there less [category] or less [smaller subclass]?
  8. Are there less [category] or less [larger subclass]?

These questions were presented in random order based on the shuffling of eight question cards. Participants received noncontingent reinforcement for taking part.
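The eight question types and their random ordering can be sketched as follows. The stimulus labels ("cats," "dogs," "animals") and the function names are hypothetical and serve only to illustrate the structure; in the study itself, the order was determined by shuffling physical question cards, for which the seeded shuffle below is merely a stand-in.

```python
import random


def build_ci_questions(smaller, larger, category):
    """Return the eight CI question types in the order listed in the text."""
    questions = []
    for quantifier in ("more", "less"):
        # Question types with the subclass named first.
        for subclass in (smaller, larger):
            questions.append(
                f"Are there {quantifier} {subclass} or {quantifier} {category}?")
        # Question types with the category named first.
        for subclass in (smaller, larger):
            questions.append(
                f"Are there {quantifier} {category} or {quantifier} {subclass}?")
    return questions


def shuffled(questions, seed=None):
    """Return a randomly ordered copy, analogous to shuffling question cards."""
    rng = random.Random(seed)
    order = list(questions)
    rng.shuffle(order)
    return order


# Hypothetical stimuli: a smaller subclass, a larger subclass, and their category.
ci_questions = build_ci_questions("cats", "dogs", "animals")
for question in shuffled(ci_questions, seed=1):
    print(question)
```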

Interobserver Agreement (IOA) and Procedural Fidelity

IOA data were collected by research assistants for 20% of participants during nonarbitrary containment, arbitrary containment, and arbitrary hierarchy assessment sessions. Prior to data collection, IOA collectors were trained in data collection until they reached 100% accuracy. Mean IOA was 99.48% (range: 96.88% to 100%). Procedural fidelity data were likewise collected for 20% of participants during nonarbitrary containment, arbitrary containment, and arbitrary hierarchy assessment sessions. A trained research assistant collected the procedural fidelity measures. Prior to data collection, correct experimenter procedural responding was modeled for the data collectors using exemplars of procedural integrity, and the data collectors were also provided with a printed copy of the assessment protocol and a checklist of the steps in the assessment procedure. Mean procedural integrity was 99.56% (range: 93.75% to 100%).
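The formula used to compute IOA is not specified here. One common approach is trial-by-trial agreement, that is, the number of trials on which two observers recorded the same outcome divided by the total number of trials, multiplied by 100. The sketch below assumes that approach; the function name and the records are hypothetical.

```python
def percent_agreement(primary, secondary):
    """Trial-by-trial agreement between two observers, as a percentage.

    Assumes both observers scored the same trials in the same order;
    this scoring rule is an assumption, not a detail given in the text.
    """
    if len(primary) != len(secondary):
        raise ValueError("Both observers must score the same number of trials.")
    if not primary:
        raise ValueError("At least one trial is required.")
    agreements = sum(a == b for a, b in zip(primary, secondary))
    return 100.0 * agreements / len(primary)


# Hypothetical session of 32 trials with a single disagreement on the last trial.
observer_1 = [True] * 32
observer_2 = [True] * 31 + [False]
print(percent_agreement(observer_1, observer_2))
```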

Results

Correlations

Table 4 shows a correlation matrix of Spearman’s ρ correlations among the experimental measures and some of their key subscales.

Table 4 Matrix of Spearman’s Rho Correlations for All Measures Administered

This table shows a highly significant correlation between age in months and overall relational framing score (ρ = .852, p < .001), as well as between age and each of the three specific relational repertoires, including nonarbitrary containment (ρ = .826, p < .001), arbitrary containment (ρ = .806, p < .001), and arbitrary hierarchy (ρ = .789, p < .001). The data also show highly significant correlations between the overall and specific relational framing scores and those of the other measures, as well as, in the case of the SB5, its subscales (i.e., SB5, SB5 Verbal, SB5 Nonverbal, PPVT–4, CCT, and CI). In general, the highest correlations are seen for the measures of general intellectual performance (i.e., SB5 and subscales) and language (PPVT–4), with slightly lower correlations for the measures of categorization (i.e., CCT and CI).

Regarding the correlations among the nonrelational framing measures themselves, we can see that the standardized measures of intellectual performance and language correlate very highly with each other, as might be expected (e.g., SB5 and PPVT–4 show a correlation of ρ = .918). Each of these measures also correlates well with the CCT (e.g., PPVT–4: ρ = .769; SB5: ρ = .817) and not quite as well with the CI (e.g., PPVT–4: ρ = .415; SB5: ρ = .432). In regard to the two latter (categorization) measures: (a) The CCT shows much higher levels of correlation with the measures of intellectual and language performance than the CI, (b) both tests show comparable levels of correlation with the relational framing measures, and (c) the level of correlation between these tests themselves is the lowest in the table (i.e., ρ = .285).
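For readers less familiar with the statistic, Spearman's ρ is the Pearson correlation computed on the ranks of the two variables, with tied values assigned their average rank. The following is a minimal sketch of that computation; the ages and scores are invented for illustration and are not data from the study, and significance testing is omitted.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: age in months and a relational framing score.
ages = [40, 52, 61, 75, 88, 95]
scores = [2, 5, 4, 9, 12, 15]
print(spearman_rho(ages, scores))
```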

Relational Framing Performance per Age Cohort

Table 5 shows the average number of correct responses and corresponding percentage of correct responses per age group and per relational framing pattern, including both relational frames (nonarbitrary containment, arbitrary containment, and arbitrary hierarchy) and relational properties (ME, CE, and ToF). The broad patterns seen are in accordance with what might have been predicted in that:

  1. For each relational property (and thus also all relations), all age cohorts perform in the aggregate at least as well as any younger cohort (with one exception, which is for ME of nonarbitrary containment relations as shown by the 5- to 6-year-old and 6- to 7-year-old groups, and in that case the difference is relatively small). In general, aggregate performance on all indices improves with age.
  2. For all cohorts and for all three relations, the aggregate score for the CE test is at most as high and is typically lower than the score for the corresponding ME test.

Table 5 Average Correct Response Number and Percentage per Age Group and Relational Framing Pattern (Including Frames and Properties)

Table 6 shows the aggregate number of passes achieved within each age cohort for each of the relational properties assessed for each of the three patterns of relational framing (also see Fig. 2). The broad patterns seen are similar to those seen for the raw data in that:

  1. For each relational property (and thus also all relations), all cohorts perform at least as well as any younger cohort and, in general, performance on all indices improves with age.
  2. For all cohorts and all three relations, the score for CE is at most as high and is typically lower than that for ME (with one slight exception: ToF for 6- to 7-year-olds).

Table 6 Number of Passes for Each Relational Test per Age Cohort

Fig. 2 Number of passes per relational frame/property and age group. ME = mutual entailment; CE = combinatorial entailment; ToF = transformation of functions

Relational Framing Test–Retest Scores

Table 7 shows correlations between scores for the initial test and the retest in the case of all subsections of each relation, for each relational total, and for the overall relational framing test total. All the tests showed very highly significant test–retest correlations.

Table 7 Test–Retest Correlations for Relational Framing Test and Subtests

Discussion

The current study aimed to investigate relational framing related to hierarchical classification in children. To this end, it involved administering to 50 children ranging from 3 to 8 years of age a custom-made protocol designed to assess relational framing presumed to be relevant to hierarchical classification from relatively simple and concrete to relatively more complex and abstract—specifically, nonarbitrary containment, arbitrary containment, and arbitrary hierarchy, each assessed through testing of ME and CE, as well as, in the case of arbitrary hierarchy, through testing of ToF. This was done with a view to gathering information regarding typical acquisition of relational framing repertoires relevant to classification; furthermore, the protocol was administered more than once to check for reliability. All the children were also tested using a number of other measures, including (a) assessment of generalized intellectual (cognitive and linguistic) skill (i.e., the SB5 and the PPVT–4) and (b) assessment of classification skill specifically (i.e., the CCT and the Piagetian CI test). The tests of generalized intellectual performance were administered chiefly to examine the correlation of relational framing of categorization with intellectual performance more generally, whereas the tests of categorization were administered to probe the validity of the claim that the relational framing repertoires examined herein are indeed relevant to classification responding.

The data showed first that there was a developmental trend in terms of the acquisition of the overall relational framing repertoire involved as well as with respect to each of the more specific relational repertoires. This is evident in the correlational data (Table 4), as shown by the high correlation between age and each of the repertoires, and in the descriptive data (Table 5) and the data showing the numbers of passes (Table 6) for each cohort for various patterns of relational framing (frames and properties). These patterns suggest that, as might have been predicted, relational framing of categorization is acquired gradually over the course of several years. The youngest age group (3–4 years) shows little or no capacity in any of the three specific relational repertoires assessed, including even nonarbitrary containment, whereas data for the oldest age group (7–8 years) indicate at least some emergence on each of the three repertoires. All members of this oldest group passed both sections of nonarbitrary containment. However, their performance on the two arbitrary sections and particularly the arbitrary hierarchy section was relatively poor, even though they performed better than the younger groups. These data suggest that children of all age groups might potentially benefit from training in relational frames of categorization and, given the correlations seen between these patterns of framing and performance on standardized measures of language and cognition, that such training might be intellectually beneficial. The data for number of passes also show that the order of sequencing of relational framing repertoires in the current protocol (i.e., nonarbitrary containment, arbitrary containment, and arbitrary hierarchy) was correct and that this is indeed the order of acquisition and would thus also be an appropriate order for assessment and training. Finally, scores on the second administration of the protocol also correlated very highly with scores for the first administration (Table 7), thus suggesting the reliability of the protocol.

Regarding the data showing correlations between relational framing and performance on other measures, a number of points in particular are important. First, as previously mentioned, there were high levels of correlation between relational framing and performance on standardized intellectual measures, including both cognitive and linguistic. These correlations were seen for overall relational framing capacity as well as for specific relational repertoires. These findings further extend a pattern seen in previous RFT studies showing strong correlations between relational framing and intellectual performance (e.g., Moran, Walsh, Stewart, McElwee, & Ming, 2015; O’Toole et al., 2009; Ruiz & Luciano, 2011). Second, although there were strong correlations between scores for the relational framing repertoires and those for the standardized tests of categorization specifically, these correlations were not as high as those between the relational framing scores and the broad measures of intellectual potential.

An original version of the current protocol assessed ToF not only for arbitrary hierarchy but also for all three framing patterns; indeed, data on ToF for all three frames were collected for the current cohort. However, the items assessing ToF in the case of the two containment relational patterns were problematic, and thus the data were omitted. Hence, the data reported herein for these two frames focus on derived ME and CE alone and do not assess ToF. Despite this, the correlations between the overall relational framing protocol and the alternative measures used were strong. This is perhaps not surprising, as (a) ToF data were collected for arbitrary hierarchy, which might be argued to be particularly important in terms of the repertoire of classification, and (b) RFT protocols that focus on derived relations alone have previously proven to be excellent predictors of intellectual performance (see, e.g., Cassidy, Roche, & Hayes, 2011). Therefore, the data provided for the containment frames would still have been useful in this respect. At the same time, from an RFT point of view, ToF is a key property of relational framing, and thus collecting ToF data for all three frames would likely boost predictive power. Doing so as an extension of the current study would be a useful future research direction.

Another possible critique of the categorization framing protocol used in the current study might be that it did not provide as thorough or comprehensive a test of arbitrary hierarchical framing as might be desirable. For example, although framing appropriately in response to the cues of type of and contains is an important part of this pattern, there are other aspects of this pattern that might also be tested. For instance, when two stimuli are framed as being part of the same class, then they should also both be framed as being different from other stimuli framed within a different class and equivalent to each other, in at least some contexts, independent of their physical properties. Such a pattern of responding might be expected from someone with a sufficiently advanced repertoire of hierarchical classification. The current protocol does not involve testing for such relations. Adjusting it so as to do so might be expected to improve its reliability and validity as a test of classification. This would be one useful direction for future work.

Another broader issue for consideration is that when using a relational skills test as a proxy for other skills (e.g., categorization) by way of identifying which skills deficits may need ameliorating in a relational skill training intervention, it is assumed that the relational skills levels are distributed more or less normally within each age cohort. In other words, if one does not know how skills are distributed across larger samples, one cannot know the significance of a relational test score that deviates from the mean. Analysis of this kind would not have been possible with the current data set, as it is too small, but this may be an important consideration for researchers interested in using relational tests either as proxies for other intellectual skills or as screening tests to assess deficits prior to intervention.

Another relevant direction for future studies might be the exploration of the capacity for establishing and/or strengthening repertoires of classification framing. If the current protocol (or a refined version of it) is seen as providing a reasonable measure of relational framing relevant to classification, as seems possible based on these data, then in future studies, it might be used to identify deficits in this repertoire as a precursor to the training of the relevant relational skills. An intervention for one or more aspects of classification framing might be expected to boost not only this repertoire itself but also—analogous to the outcomes seen in the case of other relational frame training interventions—intellectual performance more generally. Indeed, given the centrality of classification to the average person’s intellectual repertoire, perhaps such training might have a more substantial effect in this regard than training involving other frames.

One further possible direction for future research might be the examination of the relationship and/or interaction between classification framing and other framing repertoires. Such work might investigate various phenomena, including correlation between classification and various other patterns of framing, as well as the effect of training particular frames on performance in accordance with classification framing. For example, as suggested previously, sameness and difference frames may be particularly important precursors and/or accompaniments of hierarchical framing, suggesting that training fluency and/or flexibility on these frames might be particularly beneficial with respect to the acquisition and/or strengthening of hierarchical framing and categorization framing more broadly.

In summary, this is a promising initial result for the assessment of classification framing. The pattern of data acquired from the protocol used to measure this repertoire in combination with that from a suite of other measures of intellectual functioning, including ones focused on classification per se, might suggest that the former is indeed tapping into important aspects of categorization. This extends research on relational framing in general and provides an impetus for further work into classification per se from an RFT point of view. Further research might usefully extend the current study by incorporating additional testing of features of categorization as a pattern of responding as well as by facilitating the training of this repertoire in typically developing children or children with developmental delays who show relevant deficits.