Despite the changing landscape of the field of school psychology, tests of cognitive ability continue to play a major role in psychoeducational assessment (Flanagan et al. 2008). Results from these tests inform high-stakes decisions such as eligibility for special education services and the types of services available to students with disabilities. As noted by Reynolds et al. (2006), these tests provide objective evaluations of students’ ability, which are preferable to the subjective opinions of others whose perspectives might be influenced by irrelevant factors.

Though tests of cognitive ability ordinarily rely on verbal interactions between examiner and student to assess ability, these tests are unsuitable for students who have difficulty communicating in, or do not communicate in, Standard English. These students include those with speech and language disorders, hearing impairments, traumatic brain injury, autism spectrum disorder (ASD), and those who are English language learners (ELLs). For these students, many tests of cognitive ability would yield underestimates of ability because of their reliance on verbal interactions. Consequently, a nonverbal test of cognitive ability may yield fairer and more valid results (McCallum 2003).

Nonverbal tests of cognitive ability measure general ability and are characterized by administration procedures and content that eliminate or reduce the receptive and expressive language required of the student (McCallum 2003; Naglieri and Otero 2012). McCallum (2003) noted that current use of the term nonverbal is confusing when applied to tests of cognitive ability because some tests described as nonverbal have verbal directions. Bracken and McCallum (1998) suggested that a nonverbal assessment is a process in which neither receptive nor expressive language requirements are placed on the examinee or the examiner. The majority of tests of cognitive ability said to be nonverbal do not meet these criteria. These latter tests, McCallum (2003) suggested, are more appropriately termed language-reduced measures. These tests reduce expressive language requirements because students can respond to items by pointing, manipulating objects or materials, or both, eliminating the need for speech. However, test directions are given orally by the examiner, which requires understanding of oral language. Some students have difficulty expressing themselves verbally for various reasons (e.g., articulation problems, shy toddlers or preschoolers, students with hearing impairments) but understand English better than they can communicate orally or manually. For these students, language-reduced measures can be an appropriate option.

However, if students have verbal expression problems, some also may have language comprehension problems, which could confound results from language-reduced tests. Consequently, when a language-reduced test is selected, it is incumbent on examiners to ensure their examinees understand the oral directions or directions given via sign language. If it is unclear whether a student understands the directions after using sample test items, additional sample items can be created. Examples of additional procedures include providing sufficient time to establish rapport with shy children, waiting to give test directions until it is obvious the student is focused on the examiner, eliminating extraneous auditory and visual distractions, and using interpreters familiar with a student’s primary language and dialect. Unfortunately, because of the limited number of nonverbal tests available and concerns about the technical adequacy of some measures at certain ages, a language-reduced measure may be the only or the best option (e.g., for preschoolers with language expression problems).

To differentiate between nonverbal tests and tests requiring listening comprehension but not oral expression, in this manuscript we use the terms nonverbal and language-reduced as McCallum suggested. In response to the need for nonverbal intellectual assessment, numerous nonverbal tests of cognitive ability and tests of cognitive ability with nonverbal components have been published. Although having numerous options is usually desirable, comparing examiner manuals across the many variables relevant to test selection for these students may be unnecessarily burdensome for practicing school psychologists. While the information that follows is available in the examiner’s manual for each test, the purpose of this manuscript is to present it in a consolidated document to help school psychologists select appropriate nonverbal or language-reduced tests of cognitive ability, as well as understand the tests’ strengths and limitations. Tests were evaluated in terms of standardization samples, psychometric properties, types of directions used, and responses required of students, as well as other test characteristics relevant to meeting student needs that may not be addressed in examiner manuals, including adequacy of floors and item gradients and the percentage of timed items.

Because no single measure should be the basis for conclusions regarding a student’s cognitive ability, and because results from a particular measure may be questionable (e.g., because of student fatigue or a limited number of test items), beyond evaluating the seven nonverbal or language-reduced tests, we also provide tables describing eight additional measures with language-reduced components. Although these latter tests were not explicitly constructed as nonverbal tests of cognitive ability, use of their language-reduced components can provide supplementary data to increase the sample of student performance obtained.

Method

Nonverbal and language-reduced tests of cognitive ability for students within the age range of birth through 18 years were reviewed. Overall results are described because they provide the best sample of performance. The following criteria were used to evaluate the tests.

The date when normative data were collected is important. Flynn (1984, 1998) demonstrated that if test norms are not updated periodically, examinees receive inflated test results compared with prior generations. This effect is a particular concern for examinees in the lower ranges of intelligence (Kanaya et al. 2003; Zhou et al. 2010). Salvia et al. (2010) suggested that ability tests more than 15 years old are too old to be representative. Consequently, only tests with normative data collected within the past 15 years were reviewed. Several examiner manuals did not indicate when normative data were collected; however, their copyright dates suggest the data were collected within the past 15 years.
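
To put the Flynn effect in concrete terms, the following worked calculation is an illustrative sketch only, assuming the rate of roughly 0.3 IQ points per year that Flynn (1984) reported for U.S. samples and applying it across Salvia et al.’s (2010) 15-year limit:

% Approximate score inflation from outdated norms (Flynn effect),
% assuming a rate of about 0.3 IQ points per year:
\[
0.3 \;\tfrac{\text{points}}{\text{year}} \times 15 \;\text{years} \approx 4.5 \;\text{points}
\]
% For an examinee near a diagnostic cutoff (e.g., a standard score
% of 70), an inflation of this size can alter an eligibility decision.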

To be representative, demographic data for the standardization sample should be similar to U.S. Census data in terms of geographic distribution, race/ethnicity, gender, urban/rural residence, socioeconomic status (SES; defined as parents’ education or occupation or both), percentage of students with impairments, and the number of participants per age level. If students with a cognitive impairment are underrepresented, norms may be inflated. At least 100 participants per age or grade level should be included to guarantee stability, represent infrequent characteristics, and enable the calculation of a full range of derived scores (Salvia et al. 2010). Each measure was evaluated using these criteria.

Psychologists tend to agree (e.g., Sattler 2008; Salvia et al. 2010) that when making important decisions regarding students, the minimum acceptable reliability coefficient for overall results is .90. We used this criterion to evaluate tests’ internal consistency, test–retest, and alternate-form reliability.
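
The practical meaning of the .90 criterion can be seen in the standard error of measurement. The following is a sketch on an assumed IQ metric (M = 100, SD = 15):

% Standard error of measurement: SEM = SD * sqrt(1 - r_xx)
\[
\mathrm{SEM} = SD\sqrt{1 - r_{xx}}
\]
% With SD = 15:
%   r_xx = .90  gives  SEM = 15 * sqrt(.10), about 4.7 points
%   r_xx = .80  gives  SEM = 15 * sqrt(.20), about 6.7 points
% A 95 % confidence interval (+/- 1.96 SEM) spans roughly +/- 9 points
% at r_xx = .90 but widens to roughly +/- 13 points at r_xx = .80.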

Measures also were evaluated for construct, content, and concurrent validity. The methods test authors used to support construct validity are noted in the tables. In the text, independently conducted investigations of the structural validity of measures, where available, are described briefly. There is evidence to suggest that contemporary measures of cognitive ability are overfactored (i.e., they purportedly measure more factors than they actually do) when subjected to rigorous investigations of structural validity (Frazier and Youngstrom 2007), though there are relatively few such investigations of nonverbal measures of cognitive ability, perhaps because most reviewed in the current paper purport to measure a unitary factor. For concurrent validity, the number of other measures of cognitive ability used for comparison is noted, but examiners are encouraged to consult examiner manuals to determine whether the comparison tests are valid.

Inadequate test floors may overestimate performance at some ages. Adequate floors are those where a raw score of 1 results in a standard score two or more standard deviations below the mean (Bracken 1987, 1988). Tests with steep item gradients may over- or underestimate students’ performance and fail to discriminate well between those with deficits and those without. Bracken (1987, 1988) suggested an item gradient is too steep when a change of 1 raw score point results in a change greater than 1/3 of a standard deviation in standard or scaled score points. These criteria were used to evaluate the adequacy of floors and item gradients; considering such information decreases the probability of over- or underestimating student performance.
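
On the familiar IQ metric (M = 100, SD = 15, assumed here for illustration), Bracken’s two criteria reduce to simple arithmetic:

% Adequate floor: a raw score of 1 must yield a standard score at
% least 2 SDs below the mean:
\[
100 - 2(15) = 70
\]
% i.e., a raw score of 1 should correspond to a standard score of 70 or below.
% Acceptable item gradient: 1 raw-score point may shift the standard
% score by no more than 1/3 SD:
\[
\tfrac{1}{3}(15) = 5 \;\text{standard-score points}
\]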

In addition, the age range of the test, mode for directions, response mode, percentage of timed items, and subgroups for whom data were presented are described. Some measures said to be nonverbal use oral directions, but some students who require a nonverbal measure also need nonverbal directions provided through gestures or sign language. The type of student response required is critical. Some measures require pointing, others manipulation of materials, and several include imitation and paper-and-pencil tasks. For students with motor impairments and communication concerns, some response modes can be problematic. The percentage of timed items is important for some students, e.g., those who are culturally diverse with different conceptions of time, are easily distracted, are reluctant to participate, respond impulsively, or have motor impairments. Simeonsson et al. (2001) suggested that tests with timed items be avoided for students who are deaf because, if timed, these students may respond as quickly as they can, disregarding accuracy and negatively affecting their performance.

Results

Evaluation summaries are presented for seven nonverbal and language-reduced measures. Following the summaries are tables with specific information for each variable. Table 1 presents descriptions of the tests’ standardization samples. Table 2 describes their reliability, and Table 3 presents validity information. Table 4 addresses variables other than technical adequacy that may influence test selection because of the characteristics of particular students. Tables 5, 6, 7, and 8 describe the corresponding information for eight tests with language-reduced components.

Table 1 Standardization sample for tests where all items require nonverbal responses
Table 2 Reliability for tests where all items require nonverbal responses
Table 3 Validity for tests where all items require nonverbal responses
Table 4 Descriptions of tests where all items require nonverbal responses
Table 5 Standardization sample for tests where components require nonverbal responses
Table 6 Reliability for tests where components require nonverbal responses
Table 7 Validity for tests where components require nonverbal responses
Table 8 Description of tests where components require nonverbal responses

The Bayley Scales of Infant and Toddler Development—Third Edition (BSID-III; Bayley 2006) is for children from birth through 42 months. The test is well standardized, with a large sample for each 1-year age level. The sample is similar to census data except for a lack of data on urban/rural residence. Extensive information is provided in support of the test’s validity. Adequate floors begin at 16 days, and there are no item gradient problems. Data are provided describing how nine subgroups of children performed on the test. The toys and tasks are engaging for young children. Tasks are presented by showing materials or through oral directions. For this language-reduced measure, 86 of the 91 items do not require speech. Children respond by orienting, habituating, manipulating toys, or pointing.

Concerns include low test–retest correlations at all ages and the fact that these data are provided for age groups rather than for each age level. Twenty-three percent of items are timed, which could be problematic for some young children who are, for example, inattentive. No data are presented for children with hearing impairments. Because directions for administration and scoring are complicated, the test requires considerable practice to ensure valid results. Considering the stability of results over time and the lack of data on long-term predictive validity, caution is warranted in interpreting results. As with all tests for infants and preschoolers, repeated assessment over time provides a better description of ability than results from a single assessment.

The Comprehensive Test of Nonverbal Intelligence—Second Edition (CTONI-2; Hammill et al. 2009) was developed for ages 6 through 89. Strengths include a large representative norm sample similar to census data on all variables but urban/rural residence, for which no data are presented. Internal consistency is adequate, as is test–retest reliability for ages 8 through 16. Considerable validity evidence is provided, and floors and item gradients are adequate. The test is easy to use and has no timed items. Examinees point to indicate responses. Oral or signed instructions are recommended, but pantomimed instructions are optional if necessary. Thus, the CTONI-2 is intended primarily as a language-reduced test, with oral or signed instructions used “whenever possible.” However, detailed, easy-to-use pantomimed instructions, including pictures, appear in the appendices for use with students who cannot follow oral or signed instructions, making the test usable as a nonverbal measure. Data are presented for seven subgroups of students, including those with hearing impairments. One percent of the norm sample comprised students with a hearing impairment, and data are presented on internal consistency, concurrent validity, and discriminative validity for these students, which is more information than other tests provide.

The limited data on use of the test as a nonverbal measure should be considered when interpreting results. Additional concerns include the lack of test–retest data for ages 6 and 7; results for these ages should be interpreted cautiously because of the lack of information on their stability over time. An independent investigation of the structural validity of the CTONI-2 suggested its results should be interpreted only at the level of the overall score, as the Pictorial and Geometric dimensions were not supported by exploratory factor analysis (McGill 2016). Why students with hearing impairments score nearly a standard deviation lower than their hearing peers is unclear and should be considered when reporting results for these students.

The Leiter International Performance Scale—Third Edition (Leiter-3; Roid et al. 2013) is for examinees ages 3 years through 79 and older. One strength of the test is that its norm sample appears representative on all variables. Reliability data for internal consistency are adequate. Substantial validity evidence is provided, and floors and item gradients are adequate. Another strength of the Leiter-3 is that it is one of the very few nonverbal tests available. Instructions are pantomimed, and students respond to stimulus pictures by placing blocks into a tray. Tasks and materials are engaging, and none of the items are timed. Data are presented for 13 subgroups of students; those with hearing impairments had mean scores similar to those of the norm group.

Concerns for the Leiter-3 include the small number of participants for each age level. Data are presented in 2- or 3-year intervals with the number per interval ranging from 94 to 187. Thus, there were fewer than 100 children at many age levels. Composite test–retest correlations are high, but the retest interval averaged only 7 days and data were collapsed across 3–5 age groups. Considerable familiarity with the instructions is required to administer the test without difficulty. The manual states that examiners should be familiar enough with the test to administer it without using the manual. Although results for students with hearing impairments are similar to those of their hearing peers, and the nonverbal format could be beneficial for these students, additional technical adequacy data (e.g., stability reliability) for these students would be beneficial.

The Primary Test of Nonverbal Intelligence (PTONI; Ehrler and McGhee 2008) was developed for ages 3 through 9. A strength of the test is its adequate sample size, which was representative on all demographic variables except urban/rural residence, for which no data are included. Internal consistency is adequate. Test–retest reliability is strong for all ages, although no data are presented for 7-year-olds. Substantial validity evidence is provided. Floors and item gradients are adequate. For this language-reduced test, directions are delivered orally and students point to indicate their responses. The PTONI has no timed items. The test is quick to administer, requiring only 5 to 15 min. Data are provided on 11 subgroups of children.

Although quick to administer, the PTONI is not as comprehensive as some other tests (i.e., it provides a limited sample of skills). Another issue to consider in test selection is that the instruction used for many items is “Find the one that does not belong,” an abstract direction that a number of young or low-functioning children may not understand. Test–retest correlations were high, but data are provided for small age groups rather than for each age level, and 7-year-olds were not included. Why children with hearing impairments score nearly a standard deviation lower than their hearing peers on this test is unclear.

The Test of Nonverbal Intelligence—Fourth Edition (TONI-4; Brown et al. 2010) is for ages 6 through 89. The test has an adequately sized norm sample for school-age students. This is the only test reviewed that has two forms. The norm sample is representative, except that urban/rural residence is not addressed. Internal consistency correlations are adequate, and the test–retest correlation is adequate for Form B. Substantial information is presented regarding validity. Adequate floors begin at 7–0 for Form A and at 6–6 for Form B; there are no problems with item gradients. Instructions can be oral or pantomimed; examinees respond by pointing. One study suggested the oral and pantomimed instructions yield similar results, and for 23 % of the sample, norms were collected using pantomimed instructions. Thus, depending on whether the test is given using oral or pantomimed directions, it is either a language-reduced or a nonverbal measure. The test has no timed items, is easy to administer, and requires only about 15 min. No information is presented on the performance of students with hearing impairments on the TONI-4, although on a prior version they scored on average 2 points lower than hearing students. Data are presented for 10 other subgroups of students.

One concern is that the test–retest correlation was low for Form A. Another is that all alternate-form reliability correlations were less than .90, suggesting the two forms yield somewhat different results. Although the test is quick to administer, it is less comprehensive than most other measures and may be best used for screening or as a supplement to other measures.

The Universal Nonverbal Intelligence Test—Second Edition (UNIT-2; Bracken and McCallum 2016) was developed for ages 5 through 21. The norm sample is representative on a number of important variables; however, urban/rural residence is not addressed. Both internal consistency and test–retest reliability are strong. Floors and item gradients are adequate. On the UNIT-2, instructions are delivered via standardized gestures, and students respond by manipulating materials and pointing.

The UNIT-2 manual does not report the number of students per age level. Considering the total number of students in the norm sample and the number of age levels covered by the test, at least some age levels must have fewer than 100 students. Other concerns include the fact that test–retest data are reported by age group. Though substantial validity evidence is provided for the UNIT-2, confirmatory factor analysis results did not consistently support its three-factor model (i.e., Reasoning, Memory, and Quantitative) at all age levels. In fact, the three-factor model had unacceptable fit indexes for the 5–7-, 11–13-, and 18–21-year age groups. One subtest contains timed items, amounting to 8 % of items overall. Students with hearing impairments scored about 10 points lower than their hearing peers.

The Wechsler Nonverbal Scale of Ability (WNV; Wechsler and Naglieri 2006) was developed for ages 4 through 21. The norm sample included 100 students per age level through age 12 and 75 per age level for ages 17 and up. The sample is similar to census data, except that students with impairments are underrepresented and no data on urban/rural residence are included. The WNV has considerable evidence in support of validity, and the test has adequate floors. This nonverbal test is unique in employing picture sequences to convey directions; supplemental oral prompts may be used if needed. Students respond by manipulating materials, drawing, and pointing. The test offers either a two- or a four-subtest option. Data are provided for seven subgroups of students; no significant difference was found between students with hearing impairments and their hearing peers.

One concern about the WNV is that test–retest reliability is low at all ages. Item gradient problems exist on the Matrices and Recognition subtests. Also, depending on the age of the examinee, either 14 or 18 % of items are timed, in addition to the Coding subtest.

Discussion

Recommending particular nonverbal or language-reduced tests is difficult because important factors in test selection depend upon the characteristics of the student to be tested, e.g., a culturally diverse student with a different conception of time or a student who needs nonverbal directions. On occasion, the need to address such student characteristics may outweigh the need for a test that meets the criteria for technical adequacy. For example, a test with reliability coefficients of at least .90 might be less critical than nonverbal directions for some students. Thus, the various aspects of technical adequacy as well as important student variables require consideration if the most appropriate measures are to be selected for each student. We hope the summary evaluations and corresponding tables will help school psychologists navigate this process efficiently.

This review elucidates several issues related to current nonverbal assessment options. One is the need for additional nonverbal tests, where the entire test has nonverbal directions and requires nonverbal responses. Currently, the only options where this is the case are the Leiter-3, TONI-4, UNIT-2, and WNV. For the CTONI-2, oral instructions or signed instructions are recommended, but optional pantomimed instructions can be used if needed. Like most tests, each measure has limitations for use with some students, e.g., the TONI-4 arguably provides only a limited sample of skills and the Leiter-3 has too few students at various age levels in the standardization sample.

It would be welcome if test authors and publishers included sufficient information in examiner manuals to enable school psychologists to make informed decisions regarding whether a test is appropriate for a particular student. For example, only the Leiter-3 provides data on its sample’s urban/rural residence; many manuals no longer address this variable. Yet Roid and Sampers (2004) found significant differences in children’s performance based on community size.

To enable school psychologists to determine whether test results for a particular age level are likely to remain relatively stable until a student’s planning meeting is held, it would be beneficial if test–retest reliability correlations were reported for each age level or every other age level, with retest intervals of at least 2 weeks. Although a number of examiner manuals report test–retest correlations of at least .90, data are typically averaged across several age groups rather than reported by age level. Further, some tests have retest intervals of less than 2 weeks, which is of limited use in practice.

If mean differences between hearing students and students with a hearing impairment were routinely reported in examiner manuals, this information would assist in test selection and interpretation of results for these students. These data are provided for only about half of the tests. Differences in mean standard scores for these two student groups range from 0 on the WNV to 14 on the CTONI-2. The differences in performance could be a function of how students with a hearing loss are taught, difficulty they have in understanding test directions, actual differences in performance, differences on certain cognitive tasks, or some combination of these and other factors. Research examining why these differences occur is warranted to enhance our understanding of the cognitive development of students with hearing impairments and improve cognitive assessment for these students.

Whereas nonverbal and language-reduced tests may lead to fairer and more valid estimates of cognitive ability for students with communication difficulties or those who do not speak Standard English, we caution school psychologists against their indiscriminate use for these students. It would be a flagrant oversimplification to suggest that school psychologists who use nonverbal tests of cognitive ability for these students are thereby meeting their ethical and legal obligation to conduct nondiscriminatory assessments. Our results suggest that data describing how ELLs perform compared with a test’s standardization sample vary considerably across tests in terms of how subgroups are described, their age ranges, and the size of the samples. Additional data for ELLs would be of considerable assistance. Ortiz and Ochoa (2005) suggested that nonverbal tests are generally preferable for ELLs because of the reduced language demands but noted that the tests themselves do not fully address issues regarding potential linguistic bias or bias due to acculturation. They added that the performance of these students is also affected by how well the student and psychologist interact nonverbally. Consequently, for some students nonverbal tests may be necessary but not sufficient to obtain accurate results. More recent thinking suggests nonverbal tests of cognitive ability should be considered when the student has no or limited oral language and measures are not available in the student’s dominant language (Carvalho et al. 2014). Outside of tests translated into Spanish, there are few non-English options for assessing students’ cognitive ability. Carvalho et al. (2014) further point out that nonverbal tests of cognitive ability do not include bilingual students in standardization samples, are potentially confounded by the communication that does occur between the examiner and examinee, and do not measure the full range of abilities posited by current theories of intelligence (e.g., comprehension-knowledge). To meet one’s ethical and legal obligation in conducting nondiscriminatory assessments of ELLs, school psychologists are directed to recent resources outlining best practices (Carvalho et al. 2014; Ortiz 2014).

Finally, suggesting which nonverbal or language-reduced tests would be most appropriate for students with communication difficulties or students who do not speak Standard English is not possible because recommendations depend upon each student’s individual needs. However, the following recommendations can help ensure an adequate sampling of a student’s cognitive skills as well as accurate interpretation of results.

  • Because each of the available measures provides only a limited sample of cognitive skills, when possible, use more than one nonverbal measure.

  • Nonverbal measures are necessarily more limited than verbal measures in terms of the cognitive skills assessed; as noted by DeThorne and Schaefer (2004), it may be best to consider overall results a global index of fluid reasoning and/or visual processing.

  • Supplement nonverbal measures with data from any of the eight cognitive tests with language-reduced components appropriate for the student.

  • When more than one form of a test is available, consider using both forms.

  • Interpret cognitive results in light of data from other types of tests or observational procedures for areas such as adaptive behavior, social skills, and academic performance.

  • Interpret cognitive results noting concerns mentioned in this review about the particular tests administered, e.g., that on a given test students with hearing impairments tend to score on average 10 points lower than their hearing peers.

  • Interpret cognitive results considering prior assessment results. For initial assessments, mention that typically repeated assessments over time provide a better sample of a student’s performance than a single assessment.