1 Introduction

Human physical and behavioral realism heightens empathy for robots, which in turn enhances social interaction [32, 49]. This is advantageous in settings where it is preferred to have the observer expect humanlike performance from the robot, such as a robot portraying a standardized patient during a trainee’s assessment [17, 20]. It is also helpful to measure how the observer perceives and affectively evaluates human realism to develop design principles for increasing human acceptance of android robots and three-dimensional (3D) computer animated characters.

Accurate measurement is vital because humanlike characters are susceptible to negative affective evaluations known as the uncanny valley effect [31, 33, 40]. These evaluations have been characterized as cold, eerie feelings, associated with, but not equivalent to, fear, anxiety, and disgust, a loss of empathy, and avoidance behavior [22, 29, 35, 39]. Mori [42] illustrated the uncanny valley effect by drawing a valley of eeriness in a graph that otherwise depicts a positive relation between human likeness and affinity. A similar curve approximated ratings of a large sample of real-world robots, though with considerable variance (\(R^2_{\mathrm {adj}}=.29\)) [36]. However, there has been insufficient research on how to measure the uncanny valley effect accurately.

Various methods have been used to evaluate human–robot interactions, including spatial engagement, open-ended questions, and principal component analysis [38, 43, 44, 59, 60]. In the context of the uncanny valley, the present study focuses specifically on scale development for evaluating android robots and 3D computer animated characters.

Fig. 1 Categorical perception causes equal-sized differences in a character’s physical similarity to a human being to appear much smaller within the category robot or human than at the boundary between them

Bartneck [2] proposed the Godspeed indices, which were designed to measure anthropomorphism, animacy, likability, perceived intelligence, and perceived safety. These indices average ratings on semantic differential scales. Unfortunately, the anthropomorphism, animacy, and likability indices were highly correlated (\(.67<r<.89\), \(p<.001\)) [21] and thus may not measure distinct concepts. This is a recurring measurement issue, because humanness tends to be associated with other positive social attributes.

Although Ho and MacDorman’s [21] perceived humanness, attractiveness, and eeriness indices have high internal reliability and the correlation of eeriness with humanness, attractiveness, and warmth was not significant, their scatter plots form two widely separated clusters: (a) mobile, humanoid, and android robots and (b) 3D computer models of humans that range from the cartoon-like to the photorealistic. Each cluster is tightly grouped despite the varied appearance of the characters within it (Fig. 4). The formation of two tightly grouped but widely separated clusters indicates the presence of categorical perception effects in observing the characters’ appearance or behavior. This pattern occurs if one bipolar adjective describes one perceptual category and the other bipolar adjective describes a different perceptual category, because categorical perception causes physical differences among stimuli within each category to appear much smaller than equal-sized differences between categories (e.g., robot vs. human, Fig. 1) [12, 19, 41].

Humanness, attractiveness, and eeriness indices should be designed to span category boundaries because anthropomorphic entities whose features span them are prone to elicit the uncanny valley effect [7, 28, 37, 40, 41, 52]. These negative evaluations are likely to persist at least until a new category is formed and labeled [56]. As stimuli that span the new category and its neighbors are categorized, categorical perception then develops along those continua [6, 24].

Categorical perception, also called the perceptual magnet effect, has recently been found on the continuum from 3D computer models to photographs of real people [7–9, 23, 26, 28, 33]. Various theories have been proposed that broadly relate categorization to the uncanny valley, including theories that are based on categorical perception [48], categorization difficulty [61], cognitive dissonance [18], balance theory [54], and feature inconsistency [40, 41]. The categorical perception of humanlike characters necessitates examining how observers categorize the characters to ensure that the humanness, attractiveness, and eeriness indices adequately represent within-category variation along these three dimensions and span category boundaries.

Fig. 2 The 12 characters are five 3D computer animations, (1) Doctor Aki Ross from the film Final Fantasy: The Spirits Within (2001), (2) Billy, the baby from “Tin Toy” (1988), (3) an unnamed man from Phil Rice’s “Apology” (2008), (4) Orville Redenbacher from a popcorn commercial (2007), and (5) Mary Smith from “Heavy Rain: The Casting” (2006), five robots, (6) Roomba 570 (iRobot), (7) Kotaro (JSK, University of Tokyo), (8) Jules (Hanson Robotics), (9) Animatronic Head (David Ng), and (10) Aiko (Le Trung), and two human beings, (11) a man and (12) a woman

The present study seeks to improve these indices for measuring the uncanny valley effect in light of how observers categorize mobile, humanoid, and android robots and 3D computer animated characters. To address the effects of categorical perception and anthropomorphism, card sorting is applied to determine how untrained observers categorize robots, computer animated characters, and real human beings, thus revealing their own categories and the boundary regions between them. The bipolar adjectives of the semantic differential scales composing these indices are next evaluated to determine adjective pairs that span the categories and their boundary regions. The resulting indices are then evaluated in a representative survey. Improving measurement instruments for the uncanny valley is significant both scientifically, in more accurately describing the phenomenon and evaluating its effects, and in testing proposed design principles for overcoming the uncanny valley.

2 Method

This study applied a four-stage exploratory sequential design that sought to improve the humanness, attractiveness, and eeriness indices [21]: (1) a card sorting task to probe how each participant conceptualizes humanlike characters; (2) a laddering interview to collect new candidate adjectives to revise the semantic differential scales that comprise the indices; (3) a bipolar adjective evaluation to verify the importance of the original scales to the categories identified by the participant; and (4) a representative web survey to validate the revised indices. This study was approved by the Indiana University Institutional Review Board (EX0903-35B).

2.1 Participants

For the card sorting task, laddering interview, and adjective evaluation, 30 participants were recruited by email and flyers in a convenience sample from a Midwestern U.S. public university system: 70.0 % were male, 30.0 % female, and the median age was 26. Participants completed these stages from January to June 2013. There was no attrition. Participants received a $10 gift card.

For the web survey, 1311 participants were recruited by email in a random sample of undergraduate students from the same university system: 39.1 % were male, 60.9 % female, 81.5 % were age 18–25, 5.4 % 26–30, and 13.1 % 31 or older. The sample population was 74.1 % white, 7.3 % African American, 5.5 % Hispanic/Latino, 3.1 % Asian, 0.2 % American Indian, 0.1 % Pacific Islander, 2.9 % two or more races, 5.8 % international, and 1.1 % unknown. Additional inclusion criteria were 18 or older, native English speaker, and 20/40 vision or better with correction. Some data was missing at random owing to attrition. The measurement error range was \(\pm 2.89\) % at the 95 % confidence level. Participants completed the web survey from March to April 2014. Participants received no compensation.

2.2 Materials and Procedures

In the card sorting task, each of the 30 participants viewed a randomized sequence of video clips that corresponded to 12 characters: five 3D computer animated characters, five robots, and two human beings (Fig. 2). One of the robots (Hanson Robotics’ Elvis) and two of the 3D computer animated characters (from The Incredibles and The Polar Express) from Ho and MacDorman [21] were replaced to improve representativeness. The aim was to select robots from typical demonstration settings and 3D computer models from a variety of genres (short films, machinima, advertisements, and videogames) in addition to feature-length films. Two humans were added to extend the range of humanness. The video clips were 480 pixels by 360 pixels (a 4:3 aspect ratio) and were 15–30 s in duration.

A representative frame from each video was printed in color on a \(3\frac{1}{2}\)-by-5-inch card. Using the cards as visual aids, the participant grouped the 12 characters into self-determined categories and proposed a label for each category [51]. The participant was instructed to sort each character into only one category, thus ensuring that the categories were mutually exclusive. The experimenter used prompts, such as “Which characters would you group together, or separate from the others?” The participant then verified the categories by reviewing the video clips at least once again.

Next, in the laddering interview, the participant was asked to list the characteristics of each character. For each characteristic, the participant was asked repeatedly, “Why is that important to you?” The participant’s answer typically linked a formal characteristic like “mechanical movement” to an aesthetic judgment like “mismatched with human appearance” to an experiential characteristic like “weird.” The participant was required to provide at least three laddering responses.

Finally, in the adjective evaluation, the participant rated on a 3-point importance scale (1. slightly important, 2. moderately important, 3. very important) all bipolar adjectives comprising the humanness (12 adjectives), attractiveness (10 adjectives), and eeriness (16 adjectives) indices for each category that the participant had proposed in the card sorting task [21, 53, 58]. Each semantic differential scale consists of a bipolar adjective that is low on the scale (e.g., artificial is low on the humanness scale) and a bipolar adjective that is high on the scale (e.g., natural is high on the same scale). If the participant considered both bipolar adjectives important (e.g., artificial and natural), the scale was expected to measure the concept effectively; if the participant considered only one of the bipolar adjectives important, the adjectives might not span that category.

In the representative web survey, each participant rated the 12 characters on the semantic differential scales comprising the three indices, while the corresponding video clip played in a continuous loop. The semantic differential scales included new candidate adjectives from the laddering interview. As before, the characters were presented one at a time and in random order. Scale order was also randomized. The semantic differential scales recorded a 7-point value ranging from \(-3\) to \(+3\).
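To make the rating format concrete, the following is a minimal sketch of how one index score might be derived from a participant's ratings of a single character. It assumes that an index is the mean of its scales' ratings, as the Introduction describes for the Godspeed indices; the function name `index_score` and the example ratings are hypothetical and are not taken from the study.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = -3, 3  # 7-point semantic differential scale


def index_score(ratings):
    """Average one participant's ratings of one character on an index's scales.

    Averaging is an assumption made for illustration.
    """
    for r in ratings:
        if not SCALE_MIN <= r <= SCALE_MAX:
            raise ValueError(f"rating {r!r} is outside the 7-point range")
    return mean(ratings)


# Hypothetical ratings on five scales of one index for one character.
example_ratings = [2, 3, 1, 2, 2]
print(index_score(example_ratings))
```

Ratings outside the range from \(-3\) to \(+3\) are rejected rather than clipped, so that data entry errors surface early.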

2.3 Data Analysis Procedures

A semantic differential scale is defined as unbalanced to the extent that one bipolar adjective is important for more categories than the other. In taking the difference in matches between the low and high adjective of a scale, the magnitude represents the degree of imbalance and the sign represents the direction. For example, if a participant proposed three categories—Animation, Humanlike Robot, and Real Human—and stated that natural was important for all three categories but artificial only for Humanlike Robot, the natural–artificial scale is unbalanced (\(3-1=+2\)). Based on the bipolar adjective evaluation, if the mean imbalance of a scale was statistically significant, an alternative bipolar adjective was tested.
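The worked example above can be sketched in a few lines. This is an illustrative sketch only: the function name `scale_imbalance` is hypothetical, and the sign convention (high adjective count minus low adjective count) is inferred from the example, in which natural is the high adjective and the result is \(+2\).

```python
def scale_imbalance(low_important, high_important):
    """Return the imbalance of one scale for one participant.

    Each argument is the collection of categories for which the participant
    rated the low (or high) bipolar adjective as important. The magnitude is
    the degree of imbalance; a positive sign means the high adjective matched
    more categories than the low adjective.
    """
    return len(set(high_important)) - len(set(low_important))


# The example from the text: three proposed categories.
categories = ["Animation", "Humanlike Robot", "Real Human"]
natural_important = set(categories)        # high adjective, all three categories
artificial_important = {"Humanlike Robot"}  # low adjective, one category

print(scale_imbalance(artificial_important, natural_important))  # 2
```

A balanced scale returns 0, which is the value the mean imbalance is tested against.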

For the web survey, three criteria for bipolar adjective selection were applied: (a) high internal reliability, (b) loading on the correct factor, and (c) correlation with the ‘sanity check’ scale. Internal reliability of the indices was assessed with Cronbach’s \(\alpha \). To determine whether the semantic differential scales loaded on factors matching their named concepts, exploratory factor analysis was used, namely, principal component analysis with Promax rotation [16].

To verify that each index measured its concept, the following sanity check scales were included: artificial–natural, unattractive–attractive, and reassuring–eerie for the humanness, attractiveness, and eeriness index, respectively. Sanity check scales have face validity but do not meet other criteria. If a scale of a particular index did not load on the same dimension as its sanity check scale or if its factor loading was low (\({<}\,.40\)), the scale was removed from the index. The sanity check scales were excluded from the final set of revised indices.
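The removal rule just described can be sketched as a simple filter. This is a sketch under stated assumptions: "loading on the same dimension" is interpreted here as having the largest absolute loading on the same factor as the sanity check scale, and the scale names, factor names, and loadings below are hypothetical placeholders, not values from this study.

```python
LOADING_CUTOFF = 0.40  # exploratory factor analysis cut-off stated in the text


def primary_factor(loadings_for_scale):
    """Return the factor on which a scale has its largest absolute loading."""
    return max(loadings_for_scale, key=lambda f: abs(loadings_for_scale[f]))


def retained_scales(loadings, sanity_check):
    """Keep scales that load at or above the cut-off on the sanity check's factor.

    `loadings` maps scale name -> {factor name -> loading}.
    """
    target = primary_factor(loadings[sanity_check])
    kept = []
    for scale, by_factor in loadings.items():
        if scale == sanity_check:
            continue  # sanity check scales are excluded from the final index
        on_target = primary_factor(by_factor) == target
        if on_target and abs(by_factor[target]) >= LOADING_CUTOFF:
            kept.append(scale)
    return kept


# Hypothetical loadings, for illustration only.
example = {
    "sanity check": {"F1": 0.82, "F2": 0.10},
    "scale A": {"F1": 0.78, "F2": 0.05},  # same factor, above cut-off: kept
    "scale B": {"F1": 0.35, "F2": 0.12},  # same factor, below cut-off: removed
    "scale C": {"F1": 0.20, "F2": 0.66},  # different factor: removed
}
print(retained_scales(example, "sanity check"))  # ['scale A']
```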

A correlation analysis was used to evaluate the discriminant validity of the indices and the degree to which humanness was decorrelated from attractiveness and eeriness. Confirmatory factor analysis further verified the construct validity of the revised indices. Significance in comparing groups was assessed by a one-way analysis of variance (ANOVA). To visualize relations among the semantic differential scales of the indices, multidimensional scaling (MDS) was employed to reduce 18 dimensions to 2.

SPSS (ver. 20) was used to perform internal reliability assessment, exploratory factor analysis, and correlation analysis, LISREL (ver. 8.54) to perform confirmatory factor analysis, and MATLAB (ver. 8.5) to perform multidimensional scaling.

Cronbach’s \(\alpha \) thresholds were interpreted as acceptable \(=.7\), good \(=.8\), and excellent \({=}\,.9\). The factor loading cut-off for scale removal was .40 for exploratory factor analysis and .60 for confirmatory factor analysis. Test statistics were interpreted with a significance threshold of \(\alpha =.05\).
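For reference, Cronbach's \(\alpha \) for an index of \(k\) scales is \(\alpha = \frac{k}{k-1}\left(1-\frac{\sum_i s_i^2}{s_T^2}\right)\), where \(s_i^2\) is the variance of scale \(i\) and \(s_T^2\) is the variance of the summed score. The sketch below computes it in plain Python rather than SPSS, which the study used. The function names are hypothetical, and treating the stated thresholds as lower bounds (for example, good means at least .8) is an assumption.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Compute Cronbach's alpha.

    `rows` is a list of respondents, each a list of ratings on the k scales
    of one index (no missing values).
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two scales")
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def interpret_alpha(alpha):
    """Label alpha, reading the thresholds in the text as lower bounds."""
    if alpha >= 0.9:
        return "excellent"
    if alpha >= 0.8:
        return "good"
    if alpha >= 0.7:
        return "acceptable"
    return "below acceptable"


# Hypothetical ratings: three respondents, three perfectly consistent scales.
ratings = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(cronbach_alpha(ratings), interpret_alpha(0.84))
```

Under this reading, the values reported later in the paper (.84 as good, .74 as acceptable, .92 as excellent) receive the same labels the text gives them.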

Table 1 Adjective importance and imbalance in bipolar adjective importance by category

3 Results

3.1 Card Sorting

All 30 participants proposed to group the 12 characters in at least two categories. More than half (54 %) proposed at least four categories (\(M=4.38\)), thus exceeding the three nominal categories of robots, animations, and humans. The final categories (in decreasing order of frequency) were human (\(n=16\)), robot (15), animation (14), machine (5), android (3), man (3), woman (3), 3D character (2), advanced robot (2), advertisement (2), cartoon (2), digital creation (2), dummy (2), half human–half robot (2), humanlike robot (2), Japanese doll (2), machine part (2), prototype (2), robot machine (2), and utility robot (2). For the anthropomorphic characters, participants often preferred to use narrower categories (e.g., advanced robot) instead of broader ones (e.g., robot). Even though the participants identified various categories, only three used android specifically.

3.2 Scale Evaluation

Of the 38 bipolar adjectives evaluated with respect to the perceived categories, those comprising the semantic differential scales of the humanness index were deemed most important (\(M=2.00\), \( SD =0.25\), \(n=30\)), followed by attractiveness (\(M=1.64\), \( SD =0.40\)) and eeriness (\(M=1.60\), \( SD =0.33\)). However, when categorizing the anthropomorphic characters, the participants were more likely to choose low humanness adjectives (\(M=-0.34\), \( SD =1.24\)), low eeriness adjectives (\(M=-0.24\), \( SD =0.63\)), and high attractiveness adjectives (\(M=0.33\), \( SD =0.82\)).

Adjective importance was compared for robot-related categories versus the other categories and likewise for animation-related and human-related categories versus the other categories using a one-way ANOVA (Table 1). Fewer humanness adjectives were used for animation-related categories (\(M=1.87\), \( SE =0.07\)) than for other categories (\(M=2.03\), \( SE =0.03\), \(F(1, 61)=4.37\), \(p=.041\)) and more attractiveness adjectives were used for human-related categories (\(M=1.85\), \( SE =0.12\)) than for other categories (\(M=1.57\), \( SE =0.05\), \(F(1, 61)=6.18\), \(p=.016\)).
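The comparisons above are one-way ANOVAs with two groups. As a minimal sketch of how such an \(F\) statistic and its degrees of freedom arise, the function below implements the textbook between-groups and within-groups decomposition in plain Python. The function name and the example values are hypothetical, and the \(p\) value is omitted because it requires the \(F\) distribution.

```python
from statistics import mean


def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical importance ratings for two sets of categories.
target_categories = [1, 2, 3]
other_categories = [2, 3, 4]
print(one_way_anova(target_categories, other_categories))
```

With two groups, the degrees of freedom are \(1\) and \(N-2\), so the reported \(F(1, 61)\) would correspond to 63 observations in total.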

Imbalance in the importance of bipolar adjectives was similarly compared. More low humanness adjectives (e.g., inanimate) were used for robot-related categories (\(M=-1.12\), \( SE =0.16\)) than for other categories (\(M=0.11\), \( SE =0.20\), \(F(1, 61)=18.57\), \(p<.001\)) and more high humanness adjectives for human-related categories (\(M=1.56\), \( SE =0.11\)) than for other categories (\(M=-0.94\), \( SE =0.10\), \(F(1, 61)=172.93\), \(p<.001\)). More low eeriness adjectives were also used for human-related categories (\(M=-0.70\), \( SE =0.13\)) than for other categories (\(M=-0.09\), \( SE =0.09\), \(F(1, 61)=12.47\), \(p<.001\)) and more high attractiveness adjectives for human-related categories (\(M=0.83\), \( SE =0.20\)) than for other categories (\(M=0.18\), \( SE =0.11\), \(F(1, 61)=7.91\), \(p=.007\)).

Table 2 Factor loadings of the revised semantic differential scales

3.3 Revised Scales

Bipolar adjectives differed in their rated importance depending on the category. For each category, the difference in importance between the low and high bipolar adjective of each semantic differential scale was compared to identify imbalance in their relative importance.

The results indicate that for the robot category, the scale without definite lifespan–mortal (\(p\,{=}\,.006\)) of the humanness index was significantly unbalanced, as were the scales numbing–freaky (\(p\,{=}\,.005\)) and unemotional–hair-raising (\(p\,{=}\,.002\)) of the eeriness index, thus indicating these scales required revision. For the animation category, two scales of the humanness index were significantly unbalanced: synthetic–real (\(p\,{=}\,.007\)) and mechanical movement–biological movement (\(p\,{=}\,.014\)).

For the human category, inanimate–living (\(p=.001\)) of the humanness index was significantly unbalanced. Three scales of the eeriness index were significantly unbalanced: reassuring–eerie (\(p=.007\)), ordinary–supernatural (\(p<.001\)), and unemotional–hair-raising (\(p=.019\)). Two scales of the attractiveness index were significantly unbalanced: unattractive–attractive (\(p=.034\)) and crude–stylish (\(p=.013\)).

For the android category, two scales of the eeriness index were significantly unbalanced: numbing–freaky (\(p=.014\)) and unemotional–hair-raising (\(p=.029\)). Across the robot, human, and android categories, three scales of the eeriness index other than its sanity check were thus unbalanced: numbing–freaky, ordinary–supernatural, and unemotional–hair-raising.

Using the laddering responses as a pool of candidate adjectives, we tentatively considered dull–freaky and boring–freaky as potential replacements for numbing–freaky; ordinary–unreal and ordinary–creepy for ordinary–supernatural; unemotional–alarming for unemotional–hair-raising; and predictable–eerie for reassuring–eerie. In addition, plain–weird, conformist–bizarre, and habitual–supernatural were also considered. These new scales were then included in the web survey with the original ones to test whether they were more appropriate for untrained observers.

3.4 Validation of New Scales

The five scales of the humanness index were validated: inanimate–living, synthetic–real, mechanical movement–biological movement, human-made–humanlike, and without definite lifespan–mortal, along with the sanity check artificial–natural. Overall internal reliability was good (Cronbach’s \(\alpha =.84\)). The exploratory factor analysis showed all five scales and the sanity check loaded on one factor, which explained 58.30 % of the total variance. These results confirmed the reliability and validity of the original humanness index [21].

The four scales of the attractiveness index were validated: ugly–beautiful, crude–stylish, repulsive–agreeable, and messy–sleek, along with the sanity check unattractive–attractive. Overall internal reliability of the index was good (Cronbach’s \(\alpha =.88\)). Exploratory factor analysis showed all four scales and the sanity check loaded on a single factor that explained 65.08 % of the total variance. These results confirmed with a new sample the reliability and validity of the original attractiveness index [21].

All seven scales comprising the original eeriness index and its sanity check were validated. Factor analysis confirmed the existence of the two subfactors of the eeriness index previously found in Ho and MacDorman [21]. Uninspiring–spine-tingling, boring–shocking, predictable–thrilling, bland–uncanny, and unemotional–hair-raising loaded on the spine-tingling subfactor, which explained 39.54 % of the total variance with a Cronbach’s \(\alpha \) of .84. Reassuring–eerie, numbing–freaky, and ordinary–supernatural loaded on the eerie subfactor, which explained 23.62 % of the total variance. However, the Cronbach’s \(\alpha \) of the eerie subfactor was only .69, indicating the need to improve its reliability.

Fig. 3 Multidimensional scaling was performed on the 18 semantic differential scales using the ratings of the characters in the 12 video clips. The scales of the humanness, eerie, spine-tingling, and attractiveness indices were well separated

Seven candidate scales, dull–freaky, ordinary–unreal, ordinary–creepy, plain–weird, predictable–eerie, conformist–bizarre, and habitual–supernatural loaded on the same dimension as reassuring–eerie, numbing–freaky, and ordinary–supernatural; two scale candidates, unemotional–alarming and boring–freaky, loaded on the same dimension as boring–shocking, uninspiring–spine-tingling, predictable–thrilling, bland–uncanny, and unemotional–hair-raising.

First, the candidates ordinary–creepy (\(r=.70\)) and habitual–supernatural (\(r=.71\)) correlated highly with the dimension of the original set (reassuring–eerie, numbing–freaky, and ordinary–supernatural), indicating these scales were redundant and thus should be excluded. Second, adding the candidates unemotional–alarming and boring–freaky only slightly increased the internal reliability of the spine-tingling subfactor (Cronbach’s \(\alpha \) ranged from .84 to .86), indicating this subfactor, which included uninspiring–spine-tingling, boring–shocking, predictable–thrilling, bland–uncanny, and unemotional–hair-raising, was already saturated. Given that these five reliable scales were already available to measure the concept, we did not need to develop any additional scales. Therefore, unemotional–alarming and boring–freaky were excluded from the revised index. Third, ordinary–creepy (\(r_{\mathrm {attr}}=-.45\), \(r_{\mathrm {hum}}=-.31\)), ordinary–unreal (\(r_{\mathrm {attr}}=-.37\), \(r_{\mathrm {hum}}=-.44\)), conformist–bizarre (\(r_{\mathrm {attr}}=-.35\), \(r_{\mathrm {hum}}=-.28\)), and numbing–freaky (\(r_{\mathrm {attr}}=-.30\), \(r_{\mathrm {hum}}=-.23\)) significantly correlated with the attractiveness and humanness indices, which violated the criterion of scale decorrelation (cf. [21]). Therefore, they were excluded from the revised index. (Ordinary–supernatural was retained, despite its bias, because the alternative candidates, ordinary–unreal and ordinary–creepy, loaded on both the eerie and spine-tingling subfactors.)

Based on the three criteria for bipolar adjective selection (i.e., high internal reliability, loading on the correct factor, and correlation with the sanity check scale), four scales were developed for a revised version of the attractiveness index, nine scales for the eeriness index, and five scales for the humanness index.

Confirmatory factor analysis was employed to verify the theoretical structure of this final set of 18 semantic differential scales (shown in Table 2 with their factor loadings). Although one index (RMSEA \(=.061\)) exceeded the cutoff of .05, the remaining indices indicated the 18 semantic differential scales had high goodness-of-fit within the structure of the humanness, eerie, spine-tingling, and attractiveness indices (\(\chi ^2=\) 3783, CFI \(=.97\), NFI \(=.97\), GFI \(=.95\), AGFI \(=.93\)) [5, 10, 15]. The revised scales showed improved fit as compared with those of Ho and MacDorman [21] (RMSEA decreased from .075 to .061, GFI increased from .91 to .95, and AGFI increased from .88 to .93). Further, the statistics of goodness-of-fit indicated the eerie and spine-tingling subfactors of the eeriness index were robust enough to represent their own theoretical construct (\(r=.44\)).

Table 3 Correlation between the revised humanness, attractiveness, and eeriness indices

The correlation analysis indicated the revised indices retained their construct validity (Table 3). Eeriness had no significant correlation with either humanness or attractiveness, reflecting its discriminant validity.
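As a reminder of what the correlation analysis measures, the following sketch computes a Pearson correlation between two sets of index scores in plain Python. It is illustrative only: the study used SPSS, the function name is hypothetical, and the example scores are made up rather than taken from Table 3.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical index scores: a symmetric, non-linear relation gives r = 0.
humanness_scores = [-1, 0, 1]
eeriness_scores = [1, 0, 1]
print(pearson_r(humanness_scores, eeriness_scores))
```

The example also shows why a nonsignificant linear correlation does not rule out a curved relation between two indices.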

Multidimensional scaling was performed on the 18 semantic differential scales of the humanness, attractiveness, and eeriness indices. The scales occupied three well separated, nonoverlapping regions (Fig. 3). Furthermore, for the eeriness index, the four scales of its eerie subfactor and the five scales of its spine-tingling subfactor occupied two well separated, nonoverlapping regions. The MDS results show the humanness, attractiveness, and eeriness indices distinctly measured their concepts.

In comparing the scatter plot of stimuli from Ho and MacDorman [21] (Fig. 4) with those from this study (Fig. 5), the revised humanness and eeriness indices better capture the extent of within-category variation, thus mitigating the effects of categorical perception. The internal reliability of the eeriness index also increased from acceptable (Cronbach’s \(\alpha =.74\)) to good (.86).

Fig. 4 Although Ho and MacDorman’s [21] humanness and eeriness indices had excellent (Cronbach’s \(\alpha =.92\)) and acceptable (.74) internal reliability, respectively, and nonsignificant linear correlation (\(r=.02\), \(p=.514\), straight line), the Animation and Robot groups were tightly clustered and widely separated from each other, and the Human group was omitted. (Characters 1, 4–7, 9, and 10 were used in both [21] and this study)

Fig. 5 The revised humanness and eeriness indices had good internal reliability (Cronbach’s \(\alpha =.87\) and .86, respectively) and nonsignificant linear correlation (\(r=.04\), \(p=.285\), straight line). The Animation and Robot groups were spread out and overlapped. A cubic approximation of the relation between humanness and reverse-scaled eeriness resembles Mori’s (1970/2012) graph of the uncanny valley (\(R^2=.640\), dashed line)
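The cubic approximation mentioned in the caption of Fig. 5 can be illustrated with an ordinary least-squares polynomial fit. The paper does not state how the curve was fitted, so the sketch below is an assumption: it solves the normal equations in plain Python and reports \(R^2\). The function names and the example points are hypothetical.

```python
def fit_polynomial(xs, ys, degree=3):
    """Least-squares coefficients c[0] + c[1]*x + ... + c[degree]*x**degree.

    Needs more than `degree` distinct x values; otherwise a pivot is zero.
    """
    m = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    aug = [
        [sum(x ** (i + j) for x in xs) for j in range(m)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(m)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        known = sum(aug[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (aug[i][m] - known) / aug[i][i]
    return coeffs


def r_squared(xs, ys, coeffs):
    """Proportion of variance in ys explained by the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    predicted = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical points lying exactly on y = 1 - 2x + x^3.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 3 - 2 * x + 1 for x in xs]
coeffs = fit_polynomial(xs, ys)
print(coeffs, r_squared(xs, ys, coeffs))
```

In the setting of Fig. 5, `xs` would hold the characters' mean humanness scores and `ys` their reverse-scaled eeriness scores.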

4 Discussion

The categorization task revealed how observers apply categories in perceiving humanlike characters [34]. The categories supported inferences both about attributes of the character and about unrelated attributes [62]. Although the study’s untrained participants placed greater importance on the humanness bipolar adjectives than the attractiveness and eeriness ones, they tended to use the latter more frequently when evaluating the characters.

An evaluation of the scales comprising the humanness, attractiveness, and eeriness indices with respect to self-identified categories revealed that some pairs of bipolar adjectives were unbalanced in their importance. If one pole of a scale is unimportant for all sorted characters in a given category, that scale is unlikely to measure differences effectively within the category along the corresponding dimension.

During the card sorting task, untrained participants found it challenging to partition humanlike characters on a humanness continuum. Instead, they relied on their prior domain knowledge about human beings to anchor their judgments [3, 13]. During the laddering interview, this led them to anthropomorphize the robots based on their relatively simple behaviors (cf. [17, 45]). The participants seemed unaware of their own judgment errors because of their lack of knowledge about robots [11, 25, 46, 47]. It is not surprising then that the participants’ cognitive system, which was adapted to a human environment, would produce and fail to detect judgment errors when they were observing nonhuman, humanlike agents [1, 55].

The new scales for the revised humanness, attractiveness, and eeriness indices were derived in part from the participants’ responses. These adjectives may better reflect contemporary U.S. English usage and provide better content validity than previously used adjectives. The revised indices exhibited high internal reliability and, for both the computer animated characters and robots, the bipolarity of the semantic space [4, 14, 27, 50, 57].

Confirmatory factor analysis verified the theoretical structure of the three indices, which were found to measure their putative concepts. The two subscales of the eeriness index provided a more detailed characterization of the eeriness concept. Relative to the computer animated characters, the robots rated higher on the eerie subscale but lower on the spine-tingling subscale.

4.1 Limitations

Although eeriness was not significantly correlated with humanness or attractiveness, attractiveness was significantly correlated with humanness with a medium effect size (\(r=.36\), \(p<.001\)). This is a substantial reduction from the effect size reported by Ho and MacDorman, Table 7 [21] (\(r=.61\), \(p<.001\)). One source of correlation may be the lack of stylish mechanical-looking robots and cartoon characters among the stimuli. Nevertheless, the difficulty in decorrelating measures of attractiveness and other positive attribute dimensions from those of humanness indicates a general preference in U.S. culture for human attributes relative to nonhuman attributes and also for attractive attributes relative to unattractive attributes.

From the perspective of index development, emotional responses to robots and animation vary considerably between observers. These individual differences complicate the development of quantitative measures of the uncanny valley; thus, their effects require further investigation [7, 29].

Although neither age nor gender was a significant factor in our undergraduate population, these variables may become significant in a sample with a more widely distributed age range. Cultural differences and exposure can significantly affect attitudes toward robots [30]. Thus, the revised indices should be tested with other populations.

5 Conclusion

The revised indices developed in this study have two additional advantages over their previous versions (compare Figs. 3 and 5 of this study with Figs. 8 and 9 of [21], respectively; Fig. 9 was reproduced as Fig. 4 in this paper for ease of comparison). First, the scales of each index exhibit a broader conceptual coverage; they are well differentiated from each other while, nevertheless, remaining reliable (Fig. 3). Second, the humanlike characters no longer form two tightly clustered, but widely separated, categories; instead, they show considerable spread and differentiation along the humanness and eeriness dimensions—and in a U-shaped pattern that somewhat resembles Mori’s original uncanny valley graph (Fig. 5).

The revised indices also retained three advantages of the original indices. First, they maintained their theoretical structure and psychometric properties in large-scale testing. Second, their internal reliability remained high. Third, two subscales of the revised eeriness index, namely, eerie and spine-tingling, continued to serve as two stand-alone concepts for the measurement, as was verified by confirmatory factor analysis. Owing to the above advantages, these indices can contribute to the measurement and plotting of human perceptions of humanlike characters, thus providing valuable feedback to enhance their designs.

The revised indices reliably measure fairly independent dimensions with respect to the perceptions of anthropomorphic characters. In addition to assisting robot developers [2], the revised indices can also assist animators. Comparing different characters or comparing different feature settings and configurations for the same character using the same set of indices will help engineers and animators make design decisions.