1 Introduction

Consumers live in an environment where they receive signals simultaneously through multiple sensory channels, from visual cues to non-visual ones. Rather than operating as separate systems, these sensory channels are closely intertwined through a process known as multisensory integration (Owens and Efros, 2018). Previous psychology research has shown that there are many interactions and interdependencies between the different senses, such as associations between color and sound (Klapetek et al., 2012) and associations between sound and shape (Spence, 2012). These links between senses are referred to as cross-modal correspondence.

Cross-modal correspondence has garnered increasing attention in marketing research, mainly due to its effects on consumer judgment and decision-making (Krishna and Schwarz, 2014). One well-established form of cross-modal correspondence is audiovisual correspondence, wherein consumers tend to match high-frequency sounds with light-colored products (Spence, 2011). Cross-modal correspondence between sound frequency and color lightness can be leveraged for numerous retail applications. For example, playing high-frequency background music could help a clothing retailer increase the sales of its overstocked white shirts.

An overview of the literature in the domain of audiovisual cross-modal correspondence reveals that previous studies have focused predominantly on the effect of cross-modal correspondence on consumers’ initial reactions, such as attention and information recall (Iordanescu et al., 2010; Klapetek et al., 2012; Marks, 1987; Tavassoli and Lee, 2003); however, few studies have examined its differentiated effect across different stages of the consumers’ decision-making process. Bettman and Park’s (1980) two-stage cognitive model has had considerable significance in the marketing modeling and consumer literature (Roberts and James, 1991). This model splits consumers’ choice process into an “elimination stage” (i.e., choosing acceptable alternatives) and a “choice stage” (i.e., making a final choice). In general, consumers tend to unconsciously eliminate alternatives at the elimination stage and then use conscious processing to make purchase decisions at the choice stage (Ge et al., 2012; Song et al., 2018).

From this perspective, it is essential not only to understand whether the cross-modal correspondence between sound frequency and color lightness varies across the two stages of the shopping process but also to explore how cross-modal correspondence affects each stage of consumers’ shopping behavior. To answer these questions, we use two laboratory experiments (including an eye-tracking experiment) and one field experiment to examine the relationships between cross-modal correspondence effects and consumers’ shopping behavior. The results of our three studies support improvements to shopping environment design and help shape marketing strategies for stores.

2 Literature review

2.1 A two-stage cognitive model of shopping process

Bettman and Park’s (1980) two-stage theory of choice has been an important theory of the decision-making process. Ge et al. (2012) summarize that the first stage (i.e., the elimination stage) of the decision-making process involves eliminating alternatives that do not warrant serious consideration, and the second stage (i.e., the choice stage) involves identifying the best alternative among those that remain under consideration.

Not surprisingly, there are many meaningful distinctions between the elimination and choice stages. For example, there could be systematic differences in the way information is processed at different stages (Payne, 1976). Decision-makers tend to unconsciously remove alternatives from further consideration and conduct conscious assessments of the remaining alternatives when making their final choice (Bettman et al., 1998; Payne et al., 1988; Song et al., 2018). Notably, van Zee et al. (1992) find that the information used to screen the options does not have much impact on the evaluations, and vice versa.

2.2 Cross-modal correspondence effects at the elimination stage

Consumers are surrounded by multiple sensory inputs, such as visual and auditory stimuli, at all times, from text and images on electronic screens to physical product displays in retail channels. Marketing scholars have paid increasing attention to the importance of these sensory impacts on consumer behavior.

In recent years, a considerable amount of psychology research has examined perceptual matching between stimulus attributes in different sensory modes, i.e., cross-modal correspondence (e.g., Krishna, 2012; Shen and Sengupta, 2014; Spence, 2011, 2012). Specifically, some studies have shown that people perceive a match between high-pitched (i.e., high-frequency) sounds and light-colored objects (e.g., Klapetek et al., 2012; Spence, 2011). Additionally, humans have latent preferences for certain cross-sensory combinations. The mapping of pitch to luminance is not uniquely human but rather constitutes a basic feature of the perceptual system (Ludwig et al., 2011). This sensory correspondence is purely abstract: it does not depend on any semantic consistency or on any suggestive or specific location of the auditory stimulus.

Hagtvedt and Brasel (2016) find evidence through eye tracking that the effect on attention arises from the cross-modal correspondence between sound frequency and color lightness. Their results confirm that objects with a light color immediately draw increased attention in the presence of a high-frequency sound, whereas objects with a dark color attract more attention in the presence of low-frequency sounds. Thus, compared to those in the low-frequency sound condition, consumers in the high-frequency sound condition are more likely to fixate on lighter objects.

2.3 Cross-modal correspondence effects at the choice stage

To date, most research has focused on the audiovisual cross-modal correspondence effect at the elimination stage of the consumers’ shopping process, wherein consumers are prone to process information unconsciously (see Table 1). Its relative effects at the choice stage where consumers make purchase decisions consciously remain unclear.

Table 1 An overview of the audiovisual cross-modal correspondence effect

Screening at the elimination stage is accomplished by using a non-compensatory decision strategy that ignores some relevant problem information and reduces information-processing demands (Payne et al., 1988). Consumers at this stage avoid trade-offs among attributes of alternatives and unconsciously and intuitively make screening decisions (Beach and Terence, 1987). Specifically, consumers do not engage in conscious efforts on the screening task itself. Instead, they are willing to be exposed to relevant/irrelevant information and are more likely to rely on automatic attention effects to process such relevant/irrelevant information unconsciously.

In contrast, consumers at the choice stage are more particular about their goals and use more concrete terms to construe products than at the elimination stage (Lee and Ariely, 2006). Consumers at this stage tend to use a more effortful compensatory strategy (Gilbride and Allenby, 2006). A compensatory strategy during the choice stage is to determine whether a good value on one attribute of an alternative can compensate for a poor value on another attribute (Bettman et al., 1998). Specifically, consumers utilize conscious thought (Dijksterhuis, 2004; Dijksterhuis and Nordgren, 2006) and consciously evaluate the attributes of the remaining alternatives rather than only the auditory and visual stimuli to make a final purchase choice.

Note that sensory processes are the primary way consumers engage with the world, and sensory information represents the vital foundation for consumers’ behavior and cognition (Krishna, 2012). Klapetek et al. (2012) proposed that people have a default response under cross-modal congruency. When people encounter auditory and visual stimuli, they will attend to stimuli that are synesthetically congruent before attending to incongruent stimuli, consistent with the increased target detection rates. As a default response, this cross-modal correspondence is more likely to occur on a basic and automatic level (Hagtvedt and Brasel, 2016). Thus, this kind of automatic attention effect seems especially prone to occur in the elimination phase of the two-stage cognitive model, wherein consumers are unaware of the specifics of the screening process, allow themselves to be exposed to relevant/irrelevant stimuli, and rely on automatic attention effects (e.g., cross-modal correspondence effect) to rapidly and simply process such relevant/irrelevant information to evaluate and screen alternatives. Formally, we hypothesize the following:

  • H1: The cross-modal correspondence effect between sound frequency and color lightness varies across different stages of the shopping process. Specifically, in the high (vs. low)-frequency sound condition, consumers are more likely to fixate longer (or click) on light (vs. dark) products, and such cross-modal correspondence effect is stronger at the elimination stage than at the choice stage.

2.4 Moderating effects of involvement

The cross-disciplinary concept of involvement, rooted in social psychology (Sherif and Cantril, 1947), has long been a significant topic for researchers in marketing (Andrews et al., 1990). Involvement is considered to influence the complexity or extent of consumers’ decision-making processes (Steinhart et al., 2013).

Consumer involvement is the perception of personal relevance related to product categories or shopping tasks and is regarded as a perceived cognitive state during the focused attention process (Chung et al., 2018). Low-involvement consumers are more likely to be persuaded by affective or peripheral information and engage in superficial processing (Petty et al., 1983), leading to less deliberate, more immediate, and nearly automatic purchase decisions (Shiv and Fedorikhin, 1999).

Involvement motivates consumers to be more concerned about making the right decision and to process all relevant information in greater detail (Puccinelli et al., 2009). When the level of consumer involvement increases, personal relevance will increase, and consumers will be more willing to apply cognitive resources to processing information (Petty et al., 1983). That is, highly involved consumers tend to engage in more detailed and conscious thought. As such, highly involved consumers are more likely to use a more effortful compensatory strategy to process product information and make trade-offs among the attributes of their alternatives. Therefore, involvement will make the processing underlying the cross-modal correspondence in H1 more conscious, attenuating the automatic cross-modal correspondence effect among highly involved consumers. Formally, we hypothesize the following:

  • H2: Involvement will moderate the cross-modal correspondence effect between sound frequency and color lightness in both the elimination stage and the choice stage. Specifically, the effects of cross-modal correspondence on consumers’ click intention (at the elimination stage) and purchase intention (at the choice stage) will be attenuated among highly involved consumers.

3 Study 1

The purpose of study 1 was to examine how sound frequency affects consumers’ visual attention to color lightness during the two stages of the shopping process. This study was conducted via an eye-tracking laboratory experiment. Eye-tracking technology enables us to directly monitor viewers’ visual attention to specific objects and provides more accurate measurements of visual attention than self-reports do. It also provides a particularly accurate simulation of consumers’ shopping process.

3.1 Method

We recruited 62 students with online shopping experience from a university in Shanghai for study 1. Participants were randomly assigned to one of two conditions (high-frequency sound condition vs. low-frequency sound condition). Following Hagtvedt and Brasel’s (2016) design, we designed high-frequency tones (approximately 1800 Hz) and low-frequency tones (approximately 120 Hz) at a predetermined volume for each experiment. Participants were first asked to indicate their initial preferred lightness from five different levels of red lightness and then were seated at the eye-tracker computer and instructed to keep their eyes on the screen. Afterward, participants were invited to an online store and shown two red hats at the same time in randomized order for 10 s; one hat was light (100% value), and the other hat was dark (60% value).

Participants were then led to the next page, which comprised a detailed textual description of the hats, and were asked to consider whether to make a purchase. Both red hats had precisely the same description. This description page was displayed for 30 s. Last, participants were asked to report the sound frequency they perceived and to provide background information (see Appendix 1 for details of the design of laboratory experiment 1).
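The two tone conditions can be illustrated with a short synthesis sketch. It is not the stimulus-generation procedure used in the study: the text states only the approximate frequencies (1800 Hz and 120 Hz) and a predetermined volume, so the pure sine waveform, one-second duration, 44.1 kHz sample rate, and amplitude below are all assumptions, and `sine_tone` and `estimate_frequency` are hypothetical helper names.

```python
import math

SAMPLE_RATE = 44100  # samples per second (assumed, not stated in the text)


def sine_tone(freq_hz, duration_s=1.0, amplitude=0.5):
    """Return float samples of a pure sine tone (waveform shape is an assumption)."""
    n_samples = int(duration_s * SAMPLE_RATE)
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * i / SAMPLE_RATE)
        for i in range(n_samples)
    ]


def estimate_frequency(samples):
    """Estimate frequency from sign changes: a sine crosses zero twice per cycle."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings * SAMPLE_RATE / (2.0 * len(samples))


# Approximate frequencies reported for the two conditions.
high_tone = sine_tone(1800.0)
low_tone = sine_tone(120.0)

print(f"high-frequency condition: ~{estimate_frequency(high_tone):.0f} Hz")
print(f"low-frequency condition:  ~{estimate_frequency(low_tone):.0f} Hz")
```

The zero-crossing estimate is included only as a sanity check that each generated tone sits near its nominal frequency; equalizing perceived loudness across the two tones would need a separate calibration step that this sketch does not attempt.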

3.2 Results

Manipulation check

The results showed that participants who were exposed to the high-frequency tone (Mhigh = 4.06) perceived a higher frequency than those who were exposed to the low-frequency tone (Mlow = 2.71, t = 4.80, p < 0.001).

Hypothesis testing

First, we conducted a repeated-measures ANCOVA in which sound frequency (high vs. low) was chosen as the between-subjects variable, and color lightness (light vs. dark) was chosen as the within-subject variable. Participants’ initial preferred color and background information were included as covariates. The results revealed that the interaction between sound frequency and color lightness had a significant effect on participants’ visual attention at the elimination stage (F(1, 55) = 150.13, p < 0.001, partial η2 = 0.73; see Figure 1). Specifically, at the elimination stage, the light-colored hat commanded more visual attention than the dark-colored hat in the high-frequency sound condition (Mlight = 2.20 s vs. Mdark = 0.73 s, t = 10.26, p < 0.001), while the dark hat commanded more visual attention than the light hat in the low-frequency sound condition (Mlight = 1.06 s vs. Mdark = 2.33 s, t = 7.65, p < 0.001).

Fig. 1

Influence of sound frequency on visual attention towards hats with different color lightness (study 1). a Elimination (click) stage. b Choice (purchase) stage

Second, the repeated-measures ANCOVA results showed that the sound frequency × color lightness interaction also had a significant effect on participants’ visual attention at the second stage (F(1, 55) = 16.03, p < 0.001, partial η2 = 0.23; see Figure 1). Further analysis showed that at the second stage, the light hat commanded more visual attention than the dark hat in the high-frequency sound condition (Mlight = 4.58 s vs. Mdark = 3.75 s, t = 3.46, p < 0.01), while the dark hat also commanded more visual attention than the light hat in the low-frequency sound condition (Mlight = 3.73 s vs. Mdark = 4.55 s, t = 2.44, p < 0.05). Moreover, the sound frequency × color lightness interaction had a stronger effect on participants’ visual attention at the elimination stage than at the choice stage (F(1, 115) = 5.53, p < 0.05, partial η2 = 0.05). Thus, H1 was supported.
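To make the stage comparison concrete, the sketch below computes a simple interaction contrast from the mean fixation times reported above: the light-minus-dark difference under high-frequency sound minus the same difference under low-frequency sound. This is a descriptive calculation on raw cell means, not the covariate-adjusted ANCOVA itself, and `interaction_contrast` is a hypothetical helper name.

```python
def interaction_contrast(high_light, high_dark, low_light, low_dark):
    """(light - dark) under high-frequency sound minus (light - dark) under low-frequency sound."""
    return (high_light - high_dark) - (low_light - low_dark)


# Mean fixation times in seconds, copied from the study 1 results.
elimination = interaction_contrast(2.20, 0.73, 1.06, 2.33)
choice = interaction_contrast(4.58, 3.75, 3.73, 4.55)

print(f"elimination stage contrast: {elimination:.2f} s")
print(f"choice stage contrast:      {choice:.2f} s")
```

On these means the contrast is about 2.74 s at the elimination stage and 1.65 s at the choice stage. Because the product page was shown for 10 s and the description page for 30 s, the raw seconds are not on a common scale; the formal stage comparison is the F-test reported above.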

4 Study 2

Study 2 served three purposes. First, the study aimed to confirm and generalize the cross-modal correspondence effect during the two stages of the shopping process using a different color (i.e., blue). Second, study 1 focused only on the attentional effect. Attention may affect purchase intention via several mechanisms, such as the self-perception process and a facilitation effect (Shen and Sengupta, 2014). We expect that the cross-modal correspondence effect would also influence consumers’ click intention and purchase intention towards products. We thus use study 2 to provide evidence in that regard. Third, it was used to test Hypothesis 2, which postulated that the cross-modal correspondence effect during the two stages of the shopping process would be influenced by consumers’ involvement.

4.1 Method

A total of 120 undergraduate students who had online shopping experience were recruited from a university in Shanghai to participate in study 2; they were compensated with snacks. Participants were randomly assigned to one of the conditions in a 2 (high-frequency sound vs. low-frequency sound) × 2 (low involvement vs. high involvement) experimental design. The manipulation of sound was similar to study 1. Two levels of involvement (low vs. high) were designed through instructional manipulations (Puccinelli et al., 2013; Suri and Monroe, 2003).

Specifically, participants in the condition of low involvement were instructed to purchase a new T-shirt at an online store, while participants in the condition of high involvement were instructed to imagine that their university was going to launch some T-shirts for students, and they were asked to buy their new T-shirts from the online store. Similar to study 1, participants were invited to an online store where they were instructed through the pre-designed scenario to make their click and purchase decisions and then to complete an online survey (see details of the laboratory experiment 2’s design in Appendix 2).

4.2 Results

Manipulation check

The independent samples t-test results showed that the manipulations of sound frequency (Mhigh = 4.33, Mlow = 2.90, t = 5.57, p < 0.001) and involvement (Mhigh = 4.83, Mlow = 3.15, t = 7.34, p < 0.001) were successful.

Hypothesis testing

First, the results of repeated-measures ANCOVAs again demonstrated the cross-modal correspondence effect (click: F(1, 111) = 123.13, p < 0.001, partial η2 = 0.53; purchase: F(1, 111) = 13.63, p < 0.001, partial η2 = 0.11). Moreover, the sound frequency × color lightness interaction had a stronger effect on participants’ click intention than on purchase intention (F(1, 227) = 40.09, p < 0.001, partial η2 = 0.15). These results provided more evidence to support H1.

More importantly, the three-way interaction between sound frequency, color lightness, and involvement had a significant effect on click intention (F(1, 111) = 53.25, p < 0.001, partial η2 = 0.32; see Figure 2), thus supporting H2. Specifically, the effect of the sound frequency × color lightness interaction on click intention was weaker in the high involvement condition (for high-frequency sound condition: Mlight = 4.21 vs. Mdark = 3.57, t = 2.76, p < 0.05; for low-frequency sound condition: Mlight = 4.19 vs. Mdark = 4.72, t = 2.50, p < 0.05) than in the low involvement condition (for high-frequency sound condition: Mlight = 5.51 vs. Mdark = 2.40, t = 9.54, p < 0.001; for low-frequency sound condition: Mlight = 3.01 vs. Mdark = 5.73, t = 6.65, p < 0.001). These results are presented in Figure 2.

Fig. 2

Interaction of sound frequency, color lightness, and involvement (study 2). a Elimination (click) stage. b Choice (purchase) stage

Furthermore, the three-way interaction between sound frequency, color lightness, and involvement had a significant effect on purchase intention (F(1, 111) = 4.19, p < 0.05, partial η2 = 0.04; see Figure 2), supporting H2. The effect of the sound frequency × color lightness interaction on purchase intention was attenuated in the high involvement condition (for high-frequency sound condition: Mlight = 3.40 vs. Mdark = 3.19, t = 1.13, p > 0.10; for low-frequency sound condition: Mlight = 3.99 vs. Mdark = 4.22, t = 1.05, p > 0.10) compared with the low involvement condition (for high-frequency sound condition: Mlight = 4.20 vs. Mdark = 3.54, t = 2.50, p < 0.05; for low-frequency sound condition: Mlight = 3.88 vs. Mdark = 4.69, t = 2.45, p < 0.05). Notably, in the high involvement condition, there was no significant cross-modal correspondence effect on purchase intention.
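The attenuation described in H2 can be read directly off the reported cell means. The sketch below computes the sound frequency × color lightness contrast separately for each involvement level and then takes their difference, which is the descriptive counterpart of the three-way interaction. It uses raw means only, without the covariates in the ANCOVAs, and the names `CELL_MEANS`, `two_way`, and `attenuation` are hypothetical.

```python
def two_way(high_light, high_dark, low_light, low_dark):
    """Sound frequency x color lightness contrast on cell means."""
    return (high_light - high_dark) - (low_light - low_dark)


# (high-freq light, high-freq dark, low-freq light, low-freq dark),
# copied from the study 2 results.
CELL_MEANS = {
    ("click", "low_involvement"): (5.51, 2.40, 3.01, 5.73),
    ("click", "high_involvement"): (4.21, 3.57, 4.19, 4.72),
    ("purchase", "low_involvement"): (4.20, 3.54, 3.88, 4.69),
    ("purchase", "high_involvement"): (3.40, 3.19, 3.99, 4.22),
}


def attenuation(outcome):
    """Return (low-involvement contrast, high-involvement contrast, their difference)."""
    low = two_way(*CELL_MEANS[(outcome, "low_involvement")])
    high = two_way(*CELL_MEANS[(outcome, "high_involvement")])
    return low, high, low - high


for outcome in ("click", "purchase"):
    low, high, diff = attenuation(outcome)
    print(f"{outcome}: low = {low:.2f}, high = {high:.2f}, difference = {diff:.2f}")
```

For click intention the contrast falls from about 5.83 under low involvement to 1.17 under high involvement; for purchase intention it falls from about 1.47 to 0.44, in line with the non-significant simple effects reported for highly involved participants.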

5 Study 3

5.1 Experimental design

Study 3 examines how sound frequency influences consumers’ response to color lightness in the two stages of the shopping process in a real market context. We conducted this field study with the help of a start-up insole firm, which sells its products through two similar online platforms. The field study adopted a single-factor design (no music vs. high-frequency music vs. low-frequency music), with music assigned to one of the platforms (i.e., the treatment platform) in different weeks. That is, the designs of the treatment and control platforms are mostly the same, except for the music manipulation (see Appendix 5). Figure 3 shows the music designs of the two platforms, which differed only in the post-treatment period. The control platform was assigned “no music” during the post-treatment period, while the treatment platform was assigned “high-frequency music” or “low-frequency music” on a weekly basis. Following previous literature (e.g., Kumar and Tan, 2015; Yang and Xiong, 2019), we further conducted randomization checks of the products on the two platforms to ensure that there were no significant differences in product prices, pre-treatment clicks, and pre-treatment conversions between the products on the two platforms (ps > 0.1) (see Appendix 6).

Fig. 3

Illustration of the timeline and design of study 3. Note: The differences between the music designs of the two platforms existed only in the post-treatment period. Specifically, both platforms were assigned “no music” during the pre-treatment period (i.e., weeks 1–2). During the post-treatment period, the treatment platform was assigned “high-frequency music” (i.e., weeks 4–5) or “low-frequency music” (i.e., weeks 3 and 6) on a weekly basis, while the control platform remained assigned “no music” (i.e., weeks 3–6)

The start-up insole company tracked daily performance for each product on the two platforms and provided us with these data for our research. The data from both platforms are at the product-day level, with the same data structure. During the 6-week field experiment, there were 9546 visits to the two platforms (including 3994 visits in the pre-treatment period and 5552 visits in the post-treatment period), which resulted in 4223 clicks (1727 clicks in the pre-treatment period and 2496 clicks in the post-treatment period) and 392 conversions (136 conversions in the pre-treatment period and 256 conversions in the post-treatment period). We utilize the click-through rate (a ratio of clicks to visits) and conversion rate (a ratio of conversions to clicks) for each product to represent consumers’ behavior at the elimination and choice stages of their shopping process, respectively. Specifically, we use a difference-in-difference (DID) method to examine the cross-modal correspondence effect on a product level as follows in Eq. (1):

$$ Y_{it} = \beta_0 + \beta_1 \mathrm{Post}_t + \beta_2 \mathrm{Treat}_i \times \mathrm{Post}_t + \beta_3 \mathrm{Treat}_i \times \mathrm{Post}_t \times \mathrm{Correspond}_{it} + \alpha X_{it} + \mu_i + \varepsilon_{it}, \tag{1} $$

where $Y_{it}$ represents the performance of product $i$ (i.e., click-through rate or conversion rate) in period $t$. $\mathrm{Post}_t$ is a dummy variable indicating the pre-treatment (0) or post-treatment period (1); $\mathrm{Treat}_i$ indicates whether product $i$ is in the control (0) or treatment group (1). $\mathrm{Correspond}_{it}$ is a dummy variable indicating whether the cross-modal correspondence between sound frequency and color lightness occurs for product $i$ in period $t$ ($\mathrm{Correspond}_{it} = 0$ if a light-color (dark-color) product $i$ is exposed to low-frequency (high-frequency) music in period $t$, and $\mathrm{Correspond}_{it} = 1$ if not). We also include the effects of control variables $X_{it}$, such as time trend and day-of-week effects. $\mu_i$ represents the fixed effects of products, and $\varepsilon_{it}$ is the error term. $\beta_3$ is the parameter of interest that captures the cross-modal correspondence effect on product performance. Because the product-specific fixed effects would be collinear with $\mathrm{Treat}_i$, we do not include the main effect of $\mathrm{Treat}_i$ in Eq. (1). In addition, we only consider the three-way interaction $\mathrm{Treat}_i \times \mathrm{Post}_t \times \mathrm{Correspond}_{it}$ because only observations from the treatment group in the post-treatment period vary in the level of $\mathrm{Correspond}_{it}$. Specifically, $\mathrm{Correspond}_{it}$ matters only when $\mathrm{Treat}_i = 1$ and $\mathrm{Post}_t = 1$.
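The variable definitions in Eq. (1) can be sketched as a small coding routine. The example below builds the two outcome rates and the three dummy-based regressors for a single product-period, using the weekly music schedule given in the note to Fig. 3. It is a minimal illustration under stated assumptions, not the authors' estimation code: the names `TREATMENT_MUSIC`, `outcomes`, and `eq1_regressors` are hypothetical, the controls and product fixed effects are omitted, and returning `None` for an undefined conversion rate is an assumption because the text does not say how product-days with zero clicks were handled.

```python
# Music on the treatment platform by week, from the note to Fig. 3.
# The control platform has no music in every week.
TREATMENT_MUSIC = {1: None, 2: None, 3: "low", 4: "high", 5: "high", 6: "low"}

# Sound-color pairings that do NOT correspond (Correspond = 0 in Eq. 1).
MISMATCHES = {("light", "low"), ("dark", "high")}


def outcomes(visits, clicks, conversions):
    """Click-through rate (clicks / visits) and conversion rate (conversions / clicks)."""
    ctr = clicks / visits if visits else None
    cvr = conversions / clicks if clicks else None  # undefined with zero clicks (assumption)
    return ctr, cvr


def eq1_regressors(week, treated, color):
    """Return Post, Treat x Post, and Treat x Post x Correspond for one product-period."""
    post = 1 if week >= 3 else 0  # weeks 1-2 are the pre-treatment period
    treat = 1 if treated else 0
    music = TREATMENT_MUSIC[week] if treated else None
    # Correspond is 0 for a mismatched pairing and 1 otherwise, as defined in the text.
    correspond = 0 if (color, music) in MISMATCHES else 1
    return {
        "post": post,
        "treat_post": treat * post,
        "treat_post_correspond": treat * post * correspond,
    }


print(eq1_regressors(week=4, treated=True, color="light"))   # matched pairing
print(eq1_regressors(week=3, treated=True, color="light"))   # mismatched pairing
print(eq1_regressors(week=4, treated=False, color="light"))  # control platform
```

Read against Eq. (1), a treated product in the post-treatment period shifts by β2 when its color and the music are mismatched and by β2 + β3 when they correspond, which is why β3 isolates the correspondence effect.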

5.2 Estimation results

The estimation results for the click-through rate as the dependent variable are presented in Table 2, Column 1. We find strong evidence that cross-modal correspondence has a positive and significant effect on the click-through rate (β3 = 0.023, p < 0.01 in Column 1). This suggests that consumers are more likely to click on light (vs. dark) products with high (vs. low)-frequency sounds in the elimination stage of their shopping process. The coefficient indicates that the cross-modal correspondence increased the products’ click-through rate by 121.05% relative to the median value of 0.019 (Footnote 1).

Table 2 Model estimation results of study 3

Column (2) shows that cross-modal correspondence has a positive and marginally significant effect on the conversion rate (β3 = 0.032, p < 0.1 in Column 2). The coefficient indicates that the cross-modal correspondence increased the products’ conversion rate by 88.89% relative to the median of 0.036. Moreover, the coefficient β3 is significant at a stricter level in Column (1) (p < 0.01) than in Column (2) (p < 0.1), and its relative effect is larger (121.05% vs. 88.89%). This suggests that although the cross-modal correspondence between sound frequency and color lightness also occurs in consumers’ final choice stage, the effect is stronger at the elimination stage than at the final choice stage.
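The percentage figures above follow from dividing each estimated coefficient by the median of the corresponding outcome. The short sketch below reproduces that arithmetic from the reported values; `relative_effect` is a hypothetical helper name.

```python
def relative_effect(beta, median):
    """Coefficient expressed as a percentage of the outcome's median value."""
    return 100.0 * beta / median


# Coefficients and medians as reported for Table 2, Columns 1 and 2.
ctr_effect = relative_effect(beta=0.023, median=0.019)
cvr_effect = relative_effect(beta=0.032, median=0.036)

print(f"click-through rate: {ctr_effect:.2f}%")
print(f"conversion rate:    {cvr_effect:.2f}%")
```

Scaling by the median puts the two coefficients on a comparable footing, since the raw conversion-rate coefficient (0.032) is larger than the click-through coefficient (0.023) even though its relative effect is smaller.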

6 General discussion

Based on the actual behavior of participants observed in the three experimental studies (one field experiment and two laboratory experiments), we reach three broad conclusions. First, sound frequency has significant effects on consumers’ visual attention to products of different color lightness. Thus, controlling sound frequency allows cross-modal correspondence to guide consumers’ visual attention. Second, this effect is more pronounced at the elimination stage than at the choice stage. Finally, the cross-modal correspondence effect during the two stages of the shopping process is affected by consumers’ involvement. When the experimental instructions raised participants’ level of involvement, the cross-modal correspondence effect on purchase intention was not significant.

There are some alternative explanations for the differential cross-modal correspondence effect on the two stages. One possible explanation is that consumers spend much more time at the choice stage than at the elimination stage. Given that attention can be used to support higher-order information processing and thus might affect consumers’ decision-making (Janiszewski et al., 2013), more time (i.e., more attention) spent at the choice stage may influence the cross-modal correspondence effect at this stage. Another possible explanation is that consumers can read more information at the choice stage than at the elimination stage, and this additional information might draw their attention away from product color. Ge et al. (2012) propose a weight shift mechanism: when consumers evaluate their alternatives at the choice stage, the newly introduced information at this stage about alternatives on one dimension will increase the weight that consumers attach to that dimension in their evaluation process. Thus, consumers may place more decision-making weight on the newly introduced product information displayed at the choice stage, which may weaken the cross-modal correspondence effect in the choice stage.

Our research has several theoretical implications. First, studies have shown that cross-modal mapping occurs prior to conscious awareness of the visual stimuli (Hung et al., 2017), suggesting that audiovisual cross-modal correspondence occurs at an automatic and unconscious level. Building on this finding, we propose a new framework for how cross-modal correspondence effects impact each stage of consumers’ shopping behavior. Second, we contribute to the relevant literature by identifying an essential and previously uninvestigated moderating factor: consumer involvement.

Our findings also provide several important practical insights. First, our findings suggest that companies should be aware of the cross-modal correspondence effect on consumers (especially on low-involvement consumers) when developing marketing strategies. Second, marketers may want to change or enhance their advertisement targeting strategies. They should give more consideration to the interaction between hearing and vision in advertising and more effectively highlight their products or services. It is wise to combine light (vs. dark) products with high (vs. low) frequency sounds in their brand’s promotion videos. For example, when marketers choose voice actors to promote their products, they may find it helpful to consider the speaker’s tone. In general, women’s voices are higher than men’s voices, so advertisements characterized by female voices are likely to promote products with light colors more effectively.

Our studies have some limitations, and future research could be extended in several directions. First, while the randomization checks ensured that there were no significant differences between the products on the two platforms, we still observe some “imperfections” in the field experiment. For example, the website designs at the choice stage of the field experiment varied slightly between the two platforms (e.g., the color palette in the right-side table boxes). Though there is no reason to suggest that such imperfections affect our results, future research could validate our findings with additional field experiments. Second, our experiments used online stores as shopping contexts. Thus, the sample is limited to online shoppers. Future research should test the generalizability of these results by repeating our research with offline shopper samples. Lastly, in addition to consumer involvement, other important but as yet unexamined factors, such as gender, age, and situational factors in consumers’ purchase decisions, may also moderate the effects of audiovisual cross-modal correspondence and should therefore be considered.