1 Introduction

Food production and consumption have a significant impact on the environment, with food systems contributing 19–29% of global anthropogenic greenhouse gas (GHG) emissions (Vermeulen et al. 2012). In Europe, food is responsible for 20–30% of the impact of household final consumption (Hertwich et al. 2010), with meat and dairy products together accounting for a significant share of total GHG emissions (Notarnicola et al. 2017). Dietary change is therefore one of the most important ways that sustainable lifestyles can be achieved in developed countries (Carlsson-Kanyama 1998). One way to motivate changes in food consumption is through environmental food labels, although their design and effectiveness are much debated. Most existing labels focus on the carbon footprint and are developed by private and voluntary initiatives (Ernst and Young 2010). The Grenelle Environmental initiative in France advocates for the simplification and harmonization of European environmental labelling in order to avoid a proliferation of environmental claims that could undermine the credibility of the process (Footnote 1). In Finland, the importance of the environmental dimension in the choice of consumer products is increasing (Rokka and Uusitalo 2008), but there are still significant misunderstandings of some fundamental environmental concepts such as the carbon footprint (Hartikainen et al. 2014). In 2009, a survey showed that 72% of EU citizens believed that a carbon footprint label should be mandatory in the future (Eurobarometer 2009). However, this desire for information does not necessarily imply a willingness to change food consumption habits. In the United Kingdom, while consumer demand for carbon labels is also relatively high, the proliferation and technical content of these labels lead to misinterpretations and misunderstandings that could hinder their impact on purchasing behaviours (Gadema and Oglethorpe 2011). Although UK consumers show very diverse perceptions of environmental labels, they generally tend to be skeptical about unfamiliar labels and general claims (Sirieix et al. 2013). While consumers’ desire for simple environmental labelling seems widespread, the question of its format and its real impact on food purchases remains to be answered and requires further research.

To help inform these policy debates, this study assesses the extent to which simple environmental labelling on the front of packaging can shift food consumption patterns towards greater sustainability and which format is most effective for achieving this shift. We test three labelling formats: the single traffic light format (sTL), the multiple traffic lights format (mTL) and the kilometric format (KM). KM indicates the carbon dioxide (CO2) impact of each product by converting it into the equivalent number of kilometres driven by an average car. The other two formats are based on the traffic-light rating system used for nutrients by the British Food Standards Agency. mTL presents three traffic lights simultaneously: one each for GHG emissions, water eutrophication and air acidification. sTL displays a single traffic light for GHG emissions. To assess the relative impact of each format, we have set up an innovative experiment that replicates online shopping platforms. In an experimental store that offers 282 products, we observe the food purchases of 275 consumers before and after one of the three aforementioned environmental labels is affixed to each product. Changes generated by labels on food shopping baskets are therefore examined under strict ceteris paribus conditions. In addition to providing high control and avoiding confounding factors, this laboratory experiment enables the truthful elicitation of demand by properly incentivizing the purchase of food items.

Many studies in marketing science have sought to understand consumers’ intentions towards eco-labelled products. Such studies have since been widely criticized (Auger et al. 2003; Auger and Devinney 2007) because they are based on unconstrained survey response methods. Respondents are attracted to socially acceptable responses, which leads them to overstate their intentions. In response, many studies have measured consumers’ willingness to pay a premium for eco-labelled products. In his review, Gallastegui (2002) notes that these estimates are not reliable because they were mostly derived from consumer statements (such as the contingent valuation method). Such statements are known to be biased (e.g., social desirability bias). Later, studies using the discrete choice method revealed that most consumers prefer eco-labelled products and are willing to pay a premium for them (Janssen and Hamm 2012; Johnston et al. 2001; Van Loo et al. 2011). Although the method limits the risk of overstatement by forcing trade-offs, choices remain hypothetical and are thus not guaranteed to reflect true attitudes and actual purchases.

Two alternatives to survey methods are the analysis of observational data and the use of experimental methods. When using observational data, one is limited to examining the effects of labels currently used in markets. As a result, existing studies have mainly focused on the widespread organic label (Onyango et al. 2007; Wier and Calverley 2002). Because of the halo effect, consumers may opt for organic food not only for its low environmental impact but also for its healthier reputation. This makes interpretations of environmental attitudes ambiguous. Experimentalists are able to focus exclusively on environmental labelling with little confounding. They build controlled and reproducible conditions that replicate food purchases: participants actually buy real eco-labelled products in the laboratory (Bougherara and Combris 2009; Marette et al. 2012; Moser and Raffaelli 2012; Tagbata and Sirieix 2008). These experiments confirm the existence of a price premium for eco-friendly food. However, this premium is always evaluated in the context of purchasing single products. Environmental attitudes are largely ignored in the context of food basket purchases.

To our knowledge, only two experiments involving food baskets have been conducted to date. Vanclay et al. (2011) observed food purchases before and after the posting of environmental labelling for 8 weeks. The study used labels on 37 food products from five product lines (milk, butter spread, canned tomatoes, bottled water and non-perishable pet food) in a grocery store in Australia. The only format tested was a tricolour logo (green, yellow and black), in the same vein as our single traffic light logo. Consumers shifted from “black” eco-unfriendly products (− 6%) to “green” eco-friendly products (+ 4%). In their between-subject framed-field experiment, Vlaeminck et al. (2014) tested two labelling formats: a multi-coloured labelling format similar to that of our multiple traffic lights and another format with several raw pieces of environmental information. Labels were affixed to 9 products (3 fruits, 3 vegetables and 3 protein-based products) in a Belgian supermarket. Participants were required to buy at least one of these nine products. The study stresses the importance of labelling formats: while simple logos that signal overall environmental quality in colour have positive effects on food purchases, complex logos that provide multiple pieces of raw information on resources used (water, soil, pesticides, etc.) have no effect at all.

Our experiment differs in several ways. First, with 282 products divided into 36 product categories, our laboratory simulates a small grocery store, which makes the task more realistic and allows an assessment of the impact of labelling on a complete food basket rather than on a limited set of products. Second, because labels are applied to all available products, we simulate a comprehensive labelling policy. Third, our intra-subject structure controls for individual differences: the same consumers are observed with and without labelling, all other things being equal. Fourth, we compare three formats: two colour-coded formats (one conveying a single piece of information, the other several) and one format displaying a single piece of raw information. The experimental design allows a fair comparison of the formats. Finally, the purchasing task is not subject to any constraints (on quantities or budgets). Participants are free to buy what they want, just as they would be in a real grocery store.

We find that environmental labelling improves the environmental impact of food baskets without affecting the price per 100 g or the nutritional quality of the food purchased. For instance, GHG emissions decrease between 8% and 12%. Multiple traffic lights are relatively more effective at improving the environmental impact of food baskets. These results converge with the study of Gadema and Oglethorpe (2011), which notes the difficulty consumers experience when processing environmental information. To improve the effectiveness of the labels in the choice process, Gadema and Oglethorpe recommend simplicity and clarity. This recommendation echoes findings on nutritional labels. According to marketing and psychology studies, consumers’ attention is best captured by colours and symbols (Muller and Prevost 2016; Argo and Main 2004; Cox III et al. 1997). For example, consumers in supermarkets refer less to nutritional tables than to health logos (van Herpen and van Trijp 2011). Once noticed by consumers, graphic logos such as traffic lights or 3-tiered star icons can even lead to an improvement in purchasing behaviours (Balcombe et al. 2010; Sutherland et al. 2010). However, a critical review of the literature on nutritional labelling (Hieke and Taylor 2012) reaches a mixed conclusion: labels are sometimes useful for some people and only under some circumstances.

2 The Experiment

2.1 Label Formats

We use three indicators to assess the environmental impact of food. One assesses GHG emissions and is expressed in carbon dioxide (CO2) equivalents (in grams), another assesses marine eutrophication in nitrogen (N) equivalents (in grams), and the last one assesses air acidification in sulfur dioxide (SO2) equivalents (in grams). These indicators pertain to, respectively, global warming, water pollution and air pollution. The three environmental indicators follow the lifecycle analysis approach. This approach has gradually come to be favoured in France (Footnote 2) and in Europe (Manfredi et al. 2012) because it considers each stage of the product’s life cycle (including its packaging) from the manufacturing process to consumption. GHG emissions, eutrophication and acidification are calculated per 100 g of food. In this study, these calculations were carried out by «BIO by Deloitte», specialists in lifecycle analysis.

We test three labelling formats (sTL, mTL and KM) that differ in their complexity and the degree to which the label provides the consumer with conclusions on the environmental quality of a product (normative information), as opposed to the degree to which the label leaves it to the consumer to draw such conclusions (positive information).

KM reports the GHG emissions expressed as kilometres travelled by a common motor vehicle. This number indicates the distance travelled by a typical car when it emits the CO2 equivalent needed to produce, package and transport the food product. For example, if a litre of whole milk is reported at 11.0 km, it means that the product’s GHG emissions are equivalent to those of a car travelling 11 km. KM is positive and based on a single criterion. sTL displays a single coloured dot. The dot is green when the food belongs to the best one-third of products in terms of GHG emissions compared to products of the same food category. The dot is orange when the food belongs to the middle third and red when it belongs to the bottom third. In other words, because each page of the food catalogue represents one food category, there will be 3 green foods, 3 orange foods and 3 red foods if the page comprises 9 products. sTL is normative and based on a single criterion. Finally, mTL displays 3 coloured dots. One dot signals the GHG emissions, another the marine eutrophication and the last one air acidification. The colour-coding follows the same rule as for sTL. mTL is normative and based on multiple criteria. The three labelling formats are displayed in Table 1.
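The two labelling rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data layout, the function names and the car emission factor passed to `km_equivalent` are hypothetical, and the handling of ties within a category (here, arbitrary sort order) is not specified in the text.

```python
from collections import defaultdict


def km_equivalent(ghg_grams, car_grams_per_km):
    """Convert a product's GHG emissions (g CO2-eq) into car kilometres.

    `car_grams_per_km` is an assumed emission factor for a typical car;
    the value used in the study is not stated in the text.
    """
    return ghg_grams / car_grams_per_km


def traffic_light_colours(products):
    """Assign green/orange/red by GHG tercile within each food category.

    `products` is a list of (name, category, ghg_per_100g) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_category = defaultdict(list)
    for name, category, ghg in products:
        by_category[category].append((ghg, name))

    colours = {}
    for items in by_category.values():
        items.sort()  # lowest emissions first
        n = len(items)
        for rank, (_, name) in enumerate(items):
            # rank * 3 // n maps sorted positions onto terciles 0, 1, 2
            colours[name] = ("green", "orange", "red")[rank * 3 // n]
    return colours
```

With 6 or 9 products per category, as in the catalogue, the integer division yields exactly 2 or 3 products per colour.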

Table 1 Label formats

2.2 Food Supply

Purchasing behaviours and, therefore, changes in purchasing behaviours induced by labels are inevitably contingent on food supply. Accordingly, choosing the foods that make up the experimental store is a crucial task. First, the food set should be large and varied enough to allow for substitutions between and within food groups. Second, the environmental impacts of foods must have sufficient contrast to allow for improvements. Finally, the food set must be realistic and representative of French food consumption. Based on these specifications and with the help of nutritionists and life cycle analysts, we have selected 282 products grouped into 37 categories and 7 meta-categories (see Table 2). Categories refer to self-service store shelves. Within each category, 6 or 9 substitutable products differ in their environmental impacts due to their composition, preservation (fresh, frozen, canned), packaging (glass, metal, cardboard) and size (family pack, individual pack).

Table 2 Food categories

We use two indicators to assess the nutritional quality of foods: the aggregated LIM index (Darmon et al. 2009; Maillot et al. 2011) and the energy content. The LIM index is a validated nutrient profiling system that calculates the mean percentage for 3 nutrients whose intake should be limited (saturated fatty acids, free sugar and sodium) based on the recommended maximum values. It is also calculated per 100 g. We also gathered information on weight (in g) and energy content (in kcal). Where this information was missing from the packaging, it was taken from the food composition database originally developed for the Su-Vi-Max project (Footnote 3). Finally, each product is priced according to its actual retail price. Prices were taken from the largest French food supermarket (Carrefour) at the time of the experiment. All these indicators are then computed for each basket by summing the values of the foods it contains and dividing by the total weight of the basket, so that each indicator is expressed per 100 g.
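One reading of this basket-level aggregation is a weight-weighted average of the per-100 g values of the purchased foods. The sketch below assumes that reading; the function name and the `(weight_g, value_per_100g)` layout are hypothetical.

```python
def basket_indicator_per_100g(items):
    """Aggregate a per-100 g indicator (GHG, price, LIM, ...) over a basket.

    `items` is a list of (weight_g, value_per_100g) pairs, one per purchased
    product (hypothetical layout). The result is again expressed per 100 g.
    """
    total_weight = sum(weight for weight, _ in items)
    if total_weight == 0:
        raise ValueError("cannot aggregate an empty basket")
    # Total amount of the indicator contained in the whole basket.
    total_value = sum(weight * value / 100 for weight, value in items)
    return 100 * total_value / total_weight
```

For example, a basket holding 500 g of a product at 200 g CO2-eq per 100 g and 1000 g of a product at 50 g CO2-eq per 100 g comes out at 100 g CO2-eq per 100 g.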

When we look at the 282 foods in the experimental store, we find that there are strong correlations between the three environmental indicators: Spearman’s correlation coefficients are 0.91 between eutrophication and GHG, and 0.71 between acidification and GHG. GHG emissions vary widely across foods, by a factor of 1 to 46. GHG heterogeneity is greater for animal foods than for plant products. Meat, fish and eggs are the least eco-friendly foods, followed by ready-made dishes and dairy products. On the other hand, fruits and vegetables are the eco-friendliest. The same observations can be made for eutrophication and acidification. Meat, fish and eggs are also the most expensive, while fruits and vegetables are the least expensive. As for nutrition, snacks and sweets have the worst LIM index, whereas fruits and vegetables are again at the top of the list (see details in Table 3). GHG is positively correlated with price and LIM index: Spearman’s correlation coefficients are, respectively, 0.68 and 0.42. In other words, foods with a higher environmental impact are more expensive and have lower nutritional value.

Table 3 Environmental, economic and nutritional characteristics of foods in the experimental store

Although these claims are only verified for the 282 foods in the experimental store, we trust they can be generalized to larger sets of foods. However, they are highly dependent on the functional unit. The correlations above are based on 100 g of product. When energy intake or servings are used as the denominator instead, the correlations between GHG and price and between GHG and LIM, for instance, become negative (Spearman, − 0.10 and − 0.44). This makes sense: meat emits a lot of CO2 and is expensive, but it also provides many calories. This finding suggests that consumers will find it difficult to improve their environmental impact while sustaining their energy consumption. This result converges with Drewnowski et al. (2015) and Masset et al. (2014): «the sustainability dimensions seemed to be compatible when considering price per kilogram of food. However, this conclusion is too simplistic when considering price per kilocalorie».
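The dependence on the functional unit can be explored by recomputing Spearman's rank correlation after dividing each indicator by energy content rather than by weight. The sketch below is a minimal pure-Python version (average ranks for ties, then Pearson's coefficient on the ranks); the helper names are hypothetical, and any figures fed to it in the example are invented solely to show that the sign can flip, not taken from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def per_kcal(values_per_100g, kcal_per_100g):
    """Re-express per-100 g indicators per kilocalorie."""
    return [v / k for v, k in zip(values_per_100g, kcal_per_100g)]


# Invented toy figures (not data from the study), in arbitrary units.
ghg = [100, 200, 300]
price = [1, 20, 300]
kcal = [10, 100, 1000]
rho_per_100g = spearman(ghg, price)
rho_per_kcal = spearman(per_kcal(ghg, kcal), per_kcal(price, kcal))
```

In this constructed example the two coefficients have opposite signs, which mirrors the qualitative point made above without reproducing its magnitudes.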

2.3 Experimental Design

Participants’ task is to do their grocery shopping in our experimental store (see instructions in Appendix 1). The general instruction is to buy food for the household for 2 days. However, as in a real store, the participants are reminded that they are free to buy what and how much they want. If they are not satisfied with the products available in the experimental store or if they are sufficiently supplied at home, they are free to leave the experimental store without purchasing anything. Also, no budgetary constraints are set. To do their shopping, each participant is provided with a food catalogue, a barcode reader and a computer screen. The catalogue includes all 282 food products for sale during the experiment. Each page contains 6 or 9 items from the same category. Each item is displayed with name, brand, front image, price and price per kg or litre, weight or volume and a unique barcode (see Fig. 1). When scanning the barcode, the corresponding item appears on the screen. The on-screen environment is similar to e-shopping sites. Nutritional information (calories, salt, free sugars, saturated fatty acids) and environmental information (GHG emission, marine eutrophication, air acidification) are available on request (by clicking the “see detail” button). If a participant wishes to buy an item, she scans it in the catalogue to make it appear on the screen. She then clicks on the “Add to cart” button. A virtual shopping basket on the left side of the screen reminds users of the items selected (see Fig. 2).

Fig. 1 Catalogue. Left: front page; middle: table of contents; right: meat category with the single traffic light labelling

Fig. 2 Computer screen. Example when salami is scanned from the catalogue

There are several steps in the experiment’s procedures, as shown in Table 4. Instructions are displayed both on a large screen and on each participant’s computer screen and are read aloud by the facilitator. Participants are told they will have to fill two baskets of food successively, but nothing is said about environmental labelling until part 2 begins. In part 1, the first basket, the reference basket, is composed without labels. Once all participants have validated their reference basket, a new catalogue is distributed. This catalogue is strictly identical to the previous one, except that all products are now labelled with one label format (sTL, mTL or KM). The corresponding format is briefly described to the participants, and each participant is then asked to fill a new shopping basket, the label basket. This experimental procedure allows us to observe the impact of environmental labelling on each individual in perfectly identical situations. While this allows a great deal of control over the explanatory variables and thus supports a causal interpretation of the results, repetition of the task makes the labelling highly salient and may therefore lead to an overestimation of its absolute impact. After the two parts of the shopping experience, participants are invited to complete a socio-demographic questionnaire that provides information on occupation, income, and household size.

Table 4 Session overview

The experiment is made incentive-compatible. The instructions emphasize that participants will actually purchase approximately one quarter of the content of one of their two baskets. A quarter of the food is actually available and hidden from participants in an adjacent room. At the end of the experiment, a shopping basket (either the reference basket or the label basket) is randomly drawn. Any product in the drawn basket that is available in the adjacent room is actually purchased at the end of the experiment. In total, 96 food items can possibly be sold at the end of the experiment. We have taken great care to distribute the available products across all product categories, including fresh and perishable products (see Table 2). Before each session, a new set of 71 products (i.e., one quarter of 282) was chosen among these 96 items. Participants were also paid 35€ before the start of the experiment. It was explained to them that this money only compensates them for their attendance and has nothing to do with the rest of the experiment. The amount they spend in the store is at their own discretion. In fact, after the products actually available were revealed at the end of the experiment, participants never spent more than 10€ on real purchases.
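The incentive mechanism can be summarized by the following sketch, which assumes uniform random draws (the text does not state how the 71 products or the binding basket were drawn). Function names and the item identifiers used in any call are hypothetical.

```python
import random


def draw_available(sellable_pool, n_available, rng):
    """Pick the products physically available for one session.

    `sellable_pool` stands for the 96 items that can be sold and
    `n_available` for the 71 products chosen before each session.
    A uniform draw is an assumption of this sketch.
    """
    return set(rng.sample(sorted(sellable_pool), n_available))


def purchased_items(reference_basket, label_basket, available, rng):
    """Draw one of the two baskets and keep only the items really on sale."""
    drawn_basket = rng.choice([reference_basket, label_basket])
    return [item for item in drawn_basket if item in available]
```

A session would call `draw_available` once and then `purchased_items` for each participant, for example with `rng = random.Random()`; passing a seeded generator makes the draw reproducible.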

We run three label treatments, one for each labelling format (sTL, mTL and KM), and one control treatment. In the control treatment, participants keep their part 1 catalogue without environmental labelling. The objective is to control for the impact of repetition on choices and thus to adjust for any possible artefactual effect. Following the randomized controlled trial standard, participants were randomly assigned to one of the four groups. In total, 19 sessions (5 per label treatment and 4 for the control treatment) of approximately 1 h 30 were conducted.

2.4 Participants

A total of 275 participants were recruited through newspapers and flyers in the Grenoble urban area. To be eligible, individuals had to be between the ages of 20 and 65 and responsible for groceries for their household. Following the standards of randomized controlled trials, participants were randomly allocated to one treatment (sTL, mTL, KM or Control). Each participant received 35€ as compensation for her participation.

Table 5 presents the socio-economic and demographic characteristics of participants. Most of the participants were women (nearly 2/3) and were educated (almost 2/3 with higher education), and fewer than half had children (under 14 years of age). Their income level (more precisely the disposable income per consumption unit) is close to the French median income (1610 euros per month in 2010). The results from the Wilcoxon rank sum test or the Pearson χ2 test show no significant differences (α level of 0.05) between treatment groups and the control group except for education in the sTL treatment. Compared to the French averages, the size of the households in our sample is slightly larger and the income slightly lower (see sociodemographic table for France and Grenoble in Appendix 2).

Table 5 Socio-demographic characteristics of participants (mean ± standard deviation or %)

3 Results

3.1 Data Analysis

Thanks to the experimental design, we are able to measure labelling impact within subjects and to compare formats between subjects. In strictly identical conditions, participants shop for food before and after one labelling format is implemented on each item. Accordingly, individual differences between the reference basket and the label basket capture the changes in food choices that are induced by the introduction of the label and by repetition of the task. The effect of repetition is controlled for by accounting for changes in the control group (which performs the shopping task twice without any label). The effects are then compared across groups to assess the relative performance of the labelling formats. Data analysis is organized in three parts: descriptive analysis, econometric analysis and behavioural analysis.

First, for each individual, we measure the quantity of the CO2 equivalent, sulfur dioxide equivalent and nitrogen equivalent (all in grams per 100 g of products), the cost (in euros per 100 g of products), the energy intake (in kcal per 100 g of products) and the LIM index in both the reference basket and the label basket. We then use the differences between baskets to measure the impact of the labels within subjects. The differentiated impacts of KM, sTL and mTL are then assessed by comparing the individual differences between subjects. We use the Wilcoxon signed-rank test when testing label impact and the Wilcoxon–Mann–Whitney test when comparing treatments.

Second, in order to control for the confounding effects of price, energy and LIM, we run an econometric analysis to examine the impacts of the treatments on environmental footprint. As we have data on the same individuals in both pre- and post-periods, we apply the following model:

$$\Delta y_{i} = \alpha + \beta D_{i} + \theta C_{i} + \varepsilon_{i}$$

where \(\Delta y_{i}\) is the difference in the outcome of interest (GHG per 100 g basket, for instance) from the first basket to the second basket for individual \(i\). The term \(D_{i}\) is a treatment dummy variable that takes the value of one if the individual is in the treatment (KM, sTL or mTL) and zero if they are in the control. \(C_{i}\) is a vector of control variables for each individual \(i\).
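As a reading aid, consider the simplified case without control variables (\(\theta = 0\)) estimated by ordinary least squares. Both the simplification and the estimator are assumptions made here for illustration, not the specification reported in this study. The estimates then reduce to a comparison of group means:

$$\hat{\alpha} = \overline{\Delta y}_{D = 0}, \qquad \hat{\beta} = \overline{\Delta y}_{D = 1} - \overline{\Delta y}_{D = 0}$$

where \(\overline{\Delta y}_{D = 1}\) and \(\overline{\Delta y}_{D = 0}\) denote the average basket differences in the label treatment and in the control treatment. In this case, \(\hat{\alpha}\) captures the change due to repeating the task, and \(\hat{\beta}\) captures the additional change attributable to the label.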

Finally, we examine the number and nature of substitutions made by participants between the two baskets. In doing so, we can determine which label triggers the most changes and calculate the environmental impact per change for each label.

3.2 Descriptive Analysis

We first compare the content of the reference baskets to the average content of the experimental store (see Tables 3, 6). Participants choose foods that are more environmentally friendly than the average foods available in the experimental store. While the average GHG emission per 100 g of the 282 products is 283 g, the average GHG is significantly lower in the reference baskets (183 g). Participants also choose cheaper and healthier foods. The average price and LIM index per 100 g are, respectively, 0.31€ and 8.04 in the reference baskets compared to 0.64€ and 15.59 in the experimental store.

Table 6 Means (standard deviations) for environmental, economic and nutritional characteristics (indicators per 100 g) for reference and label baskets; p value for the difference between the reference and the label basket according to the Wilcoxon signed rank test

Environmental labelling, whatever its format, has a positive effect on the environmental quality of baskets (see Table 6). GHG decreases from 179 g in the reference basket to 163 g per 100 g of products in the label basket with the sTL format, from 183 to 160 g with mTL and from 186 to 169 g with KM. All differences are significant at the 1% level according to Wilcoxon signed-rank tests. The environmental improvement does not affect cost and nutritional content: price, energy and LIM index do not vary significantly between the two baskets at the 5% level. While the impact seems bigger with mTL (− 22 g vs. − 15 g and − 16 g for sTL and KM for each individual, on average), relative differences are not statistically significant according to Wilcoxon–Mann–Whitney tests (Table 7). Eutrophication and acidification also significantly decrease with labels, except for KM on eutrophication. Relative decreases are significantly higher with mTL than with KM.

Table 7 Individual means of the differences between the label basket and reference basket of the environmental, economic and nutritional indicators (per 100 g) per treatment and per product family

The product family grouping meat, fish and egg categories is clearly the one that allows the biggest environmental improvement (Table 7). For that set of products, mTL generates a larger GHG decrease than KM does. KM does not even perform statistically better than the control (Footnote 4). The superiority of mTL over KM can be generalized to all product families: mTL is always among the two labels that generate the largest decrease in GHG emissions, eutrophication and acidification, and KM is always among the two that generate the smallest decreases (Table 7). sTL is intermediate. Eutrophication and acidification are always improved with mTL and sTL and never with KM, except for fruits and vegetables. As for economic and nutritional impact, there is no explicit trend.

3.3 Econometric Analysis

In order to control for price, energy, LIM and various sociodemographic characteristics, we run a regression on the differences in, respectively, GHG, eutrophication and acidification between the reference basket and the label basket (results of the estimation are shown in Table 8). Labels, whatever the format, induce a GHG decrease of between 14 and 19 g per 100 g. mTL decreases GHG to a larger extent than do sTL and KM. mTL also provides a statistically significant reduction of eutrophication and acidification. This finding is not surprising because only mTL informs on eutrophication and air acidification. Although sTL does not inform on acidification, the improvement is nonetheless significant due to a strong correlation between acidification and GHG (0.91 with both Pearson and Spearman coefficients).

Table 8 Regression estimation of the model \(\Delta y_{i} = \alpha + \beta D_{i} + \theta C_{i} + \varepsilon_{i}\)

The consumption of more expensive and higher calorie products significantly increases GHG emissions, eutrophication and acidification. Sociodemographic moderators do not affect the environmental impacts of baskets. Variables related to income, weight, sex, age and education are not statistically significant, with one exception: baskets of participants with a master’s degree appear to generate less eutrophication but more acidification.

3.4 Behavioural Analysis

The rate of changes between baskets varies according to the treatment (see Table 9). The participants in the control treatment nevertheless make many changes: they replace almost 7 products and remove 0.5 items from their original total of 18, thus keeping about 60% of their reference basket. No label generates more changes. On average, participants keep 63.7%, 67.2% and 69.2% of their reference basket items in their baskets labelled with mTL, sTL and KM, respectively. However, participants in the control group do not change for the better: each change increases GHG emissions by 84 g on average.
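The share of reference-basket items kept in the second basket can be computed as in the sketch below. It assumes that baskets are lists of product identifiers, with one entry per unit bought, and that an item counts as kept when it appears in both baskets; the exact definitions behind Table 9 are not reproduced here, so this is only one plausible reading.

```python
from collections import Counter


def share_kept(reference_basket, second_basket):
    """Fraction of reference-basket units that reappear in the second basket.

    Baskets are lists of product identifiers, repeated once per unit bought
    (a hypothetical layout chosen for this sketch).
    """
    reference = Counter(reference_basket)
    second = Counter(second_basket)
    if not reference:
        raise ValueError("reference basket is empty")
    # Counter intersection keeps the minimum count of each product.
    kept_units = sum((reference & second).values())
    return kept_units / sum(reference.values())
```

For instance, a reference basket of two milks, one beef and one rice followed by a second basket of one milk, one tofu and one rice gives a kept share of one half.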

Table 9 Description of the changes between the reference and the label basket and the impact per substitution on GHG and cost (individual means)

With labels, each change considerably reduces GHG emissions: − 323 g with mTL, − 290 g with sTL and − 463 g with KM. While previous analyses show that KM is the label with the least impact, here we show that each item replacement reduces GHG emissions significantly more than when mTL and sTL are used. However, at the same time, participants in the KM treatment replace significantly fewer items (4.8 vs. 5.7 in mTL and 5.2 in sTL) and are the only ones who have more items in the label basket (+ 0.15 vs. − 0.61 in mTL and − 0.12 in sTL). Finally, only 60% of the items replaced in KM are within the same food category, compared to 70% and 72% in mTL and sTL. These observations can more or less be generalized to all food meta-categories.

Apart from the control treatment, items in the reference basket are replaced on average by less expensive ones. Each substitution made with KM saves significantly more money (− 0.33€ vs. − 0.08€ and − 0.09€ with mTL and sTL).

Detailed nutritional and environmental information was available to participants by clicking a button on their computer screen. Participants rarely used this button. They checked the information, on average, 1.6 and 1.4 times for nutritional and environmental information, respectively, for their reference baskets and 0.4 and 0.3 times, respectively, for their label baskets. Fewer than 10% of the participants checked information at least 5 times. With so few information viewers, we are unable to statistically test possible behavioural differences between viewers and non-viewers and between treatments.

3.5 Demand Elicitation

Incentive compatibility is key in experimental economics. To be applicable, the incentive mechanism has to be consequential but also manageable for both participants and experimenters. We thus opted to incentivize only a subset of the food supply. As previously explained, a quarter of the food supply was made available for actual purchase at the end of the experiment. While participants did not know which items were available, they knew that they had to pay for items from their experimental shopping basket that belonged to the available subset; i.e., they knew they had to buy approximately one quarter of their basket. Theoretically, this mechanism should not distort their preferences. Indeed, if \(x \succ y\) then \(p \cdot x \succ p \cdot y\) (Savage 1954), with here \(p = 1/4\). However, participants may hold beliefs about the probability distribution and may think that \(p_{x} \ne p_{y}\). If the subjective probability is nil for some items, then their purchase becomes hypothetical, and experimental results could thus be distorted by demand effects (Zizzo 2010). Although much effort was made to distribute the products available for purchase across all food categories (see Table 2, last column) and across shelf-life lengths (fresh, frozen, canned, etc.), participants may still hold beliefs about the distribution of the available products. For instance, participants may think that fresh and frozen foods are less likely to be stocked in the experimental store for practical reasons (see Footnote 5). In that case, participants may react more to the labelling of short-lived products. To verify this, we perform a robustness check by repeating the behavioural analysis of subsection 3.4 on the three labelling treatments, distinguishing fresh products, ready meals, long-life products and frozen products (Table 10). We find no such evidence. On average, participants keep 68% of the fresh products from their reference basket in their label basket. This proportion is 60%, 66% and 59% for ready meals, long-life products and frozen products, respectively. No pairwise comparison shows a significant difference (Mann–Whitney test). Similarly, the GHG impact per substitution is not in favour of fresh products (−296 g vs. −436 g, −402 g and −312 g). Again, there is no significant difference. Purchases therefore do not appear to be distorted by demand effects.
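The theoretical argument can be made explicit with a minimal expected-utility sketch. The notation is illustrative and not part of the experimental design: \(u(\cdot)\) is an assumed utility function over purchased items and \(u_{0}\) an assumed utility of not purchasing. With a common probability \(p\) that a chosen item is available for purchase,

\[
EU(x) - EU(y) = \bigl[p\,u(x) + (1-p)\,u_{0}\bigr] - \bigl[p\,u(y) + (1-p)\,u_{0}\bigr] = p\,\bigl[u(x) - u(y)\bigr],
\]

which has the same sign as \(u(x) - u(y)\) for any \(p > 0\), so the ranking of \(x\) and \(y\) is preserved when \(p = 1/4\). With item-specific beliefs \(p_{x}\) and \(p_{y}\),

\[
EU(x) - EU(y) = p_{x}\,\bigl[u(x) - u_{0}\bigr] - p_{y}\,\bigl[u(y) - u_{0}\bigr],
\]

which can be negative even when \(u(x) > u(y) > u_{0}\), provided \(p_{x}\) is sufficiently small relative to \(p_{y}\). In the limit \(p_{x} = 0\), \(EU(x) = u_{0}\) whatever the item, so choosing \(x\) carries no consequence; this is the sense in which the purchase becomes hypothetical.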

Table 10 Description of the changes between the reference and the label basket and the impact per substitution on GHG for the three labelling treatments—individual means

4 Conclusion and Discussion

In this study, we identify the causal effects of providing food labels that inform consumers of environmental impacts. To do this, we construct an experimental store that offers 282 food items, and we conduct an experiment that allows us to observe actual purchasing behaviour with and without the presence of environmental labels. Most closely related work relies instead on hypothetical choice surveys or observational data. In contrast to surveys, our experimental approach avoids issues associated with hypothetical bias. Relative to observational studies, our methods allow us to study labels that do not currently exist in markets, and our methods control for other factors that may vary between labelled and unlabelled products. In contrast to other experimental studies based on real purchases, we employ a large set of products, and thus our setting better represents a natural shopping experience.

The experimental food set has well-known characteristics. First, food of animal origin has a much greater impact on the environment than food of plant origin (Carlsson-Kanyama and Gonzalez 2009; Masset et al. 2014; Wickramasinghe et al. 2013). Second, the heterogeneity of the environmental impact is also much greater for animal products (Wickramasinghe et al. 2013). This implies that a transition to a vegetarian diet is not necessary to reduce the environmental impact of diets, despite what some authors have advocated (Berners-Lee et al. 2012; Reijnders and Soret 2003; Stehfest et al. 2009; van Dooren et al. 2014). We also find very strong correlations not only among our three environmental indicators (GHG, eutrophication and acidification) but also between GHG and both price and energy.

The results show that environmental front-of-pack labelling reduces the environmental impact of food baskets. All formats lead to a reduction in GHG emissions. This finding corroborates the results of Vanclay et al. (2011) and Vlaeminck et al. (2014), who also found, for a more limited set of products, positive effects of simplified labelling.

The mTL format appears to be the one that generates the largest decrease in GHG emissions. This finding was not expected: simplicity usually improves label effectiveness (Temple and Fraser 2014), yet mTL is multi-criteria and thus more complex than sTL. Muller and Ruffieux (2018) used the same experimental store to test several nutritional labels. They found that sTL induced more policy-oriented responses than did mTL. The main difference that could explain these apparently contradictory results is that the nutritional criteria (sugar, fat and salt) are uncorrelated. Conversely, we have seen that GHG, eutrophication and acidification are strongly correlated. As a result, nutritional mTL displays different colours for a given product, whereas environmental mTL most often displays the same colour three times. Nutritional mTL thus forces consumers to make trade-offs, while environmental mTL generates more saliency.

KM is mono-criterion and numerical; it enables participants to compare any two of the products available. In comparison, mTL and sTL only distinguish products that belong to the same food category and whose environmental impacts differ enough to receive different colours. Still, KM does not do a better job than traffic lights. Although each item replacement results in a greater GHG reduction with KM than with mTL and sTL, participants make fewer changes to their initial shopping baskets. This highlights once more the superiority of a simple colour-coded logo over a more informative numerical one: consumers comply with normative signals. On the other hand, they are more reluctant to continuously make the efforts needed to make good use of extensive and positive information. We cannot rule out that KM’s relative underperformance is explained in part by consumers’ difficulty in appreciating the equivalence between car kilometres travelled and CO2 emissions. Consumers are already unfamiliar with carbon footprint labels and may misinterpret them (Hartikainen et al. 2014; Sirieix et al. 2013); further studies are needed to determine whether expressing CO2 emissions as an equivalence adds to consumer confusion.

We have argued that there are many advantages of using a laboratory grocery store to identify the effects of food labelling (more control, presence of counterfactuals, reproducible context, etc.). Nonetheless, there are also limitations. Most prominently, the introduction of labels in the experiment is obvious to participants, which, relative to an actual supermarket, likely increases the attention paid to them. Further, there is the potential for an experimenter demand effect; i.e., participants form an interpretation of the experiment’s purpose and modify their behaviour in response. It is therefore likely that an experiment such as ours provides an upper bound on the effects of introducing environment-related food labels.