1 Introduction

When following a formal, prescriptive model of strategic management (Ansoff 1984; Dyson 2000; Mintzberg 1990), successful strategy implementation is no less important than designing the strategy. Failed implementation of successful strategies results in lost opportunities for higher organizational performance. However, it has long been bemoaned that managerial attention on the implementation stage is low and the failure rate is high (Alexander 1985; Kaplan and Norton 2004b, 2008; Mankins and Steele 2005; Sterling 2003), although recent research has shown that failure rates of up to 90 % are likely to be overestimations (Cândido and Santos 2015). Modern, balanced performance management concepts—for instance, the balanced scorecard (BSC)—have been increasingly positioned as a remedy for the strategy implementation trap. Over the years, BSC inventors Kaplan and Norton have put increasing emphasis on the scorecard’s role as “the cornerstone of a new strategic management process” (1996b, p. 75). They maintain that use of a BSC will increase an organization’s ability to execute its strategy and therefore ultimately improve its performance. However, the research topic proposed by Ittner and Larcker as early as 1998, namely to determine the net economic benefit from the use of non-financial measures and the BSC, still seems to lack a clear answer. While various studies have addressed this issue, mixed outcomes are reported (Biggart et al. 2010; Bryant et al. 2004; Davis and Albright 2004; De Geuser et al. 2009; Hoque and James 2000; Iselin et al. 2008; Ittner et al. 2003; Maiga and Jacobs 2003; Malina and Selto 2001; Sim and Koh 2001; Tapinos et al. 2011).

Nevertheless, based on a comprehensive review of the strategy implementation literature, Atkinson (2006) comes to the conclusion that a BSC can indeed play an important role in strategy implementation when used as a strategic control system (Bungay and Goold 1991) that facilitates translating strategy into decision making and action. Therefore, this study is based on the idea that strategy implementation is a decision-making task. It requires a sequence of multiple and interdependent decisions in a complex dynamic environment—which exactly defines a dynamic decision-making or complex problem-solving task (Brehmer 1992; Edwards 1962; see footnote 1). For instance, implementing a growth strategy that relies on superior service quality requires deciding on the number of hires per month. In addition to psychological factors, such as intelligence and knowledge, that are known to affect dynamic decision-making performance, this study focuses specifically on the impact that a scorecard cockpit has on individual managers’ dynamic decision making while attempting to “translate strategy into action” (Kaplan and Norton 2004b, p. 52). Such a cockpit assembles and visualizes the key indicators that are included in the BSC metrics—as illustrated, for instance, by Kaplan and Norton (2001, p. 221). It comprises the adequate “battery of instrumentation” (Kaplan and Norton 1996a, p. 2) that managers need to guide their companies towards future success. The scorecard cockpit displays the measures that have been chosen during the BSC development process in accordance with the company’s strategy and four balanced perspectives: financials, customers, internal business processes, and learning and growth. Very important to the BSC concept is the fact that the measurement system and strategy are strongly connected (Kaplan and Norton 1996a, p. 148).
Measures should never be selected for a BSC cockpit simply because they have already been used in the organization or because they seem to be fashionable; instead, the only reason for them to appear on a scorecard should be that they are the best indicators available for measuring the associated strategic objectives (Kaplan and Norton 1996a, p. 62). Hence, being strongly linked to strategy, a BSC cockpit can be expected to contribute to addressing well-known strategy implementation issues, such as communication, middle-management issues, unclear priorities and targets, undefined actions, insufficient coordination, and inadequate performance monitoring (Atkinson 2006).

Software vendors, consultants, and practically focused books have been actively promoting scorecard cockpits, dashboards, or war rooms for several years (e.g., Alexander 2007; Daum 2006; Eckerson 2011; Person 2008). There, it is implicitly assumed or even advertised that these instruments are important decision support tools worth considerable investment. However, beyond anecdotal support, there is a lack of substantial evidence for this claim. Empirical research that challenges vendors’ and consultants’ sales arguments and investigates the impact of such cockpits on decision-making performance in general is absent. More specifically, to the best of my knowledge, no study addresses a BSC cockpit’s effect on strategy implementation success while also taking important psychological factors into account—intelligence and knowledge.

There is ample evidence that human performance in complex problem solving is generally poor (e.g., Dörner 1980; Dörner et al. 1994; Moxnes 1998; Reichert and Dörner 1988; Sterman 1989; Strohhecker and Größler 2013; Wittmann and Hattrup 2004). Merely implementing a strategy through a series of operational decisions might be a less complex problem than achieving both—developing a potentially successful strategy and successfully implementing it. Nevertheless, it is still a severe challenge for human decision makers (Lane 1999), and failure rates are high (Strohhecker and Größler 2012). Psychological research indicates that general cognitive ability (G) and knowledge (K) relate to dynamic decision-making performance (Ackerman 1996). Therefore, it could be expected that these two factors also affect decision makers’ performance in the more operational setting of implementing strategies, which is not investigated in traditional dynamic decision-making research.

This study aims to contribute to dynamic decision-making research as well as to strategy implementation and empirical BSC research by answering the following research question: Do intelligence, knowledge, and a BSC cockpit—designed according to principles and examples described in the BSC literature and closely linked to a well-defined strategy—really improve the strategy-implementing decision-making performance of bounded-rational individual decision makers?

As a research method, the laboratory experiment is used following psychological and dynamic decision-making research (Brehmer and Dörner 1993; Capelo and Dias 2009; Lipe and Salterio 2000; Tayler 2010). This has the important advantage that the myriad confounding variables that can substantively impact any results from a field study can be controlled (Sprinkle and Williamson 2007). Most importantly, by using a completely deterministic business simulation without any random variables in the laboratory, potentially broken cause-and-effect links between decision making, decision implementation, and performance results are eliminated. Therefore, financial performance indicators, such as economic value added (EVA), are the undistorted result of the actions taken. In addition, these actions are direct and unbiased implementations of the decisions made, as implementation requires from the individual participants no more than entering some numbers and pressing a button.

The remainder of the paper is organized as follows. The subsequent section reviews related literature and develops the research hypotheses. Section three outlines the research method and design of the laboratory experiment used to test the hypotheses. The results of the statistical analysis of experimental data are presented in section four, followed by a discussion in section five. The paper concludes by highlighting its contribution, discussing some limitations, and providing directions for further research.

2 Theory and hypotheses

Kaplan and Norton’s BSC concept has evolved over time from a performance measurement system to a strategic management instrument (Kaplan and Norton 1992, 1996a, 2001, 2004b, 2006, 2008). From a present-day perspective, it consists of two major components. The first is a balanced performance measurement system with a comprehensible number of indicators allocated to four perspectives. Kaplan and Norton (1992, 1996a) recommend between four and seven measures per perspective, and between 16 and 25 measures for the whole scorecard. The four perspectives are as follows: financial perspective, customer perspective, internal process perspective, and learning and growth perspective. The second component is a strategy map that describes the organization’s strategy by highlighting and visualizing the cause-and-effect relationships between the strategy’s major components (Kaplan and Norton 2004b). Both components have to be closely linked in the sense that measures should clearly operationalize strategic objectives.

In their various articles and books (Kaplan and Norton 1992, 1996a, b, 2001, 2004a, b, 2006), Kaplan and Norton maintain that use of the BSC concept will ultimately improve an organization’s performance. They discuss a variety of ways in which the BSC concept contributes to attaining such improvement. They argue that the BSC “gives top managers a fast but comprehensive view of the business” while minimizing “information overload by limiting the number of measures used” (Kaplan and Norton 1992, p. 72). Additionally, they maintain that the BSC links “a company’s long-term strategy with its short-term actions” (Kaplan and Norton 1996b, p. 75). They see this linkage as established through the BSC “translating the strategy to operational terms” and “creating strategic awareness” among employees (Kaplan and Norton 2001, p. 9–11). Strategy can be communicated “to the front lines” (Kaplan and Norton 2001, p. 246), meaning that the “set of hypotheses about cause and effect” (Kaplan and Norton 1996a, p. 149) that constitutes a strategy becomes transparent to all decision makers in an organization. Atkinson (2006) corroborates Kaplan and Norton’s arguments. She concludes that the BSC can address a range of common strategy-implementation issues and proposes it as an effective tool against communication deficits, middle-management issues, unclear priorities and lack of coordination, opaque targets, and a lack of translation of strategic intent into managerial actions.

Focusing on the decision-making-related aspects and using the control perspective, as widely suggested in management accounting and operations research, strategy implementation can be seen as a simple, first-order control task (e.g., Dyson 2000; Otley 2003; Sterman 1994). Assuming that (long-term) targets and strategy are derived from a more comprehensive strategic development process, an individual’s strategy-implementing decisions are influenced by how the decision maker perceives and processes information about the real system, the targets, and the strategy provided. A gap between the perceived actual and target state of the system would result in decisions guided by the strategy that are translated into actions to close the gap (Fig. 1). According to dynamic decision-making theory, good decision making in dynamically complex settings requires that the “whole process of action regulation” (Dörner and Schaub 1994, p. 434) is conducted successfully: (1) goal elaboration, (2) hypothesis formation, (3) prognosis, (4) planning, (5) monitoring, and (6) self-reflection. An appropriate mental model comprising a clear system of targets, an adequate set of causal hypotheses about the system’s causal structure, and an unambiguous understanding of the strategy improves forecasting of the potential actions’ consequences, balancing of pros and cons, and making a choice. It also supports monitoring the results of previous actions and learning from one’s own past mistakes. According to experimental research, more accurate mental models do indeed result in better decision-making performance (Capelo and Dias 2009; Gary and Wood 2011; Ritchie-Dunham and Puente 2008).

Fig. 1

Strategy implementation as a first-order control task
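The first-order control logic of Fig. 1 can be illustrated with a minimal sketch. This is purely a didactic illustration, not part of the eHypo simulator: the gain parameter, variable names, and numbers are all assumptions; the only idea taken from the text is that a perceived gap between actual and target state triggers a corrective, strategy-guided action.

```python
def control_step(perceived_state, target, gain=0.5):
    """One first-order control decision: act to close a fraction of the
    perceived gap between actual and target state (gain is assumed)."""
    gap = target - perceived_state
    return gain * gap  # corrective action derived from the gap

# Hypothetical example: steering head count towards a strategic target
state = 100.0   # perceived actual state (e.g., current number of employees)
target = 140.0  # long-term target derived from the strategy
for quarter in range(8):
    state += control_step(state, target)
print(round(state, 1))  # → 139.8 (the gap shrinks geometrically)
```

The sketch shows why such a task is dynamically non-trivial: each decision changes the state that the next decision is based on, so errors in the mental model of the gap or the system propagate over time.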

Using a BSC cockpit has the potential to improve a decision maker’s mental model in at least three ways. First, by presenting measures that are directly related to its strategic themes, the strategy is made explicit and operational. Decision makers always have the means by which abstract themes are concretized before their eyes—for instance, the fact that the strategic theme “increase operational excellence” is measured in sales revenue per full-time equivalent employee. Second, by restricting the number of indicators included in a BSC cockpit to 20–25, decision makers’ information load is limited. While still above the magical number seven plus or minus two information chunks that human beings can process simultaneously (Baddeley 1994; Miller 1956), decision makers’ information load is nevertheless reduced when compared to the multiple indicators presented in traditional reports. At the same time, a balanced mixture of indicators from four different perspectives minimizes the risk of mistakenly ignoring unintended consequences of previous decisions because information feedback is too focused (Sterman 1994). Third, the four BSC perspectives—and with them their strategic themes and associated measures—are linked by the following logic: improvements in learning and growth measures will be followed by advancements in internal process key performance indicators (KPIs), which will result in better customer perspective indicators and finally translate into higher financial success. Embedding the strategy’s strategic themes and measures in this overall cause-and-effect chain clarifies the causal relations between them and improves decision makers’ mental models of the strategy to be implemented. This reasoning leads to the following hypothesis:

H1a:

Individual managers’ use of a BSC cockpit (carefully linked to strategy and designed according to principles and examples described in the BSC literature), rather than a traditional report cockpit, in their strategy implementation decision-making process is positively associated with strategy implementation performance.

On the other hand, relying on a BSC cockpit in strategy implementation could also come with dangers. Decision makers without adequate knowledge of the BSC concept may find the cockpit design unusual and alien; therefore, they may not be able to make use of the structural information that the cockpit provides. They may not realize the implicit cause-and-effect relationships between the performance indicators. Consequently, the reduction in the number of measures compared to traditional reports might be harmful, even though cognitive load is reduced. As a consequence, an alternative hypothesis is stated as follows:

H1b:

Individual managers’ use of a BSC cockpit (carefully linked to strategy and designed according to principles and examples described in the BSC literature), rather than a traditional report cockpit, in their strategy implementation decision-making process is negatively associated with strategy implementation performance.

Psychological research indicates that an individual’s performance in dynamic decision-making tasks (such as the strategy implementation task investigated in this study) is also influenced by personal traits—most importantly, intelligence and knowledge (Ackerman 1996). General cognitive ability, along with its lower-order factors, is among the variables that relate most strongly and positively to dynamic decision-making performance—as long as it is measured carefully (Beckmann and Guthke 1995; Wittmann and Hattrup 2004). Wittmann and Hattrup (2004) emphasize that especially the intelligence sub-factor “reasoning with numbers” is an important predictor of dynamic decision-making performance. Implementing a strategy in a business context naturally requires dealing with numbers and developing a thorough understanding of relations between them. Higher intelligence allows participants to build better mental models of the (numerical) targets that they are given, the (numerical) feedback that they receive, and the (verbally and numerically described) strategy that guides their decision making. It is therefore hypothesized that:

H2:

The higher decision makers’ general cognitive ability, the better they perform in strategy implementation.

Knowledge has long been recognized as having a positive impact on managerial performance, and as differentiating experts and novices (Wagner and Sternberg 1985). Additionally, Krems (1995) argues that knowledge in a domain increases cognitive flexibility in problem solving in that domain and, thus, leads to higher performance in decision-making tasks. Wittmann and Hattrup (2004) find that computer-game-related knowledge is the strongest predictor of performance in the game. Moreover, general economic knowledge also has a positive impact. Therefore, it is reasonable to assume that participants with higher levels of domain- and task-related knowledge can build better mental models, which leads to better decision-making performance. Specifically, if decision makers know more about how to interpret financial and non-financial indicators, they will be better at analyzing the current state of the system and have a more accurate mental representation. Therefore, the following hypothesis is stated:

H3:

The higher the decision makers’ (general and performance-measurement-related) knowledge, the better they perform in strategy implementation.

Whether gender makes a difference in dynamically complex decision making is investigated and discussed in various behavioral research streams (Dörner 1996; Eckel and Grossman 2008; Gallagher and Kaufman 2005; Wittmann and Hattrup 2004); the outcomes are mixed, but suggest that the impact of gender on decision-making performance should at least be controlled.

3 Research methodology

Data for testing the causal hypotheses introduced in the previous section could, in principle, be gathered from case studies, surveys, and field and laboratory experiments. The research methodology literature (Cooper and Schindler 2011; Trochim and Donnelly 2008; Zikmund 2012) ranks experimental field studies first in terms of internal and external validity. Therefore, this would be the method of choice in this research, which aims to establish cause-and-effect relationships. However, there are several obstacles that prevent using an experimental field study. Since field studies involve experiments in natural settings, random events and complexity are the most severe issues. In order to isolate the causal relationships between an organization’s usage of a BSC cockpit and its strategy implementation success, other factors also impacting performance have to be controlled. However, organizations face randomness and a complex network of cause-and-effect relationships that affect their strategy implementation performance. It is very difficult—if not impossible—to keep track of all those possibly disruptive factors (Sprinkle and Williamson 2007, p. 416). Therefore, a true and valid field experiment would either be extremely costly or irreproducible.

In contrast to most other empirical BSC studies mentioned above, which use the field approach or quasi-experimental designs, this study used the experimental laboratory method. More precisely, a randomized two-group design was implemented (Trochim and Donnelly 2008). With this design, all conditions are identical for both experimental groups, with one exception: the first group is exposed to the treatment of a BSC cockpit, while the second is provided with a traditional report cockpit. Additionally, pre- and post-trial questionnaires are administered to gather information on the participants’ pre-experience and knowledge, as well as on their self-assessment and opinions.

While internal validity of the classic randomized two-group design is high, the external validity of a laboratory experiment is always problematic (Levitt and List 2007a, b). The inevitable artificiality of the laboratory might limit the generalizability of the results. In this study, external validity is improved by designing the experiment carefully and choosing a task that is as close to reality as possible. Therefore, following dynamic decision-making and system dynamics research (e.g., Brehmer and Dörner 1993; Dörner 1996; Paich and Sterman 1993; Wittmann and Hattrup 2004), strategy implementation was designed as a dynamically complex, feedback-rich, and path-dependent decision-making task.

3.1 Strategy implementation task

Implementation of a strategy has to happen through a series of operational decisions. Although in reality more or less all management levels are involved, top-management decisions play an important role in the implementation process. Therefore, this study focuses on top-management decision making; as a consequence, participants in the experiment had to act as top managers. They were given a virtual 10-year (40-quarter) contract for the position of managing director (CEO) of a recently founded mortgage brokerage business called eHypo. Their main task was to successfully implement eHypo’s existing and very ambitious organic growth strategy by repeatedly making decisions on prices and resources. They did not have the option to develop a new strategy themselves. The business concept, long-term targets, strategy, and means of intervention were set by the capital owners. However, as eHypo’s CEO, the participants had complete control of the three strategy implementation levers available, which could be adjusted quarter by quarter. These levers included (i) the target commission as price control parameter, (ii) the target number of employees, and (iii) the expenditures for developing the business concept and technology (investment in technology). Quarterly marketing expenditure was determined by a simple decision rule: 2.6 % of forecasted sales revenues were spent on marketing.
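
The decision structure just described can be summarized in a short sketch. The type and function names are invented for illustration; the only elements taken from the text are the three levers and the fixed 2.6 % marketing rule.

```python
from dataclasses import dataclass

@dataclass
class QuarterlyDecision:
    """The three levers available to participants each quarter
    (field names are assumptions, not taken from the simulator)."""
    target_commission_pct: float  # (i) price control parameter
    target_employees: int         # (ii) resource lever
    tech_investment_eur: float    # (iii) investment in technology

def marketing_budget(forecast_sales_eur: float) -> float:
    """Fixed decision rule from the instructions:
    2.6 % of forecasted sales revenues are spent on marketing."""
    return 0.026 * forecast_sales_eur

# Hypothetical quarter: €1 million in forecasted sales
decision = QuarterlyDecision(1.12, 25, 46_000.0)
print(round(marketing_budget(1_000_000), 2))  # → 26000.0
```

Note that marketing expenditure is not a fourth lever: it is tied mechanically to the sales forecast, so participants influence it only indirectly through the three decisions above.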

eHypo’s vision, as defined by the capital owners, was described as follows: “In 10 years we want to significantly increase the wealth of our shareholders and at the same time retain our independence. To achieve this, we want to be the very best of residential lending brokers taking the lead in market share, income, profitability and awareness levels.” The vision was operationalized by the following set of strategic goals that the participants should achieve simultaneously by making good implementation decisions. Sales revenue was to grow from €0.3 to €31 million per quarter within a 10-year time-frame, while maintaining profitability throughout the period and keeping eHypo independent. Return on sales (ROS) was to be greater than or equal to 10 % all (or at least most of) the time, and eHypo’s market share in the mortgage brokerage business was to grow to 20 %. EVA—calculated following standard practice (see footnote 2)—was to reach at least €40 million by the end of year 10. At that point, eHypo should have been in a solid state, allowing continuation of the business without recapitalization. In addition to this set of strategic goals, an aggregated performance score (P) was provided. The participants were informed that this score was highly positively related with the detailed goals, and that values of 10,000 and higher could be achieved, indicating excellent strategy implementation success. Without revealing the mathematical formula, the instructions described P as the weighted sum of EVA and cumulative absolute ROS variance, which was then multiplied by the attractiveness index that covers the going concern principle. Compared to the EVA measure, P reacts more sensitively to opportunistic end-of-game behavior: for instance, cutting back investments and increasing prices in the last few quarters.
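
The verbal description of P translates into the following structural sketch. Since the paper deliberately withholds the actual formula, the weights below are invented for illustration; only the structure (weighted sum of EVA and cumulative absolute ROS variance, multiplied by an attractiveness index) comes from the text.

```python
def performance_score(eva_eur_m, cum_abs_ros_variance, attractiveness_index,
                      w_eva=250.0, w_var=-100.0):
    """Structure as described in the instructions: a weighted sum of EVA and
    cumulative absolute ROS variance, multiplied by the attractiveness index
    covering the going concern principle. The weights w_eva and w_var are
    invented here; the true values are not revealed in the paper."""
    return (w_eva * eva_eur_m + w_var * cum_abs_ros_variance) * attractiveness_index

# Hypothetical benchmark-like outcome: EVA of €40 million, steady ROS,
# a healthy going-concern position (attractiveness index of 1)
print(performance_score(40.0, 0.0, 1.0))  # → 10000.0
```

The structure explains why P punishes opportunistic end-of-game behavior: slashing investments in the last quarters inflates EVA but increases ROS variance and degrades the going-concern (attractiveness) factor.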

Instructions handed out to the participants included the eHypo strategy paper, which discussed 14 strategic issues that were regarded as important for successfully implementing the growth strategy. It also contained Fig. 2, showing causal links between the 14 strategic issues, providing something close to a strategy map (Kaplan and Norton 2004a). Thereby, all participants had identical and comprehensive information on the strategy to be implemented.

Fig. 2

eHypo’s strategy map

Critical to the growth strategy described in the instructions was the accumulation of eHypo’s key resources: staff, technology, employee know-how, and brand awareness (Dierickx and Cool 1989). This growth process should be initiated and maintained without jeopardizing service quality and customer satisfaction in order to avoid the trap of the growth and underinvestment archetype (Senge 1990). eHypo was described as preferring a differentiation strategy over a cost-leadership strategy. High service quality guaranteed by well-trained employees and up-to-date technology were to provide the possibility of escaping sole price competition. Logically consistent with this, the pricing strategy was set to sustain a medium to high price level, compared to competitive mortgage brokers. Since customers were described as increasingly price sensitive, decreasing prices over time could nevertheless be expected.

In the experimental session, a computerized business game built on system dynamics principles (Sterman 2000) and specifically developed according to the eHypo case was used (Strohhecker and Größler 2012). By design, the instructions and the strategy map reflected the causal relationships incorporated in the game. The game was completely deterministic: two identical simulation game runs—specifically, two identical sequences of decisions over 40 quarters—led to exactly the same outcome. It was also ensured that participants could successfully implement the strategy within the simulator and fulfill or even outperform the ambitious long-term goals set by the owners. Consequently, participants were in a more comfortable situation than real managers. They neither had to react to random events nor deal with the question of whether the strategy itself represented a winning or losing proposition.

Due to the complexity and nonlinearity of the system representing eHypo’s business model and environment, an optimal solution cannot be determined analytically. Using a set of common bounded-rational business policies, however, a successful implementation of the strategy set by the owners can be derived by simulation-based policy optimization (Coyle 1985). First, investment in technology was derived from a decision rule that calculated expenses for technology as 4.6 % of forecasted sales (see footnote 3). Second, the target number of employees was calculated from forecasted inquiries multiplied by forecasted employee productivity. Lastly, the target value for the commission was based on a policy that started with a reference value of 1.12 %, which was adjusted to market maturity (influencing customers’ sensitivity to price) and average capacity utilization. Figure 3 shows the time patterns resulting from these policies.
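
The three benchmark policies can be sketched as simple decision rules. The function names and the multiplicative form of the commission adjustment are assumptions; the 4.6 % rate and the 1.12 % reference value come from the text.

```python
def tech_investment(forecast_sales_eur):
    """Policy 1: technology expenses as 4.6 % of forecasted sales."""
    return 0.046 * forecast_sales_eur

def target_employees(forecast_inquiries, staffing_need_per_inquiry):
    """Policy 2: forecasted inquiries times the forecasted per-inquiry staffing
    need. (The text says 'employee productivity'; for the units to work out,
    the factor must be expressed in employee-quarters per inquiry.)"""
    return forecast_inquiries * staffing_need_per_inquiry

def target_commission(maturity_adj, utilization_adj, reference_pct=1.12):
    """Policy 3: reference commission of 1.12 %, adjusted for market maturity
    and average capacity utilization; the multiplicative form is assumed."""
    return reference_pct * maturity_adj * utilization_adj

# Hypothetical quarter: €1 million forecasted sales, a maturing market
print(round(tech_investment(1_000_000), 2))    # → 46000.0
print(round(target_commission(0.95, 1.0), 3))  # → 1.064
```

Such rules are "bounded-rational" in the sense that each lever reacts to only a few forecasted quantities rather than to a full optimization of the nonlinear system.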

The participants were able to successfully implement the growth strategy by increasing the target value for employees and the investments in technology in an S-shaped manner, as depicted in Fig. 3. Simulation of the strategy implementation decisions shown in Fig. 3 leads to successful execution of the intended growth strategy in terms of all relevant measures, as Fig. 4 demonstrates. Cumulative EVA would exhibit a hockey-stick shape, which is typical for many start-up organizations. Sales revenue and market share would show a typical S-shaped growth pattern, while ROS could be steadily increased to the target level of about 11 % after an initial period of fast growth and stagnation. The Pearson correlation table in the bottom right of Fig. 4 shows that all measures used to evaluate strategy implementation performance are highly and significantly correlated.

Fig. 3

Decisions leading to the benchmark strategy implementation scenario

Fig. 4

A successful implementation of the growth strategy used as a benchmark

Fig. 5

eHypo simulator’s report-based management cockpit

3.2 Laboratory experiment

To allow thorough reading of the 12 pages of eHypo instructions, the document was handed out to participants one week before the experiment. It was the only input given in advance. The business game in the laboratory was conducted using the eHypo simulator software (Strohhecker and Größler 2012). Having made and entered their decisions, participants could continue by clicking a button and simulating one quarter ahead. The outcomes of their decisions were computed, and the updated values for all measures and the overall performance score were displayed. A simulation run included 40 quarters with decisions to be made in each quarter on each of the three decision variables.

The eHypo simulator included the possibility to show the gaming results to the individual participants by two different means: traditional reports (Fig. 5) and a BSC measurement system (Fig. 6). As can be seen, the BSC management cockpit reduces the number of measures displayed compared to the reports cockpit. In accordance with Kaplan and Norton’s (1996a) recommendation, eHypo’s BSC includes 23 measures, while the report cockpit shows 52 indicators. Additionally, the BSC cockpit organized the information in a different way. The figures are linked to eHypo’s strategic themes, which are themselves related to the BSC’s typical four perspectives—learning and growth, internal processes, customers, financials. Embedded in the BSC concept is the idea of cause-and-effect chains stretching over these four perspectives (Kaplan and Norton 1996a, pp. 30–31). Investments in learning and growth measures are assumed to improve internal processes, which positively affects customers. Satisfied customers are supposed to buy more frequently, in larger quantities, and/or to spread word-of-mouth more intensively. All this should result in improved financial measures. The cockpit’s design is strongly influenced by Kaplan and Norton’s (2001, p. 221) proposition for a monthly BSC report. Following this example, the eHypo BSC cockpit organizes the perspectives along this overall causal idea, starting with the learning and growth perspective at the bottom and ending with the financial perspective at the top.

Fig. 6

eHypo simulator’s BSC management cockpit

The report cockpit in Fig. 5 includes an income statement and a balance sheet as classic ways to structure and communicate information. Six additional reports focus on business development, research and development, cash flows, customer feedback, growth potential, and EVA. As a consequence, the report cockpit provides more detailed information, although this information is not specifically related to the strategy to be implemented.

To exclude the impact of different visualization techniques on performance, which is investigated, for example, by Coll et al. (1994) and Harvey and Bolger (1996), both cockpits show numbers only. Both show the values for the actual and previous quarter for each figure.

In May and June 2010, five experimental sessions were performed, involving a total of 133 participants. Second-semester students enrolled in full- and part-time Bachelor of Business Administration programs at a private German business school served as participants. All laboratory sessions were integrated in a course on Managerial Accounting, which also covered the BSC concept. The experiment took place in the second to last lecture of the course. Participants were separated according to gender and then assigned randomly to the two different treatments. This ensured that both experimental groups included a similar ratio of women to men. The first group only had access to the BSC cockpit and did not have the traditional form available; the second group was equipped with simulators that only showed the traditional reports. Of course, both groups had the same starting situation in terms of KPI values. However, they always saw the information in their specific cockpit format only.

For both groups, descriptive data on the participants as well as information on their prior knowledge were gathered. Knowledge domains potentially relevant for this study include strategy implementation, performance measurement, the BSC concept, general business knowledge, and computer knowledge. With the exception of performance measurement, knowledge differences were assumed to be negligible. Therefore, only instruments assessing the participants' knowledge of and experience in dealing with performance measures were included in the pre-game questionnaire. A second questionnaire was used to gather post-game assessments (see Table 5 in the appendix).

The simulation game was conducted in two different labs, one for each treatment, which eliminated the risk of information exchange between groups. To incentivize the participants, they were told that their performance in the simulation experiment would have a small impact on their course grade (e.g., Guala 2005). The best students could earn up to 5 % of the total points awarded for the course: the higher the aggregate performance measure P in the best of the first three simulation runs in quarter 40 (MXR_P@40), the more points were given.

Measuring general cognitive ability with a standard test such as the Wechsler Adult Intelligence Scale (WAIS–IV) is time consuming. Instead of overloading the pre-trial phase with such a lengthy test, students' entrance assessment-center data were retrieved from the school's database. These included tests with tasks similar to those in intelligence inventories that aim to measure numerical ability and reasoning (G_NR).Footnote 4 Test results were available on a scale from 1 to 10. In addition, the Abitur grade (AG) was used as a proxy for cognitive ability. Admittedly, these data stem from archival sources and are a few years older than the data gathered in the laboratory. However, psychological research shows that general cognitive ability is rather stable over long time periods (Larsen et al. 2008; Lyons et al. 2009), which seems to justify their use.

Participants' knowledge about performance measures was assessed using two scales from the pre-game questionnaire. First, participants provided a self-assessment of their experience in dealing with financial measures (K_KPI_SA) on a ten-point Likert scale (0 = no experience at all, 9 = highly experienced). Second, specific knowledge about financial KPIs was assessed using a set of eight exam-like questions that the participants had to answer before the treatment (K_KPI).Footnote 5 Gender (MALE), age (measured in years, AGE_Y), and participation in an integrated degree program (PTS), which served as control variables, could easily be retrieved from the school's records, as participants used their student ID number as identification.

Participants in both groups were given a generous time frame of 180 min, allocated as follows: about 30 min for the introduction and for questions and answers; about 30 min for the pre-treatment task; about 105 min for the simulation game; and about 15 min for the post-treatment questionnaire. Within the game, participants could become insolvent and therefore fail; in this case, they were virtually laid off (the simulation was stopped). However, more than one run covering a maximum of 40 quarters each was allowed. While participants could restart the simulation as often as they wanted, they were instructed that only the results from the first three runs would be evaluated and included in the incentive scheme. Three runs seemed a reasonably high number to avoid counting failures that were solely or mainly attributable to faulty operation of the simulator software. On the other hand, the maximum of three valid simulations limited the risk that video game syndrome would distort data collection (Sterman 2006). The number of simulation runs and the duration of each simulation were recorded together with all other results in the simulation data file. At the end of the allotted time, the data files with the simulation results were collected so that the relevant data could be extracted.

6 Presentation of results

Of the 133 participants, 40 (30.1 %) were female and 93 (69.9 %) male. A total of 95 participants studied part-time and were employed by a company for at least 50 % of their time (PTS = 1); 38 participants were full-time students. Furthermore, 125 participants were able to complete at least one simulation run successfully (that is, without going bankrupt); 97 participants (72.9 %) avoided bankruptcy completely (number of insolvencies, NI = 0); and 22 participants went bankrupt once (including one participant who conducted only one run). Seven participants went bankrupt twice and another seven three times (14 participants in total). Table 1 provides summary statistics on all variables included in the study. Post-trial questionnaire items are presented in Table 5 in the Appendix.

Table 1 Descriptive statistics on all variables included in the study

Decision-making behavior in the simulation, in terms of time spent per simulation and number of simulation runs conducted, varied among the participants. Mean time spent per simulation (MR_TPS) ranged between 5.97 and 47.33 min (mean: 22.94 min). The time spent on the run with the highest performance (TPS@MXR_P) varied even more: between 3.80 and 70.76 min. Both variables are significantly negatively correlated with the number of simulation runs (NR) conducted (see also Table 2). The majority of participants (57.1 %) ran the eHypo simulator exactly three times; very few completed fewer than three runs, while 39.1 % voluntarily ran more simulations than the three that were incentivized. As a consequence, the number of simulation runs was truncated to a maximum value of three (NR_TT3). Both TPS@MXR_P and NR_TT3 are included as control variables in the analysis.
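The truncation of the run count is a simple capping operation. As a minimal sketch (variable names NR and NR_TT3 follow the paper; the example values are invented):

```python
import numpy as np

def truncate_runs(nr, cap=3):
    """Cap the recorded number of simulation runs at the incentivized maximum (NR_TT3)."""
    return np.minimum(nr, cap)

# Invented example values for NR; the actual data are not reproduced here
nr = np.array([1, 2, 3, 4, 7])
nr_tt3 = truncate_runs(nr)  # → array([1, 2, 3, 3, 3])
```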

Table 2 Pearson correlations

Figure 7 provides an overview of the results the participants achieved in their runs, based on the highest performance P at the end of quarter 40 (MXR_P@40) out of the first three runs. Comparing these outcomes with the benchmark shown in Fig. 4, it is obvious that no single participant was able to implement the growth strategy as successfully as a set of relatively simple heuristics. The highest-performing participant achieved a performance score MXR_P@40 of 6,673.95; mean performance was 128.25. Considering EVA totaled over all 40 quarters [cumulated economic value added (EVAC)], 90 out of 125 (72 %) non-bankrupt participants managed to create value. A total of 28 % of all participants destroyed company value, meaning that even their best run resulted in a negative EVAC. On average, 8,197.33 KEUR was accumulated. One reason for this underachievement is that the ambitious growth target set by the owners was rarely met. While the benchmark implementation resulted in sales revenue of 30,972.92 KEUR in quarter 40 and a market share of 20.68 %, participants achieved on average 13,443.41 KEUR in sales revenue and 8.14 % market share. Another reason can be found in the ROS measure: 73.6 % of the participants achieved less than the benchmark of 11.65 % in quarter 40. That the participants were on average less successful in implementing the growth strategy is not a matter of concern here. Given the research question, it is the degree of variation in strategy implementation performance that is of interest, not its shortfall in comparison to the benchmark.

Fig. 7
figure 7

Descriptive and correlation statistics on performance scores

To test hypotheses H\(_{1a}\) and H\(_{1b}\) (that a BSC cockpit increases/decreases performance) and hypotheses H\(_{2}\) and H\(_{3}\) (that intelligence and knowledge increase performance), BSC usage is dummy coded and the following straightforward regression is estimated (controlling for the number of simulation runs NR_TT3, the number of insolvencies NI, the time spent in the simulation run that resulted in the maximum aggregate performance TPS@MXR_P, part-time studies PTS, age AGE_Y, and gender MALE):

Model A:

MXR_P@40 = \(\beta_{0} + \beta_{1}\) BSC_Cockpit + \(\beta_{2}\) G_NR + \(\beta_{3}\) AG + \(\beta_{4}\) K_KPI_SA + \(\beta_{5}\) K_KPI + \(\beta_{6}\) NR_TT3 + \(\beta_{7}\) NI + \(\beta_{8}\) TPS@MXR_P + \(\beta_{9}\) PTS + \(\beta_{10}\) AGE_Y + \(\beta_{11}\) MALE + \(\varepsilon\)
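Model A is an ordinary least-squares regression with eleven regressors plus an intercept. A minimal sketch of how such a model can be estimated, using numpy's least-squares solver on synthetic stand-in data (all values are invented; only the number and rough nature of the regressors match Model A):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 122  # sample size after exclusions, as reported in the paper

# Synthetic stand-ins for the regressors of Model A (invented distributions)
X = np.column_stack([
    np.ones(n),                     # intercept (beta_0)
    rng.integers(0, 2, n),          # BSC_Cockpit dummy
    rng.normal(5, 2, n),            # G_NR
    rng.normal(2.3, 0.5, n),        # AG
    rng.integers(0, 10, n),         # K_KPI_SA
    rng.integers(0, 9, n),          # K_KPI
    rng.integers(1, 4, n),          # NR_TT3
    rng.integers(0, 3, n),          # NI
    rng.normal(25, 10, n),          # TPS@MXR_P
    rng.integers(0, 2, n),          # PTS
    rng.normal(22, 2, n),           # AGE_Y
    rng.integers(0, 2, n),          # MALE
])
beta_true = rng.normal(size=X.shape[1])      # invented coefficients
y = X @ beta_true + rng.normal(0, 0.05, n)   # stand-in for MXR_P@40

# OLS estimate of the coefficient vector
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With the small error variance chosen here, the estimated coefficients recover the invented true values closely; with real, noisy data, the interest lies in the signs and significance of the estimates rather than in exact recovery.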

As a first step in the statistical analysis, both the data generated in the laboratory experiments and those obtained from databases were carefully screened following the guidelines provided by Tabachnick and Fidell (2013). K_KPI_SA, K_KPI, and AGE_Y were found to be missing for one case, which was excluded from further analysis. Additionally, G_NR was found to be missing for 10 cases, AG for nine cases, and PO_I_8 for two cases. Listwise inclusion of cases in the regression or correlation analysis would have reduced the sample size to 113 or 112, respectively, which might have distorted the results. Therefore, missing values in G_NR, AG, and PO_I_8 were singly imputed with estimated values (G_NR_I, AG_I, PO_I_8_I) using the expectation-maximization method (e.g., Schafer and Graham 2002). Two cases with standardized residuals greater than 3 were identified and excluded as outliers, leaving an N of 122 for further analysis. A comprehensive set of Pearson correlations is shown in Table 2.
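The outlier rule can be sketched as follows. This is a simplified version using raw standardized residuals on invented data; the paper does not specify whether studentized residuals were used instead, so that detail is an assumption:

```python
import numpy as np

def exclude_outliers(X, y, threshold=3.0):
    """Drop cases whose standardized OLS residuals exceed the threshold."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Residuals have mean zero when X contains an intercept column
    z = resid / resid.std(ddof=1)
    keep = np.abs(z) <= threshold
    return X[keep], y[keep], keep

# Invented demonstration data with one planted gross outlier
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.05, 30)
y[0] += 10.0  # plant an outlier in the first case

X_clean, y_clean, keep = exclude_outliers(X, y)  # removes exactly the planted case
```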

Table 3 Regression results for models A–E

Table 3 shows the regression results for model A (and for the additional models B–E, explained below). Testing the typical assumptions for regression does not reveal any violations. Residuals are normally distributed (Kolmogorov–Smirnov test, Z \(=\) 0.686, p \(=\) 0.734), and, based on examination of the residuals plot, the assumptions of linearity and homoskedasticity can be considered met. The Durbin–Watson statistic (1.767) indicates that errors are independent. Based on a maximum variance inflation factor of 1.331, multicollinearity is not considered problematic.
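Two of these diagnostics are easy to compute directly. A sketch on invented, well-behaved data (numpy only; the Kolmogorov–Smirnov normality test would additionally require a statistics library):

```python
import numpy as np

def durbin_watson(resid):
    """Durbin–Watson statistic; values near 2 indicate independent errors."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def variance_inflation_factors(X):
    """VIF per regressor (column 0 is assumed to be the intercept and skipped):
    regress each column on all the others and compute 1 / (1 - R^2)."""
    vifs = []
    for j in range(1, X.shape[1]):
        others = np.delete(X, j, axis=1)
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1.0 - resid.var() / X[:, j].var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Invented data: independent regressors and iid errors, so DW should be
# close to 2 and all VIFs close to 1
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
resid = rng.normal(size=200)

dw = durbin_watson(resid)
vifs = variance_inflation_factors(X)
```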

While model A uses the best aggregate performance measure P in quarter 40 (MXR_P@40) as the dependent variable, models B–E use the sub-goals EVAC, sales revenues (SR), ROS, and market share (MS) as dependent variables. With the exception of model D, where residuals are not normally distributed (Kolmogorov–Smirnov test, Z \(=\) 1.903, p \(=\) 0.001), all typical regression assumptions are met.

Based on the regression results in Table 3, neither H\(_{1a}\) nor H\(_{1b}\) is supported. In all models, the dummy variable BSC_Cockpit has an insignificant impact on performance. This means that participants using a BSC management cockpit do not make significantly better strategy implementation decisions than participants in the group equipped with a cockpit showing more traditional reports. This finding is supported by the independent sample t-test results compiled in Table 4.

Table 4 T-test results for all variables included in the study

Hypothesis H\(_{2}\) was operationalized as follows: the higher the decision makers' numerical ability and reasoning, the better they perform in strategy implementation. The regression results in Table 3 support this. In model A, both proxies for cognitive ability, AG_I and G_NR_I, show a significant impact on MXR_P@40; in model B, G_NR_I has a significant impact on the cumulated economic value performance measure (EVAC). Similar results are obtained for the performance measurement knowledge constructs. Both self-assessed knowledge about financial indicators (K_KPI_SA) and knowledge assessed in a test (K_KPI) relate positively to MXR_P@40. K_KPI is also highly significant in models B, C, and E. Therefore, H\(_{3}\), which states that the higher the decision maker's knowledge, the better he or she performs in strategy implementation, can also be seen as supported.

All control variables show an at least weakly significant effect on the dependent variable in at least one model. Regarding the controls that capture certain aspects of the participants' decision-making behavior, the number of runs conducted (NR_TT3) and the time spent on the best run (TPS@MXR_P), opposite effects can be observed. Higher numbers of simulation runs translate into lower performance in models A and B. More time spent on a simulation run has a highly significant positive effect on performance in all models but D. Similarly, being enrolled in an integrated degree program (PTS), which allows gaining additional experience “on the job,” is beneficial in all models but D. Gender (MALE) is significant in models B, C, and E. The age of the participants has a negative effect that is weakly significant only in model A.

7 Discussion

This study shows that two different ways of presenting the financial and non-financial performance indicators that support individual participants in implementing an ambitious growth strategy do not make a difference. Organizing these indicators in a decision support cockpit according to BSC principles does not result in higher decision-making performance compared to providing them in a more traditional type of cockpit. However, it does no harm either. Merely using a different type of cockpit apparently does not change decision makers' mental models enough to make an impact. A range of personal factors, however, do show a significant positive impact on strategy implementation performance. As previously found in dynamic decision-making research (Beckmann and Guthke 1995; Brehmer 1992; Wittmann and Hattrup 2004), higher cognitive ability and better knowledge of performance measurement are supportive. Having some work experience is also positively linked to performance in this study's strategy implementation task, while age is disadvantageous, although only weakly significant. As in the Wittmann and Hattrup (2004) study, gender makes a difference in strategy implementation too, at least in performance measures that are directly related to the size of the company: economic value added, sales revenue, and market share. Male participants do a better job of growing the business and therefore achieve better results in these three measures. Regarding the factors that control, to some extent, for dynamic decision-making behavior (time spent per simulation and number of runs conducted), a positive influence of the first and a negative influence of the second is found. Taking more time in a dynamic decision-making task allows for deeper reasoning and seems beneficial in constructing better mental models, which translate into better decisions.

Analyzing the post-trial questionnaire provides additional insights into potential reasons why the BSC cockpit failed to improve strategy implementation decision making. Table 5 lists all 19 questions and organizes them into seven categories. The second highest correlation with performance can be seen in item PO_U_4, which measures the clarity of the causal links between decisions and performance criteria. Participants who understand the cause-and-effect chains that relate their decisions to performance indicators perform better than those who lack this understanding. This finding underlines the theoretical argument about the important role of mental models in decision making. Using Kaplan and Norton's (1996a, p. 149) definition of strategy as a “set of hypotheses about cause and effect,” item PO_U_4 could also be seen as a measure of strategic clarity: a clear understanding of the causal mechanisms that relate decisions to outcomes results in a clear understanding of strategy. Item PO_P_1 addresses strategic clarity from a slightly different angle; the question was: How clear was the strategy described for eHypo AG in the case study? Participants rating higher on PO_P_1 or PO_U_4 also achieved significantly higher performance (\(\rho =\) 0.217, p \(=\) 0.017; \(\rho =\) 0.351, p \(=\) 0.000). These findings support Kaplan and Norton's (e.g., 2004b, p. 100) argument that strategic clarity is an important success factor. Unfortunately, the BSC cockpit used in this study did not evoke significant differences in the clarity of cause-and-effect chains (see t-test results for items PO_P_1 and PO_U_4 in Table 4). At least in the design chosen in this study (following suggestions in the literature), the BSC cockpit did not support participants' understanding of the causal relations between their decisions and the consequences. Their mental model did not change, or did not change sufficiently.
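The reported \(\rho\) values are Spearman rank correlations. For reference, a tie-free sketch on invented data (rank both variables and take the Pearson correlation of the ranks; real Likert-scale responses would require tie handling with average ranks):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation for tie-free data:
    Pearson correlation of the rank vectors."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Invented example: a strictly monotone relation yields rho = 1
ratings = np.array([1, 3, 2, 5, 4])
scores = np.array([10.0, 30.5, 19.9, 54.0, 41.2])
rho = spearman_rho(ratings, scores)  # → 1.0 (up to floating point)
```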

Of all post-game question items in the information category, only two turned out to relate significantly and positively to strategy implementation performance: strategy reflection of information (PO_I_2) and information adequacy (PO_I_8_I). However, as shown in Table 4, testing for differences between the two groups of BSC cockpit and report users revealed no significant results. In the participants' perception, arranging measures according to four logically linked BSC perspectives and according to strategic themes does not make the information provided in the cockpit reflect the strategy more strongly (compared to traditional reports). Moreover, participants do not perceive the information provided through a BSC cockpit as more adequate for controlling the company than that provided through the report cockpit. Whether they needed more training in the BSC concept to adequately understand the structure behind the information presented in the cockpit, or whether the cockpit itself needed improvement, cannot be answered by this study and must be left for further research.

Significant differences between BSC cockpit and report cockpit users could be observed for three post-trial questions (see Table 4). First, the BSC cockpit provided a more clearly designed and less confusing information screen (PO_I_9). Second, information provided through the report cockpit was considered more multifaceted than that provided through the BSC cockpit (PO_I_15). This is not really surprising, as the BSC cockpit, by design, showed fewer measures than the report cockpit, and the report cockpit showed a broad range of non-financial indicators. Third, participants with a BSC cockpit rated implementing the strategy with the available information as easier than participants with a report cockpit did (PO_A_3). While the post-trial items PO_I_9 and PO_I_15 are not significantly correlated with participants' strategy implementation performance, PO_A_3 shows the highest correlation of all post-trial questions with MXR_P@40. Overall, this effect does not result in a performance difference between the two groups; nevertheless, it is remarkable that participants with access to a BSC cockpit perceive their task as easier than the report cockpit group does. This might help explain positive evaluations of the BSC concept in survey-based research (e.g., De Geuser et al. 2009; Rigby 2001; Speckbacher et al. 2003) that do not always and necessarily translate into increases in financial performance (e.g., Ittner et al. 2003).

8 Conclusions, limitations and implications for further research

This study contributes to the BSC and performance measurement literature (Atkinson 2006; Biggart et al. 2010; Davis and Albright 2004; De Geuser et al. 2009; Iselin et al. 2008; Kaplan and Norton 2004b; Tapinos et al. 2011) by showing that performance in a dynamic strategy implementation task is not improved when a BSC cockpit is provided instead of a report cockpit. Participants in a laboratory experiment are not able to make better decisions when faced with a strategy-related, reduced set of indicators grouped into the four classic BSC perspectives: learning and growth, internal processes, customers, and financials. Participants in the group forced to use reports such as balance sheets and income statements do not perform worse. Interestingly, they rate implementing the strategy with the available information as less easy than the BSC cockpit group does. This finding has important implications for BSC designers and adopters because it highlights that success in strategy implementation cannot be achieved just by changing the design of a management information system cockpit. While participants rated the BSC cockpit as less confusing and more clearly designed than the report cockpit, this did not translate into better decision making. Investing significant amounts of money in developing a BSC cockpit might increase subjective user satisfaction when working with these tools, but might not contribute to improving more objective performance indicators, supporting similar findings by Ittner et al. (2003). However, it has to be highlighted that this study's results should not be overgeneralized to discourage applying and implementing the BSC approach as a whole, as the concept is much more comprehensive than the focus of this research.

Contributing to dynamic decision-making research (Brehmer 1992; Dörner et al. 1994; Moxnes 1998; Sterman 1989; Wittmann and Hattrup 2004), this study argues that strategy implementation can be seen as a typical dynamic decision-making task. While obviously less complex than developing and implementing a strategy, executing a specified growth strategy nevertheless poses a huge challenge for human decision makers. Findings on the human “logic of failure” (Dörner 1996) in dynamic decision-making research are supported by this study's results. In addition, further evidence is provided that cognitive ability and knowledge have a significant positive impact on decision-making performance, not only in the classic complex problem-solving environment (Brehmer 1992; Brehmer and Dörner 1993), but also in a more operational setting, supporting findings by Strohhecker and Größler (2013). In line with recent research on the relation between time on task and ability in complex problem solving (Scherer et al. 2015), this study highlights that taking more time in dynamic decision making is beneficial. Deep thinking, understood as reasoning with numbers (Wittmann and Hattrup 2004), which also shows a positive impact on decision-making outcomes in this study, should therefore be complemented by long thinking.

The results contribute to the strategy literature (Ansoff 1984; Dyson 2000; Kaplan and Norton 2008; Mintzberg 1990) by showing that bad execution is not only an organizational phenomenon, but can also be rooted, to some extent, in decision-maker characteristics, as mentioned above. From this individual perspective, creating strategic clarity seems an important success factor. Participants who had a clearer understanding of the causal links between their actions and the results achieved performed significantly better. However, the BSC cockpit used in this study failed to increase the strategic clarity perceived by the participants (compared to the report cockpit). With its tabular, dashboard-like design, which closely follows suggestions by Kaplan and Norton, cause-and-effect chains are not visualized and therefore not emphasized. The BSC's main causal chain, in which better measures in the learning and growth perspective lead to improved indicators in the internal processes perspective, which in turn result in better customer performance indicators and improved financial measures, might be too invisible and perhaps also too generic to be helpful for strategy implementation decision makers. Whether improved cockpit designs that clarify the strategy as a “set of hypotheses about cause and effect” (Kaplan and Norton 1996a, p. 149), or better-educated and more experienced users who could extract more information from the BSC cockpit, would change this study's findings has to be left for further research.

Of course, this study is not without limitations. When focusing on strategy implementation, giving participants the task of deciding on execution only seems justified. However, most CEOs have broader competencies: they not only implement an already specified strategy, they can also develop one. Therefore, future research could give participants more “power” and define the task more broadly. The focus on strategy implementation results in a second limitation. Participants need a very thorough description of the strategy they have to implement, including a strategy map that outlines the important cause-and-effect relationships between strategic themes. Giving this information in very detailed form to both experimental groups might have reduced the discriminating effect of a BSC cockpit. Future research could investigate whether different forms of strategy description, such as verbal only versus verbal and graphical, result in significant performance differences. Another limitation comes from using a realistic, rather complex case study and business game in this experiment. Achieving good strategy implementation results may require more repetitions. Therefore, a future study could allow participants more simulation runs that count, and/or could use a less complex strategy implementation case study. A further limitation stems from the use of a student sample. While more than two-thirds of the participants had work experience, they could not be expected to hold positions with strategy implementation power to the extent reflected in the experiment. They might also lack the type of experience that allows for a different decision-making style, for instance, more intuitive decision making. Therefore, future research could draw participants from the population of more experienced managers and investigate whether the factors affecting strategy implementation performance change.