1 Introduction

Multi-Criteria Decision Making (MCDM) is a branch of science that is particularly useful in environments where conflicting criteria and objectives impose immense cognitive burdens on decision makers (DMs). Decision making through MCDM breaks complex decisions into six steps: defining the problem; articulating the objective(s); identifying stakeholders; defining criteria to measure objectives; weighting the criteria; and ranking the decision alternatives based on the objectives, criteria, and weights derived in previous steps (Triantaphyllou 2000). Unlike optimization approaches, MCDM works from a set of explicitly known feasible solutions. Furthermore, optimization demands a mathematical relationship between the criteria and the objective, which is almost impossible to establish for many qualitative criteria (Gandibleux 2006). The design of resilient and sustainable buildings (RSB) can present conflicting—often qualitative—criteria and objectives, and generally proceeds through the iterative refinement of feasible solutions, suggesting it may be a good candidate for MCDM and motivating the present study. MCDM relies on experts' input to define and weight criteria, and the way these questions are asked can dramatically affect both the responses and the cognitive load on the DM as they balance project-specific requirements within a context of hazards and environmental priorities. To support decision making, this paper compares two survey methods frequently used to establish weightings in the MCDM process. Section 2 provides a brief background on MCDM, followed by a discussion of sustainability, resilience, and decision making in the context of building design in Sect. 3. Section 4 introduces the method used in this study, and Sects. 5 and 6 present the results and conclusions, respectively.

2 MCDM background

This paper assumes that the first three steps of an MCDM—problem definition, articulation of objective(s), and identification of stakeholders—have already been carried out. This does not mean that these are trivial tasks with little bearing on the MCDM process and results. Rather, they may be considered pre-conditions for the analytical portion of MCDM, and their influence on outcomes has been discussed in detail elsewhere (e.g., Wang and Lee 2009). The following two sections discuss the fourth step, criteria selection, and especially the fifth step, criteria weighting, in detail.

2.1 Criteria selection

While there are many well-established methods for criteria identification, in all cases, criteria must exhibit four qualities (Phillips et al. 2017), namely be

  • coherent with the overall objective

  • measurable in some way

  • presented on a common scale

  • independent from one another.

In some instances, coherent criteria are readily available and easily measurable. For example, the criteria for a basic energy-retrofit project might include first cost and monthly energy use, each distinct from the other and both quantifiable using available tools and data. In many cases, however, selecting or measuring the appropriate criteria presents a significant challenge. Subjective aspects like aesthetics or occupant comfort are not only difficult to measure but also challenging to present on a common scale. Criteria independence, meanwhile, ensures that only the necessary criteria are included: avoiding duplication reduces the complexity of the problem and facilitates analysis.
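
To make the common-scale requirement concrete, the following minimal sketch (in Python, with hypothetical criteria names and values that are not drawn from this study) rescales two cost-like measurements with different units onto a shared 0 to 1 scale using min-max normalization; any comparable normalization could serve the same purpose.

```python
# Minimal sketch (not from the study): putting two cost-like criteria with
# different units on a common 0-1 scale via min-max normalization.
# The criteria names and values below are hypothetical.

def min_max_normalize(values, lower_is_better=False):
    """Scale raw criterion measurements to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]  # no spread: treat all alternatives alike
    scaled = [(v - lo) / (hi - lo) for v in values]
    # For cost-like criteria, a smaller raw value should score higher.
    return [1.0 - s for s in scaled] if lower_is_better else scaled

# Three hypothetical retrofit alternatives measured on two criteria.
first_cost_usd = [120_000, 95_000, 150_000]      # lower is better
monthly_energy_kwh = [4_200, 5_100, 3_600]       # lower is better

print(min_max_normalize(first_cost_usd, lower_is_better=True))
print(min_max_normalize(monthly_energy_kwh, lower_is_better=True))
```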

The present study selected criteria with the Delphi method. First used in 1948, the Delphi method aims to improve the use of expert predictions in policy-making (Woudenberg 1991) through interactions among a group of experts who work to agree on decision criteria (Martino et al. 1976). The process begins with a dialogue about the criteria that most affect an objective of the problem, followed by an iterative commenting procedure; these repeated cycles of discussion and revision allow experts to update their reflections and arrive at group consensus. Delphi enables knowledge discovery and perspective sharing among experts with different disciplinary backgrounds (Pill 1971), and so the process works well for complex problems that include considerable uncertainty and require multiple perspectives. However, the authority of researchers over the problem definition may lead them to select participants with a certain viewpoint, or to manage group dynamics, potentially introducing bias (Linstone and Turoff 2011). The lack of participant anonymity can sway individuals to change their positions, exacerbating these challenges. Nevertheless, the Delphi method affords results at reasonable time and cost (Williams and Webb 1994), even for problems demanding judgment and subjective expertise, when causal relationships are hard to validate, or when the problem must consider diverse opinions within a large group of people (Powell 2003; Yang et al. 2012), making it a good fit for the present study.

2.2 Criteria weighting

After the criteria are selected, their relative importance, or weights, must be established as inputs to the alternative-ranking process. Criteria weights depend on the problem and context; for example, designing RSB in a desert might prioritize water conservation, while projects with scarce funding might value economic cost above other considerations. Weights also reflect different stakeholders' values. Because criteria weights directly affect the final ranking, assigning accurate values to criteria is one of the most crucial steps in MCDM. There are two general approaches to criteria weighting: subjective and objective. Objective methods use mathematical calculation to arrive at the weights, whereas subjective methods are based on DMs' opinions (Tzeng et al. 1998).
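
A minimal sketch in Python (with hypothetical alternatives, scores, and weights rather than data from this study) illustrates why the weighting step matters: under a simple weighted-sum aggregation, the same two alternatives change rank when the weight vector shifts from cost-driven to sustainability-driven priorities.

```python
# Minimal sketch with hypothetical numbers: the same two alternatives flip
# rank under different criteria weights, illustrating why accurate weight
# elicitation matters for the final MCDM ranking.

criteria = ["first_cost", "energy_use", "occupant_comfort"]  # column order for the score vectors

# Normalized scores (higher is better) for two hypothetical design alternatives.
scores = {
    "Alternative A": [0.9, 0.4, 0.5],
    "Alternative B": [0.5, 0.8, 0.7],
}

def weighted_sum_rank(scores, weights):
    totals = {alt: sum(w * s for w, s in zip(weights, vals))
              for alt, vals in scores.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

cost_driven_weights = [0.6, 0.2, 0.2]        # e.g., a budget-constrained project
sustainability_weights = [0.2, 0.5, 0.3]     # e.g., an environmentally focused project

print(weighted_sum_rank(scores, cost_driven_weights))      # Alternative A ranks first
print(weighted_sum_rank(scores, sustainability_weights))   # Alternative B ranks first
```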

Objective methods are based on established mathematical procedures to assess or reconcile differences in the importance of individual criteria. They may require significant computational capacity, limiting the ability of DMs to revisit their judgments in a dynamic decision-making environment. Even in very technical cases, applications of objective methods are troublesome, as decisions nearly always involve diverse concerns (Maggino and Ruviglioni 2009). Objective methods tend to be more time-consuming, more expensive, and harder to execute than subjective approaches (Kamal 2012).

Subjective methods rely on DMs making tradeoffs among the criteria according to a subjective sense of relative importance, and the weights are easily adjusted if the objective of the problem changes. Although they demand little time or computational capacity, subjective methods incorporate human judgment—and the attendant imprecision and ambiguity—and must carefully attend to internal consistency (Önüt et al. 2009) in order to deliver accessible criteria weightings amid multifarious concerns. This reliance on human judgment aligns with the professional judgment exercised by architects and engineers working in the built environment to address the interests of multiple stakeholders, for example, client budgets and the health, safety, and welfare of the public.

2.3 Preference elicitation methods

Subjective weighting usually employs surveys to elicit DMs' preferences. Preference in this context means the relative importance of each criterion to the decision maker. MCDM does not recommend one method for establishing criteria weights over others, and similar decision-making problems employ different approaches. For example, a 2002 study evaluating renewable energy strategies obtained criteria preferences directly (Polatidis and Haralambopoulos 2002), while another applied pairwise-comparison weighting to a very similar problem in which the objective was to rank different renewable energy projects (San Cristobal 2011); neither study comments on the selection of method. This paper evaluates two subjective preference elicitation methods, the Simple Multi-Attribute Rating Technique (SMART) and Potentially All Pairwise RanKings of all possible Alternatives (PAPRIKA), to test their possible applicability to RSB design problems.

SMART is a straightforward weighting method that uses the simple additive weighting (SAW) technique to obtain weights for individual criteria (Edwards 1977; Edwards and Barron 1994). After identifying the decision criteria, DMs assign the highest score to the most important criterion and score the rest of the criteria relative to it. The results are then normalized to unity to obtain criteria weights. This method is simple to administer and easy for DMs to complete, but in its abstraction it has no analog to the contingent and sometimes messy building design process.
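
As a minimal sketch of this scoring-and-normalization step (in Python, using hypothetical raw scores rather than responses collected in this study), the most important criterion receives 100 points, the others are scored relative to it, and the scores are divided by their sum so the resulting weights total one.

```python
# Minimal sketch of the SMART scoring-and-normalization step, using
# hypothetical raw scores rather than responses from this study.

raw_scores = {
    "first cost": 100,            # most important criterion for this hypothetical DM
    "operational cost": 80,
    "occupant comfort": 60,
    "energy conservation": 40,
}

total = sum(raw_scores.values())
weights = {criterion: score / total for criterion, score in raw_scores.items()}

print(weights)                 # e.g., "first cost" -> 100/280, about 0.357
print(sum(weights.values()))   # sums to one (up to floating-point rounding)
```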

PAPRIKA is a pairwise method that presents DMs with a series of hypothetical scenarios in which all but two criteria are held equal. Those two criteria take different values in the two scenarios, and the DM must assess the tradeoff and select the more desirable scenario. Each time a DM makes a choice, PAPRIKA assigns points to the preferred criterion and updates the previous point assignments to determine which comparison to present next. It continues the process until the final weights are calculated. PAPRIKA surpasses conventional pairwise methods, which merely ask which of two criteria is more important and cannot accommodate multiple levels for each criterion or scenario-based tradeoffs. Developed in 2004, PAPRIKA is relatively new, appearing in the academic literature from 2009 onward (Noseworthy et al. 2009). This study offers the first comparative analysis of PAPRIKA against other survey methods in the context of a real-world problem.
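
The following highly simplified sketch (Python, with hypothetical criteria, levels, and a simulated choice) illustrates only the scenario-comparison idea, not the PAPRIKA algorithm itself: the full method additionally applies transitivity to eliminate implied comparisons and solves mathematically for the final point values.

```python
# Highly simplified illustration of a single scenario-comparison question,
# not the PAPRIKA algorithm itself: the real method also uses transitivity to
# skip implied comparisons and solves mathematically for the final point values.
# Criteria, levels, and the simulated choice below are all hypothetical.

scenario_a = {"first cost": "lower", "energy conservation": "code minimum"}
scenario_b = {"first cost": "higher", "energy conservation": "30% better than code"}

def ask_dm(a, b):
    """Stand-in for the survey interface; this simulated DM always picks b."""
    return "b"

tally = {"first cost": 0, "energy conservation": 0}
if ask_dm(scenario_a, scenario_b) == "b":
    tally["energy conservation"] += 1   # the DM traded first cost away for energy performance
else:
    tally["first cost"] += 1

print(tally)
```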

Donald Schön in his book The Reflective Practitioner suggests that when confronted with complex, unique, and uncertain situations, professionals engage in “reflective conversations” that seek to fit a framework of generic knowledge and prior experience to the problem at hand. By repeatedly responding to the “back-talks” of this fitting exercise, professionals refine these frameworks, until the unique and uncertain situation comes to be understood (Schön 1983). Schön cites design decisions made by architects and engineers as examples of reflective practice, and the similarities with the repeated scenario comparisons and tradeoffs suggest PAPRIKA may be an authentic model for eliciting DMs’ behavior in this domain.

2.4 Cognitive burden on DMs

Subjective MCDM can impose a heavy cognitive load on the DMs (Narasimhan and Vickery 1988). Cognitive load theory postulates that the cognitive capacity of people's working memory is limited, and exceeding this threshold hampers performance—perhaps because of the presence of complexities and uncertainties (De Jong and Den Hartog 2010). As DMs' roles vary across MCDM methods, it is logical to assume the cognitive load on DMs varies as well. Traditional MCDM methods require the DMs to understand and value each criterion relative to the other criteria or to make tradeoffs among them, while considering the importance of all criteria simultaneously. An acute awareness of the cognitive burden on DMs when evaluating the criteria motivates this study, with the hypothesis that a series of scenarios, although individually complex, may impose a smaller burden and better reflect the process of making design decisions and tradeoffs for RSB.

DMs are usually considered reliable sources of information in a complex environment, but because of their limited ability to comprehend all the complexities involved, psychological pressure on DMs can adversely affect MCDM (Larichev and Nikiforov 1987). Measuring the cognitive load on DMs while they engage in MCDM is difficult, and DMs may not have a coherent view of it (Pomerol and Barba-Romero 2012). A handful of studies have examined the difficulty of survey methods for DMs as well as the accuracy with which they capture preference weights. Buchanan and Daellenbach (1987) used different survey techniques to collect preference weights and also asked decision makers about their difficulty. In the context of cost minimization and labor force utilization in the production of electrical components for lamps, Buchanan and Daellenbach concluded that pairwise comparisons pose no problem for DMs. Similarly, Narasimhan and Vickery (1988) recorded preference weights for three criteria using SMART and pairwise comparisons and found no statistical difference in responses to the two methods. In contrast, in a study of a production scheduling task, Wallenius (1975) found that users prefer unstructured methods to more sophisticated approaches. More recently, Aloysius et al. (2006) administered two surveys to 153 individuals, one using a direct technique similar to SMART and the other using pairwise comparisons. They focused on the participants' reflections on cognitive load and reported that users believe pairwise comparisons cause more decisional conflict, are less accurate, more effortful, and less desirable (Aloysius et al. 2006). However, the results in these fairly simple scenarios may not hold true in complex settings requiring more criteria and criteria levels, such as integrating sustainability and resilience in buildings, and the pairwise methods used were less sophisticated than PAPRIKA. Whatever the influence of the survey instrument on cognitive load, it will of course also vary by DM based on knowledge and experience (Pomerol and Barba-Romero 2012).

2.5 Ranking methods

The last step of an MCDM brings together criteria weights and values to rank the various alternatives. The various methods for ranking alternatives lie outside the scope of this paper, which refers the interested reader to other studies for more detail on the following approaches: the Analytical Hierarchy Process, or AHP (e.g., Saaty 1980); the Preference Ranking Organization Method for Enrichment Evaluation, or PROMETHEE (e.g., Brans 1986); Elimination and Choice Translating Reality, or ELECTRE (e.g., Roy and Bouyssou 1986); and the Technique for Order Preference by Similarity to Ideal Solutions, or TOPSIS (e.g., Yoon 1987). Instead of enumerating already well-documented similarities and differences among these methods, this study highlights the relevance of cognitive burden to the process of eliciting criteria preferences.

3 Sustainability, resilience, and decision making in buildings

MCDM has been applied in many fields to improve decision making in complex environments, including building design (e.g., Hsieh et al. 2004; Flager et al. 2009; Balcomb and Curtner 2000), and is a promising method to integrate resilience and sustainability in building design decisions. While concern for the environmental, social, and economic performance of buildings, generally summed up under the label of sustainability, has become widespread in research and practice (e.g., Lützkendorf and Lorenz 2006; Zuo and Zhao 2014), with many scholars defining the term in detail (e.g., Pater and Cristea 2016; Glavič and Lukman 2007), the notion of building resilience is only just capturing the attention of decision makers (Laboy and Fannon 2016; Burroughs 2017). As with sustainability, a wide range of definitions of resilience exist, many of them stressing the need for systems to cope with disruption, maintain essential operations, return to normal operations after the disruption has ended, and elevate to a state of advanced performance following some exogenous shock (Alexander 2013; Linkov et al. 2014; Mirzaee et al. 2018). In the context of buildings, Alexander (2006) argued that while building codes can regulate the design, construction, and maintenance of structures to protect occupants from disasters, protection measures have not kept pace with the growing vulnerability of places at high risk of disaster. Common to both resilience and sustainability is the notion that there is no single criterion by which they can be evaluated, and that any application of these concepts to decision making must be cognizant of the boundary constraints—environmental, social, economic, and mental—within which alternative actions are compared to each other and judged.

Some prior work has applied MCDM to questions of building sustainability, including energy management (Kolokotsa et al. 2009) and selecting wall insulation to retrofit existing buildings (Ruzgys et al. 2014). Jin et al. (2016) used pairwise comparisons to capture DMs' preferences on ten sustainability criteria and then ranked ten green building technologies for existing buildings. Cegan et al. (2017) performed a broad literature review of MCDM applications in the environmental sciences and sustainability domain and found rapid growth in the use of MCDM for such problems from 2000 to 2015. MCDM has recently been used to improve sustainability (Invidiata et al. 2018) and to find the best combination of design strategies for constructing net zero energy buildings (Harkouss et al. 2018). To date, applications of MCDM to the burgeoning field of building resilience mostly focus on specific hazards and design responses, for example, identifying the best alternative for a structural seismic retrofit, whether of concrete and masonry buildings (Formisano and Mazzolani 2015) or steel-braced frames (Deierlein et al. 2011). These applications advance what has been called "engineering resilience," which assumes an equilibrium model and prioritizes the rapid return to function (Gunderson 2000; Holling 1996; Scheffer et al. 1993). While not wrong, engineering resilience tends toward narrow optimization (albeit with multiple objectives) rather than the broad incorporation of relevant stakeholders, objectives, and criteria characteristic of MCDM.

Although it can yield ranked lists of alternatives, MCDM does not employ an objective function or provide a single, optimal solution, for three reasons: (i) the decision alternatives are fuzzy and tend to change during the decision process; (ii) the decision maker is not a single person or entity, and members of a decision-making group may choose conflicting or even contradictory criteria; and (iii) the criteria preference weights elicited from DMs may be imprecise or badly formulated (Majumder 2015; Mota et al. 2013; Carlsson and Korhonen 1986; Zimmermann 1990). Real-world decisions about building design often exhibit all three characteristics. Fuzziness of the decision environment means the goals and/or decision criteria are fuzzy in nature and their boundaries are not sharply defined, although this does not necessarily mean that the system itself is fuzzy (Bellman and Zadeh 1970). Although building for sustainability and resilience relies on objective performance data, the complex tradeoffs of balancing sustainability and resilience in buildings depend on human-in-the-loop interactions. Building projects inherently involve diverse stakeholders, sometimes with conflicting goals. Finally, eliciting accurate criteria preference weights from those DMs is challenging. The limited prior literature and our initial trials suggest that simple, direct ranking methods do not adequately capture the complexities of the decision space from diverse stakeholders, while more nuanced comparative methods risk over-specification and demand considerable time and effort to develop and administer. Robust decision outcomes must address all three of these challenges by reducing the fuzziness of the decision environment, identifying critical stakeholders, and eliciting accurate criteria preferences.

4 Method

This study grows out of an effort to develop an MCDM-based tool to support design decisions about sustainable and resilient buildings, a process that encompasses multiple criteria evaluated by multiple stakeholders. The present study compares methods of eliciting preference weights, seeking reasonable accuracy and acceptable cognitive load. These results do not provide criteria values to increase the resilience and sustainability of any particular project, but rather help design teams identify methods to collect such values. The focus on design decisions reduces the fuzziness of the decision space and the diversity of stakeholders by limiting participation to decision makers who are design professionals from the Architecture, Engineering, and Construction (AEC) industry. Of course, other stakeholders (e.g., owners, occupants) may weigh the same criteria differently. As detailed below, preference weights from 42 North-American AEC-industry design professionals regarding ten resilience and sustainability criteria were collected using two different survey methods.

4.1 Criteria selection

The absence of a common definition for concepts like building resilience complicates the selection of decision-making criteria: indeed, finding a suitable definition of resilience may itself constitute a decision problem. Although there are many possible metrics for both sustainability and resilience, prior work identified commonalities and conflicts among resilience and sustainability standards and rating systems (Philips et al. 2017). Building on this literature, the present study used the Delphi method, with discussion and knowledge sharing among a panel of subject-matter experts, to identify important criteria for resilience and sustainability in buildings. The panel included leading researchers and practitioners with expertise in structural engineering, architecture, environmental chemistry, life cycle assessment (LCA), high-performance buildings, urban resilience, and the social sciences. This step of the study was particularly important because people with diverse backgrounds often use the term resilience to mean different things, so clearly establishing appropriate criteria, if not a definition, is essential in a climate of increasing uncertainty (Laboy and Fannon 2016). The prompt, coupled with the particular expertise of this panel, led to the identification of the criteria in Table 1, which incorporate resilience criteria into the commonly used Triple Bottom Line (TBL) categories (Elkington 2013), as well as a new "recovery" bottom line, resulting in a Quadruple Bottom Line (QBL). The criteria reflect the panel's collective definition of resilience and sustainability, acknowledging that those criteria will be valued higher, lower, or not at all depending on stakeholder, context, and project. For example, criteria 9 and 10 show a bias towards an engineering definition of resilience, while economic and socio-technical definitions of resilience emerge in criteria 3, 6, and 8. The criteria selection step is of great importance to the MCDM process, as a different set of criteria can shift the decision outcome. In this case, a constructive dialogue yielded ten criteria that cover the four categories of the QBL, as shown in Table 1.

Table 1 Ten criteria for sustainability and resilience organized into categories and subcategories

4.2 Survey design

Two surveys were designed to quantify DMs' preference weights among the ten resilience and sustainability criteria. The first, using the SMART method, asked DMs to directly value the criteria relative to each other. The second used the PAPRIKA method, in which DMs indicate their preferences through a series of nearly identical scenario pairs. Both surveys were developed and administered using online tools: the Qualtrics survey platform (Qualtrics, Provo, UT) for the SMART survey and 1000minds for PAPRIKA (Hansen and Ombler 2008).

Qualtrics is an online tool for data collection that supports multiple question types. As shown in Fig. 1, the SMART survey was administered using slider bars with scales from 0 to 100. The prompt asked decision makers to assign 100 points to the most important criterion and value the remaining nine criteria relative to that maximum.

Fig. 1
figure 1

SMART survey in Qualtrics environment showing the 0–100 scale and sliders

1000minds.com is an online platform for PAPRIKA that elicits preferences from multiple actors by requiring them to choose between pairs of alternative solutions that differ slightly on specific criteria. The survey designer identifies the quantitative and/or qualitative criteria of the problem and at least two performance levels for each criterion. During survey administration, the software automatically generates the pairwise scenario comparisons needed to capture DMs' preferences across the total decision space, i.e., the responses to each question govern which question is presented next.

Prior studies in the social, behavioral, and psychological sciences have evaluated the effect on responses when the same question is worded differently, and they generally conclude that the way researchers ask questions matters for the results (Moser and Kalton 2017; Walgrave et al. 2016). The primary requirement is clarity, such that the question has only one interpretation (Lee et al. 2016; Magelssen et al. 2016). To promote clarity in the PAPRIKA method, each criterion was associated with a uniquely phrased question and set of performance levels. This phrasing, shown in Table 2, avoided the redundancy and confusion of repeated performance levels such as "high, medium, low."

Table 2 Table of criteria, questions, and performance levels used in 1000 minds survey. Note the unique phrasing of the performance levels associated with each criterion

With the ten criteria included in this study and at least two performance levels for each, comparing every criterion and level would require hundreds of pairwise comparisons. However, 1000minds determines which pairwise comparisons are necessary to arrive at the final weights, thereby optimizing to the fewest questions. This survey typically asked DMs approximately 40 questions. An example question is presented in Fig. 2.

Fig. 2
figure 2

An example of a scenario-comparison question generated by 1000minds.com based on the criteria and performance levels
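
To give a sense of scale for the adaptive questioning described above, the following back-of-envelope sketch (Python) counts the full space of hypothetical scenarios and scenario pairs under the simplifying assumption of exactly two performance levels per criterion (the actual level counts, listed in Table 2, are at least two and vary by criterion); the adaptive survey asked only about 40 questions from this space.

```python
# Back-of-envelope sketch of why adaptive question selection matters. Assume,
# for simplicity, exactly two performance levels per criterion (the actual
# counts in Table 2 are at least two and vary by criterion).

from math import comb, prod

levels_per_criterion = [2] * 10               # simplifying assumption

n_scenarios = prod(levels_per_criterion)      # 2**10 = 1024 hypothetical scenarios
n_pairs = comb(n_scenarios, 2)                # 523,776 possible pairwise comparisons

print(n_scenarios, n_pairs)                   # versus roughly 40 questions actually asked
```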

4.3 An opinion-based survey

Complementing the two surveys about criteria preference weights, participants were asked to complete a simple meta-survey about the experience, consisting of a series of questions assessed on a 5-point Likert-type scale. Likert scaling is a bipolar scaling system measuring positive and negative responses, usually with a neutral response in the middle. The survey focused on the four main dimensions presented in Table 3.

Table 3 Table of opinion-based survey questions and ranges of response options

As shown in Table 3, self-reported cognitive load is used to compare the difficulty of the two methods. Web-based surveys often use self-assessed ratings to indicate participants' perception of cognitive load, not least because they are dramatically easier to collect than other measures (Paas et al. 2003). While self-reported values always demand careful consideration, prior studies have found that participants can quantify their mental load (Gopher and Braune 1984) and that self-reported cognitive load is meaningfully correlated with actual cognitive load (Gimino 2002; Paas et al. 1994). In another example, Lin et al. (2013) asked participants to evaluate a task's difficulty and found their self-reported cognitive load to be over 74% accurate. Many studies successfully use self-reported mental effort on a Likert-type scale as a measure of cognitive load (e.g., Chang and Yang 2010; Kalyuga and Sweller 2005; Scharfenberg and Bogner 2013; Antonenko and Neiderhauser 2010). Collectively, this body of prior work indicates that self-assessment is a reliable and valid measure of cognitive load (Chen et al. 2016; Ayres 2006).

4.4 Survey administration

As a pilot test, the surveys were administered to 10 researchers working in the field of building design who were familiar with the principles of MCDM as well as the concepts of sustainability and resilience. Results were reviewed with each respondent to solicit feedback about the survey design and survey method, as well as the language of the questions. The responses were used only to refine the surveys prior to wider deployment and are not included in the analysis presented below.

After pilot testing, a sample of 250 adults over 18 years of age with a professional role in the building industry was identified. There was no selection criterion related to gender, ethnicity, or socio-economic level, as these personal data were irrelevant to the research. The survey solicitation consisted of an email with a link to an intake landing page, which provided information about the relevant human subject protocols and collected basic demographic data about industry role, experience level, and educational attainment. As part of the intake, DMs were prompted to consider a recent project as a way to frame the otherwise-abstract weighting process, lending some specificity at the cost of potentially greater variance. The landing page then linked to the SMART survey, at the completion of which users were presented the PAPRIKA survey, and finally the self-reported meta-survey. The nature of the scenario-comparison questions precluded the otherwise-preferable randomization of the sequence of the two surveys, potentially introducing order effects.

The surveys were distributed to 250 professionals in the AEC industry and 59 responded to both surveys, a 23.6% response rate. Of those respondents, 42 (72%) identified their role as design professionals (61% architects or designers, 11% engineers), the target population for this study about design decision making. The remaining respondents indicated roles as owners and developers, product manufacturing, education or other related industries, and were not included in the analysis below. The 42 design professionals averaged 12 years of experience, ranging from 1 year to 37 years.

5 Results and analysis

The following sections present two types of results: the first are DMs' preference weights for the building sustainability and resilience criteria, which reveal the priorities of this sample of AEC design professionals. Recalling that the purpose of this study is to compare the two survey methods used to elicit such preferences, the second group of results analyzes the differences in weights between the two methods and the DMs' subjective responses about them.

5.1 Comparison of mean criteria weights for SMART and PAPRIKA surveys

Surveys of decision makers to establish criteria weights for MCDM often use either the average or the median of responses; for small samples with possible outliers, the median better describes central tendency. Therefore, this study compares the median weights to test for significant differences between the two surveys, as summarized in Table 4. Six of the ten criteria show significant differences in preference between the two survey methods: first cost, operational cost, percent functional of the building, energy conservation, life cycle impact, and beyond-code safety. The remaining four criteria—recovery cost, recovery time, occupant comfort, and neighborhood impact—do not show statistically significant differences between survey methods. Energy conservation and life cycle impact are valued low when asked about directly; however, DMs trade other criteria away for these two, valuing them more highly in scenario comparisons than in direct questions. Beyond-code safety and first cost are highly valued in direct questions; however, respondents value them less in scenario comparisons and trade them away for other criteria. In general, when asked directly, respondents place high value on economic/financial criteria. However, when presented with the forced choice of scenarios, respondents trade off profit for greater environmental sustainability and increased resilience. Neighborhood impact and occupant comfort, which are categorized as social criteria, appear highly consistent across respondents, while beyond-code safety, which is in the same category, is valued quite differently.

Table 4 Median preference values observed using each survey, their differences, and measures of statistical significance

The percentage change describes the increase or decrease in median preference values as captured through PAPRIKA compared to the SMART median value as baseline. Beyond-code safety and first cost showed the largest difference between survey methods, with changes of 6.2 points and 4.7 points, respectively, from the median collected with SMART to the median collected with PAPRIKA. The greatest change on a percentage basis was also beyond-code safety: its median value of preference in PAPRIKA is 81% less than its median value in SMART. The next largest change was for first cost, which showed a difference of 57% between the two methods.

Because the sample size is the same in all cases, it is possible to compare the t- and p-values with each other. T-values indicate the difference in means between the two groups, measuring the size of the difference relative to the variance of the data. In Table 4, the three largest absolute t-values correspond to the three lowest p-values, making it very unlikely that these significant differences arise by chance. These three criteria are first cost, beyond-code safety, and energy conservation.
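
As a sketch of the kind of paired comparison summarized in Table 4 (run on synthetic stand-in data, since the study responses are not reproduced here), a paired t-test compares the two elicitation methods respondent by respondent, and the Wilcoxon signed-rank test offers the analogous non-parametric check on medians.

```python
# Sketch of the kind of paired comparison summarized in Table 4, run on
# synthetic stand-in data (the study responses are not reproduced here).
# A paired t-test compares the two elicitation methods respondent by
# respondent; the Wilcoxon signed-rank test is the analogous check on medians.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic weights for one criterion from 42 respondents under each method.
smart_weights = rng.normal(loc=12.0, scale=4.0, size=42)
paprika_weights = rng.normal(loc=9.0, scale=4.0, size=42)

t_stat, p_t = stats.ttest_rel(smart_weights, paprika_weights)
w_stat, p_w = stats.wilcoxon(smart_weights, paprika_weights)

print(f"paired t-test:  t = {t_stat:.2f}, p = {p_t:.4f}")
print(f"Wilcoxon test:  W = {w_stat:.1f}, p = {p_w:.4f}")
```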

The differences in the median weights for each criterion as determined by the two surveys are illustrated in Fig. 3. The median of the criteria weight responses in PAPRIKA was subtracted from the median value from the SMART survey, so values greater than zero indicate that survey subjects gave greater preference to those criteria when assigning values directly in SMART than they evinced through their choices in scenarios. Conversely, negative values indicate criteria that DMs preferred when choosing between scenarios but had directly assigned lower scores. The significance levels of the differences are shown on each bar using star marks, and the category of each criterion is both color-coded and named on each bar. The category-based color-coding reveals suggestive trends, namely that economic criteria seem to be valued more highly in direct assessment, while environmental and resilience criteria appear to be valued more highly when DMs are making tradeoffs in scenarios. These trends suggest not only that a user's valuation varies depending on the survey method, but also that those changes in weighting may be non-uniform across criteria, and perhaps criteria groups.

Fig. 3
figure 3

Differences in median weight for each criterion from the SMART and PAPRIKA surveys. Differences that are not statistically significant are marked with (−), while *, **, and *** indicate significance at the p = 0.05, 0.001, and 0.0001 thresholds, respectively. The colored version of this figure indicates category groupings

5.2 Comparison of preference value distributions for each criterion

As described above, respondents as a group valued criteria differently when asked to directly assign a score vs. when selecting those criteria in a scenario. To better understand the responses from individuals that result in these group patterns, each subject’s scenario responses are plotted as a function of their direct weighting for each criterion. These plots are shown in Fig. 4. Each dot represents the opinion of one individual, with the weight in response to direct questions on the horizontal axis and the weight determined through scenario-comparison choices on the vertical axis. A perfectly consistent respondent would have the same value on both axes.

Fig. 4
figure 4

Distributions of individual responses for each criterion, showing the scenario (PAPRIKA) weightings as a function of the direct (SMART) valuations. Perfectly consistent respondents would score each criterion identically in both surveys, yielding the hypothetical trendline with a slope of 1 and intercept of 0, as shown in white. A best-fit linear trend for the actual responses is shown in black, with the shaded zone indicating the 95% confidence interval. The universally low R2 values indicate poor fit and low consistency

Beyond revealing the consistency (or lack thereof) of individual subjects across the two surveys, these distributions also suggest criteria for which respondents are generally consistent, or for which a small number of dramatic changes overwhelm a generally consistent set of responses. To aid in this comparison, black lines represent a linear regression of the distribution, and the gray-toned area defines the 95% confidence interval. The white lines illustrate an idealized distribution in which all the DMs are perfectly consistent, providing the same preference for each criterion in both surveys. Although the graphs visually suggest some criteria are more consistent than others, the low R-squared values in all cases indicate that there is no strong trend in responses to compare to the ideal. Graphs in which most data points are concentrated above or below the white line also highlight those criteria—identified in the previous section—which generally score highly on one survey method but not the other, for example beyond-code safety.
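
The consistency check behind Fig. 4 can be sketched as follows (in Python, with synthetic data standing in for the study responses): regress each respondent's scenario (PAPRIKA) weight on their direct (SMART) weight and compare the fitted line with the ideal line of slope 1 and intercept 0 that perfect consistency would produce.

```python
# Sketch of the per-criterion consistency check behind Fig. 4, using synthetic
# data in place of the study responses: regress each respondent's scenario
# (PAPRIKA) weight on their direct (SMART) weight and compare the fit with the
# ideal line (slope 1, intercept 0) that perfect consistency would produce.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

smart = rng.uniform(0, 25, size=42)                  # direct weights (synthetic)
paprika = 0.4 * smart + rng.normal(0, 5, size=42)    # weakly related scenario weights

fit = stats.linregress(smart, paprika)
print(f"slope = {fit.slope:.2f} (ideal 1.0), intercept = {fit.intercept:.2f} (ideal 0.0)")
print(f"R^2 = {fit.rvalue ** 2:.2f}")                # a low R^2 indicates weak consistency
```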

Respondents highly value first cost and beyond-code safety when asked about them directly but trade them away for other criteria when presented with scenario comparisons. The same logic holds for operational cost, although less intensely. The converse behavior is also present: respondents report less concern about energy conservation when asked to value the sustainability criteria directly, yet the same respondents often choose the more sustainable option in scenarios, trading away other criteria to improve energy conservation. For the remaining criteria, the data are less clear and the responses inconsistent; none of the linear regressions fit the response data well, and there are many outliers where the same criterion is weighted very differently in the two surveys.

Figure 5 shows box plots of the distribution of responses for all ten criteria in both the SMART and PAPRIKA surveys. These plots illustrate how the central tendencies used for MCDM may not reflect the complexity of preference weights across the two surveys. For example, the mean value for life cycle impact in the scenario comparisons aligns with the fourth quartile of the direct measurements, because the long tail of high values in PAPRIKA indicates that a handful of respondents value this criterion highly under that survey method. This is visible in Fig. 4 as well: while most results fall close to the trend line, a few responses well above the line pull the linear fit away from the idealized curve, perhaps because direct rating tends to encourage central tendency bias. In other cases, such as beyond-code safety, responses are skewed towards lower values when evaluated in scenarios. Overall, the results show that DMs do not weigh the criteria consistently across the two surveys.

Fig. 5
figure 5

Box plots of the distributions of responses to all ten criteria in both SMART (marked here as direct) and PAPRIKA (marked here as Scenarios). The boxes show the first and third quartiles as well as the median value as a horizontal line and mean values with a white circle. Points beyond the whisker represent outliers beyond 1.5 times the interquartile range above the upper quartile and below the lower quartile
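
The summary statistics behind box plots like Fig. 5 can be sketched on synthetic values (in Python; the numbers below are illustrative, not study data): quartiles, median, mean, and the 1.5-IQR rule used to flag outliers, showing how a long tail pulls the mean away from the median.

```python
# Sketch of the summary statistics behind box plots like Fig. 5, on synthetic
# values: quartiles, median, mean, and the 1.5-IQR rule used to flag outliers.

import numpy as np

rng = np.random.default_rng(2)
weights = np.concatenate([rng.normal(8, 2, 40), [22.0, 25.0]])  # add a long right tail

q1, median, q3 = np.percentile(weights, [25, 50, 75])
iqr = q3 - q1
outliers = weights[(weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)]

print(f"median = {median:.1f}, mean = {weights.mean():.1f}")   # the tail pulls the mean up
print(f"IQR = [{q1:.1f}, {q3:.1f}], outliers = {np.round(outliers, 1)}")
```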

5.3 Comparison of self-reported accuracy and cognitive load

In addition to comparing the criteria-weighting methods based on responses to the two surveys, the third, opinion-based meta-survey collected respondents' subjective experience of the two methods. Their responses to its four questions are illustrated in Fig. 6.

Fig. 6
figure 6

Respondents’ self-reported assessment of the survey methods. From the left, graph a shows the difficulty of navigating through scenario questions versus navigating through direct questions. Graph b shows the responses to the level of effort needed to answer questions in scenarios versus direct questions. Graph c shows the responses to how realistic the tradeoffs are in scenarios versus direct questions. Graph d shows the DM’s opinion on how accurately the two methods reflect their preference weights

Figure 6a shows that the DMs find navigating through scenario-comparison questions harder than the direct weighting. Figure 6b shows that DMs find the direct questions more realistic compared to scenario comparisons. Figure 6c illustrates that the DMs exert more effort to answer the scenario-comparison questions. Finally, Fig. 6d shows that, in DMs’ opinion, both methods capture their preferences equally well.

Mean values were calculated for each Likert response and are shown in Table 5. The numeric results suggest that, on average, DMs believe the direct questions better reflect the tradeoffs present in real projects, demand less effort, are easier to navigate, and more accurately capture their preferences. These findings are consistent with Aloysius et al. (2006) in that DMs find direct measurement of criteria less confusing, less difficult, and more accurate. They also align with Kotteman and Davis (1998), in that DMs prefer less-sophisticated weighting methods. However, in the context of building design and construction, this assessment is surprising given that decisions about building design seldom ask for numerical or even ranked preferences across multiple criteria simultaneously. Instead, DMs must evaluate complex, multifarious alternatives, which would seem to be better simulated by the PAPRIKA method. The high scores for accuracy across both methods are particularly surprising given the low consistency in responses by most DMs discussed in the previous two sections.

Table 5 The mean and median value of the DMs’ self-reported assessment of survey methods

6 Conclusions

This study evaluated DM’s preferences among decision criteria as revealed through two MCDM survey techniques and compared the two survey methods. Forty-two design professionals who participated in the project indicated their preferences for ten resilience and sustainability criteria using both the direct scoring SMART technique, and the scenario-comparison-based PAPRIKA method. For both individuals and the group, criteria weights varied significantly depending on the method used to elicit preference, and these differences would inevitably affect the final ranking of alternatives in MCDM. Respondents also assessed the effort, cognitive load, and estimated the accuracy of the two methods, reporting both as accurate but preferring the direct method. Overall the results indicate substantial challenges in selecting a method to elicit criteria preferences for MCDM in problems of resilience and sustainability in buildings.

The method of elicitation affects the evaluation of building design criteria. Economic criteria consistently scored highly in direct ranking, suggesting a primacy of financial considerations when viewed simultaneously with all other criteria. However, when confronted with choices between two alternative scenarios, DMs choose greater sustainability and resilience even at increased economic cost, resulting in higher values for environmental and social criteria. It is impossible to discern motivations from the bare data, but one possible explanation is that direct ranking reflects design professionals' perception of their clients' priorities, while the indirect results from scenario choices reveal unconscious personal preferences or professional obligations to the public, albeit at some cognitive load. Alternatively, the scenarios may represent the psychological response to an urgent choice, while the SMART ranking represents an abstracted, top-down plan.

It is not clear whether or how the differences in preferences observed in these data affect the actual design process, complicating any application of MCDM to building design. Applying MCDM to real decisions requires understanding the decision process in that context. Although designed to emulate the kinds of decisions made in practice, the scenario comparisons remain abstract, hypothetical models of real decisions. Even so, the scenario-comparison survey presents situations that direct questions simply cannot, reducing the cognitive load but also affecting the results. While difficult, additional research to determine criteria preferences empirically from the design of real buildings would validate the relationship between preference values and design decision making in the contingency and complexity of practice.

Regardless of method or context, the validity of criteria weightings is difficult to determine, as there is no inherently true or actual value. The difference in self-assessed accuracy between the two methods is very small, indicating that respondents are confident that both surveys reflect their actual preferences. This confidence contrasts sharply with the significantly different median values for six of the ten criteria captured through the two surveys and with the low reliability of individual scores. These findings challenge the DMs' confidence in their responses and complicate efforts to collect meaningful criteria weights for use in MCDM.

Subjective criteria collection requires humans in the loop, rather than humans as inputs. To avoid influencing the meta-survey, study subjects were not provided the weights calculated by PAPRIKA. We hypothesize that the lack of feedback and the increased survey time frustrated some respondents, because the nature of this study precluded individuals from reflecting on any inconsistencies in their answers. Further, the findings show that in complex decision environments with more than a handful of criteria, cognitive load may affect DMs' responses. An iterative procedure could address these limitations, but simply presenting the calculated preference criteria for respondents to modify would effectively replicate the direct survey. We propose presenting final ranked alternatives to the DMs and allowing them to indirectly evaluate their preferences based on insight into the outcome of their preference values. Coupling this with a collaborative or benchmarking approach might improve the quality of the preference weights collected.

The sample in this study is neither sufficiently large nor representative enough to support conclusions about design professionals' preferences as a population. Future work to increase the sample size and diversity would enable such findings, perhaps also identifying inter-population differences, for example based on role. Further analysis of intra-personal differences and testing for personal consistency would also yield promising insights for the development and application of decision support tools. Testing these tools in practice on specific projects would offer unique and situated insight into the validity, reliability, and utility of these preference elicitation methods in the context of RSB design. Applying MCDM to the design of buildings requires further study and careful design to support decisions that increase the resilience and sustainability of the built environment.