1 Introduction

Assessing uncertainty in life cycle assessment (LCA) is important for understanding the reliability and robustness of results in the context of decision making (Finnveden et al. 2009). Traditionally, LCA studies report only deterministic results. However, sound decision making can benefit from an understanding of the stochastic distribution of LCA results (Geisler et al. 2004; Sugiyama et al. 2005). For example, when comparing products, ignoring uncertainty may lead to a misleading decision if the distributions of the two LCA results overlap substantially, even though their deterministic values favor one over the other (Heijungs and Kleijn 2001). Therefore, many LCA studies have implemented uncertainty analysis for sound decision support (Hertwich and Hammitt 2001; Huijbregts et al. 2003; Basson and Petrie 2007; Cellura et al. 2011; Clavreul et al. 2012; Noshadravan et al. 2013).

The concept of uncertainty in LCA was first discussed in a workshop of the Society of Environmental Toxicology and Chemistry (SETAC) in 1992, in the context of data quality (Fava 1994). Recognizing the significance of incorporating uncertainty, the LCA community formed the SETAC working group on data availability and data quality in the early 1990s. Heijungs (1996) illustrated how uncertainty propagates from the input parameters of an LCA model to its outputs. Weidema and Wesnæs (1996) addressed data quality concerns by introducing the pedigree method, which has since been incorporated into various life cycle inventory (LCI) databases. The European Network for Strategic Life Cycle Assessment Research and Development (LCANET) suggested making uncertainty quantification a top research priority. During those early years, much effort was devoted to setting up schemes for data quality indicators. Building on these efforts, Huijbregts (1998) established a framework for parameter uncertainty analysis, and a framework for quantifying data quality in LCI was subsequently developed.

More recently, the literature has focused on typologies of uncertainty and approaches to treating it (Björklund 2002; Huijbregts 2002; Baker and Lepech 2009). In general, two types of uncertainty are distinguished: stochastic uncertainty (due to inherent randomness) and epistemic uncertainty (due to lack of knowledge) (Clavreul and Guyonnet 2013; Heijungs and Lenzen 2014). Stochastic uncertainty has been the focus of many LCA studies, whereas the literature on epistemic uncertainty in LCA is scarce (Laner et al. 2014; Gavankar and Suh 2014). Heijungs and Huijbregts (2004) reviewed four general treatments of stochastic uncertainty, and Ciroth et al. (2004) proposed a method for uncertainty calculation. Two families of techniques have emerged: sampling methods and analytical approaches (Ross et al. 2002; Heijungs and Frischknecht 2004; Clavreul and Guyonnet 2013; Jung et al. 2013). According to a survey of 24 LCA studies that incorporated uncertainty analysis, parameter uncertainty is addressed more often than model and scenario uncertainty, and sampling is the most frequently used quantification technique (Lloyd and Ries 2008).

In addition to the development of frameworks and methodologies for uncertainty assessment, a number of empirical studies have implemented uncertainty analysis in LCA. Geisler et al. (2004) applied uncertainty assessment to a case study of plant-protection products using generic uncertainty factors for inventories. Huijbregts et al. (2003) quantified parameter, scenario, and model uncertainties in a comparative study of building insulation options. Many studies have incorporated probability distributions into uncertainty analysis through Monte Carlo simulation (Maurice et al. 2000; McCleese and LaPuma 2002; Sonnemann et al. 2003; Hung and Ma 2009; Cucurachi and Heijungs 2014).

When using Monte Carlo simulation (MCS), the shape of the distribution of the aggregate LCI results becomes an important issue for the efficient storage of such data. In the study of waste incinerators by Sonnemann et al. (2003), the distribution of aggregate LCI results from Monte Carlo simulations resembles a lognormal distribution. Several reports suggest that the lognormal distribution is an appropriate distribution type for inventory data, risk assessment, and impact pathway analysis, because it avoids negative values for emissions and impacts (Hofstetter 1998; Frischknecht et al. 2004). Many LCA studies following Sonnemann et al. (2003) have assumed that LCI results are lognormally distributed (Rosenbaum et al. 2004; Hong et al. 2010; Ciroth et al. 2016; Imbeault-Tétreault et al. 2013; Heijungs and Lenzen 2014). However, this assumption has not been empirically tested in the LCA literature. It has been shown that the product of lognormally distributed variables is itself lognormally distributed (Limpert et al. 2001). However, there is no theoretical underpinning for the distribution type of the product of two matrices whose entries are lognormally distributed, which is essentially a set of linear combinations of products of lognormally distributed variables (Hong et al. 2010). Furthermore, LCA data exhibit not only lognormal but also other distribution types, such as normal and triangular distributions, for which the distributions of the resulting products cannot be determined analytically.

This study aims to determine the probability distribution that best describes LCI results. The paper is the first attempt to generate distribution profiles for the entire set of aggregate LCIs of Ecoinvent v3.1. We performed MCS to generate random samples of unit process data and to estimate the distribution profiles of the LCI results. We then tested the hypothesized distributions of the LCIs using the overlapping coefficient method and identified the distribution type that best represents them.

In the next section, the method and data used in this study are presented, followed by the results and discussion (Sect. 3). In Sect. 4, the main findings are presented and a set of recommendations is discussed.

2 Method and data

2.1 Monte Carlo simulation

In this study, MCS is used to create the distribution of each aggregate LCI result from the entire Ecoinvent v3.1 database. MCS is a common sampling technique in uncertainty assessment that propagates uncertainty by drawing randomly generated numbers (Lloyd and Ries 2008). With advances in computer hardware and software, MCS of large datasets such as Ecoinvent v3.1 has become viable (Gentle 2013). Our approach to MCS takes three steps: (1) extract the distribution functions of the raw data, i.e., the unit process-level intermediate and elementary flows; (2) create random samples based on the probability distributions of the raw data; and (3) iterate the process and collect the sample results. Figure 1 illustrates the procedure for the statistical analysis used in this study.

Fig. 1 Monte Carlo procedure for uncertainty assessment of aggregate LCI

Every input parameter for calculating the LCI results is treated as a stochastic parameter. In each iteration, every unit process datum in the intermediate flow matrix A and the elementary flow matrix B is reconstructed based on its distribution function. Aggregate LCI results are then calculated as M = BA^{-1} (Heijungs and Suh 2002).

This process can be summarized in Eq. (1):

$$ {M}_i^{*}=\left(B+\delta {B}_i\right)\,{\left(A+\delta {A}_i\right)}^{-1} $$
(1)

where δB_i is the randomly sampled deviation matrix for the elementary flows, B is the deterministic elementary flow matrix, δA_i is the randomly sampled deviation matrix for the intermediate flows, A is the deterministic intermediate flow matrix, and i = 1, ..., n indexes the simulation runs (n = 1000).

The resulting M matrix has dimensions of 1869 (elementary flows) × 11,332 (processes), and we generated 1000 of them, \( \left\{{M}_1^{*},{M}_2^{*},\dots, {M}_{1000}^{*}\right\} \). For efficiency, we further sampled 1000 data points from each \( {M}_i^{*} \): we randomly chose 1000 elementary flow-process pairs and used them to extract 1000 data points from each run. The sampled 1000 elementary flow-process pairs can be found in the Electronic supplementary material. The number of data points that underwent the subsequent statistical analyses was therefore 1000 (elementary flow-process pairs) × 1000 (runs) = 1,000,000. One full iteration, including simulation, calculation of the entire LCI results, and storage of the 1000 randomly chosen points, takes about 1 min in Python 2.7 on a 16-core Windows PC. The total time for completing 1000 simulations is therefore about 1000 min, or roughly 17 h.
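As an illustration of Eq. (1), the sketch below shows what one Monte Carlo iteration could look like in Python. It assumes the deterministic A (11,332 × 11,332) and B (1869 × 11,332) matrices, their entry-wise geometric standard deviations (gsd_A, gsd_B), and the indices of the 1000 sampled flow-process pairs have already been loaded; all names are illustrative, all entries are treated as lognormal for brevity, and the dense inversion stands in for the sparse factorization that would be used in practice.

```python
import numpy as np

def sample_matrix(median, gsd, rng):
    """Perturb each entry with a lognormal factor around its median.

    `median` and `gsd` are arrays of the same shape; entries with gsd == 1
    stay deterministic because log(1) = 0, and the sign of the median is kept."""
    sigma = np.log(gsd)                          # log-space standard deviation
    return median * rng.lognormal(mean=0.0, sigma=sigma)

def one_iteration(A, B, gsd_A, gsd_B, pairs, rng):
    """Evaluate Eq. (1) once and return the 1000 sampled LCI entries."""
    A_i = sample_matrix(A, gsd_A, rng)           # A + delta A_i
    B_i = sample_matrix(B, gsd_B, rng)           # B + delta B_i
    M_i = B_i @ np.linalg.inv(A_i)               # aggregate LCI, M_i* = B_i A_i^-1
    rows, cols = pairs                           # indices of the 1000 chosen pairs
    return M_i[rows, cols]

# rng = np.random.default_rng(0)
# samples = np.array([one_iteration(A, B, gsd_A, gsd_B, pairs, rng)
#                     for _ in range(1000)])     # shape (1000 runs, 1000 pairs)
```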

2.2 Distribution functions

A probability distribution function f(x) describes the probability distribution of a random variable X. The most frequently used distribution for unit process-level inventory data in Ecoinvent is the lognormal distribution (Table 1). Normal and triangular distributions are also used as input parameter distributions, although they are less common than the lognormal distribution. Two further distributions with shapes similar to the lognormal, the gamma and Weibull distributions, are used in this study to test the distribution of the aggregate LCI results. Details on the five distributions are presented in the Electronic supplementary material.
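To make the input sampling concrete, the snippet below sketches how a single exchange value could be drawn given its recorded distribution type. The parameter conventions (geometric mean/GSD for lognormal, mean/standard deviation for normal, minimum/mode/maximum for triangular) are assumptions for illustration and do not mirror the exact Ecoinvent data schema.

```python
import numpy as np

def draw_exchange(dist_type, params, rng):
    """Draw one random value for a unit process exchange."""
    if dist_type == "lognormal":
        gm, gsd = params                         # geometric mean and geometric SD
        return rng.lognormal(np.log(gm), np.log(gsd))
    if dist_type == "normal":
        mu, sd = params                          # arithmetic mean and SD
        return rng.normal(mu, sd)
    if dist_type == "triangular":
        lo, mode, hi = params                    # minimum, mode, maximum
        return rng.triangular(lo, mode, hi)
    return params[0]                             # no distribution: keep the value

rng = np.random.default_rng(1)
print(draw_exchange("lognormal", (1.0, 1.3), rng))
```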

Table 1 Summary of probability distribution in Ecoinvent v3.1 unit process data

2.3 Statistical analysis of fitting the distribution

After obtaining the 1,000,000 samples described in the previous section, statistical analysis is performed to identify the probability distribution of the aggregate LCIs of Ecoinvent v3.1. A general method for finding the best-fitting distribution involves three steps: (1) plot the data in a frequency histogram or density plot to narrow down the list of candidate distribution types (Singh et al. 1997); (2) test the data (and their log transform) for normality using the Shapiro-Wilk test, which Razali and Wah (2011) identify as the most powerful normality test; and (3) generate LCIs based on the hypothesized distributions and test the fit of each distribution against the original data using the overlapping coefficient method.
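A minimal sketch of step (2) is given below, assuming `lci` is the vector of 1000 simulated values for one elementary flow-process pair; `scipy.stats.shapiro` is used here for illustration and is not necessarily the authors' exact script.

```python
import numpy as np
from scipy import stats

def normality_pvalues(lci):
    """Shapiro-Wilk p values for the raw and the log-transformed sample.

    A p value below 0.05 rejects normality; if the log-transformed sample
    passes, the data are consistent with a lognormal distribution."""
    p_raw = stats.shapiro(lci).pvalue
    p_log = stats.shapiro(np.log(lci)).pvalue    # requires strictly positive data
    return p_raw, p_log
```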

LCI results that follow a perfect lognormal distribution can be generated from the log-mean and log-standard deviation of the LCI results. To estimate the Weibull and gamma distributions, the shape and scale, and the shape and rate parameters of the LCI distribution are calculated, respectively. The coefficient of overlapping (OVL) measures the similarity of two probability distributions and can be used to calculate the percentage of overlapping area between the distribution of the sampled LCI results and the hypothesized distribution. The greater the OVL, the more similar the two distributions. In Eq. (2), Δ is the OVL, i.e., the common area under both density curves. If the two density functions are f(x) and g(x), then

$$ \varDelta \left(f,g\right)=\int \min \left\{f(x),g(x)\right\}\ \mathrm{d}x $$
(2)

The OVL between each distribution estimate and the sampled aggregate LCI results is calculated in R. A detailed explanation of the overlapping coefficient method can be found in Ridout and Linkie (2009).
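Although the OVL was computed in R, the Python sketch below illustrates the same idea: the three candidate distributions are fitted with `scipy.stats` and Eq. (2) is evaluated numerically on a grid. The kernel density estimate of the sample and the zero-location constraint in the fits are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np
from scipy import stats

def overlap_coefficients(lci, n_grid=2000):
    """Return the OVL between the sampled LCI density and each fitted candidate."""
    kde = stats.gaussian_kde(lci)                    # empirical density of the sample
    x = np.linspace(lci.min(), lci.max(), n_grid)
    f = kde(x)
    candidates = {"lognormal": stats.lognorm,
                  "gamma": stats.gamma,
                  "weibull": stats.weibull_min}
    ovl = {}
    for name, dist in candidates.items():
        params = dist.fit(lci, floc=0)               # fit with location fixed at zero
        g = dist.pdf(x, *params)
        ovl[name] = np.trapz(np.minimum(f, g), x)    # common area under both curves
    return ovl
```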

2.4 Data sources

We use the unit process inventory data from the Ecoinvent database v3.1 (default allocation) as our input data. Version 3.1 contains more than 11,000 unit processes and nearly 2000 types of environmental exchanges (Weidema et al. 2013). Uncertainty information, including the uncertainty type and the corresponding distribution parameters, is given for each unit process datum. The unit process data include both the intermediate flow matrix (A) and the elementary flow matrix (B) and their distributions. For unit process data with lognormal distributions, the geometric standard deviations are calculated from the variances given by the pedigree-based uncertainty assessment.

We also corrected a few extremely high uncertainty values in the database, which are likely erroneous, in order to keep the A matrix invertible. For example, one of the intermediate flows in the database follows a lognormal distribution with GSD = 4.1E+22, which is highly unlikely to reflect reality. Moreover, such high GSDs lead to extreme values in the (A + δA_i) matrix that make it non-invertible. We therefore adjusted the GSDs of those intermediate flows to a reasonably high value (GSD = 5), which is still about four times the average GSD of 1.3. For consistency, we also corrected uncertainty values in the B matrix. Because elementary flows have relatively higher GSD values than intermediate flows in the database, we assigned GSD = 10 to those GSDs greater than 10 in the B matrix (the average GSD of the elements in the B matrix is 1.8).
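This correction can be expressed as a simple cap on the geometric standard deviations; in the sketch below, `gsd_A` and `gsd_B` are assumed arrays holding the GSDs of the A and B matrices, and the thresholds follow the values stated above.

```python
import numpy as np

def cap_gsds(gsd_A, gsd_B):
    """Cap implausibly large GSDs so that (A + delta A_i) stays invertible."""
    return np.minimum(gsd_A, 5.0), np.minimum(gsd_B, 10.0)

# example: a GSD of 4.1e22 in the A matrix is reduced to 5
print(cap_gsds(np.array([1.3, 4.1e22]), np.array([1.8, 50.0])))
```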

3 Results and discussion

As the first step of our analysis, we constructed frequency and probability density plots of the simulated LCIs to inspect their distribution shapes. Figure 2 presents histograms of the LCI results for nine random elementary flow-process pairs. The distributions are similar to previous LCI simulations in the literature (Sonnemann et al. 2003; Muller et al. 2016). The shapes of the distributions in Fig. 2 can be visually identified as lognormal, gamma, or Weibull (Holland and Fitz-Simons 1982). To further determine the type of probability distribution of these results, a normality test and the overlapping coefficient method are applied.

Fig. 2 Histograms of nine random points in 1000 iterations of LCI results

By definition, if the logarithm of the data is normally distributed, then the data follow a lognormal distribution. The QQ-plots of the log-transformed LCI results in the Electronic supplementary material indicate that the majority of the LCI results are very close to lognormally distributed. The normality of the data can also be assessed through a variety of statistical tests. One of the most common is the Shapiro-Wilk test, which is known to be the most powerful normality test (Razali and Wah 2011). The results of the Shapiro-Wilk normality test of the simulated LCIs are provided in Table 2.

Table 2 Shapiro-Wilk normality test results of simulated LCIs (p value)

The results of the normality test for the 1000 random elementary flow-process pairs are presented in Table 2. At the 95 % confidence level, a p value less than 0.05 means we reject the null hypothesis that the data are normally distributed. About 99.8 % of the simulated LCI results showed p values below 0.05, meaning that nearly all of the simulated LCI results are not normally distributed.

After log-transforming the LCI outputs, the share of the simulated LCIs that passed the test increased to 43 % (Table 2), indicating that they are more likely to be lognormally than normally distributed. At the 95 % confidence level, the average p value of the log-transformed LCI results is 0.18, which does not reject the null hypothesis that the LCI results are lognormally distributed. Still, 57 % of the 1000 samples of LCI results did not pass the normality test after log-transformation. This can be explained by the well-known observation that the Shapiro-Wilk test becomes increasingly likely to reject the null hypothesis for even minor departures from normality as the sample size grows (Yazici and Yolacan 2007). We therefore repeated the Shapiro-Wilk test using only 100 randomly chosen values from each simulated LCI. In this case, 81 % of the simulated LCIs passed the normality test, supporting the conclusion that simulated LCIs generally follow a lognormal distribution.

The next step in fitting the distribution is to test how well a lognormal distribution or the other candidate distributions actually fit the LCI simulations. As mentioned above, the shapes of the histograms suggest lognormal, gamma, and Weibull distributions as candidates. These distributions are fitted to the results, and the OVL is calculated to quantify how close the results are to each distribution. The three distributions are generated from the corresponding distribution parameters of the simulated LCI results, as described in Sect. 2. A detailed description of the probability density functions of the three distributions is included in the Electronic supplementary material. Figure 3 shows nine typical comparisons between the results and the lognormal, gamma, and Weibull estimates for random elementary flow-process pairs.

Fig. 3 Density plots of LCI data and lognormal, gamma, and Weibull distribution estimates

In the distribution comparisons (Fig. 3), the lognormal estimates share a larger area with the simulated LCI data than the gamma or Weibull estimates. Figure 4 shows the distributions of the OVL values between the LCI results and the lognormal, gamma, and Weibull estimates. For example, the solid line in Fig. 4 shows the probability density of the OVL between the fitted lognormal distribution and the LCI simulations. The average OVL between the lognormal distribution and the LCI results is 95 %, whereas those for the gamma and Weibull distributions are 92 and 86 %, respectively. Based on the coefficient of overlapping, the LCI samples are therefore closest to a lognormal distribution.

Fig. 4 The coefficients of overlapping (OVL) of 1000 samples of LCI results and lognormal, gamma, and Weibull distribution estimates

Graphically and numerically, therefore, we conclude that the LCI results of Ecoinvent v3.1 are lognormally distributed. This observation allows the distribution of aggregate LCI results to be characterized efficiently using the median and GSD. In other words, individual users do not need to perform an MCS using unit process-level data, which can be highly time-consuming given the dimensions of the matrices involved.
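As a practical illustration of this point, once an aggregate LCI entry is summarized by its median and GSD, a user could reproduce its distribution directly, without repeating the unit process-level MCS; the numbers below are placeholders, not values from the database.

```python
import numpy as np

median, gsd = 2.4e-3, 1.6                          # hypothetical stored parameters
rng = np.random.default_rng(7)
samples = rng.lognormal(np.log(median), np.log(gsd), size=1000)
low, high = np.percentile(samples, [2.5, 97.5])    # e.g., an approximate 95 % interval
print(low, high)
```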

4 Conclusions

In this study, the probability distribution type of the aggregate LCIs of the Ecoinvent v3.1 database is identified by comparing the simulated LCIs with three candidate distributions. The results show that the lognormal distribution has the highest overlapping coefficient with the simulated LCIs (95 % on average), compared with the gamma (92 % on average) and Weibull (86 % on average) distributions. Our normality tests also show that 43 % of the aggregate LCIs pass the lognormality test at the full sample size (81 % when tested on reduced subsamples). Therefore, aggregate LCIs can be presented efficiently as lognormal distributions (i.e., median and GSD).

Although the current database provides uncertainty values for unit process inventories, conducting uncertainty analysis from the unit process level is neither time-efficient nor necessary for most studies. Determining the distribution that best fits the aggregate LCIs is therefore needed. It helps improve the efficiency of storing uncertainty data and performing uncertainty analysis in LCA by reducing both the computation time and the storage required for LCI data.

By way of example, 1000 LCI simulations using unit process-level distribution information for a product system involving 30 inputs from Ecoinvent v3.1 would take about 1000 min on a modern, average desktop computer (7-core, 16 GB RAM, 3.4 GHz). Using a pre-calculated distribution function for the LCIs reduces this to about 15 s, roughly 1/4000th of the time needed for the unit process-level computation.

Our study only considers the uncertainty information in the unit process data of Ecoinvent 3.1, which is mostly based on the pedigree matrix. The pedigree method is a pragmatic approach to uncertainty in the absence of better uncertainty information. However, the theoretical and empirical grounds for quantifying uncertainty with the pedigree approach are themselves questionable (Ciroth et al. 2016). The validity of the pedigree approach is outside the scope of this paper; the methodology presented here can be applied to any uncertainty data regardless of how they were derived. Although the majority of the unit process data in the A matrix include uncertainty values in the current database, some still lack uncertainty information. The problem is more severe for the B matrix, where only about 60 % of the data in Ecoinvent v3.1 contain uncertainty values. The aggregate LCI results that we calculated therefore do not reflect all uncertainties, because some of the uncertainty data, especially in the B matrix, were not available. However, for the purpose of this study, adding uncertainty information for the data that lack it in the original database is unlikely to change our conclusions.

Aggregate LCI uncertainty is only one step in the analysis of LCA uncertainty. Not only LCI uncertainty but also the uncertainty from impact assessment should be assessed in order to characterize the overall uncertainty of the final LCA results. Additional research is needed to understand the uncertainties in LCA encompassing both LCI and LCIA.