Abstract
Auditing and forensic analysis of how each prediction is calculated are key attributes of the transparent open-box (TOB) learning network. The network provides the full calculation and the input-metric contributions for each prediction it derives. TOB predictions are executed in two stages (stage 1 matches and ranks records using squared-error analysis; stage 2 optimizes variable weights and conducts sensitivity analysis). Neither stage involves generating or extrapolating correlations between the input variables. Both stages of the calculation generate accurate predictions for datasets with multiple, highly dispersed and non-linear influencing inputs. The transparent way in which the network generates predictions leads to a better understanding of the interplay between the input variables. Such attributes have direct relevance to the complex systems modelled in the coal industry [e.g., gross calorific value (GCV) prediction and coal petrology–grindability relationships]. The algorithm is applied here to predict GCV for a large published database (6339 records) of US coals that includes proximate and ultimate analysis metrics. The TOB predicts GCV accurately (RMSE ≤ 0.3 MJ/kg; R2 > 0.99). The transparency of the TOB method contrasts with the hidden relationships involved in many neural-network-based prediction systems. Worked examples are provided to show the detailed prediction calculations associated with individual data points. Applied to coal GCV, the TOB approach can help to verify the source of specific samples (e.g., specific mines or coal basins) using readily understandable underlying calculations available for audit and display. The TOB is therefore also suitable for identifying the provenance of specific coal samples based on proximate and/or ultimate analysis.
Introduction
Predictions of dependent-variable values from complex systems of multiple non-linear influencing variables with highly dispersed distributions are key requirements for the coal industry. Predicting the gross calorific value of various grades of coal (Mesroghli et al. 2009) and the relationships between petrological factors and grinding properties (Bagherieh et al. 2008) are common examples. Empirical correlations and artificial-intelligence algorithms are widely used to derive predictions for datasets of variable size covering regional and/or local coal sources. Many of these methods provide meaningfully accurate predictions where the underlying inputs are related in highly non-linear and irregular ways. Often, the metric of interest for commercial valuation is expensive to measure repeatedly by laboratory testing, making such prediction tools cost effective. This is the case with proximate and ultimate coal analysis. Artificial-intelligence tools are therefore growing in their deployment for such applications (Schmidhuber 2015).
A problem with many machine-learning algorithms is that they lack transparency. They do not reveal easily how each prediction they generate is derived, because they typically involve hidden, complex and multi-dimensional correlations. This means that they typically do not provide straightforward and auditable input–output relationships between the variables involved in their predictions. For this reason, some practitioners are reluctant to rely on machine-learning algorithms where such information is of critical importance (e.g., commercial valuation or error analysis for specific value intervals, both of which are relevant to GCV analysis of coal samples).
This opaqueness leads many practitioners to be sceptical about the predictions derived from neural-network methods, particularly their claims to accuracy when applied to relatively limited datasets. They are often viewed as black boxes for this reason (Heinert 2008), and their inability to reveal the details of their underlying calculations can be frustrating. This is despite their ability, based on a range of statistical accuracy measures, to achieve impressive levels of prediction accuracy for a wide range of complex systems. Indeed, some algorithms are prone to the pitfalls of overfitting (Lever et al. 2016), i.e., their hidden correlations are too dependent on a particular set of data, introducing doubt regarding their ability to fit additional data records as they become available. This is particularly a problem for datasets that cover the prediction-metric range intermittently, i.e., with significant gaps in the value range covered by the underlying dataset.
Locally-weighted learning methods (Atkeson et al. 1997) combined with lazy-learning principles (Birattari et al. 1999), originating from the much earlier recognition of the benefit of nearest-neighbour prediction methods (Fix and Hodges 1951; Cover and Hart 1967), can be configured to provide transparency. However, these approaches tend to be applied more to pattern-recognition algorithms (Garcia et al. 2012; Chen and Shah 2018) than to non-linear regression predictions, where the application of the more-opaque neural networks now dominates. Moreover, such approaches often seek to linearize highly non-linear systems on a localized or neighbourhood basis (Bontempi et al. 1999). Nevertheless, there is the potential for such approaches to be configured with transparency in mind (Shakhnarovich et al. 2006).
The transparent open-box (TOB) learning-network algorithm (Wood 2018) overcomes many of the issues mentioned by not relying upon hidden correlations to calculate its predictions. It applies a matching technique between the tuning and training data subsets. The degree of match between data records is quantified using squared-error analysis. The TOB stage-1 prediction establishes a set of high-ranking matching records from which an initial prediction is made. TOB stage 2 then applies an optimizer to identify the optimum weights to assign to each input to improve prediction accuracy. The transparent calculations used in TOB stages 1 and 2 are readily available for audit, and the level of accuracy achieved compares favourably with the more-opaque artificial-intelligence (AI) techniques (e.g., adaptive neuro-fuzzy inference systems, multi-layer-perceptron and radial-basis-function artificial neural networks, least-squares support vector machines, and hybrids of these with evolutionary optimization algorithms).
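The stage-1 matching and ranking described above can be sketched as a weighted squared-error search. The fragment below is a minimal illustration, not the published implementation: it assumes the inputs are already normalized, applies a uniform weight Wn to each variable's squared error, and returns the Q closest training records; the function and variable names are illustrative only.

```python
import numpy as np

def tob_stage1_match(train_X, train_y, query_x, Q=10, Wn=0.5):
    """Rank training records by weighted squared error against a query record.

    train_X : (n_records, n_vars) array of normalized input variables
    train_y : (n_records,) normalized dependent-variable values
    query_x : (n_vars,) normalized inputs of the record to predict
    Q       : number of top-ranking matches retained
    Wn      : uniform stage-1 weight applied to each variable's squared error
    """
    sq_err = (train_X - query_x) ** 2          # per-variable squared errors
    sum_e = (Wn * sq_err).sum(axis=1)          # weighted sum per training record
    order = np.argsort(sum_e)[:Q]              # indices of the Q closest matches
    return order, sum_e[order], train_y[order]
```

Because no correlation is fitted, the ranked matches themselves (and their error sums) are the complete audit trail for a stage-1 prediction.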
The AI methods mentioned do not need to be totally opaque. Simulation methodologies can provide a degree of transparency (Elkatatny et al. 2016). Variable importance algorithms can also establish the covariances between the influencing variables of AI methods. Auret and Aldrich (2012) achieve this with a random-forest algorithm. The TOB method goes further than this because it facilitates drilling down into the underlying variables to obtain the exact calculations involved in each of its predictions.
Here, the TOB method is applied to a 6339-record dataset for US coals, including both proximate and ultimate influencing variables, to predict GCV (Appendix 1, supplementary file). Highly accurate predictions are achieved using a small tuning data subset (~ 1.5% of the full dataset). Matches are achieved through error analysis against a training subset that constitutes about 97% of the entire database. The remaining ~ 1.5% of the data records are not involved in the training or tuning process; they are used as a testing data subset to independently test the TOB's prediction performance. The dataset involves nine influencing variables that contribute, through various applied weightings, to predict GCV as the dependent variable.
Although a coal GCV dataset is used to demonstrate the benefits of the TOB learning network to the coal industry, there are other coal-related systems, for which ANNs are frequently used as prediction tools, that could equally benefit from its application; for example, coal petrography and petrology influencing variables in relation to a measure of coal grindability as a dependent variable (Trimble and Hower 2003; Bagherieh et al. 2008).
TOB method
TOB stages 1 and 2 comprise 14 steps (Wood 2018), summarized in a flow diagram (Fig. 1). Stage 1 builds upon lazy-learning (Birattari et al. 1999) and nearest-neighbour (Chen and Shah 2018) principles, but with very specific error drivers. Stage 2 goes far beyond such principles by linking the selection of variable weightings to an optimizer, providing a more flexible and versatile weighting regime than is typically associated with k-nearest-neighbour classifiers (Samworth 2012).
The details and mathematical formulations involved in each of the 14 steps required to establish and implement a TOB learning network are described in Appendix 1. TOB stage-1 predictions (steps 1–10) are often found to be quite accurate (e.g., comparable to those provided by typical k-learning algorithms). However, they can typically be much improved upon by applying TOB stage 2.
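Stage 2's link between the variable weightings and an optimizer can be sketched as follows. The paper uses Excel Solver's evolutionary optimizer; the fragment below substitutes a simple random search over the per-variable weights, minimizing tuning-subset RMSE, and assumes each matched record's contribution is inversely proportional to its weighted squared-error sum. All names and the contribution formula are illustrative assumptions, not the published formulation.

```python
import numpy as np

def tob_predict(train_X, train_y, x, weights, Q):
    """Stage-2-style prediction: inverse-error-weighted average of top-Q matches.
    (The paper's exact contribution formula may differ; this is a sketch.)"""
    sum_e = ((train_X - x) ** 2 * weights).sum(axis=1)
    top = np.argsort(sum_e)[:Q]
    inv = 1.0 / (sum_e[top] + 1e-12)           # closer matches contribute more
    f = inv / inv.sum()                        # fractional contributions
    return (f * train_y[top]).sum()

def tune_weights(train_X, train_y, tune_X, tune_y, Q, n_iter=200, seed=0):
    """Random-search stand-in for the evolutionary Solver used in the paper:
    minimize tuning-subset RMSE over the per-variable weights."""
    rng = np.random.default_rng(seed)
    best_w, best_rmse = None, np.inf
    for _ in range(n_iter):
        w = rng.uniform(0, 1, train_X.shape[1])
        preds = np.array([tob_predict(train_X, train_y, x, w, Q) for x in tune_X])
        rmse = np.sqrt(((preds - tune_y) ** 2).mean())
        if rmse < best_rmse:
            best_w, best_rmse = w, rmse
    return best_w, best_rmse
```

In practice the optimizer also varies Q (2 ≤ Q ≤ 10); an outer loop over candidate Q values, retaining the (Q, weights) pair with the lowest tuning RMSE, reproduces that behaviour.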
The TOB learning network can be successfully applied using spreadsheets (e.g. Excel workbooks) for mid-sized data sets. Fully-coded algorithm formats or hybrid VBA plus spreadsheet setups can speed up deployment for such datasets. For large datasets it is appropriate to deploy the TOB algorithm in a fully-coded configuration, e.g., in Octave, Python, MatLab, R, VBA, etc.
A hybrid VBA–Excel spreadsheet configuration is used here to predict the gross calorific value (GCV) from a published dataset of coals from the United States (6339 data records).
Dataset compiled to predict coal gross calorific value (GCV)
There are numerous well-established published correlations based on linear and multi-variable regressions, particularly for coals from the United States (US), based on proximate and/or ultimate analysis (Given et al. 1986; Neavel et al. 1986; Singh and Kakati 1994; Channiwala and Parikh 2002; Majumder et al. 2008; Mathews et al. 2014). Several of these provide predictions with low absolute errors and correlation coefficients (R2) > 0.9 between measured and predicted GCV. However, as many of the variables involved in proximate and ultimate analysis vary in a non-linear manner, prediction improvements have been achieved by applying non-linear regression or machine-learning algorithms such as artificial neural networks (ANN), support vector regression (SVR) or adaptive network based fuzzy inference system (ANFIS) (Patel et al. 2007; Mesroghli et al. 2009; Chelgani et al. 2010, 2011; Yalcin Erik and Yilmaz 2011; Kavsek et al. 2013; Tan et al. 2015; Feng et al. 2015). These are mainly based on proximate analysis of relatively small data sets of coals from India and China but achieve high levels of prediction accuracy with R2 > 0.99 between measured and predicted GCV in some cases.
Tan et al. (2015) also demonstrate a GCV-prediction performance with R2 > 0.99 for their SVR algorithm using the many thousands of samples of US coal analysis provided by the US Geological Survey Coal Quality (COALQUAL) database (Bragg et al. 1997). Matin and Chelgani (2016) demonstrated that the random-forest algorithm (Breiman 2001; Auret and Aldrich 2012), which establishes covariances between the influencing variables, could produce highly accurate predictions of GCV (R2 > 0.97 for proximate data; R2 > 0.99 for ultimate data) using a filtered version of the COALQUAL (version 2) database. Matin and Chelgani (2016) filtered out coal records with > 25% ash, as being unsuitable for use in power production, and those with analyses that did not sum to 100%.
This left 6339 data records of the COALQUAL database which they used to test their random forest algorithm. It is these 6339 data records that are used here to test the TOB algorithm. It should be noted that the COALQUAL dataset has now been updated and extended (> 13,000 coal records) by the issue of version 3 (Palmer et al. 2015), but for the current purpose it is deemed more useful to use the filtered version 2 dataset for which published GCV-prediction performances are available for comparison. Matin and Chelgani omitted fixed carbon (FC) and oxygen (O) values from the GCV-influencing variables they included in their prediction model, because they were derived from other variables in the proximate or ultimate analysis. These variables are included in the TOB analysis as they are valuable for data-record matching purposes.
The compiled US-coal GCV dataset used here to demonstrate the prediction capability of the TOB method spans a significant range of GCV. It also incorporates significant ranges of input-variable distributions (proximate and ultimate analysis), as shown in Table 1. The details for each of the 6339 data records are provided in a supplementary file (see “Appendix”). This includes the values of all variables and links to the sample numbers listed in the extracts from the COALQUAL version 2 database (Bragg et al. 1997) as compiled by Matin and Chelgani (2016). From the supplementary file it is possible to locate the exact COALQUAL sample number and US state of origin of each record.
It is clear from Table 1 that the dataset is skewed towards the higher end of the GCV range; nearly 80% of the samples fall in the GCV range 23–35.5 MJ/kg. This means that the lower end of the GCV scale (i.e., < 15 MJ/kg) is sparsely sampled, representing about 4.5% of the data records. This feature of the dataset is addressed in the learning networks constructed.
The highly dispersed and non-linear relationships between each of the input variables (#1 to #9) and the dependent variable GCV are illustrated in Figs. 2 and 3. For the proximate-analysis variables (Fig. 2), moisture content and fixed carbon show the best correlations with GCV; the former a negative correlation (R2 = 0.8216) and the latter a positive correlation (R2 = 0.77). On the other hand, ash and volatiles show poor correlations with GCV and greater dispersal (Fig. 2). The dispersal in the ash-versus-GCV relationship gradually reduces in this dataset for GCV values > 25 MJ/kg. For the ultimate-analysis variables (Fig. 3), carbon and oxygen show the best correlations with GCV; the former a positive correlation (R2 = 0.9847) and the latter a negative correlation (R2 = 0.8345). The other variables in Fig. 3 show significant dispersion and non-linearity, particularly S.
TOB predictions of coal gross calorific value (GCV) from the compiled 6339-record GCV dataset
The GCV dataset is divided into subsets (training = 6155 records; tuning = 92 records; testing = 91 records). Some 97% of the 6339 records in the full dataset reside in the training subset. The allocation of records for tuning and testing is spread deliberately across the entire range of data values. It is best to avoid random allocation, as that is likely to lead to clustering and gaps in specific data ranges. Steps 2 and 6 of the method (Appendix 1) help to distribute the records for tuning and testing systematically (e.g., using a specified ranking-interval spacing to select from the ranked and sorted dataset) while avoiding subjective selection.
This allocation was achieved by initially ranking the full dataset in ascending (or descending) order of GCV values and then selecting every 70th data record from the full dataset to be set aside during the tuning process and allocated to the testing subset. Two additional records were also added to the testing subset: one close to the lower GCV limit and one close to the upper GCV limit. A similar approach was then taken in selecting the tuning-subset records from the remaining dataset, i.e., every 70th data record from the combined training and tuning subset (ranked in order of GCV values) was allocated to the tuning subset. Two additional records were then added to the tuning subset: one close to the lower GCV limit and one close to the upper GCV limit. This approach ensures a full spread of GCV values, representative of the entire data range, in both the testing and tuning subsets. Such a spread cannot be guaranteed by random sampling. For the learning network to be tuned across its entire dependent-variable range, it is essential that the tuning subset is distributed relatively evenly across that range.
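The systematic (non-random) allocation described above can be sketched as a rank-and-stride selection. The stride of 70 and the inclusion of records near the range extremes follow the description in the text, while the function name and implementation details below are illustrative.

```python
import numpy as np

def allocate_subsets(y, interval=70):
    """Systematic allocation: sort records by the dependent variable and take
    every `interval`-th ranked record for a held-out subset, ensuring the
    records nearest the range extremes are also held out."""
    order = np.argsort(y)                      # rank records by GCV
    held = list(order[::interval])             # every interval-th ranked record
    for extreme in (order[0], order[-1]):      # cover both GCV limits
        if extreme not in held:
            held.append(extreme)
    held = np.array(held)
    mask = np.ones(len(y), dtype=bool)
    mask[held] = False
    return np.where(mask)[0], held             # remaining indices, held-out indices
```

Applying this once yields the testing subset; applying it again to the remaining records yields the tuning subset, leaving the rest as the training subset.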
Table 2 provides the TOB-prediction results for the coal GCV dataset. The optimum GCV-prediction performance is assessed by comparing actual and predicted GCV values. This is initially calculated for the tuning-subset records and achieves RMSE = 0.33462 MJ/kg; R2 = 0.9963 with optimized Q = 9 (i.e., the nine highest-ranking data-record matches from the training subset are used in the TOB stage-2 predictions). The TOB stage-2 variable weights established for the optimum solution (most accurate predictions) were: w#1 = 5.304E−04, w#2 = 8.317E−03, w#3 = 0, w#4 = 1.576E−03, w#5 = 0, w#6 = 1.0, w#7 = 2.062E−03, w#8 = 0, and w#9 = 4.348E−03. The very small weights applied to several of the input variables have significant influence on the prediction accuracy achieved. Section 6 addresses this point. Suffice it to say here that it does not follow that the higher the weight applied to a variable in the TOB method, the more significant that variable is in determining the predicted values.
Table 2 compares the predictions achieved with sub-optimal values of Q (i.e., Q = 2 to 10) against the optimal value of Q = 9. The results show that accurate GCV predictions can be achieved using most Q values in that range (R2 = 0.9944 for Q = 2, compared to 0.9963 for Q = 9). This suggests that for this learning network the value of Q plays a subordinate role, as high degrees of accuracy are achieved for all values of Q tested. This is further emphasized by the low range of root mean squared error (RMSE) values, from 0.33462 MJ/kg (Q = 9) to 0.41393 MJ/kg (Q = 2). These values indicate that the dataset is not noticeably under-fitted when Q = 2.
The TOB stage 1 predictions (Q = 10 and Wn = 0.5; see the left side of Table 2) also show credible accuracy (RMSE = 0.48381 MJ/kg; R2 = 0.9923). The data-record-matching conducted by TOB Stage 1 is, clearly, an essential contributing component to the optimum GCV predictions.
Testing-subset predictions (Table 2: lower two rows) applying the optimum Q and Wn values achieve very slightly higher accuracy in GCV prediction than the tuning-subset GCV predictions. The testing-subset accuracy-metric values are RMSE = 0.30556 MJ/kg and R2 = 0.9970 (Table 2). The accuracy of the TOB method's predictions compares favourably with those achieved by other machine-learning and empirical-correlation methods applied to this dataset (Matin and Chelgani 2016) and to other datasets (Feng et al. 2015; Tan et al. 2015).
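For reference, the two accuracy metrics reported throughout (RMSE and R2) can be computed as below. R2 is computed here as the coefficient of determination; some studies instead report the squared Pearson correlation between measured and predicted values, which can differ slightly.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between measured and predicted values."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def r_squared(actual, predicted):
    """Coefficient of determination between measured and predicted values."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```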
Figures 4 and 5 display the optimum TOB predictions of coal GCV for the tuning and testing subsets, respectively. It is apparent from these graphs that coals in the GCV < 15 MJ/kg range are only sparsely sampled by both the tuning and testing subsets. To verify that the TOB network can produce consistently accurate predictions in this sparser data region, a separate TOB network is constructed to analyse and tune that section of the GCV distribution more extensively.
Auditing and interrogating TOB predictions
Matching of data records rather than establishing correlations between input variables is the basis of the TOB methodology. It is constrained in its predictions by the range covered by the lowest and highest GCV values; it cannot extrapolate beyond that range (in contrast to many other AI methods).
For many machine-learning algorithms and empirical correlations it makes sense to apply their optimum coefficients to the data records in the training set as well as the tuning and/or testing sets used to verify their accuracy. When applying the TOB algorithm predictions are made only for data records that are not already present in the training subset. Predictions for records already in the TOB training set would achieve exact matches (RMSE = 0) and provide no insight to the accuracy of the method. This is a fundamental and distinguishing difference between the TOB methodology and most other AI methods (except K-learning-type methods).
As there are no hidden or difficult-to-access intermediate correlations involved in the TOB, the underlying calculations of each prediction are accessible. Indeed, a key benefit of the TOB algorithm is that it allows each prediction calculation step to be readily analysed. Tables 3, 4, 5 and 6 provide examples of how this is achieved and the information it provides. These tables detail the TOB predictions (stages 1 and 2 presented separately) for record #2738 (Tables 3, 4) and record #193 (Tables 5, 6). Tables 3 and 5 focus on the TOB stage-1 calculations (Q = 10; Wn = 0.5). Tables 4 and 6 focus on the TOB stage-2 predictions (2 ≤ Q ≤ 10; 0 ≤ Wn ≤ 1). The left side of the tables displays the ten high-matching records established by TOB stage 1 for data records #2738 and #193. The upper half of the tables displays the nine input-variable values (#1 to #9) and the dependent variable (GCV). These values are all expressed in normalized terms (ranging from − 1 to + 1). The matching records are listed in order of matching rank; the first is the closest match and the one listed 10th is the 10th-best match of the 6155 records in the GCV training subset.
The lower half of Tables 3, 4, 5 and 6 applies the Q and Wn values to the squared errors calculated for the variables. The fourth column from the right (in the lower half of each table) lists the sum of the weighted squared errors (SumE). For stage 1 (Q = 10) the values of ten matching records are involved in the calculation of SumE; for stage 2 (optimum Q = 9) the values of only nine matching records are involved. In both cases, the last three columns to the right in the lower half of these tables detail the actual and relative contributions of each of the high-matching records to the prediction. The sum of the contributions in the right-most column (lower half of the tables) provides the prediction in normalized terms. The lowest two rows (right-side numbers) in each of Tables 3, 4, 5 and 6 provide a comparison of the predicted GCV (transformed back from the normalized scale) with the measured GCV for the data record being analysed.
The last three columns of Tables 3, 4, 5 and 6 (lower half of those tables) are the key calculations to focus upon. These use the SumE values to generate GCV predictions for the data record in question. The formulas provided in the heading of each of those three columns explain the calculations made.
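The contribution columns can be reproduced from the SumE values. The sketch below assumes each top-ranking record's fractional contribution is inversely proportional to its SumE (closer matches contribute more), with the final normalized prediction transformed back to MJ/kg assuming min-max normalization onto [−1, +1]. The exact formulas are those given in the table headings, so this is an illustrative approximation with hypothetical function names.

```python
import numpy as np

def contributions(sum_e):
    """Fractional contribution of each top-Q match, assumed inversely
    proportional to its weighted squared-error sum (SumE)."""
    inv = 1.0 / np.asarray(sum_e)
    return inv / inv.sum()

def predict_gcv(sum_e, matched_gcv_norm, gcv_min, gcv_max):
    """Sum of contribution-weighted normalized GCVs of the matched records,
    transformed back to MJ/kg (assuming [-1, +1] min-max normalization)."""
    pred_norm = (contributions(sum_e) * np.asarray(matched_gcv_norm)).sum()
    return (pred_norm + 1.0) * (gcv_max - gcv_min) / 2.0 + gcv_min
```

Because the prediction is a weighted average of matched-record GCVs, it is necessarily bounded by the training subset's GCV range, consistent with the method's inability to extrapolate noted below.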
For record #2738 (Tables 3, 4) each of the top-ten high-ranking records contributes a small amount to the prediction, but the higher the ranked match the greater its contribution. The rank-#1 match contributes 14.2% to the prediction, whereas the rank-#10 match contributes only 6.7% to the GCV prediction for record #2738. The GCV prediction achieved is quite accurate: < 0.5 MJ/kg above the measured value of 29.1 MJ/kg. In TOB stage 2 (Table 4) for record #2738 only the top nine matched records contribute to the GCV prediction, because the optimized Q value selected is 9. Also, as the weightings for each variable are not equal in the optimized solution, the relative contributions of those top-ranking matches are no longer governed only by the closeness of their match with the record-#2738 variables. This is clear because the prediction is now dominated by match ranks #8 and #3 (41.6% and 40.0% contributions, respectively, to the GCV prediction; see the lower half of Table 4, second column from the right). The optimization adjustment generates a better stage-2 prediction for this record, i.e., ~ 0.3 MJ/kg above the measured GCV value of 29.1 MJ/kg.
The TOB stage-2 prediction is not always better than the TOB stage-1 prediction, but it is most of the time, as the optimizer minimizes the RMSE across the entire tuning or testing subset. The second example, for testing-subset data record #193, shows a case where the stage-1 TOB prediction outperforms the stage-2 TOB prediction. The stage-1 prediction (Table 5) is ~ 0.32 MJ/kg less than the measured GCV value of 20.9 MJ/kg for that record, whereas the stage-2 prediction (Table 6) is ~ 0.47 MJ/kg less than the measured value. In the stage-1 prediction all ten top-matching records contribute between ~ 7.0% and 14.3% to the prediction (rank #1 contributes the most, but it does not dominate). In contrast, the stage-2 prediction is dominated by ranked matches #1 (59.6%), #9 (13.4%) and #6 (10.1%). When a prediction receives more significant input from the lower-ranked of the top matches, its accuracy is sometimes impaired. Both the stage-1 and stage-2 predictions are credible for record #193, but they are influenced by the top-ten ranking matches quite differently.
For both examples, the important role of the variables with very low but non-zero Wn values in determining the stage-2 predictions is apparent. If those variable weights were zero, the variables would not contribute to the improved accuracy of the optimized solution.
The details provided by Tables 3, 4, 5 and 6 highlight just how deeply it is possible to interrogate both the stage-1 and stage-2 predictions, demonstrating the method's high level of prediction transparency. Moreover, it is able to provide almost forensic insight into the similarity between each record tested and each record in the training subset. This can be useful when trying to identify the exact provenance of a certain coal (e.g., a specific basin, or even a specific mine, can sometimes be identified by the closeness of the high-ranking matches). Hence, in some cases it is not only the prediction of the dependent-variable value that can be derived from this learning network; it can be accompanied by provenance information. Other machine-learning algorithms and empirical relationships that are underpinned by correlations cannot easily deliver this level of detail on the degree of similarity with specific records in their training subsets.
Analysis of a more sparsely populated data subset (GCV > 6 to < 15 MJ/kg)
To provide more insight into the more sparsely populated GCV < 15 MJ/kg range of the compiled dataset, a separate TOB network is evaluated using just the 283 samples in the database with GCV < 15 MJ/kg. The statistical summary of the measured variables for these 283 records (which are also included in the dataset described in Table 1) is provided in Table 7.
This more-focused TOB divides its data records using the same methodology as already described (training = 235 records; tuning = 24 records; testing = 24 records). This TOB tunes the GCV < 15 MJ/kg interval with 24 data records, whereas the larger-dataset TOB previously described involved only 5 data records to tune that interval. Table 8 details the tuned and optimized prediction performance of this focused TOB, with sensitivities, demonstrating high prediction accuracy as illustrated (Fig. 6) for its testing subset (RMSE = 0.2944; R2 = 0.9644). The accuracy is not as high as for the larger dataset (see Fig. 5) owing to the greater spacing of training-subset data records (i.e., more sparsely distributed data) in the GCV < 15 MJ/kg interval. This highlights a key positive feature of the TOB learning-network approach: it is resistant to over-fitting sparse datasets. As the spacing between data points in the training subset increases, the statistically assessed accuracy of its predictions tends to decrease. Although such behaviour is intuitive, it is not necessarily the outcome with empirical calculations or learning networks driven by complex correlations between the variables, which are prone to over-fitting.
Table 8 reveals that the best prediction performance for this focused TOB network is for Q = 9. The sensitivity analysis shows that RMSE increases and R2 decreases as the value of Q decreases below 9, with R2 falling below 0.96 for Q values below 8. However, for Q = 3 the prediction accuracy is also good, and superior to that for Q = 2 or Q = 4. This is true for both the tuning and testing subsets (Table 8). Indeed, Q = 3 represents a local minimum, which the Solver evolutionary optimizer selected as its optimum (i.e., became trapped at) on most of its runs. This bimodal optimization outcome suggests that for some data records better predictions are achieved with Q = 3. Closer inspection is therefore warranted for data record #3452, highlighted in Fig. 6 for the optimum Q = 9 tuned setting as having relatively low accuracy in comparison to most other records in the testing subset.
Tables 9 and 10 describe the detailed calculation of the TOB stage-1 and stage-2 GCV predictions for data record #3452, respectively. In the stage-1 prediction (Table 9) the matching record ranked #1 (record #3106) contributes 32.95% to the GCV prediction, with the other high-ranking matched records contributing progressively less until the matched record ranked #10 (record #21) contributes just 6.0%. For this stage-1 prediction, the top-three matched records contribute > 50% of that calculation. This achieves a prediction of high accuracy, i.e., ~ 0.3 MJ/kg below the measured GCV value of 12.67 MJ/kg.
For data record #3452 the TOB stage-2 prediction (Table 10) is significantly less accurate than the stage-1 prediction (Table 9). The reason is that in the TOB stage-2 solution, with Q = 9 and variable weights applied, the top-three matched records contribute only about 18% to the prediction, whereas the matched record ranked #8 (record #3110) contributes 68%. This achieves a prediction of less impressive accuracy, i.e., ~ 0.64 MJ/kg below the measured GCV value of 12.67 MJ/kg. In this case, considering the analysis just described and the sensitivity analysis of Table 8, a case could be made for applying a Q = 3 cut-off for the prediction of this data record. This highlights how the transparency of the TOB learning network's calculations aids the analysis of outlier data records (i.e., those for which predictions fall significantly off trend). It makes it possible to identify, in detail, the reasons for such outlying prediction values, and it often provides justification for potential adjustments to improve or correct the predictions for such problematic data records.
Auditing TOB predictions and conducting sensitivity analysis (e.g., varying Q values from the optimum and changing the data-subset allocation percentages) focused on specific data records facilitates rigorous outlier analysis; something that is not easily possible with correlation-based machine-learning algorithms or empirical calculations. This TOB strength is particularly beneficial for datasets in which details of specific data-record predictions are important (e.g., for commercial valuation or detailed sample-provenance purposes, both of which apply to GCV and commercial coal datasets). This feature could also be usefully applied to other commercially important characteristics of coal (e.g., predicting coal grindability from multiple input variables based on coal petrological properties).
Although the coal dataset studied here is relatively large, and the TOB algorithm clearly copes well with this number of data records, the TOB algorithm may have some limitations as a “big data” tool applied to very large datasets. The algorithm has to contain and manage a large training database, and its performance (i.e., computational speed) is likely to deteriorate progressively as the intrinsic dimensionality of the variable space increases. Further studies are required to establish the limits of applicability of the algorithm to such “big data” sets. However, although computational time is likely to increase for very large datasets, the transparency provided by the TOB algorithm may compensate for this. As stage 2 of the algorithm focuses on just a few of the best matches (i.e., up to ten or so), the collective influence of a significant number of variables would remain fully transparent.
The COALQUAL dataset lends itself to further studies on the impacts of sparse data coverage on TOB prediction performance. A future study will conduct sensitivity analysis that progressively excludes percentages of the dataset from the training data subset used for model tuning (i.e. adding those excluded data records to the testing subset). This will quantify how sparse the training data subset can become before it ceases to yield meaningfully accurate predictions for the dependent variable.
Conclusions
The transparent open-box (TOB) learning network algorithm provides credible and reliable predictions of dependent variables, such as coal gross calorific value (GCV), that involve complex, highly dispersed and non-linear datasets for the influencing variables. Its high prediction accuracy, demonstrated in this study when applied to predict GCV from nine influencing variables of proximate and ultimate analysis in a large published dataset (6339 data records of US coals), testifies to such capabilities. The method could easily be applied to more limited datasets, e.g., those based upon only the easier-to-obtain proximate-analysis variables.
TOB’s prediction performance for this published coal dataset compares favourably with that reported for other artificial-intelligence algorithms and empirical correlations, with the added benefit that it is more easily audited and generally more transparent. The TOB algorithm does not develop any correlations when calculating its predictions. Instead, it establishes (in TOB stage 1) the closest matches with ten data records in its large associated training subset. In TOB stage 2 the algorithm improves its prediction, based on statistical measures of accuracy for tuning and testing data subsets (i.e., minimizing root mean squared error between predicted and measured GCV values). It achieves this by applying an optimizer to select the number of those matches (2 ≤ Q ≤ 10) and applying tuned weights to the errors associated with each input variable.
The calculations involved in the predictions derived from the TOB algorithm are individually auditable. Standard Solver optimizers or customized evolutionary or non-linear optimization algorithms can be used to successfully and transparently achieve the TOB stage 2 optimized predictions. Such flexibility and access to the underlying calculations is not possible with most other artificial-intelligence prediction methods or empirical calculations.
An additional valuable feature of the TOB algorithm is the ease with which sensitivity analysis can be conducted by modifying its Q value. In particular, the Q-value sensitivities can help to identify whether the algorithm is over-fitting or under-fitting a dataset. These positive attributes make the TOB algorithm a suitable prediction-performance benchmark with which to compare the predictions of other machine-learning and empirical-correlation algorithms. It typically provides results that complement those of other algorithms in terms of the insight offered into the underlying dataset. Indeed, in some cases, where the dataset covers coals from many different regions and mines, the TOB algorithm has the ability, through its record-matching stage-1 routine, to identify the provenance of specific samples.
The detailed calculations shown for example data records demonstrate exactly how the predictions of the TOB algorithm can be audited and assessed. These detailed calculations are not complex, rather they highlight the prediction mechanisms involved and the key roles played by the optimized Q value and the input variable weights in producing the stage 2 optimized predictions. The ability to interrogate and verify in detail specific predictions is increasingly important for providing user confidence in prediction algorithms. By revealing useful information about the relative importance of identified training-subset records in terms of their contributions to specific predictions, and the problematic nature of other data records (e.g., outlying values of certain metrics not replicated in other data records), the TOB method provides such user confidence. In some applications it may be worth sacrificing a small degree of accuracy in order to obtain such insight and confidence associated with the predictions to be deployed.
References
Atkeson CG, Moore AW, Schaal S (1997) Locally weighted learning. Artif Intell Rev 11(1–5):11–73
Auret L, Aldrich C (2012) Interpretation of nonlinear relationships between process variables by use of random forests. Miner Eng 35:27–42
Bagherieh AH, Hower JC, Bagherieh AR, Jorjani E (2008) Studies of the relationship between petrography and grindability for Kentucky coals using artificial neural network. Int J Coal Geol 73:130–138
Birattari M, Bontempi G, Bersini H (1999) Lazy learning meets the recursive least squares algorithm. Advances in neural information processing systems, vol 11. MIT Press, Cambridge, pp 375–381
Bontempi G, Birattari M, Bersini H (1999) Lazy learning for local modeling and control design. Int J Control 72(7/8):643–658
Bragg LJ, Oman JK, Tewalt SJ, Oman CJ, Rega NH, Washington PM, Finkelman RB (1997) US geological survey coal quality (COALQUAL) database: version 2.0. US geological survey open-file report 97–134. https://pubs.er.usgs.gov/publication/ofr97134. Accessed 15 Nov 2018
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Channiwala SA, Parikh PP (2002) A unified correlation for estimating HHV of solid, liquid and gaseous fuels. Fuel 81:1051–1063
Chelgani SC, Mesroghli SH, Hower JC (2010) Simultaneous prediction of coal rank parameters based on ultimate analysis using regression and artificial neural network. Int J Coal Geol 83:31–34
Chelgani SC, Hart B, Grady WC, Hower JC (2011) Study relationship between inorganic and organic coal analysis with gross calorific value by multiple regression and ANFIS. Int J Coal Prep Util 31:9–19
Chen GH, Shah D (2018) Explaining the success of nearest neighbor methods in prediction. Found Trends R Mach Learn 10(5–6):337–588
Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27
Elkatatny S, Tariq Z, Mahmoud M (2016) Real time prediction of drilling fluid rheological properties using artificial neural networks visible mathematical model (whitebox). J Pet Sci Eng 146:1202–1210
Feng Q, Zhang J, Zhang X, Wen S (2015) Proximate analysis-based prediction of gross calorific value of coals: a comparison of support vector machine, alternating conditional expectation and artificial neural network. Fuel Process Technol 129:120–129
Fix E, Hodges JL Jr (1951) Discriminatory analysis, nonparametric discrimination: consistency properties. Technical report, USAF School of Aviation Medicine
Frontline Solvers (2018) Standard Excel solver—limitations of nonlinear optimization. https://www.solver.com/standard-excel-solver-limitations-nonlinear-optimization. Accessed May 2018
Garcia S, Derrac J, Cano J, Herrera F (2012) Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans Pattern Anal Mach Intell 34(3):417–435
Given PH, Weldon D, Zoeller JH (1986) Calculation of calorific values of coals from ultimate analyses: theoretical basis and geochemical implications. Fuel 65:849–854
Heinert M (2008) Artificial neural networks – how to open the black boxes? In: Reiterer A, Egly U (eds) Application of artificial intelligence in engineering geodesy. Proceedings of AIEG Vienna, Vienna, pp 42–62 (ISBN 978-3-9501492-4-1)
Kavšek D, Bednárová A, Biro M, Kranvogl R, Vončina DB, Beinrohr E (2013) Characterization of Slovenian coal and estimation of coal heating value based on proximate analysis using regression and artificial neural networks. Cent Eur J Chem 11(9):1481–1491
Lever J, Krywinski M, Altman N (2016) Model selection and overfitting. Nat Methods 13:703–704. https://doi.org/10.1038/nmeth.3968 (Published online)
Majumder AK, Jain R, Banerjee P, Barnwal JP (2008) Development of a new proximate analysis based correlation to predict calorific value of coal. Fuel 87:3077–3081
Mathews JP, Krishnamoorthy V, Louw E, Tchapda AH, Castro-Marcano F, Karri V, Alexis DA, Mitchell GD (2014) A review of the correlations of coal properties with elemental composition. Fuel Process Technol 121:104–113
Matin SS, Chelgani SC (2016) Estimation of coal gross calorific value based on various analyses by random forest method. Fuel 177:274–278
Mesroghli S, Jorjani E, Chelgani SC (2009) Estimation of gross calorific value based on coal analysis using regression and artificial neural networks. Int J Coal Geol 79:49–54
Neavel RC, Smith SE, Hippo EJ, Miller RN (1986) Interrelationships between coal compositional parameters. Fuel 65:312–320
Palmer CA, Oman CL, Park AJ, Luppens JA (2015) The US Geological Survey coal quality (COALQUAL) database version 3.0: US geological survey data series 975, p 43 with appendixes. https://doi.org/10.3133/ds975
Patel SU, Jeevan Kumar B, Badhe YP, Sharma BK, Saha S, Biswas S, Chaudhury A, Tambe SS, Kulkarni BD (2007) Estimation of gross calorific value of coals using artificial neural networks. Fuel 86:334–344
Samworth R (2012) Optimal weighted nearest neighbour classifiers. Ann Stat 40(5):2733–2763
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
Shakhnarovich G, Darrell T, Indyk P (2006) Nearest-neighbor methods in learning and vision: theory and practice (neural information processing). The MIT Press, Cambridge (ISBN 026219547X)
Singh KP, Kakati MC (1994) New models for prediction of specific energy of coal. Fuel 73:301–303
Tan P, Zhang C, Xia J, Fang Q-Y, Chen G (2015) Estimation of higher heating value of coal based on proximate analysis using support vector regression. Fuel Process Technol 138:298–304
Trimble AS, Hower JC (2003) Studies of the relationship between coal petrology and grinding properties. Int J Coal Geol 54:253–260
Wood DA (2018) A transparent open-box learning network provides insight to complex systems and a performance benchmark for more-opaque machine learning algorithms. Adv Geo Energy Res 2(2):148–162
Yalcin Erik N, Yilmaz I (2011) On the use of conventional and soft computing models for prediction of gross calorific value (GCV) of coal. Int J Coal Prep Util 31(1):32–59
Appendices
Appendix 1: TOB learning network method details
TOB stage 1 (data matching and provisional prediction)
Step 1 Set up a 2-D array of N input variables and one dependent variable to be predicted for each of M data records.
Step 2 Arrange the data records in a systematic order defined by the prediction variable’s values (e.g. ascending or descending value order).
Step 3 Derive maximum and minimum values (and other standard statistics, such as mean and standard deviation) for all records in the dataset (Table 1).
Step 4 Normalize the data in the array so that each variable spans the range minus 1 to plus 1 (− 1, +1). This is achieved by using Eq. (1):

$$X_{i}^{\prime} = \frac{2\left( X_{i} - X_{min} \right)}{X_{max} - X_{min}} - 1 \quad (1)$$

where: \(X_{i}\) = variable X value for the ith data record; \(X_{i}^{\prime}\) = normalized value of \(X_{i}\); \(X_{min}\) = minimum value of variable X; \(X_{max}\) = maximum value of variable X.
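The Step 4 normalization can be sketched in a few lines of Python. The display form of Eq. (1) is not reproduced in this version of the text, so this sketch assumes the standard min–max mapping onto (− 1, +1), which is consistent with the range stated in Step 4; the example array values are purely illustrative.

```python
import numpy as np

def normalize(X):
    """Min-max normalize each column of a 2-D data array to [-1, +1].

    Assumes Eq. (1) takes the standard form
    X'_i = 2 * (X_i - X_min) / (X_max - X_min) - 1,
    consistent with the (-1, +1) range described in Step 4.
    """
    Xmin = X.min(axis=0)  # Step 3: per-variable minima
    Xmax = X.max(axis=0)  # Step 3: per-variable maxima
    return 2.0 * (X - Xmin) / (Xmax - Xmin) - 1.0

# Illustrative 3-record, 2-variable array (hypothetical values)
data = np.array([[10.0, 0.5], [20.0, 1.5], [30.0, 2.5]])
norm = normalize(data)  # each column now spans exactly -1 to +1
```

Step 5's check then amounts to confirming that each normalized column has a minimum of − 1 and a maximum of +1.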
Step 5 Generate statistical analysis of the normalized values to check that the variables are all correctly normalized.
Step 6 Distribute the data records between training, tuning and testing subsets. Sensitivity analysis is conducted to establish the optimum percentage of data records to allocate to each data subset. Firstly, the data records to be used for testing are extracted from the complete data set and placed to one side. Sensitivity analysis then helps to divide the remaining data records between the training and tuning subsets in proportions that achieve an acceptable prediction accuracy. For most data sets the training subset is likely to hold more than 75% of the data records. For large datasets of several thousand data records the sensitivity analysis often reveals that the training subset can be a much larger percentage without compromising prediction accuracy.
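A minimal sketch of the Step 6 allocation. The subset fractions used here are illustrative placeholders; the paper selects the actual percentages by sensitivity analysis, with the training subset typically holding well over 75% of the records.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

def split_records(M, test_frac=0.1, tune_frac=0.1):
    """Randomly allocate M record indices to training, tuning and
    testing subsets (Step 6). The testing records are extracted first
    and set aside; the remainder is divided between training and
    tuning. The fractions are hypothetical, not from the paper."""
    idx = rng.permutation(M)
    n_test = int(round(M * test_frac))
    n_tune = int(round(M * tune_frac))
    test = idx[:n_test]
    tune = idx[n_test:n_test + n_tune]
    train = idx[n_test + n_tune:]
    return train, tune, test

train, tune, test = split_records(6339)  # dataset size from the study
```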
Step 7 The variable squared error (VSE) between each variable in the J data records of the tuning-data subset and the K data records in the training-data subset is calculated using Eq. (2):

$$VSE\left( X \right)_{jk} = \left[ X_{k}\left( tr \right) - X_{j}\left( tu \right) \right]^{2} \quad (2)$$

where: \({X_k}\left( {tr} \right)\) = variable X value for the kth training-subset data record; \({X_j}\left( {tu} \right)\) = variable X value for the jth tuning-subset data record; \(VSE{\left( X \right)_{jk}}\) = squared error value for variable X for the jth tuning-subset data record versus the kth training-subset data record.
∑VSE is then established as the weighted sum of the VSE values for all variables for each data-record match using Eq. (3):

$$\sum VSE_{jk} = \mathop \sum \limits_{n=1}^{N+1} W_{n} \cdot VSE\left( X_{n} \right)_{jk} \quad (3)$$

where: \(VSE{\left( {Xn} \right)_{jk}}\) = squared error for variable Xn for the jth tuning-subset data record versus the kth training-subset data record; \(\sum VSE_{jk}\) = weighted sum of the squared errors for all N + 1 variables for that data-record match; Wn = weight (0 < Wn ≤ 1) applied to the VSE of each of the N + 1 variables involved. These weights are all set to the same value (e.g. 1) in TOB stage 1 to avoid any bias in the initial training of the prediction network.
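Step 7 can be expressed compactly with NumPy broadcasting. Following the text, the weighted sum of Eq. (3) runs over all N + 1 variables of each record match, and in stage 1 the weights Wn are all equal; the small example arrays are hypothetical.

```python
import numpy as np

def weighted_sum_vse(train_norm, tune_record, weights):
    """Weighted sum of squared errors (Eqs. 2 and 3) between one
    normalized tuning-subset record and every training-subset record.

    train_norm  : (K, N+1) normalized training records (the text sums
                  over all N + 1 variables of each record match)
    tune_record : (N+1,) one normalized tuning-subset record
    weights     : (N+1,) weights Wn; all equal (e.g. 1) in stage 1
    """
    vse = (train_norm - tune_record) ** 2   # Eq. (2), per variable
    return (vse * weights).sum(axis=1)      # Eq. (3), per training record

# Tiny illustrative example with two normalized variables
train_norm = np.array([[0.0, 0.0], [1.0, 1.0]])
tune_rec = np.array([0.0, 1.0])
scores = weighted_sum_vse(train_norm, tune_rec, np.ones(2))
```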
Step 8 Select and rank (lowest in ∑VSE is ranked number 1) the top-Q-matching data records in the training subset for each tuning subset data record. Q = 10 is typically sufficient for TOB stage 1. However, Q could be adjusted to higher or lower values, if necessary to improve prediction accuracy.
Step 9 The Q selected training-subset data records (i.e. best matches) for the jth tuning-subset data record each contribute a fraction to the prediction of the dependent variable. That fraction is proportional to the relative ∑VSE scores of those Q records for the jth data record, and is calculated with Eqs. (4) to (6).
where: q = index of the qth top-ranking training-subset record for the jth tuning-subset data record; fq = fractional contribution of the qth top-ranking record to the prediction for the jth tuning-subset data record.
The constraint defined by Eq. (5) requires that the f values applied to the Q matching data records sum to unity.
The matching training-subset data record with the lowest \(\sum VSE_{jk}\) value should contribute most to the dependent-variable prediction for the jth tuning-subset data record. To achieve this, (1 − f) is the multiplier applied in Eq. (6) to each of the Q top-matching records.
where:
\(\left( {{X_{N+1}}} \right)_{q}\) = dependent-variable value for the qth top-matching data record in the training subset.
\(\left( {{X_{N+1}}} \right)_{j}^{{predicted}}\) = stage-1 TOB predicted value of the dependent variable for the jth tuning-subset data record.
This prediction is provisional because equal weights (Wn) are applied to the variables in TOB stage 1.
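Steps 8 and 9 can be sketched as follows. Because Eqs. (4) to (6) are rendered as images in the published version, this sketch assumes fq is each record's share of the total ∑VSE across the Q best matches, with the (1 − f) multiplier described in the text renormalized by (Q − 1) so the weights sum to one and the closest match contributes most; the numerical values are illustrative only.

```python
import numpy as np

def stage1_prediction(sum_vse, y_train, Q=10):
    """Provisional stage-1 prediction (Eqs. 4-6), sketched under the
    assumption that fq = sum_vse_q / total over the Q best matches, so
    the (1 - fq) multiplier weights the closest match most heavily.

    sum_vse : (K,) weighted sum-VSE of each training record versus the
              tuning record (Eq. 3)
    y_train : (K,) dependent-variable values of the training records
    Q       : number of top-matching records used (Step 8)
    """
    order = np.argsort(sum_vse)[:Q]          # Step 8: rank, keep top Q
    s = sum_vse[order]
    f = s / s.sum()                          # assumed Eq. (4); sum(f) = 1, Eq. (5)
    w = (1.0 - f) / (Q - 1)                  # (1 - f) multipliers, renormalized
    return float(np.dot(w, y_train[order]))  # assumed Eq. (6)
```

With this form, a record whose ∑VSE is half that of another receives a proportionally larger share of the prediction, as the text requires.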
Step 10 Measures of statistical accuracy are calculated for the TOB stage-1 predictions. The measures used are the coefficient of determination (R2), the mean square error (MSE) and the root mean square error (RMSE), calculated with Eqs. (7) to (9), respectively:

$$R^{2} = 1 - \frac{\mathop \sum \nolimits_{j=1}^{J} \left( X_{j}^{actual} - X_{j}^{predicted} \right)^{2}}{\mathop \sum \nolimits_{j=1}^{J} \left( X_{j}^{actual} - X_{ave}^{actual} \right)^{2}} \quad (7)$$

$$MSE = \frac{1}{J}\mathop \sum \limits_{j=1}^{J} \left( X_{j}^{predicted} - X_{j}^{actual} \right)^{2} \quad (8)$$

$$RMSE = \sqrt{MSE} \quad (9)$$

where: Xj = dependent variable (i.e. \({\left( {{X_{N+1}}} \right)_j}\) in Eq. (6)) for the jth tuning-subset data record; \(X_{j}^{{actual}}\) = actual (or directly measured) value of the dependent variable for the jth tuning-subset data record; \(X_{j}^{{predicted}}\) = predicted value of the dependent variable for the jth tuning-subset data record; \(X_{{ave}}^{{actual}}\) = average actual value of the dependent variable for all J data records in the tuning subset.
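The Step 10 accuracy measures, assuming the standard textbook definitions of the coefficient of determination, MSE and RMSE named in the text:

```python
import numpy as np

def accuracy_stats(y_actual, y_pred):
    """R2, MSE and RMSE for a set of predictions (Eqs. 7-9), using the
    standard definitions of these measures."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_actual
    mse = float(np.mean(err ** 2))                       # Eq. (8)
    rmse = mse ** 0.5                                    # Eq. (9)
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_actual - y_actual.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot                           # Eq. (7)
    return r2, mse, rmse

# Hypothetical GCV values (MJ/kg), for illustration only
r2, mse, rmse = accuracy_stats([12.1, 25.3, 18.7], [12.0, 25.5, 18.6])
```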
TOB stage 2 (optimization)
Step 11 Optimization is performed to minimize RMSE (Eq. 9) collectively for the J data records in the tuning subset. This is achieved by adjusting optimization control metrics while applying certain constraints.
The two optimization control metrics are:
1. Varying the values applied to the N input-variable weights (Wn). Small non-zero weights applied to certain variables can and do have a significant impact on the accuracy of the predictions derived.
2. Varying the number (Q) of top-matching records in Eqs. (4), (5) and (6). For most datasets: 2 ≤ Q ≤ 10. The optimizer is allowed to select the best integer value of Q to minimize RMSE. It does this by systematically changing the value of Q in the three equations mentioned and comparing the RMSE values of the predictions generated for each integer value of Q evaluated in the range 2 ≤ Q ≤ 10. For example, if Q = 4, the predictions for all of the tuning-subset data records use only the top-4 matching records from the training subset related to each tuning-subset record. In this way the optimization algorithm identifies which value of Q leads to the most accurate predictions for the tuning subset as a whole.
Here, the Generalized Reduced Gradient (GRG) algorithm option of the standard “Solver” optimizer in Microsoft Excel (Frontline Solvers 2018) is used, in conjunction with Visual Basic for Applications (VBA) code, to conduct the optimization process. Other evolutionary optimizers could be applied to achieve similar outcomes. For mid-sized datasets, calculating the TOB predictions in Excel facilitates the display of all the intermediate calculations in a convenient format.
The top-matching data records in the training subset for each tuning-subset data record are carried forward from TOB stage 1 for selection by TOB stage 2. Equation (3) is re-evaluated by varying Wn in each iteration of the optimizer. Additionally, TOB stage-2 \(\sum VSE_{jq}\) scores are derived with Eq. (4) by varying Q (2 ≤ Q ≤ 10) in each iteration of the optimizer, in contrast to the fixed value of Q used in TOB stage 1.
Step 12 Calculate TOB stage-2 RMSE and R2 values for the predictions provided by the optimum step 11 solution. Compare the TOB stage-2 predictions with the TOB stage-1 predictions to assess the prediction improvements achieved, if any. Running sensitivity analysis with different values of Q (i.e. Q = 2 to 10) often provides insight to potential underfitting or overfitting issues with the data set.
Step 13 Calculate TOB stage-1 and stage-2 predictions for the independent testing data subset using the optimum values established for Wn and Q in step 11. Calculate and evaluate the RMSE and R2 values for the predictions calculated for the testing data. Reviewing the intermediate steps in the calculations often provides useful insight into the variables that have the most influence on prediction accuracy (it is often not those with the highest Wn values). It also helps in performing outlier analysis (i.e., understanding why some data records lead to less-accurate predictions).
Step 14 Consider whether the prediction accuracy achieved by the method is sufficiently meaningful for it to be relied upon. Also, evaluate how its prediction accuracy compares with other machine-learning tools.
Appendix 2: Details of data records in the dataset
Supplementary data associated with the coal proximate and ultimate analysis dataset to which the TOB network is applied (Matin and Chelgani 2016; Bragg et al. 1997) can be found in the online version of this article. The data in the supplementary Excel file is listed in one sheet as the complete dataset (6339 data records) and in another with those records sorted in ascending order of GCV. To further aid transparency, other sheets in that file list the actual data records assigned to the training, tuning and testing subsets used for the analysis presented. This enables readers to view exactly how the TOB network was configured for the analysis described.
Wood, D.A. Transparent open-box learning network provides auditable predictions for coal gross calorific value. Model. Earth Syst. Environ. 5, 395–419 (2019). https://doi.org/10.1007/s40808-018-0543-9