Introduction

Predictions of dependent-variable values from complex systems of multiple non-linear influencing variables with highly dispersed distributions are a key requirement for the coal industry. Predicting the gross calorific value of various grades of coal (Mesroghli et al. 2009) and relationships between its petrological factors and grinding properties (Bagherieh et al. 2008) are common examples. Empirical correlations and artificial-intelligence algorithms are widely used to derive predictions for datasets of variable size covering regional and/or local coal sources. Many of these methods provide meaningfully accurate predictions where the underlying inputs are related in highly non-linear and irregular ways. Often, the metric of interest for commercial valuation is expensive to measure repeatedly by laboratory testing, making such prediction tools cost effective. This is the case with proximate and ultimate coal analysis. Artificial-intelligence tools are therefore growing in their deployment for such applications (Schmidhuber 2015).

A problem with many machine-learning algorithms is that they lack transparency: they do not readily reveal how each prediction they generate is derived. This is because they typically involve hidden, complex and multi-dimensional correlations, which means that they do not provide straightforward and auditable input–output relationships between the variables involved in their predictions. For this reason, some practitioners are reluctant to rely on machine-learning algorithms where such information is of critical importance (e.g., commercial valuation or error analysis for specific value intervals, both of which are relevant to GCV analysis of coal samples).

This opaqueness leads many practitioners to be sceptical about the predictions derived from neural-network methods, particularly their claims to accuracy when applied to relatively limited data sets. They are often viewed as black boxes for this reason (Heinert 2008), and their inability to reveal the details of their underlying calculations can be frustrating. This is despite their ability, based on a range of statistical-accuracy measures, to achieve impressive levels of prediction accuracy for a wide range of complex systems. Indeed, some algorithms are prone to the pitfalls of overfitting (Lever et al. 2016), i.e., their hidden correlations are too dependent on a particular set of data, introducing doubt regarding their ability to fit additional data records as they become available. This is particularly a problem for datasets that cover the prediction-metric range intermittently, i.e., with significant gaps in the value range covered by the underlying dataset.

Locally-weighted learning methods (Atkeson et al. 1997) combined with lazy-learning principles (Birattari et al. 1999), originating from the much earlier recognition of the benefit of nearest-neighbour prediction methods (Fix and Hodges 1951; Cover and Hart 1967), can be configured to provide transparency. However, these approaches tend to be applied more to pattern-recognition algorithms (Garcia et al. 2012; Chen and Shah 2018) than to non-linear regression predictions, where the application of the more-opaque neural networks now dominates. Moreover, such approaches often seek to linearize highly non-linear systems on a localized or neighbourhood basis (Bontempi et al. 1999). Nevertheless, there is the potential for such approaches to be configured with transparency in mind (Shakhnarovich et al. 2006).

The transparent open-box (TOB) learning-network algorithm (Wood 2018) overcomes many of the issues mentioned by not relying upon hidden correlations to calculate its predictions. It applies a matching technique between the tuning and training data subsets, quantifying the degree of match between data records using squared-error analysis. The TOB stage 1 prediction establishes a set of high-ranking matching records from which an initial prediction is made. TOB stage 2 then applies an optimizer to identify the optimum weights to assign to each input variable to improve prediction accuracy. The transparent calculations used in TOB stages 1 and 2 are readily available to audit, and the level of accuracy achieved compares favorably with the more-opaque artificial-intelligence (AI) techniques (e.g., adaptive neuro-fuzzy inference systems, multi-layer perceptron and radial-basis-function artificial neural networks, least-squares support vector machines, and hybrids of those with evolutionary optimization algorithms).
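To make the matching and weighting mechanics concrete, the short Python sketch below illustrates a stage-1-style prediction under stated assumptions: input variables are min–max normalized to the range − 1 to + 1 (as in the audit tables presented later), record similarity is measured by a weighted sum of squared errors across the input variables, and the prediction is assembled from the Q highest-ranking training-subset matches. The inverse-error contribution scheme used here is an illustrative choice rather than the exact formulation of Wood (2018), which is detailed in Appendix 1; all function and variable names are hypothetical.

```python
import numpy as np

def normalize(X, lo, hi):
    """Min-max normalize each column of X to the range [-1, +1]."""
    return 2.0 * (X - lo) / (hi - lo) - 1.0

def tob_stage1_predict(x_new, X_train, y_train, weights, Q=10):
    """Illustrative stage-1-style prediction for one (normalized) data record.

    x_new   : input-variable values of the record to be predicted (1-D array)
    X_train : training-subset input variables (n_records x n_variables)
    y_train : training-subset dependent-variable (GCV) values, normalized
    weights : per-variable weights (Wn) applied to the squared errors
    Q       : number of top-ranking matching records used in the prediction
    """
    # Weighted sum of squared errors between the new record and every training record
    sum_e = ((X_train - x_new) ** 2 * weights).sum(axis=1)
    # Rank training records by closeness of match (smallest SumE first)
    top_q = np.argsort(sum_e)[:Q]
    # Illustrative contribution scheme: closer matches (lower SumE) contribute more
    inv_e = 1.0 / (sum_e[top_q] + 1e-12)
    contrib = inv_e / inv_e.sum()
    # Prediction (in normalized terms) is the contribution-weighted sum of GCV values
    return float(np.dot(contrib, y_train[top_q])), top_q, contrib
```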

The AI methods mentioned do not need to be totally opaque. Simulation methodologies can provide a degree of transparency (Elkatatny et al. 2016). Variable importance algorithms can also establish the covariances between the influencing variables of AI methods. Auret and Aldrich (2012) achieve this with a random-forest algorithm. The TOB method goes further than this because it facilitates drilling down into the underlying variables to obtain the exact calculations involved in each of its predictions.

Here, the TOB method is applied to a 6339-record dataset for US coals including both proximate and ultimate influencing variables to predict GCV (Appendix 1, supplementary file). Highly accurate predictions are achieved using a small tuning data subset (~ 1.5% of the full dataset). Matches are achieved through error analysis against a training subset that constitutes about 97% of the entire database. The remaining 1.5% of the data records are not involved in the training or tuning process; they are used as a testing subset to independently test the TOB's prediction performance. The dataset involves nine influencing variables that contribute, through various applied weightings, to predict GCV as the dependent variable.

Although a coal GCV dataset is used to demonstrate the benefits of the TOB learning network to the coal industry, there are other coal-related systems, for which ANN is frequently used as a prediction tool, that could equally benefit from its application: for example, coal petrography and petrology influencing variables in relation to a measure of coal grindability as a dependent variable (Trimble and Hower 2003; Bagherieh et al. 2008).

TOB method

TOB stages 1 and 2 comprise 14 steps (Wood 2018). These steps are summarized in a flow diagram (Fig. 1). Stage 1 builds upon lazy-learning (Birattari et al. 1999) and nearest-neighbour (Chen and Shah 2018) principles but with very specific error drivers. Stage 2 goes far beyond such principles by linking the selection of variable weightings to an optimizer, providing a more flexible and versatile weighting regime than typically associated with k-nearest-neighbour classifiers (Samworth 2012).

Fig. 1
figure 1

Diagrammatic representation of the steps and stages in applying the transparent open-box (TOB) learning network algorithm (Wood 2018). See Appendix 1 (TOB method) for a detailed description of each stage and step including the mathematical formulations

The details and mathematical formulations involved in each of the 14 steps required to establish and implement a TOB learning network are described in Appendix 1. TOB stage-1 predictions (steps 1–10) are often found to be quite accurate (e.g., comparable to those provided by typical k-learning algorithms). However, they can typically be much improved upon by applying TOB stage 2.

The TOB learning network can be successfully applied using spreadsheets (e.g. Excel workbooks) for mid-sized data sets. Fully-coded algorithm formats or hybrid VBA plus spreadsheet setups can speed up deployment for such datasets. For large datasets it is appropriate to deploy the TOB algorithm in a fully-coded configuration, e.g., in Octave, Python, MatLab, R, VBA, etc.

A hybrid VBA-Excel spreadsheet configuration is used here to predict the gross calorific value (GCV) from a published dataset of coals from the United States (6339 data records).

Dataset compiled to predict coal gross calorific value (GCV)

There are numerous well-established published correlations based on linear and multi-variable regressions, particularly for coals from the United States (US), based on proximate and/or ultimate analysis (Given et al. 1986; Neavel et al. 1986; Singh and Kakati 1994; Channiwala and Parikh 2002; Majumder et al. 2008; Mathews et al. 2014). Several of these provide predictions with low absolute errors and correlation coefficients (R2) > 0.9 between measured and predicted GCV. However, as many of the variables involved in proximate and ultimate analysis vary in a non-linear manner, prediction improvements have been achieved by applying non-linear regression or machine-learning algorithms such as artificial neural networks (ANN), support vector regression (SVR) or adaptive network-based fuzzy inference systems (ANFIS) (Patel et al. 2007; Mesroghli et al. 2009; Chelgani et al. 2010, 2011; Yalcin Erik and Yilmaz 2011; Kavsek et al. 2013; Tan et al. 2015; Feng et al. 2015). These are mainly based on proximate analysis of relatively small datasets of coals from India and China but achieve high levels of prediction accuracy, with R2 > 0.99 between measured and predicted GCV in some cases.

Tan et al. (2015) also demonstrate a GCV prediction performance with R2 > 0.99 for their SVR algorithm using the many thousands of samples of US coal analysis provided by the US Geological Survey Coal Quality (COALQUAL) database (Bragg et al. 1997). Matin and Chelgani (2016) demonstrated that the random forest algorithm (Breiman 2001; Auret and Aldrich 2012), which establishes covariances between the influencing variables, could produce highly accurate predictions of GCV (R2 > 0.97 for proximate data; R2 > 0.99 for ultimate data) using a filtered version of the COALQUAL (version 2) database. Matin and Chelgani (2016) filtered out coal records with > 25% ash, as being unsuitable for use in power production, and those with analyses that did not sum to 100%.

This left 6339 data records of the COALQUAL database which they used to test their random forest algorithm. It is these 6339 data records that are used here to test the TOB algorithm. It should be noted that the COALQUAL dataset has now been updated and extended (> 13,000 coal records) by the issue of version 3 (Palmer et al. 2015), but for the current purpose it is deemed more useful to use the filtered version 2 dataset for which published GCV-prediction performances are available for comparison. Matin and Chelgani omitted fixed carbon (FC) and oxygen (O) values from the GCV-influencing variables they included in their prediction model, because they were derived from other variables in the proximate or ultimate analysis. These variables are included in the TOB analysis as they are valuable for data-record matching purposes.

The compiled US-coal GCV dataset used here to demonstrate the prediction capability of the TOB method spans a significant range of GCV. It also incorporates significant ranges of input-variable distributions (proximate and ultimate analysis), as shown in Table 1. The details for each of the 6339 data records are provided in a supplementary file (see "Appendix"). This includes the values of all variables and links to the sample numbers listed in the extracts from the COALQUAL version 2 database (Bragg et al. 1997) as compiled by Matin and Chelgani (2016). From the supplementary file it is possible to locate the exact COALQUAL sample number and US State of origin of each record.

Table 1 Statistical summary of dataset compiled for gross calorific value (GCV) of 6339 US coals with each record linking measured proximate and ultimate analysis variables to their measured GCV value (MJ/kg)

It is clear from Table 1 that the dataset is skewed towards the higher end of the GCV range; nearly 80% of the samples fall in the GCV range 23–35.5 MJ/kg. This means that the lower end of the GCV scale (i.e., < 15 MJ/kg) is sparsely sampled, representing about 4.5% of the data records. This feature of the dataset is addressed in the learning networks constructed.

The highly dispersed and non-linear relationships between each of the input variables (#1 to #9) and the dependent variable GCV are illustrated in Figs. 2 and 3. For the proximate-analysis variables (Fig. 2), moisture content and fixed carbon show the best correlations with GCV; the former a negative correlation (R2 = 0.8216), the latter a positive correlation (R2 = 0.77). On the other hand, ash and volatile matter show poor correlations with GCV and greater dispersal (Fig. 2). The dispersal in the ash versus GCV relationship gradually reduces in this dataset for GCV values > 25 MJ/kg. For the ultimate-analysis variables (Fig. 3), carbon and oxygen show the best correlations with GCV; the former a positive correlation (R2 = 0.9847), the latter a negative correlation (R2 = 0.8345). The other variables in Fig. 3 show significant dispersion and non-linearity, particularly sulphur (S).

Fig. 2
figure 2

a–d Proximate analysis variable relationships with gross calorific value for the 6339 data records for US coals compiled from the US Geological Survey Coal Quality (COALQUAL) database version 2.0, open file report 97–134 (Bragg et al. 1997)

Fig. 3
figure 3

a–f Ultimate analysis variable relationships with gross calorific value for the 6339 data records for US coals compiled from the US Geological Survey Coal Quality (COALQUAL) database version 2.0, open file report 97–134 (Bragg et al. 1997)

TOB predictions of coal gross calorific value (GCV) from the compiled 6339-record GCV dataset

The GCV dataset is divided into subsets (training = 6155 records; tuning = 92 records; testing = 91 records). Some 97% of the 6339 records (the full dataset evaluated) reside in the training subset. Allocation of records to the tuning and testing subsets is spread systematically across the entire range of data values. It is best to avoid random allocation, as this is likely to lead to clustering and gaps in specific data ranges. Steps 2 and 6 of the method (Appendix 1) help to distribute the records for tuning and testing systematically (e.g., using a specified ranking-interval spacing to select from the ranked and sorted dataset) while avoiding subjective selection.

This allocation was achieved by initially ranking the full dataset in ascending (or descending) order of GCV values and then selecting every 70th data record from the full dataset to be put to one side during the tuning process and allocated to the testing subset. Two additional records were also added to the testing subset: one close to the lower GCV limit and one close to the upper GCV limit. A similar approach was then taken in selecting the tuning-subset records from the remaining dataset, i.e., every 70th data record from the combined training and tuning subset (ranked in order of GCV values) was allocated to the tuning subset. Two additional records were then added to the tuning subset: one close to the lower GCV limit and one close to the upper GCV limit. This approach ensures a full spread of GCV values, representative of the entire data range, in both the testing and tuning subsets; such a spread cannot be guaranteed by random sampling. For the learning network to be tuned across its entire dependent-variable range it is essential that the tuning subset is distributed relatively evenly across that range.
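A minimal sketch of this systematic allocation is given below, assuming the dataset is held in NumPy arrays; the stride of 70 and the addition of one record near each end of the GCV range follow the description above, while the exact index handling is illustrative.

```python
import numpy as np

def split_by_rank(y, stride=70):
    """Allocate records to testing and tuning subsets by taking every `stride`-th
    record from the dataset ranked in ascending order of the dependent variable,
    plus one record near each end of the value range (illustrative)."""
    order = np.argsort(y)                          # rank records by GCV
    test_idx = set(order[::stride])                # every 70th ranked record
    test_idx.update([order[1], order[-2]])         # one near each GCV limit
    remaining = [i for i in order if i not in test_idx]
    tune_idx = set(remaining[::stride])            # repeat on the remaining records
    tune_idx.update([remaining[1], remaining[-2]])
    train_idx = [i for i in remaining if i not in tune_idx]
    return (np.array(train_idx), np.array(sorted(tune_idx)), np.array(sorted(test_idx)))
```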

Table 2 provides TOB-prediction results for the coal GCV dataset. The optimum GCV-prediction performance is assessed by comparing actual and predicted GCV values. This is initially calculated for the tuning-subset records and achieves RMSE = 0.33462 MJ/kg and R2 = 0.9963 with optimized Q = 9 (i.e., the nine highest-ranking data-record matches from the training subset are used in the TOB stage-2 predictions). The TOB stage-2 variable weights established for the optimum solution (most accurate predictions) were: w#1 = 5.304E−04, w#2 = 8.317E−03, w#3 = 0, w#4 = 1.576E−03, w#5 = 0, w#6 = 1.0, w#7 = 2.062E−03, w#8 = 0, and w#9 = 4.348E−03. The very small weights applied to several of the input variables have a significant influence on the prediction accuracy achieved. This point is addressed in the auditing discussion below; suffice it to say here that it does not follow that the higher the weight applied to a variable in the TOB method, the more significant that variable is in determining the predicted values.
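In outline, the stage-2 tuning can be reproduced with any general-purpose optimizer. The sketch below uses SciPy's differential_evolution as a stand-in for the Excel Solver evolutionary optimizer employed in this study; it assumes the hypothetical tob_stage1_predict helper sketched earlier and searches the per-variable weights (looping over candidate Q values) to minimize RMSE across the tuning subset. It is a simplified illustration of the principle rather than the exact TOB stage-2 procedure.

```python
import numpy as np
from scipy.optimize import differential_evolution

def tune_stage2(X_train, y_train, X_tune, y_tune, n_vars=9):
    """Search Q (2..10) and per-variable weights Wn (0..1) that minimize the
    RMSE of tuning-subset predictions (illustrative stage-2-style tuning)."""
    best = (np.inf, None, None)
    for Q in range(2, 11):
        def rmse(w):
            preds = [tob_stage1_predict(x, X_train, y_train, w, Q)[0] for x in X_tune]
            return float(np.sqrt(np.mean((np.asarray(preds) - y_tune) ** 2)))
        res = differential_evolution(rmse, bounds=[(0.0, 1.0)] * n_vars,
                                     maxiter=50, tol=1e-6, seed=1)
        if res.fun < best[0]:
            best = (res.fun, Q, res.x)   # (tuning RMSE, optimum Q, optimum weights)
    return best
```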

Table 2 Prediction accuracy versus TOB optimization control metrics (Q and Wn) for the complete GCV data set

Table 2 compares the predictions achieved with sub-optimal values of Q (i.e., Q = 2 to 10) with those for the optimal value of Q = 9. The results show that accurate GCV predictions can be achieved using most Q values in that range (R2 = 0.9944 for Q = 2, compared to 0.9963 for Q = 9). This suggests that for this learning network the value of Q plays a subordinate role, as high degrees of accuracy are achieved for all values of Q tested. This is further emphasized by the narrow range of root mean squared error (RMSE) values, from 0.33462 MJ/kg (Q = 9) to 0.41393 MJ/kg (Q = 2). These values indicate that the dataset is not noticeably under-fitted when Q = 2.
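The Q sensitivities reported in Table 2 can be reproduced by holding the optimized weights fixed and re-evaluating the tuning-subset predictions for each candidate Q, as in the brief sketch below (which assumes the helpers from the earlier sketches and hypothetical, normalized arrays X_train, y_train, X_tune, y_tune and optimized weights w_opt):

```python
import numpy as np

def rmse_r2(y_meas, y_pred):
    """Root mean squared error and coefficient of determination (R^2)."""
    y_meas, y_pred = np.asarray(y_meas), np.asarray(y_pred)
    rmse = np.sqrt(np.mean((y_pred - y_meas) ** 2))
    r2 = 1.0 - np.sum((y_meas - y_pred) ** 2) / np.sum((y_meas - y_meas.mean()) ** 2)
    return rmse, r2

# Hold the optimized weights fixed and vary Q (cf. the sensitivities in Table 2);
# back-transform predictions from the normalized scale to MJ/kg before comparing
# with the values quoted in Table 2.
for Q in range(2, 11):
    preds = [tob_stage1_predict(x, X_train, y_train, w_opt, Q)[0] for x in X_tune]
    print(Q, rmse_r2(y_tune, preds))
```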

The TOB stage 1 predictions (Q = 10 and Wn = 0.5; see the left side of Table 2) also show credible accuracy (RMSE = 0.48381 MJ/kg; R2 = 0.9923). The data-record-matching conducted by TOB Stage 1 is, clearly, an essential contributing component to the optimum GCV predictions.

Testing-subset predictions (Table 2: lower two rows) applying the optimum Q and Wn values achieve very slightly higher accuracy in GCV prediction than the tuning-subset GCV predictions. The testing-subset accuracy-metric values are RMSE = 0.30556 MJ/kg and R2 = 0.9970 (Table 2). The accuracy of the TOB method's predictions compares favourably with those achieved by other machine-learning and empirical-correlation methods applied to this dataset (Matin and Chelgani 2016) and other datasets (Feng et al. 2015; Tan et al. 2015).

Figures 4 and 5 display the optimum TOB predictions of coal GCV for the tuning and testing subsets, respectively. It is apparent from these graphs that the GCV < 15 MJ/kg range is only sparsely sampled by both the tuning and testing subsets. In order to verify that the TOB network can produce consistent predictions of meaningful accuracy in this sparser data area, a separate TOB network is constructed to analyse and tune that section of the GCV distribution more extensively.

Fig. 4
figure 4

Predicted versus measured GCV (MJ/kg) for 92 TOB-tuning subset records with 6155 records in the training subset

Fig. 5
figure 5

Predicted versus measured GCV (MJ/kg) for 91 TOB-testing subset records with 6155 records in the training subset. The data records of the testing subset were excluded from the tuning and training subsets

Auditing and interrogating TOB predictions

Matching of data records, rather than establishing correlations between input variables, is the basis of the TOB methodology. Its predictions are constrained to the range between the lowest and highest GCV values in the training subset; it cannot extrapolate beyond that range (in contrast to many other AI methods).

For many machine-learning algorithms and empirical correlations it makes sense to apply their optimum coefficients to the data records in the training set as well as to the tuning and/or testing sets used to verify their accuracy. When applying the TOB algorithm, predictions are made only for data records that are not already present in the training subset. Predictions for records already in the TOB training set would achieve exact matches (RMSE = 0) and provide no insight into the accuracy of the method. This is a fundamental and distinguishing difference between the TOB methodology and most other AI methods (except k-learning-type methods).

As there are no hidden or difficult-to-access intermediate correlations involved in the TOB method, the underlying calculations in each prediction are accessible. Indeed, a key benefit of the TOB algorithm is that it allows each prediction calculation step to be readily analysed. Tables 3, 4, 5 and 6 provide examples of how this is achieved and the information it provides. These tables detail the TOB predictions (stages 1 and 2 presented separately) for record #2738 (Tables 3, 4) and record #193 (Tables 5, 6). Tables 3 and 5 focus on the TOB stage-1 calculations (Q = 10; Wn = 0.5). Tables 4 and 6 focus on the TOB stage-2 predictions (2 ≤ Q ≤ 10; 0 ≤ Wn ≤ 1). The left side of the tables displays the ten high-matching records established by TOB stage 1 for data records #193 and #2738. The upper half of the tables displays the nine input-variable values (#1 to #9) and the dependent variable (GCV). These values are all expressed in normalized terms (ranging from − 1 to + 1). The matching records are listed in order of matching rank; the first is the closest match, and the record listed 10th is the 10th-best match of the 6155 records in the GCV training subset.

Table 3 Example audit of the calculation details for the TOB stage 1 GCV prediction associated with specific data records. This calculation is for the TOB prediction for data record 2738 (part of the testing subset)
Table 4 Example audit of the calculation details for the TOB stage 2 GCV prediction associated with specific data records. This calculation is for the TOB prediction for data record 2738 (part of the testing subset)
Table 5 Example audit of the calculation details for the TOB stage 1 GCV prediction associated with specific data records. This calculation is for the TOB prediction for data record 193 (part of the testing subset)
Table 6 Example audit of the calculation details for the TOB stage 2 GCV prediction associated with specific data records. This calculation is for the TOB prediction for data record 193 (part of the testing subset)

The lower half of Tables 3, 4, 5 and 6 applies the Q and Wn values to the squared errors calculated for the variables. The column fourth from the right (in the lower half of each table) lists the sum of the weighted squared errors (SumE). For stage 1 (Q = 10) the values of ten matching records are involved in the calculation of SumE; for stage 2 (optimum Q = 9) only the nine highest-ranking matching records are involved. In both cases it is the last three columns to the right in the lower half of these tables that detail the actual and relative contributions of each of the high-matching records to the prediction. It is the sum of the contributions in the rightmost column (lower half of the tables) that provides the prediction in normalized terms. The lowest two rows (right-side numbers) in each of Tables 3, 4, 5 and 6 provide a comparison of the predicted GCV (transformed back from the normalized scale) with the measured GCV for the data record being analysed.

The last three columns of Tables 3, 4, 5 and 6 (lower half of those tables) are the key calculations to focus upon. These use the SumE values to generate GCV predictions for the data record in question. The formulas provided in the heading of each of those three columns explain the calculations made.
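An audit of this form can be reproduced with the stage-1 helper sketched earlier. The hypothetical snippet below prints, for one testing record (x_record), the rank, SumE and fractional contribution of each of the top-Q matching training records, i.e., the same kind of quantities tabulated in the last three columns of Tables 3, 4, 5 and 6 (subject to the illustrative contribution scheme assumed in that earlier sketch):

```python
# Illustrative audit of one prediction; x_record, X_train, y_train and weights
# are hypothetical, normalized arrays as in the earlier sketches.
pred, top_q, contrib = tob_stage1_predict(x_record, X_train, y_train, weights, Q=10)
sum_e = ((X_train[top_q] - x_record) ** 2 * weights).sum(axis=1)
for rank, (idx, e, c) in enumerate(zip(top_q, sum_e, contrib), start=1):
    print(f"rank {rank:2d}  training record {idx:5d}  SumE = {e:.5f}  "
          f"contribution = {100 * c:.1f}%")
print(f"predicted GCV (normalized terms) = {pred:.4f}")
```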

For record #2738 (Tables 3, 4) each of the top-ten high-ranking records contributes a small amount to the prediction, but the higher the ranked match the greater its contribution to that prediction. The rank #1 match contributes 14.2% to the prediction, whereas the rank #10 match contributes only 6.7% to the GCV prediction for record 2738. The GCV prediction achieved is quite accurate, i.e., < +0.5 MJ/kg above the measured value of 29.1 MJ/kg. In TOB stage 2 (Table 4) for record 2738 only the top nine matching records contribute to the GCV prediction, because the optimized Q value selected is 9. Also, as the weightings for each variable are not equal for the optimized solution, the relative contributions of those top-ranking matches are no longer governed only by the closeness of their match with the record 2738 variables. This is clear as the prediction is now dominated by record-match ranks #8 and #3 (41.6% and 40.0% contributions, respectively, to the GCV prediction; see lower half of Table 4, column second from the right). The optimization adjustment for this record generates a better stage-2 prediction, i.e., ~ +0.3 MJ/kg above the 29.1 MJ/kg measured GCV value for record 2738.

The TOB stage 2 prediction is not always better than the TOB stage 1 prediction, but that is the case most of the time, as the optimizer minimizes the RMSE across the entire tuning or testing subset. The second example, for testing-subset data record 193, shows a case where the stage 1 TOB prediction outperforms the stage 2 TOB prediction. The stage 1 prediction (Table 5) is ~ 0.32 MJ/kg below the measured GCV value for that record of 20.9 MJ/kg, whereas the stage 2 prediction (Table 6) is ~ 0.47 MJ/kg below the measured value. In the stage-1 prediction all ten top-matching records contribute between ~ 7.0% and 14.3% to the prediction (rank #1 contributes the most, but it does not dominate the prediction). In contrast, the stage-2 prediction is dominated by ranked matches #1 (59.6%), #9 (13.4%) and #6 (10.1%). When a prediction has more significant input from the lower-ranked of the top matches, the accuracy of that prediction is sometimes impaired. Both the stage 1 and stage 2 predictions are credible for record 193, but they are influenced by the top-ten ranking matches quite differently.

For both examples, the important role of those variables with very low (but non-zero) Wn values in determining the stage 2 predictions is apparent. If those variable weights were zero, the variables would not contribute to the improved accuracy of the optimized solution.

The details provided by Tables 3, 4, 5 and 6 highlight just how deeply it is possible to interrogate both the stage-1 and stage-2 predictions, and they demonstrate the method's high level of prediction transparency. Moreover, the method provides almost forensic insight into the similarity between each record tested and each record in the training subset. This can be useful when trying to identify the provenance of a particular coal (e.g., a specific basin, or even a specific mine, can sometimes be identified from the closeness of the high-ranking matches). Hence, in some cases it is not only the prediction of the dependent-variable value that can be derived from this learning network; it can be accompanied by provenance information. Other machine-learning algorithms and empirical relationships that are underpinned by correlations cannot easily deliver this level of detail on the degree of similarity with specific records in their training subsets.

Analysis of a more sparsely populated data subset (GCV > 6 to < 15 MJ/kg)

To provide more insight into the more sparsely populated GCV < 15 MJ/kg range of the compiled dataset, a separate TOB network is evaluated using just the 283 samples in the database with GCV < 15 MJ/kg. A statistical summary of the measured variables for these 283 records (which are also included in the dataset described in Table 1) is provided in Table 7.

Table 7 Statistical summary of data subset (283 records) compiled for gross calorific value (GCV) between > 6 and < 15 MJ/kg of US coals with each record linking measured proximate and ultimate analysis variables to their measured GCV value (MJ/kg)

This more-focused TOB divides its data records using the same methodology as already described (training = 235 records; tuning = 24 records; testing = 24 records). This TOB tunes the GCV < 15 MJ/kg interval with 24 data records, whereas the larger-dataset TOB previously described involved only 5 data records to tune that interval. Table 8 provides the details of the tuned and optimized prediction performance of this focused TOB, with sensitivities, demonstrating high prediction accuracy, as illustrated (Fig. 6) for its testing subset (RMSE = 0.2944; R2 = 0.9644). The accuracy is not as high as for the larger dataset (see Fig. 5) due to the greater spacing of training-subset data records (i.e., more sparsely distributed data) for the GCV interval < 15 MJ/kg. This highlights a key positive feature of the TOB learning-network approach, i.e., it is resistant to over-fitting sparse datasets. As the spacing between data points in the training set increases, the statistically assessed accuracy of its predictions tends to decrease. Although such behaviour is intuitive, it is not necessarily the outcome with empirical calculations or learning networks driven by complex correlations between the variables, which are prone to over-fitting.
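For illustration, the subdivision helper sketched earlier could be reused on this 283-record subset with a smaller ranking stride; the stride of 12 below is an assumption chosen only to yield roughly 24 tuning and 24 testing records, not the exact spacing used in this study.

```python
# Hypothetical reuse of split_by_rank on the GCV < 15 MJ/kg subset (y_all holds
# the GCV values of the full dataset in the same order as its records).
mask = y_all < 15.0
train_i, tune_i, test_i = split_by_rank(y_all[mask], stride=12)
print(len(train_i), len(tune_i), len(test_i))   # roughly 235 / 24 / 24
```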

Table 8 Prediction accuracy versus TOB optimization control metrics (Q and Wn) for the 283-record GCV data set covering the range GCV > 6 and < 15 MJ/kg
Fig. 6
figure 6

Predicted versus measured GCV (MJ/kg) for 24 testing subset data records used to test the TOB model with 283 records (GCV > 6 and < 15 MJ/kg) in the training subset. The 24 testing data records were excluded from the TOB training process. Data record #3452 with a relatively poor fit between measured and predicted GCV is highlighted and its TOB prediction is considered in detail in Table 9

Table 9 Example audit of the calculation details for the TOB stage 1 GCV prediction associated with specific data records. This calculation is for the TOB prediction for data record 3452 (part of the testing subset)

Table 8 reveals that the best prediction performance for this focused TOB network is for Q = 9. The sensitivity analysis shows that RMSE increases and R2 decreases as the value of Q decreases below 9, with R2 falling below 0.96 for Q values below 8. However, for Q = 3 the prediction accuracy is also good and superior to that for Q = 2 or Q = 4. This is true for both the tuning and testing subsets (Table 8). Indeed, Q = 3 represents a local minimum, which the Solver evolutionary optimizer selected as its optimum (i.e., became trapped at) on most of its runs. This bimodal optimization outcome suggests that for some data records better predictions are achieved with Q = 3. Closer inspection is therefore warranted of data record #3452, which is highlighted in Fig. 6 because, for the optimum Q = 9 tuned setting, its prediction has relatively low accuracy in comparison to most other records in the testing subset.

Tables 9 and 10 describe the detailed calculations of the TOB stage 1 and stage 2 GCV predictions for data record #3452, respectively. In the stage 1 prediction (Table 9) the matching record ranked #1 (record #3106) contributes 32.95% to the GCV prediction, with the other high-ranking matched records contributing progressively less, until the matched record ranked #10 (record #21) contributes just 6.0% to the GCV prediction. For this stage 1 prediction, the top-three matched records contribute > 50% to that calculation. This achieves a prediction of high accuracy, i.e., ~ 0.3 MJ/kg below the measured GCV value of 12.67 MJ/kg.

Table 10 Example audit of the calculation details for the TOB stage 2 GCV prediction associated with specific data records. This calculation is for the TOB prediction for data record 3452 (part of the testing subset)

For data record #3452 the TOB stage-2 prediction (Table 10) is significantly less accurate than the stage-1 prediction (Table 9). The reason for this is that in the TOB stage 2 solution, with Q = 9 and the optimized variable weights applied, the top-three matched records contribute only about 18% to the prediction, whereas the matched record ranked #8 (record #3110) contributes 68% to the GCV prediction. This yields a prediction of less impressive accuracy, i.e., ~ 0.64 MJ/kg below the measured GCV value of 12.67 MJ/kg. In this case, considering the analysis just described and the sensitivity analysis of Table 8, a case could be made for applying a Q = 3 cut-off for the prediction of this data record. This example highlights how the transparency of the TOB learning network's calculations aids the analysis of outlier data records (i.e., those for which predictions fall significantly off trend). It makes it possible to identify, in detail, the reasons for such outlying prediction values. It also often provides the justification for potential adjustments that might be made to improve or correct the predictions for such problematic data records.

Auditing TOB predictions and conducting sensitivity analysis (e.g., varying Q values from the optimum and changing the data-subset allocation percentages) focused on specific data records facilitates rigorous outlier analysis; something that is not easily possible with correlation-based machine-learning algorithms or empirical calculations. This TOB strength is particularly beneficial for datasets for which details of specific data-record predictions are important (e.g., for commercial valuation purposes or detailed sample-provenance purposes, both of which apply to GCV and commercial coal datasets). This feature could also be usefully applied to other commercially important characteristics of coal (e.g., predicting coal grindability from multiple input variables based on coal petrological properties).

Although the coal dataset studied here is relatively large, and the TOB algorithm clearly copes well with this number of data records, the TOB algorithm may have some limitations as a "big data" tool for very large datasets. Clearly, the algorithm has to hold and manage a large training database, and its performance (i.e., computational speed) is also likely to deteriorate progressively as the intrinsic dimensionality of the variable space increases. Further studies are required to establish the limits of applicability of the algorithm to such "big data" sets. However, although computational time is likely to increase for very large datasets, the transparency provided by the TOB algorithm may compensate for this. As stage 2 of the algorithm focuses on just a few of the best matches (i.e., up to ten or so), the collective influence of a significant number of variables would remain fully transparent.

The COALQUAL dataset lends itself to further studies on the impacts of sparse data coverage on TOB prediction performance. A future study will conduct sensitivity analysis that progressively excludes percentages of the dataset from the training data subset used for model tuning (i.e. adding those excluded data records to the testing subset). This will quantify how sparse the training data subset can become before it ceases to yield meaningfully accurate predictions for the dependent variable.
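A sketch of how such a sensitivity analysis might be scripted, assuming the hypothetical helper functions from the earlier sketches, normalized inputs, and illustrative exclusion fractions, is given below.

```python
import numpy as np

def sparsity_sensitivity(X, y, fractions=(0.10, 0.25, 0.50, 0.75), seed=0):
    """Progressively move fractions of the training records into the testing
    subset and record the resulting prediction accuracy (illustrative only).
    X and y are assumed to be normalized, as in the earlier sketches."""
    rng = np.random.default_rng(seed)
    train_idx, tune_idx, test_idx = split_by_rank(y)
    results = {}
    for f in fractions:
        drop = rng.choice(train_idx, size=int(f * len(train_idx)), replace=False)
        keep = np.setdiff1d(train_idx, drop)
        _, Q, w = tune_stage2(X[keep], y[keep], X[tune_idx], y[tune_idx])
        eval_idx = np.concatenate([test_idx, drop])
        preds = [tob_stage1_predict(x, X[keep], y[keep], w, Q)[0] for x in X[eval_idx]]
        results[f] = rmse_r2(y[eval_idx], preds)
    return results   # fraction excluded -> (RMSE, R2) on the enlarged testing subset
```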

Conclusions

The transparent open-box (TOB) learning-network algorithm provides credible and reliable predictions of dependent variables, such as coal gross calorific value (GCV), that involve complex, highly dispersed and non-linear datasets of influencing variables. Its high prediction accuracy, demonstrated in this study by predicting GCV from nine proximate- and ultimate-analysis influencing variables in a large published dataset (6339 data records of US coals), testifies to such capabilities. The method could easily be applied to more limited datasets, e.g., those based only upon the easier-to-obtain proximate-analysis variables.

TOB’s prediction performance for this published coal data set compares favourably to that reported by other artificial-intelligence algorithms and empirical correlations, with the added benefit that it is more easily audited and generally more transparent. The TOB algorithm does not develop any correlations when calculating its predictions. Instead, it establishes (in TOB stage 1) the closest matches with ten data records in its large associated training subset. In TOB stage 2 the algorithm improves its prediction, based on statistical measures of accuracy for tuning and testing data subsets (i.e., minimizing root mean squared error between predicted and measured GCV values). It achieves this by applying an optimizer to select the number of those matches (2 ≤ Q ≤ 10) and applying tuned weights to the errors associated with each input variable.

The calculations involved in the predictions derived from the TOB algorithm are individually auditable. Standard Solver optimizers or customized evolutionary or non-linear optimization algorithms can be used to successfully and transparently achieve the TOB stage 2 optimized predictions. Such flexibility and access to the underlying calculations is not possible with most other artificial-intelligence prediction methods or empirical calculations.

An additional valuable feature of the TOB algorithm is the ease with which sensitivity analysis can be conducted by modifying its Q value. In particular, the Q-value sensitivities can help to identify whether the algorithm is over-fitting or under-fitting a dataset. These positive attributes make the TOB algorithm a suitable prediction-performance benchmark with which to compare the predictions of other machine-learning and empirical-correlation algorithms. It typically provides results that complement those of other algorithms with respect to insight into the underlying dataset. Indeed, in some cases, where the dataset covers coals from many different regions and mines, the TOB algorithm has the ability, through its record-matching stage 1 routine, to identify the provenance of specific samples.

The detailed calculations shown for example data records demonstrate exactly how the predictions of the TOB algorithm can be audited and assessed. These detailed calculations are not complex, rather they highlight the prediction mechanisms involved and the key roles played by the optimized Q value and the input variable weights in producing the stage 2 optimized predictions. The ability to interrogate and verify in detail specific predictions is increasingly important for providing user confidence in prediction algorithms. By revealing useful information about the relative importance of identified training-subset records in terms of their contributions to specific predictions, and the problematic nature of other data records (e.g., outlying values of certain metrics not replicated in other data records), the TOB method provides such user confidence. In some applications it may be worth sacrificing a small degree of accuracy in order to obtain such insight and confidence associated with the predictions to be deployed.