1 Introduction

Software project failure is common. Some researchers suggest that the incidence of failure in software projects rises to over 80% (Hoffman 1999; Jørgensen and Moløkken-Østvold 2006; Dwivedi et al. 2015). Moreover, the literature shows that software projects are frequently affected by a large number of development problems, such as lean project management, high program expenses, time lags, and inefficient advertisement (Verner et al. 2007; El Emam and Koru 2008; Hussain and Mkpojiogu 2016). Since some of these difficulties appear early in software development, the literature suggests that early treatment of problematic situations may improve software project success. In addition, existing research has emphasized detecting factors that may affect the quality of the project outcome (Cerpa et al. 2010; Reyes et al. 2011; Ahimbisibwe et al. 2015). In practice, the lack of strict software quality methodologies in software development usually leads to severe consequences for the software development process as a whole (Gefen and Straub 2001; Jan et al. 2016).

In a traditional context, the experience of project managers plays a decisive role in predicting the implications of decisions made during the project life cycle. When ignored, signs of project failure can develop into total failure of the whole project. Although some signs of project failure are detectable, many project managers fail to take corrective action at the right moment. Accordingly, there is increasing interest in developing expert systems for predicting project outcomes in early stages.

Naturally, project stakeholders put significant pressure on the development process, which affects the quality of IT projects (Heravi et al. 2015). These pressures include changes in user requirements and inconsistency between the user's definition of quality and the software development team's definition of quality (Woodroof and Kasper 1998).

Since a large variety of software is published online, users are considerably selective in what they consider the best software for them. Accordingly, it is important to study how a user interprets software quality, and how that interpretation differs from a software developer's interpretation of quality. Garvin (Deming 2000) identified two forms of quality, qualitative and quantitative. The qualitative part of quality is based on opinions and personal views, whereas the quantitative part is based on numbers and statistics. ISO/IEC 25010:2011 is composed of two models, the “quality in use” (QiU) model and the “product quality” model. This research covers the first model, which is composed of five characteristics. Table 1 briefly summarizes these characteristics as described by ISO (2011). QiU is increasingly adopted in many software programming fields, particularly in user-centered applications (UCA). For instance, it is common in business process management (Heinrich 2014), web applications (Orehovački et al. 2013; Lippert and Govindarajulu 2015), and mobile applications (Alnanih et al. 2012; Kim et al. 2015).

Table 1 The characteristics of ISO/IEC 25010, quality in use

Adopting QiU in the early stages of the software development cycle has proved beneficial for all stakeholders. However, in the absence of a clear understanding of QiU, it may slow the software development process and lead to unexpected project failure. Given the scarcity of research literature, this research aims to contribute to the process of predicting QiU in early software development cycles by integrating ISO 25010 with machine learning, in order to develop a novel QiUPS.

2 Literature review

There is a tendency amongst researchers to develop QiU models based on ISO standard frameworks. La and Kim (2013) used the ISO 9126 model to develop a service-based mobile system. Osman and Osman (2013) calculated QiU through a well-defined questionnaire. Oliveira et al. (2014) used ISO 9126 to develop a tool for evaluating the usability of project management tools.

The semantic web exploration tools QiU model (SWET-QUM) uses ISO 25010 to evaluate semantic web exploration tools (González et al. 2012). Meanwhile, Becker et al. (2012) proposed a strategy for understanding and improving quality, and recommended modifying the operability characteristic to improve QiU measurement.

Ardito et al. (2014) proposed a pattern recognition method that uses a list of QiU evaluation patterns to detect the quality of e-learning systems. Apart from the positive results, the study concluded that pattern recognition application in QiU could be time consuming.

With regard to machine learning applications in predicting QiU, data mining has been used widely to predict software project outcomes (Abe et al. 2006; Mizuno et al. 2004; Liu et al. 2014; Wang 2007; Smite 2007; Oztekin et al. 2013; Halees 2014). Abe et al. (2006) used a Bayesian classifier to predict software project outcomes. Mizuno et al. (2004) used a Bayesian classifier with tenfold cross-validation to predict runaway software projects.

Wang (2007) used a k-means algorithm to predict the outcome of open source software projects. Cheng and Wu (2008) proposed a support vector machine and a fast messy genetic algorithm to predict software project success.

An examination of the research literature shows, first, that machine learning applications for predicting QiU are rarely explored. Second, the literature shows little interest in comparing the performance of diverse machine learning algorithms in predicting QiU. Third, there are apparent research gaps in comparing the performance of different neural networks in predicting QiU. Finally, there is also little interest in using user requirements and feedback to predict QiU. Consequently, this research attempts to highlight these research gaps and to contribute empirically to filling them.

3 Methodology

The research methodology consists of four subsections. The first derives measurements from the ISO 25010 quality model. The second describes the research dataset and its source. The third explains the artificial neural network (ANN) and its use as a classification algorithm (CA) in this research, and the fourth discusses the structure of the convolutional neural network (CNN).

3.1 Deriving measurements from ISO 25010 quality model

The research literature shows that ISO 25010 can be used to derive QiU measurements. This research converted the general QiU specifications of ISO 25010 into measurements, based on a similar research model (Bevan 2009). The following are examples of converting these specifications into measurable variables:

For the effectiveness characteristic, the number of objectives achieved as requested by the user is compared against the number of all objectives requested by the user. Formula (1) shows how to calculate this measurement.

$$QiUPS\,Effectiveness = \frac{number\,of\,achieved\,objectives }{number\,of\,all\,required\,objectives}$$
(1)

The usefulness sub-characteristic of satisfaction is measured as the net number of advantageous feedback notes (advantageous minus disadvantageous) relative to the total number of feedback notes. Feedback notes are extracted from the online form and from achieved tasks. Formula (2) shows how to calculate usefulness.

$$QiUPS\, Usefulness = \frac{ number\,of\,advantageous\,notes - number\,of\,disadvantageous\,notes }{total\,number\,of\,feedback\,notes}$$
(2)

For a given QiUPS case, the context completeness sub-characteristic of context coverage is measured by calculating the average of user satisfaction and software freedom from risk. Satisfaction is calculated using four measurements: usefulness, trust, pleasure, and comfort. Formula (3) demonstrates how to calculate this measurement.

$$QiUPS\,Context\,Completeness = \frac{Satisfaction + Freedom\,from\,Risk}{2}$$
(3)

For a given QiUPS case, the flexibility sub-characteristic of context coverage is measured by calculating the average of user satisfaction (in domains beyond user requirements) and software freedom from risk (in domains beyond user requirements). Formula (4) demonstrates how to calculate this measurement.

$$QiUPS \,Flexibility = \frac{{Satisfaction\, \left( {beyond\,user\,requirements} \right) + Freedom\,from\,Risk\,\left( {beyond\,user\,requirements} \right)}}{2}.$$
(4)
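Formulas (1) to (4) can be sketched as small functions. This is a minimal illustration rather than the implementation used in the research: the function names, the argument names, and the example values are assumptions, and satisfaction and freedom from risk are taken as already-computed ratios because the text does not state how their components are combined.

```python
def effectiveness(achieved: int, required: int) -> float:
    """Formula (1): achieved objectives over all required objectives."""
    if required <= 0:
        raise ValueError("required must be positive")
    return achieved / required


def usefulness(advantageous: int, disadvantageous: int, total_notes: int) -> float:
    """Formula (2): net advantageous notes over all feedback notes."""
    if total_notes <= 0:
        raise ValueError("total_notes must be positive")
    return (advantageous - disadvantageous) / total_notes


def context_completeness(satisfaction: float, freedom_from_risk: float) -> float:
    """Formula (3): average of satisfaction and freedom from risk."""
    return (satisfaction + freedom_from_risk) / 2


def flexibility(satisfaction_beyond: float, risk_freedom_beyond: float) -> float:
    """Formula (4): same average, measured beyond the user requirements."""
    return (satisfaction_beyond + risk_freedom_beyond) / 2


# Hypothetical project: 8 of 10 objectives met, 12 positive and 3 negative
# notes out of 20, and illustrative satisfaction / risk ratios.
example = {
    "effectiveness": effectiveness(8, 10),
    "usefulness": usefulness(12, 3, 20),
    "context_completeness": context_completeness(0.5, 0.75),
    "flexibility": flexibility(0.25, 0.5),
}
```

Because Formula (2) subtracts disadvantageous notes, usefulness can be negative when negative feedback dominates, whereas the other three measurements stay between 0 and 1 when their inputs do.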

3.2 Research dataset

The research dataset was extracted from the archived files and database of the Information, Communication, and E-learning Technology Centre (ICET), Hashemite University of Jordan. Since ICET works as a software house that develops diverse software applications, the research used the ICET database and archive to generate the main research dataset. Figure 1 illustrates the process of extracting the research dataset. Unfortunately, 317 software projects were excluded because of missing data, which left the dataset with 1899 software projects. Table 2 shows the dataset processing summary with the resulting three classes. Since the dataset fields are ratios, their data type is real number. The researchers used WEKA and IBM SPSS to run the statistical experiments. Looking at Fig. 2, most of the cases are classified as having a medium QiU, with 1423 out of 1899 cases. With respect to the inconsistency between the developers' definition of quality and the users' definition of quality, this chart suggests that software development teams focus mainly on developing highly functional software rather than software with high QiU.

Fig. 1 The process of extracting research dataset

Table 2 Dataset processing summary

Fig. 2 Dataset frequency chart

3.3 Artificial neural networks (ANN)

Partially derived from the biological sciences, an ANN consists of a number of coordinated processes that accept a predefined input, process it, and predict a certain output. Modeled on the brain neuron, ANN provides tools for learning rules from formatted examples. As Fig. 3 shows, the hidden layer is positioned between the input and output layers. One of the major learning objectives of ANN is to produce rules that align with the input and output parameters. ANN is widely used in expert systems research, and many researchers agree on several advantages of using it, which can be summarized as follows:

  1. ANN can adaptively learn how to perform tasks based on initial data.

  2. ANN learns quickly by developing its own organization during the learning process.

  3. With the support of dedicated hardware and software, ANN can be implemented effectively in parallel architectures.

  4. Compared to other CAs, ANN can process large amounts of data effectively.

ANN uses a supervised learning method called back propagation to train the neural network (Burr 2015). Developers train the ANN to transform data input into a required output, and fit the model to a specified prediction context (Craven and Shavlik 2014; Schmidhuber 2015).

Fig. 3 Using WEKA tool, a visualization of the generated ANN

Technically, ANN acts as a supervised learning system for studying and solving various logical problems, including pattern recognition and classification. Looking at Fig. 3, the ANN algorithm adjusts the weights of the neural connections to reduce the error values in the network output. If these modifications minimize the error, then the designated ANN has learned a new function. As a naming convention, this research uses the term ANN to distinguish traditional neural networks from CNNs. Moreover, the research used the multi-layer perceptron (MLP) as the experimental ANN model.
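The weight adjustment described above can be sketched with a very small multi-layer perceptron trained by back propagation. This is an illustration under stated assumptions, not the WEKA model used in the research: the class name `TinyMLP`, the two-feature toy samples, the four hidden neurons, the learning rate, and the squared-error loss are all hypothetical choices made for brevity.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One-hidden-layer perceptron trained with back propagation (illustrative)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # One row per neuron; the last entry of each row is the bias weight.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        output = [sigmoid(sum(w * v for w, v in zip(row, hidden + [1.0])))
                  for row in self.w_out]
        return hidden, output

    def train_step(self, x, target, lr=0.5):
        hidden, output = self.forward(x)
        # Error signal at the output layer (squared error, sigmoid derivative).
        delta_out = [(o - t) * o * (1 - o) for o, t in zip(output, target)]
        # Error signal propagated back to the hidden layer (uses old weights).
        delta_hidden = [
            h * (1 - h) * sum(d * self.w_out[k][j] for k, d in enumerate(delta_out))
            for j, h in enumerate(hidden)
        ]
        # Gradient-descent updates of the connection weights.
        for k, row in enumerate(self.w_out):
            for j, v in enumerate(hidden + [1.0]):
                row[j] -= lr * delta_out[k] * v
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x + [1.0]):
                row[i] -= lr * delta_hidden[j] * v

    def loss(self, data):
        return sum(sum((o - t) ** 2 for o, t in zip(self.forward(x)[1], target))
                   for x, target in data) / len(data)


# Hypothetical samples: two ratio features, one-hot target (low, medium, high).
toy_data = [
    ([0.1, 0.2], [1, 0, 0]), ([0.2, 0.1], [1, 0, 0]),
    ([0.5, 0.5], [0, 1, 0]), ([0.6, 0.4], [0, 1, 0]),
    ([0.9, 0.8], [0, 0, 1]), ([0.8, 0.9], [0, 0, 1]),
]

net = TinyMLP(n_in=2, n_hidden=4, n_out=3)
initial_loss = net.loss(toy_data)
for _ in range(2000):
    for features, target in toy_data:
        net.train_step(features, target)
final_loss = net.loss(toy_data)
```

Comparing `final_loss` with `initial_loss` shows the error reduction that the text refers to as learning a new function. The model in the research has nine inputs and three outputs; only the output size is mirrored here.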

3.4 Convolutional neural networks (CNN)

Rarely used outside the image processing area, the convolutional neural network (CNN) inherits most of its features from ANN (Sainath et al. 2015). However, CNN structures its layers according to a selective convolutional principle. The main difference between ANN and CNN is that CNN contains specific layers for convolution and pooling, which implies that the layers after the input are selectively connected. As Fig. 4 shows, the first layer (X − 1) contains five inputs. Additionally, each neuron in layers “X” and “X + 1” receives three inputs from the previous layer, presenting the balanced architecture of a CNN. This structure allows balanced processing of data. Though not demonstrated in Fig. 4, CNNs have more hidden layers than traditional ANNs. Accordingly, this research selected CNN as one of the experimental CAs.

Fig. 4 A visualization of CNN
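The connectivity in which each neuron receives three inputs from the previous layer can be illustrated with a one-dimensional convolution. The sketch below is an assumption-laden illustration, not the network used in the research: the input values, the shared kernel weights, and the resulting layer sizes (five, then three, then one) are hypothetical, and the operation is the sliding weighted sum that neural network libraries usually call convolution.

```python
def conv1d(inputs, kernel, bias=0.0):
    """Valid 1-D convolution: each output sees len(kernel) adjacent inputs."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, inputs[i:i + k])) + bias
        for i in range(len(inputs) - k + 1)
    ]


def max_pool(values, size):
    """Non-overlapping max pooling over windows of the given size."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# Hypothetical values for the five inputs of layer X - 1.
layer_x_minus_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
# Hypothetical kernel of three weights, shared by every neuron in a layer.
kernel = [0.25, 0.5, 0.25]

layer_x = conv1d(layer_x_minus_1, kernel)      # three neurons, three inputs each
layer_x_plus_1 = conv1d(layer_x, kernel)       # one neuron, three inputs
pooled_x = max_pool(layer_x, size=3)           # pooling as an alternative reduction
```

Because the same three weights are reused at every position, the layer needs far fewer parameters than a fully connected layer over the same inputs.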

4 Results and discussion

This section discusses the main research results in three experiments. The results were processed using the statistical tools WEKA and IBM SPSS. The following subsections explain the research experiments.

4.1 Experiment 1: Testing the performance of different classification algorithms

The first experiment aims to test different CAs in terms of correctly classified instances, incorrectly classified instances, relative absolute error, root relative squared error, root mean squared error, and mean absolute error. In this experiment, error values are used to evaluate the difference between the model prediction and the real output.
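The four error measures can be sketched with their usual textbook definitions. This is an illustration on plain numeric vectors with hypothetical values; it assumes the relative measures are normalized by a baseline that always predicts the mean of the actual values, and it does not reproduce how WEKA derives these figures from class probability estimates.

```python
import math


def error_metrics(actual, predicted):
    """MAE, RMSE, and the two relative errors against a mean-value baseline."""
    n = len(actual)
    baseline = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    base_abs = sum(abs(baseline - a) for a in actual)
    base_sq = sum((baseline - a) ** 2 for a in actual)
    return {
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100 * abs_err / base_abs,
        "rrse_percent": 100 * math.sqrt(sq_err / base_sq),
    }


# Hypothetical actual and predicted values.
actual = [1.0, 2.0, 3.0, 4.0]
model = error_metrics(actual, [1.5, 2.0, 2.5, 4.0])
# A predictor that only returns the mean scores exactly 100% on both
# relative measures, which is the reference point for reading them.
naive = error_metrics(actual, [2.5, 2.5, 2.5, 2.5])
```

Under these definitions, relative errors below 100% indicate a model that improves on the baseline, while values at or near 100% indicate that it does not.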

Looking at Fig. 3, building the ANN model took 6.79 s, with nine inputs and three outputs. Table 3 shows that the numbers of correctly classified instances for ANN and CNN are very close, and both are higher than those of the other CAs. ANN correctly classified 1668 (87.8357%) instances, while CNN correctly classified 1654 (87.0985%) instances. Additionally, the error values for ANN and CNN differ only slightly, and both are lower than those of the other algorithms, including the Naïve Bayes CA. Naïve Bayes follows ANN and CNN in terms of correctly classified instances, with 1585 (83.465%) instances. However, its error values are higher than those of both ANN and CNN, without exception. The J48 model took about 0.14 s to build. Compared with the other CAs, J48 performed worst, with only 1423 (74.9342%) correctly classified instances. Similarly, the error values for the J48 model are the highest, with 100% in both relative absolute error and root relative squared error.

Looking at Table 4, WEKA’s confusion matrix reveals more about the comparative output of the CAs. For output “a”, Low QiU, the first three CAs produced exactly the same result (232 correctly classified), whereas the J48 tree failed to predict any Low QiU case correctly. For output “b”, Medium QiU, the J48 tree unexpectedly scored highest, as it classified all 1423 (100%) instances correctly. The Naïve Bayes classifier came second, with 1353 out of 1423 instances. CNN followed with 1335 out of 1423, and ANN with 1319 out of 1423. For output “c”, High QiU, in contrast to the previous results, both Naïve Bayes and the J48 tree performed poorly, without a single correct prediction. Conversely, ANN correctly predicted 117 out of 143 instances, whereas CNN correctly predicted 87 out of 143.

Table 3 Comparing different machine learning algorithms
Table 4 Confusion matrices for the experimented CAs
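The reported accuracies can be re-derived from the per-class counts of correctly classified instances given in the text. The sketch below only repeats that arithmetic; the dictionary layout and names are assumptions.

```python
TOTAL_INSTANCES = 1899
MEDIUM_INSTANCES = 1423

# Correctly classified instances per class (low, medium, high), as reported.
CORRECT = {
    "ANN": (232, 1319, 117),
    "CNN": (232, 1335, 87),
    "Naive Bayes": (232, 1353, 0),
    "J48": (0, 1423, 0),
}

# Overall accuracy in percent: correct predictions over all instances.
accuracy_percent = {
    name: 100 * sum(counts) / TOTAL_INSTANCES
    for name, counts in CORRECT.items()
}

# Share of the Medium class, i.e. the accuracy of always answering "Medium".
majority_share_percent = 100 * MEDIUM_INSTANCES / TOTAL_INSTANCES
```

The J48 accuracy equals the share of the Medium class, so its overall accuracy reflects the class distribution rather than an ability to separate the three classes.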

Overall, both ANN and CNN provide more balanced and correct classifications for the research dataset, with ANN slightly better than CNN. Consequently, this research considers the previous results as empirical indications of the feasibility of developing a QiUPS using either ANN or CNN. Accordingly, the researchers decided to study both CAs in two additional experiments.

4.2 Experiment 2: Testing ANN and CNN in terms of statistical measures of the performance

Formula (5) presents the true positive rate (TPR), or sensitivity, which is the proportion of actual positives that are correctly predicted. Looking at Tables 5 and 6, the TPR values for the ANN classes are close to the TPR values for the CNN classes. However, ANN predictions are more balanced, with a TPR of 0.818 in class 3, High QiU.

$$TPR = \frac{{True\, Positives\,\left( {TP} \right)}}{{True \,Positives\,\left( {TP} \right) + False \,Negatives\,\left( {FN} \right)}}$$
(5)
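As a worked instance of Formula (5), the class 3 (High QiU) counts from Experiment 1 can be used directly: 143 actual High QiU instances, of which ANN predicted 117 correctly and CNN 87. The function name below is an assumption.

```python
def true_positive_rate(tp: int, fn: int) -> float:
    """Formula (5): TP / (TP + FN)."""
    return tp / (tp + fn)


HIGH_QIU_INSTANCES = 143

# False negatives are the actual High QiU cases that were not predicted as High.
tpr_ann_high = true_positive_rate(117, HIGH_QIU_INSTANCES - 117)
tpr_cnn_high = true_positive_rate(87, HIGH_QIU_INSTANCES - 87)
```

The ANN value rounds to the 0.818 quoted above, while the CNN value for the same class is noticeably lower.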

Formula (6) shows the false positive rate (FPR), which is the proportion of actual negatives that are falsely predicted as positive. Looking at Tables 5 and 6, the overall FPR in the ANN classes is slightly lower than the overall FPR in the CNN classes, particularly in class 3, High QiU. Accordingly, this suggests that ANN is slightly more effective than CNN at predicting positive cases correctly.

$$FPR = \frac{{False \,Positive\left( {FP} \right)}}{{FP + True \,Negatives\left( {TN} \right)}}$$
(6)

Formula (7) demonstrates precision, defined as the proportion of predicted positives that are actually positive. Compared with CNN, the weighted average precision of ANN is slightly higher and more balanced, especially in class 3, High QiU. Hence, ANN is slightly better at avoiding the misclassification of negative cases as positive.

$$Precision = \frac{TP}{TP + FP}$$
(7)

Representing the harmonic mean of precision and recall, the F-measure is calculated using Formula (8). The F-measure is relatively higher for ANN, which indicates better prediction and balance with the ANN CA.

$$F{\text{-}}measure = 2 \times \frac{Precision \times TPR}{Precision + TPR}$$
(8)
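Formulas (5) to (8) can be computed per class from a confusion matrix by treating one class as positive and the remaining classes as negative. The sketch below assumes rows are actual classes and columns are predicted classes; the 3 × 3 matrix is entirely hypothetical and is not taken from Tables 4 to 6.

```python
def per_class_metrics(matrix, k):
    """One-vs-rest TPR, FPR, precision and F-measure for class index k."""
    size = len(matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                            # actual k, predicted other
    fp = sum(matrix[i][k] for i in range(size)) - tp    # actual other, predicted k
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * tpr / (precision + tpr)
                 if precision + tpr else 0.0)
    return {"tpr": tpr, "fpr": fpr, "precision": precision, "f_measure": f_measure}


# Hypothetical confusion matrix (rows: actual low, medium, high).
toy_matrix = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]

high_class = per_class_metrics(toy_matrix, 2)
```

For the hypothetical High class this gives a TPR of 0.9 and an FPR of 0.1, so a class can be recalled well while still attracting some false positives, which is why the text reads the measures together.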

Looking at Tables 5 and 6, the precision-recall curve (PRC), which relates precision to sensitivity, provides an indication of the accuracy of the CA. Contrary to the conclusions from the previous results, PRC values are slightly higher for CNN. Likewise, receiver operating characteristic (ROC) values are slightly higher for CNN. Although both PRC and ROC are slightly higher for CNN, the overall ability of the two CAs to differentiate between false and true predictions of QiU is very similar, with a slight advantage to CNN.

Table 5 Accuracy of the ANN CA by class
Table 6 Accuracy of the CNN CA by class

Based on the previous results, the researchers conducted an independent t test to assess the statistical significance of the difference between ANN and CNN predictions, with μANN = μCNN as the null hypothesis. Using the WEKA Experimenter, the t test showed that the value of Sig. (2-tailed) for correctly classified instances is over 0.05, which means that there is no evidence of a statistically significant difference between ANN predictions and CNN predictions, so the null hypothesis cannot be rejected. Consequently, the researchers proceeded to experiment 3.
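The logic of an independent two-sample t test can be sketched as follows. This is an illustration only: the per-fold accuracies are hypothetical placeholders rather than the research results, the pooled-variance form is an assumption, and the procedure inside the WEKA Experimenter may differ in detail.

```python
import math
from statistics import mean, variance


def independent_t(sample_a, sample_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(sample_a)
              + (n_b - 1) * variance(sample_b)) / df
    std_error = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_error, df


# Hypothetical accuracies over ten folds for each CA (not the paper's data).
ann_folds = [0.88, 0.87, 0.89, 0.86, 0.88, 0.87, 0.89, 0.88, 0.87, 0.88]
cnn_folds = [0.87, 0.88, 0.86, 0.87, 0.88, 0.86, 0.87, 0.88, 0.87, 0.86]

t_stat, df = independent_t(ann_folds, cnn_folds)

# Two-tailed critical value of Student's t for df = 18 at the 0.05 level.
T_CRITICAL_DF18 = 2.101
reject_null = abs(t_stat) > T_CRITICAL_DF18
```

With these placeholder values the statistic stays below the critical value, so `reject_null` is false, which mirrors the situation described above where Sig. (2-tailed) exceeds 0.05.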

4.3 Experiment 3: Testing the developed QiUPS against real test cases

After developing the QiUPS, the researchers used it to predict QiU for 558 published software projects, which were compiled from other ICET projects. Looking at Table 7, testing on the 558 real instances provided results that align with the previous results in Tables 3 and 4. With ANN, 492 (88.172%) of the cases were classified correctly, whereas with CNN 473 (84.767%) of the cases were classified correctly. The error values in tests a and b are mixed: absolute errors are better with CNN, while squared errors are better with ANN. In practice, the results align with the previous results from the QiUPS tenfold t test. On the other hand, the confusion matrices in Table 8 show better balance and classification with the ANN algorithm in both tests, particularly in predicting class 3, High QiU. Accordingly, although the previous tests show non-significant differences between ANN predictions and CNN predictions, this test shows a clear advantage for ANN predictions, especially in predicting software with high QiU.

Table 7 Comparing the ANN training data against real test cases
Table 8 Comparing the ANN training confusion matrix against real tested confusion matrix

5 Conclusion

In conclusion, the research highlighted the problem of inconsistency between software developers' interpretation of quality and users' interpretation of quality. Moreover, the research provided empirical evidence supporting the feasibility of integrating a CA with the ISO 25010 quality model in order to develop a QiUPS. As the QiU model under experiment, ISO 25010 showed its ability to cover QiU aspects in real situations, a topic that is rarely researched.

The statistical measures of performance, which have not previously been discussed in the QiU context, showed similar performance for ANN and CNN, with a clear advantage for ANN when tested on real cases. Moreover, the statistical measures of performance showed more balanced predictions when using ANN. Hence, this research recommends integrating ANN with the QiU model to develop a QiUPS, a novel recommendation based on the previous empirical evidence.

Because this research addresses a rarely explored area, its results are original and should be treated as empirical indications of the feasibility of predicting QiU using ANNs. The research dataset is considered one of the main contributions of the research because of its originality and its suitability for the CAs under study. Based on that, the research recommends using original datasets in similar contexts.

With regard to developing a decision support system (DSS) for measuring QiU, the research recommends basing the DSS on the QiUPS architecture, which was discussed and applied in this research.