
1 Introduction

Software development has been around for several decades, and the discussion about its failures and successes has been intense.

It all started with the Standish Group’s Chaos Report of 1994 [1], which stated that 53% of projects did not meet customer satisfaction and/or went significantly over time or budget. It was shocking to see a figure covering over half of all projects, and a discussion about a software crisis began.

This report, however, has since been criticized for lack of peer review, for not fully describing the study design or the project selection criteria, and for defining successful and failed projects in a way that may bias the study [2, 3, 5].

Over 20 years later, the debate is still ongoing, but there seems to be agreement that the failure rate of software development projects has dropped [5, 6, 7]. Although the reported values do not coincide, they show a decreasing tendency that may be significant considering that projects are increasingly complex.

One of the areas of software development that has contributed to this increased success of software projects is Requirements Engineering (RE), according to previous research such as [8, 9]. Furthermore, according to [6], one of the three main reasons for this positive development is that the communication of requirements has improved considerably. [10] makes an even stronger claim: that “Meets user requirements” is the most important success criterion for both users (96%) and project managers (81%).

Knowledge discovery and data mining are much more recent and less mature fields than software development. For instance, if process model development is taken as a sign of maturity, the first process model for this area dates back to 1996 [11], while in software development the well-known Waterfall model goes back to 1970 [12].

Nonetheless, it is indisputable that knowledge discovery and data mining are of growing importance in a time where more and more data is produced.

Data production numbers are, in fact, staggering: for example, 144,000 hours of video are uploaded to YouTube per day [13], 182,900,000,000 emails are sent per day [14] and 1,000,000,000 pieces of content are shared on Facebook per day [15].

This results in massive amounts of data. Facebook has one of the largest data warehouses in the world, storing more than 300 petabytes [16].

With such a large production of data and in a time when knowledge is one of the most precious assets, it is no wonder that knowledge discovery and data mining are of increasing importance.

The road ahead for knowledge discovery and data mining projects is increased systematization, as the area becomes more mainstream.

This seems important because the current trends in this area are towards larger projects (involving larger amounts of data) in which, at the same time, the people involved have lower technical skills and very little time to experiment with different approaches [17].

Within knowledge discovery projects, requirements engineering is the area that can reap the most benefits from a higher level of systematization.

Firstly, because requirements engineering is particularly neglected in this type of project. Some authors even argue that such projects should be based on the available data rather than on stakeholders’ requirements [18].

Secondly, because, the field being less mature, fewer systematization efforts have been made so far. When such efforts occur, the participation of enterprise stakeholders will be facilitated and improved, and the area can follow software engineering in general, which has improved in terms of customer satisfaction and time and budget compliance.

For these reasons, the research question was: “How can systematization be brought into Knowledge Discovery projects, in general, and into their Requirements Engineering phase, in particular, aiming at improvements in their success rate?”

The research started by analysing the Knowledge Discovery process through a systematic review of the state of the art in academia and industry regarding knowledge discovery and data mining process models. To conclude this review, a comparison of the main process models found was made.

Then, the requirements engineering area was analysed in a similar way, followed by a focus on requirements engineering for KD projects. It was found that requirements engineering for KD is different, which is why it is claimed here that a requirements engineering process model for KD is needed, and SysPRE, a Systematized Process for Requirements Engineering designed specifically for KD projects, is proposed.

SysPRE began as an initial textual description, which was then formally specified as a DEMO ontology [45]. This formal specification was instantiated in two case studies so that trivial and non-trivial errors could be identified and the necessary adjustments made.

SysPRE synthesises the knowledge obtained from the state-of-the-art reviews in a way that can be helpful for enterprises and other organizations with KD projects, both for novice and expert users, with the hope of improving the success rate of such projects.

2 Knowledge Discovery Process and DEMO Specification

In this section, the Knowledge Discovery Process (KDP) is described as understood after analysing the existing process models listed in Sects. 2.1 and 2.2, with special detail regarding Requirements Engineering within the KDP.

This specifically considers business KDPs, but this description would also be accurate for other types of organizations, namely governmental or non-profit.

2.1 Knowledge Discovery

The need for a process model stems from the fact that data mining is non-trivial. In 2006, Bernstein et al. noted that “there are many possible choices for each stage, and only some combinations are valid. Because of the large space and nontrivial interactions, both novices and data mining specialists need assistance” [19].

The need for a process model, however, goes back to 1989, when it was first discussed at the IJCAI workshop on Knowledge Discovery in Databases (KDD) [20]. This was the original workshop that started the series of KDD workshops which, from 1995 onwards, grew into the KDD conferences. Even so, the first model was only formally proposed in 1996.

This original KDD model consisted of nine steps:

  • Learning the application domain: understanding the domain and any relevant prior knowledge, and also identifying the goal of the process

  • Creating a target dataset

  • Data cleaning and pre-processing

  • Data reduction and projection

  • Selection of the data mining function (e.g., summarization, clustering)

  • Selection of the data mining algorithm(s) and specification of relevant parameters

  • Data mining, i.e., the actual search for patterns

  • Interpretation of the results

  • Using the discovered knowledge, which could be done in many ways, such as incorporating the knowledge into another system or simply generating a report of the findings
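As a purely illustrative sketch (not part of the original KDD proposal), these steps can be mapped onto a small Python workflow. The dataset, column names and the choice of clustering as the data mining function are assumptions made only for the example.

```python
# Toy walk-through of the nine KDD steps with pandas/scikit-learn.
# Dataset, columns and algorithm are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Steps 1-2: domain understanding and goal setting happen outside the code;
# the assumed goal here is to group customers by behaviour.
raw = pd.read_csv("customers.csv")                       # create a target dataset

# Step 3: data cleaning and pre-processing
data = raw.dropna(subset=["annual_spend", "visits_per_month"])

# Step 4: data reduction and projection (keep only relevant attributes, scale them)
features = StandardScaler().fit_transform(data[["annual_spend", "visits_per_month"]])

# Steps 5-6: choose the data mining function (clustering) and algorithm (k-means)
model = KMeans(n_clusters=3, random_state=0)

# Step 7: the actual search for patterns
data["cluster"] = model.fit_predict(features)

# Steps 8-9: interpret the results and use the discovered knowledge, e.g. a report
print(data.groupby("cluster")[["annual_spend", "visits_per_month"]].mean())
```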

From this model, other models were derived, such as those by Ganesh et al. [21] and Adriaans and Zantinge [22] in 1996, Brachman and Anand [23] in 1997, Berry and Linoff [24], Cabena et al. [25], the Knowledge Discovery Life Cycle (KDLC) model by Lee and Kerschberg [26] in 1998, and Buchner et al. [27] in 1999.

The most widely used in industry, however, was CRISP-DM [46], created in 1997 by a group of organizations involved in data mining (NCR, SPSS, DaimlerChrysler and OHRA). The first version was published in August 2000 [28]. Between 2006 and 2008 there were efforts to launch a second version, referred to as CRISP-DM 2.0, but no result was ever published.

The CRISP-DM life cycle consists of six iterative steps: business understanding; data understanding; data preparation; modelling; evaluation; deployment.

Many variations of CRISP-DM were proposed over the years, such as the Rapid Collaborative Data Mining System (RAMSYS) model [31] in 2001, Data Mining for Industrial Engineering (DMIE) by Solarte [32] in 2002, the Data Mining and Knowledge Discovery (DMKD) model by Cios and Kurgan [33] in 2005, Ontology Driven Knowledge Discovery (ODKD) by Gottgtroy [34] in 2007, the Knowledge Discovery and Communication Framework (KDCF) by Rennolls and AL-Shawabkeh [35] and ASD-DM by Alnoukari et al. [36] in 2008, and IKDDM by Osei-Bryson [37] in 2012.

Other models include the Catalyst methodology, proposed in 2003 [30]. This methodology has two parts: business modelling and data mining. For each part, a detailed step-by-step methodology is suggested. It was originally proposed both in printed form and online, and both formats followed a hyperlink structure.

Considering both parts of the methodology as a whole, we can say that it has six steps: business modelling; data preparation; tool selection; mining; refining; deploying.

What makes this methodology interesting is the level of detail it includes in each step. It is very focused on what needs to be done and how it can be done. This is organized in what the author calls “boxes”, of which there are four types: Action Boxes, Discovery Boxes, Technique Boxes, and Example Boxes.

Finally, SEMMA was created to be used with a specific application, SAS Enterprise Miner [29].

The acronym SEMMA stands for sample, explore, modify, model, assess, which are the five iterative steps proposed:

  • Sample: extracting sample data (an optional step)

  • Explore: exploring the data (or the sample) in order to simplify the model

  • Modify: any cleaning, pre-processing, reduction or projection deemed necessary

  • Model: the actual search for patterns

  • Assess: the evaluation and interpretation of the results

SEMMA however is tied to the SAS Enterprise Miner tool and therefore overlooks any steps that are not related to the tool, namely any business understanding tasks.

2.2 Requirements Engineering

The IEEE Standard Glossary of Software Engineering Terminology [38] defines a software requirement as:

  1. A condition or capability needed by a user to solve a problem or achieve an objective.

  2. A condition or capability that must be met or possessed by a system or system component to satisfy a contract, standard, specification, or other formally imposed document.

  3. A documented representation of a condition or capability as in 1 or 2.

In short, a software requirement is something that we expect the software to meet.

Among the studied methods, there was a special focus on six: Waterfall, by Winston Royce [12] in 1970; Spiral, by Barry Boehm [39] in 1986; Rapid Application Development (RAD), originated at the New York Telephone Company in the mid-1970s and popularized in the early 90’s by James Martin and his approach [40]; the Rational Unified Process (RUP), by the Rational Software Division of IBM [41]; Agile, proposed in 2001 in the Agile Manifesto [42]; and Goal-Oriented Requirements Engineering (GORE).

2.3 PIF and CAP Analysis

A Performa-Informa-Forma (PIF) analysis and a Coordination-Actors-Production (CAP) analysis were applied to the KDP with the goal of gaining insight into which concepts and activities are important in the KD process. Namely, in terms of activities, the Performa items are the truly relevant ones and will later become the transactions of the DEMO specification of SysPRE.

Most of the Performa-Informa-Forma analysis is omitted here; only the Performa items remain, shown in italics. The Coordination-Actors-Production analysis was done simultaneously by enclosing the text indicating an actor role between square brackets (“[” and “]”). Transaction ids (for instance, T01) are also marked next to the Performa items.

The knowledge discovery process begins {T01} when the [business analyst] realizes that there is a business problem or opportunity {T02} in which Knowledge Discovery and Data Mining might be helpful. More commonly, the [business analyst] starts with a question and needs certain information relevant to the decision he must make.

He or she starts by trying to learn {T03} as much as possible about the business and the application domain. He will identify the [stakeholders] {T04}. He will try to understand what issues are important for the [stakeholders] {T05}. The five core issues are [30]: product (goods or services, tangible or intangible); place; price; time; quantity.

The [business analyst] will classify the knowledge discovery process as {T06}:

  • Demand driven - the process aims to fulfil the information requirements of the users

  • Data driven - the process aims to discover the best use for the specific existing data

  • Exploratory - the process is designed to find how KD and DM in general can offer value within that specific business

He will try to discover any relevant prior knowledge, namely the currently existing solutions for the problem, and identify the goal for the project {T05}.

If it is an exploratory process, the [business analyst] will identify several possible goals {T05} and review his stakeholders’ identification {T04} for each one (including the core issues that each one might be concerned with {T05}).

Since starting the project might have costs, the [business analyst] might have to ask the [business manager] for approval {T14} of the data mining project. The [business manager] might ask {T13} the [project manager] for a cost and resources estimation so that he can decide on the approval {T14}. The [project manager] will create the cost, time and resources estimates or a project plan {T13}, if necessary, and will hand these to the [business manager]. The [business manager] will then decide whether to go ahead {T14}, that is, he will decide on the feasibility of the KD project. If the decision is to go ahead, the [project manager] might have to obtain the resources (human or otherwise) that are necessary and that were not available at the beginning.

If it is a demand driven project, the [business analyst] will then begin eliciting specific requirements {T07}. If it is a data driven project, the [business analyst] will then proceed by asking the [data analyst] to perform the data analysis. A hybrid approach is also possible, in which both will happen in parallel. For the requirements elicitation {T07}, the [business analyst] will choose the elicitation techniques {T08}, which might be one or more. He will execute them and document the resulting requirements from each technique at what is judged to be an appropriate level of detail. These requirements will be mostly information demand requirements, that is, requirements that describe why and how the [stakeholders] need specific information. The [business analyst] will also elicit non-functional requirements {T07}, and for that he will be particularly concerned with the delivery mechanism (how will the results be physically made available to the [end user]? What tools will the [user] employ to view it?), the format (will the [user] view the results in reports, dashboards, or other formats?) and the degree of interaction needed (to what extent must the [user] be able to manipulate the results following delivery?).
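As a hedged illustration of what the documented output of this elicitation might look like, the snippet below records one information-demand requirement together with the non-functional concerns mentioned above (delivery mechanism, format, degree of interaction); the field names and example values are assumptions made for the example, not SysPRE terminology.

```python
# Illustrative record of one elicited requirement (T07) and its
# non-functional attributes; names and values are assumptions.
elicited_requirement = {
    "id": "R-01",
    "stakeholder": "Board of Directors",
    "elicitation_technique": "Structured interview",          # chosen in T08
    "information_demand": "Why members fail to renew and which ones are at risk",
    "non_functional": {
        "delivery_mechanism": "Company BI portal",            # how results reach the [end user]
        "format": "Dashboard plus a monthly summary report",  # reports, dashboards, ...
        "interaction": "Filter by region and membership type" # degree of manipulation needed
    },
}
```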

A detailed analysis of the requirements will be done by the [business analyst]. The [business analyst] and the multiple [stakeholders] will negotiate to:

  • Decide which requirements are accepted {T09} (which, in fact, is the same as deciding the system boundaries or scope)

  • Do a triage and prioritization of the requirements {T10}

  • Assess requirements risks {T11}

The [business analyst] will validate {T12} the resulting requirements, that is, check them for completeness and consistency.

The triage and prioritization {T10} should be done after the validation {T12}, as the validation {T12} process might result in adding, changing or removing some requirements.

The [business analyst] will also need data, so he will ask the [data analyst] for it. Again, note that in a demand driven project this request will normally happen after the requirements elicitation {T07}, but in a data driven project the data gathering described next will happen before the requirements elicitation {T07}. The [data analyst] will look for the raw data {T15} to use for the project. The data might come from databases, internal or external, or from other sources. It might also still need to be collected for this specific purpose. The [data analyst] will need to select the data {T16} and decide if and when the data might need to be combined {T17}. If the [data analyst] considers the data to be too large for an initial analysis, he might consider using a sample {T17} of the data.
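The data gathering activities just described (T15 to T17) could look roughly like the following pandas sketch; the connection string, table names, keys and the sampling fraction are all assumptions made for illustration.

```python
# Hypothetical sketch of the [data analyst]'s data gathering (T15-T17).
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:password@host/crm")  # assumed source

# T15: look for the raw data (an internal database plus an exported file)
members = pd.read_sql("SELECT member_id, join_date, renewed FROM members", engine)
logins = pd.read_csv("login_events.csv")

# T16: select only the data relevant to the project
logins = logins[["member_id", "login_date"]]

# T17: combine the sources and, if the result is too large, work on a sample
combined = members.merge(logins, on="member_id", how="left")
sample = combined.sample(frac=0.1, random_state=42)   # 10% sample for the initial analysis
```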

The [data analyst] will also try to understand the data. To begin with, if the data was already available at the beginning of the project, the [data analyst] should find the business motivation for collecting and storing the data in the first place, as it might provide some insights. From the data understanding, he might suggest a possible hypothesis or objective {T18} to the [business analyst]. He might also identify constraints {T19} that arise from the data, in which case he will inform the [business analyst] of the detected constraints.

Since the raw data might be incomplete, noisy or inconsistent, the [data engineer] will perform data cleaning, pre-processing and transformation {T17}. This might include filling missing values, normalization, discretization, reduction, projection or other techniques. The data cleaning, pre-processing and transformation is guided by the data itself and also by what data mining techniques are going to be used on the data. The [data miner] selects the tool {T20} to be used (for the same project, more than one tool might be used). For selecting the tool he will start by identifying possible tools {T28} and decide on how he will compare them {T20}, specifying the evaluation criteria that are important and how the evaluation will be performed (for instance, he might decide to run a specific algorithm using all the tools and a sample of the data). He will then proceed with the evaluation and choose the tool {T20} (or tools). The [data miner] also selects the data mining technique {T21} (e.g., summarization, classification, regression, clustering) and the specific algorithms {T22}. For the same project, more than one tool might be used, as well as more than one data mining technique and one algorithm.
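A rough sketch of how these preparation and selection activities could be carried out with scikit-learn is given below; the specific imputation, scaling and discretization steps, as well as the two candidate algorithms being compared, are assumptions made for the example rather than choices prescribed by the process.

```python
# Illustration of data preparation (T17) and technique/algorithm choice (T21, T22).
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

# T17: cleaning and transformation - fill missing values, normalize, discretize
preparation = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal")),
])

# T21/T22: the technique here is classification; two candidate algorithms are
# compared on the prepared data using cross-validation (a possible choice criterion)
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=3),
    "adaboost": AdaBoostClassifier(n_estimators=50),
}

def compare(X, y):
    """Return the mean cross-validated accuracy of each candidate algorithm."""
    X_prepared = preparation.fit_transform(X)
    return {name: cross_val_score(clf, X_prepared, y, cv=5).mean()
            for name, clf in candidates.items()}
```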

Some authors believe the choice of data mining technique can be simplified to four decisions {T21}.

The [data miner] will feed the prepared data to the tool and will be responsible for the generation of the model {T24}. This means he will have, for instance, to decide on the appropriate parameters {T23}.

After the actual data mining has occurred and the KD results are available, both the [domain expert] and the [strategic manager] will analyse the results {T25}.

The [domain expert] analyses {T25} the data mining result, in the sense that he evaluates how the results fit his domain knowledge {T25}, possibly resulting in the need for refining what was done previously through:

  • Creating new questions or hypotheses {T18} for the [business analyst]

  • Pointing out the need for new or more data {T15} to the [data analyst]

  • Indicating the need to use a different function {T21} or algorithm {T22} or simply to adjust parameters {T23} to the [data miner]

The [strategic manager] interprets and evaluates {T25} the data mining result, in the sense that he evaluates how these results are relevant to or have an impact {T25} on the current or future business situation.

The [knowledge engineer] will use the analysis results from the [domain expert] and the [strategic manager] and make sure the discovered knowledge is used. He will specify {T26} how the knowledge discovery result should be deployed, for instance he can decide that an annual report should be produced for the senior management. The knowledge discovery result will then be deployed to the [end users] as planned.

2.4 Transaction Result Table

From the Performa-Informa-Forma and Coordination-Actors-Production analyses, the Transaction Result Table (TRT) presented in Table 1 was derived.

This table shows the transactions (that correspond to the main tasks of the process) and the result types corresponding to each transaction. In the result types, we can see (between square brackets) the main concept that is being created or whose state is being changed.

The last transactions (T28 to T32) refer to the specification of an elicitation technique for requirements or, regarding the data mining stage, the specification of a tool, data mining technique, algorithm or data mining parameter that was previously unknown to the system. This is necessary because the knowledge discovery and data mining area is very dynamic, and it is very likely that new tools, data mining techniques, algorithms or data mining parameters will need to be considered.

T27 is the transaction that manages all this. The elicitation techniques, tools, data mining techniques, algorithms and data mining parameters are referred to as artefacts in the context of T27 (KD area artefact management) (Table 1).

Table 1. Transaction Result Table
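As a rough, hypothetical sketch (not part of the Transaction Result Table itself), the artefact management described for T27 and T28 to T32 could be supported by a simple registry that stores the KD area artefacts known to the system and lets new ones be registered when they are first needed; the class and method names below are assumptions made for illustration.

```python
# Hypothetical sketch of KD area artefact management (T27): one registry per
# artefact kind, with registration (T28-T32) and choice (T08, T20-T23) hooks.
from collections import defaultdict

class ArtefactRegistry:
    def __init__(self):
        # one list of known artefacts per kind (elicitation technique, tool, ...)
        self._artefacts = defaultdict(list)

    def register(self, kind: str, name: str) -> None:
        """Specify an artefact previously unknown to the system (T28-T32)."""
        if name not in self._artefacts[kind]:
            self._artefacts[kind].append(name)

    def choose(self, kind: str, criteria) -> str:
        """Choose an artefact of the given kind (T08, T20-T23) using a
        documented choice criterion expressed as a scoring function."""
        return max(self._artefacts[kind], key=criteria)

registry = ArtefactRegistry()
registry.register("tool", "Tableau 8.1")
registry.register("algorithm", "AdaBoost")
```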

2.5 Object Fact Diagram

Due to space constraints, DEMO’s Actor Transaction Diagram and Process Step Diagram are omitted from this paper.

We then specified DEMO’s Object Fact Diagram (OFD).

In this diagram, one can see the classes that correspond to the main concepts identified in the DEMO transactions of the Transaction Result Table, as well as other related classes, the fact types associated with each class, and the cardinalities and dependence laws.

In the diagram, the comments marked in red show an instantiation of each class, derived from a concrete case of a real organization, so that the diagram is easier to interpret (Fig. 1).

Fig. 1. Object Fact Diagram (Part 1)

The main class of this OFD is the KNOWLEDGE DISCOVERY PROCESS (KDP), related to the main transaction T01. Each instance of this class specifies a particular KDP. Most of the classes that follow (in all-caps text) are self-explanatory, so they will be presented as the example is described.

Instances of the class PROBLEM/OPPORTUNITY specify a problem or an opportunity that triggered the KDP. Let us say that a company wants to increase its sales to existing customers. The company we are considering sells memberships, so basically it wants to increase the percentage of customers that renew their memberships. This is the problem/opportunity.

One STAKEHOLDER is the Board of Directors. This particular stakeholder has a GOAL/CORE ISSUE: to increase the annual revenue. Using one or more ELICITATION TECHNIQUES, a REQUIREMENT to satisfy the above GOAL/CORE ISSUE was elicited: predict how many customers will renew. One possible ELICITATION TECHNIQUE is a structured interview, but many others were possible (Fig. 2).

Fig. 2. Object Fact Diagram (Part 2)

Normally several STAKEHOLDERS will be identified (T04), each with one or more GOAL/CORE ISSUE from which several REQUIREMENTS will stem and be elicited (T05).

From the accepted REQUIREMENTS, we then proceed to create a HYPOTHESIS that can be tested in a KDP. In this case, one of the tested HYPOTHESES was whether the number of logins can be used to predict whether a customer will renew. The link between HYPOTHESIS and REQUIREMENTS is important for traceability.

In the end, the RESULT of the KDP will either confirm this hypothesis or not. For the KDP, there needs to be an estimation of COST AND RESOURCES, so that a Go-no-go decision (T14) can take place.

If the KDP proceeds, instances of classes corresponding to the DATA SOURCE (from which DATA will be selected and prepared), the data mining TOOL (in this case, Tableau 8.1), the type of DATA MINING TECHNIQUE (in this case, classification) and the ALGORITHM (in this case, AdaBoost) will be used to obtain a particular RESULT. The ALGORITHM might require a DATA MINING PARAMETER (or more) to be set. In this case we could change the value for a_t weight, but did not.

The KD AREA ARTEFACT is a generalization that includes ELICITATION TECHNIQUE, TOOL, DATA MINING TECHNIQUE, ALGORITHM and DATA MINING PARAMETER. The management of these artefacts (T27) involves specifying an artefact that was previously unknown to the system whenever needed (T28, T29, T30, T31, T32). They can then be chosen for use (T08, T20, T21, T22, T23) using ELICITATION TECHNIQUE CHOICE CRITERIA, TOOL CHOICE CRITERIA, DATA MINING TECHNIQUE CHOICE CRITERIA, ALGORITHM CHOICE CRITERIA or DATA MINING PARAMETER CHOICE CRITERIA, respectively. It is important that the choice criteria are all documented, which is why all these classes appear (Fig. 3).
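A minimal sketch of how instances of these OFD classes could be recorded for the membership-renewal case is shown below; the dataclass layout and field names are assumptions made for illustration and are not part of the DEMO specification itself, but the recorded values come from the case described in the text.

```python
# Hypothetical encoding of OFD instances for the membership-renewal case.
from dataclasses import dataclass, field

@dataclass
class Requirement:
    description: str
    stakeholder: str            # STAKEHOLDER that originated it (T04)
    goal: str                   # GOAL/CORE ISSUE it satisfies (T05)
    elicitation_technique: str  # ELICITATION TECHNIQUE used (T08)

@dataclass
class Hypothesis:
    statement: str
    requirements: list[Requirement]   # traceability link to REQUIREMENTS

@dataclass
class KnowledgeDiscoveryProcess:
    problem_opportunity: str
    hypotheses: list[Hypothesis] = field(default_factory=list)
    tool: str = ""
    data_mining_technique: str = ""
    algorithm: str = ""

requirement = Requirement(
    description="Predict how many customers will renew",
    stakeholder="Board of Directors",
    goal="Increase the annual revenue",
    elicitation_technique="Structured interview",
)
kdp = KnowledgeDiscoveryProcess(
    problem_opportunity="Increase the percentage of customers that renew",
    hypotheses=[Hypothesis("Number of logins predicts renewal", [requirement])],
    tool="Tableau 8.1",
    data_mining_technique="Classification",
    algorithm="AdaBoost",
)
```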

Fig. 3. Object Fact Diagram (Part 3)

The DATA might give rise to some kind of DATA CONSTRAINT. In this case, it was very noticeable that customer age was not available. The identified DATA CONSTRAINTS affected the KDP.

As mentioned, the execution of a particular algorithm with particular parameters, applied to particular data in the context of a KDP, will produce a particular RESULT - for example, a classification model or a set of association rules. For the case study at hand, we found that members who log in more than once per month are more likely to renew.
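The following is a hedged sketch, under assumed data, of how a pattern of this kind could be obtained with an AdaBoost classifier trained on monthly login counts; the file name, column names and probe values are illustrative and not taken from the case study data.

```python
# Illustrative AdaBoost classification on an assumed prepared dataset.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

members = pd.read_csv("members_prepared.csv")        # assumed prepared data
X = members[["logins_per_month"]]
y = members["renewed"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = AdaBoostClassifier(n_estimators=100)          # parameters left at defaults,
model.fit(X_train, y_train)                           # as the text notes for the a_t weight

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Probing a few login counts makes the threshold behaviour visible, e.g. whether
# more than one login per month maps to a higher predicted renewal probability.
probe = pd.DataFrame({"logins_per_month": [0, 1, 2, 5]})
print(model.predict_proba(probe))
```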

The RESULT will be the target of an analysis (T25). From such analysis, the conclusion might be that new hypotheses need to be formulated and/or new data, tools, data mining techniques or specific algorithms applied so that refined or alternative results are found. If none of this is necessary, the DEPLOYMENT of a RESULT can be specified. For example, in this case it was decided that an annual report with the obtained result was to be produced (Fig. 4).

Fig. 4. Object Fact Diagram (Part 4)

3 Discussion and Conclusion

Other efforts have been made regarding knowledge discovery ontologies, such as OntoDM [43] or the Knowledge Discovery Ontology [44], but they focus in great detail on the knowledge discovery process itself and do not offer any particular insight regarding its surroundings, such as business-side information.

This DEMO-based ontology gives several interesting insights. Thanks to the specified classes, for a particular problem/opportunity we can keep a record of detailed and important information about the respective KDP: a consistent and integrated record of important business-side information, such as the stakeholders, requirements, hypotheses and costs, and also of the technical side, such as the tools, sources and algorithms used. The class RESULT is pivotal in the sense that each instance will include not only the patterns obtained using the data mining technique, but also an analysis of the results, which may lead to the formulation of new hypotheses and requirements on the business side.

Having SysPRE, an ontology that represents both the KD work in general and the RE for KD work in particular, can help technical roles not lose track of the big picture while working on the task at hand. Also, since it is understandable not only by the technical roles involved but also by other stakeholders, SysPRE can foster a more effective dialogue between them.

This ontology can encourage knowledge reuse of the KD process, or of the RE for KD process itself, in a consistent and integrated fashion, because it enables keeping a record of the iterations and refinements of a particular process in a highly structured way. In this way, the hope is not only to make enterprises aware of their own KD process and of the RE process within their KD projects, but also to actually improve such processes, namely in terms of success rate. In other words, this can help lessons learned from the past be reused to improve the present.

The main contribution of this paper is to provide a systematization that can be applied to KD projects in general and to the requirements engineering process within such projects in particular.

Having a short, plain-text description of a generic KD process with emphasis on RE, proposed after a thorough literature review, can be useful for novices in the area, both in the research and in the industry communities.

Having the SysPRE formal ontology can be helpful within the organization using it because it can:

  • Enable keeping a record of iterations and refinements of a particular process in a highly structured way.

  • Make enterprises (and specifically decision makers within the enterprise) become aware of their own KD process and RE process in the KD projects.

  • Assist enterprises that want to improve their own KD process and RE process in the KD projects.

  • Help each technical role involved keep an eye on the big picture while working on whatever task they are working on at that specific moment.

Having the SysPRE formal ontology can also be helpful for communication between the organization and other stakeholders because, despite being formal, it is understandable and sums up a lot of information in a graphical way.