1 Introduction

Knowledge discovery in databases (KDD) is a dynamic field of research that promises high returns in many professional and scientific domains. Corporate, government and scientific communities are being overwhelmed by the influx of data that is continually stored in online databases. Analysing this data and extracting meaningful patterns from it in a timely manner is impractical by manual means. The KDD process involves searching for useful knowledge in data gathered from various sources. An enormous and growing amount of data from all kinds of human activity is being generated and stored. This data is recorded in computer databases, where it can be managed easily by computer technology, and it is being collected across a wide variety of areas at a dramatic pace. New tools and computational theories are urgently required to help people extract valuable information from the rapidly growing volumes of digital data [1]. These tools and theories together form the core of the field of KDD.

At an abstract level, the field of KDD is mainly concerned with developing strategies and methods for making sense of data. The central problem addressed by knowledge discovery is mapping low-level data into another form that is more abstract, compact and useful [2]. At its core, the process applies specific data mining techniques to recognize patterns and extract useful knowledge.

Techniques used for these tasks include cluster analysis, regression analysis, multidimensional analysis, numerical taxonomy and several other statistical methods. Many practical problems have been solved using such techniques. However, they focus mainly on extracting quantitative and statistical data, and as such they have some limitations. When knowledge is discovered from a single data source, only one type of information, and a limited amount of it, is obtained. There is therefore a need for efficient methods to collect vital information from multiple data sources [2]. This paper surveys various approaches that are useful in this area and presents a comparative study of them.

2 Related Works

Knowledge discovery is the process of uncovering hidden structures and patterns in an enormous volume of data. It includes transforming the discovered patterns into comprehensible, easily understood information. The domain comprises processes carried out at several stages, from which its basic rules are derived. It involves analysing and interpreting the evaluated patterns to decide what counts as knowledge [3]. It also includes the choice of encoding schemes, preprocessing, sampling and projection of the data before the data mining step.

2.1 Steps Involved in the Knowledge Discovery in Databases Process

Seven major aspects should be considered before databases are selected for analysis. The prerequisite knowledge needed to understand a database is as follows [4]:

  • Cleaning of Data: The process of removing irrelevant and noisy data from the gathered data.

  • Integration of Data: The process of combining diverse data collected from numerous sources into one common source.

  • Selection of Data: The process of deciding which data is relevant to the analysis and retrieving it from the data collection.

  • Data Transformation: The process of converting data into the form required for mining.

  • Data Mining: The process of applying techniques to extract potentially useful patterns.

  • Pattern Evaluation: The process of identifying the previously hidden patterns that represent knowledge.

  • Knowledge Representation: The process of presenting data mining results using visualization tools.

The complete knowledge discovery process is shown in Fig. 1.

Fig. 1 Knowledge discovery process
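The seven steps listed above can be read as a chain of small functions, each consuming the output of the previous one. The following Python sketch is illustrative only: the toy purchase records, the field names and every function name are assumptions introduced here, and the mining step is a simple frequency count standing in for a real data mining technique.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy records from two sources; None marks a missing value.
source_a = [
    {"customer": "c1", "item": "bread", "amount": 2.5},
    {"customer": "c2", "item": "milk", "amount": None},  # incomplete record
    {"customer": "c3", "item": "bread", "amount": 3.0},
]
source_b = [
    {"customer": "c4", "item": "Milk ", "amount": 1.2},  # inconsistent label
    {"customer": "c5", "item": "bread", "amount": 2.8},
]


def clean(records):
    """Cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Integration: combine several sources into one collection."""
    return [r for src in sources for r in src]


def select(records, fields):
    """Selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Transformation: normalise item labels to a common form."""
    return [{**r, "item": r["item"].strip().lower()} for r in records]


def mine(records):
    """Data mining (toy stand-in): purchase count and mean amount per item."""
    counts = Counter(r["item"] for r in records)
    patterns = {}
    for item, n in counts.items():
        amounts = [r["amount"] for r in records if r["item"] == item]
        patterns[item] = {"count": n, "mean_amount": mean(amounts)}
    return patterns


def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep only patterns with enough support."""
    return {k: v for k, v in patterns.items() if v["count"] >= min_count}


def present(patterns):
    """Knowledge representation: render patterns as readable lines."""
    return [
        f"{item}: bought {p['count']} times, mean amount {p['mean_amount']:.2f}"
        for item, p in sorted(patterns.items())
    ]


data = integrate(clean(source_a), clean(source_b))
data = transform(select(data, ["item", "amount"]))
report = present(evaluate(mine(data)))
print("\n".join(report))
```

Keeping each step as a separate function mirrors the staged nature of the process: any stage can be replaced, for example by a real clustering or regression routine in `mine`, without touching the others.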

2.2 Issues and Challenges in Knowledge Discovery

Knowledge discovery is developing into a trusted discipline; however, many challenges remain to be resolved. The following issues and challenges have been identified in the knowledge discovery process [5].

  • Noisy and Incomplete Data: “Data Mining is the way of acquiring information from massive volumes of data”. The collected data is generally heterogeneous and noisy, and much of it is irregular and unreliable. Such issues may arise from human error or from the instruments used for data collection.

  • Distributed Data: The data passes through many stages and may reside on the internet or on individual systems. Unifying all of it is difficult for organizational and technical reasons.

  • Complex Data: The data obtained is highly heterogeneous and may include text, spatial data, audio, video, images, etc. It is hard to handle such diverse kinds of data and to focus on the vital information required. Sometimes new systems and tools must be created to separate the crucial facts and information from the rest of the data.

  • Performance: The performance of a data mining framework depends fundamentally on the efficiency of the techniques and methods used. If the algorithms and techniques are inadequate, the performance of data mining suffers.

  • Scalability and Efficiency of the Algorithms: Efficient and scalable algorithms must be used to extract valuable information from vast amounts of data.
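As a small illustration of the scalability and incomplete-data challenges above, the following Python sketch computes the mean and variance of a data stream in a single pass using Welford's algorithm, so that memory use stays constant however many values arrive. It is not taken from any of the surveyed works; the class and function names are assumptions, and missing readings are assumed to be represented as `None`.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def summarise(stream):
    """Consume an iterable once, skipping incomplete (None) readings."""
    stats = RunningStats()
    for value in stream:
        if value is None:
            continue
        stats.update(value)
    return stats


# Hypothetical sensor readings with one missing value.
stats = summarise(iter([2.0, 4.0, 4.0, None, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(stats.n, stats.mean, stats.variance)
```

Because `summarise` never stores the stream, the same code works for a list in memory or for a generator reading from a file or network source.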

3 Review of Literature

Knowledge discovery covers a wide area of research. The work done in the area is as follows.

Silwattananusarn and Tuamsuk (2012) discussed the methods and techniques that will be needed to serve the data mining field as it grows more complex [6]. According to Tomar et al. (2013), data mining is a highly active and popular area of research that is attracting attention in medical applications [7]. Fan et al. (2014) described knowledge discovery as the capability of obtaining useful information from a wide variety of datasets characterized by their variability, volume and velocity [8]. Saurkar et al. (2014) described data mining as an “interdisciplinary field which includes integrated databases, machine learning technique, artificial intelligence, statistical approaches etc.”; the technique helps extract hidden information and knowledge by digging deep into the data [4]. Bifet et al. (2014) noted that real-time analysis of streaming data is becoming the fastest and most efficient way of obtaining useful knowledge, allowing firms to respond rapidly when a problem appears and so maintain performance [9]. Purcell (2014) stated that knowledge discovery databases consist of unstructured, semi-structured and structured datasets that cannot be handled using traditional methods and systems, and discussed object-based storage as a data storage technique [10].

Reddi and Indira (2014) explained that big data is a combination of heterogeneous and homogeneous, unstructured and semi-structured data, and suggested a model for transferring and handling vast quantities of data over a network [11]. Ibrahim et al. (2014) observed that partitioning skew causes a huge amount of data transfer and an uneven distribution of reduce input among data nodes, and developed a novel algorithm named LEEN to address it [12]. Gamache et al. (2015) discussed combining text mining techniques to convert unstructured textual data into structured numerical data to which statistical and mathematical algorithms can be applied [13]. Baker et al. (2015) dealt primarily with developing techniques for analysing data and discovering novel and useful information [14]. Soni (2015) discussed predicting future sales and trends from patterns in customer behaviour, which helps increase profits by assisting policymakers in decision-making [15].
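The idea of converting unstructured text into structured numbers, as discussed by Gamache et al., can be illustrated with a term-count (bag-of-words) matrix. The Python sketch below is a generic illustration and not the method used in [13]; the tokenizer, the function names and the two sample sentences are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_count_matrix(documents):
    """Return (vocabulary, matrix), where matrix[i][j] is the number of
    times vocabulary[j] occurs in documents[i]."""
    counts = [Counter(tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical sample documents.
docs = [
    "Sales rose in March.",
    "March sales fell; sales recovered in April.",
]
vocabulary, matrix = term_count_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each document becomes a fixed-length row of integers, so standard statistical and mathematical routines can operate on the rows directly.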

Kaplan and Vakili (2015) proposed a model for generating a text-based measure of knowledge recombination, which they then included as an independent variable in their econometric model [16]. Kumar and Chatterjee (2016) clarified the relationship between data mining techniques and knowledge discovery, and discussed data mining techniques and specialized methods for particular types of data and fields [2]. Angus (2018) used document similarity measures to explore the link between search distance and firm performance [17]. Hariri et al. (2019) discussed the capability to create and manage information, which has been a dominant factor in the growing era of technology [18].
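A document similarity measure of the kind mentioned above can be sketched with cosine similarity over term counts. The text does not state which measure was used in [17], so cosine similarity is assumed here purely as a common choice; the function names and sample strings are likewise hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lower-cased alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts.

    Returns 0.0 when either text contains no tokens.
    """
    a, b = term_counts(text_a), term_counts(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical examples: related versus unrelated texts.
related = cosine_similarity("data mining methods", "data mining")
unrelated = cosine_similarity("data mining", "quantum optics")
print(related, unrelated)
```

The score ranges from 0 for texts with no shared terms to 1 for texts with identical term proportions, which makes it convenient for ranking documents by closeness.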

Sankari and Shraddha (2019) introduced the application of data mining techniques to information generated in educational settings, arguing that educational data mining and the analysis of data about learners and their contexts are key to a successful inference model of educational data [19]. Kumar and Basha (2020) discussed methods for accessing the high volumes of text-based data that must be examined to retrieve information [20]. Roozbahani and Rajabzadeh (2020) focused on the past and current status of research on big data in medical and science-related areas [3]. Abdualgalil and Abraham (2020) focused on machine learning for knowledge discovery in big data; according to them, machine learning needs to be more explorable so that interacting with various kinds of data becomes easier for a learner [1]. Lauw and Wong (2020) presented original research results, new ideas and practical experiences from all knowledge discovery-related areas, such as data mining, machine learning, artificial intelligence, decision-making systems and other emerging applications [21].

4 Comparative Study

The approaches discussed in the preceding section differ on various parameters, such as technique used, database type, accuracy, sensitivity, specificity and fidelity. A comparative analysis of the approaches is presented in Table 1.

Table 1 A comparative analysis
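Three of the comparison parameters, accuracy, sensitivity and specificity, have conventional definitions in terms of confusion-matrix counts for a binary classifier. The Python sketch below uses those standard definitions, on the assumption that the surveyed works follow them; fidelity is omitted because it has no single standard definition. The function name and the example counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives, true negatives and
    false negatives. A metric is None when its denominator is zero.
    """
    total = tp + fp + tn + fn
    positives = tp + fn  # all actual positive cases
    negatives = tn + fp  # all actual negative cases
    return {
        "accuracy": (tp + tn) / total if total else None,
        "sensitivity": tp / positives if positives else None,
        "specificity": tn / negatives if negatives else None,
    }


# Hypothetical counts, not taken from any surveyed study.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity measures how many actual positives are found and specificity how many actual negatives are correctly rejected, so the two together give a fuller picture than accuracy alone when classes are imbalanced.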

5 Conclusion

The knowledge discovery process aims primarily at finding the required information in large datasets. Implementing KDD methods and techniques helps users extract meaningful information from large amounts of accumulated data. Such techniques are widely used in industries such as telecommunications, retail and biomedicine. They have proved helpful in predicting future trends, they allow business activities to be proactive and dynamic, and they present valuable knowledge in a form that people can readily understand. This paper provides an outline of the KDD process. It presents a study of knowledge discovery covering its steps, principles and challenging issues. A primary goal of the paper is to elucidate the relation between knowledge discovery and data mining. It also defines the KDD process and important data mining techniques.