1 Introduction

Data mining, an essential step in knowledge discovery in databases, is used to discover useful, previously unknown patterns in large repositories of data [1,2,3,4]. It comprises various functionalities, techniques and algorithms for discovering and extracting interesting patterns from such repositories [1, 2, 4]. Because of its importance in decision making, data mining has received wide attention over the last two decades and has become an essential tool in a variety of organizational operations [5].

According to [1], “data mining is a step in the knowledge discovery in databases process consisting of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data…”.

Han et al. [6] define data mining as follows: “data mining is a process of discovering or extracting interesting patterns, associations, changes, anomalies and significant structures from large amounts of data which is stored in multiple data sources such as file systems, databases, data warehouses or other information repositories.”

Data mining incorporates many techniques from other domains [6,7,8], such as statistics, database/data warehouse systems, machine learning, algorithms, pattern recognition, visualization, information retrieval and high-performance computing. The first three (statistics, database/data warehouse systems and machine learning) are the primary contributors to data mining [7].

2 Trends in data mining research

A survey of the literature shows that data mining research can be broadly grouped into the following categories [9,10,11,12].

2.1 Data mining functions

Data mining functions, or tasks, specify the types of patterns or knowledge to be discovered during the data mining process. The major data mining functions include summarization, characterization and discrimination, association, clustering, classification, outlier analysis, and regression and trend analysis [1, 2, 6, 13].

2.2 Data mining techniques

Data mining tasks are performed using a number of data mining techniques or approaches. Researchers have investigated a wide range of such techniques, for example machine learning, statistics, neural networks, database and data warehouse systems, genetic algorithms, fuzzy sets and visualization [6, 9, 13].

2.3 Data mining algorithms

Many researchers have proposed algorithms, also known as methods, that carry out data mining functions based on data mining techniques. Examples include the Apriori algorithm, Naïve Bayesian classification, k-Nearest Neighbour, k-Means, CLIQUE and STING [6, 14].

2.4 Data mining domains

Data mining can be applied in a range of domains, e.g. time-series data mining, web mining, temporal data mining, spatial data mining, tempo-spatial data mining, educational data mining, business, medicine, and science and engineering. Each domain can have one or more applications of data mining [6, 15].

2.5 Data mining applications

Applications are the areas in which one or more data mining functions can be used, for example financial data analysis, market-basket analysis, intrusion detection, fraud detection, recommender systems and cancer detection [6, 9, 15].

Of the above-mentioned categories of data mining research, only data mining tasks, techniques and real-life applications are surveyed in this paper.

3 Data mining tasks (or functions)

A large database and/or data warehouse may contain a variety of unknown patterns [16]. Distinct types of data mining functions, methods and techniques can be used to extract these patterns [11, 12, 17]. Based on the type of pattern sought, data mining functions can be categorized into summarization, characterization and discrimination, classification, regression and trend analysis, clustering, outlier analysis, association, etc. [1, 2, 9, 13, 18]. Classified work of the aforesaid literature related to data mining tasks is listed in Table 1.

Table 1 Classified work of reported literature related to data mining tasks

3.1 Summarization

Summarization reduces detailed data to a smaller set and presents a summary of it based on a concept hierarchy. Usually, summarization is performed using aggregation, which can be extended to different levels of abstraction and viewed from diverse angles. Various kinds of patterns can be extracted by combining different levels of abstraction and different dimensions [13]. Data summarization is usually accomplished using the attribute-oriented induction approach [69] and the data cube approach [36, 70].

The data cube approach (also referred to as ‘multidimensional databases’ or ‘materialized views’) materializes frequently queried, expensive computations that involve group functions and stores the results as materialized views in a multidimensional database (MDDB) for decision support and knowledge discovery [13]. The attribute-oriented induction approach collects the related data in a database with the help of the SQL-like DMQL and then applies a set of data generalization techniques [69] to it [13].
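The idea of materializing group-by aggregates at several levels of abstraction can be illustrated with a minimal sketch. The fact table, the dimension names and the helper functions `summarize` and `data_cube` below are hypothetical and chosen only for illustration; a real MDDB would store and index these results rather than recompute them in memory.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, product, quarter, sales amount).
SALES = [
    ("North", "Laptop", "Q1", 1200),
    ("North", "Phone", "Q1", 800),
    ("South", "Laptop", "Q1", 1000),
    ("South", "Laptop", "Q2", 1500),
    ("North", "Phone", "Q2", 700),
]
DIMENSIONS = ("region", "product", "quarter")


def summarize(rows, group_by):
    """Sum the sales measure for each combination of the chosen dimensions."""
    positions = [DIMENSIONS.index(name) for name in group_by]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[p] for p in positions)
        totals[key] += row[-1]
    return dict(totals)


def data_cube(rows):
    """Materialize every group-by (cuboid) over the dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            cube[dims] = summarize(rows, dims)
    return cube


cube = data_cube(SALES)
print(cube[("region",)])  # roll-up to the region level
print(cube[()])           # grand total (highest level of abstraction)
```

With three dimensions the cube holds 2^3 = 8 cuboids, which is why materializing only the frequently queried ones is attractive for large data.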

3.2 Characterization and discrimination

Characterization is essentially a summarization of data based on a concept hierarchy, and it generates characterization rules. Discrimination, on the other hand, identifies the differences among data sets. Its output takes the form of discriminant rules [6, 23, 31].

3.3 Classification

Classification is the process of assigning new observations to predetermined classes, i.e. supervised learning. A classification algorithm is used to predict the classes of the data [6]. Researchers have proposed a large collection of classification algorithms (or classifiers) [6, 47]. Some popular classification algorithms are summarized in Table 2. Classifiers based on genetic algorithms, the rough set approach, fuzzy sets, semi-supervised learning and active learning have also been proposed [6].

Table 2 Some Popular Classification Algorithms [6]

In addition to the popular classifiers mentioned in Table 2, many researchers have presented and/or discussed new classifiers, such as a classifier based on predictor features using supervised learning [43], a property-based classification [61] that adapts symbolic values, a unified classification model framework [50] for classifying observations with a skewed distribution, and adaptive very fast decision rules (AVFDR) [42].
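As a minimal sketch of supervised classification, the following example implements a k-Nearest Neighbour classifier, one of the algorithms named in Sect. 2.3. The training data, the class labels and the function name `knn_predict` are hypothetical; Euclidean distance and majority voting are assumed.

```python
import math
from collections import Counter

# Hypothetical labelled training set: (feature vector, class label).
TRAIN = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.2), "high"),
    ((5.1, 4.9), "high"),
    ((4.8, 5.0), "high"),
]


def knn_predict(train, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest neighbours."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


print(knn_predict(TRAIN, (1.1, 1.0)))
print(knn_predict(TRAIN, (5.0, 5.0)))
```

The classes are fixed in advance by the labelled examples, which is what distinguishes this setting from clustering.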

3.4 Clustering (or cluster analysis)

Clustering partitions or segments data objects (or observations) into subsets called groups or clusters. Objects that are close to each other are placed in the same group. Like classification, clustering groups similar data objects, but unlike classification, the class labels are unknown (i.e. unsupervised learning) [6]. Cluster analysis is one of the most popular techniques and is used not only in data mining but also in other domains such as statistics, image segmentation, pattern recognition, object recognition, information retrieval and bioinformatics [38].

Many researchers have suggested clustering algorithms over the last two decades [6, 29, 62]. Some popular clustering algorithms are presented in Table 3. Clustering algorithms based on probabilistic models, fuzzy sets, expectation–maximization, correlation using PCA, and graphs have also been proposed [6].

Table 3 Some popular clustering algorithms [6]

In addition to the popular clustering algorithms presented in Table 3, many researchers have presented and/or discussed new clustering algorithms, such as a parameter-free method using minimum description length [49], a parallelized hierarchical clustering approach [66], a gene expression data clustering approach based on the z-score measure [30], a fully automatic clustering algorithm for high-dimensional categorical data [24], the nature-inspired, swarm-based Intelligent Water Drops–K-Means (IWD-KM) algorithm [56], a Voronoi diagram based clustering algorithm [71] for artificial as well as biological data, the bisect K-means clustering algorithm [19], domain knowledge based density-based clustering [39], exemplar-based low-rank sparse matrix decomposition (EMD) [64], an algorithm for clustering large-scale data sets that combines matrix decomposition with low-rank matrix approximation, and a three-phased cluster ensemble method based on discriminant analysis [23]. Campello et al. [25] presented a framework for density-based clustering. Khandare and Alvi [40] proposed an improved clustering algorithm that uses a new method of cluster initialization. Gupta and Chandra [72] proposed an efficient approach that selects well-separated data points as initial cluster centroids to improve the performance of the k-means algorithm. New partitioning-based cluster initialization approaches for the k-means algorithm, called P-k-means and M-P-k-means, are proposed in Gupta and Chandra [33, 73], respectively. A hypercube-based cluster initialization method, called HYBCIM, is proposed in Gupta and Chandra [74]. The HYBCIM, P-k-means and M-P-k-means algorithms are reported to give better results than the traditional k-means algorithm.
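Several of the works above concern how the initial centroids of k-means are chosen. The following minimal sketch shows the traditional k-means iteration with caller-supplied initial centroids, so that the effect of initialization can be explored. It is not an implementation of P-k-means, M-P-k-means or HYBCIM; the sample points and the function name `k_means` are hypothetical.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Traditional k-means: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(POINTS, [(1, 1), (8, 8)])
print(clusters)
```

Passing poorly separated initial centroids, for example two points from the same group, may change the number of iterations or the final partition, which is the motivation for the initialization methods cited above.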

Clustering XML data is one of the straightforward problems in many recent data mining applications such as Web mining, XML query processing and bioinformatics. However, the conventional data clustering methods are not appropriate for XML data clustering [20]. Similarly, the traditional clustering techniques are not suitable for clustering web search results, because this task has specific requirements [26]. Clustering data streams is also difficult, as it requires the ability to cluster the streaming objects continuously within given memory and time restrictions [57].

3.5 Outlier analysis

Data objects that differ from the general behaviour of the data are called outliers. Most data mining methods discard outliers as noise or exceptions. Sometimes, however, outliers carry more information than the other data objects. Outlier analysis is therefore important for application areas such as intrusion detection, fraud detection and anomaly detection [35]. Many data mining techniques use clustering to detect outliers as noise. Outlier detection methods can be classified as classification-based methods, statistical methods, clustering-based methods, supervised, semi-supervised and unsupervised methods, deviation-based methods and proximity-based methods [6].

Angiulli and Fassetti [21] stated that background knowledge (or domain knowledge) can be used to detect outliers easily. Their solution is unsupervised, although it can be related to supervised learning. The gradient outlier factor is investigated in Angiulli and Fassetti [21] for the generalization and unification of statistical outliers. Campello et al. [25] presented a framework for density-based outlier detection. Two new algorithms, inc-iVAT and dec-iVAT, based on visual assessment of tendency for anomaly detection in data streams are introduced in [44].
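As a minimal sketch of the statistical family of outlier detection methods mentioned above, the following example flags values whose z-score exceeds a threshold. The sample data, the threshold of 2.0 and the function name `zscore_outliers` are assumptions made for illustration; the approach presumes roughly bell-shaped data and is not the method of any of the cited works.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]


READINGS = [10, 11, 9, 10, 12, 10, 11, 50]
print(zscore_outliers(READINGS))
```

Because the outlier itself inflates the mean and the standard deviation, a single extreme value can mask others; more robust statistics or proximity-based methods may be preferable in practice.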

3.6 Association analysis (or association mining)

Association analysis discovers associations (or links) among data sets and identifies data objects that occur together while satisfying minimum support and confidence thresholds. Association mining first identifies all frequent item sets and then generates strong association rules from them [6, 28]. Association analysis includes mining frequent itemsets, subsequences and substructures [6]. Market-basket analysis mainly relies on association analysis, and the Apriori algorithm is widely used for it. Association analysis algorithms can be classified into classical algorithms, condensed representation algorithms, and incomplete set algorithms [27].
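The two-step process (frequent itemsets first, then strong rules) can be sketched as follows. This is a simplified level-wise search in the spirit of Apriori, not a faithful or optimized implementation; the transactions, the thresholds and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: build k-itemsets only from frequent (k-1)-itemsets."""
    items = sorted(set().union(*transactions))
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        level = {c: support(c, transactions) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


def strong_rules(frequent, min_confidence):
    """Generate rules antecedent -> consequent with confidence >= min_confidence."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), confidence))
    return rules


freq = frequent_itemsets(TRANSACTIONS, min_support=0.5)
for antecedent, consequent, confidence in strong_rules(freq, min_confidence=0.7):
    print(antecedent, "->", consequent, round(confidence, 2))
```

Here the confidence of a rule A -> B is support(A ∪ B) / support(A); every subset of a frequent itemset is itself frequent, which is why the antecedent lookup always succeeds.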

Some popular association mining algorithms are summarized in Table 4. Association mining algorithms for multilevel association, multidimensional association, quantitative association, rare (or infrequent) patterns, constraint-based association, etc. have also been proposed [6].

Table 4 Some popular association algorithms [6]

Multi-relational Data Mining (MRDM) is the process of searching for patterns that span multiple tables [58]. Sampling methods using disjunctive normal form (DNF) have been developed by Li and Zaki [46]. Rare or infrequent pattern mining is becoming popular in some application areas [41]. A correct and efficient algorithm for mining uncertain frequent patterns using a minimal data structure is investigated by Lee and Yun [45]. A new, less time-consuming algorithm based on cellular learning automata (CLA) for mining frequent itemsets is presented in Sohrabi and Roshani [60].

Data mining can also extract uninteresting patterns/rules. Therefore, pattern evaluation (measuring interestingness) is required to retain only the interesting patterns. Geng and Hamilton [32] presented nine specific criteria for measuring the interestingness of mined rules and summaries, and grouped these criteria into three categories: (1) subjective, (2) objective and (3) semantic-based. Tew et al. [63] suggested a technique to detect equivalences among interestingness measures using rule-ranking behaviour-based clustering for association rule mining. Hung et al. [36] presented WSWFP-stream, a method based on FPGrowth for mining weighted frequent itemsets in data streams. A new weight-based algorithm called MFIWDSIM, which uses an Inverted Matrix for mining frequent itemsets, is proposed in [37]. Rustogi et al. [55] presented an improved parallel Apriori algorithm for multi-core systems.

3.7 Regression and trend analysis (or evolution analysis)

Regression predicts the value of an attribute over time using one or more regression techniques. The future values of variables are predicted with the help of a plot of the historical time series [6].
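As a minimal sketch, a linear trend can be fitted to a historical series by ordinary least squares and then extrapolated one period ahead. The series and the function names `fit_line` and `predict` are hypothetical, and a straight-line trend is an assumption that will not hold for every time series.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


PERIODS = [1, 2, 3, 4, 5]
VALUES = [12, 14, 16, 18, 20]  # hypothetical historical observations
slope, intercept = fit_line(PERIODS, VALUES)
print(predict(6, slope, intercept))  # forecast for the next period
```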

Trend analysis (also called evolution analysis) discovers interesting patterns in the evolution history of objects. Identifying patterns in an object’s evolution and matching the objects’ changing trends are the two major aspects of trend analysis [28]. The trends of objects whose behaviour evolves over time can be described using trend analysis and regression models. Trend analysis exposes the time-varying trends of the data objects within a data set. Association analysis can also be used for evolution analysis [62].

4 Data mining techniques

As data mining is a multi-disciplinary field, it adopts a variety of techniques or approaches from a number of domains, including statistics, machine learning, neural networks, database systems, genetic algorithms, fuzzy sets and visualization [6, 9, 28]. Classified work of the aforesaid literature related to data mining techniques is listed in Table 5.

Table 5 Classified work of reported literature related to data mining techniques

4.1 Statistical approaches

The terms ‘statistics’ and ‘statistical techniques’ are sometimes used as synonyms for data mining, although the term ‘statistics’ predates the term ‘data mining’. Statistics is data driven and is used to discover patterns and to build predictive models (called regression models in statistics). Because of this data-driven approach, statistics is one of the major techniques used for data mining [86]. In other words, data mining has an inherent connection with statistics [6]. Many statistical analysis tools, including Bayesian networks, correlation analysis, factor analysis, discriminant analysis, cluster analysis and regression analysis, are widely used for data mining [7, 18, 87]. Most statistical models are built from a training data set, and a variety of rules and patterns are then drawn from the model. Most data mining tasks are performed using one or more statistical approaches [18].

The statistical methods commonly used in data mining are described as follows [7, 18, 87]:

  • Bayesian network: It represents the causal relationships among the variables, calculated using Bayes’ theorem [88].

  • Correlation: The relationship between two or more variables/facts/dimensions can be determined using correlation [89].

  • Regression: It is a derivation of a function to map a set of variables of various objects to an output variable [90].

  • Cluster analysis: It groups the objects based on similarity measures so that objects that are similar to each other are located within same cluster [91].

  • Discriminant analysis: It assigns data objects to one or more groups based on discriminant function [92].

  • Factor analysis: It is used to understand and find out the main causes for the correlations and to identify the important ones [93].
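As one concrete instance of the methods listed above, the Pearson correlation coefficient between two variables can be computed as in the following minimal sketch. The function name `pearson` and the sample values are hypothetical; the coefficient measures only linear association.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


HOURS = [1, 2, 3, 4, 5]
SCORE = [2, 4, 6, 8, 10]
print(pearson(HOURS, SCORE))        # perfectly increasing linear relationship
print(pearson(HOURS, SCORE[::-1]))  # perfectly decreasing linear relationship
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship.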

4.2 Machine learning

Machine learning is the study of how machines and humans can learn from data. Because of its importance to data mining, a large number of data mining algorithms have their roots in machine learning [83].

Machine learning increases the level of automation in the knowledge discovery in databases process in order to improve accuracy and efficiency. The systems produced by machine learning can be used routinely in industry and in the education sector. In some applications, machine learning methods perform better than methods without learning [85, 94].

Inductive learning and deductive learning are the two categories of machine learning. Deductive learning starts from facts and knowledge that already exist and derives new knowledge from the old. Inductive learning, instead of starting from existing knowledge, generalizes from examples.

Meta-learning combines a number of separate learning processes in an intelligent fashion [95]. A meta-learning architecture should exhibit two key behaviours: (1) it must produce an accurate final classification system (or final outcome), and (2) it must be fast relative to an individual sequential learning algorithm [95]. For mining DSS solutions, RSA (rough set analysis) and DNA (dependency network analysis) have been suggested by Gengshen and Guenther [80].

The increasing popularity of the Internet has led to an increase in network attacks. Intrusion detection (ID), which identifies unusual access to or attacks on secure networks, is therefore becoming one of the key research areas in network security. Machine learning is also used in intrusion detection systems (IDSs). IDSs monitor computers for security violations and trigger alerts to report any violation [96]. The reported alerts are passed to an analyst, who evaluates them and initiates appropriate action. Two approaches for reducing the number of false positives in intrusion detection are proposed in Chih-Fong et al. [96].

4.3 Neural network

A neural network is a network or circuit of biological neurons. Neural networks can learn from examples, which makes them flexible and powerful. An artificial neural network (ANN) is composed of artificial neurons, or nodes, and uses electrical signalling similar to that of biological neural networks [97]. In an ANN, knowledge is represented as a layered set of interconnected processors (also called neurons). Different types of neural network models are used to solve business problems and also play a vital role as a modern operations research tool [81].

Classification based on ANN to examine an effective forecast of future values is discussed in David et al. [75]. Saeed and Ali [84] proposed new privacy-preserving protocols for partitioned data based on extreme learning machine (ELM) and back-propagation (BP) algorithms.

Contemporary ANN approaches can also be used in spatial environmental data analysis. The valorisation and representativity of data are discussed in Kanevski et al. [98]. A hybrid model called machine learning residuals sequential simulations (MLRSS), based on the support vector regression and multilayer perceptron ML algorithms, has also been presented in Kanevski et al. [98].

4.4 Database systems and data warehouses

Database-oriented and data warehouse-oriented approaches do not seek the best model; instead, they use the existing data model to exploit the characteristics of the existing data [16]. Database technologies can be used to achieve scalability and effectiveness in data mining tasks that must handle large data sets. Systematic data analysis capabilities have also been embedded in recent commercial database systems [1]. The major methods of this approach are iterative database scanning for attribute focusing, attribute-oriented induction and frequent item sets [28]. The multidimensional nature of the data structures in a data warehouse also promotes multidimensional data mining [6].

4.5 Genetic algorithms

Genetic algorithms are based on the concept of natural biological evolution, i.e. the processes of selection, reproduction, mutation and survival of the fittest. Just as nature produces better-adapted organisms by recombining the DNA of living beings, genetic algorithms obtain better solutions by recombining candidate solutions [99]. However, the solutions produced by genetic algorithms are difficult to explain, and no statistical measure exists to help the user understand why a particular solution has been reached [87].
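The processes of selection, reproduction and mutation can be illustrated with a minimal sketch on a toy problem (maximizing the number of 1-bits in a bit string, often called OneMax). The fitness function, the parameter values and the function name `genetic_search` are assumptions made for illustration; tournament selection, single-point crossover and elitism are one possible design among many.

```python
import random


def one_max(bits):
    """Toy fitness function: the number of 1-bits in the individual."""
    return sum(bits)


def genetic_search(fitness, length=20, pop_size=30, generations=40,
                   mutation_rate=0.02, seed=0):
    """Evolve bit strings towards higher fitness and return the best one found."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        next_gen = [population[0][:]]  # elitism: the fittest survives unchanged
        while len(next_gen) < pop_size:
            # Selection: the fitter of three random individuals becomes a parent.
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            # Reproduction: single-point crossover of the two parents.
            cut = rng.randrange(1, length)
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


best = genetic_search(one_max)
print(one_max(best), best)
```

Because the fittest individual is always carried over, the best fitness never decreases from one generation to the next, but the search gives no explanation of why the returned solution was reached, which reflects the interpretability concern noted above.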

4.6 Fuzzy sets

Fuzzy set theory was founded by Lotfi Zadeh. A fuzzy set defines the degree of membership of an element by a possibility value calculated with the help of a membership function. Fuzzy sets are widely used in classification and cluster analysis [6]. Fuzzy set theory is contributing to various applications of data mining, machine learning and related fields [78, 79].
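A membership function maps each value to a degree of membership between 0 and 1. The following minimal sketch uses a triangular membership function, which is one common choice; the function name `triangular` and the "warm temperature" example with breakpoints 10, 20 and 30 are hypothetical, and a < b < c is assumed.

```python
def triangular(x, a, b, c):
    """Triangular membership: 0 at or outside [a, c], rising to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Degree to which a temperature (in degrees Celsius) belongs to the fuzzy set "warm".
for temperature in (5, 15, 20, 25, 35):
    print(temperature, triangular(temperature, 10, 20, 30))
```

Unlike a crisp set, in which 15 degrees would be either warm or not warm, the fuzzy set assigns it a partial membership of 0.5.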

A knowledge discovery model based on integration of modification of the fuzzy transaction data-mining algorithm (MFTDA) and adaptive-network-based fuzzy inference systems (ANFIS) has been described in Mu-Jung et al. [82]. A machine learning approach combining fuzzy modelling which returns set of fuzzy rules was proposed by Edward and Olgierd [76].

4.7 Visualization

Visualization is a very useful data mining technique for identifying and representing patterns in data sets. In visualization, data are translated into objects such as points, lines and areas, which are displayed in two- or three-dimensional space. Through visual examination, users can explore interesting patterns interactively [18, 86]. Campello et al. [25] presented a framework for density-based estimates for visualization.

5 Real-life applications of data mining

Owing to its power for data analytics, data mining is used in a wide range of real-life applications across a variety of domains [100,101,102]. One or more data mining tasks, techniques and methods are applied in these applications [6, 15]. Various real-life applications of data mining are presented in the following sub-sections.

5.1 Telecommunication sector

Data mining is used by telecom/mobile service providers to formulate and design strategies for (i) marketing campaigns, (ii) customer retention, (iii) packages for customers based on customer segmentation, (iv) optimum utilization of communication infrastructure, etc. Using classification and clustering, mobile service providers can formulate marketing campaign strategies that promote direct marketing. With clustering followed by classification, customers can be segmented into groups in order to predict which customers are likely to move to another provider. Specific marketing strategies and packages can then be designed for these customers so that they are retained by the service provider. Specific packages can also be formulated for the identified customer groups according to their needs/requirements. Association analysis can also be used for designing packages. Network usage patterns can be analyzed using data mining to identify under-utilized and over-utilized network infrastructure, so that the overall infrastructure can be optimally utilized and/or enhanced as required [6, 15, 103,104,105,106].

5.2 Retail sector

The retail sector and supermarket owners can benefit from data mining. With its help, they can analyze or predict (i) the buying behavior of customers, (ii) market baskets, (iii) the choices of customers, (iv) the placement of products on shelves, (v) the introduction of effective offers/coupons/discounts, (vi) customer segments, etc. Association analysis is used to discover the buying behavior of customers and for market-basket analysis. Using association, frequent itemsets that satisfy given support and confidence levels can be discovered from the sales data, and the items in these itemsets can be placed near each other to increase their sales. Marketing campaigns can be designed using RFM (recency, frequency, and monetary) grouping. By analyzing sales data using clustering, the best locations (i.e. shelves) for products and the best offers can be discovered so that sales can be increased. The sales data can also be analyzed using clustering and/or classification to discover the various customer segments, and different marketing campaigns and promotions/offers can be customized for the discovered segments. A customer who buys infrequently but spends a lot should be treated differently from a customer who buys very frequently but spends small amounts [6, 15, 103, 105, 107].
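The RFM values mentioned above can be derived directly from a purchase log, as in the following minimal sketch. The purchase records, the reference date and the function name `rfm` are hypothetical; recency is taken as the number of days since the most recent purchase, frequency as the number of purchases and monetary as the total amount spent.

```python
from datetime import date

# Hypothetical purchase log: (customer, purchase date, amount spent).
PURCHASES = [
    ("alice", date(2024, 1, 5), 40.0),
    ("alice", date(2024, 3, 1), 60.0),
    ("bob", date(2024, 2, 10), 500.0),
]


def rfm(purchases, today):
    """Return {customer: (recency in days, frequency, monetary total)}."""
    table = {}
    for customer, day, amount in purchases:
        recency, frequency, monetary = table.get(customer, (None, 0, 0.0))
        days_ago = (today - day).days
        # Recency is measured from the most recent purchase.
        recency = days_ago if recency is None else min(recency, days_ago)
        table[customer] = (recency, frequency + 1, monetary + amount)
    return table


for customer, values in rfm(PURCHASES, today=date(2024, 3, 31)).items():
    print(customer, values)
```

In this toy log, one customer buys more often but spends little, whereas the other buys once but spends a lot, which is the kind of contrast that RFM grouping is meant to expose; the resulting triples could then be fed to a clustering algorithm to form segments.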

5.3 Financial data analysis

The data collected in the financial industry and in banking facilitate systematic data analysis and data mining. Data mining for financial data analysis can be used for (i) loan payment prediction, (ii) customer credit policy analysis, (iii) customer segmentation for targeted marketing, (iv) detection of money laundering and other financial crimes, etc.

With the help of the data mining methods of attribute ranking and attribute selection, the customer payment history can be analyzed to discover (i) the credit history, (ii) the payment-to-income ratio, (iii) the term of the loan, etc. of the customers. This analysis helps banks/financial institutions to decide their loan granting policy and to grant loans to customers according to their score. Nowadays, banks and financial institutions check the CIBIL score of customers, which is based on data mining, before granting loans to them [6, 15, 103,104,105,106].

5.4 Healthcare sector

Data mining has recently come to be widely used in the healthcare sector to (i) identify and analyze chronic diseases, (ii) identify and discover symptoms, possible causes and medicines for effective treatments, (iii) track high-risk regions prone to the spread of disease, (iv) design programs to reduce the spread of disease, (v) identify regions of patients, etc. In the healthcare sector, imaging and lab test data/reports are analyzed using data mining tasks such as clustering, classification, association and outlier detection. These tasks are used to identify/discover/predict chronic diseases and their symptoms, possible causes and medicines so that the diseases can be treated effectively. The analysis can be extended to identify and track the high-risk regions that are prone to the spread of disease. Based on the analysis, campaigns can be designed for these regions to make people aware of the disease and the precautions to take. By continuously comparing symptoms, causes and medicines, data mining can help to identify effective treatments and their associated side-effects [6, 15, 103,104,105,106].

5.5 Fraud detection and crime prevention

Data mining can also discover outliers in vast amounts of data. Outliers can be identified by discovering the infrequent patterns in the data, and infrequent patterns generally correspond to fraudulent/criminal activity. Hence, with the help of outlier detection and/or infrequent pattern mining, possible frauds can be identified and predicted so that crimes can be prevented [6, 15, 103, 107].

5.6 Customer relationship management (CRM)

Good customer relationships can be built by attracting more suitable customers and improving retention. Data mining can reinforce CRM by supporting (i) database marketing, (ii) customer acquisition and customer retention campaigns, etc. [6, 15, 103, 107].

5.7 Recommender systems

Recommender systems use data mining to provide users with recommendations that may be of interest to them. They examine user transactions, user profiles, keywords and common features among items to estimate how suitable an item is for the user. Many data mining techniques, such as machine learning, statistics and information retrieval, are used in recommender systems. For example, in marketing, a recommender system may recommend items that are similar to the items queried by the user in the past, or items preferred by other customers whose tastes are similar to the user's [103].
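The second strategy, recommending items preferred by customers with similar taste, can be sketched as follows. The purchase data and the function names `jaccard` and `recommend` are hypothetical; similarity between users is assumed to be the Jaccard coefficient of their item sets, which is only one of many possible measures.

```python
# Hypothetical purchase history: the set of items each user has bought.
BASKETS = {
    "ann": {"book", "pen"},
    "bob": {"book", "pen", "lamp"},
    "cy": {"pen", "mug"},
    "dee": {"sofa"},
}


def jaccard(a, b):
    """Similarity of two item sets: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, baskets, top_n=2):
    """Rank items the user does not own by the summed similarity of their owners."""
    own = baskets[user]
    scores = {}
    for other, items in baskets.items():
        if other == user:
            continue
        similarity = jaccard(own, items)
        if similarity == 0:
            continue  # users with nothing in common contribute no recommendation
        for item in items - own:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


print(recommend("ann", BASKETS))
```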

5.8 Online marketing/E-commerce

Various big brands/vendors in online marketing and e-commerce also use data mining to enhance their business. For example: (i) e-commerce vendors discover the lowest price of a product using text mining on the web, (ii) large fast food chains study the ordering patterns of customers, waiting times, order sizes, etc. using big data mining to enhance their customers' experience, (iii) online media service providers use data mining to find out how to make a series or a movie popular among customers [103].

6 Relationship between data mining tasks and data mining techniques

A data mining task is carried out with one or more data mining techniques, and within a technique one or more data mining methods can be applied. Table 6 shows which major technique(s) each data mining task is based on.

Table 6 Relationship between Data Mining Tasks and Data Mining Techniques

7 Summary and conclusion

Data mining needs to evolve in order to analyze huge volumes of data efficiently and to discover knowledge from them. The application domains of data mining are also growing steadily. Hence, uniform methods/algorithms are needed that can be applied to a large variety of applications with few or no changes. Most data mining systems employ a combination of methods to handle various types of data, data mining tasks and application areas [18].

Many researchers have stated challenges facing data mining research [108]. Some of these challenges, which require more research attention, are presented in Table 7.

Table 7 Challenges of data mining research

Various data mining tasks and techniques help different companies to (i) gain knowledge, and (ii) increase their profitability by making amendments in procedures and operations. Data mining helps businesses in decision making through analysis of hidden patterns and trends [103].

Finally, it is concluded that (1) unification, scalability and optimization of data mining algorithms/methods, (2) cube-oriented multidimensional data mining, and (3) scalable real-time mining are the areas of data mining which also require more attention from researchers.