
1 Introduction

Soft computing techniques have recently emerged as excellent tools for knowledge discovery in huge amounts of data. Hybrid combinations of these techniques can handle large volumes of data quickly and effectively. Since the data to be analyzed is often inexact and uncertain, traditional techniques are not adequate; tolerance of exactly this kind of imprecision is characteristic of soft computing. Applying soft computing techniques therefore yields systems with a high degree of machine intelligence. The most widely used soft computing techniques are as follows:

1.1 Genetic Algorithm

Genetic algorithms are adaptive search algorithms based on the evolutionary ideas of natural selection and genetics. As such, they represent an intelligent exploitation of random search used to solve optimization problems. The basic techniques of the genetic algorithm are designed to simulate processes in natural systems essential for evolution, in particular those that follow the principle of survival of the fittest. Genetic algorithms are more robust than hard computing algorithms, and in searching a large state space they may offer significant advantages over more conventional optimization techniques. In general, a genetic algorithm proceeds as follows. An initial population is created consisting of randomly generated rules, each of which can be represented by a string of bits. Based on the notion of survival of the fittest, a new population is formed to contain the fittest rules of the current population, together with offspring of those rules. Typically, the fitness of a rule is assessed by its classification accuracy on a set of training samples. Offspring are created by applying genetic operators such as crossover and mutation. In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In mutation, randomly selected bits in a rule's string are inverted. The process of generating new populations based on prior populations of rules continues until a population P evolves in which every rule satisfies a pre-specified fitness threshold. Genetic algorithms are easily parallelizable and have been used for classification as well as other optimization problems. In data mining, they may be used to evaluate the fitness of other algorithms [1].
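
The loop just described can be sketched as follows. This is an illustrative Python toy in which the OneMax fitness (counting 1-bits) stands in for the rule-accuracy fitness described above; the population size, string length, mutation rate, threshold and generation cap are all arbitrary choices.

```python
import random

POP_SIZE, RULE_LEN, MUTATION_RATE, FITNESS_THRESHOLD = 20, 16, 0.01, 14

def fitness(rule):
    # Placeholder: in data mining this would be the rule's
    # classification accuracy on a set of training samples.
    return sum(rule)  # OneMax: count of 1-bits

def crossover(a, b):
    # Swap substrings of a pair of rules to form a new pair of rules.
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(rule):
    # Invert randomly selected bits in the rule's bit string.
    return [bit ^ 1 if random.random() < MUTATION_RATE else bit for bit in rule]

population = [[random.randint(0, 1) for _ in range(RULE_LEN)]
              for _ in range(POP_SIZE)]
generation = 0
# Continue until every rule in the population satisfies the fitness
# threshold (with a generation cap as a safety stop).
while generation < 500 and not all(fitness(r) >= FITNESS_THRESHOLD
                                   for r in population):
    # Survival of the fittest: keep the best half of the population.
    population.sort(key=fitness, reverse=True)
    survivors = population[:POP_SIZE // 2]
    # Offspring are created by applying crossover and mutation.
    offspring = []
    while len(offspring) < POP_SIZE - len(survivors):
        a, b = random.sample(survivors, 2)
        c1, c2 = crossover(a, b)
        offspring += [mutate(c1), mutate(c2)]
    population = survivors + offspring[:POP_SIZE - len(survivors)]
    generation += 1

print(f"converged after {generation} generations")
```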

1.2 Neural Networks

New models of computing for pattern recognition tasks are inspired by the structure and function of the biological neural network. A set of processing units, when assembled in a closely interconnected network, offers a rich structure exhibiting some features of the biological neural network; such a structure is known as an artificial neural network (ANN). Since ANNs are implemented on computers, it is worth comparing the processing capabilities of a computer with those of the brain. Biological neural networks are slow in processing information: the cycle time corresponding to a neural event prompted by an external stimulus is in the millisecond range, whereas a computer processes information roughly a million times faster. Neural networks can perform massively parallel operations, since the brain that inspires them operates with massively parallel processes, each comprising relatively few steps. Neural networks have a large number of computing elements, and the computation is not restricted to within neurons. Neural networks store knowledge in the strengths of the interconnections; new knowledge is added by adjusting the interconnection strengths without destroying the old knowledge. Consequently, knowledge in the brain is adaptable, whereas in the computer it is strictly replaceable. Neural networks exhibit fault tolerance, since the knowledge is distributed across the connections throughout the network. There is no central control for processing information in the brain; in a neural network each neuron acts based on the neurons connected to it, so there is no specific control mechanism external to the computing task [2].
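
As a small illustration of knowledge stored in interconnection strengths, the sketch below trains a tiny feed-forward network on the XOR problem with plain gradient descent; the architecture, learning rate and iteration count are arbitrary choices for illustration.

```python
import numpy as np

# Toy training set: the XOR problem, a classic non-linearly-separable task.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden -> output weights

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # Forward pass: the network's knowledge lives in W1, W2, b1, b2.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: learning adjusts the interconnection strengths
    # by gradient descent, without any external control mechanism.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(3).ravel())  # approaches [0, 1, 1, 0]
```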

1.3 Support Vector Machine

A support vector machine (SVM) is a promising approach for the classification of both linear and nonlinear data. An SVM works as follows. It uses a nonlinear mapping to transform the original training data into a higher dimension. Within this new dimension, it searches for the optimal linear separating hyperplane. With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. The SVM finds this hyperplane using support vectors and margins. The first paper on SVMs was presented in 1992 by Vladimir Vapnik and colleagues, although the groundwork had existed since the 1960s. Although the training time of even the fastest SVMs can be extremely slow, they are highly accurate, owing to their ability to model complex nonlinear decision boundaries. They are much less prone to overfitting than other methods. The support vectors also provide a compact description of the learned model. SVMs can be used for prediction as well as classification. They have been applied to a number of areas, including handwritten digit recognition, object recognition, and speaker identification, as well as benchmark time-series prediction tests.
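
A minimal scikit-learn sketch of this idea follows. The concentric-circles data (not separable by any hyperplane in the input space), the RBF kernel and the parameter values are illustrative assumptions, not choices made in this paper.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not separable by a hyperplane in the input space.
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the data to a higher dimension, where
# the SVM searches for the maximal-margin separating hyperplane.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)

print("support vectors:", clf.support_vectors_.shape[0])
print("test accuracy:  ", clf.score(X_test, y_test))
```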

1.4 Fuzzy Logic

The concept of fuzzy sets was first introduced by Zadeh in 1965 to represent the vagueness present in human reasoning. Fuzzy sets can be considered a generalization of classical set theory. In a classical set, an element of the universe either belongs or does not belong to the set; the belongingness of an element is crisp. In a fuzzy set, the belongingness of an element can be a continuous variable. Mathematically, a fuzzy set is a mapping from the universe of discourse to [0, 1]. The higher the membership value of an input pattern to a class, the greater the belongingness of the pattern to the class [3]. The membership function is usually designed by taking into consideration the requirements and constraints of the problem. Fuzzy logic deals with reasoning with fuzzy sets and fuzzy numbers. It should be noted that fuzzy uncertainty is not the same as probabilistic uncertainty. In fuzzy neural approaches, network outputs are interpreted as fuzzy membership values, and learning laws are derived by minimizing a fuzzy objective function in a gradient descent manner. The concept of cross entropy has been extended to incorporate fuzzy set theory, and incorporating fuzziness in the objective functions has led to better classification in many cases. Kohonen's clustering network has been generalized to its fuzzy counterpart; the merit of this approach is that the final weight vectors do not depend on the order of presentation of the input vectors, and it uses a systematic approach to determine the learning rate parameter and the size of the neighbourhood.
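
As a small illustration of graded belongingness, the sketch below defines triangular membership functions mapping a universe of discourse to [0, 1]; the temperature sets and their breakpoints are arbitrary design choices, reflecting the point above that membership functions are shaped by the problem at hand.

```python
def triangular(x, a, b, c):
    """Membership in [0, 1] for a triangular fuzzy set peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Fuzzy sets over a temperature universe of discourse; the breakpoints
# are design choices reflecting the constraints of the problem.
def cool(t):  return triangular(t, 0, 10, 20)
def warm(t):  return triangular(t, 15, 25, 35)

t = 18.0
# Unlike a crisp set, 18 degrees partially belongs to both classes.
print(f"cool({t}) = {cool(t):.2f}, warm({t}) = {warm(t):.2f}")
# Standard fuzzy intersection (min) and union (max):
print("cool AND warm:", min(cool(t), warm(t)))
print("cool OR  warm:", max(cool(t), warm(t)))
```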

1.5 Rough Sets

In many classification tasks, the aim is to form classes of objects that may not be significantly different. These indistinguishable objects are useful for building a knowledge base about the task. For instance, if objects are classified according to colour (red, black) and shape (triangle, square and circle), then the indiscernible classes are red triangles, black squares, red circles, and so on; these attributes thus induce a partition of the set of objects. Now, if two red triangles belong to different classes, it is not possible to classify them based on the given attributes alone. This form of uncertainty is known as rough uncertainty. Pawlak formulated rough uncertainty in terms of rough sets. Rough uncertainty is completely avoided if we can successfully extract all the essential features needed to represent the distinct objects, but this cannot be guaranteed, as our knowledge about the system generating the data is limited. It should be noted that rough uncertainty is different from fuzzy uncertainty. Using rough sets, it may be possible to reduce the dimensionality of the input without losing any information: a set of features is sufficient to classify all the input patterns if the rough ambiguity for that set of features is equal to zero. Using this quantity, it is possible to select a proper set of features from the given data.
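
A minimal sketch of these notions follows, reusing the colour/shape example above. The object names and decision labels are invented for illustration, and the lower/upper approximations computed are the standard rough-set constructions rather than anything specific to this paper.

```python
from collections import defaultdict

# Objects described by the attributes colour and shape (as in the text),
# plus a decision class that the attributes cannot always determine.
objects = [
    ("o1", ("red",   "triangle"), "classA"),
    ("o2", ("red",   "triangle"), "classB"),  # indiscernible from o1
    ("o3", ("black", "square"),   "classA"),
    ("o4", ("red",   "circle"),   "classB"),
]

# Partition the universe into indiscernibility (equivalence) classes.
blocks = defaultdict(set)
for name, attrs, _ in objects:
    blocks[attrs].add(name)

target = {name for name, _, cls in objects if cls == "classA"}

# Lower approximation: blocks certainly contained in the target set.
lower = set().union(*[b for b in blocks.values() if b <= target])
# Upper approximation: blocks that possibly intersect the target set.
upper = set().union(*[b for b in blocks.values() if b & target])

print("lower:", lower)                                  # {'o3'}
print("upper:", upper)                                  # {'o1', 'o2', 'o3'}
print("boundary (rough uncertainty):", upper - lower)   # {'o1', 'o2'}
```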

2 Literature Review

Nowadays a huge amount of data is being collected and stored in databases everywhere across the globe. Valuable knowledge and information lie "hidden" in such databases, and without automatic methods for extracting it, mining them is practically impossible. Over the years, many algorithms have been created to extract what are referred to as nuggets of knowledge from large sets of data, and there are several methodologies for approaching this problem.

Boonyanusith and Jittamai [4] studied the pattern of blood donors' behaviour based on the factors influencing the blood donation decision, using an online questionnaire. The surveyed data were used with machine learning techniques of artificial intelligence to classify respondents into donors and non-donors. The accuracy testing of the surveyed data was performed using artificial neural network (ANN) and decision tree techniques in order to predict, from a set of individual blood-donation behaviour records, whether each individual is a donor. The results indicate that the accuracy, precision, and recall values of the ANN approach are better than those of the decision tree approach [4].

Classification is a data analysis technique used to extract models describing important data classes and to predict future values. Data mining combines classification techniques with machine learning, image processing, natural language processing, statistics and visualization techniques to find and present information in a clear format. Most of the classification algorithms in the literature are memory resident, usually assuming a small data size. Recent data mining research has built on such techniques, developing scalable and robust classification methods capable of handling huge disk-resident data. Classification has various applications, including flight classification, fraud detection, target marketing, performance prediction, manufacturing, and identification. The performance of classification techniques is measured by metrics such as accuracy, speed, robustness, quality, comprehensibility, time and interpretability. Classification depends on the inductive learning principle, which analyzes and finds patterns in the data. If the environment is dynamic, the model should be adaptive, i.e. it must be able to learn and map efficiently.

Bhardwaj et al. [5] focus on data mining and the trends associated with it. The main purpose of their system is to increase the blood donation rate as well as to attract more blood donors. The work classifies and predicts the number of blood donors according to their age and blood group, using the WEKA data mining tool and the J48 algorithm to classify and evaluate the data. Limère et al. presented a model for firm growth based on decision tree induction. It gives interesting results and fits the model to economic information such as growth competence and resources, growth potential and growth ambitions.

Shilton and Palaniswami presented a unified approach to support vector machines, developed for binary classification and later extended to one-class classification and regression. Takeda et al. proposed a unified robust classification model that optimizes existing classification models such as the SVM, the minimax probability machine, and Fisher discriminant analysis. It provides several benefits: it has well-defined theoretical results, extends existing techniques, and clarifies the relationships among existing models.

Yee and Haykin viewed pattern classification as an ill-posed problem, making it necessary to develop a unified theoretical framework that classifies and solves ill-posed problems. Recent literature on classification frameworks has reported good results for binary-class datasets only; for multiclass datasets there is a lack of accuracy and robustness, so evolving an efficient classification framework for multiclass datasets remains an open research problem. The evaluation of the parameters that influence the psychology of blood donors has been conducted largely because of the significant effect of blood insufficiency on the survival of patients [6]. Data mining is the approach for discovering new patterns in huge data sets; it can be used to extract information from an existing data set and transform it into a humanly understandable structure for further use [7]. It utilizes techniques at the intersection of statistics, information systems, machine learning, and computing. An ANN is a data processing method that can be used to predict or classify data concerning the ideas, emotions and behaviours of customers efficiently [8]. It is effective at learning patterns in the data [9]. To solve classification problems, grouping records is an effective way to analyze marketing databases [10]. The multi-layer perceptron is a large and useful feed-forward ANN model that can be used to analyze a dataset and classify the targeted cluster [11]. Moreover, the decision tree is one of the useful classification techniques that learns patterns of the dataset; it displays the result graphically as a tree model, depicting each step of the concluding process from input to output [12].

Borkar and Deshmukh [13] proposed using the Naïve Bayes classifier for detection of swine-flu disease. The method starts by finding the probability of every attribute of swine flu against each output; the probabilities of the attributes are then multiplied. Choosing the maximum of all these probabilities, the attributes are assigned to the class variable with the highest value. The promising results of the proposed scheme can be used for further investigation of swine flu in patients using information technology. Patil et al. [14] worked in the direction of diagnosing whether a patient, given information on age, sex, blood pressure, blood glucose, chest pain, electrocardiogram reports, etc., will develop heart disease later in life.

The experiments take the parameters of the medical tests as inputs. The proposal is effective enough to be used by nurses and medical students for training purposes. The data mining technique used is Naïve Bayes classification for the development of the decision network in the Heart Disease Prediction System (HDPS). The performance of the proposal is further improved using a smoothing operation.

Kharya et al. [15] proposed detecting the chances of a patient developing breast cancer later in life. The severity of breast cancer matters, as it has become the second most common cause of death among women. A graphical user interface (GUI) is designed for entering the patient's record for prediction, and the records are mined from the data repository. The Naïve Bayes classifier, being simple and efficient, is chosen for the prediction. The results obtained by the classifier are accurate, fast, and computationally inexpensive. The proposal is implemented in Java, and the training uses data from the UCI machine learning repository [16]. Another advantage of the proposed system is that it scales with the dataset used.

Hickey [17] proposed using a Naïve Bayes soft classifier for the public health domain. Public health data were used as input, and the purpose of the study was to identify one or several attributes that predict a target attribute without the need for searching the input space exhaustively. The proposal achieved its goal, with an increase in classification accuracy. The target attributes were related to diagnosis or procedure codes.

Ambica et al. [18] developed an efficient decision support system for diabetes using the Naïve Bayes soft computing algorithm. The developed classification system consists of two steps. The first step analyzes the optimality of the dataset and accordingly extracts the optimal feature set from the training dataset.

The second step creates a new dataset as the optimal training dataset, and the developed classification scheme is then applied to the optimal feature set. The mismatched features from the training data are ignored, and the dataset attributes are used for the calculation of the posterior probability. The proposed procedure therefore performs elimination of unavailable features and document-wise filtering.

3 Proposed Methodology

The proposed methodology used to accomplish the various tasks is shown in Fig. 1.

Fig. 1. Knowledge extraction process with soft computing techniques

3.1 Working Methodology of the Naive Bayes Classifier

The Naïve Bayes classification method starts with a text document as input. To measure the relative degree of association between class–word pairs, the classifier uses a log-linear decision rule that assigns an independent parameter to each class–word pair. The two steps of the classifier are the calculation of the class conditional probability and the calculation of the classification (posterior) probability. For each term \( t_i \) and class \( c_j \), the class conditional probability \( \widehat{p}(t_i \mid c_j) \), considering only one training set, is represented as follows:

$$ \widehat{p}(t_i \mid c_j) = \frac{\sum tf(t_i, d \in c_j) + \alpha}{\sum N_{d \in c_j} + \alpha \cdot M} $$
(1)

where \( \sum tf(t_i, d \in c_j) \) is the total sum of the term frequencies of the word over all documents in the training samples belonging to class \( c_j \), \( \alpha \) is a smoothing parameter, \( \sum N_{d \in c_j} \) is the sum of all term frequencies in the training dataset for class \( c_j \), and M is the number of terms.

Once the conditional probability has been calculated for each term and class, the trained classifier can predict the class of any new document. Let the query document d be represented by a feature vector of term frequencies. The posterior probability of the document belonging to class \( c_j \) is the product of the individual class conditional probabilities of all terms contained in the query document:

$$ \begin{aligned} \widehat{p}(d \mid c_j) &= \widehat{p}(t_1 \mid c_j)^{tf(t_1, d)} \cdot \widehat{p}(t_2 \mid c_j)^{tf(t_2, d)} \cdots \widehat{p}(t_M \mid c_j)^{tf(t_M, d)} \\ &= \prod\limits_{i = 1}^{M} \widehat{p}(t_i \mid c_j)^{tf(t_i, d)} \end{aligned} $$
(2)

After calculating these probabilities, the class \( c_k \) to which the query document d is assigned is the one with the highest posterior probability, i.e. \( k = \arg\max_j \widehat{p}(d \mid c_j) \).
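
To make the two steps concrete, here is a minimal Python sketch implementing Eqs. (1) and (2). The toy documents, the small floor value for out-of-vocabulary terms, and the use of log probabilities (to avoid numerical underflow in the product of Eq. (2)) are illustrative choices, not part of the original formulation.

```python
import math
from collections import Counter, defaultdict

def train(docs, alpha=1.0):
    """Estimate p(t|c) per Eq. (1), with smoothing parameter alpha."""
    vocab = {t for words, _ in docs for t in words}
    tf = defaultdict(Counter)   # term frequencies per class
    totals = Counter()          # total term count per class
    for words, c in docs:
        tf[c].update(words)
        totals[c] += len(words)
    M = len(vocab)              # M: number of terms
    return {c: {t: (tf[c][t] + alpha) / (totals[c] + alpha * M)
                for t in vocab}
            for c in tf}

def log_posterior(cond, words, c):
    # Log of Eq. (2): sum of tf(t, d) * log p(t|c) over terms in d.
    # The 1e-12 floor for out-of-vocabulary terms is a practical shortcut.
    return sum(n * math.log(cond[c].get(t, 1e-12))
               for t, n in Counter(words).items())

def classify(cond, words):
    # Predicted class: k = argmax_j p(d | c_j)
    return max(cond, key=lambda c: log_posterior(cond, words, c))

docs = [(["donate", "blood", "camp"], "donor"),
        (["blood", "bank", "donor"], "donor"),
        (["no", "time", "busy"], "non-donor")]
cond = train(docs)
print(classify(cond, ["blood", "camp"]))  # -> 'donor'
```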

As long as the underlying independence assumption holds, the Naïve Bayes classifier works well. Independence here means that the features are independent given the class, so that the class is a good predictor of the features. The other benefits of the classifier include simplicity, fast classification, insensitivity to irrelevant features and simple implementation, which make it a promising technique to try on a new classification problem. However, the underlying assumption is not always realistic in real-world situations. The performance of the Naïve Bayes classification technique has been evaluated on the basis of various parameters and the results collected. The algorithm has many merits, some of which are as follows:

1. When the input variables are conditionally independent, this algorithm performs well.

2. The classifier converges quickly, requiring considerably less training data than discriminative models such as logistic regression.

3. It is easy to predict the class of a test dataset with this algorithm, and it is a good choice for multi-class predictions as well.

4. The algorithm has shown excellent performance in numerous application areas in spite of the conditional independence assumption.

5. There are different flavours of the Naïve Bayes algorithm, such as Gaussian, multinomial and Bernoulli Naïve Bayes.

6. It is well suited to text classification problems; it is commonly used for spam email classification.

7. The algorithm can also be trained on small datasets.

There are numerous areas where the Naïve Bayes algorithm is used, some of which are as follows:

1. To check whether an email is junk mail or not.

2. To characterize news articles about entertainment, politics, sports, technology, etc.

3. Social sites such as Facebook use it to analyze status updates expressing positive or negative sentiments.

4. It is also used in document classification, for indexing documents in a database.

4 Experimental Evaluation

The blood donors' data were collected from the Kota blood bank and comprise 5656 instances with 12 attributes. In principle, using a big data set to build the classifier model increases the performance when classifying new data, because it is easier to construct a more general model and hence to find a suitable fit for our dataset. The dataset used to build the classifier model depends on several things, including the scale of the classification problem, the classifier algorithm used and the data set itself. The blood donor classification model was evaluated using the Naïve Bayes classification technique. There are two categories of blood donor, male and female. Of the 5656 instances in the blood donors dataset, seven attributes are used: Bag-no, Age, Date, Group, Available, Tested and Sex. The testing mode is 10-fold cross-validation. The total execution time to build the model is 0.02 s. The results for the blood donors dataset with the Naïve Bayes classification technique are shown in Table 1.
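
For orientation, a rough sketch of this evaluation is given below, assuming a hypothetical CSV export of the records; the file name, the exact column spellings, and the use of scikit-learn's GaussianNB in place of the tool actually used are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical CSV export of the blood bank records; file name and
# column names are assumptions, not the actual data source.
df = pd.read_csv("blood_donors.csv")
X = OrdinalEncoder().fit_transform(
    df[["Bag-no", "Age", "Date", "Group", "Available", "Tested"]].astype(str))
y = df["Sex"]  # target class: male / female donor

# Treating the integer-coded categories as numeric inputs to GaussianNB
# is a simplification of the original Naive Bayes setup.
scores = cross_val_score(GaussianNB(), X, y, cv=10)  # 10-fold cross-validation
print(f"mean accuracy: {scores.mean():.4%}")
```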

Table 1. Classified/unclassified instances

In the next step of the experiment, the classification accuracy was calculated (Table 2).

Table 2. Classification accuracy

Here MAE, RMSE, RAE and RRSE denote the mean absolute, root mean squared, relative absolute and root relative squared errors, respectively; a brief sketch of these error measures is given below. The male and female class accuracies follow in Table 3.
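
For reference, the sketch below computes the four error measures as they are commonly defined, with the relative errors taken against a predictor that always outputs the mean of the actual values; the sample numbers are illustrative only, not taken from the experiment.

```python
import numpy as np

def error_metrics(actual, predicted):
    """MAE, RMSE, RAE and RRSE for a vector of predictions."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    err = predicted - actual
    dev = actual - actual.mean()            # errors of the mean predictor
    mae  = np.abs(err).mean()                           # mean absolute error
    rmse = np.sqrt((err ** 2).mean())                   # root mean squared error
    rae  = np.abs(err).sum() / np.abs(dev).sum()        # relative absolute error
    rrse = np.sqrt((err ** 2).sum() / (dev ** 2).sum()) # root relative squared
    return mae, rmse, rae, rrse

# Toy example with 0/1 class indicators (values are illustrative only):
mae, rmse, rae, rrse = error_metrics([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(f"MAE={mae:.3f} RMSE={rmse:.3f} RAE={rae:.2%} RRSE={rrse:.2%}")
```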

Table 3. Male and female class accuracy

The following graph shows the accuracy of the male and female classes (Fig. 2).

Fig. 2. Accuracy of male and female class

Table 4 shows the classified male and female blood donors according to their blood groups.

Table 4. Classified male and female blood donors

The output screen generated when the Naïve Bayes algorithm is run is shown in Fig. 3.

Fig. 3. Classified and unclassified instances

5 Conclusion

There are many parameters for comparing the performance and accuracy of algorithms. The objective of this research paper is the classification and prediction of blood donors according to their sex and blood group. In this paper we have discussed how the Naïve Bayes soft computing algorithm can be used in knowledge discovery for classification and prediction. In this work a data mining model was developed and tested for extracting knowledge about blood donor classification, which can be used to support certain kinds of decisions in a blood bank organization. The blood donors' dataset was collected from an authentic government blood bank centre. The experimental outcomes show that the generated classification model performed well, with an accuracy rate of 97.5588%. In future work, soft computing techniques with KDD will be applied to real-world datasets to predict blood donors' conduct and mindset.