Keywords

1 Introduction

Big data is an emergent field which increases more number of data in the fields like marketing, medical, biological research, transaction of data, and so on. Due to growing size of data, data retrieval is more complex. Big data is classified into three V’s, that is, Volume, Velocity, and Variety of data [1]. Big data, which is a huge volume of data, is not only collected from computers but also from mobile phones, sensors in various filed, social media posts, and many other resources. Data retrieval, data analysis, quality and quantity measures of algorithm and data, and outlier detection are considered various issues in Big data [2].

Biological data is a collection of life science information, computational study, information of living organism, and high quantity of research knowledge. The progress of biological data information’s collected from DNA, RNA, protein discovered [3, 4]. The types of biological data are incorporated from genomics, proteomics, microarray, metabolomics, gene expression, and ontology, and so on. The biological data is distinguished in different data format like image, sequence, structure, patterns, graph, text, geometric, and expression [5, 6].

The cell is the basic structure of every living organism. The nucleus is the heart of the cell with chromosomes which have a part called DNA. The four bases of DNA are Adenine (A), Cytosine(C), Guanine (G), and Thymine (T). DNA is transcribed into RNA which has the base pair of Adenine (A), Cytosine(C), Guanine (G), and Uracil (U) [7]. The base pair of RNA is similar to DNA except for Thymine. RNA use Uracil instead of Thymine. RNA is translated to protein. Proteins are formed by linking different amino acid or peptide bonds [8]. A protein is normally denoted as a sequence or string on an alphabet of 20 characters, except B, J, O, U, X, and Z.

Cloud computing is the main part of the research in bioinformatics for huge volume of biological data [9]. Distributed cloud computing is one of the main roles in cloud computing that simplifies the cloud location, progress, distribution of data, and application from various sites to achieve the necessities, hence improving the performance and reducing the idleness.

Machine learning denotes to design and assess the algorithms to enable the data mining models from raw data. Generally, machine learning facilitates the two learning mechanization, that is, supervised learning and unsupervised learning [10]. Supervised learning represents the classification and prediction of the members with known features based on class label of data. Unsupervised learning, otherwise called as clustering and outliers, collects similar data into one group and dissimilar data into another. Both learning mechanisms work well in biological research for biological data. The combination of machine learning and deep learning is quite complex for biological data. Machine learning hybrid with deep learning and cloud computing enhances the performance of the algorithm.

Distributed clustering is used to solve computational issues in distributed data. Generally, the data is classified into two forms: homogeneous and heterogeneous. Homogeneous data has similar dataset attributes, and heterogeneous has different dataset attributes. In Fig. 2.1, the distributed clustering is done in two levels such as local and global [11, 12].

Fig. 2.1
figure 1

Architecture of distributed clustering

1.1 Chapter Sections Overview

Chapter sections are organized as follows: Sect. 2.1, describes the introduction; Sect. 2.2 presents a detailed survey of previous studies, Sect. 2.3 explains about the hybrid models; Sect. 2.4, presents the results and discussion; and Sect. 2.5 presents the conclusion.

2 Literature Review

Bioinformatics is an emerging research area for storing and accessing a huge volume of data. Data access is a difficult task in the research field. The structure and function of protein based on the statistical metric based feature selection techniques, which reduced the feature vector size for evaluate the growing biological data [13]. The neural network classifiers compared with other classifiers for improving the evaluation. The accuracy of classification is to exactly identify the changes of amino acid sequence. This feature selection proves a significant upgrade in performance in terms of accuracy, sensitivity, and F-measure. This selection technique fails to manage the time complexity for accessing the data.

The distribution-based spectral clustering and cuckoo search used for cancer identification with protein sequence data reduces time complexity. Invariant sequence identified based on the similarity index, which is identified by Jaccard similarity index. Fuzzy logic used to detect the membership value of protein sequence. Based on the similarity and membership value, the sequence is detected whether cancerous or non-cancerous. This distribution-based spectral clustering improves the accuracy and reduces the time but fails to detect the features-based detection [14, 15].

TRIBE-MCL is used for the family of protein to detect the information of sequence similarity. Protein family detection is one of the main goals of functional and structural genomics. Construct a protein–protein similarity graph for proteins. Then, generate a weighted transition matrix for the constructed similarity graph by BLAST E-Values and finally transform, weight into transition probability for constructing a Markov matrix. This task is probably expensive to achieve a goal in a short period [16].

The deep learning algorithm exactly identifies the breast cancer using mammography image. Digital Database for Screening Mammography (CBIS-DDSM) test improves the sensitivity, specificity and reduces the false-positive and false-negative rates [17]. Deep learning method is highly suitable for heterogeneous mammography image, but it takes much time to produce the result of algorithm. Random forest and distributed techniques are rarely used in biological environment [18, 19].

3 Hybrid Models of Deep Learning and Machine Learning

The data is distributed among various places and size. If all the data collected into single site, it takes more execution time and memory for process the data. To avoid this contingency, the distributed approach is used to cluster the data locally and form a global data based on data representative. Local cluster is done by Distributed Spectral Clustering (DSC) technique such that construct a diagonal matrix for “n” number of protein data, then find the similarity using Jaccard similarity index, then compute the Laplacian function with the help of Eigen values and Eigen vectors. Then, run the Fuzzy C-Means (FCM) to separate an object. In normal spectral clustering, K-means is used to separate a data instead of FCM. Apply the statistical metric-based feature selection in global data. This selection is done based on the scoring and length of the sequence. In this model, machine learning algorithm of spectral clustering is used to split up the data based on the similarity and the deep learning-based feature selection acts to get final informative sequence. Table 2.1. represents the Pseudo code of Distributed Spectral Clustering with Feature Selection (DSCFS).

Table 2.1 Pseudo code of distributed spectral clustering with feature selection

Feature selection is done based on the length and score of amino acid. The standard 20 (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) amino acid is used to stipulate the protein sequence of any length for any gene. Figure 2.2 represents the architecture of Distributed Spectral Clustering with Feature Selection.

Fig. 2.2
figure 2

Architecture of distributed spectral clustering with feature selection

4 Experimental Results and Discussion

The clustering measures are calculated by the following values: True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN) [20, 21].

4.1 Accuracy

Accuracy is refers to defined as correctly detect the cancerous sequence by the total number of sequence. It is measured in terms of percentage (%) (Table 2.2; Fig. 2.3).

$$ \mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}} $$
(2.1)
Table 2.2 Accuracy for TRIBE-MCL and DSCFS
Fig. 2.3
figure 3

Comparison of TRIBE-MCL and DSCFS in terms of accuracy

4.2 Precision/Specificity

Precision is referred to measure the quality of accuracy and it is the ratio of correctly identified sequences and the total number of sequences. It is also measured in terms of percentage (%) (Table 2.3; Fig. 2.4).

$$ \mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} $$
(2.2)
Table 2.3 Precision for TRIBE-MCL and DSCF
Fig. 2.4
figure 4

Comparison of TRIBE-MCL and DSCFS in terms of precision

4.3 Recall/Sensitivity

Recall is referred to measure the quality of accuracy and it is defined as a fraction of correctly identified sequences and the total number of sequences. It is also measured in terms of percentage (%) (Table 2.4; Fig. 2.5).

$$ \mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$
(2.3)
Table 2.4 Recall for TRIBE-MCL and DSCF
Fig. 2.5
figure 5

Comparison of TRIBE-MCL and DSCFS in terms of recall

4.4 F-Measure

F-measure is referred to integrate the mean of precision and recall. It is also measured in terms of percentage (%) (Table 2.5; Fig. 2.6).

$$ \mathrm{F}=2\ast \frac{\mathrm{precision}\ast \mathrm{recall}}{\mathrm{precision}+\mathrm{recall}} $$
(2.4)
Table 2.5 F-measure for TRIBE-MCL and DSCF
Fig. 2.6
figure 6

Comparison of TRIBE-MCL and DSCFS in terms of F-measure

4.5 Time

Time is referred to as starting and ending time of execution for the total number of sequence which is measured in terms of milliseconds (ms) (Table 2.6; Fig. 2.7).

Table 2.6 Times for TRIBE-MCL and DSCF
Fig. 2.7
figure 7

Comparison of TRIBE-MCL and DSCFS in terms of time

4.6 Motif for Normal Sequence (Fig. 2.8)

Fig. 2.8
figure 8

Motif graph for normal sequence

4.7 Motif for Mutant Sequence (Fig. 2.9)

Fig. 2.9
figure 9

Motif graph for mutant sequence

5 Conclusion

The Distributed Spectral Clustering with Feature Selection techniques is done in two models such as local and global models to reduce the time complexity, and feature selection is used to enhance the accuracy, precision, recall, and F-measures. Local model acts as a clustering and global model acts as Cloud, which provide most of the intelligent services like security, performance, productivity, reliability, scalability, speed, and accurate access. This method is mainly applicable for huge volume of distributed data. The results achieved are based on similarity, length, and score of the sequence. This novel technique is compared with TRIBE-MCL to show better performance to get mutant protein sequence. Every measure in this technique shows better performance than literature TRIBE-MCL method.