
1 Introduction

Clustering analysis has been studied for many years and has formed a systematic methodology [1]. Clustering is an unsupervised machine learning method that takes a group of physical or abstract objects and, according to the degree of similarity between them, divides them into several groups, so that data objects in the same group are as similar as possible and data objects in different groups are as dissimilar as possible [2]. However, a single clustering algorithm suffers from unstable results and large randomness. Existing research therefore tends to combine the results of clustering large data sets to overcome these shortcomings.

Research on clustering private data sets in heterogeneous networks has appeared in recent years [3] and has attracted wide attention from many fields. However, how to generate the optimal clustering data set and select the best merging strategy, especially a cluster fusion algorithm for large data sets with categorical attributes, remains an unsolved problem. It is therefore necessary to study the generation and mining of cluster members in order to obtain the best clustering results.

This paper proposes a cloud computing based clustering algorithm for privacy big data sets in heterogeneous networks and gives a fusion method and strategy for big data sets. First, the attributes of each large data set are partitioned by value, and features are collected and extracted to obtain the initial cluster members. Then, the optimal fused clustering result is obtained through continuous adjustment and mining. To verify the validity of the proposed algorithm, experiments are carried out. The results show that the cloud computing based big data set clustering algorithm improves the data clustering effect and ensures the accuracy of the clustering results.

2 Design of Large Data Set Clustering Algorithm

The dispersed large data set matrix is used as the input of the cloud computing clustering algorithm, and the feature coefficient of each column in the matrix is calculated from pairs of data features. The matrix features of each big data set are then compared within a given threshold, and it is determined whether the number of surrounding points whose coefficient exceeds the threshold is greater than the required count. If the feature coefficient between a point and any other point exceeds the matrix feature, and the number of such neighbors is greater than or equal to the required count, then the point is a core data point; all points density-connected to a core point form one class, and the remaining points are noise.
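As an illustration, the following is a minimal sketch of this core-point test, assuming Euclidean distance as the feature coefficient; the parameter names eps (the given threshold) and min_pts (the required neighbor count) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def find_core_points(X, eps, min_pts):
    """Label each point as core or non-core, per the threshold test above.

    X       : (n, d) array, one row per data point (hypothetical layout).
    eps     : the given distance threshold.
    min_pts : minimum number of neighbors required for a core point.
    """
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # A point is "core" if enough other points fall within the threshold.
    neighbor_counts = (dists <= eps).sum(axis=1) - 1  # exclude the point itself
    return neighbor_counts >= min_pts  # non-reachable points remain noise

# Usage: boolean mask of core points on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
print(find_core_points(X, eps=0.5, min_pts=4).sum(), "core points")
```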

All common feature data in the big data set are mined, the characteristics of the common data are collected and extracted, and cloud computing technology is then used again to mine the clustering features of the big data sets, thereby achieving clustering of the big data sets. The flow of the cloud computing based clustering algorithm is shown in Fig. 1.

Fig. 1. Cloud computing based clustering algorithm flow chart

2.1 Big Data Set Feature Collection and Extraction

Suppose there are \( n \) data points \( X = \left\{ {x_{1} ,x_{2} ,x_{3} , \cdots ,x_{n} } \right\} \) with \( m \) attributes, where the \( i \)-th attribute takes \( k_{i} \) different values and has weight \( \omega_{i} \). This paper adopts the simplest method of generating cluster members [4], namely division by attribute value; the attribute division rule \( R_{i} \) of the \( i \)-th big data set is:

$$ R_{i} = \left\{ {C_{i,1} ,C_{i,2} ,C_{i,3} , \cdots ,C_{i,j} } \right\},\quad 1 \le i \le m $$
(1)

where \( C_{i,j} \) represents the \( j \)-th data feature of the division result, and \( \sum\limits_{i = 1}^{m} {\omega_{i} } = 1 \).

Inspired by the literature [5], the data on each attribute are divided into a cluster member, and a unified method is used to divide the different data subsets to obtain the feature relationships among cluster members. Thus, the clustering results \( R = \left\{ {R_{1} ,R_{2} ,R_{3} , \cdots ,R_{m} } \right\} \) of \( m \) cluster members can be obtained, and each cluster member \( R_{i} \) has \( k_{i} \) matrix features. A minimal sketch of this partition is given below.
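The following sketch illustrates the attribute-value division of Eq. (1) on categorical records; the record layout and function name are illustrative assumptions.

```python
from collections import defaultdict

def cluster_members_by_attribute(X):
    """Partition the data set by attribute value, one cluster member per attribute.

    X is a list of records (tuples of categorical values); the i-th cluster
    member R_i groups record indices by their value on attribute i, as in Eq. (1).
    """
    m = len(X[0])  # number of attributes
    members = []
    for i in range(m):
        groups = defaultdict(list)
        for idx, record in enumerate(X):
            groups[record[i]].append(idx)   # C_{i,j}: indices sharing value j
        members.append(list(groups.values()))
    return members  # R = {R_1, ..., R_m}

# Usage: three records, two categorical attributes.
X = [("red", "small"), ("red", "large"), ("blue", "small")]
print(cluster_members_by_attribute(X))
# [[[0, 1], [2]], [[0, 2], [1]]]
```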

The similarity-based feature collection method [6] refers to the process of collecting metadata from the big data set representing the privacy of heterogeneous networks. By constructing a feature matrix, the combined clustering partition of multiple large data sets is found, and the similarity between any two data points is used to describe and define the clustering features of large data sets. First, cloud computing technology is used to extract features from the privacy big data sets in heterogeneous networks and classify them according to their characteristics [7]. Second, the content characteristics of the metadata are represented: the metadata are regarded as a vector space generated by a set of orthogonal privacy terms in a heterogeneous network. If \( t_{i} \) is treated as a term and \( w_{i} \left( d \right) \) as the weight of \( t_{i} \) in the metadata \( d \), then each metadata item \( d \) can be regarded as a normalized feature vector \( V\left( d \right) = \left( {t_{1} ,w_{1} \left( d \right);t_{2} ,w_{2} \left( d \right); \cdots ;t_{n} ,w_{n} \left( d \right)} \right) \). In general, all data appearing in \( d \) are taken as the terms \( t_{i} \), and \( w_{i} \left( d \right) \) is defined as a function of the frequency \( tf_{i} \left( d \right) \) with which \( t_{i} \) appears in \( d \), i.e. \( w_{i} \left( d \right) = \vartheta \left( {tf_{i} \left( d \right)} \right) \). Extracting the frequency function yields the characteristic function of the big data set:

$$ \vartheta = \left\{ \begin{aligned} 1,\quad tf_{i} \left( d \right) \ge 1 \hfill \\ 0,\quad tf_{i} \left( d \right) = 0 \hfill \\ \end{aligned} \right. $$
(2)

Alternatively, the square-root form of \( \vartheta \) is \( \vartheta = \sqrt {tf_{i} \left( d \right)} \), and the logarithmic form is \( \vartheta = \log \left( {tf_{i} \left( d \right) + 1} \right) \).

The feature function of the big data set is processed for similarity in the same way, preparing for the subsequent data mining process.
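For illustration, the three weighting functions above can be written directly, where tf stands for the term frequency \( tf_{i} \left( d \right) \):

```python
import math

def boolean_weight(tf):
    """Eq. (2): 1 if the term appears in the metadata, 0 otherwise."""
    return 1 if tf >= 1 else 0

def sqrt_weight(tf):
    """Square-root variant of the weighting function."""
    return math.sqrt(tf)

def log_weight(tf):
    """Logarithmic variant of the weighting function."""
    return math.log(tf + 1)

# Usage: weight of a term occurring 3 times in metadata d.
tf = 3
print(boolean_weight(tf), sqrt_weight(tf), log_weight(tf))
```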

2.2 Cloud Computing Based Big Data Set Clustering Mining

First, cloud computing technology is used to randomly extract metadata features and transform the private data set into structured data that can describe the metadata content. Then, cluster analysis of the big data set is used to form a structured metadata tree; new big data set concepts are discovered from this structure, and the corresponding logical relationships are obtained. The cloud computing based big data set mining process is shown in Fig. 2.

Fig. 2. Big data set mining process diagram

Since the amount of data in a private big data set in a heterogeneous network is very large, the dimension of the metadata feature vectors is also very large, possibly reaching tens of thousands of dimensions. We therefore extract the network terms with higher weights as the feature items of the metadata, reducing the dimension of the feature vectors. The feature clustering mining process for big data sets then proceeds as follows:

(1) Select some of the most representative data features from the original features.

(2) According to the similarity method principle [8], select the most influential feature data set.

(3) Transform the original features into fewer new features by means of mapping or transformation in cloud computing technology [9, 10].

(4) Using the evaluation function method [11], evaluate each feature in the feature set independently, assign it an evaluation score, and select a predetermined number of best features as the feature subset of the big data set, as sketched below.
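A minimal sketch of step (4), assuming per-column variance as a stand-in for the evaluation function of [11], which the paper does not specify:

```python
import numpy as np

def select_features(X, num_features, score_fn=np.var):
    """Score each feature independently and keep the best ones (step (4)).

    score_fn stands in for the evaluation function of [11]; per-column
    variance is used here purely for illustration.
    """
    scores = np.array([score_fn(X[:, j]) for j in range(X.shape[1])])
    best = np.argsort(scores)[::-1][:num_features]  # indices of top features
    return np.sort(best), scores

# Usage: keep the 2 highest-scoring of 4 features.
rng = np.random.default_rng(1)
X = rng.normal(scale=[0.1, 2.0, 1.0, 0.5], size=(200, 4))
idx, scores = select_features(X, num_features=2)
print("selected feature indices:", idx)
```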

Let there be a sample set \( X = \left\{ {X_{1} ,X_{2} ,X_{3} , \cdots ,X_{n} } \right\} \) to be classified, where \( n \) is the number of elements in the sample and \( c \) is the number of target clusters. Then the data mining matrix of the \( n \) elements with respect to the \( c \) classes is:

$$ \mu_{c} = \left[ {\begin{array}{*{20}l} {\mu_{11} ,\mu_{12} , \cdots ,\mu_{1n} } \hfill \\ \vdots \hfill \\ {\mu_{c1} ,\mu_{c2} , \cdots ,\mu_{cn} } \hfill \\ \end{array} } \right] $$
(3)

where \( \mu_{ij} \) represents the mining matrix feature of the \( j \)-th element with respect to the \( i \)-th class (\( 1 \le i \le c,1 \le j \le n \)) and satisfies \( \hbox{min} \,J\left( {X,\mu ,v} \right) = \sum\limits_{i = 1}^{c} {\sum\limits_{j = 1}^{n} {\mu_{ij} d_{ij}^{2} } } \). The problem of clustering this multivariate data set is thus converted into the simple problem of finding the minimum of the objective function.
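The objective function \( J\left( {X,\mu ,v} \right) \) can be evaluated directly from the membership matrix of Eq. (3); the sketch below assumes Euclidean distances \( d_{ij} \) between cluster centers \( v \) and points \( X \) (an illustrative assumption).

```python
import numpy as np

def clustering_objective(X, mu, v):
    """J(X, mu, v) = sum_i sum_j mu_ij * d_ij^2, as defined above.

    X  : (n, d) data points.
    mu : (c, n) membership matrix from Eq. (3).
    v  : (c, d) cluster centers (the 'v' in min J(X, mu, v)).
    """
    # d_ij: Euclidean distance from center i to point j.
    d = np.linalg.norm(v[:, None, :] - X[None, :, :], axis=-1)  # (c, n)
    return float((mu * d**2).sum())

# Usage: two clusters, three points, hard memberships.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
mu = np.array([[1, 1, 0], [0, 0, 1]], dtype=float)
v = np.array([[0.05, 0.0], [5.0, 5.0]])
print(clustering_objective(X, mu, v))
```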

The cloud computing based big data set clustering mining process builds on ordinary big data set mining by adding cloud computing technology and adding constraints to the objective function, enforcing a cluster search that satisfies the given conditions and turning the search into a clustering process constrained by supervision information.

Assume an unlabeled big data set \( X = \left\{ {X_{1} ,X_{2} ,X_{3} , \cdots ,X_{n} } \right\} \), \( X_{n} \in R_{n} \), divided into \( K \) classes \( C_{1} ,C_{2} ,C_{3} , \cdots ,C_{K} \), where the mean of each class is \( M_{1} ,M_{2} ,M_{3} , \cdots ,M_{K} \). If the number of samples in the \( K \)-th class is \( N_{K} \), then \( m_{K} = \frac{1}{{N_{K} }}\sum\limits_{{X_{i} \in C_{K} }} {X_{i} } \), \( K = 1, \cdots ,K \).

According to the Euclidean distance and the within-class sum-of-squared-error criterion, the objective function of cloud based big data set clustering is \( J = \sum\limits_{K = 1}^{K} {\sum\limits_{{X_{i} \in C_{K} }} {\left\| {X_{i} - m_{K} } \right\|^{2} } } \). When the algorithm is initialized, the center of each class is selected randomly, so the choice of initial centers determines the quality of the clustering results. After introducing cloud computing technology, a large data set is formed from a small number of labeled samples; this set covers all \( K \) clusters, and each class contains at least one sample, which realizes the cloud computing based big data set mining process.
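A minimal sketch of this seeded initialization, in which the labeled samples fix the initial class centers before the usual mean-update iterations; the function name and the k-means-style update are illustrative assumptions.

```python
import numpy as np

def seeded_kmeans(X, seeds, K, iters=20):
    """K-means whose initial centers come from a small labeled sample set.

    seeds : dict mapping class index k -> array of labeled points in class k;
            per the text, every one of the K clusters has at least one seed.
    """
    centers = np.array([seeds[k].mean(axis=0) for k in range(K)])
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute m_K as the mean of the N_K samples in class K.
        for k in range(K):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

# Usage: two seeded clusters.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
seeds = {0: X[:2], 1: X[-2:]}
labels, centers = seeded_kmeans(X, seeds, K=2)
print(centers)
```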

2.3 Implementation of Large Data Set Clustering Algorithm

The cloud computing-based heterogeneous network privacy big data set clustering algorithm is implemented as follows:

When executed, the clustering algorithm requires two parameters, \( \varepsilon \) and \( \mu \), where \( \mu \) is given by the mining matrix \( \mu_{c} \) of Eq. (3) and \( \varepsilon \) denotes the neighborhood range in the spatial dimension of the heterogeneous network [12,13,14]; no orientation analysis is performed.

Core data points are found by checking the \( \varepsilon \)-neighborhood of each data point arriving at the current time. If the \( \varepsilon \)-neighborhood of any data point \( P \) contains at least \( \mu \) data points, a data matrix is created with \( P \) as the core point. Then, by breadth-first search, the data points that are directly density-reachable from these core points are aggregated, and all points density-reachable from \( P \) are assigned to one class.

If \( P \) is a core data point, the cluster data points starting from \( P \) are marked as the current class, and the next step expands from the center of the matrix. If \( P \) is not a core data point, the algorithm proceeds to the next data point, in order, until a complete set of cluster core data points is found. An unprocessed core data point is then selected to start the next expansion, and the clustering continues in sequence until all data points are marked [15, 16].
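The breadth-first expansion can be sketched as follows, where eps and min_pts again stand for \( \varepsilon \) and \( \mu \), and the label -1 marks points left as noise; these names and the Euclidean neighborhood are illustrative assumptions.

```python
from collections import deque
import numpy as np

def expand_clusters(X, eps, min_pts):
    """Breadth-first expansion from core points, as described above.

    Returns a label per point; -1 marks points left as noise.
    """
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    is_core = np.array([len(nb) - 1 >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    current = 0
    for p in range(n):
        if labels[p] != -1 or not is_core[p]:
            continue  # already processed, or not a core point
        # Mark every point density-reachable from core point P.
        queue = deque([p])
        labels[p] = current
        while queue:
            q = queue.popleft()
            if not is_core[q]:
                continue  # border points do not expand further
            for r in neighbors[q]:
                if labels[r] == -1:
                    labels[r] = current
                    queue.append(r)
        current += 1
    return labels

# Usage: two well-separated groups yield two cluster labels.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
print(set(expand_clusters(X, eps=0.5, min_pts=4)))
```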

Data points that are not added to the clustering matrix are noise points and are temporarily stored in an invalid area. If the amount of data in the invalid area exceeds the maximum preset threshold, the algorithm is called to cluster the data in this temporary storage area, and the data points that have been clustered are deleted from it. The dynamic data of this secondary clustering is recorded as \( Q \), and the clustering calculation process for \( Q \) is as follows:

(1) The large data set of mixed attribute features is processed using different distance calculation methods, and new data point features are calculated using Eq. (2).

(2) The features of the large data sets are maintained online, and mining is performed after maintenance.

(3) The clustering algorithm is executed; any data not yet clustered is placed in the temporary storage area.

(4) The data feature matrix is mined again and the clustering algorithm is executed until the core data points are found. A sketch of this staging-area policy follows.
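A sketch of the staging-area policy above, under the assumption that unclustered points accumulate in a buffer that is re-clustered once it exceeds the preset threshold; max_size and cluster_fn are hypothetical names.

```python
import numpy as np

class StagingArea:
    """Buffer for unclustered (noise) points, re-clustered once it overflows.

    max_size stands in for 'the maximum range of the preset threshold';
    cluster_fn is any callable mapping an (n, d) array to per-point labels,
    with -1 meaning 'still noise' (e.g. expand_clusters above).
    """

    def __init__(self, max_size, cluster_fn):
        self.max_size = max_size
        self.cluster_fn = cluster_fn
        self.buffer = []  # the invalid area / temporary storage

    def add(self, point):
        """Store a noise point; trigger secondary clustering Q on overflow."""
        self.buffer.append(point)
        if len(self.buffer) > self.max_size:
            return self.flush()
        return None

    def flush(self):
        """Cluster the buffered points, then drop the ones that clustered."""
        Q = np.array(self.buffer)
        labels = self.cluster_fn(Q)
        # Keep only the points that are still noise (label -1).
        self.buffer = [p for p, lab in zip(self.buffer, labels) if lab == -1]
        return labels

# Usage (assuming expand_clusters from the previous sketch):
# area = StagingArea(max_size=50,
#                    cluster_fn=lambda Q: expand_clusters(Q, eps=0.5, min_pts=4))
```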

Cloud computing technology is used to guide the clustering of big data sets [17, 18], which solves the problem that a single algorithm yields low clustering quality. First, the data points \( x \in X = \left\{ {d_{1} ,d_{2} , \cdots ,d_{n} } \right\} \) are loaded into memory, where \( d_{i} \) denotes a data point in memory. Each big data subset is represented by a triple and participates in clustering as a center point with a weight, where the number of data points is the weight; the clustering result is the label set \( labels = U^{K} \). Outputting the \( K \) disjoint big data matrices \( \left\{ {X_{i} } \right\}_{i = 1}^{K} \) of \( X \), with objective function \( J = \sum\limits_{i = 1}^{K} {J_{i} } \), yields the locally optimal clustering of the big data set.
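This weighted-center representation can be illustrated by merging per-subset summaries into one weighted center point, where each subset's weight is its number of data points; the (count, centroid) pair layout below is an illustrative simplification of the triple.

```python
import numpy as np

def weighted_center(summaries):
    """Merge (n_points, centroid) summaries into one weighted center point.

    Per the text, each subset participates in clustering as a center point
    whose weight is its number of data points; the pair layout is an
    illustrative assumption, not the paper's exact triple.
    """
    weights = np.array([n for n, _ in summaries], dtype=float)
    centers = np.array([c for _, c in summaries])
    return (weights[:, None] * centers).sum(axis=0) / weights.sum()

# Usage: subsets of 3 and 5 points with known centroids.
print(weighted_center([(3, np.array([1.0, 2.0])), (5, np.array([2.0, 1.0]))]))
# -> weighted mean, closer to the heavier subset's centroid
```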

This completes the design of the cloud computing based heterogeneous network privacy big data set clustering algorithm.

3 Simulation Experiment Demonstration and Analysis

In order to verify the effectiveness of the cloud computing based clustering algorithm for heterogeneous network privacy big data sets, simulation experiments were carried out.

The experimental objects are private UCI data sets of a heterogeneous network, on which clustering calculations are performed.

To ensure the validity of the experiment, the traditional algorithm and the cloud based clustering algorithm are compared, and the accuracy of the two algorithms is statistically analyzed. The experimental results are shown in Tables 1 and 2 and Fig. 3.

Table 1. Traditional clustering algorithm results
Table 2. Cloud computing based clustering algorithm results
Fig. 3. Accuracy analysis of the two clustering algorithms

According to the data in Tables 1 and 2, the error rate of the traditional clustering algorithm is higher than that of the proposed algorithm, and its clustering speed is slower. This shows that the proposed heterogeneous network privacy big data clustering algorithm outperforms the traditional algorithm.

According to Fig. 3, the clustering accuracy of the proposed algorithm reaches 99%, while that of the traditional clustering algorithm is only 91%, which indicates that the clustering accuracy of the proposed algorithm is higher than that of the traditional algorithm.

In summary, during the clustering of heterogeneous network privacy big data sets, the cloud computing based clustering algorithm outperforms the traditional algorithm in clustering speed as well as in handling the structure, traffic, and dimensionality of each heterogeneous network privacy data set. The cloud computing based heterogeneous network privacy big data clustering algorithm not only improves the clustering accuracy of private data sets in heterogeneous networks but also improves the stability of the calculation process, with the clustering error gradually decreasing toward zero.

4 Conclusion

This paper analyzes and designs a cloud computing based clustering algorithm for heterogeneous network privacy big data sets, using the advantages of cloud computing technology to collect and extract the matrix features of large data sets, and combining the similarity method to mine large data sets and realize the clustering algorithm design. The experimental results show that the proposed cloud computing based clustering algorithm is highly efficient: when clustering the privacy big data sets in heterogeneous networks, it greatly improves the accuracy of the clustering calculation, effectively reduces the clustering error, saves calculation time, and improves the working efficiency of the clustering algorithm. It is hoped that this research can provide a theoretical basis and reference for China's research on heterogeneous network privacy big data set clustering algorithms.