1 Introduction

Since Toffler proposed the concept of big data in 1980, its prospects have kept expanding, and it has penetrated all aspects of life, work and research [1, 2]. Nowadays, driven by information technology, scattered data are being collected and gradually concentrated into large, complex datasets. The rapid development of big data has brought huge benefits to the high-technology industry and has changed user habits to a certain extent, attracting the attention of many financially strong companies. In 2017, IT companies such as Alibaba, Tencent, and Jingdong (JD.com) invested heavily in big data research to capture its financial returns. For example, the business efficiency of Didi taxis, shared bicycles, Taobao, and similar services has been improved through the application of big data [3]. The strategic importance of big data technology lies not merely in mastering a large amount of data, but in extracting added value from these data [4, 5]. In other words, if big data are compared to an industry, the key to profitability in this industry is to process the data and enhance data-computing capabilities so as to realize data appreciation. From a technical point of view, big data and cloud computing are as inseparable as the two sides of a coin [1, 6]. Big data cannot be processed by a single computer and must use a distributed architecture [7, 8], which provides distributed data mining over large amounts of data; it therefore relies on cloud computing for distributed processing, distributed data storage, and virtualization. With the advent of the cloud era, big data has attracted more and more attention. Analysts often use the term big data to describe the large volumes of unstructured and semistructured data created by companies, data that would take too much time and money to load into a relational database for analysis [9, 10]. Big data analytics is frequently associated with cloud computing because real-time analysis of large datasets requires a framework such as MapReduce to distribute work across dozens, hundreds or even thousands of computers [11, 12]. Therefore, the study of machine learning algorithms in the context of big data plays an important role in the development of countries, enterprises and society. Machine learning is indispensable in today's big data processing. For example, in 2017 the match between AlphaGo and Ke Jie ended with a score of 3–0, an important milestone for machine learning. Machine learning overcomes the limitations of human factors, effectively processes data through neural networks, decision trees, and deep learning, and improves data operations.

The concept of the "Internet of Things" dates back to 1990 and originated from the Internet [13, 14]. In its later development, however, it split into two strands: one is an extension based on the Internet, and the other extends to information transfer, exchange, and communication between objects. It mainly refers to the use of sensors, two-dimensional codes and other technologies to acquire product information, and the use of Internet of Things and communication networks for transmission and storage [15, 16]. With the rapid development of Internet of Things technology, the big data context can provide richer information resources for the Internet of Things [17, 18]. Conversely, the development of Internet of Things technology will also accelerate the arrival of the era of big data. It can be seen that the relationship between the Internet of Things and big data has always been complementary and inseparable; only on this basis can China quickly move toward a smart society [19, 20]. Big data include structured, semistructured, and unstructured data, and unstructured data are increasingly becoming the major part. According to IDC's survey report, 80% of the data in enterprises are unstructured, growing exponentially at 60% per year. Big data analysis through machine learning algorithms opens a new era of big data analysis [21, 22]. In the context of technological innovation represented by cloud computing, no myths or awe are needed: data that once seemed difficult to collect and use are beginning to be easily exploited, and big data will gradually create more value for human beings through continuous innovation in all areas of life.

This research is grounded in the current situation of social development and plays a role in promoting the better development of society. Through a comprehensive analysis of the research background of big data and the status quo of machine learning research, we find that effective use of research results from the field of machine learning can better solve big data problems. In order to improve the value density of massive unstructured data and remove redundant, noisy garbage data, this paper takes unstructured data as samples and uses related machine learning algorithms for preprocessing, dimensionality reduction, and predictive model training. The efficiency of the analysis is then compared with that of a traditional database, with good results.

2 Proposed method

2.1 Machine learning

Machine learning is a hot research area in current computer science and artificial intelligence. The industry has no uniform standard definition of "machine learning," but machine learning is generally a model of human cognitive and learning processes that combines the computational power of computers to simulate human behavior and acquire new knowledge or skills. It uses prior knowledge and training data to guide learning and continually adapts existing knowledge structures to improve performance. In recent years, many machine learning algorithms have been widely used in engineering practice and scientific research, such as data clustering, support vector machines (SVM), nonlinear regression, neural networks, and genetic algorithms. Whether in speech recognition, credit monitoring, and risk prediction or in data mining of big datasets, machine learning algorithms play an irreplaceable practical guiding role. Machine learning also plays a major role in big data research. For example, Google's success in text processing is due to machine learning, and building big data storage warehouses on Hadoop clusters requires extensive knowledge of neural networks and of supervised and unsupervised learning. Amazon's product recommendation system is likewise a combination of big data and machine learning. Deep analytics for big data is also based on statistical analysis and machine learning.

The development of machine learning has two main research directions. The first is the study of learning mechanisms, whose focus is machine learning techniques themselves. With the development of the big data environment, data analysis has high application requirements in many fields of society. Through machine learning, corresponding knowledge can be acquired quickly, promoting the development of machine technology. In the big data environment, machine learning should highlight the important role of learning, gradually expand its practical scope, carry out data analysis on the basis of machine learning, efficiently process different pieces of data information, and clarify the basic goals of machine learning. The second direction is the study of the rational use of information, whose focus is finding valuable information in vast data repositories. In the big data environment, data are generated ever more efficiently, and the overall quantity and variety of data have changed greatly. In-depth analysis of various new types of data, such as text analysis, content-based image search, and image processing, is pushing machine learning research toward diversified, comprehensive development. At present, the rational selection of semisupervised learning methods to strengthen the quality of training data and enhance learning ability is a key issue of concern for the relevant communities. Big data is fundamental to artificial intelligence, and turning big data into knowledge or productivity is inextricably linked to machine learning. One can say that machine learning is the core of artificial intelligence and the fundamental way to give machines human-like intelligence. The task of machine learning is to discover useful information contained in large data volumes; the more data it processes, the more machine learning shows its advantages. Problems such as speech recognition, image understanding and weather forecasting can be solved, or their performance greatly improved, by providing big data. The K-nearest neighbor (K-NN) learning method groups similar data samples into one category according to certain rules, echoing the proverb "birds of a feather flock together." The basic idea of the K-NN method is to first extract the features of the new datum to be classified or tested and compare them with the features of each datum in the original sample set. Then, the K closest samples are selected from the comparison results, and the category that appears most often among these K samples determines the class of the new datum. For a pattern recognition problem with c classes w1, w2, …, wc, where class i has Ni samples (i = 1, 2, …, c), the discriminant function of class wi can be specified as:

$$ g_{i} ({\mathbf{x}}) = \min_{k} \left\| {{\mathbf{x}} - {\mathbf{x}}_{i}^{k} } \right\|,\quad k = 1,2, \ldots ,N_{i} . $$
(1)

For an unknown sample x_u, one simply compares the Mahalanobis distance between x_u and the N samples of the known categories, where m and C are the mean and covariance matrix of the known sample set S, respectively:

$$ d = \sqrt {\left( {{\mathbf{x}}_{\text{u}} - {\mathbf{m}}} \right)^{\text{T}} {\mathbf{C}}^{ - 1} \left( {{\mathbf{x}}_{\text{u}} - {\mathbf{m}}} \right)} $$
(2)
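Putting Eqs. (1) and (2) together, the following is a minimal Python (NumPy) sketch of the K-nearest neighbor rule with the Mahalanobis distance; the toy data and the choice of K are assumptions made only for illustration.

```python
import numpy as np
from collections import Counter

def mahalanobis(a, b, C_inv):
    """Mahalanobis distance between points a and b, cf. Eq. (2)."""
    diff = a - b
    return np.sqrt(diff @ C_inv @ diff)

def knn_classify(x_u, samples, labels, k=3):
    """Assign the unknown sample x_u to the majority class among its
    k nearest training samples (the K-NN rule)."""
    C_inv = np.linalg.inv(np.cov(samples, rowvar=False))  # shared covariance of S
    dists = [mahalanobis(x_u, s, C_inv) for s in samples]
    nearest = np.argsort(dists)[:k]                       # indices of the k closest samples
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative data: two 2-D classes (assumed for demonstration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_classify(np.array([2.5, 2.5]), X, y, k=5))      # -> 1
```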

The unknown sample is then assigned to the same category as the sample closest to it. The algorithm has several advantages: it is simple and easy to understand; it needs no modeling or training; and it is easy to implement, suitable for classifying rare events, and suitable for multiclass problems. However, it also has shortcomings: it is a lazy algorithm with large memory overhead; classifying a test sample requires heavy computation, so performance is low; and its interpretability is poor, since it cannot produce rules such as those of a decision tree. The support vector machine (SVM) algorithm is another classic machine learning algorithm and has achieved good results in both theoretical analysis and practical applications. A straight line is used to divide the data into two categories; this line serves as a linear discriminant function, recorded as:

$$ g(x) = \omega^{\text{T}} x + b. $$
(3)

In general, this line corresponds to a hyperplane, and the optimal classification hyperplane satisfies:

$$ \omega^{\text{T}} x + b = 0. $$
(4)

For data that are not linearly separable, the samples are transformed by a nonlinear mapping from the low-dimensional input space into a higher-dimensional feature space, in which linear classification becomes possible. After the mapping, the classification function can be expressed as:

$$ f(x) = \sum\limits_{i = 1}^{n} {\omega_{i} } \varphi \left( {x_{i} } \right) + b. $$
(5)

Following the perceptron idea, the classification function can be rewritten in terms of inner products between mapped samples:

$$ f(x) = \sum\limits_{i = 1}^{n} {\alpha_{i} y_{i} \left\langle {\varphi \left( {x_{i} } \right),\varphi (x)} \right\rangle } + b $$
(6)
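To make Eq. (6) concrete, the hedged sketch below evaluates f(x) = sum_i alpha_i y_i <phi(x_i), phi(x)> + b with an RBF kernel standing in for the inner product; the support vectors, dual coefficients and bias are invented here for illustration and would normally come from a previously trained SVM.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2), playing the role
    of the inner product <phi(a), phi(b)> in Eq. (6)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, support_vectors, alphas, ys, b, gamma=0.5):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; its sign gives the class."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for a, y, sv in zip(alphas, ys, support_vectors)) + b

# Assumed toy values: two support vectors with dual coefficients from training
svs = np.array([[0.0, 0.0], [2.0, 2.0]])
alphas, ys, b = np.array([1.0, 1.0]), np.array([-1, 1]), 0.0
print(np.sign(svm_decision(np.array([1.8, 1.9]), svs, alphas, ys, b)))  # -> 1.0
```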

In addition to the SVM algorithm, the classic K-means clustering algorithm of machine learning is a partition-based clustering algorithm that can also be used for data analysis. The algorithm calculates the distance between each object and the defined center points and iteratively optimizes the coordinates of the center points to obtain the best clustering result, minimizing the objective:

$$ J_{c} = \sum\limits_{i = 1}^{k} {\sum\limits_{{p \in C_{i} }} {\left\| {p - M_{i} } \right\|^{2} } } . $$
(7)
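The following NumPy sketch implements the K-means iteration that minimizes the objective J_c of Eq. (7); the toy data and the choice of k are assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Alternate assignments and centroid updates to minimize
    J_c = sum_i sum_{p in C_i} ||p - M_i||^2 from Eq. (7)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial center points
    for _ in range(iters):
        # Assign each object to its nearest center point
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):                # converged
            break
        centers = new_centers
    J_c = sum(((X[labels == i] - centers[i]) ** 2).sum() for i in range(k))
    return labels, centers, J_c

# Illustrative data: two well-separated 2-D blobs (assumed for demonstration)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 5)])
labels, centers, J = kmeans(X, k=2)
print(centers, J)
```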

The artificial neural network is another classic machine learning algorithm. Neural networks fit training data well and have applications in many fields, such as medicine, physiology, philosophy, informatics and computer science. Although artificial neural networks have achieved good results in some areas, their support for big data is still at an early stage, and many issues remain to be resolved, for example, how to determine the number of layers and nodes of a network and how to improve its training speed, especially in massive-data environments where data exhibit high-dimensional attributes; big data technology is a key technology for solving these problems. Machine learning mainly includes the following steps. (1) Selecting the type of training experience, which provides direct or indirect feedback for system decision making. For example, in learning a maze problem, whether the current position allows a step in a certain direction provides the most direct feedback for the system, while the final destination provides indirect feedback that keeps the walk from deviating from the correct direction. Another consideration is the extent to which the training sample sequence is controlled, ranging over passive learning, active inquiry learning and completely autonomous learning, all of which imitate human learning styles. A final issue is how close the training samples are to the true sample distribution, which greatly affects the final evaluation of the learning results: if the training samples differ too much from the actual distribution, the learner may still perform poorly in testing even though it performs very well on the training samples. (2) Choosing the objective function: improving learning performance means learning a specific objective function. In some cases, the optimal objective function is not operable, and a second-best choice must be made; for any classification task, the optimal objective is the smallest error rate, but even for a simple Gaussian density distribution, directly minimizing the error rate is not easy. In a word, the process of learning is a search in the hypothesis space for the hypothesis that best conforms to the existing training examples and prior knowledge.

2.2 Internet of Things

The Internet of Things is a new technology application model for quickly acquiring remote information through modern wireless communication technologies. The Internet of Things (IoT) refers to "connecting everything through the Internet"; that is, information is obtained by loading it onto information-sensing devices such as radio-frequency identification tags, forming a network that intelligently identifies ubiquitous information and transmits it through ubiquitous Internet connections. This application model is based on ubiquitous information-gathering devices, such as radio-frequency identification (RFID) tags, sensors, actuators and mobile phones, each uniquely addressable, so that objects can communicate and cooperate with one another toward common goals. In 2005, the International Telecommunication Union (ITU) released the "ITU Internet Report 2005: The Internet of Things," announcing that the era of ubiquitous "Internet of Things" communication was coming. The report describes the Internet of Things as follows: information and communication technology, which once connected anyone at any time and in any place, is gradually evolving to the stage of connecting anything. Various information-sensing technologies connect real-time information on all items, including materials, spare parts, work-in-process and finished products in the supply chain, to the Internet to achieve intelligent management and identification. The Internet of Things consists of a three-tier architecture. The first is the perception layer, which collects information through sensing devices such as RFID tags and sensors. The second is the transport layer, which enables the transmission and sharing of information through existing local area networks, wide-area networks, the Internet and communication networks, together with data analysis and exchange technologies such as the electronic product code (EPC) and electronic data interchange (EDI). The third is the application layer, which processes and applies the acquired sensor data, including applications and display terminals; applications installed on mobile phones, computers and other devices operate according to their business logic. Key technologies related to the Internet of Things include radio-frequency identification (RFID), sensor technology, nanotechnology, intelligent embedded technology, and network communication technology.
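As a purely conceptual sketch of the three-tier flow just described, the Python fragment below passes a sensor reading from the perception layer through the transport layer to the application layer; every name in it is an illustrative assumption, not part of any real IoT stack.

```python
from dataclasses import dataclass

@dataclass
class SensorReading:
    """Datum produced by the perception layer (e.g., an RFID tag or sensor)."""
    device_id: str
    payload: dict

def transport_layer(reading: SensorReading) -> SensorReading:
    # Stand-in for transmission over LAN/WAN/Internet; in a real system,
    # EPC/EDI-style exchange formats and serialization would apply here.
    return reading

def application_layer(reading: SensorReading) -> str:
    # Processing and display of the acquired sensor data.
    return f"{reading.device_id}: {reading.payload}"

reading = SensorReading("rfid-001", {"temp_c": 21.5})
print(application_layer(transport_layer(reading)))  # rfid-001: {'temp_c': 21.5}
```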

In the development of modern science and technology, high technologies such as cloud computing, IoT technology and information-resource sharing can effectively improve a city's level of intelligence and utility management and make the lives of urban residents better. This trend of penetration into urban government, society, economy and infrastructure is the trend of the smart city. A good example of using big data and IoT technologies in the healthcare field is identifying patients by RFID technology, which is used for matching, patient positioning, vital-sign collection and monitoring management. Specifically, patients wear an electronic wristband when admitted to the hospital so that their identity information is always available; within the coverage of the radio-frequency identification network, doctors can identify, organize, track and record a patient's identity anytime, anywhere. The Internet of Things and big data have been inseparable from the beginning: (1) the Internet of Things is a new Internet model developed from Internet technology and enriches the content of big data; (2) the Internet of Things generated big data from the start of its development, and big data in turn promote the improvement of the Internet of Things; (3) the mobile intelligent terminal is a multifunctional IoT platform and the main application mode of the Internet of Things in the big data environment; (4) the Internet of Things can bring the greatest functionality and value to smart cities and is the primary condition for building them.

The emergence and development of the Internet of Things bring not only the rapid development of social productivity, but also another great innovation to the modes of production, lifestyles and ways of thinking of human society. (1) Reforming the mode of human production. The Internet of Things (IoT) is an integrated innovation of human ingenuity spanning sensing technology, information technology, intelligent computing technology and wireless communication technology, and it is a space of interconnection between the physical world and the network world. It will greatly promote the integration of industrialization and informatization, promote the adjustment of the economic structure and socioeconomic development, and thus transform modes of production. (2) Changing the way of human life. At present, the Internet of Things covers smart industry, smart agriculture, smart logistics, smart transportation, smart grid, smart environmental protection, smart security, smart medicine and smart home. It will bring people unprecedented convenience and comfort and completely change the way humans live. (3) Changing people's way of thinking. As an important part of the new generation of information technology, IoT technology represents a new carrier of information dissemination and a new connotation of scientific and technological innovation. On the surface, these new tools, technologies and methods gradually change and influence people's daily behavior; at a deeper level, the intelligent society brings changes not only in lifestyle but also in modes of thinking.

2.3 Big data analysis

The theoretical basis of big data analysis technology is a large amount of sample data, that is, data with accurate sources, rich content and intrinsic connections. Big data analysis theory mainly includes two analysis strategies, cluster analysis and association analysis, on which predictive analysis methods are based. At present, big data processing technologies mainly include distributed computing technology, in-memory computing technology and stream processing technology, each applicable to different fields. In-memory computing technologies were developed to address issues such as efficient data reading and online real-time processing. Stream processing technology handles real-time, continuous, uncontrolled data streams. Distributed computing technology breaks problems down into many small tasks that are assigned to multiple computers for processing. Open-source Hadoop has become the mainstream distributed computing technology, of which the Hadoop distributed file system (HDFS) and the parallel programming framework MapReduce are the two core components; it offers good scalability, efficient equipment utilization and high reliability. Distributed computing technology is applicable to distributed data sources such as those collected by power enterprises. In-memory computing technology puts large-scale data into memory for query and analysis, avoiding the large time overhead of disk reads and writes and thus greatly increasing calculation speed. As an emerging engine of in-memory computing, Spark's main advantage is the resilient distributed dataset (RDD), a cluster-based distributed memory abstraction: Spark reads the required data into memory once and reuses it. As the name implies, stream processing techniques treat continuously arriving datasets as data streams and return results as soon as the data appear, so that results are calculated, analyzed and presented on the latest data as quickly as possible. Storm is a representative stream processing technology, used mainly for real-time computation and online machine learning. With the rapid development of smart substations, the real-time requirements of grid monitoring data are becoming higher and higher, and the organic combination of stream processing technology and the intelligent substation will inevitably become a mainstream trend in the future.
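Since the paragraph above names HDFS, MapReduce and Spark's RDD abstraction, a minimal hedged sketch may help: the PySpark program below runs the classic MapReduce word count over an RDD. The input and output paths are hypothetical, and a working local Spark installation is assumed.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")           # local stand-in for a cluster
lines = sc.textFile("hdfs:///data/input.txt")        # hypothetical HDFS path
counts = (lines.flatMap(lambda line: line.split())   # map: emit each word
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))     # reduce: sum counts per word
counts.saveAsTextFile("hdfs:///data/word_counts")    # hypothetical output path
sc.stop()
```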

Cluster analysis is a foundation of big data analysis: it defines a large number of complex categories over attribute data characterized by quantity, speed and diversity, and it condenses a large amount of basic data by aggregating data of the same or similar categories. It thus becomes possible to extract, estimate and predict valid information from data sharing the same attributes. Combined with analytical methods such as cross-category correlation analysis, data can be refined to a high level, making full use of discrete, unordered and complex basic data. After a large amount of basic data is collected, analyzed and collated, cluster analysis yields a relatively stable data resource. How to identify the intrinsic links among these well-defined categories, so that the data can be fully analyzed and utilized from different perspectives, is a problem that big data analytics must address. Association analysis refers to starting from seemingly unrelated data or information, attempting correlation analysis from different angles, and forming a comprehensive judgment from the correlations obtained. By associating different types and levels of information, the data after clustering can be connected more closely across categories, providing data analysts with a reliable source of reference information and saving time in complex analysis processes. Mining the data lets the analyst understand the data better, and predictive analysis lets the analyst make predictive judgments based on the results of visual analysis and data mining.

When a large amount of basic data is collected and accessed through the database, computer technology can support the collation and filtering of large amounts of irrelevant data. However, for mobile communication optimization, which requires strong domain awareness, initial and established decision support must be provided for the optimizer. The mobile communication network in the Bihuang area has hundreds of millions of users, and the amount of data is huge. Here, the decision tree method is used for large-scale data analysis and problem location in daily optimization work. The decision tree algorithm finds deep, valuable information by purposefully classifying the underlying data. Its biggest advantage is that classification results can be described quickly in simple language, which makes it very suitable for large-scale data analysis and processing.
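As a hedged illustration of the decision-tree method described here, the sketch below uses scikit-learn's DecisionTreeClassifier on invented network-optimization features (signal strength and call drop rate); the features, labels and thresholds are assumptions for demonstration only, not the paper's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [signal_strength_dBm, call_drop_rate]; label 1 = problem cell
X = np.array([[-70, 0.01], [-95, 0.08], [-65, 0.02],
              [-100, 0.12], [-80, 0.03], [-98, 0.09]])
y = np.array([0, 1, 0, 1, 0, 1])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["signal_dBm", "drop_rate"]))  # readable rules
print(tree.predict([[-97, 0.10]]))   # classify a new cell -> [1]
```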

2.4 Unstructured data

The processing and computing power of existing traditional analytical system architectures is facing the impact of the rapid growth of big data in scale and complexity. According to research reports, the volume of data in various fields is expanding: the scale of collected data has risen from gigabytes and terabytes to exabytes and zettabytes, and the types of data are numerous. In addition to a wide range of data sources, data types are diverse, and data structures now include not only traditional structured data but also unstructured data. This makes traditional data storage solutions less and less suitable for current data, while the requirements on data processing capabilities keep increasing. The content of unstructured data usually cannot be understood directly; it must be opened by the corresponding software, which causes much trouble for later data retrieval. Moreover, the meaning of such data cannot be obtained directly from the data themselves. Unstructured data have no defined structure, cannot be standardized, and are not easy to manage, so querying, storing, updating and using them require a smarter system.

Office documents, text, images, and audio and video information in all formats are unstructured data. (1) Text: traditional full-text search technology is based on keyword matching, and its results often fail to meet demand. Intelligent search uses word-segmentation dictionaries, synonym dictionaries and homonym dictionaries to improve retrieval and combines user retrieval context analysis with user relevance feedback to assist the query, providing users with intelligent knowledge prompts and finally returning valid information accurately. Realizing these functions requires text feature extraction techniques such as text segmentation, word frequency analysis, text clustering, semantic analysis and text mining, used to preprocess the text library; the results then serve as the input of the next-layer modules to achieve similarity-based text search. (2) Images: image feature extraction is based on image analysis technology and uses the computer's extraction capabilities at three levels. Low-level visual features are the original features of the image, such as color, edge, shape, texture and layout. Intermediate object features are local features extracted with external knowledge and logical reasoning, such as specific objects or characters. High-level abstraction requires more external support to extract abstract attributes of an image, including features of specific events, specific content or style. (3) Audio: audio analysis techniques include audio feature extraction, audio classification and more. Audio feature extraction covers information such as frequency-domain energy, subband energy ratio, zero-crossing rate and bandwidth, while clip-level features such as the audio clip ratio, average subband energy ratio and spectral flux undergo corresponding extraction; the extracted features can be used for audio matching and recognition. (4) Video: video is currently the most complex type, since common video data may contain rich information such as audio, images and text, and each video file is much larger than other data, making the problem complex and variable. Video analysis can rely on the analysis techniques of the above categories of unstructured data; for example, image recognition techniques can extract key frames from a video, and the results can serve as an image summary of the video, or an image index can be built on these key frames to implement a video indexing service. In the analysis of unstructured data, the key step is to extract features from the data, and the resulting features are usually high-dimensional, so high-dimensional feature extraction involves "distance" and "dimension reduction" problems. A desirable feature extraction algorithm yields a low value of the following stress measure of distance preservation, where d_ij and d'_ij are the pairwise distances before and after the reduction:

$$ {\text{Stress}} = \sqrt {\frac{{\sum\nolimits_{i,j} {\left( {d_{ij}^{\prime } - d_{ij} } \right)^{2} } }}{{\sum\nolimits_{i,j} {d_{ij}^{2} } }}} $$
(8)
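A minimal sketch of evaluating Eq. (8) follows: it projects assumed high-dimensional features to two dimensions with PCA and computes the stress between the original and reduced pairwise distances. The data, target dimensionality and use of PCA are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist

def stress(X_high, X_low):
    """Stress of Eq. (8): a low value means pairwise distances are preserved."""
    d = pdist(X_high)        # pairwise distances d_ij in the original space
    d_prime = pdist(X_low)   # pairwise distances d'_ij after dimension reduction
    return np.sqrt(((d_prime - d) ** 2).sum() / (d ** 2).sum())

X = np.random.default_rng(0).normal(size=(100, 50))  # assumed high-dimensional features
X2 = PCA(n_components=2).fit_transform(X)
print(stress(X, X2))
```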

3 Experiments

The machine-learning-based IoT unstructured big data analysis algorithm proposed in this paper is an online terminal analysis (OTA) algorithm. The specific design is as follows: the online terminal analysis algorithm applies the model to the input data to obtain the final analysis result and is designed mainly for unstructured data. For direct application-oriented scenarios, the training set instances selected by the OTA consist of unstructured data, and the OTA uses the distance between adjacent nodes as a weighting parameter to evaluate correlation. Figure 1 shows the file reading process of the online terminal analysis algorithm.

Fig. 1 Schematic diagram of the OTA internal file reading process under HDFS

In order to analyze in depth the performance of the online terminal analysis algorithm studied in this paper, its performance is evaluated on raw data collected by Internet of Things sensors. Because the amount of user data is large, a big data platform was created for the tests and then configured. The test big data platform uses Ubuntu Linux 10.04, Hadoop 1.0.3, and Sun Java 6. Hadoop requires SSH access to be enabled, since SSH is used to manage remote and local nodes. After the configuration is complete, the operational data are fully analyzed. Table 1 shows the time and number of nodes used for each analysis.

Table 1 Time and number of nodes used for data analysis

4 Discussion

In the four data analysis experiments, the number of sensor nodes and the time used in the experiment are shown in Fig. 2.

Fig. 2 Time and number of nodes used in the experiment

As can be seen from the figure, as the number of nodes increases, the time spent on data processing also increases; in particular, when the number of nodes reaches 240,000, the processing time grows sharply. So that the data analysis behavior of the online terminal analysis algorithm during operation can be evaluated comprehensively, the experiment goes on to analyze the name node and the data nodes. The results are shown in Fig. 3.

Fig. 3 Name node, data size, and data node

It can be seen from Fig. 3 that the name node size is positively correlated with the data node size, from which the amount of data analyzed by the OTA execution can be evaluated. In terms of the number of processing operations per second, the numbers of nodes in the four runs are the same as those in Table 1.

As can be seen from Fig. 4, the efficiency of data analysis in a traditional database is lower than that of the online terminal algorithm. In particular, when the number of nodes in the sensor network is large, the efficiency of the online terminal algorithm is much higher than that of the traditional database, yielding good results.

Fig. 4 Comparison of results between traditional database and online terminal algorithm analysis

5 Conclusions

Relational databases embody decades of structured data management technology and are now mature. However, unstructured data account for the majority of the total amount of information, and their complexity is much higher than that of structured data. Therefore, how to effectively manage unstructured data has become a top priority of data management. Managing unstructured data through the structured data obtained by heterogeneous data transformation is a very effective method. All in all, in the process of big data development, the amount of data information is increasing rapidly, and traditional single-machine learning algorithms cannot meet the basic requirements of social development. Massively parallel machine learning algorithms can meet the needs of the big data era and effectively adapt to the basic development requirements of artificial intelligence. Promoting social modernization is the focus of the future development of machine learning. This paper has explored the application of machine learning algorithms in unstructured data analysis with the support of big data storage, analysis and processing technologies.