1 Introduction

In recent years, the continuous development of science and technology has made daily life considerably more convenient. As embedded technology advances, it has become possible to add intelligence to the software, network, control, and communication technologies used in modern life (Gu and Qiu 2018).

The Internet of Things (IoT) refers to a network that interconnects physical objects. It allows every everyday physical object that can be independently addressed to join an interconnected network. It uses QR codes and bar codes to store the required information, and it uses sensing technologies such as radio frequency identification (RFID), the global positioning system (GPS), and infrared sensors for the intelligent sensing, identification, and management of items and processes (Dong et al. 2018). Deep learning is a family of machine learning algorithms that learn deep internal representations of sample data. The information obtained during learning helps interpret texts, images, and sounds and discover regularities in them (Al-Hawawreh et al. 2018). Deep learning has been one of the main drivers of the widespread development of artificial intelligence in recent years. Its ultimate goal is to give machines an analytical ability similar to that of the human brain, so that they can recognize the information contained in data such as texts, images, and sounds (Dong et al. 2017). Deep learning achieves better results in language and image recognition than earlier related technologies. It has achieved excellent results in many fields, such as computer science, machine learning, machine translation, search technology, multimedia learning, and personalized recommendation technology (Zhang et al. 2018). Deep learning allows machines to imitate human audiovisual activities and reasoning, thereby solving many complex pattern recognition problems and driving significant progress in artificial intelligence technology (Jing et al. 2018). Deep learning algorithms can make the IoT smarter and more user-friendly. It can be said that the development of artificial intelligence is indispensable for the realization of the IoT, and that the IoT in turn provides a development direction for artificial intelligence (Song et al. 2020).

IoT is closely linked to the concept of the intelligent background music system. IoT technology is the basis for the intelligent background music system; in other words, the intelligent background music system is a specific application field of IoT technology (Rathore et al. 2017). The State Council of China also attaches great importance to the development of the IoT: over the past three years, its government reports have repeatedly stressed the importance of developing IoT technology. China therefore has high hopes for the development of the IoT (Camarinhamatos et al. 2016).

In summary, this study combines deep learning algorithms with IoT technologies to investigate the intelligent background music system. The aim is to expand the application of deep learning methods and IoT technologies in the development and design of intelligent systems and to promote the development of smart homes, providing new theoretical and data support for the realization of the intelligent background music system.

The contributions include:

  1. The Fast-RCNN algorithm is utilized for extracting the underlying features of the scene images.

  2. A middle-level feature extraction algorithm is proposed, whose recognition rate can reach 87.6%.

  3. An intelligent background music system is designed based on deep learning and IoT technologies, which runs stably and presents an excellent effect.

2 Literature review

Reports on the intelligent background music system were published as early as around 2007, when the system was still in its infancy. At first, it could only be operated through remote control or through interaction between physical objects, without the help of application software (Bannan 2014). With the rapid development of Internet technology in recent years, the background music system has become more intelligent. Deep learning is an important concept in contemporary science. In China, research on deep learning started late. In the earlier stage, scholars regarded deep learning as learning based on comprehension: learners could critically learn new ideas and knowledge, combine their existing cognition with the newly learned knowledge, and connect multiple structures of thought. Transferring existing knowledge to new situations was seen as a way of learning to solve problems and make decisions. Deep learning has achieved tremendous success in various application fields (Ranjan et al. 2018). It can learn sophisticated user preferences from raw data, thereby improving recommendation. Liu and Chen (2018) treated the matching of pictures and music under an IoT and deep learning model as a background music recommendation problem in an easily constructed space; a deep learning algorithm was applied to the captured pictures and the music data to obtain semantic representation vectors of the pictures and the music, respectively. For background music, a semantic vector built from a set of adjectives was constructed to describe the mood of the music (Saari et al. 2017). Then, based on large-scale statistical data, the correlation between the two representations was calculated to recommend background music (Ouyang et al. 2019).

Lv et al. (2018) believed that a background music system based on deep learning should focus on analyzing public video databases for standard scenario features and on training a video scenario classifier. Thus, a structured support vector machine (SVM) model was constructed to describe the video scenarios and video content features, such as color and motion. The music features were also described in detail, and appropriate candidate music was selected based on the trained model. To mine the matching relation between video content and background music, a music recommendation model was trained. First, video clips with professional background music were collected from movies and used as training samples for the recommendation model; in addition, videos shot by users were collected as test samples. Background music was then selected with the recommendation model, and the degree of matching was determined through volunteer scoring. Based on the deep learning method, Zhang et al. (2016) studied an automatic recommendation algorithm for video background music. The video scenario was modeled through the structured SVM, and the background music features, video contents, and video scenarios were matched. Eventually, a background music recommendation model for personally shot videos was completed, and appropriate background music was selected with the model. The recommended music matched the test videos well, and 88.8% of the users were satisfied. The music recommendation model could better describe the relationship between music and video semantics, and the recommended background music was more suitable (Luo et al. 2017).

Li et al. (2017) found that intelligent background music systems based on IoT technology have attracted widespread attention abroad. In 2016, Microsoft introduced the Microsoft HoloLens holographic glasses, which integrate a CPU, a GPU, a spatial stereo system, and a holographic processing unit (Qin and Wang 2017). Another technology company, Google, has also done much research related to the intelligent background music system: it first launched products such as Google Glass and then acquired Nest. In addition, Google's ownership of the Android operating system has become a unique advantage in the development of intelligent background music systems. Tang et al. (2018) achieved great success in research on applying IoT combined with deep learning to background music. It was found that deep learning could provide end-to-end learning and was good at handling complex tasks, which overcame the limitations of the traditional linear model; thus, the quality of background music recommendations was significantly improved (Shen et al. 2019). Deep learning can effectively capture the nonlinear relationship between users and items and obtain vector representations of users or items through vectorization or encoding. In addition, sophisticated features of the user and item information collected through the IoT, such as text, audio, or visual information, can be learned, which can also improve the quality of recommendations.

This study investigates an intelligent background music system based on deep learning and IoT technology. Previous research on intelligent background music focused only on the application of deep learning algorithms and overlooked the trend of combining them with IoT technology. Here, IoT and deep learning algorithms are combined and applied to the design of a background music system. In addition, the proposed algorithm based on Fast-RCNN shows better recognition accuracy.

3 Methods

3.1 SIFT feature extraction algorithm

The scale-invariant feature transform (SIFT) is a method proposed by David Lowe in 1999 for detecting and describing local image features. Compared with conventional algorithms, SIFT has several advantages (Xu et al. 2019). First, SIFT extracts local features, so the extracted results remain largely invariant when the image changes (for example, in scale or rotation), which makes SIFT robust and stable. Second, SIFT can produce a large number of feature vectors even from a few targets; that is, its features are plentiful. Third, the SIFT algorithm can easily be combined with other feature extraction algorithms. Fourth, SIFT features are distinctive and highly informative. In general, the extraction process of the SIFT algorithm can be divided into the following stages: the construction of the multi-scale space, the detection of extreme points, the elimination of poor feature points, the assignment of feature point orientations, and the generation of key point descriptors.

The scale-space of the image is constructed through Gaussian blurring. Lindeberg (2017) proved that the Gaussian convolution kernel is the only linear kernel that realizes the scale transformation. Therefore, the scale-space L (x, y, σ) of the image is defined as follows:

$$ L(x,y,\sigma ) = G(x,y,\sigma )*I(x,y) $$
(1)

The Gaussian function is:

$$ G(x,y,\sigma ) = \frac{1}{{2\pi \sigma^{2} }}e^{{ - (x^{2} + y^{2} )/2\sigma^{2} }} $$
(2)

In (1) and (2), G (x, y, σ) denotes the Gaussian function, I (x, y) represents the initial image, * denotes convolution, (x, y) represents the spatial coordinates, and σ represents the scale factor. The size of σ determines how strongly the image is smoothed by the Gaussian transformation: the larger the value of σ is, the smoother the image is and the lower its resolution is; conversely, the smaller the value of σ is, the higher the resolution of the image is.

Once the Gaussian scale-space of the image has been constructed, stable feature points can be detected easily in the difference-of-Gaussian scale-space. Specifically, the image is convolved with the difference of two Gaussian kernels whose scales differ by a constant factor k, which is defined as follows:

$$ D\left( {x,y,\sigma } \right) = \left( {G\left( {x,y,k\sigma } \right) - G\left( {x,y,\sigma } \right)} \right)*I\left( {x,y} \right) = L\left( {x,y,k\sigma } \right) - L\left( {x,y,\sigma } \right) $$
(3)
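As a hedged illustration of Eqs. (1) to (3), the following minimal Python sketch samples the Gaussian of Eq. (2) on a small grid, blurs an image at the scales σ and kσ, and subtracts the two results. The function names, the kernel radius, the zero padding at the image borders, and the choice k = √2 are assumptions made for illustration only; they are not taken from the system described in this paper.

```python
import math

def gaussian_kernel(sigma, radius):
    """Sample G(x, y, sigma) from Eq. (2) on a (2*radius+1) x (2*radius+1) grid."""
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return [[norm * math.exp(-(x * x + y * y) / (2.0 * sigma ** 2))
             for x in range(-radius, radius + 1)]
            for y in range(-radius, radius + 1)]

def convolve(image, kernel):
    """Same-size 2D convolution with zero padding (an assumed border rule)."""
    h, w = len(image), len(image[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i - di, j - dj
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di + r][dj + r] * image[ii][jj]
            out[i][j] = acc
    return out

def difference_of_gaussians(image, sigma, k=math.sqrt(2), radius=4):
    """D = L(k*sigma) - L(sigma), as in Eq. (3); k and radius are assumptions."""
    l_small = convolve(image, gaussian_kernel(sigma, radius))      # L(x, y, sigma)
    l_large = convolve(image, gaussian_kernel(k * sigma, radius))  # L(x, y, k*sigma)
    return [[a - b for a, b in zip(row_l, row_s)]
            for row_l, row_s in zip(l_large, l_small)]
```

Because convolution is linear, convolving the image once with the difference of the two kernels gives the same result as subtracting the two blurred images, which is the equality stated in Eq. (3).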

On this basis, the advantages of the SIFT algorithm are fully utilized in the exploration of the intelligent background music system. The underlying features of the scene images are extracted by combining deep learning methods with the SIFT feature extraction algorithm. Then, the SVM classification algorithm is applied to obtain the middle-layer features with spatial information. Finally, the proposed algorithm is compared with the LBP, Gabor, and saliency feature extraction algorithms in terms of classification accuracy.

3.2 SVM classification algorithm

SVM originated in work by Vladimir Vapnik and colleagues in 1964, and its widely used soft-margin form was published by Corinna Cortes and Vladimir Vapnik of Bell Labs (USA) in 1995 (Dong et al. 2014); since the 1990s, it has developed rapidly and given rise to a series of improved and extended algorithms (Taehwan and Jinsung 2016). It is a linear classifier that belongs to the supervised statistical learning algorithms. SVM minimizes the empirical error while maximizing the geometric margin, and it is therefore also known as the maximum margin classifier. It is mainly applied to classification and regression analysis. Training an SVM classifier is a learning process that studies the features of the samples and finds, in the high-dimensional space, a hyperplane that separates the training samples. In other words, a hyperplane that completely separates the different types of samples is sought. When the distance from the hyperplane to the nearest samples of each type is the largest, the hyperplane is called the maximum margin hyperplane, and the corresponding classifier is called the maximum margin classifier. The hyperplane is expressed as follows:

$$ w^{\text{T}} x + b = 0 $$
(4)

To describe the sampling points nearest to this hyperplane, it is necessary to find two parallel hyperplanes at equal distances on either side of it:

$$ \begin{array}{*{20}c} {H_{1} :y = w^{\text{T}} x + b = + 1} \\ {H_{2} :y = w^{\text{T}} x + b = - 1} \\ \end{array} $$
(5)

The distance between the two hyperplanes is calculated through the following equation:

$$ d = \frac{2}{\left\| w \right\|} $$
(6)

In (6), \( w \) represents the normal vector of the hyperplane.
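To make Eqs. (4) to (6) concrete, the short Python sketch below evaluates the decision value, assigns a class label according to the side of the hyperplane, and computes the margin width. The function names and the numerical values of w and b are hypothetical and chosen only for illustration; no training step is shown.

```python
import math

def decision_value(w, b, x):
    """Return w^T x + b, the left-hand side of Eq. (4)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Label a sample +1 or -1 by the side of the hyperplane it falls on."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between H1 and H2 from Eq. (6): d = 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical, already-trained parameters used only as an example.
w, b = [3.0, 4.0], -5.0
print(margin_width(w))               # ||w|| = 5, so d = 2 / 5
print(classify(w, b, [2.0, 1.0]))    # 3*2 + 4*1 - 5 = 5, positive side
print(classify(w, b, [0.0, 0.0]))    # decision value is -5, negative side
```

A sample such as (2, 0) gives a decision value of exactly 1 and therefore lies on H1 in Eq. (5); samples on H1 or H2 are the ones that determine the margin.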

The advantage of the SVM algorithm is that its parameters are adjustable, which helps avoid over-fitting. Training an SVM ultimately amounts to solving a convex optimization problem, for which many methods exist, such as sequential minimal optimization (SMO). Given a small sample size, the classification performance of the SVM classifier is significantly better than that of other classifiers, since its optimization goal is to minimize the structural risk rather than the empirical risk; it therefore generalizes well.

3.3 Faster-RCNN algorithm

A convolutional neural network (CNN) is a feedforward neural network used in deep learning, which is mainly composed of an input layer, convolutional layers, pooling layers, and an output layer. Faster-RCNN is a target detection algorithm developed based on CNN, and it makes it possible to optimize the entire training process. In terms of structure, Faster-RCNN is mainly composed of two modules: the Fast-RCNN target detection algorithm and the region proposal network (RPN), which generates candidate regions. In operation, the feature extraction network first extracts features from the input image to be tested and generates a feature map. The RPN is then applied to the feature map to output target candidate regions at multiple scales; finally, after target classification of these regions, the feature extraction is completed. The loss function of the RPN can be expressed as:

$$ L\left( {\left\{ {p_{i} } \right\},\left\{ {t_{i} } \right\}} \right) = \frac{1}{{N_{\text{cls}} }}\sum\limits_{i} {L_{\text{cls}} } \left( {p_{i} ,p_{i}^{*} } \right) + \lambda \frac{1}{{N_{\text{reg}} }}\sum\limits_{i} {p_{i}^{*} L_{\text{reg}} \left( {t_{i} ,t_{i}^{*} } \right)} $$
(7)

In (7), i is the index of a candidate box (anchor), \( p_{i} \) is the predicted probability that the ith candidate is a target, and \( p_{i}^{*} \) is its ground-truth label: \( p_{i}^{*} \) = 1 if the ith candidate is a target, and \( p_{i}^{*} \) = 0 otherwise. \( t_{i} \) is the vector of parameterized coordinates of the predicted box, and \( t_{i}^{*} \) is the corresponding vector of the real box, so that their difference measures the correction of the predicted box relative to the real box. \( N_{\text{cls}} \) and \( N_{\text{reg}} \) are normalization parameters, λ is the balance coefficient, \( L_{\text{cls}} \) is the classification loss on the predicted confidence of the candidate region category, and \( L_{\text{reg}} \) is the loss function of position correction. Because the regression term is multiplied by \( p_{i}^{*} \), only candidates labeled as targets contribute to it.
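The structure of Eq. (7) can be sketched in a few lines of Python. The paper does not state the concrete forms of the classification and regression losses, so this sketch assumes the binary log loss for the classification term and the smooth L1 loss for the regression term, which are commonly paired with this formulation. All function names and the sample values are hypothetical.

```python
import math

def log_loss(p, p_star):
    """Binary log loss for one candidate (assumed form of L_cls)."""
    eps = 1e-12  # guards against log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(p_star * math.log(p) + (1 - p_star) * math.log(1.0 - p))

def smooth_l1(t, t_star):
    """Smooth L1 loss summed over box coordinates (assumed form of L_reg)."""
    total = 0.0
    for a, b in zip(t, t_star):
        d = abs(a - b)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam):
    """Eq. (7): normalized classification term plus weighted regression term."""
    cls_term = sum(log_loss(pi, ps) for pi, ps in zip(p, p_star)) / n_cls
    # p_star gates the regression term: only candidates labeled as targets count.
    reg_term = sum(ps * smooth_l1(ti, ts)
                   for ps, ti, ts in zip(p_star, t, t_star)) / n_reg
    return cls_term + lam * reg_term

# Hypothetical mini-batch of two candidates: one target, one background.
loss = rpn_loss(p=[0.5, 0.5], p_star=[1, 0],
                t=[(0.5, 0.0, 0.0, 0.0), (9.0, 9.0, 9.0, 9.0)],
                t_star=[(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
                n_cls=2, n_reg=1, lam=1.0)
print(loss)
```

In this example the background candidate has a large coordinate error, yet it adds nothing to the regression term because its label is 0.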

The introduction of the RPN is a significant innovation of the Faster-RCNN algorithm and benefits the training optimization process. On this basis, the Faster-RCNN algorithm is combined with the SIFT feature extraction algorithm and the SVM classification algorithm to design and construct the intelligent background music system. In summary, the SIFT algorithm performs well in extracting local features, SVM performs well in classifying feature images or related data, and the Faster-RCNN deep learning method performs well in extracting multi-scale features. Considering that the intelligent background music system is a complex composite system, several feature extraction and feature classification algorithms are integrated, and a feature extraction method based on the middle-level feature structure is proposed. The overall implementation architecture of the feature extraction method is shown in Fig. 1. The effect of the proposed feature extraction algorithm based on middle-level features is tested through a comparative analysis with other feature recognition algorithms. The application of deep learning methods provides a good foundation for recognizing the specific scenes of the intelligent background music system.

Fig. 1 The overall implementation architecture of feature extraction based on the functional modules of the intelligent background music system

4 Experiments

4.1 Application of IoT technology in the intelligent background music system

Once the needs of the system users are clearly defined, the overall architecture of the system can be determined. The main functions of the intelligent background music system are provided by the security subsystem, the device control subsystem, the multimedia subsystem, and the hygiene and health subsystem. The security subsystem should include control and video surveillance functions. For the home appliance control function, ZigBee networks, an IoT technology, need to be created. Currently, most health care applications of ZigBee are sphygmomanometers, whereas applications such as meters or weighing scales can generally use Bluetooth point-to-point connections for data transmission. For the lighting function, the lighting system can adopt embedded technology that directly provides a cable interface and uses a transmission relay for checking. Finally, the home multimedia application can be extended by adding a large-capacity storage device to the system. The overall architecture of the intelligent background music system based on IoT technology is shown in Fig. 2.

Fig. 2 Overall architecture of intelligent background music system

The feasibility of the intelligent background music system based on deep learning and IoT technology is then studied. The specific research scheme is shown in Fig. 3. In the actual experiment, the functions to be realized in the system include remote video surveillance, home information management, home appliance remote control simulation, and information services. The remote video surveillance system sends the video information to the smartphone screen through the intelligent gateway platform and dials the receiving end when an alarm is triggered. The home information management function manages the information of family members: it collects the music habits of users and uses the learning algorithm to extract the features of the music preferences of different family members. Remote control simulation means controlling the switches or other function buttons of the system remotely through a smartphone, which is the underlying implementation of the intelligent background music system. Here, the feature extraction algorithm based on middle-level features is applied to extract the music preference features of different family members. To collect the music habits of family users, the user-centered PBRCM is utilized, which can obtain the users' music preferences and habits. On this basis, the recognition effect of the algorithm in different family behavior scenarios is verified and analyzed.

Fig. 3 Experimental scheme structure

4.2 Android client-side application

Android provides an API for working with SQLite databases, and routine database operations are performed through this API. The database is represented by the SQLiteDatabase class, which contains methods for manipulating the database (Blanco et al. 2018; Laghari and Niazi 2016). In addition, the sqlite3 tool provided with Android can be used to create a database and execute SQL statements. Table 1 shows the common methods of SQLiteDatabase objects.

Table 1 Common methods of SQLite Database

Android provides a helper class for SQLiteDatabase named SQLiteOpenHelper, which mainly creates and manages databases. The program calls the getWritableDatabase() or getReadableDatabase() method of the class; if the database does not exist yet, the Android system creates it automatically. The process by which SQLiteOpenHelper creates the database is shown in Fig. 4.

Fig. 4 SQLite open helper generating database procedure
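The open-or-create behavior described above can be imitated outside Android. The following sketch uses Python's standard sqlite3 module as an analogue; it is not the Android API, and the class name OpenHelper, the method names, and the music_preference table are hypothetical. It assumes that the schema version is kept in SQLite's user_version pragma and that a fresh database reports version 0, so the requested version should be at least 1.

```python
import sqlite3

class OpenHelper:
    """Hypothetical Python analogue of the open-helper pattern (not Android code)."""

    def __init__(self, path, version):
        self.path = path
        self.version = version  # expected schema version, assumed to be >= 1

    def on_create(self, db):
        # Called once for a fresh database; the table is an illustrative example.
        db.execute("CREATE TABLE IF NOT EXISTS music_preference ("
                   "member TEXT NOT NULL, genre TEXT NOT NULL)")

    def on_upgrade(self, db, old_version, new_version):
        # Schema migrations would go here; nothing to migrate in this sketch.
        pass

    def get_writable_database(self):
        db = sqlite3.connect(self.path)  # creates the database file if missing
        current = db.execute("PRAGMA user_version").fetchone()[0]
        if current == 0:
            self.on_create(db)
        elif current < self.version:
            self.on_upgrade(db, current, self.version)
        if current != self.version:
            db.execute(f"PRAGMA user_version = {int(self.version)}")
            db.commit()
        return db

helper = OpenHelper(":memory:", version=1)
db = helper.get_writable_database()
db.execute("INSERT INTO music_preference VALUES (?, ?)", ("member_a", "jazz"))
print(db.execute("SELECT member, genre FROM music_preference").fetchall())
```

The caller never creates the schema explicitly; requesting a database handle is enough, which mirrors the role that the helper class plays in the client-side application.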

5 Discussion

5.1 Feature extraction contrast experiment

To select the scenario feature extraction algorithm, the classification accuracies of the LBP feature algorithm, the Gabor feature algorithm, the saliency feature algorithm, and the middle-level feature construction-based feature extraction algorithm proposed in this study are compared experimentally. The experiment is performed with the threshold of the proposed algorithm set to T = 0.3 and its level parameter set to Level 2. The comparison of the results is shown in Fig. 5.

Fig. 5 Recognition effect maps based on different feature extraction methods

Compared with the traditional feature extraction algorithms, the middle-level feature construction-based feature extraction algorithm proposed in this study achieves the highest recognition rate for indoor scenarios, about 87.6%. When the Gabor feature algorithm is used to classify and identify the scenarios, the recognition rate stays around 20%. The Gabor feature algorithm therefore has clear limitations in classifying indoor scenarios and is not suitable for this task. In the bathroom, the recognition effect of the saliency map feature algorithm is similar to that of the proposed algorithm. In the bedroom, however, the proposed algorithm performs significantly better than the saliency map feature algorithm, which is affected by problems such as lighting and room orientation. Overall, the experiment shows that the middle-level feature construction-based feature extraction algorithm classifies and recognizes indoor scenarios well.

To better test the recognition effect of the various algorithms, the average confusion matrix of each feature extraction method over the family scenarios is obtained from 10 repeated experiments. The Gabor method gives the worst recognition of scene images, and its results are close to random guessing. The Gabor filter (Ma et al. 2017) is a linear filter that performs edge detection only, so it is poorly suited to extracting features from complex indoor scenarios. The recognition effect of Gabor features is significantly worse than that of the LBP features and saliency features. In summary, the middle-level feature construction algorithm, which builds on the low-level detection features of the targets to be detected, has strong recognition capability.
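The averaging step described above can be sketched as follows. The sketch assumes that each run produces a square confusion matrix whose rows are the true scenarios and whose columns are the predicted scenarios; the function names and the two small matrices are hypothetical and do not reproduce the experimental data.

```python
def average_confusion(matrices):
    """Element-wise mean of equally sized confusion matrices from repeated runs."""
    n = len(matrices)
    size = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(size)]
            for i in range(size)]

def recognition_rates(confusion):
    """Per-scenario recognition rate: diagonal entry divided by its row total.

    Rows are assumed to be true scenarios and columns predicted scenarios.
    """
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        rates.append(row[i] / total if total else 0.0)
    return rates

# Two hypothetical runs over two scenarios (the paper uses 10 repeated runs).
runs = [
    [[8, 2], [1, 9]],
    [[6, 4], [3, 7]],
]
mean_matrix = average_confusion(runs)
print(mean_matrix)
print(recognition_rates(mean_matrix))
```

Reading the diagonal of the averaged matrix gives the per-scenario recognition rate, while the off-diagonal entries show which scenarios a method tends to confuse with each other.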

5.2 Test of the service system

The environment configuration of the intelligent background music system is shown in Table 2.

Table 2 Test environment configurations

The results of the functional tests are shown in Table 3.

Table 3 Functional tests

In the final test of system performance, the CPU usage rate is tested first. As the number of test clients increases from 100 to 19,000, the CPU usage rate of the intelligent background music system rises slowly but remains below 32%. Next, the I/O throughput of the system is tested. The test shows that each client sends an average of 100 bytes per data request to the server. The intelligent background music system is built on the Netty framework and adopts a database operation optimization strategy, so that the database can process about 4000 concurrent user data requests per second. When more than 20,000 clients connect to the service system at the same time, the CPU usage rate of the service system is 28%, and the memory usage is about 340 MB, which is within a reasonable range. Finally, the system response is tested. During the test, the trend of the response time of the service system is recorded as the clients are simulated, and the maximum, minimum, and average response times to client requests are collected. The experimental results are shown in Fig. 6. The average request-response time of the intelligent background music system is less than 1 s, which is acceptable to users.

Fig. 6 Response time of the system
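The response-time indicators and the request payload rate implied by the figures above can be computed as in the following sketch. The function names and the sample response times are hypothetical; only the values of about 4000 requests per second and 100 bytes per request come from the test description.

```python
def summarize_response_times(samples_ms):
    """Return the minimum, maximum, and average of response times in milliseconds."""
    if not samples_ms:
        raise ValueError("at least one response time is required")
    return {
        "min": min(samples_ms),
        "max": max(samples_ms),
        "avg": sum(samples_ms) / len(samples_ms),
    }

def request_payload_rate(requests_per_second, bytes_per_request):
    """Bytes of request payload arriving per second at a given request rate."""
    return requests_per_second * bytes_per_request

# Hypothetical response times for three simulated client requests.
print(summarize_response_times([120.0, 80.0, 100.0]))

# About 4000 requests per second at 100 bytes each is 400,000 bytes per second.
print(request_payload_rate(4000, 100))
```

At the reported request rate, the request payload alone amounts to roughly 400 KB per second, which is small compared with typical network capacity and suggests that request handling rather than bandwidth dominates the load.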

The above three parameters (CPU usage rate, I/O throughput, and response time) are the test indicators. The test results show that the intelligent background music system, designed and implemented on the Netty framework as an integrated subscription and distribution system, meets expectations. The other technical solutions, such as the Redis open-source data caching system and the Dubbo remote service framework, have also met expectations, and the system is stable and effective.

6 Conclusions

In China, the application of IoT technology based on artificial intelligence is still at an initial stage of development. The related technical problems have not yet been overcome, and the business operation mode needs further improvement. Nevertheless, applications such as agricultural product circulation suggest that intelligent IoT technology has begun to take shape.

The deep learning-based Fast-RCNN algorithm, the SIFT feature extraction algorithm, and the SVM classification algorithm are combined to extract the middle-layer features with spatial information. The results show that the proposed algorithm performs better in anti-interference capacity, robustness, and recognition capacity. At the same time, the combination of deep learning algorithms and IoT technologies is applied to design and construct the intelligent background music system, which has shown excellent results in data processing for communication systems. However, deficiencies remain. In future research, the application of IoT technology should be explored further; in addition, more compatibility tests and fault tolerance tests should be conducted on the transmission and the underlying protocol, and the functions of the network protocol should be extended.