1 Introduction

Information is a key resource of human society, and efficient access to and use of information resources strongly affect daily life and work. Accurately finding the information that people are interested in within massive data is therefore an important problem [6, 12]. With continuing innovation in science and technology, and especially the rapid development of computer and multimedia technology, the ways of expressing information have become richer. Images and video carry more information than voice and text and are more intuitive and vivid, which makes them one of the most important communication channels for the public. Among text, image and video, video is particularly popular because it is vivid and specific. Video has a huge data volume and abundant content, and the number and types of videos keep growing [1, 13]. As a result, video data management and retrieval has become a research hotspot.

Reference [15] combined orthogonal decomposition with the visual word bag model for image retrieval. The image information domain was divided into an encryption domain and a retrieval domain, so that encryption and feature extraction were independent and did not interfere with each other. In the encryption domain, users could choose an encryption method according to their needs; in the retrieval domain, the visual word bag model was applied and a visual word histogram was used to represent the image, which reduced the semantic gap between the low-level features and the high-level semantics of the image and improved retrieval performance. The experimental results showed that the method was secure, but its accuracy was low [15]. Reference [14] addressed the problems of existing image feature retrieval methods by applying a weighted quantization method to the fusion of image target and background based on the concept of connected particles, combining newly defined structuring elements with an image adaptive vector model. Nine new structural elements of image features were selected according to vision, and connected granular attributes and hierarchical statistical models were constructed; the corresponding mapping subimages were obtained by color conversion and structural element registration, and statistical structural element and image connectivity feature vectors were extracted from the subimages; the components were merged into a set of feature vectors by a vector fusion model, and the composite vectors were applied to image retrieval. The experimental results showed that the method was stable, but its retrieval time was long [14]. Reference [2] proposed an image retrieval method based on Henon mapping. Based on the principle of image formation and multi-band features, the gray values of each band were encrypted by an improved Henon mapping. According to the “big data” characteristics of images, image feature vectors were constructed from the interval information of the calculated gray values, and image retrieval was completed with a similarity matching algorithm. The experimental results showed that the method had low running complexity, but its recall rate was low [2]. Reference [3] applied the two-dimensional shape features of salient regions to image retrieval. The more significant shape features were first extracted from the image, and retrieval was then realized by combining them with the texture and color features of the image. The experimental results showed that this method achieved a certain accuracy, but its energy consumption was high [3]. Reference [4] applied the correlation descriptor of the semi-circular local binary pattern frame to image retrieval. New semi-circular local binary pattern primitives were defined, and hierarchical structure primitives with different quantized colors were detected according to this definition. The spatial distribution of the new structure primitives and the image contrast characteristics were extracted, and image retrieval was completed based on the extracted results. The experimental results showed that the method was efficient, but its precision was low [4].

At present, local feature indexing of multimedia video does not describe the feature groups of video images, which leads to low precision, long retrieval time and high energy consumption. In this paper, the set of visual features corresponding to each feature group is used to construct a feature group descriptor, and SIFT descriptors are aggregated with the visual word bag model. An improved VLAD method is used to make the model more efficient. On this basis, a local feature indexing method for multimedia video based on intelligent soft computing is proposed. The method offers a new idea for local feature indexing of multimedia video, lays a foundation for the further development of this technology, expands the application scope of the multimedia field, and provides a simpler and more accurate way to retrieve multimedia video.

The framework of the method is as follows:

  1. Video images are segmented by the maximum entropy threshold method, which lays the foundation for local feature indexing of video.

  2. Through intelligent soft computing, the K-means algorithm, the VLAD method, principal component analysis and cosine similarity calculation are introduced to complete local feature indexing of video.

  3. The proposed method is validated by experiments and discussion.

  4. The full text is summarized and suggestions are put forward for further research.

2 Local feature processing of multimedia video

To achieve local feature indexing of multimedia video with high precision, short retrieval time and low energy consumption, a local feature indexing method for multimedia video based on intelligent soft computing is proposed. The overall framework is shown in Fig. 1:

Fig. 1 Overall framework of local feature indexing method for multimedia video

2.1 Video image segmentation

To improve the efficiency of video local feature indexing and, to a certain extent, its accuracy, the video must first be segmented. Video is the technology of capturing, recording, transmitting and reproducing a series of static images in the form of electrical signals. When the images change at more than 24 frames per second, the human eye, because of the persistence of vision, can no longer distinguish the individual static images and perceives a smooth and continuous visual effect. Such a continuous sequence of images is called video [8]. The video can therefore be treated as a sequence of continuous images for threshold segmentation. The average segmentation method is used to divide the whole video into multiple segments. The detailed segmentation process is as follows:

Selecting the segmentation threshold by the definition of entropy in information theory gives the one-dimensional maximum entropy threshold method. The entropy H in information theory is expressed as follows:

$$ H=-{\int}_{-\infty}^{+\infty }p(x)\lg p(x)\mathrm{d}x $$
(1)

where p(x) is the probability density function of the variable x. This paper applies the definition of entropy in information theory to image segmentation: entropy is used to determine a suitable threshold that divides the continuous image into two parts, target and background, and maximizes the amount of information in the target and the background, so as to preserve information integrity after video image segmentation [5, 11]. For a gray image, a threshold t is selected so that the one-dimensional gray-level entropy, that is, the first-order gray-level statistical information of the two segmented parts of the continuous image, is maximized.

Based on the definition of entropy, the histogram entropy expression of continuous image with K gray level is as follows:

$$ H=-\sum \limits_{i=0}^{K-1}{p}_i\lg {p}_i $$
(2)

In formula (2), \( {p}_i \) represents the probability of occurrence of the ith gray level.

Let the threshold be t. Consistent with the index ranges in formulas (3) to (5), the target area O consists of the pixels of the continuous image whose gray level does not exceed t, and the background area B consists of the pixels whose gray level is higher than t. The probability distributions of the two regions are \( {p}_i/{p}_t,\ i=0,1,\cdots, t \) for region O and \( {p}_i/\left(1-{p}_t\right),\ i=t+1,t+2,\cdots, K-1 \) for region B, where:

$$ {p}_t=\sum \limits_{i=0}^t{p}_i $$
(3)

According to the above calculation and analysis, the expression of entropy in the target and background region is as follows:

$$ {H}_O(t)=-\sum \limits_{i=0}^t\left(\frac{p_i}{p_t}\right)\cdot \lg \left(\frac{p_i}{p_t}\right) $$
(4)
$$ {H}_B(t)=-\sum \limits_{i=t+1}^{K-1}\left[\frac{p_i}{\left(1-{p}_t\right)}\right]\cdot \lg \left[\frac{p_i}{\left(1-{p}_t\right)}\right] $$
(5)

By using formula (4) and formula (5), the expression of entropy function of continuous image is as follows:

$$ H(t)={H}_O(t)+{H}_B(t) $$
(6)

The gray value corresponding to the maximum of the entropy function is the optimal segmentation threshold. Then

$$ {t}^{\ast }=\arg\ \max H(t),0\le t\le K-1 $$
(7)

The optimal segmentation threshold obtained by formula (7) can realize the optimal segmentation of multimedia video.
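The threshold search of formulas (2) to (7) can be sketched in a few lines of Python. This is only an illustrative sketch, not the MATLAB implementation used in the experiments: the function name `max_entropy_threshold` and the plain-list histogram input are assumptions made here, and empty gray levels are skipped because they contribute nothing to the entropy.

```python
import math


def max_entropy_threshold(hist):
    """Return (t*, H(t*)) for a gray-level histogram, following formulas (2)-(7)."""
    total = sum(hist)
    p = [h / total for h in hist]               # p_i: probability of gray level i
    best_t, best_h = 0, float("-inf")
    for t in range(len(p) - 1):                 # t = K-1 would leave region B empty
        p_o = sum(p[: t + 1])                   # p_t of formula (3)
        p_b = sum(p[t + 1:])                    # 1 - p_t
        if p_o == 0 or p_b == 0:
            continue                            # one region is empty, entropy undefined
        h_o = -sum((pi / p_o) * math.log10(pi / p_o) for pi in p[: t + 1] if pi > 0)
        h_b = -sum((pi / p_b) * math.log10(pi / p_b) for pi in p[t + 1:] if pi > 0)
        if h_o + h_b > best_h:                  # formula (6), maximized as in formula (7)
            best_t, best_h = t, h_o + h_b
    return best_t, best_h
```

For a flat histogram with K = 8 gray levels, H(t) = lg((t + 1)(7 − t)), which is largest at t = 3, so the sketch returns the middle gray level, as expected for a distribution with no preferred split.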

2.2 Local feature indexing of multimedia video based on intelligent soft computing

Intelligent soft computing does not require an exact mathematical or logical model of the problem; it processes the input data directly to obtain the result. It is therefore suited to problems that traditional techniques cannot handle effectively, or cannot handle at all.

On the basis of video image segmentation, local feature indexing of multimedia video is realized by intelligent soft computing. Feature regions are detected in each video image. Let the video image to be processed be I(x′, y′). Filtering I(x′, y′) with the Gaussian kernel function G(x′, y′, σ) gives the scale-space representation f(x′, y′, σ) of the video image:

$$ f\left({x}^{\prime },{y}^{\prime },\sigma \right)=G\left({x}^{\prime },{y}^{\prime },\sigma \right)\odot I\left({x}^{\prime },{y}^{\prime}\right)\cdot {t}^{\ast } $$
(8)

where ⊙ represents the convolution operation. Applying the Hessian matrix operation to the scale-space representation f(x′, y′, σ) detects the coordinates (x′, y′) and the scale σ of the feature regions in the video image.

Video image feature detectors are constructed to detect some small video image feature regions. Based on these, a visual feature group is designed and constructed using the spatial location information of the feature regions in video images [10].
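The scale-space filtering of formula (8) and the Hessian-based detection described above can be illustrated with the following Python sketch. It rests on several assumptions that are not taken from the paper: the image is a small list of lists, the Gaussian kernel is truncated at 3σ with clamped borders, the Hessian is approximated by finite differences, and the determinant of the Hessian is used as the response. The constant factor t* of formula (8) is omitted because it rescales the response without moving its maxima.

```python
import math


def gaussian_kernel(sigma):
    """1-D Gaussian weights truncated at 3*sigma and normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(image, sigma):
    """Separable Gaussian filtering G * I with clamped borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    h, w = len(image), len(image[0])

    def clamp(v, upper):
        return min(max(v, 0), upper)

    rows = [[sum(kernel[k + radius] * image[y][clamp(x + k, w - 1)]
                 for k in range(-radius, radius + 1))
             for x in range(w)] for y in range(h)]
    return [[sum(kernel[k + radius] * rows[clamp(y + k, h - 1)][x]
                 for k in range(-radius, radius + 1))
             for x in range(w)] for y in range(h)]


def hessian_determinant(f):
    """Finite-difference determinant of the Hessian at every interior pixel."""
    h, w = len(f), len(f[0])
    det = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            fxx = f[y][x + 1] - 2 * f[y][x] + f[y][x - 1]
            fyy = f[y + 1][x] - 2 * f[y][x] + f[y - 1][x]
            fxy = (f[y + 1][x + 1] - f[y + 1][x - 1]
                   - f[y - 1][x + 1] + f[y - 1][x - 1]) / 4
            det[y][x] = fxx * fyy - fxy * fxy
    return det
```

Local maxima of the response over position, repeated for several values of σ, would give candidate feature regions with coordinates and scale; a production detector would add non-maximum suppression and a response threshold.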

The feature areas detected in the video image I are marked as:

$$ {\left\{{f}_i=\left({x}_i^{\prime },{y}_i^{\prime}\right)\right\}}_{i=0}^{N-1} $$
(9)

where N represents the number of feature areas detected. The standard K-means algorithm is used to classify the feature areas into M′ feature groups based on their spatial coordinates in I. The process can be expressed by formula (10):

$$ D=\underset{\left\{\left({\overline{x}}_j^{\prime },{\overline{y}}_j^{\prime}\right)\right\}}{\min}\sum \limits_{j=0}^{M^{\prime }-1}\sum \limits_{\left({x}_i^{\prime },{y}_i^{\prime}\right)\in {\eta}_j}{\left\Vert \left({x}_i^{\prime },{y}_i^{\prime}\right)-\left({\overline{x}}_j^{\prime },{\overline{y}}_j^{\prime}\right)\right\Vert}^2 $$
(10)

For each feature group, a radius value can be assigned to describe its scale. Its expression is as follows:

$$ {\overline{r}}_j=\underset{\left({x}_i^{\prime },{y}_i^{\prime}\right)\in {\eta}_j}{\max}\sqrt{{\left\Vert \left({x}_i^{\prime },{y}_i^{\prime}\right)-\Big({\overline{x}}_j^{\prime },{\overline{y}}_j^{\prime}\Big)\right\Vert}^2} $$
(11)

In formula (11), \( {\eta}_j \) represents the set of visual features assigned to the jth feature group. According to formulas (10) and (11), image I can be represented as the feature group set \( {\left\{\left({\overline{x}}_j^{\prime },{\overline{y}}_j^{\prime },{\overline{r}}_j\right)\right\}}_{j=0}^{M^{\prime }-1} \), in which each feature group corresponds to a set of visual features \( {\eta}_j \). Some feature groups in a video image overlap each other, mainly because the radius \( {\overline{r}}_j \) of a feature group is the distance from its center \( \left({\overline{x}}_j^{\prime },{\overline{y}}_j^{\prime}\right) \) to its farthest feature.
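The grouping of formulas (10) and (11) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `assign` and `kmeans_feature_groups` are introduced here, the initial centers are passed in explicitly instead of being drawn at random, and the iteration is the usual Lloyd procedure for K-means.

```python
import math


def assign(points, centers):
    """Put every feature coordinate into the group of its nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        groups[j].append(p)
    return groups


def kmeans_feature_groups(points, initial_centers, max_iterations=50):
    """Return (centers, groups, radii) for 2-D feature coordinates."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        groups = assign(points, centers)
        updated = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if updated == centers:                  # assignments are stable
            break
        centers = updated
    groups = assign(points, centers)            # groups eta_j for the final centers
    # Formula (11): radius = distance from the center to the farthest member.
    radii = [max((math.dist(p, c) for p in g), default=0.0)
             for g, c in zip(groups, centers)]
    return centers, groups, radii
```

Because each radius reaches the farthest member of its group, two neighboring groups can cover a common area, which is the overlap noted above.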

Descriptors are then generated from the visual feature groups of the video image. The set of visual features \( {\eta}_j \) corresponding to each feature group is used to construct the feature group descriptor, so the SIFT descriptors of the features in a group must be aggregated into a single unified vector. The visual word bag model can achieve this goal. Considering the complexity and energy consumption of the whole process, the VLAD method, a non-probabilistic simplification of the Fisher vector, is used, and it is then modified to make it more efficient.

The VLAD method is briefly analyzed first, followed by the modifications that make it more efficient. The visual word bag model records only the quantized visual word of each visual feature and the number of visual features per visual word, whereas the VLAD method retains the quantization residual vectors between visual features and visual words, so it records considerably more information. For each visual word, the accumulated residual vector can be expressed by formula (12):

$$ {v}_{k^{\prime }}=\sum \limits_{d_i\in {C}_{k^{\prime }}}\left({d}_i-{c}_{k^{\prime }}\right) $$
(12)

In the formula, \( {d}_i \) represents the SIFT descriptor of local feature \( {f}_i \), \( {c}_{k^{\prime }} \) represents the k′th visual word in the visual codebook, \( {C}_{k^{\prime }} \) represents the set of descriptors quantized to \( {c}_{k^{\prime }} \), and \( {v}_{k^{\prime }} \) represents the sum of all residual vectors for the visual word \( {c}_{k^{\prime }} \). The residual vectors of all visual words are concatenated to form the final VLAD descriptor \( V=\left[{v}_{k^{\prime }}\right] \).
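The residual accumulation of formula (12) can be sketched as below. The function name `vlad` and the toy two-dimensional descriptors are assumptions made for illustration; real SIFT descriptors have 128 dimensions and the codebook would be learned by clustering.

```python
def vlad(descriptors, codebook):
    """Concatenate, over all visual words, the summed residuals d_i - c_k'."""
    dim = len(codebook[0])
    residuals = [[0.0] * dim for _ in codebook]
    for d in descriptors:
        # Quantize d to its nearest visual word c_k'.
        k = min(range(len(codebook)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(d, codebook[j])))
        for t in range(dim):
            residuals[k][t] += d[t] - codebook[k][t]
    return [value for word in residuals for value in word]   # V = [v_k']
```

With the codebook [(0, 0), (10, 10)] and the descriptors (1, 0), (0, 2) and (9, 10), the first two descriptors fall on the first visual word and the third on the second, so the sketch returns [1, 2, −1, 0].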

In order to reduce the energy consumption of video feature indexing, a compact representation of visual feature group is obtained. The dimensionality of VLAD descriptor is reduced by PCA, as shown in Formula (13):

$$ {V}^S={R}^S\cdot \left(V-\overline{V}\right) $$
(13)

In the formula, \( {R}^S \) represents the projection matrix formed by the eigenvectors corresponding to the S largest eigenvalues of the learned principal component analysis, and \( \overline{V} \) represents the mean vector in the principal component analysis.

Generally, the dimensions of a SIFT descriptor differ in discriminative power, mainly because their variances differ. Experience shows that balancing the variance across the dimensions of the SIFT descriptor gives relatively high discrimination [7, 9]. The process can be represented by formula (14):

$$ {v}_{k^{\prime }}=\sum \limits_{d_i\in {C}_{k^{\prime }}}\left[\frac{1}{\sqrt{\kappa_{k^{\prime }}}}\right]\cdot {R}_{k^{\prime }}\cdot \left({d}_i-{c}_{k^{\prime }}\right) $$
(14)

In the formula, \( {R}_{k^{\prime }} \) represents the matrix constructed from the eigenvectors in principal component analysis, and \( 1/\sqrt{\kappa_{k^{\prime }}} \) represents the diagonal matrix formed by the inverse square roots of the eigenvalues corresponding to the eigenvectors in \( {R}_{k^{\prime }} \). The descriptor generated by formula (14) is recorded as BVLAD.

After the unified descriptor of each feature group is constructed by formula (14), formula (13) is used to reduce it to S dimensions, and the reduced descriptor is encoded by piecewise quantization.

The piecewise quantization proceeds as follows. Given the S-dimensional feature group descriptor \( {h}^{\prime }={V}^S={\left[{h}_0^{\prime}\cdots {h}_{S-1}^{\prime}\right]}^T \), it is divided into l lower-dimensional sub-vectors \( \left\{{h}_l^{\prime }={\left[{h}_{l\cdot \left\lceil S/l\right\rceil}^{\prime}\cdots {h}_{\left(l+1\right)\cdot \left\lceil S/l\right\rceil -1}^{\prime}\right]}^T\right\} \). In each S/l-dimensional subspace, a visual codebook is constructed by the standard K-means clustering algorithm, giving a total of l visual codebooks \( \left\{{B}_l^V\right\} \). Each descriptor can then be expressed as in formula (15):

$$ {h}^{\prime}\approx {\left[N{N}_0\left({h}_0^{\prime}\right)\cdots N{N}_{l-1}\left({h}_{l-1}^{\prime}\right)\right]}^T $$
(15)

where \( N{N}_l\left(\cdot \right) \) represents the nearest visual word in the visual codebook \( {B}_l^V \). Assuming that the size of visual codebook \( {B}_l^V \) is \( {K}_l^{\prime } \), descriptor \( {h}^{\prime } \) can be coded into \( Z={\sum}_l\left\lceil \log \left({K}_l^{\prime}\right)\right\rceil \) bits. The Euclidean distance between two descriptors can then be approximated as:

$$ {\left\Vert {h}^{\prime }-{h}^{{\prime\prime}}\right\Vert}^2\approx {\sum}_l{\left\Vert N{N}_l\left({h}_l^{\prime}\right)-N{N}_l\left({h}_l^{{\prime\prime}}\right)\right\Vert}^2 $$
(16)

In the formula, \( {h}^{{\prime\prime} } \) also represents a visual feature group descriptor. Because the distances \( {\left\Vert N{N}_l\left({h}_l^{\prime}\right)-N{N}_l\left({h}_l^{{\prime\prime}}\right)\right\Vert}^2 \) can be calculated offline and stored in a Euclidean distance table of visual feature group descriptors, the distance between descriptors \( {h}^{\prime } \) and \( {h}^{{\prime\prime} } \) can be obtained online simply by looking up the table, which improves the recall of video feature indexing.
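The piecewise quantization of formula (15) and the table lookup of formula (16) can be sketched as follows. The helper names (`split`, `encode`, `build_tables`, `table_distance`) are hypothetical, the sub-codebooks are given directly instead of being learned by clustering, and the descriptor length is assumed to be divisible by the number of sub-vectors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(h, parts):
    """Cut descriptor h into `parts` contiguous sub-vectors of equal length."""
    step = len(h) // parts
    return [h[i * step:(i + 1) * step] for i in range(parts)]


def encode(h, codebooks):
    """Formula (15): index of the nearest visual word for every sub-vector."""
    return [min(range(len(cb)), key=lambda k: sq_dist(sub, cb[k]))
            for sub, cb in zip(split(h, len(codebooks)), codebooks)]


def build_tables(codebooks):
    """Offline step: pairwise squared distances between the words of each codebook."""
    return [[[sq_dist(a, b) for b in cb] for a in cb] for cb in codebooks]


def table_distance(code_a, code_b, tables):
    """Formula (16): approximate squared distance using table lookups only."""
    return sum(tables[l][i][j] for l, (i, j) in enumerate(zip(code_a, code_b)))
```

With the two sub-codebooks [(0, 0), (4, 0)] and [(0, 0), (0, 3)], the descriptor (0.2, 0.1, 0.1, 2.9) is coded as [0, 1] and (3.8, 0, 0, 0.2) as [1, 0]; the table distance between them is 16 + 9 = 25, obtained without touching the original descriptors.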

After the M′ visual feature group descriptors of a video image are constructed, the similarity of two video images is measured based on their visual feature group descriptors. Given an index video image \( {I}_q={\left\{{V}_i^S\right\}}_{i=0}^{M^{\prime }-1} \) and a video image \( {I}_d={\left\{{V^{\prime}}_j^S\right\}}_{j=0}^{M^{\prime }-1} \) in the video library, the similarity between them can be calculated by formula (17):

$$ sim\left({I}_q,{I}_d\right)=-\sum \limits_i\underset{j}{\min }{\left\Vert {V}_i^S-{V^{\prime}}_j^S\right\Vert}^2 $$
(17)

The similarity in formula (17) can be regarded as the negative error of reconstructing the index video image \( {I}_q \) from the video image \( {I}_d \) in the video database. If the feature group descriptors are quantized and coded as in formula (16), the Euclidean distance between two visual feature group descriptors can be obtained by table lookup.

Based on the above analysis, indexing in large-scale databases is implemented by intelligent soft computing with the following algorithm. First, a coarse similarity between the index video image and each video image in the database is obtained from the visual feature groups; an inverted index is used in this step to improve the real-time performance and recall of the index. Formula (17) is then used to calculate the reconstruction-error similarity only for the returned video images that rank highest in this coarse ranking. The process can be expressed as:

$$ si{m}^T\left({I}_q,{I}_d\right)=\sum \limits_i\sum \limits_j{\delta}_{NN\left({V}_i^S\right), NN\left({V^{\prime}}_j^S\right)} $$
(18)
$$ si m\left({I}_q,{I}_d\right)=\Big\{{\displaystyle \begin{array}{l}-\sum \limits_i\underset{j}{\min }{\left\Vert {V}_i^S-{V^{\prime}}_j^S\right\Vert}^2, si{m}^T\left({I}_q,{I}_d\right)\ge si{m}_0^T\\ {}-\infty, \mathrm{others}\end{array}} $$
(19)

In the formulas, \( si{m}_0^T \) represents a threshold and δ represents the Kronecker delta function, so \( si{m}^T\left({I}_q,{I}_d\right) \) is non-negative. Formulas (18) and (19) represent the similarity between the index video image and the database video images, and \( si{m}^T\left({I}_q,{I}_d\right) \) can be obtained from the inverted index. The database video images whose similarity \( si{m}^T\left({I}_q,{I}_d\right) \) reaches the given threshold \( si{m}_0^T \) are selected, and the similarity \( sim\left({I}_q,{I}_d\right) \) is then calculated accurately for them. These video images are ranked by \( sim\left({I}_q,{I}_d\right) \), the video images satisfying \( si{m}^T\left({I}_q,{I}_d\right)< si{m}_0^T \) are ranked by \( si{m}^T\left({I}_q,{I}_d\right) \), and the first-ranked video image is taken as the required one, which completes the local feature indexing of the video.
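The two-stage ranking of formulas (17) to (19) can be sketched as follows. This is an illustration under stated assumptions: the database is a dictionary from a hypothetical image identifier to its list of feature group descriptors, a single small codebook stands in for the quantizer NN(·), and the function names are introduced here.

```python
from collections import defaultdict


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(v, codebook):
    return min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))


def build_inverted_index(database, codebook):
    """Visual word id -> {image id: number of its descriptors quantized to that word}."""
    index = defaultdict(lambda: defaultdict(int))
    for image_id, groups in database.items():
        for v in groups:
            index[nearest_word(v, codebook)][image_id] += 1
    return index


def coarse_votes(query, index, codebook):
    """Formula (18): count descriptor pairs (i, j) that share a visual word."""
    votes = defaultdict(int)
    for v in query:
        for image_id, count in index.get(nearest_word(v, codebook), {}).items():
            votes[image_id] += count
    return votes


def exact_similarity(query, groups):
    """Formula (17): negative reconstruction error of the query from one image."""
    return -sum(min(sq_dist(v, w) for w in groups) for v in query)


def search(query, database, codebook, threshold):
    """Formula (19): exact ranking above the threshold, coarse ranking below it."""
    votes = coarse_votes(query, build_inverted_index(database, codebook), codebook)
    above = [i for i in database if votes.get(i, 0) >= threshold]
    below = [i for i in database if votes.get(i, 0) < threshold]
    above.sort(key=lambda i: exact_similarity(query, database[i]), reverse=True)
    below.sort(key=lambda i: votes.get(i, 0), reverse=True)
    return above + below
```

Only the images that pass the coarse vote pay for the exact comparison, which is where the inverted index saves time; in practice the index would be built once offline rather than inside `search`.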

3 Experimental results and analysis

To verify the effectiveness of the local feature indexing method for multimedia video based on intelligent soft computing, an experimental platform is built in MATLAB. The experimental data are collected from the microblog database from October 1 to 30, 2018, and 2000 videos are selected as experimental objects. Each video lasts about 2 min and contains about 3000 frames. To improve the overall performance of local retrieval of multimedia video, the video data are denoised before the experiment.

The methods in references [15] and [14] are compared with the proposed local feature indexing method for multimedia video based on intelligent soft computing. First, the index precision of the three methods is compared; the higher the index precision, the more accurate the local feature index of multimedia video. Second, the retrieval recall rates are compared. The recall rate is the ratio of the relevant local features retrieved to all relevant local features in the multimedia video; the higher the recall rate, the more comprehensive the retrieval results. Third, the time consumption of the three methods is compared; the shorter the time, the faster the retrieval. Finally, the energy consumption is compared; lower energy consumption with smaller fluctuation indicates better and more stable overall performance.
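The recall rate defined above, together with precision, can be computed as in the following sketch. The paper does not state how index precision is calculated, so the usual information-retrieval definition (relevant retrieved items divided by all retrieved items) is assumed here, and the function name is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for lists of retrieved and relevant item ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)            # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If four items are retrieved and two of the three relevant items are among them, precision is 2/4 and recall is 2/3.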

The experimental results are as follows:

Fig. 2 shows that the average index precision of the method in reference [15] is about 75% and its precision curve fluctuates little, but the curve trends downward as the number of experiments increases, so the reliability of the method is poor. The average index precision of the method in reference [14] is about 80%; its precision curve fluctuates greatly, but the method has a certain credibility. The index precision curve of the proposed method is the most stable, and its precision is high and stays above 90%. The main reason for these differences is that the proposed method segments the multimedia video, which improves the efficiency of video local feature indexing and, to a certain extent, its accuracy.

Fig. 2 Comparison of accuracy with different methods

Recall rate is a reliable indicator of the comprehensiveness of retrieval. As can be seen from Fig. 3, the average retrieval recall rate is about 83% for the method in reference [15], about 88% for the method in reference [14], and about 98% for the proposed method. The recall rates of the methods in references [15] and [14] fluctuate little, but the proposed method has a higher overall recall rate and is more robust than both. In the proposed method, the distance between feature descriptors is obtained by online table lookup, which effectively improves the recall of video feature indexing.

Fig. 3 Comparisons of recall rates of different methods

Fig. 4 shows that the indexing time is between 20 and 28 ugs for the method in reference [15], between 10 and 20 ugs for the method in reference [14], and between 1 and 6 ugs for the proposed method. The indexing times of the methods in references [15] and [14] do not fluctuate greatly as the number of video images to be processed increases, but when the number of video images reaches 1400 the time curves of both methods trend upward, indicating that the two methods are of limited practicability. When computing the cosine similarity between the index video image and the database video images, the proposed method uses an inverted index, which greatly improves indexing efficiency and effectively reduces indexing time.

Fig. 4 Time-consuming comparison of different methods for indexing

Fig. 5 shows that the indexing energy consumption is 43 to 62 nJ/bit for the method in reference [15], 35 to 52 nJ/bit for the method in reference [14], and 30 to 42 nJ/bit for the proposed method. The energy consumption curve of the method in reference [15] is M-shaped, and that of the method in reference [14] is W-shaped; the energy consumption of these two methods is not stable across running times and numbers of experiments. The proposed method reduces the dimension of the feature descriptors, which reduces the energy consumption of the indexing process and enhances its overall performance.

Fig. 5 Energy consumption comparison of different indexing methods

4 Discussion and exploration

This section analyzes whether the proposed method preserves the integrity of video information during video image segmentation, using the same experimental platform. Information feature points are set in the video image, and the reliability of the segmentation method is verified by counting the information feature points that remain in the image after segmentation by the maximum entropy threshold method. The results are as follows:

Fig. 6 shows that, although the information integrity curve after video image segmentation is not very stable, segmenting multimedia video images with the maximum entropy threshold method preserves the information integrity of the video images. The main reason is that the definition of entropy in information theory is applied to image segmentation and a suitable threshold is determined by entropy. The continuous image is divided into two parts, target and background, in a way that maximizes the amount of target and background information, thus preserving information integrity after video image segmentation.

Fig. 6 Information integrity after video image segmentation

5 Conclusions

Video image retrieval is a current research focus and is important for the work and life of the public. A local feature indexing method for multimedia video based on intelligent soft computing is therefore proposed. The video image is first segmented to facilitate local feature indexing, and a series of intelligent soft computing techniques is then used to realize the local feature indexing of the video. Experiments and discussion show that the proposed method is practical and can provide a reference for the development of this field. The following suggestions are put forward for further research:

  1. The most obvious feature of video is its huge amount of data. The next step is to introduce motion compensation or the DCT into the indexing to further improve its real-time performance.

  2. Video is time-varying and dynamic. The next step is to integrate these two characteristics into the indexing so that it adapts to changes in the video and performs better.

  3. Dynamic index updating is a new technology arising from the development of indexing technology. The next step is to introduce such new concepts into video image indexing to meet the growing needs of users.