1 Introduction

With the rapid development of animated visual experiences and 3D data processing capability, more and more scenes in social life use 3D images instead of the original 2D images, so existing networks have accumulated a rich amount of 3D entity and behavior data. At present, relevant data mining methods are often used to filter the more realistic and effective data from this pool, making 3D images more accurate, more relevant to their accompanying text, and applicable to multiple scenes [1]. Building on this research, visual experience systems have been developed further, and the information in 3D data images is efficiently enhanced by machine learning and other techniques; in addition, as industrial application standards change, such systems need to improve in several directions, such as special speed requirements and user behavior preferences. The traditional 3D human animation visual experience system randomly collects many 3D parameters and uses a large-scale model training approach to address the problem of the 3D image display experience [2]. Its architecture is built on 3D laser scanning, and there are two main routes to the 3D human animation experience: one quantifies the image data through 3D equivalent data transformation to convey the complex information of 3D human animation; the other combines several effective data node resources for calculation, so that the 3D human animation already held in extended system storage can maximize the representative parameters of the 3D human image display. However, the traditional 3D human animation visual experience system fuses the relevant data poorly during image display and struggles with the fusion of big data. We therefore design a 3D human animation visual experience system based on machine learning [3].

Motion capture technology was introduced fifty years ago; it provided development ideas for, and laid the foundation of, television and film, medicine, virtual reality, education, and military technology, both then and in the decades since. Even with the progress of science and technology, very realistic motion data still cannot be generated without real captured motion data, so motion capture retains important application value for current needs. At the same time, motion capture remains costly: only a few high-end setups can capture motion data that meets application needs, the captured data are often incomplete, and device accuracy is still not good enough to track perfect motion data [4]. Human motion recognition helps machines understand video content and thus enables automatic analysis and processing of that content. In recent years, video and live-streaming platforms have developed rapidly, and their huge amount of video data requires heavy labor costs for analysis and supervision. For example, in video content classification, most platforms currently classify videos with manually assigned data tags, and live-stream supervision is also mainly manual, which is expensive and inefficient [5]. By introducing algorithms such as human action recognition, video content can be understood intelligently, so that tasks such as platform content supervision, video retrieval, and video marketing push can be carried out automatically, greatly improving efficiency and reducing cost. Pure skeleton data also have shortcomings, however, such as the loss of environmental information and of possible interacting objects, and the lack of the objects' appearance features.

The immersion provided by a virtual character comes from the intelligence and rationality of the behavior it takes in the virtual scene. For virtual applications where the character is human, one hopes that the behavioral decisions and actions of the virtual human are as similar as possible to those of a real human. There is a large body of work on the behavior of virtual characters, such as logic-based virtual character design, which sets a series of rules for virtual characters to make behavioral decisions; on this basis, Jinglei Kou designed a multi-agent virtual park system that allows guides in virtual parks to interact with tourist users, and Narang proposed the Body Aware Movement (BAM) algorithm and designed a multi-role walking interaction system. Data-driven crowd modeling and simulation techniques aim to use crowd capture data to perform a series of sub-scale analyses and modeling steps to learn the movement characteristics of the crowd at different levels. In this paper, we focus on data-driven scene analysis and modeling, mesoscopic local collision avoidance, and microscopic motion synthesis, in order to synthesize more realistic, natural, and vivid crowd 3D animation from real data.

2 Current status of research

With the advent of depth sensors such as Kinect, obtaining skeletal information about human motion has become relatively simple. Human motion is usually formed by joint position changes or skeletal limb changes in space and time [6]. Most studies on human behavior recognition extract key information that best characterizes the movement as the action feature. Traditional Euclidean-space methods mostly use joint positions, joint angles, or spatio-temporal limb changes as the features describing human action [7]. Human action recognition methods based on skeletal information can be roughly divided into joint-based approaches and body-part-based approaches, where the joint-based approaches treat the human skeleton as a set of points [8]. Ukrit et al. [9] proposed using the covariance matrix of the temporal variation of skeletal joint positions as the discriminative descriptor of action sequences, deploying multiple covariance matrices on action subsequences in a hierarchical way to encode joint motion trajectories over time, followed by a linear SVM for action classification. Siddique et al. [10] proposed using the 3D histogram of joint positions (HOJ3D) as the action feature for human action recognition. The basic process is to first perform a Cartesian projection of the HOJ3D vectors using linear discriminant analysis (LDA), then cluster them into multiple pose visual words, and finally classify the pose visual words using discrete hidden Markov models (HMMs).
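As a minimal sketch of the covariance-descriptor idea attributed to [9], the temporal covariance of the joint coordinates can be computed as follows; the joint count, sequence length, and the use of the upper triangle as the fixed-length descriptor are illustrative assumptions, not the authors' exact construction.

```python
import numpy as np

def covariance_descriptor(seq):
    """Covariance of joint coordinates over time.

    seq: array of shape (T, J, 3) -- T frames, J joints, 3D positions.
    Returns the flattened upper triangle of the (3J x 3J) covariance
    matrix: a fixed-length descriptor regardless of sequence length T.
    """
    T, J, _ = seq.shape
    flat = seq.reshape(T, J * 3)        # each frame as one 3J-dim vector
    cov = np.cov(flat, rowvar=False)    # (3J, 3J) covariance over time
    iu = np.triu_indices(cov.shape[0])  # symmetric: keep upper triangle
    return cov[iu]

# toy sequence: 50 frames, 15 joints
rng = np.random.default_rng(0)
desc = covariance_descriptor(rng.normal(size=(50, 15, 3)))
print(desc.shape)  # (1035,) = 45*46/2 entries
```

The resulting vectors would then be fed to a linear SVM, as the cited work does; applying the same descriptor to subsequences yields the hierarchical temporal encoding mentioned above.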

There is also a trend toward investigating the semantic level of actions. Existing action recognition methods, especially those based on deep learning, do not pay enough attention to the unique properties of human action itself; most simply treat the action video or sequence as an object and use deep networks to extract features directly for classification [11]. Yet action is a high-level composite semantic object involving hierarchical composition, progressive state transfer, human pose features, character interaction relationships, and so on, with its own complex properties [12]. In practical applications there are many kinds of actions, large variability in how they are expressed, and usually a complex scene background; perspective changes and object occlusion in video capture pose further challenges for recognition algorithms [13]. Second, many human actions are highly similar to one another, and fine-grained study of the differences between actions is needed to achieve good results. Third, a deeper understanding of the action must be built at the semantic level [14]. Most current algorithms apply data-driven classification crudely: their understanding of human actions is shallow, they rigidly cut actions into individual pieces, and they rarely consider explicitly the role of the person in the action. Only by forming a high-level semantic understanding of the action can intelligent human action recognition truly be achieved.

Compared with traditional methods, deep learning-based target detection algorithms can use deep learning either as a feature extraction or classification tool or as an end-to-end learning model in which feature extraction is performed jointly with classifier training. Such target classification and tracking algorithms greatly improve detection accuracy compared with SVM-type algorithms. In [24], the authors combined feature extraction, deformation handling, occlusion handling, and classification in pedestrian detection for joint learning and used back propagation (BP) to optimize each part in different layers of a deep network. In [15], the authors proposed a novel deep network model that jointly trains a multi-stage classifier for pedestrian detection through multiple BP stages. With a suitable training strategy, this deep architecture can simulate cascade classifiers by mining sample information to train the network step by step. Experiments show that the multi-stage training mechanism of their deep learning model mitigates overfitting well, thus improving pedestrian detection accuracy. Shimada et al. proposed a novel scale-aware fast R-CNN (SAFR-CNN) model that merges a large-size subnetwork and a small-size subnetwork into a unified architecture to handle pedestrian instances of various sizes. Specifically, SAFR-CNN trains dedicated subnetworks for large and small pedestrians to capture their distinctive features, sharing convolutional layers to extract common features at an early stage and combining the outputs of both subnetworks with a designed scale-aware weighting mechanism [16].

3 Analysis of simulated 3D human animation vision technology with enhanced machine learning algorithms

3.1 Enhanced machine learning algorithm design

3D human animation visual experience belongs to a specific display system; in short, it is the process of generating 3D human animation according to the user's requirements. In this system, the 3D human animation mainly reflects changes of human body position, proportion, and rotation angle, and is displayed in a specific way according to the user's characteristics [17]. The position transformation takes the head point of the human body as the marker point, written \((x,y,z)\), and marks offsets of \(d_{x}\), \(d_{y}\), and \(d_{z}\) length units in the x-, y-, and z-directions, respectively. Treating the human body as a mass point, the 3D human body position coordinates, written \((X,Y,Z)\), can be expressed as:

$$\left\{ {\begin{array}{*{20}c} {X = x - d_{x} } \\ {Y = y - d_{y} } \\ {Z = z - d_{z} } \\ \end{array} } \right.$$
(1)

The scale transformation of 3D human animation treats the human body as a mass point at \((0,0,0)\), extends the coordinates of each marked point of the 3D human animation along the \(X\), \(Y\), and \(Z\) components, respectively, and multiplies the extension distance by the vector lengths \(d_{x}\), \(d_{y}\), and \(d_{z}\). The coordinates after the scale transformation still take \((x,y,z)\) as the origin, and the original area is expanded or reduced by some factor, expressed by the formula:

$$\left\{ {\begin{array}{*{20}c} {X = x \cdot d_{x} } \\ {Y = y \cdot d_{y} } \\ {Z = z \cdot d_{z} } \\ \end{array} } \right.$$
(2)

The human body is still treated as a mass point, the origin of the coordinate system \((0,0,0)\) remains unchanged, and the coordinates of the center point are written \((x,y,z)\). When any point \((X,Y,Z)\) on the human body is rotated counterclockwise by an angle \(\theta\) about the vertical axis through the center, the new coordinates, written \((X_{1} ,Y_{1} ,Z_{1} )\), describe the rotation of the 3D human animation around the center, expressed by the formula:

$$\left\{ {\begin{array}{*{20}l} {X_{1} = (X - x)\cos \theta - (Z - z)\sin \theta + x} \hfill \\ {Y_{1} = Y} \hfill \\ {Z_{1} = (X - x)\sin \theta + (Z - z)\cos \theta + z} \hfill \\ \end{array} } \right.$$
(3)
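A hedged sketch of the three transformations above (translation, scaling, and rotation about the vertical axis through the body centre) might look as follows in NumPy; the multiplicative form of the scaling, the choice of the y-axis as the rotation axis, and the sign conventions are assumptions filled in where the text is ambiguous.

```python
import numpy as np

def translate(points, d):
    """Shift every marker by the offset d = (dx, dy, dz) (cf. Eq. 1)."""
    return points - np.asarray(d)

def scale(points, s):
    """Scale each coordinate about the mass point at the origin (cf. Eq. 2)."""
    return points * np.asarray(s)

def rotate_y(points, center, theta):
    """Rotate points counter-clockwise by theta about the vertical (y)
    axis through `center` (cf. Eq. 3); axis choice is an assumption."""
    c = np.asarray(center)
    p = points - c
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    x = p[:, 0] * cos_t - p[:, 2] * sin_t
    z = p[:, 0] * sin_t + p[:, 2] * cos_t
    return np.stack([x, p[:, 1], z], axis=1) + c

pts = np.array([[1.0, 2.0, 0.0]])
print(rotate_y(pts, (0, 0, 0), np.pi / 2))  # x maps onto z under a quarter turn
```

Composing these three functions reproduces the position, proportion, and rotation changes that the display system applies to the animated body.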

Another type of data commonly used for action recognition is 3D human pose data. The 3D pose of the human body is usually represented as a skeleton model consisting of joints and limbs. Most video-based motion recognition applies data-driven deep networks to learn features automatically from the whole video image, without specifically studying human actions in the process. Human action recognition, however, processes video content at the semantic level, and a semantic understanding of human behavior is required to truly achieve universal human action recognition. Skeleton data are a higher-level semantic feature of the human body: compared with video data, the skeleton model provides a detailed description of the body's posture, helping to analyze human movements accurately and to recognize actions from the human perspective [18]. Of course, pure skeleton data lose information about the environment and possible interacting objects and lack the objects' appearance features. In actions that do not involve environmental interaction, however, such as human gestures and postures, skeleton data still have a clear advantage.

A graph convolutional neural network uses a convolution-like operation to aggregate the features of all neighboring nodes of each node and then performs a linear transformation to generate a new feature representation of the given node. Specifically, all feature vectors in the neighborhood, including the feature vector of the central node itself, are aggregated and weighted according to the number of neighbors. In this work, we build skeleton subgraph sampling rules on human skeleton graph data so that regular convolution operations can be used on the graph. Unlike a regular convolution, the number of neighboring nodes of each node is not fixed when convolving on a graph, while the parameters of the convolution operation are fixed, so a mapping function needs to be defined to correspond a fixed number of parameters to a variable number of neighboring nodes. In this paper, the size of the convolution kernel is defined as three, with the three parameters corresponding to points farther from the center of the body, points nearer the center of the body, and the convolution point itself, as shown in Fig. 1.

Fig. 1

Overall flow of constructing spatio-temporal graph convolution to recognize human actions

But the geometric center alone does not effectively represent the center of gravity in human movement. It is natural to view the human skeleton as a combination of multiple body parts rather than as a single figure. A body-part-based representation provides insight into the importance of each part and into their cooperation and contextual relationships in space and time. Therefore, we propose an alternative strategy for representing the structure of the skeleton during human motion by dividing the skeletal joints into five main parts: two arms, two legs, and the torso.

$$n_{c} (v_{tj} ) = \left\{ {\begin{array}{*{20}l} {{\text{root}}} \hfill \\ {{\text{centripetal}}} \hfill \\ {{\text{centrifugal}}} \hfill \\ \end{array} } \right.$$
(4)
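The three-slot kernel mapping of Eq. (4) can be sketched by labelling each neighbour of a joint by its graph distance to a chosen body-centre joint; the toy skeleton and hop distances below are illustrative, not the paper's actual joint graph.

```python
ROOT, CENTRIPETAL, CENTRIFUGAL = 0, 1, 2

def partition_label(j, neighbor, hops_to_center):
    """Map a neighbour of joint j to one of the three kernel slots
    (cf. Eq. 4): the convolution point itself, neighbours nearer the
    body centre, and neighbours farther from the body centre."""
    if neighbor == j:
        return ROOT
    if hops_to_center[neighbor] < hops_to_center[j]:
        return CENTRIPETAL
    return CENTRIFUGAL

# toy chain: centre joint 0 -- joint 1 -- joint 2, with hop distances
hops = {0: 0, 1: 1, 2: 2}
print(partition_label(1, 1, hops))  # 0: the convolution point itself
print(partition_label(1, 0, hops))  # 1: closer to the body centre
print(partition_label(1, 2, hops))  # 2: farther from the body centre
```

Each label selects one of the three learned weight matrices of the size-three kernel, which is how a fixed parameter count serves a variable number of neighbours.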

After designing the spatial graph convolution on the human skeleton, we model the spatio-temporal dynamics of the skeleton sequence. In the construction of the human skeleton graph, its temporal dimension is built by connecting the same joints across consecutive frames. In this way, the spatial graph convolution is extended to the temporal domain, i.e., the spatial neighborhood is extended to include temporally connected joints; the temporal convolution form is shown in Eq. (5).

$$f_{out} (v_{tj} ) = \sum\nolimits_{{v_{tj} }}^{K} {f_{in} (T(v_{tj} ))}$$
(5)
$$P\left( {C_{i} \left| X \right.} \right) = \frac{{e^{{o_{i} }} }}{{\sum\nolimits_{j = 1}^{K} {e^{{o_{j} }} } }}$$
(6)
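Equation (6) is the standard softmax over the K class scores \(o_{1},\ldots,o_{K}\) produced by the last layer; a minimal, numerically stable version:

```python
import numpy as np

def softmax(o):
    """P(C_i | X) = exp(o_i) / sum_j exp(o_j); subtracting the maximum
    score before exponentiating avoids overflow without changing the result."""
    o = np.asarray(o, dtype=float)
    e = np.exp(o - o.max())
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])
print(p, p.sum())  # probabilities summing to 1; class 0 is most likely
```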

For feature point detection on the scanned mesh, each vertex is characterized by \(D(v_{k} )\), a 20-dimensional local shape descriptor that describes the curvature around the vertex:

$$D(v_{k} ) = (d(r_{1} ), \ldots ,d(r_{20} ))$$
(7)

Finally, probability prediction of the candidate regions is performed using a Markov network, where the coordinates with maximum probability are the locations of the corresponding feature points. To maximize the Markov joint probability, this paper uses a belief propagation algorithm to optimize the solution. The results of feature point detection are shown in Fig. 2, where the model is the scanned model and the red points are feature points.

Fig. 2

Feature point detection results

Since the CAESAR dataset used for training in this paper is a 3D model dataset of Western subjects, while the scanned data in this paper are mostly of Chinese subjects, and since the scans may contain noise and holes, the feature point detection results will inevitably contain certain errors, and the quality of feature point detection directly affects the later alignment and fitting stages. This paper therefore provides a tool for interactive correction of feature points, which allows users to correct feature points with large errors. Also, since the alignment and fitting process only requires the same number of corresponding points between the standard template and the scanned model, and does not necessarily require all 73 feature points, the user can interactively remove feature points that interfere with the alignment. Similarly, if the user wants to retain better local geometric detail in the fitting process, more feature points can be added interactively to guide the fitting deformation.

The interactive point selection part contains two windows: the standard template and the scan model. The system maintains two feature point lists, one for each. The standard template list initially has 73 feature points with fixed positions, while the positions of feature points in the scan model list can be modified interactively. When the user selects a feature point in the standard template window and then selects a feature point in the scan model window, the two feature points are associated. In addition, feature points can be added to the standard template; added points are appended to its feature point list, and their correspondences are likewise established by clicking the matching feature points on the scan model.

After the correction is finished, the tool outputs two feature point list files, one for the standard template and one for the scan model. Each line of a file is a five-tuple containing the feature point serial number, the feature point's 3D coordinates, and the feature point index, and feature points with the same serial number in the two files correspond to each other. If both the standard template and the scan model have 73 feature points, both lists have 73 lines. If the scan model has fewer than 73 feature points, only the feature points with an established correspondence are output, and both lists have as many lines as the scan model has feature points. If there are more than 73 feature points in the standard template and scan model, the number of lines in both lists is the total number of feature points.
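A plausible serialisation of these lists, one five-tuple per line, might look as follows; the whitespace-separated layout and the file name are assumptions, since the text specifies only the tuple contents.

```python
def write_feature_points(path, points):
    """points: list of (serial, x, y, z, vertex_index) five-tuples.
    Rows with the same serial number in the template file and the
    scan file correspond to each other."""
    with open(path, "w") as f:
        for serial, x, y, z, idx in points:
            f.write(f"{serial} {x:.6f} {y:.6f} {z:.6f} {idx}\n")

def read_feature_points(path):
    """Parse the file back into five-tuples."""
    pts = []
    with open(path) as f:
        for line in f:
            s, x, y, z, i = line.split()
            pts.append((int(s), float(x), float(y), float(z), int(i)))
    return pts

write_feature_points("template_pts.txt", [(0, 0.1, 1.7, 0.0, 412)])
print(read_feature_points("template_pts.txt"))
```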

The deformation-map-based coarse alignment algorithm is another option for coarse alignment. After the similarity transformation alignment is completed, when the scale, position, and orientation of the standard template and the scanned model are already roughly consistent, the deformation-map-based coarse alignment can unify the body shape and pose of the standard template and the scanned model. Unlike the coarse alignment algorithm based on the SMPL model, it is a space-embedded deformation algorithm: it first constructs a deformation map, then embeds the mesh model into the deformation map, updates the affine transformations of the deformation map nodes by solving an optimization problem, and finally computes the updated mesh vertex positions, completing the non-rigid coarse alignment.

3.2 Simulation of three-dimensional human animation visual technology design

One of the more challenging and valuable types of crowd simulation is crowd modeling and simulation based on real scenes, especially unstructured scenes. Although there is research on scene modeling and trajectory analysis, most of it still focuses on scene perception and trajectory anomaly detection and ignores the spatio-temporal statistical characteristics implicit in real crowd trajectories [19]. More importantly, the existing research results cannot effectively guide crowd simulation and its evaluation. For these reasons, this chapter focuses on revealing spatio-temporal statistical properties of crowd trajectories to assist more complex macroscopic scenario path planning and simulation evaluation. This chapter first introduces the clustering of crowd trajectories and then performs a spatio-temporal statistical analysis of each motion pattern class after clustering. Anomaly removal is included in the statistical analysis to obtain purer track point data. Finally, the obtained velocity field is used to model and simulate the scene. The trajectory analysis process of this method is shown in Fig. 3.

Fig. 3

The core idea of the attentional mechanism

The original motion data type has 32 nodes with 99-dimensional data features in total, but in practice this project merges nodes that have spatio-temporal motion similarity with neighboring nodes, according to the actual requirements. To speed up training of the network model, our data pre-processing also removes redundant nodes and reduces the network parameters, using seventeen nodes to represent the research subject's original 32 nodes. The original pose data before redundancy removal, with detailed descriptions of joint names, orientation conventions, and parent and child nodes, represent the original human skeleton model, alongside the skeleton model after redundancy removal. Since the original motion data contain all joints, but in practical application the deep learning model's parameters should be reduced and training accelerated, the pre-processing removes joints that are close together in skeleton space, i.e., removes the redundant data. In addition, the classical deep network models for modeling and prediction of human motion sequences are introduced, and their advantages in short-term prediction and good performance on periodic motion are analyzed, along with the converse cases in which prediction performance degrades slightly.
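The redundancy-removal step, keeping 17 of the 32 captured joints, amounts to a column selection on the raw sequence; the specific indices kept below are illustrative, not the project's actual joint list.

```python
import numpy as np

# illustrative indices of the 17 joints kept after removing redundant,
# near-coincident joints (NOT the project's actual selection)
KEEP = [0, 1, 2, 3, 6, 7, 8, 12, 13, 14, 15, 17, 18, 19, 25, 26, 27]

def reduce_joints(seq, keep=KEEP):
    """seq: (T, 32, 3) raw capture; returns (T, len(keep), 3) with the
    redundant joints dropped, shrinking the model's input dimension."""
    return seq[:, keep, :]

seq = np.zeros((100, 32, 3))  # toy 100-frame sequence
print(reduce_joints(seq).shape)  # (100, 17, 3)
```

The smaller input directly reduces the parameter count of the first network layer, which is where the training speed-up comes from.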

In terms of skeletal information features, researchers tend to adopt the motion trajectories of joint position changes among limb bones and the geometric transformations among limbs as representations of human action, for example using the covariance matrix of the temporal changes of skeletal joint positions as the discriminative descriptor of action sequences [20]. In recent years, considering the complexity and non-linearity of human action, many researchers have started to explore natural representations of human limb action in manifold space, extracting the motion trajectories and geometric transformations of limbs in the manifold to describe human action. Inspired by this, human action features are extracted here from the geometric transformation relationships between limbs in the Lie group manifold.

In addition, recurrent neural networks predict well for periodic action types, but most human motions in natural space are non-periodic and have high degrees of freedom, non-linearity, and uncertainty. Therefore, in the next chapter, considering that human motion is influenced by natural conditions and its own properties, a human skeletal motion prediction model is constructed that considers not only the motion characteristics in the time domain but also the spatial structure information between joint points. The motion features in the time domain are stacked by recurrent units frame by frame, which strengthens the weight of certain latent motion features and facilitates their propagation through the recurrent network over long horizons, improving the prediction performance of the model. Adding a residual connection makes the transition between the last frame of the real motion sequence and the first predicted frame smoother, and adding an attention mechanism strengthens the attention weights of certain joints, helping the network focus on long-range information and slowing the forgetting of long-term information. Since only one dataset is used to train and test the model in this paper, in future work we will increase the training data and train the model with the CMU motion capture dataset. The skeleton node numbering and adjacency matrix are shown in Fig. 4.
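The residual connection described above, where the network predicts a pose delta that is added to the last observed frame, can be sketched as follows; the pose dimensionality and values are toy stand-ins.

```python
import numpy as np

def residual_step(last_frame, predicted_delta):
    """Residual connection for motion prediction: the network outputs a
    change in pose and the next frame is last observed pose + delta, so
    the transition from the real sequence to the prediction stays smooth."""
    return last_frame + predicted_delta

last = np.array([0.5, 0.2, 0.1])      # last real pose (toy, 3 dims)
delta = np.array([0.01, -0.02, 0.0])  # small change emitted by the RNN
print(residual_step(last, delta))     # stays close to the last real frame
```

Because the network only has to model the (typically small) frame-to-frame change, a zero output already reproduces the last pose, which is what removes the jump at the prediction boundary.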

Fig. 4

Adjacency matrix of skeleton node numbers and nodes

There are 12 motion trajectories for the 6 actors in the test set. The original motion capture data contain the spatial 3D coordinates of 32 joints. Since some joints are close together in space, their captured motion data are very close to each other, which means the spatial and temporal motion characteristics of nearby joints are also very close. The training speed of the model can therefore be improved by reducing the number of training parameters: the motion data of redundant nodes are removed by keeping only one node of each close group [21]. Thus, instead of the original 32 nodes totaling 99 dimensions, 17 important nodes totaling 54 dimensions are kept. Second, many of the actions in UCF24 are interactions between people and objects. After using target localization to process the human area images separately, these interactive objects are excluded, which causes a loss of action information. However, for the system's target actions, the traffic-police gestures of the TPG dataset, the actions are all individual human gestures at fixed positions. In this case, the target localization and optical flow improvement methods effectively extract the targets from the background and enhance the optical flow of the target actions, significantly improving the final recognition results.

4 Analysis of results

4.1 Algorithm performance results

Compared with rule-based crowd simulation strategies, the operational efficiency of traditional data-driven crowd simulation methods is greatly affected by the size of the database. To further demonstrate that the algorithm proposed in this chapter outperforms traditional data-driven methods in operational efficiency, the experiments in this section examine the crowd simulation efficiency of the different algorithms at 10 different data sizes; the results are shown in Fig. 5. The simulation environment density is 20 a/f, k is set to 10 for all examined methods, and the search structure is a kd-tree. The number of clusters in the CLUSTER method is set to 1/1000 of the total data volume.

Fig. 5

Comparison of the time required to run 10,000 frames for four data-driven crowd simulation-based methods

Figure 5 shows that only the method described in this chapter achieves stable and fast running efficiency in crowd simulation across different amounts of training data. Compared with the KNN method, the PAG method runs faster, but the number of collisions cannot be effectively controlled, so the algorithm must constantly reposition nodes with the search strategy during execution. The CLUSTER method reduces the search domain by adding a classifier, which introduces additional classification overhead into the final runtime. Separately, a human body measurement algorithm based on projection and posture correction is proposed to fit the missing data and to measure human body feature parameters from millimeter-wave point clouds; Lie-group skeletons are used to represent the Lie group features extracted by the model, and although the data dimension is relatively high, this method can both process high-dimensional data and reduce the complexity of the recognition process. In addition, we examined the running efficiency of ORCA, which took an average of 47.28 s to run 10,000 frames for a dense population of 20; the running time of the method described in this chapter approaches that of ORCA. It is worth noting that ORCA is affected to some extent by crowd density (not the number of pedestrians), while the neural-network-based method in this chapter is not, which further demonstrates the stability of the proposed method.
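The dependence of the KNN baseline's runtime on database size comes from the per-frame nearest-neighbour query; a brute-force sketch (a real implementation would use a kd-tree, as the experiments do) under an assumed 6-dimensional motion state:

```python
import numpy as np

def knn_query(database, state, k=10):
    """Return indices of the k database states nearest to `state`.
    Brute force is O(N) per query, so the per-frame cost grows with the
    database; the neural-network approach replaces this lookup with a
    fixed-cost forward pass, which is why its runtime stays flat."""
    d = np.linalg.norm(database - state, axis=1)
    return np.argsort(d)[:k]

rng = np.random.default_rng(1)
db = rng.normal(size=(5000, 6))  # stored (position, velocity) states (toy)
idx = knn_query(db, np.zeros(6), k=10)
print(idx.shape)  # (10,)
```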

Finally, we briefly explain that the proposed approach offers stronger data confidentiality and maintainability than traditional data-driven approaches. The approach described in this chapter does not require the maintenance of a large and complex structured data system, and the model is trained end-to-end without much human intervention. The model in this chapter has 780 parameters, far fewer than the data items themselves. Therefore, when the method is integrated into a system, we only need to store these 780 parameters and do not need to load a large database for simulation experiments. At the same time, the separation of the data from the program undoubtedly raises the confidentiality of the data to a higher level.
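The text gives the parameter count (780) but not the layer layout. For intuition, the helper below counts the weights and biases of a fully connected network; the 12-24-16-4 layout shown is purely a hypothetical example that happens to total exactly 780 parameters, not the architecture used in the paper.

```python
def mlp_param_count(layer_sizes):
    """Total weights + biases of a fully connected network whose
    layer widths are given in order, input first."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical layout reaching exactly 780 parameters:
# (12*24 + 24) + (24*16 + 16) + (16*4 + 4) = 312 + 400 + 68 = 780
n = mlp_param_count([12, 24, 16, 4])
print(n)  # 780
```

Whatever the true layout, storing a few hundred floating-point parameters is negligible next to shipping the raw motion database, which is the maintainability and confidentiality argument being made.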

A more efficient and accurate local collision avoidance mechanism is achieved by fitting real crowd data with a neural network model. By training a multilayer neural network to map the motion states in real data to the corresponding motion decisions, we obtain a crowd motion decision model that represents the motion decisions embedded in real pedestrian data. The experimental results show that our method outperforms the commonly used data-driven crowd simulation methods in terms of simulation realism, simulation efficiency, collision avoidance, and maintainability, as shown in Fig. 6.

Fig. 6

Discriminant convergence curve

The attention weights learned by the model effectively emphasize the key joints based on the action sequence information. At the beginning of an action sequence, the weights of all joints are usually relatively low and differ little from each other. As the action progresses, the weights of the joints are adjusted according to their importance in the movement: joints that are critical to the movement gradually gain more weight, while the weights of unimportant joints are reduced. For example, in the first column of the drinking action, the main interactive joints, the right hand and the head, receive increasing attention weights as the drinking action progresses. Similarly, the hands and head in the glasses-wearing action, the hands in the hand-clapping action, and the legs in the leg-kicking action all receive higher attention weights as the respective action progresses. The attention-weight visualization indicates that, through the attention mechanism, the model learns to discover the critical joints in the skeleton from the sequential information of the movements, which further demonstrates the effectiveness of the attention mechanism employed by the model.
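The per-joint weighting described above can be sketched as a softmax attention over joint features. The scoring vector, joint count, and feature dimension below are illustrative assumptions (the paper's network learns its scoring function during training); the sketch only shows how a score per joint becomes a normalized weight distribution.

```python
import numpy as np

def joint_attention(features, w):
    """features: (J, D) per-joint features at one time step;
    w: (D,) scoring vector. Returns softmax weights over the J joints."""
    scores = features @ w
    e = np.exp(scores - scores.max())   # subtract max for stability
    return e / e.sum()

rng = np.random.default_rng(1)
feats = rng.normal(size=(25, 8))        # e.g. 25 skeleton joints, 8-dim features
w = rng.normal(size=8)                  # hypothetical learned scoring vector
alpha = joint_attention(feats, w)
```

The weighted sum `alpha @ feats` then lets critical joints (large `alpha`) dominate the pooled representation, which is the effect visualized in the attention maps.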

4.2 3D human animation visual results

When comparing the three visual experience systems, the fused data volumes of all inputs are unified and summarized, and the experimental results are shown in Fig. 7. From Fig. 7 we can see that the data fusion volume of the machine-learning-based 3D human animation visual experience system designed in this paper is above 70 TB at its lowest, while the best value of the two traditional systems does not exceed 60 TB, which means that the system in this paper has a stronger ability to fuse data. This is because a machine learning algorithm is introduced in the software design: the algorithm fuses the visual experience data with the user data to fill in the visual experience data, and the contrast of the filled visual perception is then optimized, which ultimately increases the data fusion volume of the system and thus improves its performance.

Fig. 7

Change in viewing angle

The observation perspective has a great influence on 3D action recognition. The proposed attention-based multi-view observation fusion method first re-observes the action data under multiple views, then processes these observations separately, and finally fuses the results of all observations to obtain the action category judgment, improving recognition by integrating information from multiple observation views. In the fusion process, the attention mechanism evaluates, based on the action sequence information, whether an observation perspective is helpful for action recognition; perspectives that are helpful receive a high fusion weight, further improving the multi-view fusion. In this way, the model can learn to find the viewpoints best suited to action recognition among multiple viewpoints based on the action information. LSTM is a network suited to time-series prediction tasks; we fine-tune the structure of its neural units and, at each forward time step in the LSTM, i.e., for each motion sequence frame, perform local and overall spatial feature extraction between graph convolution nodes. Spatial and temporal features are thus extracted simultaneously for each motion sequence frame, capturing the complete motion information of the skeletal sequence. In addition, residual connections are added to the decoder to improve prediction performance and accelerate convergence.
The comprehensive experimental results show that our spatio-temporal end-to-end network model extracts both the temporal and the spatial features of the motion skeletal sequences and achieves outstanding prediction accuracy in the motion sequence prediction task.
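The fusion step described above can be sketched as follows: each view produces class logits, the attention mechanism produces one score per view, and the final prediction is the softmax of the attention-weighted sum of logits. The logits and scores below are hand-picked toy numbers, assumed only for illustration; in the actual model the scores come from a learned network conditioned on the action sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(view_logits, view_scores):
    """view_logits: (V, C) class logits from V observation views;
    view_scores: (V,) attention scores per view.
    Returns fused class probabilities."""
    alpha = softmax(view_scores)                     # fusion weight per view
    fused = (alpha[:, None] * view_logits).sum(axis=0)
    return softmax(fused)

logits = np.array([[2.0, 0.5, 0.1],    # view 1 favours class 0
                   [1.8, 0.4, 0.2],    # view 2 agrees
                   [0.2, 0.3, 1.9]])   # e.g. an occluded view disagrees
scores = np.array([2.0, 1.5, -1.0])    # attention down-weights the bad view
probs = fuse_views(logits, scores)
```

Because the unhelpful view receives a near-zero fusion weight, the fused prediction follows the two informative views, which is exactly the behavior the attention mechanism is meant to learn.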

From Fig. 8 it can be seen that although several convolutional neural network models with different combinations of convolutional kernel sizes can effectively classify the human action features extracted in Chapter 3, their classification performance differs, mainly because the feature information captured by different kernel sizes is not the same. Compared with the other combinations, the best results are obtained when the kernel size of the first convolutional layer is 13*13 and that of the second convolutional layer is 8*8, so we choose this combination of weight parameters for the network model construction.

Fig. 8

Effect of different numbers of feature maps on recognition effect

In the process of network training, the selection of network model parameters greatly affects the effectiveness of the model: the choice of key parameters between the input-layer and output-layer feature maps, such as weights and biases, can produce better or worse recognition results. Only a few high-cost setups can capture motion data as required, and the captured data are often incomplete; the accuracy of capture devices is still not good enough to track perfect motion data. In this paper, to achieve a more satisfactory recognition effect, we conducted several experiments with different parameter combinations (mainly the convolutional kernel sizes of the different convolutional layers), using the high-dimensional action feature data extracted in this paper and the CNN model as a reference.
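One concrete consequence of the kernel-size choice is the size of the resulting feature maps, which in turn fixes the number of downstream parameters. The standard valid-convolution size formula is sketched below; the 64x64 input map, stride 1, and zero padding are assumptions for illustration (the paper does not state its input resolution), while the 13*13 and 8*8 kernels are the combination selected above.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output side length of a 2D convolution:
    out = (size + 2*pad - kernel) // stride + 1"""
    return (size + 2 * pad - kernel) // stride + 1

# Hypothetical 64x64 input feature map, valid convolutions, stride 1:
h1 = conv_out(64, 13)   # after the 13x13 first layer: 64 - 13 + 1 = 52
h2 = conv_out(h1, 8)    # after the 8x8 second layer: 52 - 8 + 1 = 45
print(h1, h2)           # 52 45
```

Larger kernels shrink the map faster and widen the receptive field per layer, which is one reason different kernel-size combinations extract different feature information and classify differently.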

Accurate cross-sectional determination of body parts is fundamental to circumference measurement, and accurately determining the cross-sectional curve helps to improve the accuracy of locating the measured part. For example, logic-based virtual character design sets a series of rules for the virtual character, allowing the virtual person to make behavior decisions based on these rules. This is especially true when measuring waist circumference and hip circumference. Waist circumference is the circumference of the thinnest part of the human waist, i.e., the circumference of the smallest curve among all cross-sectional curves of the waist, and hip circumference is the circumference of the fullest part of the human hip, i.e., the circumference of the largest cross-sectional curve among all cross-sectional curves of the hip, so selecting cross-sectional curves at and near these two parts helps to determine the measurement location more accurately. First, according to the functional relationship between different body parts and height, the proportion of the measured part's height to the total height is determined; then the approximate height of the measured part is determined, and the selection range is floated up and down appropriately. Multiple sets of cross sections are taken over the selected range to obtain its three-dimensional profile.
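The waist-finding procedure above (slice a band of heights, compute each cross-section's circumference, keep the minimum) can be sketched on a synthetic point cloud. The torso model, slice width, and angular-sort perimeter approximation below are all illustrative assumptions, not the paper's actual millimeter-wave pipeline.

```python
import numpy as np

def slice_perimeter(points, z_lo, z_hi):
    """Approximate the circumference of the cross-section between heights
    z_lo and z_hi: project the slice to the XY plane, order the points by
    angle around the centroid, and sum the polygon edge lengths."""
    sl = points[(points[:, 2] >= z_lo) & (points[:, 2] < z_hi)][:, :2]
    if len(sl) < 3:
        return 0.0
    c = sl.mean(axis=0)
    order = np.argsort(np.arctan2(sl[:, 1] - c[1], sl[:, 0] - c[0]))
    ring = sl[order]
    return float(np.linalg.norm(np.roll(ring, -1, axis=0) - ring, axis=1).sum())

# Synthetic torso whose radius narrows towards the waist at z = 1.0 m
z = np.linspace(0.8, 1.2, 2000)
r = 0.14 + 0.3 * (z - 1.0) ** 2
theta = np.random.default_rng(2).uniform(0, 2 * np.pi, z.size)
cloud = np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)

# Float the search band around the estimated waist height, keep the minimum
heights = np.arange(0.85, 1.15, 0.02)
perims = [slice_perimeter(cloud, h, h + 0.02) for h in heights]
waist = min(p for p in perims if p > 0)
```

Hip circumference follows the same pattern with `max` instead of `min` over the band around the estimated hip height; on real scans a convex-hull or spline fit of each slice would replace the simple angular sort.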

5 Conclusion

To solve the problem of the low amount of fused data in the traditional system, a 3D human animation visual experience system based on machine learning is designed. Experimental verification shows that the system overcomes the poor data fusion ability of the traditional system, indicating that it has clear advantages in data fusion. Although the system designed in this paper has made some progress, it does not consider the problem of low data processing efficiency caused by a large amount of data, and this will be the focus of our next in-depth study. The visual experience of 3D human animation belongs to a specific display system; in short, it is the process of generating 3D human animation according to the needs of users. In the system of this article, the 3D human animation mainly reflects changes in the position, proportion, and orientation of the human body and displays the animation in a specific way according to the characteristics of the user. A projection- and pose-correction-based human point cloud size measurement method is proposed. For different parts, the functional relationship between part height and total height, together with the cross-sectional curve measurement of the part, makes part selection on the point cloud human model more accurate and the measurement results more precise. By comparing different body measurement methods, the relationship between the point cloud obtained by millimeter-wave scanning and that obtained by laser scanning is determined, which makes millimeter-wave human point cloud measurement more rigorous. Based on the obtained millimeter-wave human point cloud data, a 3D mannequin is reconstructed for virtual presentation, and a small human parameter database is constructed from the measured data.