1 Introduction

With the rapid development of science in recent years, 3D technology has been widely used in industrial design, the medical industry, architectural design, aerospace, automotive manufacturing, education, film and television animation, and other fields. The categories and numbers of 3D models are increasing year by year, and 3D model classification has become a popular topic in computer vision. Compared with 2D images, 3D models carry additional structural information. In the development of 3D model classification, multi-view images, point clouds, PANORAMA views, meshes, and voxels are all common 3D shape representations.

In the multi-view representation, an object is photographed by virtual cameras placed at fixed angular intervals and tilted down 30 degrees; in other words, a multi-view representation is a set of 2D views. Su et al. [27] input multiple views into a series of convolutional neural networks to obtain a descriptor of the model. Using this descriptor to classify and retrieve 3D models achieves good results on popular datasets, but the descriptor contains only the visual information of the model. A point cloud is a uniform point-based sampling of the 3D model. PointNet feeds the raw point cloud into the network to exploit the spatial characteristics of the points; the data volume is small and 3D models can be classified efficiently [20]. Sfikas et al. [25] fed PANORAMA views into convolutional neural networks to classify 3D models effectively; a PANORAMA representation consists of a series of images that capture both visual and structural information. Mesh data of a 3D shape is a collection of vertices, edges, and faces; a 3D model may contain several meshes, and mesh data is complex and irregular [7]. A voxel is the smallest unit of 3D model segmentation, analogous to the pixel of a 2D image; processing voxels is complex, so they are not a popular representation.

All of the above methods classify models using a single modality, so none of them makes full use of both the structural and the visual information of the 3D model. It is natural to expect that feature vectors from different modalities share the same or similar components while also carrying modality-specific information, so the modalities can complement each other in the prediction step for more robust classification. To demonstrate this idea, we propose a novel multi-modal classification network for 3D shape classification. We use multi-view images, point clouds, and PANORAMA views to represent the visual, structural, and surface information of a model, and then use MVCNN, PointNet, and PANORAMA-MVCNN to obtain the predict-scores for 3D model classification. Finally, we propose an effective fusion approach that combines the classification results of the different modalities with different weights to obtain the final 3D model classification.

The main contributions of this paper are summarized in the following two aspects:

  • We propose a new multi-modal classification network, which uses predict-score fusion with different weighting coefficients to obtain more accurate classification results.

  • A popular dataset is used to demonstrate the performance of the proposed method, and several classic methods are used for comparison. The experiments demonstrate the superiority of our approach.

The rest of the paper is structured as follows. Section 2 introduces work related to 3D classification. Section 3 illustrates the details of the multi-modal classification network. Section 4 presents experimental results and discussion. Section 5 describes the implementation details. Finally, Section 6 draws conclusions.

2 Related work

3D shape recognition can be divided into four categories according to the modality used: mesh-based methods, volume-based methods, view-based methods, and multi-modal fusion methods.

  • Mesh-based methods: Mesh data of 3D shapes consist of vertices, edges, and faces, and meshes have a stronger 3D shape description capability than other data types.

    Socher et al. [26] use a combined convolutional and recursive neural network to learn features of RGB-D images for classification. Novotny et al. [18] use joint view decomposition to align objects in an unsupervised manner from object motion. Hubeli and Gross [10] design a semi-automatic framework to extract surface mesh features, which requires users to manually enter parameters and operators. Kokkinos et al. [15] address the orientation-ambiguity problem by constructing an intrinsic shape context (ISC) meta-descriptor defined on the 3D shape surface. Feng et al. [7] propose MeshNet, which is based on face units and feature splitting and addresses the complexity and irregularity of traditional meshes.

  • Volume-based methods: By first representing 3D models as voxels or point clouds, convolution operations can be performed in a way analogous to two-dimensional images.

    Wu et al. [30] use a convolutional deep belief network to represent a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid and realize active object recognition through view planning. Brock et al. [1] train voxel-based variational autoencoders, demonstrating the potential of voxel representations for model classification. Qi et al. [20] design a network that is applied directly to point cloud data and works well for 3D model classification. However, PointNet cannot capture local structure, which results in low accuracy on fine-grained patterns and poor generalization to complex scenes. Qi et al. [22] propose PointNet++, which learns local features by hierarchically processing sets of points sampled in the metric space.

  • View-based methods: Compared with processing 3D data directly, computer vision techniques for 2D images are more mature and faster, so it is useful to reduce 3D models to 2D views. The earliest view-based 3D model recognition work is by Murase and Nayar [17], who collect a large number of images of an object under varying pose and lighting conditions to form an eigenspace and classify the 3D model by appearance matching. Kanezaki et al. [11] design RotationNet, a convolutional neural network that performs inference from partial sets of multi-views; viewpoints are treated in an unsupervised manner during training, and the network selects viewpoints and estimates categories and poses. Su et al. [27] propose the multi-view CNN (MVCNN), which takes images from different viewpoints as raw data and trains a novel CNN framework to obtain highly accurate shape descriptors. Sfikas et al. [25] feed model PANORAMA views into a convolutional neural network for 3D model classification based on Fisher vectors. Schneider et al. [24] provide benchmarks for different sketches, perform classification-driven analysis, and extract semantic features of sketches.

  • Multimodal fusion methods: Each 3D classification method performs well on its own, and different methods can be weighted and combined so that they compensate for each other.

    Chen et al. [3] use a 3D point cloud to generate 3D candidate boxes in autonomous driving scenarios and then fuse multi-view features obtained with FractalNet and Deeply-Fused Net. Gonzalez et al. [9] fuse RGB images and depth maps in a detection framework to improve the accuracy of identifying 2D objects. Enzweiler et al. [6] detect pedestrians by blending features and cues to improve the classification objective. You et al. [32] propose PVNet, which combines point cloud and multi-view data so that the two modalities compensate for each other.

3 Method

Figure 1 shows the framework of our work, which mainly includes three steps: 1) Multi-modal data generation: we utilize OpenGL to extract the multi-view and PANORAMA information and sample point cloud data for each 3D model; 2) Multi-modal network learning: this step produces the predict-scores of a 3D model for each modality. We train each single-modal network independently so that each modality reaches its best performance; 3) Classification fusion: we propose an effective classification fusion method that exploits the advantages of the different classifiers for a more accurate classification of the 3D model. The next subsections detail these three steps.

Fig. 1

Our MIFN framework is composed of four parts: a multi-view network, a point cloud network, a PANORAMA-view network, and a predict-score fusion part. Multi-view network: the structure of MVCNN is employed, with a view-pooling layer that performs max-pooling across all views. Point cloud network: the classic PointNet structure is employed. This network takes n points with 3-dimensional coordinates as input; a spatial transform net then learns a 3 × 3 matrix to align the input points to a canonical space. EdgeConv extracts the local patch of each point from its k-nearest neighborhood, computes edge features for each point by applying a 1 × 1 convolution with M′ output channels, and generates the output tensor after pooling among neighboring edge features. PANORAMA-view network: it also utilizes the structure of MVCNN, but the parameters are retrained on PANORAMA view data. Predict-score fusion part: based on the predict-scores produced by the above three networks, this part determines the weight of each modal predict-score through statistical experiments and exploits the advantages of the different modal predict-scores for a better classification result

3.1 Data processing

Multi-view (MV modal)

Since the size and orientation of 3D model data are not uniform, we first use NPCA [19] to calibrate the 3D model. We then place a virtual camera every 30 degrees around the Z-axis of the 3D model, with the lens pointing at the model centroid and tilted down 30 degrees. Finally, we obtain twelve views of the 3D model through OpenGL visualization tools.
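
To make the camera placement concrete, the following is a minimal sketch (not the exact rendering code used in this work) of how the twelve camera positions can be computed with NumPy; the framing distance `dist` is an illustrative parameter, and the actual rendering is performed with OpenGL as described above.

```python
# Sketch: cameras spaced every 30 degrees in azimuth around the Z-axis,
# elevated by 30 degrees so that each lens tilts down toward the centroid.
import numpy as np

def camera_positions(centroid, dist=2.5, n_views=12, elevation_deg=30.0):
    """Return (n_views, 3) camera centers, each looking at `centroid`."""
    elev = np.deg2rad(elevation_deg)
    positions = []
    for i in range(n_views):
        azim = np.deg2rad(360.0 / n_views * i)   # 0, 30, 60, ... degrees
        x = dist * np.cos(elev) * np.cos(azim)
        y = dist * np.cos(elev) * np.sin(azim)
        z = dist * np.sin(elev)                  # raised so the view tilts down 30 degrees
        positions.append(centroid + np.array([x, y, z]))
    return np.stack(positions)

# Each position is then passed to the renderer (e.g. an OpenGL look-at call)
# with the view direction pointing at the model centroid.
```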

Point cloud (PC modal)

The point cloud data is obtained by meshing the surface of the 3D model and using the centroid of each mesh face to represent it. Because the volumes of 3D models differ, the surface of each model needs to be subdivided to obtain more faces. Here we use the butterfly subdivision algorithm [5] to obtain more points. Finally, we obtain 1024 points for each 3D model and convert the PLY model into point cloud data.
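
The following is a minimal sketch, under our reading of the procedure above, of how a fixed-size point set can be produced from a triangle mesh by taking face centroids. For brevity it uses a simple midpoint subdivision as a stand-in for the butterfly subdivision [5] used in this work, and `vertices`/`faces` are assumed inputs.

```python
# Sketch: densify a sparse mesh, take face centroids, and sample 1024 points.
import numpy as np

def face_centroids(vertices, faces):
    return vertices[faces].mean(axis=1)                      # (F, 3)

def midpoint_subdivide(vertices, faces):
    """Split every triangle into four via edge midpoints (placeholder for butterfly)."""
    v, new_faces = list(vertices), []
    for a, b, c in faces:
        ab, bc, ca = len(v), len(v) + 1, len(v) + 2
        v += [(vertices[a] + vertices[b]) / 2,
              (vertices[b] + vertices[c]) / 2,
              (vertices[c] + vertices[a]) / 2]
        new_faces += [[a, ab, ca], [ab, b, bc], [ca, bc, c], [ab, bc, ca]]
    return np.array(v), np.array(new_faces)

def sample_point_cloud(vertices, faces, n_points=1024):
    while len(faces) < n_points:                             # subdivide until enough faces
        vertices, faces = midpoint_subdivide(vertices, faces)
    pts = face_centroids(vertices, faces)
    idx = np.random.choice(len(pts), n_points, replace=False)
    return pts[idx]                                          # (1024, 3)
```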

PANORAMA-view (PV modal)

A PANORAMA view is a set of 2D images that contains the surface information of a 3D model. To obtain the PANORAMA view, we project the surface of the 3D model onto the surface of a cylinder whose axis passes through the centroid of the 3D model. The radius R of this cylinder is three times the maximum distance from the surface of the 3D model to the axis of the cylinder, and the height of the cylinder is 2R. Taking the Z-axis as an example, we use a set of points s(φ,z) to represent the projected data, where φ is the angle of a point of the 3D shape around the axis and z is the height of the point. From this point set we can compute four panoramic maps for each axis: 1) the position of the model's surface in 3D space (SDM); 2) the orientation of the model's surface (NDM); 3) the gradient map of the NDM; 4) a three-channel image composed of the above three maps. Thus, for each model we obtain 12 views from the three-axis projections, as shown in Fig. 2.
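
As a reading aid, the following minimal sketch shows how one of the four maps, the SDM, can be computed from a point-sampled surface by the cylindrical projection described above; the resolutions `n_phi` and `n_z` are illustrative choices rather than the settings used in this paper.

```python
# Sketch: project surface points onto a cylinder around a chosen axis and keep,
# for each (phi, z) cell, the largest distance from the axis (the SDM value).
import numpy as np

def panorama_sdm(points, n_phi=180, n_z=64, axis=2):
    """Project `points` (N, 3) onto a cylinder around `axis` (default Z)."""
    pts = points - points.mean(axis=0)                   # centroid at the origin
    other = [i for i in range(3) if i != axis]
    u, v, h = pts[:, other[0]], pts[:, other[1]], pts[:, axis]

    phi = np.arctan2(v, u)                               # angle around the axis
    r = np.sqrt(u ** 2 + v ** 2)                         # distance from the axis

    phi_bin = ((phi + np.pi) / (2 * np.pi) * (n_phi - 1)).astype(int)
    z_bin = ((h - h.min()) / (h.max() - h.min() + 1e-9) * (n_z - 1)).astype(int)

    sdm = np.zeros((n_z, n_phi))
    np.maximum.at(sdm, (z_bin, phi_bin), r)              # keep the outermost hit per cell
    return sdm
```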

Fig. 2

The PANORAMA views of a 3D model on three axes, consisting of the SDM, the NDM, the gradient map of the NDM, and the 3-channel images

3.2 MIFN: Multi-modal classification network

In order to utilize the advantages of the different modalities of 3D shapes, we design the Multi-modal Information Fusion Network (MIFN), which uses a new strategy in the prediction part. In contrast to traditional methods, instead of using the prediction score of a single modality, we refer to all modal information of the 3D shape. To make the final prediction correctly represent the 3D shape while maximizing the precision of the prediction, we process each model according to the data preprocessing described in Section 3.1. Each modality produces its own modal-level prediction using its own network, and a consensus function is designed to aggregate these modal-level predictions into the final prediction scores over the different classes, called the model-level prediction. This model-level prediction score is more reliable and informative than any individual modal-level prediction, since it aggregates the prediction results of three modalities. Note that we train each single-modal network and update its parameters separately, so that each single-modal network reaches its best performance.

For a 3D shape M, we obtain K modalities of data P1, P2, ⋯, PK after preprocessing. These modality data are then fed into their own networks M1, M2, ⋯, MK. Finally, the Multi-modal Information Fusion Network (MIFN) aggregates the prediction scores of the different modalities as follows:

$$ MIFN(P_{1},P_{2},{\cdots} ,P_{K})=H(G(M(P_{1}:W_{1}),M(P_{2}:W_{2}),{\cdots} ,M(P_{K}:W_{K}))) $$
(1)

Here, WK represents the parameters of MK, which are updated by feeding the single-modality data PK into the network and producing the modal-level prediction score. G is the aggregation function that combines the outputs of the Mk to aggregate the modal-level prediction scores. Based on this aggregated result, the softmax function H predicts the probability of each class for the 3D shape. In our MIFN, the aggregation function is of great importance. It should retain as much useful information as possible. Meanwhile, it should be able to treat each modality differently, since different modalities have different characteristics and different advantages in category prediction. We therefore want a function that aggregates information across modalities rather than one tied to a single modality. We provide the details of the function G in the next subsection.

3.3 Multi-modal information fusion

From the above analysis, we know that the aggregation function is the most important part of our framework (MIFN). In this section, we introduce our design of the aggregation function. Because the modalities are learned independently, we obtain three modal-level prediction scores based on the different modality data. These scores represent the probability with which a single modality predicts the model as belonging to each class, but they are based on features learned from a single modality alone. Since the reliability of a single-modal prediction is limited, we could simply add the predict-scores as in equation (2). However, this ignores the differences between modalities, so we use a linear weighted average to fuse the modal-level prediction scores as in equation (3). In this work, we employ this weighted fusion method to fuse the three modal-level prediction scores. The framework of this method is shown in Fig. 1, and the overall formulation is given in equation (1).

$$ S=\sum\limits_{i=1}^{K}M(P_{i}:W_{i}) $$
(2)
$$ S=\sum\limits_{i=1}^{K}\alpha_{i} M(P_{i}:W_{i}) $$
(3)

where M(Pi : Wi) represents the modal-level prediction score produced by MVCNN, PointNet, and PANORAMA-MVCNN, respectively, based on the different modalities of the 3D model, and αi is the weight of each modal-level prediction score, used to balance the Multi-view, Point cloud, and PANORAMA-view modalities. The fused score is also processed by softmax to obtain the class label. The related experiments are shown in Section 4.2.
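
For clarity, the following minimal sketch illustrates equations (2) and (3): the three modal-level predict-scores are combined by a weighted sum and passed through softmax to obtain the model-level prediction. The weights shown follow the values reported in Section 4.2; the variable names are ours.

```python
# Sketch: weighted fusion of modal-level predict-scores (equation (3)),
# followed by softmax (the function H) to get model-level class probabilities.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_scores(score_pc, score_mv, score_pv, weights=(0.7, 0.2, 0.1)):
    """Fuse (n_classes,) predict-scores from the three modal networks."""
    a1, a2, a3 = weights                                   # alpha_1, alpha_2, alpha_3
    fused = a1 * score_pc + a2 * score_mv + a3 * score_pv  # equation (3)
    return softmax(fused)                                  # model-level prediction

# predicted_class = np.argmax(fuse_scores(s_pc, s_mv, s_pv))
```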

4 Experiment

4.1 Datasets

One well-known dataset, ModelNet, is used to evaluate our ideas. ModelNet has two publicly available versions: ModelNet10 and ModelNet40. ModelNet10 comprises 4899 CAD models split into 10 categories; its training and testing subsets consist of 3991 and 908 models, respectively. ModelNet40 comprises 12,311 CAD models split into 40 categories; its training and testing subsets consist of 9843 and 2468 models, respectively. The ModelNet dataset is manually filtered to remove 3D models that do not belong to the specified categories. In addition, the pose (translation and rotation) of the ModelNet10 models is normalized, whereas that of ModelNet40 is not.

4.2 Comparison of the combinations of different modal networks

In this work, we propose a novel predict-score fusion method to fuse the multi-modal information extracted by the different modal networks. The goal of this design is to exploit the advantages of the different modal networks to obtain more accurate classification results. To demonstrate the performance of this approach, we compare the classification results of single modal networks with those of combinations of different modal networks. The corresponding experimental results are shown in Table 1. From this table, we find that combining different modal networks brings a significant improvement over a single modal network. Here, MV+PC brings a 4% and 1.5% improvement over MV and PC, respectively. MV+PV brings a 2% and 6% improvement over MV and PV, respectively. PC+PV brings a 0.25% and 6.5% improvement over PC and PV, respectively. Finally, MV+PC+PV brings a 5%, 3%, and 9% improvement over each single modality, respectively. We find that the PC network brings the biggest improvement under different conditions, and the single modal PC network also obtains the best classification results among the single modal networks. This suggests that the point cloud data carries more information about the 3D model.

Table 1 Comparisons of different Model Classification accuracy on ModelNet10 and ModelNet40

Following the analysis in Section 3.3, in order to let the different modalities play to their respective advantages, we weight the modalities so as to maximize those advantages. For example, the previous experiment shows that the point cloud works best, so it is natural to give the point cloud a larger weight, and the related experiments confirm this. From Fig. 3 we can see that when the weights of Point cloud, Multi-view, and PANORAMA-view are set to α1 = 0.7, α2 = 0.2, and α3 = 0.1, we obtain the best results. Compared with directly averaging the scores without weighting, we gain 0.54% after weighting, and the modality PC receives the largest weight. This experimental result demonstrates the effectiveness of the proposed method.
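
The following minimal sketch illustrates the weight search behind Fig. 3: the PANORAMA-view weight x and the Multi-view weight y are swept over a grid, the Point cloud weight is set to 1 − x − y, and the classification accuracy is recorded for each combination. The step size and the pre-computed score arrays are assumptions, not values from the paper.

```python
# Sketch: grid search over fusion weights on a validation set.
# `val_scores_*` have shape (N, n_classes); `val_labels` has shape (N,).
import numpy as np

def search_weights(val_scores_pc, val_scores_mv, val_scores_pv, val_labels, step=0.1):
    best = (0.0, None)
    for x in np.arange(0.0, 1.0 + 1e-9, step):             # PANORAMA-view weight
        for y in np.arange(0.0, 1.0 - x + 1e-9, step):     # Multi-view weight
            w_pc = 1.0 - x - y                             # Point cloud weight
            fused = w_pc * val_scores_pc + y * val_scores_mv + x * val_scores_pv
            acc = (fused.argmax(axis=1) == val_labels).mean()
            if acc > best[0]:
                best = (acc, (w_pc, y, x))
    return best                                            # (best accuracy, (alpha_1, alpha_2, alpha_3))
```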

Fig. 3

The X axis represents the weight of the PANORAMA-view modality, the Y axis represents the weight of the Multi-view modality, the weight of the Point cloud is 1 − x − y, and the colorbar represents the classification accuracy of the 3D model on ModelNet40

4.3 Comparison to state-of-the-art methods on ModelNet40

To validate the efficiency of the proposed MIFN, 3D shape classification experiments were conducted on the Princeton ModelNet dataset [31]. In total, the ModelNet dataset contains 127,915 3D CAD models from 662 categories. ModelNet40, a commonly used subset of ModelNet containing 12,311 shapes from 40 common categories, is used in our experiments. We follow the same training and test split setting as in [31].

In the experiments, we compare the proposed MIFN with various models based on different representations, including volumetric models [31], hand-crafted descriptors for multi-view data [4, 12], deep learning models for multi-view data [21, 28], deep learning models for PANORAMA views [29], and point cloud based models [8, 13, 16, 23, 33].

Table 2 provides the classification results of all compared methods. As shown in the results, our proposed MIFN achieves the best performance, with a classification accuracy of 92.4%. Compared with MVCNN using GoogLeNet, our MIFN achieves state-of-the-art accuracy on the classification task. Among point cloud based models, our MIFN also reaches the level of the state-of-the-art point cloud based model DGCNN in terms of classification accuracy.

Table 2 Comparisons of classification accuracy on ModelNet40

5 Implementation details

Our framework contains a point cloud network, a multi-view network, and a PANORAMA-view network. For the point cloud network, 1,024 raw points per object are fed into the network. For the multi-view network, 12 views per object are fed into the network, and the CNN parameters are initialized from the pre-trained MVCNN model. For the PANORAMA-view network, 12 views are fed into the network in the same way as for the multi-view network, but the parameters are not initialized from a pre-trained model; instead, we pre-train the model on our dataset and use the best model to initialize the parameters. Finally, we fuse the modal-level predict-scores to generate the model-level predict-score, which is used to predict the class of the model.

6 Conclusion

In this paper, we propose a novel modal fusion network, MIFN, which employs different modal data for 3D shape classification. In our framework, the model-level predict-score is introduced to exploit the advantages of the different modal networks when predicting the class label. More specifically, the proposed method can effectively fuse information from multiple modalities and is easy to apply in other similar applications. The effectiveness of our framework has been demonstrated by experimental results and comparisons with state-of-the-art models on the ModelNet dataset. We have also investigated the effectiveness of the different components of our model to demonstrate its robustness.