1 Introduction

The accuracy and speed of object recognition are essential criteria for evaluating the environment perception ability of unmanned ground vehicles (UGVs) [1]. During autonomous driving, excellent object recognition results can supply sufficient semantic environment information for UGVs to execute subsequent processing tasks [2]. In addition, object recognition technology also benefits many other industrial fields, such as automatic semantic map generation, local path planning for autonomous robotics, and digital terrain recognition and analysis.

Researchers employ light detection and ranging (LiDAR) sensors to perceive precise and adequate surrounding information during continuous driving. LiDAR sensors can collect large 3D point clouds quickly with few errors [3]. Whereas two-dimensional (2D) images captured by digital cameras are sensitive to illumination changes, robustness to changing light is another advantage of the three-dimensional (3D) point clouds sensed by LiDAR. However, several special distribution characteristics of LiDAR point clouds, particularly their disordered arrangement and inhomogeneous densities, pose considerable challenges for object feature selection and computation. The efficiency of extracting and optimizing object features from sparse 3D point clouds directly determines object recognition accuracy.

Traditionally, key-point-based and local-surface-based methods are the primary feature extraction methods in the 3D object recognition domain [4]. Fixed- or adaptive-scale key-point detection algorithms compute curvatures, variances, normal vectors, and other geometric spatial attributes as their feature extraction descriptors. Points whose attributes remain stable after a series of rigid transformations are considered key points, and object types are recognized by comparing the similarity between target objects and given models. However, in most key-point detection methods, feature extraction relies heavily on determining the neighboring region and finding neighbor points, so key-point selection incurs significant computational and time costs [5].

In consideration of the limitations of key-point-based feature extraction methods, researchers prefer to choose local surface features as automatic object recognition criteria [6]. Several types of local surface feature descriptors are commonly used to extract features in 3D object recognition. For example, some studies have generated and used local-surface geometry feature descriptors in the spatial domain to extract basic contour and spatial distribution information about different types of 3D objects. Some complex object descriptors analyze the topological structure of point clouds to extract distribution features by rasterizing the 3D local space into ordered and aligned voxels [7]. Rather than computing intuitive distribution attributes, several studies have transformed object features from the spatial domain into other domains (e.g., the spectral domain) to search for a descriptive representation. After extracting object features, the classifiers are initialized and fed with multidimensional features as a series of recognition criteria.

Recently, an increasing number of researchers have concentrated on utilizing machine learning algorithms as object classifiers to solve 3D object recognition problems [8]. Thus, this paper proposes a multilayer neural network-based 3D object recognition system with multiple feature extraction from LiDAR point clouds. To compute geometry and spatial distribution features, we initially rasterize an object point cloud in a global voxel model. In each voxel, 23 features, e.g., point divergence degree, variance, and covariance, are computed in parallel. Using multiple object features and their manually annotated labels, the initialized neural network (NN) is trained through a massive number of iterations. The feature extraction processes in the proposed system are accelerated by a graphics processing unit (GPU) to help realize real-time object recognition.

The remainder of this paper is organized as follows. In Sect. 2, we discuss work related to object recognition algorithms in 3D point clouds. In Sect. 3, the proposed GPU-based recognition system is described. We analyze the results of an object recognition experiment in Sect. 4, and Sect. 5 concludes the paper.

2 Related work

Object recognition is an essential function for UGVs to realize semantic environment perception and automatic driving decision making in urban areas [9, 10]. This section surveys several widely used feature extraction methods based on 3D point clouds to realize fast and accurate object recognition.

In the 3D object recognition domain, key-point-based detection is commonly used to describe stable local features using a series of descriptive points [11]. For example, Sun et al. [12] used curvature characteristics to establish a robust key-point selection algorithm and a reliable local feature descriptor. In the defined key-point-based feature descriptor, curvature maps are structured based on interest points and their neighboring points within a predefined constraint. To define a local reference frame (LRF), the surface normal and maximum principal curvature orientation of the interest points are used as the axes of the local space. Although Sun et al. proposed a precise key-point description and matching algorithm, matching the interest points and their neighborhood points occupies the majority of the processing time. In consideration of speed performance in key-point detection, Persad et al. [13] developed a transform-based key-point detection method to search for interest points that remain invariant under rigid rotation and translation. Compared to a traditional feature extraction algorithm, i.e., random sample consensus, the point-matching process in their method is more efficient and does not require a predefined threshold. In addition, Ge et al. [14] realized a random point method to select key points at each scaling layer to compute local features. A point-cloud-matching process is executed by comparing local features described by the key points. Here, matching performance relies on the size of the feature dictionary generated by the method, which limits the stability of subsequent processing.

To enhance the descriptiveness, robustness, and time efficiency of the key-point process, some studies have employed a key-point-based histogram to gather the spatial features of point clouds in multiple dimensions [15]. For example, Weber et al. [16] computed orientation angles between points and their neighbors to generate a classic histogram-based feature descriptor in local space, referred to as fast point feature histograms (FPFH). Although the FPFH descriptor demonstrates excellent feature extraction speed, its neighboring area is still large, which incurs relatively large time costs [17]. Yang et al. [18] developed a more time-efficient local feature descriptor referred to as the local feature statistics histogram (LFSH), which is insensitive to noise and varying density. To reduce time costs in these types of histogram-based feature extraction methods, Garcia et al. [19] employed GPU technology to convert serial procedures into parallel procedures, which remains a novel use of such technology among point cloud recognition methods.

Differing from calculating spatial features by analyzing individual points in sequence, neighboring points can be grouped into a series of voxels that serve as processing units for coarse feature extraction [20]. Zhu et al. [21] proposed a semantic classification method that divides 3D space into regularly arranged fixed-size voxels, which overcomes the processing difficulty associated with non-homogeneous density distributions in point clouds. To maintain the topological structure of point clouds, adjacent voxels are clustered based on pairwise connections between different components. By computing the energy values of pairwise connections, raw point clouds collected from urban environments can be divided into individual semantic components. Xu et al. [22] also developed a voxel-based model to down-sample large original point clouds to reduce the time costs of subsequent processing. In this model, after down-sampling, principal directions are computed to reconstruct the local coordinate axes (referred to as the semi-LRF). Here, global point coordinates are simplified as local coordinates to extract features from the LRF. Extracting features from suitable local coordinates is much easier than extracting them from global coordinates because local transformation is an advisable preprocessing step for feature computation.

Fixed-size voxels are sometimes not sufficiently effective for object recognition due to the variable densities and asymmetrical structure of point clouds [23]. Lei et al. [24] exploited a 3D convolution kernel at different scales to extract object features at different resolutions. Thus, the topological structures of the point clouds can be described in a detailed manner such that the accuracy of object classification, semantic recognition, and other subsequent applications is increased. However, the computational cost of updating new points in such tree-based storage structures is high, especially in incremental point cloud collection systems.

Dimensionality reduction methods are commonly used to increase the speed and efficiency of feature extraction by projecting a 3D point cloud into a 2D space [25]. Ligon et al. [26] utilized a traditional spin image descriptor to convert 3D point clouds into a 2D feature plane. By limiting the normal angle between a selected point of interest and its neighboring points, point counts under the predefined constraint are obtained to form a statistics plane. However, the features extracted using the spin image algorithm rely heavily on the normal vector, making them sensitive to noise on the local surface. To increase the reliability and stability of point cloud processing, Yang et al. [27] developed a rotational contour signature descriptor to extract features comprising an array of contour signatures under different rotation angles. These signatures are then gathered together as local feature descriptors to realize shape matching and object recognition. Dong et al. [28] projected 3D point clouds onto six planes to record an interest point and its neighboring point counts as weighted distance features within a limited area. Here, the distance features are encoded into a string of binary numbers that serves as a local feature descriptor to categorize outdoor objects in an urban environment. With such feature descriptors, a large number of thresholds and parameters must be adjusted and modified manually; thus, machine learning algorithms are effective for such feature classification tasks [29].

Recently, inspired by outstanding object recognition results in the 2D image domain, machine learning algorithms have been investigated for 3D point cloud processing [30]. To reduce the time cost of searching for neighbor points, Dubé et al. [31] used a k-d tree model to store LiDAR point clouds and resolve several spatial features, especially for inhomogeneous and unstructured distributions. Soilán et al. [32] applied a neural network (NN) to a road marking recognition task in which local features are extracted as the input data of the classifier model. Automatic parameter adjustment in these classification models can realize better recognition results than traditional threshold-predefined classifiers [33]. However, these methods use only the geometric features of the spatial point distribution and ignore the topological relationships among neighboring points.

Motivated by significant achievements of convolutional NNs (CNNs), some studies in the 3D point cloud domain have focused on finding self-adaptive feature extraction filters. For example, Bobkov et al. [34] input a sequence of 3D points into a CNN model with five filtering and five pooling layers to extract point cloud features. Here, prior to the input process, a preprocessing step is executed to normalize the point order and down-sample the point cloud. Chen et al. [35] defined the starting points of different models to avoid rigid transformations, such as random rotations. The filters in their system are set to 3 × 1 at the first filtering layer and 1 × 1 at the subsequent filtering layers; however, this is insufficient for extracting spatial features. Therefore, the proposed system extracts multiple dimensional features from voxels and summarizes them to form a series of object feature datasets. The detailed connected component labeling algorithm for object clustering is explained in our previous work [36]. By feeding the feature datasets into an initialized multilayer NN model, point-based object identification can be implemented with parameters adjusted automatically from a large number of examples.

3 Multifeature-based object recognition system

The proposed system includes preprocessing, feature extraction, and classifier modules. Several spatial features of the point cloud are computed from each voxel. An object feature normalization process then converts the feature vectors of the voxels into a normalized descriptor. The normalized descriptor forms a normalized object feature matrix, which is then fed into the classifier to realize the object recognition function.

3.1 System overview

Accuracy and processing speed are two primary bottlenecks in object recognition using large-scale point clouds. To address these bottlenecks, the proposed method combines parallel computing technology and multifeature descriptors to realize an object recognition system (Fig. 1). First, all raw point clouds collected by the LiDAR sensor are input to the proposed system. Using a height threshold, the ground points are filtered out to avoid their adverse effect on the subsequent feature-computing and object-recognizing procedures. For example, ground points can occupy a sizable portion of the sensed point cloud in an entire scene, which results in wasteful computation in non-ground object recognition tasks. Another reason the ground point filtering step is performed is that such points form a horizontal plane that can connect objects near the earth's surface.
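As a minimal illustration of this preprocessing step, the following Python sketch removes ground points with a simple height threshold; the threshold value and the assumed N × 3 point layout are illustrative choices, not the authors' exact implementation.

```python
import numpy as np

def filter_ground_points(points: np.ndarray, ground_height: float = 0.2) -> np.ndarray:
    """Remove points at or below a height threshold (assumed z-up sensor frame).

    points: (N, 3) array of x, y, z coordinates.
    ground_height: assumed threshold in meters; the paper's actual value is not given.
    """
    mask = points[:, 2] > ground_height  # keep only non-ground points
    return points[mask]

# Usage: non_ground = filter_ground_points(raw_points, ground_height=0.2)
```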

Fig. 1 Flowchart of the proposed object recognition system with multiple extracted features

After executing the ground point filtering step, the voxel model is initialized with a predefined boundary size and voxel resolution. Multiple geometry features of the point clouds projected into the corresponding voxels are computed and formed into feature vectors. The voxel counts of different objects likely differ; thus, it is difficult to input the raw voxel feature vectors of these objects into a common NN classifier. Therefore, these voxel feature vectors with different counts are converted from standard space to a normalized descriptor to generate fixed-size object feature matrices. By feeding these normalized object feature matrices into our initialized multilayer NN model, the model improves its classification ability over 10,000 iterations in the training process. As a result, the proposed system can achieve automatic and accurate object recognition using the trained classifier.

3.2 Voxel features generation

In the proposed system, a series of feature vectors in the voxel model are computed as the judgment basis to identify different types of outdoor objects in subsequent steps. After defining a suitable resolution and boundary size, the voxel model is established and comprises cube-shaped voxel units. Here, non-ground points are projected into corresponding voxel units based on their positions at the x, y, and z axes in a standard Cartesian coordinate system. Using these 3D points in voxels, multiple geometry features are computed (Table 1). Point count N in a given voxel is a significant parameter to measure the voxel’s importance. Similarly, point density ρ in each voxel is utilized to evaluate voxel importance, as given in formula (1), which is also affected by the relative distance between the current voxel and the sensor location. In Eq. (1), l, w, and h are the voxel sizes of the x, y, and z directions in standard space, respectively.

Table 1 Feature list computed in each valid voxel
$$ \rho = N/(l \times w \times h) \tag{1} $$
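The following sketch illustrates one way to bin non-ground points into voxels and compute the point count N and density ρ of Eq. (1); the voxel size and the dictionary-based grouping are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np
from collections import defaultdict

def voxelize(points: np.ndarray, voxel_size=(0.2, 0.2, 0.2)):
    """Group non-ground points into voxels and compute N and rho per Eq. (1).

    voxel_size: (l, w, h) edge lengths along x, y, z; the values are illustrative.
    Returns a dict mapping integer voxel indices to (points, count, density).
    """
    l, w, h = voxel_size
    indices = np.floor(points / np.array([l, w, h])).astype(int)
    voxels = defaultdict(list)
    for idx, p in zip(map(tuple, indices), points):
        voxels[idx].append(p)
    features = {}
    for idx, pts in voxels.items():
        pts = np.asarray(pts)
        n = len(pts)                 # point count N
        rho = n / (l * w * h)        # point density, Eq. (1)
        features[idx] = (pts, n, rho)
    return features
```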

Voxel centroid \( \mu \) is a geometric center computed by traversing all point positions in the voxel. Vector \( \mu \) is expressed as \( \{\bar{x}, \bar{y}, \bar{z}\} \), where \( \bar{x} \), \( \bar{y} \), and \( \bar{z} \) are the mean values of the point cloud in the three directions. Point variance \( \sigma^{2} \) is a 3D variable to measure the point cloud distribution differences in the x, y, and z directions (\( \sigma^{2} = \{\sigma_{x}^{2}, \sigma_{y}^{2}, \sigma_{z}^{2}\} \)). Point covariance \( \bar{\sigma}^{2} \) is a 3D covariance variable that measures the point distribution differences in the xy, xz, and yz directions, given as \( \bar{\sigma}^{2} = \{\bar{\sigma}_{xy}^{2}, \bar{\sigma}_{xz}^{2}, \bar{\sigma}_{yz}^{2}\} \). By combining the point variance and covariance, the covariance matrix is expressed in Eq. (2).

$$ \text{cov}(X,Y,Z) = \begin{bmatrix} \sigma^{2}(XX) & \bar{\sigma}^{2}(XY) & \bar{\sigma}^{2}(XZ) \\ \bar{\sigma}^{2}(YX) & \sigma^{2}(YY) & \bar{\sigma}^{2}(YZ) \\ \bar{\sigma}^{2}(ZX) & \bar{\sigma}^{2}(ZY) & \sigma^{2}(ZZ) \end{bmatrix} \tag{2} $$

The eigenvectors \( \upsilon \) and eigenvalues \( \lambda \) are obtained by decomposing this covariance matrix using a singular value decomposition algorithm. Eigenvalue \( \lambda \) is a 3D variable represented as {λ1, λ2, λ3}, whose components are sorted from largest to smallest as λ1, λ2, and λ3. The surface curvature parameter κ describes the curvature of the point cloud (Eq. 3).

$$ \kappa = \lambda_{3}/(\lambda_{1} + \lambda_{2} + \lambda_{3}) \tag{3} $$
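The sketch below shows how the covariance matrix of Eq. (2), its eigenvectors and eigenvalues, and the surface curvature of Eq. (3) might be computed for the points of one voxel; it uses a symmetric eigendecomposition rather than the authors' SVD routine, which yields the same eigenvalues for a covariance matrix.

```python
import numpy as np

def curvature_features(pts: np.ndarray):
    """Compute centroid, covariance matrix (Eq. 2), sorted eigenvalues/eigenvectors,
    and surface curvature (Eq. 3) for the points in one voxel."""
    mu = pts.mean(axis=0)                      # voxel centroid
    cov = np.cov(pts, rowvar=False)            # 3x3 covariance matrix, Eq. (2)
    eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]          # sort lambda_1 >= lambda_2 >= lambda_3
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    kappa = eigvals[2] / eigvals.sum()         # surface curvature, Eq. (3)
    return mu, cov, eigvals, eigvecs, kappa
```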

Divergence degree F is a 3D variable used to evaluate the dispersion degree of the point clouds in the x, y, and z directions. Divergence degree F is computed according to Eq. (4), where N and μ are the point count and voxel’s centroid. Here, vector P is a set of points expressed as P = {p1, p2,…, pN}, where variable pi contains the x, y, and z coordinates as pi = {xi, yi, zi}.

$$ F = \frac{1}{N}\sum\limits_{i = 1}^{N} \left( p_{i} - \mu \right) \tag{4} $$

For each voxel, the feature list shown in Table 1 forms a description vector \( \vec{v} \) defined as \( \vec{v} = \{N, \rho, \mu, \sigma^{2}, \bar{\sigma}^{2}, \upsilon, \lambda, \kappa, F\} \) to represent the spatial features of the point cloud. Thus, feature vectors are input to the proposed system rather than the relative coordinates of the point cloud.
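A per-voxel descriptor could then be assembled as sketched below. The exact layout is an assumption: the paper reports 25 features per voxel, and how the three eigenvectors are encoded is not specified, so they are flattened to nine components here for illustration; `curvature_features` refers to the helper sketched above.

```python
import numpy as np

def voxel_descriptor(pts: np.ndarray, voxel_size=(0.2, 0.2, 0.2)) -> np.ndarray:
    """Assemble v = {N, rho, mu, sigma^2, cov, eigvecs, eigvals, kappa, F} for one voxel.

    The resulting length (27 here) is an assumption; it depends on how the
    eigenvectors are encoded, which the paper does not specify.
    """
    l, w, h = voxel_size
    n = len(pts)
    rho = n / (l * w * h)                                # point density, Eq. (1)
    mu, cov, eigvals, eigvecs, kappa = curvature_features(pts)
    var = pts.var(axis=0)                                # sigma^2 in x, y, z
    covar = np.array([cov[0, 1], cov[0, 2], cov[1, 2]])  # covariances in xy, xz, yz
    F = (pts - mu).mean(axis=0)                          # divergence degree, Eq. (4)
    return np.concatenate([[n, rho], mu, var, covar,
                           eigvecs.ravel(), eigvals, [kappa], F])
```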

3.3 Object features normalization

Different objects are of different sizes; thus, the voxel counts of different objects also differ. Under different rotation angles, the same object has different point cloud and voxel arrangement distributions. As a result, an effective normalization method is required to down-sample or up-sample object voxels equitably, regardless of whether the voxel count is large or small. Using a predefined scale filter (Fig. 2), object features are normalized from standard Cartesian space to a normalized descriptor after computing all voxel features listed in Table 1.

Fig. 2 Flowchart of object feature normalization

In object feature normalization, the voxel feature vectors of a given object are transformed from an arbitrary count to a fixed count of 3 × 3 × 3 feature vectors using the object feature's normalized descriptor. The generated feature matrix M of each object is expressed by Eq. (5), where vector \( \vec{v} \) is the feature vector of a block in the normalized descriptor.

$$ M = \begin{bmatrix} \vec{v}_{1} \\ \vec{v}_{2} \\ \vdots \\ \vec{v}_{27} \end{bmatrix} = \begin{bmatrix} \mu_{1} & \sigma_{1}^{2} & \cdots & F_{1} \\ \mu_{2} & \sigma_{2}^{2} & \cdots & F_{2} \\ \vdots & & \ddots & \vdots \\ \mu_{27} & \sigma_{27}^{2} & \cdots & F_{27} \end{bmatrix} \tag{5} $$

In this matrix, each row contains 25 variables, including the point count, point density, voxel centroid, point variance, point covariance, eigenvectors, eigenvalues, surface curvature, and divergence degree. Each column contains 27 feature vector elements computed from the 3 × 3 × 3 normalized blocks. To input object features into the proposed multilayer NN, matrix M is converted to a vector that serves as input data in the format required by the initialized neural network classifier, as defined by Eq. (6).

$$ M' = \{\vec{v}_{1}, \vec{v}_{2}, \ldots, \vec{v}_{27}\} = \{u \in (\mu_{1}, \sigma_{1}^{2}, \ldots, F_{1}, \mu_{2}, \sigma_{2}^{2}, \ldots, F_{2}, \ldots, \mu_{27}, \sigma_{27}^{2}, \ldots, F_{27})\} \tag{6} $$
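The normalization step might be approximated as in the following sketch, which divides an object's axis-aligned bounding box into 3 × 3 × 3 blocks, computes a descriptor per block with the `voxel_descriptor` helper sketched above, and flattens the result into M′. The bounding-box scale filter, the zero-padding of empty blocks, and the per-block feature length are assumptions, since the paper does not fully specify the normalized descriptor.

```python
import numpy as np

FEATURE_DIM = 27  # per-block descriptor length in this sketch (see note above)

def normalized_object_feature(points: np.ndarray, blocks: int = 3) -> np.ndarray:
    """Map an object's points into blocks^3 normalized blocks and flatten the
    per-block descriptors into a single fixed-length vector M' (Eq. 6)."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.maximum(hi - lo, 1e-6)                 # avoid division by zero
    idx = np.minimum((blocks * (points - lo) / span).astype(int), blocks - 1)
    block_size = tuple(span / blocks)                # per-block edge lengths
    rows = []
    for bx in range(blocks):
        for by in range(blocks):
            for bz in range(blocks):
                pts = points[np.all(idx == (bx, by, bz), axis=1)]
                if len(pts) >= 3:
                    rows.append(voxel_descriptor(pts, voxel_size=block_size))
                else:
                    rows.append(np.zeros(FEATURE_DIM))   # pad empty blocks
    return np.concatenate(rows)
```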

Equation (7) expresses how the input data ui from the input layer are combined with the classifier parameters, including the weights wi and threshold θ, in the first hidden layer. In vector Mʹ, u represents an element. Variable m is both the neuron count in the input layer and the dimension of the converted object feature vector Mʹ. The function f is the softmax function, which offers higher learning efficiency in the training process than the sigmoid function. The result c is the input data for a neuron in the second hidden layer. Note that the remaining computation of the employed multilayer NN classifier is omitted.

$$ c = f\left( \sum\limits_{i = 0}^{m} w_{i} u_{i} - \theta \right) \tag{7} $$
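As a worked illustration of Eq. (7), the snippet below evaluates a full hidden layer of such neurons on a flattened feature vector; the weights and thresholds are random placeholders, and the softmax is applied across the layer's outputs because it normalizes over a vector rather than a single scalar.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())          # numerically stable softmax
    return e / e.sum()

def hidden_layer(u: np.ndarray, W: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Eq. (7) applied across a whole layer: c_j = f(sum_i w_ji * u_i - theta_j)."""
    return softmax(W @ u - theta)

# Illustrative shapes only: 675 inputs, 350 neurons in the first hidden layer.
rng = np.random.default_rng(0)
u = rng.normal(size=675)             # flattened object feature vector M'
W = rng.normal(size=(350, 675))      # weights (randomly initialized placeholders)
theta = rng.normal(size=350)         # thresholds
c = hidden_layer(u, W, theta)        # inputs to the second hidden layer
```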

4 Experiments and analysis

In the experiment, a LiDAR sensor was used to collect point cloud data from the environment, including distance and rotation information. We used the CUDA programming model to implement parallel computing in the proposed system to improve feature extraction efficiency. The system was executed on a 2.8 GHz Intel® Core™ i7-7700HQ CPU (with 8 GB RAM) with a GeForce GTX 1050 Ti graphics processing unit. The system utilized the DirectX 9.0 software development kit to render the raw point clouds, voxels, and point clouds in different colors for the individual object recognition results.

Ground points occupied a large portion of the point clouds in outdoor scenes, as illustrated in Fig. 3. Thus, ground point filtering was implemented to reduce the computational complexity of non-ground object clustering and recognition. To speed up the object segmentation and recognition process, a parallel computing method was applied to voxel feature extraction and object feature matrix normalization. The time consumption of the proposed GPU-accelerated method was shorter than that of CPU programming. In the feature-computing process, the point count and point density in each voxel depend on the distance between the voxel and the sensor location; therefore, these two features were excluded, and the feature datasets contained the remaining 23 features. Using these 23 features, 1000 obstacles were labeled manually to train and test the proposed multilayer NN model. As shown in Fig. 4, the labeled datasets comprised 286 walls, 109 poles, 43 pedestrians, 416 trees, and 146 bushes.

Fig. 3 Ground and non-ground point counts in sequential frames

Fig. 4 Numbers of objects in the training and testing datasets

The numbers of input and output neurons in the initialized multilayer NN model were 675 and 4, respectively. We used three hidden layers with 350, 100, and 25 neurons, respectively. Using 800 object feature datasets as training examples, the randomly initialized parameters in the model were modified over 10,000 iterations. Finally, using the trained classifier, the recognition accuracy on the test object feature datasets reached 92.84%. As shown in Table 2, our experiment also compared the accuracy of other classifiers tested on the same datasets.
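For reference, a rough analogue of this classifier (675 inputs and hidden layers of 350, 100, and 25 neurons) could be configured with scikit-learn as sketched below; the feature matrix and labels are placeholders, and this is not the authors' implementation, which was trained on 800 manually labeled objects.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder data: 800 training objects with 675-dimensional normalized feature vectors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(800, 675))
y_train = rng.integers(0, 5, size=800)      # object class labels (e.g., wall, pole, ...)

clf = MLPClassifier(hidden_layer_sizes=(350, 100, 25),
                    max_iter=10000,          # upper bound on training iterations
                    random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_train, y_train)       # evaluate on held-out data in practice
```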

Table 2 Accuracy comparison of different classifiers

Figure 5 shows the object recognition results, where different objects are rendered in different colors. Wall objects have prominent features because most of their points lie on a common planar surface; as a result, recognition accuracy is much higher for walls than for other objects. Owing to the limited size of pedestrians, the feature matrix cannot always clearly express the object's characteristics. The accuracy rates for pole and tree objects lie between those of walls and pedestrians. The object recognition accuracy on narrow roads is higher than that in large open spaces because the objects were closer to the LiDAR sensor on the narrow roads than in the open square environment. The point clouds of objects in a narrow environment are denser; thus, the feature matrix can describe the objects' characteristics more precisely. For the same reason, the object recognition rates are related to the distance between the objects and the sensor. In the future, it is expected that object recognition accuracy can be improved significantly if an interpolation algorithm is employed to increase the point cloud density.

Fig. 5 Object recognition results for different scenes: a narrow road with trees and buildings on both sides; b narrow road scene with a pedestrian, trees, and a building on one side

5 Conclusion

To sense urban environment information as precisely as possible, most studies employ LiDAR sensors on UGVs to collect 3D point cloud datasets. Fast and accurate object recognition remains a major challenge in the autonomous driving domain due to the large number of obstacle types and their various features. In this paper, we have proposed a 3D object recognition system with multiple feature extraction from LiDAR point clouds. In a preprocessing step, non-ground points are extracted, and a voxel model is initialized based on the valid range of the remaining point clouds. After computing a feature vector for each voxel, a scale filter is used to generate a normalized feature matrix for each individual object. These object feature matrices are then fed into a multilayer NN model to obtain their object types. Using manually labeled testing datasets, the accuracy rate of the proposed 3D object recognition system was 92.84%. In the future, more object feature datasets with corresponding manual labels will be collected to train the model. In addition, more features will be computed, and their effectiveness will be analyzed to form the feature space and increase accuracy.