
1 Introduction

Trees growing in cities, together with large contiguous areas of green forest within them, can be collectively defined as urban forests. Existing urban forest databases, however, suffer from insufficient and outdated data, and traditional manual measurement and estimation cannot meet the needs of modern urban forest assessment. There is therefore a strong demand for automated, efficient estimation of urban forest density in order to enable real-time monitoring of its spatial and temporal distribution.

The diverse composition of cities makes lidar filtering challenging, and complex 3D point cloud computation makes segmentation and 3D reconstruction of urban trees even more difficult. In [1], voxel grid filtering is used to reduce the number of points; grid filtering, spatial clustering, and the k-nearest neighbor method are then applied in turn to extract the final tree height. The results are good, but the algorithm design remains complex. In [2], a new method based on point density is proposed to reconstruct the complete three-dimensional structure of a single tree, but this algorithm depends heavily on the composition of the environment: if the scene is too complex, both classification accuracy and reconstruction accuracy degrade. With the development of deep learning, tree detection and segmentation are realized in [3] by combining lidar with an object detection algorithm. The overall segmentation accuracy is high, but in-depth research on tree statistics is still lacking. In [4], the lidar point cloud is used to detect individual trees and take related measurements of tree height, canopy diameter, and depth; however, the distribution density of urban forests is not studied. In terms of computational complexity, combining lidar data with deep learning algorithms is also more demanding.

To address these problems, this paper proposes an urban tree density estimation algorithm based on image semantic segmentation. First, the whole image is semantically segmented. Second, the tree regions are extracted through a series of image processing techniques, which reduces the number of pixels the algorithm needs to traverse when estimating tree density. Finally, the proportion of tree pixels in the whole image is computed from the pixel counts to obtain the density estimate. The algorithm does not use three-dimensional data such as lidar point clouds; instead, it estimates tree density through segmentation and processing of the image alone. Compared with point cloud processing algorithms, its complexity is therefore reduced, since only image processing is required.

The rest of this paper is organized as follows. Section 2 briefly introduces the basic principle of image semantic segmentation. Section 3 presents the urban tree density estimation algorithm based on image semantic segmentation. Section 4 gives the overall implementation process and the final detection results, analyzes the experimental results, and discusses the advantages of the proposed algorithm. Finally, Sect. 5 concludes the paper.

2 System Design

Since the appearance of the FCN network [5] in 2014, image semantic segmentation has been performed at the pixel level. FCN is an end-to-end neural network based on the CNN structure: the input is not restricted to a fixed image size, and the final fully connected layer of the CNN is removed, making semantic segmentation with a pre-trained network more efficient. Skip connections are introduced to address the coarse pixel localization caused by up-sampling. However, FCN is insensitive to image details, which limits its use for data requiring precise statistics, such as tree density. After FCN, various other network architectures emerged [6,7,8,9]. In 2020, the University of Science and Technology of China and Microsoft jointly proposed the High-Resolution Network (HRNET) [10]. The network maintains a high resolution throughout: starting from a high-resolution convolution stream, it gradually adds lower-resolution convolution streams and connects the multi-resolution streams in parallel. The advantage of HRNET is that image details are preserved during feature extraction, yielding richer semantic representations and more precise spatial localization.

Therefore, this paper proposes an urban streetscape tree density estimation algorithm based on HRNET semantic segmentation. First, images from the open-source Cityscapes dataset are used as the input of the whole algorithm, and the semantic region of trees in each image is extracted with the HRNET network. The extracted result is then processed by grayscale conversion, edge extraction, and mathematical morphology to obtain the tree contour. Finally, the number of tree pixels in the region is counted, and the urban tree density is estimated from the total number of pixels in the image. The system block diagram of the whole algorithm is shown in Fig. 1.

Fig. 1.

Block diagram of tree density estimation system based on image semantic segmentation

In Fig. 1, the open-source Cityscapes image dataset is used as the input of the entire tree semantic extraction and density estimation system, and the final output is a tree density estimate for each image. The proposed HRNET-based estimation algorithm reduces the computational complexity of the original lidar-plus-image-processing approach, improves the computational efficiency of urban tree density estimation, and makes the estimation automatic. Through a series of image processing steps, the HRNET segmentation result not only isolates the tree region but also allows the pixel count of that region to be used directly for automatic density estimation, giving the algorithm better real-time interactivity in practical applications.

3 Tree Density Estimation Algorithm

3.1 Semantic Region Extraction Based on HRNET

The HRNET network is used for semantic segmentation of the city streetscape, separating each class of object such as the road surface and trees. This segmentation allows each class to be analyzed in detail and lays the foundation for estimating the density of trees in the streetscape. The network structure is shown in Fig. 2.

Fig. 2.

HRNET network structure

The network is divided into four stages, each composed of channel maps at different resolutions, and each stage has one more branch than the previous one. The new branch is produced by strided-convolution fusion of all feature maps from the previous stage; its resolution is half that of the lowest-resolution branch of the previous stage, and its number of channels is doubled. On the left is the input image, from which the semantic regions of the city streetscape are extracted by the HRNET network.
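As a minimal sketch of how the tree semantic region can be obtained from such a network (assuming a trained HRNET segmentation model `model` that maps a normalized image tensor to a per-pixel logit map over the Cityscapes classes, and assuming the vegetation/tree class uses train ID 8; both are assumptions, not details given in the paper):

```python
import torch
import numpy as np

# Assumed: `model` is a trained HRNET semantic segmentation network that maps a
# normalized (3, H, W) image tensor to a (1, num_classes, H, W) logit map over
# the Cityscapes classes. The vegetation/tree class is assumed to use train ID 8.
TREE_CLASS_ID = 8

def extract_tree_mask(model, image_tensor):
    """Return a binary (H, W) mask: 255 for tree pixels, 0 elsewhere."""
    model.eval()
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0))          # (1, C, H, W)
    pred = logits.argmax(dim=1).squeeze(0).cpu().numpy()   # (H, W) class map
    return (pred == TREE_CLASS_ID).astype(np.uint8) * 255
```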

3.2 Tree Contour Extraction Based on Semantic Region and Image Processing

The second step of the algorithm extracts the tree contour from the semantic region through image processing, mainly via grayscale conversion, Gaussian filtering, edge detection, and mathematical morphology; a consolidated code sketch of these steps is given after the list below.

(1) Image grayscale

    The semantic segmentation map of urban trees is converted to grayscale so that only the urban tree area of interest is retained and all other areas are removed. The whole image is traversed, and tree pixels are assigned a value of 255 (white) while all other pixels are assigned 0. Alternatively, the standard grayscale conversion formula can be applied:

    $$ \text{Gray} = 0.299R + 0.587G + 0.114B $$
    (1)

    where R, G, and B are the red, green, and blue channels of the image.

(2) Gaussian filtering

    The Gaussian filter is a linear filter that can remove noise generated during digital processing while preserving the useful features of the image. To prepare for the subsequent edge detection, an appropriate Gaussian filter is chosen to suppress noise whose probability density function follows a Gaussian distribution, and a kernel size of 7 × 7 is selected. The kernel coefficients are computed from the two-dimensional Gaussian function:

    $$ g\left( {x,y} \right) = \frac{1}{{2\pi \sigma^{2} }}e^{{ - \left( {\frac{{x^{2} + y^{2} }}{{2\sigma^{2} }}} \right)}} $$
    (2)

    where x and y are the horizontal and vertical distances from a kernel element to the center of the Gaussian kernel, and σ is the standard deviation. As the standard deviation increases, the smoothing effect of the Gaussian filter becomes more pronounced.

(3) Edge detection

    The Canny operator is used for edge detection. It smooths the image first and then differentiates it; as a first-order differential filter, it is an optimized operator with a strong filtering effect. By computing the image gradient, the magnitude and direction of image edges are obtained and changes in gray value are detected; where a sharp change is found, that location can be identified as an object edge. The Sobel operator is used to obtain the gradient from the gray values of the 8-neighborhood of each pixel. The template operators and the 8-neighborhood gray values are shown in (3) and (4), respectively.

    $$ M_{x} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \qquad M_{y} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} $$
    (3)
    $$ g_{px} = \begin{bmatrix} f(x-1,y+1) & f(x,y+1) & f(x+1,y+1) \\ f(x-1,y) & f(x,y) & f(x+1,y) \\ f(x-1,y-1) & f(x,y-1) & f(x+1,y-1) \end{bmatrix} $$
    (4)

    where Mx is the x-direction template and My is the y-direction template.

    The horizontal and vertical gradient components are given in formula (5), and the gradient magnitude and direction over the 8-neighborhood are given in formulas (6) and (7), respectively.

    $$ \begin{aligned} {\mathbf{G}}_{x} &= f(x+1,y+1) - f(x-1,y+1) + 2f(x+1,y) - 2f(x-1,y) + f(x+1,y-1) - f(x-1,y-1) \\ {\mathbf{G}}_{y} &= f(x-1,y+1) - f(x-1,y-1) + 2f(x,y+1) - 2f(x,y-1) + f(x+1,y+1) - f(x+1,y-1) \end{aligned} $$
    (5)
    $$ {\mathbf{G}} = \sqrt {{\mathbf{G}}_{x}^{2} + {\mathbf{G}}_{y}^{2} } $$
    (6)
    $$ \theta = \arctan \left( {\frac{{{\mathbf{G}}_{y} }}{{{\mathbf{G}}_{x} }}} \right) $$
    (7)
(4) Mathematical morphology processing

    Mathematical morphology is used to further extract the boundary of the tree area; erosion and dilation are its two basic operations. Let B(x,y) be the convolution template, Bx the structuring element, and A the input image. The erosion of A by Bx is given by:

    $$ A - B_{x} = \{ f(x,y)|B_{(x,y)} \subseteq A\} $$
    (8)

    where f(x,y) is the value of any pixel in image A. Similarly, the dilation of A by Bx is given by:

    $$ A + B_{x} = \{ f(x,y)\,|\,B_{(x,y)} \cap A \ne \varnothing \} $$
    (9)

    where \(\varnothing\) denotes the empty set.
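As a consolidated, minimal sketch of the four steps above using OpenCV (the 7 × 7 Gaussian kernel comes from the text, while the Canny thresholds and the 3 × 3 structuring element are assumptions chosen for illustration):

```python
import cv2
import numpy as np

def extract_tree_contour(tree_mask):
    """Steps (1)-(4): grayscale mask -> Gaussian smoothing -> Canny -> morphology.

    `tree_mask` is the 0/255 tree region produced by semantic segmentation;
    the Canny thresholds and structuring-element size are illustrative.
    """
    # (1) Image grayscale: the mask is already single-channel 0/255; a color
    #     input could instead be converted with cv2.cvtColor(..., cv2.COLOR_BGR2GRAY).
    gray = tree_mask.astype(np.uint8)

    # (2) Gaussian filtering with the 7x7 kernel of Eq. (2)
    #     (sigma = 0 lets OpenCV derive it from the kernel size).
    blurred = cv2.GaussianBlur(gray, (7, 7), 0)

    # (3) Canny edge detection, which uses Sobel gradients as in Eqs. (3)-(7).
    edges = cv2.Canny(blurred, 50, 150)

    # (4) Mathematical morphology: dilation followed by erosion (closing)
    #     to clean up and connect the tree boundary.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    return cv2.erode(cv2.dilate(edges, kernel), kernel)
```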

3.3 Tree Density Estimation Algorithm Based on Tree Contour Region

The density of an object in an image is defined as the ratio of the number of pixels belonging to the object to the total number of pixels in the image. The tree density can therefore be expressed as:

$$ \rho_{{{\text{tree}}}} = \frac{{n_{{{\text{tree}}}} }}{M \times N} $$
(10)

where ρtree is the tree density, ntree is the number of tree pixels, and M and N are the height and width of the image, respectively. To obtain ntree, the whole image is traversed and the tree pixels are counted.
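A direct implementation of Eq. (10), assuming the binary tree mask from the previous steps (nonzero pixels mark trees):

```python
import numpy as np

def tree_density(tree_mask):
    """Eq. (10): ratio of tree pixels to all pixels in the image."""
    M, N = tree_mask.shape[:2]                 # image height and width
    n_tree = int(np.count_nonzero(tree_mask))  # number of tree pixels
    return n_tree / (M * N)
```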

4 Experiment and Analysis

4.1 Experimental Environment

The algorithm is implemented on the Linux (Ubuntu) operating system with the PyTorch deep learning framework. The open-source Cityscapes dataset is used: the folders leftImg8bit and gtFine store the image set and the label set respectively, and each contains training, validation, and test subsets, with 5,000 images in total (2,975 for training, 500 for validation, and 1,525 for testing). To reduce training time, 1,100 pictures of the city of Stuttgart were selected for this project; 8 of them are shown in Fig. 3, and their semantic segmentation results are shown in Fig. 4.

Fig. 3.

Dataset of Stuttgart

Fig. 4.

Semantic segmentation of Stuttgart

4.2 The Experimental Results

Based on the observed distribution of urban trees in Stuttgart, criteria for grading urban tree density can be set, as shown in Table 1. The tree density is divided into three levels: low, medium, and high.

Table 1. Standard for estimating urban tree density
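Since the exact thresholds of Table 1 are not reproduced here, the following sketch uses placeholder boundaries: only the 0.28 boundary for "high" is mentioned in the analysis of Sect. 4.2, and the low/medium boundary is a hypothetical value.

```python
def density_level(rho, medium_threshold=0.10, high_threshold=0.28):
    """Map a tree density value to a qualitative level (thresholds illustrative)."""
    if rho >= high_threshold:
        return "high"
    if rho >= medium_threshold:
        return "medium"
    return "low"
```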

A total of 6 pictures are selected from frames 270–280, taking every second frame, to estimate the tree density. The outer contour of the trees is marked in yellow and brown, and the tree density value and density level of each frame are printed in the middle of the picture. The six pictures are shown in Fig. 5.

Fig. 5.

Tree extraction and tree density output

As the vehicle advances through the six selected frames, it moves toward a region where trees are more densely distributed, so the tree density value keeps increasing. Meanwhile, all values are greater than 0.28, so the density level is reported as high (Table 2).

Table 2. Tree density and dense degree table of 270–280 frames

5 Conclusion

To address the low efficiency of lidar point cloud processing and the lack of automation in urban tree density estimation, this paper proposes an urban streetscape tree density estimation algorithm based on image semantic segmentation. First, the open-source Cityscapes dataset is combined with the HRNET semantic segmentation network to extract the semantic regions of trees. Second, a series of image processing algorithms are used to extract the tree contours. Finally, the urban tree density is estimated from the pixel statistics. The whole process is based on image data alone, without collecting point clouds from other sensors or combining them with deep learning algorithms for tree region segmentation, which enhances the real-time interactivity of tree density estimation and makes it faster and automatic.