
1 Introduction

Nowadays, anthropogenic activities are affecting ecosystems and biodiversity, impacting plants and animals alike. Whereas poaching and the clearing of forests are only some of the smaller human impacts, one of the biggest is the anthropogenic effect on climate change.

Plants are strong indicators of climate change, not only in terms of phenological responses [7, 9, 13, 31, 32, 37], but also in terms of plant community compositions [27, 29, 37]. However, these compositions reflect not only changes in climate, but also changes in other factors, such as land use [2, 14] and insect abundance [39]. Hence, plant community compositions are a valuable metric for determining environmental changes and are therefore the focus of many experiments [6, 14, 27, 39].

In recent years, technology has enabled us to develop systems that automatically collect images of such experiments at high resolution and high frequency, which would be too expensive and time-consuming to do manually. This process creates large amounts of data displaying complex plant compositions, which are likewise hard to analyse by hand. As methods to survey these data automatically are missing, the analysis is usually still done manually by biologists directly in the field. However, this process is bound to produce subjective results. An automated, objective method would therefore not only enable fast evaluation of the experimental data, but also greatly improve the comparability of the results.

Krizhevsky et al. [24] showed in the ILSVRC 2012 challenge [38] that convolutional neural networks (CNNs) can be used to analyze large numbers of images, outperforming alternative approaches by a large margin. Following this, deep learning became a large area of research with many different developments. However, only a small number of approaches deal with the analysis of plants, so there are only a few existing solutions for the very specific problems in this area.

With our approach we propose a system for an important task: the analysis of plant community compositions based on plant cover. The plant cover, i.e., the fraction of ground covered by each plant species, is an indicator of the plant community composition. Information on the spatio-temporal distribution of plant communities leads to a better understanding of effects not only related to climate change, but also concerning other environmental drivers of biodiversity [5, 6, 44]. We present an approach using a custom CNN architecture, which we train on plant cover percentages provided as annotations. We treat this as a pixel-wise classification problem known as semantic segmentation and aggregate the individual scores to compute the cover predictions.

CNNs are often treated as black boxes that return a result without any information about what it is based on. To prevent trust issues resulting from this, we also focus on providing a segmentation map, which the network learns from the cover percentage labels alone. With this map, the user can verify the network's detections and check whether the output is reasonable. For implausible cases or manual inspections of random samples, the user can look at the segmentations. If the network's detections are deemed incorrect, a manual evaluation of the images can be suggested instead of blind trust in the network output. To the best of our knowledge, we are the first to apply CNNs to plant cover prediction by training on the raw cover labels only, and the first to use relative labels (cover percentages) to train a network for generating segmentation maps.

In the next section we will discuss related work, followed by the dataset we used and its characteristics in Sect. 3. In Sect. 4 we will then present our approach, with the results of our experiments following in Sect. 5. We end the paper with a conclusion and a short discussion about future work in Sect. 6.

2 Related Work

An approach also dealing with plant cover is that of Kattenborn et al. [20], who worked with remote sensing data, specifically images taken from UAVs. They developed a small convolutional neural network (CNN) architecture with 8 layers to determine the cover of different herb, shrub and woody species. In contrast to our approach, their network was trained on low-resolution image patches with delineations of tree canopies directly in the images. In addition, their approach was mostly concerned with distinguishing 2–4 tree species with heterogeneous appearances, which makes the classification easier compared to our problem.

While, to the best of our knowledge, the aforementioned approach is the only one dealing with plant cover, there are many methods tackling plant identification in general, e.g., [4, 15, 25, 48]. One example is the method of Yalcin et al. [48], who applied a pre-trained CNN with 11 layers to fruit-bearing agricultural plants. Another, more prominent project concerned with plant identification is the Flora Incognita project of Wäldchen et al. [45], in which multiple images of a single plant can be used for identification. These approaches, however, are usually applied to one or multiple images of a single plant species, in contrast to pictures of plant communities with largely different compositions as in our dataset.

Weakly-supervised segmentation, i.e., the learning of segmentation maps using only weak labels, is an established field in computer vision research. Therefore, we can also find a multitude of different approaches in this area. Some of them use bounding boxes for training the segmentation maps [10, 21], while others use merely image-level class annotations [3, 18, 23, 33, 46], as these are much easier to acquire than bounding box annotations. However, most of these approaches are only applied to images with mostly large objects, like the PASCAL-VOC dataset [12], as opposed to high-resolution images with small fine-grained objects like in our dataset. In addition, our dataset contains a new kind of weak label: plant cover percentages. This type of label enables new approaches for learning segmentation maps, which we try to exploit in this paper.

At first glance, the task of predicting cover percentages appears similar to counting or crowd-counting tasks, which are often solved by training a model on small, randomly drawn image patches and evaluating it on complete images, or by evaluating it on patches and aggregating the information afterwards [19, 28, 47]. This is possible because only absolute values have to be determined, which are usually completely independent of the rest of the image. In our dataset, however, the target values, i.e., the cover percentages, are not absolute but relative and therefore depend on the whole image. Because of this, we have to process the complete images during training and cannot rely on image patches.

Fig. 1. A selection of example images from the image series of a single camera in a single EcoUnit. The complete life cycle is captured in the image series, including flowering and senescence.

3 Dataset

For our experiments we used a dataset comprising images from the InsectArmageddon project (Footnote 1); we will therefore refer to it as the InsectArmageddon dataset. During this project, the effects of invertebrate density on plant composition and growth were investigated. The experiments were conducted using the iDiv Ecotron facility in Bad Lauchstädt [11, 42], a system comprising 24 so-called EcoUnits. Each of these EcoUnits has a base area of about \(1.5\,\text {m}\times 1.5\,\text {m}\) and contains a small, closed ecosystem corresponding to a certain experimental setup. An image of an EcoUnit is shown in Fig. 2.

Over the time span of the project, each EcoUnit was equipped with two cameras observing the experiments from two different angles. One example of such a setup is shown in Fig. 3. It should be noted that the cameras have overlapping fields of view in many cases, so the images from each unit are not independent of each other. Both cameras in each unit took one image per day. As the duration of the project was about half a year, 13,986 images were collected this way over two project phases. However, as annotating this comparatively large number of images is a very laborious task, only about one image per recorded week in the first phase was annotated per EcoUnit. This drastically reduces the number of images available for supervised training.

Fig. 2. An EcoUnit from the Ecotron system.

Fig. 3. An example camera setup in an EcoUnit. The two cameras are placed at opposite corners of the EcoUnit and can have overlapping fields of view. In some cases, the cameras do not cover the complete unit.

Fig. 4. The mean cover percentages of the plant species over all annotated images in the dataset, resulting in a long-tailed distribution. The abbreviations are explained in Sect. 3.

The plants in the images are all herbaceous, and we separate them into nine classes, seven of which are plant species. These seven plants and their short forms, which are used in the remainder of the paper, are: Trifolium pratense (tri_pra), Centaurea jacea (cen_jac), Medicago lupulina (med_lup), Plantago lanceolata (pla_lan), Lotus corniculatus (lot_cor), Scorzoneroides autumnalis (sco_aut) and Achillea millefolium (ach_mil). The two remaining classes are grasses and dead litter. These serve as collective classes for all grass-like plants and dead biomass, respectively, mostly due to the lack of visual distinguishability in the images.

As with many biological datasets, this one is heavily imbalanced. The mean plant cover percentages over the complete dataset are shown in Fig. 4. There, we can see that tri_pra represents almost a third of the dataset, while the rarest three classes, ach_mil, sco_aut and lot_cor, together constitute only about 12% of the dataset.

3.1 Images

The cameras in the EcoUnits are mounted at a height of about \(2\,\text {m}\) above the ground level of the EcoUnits and can observe an area of up to roughly \(2\,\text {m}\times 2\,\text {m}\), depending on the zoom level. Processing the images uniformly is difficult, however, because they are scaled differently. One reason is that many images have different zoom levels due to technical issues. Another is that some plants grew rather tall and therefore appear much larger in the images.

As mentioned above, the images cover a large time span, i.e., from April to August 2018 in the case of the annotated images. Hence, the plants are captured during their complete life cycle, including the different phenological stages they go through, like flowering and senescence.

Occlusion is one of the biggest challenges in the dataset, as it is very dominant in almost every image and makes an accurate prediction of the cover percentages very difficult. The occlusion is caused by the plants overlapping each other and growing in multiple layers. However, as we will mostly focus on the visible parts of the plants, tackling the non-visible parts is beyond the scope of this paper. A small selection from the images of a camera of a single EcoUnit can be seen in Fig. 1. Each of the images has an original resolution of \(2688 \times 1520\) px.

As already discussed in Sect. 2, we are not able to split up the images into patches and train on these subimages, as we only have the cover annotations for the full image. Therefore, during training we always have to process the complete images. This circumstance, in conjunction with the rather high resolution of the images, the similarity of the plants and massive occlusion, makes this a tremendously hard task.

3.2 Annotations

As already mentioned above, the annotations for the images are cover percentages of each plant species, i.e., the percentage of ground covered by each species, disregarding occlusion. The cover percentages have been estimated by a botanist using both images of each EcoUnit, if a second image was available. As perfect estimation is impossible, the estimates have been quantized into classes of a modified Schmidt scale [34] (0, 0.5, 1, 3, 5, 8, 10, 15, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90 and 100%). While such a quantization is very common for cover estimation in botanical research [30, 34], it introduces label noise and can, in conjunction with possible estimation errors, potentially impair the training and evaluation of machine learning models. In addition to the cover percentages, we also estimated vegetation percentages, specifying the percentage of ground covered by plants in general, which we use as an auxiliary target value.
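To make the quantization step concrete, the following minimal sketch maps a raw visual estimate to the nearest class of the modified Schmidt scale listed above; the function name and the nearest-class rounding rule are illustrative assumptions rather than part of the original annotation protocol.

```python
import numpy as np

# Classes of the modified Schmidt scale used for the cover annotations (in percent).
SCHMIDT_SCALE = np.array([0, 0.5, 1, 3, 5, 8, 10, 15, 20, 25, 30,
                          40, 50, 60, 70, 75, 80, 90, 100])

def quantize_cover(estimate_percent):
    """Map a raw cover estimate to the nearest class of the modified Schmidt scale.
    Rounding to the nearest class value is an assumption made for illustration."""
    idx = np.abs(SCHMIDT_SCALE - estimate_percent).argmin()
    return float(SCHMIDT_SCALE[idx])

# Example: a raw estimate of 17% would be stored as the 15% class.
print(quantize_cover(17.0))  # -> 15.0
```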

Since both images of each EcoUnit have been used for estimating a single value, the distribution of plants should be approximately the same in both images. Therefore, we increase the size of our dataset by using the same annotation for both images, which leads to 682 image-annotation pairs.

4 Approach

Due to the necessity of using the complete image during training, we require a setting in which it is feasible to process the complete image efficiently without imposing too strong limitations on hyperparameters like the batch size. The most important aspect of such a setting is the image resolution. As it is hard to train models on very high resolutions due to GPU memory limitations, we chose to process the images at a resolution of \(672 \times 336\) px, which is still several times larger than common input resolutions for neural network architectures, e.g., ResNet [17] trained on the ImageNet dataset [38] with a resolution of \(224 \times 224\) px. To make the results verifiable, we aim to create a segmentation map during prediction that designates which plant is located at each position in the image. This segmentation map has to be learned implicitly by predicting the cover percentages. As the plants are very small compared to the full image, the segmentation map also has to have a high resolution to show the predicted plants as exactly as possible.

The usage of standard classification networks, like ResNet [17] or Inception [40, 41], is not possible in this case, as the resolution of their output feature maps is too coarse for an accurate segmentation map. Additionally, these networks, and most segmentation networks with a higher output resolution, like Dilated ResNet [49], have large receptive fields. Thus, they produce feature maps that include information from large parts of the image, most of which is irrelevant to the class at a specific point. This leads to largely inaccurate segmentation maps.

We thus require a network that can process the images at a high resolution while aggregating information only from a relatively small, local area, without compressing the features spatially, in order to preserve as much local information as possible. Our proposed network is described in the following.

Fig. 5. The basic structure of the network. It consists of a feature extractor network as backbone, which aggregates information from the input image at high resolution, and a network head, which performs the cover percentage calculation and generates the segmentation map.

Table 1. A detailed view of the network architecture. We use the following abbreviations: k - kernel size, s - stride, d - dilation rate

4.1 General Network Structure

The basic structure of our network is shown in Fig. 5. We logically separate the network into two parts, backbone and network head, similar to Mask R-CNN [16]. The backbone, a feature extractor network, extracts local information from the image approximately pixel-wise and thus generates a high-resolution feature map, which is then used by the network head for the cover calculation and the generation of the segmentation map. In the network head, the pixel-wise probabilities for each plant are calculated and then aggregated to compute the total cover percentages for the complete image. The maxima of the intermediate probabilities are used to generate the segmentation map.

4.2 Feature Extractor Network

The feature extractor network initially applies two downscaling operations with 2-strided convolutions, bringing the feature maps to 25% of the original image resolution, which is kept until the end of the network. The downscaling layers are followed by nine residual bottleneck blocks as defined in the original ResNet paper [17]. To aggregate information quickly over multiple scales, an inception block similar to the ones introduced by Szegedy et al. [40, 41] is used. The inception block consists of four branches with different combinations of convolutions, resulting in four different receptive field sizes: \(1 \times 1\), \(3\times 3\), \(7\times 7\) and \(11\times 11\). Table 1 shows the network architecture in detail.
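The following Keras sketch illustrates this backbone structure. The channel widths, the exact convolution combinations inside the blocks and the use of zero padding are simplifying assumptions made for illustration; the actual configuration (kernel sizes, strides, dilation rates) is listed in Table 1, and reflective padding is used in our experiments (cf. Sect. 5.1).

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, filters):
    """Residual bottleneck block in the style of ResNet [17]; widths are illustrative."""
    shortcut = x
    y = layers.Conv2D(filters // 4, 1, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters // 4, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(filters, 1, padding="same")(y)
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def inception_block(x, filters):
    """Four parallel branches with receptive fields of 1x1, 3x3, 7x7 and 11x11
    (realized here with single large kernels for simplicity)."""
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 7, padding="same", activation="relu")(x)
    b4 = layers.Conv2D(filters, 11, padding="same", activation="relu")(x)
    return layers.Concatenate()([b1, b2, b3, b4])

def build_backbone(input_shape=(336, 672, 3), width=64, n_blocks=9):
    """Two 2-strided convolutions (output at 25% resolution per dimension),
    nine residual bottleneck blocks and one inception block."""
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(width, 3, strides=2, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(width, 3, strides=2, padding="same", activation="relu")(x)
    for _ in range(n_blocks):
        x = bottleneck_block(x, width)
    x = inception_block(x, width)
    return tf.keras.Model(inputs, x, name="feature_extractor")
```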

Fig. 6. The calculation in the network head. We use a sigmoid function to determine the plant probabilities and a softmax function for the background and irrelevance probabilities. To bring these into a relationship with each other, we use a hyperparameter \(\kappa \), L1-normalization and a multiplication, denoted with \(\cdot \).

4.3 Network Head and Calculation Model

In the network head we try to calculate the cover percentages as exactly as possible. For this, we first introduce two classes in addition to the ones described in Sect. 3: the background class and the irrelevance class. While very similar at first glance, these two classes differ significantly in meaning. The background class, abbreviated bg in the following, represents every part of the image that is not a plant but is still relevant to the cover percentage calculation; the most obvious example is the bare soil visible in the images. The irrelevance class, denoted irr in the following, represents all image parts that are neither a plant nor relevant for the cover calculation; here, the most obvious example are the walls of the EcoUnits, which are visible in many images. The aim of differentiating between these two classes is to separate unwanted objects from the actual plantable area of the EcoUnits and thus enable the network to work on images without the laborious manual removal of such objects. If not handled in any way, objects like the walls of the EcoUnits can strongly distort the calculated cover percentages. For the cover calculation, we require the pixel-wise probabilities of each plant being present at the corresponding location in the image, as well as the probabilities of the location being background that is still relevant for the cover percentages, and of the location being irrelevant for estimating cover percentages. The calculation scheme is shown in Fig. 6.

The extracted features from the backbone are processed by a \(1\times 1\) convolution to create the classification features for each plant as well as for background and irrelevance. As, due to occlusion, multiple plants can be detected at the same location, we do not consider their probabilities to be mutually exclusive. Hence, we use a sigmoid function to calculate the probability of each plant appearing at a given location. In contrast, a softmax activation is applied to the classification features for background and irrelevance, as they are mutually exclusive. We also introduce a hyperparameter \(\kappa \), which we use within an L1-normalization of the plant probabilities, followed by a multiplication that relates the plant appearance probabilities to those for background and irrelevance, as they depend on each other. The detailed equations for the complete calculation process are explained in the following.
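The following sketch illustrates this split of activations, assuming the \(1\times 1\) convolution output is a tensor with one channel per plant class followed by two channels for background and irrelevance; the function name and channel ordering are chosen for illustration only.

```python
import tensorflow as tf

N_PLANT_CLASSES = 9  # seven species plus the grasses and dead litter classes

def split_head_activations(logits):
    """Apply the activations described above: sigmoid for the (non-exclusive)
    plant classes, softmax for the mutually exclusive background/irrelevance pair.
    `logits` is assumed to have shape (batch, H, W, N_PLANT_CLASSES + 2)."""
    plant_logits = logits[..., :N_PLANT_CLASSES]
    area_logits = logits[..., N_PLANT_CLASSES:]
    p_plants = tf.sigmoid(plant_logits)             # P(plant p at this location)
    p_bg_irr = tf.nn.softmax(area_logits, axis=-1)  # preliminary P(bg), P(irr); rescaled later via kappa
    return p_plants, p_bg_irr
```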

While the plants already have separate classes, for our formalization we introduce the abstract biomass class, abbreviated with bio, which simply represents the areas containing plants. For the introduced classes the following holds:

$$\begin{aligned} A_{total} = A_{bio} + A_{bg} + A_{irr}, \end{aligned}$$
(1)

where A represents the area covered by a certain class. For improved readability we also define the area relevant for cover calculation as

$$\begin{aligned} A_{rel} = A_{bio} + A_{bg} = A_{total} - A_{irr} \end{aligned}$$
(2)

As mentioned above we consider the classes of the plants, denoted with \(C^{plants}\), to be not mutually exclusive due to occlusion enabling the possibility of multiple plants at the same location. However, the classes bio, bg and irr are mutually exclusive. We will refer to these as area classes and denote them with \(C^{area}\).

Based on this formulation we describe our approach with the following equations. Here, we select a probabilistic approach, as we can only estimate the probabilities of a pixel containing a certain plant. With this, the following equation can be used to calculate the cover percentages for each plant:

$$\begin{aligned} cover_{p}=\frac{A_p}{A_{rel}}=\frac{\sum \limits _{\forall x}\sum \limits _{\forall y}P(C_{x,y}^{plants}=p)}{\sum \limits _{\forall x}\sum \limits _{\forall y}1 - P(C_{x,y}^{area}=irr)}, \end{aligned}$$
(3)

with p being the class of a plant, and x and y denoting a location in the image. \(C_{x,y}^\cdot \) is the predicted class at location (x, y) and \(P(C_{x,y}^\cdot =c)\) is the probability of class c being located at the indicated position.

As mentioned before, we also use the vegetation percentages during training as an auxiliary output. The vegetation percentage represents how much of the relevant area is covered by plants. This additional output helps determine the area that is actually relevant for the calculation. It can be calculated as follows:

$$\begin{aligned} vegetation=\frac{A_{bio}}{A_{rel}}=\frac{\sum \limits _{\forall x}\sum \limits _{\forall y}1 - P(C_{x,y}^{area}=bg) - P(C_{x,y}^{area}=irr)}{\sum \limits _{\forall x}\sum \limits _{\forall y}1 - P(C_{x,y}^{area}=irr)}. \end{aligned}$$
(4)

The notation is analogous to Eq. 3.
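As a minimal sketch of the aggregation in Eq. 3 and Eq. 4, assume the pixel-wise probabilities are given as tensors of shape (batch, height, width, plants) for the plant classes and (batch, height, width) for background and irrelevance; the function name is illustrative.

```python
import tensorflow as tf

def cover_and_vegetation(p_plants, p_bg, p_irr):
    """Aggregate pixel-wise probabilities into cover (Eq. 3) and vegetation (Eq. 4)
    percentages. p_plants: (batch, H, W, plants); p_bg, p_irr: (batch, H, W)."""
    relevant = tf.reduce_sum(1.0 - p_irr, axis=[1, 2])                  # sum over x, y of 1 - P(irr)
    cover = tf.reduce_sum(p_plants, axis=[1, 2]) / relevant[..., None]  # Eq. 3, one value per plant class
    vegetation = tf.reduce_sum(1.0 - p_bg - p_irr, axis=[1, 2]) / relevant  # Eq. 4
    return cover, vegetation
```

These two outputs are the quantities compared against the cover and vegetation annotations during training.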

While the probabilities for each plant as well as for background and irrelevance can be predicted, we are still missing a last piece for the construction of the network head: the calculation of the biomass class bio. As mentioned above, this class is abstract, meaning it cannot be predicted independently, as it mostly depends on the prediction of plants in an area. We solve this by introducing the hyperparameter \(\kappa \) mentioned above, which represents a threshold at which we consider a location to contain a plant (in contrast to background and irrelevance). We concatenate this value with the plant probabilities \(P(C_{x,y}^{plants})\) to form a vector \(v_{x,y}\), which we normalize using L1-normalization. The result can be interpreted as the dominance of each plant, with the most dominant plant having the highest value. As the values of this normalized vector sum to 1, they can also be treated as probabilities. The value at the original position of \(\kappa \), which essentially represents the probability of the absence of all plants, is higher if no plant is dominant. Hence, we can define:

$$\begin{aligned} P(C_{x,y}^{area} = bio) = 1 - \left( \frac{v_{x,y}}{\Vert v_{x,y} \Vert _1}\right) _\kappa \end{aligned}$$
(5)

where \((\cdot )_\kappa \) designates the original position of the value \(\kappa \) in the vector. The value \(1 - P(C_{x,y}^{area} = bio)\) is then multiplied with the background and irrelevance probabilities to obtain the final probabilities for these classes. This results in the probabilities of the area classes summing to one:

$$\begin{aligned} 1 = P(C_{x,y}^{area} = bio) + P(C_{x,y}^{area} = bg) + P(C_{x,y}^{area} = irr). \end{aligned}$$
(6)

Based on these equations we can construct our network head, which is able to accurately represent the calculation of plant cover in our images.
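The following sketch illustrates Eq. 5 and Eq. 6, i.e., how the biomass probability is derived from the plant probabilities via \(\kappa \) and how the softmax outputs for background and irrelevance are rescaled; the function name and the channel ordering of the background/irrelevance pair are illustrative assumptions.

```python
import tensorflow as tf

def area_probabilities(p_plants, p_bg_irr, kappa=0.001):
    """Derive the area-class probabilities (bio, bg, irr) following Eqs. 5 and 6.
    p_plants: (batch, H, W, plants); p_bg_irr: (batch, H, W, 2) with channels [bg, irr]."""
    # Concatenate kappa with the plant probabilities and L1-normalize (all entries are non-negative).
    k = kappa * tf.ones_like(p_plants[..., :1])
    v = tf.concat([p_plants, k], axis=-1)
    v = v / tf.reduce_sum(v, axis=-1, keepdims=True)  # L1 normalization
    p_bio = 1.0 - v[..., -1]                          # Eq. 5: one minus the normalized kappa entry
    # Rescale the softmax outputs so that the three area classes sum to one (Eq. 6).
    p_bg = (1.0 - p_bio) * p_bg_irr[..., 0]
    p_irr = (1.0 - p_bio) * p_bg_irr[..., 1]
    return p_bio, p_bg, p_irr
```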

To generate the segmentation map, we use the maximum values of the sigmoidal probabilities of the plant classes together with the probabilities for background and irrelevance. As these values only have 25% of the original resolution, they are upsampled using bicubic interpolation, resulting in a segmentation map at the original image resolution.
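A minimal sketch of this step is shown below; whether the argmax is taken before or after the bicubic upsampling is an implementation detail not fixed above, and here it is taken afterwards for illustration.

```python
import tensorflow as tf

def segmentation_map(p_plants, p_bg, p_irr, target_size=(1520, 2688)):
    """Build the segmentation map: bicubic upsampling of the per-class probability
    maps to the original image resolution, then an argmax over the plant,
    background and irrelevance channels."""
    probs = tf.concat([p_plants, p_bg[..., None], p_irr[..., None]], axis=-1)
    probs = tf.image.resize(probs, target_size, method="bicubic")
    return tf.argmax(probs, axis=-1)  # one class index per pixel
```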

5 Experiments

In the following, we describe our experimental setup, explain the error measures we used, and then go over the numerical results, followed by an evaluation of the segmentation maps.

5.1 Setup

During our experiments we used an image resolution of \(672 \times 336\) px and a batch size of 16. We trained the network for 300 epochs using the Adam optimizer [22] with a learning rate of 0.01, decreased by a factor of 0.1 at epochs 100, 200 and 250. As loss we used the MAE for both the cover percentage prediction and the vegetation prediction, weighted equally. Furthermore, we used L2 regularization with a factor of 0.0001. The activation functions in the backbone were ReLU functions, and we used reflective padding instead of zero padding, as this produces fewer artifacts at the image borders. During training, the hyperparameter \(\kappa \) was set to 0.001. For data augmentation we used horizontal flipping, small rotations in the range of \(-20^\circ \) to \(20^\circ \), coarse dropout, and positional translations in the range of -20 to 20 pixels. We trained the model using the TensorFlow framework [1] with Keras [8] using mixed precision. For a fair evaluation, we divided the images into training and validation parts based on the EcoUnits. We use 12-fold cross-validation, such that each split consists of 22 EcoUnits for training and 2 for testing. While the cover percentages are not equally distributed over the EcoUnits, this should have only a small effect on the results of the cross-validation.
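The following sketch summarizes this training configuration in Keras, assuming a model with two named outputs (cover and vegetation) and a dataset object that yields batches of 16 images at \(672 \times 336\) px with the corresponding targets; the per-layer L2 regularization and the augmentation pipeline are omitted for brevity.

```python
import tensorflow as tf

# Mixed precision training, as used in our experiments.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

def lr_schedule(epoch, lr):
    """Learning rate 0.01, decayed by a factor of 0.1 at epochs 100, 200 and 250."""
    return lr * 0.1 if epoch in (100, 200, 250) else lr

def train(model, train_dataset):
    """Compile and train with the settings above. `model` and `train_dataset`
    are assumed to exist; L2 regularization (factor 0.0001) would be added as
    kernel_regularizer in the individual layers."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
        loss={"cover": "mae", "vegetation": "mae"},       # MAE on both outputs,
        loss_weights={"cover": 1.0, "vegetation": 1.0},   # weighted equally
    )
    model.fit(train_dataset, epochs=300,
              callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```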

Table 2. The mean cover percentages used for scaling in Eq. 8 during evaluation.
Table 3. The mean values and standard deviations of the absolute errors and scaled absolute errors.

5.2 Error Measures

To evaluate the numerical results of our approach, we will take a look at two different error measures. The first one is the mean absolute error (MAE), which is defined as follows:

$$\begin{aligned} MAE(t, p) = \frac{1}{n}\sum _{i=1}^n{|t_i-p_i|}, \end{aligned}$$
(7)

where t and p are the true and predicted cover values, respectively. As the mean absolute error can be misleading when comparing the goodness of the predictions for imbalanced classes, we also propose a scaled version of the MAE: the mean scaled absolute error (MSAE), which is defined as follows:

$$\begin{aligned} MSAE(t, p) = \frac{1}{n}\sum _{i=1}^n{\frac{|t_i-p_i|}{m_i}}. \end{aligned}$$
(8)

The absolute error for each class is scaled by a value \(m_i\), which is the mean cover percentage of the respective class averaged over all annotations in the dataset. This error measure provides a better basis for comparing the predictions between classes. The values used for scaling can be found in Table 2.
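As a minimal sketch of both error measures, assume the true and predicted cover values are given as arrays of shape (samples, classes) and the scaling values \(m_i\) as a vector of per-class means (cf. Table 2); the function names are illustrative.

```python
import numpy as np

def mae(t, p):
    """Mean absolute error over all predictions, Eq. 7."""
    return np.mean(np.abs(t - p))

def msae(t, p, m):
    """Mean scaled absolute error, Eq. 8: per-class absolute errors divided by the
    mean cover percentage m of the respective class before averaging."""
    return np.mean(np.abs(t - p) / m)
```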

Fig. 7. An overview of the MAE of the plant cover prediction on the dataset.

Fig. 8. An overview of the MSAE of the plant cover prediction on the dataset.

5.3 Experimental Results

Cover Predictions. Our model achieves an overall MAE of 5.3% and an MSAE of 0.50. The detailed results for each species are shown in Table 3 as well as in Fig. 7 and Fig. 8. With respect to the MAE, we can see that the error of tri_pra is the highest, while the errors of the less abundant plants (ach_mil, lot_cor, sco_aut) are much lower. However, as mentioned above, the distribution of the MAE largely reflects the distribution of the plants in the dataset, as the absolute errors for the more abundant plants are expected to be higher. Therefore, to compare the goodness of the results between plants, we look at the MSAE depicted in Fig. 8, where we can see that tri_pra actually has the lowest relative error compared to the other plants, partially caused by the comparably large amount of training data for this class. The most problematic plants appear to be ach_mil, sco_aut and med_lup, with MSAE values of 0.63, 0.61 and 0.63, respectively. For ach_mil, the rather high error might result from several circumstances: the plant is very rare in the dataset, small in comparison to many of the other plants, and has a complex leaf structure, most of which might get lost at smaller resolutions. The large error for med_lup might be caused by its similarity to tri_pra, which is very dominant in the dataset; the network possibly predicts Trifolium instead of Medicago on many occasions, causing larger errors. The same might be the case for sco_aut and pla_lan or cen_jac, especially since sco_aut is one of the least abundant plants in the dataset, making a correct recognition difficult.

To put these results into perspective, we also provide results for a constant predictor, which always predicts the mean cover percentages of the training dataset, and for a standard U-Net [36] used as feature extractor. These achieve an MAE of 9.88% and an MSAE of 0.84 (constant predictor) and an MAE of 5.54% and an MSAE of 0.52 (U-Net). Our proposed network thus outperforms the constant predictor by a large margin and also slightly improves on the U-Net, despite having less than \(10\%\) of its number of parameters (3 million vs. 34 million). More details can be found in the supplementary material.
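For illustration, the constant-predictor baseline can be evaluated as in the following sketch, assuming the cover annotations of the training and test splits are given as arrays of shape (samples, classes); the function name is ours.

```python
import numpy as np

def constant_baseline_mae(train_covers, test_covers):
    """MAE of a baseline that always predicts the per-class mean cover of the training split."""
    prediction = train_covers.mean(axis=0)            # one constant value per class
    return np.mean(np.abs(test_covers - prediction))  # MAE on the test split
```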

Fig. 9. Segmentation results for an image with a high zoom level from the validation set. We can see that grasses, P. lanceolata, T. pratense and background area are segmented correctly in many cases.

Fig. 10. Segmentation results for a zoomed-out image from the validation set. While the network captures the signals of many plants correctly, the segmentations are rather inaccurate, leading to a large number of wrongly segmented plant species.

Segmentations. To evaluate the results of our network, we also take a look at the segmentations. The first image, shown in Fig. 9, has a comparably high zoom level. There we can see that tri_pra is detected correctly in the areas on the left and right sides of the image, although the segmentations are not perfect. pla_lan is segmented well in many cases, especially on the right side of the image. On the left we can see that it is also segmented correctly, even though it is partially covered by grass; the approach therefore appears to be robust to minor occlusions to some extent. Despite these results, the segmentation is still mostly incorrect in the top center of the image. Grasses are also detected correctly in most regions of the image, whereas above the aforementioned instances of pla_lan they are not segmented at all, which is mostly caused by the low resolution of the segmentation map. This low resolution also appears to impair the segmentation results on many other occasions, and we would like to tackle this problem in the future.

The second segmentation image is shown in Fig. 10. Here, the zoom level is lower than in the previous image, which results in increasingly inaccurate segmentations. We can see that the network correctly captured the presence of most plant species; notably, the approximate regions of med_lup and tri_pra are marked correctly. However, the detailed segmentation results are not very accurate. It also appears that some parts of the wall are wrongly recognized as tri_pra, while other parts are correctly marked as irrelevant for the cover calculation. The segmentations with a U-Net feature extractor can be found in the supplementary material. All in all, the segmentations appear to be correct for the more prominent plants in the dataset in images with a high zoom level, and at least partially correct in images without zoom. Therefore, the segmentation maps can be used to explain and confirm the plant cover predictions for some plants in the dataset.

6 Conclusions and Future Work

We have shown that our approach is capable of predicting the cover percentages of different plant species while generating a high-resolution segmentation map. Learning is done without any additional information beyond the original cover annotations. Although not perfect, the segmentation map can already be used to explain the results of the cover prediction for the more prevalent plants in the dataset. Many original images have a very high resolution and are currently downscaled due to computational constraints; making our approach applicable to images of higher resolution would be one improvement, as it would also increase the resolution of the segmentation map and result in much finer segmentations. The recognition of the less abundant plants, but also of very similar plants like T. pratense and M. lupulina, might be improved by applying transfer learning techniques; for example, we could pretrain the network on the iNaturalist datasets [43], since they contain a large number of plant species. Heavy occlusion remains a big challenge in our dataset, making the prediction of the correct plants and their abundances very hard. While there are already some approaches for segmenting occluded regions in a supervised setting [26, 35, 50], this is a completely unexplored topic for weakly-supervised semantic segmentation.