
1 Introduction

The lack of data for applying machine (deep) learning methods remains a challenging problem, even though data augmentation methods have been developed. This is particularly true in the fields of video analysis and image time series. Another problem when using machine learning is the poor quality of the available annotated data. In this paper we tackle these two problems in the context of a remote sensing application. Remote sensing applications are numerous thanks to the variety of satellite sensors that produce images at a daily scale. Most of the time, a single image is acquired and then analyzed, but recently, new sensors have been providing satellite image time series (SITS) over the same geographical areas at different dates all along the year. This makes it possible to obtain more information about a sensed scene and to choose the resolution best suited to the application. At the same time, the number of training samples obviously cannot be large.

Land-cover mapping [1] is one of these applications. Two different aspects of the problem can be distinguished. It can be considered as a segmentation problem when the labeling is done without any a priori knowledge except the images, or as an updating problem when a labeling of the existing plots is already known in some database and needs to be checked or modified. In the latter case, new image acquisitions can be studied with some a priori information, and classification tasks can be considered. Actually, many geographic databases are available, particularly for urban or agricultural landscapes. For the latter, they contain information at the plot scale for monitoring environmental changes or for agricultural crop type detection, as illustrated by the RPG database in France. The RPG database is mostly built and updated through annual declarations made by farmers within the framework of the Common Agricultural Policy (CAP) in Europe. Because these declarations are entered manually, they may contain errors or inaccuracies.

In this study, we are interested in the specific vineyard label. Our aim is not to deliver an up-to-date geographic vineyard database from a single satellite acquisition; instead, we aim to correct the database in a timely and accurate manner via SITS. Of course, it is not just an update of the database that has to be performed: the spatial domains outside the database plots also have to be analyzed in order to find new or forgotten vineyard parcels. Indeed, in this particular domain it is known that the labeling in the RPG may be erroneous, since many plots are not recorded, although the RPG is used as a reference for the monetary assistance delivered to winemakers through CAP aids. The errors could be pointed out automatically to obtain a more accurate monitoring system. Thanks to the increasing availability of high spatial resolution satellite imagery via European programs (e.g. Sentinel, Ven\(\upmu \)s), it becomes possible to use series of images to analyze this type of geographic database [2] or to check the consistency of the data through the analysis of the visual content of the images. In particular, SITS make it possible to study, from 2D+t imaging data, the spatio-temporal evolutions of the territory, which may for example indicate a change in the management of the cropping system or an error in the actual labeling [3, 4]. This, in fact, amounts to improving the initial annotated data used in the learning process, in one region or another, the representativity of the data being assumed.

Many studies in remote sensing have focused on images from the Sentinel 2 satellites, which produce optical data at a high temporal frequency with a medium spatial resolution (10 to 20 m) [1]. It has already been shown that Sentinel 2 sensors do not provide a spatial resolution high enough to distinguish, for example, between vineyards and orchards: the level of detail is too low to reveal the row structure of tree or vine stocks that characterizes these types of crops. In this project, we want to explore new data from the Ven\(\upmu \)s satellite, which also offers a good temporal frequency (3 days) but with a finer spatial resolution (5 m). Such data have already been considered in the literature for agricultural monitoring [5]. Here we want to take advantage of their spatial resolution to analyze the vineyards.

In this context, the aim of the paper is to propose an approach that makes it possible to use a corrupted training set to update the vineyard labeling in the RPG database (Sect. 2). The process consists in a semantic segmentation of the spatial data into two classes, vineyard and non-vineyard, through a multi-view approach. The basis of the approach is illustrated in the top part of Fig. 1. This allows us to recover the labeling at the plot level, as illustrated in the bottom part of Fig. 1. Furthermore, from a general point of view, we have developed a method that is able to partially correct badly annotated data. The main study area is Alsace (France), where we have considered two geographical regions, around the cities of Obernai and Epfig respectively (Sect. 3).

Fig. 1.

Workflow of the proposed approach. The box at the top recalls the Deep-STaR principle [6], while the box at the bottom illustrates the innovation developed in this study: how a spatial segmentation can be performed.

2 Proposed Method

SITS can be viewed as 3D data, with two spatial dimensions and a temporal one linked to the acquisition dates of the images. When considering a supervised approach, the number of training examples is an important issue.

We have already developed a methodology involving the Deep-STaR model [6] for the classification of agricultural plots according to their temporal behaviors and spatial configurations, thanks to SITS. Deep-STaR takes as input 2D spatio-temporal representations built from the content of the 3D data cube. Based on the same principles, we show here how a similar approach can solve a semantic segmentation problem, enabling the extraction of all vineyard pixels from a geographical area. More precisely, we first use a (potentially erroneous) training set of plots to learn the characteristics of vineyards, and then apply the model to segment vineyards in any region. The segmentation is performed regardless of the plot positioning, while still enabling us to rebuild the plots. This is quite a difference from previous studies.

We first recall the Deep-STaR principle (Fig. 1, top). We then present our contribution, namely how a spatial segmentation can be performed, starting with a coarse approach and then moving to a finer one (Fig. 1, bottom).

2.1 Deep-STaR Principle [6]

In a classification process, the labeling may be done at different levels: pixels or regions, the latter corresponding to plots in remote sensing applications. In our case, we consider another level, the curve segment level. We assume that, over a short curve segment, all pixels share the same label. This assumption has led us to define 2D elements within the 3D data as an intermediate point of view. They are similar to temporal pixels but more complex, and they carry more information, in particular spatial information. We name them spatio-temporal representations (STR). Such planar representations make it possible to take advantage of the 2D+t nature of the data, and to benefit from the efficiency of classical 2D convolutional neural networks (CNN) trained on large data sets (e.g. ImageNet), which provide useful initializations for CNNs applied to other problems.

STR carry more information than a temporal pixel, which only considers the evolution along time [7]. They are based on curves drawn in the spatial domain and unfurled into a straight-line segment, the new pixel index being the curvilinear abscissa of the pixel along the curve. The construction of such STR is illustrated in Fig. 2. The straight-line segments associated with each image of the SITS are stacked as the lines of a 2D image: its lines are indexed by the dates of the images of the series, and each line contains the pixels of the curve unfurled in the image of the series at that date.

Fig. 2.

STR construction: (a) a spatial curve is drawn in the spatial domain of a SITS, and (b) a 2D image is built whose lines, indexed by the acquisition dates, are the unfurlings of the curve at those dates, leading to an STR.

The curves that we consider are defined by an initial point and built by a random walk over an 8-connected topology. The length of the curves is a parameter of the method that has to be set according to the problem; it has to be related to the size of the entities (e.g. the plots) considered in the spatial domain. We can then build and associate with an image (or a region of interest) as many STR as necessary, even more than the number of pixels in the region. Each STR contains spatial information, as each of its pixels has two neighboring pixels that carry partial information on its local environment. The random aspect of the curve construction contributes to the rotation invariance of the method, as behaviors in all directions are equally represented. The width of an STR is the length of the curve, and its height is the number of dates in the SITS (Fig. 2).
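The construction above can be sketched as follows, assuming the SITS is stored as a NumPy cube of shape (dates, height, width, channels); the function names are ours, not from the paper:

```python
import numpy as np

def random_walk_curve(start, length, shape, rng):
    """Grow a curve of `length` pixels from `start` by an 8-connected
    random walk, rejecting steps that leave the image bounds."""
    moves = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
             if (dy, dx) != (0, 0)]
    y, x = start
    curve = [(y, x)]
    while len(curve) < length:
        dy, dx = moves[rng.integers(len(moves))]
        ny, nx = y + dy, x + dx
        if 0 <= ny < shape[0] and 0 <= nx < shape[1]:
            y, x = ny, nx
            curve.append((y, x))
    return curve

def build_str(cube, curve):
    """Unfurl `curve` through every date of the SITS cube
    (dates, H, W, channels): the resulting STR has one line per date
    and one column per curve pixel."""
    ys = [p[0] for p in curve]
    xs = [p[1] for p in curve]
    return cube[:, ys, xs, :]  # shape (dates, len(curve), channels)
```

The same curve is reused for every date, which is what makes the STR lines comparable along the temporal axis.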

From the SITS, STR images are built either inside vineyard plots or outside them. The way the STR are built ensures that all curve pixels lie inside the plot: spatial constraints are added to the random walk so that it stays in the region of interest. The labeling of the STR is performed by training a classical 2D CNN designed for a two-class categorization task: vineyard and non-vineyard. The final labeling of a plot is achieved by aggregating the decisions given by the CNN at the STR level within the plot.

We are now ready to use this approach to achieve a spatial semantic segmentation. This is done in two steps (Fig. 1, bottom box). First, we describe the general principle of the method, leading to a coarse segmentation; then we show how a finer result can be obtained through an improvement of the curve analysis process.

2.2 From Classification to Segmentation

As already mentioned, our problem here is not a classification one. The input information is given as a set of agricultural plots labeled as vineyard, but the results on vineyard regions have to be presented as existing or new plots (the objective is not only to correct the given plots but also to find those missing from the input). It should also be mentioned that the plots are vector regions, not raster regions, whereas we work at the pixel level. Considering this, we need to study the whole region covered by the RPG section we want to correct, and not only the existing plots. Segmentation is therefore the best approach to the problem. Indeed, the training of the CNN classifier can be performed as in the case of Deep-STaR, since it is independent of the final decision making, but the decision can then be made at the pixel level. A post-processing step is needed to rebuild quasi-plots. In order to give the new quasi-plots smooth contours, we resort to the agricultural land register, public information available in any country. The land register plots are labeled from the pixel labeling, or have to be partitioned to define homogeneous zones. We present hereinafter the segmentation principle.

Fig. 3.

Evolution of the recall (red) and precision (green) with respect to the threshold value used in the decision process to label pixels as vineyard or non-vineyard. The blue vertical line indicates the chosen threshold. (Color figure online)

The labeling has to be performed at the pixel level. From the labeling point of view, a pixel p is known both from its spatial coordinates (x, y) and through its index in the STR curves \(C_i(p)\) it belongs to, where i varies from 1 to \(n_p\). This is illustrated in Fig. 1 (bottom, second box), where \(n_p = 3\). The number of generated curves has to be large enough so that, for each pixel p, several decisions at the STR level are available; their number is \(n_p\). The available information is given by the CNN classifier through the confidence scores of its output layer for both classes (\(s_i(v)\), \(1-s_i(v)\)), \(s_i(v)\) being the score associated with the vineyard class. We do not apply a majority rule, as it gives lower results than using the mean of the \(n_p\) confidence scores.

The decision rule for pixel p involves a confidence score defined as \(S(p)=\frac{1}{n_{p}} \sum _{i=1}^{ n_{p}} s_i(v)\). As a first approximation, pixel p can be considered a vineyard pixel if \(S(p) > \frac{1}{2}\). With this score S, it is possible to build a grey-level map where the bright pixels are vineyard pixels. A binarization step, either local or global, is then needed to take the final decision. In our case, we have chosen a global threshold, not fixed at \(\frac{1}{2}\), but learned on a training set through an exhaustive search: we choose the threshold associated with the equal error rate of vineyard pixel recognition, computed on a region included in the training set. Figure 3 shows the evolution of recall and precision with respect to the threshold, allowing the optimal value to be chosen.
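The score aggregation and the threshold search can be sketched as follows (a minimal NumPy version; the grid of candidate thresholds and the function names are our assumptions):

```python
import numpy as np

def pixel_scores(shape, curves, str_scores):
    """S(p): mean of the STR-level vineyard scores s_i(v) over the n_p
    curves passing through each pixel p."""
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    for curve, s in zip(curves, str_scores):
        for y, x in curve:
            acc[y, x] += s
            cnt[y, x] += 1
    # pixels crossed by no curve keep a score of 0
    return np.divide(acc, cnt, out=np.zeros(shape), where=cnt > 0)

def eer_threshold(scores, truth, n_steps=101):
    """Exhaustive search for the threshold where recall and precision
    are closest (equal error rate), on a held-out training region."""
    best_t, best_gap = 0.5, np.inf
    for t in np.linspace(0.0, 1.0, n_steps):
        pred = scores > t
        tp = np.logical_and(pred, truth).sum()
        recall = tp / max(truth.sum(), 1)
        precision = tp / max(pred.sum(), 1)
        if abs(recall - precision) < best_gap:
            best_gap, best_t = abs(recall - precision), t
    return best_t
```

A curve revisiting a pixel simply contributes one more vote for it, which is consistent with averaging over all available STR decisions.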

After this learning phase, a two-step system makes it possible to study any region and to label the pixels of a specific area. Figure 4 illustrates a result on a geographical area distinct from the learning one. It can be observed, for example, that roads are not visible in this map and that small plots surrounded by vegetation other than vineyard are missing. The analysis showed that these errors come from the labeling itself, first performed at the STR level: all pixels of a curve are labeled in the same way, which may be wrong, since a curve can pass through both a vineyard and a road (or a path). This did not happen in the classification process, where the curves were drawn entirely inside the plots to be labeled.

2.3 Pixel Classification Improvement

The non-homogeneous aspect of an STR is normal in non-vineyard geographical zones, but in vineyard zones the STR should present a similar temporal evolution over all the pixels of the curve. When the curve goes through a vineyard but also includes parts of a road or a tree, for example, the STR can be labeled as vineyard even though not all pixels have the same behavior along the year. This is illustrated in Fig. 5, where two behaviors can be observed: the blue vertical area in (b) corresponds to a road that keeps the same aspect all year long, whereas for the vineyard pixels, the blue color at the top of the STR (beginning of the year) indicates no vegetation, and the reddish aspect increases as chlorophyll develops along the year.

Fig. 4.

Example of a binary map obtained in the region of Obernai in Alsace (France). White pixels indicate vineyards.

Fig. 5.

(a, b) Example of two STR labeled as vineyard, associated in (a) with a curve entirely included in a vineyard plot and in (b) with a curve passing through a road (the blue area, whose appearance is stable all year long). (c) The STR shown in (b) split into two classes; only the pixels of the grey class will be labeled as vineyard. (Color figure online)

The idea is then to post-process only the STR labeled as vineyard, and to check for each of them whether several classes of behavior are present. If two (or more) types of columns are present, only the pixels of the largest class are labeled as vineyard. We transform the color STR image into a grey-level image and consider the mean value of each column; the number of columns equals the length of the curve. The modes of this grey-value population are assumed to correspond to different vegetation types, the most important mode corresponding to vineyard pixels. If more than one class appears, the curve can be segmented. The result of this segmentation is illustrated in Fig. 5(c). The confidence of the pixels of curve C in the vineyard class is the confidence \(s_C(v)\) associated with the STR; for the other pixels, the confidence in the vineyard class is \(1-s_C(v)\). The result of this process is presented in Fig. 6, which can be compared to Fig. 4: the plots are materialized by their limits, the paths between plots are visible, so at the pixel level the results seem satisfactory.
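A possible implementation of this column split, assuming a two-means clustering of the per-column grey means stands in for the mode analysis (the paper does not name the clustering method, so this is our choice):

```python
import numpy as np

def split_str_columns(str_img):
    """Split the columns of a vineyard-labeled STR (dates, length, 3)
    into two behavior classes via two-means on the per-column grey
    means; the larger class keeps the vineyard label."""
    # grey level per column: mean over dates (lines) and color channels
    col_means = str_img.mean(axis=(0, 2))
    c0, c1 = col_means.min(), col_means.max()
    assign = np.abs(col_means - c0) <= np.abs(col_means - c1)
    for _ in range(20):  # Lloyd iterations on the 1-D population
        if assign.all() or not assign.any():
            break  # a single mode: nothing to split
        c0, c1 = col_means[assign].mean(), col_means[~assign].mean()
        assign = np.abs(col_means - c0) <= np.abs(col_means - c1)
    # majority class -> vineyard; returns a boolean mask over curve pixels
    return assign if assign.sum() >= (~assign).sum() else ~assign
```

On an STR whose curve crosses a road, the road columns form the smaller class and are therefore excluded from the vineyard label, matching the behavior illustrated in Fig. 5(c).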

Fig. 6.

Processing of the same zone as in Fig. 4. White pixels indicate vineyards; paths and plot limits are now visible.

Of course, such pixel-level results are not directly usable for real applications requiring a plot-level segmentation. Small connected components are considered as noise and removed; here, however, the size of the removed components is not fixed empirically but learned from the data of the training region, so as to optimize the results at the pixel level. Furthermore, to make the results more easily usable and exploitable by end-users, the maps obtained at the pixel level have to be transformed into a vectorized form. For this, the agricultural land register is used; all its plots are polygons. We tag with a label both the plots in the RPG and the land register plots outside the labeled RPG plots. A plot is labeled vineyard when 90% of its pixels are labeled as vineyard. Note that for some end-user applications, the 90% threshold can be modified to distinguish confident labels from labels to be manually verified on an ortho-image. Most of the time, the land register plots are homogeneous with respect to our class decision, but sometimes a segmentation of the plot has to be performed. The result is illustrated in Fig. 7, where two colors are used: blue for the confident conclusions, and green for the plots where only between 70% and 90% of the pixels are labeled as vineyard.
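The plot-tagging rule can be sketched as follows, assuming each land-register polygon has been rasterized into a boolean mask over the pixel map (the tag names are ours; the 90% and 70% thresholds are from the text):

```python
import numpy as np

def label_plots(pixel_map, plot_masks, confident=0.9, review=0.7):
    """Tag each land-register plot from its fraction of vineyard pixels:
    'vineyard' when the fraction reaches `confident`, 'to_verify'
    between the two thresholds, 'other' below."""
    labels = []
    for mask in plot_masks:
        # fraction of vineyard pixels inside the plot polygon
        frac = pixel_map[mask].mean() if mask.any() else 0.0
        if frac >= confident:
            labels.append('vineyard')
        elif frac >= review:
            labels.append('to_verify')
        else:
            labels.append('other')
    return labels
```

The 'to_verify' band corresponds to the green plots of Fig. 7, which an end-user would check against an ortho-image.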

Fig. 7.

Results (same zone as in Fig. 4), superimposed on the satellite image: in blue (resp. green), the plots labeled vineyard with at least 90% (resp. between 70% and 90%) of vineyard pixels. (Color figure online)

In this section, we have described the method and illustrated it with qualitative results. Hereinafter, we detail the experiments and present quantitative results.

3 Experimental Study

First, we describe the data and the way the parameters of the system have been set, before presenting two aspects of the results: the segmentation of an area ignored during training, and the ability of the method to correct poor-quality annotations without any other information.

3.1 Data

We have studied vineyard regions in Alsace (France). As satellite data, 39 cloud-free Ven\(\upmu \)s images (Level 2A) acquired over the year 2019 were available through the Theia service. They are not uniformly distributed along the year.

Besides, because of the input expected by the CNN we use for STR labeling, SqueezeNet, a linear interpolation is performed at two levels: to fill the cloudy zones where no information is available, and to define 224 images equally distributed over the year. The global data cube thus has a depth of 224, and the STR have 224 lines. To benefit from the ImageNet initialization of the CNN, we use 3 of the 12 spectral bands available in the Ven\(\upmu \)s images (near infrared, red and green), the ones most often used for crop analysis. To avoid the effect of outliers in the raw data, a normalization is performed at the SITS level: a linear transformation is applied between the minimum and maximum values of the SITS, where the min and max are set respectively to the 2nd and 98th percentiles of the values.
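These two preprocessing steps can be sketched as follows (a minimal NumPy version; the per-pixel linear interpolation and the percentile clipping follow the text, the function names are ours):

```python
import numpy as np

def regularize_series(cube, dates, n_out=224):
    """Linearly resample an irregular SITS cube (len(dates), H, W, C)
    onto n_out equally spaced dates."""
    t_out = np.linspace(dates[0], dates[-1], n_out)
    flat = cube.reshape(len(dates), -1)
    out = np.empty((n_out, flat.shape[1]))
    for j in range(flat.shape[1]):  # one temporal profile per pixel/band
        out[:, j] = np.interp(t_out, dates, flat[:, j])
    return out.reshape((n_out,) + cube.shape[1:])

def percentile_normalize(cube, lo=2, hi=98):
    """Linear rescaling to [0, 1] between the 2nd and 98th percentiles
    of the whole series, clipping the outliers."""
    vmin, vmax = np.percentile(cube, [lo, hi])
    return np.clip((cube - vmin) / max(vmax - vmin, 1e-9), 0.0, 1.0)
```

Filling cloudy pixels amounts to dropping the masked dates from `dates` before interpolating, so that `np.interp` bridges the gap from the neighboring valid acquisitions.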

We considered two geographical zones, one for training (Epfig region, \(\approx {}\) 86 km\(^2\)) and one for testing (Obernai region, \(\approx {}\) 62 km\(^2\)). One originality of our work is that we do not have any ground-truth. The content of the RPG is actually used in the training phase, but it cannot be considered as a ground-truth since it contains several errors. Nevertheless, a ground-truth is required to evaluate our methodology, so we had to build one manually, by inspecting ortho-images under the supervision of an experienced geographer. Although the ground-truth is built in vector form, we use it at the pixel level. It is built over the Obernai region, where the evaluation is performed.

3.2 Parameter Values

Since the aim is to show that the methodology associated with Deep-STaR can manage learning with very poor annotations, the training of the STR classifier (vineyard vs. non-vineyard) is done using the RPG. An idea of the low quality of the annotation is given in Fig. 8, where most red pixels are due to RPG errors. The curves must be long enough to carry information but not too long, as their position with respect to the vineyards is random. Considering the resolution of the images and the small size of vineyard plots, we chose a length of 20 pixels (100 m). In the Epfig region, 100,000 STR are automatically built in each class for the learning phase, among which 20,000 are used as a validation set. The STR are 224 \(\times \) 224 color images, but only 20 of their columns are significant; all other pixels are set to 0.
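The zero-padding of a 20-column STR into the fixed 224 \(\times \) 224 input of an ImageNet-pretrained CNN can be sketched as follows (the exact placement of the significant columns in the canvas is our assumption, not specified by the paper):

```python
import numpy as np

def pad_str(str_img, size=224):
    """Embed a (dates, curve_length, 3) STR into a size x size zero
    canvas; with 224 interpolated dates, only the first curve_length
    columns carry data, all other pixels stay at 0."""
    canvas = np.zeros((size, size, 3), dtype=str_img.dtype)
    d, w = str_img.shape[:2]
    canvas[:d, :w, :] = str_img
    return canvas
```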

The number of STR used in the decision step is computed such that all pixels of the studied region belong to a curve and around 4 curves pass through each pixel. To set the threshold value for the binarization of the map, a zone of the studied region has been used; it is never modified in the evaluations. The evaluation of this STR classification is carried out in the Obernai testing zone, where a ground-truth has been manually built.
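The number of curves to draw for such a target coverage can be estimated with a back-of-the-envelope computation (the function name is ours; this ignores curves overlapping themselves, so it is a lower bound on the count actually needed):

```python
import numpy as np

def curves_for_coverage(region_shape, curve_length, target=4.0):
    """Rough number of random curves so that, on average, `target`
    curves pass through each pixel of the region (each curve visits
    curve_length pixels)."""
    n_pixels = region_shape[0] * region_shape[1]
    return int(np.ceil(target * n_pixels / curve_length))
```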

3.3 Evaluation

We propose three types of evaluation: first, the evaluation of the STR classification; then the evaluation of the segmentation and of the proposed vineyard plots; and finally the capacity of the method to correct the erroneous labeling used in the learning phase. The evaluation relies on recall and precision indexes computed at the pixel level.

STR Classification. It is carried out on 20,000 STR randomly extracted from vineyard and non-vineyard areas. Table 1 presents the results. The precision is high, and higher than the recall, which means there are more false negatives than false positives; the system is thus better at labeling non-vineyard. The recognition rate could be higher, but we must keep in mind that the training set contains errors that necessarily influence the global results. Besides, in the end, several STR are used to make each pixel-level decision, so if the STR philosophy is sound, the rate should increase.

Fig. 8.

Results in the same zone as in Fig. 4, superimposed on the satellite image: in white, the plots labeled vineyard by both the RPG and our method; in red, our predictions not present in the RPG; in green, the elements we missed. (Color figure online)

Table 1. Evaluation of the STR classification on Obernai.

Segmentation. It is performed at the pixel level. Since we have a manually checked ground-truth, it is also possible to compute the evaluation for the RPG, enabling a comparison with our proposal. Besides, for the chosen region, there exists for the east of France a manually labeled topographic database, OCS GE2 (©GeoGrandEst, 2019), containing a land use/cover map with vineyard plots.

The three segmentations are then compared to our ground-truth and the results are provided in Table 2. While the STR classification was limited to 0.88 in Table 1, thanks to the several curves passing through each pixel, the segmentation results increase to 0.97. We can observe that our precision is lower than that of the other maps, but our recall is much larger than the recall of the RPG and the global recognition rate is higher. The level of our results, obtained automatically, nearly reaches the evaluation computed for OCS GE2, which was manually labeled by experts.

Table 2. Evaluation of the segmentation on Obernai at the pixel level.

Figure 8 illustrates the differences between our results and the RPG. In white are the jointly labeled vineyards, in red the vineyards we predict that are not contained in the RPG, and in green the vineyards present in the RPG that we missed. Most of the newly predicted vineyards are really vineyards.

RPG Correction. Until now, we have achieved the segmentation with a system trained to differentiate vineyard from everything else, so vineyards can be segmented in any region. In this new study, we take advantage of the RPG in the new region (Obernai): we train the CNN for supplementary epochs using STR from Obernai, before applying the system to the same region. Results are reported in Table 3. The STR classification system has thus been updated with respect to the RPG of the region; no other hyperparameter has been changed. It can be observed that the recall is increased, better than in the general case, while the precision is decreased. As the aim is to recover the non-labeled vineyard plots, this new system is an improvement over the previous one.

Table 3. Correction of the RPG on the Obernai region at the pixel level.

4 Conclusion

From this study, we can draw several conclusions, both from a methodological and from an applicative point of view. Through the use of 2D data representations that can be generated in large quantities, deep learning is made possible. These data contain both spatial and temporal information, making it possible to study crop classes characterized by structural texture, and this in a rotation-invariant way. We have shown that the Deep-STaR model, built in a classification context, can be extended to solve a semantic segmentation task. Here only two types of regions are considered, but their number can be extended to deal with the semantic segmentation of other types of agricultural crops. We have also highlighted that, thanks to a two-step process and the redundancy introduced by the multi-view aspect, a poor-quality annotation can be improved or used in a learning process.

Our methodology can be applied to other applications, for example in radiology, where imaging modalities lead to 3D images. Video analysis, in the context of the localization of violent areas, is another perspective. Besides, we think it important to analyze more deeply the filters leading to spatio-temporal features, to better understand the properties of the model that is built.