Introduction

Land cover information is used in various applications such as land resource management and ecological and environmental monitoring (Lepers et al. 2005; Deka et al. 2019). Remote sensing technologies are frequently used to obtain land cover information quickly, economically, and reliably (Kavzoğlu and Çölkesen 2010; Gallego 2004; Tong et al. 2020; Cao et al. 2020). However, as the amount of information obtained by remote sensing increases, the analysis, extraction, and updating of data have become a separate study area (Appel et al. 2018; Audebert et al. 2018; Sun et al. 2019; Zhang and Ge 2019). In particular, the automatic extraction and labelling of land cover classes from remotely sensed images without user dependency is an important field.

While supervised classification methods produce labelled results in line with data determined by the user, unsupervised classification methods divide the data into unlabelled groups based on spectral similarities and differences. In the latter case, the user must interpret and label the classes after classification (Jensen and Lulla 1987). Interpretation and labelling may not be easy, as the classification does not always form meaningful clusters, and it can take a great deal of time (Enderle and Weih Jr 2005).

Automatic classification of remotely sensed images and automatic labelling of the resulting classes are still challenging. Studies in the literature show that it is possible to label target objects in images (Lin et al. 2017; Yuan et al. 2019; Huang et al. 2019). Studies have also been carried out to create target land classes (Wang et al. 2019; Gupta et al. 2018).

In recent studies, convolutional neural networks (CNNs) and deep learning structures have been used for object identification and labelling (Kaiser et al. 2017). In another study, Lin et al. (2017) propose a method for sea-land segmentation and vessel detection in remotely sensed images. Using this CNN method, they identify oil tankers and naval vessels.

Wu et al. (2020) use older labelled images of a region to update data on land classes in that region. This method gives successful results when the study area has been previously classified and labelled with auxiliary data. However, the authors state that the threshold value parameters used to update old maps are critical and determined by the user. Incorrect classifications made in the change determination operation can negatively affect the determination of the changed regions (Wu et al. 2020).

In this study, an approach is proposed to label land use classes automatically. The approach has been tested primarily on automatic labelling of land, green farmland, forest, urban area, and uncultivated agricultural land. In the method developed, the raw band values of the pixels assigned to each class are first examined, and the minimum and maximum values of the data derived from them by spectral indexing and other methods are determined. These values are then analyzed along with threshold values stored in a database created by examining images of many different geographical regions. Finally, the resulting labels are compared with the Corine classes, and the label of each pixel is determined according to the rule set created for this study.

Materials and methods

Data set and study areas

Sentinel-2 images are used to test the method developed in this study. These medium-to-high resolution images, which can be obtained free of charge, are widely used in many studies. The Sentinel-2 mission was launched by the European Space Agency (ESA) on 23 June 2015 as part of the Copernicus program. The system consists of two polar-orbiting satellites with high-resolution multispectral sensors, placed in the same orbit 180° apart and used for tracking changes on the Earth's surface (ESA 2018).

Level-2A images from the Sentinel-2 satellites are used to test the algorithm developed within the scope of the study. In order to use as many spectral indices as possible, the 20 m bands are sharpened by the intensity-hue-saturation (IHS) method to obtain a 10 m spatial resolution. The specifications of the images used are given in Table 1.

Table 1 Detailed information about the Sentinel 2 images

The images used in this study cover two geographical regions in Turkey and the Agioi Apostoli region in Greece. These regions are chosen for their different climates, topographical features, and urban distributions.

The first image is of the Gemlik region (Fig. 1b). Gemlik is located 19.13 degrees east and 40.12 degrees north. It is surrounded on three sides by mountains, with the Marmara Sea to the west. The Mediterranean climate generally prevails; however, there is a transition to the Black Sea climate (Gemlik_Belediyesi 2019). The vegetation in the region consists of maquis, forest, and olive trees (URL 2019). This area represents the climatic and vegetation characteristics of the Marmara region of Turkey and is chosen because it contains a high density of forest and urban areas.

Fig. 1 Study areas: a Agioi Apostoli; b Gemlik; c Hatay

The second study area is the Hatay region (Fig. 1c). Hatay is in southern Turkey, on the shores of the Gulf of Iskenderun. Its vegetation consists of maquis species. Hatay has a Mediterranean climate, with hot, dry summers and warm, rainy winters. Its soils include red-brown Mediterranean soil, red Mediterranean soil, brown forest soil, colluvial soils, and alluvial soils (Hatay_Valiliği 2019). This area is used to determine the algorithm’s success in detecting Mediterranean vegetation, different soil types, and dense urban areas.

The third study area is Agioi Apostoli in Greece (Fig. 1a). Agioi Apostoli is located in the Attica region. Attica is a triangular peninsula in the Aegean Sea, with a large basin at its center. This region has a Mediterranean climate.

Corine data are also used in the proposed algorithm to determine the labels of the pixels. The Corine project was initiated by the European Union in 1985 to determine the land inventory, monitor land cover changes, and support environmentally sensitive decisions. Corine Project studies in our country started in 2001. Currently, land cover maps are available for the years 1990, 2000, 2006, 2012, and 2018, all of which have been added to the European Environment Agency database.

Methods

In current unsupervised classification methods, classification results are generated without labels. This increases the dependency on the user, as the results require interpretation. To overcome this disadvantage, a labelling approach is proposed. The study aims to label the soil, green agricultural area, forest, urban area, and uncultivated agricultural area classes. The flow diagram of the proposed method is given in Fig. 2.

Fig. 2 Flowchart of the proposed method

In the first phase of the study, labels are produced for the clusters of pre-classified data with the algorithm developed. This algorithm consists of four basic steps. In the first step, the pixels in each class are examined, the spectral indices are calculated from the values of these pixels, and the minimum and maximum values of each spectral index are determined for each class. In the second step, the spectral index values obtained for each class are compared with previously prepared threshold values for all labels. In the third step, every class that is compatible with the threshold values determined for a label is assigned that label. After this process has been repeated for all classes, the last step of the algorithm checks whether more than one suitable label has been assigned to a class. If only one label is assigned, the labelling process of that class is terminated. If more than one label is assigned to a class, the most frequently repeated label is assigned to that class. All the assigned labels are presented to the user with probability information based on their repetition rate.
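The labelling steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the study: it assumes that a class is "compatible" with a label when the minimum-maximum range of an index lies inside that label's threshold interval, and the threshold numbers, the choice of indices, and the names `THRESHOLDS`, `index_ranges`, and `label_cluster` are all hypothetical.

```python
from collections import Counter

# Hypothetical threshold database: label -> {index name: (low, high)}.
# The numbers are placeholders for illustration, not the values of the study.
THRESHOLDS = {
    "forest": {"NDVI": (0.6, 1.0), "NDWI": (-1.0, -0.3)},
    "soil": {"NDVI": (0.0, 0.25), "NDWI": (-0.4, 0.0)},
    "water": {"NDVI": (-1.0, 0.0), "NDWI": (0.2, 1.0)},
}


def index_ranges(pixels):
    """Return {index name: (min, max)} over the pixels of one cluster."""
    names = pixels[0].keys()
    return {n: (min(p[n] for p in pixels), max(p[n] for p in pixels)) for n in names}


def label_cluster(pixels, thresholds=THRESHOLDS):
    """Vote for every label whose threshold interval contains the cluster range.

    Returns the most frequent label and the share of votes of each label.
    """
    votes = Counter()
    for name, (low, high) in index_ranges(pixels).items():
        for label, limits in thresholds.items():
            # Assumed compatibility test: cluster range inside the interval.
            if name in limits and limits[name][0] <= low and high <= limits[name][1]:
                votes[label] += 1
    total = sum(votes.values())
    if total == 0:
        return None, {}  # no label is compatible with this cluster
    shares = {label: count / total for label, count in votes.items()}
    return votes.most_common(1)[0][0], shares
```

With these placeholder thresholds, a cluster whose pixels all show high NDVI and low NDWI receives only forest votes, whereas a cluster that matches soil on two indices and forest on one is labelled soil, with the vote shares reported alongside the label as a rough stand-in for the probability information mentioned in the text.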

The second phase of the study starts with the acquisition of the Corine data for the study area. The Corine class corresponding to each pixel in the study area is determined. Thus, two candidate classes are available for each pixel. A rule set is developed to decide which label to choose when the two suggested labels differ. The final class is the one of these two candidates that satisfies the rule set. The rule set used in the study is given in Table 2.

Table 2 The rule set used to determine the final label to the pixel

The logic behind this rule set is twofold. First, Corine data are prepared every six years, so they cannot always reflect the current situation. Second, Corine data help to correctly label classes such as soil and fallow land, whose pixel values are very close to each other. For example, if a pixel is assigned to the soil class according to the spectral indices, but the Corine data show this area as an agricultural area, the pixel is taken to be a fallow agricultural area.
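The way the two candidate labels of a pixel are reconciled can be illustrated with a small lookup table. The sketch below is hypothetical: only the soil and agricultural area rule comes from the example in the text, the fallback of keeping the algorithm's label is an assumption motivated by the age of the Corine data, and the names `RULES` and `final_label` are invented for illustration. The actual rules are those of Table 2.

```python
# Hypothetical rule table: (algorithm label, Corine label) -> final label.
# Only this one rule is taken from the text; Table 2 holds the real rule set.
RULES = {
    ("soil", "agricultural area"): "uncultivated agricultural area",
}


def final_label(algorithm_label, corine_label, rules=RULES):
    """Pick the final label of a pixel from its two candidate labels."""
    if algorithm_label == corine_label:
        return algorithm_label
    # Assumed fallback: keep the label derived from the current image,
    # because the Corine map may be out of date.
    return rules.get((algorithm_label, corine_label), algorithm_label)
```

Keeping the rules in a dictionary keyed by label pairs means that extending the rule set only requires adding entries, without changing the decision function.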

Sentinel-2 images are used to test the proposed algorithm. Sentinel-2 images of the Black Sea and Mediterranean climatic regions are analyzed to determine the threshold values used in the algorithm. Images from January, April, June, September, and November are used in order to capture the changes of the classes across all seasons.

The spatial resolution of the images is increased to 10 m by the IHS sharpening method. The spectral indices given in Table 3 are calculated from the bands in the images explained in Table 1.

Table 3 Spectral Indices Used in Algorithm
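As an illustration of how such indices are derived from band values, the sketch below computes three of the indices named in this study for a single pixel. It assumes the common normalized-difference definitions (NDVI from near-infrared and red, NDWI from green and near-infrared, NBR from near-infrared and shortwave infrared) and the usual Sentinel-2 band assignment (B03 green, B04 red, B08 near-infrared, B12 shortwave infrared); the exact formulas used in the study are those of Table 3, and the function names are hypothetical.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def spectral_indices(pixel):
    """Compute a few indices from one pixel's reflectances keyed by band name."""
    return {
        "NDVI": normalized_difference(pixel["B08"], pixel["B04"]),  # NIR vs red
        "NDWI": normalized_difference(pixel["B03"], pixel["B08"]),  # green vs NIR
        "NBR": normalized_difference(pixel["B08"], pixel["B12"]),   # NIR vs SWIR
    }
```

All three indices share the same normalized-difference form, so they fall in the range from -1 to 1 for non-negative reflectances, which makes them convenient to compare against fixed threshold intervals.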

In order to determine the accuracy of the results, the overall accuracy method is used. In this method, 100 points are randomly generated for each study region (Fig. 3). The land cover value corresponding to these points and the classification result are determined and assigned to an error matrix.
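The overall accuracy is the share of sample points whose classification result agrees with the reference land cover, that is, the sum of the diagonal of the error matrix divided by the total number of points. A minimal sketch follows; the function names and the sample labels in the usage note are hypothetical and are not data from the study.

```python
from collections import Counter


def error_matrix(reference, predicted):
    """Count (reference label, predicted label) pairs over the sample points."""
    return Counter(zip(reference, predicted))


def overall_accuracy(matrix):
    """Share of sample points that lie on the diagonal of the error matrix."""
    total = sum(matrix.values())
    correct = sum(n for (ref, pred), n in matrix.items() if ref == pred)
    return correct / total
```

For example, if four of five sample points carry the same label in the reference data and in the classification result, the overall accuracy is 0.8.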

Fig. 3 Randomly created points for accuracy analysis: a Gemlik; b Hatay; c Agioi Apostoli

Results and discussion

In this study, threshold values are used in the labelling algorithm. In order to determine these threshold values, Sentinel-2 images of the Black Sea and Mediterranean regions of Turkey are analyzed. The reflectance of the land cover classes in the images changes with the season. Accordingly, images from January, April, June, September, and November are used to capture the changes of the classes. The spatial resolution of these images is increased to 10 m with the IHS sharpening method. The Wetness Index, SAWI, REM, SATVI, OSAVI, NGRDI, NDWI, NDVI, NDSI, NBR, TBI, GVI, EGI, BI, and AWEI spectral indices are calculated from the bands in the images. The analysis shows how the values of these indices change seasonally in the regions. The variation of the index values by month and region is given in Figs. 4 and 5.

Fig. 4 Spectral index values in the Black Sea (Karadeniz) region

Fig. 5 Spectral index values in the Mediterranean (Akdeniz) region

Threshold values are determined from the data obtained. These threshold values are calculated as the average values covering all seasonal changes. The values obtained are given in Table 4.

Table 4 Threshold values obtained from spectral indices
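One possible reading of this averaging step is sketched below: for a given label and spectral index, the minimum and maximum observed in each month are averaged into a single threshold interval. This interpretation, the function name `seasonal_threshold`, and the numbers in the usage note are assumptions made for illustration; the text does not specify the exact calculation.

```python
from statistics import mean


def seasonal_threshold(monthly_ranges):
    """Average monthly (min, max) pairs of one index into a single interval.

    monthly_ranges maps a month name to the (min, max) of the index observed
    for one label in that month.
    """
    lows = [low for low, _ in monthly_ranges.values()]
    highs = [high for _, high in monthly_ranges.values()]
    return mean(lows), mean(highs)
```

For instance, monthly ranges of (0.2, 0.4), (0.4, 0.8), and (0.6, 0.9) would yield an interval of roughly 0.4 to 0.7 under this reading.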

The approach developed for automatic labelling of the classification results is tested with Sentinel-2 images for the geographical regions of Hatay and Gemlik. The spatial resolution of the images is increased to 10 m by the IHS method. The labelling algorithm is tested in the study areas and aims to label the soil, green agricultural area, forest, urban area, and uncultivated agricultural area classes.

Gemlik study area

In order to test the proposed approach, the labels determined by the algorithm for the Gemlik region are given in Fig. 6.

Fig. 6 Gemlik region: a satellite image; b Corine classes; c labelling process result

Hatay study area

The labels determined by the algorithm for the classes in the image of the Hatay region are given in Fig. 7.

Fig. 7 Hatay region: a satellite image; b labelling process result; c Corine classes

Greece study area

The labels determined by the algorithm for the classes in the image of the Agioi Apostoli region are given in Fig. 8, together with the Corine classes.

Fig. 8 Agioi Apostoli study area: a satellite image; b labelling process result; c Corine classes

The results of the “Overall accuracy” analysis applied to the algorithm and Corine land classes covering the study area are given in Table 5.

Table 5 Error matrix for Algorithm and Corine Classes

For the Gemlik region, the accuracy analysis shows an overall accuracy of 80% for the map produced by the algorithm. While water and urban areas are separated with high accuracy, soil areas and green areas outside the forest area could not be determined. In Gemlik, there are no green agricultural areas such as wheat or barley fields; agriculture generally consists of fruit trees. Nevertheless, the algorithm labels green pastures and grass areas as green agricultural areas. For the Corine classes, the overall accuracy is 71%. While forest and water areas are separated with high accuracy, land areas and soil areas outside the forest area could not be determined successfully.

The accuracy analysis shows that the algorithm generally gives successful results for water, urban areas, and agricultural areas. When the results are analyzed, the overall accuracy is found to be 65%.

Discussion

Because unsupervised classification methods do not produce class labels, users are required to interpret the results. This requires detailed knowledge of the area studied and increases dependence on the user. In this study, an approach is proposed to automatically label data classified with an unsupervised classification method. The aim is to determine green farmland, forest, fallow, city, and water classes from Sentinel-2 images. The temporal and geographical variation of the spectral indices for the land cover classes is examined, and threshold values that separate the classes are determined from this analysis. The threshold values obtained and the algorithm developed are tested in two study areas. In addition, the Corine labels of the pixels are used. In the last stage of the algorithm, the best label according to the rule set determined by the authors is assigned to each pixel.

When the results are examined, the overall accuracy of the labelling in the Gemlik study area is 80% when the proposed algorithm is used, whereas it is 71% for the Corine classes. In another study area, the overall accuracy is 83% for the proposed algorithm and 62% for the Corine classes. These results show that the proposed algorithm is capable of labelling with high accuracy. One of the reasons for this is that the resolution of the Corine classes is 100 m, so they cannot show the details found in high-resolution images. Figure 9 shows an example of this. The marked region is labelled as Port in the Corine classes. However, when it is examined in detail, this area is seen to contain details such as water and soil. A general evaluation of the results shows that the proposed approach effectively determines the target classes, revealing that spectral indices can be used for labelling land classes. In addition, expanding the analysis of changes in spectral indices to a more comprehensive time range is expected to increase the accuracy rates and the diversity of target classes.

Fig. 9 Hatay region: a satellite image; b proposed algorithm; c Corine classes

On the other hand, as seen in Fig. 9, some green areas within urban areas, such as parks or stadiums, are labelled as green agricultural areas or forests. This is because the pilot labels determined for this study do not include a class for such areas.

Conclusions

This study aims to produce labels for unsupervised classification results, and a method is developed for this purpose. With this method, the spectral values obtained for the various classes are compared to a database of land class threshold values, and each class is assigned the appropriate label. In order to create the database, images from January, April, June, September, and November of the Mediterranean and Black Sea regions, which have different climatic and geographical characteristics, are examined. The threshold values of the classes are determined by this analysis. The approach developed is tested on images of two regions. High accuracy results are obtained, as shown by the accuracy analysis applied to the labels created. The successful results illustrate the potential of the proposed method for automatic labelling.

This study was carried out using sample data and pilot classes as a first step towards automatic labelling of images classified by unsupervised classification methods. Once thresholds are determined, labels can be created for different classes with this method. It is recommended that future work examine the results after expanding the range of labels to at least the extent of the Corine classes.