
1 Introduction: Extension Possibilities and Contributions

The Nomenclature of Territorial Units for Statistics (NUTS) is a hierarchical system in which regions of the European Union (EU) and related states are sub-divided. There is an official Resource Description Framework (RDF) dataset provided by Eurostat, which contains major NUTS concepts. Since some data is not included in the RDF dataset, the following possibilities for extension arise:

  1. E.1: Extension by the finest geographical level, named Local Administrative Units (LAU). This level contains data of districts and municipalities and allows a more precise identification of regions.

  2. E.2: Extension by the currently valid version (NUTS 2021). The published Eurostat RDF dataset is limited to data up to the NUTS 2016 version.

  3. E.3: Extension by URIs for different versions. The Eurostat RDF dataset focuses on the respective latest NUTS version. There are no unique URIs for obsolete versions, which would be helpful for, e.g., updating other datasets to revised NUTS versions, which are issued at intervals of three years.

With this work, we present the three following contributions to the Linked Data community:

  1. C.1: A proposal to extend the existing Linked Data scheme by the additional LAU level as well as unique identifiers for published NUTS and LAU versions.

  2. C.2: A Knowledge Graph (KG) generator, which can be used to build and update the KG to NUTS versions released in the future. The generator is implemented to automatically parse the file format used by Eurostat to publish new NUTS and LAU data, which are contained in Excel files.

  3. C.3: A KG built upon the existing concepts as well as a scheme extension along with data officially published by Eurostat and links to additional entities.

The contributions can be used to enhance other KGs and scientific works which include relations to EU regions and their population. Also, tasks like Named Entity Recognition (NER) of geographical entities can be improved by including the hierarchical structure of named regions.

The remainder of this article is structured as follows: Sect. 2 introduces NUTS and LAU concepts and gives insights into related works. In Sect. 3, an extension of the existing Eurostat: NUTS - Linked Open Data dataset concepts is presented. This includes a description of the given scheme (Sect. 3.1), the added concepts of the extension (Sect. 3.2) and the data processing pipeline (Sect. 3.3). Sect. 4 lists statistics of the resulting KG. Finally, Sect. 5 provides a conclusion and an outlook towards future works.

2 Related Work: Existing Concepts and Their Usage

Related works comprise the data and schemes published by Eurostat (Sect. 2.1) and scientific works related to NUTS and LAU (Sect. 2.2).

2.1 NUTS and LAU: Hierarchical Geographical Regions

The Nomenclature of Territorial Units for Statistics (NUTS)Footnote 1 is a geographical hierarchy of regions. For every member state of the EU and for additional states like the United Kingdom, respective geographical regions are sub-divided into three levels of detail. The subdivision into levels is based on thresholds of population sizes. The average population size of regions has to range between a minimum and a maximum. Figure 1 shows the specified thresholds as well as examples of NUTS levels and regions related to Strasbourg.

Fig. 1. NUTS classification criteria based on population thresholds

With exceptions, the NUTS scheme is updated every three years. The three most recent versions are 2021, 2016 and 2013. The larger gap between the version numbers 2016 and 2021 does not reflect a long delay in releasing the versions; rather, the naming of the schemes changed: up to 2016, schemes were named after the date of their technical adoption, whereas from 2021 on, they are named after the date from which data becomes available. NUTS 2016 became valid in 2018 and NUTS 2021 has been valid since 2021. The official description of the current NUTS version was published by Eurostat [4].

The current version NUTS 2021 comprises sub-divided regions of the 27 EU states Austria (AT), Belgium (BE), Bulgaria (BG), Croatia (HR), Cyprus (CY), Czechia (CZ), Denmark (DK), Estonia (EE), Finland (FI), France (FR), Germany (DE), Greece (EL), Hungary (HU), Ireland (IE), Italy (IT), Latvia (LV), Lithuania (LT), Luxembourg (LU), Spain (ES), Malta (MT), Netherlands (NL), Poland (PL), Portugal (PT), Romania (RO), Slovakia (SK), Slovenia (SI) and Sweden (SE), as well as the United Kingdom (UK). These country regions are sub-divided into 104 regions at the NUTS 1 level, 283 regions at NUTS 2 level and 1,345 regions at NUTS 3 level.
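The hierarchy is also reflected in the codes themselves: a NUTS code consists of a two-letter country code followed by one additional character per level. The following minimal sketch, assuming well-formed codes, derives the level and the chain of parent regions from a code. The helper names `nuts_level` and `nuts_ancestors` are hypothetical and only serve as an illustration.

```python
def nuts_level(code: str) -> int:
    """Return the NUTS level (0-3) encoded in the length of a NUTS code."""
    if not 2 <= len(code) <= 5:
        raise ValueError(f"not a NUTS code: {code!r}")
    return len(code) - 2


def nuts_ancestors(code: str) -> list[str]:
    """Return all parent codes, from the direct parent up to the country."""
    return [code[:length] for length in range(len(code) - 1, 1, -1)]


# FRF11 is used here as an example of a NUTS 3 code in France.
print(nuts_level("FRF11"))      # 3
print(nuts_ancestors("FRF11"))  # ['FRF1', 'FRF', 'FR']
```

A country code such as FR has level 0 and an empty list of ancestors.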

In addition to NUTS, there is a further, finer level named Local Administrative Units (LAU). It consists of municipalities or equivalent units. Up to 2016, this level was sub-divided into two LAU levels. Up to 2003, these two levels were named NUTS 4 and NUTS 5, respectively.

LAU is updated annually. In the current version (2021), it comprises the states of NUTS 2021 (listed above) as well as additional data for Albania (AL), Iceland (IS), Liechtenstein (LI), Norway (NO), Switzerland (CH) and Turkey (TR). As of 2022-06-14, data for the following countries is also to be added: Bosnia and Herzegovina (BA), Kosovo (XK), Montenegro (ME), Republic of North Macedonia (MK) and Serbia (RS). Along with the related NUTS regions, the respective area sizes and populations are published. This data has been used in statistical and scientific works.

2.2 Usage of NUTS and LAU in Statistical and Scientific Works

Statistical evaluations based on NUTS and LAU data have been carried out in several domains. In the recent work Coronis [8], multiple public COVID-19 sources were combined with NUTS regions to compare rates of infection. The work is based on GeoVocab, which contains spatial data and was updated in 2011. In the economic domain, rental listings in Greece have been sub-divided by NUTS regions and visualized [1]. This approach could be applied to other countries to enable comparisons. Farm topology and spatial land in the German state of North Rhine-Westphalia have been combined with LAU-level data [7]. It is an example of the usage of fine-grained spatial data where “official statistics provide frequency tables [...] at NUTS 3 and higher level, only”.

Early works that focused on the UK used NUTS and the related harmonized statistics at the national level [2, 3]. However, their URIs are no longer available. In order to remain sustainably retrievable, our approach combines open licensing of code and data, permanent identifiers via w3id.org, and generator software that can parse official Eurostat data from the last 10 years and is therefore expected to integrate future releases with little effort.

NUTS data has been combined with other data sources like postal codes, GeoNamesFootnote 2 and OpenStreetMapFootnote 3 to enable users to search and retrieve information about geo entities [5]. Entities from OpenStreetMap itself have been transformed into RDF data [10].

There are also various visualizations covering several domains, mainly published by Eurostat itself: Regions in Europe - 2022 interactive editionFootnote 4, Statistical AtlasFootnote 5, Statistics IllustratedFootnote 6, eurostat-map.jsFootnote 7, NutsDorlingCartogramFootnote 8, and Regions and Cities IllustratedFootnote 9. To enable other EU projects to build comparable works based on RDF, this work extends the existing NUTS Knowledge Graph by LAU data.

3 Extending the Existing NUTS Knowledge Graph

In order to extend the existing NUTS KG, we first analyze the officially published RDF data (Sect. 3.1). Based on the scheme characteristics, we propose an extension (Sect. 3.2). In addition, we describe the individual steps of the generator software (Sect. 3.3).

Fig. 2. Scheme of Eurostat: NUTS - Linked Open Data

3.1 The Eurostat Linked Open Data Scheme

The Eurostat LOD scheme comprises NUTS data from country level (NUTS 0) down to NUTS 3. For all levels, the NUTS schemes 2016, 2013 and 2010 are included. The dataset focuses on the newest included version; changes with respect to prior versions are described, e.g. if a region was split. Single NUTS URIs (hereafter named NUTS entities) are provided with the related NUTS code, NUTS scheme, label and level. Figure 2 gives an overview and Table 1 lists the namespaces used in this paper.

Table 1. Used prefixes, namespaces and related vocabularies

The RDF dataset is well suited to describe the current NUTS state. However, it has the following limitations: (a) The currently valid NUTS 2021 scheme is not included. (b) LAU-level data is not included. (c) There is no specific identifier for NUTS entities combined with related NUTS schemes. If additional data is added for a NUTS entity, e.g. the population of a region, the related NUTS scheme cannot be addressed directly. (d) The NUTS level 3 is not included as a literal; the data is limited to the literals 0, 1 and 2. (e) The properties replaces and isReplacedBy are part of the Dublin Core vocabulary, but the RDF file erroneously uses SKOS for them. With regard to adding further details for regions, an extension of the scheme is necessary.

3.2 Extension of the Eurostat Scheme

In order to uniquely address a NUTS entity, we introduce a combination of a NUTS entity and a related NUTS scheme. This combination is named Unique NUTS entity and is shown in Fig. 3. The figure also shows existing Eurostat concepts in blue, while all data generated in our approach is colored yellow and green. Additional concepts from Fig. 2 (e.g. the NUTS label) remain valid but are not additionally visualized. A Unique NUTS entity has a label (the name of the respective region in English) and can be related to other entities, e.g. the region URI in Wikipedia, Wikidata or DBpedia. The NUTS hierarchy is represented by skos:broader properties between pairs of Unique NUTS entities. The inverse narrower direction can easily be inferred and is not explicitly modelled to keep the amount of data to generate low.
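The combination of scheme and code, the skos:broader links and the inference of the narrower direction can be illustrated with a minimal sketch. The namespace `https://example.org/launuts/`, the URI layout and all function names are hypothetical placeholders chosen for this illustration; they are not the identifiers published with the KG.

```python
from collections import defaultdict

BASE = "https://example.org/launuts/"  # hypothetical namespace, not the published one
SKOS = "http://www.w3.org/2004/02/skos/core#"


def unique_nuts_uri(scheme_year: int, code: str) -> str:
    """Combine a NUTS scheme (year) and a NUTS code into one identifier."""
    return f"{BASE}nuts/{scheme_year}/{code}"


def broader_triples(scheme_year: int, codes: list[str]) -> list[tuple[str, str, str]]:
    """Create one skos:broader triple per code whose parent code is known."""
    known = set(codes)
    triples = []
    for code in codes:
        parent = code[:-1]  # the parent code is the code without its last character
        if len(code) > 2 and parent in known:
            triples.append((unique_nuts_uri(scheme_year, code),
                            SKOS + "broader",
                            unique_nuts_uri(scheme_year, parent)))
    return triples


def infer_narrower(triples: list[tuple[str, str, str]]) -> dict[str, list[str]]:
    """Derive the inverse (narrower) direction from skos:broader triples."""
    narrower = defaultdict(list)
    for child, predicate, parent in triples:
        if predicate == SKOS + "broader":
            narrower[parent].append(child)
    return dict(narrower)


triples = broader_triples(2021, ["FR", "FRF", "FRF1", "FRF11"])
for subject, predicate, obj in triples:
    print(f"<{subject}> <{predicate}> <{obj}> .")
```

Because the scheme year is part of every identifier, the same code can be addressed separately for each NUTS version, and the narrower direction is available on demand without being stored.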

Fig. 3. Extension of Eurostat scheme with LAU data

In addition, we introduce the same NUTS concepts for LAU-level data. A Unique LAU entity is related to both a LAU entity with a code and a LAU scheme representing the issued year. In addition, we add the respective area and population sizes and use skos:prefLabel for Latin names and skos:altLabel for names using non-Latin characters. Figure 4 shows the symmetric design of the scheme and the single parts of URIs, which allow directly addressing NUTS and LAU codes of individual years. LAU entities can be listed by traversing scheme paths (e.g. using SPARQL) and be directly addressed by URIs. In addition to this scheme extension, we processed published data and built a KG.

Fig. 4. Extension of Eurostat NUTS URIs with LAU and unique identifiers

3.3 Data Analysis and Processing

The LauNuts approach was developed in several iterations following the Linked Data life cycle [6] (Fig. 5). Actions such as manual revision and quality analysis towards the final KG generation are included implicitly in every stage of the workflow.

We first explored data sources and discovered, inter alia, the officially published sources for NUTSFootnote 10, LAUFootnote 11 and Linked Open DataFootnote 12. The majority of the data is provided as Excel files. NUTS data is currently available as 7 Excel files for the schemes of the years 2021, 2016, 2013, 2010, 2006, 2003, 1999 and 1995 with 31 sheets in total; for LAU, there are 14 Excel files with 495 sheets for the years from 2010 to 2021. The RDF file related to Linked Open Data contains 20,001 triples.

The extraction started with inspecting the data. Some of the Excel files could not simply be opened; the attempts failed as follows: Google Sheets (“file is too large to preview”), LibreOffice Calc (“the maximum number of columns per sheet was exceeded”) and Apache POI (“OutOfMemoryError: Java heap space”). We finally set up the following extraction pipeline, explicitly stated here as it could be of interest to other developers working on the topic: (1) Converting XLS files to XLSX using LibreOffice (7.3.7.2). (2) Converting XLSX files to CSV using ssconvert/Gnumeric (1.12.51). (3) Extracting single sheet names using in2csv/csvkit (1.0.7). (4) Renaming CSV files. The additionally provided Eurostat RDF file could be read using Apache Jena (4.6.1) without any problems.
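Steps (1) to (3) of this pipeline can be sketched as a list of command lines. The command-line options shown are assumptions about a typical invocation of the three tools and not necessarily the ones used by the generator; the file name `NUTS2021.xls`, the directory `cache` and the function name are hypothetical. The commands are only built here, not executed.

```python
from pathlib import Path


def conversion_commands(xls_file: str, work_dir: str) -> list[list[str]]:
    """Build (but do not run) the conversion commands for one Excel source file."""
    source = Path(xls_file)
    out = Path(work_dir)
    xlsx = out / (source.stem + ".xlsx")
    return [
        # (1) XLS -> XLSX with LibreOffice in headless mode
        ["soffice", "--headless", "--convert-to", "xlsx",
         "--outdir", str(out), str(source)],
        # (2) XLSX -> one CSV file per sheet; %n is replaced by the sheet number
        ["ssconvert", "--export-file-per-sheet", str(xlsx),
         str(out / (source.stem + "_%n.csv"))],
        # (3) List the sheet names, which are needed to rename the CSV files
        ["in2csv", "--names", str(xlsx)],
    ]


for command in conversion_commands("NUTS2021.xls", "cache"):
    print(" ".join(command))
```

Each command could then be passed to `subprocess.run(command, check=True)`; step (4), renaming the numbered CSV files, would map the sheet numbers to the names returned by step (3).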

Fig. 5. The Linked Data life cycle

For further querying, the stored CSV and RDF data were used as a cache. To ensure reproducibility, even if the Excel source files are updated in the future, the pre-processed data are published on an FTP serverFootnote 13.

The manual revision of the data started with an analysis of the RDF source. In order to reuse existing Semantic Web concepts, we created a scheme from the available RDF data (see Fig. 2). The scheme is extensible and provides details of the most important NUTS concepts. However, the predicates replaces and isReplacedBy used in the RDF file are not part of the used SKOS vocabulary, but of DCT (see Table 1). Regarding the predicate nuts:level and its literals, the NUTS levels 0, 1 and 2 are included, whereas level 3 is not. Additionally, the RDF data is limited to the NUTS schemes 2016, 2013, and 2010.

The provided Excel files contain values that must be handled individually and partially cleaned. The LAU Excel files provide LAU codes, related NUTS codes, names in Latin characters, names in national (non-Latin) characters, and area and population data. The values of Latin names are sometimes duplicates of non-Latin names. In other cases, no Latin names are given. Furthermore, some headings describing the same concepts are named differently in individual files. An example of required cleaning is the code FR7, which occurs twice in NUTS 2013. In addition, the LAU 2021 file contains a sheet with 1 million rows, each of which contains a cell with the value 0. Overall, the data was evaluated to be usable with additional cleaning.
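The kinds of cleaning described above can be illustrated with a minimal sketch. It is not the cleaning logic of the generator: the heading variants in `HEADER_ALIASES`, the rule that a row whose code cell is empty or 0 is a filler row, the rule that a repeated code keeps its first occurrence, and all sample values are assumptions made for this illustration.

```python
import csv
import io

# Hypothetical heading variants; the real files use further spellings.
HEADER_ALIASES = {
    "lau code": "lau_code",
    "lau_code": "lau_code",
    "nuts 3 code": "nuts_code",
    "nuts3_code": "nuts_code",
    "lau name latin": "name_latin",
    "lau name national": "name_national",
    "population": "population",
}


def clean_rows(csv_text: str) -> list[dict]:
    """Normalize headings, drop filler and repeated rows, separate the labels."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [HEADER_ALIASES.get(h.strip().lower(), h.strip()) for h in next(reader)]
    cleaned, seen = [], set()
    for cells in reader:
        row = {key: value.strip() for key, value in zip(header, cells)}
        code = row.get("lau_code", "")
        key = (row.get("nuts_code", ""), code)
        # Assumption: filler rows carry no code or a lone "0"; repeated codes are skipped.
        if code in ("", "0") or key in seen:
            continue
        seen.add(key)
        latin = row.get("name_latin", "")
        national = row.get("name_national", "")
        # Fall back to the national name if no Latin name is given.
        row["pref_label"] = latin or national
        # Keep the national name only if it differs from the preferred label.
        row["alt_label"] = national if national and national != row["pref_label"] else None
        cleaned.append(row)
    return cleaned


# Invented sample values, for illustration only.
sample = """LAU CODE,NUTS 3 CODE,LAU NAME LATIN,LAU NAME NATIONAL,POPULATION
L1,XX111,Athina,Αθήνα,1000
L2,XX111,,Πειραιάς,2000
L3,XX111,Same,Same,3000
L3,XX111,Same,Same,3000
0,,,,
"""

for row in clean_rows(sample):
    print(row["lau_code"], row["pref_label"], row["alt_label"])
```

In this sketch, `pref_label` and `alt_label` correspond to the values that would be written as skos:prefLabel and skos:altLabel.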

We interlinked the generated data with two data sources. First, the official Eurostat NUTS URIs have been reused. Second, as a proof of concept, we also created links to Wikipedia URIsFootnote 14 representing regions at NUTS levels 0 and 1. To this end, we processed JSON data retrieved using the Wikipedia API and parsed the embedded wiki markup. The Wikipedia URIs can be used to create additional skos:relatedTo links to Wikidata and DBpedia, as these KGs are also linked to existing Wikipedia URIs.

Additional steps of the Linked Data life cycle are integrated into the used workflow. The classification of entities is built in, as the overall data integration is based on the used RDF schemes. Quality analysis and evolution were conducted in several iterations during development by comparing official figures about the data, as well as concrete values, with the entities actually created in the KG. The development started in 2019 as part of the OPAL research project, and the KG has been used to access geo labels for Question Answering (QA) [9].
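Such a comparison of official numbers with created entities can be sketched as follows, using the NUTS 2021 region counts stated in Sect. 2.1. The function name is hypothetical, and the sketch assumes that the given list contains only codes of regions covered by the official counts.

```python
from collections import Counter

# NUTS 2021 region counts per level as stated in Sect. 2.1 (including the UK).
OFFICIAL_NUTS_2021 = {1: 104, 2: 283, 3: 1345}


def count_mismatches(generated_codes, official=OFFICIAL_NUTS_2021):
    """Return {level: (official, generated)} for every level whose counts differ."""
    # The level of a code is its length minus the two-letter country prefix.
    per_level = Counter(len(code) - 2 for code in set(generated_codes))
    return {level: (expected, per_level.get(level, 0))
            for level, expected in official.items()
            if per_level.get(level, 0) != expected}


# An incomplete list of codes yields a mismatch for every level.
print(count_mismatches(["FR", "FRF", "FRF1"]))
```

An empty result indicates that the number of generated entities matches the official counts at every level.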

4 Results: Open Software and Knowledge Graph

This work provides three main contributions as listed in Sect. 1. The first contribution (C.1) is the scheme extension described in Sect. 3.2, which allows the integration of LAU data versions, which are updated annually. The scheme makes extensive use of common RDF vocabularies. Additionally, created URIs use permanent identifiers of w3id.org to be available in the future.

The generator software (C.2) is published as Open Source (GNU AGPLv3 license) on GitHubFootnote 15. This enables extensions or reuse of the code in other projects. It is designed to extract NUTS and LAU data in the format of the Eurostat data published over the last 10 years; therefore, data published in the future is expected to be processable with little or no adaptation. The software is parameterized to process only single steps (e.g. data extraction or KG building) and subsets of available data (e.g. only specified NUTS or LAU versions or single country data).

Table 2. Knowledge Graph sizes

Generated Knowledge Graphs (C.3) contain up to 7 NUTS versions (from 1999 to 2021) and 12 LAU versions (from 2010 to 2021) with labels, area, and population sizes. The current version is named LauNuts2021b to be referenced unambiguously and comprises the NUTS schemes 2021 and 2016 as well as LAU data from 2021 and 2020. In addition, entities of the NUTS levels 0 and 1 and the NUTS 2021 scheme are linked to Wikipedia URIs. As new LAU versions are published annually, new KG versions are expected to be generated in the future. The KG is published under the CC BY 4.0 International license on FTPFootnote 16, Zenodo and Figshare and is therefore accessible via the respective Digital Object Identifiers (DOI). Table 2 shows an overview of contained entities and literals in the KG and sub-graphs for 2021 and 2016. The KG in version LauNuts2021b contains 1,181,549 triples.

5 Outlook and Conclusion

5.1 Possibilities for Future Work

Extending the KG with postal codes could enable a more precise linking to other KGs. Postal codesFootnote 17 are available for the NUTS schemes 2021, 2016, 2013, and 2010. For example, for NUTS 2021, there are lists for 35 countries. Mappings between geodataFootnote 18 and NUTS as well as LAU codes are available in different file formats and scales. An extension with geodata would enable the identification of geographic regions for given points of interest.

NUTS and LAU codes could complete mappings in well-known Knowledge Graphs independently from the LauNuts KG. In Wikidata, there is the property P605Footnote 19 which represents NUTS links. It is already used for entities, e.g. for AlsaceFootnote 20. In DBpedia, there is the property nutsCodeFootnote 21. It is used, e.g., for CornwallFootnote 22. The property is listed as an equivalent property to the Wikidata property P605Footnote 23.
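As an illustration of how existing P605 statements could be looked up, the following minimal sketch builds a request URL for the public Wikidata SPARQL endpoint. The function name is hypothetical and the request is only constructed, not sent; it is a sketch of one possible lookup, not part of the presented generator.

```python
from urllib.parse import urlencode

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def nuts_lookup_url(nuts_code: str) -> str:
    """Build a request URL asking Wikidata for items with the given NUTS code."""
    if not nuts_code.isalnum():
        raise ValueError(f"unexpected characters in NUTS code: {nuts_code!r}")
    # P605 is the Wikidata property for NUTS codes.
    query = f'SELECT ?item WHERE {{ ?item wdt:P605 "{nuts_code}" . }}'
    return WIKIDATA_SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


print(nuts_lookup_url("DE"))
```

Items returned for a code could be compared with the generated entities to find missing or outdated mappings.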

The generated entities could be completely linked to Wikipedia URLs. For NUTS 2021, the levels 0 and 1 have already been linked in this work as a proof of concept. Pages in the Wikipedia category Nomenclature of Territorial Units for StatisticsFootnote 24 contain tables with NUTS codes and linked Wikipedia pages. These mappings could then be utilized for KG linking, as Wikipedia pages are also linked from Wikidata and DBpedia.

5.2 Conclusion

With this work, we extended the existing Eurostat KG with the suggestions listed in Sect. 1: We added (E.1) LAU data, (E.2) the current NUTS 2021 version, and (E.3) URIs for different NUTS and LAU versions.

The KG can be utilized for tasks such as Named Entity Recognition or entity disambiguation by using the provided literals and geographical hierarchy. Other use cases are updates of outdated data to the newest NUTS and LAU versions or comparisons of EU regions based on population numbers.

The provided LauNuts KG and the generator software are available with open licensing and are ready to use for upcoming research projects related to EU regions and on the national level.