
1 Introduction

In recent years, Deep Learning (DL) has become very popular for extracting knowledge from non-structured data, such as images or time series. The increase in computing power enabled by new hardware and new deep neural network architectures has allowed for unprecedented results in previously complex tasks. From 2012 to 2019, the computing power required by state-of-the-art results increased by 300,000\(\times \) [17]. More recently, the growth has been even steeper, with models such as GPT-3, which has 175 billion parameters and was trained on a dataset of nearly a trillion words.

This rise of DL results in a huge environmental impact since the hardware required is very power-hungry. Recently, a distinction between “Red AI” and “Green AI” has been introduced in [17]: the former refers to Artificial Intelligence (AI) research focusing on performance aspects only, the latter refers to environmentally-aware research on AI. Most of the research on “Green AI” has been addressed to developing better and more efficient algorithms and architectures. However, DL training is always preceded by a data preparation phase, in charge of preparing the dataset for the training task. Information Systems Engineering (ISE) expertise can be beneficial in the design of the data preparation phase and can impact the overall energy consumption of the DL task.

This work aims at complementing the existing model-centric approaches with a data-centric approach. We propose a methodology that improves data preparation by transforming the original training set such that: (i) the performance constraints of the resulting model are satisfied; (ii) the environmental impact of the training phase is reduced. These goals are reached by taking into account the characteristics of the dataset, such as data volume and Data Quality (DQ).

The proposed methodology is validated on time series classification using Deep Neural Networks (DNNs). A prototype tool is also provided for researchers interested in reducing the carbon emissions of this task.

The paper is organized as follows. Section 2 describes existing work. Section 3 motivates the approach and sets the goals of the methodology. Sections 4 and 5 introduce an architecture for Data-centric Green AI and describe an implementation in the context of time series classification. Section 6 validates the methodology, while Sect. 7 summarizes the approach and outlines future developments.

2 State of the Art

AI has become pervasive in all fields, and its strong interdependency with climate change has been demonstrated [16]. Several applications of AI can play a role in the reduction of the effects of climate change. At the same time, AI is also the application affecting the environmental impact of IT the most. In ten years, the computing power required by AI has increased 300,000-fold [10]. DL is a subset of AI which uses DNNs as predictive models. The learning process requires several iterations and can take many hours or days on very power-hungry hardware. Models such as Convolutional Neural Networks (CNNs), Fully Convolutional Networks (FCNs), and Residual Networks (ResNets) have proved very effective in terms of performance, at the expense of a relevant environmental impact [7, 8]. The most power-hungry phase in DL is the hyperparameter (HP) search, since it involves training many models with different configurations to find the best-performing one [20].

The issue of the environmental impact of AI has been discussed in [17], introducing and comparing the two opposite concepts of “Red AI” and “Green AI”. The former refers to a performance-focused approach, where all the efforts are put into accuracy, disregarding costs and efficiency. The latter envisions a more sustainable approach to AI, encouraging a reduction in the resources spent. The main aspects to consider for reducing the environmental impact of AI are analyzed, focusing mainly on architectural and algorithm-related aspects. This can also be seen in [25] and [15], where the environmental impact of DL is considered from the infrastructural, architectural, and location perspectives. They partially consider the data perspective through transfer learning and active learning approaches. In [6], the authors focus instead on the environmental impact of model selection and hyperparameter search. As shown in [4], modelling the DL task is only one step, preceded by a data preparation phase, which may also affect the environmental impact of the overall task.

Data preparation is essential in many contexts for the analysis of large volumes of data [13, 14]. Data preparation is the preliminary phase of every DL task, which can improve the resulting model performance [11, 19] or affect the dataset balance [22]. Data preparation can also affect the environmental impact of DL tasks. The main factor to consider is data volume [12], affecting the training time and the number of resources needed, with sometimes marginal effects in terms of performance [21]. A preliminary data-centric empirical study on Green AI [23] has shown that modifications of the volume of datasets can drastically reduce energy consumption, with a limited decline in accuracy. Data selection should be DQ-driven. The data preparation step in the AI lifecycle is necessary to prevent incorrect results and biases due to poor quality data [2]. A study on the effect of DQ issues on several ML models has been performed in [3], where completeness, accuracy, consistency, and class balance have been considered, suggesting a limited relevance of class balancing on the model performance as long as the balance is higher than or equal to that of the original dataset.

This paper takes a data-centric perspective to Green AI to complement existing model-centric approaches with improved training data management.

3 Motivation and Goals for Data-Centric Green AI

AI is a first-class citizen in modern data centers, and the amount of computational and storage resources employed for supporting AI has been increasing and keeps growing. AI applications have become a utility, as demonstrated by the wide and continuously increasing adoption in different fields and for diverse purposes. Current approaches to AI focus on performance optimization and consider sustainability mainly from a model-centric perspective. The availability of huge datasets has enabled the training of complex models and boosted their performance. However, the data size used for training significantly affects the time and the resources needed for the training, impacting the environmental sustainability of AI applications [20]. Not all data have the same relevance for building the model: good quality datasets are necessary for creating high-quality models able to perform accurate predictions [9]. This problem is amplified by big data [5]. This paper adopts a sustainability-driven perspective on DL, with a data-centric focus: the environmental impact of DL applications is reduced by selecting a proper subset of the data for training the model while ensuring a required performance level. This paper identifies three incremental goals:

  • Goal 1: Explore which data-centric characteristics of DL pipelines contribute the most to energy usage, and find out which can be tweaked so the overall environmental impact is reduced.

  • Goal 2: Model the relation between the discovered data-centric characteristics and the resulting model’s performance.

  • Goal 3: Reduce the DL impact on carbon emissions by making more efficient use of data while being constrained by performance requirements.

To reach them, a general and data-centric methodology valid for any DL task is proposed, including two phases: a Data Exploration Phase in charge of reaching the first two goals through the generation of a knowledge base, and a Data Selection Phase focused on the third goal. The proposed approach can be integrated with existing model-centred techniques to improve AI sustainability.

4 An Architecture for Data-Centric Green DL

In this section, we present a detailed architecture for supporting Data-Centric Green DL in Fig. 1. The architecture is split into two parts, one for each of the two phases. Since each DL task has different characteristics (i.e., models, algorithms, datasets, and performance metrics), the actual implementation of the architecture depends on the specific DL task to support. To validate our approach, an implementation for time series classification is presented in Sect. 5.

Fig. 1. Architecture for Data-Centric Green DL

4.1 Data Exploration

The Data Exploration part of the architecture focuses on the components required for addressing Goal 1 and Goal 2. The output is a Data Exploration KB containing information about how an experiment (defined here as a dataset-model pair) is affected by the manipulation of a data-centric characteristic. The performance of the resulting model and the carbon emissions generated during training will be considered. This part consists of three components.

Fig. 2. Overview of the Experiment Evaluator component behaviour

The Experiments Evaluator executes a set of experiments to collect useful data and learn the trade-offs between data volume vs performance, data volume vs emissions, and DQ vs performance. More specifically, the Experiments Evaluator runs a set of experiments, defined as:

$$\begin{aligned} exp = {<}mod, ds{>} \end{aligned}$$
(1)

where mod is a specific DL model and ds is a dataset for training the model. For each experiment, the Experiments Evaluator runs a set of sub-experiments, changing its configurations. A sub-experiment is defined as:

$$\begin{aligned} sub\_exp = {<}exp, [conf]{>} \end{aligned}$$
(2)

where [conf] is the set of configurations to test (data volume or DQ), each one identifying a specific aspect and a specific value for that aspect (e.g., \(volume = 50\%\)). To isolate side effects, only one aspect is modified at a time, and several values are tested for each aspect. For each sub-experiment, the resulting modified dataset is used to train the model, and a set of performance metrics is evaluated and stored. The overall process is shown in Fig. 2. The component takes as input a set of models and related datasets stored in the Datasets and Models DB. The output of the component is stored in the Experiments DB as a table containing the following information for each sub-experiment:

$$\begin{aligned} exp\_res = {<}ds, mod, [conf], [perf]{>} \end{aligned}$$
(3)

where [perf] is the set of performance metrics evaluated with their assessed values. The set of metrics to evaluate depends on the task (e.g., recall, precision, accuracy, F1-score for classification tasks).
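The experiment structure of Eqs. 1–3 can be sketched as a simple evaluation loop. This is an illustrative reconstruction, not the authors' code: the class and function names, and the `train_and_eval` callable, are our assumptions.

```python
import itertools
from dataclasses import dataclass, field

@dataclass
class SubExperiment:
    """One sub-experiment: an <mod, ds> pair plus a single configuration (Eq. 2)."""
    model: str
    dataset: str
    conf: dict                          # e.g. {"volume": 0.5} or {"completeness": 0.8}
    perf: dict = field(default_factory=dict)  # filled with metrics after training (Eq. 3)

def run_experiments(models, datasets, aspects, train_and_eval):
    """For each <mod, ds> experiment (Eq. 1), vary one aspect at a time."""
    results = []
    for mod, ds in itertools.product(models, datasets):
        for aspect, values in aspects.items():
            for v in values:
                sub = SubExperiment(mod, ds, {aspect: v})
                # train the model on the modified dataset and collect metrics
                sub.perf = train_and_eval(mod, ds, sub.conf)
                results.append(sub)
    return results

# Example with a stub evaluator standing in for the actual training step:
stub = lambda mod, ds, conf: {"f1": 0.9}
res = run_experiments(["FCN"], ["GunPoint"], {"volume": [1.0, 0.9, 0.8]}, stub)
```

The stub shows the intended shape of the output table: one row per sub-experiment, each carrying its configuration and measured performance.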

The data stored in the Experiments DB have a dual use. The Results Analyzer component analyzes the impact of data volume and DQ on the model performance. It accesses the experiments with different configurations involving DQ aspects and provides a ranking of which DQ metric degradation most affects the model performance. It also assesses the carbon emission reduction achieved by each configuration to detect which data aspects most affect CO\(_2\) emissions. This information is stored in the Smart Reduction DB.

The Reduction Curves Extractor component focuses on the data volume vs performance trade-off and aggregates the information collected by the Experiments Evaluator to build a reduction curve for each experiment. An example can be seen in Fig. 3: actual data are collected only for nine configurations of the data volume, while the generated curve enables estimating the performance metric for intermediate configurations as well.

Fig. 3. Sample reduction curve for an experiment

All the activities in this phase are executed only once and aim at collecting information to enable the Data Selection Phase.

4.2 Data Selection

The Data Selection part of the architecture exploits the results collected in the previous phase to help a researcher who wants to execute a new training task reduce its environmental impact.

The most relevant component of this part of the architecture is the Regression Model Generator. At first, this component uses the data stored in the Reduction Curves DB to train a predictive regression model able to build a new reduction curve for an unseen experiment, starting from the example reduction curves contained in the DB. Once the regression model is built, it can be used to perform a prediction every time a researcher submits a new experiment.

The Regression Model Generator is the only component of this part of the architecture running partially in batch mode. All other components run interactively, providing a human-in-the-loop (HITL) approach. The researcher is, in fact, in charge of providing some preliminary information to the system. The information provided by the researcher belongs to four different categories:

  • Dataset Information: \(\mathcal{D}\mathcal{I} = {<}ds, d\_type{>}\). The researcher provides the dataset ds for training the model and specifies the data type \(d\_type\) from a list of supported types (e.g., image, sensor data, etc.).

  • Model Information: \(\mathcal{M}\mathcal{I} = {<}arch\_type, \#par{>}\). The researcher provides the features of the model to be trained, consisting of the type of architecture \(arch\_type\), selected from a list of available architectures, and the number of parameters of the model \(\#par\).

  • Baseline Execution Information: \(\mathcal{B}\mathcal{I} = {<}ds_p, perf_{val}{>}\). The researcher provides the results of a preliminary execution of the experiment using a randomly reduced dataset. More specifically, the researcher provides the tested dataset size \(ds_p\) and the performance value \(perf_{val}\) obtained with that size.

  • Performance Goal: \(\mathcal {G} = {<}perf_{metric}, perf_{val}{>}\). The researcher sets the minimum acceptable value \(perf_{val}\) for a specific performance metric \(perf_{metric}\).

The inputs provided by the researcher are used by the different components of the architecture. The dataset ds is first processed by the Dataset Features Extractor component, which performs profiling activities to extract metadata and compute DQ metrics about the dataset. The enriched dataset information \(\mathcal{D}\mathcal{I}'\) and the model information \(\mathcal{M}\mathcal{I}\) are used by the Regression Model Generator, which matches them with the parameters and configurations of its internal model and predicts a regression curve for the new experiment. With this curve and \(\mathcal {G}\), the Reduction Estimator suggests the volume of data \(\hat{p}\) that ensures \(\mathcal {G}\) while reducing energy consumption. The Dataset Reducer extracts a subset \(ds_{\hat{p}} \subset ds\) of size \(\hat{p}\), exploiting the information about the DQ metric ranking provided in the Smart Reduction DB. As an output, the researcher gets the Reduced Dataset \(ds_{\hat{p}}\), with higher DQ and lower data volume, that can be used to perform a new training with a limited environmental impact.

5 Implementation of the Architecture

The actual implementation of the architecture presented in Sect. 4 depends on the specific DL task to be addressed. To demonstrate it, we describe its implementation for the time series classification task. In this context, we can define a dataset ds as a collection of data points DP, where each data point dp is a time series consisting of L values collected over a time period.

A collection of datasets and models have been used to implement the Data Exploration Phase and stored in the Dataset and Models DB:

  • the datasets are selected from the UCR/UEA repository, consisting of over 100 datasets with different characteristics from a variety of fields;

  • three different architectures (MLP, FCN, and ResNet [24]) were used.

For the sake of simplicity, we limited our evaluation to a single performance metric, and we selected the F1-Score, the harmonic mean of precision and recall.

The experiments were run on Google Colab, on instances with an Intel(R) Xeon(R) CPU and an NVIDIA Tesla T4 GPU. Carbon emissions were measured in kgCO\(_2\)e with CodeCarbon, manually setting the execution location to Italy to reduce variability. All the code is freely available on GitHub.

5.1 Data Exploration Implementation

As described in Sect. 4.1 and depicted in Fig. 2, several experiments are executed combining the datasets and models contained in the Dataset and Models DB and storing the results in the Experiments DB. In each sub-experiment a different dataset configuration was tested, considering two aspects:

  • Data Volume: from 100% all the way down to 20%, in steps of 10%. At this stage, data points are selected randomly from the dataset;

  • DQ: injecting errors on accuracy, consistency, and completeness, from 1 to 0.2 in steps of 0.1. To obtain the dirty dataset, we apply data pollution as described in [3]: for each DQ metric and for each step, the set of data points to pollute is randomly extracted, and the data points are properly modified:

    • Accuracy: it is computed as the percentage of data points associated with a correct target value. For each of the selected data points, the target value is substituted with a different one;

    • Completeness: it is computed as the complement of the percentage of missing values in the time series composing the dataset. For each of the selected data points, values of the time series are randomly removed;

    • Consistency: it is computed as the percentage of data points that follow the consistency rule: two series with the same values must be associated with the same target value. Each of the selected data points is duplicated and a different target value is assigned to the copy.
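The three pollution procedures above can be sketched on a dataset represented as a list of (series, label) pairs. This is a minimal reconstruction under our own naming assumptions, not the actual pollution code of [3].

```python
import random

def pollute_accuracy(ds, target_level, labels, rng):
    """Replace the label of a fraction (1 - target_level) of data points."""
    ds = [list(dp) for dp in ds]
    k = round(len(ds) * (1 - target_level))
    for i in rng.sample(range(len(ds)), k):
        # always substitute a *different* target value
        ds[i][1] = rng.choice([l for l in labels if l != ds[i][1]])
    return [tuple(dp) for dp in ds]

def pollute_completeness(ds, target_level, rng):
    """Set series values to None so the missing rate is ~ (1 - target_level)."""
    out = []
    for series, label in ds:
        series = [v if rng.random() < target_level else None for v in series]
        out.append((series, label))
    return out

def pollute_consistency(ds, target_level, labels, rng):
    """Duplicate a fraction of points, assigning a different label to the copy."""
    k = round(len(ds) * (1 - target_level))
    dups = []
    for series, label in [ds[i] for i in rng.sample(range(len(ds)), k)]:
        dups.append((series, rng.choice([l for l in labels if l != label])))
    return ds + dups

# A toy dataset of 10 two-value series with binary labels:
rng = random.Random(0)
ds = [([float(i), float(i + 1)], i % 2) for i in range(10)]
dirty = pollute_accuracy(ds, 0.8, labels=[0, 1], rng=rng)
```

With `target_level = 0.8`, exactly 2 of the 10 labels are flipped; the consistency pollution instead grows the dataset by adding contradictory duplicates, matching the rule stated above.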

For each configuration, five experiments are executed to reduce noise, for a total of 1,215 experiments.

Fig. 4. Volume and carbon emissions (top) or F1-Score (bottom) trade-off

Fig. 5. Impact of different DQ dimensions on model performance

Fig. 6. Comparison of the impact on the performance of different data reduction strategies: smart removal, random removal, no removal

The Results Analyzer uses part of these data to (i) evaluate the volume vs performance and volume vs emissions trade-offs, and (ii) rank the DQ dimensions according to their impact on model performance. As an example, the results of these two analyses for three datasets and three models are shown and discussed here.

In Fig. 4, the impact of data volume on CO\(_2\) emissions (top row) can be compared to its impact on the model performance (bottom row). While the degradation in performance due to a reduced training set grows slowly, CO\(_2\) emissions follow a steeper trend, suggesting that the gain in environmental sustainability outweighs the loss in performance. Volume reduction has a limited effect on the second and third experiments. This can be due to the dataset characteristics (better class separability, which makes it easier to build a high-performance model with fewer data) and their interplay with the selected DL model and HP configuration.

Figure 5 shows the impact of DQ degradation on the resulting model performance, considering three different DQ metrics: completeness, consistency, and accuracy. Not all the metrics have the same impact, with completeness being the least relevant. A ranking of the most relevant DQ dimensions is extracted from these experiments and stored in the Smart Reduction DB.

The intuition is that removing poor-quality data improves the overall model performance. To validate this intuition, we executed experiments measuring the model performance under three different conditions: given a dataset with a percentage p of poor-quality data, (i) no data are removed; (ii) all the poor-quality data are removed (Smart Removal); (iii) the same percentage p of data is removed with a random selection. The experiments tested different percentages of affected data points for different DQ metrics. Results are shown in Fig. 6: smart removal performs similarly to or better than the other two options.

The Reduction Curves Extractor uses the Experiments DB to build a set of reduction curves modeling the trade-off between performance and volume. To build the Reduction Curves DB, 42 datasets and three models were used. The reduction curves were modeled as shown in Eq. 4:

$$\begin{aligned} F1\_Score = C_1 + C_2 \times \log (ds_p) \end{aligned}$$
(4)

where \(ds_p\) is the percentage of the original dataset to be considered, \(C_1\) and \(C_2\) are the regression parameters, and \(F1\_Score\) is the resulting model performance.
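Since Eq. 4 is linear in \(\log (ds_p)\), the coefficients \(C_1\) and \(C_2\) can be obtained with ordinary least squares on the log-transformed volume. The paper does not specify the fitting procedure, so the following is only a plausible sketch with names of our own choosing:

```python
import math

def fit_reduction_curve(points):
    """Least-squares fit of F1 = C1 + C2 * log(ds_p) (Eq. 4).

    points: iterable of (ds_p, f1) pairs, ds_p a percentage in (0, 100].
    Returns (C1, C2).
    """
    xs = [math.log(p) for p, _ in points]
    ys = [f1 for _, f1 in points]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # slope and intercept of the simple linear regression on x = log(ds_p)
    c2 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    c1 = mean_y - c2 * mean_x
    return c1, c2

# Points sampled from a known curve are recovered exactly:
pts = [(p, 0.5 + 0.1 * math.log(p)) for p in (20, 30, 50, 70, 100)]
c1, c2 = fit_reduction_curve(pts)
```

Each fitted (\(C_1\), \(C_2\)) pair, one per experiment, would then populate the Reduction Curves DB.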

Table 1. Inputs of the regression model

5.2 Data Selection Implementation

The content of the Reduction Curves DB is used by the Regression Model Generator to build a Regression Model. In our implementation, we tested several algorithms and selected the Random Forest Regression [18]. All the details about the inputs and output of this model can be seen in Table 1.

As described in Sect. 4, our methodology follows a HITL approach. The researcher is therefore expected to provide all the necessary information (\(\mathcal{D}\mathcal{I}\), \(\mathcal{M}\mathcal{I}\), \(\mathcal {G}\)) and to perform a preliminary HP search with a reduced dataset (\(\mathcal{B}\mathcal{I}\)). In our tests, we set the dataset size \(ds_p = 50\%\), since this value resulted in a good trade-off between performance and emissions in the analysed scenario. The Dataset Features Extractor extracts from the dataset the missing characteristics for the selected data type and model (as described in Table 1) and assesses DQ. Using this information, the Regression Model can be exploited to obtain the \(C_2\) coefficient for the new reduction curve. The \(C_1\) coefficient is computed from the baseline \(F1-Score\) in \(\mathcal{B}\mathcal{I}\), using Eq. 5. With this reduction curve and the performance goal \(\mathcal {G}\), the required dataset percentage is computed by the Reduction Estimator using Eq. 6. Finally, the dataset is reduced by the Dataset Reducer component by removing low-quality data first, according to the DQ dimensions ranking, until the required percentage is met: (i) for completeness, data points containing null values are removed; (ii) for consistency, data points with the same values but different target values are removed. Since data points associated with a wrong label cannot be automatically detected, no action is taken to improve accuracy unless additional information is provided. The Data Selection phase additionally allows the researcher to express preferences on the class balance of the resulting dataset: the user can decide whether to keep the same distribution or to reduce the imbalance between classes as much as possible.

$$\begin{aligned} \hat{C_1} = ReportedF1Score - \hat{C_2}\times \log (ds_p) \end{aligned}$$
(5)
$$\begin{aligned} RequiredPercentage = e^{\frac{GoalMetric - \hat{C_1}}{\hat{C_2}}} \end{aligned}$$
(6)
Fig. 7. The Data-Centric Green DL tool GUI

To ease the interaction with the researcher, we provide a prototype including a web interface (Fig. 7a). The tool increases the sustainability awareness of the researcher by estimating the emissions reduction of the approach (Fig. 7b).

6 Validation

The validation of the proposed methodology needs to focus on two aspects: (i) using the approach, the environmental impact of DL model training is reduced; (ii) the performance goals set by the researchers are met.

The majority of carbon emissions produced in a DL pipeline come from the HP search. Using a classic method for this process, N training iterations are usually performed on the full dataset changing the HP values, and the resulting best model is chosen. This paper proposes to perform this search in two steps: (i) N training iterations are performed on a reduced dataset \(ds_p = 50\%\) to generate the required input for the methodology; (ii) a final HP search is refined on the resulting reduced dataset \(ds_{\hat{p}}\) with n final iterations. If \(N>n\) by a significant amount, the carbon emissions of the new method will be lower than those of the classic method, down to half in the limit case where \(N \gg n\). The values for N and n used for the experiments were defined partly experimentally and partly from the literature [1].
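The argument above can be made concrete with a back-of-the-envelope model, assuming (as an approximation of ours, not a claim of the paper) that training emissions scale linearly with iterations times dataset fraction:

```python
def emission_ratio(N, n, p_hat, p_search=0.5):
    """Emissions of the proposed two-step HP search relative to the classic one.

    Classic: N iterations on the full dataset.
    Proposed: N iterations on the p_search fraction (50% here),
              plus n iterations on the reduced fraction p_hat.
    """
    classic = N * 1.0
    proposed = N * p_search + n * p_hat
    return proposed / classic

# With N = 100, n = 25 and a reduced dataset of 60%, emissions fall to 65%
# of the classic search; as N >> n the ratio approaches p_search = 0.5,
# consistent with the "down to half" limit case stated above.
r = emission_ratio(100, 25, 0.6)
```

Under this linear model the benefit grows with the gap between N and n, which matches the validation results reported below for \(N \gg n\) versus \(N=100;n=25\).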

In order to extensively and systematically validate the approach, the experimental data obtained in Sect. 5 were reused. The data contained in the Experiments DB were split into a training set (70% of the experiments) for training the regression model and a testing set (30% of the experiments) to simulate new experiments requested by researchers. The baseline result \(\mathcal{B}\mathcal{I}\) was obtained from one of the sub-experiments in the testing set, and the performance goal \(\mathcal {G}\) was set as the performance of the selected sub-experiment plus 5%, 10%, and 15%. Taking an extreme case, where \(N \gg n\), the emissions were reduced by around 40% in all three performance-goal cases. When using more reasonable numbers of iterations for the HP search (\(N=100;n=25\)), the reduction in emissions was closer to 15% (Fig. 8). Figure 9 shows the average error of the approach in predicting the model performance for a specific dataset volume, which turned out to be around 1.5%.

Fig. 8. Carbon emissions change due to the proposed approach

Fig. 9. Error in the satisfaction of the performance goal set by the researcher

Finally, an extra end-to-end experiment was performed to test how the researcher can reduce carbon emissions on a new and unseen DL model. This was done using the Swedish Leaf dataset, modified to have \(consistency = 0.85\) and with a performance goal set as \(\mathcal {G}: F1-Score = 0.95 \). After the first HP search with \(N = 100\), a baseline result of \(F1-Score = 0.91\) was achieved. With the proposed approach, the dataset was reduced to 68% of the original data, with a resulting consistency of 1. The new dataset was used to perform a second HP search with \(n=1\), achieving a performance of \(F1-Score = 0.961\). Table 2 compares the proposed method with the classic method in two different cases: one where the full dataset was used (with the inconsistent series present), and one with all the inconsistent data removed (85% of the dataset). The proposed method generated fewer emissions than both cases while reaching the set performance goal.

Table 2. Performance and emissions results of the Green DL compared with classic DL training on the Swedish Leaf dataset

All the experiments executed for this paper generated 6.7 kg CO\(_2\)e. Using the tool on datasets of a size similar to the ones used for development, with \(N=100; n=25\), we estimate that the generated emissions would be offset after 274 uses of the tool. This number drops to only 12 uses with \(N=1000\).

The preliminary results obtained in testing the approach have proven the relevance of data preparation for Green DL. However, the approach can be enriched by i) integrating it with existing model-centric approaches, providing a holistic view of Green DL; ii) exploiting additional data features affecting either model performance or energy consumption (e.g., data augmentation and class balancing). Finally, the approach can also be applied to other DL tasks; however, additional experiments will be needed to assess its efficiency and to automate parameter optimization in different scenarios.

7 Conclusion

Motivated by the increasing environmental impact of DL, this paper proposes a data-centric approach for reducing carbon emissions in DL training pipelines as part of what is called “Green AI”. This research is data-centric since all efforts to reduce energy usage target a more efficient use of the training data, rather than more efficient hardware or algorithms. For this, characteristics like data volume and DQ were taken into account. A general methodology, valid for any DL task, was proposed, consisting of two phases. First, a Data Exploration Phase inspects the characteristics of the data and generates a knowledge base for efficient data reduction. Second, a Data Selection Phase supports researchers in reducing their carbon emissions by operating on the training dataset. This process follows a HITL approach, in which the researcher interacts with the system, providing all the necessary information.

An implementation of the approach for the time series classification task using DNNs is provided, resulting in a prototype that researchers can use. Experimental results showed that the approach can reduce carbon emissions by up to 50%. Over time, more experimental data from new model architectures and datasets can be included, further increasing the accuracy of the predictions provided by the proposed system.

Future work will focus on testing a more extensive set of data-centric characteristics, to reduce more or in a better way the dataset. Also, the proposed system could be integrated with a location-aware deployment service, which can train models in locations with a more favourable energy mix.