Introduction

MATLAB is used to showcase a number of machine learning techniques that environmental researchers may consider incorporating into their research agenda. Regardless of a researcher’s computing abilities, machine learning provides the ability to take in large sets of data and return models relatively quickly and easily. World Health Organization data was used in this example.

Using the variables female life expectancy, continent, access to clean water in rural areas, and particulate matter concentration, machine learning algorithms displayed the ability to conduct deeper analysis for decision support. This paper is intended to introduce the techniques to get the most information out of collected data rather than to propose quantitative analyses of the data. Table 1 presents key terms used hereafter.

Table 1 Glossary of key terms

The concept of machine learning has been around for several decades. Its earliest definition describes it as a field of computer science in which computers are given the ability to learn without being explicitly programmed to do so (Samuel 1959). Machine learning has evolved drastically since first being described in the literature, with capabilities such as deep learning now available to almost anyone with a personal computer. Deep learning can uncover intricate structures in data and indicates how a machine should change its internal parameters to compute a model (LeCun et al. 2015).

Machine learning can assist any environmental researcher with a number of practical tools, including adaptability, recovery from missing data, confounding control, and causal discovery through regression learning (Pearl 2019). Some example questions to which a researcher can apply machine learning include

  • Where are the most effective places to install rain gardens to maximize stormwater capture and natural treatment? (Younos et al. 2019)

    A researcher can approach this question quite systematically. Using the rational method, where the peak discharge of stormwater is equal to the product of rainfall intensity, surface area, and a runoff coefficient, areas in a community that are prone to flooding can be quickly identified. By capturing data on flooded sites, storm parameters, water depth, and time for complete drainage, machine learning can help infrastructure agencies develop effective strategies for installing rain gardens to mitigate this issue.

  • What is the carbon footprint of a region where data availability is limited? (Habans et al. 2019)

    Understanding a community’s waste management system allows for its carbon footprint to be determined using a tool such as the Environmental Protection Agency’s Waste Reduction Model (WARM). The foundation is knowing the tonnage of every waste stream that is generated (e.g., cardboard, yard waste, rubber) and its fate (e.g., landfill, incineration, composting). Some data is not readily available or recorded, making traditional modeling adequate but with a lower confidence of prediction. With machine learning, historical data such as population, waste stream rates, transportation, and final destination can be used to “increase the probabilistic relationships recovered from incomplete data” (Pearl 2019).

  • What are the most effective ways of addressing climate change impacts which meet social objectives? (Dulal 2019)

    Climate change brings an increased risk of major events such as flooding, drought, and crop loss. One social objective to be addressed is the malnourishment of children in areas where drought is more likely to occur. Under- and malnutrition impair a child’s ability to grow, learn, and rise out of poverty (“The Social Dimensions of Climate Change” n.d.). Classification matrices can be used to help predict the areas in greatest need of having this sustainable development goal addressed.
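The rational method mentioned in the first question above can be written as Q = C · i · A, where Q is the peak discharge, C the runoff coefficient, i the rainfall intensity, and A the drainage area. The following is a minimal sketch in Python (not MATLAB, and not part of the original analysis) of how candidate sites might be ranked by estimated peak discharge. The site names, coefficients, areas, and design storm intensity are all hypothetical values chosen for illustration.

```python
# Rational method sketch: Q = C * i * A.
# With i in inches per hour and A in acres, Q is approximately in cubic feet
# per second (the unit conversion factor is about 1.008 and is usually dropped).

def peak_discharge(runoff_coefficient, intensity_in_per_hr, area_acres):
    """Return the peak stormwater discharge Q = C * i * A."""
    return runoff_coefficient * intensity_in_per_hr * area_acres


# Hypothetical candidate sites: (name, runoff coefficient C, drainage area in acres).
sites = [
    ("parking lot", 0.90, 2.0),
    ("residential block", 0.50, 5.0),
    ("park", 0.20, 8.0),
]

design_intensity = 2.0  # hypothetical design storm intensity, inches per hour

# Rank sites from highest to lowest estimated peak discharge.
ranked = sorted(
    ((name, peak_discharge(c, design_intensity, area)) for name, c, area in sites),
    key=lambda item: item[1],
    reverse=True,
)

for name, q in ranked:
    print(f"{name}: {q:.1f} cfs")
```

A ranking of this kind could serve as one input feature alongside the observed flooding data described above; it is not a substitute for a trained model.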

Review of relevant literature

Incorporating environmental data into machine learning methods will enhance researchers’ abilities to generate more robust models. A significant application of this is the forecasting of environmental events, ranging from fires and floods to land and water quality. Geologists are using multivariate data analysis through machine learning techniques to predict future occurrences of landslides more accurately from historical data and patterns (Korup and Stolle 2014). Landslide displacement prediction modeling has been performed through machine learning and causal factors (Zhou et al. 2018). Machine learning methods have identified two central landslide predictors, unfailed cliff slope angle and fault proximity, which may support management approaches that include all predictor information (Dickson and Perry 2016).

Environmental toxicity has been shown to be predictable using neural networks in cluster analysis. Research has shown that machine learning can find new knowledge that is encoded within a chemical’s molecular structure (Mayr et al. 2016). Applications such as this will allow for a decreased risk in chemical safety testing (“Prediction of human population responses to toxic compounds by a collaborative competition” Nature Biotechnology 2015).

Rain gardens are being implemented in cities as part of a green infrastructure push to help manage the effects of stormwater runoff (Liao et al. 2017; Malaviya et al. 2019). In New York City’s Jamaica Bay and Tributaries zone, over 169 million gallons of stormwater have been managed over 135 equivalent green acres of this type of green infrastructure (“NYC Green Infrastructure 2018 Annual Report” 2019). Selecting locations is tedious work due to several variables including types of trees, size of sidewalks, size of drainage area, location of public transportation, and underground utilities. Machine learning has the potential of streamlining the process.

The methods for calculating carbon footprint emissions from solid waste have a great deal of variance, resulting in a lower confidence of prediction (Robinson et al. 2018). In India, there is very “limited data available for waste generation patterns” in the largest urban areas (Joshi and Ahmed 2016). While machine learning cannot be directly used to curb a society’s waste generation, it can be used to predict waste models with limited data (Pearl 2019).

Climate change will have an impact on social objectives, particularly equity in developing nations, in the near future. In Tanzania, social equity in rural area adaptation must be addressed (Smucker et al. 2015). Social objectives in regions that cover several nations, such as the Brahmaputra River Basin in South Asia, can affect a population of tens of millions of inhabitants (Yang et al. 2016). Machine learning can provide policymakers with better guidance on where to develop hydroelectric facilities, water supply basins, and farmland.

World Health Organization environmental health data has been the subject of machine learning in other studies. Ground-based particulate matter readings were used to train a machine learning algorithm to estimate average concentrations of the air pollutant (Lary et al. 2015). These estimates were then used to demonstrate the relationship between particulate matter and mental illness-related emergency room visits in Baltimore, Maryland. The outcome for particulate matter exposure presented in this paper is instead female life expectancy; a researcher may use any relevant outcome variable. The National Institute of Environmental Health Sciences (NIEHS) has identified machine learning as a method for the analysis of multi-pollutant environmental epidemiology studies (Taylor Kyla et al. 2016).

Methodology

Data setup

The data of 162 nations was extracted from the World Health Organization (WHO) “Global Health Observatory” data repository. The study population was composed from the data made publicly available by the WHO. Each country was documented as a record along with a number of other variables, including (1) average life expectancy, (2) average life expectancies for males and females, (3) percentage of population with access to improved potable water, (4) percentage of population with access to improved sanitation, (5) average exposure concentration of PM10, (6) number of deaths per one-hundred thousand citizens, and (7) continent. Data was saved as a Microsoft Excel 2016 file.

Example hypotheses

Table 2 provides examples of machine learning techniques and a null hypothesis to demonstrate the practicality of incorporating MATLAB into common research studies. The intent is for the reader to be able to apply some to his/her own research.

Table 2 Machine learning techniques and associated null hypothesis

MATLAB

MATLAB, which stands for “Matrix Laboratory,” is a software title designed for scientists and engineers to conduct high-level research through function generation, model building, plotting data, application development, and more (Gilat 2017; Hahn and Valentine 2016). Users are able to add on a number of applications to conduct higher level mathematical analysis, including Machine Learning and Data Analysis.

Within MATLAB, the Import Data function is used to bring a data set into the session’s memory (Fig. 1).

Fig. 1 Toolbar location for data import

Regression Learner

MATLAB’s Regression Learner app is accessible from the APPS ribbon and is used to train regression models to predict data. It can also be opened directly by entering the command regressionLearner at the MATLAB command line. Using this app, users can explore their data, train models, and evaluate outcomes. Users have the ability to “perform automated training to search for the best regression model, including linear regression models, regression trees, Gaussian process regression models, support vector machines, and ensembles of regression trees” (MATLAB (version 2019a) 2019).

Once a data set is in MATLAB’s memory, with independent and dependent variables already decided upon, users can begin to conduct machine learning. Independent variables are used to train a model that generates predicted outcomes in the general form of

$$ \hat{\mathrm{y}}={\beta}_0+{\beta}_1{x}_1+{\beta}_2{x}_2+\cdots +{\beta}_n{x}_n, $$

where

ŷ is the predicted outcome,

β0 is the y-intercept of the best-fit regression line,

the β coefficients represent the change in outcome relative to their x values, and

the x variables represent distinct predictor values (Johnson and Bhattacharyya 2018).
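To make the general linear form concrete, the sketch below estimates the β coefficients by ordinary least squares, solving the normal equations with Gaussian elimination. It is a plain Python illustration, not the Regression Learner’s implementation, and the records are synthetic values generated from a known relationship rather than WHO data. The function names are illustrative only.

```python
# Ordinary least squares for y = b0 + b1*x1 + ... + bn*xn via the normal
# equations (X^T X) b = X^T y, solved with Gauss-Jordan elimination.

def fit_linear_model(rows, outcomes):
    """Return [b0, b1, ..., bn] minimizing the squared prediction error."""
    # Prepend a constant 1 to each row so that b0 acts as the intercept.
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])

    # Build the augmented normal-equation matrix [X^T X | X^T y].
    aug = []
    for a in range(k):
        line = [sum(r[a] * r[b] for r in design) for b in range(k)]
        line.append(sum(r[a] * y for r, y in zip(design, outcomes)))
        aug.append(line)

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]

    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(coefficients, row):
    """Evaluate y_hat = b0 + sum(b_j * x_j) for one record."""
    return coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], row))


# Synthetic records (not WHO data), generated from y = 1 + 2*x1 + 3*x2.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]

beta = fit_linear_model(X, y)
print([round(b, 6) for b in beta])
print(round(predict(beta, (3, 2)), 6))
```

Because the synthetic outcomes contain no noise, the fitted coefficients recover the generating values. With real data the fit is only the best approximation in the least-squares sense.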

Data can be imported later into the Regression Learner for predictions and, very importantly, continued model learning. As more representative data is added to the model, its predictions will generally become more accurate and reliable.

Classification Learner

The Classification Learner app is accessible through the APPS ribbon and is used to visualize how well variables can group together based on commonalities. Machine learning allows for the clustering of data which, in essence, divides all samples in a database into groups (Alpaydin 2014). Software algorithms locate densely populated areas of data points to find commonalities (Daniel 2013). Figure 2 displays a hypothetical data set gathered around cluster centers, represented by the four “x” symbols.

Fig. 2 Clustered classification data
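The idea of data points gathering around cluster centers can be sketched with a basic k-means loop. The source does not state which algorithm produced Fig. 2, so k-means is used here only as a familiar example, written in plain Python rather than MATLAB. The data points, which loosely imitate percentages of access to water and sanitation, and the starting centers are hypothetical.

```python
# A basic k-means loop: assign each point to its nearest center, then move
# each center to the mean of its assigned points, and repeat.

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, centers, iterations=10):
    """Return (centers, labels) after a fixed number of refinement passes."""
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: mean of the points in each cluster (an empty cluster
        # keeps its previous center).
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, labels


# Hypothetical (water access %, sanitation access %) pairs, not WHO data.
points = [(40, 30), (45, 35), (50, 25), (95, 98), (99, 97), (97, 99)]

# Start from one point in each visually apparent group (an assumption).
centers, labels = k_means(points, centers=[points[0], points[3]])

print(labels)
print(centers)
```

The result depends on the starting centers and on the number of clusters chosen, both of which are set by hand in this sketch.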

Some example questions that classification can help answer include

  • How do water quality parameters influence algae blooms?

  • Are there clusters of environmentally-related illnesses in a region where particulate matter fluctuates?

  • To what scale do small generators of toxic substances impact the environmental justice of a community? (Collins et al. 2016)

  • What zip codes present the greatest impact on stormwater runoff?

Results

Regression Learner outcomes

Eight independent variables were used in this machine learning session for the prediction of the outcome variable, “Female Life Expectancy”. These variables were “continent,” “PM10,” “Rural Water,” “Urban Water,” “Total Water,” “Rural Sanitation,” “Urban Sanitation,” and “Total Sanitation.” The null hypothesis that “female life expectancy is not influenced by environmental factors” was tested with this technique.

The model chosen through machine learning was the Medium Gaussian Support Vector Machine, which is shown along with a number of other model outcomes in Table 3. The coefficient of multiple determination (R²) was highest for linear regression Model 1.5 (0.83). This implies that 83% of the variability in female life expectancy can be explained by this linear model. The plot of this model is represented in Fig. 3.
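The coefficient of determination can be computed from observed and fitted values as one minus the ratio of the residual sum of squares to the total sum of squares. The short Python sketch below illustrates that calculation. The observed and fitted values are hypothetical and are not taken from Table 3.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Hypothetical observed and fitted life expectancies in years (not WHO data).
observed = [60.0, 70.0, 80.0, 90.0]
fitted = [62.0, 69.0, 81.0, 88.0]

print(round(r_squared(observed, fitted), 3))
```

A value near 1 means the fitted values track the observed ones closely, while a value near 0 means the model explains little more than the mean of the observations does.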

Table 3 Model outcomes from regression learning

Fig. 3 Observed versus fitted responses of female life expectancy

As this model is studied in more detail, the machine learning computes that only three input variables are required to predict female life expectancy (LXF): continent, access to water in rural areas, and particulate matter concentration.

The linear model may be explained as:

$$ \hat{\mathrm{y}}={\beta}_0+{\beta}_1\mathrm{Continent}+{\beta}_2\mathrm{Access}\ \mathrm{to}\ \mathrm{Water}\ \mathrm{in}\ \mathrm{Rural}\ \mathrm{Areas}+{\beta}_3 PM10 $$

Generated unadjusted (crude) linear and quadratic models are presented in Figs. 4 and 5. For Fig. 4, the linear model may be explained as:

$$ \hat{\mathrm{y}}={\beta}_0+{\beta}_1\mathrm{Access}\ \mathrm{to}\ \mathrm{Water}\ \mathrm{in}\ \mathrm{Rural}\ \mathrm{Areas}, $$

while Fig. 5 may be explained by

$$ \hat{\mathrm{y}}={\beta}_0+{\beta}_1 PM10 $$

Fig. 4 Linear and quadratic models for female life expectancy as predicted by access to water in rural areas

Fig. 5 Linear and quadratic models for female life expectancy as predicted by particulate matter concentration

As the share of the population with access to clean water increases, average life expectancy increases (Fig. 4, positive slope), whereas as the concentration of particulate matter increases, average life expectancy decreases (Fig. 5, negative slope).

Linear SVM was the postulated model for this technique but the machine learning has provided a stronger one in a Medium Gaussian SVM model. Linear SVM is capable of separating data with a straight-line vector while the Gaussian SVM method uses an additional dimension. The differences between the two SVMs are presented in Figs. 6 and 7.

Fig. 6 Linear SVM separating data through a best-fitting line

Fig. 7 Gaussian SVM separating data in an extra dimension
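The notion of separating data in an extra dimension can be illustrated with a deliberately simplified example. A Gaussian SVM relies on the kernel K(x, x′) = exp(−γ‖x − x′‖²), which it applies implicitly. The Python sketch below instead adds a single explicit Gaussian feature measured from one assumed reference point, which is enough to show the idea but is not how an SVM is trained. The toy data, the center, and the value of gamma are hypothetical.

```python
import math

# One-dimensional toy data (hypothetical): class "inner" sits between the
# "outer" points, so no single cut point on x separates the two classes.
points = [(-3.0, "outer"), (-2.0, "outer"), (-0.5, "inner"),
          (0.0, "inner"), (0.5, "inner"), (2.0, "outer"), (3.0, "outer")]


def gaussian_feature(x, center=0.0, gamma=1.0):
    """Gaussian similarity to a reference point: exp(-gamma * (x - center)^2)."""
    return math.exp(-gamma * (x - center) ** 2)


def separable_by_threshold(values, labels):
    """True if some cut point puts each class entirely on its own side."""
    ordered = sorted(zip(values, labels))
    switches = sum(1 for a, b in zip(ordered, ordered[1:]) if a[1] != b[1])
    return switches == 1


xs = [x for x, _ in points]
labels = [label for _, label in points]
zs = [gaussian_feature(x) for x in xs]  # the added dimension

print(separable_by_threshold(xs, labels))  # original axis
print(separable_by_threshold(zs, labels))  # Gaussian feature axis
```

On the original axis the classes interleave, while on the Gaussian feature axis the inner points take large values and the outer points take small ones, so a single threshold divides them.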

Classification Learner outcomes

Additional testing was conducted through machine learning to evaluate how well a continent can be classified from the data provided within the female life expectancy, access to water in rural areas, and particulate matter concentration variables. Figure 8 presents the confusion matrix for the classification through machine learning. Six continents/regions (AF = Africa; AS = Asia; EU = Europe; NA = North America; OC = Oceania; and SA = South America) are represented as true classes in rows and predicted classes in columns.

Fig. 8 Confusion matrix for the classification data by continent

This output shows a true positive rate of 80% for Africa: from the values of the three predictor variables, 80% of the African data was correctly classified as Africa. The remaining 20% constitutes false predictions. Classifying other continents based on the input factors is less reliable for this model. Seven percent of the African data was falsely identified as Asia.
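A confusion matrix of this kind, with true classes in rows and predicted classes in columns, can be tallied directly from paired labels. The Python sketch below shows the bookkeeping and the per-class true positive rate. The ten labels are hypothetical and do not reproduce the results in Fig. 8.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def true_positive_rates(matrix):
    """Share of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Hypothetical labels for ten records (not the study's results).
classes = ["AF", "AS", "EU"]
true_labels = ["AF", "AF", "AF", "AF", "AF", "AS", "AS", "EU", "EU", "EU"]
predicted = ["AF", "AF", "AF", "AF", "AS", "AS", "EU", "EU", "EU", "AF"]

matrix = confusion_matrix(true_labels, predicted, classes)
rates = true_positive_rates(matrix)
for name, row, rate in zip(classes, matrix, rates):
    print(name, row, f"{rate:.0%}")
```

Reading along a row shows where the records of one true class ended up, which is how a misclassification such as Africa being predicted as Asia appears as an off-diagonal count.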

The hypothesis for this technique was “machine learning will be able to classify a continent at a sixty percent success rate by using three predictor variables.” Data from African (80%) and European (72%) nations supported this hypothesis, but the model did not reach this rate for the other continents. The result for Oceania may be explained by its having fewer nations with data for the machine to learn robustly. Data similarities between North America, South America, and Asia likely resulted in collinearity during learning.

Classification was conducted using the percentages of access (1) to clean drinking water and (2) to improved sanitation to find commonalities between continents. Figure 9 shows how the data results in clustering among African nations as opposed to the greater variability of European and North American countries.

Fig. 9 Classification clustering between Africa, Europe, and North America

Conclusions

The main goal was to demonstrate readily available machine learning modeling techniques using historical environmental health data (Feldman et al. 2016). This was not an attempt to propose a fully functioning model for predicting female life expectancy from environmental infrastructure and pollution exposures. Providing researchers with an introduction to how they may use machine learning in their own research is the priority of this research brief.

MATLAB is a powerful software title that is often provided to research faculty free of charge through their institution’s academic license. Several add-ons enable research projects to be enhanced while having data evaluated quickly and accurately. A few basics of machine learning have been presented which can be easily incorporated into a researcher’s toolbox with some practice. World Health Organization environmental health data was used to demonstrate some techniques.

Regression learning displayed the ease of importing a dataset into MATLAB’s memory and then defining which variables are the predictors and which is the predicted outcome. Support vector machine learning allowed a number of models of the selected data to be run and the best one to be identified quickly.

Classification of data was demonstrated through the confusion matrix. Through the analysis of a number of variables, the algorithm can learn what outcomes may be in future analyses. This case showed that there was an 80% chance of correctly classifying African data from the predictor variables used in this example. Clustering analysis was used to demonstrate how the software can group data points together by similar characteristics into clusters and then show relative distances between all identified clusters.

Example research questions were provided in which machine learning can be applied for deeper understanding of issues. Clustered classification can be used to determine locations for the installation of rain gardens to help mitigate flooding. Carefully chosen variables, such as latitude/longitude, may provide the ability to overlay graphed clusters on a map to identify where rain gardens would be most efficient. Regression learning would be a practical technique to model the carbon footprint of a community where there is incomplete municipal waste data. The regression line will become more significant as more data is put into a database for the model to learn. The confusion matrix can be used to predict if an underserved community is experiencing a lack of equity based upon the variables incorporated into the machine learning environment.

Environmental science is experiencing an increase in the number of studies using machine learning (Best et al. 2018; Fuchs et al. 2019; Feldman et al. 2016; Gómez-Losada et al. 2018; Griffin et al. 2019; Sung Kyun et al. 2017; Hoos et al. 2019). Machine learning will account for uncertainties in environmental science studies, allowing for an increase in credibility of future results (Stritih et al. 2019). Incorporating some of these tools and techniques can enhance our decision-making and support for our planet’s needs (Hill et al. 2019; Tasdighi et al. 2018).