Introduction

Harmful algal blooms (HABs) are arguably the greatest threat to inland water quality, public health, and aquatic ecosystems (Finnis et al. 2017; Cruz et al. 2021). HABs are particularly damaging for the aquaculture industry, causing extensive disruption to operations. The effects on farmed finfish include physical interference, deoxygenation, and ichthyotoxicity (Davidson et al. 2021). Recently, large blooms of the common alga Chrysochromulina leadbeateri killed 8 million salmon in ocean net pens in Norway (Karlson et al. 2021). Shellfish farmers are potentially more exposed to HAB effects since toxins consumed by the filter feeders are bioaccumulated and bioamplified throughout the food web, affecting humans and other organisms (Van Dolah 2000). Several diseases are associated with shellfish toxins, including paralytic shellfish poisoning (PSP), diarrhetic shellfish poisoning (DSP), and amnesic shellfish poisoning (ASP) (Basti et al. 2018).

This significant risk to human health has led to rigorous monitoring and management frameworks (Trainer and Hardy 2015; Trainer 2020; Harley et al. 2020; Legleiter et al. 2022). Monitoring of phytoplankton communities in the marine environment is required by European Union (EU) legislation, such as the Marine Strategy Framework Directive (MSFD, 2008/56/EC and 2017/845/EC) and the Water Framework Directive (WFD, 2000/60/EC), as well as by the national laws of the member states (Garmendia et al. 2013). If the concentration of shellfish biotoxins exceeds threshold levels, harvesting restrictions are applied until toxins are reduced (Davidson et al. 2021).

Monitoring and early warning of harmful algal blooms is of critical importance for aquaculture industry stakeholders and has consequently been the focus of much research effort (O’Donncha and Grant 2019). Approaches include comprehensive in-situ monitoring programs, the use of remote sensing products, and sophisticated computer modeling that can detect and forecast blooms (e.g., Wynne et al. (2020); Hardison et al. (2019)). Traditionally, computationally expensive, large-scale, physics-based models have been used to simulate algal concentrations (or a suitable proxy). Their limitations include the high degree of user skill required and the difficulty of parameterizing them across different geographical regions (McGillicuddy 2010). Instead, the task of forecasting HAB events is ideally suited to a data-driven machine learning approach, especially since neither algae species data nor site-specific hydrodynamic/thermodynamic data are required.

This paper details a machine learning framework to forecast toxin events at a shellfish aquaculture site in Portugal. The objective is to develop an early warning system that allows farmers to anticipate closures of shellfish sites due to elevated toxin levels in shellfish flesh. A forecasting model was trained using historic data on site closures from the Portuguese Institute for Sea and Atmosphere (IPMA) (Portuguese Institute for Sea and Atmosphere 2022) together with environmental variables. The contributions of the paper are as follows:

  • A transferable framework based on ocean data and AutoAI models to provide early warning of toxin events.

  • A robust feature engineering approach that is amenable to the complexities of ocean time series datasets.

  • Evaluation of the approach on data from an operational shellfish site in South-West Portugal.

Related work

Early warning and accurate forecasting of algal blooms hold significant importance for public health organizations, fish farmers, tourist organizations, and various other stakeholders. As a result, substantial research efforts are dedicated to developing operational forecasting models capable of accurately capturing the intricacies of algal bloom dynamics.

Numerous algal forecasting models rely on numerically resolving the complex processes involved in the formation and development of algal blooms. This entails understanding the circulation patterns of the ocean and modeling the kinetics of individual algae species. Consequently, this scientific challenge is notoriously difficult (Roiha et al. 2010).

Some studies simplify the problem by focusing solely on circulation patterns, treating algae as passive particles transported by currents (Pinto et al. 2016). In contrast, more comprehensive approaches take into account both circulation and biology. One notable example is the Gulf of Mexico Harmful Algal Bloom Operational Forecast System (GOMX HAB-OFS), which provides a 10-day forecast of the “red tide” caused by Karenia brevis. This system combines a Regional Ocean Modeling System circulation model with a biological sub-model that considers cyst germination, cell growth rates, and other factors (Kavanaugh et al. 2013). By integrating satellite imagery, in-situ monitoring, and hydrodynamic modeling, the system estimates the current extent and intensity of the bloom, predicts its trajectory, and forecasts concentration levels.

A similar framework is employed to forecast cyanobacteria blooms in Lake Erie (Wynne et al. 2011). However, a primary challenge with these approaches is the computational expense involved in implementing them at high resolution across large spatial and temporal scales, often requiring high-performance computing facilities (O’Donncha et al. 2020). Additionally, configuring and parameterizing these models is a complex task typically requiring extensive expertise (O’Donncha et al. 2015).

To overcome these limitations, recent years have witnessed a growing interest in utilizing machine learning (ML) to develop cost-effective approximations or surrogates of physics-based models (Lary et al. 2004; Ashkezari et al. 2016; James et al. 2018; O’Donncha et al. 2019). Traditionally, data mining approaches have been employed to identify areas prone to algal blooms by determining influential features in bloom formation (Gokaraju et al. 2011; Chau and Muttil 2007). However, more recent attention has shifted towards ML methods that employ regression-based approaches to forecast future algal bloom concentrations.

Park et al. (2015) implemented artificial neural network (ANN) and support vector machine (SVM) models to predict Chl-a concentration in the Juam and Yeongsan Reservoirs. Using weekly measurements of water quality (Chl-a, phosphate phosphorus, ammonium nitrogen, nitrate nitrogen, and water temperature) and meteorological data (solar radiation and wind speed) over a 7-year period as input data, both the SVM and ANN made seven-day-ahead predictions of Chl-a concentrations. While these models indicated good predictive skill, they relied on difficult-to-collect water-quality observations that are not widely available.

Zhang et al. (2016) implemented a five-layered neural network to forecast algal blooms in the coastal waters of East China. The model was trained on 4 years of observed water-quality variables, including temperature, salinity, pH, Chl-a, chemical oxygen demand, dissolved oxygen, phosphate, acid nitrate, nitrate, ammonia nitrogen, and silicate, to forecast phytoplankton density as a proxy for algae formation.

In their study, Lee and Lee (2018) examined the performance of a machine learning model (multilayer perceptron, MLP) and deep learning models (recurrent neural network, RNN; and long short-term memory, LSTM) in forecasting Chl-a (as a proxy for algal blooms) in four rivers in South Korea. The models were applied to 16 monitoring stations, and the results revealed that ordinary least squares regression outperformed the neural network models at five locations. Among the neural network models, MLP and RNN achieved the lowest root mean square error (RMSE) at four and three locations, respectively, while LSTM performed best at four stations. However, a more recent study by Wolff et al. (2020) reported that simpler models such as generalized additive models (GAM) or random forests (RF) performed better than more sophisticated approaches like LSTM.

Fernandes-Salvador et al. (2021) conducted a comprehensive review study on harmful algal bloom (HAB) issues in European Atlantic waters and the status of early warning systems. They categorized the early warning systems into five major types: industry alert “bulletin” reports, particle tracking-based systems, statistical models based on remote sensing, statistical and machine learning models based on the fusion of multiple data sources, and mechanistic full-low trophic ecosystem models. These systems often combine multiple methods and are typically communicated to end users through expert interpretation.

CoastObs (2023) offers a HAB forecasting service that predicts the probability of a toxic bloom causing the closure of a production area. Their system employs a combination of advanced and basic models. However, they highlight the requirement of measuring nutrient concentrations, which is currently not available through existing monitoring programs or publicly accessible data. We believe that this paper is the first to predict the likelihood of farm closure based on publicly available data.

Methodology

Study site

The site for this study is located at the intersection of the west and south coasts of Portugal near Sagres (\(37^\circ \) 00′ N, \(8^\circ \) 53′ W). This coast has a narrow continental shelf that descends rapidly to depths of over 1000 m at the continental slope (Fig. 1). There are no perennial rivers, but the area is affected by coastal upwelling events induced by northerly winds, occurring mostly from early spring to late summer. The cold, nutrient-rich water from these events stimulates the high primary productivity that has enabled the development of offshore aquaculture for bivalves. More detailed information about this region can be found in the following articles and references therein (Cravo et al. 2010; Krug et al. 2017; Danchenko et al. 2019; Santos et al. 2021; Danchenko et al. 2022; Icely and Fragoso 2023).

Since 1985, IPMA (the Portuguese Institute for Sea and Atmosphere) has been responsible for the national monitoring program for HABs to identify potential levels of biotoxins in shellfish (Silva et al. 2016). With a funding contribution from the EU through the ASIMUTH project (Applied Simulations and Integrated Modeling for the Understanding of Toxic and Harmful Algal Blooms), IPMA has developed a weekly bulletin that provides information on the closure status of shellfish harvesting areas throughout Portugal (Silva et al. 2016; Fernandes-Salvador et al. 2021). The shellfish areas have been divided into specific zones; in the case of the Sagres site, the zone is L7c (Figure 1 from Danchenko et al. (2019)). However, the L7c zone was split at the end of 2018 into L7c1 and L7c2; the Sagres site is within L7c1 (Portuguese Institute for Sea and Atmosphere 2022). Thus, for the analysis in this paper, closures over a period of 7 years were considered: for zone L7c from August 2014 until the end of 2018, and subsequently for L7c1 up to May 2021.

Fig. 1

Study site at the offshore mussel farm at Sagres, SW Iberia (Portugal); distances from IPMA phytoplankton monitoring stations (figure from Danchenko et al. (2019))

Input data

The presence of toxins in shellfish flesh is regularly monitored by IPMA. When concentrations exceed threshold levels, the shellfish farm is closed until values return to acceptable limits. For each of PST, AST, and DST, the legal thresholds are \(800\,\mu \text {g}\), \(20\,\text {mg}\), and \(160\,\mu \text {g}\) equivalent per kg of shellfish, respectively (Portuguese Institute for Sea and Atmosphere 2023).

Table 1 Summary of the data sources used in this study and their respective resolution
Fig. 2

Correlation matrix of all variables considered for this study. The name of each variable is denoted on the axes, while the correlation between each variable pair is provided in the matrix plot

Data on licensed area closures have been provided by IPMA since August 2014, giving daily information on whether a site is open or closed.

The environmental data used for this study consisted of ocean product data from the Copernicus Marine Service model repository (Tonani et al. 2019) and weather data from the IBM Environmental Intelligence Suite (Villali 2021). While data were collected or generated at different resolutions, we resampled all variables to daily intervals to correspond to site closure values. Table 1 details the data sources and resolutions used for this study, while Figs. 2 and 3 provide a summary of the variables.

The data utilized in this research are publicly accessible. Although we are not authorized to distribute the weather data, a free API key can be acquired from the supplier to retrieve them. Following a complimentary registration, Copernicus ocean product data can be obtained from the marine services portal. Additionally, the full source code for downloading the data, implementing the machine learning models, and executing the analysis has been publicly released at: https://github.com/fearghalodonncha/habs_ml/.

Fig. 3

Plot of the data distribution for all variables. The state variable indicates whether the aquaculture site is open (0) or closed (1) due to elevated levels of biotoxins

Machine learning

Classical works in machine learning and optimization introduced the “no free lunch” theorem (Wolpert and Macready 1997), demonstrating that no single machine learning algorithm can be universally better than any other across all domains; in effect, one must try multiple models and find the one that works best for a particular problem. Selection of the most suitable algorithm and algorithmic settings is one of the most complex aspects of machine learning applications and is highly dependent on user skill. A powerful approach to selecting optimal algorithms is automated machine learning (AutoML) frameworks that aim to learn how to learn (Drori et al. 2018). AutoML tools use a variety of techniques, such as differentiable programming, tree search, evolutionary algorithms, and Bayesian optimization, to find the best machine learning pipelines for a given task and dataset (Drori et al. 2018). We used the open-source Lale Python library (Hirzel et al. 2019) for automated machine learning to simplify model development. Lale is designed to automatically select algorithms, tune hyperparameters, and explore pipeline topologies from a set of available preprocessors and machine learning algorithms suggested by the user.
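
To make this concrete, the following is a minimal sketch of how such a search space (cf. Fig. 4a) can be declared in Lale; the search settings (max_evals, F1 scoring) and variable names are illustrative assumptions rather than the exact configuration, which is available in the released repository.

```python
# Minimal sketch of a Lale search space (settings are illustrative).
from lale.lib.lale import Hyperopt, NoOp
from lale.lib.sklearn import (PCA, GradientBoostingClassifier, MLPClassifier,
                              RandomForestClassifier, StandardScaler)
from lale.lib.xgboost import XGBClassifier

# Optional scaling, optional PCA, then a choice of four estimators;
# ">>" pipes operators, "|" declares an algorithmic choice.
planned = (
    (StandardScaler | NoOp)
    >> (PCA | NoOp)
    >> (RandomForestClassifier | GradientBoostingClassifier
        | XGBClassifier | MLPClassifier)
)

# Lale jointly searches the pipeline topology, algorithm choice, and
# hyperparameters; X_train/y_train are the assumed training arrays.
trained = planned.auto_configure(
    X_train, y_train, optimizer=Hyperopt, cv=10, max_evals=100, scoring="f1"
)
```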

The algorithms considered for this study were Random Forest (RF), Gradient Boosting (GradBoost), XGBoost (XGB), and MultiLayer Perceptron (MLP). The first three models are from the classification and regression tree (CART) family, based on the aggregation of a large number of decision trees. Decision trees are a conceptually simple yet powerful prediction tool that breaks a dataset down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The resulting intuitive pathway from explanatory variables to outcome provides an easily interpretable model. The combination of simplicity, robustness, and interpretability makes these methods extremely popular in machine learning.

In RF (Breiman 2001), each tree is a standard Classification or Regression Tree (CART) that uses what is termed node “impurity” as a splitting criterion and selects the splitting predictor from a randomly selected subset of predictors (the subset is different at each split). Each node in the regression tree corresponds to the average of the response within the subdomains of the features corresponding to that node. The node impurity gives a measure of how badly the observations at a given node fit the model. In regression trees, this is typically measured by the residual sum of squares within that node. Each tree is constructed from a bootstrap sample drawn with replacement from the original data set, and the predictions of all trees are finally aggregated through majority voting (Boulesteix et al. 2012). RF is especially popular for its strong performance with little hyperparameter tuning (i.e., works well with the default values specified in the software library).

GradBoost and XGBoost primarily differ from RF in how the decision trees are built. While both share many characteristics and advantages with RF (namely interpretability, predictive performance, and simplicity), a key difference facilitating the performance gains of boosting methods is that decision trees are built sequentially rather than independently. This allows each new tree to help correct the errors made by previously trained trees (Breiman 1997).

The XGBoost algorithm was developed at the University of Washington in 2016 and since its introduction has been credited with winning numerous Kaggle competitions and being used in multiple industry applications. XGBoost provides algorithmic improvements such as a sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning, together with optimization towards distributed computing, to build a scalable tree boosting system that can process billions of examples (Chen and Guestrin 2016).

The multi-layer perceptron (MLP) model on the other hand comes from the neural network family and is loosely based on the anatomy of the brain. Such an artificial neural network is composed of densely interconnected information-processing nodes organized into layers. The connections between nodes are assigned “weights,” which determine how much a given node’s output will contribute to the next node’s computation. During training, where the network is presented with examples of the computation it is learning to perform (i.e., likelihood of HAB event), those weights are optimized until the output of the network’s last layer consistently approximates the training data set (i.e., correctly predicts a shellfish site closure) (Hornik et al. 1989). A significant drawback of neural network methods is that the explainability of predictions is challenging, particularly as the size of the network increases.

Model setup and training

Since in-situ and remote sensing data did not provide contiguous coverage over the entirety of the study period, we used ocean model data from the Copernicus Marine Service repository (Tonani et al. 2019) and weather data from the IBM Environmental Intelligence Suite historical reanalysis product (Villali 2021). Leveraging publicly available data as features also serves to enhance the generalizability of the approach. The framework presented here can easily be applied to any other region in Europe (or globally, wherever high-resolution ocean data are available). Data were resampled to daily values and combined with our label data indicating whether the shellfish site was open (denoted zero) or closed (denoted one). Label data from IPMA were available from August 2014; hence, our study covered the period 05/08/2014–01/05/2021. The day-of-year was included as a feature to represent temporal variations in regional HAB developments. To augment the model’s ability to assimilate temporal dynamics, we encoded these features using trigonometric functions, thereby capturing their inherent cyclical nature. The final design matrix had dimensions [2492, 20] (days × features), with a corresponding label vector of length 2492.
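
For illustration, a minimal sketch of the trigonometric encoding of day-of-year is shown below, assuming a pandas DataFrame df indexed by date (the column names are illustrative):

```python
import numpy as np

# df is assumed to be a pandas DataFrame with a DatetimeIndex of daily values.
doy = df.index.dayofyear

# The sine/cosine pair places 31 December next to 1 January, preserving
# the cyclical nature of seasonal HAB dynamics.
df["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
df["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)
```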

Figure 2 displays the correlation matrix for the data. The label data reporting when the shellfish site is designated open or closed is denoted by the state variable, and its correlation with the 22 selected explanatory variables is presented. Naturally, the degree of correlation between constituents varies. The highest correlations are reported for physical variables such as temperature and velocity, biogeochemical variables such as net primary production, and temporal variables such as day-of-year. Figure 2 shows that correlation exists between some of these variables; it can be considered the task of the machine learning model to extract the characteristics of these relationships and use this information to craft a predictive model.

Data were split into a training and validation dataset (80%) and a testing dataset (20%). To avoid data leakage, the testing dataset comprised a contiguous period from 28/08/2017 to 01/01/2019, with the remainder retained for training. Data leakage in time series forecasting can occur when temporal dependencies are not respected during the train-test split: if the data are shuffled, or the test set includes data that comes before the training set, leakage and unrealistically optimistic performance estimates can result.

Further, the selected testing period contained a relatively high number of closures, which allowed a more comprehensive evaluation.
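
A minimal sketch of such a contiguous, leakage-free split is given below; the DataFrame df and the state column (the open/closed label) follow the naming used in Figs. 2 and 3, and are otherwise illustrative assumptions:

```python
# Hold out a contiguous block (dates from the text) rather than shuffling,
# so that temporal dependencies are respected.
test_mask = (df.index >= "2017-08-28") & (df.index < "2019-01-01")
train_df, test_df = df[~test_mask], df[test_mask]

X_train, y_train = train_df.drop(columns="state"), train_df["state"]
X_test, y_test = test_df.drop(columns="state"), test_df["state"]
```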

Figure 3 indicates that the shellfish site was closed on 497 of 2353 days (21%). Post-splitting, the training dataset contained 326 closures from 1883 days (17%), while the test set contained 171 from 470 (36%). Imbalanced data can lead to model performance being skewed towards the majority class (by simply learning to output the majority class). Our selected split penalizes such a tendency by ensuring the test dataset is not dominated by the majority class. Importantly, by focusing on selecting a “difficult” test dataset, we avoid potentially biasing model results.

To address imbalances in the training data, we adopted a statistical approach termed data sampling. Fundamentally, this involves generating new data points for the minority class (up-sampling) and/or removing data points from the majority class (down-sampling). Implementing data sampling requires consideration of different combinations of up-sampling and down-sampling to improve model accuracy. Up-sampling proceeded by sampling points from the minority class with replacement (i.e., making copies of points from the minority class), while down-sampling involved deleting points from the majority class (He and Ma 2013). Adjusting the class balance does not introduce new information but simply adjusts the ratio, either by replicating the minority class or by discarding some of the majority class. Importantly, the sampling techniques were applied only to the training data; the test data were untouched to provide a comprehensive evaluation. The optimal resampling rate is a hyperparameter to be selected as part of model training.

Re-sampling provides a data-centric approach to address model inaccuracy. An alternative model-centric approach is based on the classic paradigm of combining multiple model predictions into a single classifier.

Ensemble learning is based on a relatively simple concept: reduce the model error by combining multiple “weak” learners. By combining different, diverse models, the expected error of the group reduces, while each individual model remains unchanged. Adding more models to the ensemble improves accuracy, on the condition that each learner is uncorrelated with the others. Of course, this is increasingly difficult to achieve and is subject to the law of diminishing returns (Kyriakides and Margaritis 2019).

Two main categories of ensemble classification exist: voting, and stacking (or blending). Voting, as the name implies, refers to techniques that allow models to vote in order to produce a single answer; the most popular (most voted for) answer is selected as the winner. The result of the vote can be decided based on the number of votes alone (hard voting) or on both the number of votes and the probability returned by each ML model (soft voting). Stacking, on the other hand, refers to methods that utilize a model (the meta-learner) to learn how to best combine the base learners’ predictions. Although stacking entails the generation of a new model, it does not affect the base learners; instead, the new model aims to learn which combination of the base forecasts provides the best estimate.
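
Both categories can be sketched with scikit-learn as follows; the estimator settings are illustrative defaults rather than the tuned pipelines reported later:

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

base = [
    ("rf", RandomForestClassifier()),
    ("gb", GradientBoostingClassifier()),
    ("xgb", XGBClassifier()),
    ("mlp", MLPClassifier()),
]

# Soft voting averages the class probabilities of the base learners;
# voting="hard" would instead count the majority of predicted labels.
voter = VotingClassifier(estimators=base, voting="soft")

# Stacking fits a meta-learner on the base learners' predictions.
stacker = StackingClassifier(estimators=base,
                             final_estimator=LogisticRegression())
```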

The feature transformations considered were standardization and principal component analysis (PCA). Standardization is a scaling technique in which values are centered around the mean with unit standard deviation. PCA is a data-driven modeling technique that transforms a set of correlated variables into a smaller set of uncorrelated variables while retaining most of the original information.
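
A minimal sketch of the two transformations chained with a scikit-learn Pipeline follows; retaining 95% of the variance in PCA is an illustrative choice:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

transform = Pipeline([
    ("scale", StandardScaler()),      # center to zero mean, unit standard deviation
    ("pca", PCA(n_components=0.95)),  # keep components explaining 95% of variance
])
```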

Model testing and evaluation

Model testing considered the ability of the model to accurately forecast the likelihood of site closure at the shellfish site due to an algal bloom event. A number of standard performance metrics are commonly used in classification models. True positives (TP) and true negatives (TN) report events that the model correctly predicts. Conversely, false positives (FP) and false negatives (FN) are events the model misdiagnoses. Oftentimes, an FN is the most damaging since the site suffers an unexpected toxin event.

Common classification model skill metrics include accuracy, precision, and recall. Accuracy (ACC) is defined as the number of correct predictions divided by the total number of predictions (\(ACC=\frac{TP + TN}{N}\)); precision or positive predictive value (PPV) is the fraction of positive predictions that are correct (\(PPV = \frac{TP}{TP + FP}\)); recall or sensitivity quantifies the proportion of actual closures that the model correctly predicts (\(Recall = \frac{TP}{TP + FN}\)).

Precision and recall are often in conflict and one must consider both metrics and how they contribute to the forecasting requirements. Recall provides a measure of the number of events that are correctly diagnosed, while precision measures the proportion of events flagged by the model that were correctly classified.

The F1 score is an evaluation metric that combines precision and recall and can be expressed as:

$$\begin{aligned} F1 = 2 \cdot \frac{\text {Precision} \cdot \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
(1)

The F1 score is less influenced by TN than accuracy. In many applications, TN does not have significant business implications, whereas FN and FP often have operational implications.
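
These metrics can be computed directly from the test-set predictions; a minimal sketch with scikit-learn, where y_test and y_pred denote the assumed held-out labels and model predictions, is:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Counts of TN, FP, FN, TP from the cross-tabulation of observed vs. predicted.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print(f"ACC={accuracy_score(y_test, y_pred):.2f}",
      f"PPV={precision_score(y_test, y_pred):.2f}",
      f"Recall={recall_score(y_test, y_pred):.2f}",
      f"F1={f1_score(y_test, y_pred):.2f}")
```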

Model evaluation adopted 10-fold cross-validation, a technique in which the model is trained using a subset of the dataset and then evaluated on the complementary subset, with the process repeated across folds.
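
As a brief sketch, cross-validated skill for a candidate pipeline (here the voter defined earlier; scoring on F1 is an assumption) can be computed as:

```python
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation of a candidate pipeline, scored on F1.
scores = cross_val_score(voter, X_train, y_train, cv=10, scoring="f1")
print(scores.mean(), scores.std())
```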

Table 2 Hyperparameters and their ranges used for model design
Fig. 4

Using Lale, we pass a variety of options related to feature transformation and algorithmic selection to the meta model. Panel (a) presents the pipeline to select the best-performing individual model, while (b) presents the ensemble learning preprocessing and training process. The first box explores the effects of scaling the data to unit standard deviation; the second box considers whether PCA provides valuable transformations of the data; finally, the third box considers four different machine learning algorithms to identify the most appropriate algorithm or combination thereof. Lale selects the pipeline that optimizes model skill using 10-fold cross-validation

Results

The model was trained using the Lale semi-automated machine learning library (Hirzel et al. 2019). In short, a number of algorithms were interrogated by the Lale library, which simplifies model setup for data scientists by searching over possible choices of hyperparameters, algorithms, and data standardization schemes. The models considered were RF, GradBoost, XGB, and MLP, which are described in detail in the “Machine learning” section. Figure 4a summarizes the feature transformations and algorithms provided as options to the learning algorithm, while Table 2 details the hyperparameter ranges provided as inputs to the Lale search routines.

The experimental approach involved investigating various sampling techniques to address data imbalance. A grid search strategy was employed, exploring different values of up-sampling and down-sampling within the range of 0 to 1, with an incremental step size of 0.05. The imblearn library was utilized for data resampling. After evaluating the different options, the optimal model performance was achieved with an up-sampling rate of 0.5 and a down-sampling rate of 0.6. That is, the minority class was up-sampled until its number of instances reached 50% of the majority class, and the majority class was then down-sampled until the minority-to-majority ratio reached 60%. Consequently, the ratio between the majority and minority classes in the training data changed from 1557:326 before resampling to 1296:778 after processing.
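
A minimal sketch of this two-stage resampling with the imblearn library is shown below; the random seed is illustrative, while the 0.5 and 0.6 rates are those reported above:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

resampler = Pipeline(steps=[
    # Up-sample the minority (closed) class to 50% of the majority class.
    ("up", RandomOverSampler(sampling_strategy=0.5, random_state=42)),
    # Down-sample the majority (open) class until minority/majority = 0.6.
    ("down", RandomUnderSampler(sampling_strategy=0.6, random_state=42)),
])

# Applied to the training data only, e.g., 1557:326 becomes 1296:778.
X_res, y_res = resampler.fit_resample(X_train, y_train)
```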

An ensemble model approach was implemented based on a voting scheme that considered the same algorithms (RF, GradBoost, XGB, and MLP). Figure 4b summarizes the architecture of the meta-learner, consisting of the two preprocessing options and the above four learning algorithms. Again, Lale implemented the voting classifier algorithmic search to learn the optimal pipeline. From the set of all possible pipelines, the Lale meta-learner computes the top-performing learners and implements an ensemble voting predictor.

Listing 1 details the optimal hyperparameters and pipeline topologies identified by Lale. For each individual model in the ensemble, it identifies the optimal algorithm, preprocessing, and hyperparameter settings. Each of the 10 best-performing models is then combined using a voting approach. For the ensemble classifier, either hard or soft voting can be used; soft voting was selected in this study.

The ensemble learner was evaluated against two baseline models to benchmark model skill:

  • Best-performing individual model trained on the original dataset with no resampling (called BPIM).

  • Best-performing individual model trained on the resampled dataset that amended the ratio between majority and minority classes from 1557:326 to 1296:778 (called BPIM_S).

Listing 1 Optimal hyperparameters and pipeline topologies identified by Lale
Table 3 Confusion matrix reporting the predictive skill of the models
Table 4 Accuracy and model skill reported for our three different model implementations

The two baseline models were compared against the final ensemble model trained on resampled data (called Ensemble). The model evaluation focused on predictive skill and robustness. Tables 3 and 4 summarize the relative performance of the different model implementations. Table 3 presents a confusion matrix for the three implementations. A confusion matrix is a common method for describing the performance of a classification model: a simple cross-tabulation of the observed and predicted classes for the data. The main diagonal denotes cases where the classes are correctly predicted (TP and TN), while the counter diagonal reports the number of errors for each possible case (FP and FN). Results demonstrate that model BPIM provided a relatively accurate prediction of the site being open (correct prediction 288 out of 299 times) but failed to diagnose closure events. In short, the imbalanced nature of the data led the model to be biased towards the majority class (site open).

Table 4 summarizes the accuracy metrics reported by the three models. Accuracy ranges from 0.7 to 0.83, with the ensemble model significantly outperforming BPIM. The ensemble model reports a moderate drop in precision compared to BPIM_S: while the number of TP increases and FN decreases, the number of FP also increases, which leads to the moderate drop in precision. However, this reduction in FN is critical to practical model performance. Operators typically accept a slight increase in the number of false notifications of a potential toxin event if the model accurately forecasts closures. Typically, the more severe the repercussions of an event, the more important recall becomes as a metric.

A fundamental characteristic of data-driven approaches is that the model learns the patterns that enable it to make a prediction: instead of an expert encoding these relationships using physical equations, the model learns the most appropriate mapping between the given inputs, or features, and the outputs, or labels. Aquaculture industry stakeholders have been particularly interested in knowing which specific drivers contribute to predicting potential HAB events. Interpretable and explainable AI (Samek et al. 2019) are emerging topics in data science that aim to guide AI model interrogation.

Fig. 5

Feature importance reported for the XGB model when predicting the likelihood that the shellfish site will be closed due to a toxin event. Visualization is limited to the 12 most important features. The y-axis reports the ranked list of features that contributed most to the prediction (site closure), while the x-axis presents the relative magnitude of those contributions. Ranking predictors in this manner can quickly help sift through large datasets and understand data trends (Kuhn et al. 2013)

Figure 5 presents the importance of the supplied features to the response variable, or model prediction, at the study site. Feature importance was computed based on the Shapley value (Cohen et al. 2005). The Shapley value derives from game theory (Shapley 1997) and aims to assign payout (or importance) to players (or features) depending on their contribution to the total payout. The feature importance measure computes the contribution of each feature by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if permuting its values increases the model error, because the model relied on the feature for the prediction. A feature is “unimportant” if permuting its values leaves the model error unchanged, because the model ignored the feature for the prediction (Breiman 2001).

Given the difficulties associated with extracting primary features from ensemble methods, we opted for BPIM_S in this analysis. Specifically, we employed the best-performing XGBoost model, which is highly suitable for explainability analysis. This choice underscores the significance of selecting an appropriate model that aligns with the specific requirements of a study, where factors like accuracy, explainability, computational cost, and robustness often play crucial roles.
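
As an illustration, the Shapley-value attributions behind Fig. 5 can be computed with the shap library for a trained XGBoost model; this is a sketch, with xgb_model and X_test as assumed stand-ins for the trained BPIM_S model and the test features:

```python
import shap

# TreeExplainer computes exact Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)  # per-sample, per-feature contributions

# Summary plot ranking the 12 most influential features, as in Fig. 5.
shap.summary_plot(shap_values, X_test, max_display=12)
```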

Figure 5 illustrates that the day-of-year is the most important contributor to prediction. This is not surprising, as many papers on this site (Loureiro et al. 2005; Goela et al. 2014; Cristina et al. 2016; Silva et al. 2016; Danchenko et al. 2019, 2022; Fernandes-Salvador et al. 2021) show that the phytoplankton community with HAB species changes from diatoms in early spring to summer, during the upwelling season, to dinoflagellates during the relaxation of upwelling in autumn. The other primary contributory features include physical variables such as temperature and salinity, as well as important biogeochemical variables, including phosphate, dissolved oxygen, and the net primary productivity of carbon. Notably, eastward (but not northward) velocity is also an important contributor to prediction. HAB events in the region are strongly influenced by upwelling events (Garmendia et al. 2013) driven by flows from offshore. Illustrating the complexity of prediction, these upwelling events are driven by physical processes such as wind speed and ocean flows, which significantly modify properties such as nutrient levels, dissolved oxygen, and temperature. Interpretability approaches provide a framework to disentangle these complex events to some degree, but they must ultimately be informed by oceanography, biology, and domain expertise.

Discussion

These results illustrate how machine learning can improve decision-making on shellfish farms. Experienced farm operators have heuristic knowledge of the environmental conditions that lead to HAB events (upwelling processes, ambient wind conditions, water temperature range, etc.). Machine learning offers the opportunity to mine these processes and implement automated frameworks to identify high-risk periods for HAB-induced closures.

Relying only on public datasets, the model can be seamlessly transported to other locations. Naturally, predictive skill can likely be improved by incorporating long-term in-situ monitoring data that more closely capture system dynamics. More generally, increasing the volume of data is often the most effective strategy to improve model performance. This points to another advantage of the framework: the model will continue to improve as more data are collected.

The paper details the key considerations to apply machine learning to complex environmental datasets. Many aspects of the proposed framework are relevant for other environmental studies, namely:

  • The design of suitable features to represent the dynamics of the system is critical. This includes the choice of suitable variables to represent important processes (often driven by domain expertise); understanding the statistical relationships between variables, which can inform feature selection; and implementing robust feature engineering and transformation routines to avoid data scale inconsistencies.

  • The “no free lunch” theorem in classical optimization (Wolpert and Macready 1997) indicates that no single optimization algorithm is superior to all others. For practical purposes, one must explore different algorithms and decide on the one that works best for the problem at hand. While algorithmic search can be done manually by the data scientist, AutoAI approaches are a valuable tool to do this easily, at scale.

  • Explainable and interpretable AI methods are extremely valuable for machine learning-based analysis of environmental processes. It is vital that the relationships used by the model to make a prediction agree with those expected by the domain experts. Otherwise, the grounding of the model is flawed and will not perform well in unseen situations.

  • Many environmental studies deal with the prediction or evaluation of extreme events. However, machine learning is predicated on two core assumptions: 1) maximizing accuracy is the goal, and 2) in use, the classifier will operate on data drawn from the same distribution as the training data. Consequently, careful attention is required when the prediction classes are highly imbalanced. Naïve classifiers that always predict the majority class can achieve high accuracy in this case despite having no model skill.

  • Artificially rebalancing the data by up- or down-sampling is a robust technique to improve prediction (Provost 2000). Results presented in this paper demonstrate that it produces significant model uplift for predicting adverse environmental conditions. Further, its ease of implementation makes it an important part of the data scientist’s toolbox.

  • Ensemble prediction is a widely used technique to improve model performance by considering forecasts from multiple different models. The fundamental objective of ensemble forecasting is to investigate inherent uncertainty to provide more accurate information about future states. Classical works on ensemble forecasting demonstrated that the ensemble mean should give a better forecast than a single deterministic forecast as long as the ensemble represents the uncertainty present in the forecast (Epstein 1969; Leith 1974). Ensemble methods with perturbed initial conditions are ubiquitous in meteorology and these focus on quantifying the fastest-growing errors with techniques such as the breeder method and optimal perturbation analysis, common in data-assimilation implementations (Turner et al. 2008). Multi-algorithm ensembles in machine learning provide a robust framework to explore model uncertainty and serve as a means of regularization to the model.

This paper illustrates the potential value of machine learning as a decision-support tool for aquaculture. A holistic experimental framework is vital to achieve a high-performing model. Table 4 summarizes the performance of three different models. While accuracy is high for all three models, it is only a useful metric when there is an equal distribution of classes. In many applications (including this one), recall and precision scores provide a more representative measure. Recall measures the proportion of actual closures that were accurately predicted, while the F1 score is defined as the harmonic mean of precision and recall.

In this study, intelligent feature engineering and algorithmic selection increased recall from 0.23 to 0.58, while the F1 score increased from 0.35 to 0.75. This means that almost 60% of site closures were correctly diagnosed. Importantly, the final model reported a precision of 0.9, indicating that the model returned a relatively small number of false positives (reporting site closure when in fact it was open). Considering the complexities of the problem, this is excellent model performance. Such a system can drastically improve the ability of operators to respond to adverse events. Currently, farmers have no real insight into the likelihood of closures beyond heuristics and experience (O’Donncha and Grant 2019). With a robust early forecasting system, operators can make decisions to ameliorate the effects. Potential decisions include harvesting the molluscs early to avoid the toxin event, moving the installation to another location that is less exposed, or incorporating the upcoming disruption into their scheduling and amending harvesting and product sales timelines.

This paper treats the problem as a time series case. Of course, processes are also influenced by conditions at adjacent locations, and further research could explore the integration of spatial dependencies. Examples include the use of LSTM networks to represent spatial relationships (O’Donncha et al. 2022) and convolutional neural networks (CNN), a powerful deep learning tool with exceptional performance on image data. Combining CNN with time series machine learning models has previously been used to forecast ocean temperature (Yang et al. 2017). This can allow the model to more effectively learn the spatial and temporal drivers of HAB developments.

In the future, further investigations in this field will aim to incorporate a more explicit depiction of the spatial connections among blooms. Graph neural networks (GNN) offer a potent technique that permits the integration of heuristic or statistical information about physical properties and network structure within the modeling framework (Langbridge et al. 2023). The benefit of this approach is that heuristic relationships or modeled flow patterns can be utilized to develop a graph topology that links and informs about HAB incidents across farming locations.

It is worth noting that the drivers behind HAB events are likely to exhibit similarities across different sites. This presents a clear opportunity to enhance the performance and generalization capabilities of machine learning models by leveraging transfer learning techniques (Oruche and O’Donncha 2023). In particular, transformer architectures can be trained on large datasets to learn spatial and temporal dependencies in HAB occurrences. Although initially designed for natural language processing (NLP) tasks, transformers have been successfully applied to other domains, such as computer vision and graph modeling, due to their powerful capabilities.

One advantage of using transformers for this task is their ability to handle large amounts of data and complex relationships between variables. Transformers can learn patterns across both time and space, allowing them to model the complex interactions between different environmental factors that contribute to algal blooms. Additionally, transformers can learn from multiple modalities of data, such as satellite images and oceanographic data, and combine them into a unified model.

Another advantage of using transformers for this task is their ability to scale to large datasets. As the amount of data related to algal blooms continues to grow, transformers can be trained on increasingly larger datasets without sacrificing performance.

Conclusions

This paper introduces a framework designed to forecast the closure of shellfish sites caused by excessive levels of harmful toxins. The framework encompasses data conditioning, the development of machine learning (ML) models, and ensemble model forecasting, while also addressing pragmatic aspects of model development and implementation.

The presented model demonstrates high predictive accuracy when appropriate data conditioning and algorithmic selection are applied. This research holds significant practical value for shellfish operations, as timely information is crucial for effective decision-making. By predicting site closures based on environmental conditions, the framework aligns its forecasts with available environmental predictions.

The approach relies on publicly available data and leverages robust AutoML meta-learners to automate algorithm selection and parameterization. This enables easy deployment of the framework to other shellfish sites. Moreover, the computational efficiency of the approach makes it suitable for on-site or edge deployment.

Presentations of this study to individual stakeholders and the Portuguese Association of Aquaculture Farmers have received highly positive feedback. In fact, some stakeholders have expressed interest in expanding the approach further. They have inquired about the potential for predicting bivalve larvae spawning and settlement, as well as forecasting the optimal conditions for achieving maximum sales returns of the product.