Introduction

Fisheries bycatch can be an obstacle to the socioeconomic and ecological sustainability of seafood production. Bycatch can reduce global food, nutrition and livelihood security (Belton and Thilsted 2014; Béné et al. 2015; FAO 2020). Fisheries targeting relatively productive species can cause protracted or irreparable harm or permanent loss of populations of incidentally caught bycatch species with low reproductive potential and other life history traits that make them vulnerable to anthropogenic mortality (Musick 1999; Chaloupka 2002; Dulvy et al. 2021). Some teleosts, cartilaginous fishes (sharks, rays and chimaeras), marine reptiles (turtles and sea snakes), marine mammals and seabirds are threatened with extinction due to bycatch (Wallace et al. 2013; Dias et al. 2019; Nelms et al. 2021; Pacoureau et al. 2021). Reduced biomass of apex and mid-trophic level bycatch species can have direct and indirect effects on food web dynamics and on ecosystem structure, functions, stability and services, and selective removals within populations of bycatch species based on heritable traits can cause fisheries-induced evolution, reducing population fitness (Estes et al. 2011; Heino et al. 2015; Young et al. 2016; Stevens et al. 2000).

Independent synthesis of all accumulated scientific information is a fundamental principle for developing transparent, evidence-informed regional conservation management decisions (Dicks et al. 2014; Nichols et al. 2019). There are, unfortunately, numerous examples of decisions that ignored accumulated information, including decisions based on the latest or most publicized results from a single study, and that were based on weak forms of evidence, in some cases with dire consequences (Sutton et al. 2000; Chalmers 2007). Evidence-informed policy has guided decision-making in medicine and other disciplines for almost three decades (Sackett and Rosenberg 1995; Satterfield et al. 2009), but the concept remains absent from international guidelines on fisheries bycatch management (FAO 2011). Bycatch policy not informed by evidence of responses to mitigation interventions risks adopting a management strategy that at best is ineffective in meeting ecological and socioeconomic objectives. At worst, evidence-uninformed bycatch policy can cause harm, including by exacerbating catch and mortality rates of threatened species and creating unacceptable costs to components of commercial viability (economic viability, practicality, safety). Impacts of poorly designed bycatch management strategies have consequences across manifestations of biodiversity through altered evolutionary characteristics of populations and cascading effects through food web links, compromising global food, nutrition and livelihood security. Here we expand upon Gilman et al. (2022), who include an evidence hierarchy as one step of a decision support tool for bycatch management, to describe the potential benefits and limitations of applying a sequential evidence hierarchy to evaluate alternative fisheries bycatch management methods.

Evidence hierarchy tiers of synthesis and individual studies

Table 1 integrates categories of synthesis studies with individual studies, adapting the sequential evidence hierarchies of the Oxford Centre for Evidence-Based Medicine (CEBM 2009; Stegenga 2014) and the Scottish Intercollegiate Guidelines Network Grading Review Group (2001). Study approaches are presented in rank order to identify the relative degree of risk of error and bias of different categories of study approaches. Tier 1 has the least risk of error and bias, and produces findings that are the most generalizable and optimal for guiding global- and regional-level decision-making. Lower tiers are relatively weaker forms of evidence, have higher risks of error and bias, are more context-specific and are less suitable as a basis for decisions at broad spatial scales.

Table 1 Sequential evidence hierarchy of categories of study methods for testing a hypothesis, applied to mitigating the catch and fishing mortality of threatened bycatch species (adapted from: CEBM 2009; Stegenga 2014; Scottish Intercollegiate Guidelines Network Grading Review Group 2001; Gilman et al. 2022). RCT = randomized controlled trials and experiments. Tier 1 has the least risk of error and bias, and is the most generalizable and optimal for global and regional policy. Tier 12 has the highest risk of error and bias, and is the most context-specific and least suitable for global and regional policy.

There is a risk that results from a single study are context-specific—and hence lack external validity (Deaton and Cartwright 2018). Results may be affected by the specific conditions of an individual study, such as the study area, study period, species involved and environmental conditions, preventing the results from that single study from being applicable under different conditions. This may explain cases where individual studies have conflicting findings (Deaton and Cartwright 2018). Furthermore, a single study may have low power and fail to find a meaningful result due to too small a sample size (Mumby et al. 2021).

The issue of lack of external validity can be addressed by meta-analytic based synthesis of evidence sourced from multiple studies that address the same question. The three statistical approaches used for meta-analytic based syntheses are:

  • Meta-analyses (including meta-regression) of the aggregated or summary results from individual studies. For example, see Chaloupka et al. (2022);

  • Mega-analyses of the original datasets used in each individual study (also referred to as integrative data models or individual participant data models). For example, see Musyl and Gilman (2019); and

  • Data fusion using augmented or aggregated data-dependent priors. For example, see Hooten et al. (2021).

Meta-analyses that comprise more than two interventions or treatments can be assessed simultaneously within a single model framework using a network meta-analysis modelling approach (Caldwell et al. 2005; Dias and Caldwell 2019). Due to the larger sample size plus the number of independent studies, correctly designed meta-analytic assessments, including meta-analyses, can provide estimates with increased accuracy over estimates from single studies, with increased statistical power to detect a real effect (Borenstein et al. 2009; Nakagawa et al. 2015). By synthesizing estimates from a mixture of independent, small and context-specific studies, the overall estimated effect from meta-analyses is generalizable and relevant over diverse settings (Pfaller et al. 2018). Therefore, evidence from meta-analytic studies ideally should inform the development of global- and regional-level bycatch management strategies. If effects vary across studies, meta-analytic synthesis studies can identify reasons for between-study heterogeneity. Synthesis research also identifies knowledge gaps, and conversely identifies areas where additional studies are not needed, guiding priorities for future research (Chalmers et al. 2014; Pfaller et al. 2018; Musyl and Gilman 2019).
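As an illustration of how pooling summary results can tighten an estimate and quantify between-study heterogeneity, the following is a minimal sketch of a random-effects meta-analysis using the DerSimonian-Laird estimator. It is not the method of any study cited above. The effect sizes and variances are hypothetical log relative risks invented for the example, and the function name `random_effects_meta` is an assumption.

```python
import math

def random_effects_meta(effects, variances):
    """Pool per-study effect sizes with DerSimonian-Laird random-effects weights."""
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-study variance, truncated at zero
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    # I^2: share of total variation attributed to between-study heterogeneity
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    return {"pooled": pooled, "se": se, "tau2": tau2, "Q": q, "I2": i2}

# Hypothetical log relative risks of bycatch (treatment vs. control) from four studies
effects = [-0.5, -0.3, -0.7, -0.1]
variances = [0.04, 0.09, 0.16, 0.25]
result = random_effects_meta(effects, variances)
print(result)
```

In this hypothetical case the pooled standard error is smaller than that of the most precise single study, which is the gain in precision described above. A non-zero `tau2` or a large `I2` would flag between-study heterogeneity that a meta-regression could then try to explain.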

Other synthesis study approaches are relatively weak forms of evidence. This includes qualitative systematic literature reviews, which have a higher evidence ranking than qualitative unstructured literature reviews (Table 1). Targeted, non-systematic reviews have a high risk of bias and can lead to false conclusions. Conversely, systematic reviews employ an impartial, transparent and hence replicable approach that reduces the risk of biased selection of publications and the risk of introducing prevailing paradigm, familiarity, citation and publication biases (Sutton 2009; CEE 2013; Bayliss and Beyer 2015). Methods for planning, implementing and reporting systematic reviews should follow the Reporting Standards for Systematic Evidence Syntheses (ROSES, Haddaway et al. 2018), Collaboration for Environmental Evidence (CEE, Pullin et al. 2020, 2021), or Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA, Page et al. 2021a,b), with the PRISMA checklist adapted into a reporting protocol.

Randomized controlled trials and experiments (RCTs) are considered the gold standard of individual studies, with the least risk of error and bias (Backmann 2017; Pynegar et al. 2021). After RCT studies, the next tier in a linear hierarchy of relative degree of evidence of individual studies is comprised of quasi-experiments (non-randomized, controlled studies) and comparative experiments (Boesche 2020). Next are studies analyzing observation data, including human at-sea observer and electronic monitoring data, that apply statistical modelling approaches to standardize catch and fishing effort time series data (Venables and Dichmont 2004; Potts and Rose 2018) and that apply quasi-experimental statistical modelling approaches to infer causal impacts of an intervention often using standardized fisheries data (see review in Hilborn et al. 2021). This is followed by observational studies with nominal estimates that are made without standardizing effort. Unlike in experimental studies, observational studies do not experimentally manipulate specific variables and control for others (Hayes et al. 2019), and unlike in observational studies employing appropriate modelling approaches, observational studies with nominal estimates do not standardize effort by constructing indices of relative fishing power from vessel, gear, spatial, environmental and other explanatory variables and thus do not explicitly account for simultaneous variability in potentially informative predictors of a response (e.g., mean catch rate, haulback mortality rate, length) (Venables and Dichmont 2004; Potts and Rose 2018).
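The difference between nominal and standardized estimates can be illustrated with a small sketch. The example below uses direct standardization over a single stratifying variable, which is a deliberately simplified stand-in for the model-based standardization approaches cited above. All records, field names and function names are hypothetical.

```python
from collections import defaultdict

def nominal_rate(records, gear):
    """Catch per 1000 hooks for one gear type, pooling all sets without adjustment."""
    catch = sum(r["catch"] for r in records if r["gear"] == gear)
    hooks = sum(r["hooks"] for r in records if r["gear"] == gear)
    return 1000.0 * catch / hooks

def standardized_rate(records, gear, covariate="area"):
    """Stratum-specific rates weighted by each stratum's share of total effort."""
    total_hooks = defaultdict(float)  # effort by stratum, both gear types
    gear_catch = defaultdict(float)
    gear_hooks = defaultdict(float)
    for r in records:
        total_hooks[r[covariate]] += r["hooks"]
        if r["gear"] == gear:
            gear_catch[r[covariate]] += r["catch"]
            gear_hooks[r[covariate]] += r["hooks"]
    all_hooks = sum(total_hooks.values())
    rate = 0.0
    for stratum, hooks in total_hooks.items():
        if gear_hooks[stratum] == 0:
            raise ValueError(f"no effort for {gear!r} in stratum {stratum!r}")
        rate += (hooks / all_hooks) * 1000.0 * gear_catch[stratum] / gear_hooks[stratum]
    return rate

# Hypothetical observer records: the mitigation gear was fished mostly in a bycatch hotspot
records = [
    {"gear": "control", "area": "hotspot", "hooks": 1000, "catch": 10},
    {"gear": "control", "area": "offshore", "hooks": 9000, "catch": 18},
    {"gear": "mitigation", "area": "hotspot", "hooks": 9000, "catch": 54},
    {"gear": "mitigation", "area": "offshore", "hooks": 1000, "catch": 1},
]
for gear in ("control", "mitigation"):
    print(gear, nominal_rate(records, gear), standardized_rate(records, gear))
```

In these invented data the nominal rates suggest that the mitigation gear catches more bycatch than the control, because its effort was concentrated in the hotspot. The standardized rates, which compare the gears within each area, reverse that ranking.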

Mechanistic studies, designed to answer questions about the physiological mechanisms causing a phenomenon (Marchionni and Reijula 2019), such as a behavioral response to a bycatch mitigation method, are the next tier for individual studies. Mechanistic experimental and observational studies typically do not provide direct evidence of the efficacy of a bycatch mitigation measure. Instead, they improve the understanding of why an observed response to a bycatch mitigation method occurs and can help to identify promising new or modified bycatch mitigation approaches. For example, while a non-mechanistic study could assess marine turtle catch rate responses to lightsticks, a mechanistic study could test specific behavioral responses to lightsticks with different emission spectra (Wang et al. 2007), increasing the understanding of marine turtle responses to different types of lightsticks and marine turtle visual acuity.

Expert surveys are the next lowest tier of individual studies and a relatively weak form of evidence. Expert surveys have a relatively high risk of bias and can have both low internal and external validity (Kahneman 2011; Hayes et al. 2019). Expert surveys are a rapid and low-cost approach that is suitable when previously little or no information was available. Information from fisher surveys may be the only source of data available for many fisheries. Data from expert surveys, as well as data self-reported by fishers in logbook data, however, are of relatively low certainty, especially where the survey is addressing highly sensitive issues, such as if there are stringent economic or regulatory penalties for identified infractions (Walsh et al. 2002; Mangi et al. 2016), but also due to various additional sources of bias, including retrospective, anchoring, availability, prevailing paradigm (confirmation), dominance, groupthink and overconfidence (Tourangeau 2000; Martin et al. 2012; Hemming et al. 2017). Furthermore, there is a risk that the data collected from survey respondents are not generalizable and are unrepresentative of the underlying population that was sampled (Downes and Carlin 2020). This is a high risk if a probability sampling design was not followed, resulting in undercoverage bias (e.g., fishers of small-scale vessels and of vessels from certain seaports are not sampled), nonresponse bias was large and was not explicitly accounted for, there was a low response rate, or the questionnaire design or the way the questionnaire was administered caused biased responses (Choi and Pak 2005; Brick 2011; Sarstedt et al. 2018; Downes and Carlin 2020).

Structured expert elicitation approaches can improve on simple expert judgement approaches to reduce some of these sources of bias and improve the accuracy of estimates, as well as improve transparency (Martin et al. 2012; Hemming et al. 2017). Structured expert elicitation approaches apply objective and reliable methods to select experts, frame questions that support expressing responses as probabilities or numerical quantities, employ elicitation practices to counteract biases, and employ objective aggregation methods (Hemming et al. 2017). For example, the IDEA protocol, a modified structured Delphi procedure that includes a group discussion stage, improves the accuracy of individual responses (Burgman et al. 2011; Hanea et al. 2016; Hemming et al. 2017). Initial expert estimates are elicited from a diverse group of individuals, who then revise their individual estimates following group discussion during which the experts can share evidence and resolve any linguistic ambiguity (Hemming et al. 2017). If information is known that is closely related to the focus of an expert survey, then experts can be asked questions for which the answers are already known. For example, accurate information on the number of trips and fishing operations that an individual fishing vessel made in the past year may be available from satellite-based vessel monitoring system data. The accuracy of the individual expert’s estimates for these known values can then be determined, enabling their responses to questions with unknown values to be weighted, referred to as Cooke’s Classical Model (Cooke 1991; Aspinall 2010). However, this assumes that the questions with known answers will be affected by various sources of bias to the same degree as the questions with unknown responses, which may be a false assumption. For instance, information on fishing effort is unlikely to be sensitive in the way that estimates of the number of captured threatened species and amount of abandoned and discarded fishing gear are.
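The idea of weighting experts by their accuracy on questions with known answers can be sketched as follows. This is a simplified illustration only. Cooke's Classical Model scores experts on the calibration and informativeness of their quantile assessments, whereas the sketch below substitutes a plain inverse-error weight on point estimates. The expert labels, seed values and function names are all hypothetical.

```python
def seed_weights(seed_answers, seed_truth, eps=1e-6):
    """Normalized weights from each expert's mean absolute relative error on seed questions."""
    raw = {}
    for expert, answers in seed_answers.items():
        errors = [abs(a - t) / abs(t) for a, t in zip(answers, seed_truth)]
        # Smaller mean error gives a larger raw weight; eps avoids division by zero
        raw[expert] = 1.0 / (sum(errors) / len(errors) + eps)
    total = sum(raw.values())
    return {expert: w / total for expert, w in raw.items()}

def weighted_estimate(target_answers, weights):
    """Performance-weighted mean of expert estimates for a question with unknown answer."""
    return sum(weights[e] * x for e, x in target_answers.items())

# Hypothetical seed questions: trips and sets last year, known from vessel monitoring data
seed_truth = [20, 240]
seed_answers = {"A": [20, 250], "B": [30, 120], "C": [18, 230]}
# Hypothetical target question: number of turtles captured last year
target_answers = {"A": 4, "B": 12, "C": 6}

weights = seed_weights(seed_answers, seed_truth)
print(weights, weighted_estimate(target_answers, weights))
```

The sketch inherits the limitation noted above: weights derived from non-sensitive seed questions carry over to the sensitive target question only if both are affected by bias to a similar degree.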

Finally, flawed studies, non-expert surveys, opinions from a single individual or organization, and bycatch mitigation method–species combinations with no records provide the least certain evidence. This makes up the lowest tier of the evidence hierarchy (Table 1).

Evidence hierarchy drawbacks

Evidence hierarchy categorizations should not be used as an absolute interpretation of relative degree of risk of error and bias. Several cogent arguments have been made against using evidence hierarchies (Stegenga 2014; Jones and Steel 2018). A hierarchical approach to study evidence risks ignoring potentially important findings derived from studies using methods low on an evidence hierarchy. The hypothesis being tested and the context of the study need to be considered in addition to the relative strength of evidence of the study method.

While global meta-analyses provide relatively robust evidence to inform global and regional policy, they may not be the most certain evidence for local, individual fishery-level decisions (Gilman et al. 2022). Because prevailing conditions at local and regional scales may be substantially different, bycatch mitigation measures that are effective at a regional level may have a different response locally, for an individual fishery. For instance, the catch rate response to a change in gear design that affects size selectivity, such as gillnet mesh size and hook size, of an individual fishery that overlaps with a portion of the length frequency distribution of a population may differ from the response by a regional fishery that encounters the entire length frequency distribution for the population demographic structure that is exposed to the fishery (Gilman et al. 2020).

There is no definitive basis for determining the relative certainty of some study design categories, such as between a meta-analysis of compiled quasi-experimental studies and an individual RCT. There is also variability in the degree of error, bias and quality of individual studies within each hierarchy tier. Individual studies may employ flawed designs, and synthesis studies might include flawed individual studies. A meta-analytic study employing a weak approach or that is based on predominantly flawed studies may produce less reliable results than individual, well-designed studies. For information on the strengths and weaknesses of meta-analytic approaches for either aggregated data summaries or original datasets used in each study, see Lyman and Kuderer (2005), Finckh and Tramèr (2008) and Gurevitch et al. (2018).

Evidence hierarchies tend to be simplistic. They use a small suite of criteria, ignoring many potentially critical, context-specific aspects of evidence needed to test some hypotheses. For instance, the evidence hierarchy does not account for whether evidence of the response to an intervention is applicable to conditions in practice (i.e., in the real world, such as under commercial fishing conditions) and has been externally validated, or otherwise evidence is available only from controlled conditions (Stegenga 2014; Jones and Steel 2018; Luján and Todt 2021; Pullin et al. 2021).

Estimates of the efficacy of some bycatch mitigation methods derived from analyses of monitoring data provide a more realistic prediction of the effect of the method when used during real-world, commercial fishing operations than estimates from experiments, despite the latter having a relatively lower risk of bias (Gilman et al. 2005; Cox et al. 2007; Stegenga 2014; Jones and Steel 2018; Luján and Todt 2021). The evidence hierarchy ranks evidence from experiments, where a bycatch mitigation method is likely to be employed optimally, as having relatively lower risk of error and bias than observational studies. But the efficacy of some bycatch mitigation measures is strongly affected by crew behavior [see Jones and Steel (2018) for a parallel discussion of applying evidence hierarchies in the context of real-world medical decision making]. This can cause substantial differences in the efficacy of these bycatch mitigation methods between estimates from experiments, where researchers implemented the mitigation measure, versus from analyses of observer or electronic monitoring data, where fishers implemented the bycatch mitigation method during commercial operations (Gilman et al. 2005; Cox et al. 2007). Therefore, for bycatch mitigation methods whose efficacy is affected by crew behavior, analyses of observer and electronic monitoring data may provide a more certain estimate of responses during commercial fishing operations than experiments, where experiments that optimally apply a treatment provide useful information on the upper bound of effectiveness. It therefore can be important to validate that the efficacy of an intervention when used under controlled conditions is of similar effectiveness when employed in real-world conditions through ‘pragmatic’ studies (Khorsan and Crawford 2014; Pullin et al. 2021). To account for this real-world efficacy, considering whether the efficacy of a specific method is affected by crew behavior is important (Gilman et al. 2022).

In some cases, to enable each treatment to have an equal probability of being selected, study designs with systematic treatment assignment (a form of probability-based sampling) and that are balanced may be preferable to ‘simple randomization’ designs. Many fisheries bycatch mitigation experiments employed designs that alternated the order of treatments, in some cases with a random starting point, and are thus balanced but with systematically assigned rather than randomly assigned treatments. This allows the treatments to be exposed equally to varying, patchy conditions (e.g., sea surface temperature, thermocline depth, and proximity to a submerged feature) along the distribution of the fishing gear. It also allows the treatments to have an equal probability of encountering a school of pelagic predators that are susceptible to longline capture, such as when a school of tunas encounters a section of a longline, resulting in clustered, patchy catch (Capello et al. 2013). However, study designs that use replicates such as one basket (the hooks between two floats), set or trip by pelagic longline vessels may be a less robust approach than alternating treatments by hook, because the former does not account for this patchiness of potentially informative predictors and the distribution of pelagic predators in pelagic marine ecosystems. But in studies where the experimental treatment affects a response to the control treatment, such as deterrents and attractants, or if the treatment affects local abundance, such as bycatch mitigation methods that conceal or protect baited hooks, then using a replicate of sections of gear may be warranted (e.g., Gilman et al. 2003). In addition, experiments that are designed to assess catch and mortality rate responses to variables such as the time-of-day of setting may require using a set or trip replicate in fisheries where a fishing operation occurs on a daily cycle.
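The contrast between systematic alternating assignment and simple randomization can be sketched in a few lines. The example below is illustrative only. The basket size, treatment labels and function names are assumptions, and the code does not reproduce the design of any cited experiment.

```python
import random
from collections import Counter

def alternating_assignment(n_hooks, treatments, rng=None):
    """Alternate treatments hook by hook, starting from a randomly chosen treatment."""
    rng = rng or random.Random()
    start = rng.randrange(len(treatments))
    return [treatments[(start + i) % len(treatments)] for i in range(n_hooks)]

def simple_random_assignment(n_hooks, treatments, rng=None):
    """Assign each hook independently and at random; balance is not guaranteed."""
    rng = rng or random.Random()
    return [rng.choice(treatments) for _ in range(n_hooks)]

def counts_per_basket(assignment, basket_size):
    """Treatment counts within each consecutive basket of hooks."""
    return [Counter(assignment[i:i + basket_size])
            for i in range(0, len(assignment), basket_size)]

treatments = ["control", "treatment"]
systematic = alternating_assignment(30, treatments, random.Random(1))
randomized = simple_random_assignment(30, treatments, random.Random(1))
print(counts_per_basket(systematic, 10))
print(counts_per_basket(randomized, 10))
```

With an even basket size and two treatments, the alternating design places equal numbers of each treatment in every basket, so both are exposed to the same local patchiness. Simple randomization can leave a basket dominated by one treatment by chance, especially when the number of hooks is small.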

Some RCT study designs approximate but do not truly achieve ‘simple randomization’. Employing simple randomization designs is challenging in field ecology studies, and instead haphazard designs are typically employed, where treatments are allowed to randomly mix instead of following a pre-arranged randomized order. But humans have an inherent, subconscious propensity to organize, categorize and lump like with like, and to behave in patterns instead of randomly (Washington Sea Grant 2016). Haphazard designs therefore approximate but do not achieve formal, true simple randomization (Shadish et al. 2002). For instance, a study deployed fishing hooks of three sizes by having crew mix the hooks haphazardly when storing in bins, where it would be impractical to have crew follow a pre-arranged randomized order determined by a random number generator because of the method and speed that the fishers set, retrieve and store their gear, and because they had a finite number of each hook type (Gilman et al. 2018). Gilman et al. (2018) used the Wald-Wolfowitz test for runs to test the hypothesis of randomness, i.e., that there was no significant difference between the number (size classes) of runs of each of the three hook types, and found that 11% of sets had significantly more runs of one hook size than expected, likely due to chance (i.e., simple randomization can result in some chance confounding of imbalances in some variables, especially with small sample sizes, Chu et al. 2012; Saint-Mont 2015), but possibly due to bias introduced inadvertently by the process that crew used to store gear in bins during the haulback.
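A runs test of this kind can be sketched for the simpler two-category case. The sketch below implements the Wald-Wolfowitz runs test with a normal approximation for a sequence containing exactly two hook types. It is not the analysis code of the cited study, which involved three hook sizes. Handling three types, for example by recoding each type against the other two, would be an additional assumption. The sequences are invented.

```python
import math

def runs_test(sequence):
    """Two-sided Wald-Wolfowitz runs test (normal approximation), two categories only."""
    categories = sorted(set(sequence))
    if len(categories) != 2:
        raise ValueError("this sketch handles exactly two categories")
    n1 = sequence.count(categories[0])
    n2 = sequence.count(categories[1])
    n = n1 + n2
    # A new run starts wherever adjacent elements differ
    runs = 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value from the standard normal
    return runs, z, p

# Hypothetical hook sequences along a longline: S = small hook, L = large hook
clumped = ["S"] * 10 + ["L"] * 10   # crew stored like with like
alternating = ["S", "L"] * 10       # strictly alternating order
print(runs_test(clumped))
print(runs_test(alternating))
```

A significantly negative `z` indicates fewer runs than expected under randomness (clumping of like hooks), and a significantly positive `z` indicates more runs than expected (over-regular mixing). Either pattern departs from simple randomization.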

Given these valid drawbacks of evidence hierarchies, decision-makers should consider evidence hierarchy categorizations as but one of various criteria to guide their design of a bycatch management strategy. Management authorities should account for all accumulated evidence for individual bycatch mitigation methods and the implications of different approaches for testing different hypotheses in making evidence-informed bycatch management policies (Bluhm 2005; Stegenga 2014).

Conclusions

Decisions for regional bycatch management should ideally be based on evidence from meta-analytic modelling syntheses of accumulated research, which usually produce the most robust and generalizable findings. Otherwise, if there are too few studies to support robust meta-syntheses, then decisions should rely on evidence from a qualitative synthesis of all available individual studies while accounting for the relative degree of risk of error and bias based on each study’s design. Bycatch mitigation methods with evidence only available from studies with relatively weak forms of evidence, or lacking any evidence of efficacy, should only be considered as a precautionary approach when more certain alternatives to achieve a bycatch management objective are unavailable (Gilman et al. 2022).

Strictly applying a hierarchical approach to study evidence to make policy decisions risks ignoring potentially important findings derived from studies using methods low on an evidence hierarchy. In making evidence-informed bycatch management policies, authorities should instead account for all accumulated findings and consider which study approaches are best suited for testing different hypotheses under different circumstances. A network or plurality approach that integrates different types of evidence has been proposed as an alternative to a sequential evidence hierarchy (Bluhm 2005; Stegenga 2011, 2014). Fisheries bycatch policy guided, but not bounded, by a sequential evidence hierarchy promises to achieve ecological and socioeconomic objectives.