Introduction

Pangenomic analysis aims to recognize shared biological information among living organisms [1]. The term was introduced by Tettelin and colleagues [2] in a study of the genomic composition of multiple pathogenic isolates of Streptococcus agalactiae, motivated by the fact that a single genome does not reflect how genetic variability drives pathogenesis within a bacterial species and also limits genome-wide screens for vaccine candidates or antimicrobial targets.

A building block of a pangenomic study is the identification of homologies among the genes that compose the input genomes. A gene family is a set of similar genes formed by duplication of a single original gene [3]. In this context, genes are clustered into gene families in order to identify the presence of a family within a genome and to study its genetic composition, but more importantly to understand the global genetic composition of the whole group of genomes. For this purpose, the concept of sequence homology is taken into account [4]. Transmission of genetic material can occur vertically or horizontally [5], which, together with gene duplication, leads to three categories of homology: paralogy (copies of the same gene produced by a duplication event); orthology (the same gene transmitted vertically and separated by a speciation event); and xenology (a gene transmitted horizontally between two genomes). Thus, a gene family is a set of genes related through such homologies, and pangenomic studies are essentially based on this information. In accordance with homology relations, the presence of a family in the whole group of genomes identifies it as a core family. In contrast, the absence of the gene family from one or more genomes defines the family as “accessory”, thus not constituting an essential biological function for the whole group.

Pangenomic analysis can be seen as a way of exploring the distribution of genetic information across a group of genomes, to determine whether the group complexity is saturated, in which case the pangenome is considered closed, or whether any new genome included in the analysis increases the pangenome size [6], in which case the pangenome is considered open. This approach relies on mathematical models that predict how fast the pangenome size is expected to plateau. In general, a closed pangenome indicates that the majority of the genomic content of a given group of organisms has already been discovered, whereas an open pangenome indicates that more remains to be discovered from that clade. Recently, Rubio et al. [7] showed, by means of a pangenomic analysis of Streptococcus pneumoniae strains, that accessory genes help to increase functional redundancy in bacteria.
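
To make the open/closed distinction concrete, the following is a minimal sketch (in Python, with hypothetical counts) of fitting Heaps' law, in which the number of new gene families contributed by the N-th genome is modelled as n(N) = κN^(-α): α ≤ 1 suggests an open pangenome, α > 1 a closed one. The counts and the use of scipy are illustrative assumptions, not taken from the cited studies.

    import numpy as np
    from scipy.optimize import curve_fit

    def heaps(N, kappa, alpha):
        # Heaps' law: expected number of new gene families from the N-th genome
        return kappa * N ** (-alpha)

    genomes = np.arange(2, 11)  # the 2nd to 10th genome added (hypothetical)
    new_families = np.array([310, 220, 180, 150, 130, 118, 105, 98, 90])

    (kappa, alpha), _ = curve_fit(heaps, genomes, new_families, p0=(300.0, 0.5))
    print(f"alpha = {alpha:.2f} ->", "open" if alpha <= 1 else "closed", "pangenome")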

Alternative approaches switch from such gene-level pangenomic information to a level at which whole DNA sequences are taken into account. In these approaches, the pangenomic content of genomes is represented through formal languages or graph structures [8, 9], widely used to represent population-level information in human studies [10]. However, this type of representation aims at recognizing similarity in individual variations of the DNA sequence, making no distinction between particular genomic regions, such as genes. In this study, we only consider gene-level analyses, rather than genome-level approaches.

Pangenomics, however, can also be exploited to investigate specific biological questions through a plethora of downstream applications. During disease outbreaks, for instance, pangenomic studies can be employed to characterize bacterial isolates and to track the spread of infections by comparing the genomes of different isolates, helping public health authorities implement targeted control measures and prevent further spread [11]. More generally, the collective analysis of all the genes serves many specific interests, for example, the study of a bacterial strain of a given species [12, 13]. Pangenome analyses have found many applications in clinical studies [14, 15]: for example, they help in identifying drug-target genes [16, 17], in exploring phylogenetic lineages of bacteria [18] that can be linked to strain-specific disease phenotypes [19], or in recognizing possible antiviral responses in bacteria [20].

Figure 1 summarizes the main steps of a workflow for pangenomic analyses, in which the genetic composition of genomes is retrieved and analysed to detect gene family composition by means of genetic homology. The result of this clustering step is presented to the researcher for downstream analyses, possibly aggregated with supplementary information (that is, gene coordinates, biological function, etc.). It is important to note that computational pangenomic analyses need extensive computational resources. Additional knowledge is anything beyond the mere sequences of the genes. The calculation of homology is mainly based on genetic sequences, which are the essential information needed by every tool for computing gene families. Any other data is accessory: it can help with the homology computation, or it can be used for downstream analyses.

Fig. 1. Architecture of a pangenomic analysis workflow. Red circles refer to the tips described in this study. Examples of application modules for downstream analyses are reported in [21,22,23,24,25,26]

Concerning the tips proposed in this document and reported in Fig. 1, Tip 1 focuses on the research question of the reader and how it relates to the other tips. Issues related to input data and its quality are reported in Tip 2, while output formats and their visualization are the focus of Tip 5. Because this study mainly regards gene-level pangenomics, Tip 3 gives clues on genome annotation strategies and how they can affect pangenomic analyses. Tip 4 pays attention to the core of gene-level pangenomics, which is the computation of sequence homology and the clustering of genetic sequences into gene families. Tip 6 gives guidelines on how to critically evaluate the results of a pangenomic detection methodology. Last but not least, Tip 7 advocates openness and reproducibility of data and experiments.

So far, the Quick Tips article series has published manuscripts on several topics (for example, [27,28,29,30,31,32,33]), but not on pangenomic analysis. We fill this gap by presenting our recommendations on this theme: a list of simple tips that should be followed by anyone working in this field, to avoid common mistakes and errors.

Our goal is to provide advice on how to properly conduct a computational pangenomic analysis and help you obtain a pangenome that is reliable and reproducible and that will serve as a starting point for answering your research question.

Tip 1: Ensure the data in your hands can answer your research question

Having a clear research question may seem trivial to an inexperienced researcher. However, it is fundamental to know what the ultimate goal of your research project is, because several aspects of the analysis will be affected by it. Admittedly, pieces of evidence can arise from agnostic data analysis, but, in the general case, it is essential to plan a research goal and then find the instruments for driving toward that goal. First and foremost, the ultimate goal of your analysis depends on the conditions of the data you use as input. This means that when you start a project on bacterial pangenomics you not only need to determine the set of organisms that best reflects the problem you want to solve, but you also have to make sure that the data you have has the right characteristics. It is accepted that bacteria from the same species can show big differences in their genetic content, due to horizontal gene transfer and mutations, leading to substantial differences in phenotype [34]. If the genomic sequence quality and depth are insufficient to capture strain-level information, the pangenomic analysis will not be able to detect intra-species variation in the gene family identification process [35, 36]. Even if you ensure the quality of your sequencing [37, 38], at this step you may find that the data you require to solve a question is not always available, in which case you might want to reshape your research question. Being aware of the data you need affects the feasibility of your analysis and ultimately your project.

Another point of relevance is that the questions you may want to solve normally require some downstream analysis that is specific to the question at hand. The key is understanding what is required to perform these tasks, as their input is the output of your pangenomic analysis. Depending on downstream needs, you may need to use different pangenomic analysis tools. For example, your goal might be to study the taxonomic lineage of a species. A common approach is to identify genes that are present in all genomes being analyzed, called core genes, and use these to build a phylogenetic tree [39]. However, you might also consider using a gene presence-absence matrix for this purpose, which contains information about each gene’s presence across genomes, information that would otherwise be missed when using only core genes. In that case, you need to make sure the tool you use provides that information [40]. This type of data representation is also useful if you want to use a wider core gene threshold by taking all genes present in at least a given percentage of genomes, instead of all of them [41].
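
As a minimal sketch of this idea, assuming a hypothetical 0/1 gene presence-absence matrix with gene families as rows and genomes as columns (similar in spirit to the matrices produced by the tools cited above), core families can be extracted at different thresholds as follows:

    import pandas as pd

    # Hypothetical presence-absence matrix: rows = gene families, columns = genomes
    pa = pd.DataFrame(
        {"genomeA": [1, 1, 0, 1], "genomeB": [1, 1, 1, 0], "genomeC": [1, 0, 1, 1]},
        index=["famA", "famB", "famC", "famD"],
    )

    n_genomes = pa.shape[1]
    strict_core = pa.index[pa.sum(axis=1) == n_genomes]       # present in all genomes
    soft_core = pa.index[pa.sum(axis=1) >= 0.95 * n_genomes]  # present in >= 95% of genomes
    print(list(strict_core), list(soft_core))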

What is important is knowing where you are headed with your analysis, because this will help you make the right decisions along the way.

Tip 2: Know your pangenomic input data and double-check its quality

Complete and fragmented genomes usually come in the form of one genome per file, such that multiple fragments of the same genome are included within the same file. Metagenomic material is usually distributed in the form of a Metagenome-Assembled Genome (MAG) [42], which is a single-taxon assembly based on binned sequences that have been asserted to be a close representation of an actual individual genome. When selecting analysis tools, ensure that they are designed to handle the type of data you are working with.

Additionally, always make sure your data has been quality-checked and filtered to remove any potential issues [37, 38]. This can mean different things depending on the data at hand. If you start your analysis from raw reads, you will want to make sure you discard low-quality reads and contaminants. When reads have been assembled, make sure they respect the highest standards in terms of completeness and contamination [43]. Be aware that analyzing highly fragmented genomes with a tool that does not account for fragmentation can introduce biases in your analysis, leading to an overestimation of the pangenome size and an increase in the number of singletons [41]. In fact, many of the genomes that are available in public databases are at draft level. Some existing methodologies are able to deal with them [40, 44,45,46]; however, it is always worthwhile to understand when a genome is too fragmented to be a good source of pangenomic information. Thus, you should evaluate the number of fragments, their lengths, and the various statistics based on them, such as N50, L50, and U50 [47]. You can also assess the quality of fragmented genomes by running a phylogenetic analysis that includes non-fragmented reference genomes and checking the accordance between the expected position of each fragmented genome within the obtained phylogenetic tree and its resulting location. In some cases, when you are dealing with assembled genomes, you might come across unmapped reads. Unfortunately, there is only one tool that can effectively utilize these reads, and prematurely discarding them may result in missed analysis opportunities [45]. By recognizing and addressing these considerations, you can conduct a more accurate and meaningful analysis of your pangenomic data.
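
For reference, the N50 and L50 statistics mentioned above can be computed directly from the contig lengths; the following is a minimal sketch with hypothetical lengths (N50 is the length of the contig at which half of the total assembly size is reached when contigs are sorted from longest to shortest, and L50 is the number of contigs needed to reach that point):

    def n50_l50(contig_lengths):
        lengths = sorted(contig_lengths, reverse=True)
        half_total = sum(lengths) / 2
        running = 0
        for count, length in enumerate(lengths, start=1):
            running += length
            if running >= half_total:
                return length, count  # (N50, L50)

    print(n50_l50([950_000, 400_000, 120_000, 30_000]))  # hypothetical contig lengths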

Tip 3: Be mindful of how your sequences are annotated

Biological sequences are usually distributed in the form of FASTA files [48], which are textual files reporting the nucleotide or amino acid strings of one or more sequences. No additional information is included in such files; that information is instead embedded in file formats such as GBK (GenBank file format) [49] and GFF (General Feature Format) [50], whose version 2 is identical to GTF (Gene Transfer Format). The aim of such formats is to provide information regarding the “annotation” of a given genomic sequence. In the case of genomic sequences, such an annotation can be, for example, the location of the genes contained in it. The GBK file usually contains information on genes and, more generally, on CDS (coding sequence) annotations, but recently non-coding elements have become increasingly common. For this reason, GBK files often report the translation of a gene, its known biological functions, and so forth. In contrast, GFF files are very general: they allow embedding any type of data, but they are more specialized in describing annotation elements within the sequence.

It is important to examine how annotations have been inferred. Which computational tool was used to recognize the elements, with which parameters, and did the annotation pass through manual curation? Tool parameters may cause specific elements, such as transfer RNAs (tRNAs), to be discarded, and manual curation may force the deletion of some of them. For this reason, we suggest always running a gene detection tool, such as Prodigal [51] or Prokka [52], and also evaluating the differences between the outputs of multiple detection tools. Note that in the case of fragmented genomes or metagenomic data, because some parts of a genome may be missing, some of the genes may be missed. This lack of information might have a severe impact on pangenomic studies, for example, by switching a gene family from core to accessory because one or more copies of a given gene have not been sequenced or recognized. Additionally, it is important that the different genomes in the pangenome have been annotated with the same pipeline, to avoid biases due to the annotation strategy.
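
A quick way to compare detection tools is to count how many features of a given type each annotation reports for the same genome. The sketch below (file names are hypothetical) counts CDS features in two GFF3 files, such as those produced by Prodigal and Prokka:

    def count_features(gff_path, feature="CDS"):
        # Count rows in a GFF3 file whose third column matches the feature type
        count = 0
        with open(gff_path) as fh:
            for line in fh:
                if line.startswith("#"):
                    continue
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 3 and fields[2] == feature:
                    count += 1
        return count

    print("prodigal:", count_features("genome.prodigal.gff"))
    print("prokka:  ", count_features("genome.prokka.gff"))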

Some tools may require an annotated version of the genome, in which case it is important to ensure that the annotations have been generated using the same version of the reference file or the same CDS predictor tool. Be mindful that using manually curated or in silico annotations for defining units affects all further steps. Even more importantly, pangenomics can be carried out on any unit of information, from genes and CDSs to sequence chunks [53], making consistency in annotation units and techniques necessary for a meaningful comparison.

Tip 4: Use a homology detection approach that is coherent with your input: pay attention to parameters

Homology detection is the key step for clustering the single genetic sequences spread across the input genomes into gene families. If you are focusing on functional clustering, homology of the amino acid sequences is preferred over that of the nucleotide sequences, because amino acid information more closely represents the secondary structure of the proteins. However, you may focus the analysis on purely evolutionary considerations and compute nucleotide similarity, which does not always relate to secondary structure because of possible frameshifts.

When performing a new pangenomic analysis, it is advisable to employ multiple approaches for homology detection rather than relying solely on one. Finding a consensus between different results will make your results more reliable.

In most cases, tools use alignment-based sequence similarity approaches, where similarity is computed by local aligners such as BLAST+ [54]. Although not all tools provide a direct measure of the significance of sequence similarity, when possible we suggest employing a significance threshold for the adjusted e-value of 0.005 [55], rather than the traditional, too permissive 0.05 threshold.
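
For example, when working with BLAST+ tabular output (-outfmt 6, whose eleventh column is the e-value), hits can be filtered with the stricter threshold as in the following minimal sketch (the input file name is hypothetical):

    E_VALUE_CUTOFF = 0.005  # stricter than the traditional 0.05

    with open("all_vs_all.blast.tsv") as fh:
        hits = []
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if float(fields[10]) <= E_VALUE_CUTOFF:  # column 11: e-value
                hits.append(fields)
    print(f"{len(hits)} hits pass the e-value cutoff")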

You should also consider that alignment-based measures are not always the best solution and that alternative sequence similarity measures could be employed, especially if the tool implementing them aims at reducing the number of user-defined parameters [56, 57]. Such approaches aim to infer the optimal similarity threshold, which can be specific to a pair of genomes or to a given gene family. Otherwise, when you use a pangenomic analysis tool, it is essential to be mindful of any parameters it implements and their impact on the results, since using different parameters drastically affects the size of the final pangenome [58]. Even when you are employing a parameter-free tool, such as PanDelos [57] or Edgar [56], make sure you completely understand the approach used. If the tool determines the required thresholds from the data (as is the case for Edgar), that is a reasonable approach; but if the parameters are fixed under the hood by the developer, the tool is not truly parameter-free, you are just not allowed to tweak the underlying values, which might result in very unreliable results. Parameters are usually required to account for different characteristics of the genomes under study. Understanding the level of similarity and the average nucleotide identity (ANI) [59] expected between related genes is essential for setting parameters for sequence identity and coverage [60], which are often set to default BLAST values, in other cases to no lower than 70% sequence similarity, and in some extreme cases to about 50% coverage and identity. These may vary largely depending on the level of similarity between the genomes under study, whether they belong to the same species, group, or higher taxonomic level [53], although most commonly pangenomics refers to species-level analyses. Additionally, it is worth noting that different species might have varying evolutionary rates, meaning that within-species similarity can have different implications across different species [61]. Unfortunately, the majority of existing approaches have no specific procedures or parameters for the level of similarity between paralogous sequences. The only current approach that defines a specific treatment for them is PanDelos [57], which requires the sequence similarity between paralogs to be equal to or greater than the similarity between the two most similar orthologous genes of two genomes.

To conclude, the construction of a synthetic benchmark by simulating bacterial evolution [62,63,64] can support the choice of a specific tool, especially when the synthetic evolutionary parameters reflect the phylogenomic composition of the studied population. In fact, even if there is evidence that some tools generally perform better than others, the most reliable of them are sometimes discordant because they are suited to specific experimental conditions, such as the level of evolutionary distance between the analysed genomes. Such conditions can be simulated in silico, and tools can be run over the resulting artificial benchmarks to understand which solution works better under those conditions. Evaluating the output of a clustering procedure is not a trivial task. Several measures can be used as an indication of the divergence between the output and the expected clusters (see [65] for some examples of such evaluation criteria). However, until a ground truth is available, the match between found and real clusters remains unverified. Synthetic benchmarks provide such a ground truth, because the expected outcome is known by construction of the benchmark, in contrast with experiments on living organisms; they thus allow for a supervised validation of the obtained clusters.
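
As a minimal sketch of such a supervised validation, assuming hypothetical family labels for the same ordered list of genes (the benchmark's ground truth versus a tool's prediction), a clustering agreement measure such as the adjusted Rand index can be computed as follows:

    from sklearn.metrics import adjusted_rand_score

    # Hypothetical labels: the i-th entry is the family assigned to the i-th gene
    truth = ["fam1", "fam1", "fam2", "fam2", "fam3", "fam3"]
    predicted = ["A", "A", "B", "B", "B", "C"]  # "fam3" partially merged into "B"

    print(f"adjusted Rand index: {adjusted_rand_score(truth, predicted):.2f}")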

Tip 5: Be aware of the available output formats and visualization options

Given a set of genomes composed of genetic sequences, the essential information of a pangenomic analysis is the clustering of the genetic sequences into gene families. Starting from such cluster compositions, a gene of each cluster is usually selected as the representative of its family. A piece of information derived from such clustering data is the presence-absence matrix, also called the pangenomic matrix, which reports the presence or absence of each identified gene family in each of the input genomes. No standards are currently employed for the storage and representation of cluster compositions and presence-absence matrices, which usually come in the form of raw text files.
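
As a minimal sketch of this derivation, assuming a hypothetical clustering result that maps each gene family to the genomes its member genes belong to, the presence-absence matrix can be obtained as follows:

    # Hypothetical clustering output: family -> genomes of its member genes
    families = {
        "fam1": ["gA", "gB", "gC"],  # present in all three genomes (core)
        "fam2": ["gA", "gA", "gB"],  # two paralogs in gA, absent from gC
        "fam3": ["gC"],              # singleton
    }
    genomes = sorted({g for members in families.values() for g in members})

    matrix = {
        fam: [1 if g in members else 0 for g in genomes]
        for fam, members in families.items()
    }
    print(genomes)
    for fam, row in matrix.items():
        print(fam, row)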

Some available tools might help you visualize the presence-absence matrix, as well as other aspects such as gene family composition, the inferred evolutionary history, and the distribution of families along the genomes [66]. These tools are not intended to provide a pangenomic content discovery methodology but only to visualize the results and to run downstream analyses. Thus, you need to convert the output of content discovery tools to meet the visualization platform’s input formats. In contrast, other solutions, such as Roary [60], already provide embedded visualization instruments. Unfortunately, the methodologies for homology detection that perform better in general conditions [61] and on fragmented genomes [44] are still strictly focused on the clustering step and lack downstream/visualization instruments.

Finally, as has been said elsewhere for pathway enrichment analysis [33], keep in mind that different visualization techniques can highlight or hide different results, even in pangenomics.

Tip 6: Critically evaluate the resulting gene families

Always be ready to critically evaluate the output produced by a specific methodology. Because of their intrinsic behavioural differences, different tools may produce different gene family compositions that need to be evaluated case by case. No tool is able to provide the correct output in every case.

Given an output gene family, always compare the number of genetic sequences that compose the family with the number of input genomes. When doing this, be aware that paralogs increase the family size but not the number of genomes the family is present in, also called its diffusivity [57]. Hence, some tools may merge two or more families, with a resulting increased family size, while still preserving the diffusivity of the most diffused family. To catch this aspect, it is always good practice to evaluate the functional coherence of the genes of the family and to analyze and visualize the phylogenetic relationships among the genes. Functional coherence can be evaluated by comparing functions assigned to genes, which can be already known or predicted via tools such as Prokka [52]. Moreover, in case a particular genetic biomarker is already known for the species involved in your study, you should check the presence of this gene across genomes. However, note that gene families are computed as a strict partition, which means that a gene may belong to only one gene family. This aspect excludes the possibility of representing gene fusions [67,68,69,70], or any other evolutionary event that is not embedded in the problem modelling of a given tool.
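
A minimal sketch of the size-versus-diffusivity comparison described above, using a hypothetical clustering output in the same format as the earlier sketch, could flag suspicious families as follows (the flagging ratio is an arbitrary illustrative choice):

    families = {
        "fam1": ["gA", "gB", "gC"],
        "fam2": ["gA", "gA", "gA", "gA", "gB"],  # size 5, diffusivity 2
        "fam3": ["gC"],
    }

    for fam, members in families.items():
        size, diffusivity = len(members), len(set(members))
        if size > 2 * diffusivity:  # many paralogs, or erroneously merged families
            print(f"{fam}: size {size} vs diffusivity {diffusivity} -> inspect manually")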

When performing these critical evaluations, keep in mind that public resources may carry misidentified genomes, as well as incomplete or incorrect gene functional annotations [71,72,73]. Moreover, because of the intense horizontal exchange of genetic material among microbial organisms, the actual scheme of evolutionary events is closer to a “web of life” [5] than to a tree. However, the evolution of gene families can still be evaluated in terms of bifurcating trees. Thus, a single phylogenetic tree might be insufficient to evaluate the phylogenomic relationships of your organisms. Alternatively, phylogenetic analyses of specific strains by means of core genes may be more informative and reliable than whole-genome comparisons. This analysis can be performed by using alignment-free tools, such as CVTree [74], on the sets of sequences of core genes, or by multiple sequence alignment of them [75].

Thus, be open to collaboration: pangenomics is a field that requires computational expertise (the analyses can be computationally demanding) but also necessitates a good understanding of the species being analyzed, both to start from a biological question that is relevant and to understand the results.

Tip 7: Make your pangenomic analyses open and reproducible

Reproducibility has become a pivotal topic in bioinformatics in recent years: with the availability of massive computational resources (in terms of microprocessor capacity and memory available for the computation) and fast computers, reproducing a computational analysis has become easier and more accessible, even for bioinformatics beginners [76, 77]. Making a study reproducible is also pivotal to allow external researchers to find possible mistakes in the computational pipeline, helping generate more robust results. Even if computational resources for scientific reproducibility are available at low cost worldwide, bioinformatics analyses can be replicated and reproduced only if open science best practices are taken into account:

  (a) The usage of open-source programming languages and software platforms;

  (b) The sharing of data publicly online;

  (c) The sharing of your open software code publicly online;

  (d) The publication in open-access journals.

Open-source programming languages and software platforms, such as R or Python, in fact, are necessary to make a pangenomic analysis reproducible by anyone, since they are free and have an open license. The R statistical computing language, in particular, can rely on two bioinformatics platforms which supply a large number of software libraries for computational biology analyses: Bioconductor [78] and Bioconda [79]. Bioconductor provides the PanViz [66] software library for pangenome visualization, and Bioconda furnishes the PPanGGOLiN [80, 81] software package for pangenome partitioning and the PanTools [82], Pangenome Graph Builder (PGGB) [83], PanX [26], Pagoo [84], and pgr-tk [85] software libraries for pangenomic data analysis.

A few packages are available for Python as well [86]. The recently-released programming language Julia also provides a software library for pangenome graph creation [87]. Regarding application programming interfaces (API) and visualization tools, we mention ODGI [88].

In contrast, the usage of proprietary programming languages makes the replication of an analysis doable only by people who hold the corresponding license.

Releasing software code on online platforms such as GitHub [89] and GitLab [90], moreover, can enhance the reproducibility of a study [91]. Sharing data online is another key component of reproducibility: a pangenomic analysis can be re-performed openly only if its datasets are available online to anyone without restrictions. Therefore, we suggest publishing your raw and processed datasets in open online repositories such as Gene Expression Omnibus (GEO) [92], ArrayExpress [93], Sequence Read Archive (SRA) [94], Kaggle [95], Figshare [96], Zenodo [97], or the University of California Irvine Machine Learning Repository [98], following the FAIR (Findability, Accessibility, Interoperability, and Reusability) data sharing principles [99]. In case you are implementing a new tool for pangenomic analysis, the use of synthetic benchmarks is crucial to allow a quantitative evaluation of the results and a fair comparison with other tools.

Similarly, regarding the paper writing and publishing, we recommend submitting your article to an open-access journal. Once published, your article will be available to be read for free by anyone in the world, even in the least developed countries. A list of open-access journals in bioinformatics can be found on the ScimagoJR website [100].

Conclusions

In the context of bacterial and, more generally, microbiome research, pangenomic studies exploit advanced bioinformatics tools to explore the genetic content of various organisms, providing valuable insights into genetic diversity and evolution. A core procedure is the clustering of the genetic sequences spread along the input genomes into gene families by means of homology computation. This step of computational pangenomic pipelines significantly affects results and downstream analyses. By following the seven tips outlined here, researchers can enhance the reliability and reproducibility of their pangenomic analyses. Ensuring clear research questions, high-quality input data, appropriate annotation strategies, and critical evaluation of results are fundamental steps. Additionally, utilizing open-source tools and sharing data openly is crucial for advancing the field and fostering collaboration. Ultimately, these practices contribute to a more thorough and accurate understanding of genetic landscapes, paving the way for future discoveries and innovations in microbial research.