Introduction

The use of the rat (NCBI Taxon ID 10116, Rattus norvegicus), as a model organism in research has a long history and started in the nineteenth century. The rat is considered to be the first organism to have been domesticated for scientific research purposes (Modlinska and Pisula 2020; Richter 1959). While some sporadic experiments using rats were done in the first part of the nineteenth century, such as an experiment on fasting in 1829 (Hedrich 2000; Modlinska and Pisula 2020), the systematic use of rats has started in 1856 (Philipeaux 1856), with studies on the impact of adrenalectomy (Philipeaux 1856) followed by studies on nutrition (Savory 1863), behavior, neurology, and others (Bergman et al. 2000; Watson 1914; Jonckers et al. 2011; Logan 1999). In some fields, rats outperform mice, mainly because of their bigger size, allowing larger biopsies and more repetitive (blood) samples to be taken, higher resolution imaging (e.g., in cancer therapy Bergman et al. 2000) and better precision in surgery (Jonckers et al. 2011; Modlinska and Pisula 2020). In cardiovascular research and neurobiology, rats are often preferred above mice, based on size and the amount of already available datasets and papers (Ellenbroek and Youn 2016; Leong et al. 2015). On the other hand, rat research is more expensive than mouse work, as rats require larger cages and more space. Like in the mouse field, rat research is mostly performed in rat-inbred strains, but outbred lines, typically yielding individuals with less defined genetic background, are also used (Sharp and Villano 2013).

The laboratory rat is the second most used animal model organism of all species; for example, in the European union rats accounted for 12.2% of all animals used in 2017, preceded only by the mouse (60.8%). This popularity was one of the reasons why the rat was selected to have its genome sequenced with priority, which was possible via a whole genome shotgun sequencing approach. The first draft of the rat reference genome was published in 2004 by the rat genome sequencing consortium (Gibbs et al. 2004). This genome was derived from the inbred Brown Norway strain (BN/SsNHsd), which was selected by the community and inbred another 13 generations at Medical College of Wisconsin (to BN/SsNHsdMcwi) before sequencing 2 females of this strain (Gibbs et al. 2004).

However, several rat strains/stocks were specifically bred to yield and investigate specific phenotypes or diseases. For inbred lines, well-known examples are the spontaneously hypertensive rat (SHR) strains that develop spontaneous hypertension (Conrad et al. 1995) or the Zucker Rat strain develops obesity (Obrosova et al. 2006). In outbred stocks, the outbred heterogeneous stock rat is one of the best known, created from 8 founder lines and used to study complex traits (Woods and Mott 2017) (e.g., depression disorder (Holl et al. 2018)). It is thus important that genetic variation between the strains is investigated in order to help explain the genetic components underlying these and other phenotypes. The sequencing of 28 rat (sub)strains by Atanur et al. (2013) and the sequencing of 40 strains (including resequencing and/or reanalysis of the previous 28) by Hermsen et al. (2015) encompass all the data that are currently available. These can be accessed at the rat genome database (RGD, Smith et al. 2020), where also additional information can be found concerning gene ontologies, QTLs, phenotypes, variants, pathways, as well as various tools to use this information (e.g., ontology enrichment, pathway browser).

While the rat sequence variations data are available in the RGD database, the data are provided to the user in a variant browser or as compressed vcf files. The information provided for coding genes is limited in the sense that only the nucleotide variation and the effect on the coding sequence (missense, nonsense) is given while there is no detailed overview over the consequences of the amino acid variation on the protein function, or a way to see the combined effect on the final amino acid sequence at a given position in a certain strain, compared to the reference strain amino acid. The latter can be important, as multiple mutations may affect a single codon, it was even shown that many of the nonsense variants actually have a second SNP that leaves a missense instead of a nonsense variant (Hermsen et al. 2015). However, the RGD does provide information about non-coding variants, where our tool provides only details about protein-coding variants. This makes the RGD a highly complementary resource to the tool described in this work.

Since a single amino acid change can have impact on protein function, ranging from no impact to a mild change or ruining the proteins function entirely, the coupling of amino acid variations to predicted function changes is essential. We have addressed this problem in the past, studying all amino acid changes in all 42 sequenced mouse inbred strains, and combined this information in a user-friendly database, called ‘mousepost’ (Timmermans et al. 2017). In the current study, we have focused our attention on the analysis of protein-coding variants in the 40 sequenced rat stains, as compared to the reference strain BN/SsNHsdMcwi. We have focused on high-confidence homozygous variants in inbred strains and report those contributing to changes in amino acid sequence, viewed on a per codon basis. Note that several of these are derived from outbred stocks such as the WKY/N strain from Wistar rats https://rgd.mcw.edu/. Our data allow thus to link phenotypes of rat strains with amino acid polymorphisms, as well as studying all variant versions of a given protein across all sequenced rat strains, or a genome-wide strain-to-strain comparison. Finally, we have also focused on the potential defects in protein-coding genes in the reference strain BN/SsNHsdMcwi, by comparing its genome to a consensus genome, defined by the other 40 sequenced strains.

Results and discussion

Classification of variants and comparison with the reference rat genome

The first draft assembly of the rat (Rattus norvegicus) genome was released in 2004 (Gibbs et al. 2004) and has been improved since. The reference genome is derived from the BN/SsNHsdMcwi inbred rat strain. The genome of 40 other high-priority rat-inbred strains has also been sequenced (Hermsen et al. 2015). The rat genome browser (RGD, https://rgd.mcw.edu/) is a rich database which contains the reference rat genome sequence, rat phenotype information, protein–protein interaction data, nucleotide variants, from more than 100 frequently used rat-inbred strains, and outbred stocks. Most of the research of today is preferentially performed in inbred lines, because of the stability in time and space of an inbred genome, and the identical background of individual animals within a certain inbred strain. Outbred animals are often used for more general research and behavioral studies. The identification of protein-coding variations among the rat strains is important for the correct interpretation of research or may serve as a useful tool for genetic research.

Since even a single nucleotide sequence polymorphism (SNP) variation in a protein-coding gene can have a severe effect on the protein function, via an amino acid change or the formation of a nonsense mutation STOP signal, we decided to transform the tabular lists of SNPs, and their positions in the rat genome, in protein-coding genes, into amino acids changes in the corresponding proteins, for each protein across all 40 sequenced rat strains, with the reference genome as a standard.

Sequence variations of a given rat-inbred strain compared to the reference strain are found in ‘variant files’ in the RGD. As a first step, we processed the variants files to retain only high-quality events. For each strain, coding genes and SNP/indel events were filtered as to only retain relevant genes (containing at least one nucleotide variant in that strain) and events overlapping with exons. Transcripts were classified exclusively in one of three classes: (1) stop gained (SG), which are sequence variations leading to a new STOP codon (missense mutation), (2) stop lost (SL), which are nucleotide changes leading to a loss of a STOP codon and further translation of the mRNA and (3) non-synonymous variants (MUT), which are small insertions and deletions of maximally 200 bp and SNPs, as shown in Fig. 1. For each class, transcripts and all accompanying data and results are stored in a relational database for use in the web tool. An SG or SL variant takes precedence over the MUT class, so a gene containing an SG and any number of MUT events will have the label SG. These data also include splice variants, but only in the case of a mutation in the canonical splice donor or splice acceptor sites. Such sites are not specially annotated and are then seen as either intron retention or exon loss, and the variant CDS is constructed with these variations included. The final effect is most likely a frameshift and nonsense mutation, and these are included in the corresponding class (usually SG).

Fig. 1
figure 1

Data use flowchart: data are obtained from ensemble (genome annotations) and the rat genome database (reference genome, sequence variants). Protein-coding transcripts are filtered from the genome annotation and their exon sequences extracted from the genome. Only transcripts with at least one exon containing a variant are kept. Variants are likewise filtered so only those overlapping an exon are kept. We perform three analyses: First, we compare the strain-specific sequences of each individual strain to the reference sequence. Second, we construct a synthetic consensus reference from all strain-specific sequences using a majority vote method (per position), and the reference sequence is compared to this consensus in order to obtain genes that can be considered to be deviant in the reference strain. For these two analyses, protein sequences are classified as stop gain, stop loss, or mutated (non-synonymous variant(s)). Potential effects of mutated variants are estimated with PROVEAN. Finally, we perform a pairwise comparison between each of the non-reference strains to obtain a list of all differences. All these data are made available in the website “ratpost.be” and can be accessed through strain and class-specific listing or searched by gene, location, or gene ontology term

The impact of SG or SL variants was estimated by the length ratio, i.e., we assumed that the greater the deviation from the reference protein chain length, the more likely that there will be an impact on function. In general, a nonsense variant removes 46.6% of the total sequence (median value, mean: 47.2%), resulting in a high impact. For MUT variants, we applied the Protein Variant Effect Analyzer (PROVEAN) software to provide a score to a given variation (Choi et al. 2012). We are aware that the SL- and SG-classified transcripts may contain additional non-synonymous variants, but these were not analyzed further with PROVEAN. We made this decision by taking into consideration the additional information gained by performing a PROVEAN analysis compared to the time and the computational resources needed to obtain these scores. The variant sequences themselves are available in the web tool. The PROVEAN score indicates the chance that the variant amino acid will affect the protein function. The lower the score, the higher the chance of function loss. Based on previous papers from the authors of the PROVEAN software (Choi and Chan 2015a; Choi et al. 2012), we here considered PROVEAN scores lower than − 2.5 as leading to significant function loss. This is the point where the software shows the optimal balanced accuracy, i.e., where the sensitivity and specificity are both within acceptable limits. For PROVEAN, this means a sensitivity and specificity of about 80%. A higher specificity (90%) can be obtained by changing the score cut-off (− 4.1) at the cost of the sensitivity, which drops to 57%. We tested all 32,883 genes included in the ensemble genome annotation for the rat reference genome (41,078 transcripts). Our analysis showed that 12,172 distinct transcripts from 9715 genes have at least one natural variant in the included rat strains, which corresponds to 29.6% of the transcriptome or 29.5% of all genes. The total amount of different variants per strain and per class is shown in Table 1. As expected, the amount of non-synonymous variants (MUT) is the largest group. However, only a fraction of those (28.7%) are predicted to be potentially deleterious for protein function (PROVEAN score < − 2.5). The amount of nonsense (SG) variants in each strain compared to the reference strain is lower compared to the missense variants, with a maximum of 627 transcripts affected in any one strain (median: 558, IQR: 160). The amount of transcripts affected by a variant removing the stop codon is lower, with 67 or less transcripts affected per strain (median: 45, IQR: 1). While the number of non-synonymous variants is the highest, only 20–30% of those have a PROVEAN score of − 2.5 or lower, so the majority of them predicted not to influence protein function. The other BN strains (“BN/SsN”, “BN-Lx/Cub”, “BN-Lx/CubPrin”), which are closest related to the reference strain, display the lowest number of variants, as expected. In absolute numbers, the non-BN strains are all in the same range for amount of variant-containing transcripts (median(IQR): 599(43), 55(7), and 4971(518) for SG, SL, and MUT, respectively). In the 3 BN strains, the proportion of SG variants on the total is much higher than in the other non-BN strains. With an average of 278 transcripts that contain a stop-gain variants in the “BN/SsN”, “BN-Lx/Cub”, “BN-Lx/CubPrin” strains, 45.5% of all variant-containing transcripts in these strains fall in the SG category. For the non-BN strains, there are on average 532 stop-gain containing transcripts per strain, amounting to 9.5% of all variant-containing transcripts in each strain, on average. Many of the sequenced strains are also related closely to one another or are derived from a common stock; for example, the SHR strains. Compared to the reference strain, these have similar amounts of variants and many of these are shared between several or all these strains. In case of the SHR strains, 64% of all events (2035 of 3329) are shared between at least two and 23% between all of them (764 of 3329), as is illustrated in Fig. 2. Note that one of the SHR strains, SHR/OlaIpcvPrin has a larger amount of strain-specific variants (914) than the other three (43, 98, 139).

Table 1 Number of genes and transcripts per class
Fig. 2
figure 2

Overlap between MUT events in all transcripts (PROVEAN < − 2.5) in the SHR strains

Data availability

All data displayed in Table 1 are available through the searchable Ratpost (rat polymorphic sequence tags) web tool, which is available at ratpost.be. The database provides an overview table showing the number of SG, SL, and MUT variants per strain, with customizable cut-offs for length ratios and PROVEAN scores, and the numbers in the table are clickable. A listing of the strain-specific variants (versus the reference strain), divided by type, can also be obtained and exported in combination with various advanced filter options.

Ratpost provides several search functions that may assist with various questions. For example, (1) the tool provides the option to search for all variants of a particular protein-coding gene or transcript, across all 40 strains, or in a subset of these strains; (2) a search for variants, based on chromosomal location, can be helpful in the process of mapping and positional cloning of a trait; and (3) searching deviant protein-associated functions (by introducing a gene ontology term) is also possible. As a way of providing examples, which illustrate the power of the tool, we investigated several variants that are potentially interesting and/or relate to known (strain-specific) rat phenotypes. We provide three examples.

One of the best known variants in several mammalian species is the loss of function of tyrosinase (Tyr), resulting in an albino phenotype (Blaszczyk et al. 2005). In Ratpost, by introducing ‘Tyr’ as a search function, we recovered the previously identified Tyr variant (Blaszczyk et al. 2005) with a loss-of-function effect, the R299H mutation, which has a PROVEAN score of − 3.2 and is found in 32 strains. This is due to the fact that many strains that were sequenced are albino strains having the defect Tyr gene.

The Wistar outbred stock is well known as a model for diabetes since these rats develop diabetes spontaneously (Obrosova et al. 2006). Four inbred strains derived from Wistar rats have been sequenced (WKY/N, WKY/Nctrl, WKY/NHsd, WKY/Gla). Searching for genetic variants in genes related to diabetes (by entering the GO term ‘glucose metabolic process’) provides Gckr (glucokinase regulator) with the variant R149C (PROVEAN score − 4.9) found in three out of four strains (WKY/N, WKY/Nctrl, WKY/NHsd) as a likely polymorphism related to the diabetes phenotype. It has been shown that both a genetic and an environmental component are involved in developing the disease (Liu et al. 2015; Stumvoll et al. 2005), and this gene may contribute to the former. The gene itself is a strong candidate for maturity onset diabetes of the young in humans (O’Leary et al. 2016). Another hit in this analysis is Phip, which has the GO function ‘insulin receptor binding’ (GO) and is involved in promotion of development and survival of pancreatic β cells (Podcheko et al. 2007). The encoded protein has a W388L variant, (with a PROVIAN score of − 10.7) found in the GK/Ox strain, a standard rat model for rat type 2 diabetes (Liu et al. 2015; Stumvoll et al. 2005).

The SHR strains develop spontaneous and severe hypertension (Conrad et al. 1995). This phenotype was linked to several genomic loci. When interrogating Ratpost, we found a few dozen genes at these locations with predicted deleterious variants present in these strains, and so we were not able to clearly identify a major candidate gene (Smith et al. 2020), but interestingly, we found that these strains also have a mutation in the gene Mybphl (W159L, PROVEAN of − 12.7). The protein (Myosine-binding protein H-like) plays a role in cardiac function (Barefield et al. 2017) and may sensitize these rats for developing the cardiac issues (heart failure) that they are known to develop as a result of their the hypertension (Trippodo and Frohlich 1981).

As a final example of the value of the database, in Supplemental Table, we provide the results of a search with the GO term “inflammation,” which yield a variety of interesting deviant alleles present in several rat-inbred strains, e.g., in the genes Tlr3, Tlr10, Tlr12, or the Mineralocorticoid gene Nr3c2.

Variations in the reference and pairwise comparisons

As shown in Supplemental Figure, most variants are specific to only one strain, or to a rather small group of strains. However, it is clear that there are also some variants that are found in many strains, resulting in a situation where the alternative sequence is the one that occurs in the majority of strains that have been sequenced. This means that there are also genes where the reference strain is the one containing the variant or defective gene: similarly as the mouse reference strain C57BL/6J has been shown to harbor multiple variant (even defective) protein-coding genes compared to all other strains (Timmermans and Libert 2018), we categorized genes where the reference strain, BN/SsNHsdMcwi, contains a deviant allele. In Ratpost, we provide a way to query these variants, which are based on a comparison of the protein-coding sequences of the reference strain with a ‘consensus rat genome’: by using a simple majority-voting mechanism, the amino acid sequence that is supported by the most strains at each position of each protein is fixed and is considered a synthetic reference to which the BN/SsNHsdMcwi sequence is compared. Due to the nature of the data, the following considerations need to be taken into account: (1) as only a limited number of the available rat strains have been sequenced (40 plus the reference strain out of > 100 strains), the true consensus sequences cannot be known with certainty; (2) if the sample from the set of strains is biased, then the consensus sequence will be biased as well. This actually the case to some extent for the dataset: there are several closely related strains sequenced (SHR, WKy, and F344 substrains), it is not unlikely that if 1 SHR strain supports a certain variant, all strains will support this variant. However, as shown previously (Fig. 2), these related substrains still contain variants that are not shared by all of them. Overall, all the most common strains are represented in the dataset, while some bias cannot be excluded, we believe that this gives a high-quality overview of variation in commonly used inbred rat strains. For example, the defective mutant Tyr gene is present in a majority of the sequenced rat strains (32 out of 41), and would be considered as the consensus sequence, so the wild-type Tyr sequence of BN/SsNHsdMcwi would wrongly be identified as deviant. These facts mean that, while potentially very informative, care needs to be taken when interpreting the results of this BN/SsNHsdMcwi versus the consensus reference. Such variants, with the exception of stop losses (these need to have the 3′ UTR added to detect the presence of a new stop codon in some cases), can be derived from the Ratpost database that was constructed as described in the previous section.

A list of reference-strain-specific variants can thus be obtained in Ratpost. Only variants where more than 50% of the strains differ from the reference are included in this analysis. We also provide an agreement score (S = M1 × M2), which is a number between 0 and 1 that shows how many strains (1) agree with the consensus sequence (metric M1 = C/N, C: number of strain supporting consensus, N: total number of strains) and (2) how many other variants there are for that position that are not the consensus sequence and not the same as BN/SsNHsdMcwi (metric M2 = 1 − B/N; B: number of strains supporting the BN/SsNHsdMcwi sequence, N: total number of strains), the closer this number is to “1”, the more unique the BN/SsNHsdMcwi sequence is. When performing this analysis, we find 411 transcripts where the BN/SsNHsdMcwi has a stop-loss variant compared to the consensus sequence and 49 where it has a stop-gain mutation compared to the consensus. For non-synonymous variants, we find a total of 5981 variants that differ in the reference rat strain compared to the synthetic consensus sequence. Assuming that the variants with a PROVEAN score of < − 2.5 in the reference are negatively affecting the protein function, there are 1191 variants present that have a negative effect on protein function (out of 5981 in total). It is important to take into account the fact that the variants found in BN/SsNHsdMcwi compared to the consensus sequence across all strains are not always the correct ones, it is not because the large majority of the sequenced strains have a certain agreement sequence that this sequence is the functional one. These results also show that it is important to keep the genetic background of the used animals when using the rat reference genome, which contains its own set of defective alleles. Of the variants with low PROVEAN scores, there are 52 (in 37 transcripts) with an agreement score of S = 1, i.e., unique to the reference strain (see Table 2). One example is a variant unique to the reference that is the gene coding for proline dehydrogenase 1 (Prodh1) which has a G373C substitution with a PROVEAN score of − 8.3 as well as a second substitution: G312C with a score of − 8.6. While the reference strain has not been extensively tested in this regard, it is known that mutations in his gene in humans lead to Proline Dehydrogenase deficiency (Afenjar et al. 2007). Depending on the residual activity phenotypes can be very mild to severe, manifesting as hyperprolinemia with cognitive defects in severe cases (Afenjar et al. 2007; Perry et al. 1968). A second interesting example concerns the Dclk1 gene, also specific to the reference strain, which has two amino acid substitutions compared to all other strains, with low PROVEAN scores (L269F: − 3.8 and V268D: − 6.6). Previous work has shown that the reference rat strain precursor (BN/SsNHsd) has an abnormal pre-pulse inhibition (Palmer et al. 2003). This trait was linked to two regions in a QTL study (Palmer et al. 2003). The chromosome 2 region encompasses the Dclk1 gene. This gene belongs to a family of three genes (along with Dcx, Dclk2) which are all three important for normal neuronal functions (Dijkmans et al. 2010). Dckl1 has been proven to be important for proper neuronal development in mice, based on full KO phenotypes (Koizumi et al. 2006). The available data thus make the abnormal Dclk1 protein in BN/SsNHsd a good candidate to explain its pre-pulse inhibition trait.

Table 2 A list of variants that are unique to the reference strain with a PROVEAN score < − 2.5

Finally, the Ratpost tool also allows the direct comparison of any two strains. This comparison allows only a list of differing transcripts and their differences as it is not informative to provide a classification in this case. For example, Tyr gene can easily be identified by comparing protein-coding gene sequence differences between the albino strain SS/Jr and the pigmented strain BN/SsN.

In conclusion, a searchable online repository (which will be updated yearly) of variant alleles of protein-coding genes is now available at Ratpost and can lead to extensive exploration and exploitation of the naturally occurring mutant variants fixed in the 40 sequenced rat strains. Our analysis and database concerns protein-coding genes only, and an extension toward non-coding RNAs and a link towards mRNA expression levels might be considered in the future. Combined with the efficient CRISPR/Cas-based mutagenesis, the availability of these naturally occurring sequence variations in these 40 rat strains is an added value to couple sequence variations to phenotypes, and moreover, CRISPR/Cas could be considered to correct some unwanted variations in the reference strain, upgrading it to a higher standard reference.

Methods

Sequence variation data

We obtained the sequence variation data (SNPs, insertions and deletions) of the inbred strains from the ftp site of the rat genome database (Hermsen et al. 2015; Smith et al. 2020). Data were converted to bed (vcf2bed) and filtered with bedops (Neph et al. 2012) to retain only events overlapping with exons of coding genes, by using the element of 1 option with the variant file as reference and exon locations as query input. All data from the RN5 version of the rat genome were converted to RN6 with the use of the picard tools (LiftoverVcf) (Institute 2019) using the RN5 to RN6 chain file obtained from UCSC.

Reference genome and annotation

We were constrained by the variation data to use the RN6 version of the reference genome. The sequences and corresponding annotation (gene transfer format; gtf) were downloaded from ensemble ftp server. The gtf file was converted to bed (gtf2bed) and filtered to only retain exons from transcripts containing at least one variant. For this, the filtered set of variants obtained as described in the above section was used as the query for bedops—element of 1, with the exon positions as the reference. Exon sequences were extracted from the reference genome fasta file with the getfasta command from bedtools using the filtered transcript set.

Transcript classification

We used an updated version of the script used in mousepost (Timmermans et al. 2017). The cDNA sequences were constructed using from the previously constructed exon sequences and split up in the 5′ UTR, CDS, and 3′ UTR. For each strain, the sequence variations present for that strain were applied to the CDS. The in silico mutation of the sequence was done using a custom made perl script from the 5′ end to the 3′ end of the sequences in order to easily take into account sequence offsets from indels. All genomic positions were converted to cDNA/CDS positions internally and applied in sequence, negative-strand sequences were reverse complemented for further processing. Finally, the CDS was translated to the protein sequence, using the standard vertebrate coding table. Classification was performed by comparing the reference and alternate aa sequences into three classes (SG, SL, MUT). For SG, any offsets from indels were taken into account, and SL variants were called on the loss of the canonical stop codon, in which case the 3′UTR, with variants applied (same as CDS), was appended to the CDS and a scan for a new stop codon was done. All variants not classed as SG/SL were placed as MUT, and all processing was stopped after encountering a SG-classed variant. Transcripts were assigned to the most severe variant class. We processed all strains in parallel.

BLAST databases

The BLAST+ program suite, v2.2.30, was pre-installed on the HPC infrastructure, and BLAST databases for the blast tools were obtained from the NCBI ftp site. The database with non-redundant protein sequences for use with PROVEAN was downloaded November 16, 2016 (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). This will be used by the PROVEAN software, more specifically, the psi-blast program.

PROVEAN

Several programs have been developed to estimate the effect of a given sequence variation on the function of the protein. Because PROVEAN is able to interpret small insertions and deletions, this tool was selected (Choi and Chan 2015b). First, a file for each transcript in each strain was constructed with the mutated positions. For this, a global pairwise sequence alignment was constructed between the reference and alternative transcript with needle (EMBOSS tools) (Madeira et al. 2019). This alignment file, along with the positional data from the classification step, was processed with a perl script that was specifically created to build these files. To keep run times within acceptable limits, the PROVEAN tool was run on a high-performance computing (HPC) cluster for each strain in sequence; in this way, we could save and reuse the supported sequence sets. Due to the high computational requirements and limitations in HPC access, some variants were excluded from PROVEAN analysis: all variants from strain pairwise comparisons were excluded, as the PROVEAN scores are only truly relevant in comparison to a reference and variants from SL- and SG-classed transcripts were also not processed with PROVEAN. The score provided by PROVEAN depends in part on the number of available sequences for a given transcript. We have followed the suggestions of the authors of the PROVEAN tool and show a waring if the number of sequences is < 50, which may give unreliable results, and we use a score of − 2.5 as a ceiling cut-off for potentially deleterious variants (Choi and Chan 2015b).

Gene ontology

The gene ontology (GO) annotation was downloaded from the gene ontology consortium website (www.geneontology.org). We processed this file to obtain the GO terms for all genes in our dataset to allow the GO search functionality in the web tool.

Data storage & presentation

All results from the analyses are stored in a MySQL relational database. Database tables are interlinked and have been optimized with indexes to allow fast searches. This database is made accessible through the website ratpost.be, built on the bootstrap 4 framework php and javascript. All tables found are created by use of the DataTables plugin.