Introduction

Lampreys and hagfish are the only two extant groups of jawless fish (Agnatha). As such, they are central to our understanding of the early evolution of vertebrates, offering the possibility of finding simpler versions of complex physiological systems observed in mammals. For example, it was long ago determined that lampreys have single-chain hemoglobins and not the tetrameric kind found in most vertebrates (Wald and Riggs 1951). Similarly, early biochemical studies suggested that lampreys have a simpler blood clotting scheme than do higher vertebrates and one that is limited to the so-called “extrinsic” clotting system (Doolittle and Surgenor 1962). The same underlying question has been asked about numerous other biochemical systems: namely, can the early evolutionary stages of complex systems found in higher vertebrates be better understood by examining the situation in lampreys or hagfish?

Over the years, we and others have been attempting to trace the appearance of the various genes involved in vertebrate blood coagulation. Currently, the most direct method of attempting such evolutionary reconstructions is the examination of whole genome DNA sequences (WGS) of representative organisms. In this regard, high-quality genomes are available for several jawed fish, frogs, lizards, birds, and numerous mammals, and it has been possible to detail many of the events that have made mammalian blood clotting so complex (Davidson et al. 2003a, b; Jiang and Doolittle 2003; Ponczec et al. 2008). Unhappily, whole genome sequences for lamprey and hagfish have been late in coming and are still imperfect, especially for cases where a consideration of gene absence is critical.

Several years ago, we attempted to circumvent the problem of not having a lamprey WGS by exhaustively studying individual DNA sequences stored in the NCBI Trace Archive, which, in the case of the sea lamprey (Petromyzon marinus), is an especially rich trove of raw sequence data. As it happened, the concurrent appearance of a draft assembly based on the same Trace data made it possible to make judgments about the presence and likely absence of the various clotting factors found in mammals, and in the end we cautiously proposed that lampreys have a reduced set of such genes (Doolittle et al. 2008).

In particular, the data suggested that lampreys lack genes for factors VIII and IX, two genes that are essential for the “intrinsic” clotting system in higher vertebrates and defects in which are responsible for hemophilia in humans. It was already known that genes for these two proteins are present in jawed fishes like the pufferfish (Davidson et al. 2003a; Jiang and Doolittle 2003) and zebrafish (Hanumanthaiah et al. 2002), and the proposal was that in the interval between the divergence of the jawless and jawed fish more or less simultaneous duplications of genes for factors X and V gave rise to those for factors IX and VIII. Crucially, factors VIII and IX interact with each other in the same way that factors V and X do, the first-named of each pair (VIII and V) serving as a large molecular weight co-factor for the second-named vitamin K-dependent protease (factors IX and X).

The study also revealed that lampreys have two genes for factor X and three for factor VII, vitamin K-dependent factors at the heart of the “extrinsic” clotting system. The main shortcoming of the work was that it was not possible to link together all the various trace sequences for every one of the clotting genes, as would be required for proving absence.

In the interval since that report, two different assemblies for lamprey genomes have appeared. The first (Smith et al. 2013) was a new assembly for P. marinus but one that was based on sequences that also appear in the Trace Archive and were used in the 2007 draft assembly. Although the new assembly significantly increased the lengths of scaffolds relative to the contigs generated in the earlier draft, it was actually slightly less complete as measured by base pairs provided (Table 1). The other recently reported assembly was for the closely related Japanese lamprey (Lethenteron japonicum) (Mehta et al. 2013). It has only a slightly higher coverage, but the much longer scaffolds provide vital overlaps for joining segments found in the other databases (Table 1). The lamprey genome has been estimated to contain between 1.6 and 2.2 billion bp (Gregory 2005), suggesting that the average coverage in both of the new assemblies may be as low as 60 % (Table 1). This may be misleading, however, because gene-rich regions of the genome were likely easier to sequence than gene-poor, high-repeat regions.

Table 1 Sources of sequence data used for re-constructing lamprey clotting genes

In spite of their limitations, in the work reported here, these assemblies have been used in combination with the abundant short sequences stored in the Trace Archive collection to reconstruct fully most of the genes and proteins involved in the lamprey coagulation pathway. All the evidence supports the earlier conclusions about the absences of factors VIII and IX and the presence of additional factors VII and X in lampreys. The biggest challenge was that most of the gene products of interest are the result of gene duplications that took place about the same time as the appearance of vertebrates, exacerbating the problem of distinguishing orthologs from highly similar paralogs. The cases of coagulation factors V and VIII are even more problematic because of their being members of the ferroxidase family of proteins, which includes ceruloplasmin and hephaestin proteins, themselves composed of three major domains that are the result of (tandem) duplications.

Methods

Various lamprey databases were downloaded onto an in-house computer. These included current versions of the Trace Archive for P. marinus, which, in addition to the vast numbers of random shotgun reads, also contains numerous sequences from non-genomic sources, including extensive cDNA and EST entries. The Trace Archive also contains many mate pairs that can be used to establish neighborliness. As noted in our earlier report, we had also downloaded a 2007 draft assembly based on the same Trace data from ftp://genome.wustl.edu/pub/petromyzon_marinus. More recently, a 2012 top-level assembly (subsequently updated to 2015 versions) was downloaded from http://www.ensembl.org, as well as the 2013 release of the L. japonicum assembly from http://lampreygenome.imcb.a-star.edu.sg/.

BLAST software (Altschul et al. 1997) was downloaded from the NCBI website; tblastn was used for searching amino acid sequences against raw DNA data. Phylogenetic reconstructions were made by a distance-matrix method (Feng and Doolittle 1996) as well as with a parsimony procedure (Doolittle and Feng 1990). Trees were drawn on the PHYLODENDRON website http://iubio.bio.indiana.edu/treeapp/treeprint-form.html.
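A search of this kind can be sketched as follows. This is a minimal illustration and not the procedure actually used: it assumes the current BLAST+ command-line `tblastn` with tabular output (`-outfmt 6`), which may differ from the BLAST release used for this work. The E-value cutoff and any file or database names passed in are hypothetical.

```python
import subprocess
from typing import NamedTuple

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")


class Hit(NamedTuple):
    query: str      # protein query (e.g. an exon-sized human segment)
    subject: str    # DNA read, contig, or scaffold that was matched
    pident: float   # percent identity of the translated match
    length: int     # alignment length in residues
    sstart: int     # start of the match on the subject DNA
    send: int       # end of the match (less than sstart on the minus strand)
    evalue: float


def tblastn_command(query_fasta, db, evalue=1e-5):
    """Build the tblastn argument list; the cutoff is an assumed value."""
    return ["tblastn", "-query", query_fasta, "-db", db,
            "-evalue", str(evalue), "-outfmt", "6"]


def parse_tabular(text):
    """Parse tab-separated -outfmt 6 output into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        hits.append(Hit(row["qseqid"], row["sseqid"], float(row["pident"]),
                        int(row["length"]), int(row["sstart"]),
                        int(row["send"]), float(row["evalue"])))
    return hits


def run_tblastn(query_fasta, db):
    """Run tblastn (requires a local BLAST+ installation) and parse the hits."""
    result = subprocess.run(tblastn_command(query_fasta, db),
                            capture_output=True, text=True, check=True)
    return parse_tabular(result.stdout)
```

Keeping the parser separate from the call to the external program means that saved search results can be re-read later without repeating the search.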

In one case, cDNA prepared from sea lamprey liver was subjected to PCR to obtain an overlap between two key segments of the lamprey factor V gene. Primers were based on sequences found in the Trace database. I am grateful to Russell Darst for conducting these experiments in the laboratory of Dr. Lorraine Pillus.

Searching Strategy

As described in our earlier report (Doolittle et al. 2008), the exploration began by searching relatively short (exon-sized) sequences from human clotting factors against all entries in the 2007 Trace Archive. Identical (or near identical) sequences were extracted from the individual reads and clustered in “hit-groups.” As an example, the preliminary search for sequences corresponding to clotting factors V and VIII yielded 257 hits, which after accounting for redundancy reduced to about 50 “hit-groups.”
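The grouping of redundant hits can be sketched with a simple greedy clustering. This is only an illustration under stated assumptions: the similarity measure (`difflib.SequenceMatcher`), the 0.9 threshold, and the function names are hypothetical choices, and the original grouping may well have relied on manual inspection.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1] between two translated hit sequences."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_hits(sequences, threshold=0.9):
    """Greedily assign each sequence to the first group whose
    representative (its first member) is at least `threshold` similar."""
    groups = []
    for seq in sequences:
        for group in groups:
            if similarity(seq, group[0]) >= threshold:
                group.append(seq)
                break
        else:
            # No existing group is close enough: start a new hit-group.
            groups.append([seq])
    return groups
```

Greedy clustering depends on input order, so borderline members are best checked by eye, particularly when paralogs are highly similar.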

The challenge from that point on was to arrange the hit-groups in the proper order in which they occur in their parent proteins, either with the aid of other data or by homology with factors from other vertebrates. In the cases of some of the clotting factors, longer non-genomic sequences (cDNA and/or ESTs) are available, including some especially useful entries added to the Trace Archive after our first report. The 2007 draft lamprey assembly cited above had been used to link about half of the remaining hit-groups to contigs, and the posting of the 2012 assembly with its longer scaffolds linked a few more. In a few cases, connections in contigs and scaffolds revealed segments that had been missed in the initial searching. At this point, all the collected matches for the sea lamprey were re-searched against the 2015 Trace Archive, as well as against the 2013 assembly for the Japanese lamprey. The latter has very long scaffolds and provided additional linkages between contigs and scaffolds from the sea lamprey assembly. Exon–intron boundaries were determined on the basis of the GT-AG rule for the start- and endpoints of introns.
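The GT-AG rule lends itself to a short sketch. The code below assumes 0-based, half-open coordinates on the coding strand and a hypothetical search window around approximate exon edges taken from a protein-to-DNA alignment; in practice a boundary must also preserve the reading frame, which is not checked here.

```python
def is_gt_ag_intron(genomic, start, end):
    """True if genomic[start:end] begins with GT and ends with AG."""
    intron = genomic[start:end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def candidate_introns(genomic, exon1_end, exon2_start, window=3):
    """Return (start, end) intron coordinates within `window` bases of the
    approximate exon edges that satisfy the GT-AG rule."""
    candidates = []
    for start in range(max(0, exon1_end - window), exon1_end + window + 1):
        first_end = max(start, exon2_start - window)
        last_end = min(len(genomic), exon2_start + window)
        for end in range(first_end, last_end + 1):
            if is_gt_ag_intron(genomic, start, end):
                candidates.append((start, end))
    return candidates
```

When more than one candidate is returned, the choice would fall to the one that keeps the flanking exons in frame and best matches the homologous protein.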

Results

The Ferroxidase Gene Family

The initial evidence for there being only four genes in the hephaestin-ceruloplasmin-factor V–VIII family was based on there never being more than four different sequences for any set of aligned segments, and only one of those being more similar to factors V and VIII (Doolittle et al. 2008). The weakness of the proposal was that fewer than half of the segments (based on reads from the Trace Archive) could be linked together by cDNA sequences or the draft assembly. Two important developments have taken place since that first effort. First, some long non-genomic DNA sequences have been added to the Trace Archive that link together many of the trace reads for three of the ferroxidase family sequences and demonstrate unequivocally that none of the three has an intervening B domain, so that none can be either factor V or factor VIII. Second, an assembly for the Japanese lamprey with its very long scaffolds showed that these same three constructs are in complete agreement and located on separate scaffolds (Fig. 1). The corresponding exons for these three genes are very similar in length but the introns separating them vary greatly (Figs. 1, 2a–c). The same is true for a comparison of the lamprey ceruloplasmin gene with its human counterpart (Daimon et al. 1995), the exon lengths corresponding closely but the introns exhibiting significant differences. Two of the constructs also have membrane-spanning segments at their carboxyl ends and must be hephaestins. The third does not have such a segment, as would be expected for ceruloplasmin.

Fig. 1

Gene structures of four ferroxidase family members (hephaestin-1, hephaestin-2, ceruloplasmin, and factor V) in L. japonicum. Locations of “hit-groups” from P. marinus are shown as small boxes: blue boxes mean sequences available for both species; orange boxes occur in unsequenced regions of scaffolds (dotted lines) and are positioned by independent data from P. marinus assemblies (see Fig. 2). Open boxes (no color) indicate approximate locations; MS denotes region encoding membrane-spanner (Color figure online)

Fig. 2

Reconstructed blocks of “hit-groups” for four ferroxidase proteins from lamprey: a ceruloplasmin-1, b hephaestin-1, c hephaestin-2, and d factor V. Double-arrows denote overlap sequences from various sequence data sources that support the arrangements. LJ and LJcon denote scaffolds and contigs from L. japonicum assembly; the prefix con denotes contigs from the 2007 draft assembly of P. marinus, and the prefix scaf is for scaffolds from the 2012 P. marinus assembly. Orange boxes are “hit-groups” from Trace Archive; green boxes are based only on sequences in L. japonicum scaffolds. Pink boxes are based on non-genomic sequences in Trace Archive, and blank boxes in factor V denote missing regions based on homology expectations. The motifs “WDY” and “ING” (or “VNG”) found in all A domains are shown atop the group in which they occur. MP mate pair (Color figure online)

The fourth member of the ferroxidase family has only been partially reconstructed, but all of its parts are more similar to factors V or VIII from other vertebrates than they are to hephaestin or ceruloplasmin. Unhappily, the gene occurs mainly on a scaffold (LJ scaf02148) with long interrupting regions of unsequenced DNA (Fig. 1). Nonetheless, the gene was pieced together further by a variety of means. For one, PCR of sea lamprey (PM) cDNA was used to show that hit-groups G27 and G67 are adjacent to each other. Second, various other hit-groups were found to match segments in the same region of the scaffold (LJ scaf02148). The few remaining unplaced hit-groups (from the original 50) were positioned in the most propitious gaps. Additionally, a previously unidentified mate pair for hit-group G45 was found and logically placed nearby (MP45 in Fig. 1). On a negative note, although there are numerous discoidin sequences in the various lamprey databases, it was not possible to link any specific pair to the putative factor V gene.

As implied above, it was possible to assign all 50 of the original hit-groups for the ferroxidase family by the use of overlaps from the various databases (Fig. 2a–d). The matches for the putative factor V gene were scattered and occurred in parts of all three A domains. A concatenated sequence of five coding sections amounting to 481 amino acids was prepared and aligned with the corresponding segments from 12 other ferroxidase proteins, including the three other lamprey proteins (hephaestins −1 and −2 and ceruloplasmin), five human counterparts, pufferfish hephaestin, and zebrafish ceruloplasmin. A phylogenetic tree generated from the alignment is in complete accord with there being a single encoded protein of the factor V–VIII kind in lamprey genomes (Fig. 3). (The amino acid sequences of the four lamprey ferroxidase proteins are provided in the Supplementary Material.)
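The concatenate-then-compare step can be sketched as below. This is a simplified stand-in and not the distance-matrix method cited in the Methods: it uses the plain proportion of differing residues (p-distance) over ungapped columns, and it assumes the segments have already been aligned so that every concatenated sequence has the same length. Function names are hypothetical.

```python
from itertools import combinations


def concatenate(segments):
    """Join the aligned coding segments of one protein into one sequence."""
    return "".join(segments)


def p_distance(a, b):
    """Fraction of differing residues over columns where neither
    sequence has a gap ('-')."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(aligned):
    """Symmetric matrix of pairwise p-distances, keyed by sequence name."""
    names = list(aligned)
    matrix = {name: {name: 0.0} for name in names}
    for n1, n2 in combinations(names, 2):
        d = p_distance(aligned[n1], aligned[n2])
        matrix[n1][n2] = matrix[n2][n1] = d
    return matrix
```

A matrix of this form is the usual input to distance-based tree-building programs; published analyses would normally apply a correction for multiple substitutions before building the tree.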

Fig. 3

Phylogenetic tree (unrooted) of ferroxidase family proteins (coagulation factors V and VIII, hephaestins, and ceruloplasmin) from human (5 entries), lamprey (4 entries), pufferfish (Takifugu rubripes) (3 entries), and zebrafish (Danio rerio) (1 entry). Each of the 13 entries was a concatenated sequence of five segments corresponding to the available partial sequences for lamprey factor V. The five segments represented all three A domains

Vitamin K-Dependent Factors

Previously, we had reported the existence of seven vitamin K-dependent proteases in the sea lamprey, P. marinus, based on Trace Archive data (which included some EST and non-genomic sequences). All told, 215 hits were gathered into 38 hit-groups, independent of eight others for GLA domains (see below). The seven proteins identified were prothrombin, protein C, two factors X, and three factors VII. One of the factors X (factor XB) was found to lack an activation cleavage site, casting doubt on its role as a prothrombinase. At the time, it had not been possible to provide sequences or link together traces for all three of the factors VII.

Since that report, an article has appeared reporting the cloning of messages from the liver of the Japanese lamprey (L. japonicum) for prothrombin, protein C, a factor VII, and two factors X, one of which, like the putative sea lamprey protein, lacks an activation cleavage site (Kimura et al. 2009).

As a reminder, the vitamin K-dependent clotting factors contain amino-terminal sections with several gamma-carboxylated glutamic acid residues, casually referred to as “GLA domains.” In contrast to the 16 GLA domains found in most vertebrates, we had only found eight in sea lampreys (Doolittle et al. 2008), but now a search of the genome for the Japanese lamprey has revealed 12 GLA domains, two of which are 96 % identical. All 12 have been linked with their parent proteins (Table 2). Four of these are not associated with serine proteases and correspond to protein S, growth-arrest protein (homologous to protein S), and two GLA-transmembrane proteins. The other eight occur in prothrombin, protein C, three factors VII, and three factors X. The third factor X was a surprise and found to lie adjacent to what had been termed the factor XA gene (LJ scaffold 00078). The two putative proteins are 92 % identical and clearly the result of a fairly recent duplication, not much different from the time the sea (PM) and Japanese (LJ) lampreys diverged. There was no evidence for a comparably recent duplication in P. marinus, the single factor XA being located in the middle of a very large scaffold in the 2012 assembly.

Table 2 Vitamin K-dependent proteins identified in L. japonicum assembly

As was the case for the ferroxidase family proteins, it was possible to assemble the sequences of the vitamin K-dependent factors by taking advantage of sequences being found in one or another of the databases that were not present in the others. The approach was particularly helpful in characterizing the three factor VII proteins. As an example, in the case of factor VII C, scaffold 00627 of the Japanese lamprey contains the GLA domain and almost all of the protease domain, but the regions of the gene between them, which include the two EGF domains, remain unsequenced (NNNN). By good fortune, the 2007 draft assembly for P. marinus had a contig (35757) that contained the GLA domain and the first EGF section, and a scaffold from the 2012 PM assembly contained both EGF domains (but no GLA domain) (Fig. 4).
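The factor VII C example can be restated as a small bookkeeping sketch that records which domains each data source contains and checks that, taken together, the sources span the whole protein. Treating each source as a set of whole domains is a simplification (the Japanese lamprey scaffold holds almost all, not all, of the protease domain), and the labels `EGF1`, `EGF2`, and `SP` and the function names are shorthand introduced here.

```python
from itertools import combinations

# Domain order of a vitamin K-dependent protease (GLA, two EGF, serine protease).
DOMAINS = ["GLA", "EGF1", "EGF2", "SP"]

# Domain content of each source for factor VII C, as described in the text.
SOURCES = {
    "LJ scaffold 00627": {"GLA", "SP"},       # EGF region unsequenced (NNNN)
    "PM 2007 contig 35757": {"GLA", "EGF1"},
    "PM 2012 scaffold": {"EGF1", "EGF2"},     # no GLA domain
}


def missing_domains(sources):
    """Domains not found in any of the given sources."""
    covered = set().union(*sources.values())
    return [d for d in DOMAINS if d not in covered]


def overlap_links(sources):
    """Pairs of sources sharing at least one domain; shared domains are
    the overlaps that allow segments to be joined."""
    links = {}
    for (a, doms_a), (b, doms_b) in combinations(sources.items(), 2):
        shared = doms_a & doms_b
        if shared:
            links[(a, b)] = sorted(shared)
    return links
```

Here the GLA domain links the Japanese lamprey scaffold to the 2007 contig, and the first EGF domain links that contig to the 2012 scaffold, so no single source has to contain the whole gene.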

Fig. 4

Schematic depiction of three different factors VII from lamprey showing sources of sequence data. LJ data from L. japonicum; PM data from P. marinus. The cartoon across the top shows the domainal arrangement of the proteins: GLA γ-carboxy-glutamic acid containing domain; EGF epidermal growth factor domain; joiner, joining region; SP serine protease section. Dotted lines denote unsequenced regions of scaffolds

Similarly, the scaffold containing the factor VIIA gene (scaffold 00301) also had large unsequenced regions, one of which fell in the protease region of the gene (Fig. 4). In this case, matters were greatly facilitated by the cDNA for this protein having been determined (Kimura et al. 2009). With the aid of that sequence, it was possible to use various traces, contigs, and scaffolds to reconstitute most of the protein sequence for the sea lamprey, P. marinus.

The putative sequences for the eight vitamin K-dependent factors of interest are included in the Supplementary Material. A phylogenetic tree of various vitamin K-dependent proteases that included factor IX sequences from various species was wholly consistent with its absence in lampreys (Fig. 5).

Fig. 5

Phylogenetic tree (unrooted) of 18 vitamin K-dependent clotting factors (serine protease domains only) from lamprey (P. marinus and/or L. japonicum), elephant shark (Callorhinchus milii) (only factor IX), pufferfish (Takifugu rubripes), and human

Some Other Clotting Proteins

Fibrinogen

Although the sequences for the various chains of sea lamprey fibrinogen have been long known, the Japanese lamprey assembly allows the full gene structures to be revealed. Unlike the situation in mammals where the α, β, and γ genes are clustered together in a 50-kilobase region that is coordinately regulated with regard to gene expression, in the lamprey the α, β, and γ genes are on separate scaffolds, the great lengths of which show that in the lamprey these genes cannot be within a megabase of each other. Not surprisingly, the α (Wang et al. 1989) and α2 (Pan and Doolittle 1992) genes are adjacent to each other on the same scaffold (LJ scaffold 00123).

Tissue Factor

In our earlier report, we had been unable to identify a gene in the sea lamprey for tissue factor, a notoriously fast changing protein, even though lamprey tissue factor had been long ago characterized biochemically. We have now re-examined the various databases and found appropriate matches covering the whole protein in the 2007 draft assembly, two scaffolds amounting to about half the protein in the 2012 assembly, and a virtually complete sequence in the Japanese lamprey genome. (The amino acid sequences are provided in the Supplementary Material.)

Species Differences

The use of the LJ assembly in coordination with sundry sea lamprey data afforded an opportunity for numerous direct amino acid sequence comparisons for many of the clotting factors (Table 3). On average, these proteins are about 95 % identical in the two species, in line with the suggested divergence time of 10–30 million years based on similar data for other proteins (Kuraku and Kuratani 2006).

Table 3 Percent identities for 12 orthologous proteins in P. marinus and L. japonicum
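A percent identity of the kind tabulated here can be computed from a pairwise alignment as sketched below. The convention assumed, identical residues divided by the number of columns in which neither sequence has a gap, is one of several in use and is not necessarily the one applied to Table 3.

```python
def percent_identity(a, b):
    """Percent identical residues over ungapped columns of an alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)
```

Because partial reconstructions cover different fractions of each protein, identities computed this way are comparable only over the regions actually available in both species.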

Discussion

The primary aim of this report is to update and solidify the proposal that the lamprey genome contains a smaller number of standard blood clotting factors than jawed vertebrates. Beyond that, a strategy is described for using a combination of various sources of lamprey sequence data that can overcome the limitations and incompleteness of currently available lamprey genome assemblies.

In particular, we had cautiously proposed that the lamprey has a reduced set of clotting factors, corresponding to what would have been in place before the duplication of two different kinds of protein, a vitamin K-dependent protease, for one, and a (factor V–VIII) ferroxidase family protein, for the other. Even with the newly available data, it has not yet been possible to reconstruct the full sequence for the putative pre-duplication factor V/VIII gene, but the fact that there is one, and only one, is now certain. The full reconstruction of the closely related ferroxidase proteins hephaestins −1 and −2 and ceruloplasmin has been completed.

Several kinds of evidence speak for the absence of the vitamin K-dependent protease factor IX. First, all of the 12 GLA domains in the Japanese lamprey have been identified. Of these, the only possible candidates for a factor IX would be one of the two “extra” factors VII. These sequences are very different from factor IX, however, and contain features typically found in factors VII and X, including an additional disulfide bond near the amino-terminus of the A-chain in the activated protease. It might also be mentioned again that the sequence similarities between known factors IX and X are only slightly less than the resemblances observed between fish and human sequences for those two factors, implying that the duplication event occurred not long before the appearance of jawed fishes (Fig. 5). It is always more difficult to prove the absence of a gene than its presence in a sequence database, but the case is strong that a gene corresponding to factor IX is not present in the lamprey genome.

What makes the question of when the coagulation factors VIII and IX first made their appearance interesting is that it offers a unique example of two interacting proteins in a cascade, one a protease and the other a large MW co-factor, having their genes duplicated at (or about) the same time and as a consequence expanding an existing pathway in a kind of double-jump. Fuller descriptions of how the clotting system may have gotten started and how it continued to evolve during vertebrate evolution have been presented (Doolittle 2009, 2012).

It would have been better, of course, if all the clotting factors considered here could have been identified (or not) directly in a high-quality, fully completed genome assembly for the lamprey. Neither the 2012 assembly for the sea lamprey (Smith et al. 2013) nor the 2013 assembly for the Japanese lamprey (Mehta et al. 2013) fully meets the criterion, however. It should be emphasized that the missing data from the 2012 sea lamprey assembly are not because of programmed loss of DNA during development (Smith et al. 2009). The main reason for the deficiency is that a large number of trace reads were omitted during the assembly process, mostly because they were considered repetitive DNA.

To emphasize the point, more than a third of the (~50) hit-groups for the ferroxidase family of proteins were not found in the sea lamprey assembly even though they are present in the Trace Archive. Many appear on the long list of “unplaced reads” provided as part of Supplementary Material (Smith et al. 2013). That some key genes reside in parts of the genome that are extensively populated with repeats is unfortunate but something that needs to be dealt with, difficulties with the assembly process aside.

The strategy of using numerous short sequences in the Petromyzon marinus Trace Archive together with the two partially assembled genomes should be applicable for studies already under way for other vertebrate systems, including, for example, the evolution of complement systems (Kimura et al. 2009), the extracellular matrix (Hynes 2012; Adams et al. 2015), bone formation (Zhang et al. 2006), and numerous other systems that have been expanded by gene duplications during the course of vertebrate evolution. In a sense, the use of this approach allows one to make use of information discarded during the assembly process. Clearly, different assemblies of the same sequence data can end up with different regions being excluded. Used together and with “reads” from the Trace database, these assemblies can be used to encompass entire genes, introns included.

The strategy described here, which depends heavily on manual curation, has the advantage that individual researchers bring a familiarity to focused projects that can greatly aid interpretation of the data. In this regard, it was disconcerting to read in the report covering the 2012 assembly (Smith et al. 2013) that the sea lamprey had lost clotting genes, in contrast to our earlier (and current) findings that several clotting genes had never been there to lose (Doolittle 2009, 2012). It was also dispiriting to read of the discovery that sea lamprey codon usage is greatly biased in favor of G and C at the third position, an observation reported three decades ago (Strong et al. 1985; Bohonus et al. 1986; Pontes et al. 1988, inter alia) and re-discovered in 2011 (Qiu et al. 2011).
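A third-position bias of this sort is usually summarized as GC3, the fraction of codons ending in G or C. A minimal sketch, assuming an in-frame coding sequence whose length is a multiple of three (the function name is hypothetical):

```python
def gc3_fraction(cds):
    """Fraction of codons whose third base is G or C."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of three")
    third_bases = cds[2::3]  # every third base, starting at codon position 3
    if not third_bases:
        raise ValueError("empty coding sequence")
    return sum(base in "GC" for base in third_bases) / len(third_bases)
```

Applied to a set of reconstructed coding sequences, values well above 0.5 would reflect the bias described.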

About Whole Genome Duplications

Certainly there is uniform agreement that the vertebrate blood coagulation pathway is the result of a series of gene duplications; it is the timing of those duplications that is an issue. At one point, it was suggested that the gene duplications responsible for clotting factor genes may be linked to whole genome duplication events (Davidson et al. 2003b) in accord with the proposal that two rounds of whole genome duplication occurred during the early evolution of vertebrates (Ohno 1970). The 2R hypothesis, as it has come to be known, has been much debated ever since, and the two recent lamprey genome assemblies seem to have yielded two different interpretations of the matter (Mehta et al. 2013; Smith and Keinath 2015).

We have no wish to become embroiled in the debate about the timing of the two rounds of duplication and whether or not one or both preceded the divergence of Agnatha except to note that our findings are in complete accord with an early proposal of one round of whole genome duplication occurring before the advent of cyclostomes and another after their appearance (Escriva et al. 2002). Certainly in the wake of this event, the lineage leading to jawed vertebrates experienced a large scale block duplication—even if less than genome wide—that simultaneously gave rise to genes that were destined to encode factors VIII and IX, quite apart from several independent duplications that gave rise to additional genes for factors VII and X on the lineage leading to lampreys.