Introduction

Over the last few decades, it has emerged that programmed cell death (PCD) played a much more significant part in the evolution of life than previously imagined (Ameisen 2002; Michod and Nedelcu 2003; Segovia et al. 2003; Blackstone 2013; Iranzo et al. 2014; Koonin and Zhang 2017; Durand 2021; Huneman 2022). PCD fulfills a critical role in the eco-evolutionary dynamics of prokaryote and unicellular eukaryote communities (Vardi et al. 2007; Bidle 2016; Durand et al. 2016; Abada and Segev 2018; Ndhlovu et al. 2021) and tissue homeostasis in multicellular organisms (Glücksmann 1951; Lockshin and Williams 1964). In addition, PCD was essential for the evolutionary transitions in individuality that characterize the history of life, such as eukaryogenesis, multicellularity, and eusociality in insects (Iranzo et al. 2014; Durand et al. 2019).

It is important to acknowledge at the outset that the very subject of PCD in microbes has been fraught. This is largely because PCD was initially discovered and subsequently interpreted as a hallmark of ontogenesis in Metazoa (Glücksmann 1951; Lockshin and Williams 1964), and the very concept of programmed microbial death in some unicellular lineages was questioned (Ratel et al. 2001; Nedelcu et al. 2011; Proto et al. 2013; Ramisetty et al. 2015). Different researchers often imply different things when referring to the various kinds of cell death, which can lead to a conflation of questions and ideas. To circumvent this, there are now both mechanistic and evolutionary definitions. The mechanistic definition we adopt is the general Berman-Frank interpretation that PCD is “an active, genetically controlled, cellular self-destruction driven by a series of complex biochemical events and specialized cellular machinery” (Berman-Frank et al. 2004). Evolutionists use a historical account of whether there was selection for the trait or not and draw a distinction between PCD as an adaptation (true PCD) and non-adaptive PCD when the trait is a pleiotropic phenomenon (also called ersatz PCD) (Durand and Ramsey 2019). For the purposes of this study and for investigating the hypotheses below, we use the Berman-Frank mechanistic definition. The evolutionary definitions are set aside because the distinction between ‘selection for’ and ‘selection of’ is not drawn here.

There have been several hypotheses for the origin of PCD, based largely on philosophical and conceptual works, which can now be examined by comparative analyses of death domains in prokaryotes (bacteria and archaea) (Aravind et al. 1999; Asplund-Samuelsson et al. 2012; Hofmann 2019). One of the hypotheses for the emergence of PCD in the earliest cells is the ‘original sin’ hypothesis, which postulates that ancestral effector genes of PCD were present at the origin of cellular life (Ameisen 2002). The argument is that PCD effectors may have been rooted in primordial non-death functions, such as the cell cycle or adaptations to environmental stress (Shrestha and Megeney 2012; Shalini et al. 2015), with subsequent pro-death functions emerging. Crosstalk between molecular pathways in extant unicellular organisms is consistent with this claim (van Creveld et al. 2018). There are other hypotheses that may be complementary and synergistic such as the ‘addiction’ hypothesis (Ameisen 2002) and the argument for life–death coevolution in the last universal common ancestor state (LUCAS) (Durand 2021). The different hypotheses may not be mutually exclusive; however, the ‘original sin’ hypothesis is much more accessible with a comparative genomics approach and is the focus here.

Genomic data have linked eukaryogenesis with PCD in bacteria (Blackstone and Green 1999; Koonin and Aravind 2002; Michod and Nedelcu 2003). Many of the ‘death domains’ have been identified and the “bacterial connection” between PCD and eukaryogenesis is well established (Koonin and Aravind 2002). While it is known that PCD played a part in the rise of eukaryotes, potentially as a conflict mediator (Blackstone and Green 1999; Michod and Nedelcu 2003; Iranzo et al. 2014), and continues to impact the social lives of microbes (Arnoult et al. 2001; Durand et al. 2011; Bayles 2014), the origin of the genetic toolkit in prokaryotes generally is largely unknown (Koonin and Aravind 2002; Kaczanowski 2016; Klim et al. 2018). This raises some interesting questions. Were PCD genes also present in archaea, or did they emerge in bacteria after the divergence of the two lineages? One of the historical obstacles to addressing this question has been a dearth of sequence data from archaeal taxa. However, since 2012, much more data have become available (for example, Asplund-Samuelsson et al. 2012), presenting an opportunity to investigate the most ancient origins of the PCD genetic toolkit (Srivastav and Suneja 2019; Zhang et al. 2020). In addition, there has been some “tantalizing” empirical evidence for PCD genes in archaea (Asplund-Samuelsson et al. 2012; Klemenčič et al. 2019). Caspases (cysteine–aspartic proteases) (CAs) and their homologs in non-Metazoa are among the central mediators of PCD, activating multiple enzymes that are part of the death signaling cascade (Cohen 1998). Caspase homologs refer to orthocaspases (OCAs), metacaspases (MCAs), and paracaspases (PCAs), as per the proposed unified nomenclature of the C14 peptidase family (Minina et al. 2020) (Fig. 1). There are some reports of putative caspase homologs of the C14 peptidase family, OCAs and MCAs, in archaeal phyla (Euryarchaeota, Thermoplasmatota, Thaumarchaeota, Bathyarchaeota, and Heimdallarchaeota) (see supplementary documents in Klemenčič et al. 2019).

Fig. 1

Classification and domain organization of caspase homologs possessing the catalytic dyad in unicellular organisms according to the unified nomenclature of the C14 peptidase family (adapted from Klemenčič et al. 2015; Minina et al. 2020). Members of the family possess the catalytic p20-like region (maroon), a regulatory p10-like region (blue) and/or the linker region (thick gray line). The positions of catalytic His (H) and Cys (C) residues are shown above the p20-like region. The most common YC dyad is shown for the pseudo-orthocaspase. N-terminal pro-domains are indicated in green. Detailed classification is shown in the methodology. The figure is not drawn to scale (Color figure online)

Three homologous groups of caspases (CAs) are documented: type I, II, and III MCAs; type I and II PCAs; and OCAs (Fig. 1). All possess the p20-like region which includes the conserved Histidine–Cysteine (HC) dyad, sometimes also referred to as the Cysteine–Histidine (CH) dyad, which is situated within a characteristic caspase/haemoglobinase fold (Aravind and Koonin 2002; McLuskey and Mottram 2015) (Fig. 1). They are members of the CD clan, C14 peptidase family of the MEROPS peptidase database (Rawlings et al. 2014). Unlike CAs that are only found in Metazoa, PCAs are common in Metazoa and slime molds while MCAs are universally distributed in the three domains of life, except for the Metazoa (Uren et al. 2000). OCAs are ancestral caspase homologs present in bacteria and phytoplankton (Choi and Berges 2013; Klemenčič et al. 2015). OCAs in eukaryotes are commonly referred to as MCA-like proteases (Minina et al. 2020). MCAs, PCAs, and OCAs have different substrate specificities and N-terminal domains compared to CAs (McLuskey and Mottram 2015) but both sets of homologs correlate and are causally associated with cell survival and death functions (Vercammen et al. 2007; Shrestha and Megeney 2012; Minina et al. 2017; van Creveld et al. 2018). Cell survival functions include cell cycle regulation (Ambit et al. 2008; Lee et al. 2008), stress response (Richie et al. 2007), and virulence mediation (Proto et al. 2011; Benler and Koonin 2020). The pleiotropic nature of caspase homologs in both death and non-death-related functions (Shrestha and Megeney 2012; Klemenčič et al. 2015; Hill and Nyström 2015; Lema et al. 2021) indicates that the ‘original sin’ hypothesis is a potentially useful framework for investigating the evolution of the PCD toolkit.

The aim of this study was to investigate the potential presence of homologous caspase-like protein sequences in archaea and provide detailed comparative phylogenetic and sequence analyses of other possible death domains. Our analyses revealed type I MCAs and OCAs are surprisingly widespread in archaea. The phylogenetic reconstruction analysis and taxonomic distribution are suggestive of massive horizontal gene transfer (HGT) events between bacteria and archaea, and that at least some of the effectors of PCD were likely present prior to the diversification of prokaryotes. In addition, numerous death domains were identified in archaeal OCAs and type I MCAs, and subsequent phyletic pattern analyses inferred their putative ancestral functions. Our data provide strong support for the ‘original sin’ hypothesis of PCD. This lays the foundation for understanding the evolution and role of PCD in prokaryotic communities and as a conflict mediator in eukaryogenesis.

Results

Caspase Homologs in Archaea

In total, 321 caspase homologous sequences were identified spanning eleven archaeal phyla (Fig. 2 and Supplementary Table S1). Sequences identified as caspase homologs were further classified according to their structural subtypes (meta-, ortho-, and paracaspases) (Fig. 1). Fifty sequences (15.58%) possessed both the p20-like region and the p10-like region and were hence classified as MCAs. The length of the linker region for the fifty sequences was 7.34 ± 27.34 amino acids (Supplementary Table S2). Following the guidelines and criteria of Choi and Berges (2013), the presence of a short linker region (161.3 ± 32.9 aa) was used to classify the identified MCAs as type I MCAs (Supplementary Table S1) versus type II MCAs with longer linker regions. The remaining 271 sequences (84.42%) contained only the p20-like region and were therefore classified as OCAs (Supplementary Table S1).
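
The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, region presence is taken as already determined, and the linker-length criterion that separates type I from type II MCAs is deliberately left out. The counts (50 and 271) are those reported in the text.

```python
from collections import Counter


def classify_homolog(has_p20: bool, has_p10: bool) -> str:
    """Return a structural subtype label from region presence alone.

    Hypothetical helper: p20 plus p10 gives an MCA, p20 alone gives an OCA.
    """
    if has_p20 and has_p10:
        return "MCA"
    if has_p20:
        return "OCA"
    return "not a C14 homolog"


def subtype_percentages(labels):
    """Percentage of each label, rounded to two decimal places."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Counts reported in the text: 50 sequences with both regions, 271 with p20 only.
labels = [classify_homolog(True, True)] * 50 + [classify_homolog(True, False)] * 271
percentages = subtype_percentages(labels)
print(percentages)
```

With these counts the sketch reproduces the proportions given in the text for MCAs and OCAs out of 321 sequences.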

Fig. 2

Taxonomic distribution and abundance (%) of OCAs (green) and type I MCAs (orange) according to different phyla of archaea. Superphyla are indicated in gray boxes. Unclassified refers to sequences from environmental samples where taxonomic information was not available from the NCBI taxonomy database (Color figure online)

Archaeal OCAs and type I MCAs identified in this study showed a wide taxonomic distribution with sequences belonging to the three major superphyla (TACK, Asgard group, and DPANN) and two phyla of archaea (Thermoplasmatota and Euryarchaeota) (Fig. 2). TACK is an acronym for the phyla Thaumarchaeota, Aigarchaeota, Crenarchaeota, and Korarchaeota (Guy and Ettema 2011) and DPANN is an acronym for the phyla Aenigmarchaeota, Diapherotrites, Nanoarchaeota, Nanohaloarchaeota, Parvarchaeota, and Woesearchaeota (Rinke et al. 2013; Castelle et al. 2015). The majority of OCAs and type I MCAs were identified in the phylum Euryarchaeota (51.09%), and the phyla with the fewest homologs were Candidatus Aenigmarchaeota and Candidatus Woesearchaeota, both belonging to superphylum DPANN (Fig. 2). Unclassified organisms refer to sequences obtained from environmental samples that do not have taxonomic annotation information.

Sequence Analysis of Archaeal Orthocaspases and Type I Metacaspases

Multiple sequence alignment (MSA) and sequence logo analyses of archaeal OCAs and type I MCAs revealed that the highest sequence conservation occurred in amino acid residues surrounding the catalytic HC dyad (Supplementary Fig. S1 and S2). Four additional key amino acid residues are known to be important for the formation of the S1 pocket and enzyme specificity in eukaryotic type I MCAs (McLuskey et al. 2012). Of these, two Asp residues were conserved in archaeal type I MCAs and present in most OCAs (Supplementary Fig. S1 and S2). The remaining two amino acids, acidic Cys and Ser, showed variation in residues in both archaeal OCAs and type I MCAs (Supplementary Fig. S1 and S2). The variation in residues corresponded to the amino acids observed in CAs, PCAs, or MCAs of eukaryotes (Wei et al. 2000; Yu et al. 2011; McLuskey et al. 2012). The acidic residues important for substrate specificity in type I MCAs (Y31, numbered according to TbMCA-Ib) and PCAs (E500, numbered according to MALT1) were absent. Proline-rich repeats, which are usually found in the N-terminal region of type I MCAs (Uren et al. 2000), were identified in the linker region of archaeal type I MCAs (Supplementary Fig. S2). Archaeal type I MCAs possessed high-affinity and low-affinity Ca2+-binding sites (Supplementary Fig. S2 and S4). These sites were absent in archaeal OCAs.
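
As an illustration of how per-column conservation in an alignment might be quantified, the sketch below scores each column by the fraction of non-gap residues that match the most common residue. It is an assumption-laden toy, not the sequence logo method used in the study: the function name and the three short aligned sequences are hypothetical and chosen only to show a conserved H and C flanked by variable positions.

```python
from collections import Counter


def column_identity(alignment):
    """Fraction of non-gap residues matching the most common residue, per column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            # A column made only of gaps carries no conservation signal.
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(residues))
    return scores


# Hypothetical toy alignment; '-' marks a gap.
toy_alignment = ["AHG-C", "SHGAC", "THGSC"]
scores = column_identity(toy_alignment)
print(scores)
```

Columns that score 1.0 are fully conserved in the toy data; real analyses would normally use an information-content measure and many more sequences.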

According to the automated I-TASSER protein structure prediction, structural mimicry of archaeal type I MCAs and OCAs to TbMCA-Ib of Trypanosoma brucei (UniProt ID: Q585F3) was observed. Type I MCA of Candidatus Thorarchaeota archaeon (GenBank ID: TFG95767.1) and OCA of Candidatus Prometheoarchaeum syntrophicum (GenBank ID: QEE14607.1) were used as model templates for archaeal type I MCAs (Supplementary Fig. S4) and OCAs (Supplementary Fig. S3), respectively. The S1-binding pocket, which possesses the HC dyad and the four key residues involved in substrate specificity and enzyme activity, was conserved (Supplementary Fig. S3 and S4). Some differences were observed in the secondary structure and the tertiary folding of archaeal OCAs and type I MCAs, including the number of β-sheets and α-helices, as well as the missing N-terminal region in archaeal type I MCAs (Supplementary Fig. S3 and S4).

Phylogenetic Analysis

Archaeal OCA and type I MCA phylogenies were reconstructed using the MSA of the p20-like region and the C14 peptidase domain, respectively (Supplementary Fig. S1 and S2). Both the OCA and type I MCA phylogenetic distributions were incongruent with 16S rRNA phylogeny and revealed a diffuse, scattered distribution (Supplementary Fig. S1 and S2). Tree topologies revealed that the clades were clustered according to the different key residues required for the S1 pocket formation and enzyme activation, which corresponded with the residues observed in CAs, PCAs, or type I MCA of T. brucei. Both trees were well supported with robust bootstrap values above 0.80 (Supplementary Fig. S1 and S2).

Phylogenetic reconstruction of the p20-like region of bacterial and archaeal OCAs, and eukaryotic CAs, PCAs, and type I MCAs (Fig. 3) revealed an incongruency with the species tree provided by Hug and colleagues (Hug et al. 2016). P20-like regions of PCAs and CAs of Metazoa clustered together, while eukaryotic type I MCAs and PCA of D. discoideum (UniProt ID: Q9GPM2) were placed in separate clades. CAs, PCAs, and eukaryotic type I MCAs all branch from bacterial OCAs, except for PCA of D. discoideum which branches from archaeal OCAs.

Fig. 3

Maximum likelihood phylogenetic tree of the p20-like region of 231 amino acid sequences of prokaryotic OCAs, and eukaryotic CAs, PCAs, and type I MCAs generated using the LG + R6 model under IQ-TREE. Three metazoan CAs, H. sapiens CA 3 and CA 8, and C. elegans cell death protein 3 (UniProt IDs: Q14790, P42573, and P42574), were used as outgroup sequences. The classification is shown in different colors: eukaryotic PCAs (yellow), eukaryotic type I MCAs (orange), archaeal OCAs (red), and bacterial OCAs (blue). The tree reliability was tested using the bootstrap method with 1000 replicates, of which values below 0.50 are not shown. The tree scale bar shows substitutions per site. Corresponding Genbank IDs of archaeal OCAs, and UniProt IDs of eukaryotic CAs, PCAs, and type I MCAs, and bacterial OCAs are shown in brackets (Color figure online)

The phylogenetic relationship of type I MCAs across the three domains of life (archaea, bacteria, and eukarya) was incongruent with the classic 16S rRNA phylogenetic distribution (Fig. 4). Three clades were observed that were well supported with robust bootstrap values above 0.95 (Fig. 4). For the ultrafast bootstrap used by IQ-TREE, a clade is considered reliable if its support is more than 95% (Hoang et al. 2018). Clade 1 possessed the CAs and bacterial type I MCAs, which included the alphaproteobacterium Rhizobiales bacterium (UniProt ID: A0A2A4P607). The second clade consisted of bacterial and archaeal type I MCAs. They possessed key residues resembling type I MCA as well as the low-affinity and high-affinity Ca2+-binding sites and proline-rich repeats. The third clade possessed type I MCAs of eukarya, bacteria, and archaea, which included the type I MCA of the Asgard archaeon Candidatus Thorarchaeota archaeon (GenBank ID: TFG95767.1). They also possessed the key residues observed in a type I MCA as well as the low-affinity and high-affinity Ca2+-binding sites.
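
The support criterion mentioned above can be expressed as a simple filter. The sketch below is illustrative only: the function name and the clade labels and values in the example are hypothetical, supports greater than 1 are assumed to be percentages, and the cut-off is treated as inclusive, which is a choice of this sketch rather than something stated in the text.

```python
def reliable_clades(supports, threshold=0.95):
    """Keep clades whose support meets the threshold.

    Values greater than 1 are assumed to be percentages and are rescaled
    to fractions before comparison. The comparison is inclusive.
    """
    kept = {}
    for name, value in supports.items():
        fraction = value / 100 if value > 1 else value
        if fraction >= threshold:
            kept[name] = fraction
    return kept


# Hypothetical support values, mixing the 0-1 and percentage scales.
example = {"clade_a": 97, "clade_b": 0.80, "clade_c": 0.95}
kept = reliable_clades(example)
print(kept)
```

Normalizing the scale first avoids comparing a percentage against a fractional threshold, a mismatch that is easy to introduce when tree files and text report support differently.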

Fig. 4

A Maximum likelihood phylogenetic tree of MCAs in the three domains of life generated using the LG + R5 model under IQ-TREE. The tree is based on the C14_peptidase domain of 69 protein sequences. Three metazoan CAs, H. sapiens CA 3 and CA 8, and C. elegans cell death protein 3 (UniProt IDs: Q14790, P42573, and P42574), were used as outgroup sequences. The classification according to the three domains of life is shown in different colors: eukarya (green), archaea (red), and bacteria (blue). Different clades are shown in different colors: clade 1 (orange), clade 2 (pink), and clade 3 (green). Bootstrap values based on 1000 replicates calculated by maximum likelihood. Bootstrap values below 0.50 are not shown. The tree scale bar shows substitutions per site. Corresponding Genbank IDs of archaea and UniProt IDs of eukaryotes and bacteria are shown in brackets. B Partial MSA of key residues of type I MCAs in the three domains of life under hmmalign as well as MAFFT (See Methods). The figure was generated using Jalview. Colors of the amino acid have been assigned according to the Percentage Identity color scheme on Jalview with an identity threshold above 0.50. The catalytic HC dyad is indicated with a red star. Residues involved in the formation of the S1 pocket are indicated in orange, the proline-rich repeat is shown in pink, and Ca2+-binding sites are shown in blue (high affinity) and green (low affinity). The p20-like, linker, and p10-like regions are shown below the MSA (Color figure online)

C14 Peptidase-Accompanying Domains and Domain Architectures of Archaeal OCAs and Type I MCAs

In addition to the C14 Peptidase domain, 56 different C14 peptidase-accompanying domains were identified (Supplementary Table S3). These include Peptidase_C13 (34.07%), Raptor_N (10.37%), Formylglycine-generating (FGE) sulfatase (4.44%), Polycystic Kidney Disease (PKD) (4.44%), and PKD_4 (4.44%). Thirty domains were identified only once in the sequence data. GO prediction and functional information available on the Pfam database revealed the functions of the three most abundant C14 peptidase-accompanying domains (PKD, PKD_4, and FGE_sulfatase) as cell surface proteins that protect against extreme environments (Jing et al. 2002), respond to environmental stimuli, and unknown, respectively (Supplementary Table S4). Archaeal OCAs and type I MCAs that possess C14 peptidase-accompanying domains associated with cell survival and PCD were identified. Single C14 peptidase-accompanying domains predicted to have both functions were common (Fig. 5 and Supplementary Table S4). In addition, C14 peptidase-accompanying domains implicated in interactions with the environment and neighboring cells were abundant. These included domains associated with cell adhesion, cell projection, and cell surface receptors (Supplementary Table S4). A complete list of C14 peptidase-accompanying domains, their abundance, and predicted functions is available in Supplementary Table S3 and S4.

Fig. 5

Dual role of C14 peptidase-accompanying domains which were predicted to have cell survival and death-associated functions. Cell survival-associated functions refer to functions mentioned by Ameisen which contributed to the rise of the PCD genetic toolkit (Ameisen 2002). C14 peptidase-accompanying domains associated with cell survival or death have been classified according to appropriate categories: cell cycle (blue), cell metabolism (orange), cell differentiation (purple), and PCD (green). C14 peptidase-accompanying domains include ANAPC3 (Anaphase-promoting complex, cyclosome, subunit 3), ANAPC4_WD40 (Anaphase-promoting complex subunit 4 WD40 domain), ANF_receptor (Receptor family ligand-binding region), BiBP_C (Penicillin-Binding Protein C-terminus Family), CUB (CUB domain), DUF11 (Domain of unknown function), DZR (Double zinc ribbon), EF-hand_5 (EF hand), fn3 (Fibronectin type III domain), fn3_3 (Polysaccharide lyase family 4, domain II), Peripla_BP_5 (Periplasmic-binding protein domain), Peptidase_C13, Peptidase_M20, Peptidase_M28, PPC (Bacterial pre-peptidase C-terminal domain), TPR_1, TPR_2, TPR_11, and TPR_16 (TPR repeat), and WD_40 (WD domain, G-beta repeat). Detailed information is available in Supplementary Table S4 (Color figure online)

Prediction of transmembrane domains using TMHMM revealed 127 (39.56%) sequences with varying transmembrane topologies (Supplementary Table S5). One hundred fourteen protein sequences were found to be type II membrane proteins, which are single-spanning membrane proteins that possess intracellular N-terminals and extracellular C-terminals (von Heijne 2006) and resemble signaling peptides (Krogh et al. 2001).
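
The topology criterion described above (a single membrane-spanning helix with an intracellular N-terminus and an extracellular C-terminus) can be checked programmatically. The sketch below assumes a TMHMM-style topology string in which `i` and `o` mark inside and outside segments and number ranges mark helices, for example `i7-29o`. The function names and example strings are hypothetical, and this is not the script used in the study.

```python
import re

# One leading side letter, then zero or more "start-end" helices each followed by a side letter.
TOPOLOGY = re.compile(r"[io](\d+-\d+[io])*")
HELIX = re.compile(r"(\d+)-(\d+)")


def parse_topology(topology: str):
    """Split a topology string such as 'i7-29o' into (N-side, helices, C-side)."""
    if not TOPOLOGY.fullmatch(topology):
        raise ValueError(f"unrecognized topology string: {topology!r}")
    helices = [(int(start), int(end)) for start, end in HELIX.findall(topology)]
    return topology[0], helices, topology[-1]


def is_single_span_n_in(topology: str) -> bool:
    """True for one helix with the N-terminus inside and the C-terminus outside."""
    n_side, helices, c_side = parse_topology(topology)
    return len(helices) == 1 and n_side == "i" and c_side == "o"


# Hypothetical examples: single-span N-in, single-span N-out, two helices, no helix.
for example in ["i7-29o", "o7-29i", "i7-29o40-62i", "o"]:
    print(example, is_single_span_n_in(example))
```

Validating the whole string before extracting helices keeps malformed lines from being silently counted as soluble proteins.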

In total, 63 different domain architectures were identified (Figs. 6, 7). Out of 321 sequences, 183 (57.01%) sequences possessed C14 peptidase-accompanying domains of which 177 sequences were OCAs and six were type I MCAs. The most abundant domain architectures were OCAs possessing a single p20-like region (29.30%), OCAs possessing a transmembrane helix followed by a p20-like region (22.43%), and type I MCAs possessing a p20-like region followed by a p10-like region (13.40%). Forty-two domain architectures were identified only once in the sequence dataset (13.08%) and no domain architecture was unique to a specific phylum.
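
Tallies of this kind (number of distinct architectures, their relative abundance, and architectures seen only once) follow from counting ordered domain tuples. The sketch below is a minimal illustration with hypothetical toy data and a hypothetical function name. It does not reproduce the study's dataset or its counts.

```python
from collections import Counter


def architecture_summary(architectures):
    """Count ordered domain architectures and report percentages and singletons."""
    counts = Counter(tuple(arch) for arch in architectures)
    total = sum(counts.values())
    percent = {arch: round(100 * n / total, 2) for arch, n in counts.items()}
    singletons = [arch for arch, n in counts.items() if n == 1]
    return counts, percent, singletons


# Hypothetical toy data: each tuple lists domains from N-terminus to C-terminus.
toy = [
    ("p20",),
    ("p20",),
    ("TM", "p20"),
    ("p20", "p10"),
    ("TM", "p20", "PKD"),
]
counts, percent, singletons = architecture_summary(toy)
print(len(counts), percent[("p20",)], len(singletons))
```

Using tuples keeps domain order significant, so a transmembrane helix before the p20-like region is counted separately from one after it.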

Fig. 6

Domain architectures of transmembrane archaeal OCAs and type I MCAs. Domains (oval) and distances (gray line) between domains are not drawn to scale. Domains are presented with N-terminals on the left and C-terminals on the right. iTmm refers to transmembrane helices predicted to have N-terminal domains situated intracellularly and C-terminal domains extracellularly. oTMM refers to transmembrane helices predicted to have the reverse topology. Functional categories and color keys of C14 peptidase-accompanying domains and transmembrane domains are indicated below the domain architectures. Only functions associated with cell survival, death, or interactions with neighboring cells and the environment are indicated. Functions labeled ‘other’ (yellow) refer to domains associated with functions other than PCD, cell survival, or cell–cell interactions. P20- and p10-like regions of the C14 peptidase domain are colored lilac. If a domain was associated with more than one function, all the potential functions were indicated with the corresponding color (Color figure online)

Fig. 7

Domain architectures of 195 cytosolic archaeal OCAs and type I MCAs. Domains (oval) and distances (gray line) between domains are not drawn to scale. Domains are presented with N-terminals on the left and C-terminals on the right. iTmm refers to transmembrane helices predicted to have N-terminal domains situated intracellularly and C-terminal domains extracellularly. oTMM refers to transmembrane helices predicted to have the reverse topology. Functional categories associated with cell survival, death, or interaction with neighboring cells and the environment were considered, and the color key of C14 peptidase-accompanying domains and transmembrane domains is shown below the domain architectures. Functions labeled ‘Other’ refer to domains associated with functions other than PCD, cell survival, or interaction with neighboring cells and the environment. P20- and p10-like regions of the C14 peptidase domain are colored lilac. If a domain was associated with more than one function, all the potential functions were indicated with the corresponding colors (Color figure online)

Transmembrane OCAs are common in archaea (39.25%). The most common domain architecture of transmembrane OCAs was an extracellular C-terminal p20-like region and domains interacting with the neighboring cells and the environment (8.41%) (Fig. 6, domain architecture 3, 5–8, 10, 14, 15, 17, 18, 20–22, 25–27, 30). Compared to transmembrane OCAs, cytosolic OCAs possessed a greater variety of domains, including those associated with survival (Fig. 7, domain architecture 34, 37, 38, 43, 48, 55–57, 60–62) and PCD (Fig. 7, domain architecture 43, 56, 61, and 62). The death domains (Aravind et al. 1999; Li and Roberts 2001) identified in this study include TPR_1, TPR_2, TPR_16, and TPR_11 (Tetratricopeptide repeat), WD_40 (WD domain), and ANAPC4_WD40 (Anaphase-promoting complex subunit 4 WD40 domain) (Supplementary Table S4). Function assignments via the Pfam database, dcGO, and Pfam2GO associated these death domains with cell survival functions, such as regulation of cell growth, cell projection assembly, and cell cycle control (Supplementary Table S4).

Archaeal type I MCAs possessed simple domain architectures, of which most did not possess any C14 peptidase-accompanying domains (86.60%). Transmembrane type I MCAs were rare (3.38%). C14 peptidase-accompanying domains in archaeal type I MCAs were associated with cell adhesion and cell surface receptors (Fig. 6, domain architecture 12 and 29, and Fig. 7, domain architecture 58).

Taxonomic distribution analysis revealed that cytosolic and transmembrane OCAs and type I MCAs associated with cell survival had wide taxonomic distributions (Supplementary Table S6). They were identified in superphyla Asgard group and TACK, and phyla Thermoplasmatota and Euryarchaeota. Archaeal cytosolic OCAs possessing death domains were confined to superphyla Asgard group and TACK (Supplementary Table S6).

Discussion

Caspase homologs were identified in archaeal superphyla TACK, the Asgard group, and DPANN, and phyla Euryarchaeota and Thermoplasmatota. In total, 321 OCAs and type I MCAs were identified (Fig. 2) by stringent hmmsearch filtering, manual inspection of the HC dyad (Supplementary Fig. S1 and S2), and structural similarity (Supplementary Fig. S3 and S4). The HC dyad is a requirement for the proteolytic activity of OCAs and MCAs (Aravind and Koonin 2002; McLuskey and Mottram 2015), whereas the absence of the dyad indicates catalytic inactivity (Szallies et al. 2002). It is likely, therefore, that the OCAs and MCAs reported here are active.

The OCAs were much more abundant in archaea (84.42%) than type I MCAs (15.58%) (Fig. 2), a pattern that is similar to the findings in bacteria (Klemenčič et al. 2019). It is possible that some sequences classified as OCAs may be partial sequences, which can result in erroneous classifications, and future whole-genome sequencing analyses will resolve any potential inconsistencies. There was a bias in the taxonomic distribution of archaea (Fig. 2), which we attribute to the relative dearth of data from superphyla TACK, DPANN, and Asgard group. This will likely also change as more data become available. CAs and PCAs were not identified, which is in keeping with the previous phylogenetic predictions (Uren et al. 2000). Similarly, type II MCAs and type III MCAs were not identified. This was unsurprising because type II and III MCAs are reportedly specific to Viridiplantae and some algal species where they are associated with secondary endosymbiotic events (Choi and Berges 2013). In addition to the previously listed putative archaeal caspase homologs (Klemenčič et al. 2019), an additional 228 sequences were identified here, including previously unclassified sequences from superphyla Asgard group and TACK.

Comparative sequence and secondary structure prediction analyses of OCAs and archaeal type I MCAs confirmed the presence of key residues and structures typical of this group of homologs (Supplementary Fig. S1, S2, S3, and S4). This includes the HC dyad, and two Asp residues present in the S1 pocket that are required for substrate specificity and enzyme activity. There was some minor variation in two of the key residues. These are the acidic Cys (Supplementary Fig. S1 and S2), which is important for substrate binding (McLuskey et al. 2012), and Ser (Supplementary Fig. S1 and S2), which is important for regulation of enzyme activity (McLuskey et al. 2012). Interestingly, this variation matched those of eukaryotic type I MCAs, PCAs, or CAs (Supplementary Fig. S1 and S2) and may have some functional significance. The conservation of key residues in eukaryotic type I MCAs, PCAs, or CAs suggests that they may be derivatives of archaeal OCAs and type I MCAs. Furthermore, the structural mimicry of the automated I-TASSER predicted model of archaeal type I MCA and OCAs with the X-ray crystallography of TbMCA-Ib further suggests common ancestry and structural (and presumably functional) homology (Supplementary Fig. S4).

The presence of key residues and structures required for a catalytically active C14 peptidase suggests that archaeal OCAs and type I MCAs are catalytically active. A small subset of archaeal OCAs was very similar in their key residues to an OCA found in Microcystis aeruginosa (Supplementary Fig. S1) that is known to have MCA-like activity (Klemenčič et al. 2015). This subset may, therefore, also have MCA-like activity. The conservation of some of the Ca2+-binding sites in archaeal type I MCAs suggests that Ca2+ remains important as a secondary messenger in the activation of archaeal type I MCAs. As described by Klemenčič et al. (2015), Ca2+-binding sites were absent in archaeal OCAs.

The N-terminal domain required for MCA activity (McLuskey et al. 2012) and the zinc-finger domain required for cell death function (Coll et al. 2010) were absent in archaeal OCAs and type I MCAs (Supplementary Fig. S3 and S4). Tyr, which functions as a latch with Ser for regulation of enzyme activation (McLuskey et al. 2012), was also absent in both archaeal OCAs and type I MCAs. The observed differences in the secondary structure and some of the key residues required for enzyme activity and regulation (Supplementary Fig. S1 and S3) suggest there are likely subtle differences in substrate specificity or regulation of activity of archaeal OCAs and type I MCAs compared to TbMCA-Ib.

Origins of Prokaryotic Type I MCAs and OCAs

Phylogenetic reconstruction of archaeal type I MCAs and OCAs illustrates a diffuse gene distribution pattern across different phyla and is inconsistent with the species trees (Supplementary Fig. S1 and S2 and discussed further below). Both trees were well supported with robust bootstrap values above 0.80. Only one sequence belonging to the superphylum Asgard group was available (Genbank ID TFG95767.1) for type I MCA phylogenetic analysis, while none were available from superphylum DPANN, which resulted in a biased phylogeny of type I MCAs towards the phylum Euryarchaeota.

The identification of OCAs and type I MCAs in all three domains of life (archaea, bacteria, and eukarya) and their shared origins with CAs and PCAs raises new questions: were the ancestors of these PCD effectors present in the ancestors of both archaea and bacteria? Or did they emerge after the divergence of the two prokaryote lineages with subsequent HGT between them? Furthermore, since most of the eukaryotic members of the C14 peptidase family appear to have originated from the bacterial ancestors of mitochondria (Aravind et al. 2001; Koonin and Aravind 2002; Klim et al. 2018), what has been the evolutionary trajectory of the archaeal homologs? What is clear from the data here and previous publications (Uren et al. 2000) is that PCAs and CAs are limited to the so-called higher, complex organisms while MCAs are widely distributed across all lineages with the notable exception of Metazoa (Uren et al. 2000; Choi and Berges 2013). OCAs are confined to unicellular organisms (Klemenčič et al. 2015; Klemenčič and Funk 2018). The conservation of key residues, the structural similarities between archaeal OCAs and the p20-like region of PCAs and type I MCAs, and their presence in the deepest branching superphylum DPANN, phylum Euryarchaeota (Fig. 3), and the bacterial phyla Aquifex, Thermotogae, and Deinococci (Klemenčič et al. 2019) suggest that OCAs (the p20-like region) are almost certainly the most ancestral members of the C14 peptidase family. This supports earlier predictions (Choi and Berges 2013).

Three clades of type I MCAs are neatly resolved. These are the bacterial caspase homologs (clade 1), prokaryotic type I MCAs (clade 2), and eukaryotic type I MCAs (clade 3). CAs were placed in one clade with bacterial type I MCAs, which included that of a Rhizobiales bacterium (UniProt ID: A0A2A4P607) (Fig. 4). This supports the origin of CAs as bacterial caspase homologs during the endosymbiotic event (Koonin and Aravind 2002) and is consistent with the previous findings tracing the evolutionary histories of CAs and MCAs across the three domains of life (Klim et al. 2018).

The origin of unicellular eukaryotic type I MCAs in clade 3 is more difficult to resolve; they could have emerged in either archaea or bacteria (Fig. 4). The absence of the Ca2+-binding sites in bacterial type I MCAs in clade 1 indicates that the ancestral type I MCAs were not regulated by Ca2+, a feature that emerged later. Proline-rich repeats, which are typically found in the N-terminal domains of eukaryotic type I MCAs, were only identified in clade 2, which contains prokaryote type I MCAs. The distribution of type I MCAs across different domains indicates HGT events between prokaryotes and eukaryotes.

The incongruency between gene and species trees is most likely due to extensive HGT between archaea and bacteria (Maddison 1997; Ponting et al. 1999; Ochman et al. 2000; Aravind et al. 2001). Tracing the key residues in the S1 pocket and the additional C14 peptidase-accompanying domains present in different members of the C14 peptidase family also suggests massive HGT. This is in line with what is known about large-scale gene flows between the proto-cellular organisms during the emergence of diverse prokaryotic lineages (Koonin et al. 1997; Aravind et al. 1998; Nelson et al. 1999; Nelson-Sathi et al. 2015; Koonin 2016; Wagner et al. 2017; Husnik and McCutcheon 2018) and limits the identification of donor lineages (Akanni et al. 2015). Of course, there are other potential explanations. For example, one type of C14 peptidase may have arisen de novo in an ancestral archaeon, continued in daughter lineages, and then undergone HGT and gene duplication events (Fig. 8). However, this would have to include multiple duplication events with subsequent HGT of the paralogous genes. It seems more parsimonious that the OCAs were present prior to the diversification of archaea and bacteria, suggesting that ancestral OCAs were already present in the realm of the LUCAS (Last Universal Common Ancestor State). The acronym LUCAS refers here to Koonin’s “forest of life” (FoL) concept, which was a hypothetical state prior to the existence of cells as we understand them today and in which genetic material flowed between semi-cellular compartments (Koonin 2009). Another possible explanation is that these C14 peptidases originated from eukaryotes and were transferred to prokaryotes via HGT events (Rawlings and Bateman 2019). However, their wide taxonomic distribution in all of the major phyla of archaea and bacteria suggests they were present in prokaryotes first. The mitochondrial origin of the C14 peptidase family during eukaryogenesis (Kroemer 1997; Frade and Michaelidis 1997; Mignotte and Vayssiere 1998; Blackstone and Green 1999) also supports their prokaryotic origin and is consistent with Klim and colleagues’ mitochondrial hypothesis for the origin of eukaryotic apoptosis (Klim et al. 2018).

Fig. 8

Canonical 3-domain tree of life with caspase homologs. Only the groups of interest included in this study are indicated. HGT events between the domains eukarya and bacteria are indicated by the arrows, and arrow thickness indicates the frequency of the event (Color figure online)

C14 Peptidase-Accompanying Domains Associated with Pro-survival Outnumber Death Domains in Archaeal OCAs and Type I MCAs

The conservation of OCAs and their ancient origins raise questions about their ancestral functions. The functional inference was based on phyletic and functional pattern analysis of C14 peptidase-accompanying domains, with reference to previous studies (Gaasterland and Ragan 1998; Aravind et al. 1999; Galperin and Koonin 2000). Of course, functional inference based on phyletic pattern analysis is predictive and awaits future empirical studies. Nevertheless, a large variety of different C14 peptidase-accompanying domains were easily identified, and these were more abundant in archaeal OCAs (97.27%) than in type I MCAs (2.73%) (Supplementary Table S3). The high abundance of the Peptidase_C13 and Raptor_N domains is spurious and explained by their sequence similarities to the C14 peptidase domain and the presence of the HC dyad (Ginalski et al. 2004; Rawlings et al. 2014). Functional assignment of the other C14 peptidase-accompanying domains using multiple GO databases indicated that domains associated with cell survival are more abundant than those associated with PCD (Fig. 5). GO functions associated with cell survival were chosen based on the Gedanken experiment by Ameisen (Ameisen 2002), which includes pro-survival functions like cell differentiation, cell cycle regulation, and cell metabolism.

Domain architecture analyses revealed archaeal OCAs resembling signaling peptides that interact with the external environment (Fig. 6). Examples include OCAs possessing a transmembrane helix followed by C14 peptidase-accompanying domains associated with cell adhesion, cell projection, and cell surface receptors. The occurrence of these membrane-bound OCAs suggests archaeal OCAs were involved in cellular responses to the surrounding environment through cell signaling mechanisms. In contrast, only a few archaeal type I MCAs possessed C14 peptidase-accompanying domains and transmembrane helices. It is acknowledged that the transmembrane components of the archaeal OCAs and type I MCAs resembling signaling peptides were predicted using the TMHMM program, which reports a ~78% precision in topology prediction (Krogh et al. 2001) and should be interpreted in this light.

Cytosolic archaeal OCAs possessing death domains were observed in superphylum TACK (Fig. 7 and Supplementary Table S6). However, it is uncertain if their presence implies archaeal OCAs are involved in PCD as death domains are associated with both death and pro-survival. According to the GO prediction, the death domains are also associated with cell survival functions (a quirk of the historical use of PCD terminology), such as cell cycle control, differentiation, and metabolism (Supplementary Table S4). The abundance of C14 peptidase-accompanying domains associated with cell survival in archaeal OCAs adds to this uncertainty.

The pleiotropic nature of MCAs and the association of these effector enzymes with both PCD and cell survival functions are established (Shrestha and Megeney 2012; Hill and Nyström 2015). Their pleiotropic nature and the higher abundance of C14 peptidase-accompanying domains associated with pro-survival, therefore, raise questions about the role of ancestral OCAs and type I MCAs in archaea. Did they evolve because of PCD-associated cell death functions, or were they co-opted from ancestral proteins with pro-survival functions? Intuitively, it seems more parsimonious that these cytosolic archaeal OCAs were initially involved in pro-survival activities and were subsequently co-opted into death-related functions.

Archaeal OCAs and Eukaryogenesis

According to the Entangle-Engulf-Enslave (E3) model, the physical interaction between a Candidatus Prometheoarchaeum syntrophicum MK-D1 archaeon and an alphaproteobacterium involved cellular protrusions and phagocytosis-independent engulfment (Imachi et al. 2020). This is because phagocytic machinery is absent in the Asgard group, the proposed archaeal clade involved in archaea–alphaproteobacteria symbiosis (Moreira and Lopez-Garcia 1998). Using this hypothetical framework, which is now one of the most recent hypotheses for eukaryogenesis (Imachi et al. 2020), we wished to investigate the potential role of these archaeal OCAs in the rise of FECA (First Eukaryotic Common Ancestor). Two OCAs found in Candidatus Prometheoarchaeum syntrophicum strain MK-D1 archaeon possessed simple domain architectures with no C14 peptidase-accompanying domains (Fig. 6, domain architecture 1 and Fig. 7, domain architecture 32). In contrast, OCAs of other Asgard archaea (Fig. 6, domain architectures 2, 3, 7, 14, 17, 19, and 24 and Fig. 7, domain architectures 36 and 60) were all transmembrane proteins with domains associated with cell metabolism, cell adhesion, cell cycle control, cell projection, cell differentiation, and cell surface proteins/receptors implicated in stress responses. The abundance of cell surface and receptor-like OCAs in the Asgard group suggests that these archaeal OCAs may be candidate PCD-related genes that were also important in the E3 model of eukaryogenesis (Fig. 6).

The ‘Original Sin’ Hypothesis and the PCD Toolkit

The phyletic pattern analysis of archaeal OCAs revealed that they are not only the most ancestral peptidase C14 members, but they also contain an abundance of pro-survival domains (Figs. 6, 7). This is especially true of the deep-branching phyla Euryarchaeota and Thermoplasmatota, linking these ancestral OCAs with cell survival prior to eukaryogenesis. Although OCAs of DPANN only possessed the p20-like region, it is possible that additional genomic data will identify some of the C14 peptidase-accompanying domains. Although we are unsure if ancestral OCAs and type I MCAs were associated with PCD, it is clear that the ancestral effectors of PCD were already present prior to the diversification of bacteria and archaea. Type I MCAs appeared later in Thermoplasmatota and Euryarchaeota (Fig. 9) and acquired Ca2+-binding sites to refine their activity (Figs. 4, 9). In addition, OCAs and type I MCAs of superphylum TACK acquired death domains and other C14 peptidase-accompanying domains via HGT (Figs. 6, 7, 9). It seems that this pre-adaptation allowed for the emergence of pro-death processes in prokaryotes and eukaryotes (Aravind et al. 1999; Ameisen 2002). Klemenčič and Funk have suggested before that PCD evolved from effectors involved in physiological cellular processes (Klemenčič and Funk 2018). The findings presented here enhance these assertions and support the ‘original sin’ hypothesis which is a philosophical argument put forward by Ameisen in 2002 that argues for diverse pro-survival functions for the PCD toolkit with subsequent refinement for essential death-related functions (Ameisen 2002).

Fig. 9

Proposed acquisitions of the different C14 peptidase domains and their associated pro-survival or pro-death domains in archaea. A simple C14 peptidase distribution is shown on the tree of life: archaea (red), bacteria (blue), and eukarya (green). The acquisitions of domains are shown in lines of different colors: p20-like region (red), p10-like region (light blue), C14 peptidase-accompanying domains associated with cell survival (yellow), and death domains (green). The important regions specific to type I MCAs, the Ca2+-binding sites (blue circle) and proline-rich repeats (pink circle), are indicated. LUCAS refers to the Last Universal Common Ancestor State. The point of endosymbiosis leading to eukarya is shown with a star. Horizontal gene transfer events between the domains eukarya and bacteria are indicated by the arrows (Color figure online)

Conclusion

OCAs and type I MCAs are present in archaea and have a wide taxonomic distribution in the three domains of life. The phylogeny of OCAs and type I MCAs suggests that these are the ancestral peptidase C14 members and have been present since LUCAS. Their strong similarity with other members in bacteria and eukarya further supports this, and the incongruency with the species tree points to massive HGT of domains between prokaryotes and proto-cellular compartments prior to their divergence. The phyletic and functional pattern analyses of C14 peptidase-accompanying domains and death domains provide strong support for the ‘original sin’ hypothesis that OCAs in the Asgard group were perhaps first involved in pro-survival functions with subsequent co-option for PCD functions in prokaryotes and the earliest eukaryotes.

Materials and Methods

Sequence Retrieval and Identification of Caspase Homologs

The NCBI Protein database (https://www.ncbi.nlm.nih.gov/protein/) (Accessed 2 June 2021) was queried using the search terms “C14,” “caspase,” “metacaspase,” and “orthocaspase” in organism “archaea” under default parameters. Retrieved protein sequences were identified as caspase homologs by the presence of the Peptidase_C14 domain (PF00656) (http://pfam.xfam.org/) (Mistry et al. 2021) using the hmmsearch tool from the HMMER v3.0 package (http://hmmer.org/) (Eddy 1998) with a domain-independent E-value ≤ 0.0001 and score/bias ratio ≥ 10. Hmmsearch uses profile hidden Markov Models (HMM) to detect sequence similarity and homology (Eddy 1998). The raw Peptidase_C14 HMM profile (PF00656) (206 amino acids) obtained from the Pfam database was used. The MSA file of 321 sequences was aligned to the Peptidase_C14 HMM profile (PF00656) using the hmmalign tool with the --trim option. The presence of the conserved histidine and cysteine (HC) dyad was identified by visual inspection in Jalview v2.0 (Waterhouse et al. 2009) according to the reference positions (H220 and C276) provided on the MEROPS database (Rawlings et al. 2014) as well as using the reference sequence of TbMCA-Ib of T. brucei. Sequences not possessing the HC dyad or pseudo-dyads (Eyers and Murphy 2016) were removed.
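
The threshold filter described above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used in the study. It assumes the hits were written as an hmmsearch per-domain table (whitespace-separated columns), and it assumes that the score/bias ratio is computed from the per-domain score and bias columns, which the text does not specify. The names DomainHit, parse_domtblout, and passes_thresholds are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class DomainHit(NamedTuple):
    target: str      # sequence ID (column 1 of the per-domain table)
    query: str       # profile HMM name (column 4)
    i_evalue: float  # domain independent E-value (column 13)
    score: float     # per-domain bit score (column 14)
    bias: float      # per-domain bias correction (column 15)
    ali_from: int    # alignment start on the sequence (column 18)
    ali_to: int      # alignment end on the sequence (column 19)


def parse_domtblout(lines: Iterable[str]) -> List[DomainHit]:
    """Parse per-domain table rows, skipping comment and blank lines."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split()
        hits.append(DomainHit(f[0], f[3], float(f[12]), float(f[13]),
                              float(f[14]), int(f[17]), int(f[18])))
    return hits


def passes_thresholds(hit: DomainHit, max_evalue: float = 1e-4,
                      min_ratio: float = 10.0) -> bool:
    """Apply the E-value and score/bias cut-offs quoted in the text."""
    if hit.i_evalue > max_evalue:
        return False
    if hit.bias == 0:
        # Assumption: a positive score with zero bias passes the ratio test.
        return hit.score > 0
    return hit.score / hit.bias >= min_ratio
```

Filtering then reduces to `[h for h in parse_domtblout(open(path)) if passes_thresholds(h)]`, where `path` points to the domain hits table.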

Classification and Taxonomy Assignment of Caspase Homologs

Profile HMMs for the p20-like and p10-like regions were constructed with HMMER v3.0 hmmbuild, according to the approach followed by Klemenčič and colleagues (Klemenčič et al. 2019). The first 457 and the last 211 columns of the Pfam Peptidase_C14 (PF00656) seed alignment (112 sequences with gaps) were used as the input for the p20-like and the p10-like regions, respectively. To identify PCAs, hmmsearch was performed using the Immunoglobulin (Ig) (PF00047), Ig_2 (PF13895), or Ig_3 (PF13927) HMM profiles (http://pfam.xfam.org/), commonly found N-terminal pro-domains in PCAs, against the unaligned file of 321 protein sequences with a domain-independent E-value ≤ 0.0001 and score/bias ratio ≥ 10, using the --domtblout parameter to produce the domain hits table as the output. MCAs were classified according to their subtypes based on the amino acid length of the linker region between the p20-like and p10-like regions (Uren et al. 2000; Choi and Berges 2013) (Fig. 1). Linker region length was determined from the alignment coordinates in the domain hits table obtained from the hmmsearch hit.
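
As an illustration of how a linker length can be derived from alignment coordinates, the sketch below computes the number of residues lying strictly between a p20-like hit and a p10-like hit. It assumes 1-based, inclusive coordinates, as reported in HMMER domain tables. The function names and the dictionary layout are hypothetical, and no subtype cut-offs are encoded because the text does not state them.

```python
from typing import Dict, Optional, Tuple

Coords = Tuple[int, int]  # (ali_from, ali_to), 1-based and inclusive


def linker_length(p20: Coords, p10: Coords) -> int:
    """Residues strictly between the end of p20 and the start of p10."""
    gap = p10[0] - p20[1] - 1
    return max(gap, 0)  # touching or overlapping hits give a linker of 0


def linker_lengths(hits: Dict[str, Dict[str, Coords]]) -> Dict[str, Optional[int]]:
    """Map each sequence ID to its linker length, or None if a region is missing."""
    out: Dict[str, Optional[int]] = {}
    for seq_id, regions in hits.items():
        if "p20" in regions and "p10" in regions:
            out[seq_id] = linker_length(regions["p20"], regions["p10"])
        else:
            out[seq_id] = None  # no p10-like (or p20-like) hit for this sequence
    return out
```

For example, a p20-like hit at residues 10 to 150 and a p10-like hit at residues 181 to 260 give a linker of 30 residues.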

TaxonKit, an NCBI taxonomic classification toolkit (Shen and Ren 2021), was then used to convert taxonomy names to their respective taxonomy IDs (TaxId) and assign the full taxonomy to the 321 sequences. If no taxonomic information was available, the sequence was assigned as ‘unclassified.’

Sequence Analysis and Tertiary Model Prediction

Sequence analysis of archaeal OCAs and type I MCAs was performed by aligning the respective unaligned sequences to the constructed p20-like region HMM profiles and the Peptidase_C14 HMM profile (PF00656) using hmmalign with the --trim option. A more reliable alignment was produced by realigning the resulting MSA using MAFFT v.7.470 (Katoh and Standley 2013) under default parameters. Sequences were filtered by removing sequences with no residues at the site of four key residues required for the correct formation of the S1 pocket (Cys, Asp, Ser, and Asp) (Wei et al. 2000; Yu et al. 2011; McLuskey et al. 2012). Three metazoan CAs, Homo sapiens CASP8, H. sapiens CASP3, and Caenorhabditis elegans cell death protein 3 (UniProt IDs: Q14790, P42573, and P42574), PCAs of H. sapiens (MALT1), C. elegans (MALT1), Amphiprion percula (MALT PCA3) and Dictyostelium discoideum (PCP_DICDI) (UniProt IDs: Q9UDY8, G5EG87, A0A3P8TIJ8, and Q9GPM2), and type I MCA of T. brucei (TbMCA-Ib) (UniProt ID: Q585F3) were included for sequence comparison. PCAs of Dictyostelium discoideum (UniProt ID: Q9GPM2) and Amphiprion percula (UniProt ID: A0A3P8TIJ8) were only used for the sequence comparison with archaeal type I MCAs as they possessed the p10-like region. Secondary structure prediction was performed using JPred (Protein Secondary Structure Prediction Server) (Drozdetskiy et al. 2015) in Jalview with default parameters. To generate sequence logos of archaeal OCAs and type I MCAs, MSAs were transformed to HMMs using hmmbuild and submitted to Skylign (https://www.skylign.org/) (Wheeler et al. 2014). The full alignment of OCAs (271 sequences) or type I MCAs (50 sequences) was used with the parameter ‘information above background.’ Putative protein tertiary structures of archaeal OCAs and type I MCAs were predicted using I-TASSER under default parameters (https://zhanglab.dcmb.med.umich.edu/I-TASSER/) (Yang and Zhang 2015). I-TASSER uses a hierarchical approach to predict protein structures, based on the multiple-threading approach LOMETS (Local Meta-Threading Server, version 3). Protein structures were visualized in PyMOL (The PyMOL Molecular Graphics System, Version 1.2r3pre, Schrödinger, LLC). The protein sequences of an archaeal OCA (GenBank ID: QEE14607.1) and a type I MCA (GenBank ID: TFG95767.1) were selected as the representative OCA and type I MCA models, with the TbMCA-Ib of T. brucei (PDB ID: 4AFV) as the template.
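
The S1 pocket filter can be illustrated with a short sketch. It assumes the alignment is held as a mapping from sequence ID to aligned string, that the four key alignment columns are already known (0-based indices here, chosen arbitrarily for the example), and that a gap at any one of them removes the sequence. The text does not say whether one missing residue or all four are required for removal, so that rule is an assumption, and the function names are hypothetical.

```python
from typing import Dict, Sequence

GAP_CHARS = frozenset("-.")  # gap symbols commonly used in alignments


def has_key_residues(aligned_seq: str, key_columns: Sequence[int]) -> bool:
    """True if every key column holds a residue rather than a gap."""
    return all(aligned_seq[col] not in GAP_CHARS for col in key_columns)


def filter_alignment(msa: Dict[str, str],
                     key_columns: Sequence[int]) -> Dict[str, str]:
    """Keep only sequences with a residue at all key columns."""
    return {seq_id: seq for seq_id, seq in msa.items()
            if has_key_residues(seq, key_columns)}
```

Note that this only tests for the presence of a residue at each site, as described in the text, and not for the identity of that residue.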

Phylogenetic Analysis

Four MSAs (archaeal type I MCA and OCA, and type I MCA and OCA across three domains of life) were created for the reconstruction of respective phylogenies. Archaeal OCA and type I MCA MSAs created for the sequence analysis were used, excluding sequences with no taxonomic classification. For the sequence selection of type I MCAs across the three domains of life, all 33 filtered archaeal type I MCA sequences were used. UniProt IDs of an identical number of type I MCAs of eukarya belonging to unicellular phyla and bacteria were obtained from supplementary documents (Klemenčič et al. 2019). Respective protein sequences were obtained from the UniProt database (The UniProt Consortium 2021). For the sequence selection of the p20-like region, 143 filtered p20-like regions of archaeal OCA sequences were used. An identical number of UniProt IDs of bacterial OCAs were obtained from supplementary documents (Klemenčič et al. 2019) and the respective sequences were obtained from the UniProt database. The p20-like region of eukaryotic CAs (H. sapiens CASP8, H. sapiens CASP3, and C. elegans cell death protein 3), PCAs (H. sapiens (MALT1), C. elegans (MALT1), and D. discoideum (PCP_DICDI)), and eleven eukaryotic type I MCAs with known functions (Tsiatsiani et al. 2011) were included. Eukaryotic OCAs were not considered as the aim was to determine the origin and investigate the evolutionary history of the p20-like region from prokaryotic OCAs to eukaryotic CAs, type I MCAs, and PCAs. An identical protocol was followed for the construction of all four MSAs. The unaligned file of acquired protein sequences was aligned to the p20-like region HMM profile for OCAs and the entire Peptidase_C14 HMM profile using hmmalign with the --trim option. Gaps were removed using seqmagick 0.7.0 at default parameters as gaps are treated as unknown characters in IQ-TREE (Nguyen et al. 2015). The resulting trimmed MSA file was then aligned using MAFFT under default parameters. Sequences were filtered by removing sequences with no residues at the site of four key residues required for the correct formation of the S1 pocket (McLuskey et al. 2012). The best substitution model for phylogenetic reconstruction was chosen using IQ-TREE to correct for multiple changes at the same site under a maximum likelihood (ML) approach, including all sites (including gaps) (Kalyaanamoorthy et al. 2017). Considered substitution models included a wide range of models that are supported by IQ-TREE, including advanced partition and mixture models (Minh et al. 2020). Rate heterogeneity across sites was considered as well (Minh et al. 2020). The phylogenetic tree was reconstructed with IQ-TREE using an ML approach (Nguyen et al. 2015). The reliability of the reconstructed tree was assessed using the bootstrap method with 1000 replicates in IQ-TREE (Hoang et al. 2018). The phylogenetic tree was visualized using the Interactive Tree of Life (iTOL) v5.0 online service (https://itol.embl.de/) (Letunic and Bork 2019).

Identification of C14 Peptidase-Accompanying Domains

C14 peptidase-accompanying domains refer to Pfam domains identified within an OCA or type I MCA sequence, adjacent to the C14 peptidase domain. The entire database of HMM profiles was retrieved from Pfam (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam33.0/) and the HMM profile database was prepared for usage using hmmpress in the HMMER package. The 321 identified full-length amino acid sequences of caspase homologs were subjected to an hmmsearch against the produced HMM profile database under a threshold E-value of ≤ 0.0001, independent E-value ≤ 0.0001, and score/bias ratio ≥ 10, and the output was produced using the --domtblout option (Eddy 1998).

Identification of Transmembrane Domains

Transmembrane domains in caspase-like protein sequences were predicted using the TMHMM Online Server v2.0 under default values (http://www.cbs.dtu.dk/services/TMHMM/) (Sonnhammer et al. 1998) with the unaligned MSA of the 321 sequences as the input file. Transmembrane domains were classified according to their topology (von Heijne 2006). If the C14 peptidase domain was located C-terminally to a single N-terminal transmembrane helix, it was regarded as a signal peptide, following the approach of Krogh et al. (Krogh et al. 2001).
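
The signal peptide rule stated above can be written as a small predicate. This is a sketch under assumptions: helices and the C14 peptidase domain are given as 1-based (start, end) residue ranges, and "N-terminal" is interpreted only as "upstream of the C14 peptidase domain", because no positional cut-off is given in the text. The function name is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), 1-based and inclusive


def is_signal_peptide_like(tm_helices: List[Span], c14_domain: Span) -> bool:
    """True for exactly one predicted helix lying entirely before the C14 domain."""
    if len(tm_helices) != 1:
        return False
    _, helix_end = tm_helices[0]
    return helix_end < c14_domain[0]
```

A sequence with one predicted helix at residues 5 to 27 and a C14 peptidase domain at residues 40 to 250 would be flagged, whereas a sequence with two helices, or with a helix downstream of the domain, would not.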

Functional Prediction and Classification of C14 Peptidase-Accompanying Domains

Putative functions (biological processes) of identified C14 peptidase-accompanying domains were predicted from gene ontology (GO) information obtained with the Pfam2Go (https://rdrr.io/github/missuse/ragp/man/pfam2go.html) and dcGO Enrichment approaches (https://supfam.org/SUPERFAMILY/dcGO/), as well as the information available on the Pfam website (https://pfam.xfam.org/). The Pfam release used for the functional annotation was v34.0 (Mistry et al. 2021). For dcGO GO assignment, a False Discovery Rate (FDR) threshold of 0.05 was chosen as the cut-off value and the remaining settings were left at their defaults. Multiple GO databases were used for functional prediction (Forslund and Sonnhammer 2008). Predicted functions were further classified into cell survival, PCD, and interaction with neighboring cells and the environment. Cell survival-associated functions refer to the functions listed by Ameisen (Ameisen 2002). The classification of the predicted functions (biological processes) used is as follows: cell differentiation, cell cycle (DNA replication, transcription, and recombination, DNA repair mechanisms, cell membrane repair mechanisms, cell division, chromatin organization, remodeling or dissolution of the nuclear membrane and chromosomal migration for cell division), and cell metabolism (regulation of ionic channels). The term ‘PCD-associated’ refers to the death domains that act as ligands or adaptors of the PCD molecular pathway domains (Aravind et al. 1999; Li and Roberts 2001). Interaction with neighboring cells and the environment refers to the following biological processes: cell adhesion, cell projection, cell surface protein or receptor, response to stress, and interaction with neighboring cells. The remaining functions were classified under “other.” If no function was predicted by the GO databases, the function of the domain was assigned as “unknown.”
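
The classification scheme above amounts to a lookup from predicted biological process to category, which can be sketched as below. The category and process labels are taken from the text, but the exact strings, the "death domain" key used to mark PCD-associated domains, and the treatment of domains with several predicted processes (all matching categories are returned) are assumptions made for illustration.

```python
from typing import Iterable, List, Optional

SURVIVAL = "cell survival"
PCD = "PCD"
INTERACTION = "interaction with neighboring cells and the environment"

# Process labels follow the text; the spelling of the keys is an assumption.
CATEGORY_BY_PROCESS = {
    "cell differentiation": SURVIVAL,
    "cell cycle": SURVIVAL,
    "cell metabolism": SURVIVAL,
    "death domain": PCD,
    "cell adhesion": INTERACTION,
    "cell projection": INTERACTION,
    "cell surface protein or receptor": INTERACTION,
    "response to stress": INTERACTION,
    "interaction with neighboring cells": INTERACTION,
}


def classify_domain(processes: Optional[Iterable[str]]) -> List[str]:
    """Return the sorted categories for one domain's predicted processes."""
    labels = [p.strip().lower() for p in (processes or [])]
    if not labels:
        return ["unknown"]  # no function predicted by any GO source
    found = {CATEGORY_BY_PROCESS[p] for p in labels if p in CATEGORY_BY_PROCESS}
    return sorted(found) if found else ["other"]
```

Returning a list keeps domains that map to more than one category visible, which matters here because death domains were also associated with cell survival functions.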

Domain Architecture Analysis

Protein domain architectures of archaeal caspase homologs were created using a custom R script (https://github.com/Asplund-Samuelsson/caspases.git) and an input file (Sequence ID, Domain, alignment coordinates). Domain refers to the C14_peptidase domain, C14 peptidase-accompanying domains, and transmembrane helices. Domains were placed in sequential order according to the alignment coordinates. The overlap of two domains was calculated from the predicted alignment end coordinate of domain1 and the start coordinate of domain2. If two domains on a sequence overlapped by more than 25%, the domain with the lower E-value was accepted. The C14 peptidase domain took precedence if other domains were found. If a transmembrane helix occupied the same region as a C14 peptidase-accompanying domain, the C14 peptidase-accompanying domain took precedence. Consecutive repeats of identical domains were identified as a single domain.
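
The precedence rules above can be summarized in a short Python sketch. It is not the custom R script cited in the text. It assumes that the 25% overlap is measured relative to the shorter of the two domains, which the text leaves open, and it compares each domain only with the most recently accepted one. The Domain type, the kind labels, and the function names are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Domain(NamedTuple):
    name: str
    start: int                    # 1-based, inclusive
    end: int
    kind: str = "pfam"            # "c14", "pfam" (accompanying), or "tm"
    evalue: float = float("inf")  # transmembrane helices carry no E-value


# Lower value wins: C14 domain, then accompanying domains, then TM helices.
PRIORITY = {"c14": 0, "pfam": 1, "tm": 2}


def overlap_fraction(a: Domain, b: Domain) -> float:
    """Shared residues divided by the length of the shorter domain (assumed)."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a.end - a.start, b.end - b.start) + 1
    return shared / shorter


def preferred(a: Domain, b: Domain) -> Domain:
    """Pick by domain kind first, then by the lower E-value."""
    if PRIORITY[a.kind] != PRIORITY[b.kind]:
        return a if PRIORITY[a.kind] < PRIORITY[b.kind] else b
    return a if a.evalue <= b.evalue else b


def resolve_architecture(domains: Iterable[Domain],
                         max_overlap: float = 0.25) -> List[str]:
    """Order domains, resolve overlaps, and collapse consecutive repeats."""
    kept: List[Domain] = []
    for dom in sorted(domains, key=lambda d: (d.start, d.end)):
        if kept and overlap_fraction(kept[-1], dom) > max_overlap:
            kept[-1] = preferred(kept[-1], dom)
        else:
            kept.append(dom)
    names: List[str] = []
    for dom in kept:
        if not names or names[-1] != dom.name:
            names.append(dom.name)
    return names
```

With a transmembrane helix at residues 5 to 27, a C14 peptidase domain at 40 to 250, an overlapping Peptidase_C13 hit at 60 to 240, and two consecutive hits of the same accompanying domain further downstream, the resolved architecture is the helix, the C14 peptidase domain, and a single copy of the accompanying domain.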