Introduction

Each time a new genome is sequenced, genes coding for proteins with no known homolog are found, even when genomes of closely related species are available, as is the case of primates (Toll-Riera et al. 2009; Cai and Petrov 2010; Ruiz-Orera et al. 2015; Sandmann et al. 2023) or Drosophila (Domazet-Loso and Tautz 2003; Wang et al. 2004; Heames et al. 2020; Grandchamp et al. 2023).

In the former case, the hypothesis that human-specific proteins may prove to be involved in the behavioral or anatomical peculiarities of the human species (Vakirlis et al. 2022; Papadopoulos and Albà 2023), such as the size of its brain (Rich and Carvunis 2023) or its functions (Li et al. 2010; Duffy et al. 2022; An et al. 2023), seems worthy of consideration. In the present study, taking advantage of both the high quality of the annotation of the human proteome (Amaral et al. 2023) and the availability of a significant number of well-annotated primate proteomes (Marques-Bonet et al. 2009; Juan et al. 2023), an extensive search of human-specific proteins was undertaken.

To achieve this, as in a previous study (Sanejouand 2023), a reference database was set up. Then, information about the tertiary structure of the identified putative human-specific proteins was gathered, with the idea that such knowledge could provide hints about their function or origin. No such information based on experimental data was found, so advantage was taken of recent progress in structure prediction methods (Kryshtafovych et al. 2021; Necci et al. 2021; Liu et al. 2023). Since predictions can vary significantly from one method to another, notably in the case of proteins with no known homolog (Monzon et al. 2022; Aubel et al. 2023; Middendorf and Eicholt 2023), three prediction methods based on different approaches were considered. The first is IUPred (Dosztányi 2018), which provides a qualitative prediction stating whether a polypeptide is expected to adopt a globular fold. The second is flDPnn (Hu et al. 2021), a neural network that provides structure disorder predictions and was ranked among the top methods in the recent Critical Assessment of protein Intrinsic Disorder (CAID) prediction experiment (Necci et al. 2021). The third is AlphaFold (Jumper et al. 2021), which has proved able to predict the tertiary structure of proteins at an atomic level of detail (Kryshtafovych et al. 2021; Jones and Thornton 2022).

Methods

Choice of a Reference Database

The definition of what a species-specific protein is depends upon what is known about the proteomes of the closest species at a given time. For studying species-specific proteins, as well as for the sake of reproducibility, it is thus important to choose a well-defined reference system (Sanejouand 2023), that is, a set of reference proteomes. Moreover, since the quality of the annotation of a proteome can vary significantly from one proteome to another, these proteomes have to be chosen from among the best annotated ones.

To do this, 27 UniProt reference proteomes (UniProt Consortium 2017) of primates were picked, that is, all those that have at least a standard level of annotation (Footnote 1), according to the Complete Proteome Detector (UniProt Consortium 2021). Note that ten of these are high-value outliers, namely those of Callithrix jacchus, Cercocebus atys, Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Macaca nemestrina, Pan troglodytes, Papio anubis, Rhinopithecus roxellana, and Sapajus apella, meaning that they have significantly more identified proteins than closely taxonomically related species.

For these 27 reference proteomes, complete BUSCO predictions of single-copy orthologs (Simão et al. 2015) are found for 94% (median value) of the cases. However, the percentage of short (less than 50 amino acid residues long) proteins varies widely from one proteome to another, ranging between 0.1% (Sapajus apella) and 3.5%, being over 1% in the case of only four primate species, namely Pongo abelii, Pan troglodytes, Macaca fascicularis, and Homo sapiens.
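
As an illustration, the percentage of short proteins in a proteome could be computed from a FASTA file along the following lines. This is a minimal sketch, not the code used in the study: the function names read_fasta and percent_short are hypothetical, and "short" is taken to mean strictly fewer than 50 residues, as stated above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def percent_short(lines, max_length=50):
    """Percentage of sequences strictly shorter than max_length residues."""
    lengths = [len(seq) for _, seq in read_fasta(lines)]
    if not lengths:
        return 0.0
    return 100.0 * sum(n < max_length for n in lengths) / len(lengths)
```

With a real proteome, the lines argument could simply be an open file handle, since the function only iterates over lines.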

Overall, there are between 19,229 (Chlorocebus sabaeus) and 50,207 (Macaca fascicularis) proteins per proteome (40,000 ± 7500, on average) in our reference database, with a total of 1,083,746 proteins, including known isoforms.

Figure 1 shows the phylogenetic tree of the primate species considered in the present study, according to the TimeTree webserver version 5 (Kumar et al. 2022).

Fig. 1 The phylogenetic tree of the primate species with well-annotated proteomes considered herein

Search for Homologs

For each of the 20,449 human proteins that were at least 30 amino acid residues long, associated with a given gene, as found in UniProt (in February 2023), homologs in the reference database were sought using BLAST (Altschul et al. 1997) version 2.6.0+, assuming that two proteins are homologous when the E-value of their pairwise alignment is lower than \(10^{-6}\) (Lobley et al. 2007; Lucas et al. 2014; Sanejouand 2023). Note that, to avoid an overestimation of the number of specific proteins due to the filtering of low-entropy segments, that is, of segments with restricted amino acid composition, composition-based statistics (Schäffer et al. 2001) were not considered (-comp_based_stats 0).
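
The homology criterion above can be sketched as follows. The sketch assumes that the BLAST results are available in the standard tabular format (-outfmt 6), in which the query identifier is the first column and the E-value the eleventh, and that the database contains no human sequence, so that self-hits need not be removed. The function names are hypothetical and this is not the script used in the study.

```python
def queries_with_homologs(blast_lines, max_evalue=1e-6):
    """Return the query ids having at least one hit with E-value < max_evalue."""
    hits = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue < max_evalue:  # strictly lower than the threshold
            hits.add(query_id)
    return hits


def specific_proteins(sequences, blast_lines, min_length=30, max_evalue=1e-6):
    """Ids of proteins long enough to be queried and left without any homolog.

    sequences: dict mapping a protein id to its amino acid sequence.
    """
    candidates = {pid for pid, seq in sequences.items() if len(seq) >= min_length}
    return candidates - queries_with_homologs(blast_lines, max_evalue)
```

Note that a hit whose E-value is exactly equal to the threshold is not counted as a homolog here, in keeping with the "lower than" wording of the criterion.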

Noncoding RNAs

For each human-specific protein found, that is, for each human protein with no homolog in the reference database, its possible encoding by a noncoding RNA (ncRNA), as found in the RNAcentral database (The RNAcentral Consortium 2015) version 22, was checked. To do this, its sequence was compared with those obtained by translating all nonoverlapping open reading frames at least 90 nucleotides long. Note that it was assumed that both start and stop codons are standard ones, although it is known that peptides encoded by ncRNAs can have atypical stop codons (Dragomir et al. 2020).
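
The translation step can be illustrated with the following sketch, which is an assumption-laden reading of the protocol rather than the code actually used. It assumes that only the three forward reading frames are scanned, that an open reading frame starts at AUG and ends at the first in-frame standard stop codon, that its length is counted without the stop codon, and that frames left open at the end of the RNA are discarded. The name translate_orfs is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_orfs(rna, min_nt=90):
    """Translate the nonoverlapping ORFs of each forward frame of an RNA.

    Returns the peptides (without the stop) of ORFs at least min_nt long.
    """
    seq = rna.upper().replace("U", "T")
    peptides = []
    for frame in range(3):
        current = None  # residues of the ORF being read, None outside an ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            aa = CODON_TABLE.get(codon, "X")  # X for ambiguous codons
            if current is None:
                if codon == "ATG":
                    current = ["M"]
            elif aa == "*":
                if 3 * len(current) >= min_nt:
                    peptides.append("".join(current))
                current = None  # resume scanning after the stop codon
            else:
                current.append(aa)
    return peptides
```

Each returned peptide could then be compared with the sequence of a human-specific protein, for instance with BLAST.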

Protein Globularity

The degree of globularity of each human-specific protein found, that is, the percentage of the protein length predicted to be globular, as well as the number of globular domains, was estimated using version 1 (Dosztányi et al. 2005) of the standalone version of IUPred (Dosztányi 2018; Pajkos et al. 2023). Note that IUPred performs its predictions by considering the local sequential environment of each amino acid residue within 2–100 residues in either direction. Note also that, at variance with most recent methods (Necci et al. 2021), IUPred does not make use of evolutionary data, which are expected to be lacking in the case of species-specific proteins.

Ordered Residues

Binary predictions of ordered/disordered residues were obtained using the flDPnn neural network (Hu et al. 2021), as implemented in the eponymous webserver (Footnote 2).

In this study, high percentages, namely over 80%, of ordered residues are assumed to indicate that the considered protein is globular, meaning that it can adopt a stable tertiary structure. Note that the results obtained herein depend slightly upon the threshold chosen to indicate that a protein is predicted to be globular.
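
The criterion above amounts to a simple threshold on the fraction of ordered residues, as in the following sketch. The input encoding (one flag per residue, true when the residue is predicted to be disordered) and the function names are assumptions made for the illustration; they do not describe the output format of flDPnn.

```python
def percent_ordered(disordered_flags):
    """Percentage of residues whose flag says 'ordered' (flag is falsy)."""
    flags = list(disordered_flags)
    if not flags:
        raise ValueError("empty prediction")
    return 100.0 * sum(not flag for flag in flags) / len(flags)


def is_predicted_globular(disordered_flags, threshold=80.0):
    """True when strictly more than threshold percent of residues are ordered."""
    return percent_ordered(disordered_flags) > threshold
```

Changing the threshold argument is a straightforward way to check how much a given result depends on this choice.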

Structure Prediction

Predictions of tertiary structure were picked from the AlphaFold Protein Structure Database (Varadi et al. 2022), except for PACMP, which was included in UniProt after the release of the fourth version of the database. In this case, the prediction was performed using the standalone version of AlphaFold2, version 2.3, as available on the GitHub webserver (Footnote 3).

AlphaFold2 can also be used for predicting whether a protein has disordered segments. Indeed, AlphaFold2 provides an estimate of the accuracy of its prediction for the position of each amino acid residue, called pLDDT (Footnote 4), with values over 90 corresponding to high quality, meaning that residue positions can be trusted, while for values below 50 they should not be (Varadi et al. 2022). In the latter case, this can mean that the residues belong to disordered segments (Ruff and Pappu 2021; Pajkos et al. 2023), but it can also be interpreted as a possible lack of homologs, including remote ones, in the sequence databases available at the time of training the network (Varadi et al. 2022).

Hereafter, the overall quality of the prediction of the structure of a protein is assumed to be given by the average of the quality of the prediction of the position of its residues (\(\langle\)pLDDT\(\rangle\)).
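
Since AlphaFold2 models store the pLDDT of each residue in the B-factor field of the coordinate file, this average can be obtained as sketched below for a model in PDB format. The sketch takes one value per residue by reading the C-alpha atoms only; the function name mean_plddt is hypothetical, and this is not necessarily how the values reported herein were computed.

```python
def mean_plddt(pdb_lines):
    """Average pLDDT over residues, read from the C-alpha ATOM records.

    In the PDB format, the atom name occupies columns 13-16 and the
    B-factor (here, the pLDDT) columns 61-66.
    """
    values = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no C-alpha atom found")
    return sum(values) / len(values)
```

The same list of per-residue values could also be used to compute the fraction of residues below 50, that is, those whose position should not be trusted.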

Results

Fig. 2 Number of human-specific proteins as a function of the number of primate proteomes in which homologs of the human proteins were sought. Primate proteomes were added one by one according to the time of divergence between the primate and the human species, as provided in the TimeTree database, Pan troglodytes being added first (left) and Prolemur simus the 27th (right)

How Many Human-Specific Proteins?

The requirement that the reference database be large enough (Vakirlis and McLysaght 2019) was assessed as follows: When proteins of Homo sapiens not found in the proteome of its closest relative, namely Pan troglodytes, are sought, 347 are identified. Note that this number is lower than a previous estimate obtained 10 years ago, namely 634 (Ruiz-Orera et al. 2015), maybe as a result of the improvement of the annotation of the human proteome (Amaral et al. 2023). Indeed, when the search was performed the other way around (Sanejouand 2023), 1036 chimpanzee-specific proteins (Footnote 5) were found.

In fact, as shown in Fig. 2, when the number of proteomes in the reference database is increased, by adding proteomes one after another starting from the proteomes closest to the human species, the number of human-specific proteins drops from 347 (left) to 193 (right). Interestingly, a few species make significantly higher contributions to this reduction, like the fourth (Pongo abelii) and tenth (Macaca fascicularis) (the two sharpest drops in Fig. 2), further suggesting that their proteomes are more complete than the others. However, while the proteome of Macaca fascicularis is indeed the largest in our reference database, the size of the proteome of Pongo abelii, with 39,491 proteins, is slightly below the average. Indeed, its level of annotation is only considered standard, according to the Complete Proteome Detector (UniProt Consortium 2021).
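
The way such a curve is built can be sketched as follows, assuming that the homology search has already been summarized, for each human protein, as the set of proteomes in which at least one homolog was found. The function name specific_counts and the toy data in the usage note are hypothetical.

```python
def specific_counts(homolog_proteomes, ordered_proteomes):
    """Number of proteins left without homolog as proteomes are added.

    homolog_proteomes: dict mapping a protein id to the set of proteomes
        in which at least one homolog of this protein was found.
    ordered_proteomes: proteome names, from the closest species to the
        most distant one.
    Returns one count per proteome added.
    """
    remaining = set(homolog_proteomes)
    counts = []
    for proteome in ordered_proteomes:
        # Keep only the proteins that still have no homolog so far.
        remaining = {p for p in remaining if proteome not in homolog_proteomes[p]}
        counts.append(len(remaining))
    return counts
```

For instance, with four toy proteins, one of which has no homolog at all, and three proteomes, the counts decrease from 3 to 1. A large drop between two consecutive counts points to a proteome that contributes many otherwise missing homologs.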

Fig. 3 Number of human proteins with homologs found in a given number of primate proteomes; 193 human proteins have no homolog in the proteomes of the 27 other primates considered

As shown in Fig. 3, while 89% of human proteins have homologs in all 27 proteomes in our reference database, 836 of them have homologs in all but one, 298 in all but two, etc., suggesting that the annotation of several reference proteomes is far from being complete. Of course, if the annotation of the 27 proteomes considered were improved or if more primate proteomes were added, the number of proteins found to be specific to the human species would continue to drop. To partially take this expected trend into account, homologs of the 193 proteins found above were sought in 52 UniProt reference proteomes of other mammalian species, these other proteomes being chosen on the basis of their high level of annotation, being all high-value outliers (Footnote 6), according to the Complete Proteome Detector (UniProt Consortium 2021).

Homologs were indeed found for 23 (12%) of them. However, in more than half of these cases, they were found in a single mammalian species only, as if the annotation of these proteins was intrinsically difficult. A possible reason is that these proteins are often short, with an average length of 106 ± 54 amino acid residues.

A total of 170 putative human-specific proteins were identified above, but as suggested, note that this number is expected to drop year on year as a consequence of the ongoing progress of proteome annotation. Note, however, that the protocol used in the present study was designed to be easy to reproduce, allowing for independent updates.

Table 1 The 25 human-specific proteins known at either the protein (top) or the transcript (bottom) level

How Many Well-Known Ones?

In UniProt, the degree of knowledge, that is, the type of evidence that supports the existence of a protein, is quantified through a number ranging between one (known at the protein level) and five (uncertain).

Among the 170 putative human-specific proteins identified above, only 2 are known at the protein level (top of Table 1) according to UniProt, namely PACMP, the poly-ADP-ribosylation-amplifying and CtIP-maintaining micropeptide (Zhang et al. 2022), and SDIM1, the stress-responsive DNAJB4-interacting membrane protein 1 (Lei et al. 2011). Such a result is in sharp contrast to the fact that 90% of the human proteome is nowadays known at this level (Adhikari et al. 2020). Note that PACMP is short (44 residues) and, as such, could have escaped annotation in the proteomes considered above. In fact, PACMP was included in UniProt quite recently (Footnote 7).

On the other hand, while 23 of these proteins are known at the transcript level (Table 1), 23 others are just predicted. Strikingly, the 122 others (72%) are deemed uncertain in UniProt, being annotated as dubious CDS or gene predictions, possible pseudogenes, etc. This means that, according to UniProt, although a few of them may prove to be actual proteins, this is unlikely for the vast majority of them.

Actually, among the 25 human-specific proteins known at either the protein or the transcript level, except PACMP, SDIM1, CATR1, and HCP5, all of them are considered to be uncharacterized, meaning that they do not have any known function. On the other hand, as stated in Table 1, 21 of them are found to be encoded by an open reading frame of a long noncoding human RNA (lncRNA), while 2 others, namely CATR1 and YK004, have close RNA-encoded homologs.

Interestingly, these 25 proteins, except SDIM1, CATR1, YV004, and YS039, also have close homologs encoded in the open reading frames of RNAs of other primate species, meaning that, at the RNA level, their sequences are not human specific. Since, in UniProt, no transcript is known for any of these homologs, this raises the possibility that, in the human species, these RNAs have acquired the ability to be recognized as messenger ones. Of course, they may also just have been missed so far in species other than humans, at both the protein and the transcript level. Note that the growth of the number of reported noncoding RNA genes has been rapid, suggesting that primate catalogs may, in this respect, prove rather incomplete (Amaral et al. 2023).

How Many Globular Ones?

From the results above, it is tempting to speculate that most human-specific genes do not code for proteins and may instead be involved, like many lncRNAs (Statello et al. 2021), in the regulation of gene expression (Nahon 2003). However, it has recently been shown that the translation of annotated lncRNA transcripts is widespread (Patraquim et al. 2020, 2022; Mudge et al. 2022; Broeils et al. 2023), with up to 3330 human lncRNAs found bound to ribosomes with active translation elongation (Lu et al. 2019).

Actually, lncRNAs often show coding potential and sequence constraints similar to evolutionarily young protein coding sequences (Ruiz-Orera et al. 2014). It is thus necessary to assess the coding potential of lncRNAs. A straightforward way to do this is to predict how globular the encoded proteins are expected to be (Papadopoulos et al. 2021; Peng and Zhao 2024). As shown in Table 1, ten human-specific proteins known at the transcript level and encoded by an lncRNA (50% of them) are predicted to be at least 80% globular, by IUPred, with a single structural domain and more than 80% of ordered residues, according to flDPnn. Note that the predictions of IUPred and flDPnn are similar. In fact, they differ by more than 20% for five cases only, namely YS049, FEAS1, YI001, IDAS1, and YT009.

Note also that the two human-specific proteins that are not known to be encoded or to have homologs encoded by a human lncRNA, namely SDIM1 and YV004, are predicted to be at least 83% globular, by IUPred, also with a single structural domain, and to have more than 80% of ordered residues, according to flDPnn.

On the other hand, since nearly 30% of regions within the proteome are expected to be disordered (Ruff and Pappu 2021), other human-specific lncRNAs could also encode genuine, though disordered, proteins.

Fig. 4 Quality of tertiary structure prediction, according to AlphaFold2, for the whole human proteome (left) and for the putative human-specific proteins identified herein (right). The dashed line indicates the quality threshold below which the confidence in the models is very low (\(\langle\)pLDDT\(\rangle <\) 50)

What about Their Structure?

No homolog was found in the Protein Data Bank (Kouranov et al. 2006) for the 25 human-specific proteins identified above. However, thanks to machine learning algorithms, major progress has recently been witnessed in the field of protein structure prediction (Jumper et al. 2021; Jones and Thornton 2022). Moreover, such predictions have been performed on a large scale and are nowadays available in public databases (Varadi et al. 2022).

As illustrated in Fig. 4, the tertiary structure of most human proteins has been predicted with a high level of confidence by AlphaFold2 (Varadi et al. 2022), the average pLDDT being over 90 for nearly 40% of them, and over 70 for more than 70% of them. Note that the structures of only 14% of human proteins are predicted with low confidence (\(\langle\)pLDDT\(\rangle\) below 50).

However, in the case of the putative human-specific proteins identified above, up to 70% of them are predicted with such low confidence (Fig. 4). As specified in Table 1, among the 25 human-specific proteins known at either the protein or the transcript level, AlphaFold2 is able to make a fair prediction in the case of one of them (\(\langle\)pLDDT\(\rangle\) = 83) only, namely CATR1, the CATR tumorigenic conversion 1 protein (Li et al. 1995, 1998). However, as shown in Fig. 5, its predicted structure is fairly simple, with a single, long α-helical segment.

The tendency of AlphaFold to predict helical structures for short proteins with no known homolog has already been noted (Monzon et al. 2022). In fact, the two other rather confident predictions of AlphaFold2 (\(\langle\)pLDDT\(\rangle>\) 50; see Table 1) are also for small proteins with very simple topologies (Fig. 5).

Note that the structure of most proteins predicted to be globular, by IUPred, and to have more than 80% of ordered residues, by flDPnn, is predicted with little confidence by AlphaFold2 (Table 1). This result further suggests that AlphaFold2 is of little help for predicting the structure of proteins with no known homolog (Varadi et al. 2022; Monzon et al. 2022; Middendorf and Eicholt 2023), as for instance illustrated in a previous study of 362 eukaryotic proteomes (Sanejouand 2023). On the other hand, such discrepancies could also indicate conditional folding of intrinsically disordered regions (Alderson et al. 2023).

Fig. 5 The best predicted structures of human-specific proteins, according to AlphaFold2. Top: PACMP, the poly-ADP-ribosylation-amplifying and CtIP-maintaining micropeptide (\(\langle\)pLDDT\(\rangle\)=57). Middle: Q68DW6, an uncharacterized protein (\(\langle\)pLDDT\(\rangle\)=61). Bottom: CATR1, the CATR tumorigenic conversion 1 protein (\(\langle\)pLDDT\(\rangle\)=83). The darker the color, the higher the level of confidence (pLDDT). For entries Q13166 (CATR1) and Q68DW6, colored representations can be found at https://www.uniprot.org/uniprotkb. Drawn with Chimera (Pettersen et al. 2004)

Conclusions

By looking for a lack of homologs in a reference database of 27 well-annotated proteomes of primates and 52 well-annotated proteomes of other mammals, 170 putative human-specific proteins were identified. However, most of these are deemed uncertain in UniProt, casting doubts on the 23 that are deemed to be predicted. Indeed, given the efforts made to complete the annotation of the human proteome (Amaral et al. 2023), it becomes less and less likely to have human proteins that are not known at least at the transcript level (Adhikari et al. 2020).

On the other hand, 23 of the 25 human-specific proteins known at either the protein or the transcript level are found to be encoded or to have close homologs in an open reading frame of a human lncRNA (Table 1). While one of them, namely PACMP, the poly-ADP-ribosylation-amplifying and CtIP-maintaining micropeptide, is known at the protein level (Zhang et al. 2022), 12 others are predicted to be at least 80% globular, with a single structural domain, and to have more than 80% of ordered residues, suggesting that a majority of human-specific proteins may prove to be encoded by lncRNAs.

In fact, de novo proteins have already been found to have an lncRNA origin (Ruiz-Orera et al. 2014, 2020). Such lncRNAs could come from RNAs with a former regulatory function or from intergenic open reading frames (Papadopoulos et al. 2021), which in turn may appear randomly, becoming new functional proteins when they happen to confer selective advantages (Ruiz-Orera et al. 2020).