Introduction

Human endogenous retroviruses (HERVs) and related genetic elements occupy ~8 % of the human genome. Genomic copies of HERVs are of particular interest because, in addition to functional viral genes, they carry a multitude of regulatory DNA regions serving as promoters [1–3], enhancers [4, 5], polyadenylation signals [6, 7], insulators [8, 9] and binding sites for various nuclear proteins [2, 10–13]. Many families of HERVs exhibit high transcriptional activity in human tissues [1, 14–17]. HERVs are believed to be remnants of numerous retroviral infections [18–20] that occurred repeatedly during primate evolution [18, 21]. This view is supported by the artificial reconstruction of an active HERV element [20]. This element successfully amplified via an extracellular pathway involving retroviral particle exit from the cell and reinfection, thus recapitulating ex vivo the molecular events responsible for its dissemination in the host genomes. HERVs became fixed in the genome and inheritable because their insertions occurred in the germ cell lineage [22–25].

HERVs are composed of sequences related to retroviral genes and are flanked at both ends by ~1 kb sequences called long terminal repeats (LTRs). An LTR comprises functional enhancers [26], promoters and polyadenylation signals [18] normally used for retroviral gene expression. However, the LTRs may also drive the transcription of adjacent host genomic sequences [27, 28]. Most HERVs reside in the human genome as solitary LTRs that arose through homologous recombination between the 5′- and 3′-flanking LTRs of the same full-length HERV element [29–31]. However, some full-length HERVs express viral genes in a variety of human tissues [32] and even form virus-like particles [33, 34]. Expression of HERV-encoded proteins is directly or indirectly associated with the progression of many human diseases [35].

In this review, we summarize current progress in the identification of HERV-linked genomic features that affect human gene expression. We consider the structure of HERVs, their protein-coding potential, their influence on the transcription of host genes, their implication in human diseases and evolutionary aspects of HERV accumulation, mutation and selection. Special attention is paid to the databases featuring functional peculiarities of HERVs.

Structure of HERVs

Genomic structure

Human endogenous retroviruses are represented in the human genome by 504 families including ~720,000 individual members, which make up ~8 % of human DNA. The genomic structure of a full-length HERV includes retroviral genes, typically Gag, Prot, Pol and Env, and flanking ~1 kb sequences termed long terminal repeats (LTRs). Most HERVs exist in the form of solitary LTRs, which most probably arose through homologous recombination between the LTRs of full-length elements [29–31]. In turn, recombination between different HERV elements may cause further genomic rearrangements, including copy number variation (CNV) of known human genes. For example, this mechanism may be responsible for at least 78 CNV cases encompassing known human genes [36].

Evolutionarily recent HERVs have more intact open reading frames (ORFs) and less mutated regulatory sequences than the ancient elements [37]. The most recent endogenous retroviral group of the human genome, HERV-K/HML-2 (ERVK), comprises at least fifty-five full-length members (termed proviruses) [21] and ~2000 solitary LTRs [38].

Having a variety of potential regulatory sequences such as promoters, enhancers, transcription factor binding sites, splice sites and polyadenylation signals, LTRs are believed to carry most of the transcriptional regulatory potential of endogenous retroviruses. Solitary LTRs are differentially methylated in different human tissues [39–41], may specifically bind host cell nuclear proteins [42, 43], serve as tissue-specific transcriptional promoters and enhancers [4, 26, 39, 44], and, finally, are transcribed in vivo in many tissues [1, 45–48]. In addition, LTRs may contribute to the host gene regulation network by acting in cis (by providing regulatory elements) or in trans (by driving expression of antisense transcripts) [30, 48–51]. Overall, the LTR sequences have a higher substitution rate than the rest of the non-coding part of the genome [21]. This higher mutation rate underlines the regulatory potential of LTRs: mutations may inactivate an LTR and thereby counteract its deleterious effects on the host cell [21].

HERV life cycle

The life cycle of a HERV comprises reverse transcription of viral RNA, followed by the integration of a nascent DNA copy into the genomic DNA of the host cell [52, 53]. Importantly, retroviral genomic RNA differs from the genomic copy by the absence of LTRs, which are built during reverse transcription, a complex multistep process including several template switching events [54–56].

Transcription of an endogenous retrovirus can be illustrated by the example of HERV-K/HML-2 (ERVK) family members, which have the most complex organization among all HERV elements [19]. The inserted full-length proviral copy is normally transcribed from the functional promoter in its 5′ LTR (Fig. 1a). Transcription stops at the polyadenylation signal of the 3′-terminal LTR. The polyadenylated full-length transcript can be further spliced, generating at least three different splice forms (Fig. 1b). The fully unspliced transcript encodes the 160-kDa viral polyprotein Gag–Prot–Pol. Its translation requires two (−1) ribosomal frameshifts [37]. The polyprotein is then processed by the intramolecular activity of Prot (the retroviral protease), and the mature proteins are released. The Gag protein is further cleaved to release matrix, core and nucleocapsid proteins [37, 57]. Pol is the retroviral reverse transcriptase (RT), which also possesses RNase H activity. The single-spliced transcript encodes the envelope protein (Env), which is needed to infect cells via binding to a cellular receptor [37]. The double-spliced RNA encodes a short regulatory protein. Type 1 HERV-K/HML-2 (ERVK) elements share a 292 nt deletion in the Env region. Apart from fusing the Pol and Env genes, this deletion also gives rise to a difference between the two isoforms of regulatory proteins encoded by the double-spliced transcripts. The 1.8 kb-long type 2 HERV-K/HML-2 (ERVK) proviral transcript codes for the 15-kDa accessory protein Rec (also called cORF [58]), which is the only known auxiliary factor encoded by HERVs [37]. The type 1-specific double-spliced RNA encodes Np9, a 9-kDa protein that shares only the N-terminal 15 aa residues with Rec [37, 59, 60]. Rec specifically accumulates in the nucleoli. It has a striking functional homology to lentiviral RNA-binding nuclear export proteins like the HIV and HTLV proteins Rev and Rex, respectively [37].
Similarly to those proteins, Rec binds to unspliced or partially spliced viral transcripts and mediates their transfer to the cytoplasm where they escape the cellular splicing machinery and can be translated into retroviral polyprotein [61]. Rec may act via interaction with the host-encoded protein Staufen-1 that facilitates transfer of unspliced transcripts to the cytoplasm [62]. The Rec binding site, termed Rec Responsive Element (RRE), is a highly structured RNA motif within the U3R region of the 3′ LTR. Interestingly, this functional motif can be recognized by the HTLV Rex protein that can at least partly substitute for the Rec function [37, 61, 63]. Similarly to Rec, Np9 accumulates in the nucleus. Although Np9 expression was found in many tissues and cell lines, the exact molecular function for this protein remains unclear.

Fig. 1

Functional genes encoded by HERVs, on the example of HERV-K (HML-2) elements. a Genomic organization of the reconstituted full-size provirus. Apart from “classical” retroviral genes Gag, Prot, Pol and Env, an additional gene Rec or Np9, depending on the provirus type, is encoded. b Different types of proviral transcripts. Full-length subgenomic transcript encodes for Gag–Prot–Pol polyprotein, single-spliced product codes Envelope protein, double-spliced RNA is for Rec/Np9, whereas ~1.5 kb long completely spliced transcript of unknown function appears to lack any functional open reading frames

Finally, ~1.5 kb-long completely spliced proviral transcripts (Fig. 1b) appear to lack any protein coding regions and may have only some regulatory functions, if any [64, 65].

Expression of HERV proteins

Human endogenous retrovirus proteins are actively expressed in a variety of human tissues [37]. For example, autologous antibodies against multiple HERV-K/HML-2 (ERVK) proviral Env epitopes were found in ~30 % of healthy individuals [66]. Increased HERV protein production was also detected in placentas and in embryonic tissues, in line with the identification of putative responsive elements for several pregnancy hormones within the HERV LTRs [67, 68]. Gag protein expression may induce massive T cell stimulation or apoptosis [69]. Endogenous Prot genes may help exogenous retroviruses, such as lentiviruses, to infect the host cells [70–72]. Rec and Np9 activities may interfere with normal nucleocytoplasmic transport mechanisms [73] or even serve as inducers of organ-specific tumorigenesis [74]. Finally, the Env protein has an immunosuppressive domain that inhibits T and B cell activation and proliferation and modulates the expression of many cytokines [75, 76]. This may be functionally linked to increased HERV expression in some tumors [37, 77]. Of note, human primary lymphocytes express up to 32 different HERV-K/HML-2 (ERVK) envelopes, and at least two of the most highly expressed Env genes retain their protein-coding capacity [78].

HERV proliferation in the genome

Although traces of several hundred retroviral groups can be identified in human DNA, there is no evidence that any of them remains active in terms of generating new copies in the human genome. However, until recent times in human evolution, a few retroviral families were prolific [24]. Some HERV inserts are specific to human DNA, which means that they were inserted after the radiation of the human and chimpanzee ancestor lineages, which occurred ~6 million years ago [38, 79, 80]. HERV-K/HML-2 (ERVK) is the sole group of HERVs known to contain human-specific members, with a total of ~140 such elements identified in previous works [1, 21, 24, 38, 47, 81–86], which contributed ~330 kb to human DNA [87]. Similarly to other HERVs, this group comprises mostly proliferation-deficient and transcriptionally silent elements [88]. However, many family members are transcriptionally active [45] and theoretically may possess an infectious potential [79, 86]. Judging by the number of literature citations, this group may be considered the most biologically active human endogenous retroviral family [31, 89, 90]. Detailed analysis of the human-specific LTR structures provided evidence that at least three HERV-K/HML-2 (ERVK) master genes were active in the hominid lineage soon after the human-chimpanzee ancestral radiation [24]. Moreover, a few dozen HERVs are also insertionally polymorphic in the human population [25, 79, 86, 91–94], suggesting that this family remained prolific until recent times in human evolutionary history [74, 86, 89, 95]. Bioinformatic screening of human-specific HERVs revealed that they may encode a total of 11 functional genes for Gag, 12 for Prot, 9 for Pol, 8 for Env and 9 for Rec [19].

In contrast to humans, in other mammals, endogenous retroviruses may be actively proliferating and highly invasive [96, 97]. For example, murine ERVs are very active and form a significant source of mutation of the murine germ-line. Approximately, one in ten spontaneous phenotypes that have been described in mice are caused by insertional mutagenesis by an ERV [98].

In addition to the standard mechanism of HERV proliferation, which involves reverse transcription and genomic integration of retroviral RNA by a complex of HERV-encoded proteins, at least two alternative possibilities can be mentioned. First, new HERV copies may be generated by unrelated DNA duplication mechanisms, as in the case of the widespread expansion of centromeric human endogenous retroviruses [99]. A single recently inserted HERV-K/HML-2 (ERVK) provirus, named K111, is present in at least 100 copies spread across the centromeres of fifteen human chromosomes. In the chimpanzee genome, K111 is present as a single copy, and it is most likely absent from the DNA of other primates [99]. Second, new HERV copies may appear in the form of processed pseudogenes generated by the reverse transcriptase machinery of a LINE-1 retrotransposon rather than by HERV self-encoded enzymes [10]. Such HERV copies lack LTRs and are most frequently transcriptionally silent [100]. This phenomenon is not unique to HERVs and represents a general mechanism, e.g., there are on average 1–10 processed pseudogenes per human gene [101].

Regulatory potential of endogenous retroviruses

Not only proteins, but also non-coding sequences including viral regulatory elements have considerably shaped the human genome and transcriptome [32, 102–104]. HERV LTRs have functional enhancers, promoters, polyadenylation signals and splice sites. They can regulate transcription in human cells by the following five major mechanisms: (1) LTR enhancer/transcriptional repressor activity may alter expression of the neighboring genes; (2) LTRs may drive transcription of downstream genomic sequences, thus creating new genes and non-coding RNAs; (3) LTR polyadenylation sites may cause premature termination of the read-through transcripts; (4) LTR splice sites may change the exon–intron structure of genes; (5) LTRs may regulate host genes via RNA interference mechanisms (Fig. 2). Below, we discuss examples of such functional roles published in the literature.

Fig. 2

Functional roles played by HERV elements (defined as LTRs) in the regulation of gene expression. 1 HERVs may serve as transcriptional enhancers or silencers by regulating activities of downstream promoters. 2 HERVs may act as transcriptional promoters for host non-repetitive DNA, thus creating new genes. 3 HERV polyadenylation sites may cause premature termination of transcription of the host genes. 4 HERV sequences may disrupt exon–intron structure of genes by donating new splice sites. 5 HERVs may initiate antisense transcripts overlapping with RNAs of the host genes

Functional DNA regions

DNaseI hypersensitivity sites (DHS) are probably the most important genomic landmarks for regions of open (functionally active) chromatin, whereas transcription factor binding sites (TFBS) denote regions of DNA with nuclear protein binding properties [105, 106]. Garazha et al. combined investigation of both the DHS and TFBS content of HERVs on a genomic scale. To this end, we devised a bioinformatic algorithm that maps the relevant TFBS onto an annotation of all the HERVs in human DNA (504 families and ~720,000 copies) [107]. For the whole set of HERVs, ~140,000 inserts (~19 %) had at least one mapped DHS and ~110,000 inserts (~15 %) had at least one mapped TFBS. The total numbers of all DHS and TFBS in all HERV elements amounted to ~155,000 and ~320,000, respectively [107]. All the 504 HERV families were characterized with regard to their TFBS content (available at http://herv.pparser.net/TotalStatistic.php). The individual families differ dramatically in copy number, ranging from just a few copies for the HERV-F (ERVFH21-1, ERVH48-1) family to more than 22,000 members for the THE1B family. The total number of TFBS was also strikingly different, varying from 0 (families LTR5, LTR7A) to ~13,000 (MLT1K family). The maximum absolute number of TFBS-positive members was observed for the MLT1K family (~4000). The families with the greatest densities of TFBS may be regarded as the most functionally active ones among the HERVs. However, it is also important to consider the absolute numbers of TFBS contributed by each family. For example, the family LTR12 has the highest proportion of TFBS-positive members and contributed a total of ~1300 TFBS to the human genome, whereas the family MLT1K donated the greatest number of TFBS (~13,000), but has a rather small proportion of TFBS-positive members. A definite trend was seen when the distributions of HERV-related DHS and TFBS were compared.
The probability that an individual HERV element has a DHS increases in proportion to the number of mapped TFBS. Further experiments confirmed that TFBS density may be an overall measure of the functional activity of HERVs [107]. These results provide clues for the identification and functional validation of tens of thousands of previously unknown regulatory sequences of the human genome. Moreover, these results likely underestimate the HERV-related TFBS pool. Due to the repetitive nature of HERVs, it is impossible in many cases to directly map TFBS on any particular element. Those TFBS that were successfully mapped corresponded mostly to the 5′- or 3′-terminal regions of HERVs, no further than ~200 bp from the border with the unique flanking genomic sequence [107].
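
The proportions quoted above follow directly from the rounded counts. The short Python sketch below only redoes that arithmetic; the constants are the approximate figures given in the text, the variable names are illustrative, and the derived per-copy averages are rough values that were not reported in [107].

```python
# Rounded genome-wide counts quoted in the text (from [107]); they are
# approximations, so every derived value is approximate as well.
TOTAL_HERV_COPIES = 720_000
COPIES_WITH_DHS = 140_000
COPIES_WITH_TFBS = 110_000
TOTAL_TFBS_IN_HERVS = 320_000
MLT1K_TFBS = 13_000
MLT1K_TFBS_POSITIVE_MEMBERS = 4_000


def percent(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of HERV copies with at least one mapped DHS or TFBS.
dhs_pct = percent(COPIES_WITH_DHS, TOTAL_HERV_COPIES)
tfbs_pct = percent(COPIES_WITH_TFBS, TOTAL_HERV_COPIES)

# Average number of TFBS per TFBS-positive copy, genome-wide and for MLT1K.
tfbs_per_positive_copy = TOTAL_TFBS_IN_HERVS / COPIES_WITH_TFBS
mlt1k_tfbs_per_positive_member = MLT1K_TFBS / MLT1K_TFBS_POSITIVE_MEMBERS

print(f"copies with DHS:  {dhs_pct:.1f} %")
print(f"copies with TFBS: {tfbs_pct:.1f} %")
print(f"TFBS per TFBS-positive copy: {tfbs_per_positive_copy:.2f}")
print(f"TFBS per TFBS-positive MLT1K member: {mlt1k_tfbs_per_positive_member:.2f}")
```

The two percentages round to the ~19 % and ~15 % given in the text, which is also why a total of ~720,000 copies is the figure consistent with them.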

Enhancer activity

Human endogenous retroviruses and their LTRs include numerous transcription factor binding sites and may be involved in the regulation of neighboring host genes. One of the first striking reports of the involvement of HERVs in tissue-specific transcriptional regulation was for the human amylase locus [108]. In humans, amylase is produced in the pancreas and in the salivary glands. The human amylase locus includes two genes of pancreatic amylase (AMY2A and AMY2B) and three genes of salivary amylase (AMY1A, AMY1B, AMY1C). The latter three genes are likely products of a recent triplication, because in the chimpanzee genome there is only one gene for AMY1. All genes for salivary amylase contain a full-length insert of HERV-E (ERVE) upstream of their transcription start site. The insertion of a full-length endogenous retrovirus activated a cryptic promoter that drives the transcription of amylase in the salivary glands.

Promoter activity

Promoter strengths of HERVs were investigated in many experimental assays. The application of novel high-throughput techniques such as cap analysis of gene expression (CAGE) and paired-end ditag (PET) sequencing revealed 51,197 HERV-derived promoter sequences. Of these, 114 HERV-derived transcription start sites appeared to drive transcription of at least 97 human genes, producing chimeric transcripts that are initiated within an LTR and read through into known gene sequences [109]. In transient transfection experiments, a human-specific HERV-K/HML-2 (ERVK) LTR from contig L47334 displayed very low promoter activity in three out of ten cell lines tested, moderate activity (10–20 % of the SV40 promoter) in six cell lines and, finally, the maximal value of ~100 % of the SV40 activity in the embryonal teratocarcinoma cell line Tera-1 [44]. Similarly, in another laboratory, five other individual HERV-K/HML-2 (ERVK) LTRs showed high promoter strengths in the same cell line [39].

A comprehensive study of the expression of human-specific LTRs in human germ-line tissue (testicular parenchyma) and in the corresponding tumor (seminoma) [1] showed that individual LTRs were expressed at markedly different levels, differing by up to ~3000-fold [47]. The LTR status (solitary, 5′ or 3′ proviral) was an important factor affecting LTR activity: promoter strengths of solitary and 3′ proviral LTRs were almost identical, whereas 5′ proviral LTRs showed higher promoter activity (~2-fold and ~5-fold greater in testicular parenchyma and seminoma, respectively). Another important factor influencing promoter activity was the LTR distance from genes: the relative content of promoter-active LTRs in gene-rich regions was significantly higher than in gene-poor loci.

The data obtained also suggest a selective suppression of transcription for the LTRs located in gene introns. Such transcriptional suppression might be aimed at silencing proviral gene expression in gene-rich regions and may serve to minimize the possible destructive effects of undesirable transcription. Transcriptional peculiarities of the LTRs are tightly associated with their capability of binding host transcription factors, e.g., DUX-4 by MaLR, HERV-L (ERVL) and HERV-K/HML-2 (ERVK) promoters [12].

Polyadenylation

Polyadenylation is an essential step in the maturation of almost all eukaryotic mRNAs. A polyadenylation signal (AAUAAA) near the 3′ end of the pre-mRNA is required for poly(A) synthesis. HERVs encode proteins and utilize functional poly(A) signals at the 3′-termini of their genes. Therefore, insertions of HERVs in the sense orientation can influence the expression of neighboring genes by providing new poly(A) signals. This consideration may explain the clearly seen strong negative selection pressure on such elements inserted in gene introns and oriented in the same transcriptional direction as the enclosing gene [110, 111].
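
As an illustration of why orientation matters, the minimal Python sketch below scans the sense strand of a DNA sequence for the canonical hexamer (AATAAA in DNA, i.e. AAUAAA in RNA). The function name and the toy sequence are hypothetical and chosen only for illustration. Real poly(A) site usage also depends on surrounding sequence elements that this sketch ignores.

```python
POLYA_SIGNAL_DNA = "AATAAA"  # DNA form of the AAUAAA polyadenylation signal


def find_polya_signals(dna: str) -> list[int]:
    """Return 0-based start positions of every AATAAA hexamer on the given strand."""
    dna = dna.upper()
    positions = []
    start = dna.find(POLYA_SIGNAL_DNA)
    while start != -1:
        positions.append(start)
        # Continue one base further so that overlapping matches are not missed.
        start = dna.find(POLYA_SIGNAL_DNA, start + 1)
    return positions


# Hypothetical intron fragment (not a real sequence) carrying two signals.
toy_intron = "GGCTAGCAATAAAGCTTGCCAATAAATTT"
print(find_polya_signals(toy_intron))
```

Only the strand passed to the function is scanned. An insert in the antisense orientation would present the reverse complement (TTTATT) on the transcribed strand and would not be reported, which mirrors the orientation dependence described above.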

Consistent with this, HERV polyadenylation signals may be used by non-retroviral human transcripts. For example, eight human mRNAs are polyadenylated at the sequence of a HERV-K/HML-2 (ERVK) LTR [19, 112]. One of these transcripts encodes an 8-kDa protein of unknown function, highly similar to the human protein GON4L, a transcription factor that functions in cell cycle control [113]. The 5′ LTR of the retrovirus HERV-F (ERVFH21-1, ERVH48-1) may function as an alternative polyadenylation site for the known gene ZNF195 [114]. The human genes HHLA2 and HHLA3 utilize HERV-H (ERVH) LTRs as their major polyadenylation signals. In the baboon genome, the orthologous loci lack retroviral inserts and these genes recruit other polyadenylation motifs [115].

Antisense regulation of gene expression

This regulatory mechanism is based on the formation of double-stranded RNA between an mRNA and an antisense transcript, followed by catalytic degradation of RNAs containing sites homologous to the double-stranded fragment [50]. Among twenty-eight antisense-oriented human-specific HERV-K/HML-2 (ERVK) LTRs located in gene introns, fifteen elements were shown to be promoter-active in human germ cells [1]. High expression levels of certain intronic LTRs might suggest their involvement in antisense regulation of the enclosing genes [30]. Recently, we found the first evidence for human-specific antisense regulation of gene expression caused by the promoter activity of HERV-K/HML-2 (ERVK) endogenous retroviral inserts [48]. The human-specific LTRs located in the introns of the genes SLC4A8 (for sodium bicarbonate cotransporter) and IFT172 (for intraflagellar transport protein 172) generate in vivo transcripts that are complementary to exons within the corresponding mRNAs in a variety of human tissues. Overexpression of the antisense transcripts resulted in ~4- and ~3-fold decreases in mRNA levels for these genes, respectively [48].

Splicing

The sequence of the human gene Hpr, which encodes haptoglobin-related protein, is 92 % identical to that of the haptoglobin gene HP [116]. Both genes are transcribed at the highest level in the liver. The Hpr promoter is stronger than the HP promoter, but the concentration of Hpr transcripts in the liver is ~17-fold lower than that of the HP mRNA [117]. The major distinction between these genes is the endogenous retroviral sequence RTVL-Ia in the intron of Hpr [118]. An RTVL-Ia fragment demonstrated significant silencer activity in a series of luciferase transient transfection experiments [117]. The mechanism of the negative regulation of Hpr by the RTVL-Ia endogenous retrovirus is most probably linked to aberrant splicing of the Hpr transcript at the retroviral sequences.

Host regulation of HERVs

Suppression of HERVs by the host-encoded mechanisms

Expression of HERVs is tightly controlled by the host cell because it may be deleterious. Even the physical presence of such a number of repetitive sequences in the genome can generate considerable problems dealing with homologous recombination between the different HERV elements, which may disrupt functional genes located in their neighborhood [10]. HERV-derived transcription and gene regulation can bias normal gene expression regulatory networks [4, 10, 48]. Expression of HERV proteins in various human tissues may result in dangerous inflammatory or immunosuppressive effects [35].

In the mouse, endogenous retroviruses are transcriptionally repressed by the tetrapod-specific KRAB-containing zinc finger proteins (KRAB-ZFPs) and their cofactor TRIM28. A recent study demonstrated that KRAB/TRIM28-mediated regulation also controls a broad range of HERVs in human embryonic stem cells. The authors revealed a reciprocal dependence between TRIM28 recruitment at specific families of HERVs and their DNA methylation, which suggests that the KRAB/TRIM28 complex recruits the methylation machinery to HERV copies [119]. These data are in line with the striking correlation observed across vertebrate genomes between the number of LTR retroelements and the number of host tandem KRAB domain zinc finger genes [120]. Similarly, the zinc finger protein Yin Yang 1 may be one of the crucial components restricting HERV transcription in embryonic cells by suppressing the promoter activities of the LTRs [121]. Other known mechanisms suppressing endogenous and exogenous retroviruses involve the functions of APOBEC3, BST2, TREX, Tetherin, TRIM5α and Toll-like receptor proteins [122–124].

These factors are able to limit viral replication by targeting specific steps of the viral life cycle. Tetherin is an interferon-induced transmembrane protein that blocks the release of particles of many enveloped viruses, including HIV [125]. It is associated with lipid rafts at the plasma membrane and at the trans-Golgi network [122]. The Tetherin sequence is highly variable among mammals. It appears to inhibit virus release by connecting the viral and host cell membranes [122]. However, the envelope proteins (Env) of HERV-K/HML-2 (ERVK) proviruses are able to inhibit Tetherin by a yet unknown mechanism mediated by the recognition of Tetherin by the surface subunit of Env. In experimental tests, two out of six natural complete alleles of HERV-K/HML-2 (ERVK) Env were able to inhibit Tetherin and block Tetherin-mediated viral restriction [126]. Notably, since many HERV-K/HML-2 (ERVK) elements are polymorphic in the human population, it is likely that not all individuals possess the same anti-Tetherin potential, which may have functional consequences for individual responses to infection [126].

The APOBEC3 protein family consists of seven members in humans. Human APOBEC3G (hA3G) inhibits the infectivity of HIV-1 variants lacking the Vif gene. Vif (virion infectivity factor) prevents hA3G activity by binding it and inducing its degradation through ubiquitination [127]. If not inactivated by Vif, hA3G enters HIV-1 particles and then induces hypermutation of HIV-1 proviruses by editing the proviral genome during reverse transcription, leading to G to A substitutions [122, 128]. Similarly, hA3G inhibits the replication of many other exogenous retroviruses and HERVs. In addition to hA3G, hA3F contributes to proviral hypermutation by deaminating the minus strand of viral cDNA during reverse transcription [122]. In mice, the activity of endogenous retroviruses is also suppressed by the nucleic acid-recognizing Toll-like receptors 3, 7, and 9 (TLR3, TLR7, and TLR9). Loss of TLR7 function caused spontaneous retroviral viremia that coincided with the absence of ERV-specific antibodies. Additional TLR3 and TLR9 deficiency led to acute T cell lymphoblastic leukemia. Experimental ERV infection induced a TLR3-, TLR7-, and TLR9-dependent group of “acute-phase” genes previously described in HIV and SIV infections [124].
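
The G to A signature of APOBEC3 editing can be tallied in a simple way. The Python sketch below compares two already aligned, equal-length plus-strand sequences and separates G to A changes from all other substitutions. The function name and the toy sequences are hypothetical, and real analyses need a proper alignment and a statistical test, which this sketch does not provide.

```python
def count_g_to_a(reference: str, provirus: str) -> tuple[int, int]:
    """Return (G-to-A substitutions, all other substitutions).

    Both sequences must be pre-aligned plus-strand sequences of equal length.
    """
    if len(reference) != len(provirus):
        raise ValueError("sequences must be aligned and of equal length")
    g_to_a = 0
    other = 0
    for ref_base, obs_base in zip(reference.upper(), provirus.upper()):
        if ref_base == obs_base:
            continue
        if ref_base == "G" and obs_base == "A":
            g_to_a += 1
        else:
            other += 1
    return g_to_a, other


# Hypothetical reference and edited sequences (not real proviral data).
toy_reference = "ATGGGCTGGA"
toy_provirus = "ATGAGTTAAA"
print(count_g_to_a(toy_reference, toy_provirus))
```

A strong excess of the first count over the second is the kind of pattern usually described as hypermutation. The threshold for calling it is left open here.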

Helpful HERVs

Co-evolution with the human genome resulted in the recruitment of certain HERVs to the execution of important molecular functions. HERV-H (ERVH) is a family of endogenous retroviruses expressed preferentially in human embryonic stem cells (hESCs) [129]. Two research teams recently reported simultaneously that transcriptional regulation of the HERV-H (ERVH) LTRs may be one of the primary mediators of cell fate reprogramming, e.g., the generation of induced pluripotent stem cells using the “Yamanaka cocktail” (by overexpressing OCT3/4, SOX2, and KLF4 proteins) [130, 131]. This effect appeared to be most probably mediated by HERV-H (ERVH)-driven intergenic long noncoding regulatory RNAs [130, 131]. The HERV-R (ERV3-1)-encoded Env protein has also been suggested as a possible developmental mediator, as it is overexpressed in developing tissues such as kidney, tongue, heart, liver and brain [132].

The envelope proteins of HERV-W (ERVW-1) family members (corresponding to the human protein Syncytin) may serve human physiology through their fusogenic or immunosuppressive properties. For example, Syncytin is essential for placentation, by mediating cell fusion of syncytial cell layers, and for maternal tolerance of the fetus, by immunosuppression [98, 133]. Syncytin, the product of the individual HERV-W (ERVW-1) proviral locus ERVW-1, binds to its extramembrane receptor SLC1A5/ASCT-2/RDR/ATB(0) and initiates trophoblast cell fusion, most likely via Cx43-mediated gap junctional intercellular communication [134]. Deficiencies in Syncytin expression, e.g., caused by hypermethylation of ERVW-1, were reported to be associated with various placental abnormalities [135]. As shown in cell culture experiments, Syncytin activity may be negatively regulated by the RNA-binding protein LIN28A, whose targeted downregulation, in turn, appeared to release Syncytin functionality and promoted fusion of cultured human trophoblast cells [136]. This fusogenic effect was specific to human, but not mouse, trophoblast cells, which suggests long-term species-specific molecular evolution of the mechanisms controlling spatio-temporal activation of Syncytin in the placenta [136]. Other retroviral Env proteins, encoded by the endogenous elements ERV-3 and HERV-FRD (ERVFRD), may also be implicated in placentation, as they promote intercellular fusion in a cell culture model [137] and are normally expressed in the placenta [138]. Interestingly, in other mammalian species, other retroviral Env proteins may behave similarly to Syncytin in placentation, like the EnvV protein encoded by the ERVV-1 and ERVV-2 proviruses in the Old World monkeys, but not in the great apes [139].

HERVs may serve as major transcriptional regulators of human genes through direct enhancer or promoter activities, or through other mechanisms, sometimes involving long non-coding transcripts [140]. For example, human-specific transcription of the human gene PRODH in the hippocampus is regulated by an enhancer element created by the insertion of a HERV-K/HML-2 (ERVK) LTR [4]. This might have had an important impact on human evolution, since PRODH metabolizes neuromediator molecules and is strongly implicated in higher nervous activity [4]. The ERV9 LTR element upstream of the DNase I hypersensitive site 5 (HS5) of the locus control region in the human β-globin cluster, 40–70 kb upstream of the human fetal gamma- and adult beta-globin genes, is responsible for controlling expression of this cluster in erythroid cells [141]. The enhancer effect is caused by LTR-initiated transcription driven in the direction of the associated gene promoter [9, 142]. The LTR contains multiple CCAAT and GATA motifs and competitively recruits a high concentration of the NF-Y and GATA-2 transcription factors, which are present in low abundance in adult erythroid cells, to assemble an LTR/RNA polymerase II complex. The LTR complex transcribes intergenic RNAs unidirectionally through the intervening DNA to loop with and modulate transcription factor occupancies at the far downstream globin promoters, thereby regulating globin gene switching in a competitive manner [140].

A solitary ERVL LTR was shown to promote transcription of the known human gene β3GAL-T5 in various tissues, being especially active in colon, where it is responsible for the majority of gene transcripts [143]. β3GAL-T5 is involved in the synthesis of type 1 carbohydrate chains in gastrointestinal and pancreatic tissues. Interestingly, the murine β3GAL-T5 gene is also expressed primarily in colon, despite the absence of an orthologous LTR in the mouse genome. It is likely that in humans, the LTR adopted the function of an ancestral mammalian promoter active in colon [143]. Another interesting example of transcriptional regulation by an LTR was shown for the NAIP (BIRC1) gene coding for neuronal apoptosis inhibitory protein [144]. Although the human and rodent NAIP promoter regions share no similarity, in both cases LTRs serve as an alternative promoter. Thus, two evolutionarily distinct LTR elements were recruited independently in primate and rodent genomes for transcriptional regulation of this gene.

The human gene CYP19 codes for aromatase P450, the key enzyme in estrogen biosynthesis. An LTR insertion upstream of CYP19 led to the formation of an alternative promoter located 100 kb upstream of the coding region [28]. This event resulted in the primate-specific transcription of CYP19 in the syncytiotrophoblast layer of placenta. Placenta-specific expression plays an important role in controlling estrogen levels during pregnancy. Cases of placenta-specific transcription driven from endogenous retroviral promoters were also shown for the Mid1 gene linked with inheritable Opitz syndrome [145], endothelin B receptor [146] and insulin-like growth factor INSL4 [147]. HERVs may also serve as unique promoters for human genes. For example, the only apparent promoter of the liver-specific gene BAAT, implicated in familial hypercholanemia, is an ancient LTR in human but not in mouse [148].

In addition, a polymorphic HERV-K/HML-2 (ERVK) insert in the ninth intron of the complement component C4 gene was reported as a novel marker of type 1 diabetes that accounts for the disease association previously attributed to some key HLA-DQB1 alleles. This raises the possibility that this retroviral insertion contributes to functional protection against type 1 diabetes through any of the above mechanisms [149].

HERVs and human diseases

Recent studies indicate that different activities of HERVs may be involved in various human diseases, including autoimmune disorders, neurological and infectious diseases, and cancer [150].

Autoimmune diseases

The biased expression of proviral proteins in human tissues may trigger autoimmune diseases [151, 152]. This was indicated first by increased proviral transcript levels [153] and by the finding of anti-HERV protein antibodies in sera from several groups of patients suffering from these systemic disorders [37]. Immune reactivity to HERV products can often occur spontaneously in infection or cancer, and immune reactivity to ERV products is also considered the driving force of several autoimmune disorders in mice. Immune reactivity against ERV proteins can be experimentally induced in mice and non-human primates, suggesting that immunological tolerance to ERV-derived products is not complete [98].

The apparent overexpression of HERV genes may be associated with global hypomethylation, as observed for the DNA of T cells from systemic lupus erythematosus (SLE) patients. However, different HERV families behave differently in SLE. For example, HERV-E (ERVE) mRNA expression was higher in lupus CD4+ T cells than in healthy controls, whereas the expression of HERV-K/HML-2 (ERVK) and HERV-W (ERVW-1) family members was similar in SLE patients and healthy controls. Additionally, the HERV-E (ERVE) mRNA expression level was positively correlated with SLE disease activity. Consistently, the HERV-E (ERVE) LTR methylation level was decreased and negatively correlated with HERV-E (ERVE) mRNA expression in lupus CD4+ T cells [154]. Overexpression of the following HERVs is considered as possibly implicated in SLE: HRES-1, ERV-3, HERV-E 4-1 (ERVE 1–4), HERV-K10 (ERVK-10) and HERV-K18 (ERVK-18) [155]. In addition to HERV-E (ERVE) LTRs, the HERV-K/HML-2 (ERVK) LTRs were also found to be significantly undermethylated in various types of T lymphocytes in SLE patients [156]. Note that another study showed decreased expression of most HERV genes in purified monocytes from SLE patients [157].

In patients with rheumatoid arthritis, a statistically significant increase in the IgG antibody response to HERV-K10 (ERVK-10) Gag protein was detected, as compared to normal controls [158]. HERV-W (ERVW-1) transcripts and protein products (isoforms of Syncytin) were overrepresented in cartilage of osteoarthritis patients [159]. In osteoarthritis, the patients' health status and the disease severity index were also correlated with the expression of the HERV-K18 (ERVK-18) provirus, thus suggesting its possible involvement in the aetiopathogenesis of this disease [160]. The implication of HERV-K/HML-2 (ERVK) elements in autoimmune disorders may be connected with the presence of multiple binding sites for various inflammation-linked transcription factors in their LTRs [161].

However, inflammatory diseases may be equally associated with decreased expression of HERV genes [162]. For example, lichen planus (LP) is a common inflammatory skin disease of unknown etiology. In LP subjects, a significant decrease in HERV-K/HML-2 (ERVK) Gag and Env, as well as HERV-K18 (ERVK-18) and HERV-W (ERVW-1) Env mRNA expression was detected, compared to healthy controls. Overall, HERV-K/HML-2 (ERVK) Gag expression strongly correlated with that of other HERV sequences. The decrease of HERV expression in this case may be at least partly explained by the observed significant up-regulation of known retroviral restriction factors such as the cytidine deaminase APOBEC3G gene and the GTPase MxA (Myxovirus resistance A) gene [162]. Other inflammation-related transcripts, such as the master regulator of interferon-dependent immune responses STING, IRF-7 (interferon regulatory factor 7), IFN-β and the inflammasome component NALP3, also had increased levels in LP when compared to healthy controls [162]. This study indicates that interferon-inducible factors may contribute to the negative transcriptional control of HERVs [162]. For psoriasis, which is a multifactorial chronic disease of skin, a significant decrease in the antibody response against the HERV-K/HML-2 (ERVK) protein products Gag and Env was detected in plasma of the affected patients [163]. Congruently, the expression of HERV-K/HML-2 (ERVK) and ERV-9 gene transcripts was significantly lower in lesional psoriatic skin as compared to healthy skin [163].

Interestingly, although there are currently no indications that HERV reverse transcriptase (RT) is involved in autoimmune disorders, some of the drugs targeting RT enzymatic activity manifested good clinical effects for inflammation-linked diseases [164].

Neurological diseases

Enhanced expression of HERV-encoded proteins is a promising biomarker for several neurological diseases [165, 166]. The pro-inflammatory cytokine IFNγ plays a key role in the pathology of several HERV-associated neurological diseases. In model experiments, IFNγ signalling markedly enhanced the levels of HERV-K/HML-2 (ERVK) protein expression in both human astrocytes and neurons. These findings again indicate that HERV expression may be inducible under inflammatory conditions [165]. For example, expression of HERV-K/HML-2 (ERVK) genes was significantly increased in postmortem brains of amyotrophic lateral sclerosis (ALS) patients [167].

For multiple sclerosis (MS), a hypothesis was proposed that HERV-encoded envelope proteins (Env) can act as strong immune stimulators [168]. Thus, slow disease progression following neurodegeneration might be induced by re-activation of HERV expression directly, while relapses in parallel to inflammation might be secondary to the expression of HERV-encoded superantigens [169]. In Northern and Southern European cohorts, an association with susceptibility to bout-onset MS was established and confirmed for the HERV-Fc1 (ERVFC1) sequence on chromosome X and the enclosing polymorphism rs391745 [170]. In addition, two polymorphisms mapped within the HERV-K18 (ERVK-18) locus on chromosome 1 appeared to be significantly associated with susceptibility to MS in a Spanish cohort of patients [171]. MS occurs more frequently in women than in men, and a possible link between the HERV-W (ERVW-1) copy on chromosome Xq22.3 and the sex-related difference in MS prevalence has been suggested. This copy contains an almost intact open reading frame for Env, but it is interrupted by a premature stop codon, so the resulting protein, if any, is heavily truncated [172]. Several MS-linked genetic polymorphisms located within the same genomic region were recently reported [172]. HERV-W (ERVW-1) Env protein was detected in MS brain lesions within microglia and perivascular macrophages and was shown to induce a proinflammatory response in human macrophage cells through the TLR4 activation pathway [173]. The corresponding HERV-W (ERVW-1) mRNA levels were enriched in blood, spinal fluid, and brain samples of MS patients [174]. Furthermore, HERV-W (ERVW-1) Env was even utilized as a superantigen to develop an MS model in mice [173]. While increasing during MS manifestation, HERV-W (ERVW-1) expression decreases with the decline of this disease. Natalizumab is a humanized monoclonal antibody against the cell adhesion molecule α4-integrin. Since 2004, it has been widely used as a targeted drug against MS. In a cohort of MS patients efficiently treated with Natalizumab, both mRNA and protein levels of HERV-W (ERVW-1) Env were significantly reduced [175].

There is frequently cross-talk between MS and previous infections of human CNS cells [168]. It was recently demonstrated that infectious mononucleosis may lead to enhanced levels of HERV-W (ERVW-1) Env protein and mRNA in blood mononuclear cells [174]. Flow cytometry data showed increased percentages of cells exposing surface HERV-W (ERVW-1) Env protein; these increases differed between specific cell subsets and between acute disease and past infection [174]. Expression of the same retroviral protein was considerably increased in various human cell types following influenza A virus infection [2]. Importantly, treatment with neuroleptics and antidepressants (e.g., valproic acid, haloperidol, risperidone, and clozapine) may greatly upregulate the expression of HERV-W (ERVW-1) and ERV9 elements in CNS cells [176]. Moreover, expression of HERV-W (ERVW-1) may be significantly upregulated by some ubiquitous nutrients and medicines like caffeine and aspirin [177]. Besides HERV-W (ERVW-1), MS is genetically associated in Scandinavians with one human endogenous retroviral locus related to the HERV-F (ERVFH21-1, ERVH48-1) element [123]. HERV-F (ERVFH21-1, ERVH48-1) Gag RNA in plasma was increased fourfold in patients with a recent history of attacks, relative to patients in a stable state and to healthy controls [178]. It can be extrapolated that infections can sometimes upregulate HERV expression in CNS cells, thus provoking a deleterious autoimmune response [179]. Indeed, genetic variants in some genes restricting retroviral infections were statistically linked with the risk of developing MS, as shown for TRIM5, TRIM22 and BST2, but not for the APOBEC3 and TREX genes [123]. Interestingly, HIV infection is associated with a significantly decreased risk of developing MS. Mechanisms of this observed protective association may include immunosuppression induced by chronic HIV infection and antiretroviral medications [180].

In schizophrenia and bipolar disorder, abnormally high levels of the HERV-W (ERVW-1) Env gene product were also detected in plasma [176, 181]. Seroprevalence for Toxoplasma gondii showed a low but significant association with the HERV-W (ERVW-1) transcriptional level in a subgroup of bipolar disorder and schizophrenia patients, suggesting a potential role in particular patients [182]. However, for the HERV-K18 (ERVK-18) provirus, there were no significant associations with susceptibility to schizophrenia [183].

Finally, HERVs may also cause neurological disorders through a quite distinct mechanism involving HERV-linked genomic rearrangements. For example, HERV-H (ERVH)-mediated 3q13.2-q13.31 deletions cause a syndrome of hypotonia and motor, language, and cognitive delays [184].

Infectious diseases

The long-term spontaneous evolution of humans and human viruses might have generated various mechanisms involving cooperation or interference of endogenous and exogenous retroviruses. For example, the primate lentiviruses HIV and simian immunodeficiency virus (SIV) do not express their own dUTPase, and it is believed that a host cell endogenous retroviral enzyme (Prot) provides this activity during reverse transcription [70–72], in line with the recent observations that HIV-1 infection may increase the expression of HERV-K/HML-2 (ERVK) proviruses in vitro [185] and in vivo [185, 186]. The envelope glycoprotein of one of the HERV-K/HML-2 (ERVK) members, HERV-K18 (ERVK-18), is incorporated into HIV-1 in an HIV matrix-specific fashion [78]. In HIV patients, HERV-K/HML-2 (ERVK) proviruses are expressed at significantly higher levels in peripheral blood mononuclear cells than in those from uninfected individuals [187]. Proviruses were expressed in multiple blood cell types, and the magnitude of HML-2 expression was not related to HIV-1 disease markers [187]. In addition, controversial data were reported on whether or not HERV-K/HML-2 (ERVK) viral particles are present in the plasma of HIV-infected patients [187]. Increased levels of HERV-K/HML-2 (ERVK) RNA were detected in plasma of HIV patients from Uganda, but not from the USA [188]. In contrast, Esqueda et al. argue that there was no correlation between HERV-K/HML-2 (ERVK) RNA levels and HIV viral loads in the plasma specimens they profiled [189]. However, the presence of antibodies against HERV-K/HML-2 (ERVK) Env protein in blood was proposed as a new biomarker of HIV-1 infection, because HIV-1 can modify HERV-K/HML-2 (ERVK) Env mRNA expression, resulting in the expression of a fully N-glycosylated HERV-K/HML-2 (ERVK) envelope protein on the cell surface [190]. Moreover, HERV-K/HML-2 (ERVK)-specific CD8+ T cells obtained from HIV-1-infected human subjects exhibited complete elimination of human cells infected with a panel of globally diverse HIV-1, HIV-2, and SIV isolates in vitro. This supports the consideration of HERV-K/HML-2 (ERVK)-specific and cross-reactive T cell responses for exploring HERV-K/HML-2 (ERVK)-targeted HIV vaccines and immunotherapeutics [191]. The mechanism of HIV-1-induced transactivation of HERV-K/HML-2 (ERVK) and other transposable elements possibly involves the activity of the HIV-1 Tat protein [192]. Recent studies showed that in model experiments with peripheral blood lymphocytes, out of 91 annotated HERV-K/HML-2 (ERVK) proviruses, Tat significantly activates expression of 26 unique HERV-K/HML-2 (ERVK) proviruses, silences 12, and does not significantly alter the expression of the remaining proviruses [193]. In addition, HIV infection may cause transactivation of the HERV-W (ERVW-1) family with their Env genes and Syncytin [194].

On the other hand, endogenous retroviral Env production can theoretically provide the host cell with partial resistance to infection by pathogenic exogenous counterparts or related retroviruses through receptor interference [77, 195, 196], as is the case for endogenous Jaagsiekte sheep retrovirus (JSRV), which blocks the entry of the corresponding exogenous virus. Both forms use the same protein receptor for entry, implying interference between endogenous and exogenous viruses [195]. Endogenous Gag protein may also be involved in antiviral host cell protection. For instance, the expression of the murine endogenous Gag sequence Fv1 blocks certain strains of mouse leukemia virus (MLV) soon after entry [197], most probably due to a direct encounter with the incoming viral capsid [37]. Similar cases were also reported for chicken, ground squirrel and cat endogenous retroviral elements [198]. No direct evidence exists so far for the human elements, but theoretically this may be possible, since human DNA harbors a number of intact or largely intact retroviral env genes [199]. A recent association study of ~23,000 participants indicates that susceptibility to herpes zoster caused by the varicella zoster virus (VZV) is linked with the non-coding gene HCP5 (HLA Complex P5) in the major histocompatibility complex. This gene is an endogenous retrovirus that likely suppresses viral activity through regulatory functions. Variants in this genetic region are also known to be associated with a delay in the development of AIDS in people infected by HIV [200].

Cancer

The role of HERVs in cancer is most likely limited to retrovirus-driven gene expression and does not involve their insertional activity [22]. In this regard, data from cancer genome sequencing identified over 180 somatic integrations in human cancer cells caused by LINE-1 retrotransposon activity [98, 201, 202]. In contrast, there was only a single insertion of a small HERV fragment, which was most likely the result of a microhomology-mediated DNA repair mechanism [98, 201].

Abnormal expression of HERVs in cancer is well known. For example, HERV-K/HML-2 (ERVK) elements are overexpressed in germ cell tumors and in melanoma [22, 203–205]. Upregulation of HERVs may be mediated by cancer-specific combinations of transcription factors, as shown for HERV-K/HML-2 (ERVK) activation by the melanoma-specific transcription factor MITF-M [206]. A significant increase in the frequency and titer of antibodies against proviral proteins in patients suffering from testicular cancers has been documented (60 % against 4 % in the healthy control group) [57]. Importantly, shortly after the elimination of the tumor, the antibody titers dropped and became undetectable by 5 years after surgery [57]. Both HERV-K/HML-2 (ERVK) mRNA and antibodies against proviral proteins were significantly overrepresented in plasma of patients with primary breast cancer [207]. Serum proviral mRNA levels tended to be higher in the patients who later developed metastatic disease [207]. Aberrant overexpression of HERV-K/HML-2 (ERVK) proviruses was also found in prostate cancer [208], and HERV-K/HML-2 (ERVK)-encoded transcripts or proteins are considered possible biomarkers of malignancy, being overexpressed in patients with poor prognosis [209]. Increased HERV-K/HML-2 (ERVK) Env gene expression was detected in the prostate tumors of 40 % of European-American and 61 % of African-American patients [208]. Expression of some individual HERV-K/HML-2 (ERVK) elements can be of outstanding importance for following prostate cancer development, as recently shown for the HERV-K/HML-2 (ERVK) (ch22q11.23) Gag gene [209]. In a cohort of patients with renal cell carcinoma, bone marrow transplantation was reported to result in tumor regression, likely due to a graft-versus-tumor effect. In such patients, anti-tumor cytotoxic lymphocytes targeted a HERV-E (ERVE)-encoded epitope [210], thus demonstrating the importance of anti-HERV immune responses in the progression or cure of human diseases [98].

Chronic lymphocytic leukemia (CLL) cells demonstrate increased transcription of the auxiliary HERV-K/HML-2 (ERVK) gene Np9, which was previously described as a possible oncogene [211]. Indeed, silencing of Np9 inhibits the growth of myeloid and lymphoblastic leukemic cells, whereas its overexpression promotes their growth in vitro and in vivo. In human leukemia cells, the Np9 protein level is substantially higher than in normal cells, e.g., in normal hematopoietic stem cells. Np9 may act by activating the ERK, AKT and Notch1 signaling pathways and through the upregulation of β-catenin, which is essential for survival of leukemia stem cells [212]. Another group of researchers found that Np9 may directly interact with the RING-type E3 ubiquitin ligase LNX [59]. This interaction affects the subcellular localization of LNX, whereas LNX can target the cell fate determinant and Notch antagonist Numb for proteasome degradation, thereby promoting Notch signaling. The LNX-interacting Np9 is also unstable and degraded via the proteasome pathway. Combined, these findings point to the possibility that Np9 affects tumorigenesis by influencing the LNX/Numb/Notch pathway [59]. In addition, Np9-positive leukemia samples highly expressed the HERV-K/HML-2 (ERVK) Pol-Env polyprotein, Env and transmembrane proteins, as well as entire viral particles [212].

Expression of Syncytin, or Syncytin-1, which is the Env protein of the HERV-W (ERVW-1) family, is normally restricted to the placenta. However, it was also found in many pathologies including cancers, and it is hypothesized that Syncytin-mediated cell fusion participates in cancer cell transformation or metastasis [98]. Endogenous retroviral Env proteins possess immunosuppressive and fusogenic activities [98]. As in the placenta, where it helps to prevent rejection of the embryo, expression of the immunosuppressive domain of HERV-W (ERVW-1) and HERV-K/HML-2 (ERVK) Env in tumors may suppress immune responses and thus prevent rejection of the tumor [76]. Immunosuppression is most likely mediated by the transmembrane subunit of the envelope protein of several retroviruses [213]. Indeed, overexpression of the Moloney MLV transmembrane subunit in murine tumor cell lines led to tumor growth in recipient mice that would otherwise immunologically reject them [98, 214]. Consistently, knockdown of env transcripts in melanoma and neuroblastoma cell lines, both of which produce infectious MLVs derived from endogenous retroviral precursors, rendered them susceptible to immune rejection in vivo [215, 216]. In human DNA, in addition to Syncytins, the Env genes of HERV-E (ERVE) [217] and HERV-H (ERVH) [218] were shown to possess immunosuppressive potential [98]. The mRNA levels of HERV-R (ERV3-1), HERV-H (ERVH), HERV-K/HML-2 (ERVK), and HERV-P Env genes were significantly increased in breast cancer patients and dropped to normal levels following chemotherapy [219]. Syncytins and other Env genes like Erv-3, envT and envFc2 were also upregulated in endometrial carcinomas [220].

In addition, HERVs may promote cellular transformation by regulating downstream human gene expression through their LTRs [98]. Transcriptional derepression of the CSF1R gene, encoding colony stimulating factor-1 receptor, by a demethylated MaLR LTR acting as an alternative promoter has been linked with survival of cancer B cells in Hodgkin’s lymphoma [221]. Finally, the discovery of HTLV-1 (human T-lymphotropic virus 1) unequivocally proved the existence of a tumor-inducing exogenous human retrovirus [98].

Impact on human evolution

Various HERV families were active at different timepoints during human evolution [8, 10, 18, 222]. For primate-specific regions, ~63 % of mapped DNase I hypersensitivity sites representing open chromatin regions corresponded to HERV sequences [105]. The apparent emergence of ~320,000 transcription factor binding sites donated by HERVs to the human genome must have had a deep impact on the regulation of intracellular molecular networks [107]. Expression of viral genes, HERV-driven transcription of neighboring DNA, attenuation of splicing, polyadenylation and RNA degradation might all considerably influence the human cell and organism [10]. In a comprehensive genome-wide study, Subramanian and coauthors recently found that both solitary LTRs and full-size proviruses of the human HERV-K/HML-2 (ERVK) family are preferentially located on gene-rich chromosomes and close to gene regions [223]. A small group of HERVs belonging to ERVK elements was actively proliferating in the genome after the divergence of the human and chimpanzee ancestor lineages [19]. A few group members were even specific to the non-human hominid genomes of Neanderthals and Denisovans [224]. Human-specific HERVs, in turn, are represented by ~130–140 members that shaped ~330 kb of the human DNA. This group modified human genome activity by endogenizing ~50 functional copies of viral genes that may have important implications in the immune response, cancer progression and anti-retroviral host defence. In total, 134 potential promoters and enhancers have been added to the human DNA, about 50 % of them in close vicinity to genes and 22 % in gene introns [1]. For sixty of these human-specific promoters, activity was confirmed by in vivo assays, with the transcriptional level varying ~1000-fold, from hardly detectable to a level comparable with the expression of major housekeeping genes. New polyadenylation signals have been provided to four human RNAs, and a number of potential antisense regulators of known human genes appeared due to human-specific retroviral insertional activity [47, 48].

When looking at all HERV elements, not only the human-specific ones, a number of remarkable trends in transcription factor binding site (TFBS) densities can be observed. Garazha et al. analyzed the distribution of the TFBS-containing HERVs relative to their divergence from the consensus sequence (this divergence positively reflects the evolutionary age of each HERV element) [225, 226]. Overall, the TFBS activity of HERVs decreased with evolutionary age [107]. A significant heterogeneity between the proportions of TFBS-positive elements was found for the evolutionarily “young” HERVs (divergence less than 10 %). However, the proportion of TFBS-positive elements drops as the divergence increases (>15 %) and then tends to lie within a narrow interval of 0.12–0.18, which is ~6 times narrower than for the young elements. This clearly demonstrates that the low-diverged (evolutionarily recent) HERVs have a significantly higher likelihood of functional activity in human DNA compared to the “older” elements. This observation may suggest that genomic “domestication” of the newly integrated HERV sequences involves reshaping of their active TFBS profiles and their further “standardization”, e.g., upon accumulation of mutations [107]. Another aspect of the same concept was uncovered when looking at the distributions of TFBS for the different transcription factors. For example, the protein NF-YA has highly abundant TFBS in the evolutionarily young HERVs (divergence 5–8 %), whereas for the older elements, the TFBS ratio is significantly lower. In contrast, for the protein Rad21 there is a relatively low ratio of TFBS for the “young” elements followed by a subsequent increase for the older elements, reaching a maximum value at ~22 % divergence [107]. This example shows that for some transcription factors (e.g., NF-YA), the recently inserted HERVs are enriched in TFBS, whereas further genomic domestication and mutation pressure progressively decrease the TFBS proportion. In contrast, for another group of transcription factors like Rad21, the older elements accumulate increased proportions of TFBS [107]. Overall, most of the transcription factors showed one common feature in their TFBS evolutionary dynamics: a decrease in the proportion of TFBS in the divergence interval around 5–15 %. This indicates that functional adaptation and modification of a HERV insert includes strong initial silencing of the original TFBS that came from this element, and further accumulation of new functional TFBS in tight co-evolution with the host genome [107]. Support for this hypothesis comes from the studies showing that the newly integrated inserts are initially under strong DNA methylation repression [119, 227]. This preserves the cell genome from viral gene expression and from the deleterious influence of these elements on the host gene regulatory ensembles. Upon de-methylation, a number of HERVs release their regulatory potential and provide functional TFBS to the human genome. De-methylation is followed by progressive mutation of the HERV sequence, which is reflected by a further decline of the “original” TFBS. However, mutations not only remove these sites, but also create new TFBS in the HERVs that, in turn, fall under selection pressure according to their implication in the overall genomic context [228]. Finally, the highly diverged HERVs become roughly equilibrated with the enclosing genomic regions and show little difference compared to the average genomic sequence [107].
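The divergence-based comparison described above can be illustrated with a minimal sketch that bins elements by their divergence from the consensus sequence and reports the fraction of TFBS-positive elements per bin. This is not the procedure used in [107]; the function name, the 5 % bin width and the toy records are assumptions made purely for illustration.

```python
from collections import defaultdict


def tfbs_positive_fraction_by_divergence(elements, bin_width=5.0):
    """Return {bin_start: fraction of elements with at least one TFBS}.

    `elements` is an iterable of (divergence_percent, tfbs_count) pairs.
    The record layout and the bin width are hypothetical choices.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for divergence, tfbs_count in elements:
        # Lower edge of the divergence bin that contains this element.
        bin_start = int(divergence // bin_width) * bin_width
        totals[bin_start] += 1
        if tfbs_count > 0:
            positives[bin_start] += 1
    return {b: positives[b] / totals[b] for b in sorted(totals)}


# Hypothetical toy records, not real HERV annotations.
toy_elements = [
    (2.0, 3), (4.5, 0),                        # 0-5 % divergence
    (7.0, 1), (8.0, 2),                        # 5-10 %
    (12.0, 0), (13.0, 1),                      # 10-15 %
    (16.0, 0), (17.0, 0), (18.0, 0), (19.0, 1),  # 15-20 %
]
fractions = tfbs_positive_fraction_by_divergence(toy_elements)
for bin_start, fraction in fractions.items():
    print(f"{bin_start:4.1f}-{bin_start + 5.0:4.1f} %: {fraction:.2f}")
```

Restricting the TFBS counts to one transcription factor at a time would give per-factor profiles of the kind discussed for NF-YA and Rad21.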

On a much closer evolutionary scale, this theory is exemplified by investigations of the promoter activity of the evolutionarily recent family HERV-K/HML-2 (ERVK) [27]. Mapping of the transcriptional start sites for several actively transcribed HERV-K/HML-2 (ERVK) LTRs revealed two functional promoter regions within their sequence [27]. The first promoter was the canonical element located in the LTR U3 region, whereas the second one was mapped to the very 3′ terminus of the LTR R region. Both promoters appeared to be active in solitary LTRs and in full-length proviruses. Surprisingly, this second, non-canonical element was even more active than the classical U3-based retroviral promoter. Therefore, the R region is excluded from most transcripts initiated on LTRs, whereas the classical retroviral life cycle model implies that transcription is driven from between the LTR U3 and R elements (first promoter), and that the R transcript is a 5′-terminal component of the newly synthesized proviral RNA [27]. This classical mode of proviral DNA transcription is a basis of the HERV life cycle, because it provides the possibility of template jumps during proviral RNA reverse transcription. Further studies confirmed the prevailing activity of the non-canonical HERV-K/HML-2 (ERVK) LTR promoter and showed that it is regulated by the Sp1 and Sp3 transcription factors via a TATA box-independent transcription initiation mechanism [229]. The shift of the transcriptional start site can be explained by initial adaptation to the human genomic context, which can be an early step in the complex process of HERV sequence domestication and reshaping by the host genome. Another important mechanism guiding the coevolution of HERVs with the human genome deals with inactivation of full-size proviruses. This mechanism may include recombinations, most frequently eliminating viral genes and producing solitary LTRs instead of complete proviral sequences. An outstanding case of full-size provirus disruption by recombination was published by Hughes and Coffin in 2004 [29]. The authors found a genomic locus that may exist in three alternative states: it may lack the proviral insert, may have a complete provirus, or may have a solitary LTR instead. Overall, apparent polymorphic solitary LTR formation from full-size proviruses was documented in 5/13 of the studied cases and was found to be a rather frequent event in the human population [29]. Moreover, the same authors previously found that ~16 % of all recently inserted ERVK elements are linked with traceable genomic rearrangement events [230].

Databases linked with HERV structure and functions

The ubiquitous nature of HERVs and the plurality of their molecular functions stress the importance of organizing and maintaining related databases. Some of the HERV sequences and features were annotated and collected in various databases. For example, the first database, termed HERVd, was published in 2002 and collected structural data on most of the HERV elements known at that time [231]. However, this database has not been updated since 2003 and now misses important information available elsewhere. Villesen et al. published a database putting together the sequences of HERV-encoded open reading frames of the human genome available at the date of publication (2005). Fifty-nine intact or almost intact endogenous retroviral genes were identified, 29 of them encoding envelope proteins [232]. The database of human retrotransposon insertion polymorphisms [233] presents data on the HERV elements that appeared polymorphic in the human population, and the database of human-specific transposable elements [234] contains information on the genomic location of human-specific HERVs and other retrotransposons. However, many human-specific HERVs were listed only in a further database [19], which encompassed 134 human-specific endogenous retroviral inserts. It contains genomic coordinates of HERVs, their polymorphic state in human populations, distances from known genes or mapped RNAs, the structural status of HERVs (solitary LTR/provirus), information about open reading frames putatively encoded by the corresponding HERVs, and the corresponding deduced amino acid sequences. It also accumulates data on methylation of individual elements and on individual HERV transcription. This database was probably the first attempt to characterize in detail a particular group of HERVs [19]. However, it was focused solely on human-specific HERV-K/HML-2 (ERVK) elements and has not been updated since 2007. More recently, Subramanian et al. cataloged all available HERV-K/HML-2 (ERVK) sequences of the human genome, including both proviruses and solitary LTRs [223]. However, other human endogenous retroviral groups were left outside the scope of this study.

RNA of HERVs and other transposable elements may form intrinsic hairpin structures and/or serve as microRNA precursors when the elements are inserted into transcriptionally active genomic regions [235, 236]. To catalog the data on transposable elements that may have an impact on gene regulation and functioning via RNA interference, a database termed “TranspoGene database” has been constructed that covers not only human, but also mouse, chicken, zebrafish, fruit fly, nematode and sea squirt genomes [237]. A variant of this database termed “microTranspoGene” collects data on human, mouse, zebrafish and nematode TE-derived microRNAs [237]. The HESAS (HERVs Expression and Structure Analysis System) database was developed to link HERV sequences with the neighboring genes in order to identify the elements that have an impact on the expression of human functional genes. Inserts of HERV elements were found in 17,317 human genes and were linked (in the opinion of the authors) to the expression of 898 genes [238]. However, the database has not been updated since 2004 and lacks information on functional signatures of regulatory sequences such as DNase I hypersensitivity sites and transcription factor binding sites.
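Linking HERV inserts with neighboring genes is, at its core, an interval comparison on genomic coordinates. The following Python sketch illustrates the idea under stated assumptions: the record layout, the names Interval, overlaps, gap and nearest_gene, and all coordinates are invented for illustration and do not reproduce the schema or the algorithm of HESAS or of any other database mentioned here. Coordinates are assumed to be 0-based and half-open.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Interval:
    """A named genomic interval (hypothetical layout; 0-based, half-open)."""
    name: str
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def gap(a: Interval, b: Interval) -> Optional[int]:
    """Distance in bp between two intervals: 0 if they overlap,
    None if they lie on different chromosomes."""
    if a.chrom != b.chrom:
        return None
    if overlaps(a, b):
        return 0
    return max(a.start, b.start) - min(a.end, b.end)


def nearest_gene(herv: Interval, genes: List[Interval]) -> Optional[Tuple[str, int]]:
    """Return (gene name, distance) for the closest gene on the same
    chromosome, or None if that chromosome carries no annotated gene."""
    candidates = [(gap(herv, g), g.name) for g in genes if g.chrom == herv.chrom]
    if not candidates:
        return None
    distance, name = min(candidates)
    return name, distance


# Toy annotation; every name and coordinate below is made up.
genes = [
    Interval("GENE_A", "chr1", 1_000, 5_000),
    Interval("GENE_B", "chr1", 20_000, 30_000),
    Interval("GENE_C", "chr2", 1_000, 2_000),
]
intragenic = Interval("herv_1", "chr1", 1_500, 2_500)
intergenic = Interval("herv_2", "chr1", 6_000, 7_000)
orphan = Interval("herv_3", "chr3", 100, 900)

for herv in (intragenic, intergenic, orphan):
    print(herv.name, nearest_gene(herv, genes))
```

A distance of zero marks an insert that overlaps a gene, which corresponds to the intragenic inserts counted above; any link to gene expression would require additional experimental or transcriptomic evidence that this sketch does not model.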

A bioinformatic algorithm termed RetroTector and a related database, published in 2007, utilize automated recognition of retroviral sequences in genomic data. Retroviral sequences were detected in various vertebrate genomes, including human. Most RetroTector-detected chains coincided with the RepeatMasker output and the HERVd database. However, RetroTector did not report many evolutionarily old HERV sequences and could not be used for the detection of solitary LTRs [239].

Most recently, Garazha and coworkers published an interactive comprehensive HERV database that groups the individual inserts according to their familial nomenclature, number of mapped TFBS and divergence from their consensus sequence [107]. The database encompasses data on 717,612 individual elements belonging to 504 different HERV families, which together cover ~8 % of human DNA. Detailed information on any particular HERV element can be easily extracted by the user. To facilitate data navigation, we created a genome browser tool enabling quick mapping and finding of any HERV insert according to genomic coordinates, known human genes and densities of TFBS. This browser is cross-linked with the UCSC Genome Browser to enable easy mapping of other genetic features of interest. This database may be widely used for quickly locating functionally relevant individual HERVs and for analyzing their impact on the regulation of human genes. In addition, it is possible to browse the TFBS distribution among the HERVs for any particular transcription factor [107]. These resources are freely available at http://herv.pparser.net.
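The grouping criteria described above (family, number of mapped TFBS, divergence from the consensus sequence) and the search by genomic coordinates can be illustrated with a minimal Python sketch. It is only a toy model under stated assumptions: the HervInsert record, the functions in_region and summarize_by_family, the family names and all values are hypothetical and do not correspond to the actual schema, interface or content of the database in [107].

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class HervInsert:
    """One HERV insert (hypothetical layout; 0-based, half-open coordinates)."""
    family: str
    chrom: str
    start: int
    end: int
    tfbs_count: int    # number of TFBS mapped within the insert
    divergence: float  # fraction of positions differing from the family consensus


def in_region(inserts: List[HervInsert], chrom: str, start: int, end: int) -> List[HervInsert]:
    """Inserts overlapping chrom:start-end, sorted by TFBS count (highest first)."""
    hits = [i for i in inserts if i.chrom == chrom and i.start < end and start < i.end]
    return sorted(hits, key=lambda i: i.tfbs_count, reverse=True)


def summarize_by_family(inserts: List[HervInsert]) -> Dict[str, Tuple[int, int, float]]:
    """Per family: (number of inserts, total TFBS, mean divergence)."""
    grouped: Dict[str, List[HervInsert]] = defaultdict(list)
    for ins in inserts:
        grouped[ins.family].append(ins)
    return {
        family: (
            len(members),
            sum(m.tfbs_count for m in members),
            sum(m.divergence for m in members) / len(members),
        )
        for family, members in grouped.items()
    }


# Toy records; families, coordinates and values are invented.
inserts = [
    HervInsert("family_A", "chr1", 1_000, 1_900, 3, 0.04),
    HervInsert("family_A", "chr1", 50_000, 50_950, 5, 0.06),
    HervInsert("family_B", "chr1", 1_500, 2_400, 0, 0.20),
    HervInsert("family_B", "chr2", 700, 1_600, 2, 0.10),
]

for hit in in_region(inserts, "chr1", 0, 10_000):
    print(hit)
for family, stats in summarize_by_family(inserts).items():
    print(family, stats)
```

In such a layout, a lower mean divergence would point to a younger family whose members have accumulated fewer mutations since insertion, whereas the TFBS count serves only as a rough proxy for regulatory potential.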

Concluding remarks

In the last decade, human endogenous retroviruses have attracted the attention of the research community because of the multiplicity of ways in which they can influence human physiology in health and disease. Since 2005, the grant funding for HERV-related projects by governmental agencies has increased ~4-fold ([240]; International Aging Research Portfolio, http://www.agingportfolio.org/) and the number of PubMed-indexed publications featuring HERVs has increased ~1.4-fold, reaching approximately 200 publications per year. This growing interest of the biomedical community is linked with the immense role that these elements have played, play and may play in the shaping of the human genome and transcriptome, in molecular evolution and in the progression of many autoimmune, neurological, infectious and oncological diseases. Contemporary experimental and bioinformatic methods enable the investigation of HERVs at the unprecedented levels of whole transcriptomes and even proteomes. This promises further acceleration of progress in decoding the molecular functions of HERVs, hopefully at the level of each individual genomic element.