Keywords

The deep complexity of eukaryotic transcriptomes and the rapid development of high-throughput sequencing technologies led to an explosion in the number of newly identified and uncharacterized lncRNAs. Many challenges in lncRNA biology remain, including accurate annotation, functional characterization, and clinical relevance. All these topics will be thoroughly discussed throughout the book. But to start with, we will detail the discovery of RNA as life’s indispensable molecule. The long journey for the biological characterization of non-coding RNAs is summed up in Fig. 1.1, and this history will be described over the first half of this chapter, from the DNA to the first non-coding transcripts. Then, we will discuss how global genomic and transcriptomic studies changed our view on the role of RNA in regulatory circuits, biodiversity, and complexity. Finally, we will include a summary of the extensive classification of lncRNAs.

Fig. 1.1
figure 1

The timeline of principle discoveries in nucleic acid biology and, in particular, eukaryotic ncRNAs

1.1 A Hundred-Years History of RNA Biology

Before the ever-expanding catalogues of lncRNAs that we have today, a long experimental and theoretical journey was required to prove the importance of RNA molecules in cell biology. It began in 1869 with the discovery of nucleic acids, and it took over a hundred years for researchers to finally identify non-coding transcripts and begin proposing regulatory roles for them.

1.1.1 From “Nuclein” to Nucleic Acids and to the Double Helix

At the end of the nineteenth century, a few pivotal discoveries foreshadowed the molecular biology era. In 1869, Friedrich Miescher isolated a material from nuclei that he called “nuclein” and which he described as highly acidic: in fact he had discovered DNA [1]. In contrast with proteins that were the main focus at the time, its content was low in sulfur and very high in phosphorus and could not be digested by protease treatment. Later, once the chemical composition of the “nuclein” isolated from different organisms had been discovered, it was realized that “thymus nucleic acid” consisted of DNA, while “yeast nucleic acid” was composed of RNA. In the early 1900s, several scientists proposed the chemical composition and the first structures for DNA and RNA, though the biological differences between these two molecules were still not apparent. Ironically, the discovery of “nuclein” by Miescher happened only a few years after Gregor Mendel published his work on the laws of heredity in 1866, but nevertheless many scientists thought proteins were the carriers of genetic information. Thus, the link between Mendel’s model and Miescher’s “nuclein” remained missing until 1944 when Oswald Avery proposed DNA as a carrier of genetic information [2].

The link between DNA and RNA was established in the late 1950s as Elliot Volkin and Lawrence Astrachan thoroughly described RNA as a DNA-like molecule synthesized from DNA. This discovery was then further elaborated into a molecular concept of RNA and DNA synthesis [3, 4]. Indeed, following the X-ray crystallographic studies of Rosalind Franklin and the establishment of the double-helix structure of DNA by James Watson and Francis Crick in 1953, it was proposed in 1961 that RNA could be an intermediate molecule in the information flow from DNA to proteins [5]. First devised in 1958 by Francis Crick and then by François Jacob and Jacques Monod, the Central Dogma of Molecular Biology comprised transcription of a DNA gene into RNA in the nucleus followed by protein synthesis in the cytoplasm. It was also stated that the information flow can only proceed from DNA to RNA and then from RNA to protein, but never from protein to nucleic acids [5]. The mediating role of RNA became a new focus of research which has been pivotal for the development of modern molecular biology.

1.1.2 A Central Role for RNA in Cell Biology: The RNA World Concept

In 1939, Torbjörn Caspersson and Jean Brachet showed independently that the cytoplasm is very rich in RNA. They also showed that cells producing high amount of proteins seemed to have high amounts of RNA as well [5]. This was a first hint for the requirement of RNA during protein synthesis and its role as a link between DNA and proteins. In 1955, Georges Palade identified the very first ncRNA that makes part of the very abundant cytoplasmic ribonucleoprotein (RNP) complex: the ribosome. In his “Central Dogma” Crick also theorized that there was an “adapter” molecule for the translation of RNA to amino acids. This second class of ncRNAs was discovered in 1857 by Mahlon Hoagland and Paul Zamecnik: the transfer (t)RNA. In 1960, François Jacob and Jacques Monod first coined the term “messenger RNA” (mRNA) as part of their study of inducible enzymes in Escherichia (E.) coli. Indeed, they showed the existence of an intermediate molecule carrying the genetic information leading to protein synthesis. Shortly after, the work of Crick helped establish that the genetic code is a comma-less, non-overlapping triplet code in which three nucleotides code for one amino acid. It was later deciphered in vitro as well as in vivo and shown to be universal across all living organisms [6]. In the late 1960s, rather different from mRNAs, a new class of short-lived nuclear RNAs was found: heterogeneous nuclear (hn)RNAs. These long RNA molecules, which were in fact precursors for mature rRNAs and mRNAs, led to the study of rRNA processing and the discovery of splicing [7, 8]. During that period, small nuclear (sn)RNAs which are part of the spliceosome, the RNP machinery responsible for intron splicing from pre-mRNAs, were discovered [9]; as well as small nucleolar (sno)RNAs, which are involved in the processing and maturation of ribosomal RNAs in the nucleolus [10].

Although Jacob, Monod, and Crick had already mentioned independently that RNA was not just a messenger, many scientists considered it as a mere unstable intermediating molecule, overlooking the active roles of other classes of ncRNAs. However, this view partially changed in 1980 when Thomas Cech and Sidney Altman discovered that RNA molecules could act as catalysts for a chemical reaction. Initially, Cech’s group found an intron from an mRNA in Tetrahymena thermophila that is able to perform its own splicing through an RNA-catalyzed cleavage [11]. Subsequently, Altman’s group showed that the RNA component of the ribonucleoprotein RNase P is responsible for its activity in degrading RNA [12]. These RNA enzymes were called ribozymes and have been shown since then to be key actors of the genetic information flow, making part of both the ribosome and the spliceosome [13, 14].

The discovery of catalytic RNA also led scientists to develop the RNA World theory, which states that prebiotic life revolved around RNA, since it appeared before DNA and protein. Indeed, the extensive studies of its roles in cell biology revealed that RNA is necessary for DNA replication and that its ribonucleotides are precursors for DNA’s deoxyribonucleotides. Moreover, as it was previously mentioned, RNA plays an important role in every step of protein synthesis, both as scripts (mRNAs) and actors (ncRNAs: rRNAs, tRNAs, etc.) (Fig. 1.2) [15]. Remarkably, the latter ones are constitutively expressed in the cell and are necessary for vital cellular functions, constituting a class of housekeeping ncRNAs. Being extensively studied, housekeeping ncRNAs are the subject of many specialized publications and will not be described here. Instead, other classes of regulatory ncRNAs that were discovered in the early 1990s will be discussed. These ncRNAs are characterized by very specific expression during certain developmental stages, in certain tissues or disease states, and play multiple roles in gene expression regulation.

Fig. 1.2
figure 2

Initial and current dogma of molecular biology

1.1.3 Bacterial sRNAs: Pioneers of Regulatory ncRNAs

The very first regulatory ncRNA to be discovered and characterized was micF from the bacteria E. coli. It was described as the first RNA regulating gene expression through sense-antisense base pairing in 1984 by the team of Masayuki Inoue [16] and represents the major class of bacterial regulatory ncRNAs, small (s)RNAs. The micF ncRNA was shown to repress the translation of a target mRNA encoding a porin (outer membrane protein F, OmpF), involved in passive transport through the cell membrane. First discovered through multicopy plasmid experiments, the transcript was isolated 3 years later and shown to be an independent gene. When transcription of micF is activated, it inhibits the expression of the ompF gene at both mRNA and protein levels. Subsequently, following the characterization of the RNA duplex structure in vitro, micF was shown to bind to the ribosome-binding site (RBS) of the ompF mRNA, thus inhibiting ribosome binding and translation.

More recently, it was shown that the regulation of gene expression by micF through base pairing extends to other genes, among which is the lrp mRNA [17]. Lrp (leucine-responsive protein) is a transcription factor that vastly regulates gene expression in E. coli in response to osmotic changes and nutrient availability. Remarkably, Lrp regulates micF expression as well, thus creating a feedback and proving the important role of micF in global gene regulation and metabolism. The same mechanisms were also found in Salmonella, supporting the evolutionary conservation of this regulatory pathway [18]. Since then many other sRNAs ranging in length from 50 to 500 nucleotides (nt) have been discovered, including trans- or cis-encoded ncRNAs, RNA thermometers, and riboswitches. They all act by pairing, thus inhibiting translation of targeted mRNAs and inducing their degradation.

1.1.4 MicroRNAs and RNA Interference

In the early 1990s, several scientists observed independently and in different eukaryotic organisms, through experiments of transgene co-expression or viral infection, an intriguing phenomenon of RNA-mediated inhibition of protein synthesis. The regulatory effects of these RNA molecules reshaped the views of RNA as a mere messenger. The very first studies described the phenomenon as “co-suppression” in plants, as “posttranscriptional gene silencing” in nematodes, or as “quelling” in fungi, but none of them suspected RNA to be the key actor until the identification of the first micro (mi)RNA in the nematode Caenorhabditis (C.) elegans in 1993 by Victor Ambros and coworkers. Ambros discovered that the lin-4 gene produces small RNAs of 22 and 61 nt from a longer non-protein-coding precursor. The longer RNA forms a stem-loop structure, which is cut to generate the shorter RNA with antisense complementarity to the 3′-untranslated region (UTR) of the lin-14 transcript [19]. The lin-4 RNA pairing to lin-14 mRNA was proposed as a molecular mechanism of “posttranscriptional gene silencing”, thus decreasing LIN-14 protein levels at first larval stages of nematode development [20]. Michael Wassenegger observed a similar phenomenon occurs in plants which he described as “homology-dependent gene silencing” or “transcriptional gene silencing”; this process is mediated by the incorporation of viroid RNA which induces the methylation of the viroid cDNA and gene silencing [21]. Ultimately the entire process of RNA-mediated gene silencing was elucidated in 1998 by Andrew Fire and Craig Mello in similar experiments with the unc-22 gene of C. elegans.

In 2000, another essential miRNA was identified in C. elegans. This miRNA, let-7, was shown to have homologues in several other organisms, including humans [22, 23]. The biogenesis as well as the molecular mechanisms of miRNA-mediated gene silencing has been extensively characterized. In 2001, Thomas Tuschl showed that, in C. elegans, long double-stranded RNA is processed into shorter fragments of 21–25 nts. Since this discovery, it has been demonstrated that premature transcripts in the nucleus are processed into hairpin-structured RNA by the Drosha-containing microprocessor complex and then exported to the cytoplasm where they are cleaved into a double-stranded RNA by Dicer. One of the strands of this double-stranded RNA is loaded to the RISC complex and then targeted to an mRNA molecule by complementarity, thus inducing translational repression [23]. This simplified scheme constitutes the mechanistic basis of RNA interference (RNAi) and presently unites all gene silencing phenomena at transcriptional and posttranscriptional levels, mediated by small ncRNAs including miRNAs, small interfering (si)RNAs, and Piwi-interacting (pi)RNAs, all of which are processed from double-stranded RNA precursors [24, 25].

Although the focus on RNAi resulted in a breakthrough for modern biology and biotechnology, as well as provided a deeper understanding of gene regulation, development, and disease, the relevance of lncRNAs remained largely unexplored. Nevertheless, some lncRNAs were investigated in the late 1980s such as H19 and Xist, the milestones of dosage compensation in mammals.

1.2 LncRNA Discovery in the Pre-genomic Era

In the 1980s, scientists were using differential hybridization screens of cDNA libraries to clone and study genes with tissue-specific and temporal patterns of expression. Initially, efforts were focused on genes producing known proteins; subsequently, an a posteriori approach was adopted without regard to the coding potential of RNA. Through this approach, the first non-coding gene was discovered, H19, even though at that time it was first classified as an mRNA [26].

1.2.1 H19: The Very First Eukaryote lncRNA Gene

In the late 1980s, elegant genetic and molecular studies discovered a phenomenon of genomic imprinting or parent-of-origin-specific expression which constitutes part of the dosage compensation mechanisms. Independently, two imprinted genes were identified: the paternally expressed protein-coding Igf2r and the maternally expressed H19. Both genes were localized to mouse chromosome 7 in proximity to each other forming the H19/IGF2 cluster [27, 28]. What made H19 unusual was the absence of translation even though the gene contained small open reading frames. H19 showed high sequence conservation across mammals, and the abundant transcript presented features of mRNAs: transcribed by RNA polymerase II, spliced, 3′ polyadenylated, and localized to the cytoplasm [29]. The expression of H19 in transgenic mice revealed to be lethal in prenatal stages, suggesting not only that the dosage of this lncRNA is tightly controlled but that it has an important role in embryonic development. However, the function of H19 as an RNA molecule in its own right remained a mystery until the functional characterization of another lncRNA involved in dosage compensation in mammals, Xist. Since that time, H19 has been thoroughly investigated and represents the prototype of a multitasking lncRNA.

1.2.2 X Inactivation: Existence of Xist

In living organisms, sex can be determined by many ways; it is defined in mammals by the X and Y chromosomes, while males only have one X and Y chromosome, females have two X chromosomes in their karyotype. However, the X chromosome carries many genes, most of which have functions that are not involved in sex determination. Hence, there is a need for dosage compensation between males and females. Although the mechanism of choice in Drosophila is to double the transcription of the single X chromosome in males, it is the opposite in mammals: one of the female X chromosomes is inactivated. This phenomenon, called X-chromosome inactivation (XCI), was first discovered in mouse by Mary Lyon in 1961 [30] and further generalized to other mammals. XCI is established early in development and is initiated by a unique locus, the X-inactivation center (Xic).

In the early 1990s, this locus was found to produce a long non-coding RNA, Xist (X-inactive-specific transcript). It is expressed at very low levels in both, male and female, mouse undifferentiated embryonic stem (ES) cells. Upon differentiation Xist expression is activated in a monoallelic way in female cells, from the future inactive X (Xi) to initiate the onset of random XCI. Being retained in the nucleus, Xist triggers gene silencing in cis by physically localizing and spreading broadly on the future Xi [31,32,33]. In contrast to H19 and other lncRNAs involved in dosage compensation, Xist is highly unusual since it triggers the silencing of the entire chromosome. The propagation of Xist along the Xi, called “coating,” implicates the RNA wrapping around the X and the recruitment of multiple factors, including the polycomb repressive complexes 1 and 2 (PRC1 and PRC2). This triggers a cascade of chromatin changes and a global spatial reorganization of the Xi and, ultimately, the stable repression of nearly all Xi-linked genes throughout development and adult life [34]. While Xist expression is critical for the initiation of the XCI, in somatic cells, Xist and the whole Xic were shown to be dispensable for the maintenance of silencing in mouse [35]. In 1999 human XIST, ectopically expressed from the artificially inserted transgene on mouse autosomes, was demonstrated to function as Xic and to initiate XCI even in undifferentiated mouse ES cells, unlike the mouse counterpart. This result suggested differences in the developmental regulation of Xist and in the initiation of the XCI process between mouse and humans. In addition, the inactivation by ectopic human XIST was observed only in a portion of mouse male ES cells, thus confirming that Xist is not a unique actor of stable inactivation [36]. Indeed, the Xic was initially defined in mouse as the minimal region of the X chromosome that contains all sequences both necessary and sufficient for the initiation of XCI. Xic extends over 1 Mb, and transcriptomic studies revealed that this region contains several protein-coding and non-coding genes, including Linx, Ftx, and others. Remarkably, some non-coding genes within Xic and beyond show poor primary sequence conservation between human and mouse, and this includes the sequence of Xist itself [37]. In particular, the Tsix lncRNA is an antisense transcript which overlaps the whole Xist gene and its promoter in mouse. In humans, key regulatory elements are truncated, and the transcript overlaps XIST only at 3′ end. These differences abolish the TSIX function in transcriptional repression of XIST on the future active X in humans [38, 39]. Recently, another lncRNA, XACT, was discovered in human ES cells. This gene is located within the intergenic region, outside of Xic, and it is not conserved in mice. In female human ES cells, XACT is expressed from and coats both X chromosomes. This lncRNA seems to be specific to pluripotent cells and is proposed to ensure peculiar control of XCI in humans [40]. The biogenesis of Xist, its structure, and the molecular mechanism of XCI have been the focus of many studies in different mammals and extensively documented in other publications [34].

The pioneering studies of H19 and Xist revolutionized our view of non-protein-coding gene functions and on the biological relevance of lncRNAs in general. These examples demonstrated the complexity and versatility of regulatory circuits orchestrated by a single lncRNA. They also stimulated the discovery and suggested potential mechanisms for other, yet uncharacterized, non-coding transcripts. A global effort toward lncRNA identification and characterization began in the 2000s, as a plethora of novel non-coding transcripts during the sequencing of the complete human genome.

1.3 From Non-coding Genome to Non-coding Transcriptome: The Genomic Era

Our modern view of eukaryotic transcriptomes was preceded by comprehensive investigations of genomic DNA and the discovery that, in addition to protein-coding (PC) sequences and regulatory elements essential for PC gene (PCG) transcription, the majority of the genome contains sequences that were considered to be useless evolutionary fossils. To differentiate these sequences from PC sequences, this DNA was named non-coding and referred to as selfish or junk DNA for almost 20 years [41].

1.3.1 The Human Genome Project: Genomic DNA Is Mostly Non-coding

In 1978, using the sequencing technique he had developed, Frederick Sanger generated the first ever full genomic sequence: the viral genome of the bacteriophage ɸX174 [42]. Since then, Sanger sequencing has been routinely used worldwide, and its discovery and development earned the Nobel Prize in Chemistry for Sanger, along with Walter Gilbert. During the following years, several viral genomes were sequenced and by the end of 1990, a worldwide sequencing effort, the Human Genome Project (HGP), was established by the National Institute of Health (NIH, USA) to completely sequence the human genome. In parallel, the American biochemist and entrepreneur Craig Venter founded his own company and sought private funding to achieve the same goal. This put pressure on the public groups involved in the HGP, and the race to unravel the human genome began. The first bacterial genome was published in 1995 [43]. It was followed in 1999 by the sequence of the euchromatic portion of human chromosome 22 [44], which covered approximately 65% of what is now known to be the full chromosome 22. This sequence was thought to contain 545 protein-coding genes (whether known or predicted), with PC exons spanning a mere 3% of the full sequence.

Finally, using clone-by-clone methodology, the first draft of the complete human genome was published in Nature in 2001 covering 96% of the euchromatin [45], followed the next day by Craig Venter’s publication in Science of the whole-genome sequence obtained by the shotgun-cloning method [46]. Regular updates completed the human genome sequence in 2003. In the meantime, the genomes of several other organisms had already been released, notably yeast [47], pufferfish [48], worm [49], fruit fly [50], and mouse [51], thus allowing comparative studies to be performed.

The first surprise from this comprehensive genomic sequencing effort was the rather low number of PCGs compared to what was initially expected. Indeed, early studies that looked at the repartition of CpG islands predicted 70,000–80,000 genes in the human genome [52], a figure close to the well-admitted 100,000 genes from the mid-1980s. However, the HGP predicted around 31,000 PCGs in 2001 reduced to 22,287 PCGs in 2004 [45, 53]. In general, only 1.2% of the human genome represents PC exons, whereas 24 and 75% were attributed to intronic and intergenic non-coding DNA.

1.3.2 Pervasive Transcription and the Dark Matter of the Genome

The HGP also revealed that most of the genome is actually transcribed, whether it encodes proteins or not. Indeed, a tiling array with oligonucleotide probes spanning human chromosomes 21 and 22 revealed that 90% of detected cytosolic polyadenylated transcripts map to non-coding genomic regions and not to exons [54]. Similar results were found by the FANTOM and RIKEN consortia when analyzing the transcriptome in both human [55] and mouse [56]. They sequenced more than 60,000 full-length cDNAs from mouse in a standardized manner to generate accurate maps of the 5′ and 3′ boundaries of all transcripts, thus defining transcription start (TSS) and termination (TTS) sites. Remarkably, cap analysis gene expression (CAGE) sequencing, a technique that sequences 5′ ends of capped transcripts, revealed over 23,000 ncRNAs originating from both sense and antisense transcription representing approximately two thirds of the mouse genome [57]. For the first time, antisense transcription was proposed to contribute to the regulation of gene expression at transcriptional level in mammals.

These results were later confirmed by even larger-scale studies conducted in humans by the ENCODE (Encyclopedia of DNA Elements) consortium. This project compiled over 200 experiments in its pilot phase [58] and up to 1640 datasets from 147 different cell lines in its later release [59]. Through various sequencing techniques, landscapes of DNase I hypersensitive sites, histone modifications, transcription factor binding sites, and the whole transcriptome were defined. Conclusions from these studies estimated that 93% of the human genome is actively transcribed and associated with at least one primary transcript (i.e., coding and non-coding exons and introns); among these transcripts, approximately 39% of the genome represented PCGs (from promoter to poly(A) signal) and 1% protein-coding exons, while the other 54% mapped outside of PCGs (Fig. 1.3). However, many lncRNAs overlap with PCG annotations in both sense, coding and antisense strands. More recently, the mouse counterpart of the ENCODE Consortium confirmed previous reports by publishing a similar analysis which showed that 46% of the mouse genome produces mRNAs while at least 87% of its genome is transcribed [60, 61].

Fig. 1.3
figure 3

Proportion of transcribed protein-coding and non-coding sequences (introns, UTRs, and others) in the human genome according to ENCODE [59]

Many studies aiming to characterize non-coding transcription were also performed in other eukaryotes, including Saccharomyces cerevisiae. Even in this primitive unicellular eukaryote, about 85% of the genome is transcribed [62]. This phenomenon is often referred to as “pervasive transcription” and is widespread among eukaryotes. An expanding body of literature details its function [63, 64]. The identification and characterization of non-coding transcripts as unique ncRNAs extended the former definition of a “gene” beyond its coding function. Furthermore, the discovery of the non-coding genome and transcriptome gave rise to heated debates in the scientific community concerning the biological significance and functional relevance of these non-coding DNA and RNA, still perceived as a junk [63, 65, 66]. These debates challenged the Central Dogma of Watson and Crick, promoting ncRNAs to the epicenter of the cellular processes as a driver of biological complexity through evolution.

1.4 Non-coding RNAs: Junk or Functional

Polemics around the biological and functional relevance of lncRNAs were oriented toward understanding the origin, conservation, and diversification of lncRNA species across evolution.

1.4.1 Origin of lncRNA Genes

Non-coding genes were proposed to arise through various mechanisms including DNA-based or RNA-based duplications of existing genomic sequences, the metamorphosis of PCGs by loss of protein-coding potential, transposable element exaptation, or non-coding DNA exaptation [67]. Homologous non-coding genes arise from duplications of already existing lncRNA genes. Pseudogenes are an example of PCG metamorphosis during which a duplicated ancestral open reading frame had accumulated disruptions destroying its potential to be translated. Once transcribed, pseudogenes often produce lncRNAs, as in the case of PTENP1. Pseudogenization of a PCG, due to mutations deleterious to translation, can also produce lncRNA genes that do not have an apparent protein-coding “homologue”. An example is Xist which is derived from an ancestral Lnx3 gene and which has acquired several frame-shifting mutations during early evolution of placental mammals [68]. Exaptation or co-option of RNA-derived transposable elements (TE) into non-coding genes is another frequent mechanism of lncRNA origination. In humans TEs constitute a large portion of the genome (40–45%) [45]. Most of them are genomic remnants that are currently defunct but are often embedded into non-coding transcripts. TEs are considered as major contributors to the origin and diversification of lncRNAs in vertebrates [69]. Together with local repeats, they provide lncRNA genes with TSS, splicing, polyadenylation, RNA editing, RNA binding sites, nuclear retention signals or particular secondary structures for protein binding [70,71,72].

Finally, pervasive transcription of the genome may generate cryptic RNAs that, if maintained through evolution, can give rise to lncRNA genes with novel functions. In particular, exaptation of non-coding sequences into lncRNAs can occur through the acquisition of regulatory elements within a silent region, thereby promoting transcription. However, the de novo origin of lncRNAs remains difficult to prove and is represented by few examples, such as the testis-specific lncRNA Poldi [73]. Interestingly in humans, the testis and cerebral cortex are the most enriched tissues for the expression of PCGs and non-coding genes of de novo origin. This particularity was suggested to contribute to phenotypic traits that are unique to humans, such as an improved cognitive ability [74, 75].

1.4.2 Evolutionary Conservation of lncRNAs

Genomic and transcriptomic studies across the eukaryotic kingdom allowed the analysis of the primary sequence conservation of protein-coding and non-coding loci. These studies revealed that the human genome is highly dynamic, and only 2.2% of its DNA sequence is subjected to conservation constraints [76]. Remarkably, non-coding genes are among the least conserved with more than 80% of lncRNA families being of primate origin [77]. This finding raised skepticism regarding the functionality and biological relevance of lncRNAs and initiated a search for other conservation constrains [78, 79]. If the criterion of primary sequence conservation is too restrictive in regard to lncRNA genes, other features such as structure, function, and expression from syntenic loci constitute multidimensional factors that are more applicable for evolutionary studies of lncRNAs [80]. Recently, a study looking at the non-coding transcriptome of 17 different species (16 vertebrates and the sea urchin) showed that although the body of non-coding genes tends not to be conserved, short patches of conserved sequences could be found at their 5′ ends. This confirmed a higher conservation of TSS and synteny, as well as expression patterns in different tissues, especially in those involved in development [81]. Indeed, the most conserved are developmentally regulated lncRNAs of the lincRNAs subfamily. These lncRNAs have a remarkably strong conservation of spatiotemporal and syntenic loci expression, suggesting that it is selectively maintained and crucial for developmental processes [77, 82, 83].

1.4.3 Role of lncRNAs in Biological Diversity

The identification of new lncRNAs in the last decade continues to increase and, as anticipated in the past, largely exceeds that of protein-coding transcripts. The diversity of the non-coding transcriptome is considered as an argument to explain the remarkable phenotypic differences observed among species given a relatively similar numbers of protein-coding genes among fruit fly (13,985; BDGP release 4), nematode worm (21,009; Wormbase release 150), and human (23,341; NCBI release 36) [84]. In 2001, John Mattick and Michael Gagen proposed, for the very first time, that non-coding transcripts named “efference” RNA, together with introns, constitute an endogenous network enabling dynamic gene-gene communications and the multitasking of eukaryotic genomes. In contrast to core proteomic circuits, this higher-order regulatory system is based on RNA and operates through RNA-DNA, RNA-RNA, and RNA-protein interactions to promote the evolution of developmentally sophisticated multicellular organisms and the rapid expansion of phenotypic complexity. A direct correlation between the portion of non-coding sequences in the genome and organism complexity was hypothesized [85, 86]. Interestingly comparative genomics allowed the identification of a few regions in the human genome that have high divergence when compared to other species [87, 88]. These human accelerated regions (HAR) contain many lncRNA genes and have been suggested to be involved in the acquisition of human-specific traits during evolution. In 2006, a first lncRNA from these regions was shown to be expressed during cortical brain development [89]. Since then, many mutations involved in diseases were identified in these non-coding regions and shown to be associated with regulatory elements in the brain [90]. A more recent study showed that mutations of HAR enhancer elements could be involved in the development of autism, thus supporting the hypothesis that some HAR could be involved in human-specific behavioral traits and cognitive or social disorders when mutated [91]. However, the functionality of non-coding transcripts was and still remains hotly debated. Nevertheless, the conception of developmental and evolutionary significance has stimulated an exhaustive molecular characterization of lncRNA genes and transcripts.

1.5 The General Portrait of lncRNA Genes and Transcripts

lncRNAs have been identified in all species which have been studied at the genomic level, including animals, plants, fungi, prokaryotes, and even viruses. Genome-wide studies continue to enlarge the catalogue of lncRNAs continuously reshaping the specific features of lncRNAs as transcription units. Here, we will summarize the main features of lncRNAs that distinguish them from mRNAs (Table 1.1).

Table 1.1 Comparison of lncRNA and mRNA features

1.5.1 Coding Potential of lncRNA Genes

As dictated by the acronym, lncRNA genes do not encode proteins. Cytosol-localized lncRNAs were found associated with mono- or polyribosomal complexes [92], but this association is not necessarily linked to translation but rather proposed to determine lncRNA decay [93, 94]. Some lncRNAs include short open reading frames (sORFs) and undergo translation, though only a minority of such translation events results in stable and functional peptides [95, 96]. This is the case of DWORF, a muscle-specific lncRNA that encodes a functional peptide of 34 amino acids [97,98,99,100]. Proteomic studies will undoubtedly introduce a new “coding” aspect to lncRNAs, expanding our conception of “coding” and leading to a possible concept of bifunctionality.

1.5.2 LncRNA Transcription and Transcript Organization

The majority of eukaryotic lncRNAs are produced by RNA polymerase II, with some exceptions, for example, the murine heat-shock induced B2-SINE RNAs [101] or the human neuroblastoma associated NDM29 [102], which are synthesized by RNA polymerase III. However, the last two examples are not strictly considered as lncRNAs because the transcript length is below the arbitrary threshold of 200 nts. In plants, two specialized RNA polymerases, Pol IV and Pol V, transcribe some lncRNA genes [103]. Many lncRNAs are capped at the 5′ end, except those processed from longer precursors (intronic lncRNAs or circRNAs). However, some ambiguities exist concerning the presence of a cap, especially for highly unstable and low-abundant transcripts, since they can’t be captured by the CAGE-seq technique. LncRNAs may or may not be 3′-end polyadenylated; in addition, they may also be present as both forms, such as bimorphic transcripts like NEAT1 and MALAT1 [104, 105]. LncRNAs with a polyadenylation signal have higher stability than those that are poorly or not polyadenylated, with the exception of lncRNAs bearing specific 3′-end structures as in case of MALAT1 [106]. Of note, poly(A)+ transcriptomic studies exclude the possibility of discovery of non-polyadenylated transcripts and introduce a quantitative bias in the identification of such lncRNAs. This point should be taken into account in comparative studies or in selection of RNA-seq strategies, favoring the use of total RNAs instead of the more customary used poly(A)+ RNA fraction.

Similar to PCGs, transcription of many lncRNA genes requires canonical factors assisting the RNA polymerase machinery such as the pre-initiation complex (PIC), Mediator, transcription elongation complex, and also specific transcription factors that in turn could define the specificity of lncRNA expression in different biological contexts. However, some particularities in lncRNA promoters have been demonstrated. In humans lncRNA promoters are more enriched in A/T mono-, di-, and trinucleotide stretches and are characterized by reduced CG and almost depleted AT skews (CG and AT compositional strand biases); this is contrary to PCGs suggesting a distinct regulation of transcription for these two groups of genes [107]. Promoters of PROMPTs are devoid of transcription initiation factors such as TAFI, TAFII, p250, and E2F1 and are believed to initiate transcription without the use of conventional PIC [108]. eRNAs require the Integrator complex for the 3′-end cleavage of primary transcripts [109], and lncRNA precursors of small ncRNAs were shown to be processed by specific endonucleases [110, 111]. Some unstable lncRNAs such as yeast NUTs and CUTs are terminated by the Nrd1-dependent pathway, thus targeting them for rapid degradation by the exosome [112,113,114].

LncRNA genes can have a multi-exonic composition with similar splicing signals as PCGs and therefore could undergo splicing into several different isoforms with distinct functional outcomes and clinical relevance [115,116,117]. However, they usually comprise fewer and slightly longer exons than PCGs [118, 119].

1.5.3 Chromatin Signatures of lncRNAs Genes

As RNA polymerase II transcribes most of the lncRNA genes, their genomic regions present a chromatin organization resembling that of PCGs, with some differences. This could be due to the globally low expression of lncRNAs, which is a consequence of either low rate of transcription, lower stability, or both. Globally, lncRNA TSS reside within the DNase I hypersensitive sites suggesting nucleosome depletion from this region. LncRNA promoters have lower levels of histone H3K4 trimethylation (H3K4me3), which is in accordance with their low transcription rate. eRNAs and PROMPTs present high levels of histone H3K4 monomethylation (H3K4me1) and K27 acetylation (H3K27ac) at promoters, which is considered as a specific signature of enhancer- and promoter-associated unstable transcripts; these signatures exist in the following ratios: H3K4me3 over H3K4me1 as a mark of PROMPTS and H3K4me1 over H3K4me2 as a mark of eRNAs [120]. The body of most lncRNA genes with the exception of eRNAs and PROMPTs is marked by histone H3K36 trimethylation (H3K36me3). In yeast, sense-antisense transcription was reported to be associated with particular chromatin architecture: reduced histone H2B ubiquitination, H3K36me3, and histone H3K79 trimethylation, as well as increased levels of H3ac, chromatin remodeling enzymes, histone chaperones, and histone turnover [121]. In mouse, bidirectional transcription, which is often associated with developmental genes and genes involved in transcription regulation, was found to harbor high H3K79 dimethylation (H3K79me2) and elevated RNA polymerase II levels. This signature is characteristic of intensified rates of early transcriptional elongation within a region transcribed in both directions [122].

It is anticipated that single cell studies will resolve the problem of signal variability in a population of cells, allowing transcriptional events to be directly linked to specific chromatin modifications. Such efforts have already been initiated for transcriptome profiling [123,124,125] but remain challenging for epigenomic studies [126].

1.5.4 Expression Pattern of lncRNAs: Stability, Specificity, and Abundance

Several genome-wide studies addressed lncRNA stability and, depending on the employed experimental approach, revealed some discrepancy for different species of lncRNAs. In mouse, the measurements of the lncRNA half-life (t½) and decay rates were performed through transcription inhibition by actinomycin B treatment. In this case, lncRNAs showed a half-life range from 30 min to 48 h, which is similar to mRNAs; however, a mean t½ of 4.8 versus 7.7 h for mRNAs suggests that lncRNAs possess a lower stability. A high percentage of lncRNAs was classified as unstable (t½ < 2 h), e.g., Neat1, and a few as highly stable (t½ > 12 h) [127]. Comparison of the stability of different lncRNA species revealed that intronic or promoter-associated lncRNAs are less stable than either intergenic, antisense, or 3′ UTR-associated lncRNAs. Single-exon transcripts, a class of nuclear-localized lncRNAs, are overrepresented among unstable transcripts. In human HeLa cells, the same approach of transcriptional inhibition was used and revealed that antisense lncRNAs are more stable than mRNAs (median t 1/2 = 3.9 versus 3.2 h, respectively), whereas intronic lncRNAs included both stable (t 1/2 > 3 h) and unstable (t 1/2 < 1 h) transcripts with the t 1/2 median of 2.1 h [128]. Recently discovered circular RNAs are examples of highly stable lncRNAs with the median t 1/2 of 18.8–23.7 h and which is at least 2.5 times longer than their linear counterparts [129].

Nuclear and cytoplasmic exosomes, cytoplasmic Xrn1, and nonsense-mediated decay (NMD), as well as RNAi pathways, are known to control lncRNA abundance in the cell. Circular RNAs are intrinsically protected from any exonucleolytic- or polyadenylation-dependent decay pathways. Of note, actinomycin D treatment has a large impact on cells, and this can particularly influence lncRNA decay because of the very high sensitivity of lncRNAs to stress. Indeed, the measurements of t½ for single lncRNAs could significantly vary from one experiment to another, pointing to the necessity of multiple approaches including de novo RNA labeling to achieve more accurate and confident conclusions.

Multiple transcriptome profiling globally highlighted a highly specific spatiotemporal, lineage, tissue- and cell-type expression patterns for lncRNAs compared to PCGs; only a minority are ubiquitously present across all tissues or cell types, such as TUG1 or MALAT1 [105, 130, 131]. Curiously, the brain and testis represent a very rich source of uniquely expressed lncRNAs supporting the hypothesis that such transcripts are important for the acquisition of specific phenotypic traits [82, 130]. The ubiquitously expressed lncRNAs are often highly abundant, whereas specific lncRNAs present in one tissue or cell type tend to be expressed at low levels [132]. Moreover, interindividual expression analysis in normal human primary granulocytes revealed increased variability in lncRNA abundance compared to mRNAs [133]. Some disease-associated single-nucleotide polymorphisms (SNPs) within lncRNA genes and their promoters were linked to altered lncRNA expression, thus supporting their functional relevance in pathologies [134].

The high specificity of lncRNA expression argues in favor of important regulatory roles that these molecules can play in different biological contexts, including normal and pathological development.

1.5.5 Subcellular Localization of lncRNAs

Globally, unlike mRNAs, many lncRNAs have nuclear residence with focal or dispersed localization pattern (NEAT1) [135]. However, others were also found both in the nucleus and in the cytosol (TUG1, HOTAIR) or in the cytosol exclusively (DANCR) [105]. Multiple determinants, such as a specific RNA motif (BORG) [136] or RNA-protein assemblies, may dictate the subcellular localization of lncRNAs and define their function [137]. Remarkably, environmental changes or infection can induce lncRNA delocalization (or active trafficking) from one cellular compartment to another, as in the case of stress-induced lncRNAs [138]. HuR and GRSF1 modulate nuclear export and mitochondrial localization of the nuclear-encoded RMRP lncRNA [139].

1.5.6 Structure of lncRNAs

RNA is a highly flexible and dynamic molecule that adopts complex secondary structures. The folding of lncRNAs defines their cellular decay and functional versatility, enabling their nuclear localization, stability, and interaction with proteins [140]. A growing number of examples demonstrate that the RNA secondary structure constitutes the primary functional unit and evolutionary constraint bypassing poor interspecies lncRNA sequence conservation [141]. One such example is the lncRNA HOTAIR which exists only in mammals, sharing 58% of homology between human and mouse [142, 143]. Covariance analysis across 33 mammalian sequences of HOTAIR revealed a significant number of covariant base pairs and half-flips, which maintained a similar structure regardless of the changed sequence; this was especially true in regions surrounding proposed protein-binding segments of the lncRNA [144]. On the other hand, low sequence conservation that induces changes in structure can drive acquisition of new functions and specialization of the lncRNA-mediated regulatory circuit. This is the case of human accelerated region 1 (HAR1)-derived lncRNAs expressed in developing neocortex in primates where the capacity to form a stable cloverleaf-like structure has arisen only in humans [89, 145]. However, we are still far from understanding the function of this lncRNA in human brain development.

Numerous structure prediction tools, such as Rfold, have been developed to give guidance for further functional studies. Structural analysis of RNA has increased our understanding of mechanistic aspects of lncRNA action; however, X-ray crystallography, nuclear magnetic resonance (NMR), and cryo-electron microscopy require purified and stable, nearly static, molecules and are not adapted to highly dynamic and flexible RNA. Very recently, new technologies based on high-throughput sequencing have evolved enabling both an in vitro and in vivo view of RNA conformation [140].

1.6 Classification of lncRNAs

Advances in deep sequencing technologies gave rise to a plethora of novel transcripts requiring a universal standardized system for lncRNA classification and functional annotation. The state of lncRNA annotations is still at its beginning, and different classifications based on their length, transcript properties, location in respect to known genomic annotations, regulatory elements, and function have been proposed. Here, we review a non-exhaustive cataloguing of eukaryotic lncRNAs summarized in Table 1.2.

Table 1.2 Classification of lncRNAs (adapted from [279])

1.6.1 Classification According to lncRNA Length

By convention, a length of 200 nt constitutes a bottom line for discrimination of long or large ncRNAs from small or short ncRNAs. However, lncRNAs vary significantly in size, and those that exceed the length of 10 kb belong to the groups of very long intergenic (vlinc)RNAs and macro lncRNAs. These transcripts possess some particular features that distinguish them from other lncRNAs: they are poorly or not spliced, weakly polyadenylated at 3′ end, and are produced by particular genomic loci. The majority of vlincRNAs are localized in close proximity or within PCG promoters on the same or opposite strand and function in cis as positive regulators of nearby gene transcription. Interestingly, some vlincRNA promoters harbor LTR sequences that are highly regulated by three major pluripotency-associated transcription factors, suggesting a possible role in early embryonic development [146]. Others are specifically induced by senescence and are required for the maintenance of senescent features that in turn control the transcriptional response to environmental changes [147]. Macro lncRNAs are often antisense to PCGs and are produced from imprinted clusters in a parent-of-origin-specific manner. Macro lncRNAs silence nearby imprinted genes either through their lncRNA product triggering epigenetic chromatin modifications or by a transcriptional interference mechanism [148].

1.6.2 Classification According to lncRNA Location with Respect to PCGs

This attribute is commonly used by the GENCODE/Ensembl portal in transcript biotype annotations, but also employed on an individual scale by consortia and laboratories for newly assembled lncRNA transcripts. Initially transcripts are classified as either intergenic or intragenic (Fig. 1.4). Long or large intergenic non-coding (linc)RNAs do not intersect with any protein-coding and ncRNA gene annotations. This category also includes the adopted GENCODE and homonymous biotype of long or large intervening ncRNAs that were originally defined by specific histone H3K4-K36 chromatin signatures within evolutionary conserved genomic loci [149, 150]. LincRNAs are usually shorter than PCGs, transcribed by RNA polymerase II, 5′ capped, 3′ polyadenylated, and spliced. Although several highly conserved lincRNAs exist, the majority possess modest sequence conservation comprising short, 5′-biased patches of conserved sequence nested in exons [81]. Highly conserved lincRNAs are believed to contribute to biological processes that are common to many lineages, such as embryonic development [77], while others are proposed to assure phenotypic and functional variations at individual and interspecies levels. Many, if not most, lincRNAs are localized in the nucleus where they exercise their regulatory functions. One such example is lincRNA-p21 which is induced by p53 upon DNA damage [151]. lincRNA-p21 physically associates with and recruits the nuclear factor hnRNP-K to specific promoters mediating p53-dependent transcriptional responses.

Fig. 1.4
figure 4

Annotation of non-coding transcripts according to their genomic position relative to a protein-coding gene (orange box, protein-coding exon; blue box, non-coding exon)

Intragenic lncRNAs overlap with PCG annotations and can be further classified into antisense, bidirectional, intronic, and overlapping sense lncRNAs.

Antisense lncRNAs, asRNAs or ancRNAs, were first discovered in single gene studies, but the recent development of stranded tiling and RNA-seq technologies has identified them as a common genome-wide feature of eukaryotic transcriptomes [152,153,154]. This group encompasses so-called natural antisense transcripts, NATs, which are in turn subdivided into cis -NATs, which affect the expression of the corresponding sense transcripts and into trans -NATs, which regulate expression of non-paired genes from other genomic locations [155,156,157]. A very recent study has pointed to a higher specificity of expression and an increased stability of asRNAs compared to lincRNAs and sense intragenic lncRNAs [128]. Due to sequence complementarity to sense-paired mRNAs or pre-mRNAs, asRNAs can act through RNA-RNA pairing, thereby ensuring specific targeting of the asRNA regulatory activity. This is the case of BACE1-AS that is highly expressed in Alzheimer’s disease patients. It stabilizes the BACE1 mRNA resulting in an increased expression of the BACE1-encoded beta-secretase and the accumulation of amyloid-beta peptides in the brain [158]. Antisense transcription across intron regions has been shown to regulate the local chromatin organization and environment, thus affecting co-transcriptional splicing of sense-paired pre-mRNAs [159]. Some NATs contain the inverted short interspersed nuclear element B2 (SINEB2), such as AS-Uchl1 [160]. These NATs, called SINEUPs, are able to stimulate sense mRNA translation through lncRNA-mRNA pairing thanks to a complementary 5′ overlapping sequence to the paired-sense protein-coding gene. Recently, SINEUPs were proposed as a synthetic reagent for biotechnological applications and in therapy of haploinsufficiencies [161, 162]. In spite of the poor evolutionary conservation of sense-antisense transcription, some subgroups of lncRNAs, such as senescence-associated vlincRNAs and macro lncRNAs in mammals or XUTs in yeast, are mostly constituted of antisense transcripts, which suggests potential antisense-mediated regulatory pathways in control of cellular homeostasis, stress response, and disease [154].

The discovery of bidirectional transcription as an intrinsic feature of the eukaryotic transcriptional machinery has given rise to the identification of bidirectional lncRNAs [153, 163,164,165,166]. Originating from the opposite strand of a PCG strand, these transcripts do not overlap or only partially overlap with the 5′ region of paired PCGs, as is the case of promoter-associated (pa)ncRNAs, long upstream antisense transcripts (LUATs), and upstream antisense transcripts (uaRNA) [122, 167,168,169,170]. Presently, the number of bidirectional lncRNAs is largely underestimated not only because of the inaccurate annotation of transcriptional start sites (TSS) and promoters in the genome but also because of the highly unstable nature of these ncRNAs and the corresponding difficulty to detect them. Genomic studies have revealed that bidirectional promoters display distinct sequences and epigenetic features; moreover, they can be found near genes involved in specific biological processes such as developmental transcription factors or cell cycle regulation [122, 168, 169, 171, 172]. An imbalance in bidirectional transcription constitutes an endogenous fine-tuning mechanism that is particularly operative when facultative gene activation or repression is required [173, 174].

Intronic lncRNAs are restricted to PCG introns and could be either stand-alone unique transcripts or by-products of pre-mRNA processing. Examples of pre-mRNA-derived intronic transcripts are circular intronic (ci)RNAs produced from lariat introns which have escaped from debranching [175] and sno-lncRNAs produced from introns with two embedded snoRNA genes [176]. Such lncRNAs are proposed to positively regulate the transcription of the host PCG or its splicing by accumulating near the transcription locus. Another example of intronic lncRNAs of lariat origin, named switch RNAs, is produced by transcription through the immunoglobulin switch regions. They are folded into G-quadruplex structures to bind and recruit the activation-induced cytidine deaminase AID to DNA in a sequence-specific manner, thereby ensuring proper class switch recombination in the germ line [177]. Stand-alone intronic transcripts, expressed independently of the PCG hosts, are believed to be the most prevalent class of intronic lncRNAs, including so-called totally intronic ncRNAs, TINs [178, 179]. Expression of a certain TIN is activated during inflammation, but the exact function of these lncRNAs is still poorly understood [180].

Overlapping sense transcripts encompass exons or whole PCGs within their introns without any sense exon overlap and are transcribed in the same sense direction. This annotation includes the GENCODE-adopted homonymous biotype and has been attributed to a number of transcripts, denoted as “GENENAME-OT.” One such example is SOX2-OT that harbors in its intron one of the major pluripotency regulators, the SOX2 gene. SOX2-OT is dynamically expressed and is alternatively spliced not only during differentiation but also in cancer cells where it was proposed to regulate SOX2 [181].

Intronic and overlapping sense lncRNAs could form circular lncRNAs (circRNAs) due to head-to-tail noncanonical splicing [182, 183]. Some sequence features such as the presence of repetitive elements within introns could be decisive for activation of noncanonical splicing and generation of a circular RNA molecule [184]. For example, Alu elements within introns are proposed to participate in RNA circularization via RNA-RNA pairing [185]. Remarkably, such events seem to be tissue or cell type specific, restricted to a certain developmental stage or pathological context [186, 187]. More generally, circRNAs function in the cytosol as miRNA sponges, as the case of CDR1as/ciRS-7 which is an RNA sponge of miR-7 [182, 183]. Some circRNAs, termed exon-intron circRNAs (EIciRNAs), still contain unspliced introns and are retained in the nucleus, where they are able to interact with U1 snRNP and promote transcription of their parental genes [188]. The most remarkable property of circRNAs is their high stability which makes them eligible as potent diagnostic markers and therapeutic agents [189].

1.6.3 Classification According to lncRNA Residence Within Specific DNA Regulatory Elements and Loci

In addition to PCGs, mammalian genomes contain tens of thousands of pseudogenes, which are genomic remnants of ancient PCGs that have lost their coding potential throughout evolution. Importantly, many of them are transcribed in both sense and antisense directions into lncRNAs. Given high sequence similarity with parental genes, pseudogene-derived lncRNAs can regulate PCG expression via RNA-RNA pairing by acting as miRNA sponges, by producing endogenous siRNAs, or by interacting with mRNAs [190,191,192]. PTENP1, a lncRNA pseudogene derived from the tumor-suppressor gene PTEN, was among the first reported non-coding miRNA sponges with a function in cancer [193].

Ultra-conserved regions (UCRs) are genome segments that exhibit 100% DNA sequence conservation between human, mouse, and rat. The human genome contains 481 UCRs within intragenic (39%), intronic (43%), and exonic (15%) sequences [194]. These regions are extensively transcribed into T-UCR lncRNAs [195, 196]. Remarkably, expression of T-UCRs is induced by cancer-related stresses such as retinoid treatment or hypoxia. They are aberrantly expressed in different cancers and some are associated with poor prognosis [196,197,198]. Given high specificity of expression, T-UCRs were proposed as molecular markers for cancer diagnosis and prognosis [199]. The function of T-UCRs is still poorly understood. Evf2 (or Dlx6as) is an example of T-UCR which acts as a decoy. It interacts with the transcription activator DLX1 increasing its association with key DNA enhancers but also with the SWI-/SNF-like chromatin remodeler brahma-related gene 1 (BRG1) inhibiting its ATPase activity. As a result, Evf2 induces chromatin remodeling and Dlx5/Dlx6 enhancers decommissioning with a final repression of transcription [200, 201].

Telomeres, which are protective nucleoprotein structures at the ends of chromosomes, are transcribed into non-coding telomeric repeat-containing RNAs, TERRA , in all eukaryotes. This family of transcripts is generated from both Watson and Crick strands in a cell cycle-dependent manner [202, 203]. Formation of RNA-DNA hybrids by TERRA at chromosome ends promotes recombination and, hence, delays senescence. However, in cells lacking telomerase- and homology-directed repair, TERRA expression induces telomere shortening and accelerates senescence [204, 205]. Subtelomeric regions are also actively transcribed [206,207,208]. In budding yeast, this heterogeneous population of lncRNAs, named subTERRA, is transiently accumulating in late G2/M and G1 phases of the cell cycle in wild-type cells or in asynchronous cells deleted for the Xrn1 exoribonuclease [209]. The exact function of subTERRA is not yet clear though it has been proposed to have a regulatory role in telomere homeostasis.

Recent findings in different eukaryotes including human revealed that centromeric repeats are actively transcribed into lncRNAs during the progression from late mitosis to early G1 [210,211,212,213,214]. These centromeric lncRNAs physically interact with different centromere-specific nucleoprotein components, such as CENP-A/CENP-C and HJURP, and are required for correct kinetochore assembly and the maintenance of centromere integrity.

Ribosomal (r)DNA loci were shown to be transcribed by RNA polymerase II, antisense to the rRNA genes, into a heterogeneous population of lncRNAs, called PAPAS (promoter and pre-rRNA antisense). Their expression is induced in quiescent cells and triggers the recruitment of histone H4K20 methyltransferase Suv4-20h2 to ribosomal RNA genes for histone modification and transcriptional silencing [215]. PAPAS also allow heterochromatin formation and gene silencing in growth-arrested cells.

Promoters and enhancers constitute fundamental cis-regulatory elements for the control of PCG expression, serving as platforms for the recruitment of transcription factors and transcription machinery and the establishment of particular chromatin organization. Remarkably, many, if not all, functional enhancers and promoters are pervasively transcribed, respectively, into eRNAs and PALRs, in both sense and antisense directions. Transcribed enhancer and promoter regions possess particular histone modification signatures that distinguish them from other transcription units. Such signatures include increased histone H3K27ac and H3K4me1 as compared with other lncRNA and PCGs. The termination of enhancer-derived lncRNAs, eRNAs, depends on the Integrator complex which ensures 3′-end transcript cleavage. The result is that eRNAs are poorly or not polyadenylated and highly unstable. Their expression is specific to cell type, tissue, or stages of development and can be activated by external or internal stimuli. Enhancer transcription was proposed to mark functional, active enhancer elements. However, eRNA function as stand-alone transcripts is still controversial, and the function of only few eRNAs, such as FOXC1e or NRIP1e [216], has been demonstrated. Specifically, it is proposed that these eRNAs control promoter chromatin environment, enhancer-promoter looping, RNA polymerase II loading and pausing, and “transcription factor trapping”; all these events contribute to a robust transcription activation of nearby and distant genes [217].

Promoter-associated lncRNAs or PALRs are transcribed in sense and antisense directions at promoter regions and can partially overlap the 5′ end of a gene [218]. This class of transcripts includes highly unstable PROMPTs (promoter upstream transcripts) and upstream antisense RNAs (uaRNAs) that are more easily detectable in a context where the nuclear exosome has been depleted [108, 170, 219]. Polyadenylation-dependent degradation of PROMPTs was proposed to ensure directional RNA production from otherwise bidirectional promoters [220]. The presence of a splicing competent intron within uaRNAs was shown to facilitate gene looping placing termination factors at the vicinity of a bidirectional promoter for termination and thereby ensuring RNA polymerase II directionality toward a PCG [221]. Some PALRs were shown to negatively regulate transcription of the nearby genes. One such example is a PALR from the CCND1 gene promoter which represses transcription by recruiting TLS and locally inhibiting CBP/p300 histone acetyltransferase activity on the downstream target gene, cyclin D1 [222, 223].

The 3′-untranslated regions (UTRs) of eukaryotic genes can be transcribed into independent transcription units or UTR-associated (ua)RNAs [224]. They are generated either by an independent transcriptional event from the upstream PCG or by posttranscriptional processing of a pre-mRNA. Expression of uaRNAs is regulated in a developmental stage- and tissue-specific fashion and is evolutionarily conserved; nevertheless, the functional relevance of such transcripts has not yet been explored.

1.6.4 Classification According to lncRNA Biogenesis Pathways

In budding yeast, since many lncRNAs are highly unstable or “cryptic,” the commonly employed classification of lncRNAs is based on their decay or biogenesis features. However, some so-called stable unannotated transcripts (SUTs) were identified in a wild-type genetic background [163]. Others are only detectable under specific stress conditions or in RNA-decay mutant strains. These latter transcripts are roughly divided into three classes: cryptic unstable transcripts (CUTs), which are sensitive to the nuclear RNA decay pathway [163, 225]; Nrd1-unterminated lncRNAs (NUTs) [113]; and Xrn1-sensitive unstable transcripts (XUTs), which are degraded by the cytoplasmic 5′–3′ exoribonuclease, Xrn1 [226, 227]. The majority of XUTs are transcribed antisense to PCGs. CUTs are often bidirectional or overlapping sense transcripts, but can also be antisense, as is the case of the PHO84 CUT [228]. Beyond each class definition, there is a considerable overlap between CUTs and NUTs but also XUTs and SUTs [94, 112]. Some CUTs have been reported to escape nuclear RNA decay and are exported to the cytoplasm where they are taken in charge by Xrn1 or by nonsense-mediated mRNA decay (NMD), as is the case of cytoplasmically degraded CUTs or CD-CUTs [229]. CD-CUTs bear a 5′ extension originating upstream from the bona fide promoter and which partially or completely overlaps PCGs. CD-CUT transcription is proposed to control the expression of a subset of genes from subtelomeric regions and, in particular, metal homeostasis genes. Another subclass of CUTs includes meiotically induced lncRNAs, meiotic unannotated transcripts (MUTs), that are degraded by the nuclear exosome Rrp6 and the exosome targeting complex TRAMP [230, 231]. The key difference between CUTs, XUTs, and SUTs is determined by their distinct subcellular fates. CUTs are transcribed and degraded in the nucleus, while SUTs and XUTs are exported to the cytoplasm where many XUTs are degraded by Xrn1 unless they escape degradation by pairing to complementary mRNAs [94]. In this case, they could be protected from NMD-mediated degradation and eventually translated into peptides, giving rise to new putatively functional molecules [232]. Notably, CUTs and XUTs are conserved among yeast species [233], (Wery et al., unpublished).

In other eukaryotes, some highly unstable lncRNAs have been reported, for example, above mentioned PROMPTs and eRNAs which could be considered to be human analogues of CUTs, since they are highly stabilized upon RNA exosome depletion [108, 234]. The RNA exosome is proposed to play a role in resolving deleterious RNA/DNA hybrids (R-loops) arising from active enhancers to prevent recombination. So far, the existence of mammalian XUTs has not been reported; however, in humans, XRN1 was shown to be sequestrated by some RNA viruses [235, 236]. Their genomic RNA possesses a structured module in the 3′-UTR that traps and inhibits XRN1 catalytic activity. This action gives rise to the stabilization of the subgenomic flavivirus (sf)RNA which is important for the pathogenicity of the virus but could also result in a global stabilization of transcripts, including yet uncovered, highly unstable lncRNAs analogous to yeast XUTs.

1.6.5 Classification According to lncRNA Subcellular Localization or Origin

Knowing the subcellular localization of a particular lncRNA provides important insights into its biogenesis and function. LncRNAs could be exclusively cytosolic (DANCR and OIP5-AS1) or nuclear (NEAT1) or have a dual localization (HOTAIR) [128]. Several subgroups of lncRNAs with a precise subcellular localization have been defined, such as chromatin-enriched (che)RNAs [237] and chromatin-associated lncRNAs, CARs [238]. Many nuclear and chromatin functions have been proposed for such lncRNAs, including the assembly of subnuclear domains or RNP complexes, the guiding of chromatin modifications, and the activation or repression of protein activity [239]. GAA repeat-containing RNAs, GRC-RNAs, represent a subclass of nuclear lncRNAs that show focal localization in the mammalian interphase nucleus, where they are a part of the nuclear matrix. They have been suggested to play a role in the organization of the nucleus by assembling various nuclear matrix-associated proteins [240].

The mitochondrial genome is also transcribed into mitochondrial ncRNAs, ncmtRNAs [241,242,243]. Their biogenesis is dependent on nuclear-encoded mitochondrial processing proteins. After synthesis, some ncmtRNAs are exported from the mitochondria to the nucleus [244]. Importantly, expression of ncmtRNAs is altered in cancers promoting them as potential targets for cancer therapy [245, 246].

1.6.6 Classification According to lncRNA Function

To highlight a regulatory role, lncRNAs are often classified based on their function. Several archetypal activities of lncRNAs are used for classification: scaffolds, guides, decoys or ribo-repressors, ribo-activators, sponges, and precursors of small ncRNAs. Here we present examples of functional lncRNA classifications that regroup several lncRNAs into subclasses with a common operating mode.

LncRNA scaffolds function in the assembly of RNP complexes. The structural plasticity of lncRNAs allows them to adopt complex and dynamic three-dimensional structures with high affinity for proteins [247]. LncRNA scaffolds are often actors of epigenetic and transcriptional control of gene expression regulation. In this case, a lncRNA can act in trans or in cis in respect to its transcription site [248]. They are known to associate with a multitude of histone- or DNA-modifying and nucleosome remodeling complexes [249, 250]. LncRNA-mediated assembly of these complexes reshapes the epigenetic landscape and the organization of chromatin domains, thus allowing the modulation of all DNA-based processes including transcription, recombination, DNA repair, as well as RNA processing [159, 177, 251, 252]. HOTAIR is one example of a scaffold lncRNA which recognizes numerous targets. HOTAIR adopts a four-module secondary structure [144] which interacts in the nucleus with the PRC2 and Lsd1/REST/coREST complexes through its 5′ and 3′ modules, respectively [253]; it then targets them to specific genomic locations to affect histone modifications and gene silencing. In the cytoplasm, HOTAIR associates with the E3 ubiquitin ligases, Dzip3 and Mex3b, facilitating ubiquitination and proteolysis of their respective substrates, Ataxin-1 and Snurportin-1, in senescent cells [251].

Architectural lncRNAs (arcRNAs) represent a subclass of lncRNA scaffolds that are essential for the assembly of particular nuclear substructures [254]. Presently, five lncRNAs are classified as arcRNAs, and among them is NEAT1, which assembles more than 60 different RNA-binding proteins and transcription factors in paraspeckles [255]. ArcRNAs are highly enriched in repetitive sequences indicative of complex RNA folding that is essential for their scaffold function. They could be temporarily regulated by stress, during development, or in disease. ArcRNAs often sequester regulatory proteins, thereby changing gene expression. A detailed molecular role of scaffold and arcRNAs will be discussed in the forthcoming chapter.

Guide lncRNAs can recruit RNP complexes to specific chromatin loci. Remarkably, a guide function of one and the same lncRNA depends on the biological context (cell-/tissue-type, developmental stage, pathology) and often cannot be explained by a simple RNA/DNA sequence complementarity. For some lncRNA guides the formation of a triple helix structure between DNA and the lncRNA was experimentally proven, as in the case of Khps1 which anchors the CBP/p300 complex to the proto-oncogene SPHK1 [256]. Another example is MEG3 which guides the EZH2 subunit of PRC2 to TGFβ-regulated genes [257].

lncRNA decoys play the role of ribo-repressors for protein activities through the induction of allosteric modifications, the inhibition of catalytic activity, or by blocking the binding sites. One classical example of a ribo-repressor lncRNA is GAS5 (growth arrest-specific 5), which acts as a decoy for a glucocorticoid receptor (GR) by mimicking its genomic DNA glucocorticoid response element (GRE). The interaction of GAS5 with GR prevents it from binding to the GRE and ultimately represses GR-regulated genes, thus influencing many cellular functions including metabolism, cell survival, and response to apoptotic stimuli [258].

lncRNAs can also act as ribo-activators essential for or enhancing protein activities. One such example is the lnc-DC lncRNA which promotes the phosphorylation and activation of the STAT3 transcription factor [259]. Another subclass is the lncRNA transcriptional co-activators, also called activating ncRNAs (ncRNA-a), which possess enhancer-like properties [260]. They were shown to interact with and regulate the kinase activity of Mediator, hence facilitating chromatin looping and transcription [261]. In addition to Mediator-interacting RNAs, other lncRNAs are able to upregulate transcription and could also be considered as ncRNA-a. Among them is the steroid receptor RNA activator SRA which interacts with and enhances the function of the insulator protein CTCF [262], and NeST which binds to and stimulates the activity of a subunit of the histone H3 Lysine 4 methyltransferase complex [263].

Competing endogenous RNAs (ceRNAs), also known as lncRNA sponges, are represented by lncRNAs and circRNAs that share partial sequence similarity to PCG transcripts; they function by competing for miRNA binding and posttranscriptional control [264]. Pseudogene-derived lncRNAs represent an important source of ceRNAs as they are particularly enriched in miRNA response elements, as is the case of the already mentioned PTENP1 [265]. The subcellular balance between ceRNA, one or multiple miRNAs, and mRNA targets constitutes a complex network allowing a fine-tuning of the regulation of gene expression during adaptation, stress response, and development [266, 267].

Many lncRNAs host small RNA genes and serve as precursor lncRNAs for shorter regulatory RNAs, in particular, those involved in the RNAi pathway (mi/si/piRNAs). Many lncRNAs were identified and functionally studied before their precursor function was known. Such is the case for H19, one of the first discovered lncRNA genes and which contains two conserved microRNAs, miR-675-3p and miR-675-5p. In undifferentiated cells, H19 acts as a ribo-activator interacting with and promoting the activity of the ssRNA-binding protein KSRP (K homology-type splicing regulatory protein) to prevent myogenic differentiation [268]. During development, and, in particular, during skeletal muscle differentiation, H19 is processed into miRNAs ensuring the posttranscriptional control of the anti-differentiation transcription factors Smad [269]. Some piRNA clusters were found to map to lncRNA genes, mostly in exonic but also in non-exonic regions enriched in mobile elements thereby constituting putative pi-lncRNA precursors [270]. Putative endo-siRNAs can be produced from inverted repeats within lncRNA genes or from any double-stranded lncRNA-RNA precursors originated from sense-antisense convergent transcription [271, 272]. Endo-siRNAs have been documented in many eukaryotes, including fly, nematode, and mouse. Overlapping and bidirectional transcription is an abundant and conserved phenomenon among eukaryotes [154, 218]. However, in mammals, processing of sense-antisense paired transcripts into siRNA and their functional relevance is still controversial and requires experimental evidence, specifically at the single cell level. LncRNA processing into small RNA molecules could depend on different cellular machineries such as RNase P- and RNase Z-mediated cleavage of the small cytoplasmic mascRNA from MALAT1 [110] or Drosha-DGCR8-driven termination and 3′ end formation for lnc-pri-miRNAs [111]. The possible coexistence of two operational modes combining a long, precursor lncRNA and a derived small RNA adds additional complexity in lncRNA-mediated regulatory circuits.

1.6.7 Classification According to lncRNA Association with Specific Biological Processes

Examination of the non-coding transcriptome in different biological contexts has resulted in the discovery of lncRNAs specifically associated with particular biological states or pathologies. LncRNAs differentially expressed during replicative senescence represent senescence-associated lncRNAs, or SAL [273]. One such example, SALNR, is able to delay oncogene-induced senescence by its interaction with and inhibition of the NF90 posttranscriptional repressor [274]. Hypoxia, one of the classic features of the tumor microenvironment, induces the expression of many lncRNAs, in particular those from UCRs, named HINCUTs [197, 275]. Oxidative stress induces the production of stress-induced lncRNAs, si-lncRNAs, that accumulate at polysomes in contrast to mRNAs, which are depleted [138]. Deep sequencing transcriptome analysis of mammalian stem cells identified non-annotated stem transcripts, or NASTs, that appear to be important for maintaining pluripotency [276]. Finally, with the progression of clinical and diagnostic studies, a growing number of specific disease-associated lncRNAs have been detected. An example is the prostate cancer-associated transcripts (PCATs), such as PCAT1, that were shown to have a role in cancer biology but also as potent prognostic markers [277].

1.6.8 Future Challenges in lncRNA Annotation and Classification

Presently, the discovery of a novel lncRNA is an everyday occurrence, and proper annotation and classification are a necessity. In addition to catchy nicknames, various classifications of lncRNAs that rely on certain properties of the transcript, its origin, or possible function are proposed in oral and written communications. However, and in the aim of universalization, a “gold standard” of annotation should be sought. Repositories such as RNAcentral and other consortia are working on the challenging task of integrating the unambiguous annotations of all transcripts and genes, including numerical identifiers in addition to unique transcript names such as “GENENAME”. Recently, John Mattick and John Rinn have proposed some rules for lncRNA annotation. In particular, it has been recommended to refer to intergenic lncRNAs as “LINC-X,” where X represents a number and to all intragenic lncRNAs as “GENENAME” corresponding to overlapping PCG annotations with a prefix “AS-” for antisense, “BI-” for bidirectional, “OT-” for overlapping sense, and “INT-” for intronic transcripts in order to provide them with a positional criterion [278]. Respecting this guideline, OT-SOX2-(1) would correspond to the first isoform of the SOX2-OT1 lncRNA overlapping in sense orientation the SOX2 gene, while HOTAIR should take the name of AS-HOXC11-(1) to designate the largest lncRNA antisense to the HOXC11 gene. However, the descriptive nickname of experimentally assigned lncRNAs should be preserved on condition of its uniqueness as a gene name. To avoid confusion, the renaming of transcripts should be accurately marked in all lncRNA repositories.

Identification, annotation, and classification are the first steps toward unraveling lncRNA biology. This work is still in its early days and requires novel thinking and methodologies, in parallel with the development of new and more accurate technologies and improved tools for the discovery and assembly of transcripts. In particular, the twenty-first century has been marked by the emergence of new technologies in regard to genomics and integrative system biology. These new approaches will allow researchers to build a comprehensive framework of regulatory circuits embedding both coding and non-coding transcripts, thereby deciphering a bit more the puzzle of life biodiversity and complexity.