Introduction

In recent years, next-generation sequencing (NGS) has become a widely used method for addressing biological research questions at different omics levels, such as genomics, epigenomics, transcriptomics, metabolomics, etc. For transcriptomics research, RNA-Seq was a revolution (Mortazavi et al. 2008; Pan et al. 2008) because it can measure the transcription of 90% of the genomic DNA in eukaryotes, can be used for a variety of other analyses (copy number alterations, TWAS, neoantigen pre), and can reveal complex events (Thind et al. 2021). Most of the transcribed DNA gives rise to RNAs without any coding capacity, commonly known as non-coding RNAs (ncRNAs). Across the evolution of species, the approximate number of coding genes remains the same, while the number of non-coding sequences rises with organism complexity (Amaral and Mattick 2008). The fact that most ncRNAs are expressed at much lower levels than mRNAs suggests that ncRNAs primarily play a role in regulating gene expression (Geisler and Coller 2013). ncRNAs include ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), small nucleolar RNAs (snoRNAs), long ncRNAs (lncRNAs), small interfering RNAs (siRNAs), microRNAs (miRNAs), PIWI-interacting RNAs (piRNAs), circular RNAs (circRNAs), etc. Studies have shown the role of ncRNAs in the occurrence and regulation of normal physiological processes (Dinger et al. 2008), regulation of gene expression (Holoch and Moazed 2015), and human diseases (Fig. 1). Moreover, genetic and epigenetic defects in miRNAs or their machinery may cause many diseases (Wang et al. 2013a), and miRNAs also have applications in forensic science (Rocchi et al. 2020). To date, however, the role of ncRNAs has been studied mainly in cancer (Huarte 2015) and cardiovascular diseases (Fang et al. 2020). The abnormal expression of lncRNAs can contribute to the development and progression of cancer (Mercer et al. 2009; Hauptman and Glavač 2013).
Circular RNAs and piRNAs are also known to play a role in cardiovascular diseases (Altesha et al. 2019; Zeng et al. 2021). It is essential to recognize the full repertoire of ncRNAs in order to understand their regulatory functions in normal developmental processes and human diseases. The expression of ncRNAs varies among healthy cell types, as shown by various studies; for example, lncRNA expression can resolve different cell types in single-cell RNA-seq (scRNA-seq) data (Mortazavi et al. 2008). Cataloguing the full set of ncRNAs is difficult to achieve with conventional molecular biology techniques.

Fig. 1 Schematic representation of the biosynthesis of different types of non-coding RNAs and their applications. A miRNAs. B lncRNAs. C circRNAs

However, third-generation sequencing has advanced the discovery and characterization of ncRNAs and their roles. In the past few years, the number of RNA-seq datasets submitted to public databases has increased exponentially, and these datasets can be reused for many novel analyses using advanced bioinformatics methods. In this review, we discuss the resources available for ncRNA-seq data and the advanced tools used to analyze them. Furthermore, we discuss some key challenges that remain to be solved.

Types of non-coding RNAs

There is growing appreciation and understanding of the roles of non-coding regions in gene function and expression. Many non-coding RNA (ncRNA) molecules have been reported and assigned various roles depending on their length, function, mode of action, location of transcription, etc. Examples of these ncRNA molecules include microRNAs (miRNAs), small interfering RNAs (siRNAs), small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs), Piwi-interacting RNAs (piRNAs), long ncRNAs (lncRNAs), and circular RNAs (circRNAs).

Long non-coding RNA and circular RNA

Among all ncRNA molecules, lncRNAs are the most versatile and critical, and they are implicated in diverse gene regulatory processes (Mercer et al. 2009; Rinn and Chang 2012). In the broadest sense, lncRNAs are defined as a class of functional ncRNAs longer than 200 nucleotides (Ma et al. 2013). They play a significant role in regulating gene expression (at both the transcriptional and post-transcriptional levels), chromatin remodeling, protein localization, etc. (Jarroux et al. 2017). They modulate target gene expression at the transcriptional level (via RNA–DNA hybrids), at the post-transcriptional level, at the epigenetic level by interacting with chromatin complexes, and at the level of 3D genome structure. lncRNAs were initially grouped by length, as proposed by Amaral et al. in lncRNAdb (Amaral et al. 2011). Later, they were classified based on their position, their direction of transcription (relative to protein-coding genes), and their shape, into classes such as long intergenic non-coding RNAs (lincRNAs), transcribed from regions between genes; long intronic non-coding RNAs, transcribed from intronic regions; telomere-associated ncRNAs (TERRAs), transcribed from telomeric regions; transcribed ultraconserved regions (T-UCRs); antisense lncRNAs; and circular RNAs (circRNAs).

CircRNAs are the most recently identified form of lncRNAs and are formed by back-splicing. They can be classified as intronic, single-exon, multi-exon, or exon-intron circRNAs, depending on which parts of the gene are retained in the closed circle. circRNAs are usually cell-type specific and more stable than linear RNAs because exonucleases cannot easily degrade their closed circular structure. They are excellent competing endogenous RNAs (ceRNAs), acting as miRNA sponges that compete with the host cell’s miRNAome and thereby allow gene expression. A major proportion of lncRNAs reside in intergenic regions of the genome and are therefore termed long intergenic ncRNAs (lincRNAs). lincRNAs target their genes through long-range chromatin looping; they do not need cohesin proteins to close the 3D chromatin loops. Such findings point to an exciting function of lincRNAs as linkers in the maintenance of 3D genome structure. Alterations in lncRNA expression can promote disease states such as tumor formation, progression, and metastasis. Increased knowledge of the molecular mechanisms of lncRNAs could therefore provide novel therapeutic targets for treating diseases such as autoimmune disorders, cancers, and viral infections.

Small non-coding RNA analysis

In general, small RNA-seq applications focus on identifying and characterizing two types of small RNAs, i.e., microRNAs and Piwi-interacting RNAs. A complementary DNA (cDNA) library construction protocol is widely used for small non-coding RNA-seq, but it introduces bias into the sequencing results, partially due to RNA modifications. These modifications interfere with adapter ligation and reverse transcription and prevent the detection of the small ncRNAs (sncRNAs) that carry them (Shi et al. 2021).

miRNA

MicroRNAs (miRNAs) are small non-coding RNAs of 20–22 nt. They act as post-transcriptional regulators by interacting with mRNA, lncRNA, and circular RNA molecules. miRNAs also serve as biomarkers in various diseases, including cancer. NGS-based miRNA analysis has evolved very quickly. For example, aligning short miRNA reads (20–22 nt) to the reference was initially challenging, but much work has been done to address this issue (Ziemann et al. 2016). Currently, many advanced command-line and user-friendly pipelines are available for miRNA analysis (Aparicio-Puerta et al. 2019; Riffo-Campos et al. 2016; Chen et al. 2019a; Farrell 2017; Liu et al. 2021). These pipelines can identify novel miRNAs, predict miRNA structure, and perform expression quantification and differential expression analysis (Bortolomeazzi et al. 2019). Each of these steps has its own importance; e.g., the secondary structure of a miRNA can have a conformational role in modulating miRNA-mRNA interactions, and expression quantification is a prerequisite for differential expression analysis.
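To illustrate why quantification precedes differential expression, the following minimal sketch normalizes raw miRNA read counts to counts per million (CPM) and computes a log2 fold change between two libraries. The miRNA names and counts are invented for illustration, and the pseudocount of 1 is an assumption; the pipelines cited above rely on dedicated statistical models with biological replicates rather than on this simple ratio.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that the library sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library contains no reads")
    return {name: c * 1_000_000 / total for name, c in counts.items()}


def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2(B / A) per miRNA; the pseudocount avoids division by zero."""
    names = set(cpm_a) | set(cpm_b)
    return {
        n: math.log2((cpm_b.get(n, 0.0) + pseudocount)
                     / (cpm_a.get(n, 0.0) + pseudocount))
        for n in names
    }


# Hypothetical raw counts for two libraries (not real data).
control = {"miR-A": 500, "miR-B": 300, "miR-C": 200}
treated = {"miR-A": 250, "miR-B": 300, "miR-C": 450}

cpm_control = counts_per_million(control)
cpm_treated = counts_per_million(treated)
lfc = log2_fold_change(cpm_control, cpm_treated)
```

Normalizing first matters because libraries differ in depth; comparing raw counts would confound sequencing depth with expression change.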

PIWI-RNA

PIWI-interacting RNAs (piRNAs) are a recently discovered class of small ncRNAs of ~ 19–33 nt in length that are implicated in the regulation of transposable elements (TEs) in germline cells (Cox et al. 1998) and are among the most abundant small ncRNAs in these cells (Wang et al. 2019). Eukaryotic genomes encode millions of copies of selfish DNA elements such as repetitive sequences, transposons, SINEs, LINEs, etc. (Ünsal and Morgan 1995; Ernst et al. 2017; Jurka 2000; Frith et al. 2005). The movement of TEs plays an important role in genomic evolution by creating novel genes and diverse immune responses, viz. V(D)J recombination for MHC alleles. These movements need to be controlled because unrestricted movement threatens genomic integrity, which could cause deadly diseases such as cancers, autoimmunity, and genetic disorders (Castañeda et al. 2011). piRNAs are loaded onto germline Argonaute (AGO) proteins, termed PIWI proteins, and regulate TE expression (Castañeda et al. 2011). The PIWI proteins were first identified by their roles in maintaining (Cox et al. 2000) and patterning (Wilson et al. 1996).

New small RNA-seq technologies and advanced bioinformatics tools have expanded the known piRNA repertoire, which in turn has helped to improve prediction tools for novel piRNAs (Jensen et al. 2020). Datasets of experimentally known piRNAs are usually used to train classifiers that predict piRNA sequences from the genome. The basic features used for training are nucleotide usage, physicochemical properties, RNA secondary structure, etc. After training, the ability of these algorithms to predict novel piRNA sequences is assessed using statistical measures such as sensitivity, specificity, and the Matthews correlation coefficient (MCC). Many recently published piRNA tools and databases offer functions such as piRNA prediction, identification of novel piRNAs, discrimination of transposon-derived piRNAs from non-piRNAs, cataloguing of piRNAs in databases, detection of piRNA-mediated transposon silencing, and discrimination between siRNAs and piRNAs.
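As a minimal sketch of two ingredients named above, the code below derives a simple nucleotide-usage feature (k-mer frequencies) from a sequence and computes sensitivity, specificity, and MCC from a confusion matrix. The function names and the example values are hypothetical and do not reproduce any specific published piRNA predictor; real tools combine many more features.

```python
import math
from itertools import product


def kmer_frequencies(seq, k=2):
    """Relative frequency of each k-mer over the RNA alphabet ACGU."""
    seq = seq.upper().replace("T", "U")
    counts = {"".join(p): 0 for p in product("ACGU", repeat=k)}
    windows = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing ambiguous bases such as N
            counts[kmer] += 1
            windows += 1
    return {m: (c / windows if windows else 0.0) for m, c in counts.items()}


def classifier_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and Matthews correlation coefficient."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Hypothetical confusion matrix from a held-out test set (not real results).
metrics = classifier_metrics(tp=90, fp=10, tn=80, fn=20)
features = kmer_frequencies("UAGCUAGC", k=1)
```

MCC is often reported alongside sensitivity and specificity because it stays informative when positive and negative classes are unbalanced, which is typical when true piRNAs are rare among candidate sequences.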

Computational tools for identification, annotation, and analysis of ncRNAs

Recent transcriptomics methods and current computational resources are improving the classification, annotation, and analysis of ncRNAs, helping the scientific community to identify, annotate, store, predict, and analyze ncRNA data (Thind et al. 2022). Table 1 summarizes various computational tools according to non-coding RNA type and function.

Table 1 Bioinformatics tools used for the identification, prediction, and annotation of different types of non-coding RNAs

Future prospects and challenges of non-coding RNAs

Despite advances in RNA sequencing technology, several technical challenges in this field still need to be addressed. For instance, the expression of ncRNAs is generally restricted to a specific cell lineage, and ncRNAs are expressed at lower levels than other genes (Guttman et al. 2010; Iyer et al. 2015). Because of these low expression levels, exact quantification is very difficult to achieve, which affects differential expression studies (Everaert et al. 2017). Reliable quantification for differential expression studies requires higher sequencing coverage (e.g., 100–200 million reads from a total RNA-Seq library for deep whole-transcriptome analysis of human RNA-Seq data). Another challenge lies in dealing with natural antisense transcripts, which are widespread among lncRNAs (Pasmant et al. 2007; Beltran et al. 2008). Antisense transcripts of lncRNAs and miRNAs whose exons overlap a gene on the opposite strand are also difficult to count. To deal with such issues, various computational methods have been developed that correctly identify antisense transcription using the location and orientation of splice sites and poly(A) tails (Lorenzi et al. 2019).
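The counting problem for overlapping sense and antisense transcripts can be sketched as follows. Note that this example does not implement the splice-site or poly(A) based methods cited above; it instead assumes a strand-specific library in which each read already carries the strand of its transcript of origin, and shows how ignoring that strand makes overlapping reads ambiguous. The feature names, coordinates (0-based, half-open), and reads are all hypothetical.

```python
from collections import Counter, namedtuple

Feature = namedtuple("Feature", "name chrom start end strand")
Read = namedtuple("Read", "chrom start end strand")


def overlaps(read, feature):
    """True if the read and feature share at least one base (half-open)."""
    return (read.chrom == feature.chrom
            and read.start < feature.end
            and feature.start < read.end)


def count_reads(reads, features, stranded=True):
    """Assign each read to a unique overlapping feature, if there is one."""
    counts = Counter()
    for read in reads:
        hits = [f for f in features
                if overlaps(read, f)
                and (not stranded or f.strand == read.strand)]
        if len(hits) == 1:
            counts[hits[0].name] += 1
        elif hits:
            counts["ambiguous"] += 1
        else:
            counts["unassigned"] += 1
    return counts


# A hypothetical gene and an antisense lncRNA overlapping it on the other strand.
features = [Feature("GENE1", "chr1", 100, 500, "+"),
            Feature("GENE1-AS", "chr1", 300, 800, "-")]
reads = [Read("chr1", 350, 400, "+"),
         Read("chr1", 350, 400, "-"),
         Read("chr1", 600, 650, "-")]

stranded_counts = count_reads(reads, features, stranded=True)
unstranded_counts = count_reads(reads, features, stranded=False)
```

With strand information, both reads in the overlapping region are assigned; without it, they are lost as ambiguous, which lowers the apparent expression of an already low-abundance antisense transcript.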

Both lncRNAs and circRNAs have also been observed in extracellular vesicles (EVs) secreted by diseased cells. The RNA content of these vesicles generally acts as a biomarker for a particular disease (Mohankumar and Patel 2016; Hinger et al. 2018; Li et al. 2020). However, the amount of RNA present in EVs is very low, which again makes it challenging to quantify. Similarly, the detection of ncRNAs in a single cell is difficult because of their low abundance. To enable accurate prediction and quantification, detection methods such as (lnc)RNA capture sequencing have been developed. This approach uses biotinylated probes to capture the target (lnc)RNAs, improving coverage of low-abundance lncRNAs (Kato and Carninci 2020). In standard RNA-seq library preparations, circRNAs are depleted at the poly(A) enrichment step because they lack a poly(A) tail. However, they are retained in rRNA-depleted libraries and in libraries treated with RNase R, which degrades linear RNAs. RNase R treatment followed by RT-quantitative PCR (qPCR) is a popular experimental strategy for validating circRNAs obtained from rRNA-depleted samples, allowing targeted confirmation of true positives (Szabo and Salzman 2016).

Neither circRNAs nor lncRNAs have a standard naming convention. The naming of lncRNAs is mostly based on their functions, structures, and mechanisms of action (Gong et al. 2021). The same circRNA is called by different names in different circRNA databases. For instance, circBase names take into account the species and a numeric code, whereas others have proposed names based on genomic coordinates (e.g., chr10:126,970,702|127,127,764). The latter are also inconsistent, since reference genomes are updated periodically and newly developed databases may use hg38 or another assembly instead of hg19 or an older one. In addition, coordinate-based names can be influenced by zero- versus one-based indexing (chr10:126,970,701|12,712,776 could also be named as chr10:126,970,702|127,127,766). Based on genomic coordinates from UCSC resources, circBank and circAtlas use gene symbols to identify the transcriptional units that generate circRNAs; however, names may differ between these databases because the transcriptional unit defined for a particular gene is not consistent (e.g., hsa_circAEBP2_003 in circAtlas could be hsa_circAEBP2_001 or hsa_circAEBP2_002 in circBank).
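The effect of the indexing convention on a coordinate-based name can be sketched as below. The example assumes that the interval chr10:126,970,702|127,127,764 quoted above is 1-based and closed, which the text does not state; the function names are hypothetical. Converting to the 0-based, half-open convention (as used by BED files) shifts only the start by one, yet yields a different name for the same circRNA.

```python
def one_based_to_zero_based(start, end):
    """1-based closed [start, end] -> 0-based half-open [start - 1, end)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def zero_based_to_one_based(start, end):
    """0-based half-open [start, end) -> 1-based closed [start + 1, end]."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def locus_name(chrom, start, end):
    """Coordinate-based name in the chrom:start|end style used in the text."""
    return f"{chrom}:{start:,}|{end:,}"


one_based = (126970702, 127127764)  # assumed to be 1-based, closed
zero_based = one_based_to_zero_based(*one_based)

name_one_based = locus_name("chr10", *one_based)
name_zero_based = locus_name("chr10", *zero_based)
```

A coordinate-based identifier is therefore only unambiguous when the assembly and the indexing convention are recorded alongside it.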

With deep sequencing technologies, both known and novel miRNAs can now be detected at a large scale. Because most organisms do not have completely sequenced genomes, even mapping reads to a genome can be challenging. Several tools are available for miRNA profiling, some of which are listed in Table 1. However, these methods depend on databases of known miRNAs, so the accuracy of predicted novel miRNAs remains questionable. It has also been observed that many different sequences can be produced from a single miRNA locus. These variable-length short sequences (miRNA isomers) may have different 5′ and 3′ ends from those of the miRNA sequences stored in public databases. They may possess regulatory activity and need to be properly catalogued in databases, such as the existing YM500 (Cheng et al. 2013) and isomiRex (Sablok et al. 2013). For piRNAs, on the other hand, the limitation is the absence of a reliable and efficient method for their detection in tissues other than the germline. Owing to the lack of suitable databases, the detection and characterization of piRNAs in somatic cells remain difficult. Similar to miRNA isomers, identical piRNA sequences are produced from multiple loci, which adds to the complexity and lowers the precision of the generated data (Geles et al. 2021).
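The 5′ and 3′ end variation described above can be expressed as offsets of a read relative to the annotated miRNA. The sketch below is a simplified illustration, assuming that both the annotated miRNA and the read are given as (start, end) positions on the same precursor, in the same coordinate convention and in 5′ to 3′ orientation; the coordinates are invented, and internal sequence changes or non-templated additions are not considered.

```python
def isomir_offsets(canonical, read):
    """Return (5' offset, 3' offset) of a read relative to the annotation.

    A positive 5' offset means the read starts downstream (5' trimmed);
    a positive 3' offset means the read ends downstream (3' extended).
    """
    five_prime = read[0] - canonical[0]
    three_prime = read[1] - canonical[1]
    return five_prime, three_prime


def describe_isomir(canonical, read):
    """Human-readable label for the end variation of a read."""
    five, three = isomir_offsets(canonical, read)
    if five == 0 and three == 0:
        return "canonical"
    parts = []
    if five:
        parts.append(f"5' {'trimmed' if five > 0 else 'extended'} by {abs(five)} nt")
    if three:
        parts.append(f"3' {'extended' if three > 0 else 'trimmed'} by {abs(three)} nt")
    return ", ".join(parts)


# Hypothetical annotated miRNA and reads on one precursor (not real data).
canonical = (10, 31)
reads = [(10, 31), (11, 31), (10, 33), (9, 30)]
labels = [describe_isomir(canonical, r) for r in reads]
```

Tools that only count reads matching the database sequence exactly would discard three of the four reads in this example, which is one reason isomiR-aware databases are useful.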

Single-cell and long-read technology for non-coding RNA sequencing

Single-cell RNA-seq (scRNA-seq) is a recent and transformative technology. With its help, the role of non-coding RNAs in cell specificity (Gawronski and Kim 2017), embryonic development (Fu 2018), and cell reprogramming (Luginbühl et al. 2017) has been revealed. It is used to answer questions that bulk RNA sequencing cannot; for instance, it enables gene expression analysis of an individual cell within a group of cells. Non-coding RNAs (ncRNAs) play an important role in cell differentiation by changing the overall genomic program in a small subset of cells. They are also expressed at low levels, expressed transiently, or expressed in association with transcription events involved in regulatory processes. Therefore, they cannot be easily detected by bulk RNA-seq analysis, and single-cell transcriptome sequencing is required to evaluate their role in a particular cell type. Traditional approaches for sequencing small RNAs required a large amount of cellular material, which limited the possibilities for single-cell analysis.

Recently, various single-cell protocols specific to non-coding RNAs have been developed. CAS-seq and Small-seq are single-cell small RNA sequencing methods that enable the capture, sequencing, and molecular counting of small RNAs (Yang et al. 2019; Hagemann-Jensen et al. 2018). Small-seq is a ligation-based approach. Not only sequencing protocols but also tools specific to single-cell data are evolving; e.g., the miReact software uses motif enrichment analysis to infer miRNA activity from single-cell mRNA-seq data (Nielsen and Pedersen 2021). With the availability of long-read sequencing technologies, current annotations are improving, and large-scale initiatives are under way to complete the human lncRNA transcriptome map (Uszczynska-Ratajczak et al. 2018). lncRNAs are probably the class of transcripts whose annotation stands to benefit most from long-read sequencing technology. Compared with protein-coding genes, lncRNA annotations are poorly characterized because of trade-offs between quality and size, with often unappreciated consequences for downstream studies. Furthermore, the impact of short- and long-read sequencing on the identification of lncRNAs in humans and plants has been documented (Chiquitto et al. 2022), with a significant improvement in human lncRNA annotation observed using tools such as CPAT (Wang et al. 2013b), RNAmining (Ramos et al. 2021), lncRNAnet (Baek et al. 2018), and LncADeep (Yang et al. 2021). scRNA-seq has also been applied to identify the role of non-coding RNAs in gene regulatory networks (Zhao et al. 2022), in addition to cell specificity (Gawronski and Kim 2017), embryonic development (Fu 2018), and cell reprogramming (Luginbühl et al. 2017).

scRNA-seq is a powerful tool for studying the expression and regulation of cell-specific ncRNAs. However, current single-cell sequencing methods are not yet well optimized, so many limitations remain. For example, scRNA-seq generally starts from a very small amount of material, which causes low capture efficiency and high dropout rates, so that only a minority of the expressed genes is detected (Hwang et al. 2018). Because ncRNAs are expressed at low levels, dropout events may have a pronounced effect on their analysis. scRNA-seq also produces noisier and more complex data than bulk RNA-seq, which makes computational analysis difficult. Batch effects caused by slight variations in sample preparation are common. In addition, biological variation due to cell state, size, cell cycle stage, etc. affects transcriptomic analysis. To minimize both technical and biological errors, repeated analysis of multiple cells is required. To address this, the scLVM approach was recently developed to minimize errors caused by latent variables (Chen et al. 2019b). For gene regulatory network analysis, many different tools are available, such as SCODE (Matsumoto et al. 2017), SCGRNs (Turki and Taguchi 2020), and scGNN (Wang et al. 2021), but none of these has been tested for mapping ncRNA gene regulatory networks. In this regard, multi-omics data integration may be helpful, as it cross-validates regulatory interactions across multiple datasets (Hu et al. 2020). Given that ncRNAs are emerging players in cell differentiation, interactions, and reprogramming and are less explored than single-cell mRNA and bulk RNA, their investigation in specific cell types should provide a new outlook in the near future.
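The interaction between low expression and dropout can be made concrete with a small sketch that computes, for each gene, the fraction of cells with zero counts. The count matrix below is invented for illustration (six cells, one highly expressed mRNA, one lowly expressed lncRNA), and the function name is hypothetical; an observed zero may reflect either a technical dropout or true absence of expression, which this simple measure cannot distinguish.

```python
def zero_fraction(counts_per_cell):
    """Fraction of cells in which a gene has zero observed counts."""
    if not counts_per_cell:
        raise ValueError("no cells provided")
    zeros = sum(1 for c in counts_per_cell if c == 0)
    return zeros / len(counts_per_cell)


# Hypothetical gene-by-cell count matrix (not real data).
matrix = {
    "mRNA_high": [12, 8, 15, 9, 11, 7],
    "lncRNA_low": [0, 1, 0, 0, 2, 0],
}

rates = {gene: zero_fraction(cells) for gene, cells in matrix.items()}
```

In this toy example the lowly expressed lncRNA is undetected in four of six cells, so any downstream statistic for it rests on far fewer informative observations than for the mRNA.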