The cacao tree, Theobroma cacao (L.) (family Malvaceae), is an economically important tree crop that is cultivated in tropical regions and from which the beans are harvested to produce high-value confectionaries and other products [1]. A number of plant viruses cause diseases in cacao trees that limit crop production where cacao is grown, and recently, at least three have been identified in cacao germplasm collections [2,3,4,5,6,7]. The family Solemoviridae includes four genera of plant viruses: Enamovirus, Polemovirus, Polerovirus, and Sobemovirus [8]. Poleroviruses have non-enveloped icosahedral virions that are 26–30 nm in diameter and a polycistronic positive-sense single-stranded RNA (+ssRNA) genome of approximately 5.6–6.2 kilobases (kb) with 5’- and 3’-terminal noncoding regions and 7–10 open reading frames (ORFs). They are transmitted in a persistent, circulative manner by aphids (suborder Homoptera; order Hemiptera) [9] and are economically important pathogens of cereal grains, fruits, and vegetable crops. Infected plants exhibit symptoms that consist of leaf rolling, foliar chlorosis, and overall stunting [10]. To safeguard the transport of virus-free cacao germplasm and ensure that cacao seedlings planted on commercial cacao farms are virus-free, high-throughput sequencing has been used to screen germplasm and planting materials for plant viruses and thereby prevent their introduction. This has led to the characterization of numerous badnaviruses and, recently, to the detection of a previously uncharacterized polerovirus [3]. In this study, a previously uncharacterized polerovirus, which we have named "cacao leafroll virus" (CaLRV; family Solemoviridae; genus Polerovirus), was found to be associated with symptomatic cacao seedlings (n = 4) held in quarantine, making it only the second polerovirus known to infect cacao [11].
The virus was initially detected by discovery Illumina RNAseq and de novo assembly of the complete genome sequence, and it was identified as a polerovirus based on phylogenetic and pairwise distance analysis. The complete CaLRV genome sequence was obtained for two of the four isolates (MITC2028 and MITC2039) by rapid amplification of cDNA ends (RACE). The coding and non-coding regions of all four genome sequences were annotated based on previously published polerovirus genome sequences available in the NCBI GenBank database.

CaLRV genomic RNA was recovered from the cacao accessions CATIE-R4, CCN 51, CC 137, MX C67, and CATIE-R1, which were identified by genetic analysis as crosses between different genetic groups of cacao [12]. These plants were held in the USDA-ARS-SHRS, Miami, FL, quarantine greenhouse facility for mandatory observation prior to release. The plants exhibited foliar symptoms consisting of bluish-greenish discoloration, foliar malformation, and downward leaf rolling. Leaves with the petioles attached were collected from young cacao seedlings at the 12- to 14-leaf stage and shipped by courier to The University of Arizona, Tucson, AZ, under a USDA-APHIS permit (issued to J.K. Brown). Total RNA was isolated from leaf samples using a modified silica RNA isolation protocol [13] and submitted to Novogene Corp., Sacramento, CA, for Illumina RNA sequencing (RNAseq). Following ribosomal RNA depletion, cDNA libraries were constructed and sequenced (150-base paired-end reads) on an Illumina NovaSeq 6000 platform (Illumina, San Diego, CA). Adapter sequences and low-quality reads were removed using BBDuk (Department of Energy, Joint Genome Institute, Berkeley, CA), and the read quality was assessed using the FastQC quality control tool, which is available online at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Quality-filtered reads were assembled de novo using rnaviralSPAdes, implemented in SPAdes ver. 3.15.4 [14]. Virus-specific contigs were initially identified using BLASTn [15] to query a local database (UA Plant Sciences Diagnostic Lab) containing sequences downloaded from the NCBI GenBank Virus Refseq database, available at https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/, including the previously published partial CaLRV genome sequences [11]. Bioinformatic analysis was carried out on The University of Arizona High Performance Computing (HPC) cluster (UA Research Data Center, Tucson, AZ).
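The contig-screening step can be illustrated with a minimal sketch that parses BLASTn hits in the standard 12-column tabular format and keeps the best-scoring hit per contig. The e-value cutoff, the function names, and the example rows are hypothetical and were chosen for illustration only; they do not reproduce the settings or results of this study.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str       # contig identifier
    subject: str     # database accession
    identity: float  # percent identity
    evalue: float
    bitscore: float


def parse_tabular(lines):
    """Parse BLAST tabular output (the 12 standard columns of -outfmt 6)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[2]),
                        float(fields[10]), float(fields[11])))
    return hits


def best_hits(hits, max_evalue=1e-10):
    """Keep the highest-bitscore hit per contig among hits passing the cutoff.

    The default cutoff is an assumption for this sketch, not the study's value.
    """
    best = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue
        if hit.query not in best or hit.bitscore > best[hit.query].bitscore:
            best[hit.query] = hit
    return best


# Hypothetical rows for illustration only (not results from this study).
example = [
    "contig_1\tNC_010809\t75.0\t400\t100\t0\t1\t400\t1\t400\t1e-50\t200",
    "contig_1\tNC_030225\t72.0\t300\t84\t0\t1\t300\t1\t300\t1e-30\t120",
    "contig_2\tNC_035451\t71.0\t50\t14\t0\t1\t50\t1\t50\t0.5\t30",
]
result = best_hits(parse_tabular(example))
for contig, hit in result.items():
    print(contig, hit.subject, hit.identity, hit.evalue)
```

In this sketch, contigs whose only hits fail the e-value cutoff are dropped entirely, which mirrors the idea of retaining only virus-specific contigs for further analysis.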
Viral coding regions were predicted using NCBI ORF Finder (https://www.ncbi.nlm.nih.gov/orffinder/), implemented in Geneious Prime v2023.1.2 (Biomatters Inc., San Diego, CA). The resultant viral nucleotide (nt) and amino acid (aa) sequences were aligned in MUSCLE [16], available in Geneious Prime v2023.1.2. Pairwise distance analysis of the viral nt sequences and predicted aa sequences was carried out using Sequence Demarcation Tool (SDT) v1.2 [17]. Phylogenetic analysis of the aligned sequences was carried out using MrBayes software, version 3.2.7a [18], implemented on The University of Arizona HPC cluster. The phylogenetic tree was drawn using FigTree v1.4.4 (https://github.com/rambaut/figtree/releases) and edited in Adobe Illustrator 2019. The amino acid sequence protein family (pfam) matches were determined using the InterProScan tool [19] and InterPro protein signature databases. The location of the −1 ribosomal frameshift site in the ORF1-ORF2 overlap, which gives rise to the putative viral fusion protein, was predicted using KnotInFrame software (Turner model 2004) [20]. The 5’- and 3’-terminal ends of the genome were determined by rapid amplification of cDNA ends (RACE) as described previously [21]. Briefly, to determine the 5’-terminal sequence, first-strand complementary DNA (cDNA) was synthesized using a SuperScript IV reverse transcription (RT) system (Invitrogen, Carlsbad, CA) with the virus-specific antisense primer CaPol_1335R (5’-GGTCGCATCGAGCCATTCTA-3’). The cDNA was purified using a PureLink PCR Micro Kit (Invitrogen) and C-tailed using recombinant terminal deoxynucleotidyl transferase (rTdT) in 10X TdT reaction buffer, 2 mM deoxycytidine triphosphate (dCTP), and 2.5 mM cobalt(II) chloride (CoCl2) solution (New England Biolabs, Ipswich, MA). The 5’-terminal end was amplified from the C-tailed cDNA template by PCR.
The reaction mixture contained the sense primer P5-poly-G-F (5’-GGGGGGGGGGGGACAAAAGA-3’), in which the conserved polerovirus sequence [21] follows the poly-G tract, the virus-specific antisense primer CaPol_653R (5’-CGTCCACCATATCCATGCAAAC-3’), and JumpStart REDTaq ReadyMix (Sigma-Aldrich, St. Louis, MO). To determine the sequence of the 3’-terminal end of the genome, the total RNA was poly-A tailed using Escherichia coli poly(A) polymerase in 10X E. coli poly(A) polymerase reaction buffer containing 10 mM ATP (New England Biolabs). The poly-A-tailed RNA was reverse transcribed using the SuperScript IV RT system (Invitrogen) with the primer poly-T (5’-TCCCGGGTTTTTTTTTTTTTTTTTTTTTT-3’) [21]. The cDNA template was amplified by PCR using the virus-specific primer CaPol_5225F (5’-AGATGAAACTACCTCTGTGGCG-3’) and poly-T as the antisense primer [21]. The amplicons were separated by agarose gel (1.0%) electrophoresis in 1X TAE buffer, pH 8.0, and excised from the gel with a razor blade (VWR, Radnor, PA), followed by gel purification using a Wizard SV Gel and PCR Clean-up System (Promega, Madison, WI). Gel-purified amplicons were ligated into the pGEM-T Easy plasmid vector (Promega) and used to transform chemically competent E. coli strain DH5α cells. Bacterial colonies containing a plasmid bearing a fragment of the expected size were screened by colony PCR amplification as described previously [22]. The DNA sequence of each insert was determined by Sanger sequencing (Eurofins Genomics LLC, Louisville, KY).
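To make the orientation of the RACE primers easier to follow, the sketch below stores the primer sequences given above and derives, for each antisense primer, the genome-sense sequence to which it is complementary. The helper names and the GC-content calculation are illustrative additions and are not part of the published protocol.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Primer sequences (5'->3') as listed in the text.
PRIMERS = {
    "CaPol_1335R": "GGTCGCATCGAGCCATTCTA",     # antisense, 5' RACE first strand
    "CaPol_653R": "CGTCCACCATATCCATGCAAAC",    # antisense, 5' RACE PCR
    "P5-poly-G-F": "GGGGGGGGGGGGACAAAAGA",     # sense, anneals to the C-tail
    "CaPol_5225F": "AGATGAAACTACCTCTGTGGCG",   # sense, 3' RACE PCR
    "poly-T": "TCCCGGGTTTTTTTTTTTTTTTTTTTTTT",  # anneals to the added poly-A tail
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Return the percentage of G and C bases in a sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# Genome-sense sequence recognized by each antisense (R) primer.
sense_targets = {
    name: reverse_complement(seq)
    for name, seq in PRIMERS.items()
    if name.endswith("R")
}

# Portion of the poly-G primer that remains after the G tract is removed.
virus_specific_part = PRIMERS["P5-poly-G-F"].lstrip("G")

for name, target in sense_targets.items():
    print(name, "anneals to sense sequence", target)
print("P5-poly-G-F virus-specific part:", virus_specific_part)
print("CaPol_5225F GC content: %.1f%%" % gc_content(PRIMERS["CaPol_5225F"]))
```

The virus-specific portion of P5-poly-G-F begins with the "ACAAAA" sequence that is described in the genome annotation as the conserved polerovirus 5’-terminal sequence.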

The number of reads used for de novo assembly ranged from 75,350,628 (isolate MITC2007) to 84,531,254 (isolate MITC2028), yielding 131,803 to 166,226 contigs, respectively (Supplementary Table S1A). The sequence coverage ranged from 1,883 (isolate MITC2007) to 30,103 (isolate MITC2039). BLASTn analysis of the assembled sequences against the viral RefSeq database (https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/) identified contigs from the four samples, ranging in size from 5,973 bp to 6,256 bp, that matched one of five previously characterized poleroviruses: melon aphid-borne yellows virus (MABYV, NC_010809), pepo aphid-borne yellows virus (PABYV, NC_030225), potato leafroll virus (PLRV, NC_076505), pumpkin polerovirus (PuPV, NC_055513), or wheat leaf yellowing-associated virus (WLYaV, NC_035451), with 71.8–79.7% nucleotide sequence identity and e-values from 7.27e−31 (PABYV) to 1.13e−78 (Supplementary Table S1B).

Four CaLRV genome sequences were submitted to the GenBank database and assigned the accession numbers OR423047-50. The complete CaLRV genome sequences MITC2007, MITC2028, MITC2039, and MITC2064, which were assembled from Illumina RNAseq reads and/or RACE, were 5,994 to 5,997 nt in length. Each genome contained seven ORFs (ORFs 0, 1, 1–2, 3a, 3, 4, and 3–5), 5’ and 3’ untranslated regions (UTRs), and an intergenic region (IGR) and exhibited a genome organization like that of other well-characterized poleroviruses (Fig. 1). The 5’-UTR (nt 1–101) is 101 nt in length and contains a polerovirus-like conserved terminal “ACAAAA” sequence [8]. ORF0 (nt 102–791) is 690 nt in length and encodes a predicted protein (P0; 229 aa) with a predicted molecular weight (MW) of 26.7 kDa. The polerovirus P0 protein, which is believed to be a suppressor of host gene silencing, contains the F-box motif FPFLL-X14-P (positions 64–83, where Xn denotes n unspecified residues), which is conserved among the well-characterized poleroviruses [23, 24]. ORF1 (nt 229–2289) is 2,061 nt in length and encodes a predicted multifunctional polyprotein (P1) of 686 aa with a predicted MW of 76.3 kDa that is predicted to be involved in virus replication [25]. P1 contains a conserved protease motif with the predicted catalytic triad of a chymotrypsin-like serine protease (H-X31-D-X63-TAAGFSG, positions 276–378) [25,26,27] and includes the viral genome-linked protein (VPg), which is released following cleavage by the serine protease portion of P1. ORF2 (nt 1857–3566) is 1,710 nt in length and encodes an RNA-dependent RNA polymerase, RdRp (P2), of 569 aa, with an MW of 64.3 kDa. ORF2 overlaps the 3’-terminal region of ORF1 and is expressed together with P1 as a putative fusion protein, RdRp (P1-P2), through a −1 ribosomal frameshift at nt 1,863 during translation. This is mediated by a slippery heptanucleotide sequence (GGGAAAA) located at nt 1,857–1,863 within the ORF1-ORF2 overlapping region.
The RdRp contains the conserved ‘GDD’ aa core GSYNTSSSN-X19-GDD (positions 1220–1250), which is present in other viral RdRps [28]. The ORF1-ORF2 region (nt 229–3,566) is 3,338 nt in length and encodes a P1-P2 protein of 1,112 aa, with an estimated MW of 124.3 kDa. The viral ORF2 is followed by an intergenic region (IGR; nt 3567–3660) of 94 nt. The ORF3a sequence (nt 3661–3783) is 123 nt in length and begins with a non-canonical ‘CUG’ initiation codon [29]. It encodes the 40-aa P3a protein, which has an MW of 4.4 kDa. In PLRV, the P3a is involved in long-distance movement [30]. ORF3 (nt 3790–4395) is 606 nt long and encodes a coat protein (P3) that has 201 aa and an MW of 22.3 kDa. ORF4 (nt 3875–4279) is 405 nt long, is nested within ORF3, and encodes the viral cell-to-cell movement protein (P4) of 134 aa, with an MW of 15.2 kDa [28]. ORF5 (nt 4396–5817) is 1,422 nt long and encodes the readthrough domain (RTD), which is expressed by translational readthrough resulting from suppression of the amber stop codon of ORF3. This results in an ORF referred to as ORF 3–5 (nt 3790–5817), which encodes the P3-P5 fusion protein, or readthrough protein (RTP), consisting of 675 aa, with an MW of 76.1 kDa. This protein is essential for aphid-mediated polerovirus transmission [9]. The viral 3’-UTR (nt 5818–5997) is 180 nt in length. The 3’-proximal ORFs ORF3a, ORF3, ORF4, and ORF5 of poleroviruses are expressed from subgenomic mRNA (sgRNA) to synthesize the capsid protein (CP) and CP-RTD readthrough protein (RTP) and the P4 and P3a movement proteins (see above) [31].
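The coordinates and protein sizes reported above can be cross-checked with simple arithmetic. The sketch below assumes that each reported ORF range is 1-based, inclusive, and ends with a stop codon, and it models the −1 frameshift as nt 1,863 being read twice (once as the last base of an ORF1-frame codon and again as the first base of the next ORF2-frame codon). These assumptions and the helper names are ours; they are inferred from the numbers rather than stated in the text.

```python
# Coordinates (1-based, inclusive) and protein sizes as reported in the text.
REPORTED = {
    # name: (start nt, end nt, reported protein length in aa)
    "ORF0": (102, 791, 229),
    "ORF1": (229, 2289, 686),
    "ORF2": (1857, 3566, 569),
    "ORF3a": (3661, 3783, 40),
    "ORF3": (3790, 4395, 201),
    "ORF4": (3875, 4279, 134),
}


def nt_length(start: int, end: int) -> int:
    """Length of an inclusive coordinate range."""
    return end - start + 1


def aa_count(start: int, end: int) -> int:
    """Residues encoded by an ORF whose final codon is assumed to be a stop."""
    length = nt_length(start, end)
    if length % 3 != 0:
        raise ValueError("ORF length is not a multiple of three")
    return length // 3 - 1


def frameshift_fusion_aa(orf1_start: int, slip_nt: int, orf2_end: int) -> int:
    """Residues in a -1 frameshift fusion in which slip_nt is read twice."""
    upstream = nt_length(orf1_start, slip_nt)   # decoded in the ORF1 frame
    downstream = nt_length(slip_nt, orf2_end)   # slip_nt re-read in the ORF2 frame
    if upstream % 3 or downstream % 3:
        raise ValueError("frameshift position is inconsistent with the frames")
    return (upstream + downstream) // 3 - 1     # minus the terminal stop codon


computed = {name: aa_count(s, e) for name, (s, e, _) in REPORTED.items()}
fusion_aa = frameshift_fusion_aa(229, 1863, 3566)  # P1-P2
# Readthrough: the ORF3 amber codon is decoded as an amino acid, so ORF 3-5
# behaves like a single ORF ending at the ORF5 stop codon.
readthrough_aa = aa_count(3790, 5817)              # P3-P5

for name, (start, end, reported) in REPORTED.items():
    print(name, nt_length(start, end), "nt;", computed[name], "aa; reported", reported)
print("P1-P2:", fusion_aa, "aa; P3-P5:", readthrough_aa, "aa")
```

Under these assumptions, the computed values agree with the reported protein sizes for all six ORFs, for the P1-P2 fusion protein (1,112 aa), and for the P3-P5 readthrough protein (675 aa).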

Fig. 1

Genome organization of cacao leafroll virus (CaLRV) showing an illustration of the open reading frames. Numbers indicate the nucleotide positions of ORFs within the genome. Boxes show open reading frames (ORFs). The shaded triangle (▲) indicates the location of the slippery heptamer ‘GGGAAAA’ that is responsible for the −1 ribosomal frameshift (indicated by a slanting arrow) that results in the synthesis of the P1-P2 fusion protein. The unshaded triangle (∆) indicates the position of suppression of the amber stop codon of ORF3 that results in a translational readthrough from ORF3 to ORF5.

Pairwise comparisons of the four CaLRV genome sequences from this study with those of other well-characterized poleroviruses indicated that they shared the lowest and highest nt identity of 58.3% and 62.2% with beet chlorosis virus (NC_002766) and potato leafroll virus (NC_076505), respectively (Table 1).
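For readers unfamiliar with how percent identity is derived from an alignment, the following minimal sketch computes it for one pair of pre-aligned sequences as 100 × (1 − M/N), where M is the number of mismatches and N is the number of alignment columns in which neither sequence has a gap. This definition is our assumption about a common convention; it is not a reimplementation of SDT, and the example sequences are hypothetical.

```python
def pairwise_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identity between two aligned sequences, ignoring gapped columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # columns with a gap in either sequence are not counted
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * (1 - mismatches / compared)


# Hypothetical aligned fragments for illustration only.
aligned_a = "ACAAAAGA-TTC"
aligned_b = "ACAAAGGACTTC"
print(round(pairwise_identity(aligned_a, aligned_b), 1))
```

Excluding gapped columns means that insertions and deletions do not lower the score, so values computed this way reflect substitutions only.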

Table 1 Comparison of the percent nucleotide and amino acid sequence identity values for the complete genome and predicted proteins, respectively, of cacao leafroll virus (CaLRV) and selected members of the genus Polerovirus

The P0 proteins of the four CaLRV isolates from this study shared the least aa sequence similarity (17.9% identity) with wheat leaf yellowing-associated virus (YP_009407906) and the most (28.4% identity) with cowpea polerovirus 1 (YP_009352244) (Table 1). The CaLRV P1 protein shared the least aa sequence similarity (27.4% identity) with the P1 protein of pumpkin polerovirus 1 (YP_010087204) and the most similarity (36.8% identity) with that of beet mild yellowing virus (NP_612214). The CaLRV P1-P2 protein shared the least aa sequence similarity (39.4% identity) with that of suakwa aphid-borne yellows virus (YP_006666506) and the most (48.5% identity) with that of beet mild yellowing virus (NP_620479). The CaLRV P3 protein shared the least aa sequence similarity (36.7% identity) with that of strawberry polerovirus-1 (YP_009100306) and the most (57.3% identity) with that of melon aphid-borne yellows virus (YP_001949873). The CaLRV P4 protein shared the least aa sequence similarity (27.8% identity) with that of luffa aphid-borne yellows virus (YP_009162337) and the most aa sequence similarity (45.3% identity) with those of potato leafroll virus (NP_056750) and tobacco virus 2 (YP_009352891). The CaLRV P3-P5 protein shared the least aa sequence similarity (17.1% identity) with that of sugarcane yellow leaf virus (NP_050010) and the most (37.5% identity) with that of cereal yellow dwarf virus-RPV (NP_840025).

Based on the polerovirus species threshold of <75% nt sequence identity in the complete genome sequence [8] and >10% aa sequence divergence from any polerovirus protein [32], CaLRV should be considered a member of a new species in the genus Polerovirus. The P3 protein of CaLRV shares no more than 57% aa sequence identity with any of the P3 proteins of ICTV-recognized poleroviruses, while the most divergent CaLRV protein, P3-P5, shares no more than 17% aa sequence identity with any of the other polerovirus proteins.
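The demarcation argument reduces to two comparisons, which the sketch below makes explicit using the highest identity values reported in the text for the complete genome and for several proteins. The function name and the reading of the aa criterion as "every protein compared must exceed the divergence threshold" (the strictest interpretation) are our assumptions.

```python
NT_IDENTITY_THRESHOLD = 75.0    # percent; complete genome, as cited in the text
AA_DIVERGENCE_THRESHOLD = 10.0  # percent; per protein, as cited in the text


def exceeds_species_thresholds(max_nt_identity, max_aa_identity_by_protein):
    """Return (meets_criteria, divergence_by_protein).

    Divergence is taken as 100 minus the highest percent identity shared
    with any other polerovirus for that protein.
    """
    divergence = {
        protein: 100.0 - identity
        for protein, identity in max_aa_identity_by_protein.items()
    }
    nt_ok = max_nt_identity < NT_IDENTITY_THRESHOLD
    aa_ok = all(d > AA_DIVERGENCE_THRESHOLD for d in divergence.values())
    return nt_ok and aa_ok, divergence


# Highest identities reported in the text (percent).
calrv_max_nt = 62.2  # complete genome, versus potato leafroll virus
calrv_max_aa = {"P0": 28.4, "P1": 36.8, "P1-P2": 48.5, "P3": 57.3, "P4": 45.3}

is_distinct, divergence = exceeds_species_thresholds(calrv_max_nt, calrv_max_aa)
print(is_distinct)
for protein, value in divergence.items():
    print(protein, round(value, 1))
```

Even for P3, the most conserved of the proteins listed here, the divergence is more than 40%, which is well beyond the 10% threshold.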

Bayesian phylogenetic analysis based on the complete genome sequence of CaLRV and reference polerovirus genome sequences obtained from the GenBank database revealed that the closest relatives of CaLRV are poleroviruses belonging to clade I, one of three well-supported sister groups (90–100% posterior probability), which exclude the highly divergent strawberry polerovirus 1 (88%) (Fig. 2).

Fig. 2

Bayesian phylogenetic analysis of the genome sequence of cacao leafroll virus (CaLRV) and reference genome sequences of poleroviruses from the GenBank database. The tree was rooted using pea enation mosaic virus (PEMV, family Solemoviridae, genus Enamovirus). The evolutionary history was inferred using MrBayes 3.2.7a [18], with 50,000,000 generations (nruns = 2) and a burn-in of 2,500 generations. The evolutionary distances were computed using the general time-reversible (GTR) model with invariable sites. The rate variation among sites was modeled with a gamma distribution (GTR + I + G). Evolutionary analysis was carried out on the University of Arizona High Performance Computing (HPC) cluster (UA Research Data Center, Tucson, AZ). The phylogenetic tree was drawn using FigTree v1.4.4 and edited in Adobe Illustrator 2019.

The only known host of CaLRV is T. cacao (family Malvaceae). However, CaLRV grouped with poleroviruses that infect plants of various families, including the Apiaceae, Brassicaceae, Fabaceae, Poaceae, Solanaceae, and others. This is consistent with the observation that many poleroviruses infect a wide variety of plants and indicates that host range does not closely correlate with genus-level evolutionary relationships. This is the first report of the complete genome sequence of the newly discovered polerovirus CaLRV. Together with CaLRV, two other sequences are the first from poleroviruses associated with symptomatic cacao trees: a recently published partial genome sequence of cotton leafroll dwarf virus (CLRDV) recovered from commercial T. cacao trees in Brazil [3], and the nearly complete genome sequence of a polerovirus isolate from cacao, referred to as ‘cacao polerovirus’. The latter sequence is nearly identical to that of CaLRV [33] and was assembled from transcriptome libraries of several T. cacao accessions originating from the germplasm collection maintained at CATIE (Centro Agronómico Tropical de Investigación y Enseñanza, Costa Rica).