Possible new members in the Tymovirales order
Initial similarity searches indicated that the new sequences probably belonged to members of the Tymovirales order. To elucidate their placement within this order, a dataset containing members of the order was used to construct the phylogeny of the group. This dataset combines reference genomes from the Alphaflexiviridae, Betaflexiviridae, Deltaflexiviridae, and Tymoviridae families with sequences obtained through an NCBI database search [44]. Additionally, a broader BLAST similarity search was performed against a database containing all the proteins deposited in the NCBI Virus database [45]. After filtering out hits that generated short alignments, the genomes corresponding to the remaining hits were also added to the dataset. The polyprotein amino acid sequences were then extracted, and those containing very short or truncated ORFs were excluded from the analysis. The resulting phylogenetic tree is shown in Fig. 2. The highlighted clades indicate where the mined viral sequences clustered.
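The alignment-length filter applied to the BLAST hits can be sketched as follows. This is a minimal illustration, not the procedure used in this work: it assumes the hits are available in BLAST tabular format (-outfmt 6, default column order), and the function name filter_hits and the 500-residue cut-off are hypothetical choices made only for the example.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(blast6_text, min_alignment_length=500):
    """Return subject IDs of hits whose alignment is long enough.

    min_alignment_length is a hypothetical cut-off used for illustration;
    the text does not state the value that was applied.
    """
    reader = csv.DictReader(
        io.StringIO(blast6_text), fieldnames=BLAST6_COLUMNS, delimiter="\t"
    )
    kept = []
    for row in reader:
        long_enough = int(row["length"]) >= min_alignment_length
        if long_enough and row["sseqid"] not in kept:
            kept.append(row["sseqid"])  # keep each subject once, in input order
    return kept


# Two made-up hits: one long alignment and one short alignment.
example = (
    "query1\tsubjA\t45.2\t1200\t600\t10\t1\t1200\t5\t1190\t1e-150\t800\n"
    "query1\tsubjB\t60.0\t80\t30\t1\t10\t90\t3\t82\t1e-5\t60\n"
)
print(filter_hits(example))
```

With the illustrative cut-off, only the subject with the long alignment is retained; the genomes corresponding to the retained subjects would then be added to the dataset.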
The phylogeny of the members of the Tymovirales order corroborates the family classification currently accepted for the well-known plant-infecting viruses such as the Alphaflexiviridae, Betaflexiviridae, and Tymoviridae, as they form monophyletic clades (shown in green, orange, and grey in Fig. 2, respectively) with bootstrap values above 90. Additionally, the phylogenetic tree corroborates the subfamily structure within the Betaflexiviridae family, with the two subfamilies Quinvirinae and Trivirinae forming well-supported clades (shown in orange in Fig. 2). Another well-defined clade, shown in blue in Fig. 2, with high statistical support (bootstrap above 90), included members that have been suggested to constitute a new family called Emraviridae [46]. Surprisingly, the clade that would form the Deltaflexiviridae family presented an unexpected structure: two large clades (shown in red and purple in Fig. 2), harboring most of the new sequences (100 sequences out of the 111), displayed considerable divergence from each other, despite stemming from a common ancestor and forming a monophyletic clade. The evident separation between these two clades, as indicated by the length of the branch marked in purple, suggests that the Deltaflexiviridae can be divided into two families, proposed in this work as Deltaflexiviridae and Thetaflexiviridae. Two other sequences described in this work formed a separate clade (shown in pink in Fig. 2). Given their high divergence from the rest of clade B, these two sequences should also be classified as a new family, proposed in this work as Epsilonflexiviridae.
Two novel sequences clustered with Alphaflexiviridae (marked in green in Fig. 2), and one of them, hereby named East River virus 2, diverged from the common ancestor of the whole family. A second phylogenetic tree, containing a subset of the sequences in Fig. 3, created by sampling three members of each Alphaflexiviridae genus and using the Tymoviridae member turnip yellow mosaic virus as the outgroup, shows that East River virus 2 probably belongs to this family, despite sharing only the polyprotein ORF with the other members of the family (Fig. 3B). The phylogenetic tree also places the other novel virus, East River virus 1, among the members of the Allexivirus genus (Fig. 3A). This grouping is well supported by the high Fast Bootstrap value and by genomic composition (Fig. 3B), since East River virus 1 shares genes with the other members of the genus. Interestingly, this virus also has a coat protein gene ORF that overlaps with the 40 kDa protein gene ORF and is longer than its counterparts.
The Betaflexiviridae family contained four new sequences, which were highly similar to published sequences (Fig. 4). Repeating the approach used for Alphaflexiviridae, three reference sequences were selected from each genus to place the new putative viruses. The high similarity indicates that these viruses are isolates of existing species belonging to the genus Carlavirus. Three isolates originated from samples taken from Gansu. They belonged to potato virus S (closest BLAST hit: AAP76207, 98.22% identity), potato virus H (closest BLAST hit: AEI55831, 96.1% identity), and potato virus M (closest BLAST hit: YP_277428, 94.68% identity). A fourth isolate, identified as a member of potato virus M (closest BLAST hit: UTQ50775, 98.32% identity), originated from a sample of wintersweet, Chimonanthus praecox (Fig. 4).
Four of the new genomes clustered with the isolated group, suggesting they can be classified into the new family Emraviridae (shown in blue in Fig. 2). A phylogenetic tree (Fig. 5A) with fewer sequences and, thus, a higher resolution indicates that these four genomes (shown in bold and italic in Fig. 5) belong to new species within this family. The highest identity to a BLAST hit among the new sequences is 43.4% (AQM32763), thus corroborating the classification of the novel putative genomes as new species.
As this family was recently proposed, it is yet to be recognized by the ICTV; consequently, there is still no genus-level classification. Nevertheless, there are some indications of a possible genus classification. A combination of orthologous search clustering and phylogenetic analysis revealed the formation of five well-defined groups, corroborated by both genome organization clustering and tree structure. Three out of the four new viruses clustered with clade V, which is also characterized by the presence of HP1 (shown in green in Fig. 5B). The remaining genome clustered within clade I, which is corroborated by the presence of HP3, even though not all members of the clade contain this gene.
Deltaflexiviridae family split
Interestingly, most of the sequences described in this work were grouped with known members classified within the family Deltaflexiviridae. However, the group exhibits a remarkable topology, characterized by three distinct and clearly defined clades (Fig. 2). Additionally, the published sequences classified as members of the family did not form a monophyletic clade. For example, Sclerotinia sclerotiorum deltaflexivirus 1 (Fig. 6), the type species for the Deltaflexiviridae family, and Fusarium deltaflexivirus 2 (Fig. 7) are categorized within the same family, yet they cluster in distinct clades.
The high divergence between the clades suggests that these three clades should be classified into different families. Thus, we propose the creation of two families to represent the evolutionary history of this group more accurately, as estimated by the phylogenetic analysis, namely Epsilonflexiviridae (shown in pink in Fig. 6) and Thetaflexiviridae (Figs. 7 and 8). Despite this phylogeny showing some incongruences with the tree shown in Fig. 2, such as the placement of the Epsilonflexiviridae, the divergence is still consistent, which supports our proposed classification.
Following our proposed classification, the Deltaflexiviridae family will then encompass 18 out of the 111 new sequences. The new Deltaflexiviridae phylogenetic tree is presented in Fig. 6, with the new genomes shown in bold and italics. Interestingly, the genome content is quite diverse within this group. The polyprotein is the sole shared element among all viruses, and seven clusters of orthologs are identified within the group.
Pairwise amino acid identity comparisons revealed that 13 out of the 18 genomes classified into the Deltaflexiviridae family belong to new species (marked with circles in Fig. 6, Supplementary Table 2). However, under the more conservative criterion, which considers only genomes mined from samples originating from a single individual, 7 of the 18 putative viruses should be classified as new species (marked with black circles in Fig. 6). Four novel species clustered together: three mined from single-plant conifer samples (Picea glauca deltaflexivirus, Cupressus duclouxiana deltaflexivirus, and Cupressus gigantea deltaflexivirus, sharing 55.16%, 55.53%, and 55.45% identity with QYF50206, respectively) [47, 48], and one mined from a single lesion caused by the plant pathogen Puccinia striiformis in a wheat leaf (Puccinia striiformis deltaflexivirus, 55.43% identity with QYF50206) [49]. Another cluster containing four new genomes harbored a putative virus, Picea deltaflexivirus, isolated from a single spruce individual (58.01% identity to UYL94495). It clustered with two new genomes mined from a sample of the moss Dicranum scoparium, which could be classified as isolates of a new species but did not come from a sample of a single individual [50]. Finally, a genome mined from a single-plant sample of the conifer Abies sachalinensis (Abies sachalinensis deltaflexivirus, 57.7% identity to QYF50206) clustered with a genome mined from a soil sample from Antarctica.
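The two-tier classification described above (an identity criterion, followed by the more conservative single-individual criterion) can be expressed as a small decision function. The sketch below is illustrative only: it assumes the 90% amino acid identity threshold that this work states for the Thetaflexiviridae, treats exactly 90% as a new species (an assumption), and uses hypothetical names (Genome, classify). Apart from the Picea glauca deltaflexivirus value taken from the text, the example genomes are made up.

```python
from dataclasses import dataclass

# Assumed species threshold (percent amino acid identity).
SPECIES_IDENTITY_THRESHOLD = 90.0


@dataclass
class Genome:
    name: str
    best_hit_identity: float  # % identity to the closest known sequence
    single_individual: bool   # True if the sample came from one individual


def classify(genome, threshold=SPECIES_IDENTITY_THRESHOLD):
    """Apply the identity criterion, then the single-individual criterion."""
    if genome.best_hit_identity > threshold:
        return "known species"
    if genome.single_individual:
        return "new species (single-individual sample)"
    return "new species (identity criterion only)"


genomes = [
    Genome("Picea glauca deltaflexivirus", 55.16, True),  # values from the text
    Genome("hypothetical pooled-sample genome", 60.0, False),  # made up
    Genome("hypothetical known isolate", 98.0, True),  # made up
]
for g in genomes:
    print(f"{g.name}: {classify(g)}")
```

Counting the genomes in the last two categories together corresponds to the looser estimate, while counting only the single-individual category corresponds to the conservative one.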
In addition to the putative novel species, viruses that belonged to already described species were also detected by our methods. Among those, three genomes were classified as members of the species Erysiphe necator associated deltaflexivirus 4 and Sclerotinia sclerotiorum deltaflexivirus 1, viruses associated with plant-infecting fungi. The isolate originating from corn lacks a part of the genome, indicating that it is incomplete. The remaining new genome within this cluster was classified as an isolate of the already described Leptosphaeria biglobosa deltaflexivirus 2 (Fig. 6).
The family Epsilonflexiviridae (depicted in pink in Figs. 2 and 6) comprises two members. The amino acid identity between Chimonanthus praecox epsilonflexivirus and the closest sequenced entry in GenBank (QDW81317) is 34.33%. The sample from which this putative virus was derived came from a single wintersweet plant [47], supporting its classification as a new species. In contrast, information regarding the sample of the other family member, Macrocybe gigantea epsilonflexivirus, isolated from an edible mushroom, could not be located. Therefore, although it shares only 33.83% identity with the closest deposited sequence (QDW81317), the classification of this putative virus as a novel species should be approached with caution.
Meanwhile, the family Thetaflexiviridae (shown in purple in Fig. 2) harbored 81 of the 111 new sequences. The putative novel viruses are presented in Figs. 7 and 8 with a higher-resolution phylogeny, marked in bold and italics. To facilitate visualization, portions of the tree were collapsed; specifically, the highlighted section in Fig. 7 is collapsed in Fig. 8 and vice versa.
Using a threshold of 90% amino acid identity and considering the genomic content, 18 novel genomes belong to species already described, and 67 novel genomes probably belong to new species. Taking into account only genomes mined from samples in which a single individual was sequenced, this number drops to 34. Since some of the new genomes assigned as probable new species were closely related to each other (above 90% identity), only 55 new species were predicted to exist under the identity criterion within the Thetaflexiviridae family. For instance, a putative virus named Crawfurdia deltaflexivirus (55.62% identity with QKN22686), mined from a sequenced museum accession [51], showed high identity with another virus mined from leaves of a single individual of the Chinese cypress, Cupressus duclouxiana [48]. Finally, 25 new species fit the single-individual sample criterion within the Thetaflexiviridae family.
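The step from the number of novel genomes to the number of predicted new species amounts to grouping genomes that exceed the identity threshold with one another and counting the groups. A minimal sketch of that idea is given below. It assumes single-linkage grouping (any pair above 90% identity joins two groups), which is an assumption and not necessarily the procedure used in this work; the function name count_species and the example identities are hypothetical.

```python
def count_species(names, pairwise_identity, threshold=90.0):
    """Count groups of genomes linked by identity above the threshold.

    pairwise_identity maps (name_a, name_b) tuples to percent identity.
    Single-linkage grouping is assumed for illustration.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative, compressing paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), identity in pairwise_identity.items():
        if identity > threshold:
            parent[find(a)] = find(b)  # merge the two groups

    return len({find(name) for name in names})


# Made-up example: g1, g2 and g3 are chained above 90%; g4 stands alone.
names = ["g1", "g2", "g3", "g4"]
identities = {
    ("g1", "g2"): 95.0,
    ("g2", "g3"): 92.0,
    ("g1", "g3"): 88.0,
    ("g3", "g4"): 50.0,
}
print(count_species(names, identities))
```

Under single linkage, g1 and g3 end up in the same group through g2 even though their direct identity is below the threshold; a stricter rule (for example, complete linkage) would give a different count, so the choice of rule matters when genomes sit near the threshold.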
A notable large clade contains genomes with a single ORF, the polyprotein (see highlighted section in Fig. 8). Interestingly, a putative new virus, Brassica napus thetaflexivirus, is placed within this group, but contains HP3, RNAse III, HP7, and two more genes with no detected orthologs. It is unclear whether these distinct additional ORFs were acquired independently or if they were present in the ancestor, as this virus was not positioned as diverging from the ancestor of the entire single ORF group.
This clade also harbors 41 of the 81 new sequences characterized in this work. All of them could be classified as new species under the similarity criterion. However, taking into account only viruses mined from single-individual samples, only 30 should be considered to belong to a new species. There is a clade composed only of new species, comprising, for instance, the putative virus Tetracentron sinense thetaflexivirus 2, mined from a sequencing project that sampled leaves from a single individual [52], and Picea wilsonii thetaflexivirus 2, mined from sequencing data of a single individual of the conifer Picea wilsonii [50] (Fig. 8). This highlights the limited understanding of the diversity within this group.
The orthology detection performed by OrthoMCL [35] resulted in five new groups of genes in different viral genomes with a likely shared evolutionary history, hereby named hypothetical proteins 6 through 10 (HP6, HP7, HP8, HP9, and HP10). This method was also capable of reproducing the orthology relationships previously indicated; that is, orthologs of the hypothetical proteins HP2 to HP5 were also found. However, information about the function of these genes could only be found for HP2 via InterProScan searches [37]; HP2 includes an RNAse 3 domain commonly found in viruses infecting fungi.
Additionally, patterns of gene presence/absence were also found. For instance, the orthology clustering showed that genomes lacking HP7 (yellow in Fig. 6) contained the gene that showed similarity with the protein annotated as hypothetical protein 3 (HP3) (orange in Fig. 6), which could indicate that this cluster of genes and HP3 are orthologous.
The genomic content of the Thetaflexiviridae is more conserved than that observed in the family Deltaflexiviridae. Interestingly, genes annotated as the coat protein gene and HP3 are mutually exclusive. The coat protein gene is shared between the Deltaflexiviridae and Thetaflexiviridae; in the latter, it was found in two genomes sourced from Angelo Coast and Alaska (Fig. 7). Given that the presence of this gene is sparse even within the Deltaflexiviridae, further studies may be necessary to understand its evolutionary history.
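Mutual exclusivity of two genes, as noted for the coat protein gene and HP3, can be checked directly on a gene presence/absence table. The sketch below is a minimal illustration assuming that each genome is represented as a set of ortholog-cluster labels; the function names and the three example genomes are hypothetical and do not correspond to genomes in this work.

```python
def genomes_with_both(presence, gene_a, gene_b):
    """Return the genomes (sorted by name) that carry both genes."""
    return sorted(
        genome
        for genome, genes in presence.items()
        if gene_a in genes and gene_b in genes
    )


def mutually_exclusive(presence, gene_a, gene_b):
    """True if no genome in the table carries both genes."""
    return not genomes_with_both(presence, gene_a, gene_b)


# Made-up presence/absence table: genome name -> set of gene labels.
presence = {
    "genome_1": {"polyprotein", "CP"},
    "genome_2": {"polyprotein", "HP3"},
    "genome_3": {"polyprotein"},
}
print(mutually_exclusive(presence, "CP", "HP3"))
print(genomes_with_both(presence, "polyprotein", "HP3"))
```

Note that with a gene as sparse as the coat protein gene, the absence of co-occurrence is weak evidence on its own, since two rare genes are unlikely to co-occur by chance in any case.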
A set of 28 detected clusters contained only two genes in closely related viruses, hereby named clade-restricted clusters (dark grey squares in Figs. 6, 7, and 8). Notably, the HP5 gene was also present in only two genomes, the previously described Agave tequilana deltaflexivirus and a newly described putative virus, Spruce root thetaflexivirus. However, HP5 was not assigned as clade-restricted, being another example of a gene shared between the families Deltaflexiviridae and Thetaflexiviridae.
In summary, the four previously described hypothetical proteins (HP2 to HP5) were present in both families. Among the orthologs reported in this work, HP6 and HP9 were common to both families, HP8 was exclusive to Deltaflexiviridae, and HP7 and HP9 were only detected in Thetaflexiviridae.
Pipeline efficiency
The pipeline was executed for 202 accessions available in the Serratus platform. Despite the indications that all of them contained viral sequences, our pipeline failed to detect new genomes in more than half of the accessions. The reads from two accessions could not be retrieved. The pipeline yielded 193 sets of contigs that probably belong to viruses. Similarity searches revealed that 34 sets of contigs had been previously published; thus, they were discarded. Out of the 159 sets remaining, only 81 passed the manual curation step (Fig. 9A).
Out of the 81 SRA accessions that contain curated genomes, 56 originated from plant samples (Fig. 9B), with 41 of those containing new species (Fig. 9C). However, only 26 plant accessions that contain putative new species originated from single-individual samples (Fig. 9D). Notably, gymnosperms were overrepresented among the samples containing new species. A single sample originated from non-vascular plants; it was sequenced from single branches and leaves of the liverwort Mylia verrucosa [53].
Environmental samples constituted the second most significant source, contributing 18 accessions that resulted in novel genomes. Of these, 16 contained novel species. Specifically, 14 originated from soil samples, with three of them being derived from metagenomic studies investigating plant-associated microbiota. Curiously, one sample was collected in the sea, and the viral genome is probably the result of contamination.
Interestingly, three accessions came from arthropods: the grasshoppers Locusta migratoria and Ceracris nigricornis, and the phytophagous insect Apolygus lucorum. The genome from the latter is classified as an isolate of Aspergillus flavus deltaflexivirus 1. The grasshopper Ceracris nigricornis accession was the only one that originated from a single individual (the BioProject accession refers to an individual: PRJNA543568). The remaining four accessions originated from fungi: two from Macrocybe gigantea, an edible mushroom, and two from the plant pathogen Puccinia striiformis.
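The counts reported in this section can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in the text; the variable names are illustrative.

```python
# Counts reported in the text.
accessions_run = 202
sets_of_contigs = 193
previously_published = 34
curated_accessions = 81

# Sets of contigs left after discarding previously published ones.
remaining_sets = sets_of_contigs - previously_published

# Fraction of all processed accessions that yielded curated genomes.
curated_fraction = curated_accessions / accessions_run

# Curated accessions by sample source, as reported in the text.
sources = {"plant": 56, "environmental": 18, "arthropod": 3, "fungal": 4}

print(f"sets left after removing published ones: {remaining_sets}")
print(f"accessions yielding curated genomes: {curated_fraction:.1%}")
print(f"sources sum to curated total: {sum(sources.values()) == curated_accessions}")
```

The curated fraction of roughly 40% is consistent with the statement that the pipeline failed to detect new genomes in more than half of the accessions, and the four source categories add up to the 81 curated accessions.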