To further evaluate the final viral sequences, we applied checkV and accordingly most of the sequences (2821) were estimated to be low-quality genome fragments (< 50% complete), with only 67 with medium-quality (50–90% complete) and a total of 37 contigs were estimated to be high-quality (> 90% complete), of which 20 were predicted to be completed genome, also 115 contigs were not-determined (Supplementary Fig. S1). The chance of not-determined contigs to be false positive is low, as we retained them as viruses if they have been predicted with all three pieces of software (with VirSorter2 and DeepVirfinder quality scores ≥ 0.9). Therefore, the not-determined contigs are more likely to be too short to categorise, or the sequences are too divergent from the CheckV references—although Cenote-Taker 2 also could not predict halmarks for most of these 115 contigs. Next, we used VIBRANT to analyze whether the 3040 identified contigs were likely to have a lytic lifecycle. We found that 96.7% of them (2941 viral contigs) were predicted to have a lytic lifestyle, while only 3.3% represented a temperate lifestyle (Supplementary Table S2).
The 1289 clustered contigs formed 469 clusters, of which 434 clusters did not include any sequences from the Refseq database (Fig. 1A and Supplementary Table S3). The remaining 35 clusters, representing 67 contigs, were classified mainly as Siphoviridae, Myoviridae, and Podoviridae (Supplementary Fig. S2B). Furthermore, members of Schitoviridae, Herelleviridae, and Autographiviridae were present, although they comprised very small numbers of contigs (Fig. 1A and Supplementary Fig. S2B). In general, the resulting protein similarity network represented the newly identified phage sequences as broadly dispersed across the known viruses from the RefSeq database (Fig. 1A), despite the fact that 25% of the recovered sequences were also singletons (i.e. not presented in the network graph). The protein similarity network classification, however, showed that the higher-quality contigs (> 50% complete) tended to cluster more closely with previously identified viruses in the Refseq database, where almost 30% of these sequences assigned to known phages (Supplementary Fig. S3). Assignment success depended partly on contig length (Supplementary Fig. S4), with 53 (4.6%) of larger (> 5kb) contigs being assigned, but only 14 (~ 1%) of the smaller contigs (Supplementary Figs. S4A and S4B). However, the assignment of the large contigs to different phage families was comparable to the entire data set (Fig. 1 and Supplementary Figs. S2 versus Figs. S4).
To determine the family level taxonomy of the recovered contigs, we also characterised overlap in the composition of their viral protein families (VPF) with known viral taxa (Supplementary Table S4). Most of the recovered contigs have VPF compositions with family-level relatedness to Siphoviridae, Myoviridae, Podoviridae, and in lower ratios with Tectiviridae and Microviridae (an ssDNA virus). We also identified VPFs from archaeal virus families (Lipothrixviridae, Sphaerolipoviridae, Rudiviridae, and Turriviridae), NCLDV, and Nudiviridae—a family of DNA viruses known to infect eukaryotes, including Drosophila (Fig. 1B and Supplementary Table S4).
Of the 37 high-quality genomes, the VPF membership ratios of almost all suggest that they were related to bacteriophage families including Myoviridae, Siphoviridae, Podoviridae, and Microviridae (Fig. 1C). Although 27 (73%) of these almost complete phage genomes did not cluster with known reference phages (Supplementary Fig. S3), their VPF membership ratios were primarily related to Caudovirales (Fig. 1B and Supplementary Table S4), which may point to novel families or genera within this order.
Host Assignment of bacteriophage contigs
To predict the hosts of the recovered phage sequences, we took advantage of the VPFs as well as CRISPR-spacer sequences. Using the VPF approaches with a membership ratio and a confidence score cutoff of 0.6 and 0.8 respectively, 650 contigs could be linked to a host domain, of which almost all were assigned to the bacterial host (Supplementary Table S5). Further investigation showed that 267 of these phage contigs were linked to 15 host families and 154 of them could be related to 20 host genera. All identified potential hosts reflect either the gut or environmental bacteria (Supplementary Table S5). Of the 156 contigs that hit the host at genus-level, 43 could be related to the genus Pseudomonas, 30 to Lactococcus, 25 to Staphylococcus, 10 to Escherichia, and eight to Lactobacillus (Supplementary Table S5).
On the other hand, the CRISPR-spacer approach, which was used to predict bacterial hosts of determined phages, linked 488 bacteriophage contigs to 61 unique bacterial genera. Of these 488 contigs, most of them (> 300 contigs) hit against Drosophila gut specific bacteria (such as 74 Lactobacillus hits, 66 Gluconobacter hits, 48 Acetobacter hits, 31 Acinetobacter hits, 20 Pseudomonas hits, 19 Serratia hits, 15 Corynebacterium hits, 13 Klebsiella hits, 13 Komagataeibacter hits, etc.) (Supplementary Table S5).
Comparing the result from VPF and CRISPR-spacer approaches showed 29 contigs that hit the host at genus-level shared between two approaches, but only 69% of them (20 contigs) hit the same bacteria genus using both approaches (Supplementary Table S5). Further investigation showed that the VPF tools also predicted the presence of similar genus as CRISPR-spacer for these nine contigs, but were excluded due to our inclusion threshold (including only VPFs with a membership ratio and confident score of ≥ 0.6 and ≥ 0.8, respectively).
Functional Potential of D. melanogaster associated phage contigs from gut
Annotations from KEGG, Pfam, and VOG databases were used to gain insight into the functional potential encoded by the recovered contigs. Based on the available databases, only 50% (1549 contigs) of the contigs were assigned a function based on KEGG. However, the majority of phage associated functional genes were annotated with metabolism functions, including nucleotide, amino acid, lipid, energy and carbohydrate metabolism (Supplementary Table S6). Furthermore, functional genes associated with DNA repair, replication, and recombination, also transcription, were including quite large proportion (Supplementary Table S6).
We identified a total of 149 phage contigs that encoded at least one AMG and could be assigned to 42 unique KEGG orthologs (KO) (Supplementary Table S6). The most abundant AMGs (in terms of number of contigs involved) encoding dcm/DNMT1, folA/DHFR, mec, queE, NAMPT and folE/GCH1. These genes encode, respectively, DNA (cytosine-5)-methyltransferase 1, dihydrofolate reductase, the [CysO sulfur carrier protein] -S-L-cysteine hydrolase, the 7-cyano-7-deazaguanine synthase, nicotinamide phosphoribosyl transferase, and GTP cyclohydrolase IA (Supplementary Table 6). In other words, the AMGs derived from D. melanogaster tend to encode products mainly for amino acid, cofactor and vitamin metabolism and folding, sorting and degradation.
To further explore the functional potential of the identified contigs, we projected KO accession onto the KEGG pathways (Fig. 2). We used a total of 1549 phage contigs, which could be annotated based on KEGG database and assigned to 419 unique KO, for KEGG pathways projection (Supplementary Table 6). In some cases, the retrieved pathways represent functions that could directly affect bacteria, such as involvement in biofilm formation, quorum sensing, and the bacterial secretion system. In general, the retrieved pathways reflect a wide range of metabolic functions, including carbohydrate, energy, lipid, nucleotide, and amino acid metabolism. Furthermore, glycan biosynthesis and cofactors, vitamins, terpenoid and polyketide metabolism, and xenobiotic degradation were presented (Fig. 2). The identified genes in D. melanogaster gut phages are also involved in folding, sorting, degradation, transcription, and translation, as well as in replication and repair.
Genome binning of phage contigs from metagenomic data from D. melanogaster
We used vRhyme to construct viral metagenome-assembled genomes (vMAGs), resulting in the binning of 767 contigs into 188 vMAGs (Supplementary Table 7). However, after post-binning filtration based on potential contamination of vMAGs (bins that could include sequences from multiple viral genomes), in which we filtered out bins with ≥ 6 redundant proteins, we ended up with 167 vMAGs. We then checked the relative abundance of the final vMAG sets across the collected samples and observed variation related to the sampling locations (Fig. 3A). Out of 167 vMAGs, 11 were identified in more than 70% of the samples. Among these, two genomes, which their VPF membership ratios primarily belong to the Myoviridae family, were detected in all 851 samples (Fig. 3B and Supplementary Fig. S5).
Next, we further investigated the pattern of viral sequences across the samples, using principal coordinate analysis (PCoA), and clustering for sampling locations was assessed using Adonis2 and Betadisper tests. A significant effect from sampling location was observed not just based on the vMAG sequences but also based on unbinned viral sequences (2381 contigs) (Fig. 3C and Supplementary Fig. S6).
Drosophila melanogaster DNA viruses
During the analysis of virus protein composition, we identified a number of contigs deriving from nudiviruses, which are known to include multiple pathogens of Drosophila melanogaster (Wallace, Coffman et al., 2021). A blastn search against the NCBI viral genome database showed that these contigs were derived from Drosophila Kallithea nudivirus (NC_033829) with 99–100% coverage and a 99% identity. By mapping the high-quality short reads to the reference genome, we found that Kallithea virus occurs frequently, with a high overall prevalence of 54.2% (in 461 individuals of 851 samples with genome copy-number threshold ≥ 1% of the fly genome copy-number). The prevalence of Kallithea virus varied among four different sampling locations, ranging from 33.9% to 82.1%. We also detected Drosophila Esparto nudivirus, but its prevalence was extremely low, with only affected two flies.
By mapping reads to all previously-reported DNA virus genomes from Drosophila melanogaster [15, 17, 18], we identified two further previously-known pathogens, including Drosophila Linvill Road densovirus and Vesanto bidna-like virus. Drosophila Linvill Road densovirus was extremely rare, being detected in only seven individuals, all from sampling location 1, indicating a prevalence of 2 percent in that location and a total prevalence of 0.8% (95% confidence intervals 0.78%-0.82%). On the other hand, Drosophila Vesanto virus, a multi-segmented bidna-like virus, occurred at relatively high prevalence of 20% (95% confidence interval 18–23%; 172 samples out of 851) using a threshold of ≥ 1% (Fig. 4).
Almost all previous sequencing of Vesanto virus has been based on large metagenomic pools, leaving the associations between putative segments uncertain, as many independent infections are represented in each pool [17]. Here, based on single fly sequencing, we did not detect segments 9 or 12 —which were also absent from some population pools reported in [17]. Detection of the other segments was variable, with segment 5 being detected in 165 out of 851 samples (Fig. 4A, i.e. nearly all individuals infected with the Vesanto virus), segments 3 and 10 being detectable in 86 and 56 samples respectively and segments 1, 2, 4, 6, 7, and 8 each appearing in more than 35 individuals. Segment 11, was extremely rare, detected in only three samples.
Excluding segment 11, all other segments were detectable in those infected individuals that had the highest Vesanto virus read numbers (e.g., samples DE5797 and DE5682, which exhibited almost 2.5 and 1.8 million Vesanto virus reads, respectively). This is consistent with some segments (e.g. segments 9, 11, and 12) being ‘optional’ or ‘satellite’ components and other segments differing in copy-number, with segment 5 being the most readily detectable.