Sample details and quality check
The organism was isolated from human, biological replicate and laboratory-adapted strains. Leishmania sp. cultured under different conditions and drug resistance has shown different mutations and genetic makeup (Supplementary Table 1). The samples were quality checked, processed, and aligned to the reference genome. The Binary Alignment Map (BAM) files were used to assess the alignment and depth statistics of each sample (Fig. 1). 2 samples were discarded from further analysis because of insufficient alignment to the reference genome, due to the incompatibility of the read files with BWA alignment software. The 31 remaining samples were used for further analysis. The samples used in this study have been provided in Supplementary Table 2.
Aneuploidy change pattern in Leishmania donovani samples from Indian Sub-continent
All the analyzed samples have shown tetrasomic state as reported in the earlier studies. In addition to this, trisomy was observed in chromosomes 2 and 8 in BHU 1220, BHU 575, MHOM/IN/1983/AG83, MHOM/IN/2013, MHOM/IN/2014, K133 samples and tetrasomic states in BPK275/0/cl18 samples. Contradictorily, the Nepalese strain BPK275/0/cl18 did not show either trisomy or tetrasomy in chromosomes 2 and 8, and clustered away from the other Nepalese strains, with the K133 strain isolated from India. A few Indian strains along with one Nepalese strain, BPK275/0/cl18, K133PMM/R, K133AS/R, K133WT, MHOM/IN/2010/T3, MHOM/IN/IN/T8, MHOM/IN/2012/T9, MHOM/IN/2014/A2, MHOM/IN/2013/P2, MHOM/IN/2014/H1, BHU 575, showed the disomic state of chromosome 2 and clustered together separate from the other Indian cluster (Fig. 2). It should be noted that the strains are clustering into 4 major clusters on the similarity in their aneuploidy profile. These clusters encompass the two distinctly separate groups of Indian strains, a Nepalese cluster and a mixed group with Nepalese and Indian strains. Here, we observed that the Indian cluster of strains isolated from India BHU 1220, MHOM/IN/1983/AG83.i and ii, West Bengal MHOM/IN/2013/M1, Jharkhand MHOM/IN/2014/P6, and MHOM/IN/2014/T10, showed expected karyotypes as most chromosomes were disomic barring the trisomic chromosomes 6 and 15, and the tetrasomic chromosome 31. On the other hand, the second Indian cluster comprising strains isolated from Bihar MHOM/IN/2010/T3, MHOM/IN/2010/T7, MHOM/IN/2012/T9, MHOM/IN/2013/P2, West Bengal MHOM/IN/2012/T8, Jharkhand MHOM/IN/2014/H1, MHOM/IN/2014/A2 and BHU 575 strain showed a remarkable difference in aneuploidy profile; chromosome 31 and 8 showed an increase in copy number which reflected in their tetrasomic and trisomic states, chromosomes 2, 5, 9, 11, 13, 14, 16, 20, 22, 23, 26, 35 showed variable aneuploidy, ranging from disomy to trisomy, which was lacking in the previous Indian cluster. The Nepalese strains grouped together and surprisingly were closer to an Indian cluster due to similarity in aneuploidy profiles, except the chromosome 2 which distinctly showed tetrasomic condition in Nepalese strains BPK275/0/cl18. Finally, the remaining cluster, was a mixed group of one Nepalese strain and three Indian strains. This may have been due to the similarity in tetrasomic state of chromosome 5 along with trisomy observed in chromosomes 6, 9, 13, 15, 20, 23, 26, and 33. Chromosomes 2, 8 and 31 were also observed to be clustered together, possibly due to their peculiar nature of existing in unusually higher copy number states, mostly trisomic and/or tetrasomic with respect to the other chromosomes. The chromosomes 5, 9, 13, 14, 16, 20, 23, 26, 33, and 35 were grouped together, possibly because of their tendency to exhibit variable aneuploidies ranging from disomy to tetrasomy. Chromosomes 6 and 15 show occasional trisomy and hence grouped together. Rest of the chromosomes showed normal copy numbers with minimal deviation from disomy. The overall heatmap of the aneuploidy change in the samples from Indian subcontinent has shown trisomy or tetrasomy in the whole or partially in chromosome 2 and 31. This observation is supported by earlier study that has shown chromosome 31 is superanumerary and tetrasomic (Pilling, Reis-Cunha et al. 2023). The aneuploidy changes of Leishmania under the effect of paromomycin can be observed in the samples with BPK* id, most of the samples have shown trisomy in Chromosome 2. WGS samples of late passage has shown tetrasomy in chromosome 31 and trisomy in chromosome 5 and 6. The present analysis show a very high genetic modification in Leishmania donovani sequenced from patients of Indian sub-continent and drug resistance also alter the aneuploidy of Leishmania.
Distribution of variants among samples
The variant analysis has shown that the sample from the Himachal region of India has recorded the highest number of variants in the coding sequences, gene interval and chromosomal interval (Supplementary Figs. 1 and 2). Rest of the samples have shown variants in the same range i.e. 1000 to 10000 in number. We observed the genotype distribution of SNPs in different strains; Indian strains MHOM/IN/2010, MHOM/IN/2012, MHOM/IN/2013, and MHOM/IN/2014 showed an average SNP frequency of 1900. These strains were isolated from different regions of India mainly from Bihar, Delhi, Jharkhand, and West Bengal. Strains with unknown states as geographical locations were tagged as ‘Indian’. The strain MHOM/IN/2010/T7, isolated from Bihar, had the highest SNP occurrence of 1975 while Jharkhand isolate MHOM/IN/2014/H1 had the lowest SNP occurrence of 1868. Indian strains BHU1220, isolated from Bihar, showed 2089 SNPs, MHOM/IN/1983/AG showed a higher SNP frequency of 2707 and 2719, and BHU575/0/cl2 with 3800 SNPs was the highest. Indian strains K133WT, K133AS-R, and K133PM-R showed abundance with each having 3713, 3281 and 3488 SNPs respectively. We looked at the distribution of genotypes in these samples; 86–91% of SNPs had Heterozygous Reference Alternate (Het RA) genotype, 9–13% had Homozygous Alternate (Hom AA) genotype and a miniscule percentage, around 0–0.002% of SNPs represented the Heterozygous Alternate (Het AA) genotype, which represents two different alternate alleles in place of the reference (Fig. 3).
The Nepalese strains BPK275/0/cl18 showed a distinctly higher SNP abundance than Indian strains and an average SNP abundance of 3482 SNPs. BPK275/0/cl18.vii, isolated from Koshi province, showed the lowest SNP abundance of 2962 while BPK275/0/cl18 was the highest – 4003, among all strains from both India and Nepal. Most of the Nepalese strains were isolated from the Koshi province of Nepal, barring BPK275/0/cl18 of the study SRR15851294 which did not have an associated geographical location. The variant frequency distribution is similar among all the samples except the one obtained from the Himalayan region.
We also analyzed the Insertion/Deletions occurrence frequency in the samples and reported the different genotypes associated (Fig. 4). Quite interestingly, the Indian strains MHOM/IN/2010 T3, T7, MHOM/IN/2012/T8, T9, MHOM/IN/2013/M1, P2, MHOM/IN/2014/A2, H1, P6 and T10 had remarkably high insertion/deletion frequency compared to rest of the samples. The highest occurrence was in P2–29760, and the lowest – 18,153 – in T8. Among the strains, there existed unusually high counts of Het ALT/REF alleles nearing 20,000 to 30,000 in a few strains. The average insertion/deletions variants in these strains were 24,060. While the Het REF/ALT alleles were in abundance, the Hom ALT and Het ALT alleles showed sparse distribution. Allele
frequency ranged from strain MHOM/IN/2014/T10, from Jharkhand, having the most with 314 Hom AA and 16 Het AA alleles and strain MHOM/IN/2014/A2, from West Bengal, having the least with 165 Hom AA and 8 Het AA alleles. Other Indian strains such as the BHU1220, BHU575/0/cl2, MHOM/IN/1983/AG83 and K133 strains showed markedly higher frequency of Hom AA and Het AA while exhibiting negligible Het RA alleles relative to the strains mentioned previously.
The Nepalese strains BPK275/0/cl18 showed limited InDels relative to the Indian strains, while also exhibiting a higher frequency of Hom ALT and Het ALT alleles. The strain BPK275/0/cl18.vii showed lower amounts of insertion/deletions in comparison to other Nepalese strains. The sample of Himachal state of India has around 1 lakh missense and 40000 InDels.
To decipher the distribution of different variant types within the strains used in this study, we utilized snpEff for annotation. The method involves identifying and attaching the relevant sequence ontological terms to the variants, as well as predicting their effect types and prospective impacts. Most of the variants appeared to be impacting non-coding genomic regions which included the 5’Flank, 3’Flank, and IGR. We observed an abundance of Frame_Shift_Del events and Missense_Mutation which impacted the coding portion of the genome. Few variants which impacted the coding regions were associated with in-frame deletion or insertion, missense and silent mutations. A tiny proportion of the variants represented intron, nonsense, nonstop, splice region, splice site and targeted region classes (Fig. 5). We looked at the distribution of different variant types across strains. The majority of the variants were in 3’ Flank or 5’ Flank across the strains that constituted 75.1% of the total variants. The strains had an average of 7.43% IGR variants. A relatively small share of the variants affected the coding regions of the gene sequences, these were Frame_Shift_Del, Frame_Shift_Ins, and Missense Mutations, contributing to an average of 8.37%, 1.18% and 4.54% respectively (Fig. 6).
When we compared the average variant proportion between Indian and Nepalese strains we found there to be significant differences between Frame_Shift_Del variants, 3’Flank and IGR regions.
We observed that Indian strains MHOM/IN/2010/, MHOM/IN/2012/, MHOM/IN/2013/, and MHOM/IN/2014 exhibited substantial amounts of variants tagged as ‘Frame_Shift_Del’, except for the BHU 1220, BHU575/0/cl2, MHOM/IN/1983/AG83 and K133 strains. This might be due to the presence of a higher number of insertion/deletion events, potentially leading to an increased representation of these variants in the samples. This was supported by the higher frequency of these variants relative to other strains in the population. It ranged from 4364 in MHOM/IN/2013/M1 and 1497 in MHOM/IN/2014/H1. The proportion also remained high at 27 to 18% of the total variants in these samples. Frame_Shift_Del events in other Indian strains remained lower; only occurring as less than 1% of the total variants in the samples. The frequency relative to other Indian strains previously mentioned also remained significantly low i.e. 30, 41, 44, 53 respectively. This might be due to the lesser insertion/deletion events in these strains which could be contributing to remarkably higher frame shift events. Whereas the Nepalese strains exhibited less than 1% of these variants, except strain BPK/275/0/cl18.vii which had 5% of ‘Frame_Shift_Del’ variants, which could be possible due to relatively lower insertion/deletion events in these strains. Nepalese strains also showed a higher mean share of Missense_Mutations, around 5.59%, than the Indian strains which showed 3.78%. IGR also contributed to 9.74% in Nepalese strains while 5.76% in Indian strains. In the Himalayan region sample, downstream and upstream variants comprises of 36% and 35% of the total variants and 9.2% in the exon region.
Single nucleotide polymorphisms impact functional domains
We used the Mutation Annotation Format (MAF) files generated by the R package MAFtools to investigate the abundance of ‘Missense variants’ in different isolates. We observed a median of 310 missense variants in the isolates affecting an array of genes, mainly imparting function as proteins involved in biosynthesis pathways, transport, anchorage, and as receptor proteins involved in cellular uptake of essential amino acids, interaction with host immune system and drug resistance. Among the strains from the Indian subcontinent those from Nepal showed a relatively high frequency of missense variations. Although the Indian isolate K133WT exhibited the highest missense mutation frequency, also K133 and BHU575/0/cl2 strains had a higher frequency of mutations which could be due to the exposure from Paromomycin (PMM) drugs (Fig. 7). Fifty genes were found to be affected by missense variants in the entire population of
isolates (Fig. 8). These included genes of ATP-binding cassette (ABC) family, Amastin-like surface proteins, A2 genes, amino acid permeases, and others. ABC subfamily A, member 2 (ABCA2) and member 9 (ABCA9) genes showed the most mutations, 503 and 326 respectively. We observed an increase in copy number of ABC subfamily A, member 9 protein (LDBPK_270840) in all the isolates. ABCA9 protein has two ‘ABC transporter domains’ from amino acids (aa) 760–993 and 1547–1777; while none of the domains were affected by variants, several missense variants of ‘Moderate Impact’ were observed in the gene sequence. It has been reported that Leishmania modulates aneuploidy in response to drug pressure, as ABC family has been shown to be present in higher copy number. ABC family of proteins are membrane transporter proteins which have been observed to be involved in drug resistance against antileishmaniasis through efflux (Ouellette, Légaré et al. 1998). Three amastin genes that exhibit high mutation count; Amastin-like surface protein (LDBPK_341150), Amastin surface glycofamily protein (LDBPK_044340), and Amastin surface glycofamily protein (LDBPK_140500) were among that constituted high number of mutations. It has been proposed that Amastin-like surface glycoproteins play a crucial role in macrophage-parasite interaction and parasite virulence. As reported by in a study wherein they knocked down the δ-Amastin expression in L. braziliensis using RNA-interference and observed decreased intracellular parasites in amastigotes in-vitro and in-vivo. We observed 2 missense mutations and 3 silent mutations in LDBPK_341150, 2 missense and 1 silent mutation in LDBPK_044340 and 1 missense mutation in LDBPK_140500. The 3’a2rel (LDBPK_220560) and 5’a2rel (LDBPK_220660) related proteins of the A2-A2rel cluster showed an increased missense variant count, the A2rel genes have been predicted to be involved in the visceralization of the leishmaniasis infection, as reported in a study on BALB/c mice by infection of L. major Friedlin V9 strain which was transfected with A2 expressing plasmids .The 3’a2rel sequence showed 7 moderate impact missense variants, 2 silent mutations and the 5’a2rel sequence showed missense, a modifier 5’ Flank and silent mutations. Several surface protein encoding genes showed high frequency of missense mutations (Fig. 9) including the Phosphoglycan-ꞵ-1,3-Galactosyltransferase, partial protein (LDBPK_020150), Mannosyltransferase-like protein (LDBPK_311920) and Amastin-like surface protein (LDBPK_341150). Several studies have implicated the Lipophosphoglycan (LPG), which are glycolipid surface proteins, which play a key role in the propagation of the parasite, through the insect vector, by facilitating the interaction of promastigotes with midgut epithelial lining of sandfly. (Ilg T., Sacks D et al.) LPGs are involved in parasite defence against the host innate immune system by avoiding the formation of membrane attack complexes. These glycoconjugates have also been shown to have a role in promoting cellular uptake of Leishmania parasites into the host macrophages, through the complement system, where they take refuge and multiply inside the acidic phagolysosomes. (SM Puentes et al., PJ Green et al.) In response to the selective pressures by the host environment and immune system Leishmania parasites alter surface physiology by altering LPG side chains which contribute to enhanced survivability and pathogenicity. (Roditi I et al.) The geographical distribution of sandfly influences species-specific LPG alterations and contribute to the successful transmission of parasite from sandfly vector to vertebrate host. (Ilg T.) Differential host-species and host-strain interactions could be contributing to the observed mutational frequency in the Galactosyltransferase gene. Mannosyltransferases are a type of galactosyltransferase which act on mannose and produces Mannogen, a unique carbohydrate material soluble in the cytoplasm, a homopolymer of ꞵ-1,2-linked mannose, 4–40 residues long. (Ralton et al., Sernee et al.) Mannogen is constitutively produced by seven Mannosyltransferases/phosphorylases (MTPs). Infective stage amastigotes have accumulated Mannogen, as it has been reported that defective Mannogen biosynthesis can lead to loss of virulence and attenuated growth in macrophages. (Sernee, M. Fleur et al.) There exist 𝛼-mannosyltransferases that catalyse the synthesis of N-linked glycans and glycosylphosphatidylinositol (GPI) glycolipids which promotes anchoring of surface proteins, including LPGs, by GPI. (Hilley et al.) Among the multi-hit top mutated genes were Amino acid permease, putative (LDBPK_270530), Tryparedoxin peroxidase (LDBPK_151140), cAMP specific phosphodiesterases (LDBPK_151450). Arginine is an essential amino acid for Leishmania hence it needs to be sourced from exogenous sources. Arginine uptake and homeostasis is maintained by transmembrane proteins such as amino acid permeases which transport Arginine on depletion in the parasite. (Opperdoes FR et al., Darlyuk, Ilona et al.) Tryparedoxin peroxidase enzyme, which is upregulated in the presence of Reactive Oxygen Species (ROOs), helps mitigate cellular toxicity by eliminating toxic agents responsible for Ca2 + influx, also an increase in Tryparedoxin peroxidases leads to lesser responsiveness to anti-leishmanial Antimony. (Iyer JP et al.) Various membrane and surface proteins have been implicated in the pathogenicity of Leishmania which also showed high frequency of missense variations, such as the ATP-binding cassette protein subfamily A, member 2, putative (LDBPK_111210) and member 9, putative (LDBPK_111210), 3’a2rel-related protein (LDBPK_220560). The A2 gene family is essential for visceralization of the disease as it has been observed to be absent in Leishmania major which causes Mucocutaneous Leishmaniasis (MCL) but is present in L. donovani which results in the visceral manifestation of the disease. (Zhang et al.). Next, we investigated if any of the mutations affected the functional domains of protein products of the gene sequences. We found 20 genes which had a reported functional domain that happened to be impacted by the missense mutations. The findings are reported in the Table 1.
Indian and Nepalese L. donovani isolates have distinct mutational profiles
We checked for genes which were affected by indels and found 370 genes common to all isolates. Our analysis revealed differential mutation profiles for each gene in different isolates. A cluster analysis of isolates was performed based on Indel mutation profiles and the results are shown in the below. (Fig. 10)
We observed the clustering of Leishmania donovani strains into three main groups, one of the clusters showed exclusively Indian isolates, cluster A, while the other two showed an intermixing of both Indian and Nepalese isolates, cluster B and C.
Cluster A comprised the strains MHOM/IN/2010/T3, MHOM/IN/2010/T7, MHOM/IN/2012/T8, MHOM/IN/2012/T9, MHOM/IN/2013/P2, MHOM/IN/2013/P2/M1, MHOM/IN/2014/A2, MHOM/IN/2014/H1, MHOM/IN/2014/P6, and MHOM/IN/2014/T10. These strains showed increased mutations in Bridge-like lipid transfer protein family member 1 N-terminal domaining containing protein (LDBPK_160180), Mannosyltransferase like-protein (LDBPK_311920), Calpain-like cysteine peptidase, putative (LDBPK_270510, ATP-binding cassette protein subfamily A, member 2, putative (LDBPK_111210), and uncharacterized proteins (LDBPK_333030-LDBPK_333040). Strains M1, P2 and T10 isolated from West Bengal, Bihar and Jharkhand grouped together since they had frequency of variants in the above-mentioned genes.
Cluster B comprised Indian and Nepalese isolates; BPK275/0/cl18.i, BPK275/0/cl18.ii, BPK275/0/cl18.iii, BPK275/0/cl18.vi, BPK275/0/cl18.viii, BPK275/0/cl18.ix,BPK275/0/cl18.xi, MHOM/IN/1983/AG83.i, MHOM/IN/1983/AG83.ii, and K133AS-R, K133PMM-R and K133WT. This cluster exhibited higher mutations in Uncharacterized protein (LDBPK_310470), uncharacterized proteins (LDBPK_333030-LDBPK_333040), Calpain-like cysteine peptidase, putative (LDBPK_270500-LDBPK_270510), Phosphoglycan beta 1,3 galactosyltransferase (LDBPK_020150), Amastin-like surface protein, putative (LDBPK_341690), Amino acid permease, putative (LDBPK_272680), Amastin-like surface protein, putative (LDBPK_341720), Amastin surface glycofamily protein-CHR_END (LDBPK_044360-CHR_END), more than 30 variants were seen in these genes. A few more genes were observed that contained mutations, not as many as the above mentioned, these were; pseudo (LDBPK_310470), Uncharacterized protein (LDBPK_120661-LDBPK_120667), ATP-binding cassette protein subfamily A, member 2, putative (LDBPK_111210), Uncharacterized protein (LDBPK_093020), Tyrosine phosphatase family protein (LDBPK_321070), Amastin surface glycofamily protein (LDBPK_044340), and Uncharacterized protein (LDBPK_333030-LDBPK_333040).
Cluster C comprised Indian and Nepalese isolates; BHU 1220, BHU 575/0/cl2, BPK275/0/cl18.iv, BPK275/0/cl18.v, BPK275/0/cl18.vi, BPK275/0/cl18.x, BPK275/0/cl18.xii, and BPK275/0/cl18.xiii. Strains showed similar mutation profiles to the strains in cluster B but distinct to the cluster A strains. Isolates showed a peculiar abundance in mutation in the Amastin surface glycoprotein (LDBPK_044360-CHR_END) gene and the Amastin-like surface protein, putative (LDBPK_341720).
Indels aggregate in specific regions of the L. donovani genome
Here we provide a comprehensive report on the Indel variation in the population of Nepalese and Indian isolates (Table 2). 91 indels were seen to be occurring in all the samples, with different variant classifications, in noncoding and coding regions of the genome. Only 7.6% i.e. 7 indels affected the coding regions. These variants were classified as either ‘In_Frame_Ins’ or ‘In_Frame_Del’, meaning they existed inside the reading frame of the genes and were size multiples of three, thereby preserving the reading frame. Whereas 92%, i.e. 84 indels were in noncoding regions and mostly were size multiples of two. Indels in coding regions may have a severe impact on the structure and function of gene products and hence are negatively selected, which explains the lower frequency of reading frame disrupting indels. Non-coding indels may be better tolerated due to more lenient selection pressure. It has been proposed that non-coding variants may play a role in influencing the gene expression of upstream and downstream genes. 91 variants were spread over coding and non-coding regions of 54 genes. These variants were observed to be occurring in and around specific nucleotide positions of gene sequences. The ATP-binding cassette protein, family A, member 2 (ABCA2) showed 6 frameshift indels, 585_586insTA, 567_568insCTTGC, 557_558insA, 580_581delCA, 572delA, and 559_563delGGTGT, all of which push the reading frame of the coding sequence and are concentrated around nucleotide (nt) 550 to 590. Hypothetical predicted transmembrane protein (LDBPK_240440) had three in-frame indels and two frameshift indels concentrated at nt 1270 to 1290. IGR between two Amastin-like proteins, LDBPK_080760 and LDBPK_080780 had four indels, which could possibly be involved in influencing gene expressions. Indels were seen in the 5’ Flank region of the Histone H2 protein. 3',5'-cyclic nucleotide phosphodiesterase, putative indels were at 5’Flank upstream of the gene sequence. Sodium stibogluconate resistance protein, putative had indels which resulted in reading frame shift which were predicted to be of high impact. This trend continued in the rest of the genes, thereby indicating that indels tend to arise in specific regions of the gene sequences most prone to alterations.
Copy number alterations
We observed that all of the chromosomes were predicted to have local copy number alterations in them. Chromosomes 36 through 26, 24, 22, 21, 16, 12, 7 and 2 were observed to have developed variations in copy number in all 31 samples. While chromosome 1, 3 and 4 were predicted to have local copy number variations in 5, 14 and 17 samples respectively. We checked genes that were associated with loss in copy number and the frequency of samples they occurred in. We observed that most of the genes showing a loss in copy numbers, that occurred in ≥ 30 samples, existed on chromosomes 2, 3, 7, 8, 9, 16, 22, 26 through 36. We checked the same for genes showing a gain in copy number and observed that the same chromosomes, which showed increased loss in copy number alterations, also had increased gain in copy number events. We found 393 genes with a gain in copy numbers; the highest number of CNV alterations were predicted in the chromosome 8 and 31 which may be because of their increased chromosomal copy number.