Short read polished assemblies indicated the presence of inversions: From Illumina-based short sequencing reads deposited in SRA with BioProjects PRJEB41116 and PRJNA879962, a subset of 500 assemblies was selected. Randomization did not yield unique numbers; as a result, there were some duplicate assemblies that had to be removed. Finally, 467 WGS datasets were acquired from the European Nucleotide Archive (Table S1). Based on N50 (2817734±1888117), Genome fraction (98.9±11.5%), and total length (4403718±513257) from de novo assembly QC analysis, poor quality samples were identified and eliminated. WGS-based lineage detection was also correlated with samples and is included in the de novo assembly statistics table of short-read data sets, shown in Table S2. The subsequent analysis was done with a set of 467 pure lineage and high-quality samples. A schematic representation of procedure followed for analysis is shown in Figure 1A. Phylogenetic clustering of 467 samples showed lineage-specific clustering, and higher prevalence of drug resistance in lineage 2 isolates was observed (Figure 1B). A subset of 40 isolates (10 samples per lineage) was subsequently processed for alignment using the ProgressiveMauve algorithm. The corrected de novo assemblies that were aligned using ProgressiveMauve showed distorted patterns. Alignment-free clustering based on the mash distance matrix also clustered samples according to lineages. Again, lineage 2 samples from this subset showed resistance to a greater number of drugs, and only 2 of 10 lineage 2 isolates were sensitive (Figure 1C). Even though ProgressiveMauve results were not conclusive, the LCB matrices built using the approaches mentioned in the methodology section showed clustering on categorical PCA according to lineages as far as inversions were concerned (Figure 1D). Similar approach for translocations clustered samples into 4 subclusters. However, in the case of translocations, each cluster contained isolates from varying lineages (Figure 1E). This indicated that translocations are probably associated with an unknown factor, unlike in the case of lineages.
SV determination from long reads: Long-read WGS datasets of 237 PacBio datasets originating from cultures were identified using the metadata and downloaded. A subset (n=182) of datasets could be assembled successfully into fewer contigs. The others that could not be assembled were either due to poor quality (< 500 bp average read length) or limited computational resources. Lineage determination using tb-profiler revealed a major fraction of lineage 4 (n=113), followed by lineage 2 (n=26), lineage 1 (n=9), lineage 3 (n=6), and mixed lineages (n=28). The datasets representing mixed lineages were eliminated from subsequent analysis, except lineage2/4 samples. Inversion and BND were filtered for PCA analysis, which showed subtle clustering on the graph, indicating the presence of fewer INVs and BNDs between lineages (Figure 2A). Inversions with breakpoints on both strands were considered as true inversions, and those that had breakpoints only on one of the strands were, despite being significant, not considered true inversions. In Lineage 1, a set of two inversions, 652 bp and 310 bp, were observed with coordinates 76531-77183 (Raw P < 0.0001; BH P < 0.016) and 1482828-1483138 (Raw P < 0.01; BH P = 0.08), respectively (Figure 2B). In lineage 3, a set of three inversions of 639 bp, 681 bp and 539 bp were observed with coordinates 1508080-1508719 (Raw P = 0.01; BH P = 0.1); 2272338-2273019 (Raw P = 0.04; BH P = 0.1) and 3781352-3781891 (Raw P = 0.005; BH P = 0.09) (Figure 2C). Lineage 4 showed four inversions of 197 bp, 639 bp, 245 bp and 539 bp with coordinates 869783-869980 (Raw P = 0.02; BH; P = 0.2), 1508080-1508719 (Raw P = 0.008; BH P = 0.1), 2844929-2845174 (Raw P = 0.001; BH P = 0.02), and 3781352-3781891 (Raw P = 0.02; BH P = 0.2), respectively (Figure 2D). The observed lineage inversions in Lineage 1 of 652 bp and 310 bp were within the genes sdaA and Rv1320c (Figure 2E). Another inversion of 681 bp in lineage 3 was within but at the latter end of the coding region of dosT and within Rv2026c. Both these genes are reverse in direction. Probably this, disorientation, is causing a gene fusion or, more likely, disruption of both (Figure 2F). A set of two unique mutations that were not present in other lineages but found in lineage 4 included a 197 bp inversion in the upstream control region of the two genes oriented away from each other (Rv0776c and purB), probably having an impact on their expression. The other 245 bp inversion is within the centroid of the fas gene (Figure 2G). A set of two inversions was found, both in lineage 3 and lineage 4, that included a 539 bp inversion encompassing the latter end of Rv3369 and the latter end of dnaE2, and a 639 bp inversion disrupting three relatively 3 small genes oriented in opposite directions in lineages 3 and 4 was observed. The three genes included Rv1341 (forward), Rv1342 (reverse), and lprD (reverse) (Figure 2H). Some of the lineage 4 inversions were also found in samples of mixed lineages (lineage 2 and lineage 4), and these included 869783-869980, 1508080-1508719, and 2844929-2845174 (Figure S1). Due to low sample size, true inversions were not observed in lineage 2 isolates (Figure S1). A summary of inversions is included in Table 1.