Overview
To evaluate the performance of opjMap, we conducted a comparative analysis against widely used long-read aligners: minimap2[29], NGMLR[30] and Winnowmap2[31]. All alignment methods were tested on both simulated and real single-molecule sequencing datasets. The experiments were performed on a server running the Ubuntu 22.04 operating system, equipped with 189 GB of RAM and two Intel Xeon E5-2686 v4 processors (2.30 GHz, 16 cores, 32 threads each).
Simulated Data Experiments
Alignment Evaluation for Non-Structural Variation Reads
To evaluate the alignment performance of the different tools, we used PBSIM2[32] to generate simulated reads with known reference positions, which enables a precise comparison of alignment quality. We generated four sets of simulated reads with varying error rates: 10% and 15% to mimic the PacBio platform, and 20% and 30% to represent the ONT platform. The reads were generated from the chromosome 1 sequence of H. sapiens. The commands used for generating these datasets are provided in Supplementary Table S1.
Given the inherent error rate of sequencing reads, a base is considered correctly aligned if its mapped position on the reference genome differs from its true simulated position by no more than w bases (where w = 5). A read is considered correctly aligned if more than 90% of its bases are correctly mapped[33]. Base-level accuracy is defined as the ratio of correctly aligned bases to the total number of aligned bases[34], while base-level sensitivity is the ratio of correctly aligned bases to the total number of bases in the simulated dataset. Similarly, read-level accuracy is the ratio of correctly aligned reads to the total number of aligned reads, and read-level sensitivity is the ratio of correctly aligned reads to the total number of reads in the simulated dataset. The specific commands used for aligning with each tool are provided in Supplementary Table S2. The resulting alignment data are presented in Table 1. The values in parentheses in the Accuracy and Sensitivity columns give the difference, in percentage points, relative to opjMap. For example, minimap2's base-level accuracy at a 10% error rate is listed as 95.40(-0.13), meaning that it is 0.13 percentage points lower than that of opjMap.
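The definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation script used in the study: the function names are hypothetical, unaligned bases are represented as None, and the 90% threshold is assumed to be taken over all bases of the read. The final two lines recompute the read-level accuracy and sensitivity of opjMap at the 10% error rate from the counts listed in Table 1.

```python
from typing import Optional, Sequence

W = 5                # position tolerance in bases (w = 5 in the text)
MIN_FRACTION = 0.90  # a read needs more than 90% correctly mapped bases


def base_is_correct(mapped: Optional[int], true: int, w: int = W) -> bool:
    """A base is correct if it is aligned within w bases of its true position."""
    return mapped is not None and abs(mapped - true) <= w


def read_is_correct(mapped: Sequence[Optional[int]], true: Sequence[int],
                    w: int = W, min_fraction: float = MIN_FRACTION) -> bool:
    """Assumption: the threshold is evaluated over all bases of the read."""
    if not true or len(mapped) != len(true):
        return False
    n_correct = sum(base_is_correct(m, t, w) for m, t in zip(mapped, true))
    return n_correct / len(true) > min_fraction


def accuracy(n_correct: int, n_aligned: int) -> float:
    """Correct items as a percentage of all aligned items."""
    return 100.0 * n_correct / n_aligned


def sensitivity(n_correct: int, n_total: int) -> float:
    """Correct items as a percentage of all simulated items."""
    return 100.0 * n_correct / n_total


# Read-level counts for opjMap at the 10% error rate, taken from Table 1.
print(round(accuracy(216563, 217354), 2))     # 99.64
print(round(sensitivity(216563, 241144), 2))  # 89.81
```

The same two ratio functions apply at the base level when the counts are numbers of bases instead of numbers of reads.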
At a simulated error rate of 10%, opjMap showed higher sensitivity at both the base and read levels than all other tools except minimap2, to which it was slightly inferior. At error rates of 15%, 20% and 30%, opjMap consistently exhibited superior sensitivity at both base and read levels compared to the other aligners. Although Winnowmap2 and NGMLR achieved higher base-level accuracy, their read-level accuracy was consistently lower than that of opjMap. These results collectively suggest that the orthogonal projection-based opjMap offers high sensitivity under moderate to high error rate conditions, enabling it to align a greater number of bases and accurately map more reads.
Alignment Evaluation for Duplications Across Distinct Reference Regions
We evaluated the tools' ability to detect interspersed repeats located outside the read's corresponding gene. We generated sequences containing repeats using a custom script, randomly selecting the strand for each fragment. Unlike PBSIM2, Badread[35] can introduce sequencing errors into a short sequence, simulating its output under various error rates. Using Badread, we added sequencing errors to the fragments (see Supplementary Table S3 for specific commands) and then used a script to select the simulated sequences whose repeat variations met our criteria. Because the simulation is random, the number of reads in each dataset varied.
Table 1
Results of different methods on simulated dataset

| Error Rate (Number of Reads) | Alignment Tool | Base-Level Number of Alignments (M) | Base-Level Correct Alignments (M) | Base-Level Accuracy (%) | Base-Level Sensitivity (%) | Read-Level Number of Alignments | Read-Level Correct Alignments | Read-Level Accuracy (%) | Read-Level Sensitivity (%) |
|---|---|---|---|---|---|---|---|---|---|
| 10% (241144) | opjMap | 2,347 | 2,242 | 95.53 | 89.94 | 217354 | 216563 | 99.64 | 89.81 |
| | minimap2 | 2,350 | 2,242 | 95.40(-0.13) | 89.96(+0.02) | 217318 | 216660 | 99.70(-0.06) | 89.85(+0.04) |
| | Winnowmap2 | 2,273 | 2,231 | 98.14(+2.61) | 89.50(-0.44) | 215791 | 214798 | 99.54(-0.10) | 89.07(-0.74) |
| | NGMLR | 2,234 | 2,225 | 99.57(+4.04) | 89.25(-0.69) | 216443 | 214058 | 98.90(-0.74) | 88.77(-1.04) |
| 15% (239772) | opjMap | 2,320 | 2,232 | 96.22 | 89.55 | 215443 | 213962 | 99.31 | 89.24 |
| | minimap2 | 2,313 | 2,212 | 95.61(-0.61) | 88.74(-0.81) | 214111 | 211273 | 98.67(-0.64) | 88.11(-1.13) |
| | Winnowmap2 | 2,112 | 2,064 | 97.72(+1.50) | 82.81(-6.74) | 198830 | 195513 | 98.33(-0.98) | 81.54(-7.70) |
| | NGMLR | 2,182 | 2,170 | 99.45(+3.23) | 87.07(-2.48) | 212526 | 205323 | 96.61(-2.70) | 85.63(-3.61) |
| 20% (127587) | opjMap | 2,322 | 2,233 | 96.14 | 89.58 | 114860 | 114074 | 99.32 | 89.41 |
| | minimap2 | 2,326 | 2,233 | 96.03(-0.11) | 89.60(-0.34) | 114850 | 114045 | 99.30(-0.02) | 89.39(-0.02) |
| | Winnowmap2 | 2,224 | 2,205 | 99.17(+3.03) | 88.48(-1.46) | 113582 | 111987 | 98.60(-0.72) | 87.77(-1.64) |
| | NGMLR | 2,221 | 2,206 | 99.35(+3.21) | 88.52(-1.42) | 114446 | 111828 | 97.71(-1.61) | 87.65(-1.76) |
| 30% (118334) | opjMap | 2,258 | 2,203 | 97.57 | 88.40 | 105809 | 104154 | 98.44 | 88.02 |
| | minimap2 | 2,216 | 2,142 | 96.64(-0.93) | 85.93(-2.47) | 103884 | 98286 | 94.61(-3.83) | 83.06(-4.96) |
| | Winnowmap2 | 1,386 | 1,357 | 97.94(+0.37) | 54.46(-33.94) | 69471 | 57197 | 82.33(-16.11) | 48.34(-39.68) |
| | NGMLR | 2,052 | 2,036 | 99.25(+1.68) | 81.70(-6.70) | 101117 | 90927 | 89.92(-8.52) | 76.84(-11.18) |

To select an appropriate error rate for comparison, we first tested the sensitivity and accuracy of different methods for aligning variations in 1000 bp sequences. The results are shown in Supplementary Fig. S1. opjMap demonstrated a significant lead in both accuracy and sensitivity under high error rates, with this gap only narrowing when the error rate dropped to between 15% and 10%. We chose an error rate of 15% to test the alignment of repeats of different lengths. For this test, we set five different lengths for external repetitive variations: 100 bp, 500 bp, 1000 bp, 2500 bp, and 5000 bp.
The average sequence length was 10,000 bp, with 3,000 sequences in each length group. For more detailed information on these two sets of reads with different error rates and lengths, please refer to Supplementary Tables S4 and S5.
Base-level errors in sequencing reads shift the position of repeat variations within the reads. We therefore determined whether a repeat variation was detected by verifying that the reference position corresponding to the variation was aligned multiple times[36]. Specifically, we defined the position and orientation of an aligned read segment on the reference genome as G = (Gst, Ged, Gd), and the true position and orientation of the variation on the reference as T = (Tst, Ted, Td). The alignment was considered correct if at least n + 1 aligned segments G had a non-empty intersection with T and the same orientation as T, where n is the number of repeats (here, n = 1).
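The detection criterion can be expressed compactly in code. The sketch below is an illustration under stated assumptions, not the script used in the study: the Interval type and function names are hypothetical, coordinates are treated as closed intervals, and orientation is encoded as '+' or '-'.

```python
from typing import Iterable, NamedTuple


class Interval(NamedTuple):
    """(start, end, strand) on the reference, mirroring G and T in the text."""
    start: int
    end: int
    strand: str  # '+' or '-'


def intersects(g: Interval, t: Interval) -> bool:
    """True if the closed intervals [g.start, g.end] and [t.start, t.end] overlap."""
    return g.start <= t.end and t.start <= g.end


def repeat_detected(alignments: Iterable[Interval], truth: Interval, n: int = 1) -> bool:
    """Correct if at least n + 1 aligned segments overlap the true region
    of the variation with the same orientation."""
    hits = sum(1 for g in alignments
               if g.strand == truth.strand and intersects(g, truth))
    return hits >= n + 1


truth = Interval(1000, 1999, "+")
segments = [Interval(900, 1500, "+"), Interval(1200, 2100, "+")]
print(repeat_detected(segments, truth))      # True: two overlapping '+' segments
print(repeat_detected(segments[:1], truth))  # False: only one overlapping segment
```

With n = 1 the true region must be covered by at least two aligned segments of the read, which is what distinguishes a detected duplication from an ordinary single alignment.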
Figure 3 illustrates the accuracy and sensitivity of this detection at different lengths (for specific commands, see Supplementary Table S6, and for numerical values, see Supplementary Table S7). As the length of the repetitive variation region increases, the accuracy and sensitivity of the alignment tools also increase. Because the simulated repeat sequences were all fragments extracted from the reference genome, and their length was around 10,000 bp, most alignment tools were able to align the entire sequence. This resulted in the sensitivity and accuracy of many results being identical. Throughout the length-based experiments, opjMap consistently outperformed other tools, achieving 100% sensitivity and accuracy in detecting repetitive variations when the repeat length was 5000 bp.
Overall, opjMap maintained high accuracy and sensitivity in detecting duplications across distinct reference regions, regardless of variations in length or error rate. This indicates that opjMap is capable of identifying a greater number of inter-regional repetitive variations, even under conditions of high error rates and short variation lengths.
Alignment Evaluation for Duplications within a Single Reference Region
The experiments also tested the detection of repetitive regions located within the reads, with two distinct types of variations: interspersed repeats with a single duplication event and contiguous segmental duplications with multiple repeats.
Single Duplication Event
We evaluated the tools' detection capabilities by fixing the repeat fragment length at 1000 bp and varying the sequencing error rates. The detailed results are shown in Supplementary Fig. S2. At high sequencing error rates, opjMap maintained a high level of performance. As the error rate decreased, the alignment results of other tools began to approach those of opjMap. We then chose an error rate of 15% for the subsequent experiment, which was designed to test the alignment of repeat fragments of different lengths. For this, we generated sequences with this error rate, containing internal repeats of 100 bp, 500 bp, 1000 bp, 2500 bp, and 5000 bp. Detailed information on these two datasets with varying error rates and lengths can be found in Supplementary Tables S8 and S9.
Figure 4 illustrates the accuracy and sensitivity at different fragment lengths, with specific numerical values available in Supplementary Table S10. opjMap showed higher alignment sensitivity and accuracy when the repeat fragments were short. As the length of the repeat fragments increased, the performance of other tools approached that of opjMap. This indicates that opjMap is suitable for detecting interspersed repeats in a wide range of scenarios.
Contiguous Segmental Duplication
For the comparison of segmental duplication detection, a custom script was used to generate sequence fragments of five different lengths (100 bp, 250 bp, 500 bp, 750 bp, and 1000 bp), with each length repeated 10 times. We then introduced sequencing errors at a rate of 15% using the Badread tool. Detailed read information can be found in Supplementary Table S11. Given the high number of repeats, we fixed the fragment length at 1000 bp and initially tested the sensitivity for repeat judgment thresholds (n) of 3, 5, 7, and 10. The results are shown in Fig. 5, with specific values available in Supplementary Table S12. From these results, it can be seen that Winnowmap2 is not well-suited for aligning segmental repeats. In contrast, opjMap maintained high sensitivity as the threshold n increased.
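For illustration, a contiguous segmental duplication of this kind could be constructed as follows. This is a hypothetical sketch and not the custom script used in the study: the function names are invented, and the random choice of strand for each copy is an assumption. Sequencing errors would be added afterwards by a separate simulator such as Badread, which is not shown here.

```python
import random
from typing import List, Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def tandem_duplicate(seq: str, start: int, length: int, copies: int = 10,
                     rng: Optional[random.Random] = None) -> Tuple[str, List[str]]:
    """Replace seq[start:start+length] with `copies` contiguous copies of itself.

    Each copy is placed on a randomly chosen strand (assumption).
    Returns the new sequence and the strand ('+' or '-') of each copy.
    """
    rng = rng or random.Random()
    unit = seq[start:start + length]
    if len(unit) != length:
        raise ValueError("fragment extends past the end of the sequence")
    strands = [rng.choice("+-") for _ in range(copies)]
    block = "".join(unit if s == "+" else reverse_complement(unit)
                    for s in strands)
    return seq[:start] + block + seq[start + length:], strands


rng = random.Random(0)
source = "".join(rng.choice("ACGT") for _ in range(3000))
duplicated, strands = tandem_duplicate(source, start=1000, length=500, rng=rng)
print(len(duplicated))  # 7500 = 3000 + 9 * 500
```

Recording the strand of every copy provides the ground truth against which forward and reverse repeat alignments can later be counted.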
A repeat judgment threshold (n) of 10 was selected to evaluate the performance of alignment tools on repeats. The results are shown in Fig. 6, with specific numerical values available in Supplementary Table S13. As the figure illustrates, opjMap surpassed the other aligners in both accuracy and sensitivity for detecting segmental duplications. opjMap achieves this by extracting shorter sub-fragments of length 9 from overlapping regions. This demonstrates that constructing the alignment skeleton with shorter fragment information can effectively enhance the detection of repeat region information.
Figure 7 presents a comparison of opjMap with three other tools, visualized using the IGV alignment visualization tool. Figure 7a shows the true skeleton anchor graph for a segmental tandem repeat of 500 bp, repeated 10 times. From this, it can be seen that 6 of the repeats are in the forward direction and 4 are in the reverse direction. Figure 7b shows the alignment results for this sequence from all four tools. We can observe that opjMap successfully aligned 4 reverse-oriented and 5 forward-oriented segmental repeats. In comparison, NGMLR aligned 2 reverse-oriented and 4 forward-oriented repeats. Winnowmap2 failed to recognize this repeat region, and minimap2 produced only a small number of alignment results. These findings demonstrate that opjMap is capable of identifying a greater number of segmental repeats, yielding more comprehensive alignment results. This indicates that opjMap possesses a superior ability to align segmental repeat variations.
Real Data Experiments
Evaluation on Datasets Without Segmental Repeats
A comparison of alignment performance on real-world datasets was conducted using sequencing data from two platforms: PacBio and ONT. The PacBio dataset, from A. thaliana, contained approximately 300,000 sequences, while the ONT dataset, from E. coli, contained approximately 60,000 sequences (exact read counts are given in Table 2). All experiments were run using 64 threads. As shown in Table 2, opjMap aligned a greater number of bases and reads than the other tools on both the PacBio and ONT datasets. On the PacBio dataset it also required the least CPU time and wall time, and on the ONT dataset it had the lowest peak memory. minimap2's performance is close to opjMap's, whereas NGMLR consumes significantly more resources.
Table 2
Results of different methods on real dataset.

| DataSet (Read number) | Aligner | Mapped bases | Mapped reads | CPU time (seconds) | Wall time (seconds) | Peak Memory (GB) |
|---|---|---|---|---|---|---|
| PacBio (304718) | opjMap | 5492704427 | 292604 | 48032 | 990 | 26.3 |
| | minimap2 | 5456246662 | 290099 | 67325 | 1150 | 25.4 |
| | Winnowmap2 | 5251174215 | 280013 | 80924 | 1353 | 40.1 |
| | NGMLR | 4362237072 | 255632 | 321851 | 5134 | 39.2 |
| ONT (62094) | opjMap | 413018134 | 53917 | 691 | 15 | 13.8 |
| | minimap2 | 412818972 | 53665 | 495 | 13 | 15.5 |
| | Winnowmap2 | 403409572 | 52908 | 1119 | 26 | 28.4 |
| | NGMLR | 365331379 | 49943 | 19117 | 395 | 39.0 |
Evaluation on Datasets With Segmental Repeats
To compare the performance of different alignment tools on segmental repeat variations in real-world sequencing data, we used long-read sequencing datasets from the human genomes T2T-CHM13 and HG002[37]. T2T-CHM13, considered the first complete and gapless human reference genome, serves as an ideal benchmark for evaluating and improving genomic alignment and variant calling algorithms. The HG002 dataset, on the other hand, consists of high-quality sequencing data from a real human sample. As existing structural variation benchmark sets lack sufficient information on segmental repetitive variations, we programmatically inserted 2,300 segmental repeat sequences into the T2T-CHM13 reference genome at regions corresponding to the original reads. The length distribution is shown in Supplementary Fig. S3.
Table 3
Comparison of Mappers for Segmental Repeat Detection on a Reference Genome

| Aligner | opjMap | minimap2 | NGMLR | Winnowmap2 |
|---|---|---|---|---|
| Total | 2300 | 2297 | 1705 | 2140 |
| Correct | 1893 | 1878 | 44 | 1450 |
| Acc (%) | 82.3 | 81.76 | 2.58 | 67.46 |
| Sen (%) | 82.3 | 81.65 | 1.91 | 63.04 |
As shown in Table 3, opjMap achieved both an accuracy and a sensitivity of 82.3%, outperforming all other alignment tools. minimap2 followed closely behind, whereas Winnowmap2 was considerably less sensitive and NGMLR correctly aligned very few of the inserted segmental repeats. Notably, segmental repeats occurring within the reference genome are more challenging to detect than those in the reads. Due to its orthogonal projection-based approach, opjMap exhibits higher sensitivity when dealing with a reference genome containing segmental repeats, allowing it to identify a greater number of variations.