opjMap: A Sensitive Mapper for Repetitive Structural Variations in Long Noisy Reads Based on Orthogonal Projection

doi:10.21203/rs.3.rs-7929852/v1

Background

The continuous advancements in single-molecule sequencing (SMS) technologies, including PacBio Single Molecule Real-Time and Oxford Nanopore Technologies (ONT), have led to a significant increase in read lengths. This has unlocked tremendous potential for a wide range of cutting-edge genomic applications. However, these long reads suffer from higher sequencing error rates and contain repetitive segments, making it challenging for most existing alignment tools to effectively map these repetitive regions. Given the crucial role that repetitive variations play in biological evolution, we introduce opjMap, an alignment tool based on orthogonal projection localization, which is specifically designed to align long, noisy SMS reads to a reference sequence while also accommodating repetitive structural variations (SVs).

Results

Through exhaustive benchmark experiments on both simulated and real SMS datasets, we demonstrate that opjMap exhibits higher sensitivity compared to other mainstream alignment tools like minimap2, NGMLR, and Winnowmap2, enabling it to align more reads and bases to the reference genome. Furthermore, opjMap produces a greater number of alignment results under challenging conditions of high error rates and short repetitive segments.

Conclusions

opjMap provides a robust and highly sensitive solution for mapping noisy long reads containing repetitive structural variations. opjMap supports multi-threaded alignment. The source code is publicly available for download at https://github.com/FanXingGuo/opjMap.

High error rate

long-read alignment

orthogonal projection

repetitive variations

segmental duplication

Sequence alignment is a fundamental technique in bioinformatics and serves as the cornerstone for subsequent biological sequence analyses[1–5]. Biological sequences, typically obtained through sequencing technologies, are continuous chains of nucleotides (such as DNA, RNA) or amino acids (such as proteins)[6]. These sequences encode the complete genetic information of an organism and are crucial for essential life activities, including growth, development, and metabolism. The primary purpose of sequence alignment is to determine the similarity between biological sequences, which in turn facilitates the study of species homology and evolutionary processes[7–11].

Third-generation sequencing technologies are characterized by a high error rate, which makes it challenging to directly and accurately align the reads to a reference genome[12]. For this reason, most state-of-the-art third-generation sequence alignment algorithms utilize a seed-and-extend approach[13]. When sequencing errors are present, the overall read may not perfectly match a local region of the reference genome, but the two sequences will share numerous identical short substrings (seeds)[14]. A key principle is that a reference region containing more shared seeds is more likely to be the correct mapping location for the read. Based on the seeding strategy employed, existing third-generation alignment methods can be broadly categorized into two types: dynamic programming-based alignment and voting-based alignment.

Dynamic programming-based alignment methods select regions with a high density of collinear seeds, effectively filtering out irrelevant seeds (noise) to enhance accuracy. These collinear seeds can serve as a basic skeleton for base-to-base alignment, which is why this approach is widely adopted. Several notable algorithms exemplify this strategy. GraphMap[15] utilizes a hash-based indexing technique and a conservative, stepwise filtering strategy for candidate regions, achieving high sensitivity and speed. Minimap2[16] pioneered the use of minimizers—seeds with the smallest hash values within a given window—to construct its index. This approach has demonstrated superior performance in aligning long reads with high error rates. NGMLR[17] employs a convex gap-scoring model to handle gaps between skeleton segments, enabling it to effectively align sequences with minor insertions, deletions, and large-scale structural variations. kngmap[18] focuses on identifying the maximum number of collinear seeds for localization, which allows it to align a greater number of reads and bases. Furthermore, it leverages gap lengths between skeleton seeds to identify structural variation types, demonstrating a capability to align a range of structural variants.

Voting-based alignment methods, in contrast, statistically rank candidate windows by counting the number of shared seeds and selecting the top m windows for further analysis. This approach has lower computational complexity than dynamic programming and provides a more holistic view by considering multiple overlapping regions, but it is more susceptible to including noise. rHAT[19] improves alignment speed and quality by using overlapping windows on the reference genome and extracting k-mers from the read for efficient lookup. lordFAST[20] enhances this approach by considering not only the number of seeds during localization but also their length. By combining hash indexing with the FM-index[21], lordFAST achieves superior performance in both alignment speed and memory utilization.

Despite the development of two main third-generation sequencing alignment approaches—dynamic programming-based and voting-based—to effectively handle long reads with high error rates, these methods still face limitations when confronted with duplication, such as interspersed repeats and segmental duplications. Specifically, dynamic programming algorithms struggle to process overlapping variant skeletons effectively, while voting-based methods are susceptible to noise, leading to reduced alignment sensitivity and quality for duplication.

Duplication plays a crucial role in significant genomic structural changes and is fundamental to biological evolution[22]. However, the complex structure of duplication often compromises the sensitivity of existing third-generation alignment tools in detecting and aligning these variations[23]. Consequently, the development of specialized tools that can leverage the unique characteristics of repeat variation has become a pressing issue in the advancement of third-generation sequencing technologies[24]. To address this, we developed opjMap, a highly sensitive alignment tool based on orthogonal projection. opjMap is capable of aligning more bases and reads under high error rate conditions while also identifying a greater number of repetitive variations. The opjMap workflow primarily consists of five steps. First, an index of minimizers is built for the reference genome. Next, minimizers are extracted from the read, used to query the index to obtain matching anchors, and the positions of reverse-oriented anchors are recalculated. Subsequently, orthogonal projection is used to project the alignment skeleton onto a straight line, which is then partitioned into windows for a voting process. Different types of repetitive variations are then selected using windows of varying sizes. Finally, after further processing for each type of repetitive variation, a detailed alignment is performed to produce the complete alignment result. Experimental results demonstrate that opjMap exhibits higher sensitivity when aligning sequences with moderate-to-high error rates in both PacBio and ONT platform simulations of real-life data, while also being capable of aligning a greater number of repetitive variations.

Overview

opjMap employs an orthogonal projection method to align repetitive variations by projecting matched anchor points onto a straight line, followed by a window-based voting approach. The overall process, as illustrated in Fig. 1, consists of five key steps: (a) Reference Genome Indexing: A hash-based index is constructed for the reference genome to enable efficient lookup of seeds. (b) Generation of Minimizer Anchor Graph: Minimizers are extracted from the query read and used to construct an anchor graph. (c) Orthogonal Projection and Voting: The anchors from the graph are orthogonally projected onto a straight line. This line is then partitioned into windows, and a voting strategy is applied to identify regions with a high density of projected anchors, which serve as alignment candidates. (d) Localization of Repetitive and Non-Repetitive Regions: opjMap employs a refined localization strategy that uses two distinct window sizes, l and pl, to identify repetitive and non-repetitive regions. (e) Refined Alignment and Result Merging: The candidate regions identified in step (d) are further refined by pruning the anchor skeleton. A detailed alignment is then performed based on this skeleton. Finally, the alignment results for both the seeded and non-seeded regions are merged to produce a complete alignment result.

Reference Genome Indexing

To facilitate rapid lookup, a hash-based index is constructed for the reference genome[25]. This process involves extracting minimizers from the reference and storing each minimizer along with its corresponding position in a hash table[26].
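As a minimal sketch of the indexing step, the Python below extracts window minimizers from a reference string and stores their positions in a hash table. The function names, the tiny k and w values, and the use of lexicographic order in place of a hash ordering are illustrative assumptions, not opjMap's actual implementation.

```python
from collections import defaultdict


def minimizers(seq, k=5, w=4):
    """Yield (kmer, position) for the smallest k-mer in each window of w consecutive k-mers.

    Assumption: k-mers are ranked lexicographically; real aligners usually rank by a hash value.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()  # positions already reported, so each minimizer is emitted once
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the leftmost smallest k-mer when there are ties
        offset = min(range(w), key=lambda j: window[j])
        pos = start + offset
        if pos not in picked:
            picked.add(pos)
            yield kmers[pos], pos


def build_index(reference, k=5, w=4):
    """Map each minimizer sequence to the list of its positions on the reference."""
    index = defaultdict(list)
    for kmer, pos in minimizers(reference, k, w):
        index[kmer].append(pos)
    return index


index = build_index("ACGTACGTAC", k=3, w=2)
print(dict(index))
```

Querying the index with a read's minimizers then amounts to one dictionary lookup per minimizer, which is what makes the subsequent anchor generation fast.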

Generation of Minimizer Anchor Graph

Following index construction, minimizers are extracted from the read and used to query the reference index. Each match is recorded as an anchor tuple m_i = (x_i, y_i, d_i), where x_i is the minimizer's position on the reference, y_i is its position on the read, and d_i represents its orientation (0 for forward, 1 for reverse). For reverse-oriented minimizers (as shown in Fig. 1b), the position y_i on the read is recalculated from the length len(r) of the read r according to (1). The recalculated positions for reverse-oriented anchors are illustrated in Fig. 1c.

$$\left\{ \begin{array}{ll} y_i, & d_i = 0 \\ len(r) - y_i, & d_i = 1 \end{array} \right. \tag{1}$$
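The recalculation in (1) can be illustrated with a short Python sketch. The `Anchor` type and the function name are hypothetical; only the rule itself (keep y_i for forward anchors, use len(r) - y_i for reverse anchors) comes from the text.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    x: int  # minimizer position on the reference
    y: int  # minimizer position on the read
    d: int  # orientation: 0 = forward, 1 = reverse


def recalc_reverse(anchor: Anchor, read_len: int) -> Anchor:
    """Apply equation (1): flip the read coordinate of reverse-oriented anchors."""
    if anchor.d == 1:
        return anchor._replace(y=read_len - anchor.y)
    return anchor


# Reverse-strand anchors run "downhill" (x grows while y shrinks) before the flip.
raw = [Anchor(100, 900, 1), Anchor(200, 800, 1), Anchor(300, 700, 1)]
flipped = [recalc_reverse(a, read_len=1000) for a in raw]
print([(a.x, a.y) for a in flipped])
```

After the flip, the three reverse anchors share the same offset x - y, so they lie on an upward 45-degree line just like forward anchors.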

Orthogonal Projection and Voting

After recalculating the positions of reverse-oriented anchors, all collinear skeleton anchors are positioned along a 45-degree upward-sloping line on the anchor graph. To facilitate the voting process, anchors are orthogonally projected onto a 45-degree downward-sloping line using (2). The projected anchors are recorded as proj_i = (projx_i, projy_i, d_i). After orthogonal projection, a linear skeleton approximates a single focal point. The difference in anchor count between windows containing a skeleton and those without one is significantly greater after orthogonal projection than it would be without it. This key characteristic allows us to effectively identify the skeleton-containing windows using non-overlapping windows. The projected line is then partitioned into windows of a fixed length l (default 1000 bp). The number of anchors falling into each window is counted, as illustrated in Fig. 1c, which forms the basis for the subsequent voting strategy.

$$\left[ \begin{array}{c} projx_i \\ projy_i \end{array} \right] = \left[ \begin{array}{cc} \frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} \end{array} \right] \left[ \begin{array}{c} x_i \\ y_i \end{array} \right] \tag{2}$$
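To make the effect of (2) concrete: every anchor on a diagonal x - y = c is sent to the same point (c/2, -c/2), which is why a linear skeleton collapses to a single focal point. The sketch below applies (2) and counts anchors per window. Binning on the projx coordinate and the function names are assumptions made for illustration; the text specifies only the projection matrix and the window length l.

```python
from collections import Counter


def project(x, y):
    """Equation (2): orthogonal projection of (x, y) onto the line y = -x."""
    return (x - y) / 2, (y - x) / 2


def vote(anchors, l=1000):
    """Count projected anchors per non-overlapping window of length l.

    Assumption: the window index is taken from projx; the second coordinate
    is redundant because projy = -projx for every projected point.
    """
    votes = Counter()
    for x, y in anchors:
        projx, _ = project(x, y)
        votes[int(projx // l)] += 1
    return votes


# Three collinear skeleton anchors (x - y = 5000) plus one noise anchor.
anchors = [(5000, 0), (5100, 100), (5250, 250), (9000, 100)]
votes = vote(anchors)
print(votes.most_common())
```

The skeleton anchors all fall into one window while the noise anchor lands elsewhere, which is the contrast the voting step relies on.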


Localization of Repetitive and Non-Repetitive Regions

opjMap employs a refined localization strategy that uses two distinct window sizes, l and pl (where p is a hyperparameter), to identify different types of anchor regions, as shown in Fig. 1d. This process avoids the limitations of a standard sliding window by first ranking all windows on the projected line based on their anchor count. It then selects the top m₁ windows of size l as candidates for repeats across distinct reference regions, which correspond to either non-repetitive alignments or interspersed repeats where the duplicated segment lies outside the read's mapping region on the reference. Subsequently, after removing these selected windows, the algorithm re-evaluates the remaining regions using a larger window size of pl. The top m₂ windows of this size are then chosen as candidates for repeats within a single reference region, which are characteristic of complex structures like segmental duplications. This two-stage, multi-size window selection process effectively distinguishes unique alignment regions from those associated with duplication structural variations.
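One plausible reading of the two-stage selection is sketched below in Python. How the selected windows are removed and how windows of size pl are formed are not detailed in the text, so the grouping of p adjacent small windows into one large window, as well as the function name and default values, are assumptions.

```python
from collections import Counter


def select_windows(votes, m1=2, m2=1, p=3):
    """Two-stage selection over per-window anchor counts.

    Stage 1: take the top m1 windows of size l.
    Stage 2: drop those windows, merge every p adjacent remaining windows
             into one window of size p*l, and take the top m2 of them.
    Returns (small_window_ids, large_window_ids).
    """
    small = [w for w, _ in votes.most_common(m1)]
    remaining = {w: c for w, c in votes.items() if w not in small}

    large_votes = Counter()
    for w, c in remaining.items():
        large_votes[w // p] += c  # window w of size l lies inside large window w // p
    large = [w for w, _ in large_votes.most_common(m2)]
    return small, large


# Windows 2 and 10 hold dense skeletons; windows 6-8 each hold a few anchors
# that only stand out once they are pooled into one larger window.
votes = Counter({2: 40, 10: 35, 6: 9, 7: 8, 8: 7, 20: 10})
small, large = select_windows(votes)
print(small, large)
```

In this toy input, no single window among 6, 7, and 8 beats the isolated window 20, but their pooled count does, which mirrors how a larger window can surface anchors spread over neighbouring diagonals.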

Refining the Alignment Skeleton and Detailed Alignment

Orthogonal projection and voting are initially used to identify regions containing linear skeletons. Since these initial skeletons often contain noise and can be incomplete, we address this by employing a dynamic programming algorithm to further process these regions and construct a refined alignment skeleton[27]. As shown in Fig. 2, this dynamic scoring algorithm can effectively: (a) remove irrelevant anchors (noise), (b) merge skeletons that span multiple windows to form a complete skeleton, and (c) construct skeletons for segmental repeats, which facilitates subsequent detailed alignment.

Overlapping skeleton structures can be complex, and with relatively short fragments in the read, the skeleton information is often sparse, which can compromise the alignment quality. Therefore, opjMap extracts shorter minimizers of length 9 from these regions to construct a more informative and complete skeleton for detailed alignment. After a high-quality alignment skeleton is obtained, opjMap extends it at both ends to ensure the completeness of the alignment region. Finally, a basic alignment algorithm is used for a detailed alignment of the non-seed regions[28], and the results are merged with those of the seed regions to produce a complete alignment (as shown in Fig. 1e).
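The noise-removal role of the dynamic programming step can be illustrated with a generic collinear chaining sketch. This is a textbook quadratic-time chaining with a simple diagonal-drift penalty, not opjMap's scoring function; the penalty weight, the `max_gap` limit, and the function name are all assumptions.

```python
def chain(anchors, max_gap=500, drift_penalty=0.01):
    """Return the highest-scoring collinear chain from a list of (x, y) anchors.

    Each anchor contributes a score of 1; linking two anchors costs
    drift_penalty * |dx - dy|, so anchors far off the diagonal are skipped.
    """
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return []
    score = [1.0] * n
    prev = [-1] * n
    for i in range(n):
        xi, yi = anchors[i]
        for j in range(i):
            xj, yj = anchors[j]
            dx, dy = xi - xj, yi - yj
            # a predecessor must lie strictly before anchor i on both axes, within max_gap
            if dx <= 0 or dy <= 0 or dx > max_gap or dy > max_gap:
                continue
            cand = score[j] + 1.0 - drift_penalty * abs(dx - dy)
            if cand > score[i]:
                score[i], prev[i] = cand, j
    # backtrack from the best-scoring anchor
    best = max(range(n), key=score.__getitem__)
    out = []
    while best != -1:
        out.append(anchors[best])
        best = prev[best]
    return out[::-1]


skeleton = chain([(100, 100), (200, 205), (300, 300), (250, 900), (400, 402)])
print(skeleton)
```

The off-diagonal anchor (250, 900) is dropped from the chain, while the four near-diagonal anchors are kept as the skeleton.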

Overview

To evaluate the performance of opjMap, we conducted a comparative analysis against widely used long-read aligners: minimap2[29], NGMLR[30] and Winnowmap2[31]. All alignment methods were tested on both simulated and real single-molecule sequencing datasets. The experiments were performed on a server running the Ubuntu 22.04 operating system, equipped with 189 GB of RAM and two Intel Xeon E5-2686 v4 processors (2.30 GHz, 16 cores, 32 threads each).

Simulated Data Experiments

Alignment Evaluation for Non-Structural Variation Reads

To evaluate the alignment performance of the different tools, we used PBSIM2[32] to generate simulated reads with known reference positions, thereby enabling a precise comparison of alignment quality. We generated four sets of simulated reads with varying error rates: 10% and 15% to mimic the PacBio platform, and 20% and 30% to represent the ONT platform. The reads were generated from the chromosome 1 sequence of H. sapiens. The commands used for generating these datasets are provided in supplementary file Table S1.

Given the inherent error rate of sequencing reads, a base is considered correctly aligned if its mapped position on the reference genome differs from its true simulated position by no more than w bases (where w = 5). A read is considered correctly aligned if more than 90% of its bases are correctly mapped[33]. Base-level accuracy is defined as the ratio of correctly aligned bases to the total number of aligned bases[34], while base-level sensitivity is the ratio of correctly aligned bases to the total number of bases in the simulated dataset. Similarly, read-level accuracy is the ratio of correctly aligned reads to the total number of aligned reads, and read-level sensitivity is the ratio of correctly aligned reads to the total number of reads in the simulated dataset. The specific commands used for aligning with each tool are provided in supplementary file Table S2. The resulting alignment data are presented in Table 1. The values in parentheses in the Accuracy and Sensitivity columns indicate the difference, in percentage points, relative to opjMap. For example, minimap2's base-level accuracy at a 10% error rate is listed as 95.40 (-0.13), meaning that it is 0.13 percentage points lower than opjMap's 95.53%.
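The definitions above translate directly into a few lines of Python. The function names are illustrative; the thresholds (w = 5, more than 90% of bases) and the read counts used in the example (opjMap, 10% error rate, read level) are taken from the text and Table 1.

```python
def base_is_correct(mapped_pos, true_pos, w=5):
    """A base is correct if it maps within w bases of its simulated position."""
    return abs(mapped_pos - true_pos) <= w


def read_is_correct(pairs, read_len, w=5, min_fraction=0.9):
    """pairs: (mapped_pos, true_pos) for each aligned base of one read.

    The read is correct if more than min_fraction of all its bases are correct.
    """
    correct = sum(base_is_correct(m, t, w) for m, t in pairs)
    return correct > min_fraction * read_len


def accuracy_sensitivity(correct, aligned, total):
    """Return (accuracy %, sensitivity %) from counts of bases or reads."""
    return 100 * correct / aligned, 100 * correct / total


# Read-level counts for opjMap at the 10% error rate (Table 1).
acc, sen = accuracy_sensitivity(correct=216563, aligned=217354, total=241144)
print(f"accuracy={acc:.2f}%  sensitivity={sen:.2f}%")
```

Accuracy penalizes wrong alignments among those reported, whereas sensitivity also penalizes reads that were not aligned at all, which is why the two can rank tools differently.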

At a simulated error rate of 10%, opjMap demonstrated higher base-level and read-level sensitivity than Winnowmap2 and NGMLR and was marginally below minimap2 (by 0.02 and 0.04 percentage points, respectively). At error rates of 15% and 30%, opjMap exhibited the highest sensitivity at both the base and read levels; at 20%, it was essentially tied with minimap2 (89.58% versus 89.60% at the base level, 89.41% versus 89.39% at the read level) and ahead of the other two aligners. Although Winnowmap2 and NGMLR achieved higher base-level accuracy, their read-level accuracy was lower than that of opjMap at every error rate, and minimap2's read-level accuracy exceeded opjMap's only at the 10% error rate. These results collectively suggest that the orthogonal projection-based opjMap offers high sensitivity under moderate to high error rate conditions, enabling it to align a greater number of bases and accurately map more reads.

Alignment Evaluation for Duplications Across Distinct Reference Regions

We evaluated the tools' ability to detect interspersed repeats located outside the read's corresponding mapping region on the reference. We generated sequences containing repeats using a custom script, randomly selecting the strand for each fragment. Unlike PBSIM2, Badread[35] can introduce sequencing errors into a short sequence, simulating its output under various error rates. Using Badread, we added sequencing errors to the fragments (see Supplementary Table S3 for specific commands) and then used a script to select simulated sequences with repetitive variations that met our criteria. Due to the random nature of the simulation, the number of reads in each dataset varied.

Table 1

Results of different methods on the simulated datasets. Values in parentheses are differences, in percentage points, relative to opjMap at the same error rate.
Error rate (number of reads)	Aligner	Aligned bases (M)	Correct bases (M)	Base accuracy (%)	Base sensitivity (%)	Aligned reads	Correct reads	Read accuracy (%)	Read sensitivity (%)
10% (241144)	opjMap	2,347	2,242	95.53	89.94	217354	216563	99.64	89.81
	minimap2	2,350	2,242	95.40(-0.13)	89.96(+0.02)	217318	216660	99.70(+0.06)	89.85(+0.04)
	Winnowmap2	2,273	2,231	98.14(+2.61)	89.50(-0.44)	215791	214798	99.54(-0.10)	89.07(-0.74)
	NGMLR	2,234	2,225	99.57(+4.04)	89.25(-0.69)	216443	214058	98.90(-0.74)	88.77(-1.04)
15% (239772)	opjMap	2,320	2,232	96.22	89.55	215443	213962	99.31	89.24
	minimap2	2,313	2,212	95.61(-0.61)	88.74(-0.81)	214111	211273	98.67(-0.64)	88.11(-1.13)
	Winnowmap2	2,112	2,064	97.72(+1.50)	82.81(-6.74)	198830	195513	98.33(-0.98)	81.54(-7.70)
	NGMLR	2,182	2,170	99.45(+3.23)	87.07(-2.48)	212526	205323	96.61(-2.70)	85.63(-3.61)
20% (127587)	opjMap	2,322	2,233	96.14	89.58	114860	114074	99.32	89.41
	minimap2	2,326	2,233	96.03(-0.11)	89.60(+0.02)	114850	114045	99.30(-0.02)	89.39(-0.02)
	Winnowmap2	2,224	2,205	99.17(+3.03)	88.48(-1.10)	113582	111987	98.60(-0.72)	87.77(-1.64)
	NGMLR	2,221	2,206	99.35(+3.21)	88.52(-1.06)	114446	111828	97.71(-1.61)	87.65(-1.76)
30% (118334)	opjMap	2,258	2,203	97.57	88.40	105809	104154	98.44	88.02
	minimap2	2,216	2,142	96.64(-0.93)	85.93(-2.47)	103884	98286	94.61(-3.83)	83.06(-4.96)
	Winnowmap2	1,386	1,357	97.94(+0.37)	54.46(-33.94)	69471	57197	82.33(-16.11)	48.34(-39.68)
	NGMLR	2,052	2,036	99.25(+1.68)	81.70(-6.70)	101117	90927	89.92(-8.52)	76.84(-11.18)

To select an appropriate error rate for comparison, we first tested the sensitivity and accuracy of different methods for aligning variations in 1000 bp sequences. The results are shown in Supplementary Fig. S1. opjMap demonstrated a significant lead in both accuracy and sensitivity under high error rates, with this gap narrowing only when the error rate dropped to between 15% and 10%. We chose an error rate of 15% to test the alignment of repeats of different lengths. For this test, we set five different lengths for external repetitive variations: 100 bp, 500 bp, 1000 bp, 2500 bp, and 5000 bp.

The average sequence length was 10,000 bp, with 3,000 sequences in each length group. For more detailed information on these two sets of reads with different error rates and lengths, please refer to Supplementary Tables S4 and S5.

Due to the presence of base-level errors in sequencing reads, the position of repeat variations within the reads is affected. The experiment determined whether a repeat variation was detected by verifying if the read's corresponding variation position on the reference genome was aligned multiple times[36]. Specifically, we defined the position and orientation of an aligned read on the reference genome as G = (G_st, G_ed, G_d), and the true position of the variation on the reference as T = (T_st, T_ed, T_d). If the number of non-empty intersections between G and T was greater than or equal to (n + 1), and the alignment orientation was identical, where n is the number of repeats (n = 1), the alignment was considered correct.
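The detection criterion can be written as a short check. The sketch below assumes half-open intervals [st, ed) and hypothetical function names; the rule itself (at least n + 1 same-orientation alignments intersecting the true variation interval) follows the definition above.

```python
def intersects(g, t):
    """g = (G_st, G_ed, G_d), t = (T_st, T_ed, T_d); half-open intervals are assumed."""
    g_st, g_ed, g_d = g
    t_st, t_ed, t_d = t
    same_orientation = g_d == t_d
    overlap = max(g_st, t_st) < min(g_ed, t_ed)
    return same_orientation and overlap


def repeat_detected(alignments, truth, n=1):
    """A repeat is detected if at least n + 1 alignments hit the true interval."""
    hits = sum(intersects(g, truth) for g in alignments)
    return hits >= n + 1


truth = (1000, 2000, 0)
# The full-read alignment covers the region once; the repeated copy covers it again.
print(repeat_detected([(0, 5000, 0), (1200, 1900, 0)], truth))
# A second hit with the wrong orientation does not count.
print(repeat_detected([(0, 5000, 0), (1200, 1900, 1)], truth))
```

With n = 1, a read that aligns only once over the region is not counted as detecting the duplication, even if that single alignment is correct.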

Figure 3 illustrates the accuracy and sensitivity of this detection at different lengths (for specific commands, see Supplementary Table S6, and for numerical values, see Supplementary Table S7). As the length of the repetitive variation region increases, the accuracy and sensitivity of the alignment tools also increase. Because the simulated repeat sequences were all fragments extracted from the reference genome, and their length was around 10,000 bp, most alignment tools were able to align the entire sequence. This resulted in the sensitivity and accuracy of many results being identical. Throughout the length-based experiments, opjMap consistently outperformed other tools, achieving 100% sensitivity and accuracy in detecting repetitive variations when the repeat length was 5000 bp.

Overall, opjMap maintained high accuracy and sensitivity in detecting duplications across distinct reference regions, regardless of variations in length or error rate. This indicates that opjMap is capable of identifying a greater number of inter-regional repetitive variations, even under conditions of high error rates and short variation lengths.

Alignment Evaluation for Duplications within a Single Reference Region

The experiments also tested the detection of repetitive regions located within the reads, with two distinct types of variations: interspersed repeats with a single duplication event and contiguous segmental duplications with multiple repeats.

Single Duplication Event

We evaluated the tools' detection capabilities by fixing the repeat fragment length at 1000 bp and varying the sequencing error rates. The detailed results are shown in Supplementary Fig. S2. At high sequencing error rates, opjMap maintained a high level of performance. As the error rate decreased, the alignment results of other tools began to approach those of opjMap. We then chose an error rate of 15% for the subsequent experiment, which was designed to test the alignment of repeat fragments of different lengths. For this, we generated sequences with this error rate, containing internal repeats of 100 bp, 500 bp, 1000 bp, 2500 bp, and 5000 bp. Detailed information on these two datasets with varying error rates and lengths can be found in Supplementary Tables S8 and S9.

Figure 4 illustrates the accuracy and sensitivity at different fragment lengths, with specific numerical values available in Supplementary Table S10. opjMap showed higher alignment sensitivity and accuracy when the repeat fragments were short. As the length of the repeat fragments increased, the performance of other tools approached that of opjMap. This indicates that opjMap is suitable for detecting interspersed repeats in a wide range of scenarios.

Contiguous Segmental Duplication

For the comparison of segmental duplication detection, a custom script was used to generate sequence fragments of five different lengths (100 bp, 250 bp, 500 bp, 750 bp, and 1000 bp), with each length repeated 10 times. We then introduced sequencing errors at a rate of 15% using the Badread tool. Detailed read information can be found in Supplementary Table S11. Given the high number of repeats, we fixed the fragment length at 1000 bp and initially tested the sensitivity for repeat judgment thresholds (n) of 3, 5, 7, and 10. The results are shown in Fig. 5, with specific values available in Supplementary Table S12. From these results, it can be seen that Winnowmap2 is not well-suited for aligning segmental repeats. In contrast, opjMap maintained high sensitivity as the threshold n increased.

A repeat judgment threshold (n) of 10 was selected to evaluate the performance of alignment tools on repeats. The results are shown in Fig. 6, with specific numerical values available in Supplementary Table S13. As the figure illustrates, opjMap surpassed the other aligners in both accuracy and sensitivity for detecting segmental duplications. opjMap achieves this by extracting shorter sub-fragments of length 9 from overlapping regions. This demonstrates that constructing the alignment skeleton with shorter fragment information can effectively enhance the detection of repeat region information.

Figure 7 presents a comparison of opjMap with three other tools, visualized using the IGV alignment visualization tool. Figure 7a shows the true skeleton anchor graph for a segmental tandem repeat of 500 bp, repeated 10 times. From this, it can be seen that 6 of the repeats are in the forward direction and 4 are in the reverse direction. Figure 7b shows the alignment results for this sequence from all four tools. We can observe that opjMap successfully aligned 4 reverse-oriented and 5 forward-oriented segmental repeats. In comparison, NGMLR aligned 2 reverse-oriented and 4 forward-oriented repeats. Winnowmap2 failed to recognize this repeat region, and minimap2 produced only a small number of alignment results. These findings demonstrate that opjMap is capable of identifying a greater number of segmental repeats, yielding more comprehensive alignment results. This indicates that opjMap possesses a superior ability to align segmental repeat variations.

Real Data Experiments

Evaluation on Datasets Without Segmental Repeats

A comparison of alignment performance on real-world datasets was conducted using sequencing data from two platforms: PacBio and ONT. The PacBio dataset, from A. thaliana, contained approximately 300,000 sequences (304,718 reads), while the ONT dataset, from E. coli, contained approximately 60,000 sequences (62,094 reads). All experiments were run using 64 threads, and the alignment results are presented in Table 2. opjMap aligned the most bases and reads on both platforms. On the PacBio dataset it also used the least CPU and wall-clock time, with a peak memory (26.3 GB) slightly above that of minimap2 (25.4 GB); on the ONT dataset it had the lowest peak memory (13.8 GB) but was slightly slower than minimap2 (15 s versus 13 s wall time). minimap2's performance is close to opjMap's, whereas NGMLR consumes significantly more resources.

Table 2

Results of different methods on real dataset.
DataSet (Read number)	Aligner	Mapped bases	Mapped reads	CPU time (seconds)	Wall time (seconds)	Peak Memory (GB)
PacBio (304718)	opjMap	5492704427	292604	48032	990	26.3
	minimap2	5456246662	290099	67325	1150	25.4
	Winnowmap2	5251174215	280013	80924	1353	40.1
	NGMLR	4362237072	255632	321851	5134	39.2
ONT (62094)	opjMap	413018134	53917	691	15	13.8
	minimap2	412818972	53665	495	13	15.5
	Winnowmap2	403409572	52908	1119	26	28.4
	NGMLR	365331379	49943	19117	395	39.0

Evaluation on Datasets With Segmental Repeats

To compare the performance of different alignment tools on segmental repeat variations in real-world sequencing data, we used long-read sequencing datasets from the human genomes T2T-CHM13 and HG002[37]. T2T-CHM13, considered the first complete and gapless human reference genome, serves as an ideal benchmark for evaluating and improving genomic alignment and variant calling algorithms. The HG002 dataset, on the other hand, consists of high-quality sequencing data from a real human sample. As existing structural variation benchmark sets lack sufficient information on segmental repetitive variations, we programmatically inserted 2,300 segmental repeat sequences into the T2T-CHM13 reference genome at regions corresponding to the original reads. The length distribution is shown in Supplementary Fig. S3.

Table 3

Comparison of Mappers for Segmental Repeat Detection on a Reference Genome
Aligner	opjMap	minimap2	NGMLR	Winnowmap2
Total	2300	2297	1705	2140
Correct	1893	1878	44	1450
Acc (%)	82.30	81.76	2.58	67.46
Sen (%)	82.30	81.65	1.91	63.04

As shown in Table 3, opjMap achieved both an accuracy and a sensitivity of 82.3%, outperforming all other alignment tools. minimap2 followed closely behind, while both NGMLR and Winnowmap2 performed poorly in aligning segmental repetitive variations. Notably, segmental repeats occurring within the reference genome are more challenging to detect than those in the reads. Due to its orthogonal projection-based approach, opjMap exhibits higher sensitivity when dealing with a reference genome containing segmental repeats, allowing it to identify a greater number of variations.

Alignment of repetitive structural variations in long reads with high error rates presents a significant challenge. When aligning such reads to a reference genome, the high error rate often leads to overlapping alignment skeletons, which many existing tools struggle to handle effectively. To overcome this issue, we propose opjMap, an alignment tool based on orthogonal projection. opjMap projects the linear alignment skeleton onto a straight line, enabling highly sensitive localization of the skeleton. This method allows opjMap to identify a greater number of reads on the reference genome. After locating the skeleton, opjMap extracts shorter minimizers from the repetitive regions to gather more detailed alignment information, thereby aligning a greater number of bases and improving overall alignment quality.

opjMap achieves high localization sensitivity while maintaining low computational complexity. Dynamic programming algorithms perform scoring and backtracking on window anchors to select collinear seeds, with an optimized time complexity approaching O(n log n), where n is the number of anchors. In opjMap, by contrast, the number of windows is significantly smaller than the number of anchors, so the method primarily projects and counts each anchor, resulting in a time complexity closer to O(n). After the projection and voting step, opjMap utilizes radix sort to count the anchors within each window, selecting windows with a high vote count as alignment candidates.

However, due to sequencing errors, two linear alignment skeletons within a read can become misaligned, which might lead to them being incorrectly projected into separate windows, thereby reducing read alignment sensitivity. To mitigate this issue, opjMap's projection process strategically increases the window length to place these misaligned skeletons within a single window. While this approach enhances read detection sensitivity, it can make it challenging to identify the specific structural variation information within the window, thus lowering the sensitivity for detecting internal variations. In future work, we plan to develop targeted processing methods for the alignment skeletons within these voted windows to further improve the sensitivity of structural variation alignment.

In this work, we propose a novel orthogonal projection-based voting localization method. This approach effectively avoids introducing excessive noise during the candidate region selection process, thereby satisfying the requirement for selecting collinear seeds. The method significantly reduces computational time complexity, and its use of orthogonal projection effectively filters out noise, which is beneficial for subsequent skeleton construction and detailed alignment. Experimental results demonstrate that our method can align a greater number of reads and bases under moderate-to-high sequencing error rates. Furthermore, it is also capable of aligning a higher number of repetitive variations, confirming its robustness and effectiveness.

Abbreviations

SMS: Single-molecule sequencing

SMRT: Single Molecule Real-Time

ONT: Oxford Nanopore Technologies

SVs: Structural variations

RHT: Regional hash table

FM-index: Full-text Minute-space index

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and material

All data in this paper are available in the supplementary file or from the corresponding author upon reasonable request.

Competing interests

Not applicable.

Funding

This work was supported in part by the Scientific Research General Project of Wuhan Technology And Business University under Grant A2025044 and by the Special Fund of Advantageous and Characteristic Disciplines (Group) of Hubei Province.

Availability of data and materials

The datasets used in this study, along with the corresponding reference genomes, are publicly available from the NCBI and EBI repositories.

Real Datasets: Raw reads from Escherichia coli (ONT platform), Arabidopsis thaliana (PacBio platform), and Homo sapiens (PacBio platform) were obtained from the following sources:

E. coli: https://www.ncbi.nlm.nih.gov/sra/?term=SRR34757056%2F

A. thaliana: https://www.ncbi.nlm.nih.gov/sra/?term=ERR15092965

H. sapiens: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb_20kb_chemistry2/reads/

Reference Genomes: The reference genomes for E. coli, A. thaliana, and H. sapiens can be accessed through these links:

E. coli: https://www.ebi.ac.uk/ena/browser/view/ERX987748

A. thaliana: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001735.4/

H. sapiens: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/

Authors’ contributions

Xing-Guo Fan implemented the alignment tool and drafted the manuscript. Xiao-Dan Zhang revised the manuscript. Cheng-Song Hu analyzed the experimental results. Jie-Jie Zeng and Shu-Rui Li tested the tool. Ze-Gang Wei provided the reference genomes, reads, and computational infrastructure. All authors contributed to the conception and design of the study, discussed the results, and read, edited, and approved the final manuscript.

Acknowledgements

Not applicable.

References

Beran P, et al. KEC: unique sequence search by k-mer exclusion. Bioinformatics. 2021;37(19):btab196.
Charalampous T, et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat Biotechnol. 2019;37(7):783–92.
Wei Z-G, Zhang S-W. DMclust, a density-based modularity method for accurate OTU picking of 16S rRNA sequences. Mol Inf. 2017;36(12):1600059.
Wei Z-G, Zhang S-W. MtHc: a motif-based hierarchical method for clustering massive 16S rRNA sequences into OTUs. Mol BioSyst. 2015;11(7):1907–13.
Smith AD, Xuan Z, Zhang MQ. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics. 2008;9(1):128.
Hedges DJ, et al. Evidence of novel fine-scale structural variation at autism spectrum disorder candidate loci. Mol autism. 2012;3:1–11.
Pan B, et al. Similarities and differences between variants called with human reference genome HG19 or HG38. BMC Bioinformatics. 2019;20:17–29.
Ferragina P, Manzini G. Opportunistic data structures with applications. In: Symposium on Foundations of Computer Science; 2000.
Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform. 2010;11(5):473.
Zhang H, et al. Fast and efficient short read mapping based on a succinct hash index. BMC Bioinformatics. 2018;19(1):92.
Kaur H, Chand L. Biological sequence alignment using varied optimization algorithms. International Conference on Inventive Computation Technologies. Berlin: Springer; 2016. pp. 1–5.
Xu X, et al. SLPal: Accelerating long sequence alignment on many-core and multi-core architectures. 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2020. pp. 2242–9.
Poerba YS, Martanti D. Genetic variability of Amorphophallus muelleri Blume in Java based on random amplified polymorphic DNA. Biodiversitas J Biol Divers. 2008;9(4).
Savage DG, et al. Clinical features at diagnosis in 430 patients with chronic myeloid leukaemia seen at a referral centre over a 16-year period. Br J Haematol. 1997;96(1):111–6.
Sović I, et al. Fast and sensitive mapping of nanopore sequencing reads with GraphMap. Nat Commun. 2016;7:11307.
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–100.
Sedlazeck FJ, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15(6).
Wei ZG, et al. kngMap: sensitive and fast mapping algorithm for noisy long reads based on the K-Mer neighborhood graph. Front Genet. 2022;13:890651.
Liu B, et al. rHAT: fast alignment of noisy long reads with regional hashing. Bioinformatics. 2016;32(11):1625–31.
Haghshenas E, et al. lordFAST: sensitive and fast alignment search tool for long noisy read sequencing data. Bioinformatics. 2019;35(1):20–7.
Lippert RA. Space-efficient whole genome comparisons with Burrows–Wheeler transforms. J Comput Biol. 2005;12(4):407–15.
Takahashi KK, Innan H. Duplication with structural modification through extrachromosomal circular and lariat DNA in the human genome. Sci Rep. 2020;10(1):7150.
Rasko DA, et al. Origins of the E. coli strain causing an outbreak of hemolytic–uremic syndrome in Germany. N Engl J Med. 2011;365(8):709–17.
Murray IA, et al. The methylomes of six bacteria. Nucleic Acids Res. 2012;40(22):11450–62.
Ning Z, et al. SSAHA: a fast search method for large DNA databases. Genome Res. 2001;11(10):1725–9.
Roberts M, et al. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004;20:3363–9.
Liu B, et al. deBGA: read alignment with de Bruijn graph-based seed and extension. Bioinformatics. 2016;32(21):3224–32.
Wei ZG, et al. invMap: a sensitive mapping tool for long noisy reads with inversion structural variants. Bioinformatics. 2023;39(12):btad726.
Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–100.
Sedlazeck FJ, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15(6).
Jain C, Rhie A, Hansen NF, et al. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods. 2022;19:705–10.
Ono Y, Asai K, Hamada M. PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores. Bioinformatics. 2020.
Wei Z-G, Zhang S-W. NPBSS: a new PacBio sequencing simulator for generating the continuous long reads with an empirical model. BMC Bioinformatics. 2018;19(1):177.
Wei Z-G, Zhang S-W, Liu F. smsMap: mapping single molecule sequencing reads by locating the alignment starting positions. BMC Bioinformatics. 2020;21(1):341.
Wick RR. Badread: simulation of error-prone long reads. J Open Source Softw. 2019;4(36):1316.
Wei ZG, et al. invMap: a sensitive mapping tool for long noisy reads with inversion structural variants. Bioinformatics. 2023;39(12):btad726.
Vollger MR, et al. Segmental duplications and their variation in a complete human genome. Science. 2022;376:eabj6965.

No competing interests reported.

supplymentaryfile.docx
