For model evaluation, we used the dataset introduced in the FusionAI study11, comprising approximately 26K fusion-positive and 26K fusion-negative sequences. Of the ~ 52K sequences in total, we used ~ 36K (~ 18K positive and ~ 18K negative) for training, identical to the training set in the original study. The remaining ~ 16K sequences were evenly divided into validation (~ 8K) and test (~ 8K) sets, while preserving the same ratio of positive and negative samples. We kept all data partitions consistent across experiments to ensure a fair comparison between models.
Each data sample consisted of a 10 kbp sequence surrounding the fusion breakpoint in gene 1 (sequence 1) and gene 2 (sequence 2). Each sequence was encoded using four widely used genomics foundation models: Nucleotide Transformer22, HyenaDNA24, Evo225, and DNABERT217. For each model, we tokenized the sequences with the corresponding tokenizer and extracted embeddings from the hidden states of an author-recommended internal layer, specifically:
Nucleotide Transformer (NT)
For embedding generation, the Nucleotide Transformer 500M_multi_species_v2 model was used. The Nucleotide Transformer was configured with a maximum sequence length of 1,671 tokens, and 1024-dimensional embeddings were extracted from the 20th transformer layer (out of 24 total layers). Given the model's 6-mer tokenization scheme, this corresponds to a receptive field of approximately 10,026 bp, which fully encompasses the 10 kbp input sequences centered on the gene fusion breakpoints. This configuration allowed the model to process the entire region of interest in a single pass without the need for truncation or sliding-window strategies.
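As an illustration, the following minimal sketch shows how per-token embeddings from an intermediate layer can be extracted with the Hugging Face transformers API. The checkpoint identifier and the helper function are assumptions for illustration and may differ from the exact pipeline used.

```python
# Minimal sketch: layer-20 token embeddings from Nucleotide Transformer v2 (500M, multi-species).
# The checkpoint identifier below is an assumed Hugging Face name; adjust to the checkpoint actually used.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID, trust_remote_code=True).eval()

def nt_embed(sequence: str, layer: int = 20) -> torch.Tensor:
    """Return per-token hidden states (n_tokens x 1024) from the chosen transformer layer."""
    inputs = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=1671)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding layer; index `layer` selects the output of the 20th block.
    return out.hidden_states[layer].squeeze(0)
```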
Evo2 (Evo)
For the Evo2 model, sequences were encoded using the evo2_7b checkpoint. Unlike k-mer-based approaches, Evo2 uses a byte-level tokenizer in which each nucleotide character is mapped to its corresponding UTF-8 integer value, so each base corresponds to a single token and single-base resolution is preserved. Embeddings were extracted from the blocks.28.mlp.l3 internal layer, where the embedding dimension was 4096. Leveraging the model's StripedHyena architecture designed for long-range genomic modeling, we processed the input sequences at their full 10 kbp length (10,000 tokens) without truncation or downsampling.
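The byte-level tokenization and the extraction of activations from a named internal layer can be sketched with plain PyTorch as follows; the model object is assumed to be the loaded evo2_7b network, and the forward call is illustrative of the general pattern rather than the exact Evo2 API.

```python
# Sketch: UTF-8 byte-level tokenization and a forward hook on a named internal layer.
# `model` is assumed to be the loaded evo2_7b network as a torch.nn.Module; the forward
# call below is illustrative and may differ from the actual Evo2 interface.
import torch

def byte_tokenize(sequence: str) -> torch.Tensor:
    """Map each nucleotide character to its UTF-8 integer value (one token per base)."""
    return torch.tensor([ord(c) for c in sequence], dtype=torch.long).unsqueeze(0)

def extract_from_named_layer(model: torch.nn.Module, input_ids: torch.Tensor,
                             layer_name: str = "blocks.28.mlp.l3") -> torch.Tensor:
    """Capture the activations of the named submodule during a single forward pass."""
    captured = {}

    def hook(module, inputs, output):
        captured["h"] = output.detach()

    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook)
    with torch.no_grad():
        model(input_ids)
    handle.remove()
    return captured["h"]  # expected shape: (1, 10000, 4096) for a 10 kbp input
```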
HyenaDNA (Hyena)
The “hyenadna-large-1m-seqlen-hf” model, which supports a context length of up to 1 million tokens, was used; it employs a character-level tokenizer where each nucleotide corresponds to a single token. HyenaDNA's sub-quadratic operator allowed efficient processing of the full input at single-nucleotide resolution without the need for downsampling or token aggregation. Embeddings were extracted from the model's final hidden layer, yielding one 256-dimensional embedding per token.
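A minimal extraction sketch, assuming the Hugging Face release of the checkpoint; the exact output format of the remote-code model class may differ, so the output indexing below is an assumption.

```python
# Sketch: character-level tokenization and final-hidden-layer extraction for HyenaDNA.
# Assumes the Hugging Face checkpoint below; trust_remote_code loads the custom model class.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "LongSafari/hyenadna-large-1m-seqlen-hf"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()

def hyena_embed(sequence: str) -> torch.Tensor:
    """One token per nucleotide; returns (n_tokens, 256) embeddings from the final hidden layer."""
    input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        outputs = model(input_ids)
    return outputs[0].squeeze(0)  # first output element holds the per-token hidden states
```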
DNABERT2 (BERT)
For the BERT-based approach, the zhihan1996/DNABERT-2-117M model was utilized. DNABERT-2 employs Byte Pair Encoding (BPE), which results in variable token counts for fixed-length DNA sequences; sequences were tokenized with the model's BPE tokenizer. To enable consistent batch processing and preserve spatial alignment, we standardized all input tensors to a fixed length of 2,143 tokens, a threshold empirically determined to accommodate the maximum tokenized length of any 10 kbp sequence in our dataset. We implemented a symmetric padding strategy, adding special padding tokens to both ends of shorter sequences. The final embeddings were extracted from the hidden states of the last transformer layer, where the embedding dimension was 768.
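The fixed-length symmetric padding can be sketched as follows; the padding helper is an assumption for illustration, while the model identifier and the 2,143-token target length come from the description above.

```python
# Sketch: DNABERT-2 BPE tokenization with symmetric padding to a fixed 2,143-token length.
# The symmetric-padding helper is illustrative; in practice an attention mask marking the
# padded positions would also be passed to the model.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "zhihan1996/DNABERT-2-117M"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()
FIXED_LEN = 2143

def symmetric_pad(ids: list[int], pad_id: int, length: int = FIXED_LEN) -> list[int]:
    """Pad equally on both ends so the breakpoint-centred alignment is preserved."""
    total = length - len(ids)
    return [pad_id] * (total // 2) + ids + [pad_id] * (total - total // 2)

def dnabert2_embed(sequence: str) -> torch.Tensor:
    ids = tokenizer(sequence)["input_ids"]
    input_ids = torch.tensor([symmetric_pad(ids, tokenizer.pad_token_id)])
    with torch.no_grad():
        hidden = model(input_ids)[0]  # last-layer hidden states, (1, 2143, 768)
    return hidden.squeeze(0)
```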
From each sequence, only the embedding of the middle token was used, which corresponds to the fusion breakpoint. Due to the contextual nature of the embeddings, this central embedding also encodes information from the surrounding sequence. We concatenated the middle embeddings from both sequences and used the resulting vector for classification.
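In code, this step reduces to selecting the central row of each token-level embedding matrix and concatenating the two vectors; the function below is a minimal sketch of that operation.

```python
import numpy as np

def breakpoint_feature(emb_seq1: np.ndarray, emb_seq2: np.ndarray) -> np.ndarray:
    """Take the middle (breakpoint-aligned) token embedding from each (n_tokens x dim)
    matrix and concatenate them into a single feature vector for classification."""
    mid1 = emb_seq1[emb_seq1.shape[0] // 2]
    mid2 = emb_seq2[emb_seq2.shape[0] // 2]
    return np.concatenate([mid1, mid2])
```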
We visualized embedding quality using t-distributed Stochastic Neighbor Embedding (t-SNE) with perplexity = 30 and 1,000 iterations on a random subset of 1,000 samples. Class separability was qualitatively assessed by visual inspection of the 2D projections.
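A minimal sketch of this projection with scikit-learn, assuming X holds the concatenated breakpoint embeddings:

```python
# Sketch: 2-D t-SNE projection of a random 1,000-sample subset of the embedding matrix X.
import numpy as np
from sklearn.manifold import TSNE

def tsne_project(X: np.ndarray, n: int = 1000, seed: int = 42):
    """Project a random subset of embedding vectors X (n_samples x dim) to two dimensions."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n, len(X)), replace=False)
    proj = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=seed).fit_transform(X[idx])
    return proj, idx  # colour `proj` by the class labels of the selected samples for inspection
```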
Classification
For classification, we used two classifiers following the FusionAI article11: (i) a support vector machine (SVM) with an RBF kernel, C = 1.0, and γ automatically computed as 1/(n_features × variance). Hyperparameters were not tuned via cross-validation; fixed values were used across all experiments. (ii) A fully connected neural network (NN) with an architecture adopted from the FusionAI article11. We also reimplemented the entire CNN architecture described in FusionAI11 to allow for a direct comparison (see Fig. 1). The NN classifier consists of a feedforward network with a single hidden layer of 32 neurons with ReLU activation, followed by dropout (p = 0.4) and a softmax output layer for binary classification. The model was compiled using the Adadelta optimizer with categorical cross-entropy loss. Training was performed with a batch size of 256, and the neural networks were trained for up to 1,000 epochs to ensure training stability and convergence.
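The two classifiers can be sketched as follows with scikit-learn and Keras; the variable names and the training call in the final comment are illustrative.

```python
# Sketch of the two classifiers: an RBF-kernel SVM with fixed hyperparameters and a
# feedforward network with one hidden layer (32 ReLU units), dropout 0.4, and softmax output.
import keras
from sklearn.svm import SVC

# (i) SVM: C = 1.0, gamma = "scale", i.e. 1 / (n_features * X.var())
svm = SVC(kernel="rbf", C=1.0, gamma="scale")

# (ii) Neural network trained on the concatenated breakpoint embeddings
def build_nn(input_dim: int) -> keras.Model:
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dropout(0.4),
        keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95, epsilon=1e-7),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Example usage: nn = build_nn(2 * 1024); nn.fit(X_train, y_train_onehot, batch_size=256, epochs=1000)
```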
Evaluation metrics
We evaluated model performance on the test set using accuracy (percentage of correct predictions), precision (weighted average across classes), recall (weighted average across classes), F1 score (weighted average), and area under the ROC curve (AUC-ROC, macro-averaged for multi-class settings). For neural networks, we report the final-epoch performance; for SVMs, the results of the single training run.
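A minimal sketch of the metric computation with scikit-learn, assuming y_score holds the predicted probability of the fusion-positive class:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

def evaluate(y_true, y_pred, y_score) -> dict:
    """Weighted precision/recall/F1 and ROC AUC on the test set."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc_roc": roc_auc_score(y_true, y_score),
    }
```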
Sample efficiency
To assess sample efficiency, we trained models on 19 stratified training subsets ranging from 200 to 36,302 samples. We fit logarithmic functions (y = a·log(x) + b) to the resulting learning curves and computed: (1) the sample size required to reach 95% of final accuracy (Samples@95%), obtained by inverting the fitted curve: x₉₅ = exp((0.95·y_final − b) / a), and (2) a logarithmic efficiency score (Efficiency@95%), defined as y_final / ln(x₉₅), which quantifies accuracy achieved per natural-logarithm unit of training data. Higher efficiency scores indicate models that reach high performance with logarithmically fewer samples. Accuracy values were normalized to 0-100% of each model's maximum performance for cross-model visualization.
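The fit and the two derived quantities can be sketched as follows, where sizes and accuracies hold the 19 subset sizes and the corresponding test accuracies:

```python
# Sketch: fit y = a*log(x) + b to a learning curve and derive Samples@95% and Efficiency@95%.
import numpy as np

def sample_efficiency(sizes: np.ndarray, accuracies: np.ndarray) -> dict:
    a, b = np.polyfit(np.log(sizes), accuracies, deg=1)  # least-squares fit of a*ln(x) + b
    y_final = accuracies[-1]                             # accuracy at the largest subset size
    x95 = np.exp((0.95 * y_final - b) / a)               # samples needed for 95% of final accuracy
    efficiency = y_final / np.log(x95)                   # accuracy per natural-log unit of data
    return {"a": a, "b": b, "samples_at_95": x95, "efficiency_at_95": efficiency}
```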
Implementation details
All models were implemented in Python 3.11 using Keras 3.0 with PyTorch backend. Neural network training used the Adadelta optimizer with default Keras parameters (learning_rate = 1.0, rho = 0.95, epsilon = 1e-07). SVM models were trained using scikit-learn 1.4 with default parameters unless otherwise specified. Random sampling and data splitting used a fixed seed (42) to ensure reproducibility across different training set sizes. Training was performed on NVIDIA H100 GPUs with 96GB memory. All source code and notebooks with results used for comparison are available at https://github.com/kbi-fbmi/articles--2026fusionEmbBenchmark/. The data and results files are available on Zenodo26.