For model evaluation, we used the dataset introduced in the FusionAI study11, comprising approximately 26K fusion-positive and 26K fusion-negative sequences. Of the ~ 52K sequences in total, we used ~ 36K (~ 18K positive and ~ 18K negative) for training, identical to the training set in the original study. The remaining ~ 16K sequences were evenly divided into validation (~ 8K) and test (~ 8K) sets, while preserving the same ratio of positive and negative samples. We kept all data partitions consistent across experiments to ensure a fair comparison between models.
Each data sample consisted of a 10 kbp sequence surrounding the fusion breakpoint in gene 1 (sequence 1) and gene 2 (sequence 2). Each sequence was encoded using four widely used genomics foundation models: Nucleotide Transformer22, HyenaDNA24, Evo225, and DNABERT217. For each model, we tokenized the sequences with the corresponding tokenizer and extracted embeddings from the hidden states of an author-recommended internal layer, specifically:
Nucleotide Transformer (NT)
For embedding generation, the Nucleotide Transformer 500M_multi_species_v2 model was used. The Nucleotide Transformer was configured with a maximum sequence length of 1,671 tokens, and 1024-dimensional embeddings were extracted from the 20th transformer layer (out of 24 total layers). Given the model's 6-mer tokenization scheme, this corresponds to a receptive field of approximately 10,026 bp, which fully encompasses the 10 kbp input sequences centered on the gene fusion breakpoints. This configuration allowed the model to process the entire region of interest in a single pass without the need for truncation or sliding-window strategies.
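As an illustration, the following minimal sketch shows how per-token embeddings from an intermediate layer can be extracted with the Hugging Face transformers API. The checkpoint identifier and the helper function are assumptions for illustration and may differ from the exact pipeline used.

```python
# Minimal sketch: layer-20 token embeddings from Nucleotide Transformer v2 (500M, multi-species).
# The checkpoint identifier below is an assumed Hugging Face name; adjust to the checkpoint actually used.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "InstaDeepAI/nucleotide-transformer-v2-500m-multi-species"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID, trust_remote_code=True).eval()

def nt_embed(sequence: str, layer: int = 20) -> torch.Tensor:
    """Return per-token hidden states (n_tokens x 1024) from the chosen transformer layer."""
    inputs = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=1671)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding layer; index `layer` selects the output of the 20th block.
    return out.hidden_states[layer].squeeze(0)
```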
Evo2 (Evo)
For the Evo2 model, sequences were encoded using the evo2_7b checkpoint. Unlike k-mer-based approaches, Evo2 uses a byte-level tokenizer in which each nucleotide character is mapped to its corresponding UTF-8 integer value, so each base corresponds to a single token and single-base resolution is preserved. Embeddings were extracted from the blocks.28.mlp.l3 internal layer, where the embedding dimension was 4096. Leveraging the model's StripedHyena architecture designed for long-range genomic modeling, we processed the input sequences at their full 10 kbp length (10,000 tokens) without truncation or downsampling.
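The byte-level tokenization and the extraction of activations from a named internal layer can be sketched with plain PyTorch as follows; the model object is assumed to be the loaded evo2_7b network, and the forward call is illustrative of the general pattern rather than the exact Evo2 API.

```python
# Sketch: UTF-8 byte-level tokenization and a forward hook on a named internal layer.
# `model` is assumed to be the loaded evo2_7b network as a torch.nn.Module; the forward
# call below is illustrative and may differ from the actual Evo2 interface.
import torch

def byte_tokenize(sequence: str) -> torch.Tensor:
    """Map each nucleotide character to its UTF-8 integer value (one token per base)."""
    return torch.tensor([ord(c) for c in sequence], dtype=torch.long).unsqueeze(0)

def extract_from_named_layer(model: torch.nn.Module, input_ids: torch.Tensor,
                             layer_name: str = "blocks.28.mlp.l3") -> torch.Tensor:
    """Capture the activations of the named submodule during a single forward pass."""
    captured = {}

    def hook(module, inputs, output):
        captured["h"] = output.detach()

    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook)
    with torch.no_grad():
        model(input_ids)
    handle.remove()
    return captured["h"]  # expected shape: (1, 10000, 4096) for a 10 kbp input
```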
HyenaDNA (Hyena)
The “hyenadna-large-1m-seqlen-hf” model, which supports a context length of up to 1 million tokens, was used; it employs a character-level tokenizer where each nucleotide corresponds to a single token. HyenaDNA's sub-quadratic operator allowed efficient processing of the full input at single-nucleotide resolution without the need for downsampling or token aggregation. Embeddings were extracted from the model's final hidden layer, yielding one 256-dimensional embedding per token.
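A minimal extraction sketch, assuming the Hugging Face release of the checkpoint; the exact output format of the remote-code model class may differ, so the output indexing below is an assumption.

```python
# Sketch: character-level tokenization and final-hidden-layer extraction for HyenaDNA.
# Assumes the Hugging Face checkpoint below; trust_remote_code loads the custom model class.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "LongSafari/hyenadna-large-1m-seqlen-hf"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()

def hyena_embed(sequence: str) -> torch.Tensor:
    """One token per nucleotide; returns (n_tokens, 256) embeddings from the final hidden layer."""
    input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        outputs = model(input_ids)
    return outputs[0].squeeze(0)  # first output element holds the per-token hidden states
```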
DNABERT2 (BERT)
For the BERT-based approach, the zhihan1996/DNABERT-2-117M model was utilized. DNABERT-2 employs Byte Pair Encoding (BPE), which results in variable token counts for fixed-length DNA sequences; sequences were tokenized with the model's BPE tokenizer. To enable consistent batch processing and preserve spatial alignment, we standardized all input tensors to a fixed length of 2,143 tokens, a threshold empirically determined to accommodate the maximum tokenized length of any 10 kbp sequence in our dataset. We implemented a symmetric padding strategy, adding special padding tokens to both ends of shorter sequences. The final embeddings were extracted from the hidden states of the last transformer layer, where the embedding dimension was 768.
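The fixed-length symmetric padding can be sketched as follows; the padding helper is an assumption for illustration, while the model identifier and the 2,143-token target length come from the description above.

```python
# Sketch: DNABERT-2 BPE tokenization with symmetric padding to a fixed 2,143-token length.
# The symmetric-padding helper is illustrative; in practice an attention mask marking the
# padded positions would also be passed to the model.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "zhihan1996/DNABERT-2-117M"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()
FIXED_LEN = 2143

def symmetric_pad(ids: list[int], pad_id: int, length: int = FIXED_LEN) -> list[int]:
    """Pad equally on both ends so the breakpoint-centred alignment is preserved."""
    total = length - len(ids)
    return [pad_id] * (total // 2) + ids + [pad_id] * (total - total // 2)

def dnabert2_embed(sequence: str) -> torch.Tensor:
    ids = tokenizer(sequence)["input_ids"]
    input_ids = torch.tensor([symmetric_pad(ids, tokenizer.pad_token_id)])
    with torch.no_grad():
        hidden = model(input_ids)[0]  # last-layer hidden states, (1, 2143, 768)
    return hidden.squeeze(0)
```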
From each sequence, only the embedding of the middle token was used, which corresponds to the fusion breakpoint. Due to the contextual nature of the embeddings, this central embedding also encodes information from the surrounding sequence. We concatenated the middle embeddings from both sequences and used the resulting vector for classification.
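In code, this step reduces to selecting the central row of each token-level embedding matrix and concatenating the two vectors; the function below is a minimal sketch of that operation.

```python
import numpy as np

def breakpoint_feature(emb_seq1: np.ndarray, emb_seq2: np.ndarray) -> np.ndarray:
    """Take the middle (breakpoint-aligned) token embedding from each (n_tokens x dim)
    matrix and concatenate them into a single feature vector for classification."""
    mid1 = emb_seq1[emb_seq1.shape[0] // 2]
    mid2 = emb_seq2[emb_seq2.shape[0] // 2]
    return np.concatenate([mid1, mid2])
```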
We visualized embedding quality using t-distributed Stochastic Neighbor Embedding (t-SNE) with perplexity = 30 and 1,000 iterations on a random subset of 1,000 samples. Class separability was qualitatively assessed by visual inspection of the 2D projections.
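A minimal sketch of this projection with scikit-learn, assuming X holds the concatenated breakpoint embeddings:

```python
# Sketch: 2-D t-SNE projection of a random 1,000-sample subset of the embedding matrix X.
import numpy as np
from sklearn.manifold import TSNE

def tsne_project(X: np.ndarray, n: int = 1000, seed: int = 42):
    """Project a random subset of embedding vectors X (n_samples x dim) to two dimensions."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n, len(X)), replace=False)
    proj = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=seed).fit_transform(X[idx])
    return proj, idx  # colour `proj` by the class labels of the selected samples for inspection
```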
Classification
For classification, we used two classifiers following the FusionAI article11: (i) a support vector machine (SVM) with an RBF kernel, C = 1.0, and γ automatically computed as 1/(n_features × variance). Hyperparameters were not tuned via cross-validation; fixed values were used across all experiments. (ii) A fully connected neural network (NN) with an architecture adopted from the FusionAI article11. We also reimplemented the entire CNN architecture described in FusionAI11 to allow for a direct comparison (see Fig. 1). The NN classifier consists of a feedforward network with a single hidden layer of 32 neurons with ReLU activation, followed by dropout (p = 0.4) and a softmax output layer for binary classification. The model was compiled using the Adadelta optimizer with categorical cross-entropy loss. Training was performed with a batch size of 256, and the neural networks were trained for up to 1,000 epochs to ensure training stability and convergence.
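The two classifiers can be sketched as follows with scikit-learn and Keras; the variable names and the training call in the final comment are illustrative.

```python
# Sketch of the two classifiers: an RBF-kernel SVM with fixed hyperparameters and a
# feedforward network with one hidden layer (32 ReLU units), dropout 0.4, and softmax output.
import keras
from sklearn.svm import SVC

# (i) SVM: C = 1.0, gamma = "scale", i.e. 1 / (n_features * X.var())
svm = SVC(kernel="rbf", C=1.0, gamma="scale")

# (ii) Neural network trained on the concatenated breakpoint embeddings
def build_nn(input_dim: int) -> keras.Model:
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dropout(0.4),
        keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95, epsilon=1e-7),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Example usage: nn = build_nn(2 * 1024); nn.fit(X_train, y_train_onehot, batch_size=256, epochs=1000)
```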
Evaluation metrics
We evaluated model performance on the test set using accuracy (percentage of correct predictions), precision (weighted average across classes), recall (weighted average across classes), F1 score (weighted average), and area under the ROC curve (AUC-ROC, macro-averaged for multi-class settings). For neural networks, we report the final-epoch performance; for SVMs, the results of the single training run.
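A minimal sketch of the metric computation with scikit-learn, assuming y_score holds the predicted probability of the fusion-positive class:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score

def evaluate(y_true, y_pred, y_score) -> dict:
    """Weighted precision/recall/F1 and ROC AUC on the test set."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc_roc": roc_auc_score(y_true, y_score),
    }
```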
Sample efficiency
To assess sample efficiency, we trained models on 19 stratified training subsets ranging from 200 to 36,302 samples. We fit logarithmic functions (y = a·log(x) + b) to the resulting learning curves and computed: (1) the sample size required to reach 95% of final accuracy (Samples@95%), obtained by inverting the fitted curve: x₉₅ = exp((0.95·y_final − b) / a), and (2) a logarithmic efficiency score (Efficiency@95%), defined as y_final / ln(x₉₅), which quantifies accuracy achieved per natural-logarithm unit of training data. Higher efficiency scores indicate models that reach high performance with logarithmically fewer samples. Accuracy values were normalized to 0-100% of each model's maximum performance for cross-model visualization.
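The fit and the two derived quantities can be sketched as follows, where sizes and accuracies hold the 19 subset sizes and the corresponding test accuracies:

```python
# Sketch: fit y = a*log(x) + b to a learning curve and derive Samples@95% and Efficiency@95%.
import numpy as np

def sample_efficiency(sizes: np.ndarray, accuracies: np.ndarray) -> dict:
    a, b = np.polyfit(np.log(sizes), accuracies, deg=1)  # least-squares fit of a*ln(x) + b
    y_final = accuracies[-1]                             # accuracy at the largest subset size
    x95 = np.exp((0.95 * y_final - b) / a)               # samples needed for 95% of final accuracy
    efficiency = y_final / np.log(x95)                   # accuracy per natural-log unit of data
    return {"a": a, "b": b, "samples_at_95": x95, "efficiency_at_95": efficiency}
```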
Implementation details
All models were implemented in Python 3.11 using Keras 3.0 with PyTorch backend. Neural network training used the Adadelta optimizer with default Keras parameters (learning_rate = 1.0, rho = 0.95, epsilon = 1e-07). SVM models were trained using scikit-learn 1.4 with default parameters unless otherwise specified. Random sampling and data splitting used a fixed seed (42) to ensure reproducibility across different training set sizes. Training was performed on NVIDIA H100 GPUs with 96GB memory. All source code and notebooks with results used for comparison are available at https://github.com/kbi-fbmi/articles--2026fusionEmbBenchmark/. The data and results files are available on Zenodo26.