Data collection
This study utilizes open-source peptide sequence datasets covering 20 bioactivity types (24 datasets in total) to train and evaluate the RNN and BERT modeling methods: antihypertensive, antioxidant, dipeptidyl peptidase IV (DPP IV) inhibitory, bitter, umami, antimicrobial, antimalarial, quorum sensing, anticancer, anti-MRSA, tumor T cell antigen, blood-brain barrier, antiparasitic, neuropeptide, antibacterial, antifungal, antiviral, toxicity, anti-coronavirus, and interleukin-6 inducing activity. Table 1 provides an overview of these open-source datasets. By data volume, 11 datasets contain more than 2000 samples, 5 contain 1000–2000, and 8 contain fewer than 1000; 15 datasets are balanced, with equal numbers of positive and negative samples, and 9 are imbalanced. To ensure a fair comparison with existing methods, the training, validation, and independent test sets are identical to those used in the original papers, and the same numbers of cross-validation folds are applied. In addition, 20 fair external independent test sets, one per bioactivity type, were constructed from peptides newly reported in 2023–2025 and absent from the aforementioned 24 datasets (Table 1), in order to assess the models' generalization ability. The ACEiPs dataset was constructed by deduplicating and balancing the ACEiPP dataset [12], the AHTpin dataset [32], and newly reported ACE-inhibitory peptides (104 positive and 98 negative samples), yielding 1181 positive and 1181 negative samples in total (Table 1). The ACEiPs dataset was then randomly split into a benchmark dataset (ACEiPs_benchmark) and an independent test set (ACEiPs_test) at a 7:3 ratio.
Additionally, two large peptide sequence datasets of unknown activity were constructed to assess the models' ability to pre-screen candidate peptides at scale. The first, named Sequence-Library 1, contains 3,768,340 peptide sequences retrieved from the UniProt database using the search term ‘(length:[2 TO 50])’. The second, named Sequence-Library 2, consists of 2,321,342 non-redundant peptide sequences of length 2–20 from 21,249 food-derived proteins, downloaded from the ACEiPP database [12].
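As a minimal sketch of this construction step, assuming the deduplicated positive and negative pools are stored one peptide per line (the file names below are hypothetical), the class-wise 7:3 split could be reproduced as follows:

```python
import random

def load_unique(path):
    """Read one peptide per line and drop duplicates while keeping order."""
    with open(path) as fh:
        return list(dict.fromkeys(line.strip().upper() for line in fh if line.strip()))

# Hypothetical file names; the merged ACEiPP/AHTpin/new-peptide pools are assumed.
positives = load_unique("aceips_positive.txt")
negatives = load_unique("aceips_negative.txt")

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(positives)
random.shuffle(negatives)

def split_73(seqs):
    """Split a sequence list 7:3 into benchmark and independent test subsets."""
    cut = int(len(seqs) * 0.7)
    return seqs[:cut], seqs[cut:]

pos_bench, pos_test = split_73(positives)
neg_bench, neg_test = split_73(negatives)
```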
Recurrent neural network modelling module
The peptide sequence, based on the single-letter codes of the 20 natural amino acids (R, K, N, D, Q, E, H, P, Y, W, S, T, G, A, M, C, F, L, V, I), can be represented as:
$$\text{Peptide}=\left[A_1,A_2,\ldots,A_i\right] \quad (1)$$
where \(A_i\) denotes the amino acid residue at position i of the peptide sequence. Peptide sequences were then transformed into feature vectors using one-hot coding and 23 sets of amino acid descriptors (AADs), which represent the types and physicochemical properties of the sequence residues (Table S1). By referencing these AADs, non-numeric peptide sequences can be converted into numeric feature vectors and input into recurrent neural networks for model training. The AADs coding is defined as follows:
$$\text{AADs}=\begin{bmatrix}V_1^1 & V_2^1 & \cdots & V_n^1 \\ V_1^2 & V_2^2 & \cdots & V_n^2 \\ \vdots & \vdots & \ddots & \vdots \\ V_1^{20} & V_2^{20} & \cdots & V_n^{20}\end{bmatrix} \quad (2)$$
where \(V_j^i\) denotes the j-th feature variable of the i-th natural amino acid, n is the number of variables per residue, and the AADs coding matrix therefore has dimensions of 20 × n.
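As an illustration of this encoding step (our sketch, not the released module code), a peptide can be one-hot encoded into a residue-by-feature matrix; an AAD scheme would simply replace the 20 identity columns with the n descriptor values per residue:

```python
import numpy as np

AMINO_ACIDS = "RKNDQEHPYWSTGAMCFLVI"  # the 20 natural amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(peptide, max_len=50):
    """Encode a peptide as a (max_len, 20) matrix; short sequences are zero-padded."""
    mat = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for i, aa in enumerate(peptide[:max_len]):
        mat[i, AA_INDEX[aa]] = 1.0
    return mat

x = one_hot_encode("AKDEV")
print(x.shape)  # (50, 20)
```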
The RNN-Trainer module was developed based on the TensorFlow framework and incorporates four recurrent neural networks: SimpleRNN, LSTM, GRU, and BiLSTM. For each of these networks, 24 amino acid encoding schemes (Table S1), 19 activation functions, N-fold cross-validation (N = 5, 10, 15, 20), and other training parameters (network layers, neurons, learning rate, dropout, nEpochs, early stopping and checkpoint) are provided.
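To make this concrete, the following is a minimal Keras sketch of one such configuration (a single BiLSTM binary classifier with early stopping and checkpointing; the layer sizes, dropout rate, and learning rate are illustrative defaults, not the module's fixed settings):

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def build_bilstm(max_len=50, n_features=20, units=64, dropout=0.3, lr=1e-3):
    """Build a BiLSTM binary classifier over encoded peptide matrices."""
    model = models.Sequential([
        layers.Input(shape=(max_len, n_features)),   # encoded peptide matrix
        layers.Bidirectional(layers.LSTM(units)),    # BiLSTM over residues
        layers.Dropout(dropout),
        layers.Dense(1, activation="sigmoid"),       # binary activity label
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Early stopping and checkpointing, mirroring the module's training options
cbs = [callbacks.EarlyStopping(patience=10, restore_best_weights=True),
       callbacks.ModelCheckpoint("best_model.keras", save_best_only=True)]
```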
Table 1
Open-source benchmark and test datasets collected from various sources in the literature and platforms

| Bioactivity | Dataset reference | Training (positive) | Training (negative) | Test (positive) | Test (negative) | Newly reported (positive) | Newly reported (negative) |
|---|---|---|---|---|---|---|---|
| Antihypertensive activity | mAHTPred [24] | 913 | 913 | 386 | 386 | 102 | 93 |
| | ACEiPP [12] | 730 | 730 | 313 | 313 | | |
| | ACEiPs (this study) | 826 | 826 | 355 | 355 | | |
| Antioxidant activity | AnOxPP [14] | 848 | 848 | 212 | 212 | 112 | 42 |
| | AnOxPePred-FRS [33] | 530 | 593 | 146 | 135 | | |
| DPP IV inhibitory activity | iDPPIV-SCM [34] | 532 | 532 | 133 | 133 | 109 | 75 |
| Bitter | BERT4Bitter [27] | 256 | 256 | 64 | 64 | 179 | 184 |
| Umami | iUMAMI-SCM [35] | 112 | 241 | 28 | 61 | 244 | 19 |
| Antimicrobial activity | TransImbAMP [26] | 3876 | 9552 | 2584 | 6369 | 376 | 9 |
| Antimalarial activity | iAMAP-SCM (Main dataset) [36] | 111 | 1708 | 28 | 427 | 17 | 4 |
| | iAMAP-SCM (Alternative dataset) [36] | 111 | 542 | 28 | 135 | | |
| Quorum sensing activity | QSPred-FL [37] | 200 | 200 | 20 | 20 | 7 | 0 |
| Anticancer activity | AntiCP 2.0 (Main dataset) [38] | 689 | 689 | 172 | 172 | 570 | 71 |
| | AntiCP 2.0 (Alternative dataset) [38] | 776 | 776 | 194 | 194 | | |
| Anti-MRSA strains activity | SCMRSA [28] | 118 | 678 | 30 | 169 | 12 | 1 |
| Tumor T cell antigens | iTTCA-Hybrid [39] | 470 | 318 | 122 | 75 | 122 | 75 |
| Blood-brain barrier | BBPpred [40] | 100 | 100 | 19 | 19 | 45 | 3 |
| Antiparasitic activity | PredAPP [41] | 255 | 1863 | 46 | 46 | 23 | 4 |
| Neuropeptide | NeuroPred-CLQ [42] | 1940 | 1940 | 485 | 485 | 9 | 18 |
| Antibacterial activity | starPep_AB [43] | 6583 | 6583 | 1695 | 1695 | 226 | 27 |
| Antifungal activity | starPep_AF [43] | 778 | 778 | 215 | 215 | 117 | 21 |
| Antiviral activity | starPep_AV [43] | 2321 | 2321 | 623 | 623 | 31 | 467 |
| Toxicity | ATSE [44] | 1663 | 1621 | 290 | 290 | 22 | 6 |
| Anti-coronavirus activity | FEOpti-ACVP [18] | 125 | 1587 | 32 | 397 | 63 | 38 |
| Interleukin-6 inducing activity | StackIL6 [47] | 292 | 2393 | 73 | 597 | 2 | 1 |
BERT pre-training and fine-tuning module
A total of 556,603 protein sequences were downloaded from UniProt as the peptide-related pre-training corpus. UniProt, which integrates data from SWISS-PROT, TrEMBL, and UniParc, is the largest and most comprehensive protein database and provides ample data for model pre-training. As extensively discussed in the literature on the relationship between peptide sequence and activity, amino acid residues and specific motifs (dipeptides and tripeptides) are key determinants of the activity of therapeutic peptides [45, 46]. We therefore split each protein sequence into non-overlapping k-mers (k = 1, 2, 3); when fewer than k amino acids remain at the end of a sequence, the remaining residues are grouped into a final token [13]. The computational formulas of the BERT model are as follows:
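The k-mer grouping described above (non-overlapping windows of k residues, with any shorter remainder kept as a final token) can be sketched as:

```python
def to_kmers(sequence, k):
    """Split a protein sequence into non-overlapping k-mers; a remainder
    shorter than k is kept as the final token."""
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]

print(to_kmers("MKTAYIA", 3))  # ['MKT', 'AYI', 'A']
```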
$$\text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\ldots,\text{head}_h)W^O \quad (3)$$

$$\text{head}_i=\text{Attention}\left(QW_i^Q,KW_i^K,VW_i^V\right) \quad (4)$$

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad (5)$$

where \(d_k\) is the dimension of the Key, and \(W_i^Q\), \(W_i^K\), \(W_i^V\), and \(W^O\) are learnable parameter matrices.
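As a standalone numerical illustration of Eq. (5), not the pre-training code itself, scaled dot-product attention can be computed directly:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as in Eq. (5)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```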
Three k-mer pre-trained models (k = 1, 2, 3) were trained on the TensorFlow framework for model fine-tuning. The developed BERT-Trainer module provides 7 standard configuration parameters (Batch size, Evaluate, Train epochs, Warmup proportion, Learning rate, Classification, and Cross-validation, supporting 5-, 10-, 15-, and 20-fold) along with 11 advanced configuration parameters: Attention probabilities dropout probability, Hidden layer activation function (5 options), Hidden layer dropout probability, Hidden layer size, Initializer range, Intermediate layer size, Maximum position embeddings, Number of attention heads, Number of hidden layers, Type vocabulary size, and Vocabulary size (3 options, one per k-mer). Furthermore, Early stopping and Checkpoint parameters are designed to monitor the model's performance and systematically save its state during training.
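For orientation, these advanced parameters correspond to the standard BERT configuration fields; a hypothetical 3-mer configuration might look like the following (all values are illustrative, not the module's defaults):

```python
# Hypothetical configuration mirroring standard BERT config field names;
# vocab_size depends on the k-mer scheme (e.g., 20**3 3-mers plus special tokens).
bert_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",            # one of the 5 supported activations
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 8005,              # illustrative: 20**3 k-mers + special tokens
}
```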
Evaluation criteria
The predictive ability of the models in n-fold cross-validation and external testing is evaluated using six parameters: Precision, Sensitivity, Specificity, Accuracy, the Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (AUC). They are defined as follows:
$$\text{Precision}=\frac{TP}{TP+FP} \quad (6)$$

$$\text{Sensitivity}=\frac{TP}{TP+FN} \quad (7)$$

$$\text{Specificity}=\frac{TN}{TN+FP} \quad (8)$$

$$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} \quad (9)$$

$$\text{MCC}=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \quad (10)$$
where TP, FP, TN, and FN represent the number of true positive samples, false positive samples, true negative samples, and false negative samples, respectively.
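These definitions translate directly into code; the sketch below computes Eqs. (6)-(10) from the confusion-matrix counts, with scikit-learn assumed available for the AUC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute Precision, Sensitivity, Specificity, Accuracy, MCC, and AUC."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    tn = int(((y_pred == 0) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    mcc_den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Precision": tp / (tp + fp),
        "Sensitivity": tp / (tp + fn),
        "Specificity": tn / (tn + fp),
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "MCC": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
        "AUC": roc_auc_score(y_true, y_prob),
    }
```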