Selection of Candidate Genes Based on Thermophilicity Prediction Properties
From the metagenomic data of the intestines of two termites, we identified 115 protein sequences corresponding to the GH5 family, the most predominant family involved in cellulose degradation.4 Using three tools employing different machine learning algorithms (MCIC, iThermo and ThermoPred), we classified these enzymes as potentially mesophilic or thermophilic (Table 1). From the 115 sequences, we selected five enzymes predicted to be thermophilic by two or more different tools, which also possessed a signal peptide and an e-value ≤ 1e-12. The five selected GH5-encoding genes were KBCPBGKF_14422, KBCPBGKF_22703, AFHCADON_14335, KBCPBGKF_00253, and KBCPBGKF_33952, henceforth referred to as T1Cel5, T2Cel5, T3Cel5, T4Cel5, and T5Cel5, respectively (Table 1). According to BLASTP analysis (Table S1), T1Cel5 showed 61.38% identity and 99% query coverage with a GH5 protein from Treponema sp. (MDR0301551.1). Similarly, T2Cel5 had 47.48% identity and 99% coverage with a GH5 protein from Treponema sp. (MDR2575803.1), while T3Cel5 exhibited 68.73% identity and 99% coverage with another GH5 protein from Treponema sp. (MDR1839590.1). T4Cel5 had 74.56% identity and 98% coverage with a GH5 protein from Treponema sp. (MDR2953146.1). In addition, T5Cel5 displayed 79.07% identity and 100% coverage with a cellulase from Chitinispirillales bacterium (MDR0303540.1). The amino acid sequences with the highest similarity were obtained from a termite gut metagenome analysis conducted by Salgado et al.53
Phylogenetic Analysis
A phylogenetic analysis of GH5 was conducted to establish the evolutionary relationship of the studied enzymes. We generated a maximum likelihood tree using FastTreeMP (Fig. 1). A subfamily analysis using the CAZy database and BLASTP grouped the five GH5 enzymes into four GH5 subfamilies (GH5_2, GH25, GH5_39, and GH5_40). T1Cel5 grouped within the GH5_40 subgroup, an uncharacterized subfamily primarily represented by organisms from the Terrabacteria group. T2Cel5 and T4Cel5 were found in the same subgroup (GH5_39), composed of enzymes that are distributed across a wide variety of organisms, predominantly within the Terrabacteria group. This includes species from the families Herpetosiphonaceae and Synechococcaceae, such as Synechococcus sp. and Herpetosiphon aurantiacus, respectively. T3Cel5 grouped in the GH5_2 subgroup predominantly represented by organisms from the Terrabacteria group, including species from the phylum Firmicutes (e.g., Bacillus sp.) and Spirochaetes (e.g., Treponema sp.). T5Cel5 groups within the GH5_25 subgroup, predominantly composed of enzymes from the Actinobacteria and Bacteroidetes phyla, including thermophilic species such as Dictyoglomus thermophilum and Caldithrix abyssi. GH5_39 is characterized as a monospecific cellulase subfamily, exhibiting a single activity among its characterized enzyme members, while GH5_2 and GH5_25 are polyspecific. Additionally, this subfamily includes sequences from uncultured bacteria and members of the Spirosomaceae family.
Physicochemical Characterization and Secondary Structure Prediction
To gain insight into the physicochemical properties of the enzymes, we assessed features such as molecular weight (Mw), isoelectric point (pI), general hydrophobicity (GRAVY), aliphatic index, and instability index to predict protein characteristics (Table 2). The differences in Mw among the thermophilic cellulase candidates can be explained by both their domain architecture and the physicochemical properties data. T1Cel5 and T3Cel5, with molecular weights of 41.97 kDa and 38.89 kDa, respectively, consist solely of the cellulase domain, as shown in Fig. S1. In contrast, T2Cel5, T4Cel5, and T5Cel5, which have significantly higher molecular weights (66.6 kDa, 65.96 kDa, and 63.14 kDa, respectively), possess an additional galactose-binding-like domain, contributing to their increased Mw (Fig. S1). The GRAVY value, which evaluates hydrophobicity and hydrophilicity, serves as an indicator of protein solubility and its interactions with water. A positive GRAVY value indicates that a protein is generally hydrophobic, while a negative value suggests that it is more hydrophilic. In this study, the GRAVY values observed ranged from -0.34 to -0.45, which could indicate the hydrophilic potential of the five proteins evaluated in this study (Table 2). The aliphatic index of a protein is a measure of the relative volume occupied by the aliphatic side chain of the following amino acids: alanine, valine, leucine and isoleucine. The sequence-based analysis of the aliphatic index among five GH5 revealed a mean value of 81.33 ± 4.10. In addition, the instability index, the measure of the in vivo half-life of a protein, of the five GH5 enzymes tested revealed a mean value of 33.29 ± 6.81 (Table 2). Furthermore, the number of negatively charged amino acid residues (R-: aspartic acid and glutamic acid) is higher than the number of positive residues (R+: arginine and lysine) in the two candidates (T1Cel and T2Cel), except for T3Cel, which showed the same number of positive and negative residues. For the T4Cel and T5Cel5 enzymes, the number of positively charged residues was higher than the number of negatively charged residues (Table 2). The secondary structure analysis determines the spatial arrangement of amino acids, classifying them into helices, strands, or coils. In the identified enzymes, the secondary structure composition averaged 28.66% ± 6.50 α-helix, 25.5% ± 3.04 β-sheet, 25.44% ± 3.74 turn, and 24.38% ± 3.03 random coil. This composition typically shows a well-balanced distribution of four characteristics (Table 2).
In addition, the in silico Mw and pI for C. fulviceps and N. aquilinus were evaluated (Fig. S2A). The in silico Mw ranged from 21 to 71 kDa (C. fulviceps) and from 21 to 83 kDa (N. aquilinus). Whereas the average theoretical pI ranged from 4.5 to 7.4 for C. fulviceps and from 4.6 to 8.6 for N. aquilinus (Fig. S2A). The average GRAVY values observed were -0.42 ± 0.129 for C. fulviceps and -0.36 ± 0.129 for N. aquilinus (Fig. S2B).
Amino Acid Composition
Subsequently, the candidate cellulases were analyzed for their amino acid compositions. Regarding amino acid composition, using the Dayhoff metric, we observed that enzymes predicted as thermophilic show a higher relative occurrence of the amino acids Phe, Glu, Asp, Cys, Tyr, Trp, Pro, Asn, Leu, Lys, Ile and His compared to mesophilic enzymes (Fig. 2A). In Fig. 2B, T1Cel5 showed an enrichment in Glu, Cys, Tyr, Trp, Asn, Leu, Lys, Ile, and His compared to mesophilic enzymes, which have higher levels of Gly, Val, Thr, and Ser, and GH5CelA, which is richer in Phe, Asp, Ala, Arg, Gln, Pro, and Met. In Figure 2C, T2Cel5 contains a higher amount of Glu, Cys, Tyr, Trp, Leu, Lys, Ile, and His compared to GH5CelA, while mesophilic enzymes display a greater proportion of Gly, Val, Thr, Ser, and Asn. For Figures 2D to 2F, T3Cel5, T4Cel5, and T5Cel5 are all enriched in Glu, Cys, Tyr, Trp, Leu, Lys, Ile, and His in comparison to GH5CelA, which is richer in Phe, Asp, Ala, Arg, Gln, Pro, and Met, whereas Mesophilic shows higher levels of Gly, Val, Thr, Ser, and Asn.
In Figure 3, the thermophilic and mesophilic enzymes are further categorized based on their amino acid properties. Although mesophilic enzymes exhibited a dominant composition of small and tiny amino acids compared to thermophilic enzymes, thermophilic enzymes show a higher presence of aromatic, acidic, basic, polar and charged amino acids in their molecular composition (Fig. 3A). In Figures 3B to 3F, the categorized amino acid properties of the individual thermophilic cellulases T1Cel5, T2Cel5, T3Cel5, T4Cel5, and T5Cel5 are presented alongside GH5CelA and mesophilic enzymes. In Figures 3B to 3F, tiny and small amino acids are more enriched in mesophilic proteins and GH5CelA than in thermophilic candidates. Aliphatic amino acids are primarily enriched in GH5CelA, with the exception of T3Cel5 (Fig. 3D) and T5Cel5 (Fig. 3F). Aromatic amino acids show enrichment in most thermophilic candidates, except for T3Cel5 and T4Cel5 (Fig. 3E). GH5CelA displays the highest levels of non-polar amino acids. With respect to polar amino acids, thermophilic candidates exhibit greater enrichment, with the exception of T1Cel5 (Fig. 3B). Charged amino acids are more abundant in thermophilic candidates. Basic amino acids are predominantly present in GH5CelA, with T3Cel5 and T4Cel5 as exceptions. Lastly, acidic amino acids are enriched in all thermophilic candidates.
Tertiary Structure Prediction
We analyzed the tertiary structures of different cellulase sequences using the I-TASSER prediction server (Table 3). We selected the best model based on the C-score and TM-score. The C-scores for the selected models ranged from -1.73 to 0, indicating varying confidence levels in their predictions. The TM-scores, which measure the structural similarity to native proteins, ranged from 0.50 to 0.75, with T3Cel5 exhibiting the highest TM-score of 0.75. Afterwards, these models were refined using AlphaFold2_mmseqs2 and ColabFold.44 AlphaFold2 not only predicts 3D structures but also evaluates the quality of the predicted models using two key metrics: the pLDDT and pTM scores. The pLDDT score, ranging from 0 to 100, signifies the confidence of the model prediction for each amino acid residue relative to the α carbon atoms. The refined models demonstrated optimal pLDDT confidence values greater than 90, with T3Cel5 achieving the highest pLDDT score of 97.2, indicating strong predictive accuracy. Structural modeling revealed that the enzymes exhibit the typical TIM barrel structure, with eight α-helices and eight β-sheets (Fig. 4A). Furthermore, the catalytic residues for each enzyme were identified: two glutamic acid residues, Glu169 and Glu279, and the histidine His250 were identified as probably involved in the catalytic site of T1Cel5; Glu149, Glu293, and His207 (T2Cel5); Glu150, Glu239 and His211 (T3Cel5); Glu156, Glu298, and His214 were identified for T4Cel5 and Glu128, Glu245 and His187 for T5Cel5 (Fig. 4C, T2Cel5). Regarding the structure and sequence alignment, the GH5 candidates contain the eight key residues Arg46, His90, Asn139, Glu140, His198, Tyr200, Glu280, and Trp313, referencing T1Cel5, which are conserved in the functional motif of the GH5 family (Fig. 5).
All five refined models were associated with specific substrates identified from the BioLip2 database, with cellobiose assigned to T1Cel5 and T4Cel5, and cellotriose associated with T2Cel5, T3Cel5 and T5Cel5.
The structural similarity among the thermophilic cellulase candidates was evaluated through RMSD values obtained from pairwise alignments of their three-dimensional structures. Typically, RMSD values below 3Å indicate strong structural similarity, allowing for reliable comparisons between proteins.54 As shown in Table 4, the candidates exhibited RMSD values predominantly below this threshold, with T3Cel5 displaying the lowest RMSD of 0.57Å in comparison to its closest structural homolog, 3pzt. This structural proximity among the candidates potentially contributes to their classification as thermophilic proteins. In addition, the TM-scores further corroborate these findings, with values above 0.76, indicating a high degree of structural alignment with known thermophilic proteins in the PDB.
To assess the interatomic features of thermophilic and mesophilic enzymes, we conducted an analysis using the Arpeggio tool. Our findings showed a distinct pattern: thermophilic enzymes exhibited a significantly higher tendency towards hydrophobic contacts compared to mesophilic enzymes (Table 5). Specifically, the hydrophobic contacts per amino acid for thermophilic enzymes ranged from 3.05 to 3.37, while mesophilic enzymes exhibited lower values, indicating a robust hydrophobic character among the thermophilic candidates. This observation underscores potential structural adaptations that contribute to the stability of thermophilic enzymes in high-temperature environments.
Finally, we performed a data mining analysis by querying the molecular structure of a set of reference enzymes and our top five enzymes to evaluate the abundance of certain amino acids, in two topological scenarios of structure/function relationship, that influence on thermotolerance.
Specifically, we extracted from Figure 2A the residues overrepresented in the set of thermophilic enzymes over mesophilic enzymes and, based on what is described in the literature, we assigned the amino acids Ile, Phe, Tyr, Trp to the HYD group and the amino acids Lys, His, Glu to the CAT group.55-57 Then, the residues of higher hydrophobic character of the HYD group were counted and relativized according to the ratio with the total amino acids of the enzymes (Fig. S3, C and D) or according to the ratio with the total HYD amino acids per enzyme. In this sense, we found a higher ratio of presence of HYD-type amino acids in the core residues of our five enzymes compared to a set of mesophilic enzymes. Similarly, for these novel enzymes we found a higher ratio of catalytically influenced residues in the catalytic pocket environment compared to the mesophilic enzyme set (Fig. S3, A and B).