Workflow Architecture and Implementation
MS-Net was developed using the KNIME visual programming interface21 and requires three primary inputs: (1) feature intensity tables (peak height or area), (2) mass spectral similarity (MSS) network edge lists, and (3) multi-level annotation hits encompassing experimental library matches (Levels 1-2) and in silico predictions (Levels 3-4). Together, these annotation hits define a putative chemical space that may comprise several thousand candidate structures per dataset (Figure 1A,B).
Annotation Confidence Scoring System
The workflow begins by normalizing confidence scores across annotation Levels to establish a unified ranking system. Level 1 receives the highest confidence score and is assigned to features that match authentic standards on both MS/MS spectral similarity (similarity > 0.95) and retention time (ΔRT < 0.2 min). Level 2 annotations derive from spectral library matches and are subdivided according to matching quality: Level 2a for high-confidence matches (similarity > 0.85) and Level 2b for moderate-confidence matches (similarity > 0.7). For in silico annotations, MS-Net enriches structural candidates with taxonomic information by querying Coconut 2.022. The workflow performs InChIKey-based matching to identify compounds reported from user-specified taxonomic sources (genus and family levels). Candidates with confirmed biosource origins or matching the target chemical class are assigned Level 3a, while remaining in silico matches receive Level 3b. Spectral library matches exhibiting lower similarity (similarity < 0.7) or significant precursor mass discrepancies are classified as MS/MS analogs (Level 4). Features lacking any structural annotation remain at Level 5 (unknown) (Figure 1B). Confidence scores are normalized to a 0–100 scale to enable cross-Level comparison. The normalization scheme employs Level-specific transformations: Level 1 = 95 + (similarity score − 0.95) × 100; Level 2a = 85 + (similarity score − 0.85) × 100; Level 2b = 70 + (similarity score − 0.70) / 0.15 × 14; Level 3b (in silico and de novo) = 40 + confidence score × 40; Level 4 (MS/MS analogs) = 30 + (similarity score − 0.5) × 90; Level 5 = 0. These transformations are designed to rank experimental matches above computational predictions while preserving score discrimination within each Level.
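The thresholds and transformations above can be illustrated with a minimal Python sketch. It is not the KNIME implementation: the function names, the clamping to the 0–100 range, and the handling of a similarity of exactly 0.7 are assumptions made for illustration. The text does not state a formula for Level 3a, so the sketch deliberately raises an error for it, and the precursor mass criterion for Level 4 is not modeled.

```python
def library_match_level(similarity, is_authentic_standard=False, delta_rt_min=None):
    """Assign a Level to an experimental match using the thresholds quoted in the text."""
    if (is_authentic_standard and similarity > 0.95
            and delta_rt_min is not None and abs(delta_rt_min) < 0.2):
        return "1"
    if similarity > 0.85:
        return "2a"
    if similarity > 0.70:
        return "2b"
    # Assumption: anything at or below 0.7 is treated as an MS/MS analog.
    return "4"


def normalize_confidence(level, score=0.0):
    """Map a raw score (0-1) to the 0-100 scale with the Level-specific formulas."""
    if level == "1":
        value = 95 + (score - 0.95) * 100
    elif level == "2a":
        value = 85 + (score - 0.85) * 100
    elif level == "2b":
        value = 70 + (score - 0.70) / 0.15 * 14
    elif level == "3b":
        value = 40 + score * 40
    elif level == "4":
        value = 30 + (score - 0.5) * 90
    elif level == "5":
        value = 0.0
    else:
        # Level 3a (and any other label) has no formula in the text.
        raise ValueError(f"no normalization formula stated for Level {level!r}")
    # Assumption: clamp to the 0-100 range so out-of-range inputs stay on scale.
    return max(0.0, min(100.0, value))


if __name__ == "__main__":
    level = library_match_level(0.90)
    print(level, normalize_confidence(level, 0.90))
```

Under these formulas a Level 2b match spans 70 to 84 and a Level 3b candidate spans 40 to 80, which shows how the offsets keep the Levels ordered while the slope preserves discrimination within a Level.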
Feature Filtering and Data Reduction
Prior to network-based annotation propagation, users may optionally apply filtering strategies to reduce dataset complexity while retaining chemically informative features. The workflow supports retention time clustering using the MS-CleanR3 algorithm. Within each RT cluster, the most intense features and/or those with the highest network connectivity (degree) are preferentially retained, effectively removing redundant signals (Figure 1C). Optionally, a cosine similarity threshold can be applied to constrain the MSS network. Finally, putative annotations between two nodes may be filtered according to the XlogP calculated for each candidate. For each feature pair, the XlogP values of the candidates are compared with the retention time difference along the edge. In C18 mode, only feature pairs whose XlogP trends are consistent with the retention time order are retained.
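One way to read the XlogP filter is as a sign check between the XlogP difference and the retention time difference of a candidate pair. The sketch below is an interpretation rather than the workflow's implementation: the function name, the `rt_tol` parameter, and the decision to keep pairs with negligible retention time difference are assumptions.

```python
def xlogp_consistent_with_rt(xlogp_a, rt_a, xlogp_b, rt_b, rt_tol=0.0):
    """Check whether the XlogP order of two candidates agrees with elution order.

    On a reversed-phase (C18) column, the more lipophilic candidate
    (higher XlogP) is expected to elute later.
    """
    delta_rt = rt_b - rt_a
    delta_xlogp = xlogp_b - xlogp_a
    if abs(delta_rt) <= rt_tol:
        # Assumption: co-eluting features carry no ordering information.
        return True
    # Same sign (or equal XlogP) means the trend is consistent.
    return delta_rt * delta_xlogp >= 0


if __name__ == "__main__":
    # Candidate B is more lipophilic and elutes later: consistent.
    print(xlogp_consistent_with_rt(2.0, 5.0, 4.0, 8.0))
    # Candidate B is less lipophilic but elutes later: inconsistent.
    print(xlogp_consistent_with_rt(4.0, 5.0, 2.0, 8.0))
```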
Network-Based Annotation Propagation
MS-Net constructs a seed subnetwork comprising only high-confidence annotations (Levels 1, 2a, and 3a), which serves as the foundation for propagating structural assignments throughout the entire MSS network. To rank competing structural candidates for each feature pair connected in the MSS network, we developed a composite scoring metric that integrates spectral, structural, and computational evidence into a unified Link Score. The structural similarity component employs two complementary Tanimoto measures calculated from molecular fingerprints: Tanimoto_Full (based on Morgan, PubChem, or RDKit fingerprints) captures overall molecular similarity, including substituents, while Tanimoto_Murcko (scaffold fingerprints) emphasizes core structural frameworks. These metrics are dynamically weighted according to their relative informativeness. When the absolute difference between these two measures exceeds 0.1 (|Tanimoto_Full - Tanimoto_Murcko| > 0.1), the higher value receives greater weight, reflecting either the dominance of substituent patterns (Tanimoto_Full) or core scaffold similarity (Tanimoto_Murcko). When the absolute difference is 0.1 or less, equal weights are applied. The resulting Combined_Tanimoto is then scaled by the MS/MS cosine similarity to produce the Adjusted_Structural score, which accounts for both molecular structure and spectral concordance. In parallel, an InSilico_Combined score is calculated as the arithmetic mean of the confidence scores described above. The final Link Score integrates these two components through a user-tunable weighting parameter:
Link Score = (1 - α) × Adjusted_Structural + α × InSilico_Combined
The parameter α allows users to control the relative contributions of spectral-structural evidence versus in silico prediction scores. Lower α values (e.g., 0.3) prioritize structural and spectral similarity, while higher values give more weight to computational prediction confidence. For each feature pair in the MSS network, the workflow evaluates all candidate structures and selects the annotation with the highest Link Score. This process propagates iteratively from high-confidence seeds to their direct neighbors and subsequently through the entire connected network. Features that remain outside the MSS network are ranked using a simplified metric combining their best in silico score with the maximum Tanimoto similarity to any annotated compound within the MSS network.
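The Link Score calculation can be sketched in Python as follows. Several details are assumptions, because the text does not specify them: the 0.7/0.3 weights used when the two Tanimoto values differ by more than 0.1, the rescaling of the 0–100 confidence scores to 0–1, and the reading of InSilico_Combined as the mean of the two paired nodes' confidence scores. All function and field names are hypothetical.

```python
def combined_tanimoto(t_full, t_murcko, threshold=0.1, major_weight=0.7):
    """Weight the two Tanimoto measures; the larger one dominates if they diverge."""
    if abs(t_full - t_murcko) > threshold:
        high, low = max(t_full, t_murcko), min(t_full, t_murcko)
        # Assumption: 0.7/0.3 split; the text only says "greater weight".
        return major_weight * high + (1 - major_weight) * low
    return 0.5 * (t_full + t_murcko)


def link_score(t_full, t_murcko, cosine, conf_a, conf_b, alpha=0.3):
    """Link Score = (1 - alpha) * Adjusted_Structural + alpha * InSilico_Combined."""
    adjusted_structural = combined_tanimoto(t_full, t_murcko) * cosine
    # Assumption: confidence scores (0-100) are averaged and rescaled to 0-1.
    insilico_combined = (conf_a + conf_b) / 2 / 100
    return (1 - alpha) * adjusted_structural + alpha * insilico_combined


def best_candidate(candidates, cosine, seed_conf, alpha=0.3):
    """Select the candidate with the highest Link Score against an annotated neighbor."""
    return max(
        candidates,
        key=lambda c: link_score(
            c["t_full"], c["t_murcko"], cosine, seed_conf, c["conf"], alpha
        ),
    )


if __name__ == "__main__":
    candidates = [
        {"name": "A", "t_full": 0.9, "t_murcko": 0.9, "conf": 50},
        {"name": "B", "t_full": 0.3, "t_murcko": 0.3, "conf": 80},
    ]
    print(best_candidate(candidates, cosine=0.89, seed_conf=95, alpha=0.3)["name"])
```

With α = 0.3 the structurally similar candidate A is preferred even though B has the higher in silico confidence; raising α toward 1 reverses the choice, which is the trade-off the parameter is meant to expose.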
Metadata Enrichment and Output Generation
The final annotated feature list is optionally enriched with chemical metadata using ClassyFire and NPClassifier ontologies, providing hierarchical chemical classifications (kingdom, superclass, class, subclass). Database identifiers for each annotated feature are retrieved using the Chemical Translation Service23, ensuring compatibility with downstream pathway enrichment tools and multi-omics integration platforms (Figure 1E).
MS-Net generates four primary outputs: (1) a comprehensive annotated feature table with confidence Levels, structural information, and chemical ontology; (2) a feature height/area table; (3) an MSS network edge table for visualizing spectral similarity relationships; and (4) a Tanimoto-based network edge table connecting structurally related compounds, including links between unknown features and their nearest annotated structural neighbors. This latter output enables users to infer structural motifs for unannotated features. Both networks are enriched with putative chemical reactions between two neighboring nodes using a delta mass match to a predefined list from Metanetter 224. Optionally, features acquired in positive and negative ionization modes can be merged based on user-defined retention time and m/z tolerances.
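The delta mass matching used to suggest reactions between neighboring nodes can be sketched as a tolerance lookup. The reaction table below is a small illustrative set of monoisotopic mass differences, not the predefined list used by the workflow. The 0.005 Da tolerance and the assumption that both nodes are the same ion species are also hypothetical.

```python
# Illustrative monoisotopic mass differences (Da); not the workflow's actual list.
REACTIONS = {
    "hydrogenation/dehydrogenation (H2)": 2.01565,
    "methylation/demethylation (CH2)": 14.01565,
    "hydroxylation (O)": 15.99491,
    "hydration/dehydration (H2O)": 18.01056,
    "carboxylation/decarboxylation (CO2)": 43.98983,
}


def match_reactions(mz_a, mz_b, tol=0.005):
    """Return reactions whose mass shift matches |mz_a - mz_b| within tol (Da).

    Assumption: both features are the same ion species (e.g. both [M-H]-),
    so the m/z difference equals the neutral mass difference.
    """
    delta = abs(mz_a - mz_b)
    return [name for name, mass in REACTIONS.items() if abs(delta - mass) <= tol]


if __name__ == "__main__":
    # [M-H]- of an acidic cannabinoid (C22H30O4) vs. its neutral form (C21H30O2).
    print(match_reactions(357.2071, 313.2173))
```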
Application to Cannabis Metabolomics Dataset
Inflorescences of three medical-grade Cannabis sativa L. chemotypes were selected to evaluate MS-Net's annotation capabilities. Cannabis represents an ideal model system for several reasons. First, the species exhibits remarkable chemical diversity, encompassing a wide array of cannabinoids, terpenes, and phenolic compounds (primarily flavonoids and hydroxycinnamic acids)25–27. Second, the phytochemistry of cannabis is well documented, with numerous metabolomic studies demonstrating robust discrimination among cultivars and chemotypes28–30.
The traditional morphology-based classification (Indica vs. Sativa) has been superseded by a chemotype system based on the relative concentrations of Δ9-tetrahydrocannabinol (THC) and cannabidiol (CBD). This framework defines three main chemotypes: Type I (THC-predominant, <0.5% CBD), Type II (balanced THC:CBD ratio), and Type III (CBD-predominant, <1% THC). Although expanded classifications include Type IV (cannabigerol-predominant) and Type V (cannabinoid-free), Types I–III remain the most extensively characterized31,32.
Crucially for algorithm evaluation, cannabinoids exhibit exceptional structural diversity (an estimated 120–150 distinct phytocannabinoids) while sharing highly similar core molecular scaffolds33. For instance, the major cannabinoids THC, CBD, and cannabichromene (CBC) are constitutional isomers (C₂₁H₃₀O₂) differing only in cyclization patterns: THC features a pyran ring, CBD a cyclohexene ring, and CBC a benzopyran structure. This structural similarity, combined with well-elucidated biosynthetic pathways, provides an ideal benchmark for evaluating network-based annotation algorithms. The biosynthetic pathway begins with cannabigerolic acid (CBGA), formed by the CBGA synthase-catalyzed condensation of olivetolic acid and geranyl pyrophosphate, which serves as the universal precursor for nearly all other cannabinoids34. Additionally, in planta, cannabinoids exist predominantly as carboxylic acids (THCA, CBDA, CBCA), with decarboxylation to neutral forms occurring upon heating.
To demonstrate MS-Net's capabilities, we applied the workflow to a comprehensive untargeted metabolomics study of Cannabis sativa L., analyzing three distinct chemotypes: Bedrocan® (THC-dominant, Type I), Bedrolite® (CBD-dominant, Type III), and Bediol® (THC/CBD-balanced, Type II). Initial data processing detected 2,595 features across positive and negative ionization modes.
Feature Filtering and Chemical Space Reduction
We applied a sequential filtering strategy to reduce dataset complexity while retaining chemically meaningful signals. First, Ion Identity Networking identified and collapsed redundant adducts and isotopes by selecting the most informative precursor ions ([M+H]⁺ or [M-H]⁻) from each ion cluster. Subsequently, MS-CleanR-based retention time clustering further consolidated co-eluting features (Figure 2B). This filtering reduced the dataset to 1,297 unique features while maintaining comprehensive chemical coverage.
In silico annotation using Sirius-CSI (top 50 candidates per feature) and MSNovelist (top 20 de novo structures per feature) generated a putative chemical space encompassing more than 118,000 candidate structures (Figure 2A). This expansive search space highlights the challenge of confident structural assignment in untargeted metabolomics: without prioritization strategies, the likelihood of selecting incorrect annotations from such large candidate pools is substantial.
Network-Based Annotation Prioritization
We seeded the MSS network using Level 1 (authentic standards), Level 2a (high-confidence spectral matches), and Level 3a annotations (taxonomically informed candidates from the Cannabis genus, Cannabaceae family, or cannabinoid chemical class). PubChem fingerprints were selected to calculate Tanimoto similarities. The Link Score algorithm was configured with α = 0.3 to prioritize structural and spectral similarity over raw in silico ranking scores.
To evaluate the algorithm's performance, we examined the agreement between MSS network topology (cosine similarity) and structural similarity (Tanimoto scores). Before annotation prioritization, the raw chemical space exhibited a mean absolute distance of 0.55 between these metrics, with the highest density occurring between 0.6 and 0.8 (Figure 2C). This discrepancy reflects that spectral similarity does not always correlate with structural similarity, particularly when in silico tools generate diverse candidates. Restricting to only the top-ranked in silico candidate (top 1) reduced the mean distance to 0.3, but at the cost of excluding potentially correct structures ranked lower. Expanding to the top 10 or top 20 candidates achieved better performance, with maximum density centered around 0.2, indicating strong agreement between spectral and structural similarities. Notably, incorporating the top 10 de novo structures from MSNovelist further improved concordance, suggesting that machine learning-generated candidates can complement database-constrained searches for features representing novel or underrepresented chemical scaffolds. On this basis, the top 50 in silico and top 20 de novo candidates per feature were selected for annotation prioritization.
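The agreement metric used here, the mean absolute distance between cosine and Tanimoto similarity, can be written in a few lines. The sketch assumes that each record is one (cosine, Tanimoto) pair for a candidate pair on an MSS edge; how candidate pairs are enumerated per edge is not specified in the text, and the function name is hypothetical.

```python
from statistics import mean


def mean_abs_distance(pairs):
    """Mean |cosine - Tanimoto| over (cosine, tanimoto) records.

    Values near 0 mean that spectral and structural similarity agree.
    """
    return mean(abs(cosine - tanimoto) for cosine, tanimoto in pairs)


if __name__ == "__main__":
    # Hypothetical values for two candidate pairs on MSS edges.
    records = [(0.9, 0.3), (0.8, 0.6)]
    print(round(mean_abs_distance(records), 3))
```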
Global Dataset Annotation and Chemical Space Coverage
MS-Net reduced the initial chemical space from more than 118,000 candidates to 1,275 confidently annotated compounds across 1,297 features (Figure 2D). Analysis of annotation rank distribution revealed that 47% of features were assigned their top-ranked in silico candidate, indicating strong agreement between computational predictions and network-guided prioritization. An additional 30% of annotations fell within ranks 2–20, demonstrating the algorithm's ability to rescue correct structures initially ranked lower due to limitations in in silico fragmentation models. The remaining 23% of annotations had an original in silico rank beyond 20 (Figure 2F).
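The rank distribution reported above corresponds to a simple binning of the original in silico rank of each selected annotation. A minimal sketch, with hypothetical function names and example ranks:

```python
from collections import Counter


def rank_bin(rank):
    """Assign an original in silico rank to one of the three reported bins."""
    if rank == 1:
        return "top 1"
    if rank <= 20:
        return "ranks 2-20"
    return "rank > 20"


def rank_distribution(ranks):
    """Return the percentage of selected annotations falling in each bin."""
    counts = Counter(rank_bin(r) for r in ranks)
    total = len(ranks)
    return {label: 100 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical ranks of the annotations selected for four features.
    print(rank_distribution([1, 1, 5, 30]))
```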
The final annotation distribution by confidence Level showed: 9 Level 1 (authentic standards), 58 Level 2a (high-confidence spectral matches), 31 Level 2b (moderate-confidence spectral matches), 43 Level 3a (taxonomically informed in silico annotations), 1,051 Level 3b in silico and 71 Level 3b de novo matches, 4 Level 4 (MS/MS analogs), and 26 Level 5 (unknown) (Figure 2E). This distribution reflects the typical annotation coverage achievable in specialized metabolomics studies, where experimental spectral libraries cover only a fraction of detected features, necessitating extensive in silico inference.
Chemotype Discrimination
Principal component analysis (PCA) of the annotated feature matrix (n = 18 samples, p = 1,297 features) revealed clear separation of the three cannabis chemotypes, with the first two principal components explaining 90% of total variance (Figure 2G). Sparse partial least squares discriminant analysis (sPLS-DA) identified 60 discriminant features that robustly distinguished the chemotypes (Figure 2H).
Chemical ontology classification using NPClassifier revealed distinct natural product pathway enrichments for each chemotype (Figure 2I). Bedrocan® (THC-dominant) exhibited enrichment in phenylpropanoids and terpenoid pathways, consistent with high levels of THC and related cannabinoids. Bedrolite® (CBD-dominant) showed elevated levels of amino acid derivatives and shikimate pathways. Bediol® (balanced THC/CBD) displayed an intermediate metabolite profile with enrichment in polyketide derivatives.
Case Study: Cannabinoid Subnetwork Annotation
The initial MSS network is seeded with annotation Levels 1, 2a, and 3a (green dots, Figure 3A). The cannabinoid MSS subnetwork illustrates the algorithm's discriminative power (Figure 3B). MS-Net successfully prioritized structurally coherent annotations, including close derivatives differing primarily in hydroxylation patterns, methyl substitutions, or double bond positions, chemical variations that are consistent with known cannabinoid biosynthetic pathways. Two de novo structures from MSNovelist were also integrated, representing potential novel cannabinoid scaffolds warranting further investigation (Figure 3B). The MS-Net prioritization algorithm is illustrated for cannabichromenic acid (CBC-A, Level 1 authentic standard) and its close neighbor, a feature annotated at Level 3b (Figure 3C). This neighbor displayed a pseudomolecular ion at m/z 301.145 (molecular formula C₁₈H₂₂O₄, 0.9 ppm mass accuracy), which matched 70 putative in silico candidates. Reliance solely on the top-ranked in silico candidate from Sirius-CSI would have resulted in an incorrect structural assignment. However, the Link Score algorithm identified cannabiorcichromenic acid, originally ranked 50th, as the most probable annotation based on its high Tanimoto similarities to CBC-A, which is consistent with the MSS cosine similarity of 0.89.
Exploitation of the Tanimoto structural similarity network through the study of cannabinoid biosynthesis pathway
Figure 4 illustrates how the MS-Net workflow clusters metabolites according to their structural similarities, providing an overview of the metabolic landscape. The upper panel (A) presents the complete Tanimoto structural similarity network, in which each node represents a distinct metabolite and each edge represents the Tanimoto similarity score between two metabolites. The network is organized into well-defined clusters of metabolites corresponding to distinct biosynthetic routes, reflecting the algorithm's capacity to capture pathway-level organization within complex metabolomic data. Overall, the similarity network comprises three main subnetworks, five large clusters, and several smaller satellite clusters. An expanded subnetwork (B) was isolated by selecting the cannabinoids and their precursors, as well as their close neighbors. This local region illustrates how cannabinoids and precursors with related chemical structures or biosynthetic origins are tightly interconnected. The lower panel (C) focuses on the three main biosynthetic pathways that specifically yield cannabinoids: the olivetolic acid, orsellinic acid, and divarinic acid pathways. Each pathway is represented as a sequence of enzymatic and chemical reactions converting early intermediates into key cannabinoids. The three precursors associated with these pathways were detected within the extracts, along with the majority of metabolites produced via the olivetolic acid pathway and a subset originating from the divarinic and orsellinic acid pathways.