The primary aim of this study was to assess the diagnostic utility of two patient-level scoring strategies derived from the OMERACT definitions for salivary gland ultrasonography (SGUS) in patients with suspected primary Sjögren’s syndrome (pSS): an ordinal scale (0–6) and a sum score (0–12), both integrating the number and severity of affected glands. Compared with labial salivary gland biopsy, both scores showed modest diagnostic performance, with areas under the ROC curve (AUC) of 0.683 (95% CI: 0.543–0.824) for the sum score and 0.687 (95% CI: 0.546–0.828) for the ordinal scale. Nevertheless, both demonstrated high specificity at certain thresholds (≥ 4 for the ordinal score and ≥ 5 for the sum score), highlighting their potential clinical value as biopsy-sparing tools in selected patients(10, 12, 13).
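To make the threshold-based evaluation concrete, the following is a minimal sketch, not the study's analysis code, of how a patient-level sum score and the sensitivity and specificity of a "score ≥ threshold" rule against biopsy could be computed. It assumes, hypothetically, that the 0–12 sum score is the total of the OMERACT grades (0–3) of the four major salivary glands (both parotid and both submandibular glands); the helper names `sum_score` and `sens_spec` and the toy patients are invented for illustration. The ordinal 0–6 scale is not reproduced, because its construction is not described in this section.

```python
from typing import Sequence, Tuple

# Hypothetical gland order (assumption): right/left parotid, right/left submandibular.
GLANDS = ("parotid_R", "parotid_L", "submandibular_R", "submandibular_L")


def sum_score(grades: Sequence[int]) -> int:
    """Patient-level sum score (0-12), ASSUMED here to be the sum of the
    OMERACT grade (0-3) of each of the four major salivary glands."""
    if len(grades) != len(GLANDS):
        raise ValueError(f"expected {len(GLANDS)} gland grades, got {len(grades)}")
    if any(g not in (0, 1, 2, 3) for g in grades):
        raise ValueError("OMERACT grades must be integers 0-3")
    return sum(grades)


def sens_spec(scores: Sequence[int], biopsy_pos: Sequence[bool],
              threshold: int) -> Tuple[float, float]:
    """Sensitivity and specificity of the rule 'score >= threshold',
    using biopsy as the reference standard (both classes must be present)."""
    pairs = list(zip(scores, biopsy_pos))
    tp = sum(s >= threshold and b for s, b in pairs)        # test+, biopsy+
    fn = sum(s < threshold and b for s, b in pairs)         # test-, biopsy+
    tn = sum(s < threshold and not b for s, b in pairs)     # test-, biopsy-
    fp = sum(s >= threshold and not b for s, b in pairs)    # test+, biopsy-
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Invented toy data (NOT study data): four gland grades per patient.
    patients = [(3, 2, 2, 1), (1, 1, 0, 0), (2, 2, 1, 1),
                (1, 0, 1, 0), (0, 0, 0, 0), (2, 1, 1, 1)]
    biopsy = [True, False, True, True, False, False]
    scores = [sum_score(p) for p in patients]  # [8, 2, 6, 2, 0, 5]
    se, sp = sens_spec(scores, biopsy, threshold=5)
    print(f"Sensitivity={se:.2f}, specificity={sp:.2f}")  # Sensitivity=0.67, specificity=0.67
```

Sweeping `threshold` over all possible score values in the same way yields the sensitivity/specificity pairs from which an ROC curve, and hence an AUC, is built; the toy figures above are purely illustrative.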
Previous studies have evaluated SGUS as a diagnostic tool for pSS using a variety of scoring systems. Mossel et al. reported good agreement with histopathology using the Hocevar score (AUC 0.82), with a specificity of 85%(14). Barrio-Nogal et al. applied a 0–4 scale and used the most affected gland to define positivity, achieving 90% sensitivity but lower specificity (67%)(15). In a 2023 meta-analysis, Tang et al. reported pooled sensitivity and specificity for SGUS of 81% and 87%, respectively, although reference standards varied and scoring systems were heterogeneous(16). Unlike these approaches, we evaluated composite scoring methods based on OMERACT criteria that capture global glandular involvement, providing a reproducible and clinically feasible framework.
Various semiquantitative systems have been proposed to standardize SGUS interpretation in pSS. The Hocevar score, though comprehensive, has been criticized for its complexity and limited feasibility in routine practice(17). Attempts to simplify it, for instance by focusing solely on hypoechogenic areas, have improved applicability but at the expense of omitting other relevant echostructural features(13). Cornec et al. proposed a simplified 0–4 scale based on the De Vita score, which showed high specificity but lacked multicenter validation and standardized reliability testing(18). In contrast, the OMERACT system, developed by international consensus, offers a pragmatic balance between diagnostic value and usability. Its 0–3 scale focuses on reproducible features (parenchymal inhomogeneity, hypoechoic areas and fibrotic bands) and has demonstrated high inter-reader agreement and external validity, fulfilling key OMERACT criteria for clinical application(10, 14, 19).
Our findings reinforce the role of SGUS as a structural imaging biomarker associated with systemic autoimmunity in pSS. In our cohort, an OMERACT ordinal score ≥ 2 was significantly associated with anti-Ro52 positivity, while higher thresholds (≥ 3 for the ordinal score, ≥ 5 for the sum score) were associated with anti-Ro52, anti-Ro60 and anti-La antibodies. These results align with previous studies reporting significant associations between abnormal SGUS findings and various autoantibodies, including ANA, anti-SSA and rheumatoid factor(14, 20, 21). Taken together, this evidence supports the interpretation of SGUS as a non-invasive surrogate marker of B-cell-mediated glandular damage in pSS.
Although SGUS is emerging as a key structural biomarker in pSS, our results and previous reports suggest that its findings are not consistently aligned with measures of glandular function. In our cohort, no significant associations were found between SGUS scores and either unstimulated salivary flow or Schirmer’s test. This is consistent with prior reports by Schmidt and Mossel and supports the interpretation that SGUS captures chronic structural damage rather than real-time exocrine performance(20, 22). By contrast, a recent study by Shi et al. found that, despite only moderate sensitivity, an SGUS grade 3 in the most affected gland and a patient-level OMERACT-based sum score ≥ 9 were strongly associated with both biopsy positivity and reduced salivary flow, achieving a specificity of 93%(23). Caraba et al. likewise noted parallel declines in structure and function(24); these discrepancies may reflect cohort variability. Altogether, our results highlight the complementary nature of SGUS and functional testing in the diagnostic approach to pSS.
This study has several strengths. First, we evaluated two patient-level scoring strategies based on the OMERACT consensus definitions (an ordinal scale, 0–6, and a sum score, 0–12), both of which incorporate the number and severity of abnormalities across all major salivary glands. Second, SGUS readings were performed independently by two experienced operators blinded to clinical and serological data, with discrepancies resolved by consensus. Third, the study included a real-world cohort of consecutive patients referred for suspected pSS, increasing its external validity and closely reflecting diagnostic scenarios encountered in daily practice. Finally, the use of labial salivary gland biopsy as the reference standard provides a clinically meaningful comparator.
Nonetheless, some limitations must be acknowledged. The sample size was moderate and derived from a single-center cohort, which may limit generalizability. Although interobserver agreement was evaluated using kappa and intraclass correlation coefficient (ICC) metrics, the readings were performed by experienced sonographers at a single center, which may overestimate reproducibility compared with multicenter or less specialized settings. Histological interpretation by a single pathologist ensured internal consistency but may reduce external applicability. Moreover, the cross-sectional design captures diagnostic performance at only a single time point. In addition, only patients with suspected pSS were included, excluding those with previously diagnosed disease or overlapping systemic autoimmune conditions. While this improves internal validity, it may limit extrapolation to broader clinical populations, such as patients with secondary Sjögren’s syndrome or more complex immunological profiles. Similarly, although common comorbidities (e.g., hypertension, dyslipidemia) were not exclusion criteria, their influence on SGUS findings was not specifically analyzed. Finally, variations in disease duration, symptom burden or pre-test probability may also affect SGUS performance and should be explored in future studies. Longitudinal studies are also needed to explore the utility of SGUS scores in disease monitoring, stratification or prognostication.
Our study supports the use of SGUS as a non-invasive, reproducible and clinically informative tool in the diagnostic assessment of patients with suspected pSS. By applying two patient-level scoring strategies derived from the OMERACT definitions (an ordinal scale and a sum score), we observed that SGUS abnormalities were frequently aligned with biopsy positivity and serological markers of autoimmunity. Although the overall discriminative performance of both scores was modest (AUC ≈ 0.68), their clinical utility lies in the identification of sonographic patterns with high specificity. For example, an OMERACT ordinal score ≥ 4 and a sum score ≥ 5 both achieved specificities close to 92%; although these thresholds did not reach absolute specificity, they consistently exceeded 89%. Consistently, the positive likelihood ratio (LR+) was 3.62 (95% CI: 2.84–24.00) for a sum score ≥ 5 and 2.15 (95% CI: 1.95–19.00) for an ordinal score ≥ 4, reinforcing their role as “rule-in” thresholds. In this context, and assuming compatible clinical and serological profiles, these thresholds could potentially be used to avoid biopsy in selected patients with high pre-test probability. In our cohort, this approach would have spared biopsy in approximately 16.6% (ordinal score ≥ 4) and 21.7% (sum score ≥ 5) of patients. Conversely, lower thresholds such as an ordinal score ≥ 2 or a sum score ≥ 3, though less specific, would have identified a greater proportion of biopsy-positive cases (82.6% and 65.2%, respectively), yet their comparatively weak negative likelihood ratios (LR−) limit their role in safely ruling out disease. This illustrates the trade-off between diagnostic certainty and test avoidance.
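The rule-in argument rests on two standard relations, LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity, combined with Bayes’ theorem in odds form (post-test odds = pre-test odds × LR). The minimal sketch below applies them. The function names are hypothetical, the sensitivity/specificity pair is an invented example rather than a study estimate, and the pre-test probabilities are assumed values chosen only to show why the same LR+ is far more decisive in patients with a high pre-test probability; only the LR+ of 3.62 is taken from the reported results.

```python
from typing import Tuple


def likelihood_ratios(sensitivity: float, specificity: float) -> Tuple[float, float]:
    """Return (LR+, LR-) for a dichotomized test; inputs must lie strictly in (0, 1)."""
    lr_pos = sensitivity / (1.0 - specificity)  # factor by which a positive result raises the odds
    lr_neg = (1.0 - sensitivity) / specificity  # factor by which a negative result lowers the odds
    return lr_pos, lr_neg


def post_test_probability(pre_test_prob: float, lr: float) -> float:
    """Bayes' theorem in odds form: post-test odds = pre-test odds * LR."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1.0 + post_odds)


if __name__ == "__main__":
    # Invented example (NOT study data): a high-specificity, low-sensitivity threshold.
    lr_pos, lr_neg = likelihood_ratios(sensitivity=0.30, specificity=0.92)
    print(f"LR+={lr_pos:.2f}, LR-={lr_neg:.2f}")  # LR+=3.75, LR-=0.76

    # Reported LR+ for a sum score >= 5, applied to ASSUMED pre-test probabilities.
    for pre in (0.3, 0.5):
        print(f"pre-test {pre:.0%} -> post-test {post_test_probability(pre, 3.62):.0%}")
    # pre-test 30% -> post-test 61%
    # pre-test 50% -> post-test 78%
```

A negative likelihood ratio close to 1, as in this toy example, barely changes the odds, which is why a test with a weak LR− cannot safely rule out disease, whereas the gain from a positive result depends strongly on the starting probability.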
These findings underscore the potential value of SGUS as a gatekeeper in the diagnostic algorithm, especially when high-specificity thresholds are applied in patients with concordant clinical and serological features. Thus, the clinical utility of SGUS scoring appears to be limited to “rule-in” applications using high-specificity thresholds, rather than broad screening or exclusion strategies.
Finally, a recent study by Finzel et al. confirmed the good interobserver reliability of the OMERACT scoring system when applied at the patient level, further supporting its broader integration into diagnostic workflows(25). While promising, our findings must be interpreted cautiously given the single-center, cross-sectional design of this study, and should be validated in larger prospective cohorts.