Technological advances over the past century have profoundly transformed medical knowledge and clinical practice. The complexity of modern medicine has introduced a growing number of variables beyond traditional clinical concerns, necessitating more quantitative approaches. As early as the second half of the 20th century, some authors proposed replacing clinical judgment with mathematical formulas [10]. Those formulas proved difficult to apply in practice, however, and clinical judgment remains a cornerstone of medicine, especially when interpreting the often-mechanistic results produced by statistical methods [11].
In this context, ChatGPT-4.0 emerges as a novel tool capable of supporting doctor-machine interactions that resemble a dialectical process rather than a purely logical-formal one, delivering responses in a conversational, human-like manner. This positions the AI as a potential source of information presented in a more organic and articulated fashion.
Previous studies have tested the performance of AI tools like ChatGPT in the context of medical examinations [12], and the potential applications of AI in medicine have been widely discussed [13]. However, the clinical use of AI tools, particularly in the field of rheumatology, remains underexplored, and further research is needed. Some studies comparing AI models and specialists in practical rheumatology settings [14, 15, 16] have yielded results consistent with ours: AI demonstrates generally acceptable performance, but with notable areas for improvement.
Our findings highlighted a particular weakness of ChatGPT-4.0 in answering questions about which signs or symptoms are most useful for diagnosis, questions that require an understanding grounded in clinical experience; this contrasts with its strong performance on other types of questions. The discrepancy is likely due to the subjective nature of these diagnostic questions, which depend heavily on the clinician’s bedside experience in differentiating between possible disease manifestations. In evidence-based medicine, the concepts of sensitivity and specificity attempt, albeit imperfectly, to capture this distinction [17]. The issue is especially critical in rheumatology, given the broad spectrum of disease manifestations and the overlap of symptoms across conditions.
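For reference, these measures are conventionally defined from true and false positives and negatives; the formulation below is the standard textbook one and is not derived from our data:

\[
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}
\]

A highly sensitive sign is mainly useful for ruling a disease out, whereas a highly specific sign is mainly useful for ruling it in; neither figure fully captures the bedside weighting of overlapping manifestations described above.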
Regarding treatment choices, imaging studies, and laboratory tests, it is unsurprising that ChatGPT-4.0 performed better than the human experts. These decisions typically follow established protocols and guidelines, to which ChatGPT has direct access, and they tend to be less subjective and less controversial than judgments about which signs or symptoms are most diagnostically useful.
Another intriguing finding in our study was the lower performance of the most experienced group (Group D). We do not attribute this to the length of these specialists’ experience per se, but rather to the nature of the questions being evaluated. As rheumatologists accumulate experience, their perspectives become more nuanced and complex, making seemingly “simple” questions harder to answer; in many cases this reflects a departure from basic, standardized concepts in favor of complex or individualized approaches to rheumatology, possibly compounded by reduced involvement in academic settings. Notably, the evaluating experts, who themselves had over 30 years of experience, did not align with Group D’s responses. In our view, this pattern points to a need for continuing medical education (CME) rather than to a shortcoming in Group D’s expertise.
Our study also has several limitations. One major issue is the variability of ChatGPT-4.0’s responses to the same questions. We minimized this variability by using prompts that encouraged more objective answers: with the original “answer objectively” instruction (the same one given to the human experts), the AI produced responses that were excessively verbose and unsuitable for our evaluation, so the instruction was changed to “answer objectively and without explanations.” Another limitation is the small sample size, which restricts the generalizability of our findings and precludes definitive statistical conclusions about the utility of ChatGPT-4.0 in rheumatology. Finally, ChatGPT-4.0 undergoes continuous updates and modifications, so our study provides a snapshot of the performance of version 4.0 as of April 2024.
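For illustration only, the sketch below shows how the two instruction variants could be compared programmatically. It assumes the OpenAI Python API, a placeholder model name, and a hypothetical example question; the study itself queried the ChatGPT-4.0 web interface, so this is an approximation rather than the procedure we actually followed.

```python
# Illustrative sketch only: the study used the ChatGPT-4.0 web interface (April 2024);
# this approximates the two instruction variants via the OpenAI Python API (an assumption).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical example question, not taken from the study questionnaire.
QUESTION = "Which laboratory tests are most useful when giant cell arteritis is suspected?"

def ask(instruction: str) -> str:
    """Send one clinical question with a given instruction and return the model's reply."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": QUESTION},
        ],
        temperature=0,  # a lower temperature reduces run-to-run variability
    )
    return response.choices[0].message.content

verbose_reply = ask("Answer objectively.")                            # original instruction: verbose output
concise_reply = ask("Answer objectively and without explanations.")   # revised instruction: concise output
```

Fixing the decoding temperature and pinning a specific model version are the usual programmatic levers for reducing the run-to-run variability noted above.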
Ultimately, the primary goal of our study was to describe the results obtained and highlight specific findings that could not only address existing questions but also serve as a springboard for future research. The performance of ChatGPT-4.0 in this study provides valuable insights and paves the way for more comprehensive evaluations of AI tools in the field of rheumatology.