In the field of patient safety, the World Health Organization (WHO) has identified misdiagnosis as one of the primary contributors to unsafe healthcare delivery (World Health Organization & World Alliance for Patient Safety Research Priority Setting Working Group, 2008). To make a diagnostic judgment, health professionals must rely on clinical evidence and, in addition, possess an adequate level of confidence in that judgment. The discrepancy between actual diagnostic accuracy and the perceived confidence in that accuracy represents a critical threat to patient safety (Meyer et al., 2013). When professionals feel unsure about a clinical diagnosis, they tend to seek additional information to confirm or rule out their hypothesis, thereby helping minimize diagnostic error. In this regard, the self-assessment that professionals make about their decisions, modulated by their level of confidence, largely determines the course of action they adopt in clinical practice.
Therefore, the quality of care is closely related to healthcare professionals’ self-assessment capacity and to self-regulated learning (Murdoch-Eaton & Whittle, 2012; Sandars & Cleary, 2011). However, several studies have shown a weak correlation between self-assessments and standardized external assessments among health professionals (Davis et al., 2006; Eva et al., 2004), with coefficients ranging from 0.02 to 0.65 in the case of students (Gordon, 1991). This gap highlights the need to strengthen metacognition, understood as the ability to monitor, regulate, and evaluate one's own performance, in the training of future healthcare professionals (Honeycutt et al., 2021; Naug et al., 2011; Prokop, 2019; Tan et al., 2010).
One of the key components of clinical practice is the interpretation of medical images, which constitutes a fundamental basis for accurate diagnostic judgments. This skill requires rigorous and early training (Weimer et al., 2024), since the interpretation of these images entails reasoning with complex spatial information. In this context, spatial ability plays a central role, as these tasks rely on spatial mental representations and on processes such as mental rotation and visualization (Hegarty et al., 2007). Spatial ability is defined as the capacity to generate, retain, retrieve, and transform well-structured visual images (Lohman, 1996). Health science students are frequently required to reason about spatial concepts such as shape, relative position, and the connections between anatomical structures (Hegarty et al., 2007), which involves learning to mentally represent internal body structures that are not directly observable.
Consequently, given the relevance of spatial ability in the interpretation of medical images and in clinical decision-making, it is essential to make it a central focus of metacognition research in health education. Analyzing how students judge their own spatial performance allows researchers to understand and optimize these processes from the early stages of university studies. Although numerous studies analyze confidence in responses, confidence does not equate to competence, and further research is needed to examine the degree of discrepancy between confidence judgments and actual performance among health sciences students.
Metacognition in education
Metacognition is defined as reflection on one's own thinking. In other words, it is the ability to plan, monitor, and evaluate learning and performance by regulating one's own cognitive processes (Flavell, 1979; Lai, 2011). In the clinical context, metacognition has been associated with greater patient safety, enhanced decision-making, and a lower incidence of diagnostic errors in clinical professionals (Kuiper & Pesut, 2004; Royce et al., 2019; Siqueira et al., 2020). Conversely, insufficient metacognitive skills can lead to overestimation of abilities or reduced self-monitoring and, consequently, to medical errors (Medina et al., 2017; Royce et al., 2019). For example, Pusic et al. (2015) reported that medical students rated their diagnostic judgments on an X-ray image as "definitely correct," although this was accurate in only 69% of cases.
Metacognitive knowledge and metacognitive regulation are the two components of metacognition, the latter involving monitoring and control processes (Nelson & Narens, 1994; Schraw & Dennison, 1994). Metacognitive monitoring refers to activities that involve reviewing and evaluating the quality of cognition, whereas metacognitive control refers to decisions made based on information from monitoring operations (for a review, see Fiedler et al., 2019; Nelson & Narens, 1990). Control processes determine the confidence threshold at which an action is initiated. For example, in the clinical context, a physician may proceed with treatment only when they have more than 90% confidence in a diagnosis (e.g., Djulbegovic et al., 2014; Pauker & Kassirer, 1980). Inaccurate monitoring degrades decision quality; thus, the effectiveness of control depends directly on monitoring accuracy (Ackerman & Thompson, 2017). In this context, confidence judgments (hereafter, CJs) are among the most widely used measures of metacognitive monitoring (Fleming & Lau, 2014). CJs reflect an individual's belief in the accuracy of their decisions following a cognitive task. According to Nelson and Narens (1990), “a system that monitors itself (even if imperfectly) can use its own introspections as information to alter the behavior of the system” (p. 128). Thus, confidence levels help determine whether individuals feel sufficiently competent in a particular domain or whether further learning is required (Dautriche et al., 2021), guiding control decisions such as seeking help or checking for errors (Dunlosky & Hertzog, 1998; Dunlosky & Metcalfe, 2009; Efklides, 2011; Metcalfe, 2002, 2009; Nelson & Narens, 1990).
Empirically, monitoring accuracy is assessed by comparing subjective performance evaluations (e.g., CJs) with objective task performance and measuring the correspondence between the two. A recent meta-analysis (Jin et al., 2022) across 16 countries showed significant correlations between performance and mean confidence, suggesting that individuals who feel more confident tend to perform better. Judgment accuracy, defined as the relationship between objective performance and CJs, is considered a central indicator of monitoring quality (Kleitman & Moscrop, 2010). Monitoring accuracy can be assessed through absolute accuracy, relative accuracy, or the bias index (Burson et al., 2006; Nelson, 1996). Absolute accuracy is calculated from the difference between estimated performance scores (confidence judgments) and objective performance; values closer to zero indicate better calibration. The bias index is a signed counterpart of absolute accuracy that measures the degree to which an individual shows an excess or a deficit of confidence when making a prediction. It reflects the direction and degree of discrepancy between judgment and performance (Schraw, 2009), with positive values indicating overconfidence and negative values indicating underconfidence. Thus, overconfident students report confidence judgments above their objective performance on the task, whereas underconfident students report confidence judgments below it.
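To make these indices concrete, a minimal formalization (our notation, assuming confidence judgments and performance are expressed on the same 0–100 scale across N items; published variants use absolute or squared deviations, e.g., Schraw, 2009) is:

\[
\text{Bias} = \frac{1}{N}\sum_{i=1}^{N}\left(c_i - p_i\right), \qquad
\text{Absolute accuracy} = \frac{1}{N}\sum_{i=1}^{N}\left|c_i - p_i\right|,
\]

where \(c_i\) is the confidence judgment for item \(i\) and \(p_i\) the corresponding performance score. For instance, a student reporting a mean confidence of 80% while answering 60% of items correctly would obtain a bias of +20, indicating overconfidence, whereas values near zero on both indices would indicate good calibration.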
In educational contexts, students who consistently underestimate their performance (with low confidence) may lose motivation to learn, whereas those who overestimate their performance may be at a disadvantage in the long term, as the conviction that they already know enough can hinder their motivation to learn new techniques. Numerous studies have shown that students tend to be inaccurate in judging their performance (Fitzsimmons & Thompson, 2024), tending to overestimate, that is, to assess their performance as higher than their actual performance on tests (Händel & Dresel, 2018; Kruger & Dunning, 1999). This phenomenon, known as overconfidence bias, varies depending on contextual and individual factors. Due to the nature of test administration, confidence ratings are sensitive to item difficulty. Thus, difficulty directly impacts the CJs assigned, so that participants tend to show overconfidence with difficult items and underconfidence with easy items. This effect is called the hard-easy effect (Juslin et al., 2000; Lichtenstein & Fischhoff, 1977). It is worth noting that item difficulty is closely related to participants' ability: lower-ability participants tend to be overconfident with moderately difficult items, whereas higher-ability participants tend to be well calibrated or even slightly underconfident (Morphew, 2021; Stankov et al., 2012). Other studies, however, indicate that low-performing participants were more accurate, overestimating less than high-performing participants on a test of pedagogical knowledge (Florín & Grecu, 2019). Despite these discrepancies, the literature converges in indicating that lower-performing students generally tend to overestimate their performance to a greater extent than those of average or high performance, a pattern also observed in health sciences and allied disciplines (Bunce et al., 2023; Cale, 2023; Morphew, 2021).
Although these individual differences exist, confidence tends to manifest as a relatively stable characteristic across ability levels; thus, overconfidence is not limited to low performers. Stankov and Lee (2014) noted that this trend persists even among high-ability groups, suggesting that overconfidence is a widespread phenomenon. In the healthcare field, Cleary et al. (2019) found that medical students—in a sample of 157 participants—overestimated their performance on two tasks (medical history and physical examination) in 98% and 95% of cases, respectively, and their judgment accuracy remained moderately stable across tasks. Similar findings have been reported in other studies that show poor calibration in medicine (Benjamin et al., 2022; Berner & Graber, 2008; Meyer et al., 2013) as well as in education (Callender et al., 2016; Foster et al., 2017; Huff & Nietfeld, 2009), where a low correspondence between confidence and performance can lead to systematic diagnostic errors (Berner & Graber, 2008).
In addition to ability and difficulty, gender emerges as a relevant factor in metacognitive calibration. Several studies have shown that, in general knowledge tasks or university contexts, women tend to show lower confidence and less overestimation than men (Buratti et al., 2013; de Bruin et al., 2017), resulting in better calibration, reflected in lower bias scores (Pallier et al., 2002). In a numerical series task with a large sample of 6,544 participants (58% women), men showed higher confidence despite no differences in performance and exhibited greater overconfidence bias, although with a small effect size. The meta-analysis by Rivers et al. (2020) partially supported these findings: across six studies with 758 participants, men were more confident and, in some cases, also more accurate. Even when controlling for performance, women reported approximately 7% less confidence than men. However, gender differences appear to be task-dependent: in mathematical problems, women have shown superior calibration (Bench et al., 2015). McMurran (2020) found that men were more confident but less calibrated, particularly on low-difficulty problems; as task complexity increased, women's calibration improved and gender differences in confidence decreased. Overall, although men tend to appear more confident (Lundeberg & Mohan, 2009; Stankov & Lee, 2008), the evidence indicates that overconfidence decreases in both genders with practice and time (Hadwin & Webster, 2013).
Spatial reasoning and metacognition in health sciences
Spatial reasoning has been less explored from a metacognitive perspective, despite its central role in multiple cognitive and professional activities. Spatial reasoning can be considered an umbrella term encompassing a broad set of spatial abilities, including spatial visualization, mental rotation, and spatial orientation (Ramful et al., 2017). In their meta-analysis, Voyer et al. (1995) defined mental rotation (hereafter, MR) as the ability to mentally manipulate two- and three-dimensional objects, whereas spatial visualization refers to the ability to mentally transform or modify the spatial properties of an object. Subsequent studies have confirmed that these are clearly differentiated abilities (Hegarty & Waller, 2005; Mix & Cheng, 2012). Given that these processes involve monitoring, controlling, and evaluating one's own performance, their analysis from a metacognitive perspective is particularly relevant.
In the clinical field, spatial reasoning acquires particular importance, as many medical and surgical tasks depend on the accurate interpretation of spatial information. For example, understanding X-rays, CT scans, or ultrasounds requires integrating complex two- and three-dimensional representations with prior anatomical knowledge. Numerous studies have demonstrated the relationship between spatial ability and the learning of specific skills in laparoscopic surgery, colonoscopy, dentistry, and other clinical techniques (Evans et al., 2001; Hassan et al., 2007; Hedman et al., 2006; Hegarty et al., 2008; Keehner et al., 2006; Luursema et al., 2008; Risucci, 2002; Wanzel et al., 2002, 2003). In this regard, clinical teaching seeks to develop competencies in the interpretation of radiological images through strategies that enhance anatomical understanding and visuospatial ability (de Barros et al., 2001; Ertl-Wagner et al., 2016; Khalil et al., 2005; Sendra Portero et al., 2023). Thus, the ability to perceive anatomical relationships and mentally manipulate internal structures is considered a fundamental skill for healthcare practice.
Empirical evidence has shown that targeted practice and specific training can significantly improve spatial reasoning and, consequently, clinical performance. Weimer et al. (2024) found that after an ultrasound course, students improved both their visuospatial ability and their understanding of radiological sections and anatomical relationships in the abdomen. Similarly, Garg et al. (2001) demonstrated that students with higher MR ability and greater exposure to multi-view three-dimensional models achieved superior spatial learning in anatomy. These findings underscore initial spatial ability as a predictor of anatomical learning and success in clinical tasks (Garg et al., 2001). Likewise, Clem et al. (2013) reported that 21% of the variance in students' performance in ultrasound interpretation could be explained by their initial spatial ability. Complementarily, Hoyek et al. (2009) observed that MR training improved performance on an anatomy test with items that specifically required mental rotations, but not on those focused on factual knowledge. Both men and women improved after the training, although the former maintained higher absolute performance. These results support the hypothesis of transfer from mental rotation tasks to complex anatomical tasks (Guillot et al., 2007; Hoyek et al., 2009), suggesting that strengthening spatial skills contributes directly to anatomical learning and retention.
Individual differences in spatial ability emerge early in development (Levine et al., 1999) and persist into adulthood (Hegarty & Waller, 2005). Regarding gender, men consistently outperform women on mental rotation tasks (Fernández-Méndez et al., 2024; Lippa et al., 2010; Maeda & Yoon, 2013; Voyer et al., 1995). This mental rotation advantage is considered one of the most robust gender differences in cognitive psychology (Halpern, 2013; Voyer et al., 1995; Zell et al., 2015). In contrast, sex/gender differences in chronometric mental rotation tasks are less consistent, often small or non-significant, or appear only in particular subtests or with specific types of stimuli (Bauer et al., 2021; Bauer & Jansen, 2024; Jansen-Osmann & Heil, 2007; Peters & Battista, 2008).
Regarding the psychometric Mental Rotation Test (Vandenberg & Kuse, 1978; Peters et al., 1995), it is well established that the speed at which spatial problems are solved differs between men and women and that this may contribute to observed gender differences in mental rotation performance when time limits are imposed (Peters, 2005; Voyer & Saunders, 2004). Accordingly, examining the effects of varying time constraints, Peters (2005) proposed that men would outperform women under all timing conditions and that their advantage would become more pronounced as the time available per item decreased.
In the context of these psychometric spatial tasks, analyses of metacognitive variables suggest that confidence in spatial task responses may explain part of the individual differences observed in such tasks (Arrighi & Hausmann, 2022; Cooke-Simpson & Voyer, 2007; Desme et al., 2024; Estes & Felker, 2012). Specifically, Cooke-Simpson and Voyer (2007) examined the role of confidence in MR task performance. They reasoned that participants who responded at random were likely to have less confidence in their answers, and therefore examined confidence as an explanation for overall performance. Using a sample of university students, they found a high correlation (r = 0.685) between MR (measured through the Mental Rotation Test; MRT) and the average confidence rating. Similarly, Estes and Felker (2012) showed not only that men had more confidence than women in their own MR abilities, but also that individuals who rated themselves as more confident in the accuracy of their answers were, in fact, more accurate on a classic MR task. Other studies have corroborated that those who obtain the best results also have greater confidence in their responses on the MRT (Desme et al., 2024). However, greater self-confidence is not always related to better performance on spatial tasks. Ariel et al. (2018) showed that men had higher confidence levels even when no gender differences were observed in visuospatial performance. Despite differing results among studies, the finding of greater confidence in men's spatial responses compared to women's appears consistent. Furthermore, the positive relationship between self-confidence and MR performance is stronger in men than in women (Estes & Felker, 2012). Thus, confidence differs between men and women in MR tasks (Lemieux et al., 2019; Rahe & Jansen, 2022), even when performance is controlled for. Moreover, among the different metacognitive variables, confidence seems to be the one that best predicts performance in spatial tasks, surpassing both perceived difficulty and the effort invested (Ackerman et al., 2024).
These differences are also observed among health science students across different educational contexts. Male advantages have been documented in MR among medical students (Langlois et al., 2013) and in anatomy (Koh et al., 2023). Since clinical tasks constantly require spatial reasoning, understanding how individual differences in spatial ability and metacognitive confidence interact is essential for optimizing the teaching of diagnostic skills that rely on visuospatial processing.
Present research
This study had two primary objectives. The first was to analyze calibration and bias in a sectioning task, considering gender, spatial ability, and item difficulty. This task was selected because it requires spatial reasoning about object sections, an essential component of anatomical learning. Spatial ability was estimated using MR and visualization tests, which provide complementary measures of individuals' capacity to mentally manipulate spatial information. We hypothesized that students with greater spatial ability would exhibit better calibration (scores closer to zero) and less overconfidence bias. Furthermore, calibration was expected to worsen (scores deviating further from zero) and overconfidence to increase as item difficulty rose. Men were also expected to exhibit poorer calibration than women, with a greater overconfidence bias.
The second objective was to analyze performance across three spatial tests (sections, visualization, and mental rotation) as a function of confidence, task difficulty, and the proportion of items attempted. We expected performance to be higher among students with greater spatial ability and higher confidence, whereas increased difficulty would reduce performance. In line with previous research, men were expected to outperform women on the MR task. Additionally, we hypothesized that students who answered a greater number of items under time pressure would perform worse, reflecting a speed-accuracy trade-off. However, we expected this negative effect to be attenuated in participants with greater spatial ability, whose skill would compensate for the impact of responding quickly on performance.