The in-house simulator showed advantages over the conventional simulator.
Resection status and LEEP score
Regarding R0 resections, there was no statistically significant difference between the two simulators concerning the LEEP score of the first and second resections. In the third attempt, all students (100%) achieved an R0 resection with the in-house simulator, whereas only 25 of 30 students (83.3%) did so with the conventional simulator (p = 0.020). In the fourth attempt, 29 students (96.7%) achieved an R0 resection with the in-house simulator, compared to only 22 of 30 students (73.3%) with the conventional simulator (p = 0.011). A graphical representation of the R0 resections for both simulators is displayed in Fig. 1.
Regarding the LEEP scores, the average of all recorded scores for LLETZ attempts one to five was calculated for both simulators. A LEEP score of zero points was regarded as the best possible result; in that case, a cone depth of 8 to 10 mm was achieved with a single resection. Hence, higher scores indicated a greater deviation of the excised cone from the study’s ideal target and/or more excisions. The mean scores facilitated a more accurate comparison between the two phantoms. The mean scores for the conventional simulator initially exhibited a downward trend, starting from 1.667 in the first attempt, dropping to 1.167 in the second attempt and further decreasing to 0.633 in the third. The LEEP scores then began to rise again: in the fourth attempt, the average LEEP score was 0.867, and in the fifth attempt it increased to 1.200. The mean LEEP scores for the in-house simulator decreased consistently, from 1.233 in the first attempt to 0.767 in the second, 0.567 in the third, and 0.433 in the fourth. In the fifth and final attempt, an average LEEP score of 0.167 was reached. Comparing the LEEP scores of the individual attempts between the two simulators, a statistically significant difference was found from attempt four onwards (p = 0.036 and p < 0.001) (Fig. 2).
The variation in LEEP scores with the use of each simulator across the five attempts was assessed using the Friedman test. Pairwise comparisons of attempts one to five revealed a statistically significant difference between attempt one and attempt three for both simulators (p = 0.006 each). For the conventional simulator, a statistically significant difference was also observed between attempt one and attempt four (p = 0.037), but not between attempt one and attempt five (p = 0.079). For the in-house simulator, significant differences in the LEEP score were noted between attempt one and attempt four (p = 0.003) and between attempt one and attempt five (p < 0.001). However, the improvement between attempts three, four, and five was no longer statistically significant for either simulator (Table 1).
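As an illustrative sketch only (not the authors' actual analysis code), the within-simulator comparison described above can be reproduced in Python with SciPy: a Friedman test across the five repeated attempts, followed by post-hoc pairwise Wilcoxon signed-rank tests. The score matrix below is invented placeholder data, not the study's measurements.

```python
# Sketch of the repeated-measures analysis: Friedman test across five
# attempts, then a pairwise post-hoc comparison. Data are hypothetical
# placeholder LEEP scores, NOT the study's data.
from scipy.stats import friedmanchisquare, wilcoxon

# One row per student, one column per attempt (LLETZ1..LLETZ5).
attempts = [
    [2, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [2, 2, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [3, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
]

# Transpose to per-attempt columns and run the omnibus Friedman test:
# do LEEP scores differ across the five repeated attempts?
cols = list(zip(*attempts))
stat, p = friedmanchisquare(*cols)
print(f"Friedman chi2 = {stat:.3f}, p = {p:.3f}")

# Post-hoc pairwise comparison, e.g. attempt one vs. attempt five
# (Wilcoxon signed-rank test for paired samples).
stat15, p15 = wilcoxon(cols[0], cols[4])
print(f"LLETZ1 vs LLETZ5: p = {p15:.3f}")
```

In practice the post-hoc p-values would additionally be corrected for multiple comparisons; the sketch omits that step for brevity.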
Table 1
Pairwise comparison of LEEP scores across five attempts for both simulators
| Comparison of two samples | Conventional simulator (p-value) | In-house simulator (p-value) |
| LLETZ1 – LLETZ3 | 0.006* | 0.006* |
| LLETZ1 – LLETZ4 | 0.037* | 0.003* |
| LLETZ1 – LLETZ5 | 0.079 (n.s.) | < 0.001* |
Legend: Asterisks (*) indicate statistically significant differences between the LEEP scores. Differences between the LEEP scores of other attempts were not statistically significant.
Video evaluation by two senior experts
Initially, following the methodology of Wilson et al. (33), the inter-rater reliability between the two blinded senior experts was assessed. This measure reflects the degree of consistency in the ratings provided by both assessors; values above 0.75 are considered to represent very strong agreement. In our study, the inter-rater reliability was 0.70 for the checklist mean score, 0.80 for the GRS mean score, and 0.79 for the overall mean score. These results demonstrate strong consistency, confirming the validity of the data for further analysis.

Subsequently, the mean scores for the checklist mean, GRS mean, and overall mean were compared between the two simulators across the five attempts. The checklist score assessed three key aspects, each rated on a scale from 0 to 3, with a maximum possible total of 9 points. The Global Rating Scale (GRS) evaluated four criteria, each scored between 0 and 5, allowing for a highest possible total of 20 points. To calculate the overall score, the checklist score and the GRS score were summed, resulting in a maximum achievable score of 29 points.

For the checklist mean, both simulators demonstrated an increase in mean scores: the conventional simulator rose from 6.45 points in the first attempt to 7.50 points in the fifth, while the in-house simulator rose from 5.75 in the first attempt to 8.50 in the fifth. The GRS mean values also showed a steady increase for both simulators over the five attempts, from 15.17 to 16.95 for the conventional simulator and from 15.52 to 19.20 for the in-house simulator. The overall mean likewise increased steadily for both simulators: the conventional simulator's mean score rose from 21.62 in the first attempt to 24.45 in the fifth, while the in-house simulator's mean score rose from 21.27 to 27.85.
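The composition of the expert ratings described above can be summarized in a minimal sketch. The function below simply encodes the scoring arithmetic stated in the text (checklist: three items, 0 to 3 each, max 9; GRS: four items, max 20; overall = checklist + GRS, max 29); the example ratings are invented placeholders, not study data.

```python
# Minimal sketch of the expert scoring scheme; example ratings are
# hypothetical, not study data.

def overall_score(checklist_items, grs_items):
    """Overall score = checklist total (3 items, 0-3 each, max 9)
    plus GRS total (4 items, max 5 each, max 20), i.e. max 29."""
    assert len(checklist_items) == 3 and all(0 <= x <= 3 for x in checklist_items)
    assert len(grs_items) == 4 and all(0 <= x <= 5 for x in grs_items)
    return sum(checklist_items) + sum(grs_items)

# Hypothetical ratings from the two blinded experts for one attempt:
expert_a = overall_score([3, 2, 3], [4, 4, 5, 4])  # 8 + 17 = 25
expert_b = overall_score([3, 3, 2], [5, 4, 4, 4])  # 8 + 17 = 25

# The "overall mean" compared between simulators averages both experts.
overall_mean = (expert_a + expert_b) / 2
print(overall_mean)  # 25.0
```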
As with the LEEP score, the checklist mean, GRS mean, and overall mean for each attempt were compared between the two simulators. The in-house simulator demonstrated a statistically significant advantage over the conventional simulator from the third attempt onward for the checklist mean (p = 0.009, p = 0.006, p < 0.001), from the second attempt onward for the GRS mean (p = 0.011, p < 0.001, p < 0.001, p < 0.001), and from the third attempt onward for the overall mean (all p < 0.001) (Tables 2–4).
Table 2
Mean "checklist mean" scores across five electrosurgical excisions for both simulators
| | LLETZ1 | LLETZ2 | LLETZ3* | LLETZ4* | LLETZ5* |
| checklist mean (conventional) | 6.450 | 7.150 | 7.100 | 7.150 | 7.500 |
| checklist mean (in-house) | 5.750 | 7.050 | 8.100 | 8.050 | 8.500 |
| p-value (checklist mean) | 0.083 | 0.944 | 0.009 | 0.006 | < 0.001 |
| Legend: The checklist score evaluates three key aspects: (1) whether the procedure was completed in a single step, (2) whether an adequate tissue sample was obtained and (3) whether damage to the surrounding tissue was avoided. Each category is scored on a scale of 0 to 3 points, with a maximum possible total of 9 points. * indicate statistically significant differences between the conventional simulator and the in-house simulator |
Table 3
Mean "GRS mean" scores across five electrosurgical excisions for both simulators
| | LLETZ1 | LLETZ2* | LLETZ3* | LLETZ4* | LLETZ5* |
| GRS mean (conventional) | 15.167 | 15.900 | 16.150 | 16.550 | 16.950 |
| GRS mean (in-house) | 15.517 | 16.967 | 18.083 | 18.500 | 19.200 |
| p-value (GRS mean) | 0.431 | 0.011 | < 0.001 | < 0.001 | < 0.001 |
| Legend: The Global Rating Scale (GRS) score assesses four criteria: (1) tissue handling, (2) time and motion efficiency, (3) instrument handling, and (4) procedural fluidity. Each category was rated on a scale from 1 to 5 points, with higher scores indicating superior performance. * indicate statistically significant differences between the conventional simulator and the in-house simulator |
Table 4
Mean "Overall mean" scores across five electrosurgical excisions for both simulators
| | LLETZ1 | LLETZ2 | LLETZ3* | LLETZ4* | LLETZ5* |
| Overall mean (conventional) | 21.617 | 23.050 | 23.250 | 23.700 | 24.450 |
| Overall mean (in-house) | 21.267 | 24.017 | 26.241 | 26.517 | 27.850 |
| p-value (overall mean) | 0.700 | 0.083 | < 0.001 | < 0.001 | < 0.001 |
| Legend: To calculate the overall score, the checklist score and the GRS score assigned by each expert are summed. * indicate statistically significant differences between the conventional simulator and the in-house simulator |
Students’ assessment of both simulators
A total of 60 participating students answered 21 identical questions for both simulators. Additionally, Group A answered one extra question, while Group B responded to two additional simulator-specific questions. Responses were recorded using a 10-point Likert scale (1: "strongly agree/very good" to 10: "strongly disagree/very bad").
The statement "The surgical simulation training was enjoyable" received an average rating of 1.1 in both groups. Both cohorts indicated that the simulation training had a positive impact on their education: "The simulation training improved my medical skills" was rated 1.4 in Group A and 1.3 in Group B, and "The simulation training enhanced my knowledge and medical skills in gynecological examinations" was rated 1.6 in Group A and 1.2 in Group B. Both groups agreed that "Surgical simulation helps me in real-life surgeries", with scores of 1.5 (Group A) and 1.4 (Group B). Additionally, "I would like to participate in further simulation training in gynecology and obstetrics" was rated 1.3 in Group A and 1.2 in Group B, and the broader statement "I would like to participate in simulation training in other medical specialties" received an average score of 1.3 from both groups. Regarding professional interest, participants were asked whether "The training sparked or strengthened my interest in gynecology", with Group A rating it 2.2 and Group B 2.0.

The question "How well did the model simulate a LLETZ procedure?" was rated 2.4 by Group A and 2.1 by Group B. The question "Is the representation of an artificial cervical canal useful for surgical simulation?" received ratings of 1.3 (Group A) and 1.2 (Group B). Further, participants evaluated the variation between the cervix of a nulliparous vs. a multiparous patient in surgical simulation: those using the conventional simulator rated it 3.5, while the in-house simulator group rated it 2.6. Students were also asked whether performing the LLETZ procedure enhanced their gynecological knowledge, with Group A rating it 1.7 and Group B 1.4. The question "Would you be able to perform a real LLETZ procedure under supervision?" was rated 3.1 in Group A and 3.3 in Group B.

Participants were further surveyed on their confidence and knowledge regarding electrosurgery. When asked if they had "acquired sufficient technical knowledge about electrosurgery", Group A rated it 2.2 and Group B 2.4. Confidence in handling electrosurgery was rated 2.3 in Group A and 2.2 in Group B. Regarding the improvement of operative skills through electrosurgery, Group A rated it 2.1, while Group B gave it 1.8 points. Participants using the conventional simulator were specifically asked about the statement "I did not like working with raw meat", with 23% answering "yes" and 77% "no". For those using the in-house simulator, the variation in the length and depth of the artificial vagina for surgical simulation was rated 2.2, while the ability to actively perform a Lugol iodine test received an average rating of 1.6. The only statistically significant difference was found for the statement "Training in electrosurgery improved my medical education" (p < 0.05), with Group A rating it 1.2 and Group B 1.6.
Residents' assessment of both simulators
Overall, the simulator training received positive feedback. The statement "The operative simulation training was enjoyable" was rated 2.0 by Group A and 1.0 by Group B. The statement "The simulation of surgical procedures helps me in real operations" received mean scores of 1.4 (Group A) and 1.2 (Group B). Furthermore, participants expressed a willingness to undergo additional simulation training in gynecology and obstetrics (Group A: 1.2; Group B: 1.0). The statement "The simulation training improved my medical skills" was rated 3.2 by Group A and 1.6 by Group B. The assertion "The simulation training improved my knowledge and skills regarding gynecological examination" received scores of 3.6 in Group A and 2.2 in Group B. Regarding the simulation of the LLETZ procedure, participants indicated an improvement in gynecological knowledge (Group A: 3.6; Group B: 1.2).

The suitability of the simulators for replicating diagnostic procedures was also assessed. The simulation of a real PAP smear was rated 4.6 by Group A and 2.0 by Group B. The ability to simulate a cervical biopsy was rated 4.8 (Group A) and 3.0 (Group B). The replication of an endocervical curettage received scores of 5.4 (Group A) and 3.6 (Group B). The statement "Thanks to the simulator, I feel more confident in independently diagnosing cervical dysplasia" was rated 4.6 in Group A and 1.6 in Group B.

Additionally, the potential of the simulators for LLETZ simulation was evaluated. The question "How well did the given simulator replicate a real LLETZ procedure?" was rated 4.0 by Group A and 2.0 by Group B. The representation of an artificial cervical canal received 1.2 points in Group A and 1.0 points in Group B. The inclusion of a variation between nulliparous and multiparous cervices was rated 2.8 by Group A and 3.6 by Group B. Regarding confidence in performing an actual LLETZ, Group A gave a mean of 4.4 points and Group B a mean of 1.6 points.
Residents were also surveyed regarding their knowledge of electrosurgery. The statement "I have acquired sufficient technical knowledge in electrosurgery" received ratings of 3.0 (Group A) and 2.8 (Group B). The assertion "Using electrosurgery has improved my surgical skills" was rated 3.0 by Group A and 1.6 by Group B. In the final overall assessment of the electrosurgery training, Group A gave a mean rating of 3.6 points and Group B 1.2 points. Furthermore, residents who trained with the conventional simulator were asked whether they found working with raw meat unpleasant: three participants responded "no", while two answered "yes". In the in-house group, the opportunity to actively perform a Lugol’s iodine test was rated 1.4. The only statistically significant difference was found for the statement "I feel more confident handling electrosurgery", with Group A rating it 4.2 and Group B 1.6 (p = 0.032). Concerning the question "How well did the given model simulate a real LLETZ procedure?", Group A gave a mean of 4.0 points and Group B a mean of 2.0 points (p = 0.056).