A total of 26 AI-generated reports from nine different simulation scenarios, conducted between January 21 and January 28, 2025, were reviewed and commented on by four experienced simulation debriefers. In addition, 27 participants took part in interviews about their experiences with the AI-based observation and analysis during the simulations.
The debriefers’ perspective on AI
We identified four themes in the debriefers’ perspectives on AI-generated teamwork reports: (1) positive feedback, (2) categorization accuracy issues, (3) identification of problems, and (4) suggestions for improvement. Each theme consists of three to four subthemes, and each thematic category is illustrated below with representative examples. Table 1 provides selected quotes from the debriefers’ comments, the coding trees derived from their responses are presented in Supplementary Fig. 2, and Supplementary Table 2 provides the debriefers’ full feedback.
Debriefers’ theme 1: Positive Feedback
The first theme, positive feedback, consisted of three subthemes. The first subtheme, “Broader Observation Capture,” emphasized that AI-generated transcripts often captured details that debriefers missed during live observation. One debriefer explained, “In practice, it’s impossible to write down this much, because you’re doing several things at once.” Another added, “I didn’t hear the first example — that would have been great.”
The second subtheme, “Quote Selection Supports Theme Well,” captured debriefers’ views that the generative AI models selected quotes that effectively illustrated specific and relevant aspects of teamwork. One debriefer noted, “These examples include well-chosen quotes, and the context is immediately clear. I find that particularly compelling.” The practical relevance for real debriefings was also emphasized: “A participant stated that the (AI) remarks helped to build trust, spontaneously and without the topic having been discussed (further by the debriefer).”
The third subtheme, “Support for Debriefing,” reflected debriefers’ perception that the concept offered valuable assistance for debriefing practice. One debriefer noted, “Based on my first impression yesterday, I think the concept has great potential.” Another emphasized the transcripts' usefulness: “This transcript would help me simply by providing a collection of quotes.”
Debriefers’ theme 2: Categorization Accuracy Issues
The second theme consisted of four subthemes concerning difficulties in correctly categorizing transcribed teamwork quotes into the ten Team FIRST categories. The first subtheme was “Unclear, Inaccurate, or Overlapping Assignment”. One debriefer remarked, “Overall, I had the impression that the quotes didn’t fit the headings. I might have assigned them differently.” Another noted, “Many of the quotes were assigned to other team competencies, even though I remembered them differently.”
The second subtheme, “Overgeneralized, Superficial, or Ambiguous Categorization,” reflected debriefers’ concerns that some assignments lacked depth, clarity, or contextual grounding. One debriefer noted: “The positive examples of closed-loop communication are well chosen but remain too vague in depth.” Another debriefer remarked: “Global statements are pointing to missing elements, but the specific points these judgments are based on are not provided.”
The third subtheme was “More Representative Quotes Could Have Been Chosen”. One debriefer commented, “These examples don’t quite fit, in my view.” Another remarked, “I find that inappropriate; quotes from later in the scenario would be more relevant.”
The fourth subtheme, “Desire for More Context or Stronger Exemplification,” captured debriefers’ wishes for more robust examples to support the assigned categories. One noted, “Some example sentences are too short, like ‘okay, good’, but I need more context to make a valid judgment.” Another remarked, “The proactive coordination was a good observation, it’s a pity the full dialogue wasn’t included.”
Debriefers’ theme 3: Identification of Problems
The third theme comprised four subthemes concerning problems the debriefers identified in the AI-generated reports. The first subtheme, “Misidentification of Speaker or Role,” captured debriefers’ concerns about inaccuracies in attributing statements to the correct team members. One debriefer noted, “This question came from the embedded simulated person, not from a team member, so the conclusion based on this example is incorrect.” Another added, “The above examples also seem to come from the simulation staff. In this case, from the ‘audio to the room’.”
The second subtheme, “Misinterpretation of Technical Terms or Abbreviations,” reflected debriefers’ observations that technical language was sometimes inaccurately captured or translated. One noted, “The translation of specialized terms and abbreviations is still somewhat flawed and shows gaps.” However, another remarked, “It didn’t bother me that some technical terms weren’t correctly recognized. I could recall what was said.”
The third subtheme, “Overconfident or Unwarranted Wording by AI,” captured concerns that some statements lacked critical nuance. One debriefer noted, “The wording is very assertive and explicit. For example, ‘team members understand their roles and work together,’ but it doesn’t sufficiently question the basis of such claims.” Another warned, “It’s problematic to suggest that single events indicate a shared mental model. During the debriefing, it became clear that this model differed at least initially.”
The final subtheme, “Systematic Biases in AI Interpretation,” captured concerns that the AI may have exhibited implicit assumptions in its interpretation of team dynamics. One debriefer questioned, “I wonder what knowledge the statement is based on: that this happened between a doctor and a nurse. Could a bias be at play, assuming certain roles automatically belong to certain professions? (For example, the assumption that the team leader must be the physician and not the nurse?)” Another noted a possible positive bias: “What I’d find interesting is whether the analysis could also highlight what was missing, for instance, absent closed-loop communication or destructive language. Not noticing what’s missing creates blind spots that this tool could help uncover.”
Debriefers’ theme 4: Suggestions for Improvement
The final theme reflected the debriefers’ suggestions for improvement and included three subthemes. The first subtheme was “Link Quotes to Timestamps and Identify Speakers”. One noted, “What’s missing is the identification of speakers. You can’t tell if it’s a dialogue.” Another remarked, “The timestamp would help me find the scene in the video.”
The second subtheme, “Expand Beyond Text-Based Analysis,” highlighted debriefers’ concerns that important nonverbal cues, such as tone of voice, body orientation, and overall atmosphere, were not captured. One remarked, “The calm and controlled tone that contributed to psychological safety is missing. This is something we pay close attention to as instructors.” Another noted, “Participants mentioned that they felt safe because team members faced each other, which the software cannot detect.” One debriefer also questioned whether ambient noise levels, which may affect performance, were considered by the AI at all.
The third subtheme was “Improve Result Presentation and Clarity”. One suggestion was to embed timestamped quotes directly into existing tools, such as a Team FIRST debriefing event log: “I could imagine inserting these quotes with timestamps directly into our Team FIRST debriefing event log.” Another comment proposed a more structured transcript output, linking quotes to the scenario protocol or displaying them chronologically in a color-coded list that indicates whether each quote relates to challenges, communication skills, or coordination.
The learners’ perspective on AI
Based on the interviews with the learners, we identified four themes in their perspectives on being observed and analyzed by an AI during simulation-based training: (1) influence of perceived AI observation, (2) perceived benefits, (3) perceived risks, and (4) suggested key features for AI. Three of these four themes consisted of two to seven subthemes, while the first theme stood alone without subthemes. Table 2 presents corresponding examples from the participant interviews. Supplementary Fig. 3 and Supplementary Table 3 provide the learners’ full feedback.
Learners’ theme 1: Influence of Perceived AI Observation
This theme included participants’ differing reactions to the awareness of being observed by AI. For some learners, the AI presence faded into the background and did not noticeably influence their behavior: “I forgot that AI was running during the simulation.” Others described a general sense of inhibition caused simply by being observed, regardless of who or what was watching: “You don’t feel entirely free to act. Just the fact of being watched causes tension, whether it’s AI or a human.”
Learners’ theme 2: Perceived Benefits
The first subtheme, “General Optimism About the Potential of AI Technologies”, captured participants’ hopeful outlook on the role of AI in clinical settings. One participant noted, “I believe it can offer real benefits and support.” Another emphasized: “New positive aspects should be explored. AI shouldn’t be rejected as a concept outright.”
The second subtheme, “Trust in AI”, included participants’ statements expressing high confidence in AI systems and little concern about potential risks. One participant noted, “I have no worries about data protection with AI,” while another remarked, “I don’t see any risks.”
The third subtheme, “Enhanced Perception Through AI”, included learners’ reflections on how AI could expand human perceptual limits. One remarked, “As humans, we can only perceive so much. We can’t see or hear everything. AI might help us notice other things.” Another added, “I think AI will help with recognizing (hidden) connections.”
The fourth subtheme, “Support for Generating Ideas and Structuring Feedback”, included participants’ views of AI as a valuable partner in creative and analytical processes. One stated, “We have the ideas, and AI helps us realize them.” Another added, “If personal data security is guaranteed, AI can help us in research.”
The fifth subtheme was “Increased Efficiency Through Automation”. One noted, “AI technology can take much work off our shoulders in the future.” Another emphasized, “AI increases efficiency, data can be captured and analyzed faster and more effectively.”
The sixth subtheme was “Objectivity and Impartiality”. One participant remarked, “Unlike human analysis, it prevents bias based on past mistakes.” Another pointed out, “AI has no emotions, no memory — it doesn’t judge, making it impartial.”
The seventh subtheme, “Independence from Human Factors”, included participants’ appreciation of AI’s ability to operate without being influenced by human expectations or biases. One noted, “It misses nothing, while we might overlook things.” Another added, “Humans sometimes wait for participants to make mistakes, influenced by past groups or personal experiences.”
Learners’ theme 3: Perceived Risks
The first subtheme, “Lack of Transparency”, included participants’ concerns about the opaque nature of AI systems. One remarked, “AI can present facts in a way that makes them seem true. With humans, you notice uncertainty or emotion. A computer comes across as more confident. But is the information correct?” Another added, “I’m more worried about not fully understanding how AI works. There’s a tendency to think the computer knows better, but we need to think more critically.”
The second subtheme, “Data Protection and Security Concerns”, included participants’ unease about how their data is stored, processed, and shared. One participant stated they “would rather not have their voice uploaded to AI,” highlighting discomfort with audio data. Others felt more at ease “if only transcripts are generated,” provided it is clear “where the data is going.” Concerns about data being passed on to companies or possible “conflicts of interest” reflected a strong desire for transparency and control over one’s digital footprint.
The third subtheme, “Interpretation Errors”, included participants’ concerns about the risk of AI misinterpreting input. Participants asked, “How can you check the information or know if the AI has made a mistake?” This uncertainty was seen as particularly problematic in emotionally nuanced situations. One participant noted a key limitation: the AI “cannot recognize emotions,” making it difficult for it to fully grasp context or subtext, essential elements in human communication.
The fourth subtheme, “Loss of Cognitive, Social, and Communicative Abilities”, included participants’ concerns about the long-term impact of AI on essential human skills. One remarked, “You don’t need communication skills anymore,” referring to how AI tools increasingly take over tasks like writing or speaking. Another observed, “People just enter it into the GPS. You don’t have to remember anything,” pointing to the erosion of memory and cognitive independence. Participants feared that overreliance on AI might gradually diminish people’s ability to think, speak, and connect with others unaided.
The fifth subtheme, “Lack of Trust in AI”, included participants’ skepticism about overreliance on AI systems. One participant warned, “If humans no longer carry the main responsibility, that could be problematic,” pointing to the risks of delegating critical decisions. An example was automated ECG interpretation, which is “often incorrect and requires critical human review.” Trust in AI was limited, especially when accuracy and accountability were at stake.
The sixth subtheme, “Various Concerns About AI Technology”, included a broad range of practical and ethical questions raised by participants. One asked, “Does it understand Swiss German?”, highlighting limitations in language comprehension. Environmental concerns were also voiced, such as “Energy consumption for the servers is a concern. Where does the electricity come from?” These reflections culminated in a broader, critical question: “Do we need it?”, expressing a desire for more mindful and sustainable use of AI technologies.
Learners’ theme 4: Suggested Key Features for AI
The first subtheme, “Support for Human Situation Awareness”, included participants’ reflections on how AI output should be accessible and practically applicable. They emphasized that “being easy to perceive will be the key. It must be output that people will notice.” Participants also stressed the need for something “beneficial for everyone and transparent.”
The second subtheme, “Desire for Human Involvement in Evaluation”, included participants’ insistence on needing expert oversight when using AI. One noted that “AI is only as good as the input you give it,” and its output must be “evaluated with expert knowledge.” Others stressed that “a layperson cannot assess the AI’s output,” and that both input and output must be “cross-checked with human expertise.” If AI alone conducted the evaluation, “the human aspect would be missing.”