Overview of Student Research Papers - From 2021 to 2024, 117 students (60 undergraduate, 57 graduate) completed Fundamentals of Global Health. The research paper that is the focus of this project is required of graduate students only. Of the 54 students who completed research papers, 28 (51.9%) accepted the invitation to collaborate as authors (Supplemental Table 1). The infectious diseases of global health consequence and the regional concentrations of the research papers are summarized in Fig. 2. These included 20 viral, 5 bacterial and 3 parasitic infections. The high representation of papers focused on COVID-19 stemmed from student interest in the pandemic during the 2021 and 2022 course offerings.
The additional perspectives students took to assess the complexity of the infectious diseases of global health importance, based on weekly modules presented during the course, are summarized in Table 1. Taken together, the data from Fig. 2 and Table 1 illustrate that the content of the research papers was unique to the individual students.
Assessment of ChatGPT4o Global Health Research Papers - ChatGPT4o Narrative – The 28 ChatGPT4o-generated papers (Supplemental Table 1) were first evaluated against the word-length targets provided in the prompt guidelines. Results in Fig. 3 show that the content generated by ChatGPT4o was significantly shorter (red crosses) than the targets (blue dashed lines/boxes) suggested for the Introduction, P1, P2 and P3 sections and for the total paper; the word length of the Summary section met the suggested target. A cross comparison also demonstrated that the content generated by ChatGPT4o was significantly (p < 0.001) shorter than the student-written papers.
All students indicated that the ChatGPT4o paper included an Introduction that followed the prompt guidelines. Results summarizing student responses to the 5 Likert scale items (1 = Significantly inferior; 2 = Inferior; 3 = Similar; 4 = Somewhat better; 5 = Significantly better) are presented in Fig. 4. An overall evaluation of the student surveys suggested that the majority of students rated the AI-generated paper as inferior or similar to their own (overall satisfaction average = 2.39 (1.61–3.17); counts: Significantly inferior = 2; Inferior = 16; Similar = 7; Somewhat better = 3; Significantly better = 0).
Individual student responses showed both within-student and between-student variation. Only four of the 28 students (students 6, 8, 14 and 20) selected the same score for all 5 survey items, responding with a score of 2 to each item. All other students entered different responses among the 5 items, suggesting that the students evaluated each component of the research paper independently. Consistent with the inferior assessment of the AI-generated paper, the average individual student responses to the 5 queries showed that 17 scores (60.7%) were < 2.9; 7 scores (25%) were between 3.0 and 3.9; and 4 scores (14.3%) were ≥ 4.0. Additionally, for the 5 Likert scale items, the upper bound of the standard deviation was less than 2.9 for 17 students (60.7%), 3.0–3.9 for 7 students (25%) and 4.0–4.9 for 4 students (14.3%) (Supplemental Table 2). By this individual student assessment, the majority of the students felt that the AI-generated paper was inferior to the paper they had written; 14.3% (4 students) indicated that the AI-generated paper was superior.
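The per-student summary described above (mean of the 5 Likert responses, the mean-plus-SD upper bound, and the inferior/similar/better grouping) can be sketched as follows; this is an illustrative reconstruction, not the authors' analysis code, and the example scores are hypothetical.

```python
# Illustrative sketch of the per-student Likert summary: each student rates
# 5 survey items on a 1-5 scale; we compute the mean and the mean + SD upper
# bound across those 5 responses, then bucket the result as reported in the
# text. The scores below are hypothetical, not actual survey data.
from statistics import mean, stdev

def summarize_student(scores):
    """Return (mean, mean + SD) across one student's 5 Likert item scores."""
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation across the 5 items
    return m, m + s

def bucket(avg):
    """Group an average score the way the text reports them."""
    if avg < 3.0:
        return "inferior (<3.0)"
    elif avg < 4.0:
        return "similar (3.0-3.9)"
    return "better (>=4.0)"

# Hypothetical responses (1 = Significantly inferior ... 5 = Significantly better)
student_scores = [2, 2, 3, 2, 1]
avg, upper = summarize_student(student_scores)
print(round(avg, 2), round(upper, 2), bucket(avg))
```

Applied to all 28 students, such grouping yields the distribution reported above (17 inferior, 7 similar, 4 better).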
In response to the open-ended query regarding experience with AI tools, 6 students indicated first-time use in the context of this project, 9 reported little or non-academic use, 10 reported moderate experience (in the context of studying, improving grammar and spelling, and organizing daily plans), and 3 indicated frequent use in writing and research. The students who had used AI tools prior to this project had used earlier versions of ChatGPT, Microsoft Copilot and Perplexity AI. In response to their overall expectations for ChatGPT4o in this project, the students provided both positive and negative assessments of the AI-generated paper; excerpts from a sample of these responses are provided in Text Box 1. (Minor editing to delete “…” or [modify] text was made by the instructor, e.g., from “will not work in a research paper, let alone graduate level research” to “will not work in a [graduate level research paper]”.)
Text Box 1
-
…favorite things about public health is its relationship to medicine, math, history, and politics... It seemed that ChatGPT struggled with interweaving these topics
-
I was surprised to see how much of the essay was spent [on] defining the terms given in the prompt…I was astounded by the ability of AI to generate a paper that I would have pored over for hours to ponder each sentence and connection in a short period of time. I cannot believe that this is possible.
-
ChatGPT did a great job of summarizing the existing work [conclusion], which was honestly better than mine. The introduction also did a pretty nice job of setting the scene. Yet, the ChatGPT did not really explain any claims it made throughout the body of the paper, which simply will not work in a [graduate level research paper].
-
I believe AI tools are best [for aggregating] resources… A great deal of nuance is missing from the ChatGPT version that I included in my paper because I had the historical, cultural, and global context of previous research…
-
… the synthesis, which was just regurgitated information from other sections and lacked deeper analysis.
-
The modeling section was surprising, as ChatGPT did an analysis similar to mine, and created an SEIR model of COVID-19 and influenza dual-endemicity.
-
I expected a higher quality output... I think part of the problem was in the quality of the inputs. I didn't include nuanced topics that I researched...
-
Perhaps due to the niche nature of the subject I was surprised that the chat bot did not provide more recent examples compared to my paper written nearly 3 years ago (original student paper Spring 2021).
-
In my own paper, I attempted to present more background information than I found in the generated paper, but the generated paper presented the biological/technical information better than I did, I think. I was both surprised and frightened by the quality of ChatGPT's output and annoyed when it presented the information better than I did.
-
It is…very simple…to use and very quick. I am sure some of the grammar…in the ChatGPT version is better than my own. … very complicated to use a tool like this to completely synthesize different perspectives, specifically …as broad as Global Health.
-
…reinforced my prior assumptions about ChatGPT…they can write at a high-school, 10th grade level but have repetitive sentence structure, poor transitions, and overall stilted writing style…ChatGPT made up facts about the pathophysiology of malaria.
-
ChatGPT 4o output did not meet the word requirements of the paper but provided a lot of relevant information to the three perspectives. While it is helpful…, it still needs a lot of work in order to fully create a paper that meets all of the requirements.
-
I think some more critical analysis of sources and synthesis of the material was needed to bring this up to par with the work of a graduate student, but it provided a useful overview of the perspectives in the prompt.
-
I think I have it deeply ingrained in me that I can do it better myself, and so I'm unwilling to give up the steering wheel, so to speak.
-
The ChatGPT provided similar information to my original paper. However, it was very surface level information that lacked depth and details of the topic at hand. It was a very general paper with few references that supported each point.
These responses were consistent with the quantitative assessment of the overall paper (Fig. 4) and provided further insight into the students’ assessments of their own papers and the ChatGPT4o product. In general, their comments noted that the ChatGPT4o paper was generated very quickly, was grammatically correct, and might serve as a useful outline or organized guide for a complete paper. However, the students found that the content was noticeably superficial and repetitive of the prompt, and that nuanced details of their complex global health problems were not developed. Additionally, students commented on a number of ChatGPT4o shortcomings in identifying and integrating references. Of note, ChatGPT4o seldom used more than one reference to support its content. Moreover, although prompted to integrate references into the narrative (in some cases more than once), ChatGPT4o did not complete this step in 11 of 28 papers.
Effective use of citations is an indication of acquired skills in academic writing. Importantly, appropriate use of citations provides attribution to those who have previously published content, concepts and methods, made discoveries, and developed theories [29]. Written narrative on complex topics that is supported by well-selected references builds the credibility and authority of the author. It shows that the writer has read the appropriate background to understand many aspects of the topic being presented, provides validation of critical facts underpinning the topic, and calls attention to the history of contributions that have led to the present. Ultimately, a well-referenced work enables readers to feel that the authors have helped them understand the essentials of the topic. Understanding present-day global health requires that students read from multiple sources and use references to (1) avoid bias or over-simplification and (2) accurately represent the historical, cultural, geographic, biomedical and public health perspectives contributing to the challenges on which they are writing. Inadequate referencing leads to the appearance of an incomplete or confused presentation of important topics. Failure to cite previous work is viewed as plagiarism [30], and stiff punishments (failure of an assignment or expulsion from school) are often meted out to those who violate this basic principle of academic integrity.
Therefore, products generated by the students and ChatGPT4o were finally evaluated with the importance of referencing in mind. Two criteria were used to evaluate references – accuracy and impact. Accuracy, or “hit-rate” (number of cited references found to exist / number of references cited), was determined by whether the reference cited could be found to exist using the approach outlined in the Methods. Impact was determined by assigning the journal impact factor (IF) to each individual reference and averaging the journal impact factors within the paper.
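The two reference metrics defined above can be expressed as a short sketch; this is a minimal illustration of the calculations, not the authors' pipeline, and the reference records below are hypothetical placeholders.

```python
# Minimal sketch of the two reference metrics described in the text:
# "hit-rate" = references verified to exist / references cited, and the
# mean journal impact factor (IF) over a paper's verified references.
# The reference records below are hypothetical, not actual citations.

def hit_rate(refs):
    """Fraction of cited references that could be verified to exist."""
    found = sum(1 for r in refs if r["exists"])
    return found / len(refs)

def mean_impact_factor(refs):
    """Average journal IF across the verified references in one paper."""
    ifs = [r["impact_factor"] for r in refs if r["exists"]]
    return sum(ifs) / len(ifs)

# Hypothetical reference list for one paper
refs = [
    {"exists": True,  "impact_factor": 12.0},
    {"exists": True,  "impact_factor": 8.0},
    {"exists": False, "impact_factor": None},  # unverifiable citation
]
print(hit_rate(refs))            # 2 of 3 references verified
print(mean_impact_factor(refs))  # mean IF of the verified references
```

Applied across a set of papers, the per-paper mean IFs are then averaged to give the group-level figures reported below.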
For the 790 references cited by the students (28.21 references per paper), the hit-rate was 100%; the average IF across the 28 student papers was 10.43. For the 729 references cited by ChatGPT4o (26.04 references per paper), the hit-rate was 54.3% (396/729); the average IF across the 28 ChatGPT4o papers was 14.14. Factors contributing to the ChatGPT4o 45.7% miss rate (333/729) are summarized in Table 2. Of the ChatGPT4o references deemed accurate, 26.5% (105/396; 105/729 = 14.4% of total references) were determined to be relevant to the paper narrative where they were cited. The most common reason for Failed Relevance was ChatGPT4o's failure to integrate citations into the text (259/396). More specific examples of reference failures for accuracy and relevance are provided in Text Box 2.
Text Box 2
Failed Accuracy Examples
-
Cited manuscript did not appear in Nature Medicine, 26 pp. 1641–1645, 2020; it did appear in Nature Medicine 27 pp. 94–105, 2021.
-
Cited manuscript did not appear in PLoS Negl Trop Dis. 2016;10(1); it did appear in PLoS Negl Trop Dis. 2011 Jan 25;5(1):e1003.
-
Cited manuscript did not appear in Int. J. of STD & AIDS. 2005, 16(3):217–223; it did appear in Int. J. of STD & AIDS. 2005, 16(4):217–223.
-
Cited manuscript did not appear in Nature Microbiology. 2014; 2,14012; it did appear in Nature Microbiology. 2014; 4(9):1508–1515, with an expanded author group.
-
Ferguson, N. M., et al. (2020). Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand. Nature, 585(7807), 257–261.
It was published as an internal document at Imperial College London on March 16, 2020.
-
Lancet 396(10249) did not contain pages 1138-40. These pages were found in Lancet 36(10258). No matching manuscript was found.
Failed Relevance Examples
-
Sentence reads - In addition to vaccines, antiviral medications such as oseltamivir and zanamivir can reduce the severity and duration of influenza symptoms if administered early in the course of illness [REF]. The reference (Gostic KM, et al. (2020) Practical considerations for measuring the effective reproductive number, Rt. PLoS Comput Biol 16(12): e1008409) makes no mention of treatment of the severity and duration of influenza symptoms by the indicated drugs.
-
Sentence reads - France, leveraging its economic strength, swiftly mobilized funds for healthcare, research, and economic stimulus to mitigate the pandemic's impact [REF]. The reference (Emanuel EJ, et al.(2020) Fair Allocation of Scarce Medical Resources in the Time of Covid-19. New England Journal of Medicine 21;382(21):2049–2055) is focused on rationing of medical equipment and interventions in the United States. There was no mention of France in the article.
-
Sentence reads - Understanding the genetic and environmental factors that influence the progression of COVID-19 and its variants is essential for developing effective public health policies [REF]. The reference (Bastard P, et al. (2020) Autoantibodies against type I IFNs in patients with life-threatening COVID-19. Science 370(6515):eabd4585. doi: 10.1126/science.abd4585) was not focused on COVID-19 variants or public health.
-
Sentence reads - A study by Evans et al. [REF] reported that the timely establishment of Ebola Treatment Centers (ETCs) in Sierra Leone was associated with a reduction in case fatality rates (CFRs) from 70–40%, highlighting the importance of accessible and effective treatment facilities. The reference (Evans, D. K., Goldstein, M., & Popova, A. (2015). Health-care worker mortality and the legacy of the Ebola epidemic. The Lancet Global Health, 3(8), e439-e440) modeled how the loss of health-care workers - defined here as doctors, nurses, and midwives - to Ebola might affect maternal, infant, and under-5 mortality. There was no mention of ETCs and no specific mention of reduced CFRs from 70–40%.
-
Sentence reads - A study by Tiffany et al. (REF) indicated that community engagement efforts led to increased compliance with public health measures and a greater willingness to report suspected cases. The reference (Tiffany, A., et al. (2017). Estimating the number of secondary Ebola cases resulting from an unsafe burial and risk factors for transmission during the West Africa Ebola epidemic. PLoS Neglected Tropical Diseases, 11(6), e0005491.) focused on safe dignified burial practices and not on willingness to report suspected cases.