Can AI Find Drug Safety Signals Hidden in Clinical Notes? A New Study Puts It to the Test

Monitoring drug safety has become increasingly complex. A recent study published on April 6 in eBioMedicine examines how large language models (LLMs), a form of artificial intelligence (AI), can help identify safety signals hidden within clinical notes and electronic health records (EHRs). This matters particularly for immune checkpoint inhibitors (ICIs), a class of cancer drugs that has been in use since 2011 and is known to cause a range of immune-related adverse events (irAEs).
Traditionally, identifying these adverse events has involved labor-intensive manual chart abstraction or specialized natural language processing tailored to specific drugs and healthcare facilities. These methods can be costly and slow, hampering the pace of discovery in precision medicine. ICIs can trigger adverse effects in many organ systems, including the colon, liver, lungs, heart, nervous system, skin, and endocrine system, which makes effective monitoring even more challenging.
Researchers from Vanderbilt Health and the University of California at San Francisco explored whether LLMs could streamline this process. Their multicenter study used zero-shot learning, a technique in which the model is prompted with detailed instructions but given no worked examples or task-specific training. The prompt used by the researchers was narrowly focused: “You are a clinical expert in identifying immune-related adverse events caused by immune checkpoint inhibitors…” This directive included a list of six ICIs and their associated irAEs.
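To make the zero-shot setup concrete, here is a minimal Python sketch of how such a prompt might be assembled. Only the quoted fragment comes from the study as reported here; the function name `build_zero_shot_prompt`, the placeholder drug and event lists, and the requested answer format are illustrative assumptions, not the researchers' actual prompt.

```python
# Fragment of the study prompt as quoted in the article; the rest is elided there.
QUOTED_FRAGMENT = (
    "You are a clinical expert in identifying immune-related adverse events "
    "caused by immune checkpoint inhibitors…"
)


def build_zero_shot_prompt(note_text, drugs, adverse_events):
    """Assemble instructions plus one clinical note, with no worked examples.

    Everything except QUOTED_FRAGMENT is a hypothetical stand-in for the
    study's real instructions.
    """
    lines = [
        QUOTED_FRAGMENT,
        "Immune checkpoint inhibitors to consider: " + ", ".join(drugs),
        "Adverse events to look for: " + ", ".join(adverse_events),
        # Assumed answer format, chosen here only so a reply is easy to parse.
        "List every adverse event from the list that the note documents, or reply 'none'.",
        "Clinical note:",
        note_text,
    ]
    return "\n".join(lines)


# Placeholder inputs: the article does not name the six drugs or the events.
prompt = build_zero_shot_prompt(
    note_text="<de-identified clinical note text>",
    drugs=["<ICI 1>", "<ICI 2>"],
    adverse_events=["<irAE 1>", "<irAE 2>"],
)
print(prompt)
```

The resulting string would be sent to a model such as GPT-4o. Because it contains instructions and the note but no labeled example notes, it is zero-shot in the sense described above.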
The study analyzed clinical notes from several sources, including 100 patients at Vanderbilt Health, 70 patients at UCSF, and 272 notes from ICI trials sponsored by Roche, a leading pharmaceutical company based in Basel, Switzerland. The LLMs tested included GPT-3.5, GPT-4, and GPT-4o, with GPT-4o performing best across the datasets.
According to the report’s corresponding author, Cosmin Bejan, PhD, an assistant professor of Biomedical Informatics at Vanderbilt Health, “Manual patient chart abstraction for monitoring the safety and efficacy of drugs already on the market requires tremendous resources and puts a drag on the pace of discovery in precision medicine.” Bejan emphasized that if zero-shot learning with LLMs could assist in analyzing these clinical notes, it would significantly reduce time and costs for all involved.
For the main performance measure, the researchers used the F1 score, the harmonic mean of precision and recall, which penalizes both false positives and false negatives. An F1 score of 90% or more is deemed excellent, while a score above 80% might qualify a predictive model for automated clinical decision support. However, the average F1 scores obtained with GPT-4o for detecting irAEs at the patient level were 56%, 66%, and 62% across the three data sources. The model showed a systematic bias toward overpredicting these adverse events.
For the detection of 17 irAEs at the note level, where GPT-4o analyzed 667 notes from Vanderbilt Health, the average F1 score was 57%. Bejan remarked, “These results show that zero-shot learning with a powerful LLM is useful for detecting these adverse events. This performance does not rise to the level required for clinical decision support, but the method could be valuable for automated irAE extraction across multiple sites, potentially speeding discovery and enhancing the safety and effectiveness of cancer immunotherapies.”
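For readers unfamiliar with the metric, the short Python sketch below shows how an F1 score is computed from counts of correct and incorrect predictions, and how overprediction pulls it down. The function name and the counts are invented for illustration and are not taken from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts (NOT from the study): the model flags 85 patients,
# 40 correctly and 45 wrongly, and misses 5 true cases.
precision, recall, f1 = precision_recall_f1(tp=40, fp=45, fn=5)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# prints: precision=0.47 recall=0.89 F1=0.62
```

In this invented case the model catches most true events but also flags many that are not there, so low precision drags F1 down even though recall is high.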
Additional contributors to the study from Vanderbilt Health included Yaomin Xu, PhD, Eric Mukherjee, MD, PhD, Matthew Krantz, MD, Douglas Johnson, MD, MSCI, Elizabeth Phillips, MD, and Justin Balko, PhD. The research was partly funded by grants from the National Institutes of Health, specifically awards R01CA227481 and R01HL156021.
In a related study published in December in JAMA Oncology, Mukherjee and Phillips, among others, used logistic regression on adverse event reports collected by the Food and Drug Administration (FDA) to confirm that ICIs were independently associated with an increased risk of severe skin reactions, namely Stevens-Johnson syndrome (SJS) and toxic epidermal necrolysis (TEN). These reactions were sometimes found to correlate with patient exposure to human leukocyte antigen-restricted drugs.
As researchers continue to test how well LLMs can identify irAEs, the potential of these tools to improve patient safety and accelerate drug discovery remains a focus at the intersection of AI and healthcare. For now, the study's results suggest that zero-shot LLM extraction is better suited to speeding up multi-site safety research than to driving automated clinical decisions.