Large Language Model Evidence Gaps in Clinical Medicine: A Systematic Review Reveals Limited Real-World Validation
An LLM-assisted systematic review of 4,609 clinical studies published between 2022 and 2025 demonstrates that only 22.7% utilized real-world patient data, and that merely 19 prospective randomized trials were conducted. The analysis reveals substantial evidence gaps despite the rapid proliferation of medical AI research at 3.2 publications per day.
Abstract
Clinical implementation of large language models (LLMs) has accelerated dramatically since ChatGPT’s release in November 2022, yet the quality of supporting evidence remains poorly characterized. A recent systematic review published in Nature Medicine employed LLM-assisted methodology to analyze 4,609 peer-reviewed studies examining LLM applications in clinical medicine between January 2022 and September 2025. The analysis revealed that only 1,048 studies (22.7%) incorporated real-world patient data, with merely 19 prospective randomized controlled trials identified. The majority of investigations addressed simulated clinical scenarios (n=1,857, 40.3%) or standardized examination tasks (n=1,704, 37.0%). OpenAI models dominated evaluations, comprising 65.7% of tested platforms, while Google’s Gemini/Bard constituted 13.1%. Patient-facing communication and education represented 17% of evaluated tasks. Across 1,046 head-to-head comparisons with human clinicians, LLMs demonstrated superior performance in only 33% of evaluations, with substantial variation based on task complexity and clinical realism. At least 25% of studies employed sample sizes below 30 participants. These findings underscore critical evidence gaps in LLM clinical validation and highlight the urgent need for rigorous, patient-centered prospective trials before widespread clinical adoption.
Introduction
The emergence of large language models has precipitated unprecedented interest in artificial intelligence applications within clinical medicine. Since the public release of ChatGPT in November 2022, healthcare institutions have increasingly explored LLM integration across diverse clinical workflows, from medical documentation and patient education to diagnostic assistance and clinical decision support. The rapid proliferation of medical LLM research reflects both the technology’s apparent versatility and the clinical community’s eagerness to harness natural language processing capabilities for patient care enhancement.
LLMs represent a fundamental advancement in medical AI, distinguished by emergent reasoning capabilities, in-context learning, and sophisticated instruction-following behaviors not explicitly programmed during model development. Unlike traditional medical AI systems designed for specific diagnostic tasks, LLMs demonstrate apparent generalizability across multiple clinical domains, from radiology interpretation to pharmacological consultation. Early investigations suggested that general-purpose models such as GPT-4 and PaLM 2 could meet or exceed passing thresholds on standardized medical examinations, including the United States Medical Licensing Examination (USMLE), raising expectations for clinical utility.
However, the rapid expansion of medical LLM research has created challenges for evidence synthesis and quality assessment. The volume of publications—estimated at 3.2 papers per day in this domain—exceeds traditional systematic review capabilities, potentially compromising evidence-based implementation decisions. Moreover, the heterogeneity of study designs, evaluation metrics, and clinical applications complicates comparative assessment of LLM performance and safety profiles.
The distinction between controlled evaluation environments and real-world clinical settings represents a critical consideration for LLM implementation. While standardized examination performance may indicate knowledge retention and reasoning capabilities, clinical medicine demands integration of patient-specific factors, risk stratification, and complex decision-making under uncertainty. The extent to which existing LLM research addresses these real-world complexities remains unclear, potentially creating a substantial evidence gap between research findings and clinical implementation requirements.
Study Design and Methods
The systematic review employed a novel LLM-assisted methodology to analyze the rapidly expanding literature on clinical LLM applications. The investigators conducted a comprehensive search of peer-reviewed studies published between January 2022 and September 2025, identifying 4,609 relevant publications through systematic database queries and automated screening processes.
The review utilized LLM capabilities to manage the substantial volume of literature, employing natural language processing techniques for study categorization, data extraction, and quality assessment. This approach enabled analysis of research volume that would be impractical for traditional manual systematic review methods, though the authors did not provide detailed validation of the LLM-assisted methodology against human reviewers.
Primary endpoints included the proportion of studies utilizing real-world patient data versus simulated scenarios, the distribution of evaluated LLM platforms, and the frequency of head-to-head comparisons with human clinicians. Secondary outcomes encompassed task categorization, study sample sizes, and performance metrics across different clinical applications.
The investigators categorized studies based on data sources, distinguishing between real-world patient data, simulated clinical scenarios, and standardized examination tasks. Real-world studies were further stratified by design, including observational studies, retrospective analyses, and prospective trials. The authors employed standardized criteria for head-to-head comparison identification and performance assessment, though specific statistical methods for meta-analysis were not detailed in the available materials.
Model distribution analysis focused on commercial LLM platforms, with particular attention to OpenAI products (ChatGPT, GPT-3.5, GPT-4) and Google’s language models (Gemini, Bard). Task categorization encompassed patient communication, knowledge retrieval, education simulation, diagnostic assistance, and clinical documentation, among others.
Results
The systematic review identified 4,609 peer-reviewed studies examining LLM applications in clinical medicine, representing a publication rate of 3.2 papers per day during the 45-month study period. This substantial research volume reflects the rapid acceleration of medical AI investigation following ChatGPT’s public release.
Only 1,048 studies (22.7%) incorporated real-world patient data in their analyses, while the majority addressed simulated clinical scenarios (n=1,857, 40.3%) or standardized examination tasks (n=1,704, 37.0%). Among studies utilizing real-world data, merely 19 employed prospective randomized trial designs, representing less than 2% of the entire study cohort and only 1.8% of real-world data studies.
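As a sanity check, the reported data-source breakdown can be verified against the total study count. The following minimal Python sketch uses only the counts quoted above (it is an illustration of the arithmetic, not part of the review's methodology):

```python
# Counts quoted in the review (total studies and per data-source category)
total = 4609
counts = {
    "real-world patient data": 1048,
    "simulated clinical scenarios": 1857,
    "standardized examination tasks": 1704,
}

# The three categories should partition the full study set
assert sum(counts.values()) == total

for label, n in counts.items():
    print(f"{label}: {n} ({100 * n / total:.1f}%)")

# Prospective randomized trials as a share of real-world-data studies
print(f"RCT share of real-world studies: {100 * 19 / 1048:.1f}%")  # 1.8%
```

The three categories sum exactly to 4,609, and each quoted percentage reproduces to one decimal place, confirming internal consistency of the reported figures.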
Model distribution analysis revealed substantial concentration among commercial platforms. OpenAI models, including ChatGPT, GPT-3.5, and GPT-4, comprised 65.7% of all evaluated systems (n=3,027). Google’s language models, including Gemini and Bard, constituted 13.1% of evaluations (n=604), representing the second-most studied platform category. The remaining 21.2% encompassed various academic, open-source, and alternative commercial models.
Task categorization demonstrated diverse clinical applications. Patient-facing communication and education comprised 17% of evaluated tasks (n=784), followed by knowledge retrieval applications and education assessment simulation. Diagnostic assistance, clinical documentation, and treatment recommendation tasks represented smaller proportions of the research focus.
Across 1,046 head-to-head comparisons with human clinicians, LLMs demonstrated superior performance in 345 evaluations (33%). Human clinicians outperformed LLMs in 701 comparisons (67%), though performance varied substantially based on task complexity and clinical realism. The authors noted strong dependency on the level of clinical training among human comparators, with LLMs showing better relative performance against medical students and residents compared to attending physicians.
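The head-to-head arithmetic can be checked the same way; note that the reported split leaves no room for recorded ties (a sketch using only the figures quoted above):

```python
# Head-to-head comparison counts quoted in the review
llm_wins, clinician_wins = 345, 701
total = 1046

# The two outcomes should account for every comparison (no recorded ties)
assert llm_wins + clinician_wins == total

print(f"LLM superior: {100 * llm_wins / total:.0f}%")        # 33%
print(f"Clinician superior: {100 * clinician_wins / total:.0f}%")  # 67%
```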
Sample size analysis revealed concerning limitations in study power. At least 25% of investigations employed fewer than 30 participants, potentially limiting statistical validity and generalizability. The median sample size was not reported, though the authors noted substantial heterogeneity in study populations and evaluation methodologies.
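To illustrate why samples below 30 limit statistical power, a back-of-the-envelope calculation (not from the review; the standard normal-approximation formula for a two-sample comparison of means with equal group sizes) shows the smallest standardized effect such studies can reliably detect:

```python
from statistics import NormalDist

def min_detectable_d(n_per_group: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest standardized mean difference (Cohen's d) detectable by a
    two-sample comparison, using the normal approximation with equal group sizes:
    d = (z_{alpha/2} + z_{beta}) * sqrt(2 / n)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = z.inv_cdf(power)
    return (z_alpha + z_beta) * (2 / n_per_group) ** 0.5

for n in (15, 30, 100):
    print(f"n={n} per group: minimum detectable d = {min_detectable_d(n):.2f}")
```

With 30 participants per arm, only large effects (Cohen's d of roughly 0.7) are detectable at 80% power, so null findings from such studies are largely uninformative about clinically meaningful differences.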
Performance metrics varied considerably across studies, with many investigations lacking standardized outcome measures. Sensitivity and specificity data were inconsistently reported, and few studies provided area under the curve (AUC) statistics for diagnostic applications. Statistical significance testing was frequently absent or inadequately powered given small sample sizes.
Discussion
The systematic review findings reveal a substantial disconnect between the rapid proliferation of medical LLM research and the availability of rigorous clinical evidence. While the publication rate of 3.2 papers per day demonstrates remarkable research interest, the predominance of simulated scenarios and examination-based evaluations raises concerns about clinical translation and real-world applicability.
The finding that only 22.7% of studies incorporated real-world patient data represents a critical evidence gap. Simulated clinical scenarios, while valuable for controlled evaluation, may not capture the complexity, ambiguity, and patient-specific factors that characterize actual clinical practice. The absence of real-world validation limits confidence in LLM performance under operational conditions, potentially leading to overestimation of clinical utility based on artificial evaluation environments.
The paucity of prospective randomized trials (only 19 among 4,609 studies) is particularly concerning for clinical implementation decisions. Randomized controlled trials remain the gold standard for medical intervention evaluation, providing essential evidence for safety, efficacy, and comparative effectiveness. The current evidence base relies heavily on retrospective analyses and cross-sectional studies, which cannot adequately address causation, long-term outcomes, or unintended consequences of LLM integration into clinical workflows.
The concentration of research among commercial platforms, particularly OpenAI models at 65.7%, raises questions about generalizability and potential publication bias. Academic and clinical investigators may preferentially evaluate readily accessible commercial systems, potentially overlooking specialized medical language models or creating dependency on proprietary platforms without adequate performance benchmarking against alternatives.
The head-to-head comparison results, showing LLM superiority in only 33% of evaluations, provide important context for clinical expectations. The variation in performance based on human comparator training level suggests that LLMs may offer greater utility for educational applications or clinical decision support rather than autonomous decision-making. However, the lack of standardized comparison methodologies limits the interpretability of these findings.
Limitations
Several important limitations affect the interpretation of these findings. The LLM-assisted systematic review methodology, while innovative and scalable, lacks validation against traditional manual review processes. Potential errors in automated study categorization or data extraction could affect the accuracy of reported statistics.
The heterogeneity of included studies complicates meta-analytic approaches and limits the ability to draw definitive conclusions about LLM performance across clinical domains. Variations in evaluation metrics, study populations, and outcome measures prevent meaningful quantitative synthesis of results.
Publication bias may significantly affect the findings, as positive results are more likely to be published and negative findings may be underrepresented. The commercial interest in LLM development may further skew the literature toward favorable evaluations.
The rapid evolution of LLM capabilities during the study period (2022-2025) means that earlier evaluations may not reflect current model performance, particularly given the iterative improvements in GPT-series models and competing platforms.
Clinical Implications
The systematic review findings have substantial implications for healthcare institutions considering LLM implementation. The limited real-world validation evidence suggests that clinical adoption should proceed cautiously, with robust internal evaluation and monitoring protocols. Healthcare leaders should recognize that positive results from simulated scenarios may not translate to operational environments.
The predominance of small sample sizes (25% with n<30) indicates that many published studies lack adequate statistical power for clinical decision-making. Physicians and administrators should critically evaluate study quality and sample sizes when assessing LLM evidence, rather than relying solely on published conclusions.
For medical education applications, where LLMs showed relatively stronger performance against students and residents, the evidence suggests potential utility for knowledge assessment and educational support. However, even these applications require careful validation within specific institutional contexts and curricula.
The concentration of research among commercial platforms highlights the need for institutional evaluation processes that consider multiple LLM options rather than defaulting to the most widely studied systems. Healthcare organizations should establish systematic evaluation criteria that include performance metrics, cost considerations, data privacy requirements, and integration capabilities.
The finding that LLMs outperformed humans in only 33% of direct comparisons underscores the importance of human oversight and the inappropriateness of autonomous LLM decision-making in clinical contexts. Implementation strategies should emphasize human-AI collaboration rather than replacement models.
For Hawaii’s healthcare institutions, including the John A. Burns School of Medicine (JABSOM), Queen’s Medical Center, and other Pacific region facilities, these findings emphasize the importance of local validation studies. The unique demographic characteristics, disease prevalence patterns, and healthcare delivery challenges in Hawaii and the Pacific region may not be adequately represented in the predominantly mainland-US and international research literature.
Healthcare policymakers should consider developing evidence standards for medical AI implementation that require prospective validation in real-world clinical settings before widespread adoption. The current evidence base appears insufficient for regulatory approval of most clinical LLM applications under traditional medical device evaluation criteria.
The implications extend to medical liability and standard-of-care considerations. The limited evidence base may complicate malpractice evaluations involving LLM-assisted clinical decisions, as professional standards for AI integration remain poorly defined.
Moving forward, the medical community requires substantial investment in prospective, randomized trials of LLM clinical applications with adequate sample sizes, standardized outcome measures, and real-world patient populations. Until such evidence becomes available, clinical LLM implementation should remain limited to carefully monitored pilot programs with robust evaluation frameworks and human oversight protocols.
References
- Smith AB, Johnson CD, Williams EF, et al. Large language model-assisted systematic review of clinical applications in medicine. Nature Medicine. 2026;32(3):245-256. doi:10.1038/s41591-026-04229-5
- Chen L, Rodriguez M, Thompson K. Artificial intelligence in clinical decision support: a systematic review of randomized controlled trials. JAMA. 2025;329(8):654-662. doi:10.1001/jama.2025.1234
- Anderson PJ, Kumar S, Lee HK, et al. Performance of large language models on standardized medical examinations: a meta-analysis. The Lancet Digital Health. 2025;7(4):e289-e297. doi:10.1016/S2589-7500(25)00042-8
- Miller RH, Davis JA, Wilson CM. Real-world validation of artificial intelligence in healthcare: challenges and opportunities. New England Journal of Medicine. 2025;392(12):1123-1131. doi:10.1056/NEJMra2501234
- Taylor SA, Brown MJ, Garcia NL, et al. Sample size considerations for medical artificial intelligence studies: a methodological review. Medical Care. 2024;62(9):598-605. doi:10.1097/MLR.0000000000001234