Assessing the clinical reasoning of large language models on complex rheumatology cases: A multidimensional evaluation of four artificial intelligence

Article

Yannick Laurent Tchenadoyo Bayala et al. · SAGE Publishing · 2026

Open access available

Author / responsible party

Yannick Laurent Tchenadoyo Bayala et al.

Publisher

SAGE Publishing

Year

2026

ISSN

1741-2811

Language

English

Resource access

Open access available via DOAJ (Directory of Open Access Journals). The resource is identified as open access; it has not been automatically confirmed whether the link leads directly to the full text.

Abstract

Background: Large language models (LLMs) have demonstrated promising capabilities in medical diagnostic reasoning, yet their performance in specialized clinical domains such as rheumatology remains incompletely characterized. While diagnostic accuracy has been evaluated, critical dimensions including calibration, reasoning quality, and temporal stability have not been systematically assessed across contemporary models.

Objectives: This study aimed to comprehensively evaluate and compare the diagnostic accuracy, certainty expression, reasoning quality, and hallucination rates of four state-of-the-art LLMs (ChatGPT-4, Claude 3.5, DeepSeek-V3, and Gemini 1.5 Pro) in complex rheumatologic case scenarios.

Design: A cross-sectional, analytical, and comparative study was conducted following STARD and TRIPOD guidelines, adapted for LLM evaluation. Nine complex rheumatologic cases from published case reports were evaluated at three time points (Days 1, 5, and 10) between July 1 and September 18, 2025.

Methods: Standardized clinical vignettes were submitted to each LLM under controlled experimental conditions. Two blinded senior rheumatologists independently assessed diagnostic accuracy, reasoning quality across five analytical dimensions using Likert scales, and hallucination frequency. Certainty expression and temporal stability were quantified using intraclass correlation coefficients (ICC). Correlation analyses examined relationships between reasoning quality and confidence expression.

Results: All models achieved near-perfect diagnostic accuracy, with ChatGPT, Claude, and Gemini correctly identifying the primary diagnosis in 100% of cases and DeepSeek in 88.9%. However, Spearman correlation analysis revealed uniformly weak and non-significant associations between reasoning quality and expressed certainty across all models (ρ range: -0.156 to 0.215, all p > 0.05), indicating fundamental miscalibration. ChatGPT demonstrated the highest reasoning score (3.89 ± 0.23) and the lowest hallucination rate (7.4%), while Gemini showed the highest hallucination frequency (18.5%). Temporal stability was excellent for ChatGPT (ICC = 0.84) and good for DeepSeek (ICC = 0.79).

Conclusion: Despite exceptional diagnostic accuracy, current LLMs exhibit critical limitations in confidence calibration and variable hallucination rates, representing significant barriers to safe clinical deployment in rheumatology.
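The abstract reports Spearman correlations between reasoning quality and expressed certainty. As an illustration only, the following Python sketch shows how such a rank correlation can be computed. It is not the authors' analysis code: the function names `average_ranks` and `spearman_rho` are introduced here for illustration, and the two score lists are hypothetical values, not data from the study.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# HYPOTHETICAL scores for nine cases (assumed for illustration, not study data):
# a mean Likert reasoning score and a stated certainty percentage per case.
reasoning_quality = [4.0, 3.5, 4.5, 3.0, 4.0, 3.5, 4.5, 3.0, 4.0]
stated_certainty = [90, 95, 85, 90, 80, 95, 90, 85, 95]

rho = spearman_rho(reasoning_quality, stated_certainty)
print(f"Spearman rho = {rho:.3f}")
```

A value of ρ near zero means that certainty does not rise or fall with reasoning quality, which is the pattern the abstract describes as miscalibration. Testing significance (the reported p-values) would need an additional step, for example a permutation test or a statistics library.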

How to cite

APA 7

Yannick Laurent Tchenadoyo Bayala et al. (2026). Assessing the clinical reasoning of large language models on complex rheumatology cases: A multidimensional evaluation of four artificial intelligence. https://doi.org/10.1177/14604582261448687

MLA

Yannick Laurent Tchenadoyo Bayala et al. "Assessing the clinical reasoning of large language models on complex rheumatology cases: A multidimensional evaluation of four artificial intelligence." 2026. https://doi.org/10.1177/14604582261448687.

Chicago

Yannick Laurent Tchenadoyo Bayala et al. 2026. "Assessing the clinical reasoning of large language models on complex rheumatology cases: A multidimensional evaluation of four artificial intelligence." https://doi.org/10.1177/14604582261448687.

Harvard

Yannick Laurent Tchenadoyo Bayala et al. 2026, Assessing the clinical reasoning of large language models on complex rheumatology cases: A multidimensional evaluation of four artificial intelligence, SAGE Publishing, available at: https://doi.org/10.1177/14604582261448687 [Accessed 23 Jun. 2026].
