The Performance of DeepSeek R1 and Gemini 3 in Complex Medical Scenarios: Comparative Study

Article

Maria Bajwa et al · JMIR Publications · 2026

Open access available

Author / responsible party

Maria Bajwa et al

Publisher

JMIR Publications

Year

2026

ISSN

2563-6316

Language

eng

Access

The resource is identified as open access. Whether the link leads directly to the full text has not been confirmed automatically.

Abstract

Background: Generative artificial intelligence models, especially reasoning large language models (LLMs), are gaining adoption in health care for diagnostic decision support and medical education. DeepSeek R1 is a reasoning LLM that generates extended chain-of-thought explanations to make its decision-making process more explicit. Traditional medical benchmarks often lack complexity and authenticity, motivating the adoption of scenario-rich datasets, such as the Massive Multitask Language Understanding Pro (MMLU-Pro) professional medicine subset, which provides multispecialty clinical vignettes for reasoning-centric evaluation.

Objective: The objective of this study is to assess the diagnostic accuracy, reasoning quality, reasoning transparency, and practical usability of DeepSeek R1 and Gemini 3 Pro across closed- and open-ended clinical scenarios, with the intention of guiding their prospective application in practical clinical education and training. This evaluation was conducted by analyzing 162 diverse medical scenarios (both closed- and open-ended) from the MMLU-Pro health subset.

Methods: In a 2-phase, dual-model evaluation, DeepSeek R1 and Gemini 3 Pro were applied to 162 matched clinical vignettes from the MMLU-Pro professional medicine subset spanning 21 specialties. Closed-ended, multiple-choice, and open-ended prompts were constructed for the same scenarios, and model outputs were coded for accuracy, reasoning steps, and citation behavior; descriptive statistics and the McNemar test were used to compare performance across formats.

Results: DeepSeek R1 achieved an accuracy of 86.4% (140/162 scenarios) on closed-ended tasks and 80.9% (131/162) on open-ended questions across 162 clinical scenarios, indicating modest attenuation of performance when answer cues were removed. Gemini 3 Pro demonstrated 90.7% (147/162) closed-ended and 88.9% (144/162) open-ended accuracy on the same scenarios, showing a similar pattern of decreased performance without answer options. Error analysis indicated that incorrect answers typically involved longer reasoning chains, suggesting overthinking. In a structured review of open-ended responses, DeepSeek R1 produced an average of 18.7 (range 0–52) references per case, with 5.2 unrelated references and 13.1 (range 3–67) reasoning steps, whereas Gemini 3 Pro averaged 22.5 (range 12–50) references, 1.9 (range 0–8) unrelated references, and 4.4 (range 1–10) reasoning steps per case.

Conclusions: DeepSeek R1 demonstrated moderate-to-excellent accuracy and reasoning in evaluating both closed- and open-ended medical scenarios. In parallel, Gemini 3 Pro showed broadly comparable but distinct performance and reasoning patterns. While the closed-ended format may inflate accuracy due to cueing, the open-ended evaluation yielded richer insights into the fidelity of reasoning. Side-by-side evaluation of two large reasoning models highlights the importance of format, specialty, and citation behavior when considering clinical and educational use. Continued validation across a wider range of specialties and real-world contexts will enhance the model’s trustworthiness for diagnostic and teaching applications.
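The accuracy figures in the abstract can be rechecked from the reported counts, and the paired closed- versus open-ended comparison can be sketched with an exact McNemar test. The following is a minimal illustration in Python, not the authors' analysis code. The abstract does not report the discordant pair counts that the McNemar test needs, so the values of `c` in the usage loop are hypothetical. The only constraint taken from the abstract is that, for DeepSeek R1, the two discordant counts differ by 140 - 131 = 9. The names `accuracy_pct` and `mcnemar_exact` are likewise illustrative.

```python
from math import comb

# Correct-answer counts reported in the abstract (out of 162 scenarios).
N = 162
CORRECT = {
    ("DeepSeek R1", "closed"): 140,
    ("DeepSeek R1", "open"): 131,
    ("Gemini 3 Pro", "closed"): 147,
    ("Gemini 3 Pro", "open"): 144,
}


def accuracy_pct(correct: int, total: int = N) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: scenarios answered correctly in the closed-ended format only.
    c: scenarios answered correctly in the open-ended format only.
    Under the null hypothesis each discordant pair is equally likely to
    fall on either side, so this is a binomial test with p = 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    for (model, fmt), correct in CORRECT.items():
        print(f"{model:13s} {fmt:6s} {accuracy_pct(correct)}%")

    # HYPOTHETICAL discordant counts for DeepSeek R1: only b - c = 9 is
    # implied by the abstract; the individual values are not reported.
    for c in (0, 3, 6):
        b = c + 9
        print(f"b={b}, c={c}: p={mcnemar_exact(b, c):.4f}")
```

The loop shows why the discordant counts matter: the same net difference of 9 scenarios gives a different p-value depending on how many scenarios changed in each direction, so the test result cannot be reconstructed from the accuracy percentages alone.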

How to cite

APA 7

Bajwa, M., et al. (2026). The Performance of DeepSeek R1 and Gemini 3 in Complex Medical Scenarios: Comparative Study. https://doi.org/10.2196/76822

MLA

Bajwa, Maria, et al. "The Performance of DeepSeek R1 and Gemini 3 in Complex Medical Scenarios: Comparative Study." 2026. https://doi.org/10.2196/76822.

Chicago

Bajwa, Maria, et al. 2026. "The Performance of DeepSeek R1 and Gemini 3 in Complex Medical Scenarios: Comparative Study." https://doi.org/10.2196/76822.

Harvard

Bajwa, M. et al. 2026, The Performance of DeepSeek R1 and Gemini 3 in Complex Medical Scenarios: Comparative Study, JMIR Publications, available at: https://doi.org/10.2196/76822 [Accessed 29 Jun. 2026].
