Multi-metric comparative evaluation of DeepSeek and ChatGPT in USMLE versus CNMLE for medical education

Article

Qing Wang et al · Nature Portfolio · 2026

Supplementary material available

Author / responsible party

Qing Wang et al.

Publisher

Nature Portfolio

Year

2026

ISSN

2045-2322

Language

English

Resource access

Supplementary material available via DOAJ - Open Access Journals. The link points to associated material (annexes, tables, data, or a supplementary page) and is not marked as the book or full text.

Abstract

Large language models (LLMs) such as ChatGPT and DeepSeek are gaining attention for their potential in medical education. This study evaluates the performance of ChatGPT and DeepSeek on the United States Medical Licensing Examination (USMLE) and the Chinese National Medical Licensing Examination (CNMLE), and then proposes targeted optimization methods to support the efficient and effective application of LLMs in medical education. The study conducted a comparative quantitative analysis across multiple dimensions: answer accuracy, consistency, the number of reasoning characters, and runtime. Based on the identified limitations of LLMs, targeted optimizations were explored, including the construction of a technical safeguard framework and a multi-dimensional evaluation system. On the USMLE, DeepSeek had an average accuracy of 92.59% and a Fleiss’ Kappa of 0.96, while ChatGPT had an accuracy of 90.26% and a Fleiss’ Kappa of 0.93. On the CNMLE, DeepSeek achieved an accuracy of 86.78% and a Fleiss’ Kappa of 0.96, while ChatGPT had an accuracy of 79.44% and a Fleiss’ Kappa of 0.90. Both DeepSeek and ChatGPT demonstrated the ability to identify flawed questions, yet both also produced incorrect answers due to hallucinations. To address these issues, the study proposed a Knowledge Graph-Based RAG Fact-Checking Framework centered on evidence anchoring and a multi-dimensional evaluation system focusing on reliability and safety. DeepSeek generally outperformed ChatGPT in accuracy, particularly on complex medical problems and Chinese medical knowledge, but had a longer runtime than ChatGPT. According to the authors, the proposed optimization framework and evaluation system effectively address core issues such as LLM hallucinations and clarify the positioning of LLMs as “auxiliary tools” that require rigorous fact-checking. Together, these solutions form a core governance system for the application of LLMs in medical education, providing key support for their precise and efficient integration into educational scenarios. The study indicates that LLMs are expected to bring about a progressive transformation, evolving from functional enhancement to paradigm reconstruction.
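The abstract reports two of its metrics, average accuracy and Fleiss’ Kappa, without showing how they are computed. The following is a minimal sketch of both, assuming that each question is answered several times by the same model and that each repeated run is treated as one "rater". The function names `fleiss_kappa` and `mean_accuracy` and the toy data are hypothetical illustrations, not the authors' code or data.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of categorical answers.

    Assumption: every item has the same number of answers (one per repeated
    run), and each run plays the role of one rater.
    """
    if not ratings:
        raise ValueError("ratings must contain at least one item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of answers")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Per-item agreement: P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
        squared = sum(c * c for c in counts.values())
        agreement_sum += (squared - n_raters) / (n_raters * (n_raters - 1))

    p_bar = agreement_sum / len(ratings)      # mean observed agreement
    total = len(ratings) * n_raters
    p_e = sum((c / total) ** 2 for c in category_totals.values())  # chance agreement
    if p_e == 1.0:
        return math.nan                       # kappa is undefined when only one category occurs
    return (p_bar - p_e) / (1.0 - p_e)


def mean_accuracy(ratings, answer_key):
    """Accuracy of each run against the answer key, averaged over runs."""
    n_raters = len(ratings[0])
    per_run = []
    for run in range(n_raters):
        correct = sum(item[run] == key for item, key in zip(ratings, answer_key))
        per_run.append(correct / len(answer_key))
    return sum(per_run) / n_raters


if __name__ == "__main__":
    # Toy data invented for illustration: 4 questions, 3 repeated runs each.
    toy_runs = [["A", "A", "A"], ["B", "B", "B"], ["A", "A", "B"], ["C", "C", "C"]]
    toy_key = ["A", "B", "A", "C"]
    print(f"kappa = {fleiss_kappa(toy_runs):.3f}")
    print(f"accuracy = {mean_accuracy(toy_runs, toy_key):.3f}")
```

Under this reading, a kappa close to 1 means the model gives nearly the same answer to each question across runs, independently of whether that answer is correct, which is why the abstract reports consistency separately from accuracy.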

How to cite

APA 7

Wang, Q., et al. (2026). Multi-metric comparative evaluation of DeepSeek and ChatGPT in USMLE versus CNMLE for medical education. https://doi.org/10.1038/s41598-026-40043-2

MLA

Wang, Qing, et al. "Multi-metric comparative evaluation of DeepSeek and ChatGPT in USMLE versus CNMLE for medical education." 2026. https://doi.org/10.1038/s41598-026-40043-2.

Chicago

Wang, Qing, et al. 2026. "Multi-metric comparative evaluation of DeepSeek and ChatGPT in USMLE versus CNMLE for medical education." https://doi.org/10.1038/s41598-026-40043-2.

Harvard

Wang, Q. et al. 2026, Multi-metric comparative evaluation of DeepSeek and ChatGPT in USMLE versus CNMLE for medical education, Nature Portfolio, available at: https://doi.org/10.1038/s41598-026-40043-2 [Accessed 25 Jun. 2026].

Subjects

Large language models; Medical education; DeepSeek; ChatGPT; United States Medical Licensing Examination; Chinese National Medical Licensing Examination
