Bibliographic record · Lookup and access
Article

Comparative evaluation of LLMs in orthopedic surgery

Gnaneswar Chundi et al. · Elsevier · 2026

Supplementary material available
Quick read. Review the resource's basic details, then open the content from the main button. This record shows only the information needed to identify, cite, and open the work.

Resource access

Open the content from the main option, or choose another available source.

Main access

Supplementary material available

The link points to associated material, appendices, tables, data, or a supplementary page. It is not marked as a book or full text.

Abstract

Overview of the resource's content.

Aims/objectives: This study aimed to evaluate the performance of leading large language models (LLMs) in orthopedic surgery, with a focus on diagnostic accuracy, radiographic interpretation, subspecialty-specific performance, consistency, and gender bias. Unlike previous investigations that focused solely on ChatGPT, we assessed Claude-3-Sonnet, GPT-4o, Gemini-1.5, and Meta's LLaMA Vision-Instruct models against Orthopaedic In-Training Examination (OITE) questions.

Methods: We tested each LLM on 2906 multiple-choice questions from the OITE question bank. Questions were categorized by subspecialty and by presence of images. Models provided answer choices, confidence scores (1–4), and justifications. Each question was administered three times to evaluate consistency. Statistical analyses included Z-tests, t-tests, ANOVA, chi-square tests, and logistic regression. Rasch transformation enabled comparison to PGY-1 and PGY-5 resident performance. Gender bias was evaluated based on performance differences across gender-specific cases.

Results: GPT-4o achieved the highest accuracy (72.8 %), outperforming all other models. Performance improved with larger model sizes across all vendors. All models showed diminished accuracy on image-based questions (mean 10.9 % lower, p < 0.001). Confidence scores and triplicate agreement were associated with accuracy; their combination yielded the most reliable outputs (up to 83.4 % accuracy). Several models exhibited worse performance on female patient questions, indicating possible gender bias.

Conclusion: LLMs demonstrate varying levels of accuracy in orthopaedic applications, with GPT-4o approaching PGY-5 performance, strictly in the context of standardized exam questions. Current models underperform on image-based questions and exhibit gender bias. Response confidence and consistency can flag reliable outputs. Continued development, bias mitigation, and fine-tuning with diverse, domain-specific datasets are essential for integration of LLMs into orthopaedic education and practice.
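
The abstract reports that combining confidence scores with triplicate agreement gave the most reliable outputs, but it does not state the exact rule the authors used. The following is only a minimal sketch of one way such a flag could be computed. The function name `flag_reliable`, the unanimity requirement, and the `min_confidence` threshold of 3 are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def flag_reliable(answers, confidences, min_confidence=3):
    """Return (majority_answer, is_reliable) for repeated runs of one question.

    Hypothetical rule (an assumption, not the paper's method): the output is
    flagged reliable only if every run gave the same answer AND the mean
    self-reported confidence (1-4 scale) is at least `min_confidence`.
    """
    if not answers or len(answers) != len(confidences):
        raise ValueError("answers and confidences must be non-empty and of equal length")

    # Most frequent answer across the repeated runs and how often it occurred.
    majority, count = Counter(answers).most_common(1)[0]
    unanimous = count == len(answers)

    mean_confidence = sum(confidences) / len(confidences)
    return majority, unanimous and mean_confidence >= min_confidence


# Three runs of the same question: answer choice plus confidence per run.
print(flag_reliable(["B", "B", "B"], [4, 3, 4]))
print(flag_reliable(["B", "C", "B"], [4, 4, 4]))
```

Under this assumed rule, a question answered identically in all three runs with high confidence is flagged, while a split answer or a low mean confidence is not. A looser variant could accept two-of-three agreement instead.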

How to cite

Choose the format you need and copy the reference.

APA 7

Chundi, G., et al. (2026). Comparative evaluation of LLMs in orthopedic surgery. https://doi.org/10.1016/j.jorep.2025.100728

MLA

Chundi, Gnaneswar, et al. "Comparative evaluation of LLMs in orthopedic surgery." 2026. https://doi.org/10.1016/j.jorep.2025.100728.

Chicago

Chundi, Gnaneswar, et al. 2026. "Comparative evaluation of LLMs in orthopedic surgery." https://doi.org/10.1016/j.jorep.2025.100728.

Harvard

Chundi, G. et al. 2026, Comparative evaluation of LLMs in orthopedic surgery, Elsevier, available at: https://doi.org/10.1016/j.jorep.2025.100728 [Accessed 29 Jun. 2026].

Resource details

Bibliographic information to help confirm that this is the correct item.

Title
Comparative evaluation of LLMs in orthopedic surgery
Author / contributors
Gnaneswar Chundi et al.
Publisher
Elsevier
Year of publication
2026
ISSN
2773-157X
Language
English (eng)
