Evaluating large language models for abstract evaluation tasks: an empirical study

Article

Yinuo Liu et al. · Frontiers Media S.A. · 2026

Supplementary material available

Author / responsible party

Yinuo Liu et al

Publisher

Frontiers Media S.A

Year

2026

ISSN

2504-0537

Language

eng

Access to the resource

Source: DOAJ - Open Access Journals

The link points to associated material (appendices, tables, data, or a supplementary page). It is not marked as the book or full text.

Abstract

Introduction: Large language models (LLMs) show great promise as tools for assisting scientific peer review, but their agreement with human experts in quantitative assessment of academic content needs further investigation. This study examined ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5's consistency and reliability in evaluating conference abstracts compared to one another and to human reviewers.

Methods: Three LLMs independently graded 160 abstracts from a regional conference, while 14 human reviewers each assessed a subset using an identical rubric with eight criteria scored on a 1–5 scale. We compared AI and human scoring patterns using boxplots, calculated intraclass correlation coefficients (ICCs) for inter-rater reliability both among LLMs and between human and LLMs, and examined Bland-Altman plots to identify agreement patterns and systematic bias.

Results: Three LLMs demonstrated high internal consistency with narrow interquartile ranges and few outliers in composite scores, while human reviewers exhibited greater scoring variability. LLMs also achieved good-to-excellent agreement with each other across all criteria (ICCs: 0.59–0.87). ChatGPT and Claude reached moderate agreement with human reviewers on overall quality and content-specific criteria, with ICCs = 0.45–0.60 for composite score, impression, clarity, objective, and results. The two LLMs' concordance with humans achieved fair levels on subjective dimensions, with ICC ranging from 0.23–0.38 for impact, engagement, and applicability. Gemini performed notably worse, showing fair agreement on half the criteria and poor reliability on impact and applicability. Bland-Altman analysis revealed acceptable or negligible systematic bias, with mean differences of 0.24 (ChatGPT), 0.42 (Gemini), and −0.02 (Claude) from human mean ratings.

Discussion: With appropriate model selection, LLMs could reach moderate agreement with human experts on abstract overall quality and objective criteria, supporting their potential use for pre-screening low-quality submissions or serving as additional reviewers. Their ability to apply rubrics consistently across large volumes of abstracts offers advantages in efficiency and standardization that exceed human feasibility. However, LLMs' reduced performance on subjective dimensions indicates that they should complement rather than replace human judgment in abstract evaluation, with expert review remaining essential for comprehensive assessment.
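
The two agreement statistics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the abstract does not say which ICC form was used, so the sketch assumes ICC(2,1) (two-way random effects, absolute agreement, single rater), and the function names `icc2_1` and `bland_altman` as well as the toy `ratings` table are hypothetical.

```python
from statistics import mean, stdev

def icc2_1(scores):
    """ICC(2,1), an assumed ICC form (the abstract does not specify one).

    scores: list of rows, one row per abstract, one column per rater.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(col) for col in zip(*scores)]
    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def bland_altman(a, b):
    """Return (bias, lower limit, upper limit) for paired ratings a and b.

    bias is the mean of a - b; the limits are bias +/- 1.96 sample SDs.
    """
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread

# Hypothetical toy data: 6 abstracts (rows) scored by 4 raters (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc2_1(ratings), 2))  # 0.29

# Compare the first rater against the second, pairwise per abstract.
first = [row[0] for row in ratings]
second = [row[1] for row in ratings]
print(bland_altman(first, second))
```

A bias near zero, like the −0.02 reported for Claude, means the two raters score at about the same level on average. The limits of agreement show how far individual paired scores can still diverge.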

How to cite

APA 7

Liu, Y., et al. (2026). Evaluating large language models for abstract evaluation tasks: an empirical study. https://doi.org/10.3389/frma.2026.1807672

MLA

Liu, Yinuo, et al. "Evaluating large language models for abstract evaluation tasks: an empirical study." 2026. https://doi.org/10.3389/frma.2026.1807672.

Chicago

Liu, Yinuo, et al. 2026. "Evaluating large language models for abstract evaluation tasks: an empirical study." https://doi.org/10.3389/frma.2026.1807672.

Harvard

Liu, Y. et al. 2026, Evaluating large language models for abstract evaluation tasks: an empirical study, Frontiers Media S.A, available at: https://doi.org/10.3389/frma.2026.1807672 [Accessed 23 Jun. 2026].

Subjects

abstract evaluation; artificial intelligence; inter-rater reliability; large language models; peer-review