Bibliographic record · Lookup and access
Article

Measuring the gap: correlating synthetic-to-real drift with PHI de-identification performance

Joseph Cornelius et al · BioMed Central · 2026

Open access available
Resource identified as open access; whether it links directly to the full text has not been automatically confirmed.

Abstract

Clinical text de-identification enables the use of electronic health records while protecting patient privacy, but public training data remain scarce and often have mismatched documentation styles. Recent work has proposed using large language models (LLMs) to generate synthetic clinical notes, but it remains unclear whether such notes reflect the distributions of real clinical notes. We examine how lexical and semantic drift between training and evaluation corpora affects the performance of protected health information (PHI) taggers. We generated synthetic notes from scratch for four categories using five generator LLMs and one judge LLM. We then fine-tuned small de-identification models on real, synthetic, and mixed corpora and evaluated them on three external benchmarks under a harmonized label schema. Models trained on broad, clinically oriented sources transfer better than those trained on legal or narrowly synthetic data. These results suggest that although synthetic data lack some real-world distributional properties, they remain useful in low-resource settings. We found that compact distributional and embedding-based drift measures correlate moderately with out-of-distribution F1 score, a practically important result because drift estimation can improve quality control and alignment of synthetic data.

How to cite

APA 7

Cornelius, J., et al. (2026). Measuring the gap: correlating synthetic-to-real drift with PHI de-identification performance. https://doi.org/10.1186/s44342-026-00072-9

MLA

Cornelius, Joseph, et al. "Measuring the gap: correlating synthetic-to-real drift with PHI de-identification performance." 2026. https://doi.org/10.1186/s44342-026-00072-9.

Chicago

Cornelius, Joseph, et al. 2026. "Measuring the gap: correlating synthetic-to-real drift with PHI de-identification performance." https://doi.org/10.1186/s44342-026-00072-9.

Harvard

Cornelius, J. et al. 2026, Measuring the gap: correlating synthetic-to-real drift with PHI de-identification performance, BioMed Central, available at: https://doi.org/10.1186/s44342-026-00072-9 [Accessed 29 Jun. 2026].

Resource details

Title
Measuring the gap: correlating synthetic-to-real drift with PHI de-identification performance
Author / contributors
Joseph Cornelius et al.
Publisher
BioMed Central
Year of publication
2026
ISSN
1598-866X
Language
eng