Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation

Journal article

Ruize Xia · IEEE · 2026

Author / responsible party

Ruize Xia

Publisher

IEEE

Year

2026

ISSN

2169-3536

Language

eng

Abstract

Sign language is a primary communication channel for millions of people who are deaf or hard of hearing, yet generating signer video directly from text remains difficult because video diffusion models are expensive to train and evaluate. This paper presents Text2Sign, a text-conditioned diffusion architecture for short sign-language video clips, designed to operate on a single NVIDIA L4 graphics processor rather than on a multi-node training infrastructure. The model combines a frozen vision–language text encoder with a three-dimensional encoder–decoder backbone and factorized spatial and temporal attention, thereby reducing the cost of full spatio-temporal attention while preserving motion coherence. Three design choices are examined: whether transformer-style blocks improve upon convolution-only baselines, whether a frozen pretrained text encoder yields lower loss than a task-specific encoder trained from scratch under the present short-budget comparison, and whether factorized attention is competitive with full video attention. On a signer-disjoint partition of short clips extracted from How2Sign, the best short-run ablation attains a validation loss of 0.0648, while a longer-run checkpoint reaches 0.00999. A compact evaluation slice of that checkpoint yields SSIM $0.2403 \pm 0.0238$, PSNR $15.11 \pm 0.42$ dB, and temporal consistency $1.0000 \pm 0.0000$; under an 8-step DDIM setting with guidance scale 5.0, the model generates a 32-frame $64 \times 64$ clip in 12.60 s (2.54 frames/s) with 3.12 GB peak inference memory on a single NVIDIA L4.
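The cost argument for factorized attention can be made concrete by counting query-key pairs. The sketch below is illustrative only: the 8 x 8 token grid per frame is a hypothetical value (the abstract gives the 32-frame clip length but not the resolution at which attention is applied), and the function names are not taken from the Text2Sign code.

```python
def full_attention_pairs(frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    n = frames * tokens_per_frame
    return n * n


def factorized_attention_pairs(frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs for spatial attention within each frame plus
    temporal attention across frames at each spatial location."""
    spatial = frames * tokens_per_frame ** 2
    temporal = tokens_per_frame * frames ** 2
    return spatial + temporal


FRAMES = 32                # clip length reported in the abstract
TOKENS_PER_FRAME = 8 * 8   # hypothetical attention grid, not from the paper

full = full_attention_pairs(FRAMES, TOKENS_PER_FRAME)
factorized = factorized_attention_pairs(FRAMES, TOKENS_PER_FRAME)
print(f"full: {full}, factorized: {factorized}, ratio: {full / factorized:.1f}x")
# full: 4194304, factorized: 196608, ratio: 21.3x
```

Under these assumed sizes the factorized scheme scores roughly 21 times fewer pairs per attention layer; the actual saving in Text2Sign depends on its real token grid and layer layout.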
In a held-out conditional denoising audit on real validation clips, removing text raises late-timestep denoising loss from 0.9875 to 0.9891, whereas shuffled prompts remain nearly indistinguishable from the intended prompt. Thus, frozen text conditioning yields a lower short-budget validation loss than the custom encoder baseline, and the post-revision checkpoint is qualitatively stronger than the earlier baseline in direct side-by-side inspection; however, held-out audits still show only weak prompt-specific separation. The present system remains limited to low-resolution short clips and does not yet include expert linguistic evaluation; accordingly, the reported results should be interpreted as a single-GPU research baseline rather than a complete solution to sign-language production. The code is publicly available at https://github.com/xiaruize0911/text2sign.
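To put the audit numbers in perspective, the gap between the text-conditioned and text-removed losses can be expressed in absolute and relative terms. This is plain arithmetic on the two values quoted in the abstract; the variable and function names are illustrative and do not come from the paper or its repository.

```python
def relative_change(baseline: float, perturbed: float) -> float:
    """Fractional change of `perturbed` with respect to `baseline`."""
    return (perturbed - baseline) / baseline


LOSS_WITH_TEXT = 0.9875     # late-timestep denoising loss, intended prompt
LOSS_WITHOUT_TEXT = 0.9891  # same audit with the text removed

gap = LOSS_WITHOUT_TEXT - LOSS_WITH_TEXT
rel = relative_change(LOSS_WITH_TEXT, LOSS_WITHOUT_TEXT)
print(f"absolute gap: {gap:.4f}, relative increase: {rel:.2%}")
# absolute gap: 0.0016, relative increase: 0.16%
```

An increase of about 0.16% is consistent with the abstract's own reading that prompt-specific separation remains weak.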

How to cite

APA 7

Xia, R. (2026). Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation. https://doi.org/10.1109/ACCESS.2026.3686260

MLA

Xia, Ruize. "Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation." 2026. https://doi.org/10.1109/ACCESS.2026.3686260.

Chicago

Xia, Ruize. 2026. "Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation." https://doi.org/10.1109/ACCESS.2026.3686260.

Harvard

Xia, R. 2026, Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation, IEEE, available at: https://doi.org/10.1109/ACCESS.2026.3686260 [Accessed 28 Jun. 2026].

Subjects

Sign language generation; diffusion models; text-to-video synthesis; video generation; accessibility; spatio-temporal attention
