Journal article

Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation

Ruize Xia · IEEE · 2026


Abstract

Sign language is a primary communication channel for millions of people who are deaf or hard of hearing, yet generating signer video directly from text remains difficult because video diffusion models are expensive to train and evaluate. This paper presents Text2Sign, a text-conditioned diffusion architecture for short sign-language video clips, designed to operate on a single NVIDIA L4 graphics processor rather than on a multi-node training infrastructure. The model combines a frozen vision–language text encoder with a three-dimensional encoder–decoder backbone and factorized spatial and temporal attention, thereby reducing the cost of full spatio-temporal attention while preserving motion coherence. Three design choices are examined: whether transformer-style blocks improve upon convolution-only baselines, whether a frozen pretrained text encoder yields lower loss than a task-specific encoder trained from scratch under the present short-budget comparison, and whether factorized attention is competitive with full video attention. On a signer-disjoint partition of short clips extracted from How2Sign, the best short-run ablation attains a validation loss of 0.0648, while a longer-run checkpoint reaches 0.00999. A compact evaluation slice of that checkpoint yields SSIM 0.2403 ± 0.0238, PSNR 15.11 ± 0.42 dB, and temporal consistency 1.0000 ± 0.0000; under an 8-step DDIM setting with guidance scale 5.0, the model generates a 32-frame 64 × 64 clip in 12.60 s (2.54 frames/s) with 3.12 GB peak inference memory on a single NVIDIA L4.
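The cost argument for factorized attention and the reported throughput can be checked with a few lines of arithmetic. The sketch below is illustrative only: it counts query-key pairs for full versus factorized attention on an assumed 8 × 8 feature map (the abstract does not state the resolution at which attention is applied, so that grid is a hypothetical choice), and it recomputes frames per second from the clip length and generation time given above. The function names are placeholders, not identifiers from the Text2Sign code.

```python
def full_attention_pairs(frames: int, height: int, width: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    tokens = frames * height * width
    return tokens * tokens


def factorized_attention_pairs(frames: int, height: int, width: int) -> int:
    """Query-key pairs for one spatial pass per frame plus one temporal pass per location."""
    spatial = frames * (height * width) ** 2   # each frame attends within itself
    temporal = (height * width) * frames ** 2  # each location attends across frames
    return spatial + temporal


def frames_per_second(frames: int, seconds: float) -> float:
    """Generation throughput for a clip of `frames` frames produced in `seconds`."""
    return frames / seconds


# Assumed 8 x 8 feature map for a 32-frame clip (hypothetical, not from the paper).
full = full_attention_pairs(32, 8, 8)
fact = factorized_attention_pairs(32, 8, 8)
print(full, fact, round(full / fact, 1))       # 4194304 196608 21.3
print(round(frames_per_second(32, 12.60), 2))  # 2.54
```

Under that assumed grid, factorization cuts the number of attention pairs by roughly a factor of 21, and the throughput figure matches the 2.54 frames/s stated in the abstract.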
In a held-out conditional denoising audit on real validation clips, removing text raises late-timestep denoising loss from 0.9875 to 0.9891, whereas shuffled prompts remain nearly indistinguishable from the intended prompt. Thus, frozen text conditioning yields a lower short-budget validation loss than the custom encoder baseline, and the post-revision checkpoint is qualitatively stronger than the earlier baseline in direct side-by-side inspection; however, held-out audits still show only weak prompt-specific separation. The present system remains limited to low-resolution short clips and does not yet include expert linguistic evaluation; accordingly, the reported results should be interpreted as a single-GPU research baseline rather than a complete solution to sign-language production. The code is publicly available at https://github.com/xiaruize0911/text2sign.
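The structure of such a conditioning audit can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: `loss_fn`, the clip dictionaries, and the toy loss values are all hypothetical stand-ins for a real denoiser evaluated at late timesteps. The sketch compares mean loss under the intended prompt, an empty prompt, and a mismatched prompt (here obtained by rotating the prompt list by one position, which assumes the prompts are distinct).

```python
from statistics import mean


def conditioning_audit(loss_fn, clips, prompts):
    """Mean denoising loss under matched, empty, and mismatched prompts."""
    mismatched = prompts[1:] + prompts[:1]  # rotate so each clip gets another clip's prompt
    return {
        "matched": mean([loss_fn(c, p) for c, p in zip(clips, prompts)]),
        "no_text": mean([loss_fn(c, "") for c in clips]),
        "shuffled": mean([loss_fn(c, p) for c, p in zip(clips, mismatched)]),
    }


def relative_gap(baseline: float, other: float) -> float:
    """Relative increase of `other` over `baseline`."""
    return (other - baseline) / baseline


# Toy stand-in for a denoiser loss (hypothetical): lower only when the prompt matches.
def toy_loss(clip, prompt):
    return 0.75 if prompt == clip["gloss"] else 1.0


clips = [{"gloss": "hello"}, {"gloss": "thanks"}, {"gloss": "water"}]
prompts = [c["gloss"] for c in clips]
print(conditioning_audit(toy_loss, clips, prompts))

# Gap implied by the abstract's figures when text is removed.
print(round(relative_gap(0.9875, 0.9891) * 100, 2))  # 0.16 (percent)
```

A strongly text-conditioned model would show a clear gap between the matched condition and the other two, as the toy loss does. The figures in the abstract correspond to an increase of about 0.16% when text is removed, which is consistent with the weak separation the author reports.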

How to cite

APA 7

Xia, R. (2026). Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation. https://doi.org/10.1109/ACCESS.2026.3686260

MLA

Xia, Ruize. "Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation." 2026. https://doi.org/10.1109/ACCESS.2026.3686260.

Chicago

Xia, Ruize. 2026. "Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation." https://doi.org/10.1109/ACCESS.2026.3686260.

Harvard

Xia, R. 2026, Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation, IEEE, available at: https://doi.org/10.1109/ACCESS.2026.3686260 [Accessed 28 Jun. 2026].


Resource details

Title
Text2Sign: A Single-GPU Diffusion Baseline for Text-to-Sign Language Video Generation
Author / contributors
Ruize Xia
Publisher
IEEE
Year of publication
2026
ISSN
2169-3536
Language
eng
