MoTIF: An end-to-end multimodal road traffic scene understanding foundation model

Article

Zihe Wang et al. · Tsinghua University Press · 2025

Open access available

Author / responsible party

Zihe Wang et al.

Publisher

Tsinghua University Press

Year

2025

ISSN

2772-4247

Language

English

Resource access

Identified as open access; whether the link leads directly to the full text has not been automatically confirmed.

Abstract

Video-based intelligent road detection is a critical component of modern intelligent transportation systems, playing a crucial role in comprehensive transportation planning and emergency traffic management. Current traffic scene perception methods that rely on conventional deep learning architectures have inherent limitations, including heavy dependence on extensive manual annotation of specific traffic scenarios and on predefined rule configurations. These approaches show constrained semantic representation capacity and limited generalizability across heterogeneous traffic scenarios. To address these challenges, this study proposes a novel end-to-end multimodal foundation model architecture that jointly generates dynamic traffic event detection outcomes and semantically rich contextual descriptions. By integrating low-rank adaptation (LoRA) and prompt fine-tuning as parameter-efficient fine-tuning strategies, we develop the multimodal road traffic scene understanding foundation model (MoTIF), which establishes cross-modal alignment between visual patterns and textual semantics. This framework demonstrates enhanced capability in extracting salient traffic targets and generating hierarchical scene representations, significantly improving automated detection efficiency in road video analytics. Notably, MoTIF exhibits contextual reasoning capabilities for implicit traffic event interpretation. Extensive evaluations on two real-world datasets, covering urban road intersection scenarios in Tianjin and highway monitoring systems in Shandong Province, show that MoTIF achieves an average score of 65.81 on the multimodal scene understanding assessment and 83.33% event detection accuracy, outperforming mainstream benchmarks in both precision and computational efficiency. This research advances multimodal learning paradigms for intelligent transportation systems while providing practical insights for adaptive traffic management applications.
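
The abstract names low-rank adaptation (LoRA) as one of the parameter-efficient fine-tuning strategies but gives no implementation details. As a generic illustration of the technique, not MoTIF's actual code, the sketch below shows the standard LoRA forward pass for a single linear layer: the frozen weight `w0` is left untouched and a trainable low-rank product `lora_b @ lora_a`, scaled by `alpha / r`, is added to its output. The layer sizes, the rank, `alpha`, and all function and variable names are assumptions chosen for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(w0, lora_a, lora_b, x, alpha):
    """Return W0 x + (alpha / r) * B (A x) for a column vector x.

    w0:     frozen pretrained weight, shape (d_out, d_in)
    lora_a: trainable down-projection A, shape (r, d_in)
    lora_b: trainable up-projection B, shape (d_out, r)
    """
    r = len(lora_a)                      # rank of the update
    scale = alpha / r
    base = matmul(w0, x)                 # frozen path
    update = matmul(lora_b, matmul(lora_a, x))  # low-rank path
    return [[bv + scale * uv for bv, uv in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


# Hypothetical sizes for illustration only.
random.seed(0)
d_out, d_in, rank = 4, 6, 2
w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Usual LoRA initialization: A is small random, B is zero,
# so the adapted layer starts out identical to the frozen layer.
lora_a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]
x = [[1.0] for _ in range(d_in)]

full_params = d_out * d_in            # weights updated by full fine-tuning
lora_params = rank * (d_in + d_out)   # weights updated by LoRA
output = lora_forward(w0, lora_a, lora_b, x, alpha=4.0)
```

In this toy layer the two counts are close, but `rank * (d_in + d_out)` grows linearly with the layer dimensions while `d_out * d_in` grows with their product, which is where the parameter savings come from at realistic layer sizes.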

How to cite

APA 7

Wang, Z., et al. (2025). MoTIF: An end-to-end multimodal road traffic scene understanding foundation model. https://doi.org/10.1016/j.commtr.2025.100227

MLA

Wang, Zihe, et al. "MoTIF: An end-to-end multimodal road traffic scene understanding foundation model." 2025. https://doi.org/10.1016/j.commtr.2025.100227.

Chicago

Wang, Zihe, et al. 2025. "MoTIF: An end-to-end multimodal road traffic scene understanding foundation model." https://doi.org/10.1016/j.commtr.2025.100227.

Harvard

Wang, Z. et al. 2025, MoTIF: An end-to-end multimodal road traffic scene understanding foundation model, Tsinghua University Press, available at: https://doi.org/10.1016/j.commtr.2025.100227 [Accessed 1 Jul. 2026].

Subjects

Road traffic; Scene understanding; Multimodal foundation model; Fine-tuning