GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models

Article

Haicheng Liao et al. · Tsinghua University Press · 2024

Open access available

Author / responsible party

Haicheng Liao et al.

Publisher

Tsinghua University Press

Year

2024

ISSN

2772-4247

Language

eng

Access to the resource

The resource is identified as open access. Whether the link leads directly to the full text has not been automatically confirmed.

Abstract

In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces a sophisticated encoder-decoder framework, developed to address visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders—Text, Emotion, Image, Context, and Cross-Modal—with a multimodal decoder. This integration enables the CAVG model to adeptly capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the implementation of multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This architectural design enables the model to efficiently process and interpret a range of cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG establishes new standards in prediction accuracy and operational efficiency. Notably, the model exhibits exceptional performance even with limited training data, ranging from 50% to 75% of the full dataset. This feature highlights its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather conditions, and densely populated urban environments.
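The multi-head cross-modal attention mentioned in the abstract can be illustrated with a minimal sketch. This is not the CAVG implementation: it assumes text-token vectors act as queries and image-region vectors act as keys and values, it omits learned projection matrices, and it does not model the RSD layer or any of the five encoders. The function names, shapes, and sample vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(vector, num_heads):
    """Split one feature vector into num_heads equal-sized slices."""
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


def cross_modal_attention(text_tokens, image_regions, num_heads):
    """Scaled dot-product attention from text tokens to image regions.

    Assumption: queries come from the text side, keys and values from the
    image side, and projections are the identity (no learned weights).
    Returns (fused vectors per token, attention weights per token and head).
    """
    dim = len(text_tokens[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    scale = math.sqrt(head_dim)

    # Pre-split every image region so each head sees its own slice.
    region_heads = [split_heads(region, num_heads) for region in image_regions]

    fused, weights = [], []
    for token in text_tokens:
        query_heads = split_heads(token, num_heads)
        token_out, token_weights = [], []
        for h in range(num_heads):
            keys = [region[h] for region in region_heads]
            scores = [
                sum(q * k for q, k in zip(query_heads[h], key)) / scale
                for key in keys
            ]
            attn = softmax(scores)
            # Weighted sum of the region slices (values equal keys here).
            context = [
                sum(attn[i] * keys[i][d] for i in range(len(keys)))
                for d in range(head_dim)
            ]
            token_out.extend(context)  # concatenate the heads
            token_weights.append(attn)
        fused.append(token_out)
        weights.append(token_weights)
    return fused, weights


# Hypothetical toy input: one command token and two candidate image regions.
command_tokens = [[1.0, 0.0, 0.0, 1.0]]
regions = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
fused_features, attention = cross_modal_attention(command_tokens, regions, num_heads=2)
```

In this toy input the first region is aligned with the command token in both heads, so it receives the larger attention weight. A trained model would learn separate query, key, and value projections per head rather than using the raw slices.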

How to cite

APA 7

Liao, H., et al. (2024). GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models. https://doi.org/10.1016/j.commtr.2023.100116

MLA

Liao, Haicheng, et al. "GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models." 2024. https://doi.org/10.1016/j.commtr.2023.100116.

Chicago

Liao, Haicheng, et al. 2024. "GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models." https://doi.org/10.1016/j.commtr.2023.100116.

Harvard

Liao, H. et al. 2024, GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models, Tsinghua University Press, available at: https://doi.org/10.1016/j.commtr.2023.100116 [Accessed 28 Jun. 2026].

Subjects

Autonomous driving; Visual grounding; Cross-modal attention; Large language models; Human-machine interaction