← Volver a resultados
Ficha bibliográfica · Consulta y acceso
Artículo

An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering

M. Balakrishna Mallapu et al · IEEE · 2026

Material complementario disponible
Lectura rápida. Revisá los datos básicos del recurso y luego accedé al contenido desde el botón principal. En esta ficha solo se muestra la información necesaria para identificar la obra, citarla y abrirla.
Revista académica

3PS-RAN: A Real-Time Framework for Securing the O-RAN RACH Against DDoS Attacks Toward NextG

Esta revista contiene 172 artículos y documentos relacionados.

Acceso al recurso

Entrá al contenido desde la opción principal o elegí otra fuente disponible.

Acceso principal

Material complementario disponible

DOAJ DOAJ - Open Access Journals
El enlace apunta a material asociado, anexos, tablas, datos o página complementaria. No se marca como libro/texto completo.
Abrir material

Resumen

Descripción general del contenido del recurso.

The provision of accurate and explainable captions and answers to visual questions about remote sensing images is essential in areas such as disaster management, land-use supervision, and military analysis, where system interpretability is a critical factor for decision-making. However, existing vision–language models, powered by transformers and multimodal encoders, often struggle to generalize across diverse scene types and lack transparent reasoning mechanisms, thereby limiting their applicability in critical domains. To address these challenges, thisresearch aims to develop a unified and explainable vision–language framework that improves both caption generation and visual question answering performance while ensuring interpretability. Although the current vision-language models using the transformer architecture, multimodal encoders, and attention-based decoders achieve promising results, the standardization of diverse scene types and transparent decision-making remain to be addressed. Therefore, this research presents the application of the explainable vision-language model with the incorporation of the Mixture of Experts-based Robustly bidirectional encoder representations from the transformers AutoEncoder (MoERAE) block to the caption generation task. The MoERAE model utilizes the dynamically routing mechanism for the routing of the multimodal features using different expert paths to capture the context-based representation of the different scenarios in remote sensing. The framework further incorporates an Attention-based Vision Transformer (AVT) for robust spatial feature extraction and GPT-4 for effective multimodal alignment and semantic reasoning. The fine-tuning of MoERAE is carried out by the Rabbit and Turtle Algorithm (RTA), which is an adaptive optimizer that ensures the optimal selection of experts by striking a balance between rapid exploration and convergence. Thus, the quality of the captions generated is very high. Furthermore, the Local Interpretable Model-Agnostic Explanations (LIME) technique has been employed to ascertain the importance of specific visual and textual tokens in the generation of captions and answers, thus promoting transparency. The framework is tested and validated on several benchmark datasets such as UCM-Captions, Sydney-Captions, and RSICD. The metrics for evaluating the framework are also provided. The experimental results prove that the proposed framework performs better than existing ones by achieving a high of 99% and a cross-validation accuracy of 99%, thereby proving the robustness and generalization of the framework. The results also prove that the integration of multimodal alignment, adaptive expert routing, and explainability mechanisms into the framework significantly enhances its performance and interpretability for practical use cases in the field of remote sensing.

Cómo citar

Elegí el formato que necesitás y copiá la referencia al portapapeles.

APA 7

al, M. B. M. E. (2026). An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering. https://doi.org/10.1109/ACCESS.2026.3686569

MLA

al, M. Balakrishna Mallapu et. "An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering." 2026. https://doi.org/10.1109/ACCESS.2026.3686569.

Chicago

al, M. Balakrishna Mallapu et. 2026. "An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering.". https://doi.org/10.1109/ACCESS.2026.3686569.

Harvard

al, M. B. M. E. 2026, An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering, IEEE, available at: https://doi.org/10.1109/ACCESS.2026.3686569 [Accessed 23 Jun. 2026].

Compartir e imprimir

Guardá la ficha, copiá su enlace permanente o imprimila como PDF.

Exportar referencia

Si usás un gestor bibliográfico, podés exportar el registro en los formatos más comunes.

Detalles del recurso

Información bibliográfica útil para confirmar que se trata del material correcto.

Título
An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering
Autor / colaboradores
M. Balakrishna Mallapu et al
Editorial
IEEE
Año de publicación
2026
ISSN
2169-3536
ISSN
2169-3536
Idioma
eng

Materias

Explorá otros recursos relacionados a partir de estas materias.

Copiado