An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering

Artículo

An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering

M. Balakrishna Mallapu et al · IEEE · 2026

Material complementario disponible

Lectura rápida. Revisá los datos básicos del recurso y luego accedé al contenido desde el botón principal. En esta ficha solo se muestra la información necesaria para identificar la obra, citarla y abrirla.

Autor / responsable

M. Balakrishna Mallapu et al

Editorial

IEEE

Año

2026

ISSN

2169-3536

ISSN

2169-3536

Idioma

eng

Acceso al recurso

Entrá al contenido desde la opción principal o elegí otra fuente disponible.

Acceso principal

Material complementario disponible

DOAJ DOAJ - Open Access Journals

El enlace apunta a material asociado, anexos, tablas, datos o página complementaria. No se marca como libro/texto completo.

Abrir material

Resumen

Descripción general del contenido del recurso.

The provision of accurate and explainable captions and answers to visual questions about remote sensing images is essential in areas such as disaster management, land-use supervision, and military analysis, where system interpretability is a critical factor for decision-making. However, existing vision–language models, powered by transformers and multimodal encoders, often struggle to generalize across diverse scene types and lack transparent reasoning mechanisms, thereby limiting their applicability in critical domains. To address these challenges, thisresearch aims to develop a unified and explainable vision–language framework that improves both caption generation and visual question answering performance while ensuring interpretability. Although the current vision-language models using the transformer architecture, multimodal encoders, and attention-based decoders achieve promising results, the standardization of diverse scene types and transparent decision-making remain to be addressed. Therefore, this research presents the application of the explainable vision-language model with the incorporation of the Mixture of Experts-based Robustly bidirectional encoder representations from the transformers AutoEncoder (MoERAE) block to the caption generation task. The MoERAE model utilizes the dynamically routing mechanism for the routing of the multimodal features using different expert paths to capture the context-based representation of the different scenarios in remote sensing. The framework further incorporates an Attention-based Vision Transformer (AVT) for robust spatial feature extraction and GPT-4 for effective multimodal alignment and semantic reasoning. The fine-tuning of MoERAE is carried out by the Rabbit and Turtle Algorithm (RTA), which is an adaptive optimizer that ensures the optimal selection of experts by striking a balance between rapid exploration and convergence. Thus, the quality of the captions generated is very high. Furthermore, the Local Interpretable Model-Agnostic Explanations (LIME) technique has been employed to ascertain the importance of specific visual and textual tokens in the generation of captions and answers, thus promoting transparency. The framework is tested and validated on several benchmark datasets such as UCM-Captions, Sydney-Captions, and RSICD. The metrics for evaluating the framework are also provided. The experimental results prove that the proposed framework performs better than existing ones by achieving a high of 99% and a cross-validation accuracy of 99%, thereby proving the robustness and generalization of the framework. The results also prove that the integration of multimodal alignment, adaptive expert routing, and explainability mechanisms into the framework significantly enhances its performance and interpretability for practical use cases in the field of remote sensing.

Cómo citar

Elegí el formato que necesitás y copiá la referencia al portapapeles.

APA 7

al, M. B. M. E. (2026). An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering. https://doi.org/10.1109/ACCESS.2026.3686569

MLA

al, M. Balakrishna Mallapu et. "An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering." 2026. https://doi.org/10.1109/ACCESS.2026.3686569.

Chicago

al, M. Balakrishna Mallapu et. 2026. "An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering.". https://doi.org/10.1109/ACCESS.2026.3686569.

Harvard

al, M. B. M. E. 2026, An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering, IEEE, available at: https://doi.org/10.1109/ACCESS.2026.3686569 [Accessed 23 Jun. 2026].

Compartir e imprimir

Guardá la ficha, copiá su enlace permanente o imprimila como PDF.

Exportar referencia

Si usás un gestor bibliográfico, podés exportar el registro en los formatos más comunes.

RIS BibTeX

Detalles del recurso

Información bibliográfica útil para confirmar que se trata del material correcto.

Título: An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering

Autor / colaboradores: M. Balakrishna Mallapu et al

Editorial: IEEE

Año de publicación: 2026

ISSN: 2169-3536

ISSN: 2169-3536

Idioma: eng

Materias

Explorá otros recursos relacionados a partir de estas materias.

Remote sensing image captioning; rabbit and turtle algorithm; visual question answering; local interpretable model-agnostic explanations.

An Explainable Multimodal Vision–Language Framework With Adaptive Mixture of Experts and Optimized Learning for Remote Sensing Image Captioning and Visual Question Answering

3PS-RAN: A Real-Time Framework for Securing the O-RAN RACH Against DDoS Attacks Toward NextG

Acceso al recurso

Resumen

Cómo citar

APA 7

MLA

Chicago

Harvard

Compartir e imprimir

Exportar referencia

Detalles del recurso

Materias