
UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Aerospace Information Research Institute, Chinese Academy of Sciences
School of Geographic Sciences, Hunan Normal University
Department of Computer Science, City University of Hong Kong
*Corresponding author

The domain gap between remote sensing imagery and natural images has recently received widespread attention, and Vision-Language Models (VLMs) have demonstrated excellent generalization performance in remote sensing multimodal tasks. However, current research remains limited in exploring how remote sensing VLMs handle different types of visual inputs. To bridge this gap, we introduce UniRS, the first vision-language model to unify multi-temporal remote sensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis within a unified framework. We adopt a unified visual representation approach, enabling the model to accept various visual inputs. For dual-time image-pair tasks, we customize a change extraction module to further enhance the extraction of spatiotemporal features. Additionally, we design a prompt augmentation mechanism tailored to the model's reasoning process, utilizing the prior knowledge of a general-purpose VLM to provide clues for UniRS. To promote multi-task knowledge sharing, the model is jointly fine-tuned on a mixed dataset. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks, including visual question answering, change captioning, and video scene classification, highlighting its versatility and effectiveness in unifying these multi-temporal remote sensing tasks.

🏆 Contributions

  1. We propose UniRS, the first vision-language model designed to tackle multi-temporal remote sensing tasks, including visual question answering, change captioning, and video scene classification. It establishes a unified framework that combines three critical temporal visual input types in remote sensing, i.e., single image, dual-time image pair, and video, broadening the capabilities of VLMs in remote sensing analysis and providing a paradigm for future research on multi-task integration within the remote sensing community.

  2. We design a dedicated Change Extraction module, which enhances the comprehension of spatiotemporal semantic information in dual-time image pairs. This module incorporates a spatial feature enhancement component and a dual-time image feature fusion mechanism, enabling the model to detect and interpret local differences of interest and temporal relationships between two images. The module achieves high granularity in extracting and enhancing the spatiotemporal correlations of images, which is crucial for tasks requiring nuanced change detection.

  3. We design a prompt augmentation mechanism for the inference process, which leverages the visual-language interaction capabilities of a general-purpose VLM to enrich templated task instructions and provide task-specific clues for UniRS in multimodal comprehension. During clue generation, we design specific prompts for each type of remote sensing visual input. This mechanism utilizes the extensive prior knowledge of the general-purpose VLM, facilitating the transfer of general knowledge to remote sensing analysis.

  4. We develop a multi-task joint fine-tuning framework, designing task-specific prompt templates for different types of visual inputs to distinguish between tasks. UniRS is jointly trained on mixed datasets, and this joint training promotes knowledge sharing across tasks, enhancing the model's ability to understand the spatiotemporal features of remote sensing images compared with individual training. We extensively evaluate UniRS on visual question answering, change captioning, and video scene classification tasks, and it achieves state-of-the-art performance on all of them, showcasing its versatility and effectiveness in tackling multi-temporal remote sensing challenges.

UniRS: A unified framework integrating multi-temporal analysis


Our UniRS is a framework that unifies multi-temporal remote sensing tasks over various visual inputs within a single model. It can analyze three critical types of remote sensing visual inputs, i.e., single image, dual-time image pair, and video, under task instructions. Our research focuses on a typical remote sensing task for each input type: visual question answering, change captioning, and video scene classification.
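To make the unified visual representation concrete, below is a minimal, hedged sketch of how the three input types could be normalized into a single frame batch before encoding. The helper name, frame-sampling strategy, and tensor conventions are illustrative assumptions, not the released UniRS implementation.

```python
import torch

def to_frame_batch(visual_input, num_video_frames=8):
    """Normalize the three input types into one (T, C, H, W) frame batch.

    visual_input: a single image tensor (C, H, W), a tuple of two image
    tensors for a dual-time pair, or a video tensor (T, C, H, W).
    The frame count and stacking order are illustrative assumptions.
    """
    if isinstance(visual_input, tuple):                    # dual-time image pair
        return torch.stack(list(visual_input), dim=0)      # (2, C, H, W)
    if visual_input.dim() == 3:                            # single image
        return visual_input.unsqueeze(0)                   # (1, C, H, W)
    # video: uniformly sample a fixed number of frames
    t = visual_input.shape[0]
    idx = torch.linspace(0, t - 1, steps=min(num_video_frames, t)).long()
    return visual_input[idx]                               # (T', C, H, W)
```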

UniRS: Architecture

The architecture of our UniRS. The left part of this figure includes the prompt augmentation mechanism and the UniRS main architecture. UniRS is primarily composed of four components, i.e., the visual encoder $\mathcal{E}_v$, the multimodal projector $\mathcal{M}_{m}$, the language module $\mathcal{M}_{l}$, and the change extraction module $\mathcal{M}_{c}$. Here, the change extraction module $\mathcal{M}_{c}$ is designed for the dual-time image pair input, extracting and enhancing spatiotemporal relationship features between the image pair. During inference, all visual inputs $I$ are encoded into visual features $F$ by the visual encoder $\mathcal{E}_v$. In the prompt augmentation mechanism, initial visual clues $P_c$ are obtained after parsing and merged with the task instruction $P_t$ to form the full prompt $P$. In UniRS, the multimodal projector $\mathcal{M}_{m}$ projects the visual features $F$ into the text feature space as the visual embedding $E_{I}$, which is then combined with the text embedding $E_{P}$ and fed into the language module $\mathcal{M}_{l}$ to produce the final answer $a$. The right part of this figure shows the structure of the change extraction module $\mathcal{M}_{c}$.
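As a reading aid, here is a hedged PyTorch-style sketch of this forward pass together with a simplified stand-in for the change extraction module. The class names, layer choices (a linear spatial enhancement followed by concatenation-based fusion), and the call signatures of $\mathcal{E}_v$, $\mathcal{M}_m$, and $\mathcal{M}_l$ are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChangeExtraction(nn.Module):
    """Illustrative stand-in for the change extraction module M_c:
    a spatial feature enhancement step followed by dual-time fusion.
    Layer choices here are assumptions, not the paper's exact design."""
    def __init__(self, dim):
        super().__init__()
        self.spatial_enhance = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.fuse = nn.Linear(2 * dim, dim)        # fuse t1/t2 token features

    def forward(self, f_t1, f_t2):                 # each: (N_tokens, dim)
        f_t1 = self.spatial_enhance(f_t1)
        f_t2 = self.spatial_enhance(f_t2)
        return self.fuse(torch.cat([f_t1, f_t2], dim=-1))

def unirs_forward(E_v, M_c, M_m, M_l, visual_input, prompt, is_pair=False):
    """Sketch of the pipeline: F = E_v(I); M_c applied for dual-time pairs;
    E_I = M_m(F); answer a = M_l([E_I; E_P]). Component call signatures
    and the (2, N_tokens, dim) pair-feature layout are assumed."""
    F = E_v(visual_input)                          # visual features F
    if is_pair:
        F = M_c(F[0], F[1])                        # spatiotemporal change features
    E_I = M_m(F)                                   # project into text feature space
    return M_l(visual_embeds=E_I, text=prompt)     # final answer a
```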

Inference Process of UniRS with Prompt Augmentation Mechanism

The inference process of UniRS with the prompt augmentation mechanism. During the execution of remote sensing tasks, visual inputs are first processed by the base model, where clues $P_{c}$ are generated under fixed prompts $P_g$ customized for each input type. These clues are then merged with special markers, task tags, and the task instruction $P_{t}$ to form the prompt $P$, which is input to UniRS. The model then generates the corresponding response $a$.
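A minimal sketch of this prompt assembly is given below, assuming a simple text-generation interface on the base VLM. The guiding prompt wording, marker strings, and task tags are placeholders; the exact templates used by UniRS are not reproduced here.

```python
def augment_prompt(base_vlm, visual_input, input_type, task_instruction):
    """Prompt augmentation sketch: the base (general-purpose) VLM generates
    clues P_c under a fixed, input-type-specific prompt P_g, and the clues
    are merged with special markers, a task tag, and the task instruction
    P_t into the full prompt P. All strings below are illustrative."""
    guide_prompts = {   # P_g, one fixed prompt per visual input type (assumed wording)
        "image": "Briefly describe the salient objects and scene in this image.",
        "pair":  "Briefly describe each image and any apparent differences.",
        "video": "Briefly describe the main activity shown in this video.",
    }
    clues = base_vlm.generate(visual_input, guide_prompts[input_type])   # P_c
    task_tag = {"image": "[VQA]", "pair": "[CC]", "video": "[CLS]"}[input_type]
    return f"{task_tag} <clue> {clues} </clue> {task_instruction}"       # P
```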


Qualitative Results

Qualitative results of UniRS. From left to right, results are shown on visual question answering, dual-time change captioning, and video scene classification. The user can provide task-specific instructions to shape model responses according to the desired behavior. Compared with previous mainstream remote sensing VLMs, general VLMs, and traditional expert models, UniRS shows better generation performance across various tasks.

Remote Sensing Visual Question Answering (RSVQA)

We use the test sets of the RSVQA-LR, RSVQA-HR, and CRSVQA datasets for quantitative evaluation of the RSVQA task. For RSVQA-LR and RSVQA-HR, we follow the GeoChat benchmark settings. For CRSVQA, we follow the setting of MQVQA, adopting 10% of the data as the test set and evaluating performance under the supervised setting.
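The tables below report per-question-type accuracy (%) together with an average accuracy. For reference, here is a minimal sketch of how such metrics can be computed from model outputs; the exact-match scoring rule and the field names are simplifying assumptions and may differ from the GeoChat benchmark code.

```python
from collections import defaultdict

def rsvqa_accuracy(samples):
    """samples: list of dicts with 'type' (e.g. 'presence', 'comparison',
    'rural_urban'), 'prediction', and 'answer'. Exact-match scoring is a
    simplifying assumption."""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        total[s["type"]] += 1
        if s["prediction"].strip().lower() == s["answer"].strip().lower():
            correct[s["type"]] += 1
    per_type = {t: 100.0 * correct[t] / total[t] for t in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_type, overall
```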

| Method | Presence | Comparison | Rural/Urban | Avg. Accuracy |
|---|---|---|---|---|
| VILA-1.5 (3B) | 68.49 | 64.99 | 64.00 | 66.44 |
| RSVQA | 87.47 | 81.50 | 90.00 | 86.32 |
| Bi-Modal | 91.06 | 91.16 | 92.66 | 91.63 |
| SHRNet | 91.03 | 90.48 | 94.00 | 91.84 |
| RSGPT | 91.17 | 91.70 | 94.00 | 92.29 |
| SkyEyeGPT (7B) | 88.93 | 88.63 | 75.00 | 84.19 |
| LHRS-Bot (7B) | 89.07 | 88.51 | 90.00 | 89.19 |
| GeoChat (7B) | 91.09 | 90.33 | 94.00 | 90.70 |
| UniRS | 91.64 | 92.68 | 90.00 | 92.21 |
| UniRS (further training) | 91.81 | 93.23 | 93.00 | 92.63 |

Comparison of the visual question answering performance on the RSVQA-LR dataset. VILA-1.5 is evaluated under the zero-shot setting. Our UniRS, SkyEyeGPT, LHRS-Bot, and GeoChat are non-expert models. UniRS (further training) is compared with expert models.
| Method | Presence | Comparison | Avg. Accuracy |
|---|---|---|---|
| VILA-1.5 (3B) | 61.44 | 63.06 | 62.79 |
| MiniGPTv2 | 40.79 | 50.91 | 46.46 |
| LLaVA-1.5 | 68.23 | 65.45 | 66.67 |
| GeoChat | 59.02 | 83.16 | 72.53 |
| EarthGPT | 62.77 | 79.53 | 72.06 |
| UniRS | 59.29 | 84.05 | 73.15 |

Comparison of the zero-shot visual question answering performance on the RSVQA-HR dataset. We compare our UniRS with general VLMs and remote sensing VLMs under zero-shot settings.
| Method | OA (overall accuracy) |
|---|---|
| VILA-1.5 (3B) | 80.33 |
| Qonly | 23.49 |
| RSVQA | 58.96 |
| RSVQA (GRU) | 59.41 |
| SAN | 61.17 |
| MQVQA | 70.91 |
| EarthGPT | 82.00 |
| GeoChat | 82.50 |
| UniRS | 86.67 |

Comparison of the visual question answering performance on the CRSVQA dataset. VILA-1.5, EarthGPT, GeoChat, and UniRS are tested under the supervised setting.

Remote Sensing Change Captioning

In the change captioning task, the user inputs a dual-time image pair and a corresponding text instruction to the model. The model understands the spatiotemporal features contained in the image pair and produces a textual description of the changes that occurred at the same location between the two time points.

| Method | CIDEr-D |
|---|---|
| VILA-1.5 (3B) | 6.22 |
| RSICCFormer | 131.40 |
| PSNet | 132.62 |
| PromptCC | 136.44 |
| Chg2Cap | 136.61 |
| LLaVA-1.5 | 126.25 |
| GeoChat | 128.36 |
| UniRS | 139.12 |

Comparison of the change captioning performance on the LEVIR-CC dataset. LLaVA-1.5 and GeoChat here refer to versions fine-tuned on the LEVIR-CC training set.

Remote Sensing Video Scene Classification

UniRS also integrates the ability to understand and analyze remote sensing video content. Given a drone bird's-eye-view video and a text instruction, the model outputs the corresponding video scene classification result.

| Class | VILA-1.5 (3B) | HDense | FuTH-Net | MSTN | TRM | ASAT | UniRS |
|---|---|---|---|---|---|---|---|
| post-earthquake | 0.0 | 67.3 | 72.7 | 61.8 | 72.7 | 62.3 | 85.5 |
| flood | 63.3 | 71.4 | 75.7 | 76.1 | 75.5 | 85.7 | 89.8 |
| fire | 14.3 | 78.6 | 87.5 | 92.2 | 87.5 | 91.4 | 100.0 |
| landslide | 16.3 | 34.7 | 57.1 | 60.4 | 57.1 | 56.5 | 69.4 |
| mudslide | 0.0 | 74.5 | 74.5 | 62.8 | 74.5 | 62.7 | 78.4 |
| traffic collision | 35.8 | 35.9 | 34.0 | 54.1 | 34.0 | 47.7 | 67.9 |
| traffic congestion | 70.0 | 74.0 | 56.0 | 69.6 | 56.0 | 66.0 | 88.0 |
| harvesting | 28.1 | 81.3 | 76.6 | 80.0 | 76.6 | 68.8 | 95.3 |
| ploughing | 34.6 | 82.7 | 71.2 | 91.1 | 71.2 | 90.8 | 94.2 |
| constructing | 40.7 | 59.3 | 81.4 | 73.6 | 81.4 | 75.0 | 91.5 |
| police chase | 0.0 | 64.7 | 76.5 | 71.7 | 76.5 | 80.5 | 88.2 |
| conflict | 68.0 | 16.0 | 36.0 | 54.6 | 36.0 | 40.0 | 96.0 |
| baseball | 86.0 | 76.0 | 78.0 | 86.0 | 78.0 | 84.0 | 96.0 |
| basketball | 58.3 | 72.9 | 85.4 | 72.4 | 85.4 | 78.9 | 100.0 |
| boating | 62.7 | 88.2 | 80.4 | 86.5 | 80.4 | 85.7 | 100.0 |
| cycling | 26.4 | 62.3 | 73.6 | 66.0 | 73.6 | 64.6 | 94.3 |
| running | 12.8 | 16.3 | 16.3 | 66.9 | 16.3 | 78.8 | 73.3 |
| soccer | 85.5 | 82.3 | 64.5 | 90.2 | 64.5 | 94.0 | 91.9 |
| swimming | 11.8 | 76.5 | 80.4 | 74.1 | 80.4 | 61.4 | 88.2 |
| car racing | 100.0 | 63.2 | 84.2 | 61.9 | 84.2 | 61.9 | 89.5 |
| party | 0.0 | 54.0 | 56.0 | 67.4 | 56.0 | 87.1 | 86.0 |
| concert | 91.8 | 73.5 | 89.8 | 56.0 | 89.8 | 56.0 | 93.9 |
| parade/protest | 55.1 | 59.2 | 65.3 | 46.6 | 65.3 | 47.9 | 83.7 |
| religious activity | 5.6 | 61.1 | 63.0 | 58.5 | 63.0 | 65.3 | 92.6 |
| non-event | 78.1 | 58.1 | 63.9 | 51.5 | 63.9 | 55.2 | 82.6 |
| OA | 41.7 | 63.0 | 66.8 | 67.4 | 66.8 | 68.1 | 87.8 |

Comparison of video scene classification performance on the ERA dataset. Per-class results (%) are shown, with the overall accuracy (OA) in the last row. HDense, FuTH-Net, MSTN, TRM, and ASAT are expert models designed for video classification. VILA-1.5 (3B) is tested under the zero-shot setting.

BibTeX


@article{li2024unirs,
  title={UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models},
  author={Li, Yujie and Xu, Wenjia and Li, Guangzuo and Yu, Zijian and Wei, Zhiwei and Wang, Jiuniu and Peng, Mugen},
  journal={arXiv preprint arXiv:2412.20742},
  year={2024}
}
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We are thankful to LLaVA, GeoChat and VILA for releasing their models and code as open-source contributions.
