
UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications
Aerospace Information Research Institute, Chinese Academy of Sciences
School of Geographic Sciences, Hunan Normal University
Department of Computer Science, City University of Hong Kong
*Corresponding author

The domain gap between remote sensing imagery and natural images has recently received widespread attention, and Vision-Language Models (VLMs) have demonstrated excellent generalization performance in remote sensing multimodal tasks. However, current research remains limited in exploring how remote sensing VLMs handle different types of visual inputs. To bridge this gap, we introduce UniRS, the first vision-language model to unify multi-temporal remote sensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis within a unified framework. We adopt a unified visual representation approach, enabling the model to accept various visual inputs. For dual-time image-pair tasks, we customize a change extraction module to further enhance the extraction of spatiotemporal features. Additionally, we design a prompt augmentation mechanism tailored to the model's reasoning process, utilizing the prior knowledge of a general-purpose VLM to provide clues for UniRS. To promote multi-task knowledge sharing, the model is jointly fine-tuned on a mixed dataset. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks, including visual question answering, change captioning, and video scene classification, highlighting its versatility and effectiveness in unifying these multi-temporal remote sensing tasks.

🏆 Contributions

  1. We propose UniRS, the first vision-language model designed to tackle multi-temporal remote sensing tasks, including visual question answering, change captioning, and video scene classification. It establishes a unified framework that combines three critical temporal visual input types in remote sensing, i.e., single image, dual-time image pair, and video, broadening the capabilities of VLMs in remote sensing analysis and providing a paradigm for future research on multi-task integration within the remote sensing community.

  2. We design a dedicated Change Extraction module, which enhances the comprehension of spatiotemporal semantic information in dual-time image pairs. This module incorporates a spatial feature enhancement component and a dual-time image feature fusion mechanism, enabling the model to detect and interpret local differences of interest and temporal relationships between two images. The module achieves high granularity in extracting and enhancing the spatiotemporal correlations of images, which is crucial for tasks requiring nuanced change detection.

  3. We design a prompt augmentation mechanism for the inference process, which leverages the visual-language interaction capabilities of a general-purpose VLM to enrich templated task instructions and provide task-specific clues for UniRS in multimodal comprehension. During clue generation, we design specific prompts for each type of remote sensing visual input. This mechanism utilizes the extensive prior knowledge of the general-purpose VLM, facilitating the transfer of general knowledge to remote sensing analysis.

  4. We develop a multi-task joint fine-tuning framework, designing task-specific prompt templates for different types of visual inputs to distinguish between tasks. UniRS is jointly trained on mixed datasets, and this joint training promotes knowledge sharing across tasks, enhancing the model's ability to understand the spatiotemporal features of remote sensing images compared with individual training. We extensively evaluate UniRS on visual question answering, change captioning, and video scene classification tasks, and it achieves state-of-the-art performance on all of them, showcasing its versatility and effectiveness in tackling multi-temporal remote sensing challenges.

UniRS: A unified framework integrating multi-temporal analysis


Our UniRS is a framework that unifies multi-temporal remote sensing tasks over various visual inputs within a single model. It can analyze three critical types of remote sensing visual inputs, i.e., single image, dual-time image pair, and video, under task instructions. Our research focuses on a typical remote sensing task for each input type: visual question answering, change captioning, and video scene classification.
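To make the unified visual representation concrete, below is a minimal, hedged sketch of how the three input types could be normalized into a single frame batch before encoding. The helper name, frame-sampling strategy, and tensor conventions are illustrative assumptions, not the released UniRS implementation.

```python
import torch

def to_frame_batch(visual_input, num_video_frames=8):
    """Normalize the three input types into one (T, C, H, W) frame batch.

    visual_input: a single image tensor (C, H, W), a tuple of two image
    tensors for a dual-time pair, or a video tensor (T, C, H, W).
    The frame count and stacking order are illustrative assumptions.
    """
    if isinstance(visual_input, tuple):                    # dual-time image pair
        return torch.stack(list(visual_input), dim=0)      # (2, C, H, W)
    if visual_input.dim() == 3:                            # single image
        return visual_input.unsqueeze(0)                   # (1, C, H, W)
    # video: uniformly sample a fixed number of frames
    t = visual_input.shape[0]
    idx = torch.linspace(0, t - 1, steps=min(num_video_frames, t)).long()
    return visual_input[idx]                               # (T', C, H, W)
```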

UniRS: Architecture

The architecture of our UniRS. The left part of this figure includes the prompt augmentation mechanism and the UniRS main architecture. UniRS is primarily composed of four components, i.e., the visual encoder $\mathcal{E}_v$, the multimodal projector $\mathcal{M}_{m}$, the language module $\mathcal{M}_{l}$, and the change extraction module $\mathcal{M}_{c}$. Here, the change extraction module $\mathcal{M}_{c}$ is designed for the dual-time image pair input, extracting and enhancing spatiotemporal relationship features between the image pair. During inference, all visual inputs $I$ are encoded into visual features $F$ by the visual encoder $\mathcal{E}_v$. In the prompt augmentation mechanism, initial visual clues $P_c$ are obtained after parsing and merged with the task instruction $P_t$ to form the full prompt $P$. In UniRS, the multimodal projector $\mathcal{M}_{m}$ projects the visual features $F$ into the text feature space as the visual embedding $E_{I}$, which is then combined with the text embedding $E_{P}$ and fed into the language module $\mathcal{M}_{l}$ to produce the final answer $a$. The right part of this figure shows the structure of the change extraction module $\mathcal{M}_{c}$.
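As a reading aid, here is a hedged PyTorch-style sketch of this forward pass together with a simplified stand-in for the change extraction module. The class names, layer choices (a linear spatial enhancement followed by concatenation-based fusion), and the call signatures of $\mathcal{E}_v$, $\mathcal{M}_m$, and $\mathcal{M}_l$ are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChangeExtraction(nn.Module):
    """Illustrative stand-in for the change extraction module M_c:
    a spatial feature enhancement step followed by dual-time fusion.
    Layer choices here are assumptions, not the paper's exact design."""
    def __init__(self, dim):
        super().__init__()
        self.spatial_enhance = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.fuse = nn.Linear(2 * dim, dim)        # fuse t1/t2 token features

    def forward(self, f_t1, f_t2):                 # each: (N_tokens, dim)
        f_t1 = self.spatial_enhance(f_t1)
        f_t2 = self.spatial_enhance(f_t2)
        return self.fuse(torch.cat([f_t1, f_t2], dim=-1))

def unirs_forward(E_v, M_c, M_m, M_l, visual_input, prompt, is_pair=False):
    """Sketch of the pipeline: F = E_v(I); M_c applied for dual-time pairs;
    E_I = M_m(F); answer a = M_l([E_I; E_P]). Component call signatures
    and the (2, N_tokens, dim) pair-feature layout are assumed."""
    F = E_v(visual_input)                          # visual features F
    if is_pair:
        F = M_c(F[0], F[1])                        # spatiotemporal change features
    E_I = M_m(F)                                   # project into text feature space
    return M_l(visual_embeds=E_I, text=prompt)     # final answer a
```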

Inference Process of UniRS with Prompt Augmentation Mechanism

The inference process of UniRS with the prompt augmentation mechanism. During the execution of remote sensing tasks, visual inputs are first processed by the base model, where clues $P_{c}$ are generated under fixed prompts $P_g$ customized for each input type. These clues are then merged with special markers, task tags, and the task instruction $P_{t}$ to form the prompt $P$, which is input to UniRS. The model then generates the corresponding response $a$.
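A minimal sketch of this prompt assembly is given below, assuming a simple text-generation interface on the base VLM. The guiding prompt wording, marker strings, and task tags are placeholders; the exact templates used by UniRS are not reproduced here.

```python
def augment_prompt(base_vlm, visual_input, input_type, task_instruction):
    """Prompt augmentation sketch: the base (general-purpose) VLM generates
    clues P_c under a fixed, input-type-specific prompt P_g, and the clues
    are merged with special markers, a task tag, and the task instruction
    P_t into the full prompt P. All strings below are illustrative."""
    guide_prompts = {   # P_g, one fixed prompt per visual input type (assumed wording)
        "image": "Briefly describe the salient objects and scene in this image.",
        "pair":  "Briefly describe each image and any apparent differences.",
        "video": "Briefly describe the main activity shown in this video.",
    }
    clues = base_vlm.generate(visual_input, guide_prompts[input_type])   # P_c
    task_tag = {"image": "[VQA]", "pair": "[CC]", "video": "[CLS]"}[input_type]
    return f"{task_tag} <clue> {clues} </clue> {task_instruction}"       # P
```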


Qualitative Results

Qualitative results of UniRS. From left to right, results are shown on visual question answering, dual-time change captioning, and video scene classification. The user can provide task-specific instructions to shape model responses according to the desired behavior. Compared with previous mainstream remote sensing VLMs, general VLMs, and traditional expert models, UniRS shows better generation performance across various tasks.

Remote Sensing Visual Question Answering (RSVQA)

We use the test sets of the RSVQA-LR, RSVQA-HR, and CRSVQA datasets for quantitative evaluation of the RSVQA task. For RSVQA-LR and RSVQA-HR, we follow the GeoChat benchmark settings. For CRSVQA, we follow the setting of MQVQA, adopting 10% of the data as the test set and evaluating performance under the supervised setting.
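The tables below report per-question-type accuracy (%) together with an average accuracy. For reference, here is a minimal sketch of how such metrics can be computed from model outputs; the exact-match scoring rule and the field names are simplifying assumptions and may differ from the GeoChat benchmark code.

```python
from collections import defaultdict

def rsvqa_accuracy(samples):
    """samples: list of dicts with 'type' (e.g. 'presence', 'comparison',
    'rural_urban'), 'prediction', and 'answer'. Exact-match scoring is a
    simplifying assumption."""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        total[s["type"]] += 1
        if s["prediction"].strip().lower() == s["answer"].strip().lower():
            correct[s["type"]] += 1
    per_type = {t: 100.0 * correct[t] / total[t] for t in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_type, overall
```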

| Method | Presence | Comparison | Rural/Urban | Avg. Accuracy |
|---|---|---|---|---|
| VILA-1.5 (3B) | 68.49 | 64.99 | 64.00 | 66.44 |
| RSVQA | 87.47 | 81.50 | 90.00 | 86.32 |
| Bi-Modal | 91.06 | 91.16 | 92.66 | 91.63 |
| SHRNet | 91.03 | 90.48 | 94.00 | 91.84 |
| RSGPT | 91.17 | 91.70 | 94.00 | 92.29 |
| SkyEyeGPT (7B) | 88.93 | 88.63 | 75.00 | 84.19 |
| LHRS-Bot (7B) | 89.07 | 88.51 | 90.00 | 89.19 |
| GeoChat (7B) | 91.09 | 90.33 | 94.00 | 90.70 |
| UniRS | 91.64 | 92.68 | 90.00 | 92.21 |
| UniRS (further training) | 91.81 | 93.23 | 93.00 | 92.63 |

Comparison of the visual question answering performance on the RSVQA-LR dataset. VILA-1.5 is evaluated under the zero-shot setting. Our UniRS, SkyEyeGPT, LHRS-Bot, and GeoChat are non-expert models. UniRS (further training) is compared with expert models.
| Method | Presence | Comparison | Avg. Accuracy |
|---|---|---|---|
| VILA-1.5 (3B) | 61.44 | 63.06 | 62.79 |
| MiniGPTv2 | 40.79 | 50.91 | 46.46 |
| LLaVA-1.5 | 68.23 | 65.45 | 66.67 |
| GeoChat | 59.02 | 83.16 | 72.53 |
| EarthGPT | 62.77 | 79.53 | 72.06 |
| UniRS | 59.29 | 84.05 | 73.15 |

Comparison of the zero-shot visual question answering performance on the RSVQA-HR dataset. We compare our UniRS with general VLMs and remote sensing VLMs under zero-shot settings.
| Method | OA (overall accuracy) |
|---|---|
| VILA-1.5 (3B) | 80.33 |
| Qonly | 23.49 |
| RSVQA | 58.96 |
| RSVQA (GRU) | 59.41 |
| SAN | 61.17 |
| MQVQA | 70.91 |
| EarthGPT | 82.00 |
| GeoChat | 82.50 |
| UniRS | 86.67 |

Comparison of the visual question answering performance on the CRSVQA dataset. VILA-1.5, EarthGPT, GeoChat, and UniRS are tested under the supervised setting.

Remote Sensing Change Captioning

In the change captioning task, the user inputs a dual-time image pair and a corresponding text instruction to the model. The model understands the spatiotemporal features contained in the image pair and produces a textual description of the changes that occurred at the same location between the two time points.

| Method | CIDEr-D |
|---|---|
| VILA-1.5 (3B) | 6.22 |
| RSICCFormer | 131.40 |
| PSNet | 132.62 |
| PromptCC | 136.44 |
| Chg2Cap | 136.61 |
| LLaVA-1.5 | 126.25 |
| GeoChat | 128.36 |
| UniRS | 139.12 |

Comparison of the change captioning performance on the LEVIR-CC dataset. LLaVA-1.5 and GeoChat here refer to versions fine-tuned on the LEVIR-CC training set.

Remote Sensing Video Scene Classification

UniRS also integrates the ability to understand and analyze remote sensing video content. Given a drone bird's-eye-view video and a text instruction, the model outputs the corresponding video scene classification result.

| Class | VILA-1.5 (3B) | HDense | FuTH-Net | MSTN | TRM | ASAT | UniRS |
|---|---|---|---|---|---|---|---|
| post-earthquake | 0.0 | 67.3 | 72.7 | 61.8 | 72.7 | 62.3 | 85.5 |
| flood | 63.3 | 71.4 | 75.7 | 76.1 | 75.5 | 85.7 | 89.8 |
| fire | 14.3 | 78.6 | 87.5 | 92.2 | 87.5 | 91.4 | 100.0 |
| landslide | 16.3 | 34.7 | 57.1 | 60.4 | 57.1 | 56.5 | 69.4 |
| mudslide | 0.0 | 74.5 | 74.5 | 62.8 | 74.5 | 62.7 | 78.4 |
| traffic collision | 35.8 | 35.9 | 34.0 | 54.1 | 34.0 | 47.7 | 67.9 |
| traffic congestion | 70.0 | 74.0 | 56.0 | 69.6 | 56.0 | 66.0 | 88.0 |
| harvesting | 28.1 | 81.3 | 76.6 | 80.0 | 76.6 | 68.8 | 95.3 |
| ploughing | 34.6 | 82.7 | 71.2 | 91.1 | 71.2 | 90.8 | 94.2 |
| constructing | 40.7 | 59.3 | 81.4 | 73.6 | 81.4 | 75.0 | 91.5 |
| police chase | 0.0 | 64.7 | 76.5 | 71.7 | 76.5 | 80.5 | 88.2 |
| conflict | 68.0 | 16.0 | 36.0 | 54.6 | 36.0 | 40.0 | 96.0 |
| baseball | 86.0 | 76.0 | 78.0 | 86.0 | 78.0 | 84.0 | 96.0 |
| basketball | 58.3 | 72.9 | 85.4 | 72.4 | 85.4 | 78.9 | 100.0 |
| boating | 62.7 | 88.2 | 80.4 | 86.5 | 80.4 | 85.7 | 100.0 |
| cycling | 26.4 | 62.3 | 73.6 | 66.0 | 73.6 | 64.6 | 94.3 |
| running | 12.8 | 16.3 | 16.3 | 66.9 | 16.3 | 78.8 | 73.3 |
| soccer | 85.5 | 82.3 | 64.5 | 90.2 | 64.5 | 94.0 | 91.9 |
| swimming | 11.8 | 76.5 | 80.4 | 74.1 | 80.4 | 61.4 | 88.2 |
| car racing | 100.0 | 63.2 | 84.2 | 61.9 | 84.2 | 61.9 | 89.5 |
| party | 0.0 | 54.0 | 56.0 | 67.4 | 56.0 | 87.1 | 86.0 |
| concert | 91.8 | 73.5 | 89.8 | 56.0 | 89.8 | 56.0 | 93.9 |
| parade/protest | 55.1 | 59.2 | 65.3 | 46.6 | 65.3 | 47.9 | 83.7 |
| religious activity | 5.6 | 61.1 | 63.0 | 58.5 | 63.0 | 65.3 | 92.6 |
| non-event | 78.1 | 58.1 | 63.9 | 51.5 | 63.9 | 55.2 | 82.6 |
| OA | 41.7 | 63.0 | 66.8 | 67.4 | 66.8 | 68.1 | 87.8 |

Comparison of video scene classification performance on the ERA dataset. Per-class results (%) are shown, with the overall accuracy (OA) in the last row. HDense, FuTH-Net, MSTN, TRM, and ASAT are expert models designed for video classification. VILA-1.5 (3B) is tested under the zero-shot setting.

BibTeX


@article{li2024unirs,
  title={UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models},
  author={Li, Yujie and Xu, Wenjia and Li, Guangzuo and Yu, Zijian and Wei, Zhiwei and Wang, Jiuniu and Peng, Mugen},
  journal={arXiv preprint arXiv:2412.20742},
  year={2024}
}
  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We are thankful to LLaVA, GeoChat and VILA for releasing their models and code as open-source contributions.
