UniRS is a framework that unifies multi-temporal remote sensing tasks with various visual inputs within a single model. It analyzes three critical types of remote sensing visual input, i.e., single images, dual-time image pairs, and video, under task instructions. Our research focuses on a representative remote sensing task for each input type: visual question answering, change captioning, and video classification.
The architecture of UniRS. The left part of the figure shows the prompt augmentation mechanism and the main UniRS architecture. UniRS consists of four components: a visual encoder $\mathcal{E}_v$, a multimodal projector $\mathcal{M}_{m}$, a language module $\mathcal{M}_{l}$, and a change extraction module $\mathcal{M}_{c}$. The change extraction module $\mathcal{M}_{c}$ is designed for the dual-time image pair input, extracting and enhancing the spatiotemporal relationship features between the two images. During inference, every visual input $I$ is encoded into visual features $F$ by the visual encoder $\mathcal{E}_v$. In the prompt augmentation mechanism, initial visual clues $P_c$ are obtained after parsing and merged with the task instruction $P_t$ to form the full prompt $P$. The multimodal projector $\mathcal{M}_{m}$ projects the visual features $F$ into the text feature space as the visual embedding $E_{I}$, which is then concatenated with the text embedding $E_{P}$ and fed into the language module $\mathcal{M}_{l}$ to produce the final answer $a$. The right part of the figure shows the structure of the change extraction module $\mathcal{M}_{c}$.
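The data flow above can be summarized in a short sketch. This is a minimal PyTorch-style illustration assuming generic encoder/projector/LLM modules and token shapes; it is not the released UniRS implementation.

```python
import torch
import torch.nn as nn

class UniRSSketch(nn.Module):
    """Minimal sketch of the UniRS forward pass (shapes and modules assumed)."""
    def __init__(self, visual_encoder, mm_projector, change_module, language_model):
        super().__init__()
        self.E_v = visual_encoder   # visual encoder E_v
        self.M_m = mm_projector     # multimodal projector M_m
        self.M_c = change_module    # change extraction module M_c (dual-time input only)
        self.M_l = language_model   # language module M_l

    def forward(self, images, text_embeds):
        # images: [B, T, C, H, W]; T == 2 for a dual-time image pair.
        # text_embeds: text embedding E_P of the full prompt P, [B, L, D].
        B, T = images.shape[:2]
        F = self.E_v(images.flatten(0, 1))          # visual features F: [B*T, N, D_v]
        F = F.view(B, T, *F.shape[1:])
        if T == 2:
            # dual-time pair: extract and enhance spatiotemporal relation features
            F = self.M_c(F[:, 0], F[:, 1])          # [B, N, D_v]
        else:
            F = F.flatten(1, 2)                     # merge frame tokens: [B, T*N, D_v]
        E_I = self.M_m(F)                           # visual embedding E_I in text space
        inputs = torch.cat([E_I, text_embeds], 1)   # prepend visual tokens to E_P
        return self.M_l(inputs_embeds=inputs)       # language module yields answer a
```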
The inference process of UniRS with the prompt augmentation mechanism. During task execution, visual inputs are first processed by the base model, which generates clues $P_{c}$ under fixed prompts $P_g$ customized for each input type. These clues are then merged with special markers, task tags, and the task instruction $P_{t}$ to form the prompt $P$ fed into UniRS, and the model generates the corresponding response $a$.
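The prompt assembly step can be sketched as below. The fixed prompt wording, marker strings, and tag format are illustrative assumptions; only the overall merge of $P_c$, markers, a task tag, and $P_t$ into $P$ follows the description above.

```python
FIXED_PROMPTS = {  # P_g, one fixed prompt per input type (wording assumed)
    "image": "Briefly list the salient objects in the image.",
    "pair": "Briefly describe each image of the dual-time pair.",
    "video": "Briefly describe the scene shown in the video.",
}

def build_prompt(base_model, visual_input, input_type, task_instruction):
    # 1) the base model generates initial visual clues P_c under the fixed prompt P_g
    clues = base_model.generate(visual_input, FIXED_PROMPTS[input_type])  # hypothetical API
    # 2) merge clues, special markers, a task tag, and the task instruction P_t into P
    return (f"[{input_type.upper()}] "      # task tag (format assumed)
            f"<clue>{clues}</clue> "        # visual clues wrapped in special markers
            f"{task_instruction}")          # task instruction P_t
```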
Qualitative results of UniRS. From left to right: visual question answering, dual-time change captioning, and video scene classification. Users can provide task-specific instructions to shape model responses toward the desired behavior. Compared with previous mainstream remote sensing VLMs, general-purpose VLMs, and traditional expert models, UniRS generates better responses across tasks.
We use the test sets of the RSVQA-LR, RSVQA-HR, and CRSVQA datasets for quantitative evaluation of the RSVQA task. For RSVQA-LR and RSVQA-HR, we follow the GeoChat benchmark settings. For CRSVQA, we follow the setting of MQVQA, holding out 10% of the data as the test set and evaluating under supervised assessment.
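For reference, the metrics in the tables below can be computed with a simple scorer like the following sketch. Exact-match scoring with lowercase normalization is an assumption, not the benchmark code; "Avg. Accuracy" is taken as sample-weighted overall accuracy, which matches how the reported averages weight the question types.

```python
from collections import defaultdict

def rsvqa_accuracy(samples):
    """samples: iterable of (question_type, prediction, ground_truth) strings."""
    hit, tot = defaultdict(int), defaultdict(int)
    for qtype, pred, gt in samples:
        tot[qtype] += 1
        hit[qtype] += int(pred.strip().lower() == gt.strip().lower())
    scores = {t: 100.0 * hit[t] / tot[t] for t in tot}   # per-type accuracy (%)
    scores["Avg. Accuracy"] = 100.0 * sum(hit.values()) / sum(tot.values())
    return scores
```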
Results on the RSVQA-LR test set (accuracy, %):

| Method | Presence | Comparison | Rural/Urban | Avg. Accuracy |
|---|---|---|---|---|
| VILA-1.5 (3B) | 68.49 | 64.99 | 64.00 | 66.44 |
| RSVQA | 87.47 | 81.50 | 90.00 | 86.32 |
| Bi-Modal | 91.06 | 91.16 | 92.66 | 91.63 |
| SHRNet | 91.03 | 90.48 | 94.00 | 91.84 |
| RSGPT | 91.17 | 91.70 | 94.00 | 92.29 |
| SkyEyeGPT (7B) | 88.93 | 88.63 | 75.00 | 84.19 |
| LHRS-Bot (7B) | 89.07 | 88.51 | 90.00 | 89.19 |
| GeoChat (7B) | 91.09 | 90.33 | 94.00 | 90.70 |
| UniRS | 91.64 | 92.68 | 90.00 | 92.21 |
| UniRS (further training) | 91.81 | 93.23 | 93.00 | 92.63 |
Results on the RSVQA-HR test set (accuracy, %):

| Method | Presence | Comparison | Avg. Accuracy |
|---|---|---|---|
| VILA-1.5 (3B) | 61.44 | 63.06 | 62.79 |
| MiniGPTv2 | 40.79 | 50.91 | 46.46 |
| LLaVA-1.5 | 68.23 | 65.45 | 66.67 |
| GeoChat | 59.02 | 83.16 | 72.53 |
| EarthGPT | 62.77 | 79.53 | 72.06 |
| UniRS | 59.29 | 84.05 | 73.15 |
Results on the CRSVQA test set:

| Method | OA (%) |
|---|---|
| VILA-1.5 (3B) | 80.33 |
| Qonly | 23.49 |
| RSVQA | 58.96 |
| RSVQA (GRU) | 59.41 |
| SAN | 61.17 |
| MQVQA | 70.91 |
| EarthGPT | 82.00 |
| GeoChat | 82.50 |
| UniRS | 86.67 |
In the change captioning task, the user inputs a dual-time image pair together with text instructions. The model understands the spatiotemporal features contained in the image pair and produces a textual description of the changes that occurred at the same location between the two time points.
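As a concrete illustration of how a change extraction module $\mathcal{M}_{c}$ can relate the two time steps, here is a hedged sketch. The difference-plus-cross-attention design is an assumption for exposition, not the module published with UniRS.

```python
import torch
import torch.nn as nn

class ChangeExtractionSketch(nn.Module):
    """Illustrative change extraction over two time steps' token features."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, f_t1, f_t2):
        # f_t1, f_t2: [B, N, D] token features of the two acquisition times.
        diff = f_t2 - f_t1                           # raw change signal
        attn, _ = self.cross_attn(f_t1, f_t2, f_t2)  # t1 tokens attend to t2 tokens
        # fuse the attended view, the later image, and the difference
        return self.fuse(torch.cat([attn, f_t2, diff], dim=-1))
```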
Change captioning results:

| Method | CIDEr-D |
|---|---|
| VILA-1.5 (3B) | 6.22 |
| RSICCFormer | 131.40 |
| PSNet | 132.62 |
| PromptCC | 136.44 |
| Chg2Cap | 136.61 |
| LLaVA-1.5 | 126.25 |
| GeoChat | 128.36 |
| UniRS | 139.12 |
UniRS also integrates the ability to understand and analyze remote sensing video content. Given a drone bird's-eye-view video and text instructions, the model outputs the corresponding video scene classification result.
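A video input is typically reduced to a small set of frames before encoding. The sketch below uniformly samples frames with OpenCV; the frame count and preprocessing are assumptions, not the exact UniRS pipeline.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=8, size=(336, 336)):
    """Uniformly sample num_frames RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for i in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            # resize and convert OpenCV's BGR layout to RGB
            frames.append(cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # [T, H, W, 3], passed on to the visual encoder
```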
Per-class accuracy and overall accuracy (OA, %) for video scene classification:

| Method | post-earthquake | flood | fire | landslide | mudslide | traffic collision | traffic congestion | harvesting | ploughing | constructing | police chase | conflict | baseball | basketball | boating | cycling | running | soccer | swimming | car racing | party | concert | parade/protest | religious activity | non-event | OA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VILA-1.5 (3B) | 0.0 | 63.3 | 14.3 | 16.3 | 0.0 | 35.8 | 70.0 | 28.1 | 34.6 | 40.7 | 0.0 | 68.0 | 86.0 | 58.3 | 62.7 | 26.4 | 12.8 | 85.5 | 11.8 | 100.0 | 0.0 | 91.8 | 55.1 | 5.6 | 78.1 | 41.7 |
| HDense | 67.3 | 71.4 | 78.6 | 34.7 | 74.5 | 35.9 | 74.0 | 81.3 | 82.7 | 59.3 | 64.7 | 16.0 | 76.0 | 72.9 | 88.2 | 62.3 | 16.3 | 82.3 | 76.5 | 63.2 | 54.0 | 73.5 | 59.2 | 61.1 | 58.1 | 63.0 |
| FuTH-Net | 72.7 | 75.7 | 87.5 | 57.1 | 74.5 | 34.0 | 56.0 | 76.6 | 71.2 | 81.4 | 76.5 | 36.0 | 78.0 | 85.4 | 80.4 | 73.6 | 16.3 | 64.5 | 80.4 | 84.2 | 56.0 | 89.8 | 65.3 | 63.0 | 63.9 | 66.8 |
| MSTN | 61.8 | 76.1 | 92.2 | 60.4 | 62.8 | 54.1 | 69.6 | 80.0 | 91.1 | 73.6 | 71.7 | 54.6 | 86.0 | 72.4 | 86.5 | 66.0 | 66.9 | 90.2 | 74.1 | 61.9 | 67.4 | 56.0 | 46.6 | 58.5 | 51.5 | 67.4 |
| TRM | 72.7 | 75.5 | 87.5 | 57.1 | 74.5 | 34.0 | 56.0 | 76.6 | 71.2 | 81.4 | 76.5 | 36.0 | 78.0 | 85.4 | 80.4 | 73.6 | 16.3 | 64.5 | 80.4 | 84.2 | 56.0 | 89.8 | 65.3 | 63.0 | 63.9 | 66.8 |
| ASAT | 62.3 | 85.7 | 91.4 | 56.5 | 62.7 | 47.7 | 66.0 | 68.8 | 90.8 | 75.0 | 80.5 | 40.0 | 84.0 | 78.9 | 85.7 | 64.6 | 78.8 | 94.0 | 61.4 | 61.9 | 87.1 | 56.0 | 47.9 | 65.3 | 55.2 | 68.1 |
| UniRS | 85.5 | 89.8 | 100.0 | 69.4 | 78.4 | 67.9 | 88.0 | 95.3 | 94.2 | 91.5 | 88.2 | 96.0 | 96.0 | 100.0 | 100.0 | 94.3 | 73.3 | 91.9 | 88.2 | 89.5 | 86.0 | 93.9 | 83.7 | 92.6 | 82.6 | 87.8 |
@article{li2024unirs,
  title={UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models},
  author={Li, Yujie and Xu, Wenjia and Li, Guangzuo and Yu, Zijian and Wei, Zhiwei and Wang, Jiuniu and Peng, Mugen},
  journal={arXiv preprint arXiv:2412.20742},
  year={2024}
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank LLaVA, GeoChat, and VILA for releasing their models and code as open-source contributions.