Oral Session 3B: Multimodal Computer Vision
SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images
Kaiyu Li · Ruixun Liu · Xiangyong Cao · Xueru Bai · Feng Zhou · Deyu Meng · Zhi Wang
Current remote sensing semantic segmentation methods are mostly built on the closed-set assumption: the model can only recognize the pre-defined categories present in the training set. In practical Earth observation, however, unseen categories are countless and manual annotation is impractical. To address this challenge, we make a first attempt to introduce training-free open-vocabulary semantic segmentation (OVSS) into the remote sensing context. Because remote sensing images are sensitive to low-resolution features, the predicted masks exhibit distorted target shapes and ill-fitting boundaries. To tackle these issues, we propose a simple and universal upsampler, SimFeatUp, to restore the spatial information lost in deep features. SimFeatUp needs to learn from only a few unlabeled images and can upsample features of arbitrary remote sensing images. Furthermore, motivated by the observation that patch tokens respond abnormally to the [CLS] token in CLIP, we apply a simple subtraction operation to alleviate the global bias in patch tokens. Extensive experiments are conducted on 17 remote sensing datasets spanning 4 tasks: semantic segmentation, building extraction, road detection, and flood detection. Our method achieves average improvements of 5.8%, 8.2%, 4.0%, and 15.3% over state-of-the-art methods on the 4 tasks.
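As a rough illustration of the subtraction idea (a sketch, not the authors' released code), the following PyTorch snippet removes a scaled copy of the [CLS] embedding from every patch token before patch-text similarities are computed; the function name and the strength `lam` are hypothetical:

```python
import torch

def remove_global_bias(patch_tokens: torch.Tensor,
                       cls_token: torch.Tensor,
                       lam: float = 0.3) -> torch.Tensor:
    """Subtract a scaled copy of the global [CLS] embedding from each
    CLIP patch token to suppress the shared global component.

    patch_tokens: (B, N, D) ViT patch embeddings
    cls_token:    (B, D)    ViT [CLS] embedding
    lam:          assumed subtraction strength
    """
    return patch_tokens - lam * cls_token.unsqueeze(1)

# toy check with random features
tokens = torch.randn(2, 196, 512)
cls = torch.randn(2, 512)
print(remove_global_bias(tokens, cls).shape)  # torch.Size([2, 196, 512])
```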
Towards Universal Dataset Distillation via Task-Driven Diffusion
Ding Qi · Jian Li · Junyao Gao · Shuguang Dou · Ying Tai · Jianlong Hu · Bo Zhao · Yabiao Wang · Chengjie Wang · Cai Rong Zhao
Dataset distillation (DD) condenses the key information of large-scale datasets into smaller synthetic datasets, reducing the storage and computational cost of training networks. However, recent research has focused primarily on image classification, with limited extension to detection and segmentation. Two key challenges remain: (i) Task Optimization Heterogeneity, where existing methods focus on class-level information and fail to address the diverse needs of detection and segmentation, and (ii) Inflexible Image Generation, where current generation methods rely on global updates for single-class targets and lack localized optimization for specific object regions. To address these challenges, we propose a universal dataset distillation framework, named UniDD, a task-driven diffusion model for diverse DD tasks. Our approach operates in two stages: Universal Task Knowledge Mining, which captures task-relevant information by training task-specific proxy models, and Universal Task-Driven Diffusion, where these proxies guide the diffusion process to generate task-specific synthetic images. Extensive experiments across ImageNet-1K, Pascal VOC, and MS COCO demonstrate that UniDD consistently outperforms state-of-the-art methods. In particular, on ImageNet-1K with IPC-10, UniDD surpasses previous diffusion-based methods by 6.1% while also reducing deployment costs.
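The abstract does not spell out how the proxies steer generation; one plausible reading is classifier-guidance-style steering, sketched below under that assumption. `eps_model` and `proxy_loss_fn` are stand-ins, not UniDD's actual components:

```python
import torch

def task_guided_eps(x, t, eps_model, proxy_loss_fn, scale=1.0):
    """Bias the predicted diffusion noise with the gradient of a
    task-specific proxy loss (classifier-guidance-style sketch)."""
    x = x.detach().requires_grad_(True)
    eps = eps_model(x, t)                           # base noise prediction
    g = torch.autograd.grad(proxy_loss_fn(x), x)[0]
    return eps + scale * g                          # steers samples toward lower task loss

# toy demo with stand-in model and loss
eps_model = lambda x, t: torch.zeros_like(x)
proxy_loss = lambda x: (x ** 2).mean()              # placeholder for a detection/segmentation loss
x = torch.randn(1, 3, 32, 32)
print(task_guided_eps(x, 0, eps_model, proxy_loss).shape)  # torch.Size([1, 3, 32, 32])
```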
IceDiff: High Resolution and High-Quality Arctic Sea Ice Forecasting with Generative Diffusion Prior
Jingyi Xu · Siwei Tu · Weidong Yang · Ben Fei · Shuhao Li · Keyi Liu · Yeqi Luo · Lipeng Ma · Lei Bai
Variations in Arctic sea ice have significant impacts on polar ecosystems, shipping routes, coastal communities, and the global climate. Tracing sea ice change at a finer scale is paramount for both operational applications and scientific studies. Recent pan-Arctic sea ice forecasting methods that leverage advances in artificial intelligence have made promising progress over numerical models. However, forecasting sea ice at higher resolutions remains under-explored. To bridge this gap, we propose a two-module cooperative deep learning framework, IceDiff, to forecast sea ice concentration at finer scales. IceDiff first leverages a vision transformer to generate coarse forecasts on a regular 25 km grid that improve over previous methods. These high-quality forecasts then serve as reliable guidance for the next module: an unconditional diffusion model, pre-trained on low-resolution sea ice concentration maps, samples downscaled forecasts via a zero-shot guided sampling strategy and a patch-based method. For the first time, IceDiff demonstrates sea ice forecasting at 6.25 km resolution. IceDiff extends the boundary of existing sea ice forecasting models and, more importantly, its ability to generate high-resolution sea ice concentration data is vital for practical use and research.
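Zero-shot guided sampling of this kind is often implemented by projecting the coarse forecast onto the low frequencies of the running estimate at each denoising step (ILVR-style); the sketch below assumes that reading, with 4x downscaling between the 6.25 km and 25 km grids:

```python
import torch
import torch.nn.functional as F

def lowres_consistency(x_hat: torch.Tensor, lr_guide: torch.Tensor,
                       scale: int = 4) -> torch.Tensor:
    """Replace the low-frequency content of the current diffusion
    estimate with the coarse forecast (ILVR-style guidance, assumed)."""
    down = F.avg_pool2d(x_hat, scale)                        # project to the coarse grid
    diff = F.interpolate(lr_guide - down, scale_factor=scale,
                         mode="bilinear", align_corners=False)
    return x_hat + diff                                      # keep high frequencies, swap low ones

hr = torch.randn(1, 1, 64, 64)   # running estimate on the fine (6.25 km) grid
lr = torch.randn(1, 1, 16, 16)   # ViT forecast on the coarse (25 km) grid
print(lowres_consistency(hr, lr).shape)  # torch.Size([1, 1, 64, 64])
```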
Efficient Test-time Adaptive Object Detection via Sensitivity-Guided Pruning
Kunyu Wang · Xueyang Fu · Xin Lu · Chengjie Ge · Chengzhi Cao · Wei Zhai · Zheng-Jun Zha
Continual test-time adaptive object detection (CTTA-OD) aims to adapt a source pre-trained detector online to ever-changing environments during inference under continuous domain shifts. Most existing CTTA-OD methods prioritize effectiveness while overlooking computational efficiency, which is crucial in resource-constrained scenarios. In this paper, we propose an efficient CTTA-OD method based on pruning. Our motivation stems from the observation that not all learned source features are beneficial: certain domain-sensitive feature channels can adversely affect target-domain performance. Inspired by this, we introduce a sensitivity-guided channel pruning strategy that scores each channel by its sensitivity to domain discrepancies at both the image and instance levels. We apply weighted sparsity regularization to selectively suppress and prune these sensitive channels, focusing adaptation on the invariant ones. Additionally, we introduce a stochastic channel reactivation mechanism to restore pruned channels, enabling the recovery of potentially useful features and mitigating the risk of premature pruning. Extensive experiments on three benchmarks show that our method achieves superior adaptation performance while reducing computational overhead by 12% in FLOPs compared with the recent SOTA method.
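A minimal sketch of the scoring-and-masking idea, assuming channel sensitivity is measured as the shift in mean activation between source and target batches (the paper's actual image/instance-level measure may differ):

```python
import torch

def channel_sensitivity(src_feats: torch.Tensor,
                        tgt_feats: torch.Tensor) -> torch.Tensor:
    """Score each channel by how far its mean activation drifts
    between source and target batches; feats are (B, C, H, W)."""
    mu_s = src_feats.mean(dim=(0, 2, 3))
    mu_t = tgt_feats.mean(dim=(0, 2, 3))
    return (mu_s - mu_t).abs()

def prune_mask(sensitivity: torch.Tensor, ratio: float = 0.1) -> torch.Tensor:
    """Zero out the most domain-sensitive channels."""
    k = int(ratio * sensitivity.numel())
    mask = torch.ones_like(sensitivity)
    mask[sensitivity.topk(k).indices] = 0.0
    return mask.view(1, -1, 1, 1)       # broadcastable over (B, C, H, W)

src = torch.randn(8, 64, 32, 32)
tgt = torch.randn(8, 64, 32, 32)
pruned = tgt * prune_mask(channel_sensitivity(src, tgt))
print(pruned.shape)  # torch.Size([8, 64, 32, 32])
```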
Keep the Balance: A Parameter-Efficient Symmetrical Framework for RGB+X Semantic Segmentation
Jiaxin Cai · Jingze Su · Qi Li · Wenjie Yang · Shu Wang · Tiesong Zhao · Shengfeng He · Wenxi Liu
Multimodal semantic segmentation is a critical challenge in computer vision; early methods suffer from high computational costs and limited transferability due to full fine-tuning of RGB pre-trained parameters. Recent studies, while leveraging additional modalities as supplementary prompts to RGB, still rely predominantly on RGB, which restricts the full potential of the other modalities. To address these issues, we propose a novel symmetric parameter-efficient fine-tuning framework for multimodal segmentation, featuring a modality-aware prompting and adaptation scheme that simultaneously adapts a powerful pre-trained model to both the RGB and X modalities. Furthermore, prevalent approaches use the global cross-modality correlations of the attention mechanism for modality fusion, which inadvertently introduces noise across modalities. To mitigate this noise, we propose a dynamic sparse cross-modality fusion module that enables effective and efficient cross-modality fusion. To further strengthen these two modules, we propose a training strategy that leverages accurately predicted dual-modality results to self-teach the single-modality outcomes. Comprehensive experiments demonstrate that our method outperforms previous state-of-the-art approaches across six multimodal segmentation scenarios with minimal computational cost.
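One way to realize sparse cross-modality fusion is top-k cross-attention, where each RGB query attends only to its k most correlated X-modality tokens; the sketch below assumes that formulation (`top_k` is an illustrative hyperparameter, not the paper's):

```python
import torch

def sparse_cross_fusion(rgb_tokens: torch.Tensor,
                        x_tokens: torch.Tensor,
                        top_k: int = 8) -> torch.Tensor:
    """Cross-attend RGB tokens to X-modality tokens, keeping only the
    top-k correlations per query to suppress cross-modal noise."""
    d = rgb_tokens.shape[-1]
    attn = rgb_tokens @ x_tokens.transpose(-2, -1) / d ** 0.5  # (B, Nq, Nk)
    topv, topi = attn.topk(top_k, dim=-1)
    sparse = torch.full_like(attn, float("-inf"))
    sparse.scatter_(-1, topi, topv)             # mask all but the top-k entries
    return sparse.softmax(dim=-1) @ x_tokens

rgb = torch.randn(2, 100, 64)    # RGB patch tokens
xmod = torch.randn(2, 100, 64)   # X-modality tokens (e.g., depth or thermal)
print(sparse_cross_fusion(rgb, xmod).shape)  # torch.Size([2, 100, 64])
```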