Oral Session

Orals 3C Medical and Physics-based vision

Summit Flex Hall C
Thu 20 Jun 9 a.m. PDT — 10:30 a.m. PDT
Thu 20 June 9:00 - 9:18 PDT

EventPS: Real-Time Photometric Stereo Using an Event Camera

Bohan Yu · Jieji Ren · Jin Han · Feishi Wang · Jinxiu Liang · Boxin Shi

Photometric stereo is a well-established technique for estimating the surface normals of an object. However, the requirement of capturing multiple high dynamic range images under different illumination conditions limits its speed and hinders real-time applications. This paper introduces EventPS, a novel approach to real-time photometric stereo using an event camera. Capitalizing on the exceptional temporal resolution, dynamic range, and low bandwidth characteristics of event cameras, EventPS estimates surface normals from radiance changes alone, significantly enhancing data efficiency. EventPS seamlessly integrates with both optimization-based and deep-learning-based photometric stereo techniques to offer a robust solution for non-Lambertian surfaces. Extensive experiments validate the effectiveness and efficiency of EventPS compared to frame-based counterparts. Our algorithm runs at over 30 fps in real-world scenarios, unleashing the potential of EventPS in time-sensitive and high-speed downstream applications.
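For context, the frame-based setting that the abstract contrasts with can be sketched as classical Lambertian least-squares photometric stereo. This is not the EventPS algorithm: it assumes a Lambertian surface, known distant light directions, and no shadows, and the function name `lambertian_photometric_stereo` is a hypothetical one chosen for illustration.

```python
import numpy as np

def lambertian_photometric_stereo(images, light_dirs):
    """Classical frame-based photometric stereo (illustrative sketch only).

    images:     (K, H, W) intensities captured under K different lights.
    light_dirs: (K, 3) unit light directions, assumed known.
    Returns unit normals (H, W, 3) and albedo (H, W).
    """
    k, h, w = images.shape
    intensities = images.reshape(k, -1)  # (K, H*W)
    # Lambertian model: intensity = light_dir . (albedo * normal).
    # Solve light_dirs @ g = intensities in the least-squares sense.
    g, *_ = np.linalg.lstsq(light_dirs, intensities, rcond=None)  # (3, H*W)
    albedo = np.linalg.norm(g, axis=0)
    normals = g / np.maximum(albedo, 1e-12)  # guard against division by zero
    return normals.T.reshape(h, w, 3), albedo.reshape(h, w)
```

Each additional light direction requires another full frame here, which is the capture cost the abstract identifies as the obstacle to real-time use.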

Thu 20 June 9:18 - 9:36 PDT

EvDiG: Event-guided Direct and Global Components Separation

Xinyu Zhou · Peiqi Duan · Boyu Li · Chu Zhou · Chao Xu · Boxin Shi

Separating the direct and global components of a scene aids in shape recovery and basic material understanding. Conventional methods capture multiple frames under high-frequency illumination patterns or shadows, requiring the scene to remain stationary during the image acquisition process. Single-frame methods simplify the capture procedure but yield lower-quality separation results. In this paper, we leverage the event camera to facilitate the separation of direct and global components, enabling high-quality separation at video rate. Specifically, we use an event camera to record the rapid illumination changes caused by the shadow of a line occluder sweeping over the scene, and reconstruct coarse separation results through event accumulation. We then design a network to remove the noise in the coarse separation results and restore color information. A real-world dataset is collected using a hybrid camera system for network training and evaluation. Experimental results show superior performance over state-of-the-art methods.
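The conventional multi-frame approach mentioned in the abstract can be sketched with a per-pixel max/min rule. This is not the EvDiG pipeline: it assumes a stack of frames in which every pixel is seen both lit and unlit, that a fixed fraction `lit_fraction` of the scene is illuminated in every frame, and that the global component is smooth relative to the pattern. The function and parameter names are hypothetical.

```python
import numpy as np

def separate_direct_global(frames, lit_fraction=0.5):
    """Per-pixel max/min separation from a stack of frames (sketch).

    frames:       (K, H, W) images under shifted high-frequency patterns.
    lit_fraction: assumed fraction of the scene lit in each frame.
    Model: lit pixel   -> direct + lit_fraction * global
           unlit pixel -> lit_fraction * global
    """
    frames = np.asarray(frames, dtype=np.float64)
    l_max = frames.max(axis=0)  # pixel observed while lit
    l_min = frames.min(axis=0)  # pixel observed while unlit
    direct = l_max - l_min
    global_part = l_min / lit_fraction
    return direct, global_part
```

Because the maximum and minimum are taken over several frames, the scene has to stay still during capture, which is the limitation the abstract sets out to remove.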

Thu 20 June 9:36 - 9:54 PDT

MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation

Xiaolong Deng · Huisi Wu · Runhao Zeng · Jing Qin

We propose a novel echocardiography video segmentation model that adapts SAM to medical videos to address several long-standing challenges in ultrasound video segmentation, including (1) massive speckle noise and artifacts, (2) extremely ambiguous boundaries, and (3) large variations of target objects across frames. The core technique of our model is a temporal-aware and noise-resilient prompting scheme. Specifically, we employ a space-time memory that contains both spatial and temporal information to prompt the segmentation of the current frame, and thus we call the proposed model MemSAM. In prompting, the memory carrying temporal cues sequentially prompts the video segmentation frame by frame. Meanwhile, because the memory prompt propagates high-level features, it avoids the misidentification caused by mask propagation and improves representation consistency. To address the challenge of speckle noise, we further propose a memory reinforcement mechanism, which leverages predicted masks to improve the quality of the memory before storing it. We extensively evaluate our method on two public datasets and demonstrate state-of-the-art performance compared to existing models. In particular, with limited annotations, our model achieves performance comparable to fully supervised approaches. Codes are available at

Thu 20 June 9:54 - 10:12 PDT

Transcriptomics-guided Slide Representation Learning in Computational Pathology

Guillaume Jaume · Lukas Oldenburg · Anurag Vaidya · Richard J. Chen · Drew F. K. Williamson · Thomas Peeters · Andrew Song · Faisal Mahmood

Self-supervised learning (SSL) has been successful in building patch embeddings of small histology images (e.g., 224 x 224 pixels), but scaling these models to learn slide embeddings from the entirety of giga-pixel whole-slide images (WSIs) remains challenging. Here, we leverage complementary information from gene expression profiles to guide slide representation learning using multimodal pre-training. Expression profiles constitute highly detailed molecular descriptions of a tissue that we hypothesize offer a strong task-agnostic training signal for learning slide embeddings. Our slide and expression (S+E) pre-training strategy, called TANGLE, employs modality-specific encoders, the outputs of which are aligned via contrastive learning. TANGLE was pre-trained on samples from three different organs: liver (n=6,597 S+E pairs), breast (n=1,020), and lung (n=1,012) from two different species (Homo sapiens and Rattus norvegicus). Across three independent test datasets consisting of 1,265 breast WSIs, 1,946 lung WSIs, and 4,584 liver WSIs, TANGLE shows significantly better few-shot performance compared to supervised and SSL baselines. When assessed using prototype-based classification and slide retrieval, TANGLE also shows a substantial performance improvement over all baselines. Code will be made available upon acceptance.
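The alignment of the two modality-specific encoders can be illustrated with a generic symmetric contrastive (InfoNCE-style) objective over paired embeddings. This is a minimal sketch under stated assumptions, not the TANGLE training code: the exact loss, the temperature value, and the names `symmetric_contrastive_loss`, `slide_emb`, and `expr_emb` are illustrative choices, and the encoders themselves are omitted.

```python
import numpy as np

def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(slide_emb, expr_emb, temperature=0.1):
    """Symmetric InfoNCE over N paired embeddings (illustrative sketch).

    slide_emb, expr_emb: (N, D) arrays; row i of each comes from the same sample.
    """
    s = slide_emb / np.linalg.norm(slide_emb, axis=1, keepdims=True)
    e = expr_emb / np.linalg.norm(expr_emb, axis=1, keepdims=True)
    logits = s @ e.T / temperature  # (N, N) scaled cosine similarities
    idx = np.arange(len(s))
    # Matched pairs sit on the diagonal; all other entries act as negatives.
    loss_slide_to_expr = -_log_softmax(logits, axis=1)[idx, idx].mean()
    loss_expr_to_slide = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_slide_to_expr + loss_expr_to_slide)
```

Minimizing such a loss pulls each slide embedding toward the expression embedding of the same sample and pushes it away from the others in the batch, which is one common way to obtain a task-agnostic training signal of the kind the abstract describes.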

Thu 20 June 10:12 - 10:30 PDT

Correlation-aware Coarse-to-fine MLPs for Deformable Medical Image Registration

Mingyuan Meng · Dagan Feng · Lei Bi · Jinman Kim

Deformable image registration is a fundamental step in medical image analysis. Recently, transformers have been used for registration and have outperformed Convolutional Neural Networks (CNNs). Transformers can capture long-range dependence among image features, which has been shown to be beneficial for registration. However, due to the high computation and memory loads of self-attention, transformers are typically applied at downsampled feature resolutions and cannot capture fine-grained long-range dependence at the full image resolution. This limits deformable registration, which requires precise dense correspondence between image pixels. Multi-layer Perceptrons (MLPs) without self-attention are efficient in computation and memory usage, making it feasible to capture fine-grained long-range dependence at full resolution. Nevertheless, MLPs have not been extensively explored for image registration and lack consideration of the inductive bias crucial for medical registration tasks. In this study, we propose the first correlation-aware MLP-based registration network (CorrMLP) for deformable medical image registration. Our CorrMLP introduces a correlation-aware multi-window MLP block in a novel coarse-to-fine registration architecture, which captures fine-grained multi-range dependence to perform correlation-aware coarse-to-fine registration. Extensive experiments with seven public medical datasets show that our CorrMLP outperforms state-of-the-art deformable registration methods.