new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Jan 7

Reinforcing Video Reasoning Segmentation to Think Before It Segments

Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into <SEG> tokens for mask prediction. However, this paradigm suffers from limited interpretability during inference and suboptimal performance due to inadequate spatiotemporal reasoning. Drawing inspiration from seminal breakthroughs in reinforcement learning, we introduce Veason-R1, a specialized LVLM for VRS that emphasizes structured reasoning in segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization. To begin with, we curate high-quality CoT training data to instill structured reasoning trajectories, bridging video-level semantics and frame-level spatial grounding, yielding the supervised fine-tuned model Veason-SFT. Subsequently, GRPO fine-tuning encourages efficient exploration of the reasoning space by optimizing reasoning chains. To this end, we incorporate a holistic reward mechanism that synergistically enhances spatial alignment and temporal consistency, bolstering keyframe localization and fine-grained grounding. Comprehensive empirical evaluations demonstrate that Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins (e.g., +1.3 J &F in ReVOS and +10.0 J &F in ReasonVOS), while exhibiting robustness to hallucinations (+8.8 R). Our code and model weights will be available at Veason-R1.

  • 6 authors
·
Aug 15, 2025

Decoding with Structured Awareness: Integrating Directional, Frequency-Spatial, and Structural Attention for Medical Image Segmentation

To address the limitations of Transformer decoders in capturing edge details, recognizing local textures and modeling spatial continuity, this paper proposes a novel decoder framework specifically designed for medical image segmentation, comprising three core modules. First, the Adaptive Cross-Fusion Attention (ACFA) module integrates channel feature enhancement with spatial attention mechanisms and introduces learnable guidance in three directions (planar, horizontal, and vertical) to enhance responsiveness to key regions and structural orientations. Second, the Triple Feature Fusion Attention (TFFA) module fuses features from Spatial, Fourier and Wavelet domains, achieving joint frequency-spatial representation that strengthens global dependency and structural modeling while preserving local information such as edges and textures, making it particularly effective in complex and blurred boundary scenarios. Finally, the Structural-aware Multi-scale Masking Module (SMMM) optimizes the skip connections between encoder and decoder by leveraging multi-scale context and structural saliency filtering, effectively reducing feature redundancy and improving semantic interaction quality. Working synergistically, these modules not only address the shortcomings of traditional decoders but also significantly enhance performance in high-precision tasks such as tumor segmentation and organ boundary extraction, improving both segmentation accuracy and model generalization. Experimental results demonstrate that this framework provides an efficient and practical solution for medical image segmentation.

  • 3 authors
·
Dec 5, 2025

Segmentation and Processing of German Court Decisions from Open Legal Data

The availability of structured legal data is important for advancing Natural Language Processing (NLP) techniques for the German legal system. One of the most widely used datasets, Open Legal Data, provides a large-scale collection of German court decisions. While the metadata in this raw dataset is consistently structured, the decision texts themselves are inconsistently formatted and often lack clearly marked sections. Reliable separation of these sections is important not only for rhetorical role classification but also for downstream tasks such as retrieval and citation analysis. In this work, we introduce a cleaned and sectioned dataset of 251,038 German court decisions derived from the official Open Legal Data dataset. We systematically separated three important sections in German court decisions, namely Tenor (operative part of the decision), Tatbestand (facts of the case), and Entscheidungsgründe (judicial reasoning), which are often inconsistently represented in the original dataset. To ensure the reliability of our extraction process, we used Cochran's formula with a 95% confidence level and a 5% margin of error to draw a statistically representative random sample of 384 cases, and manually verified that all three sections were correctly identified. We also extracted the Rechtsmittelbelehrung (appeal notice) as a separate field, since it is a procedural instruction and not part of the decision itself. The resulting corpus is publicly available in the JSONL format, making it an accessible resource for further research on the German legal system.

  • 4 authors
·
Jan 4

Improving Long Document Topic Segmentation Models With Enhanced Coherence Modeling

Topic segmentation is critical for obtaining structured documents and improving downstream tasks such as information retrieval. Due to its ability of automatically exploring clues of topic shift from abundant labeled data, recent supervised neural models have greatly promoted the development of long document topic segmentation, but leaving the deeper relationship between coherence and topic segmentation underexplored. Therefore, this paper enhances the ability of supervised models to capture coherence from both logical structure and semantic similarity perspectives to further improve the topic segmentation performance, proposing Topic-aware Sentence Structure Prediction (TSSP) and Contrastive Semantic Similarity Learning (CSSL). Specifically, the TSSP task is proposed to force the model to comprehend structural information by learning the original relations between adjacent sentences in a disarrayed document, which is constructed by jointly disrupting the original document at topic and sentence levels. Moreover, we utilize inter- and intra-topic information to construct contrastive samples and design the CSSL objective to ensure that the sentences representations in the same topic have higher similarity, while those in different topics are less similar. Extensive experiments show that the Longformer with our approach significantly outperforms old state-of-the-art (SOTA) methods. Our approach improve F_1 of old SOTA by 3.42 (73.74 -> 77.16) and reduces P_k by 1.11 points (15.0 -> 13.89) on WIKI-727K and achieves an average relative reduction of 4.3% on P_k on WikiSection. The average relative P_k drop of 8.38% on two out-of-domain datasets also demonstrates the robustness of our approach.

  • 6 authors
·
Oct 18, 2023

Matching-Based Few-Shot Semantic Segmentation Models Are Interpretable by Design

Few-Shot Semantic Segmentation (FSS) models achieve strong performance in segmenting novel classes with minimal labeled examples, yet their decision-making processes remain largely opaque. While explainable AI has advanced significantly in standard computer vision tasks, interpretability in FSS remains virtually unexplored despite its critical importance for understanding model behavior and guiding support set selection in data-scarce scenarios. This paper introduces the first dedicated method for interpreting matching-based FSS models by leveraging their inherent structural properties. Our Affinity Explainer approach extracts attribution maps that highlight which pixels in support images contribute most to query segmentation predictions, using matching scores computed between support and query features at multiple feature levels. We extend standard interpretability evaluation metrics to the FSS domain and propose additional metrics to better capture the practical utility of explanations in few-shot scenarios. Comprehensive experiments on FSS benchmark datasets, using different models, demonstrate that our Affinity Explainer significantly outperforms adapted standard attribution methods. Qualitative analysis reveals that our explanations provide structured, coherent attention patterns that align with model architectures and and enable effective model diagnosis. This work establishes the foundation for interpretable FSS research, enabling better model understanding and diagnostic for more reliable few-shot segmentation systems. The source code is publicly available at https://github.com/pasqualedem/AffinityExplainer.

  • 5 authors
·
Nov 22, 2025

Segmentation with Noisy Labels via Spatially Correlated Distributions

In semantic segmentation, the accuracy of models heavily depends on the high-quality annotations. However, in many practical scenarios such as medical imaging and remote sensing, obtaining true annotations is not straightforward and usually requires significant human labor. Relying on human labor often introduces annotation errors, including mislabeling, omissions, and inconsistency between annotators. In the case of remote sensing, differences in procurement time can lead to misaligned ground truth annotations. These label errors are not independently distributed, and instead usually appear in spatially connected regions where adjacent pixels are more likely to share the same errors. To address these issues, we propose an approximate Bayesian estimation based on a probabilistic model that assumes training data includes label errors, incorporating the tendency for these errors to occur with spatial correlations between adjacent pixels. Bayesian inference requires computing the posterior distribution of label errors, which becomes intractable when spatial correlations are present. We represent the correlation of label errors between adjacent pixels through a Gaussian distribution whose covariance is structured by a Kac-Murdock-Szeg\"{o} (KMS) matrix, solving the computational challenges. Through experiments on multiple segmentation tasks, we confirm that leveraging the spatial correlation of label errors significantly improves performance. Notably, in specific tasks such as lung segmentation, the proposed method achieves performance comparable to training with clean labels under moderate noise levels. Code is available at https://github.com/pfnet-research/Bayesian_SpatialCorr.

  • 3 authors
·
Apr 20, 2025

Learning Structured Output Representations from Attributes using Deep Conditional Generative Models

Structured output representation is a generative task explored in computer vision that often times requires the mapping of low dimensional features to high dimensional structured outputs. Losses in complex spatial information in deterministic approaches such as Convolutional Neural Networks (CNN) lead to uncertainties and ambiguous structures within a single output representation. A probabilistic approach through deep Conditional Generative Models (CGM) is presented by Sohn et al. in which a particular model known as the Conditional Variational Auto-encoder (CVAE) is introduced and explored. While the original paper focuses on the task of image segmentation, this paper adopts the CVAE framework for the task of controlled output representation through attributes. This approach allows us to learn a disentangled multimodal prior distribution, resulting in more controlled and robust approach to sample generation. In this work we recreate the CVAE architecture and train it on images conditioned on various attributes obtained from two image datasets; the Large-scale CelebFaces Attributes (CelebA) dataset and the Caltech-UCSD Birds (CUB-200-2011) dataset. We attempt to generate new faces with distinct attributes such as hair color and glasses, as well as different bird species samples with various attributes. We further introduce strategies for improving generalized sample generation by applying a weighted term to the variational lower bound.

  • 1 authors
·
Apr 30, 2023

ReXGroundingCT: A 3D Chest CT Dataset for Segmentation of Findings from Free-Text Reports

We present ReXGroundingCT, the first publicly available dataset to link free-text radiology findings with pixel-level segmentations in 3D chest CT scans that is manually annotated. While prior datasets have relied on structured labels or predefined categories, ReXGroundingCT captures the full expressiveness of clinical language represented in free text and grounds it to spatially localized 3D segmentation annotations in volumetric imaging. This addresses a critical gap in medical AI: the ability to connect complex, descriptive text, such as "3 mm nodule in the left lower lobe", to its precise anatomical location in three-dimensional space, a capability essential for grounded radiology report generation systems. The dataset comprises 3,142 non-contrast chest CT scans paired with standardized radiology reports from the CT-RATE dataset. Using a systematic three-stage pipeline, GPT-4 was used to extract positive lung and pleural findings, which were then manually segmented by expert annotators. A total of 8,028 findings across 16,301 entities were annotated, with quality control performed by board-certified radiologists. Approximately 79% of findings are focal abnormalities, while 21% are non-focal. The training set includes up to three representative segmentations per finding, while the validation and test sets contain exhaustive labels for each finding entity. ReXGroundingCT establishes a new benchmark for developing and evaluating sentence-level grounding and free-text medical segmentation models in chest CT. The dataset can be accessed at https://huggingface.co/datasets/rajpurkarlab/ReXGroundingCT.

  • 23 authors
·
Jul 29, 2025

RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable reasoning capability while lack explicit mechanisms for visual grounding and segmentation, creating a gap between cognitive reasoning and visual perception. To bridge this gap, we introduce Reasoning Segmentation via Visual Prompting (RSVP), a novel framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP is a two-stage structuralized framework that integrates reasoning-driven localization with segmentation refinement. In the reasoning stage, RSVP employs multimodal chain-of-thought visual prompts to help MLLMs understand queries and infer targets, generating interpretable region proposals that enhance visual grounding. In segmentation stage, RSVP refines these proposals with a Vision-Language Segmentation Module (VLSM), seamlessly integrates textual and visual cues to produce precise segmentation masks. By explicitly modelling the interaction between multimodal reasoning and segmentation, RSVP introduces a new paradigm for interpretable reasoning segmentation. It exploits MLLMs' inherent localization capabilities, enabling the models to not only reason about objects but also generate structured visual representations. Our extensive experiments demonstrate that RSVP achieves state-of-the-art performance, surpasses state-of-the-art methods by up to +6.5 gIoU and +9.2 cIoU on ReasonSeg, and achieves 49.7 mAP on SegInW under zero-shot settings. These results validate RSVP as an effective and scalable framework for integrating cognitive reasoning with structured visual understanding.

  • 9 authors
·
Jun 3, 2025

A large-scale image-text dataset benchmark for farmland segmentation

The traditional deep learning paradigm that solely relies on labeled data has limitations in representing the spatial relationships between farmland elements and the surrounding environment.It struggles to effectively model the dynamic temporal evolution and spatial heterogeneity of farmland. Language,as a structured knowledge carrier,can explicitly express the spatiotemporal characteristics of farmland, such as its shape, distribution,and surrounding environmental information.Therefore,a language-driven learning paradigm can effectively alleviate the challenges posed by the spatiotemporal heterogeneity of farmland.However,in the field of remote sensing imagery of farmland,there is currently no comprehensive benchmark dataset to support this research direction.To fill this gap,we introduced language based descriptions of farmland and developed FarmSeg-VL dataset,the first fine-grained image-text dataset designed for spatiotemporal farmland segmentation.Firstly, this article proposed a semi-automatic annotation method that can accurately assign caption to each image, ensuring high data quality and semantic richness while improving the efficiency of dataset construction.Secondly,the FarmSeg-VL exhibits significant spatiotemporal characteristics.In terms of the temporal dimension,it covers all four seasons.In terms of the spatial dimension,it covers eight typical agricultural regions across China.In addition, in terms of captions,FarmSeg-VL covers rich spatiotemporal characteristics of farmland,including its inherent properties,phenological characteristics, spatial distribution,topographic and geomorphic features,and the distribution of surrounding environments.Finally,we present a performance analysis of VLMs and the deep learning models that rely solely on labels trained on the FarmSeg-VL,demonstrating its potential as a standard benchmark for farmland segmentation.

  • 5 authors
·
Mar 29, 2025

RPCANet++: Deep Interpretable Robust PCA for Sparse Object Segmentation

Robust principal component analysis (RPCA) decomposes an observation matrix into low-rank background and sparse object components. This capability has enabled its application in tasks ranging from image restoration to segmentation. However, traditional RPCA models suffer from computational burdens caused by matrix operations, reliance on finely tuned hyperparameters, and rigid priors that limit adaptability in dynamic scenarios. To solve these limitations, we propose RPCANet++, a sparse object segmentation framework that fuses the interpretability of RPCA with efficient deep architectures. Our approach unfolds a relaxed RPCA model into a structured network comprising a Background Approximation Module (BAM), an Object Extraction Module (OEM), and an Image Restoration Module (IRM). To mitigate inter-stage transmission loss in the BAM, we introduce a Memory-Augmented Module (MAM) to enhance background feature preservation, while a Deep Contrast Prior Module (DCPM) leverages saliency cues to expedite object extraction. Extensive experiments on diverse datasets demonstrate that RPCANet++ achieves state-of-the-art performance under various imaging scenarios. We further improve interpretability via visual and numerical low-rankness and sparsity measurements. By combining the theoretical strengths of RPCA with the efficiency of deep networks, our approach sets a new baseline for reliable and interpretable sparse object segmentation. Codes are available at our Project Webpage https://fengyiwu98.github.io/rpcanetx.

  • 7 authors
·
Aug 6, 2025 2

Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans

With the growing volume of CT examinations, there is an increasing demand for automated tools such as organ segmentation, abnormality detection, and report generation to support radiologists in managing their clinical workload. Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem due to the complex spatial relationships inherent in volumetric data and the wide variability of abnormalities. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers often require extensive pre-training on large-scale, domain-specific datasets to perform competitively. In this work of academic research, we propose a 2.5D alternative by introducing a new graph-based framework that represents 3D CT volumes as structured graphs, where axial slice triplets serve as nodes processed through spectral graph convolution, enabling the model to reason over inter-slice dependencies while maintaining complexity compatible with clinical deployment. Our method, trained and evaluated on 3 datasets from independent institutions, achieves strong cross-dataset generalization, and shows competitive performance compared to state-of-the-art visual encoders. We further conduct comprehensive ablation studies to evaluate the impact of various aggregation strategies, edge-weighting schemes, and graph connectivity patterns. Additionally, we demonstrate the broader applicability of our approach through transfer experiments on automated radiology report generation and abdominal CT data.

  • 4 authors
·
Oct 12, 2025

Benchmarking Multi-modal Semantic Segmentation under Sensor Failures: Missing and Noisy Modality Robustness

Multi-modal semantic segmentation (MMSS) addresses the limitations of single-modality data by integrating complementary information across modalities. Despite notable progress, a significant gap persists between research and real-world deployment due to variability and uncertainty in multi-modal data quality. Robustness has thus become essential for practical MMSS applications. However, the absence of standardized benchmarks for evaluating robustness hinders further advancement. To address this, we first survey existing MMSS literature and categorize representative methods to provide a structured overview. We then introduce a robustness benchmark that evaluates MMSS models under three scenarios: Entire-Missing Modality (EMM), Random-Missing Modality (RMM), and Noisy Modality (NM). From a probabilistic standpoint, we model modality failure under two conditions: (1) all damaged combinations are equally probable; (2) each modality fails independently following a Bernoulli distribution. Based on these, we propose four metrics-mIoU^{Avg}_{EMM}, mIoU^{E}_{EMM}, mIoU^{Avg}_{RMM}, and mIoU^{E}_{RMM}-to assess model robustness under EMM and RMM. This work provides the first dedicated benchmark for MMSS robustness, offering new insights and tools to advance the field. Source code is available at https://github.com/Chenfei-Liao/Multi-Modal-Semantic-Segmentation-Robustness-Benchmark.

  • 9 authors
·
Mar 24, 2025

Deep Structured Feature Networks for Table Detection and Tabular Data Extraction from Scanned Financial Document Images

Automatic table detection in PDF documents has achieved a great success but tabular data extraction are still challenging due to the integrity and noise issues in detected table areas. The accurate data extraction is extremely crucial in finance area. Inspired by this, the aim of this research is proposing an automated table detection and tabular data extraction from financial PDF documents. We proposed a method that consists of three main processes, which are detecting table areas with a Faster R-CNN (Region-based Convolutional Neural Network) model with Feature Pyramid Network (FPN) on each page image, extracting contents and structures by a compounded layout segmentation technique based on optical character recognition (OCR) and formulating regular expression rules for table header separation. The tabular data extraction feature is embedded with rule-based filtering and restructuring functions that are highly scalable. We annotate a new Financial Documents dataset with table regions for the experiment. The excellent table detection performance of the detection model is obtained from our customized dataset. The main contributions of this paper are proposing the Financial Documents dataset with table-area annotations, the superior detection model and the rule-based layout segmentation technique for the tabular data extraction from PDF files.

  • 5 authors
·
Feb 20, 2021

VISA: Reasoning Video Object Segmentation via Large Language Models

Existing Video Object Segmentation (VOS) relies on explicit user instructions, such as categories, masks, or short phrases, restricting their ability to perform complex video segmentation requiring reasoning with world knowledge. In this paper, we introduce a new task, Reasoning Video Object Segmentation (ReasonVOS). This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities based on world knowledge and video contexts, which is crucial for structured environment understanding and object-centric interactions, pivotal in the development of embodied AI. To tackle ReasonVOS, we introduce VISA (Video-based large language Instructed Segmentation Assistant), to leverage the world knowledge reasoning capabilities of multi-modal LLMs while possessing the ability to segment and track objects in videos with a mask decoder. Moreover, we establish a comprehensive benchmark consisting of 35,074 instruction-mask sequence pairs from 1,042 diverse videos, which incorporates complex world knowledge reasoning into segmentation tasks for instruction-tuning and evaluation purposes of ReasonVOS models. Experiments conducted on 8 datasets demonstrate the effectiveness of VISA in tackling complex reasoning segmentation and vanilla referring segmentation in both video and image domains. The code and dataset are available at https://github.com/cilinyan/VISA.

  • 8 authors
·
Jul 15, 2024

DAMASHA: Detecting AI in Mixed Adversarial Texts via Segmentation with Human-interpretable Attribution

In the age of advanced large language models (LLMs), the boundaries between human and AI-generated text are becoming increasingly blurred. We address the challenge of segmenting mixed-authorship text, that is identifying transition points in text where authorship shifts from human to AI or vice-versa, a problem with critical implications for authenticity, trust, and human oversight. We introduce a novel framework, called Info-Mask for mixed authorship detection that integrates stylometric cues, perplexity-driven signals, and structured boundary modeling to accurately segment collaborative human-AI content. To evaluate the robustness of our system against adversarial perturbations, we construct and release an adversarial benchmark dataset Mixed-text Adversarial setting for Segmentation (MAS), designed to probe the limits of existing detectors. Beyond segmentation accuracy, we introduce Human-Interpretable Attribution (HIA overlays that highlight how stylometric features inform boundary predictions, and we conduct a small-scale human study assessing their usefulness. Across multiple architectures, Info-Mask significantly improves span-level robustness under adversarial conditions, establishing new baselines while revealing remaining challenges. Our findings highlight both the promise and limitations of adversarially robust, interpretable mixed-authorship detection, with implications for trust and oversight in human-AI co-authorship.

  • 6 authors
·
Dec 4, 2025

SOS: Synthetic Object Segments Improve Detection, Segmentation, and Grounding

Visual grouping -- operationalized via instance segmentation, visual grounding, and object detection -- underpins applications from robotic perception to photo editing. Large annotated datasets are costly, biased in coverage, and hard to scale. Synthetic data are promising but often lack flexibility, accuracy, and compositional diversity. We present SOS, a simple and scalable data synthesis pipeline based on an object-centric composition strategy. It pastes high-quality synthetic object segments into new images using structured layout priors and generative relighting, producing accurate and diverse masks, boxes, and referring expressions. Models trained on 100000 synthetic images from SOS outperform those trained on larger real-image datasets such as GRIT (20M) and V3Det (200K) on detection and grounding tasks, achieving +10.9 AP on LVIS detection and +8.4 N_{Acc} on gRefCOCO grounding. SOS enables controllable dataset construction and improves generalization in both low-data and closed-vocabulary settings. Augmenting LVIS and COCO with synthetic object segments yields strong performance across real-data scales and even larger gains under extremely limited real data (for example, +3.83 AP_{rare} on LVIS instance segmentation and +6.59 AP with a 1 percent COCO setup). This controllability also supports targeted data generation for challenging intra-class referring in visual grounding.

  • 7 authors
·
Oct 10, 2025

PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data

Segmenting 3D objects into parts is a long-standing challenge in computer vision. To overcome taxonomy constraints and generalize to unseen 3D objects, recent works turn to open-world part segmentation. These approaches typically transfer supervision from 2D foundation models, such as SAM, by lifting multi-view masks into 3D. However, this indirect paradigm fails to capture intrinsic geometry, leading to surface-only understanding, uncontrolled decomposition, and limited generalization. We present PartSAM, the first promptable part segmentation model trained natively on large-scale 3D data. Following the design philosophy of SAM, PartSAM employs an encoder-decoder architecture in which a triplane-based dual-branch encoder produces spatially structured tokens for scalable part-aware representation learning. To enable large-scale supervision, we further introduce a model-in-the-loop annotation pipeline that curates over five million 3D shape-part pairs from online assets, providing diverse and fine-grained labels. This combination of scalable architecture and diverse 3D data yields emergent open-world capabilities: with a single prompt, PartSAM achieves highly accurate part identification, and in a Segment-Every-Part mode, it automatically decomposes shapes into both surface and internal structures. Extensive experiments show that PartSAM outperforms state-of-the-art methods by large margins across multiple benchmarks, marking a decisive step toward foundation models for 3D part understanding.

  • 9 authors
·
Sep 26, 2025

Text4Seg++: Advancing Image Segmentation via Generative Language Modeling

Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. We first introduce image-wise semantic descriptors, a patch-aligned textual representation of segmentation masks that integrates naturally into the language modeling pipeline. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by 3times, without compromising performance. Building upon this, our initial framework Text4Seg achieves strong segmentation performance across a wide range of vision tasks. To further improve granularity and compactness, we propose box-wise semantic descriptors, which localizes regions of interest using bounding boxes and represents region masks via structured mask tokens called semantic bricks. This leads to our refined model, Text4Seg++, which formulates segmentation as a next-brick prediction task, combining precision, scalability, and generative efficiency. Comprehensive experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models across diverse benchmarks without any task-specific fine-tuning, while remaining compatible with existing MLLM backbones. Our work highlights the effectiveness, scalability, and generalizability of text-driven image segmentation within the MLLM framework.

  • 9 authors
·
Sep 8, 2025

Unleashing Hierarchical Reasoning: An LLM-Driven Framework for Training-Free Referring Video Object Segmentation

Referring Video Object Segmentation (RVOS) aims to segment an object of interest throughout a video based on a language description. The prominent challenge lies in aligning static text with dynamic visual content, particularly when objects exhibiting similar appearances with inconsistent motion and poses. However, current methods often rely on a holistic visual-language fusion that struggles with complex, compositional descriptions. In this paper, we propose PARSE-VOS, a novel, training-free framework powered by Large Language Models (LLMs), for a hierarchical, coarse-to-fine reasoning across text and video domains. Our approach begins by parsing the natural language query into structured semantic commands. Next, we introduce a spatio-temporal grounding module that generates all candidate trajectories for all potential target objects, guided by the parsed semantics. Finally, a hierarchical identification module select the correct target through a two-stage reasoning process: it first performs coarse-grained motion reasoning with an LLM to narrow down candidates; if ambiguity remains, a fine-grained pose verification stage is conditionally triggered to disambiguate. The final output is an accurate segmentation mask for the target object. PARSE-VOS achieved state-of-the-art performance on three major benchmarks: Ref-YouTube-VOS, Ref-DAVIS17, and MeViS.

  • 8 authors
·
Sep 6, 2025

BI-RADS BERT & Using Section Segmentation to Understand Radiology Reports

Radiology reports are one of the main forms of communication between radiologists and other clinicians and contain important information for patient care. In order to use this information for research and automated patient care programs, it is necessary to convert the raw text into structured data suitable for analysis. State-of-the-art natural language processing (NLP) domain-specific contextual word embeddings have been shown to achieve impressive accuracy for these tasks in medicine, but have yet to be utilized for section structure segmentation. In this work, we pre-trained a contextual embedding BERT model using breast radiology reports and developed a classifier that incorporated the embedding with auxiliary global textual features in order to perform section segmentation. This model achieved a 98% accuracy at segregating free text reports sentence by sentence into sections of information outlined in the Breast Imaging Reporting and Data System (BI-RADS) lexicon, a significant improvement over the Classic BERT model without auxiliary information. We then evaluated whether using section segmentation improved the downstream extraction of clinically relevant information such as modality/procedure, previous cancer, menopausal status, the purpose of the exam, breast density, and breast MRI background parenchymal enhancement. Using the BERT model pre-trained on breast radiology reports combined with section segmentation resulted in an overall accuracy of 95.9% in the field extraction tasks. This is a 17% improvement compared to an overall accuracy of 78.9% for field extraction with models using Classic BERT embeddings and not using section segmentation. Our work shows the strength of using BERT in radiology report analysis and the advantages of section segmentation in identifying key features of patient factors recorded in breast radiology reports.

  • 3 authors
·
Oct 14, 2021

DeepOrgan: Multi-level Deep Convolutional Networks for Automated Pancreas Segmentation

Automatic organ segmentation is an important yet challenging problem for medical image analysis. The pancreas is an abdominal organ with very high anatomical variability. This inhibits previous segmentation methods from achieving high accuracies, especially compared to other organs such as the liver, heart or kidneys. In this paper, we present a probabilistic bottom-up approach for pancreas segmentation in abdominal computed tomography (CT) scans, using multi-level deep convolutional networks (ConvNets). We propose and evaluate several variations of deep ConvNets in the context of hierarchical, coarse-to-fine classification on image patches and regions, i.e. superpixels. We first present a dense labeling of local image patches via P{-}ConvNet and nearest neighbor fusion. Then we describe a regional ConvNet (R_1{-}ConvNet) that samples a set of bounding boxes around each image superpixel at different scales of contexts in a "zoom-out" fashion. Our ConvNets learn to assign class probabilities for each superpixel region of being pancreas. Last, we study a stacked R_2{-}ConvNet leveraging the joint space of CT intensities and the P{-}ConvNet dense probability maps. Both 3D Gaussian smoothing and 2D conditional random fields are exploited as structured predictions for post-processing. We evaluate on CT images of 82 patients in 4-fold cross-validation. We achieve a Dice Similarity Coefficient of 83.6pm6.3% in training and 71.8pm10.7% in testing.

  • 7 authors
·
Jun 21, 2015

xLSTM-UNet can be an Effective 2D \& 3D Medical Image Segmentation Backbone with Vision-LSTM (ViL) better than its Mamba Counterpart

Convolutional Neural Networks (CNNs) and Vision Transformers (ViT) have been pivotal in biomedical image segmentation, yet their ability to manage long-range dependencies remains constrained by inherent locality and computational overhead. To overcome these challenges, in this technical report, we first propose xLSTM-UNet, a UNet structured deep learning neural network that leverages Vision-LSTM (xLSTM) as its backbone for medical image segmentation. xLSTM is a recently proposed as the successor of Long Short-Term Memory (LSTM) networks and have demonstrated superior performance compared to Transformers and State Space Models (SSMs) like Mamba in Neural Language Processing (NLP) and image classification (as demonstrated in Vision-LSTM, or ViL implementation). Here, xLSTM-UNet we designed extend the success in biomedical image segmentation domain. By integrating the local feature extraction strengths of convolutional layers with the long-range dependency capturing abilities of xLSTM, xLSTM-UNet offers a robust solution for comprehensive image analysis. We validate the efficacy of xLSTM-UNet through experiments. Our findings demonstrate that xLSTM-UNet consistently surpasses the performance of leading CNN-based, Transformer-based, and Mamba-based segmentation networks in multiple datasets in biomedical segmentation including organs in abdomen MRI, instruments in endoscopic images, and cells in microscopic images. With comprehensive experiments performed, this technical report highlights the potential of xLSTM-based architectures in advancing biomedical image analysis in both 2D and 3D. The code, models, and datasets are publicly available at http://tianrun-chen.github.io/xLSTM-UNet/{http://tianrun-chen.github.io/xLSTM-Unet/}

  • 7 authors
·
Jul 1, 2024

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.

  • 6 authors
·
Apr 11, 2025 2

Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner

Large reasoning models (LRMs) have recently shown promise in solving complex math problems when optimized with Reinforcement Learning (RL). But conventional approaches rely on outcome-only rewards that provide sparse feedback, resulting in inefficient optimization process. In this work, we investigate the function of process reward models (PRMs) to accelerate the RL training for LRMs. We propose a novel intrinsic signal-driven generative process evaluation mechanism operating at the thought level to address major bottlenecks in RL-based training. Specifically, instead of requiring PRMs to know how to solve problems, our method uses intrinsic signals in solutions to judge stepwise correctness and aggregate contiguous correct/incorrect steps into coherent 'thought' units. This structured, thought-level rewards enable more reliable credit assignment by reducing ambiguity in step segmentation and alleviating reward hacking. We further introduce a capability-adaptive reward mechanism that dynamically balances exploration and exploitation based on the LRM's current proficiency, guiding learning without stifling creative trial-and-error. These innovations are integrated into a new off-policy RL algorithm, TP-GRPO, which extends grouped proximal optimization with process-based rewards and improves training efficiency. Experiments on 1.5B and 7B parameter LRMs demonstrate that our method achieves higher problem-solving accuracy with significantly fewer training samples than outcome-only reward baselines. The results validate that well-structured process rewards can substantially accelerate LRM optimization in math reasoning tasks. Code is available at https://github.com/cs-holder/tp_grpo.

  • 6 authors
·
Jul 31, 2025

LOGICSEG: Parsing Visual Semantics with Neural Logic Learning and Reasoning

Current high-performance semantic segmentation models are purely data-driven sub-symbolic approaches and blind to the structured nature of the visual world. This is in stark contrast to human cognition which abstracts visual perceptions at multiple levels and conducts symbolic reasoning with such structured abstraction. To fill these fundamental gaps, we devise LOGICSEG, a holistic visual semantic parser that integrates neural inductive learning and logic reasoning with both rich data and symbolic knowledge. In particular, the semantic concepts of interest are structured as a hierarchy, from which a set of constraints are derived for describing the symbolic relations and formalized as first-order logic rules. After fuzzy logic-based continuous relaxation, logical formulae are grounded onto data and neural computational graphs, hence enabling logic-induced network training. During inference, logical constraints are packaged into an iterative process and injected into the network in a form of several matrix multiplications, so as to achieve hierarchy-coherent prediction with logic reasoning. These designs together make LOGICSEG a general and compact neural-logic machine that is readily integrated into existing segmentation models. Extensive experiments over four datasets with various segmentation models and backbones verify the effectiveness and generality of LOGICSEG. We believe this study opens a new avenue for visual semantic parsing.

  • 3 authors
·
Sep 24, 2023

Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source

Accurate delineation of agricultural field boundaries from satellite imagery is essential for land management and crop monitoring, yet existing methods often produce incomplete boundaries, merge adjacent fields, and struggle to scale. We present the Delineate Anything Flow (DelAnyFlow) methodology, a resolution-agnostic approach for large-scale field boundary mapping. DelAnyFlow combines the DelAny instance segmentation model, based on a YOLOv11 backbone and trained on the large-scale Field Boundary Instance Segmentation-22M (FBIS 22M) dataset, with a structured post-processing, merging, and vectorization sequence to generate topologically consistent vector boundaries. FBIS 22M, the largest dataset of its kind, contains 672,909 multi-resolution image patches (0.25-10m) and 22.9million validated field instances. The DelAny model delivers state-of-the-art accuracy with over 100% higher mAP and 400x faster inference than SAM2. DelAny demonstrates strong zero-shot generalization and supports national-scale applications: using Sentinel 2 data for 2024, DelAnyFlow generated a complete field boundary layer for Ukraine (603,000km2) in under six hours on a single workstation. DelAnyFlow outputs significantly improve boundary completeness relative to operational products from Sinergise Solutions and NASA Harvest, particularly in smallholder and fragmented systems (0.25-1ha). For Ukraine, DelAnyFlow delineated 3.75M fields at 5m and 5.15M at 2.5m, compared to 2.66M detected by Sinergise Solutions and 1.69M by NASA Harvest. This work delivers a scalable, cost-effective methodology for field delineation in regions lacking digital cadastral data. A project landing page with links to model weights, code, national-scale vector outputs, and dataset is available at https://lavreniuk.github.io/Delineate-Anything/.

  • 7 authors
·
Nov 17, 2025

Sylber: Syllabic Embedding Representation of Speech from Raw Audio

Syllables are compositional units of spoken language that play a crucial role in human speech perception and production. However, current neural speech representations lack structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised model that regresses features on syllabic segments distilled from a teacher model which is an exponential moving average of the model in training. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited for lexical and syntactic understanding. We also train token-to-speech generative models with our syllabic units and show that fully intelligible speech can be reconstructed from these tokens. Lastly, we observe that categorical perception, a linguistic phenomenon of speech perception, emerges naturally in our model, making the embedding space more categorical and sparse than previous self-supervised learning approaches. Together, we present a novel self-supervised approach for representing speech as syllables, with significant potential for efficient speech tokenization and spoken language modeling.

  • 7 authors
·
Oct 9, 2024

UniPose: Detecting Any Keypoints

This work proposes a unified framework called UniPose to detect keypoints of any articulated (e.g., human and animal), rigid, and soft objects via visual or textual prompts for fine-grained vision understanding and manipulation. Keypoint is a structure-aware, pixel-level, and compact representation of any object, especially articulated objects. Existing fine-grained promptable tasks mainly focus on object instance detection and segmentation but often fail to identify fine-grained granularity and structured information of image and instance, such as eyes, leg, paw, etc. Meanwhile, prompt-based keypoint detection is still under-explored. To bridge the gap, we make the first attempt to develop an end-to-end prompt-based keypoint detection framework called UniPose to detect keypoints of any objects. As keypoint detection tasks are unified in this framework, we can leverage 13 keypoint detection datasets with 338 keypoints across 1,237 categories over 400K instances to train a generic keypoint detection model. UniPose can effectively align text-to-keypoint and image-to-keypoint due to the mutual enhancement of textual and visual prompts based on the cross-modality contrastive learning optimization objectives. Our experimental results show that UniPose has strong fine-grained localization and generalization abilities across image styles, categories, and poses. Based on UniPose as a generalist keypoint detector, we hope it could serve fine-grained visual perception, understanding, and generation.

  • 4 authors
·
Oct 12, 2023

Polyline Path Masked Attention for Vision Transformer

Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.

  • 6 authors
·
Jun 18, 2025

LEMUR Neural Network Dataset: Towards Seamless AutoML

Neural networks are fundamental in artificial intelligence, driving progress in computer vision and natural language processing. High-quality datasets are crucial for their development, and there is growing interest in datasets composed of neural networks themselves to support benchmarking, automated machine learning (AutoML), and model analysis. We introduce LEMUR, an open source dataset of neural network models with well-structured code for diverse architectures across tasks such as object detection, image classification, segmentation, and natural language processing. LEMUR is primarily designed to provide a rich source of structured model representations and associated performance data, enabling the fine-tuning of large language models for AutoML applications. Leveraging Python and PyTorch, LEMUR enables seamless extension to new datasets and models while maintaining consistency. It integrates an Optuna-powered framework for evaluation, hyperparameter optimization, statistical analysis, and graphical insights. LEMUR VR extension enables the seamless deployment of models in virtual reality, optimizing their performance on resource-constrained devices. Providing tools for model evaluation, preprocessing, and database management, LEMUR supports researchers and practitioners in developing, testing, and analyzing neural networks. It offers an API that delivers comprehensive information about neural network models and their complete performance statistics with a single request, which can be used in experiments with code-generating large language models. The LEMUR and its plugins are accessible as open source projects under the MIT license at https://github.com/ABrain-One/nn-dataset, https://github.com/ABrain-One/nn-plots and https://github.com/ABrain-One/nn-vr.

  • 10 authors
·
Apr 14, 2025

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box abilities to accept diverse inputs and perform diverse tasks are limited by the (usually rather small) number of modalities and tasks they are trained on. In this paper, we expand upon the capabilities of them by training a single model on tens of highly diverse modalities and by performing co-training on large-scale multimodal datasets and text corpora. This includes training on several semantic and geometric modalities, feature maps from recent state of the art models like DINOv2 and ImageBind, pseudo labels of specialist models like SAM and 4DHumans, and a range of new modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata or color palettes. A crucial step in this process is performing discrete tokenization on various modalities, whether they are image-like, neural network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text. Through this, we expand on the out-of-the-box capabilities of multimodal models and specifically show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance. This enables more fine-grained and controllable multimodal generation capabilities and allows us to study the distillation of models trained on diverse data and objectives into a unified model. We successfully scale the training to a three billion parameter model using tens of modalities and different datasets. The resulting models and training code are open sourced at 4m.epfl.ch.

  • 9 authors
·
Jun 13, 2024 2

A Data-Efficient Pan-Tumor Foundation Model for Oncology CT Interpretation

Artificial intelligence-assisted imaging analysis has made substantial strides in tumor diagnosis and management. Here we present PASTA, a pan-tumor CT foundation model that achieves state-of-the-art performance on 45 of 46 representative oncology tasks -- including lesion segmentation, tumor detection in plain CT, tumor staging, survival prediction, structured report generation, and cross-modality transfer learning, significantly outperforming the second-best models on 35 tasks. This remarkable advancement is driven by our development of PASTA-Gen, an innovative synthetic tumor generation framework that produces a comprehensive dataset of 30,000 CT scans with pixel-level annotated lesions and paired structured reports, encompassing malignancies across ten organs and five benign lesion types. By leveraging this rich, high-quality synthetic data, we overcome a longstanding bottleneck in the development of CT foundation models -- specifically, the scarcity of publicly available, high-quality annotated datasets due to privacy constraints and the substantial labor required for scaling precise data annotation. Encouragingly, PASTA demonstrates exceptional data efficiency with promising practical value, markedly improving performance on various tasks with only a small amount of real-world data. The open release of both the synthetic dataset and PASTA foundation model effectively addresses the challenge of data scarcity, thereby advancing oncological research and clinical translation.

  • 16 authors
·
Feb 10, 2025

UU-Mamba: Uncertainty-aware U-Mamba for Cardiovascular Segmentation

Building on the success of deep learning models in cardiovascular structure segmentation, increasing attention has been focused on improving generalization and robustness, particularly in small, annotated datasets. Despite recent advancements, current approaches often face challenges such as overfitting and accuracy limitations, largely due to their reliance on large datasets and narrow optimization techniques. This paper introduces the UU-Mamba model, an extension of the U-Mamba architecture, designed to address these challenges in both cardiac and vascular segmentation. By incorporating Sharpness-Aware Minimization (SAM), the model enhances generalization by targeting flatter minima in the loss landscape. Additionally, we propose an uncertainty-aware loss function that combines region-based, distribution-based, and pixel-based components to improve segmentation accuracy by capturing both local and global features. While the UU-Mamba model has already demonstrated great performance, further testing is required to fully assess its generalization and robustness. We expand our evaluation by conducting new trials on the ImageCAS (coronary artery) and Aorta (aortic branches and zones) datasets, which present more complex segmentation challenges than the ACDC dataset (left and right ventricles) used in our previous work, showcasing the model's adaptability and resilience. We confirm UU-Mamba's superior performance over leading models such as TransUNet, Swin-Unet, nnUNet, and nnFormer. Moreover, we provide a more comprehensive evaluation of the model's robustness and segmentation accuracy, as demonstrated by extensive experiments.

  • 8 authors
·
Sep 21, 2024

GraphEcho: Graph-Driven Unsupervised Domain Adaptation for Echocardiogram Video Segmentation

Echocardiogram video segmentation plays an important role in cardiac disease diagnosis. This paper studies the unsupervised domain adaption (UDA) for echocardiogram video segmentation, where the goal is to generalize the model trained on the source domain to other unlabelled target domains. Existing UDA segmentation methods are not suitable for this task because they do not model local information and the cyclical consistency of heartbeat. In this paper, we introduce a newly collected CardiacUDA dataset and a novel GraphEcho method for cardiac structure segmentation. Our GraphEcho comprises two innovative modules, the Spatial-wise Cross-domain Graph Matching (SCGM) and the Temporal Cycle Consistency (TCC) module, which utilize prior knowledge of echocardiogram videos, i.e., consistent cardiac structure across patients and centers and the heartbeat cyclical consistency, respectively. These two modules can better align global and local features from source and target domains, improving UDA segmentation results. Experimental results showed that our GraphEcho outperforms existing state-of-the-art UDA segmentation methods. Our collected dataset and code will be publicly released upon acceptance. This work will lay a new and solid cornerstone for cardiac structure segmentation from echocardiogram videos. Code and dataset are available at: https://github.com/xmed-lab/GraphEcho

  • 5 authors
·
Sep 20, 2023

PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation

Comprehensive modeling of the surrounding 3D world is key to the success of autonomous driving. However, existing perception tasks like object detection, road structure segmentation, depth & elevation estimation, and open-set object localization each only focus on a small facet of the holistic 3D scene understanding task. This divide-and-conquer strategy simplifies the algorithm development procedure at the cost of losing an end-to-end unified solution to the problem. In this work, we address this limitation by studying camera-based 3D panoptic segmentation, aiming to achieve a unified occupancy representation for camera-only 3D scene understanding. To achieve this, we introduce a novel method called PanoOcc, which utilizes voxel queries to aggregate spatiotemporal information from multi-frame and multi-view images in a coarse-to-fine scheme, integrating feature learning and scene representation into a unified occupancy representation. We have conducted extensive ablation studies to verify the effectiveness and efficiency of the proposed method. Our approach achieves new state-of-the-art results for camera-based semantic segmentation and panoptic segmentation on the nuScenes dataset. Furthermore, our method can be easily extended to dense occupancy prediction and has shown promising performance on the Occ3D benchmark. The code will be released at https://github.com/Robertwyq/PanoOcc.

  • 5 authors
·
Jun 16, 2023

Joint Self-Supervised Image-Volume Representation Learning with Intra-Inter Contrastive Clustering

Collecting large-scale medical datasets with fully annotated samples for training of deep networks is prohibitively expensive, especially for 3D volume data. Recent breakthroughs in self-supervised learning (SSL) offer the ability to overcome the lack of labeled training samples by learning feature representations from unlabeled data. However, most current SSL techniques in the medical field have been designed for either 2D images or 3D volumes. In practice, this restricts the capability to fully leverage unlabeled data from numerous sources, which may include both 2D and 3D data. Additionally, the use of these pre-trained networks is constrained to downstream tasks with compatible data dimensions. In this paper, we propose a novel framework for unsupervised joint learning on 2D and 3D data modalities. Given a set of 2D images or 2D slices extracted from 3D volumes, we construct an SSL task based on a 2D contrastive clustering problem for distinct classes. The 3D volumes are exploited by computing vectored embedding at each slice and then assembling a holistic feature through deformable self-attention mechanisms in Transformer, allowing incorporating long-range dependencies between slices inside 3D volumes. These holistic features are further utilized to define a novel 3D clustering agreement-based SSL task and masking embedding prediction inspired by pre-trained language models. Experiments on downstream tasks, such as 3D brain segmentation, lung nodule detection, 3D heart structures segmentation, and abnormal chest X-ray detection, demonstrate the effectiveness of our joint 2D and 3D SSL approach. We improve plain 2D Deep-ClusterV2 and SwAV by a significant margin and also surpass various modern 2D and 3D SSL approaches.

  • 10 authors
·
Dec 4, 2022

BPKD: Boundary Privileged Knowledge Distillation For Semantic Segmentation

Current knowledge distillation approaches in semantic segmentation tend to adopt a holistic approach that treats all spatial locations equally. However, for dense prediction, students' predictions on edge regions are highly uncertain due to contextual information leakage, requiring higher spatial sensitivity knowledge than the body regions. To address this challenge, this paper proposes a novel approach called boundary-privileged knowledge distillation (BPKD). BPKD distills the knowledge of the teacher model's body and edges separately to the compact student model. Specifically, we employ two distinct loss functions: (i) edge loss, which aims to distinguish between ambiguous classes at the pixel level in edge regions; (ii) body loss, which utilizes shape constraints and selectively attends to the inner-semantic regions. Our experiments demonstrate that the proposed BPKD method provides extensive refinements and aggregation for edge and body regions. Additionally, the method achieves state-of-the-art distillation performance for semantic segmentation on three popular benchmark datasets, highlighting its effectiveness and generalization ability. BPKD shows consistent improvements across a diverse array of lightweight segmentation structures, including both CNNs and transformers, underscoring its architecture-agnostic adaptability. The code is available at https://github.com/AkideLiu/BPKD.

  • 6 authors
·
Jun 13, 2023

View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields

Large-scale vision foundation models such as Segment Anything (SAM) demonstrate impressive performance in zero-shot image segmentation at multiple levels of granularity. However, these zero-shot predictions are rarely 3D-consistent. As the camera viewpoint changes in a scene, so do the segmentation predictions, as well as the characterizations of "coarse" or "fine" granularity. In this work, we address the challenging task of lifting multi-granular and view-inconsistent image segmentations into a hierarchical and 3D-consistent representation. We learn a novel feature field within a Neural Radiance Field (NeRF) representing a 3D scene, whose segmentation structure can be revealed at different scales by simply using different thresholds on feature distance. Our key idea is to learn an ultrametric feature space, which unlike a Euclidean space, exhibits transitivity in distance-based grouping, naturally leading to a hierarchical clustering. Put together, our method takes view-inconsistent multi-granularity 2D segmentations as input and produces a hierarchy of 3D-consistent segmentations as output. We evaluate our method and several baselines on synthetic datasets with multi-view images and multi-granular segmentation, showcasing improved accuracy and viewpoint-consistency. We additionally provide qualitative examples of our model's 3D hierarchical segmentations in real world scenes. The code and dataset are available at https://github.com/hardyho/ultrametric_feature_fields

  • 4 authors
·
May 30, 2024

You Only Look at Once for Real-time and Generic Multi-Task

High precision, lightweight, and real-time responsiveness are three essential requirements for implementing autonomous driving. In this study, we incorporate A-YOLOM, an adaptive, real-time, and lightweight multi-task model designed to concurrently address object detection, drivable area segmentation, and lane line segmentation tasks. Specifically, we develop an end-to-end multi-task model with a unified and streamlined segmentation structure. We introduce a learnable parameter that adaptively concatenates features between necks and backbone in segmentation tasks, using the same loss function for all segmentation tasks. This eliminates the need for customizations and enhances the model's generalization capabilities. We also introduce a segmentation head composed only of a series of convolutional layers, which reduces the number of parameters and inference time. We achieve competitive results on the BDD100k dataset, particularly in visualization outcomes. The performance results show a mAP50 of 81.1% for object detection, a mIoU of 91.0% for drivable area segmentation, and an IoU of 28.8% for lane line segmentation. Additionally, we introduce real-world scenarios to evaluate our model's performance in a real scene, which significantly outperforms competitors. This demonstrates that our model not only exhibits competitive performance but is also more flexible and faster than existing multi-task models. The source codes and pre-trained models are released at https://github.com/JiayuanWang-JW/YOLOv8-multi-task

  • 3 authors
·
Oct 2, 2023

Dual Structure-Aware Image Filterings for Semi-supervised Medical Image Segmentation

Semi-supervised image segmentation has attracted great attention recently. The key is how to leverage unlabeled images in the training process. Most methods maintain consistent predictions of the unlabeled images under variations (e.g., adding noise/perturbations, or creating alternative versions) in the image and/or model level. In most image-level variation, medical images often have prior structure information, which has not been well explored. In this paper, we propose novel dual structure-aware image filterings (DSAIF) as the image-level variations for semi-supervised medical image segmentation. Motivated by connected filtering that simplifies image via filtering in structure-aware tree-based image representation, we resort to the dual contrast invariant Max-tree and Min-tree representation. Specifically, we propose a novel connected filtering that removes topologically equivalent nodes (i.e. connected components) having no siblings in the Max/Min-tree. This results in two filtered images preserving topologically critical structure. Applying the proposed DSAIF to mutually supervised networks decreases the consensus of their erroneous predictions on unlabeled images. This helps to alleviate the confirmation bias issue of overfitting to noisy pseudo labels of unlabeled images, and thus effectively improves the segmentation performance. Extensive experimental results on three benchmark datasets demonstrate that the proposed method significantly/consistently outperforms some state-of-the-art methods. The source codes will be publicly available.

  • 7 authors
·
Dec 12, 2023

TotalSegmentator: robust segmentation of 104 anatomical structures in CT images

We present a deep learning segmentation model that can automatically and robustly segment all major anatomical structures in body CT images. In this retrospective study, 1204 CT examinations (from the years 2012, 2016, and 2020) were used to segment 104 anatomical structures (27 organs, 59 bones, 10 muscles, 8 vessels) relevant for use cases such as organ volumetry, disease characterization, and surgical or radiotherapy planning. The CT images were randomly sampled from routine clinical studies and thus represent a real-world dataset (different ages, pathologies, scanners, body parts, sequences, and sites). The authors trained an nnU-Net segmentation algorithm on this dataset and calculated Dice similarity coefficients (Dice) to evaluate the model's performance. The trained algorithm was applied to a second dataset of 4004 whole-body CT examinations to investigate age dependent volume and attenuation changes. The proposed model showed a high Dice score (0.943) on the test set, which included a wide range of clinical data with major pathologies. The model significantly outperformed another publicly available segmentation model on a separate dataset (Dice score, 0.932 versus 0.871, respectively). The aging study demonstrated significant correlations between age and volume and mean attenuation for a variety of organ groups (e.g., age and aortic volume; age and mean attenuation of the autochthonous dorsal musculature). The developed model enables robust and accurate segmentation of 104 anatomical structures. The annotated dataset (https://doi.org/10.5281/zenodo.6802613) and toolkit (https://www.github.com/wasserth/TotalSegmentator) are publicly available.

  • 12 authors
·
Aug 11, 2022

FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures

Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

  • 9 authors
·
Mar 29, 2024

TotalSegmentator MRI: Robust Sequence-independent Segmentation of Multiple Anatomic Structures in MRI

Since the introduction of TotalSegmentator CT, there is demand for a similar robust automated MRI segmentation tool that can be applied across all MRI sequences and anatomic structures. In this retrospective study, a nnU-Net model (TotalSegmentator) was trained on MRI and CT examinations to segment 80 anatomic structures relevant for use cases such as organ volumetry, disease characterization, surgical planning and opportunistic screening. Examinations were randomly sampled from routine clinical studies to represent real-world examples. Dice scores were calculated between the predicted segmentations and expert radiologist reference standard segmentations to evaluate model performance on an internal test set, two external test sets and against two publicly available models, and TotalSegmentator CT. The model was applied to an internal dataset containing abdominal MRIs to investigate age-dependent volume changes. A total of 1143 examinations (616 MRIs, 527 CTs) (median age 61 years, IQR 50-72) were split into training (n=1088, CT and MRI) and an internal test set (n=55; only MRI), two external test sets (AMOS, n=20; CHAOS, n=20; only MRI), and an internal aging-study dataset of 8672 abdominal MRIs (median age 59 years, IQR 45-70) were included. The model showed a Dice Score of 0.839 on the internal test set and outperformed two other models (Dice Score, 0.862 versus 0.759; and 0.838 versus 0.560; p<.001 for both). The proposed open-source, easy-to-use model allows for automatic, robust segmentation of 80 structures, extending the capabilities of TotalSegmentator to MRIs of any sequence. The ready-to-use online tool is available at https://totalsegmentator.com, the model at https://github.com/wasserth/TotalSegmentator, and the dataset at https://zenodo.org/records/14710732.

  • 19 authors
·
May 29, 2024

From Density to Geometry: YOLOv8 Instance Segmentation for Reverse Engineering of Optimized Structures

This paper introduces YOLOv8-TO, a novel approach for reverse engineering of topology-optimized structures into interpretable geometric parameters using the YOLOv8 instance segmentation model. Density-based topology optimization methods require post-processing to convert the optimal density distribution into a parametric representation for design exploration and integration with CAD tools. Traditional methods such as skeletonization struggle with complex geometries and require manual intervention. YOLOv8-TO addresses these challenges by training a custom YOLOv8 model to automatically detect and reconstruct structural components from binary density distributions. The model is trained on a diverse dataset of both optimized and random structures generated using the Moving Morphable Components method. A custom reconstruction loss function based on the dice coefficient of the predicted geometry is used to train the new regression head of the model via self-supervised learning. The method is evaluated on test sets generated from different topology optimization methods, including out-of-distribution samples, and compared against a skeletonization approach. Results show that YOLOv8-TO significantly outperforms skeletonization in reconstructing visually and structurally similar designs. The method showcases an average improvement of 13.84% in the Dice coefficient, with peak enhancements reaching 20.78%. The method demonstrates good generalization to complex geometries and fast inference times, making it suitable for integration into design workflows using regular workstations. Limitations include the sensitivity to non-max suppression thresholds. YOLOv8-TO represents a significant advancement in topology optimization post-processing, enabling efficient and accurate reverse engineering of optimized structures for design exploration and manufacturing.

  • 4 authors
·
Apr 29, 2024

SPOCKMIP: Segmentation of Vessels in MRAs with Enhanced Continuity using Maximum Intensity Projection as Loss

Identification of vessel structures of different sizes in biomedical images is crucial in the diagnosis of many neurodegenerative diseases. However, the sparsity of good-quality annotations of such images makes the task of vessel segmentation challenging. Deep learning offers an efficient way to segment vessels of different sizes by learning their high-level feature representations and the spatial continuity of such features across dimensions. Semi-supervised patch-based approaches have been effective in identifying small vessels of one to two voxels in diameter. This study focuses on improving the segmentation quality by considering the spatial correlation of the features using the Maximum Intensity Projection~(MIP) as an additional loss criterion. Two methods are proposed with the incorporation of MIPs of label segmentation on the single~(z-axis) and multiple perceivable axes of the 3D volume. The proposed MIP-based methods produce segmentations with improved vessel continuity, which is evident in visual examinations of ROIs. Patch-based training is improved by introducing an additional loss term, MIP loss, to penalise the predicted discontinuity of vessels. A training set of 14 volumes is selected from the StudyForrest dataset comprising of 18 7-Tesla 3D Time-of-Flight~(ToF) Magnetic Resonance Angiography (MRA) images. The generalisation performance of the method is evaluated using the other unseen volumes in the dataset. It is observed that the proposed method with multi-axes MIP loss produces better quality segmentations with a median Dice of 80.245 pm 0.129. Also, the method with single-axis MIP loss produces segmentations with a median Dice of 79.749 pm 0.109. Furthermore, a visual comparison of the ROIs in the predicted segmentation reveals a significant improvement in the continuity of the vessels when MIP loss is incorporated into training.

  • 8 authors
·
Jul 11, 2024

MedCLIP-SAMv2: Towards Universal Text-Driven Medical Image Segmentation

Segmentation of anatomical structures and pathological regions in medical images is essential for modern clinical diagnosis, disease research, and treatment planning. While significant advancements have been made in deep learning-based segmentation techniques, many of these methods still suffer from limitations in data efficiency, generalizability, and interactivity. As a result, developing precise segmentation methods that require fewer labeled datasets remains a critical challenge in medical image analysis. Recently, the introduction of foundation models like CLIP and Segment-Anything-Model (SAM), with robust cross-domain representations, has paved the way for interactive and universal image segmentation. However, further exploration of these models for data-efficient segmentation in medical imaging is still needed and highly relevant. In this paper, we introduce MedCLIP-SAMv2, a novel framework that integrates the CLIP and SAM models to perform segmentation on clinical scans using text prompts, in both zero-shot and weakly supervised settings. Our approach includes fine-tuning the BiomedCLIP model with a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss, and leveraging the Multi-modal Information Bottleneck (M2IB) to create visual prompts for generating segmentation masks from SAM in the zero-shot setting. We also investigate using zero-shot segmentation labels within a weakly supervised paradigm to enhance segmentation quality further. Extensive testing across four diverse segmentation tasks and medical imaging modalities (breast tumor ultrasound, brain tumor MRI, lung X-ray, and lung CT) demonstrates the high accuracy of our proposed framework. Our code is available at https://github.com/HealthX-Lab/MedCLIP-SAMv2.

  • 4 authors
·
Sep 28, 2024

Semantic Amodal Segmentation

Common visual recognition tasks such as classification, object detection, and semantic segmentation are rapidly reaching maturity, and given the recent rate of progress, it is not unreasonable to conjecture that techniques for many of these problems will approach human levels of performance in the next few years. In this paper we look to the future: what is the next frontier in visual recognition? We offer one possible answer to this question. We propose a detailed image annotation that captures information beyond the visible pixels and requires complex reasoning about full scene structure. Specifically, we create an amodal segmentation of each image: the full extent of each region is marked, not just the visible pixels. Annotators outline and name all salient regions in the image and specify a partial depth order. The result is a rich scene structure, including visible and occluded portions of each region, figure-ground edge information, semantic labels, and object overlap. We create two datasets for semantic amodal segmentation. First, we label 500 images in the BSDS dataset with multiple annotators per image, allowing us to study the statistics of human annotations. We show that the proposed full scene annotation is surprisingly consistent between annotators, including for regions and edges. Second, we annotate 5000 images from COCO. This larger dataset allows us to explore a number of algorithmic ideas for amodal segmentation and depth ordering. We introduce novel metrics for these tasks, and along with our strong baselines, define concrete new challenges for the community.

  • 4 authors
·
Sep 3, 2015

Full-scale Representation Guided Network for Retinal Vessel Segmentation

The U-Net architecture and its variants have remained state-of-the-art (SOTA) for retinal vessel segmentation over the past decade. In this study, we introduce a Full-Scale Guided Network (FSG-Net), where a novel feature representation module using modernized convolution blocks effectively captures full-scale structural information, while a guided convolution block subsequently refines this information. Specifically, we introduce an attention-guided filter within the guided convolution block, leveraging its similarity to unsharp masking to enhance fine vascular structures. Passing full-scale information to the attention block facilitates the generation of more contextually relevant attention maps, which are then passed to the attention-guided filter, providing further refinement to the segmentation performance. The structure preceding the guided convolution block can be replaced by any U-Net variant, ensuring flexibility and scalability across various segmentation tasks. For a fair comparison, we re-implemented recent studies available in public repositories to evaluate their scalability and reproducibility. Our experiments demonstrate that, despite its compact architecture, FSG-Net delivers performance competitive with SOTA methods across multiple public datasets. Ablation studies further demonstrate that each proposed component meaningfully contributes to this competitive performance. Our code is available on https://github.com/ZombaSY/FSG-Net-pytorch.

  • 3 authors
·
Jan 31, 2025

MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image Segmentation

Medical image segmentation of anatomical structures and pathology is crucial in modern clinical diagnosis, disease study, and treatment planning. To date, great progress has been made in deep learning-based segmentation techniques, but most methods still lack data efficiency, generalizability, and interactability. Consequently, the development of new, precise segmentation methods that demand fewer labeled datasets is of utmost importance in medical image analysis. Recently, the emergence of foundation models, such as CLIP and Segment-Anything-Model (SAM), with comprehensive cross-domain representation opened the door for interactive and universal image segmentation. However, exploration of these models for data-efficient medical image segmentation is still limited, but is highly necessary. In this paper, we propose a novel framework, called MedCLIP-SAM that combines CLIP and SAM models to generate segmentation of clinical scans using text prompts in both zero-shot and weakly supervised settings. To achieve this, we employed a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss to fine-tune the BiomedCLIP model and the recent gScoreCAM to generate prompts to obtain segmentation masks from SAM in a zero-shot setting. Additionally, we explored the use of zero-shot segmentation labels in a weakly supervised paradigm to improve the segmentation quality further. By extensively testing three diverse segmentation tasks and medical image modalities (breast tumor ultrasound, brain tumor MRI, and lung X-ray), our proposed framework has demonstrated excellent accuracy. Code is available at https://github.com/HealthX-Lab/MedCLIP-SAM.

  • 4 authors
·
Mar 29, 2024

Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models

Medical image segmentation allows quantifying target structure size and shape, aiding in disease diagnosis, prognosis, surgery planning, and comprehension.Building upon recent advancements in foundation Vision-Language Models (VLMs) from natural image-text pairs, several studies have proposed adapting them to Vision-Language Segmentation Models (VLSMs) that allow using language text as an additional input to segmentation models. Introducing auxiliary information via text with human-in-the-loop prompting during inference opens up unique opportunities, such as open vocabulary segmentation and potentially more robust segmentation models against out-of-distribution data. Although transfer learning from natural to medical images has been explored for image-only segmentation models, the joint representation of vision-language in segmentation problems remains underexplored. This study introduces the first systematic study on transferring VLSMs to 2D medical images, using carefully curated 11 datasets encompassing diverse modalities and insightful language prompts and experiments. Our findings demonstrate that although VLSMs show competitive performance compared to image-only models for segmentation after finetuning in limited medical image datasets, not all VLSMs utilize the additional information from language prompts, with image features playing a dominant role. While VLSMs exhibit enhanced performance in handling pooled datasets with diverse modalities and show potential robustness to domain shifts compared to conventional segmentation models, our results suggest that novel approaches are required to enable VLSMs to leverage the various auxiliary information available through language prompts. The code and datasets are available at https://github.com/naamiinepal/medvlsm.

  • 6 authors
·
Aug 15, 2023

CADS: A Comprehensive Anatomical Dataset and Segmentation for Whole-Body Anatomy in Computed Tomography

Accurate delineation of anatomical structures in volumetric CT scans is crucial for diagnosis and treatment planning. While AI has advanced automated segmentation, current approaches typically target individual structures, creating a fragmented landscape of incompatible models with varying performance and disparate evaluation protocols. Foundational segmentation models address these limitations by providing a holistic anatomical view through a single model. Yet, robust clinical deployment demands comprehensive training data, which is lacking in existing whole-body approaches, both in terms of data heterogeneity and, more importantly, anatomical coverage. In this work, rather than pursuing incremental optimizations in model architecture, we present CADS, an open-source framework that prioritizes the systematic integration, standardization, and labeling of heterogeneous data sources for whole-body CT segmentation. At its core is a large-scale dataset of 22,022 CT volumes with complete annotations for 167 anatomical structures, representing a significant advancement in both scale and coverage, with 18 times more scans than existing collections and 60% more distinct anatomical targets. Building on this diverse dataset, we develop the CADS-model using established architectures for accessible and automated full-body CT segmentation. Through comprehensive evaluation across 18 public datasets and an independent real-world hospital cohort, we demonstrate advantages over SoTA approaches. Notably, thorough testing of the model's performance in segmentation tasks from radiation oncology validates its direct utility for clinical interventions. By making our large-scale dataset, our segmentation models, and our clinical software tool publicly available, we aim to advance robust AI solutions in radiology and make comprehensive anatomical analysis accessible to clinicians and researchers alike.

  • 33 authors
·
Jul 29, 2025

Complexity in Complexity: Understanding Visual Complexity Through Structure, Color, and Surprise

Understanding how humans perceive visual complexity is a key area of study in visual cognition. Previous approaches to modeling visual complexity assessments have often resulted in intricate, difficult-to-interpret algorithms that employ numerous features or sophisticated deep learning architectures. While these complex models achieve high performance on specific datasets, they often sacrifice interpretability, making it challenging to understand the factors driving human perception of complexity. Recently (Shen, et al. 2024) proposed an interpretable segmentation-based model that accurately predicted complexity across various datasets, supporting the idea that complexity can be explained simply. In this work, we investigate the failure of their model to capture structural, color and surprisal contributions to complexity. To this end, we propose Multi-Scale Sobel Gradient (MSG) which measures spatial intensity variations, Multi-Scale Unique Color (MUC) which quantifies colorfulness across multiple scales, and surprise scores generated using a Large Language Model. We test our features on existing benchmarks and a novel dataset (Surprising Visual Genome) containing surprising images from Visual Genome. Our experiments demonstrate that modeling complexity accurately is not as simple as previously thought, requiring additional perceptual and semantic factors to address dataset biases. Our model improves predictive performance while maintaining interpretability, offering deeper insights into how visual complexity is perceived and assessed. Our code, analysis and data are available at https://github.com/Complexity-Project/Complexity-in-Complexity.

  • 4 authors
·
Jan 27, 2025

Segmentation and Vascular Vectorization for Coronary Artery by Geometry-based Cascaded Neural Network

Segmentation of the coronary artery is an important task for the quantitative analysis of coronary computed tomography angiography (CCTA) images and is being stimulated by the field of deep learning. However, the complex structures with tiny and narrow branches of the coronary artery bring it a great challenge. Coupled with the medical image limitations of low resolution and poor contrast, fragmentations of segmented vessels frequently occur in the prediction. Therefore, a geometry-based cascaded segmentation method is proposed for the coronary artery, which has the following innovations: 1) Integrating geometric deformation networks, we design a cascaded network for segmenting the coronary artery and vectorizing results. The generated meshes of the coronary artery are continuous and accurate for twisted and sophisticated coronary artery structures, without fragmentations. 2) Different from mesh annotations generated by the traditional marching cube method from voxel-based labels, a finer vectorized mesh of the coronary artery is reconstructed with the regularized morphology. The novel mesh annotation benefits the geometry-based segmentation network, avoiding bifurcation adhesion and point cloud dispersion in intricate branches. 3) A dataset named CCA-200 is collected, consisting of 200 CCTA images with coronary artery disease. The ground truths of 200 cases are coronary internal diameter annotations by professional radiologists. Extensive experiments verify our method on our collected dataset CCA-200 and public ASOCA dataset, with a Dice of 0.778 on CCA-200 and 0.895 on ASOCA, showing superior results. Especially, our geometry-based model generates an accurate, intact and smooth coronary artery, devoid of any fragmentations of segmented vessels.

  • 6 authors
·
May 7, 2023

Token Contrast for Weakly-Supervised Semantic Segmentation

Weakly-Supervised Semantic Segmentation (WSSS) using image-level labels typically utilizes Class Activation Map (CAM) to generate the pseudo labels. Limited by the local structure perception of CNN, CAM usually cannot identify the integral object regions. Though the recent Vision Transformer (ViT) can remedy this flaw, we observe it also brings the over-smoothing issue, \ie, the final patch tokens incline to be uniform. In this work, we propose Token Contrast (ToCo) to address this issue and further explore the virtue of ViT for WSSS. Firstly, motivated by the observation that intermediate layers in ViT can still retain semantic diversity, we designed a Patch Token Contrast module (PTC). PTC supervises the final patch tokens with the pseudo token relations derived from intermediate layers, allowing them to align the semantic regions and thus yield more accurate CAM. Secondly, to further differentiate the low-confidence regions in CAM, we devised a Class Token Contrast module (CTC) inspired by the fact that class tokens in ViT can capture high-level semantics. CTC facilitates the representation consistency between uncertain local regions and global objects by contrasting their class tokens. Experiments on the PASCAL VOC and MS COCO datasets show the proposed ToCo can remarkably surpass other single-stage competitors and achieve comparable performance with state-of-the-art multi-stage methods. Code is available at https://github.com/rulixiang/ToCo.

  • 4 authors
·
Mar 2, 2023

NuClick: A Deep Learning Framework for Interactive Segmentation of Microscopy Images

Object segmentation is an important step in the workflow of computational pathology. Deep learning based models generally require large amount of labeled data for precise and reliable prediction. However, collecting labeled data is expensive because it often requires expert knowledge, particularly in medical imaging domain where labels are the result of a time-consuming analysis made by one or more human experts. As nuclei, cells and glands are fundamental objects for downstream analysis in computational pathology/cytology, in this paper we propose a simple CNN-based approach to speed up collecting annotations for these objects which requires minimum interaction from the annotator. We show that for nuclei and cells in histology and cytology images, one click inside each object is enough for NuClick to yield a precise annotation. For multicellular structures such as glands, we propose a novel approach to provide the NuClick with a squiggle as a guiding signal, enabling it to segment the glandular boundaries. These supervisory signals are fed to the network as auxiliary inputs along with RGB channels. With detailed experiments, we show that NuClick is adaptable to the object scale, robust against variations in the user input, adaptable to new domains, and delivers reliable annotations. An instance segmentation model trained on masks generated by NuClick achieved the first rank in LYON19 challenge. As exemplar outputs of our framework, we are releasing two datasets: 1) a dataset of lymphocyte annotations within IHC images, and 2) a dataset of segmented WBCs in blood smear images.

  • 4 authors
·
May 29, 2020

Not All Pixels Are Equal: Learning Pixel Hardness for Semantic Segmentation

Semantic segmentation has recently witnessed great progress. Despite the impressive overall results, the segmentation performance in some hard areas (e.g., small objects or thin parts) is still not promising. A straightforward solution is hard sample mining, which is widely used in object detection. Yet, most existing hard pixel mining strategies for semantic segmentation often rely on pixel's loss value, which tends to decrease during training. Intuitively, the pixel hardness for segmentation mainly depends on image structure and is expected to be stable. In this paper, we propose to learn pixel hardness for semantic segmentation, leveraging hardness information contained in global and historical loss values. More precisely, we add a gradient-independent branch for learning a hardness level (HL) map by maximizing hardness-weighted segmentation loss, which is minimized for the segmentation head. This encourages large hardness values in difficult areas, leading to appropriate and stable HL map. Despite its simplicity, the proposed method can be applied to most segmentation methods with no and marginal extra cost during inference and training, respectively. Without bells and whistles, the proposed method achieves consistent/significant improvement (1.37% mIoU on average) over most popular semantic segmentation methods on Cityscapes dataset, and demonstrates good generalization ability across domains. The source codes are available at https://github.com/Menoly-xin/Hardness-Level-Learning .

  • 5 authors
·
May 15, 2023

CrackNex: a Few-shot Low-light Crack Segmentation Model Based on Retinex Theory for UAV Inspections

Routine visual inspections of concrete structures are imperative for upholding the safety and integrity of critical infrastructure. Such visual inspections sometimes happen under low-light conditions, e.g., checking for bridge health. Crack segmentation under such conditions is challenging due to the poor contrast between cracks and their surroundings. However, most deep learning methods are designed for well-illuminated crack images and hence their performance drops dramatically in low-light scenes. In addition, conventional approaches require many annotated low-light crack images which is time-consuming. In this paper, we address these challenges by proposing CrackNex, a framework that utilizes reflectance information based on Retinex Theory to help the model learn a unified illumination-invariant representation. Furthermore, we utilize few-shot segmentation to solve the inefficient training data problem. In CrackNex, both a support prototype and a reflectance prototype are extracted from the support set. Then, a prototype fusion module is designed to integrate the features from both prototypes. CrackNex outperforms the SOTA methods on multiple datasets. Additionally, we present the first benchmark dataset, LCSD, for low-light crack segmentation. LCSD consists of 102 well-illuminated crack images and 41 low-light crack images. The dataset and code are available at https://github.com/zy1296/CrackNex.

  • 4 authors
·
Mar 5, 2024

FreeCOS: Self-Supervised Learning from Fractals and Unlabeled Images for Curvilinear Object Segmentation

Curvilinear object segmentation is critical for many applications. However, manually annotating curvilinear objects is very time-consuming and error-prone, yielding insufficiently available annotated datasets for existing supervised methods and domain adaptation methods. This paper proposes a self-supervised curvilinear object segmentation method that learns robust and distinctive features from fractals and unlabeled images (FreeCOS). The key contributions include a novel Fractal-FDA synthesis (FFS) module and a geometric information alignment (GIA) approach. FFS generates curvilinear structures based on the parametric Fractal L-system and integrates the generated structures into unlabeled images to obtain synthetic training images via Fourier Domain Adaptation. GIA reduces the intensity differences between the synthetic and unlabeled images by comparing the intensity order of a given pixel to the values of its nearby neighbors. Such image alignment can explicitly remove the dependency on absolute intensity values and enhance the inherent geometric characteristics which are common in both synthetic and real images. In addition, GIA aligns features of synthetic and real images via the prediction space adaptation loss (PSAL) and the curvilinear mask contrastive loss (CMCL). Extensive experimental results on four public datasets, i.e., XCAD, DRIVE, STARE and CrackTree demonstrate that our method outperforms the state-of-the-art unsupervised methods, self-supervised methods and traditional methods by a large margin. The source code of this work is available at https://github.com/TY-Shi/FreeCOS.

  • 4 authors
·
Jul 14, 2023

Mediastinal lymph nodes segmentation using 3D convolutional neural network ensembles and anatomical priors guiding

As lung cancer evolves, the presence of enlarged and potentially malignant lymph nodes must be assessed to properly estimate disease progression and select the best treatment strategy. Following the clinical guidelines, estimation of short-axis diameter and mediastinum station are paramount for correct diagnosis. A method for accurate and automatic segmentation is hence decisive for quantitatively describing lymph nodes. In this study, the use of 3D convolutional neural networks, either through slab-wise schemes or the leveraging of downsampled entire volumes, is investigated. Furthermore, the potential impact from simple ensemble strategies is considered. As lymph nodes have similar attenuation values to nearby anatomical structures, we suggest using the knowledge of other organs as prior information to guide the segmentation task. To assess the segmentation and instance detection performances, a 5-fold cross-validation strategy was followed over a dataset of 120 contrast-enhanced CT volumes. For the 1178 lymph nodes with a short-axis diameter geq10 mm, our best performing approach reached a patient-wise recall of 92%, a false positive per patient ratio of 5, and a segmentation overlap of 80.5%. The method performs similarly well across all stations. Fusing a slab-wise and a full volume approach within an ensemble scheme generated the best performances. The anatomical priors guiding strategy is promising, yet a larger set than four organs appears needed to generate an optimal benefit. A larger dataset is also mandatory, given the wide range of expressions a lymph node can exhibit (i.e., shape, location, and attenuation), and contrast uptake variations.

  • 5 authors
·
Feb 11, 2021