1 Introduction

In recent years, large language models (LLMs) (Chowdhery et al, 2023; Touvron et al, 2023; Zhang et al, 2022) have sparked a revolution in natural language processing (NLP). These foundational models exhibit remarkable transfer capabilities, extending far beyond their initial training objectives. LLMs showcase robust generalization abilities and excel in a multitude of open-world language tasks, including language comprehension, generation, interaction, and reasoning. Inspired by the success of LLMs, vision foundation models such as CLIP (Radford et al, 2021), DINOv2 (Oquab et al, 2023), BLIP (Li et al, 2022), and SAM (Kirillov et al, 2023) have also emerged. These models, once trained, can seamlessly apply their knowledge to various downstream tasks. This trend has further motivated researchers to explore approaches to open-world visual understanding.

Fig. 1
figure 1

Comparison of different open-world segmentation frameworks based on foundation models. From left to right: foundation model adaptation, task-specific foundation models trained from scratch, and training-free foundation models

Pioneering works (Liu et al, 2023a; Dai et al, 2023; Zhu et al, 2023a) have mainly focused on understanding images as a whole in the open world. Herein, we shift our viewpoint to open-world understanding at the object level, specifically the task of open-world segmentation (Qi et al, 2022). When approaching open-world segmentation tasks, there are three primary strategies for leveraging foundation models. The most widely studied approach (Liang et al, 2023; Qin et al, 2023; Ghiasi et al, 2022) is to take a vision foundation model such as CLIP or DINOv2 and pair it with a task-specific segmentation head or adapter to complete the open-world segmentation task. Such methods (Fig. 1a) often require fine-tuning or training the segmentation head or adapter. Beyond these methods that combine a foundation model with an adapter, some researchers have drawn on the successful experience in NLP and directly trained a foundation model for generic dense-prediction vision problems, as demonstrated in works like Painter (Wang et al, 2023a). Such models (Fig. 1b) can complete open-world segmentation simply with a task-specific prompt. Lately, the Segment Anything Model (SAM) (Kirillov et al, 2023) has attained remarkable zero-shot segmentation results. It presents researchers with the prospect of devising an alternative way to accomplish open-world segmentation without the need for training (Fig. 1c). For example, PerSAM (Zhang et al, 2023b) effectively transfers SAM to open-world object segmentation tasks in a training-free manner through the design of the cross-attention layer in SAM’s decoder, thereby tapping into the potential of vision foundation models to a significant extent. While these approaches have achieved excellent performance, incorporating more vision foundation models to improve the generalization capability and segmentation quality of open-world segmentation remains an avenue for further investigation.

Fig. 2
figure 2

Different prompt forms in existing open-world segmentation methods. The left shows the prompt of predefined textual descriptions or categories. The middle shows the prompt form used in existing one-shot object segmentation works (Liu et al, 2023b; Zhang et al, 2023b). The right shows the prompt form used in this paper, which only uses one image containing a salient object with specific visual concepts

In addition to the architectural design of the foundation model for open-world segmentation tasks, another critical aspect is the development of flexible and user-friendly prompts. This ensures that the model accurately grasps the visual concepts users desire. As shown in Fig. 2a, b, existing works typically rely on predefined textual descriptions or high-quality annotations for a given image as the segmentation prompt, which lacks flexibility. Yet, in open-world scenarios, we not only expect the network to perform well on various open-set datasets but also need it to handle object segmentation tasks with more versatile prompt information. Therefore, a fundamental question emerges: could we prompt foundation models, such as SAM, to segment specific objects based on a user-given image prompt that contains objects with a clear subjective concept?

Motivated by this question, we present a novel open-world segmentation framework, which utilizes image prompts to instruct the training-free vision foundational models to segment open-world objects. The proposed Image Prompt Segmentation (IPSeg) network is a straightforward yet highly effective framework, comprising three main components, i.e., feature extraction, feature interaction, and segmentation. For the feature extraction, we design two branches, including the prompt and the input branches. The prompt branch is dedicated to capturing general representations of subjective objects belonging to a specific category from the prompt image, and the extracted representations are employed to identify the objects in the input image.

The input branch is designed to capture the feature representation of the input image to be segmented, following the same architecture as the prompt branch. For the feature interaction, we devise a feature interaction module to facilitate interaction between the input image features and the given image prompt features, thereby highlighting the pixels of the target objects. Finally, the generated pixel points serve as the prompt information for SAM (Kirillov et al, 2023), guiding SAM in predicting the final segmentation map.

In summary, the key contributions are listed as follows:

  • We propose a training-free open-world object segmentation framework based on foundational models. We take the pioneering step of utilizing image prompts with clear target objects to query generic object representations from foundational models. Such a framework can potentially inspire researchers to address open-world segmentation from a fresh perspective.

  • We introduce a simple but effective framework, coined IPSeg, which contains three effective components. They are used to extract discriminative features of the target objects indicated in the given image prompt and to produce accurate points that prompt SAM to generate object masks.

  • We validate the proposed IPSeg framework on widely used segmentation datasets, including COCO-20\(^i\) (Nguyen and Todorovic, 2019), FSS-1000 (Li et al, 2020) and PerSeg (Zhang et al, 2023b). Compared to PerSAM and Painter, our method achieves 30.6% and 42.8% improvements in the mIoU metric, respectively, with flexible prompts under a training-free mechanism.

2 Related Works

2.1 Large Vision Models (LVMs)

Prompted by the powerful generalization ability of large language models (Devlin et al, 2018; Lu et al, 2019; Brown et al, 2020; Radford et al, 2018, 2019; Zhang et al, 2023a) in natural language processing, large vision models (Oquab et al, 2023; Kirillov et al, 2023; Radford et al, 2021) have emerged. Among these large vision models, CLIP (Radford et al, 2021) aligns the image and text feature spaces through contrastive learning on a huge number of image-text pairs, and its models show powerful zero-shot generalization ability on various downstream vision tasks (Xu et al, 2023), such as open-world segmentation (Qi et al, 2022; Cen et al, 2021). SAM (Kirillov et al, 2023) trains a prompt-based large segmentation model on 1 billion masks. The prompt-based segmentation model can accurately segment objects in images from different domains. Such ability has facilitated different applications, such as object tracking (Yang et al, 2023; Cheng et al, 2023; Zhu et al, 2023b), image segmentation (Zhang and Liu, 2023; Chen et al, 2023; Tang et al, 2023; Jiang and Yang, 2023), 3D reconstruction (Cen et al, 2023; Shen et al, 2023), etc. Besides, DINOv2 (Oquab et al, 2023) learns powerful object-level representations in an unsupervised manner. Such powerful representations facilitate downstream dense scene parsing tasks, such as semantic segmentation (Chen et al, 2017; Long et al, 2015) and depth estimation (Ranftl et al, 2021).

2.2 Open-World Segmentation

Open-world segmentation aims to extend traditional closed-set segmentation models (Long et al, 2015; Chen et al, 2017) to enable open-set pixel classification, making them more versatile and capable of generalization. Open-world segmentation models (Cui et al, 2020; Cen et al, 2021; Qi et al, 2022) need to be able to handle unknown classes. There exist several kinds of open-world segmentation methods. The first line of works (Xia et al, 2020; Cen et al, 2021; Angus et al, 2019; Hammam et al, 2023) attempts to classify the pixels of objects outside the training set’s distribution as ‘anomaly’, without further distinguishing different novel classes within the ‘anomaly’ category. The second line of works (Xian et al, 2019; Bucher et al, 2019) usually trains segmentation models on datasets with a fixed number of seen classes and utilizes the models to segment images with unseen classes. These works strive to improve the generalization of segmentation embeddings to unseen classes.

Fig. 3
figure 3

The framework of our proposed IPSeg. Importantly, all parameters in the network remain frozen, eliminating the need for additional training. The green point in \(\mathcal {P}_\mathcal {G}\) represents the positive point prompts sent to SAM, while the red point represents the negative point prompts sent to SAM (Color figure online)

Recently, since LVMs such as CLIP (Radford et al, 2021) have shown significant zero-shot classification ability, researchers have attempted to transfer their image-level classification ability to region-level classification. These methods (Luo et al, 2023; Xu et al, 2023; Ma et al, 2022; Liang et al, 2023; Xu et al, 2022b; Zhou et al, 2023c; Liu et al, 2022) adapt CLIP models into open-world segmentation models by training on datasets with seen classes to align the predicted region features and text features. Among the methods using LVMs, some works (Zhou et al, 2022; Liu et al, 2023b; Zhang et al, 2023b) also attempt to utilize training-free LVMs and design prompts to conduct open-world segmentation. Without fine-tuning the LVMs, they directly extract object segmentation masks from them. Zhou et al (2022) conduct minimal modifications of the CLIP model to extract segmentation masks of open-world categories. Liu et al (2023b) and Zhang et al (2023b) utilize an image with an object mask to extract prompts. Then, the prompts are used to instruct the SAM model to segment objects of the target category indicated in the provided image.

Our proposed method also falls into the training-free LVM category. Different from previous works using image-mask pairs, we only utilize an image containing objects of the target concept as the prompt to conduct open-world segmentation. Image prompts are more flexible than image-mask pairs, as humans do not need to annotate the objects of the target class. Besides, we also utilize off-the-shelf LVMs, such as DINOv2, to extract discriminative feature representations of image prompts. These discriminative feature representations are then used to prompt LVMs to segment target objects in test images.

3 Method

We first introduce the preliminaries of the Segment Anything Model (SAM) (Kirillov et al, 2023) used in this paper. Then, we introduce the proposed IPSeg framework, which is shown in Fig. 3. Given an image prompt with a clear concept, IPSeg is capable of segmenting any semantically identical object under the open-world setting.

3.1 Preliminaries

SAM consists of three components: a prompt encoder \(Enc_\mathcal {P}\), an image encoder \(Enc_\mathcal {I}\), and a lightweight mask decoder \(Dec_\mathcal {M}\). As a prompt-based framework, SAM takes as input an image \(\mathcal {I}\) and prompts \(\mathcal {P}\) (e.g., specific points). Specifically, SAM first utilizes \(Enc_\mathcal {I}\) to extract features from the input image and employs \(Enc_\mathcal {P}\) to encode the provided prompts into prompt tokens:

$$\begin{aligned} F_\mathcal {I} = Enc_\mathcal {I}(\mathcal {I}), \quad T_\mathcal {P} = Enc_\mathcal {P}(\mathcal {P}). \end{aligned}$$
(1)

Afterwards, the encoded image \(F_\mathcal {I}\) and prompt tokens \(T_\mathcal {P}\) are input into the decoder \(Dec_\mathcal {M}\) for feature interaction. It is worth noting that SAM constructs the decoder’s input by concatenating several learnable mask tokens \(T_\mathcal {M}\) as prefixes to the prompt tokens \(T_\mathcal {P}\). These mask tokens are responsible for generating the mask output, formulated as:

$$\begin{aligned} \mathcal {M} = Dec_\mathcal {M}(F_\mathcal {I}, Concat(T_\mathcal {M},T_\mathcal {P})), \end{aligned}$$
(2)

where \(\mathcal {M}\) denotes the final segmentation masks predicted by SAM.

As discussed above, SAM can segment objects in an image based on the given prompt. Therefore, the core of this paper lies in how to find semantically matching points in the image \(\mathcal {I}\) to be segmented when given an image prompt \(\mathcal {I}_{\mathcal {P}}\) that contains clear visual concepts. These points, in turn, guide SAM in generating segmentation results. Note that we focus on constructing an image-prompt open-world framework; exploring other prompt forms, such as bounding boxes, is beyond the scope of this paper.
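To make the prompt interface concrete, the following minimal sketch shows how positive and negative point prompts drive SAM's decoder through the official segment-anything package; the checkpoint path and the chosen coordinates are illustrative placeholders, not values from our experiments.

```python
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

# Load a frozen SAM; Enc_I, Enc_P and Dec_M are wrapped inside the predictor.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                          # runs Enc_I once and caches F_I

point_coords = np.array([[320, 240], [100, 400]])   # (x, y) pixel locations (illustrative)
point_labels = np.array([1, 0])                     # 1 = positive point, 0 = negative point

masks, scores, _ = predictor.predict(
    point_coords=point_coords,                      # encoded by Enc_P into T_P
    point_labels=point_labels,
    multimask_output=False,                         # a single mask M from Dec_M
)
```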

3.2 Overview

The pipeline of our method is shown in Fig. 3. The proposed IPSeg framework comprises three components: feature extraction, feature interaction, and SAM. The feature extraction module is shared by the prompt branch and the input branch, and extracts the discriminative feature representations of both the input image \(\mathcal {I}\) and the image prompt \(\mathcal {I}_{\mathcal {P}}\). Then, the prompt feature \(\mathcal {F}_{\mathcal{I}\mathcal{P}}\) interacts with the input image feature \(\mathcal {F}_\mathcal {I}\) in the feature interaction module to generate specialized prompts \(\mathcal {P}_\mathcal {G}\), such as points in the input image, that carry the same semantic information as the prompt image. Finally, the generated prompt \(\mathcal {P}_\mathcal {G}\) and the input image \(\mathcal {I}\) are sent to SAM, generating the final prediction \(\mathcal {M}\). We will provide detailed explanations of the first two components in the subsequent subsections.
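The data flow can be summarized by the short sketch below. It is only a schematic of the pipeline in Fig. 3: the extractor, saliency detector, interaction step, and SAM wrapper are passed in as callables (their concrete forms are sketched in Sects. 3.3 and 3.4), and none of the names correspond to a released implementation.

```python
def ipseg(input_image, prompt_image, extract, saliency, interact, sam_predict,
          K: int = 32, c: int = 4):
    """Training-free IPSeg pipeline sketch; every model involved stays frozen.

    All arrays are assumed to be torch tensors.
    extract(img)            -> (H, W, C) fused DINOv2 + SD features (Sect. 3.3)
    saliency(img)           -> (H, W) unsupervised foreground map (TSDN in the paper)
    interact(f, v, K, c)    -> positive / negative point coordinates (Sect. 3.4)
    sam_predict(img, p, n)  -> final mask M (Sect. 3.1)
    """
    f_input = extract(input_image)                    # F_I
    f_prompt = extract(prompt_image)                  # F_IP before pooling
    mask = saliency(prompt_image)                     # M_S
    # Eq. (5): background-filtered average pooling of the prompt features.
    prompt_vec = (f_prompt * mask.unsqueeze(-1)).sum((0, 1)) / mask.sum().clamp(min=1e-6)
    pos, neg = interact(f_input, prompt_vec, K, c)    # Eq. (6) + clustering -> P_G
    return sam_predict(input_image, pos, neg)         # SAM produces the final mask M
```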

Fig. 4
figure 4

Visualization results of features extracted from different models. The second and fifth columns indicate the use of only the DINOv2 model for feature extraction, while the third and sixth columns denote the use of both DINOv2 and SD models for this purpose

3.3 Feature Extraction

Robust feature representations must be extracted from both the prompt image \(\mathcal {I}_{\mathcal {P}}\) and the input image \(\mathcal {I}\): they should effectively capture the visual semantic information in both images and ensure that the network can find semantically consistent objects between the two images. Generally, the feature representation of an image can be divided into a high-level feature representation and a low-level feature representation. In this paper, we explore how to extract the feature representation of an image from both of these aspects.

In the following, we first introduce the feature extraction process. Then, we describe how the feature extraction constitutes the prompt and input branches of the IPSeg framework.

3.3.1 Feature Extraction

High-level Feature Extraction

A previous study (Oquab et al, 2023) has established that features from Vision Transformers, particularly those from DINOv2, are rich in explicit information pertinent to semantic segmentation and are highly effective when used with K-Nearest Neighbors classifiers. DINOv2, in essence, excels at extracting semantic content with high accuracy from each image. Consequently, we choose to utilize the features extracted by the foundational model DINOv2 to represent the semantic information of each image, denoted as \(\mathcal {F}_\mathcal {D}\).
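As an illustration, dense DINOv2 patch features can be obtained with only a few lines; this is a minimal sketch assuming the torch.hub release of DINOv2 (the ViT-B/14 variant is used here purely for brevity) and standard ImageNet preprocessing.

```python
import torch
from PIL import Image
from torchvision import transforms

# Frozen DINOv2 backbone from torch.hub (ViT-B/14 chosen only for this sketch).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

preprocess = transforms.Compose([
    transforms.Resize((518, 518)),   # a multiple of the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def dinov2_features(image_path: str) -> torch.Tensor:
    """Return dense semantic features F_D with shape (h, w, C)."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    tokens = dino.forward_features(x)["x_norm_patchtokens"]   # (1, h*w, C)
    h = w = x.shape[-1] // 14                                  # 518 / 14 = 37 patches per side
    return tokens.reshape(h, w, -1)                            # (h, w, C)
```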

Low-Level Feature Extraction

DINOv2 is proficient in capturing significant high-level semantic information, yet it has limitations in providing intricate low-level detail information. As illustrated in the second column of Fig. 4, the visual features generated exclusively through DINOv2 might miss out on fine-grained low-level details. Notably, there is a discernible research gap in augmenting features extracted by DINOv2 with low-level detail information without necessitating additional training.

In our proposed IPSeg, integrating a pre-trained model that specializes in capturing low-level detail information becomes vital. Such a model is capable of effectively compensating for the detailed information that might be overlooked by DINOv2. Notably, Stable Diffusion (SD) (Rombach et al, 2022) has recently been recognized for its exceptional prowess in generating high-quality images, underscoring its ability to robustly represent images with comprehensive content and detailed information. Consequently, our primary focus is to explore the potential benefits of combining SD features with DINOv2 in enhancing the overall quality of feature representations.

The architecture of SD consists of three key components: an encoder \(\mathcal {E}_{nc}\), a decoder \(\mathcal {D}_{ec}\), and a denoising U-Net \(\mathcal {U}_{net}\) operating within the latent space. We initiate the process by projecting an input image \(I_0\) into the latent space using the encoder \(\mathcal {E}_{nc}\), resulting in a latent code \(x_0=\mathcal {E}_{nc}(I_0)\). Subsequently, we introduce Gaussian noise \(\epsilon \) to the latent code, following a predefined time step t. Finally, utilizing the latent code \(x_t\) at time step t, we extract the SD features \(\mathcal {F}_\mathcal {S}\) through the denoising U-Net:

$$\begin{aligned} \mathcal {F}_\mathcal {S} = \mathcal {U}_{net}(x_t,t),\ x_t = \sqrt{\bar{a}_t}x_0 + \sqrt{1-\bar{a}_t}\epsilon . \end{aligned}$$
(3)

Here, \(\bar{a}_t\) is determined by the noise schedule (Ho et al, 2020).
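The following sketch shows one plausible way to realize Eq. (3) with the diffusers library: the VAE encodes the image into \(x_0\), a DDIM scheduler adds noise at step t, and a forward hook taps an intermediate U-Net activation as \(\mathcal {F}_\mathcal {S}\). The checkpoint identifier, the hooked block (up_blocks[1]), and the use of an empty text prompt are assumptions of this sketch, not choices prescribed by the paper.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"   # assumed SD v1.5 checkpoint id
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").eval()
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").eval()
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").eval()

# Tap an intermediate U-Net activation; which block to hook is a design choice.
features = {}
unet.up_blocks[1].register_forward_hook(lambda m, i, o: features.update(sd=o))

@torch.no_grad()
def sd_features(image: torch.Tensor, t: int = 50) -> torch.Tensor:
    """image: (1, 3, H, W) scaled to [-1, 1]. Returns F_S from the noised latent x_t."""
    # x_0 = Enc(I_0), scaled as required by the SD latent space.
    x0 = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    # Eq. (3): x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.
    timestep = torch.tensor([t])
    xt = scheduler.add_noise(x0, torch.randn_like(x0), timestep)
    # Unconditional (empty) text embedding for the denoising U-Net.
    tokens = tokenizer("", padding="max_length",
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids)[0]
    unet(xt, timestep, encoder_hidden_states=text_emb)   # hook captures the features
    return features["sd"]                                # (1, C_S, h, w)
```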

Feature Fusion

Building upon the discussions mentioned earlier, we present a straightforward yet notably effective fusion strategy. This strategy is designed to capitalize on the strengths of both SD and DINOv2 features:

$$\begin{aligned} \mathcal {F}_\mathcal {F} = Cat(\mathcal {F}_\mathcal {S}, \mathcal {F}_\mathcal {D}), \end{aligned}$$
(4)

where \(Cat(\cdot ,\cdot )\) denotes feature concatenation along the channel dimension. As shown in the third and sixth columns of Fig. 4, the fused feature is smoother and more robust, which helps feature matching. Specifically, adding SD enhances the internal features of foreground objects, making them smoother and more consistent, thereby assisting the network in extracting target objects from the images to be segmented.
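To make Eq. (4) concrete, the sketch below concatenates the two feature maps after bringing them onto a common grid; the bilinear resizing step is our assumption, since Eq. (4) itself only specifies the channel-wise concatenation.

```python
import torch
import torch.nn.functional as F

def fuse_features(f_sd: torch.Tensor, f_dino: torch.Tensor) -> torch.Tensor:
    """Eq. (4): F_F = Cat(F_S, F_D) after spatially aligning the two maps.

    f_sd:   (1, C_S, h_s, w_s)  Stable Diffusion features
    f_dino: (h_d, w_d, C_D)     DINOv2 patch features
    returns (h_d, w_d, C_S + C_D)
    """
    h_d, w_d, _ = f_dino.shape
    # Bilinear resizing to the DINOv2 grid is one simple alignment choice.
    f_sd = F.interpolate(f_sd, size=(h_d, w_d), mode="bilinear", align_corners=False)
    f_sd = f_sd.squeeze(0).permute(1, 2, 0)             # (h_d, w_d, C_S)
    return torch.cat([f_sd, f_dino], dim=-1)            # concatenate along channels
```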

3.3.2 Input and Prompt Branches

After introducing the feature extraction pipeline, we use it to extract features for the input image (input branch) and the image prompt (prompt branch), respectively.

Input Branch

For the input image \(\mathcal {I}\), we use the above process to extract the feature \(\mathcal {F}_\mathcal {I} \in \mathbb {R}^{H \times W \times C}\), where H and W denote the spatial resolution of the feature and C denotes the number of channels. Then, we reshape \(\mathcal {F}_\mathcal {I}\) to \(\mathbb {R}^{HW \times C}\), where HW is the total number of pixels in the feature and each pixel is represented by a vector in \(\mathbb {R}^{C}\).

Prompt Branch

For the image prompt \(\mathcal {I}_\mathcal {P}\), we also extract its feature \(\mathcal {F}_{\mathcal{I}\mathcal{P}} \in \mathbb {R}^{H \times W \times C}\) through the above process. Since we do not care about the background information of this feature, we use an unsupervised salient object detection method, TSDN (Zhou et al, 2023b), to filter out the pixels belonging to the background, and then use an average pooling (Avgpool) operation to generate the prompt embedding:

$$\begin{aligned} \mathcal {F}_{\mathcal{I}\mathcal{P}} = Avgpool(\mathcal {F}_{\mathcal{I}\mathcal{P}} \odot \mathcal {M}_\mathcal {S}), \end{aligned}$$
(5)

where \(\odot \) denotes pixel-wise multiplication. The object map \(\mathcal {M}_\mathcal {S}\) is directly obtained by the unsupervised method TSDN.
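A minimal sketch of Eq. (5) is given below; averaging only over the foreground pixels indicated by the saliency map is our reading of the Avgpool operation, and the saliency map itself can come from any unsupervised salient object detector (TSDN in the paper).

```python
import torch

def prompt_embedding(f_prompt: torch.Tensor, saliency: torch.Tensor,
                     eps: float = 1e-6) -> torch.Tensor:
    """Eq. (5): F_IP = Avgpool(F_IP ⊙ M_S).

    f_prompt: (H, W, C) fused features of the image prompt
    saliency: (H, W)    foreground map M_S from an unsupervised salient object detector
    returns   (C,)      prompt embedding
    """
    masked = f_prompt * saliency.unsqueeze(-1)                 # F_IP ⊙ M_S
    return masked.sum(dim=(0, 1)) / (saliency.sum() + eps)     # average over foreground pixels
```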

Fig. 5
figure 5

Visualizing the features of foreground objects in the prompt image and all objects in the input image

3.4 Feature Interaction and Segmentation

After generating the input image feature \(\mathcal {F}_\mathcal {I}\) and the prompt feature vector \(\mathcal {F}_{\mathcal{I}\mathcal{P}}\), we can obtain specific point prompts for the input image \(\mathcal {I}\) by performing interaction between \(\mathcal {F}_\mathcal {I}\) and \(\mathcal {F}_{\mathcal{I}\mathcal{P}}\).

Concretely, for the input image feature, which contains HW pixels, the feature representation of each pixel is denoted as \(\mathcal {F}_\mathcal {I}^{l}\), where \(l \in [1,HW]\). First, we calculate the correlation score between \(\mathcal {F}_{\mathcal{I}\mathcal{P}}\) and \(\mathcal {F}_\mathcal {I}^{l}\) through cosine similarity. Second, we utilize a TopK algorithm to select the points in the input image that are most semantically similar to the prompt image, whose positions are denoted as \(P_{coord}\):

$$\begin{aligned} S = \mathcal {F}_{\mathcal{I}\mathcal{P}} \otimes \mathcal {F}_{\mathcal {I}}, \quad P_{coord} = {\text {TopK}}(S) \in \mathbb {R}^K, \end{aligned}$$
(6)

where \(\otimes \) denotes matrix multiplication. As shown in Fig. 5, the foreground object in the prompt image and the object to be segmented in the input image maintain good semantic consistency, ensuring the effectiveness of our TopK algorithm.

Finally, we further refine \(P_{coord}\) into c cluster centers as the positive point prompts for SAM. In addition, using the same pipeline, we also select K points that are the least similar to the prompt image feature and cluster them into c cluster centers as negative point prompts for SAM. We set \(K=32\) and \(c=4\) in this paper. The generated positive/negative point prompts and the input image \(\mathcal {I}\) are sent to SAM to predict the final segmentation result \(\mathcal {M}\).
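The interaction step can be sketched as follows, assuming the fused features and the prompt embedding from the previous sketches; k-means from scikit-learn stands in for the clustering algorithm, which is not pinned down here.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def point_prompts(f_input: torch.Tensor, f_prompt: torch.Tensor, K: int = 32, c: int = 4):
    """Eq. (6) plus clustering: turn feature similarity into SAM point prompts.

    f_input:  (H, W, C) fused features of the input image
    f_prompt: (C,)      prompt embedding from Eq. (5)
    returns   positive and negative point coordinates, each of shape (c, 2) as (x, y)
    """
    H, W, C = f_input.shape
    flat = f_input.reshape(H * W, C)                                   # F_I reshaped to (HW, C)
    sim = F.cosine_similarity(flat, f_prompt.unsqueeze(0), dim=-1)     # similarity S, (HW,)

    def cluster(indices: torch.Tensor) -> np.ndarray:
        ys, xs = indices // W, indices % W                             # back to 2-D positions
        coords = torch.stack([xs, ys], dim=-1).float().cpu().numpy()   # (K, 2) as (x, y)
        return KMeans(n_clusters=c, n_init=10).fit(coords).cluster_centers_

    pos_idx = sim.topk(K, largest=True).indices     # most similar pixels  -> positive prompts
    neg_idx = sim.topk(K, largest=False).indices    # least similar pixels -> negative prompts
    return cluster(pos_idx), cluster(neg_idx)
```

In practice, the returned cluster centers live on the feature grid and would be rescaled to the input image resolution before being passed to SAM as point coordinates with labels 1 (positive) and 0 (negative), exactly as in the point-prompt sketch of Sect. 3.1.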

Table 1 Comparison of the few-shot semantic segmentation performance between our proposed method and five typical generalist models. Painter (Wang et al, 2023a), SegGPT (Wang et al, 2023b) and DeLVM (Guo et al, 2024) are three methods that require an extra training process

4 Experiment

4.1 Experimental Setup

We employ the Stable Diffusion v1.5 and DINOv2 models as our feature extractors. The DDIM timestep in the denoising process is set to 50 by default. All experiments are conducted on a single RTX A6000 GPU using only 13 GB of GPU memory. This means that our proposed training-free framework can run on cheaper graphics cards such as the RTX 3090, providing a good opportunity for researchers with limited computing power to explore foundational models.

4.2 Evaluation Datasets

Following PerSAM (Zhang et al, 2023b), we conduct few-shot experiments on three datasets, including COCO-20\(^i\) (Nguyen and Todorovic, 2019), FSS-1000 (Li et al, 2020) and PerSeg (Zhang et al, 2023b), to evaluate the performance of our proposed IPSeg network in the open-world scene. Note that PerSeg is a new dataset collected by PerSAM, which comprises a total of 40 objects from various categories, including daily necessities, animals, and buildings. For each object, there are 5 to 7 images and masks, representing different poses or scenes. We use the same setting as the PerSAM paper to perform our experiments. Unlike previous few-shot works utilizing image-mask pairs as input, our method only needs a randomly sampled image as the image prompt.

Moreover, inspired by the work ViL-Seg (Liu et al, 2022), we employ three datasets, including COCO-Stuff (Caesar et al, 2018), PASCAL-VOC (Everingham et al, 2010) and PASCAL-Context (Mottaghi et al, 2014), to evaluate the performance of our IPSeg network in the zero-shot setting. We use the same experimental setting as ViL-Seg to perform the experiments. For the above datasets, 15 classes (frisbee, skateboard, cardboard, carrot, scissors, suitcase, giraffe, cow, road, wall concrete, tree, grass, river, clouds, playingfield) are held out of the 183 object categories in COCO-Stuff; 5 classes (potted plant, sheep, sofa, train, tv-monitor) are held out of the 20 object categories in PASCAL-VOC; and 4 classes (cow, motorbike, sofa, cat) are held out of the 59 object categories in PASCAL-Context.

4.3 Quantitative Evaluation

Herein, we do not utilize the ground-truth mask corresponding to the prompt image to select its foreground object for IPSeg.

Table 2 Comparison of the zero-shot segmentation performance between our proposed method and seven typical specialist models

4.3.1 Compared to Generalist Models

We select five representative open-world object segmentation methods that employ foundation models in distinct ways: Painter (Wang et al, 2023a), SegGPT (Wang et al, 2023b), DeLVM (Guo et al, 2024), PerSAM (Zhang et al, 2023b) and Matcher (Zhao et al, 2023). Painter, SegGPT and DeLVM are based on a generalized foundation model that is directly trained for various tasks, allowing the use of image-mask pairs for open-world object segmentation. In contrast, PerSAM and Matcher efficiently adapt SAM for open-world object segmentation tasks without the need for additional training. The comparative results are shown in Table 1.

As indicated in Table 1, our proposed method consistently outperforms Painter, DeLVM and PerSAM, which demonstrates the efficacy of our IPSeg network. Specifically, our approach shows significant mIoU improvements over PerSAM on the COCO-20\(^i\), FSS, and PerSeg datasets, with gains of 87.0%, 1.3%, and 3.6%, respectively. A noteworthy point is that Painter, DeLVM and PerSAM rely on image-mask pair inputs, which are more stringent and less flexible than our prompts. This observation suggests that using a single image as the prompt, as proposed in our method, is a promising avenue for further research. This approach could serve as an alternative or supplement to the traditional image-mask pair prompts, potentially broadening the scope of research in open-world segmentation tasks.

Note that our proposed IPSeg is designed for zero-shot open-world segmentation. Therefore, for a fair comparison, we also evaluate the performance of Matcher under the zero-shot setting. Here, the zero-shot setting means using the unsupervised salient object detection method TSDN (Zhou et al, 2023b) to filter the background of image prompts instead of their corresponding ground truth. As shown in Table 1, IPSeg surpasses Matcher’s performance in the zero-shot setting (Matcher-Z) by a large margin, which further illustrates the validity of our IPSeg.

Fig. 6
figure 6

Qualitative segmentation results of the proposed IPSeg framework. It can be seen that the proposed method can effectively segment the objects contained in the prompt image in the input images from different scenarios. The green point represents positive point prompts sent to SAM, while the red point represents negative point prompts sent to SAM (Color figure online)

Fig. 7
figure 7

Qualitative segmentation results of the proposed IPSeg and PerSAM using same image prompts. The green point represents positive point prompts sent to SAM, while the red point represents negative point prompts sent to SAM (Color figure online)

Table 3 Ablation studies of the combination of SD and DINOv2 in this paper

4.3.2 Compared to Specialist Models

We compare our proposed IPSeg network with several well-known specialist zero-shot segmentation methods, including SPNet (Xian et al, 2019), ZS3 (Bucher et al, 2019), CaGNet (Gu et al, 2020), SIGN (Cheng et al, 2021), ViL-Seg (Liu et al, 2022), GroupViT (Xu et al, 2022a) and TCL (Cha et al, 2023). It is important to note that these specialist methods are designed with specific segmentation models, each trained on particular datasets. The comparative results are displayed in Table 2. Our IPSeg network demonstrates superior performance compared to these specialist models. Notably, it outperforms the CLIP-based ViL-Seg method on the COCO-Stuff, Pascal-VOC, and Pascal-Context datasets, with mIoU improvements of 99.4%, 68.3%, and a remarkable 231%, respectively. It is worth mentioning that the Pascal-Context dataset, whose unseen set comprises four common classes, represents relatively simpler scenarios. This aspect may have contributed to the substantial superiority of IPSeg over ViL-Seg on this dataset. Compared to TCL, our method also achieves competitive performance.

In conclusion, our training-free IPSeg network consistently surpasses specialist open-world object segmentation methods. This success underscores the potential of exploring open-world object segmentation from a novel angle, combining foundational models in a training-free approach. Such an endeavor could significantly enhance the efficiency and applicability of segmentation tasks in diverse real-world scenarios.

Fig. 8
figure 8

Further analysis about why adding SD can help improve the performance. The second and fifth columns indicate the use of only the DINOv2 model for feature extraction, while the third and sixth columns denote the use of both DINOv2 and SD models for this purpose

Table 4 Hyperparameter settings in the feature interaction module

4.4 Qualitative Evaluation

In Fig. 6, we showcase the visualization results from our IPSeg network. These visualizations highlight the network’s capability in effectively segmenting objects within a variety of complex scenes. This serves as a testament to the effectiveness of our approach from a visual standpoint. Particularly noteworthy is the network’s performance in intricate scenarios involving multiple objects, such as scenes labelled ‘Dogs’ and ‘Elephants.’ In these cases, our IPSeg network accurately segments the target objects, underscoring its proficiency in correctly identifying objects in the input image that have semantic correspondence with those in the image prompt. This ability showcases the robustness and adaptability of the IPSeg network in dealing with diverse and challenging segmentation tasks. To further illustrate the validity of our method, we conduct some visual comparisons with PerSAM in Fig. 7. It can be seen that in different complex scenes, the performance of our IPSeg is better than that of PerSAM under the same image prompts.

4.5 Ablation Studies

For ablation studies, similar to the experimental setting above, we do not utilize the ground truth corresponding to the prompt image to select its foreground object.

4.5.1 Combination of SD and DINOv2

In the process of feature extraction, our IPSeg network considers both high-level and low-level details from the input and prompt images. Recognizing the limitations of the DINOv2 model in capturing low-level features, we integrate the SD model to address this gap. As shown in Table 3, incorporating SD significantly boosts the performance of our IPSeg network. This improvement is further evidenced by the visual results in Fig. 4, where the inclusion of SD results in smoother feature representations. Moreover, as shown in Table 3, the performance of using solely SD as the feature extractor is clearly inferior to that of using a combination of DINOv2 and SD. One primary reason is that the features extracted by SD lack high-level semantic information. As illustrated in Fig. 8, incorporating features extracted by the SD model allows IPSeg to more distinctly differentiate between foreground and background. This enhancement significantly boosts the performance of IPSeg.

Table 5 Image prompt robustness of the proposed IPSeg
Fig. 9
figure 9

Qualitative results of the proposed IPSeg framework when using different image prompts. When given the same input image with different image prompts, our proposed IPSeg network can consistently generate satisfactory results. This also indicates the robustness of our method. The green point represents positive point prompts sent to SAM, while the red point represents negative point prompts sent to SAM (Color figure online)

Fig. 10
figure 10

Some failure prediction results of our IPSeg under different image prompts. The green point represents positive point prompts sent to SAM, while the red point represents negative point prompts sent to SAM (Color figure online)

4.5.2 Hyperparameters in Feature Interaction

In feature interaction, we introduce a simple yet effective approach for generating point prompts to guide SAM in generating the corresponding segmentation results. In this module, we compute the cosine similarity between each pixel of the input image and the prompt embedding, use the TopK algorithm to select the K most/least similar points, and then apply the clustering algorithm to group these points into c cluster centers. In Table 4, we investigate the impact of different values of K and c on performance. We observe that using the TopK algorithm alone already gives the model an initial level of performance (Ours(\(K=32,c=32\))), and further applying the clustering algorithm improves performance even more.

4.5.3 Image Prompt Robustness

Table 6 The impact of background noise on IPSeg
Table 7 The performance of IPSeg using different feature extractors
Table 8 Comparison between IPSeg and PerSAM on other four datasets, containing DAVIS2017 (Pont-Tuset et al, 2017), Pascal-Part (Morabia et al, 2020), PACO-Part (Ramanathan et al, 2023) and LVIS-92\(^i\) (Gupta et al, 2019)

In this paper, we introduce a more flexible approach to using image prompts. To further validate the robustness of our model with different image prompt combinations, we randomly select three different image prompt combinations. Specifically, we prepare appropriate prompt images based on their categories. For all prompt images, we first manually choose different prompt images with clear visual representations of certain classes. Then, we randomly compose prompt set-1 to set-3 from these prompt images. Note that all prompt images are chosen from the used benchmarks based on categories, such as COCO and FSS, and we make sure that the selected prompt images and the evaluation datasets do not share any images. From Table 5, it can be observed that our method maintains good robustness across different prompt inputs. As shown in Fig. 9, when given the same input image with different image prompts, our proposed IPSeg network consistently generates satisfactory results. This experiment further indicates that in future improvements of this framework, researchers can have a more flexible choice of prompts, reaffirming the potential of our IPSeg.

Moreover, in Fig. 10, we share some failure cases. Specifically, we showcase objects that are correctly segmented with prompt set-1 but fail with prompt set-2 and set-3. This group of examples indicates that choosing image prompts containing only a single, complete target object can significantly aid IPSeg in achieving accurate segmentation results. Hence, when preparing image prompts, we strive to adhere to these two principles. However, while aiming for optimal performance, we do not want our framework to be constrained by the reference images. Consequently, the three prompt sets designed in Table 5 are not deliberately combined. This design ensures that the results obtained by IPSeg are reliable and credible, and it indirectly shows that our framework does not rely on carefully selected image prompts that require an extensive time investment.

4.5.4 Impact of Background Noise

To investigate the impact of background noise on our method, we conduct experiments under the following settings: without using the unsupervised salient object detection (USOD) method TSDN (Zhou et al, 2023b) to filter the background, using TSDN to filter the background, and using the ground truth corresponding to the prompt image to filter the background. The results are shown in Table 6. First, it is evident that not filtering the background significantly degrades our experimental performance. Further improvements are observed upon utilizing TSDN. Finally, by utilizing the ground truth to filter the background noise in the prompt image, we achieve the best performance. Moreover, if we use another USOD method, A2S (Zhou et al, 2023a), IPSeg’s performance does not fluctuate dramatically. This experiment shows that IPSeg requires a USOD method to provide a relatively less noisy image prompt, but does not depend on a particular USOD method.

4.5.5 Different Feature Extractors

To demonstrate the impact of different feature extractors, inspired by Matcher (Zhao et al, 2023), we use MAE (He et al, 2022) and CLIP (Radford et al, 2021) as alternative feature extractors, and the performance is shown in Table 7. Using DINOv2 as the feature extractor achieves the best performance on all datasets. Additionally, this experiment demonstrates that IPSeg, as a training-free framework, facilitates the integration of various feature extractors.

4.5.6 Transferability of IPSeg

We conduct experiments on several additional datasets to further demonstrate the effectiveness and transferability of our IPSeg, as shown in Table 8. These datasets include the video object segmentation benchmark DAVIS2017 (Pont-Tuset et al, 2017), the semantic segmentation benchmark LVIS-92\(^i\) (Gupta et al, 2019), and the part segmentation benchmarks Pascal-Part (Morabia et al, 2020) and PACO-Part (Ramanathan et al, 2023). As shown in Table 8, our method outperforms PerSAM on all of these datasets, which once again illustrates the validity of our IPSeg.

5 Conclusion

In this paper, we introduce the IPSeg framework for open-world segmentation using visual concepts from a single image. IPSeg is a simple yet highly effective approach designed to inspire researchers to approach open-world segmentation from two pivotal perspectives: efficient utilization of foundational models and a flexible setup for prompt information. Through our exploration of how to optimally combine diverse foundational models, our method attains outstanding performance on six widely utilized datasets. Furthermore, our research underscores the importance of adaptability in foundational models, emphasizing their potential to revolutionize the way we approach complex computer vision challenges. We believe that our contributions will pave the way for future research endeavors, pushing the boundaries of what is possible in open-world segmentation and setting new standards for efficiency and versatility in the field.