1 Introduction

Experiments in fluid mechanics have evolved significantly, and the objects of study have become increasingly complex. While traditional experiments focused on canonical configurations such as cylinder wakes and backward-facing step flows, current studies address complicated cases such as the motion of flying birds (Usherwood et al. 2020). Even canonical configurations are now studied in three-dimensional experiments, where perspective effects in the recorded images challenge accurate masking. Masking becomes essential in fluid–structure interaction studies, where the object’s shape and position must be detected. Current approaches rely either on identifying void regions with no particles (Jux et al. 2021) or on case-specific manual masking.

Apart from object identification, flow structures also require segmentation. Turbulent flow structures are known for their complex and chaotic patterns, which make their identification difficult. The turbulent/non-turbulent interface (TNTI) is among these structures, marking the boundary between the chaotic, rotational regions of turbulent flow and the irrotational regions (Westerweel et al. 2005). Accurately detecting the TNTI is essential for understanding and modelling turbulent properties, such as transport across the interface. In scalar turbulence, such as smoke plumes, the TNTI displays a sharp separation from the non-turbulent region. Edge detection nevertheless remains challenging because of the chaotic patterns of turbulence (Asadi 2024). Segmenting the scalar field is a common means of detecting the TNTI. Existing methods primarily rely on a threshold approach, finding the local minimum of the intensity histogram, or on a clustering approach (Younes et al. 2021). Manual segmentation is also employed in complex situations. A review of TNTI detection algorithms is available in the thesis of Asadi (2024). These examples highlight the need for a universal segmentation model that works effectively on both coherent structures and objects in fluid experiments.

Recent advances in image segmentation using artificial intelligence (AI) offer promising applications in fluid mechanics experiments. Vennemann and Rösgen (2020) introduced an automatic masking method based on artificial neural networks (ANNs) for velocimetry images. This approach showed promising results in 2D measurements, particularly when only a single object is present within the field of view. However, its capability to segment complex flow structures, or to differentiate between distinct objects in the view, such as separating the bike from the cyclist in a sports flow experiment (Jux et al. 2018), is limited. The Segment Anything Model (SAM), developed by Meta AI (Kirillov et al. 2023), stands out as a foundation model. SAM’s extensive training dataset, consisting of over 1 billion masks on 11 million images, offers a robust starting point for exploring its capabilities in fluid experiments.

Fig. 1

The Segment Anything Model (SAM) (Kirillov et al. 2023). Its image encoder creates a representation that supports object mask generation in response to different prompts, capable of producing multiple valid masks (shown in black) and confidence scores

In this paper, we begin by introducing object segmentation and detailing the fine-tuning process used to address the complexities of fluid flow detection with SAM’s pre-trained architecture and weights. We focus on detecting structures in a turbulent flow, the turbulent/non-turbulent interface, using a time series of scalar concentration images. This required fine-tuning SAM’s mask decoder, which we carried out with the same approach used in the model’s original training. Finally, we demonstrate how the prompt encoder of SAM can be modified and combined with language models to ease complex object detection in experiments using only textual input.

Fig. 2

Identification of turbulent/non-turbulent flow. a The model is trained at a low Reynolds number of 2000. b–d The fine-tuned model is then applied at higher Reynolds numbers

2 Segment Anything Model

SAM’s architecture comprises three key modules: an image encoder, a prompt encoder, and a mask decoder (see Fig. 1). The image encoder processes input images to generate image embeddings (representations), while the prompt encoder transforms point and box prompts into embeddings that guide segmentation. The mask decoder combines the information from the image and prompt encoders to predict the final segmentation mask(s). SAM accepts guiding prompts in various forms, such as points or bounding boxes. However, complex geometries require more detailed prompts, and objects often move during experiments, such as a flying bird (Usherwood et al. 2020). Point or box prompts are therefore not always directly applicable in fluid experiments.

Fig. 3

IoU scores for four Reynolds numbers: a SAM and b Fine-tuned SAM in TNTI detection

Recent studies have attempted to integrate natural language models, such as BERT introduced by Google AI Language (Devlin et al. 2018), as prompt encoders to perform highly specific and context-aware tasks. BERT’s language understanding helps the model focus attention and isolate desired objects within an image. In this study, we were inspired by Lightning AI (Lightning 2024), which integrated natural language prompts with GroundingDino (Liu et al. 2023) and SAM. GroundingDino employs BERT to detect a bounding box around objects. BERT tokenises the textual input to create contextualised embeddings, which are enhanced using text-to-image and image-to-text cross-attention mechanisms (Liu et al. 2023). These refined features are processed by a cross-modality decoder, which aligns the text with relevant visual regions to generate bounding boxes around the described objects; the boxes then serve as prompts for the SAM model (see Fig. 1). We can therefore exploit BERT’s language understanding to drive SAM with textual inputs for flow experiment segmentation.

Fig. 4

Object detection within the field of view of flow experiments. The segmentation model demonstrates robustness across various objects, accurately detecting their position, area, and shape with solely textual inputs. Textual inputs are a A flying owl (Usherwood et al. 2020), b A cyclist and a bike (Jux et al. 2018), c A skater (Terra et al. 2023), d Particles (Schanz et al. 2016), which require fine-tuning similar to Appendix B, e Sharks (Muller 2022), f A hydrofoil close to a wall (Zhou 2023)

2.1 Fine-tuning the SAM model

Fine-tuning involves optimising a pre-trained model (architecture + weights) with data specific to a particular use case. Ma et al. (2024) demonstrated that fine-tuning SAM for medical images can enhance performance, particularly when the number of training images is substantially increased. The process runs over multiple epochs: in each epoch, the model iterates over the entire dataset, computes the loss between predicted masks and ground truth masks for each batch, and updates its parameters using backpropagation. The pre-trained parameters are thereby adjusted to minimise the discrepancy between the predicted segmentation masks and the ground truth masks, using an optimisation algorithm (in this case, Adaptive Moment Estimation, ADAM (Kingma and Ba 2014)); the loss function penalises deviations between the predicted and ground truth masks. Through this process, the model learns to better capture the specific patterns and features in the dataset, ultimately improving its performance on segmentation tasks such as scalar turbulence. The convergence and evaluation of the fine-tuning process are provided in Appendix A.
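The loop below sketches this optimisation on a deliberately tiny stand-in: a two-parameter per-pixel logistic "decoder" fitted to a synthetic mask with a binary cross-entropy loss. Plain gradient descent stands in for ADAM, and none of this is SAM's actual decoder; the point is only the epoch/loss/update structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data (hypothetical): one scalar "feature" per pixel and a
# binary ground-truth mask derived from it.
X = rng.normal(size=(64, 64))
truth = (X > 0.5).astype(float)

def predict(w, b):
    """Per-pixel logistic 'decoder': sigmoid of a linear score."""
    return 1.0 / (1.0 + np.exp(-(w * X + b)))

def bce(p):
    """Binary cross-entropy between predicted probabilities and the mask."""
    eps = 1e-7
    return -np.mean(truth * np.log(p + eps) + (1.0 - truth) * np.log(1.0 - p + eps))

w, b, lr = 0.0, 0.0, 0.5
loss_before = bce(predict(w, b))
for epoch in range(200):              # "epochs" over the one-image dataset
    g = predict(w, b) - truth         # gradient of BCE w.r.t. the logits
    w -= lr * np.mean(g * X)          # parameter updates (plain gradient
    b -= lr * np.mean(g)              # descent standing in for ADAM)
loss_after = bce(predict(w, b))
```

In the real setting, `predict` is SAM's mask decoder, the parameters are its weights, and the gradients flow through backpropagation rather than the closed-form expression used here.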

3 Detect scalar turbulence

Direct application of SAM to scalar turbulence succeeds only in occasional cases; as Fig. 1 also illustrates, the output valid masks fail because the model is trained and designed for natural images. SAM suffers from incorrect predictions, broken masks, and large errors in turbulent/non-turbulent detection. Scalar turbulence can exhibit complex patterns, low-contrast boundaries, thin structures, and significant differences from the objects typically found in natural images. Despite being trained with 1.1 billion masks, SAM’s prediction quality falls short when dealing with turbulent flow.

We then fine-tuned the mask decoder of SAM for the specific task of scalar turbulence segmentation. As explained in Appendix A, we selected reliable turbulent/non-turbulent masks from low Reynolds number experimental data and applied them to the pre-trained SAM model weights. We used scalar turbulence images of a jet flow provided by Fukushima and Westerweel (2022). We intentionally trained the model with low Reynolds number data because the interface is well defined at such Reynolds numbers. Subsequently, we applied the fine-tuned model to higher Reynolds numbers, as shown in Fig. 2 and evaluated in Fig. 3. The performance of the fine-tuned model improved significantly, with IoU scores, which measure the overlap between the predicted and reference masks relative to the area of their union (see Appendix A), increasing from 0.5 to above 0.95 for all Reynolds numbers. At higher Reynolds numbers, the interfaces have more scattered patterns and less sharply defined edges, which is why the pdf plots show broader distributions. We present an additional application of the segmentation model to bubbly flows in Appendix B.
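For reference, the IoU score reduces to a few lines on binary masks; the two small masks below are purely illustrative:

```python
import numpy as np

def iou(pred, ref):
    """Intersection over union of two binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return 1.0            # both masks empty: treat as perfect agreement
    return np.logical_and(pred, ref).sum() / union

# Two toy masks: a 2x2 square, and the same square widened by one column.
a = np.zeros((4, 4), dtype=int); a[1:3, 1:3] = 1
b = np.zeros((4, 4), dtype=int); b[1:3, 1:4] = 1
score = iou(a, b)             # intersection = 4 pixels, union = 6, so 4/6
```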

4 Objects in experiments

We analysed recent particle image velocimetry (PIV) and particle tracking velocimetry (PTV) experiment images, which present unique challenges compared to other segmentation cases due to thousands of surrounding light-emitting particles. In our initial case study, we focused on volumetric measurements of vortices behind a flying owl as it crossed the field of view (Usherwood et al. 2020). Given the complex nature of this object, manual masking proved impractical. Instead, by inputting only the text "Flying Owl", the segmentation model accurately produced masks without necessitating fine-tuning (see Fig. 4). The next case study involved 3D-PTV analysis around a cyclist (Jux et al. 2018). The model segmented the cyclist and accurately differentiated the bike from the cyclist’s body.

The most challenging scenario occurred during particle detection in a water tank experiment in which a jet was injected into the tank (Schanz et al. 2016). This case required fine-tuning the segmentation model for effective particle detection (similar to the bubbly flow in Appendix B). Even after fine-tuning, some large particles remained undetected. We found that the model fails to detect particles when the background is fully dark and the particles are bright. We therefore inverted the image colours to obtain dark particles on a bright white background. Furthermore, the model effectively masked four sharks in a 3D-PIV study of schooling fish (Muller 2022). The model’s ability to distinguish individual sharks allows schooling fish to be tracked without interference from the larger fish. In 2D-PIV image analysis, the model segmented a flat plate and a hydrofoil using the "Hydrofoil + Wall" input (Zhou 2023).
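The inversion itself is a one-line operation on 8-bit images; a minimal sketch on a synthetic frame (the particle positions and intensities are made up for illustration):

```python
import numpy as np

# Hypothetical 8-bit frame: bright particles on a dark background.
frame = np.zeros((8, 8), dtype=np.uint8)
frame[2, 3] = frame[5, 6] = 250       # two bright "particles"

inverted = 255 - frame                # dark particles on a bright background
```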

5 Conclusion

In conclusion, we have introduced a practical approach by adapting a natural-image segmentation model for coherent structure and object identification in fluid experiments. SAM has proven to be a valuable tool, capable of being fine-tuned to capture TNTI structures. Our approach involved fine-tuning the mask decoder of the SAM model, in line with its original training methodology. Additionally, we integrated the model with language models serving as a prompt encoder, allowing communication between the language model and SAM for precise, context-aware detection.