Object segmentation is a foundational and critically important field in modern computer vision. It plays a vital role in applications that rely heavily on visual understanding, such as object localization and identification, and these applications often demand real-time, fast, and accurate segmentation. This importance has made object segmentation a consistently active research topic, with significant work done in areas like instance segmentation, semantic segmentation, and panoptic segmentation.
With the evolution of object segmentation, the Segment Anything Model (SAM) has emerged as a remarkable tool, showcasing outstanding segmentation abilities and quickly being adopted in various computer vision applications. Frameworks using a pre-trained SAM architecture have achieved impressive performance in downstream vision tasks. However, despite its capabilities and high accuracy in segmentation tasks, SAM’s complex and heavy architecture necessitates substantial computational power, hindering its implementation on computationally constrained devices.
Addressing SAM’s computational challenges, researchers have developed the Tiny Segment Anything Model (TinySAM), which retains the zero-shot performance of the original framework while being more lightweight. TinySAM uses a full-stage knowledge distillation method with online hard prompts to create a more efficient student model. Post-training quantization adapted to promptable segmentation tasks further reduces computational needs. Additionally, TinySAM’s design aims for hierarchical segmentation, almost doubling the inference speed without compromising performance.
This article delves into the TinySAM framework, exploring its foundational principles, architecture, and performance compared to other state-of-the-art segmentation frameworks. Let’s explore these aspects in more detail.
The Segment Anything Model has driven rapid progress across several computer vision applications, owing to its commendable segmentation capabilities coupled with a massive segmentation dataset that houses over 11 million images and over a billion image masks. Thanks to its exceptional performance when segmenting objects of arbitrary categories and shapes, it serves as the foundation for frameworks performing downstream tasks like image inpainting, object tracking, 3D vision, and more. Furthermore, the Segment Anything Model offers remarkable zero-shot segmentation performance that has benefited sensitive industries working with limited amounts of data, including medical research and medical imaging.
Although one cannot question the remarkable segmentation capabilities the Segment Anything Model offers across a wide array of downstream vision tasks, it has clear downsides: a complex, heavy architecture, high computational requirements, and significant operational costs. Even on a modern GPU, the inference time of a SAM model can be as high as 2 seconds for a 1024×1024 image, which makes it highly difficult to deploy SAM applications on devices with limited computational abilities. To overcome this hurdle, recent works like MobileSAM and FastSAM have tried to develop a more computationally efficient SAM model. The MobileSAM framework replaces the heavy vision transformer in the image encoder with the architecture of the TinyViT framework, whereas the FastSAM model recasts the segment anything task as a single-category instance segmentation task built on the YOLOv8 model. Although these methods achieved some success in reducing computational requirements, they could not maintain performance, especially on downstream zero-shot tasks.
TinySAM, or the Tiny Segment Anything Model, is an attempt to reduce the computational requirements of the current SAM model without hindering performance on zero-shot downstream tasks. The TinySAM framework proposes a full-stage knowledge distillation method aimed at improving the ability of the compact student network: the student network is distilled in an end-to-end manner under the supervision of the teacher network at different stages. To boost performance further, the framework makes the distillation process attend more to hard examples by implementing an additional online hard prompt sampling strategy. Furthermore, to reduce computational costs even further, the TinySAM framework adapts post-training quantization to promptable segmentation tasks.
The major chunk of the computational cost of the Segment Anything Model comes from generating massive numbers of masks from grid prompt points in order to segment everything in the image. To reduce the cost of this segmentation strategy, the TinySAM framework employs a hierarchical segment everything strategy that almost doubles the inference speed without degrading performance. With these methods employed in its architecture, the TinySAM framework offers a significant reduction in computational requirements and pushes the limits of efficient segment anything tasks.
TinySAM : Architecture and Methodology
Before we talk about the architecture and methodology of the TinySAM framework, it is important to first have a look at its predecessor, the SAM framework. Ever since its introduction, the Segment Anything Model has demonstrated remarkable performance, versatility, and generalization capabilities across a range of downstream vision and object segmentation tasks.
At its core, the SAM model consists of three subnetworks: the prompt encoder, the image encoder, and the mask decoder. The primary aim of the prompt encoder is to encode arbitrary-shaped masks, input points and boxes, and free-form text along with positional information, using different networks to process the geometric and text prompts. The image encoder is a heavy ViT or vision transformer based network that converts the input image into embeddings. Finally, the mask decoder contains a two-way transformer that receives the outputs of the prompt encoder and the image encoder to generate the final mask prediction. Trained on its massive dataset, the SAM framework demonstrates remarkably high quality segmentation for objects irrespective of their shape and category. Furthermore, the Segment Anything Model demonstrates remarkable performance and efficiency across zero-shot downstream vision tasks including object proposal, edge detection, text to mask prediction, and instance segmentation. Owing to its high quality segmentation abilities and flexible prompting, the SAM framework forms the foundation for many vision applications. With that being said, one cannot ignore the high computational requirements of the traditional SAM architecture: its large number of parameters makes it almost impossible for developers to deploy SAM based applications on devices with constrained resources.
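To make the three-subnetwork structure concrete, the minimal sketch below wires together a toy image encoder, prompt encoder, and mask decoder in PyTorch. The module choices, tensor shapes, and pooling are illustrative stand-ins for the purpose of this article, not the actual SAM implementation.

```python
import torch
import torch.nn as nn

class ToySAM(nn.Module):
    """Toy stand-in for SAM's three subnetworks; not the real architecture."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # Stand-in for the heavy ViT image encoder: patchify with a strided conv.
        self.image_encoder = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # Lightweight prompt encoder for a single (x, y) point prompt.
        self.prompt_encoder = nn.Linear(2, embed_dim)
        # Stand-in for the two-way transformer mask decoder.
        self.mask_decoder = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, 64 * 64)
        )

    def forward(self, image, point_prompt):
        img_emb = self.image_encoder(image)           # (B, C, H/16, W/16)
        img_emb = img_emb.flatten(2).mean(-1)         # (B, C), pooled for simplicity
        prm_emb = self.prompt_encoder(point_prompt)   # (B, C)
        fused = img_emb + prm_emb                     # real SAM cross-attends instead
        return self.mask_decoder(fused).view(-1, 1, 64, 64)

model = ToySAM()
mask_logits = model(torch.randn(1, 3, 1024, 1024), torch.tensor([[512.0, 512.0]]))
print(mask_logits.shape)  # torch.Size([1, 1, 64, 64])
```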
Knowledge Distillation
Knowledge distillation is an important approach for boosting the performance of compact networks during the training phase. In knowledge distillation, the output of the teacher network is used to supervise the training of the lightweight student network. The method can be split into two subcategories: distillation of intermediate features and distillation of network outputs, with a majority of research on knowledge distillation focusing on image classification tasks.
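As a point of reference, the snippet below sketches classic output-level distillation for a classification model: a frozen teacher produces softened targets that supervise a smaller student alongside the ground-truth labels. The layer sizes, temperature, and loss weighting are illustrative assumptions, not values taken from TinySAM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

def distillation_loss(x, labels, temperature=2.0, alpha=0.5):
    with torch.no_grad():                     # the teacher only supervises, it is never trained
        t_logits = teacher(x)
    s_logits = student(x)
    # Soft-target loss: the student mimics the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard-target loss against the ground-truth labels.
    hard = F.cross_entropy(s_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 128), torch.randint(0, 10, (8,)))
loss.backward()
```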
With that being said, the following figure demonstrates the generic architecture of the TinySAM framework along with the performance overview on zero-shot instance segmentation tasks.
In the first stage, the TinySAM framework implements a knowledge distillation scheme designed specifically for the SAM framework, and to make the distillation more effective, the model uses online hard prompt sampling to mine hard knowledge from the teacher network for the student network. In the second stage, the TinySAM framework adapts the post-training quantization method to promptable segmentation tasks and applies it to the lightweight student network. Finally, the model implements a hierarchical segment everything inference mode designed for everything-mode segmentation, roughly doubling the inference speed with negligible loss in accuracy.
Full-Stage Knowledge Distillation
As mentioned earlier, the Segment Anything Model consists of three sub-networks at its core: the prompt encoder, the image encoder, and the mask decoder, with the image encoder built on a vision transformer and therefore carrying high computational requirements. To tackle this issue, the MobileSAM framework replaced the vision transformer with a TinyViT or Tiny Vision Transformer, although the substitution came with a significant performance decay. To avoid such decay, the TinySAM framework implements a full-stage knowledge distillation method that guides the lightweight student network with knowledge from the teacher at multiple levels. In addition to the conventional loss between the ground-truth labels and the predicted results, the TinySAM framework introduces several distillation losses at different stages, as shown in the following figure.
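The hypothetical sketch below shows what one training step of such a full-stage scheme could look like: distillation losses are applied to the image embeddings, to intermediate decoder tokens, and to the final mask logits, and the prompts on which the student is furthest from the teacher are flagged as hard prompts for subsequent sampling. The `image_encoder`/`decode` interfaces, the loss weights, and the hard-prompt ratio are assumptions made for illustration, not TinySAM's actual implementation.

```python
import torch
import torch.nn.functional as F

def full_stage_distill_step(teacher, student, image, prompts, gt_masks,
                            w_embed=1.0, w_token=1.0, w_mask=1.0, hard_ratio=0.5):
    """One hypothetical full-stage distillation step.

    Assumes both models expose `image_encoder(image)` and
    `decode(embeddings, prompts) -> (tokens, mask_logits)`, and that
    mask logits / ground-truth masks are float tensors of shape (num_prompts, H, W).
    """
    with torch.no_grad():  # the teacher only supervises, it is never updated
        t_embed = teacher.image_encoder(image)
        t_tokens, t_masks = teacher.decode(t_embed, prompts)

    s_embed = student.image_encoder(image)
    s_tokens, s_masks = student.decode(s_embed, prompts)

    # Encoder level: match the teacher's image embeddings.
    loss_embed = F.mse_loss(s_embed, t_embed)
    # Decoder level: match intermediate output tokens.
    loss_token = F.mse_loss(s_tokens, t_tokens)
    # Output level: match the teacher's masks and fit the ground truth.
    loss_mask = F.mse_loss(s_masks, t_masks) + \
        F.binary_cross_entropy_with_logits(s_masks, gt_masks)

    # Online hard prompt sampling: keep the prompts the student handles worst
    # so that later iterations attend more to hard examples.
    per_prompt_err = (s_masks - t_masks).pow(2).flatten(1).mean(-1)
    k = max(1, int(hard_ratio * per_prompt_err.numel()))
    hard_prompt_idx = per_prompt_err.topk(k).indices

    total = w_embed * loss_embed + w_token * loss_token + w_mask * loss_mask
    return total, hard_prompt_idx
```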
Quantization
Model quantization is a popular approach in computer vision frameworks; it compresses a model by quantizing its weights or activations from a higher to a lower bit-width, aiming to reduce computational complexity and storage requirements without significantly degrading output quality.
The primary aim of quantization in TinySAM is to project a floating-point tensor to an integer tensor using a scaling factor, with the metric used to measure the distance between the original and the quantized matrix multiplication outputs playing a vital role in optimizing that scaling factor.
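A minimal sketch of this idea is shown below: a weight matrix is quantized to 8-bit integers with a scaling factor picked by a small grid search that minimizes the distance between the original and the quantized matrix multiplication. The search range, candidate count, and mean-squared-error metric are illustrative assumptions rather than the exact procedure used by TinySAM.

```python
import torch

def quantize(w, scale, n_bits=8):
    """Round-to-nearest quantization, then de-quantization so the error can be measured."""
    qmax = 2 ** (n_bits - 1) - 1
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return w_int * scale

def search_scale(w, x, n_bits=8, n_candidates=20):
    """Pick the scaling factor whose quantized matmul output is closest to the original."""
    qmax = 2 ** (n_bits - 1) - 1
    base_scale = w.abs().max() / qmax            # naive max-abs scale as the starting point
    y_ref = x @ w                                # full-precision matrix multiplication
    best_scale, best_err = base_scale, float("inf")
    for ratio in torch.linspace(0.5, 1.0, n_candidates):
        scale = base_scale * ratio
        err = (x @ quantize(w, scale, n_bits) - y_ref).pow(2).mean().item()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

w = torch.randn(256, 256)   # weight of a linear / matmul layer
x = torch.randn(64, 256)    # calibration activations
print(search_scale(w, x))
```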
Hierarchical Segment Anything
The Segment Anything Model proposes an automatic mask generator that samples points as a grid to segment everything in the image. However, using a dense point grid results in over-fine-grained segmentation outputs, requires massive computation, and incurs high operational costs. Furthermore, on the one hand, too many sampling points on a complete object can cause different sections of the object to be incorrectly segmented as separate masks; on the other hand, since the image encoder has already been shrunk significantly, the time cost of everything-mode inference now falls mainly on the mask decoder. To reduce the operational cost of the everything mode, the TinySAM framework uses a hierarchical mask generation approach; the difference between this strategy and that of the original SAM framework is demonstrated in the following image.
Unlike the approach implemented in the original SAM framework, the TinySAM model uses only 25% of the points on each side in the first round, thus utilizing only 1/16 of the points in the original setting. The model then runs the prompt encoder and the mask decoder with these prompts to get the output, filters out the masks whose confidence exceeds a certain threshold, and marks the corresponding locations as areas of potential final predictions. Since the model treats these regions as the segmentation result of high-confidence instances, there is no need to generate point prompts within them in the second round. This strategy not only helps prevent over-fine-grained segmentation of objects but also brings down operational costs and computational requirements significantly. The framework then merges and post-processes the results of the two rounds to obtain the final masks.
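The sketch below illustrates the two-round idea under simplifying assumptions: a sparse first round keeps high-confidence masks and marks their regions as occupied, and the denser second round only samples points outside those regions. The `predict_masks` callable, assumed here to run the prompt encoder and mask decoder and yield (boolean mask, confidence) pairs, as well as the grid sizes and threshold, are hypothetical placeholders rather than TinySAM's actual API.

```python
import numpy as np

def hierarchical_everything(predict_masks, img_hw, dense_side=32, conf_thresh=0.9):
    """Two-round everything-mode inference; `predict_masks` is an assumed callable."""
    H, W = img_hw
    occupied = np.zeros((H, W), dtype=bool)   # regions already covered by confident masks
    final_masks = []

    # Round 1: only 1/4 of the points per side, i.e. 1/16 of the dense grid.
    sparse_side = dense_side // 4
    ys, xs = np.linspace(0, H - 1, sparse_side), np.linspace(0, W - 1, sparse_side)
    round1_points = [(x, y) for y in ys for x in xs]
    for mask, conf in predict_masks(round1_points):
        if conf >= conf_thresh:               # confident instance: keep it, mark region as done
            final_masks.append(mask)
            occupied |= mask

    # Round 2: dense grid, but skip points falling inside already-segmented areas.
    ys, xs = np.linspace(0, H - 1, dense_side), np.linspace(0, W - 1, dense_side)
    round2_points = [(x, y) for y in ys for x in xs if not occupied[int(y), int(x)]]
    for mask, _conf in predict_masks(round2_points):
        final_masks.append(mask)

    # The results of both rounds would then be merged and post-processed (e.g. NMS).
    return final_masks
```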
TinySAM : Experiments and Results
To accelerate the distillation process, the TinySAM framework computes and stores the image embeddings from the teacher network in advance, so the model no longer needs to run the heavy image encoder of the teacher network repeatedly during the training phase. For post-training quantization, the TinySAM framework quantizes all the matrix multiply layers, the convolution layers, the deconvolution layers, and the linear layers, using channel-wise scaling factors for both the convolution and the deconvolution layers. For the matrix multiply layers, the model uses head-wise scaling factors, whereas for the linear layers, it uses linear-wise scaling factors. The model also conducts evaluation on zero-shot downstream tasks.
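To illustrate what those different scaling-factor granularities mean in practice, the short sketch below computes a per-output-channel scale for a convolution weight, a per-head scale for an attention matrix multiply, and a single per-tensor scale for a linear weight. The tensor shapes and the simple max-abs calibration are made-up examples, not TinySAM's calibration procedure.

```python
import torch

def scale_per_channel(w_conv):     # conv / deconv weights: one scale per output channel
    # w_conv: (out_channels, in_channels, kH, kW)
    return w_conv.abs().amax(dim=(1, 2, 3)) / 127.0

def scale_per_head(attn_scores):   # matmul inside attention: one scale per head
    # attn_scores: (batch, heads, seq, seq)
    return attn_scores.abs().amax(dim=(0, 2, 3)) / 127.0

def scale_per_tensor(w_linear):    # linear layers: a single scale for the whole weight
    return w_linear.abs().max() / 127.0

print(scale_per_channel(torch.randn(64, 3, 3, 3)).shape)   # torch.Size([64])
print(scale_per_head(torch.randn(2, 8, 196, 196)).shape)   # torch.Size([8])
print(scale_per_tensor(torch.randn(256, 256)).shape)       # torch.Size([])
```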
For instance segmentation tasks in a zero-shot setting, the TinySAM framework follows the experimental settings of its predecessor, the Segment Anything Model, and uses the object detection results of the ViTDet-H (Vision Transformer for Detection, Huge) framework as prompts for instance segmentation. As demonstrated in the following image, the TinySAM framework outperforms existing methods in terms of both instance segmentation accuracy and FLOPs.
Furthermore, the qualitative performance of the TinySAM model is demonstrated in the following image for zero-shot instance segmentation with the green box representing the box prompts.
In terms of zero-shot points valid mask evaluation, the TinySAM model significantly outperforms the MobileSAM framework on different datasets, and delivers substantially better results when fewer points are used as prompts.
Furthermore, the following table summarizes the results of the acceleration and decrease in computational requirements achieved as a result of the hierarchical everything mode strategy. The model applies the same stability score and threshold value with different strategies for a fair comparison, and the results are summarized below.
Final Thoughts
In this article, we have talked about TinySAM, a proposed framework that pushes the boundaries of the segment anything task, delivering an efficient model architecture with lower computational requirements and accuracy on par with the original SAM framework. TinySAM, or the Tiny Segment Anything Model, maintains the zero-shot performance of the original framework. The TinySAM framework first implements a full-stage knowledge distillation method that uses online hard prompts to distill a lightweight student model. It then adapts post-training quantization to promptable segmentation tasks, which further helps reduce computational requirements. Furthermore, the framework segments everything hierarchically, which almost doubles the inference speed without affecting performance.