Fine-Grained Image Segmentation (FGIS)

Remove Background From Images For Free

Try Erase.bg For Free

Table of Content

The Problem and Computer Vision To The Rescue

Nowadays, realistic editing of photographs demands a careful treatment of color blends that frequently occur in natural scenes. These color blends are typically modelled through a soft selection of scene or object colors. Therefore to achieve high-quality image editing and background composition, accurate representation of these soft transitions between image regions is essential.

‍

Most of the present techniques used in the industry for generating such representations heavily depend on some user interaction by skilled visual artists. So creating such an accurate saliency selection becomes an expensive and tedious task. To fulfil this void of skilled visual artists, we utilize computer vision to simulate the human vision system, which has an effective attention mechanism for determining the most salient information from visual scenes.

‍

Such a problem can also be interpreted as a foreground extraction problem, where the salient objects are considered foreground class, and the remaining scene is the background class.

‍

Computer vision and deep learning aim to model such a mechanism through selective research branches, namely image matting, salient object detection, eye-fixation detection, and soft segmentation. It is also important to note that, unlike computer vision, deep learning is mainly a data-intensive research approach.

‍

Deep learning has significantly improved foreground extraction and saliency detection baselines with the recent rise in fully convolutional networks (FCN) for image segmentation. Despite all these improvements, most of the suggested architectures use network backbones initially designed for image attribute classification tasks, extracting representative semantic signification features rather than global contrast and local detail information. But in this blog, I will discuss this issue in more detail later.

‍

Is it a Segmentation problem?

Yes, it’s a segmentation problem if we look at it from the angle of our output format. In recent years, semantic segmentation has become a key problem in computer vision and deep learning.

‍

Thus looking at the more significant scenario, we can say that semantic segmentation is one of the critical tasks in its field that paves the way towards better scene understanding. The significance of scene understanding is also highlighted by the evidence of an increasing number of applications that nourishes from inferring cognitive facts from images and videos.

‍

The three approaches we discuss for achieving a smooth and perceptually sound fine-grained semantic segmentation are:

Image Matting
Salient Object Detection (SOD)
Soft Segmentation

‍

Image Matting

‍

Image matting can be understood as a generalized version of green-screen keying used for precise estimation of foreground opacities in an unconstrained setting. Image matting is a critical topic in both computer graphics and vision applications. Earlier approaches to image matting involved large sparse matrices such as large kernel matting Laplacian and its optimization.

‍

However, these methodologies of solving such linear systems are often time-consuming and unfavoured by users. Many pieces of research tried to improve this linear system solving speed by using adaptive kernel sizes and KD-tree. Still, no significant improvement was observed regarding the quality on wild images and inference speed. Since the problem is highly ill-posed, a trimap (or strokes) indicating definite foreground, definite background, and unknown regions is usually given by the user as supportive input.

‍

Let’s first formulate a basic equation for image matting. Denoting an image pixel’s background color, foreground color, and foreground opacity (alpha matte) as B, F and α, respectively, the pixel’s color C can be written as a convex combination of B and F:

C = F (α)+ B(1 − α).

‍

Image matting methods can be categorized into three major types, propagation-based, sampling-based, and learning-based. In some approaches, a hybrid combination of sampling-based and propagation-based matting is also used.

‍

Sampling-based image matting is based on an assumption that the true background and foreground colors of the uncharted pixels can be derived from the known background and foreground pixels that are localized near that unknown pixel. Some sampling-based methods are:

‍

Shared sampling matting
Iterative matting
Bayesian matting
Sparse gcoding

‍

Propagation-based image matting techniques calculate the alpha values of the uncharted pixels by propagating the alpha values of the known local background and foreground pixels into the unknown regions. However, in the case of wild background images, the over-dependency on color knowledge leads to artefacts in images where the distribution of background and foreground color overlap. Some propagation-based methods are:

‍

Geodesic matting
Close-form matting
Poisson matting
Spectral matting

‍

Nonetheless, both sampling and propagation-based techniques are unable to provide satisfactory and completely automated results. Thus, recently, several deep learning studies have suggested approaches that either solve the above mentioned linear system through a concatenated input of trimap and RGB image into FCN or just the RGB image itself to predict final alpha matte. Some known trimap dependent deep learning architectures are:

‍

Deep Automatic Portrait Matting by Shen et al.
Deep Image Matting ( DIM ) by Xu et al.
Disentangled Image Matting by Cai et al.

‍

Whereas some trimap independent deep learning architectures are:

‍

Late Fusion Matting by Zhang et el.
Semantic Human Matting by Chen et al.
AlphaNet: An Attention Guided Deep Network for Automatic Image Matting by Sharma et al.

‍

In my personal experience, deep learning-based approaches are able to capture global semantic information and local details better than the other two approaches, also they are not biased on any crude assumption of the existence of a correlation between the known and unknown region pixels.

‍

Salient Object Detection (SOD)

‍

SOD's major objective is to segment the most salient (important) and visually attractive object in the picture. Many fields such as image segmentation and visual tracking apply SOD in a variety of applications. Similar to image matting, after the rise of Fully Convolutional Networks (FCN) for saliency detection, SOD state-of-the-art has improved significantly.

‍

Unlike Natural Image Matting, salient object detection is not as complex as it seems. The major challenges in achieving a precise salient object detection are:

‍

(1) The Saliency Localization. The saliency of a particular visual asset is generally defined over the global contrast of the entire image rather than any pixel-wise or local feature. Therefore, in order to achieve a precise SOD, the saliency detection algorithm must not only capture the global contrast of the whole image but also establish an accurate representation of the detailed structures of the foreground object. To tackle this problem, multi-level deep feature aggregation networks are used.

‍

(2) The Absence of Boundary Refinement Losses. The most common loss used for training saliency object detection models is either Intersection over Union (IoU) loss or Cross-Entropy (CE). But both of them lead to a blurry boundary detailing due to their incipiency of efficiently differentiating boundary pixels. Many studies also use the Dice-score loss, but its major purpose is to handle the biased training sets and not specifically enforcing the modelling of fine structures.

‍

Research History

There has been a rich modern history of deep learning literature for salient object detection. Some studies emphasize using deep recurrent networks with attention mechanisms for iterative refinement of some selective image sub-regions.

‍

On the other hand, some studies highlight the effectiveness of global information transfer from the deep layers of the network to the shallow end by a deep multi-path recurrent connection.

‍

Many authors like Hu et al.[1] and Wang et al.[2] proposed methods that either use recurrent fully connected networks or recurrently concatenated multi-layer in-depth features for salient object detection. These studies also show the effectiveness of iterative correction of prediction errors.

‍

In contrast to the previously mentioned research works, a few pieces of research also showed the use of a contextual attention network in U-Net architecture for predicting pixel-wise attention maps.

‍

These extracted pixel-wise attention maps are proved very effective for saliency detection in terms of evaluation metrics. Only some proposed methods emphasize on coarse to fine prediction transitioning. These methods propose refinement strategies for achieving more accurate boundary details by capturing finer structures.

‍

For example, Lu et al. proposed an architecture that captures a deep hierarchical salient representation for modelling various global structured saliency cues of the saliency maps along with a post-refinement stage.

‍

The latest published advancement in the field of salient object detection (as I write this blog) is by Qin et al., who suggest a robust deep network architecture (U^2-Net) with a two-level nested U-structure.

‍

The key improvements stated by the authors are multi-scale contextual information captivity (a mixture of receptive fields) and increased network depth (pooling in ReSidual U-blocks) without significant computational expense.

‍

In my personal experience, SOD also achieves higher quality saliency maps as natural image matting but an inferior quality in terms of transparency modeling and fine-structure extraction.

‍

Soft Segmentation

‍

Soft segmentation is defined as the decomposition of the image into two or more sections where each member pixel may own membership into two or more sections.

‍

Research History

Most of the earlier soft segmentation methods emphasize on the extraction of soft saliency maps of various homogeneous colors using either per-pixel color unmixing or global optimization. Although these extracted soft color maps are observed to be useful for many critical image editing applications such as image recoloring but similar to SOD they do not specifically respect object boundary and transition region granularity.

‍

It is interesting to note that image matting has a very close relationship with the branch of soft segmentation. In fact, some pieces of image matting literature such as Matting Laplacian are completely aligned with the key idea of soft segmentation, which is capturing a powerful representation for the local soft transition regions in the image.

‍

Given a set of user-defined regions, these methods mainly work on the idea of iteratively solving a two-layer soft segmentation problem to generate multiple layers. The work of Levin et al. on spectral matting also serves the same purpose by estimating a set of spatially connected soft segments automatically via spectral decomposition.

‍

Recent soft segmentation research by Aksoy et al. also follows the idea of spectral matting in combination with spectral decomposition and matting Laplacian. However, unlike spectral matting, their work approaches the problem from a spectral decomposition angle by fusing the local texture information with the high-level features from a deep convolutional neural network trained for scene analysis.

‍

One of their key contributions is the use of the graph like structure to enrich the eigenvectors of the corresponding Laplacian matrix by semantic objects as well as the soft transitions between them.

‍

In my personal experience, soft segmentation is a derived branch of natural image matting that combines the rich history image matting practices with the power of deep learning. Also unlike vanilla image matting, soft segmentation gives more layers of output that represent the semantically meaningful regions. But despite all these significant improvements, there remains a huge scope of improvement that still needs to be solved.

‍

Conclusion

I have explained these approaches from the point of view of solving the problem of salient foreground extraction. Still, the actual problem these methods aim to solve is very diverse and enriched in their respective research branches and contributes to the domain of Deep Computer Vision (my take on Computer Vision + Deep learning) in their way.

‍

References

[1] Xiaowei Hu, Lei Zhu, Jing Qin, Chi-Wing Fu, and PhengAnn Heng. Recurrently aggregating deep features for salient object detection. In Proceedings of AAAI-18, New Orleans, Louisiana, USA, pages 6943–6950, 2018.

‍

[2] Linzhao Wang, Lijun Wang, Huchuan Lu, Pingping Zhang, and Xiang Ruan. Salient object detection with recurrent fully convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. Thanks to Elliot Gunn.

FAQ's

What is fine-grained segmentation?

Fine-grained segmentation refers to the process of dividing an image or dataset into smaller, more specific regions or categories, often with the aim of identifying subtle differences or details within the data. This type of segmentation can be useful in a wide range of applications, such as image recognition, natural language processing, and data analysis.

What is fine-grained in image processing?

In image processing, "fine-grained" refers to the level of detail or specificity in the segmentation or classification of image elements, such as objects, features, or textures. Fine-grained processing aims to identify subtle differences or nuances within these elements, often requiring advanced algorithms and techniques to achieve high accuracy.

What are the three types of image segmentation?

The three main types of image segmentation are:‍ Thresholding-based segmentation, which uses a threshold value to separate foreground from background. Edge-based segmentation, which identifies edges or boundaries in the image to separate different regions or objects. Region-based segmentation, which groups together pixels or regions based on their similarity or proximity to each other.

Which algorithm is best for image segmentation?

There is no single algorithm that is universally considered the "best" for image segmentation, as the choice of algorithm depends on the specific application and the characteristics of the images being processed. Some commonly used algorithms for image segmentation include thresholding, k-means clustering, watershed segmentation, and convolutional neural networks (CNNs).

What is fine-grained image classification?

Fine-grained image classification is a type of image classification that focuses on distinguishing between visually similar subcategories within a broader category. This involves identifying subtle differences or features within the image that may be difficult for humans to distinguish but can be detected using advanced algorithms and techniques.

What is thresholding in image segmentation?

Thresholding is a common method of image segmentation that involves separating foreground objects from the background by using a threshold value to classify pixels as either "object" or "background". This technique is based on the idea that objects of interest are often darker or brighter than the background, and can be separated based on their pixel intensity values.