
How to remove captions from images? - Language SAM

SeulGi Hong · 3 min read

How to remove captions from video data?

Task

  • Goal
    • I want to remove the caption area from images.
    • The output should be either a bbox or a non-rectangular mask covering only the text area. The format doesn't matter, but the latter preserves more information about the remaining scene.
  • Issue
    • Prefer a generalized, pretrained model
    • Fully automatic: no human supervision (no GUI prompts)
    • Speed matters

How?

1. OCR. First of all, optical character recognition would be a reliable approach, since it is a long-standing computer vision task.

  • Scene Text Detection (Localization): get the bbox of each text region in the image (a short detection-only sketch follows this list)
  • Scene Text Recognition: not needed for this task
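
For reference, this is roughly what detection-only OCR could look like with EasyOCR, one off-the-shelf toolkit among many (my own choice, not part of the original task; the file name and confidence threshold are placeholders):

```python
# Minimal detection-only sketch using EasyOCR (pip install easyocr).
# Only the bounding boxes are kept; the recognized text itself is discarded.
import easyocr

reader = easyocr.Reader(['en'])            # loads detection + recognition models
results = reader.readtext('frame.jpg')     # list of (bbox, text, confidence)

# Each bbox is four corner points: [top-left, top-right, bottom-right, bottom-left].
text_boxes = [bbox for bbox, text, conf in results if conf > 0.3]
print(text_boxes)
```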

2. Generalized Segmentation Model. These days, "Segment Anything" is a trending topic in computer vision. Therefore, I am examining how well SAM masks text areas and surveying the follow-up studies.

Step 1. What is SAM?

I believe the scene text detection task is well-known and has already achieved satisfactory performance. Therefore, I will explore how SAM works for my specific task.

  • Generalization: SAM is general enough to cover a broad set of use cases and can be used out of the box on new image domains — whether underwater photos or cell microscopy — without requiring additional training (a capability often referred to as zero-shot transfer).
  • Time cost: the prompt encoder and mask decoder are lightweight, so inference is near real-time once the image embedding has been computed.
  • Promptable segmentation task: the model is trained on various prompt types so that it produces a valid mask even when the prompt is ambiguous.
    • e.g., foreground/background points, bounding boxes, masks, and free-form text are defined as prompts.
    • However, the free-form text prompt has not been released (as of 2023-04-06).

What the model looks like:

  • To summarize, the model is composed of three parts: an image encoder, a prompt encoder, and a mask decoder (a usage sketch follows this list).
  • Image encoder: an MAE-pretrained ViT that takes a 1024x1024 input.
  • Prompt encoder: converts any type of prompt into 256-dimensional embeddings.
    • Text: encoded with the CLIP text encoder.
    • Fg/bg points or bbox: encoded as points with positional encodings plus learned type embeddings (a bbox is represented by its two corner points).
    • Mask: downscaled by convolutions; a final 1x1 conv produces a 256-dim embedding that is added to the image embedding.
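
In practice, these three parts are exposed through the official segment_anything package: the heavy image encoder runs once per image, and the lightweight prompt encoder and decoder are then queried with points or boxes. A rough sketch, assuming the package and the ViT-H checkpoint are available; the image path and box coordinates are placeholders:

```python
# Sketch: prompting SAM with a bounding box via the official segment_anything package.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                      # image encoder runs once here

# Box prompt around the caption area, in pixel XYXY coordinates (placeholder values).
box = np.array([50, 400, 600, 460])
masks, scores, logits = predictor.predict(box=box, multimask_output=False)
print(masks.shape)                              # (1, H, W) boolean mask
```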

Dataset: SA-1B

  • Masks are stored in the COCO run-length encoding (RLE) annotation format (a decoding sketch follows this list).
  • Provides 11M images and a huge number of class-agnostic masks (over 1B), released for research purposes only (the masks do not contain category labels!).
  • The masks are auto-generated by SAM.
  • You can request a validation set, which is randomly sampled and annotated by human annotators.
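
The RLE masks can be decoded with pycocotools. A small sketch, assuming the per-image JSON layout with an annotations list whose entries carry a segmentation field (the file name is a placeholder):

```python
# Sketch: decoding SA-1B RLE masks (pip install pycocotools).
import json
from pycocotools import mask as mask_utils

with open("sa_000000.json") as f:
    data = json.load(f)

for ann in data["annotations"]:
    rle = ann["segmentation"]              # {"size": [H, W], "counts": "..."} in COCO RLE
    binary_mask = mask_utils.decode(rle)   # numpy array of shape (H, W), values 0/1
    print(binary_mask.sum(), "foreground pixels")
```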

My comments: I need a fully automatic process, but obtaining fg/bg points or a bbox would be an additional task. Luckily, I found a nice open-source project that tackles this!

However, Language SAM is not exactly the same as the original SAM architecture: it relies on a text-to-bbox approach, combining GroundingDINO with SAM.

Step 2. Let's try Language SAM

Let's go back to our task: removing the caption area. Language Segment Anything model: github link

pip install torch torchvision
pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git
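
Usage follows the project's README at the time; the image path and the exact text prompt below are my own choices, and the API may have changed in later versions:

```python
# Sketch: text-prompted segmentation with Language SAM (lang-segment-anything).
from PIL import Image
from lang_sam import LangSAM

model = LangSAM()                                   # loads GroundingDINO + SAM weights
image_pil = Image.open("frame.jpg").convert("RGB")
text_prompt = "caption text"                        # free-form phrase for GroundingDINO

masks, boxes, phrases, logits = model.predict(image_pil, text_prompt)
```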

Outputs

  • The output consists of instances, each containing a binary mask, a bounding box, a phrase, and a logit.
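
Given the binary masks, the actual removal step is up to you. One simple option (my own addition, not part of the project) is to fill the masked region with OpenCV inpainting:

```python
# Sketch: remove the caption area by inpainting the union of the predicted masks.
# Assumes `masks` comes from the LangSAM call above (tensors of shape (N, H, W)).
import cv2
import numpy as np

image_bgr = cv2.imread("frame.jpg")
combined = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
for m in masks:
    combined |= m.cpu().numpy().astype(np.uint8)    # union of all text masks

inpainted = cv2.inpaint(image_bgr, combined * 255, 3, cv2.INPAINT_TELEA)
cv2.imwrite("frame_no_caption.jpg", inpainted)
```

Cropping with the bbox alone would also work, but the mask keeps more of the surrounding scene intact, which is exactly why the non-rectangular output was preferred in the task definition.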

I didn't upload the output due to copyright issues, but I'll add some visualizations later on.
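
In the meantime, here is a generic way to overlay the predictions for a quick visual check (my own plotting code, reusing image_pil, masks, boxes, and phrases from the sketch above and assuming boxes in pixel XYXY format):

```python
# Sketch: overlay predicted masks, boxes, and phrases with matplotlib.
import numpy as np
import matplotlib.pyplot as plt

plt.imshow(image_pil)
for m, box, phrase in zip(masks, boxes, phrases):
    mask_np = m.cpu().numpy().astype(bool)
    overlay = np.zeros((*mask_np.shape, 4))          # RGBA layer, fully transparent
    overlay[mask_np] = [1.0, 0.0, 0.0, 0.5]          # semi-transparent red on the mask
    plt.imshow(overlay)

    x0, y0, x1, y1 = box.tolist()
    plt.gca().add_patch(plt.Rectangle((x0, y0), x1 - x0, y1 - y0,
                                      fill=False, edgecolor="lime", linewidth=1.5))
    plt.text(x0, max(y0 - 5, 0), phrase, color="lime", fontsize=8)
plt.axis("off")
plt.show()
```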
