How to remove captions from video data?
Task
- Goal
    - I want to remove the caption area from images
    - Output should be a bbox or a non-rectangular mask covering only the text area. The format doesn't matter much, but the latter is better for preserving more of the remaining scene
- Requirements
    - prefer a generalized pretrained model
    - fully automatic: without any supervision from a human (no GUI prompts)
    - speed matters
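For context on why a tight mask beats a bbox: once a mask is available, the removal step itself can be simple inpainting. A minimal sketch with OpenCV (the file names are placeholders):

```python
import cv2

# Hypothetical inputs: a video frame and a binary mask where 255 marks text pixels.
frame = cv2.imread("frame.png")
mask = cv2.imread("caption_mask.png", cv2.IMREAD_GRAYSCALE)

# Telea inpainting fills the masked region from the surrounding scene;
# a tight, non-rectangular mask leaves more of the original content untouched.
restored = cv2.inpaint(frame, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
cv2.imwrite("frame_clean.png", restored)
```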
How?
1. OCR
First of all, optical character recognition would be a natural and trustworthy approach, since text detection has been a long-standing computer vision task.
- Scene Text Detection (Localization): to get the bbox of each text region in the image (see the detection sketch below)
    - e.g., CRAFT (https://github.com/clovaai/CRAFT-pytorch) provides a pre-trained model
- Scene Text Recognition: I don't need this part
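As a quick way to try this without wiring up the CRAFT repo directly, EasyOCR bundles CRAFT as its detection stage. A minimal sketch (the confidence threshold is an arbitrary choice of mine):

```python
import easyocr

# EasyOCR uses CRAFT internally for the detection (localization) stage.
reader = easyocr.Reader(["en"], gpu=True)

# readtext runs detection + recognition; here we keep only the detected boxes.
results = reader.readtext("frame.png")
text_boxes = [bbox for bbox, text, conf in results if conf > 0.3]
# Each bbox is four [x, y] corner points, so a region can be a tilted quadrilateral.
```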
2. Generalized Segmentation Model
These days, "Segment Anything" is a trending topic in computer vision. Therefore, I am examining how well SAM masks text areas and surveying its follow-up studies.
Step 1. What is SAM?
I believe the scene text detection task is well-known and has already achieved satisfactory performance. Therefore, I will explore how SAM works for my specific task.
- Generalization: SAM is general enough to cover a broad set of use cases and can be used out of the box on new image domains — whether underwater photos or cell microscopy — without requiring additional training (a capability often referred to as zero-shot transfer).
- Time Cost: the prompt encoder and mask decoder are lightweight, enabling real-time interactive inference once the image embedding is computed
- Promptable Segmentation Task: the model is trained to take various prompts and return a valid mask even for ambiguous inputs (see the sketch after this list)
    - e.g., foreground/background points, bounding boxes, masks, and free-form text are defined as prompts.
    - However, the free-form text prompt has not been released (as of 2023-04-06)
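To make the prompt types concrete, here is how they map onto the official `segment_anything` predictor API (a sketch; the image path and all coordinates are made up):

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint and embed the image once (the expensive step).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
image = np.array(Image.open("frame.png").convert("RGB"))
predictor.set_image(image)

# Point prompts: label 1 = foreground, 0 = background.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[320, 40], [320, 400]]),
    point_labels=np.array([1, 0]),
    multimask_output=True,  # 3 candidate masks to handle ambiguous prompts
)

# Box prompt in XYXY pixel coordinates.
masks, scores, logits = predictor.predict(box=np.array([100, 20, 540, 60]))
```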
What the model looks like:
- To summarize, the model is composed of 3 parts: an image encoder, a prompt encoder, and a mask decoder (see the sketch after this list).
    - Image encoder: an MAE (Masked Auto-Encoder) pretrained ViT that takes a 1024x1024 input
    - Prompt encoder: converts any type of prompt into a 256-dimensional embedding
        - Text: uses the CLIP text embedding
        - Fg/bg points or bbox: encoded with positional encodings plus learned embeddings; a bbox is represented by its 2 corner points
        - Mask: a small CNN downscales the mask, and a final 1x1 conv produces the 256-dim embedding
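The three parts are visible directly in the official implementation; a small sketch to confirm (the ViT-B checkpoint name is the one published by Meta):

```python
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# The Sam module is literally composed of the three sub-networks described above.
print(type(sam.image_encoder).__name__)   # ImageEncoderViT (MAE-pretrained ViT)
print(type(sam.prompt_encoder).__name__)  # PromptEncoder (points/boxes/masks -> 256-dim)
print(type(sam.mask_decoder).__name__)    # MaskDecoder
```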
Dataset: SA-1B
- masks are stored in the COCO run-length encoding (RLE) annotation format
- provides 11M images and 1.1B class-agnostic masks, for research purposes only (the masks don't contain category labels!)
- masks were auto-generated by SAM
- you can request the validation set, which is randomly sampled and annotated by human annotators
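Since the masks ship in COCO RLE, `pycocotools` decodes them into binary arrays. A sketch assuming one of SA-1B's per-image JSON files (the file name is a placeholder):

```python
import json
from pycocotools import mask as mask_utils

# SA-1B provides one JSON per image; each annotation holds a COCO-RLE mask.
with open("sa_000001.json") as f:
    record = json.load(f)

for ann in record["annotations"]:
    binary_mask = mask_utils.decode(ann["segmentation"])  # HxW uint8, no class label
```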
My comments: I need a fully automatic process, but finding fg/bg points or a bbox would be an additional task. Luckily, I found a nice open-source project to tackle this!
However, Language SAM is not exactly the same as the original SAM architecture: it relies on a text-to-bbox approach, combining SAM with GroundingDINO.
- GroundingDINO
- Segment-Anything
- A GUI is available from Lightning AI! (but I didn't check this)
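Before trying Language SAM, note that the two steps could also be chained by hand: feed the detector's boxes to SAM as box prompts. A sketch reusing the EasyOCR detector from earlier (Language SAM replaces this detection step with GroundingDINO's text-to-bbox):

```python
import easyocr
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

image = np.array(Image.open("frame.png").convert("RGB"))

# 1) Text detection: quadrilateral corner points per text region.
detections = easyocr.Reader(["en"]).readtext(image)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

# 2) Each detected region becomes a SAM box prompt; union the resulting masks.
caption_mask = np.zeros(image.shape[:2], dtype=bool)
for quad, _text, _conf in detections:
    quad = np.array(quad)
    box = np.array([quad[:, 0].min(), quad[:, 1].min(),
                    quad[:, 0].max(), quad[:, 1].max()])
    masks, _, _ = predictor.predict(box=box, multimask_output=False)
    caption_mask |= masks[0]
```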
Step 2. Let's try Language SAM
Let's go back to our task: removing the caption area. Language segment anything model: https://github.com/luca-medeiros/lang-segment-anything
```
pip install torch torchvision
pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git
```
Outputs
- The output consists of instances, each containing a binary mask, a bounding box, a phrase, and a logit (see the sketch below).
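A minimal usage sketch following the project README (the text prompt here is just my guess for the caption task):

```python
from PIL import Image
from lang_sam import LangSAM

model = LangSAM()  # downloads the GroundingDINO and SAM weights on first use
image_pil = Image.open("frame.png").convert("RGB")

# One entry per detected instance matching the text prompt.
masks, boxes, phrases, logits = model.predict(image_pil, "text")
```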
I didn't upload the outputs due to copyright issues, but I'll add some visualizations later on.
References
- Language SAM: https://github.com/luca-medeiros/lang-segment-anything
- Lightning AI tutorial: https://lightning.ai/pages/community/lang-segment-anything-object-detection-and-segmentation-with-text-prompt/
- Facebook post (Meta AI's SAM announcement)
- (KR) Explanation of SAM details: https://blog.annotation-ai.com/segment-anything/