
How to remove captions from images? - Language SAM

SeulGi Hong · 3 min read

How to remove captions from video data?

Task

  • Goal
    • I want to remove the caption area from images.
    • The output should be either a bbox or a non-rectangular mask covering only the text area. The format doesn't matter, but the latter preserves more information about the remaining scene.
  • Issue
    • Prefer a generalized, pretrained model
    • Fully automatic: no human supervision (no GUI prompts)
    • Speed matters

How?

1. OCR. First of all, optical character recognition would be a reliable approach, since it is a long-standing computer vision task.

  • Scene Text Detection (Localization): get the bbox of each text region in the image (a short detection-only sketch follows this list)
  • Scene Text Recognition: not needed for this task
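
For reference, this is roughly what detection-only OCR could look like with EasyOCR, one off-the-shelf toolkit among many (my own choice, not part of the original task; the file name and confidence threshold are placeholders):

```python
# Minimal detection-only sketch using EasyOCR (pip install easyocr).
# Only the bounding boxes are kept; the recognized text itself is discarded.
import easyocr

reader = easyocr.Reader(['en'])            # loads detection + recognition models
results = reader.readtext('frame.jpg')     # list of (bbox, text, confidence)

# Each bbox is four corner points: [top-left, top-right, bottom-right, bottom-left].
text_boxes = [bbox for bbox, text, conf in results if conf > 0.3]
print(text_boxes)
```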

2. Generalized Segmentation Model. These days, "Segment Anything" is a trending topic in computer vision. Therefore, I am examining how well SAM masks text areas and surveying the follow-up studies.

Step 1. What is SAM?

I believe the scene text detection task is well-known and has already achieved satisfactory performance. Therefore, I will explore how SAM works for my specific task.

  • Generalization: SAM is general enough to cover a broad set of use cases and can be used out of the box on new image domains — whether underwater photos or cell microscopy — without requiring additional training (a capability often referred to as zero-shot transfer).
  • Time cost: the prompt encoder and mask decoder are lightweight, so inference is near real-time once the image embedding has been computed.
  • Promptable segmentation task: the model is trained on various prompt types so that it produces a valid mask even when the prompt is ambiguous.
    • e.g., foreground/background points, bounding boxes, masks, and free-form text are defined as prompts.
    • However, the free-form text prompt has not been released (as of 2023-04-06).

What the model looks like:

  • To summarize, the model is composed of three parts: an image encoder, a prompt encoder, and a mask decoder (a usage sketch follows this list).
  • Image encoder: an MAE-pretrained ViT that takes a 1024x1024 input.
  • Prompt encoder: converts any type of prompt into 256-dimensional embeddings.
    • Text: encoded with the CLIP text encoder.
    • Fg/bg points or bbox: encoded as points with positional encodings plus learned type embeddings (a bbox is represented by its two corner points).
    • Mask: downscaled by convolutions; a final 1x1 conv produces a 256-dim embedding that is added to the image embedding.
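
In practice, these three parts are exposed through the official segment_anything package: the heavy image encoder runs once per image, and the lightweight prompt encoder and decoder are then queried with points or boxes. A rough sketch, assuming the package and the ViT-H checkpoint are available; the image path and box coordinates are placeholders:

```python
# Sketch: prompting SAM with a bounding box via the official segment_anything package.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                      # image encoder runs once here

# Box prompt around the caption area, in pixel XYXY coordinates (placeholder values).
box = np.array([50, 400, 600, 460])
masks, scores, logits = predictor.predict(box=box, multimask_output=False)
print(masks.shape)                              # (1, H, W) boolean mask
```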

Dataset: SA-1B

  • Masks are stored in the COCO run-length encoding (RLE) annotation format (a decoding sketch follows this list).
  • Provides 11M images and a huge number of class-agnostic masks (over 1B), released for research purposes only (the masks do not contain category labels!).
  • The masks are auto-generated by SAM.
  • You can request a validation set, which is randomly sampled and annotated by human annotators.
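
The RLE masks can be decoded with pycocotools. A small sketch, assuming the per-image JSON layout with an annotations list whose entries carry a segmentation field (the file name is a placeholder):

```python
# Sketch: decoding SA-1B RLE masks (pip install pycocotools).
import json
from pycocotools import mask as mask_utils

with open("sa_000000.json") as f:
    data = json.load(f)

for ann in data["annotations"]:
    rle = ann["segmentation"]              # {"size": [H, W], "counts": "..."} in COCO RLE
    binary_mask = mask_utils.decode(rle)   # numpy array of shape (H, W), values 0/1
    print(binary_mask.sum(), "foreground pixels")
```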

My comments: I need a fully automatic process, but obtaining fg/bg points or a bbox would be an additional task. Luckily, I found a nice open-source project that tackles this!

However, Language SAM is not exactly the same as the original SAM architecture: it relies on a text-to-bbox approach, combining GroundingDINO with SAM.

Step 2. Let's try Language SAM

Let's go back to our task: removing the caption area. Language Segment Anything model: github link

pip install torch torchvision
pip install -U git+https://github.com/luca-medeiros/lang-segment-anything.git
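
Usage follows the project's README at the time; the image path and the exact text prompt below are my own choices, and the API may have changed in later versions:

```python
# Sketch: text-prompted segmentation with Language SAM (lang-segment-anything).
from PIL import Image
from lang_sam import LangSAM

model = LangSAM()                                   # loads GroundingDINO + SAM weights
image_pil = Image.open("frame.jpg").convert("RGB")
text_prompt = "caption text"                        # free-form phrase for GroundingDINO

masks, boxes, phrases, logits = model.predict(image_pil, text_prompt)
```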

Outputs

  • The output consists of instances, each containing a binary mask, a bounding box, a phrase, and a logit.
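
Given the binary masks, the actual removal step is up to you. One simple option (my own addition, not part of the project) is to fill the masked region with OpenCV inpainting:

```python
# Sketch: remove the caption area by inpainting the union of the predicted masks.
# Assumes `masks` comes from the LangSAM call above (tensors of shape (N, H, W)).
import cv2
import numpy as np

image_bgr = cv2.imread("frame.jpg")
combined = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
for m in masks:
    combined |= m.cpu().numpy().astype(np.uint8)    # union of all text masks

inpainted = cv2.inpaint(image_bgr, combined * 255, 3, cv2.INPAINT_TELEA)
cv2.imwrite("frame_no_caption.jpg", inpainted)
```

Cropping with the bbox alone would also work, but the mask keeps more of the surrounding scene intact, which is exactly why the non-rectangular output was preferred in the task definition.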

I didn't upload the output due to copyright issues, but I'll add some visualizations later on.
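
In the meantime, here is a generic way to overlay the predictions for a quick visual check (my own plotting code, reusing image_pil, masks, boxes, and phrases from the sketch above and assuming boxes in pixel XYXY format):

```python
# Sketch: overlay predicted masks, boxes, and phrases with matplotlib.
import numpy as np
import matplotlib.pyplot as plt

plt.imshow(image_pil)
for m, box, phrase in zip(masks, boxes, phrases):
    mask_np = m.cpu().numpy().astype(bool)
    overlay = np.zeros((*mask_np.shape, 4))          # RGBA layer, fully transparent
    overlay[mask_np] = [1.0, 0.0, 0.0, 0.5]          # semi-transparent red on the mask
    plt.imshow(overlay)

    x0, y0, x1, y1 = box.tolist()
    plt.gca().add_patch(plt.Rectangle((x0, y0), x1 - x0, y1 - y0,
                                      fill=False, edgecolor="lime", linewidth=1.5))
    plt.text(x0, max(y0 - 5, 0), phrase, color="lime", fontsize=8)
plt.axis("off")
plt.show()
```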
