Using Grounded SAM 1 and 2

FSA · August 22, 2024


1. Grounded SAM 1

https://github.com/IDEA-Research/Grounded-Segment-Anything?tab=readme-ov-file#install-without-docker

1.1. Install without Docker

Set the environment variables (local GPU environment):

export AM_I_DOCKER=False
export BUILD_WITH_CUDA=True
export CUDA_HOME=/usr/local/cuda-12.1/
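
Before building, it can help to confirm that the variables above are actually visible to the build process. The following is a minimal sketch, not part of the Grounded-Segment-Anything repository: `check_cuda_env` is a hypothetical helper, and it assumes a standard CUDA toolkit layout in which `nvcc` sits under `$CUDA_HOME/bin`.

```python
import os
from pathlib import Path


def check_cuda_env(env=None):
    """Return a list of human-readable problems with the CUDA build settings."""
    env = os.environ if env is None else env
    problems = []

    # BUILD_WITH_CUDA must be "True" for the GPU build described above.
    if env.get("BUILD_WITH_CUDA") != "True":
        problems.append("BUILD_WITH_CUDA is not set to True")

    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        problems.append("CUDA_HOME is not set")
    elif not (Path(cuda_home) / "bin" / "nvcc").exists():
        # Assumption: nvcc lives under $CUDA_HOME/bin in a standard install.
        problems.append(f"nvcc not found under {cuda_home}")

    return problems


if __name__ == "__main__":
    for problem in check_cuda_env():
        print("WARNING:", problem)
```

An empty list means both settings look usable; anything else is worth fixing before running the pip install steps.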

Install Segment Anything:

python -m pip install -e segment_anything

Install Grounding DINO:

pip install --no-build-isolation -e GroundingDINO

The following optional dependencies are necessary for
- mask post-processing,
- saving masks in COCO format,
- the example notebooks, and
- exporting the model in ONNX format.
jupyter is also required to run the example notebooks.

pip install opencv-python pycocotools matplotlib onnxruntime onnx ipykernel
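
To see which of these optional packages are still missing, a small check like the one below may be useful. It is only a sketch: `missing_packages` is a hypothetical helper, and the import names in the table (for example `cv2` for `opencv-python`) are the usual ones rather than something stated in this post.

```python
from importlib.util import find_spec

# pip package name -> import name (opencv-python is imported as cv2).
OPTIONAL_MODULES = {
    "opencv-python": "cv2",
    "pycocotools": "pycocotools",
    "matplotlib": "matplotlib",
    "onnxruntime": "onnxruntime",
    "onnx": "onnx",
    "ipykernel": "ipykernel",
}


def missing_packages(modules=OPTIONAL_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [pkg for pkg, mod in modules.items() if find_spec(mod) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```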

More details
- install segment anything
  - https://github.com/facebookresearch/segment-anything#installation
- install GroundingDINO
  - https://github.com/IDEA-Research/GroundingDINO#install

2. Grounded SAM 2

https://github.com/IDEA-Research/Grounded-SAM-2

2.1. Installation

Download the pretrained SAM 2 checkpoints:

cd checkpoints
bash download_ckpts.sh

Download the pretrained Grounding DINO checkpoints:

cd gdino_checkpoints
bash download_ckpts.sh
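
After both download scripts finish, a quick listing can confirm that checkpoint files actually landed in the two folders. This is a sketch under assumptions: `list_checkpoints` is a hypothetical helper, and the `.pt` / `.pth` suffixes are a guess at how the downloaded files are named.

```python
from pathlib import Path


def list_checkpoints(directory, suffixes=(".pt", ".pth")):
    """Return (file name, size in MiB) for each checkpoint file in a directory."""
    root = Path(directory)
    if not root.is_dir():
        return []
    found = []
    for path in sorted(root.iterdir()):
        # Assumption: checkpoints use a .pt or .pth suffix.
        if path.is_file() and path.suffix in suffixes:
            found.append((path.name, path.stat().st_size / 2**20))
    return found


if __name__ == "__main__":
    for folder in ("checkpoints", "gdino_checkpoints"):
        files = list_checkpoints(folder)
        print(folder, files if files else "no checkpoint files found")
```

A very small file size usually points to an interrupted download, so the size column is worth a glance.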

2.1.1. Installation without docker

Install the PyTorch environment first. The environment used to run this demo:
- python=3.10
- torch>=2.3.1
- torchvision>=0.18.1
- cuda-12.1

The most recommended way to install PyTorch:

pip3 install torch torchvision torchaudio
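
The minimum versions listed above can be checked without importing torch itself. The sketch below compares installed package versions against those minimums; `parse_version` and `meets_minimum` are hypothetical helpers written for this post, and the parsing deliberately ignores local suffixes such as `+cu121`.

```python
import re


def parse_version(text):
    """Turn '2.3.1+cu121' into (2, 3, 1), ignoring any local/build suffix."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


# Minimum versions quoted in the list above.
MINIMUMS = {"torch": "2.3.1", "torchvision": "0.18.1"}

if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version

    for package, minimum in MINIMUMS.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            print(f"{package}: not installed")
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"needs >= {minimum}"
        print(f"{package} {installed}: {status}")
```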
Grounding DINO uses a Deformable Attention operator that has to be compiled with CUDA, so check that the CUDA environment variables are set correctly (see the Grounding DINO installation guide for details).
To build a local GPU environment for Grounding DINO to run Grounded SAM 2, you can set the variable manually:

export CUDA_HOME=/usr/local/cuda-12.1/

Install Segment Anything 2:

pip install -e .

Install Grounding DINO:

pip install --no-build-isolation -e grounding_dino

Grounded SAM 2 Demos

Grounded SAM 2 Image Demo (with Grounding DINO)

Grounding DINO is already supported through Huggingface, so there are two options for running the Grounded SAM 2 model.
[Option 1] Use the Huggingface API to run Grounding DINO inference (simple and clear):

python grounded_sam2_hf_model_demo.py

[!NOTE]
🚨 If you encounter network issues while using the HuggingFace model, you can resolve them by pointing to a mirror: export HF_ENDPOINT=https://hf-mirror.com
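
The same mirror setting can be applied from Python instead of the shell. This is a minimal sketch: `use_hf_mirror` is a hypothetical helper, and it assumes the HuggingFace libraries read `HF_ENDPOINT` when they are first imported, so the variable has to be set before those imports.

```python
import os

MIRROR = "https://hf-mirror.com"  # mirror URL quoted in the note above


def use_hf_mirror(endpoint=MIRROR, env=os.environ):
    """Set HF_ENDPOINT unless one is already chosen; return the value in use."""
    return env.setdefault("HF_ENDPOINT", endpoint)


if __name__ == "__main__":
    # Call this before importing transformers / huggingface_hub
    # (assumption: the endpoint is read at import time).
    print("HF_ENDPOINT =", use_hf_mirror())
```

Using `setdefault` keeps any endpoint that was already exported in the shell, so the helper never overrides an explicit choice.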

[Option 2] Load a local pretrained Grounding DINO checkpoint and run inference with the original Grounding DINO API
- (make sure you have already downloaded the pretrained checkpoint)

python grounded_sam2_local_demo.py

TODO: study this later

Grounded SAM 2 Image Demo (with Grounding DINO 1.5 & 1.6)

Grounding DINO 1.5 & 1.6 are the authors' most capable open-set detection models.
To run Grounded SAM 2 with Grounding DINO 1.5, first apply for an API token, then proceed as follows.
Install the latest DDS cloudapi:

pip install dds-cloudapi-sdk

Apply for your API token on the official website (the "request API token" link in the repository README), then run:

python grounded_sam2_gd1.5_demo.py

Grounded SAM 2 Video Object Tracking Demo

Based on the strong tracking capability of SAM 2, we can combine it with Grounding DINO for open-set object segmentation and tracking. Run the following script to get the tracking results with Grounded SAM 2:

python grounded_sam2_tracking_demo.py

The tracking results for each frame are saved in ./tracking_results.
The video is saved as children_tracking_demo_video.mp4.
You can edit this file to use different text prompts and video clips and get more tracking results.
Only the first video frame is prompted with Grounding DINO here, to keep usage simple.

Supported Prompt Types for Tracking

The Grounded SAM 2 tracking demo supports the following prompt types:

- Point Prompt: To get stable segmentation results, the SAM 2 image predictor is re-used to predict a mask for each object from the Grounding DINO box outputs; points are then uniformly sampled from each predicted mask and used as point prompts for the SAM 2 video predictor.
- Box Prompt: The box outputs from Grounding DINO are used directly as box prompts for the SAM 2 video predictor.
- Mask Prompt: The SAM 2 mask predictions based on the Grounding DINO box outputs are used as mask prompts for the SAM 2 video predictor.
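
The point-prompt idea (uniformly sampling points from a predicted mask) can be sketched in a few lines of NumPy. This is an illustration only, not the repository's own sampling code: `sample_points_from_mask` and its `num_points` and `seed` parameters are hypothetical.

```python
import numpy as np


def sample_points_from_mask(mask, num_points=10, seed=0):
    """Uniformly sample (x, y) pixel coordinates from the True region of a mask."""
    ys, xs = np.nonzero(mask)          # row (y) and column (x) indices inside the mask
    if len(xs) == 0:
        return np.empty((0, 2), dtype=np.int64)
    rng = np.random.default_rng(seed)
    count = min(num_points, len(xs))   # never ask for more points than mask pixels
    chosen = rng.choice(len(xs), size=count, replace=False)
    return np.stack([xs[chosen], ys[chosen]], axis=1)


if __name__ == "__main__":
    mask = np.zeros((8, 8), dtype=bool)
    mask[2:5, 3:6] = True              # a 3x3 object
    print(sample_points_from_mask(mask, num_points=4))
```

The points come back in (x, y) order, which is the convention I would expect a point-prompt API to use, but check the predictor you call before relying on it.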

Grounded SAM 2 Video Object Tracking Demo (with Grounding DINO 1.5 & 1.6)

Users can upload their own video file (e.g. assets/hippopotamus.mp4) and specify a custom text prompt for grounding and tracking with Grounding DINO 1.5 and SAM 2.
Use the following script:

python grounded_sam2_tracking_demo_custom_video_input_gd1.5.py

You can specify the following parameters in this file.
The tracking visualization is saved automatically to OUTPUT_VIDEO_PATH.

VIDEO_PATH = "./assets/hippopotamus.mp4"  # path to your video file
TEXT_PROMPT = "hippopotamus."  # text prompt for the object to track
OUTPUT_VIDEO_PATH = "./hippopotamus_tracking_demo.mp4"  # output video path
API_TOKEN_FOR_GD1_5 = "Your API token"  # API token for Grounding DINO 1.5
PROMPT_TYPE_FOR_VIDEO = "mask"  # use the SAM 2 mask prediction as the prompt for the video predictor

Notes

The box prompt is initialized on the first frame of the input video.
To start from a different frame, edit ann_frame_idx in the code.
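
Since these settings are edited by hand, a small validation layer can catch typos before a long run. The sketch below is not part of the demo script: `TrackingConfig`, `normalize_text_prompt`, and the `GD15_API_TOKEN` environment variable are all hypothetical names, and it assumes that the lowercase, period-terminated style of the example prompt is what the detector expects.

```python
import os
from dataclasses import dataclass

PROMPT_TYPES = ("point", "box", "mask")  # the three prompt types described earlier


def normalize_text_prompt(text):
    """Lowercase the prompt and make sure it ends with a period."""
    cleaned = text.strip().lower()
    if not cleaned:
        raise ValueError("text prompt must not be empty")
    return cleaned if cleaned.endswith(".") else cleaned + "."


@dataclass
class TrackingConfig:
    video_path: str
    text_prompt: str
    output_video_path: str
    prompt_type: str = "mask"
    api_token: str = ""

    def __post_init__(self):
        if self.prompt_type not in PROMPT_TYPES:
            raise ValueError(f"prompt_type must be one of {PROMPT_TYPES}")
        self.text_prompt = normalize_text_prompt(self.text_prompt)
        # GD15_API_TOKEN is a made-up variable name for this sketch.
        self.api_token = self.api_token or os.environ.get("GD15_API_TOKEN", "")


if __name__ == "__main__":
    cfg = TrackingConfig("./assets/hippopotamus.mp4", "Hippopotamus",
                         "./hippopotamus_tracking_demo.mp4")
    print(cfg.text_prompt, cfg.prompt_type)
```

Reading the token from the environment keeps it out of the source file, which matters if the script is ever committed or shared.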

Grounded SAM 2 Video Object Tracking with Continuous ID (with Grounding DINO)

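The demos above prompt Grounded SAM 2 only on a specific frame.
- That makes it hard to find new objects over the course of the video.
- Objects that first appear in later frames are not detected.
- Even if a new object shows up partway through, only the objects specified at the start are tracked.
- As a result, newly appearing objects cannot be detected and tracked automatically across the whole video.
Continuous ID assignment addresses this: whenever a new object appears in the video, it is given a new ID and tracked from then on.
- Objects that appear as the video progresses are tracked too, so no object is missed.
- This feature is still under development and is not yet fully stable.
Users can upload their own video file and specify a custom text prompt for grounding and tracking with the Grounding DINO and SAM 2 framework.
Run the following script: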

python grounded_sam2_tracking_demo_with_continuous_id.py

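Several parameters can be customized:
- text: the grounding text prompt.
- video_dir: the directory containing the video files.
- output_dir: the directory where processed output is saved.
- output_video_path: the output video path.
- step: the frame interval to process.
- box_threshold: the box threshold for the Grounding DINO model.
- text_threshold: the text threshold for the Grounding DINO model.
Note: this method supports only the mask prompt type.
To try the Grounding DINO 1.5 model, set your API token and then run the following script: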

python grounded_sam2_tracking_demo_with_continuous_id_gd1.5.py
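
The continuous-ID idea in this section can be illustrated with a small matching routine: detections that overlap an already tracked object keep its ID, and anything else gets a fresh one. This is a simplified sketch, not the repository's implementation; it matches boxes rather than masks for brevity, and `box_iou`, `assign_ids`, and the 0.5 threshold are hypothetical choices.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_ids(tracked, detections, next_id, iou_threshold=0.5):
    """Match detections to tracked boxes; unmatched detections get new IDs.

    tracked: dict of object ID -> box carried over from earlier frames.
    Returns (dict of object ID -> detection box, next unused ID).
    """
    assigned = {}
    free_ids = set(tracked)
    for det in detections:
        # Greedy match: best-overlapping tracked object not yet claimed.
        best_id = max(free_ids, key=lambda i: box_iou(tracked[i], det), default=None)
        if best_id is not None and box_iou(tracked[best_id], det) >= iou_threshold:
            assigned[best_id] = det
            free_ids.remove(best_id)
        else:
            assigned[next_id] = det   # a newly appeared object
            next_id += 1
    return assigned, next_id


if __name__ == "__main__":
    tracked, next_id = {}, 1
    # Detections from two successive prompted frames (made-up boxes).
    for dets in ([(0, 0, 10, 10)], [(1, 1, 11, 11), (50, 50, 60, 60)]):
        tracked, next_id = assign_ids(tracked, dets, next_id)
        print(sorted(tracked))
```

In a full pipeline this matching would run once every `step` frames, with the tracker carrying the assigned IDs through the frames in between.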

Grounded SAM 2 Video Object Tracking with Continuous ID and Reverse Tracking (with Grounding DINO)

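This method makes it simple to track the whole lifetime of an object.
Reverse tracking means that an object is tracked not only from the frame where it was first detected,
- but also back through the frames before that detection.
In other words, besides finding new objects through continuous ID assignment, the method checks whether each object was already present in earlier frames and tracks it there as well. This covers the object's whole lifetime rather than only the part after its first detection.

Instead of tracking strictly from the start of the video to the end, the method goes back past the point where an object was first detected when needed, which makes tracking more accurate.
As a result, the whole lifetime of an object can be tracked across the video no matter where it first appeared.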

python grounded_sam2_tracking_demo_with_continuous_id_plus.py

With this method, new objects are identified and tracked throughout the video, and reverse tracking makes the object tracking more accurate.
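
To make the difference between forward-only and reverse tracking concrete, the sketch below lists which frames are visited for an object first detected at a given frame. It is only an illustration of the idea described in this section: `propagation_plan` and its parameters are hypothetical and do not call SAM 2.

```python
def propagation_plan(first_seen, num_frames, reverse=True):
    """Frame indices visited when tracking an object first detected at first_seen.

    Forward tracking covers first_seen .. num_frames - 1. With reverse=True the
    earlier frames first_seen - 1 .. 0 are visited as well, in descending order.
    """
    if not 0 <= first_seen < num_frames:
        raise ValueError("first_seen must be a valid frame index")
    forward = list(range(first_seen, num_frames))
    backward = list(range(first_seen - 1, -1, -1)) if reverse else []
    return forward, backward


if __name__ == "__main__":
    forward, backward = propagation_plan(first_seen=3, num_frames=6)
    print("forward:", forward)
    print("backward:", backward)
```

Without the backward pass, frames 0 to 2 would carry no mask for an object detected at frame 3, even if it was partly visible there.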

Grounded SAM 2 Florence-2 Demos

Grounded SAM 2 Florence-2 Image Demo

Florence-2 is a powerful vision foundation model from Microsoft that supports a range of vision tasks.
Each task is selected with a special task prompt (task_prompt), for example:

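| Task | Task Prompt | Text Input | Task Description |
| --- | --- | --- | --- |
| Object Detection | `<OD>` | ✘ | Detect main objects with a single category name |
| Dense Region Caption | `<DENSE_REGION_CAPTION>` | ✘ | Detect main objects with a short description |
| Region Proposal | `<REGION_PROPOSAL>` | ✘ | Generate proposals without category names |
| Phrase Grounding | `<CAPTION_TO_PHRASE_GROUNDING>` | ✔ | Ground the main objects in the image that are mentioned in the caption |
| Referring Expression Segmentation | `<REFERRING_EXPRESSION_SEGMENTATION>` | ✔ | Ground the object most related to the text input |
| Open Vocabulary Detection and Segmentation | `<OPEN_VOCABULARY_DETECTION>` | ✔ | Ground any object named in the text input |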

Integrating Florence-2 with SAM 2 gives a strong vision pipeline for complex vision tasks. Run the following scripts to try the demos:

Note

🚨 If you run into network issues while using the HuggingFace model, setting HF_ENDPOINT=https://hf-mirror.com can resolve them.

Object detection and segmentation

python grounded_sam2_florence2_image_demo.py \
    --pipeline object_detection_segmentation \
    --image_path ./notebooks/images/cars.jpg

Dense region caption and segmentation

python grounded_sam2_florence2_image_demo.py \
    --pipeline dense_region_caption_segmentation \
    --image_path ./notebooks/images/cars.jpg

Region proposal and segmentation

python grounded_sam2_florence2_image_demo.py \
    --pipeline region_proposal_segmentation \
    --image_path ./notebooks/images/cars.jpg

Phrase grounding and segmentation

python grounded_sam2_florence2_image_demo.py \
    --pipeline phrase_grounding_segmentation \
    --image_path ./notebooks/images/cars.jpg \
    --text_input "The image shows two vintage Chevrolet cars parked side by side, with one being a red convertible and the other a pink sedan, \
            set against the backdrop of an urban area with a multi-story building and trees. \
            The cars have Cuban license plates, indicating a location likely in Cuba."

Referring expression segmentation

Important

python grounded_sam2_florence2_image_demo.py \
    --pipeline referring_expression_segmentation \
    --image_path ./notebooks/images/cars.jpg \
    --text_input "The left red car."

Open-vocabulary detection and segmentation

Important

python grounded_sam2_florence2_image_demo.py \
    --pipeline open_vocabulary_detection_segmentation \
    --image_path ./notebooks/images/cars.jpg \
    --text_input "car <and> building"

Note: To detect multiple objects, separate them with <and> in the input text.
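
The six commands above differ only in the pipeline name and in whether a text input is needed, so they can be generated from one table. The sketch below is a convenience wrapper, not part of the repository: `PIPELINES`, `join_categories`, and `build_command` are hypothetical, and the pairing of pipeline names with task prompts is inferred from their matching names rather than confirmed against the script.

```python
import shlex

# Pipeline name -> (Florence-2 task prompt, whether --text_input is needed).
# Assumption: the pairing is inferred from the table and commands above.
PIPELINES = {
    "object_detection_segmentation": ("<OD>", False),
    "dense_region_caption_segmentation": ("<DENSE_REGION_CAPTION>", False),
    "region_proposal_segmentation": ("<REGION_PROPOSAL>", False),
    "phrase_grounding_segmentation": ("<CAPTION_TO_PHRASE_GROUNDING>", True),
    "referring_expression_segmentation": ("<REFERRING_EXPRESSION_SEGMENTATION>", True),
    "open_vocabulary_detection_segmentation": ("<OPEN_VOCABULARY_DETECTION>", True),
}


def join_categories(categories):
    """Join category names with the <and> separator used for open-vocabulary input."""
    return " <and> ".join(c.strip() for c in categories)


def build_command(pipeline, image_path, text_input=None):
    """Build the demo command line, checking the text-input requirement."""
    _, needs_text = PIPELINES[pipeline]
    if needs_text and not text_input:
        raise ValueError(f"{pipeline} requires text_input")
    args = ["python", "grounded_sam2_florence2_image_demo.py",
            "--pipeline", pipeline, "--image_path", image_path]
    if needs_text:
        args += ["--text_input", text_input]
    return shlex.join(args)  # quotes arguments that contain spaces or < >


if __name__ == "__main__":
    print(build_command("open_vocabulary_detection_segmentation",
                        "./notebooks/images/cars.jpg",
                        join_categories(["car", "building"])))
```

`shlex.join` takes care of quoting, which matters here because `<` and `>` would otherwise be read by the shell as redirections.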

Grounded SAM 2 Florence-2 Image Auto-Labeling Demo

Important
Florence-2 can be used as an automatic image annotation tool by combining its captioning and grounding capabilities.

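| Task | Task Prompt | Text Input |
| --- | --- | --- |
| Caption + Phrase Grounding | `<CAPTION>` + `<CAPTION_TO_PHRASE_GROUNDING>` | ✘ |
| Detailed Caption + Phrase Grounding | `<DETAILED_CAPTION>` + `<CAPTION_TO_PHRASE_GROUNDING>` | ✘ |
| More Detailed Caption + Phrase Grounding | `<MORE_DETAILED_CAPTION>` + `<CAPTION_TO_PHRASE_GROUNDING>` | ✘ |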

You can run these demos with the following script:

Caption to phrase grounding

python grounded_sam2_florence2_autolabel_pipeline.py \
    --image_path ./notebooks/images/groceries.jpg \
    --pipeline caption_to_phrase_grounding \
    --caption_type caption

Note: To control the granularity of the caption, specify caption_type. For a more detailed caption, try --caption_type detailed_caption or --caption_type more_detailed_caption.
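
The auto-labeling pipeline is a two-stage chain: caption the image, then feed that caption back in as the text input for phrase grounding. The sketch below shows only the chaining logic. `auto_label` is a hypothetical function, and `run_task` stands in for whatever code performs a Florence-2 forward pass; neither is taken from the repository.

```python
# caption_type value -> caption task prompt, following the table above.
CAPTION_TASKS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
}
GROUNDING_TASK = "<CAPTION_TO_PHRASE_GROUNDING>"


def auto_label(image, run_task, caption_type="caption"):
    """Caption the image, then ground the caption's phrases in the same image.

    run_task(task_prompt, image, text_input) is a hypothetical callable that
    wraps a Florence-2 forward pass and returns the parsed result.
    """
    caption = run_task(CAPTION_TASKS[caption_type], image, None)
    grounding = run_task(GROUNDING_TASK, image, caption)
    return caption, grounding


if __name__ == "__main__":
    def fake_run_task(task_prompt, image, text_input):
        # Stand-in for a real model call, used only to show the call order.
        print("running", task_prompt, "with text input:", text_input)
        return "a basket of groceries" if text_input is None else {"labels": ["basket"]}

    auto_label("groceries.jpg", fake_run_task, caption_type="detailed_caption")
```

This also explains why the table marks the text input as not required: the caption generated in the first stage supplies it.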
