🧨 Diffusers Library Basics

구명규 · September 25, 2023
'23 Internship Study

  In contrast to recent diffusion models such as DALL·E 2 and Imagen, whose code is not publicly released, 🧨 Diffusers is an open-source diffusion-model library implemented by Hugging Face for easy model development and API use. For fast inference it provides pipeline classes that bundle a UNet with a scheduler, and any pretrained model hosted on the Hugging Face Hub can be loaded.


Installation

pip install --upgrade diffusers[torch]

Simple Inference

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipeline.to("cuda")
prompt = "An image of a squirrel in Picasso style"
image = pipeline(prompt).images[0]

# Pipeline options (deterministic outputs, inference steps, guidance scale, ...)
# (clip_guidance_scale, num_cutouts, etc. exist only on the CLIP-guided community
#  pipeline, not on the standard StableDiffusionPipeline used here)
generator = torch.Generator("cuda").manual_seed(1024)  # fixed seed -> deterministic output
image = pipeline(
    prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator,
).images[0]
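Schedulers are also hot-swappable on a loaded pipeline. A minimal sketch, reusing the `pipeline` object from above; DPMSolverMultistepScheduler is one of the interchangeable schedulers that ships with diffusers and typically needs fewer steps:

from diffusers import DPMSolverMultistepScheduler

# Reuse the current scheduler's config so the noise schedule stays consistent
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config)
image = pipeline(prompt, num_inference_steps=20).images[0]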
  • UNet Models & Schedulers manually
    : Models such as UNet2DModel are parameterized neural networks (PyTorch modules) trained to take a noisy image as input and predict a less noisy image or the residual.
    https://github.com/huggingface/diffusers/tree/main/src/diffusers/models
    : Schedulers such as DDPMScheduler, on the other hand, are the algorithms that add noise to an image or compute a less noisy sample; they are parameter-free Python classes with no trainable weights.
    https://github.com/huggingface/diffusers/tree/main/src/diffusers/schedulers
    → Inference proceeds with the model predicting the residual toward the less noisy sample (the `.sample` attribute of the model output) and the scheduler producing the less noisy sample (`scheduler.step`):
from diffusers import DDPMScheduler, UNet2DModel
from PIL import Image
import torch

scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda")
scheduler.set_timesteps(50)

# Start from pure Gaussian noise at the model's native resolution
sample_size = model.config.sample_size
sample = torch.randn((1, 3, sample_size, sample_size)).to("cuda")

for t in scheduler.timesteps:
    with torch.no_grad():
        noisy_residual = model(sample, t).sample  # model predicts the noise residual
    # the scheduler turns the residual into the previous (less noisy) sample
    sample = scheduler.step(noisy_residual, t, sample).prev_sample

# Map from [-1, 1] to [0, 1], then convert to a PIL image
image = (sample / 2 + 0.5).clamp(0, 1)
image = image.cpu().permute(0, 2, 3, 1).numpy()[0]
image = Image.fromarray((image * 255).round().astype("uint8"))

Core API Usage

  1. from diffusers import [PIPELINE or MODEL or SCHEDULER class name]

  2. repo_id = "[HUGGING FACE model repo name]"

  3. model = [class name].from_pretrained(repo_id)

  4. model.config → the parameters needed to define the model architecture

  5. model.to("cuda")

  6. image = model("[OPTIONAL conditioning]").images

  7. model.save_pretrained("my_model") → writes 'config.json' and 'diffusion_pytorch_model.bin'
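
As a concrete instance of steps 1–7 (skipping the conditioning step, since UNet2DModel is unconditional), a minimal sketch:

from diffusers import UNet2DModel

repo_id = "google/ddpm-cat-256"
model = UNet2DModel.from_pretrained(repo_id)

# model.config is a frozen dict of the architecture-defining parameters
print(model.config.sample_size, model.config.in_channels)

model.to("cuda")
model.save_pretrained("my_model")  # writes config.json plus the weights file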


Prompt Conditioning

  • The modules that tokenize (word → token mapping) and encode (token → embedding-vector mapping) the text prompt.
from transformers import CLIPTokenizer, CLIPTextModel

torch_device = "cuda"
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(torch_device)
  • Generating the text embeddings
prompt = ["a photograph of an astronaut riding a horse"]

text_input = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt"
)

with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
  • Adding unconditional text embeddings for classifier-free guidance
max_length = text_input.input_ids.shape[-1]
batch_size = len(prompt)  # one unconditional embedding per prompt
uncond_input = tokenizer(
    [""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
)
with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]

text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
  • The text embeddings are then passed to the UNet via encoder_hidden_states when computing the noise residual
noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
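
Putting the pieces together, below is a condensed sketch of the classifier-free-guidance denoising loop, following the standard Stable Diffusion latent-denoising recipe. It assumes the tokenizer/text_encoder setup above; the unet and scheduler are loaded here for self-containment, and SD v1 uses a 64×64 latent space for 512×512 images:

from diffusers import UNet2DConditionModel, PNDMScheduler

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
).to(torch_device)
scheduler = PNDMScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")

guidance_scale = 7.5
scheduler.set_timesteps(50)

latents = torch.randn((batch_size, unet.config.in_channels, 64, 64)).to(torch_device)
latents = latents * scheduler.init_noise_sigma  # scaling required by some schedulers

for t in scheduler.timesteps:
    # One batched forward pass covers both the unconditional and conditional branches
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, t)

    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # Classifier-free guidance: push the prediction toward the text condition
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    latents = scheduler.step(noise_pred, t, latents).prev_sample
# (decoding the latents to pixels requires the VAE, omitted here)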

Textual Inversion

Koo's log: Fine-Tuning Techniques for Diffusion Models
HuggingFace: Stable Diffusion concepts library

  • For the text encoder, which converts a text prompt in the order word → token → embedding vector, textual inversion creates a new pseudo-word for a specific visual concept so that the diffusion model acquires a semantic understanding of that concept.

  • The concept to be learned is typically given a placeholder_token wrapped in angle brackets, such as <cat-toy>. An initializer token related to the concept can also be specified as the starting point for learning it. An example of using a learned concept is sketched below.
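
Learned concepts hosted on the Stable Diffusion concepts library can be pulled into a pipeline with load_textual_inversion; a minimal sketch using the publicly hosted sd-concepts-library/cat-toy embedding:

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Registers the <cat-toy> placeholder token and its learned embedding
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

image = pipe("a <cat-toy> on a beach").images[0]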


