PSA is Position-wise Spatial Attention. SCDown handles the channel downsampling.
For n (nano) and s (small), the RepVGG technique is used to reduce model size — the C2fCIB entry [-1, 3, C2fCIB, [1024, True, True]] passes two True arguments. For m, b, l, and x, RepVGG is not used; those sizes only adjust [depth, width, max_channels].
There is a separate config per task (obb, pose, seg, ...), and the b model is the balanced version.
class SCDown(nn.Module):
def __init__(self, c1, c2, k, s):
"""
Spatial Channel Downsample (SCDown) module.
Args:
c1 (int): Number of input channels.
c2 (int): Number of output channels.
k (int): Kernel size for the convolutional layer.
s (int): Stride for the convolutional layer.
"""
super().__init__()
self.cv1 = Conv(c1, c2, 1, 1)
self.cv2 = Conv(c2, c2, k=k, s=s, g=c2, act=False)
def forward(self, x):
"""
Forward pass of the SCDown module.
Args:
x (torch.Tensor): Input tensor.
Returns:
(torch.Tensor): Output tensor after applying the SCDown module.
"""
return self.cv2(self.cv1(x))
A Depthwise Separable Convolution is a Depthwise Conv followed by a Pointwise Conv, whereas Spatial Channel Downsample is a Pointwise Conv followed by a Depthwise Conv.
cv1 has kernel size 1, so it combines information across the input channels to produce a new output. cv2 uses kernel size k with groups set to the same channel count, making it a depthwise conv.
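To make the ordering difference concrete, here is a minimal sketch (with made-up channel sizes, not the YOLOv10 ones) contrasting a standard depthwise separable block with the SCDown ordering:
import torch
import torch.nn as nn

c_in, c_out = 64, 128

# Depthwise Separable Conv: depthwise 3x3 (groups=c_in), then pointwise 1x1
dw_separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, 3, stride=2, padding=1, groups=c_in, bias=False),
    nn.Conv2d(c_in, c_out, 1, bias=False),
)

# SCDown ordering: pointwise 1x1 first, then strided depthwise 3x3 (groups=c_out)
scdown_like = nn.Sequential(
    nn.Conv2d(c_in, c_out, 1, bias=False),
    nn.Conv2d(c_out, c_out, 3, stride=2, padding=1, groups=c_out, bias=False),
)

x = torch.randn(1, c_in, 32, 32)
print(dw_separable(x).shape)  # torch.Size([1, 128, 16, 16])
print(scdown_like(x).shape)   # torch.Size([1, 128, 16, 16])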
Let's count the learnable parameters of this block.
YOLO v10 structure
model.5.cv1.conv.weight False 131072 [512, 256, 1, 1] -0.000754 0.00819 torch.float32
model.5.cv1.bn.weight False 512 [512] 0.968 0.101 torch.float32
model.5.cv1.bn.bias False 512 [512] -0.746 0.405 torch.float32
model.5.cv2.conv.weight False 4608 [512, 1, 3, 3] 0.000167 0.0182 torch.float32
model.5.cv2.bn.weight False 512 [512] 0.998 0.133 torch.float32
model.5.cv2.bn.bias False 512 [512] 2.21e-07 9.74e-06 torch.float32
model.6.cv1.conv.weight False 262144 [512, 512, 1, 1] 4.71e-05 0.00666 torch.float32
sample code
In this sample the SCDown module is stripped down to the Conv2d layers only; BatchNorm and the activation were removed so the parameter calculation is easy to follow.
import torch
import torch.nn as nn
from torchsummary import summary

def autopad(k, p=None, d=1):
    # 'same'-padding helper (mirrors ultralytics' autopad)
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]
    return p

class Conv(nn.Module):
def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1):
super(Conv, self).__init__()
self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, bias=False)
def forward(self, x):
return self.conv(x)
class SCDown(nn.Module):
def __init__(self, c1, c2, k, s):
super(SCDown, self).__init__()
self.cv1 = Conv(c1, c2, 1, 1)
self.cv2 = Conv(c2, c2, k=k, s=s, g=c2)
def forward(self, x):
return self.cv2(self.cv1(x))
c_in = 256
input_data = torch.randn(1, 256, 512, 512)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SCDown(c1=256, c2=512, k=3, s=2).to(device=device)
summary(model, (256, 512, 512))
c_in = 256
c_out = 512
h = 1024
w = 1024
kernel = 3
stride = 2
input_data = torch.randn(1, c_in, h, w)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SCDown(c1=c_in, c2=c_out, k=kernel, s=stride).to(device=device)
summary(model, (c_in, h, w))
The torchsummary result matches the parameter counts of the yolov10 architecture above.
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 512, 512, 512] 131,072
Conv-2 [-1, 512, 512, 512] 0
Conv2d-3 [-1, 512, 256, 256] 4,608
Conv-4 [-1, 512, 256, 256] 0
================================================================
parameter formula
params = (k × k × c_in / g + 1) × c_out
The +1 applies only when a bias is used; g is the number of groups.
self.cv1 = Conv(c1, c2, 1, 1)
self.cv2 = Conv(c2, c2, k=k, s=s, g=c2)
cv1 is a 1×1 pointwise conv: 1 × 1 × 256 × 512 = 131,072 parameters. cv2 is a 3×3 depthwise conv (g = c2): 3 × 3 × (512 / 512) × 512 = 4,608 parameters, 135,680 in total. Using a plain Conv2d directly for the same 256 → 512 stride-2 downsample would take 3 × 3 × 256 × 512 = 1,179,648 parameters — roughly a 9× difference.
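A quick check of that gap with plain nn.Conv2d (bias-free, matching the Conv blocks above):
import torch.nn as nn

# Parameter count of the SCDown layers vs a plain strided Conv2d
c1, c2, k, s = 256, 512, 3, 2
cv1 = nn.Conv2d(c1, c2, 1, 1, bias=False)
cv2 = nn.Conv2d(c2, c2, k, s, k // 2, groups=c2, bias=False)
plain = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)

scdown_params = sum(p.numel() for p in [*cv1.parameters(), *cv2.parameters()])
plain_params = sum(p.numel() for p in plain.parameters())
print(scdown_params, plain_params, round(plain_params / scdown_params, 1))
# 135680 1179648 8.7 -- the ~9x difference mentioned above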
This is the RepVGG + DWConv combination seen in FastViT.. FastViT Link
The DW separable Conv block is drawn on top of the RepVGG structure.
The kernels are 7 and 3, so it can be read in two ways. With g=ed, each branch is a Depthwise Conv; there is no Pointwise Conv. If a k=1 Conv2d were added, it would become a complete DW separable Conv.
class RepVGGDW(torch.nn.Module):
"""RepVGGDW is a class that represents a depth wise separable convolutional block in RepVGG architecture."""
def __init__(self, ed) -> None:
super().__init__()
self.conv = Conv(ed, ed, 7, 1, 3, g=ed, act=False)
self.conv1 = Conv(ed, ed, 3, 1, 1, g=ed, act=False)
self.dim = ed
self.act = nn.SiLU()
def forward(self, x):
"""
Performs a forward pass of the RepVGGDW block.
Args:
x (torch.Tensor): Input tensor.
Returns:
(torch.Tensor): Output tensor after applying the depth wise separable convolution.
"""
return self.act(self.conv(x) + self.conv1(x))
def forward_fuse(self, x):
"""
Performs a forward pass of the RepVGGDW block without fusing the convolutions.
Args:
x (torch.Tensor): Input tensor.
Returns:
(torch.Tensor): Output tensor after applying the depth wise separable convolution.
"""
return self.act(self.conv(x))
conv1_w (the 3×3 kernel) is padded with [2, 2, 2, 2] so it lines up with the 7×7 kernel before the two are added.
RepVGG (with FastViT post) Link
RepVGG Paper Link
@torch.no_grad()
def fuse(self):
"""
Fuses the convolutional layers in the RepVGGDW block.
This method fuses the convolutional layers and updates the weights and biases accordingly.
"""
conv = fuse_conv_and_bn(self.conv.conv, self.conv.bn)
conv1 = fuse_conv_and_bn(self.conv1.conv, self.conv1.bn)
conv_w = conv.weight
conv_b = conv.bias
conv1_w = conv1.weight
conv1_b = conv1.bias
conv1_w = torch.nn.functional.pad(conv1_w, [2, 2, 2, 2])
final_conv_w = conv_w + conv1_w
final_conv_b = conv_b + conv1_b
conv.weight.data.copy_(final_conv_w)
conv.bias.data.copy_(final_conv_b)
self.conv = conv
del self.conv1
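A minimal standalone sketch of that pad-and-add trick with plain depthwise Conv2d layers (hypothetical channel count; BatchNorm folding left out, since fuse_conv_and_bn already handles it above), checking that the fused 7×7 kernel reproduces the two-branch sum:
import torch
import torch.nn as nn
import torch.nn.functional as F

ed = 8
conv7 = nn.Conv2d(ed, ed, 7, 1, 3, groups=ed, bias=True)  # like self.conv
conv3 = nn.Conv2d(ed, ed, 3, 1, 1, groups=ed, bias=True)  # like self.conv1

x = torch.randn(1, ed, 16, 16)
two_branch = conv7(x) + conv3(x)

# Fold the 3x3 branch into the 7x7 branch: pad its kernel by 2 on every side.
fused = nn.Conv2d(ed, ed, 7, 1, 3, groups=ed, bias=True)
fused.weight.data = conv7.weight.data + F.pad(conv3.weight.data, [2, 2, 2, 2])
fused.bias.data = conv7.bias.data + conv3.bias.data

print(torch.allclose(two_branch, fused(x), atol=1e-5))  # True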
So fuse() produces a single conv equal to self.conv(x) + self.conv1(x) and then deletes self.conv1. RepVGGDW is only used for n and s; the other sizes do not use it.
One part of the CIB block below puzzles me: c2 is multiplied by e (0.5) to halve the hidden channels, but 2 * c_ then multiplies by 2 and brings it all back. For this factor to have any effect, e would need to be set to a smaller fraction.
class CIB(nn.Module):
"""
Conditional Identity Block (CIB) module.
Args:
c1 (int): Number of input channels.
c2 (int): Number of output channels.
shortcut (bool, optional): Whether to add a shortcut connection. Defaults to True.
e (float, optional): Scaling factor for the hidden channels. Defaults to 0.5.
lk (bool, optional): Whether to use RepVGGDW for the third convolutional layer. Defaults to False.
"""
def __init__(self, c1, c2, shortcut=True, e=0.5, lk=False):
"""Initializes the custom model with optional shortcut, scaling factor, and RepVGGDW layer."""
super().__init__()
c_ = int(c2 * e) # hidden channels
self.cv1 = nn.Sequential(
Conv(c1, c1, 3, g=c1),
Conv(c1, 2 * c_, 1),
RepVGGDW(2 * c_) if lk else Conv(2 * c_, 2 * c_, 3, g=2 * c_),
Conv(2 * c_, c2, 1),
Conv(c2, c2, 3, g=c2),
)
self.add = shortcut and c1 == c2
def forward(self, x):
return x + self.cv1(x) if self.add else self.cv1(x)
Walking through self.cv1: Conv(c1, c1, 3, g=c1) is indeed a Depthwise Conv, and Conv(c1, 2 * c_, 1) is a Pointwise Conv. The third layer is RepVGGDW if lk is True, otherwise Conv(2 * c_, 2 * c_, 3, g=2 * c_), which is again a Depthwise Conv. Conv(2 * c_, c2, 1) is a Pointwise Conv, and Conv(c2, c2, 3, g=c2) is a Depthwise Conv.
To add a note: this is a variation of the structure in the YOLO-v8 C2f block (link), which stacks Bottleneck blocks in a ModuleList.
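The five cv1 layers above can be checked quickly by rebuilding them with plain Conv2d (a sketch with hypothetical channel sizes; BN and activations omitted, lk=False):
import torch.nn as nn

c1 = c2 = 128
c_ = int(c2 * 0.5)  # 64, but 2 * c_ == 128 == c2 again
cv1 = nn.Sequential(
    nn.Conv2d(c1, c1, 3, 1, 1, groups=c1, bias=False),               # depthwise
    nn.Conv2d(c1, 2 * c_, 1, bias=False),                            # pointwise
    nn.Conv2d(2 * c_, 2 * c_, 3, 1, 1, groups=2 * c_, bias=False),   # depthwise (RepVGGDW when lk=True)
    nn.Conv2d(2 * c_, c2, 1, bias=False),                            # pointwise
    nn.Conv2d(c2, c2, 3, 1, 1, groups=c2, bias=False),               # depthwise
)
for m in cv1:
    kind = "depthwise" if m.groups == m.in_channels else "pointwise"
    print(kind, m.in_channels, "->", m.out_channels, "k =", m.kernel_size[0])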
class C2fCIB(C2f):
"""
C2fCIB class represents a convolutional block with C2f and CIB modules.
Args:
c1 (int): Number of input channels.
c2 (int): Number of output channels.
n (int, optional): Number of CIB modules to stack. Defaults to 1.
shortcut (bool, optional): Whether to use shortcut connection. Defaults to False.
lk (bool, optional): Whether to use local key connection. Defaults to False.
g (int, optional): Number of groups for grouped convolution. Defaults to 1.
e (float, optional): Expansion ratio for CIB modules. Defaults to 0.5.
"""
def __init__(self, c1, c2, n=1, shortcut=False, lk=False, g=1, e=0.5):
"""Initializes the module with specified parameters for channel, shortcut, local key, groups, and expansion."""
super().__init__(c1, c2, n, shortcut, g, e)
self.m = nn.ModuleList(CIB(self.c, self.c, shortcut, e=1.0, lk=lk) for _ in range(n))
The CIB blocks are gathered n times, via range(n), into the ModuleList.
The __init__ default is e=0.5, yet the call actually passes e=1.0?
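For reference, a tiny arithmetic sketch of the widths that result (assuming the parent C2f sets self.c = int(c2 * e), as in the ultralytics implementation; c2 = 512 is a hypothetical size):
c2, e = 512, 0.5
branch_c = int(c2 * e)            # 256: per-branch width chosen by C2fCIB's own e
cib_hidden = int(branch_c * 1.0)  # 256: each CIB is called with e=1.0, so c_ == branch_c
print(branch_c, cib_hidden, 2 * cib_hidden)  # 256 256 512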
class Attention(nn.Module):
"""
Attention module that performs self-attention on the input tensor.
Args:
dim (int): The input tensor dimension.
num_heads (int): The number of attention heads.
attn_ratio (float): The ratio of the attention key dimension to the head dimension.
Attributes:
num_heads (int): The number of attention heads.
head_dim (int): The dimension of each attention head.
key_dim (int): The dimension of the attention key.
scale (float): The scaling factor for the attention scores.
qkv (Conv): Convolutional layer for computing the query, key, and value.
proj (Conv): Convolutional layer for projecting the attended values.
pe (Conv): Convolutional layer for positional encoding.
"""
def __init__(self, dim, num_heads=8, attn_ratio=0.5):
"""Initializes multi-head attention module with query, key, and value convolutions and positional encoding."""
super().__init__()
self.num_heads = num_heads
self.head_dim = dim // num_heads
self.key_dim = int(self.head_dim * attn_ratio)
self.scale = self.key_dim**-0.5
nh_kd = self.key_dim * num_heads
h = dim + nh_kd * 2
self.qkv = Conv(dim, h, 1, act=False)
self.proj = Conv(dim, dim, 1, act=False)
self.pe = Conv(dim, dim, 3, 1, g=dim, act=False)
def forward(self, x):
"""
Forward pass of the Attention module.
Args:
x (torch.Tensor): The input tensor.
Returns:
(torch.Tensor): The output tensor after self-attention.
"""
B, C, H, W = x.shape
N = H * W
qkv = self.qkv(x)
q, k, v = qkv.view(B, self.num_heads, self.key_dim * 2 + self.head_dim, N).split(
[self.key_dim, self.key_dim, self.head_dim], dim=2
)
attn = (q.transpose(-2, -1) @ k) * self.scale
attn = attn.softmax(dim=-1)
x = (v @ attn.transpose(-2, -1)).view(B, C, H, W) + self.pe(v.reshape(B, C, H, W))
x = self.proj(x)
return x
The qkv output is split into query, key, and value, and then
attn = (q.transpose(-2, -1) @ k) * self.scale
attn = attn.softmax(dim=-1)
self.proj = Conv(dim, dim, 1, act=False)
self.pe = Conv(dim, dim, 3, 1, g=dim, act=False)
The Positional Encoding here is done in an interesting way: instead of the usual sin/cos periodic encoding, self.pe is a 3×3 depthwise Conv applied to v. Also, doing the projection with a Conv instead of a Linear is something I am seeing for the first time; since it maps dim to dim, the channel count does not change.
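A shape walkthrough of the forward pass with plain tensors (hypothetical dim and num_heads; the scale factor, positional encoding, and projection are left out for brevity):
import torch
import torch.nn as nn

dim, num_heads, attn_ratio = 256, 4, 0.5
head_dim = dim // num_heads           # 64
key_dim = int(head_dim * attn_ratio)  # 32
h = dim + key_dim * num_heads * 2     # 512 channels out of the qkv conv

x = torch.randn(2, dim, 20, 20)
B, C, H, W = x.shape
N = H * W

qkv = nn.Conv2d(dim, h, 1, bias=False)(x)                 # (2, 512, 20, 20)
q, k, v = qkv.view(B, num_heads, key_dim * 2 + head_dim, N).split(
    [key_dim, key_dim, head_dim], dim=2
)
print(q.shape, k.shape, v.shape)  # (2, 4, 32, 400), (2, 4, 32, 400), (2, 4, 64, 400)

attn = (q.transpose(-2, -1) @ k).softmax(dim=-1)          # (2, 4, 400, 400)
out = (v @ attn.transpose(-2, -1)).view(B, C, H, W)       # back to (2, 256, 20, 20)
print(attn.shape, out.shape)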
Position-wise Spatial Attention
class PSA(nn.Module):
"""
Position-wise Spatial Attention module.
Args:
c1 (int): Number of input channels.
c2 (int): Number of output channels.
e (float): Expansion factor for the intermediate channels. Default is 0.5.
Attributes:
c (int): Number of intermediate channels.
cv1 (Conv): 1x1 convolution layer to reduce the number of input channels to 2*c.
cv2 (Conv): 1x1 convolution layer to reduce the number of output channels to c.
attn (Attention): Attention module for spatial attention.
ffn (nn.Sequential): Feed-forward network module.
"""
def __init__(self, c1, c2, e=0.5):
"""Initializes convolution layers, attention module, and feed-forward network with channel reduction."""
super().__init__()
assert c1 == c2
self.c = int(c1 * e)
self.cv1 = Conv(c1, 2 * self.c, 1, 1)
self.cv2 = Conv(2 * self.c, c1, 1)
self.attn = Attention(self.c, attn_ratio=0.5, num_heads=self.c // 64)
self.ffn = nn.Sequential(Conv(self.c, self.c * 2, 1), Conv(self.c * 2, self.c, 1, act=False))
def forward(self, x):
"""
Forward pass of the PSA module.
Args:
x (torch.Tensor): Input tensor.
Returns:
(torch.Tensor): Output tensor.
"""
a, b = self.cv1(x).split((self.c, self.c), dim=1)
b = b + self.attn(b)
b = b + self.ffn(b)
return self.cv2(torch.cat((a, b), 1))
With e=0.5 and the result then multiplied by 2, the channels of cv1 and cv2 come out as
model.10.cv1.conv.weight False 262144 [512, 512, 1, 1]
model.10.cv2.conv.weight False 262144 [512, 512, 1, 1]
and in self.ffn the channels are
model.10.ffn.0.conv.weight False 131072 [512, 256, 1, 1]
model.10.ffn.1.conv.weight False 131072 [256, 512, 1, 1]
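The same channel bookkeeping with plain nn.Conv2d (bias-free, as in the Conv blocks; c1 = c2 = 512 matches the model.10 dump above):
import torch.nn as nn

c1 = 512
c = int(c1 * 0.5)                                  # 256 intermediate channels
layers = {
    "cv1": nn.Conv2d(c1, 2 * c, 1, bias=False),    # 512 -> 512, no reduction
    "cv2": nn.Conv2d(2 * c, c1, 1, bias=False),    # 512 -> 512
    "ffn.0": nn.Conv2d(c, 2 * c, 1, bias=False),   # 256 -> 512
    "ffn.1": nn.Conv2d(2 * c, c, 1, bias=False),   # 512 -> 256
}
for name, m in layers.items():
    print(name, list(m.weight.shape), m.weight.numel())
# cv1 [512, 512, 1, 1] 262144 ... matches the weight dump above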
The docstring says "with channel reduction", but at least for the l size that is not actually the case — cv1 keeps the full 512 channels.
Seeing the Depthwise Separable Conv structure from FastViT put into every block was a lot of fun. Deleting self.conv1 after fusing is interesting every time I see it. I suspect this kind of approach will become the mainstream.
Now that YOLO has been combined with lightweighting and attention.. is there anything more to come?
I have a rough grasp of the structure now, so time to read the paper.