[Lightweight Challenge] Day 19 - nn.Linear

ehghkwl · December 16, 2025

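Before implementing quantization code myself, I need to understand this precisely first!
Suppose there is a weight of shape (1024, 512). What is the shape of the scale factor needed to quantize it to int8? Only if I can answer this do I really understand it!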
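
One way to reason about the question above, as a sketch in plain Python and assuming symmetric int8 quantization with the range [-127, 127]: with a single scale for the whole tensor the scale factor is a scalar, while with one scale per output channel (one per row of the weight) it has shape (1024, 1). Other granularities, such as group-wise scales, give other shapes. The random weight and the names per_tensor_scale, per_channel_scale, and quantize_row are illustrative and not taken from any library.

```python
import random

OUT_FEATURES, IN_FEATURES = 1024, 512  # weight shape from the question above
QMAX = 127                             # assumed symmetric int8 range [-127, 127]

random.seed(0)
# Hypothetical stand-in for a trained weight: 1024 rows of 512 floats.
weight = [[random.uniform(-1.0, 1.0) for _ in range(IN_FEATURES)]
          for _ in range(OUT_FEATURES)]

# Per-tensor: one scale for the whole matrix, i.e. a scalar.
per_tensor_scale = max(abs(v) for row in weight for v in row) / QMAX

# Per-output-channel: one scale per row, i.e. shape (1024, 1),
# which lines up with the (1024, 512) weight row by row.
per_channel_scale = [[max(abs(v) for v in row) / QMAX] for row in weight]


def quantize_row(row, scale):
    """Divide by the scale, round to the nearest integer, clamp to int8 range."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in row]


q_weight = [quantize_row(row, s[0]) for row, s in zip(weight, per_channel_scale)]

print(len(per_channel_scale), len(per_channel_scale[0]))  # 1024 1
```

With per-channel scales, every row uses the full int8 range, because each row is divided by its own maximum absolute value.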

nn.Linear

The code lives in torch/nn/modules/linear.py in the PyTorch GitHub repository.
https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html

    r"""Applies an affine linear transformation to the incoming data: :math:`y = xA^T + b`.

affine
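nn.Linear applies an affine transformation to its input data. So what do "affine" and "linear" mean?
- linear transformation: $y = xA^T$. It stretches, shrinks, or rotates space while keeping the origin fixed. However, it cannot express a shift that moves the origin away from (0,0).
- affine transformation: $y = xA^T + b$. It is a linear transformation plus a translation (so linear is the special case $b = 0$).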
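
The difference can be checked on the origin itself. The following is a minimal plain-Python sketch; linear_map and affine_map are hypothetical helpers written for this illustration, and A and b are arbitrary example values.

```python
def linear_map(x, A):
    """y = x A^T for one input vector x; A has shape (out, in)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def affine_map(x, A, b):
    """y = x A^T + b: the linear map followed by a translation by b."""
    return [y + bj for y, bj in zip(linear_map(x, A), b)]


A = [[2.0, 0.0],
     [0.0, 3.0]]   # scales the two axes by 2 and 3
b = [1.0, -1.0]    # translation

origin = [0.0, 0.0]
print(linear_map(origin, A))     # [0.0, 0.0]  -> the origin stays fixed
print(affine_map(origin, A, b))  # [1.0, -1.0] -> the origin moves to b
```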

    Shape:
        - Input: :math:`(*, H_\text{in})` where :math:`*` means any number of
          dimensions including none and :math:`H_\text{in} = \text{in\_features}`.
        - Output: :math:`(*, H_\text{out})` where all but the last dimension
          are the same shape as the input and :math:`H_\text{out} = \text{out\_features}`.

shape
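The input is $(*, H_\text{in})$ and the output is $(*, H_\text{out})$, where * stands for any number of leading dimensions (including none), so the input can be 1-D or higher.
For example, in an LLM, if the input is "나는 밥을 먹는다" ("I eat rice", three words), the data has a shape like (Batch_Size, 3 (= number of words), 100 (= embedding size)).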
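
A short check of this shape rule, assuming PyTorch is installed. The batch size of 8 and the output size of 64 are arbitrary choices for the illustration; only the last dimension has to match in_features.

```python
import torch
from torch import nn

# in_features matches the embedding size from the example; 64 is arbitrary.
layer = nn.Linear(in_features=100, out_features=64)

x_3d = torch.randn(8, 3, 100)  # (batch, words, embedding)
x_1d = torch.randn(100)        # "*" may also be empty

print(layer(x_3d).shape)  # torch.Size([8, 3, 64])
print(layer(x_1d).shape)  # torch.Size([64])
```

Only the last dimension changes; every leading dimension is passed through untouched.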

    Attributes:
        weight: the learnable weights of the module of shape
            :math:`(\text{out\_features}, \text{in\_features})`. The values are
            initialized from :math:`\mathcal{U}(-\sqrt{k}, \sqrt{k})`, where
            :math:`k = \frac{1}{\text{in\_features}}`
        bias:   the learnable bias of the module of shape :math:`(\text{out\_features})`.
                If :attr:`bias` is ``True``, the values are initialized from
                :math:`\mathcal{U}(-\sqrt{k}, \sqrt{k})` where
                :math:`k = \frac{1}{\text{in\_features}}`

attributes
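The weight of a linear layer is stored with shape (out_features, in_features). Because $y = xA^T$, the weight is stored in (output, input) order.
- ❓But come to think of it, storing the weight already transposed would avoid the transpose operation entirely, so why store the pre-transpose matrix in memory?
  Producing one element of y needs one full row of the weight A (y1, for example, uses the first column of A^T, which is the first row of A). Memory is not two-dimensional: a tensor is laid out as a flat 1-D buffer in row-major order, row 1, then row 2, and so on. If the weight were stored transposed, the weights needed for a single y would be scattered across memory at a fixed stride instead of sitting next to each other, which tends to hurt contiguous (cache-friendly) access when outputs are computed in parallel. So the weight is stored as A.
  Storing A as is does not add an actual transpose at compute time either. The transpose is conceptual: in the real computation the data is not rearranged, and the kernel reads the weights in their stored memory order.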
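
The layout argument can be made concrete by listing which flat-buffer offsets one output element needs under each storage order. This is a plain-Python sketch with tiny made-up sizes (OUT = 3, IN = 4); offsets_for_output is a hypothetical helper, not a PyTorch function.

```python
OUT, IN = 3, 4  # tiny example sizes, chosen only for readability


def offsets_for_output(j, transposed_storage):
    """Flat row-major offsets of the weights needed to compute y[j]."""
    if transposed_storage:
        # Stored as A^T with shape (IN, OUT): element (i, j) sits at i * OUT + j.
        return [i * OUT + j for i in range(IN)]
    # Stored as A with shape (OUT, IN): element (j, i) sits at j * IN + i.
    return [j * IN + i for i in range(IN)]


print(offsets_for_output(1, transposed_storage=False))  # [4, 5, 6, 7]
print(offsets_for_output(1, transposed_storage=True))   # [1, 4, 7, 10]
```

Stored as A, the weights for one output are adjacent in memory; stored as A^T, they are OUT elements apart.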

In addition, there is also a bias.

    __constants__ = ["in_features", "out_features"]
    in_features: int
    out_features: int
    weight: Tensor

constants
This is for TorchScript (the JIT compiler): it tells the compiler that these values are constants that do not change after the model has been created.

class Linear(Module):
    def __init__(
        self,
        in_features: int,
        out_features: int,
        bias: bool = True,
        device=None,
        dtype=None,
    ) -> None:
        factory_kwargs = {"device": device, "dtype": dtype}
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(
            torch.empty((out_features, in_features), **factory_kwargs)
        )
        if bias:
            self.bias = Parameter(torch.empty(out_features, **factory_kwargs))
        else:
            self.register_parameter("bias", None)
        self.reset_parameters()

    def reset_parameters(self) -> None:
        """
        Resets parameters based on their initialization used in ``__init__``.
        """
        # Setting a=sqrt(5) in kaiming_uniform is the same as initializing with
        # uniform(-1/sqrt(in_features), 1/sqrt(in_features)). For details, see
        # https://github.com/pytorch/pytorch/issues/57109
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
            init.uniform_(self.bias, -bound, bound)

init
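- __init__ is the initializer that runs when the object is first created.
  Here, Parameter marks a tensor as a variable that must be learned. If it were left as a plain torch.Tensor, it would not be registered in model.parameters(), so optimizer.step() would not update it.
- reset_parameters fills the parameters with random values, which is why they are first allocated with torch.empty (uninitialized memory) and randomly initialized afterwards. The values are not arbitrary, though: the range depends on in_features, so the more inputs there are, the narrower the range the weights are drawn from. This is done with init.kaiming_uniform_(self.weight, a=math.sqrt(5)), which is equivalent to sampling from uniform(-1/sqrt(in_features), 1/sqrt(in_features)). This helps keep the magnitude of the values roughly constant as they pass through layers, which makes training more stable.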
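
The equivalence stated in the source comment can be verified by hand. The sketch below reproduces the bound arithmetic of Kaiming-uniform initialization in plain Python (leaky_relu gain, fan_in mode); kaiming_uniform_bound is a hypothetical helper written for this check, not the PyTorch function itself.

```python
import math


def kaiming_uniform_bound(fan_in, a):
    """Half-width of the uniform range for Kaiming-uniform init."""
    gain = math.sqrt(2.0 / (1.0 + a ** 2))  # leaky_relu gain with slope a
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std             # a uniform(-b, b) has std b / sqrt(3)


for in_features in (16, 512, 4096):
    bound = kaiming_uniform_bound(in_features, a=math.sqrt(5))
    # The two printed numbers agree: bound == 1 / sqrt(in_features).
    print(in_features, bound, 1 / math.sqrt(in_features))
```

With a = sqrt(5) the gain is sqrt(1/3), which cancels the sqrt(3) factor and leaves 1/sqrt(fan_in), so a larger in_features gives a narrower range.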

class Linear(Module):
    ...

    def forward(self, input: Tensor) -> Tensor:
        """
        Runs the forward pass.
        """
        return F.linear(input, self.weight, self.bias)

    def extra_repr(self) -> str:
        """
        Return the extra representation of the module.
        """
        return f"in_features={self.in_features}, out_features={self.out_features}, bias={self.bias is not None}"

forward
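This is the computation performed when an input comes in. It could be written as input @ self.weight.T + self.bias, but PyTorch calls F.linear, which is implemented in optimized C++/CUDA internally.
On the backend, this function dispatches to a GEMM (General Matrix Multiply) routine, which computes $y = xA^T + b$ efficiently.
Unlike a plain torch.Tensor, a Parameter is registered as a learnable variable.
The weight is in (output, input) order because of the memory layout: y1 is computed by multiplying the input vector x with the entire first row of the weight matrix and summing the products.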
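
These points can be checked directly, assuming PyTorch is installed. The sketch compares the module output with F.linear and with the hand-written expression, lists the registered parameters, and confirms that .T is a view over the same storage rather than a copy. The layer sizes are arbitrary.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(4, 3)  # arbitrary small sizes
x = torch.randn(2, 4)

y_module = layer(x)
y_functional = F.linear(x, layer.weight, layer.bias)
y_manual = x @ layer.weight.T + layer.bias

# All three compute y = x A^T + b (up to floating-point rounding).
print(torch.allclose(y_module, y_functional, atol=1e-6))
print(torch.allclose(y_module, y_manual, atol=1e-6))

# weight and bias are Parameters, so they are what an optimizer receives.
names = [name for name, _ in layer.named_parameters()]
print(names)  # ['weight', 'bias']

# .T only swaps the strides; the underlying memory is shared, not moved.
same_storage = layer.weight.T.data_ptr() == layer.weight.data_ptr()
print(same_storage)
```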
extra_repr
This function simply defines the string shown when you call print(model); it is a debugging aid.
Example: print(nn.Linear(20, 30)) prints Linear(in_features=20, out_features=30, bias=True) thanks to this function.
