TIL - 2023.09.26

배엘리·2023년 9월 26일

TIL

목록 보기

23/23

🤗 오늘 배운 것

오늘은 NLP 2일차이자 마지막 날..?
아직 실습이 남았지만 추석 전으로 무언가를 배우는 날은 끝이고 추석 이후로는 모의경진대회를 한다고 해서 아쉽다ㅠㅜ

BertConfig

Bert도 하이퍼파라미터의 지옥을 벗어날 수 없다...
이러한 파라미터들은

class transformers.BertConfig

에서 관리(?)할 수 있다. BertConfig 안에는 총 16개의 하이퍼파라미터가 있는데, 모두 다 알 필요는 없고 특별히 아래만 관리하면 된다고 하셨다.

vocab_size (int, optional, defaults to 30522) — Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BertModel or TFBertModel.
hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
hidden_act (str or Callable, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
max_position_embeddings (int, optional, defaults to 512) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
type_vocab_size (int, optional, defaults to 2) — The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel.

특이한 점은, hidden_act의 default가 gelu라는 것인데, gelu는

아래와 같이 생겼다. 조금 더 완만한 ReLU 같이 생겼는데 시간이 있을 때 수식도 한번 파보는게 좋은 것 같다.

Huggingface Tokenizer

오늘은 Hunggingface의 다양한 모델들과 tokenizer 사용법에 대해 배웠다.

Pretrained Model 불러오기

간단하게 bert안의 pretrained model을 불러와 tokenizer를 사용해 보았다.

이게 모델 선언부라면,

from transformers import BertPreTrainedModel
import torch.nn as nn

class MyBertForSentenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        self.bert = BertModel(config)
        self.linear = nn.Linear(config.hidden_size, 2)

    def forward(
        self, **kwargs
    ):

        outputs = self.bert(**kwargs)
        prediction = self.linear(outputs.pooler_output)
        return prediction

pretrained model을 불러오고 싶다면 아래와 같이 from_pretrained(내가 불러오고 싶은 모델) 이렇게 선언해주면 된다.

classifier = MyBertForSentenceClassification.from_pretrained('bert-base-uncased')

Tokenizer

오늘은 다양한 Tokenizer를 사용했는데 GPT2Tokenizer, BertTokenizer, AutoTokenizer를 사용했다.

다른 tokenizer들은 추론까지 안갔지만 AutoTokenizer로

embeddings = tokenizer("This is the capital city of Korea, [MASK].", return_tensors='pt')

outputs = model(**embeddings)

이렇게 추론을 진행했는데,

mlm_logits.argmax(-1).shape # vocab 중 가장 확률 높은 것


------
tokenizer.decode(mlm_logits.argmax(-1)[0, 1:-1])

=> This is the capital city of Korea, Seoul.

이렇게 MASK값을 잘 맞추는 것을 볼 수 있었다! 너무 신기했다.

🐻 더 배워야 할 점

NLP를 공부하면서 의외로(?) 재밌는 부분이 많다고 생각했다.
그리고 CV보다 더 연구해야 할 부분이 많다고도 생각했다. CV에서는 '이런 상황에서는 어떻게 해요~?'라고 했을 때 이미 나와있는 모델이 있었는데 nlp 아직 활발히 연구되고 있는 분야라고 했다.

추석을 앞두고 추석에 공부 열심히 해야겠다 이런 생각이 가득하다.. 부산에 내려가면 괜찮을지 모르겠지만 그래도 틈틈히 강의와 코드짜는 연습을 해야하는데 시간이 날 지는 모르겠지만 그래도 꾸준히..열심히 해야지! 화이팅!

배엘리

이전 포스트