Ex5) Transformer_huggingface_pre-training

Jacob Kim·2024년 2월 1일

Naver Project Week4

목록 보기

6/16

# Transformers installation
! pip install transformers datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers) (2023.11.17)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2023.3.post1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->datasets) (1.16.0)

! pip install -U accelerate
! pip install -U transformers[torch]

Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[torch]) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[torch]) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers[torch]) (2023.11.17)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch!=1.12.0,>=1.10->transformers[torch]) (2.1.3)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch!=1.12.0,>=1.10->transformers[torch]) (1.3.0)

Fine-tune a pretrained model

사전 학습된 모델을 사용하면 상당한 이점이 있습니다. 계산 비용과 탄소 배출량을 줄이고 처음부터 훈련하지 않고도 최신 모델을 사용할 수 있습니다. 🤗 Transformers는 다양한 작업을 위해 수천 개의 사전 훈련된 모델에 대한 액세스를 제공합니다. 사전 훈련된 모델을 사용하는 경우 작업에 특정한 데이터 세트에서 모델을 훈련합니다. 이 튜토리얼에서는 선택한 딥러닝 프레임워크로 사전 학습된 모델을 미세 조정합니다.

Fine-tune a pretrained model with 🤗 Transformers Trainer.
Fine-tune a pretrained model in TensorFlow with Keras.

Prepare a dataset

사전 훈련된 모델을 미세 조정하기 전에 데이터 세트를 다운로드하고 훈련을 위해 준비하세요.

Begin by loading the Yelp Reviews dataset:

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}

텍스트를 처리하고 가변 시퀀스 길이를 처리하기 위한 패딩 및 truncation 전략을 포함하려면 토크나이저가 필요합니다.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

원하는 경우 전체 데이터 세트의 더 작은 하위 집합을 만들어 미세 조정하여 소요 시간을 줄일 수 있습니다.

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Train

🤗 Transformers는 교육에 최적화된 Trainer 클래스를 제공합니다. 자신의 훈련 루프. Trainer API는 로깅, 기울기 누적, 혼합 정밀도와 같은 다양한 훈련 옵션과 기능을 지원합니다.

모델을 로드하여 시작하고 예상 레이블 수를 지정합니다. Yelp Review 데이터 세트 카드에서 5개의 레이블이 있음을 알 수 있습니다.

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training hyperparameters

다음으로 조정할 수 있는 모든 하이퍼파라미터와 다양한 훈련 옵션을 활성화하기 위한 플래그가 포함된 TrainingArguments 클래스를 만듭니다.

이 튜토리얼의 경우 기본 하이퍼파라미터로 시작할 수 있지만 자유롭게 실험하여 최적의 설정을 찾으십시오.

교육에서 체크포인트를 저장할 위치를 지정합니다.

from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

Metrics

트레이너는 트레이닝 중에 모델 성능을 자동으로 평가하지 않습니다. 메트릭을 계산하고 보고하려면 Trainer에 함수를 전달해야 합니다. 🤗 데이터셋 라이브러리에서는 load_metric(자세한 내용은 이 튜토리얼 함수를 참조하세요) 함수와 함께 로드할 수 있는 간단한 accuracy 함수를 제공합니다:

import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

<ipython-input-8-56b0b4182bac>:4: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate
  metric = load_metric("accuracy")
/usr/local/lib/python3.10/dist-packages/datasets/load.py:752: FutureWarning: The repository for accuracy contains custom code which must be executed to correctly load the metric. You can inspect the repository content at https://raw.githubusercontent.com/huggingface/datasets/2.16.1/metrics/accuracy/accuracy.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
  warnings.warn(

metric에서 compute를 호출하여 예측의 정확도를 계산합니다. 계산`에 예측을 전달하기 전에 예측을 로그로 변환해야 합니다(모든 🤗 트랜스포머 모델은 로그를 반환한다는 점을 기억하세요):

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

미세 조정 중에 평가 지표를 모니터링하려면 훈련 인수에 evaluation_strategy 매개 변수를 지정하여 각 에포크가 끝날 때 평가 지표를 보고하세요:

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

Trainer

모델, 훈련 인수, 훈련 및 테스트 데이터 세트, 평가 함수를 사용하여 Trainer 개체를 만듭니다:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Then fine-tune your model by calling train():

🤗 트랜스포머 모델은 Keras API를 통해 텐서플로우의 트레이닝도 지원합니다.

Convert dataset to TensorFlow format

DefaultDataCollator는 모델이 학습할 텐서를 일괄 처리로 조립합니다. 텐서플로 텐서를 반환하려면 return_tensors를 지정해야 합니다:

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

트레이너는 기본적으로 DataCollatorWithPadding을 사용하므로 데이터 콜레이터를 명시적으로 지정할 필요가 없습니다.

다음으로, to_tf_dataset 메서드를 사용하여 토큰화된 데이터셋을 텐서플로 데이터셋으로 변환합니다. 입력은 columns에, 레이블은 label_cols에 지정합니다:

tf_train_dataset = small_train_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=4,
)

tf_validation_dataset = small_eval_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=4,
)

Compile and fit

예상 레이블 수로 TensorFlow 모델을 로드합니다:

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

그런 다음 다른 Keras 모델과 마찬가지로 fit을 사용하여 모델을 컴파일하고 미세 조정합니다:

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)

Epoch 1/3
250/250 [==============================] - 207s 679ms/step - loss: 1.6181 - sparse_categorical_accuracy: 0.2180 - val_loss: 1.7784 - val_sparse_categorical_accuracy: 0.2950
Epoch 2/3
250/250 [==============================] - 166s 664ms/step - loss: 1.6177 - sparse_categorical_accuracy: 0.2340 - val_loss: 1.6148 - val_sparse_categorical_accuracy: 0.2200
Epoch 3/3
250/250 [==============================] - 166s 664ms/step - loss: 1.6218 - sparse_categorical_accuracy: 0.2220 - val_loss: 1.6146 - val_sparse_categorical_accuracy: 0.2200
<keras.src.callbacks.History at 0x7d953af76800>

Jacob Kim

AI, Information and Communication, Electronics, Computer Science, Bio, Algorithms