PL Template for NLP(4)

City_Duck · April 28, 2023

NLP PyTorch hydra python pytorch lightning


In this post, we look at the components used in the template.
We cover them in the order wandb, pyrootutils, pytest.

Weights and Biases(wandb)

wandb : a tool for easily tracking and visualizing deep learning experiments

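wandb offers the following features.
1. Stores the parameters used for training
2. Compares the results of multiple experiments at once
3. Hyperparameter tuning through sweeps
4. Experiment sharing when used by a team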

When using wandb together with PL:

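import lightning as L
import wandb
from lightning.pytorch.loggers import WandbLogger

wandb.login(key="")  # pass your wandb API key here
wandb_logger = WandbLogger()

trainer = L.Trainer(logger=wandb_logger)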

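With just this small addition of code, wandb is ready to use.
Results can be viewed in the browser on the wandb.ai website; an actual experiment looks like this:
[Image: wandb]

As shown, results can be tracked and analyzed with little effort.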
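
The sweep feature from the list above is driven by a configuration dictionary. The following is a minimal sketch of such a configuration. The metric name `val/loss`, the parameter ranges, the project name, and the helper `describe_sweep` are all assumptions made for illustration and do not come from the template.

```python
# Hypothetical sweep configuration; metric name and parameter ranges are assumptions.
sweep_config = {
    "method": "random",  # search strategy: "grid", "random", or "bayes"
    "metric": {"name": "val/loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-3},
        "batch_size": {"values": [16, 32, 64]},
    },
}


def describe_sweep(config):
    """Return a one-line summary of the sweep, useful as a quick sanity check."""
    names = ", ".join(sorted(config["parameters"]))
    metric = config["metric"]
    return f"{config['method']} search over {names} to {metric['goal']} {metric['name']}"


print(describe_sweep(sweep_config))

# With wandb installed and logged in, the config would typically be used like this:
#   sweep_id = wandb.sweep(sweep_config, project="my-project")  # placeholder project name
#   wandb.agent(sweep_id, function=train_fn, count=10)          # train_fn is a hypothetical training function
```

The dictionary is kept separate from the wandb calls so it can be checked or versioned without a wandb account.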

pyrootutils

GitHub link
This library was written by ashleve, the author of the PL Template introduced earlier, and it makes specifying the project path simple.

import pyrootutils

# find absolute root path (searches for directory containing .project-root file)
# search starts from current file and recursively goes over parent directories
# returns pathlib object
path = pyrootutils.find_root(search_from=__file__, indicator=".project-root")

# find absolute root path (searches for directory containing any of the files on the list)
path = pyrootutils.find_root(search_from=__file__, indicator=[".git", "setup.cfg"])

# take advantage of the pathlib syntax
data_dir = path / "data"
assert data_dir.exists(), f"path doesn't exist: {data_dir}"

# set root directory
pyrootutils.set_root(
    path=path, # path to the root directory
    project_root_env_var=True, # set the PROJECT_ROOT environment variable to root directory
    dotenv=True, # load environment variables from .env if exists in root directory
    pythonpath=True, # add root directory to the PYTHONPATH (helps with imports)
    cwd=True, # change current working directory to the root directory (helps with filepaths)
)

Because paths no longer have to be hard-coded, this is useful when building a template.
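
To make the idea of searching upward for an indicator file concrete, here is a rough sketch using only pathlib. It is not pyrootutils' actual implementation; the function name `find_root_sketch` and its behavior on failure are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def find_root_sketch(search_from, indicators=(".project-root",)):
    """Walk upward from search_from until a directory contains one of the indicators."""
    start = Path(search_from).resolve()
    if start.is_file():
        start = start.parent
    # Check the starting directory first, then each parent up to the filesystem root.
    for directory in (start, *start.parents):
        if any((directory / name).exists() for name in indicators):
            return directory
    raise FileNotFoundError(f"no directory containing {indicators} above {start}")


# Usage: build a throwaway project tree and search from a nested directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / ".project-root").touch()
    nested = root / "src" / "models"
    nested.mkdir(parents=True)
    found = find_root_sketch(nested)
    found_matches = found == root

print(found_matches)
```

Because the search starts from the calling file rather than the working directory, the same code resolves the root regardless of where the script is launched from.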

pytest

pytest is a Python testing library that supports TDD (Test Driven Development).

Let's read the code to understand how pytest works in this template.
The full code is available through the tests/test_train.py link.

First, the test_train_fast_dev_run function.

def test_train_fast_dev_run(cfg_train):
    """Run for 1 train, val and test step."""
    HydraConfig().set_config(cfg_train)     # set the config through Hydra
    with open_dict(cfg_train):
        cfg_train.trainer.fast_dev_run = True   # use fast_dev_run for a quick check that everything runs
        cfg_train.trainer.accelerator = "cpu"
    train(cfg_train)

Next is the test_train_fast_dev_run_gpu function.

@RunIf(min_gpus=1)
def test_train_fast_dev_run_gpu(cfg_train):
    """Run for 1 train, val and test step on GPU."""
    HydraConfig().set_config(cfg_train)
    with open_dict(cfg_train):
        cfg_train.trainer.fast_dev_run = True
        cfg_train.trainer.accelerator = "gpu"
    train(cfg_train)

This function carries the @RunIf decorator.
It decides, based on a set of conditions, whether the test should run. An excerpt of its definition:

if min_gpus:
    conditions.append(torch.cuda.device_count() < min_gpus)
    reasons.append(f"GPUs>={min_gpus}")

...

reasons = [rs for cond, rs in zip(conditions, reasons) if cond]

return pytest.mark.skipif(
    condition=any(conditions),
    reason=f"Requires: [{' + '.join(reasons)}]",
    **kwargs,
)

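The excerpt above works as follows.
1. It checks various conditions such as min_gpus. (Each condition is False when the requirement is met.)
2. Through any, if even one condition is True (a requirement is not met), skipif is triggered.
3. The test is then skipped.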

In other words, the test runs only when every requirement is met; if any requirement is not met, the test is skipped and the reason is printed.
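
The skip decision can be reproduced without pytest or torch. The sketch below mirrors the any/join logic of the excerpt; the helper name `skip_decision` and the hard-coded GPU count are assumptions for illustration only (RunIf itself reads torch.cuda.device_count()).

```python
def skip_decision(requirements):
    """Decide whether to skip, given (unmet, reason) pairs.

    unmet=True means the requirement is NOT satisfied.
    Returns (should_skip, message).
    """
    conditions = [unmet for unmet, _ in requirements]
    # Keep only the reasons whose requirement is not satisfied.
    reasons = [reason for unmet, reason in requirements if unmet]
    return any(conditions), f"Requires: [{' + '.join(reasons)}]"


available_gpus = 0  # assumed value for illustration
min_gpus = 1
should_skip, message = skip_decision([(available_gpus < min_gpus, f"GPUs>={min_gpus}")])
print(should_skip, message)  # True Requires: [GPUs>=1]
```

On a machine with at least one GPU the condition would be False, any would return False, and the test would run.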

Apart from running on the GPU, this test is identical to the previous one.

There are other functions with pytest decorators as well.
First, some functions are marked with @pytest.mark.slow.

@pytest.mark.slow
def test_train_epoch_double_val_loop(cfg_train):
    """Train 1 epoch with validation loop twice per epoch."""
    
@pytest.mark.slow
def test_train_ddp_sim(cfg_train):
    """Simulate DDP (Distributed Data Parallel) on 2 CPU processes."""
    
@pytest.mark.slow
def test_train_resume(tmp_path, cfg_train):
    """Run 1 epoch, finish, and resume for another epoch."""
    
						...

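What these functions have in common is that, as the decorator (mark) name suggests, they take a long time to run.
For routine test runs, it is therefore more efficient to run only the tests that finish quickly.
The template's Makefile shows how this is done.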

# Makefile

test: ## Run not slow tests
	pytest -k "not slow"

test-full: ## Run all tests
	pytest

train: ## Train the model
	python src/train.py

The commands above use flags such as -k.
Explaining them all would make this post too long, so I will leave a reference page instead. flags

In short, @pytest.mark.slow is a decorator that uses pytest's mark feature to tell groups of test functions apart.
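
Custom marks such as slow are usually registered so that pytest does not warn about an unknown mark. One way to do this is the pytest_configure hook in conftest.py, sketched below. The template may already register the marker in its own configuration, so treat this as an illustrative assumption; the description string is also made up.

```python
# conftest.py (sketch)
def pytest_configure(config):
    # Register the custom "slow" marker under the "markers" ini option.
    config.addinivalue_line("markers", "slow: marks tests that take a long time to run")
```

With the marker registered, tests can also be selected by marker expression, for example pytest -m "not slow", whereas -k filters by keyword expression.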
The template also uses @pytest.mark.parametrize and @pytest.fixture.

@pytest.fixture makes test code more reusable and can be used as follows.

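import pytest

@pytest.fixture(params=[1, 3, 5])
def make_double(request):
    # request.param holds the current value taken from params
    return [request.param, request.param * 2]

def test_double(make_double):
    # the test runs once per parameter: 1, 3 and 5
    number, doubled = make_double
    assert number * 2 == doubled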

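@pytest.mark.parametrize passes parameters to a test function, running it once per parameter set. In the example below, the third case fails because 6*9 is 54, not 42.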

import pytest


@pytest.mark.parametrize("test_input,expected", [("3+5", 8), ("2+4", 6), ("6*9", 42)])
def test_eval(test_input, expected):
    assert eval(test_input) == expected
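
To see why only one of the three cases fails, the parametrized test can be unrolled into a plain loop. This is a sketch of the idea only, not how pytest implements parametrization; the helper name `run_cases` is hypothetical.

```python
cases = [("3+5", 8), ("2+4", 6), ("6*9", 42)]


def run_cases(cases):
    """Evaluate each case independently and record pass/fail, as separate test runs would."""
    results = {}
    for test_input, expected in cases:
        results[test_input] = eval(test_input) == expected
    return results


print(run_cases(cases))  # {'3+5': True, '2+4': True, '6*9': False}
```

Each tuple is reported as its own test, so the failing case does not stop the other two from being checked.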

pytest function reference
