NLP - Day 5, 9/11 Fri

이호영 · September 11, 2021

Subword tokenization

No matter how well a model is pre-trained, it will not perform well if the quality of its subwords is poor.

Byte-Pair Encoding (BPE) tokenization was introduced in 'Neural Machine Translation of Rare Words with Subword Units' by Rico Sennrich et al. to address the out-of-vocabulary (OOV) problem.
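To make the idea concrete, here is a minimal sketch of how BPE merge rules can be learned. It assumes a toy word-frequency dictionary and an end-of-word marker `</w>`. The function name `learn_bpe` and the variable `toy_corpus` are made up for illustration and do not come from the assignment code.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dictionary (illustrative sketch)."""
    # Start from characters, with an end-of-word marker appended to every word.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical toy corpus: word -> frequency.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_corpus, num_merges=3)
print(merges)
```

Because frequent character sequences are merged into single symbols while rare words stay decomposed into smaller units, an unseen word can still be represented by known subwords instead of falling back to an OOV token.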

Required Assignment 3: Subword-level Language Model

Nearly all recent models, including Transformer, BERT, and ELECTRA, use subword segmentation. Common subword segmentation methods include BPE, SentencePiece, and WordPiece.
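As a rough illustration of how a subword vocabulary is applied at tokenization time, the sketch below follows the greedy longest-match-first scheme commonly associated with WordPiece, where pieces that continue a word carry a `##` prefix. The vocabulary `toy_vocab` and the function name `wordpiece_tokenize` are assumptions made for this example, not part of any library.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first (illustrative sketch)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No known piece covers this position, so the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary.
toy_vocab = {"un", "##aff", "##able", "##a", "##ble"}
pieces = wordpiece_tokenize("unaffable", toy_vocab)
print(pieces)
```

With this toy vocabulary the longer piece `##able` is preferred over `##a` followed by `##ble`, which shows why vocabulary quality directly shapes the resulting segmentation.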

Subword tokenizer options for Korean: SentencePiece, Hugging Face, OpenNMT

Hugging Face

A library that implements a wide range of Transformer-based models along with their training scripts.

In general, what you would write yourself as layer.py and model.py corresponds to transformers.models, and train.py corresponds to transformers.Trainer.
