[Kim Ki-hyun's Natural Language Processing Deep Learning Camp] Chapter 4. Preprocessing (2022/03/08)

gromit · March 8, 2022


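[1] Corpus collection step: what is the difference between Selenium and BeautifulSoup? (https://rubber-tree.tistory.com/88)
[2] Cleaning (normalization) step: re.sub(pattern, new_text, text)
[3] Sentence segmentation step: the NLTK natural language toolkit (version 3.2.5) - from nltk.tokenize import sent_tokenize
[4] Tokenization step: for Korean → Mecab, KoNLPy (morphological analyzers)
[5] Parallel corpus alignment: MUSE (Facebook; word-level translation, unsupervised learning) ⇒ CTK (sentence alignment for bilingual corpora) (https://kh-kim.gitbook.io/natural-language-processing-with-pytorch/00-cover-3/05-align)
[6] Subword segmentation: the BPE algorithm (Sennrich), SentencePiece (Google)
- Benefits: (1) reduces the vocabulary size, (2) reduces sparsity, (3) handles UNK (out-of-vocabulary) tokens efficiently
- Drawback: a separate BPE model has to be trained for each training dataset
  ➕) [Subword segmentation (sentencepiece, bpe, sub-word, bpe-dropout)] (introduces the techniques and the code for applying them)
  ➕) SentencePiece works the same way: run spm.SentencePieceTrainer.Train → training produces two files, <model_name>.model and <model_name>.vocab → load the resulting model file (e.g. m.model) to perform segmentation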
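
As a rough illustration of the cleaning step in [2], here is a minimal sketch built on re.sub. The function name clean_text and the three patterns are my own assumptions for the example. They are not rules taken from the book, and a real corpus usually needs patterns tuned to its own noise.

```python
import re


def clean_text(text: str) -> str:
    """Apply a few illustrative cleaning rules (hypothetical, not from the book)."""
    text = re.sub(r"<[^>]+>", " ", text)       # replace HTML-like tags with a space
    text = re.sub(r"https?://\S+", " ", text)  # replace URLs with a space
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


cleaned = clean_text("<p>Hello,   world!</p> see https://example.com now")
print(cleaned)
```

Each call has the shape re.sub(pattern, new_text, text) from the note above: the pattern is matched against the text and every match is replaced with the new text.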
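
To make the BPE idea in [6] more concrete, here is a toy sketch of how merge rules can be learned: count adjacent symbol pairs, merge the most frequent pair, and repeat. The helper names (get_pair_counts, merge_pair, learn_bpe) and the tiny vocabulary are assumptions for the example. This is a simplified teaching version, not the code that SentencePiece or any production BPE tool runs.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Rewrite every word so that the given symbol pair becomes one symbol."""
    bigram = re.escape(" ".join(pair))
    # Match the pair only when it is bounded by whitespace or string edges.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    # A function replacement avoids backslash processing in the merged symbol.
    return {pattern.sub(lambda _m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Greedily learn up to num_merges merge rules from a symbol-split vocabulary."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy vocabulary: words split into characters, with </w> marking the word end.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, segmented = learn_bpe(toy_vocab, num_merges=3)
print(merges)
print(segmented)
```

On this toy vocabulary the three merges build up the shared suffix est</w>, so "newest" and "widest" end up sharing one subword. Because the merge rules come from pair frequencies in the training data, a different corpus yields different rules, which is why a BPE model has to be trained per dataset as noted above.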

torchtext library → the data.Field class, the data.TabularDataset.splits() and data.BucketIterator.splits() class methods, and the LanguageModelingDataset class
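
BucketIterator batches examples of similar length together so that less padding is needed. The sketch below shows that idea in plain Python only. It is not torchtext's implementation, and the names bucket_batches and pad_batch, the "<pad>" token, and the sample sentences are all assumptions for the example.

```python
import random


def bucket_batches(examples, batch_size, shuffle=True, seed=0):
    """Group token sequences of similar length into batches (hypothetical helper)."""
    ordered = sorted(examples, key=len)  # neighbours now have similar lengths
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    if shuffle:
        # Shuffle whole batches so the training order is not sorted by length.
        random.Random(seed).shuffle(batches)
    return batches


def pad_batch(batch, pad_token="<pad>"):
    """Pad every sequence in a batch to the length of its longest member."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_token] * (max_len - len(seq)) for seq in batch]


sentences = [s.split() for s in ["a", "a b c d", "a b", "a b c", "a b c d e", "a b"]]
batches = [pad_batch(b) for b in bucket_batches(sentences, batch_size=2)]
for batch in batches:
    print(batch)
```

Sorting before batching means each batch only pads up to its own longest sentence, instead of the longest sentence in the whole dataset.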
