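In this exercise, we build a Vocab using torchtext and spaCy.
First, we implement a vocab ourselves using spaCy's Tokenizer.
Then, we build a vocab using the methods that torchtext provides.
We load WikiText-2, one of the torchtext datasets, to use as the data.
Loading a dataset from torchtext requires torchdata to be installed first.
Install the torchdata version that matches your PyTorch version.
Installing torchdata may produce an ERROR like the following.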
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
Solution: install the folium version that datascience requires, then install torchdata.
!pip install folium==0.2.1
!pip install torchdata==0.4.0
!pip show torchdata
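The installed versions can also be checked from Python, which makes it easier to confirm that torch, torchtext, and torchdata form a matching set. The following is a minimal sketch: the helper name `installed_version` is hypothetical, and it only reports versions without judging whether they are compatible.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

# Print the versions of the three packages this exercise depends on
for name in ("torch", "torchtext", "torchdata"):
    print(f"{name}: {installed_version(name)}")
```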
from torchtext.datasets import WikiText2
train = WikiText2(split='train')
The data contains \<unk>, which denotes an unknown token.
for i, text in enumerate(train):
    if i == 5: break
    print(text)
= Valkyria Chronicles III =
Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " .
The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .
!python -m spacy download en_core_web_sm
import spacy
from spacy.symbols import ORTH
Because the data contains the special token \<unk>, a special case must be added so that the spaCy tokenizer recognizes \<unk> as a single token.
spacy_en = spacy.load('en_core_web_sm')
special_case = [{ORTH:'<unk>'}]
spacy_en.tokenizer.add_special_case('<unk>', special_case)
# TEST
text = 'I use <unk> things.'
for token in spacy_en.tokenizer(text):
    print(token)
I
use
<unk>
things
.
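The effect of a special case can be illustrated without spaCy. The sketch below uses a plain regular expression as a stand-in tokenizer; it is not how spaCy tokenizes internally, and the names `PLAIN`, `WITH_SPECIAL`, and `tokenize` are hypothetical. It only shows that a rule which treats punctuation as separate tokens splits `<unk>` apart unless the literal string is matched first.

```python
import re

# Without a special case: runs of word characters, or single punctuation marks
PLAIN = re.compile(r"\w+|[^\w\s]")
# With a special case: try the literal "<unk>" before the general rules
WITH_SPECIAL = re.compile(r"<unk>|\w+|[^\w\s]")

def tokenize(text, pattern):
    """Return all non-overlapping matches of the pattern as a token list."""
    return pattern.findall(text)

text = 'I use <unk> things.'
print(tokenize(text, PLAIN))         # ['I', 'use', '<', 'unk', '>', 'things', '.']
print(tokenize(text, WITH_SPECIAL))  # ['I', 'use', '<unk>', 'things', '.']
```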
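The Vocab class does the following: it builds the token-to-id and id-to-token mappings from tokens that appear at least min_freq times, encodes a text into a sequence of token ids, and decodes a sequence of ids back into text.
Note: the spaCy tokenizer returns Token objects rather than strings, so use token.text as the key when counting and looking up tokens; this gives more accurate results than using the token object itself.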
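Why the string matters can be shown with a small sketch. It assumes that token objects do not compare equal merely because their text is equal; `FakeToken` is a hypothetical stand-in that imitates only that property and is not spaCy's Token class.

```python
from collections import Counter

class FakeToken:
    """Hypothetical stand-in for a tokenizer's token object; it holds only the text."""
    def __init__(self, text):
        self.text = text

tokens = [FakeToken("the"), FakeToken("game"), FakeToken("the")]

by_object = Counter(tokens)                # keys are three distinct objects
by_text = Counter(t.text for t in tokens)  # keys are strings

print(len(by_object))   # 3
print(by_text["the"])   # 2
```

Counting by object never merges the two occurrences of "the", so a min_freq filter applied to those counts would not behave as intended.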
from collections import Counter
from tqdm.notebook import tqdm

class Vocab:
    UNK_TOKEN = '<unk>'
    UNK_TOKEN_ID = 0

    def __init__(self, data, tokenizer, min_freq):
        self.data = [text for text in data]  # materialize the dataset iterator
        self.en = tokenizer                  # spaCy pipeline; self.en.tokenizer splits the text
        self.id2token = list()
        self.token2id = dict()
        self.build_vocab(min_freq)

    def build_vocab(self, min_freq):
        # Count token strings over the whole corpus
        counter = Counter()
        for tokens in tqdm(map(self.en.tokenizer, self.data), total=len(self.data), desc='Building Vocab'):
            counter.update(map(lambda x: x.text, tokens))
        # <unk> always gets id 0; keep only tokens seen at least min_freq times
        self.id2token = [Vocab.UNK_TOKEN] + [token for token, freq in counter.items() if freq >= min_freq and token != Vocab.UNK_TOKEN]
        self.token2id = {token: i for i, token in enumerate(self.id2token)}

    def encode(self, text):
        # Tokens missing from the vocab map to UNK_TOKEN_ID (the class attribute must be qualified)
        encoded = [self.token2id.get(token.text, Vocab.UNK_TOKEN_ID) for token in self.en.tokenizer(text)]
        return encoded

    def decode(self, sequence):
        # Join the token strings with single spaces
        decoded = " ".join([self.id2token[token_id] for token_id in sequence])
        return decoded
corpus = Vocab(train, spacy_en, 3)
Building Vocab: 0%| | 0/36718 [00:00<?, ?it/s]
len(corpus.token2id), len(corpus.id2token)
(33242, 33242)
corpus.token2id['<unk>'], corpus.id2token[0]
(0, '<unk>')
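The effect of min_freq is easier to see on a tiny corpus. The sketch below repeats the same counting and filtering steps with `str.split` in place of the spaCy tokenizer, which is an assumption made only to keep the example self-contained; `build_mappings` and the sample sentences are hypothetical.

```python
from collections import Counter

UNK = "<unk>"

def build_mappings(texts, min_freq):
    """Build token2id / id2token, keeping only tokens seen at least min_freq times."""
    counter = Counter()
    for text in texts:
        counter.update(text.split())  # whitespace split stands in for the real tokenizer
    id2token = [UNK] + [tok for tok, freq in counter.items() if freq >= min_freq and tok != UNK]
    token2id = {tok: i for i, tok in enumerate(id2token)}
    return token2id, id2token

texts = ["the game began", "the game ended", "a new game"]
token2id, id2token = build_mappings(texts, min_freq=2)
print(id2token)  # ['<unk>', 'the', 'game']

# "new" appears only once, so it falls back to the <unk> id
encoded = [token2id.get(tok, token2id[UNK]) for tok in "the new game".split()]
print(encoded)   # [1, 0, 2]
```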
train_text = [text for text in train]
encoded = corpus.encode(train_text[4])
encoded
[2, 86, 35, 87, 88, 46, 89, 15, 90, 91, 29, 92, 93, 18, 19, 94, 95, 96, 4, 5, 97, 17, 98, 49, 99, 19, 100, 101, 18, 19, 51, 15, 49, 102, 103, 104, 105, 15, 106, 25, 107, 19, 35, 108, 0, 42, 51, 109, 17, 110, 111, 0, 112, 39, 113, 114, 115, 116, 117, 118, 119, 120, 15, 121, 122, 4, 5, 97, 123, 124, 125, 17, 126, 92, 127, 18, 128, 129, 19, 130, 17, 86, 35, 131, 132, 133, 134, 135, 37, 136, 137, 138, 17, 7]
print(f"decode : {corpus.decode(encoded)}")
print(f"original: {train_text[4]}")
decode : The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May ' n .
original: The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .
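The decoded text differs from the original near the end (May ' n instead of May 'n). Joining tokens with single spaces cannot restore spacing that the tokenizer removed. The sketch below reproduces the effect with a regex stand-in tokenizer, which is an assumption and not spaCy's actual rules.

```python
import re

def tokenize(text):
    """Regex stand-in: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

original = "sung by May 'n ."
round_trip = " ".join(tokenize(original))

print(round_trip)              # sung by May ' n .
print(round_trip == original)  # False
```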
With torchtext's get_tokenizer and build_vocab_from_iterator, a vocab can be built with relatively little effort.
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
torch_tokenizer = get_tokenizer('basic_english')
torch_tokenizer('I use <unk> thing.')
['i', 'use', '<unk>', 'thing', '.']
torch_vocab = build_vocab_from_iterator(map(torch_tokenizer, train), min_freq=3, specials=['<unk>'])
build_vocab_from_iterator() returns an object of the torchtext.vocab.Vocab class. The returned Vocab object can be used as shown below.
# get_stoi(), get_itos()
p_token2id = torch_vocab.get_stoi()
p_id2token = torch_vocab.get_itos()
print(len(p_token2id.keys()), len(p_id2token))
print(p_token2id['<unk>'], p_id2token[0])
28782 28782
0 <unk>
# __getitem__, lookup_token()
torch_vocab['<unk>'], torch_vocab.lookup_token(0)
(0, '<unk>')
# Test sentences for tokenization
train_text = [text for text in train]
# forward(), lookup_indices()
encoded1 = torch_vocab(torch_tokenizer(train_text[4]))
encoded2 = torch_vocab.lookup_indices(torch_tokenizer(train_text[4]))
print(f"encoded1: {encoded1}")
print(f"encoded2: {encoded2}")
encoded1: [1, 67, 135, 369, 6, 297, 2, 3245, 65, 8, 184, 1742, 4, 1, 138, 1177, 13, 3849, 3869, 304, 3, 66, 24, 3277, 1, 1176, 579, 4, 1, 93, 2, 24, 44, 4380, 1842, 18273, 2, 89, 14, 407, 1, 67, 61, 0, 17, 93, 19588, 3, 278, 3749, 0, 25905, 5, 3024, 25883, 19949, 99, 435, 25, 479, 11649, 2, 163, 18, 3849, 3869, 304, 537, 17954, 27012, 3, 8, 184, 157, 4, 1145, 3886, 1, 1623, 3, 1, 67, 11, 15, 658, 1071, 10, 3610, 19, 75, 11, 1586, 3]
encoded2: [1, 67, 135, 369, 6, 297, 2, 3245, 65, 8, 184, 1742, 4, 1, 138, 1177, 13, 3849, 3869, 304, 3, 66, 24, 3277, 1, 1176, 579, 4, 1, 93, 2, 24, 44, 4380, 1842, 18273, 2, 89, 14, 407, 1, 67, 61, 0, 17, 93, 19588, 3, 278, 3749, 0, 25905, 5, 3024, 25883, 19949, 99, 435, 25, 479, 11649, 2, 163, 18, 3849, 3869, 304, 537, 17954, 27012, 3, 8, 184, 157, 4, 1145, 3886, 1, 1623, 3, 1, 67, 11, 15, 658, 1071, 10, 3610, 19, 75, 11, 1586, 3]
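In the ids above, very common tokens such as "the" and "," receive small ids (1 and 2), which suggests that ids are assigned in descending order of frequency after the special tokens. The sketch below imitates that ordering in plain Python. It is an approximation written under that assumption, not torchtext's code, and `build_vocab_sketch` is a hypothetical name.

```python
from collections import Counter

def build_vocab_sketch(token_lists, min_freq=1, specials=("<unk>",)):
    """Assign ids: specials first, then tokens by descending frequency (ties broken alphabetically)."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, freq in ordered if freq >= min_freq and tok not in specials]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

docs = [["the", "game", ",", "the", "series"], ["the", "game", ",", "<unk>"]]
stoi, itos = build_vocab_sketch(docs, min_freq=2)
print(itos)  # ['<unk>', 'the', ',', 'game']
```

By contrast, the hand-written Vocab class earlier assigns ids in order of first appearance, so the two vocabularies give different ids to the same token.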
# lookup_tokens()
decoded = torch_vocab.lookup_tokens(encoded1)
decoded_sentence = " ".join(decoded)
print(f"decoded: {decoded_sentence}")
print(f"original: {train_text[4]}")
decoded: the game began development in 2010 , carrying over a large portion of the work done on valkyria chronicles ii . while it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . character designer <unk> honjou and composer hitoshi sakimoto both returned from previous entries , along with valkyria chronicles ii director takeshi ozawa . a large team of writers handled the script . the game ' s opening theme was sung by may ' n .
original: The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . Character designer <unk> Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .