Natural Language Processing in TensorFlow week2

han811 · November 14, 2020

Word Embedding

Large Movie Review Dataset (IMDB) link
http://ai.stanford.edu/~amaas/data/sentiment/

To use tensorflow_datasets, first run pip install tensorflow-datasets.
The package makes it easy to download many well-known datasets provided through TensorFlow.

import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

For tfds, it seems enough to consult the official TensorFlow docs whenever a specific need comes up,
since in practice you usually clean and build your own data anyway.
https://www.tensorflow.org/datasets/overview#tfdsload

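import tensorflow as tf

# Example hyperparameters (assumed values; match them to your tokenizer and padding)
vocab_size = 10000
embedding_dim = 16
max_length = 120

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()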

tf.keras.layers.Embedding creates a trainable lookup layer that maps each word index to a dense vector of size embedding_dim.
Its weight matrix therefore has shape (vocab_size, embedding_dim).
The input_length argument is the length of each tokenized, padded sequence that is fed to the layer.
Each element of the sequence is converted to a vector of size embedding_dim, so input of shape (batch_size, input_length) comes out as (batch_size, input_length, embedding_dim).
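
The shape bookkeeping above can be sketched with plain NumPy. This only illustrates the lookup and is not the Keras implementation: the weights are random, the token ids are made up, and pad_to_length is a hypothetical helper standing in for the padding step (it right-pads, which may differ from the padding settings you actually use).

```python
import numpy as np

vocab_size, embedding_dim, max_length = 10, 4, 5  # small assumed values for illustration

# Stand-in for the layer's weight matrix: one row per token id.
rng = np.random.default_rng(0)
weights = rng.normal(size=(vocab_size, embedding_dim))

def pad_to_length(ids, length, pad_id=0):
    """Hypothetical helper: truncate or right-pad a list of token ids to `length`."""
    return (ids + [pad_id] * length)[:length]

# Two tokenized sentences of different lengths (made-up ids).
sequences = [[3, 7, 1], [2, 9, 4, 4, 8, 6]]
batch = np.array([pad_to_length(s, max_length) for s in sequences])

# Embedding lookup: each integer id selects one row of the weight matrix.
embedded = weights[batch]

print(weights.shape)   # (10, 4)   -> (vocab_size, embedding_dim)
print(batch.shape)     # (2, 5)    -> (batch_size, input_length)
print(embedded.shape)  # (2, 5, 4) -> (batch_size, input_length, embedding_dim)
```

In the real layer the rows of the weight matrix are learned during training rather than fixed random values.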

TensorFlow official Embedding docs
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

tf.keras.layers.GlobalAveragePooling1D()

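This layer takes a 3D tensor and averages it over the middle (sequence) dimension, so input of shape (batch_size, input_length, embedding_dim) comes out as a 2D tensor of shape (batch_size, embedding_dim).
It can reportedly work better than Flatten(), and it leaves the following Dense layer with far fewer weights.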
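
To make the comparison with Flatten() concrete, here is a NumPy sketch of the two reductions on a made-up batch. It mimics the shapes only, and the parameter counts assume that the Dense(6) layer from the model above comes next.

```python
import numpy as np

batch_size, input_length, embedding_dim = 2, 5, 4  # small assumed values

# Made-up embedding output with shape (batch_size, input_length, embedding_dim).
embedded = np.arange(batch_size * input_length * embedding_dim, dtype=float)
embedded = embedded.reshape(batch_size, input_length, embedding_dim)

# Global average pooling: mean over the sequence (middle) axis.
pooled = embedded.mean(axis=1)
# Flatten: keep every position, concatenated into one long vector per sample.
flat = embedded.reshape(batch_size, -1)

print(pooled.shape)  # (2, 4)
print(flat.shape)    # (2, 20)

# Weights plus biases of a following Dense(6) layer for each choice.
units = 6
print(pooled.shape[1] * units + units)  # 30
print(flat.shape[1] * units + units)    # 126
```

The gap grows with input_length, since the flattened vector scales with the sequence length while the pooled vector does not.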

TensorFlow official GlobalAveragePooling1D docs
https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D

You can also load and reuse a pre-built tokenizer vocabulary that ships with a TensorFlow dataset.
See the docs for the details.
TensorFlow Datasets official docs
https://www.tensorflow.org/datasets/catalog/overview

my github repo - https://github.com/han811/tensorflow
