TensorFlow Datasets: why is it so awkward?

주제무 · July 14, 2022

Tensorflow

How to use TensorFlow Datasets (it's really no big deal)

Everything starts with tfds.load, and with knowing that tfds.load returns a tf.data.Dataset.
So when you are looking for methods and attributes, the tf.data API is the place to check.

import tensorflow_datasets as tfds

(ds_train, ds_val, ds_test), info = tfds.load(
    'stanford_dogs',
    split=['train[:90%]', 'train[90%:]', 'test'],
    data_dir='/content/yo',
    as_supervised=True,
    with_info=True
)

with_info

First, let's find out what kind of data this is.
Passing with_info=True to tfds.load returns a DatasetInfo object alongside the datasets. Printing it gives:

tfds.core.DatasetInfo(
    name='stanford_dogs',
    version=0.2.0,
    description='The Stanford Dogs dataset contains images of 120 breeds of dogs from around
the world. This dataset has been built using images and annotation from
ImageNet for the task of fine-grained image categorization. There are
20,580 images, out of which 12,000 are used for training and 8580 for
testing. Class labels and bounding box annotations are provided
for all the 12,000 images.',
    homepage='http://vision.stanford.edu/aditya86/ImageNetDogs/main.html',
    features=FeaturesDict({
        'image': Image(shape=(None, None, 3), dtype=tf.uint8),
        'image/filename': Text(shape=(), dtype=tf.string),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=120),
        'objects': Sequence({
            'bbox': BBoxFeature(shape=(4,), dtype=tf.float32),
        }),
    }),
    total_num_examples=20580,
    splits={
        'test': 8580,
        'train': 12000,
    },
    supervised_keys=('image', 'label'),
    citation="""@inproceedings{KhoslaYaoJayadevaprakashFeiFei_FGVC2011,
    author = "Aditya Khosla and Nityananda Jayadevaprakash and Bangpeng Yao and
              Li Fei-Fei",
    title = "Novel Dataset for Fine-Grained Image Categorization",
    booktitle = "First Workshop on Fine-Grained Visual Categorization,
                 IEEE Conference on Computer Vision and Pattern Recognition",
    year = "2011",
    month = "June",
    address = "Colorado Springs, CO",
    }
    @inproceedings{imagenet_cvpr09,
            AUTHOR = {Deng, J. and Dong, W. and Socher, R. and Li, L.-J. and
                      Li, K. and Fei-Fei, L.},
            TITLE = {{ImageNet: A Large-Scale Hierarchical Image Database}},
            BOOKTITLE = {CVPR09},
            YEAR = {2009},
            BIBSOURCE = "http://www.image-net.org/papers/imagenet_cvpr09.bib"}""",
    redistribution_info=,
)

It is a huge amount of text, but none of it can be left out.
The parts that deserve close attention are:

features
total_num_examples
splits

features

They can be used as follows.

info.features['image']
# Image(shape=(None, None, 3), dtype=tf.uint8)

info.features['label'].num_classes
# 120

Access info.features first;
since it is a FeaturesDict, index it with ['image'] or ['label'].

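Note that the None entries in the image shape
mean the images do not share a single size (height and width vary from image to image),

and num_classes is the total number of classes in the classification problem.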

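total_num_examples

The combined number of examples across all splits (here, train plus test).

It cannot be read as info.total_num_examples. The value is exposed on the splits object instead, as info.splits.total_num_examples.
info.dataset_size returns a different value because it reports the dataset's size on disk in bytes, not a number of examples.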
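As a quick sanity check on the numbers in the printout above, the total is simply the sum of the per-split sizes. The sizes below are copied by hand from that printout; with a loaded info object they would normally be read from info.splits[name].num_examples rather than typed in.

```python
# Split sizes copied by hand from the DatasetInfo printout (an assumption:
# in real code, read them from info.splits[name].num_examples instead).
split_sizes = {"test": 8580, "train": 12000}

# The total is the sum over every split the dataset provides.
total = sum(split_sizes.values())
print(total)  # 20580
```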

splits

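The most important part.
If you do not pass split to tfds.load, the return value is a dict keyed by split name rather than a single dataset, so iterating over it yields strings.

splits tells you how many examples each of train, validation, and test contains.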

Pass it to tfds.load as
split=['train[:90%]', 'train[90%:]', 'test'].
The stanford_dogs dataset only provides train and test,
so the validation data is carved out of train separately.

If only train exists,
specify it as split=['train[:70%]', 'train[70%:80%]', 'train[80%:]'].

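You can also pass split='train+test' to get the train and test data merged into a single dataset (written as a list, split=['train+test'], the result is a one-element list).

There are other ways to specify the split argument,
but the string form above is the most intuitive.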
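To see what the percent slices amount to in example counts, here is a minimal pure-Python sketch. resolve_split is a hypothetical helper written for this illustration, not part of TFDS; it only handles the simple forms used in this post and uses floor division, whereas TFDS applies its own rounding rules when a percentage does not divide evenly.

```python
import re

# Hypothetical helper (not a TFDS function): matches 'name',
# 'name[:B%]', 'name[A%:]' and 'name[A%:B%]'.
_SPEC = re.compile(r"^(\w+)(?:\[(?:(\d+)%)?:(?:(\d+)%)?\])?$")

def resolve_split(spec, sizes):
    """Return (split name, start index, stop index) for a split spec."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"unsupported split spec: {spec!r}")
    name, start_pct, stop_pct = match.groups()
    n = sizes[name]
    # Missing bounds default to the beginning or end of the split.
    start = n * int(start_pct) // 100 if start_pct else 0
    stop = n * int(stop_pct) // 100 if stop_pct else n
    return name, start, stop

# Split sizes taken from the stanford_dogs info printout.
sizes = {"train": 12000, "test": 8580}

for spec in ["train[:90%]", "train[90%:]", "test"]:
    name, start, stop = resolve_split(spec, sizes)
    print(spec, "->", stop - start, "examples")
```

With these sizes, the 90/10 split leaves 10,800 examples for training and 1,200 for validation, and test keeps all 8,580.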

as_supervised

Determines the form in which each dataset element is returned.

as_supervised=False (default)

First, take a look at the code below.

for data in ds_train:
  print(data.keys())
  break

# dict_keys(['image', 'image/filename', 'label', 'objects'])

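ds_train (a tf.data.Dataset) cannot be accessed by index.
Besides the take method listed further down, you can iterate over it with a for loop as above;
each element (data) is a dictionary, and the keys differ from dataset to dataset.

You may be curious about objects. According to the info above, it is a Sequence whose bbox field is a BBoxFeature, holding the normalized coordinates of each bounding box.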
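To make the dictionary-style access concrete without downloading anything, here is a small sketch on a synthetic dataset. The array contents and the fixed 8x8 image size are assumptions chosen for brevity; real stanford_dogs elements have more keys and varying image sizes.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for a dict-style dataset (as_supervised=False).
features = {
    "image": np.zeros((4, 8, 8, 3), dtype=np.uint8),
    "label": np.array([0, 1, 2, 3], dtype=np.int64),
}
ds = tf.data.Dataset.from_tensor_slices(features)

# take(1) returns a new dataset holding only the first element,
# which replaces the for/break pattern.
for data in ds.take(1):
    keys = sorted(data.keys())
    image_shape = tuple(data["image"].shape)
    label = int(data["label"].numpy())
    print(keys, image_shape, label)
```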

as_supervised=True

If you pass the argument as above,

for data in ds_train:
  img, label = data
  print(label)
  break

each element (data) becomes a tuple instead of a dictionary and holds only the image and label values. Which keys end up in the tuple can be checked through supervised_keys in info.

print(info.supervised_keys)

# ('image', 'label')

Of course, you can also unpack straight into img, label as follows.

for img, label in ds_train:
  print(label)
  break

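How to access the image

import matplotlib.pyplot as plt

# Show the first image in the dataset, then stop.
for img, label in ds_train:
  plt.figure(figsize=(10,10))
  plt.imshow(img)
  plt.grid(False)
  plt.axis("off")
  break
# displays a cute dog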

This carries over directly to tf.data.Dataset.map(function):
with as_supervised=True, function receives img and label as its two arguments,
so the map method is easy to use.
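A minimal sketch of that pattern follows. It runs on synthetic (image, label) pairs of varying size rather than the real dataset, and the preprocess function, the 224 target size, and the division by 255 are illustrative assumptions, not something the original post prescribes.

```python
import numpy as np
import tensorflow as tf

IMG_SIZE = 224  # assumed target size, chosen only for illustration

def make_examples():
    # Synthetic (image, label) pairs with different image sizes,
    # standing in for elements loaded with as_supervised=True.
    for i, (h, w) in enumerate([(50, 60), (80, 40), (30, 30)]):
        yield np.zeros((h, w, 3), dtype=np.uint8), np.int64(i)

ds = tf.data.Dataset.from_generator(
    make_examples,
    output_signature=(
        tf.TensorSpec(shape=(None, None, 3), dtype=tf.uint8),
        tf.TensorSpec(shape=(), dtype=tf.int64),
    ),
)

def preprocess(img, label):
    # map unpacks each (img, label) tuple into two arguments.
    img = tf.image.resize(img, (IMG_SIZE, IMG_SIZE))  # result is float32
    return img / 255.0, label

ds_resized = ds.map(preprocess)

# After the map, every image has the same shape.
shapes = [tuple(img.shape) for img, _ in ds_resized]
print(shapes)
```

Resizing inside map is what makes later batching possible, since batch needs all elements to share one shape.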

tf.data.Dataset

A confusing thing

After calling tfds.load, you would expect to get back an object from the same tensorflow_datasets module,
but it returns a tf.data.Dataset.

So to find the methods you need, you have to look at the tf.data API.

Here are the important methods.

map
batch
from_tensor_slices
from_tensors
prefetch
repeat
shuffle
take

Try summarizing them yourself!
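As a starting point, here is a hedged sketch that chains several of the listed methods on a small synthetic dataset. The array sizes, buffer size, and batch size are arbitrary assumptions, and tf.data.AUTOTUNE assumes a reasonably recent TensorFlow 2 release (older versions expose it as tf.data.experimental.AUTOTUNE).

```python
import numpy as np
import tensorflow as tf

# Ten synthetic examples with a fixed image shape.
images = np.zeros((10, 8, 8, 3), dtype=np.float32)
labels = np.arange(10, dtype=np.int64)

# from_tensor_slices creates one element per row;
# from_tensors would wrap the whole pair as a single element.
ds = tf.data.Dataset.from_tensor_slices((images, labels))

pipeline = (
    ds.shuffle(buffer_size=10, seed=0)  # randomize element order
    .batch(4)                           # group elements into batches of 4
    .prefetch(tf.data.AUTOTUNE)         # overlap data preparation with consumption
)

# 10 examples in batches of 4 give batch sizes 4, 4 and 2.
batch_sizes = [int(lbl.shape[0]) for _, lbl in pipeline]
print(batch_sizes)  # [4, 4, 2]

# take(n) keeps the first n elements (here, batches);
# repeat(n) iterates over the dataset n times.
num_taken = sum(1 for _ in pipeline.take(2))
num_repeated = sum(1 for _ in pipeline.repeat(2))
print(num_taken, num_repeated)  # 2 6
```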

Sources

tfds.load
https://www.tensorflow.org/datasets/api_docs/python/tfds/load

tf.data.Dataset
https://www.tensorflow.org/api_docs/python/tf/data/Dataset

tfds.imageclassification.StanfordDogs
https://www.tensorflow.org/datasets/catalog/stanford_dogs
