[cs231n] Lecture 5: Convolutional Neural Networks

๊ฐ•๋™์—ฐยท2022๋…„ 1์›” 26์ผ
[CS231n-2017]

👨‍🏫 This review is based on the cs231n-2017 lectures.

Syllabus
Youtube Link

Convolutional Neural Networks

📌 The first part of the lecture introduces Convolutional Neural Networks. As the lecture slide shows, neural networks found success when applied to image classification, and since then most image processing work has used CNNs.

📌 AlexNet (2012) is the network that brought CNNs into the spotlight, although earlier CNNs such as LeNet predate it. Starting with AlexNet, neural networks are now used across many image tasks: Detection, Segmentation, Keypoint, Image Captioning, and so on. The lecture sums this up as "ConvNets are everywhere".

📌 Instead of the Fully Connected Layer used so far, a Convolution Layer learns with filters (kernels). In a 32x32x3 input, 32x32 is the spatial size of the image and the x3 is the three RGB channels.

📌 As in the slide, a 5x5x3 filter slides across the input image starting from the top, and a dot product is computed at each position. The results form the activation map shown on the right. Note that the number of filter channels must equal the number of input channels. Filter sizes vary: 1x1, 3x3, 5x5, and so on.
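The sliding computation can be sketched in plain Python. This is a minimal illustration, not the lecture's code: the function name `conv2d_single` is hypothetical, the bias term is left out, and the all-ones image and filter are dummy data chosen so the result is easy to check by hand.

```python
def conv2d_single(image, kernel, stride=1):
    """Slide one filter over an image and return a 2-D activation map.

    image:  nested list shaped [H][W][C]
    kernel: nested list shaped [F][F][C], with the same C as the image
    """
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    f = len(kernel)
    if len(kernel[0][0]) != channels:
        raise ValueError("filter depth must equal input depth")

    out_h = (height - f) // stride + 1
    out_w = (width - f) // stride + 1
    out = [[0.0] * out_w for _ in range(out_h)]

    for i in range(out_h):
        for j in range(out_w):
            total = 0.0
            # Dot product between the filter and the patch it covers.
            for di in range(f):
                for dj in range(f):
                    for c in range(channels):
                        total += image[i * stride + di][j * stride + dj][c] * kernel[di][dj][c]
            out[i][j] = total
    return out


# Dummy data (an assumption for illustration): 32x32x3 image and 5x5x3 filter of ones.
image = [[[1.0] * 3 for _ in range(32)] for _ in range(32)]
kernel = [[[1.0] * 3 for _ in range(5)] for _ in range(5)]

amap = conv2d_single(image, kernel)
print(len(amap), len(amap[0]))  # 28 28
print(amap[0][0])               # 75.0  (5 * 5 * 3 ones summed)
```

Running the same function once per filter and stacking the resulting maps gives the depth of the output volume.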

📌 The process above is repeated once per filter. With six filters, the result is a 28x28x6 activation map. Notice that the spatial size shrank from 32 to 28; the reason is covered a little later.

📌 As the slide shows, each Conv layer is combined with an activation function, pooling layers, and so on.

📌 Looking at the low-level features, the network learns very small local patterns; moving toward high-level features, it learns patterns that are increasingly specific and closer to whole objects. Each neuron image on the slide shows what an input should look like to maximize that neuron's activation. In other words, the neuron produces a large output when a similar image comes in.

📌 The slide shows an overall example of a CNN. Data flows through the network in the order Conv - ReLU - Pooling, and a Fully Connected layer is used at the end for classification.
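To see how shapes change along such a pipeline, here is a small shape-tracing sketch. The layer stack (one 5x5 conv with 6 filters, ReLU, 2x2 pooling, a 10-unit fully connected layer) is a hypothetical example, not the architecture from the slide, and `trace_shapes` is a helper name made up for this illustration.

```python
def trace_shapes(input_shape, layers):
    """Return the (height, width, depth) shape after every layer."""
    h, w, c = input_shape
    shapes = [(h, w, c)]
    for kind, cfg in layers:
        if kind == "conv":
            h = (h - cfg["f"] + 2 * cfg["pad"]) // cfg["stride"] + 1
            w = (w - cfg["f"] + 2 * cfg["pad"]) // cfg["stride"] + 1
            c = cfg["filters"]          # one activation map per filter
        elif kind == "pool":
            h = (h - cfg["f"]) // cfg["stride"] + 1
            w = (w - cfg["f"]) // cfg["stride"] + 1   # depth is unchanged
        elif kind == "relu":
            pass                        # element-wise, shape is unchanged
        elif kind == "fc":
            h, w, c = 1, 1, cfg["units"]
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        shapes.append((h, w, c))
    return shapes


# Hypothetical layer stack, chosen only for illustration.
layers = [
    ("conv", {"f": 5, "stride": 1, "pad": 0, "filters": 6}),
    ("relu", {}),
    ("pool", {"f": 2, "stride": 2}),
    ("fc", {"units": 10}),
]

shapes = trace_shapes((32, 32, 3), layers)
for shape in shapes:
    print(shape)
# (32, 32, 3)
# (28, 28, 6)
# (28, 28, 6)
# (14, 14, 6)
# (1, 1, 10)
```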

📌 As in the figure, the filter starts at the top-left corner and moves by the stride (= 1). In the example, applying a 3x3 filter with stride = 1 to a 7x7 input produces a 5x5 output.

📌 What output size do we get with stride = 2? As you can see, the output is 3x3. In other words, the stride controls the output size.

📌 Generalizing the process above gives the formula (N - F) / stride + 1. However, if layers keep being stacked this way, the activation map of the last layer becomes far too small. In addition, pixels along the border are covered by fewer filter positions than pixels in the interior, so the edges are underused. Zero padding is used to solve these problems.
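The unpadded formula can be checked against the 7x7 examples with a few lines of Python. This is a sketch under one assumption: the helper `conv_output_size` (a name made up here) raises an error when the filter does not fit evenly, whereas many libraries round down instead.

```python
def conv_output_size(n, f, stride=1):
    """Output width for an n x n input and an f x f filter, without padding."""
    if (n - f) % stride != 0:
        # Design choice of this sketch: reject strides that do not fit evenly.
        raise ValueError(f"stride {stride} does not fit n={n}, f={f} evenly")
    return (n - f) // stride + 1


print(conv_output_size(7, 3, stride=1))  # 5
print(conv_output_size(7, 3, stride=2))  # 3
```

With stride = 3 the same call raises `ValueError`, because (7 - 3) / 3 is not an integer.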

📌 With zero padding, as shown on the slide, a 7x7 input can produce a 7x7 output. The border region also takes part fully in the computation. Typically, a filter size of 3 uses zero pad = 1, a filter size of 5 uses zero pad = 2, and a filter size of 7 uses zero pad = 3.
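A minimal sketch of zero padding on a 2-D grid follows. The helper names `zero_pad` and `same_padding` are hypothetical, and the grid of ones stands in for real pixel values. The rule P = (F - 1) / 2 is the pattern behind the 3 → 1, 5 → 2, 7 → 3 pairs above and assumes an odd filter size with stride 1.

```python
def zero_pad(grid, p):
    """Surround a 2-D grid (list of rows) with p rings of zeros."""
    width = len(grid[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]            # top rows of zeros
    for row in grid:
        padded.append([0] * p + list(row) + [0] * p)    # zeros left and right
    padded.extend([0] * width for _ in range(p))        # bottom rows of zeros
    return padded


def same_padding(f):
    """Padding that keeps the spatial size at stride 1 for an odd filter size f."""
    return (f - 1) // 2


grid = [[1] * 7 for _ in range(7)]                      # dummy 7x7 input
padded = zero_pad(grid, same_padding(3))
print(len(padded), len(padded[0]))                      # 9 9

out = (len(padded) - 3) // 1 + 1                        # 3x3 filter, stride 1
print(out)                                              # 7

print([same_padding(f) for f in (3, 5, 7)])             # [1, 2, 3]
```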

  • (N - F + 2*P) / stride + 1

📌 So the output size can be computed with the formula above.

📌 Input: 32x32x3, with 10 filters of size 5x5, stride = 1, pad = 2. How many parameters does this layer have? The answer is (5x5x3 + 1 (bias)) * 10 = 760 parameters.
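The arithmetic for this example can be reproduced with a short sketch. The function name `conv_layer_summary` is made up for illustration; it applies the padded output-size formula and counts one bias per filter.

```python
def conv_layer_summary(n, depth, num_filters, f, stride, pad):
    """Return (output volume shape, parameter count) for a square conv layer."""
    out = (n - f + 2 * pad) // stride + 1
    params_per_filter = f * f * depth + 1   # weights plus one bias
    return (out, out, num_filters), params_per_filter * num_filters


shape, params = conv_layer_summary(n=32, depth=3, num_filters=10, f=5, stride=1, pad=2)
print(shape)   # (32, 32, 10)
print(params)  # 760
```

Because pad = 2 matches the 5x5 filter, the spatial size stays at 32, and only the depth changes to the number of filters.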

📌 The lecture slide summarizes these formulas.

📌 Although the slide does not mention it, a 5x5 filter is commonly said to have a 5x5 receptive field. In the 28x28x5 activation map, each spatial position holds five values, each capturing a different feature.

📌 CNNs also have Pooling layers. In short, they are used for downsampling: a Pooling layer simply reduces the spatial size of the activation map.

📌 The most common method is Max pooling. As on the slide, max pooling with 2x2 filters and stride 2 turns a 4x4 input into a 2x2 output. Max pooling is very simple: it picks the largest value inside each filter window.
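Max pooling is short enough to sketch directly. The function name `max_pool2d` and the 4x4 values below are illustrative assumptions; the defaults follow the 2x2 filter and stride 2 described above.

```python
def max_pool2d(grid, f=2, stride=2):
    """Take the maximum of each f x f window of a 2-D grid."""
    out_h = (len(grid) - f) // stride + 1
    out_w = (len(grid[0]) - f) // stride + 1
    return [
        [
            max(
                grid[i * stride + di][j * stride + dj]
                for di in range(f)
                for dj in range(f)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Illustrative 4x4 activation map (values are made up for the example).
grid = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool2d(grid)
print(pooled)  # [[6, 8], [3, 4]]
```

There are no learned parameters here; the layer only selects values, and it is applied to each depth slice independently.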

📌 Commonly, F = 2 or 3 is used with stride = 2.

📌 We have looked at the role each layer plays. I think the process above is the most basic structure of a CNN.
