NN (Neural Networks)

창슈 Β· April 4, 2025

Deep Learning

λͺ©λ‘ 보기
3/16
post-thumbnail

πŸ“Œ λ”₯λŸ¬λ‹μ΄λž€?

Neural networks began attracting attention in the 1980s (with roots going back to the 1950s), gathering success stories and high expectations alongside prominent venues such as NeurIPS and the Snowbird workshop.

In the 1990s they were pushed aside as a variety of other methods emerged, but around 2010 they were revived as "deep learning" and are now a dominant field.

성곡 λ°°κ²½μ—λŠ” Computing Power, Larger Training Sets, PyTorch, Tensorflow


πŸ“Œ PyTorch vs. TensorFlow

PyTorch

  • Simple, flexible, and Pythonic (tight integration with Python); the computation graph is built on the fly as the code runs (see the sketch after this list).
  • Popular with beginners and researchers.

TensorFlow

  • A structured approach: it uses a static computation graph, so the computation must be planned in advance.
  • Advantageous for high-performance model development when the project is designed around its structured ecosystem from the start.

Single Layer Neural Network

βœ”οΈ 단일 계측 신경망을 ν†΅ν•œ Y 예츑

  • π‘Œ = 𝑓(𝑋) β†’ λͺ©ν‘œλŠ” μž…λ ₯ π‘Ώλ‘œλΆ€ν„° κ²°κ³Ό 𝒀λ₯Ό μ˜ˆμΈ‘ν•˜λŠ” 것.

  • π‘Œ: λ°˜μ‘ λ³€μˆ˜ (μ˜ˆμΈ‘ν•˜κ³ μž ν•˜λŠ” κ°’)

  • 𝑋 = (𝑋₁, … , π‘‹β‚š): μž…λ ₯ 벑터, 총 p개의 λ³€μˆ˜λ‘œ ꡬ성됨

  • 𝑓(𝑋): μž…λ ₯ 𝑿에 λŒ€ν•œ λΉ„μ„ ν˜• ν•¨μˆ˜, ν•™μŠ΅μ„ 톡해 좔정됨

νŒŒλΌλ―Έν„°μ˜ 개수
parameters:(p+1)β‹…K+(K+1)parameters: (p+1) \cdot K + (K + 1) λŠ” λ‹€μŒκ³Ό κ°™λ‹€. β†’ Wkj+Ξ²kW_{kj} + \beta_k

βœ”οΈ 단일 계측 신경망(Single Layer Neural Network) λͺ¨λΈ

  • Functional form:

    $$f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k h_k(X)$$

  • Here, each hidden unit $h_k(X)$ is computed as:

    $$h_k(X) = g\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j \right)$$

  • Written out in full (a NumPy sketch follows the component list below):

    $$f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k \cdot g\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j \right)$$

πŸ“Œ Components
$K$: number of hidden units
$g(z)$: a fixed nonlinear activation function (e.g., ReLU, sigmoid, tanh)
$w_{kj}$: hidden-layer weights
$\beta_0, \beta_k$: output-layer bias and weights
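To make the notation concrete, here is a minimal NumPy sketch of the forward pass; the weights are random and $p$, $K$ are chosen arbitrarily, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 4, 5                        # p inputs, K hidden units (arbitrary)

W = rng.normal(size=(K, p))        # hidden-layer weights w_kj
w0 = rng.normal(size=K)            # hidden-layer biases  w_k0
beta = rng.normal(size=K)          # output weights       beta_k
beta0 = 0.5                        # output bias          beta_0

def g(z):
    return np.maximum(z, 0.0)      # ReLU as the nonlinearity

def f(x):
    h = g(w0 + W @ x)              # h_k(X) = g(w_k0 + sum_j w_kj X_j)
    return beta0 + beta @ h        # f(X)   = beta_0 + sum_k beta_k h_k(X)

print(f(rng.normal(size=p)))
```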


Activation Function

$$A_k = h_k(X) = g\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j \right)$$

is called the activation of hidden unit $k$.

Here, $g(z)$ is called the activation function.

Commonly used activation functions are the sigmoid and the ReLU (Rectified Linear Unit).

Sigmoid

$$g(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}$$

The output always lies between 0 and 1 and can be interpreted like a probability.

ReLU

$$g(z) = z^+ = \begin{cases} 0, & \text{if } z < 0 \\ z, & \text{otherwise} \end{cases}$$

Inputs below 0 are mapped to 0; everything else passes through unchanged.

Because ReLU is more efficient to compute than the sigmoid, it is the default activation function in most modern neural networks.
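Both activations in a few lines of NumPy (a quick sketch; printed values rounded):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # output always in (0, 1)

def relu(z):
    return np.maximum(z, 0.0)         # 0 below zero, identity above

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # [0.1192 0.5    0.8808]
print(relu(z))      # [0. 0. 2.]
```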

πŸ” μ€λ‹‰μΈ΅μ˜ ν™œμ„±ν™” ν•¨μˆ˜μ™€ λΉ„μ„ ν˜•μ„±

μ€λ‹‰μΈ΅μ—μ„œμ˜ ν™œμ„±ν™” ν•¨μˆ˜λŠ” 일반적으둜 λΉ„μ„ ν˜•μ΄λ‹€.
λ§Œμ•½ ν™œμ„±ν™” ν•¨μˆ˜κ°€ μ„ ν˜•μ΄λΌλ©΄, 전체 신경망 λͺ¨λΈμ€ κ²°κ΅­ μ„ ν˜• λͺ¨λΈλ‘œ μˆ˜λ ΄ν•˜κ²Œ λœλ‹€.
(즉, 은닉측을 μŒ“λŠ” μ˜λ―Έκ°€ 사라진닀.)

βœ… λͺ¨λΈ μˆ˜μ‹

f(X)=Ξ²0+βˆ‘k=1KΞ²khk(X)=Ξ²0+βˆ‘k=1KΞ²kβ‹…g(wk0+βˆ‘j=1pwkjXj)f(X) = \beta_0 + \sum_{k=1}^{K} \beta_k h_k(X) = \beta_0 + \sum_{k=1}^{K} \beta_k \cdot g\left(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j \right)

‼️ Example: a quadratic activation function (nonlinear, but a very simple form)

  • Input: $X = (X_1, X_2)$
  • Number of hidden units: $K = 2$
  • Activation function: $g(z) = z^2$
  • Weights and coefficients:
    $\beta_0 = 0,\quad \beta_1 = \tfrac{1}{4},\quad \beta_2 = -\tfrac{1}{4}$
    $w_{10} = 0,\quad w_{11} = 1,\quad w_{12} = 1$
    $w_{20} = 0,\quad w_{21} = 1,\quad w_{22} = -1$
  • Hidden unit computation:
    $h_1(X) = (0 + X_1 + X_2)^2 = (X_1 + X_2)^2$
    $h_2(X) = (0 + X_1 - X_2)^2 = (X_1 - X_2)^2$
  • Final output:
    $$f(X) = \tfrac{1}{4}(X_1 + X_2)^2 - \tfrac{1}{4}(X_1 - X_2)^2 = X_1 X_2$$

In other words, the network recovers the interaction term $X_1 X_2$ between the inputs, something a linear model with an interaction term could express just as well; the sketch below checks the arithmetic.
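A quick numeric check of the derivation, with the weights hard-coded from the example above:

```python
def g(z):
    return z ** 2                  # the quadratic activation

def f(x1, x2):
    h1 = g(x1 + x2)                # h_1(X) = (X_1 + X_2)^2
    h2 = g(x1 - x2)                # h_2(X) = (X_1 - X_2)^2
    return 0.25 * h1 - 0.25 * h2   # beta_1 h_1 + beta_2 h_2

print(f(3.0, 2.0))    # 6.0  == 3 * 2
print(f(-1.5, 4.0))   # -6.0 == -1.5 * 4
```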

βœ… λͺ¨λΈν•™μŠ΅
신경망 λͺ¨λΈμ€ λ‹€μŒ 손싀 ν•¨μˆ˜λ₯Ό μ΅œμ†Œν™”ν•˜μ—¬ ν•™μŠ΅λœλ‹€. (예: νšŒκ·€ 문제):

βˆ‘i=1n(yiβˆ’f(xi))2\sum_{i=1}^{n} \left( y_i - f(x_i) \right)^2
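A minimal PyTorch sketch of fitting by gradient descent on this loss; the data are synthetic and the sizes, step count, and learning rate are all arbitrary.

```python
import torch
import torch.nn as nn

p, K, n = 4, 5, 100
X = torch.randn(n, p)
y = torch.randn(n)                 # placeholder targets

# Single hidden layer: Linear -> ReLU -> Linear, as in f(X) above.
model = nn.Sequential(nn.Linear(p, K), nn.ReLU(), nn.Linear(K, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

for _ in range(200):
    loss = ((y - model(X).squeeze(1)) ** 2).sum()  # sum_i (y_i - f(x_i))^2
    opt.zero_grad()
    loss.backward()
    opt.step()
```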

Multilayer Neural Network

ν˜„λŒ€μ˜ 신경망(Modern Neural Networks)은 일반적으둜 ν•˜λ‚˜ μ΄μƒμ˜ 은닉측(hidden layer)을 κ°€μ§„λ‹€.

μ λ‹Ήν•œ 크기의 μ—¬λŸ¬ 은닉측을 μŒ“λŠ” 것이 훨씬 더 쒋은 해법을 μ°ΎλŠ” 데 μš©μ΄ν•˜λ‹€.
즉, λ‹€μΈ΅ ꡬ쑰(multi-layer structure)κ°€ ν•™μŠ΅μ„ 더 효율적이고 효과적으둜 λ§Œλ“ λ‹€.

πŸ”’ MNIST digit recognition (MNIST Digits)

  • MNIST: a dataset of handwritten digit (0–9) images

  • 28 Γ— 28 grayscale images, i.e., 784 pixels per image

  • Pixel values are integers in the range 0–255 (60,000 training images, 10,000 test images)

  • Input vector:

    $$X = (X_1, X_2, \dots, X_{784}), \quad X_j \in \{0, 1, \dots, 255\}$$

  • Output vector (one-hot encoded dummy variables β†’ exactly one of the 10 entries is 1; see the sketch below):

    $$Y = (Y_0, Y_1, \dots, Y_9)$$
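For reference, a one-hot encoding sketch in torch (the label values here are arbitrary):

```python
import torch
import torch.nn.functional as F

# Each row has exactly one 1, in the column of the true digit.
labels = torch.tensor([3, 0, 9])
print(F.one_hot(labels, num_classes=10))
```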

πŸ–‡οΈ 1μΈ΅ 은닉측 (L1L_1: 256 μœ λ‹›)

은닉 μœ λ‹› 계산:

Ak(1)=hk(1)(X)=g(wk0(1)+βˆ‘j=1784wkj(1)Xj),k=1,…,256A_k^{(1)} = h_k^{(1)}(X) = g\left( w_{k0}^{(1)} + \sum_{j=1}^{784} w_{kj}^{(1)} X_j \right), \quad k = 1, \dots, 256

κ°€μ€‘μΉ˜ ν–‰λ ¬ π‘Š(1)π‘Š^{(1)} 크기:

785Γ—256=200,960(bias 포함)785 \times 256 = 200{,}960 \quad (\text{bias 포함})

πŸ–‡οΈ 2μΈ΅ 은닉측 (L2L_2: 128 μœ λ‹›)

은닉 μœ λ‹› 계산:

Al(2)=hl(2)(X)=g(wl0(2)+βˆ‘k=1256wlk(2)Ak(1)),l=1,…,128A_l^{(2)} = h_l^{(2)}(X) = g\left( w_{l0}^{(2)} + \sum_{k=1}^{256} w_{lk}^{(2)} A_k^{(1)} \right), \quad l = 1, \dots, 128

κ°€μ€‘μΉ˜ ν–‰λ ¬ π‘Š(2)π‘Š^{(2)} 크기:

257Γ—128=32,896257 \times 128 = 32{,}896

πŸ–‡οΈ 좜λ ₯μΈ΅ (10개 μœ λ‹›)

μ„ ν˜• κ²°ν•©:

Zm=Ξ²m0+βˆ‘l=1128Ξ²mlAl(2),m=0,…,9Z_m = \beta_{m0} + \sum_{l=1}^{128} \beta_{ml} A_l^{(2)}, \quad m = 0, \dots, 9

κ°€μ€‘μΉ˜ ν–‰λ ¬ BB 크기:

129Γ—10=1,290129 \times 10 = 1{,}290

전체 νŒŒλΌλ―Έν„° 수 (bias 포함):

μ΄Β νŒŒλΌλ―Έν„°Β μˆ˜=200,960+32,896+1,290=235,146\text{총 νŒŒλΌλ―Έν„° 수} = 200{,}960 + 32{,}896 + 1{,}290 = \boxed{235{,}146}

βœ… Output activation function: Softmax

$$f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{l=0}^{9} e^{Z_l}}, \quad m = 0, \dots, 9$$

This is the same mechanism as multiclass logistic regression:
the 10 probabilities are all non-negative and sum to 1, and the class with the highest probability becomes the final prediction.
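A short sketch of the softmax computation; subtracting the max is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by max(z) to avoid overflow
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())               # non-negative entries, summing to 1
print(p.argmax())               # index of the predicted class
```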

βœ… Training: loss function (Cross-Entropy)

$$\text{Cross-Entropy} = - \sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log f_m(x_i)$$

  • $y_{im} = 1$ only when the true class of observation $i$ is $m$, and $0$ otherwise (one-hot encoding)

Minimizing this is the same as minimizing the negative log-likelihood.
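In PyTorch this whole pipeline is one call: `F.cross_entropy` combines the softmax with the negative log-likelihood, taking the raw scores $Z_m$ (logits) and integer labels rather than one-hot vectors. The batch below is random, for shape illustration only.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)           # Z_m for a batch of 8, 10 classes
labels = torch.randint(0, 10, (8,))   # integer class labels, not one-hot
# Note: averaged over the batch by default (reduction="mean"),
# versus the sum over i in the formula above.
print(F.cross_entropy(logits, labels))
```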

βœ… Test error rate and regularization

Many parameters β†’ regularization is essential.
Regularization used: ridge and dropout (a sketch follows below).
The best models achieve an error rate below 0.5% (the human error rate is about 0.2%, i.e., 20 mistakes on the 10,000 test images).
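A sketch of both regularizers in PyTorch: ridge corresponds to the optimizer's `weight_decay` (an L2 penalty added at every gradient step), dropout to `nn.Dropout` layers. The rates and decay value here are arbitrary.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.4),
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(p=0.3),
    nn.Linear(128, 10),
)
# weight_decay applies the ridge (L2) penalty during optimization.
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```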

The input to such a fully connected network is a 2D tensor: (#Samples, #Features).
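For example, flattening the MNIST training images into that layout (a random stand-in tensor here):

```python
import torch

images = torch.rand(60000, 28, 28)   # stand-in for the MNIST images
X = images.reshape(60000, -1)        # (#samples, #features)
print(X.shape)                       # torch.Size([60000, 784])
```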
