MLP_2 (Multilayer Perceptron)

창슈 · April 10, 2025

Deep Learning


The Backpropagation Algorithm

The backpropagation algorithm first propagates the input forward through the network to compute the output, then computes the error between the actual output and the desired output.

  1. Initialize the weights
    Initialize all weights and biases to random numbers between 0 and 1.

  2. Repeat the training loop
    Repeat the steps below for every weight until the error becomes sufficiently small.

  3. Compute the gradient of the loss function
    For each weight, compute the gradient of the loss function $E$:

    $$\frac{\partial E}{\partial w}$$

  4. Update the weights
    Using gradient descent, move each weight in the direction that reduces the error:

    $$w(t+1) = w(t) - \eta \cdot \frac{\partial E}{\partial w}$$

  • $\eta$: learning rate

Training proceeds, using the gradients obtained via backpropagation, until the loss function $E(w)$ reaches a minimum.
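As a minimal sketch of steps 3–4, here is the update loop for a toy setup I'm assuming (a single weight $w$ with loss $E(w) = \frac{1}{2}(wx - t)^2$, not the network from this post):

```python
# Toy gradient-descent loop for a single weight.
# Assumed setup: E(w) = 0.5 * (w*x - t)**2, so dE/dw = (w*x - t) * x.
eta = 0.2                    # learning rate
w, x, t = 0.5, 1.0, 0.0      # initial weight, input, target

for step in range(5):
    grad = (w * x - t) * x   # gradient of the loss w.r.t. w
    w = w - eta * grad       # w(t+1) = w(t) - eta * dE/dw
    print(step, w, 0.5 * (w * x - t) ** 2)
```

Each iteration moves $w$ against the gradient, so the printed loss shrinks monotonically for a small enough $\eta$.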


๋ฏธ๋ถ„์˜ Chain Rule (์—ฐ์‡„ ๋ฒ•์น™)

์‹ ๊ฒฝ๋ง์€ ์—ฌ๋Ÿฌ ์ธต์œผ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์œผ๋ฏ€๋กœ, ์ถœ๋ ฅ๊นŒ์ง€์˜ ๊ฒฝ๋กœ๊ฐ€ ํ•จ์ˆ˜์˜ ํ•ฉ์„ฑ์œผ๋กœ ๋˜์–ด ์žˆ์Œ.

For example, if

  • $y = f(u)$
  • $u = g(x)$

then

$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \cdot \frac{\partial u}{\partial x}$$

💡 What happens if we change $w_1$? $w_1$ affects the computation of $h_1$, and $h_1$ in turn affects $y$ and thus the `final error` $E$.

So the derivative is designed to be taken step by step:

$$\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial h_1} \cdot \frac{\partial h_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$
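As a quick toy check of the chain rule (my own example, not the network above): take $y = f(u)$ with $f$ the sigmoid and $u = g(x) = 2x$, and compare the analytic product of derivatives with a finite-difference estimate:

```python
import math

f = lambda u: 1 / (1 + math.exp(-u))   # y = f(u), the sigmoid
g = lambda x: 2 * x                    # u = g(x)

x = 0.3
u = g(x)
y = f(u)

dy_du = y * (1 - y)                    # sigmoid'(u), written via its output
du_dx = 2.0                            # g'(x)
analytic = dy_du * du_dx               # chain rule: dy/dx = dy/du * du/dx

h = 1e-6                               # central-difference estimate of dy/dx
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)
print(analytic, numeric)               # the two values should agree closely
```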


Weight connecting unit $i$ to unit $j$: when unit $j$ is an output-layer unit

📌 Weight derivative for an output-layer unit: $\frac{\partial E}{\partial w_{ij}}$

Applying the Chain Rule

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial \text{out}_j} \cdot \frac{\partial \text{out}_j}{\partial \text{net}_j} \cdot \frac{\partial \text{net}_j}{\partial w_{ij}}$$

① $\frac{\partial E}{\partial \text{out}_j}$
How sensitive the error function is to the output value:

$$\frac{\partial E}{\partial \text{out}_j} = \frac{\partial}{\partial \text{out}_j} \sum_k \frac{1}{2} (\text{target}_k - \text{out}_k)^2 = \text{out}_j - \text{target}_j$$

  • This is the rate of change of the error with respect to the unit's output.

② $\frac{\partial \text{out}_j}{\partial \text{net}_j}$
The derivative of the activation function $f$:

$$\frac{\partial \text{out}_j}{\partial \text{net}_j} = \frac{\partial f(\text{net}_j)}{\partial \text{net}_j} = f'(\text{net}_j)$$

  • This is the rate of change of unit $j$'s output with respect to its net input.
  • It is simply the derivative of the activation function.

③ $\frac{\partial \text{net}_j}{\partial w_{ij}}$
How much the weight $w_{ij}$ affects the net input $\text{net}_j$:

$$\frac{\partial \text{net}_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \left( \sum_{k=0}^{n} w_{kj}\,\text{out}_k \right) = \text{out}_i$$

  • This is the rate of change of $\text{net}_j$ with respect to the weight.

✅ Putting the derivative together

$$\frac{\partial E}{\partial w_{ij}} = ① \times ② \times ③ = (\text{out}_j - \text{target}_j) \cdot f'(\text{net}_j) \cdot \text{out}_i$$
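A quick numerical sanity check of this formula, using toy values I'm assuming here (a single sigmoid output unit with one incoming connection and no bias):

```python
import math

def loss(w, out_i=0.8, target=0.2):
    """E = 0.5 * (target - out_j)^2 for a one-input sigmoid unit."""
    net_j = w * out_i
    out_j = 1 / (1 + math.exp(-net_j))
    return 0.5 * (target - out_j) ** 2

w, out_i, target = 0.4, 0.8, 0.2
out_j = 1 / (1 + math.exp(-(w * out_i)))

# ① x ② x ③: (out_j - target_j) * f'(net_j) * out_i
analytic = (out_j - target) * out_j * (1 - out_j) * out_i

h = 1e-6                                  # finite-difference comparison
numeric = (loss(w + h) - loss(w - h)) / (2 * h)
print(analytic, numeric)                  # should match to many decimal places
```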

Weight connecting unit $i$ to unit $j$: when unit $j$ is a hidden-layer unit

๐Ÿ“Œ ์€๋‹‰์ธต ์œ ๋‹›์— ๋Œ€ํ•œ ๊ฐ€์ค‘์น˜ ๋ฏธ๋ถ„: โˆ‚Eโˆ‚wij\frac{\partial E}{\partial w_{ij}}

Because a hidden unit's error is not computed directly the way an output unit's is,
it must be computed from the error propagated backward from the output layer.

Applying the Chain Rule (hidden layer)

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial \text{out}_j} \cdot \frac{\partial \text{out}_j}{\partial \text{net}_j} \cdot \frac{\partial \text{net}_j}{\partial w_{ij}}$$

① $\frac{\partial E}{\partial \text{out}_j}$: the error of hidden unit $j$
Since a hidden layer has no direct target, the error is passed back from the connected output-layer units $k$:

$$\frac{\partial E}{\partial \text{out}_j} = \sum_{k \in L} \left( \frac{\partial E}{\partial \text{out}_k} \cdot \frac{\partial \text{out}_k}{\partial \text{net}_k} \cdot \frac{\partial \text{net}_k}{\partial \text{out}_j} \right) = \sum_{k \in L} \left( \frac{\partial E}{\partial \text{out}_k} \cdot \frac{\partial \text{out}_k}{\partial \text{net}_k} \cdot w_{jk} \right) = \sum_{k \in L} \delta_k \cdot w_{jk}$$

  • $L$ is the set of output units connected to hidden unit $j$.

The quantity $\frac{\partial E}{\partial \text{out}_k} \cdot \frac{\partial \text{out}_k}{\partial \text{net}_k}$ has already been computed at the output layer, so we only need to multiply it by $w_{jk}$.
👉 That is, take the value already computed downstream and multiply it by the connecting weight.

② $\frac{\partial \text{out}_j}{\partial \text{net}_j} = f'(\text{net}_j)$

  • The derivative of the activation function.

③ $\frac{\partial \text{net}_j}{\partial w_{ij}} = \text{out}_i$

  • The output of the upstream unit $i$.

✅ Putting the derivative together

$$\frac{\partial E}{\partial w_{ij}} = \left( \sum_{k \in L} \delta_k \cdot w_{jk} \right) \cdot f'(\text{net}_j) \cdot \text{out}_i$$

Here we define

$$\delta_j = \left( \sum_{k \in L} \delta_k \cdot w_{jk} \right) \cdot f'(\text{net}_j)$$

→ this is the key formula for the hidden-layer error, illustrated in the sketch below.


📕 Backpropagation in one symbol: the "delta"

What is $\delta_k$?

  • It denotes the error at unit $k$.
  • It is the value that carries the change in error through output-layer and hidden-layer units.

👉 Output-layer unit $j$

$$\delta_j = (\text{out}_j - \text{target}_j) \cdot f'(\text{net}_j)$$

๐Ÿ‘‰ ์€๋‹‰์ธต ์œ ๋‹› jj

ฮดj=(โˆ‘kwjkโ‹…ฮดk)โ‹…fโ€ฒ(netj)\delta_j = \left( \sum_{k} w_{jk} \cdot \delta_k \right) \cdot f'(net_j)

In other words, the computation is split according to the layer the unit belongs to:

$$\frac{\partial E}{\partial w_{ij}} = \delta_j \cdot \text{out}_i \quad \text{where} \quad \delta_j = \begin{cases} (\text{out}_j - \text{target}_j) \cdot f'(\text{net}_j) & \text{if } j \text{ is an output-layer unit} \\ \left( \sum_k w_{jk}\,\delta_k \right) \cdot f'(\text{net}_j) & \text{if } j \text{ is a hidden-layer unit} \end{cases}$$

๊ทธ๋ผ๋””์–ธํŠธ(๊ธฐ์šธ๊ธฐ)๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ๊ฐ’!

๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ

  • ๊ฐ€์ค‘์น˜ ๋ฏธ๋ถ„:
    ์ถœ๋ ฅ์ธต์ผ ๋•Œ๋Š” ฮดj\delta_j๊ฐ’์— ์ž…๋ ฅ๊ฐ’ outiout_i๋ฅผ ๊ณฑํ•ด์„œ ๊ณ„์‚ฐ:
    โˆ‚Eโˆ‚wij=ฮดjโ‹…outi\frac{\partial E}{\partial w_{ij}} = \delta_j \cdot out_i
  • ์€๋‹‰์ธต์ผ ๋•Œ๋Š” ๋ธํƒ€๊ฐ’์„ ์ด์šฉํ•ด ์ด์ „ ์ธต์œผ๋กœ๋ถ€ํ„ฐ ์ „ํŒŒ๋œ ์˜ค์ฐจ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ณ„์‚ฐ:
    โˆ‚Eโˆ‚wij=(outjโˆ’targetj)โ‹…fโ€ฒ(netj)โ‹…outi\frac{\partial E}{\partial w_{ij}} = (out_j - target_j) \cdot f'(net_j) \cdot out_i

๋ธํƒ€์˜ ์—ญํ• 

๋ธํƒ€ ฮดk\delta_k๋Š” ์˜ค์ฐจ๋ฅผ ์ถœ๋ ฅ์ธต์—์„œ ์€๋‹‰์ธต์œผ๋กœ, ์€๋‹‰์ธต์—์„œ ์ž…๋ ฅ์ธต์œผ๋กœ ์ „ํŒŒํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•œ๋‹ค.

์‹ ๊ฒฝ๋ง์˜ ํ•™์Šต์€ ์ด ๋ธํƒ€ ๊ฐ’์„ ํ†ตํ•ด ์˜ค์ฐจ๋ฅผ ๊ฐ ์œ ๋‹›์— ์ „ํŒŒํ•˜๊ณ , ์ด๋ฅผ ๋ฐ˜์˜ํ•˜์—ฌ ๊ฐ€์ค‘์น˜๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์ง„ํ–‰๋œ๋‹ค.


🧮 Working Through the Backpropagation Algorithm by Hand

✅ Forward Pass

  1. Compute the output of output unit $y$:
    Combine the weights and input values to obtain $\text{net}_y$.

    $$\text{net}_y = w_5 \cdot \text{out}_{h1} + w_6 \cdot \text{out}_{h2} + b_3 = 0.5 \cdot 0.524979 + 0.6 \cdot 0.549834 + 0.3 = 0.89239$$

    Here $\text{out}_{h1}$ and $\text{out}_{h2}$ are the values coming out of the hidden layer.
    Passing this value through the sigmoid function gives the final output $\text{out}_y$:

    $$\text{out}_y = \frac{1}{1 + e^{-\text{net}_y}} = \frac{1}{1 + e^{-0.89239}} \approx 0.709383$$

  2. Compute the total error
    Compute the error between the target output $\text{target}_y = 0.0$ and the computed output $\text{out}_y$:

$$E = \frac{1}{2} (\text{target}_y - \text{out}_y)^2 = \frac{1}{2} (0.00 - 0.709383)^2 \approx 0.251612$$
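These forward-pass numbers can be reproduced in a couple of lines, using the example's values directly:

```python
import numpy as np

out_h1, out_h2 = 0.524979, 0.549834   # hidden-layer outputs from the example
w5, w6, b3 = 0.5, 0.6, 0.3

net_y = w5 * out_h1 + w6 * out_h2 + b3
out_y = 1 / (1 + np.exp(-net_y))      # sigmoid
E = 0.5 * (0.0 - out_y) ** 2          # target_y = 0.0
print(net_y, out_y, E)                # ~ 0.89239, 0.709383, 0.251612
```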

✅ Backward Pass

📌 Output layer → hidden layer

  1. Compute how a change in weight $w_5$ affects the output error, using the Chain Rule:

    $$\frac{\partial E}{\partial w_5} = \frac{\partial E}{\partial \text{out}_y} \cdot \frac{\partial \text{out}_y}{\partial \text{net}_y} \cdot \frac{\partial \text{net}_y}{\partial w_5}$$

Differentiating step by step:

  • Error with respect to the output:
    $$\frac{\partial E}{\partial \text{out}_y} = \text{out}_y - \text{target}_y = 0.709383 - 0 = 0.709383$$
    `layer2_error = layer2 - y`
  • Derivative of the activation function (derivative of the sigmoid):
    $$\frac{\partial \text{out}_y}{\partial \text{net}_y} = \text{out}_y \cdot (1 - \text{out}_y) = 0.709383 \cdot (1 - 0.709383) = 0.206158$$
    `layer2_delta = layer2_error * actf_deriv(layer2)`
  • Derivative with respect to the weight:
    $$\frac{\partial \text{net}_y}{\partial w_5} = \text{out}_{h1} = 0.524979$$

  2. Compute the final gradient
    The gradient with respect to $w_5$ is therefore:
    $$\frac{\partial E}{\partial w_5} = 0.709383 \cdot 0.206158 \cdot 0.524979 = 0.076775$$
    This value is used to update the weight with gradient descent.
    `layer2_delta * layer1.T`
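Putting the three factors together in code, and reusing the same $\delta_y$ for $w_6$ and $b_3$ (whose updated values appear just below):

```python
out_y, target_y = 0.709383, 0.0
out_h1, out_h2 = 0.524979, 0.549834

delta_y = (out_y - target_y) * out_y * (1 - out_y)  # factors ① x ②
grad_w5 = delta_y * out_h1                          # x ③ for w5
grad_w6 = delta_y * out_h2                          # x ③ for w6
grad_b3 = delta_y                                   # the bias input is 1
print(grad_w5, grad_w6, grad_b3)  # ~ 0.076775, 0.080412, 0.146245
```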

📌 Weight update

Update the value of $w_5$ via gradient descent:

$$w_5(t+1) = w_5(t) - \eta \cdot \frac{\partial E}{\partial w_5}$$

Here the learning rate is $\eta = 0.2$, so:

$$w_5(t+1) = 0.5 - 0.2 \cdot 0.076775 = 0.484645$$

The backward pass propagates the output error to the hidden layer and computes how much each weight contributes to the error.
The weights are then updated based on this information.

$$w_6(t+1) = 0.583918, \quad b_3(t+1) = 0.270750$$

  • The weights gradually decrease.
  • The bias also becomes smaller than its previous value, which will make the unit's output lower on the next pass (reproduced in the sketch below).

👉 This is because the output we want here is 0.
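The three updates can be verified with the gradients just computed (a small check script, $\eta = 0.2$):

```python
eta = 0.2
delta_y = 0.146245                    # (out_y - target_y) * f'(net_y)

w5 = 0.5 - eta * delta_y * 0.524979   # gradient = delta_y * out_h1
w6 = 0.6 - eta * delta_y * 0.549834   # gradient = delta_y * out_h2
b3 = 0.3 - eta * delta_y              # gradient = delta_y (bias input is 1)
print(w5, w6, b3)                     # ~ 0.484645, 0.583918, 0.270750
```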


๐Ÿ“Œ ์€๋‹‰์ธต โ†’ ์ž…๋ ฅ์ธต

  1. ๊ฐ€์ค‘์น˜ w1w_1์˜ ์—…๋ฐ์ดํŠธ ๊ณ„์‚ฐ:
    w1(t+1)=w1(t)โˆ’ฮทโ‹…โˆ‚Eโˆ‚w1=0.10โˆ’0.2โ‹…0.0=0.10w_1(t+1) = w_1(t) - \eta \cdot \frac{\partial E}{\partial w_1} = 0.10 - 0.2 \cdot 0.0 = 0.10
    w2(t+1)=0.2,w3(t+1)=0.3,w4(t+1)=0.4w_2(t+1) = 0.2, \quad w_3(t+1)=0.3, \quad w_4(t+1)=0.4
  • ์ž…๋ ฅ๊ฐ’์ด 0์ธ ๊ฒฝ์šฐ์—๋Š” ๊ฐ€์ค‘์น˜๋Š” ๋ณ€ํ™”ํ•˜์ง€ ์•Š๋Š”๋‹ค.
  • ์ž…๋ ฅ์ด 0์ด๋ฉด ๊ฐ€์ค‘์น˜๋ฅผ ์•„๋ฌด๋ฆฌ ๋ฐ”๊ฟ”๋„ ๋ฌด์Šจ ์†Œ์šฉ์ด ์žˆ๋‚˜?
  1. ๋ฐ”์ด์–ด์Šค b1b_1์™€ b2b_2์—…๋ฐ์ดํŠธ:
    b1(t+1)=0.096352,b2(t+1)=0.195656b_1(t+1) = 0.096352, \quad b_2(t+1) = 0.195656
  • ๋ฐ”์ด์–ด์Šค๋Š” ๊ธฐ์กด ๊ฐ’๋ณด๋‹ค ๋‚ฎ์•„์ง€๊ฒŒ ๋œ์–ด, ์ถœ๋ ฅ๊ฐ’์„ ๋” ๋‚ฎ์ถ”๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๋™์ž‘ํ•œ๋‹ค.

๐Ÿ“Œ ์†์‹คํ•จ์ˆ˜ ํ‰๊ฐ€

E=12(targetโˆ’outy)2=12(0.00โˆ’0.709383)2=0.251612E = \frac{1}{2} ( \text{target} - \text{out}_y )^2 = \frac{1}{2} ( 0.00 - 0.709383 )^2 = 0.251612

โฌ‡๏ธ ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ• 1๋ฒˆ ์ ์šฉ

E=12(targetโˆ’outy)2=12(0.00โˆ’0.699553)2=0.244687E = \frac{1}{2} ( \text{target} - \text{out}_y )^2 = \frac{1}{2} ( 0.00 - 0.699553 )^2 = 0.244687

โฌ‡๏ธ ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ• 10000๋ฒˆ ์ ์šฉ

E=12(targetโˆ’outy)2=12(0.00โˆ’0.005770)2=0.000016E = \frac{1}{2} ( \text{target} - \text{out}_y )^2 = \frac{1}{2} ( 0.00 - 0.005770 )^2 = 0.000016

์˜ค์ฐจ๊ฐ€ ํฌ๊ฒŒ ์ค„์–ด๋“ ๋‹ค.


📦 Implementing an MLP with NumPy

```python
import numpy as np

# Sigmoid activation function
def actf(x):
    return 1 / (1 + np.exp(-x))

# Derivative of the sigmoid, expressed in terms of the sigmoid's output
# (so it must be called with the activated value, not the net input)
def actf_deriv(x):
    return x * (1 - x)

# Number of input, hidden, and output units
inputs, hiddens, outputs = 2, 2, 1
learning_rate = 0.2

# Training samples and targets (the XNOR truth table)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([[1], [0], [0], [1]])
W1 = np.array([[0.10, 0.20], [0.30, 0.40]])  # weights from input to hidden layer
W2 = np.array([[0.50], [0.60]])              # weights from hidden to output layer
B1 = np.array([0.1, 0.2])                    # hidden-layer biases
B2 = np.array([0.3])                         # output-layer bias

# Forward propagation
def predict(x):
    layer0 = x                      # assign the input to layer0
    Z1 = np.dot(layer0, W1) + B1    # matrix product plus bias
    layer1 = actf(Z1)               # apply the activation function
    Z2 = np.dot(layer1, W2) + B2    # matrix product plus bias
    layer2 = actf(Z2)               # apply the activation function
    return layer0, layer1, layer2

# Backward propagation (training)
def fit():
    global W1, W2, B1, B2  # we modify variables defined outside, hence global
    for i in range(90000):              # repeat 90,000 times
        for x, y in zip(X, T):          # take the training samples one by one
            x = np.reshape(x, (1, -1))  # turn into a 2-D matrix  ①
            y = np.reshape(y, (1, -1))  # turn into a 2-D matrix

            # forward pass
            layer0, layer1, layer2 = predict(x)

            # error and delta at the output layer
            layer2_error = layer2 - y                          # output-layer error
            layer2_delta = layer2_error * actf_deriv(layer2)   # output-layer delta

            # error and delta at the hidden layer
            layer1_error = np.dot(layer2_delta, W2.T)          # hidden-layer error  ②
            layer1_delta = layer1_error * actf_deriv(layer1)   # hidden-layer delta  ③

            # weight updates
            W2 += -learning_rate * np.dot(layer1.T, layer2_delta)  # ④
            W1 += -learning_rate * np.dot(layer0.T, layer1_delta)  # ⑤

            # bias updates
            B2 += -learning_rate * np.sum(layer2_delta, axis=0)    # ⑥
            B1 += -learning_rate * np.sum(layer1_delta, axis=0)    # ⑦

# Test function
def test():
    for x, y in zip(X, T):                   # take the samples one by one
        x = np.reshape(x, (1, -1))           # reshape each sample into a 2-D matrix
        layer0, layer1, layer2 = predict(x)  # forward pass
        print(x, y, layer2)                  # print the output-layer value

# Train, then test
fit()   # training
test()  # testing
```

Output:

```
[[0 0]] [1] [[0.99196032]]
[[0 1]] [0] [[0.00835708]]
[[1 0]] [0] [[0.00836107]]
[[1 1]] [1] [[0.98974873]]
```
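One way to sanity-check the hand-derived gradients against this implementation is numerical gradient checking: perturb a single weight, re-run `predict`, and compare the change in loss with the analytic gradient. A minimal sketch (run it before `fit()`, while the initial weights are still in place):

```python
# Numerical gradient check for W2[0, 0] on the first training sample.
def loss(x, y):
    _, _, layer2 = predict(x)
    return 0.5 * np.sum((y - layer2) ** 2)

x0 = np.reshape(X[0], (1, -1))
y0 = np.reshape(T[0], (1, -1))

# Analytic gradient, using the same deltas as fit()
layer0, layer1, layer2 = predict(x0)
layer2_delta = (layer2 - y0) * actf_deriv(layer2)
analytic = np.dot(layer1.T, layer2_delta)[0, 0]

# Central-difference estimate
h = 1e-6
W2[0, 0] += h;     loss_plus = loss(x0, y0)
W2[0, 0] -= 2 * h; loss_minus = loss(x0, y0)
W2[0, 0] += h      # restore the original weight
numeric = (loss_plus - loss_minus) / (2 * h)
print(analytic, numeric)  # the two should agree closely
```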

Summary

  • An MLP is a neural network architecture with one or more hidden layers between the input layer and the output layer.
  • The backpropagation algorithm is the core method used to train an MLP.
  • The backpropagation procedure:
    1. Given an input, compute the output by propagating forward through the network.
    2. Compute the error, i.e. the difference between the actual output and the desired output.
    3. Propagate this error backward to update the weights, steering the training toward smaller error.

0๊ฐœ์˜ ๋Œ“๊ธ€