🤖 Machine Learning Day 1 - Multiple Linear Regression

JItzel · December 10, 2025

๐Ÿก Machine_learning

๋ชฉ๋ก ๋ณด๊ธฐ
3/14

Multiple Linear Regression and Matrix Operations

Let's look at how to handle data when there are multiple input variables (x), focusing on how to match matrix dimensions.

1. What Is Multiple Linear Regression?

  • Used when there is not one input feature but several (x_1, x_2, ..., x_n).
  • Where simple regression finds a 'line' in the 2-D plane, multiple regression finds a 'hyperplane' in a higher-dimensional space.

Formula: y = w_1x_1 + w_2x_2 + ... + w_nx_n + b
Matrix form: H(X) = XW + b

2. Example: Predicting the Final Exam Score from Quiz Scores (3 Variables)

Predict the final exam score (final) from three quiz scores (q1, q2, mid).

1) Preparing the Data

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# ๋ฐ์ดํ„ฐ ๋กœ๋“œ (header๊ฐ€ ์—†๋Š” ๊ฒฝ์šฐ)
df = pd.read_csv('data-01.csv', header=None)
df.columns = ['q1', 'q2', 'mid', 'final']

df.head()

2) Training (Fit)

There are 3 independent variables (X: q1, q2, mid) and 1 dependent variable (Y: final).

# X: every column except the last (3 features)
# Y: the last column (1 label)
x = df.iloc[:, :-1].values
y = df.iloc[:, [-1]].values

model = LinearRegression()
model.fit(x, y)

# ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜(w)์™€ ์ ˆํŽธ(b) ํ™•์ธ
print('๊ฐ€์ค‘์น˜(Coefficients):', model.coef_)
print('์ ˆํŽธ(Intercept):', model.intercept_)

# ์ถœ๋ ฅ ์˜ˆ์‹œ
# ๊ฐ€์ค‘์น˜ [[0.35593822 0.54251876 1.16744422]] -> ๋ณ€์ˆ˜๊ฐ€ 3๊ฐœ๋ผ w๋„ 3๊ฐœ
# ์ ˆํŽธ [-4.3361024]
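
The double brackets in `df.iloc[:, [-1]]` are what keep y two-dimensional. A quick sketch with a toy frame (numbers made up):

```python
import pandas as pd

# Toy frame standing in for the quiz data (made-up numbers)
df = pd.DataFrame({'q1': [80, 90], 'q2': [85, 95],
                   'mid': [70, 88], 'final': [150, 180]})

print(df.iloc[:, -1].values.shape)    # (2,)  : -1 selects a 1-D Series
print(df.iloc[:, [-1]].values.shape)  # (2, 1): [-1] keeps a 2-D DataFrame
```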

3) Prediction (Predict) and Checking the Matrix Operation

Q. What is the final exam score of a student with q1: 90, q2: 90, mid: 95?

The key to prediction is matching the shape of the input data.

# The model was trained on X of shape (N, 3),
# so prediction input must also have shape (1, 3) (double brackets [[ ]])

model.predict([[90, 90, 95]]) 

# Result
# array([[187.43222601]])
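
The prediction can be reproduced by hand from the learned coefficients. Since data-01.csv is not included here, the sketch below fits on made-up scores, then checks that predict() is exactly the matrix product X_new · W + b:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up quiz scores standing in for data-01.csv
x = np.array([[73, 80, 75], [93, 88, 93],
              [89, 91, 90], [96, 98, 100]], dtype=float)
y = np.array([[152], [185], [180], [196]], dtype=float)

model = LinearRegression().fit(x, y)

# predict() computes X_new @ W + b with the learned coef_ and intercept_
x_new = np.array([[90, 90, 95]])
manual = x_new @ model.coef_.T + model.intercept_
print(np.allclose(model.predict(x_new), manual))  # True
```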

3. Key Takeaway: Machine Learning and Matrix Shapes

  • For the matrix product (matmul) to be valid, the arrays must follow these dimension rules.

H(X) = X · W + b

  • Weights (W, slopes): shape [number of feature columns, number of labels]
    This example: 3 features, 1 label → (3, 1)

  • Feature data (X): shape [number of samples (rows), number of columns]

  • Label data (Y): shape [number of samples (rows), number of labels (columns)] (usually 1 column)

  • At prediction time (X_new): the input must match the shape [rows, number of columns].
    Arithmetic: (1 × 3) · (3 × 1) = (1 × 1)
    → yields a scalar prediction

  • Tip: Scikit-learn handles transposes and the like internally, but conceptually you must remember the rule: "the number of columns of the front matrix must equal the number of rows of the back matrix for the product to exist."

4. Example 1: Electricity Production vs. Usage (Simple Regression Review)

  • With a single feature (n = 1), visualization is intuitive.
# 1. ๋ฐ์ดํ„ฐ ์ค€๋น„
elecDF = pd.read_csv('data/electric.csv', index_col='Unnamed: 0')

x = elecDF.iloc[:, :-1].values 		# ์ƒ์‚ฐ๋Ÿ‰
y = elecDF.iloc[:, [-1]].values  	# ์‚ฌ์šฉ๋Ÿ‰

# 2. ๋ชจ๋ธ ํ•™์Šต
model = LinearRegression()
model.fit(x, y)

print(f"Shape ํ™•์ธ - x:{x.shape}, y:{y.shape}")
# Shape ํ™•์ธ - x:(12, 1), y:(12, 1)

print(f"๊ธฐ์šธ๊ธฐ:{model.coef_}, ์ ˆํŽธ:{model.intercept_}")
# ๊ธฐ์šธ๊ธฐ:[[0.49560324]], ์ ˆํŽธ:[0.91958143]

# 3. ์˜ˆ์ธก (์ƒ์‚ฐ๋Ÿ‰์ด 4์™€ 5์ผ ๋•Œ)
# 2๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์˜ˆ์ธกํ•˜๋ฏ€๋กœ (2, 1) ํ–‰๋ ฌ๋กœ ์ž…๋ ฅ
print(model.predict([[4], [5]]))
# array([[2.90199437],
#        [3.39759761]])

# 4. ์‹œ๊ฐํ™”
plt.scatter(x, y, label='์‹ค์ œ๊ฐ’')
plt.plot(x, model.predict(x), 'r', label='์˜ˆ์ธก์„ ') # ํšŒ๊ท€์„ 
plt.legend()
plt.show()
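
A common slip at this step is passing a 1-D array to predict(): scikit-learn expects the same 2-D layout as in training. A sketch on made-up production/usage pairs, since electric.csv is not included here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up production/usage pairs standing in for electric.csv
x = np.array([[1], [2], [3], [6]], dtype=float)
y = np.array([[1.5], [2.0], [2.4], [3.9]], dtype=float)

model = LinearRegression().fit(x, y)

print(model.predict([[4], [5]]).shape)  # (2, 1): two predictions, one label each

try:
    model.predict(np.array([4.0, 5.0]))  # 1-D input of shape (2,) is rejected
except ValueError as e:
    print('needs 2-D input:', str(e)[:40])
```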

5. Example 2: Predicting Tree Volume (Multiple Regression)

  • Use a tree's girth (Girth) and height (Height) to predict its volume (Volume). (2 variables → 1 result)
# 1. ๋ฐ์ดํ„ฐ ์ค€๋น„
treeDF = pd.read_csv('data/trees.csv')
# Girth, Height, Volume ์ปฌ๋Ÿผ ์กด์žฌ

x = treeDF.iloc[:, :-1].values # Girth, Height (ํŠน์„ฑ 2๊ฐœ)
y = treeDF.iloc[:, [-1]].values # Volume (๋ผ๋ฒจ)

# 2. ๋ชจ๋ธ ํ•™์Šต
model = LinearRegression()
model.fit(x, y)

# 3. ์˜ˆ์ธก
# Case: (๋‘˜๋ ˆ 11, ํ‚ค 66) ๊ณผ (๋‘˜๋ ˆ 11, ํ‚ค 75) ์ธ ๋‚˜๋ฌด์˜ ๋ถ€ํ”ผ๋Š”?
# ์ž…๋ ฅ ๋ฐ์ดํ„ฐ Shape: (2, 2)
result = model.predict([[11, 66], [11, 75]])
print(result)

# ๊ฒฐ๊ณผ
# array([[16.19268807],
#        [19.24594918]])

# 4. ์‹œ๊ฐํ™” (Line Chart ๋น„๊ต)
# ๋‹ค์ค‘ ํšŒ๊ท€๋Š” 3์ฐจ์› ์ฐจํŠธ๊ฐ€ ํ•„์š”ํ•˜๋ฏ€๋กœ, 
# ์‹ค์ œ๊ฐ’(y)๊ณผ ์˜ˆ์ธก๊ฐ’(pred)์„ ๋‚˜๋ž€ํžˆ ๊ทธ๋ ค์„œ ์ถ”์„ธ๋ฅผ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค.
pred = model.predict(x)

plt.plot(y, 'b', label='์‹ค์ œ๊ฐ’(Volume)')
plt.plot(pred, 'r--', label='์˜ˆ์ธก๊ฐ’(Predicted)')
plt.legend()
plt.title("Actual vs Predicted Volume")
plt.show()

→ Interpretation: if the blue solid line (actual) and the red dashed line (predicted) move together, the model has learned well.
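
Beyond eyeballing the two lines, the fit can be scored numerically: LinearRegression's score() returns R², where 1.0 means a perfect fit. A sketch on made-up girth/height/volume numbers, since trees.csv is not included here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up measurements standing in for trees.csv (Girth, Height -> Volume)
x = np.array([[8.3, 70], [8.6, 65], [10.5, 72],
              [11.0, 66], [12.9, 74], [16.0, 77]], dtype=float)
y = np.array([[10.3], [10.2], [16.4], [15.6], [22.2], [38.3]], dtype=float)

model = LinearRegression().fit(x, y)

# R^2 on the training data: close to 1.0 means the two lines nearly overlap
r2 = model.score(x, y)
print(round(r2, 3))
```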


Summary

  • Multiple linear regression is used when there are multiple input variables (x).
  • The number of learned weights (w) equals the number of input variables.
  • When calling predict, pass a 2-D array with the same number of columns as the training data.
  • Understanding the matrix-product rule ((1 × N) · (N × 1)) lets you handle data without shape errors.