Problem-Solving Procedures
The general problem-solving procedure
![](https://velog.velcdn.com/images/tim0902/post/cb71ad2d-845c-4225-9415-f72d5955a3ce/image.png)
Data-driven problem-solving procedure
![](https://velog.velcdn.com/images/tim0902/post/c733c227-29ab-4983-9085-9987fd92cfd1/image.png)
The model itself responds to change, based on the data
![](https://velog.velcdn.com/images/tim0902/post/723ba7de-d96d-42d5-8c50-93b22f673ecc/image.png)
Learning through machine learning
![](https://velog.velcdn.com/images/tim0902/post/bc21c8f2-8111-4347-80cf-6b14f74082b4/image.png)
Supervised Learning – Classification
![](https://velog.velcdn.com/images/tim0902/post/b26e83a5-bf92-4cfb-ae85-d0d6299c9fdb/image.png)
Supervised Learning – Regression
![](https://velog.velcdn.com/images/tim0902/post/d21d1068-746c-40df-8db2-0c61ed99e3ff/image.png)
Unsupervised learning has no labels
![](https://velog.velcdn.com/images/tim0902/post/4f859a43-4959-4cd4-adf5-36319e57470a/image.png)
Unsupervised Learning – Clustering
![](https://velog.velcdn.com/images/tim0902/post/3e6c08bb-c059-4f9f-be2a-63afb33ed399/image.png)
Unsupervised Learning – Dimensionality Reduction
![](https://velog.velcdn.com/images/tim0902/post/2ff96e70-704c-4655-a6e9-d1deafb86db1/image.png)
Regression
Suppose we have data on house area and price, and we want to predict house prices
![](https://velog.velcdn.com/images/tim0902/post/fb80023d-eb94-45ec-b11e-142cf3160c6f/image.png)
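As a toy sketch of this setup (the area and price numbers below are made up for illustration, not taken from the post), a line can be fit and used for prediction with NumPy:

```python
import numpy as np

# Hypothetical data: house area (m^2) vs. price (arbitrary units); made-up numbers
area = np.array([50, 60, 70, 80, 90], dtype=float)
price = np.array([150, 180, 205, 240, 265], dtype=float)

# Fit a first-degree polynomial (a straight line) to the points
slope, intercept = np.polyfit(area, price, 1)

# Predict the price of a 75 m^2 house from the fitted line
predicted = slope * 75 + intercept
```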
Building a machine learning model
![](https://velog.velcdn.com/images/tim0902/post/c2988fdd-1ecd-4c58-b611-0250851f8ec8/image.png)
![](https://velog.velcdn.com/images/tim0902/post/37be9f70-70d7-461c-8453-31e58e566ca7/image.png)
A linear (first-degree) function
![](https://velog.velcdn.com/images/tim0902/post/ef4e0fdc-da4b-4649-ad58-50495b3d0c62/image.png)
Linear regression
![](https://velog.velcdn.com/images/tim0902/post/0c35ce7d-5508-40bc-9bbb-b8761c3adc34/image.png)
Finding the parameters that define the model
![](https://velog.velcdn.com/images/tim0902/post/738bf213-5dfb-44dc-8e00-16930c001ab9/image.png)
OLS (Ordinary Least Squares)
- If we fit the data with a single straight line
![](https://velog.velcdn.com/images/tim0902/post/f054919d-d1a0-48a4-9118-29d9309ba920/image.png)
- The equation of the line
![](https://velog.velcdn.com/images/tim0902/post/9a44d899-b76c-4c41-a0e7-4b5f8dc0e428/image.png)
- Substitute every data point into the line
![](https://velog.velcdn.com/images/tim0902/post/0e3eaf34-0503-4f26-9835-dcdca47d94c6/image.png)
- Express the problem with vectors and matrices
![](https://velog.velcdn.com/images/tim0902/post/7e644fe9-83e8-4acd-916c-ef97e5bfea69/image.png)
- The model we want to find
![](https://velog.velcdn.com/images/tim0902/post/6d972994-6838-4dd8-98e7-28b4d05810ce/image.png)
- Rearrange in matrix form
![](https://velog.velcdn.com/images/tim0902/post/4a5ad953-74c7-45aa-bb6e-bf3e74a79a4a/image.png)
- At last, we can solve for X
![](https://velog.velcdn.com/images/tim0902/post/0bd0bdba-65c8-496f-bf40-de29f62927db/image.png)
- Back to the original problem
![](https://velog.velcdn.com/images/tim0902/post/7482d722-d1df-4b79-a55b-5649a6b75a20/image.png)
- Plug in the data
![](https://velog.velcdn.com/images/tim0902/post/cee6838e-dc5f-412a-86b9-57f680436c41/image.png)
- The original equation
![](https://velog.velcdn.com/images/tim0902/post/e71e224d-9a9a-45e5-a0c7-1f6d2be3e341/image.png)
- Simplify
![](https://velog.velcdn.com/images/tim0902/post/0d7cb47e-d295-4a4e-8ca7-aa656ab18ec1/image.png)
- Applying this, we can obtain a and b
![](https://velog.velcdn.com/images/tim0902/post/e664417c-1eae-4642-b25e-9eb4f9ed2f4e/image.png)
- The final model
![](https://velog.velcdn.com/images/tim0902/post/3ee7cb5b-46b7-4722-8fc6-8fe3dab94911/image.png)
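The derivation above can be checked numerically. Here is a minimal NumPy sketch of the normal equation for five hypothetical points:

```python
import numpy as np

# Five hypothetical (x, y) points
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 3, 4, 6, 5], dtype=float)

# Design matrix A with a column of x values and a column of ones,
# so that A @ [a, b] approximates y (i.e. y ≈ a*x + b)
A = np.column_stack([x, np.ones_like(x)])

# Normal equation: [a, b] = (A^T A)^(-1) A^T y
a, b = np.linalg.inv(A.T @ A) @ (A.T @ y)
```

In practice `np.linalg.lstsq(A, y, rcond=None)` is preferred over forming the inverse explicitly, but the version above mirrors the derivation step by step.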
- Expressing the model's performance
![](https://velog.velcdn.com/images/tim0902/post/614e218b-871d-458b-a363-6fa3ab5b0caf/image.png)
Practice
Creating the data
```python
import pandas as pd

data = {'x': [1, 2, 3, 4, 5], 'y': [1, 3, 4, 6, 5]}
df = pd.DataFrame(data)
df
```
![](https://velog.velcdn.com/images/tim0902/post/85f3140b-9a3b-4ed3-a8ce-63431e38dd59/image.png)
Setting up the hypothesis
```python
import statsmodels.formula.api as smf

lm_model = smf.ols(formula='y ~ x', data=df).fit()
```
The result

```python
lm_model.params
```
![](https://velog.velcdn.com/images/tim0902/post/1cbe3459-eff9-40fd-b734-ca065bfbb75d/image.png)
Plotting with seaborn
```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.lmplot(x='x', y='y', data=df)
plt.xlim([0, 5])
```
![](https://velog.velcdn.com/images/tim0902/post/7fe4e56f-c939-4113-9788-b42faa030b89/image.png)
Evaluating the residuals
- The residuals should follow a normal distribution with mean 0
- Residual evaluation checks whether the residuals have mean 0 and follow a normal distribution
Checking the residuals
```python
resid = lm_model.resid
resid
```
![](https://velog.velcdn.com/images/tim0902/post/aacd6f69-bbff-4f4c-aed6-40ccb69b5d3e/image.png)
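The claim above can be verified directly: when the model includes an intercept, OLS residuals average to zero, and a normality test can probe the second condition. The Shapiro-Wilk test from scipy is an assumption of this sketch, not part of the original post:

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1, 3, 4, 6, 5]})
resid = smf.ols(formula='y ~ x', data=df).fit().resid

# With an intercept in the model, OLS residuals sum (and average) to zero
mean_resid = resid.mean()

# Shapiro-Wilk: a large p-value means no evidence against normality
stat, p = stats.shapiro(resid)
```

With only five points such a test has little power, so it is a sanity check rather than proof of normality.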
Coefficient of Determination, R-Squared
- y_hat is the predicted value
- If the predicted values exactly match the actual values (y), the coefficient of determination is 1 (so the higher the R-squared, the better the model)
![](https://velog.velcdn.com/images/tim0902/post/c3aa625b-faac-474a-a6f4-03d102023323/image.png)
Computing R-squared with numpy
```python
import numpy as np

mu = np.mean(df['y'])
y = df['y']
y_hat = lm_model.predict()

np.sum((y_hat - mu)**2) / np.sum((y - mu)**2)
```
![](https://velog.velcdn.com/images/tim0902/post/2d57ff76-93af-4041-94e1-58a5adaadd2d/image.png)
```python
lm_model.rsquared
```
![](https://velog.velcdn.com/images/tim0902/post/97e5ae08-4d1e-4242-a5f9-7746ef341771/image.png)
Checking the distribution of the residuals

```python
# sns.distplot is deprecated (and removed in recent seaborn releases);
# histplot with kde=True shows the same view of the residual distribution
sns.histplot(resid, color='black', kde=True)
```
![](https://velog.velcdn.com/images/tim0902/post/df5f191d-8a52-42a1-ab3e-9ba7bd91ab15/image.png)