Machine Learning (AI Study 27)

이유진 · July 8, 2024

AI ML colab python

Calculating obesity level with an SVM

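SVM: Support Vector Machine

Used for classification models
At its core, a binary linear classification model that decides which category a given data point belongs to. scikit-learn's SVC extends this to more than two classes and, through kernels, to nonlinear boundaries

Margin maximization: the decision boundary is chosen so that its distance to the closest data points of each class is as large as possible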
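To make the margin idea concrete, here is a minimal sketch on a tiny made-up dataset (the six points below are an assumption for illustration, not part of this post's data). It fits a linear-kernel SVC, reads the support vectors, and computes the margin width as 2 / ||w||. A large C is used so the result approximates a hard-margin SVM.

```python
import numpy as np
from sklearn import svm

# Toy data (made up for illustration): two linearly separable groups
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear kernel with a large C, so almost no margin violations are tolerated
clf = svm.SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

w = clf.coef_[0]                         # normal vector of the separating line
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines

print("support vectors:")
print(clf.support_vectors_)              # the points that sit on the margin
print("margin width:", round(margin_width, 3))
```

Only the points closest to the boundary (the support vectors) determine where the boundary ends up; moving the other points slightly would not change it.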

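Generate random height and weight data for 20,000 people,

label each person as underweight, normal, or obese using BMI (body mass index),

then train an SVM on the data and test whether it can predict the labels accurately.

BMI = weight (kg) / (height (m) × height (m))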

from sklearn import svm, metrics
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import random

import time  # used to measure execution time
from time import strftime

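Generating random data

Generate height and weight data for 20,000 people

"height (cm), weight (kg)" --> output "underweight (thin), normal (normal), obese (fat)"

Write a CSV file with three columns: height, weight, and the label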

base_path = r'/content/drive/MyDrive/dataset'

Function to calculate the obesity level

def calc_bmi(h, w):
    bmi = w / (h / 100) ** 2  # convert height from cm to m before squaring
    if bmi < 18.5: return "thin"  # underweight
    if bmi < 25: return "normal"  # normal weight
    return "fat"  # obese

calc_bmi(170, 80),calc_bmi(180, 54),calc_bmi(166,65)
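As a quick check of the three calls above, the sketch below repeats the function and prints the raw BMI value next to each label. It also tries the two threshold values, since the comparisons use `<`, so a BMI of exactly 18.5 counts as normal and exactly 25 counts as fat.

```python
def calc_bmi(h, w):
    """Return the BMI label for a height in cm and a weight in kg."""
    bmi = w / (h / 100) ** 2
    if bmi < 18.5:
        return "thin"
    if bmi < 25:
        return "normal"
    return "fat"

for h, w in [(170, 80), (180, 54), (166, 65)]:
    bmi = w / (h / 100) ** 2
    print(h, w, round(bmi, 2), calc_bmi(h, w))
# 170 80 27.68 fat
# 180 54 16.67 thin
# 166 65 23.59 normal

# Threshold behavior: height 100 cm makes BMI equal to the weight in kg
print(calc_bmi(100, 18.5), calc_bmi(100, 25))
# normal fat
```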

20,000 rows of data --> save as CSV

file_path = os.path.join(base_path, 'bmi.csv')
file_path

fp = open(file_path, "w", encoding="utf-8")
fp.write("height,weight,label\n")

cnt = {
    "thin": 0,
    "normal": 0,
    "fat": 0,
}

random.seed(10)

for i in range(20000):
    h = random.randint(120, 200)  # height 120-200 cm
    w = random.randint(35, 80)  # weight 35-80 kg
    label = calc_bmi(h, w)
    cnt[label] += 1
    fp.write(f'{h},{w},{label}\n')

fp.close()
print("ok", cnt)

df_bmi = pd.read_csv(file_path)
df_bmi
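Before training, it can help to look at how the three labels are distributed. The sketch below regenerates the same kind of data in memory and counts the labels with `value_counts()`. Writing to an `io.StringIO` buffer instead of the Google Drive path is an assumption made only so the snippet runs anywhere.

```python
import io
import random

import pandas as pd

def calc_bmi(h, w):
    bmi = w / (h / 100) ** 2
    if bmi < 18.5:
        return "thin"
    if bmi < 25:
        return "normal"
    return "fat"

random.seed(10)

# In-memory stand-in for bmi.csv (assumption: avoids needing a mounted Drive)
buf = io.StringIO()
buf.write("height,weight,label\n")
for _ in range(20000):
    h = random.randint(120, 200)
    w = random.randint(35, 80)
    buf.write(f"{h},{w},{calc_bmi(h, w)}\n")
buf.seek(0)

df_bmi = pd.read_csv(buf)

# Absolute counts and proportions per label
print(df_bmi["label"].value_counts())
print(df_bmi["label"].value_counts(normalize=True).round(3))
```

If one label were very rare, accuracy alone would be a misleading score, which is one reason to check the per-class report later on.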

Scale each column (feature), then separate the data and the target (label)

Separate the data + scale

w = df_bmi['weight'] / 100  # assume a maximum weight of 100 kg and normalize to the 0-1 range
h = df_bmi['height'] / 200  # assume a maximum height of 200 cm

wh = pd.concat([w,h], axis = 1)
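Dividing by an assumed maximum works here because the value ranges are known in advance. As an alternative sketch (not what this post uses), scikit-learn's `MinMaxScaler` derives the range from the data itself. The three-row frame below is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up sample with the same column names as bmi.csv
df_bmi = pd.DataFrame({
    "height": [120, 160, 200],
    "weight": [35, 60, 80],
})

scaler = MinMaxScaler()  # rescales each column to [0, 1] using its observed min and max
scaled = scaler.fit_transform(df_bmi[["weight", "height"]])
wh = pd.DataFrame(scaled, columns=["weight", "height"])

print(wh)
```

Note the difference: with this scaler the smallest observed weight maps to 0 and the largest to 1, whereas dividing by 100 maps 35 kg to 0.35. In a real pipeline the scaler would normally be fitted on the training split only and then applied to the test split.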

Separate the target (label)

label = df_bmi['label']
label

Splitting into train/test data

X_train, X_test, y_train, y_test = train_test_split(wh, label, random_state=24)

The data is split into train:test = 15000:5000 (the default test_size is 0.25)

X_train.shape

X_test.shape
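The 75:25 split comes from the default `test_size`. The sketch below spells that default out and adds `stratify`, an optional argument this post does not use, which keeps the label proportions similar in both splits. The 100-row dataset is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with an uneven label distribution
X = pd.DataFrame({"weight": range(100), "height": range(100)})
y = pd.Series(["thin"] * 20 + ["normal"] * 30 + ["fat"] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # same as the default when test_size is not given
    stratify=y,        # keep the thin/normal/fat ratio roughly equal in both splits
    random_state=24,
)

print(X_train.shape, X_test.shape)
# (75, 2) (25, 2)
print(y_test.value_counts())
```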

Training

clf = svm.SVC() # Support Vector Classifier
clf.fit(X_train, y_train)

Predicting

predict = clf.predict(X_test)

predict

Checking the accuracy

metrics.accuracy_score(y_test, predict)

report = metrics.classification_report(y_test, predict)

print(report)
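The classification report gives per-class precision and recall, and a confusion matrix shows which labels get mixed up with which. Here is a minimal, self-contained sketch. It uses 2,000 NumPy-generated samples instead of the 20,000-row CSV (an assumption to keep it quick), so the numbers will differ from the results in this post.

```python
import numpy as np
import pandas as pd
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split

# Smaller stand-in dataset, generated with the same ranges and BMI rule
rng = np.random.default_rng(10)
h = rng.integers(120, 201, size=2000)   # height 120-200 cm (upper bound is exclusive)
w = rng.integers(35, 81, size=2000)     # weight 35-80 kg
bmi = w / (h / 100) ** 2
label = np.where(bmi < 18.5, "thin", np.where(bmi < 25, "normal", "fat"))

X = np.column_stack([w / 100, h / 200])  # same scaling as in the post
X_train, X_test, y_train, y_test = train_test_split(X, label, random_state=24)

clf = svm.SVC()
clf.fit(X_train, y_train)
predict = clf.predict(X_test)

# Rows are the true labels, columns are the predicted labels
labels = ["thin", "normal", "fat"]
cm = metrics.confusion_matrix(y_test, predict, labels=labels)
print(pd.DataFrame(cm,
                   index=[f"true_{l}" for l in labels],
                   columns=[f"pred_{l}" for l in labels]))
```

Errors should cluster between neighboring classes (thin vs. normal, normal vs. fat), since those are the samples that sit close to a BMI threshold.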

Checking the data distribution (visualization)

df = df_bmi.set_index('label')
df

df.loc['normal']

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

def scatter(lbl, color):
    b = df.loc[lbl]
    ax.scatter(b['weight'], b['height'], c=color, label=lbl)

scatter("fat", "red")
scatter("normal", "yellow")
scatter("thin", "purple")

ax.legend()

Types of SVM

SVC / NuSVC / LinearSVC

start_time = time.time()  # start time

clf = svm.LinearSVC()
clf.fit(X_train, y_train)
predict = clf.predict(X_test)
acc_score = metrics.accuracy_score(y_test, predict)
c1_report = metrics.classification_report(y_test, predict)

end_time = time.time()  # end time

print(f'accuracy = {acc_score}')
print(c1_report)

print(f'elapsed time {end_time - start_time} sec')
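To compare the three variants side by side, the same timing pattern can be run in a loop. This is only a sketch under assumptions: it uses 2,000 NumPy-generated samples rather than the CSV above, all three classifiers keep their default parameters, and timings depend heavily on the machine. `LinearSVC` may emit a convergence warning at its default iteration limit.

```python
import time

import numpy as np
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split

# Smaller stand-in dataset, generated with the same ranges and BMI rule
rng = np.random.default_rng(10)
n = 2000
h = rng.integers(120, 201, size=n)   # height 120-200 cm
w = rng.integers(35, 81, size=n)     # weight 35-80 kg
bmi = w / (h / 100) ** 2
label = np.where(bmi < 18.5, "thin", np.where(bmi < 25, "normal", "fat"))

X = np.column_stack([w / 100, h / 200])
X_train, X_test, y_train, y_test = train_test_split(X, label, random_state=24)

models = {
    "SVC": svm.SVC(),              # RBF kernel by default
    "NuSVC": svm.NuSVC(),          # like SVC, but controlled by nu instead of C
    "LinearSVC": svm.LinearSVC(),  # linear boundary only
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, model.predict(X_test))
    elapsed = time.perf_counter() - start
    results[name] = (acc, elapsed)
    print(f"{name:10s} accuracy = {acc:.4f}  elapsed = {elapsed:.3f} sec")
```

The BMI thresholds trace curves of the form weight = c × height² in the weight/height plane, so a purely linear boundary can only approximate them. That is one plausible reason for `LinearSVC` scoring differently from the kernel-based classifiers.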