Sample problem source: Korea Data Agency (한국데이터산업진흥원) notice board https://www.dataq.or.kr/www/board/view.do
Apply min-max scaling to the qsec column of the dataset, then find the number of records with a value greater than 0.5.
# Example: reading the data file
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Load the data
df = pd.read_csv('./mtcars.csv', index_col=0)
# Min-max scaling
scaler = MinMaxScaler()
df[['qsec']] = scaler.fit_transform(df[['qsec']])
# Answer
print(len(df[df['qsec'] > 0.5]))
9
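The same transform can be reproduced without sklearn, since min-max scaling is just (x - min) / (max - min). A minimal sketch on a toy series (hypothetical values standing in for the mtcars qsec column):

```python
import pandas as pd

# Toy stand-in for the qsec column (hypothetical values)
qsec = pd.Series([14.5, 16.0, 17.0, 18.5, 20.0])

# Min-max scaling by hand: (x - min) / (max - min)
scaled = (qsec - qsec.min()) / (qsec.max() - qsec.min())

# Count records with a scaled value greater than 0.5
count = int((scaled > 0.5).sum())
print(count)
```

On the real mtcars data this reproduces the MinMaxScaler result above.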
# Example: reading the data files
import pandas as pd
x_test = pd.read_csv("./X_test.csv", encoding="cp949")
x_train = pd.read_csv("./X_train.csv", encoding="cp949")
y_train = pd.read_csv("./y_train.csv", encoding="cp949")
# Explore the data
#print(x_test.head())
#print(x_train.head())
#print(y_train.head())
#print(x_test.describe())
#print(x_train.describe())
# Check for missing values
# print(x_train.isnull().sum())
# print(x_test.isnull().sum())
# Fill missing values with 0
x_train.fillna(0, inplace=True)
x_test.fillna(0, inplace=True)
#print(x_train.isnull().sum())
#print(x_test.isnull().sum())
# One-hot encode 주구매상품 and 주구매지점
item = pd.get_dummies(x_train['주구매상품'], prefix='주구매상품')
store = pd.get_dummies(x_train['주구매지점'], prefix='주구매지점')
x_train = pd.concat([x_train, item, store], axis=1)
x_train.drop(['주구매상품', '주구매지점'], axis=1, inplace=True)
item = pd.get_dummies(x_test['주구매상품'], prefix='주구매상품')
store = pd.get_dummies(x_test['주구매지점'], prefix='주구매지점')
x_test = pd.concat([x_test, item, store], axis=1)
x_test.drop(['주구매상품', '주구매지점'], axis=1, inplace=True)
# Drop '주구매상품_소형가전', which appears only in the train set
x_train.drop(['주구매상품_소형가전'], axis=1, inplace=True)
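Dropping '주구매상품_소형가전' by hand works for this particular split, but any mismatch between train and test categories can be handled generically with `reindex`. A sketch on toy frames (hypothetical data, not the exam files):

```python
import pandas as pd

# Toy train/test frames whose category sets differ (hypothetical data)
train = pd.DataFrame({'주구매상품': ['가전', '식품', '소형가전']})
test = pd.DataFrame({'주구매상품': ['가전', '식품']})

train_d = pd.get_dummies(train['주구매상품'], prefix='주구매상품')
test_d = pd.get_dummies(test['주구매상품'], prefix='주구매상품')

# Align test to the train columns; categories missing from test become all-zero columns
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns) == list(train_d.columns))
```

This keeps the feature matrices identical in shape and column order, which the scaler and models below require.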
# Standardize the features (note: x_train_sc and x_test_sc are not used by the models below)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(x_train)
x_train_sc = sc.transform(x_train)
x_test_sc = sc.transform(x_test)
# Logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# y_train holds cust_id and gender, so fit on the gender column only
model.fit(x_train, y_train['gender'])
print('Logistic score: ', model.score(x_train, y_train['gender']))
# Knn
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=4, metric='euclidean')
model.fit(x_train, y_train['gender'])
print('KNN score: ', model.score(x_train, y_train['gender']))
# XGB
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(x_train, y_train['gender'])
print('XGB score:', model.score(x_train, y_train['gender']))
# DT
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=1, max_depth=10)
model.fit(x_train, y_train['gender'])
print('DTree score: ', model.score(x_train, y_train['gender']))
# RF
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(max_depth=10, n_estimators=100)
model.fit(x_train, y_train['gender'])
print('RF score: ', model.score(x_train, y_train['gender']))
# Predict on the test set and write the submission file
predict = model.predict_proba(x_test)
# predict_proba columns follow model.classes_ ([0, 1]), so column 1 is P(gender=1)
output = pd.DataFrame({'cust_id': x_test['cust_id'], 'gender': predict[:, 1]})
output.to_csv('1234.csv', index=False)
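`predict_proba` returns one column per class, ordered to match `model.classes_`, so it is worth confirming which column corresponds to gender 1 before writing the submission. A minimal sketch with synthetic 0/1 labels (not the exam data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic features and 0/1 labels standing in for the gender target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
print(clf.classes_)  # class order: column 0 of predict_proba is P(y=0), column 1 is P(y=1)

proba = clf.predict_proba(X)
print(proba.shape)   # one row per sample, one column per class; rows sum to 1
```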
Logistic score: 0.6237142857142857
KNN score: 0.624
XGB score: 0.7114285714285714
DTree score: 0.7222857142857143
RF score: 0.7605714285714286
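The scores above are all measured on the training data, which flatters the tree-based models; a hold-out split scored with `roc_auc_score` (the metric this exam task uses) gives a fairer comparison. A sketch on synthetic data, since the exam files are assumed unavailable here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the customer features and gender labels
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

model = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=1)
model.fit(X_tr, y_tr)

# Evaluate on data the model has not seen, using the class-1 probabilities
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(round(auc, 3))
```

The same pattern applied to x_train / y_train above would show how much of the RF's 0.76 training score survives on unseen data.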
Sources
https://5ohyun.tistory.com/108
https://hobby-weighted.tistory.com/156
https://blog.naver.com/PostView.naver?blogId=da0097&logNo=222390408292&categoryNo=0&parentCategoryNo=0