ML-300 Problems

TOLL TERRY · January 8, 2024


Today I'll summarize the regression-related problems
while working through the 300-problem set.

Goals

Cleaning scraped, dirty data
Normalizing various kinds of data
Learning how to train models with scikit-learn
Training XGBoost and LightGBM models
Evaluating and visualizing models

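Analyzing used car data

1. Inspecting the data

26 columns (checked with head, info, describe)

df.isna().sum() # number of nulls per column
df.columns # 26 columns
df.drop([<columns that are not needed>], axis=1, inplace=True) # placeholder: list the columns to drop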

1. Converting year to age

df['age'] = 2021 - df['year']
df.drop('year', axis=1, inplace=True)
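
The reference year 2021 is hard-coded above. As a minimal sketch of the same conversion, assuming a hypothetical toy frame and a hypothetical `REFERENCE_YEAR` constant (neither is in the original notes), this shows that a missing year simply stays missing:

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2021  # same reference year as in the notes above

# Hypothetical toy data; the real dataset is not reproduced here.
toy = pd.DataFrame({"year": [2015.0, 2000.0, np.nan]})

# Age in years; a missing year stays NaN instead of raising an error.
toy["age"] = REFERENCE_YEAR - toy["year"]
toy = toy.drop(columns="year")
```

Those NaN ages still need to be handled before training, which is why the preprocessing step checks `isna()` again.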

2. Inspecting categorical data

import matplotlib.pyplot as plt
import seaborn as sns

df.columns # 26 columns

# manufacturer: 43 unique values
df['manufacturer'].value_counts()
fig = plt.figure(figsize=(8, 12))
sns.countplot(x='manufacturer', data=df.fillna('n/a'), order=df.fillna('n/a')['manufacturer'].value_counts().index)

# model: 31,520 unique values
df['model'].value_counts()
fig = plt.figure(figsize=(8, 12))
sns.countplot(x='model', data=df.fillna('n/a'), order=df.fillna('n/a')['model'].value_counts().index)

# condition
df['condition'].value_counts()
fig = plt.figure(figsize=(8, 12))
sns.countplot(x='condition', data=df.fillna('n/a'), order=df.fillna('n/a')['condition'].value_counts().index)

# cylinders
df['cylinders'].value_counts()
fig = plt.figure(figsize=(8, 12))
sns.countplot(x='cylinders', data=df.fillna('n/a'), order=df.fillna('n/a')['cylinders'].value_counts().index)

# transmission
df['transmission'].value_counts()
fig = plt.figure(figsize=(8, 12))
sns.countplot(x='transmission', data=df.fillna('n/a'), order=df.fillna('n/a')['transmission'].value_counts().index)
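
Before plotting every column one by one, it can help to see the cardinality and missing counts in a single table. This is a rough sketch on a hypothetical toy frame (the column values are made up), not the code used in the course:

```python
import pandas as pd

# Hypothetical toy data standing in for the real listing table.
toy = pd.DataFrame({
    "manufacturer": ["ford", "ford", "toyota", None],
    "transmission": ["automatic", "manual", "automatic", "automatic"],
})

# Number of distinct values and number of missing values per column.
summary = pd.DataFrame({
    "n_unique": toy.nunique(),      # missing values are not counted as a category
    "n_missing": toy.isna().sum(),
})
print(summary)
```

A column like model, with tens of thousands of distinct values, stands out immediately in such a table and is a candidate for dropping or lumping.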

3. Inspecting numeric data

# price
sns.histplot(x='price', data=df)
sns.boxplot(x='price', data=df)
sns.rugplot(x='price', data=df, height=1)

# odometer
sns.histplot(x='odometer', data=df)

# age
sns.histplot(x='age', data=df)
sns.histplot(x='age', data=df, bins=18, kde=True)

2. Data cleaning

1. Visualizing categorical data

# not workable at this stage
sns.boxplot(x='manufacturer', y='price', data=df.fillna('n/a'))

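2. Cleaning categorical data

Drop rows with missing data
Replace missing data with 'others'
Merge classes with few samples into 'others'
Train a classification model and fill in missing data with its predictions

df['manufacturer'].fillna('others').value_counts()

# plot the counts per category to choose how many categories to keep
col = 'manufacturer'
counts = df[col].fillna('others').value_counts()
plt.grid()
plt.plot(range(len(counts)), counts)

# keep the 10 most frequent categories and map the rest to 'others'
n_categorical = 10
counts_index = counts.index[n_categorical:]
df[col] = df[col].apply(lambda s: s if str(s) not in counts_index else 'others')
df[col] = df[col].fillna('others')
df.loc[df[col] == 'other', col] = 'others' # merge an existing 'other' label into 'others'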
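
The same top-N lumping has to be repeated for every categorical column, so it can be wrapped in a helper. The function name `lump_rare` and the toy series are hypothetical, and this sketch does not merge a pre-existing 'other' label the way the last line above does:

```python
import pandas as pd

def lump_rare(series: pd.Series, n_keep: int, other: str = "others") -> pd.Series:
    """Keep the n_keep most frequent categories; map the rest and NaN to `other`."""
    filled = series.fillna(other)
    keep = filled.value_counts().index[:n_keep]
    # where() keeps values that satisfy the condition and replaces the rest.
    return filled.where(filled.isin(keep), other)

toy = pd.Series(["ford", "ford", "ford", "toyota", "toyota", "kia", None])
lumped = lump_rare(toy, n_keep=2)
print(lumped.value_counts())
```

It could then be applied column by column, for example `df[col] = lump_rare(df[col], n_keep=10)`.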

3. Visualizing numeric data

fig = plt.figure(figsize=(8,12))
sns.rugplot(x='price', data=df, height=1)

4. Cleaning numeric data

price_1 = df['price'].quantile(0.99) # top 1% cutoff
price_2 = df['price'].quantile(0.1) # bottom 10% cutoff

# keep only rows strictly between the two cutoffs
df = df[(price_1 > df['price']) & (df['price'] > price_2)]
df.describe()
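
To see what the quantile filter does, here is a small sketch on synthetic prices. The data is made up for illustration (zero-price and absurdly high listings mixed into plausible ones), and `Series.between` is used as an equivalent way to write the two comparisons:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical prices: mostly plausible values plus a few zero and extreme listings.
toy = pd.DataFrame({"price": np.concatenate([
    rng.uniform(1_000, 40_000, size=970),
    np.zeros(20),
    np.full(10, 1e9),
])})

low = toy["price"].quantile(0.10)   # bottom 10% cutoff
high = toy["price"].quantile(0.99)  # top 1% cutoff

# inclusive="neither" gives strict inequalities, matching the filter above.
trimmed = toy[toy["price"].between(low, high, inclusive="neither")]
print(len(toy), len(trimmed))
```

Both the zero-price rows and the extreme rows fall outside the cutoffs, so the remaining distribution is much easier to plot and to model.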

5. Visualizing categorical data with boxplots

fig = plt.figure(figsize=(14, 5))
sns.boxplot(x='manufacturer', y='price', data=df)

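6. Visualizing the correlation heatmap

# read the correlations by absolute value
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='YlOrRd') # numeric_only skips the categorical columns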
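
When only the relationship to price matters, the heatmap can be reduced to a single sorted column. This is a sketch on synthetic data; the coefficients used to generate it are arbitrary assumptions, not properties of the real dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic features: price falls with age, odometer grows with age.
age = rng.uniform(0, 20, size=500)
odometer = age * 10_000 + rng.normal(0, 20_000, size=500)
price = 30_000 - 1_000 * age + rng.normal(0, 2_000, size=500)
toy = pd.DataFrame({"price": price, "odometer": odometer, "age": age})

# Absolute correlation of each feature with price, strongest first.
corr_with_price = (
    toy.corr(numeric_only=True)["price"]
    .drop("price")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_price)
```

Taking the absolute value puts strong negative correlations (such as age against price) on the same footing as strong positive ones.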

3. Data preprocessing

from sklearn.preprocessing import StandardScaler

# standardize the numeric features
x_num = df[['odometer', 'age']]
scaler = StandardScaler()
scaler.fit(x_num)
x_scaled = scaler.transform(x_num)
x_scaled = pd.DataFrame(x_scaled, index=x_num.index, columns=x_num.columns)

# one-hot encode the categorical features
x_cat = df.drop(['price', 'odometer', 'age'], axis=1)
x_cat = pd.get_dummies(x_cat)

x = pd.concat([x_scaled, x_cat], axis=1)
y = df['price']

x.head()
x.shape
x.isna().sum() # if NaNs remain: x.fillna(0.0, inplace=True), or fill with x['age'].mean()
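
As an alternative to scaling and one-hot encoding by hand, scikit-learn's `ColumnTransformer` can bundle both steps. This is a sketch on a hypothetical four-row frame, not the approach taken in the notes above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows; values are made up for illustration.
toy = pd.DataFrame({
    "odometer": [120_000.0, 30_000.0, 75_000.0, 200_000.0],
    "age": [10.0, 2.0, 5.0, 15.0],
    "manufacturer": ["ford", "toyota", "ford", "others"],
})

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["odometer", "age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["manufacturer"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

x_toy = preprocess.fit_transform(toy)
print(x_toy.shape)  # 2 scaled columns + 3 one-hot columns
```

One advantage of this form is that the fitted `preprocess` object can later be applied to new rows with `transform`, reusing the same scaling statistics and categories.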

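4. XGBoost regression model

from xgboost import XGBRegressor
# x_train, x_test, y_train, y_test are assumed to come from a train/test split of x and y (not shown in these notes)
model = XGBRegressor()
model.fit(x_train, y_train)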
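
The notes use x_train and x_test without showing where they come from. One common way to produce them is scikit-learn's `train_test_split`. The 80/20 ratio, the seed, and the synthetic data below are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-ins for the preprocessed features x and the target y.
x_toy = pd.DataFrame({"odometer": rng.normal(size=100), "age": rng.normal(size=100)})
y_toy = pd.Series(rng.uniform(1_000, 40_000, size=100), name="price")

# Hold out 20% of the rows for evaluation; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=42
)
print(x_train.shape, x_test.shape)
```

With the real data, `x` and `y` from the preprocessing step would take the place of `x_toy` and `y_toy`.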

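5. Model evaluation

from math import sqrt
from sklearn.metrics import mean_absolute_error, mean_squared_error

pred = model.predict(x_test)
print(mean_absolute_error(y_test, pred)) # MAE
print(sqrt(mean_squared_error(y_test, pred))) # RMSE: square root of the mean squared error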
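
To make the two metrics concrete, here they are computed by hand with NumPy on three made-up predictions. RMSE takes the square root of the mean squared error, so large misses weigh more than they do in MAE:

```python
import numpy as np

# Made-up actual and predicted prices.
y_true = np.array([10_000.0, 20_000.0, 30_000.0])
y_hat = np.array([12_000.0, 19_000.0, 27_000.0])

residual = y_hat - y_true                # [2000, -1000, -3000]
mae = np.mean(np.abs(residual))          # (2000 + 1000 + 3000) / 3
rmse = np.sqrt(np.mean(residual ** 2))   # sqrt((4e6 + 1e6 + 9e6) / 3)
print(mae, rmse)
```

RMSE is never smaller than MAE on the same residuals, and the gap between them grows when a few predictions are far off.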

6. Evaluating the trained model in more depth

# predicted vs. actual price, with the ideal y = x line in red
plt.scatter(x=y_test, y=pred, alpha=0.005)
plt.plot([0, 60000], [0, 60000], 'r-')

sns.histplot(x=y_test, y=pred)
plt.plot([0, 60000], [0, 60000], 'r-')

# relative error in percent
err = (pred - y_test) / y_test * 100
plt.hist(err[err < 600], bins=12)
plt.xlabel('error (%)')
plt.xlim(-100, 100)
plt.grid()

# absolute error in dollars
err = pred - y_test
plt.hist(err, bins=12)
plt.xlabel('error ($)')
plt.grid()
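
The percentage-error histogram can also be condensed into a single number, such as the share of predictions within a tolerance. The ±20% threshold and the values below are arbitrary choices for illustration:

```python
import numpy as np

# Made-up actual and predicted prices.
y_true = np.array([10_000.0, 20_000.0, 30_000.0, 5_000.0])
y_hat = np.array([11_000.0, 19_000.0, 45_000.0, 5_200.0])

pct_err = (y_hat - y_true) / y_true * 100    # [10, -5, 50, 4]
within_20 = np.mean(np.abs(pct_err) <= 20)   # fraction of predictions within +/-20%
print(within_20)
```

The division by the actual price requires strictly positive prices, which the earlier quantile filter is expected to guarantee by removing zero-price rows.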
