[AI] Pima Indians Diabetes Database

황동규 · August 2, 2023

This post documents my analysis of the Pima Indians Diabetes Database.

Dataset

import pandas as pd

df = pd.read_csv("./data/diabetes.csv", encoding="utf-8")
df.shape

(768, 9)

The DataFrame df consists of 768 rows and 9 columns.

col = df.columns
print(col)

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

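Pregnancies: number of pregnancies
Glucose: glucose tolerance test result
BloodPressure: blood pressure
SkinThickness: skinfold thickness measured at the back of the triceps
Insulin: serum insulin
BMI: body mass index
DiabetesPedigreeFunction: a weighted score for family history of diabetes
Age: age
Outcome: whether the person has diabetes

Let's take a look at the data.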

df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

Looking at the data, there are no NaN values. However, columns such as Glucose, SkinThickness, and Insulin contain 0 values, which are not plausible measurements and effectively mark missing data.

import numpy as np
import seaborn as sns
is_null = df.iloc[:, 1:5]  # columns that appear to use 0 for missing values
sns.heatmap(is_null == 0)

is_null = is_null.replace(0, np.nan)
temp = is_null.isnull().mean()
temp

Glucose          0.006510
BloodPressure    0.045573
SkinThickness    0.295573
Insulin          0.486979
dtype: float64

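Glucose: almost no missing values, about 0.7%
BloodPressure: about 5% missing
SkinThickness (triceps skinfold thickness): about 30% missing
Insulin: about 49% missing

Insulin and SkinThickness clearly have the most missing values.

Handling missing values

I handled the missing values by filling them with the column mean. Only Insulin and SkinThickness are imputed here; the zeros in Glucose, BloodPressure, and BMI are left as they are.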

# Note: the means are computed while the 0 placeholders are still present,
# so they include the zeros.
insulin_mean = df['Insulin'].mean()
skin_mean = df['SkinThickness'].mean()
df['Insulin'] = df['Insulin'].replace(0, np.nan)  # mark Insulin zeros as missing
df['Insulin'] = df['Insulin'].fillna(insulin_mean)
df['SkinThickness'] = df['SkinThickness'].replace(0, np.nan)  # mark SkinThickness zeros as missing
df['SkinThickness'] = df['SkinThickness'].fillna(skin_mean)
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    float64
 4   Insulin                   768 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(4), int64(5)
memory usage: 54.1 KB

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	26.606479	118.660163	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	9.631241	93.080358	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	7.000000	14.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	20.536458	79.799479	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	79.799479	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000
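One thing worth noting about the imputation above: the fill value is the mean taken while the zeros were still in the column, which is why 79.799479 and 20.536458 now show up as the 25% quantiles. From the figures above (Insulin mean 79.80 with about 48.7% zeros), the mean of the observed values alone would be roughly 79.80 / (1 - 0.487) ≈ 155.5. Below is a minimal sketch of the difference, using a made-up toy column rather than the real data; it is an alternative to consider, not what this post actually ran.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Insulin column; the values are invented for illustration.
toy = pd.DataFrame({"Insulin": [0, 100, 0, 200, 300]})

# Mean taken while zeros are still present (the approach used in this post).
mean_with_zeros = toy["Insulin"].mean()      # (0 + 100 + 0 + 200 + 300) / 5

# Mean taken after zeros are marked as missing; mean() skips NaN by default.
insulin = toy["Insulin"].replace(0, np.nan)
mean_observed = insulin.mean()               # (100 + 200 + 300) / 3

# Fill the missing entries with the mean of the observed values only.
imputed = insulin.fillna(mean_observed)

print(mean_with_zeros, mean_observed)        # 120.0 200.0
print(imputed.tolist())                      # [200.0, 100.0, 200.0, 200.0, 300.0]
```

Filling with the zero-inclusive mean pulls the imputed values toward zero, so it may be worth comparing both variants, or a median fill, before settling on one.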

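Removing outliers

As with the missing values, I handled outliers only in Insulin and SkinThickness, dropping rows that fall outside 1.5 × IQR of the quartiles.

def remove_outliers_iqr(df, column, threshold=1.5):
    # Keep only rows within [Q1 - threshold*IQR, Q3 + threshold*IQR]
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - threshold * IQR
    upper_bound = Q3 + threshold * IQR
    df_no_outliers = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return df_no_outliers

# Columns 3 and 4 are SkinThickness and Insulin
columns = df.iloc[:, 3:5].columns
for col in columns:
    df = remove_outliers_iqr(df, col)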
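To make the IQR rule above concrete, here is a small worked sketch on a made-up series. The helper name iqr_bounds and the toy values are mine, introduced only for illustration.

```python
import pandas as pd

def iqr_bounds(series, threshold=1.5):
    """Return the (lower, upper) limits of the IQR rule for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr

# Toy data: five ordinary values and one obvious outlier.
toy = pd.Series([10, 12, 14, 16, 18, 100])

lower, upper = iqr_bounds(toy)               # Q1 = 12.5, Q3 = 17.5, IQR = 5.0
kept = toy[(toy >= lower) & (toy <= upper)]  # rows inside the limits

print(lower, upper)                          # 5.0 25.0
print(kept.tolist())                         # [10, 12, 14, 16, 18]
```

Applied to the describe() output above, SkinThickness has Q1 = 20.536458 and Q3 = 32, so the IQR is about 11.46 and the kept range is roughly 3.34 to 49.20. Because the loop filters one column at a time, the Insulin quartiles are then computed on the rows that survived the SkinThickness filter, not on the full 768 rows.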

Training and results

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

def get_clf_eval(y_test, pred=None, pred_proba=None):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    # Note: AUC is computed here from the hard predictions (pred), not from pred_proba
    roc_auc = roc_auc_score(y_test, pred)

    print('Confusion matrix')
    print(confusion)
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, F1: {3:.4f}, AUC:{4:.4f}'
          .format(accuracy, precision, recall, f1, roc_auc))

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Extract the feature set X and the label set y.
# The last column, Outcome, is the label, so column position -1 is used.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# stratify defaults to None; passing the target keeps each class's ratio the same in the train and test splits (so one class does not end up concentrated on one side).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=156, stratify=y)

# Train, predict, and evaluate with logistic regression
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
pred = lr_clf.predict(X_test)
pred_proba = lr_clf.predict_proba(X_test)[:, 1]

get_clf_eval(y_test, pred, pred_proba)

Confusion matrix
[[82  8]
 [15 29]]
Accuracy: 0.8284, Precision: 0.7838, Recall: 0.6591, F1: 0.7160, AUC:0.7851


/Users/hwangdong-gyu/anaconda3/envs/test/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
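The ConvergenceWarning above suggests either raising max_iter or scaling the data. Below is a minimal sketch of the scaling route, assuming synthetic features (generated here, not diabetes.csv) with deliberately different scales. StandardScaler and make_pipeline are standard scikit-learn tools, but using them is my suggestion and was not part of the run reported in this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three features on very different scales.
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([
    rng.normal(120, 30, n),    # glucose-like scale
    rng.normal(80, 100, n),    # insulin-like scale
    rng.normal(0.5, 0.3, n),   # pedigree-like scale
])
# Label depends on the first feature plus noise (purely illustrative).
y = (X[:, 0] + rng.normal(0, 30, n) > 125).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=156, stratify=y)

# The scaler is fit on the training split only, inside the pipeline,
# so no information from the test split leaks into the scaling.
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(X_train, y_train)

test_accuracy = scaled_clf.score(X_test, y_test)
n_iterations = scaled_clf.named_steps["logisticregression"].n_iter_[0]
print(test_accuracy, n_iterations)
```

With standardized inputs lbfgs typically converges well before the iteration limit, so the warning should no longer appear. Whether accuracy on the real data improves would have to be checked by rerunning the experiment.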

With the missing values and outliers handled, the trained model reaches an accuracy of about 83% (0.8284).
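A note on the AUC figure: get_clf_eval passes the hard 0/1 predictions to roc_auc_score, so the reported 0.7851 is just the average of recall and specificity, (29/44 + 82/90) / 2 ≈ 0.7851. ROC AUC is normally computed from predicted probabilities. The sketch below shows the difference on a tiny made-up example; the arrays are invented for illustration and are not from this dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.1, 0.3, 0.6, 0.4, 0.7, 0.9])

# Hard predictions at the usual 0.5 threshold.
labels = (proba >= 0.5).astype(int)          # [0, 0, 1, 0, 1, 1]

# AUC from probabilities ranks every positive against every negative:
# 8 of the 9 positive/negative pairs are ordered correctly.
auc_from_proba = roc_auc_score(y_true, proba)

# AUC from hard labels collapses to (recall + specificity) / 2 = (2/3 + 2/3) / 2.
auc_from_labels = roc_auc_score(y_true, labels)

print(round(auc_from_proba, 4), round(auc_from_labels, 4))  # 0.8889 0.6667
```

In the evaluation function this would mean passing pred_proba instead of pred to roc_auc_score. I have not rerun the model that way, so the probability-based AUC for this experiment is not reported here.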

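Takeaways

Working through the analysis and the missing-value and outlier handling on this dataset, I realized that I am still inexperienced at handling data, and that I should explore the data in more detail through visualization with seaborn. I also want to try improving accuracy through scaling such as normalization.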