3. Individual Level Variables
- There are two kinds of individual-level variables:
- Boolean values indicating true/false
- Ordered values indicating rank
ind = data[id_ + ind_bool + ind_ordered]
ind.shape
![](https://velog.velcdn.com/images/hsjunior1/post/23962a9c-d594-4170-8d79-4b0055fb923f/image.png)
3.1 Redundant Individual Variables
- To remove redundant variables, find the columns whose absolute correlation with another column exceeds 0.95; these are the candidates to drop
corr_matrix = ind.corr()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 0.95)]
to_drop
![](https://velog.velcdn.com/images/hsjunior1/post/0aa8b711-5157-40b3-96a8-5f6bec98fd5c/image.png)
`female` is flagged because its correlation with its opposite, `male`, is extremely high, so we drop the `male` column
ind = ind.drop(columns = 'male')
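The correlation filter above can be sketched on a small toy frame. The data and the name `toy` are made up for illustration; only the masking logic mirrors the code in this section. Note that the strict upper triangle flags the later column of a correlated pair, so which of the two names appears in `to_drop` depends on column order.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'male' is the exact complement of 'female'.
toy = pd.DataFrame({
    'female': [1, 0, 1, 1, 0, 0],
    'male':   [0, 1, 0, 0, 1, 1],
    'age':    [23, 45, 31, 60, 18, 52],
})

corr_matrix = toy.corr()

# Keep only the strict upper triangle so each pair is examined once.
mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
upper = corr_matrix.where(mask)

# A column is flagged when it correlates strongly with any earlier column.
to_drop = [c for c in upper.columns if (upper[c].abs() > 0.95).any()]
print(to_drop)
```

Either column of a flagged pair carries the same information, so dropping `male` instead of the flagged `female`, as done above, is an equally valid choice.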
3.1.1 Creating Ordinal Variables
Create a column representing education level, from `instlevel1` (no education) to `instlevel9` (postgraduate education)
ind[[c for c in ind if c.startswith('instl')]].head()
![](https://velog.velcdn.com/images/hsjunior1/post/e5081437-1e11-4ed0-8dd0-826a2035bec2/image.png)
ind['inst'] = np.argmax(np.array(ind[[c for c in ind if c.startswith('instl')]]), axis = 1)
plot_categoricals('inst', 'Target', ind, annotate = False)
![](https://velog.velcdn.com/images/hsjunior1/post/b3d01cfb-ef81-4640-98e7-0d043f170c71/image.png)
- The result above shows that the higher the education level, the less severe the poverty level
plt.figure(figsize = (10, 8))
sns.violinplot(x = 'Target', y = 'inst', data = ind)
plt.title('Education Distribution by Target')
![](https://velog.velcdn.com/images/hsjunior1/post/850a5169-5510-443b-a7ae-b0197604e2a6/image.png)
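As a minimal sketch of how `np.argmax` turns one-hot education flags into a single ordinal value, consider the toy frame below. The three columns and their values are invented for illustration. The result is zero-based and relies on the columns being in ascending order of education level; a row with no flag set would also map to 0, which this sketch does not handle.

```python
import numpy as np
import pandas as pd

# Hypothetical one-hot education flags: exactly one 1 per row.
one_hot = pd.DataFrame({
    'instlevel1': [1, 0, 0],
    'instlevel2': [0, 0, 1],
    'instlevel3': [0, 1, 0],
})

level_cols = [c for c in one_hot if c.startswith('instl')]

# argmax along each row returns the position of the column holding the 1.
ordinal = np.argmax(one_hot[level_cols].to_numpy(), axis=1)
print(ordinal)
```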
3.1.2 Feature Construction
- Create new features from the existing data
ind['escolari/age'] = ind['escolari'] / ind['age']
plt.figure(figsize = (10, 8))
sns.violinplot(x = 'Target', y = 'escolari/age', data = ind)
![](https://velog.velcdn.com/images/hsjunior1/post/f31d018f-8ae6-4b03-8872-951b4c2a24c4/image.png)
ind['inst/age'] = ind['inst'] / ind['age']
ind['tech'] = ind['v18q'] + ind['mobilephone']
ind['tech'].describe()
![](https://velog.velcdn.com/images/hsjunior1/post/f606e35e-a6ba-4d7e-8683-4c98bf3afb3a/image.png)
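Ratio features such as `escolari/age` can produce `NaN` or infinite values when the denominator is 0. The sketch below uses invented values to show one possible guard; whether the real data contains an age of 0 is not checked here, so treat this as an optional precaution rather than a required step.

```python
import numpy as np
import pandas as pd

# Hypothetical values; the first row has age 0 to trigger the edge case.
toy = pd.DataFrame({'escolari': [0, 6, 11], 'age': [0, 12, 44]})

# Division by zero yields NaN (0/0) or inf (x/0); map inf to NaN as well
# so that downstream statistics are not distorted.
ratio = (toy['escolari'] / toy['age']).replace([np.inf, -np.inf], np.nan)
print(ratio)
```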
3.2 Feature Engineering through Aggregations
range_ = lambda x: x.max() - x.min()
range_.__name__ = 'range_'
ind_agg = ind.drop(columns = 'Target').groupby('idhogar').agg(['min','max','sum','count','std',range_])
ind_agg.head()
![](https://velog.velcdn.com/images/hsjunior1/post/527e5105-df46-4d69-9636-5f140b3b628d/image.png)
new_col = []
for c in ind_agg.columns.levels[0]:
    for stat in ind_agg.columns.levels[1]:
        new_col.append(f'{c}-{stat}')
ind_agg.columns = new_col
ind_agg.head()
![](https://velog.velcdn.com/images/hsjunior1/post/5b48c2e7-cf5e-4edc-b51e-49edbfe7d3eb/image.png)
ind_agg.iloc[:, [0, 1, 2, 3, 6, 7, 8, 9]].head()
![](https://velog.velcdn.com/images/hsjunior1/post/feb58b33-c436-43ff-abc0-5a1d4a3d2a8a/image.png)
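The aggregation and column renaming above can be sketched on a toy frame as follows. The frame `people` and its values are hypothetical. Instead of looping over the two index levels, this variant iterates over the column tuples directly, which keeps each new name tied to its own column regardless of how the levels are ordered; it is an alternative to the loop above, not a change to it.

```python
import pandas as pd

# Hypothetical individual-level rows for two households.
people = pd.DataFrame({
    'idhogar': ['h1', 'h1', 'h2'],
    'age': [10, 40, 25],
    'escolari': [3, 11, 6],
})

def range_(x):
    """Spread of a group's values (max minus min)."""
    return x.max() - x.min()

# One row per household, one (column, statistic) pair per output column.
agg = people.groupby('idhogar').agg(['min', 'max', range_])

# Flatten the two-level column index into 'column-stat' names.
agg.columns = [f'{col}-{stat}' for col, stat in agg.columns]
print(agg)
```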
3.2.1 Feature Selection
corr_matrix = ind_agg.corr()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 0.95)]
print(f'There are {len(to_drop)} correlated columns to remove')
![](https://velog.velcdn.com/images/hsjunior1/post/df8e588d-5459-491d-b3ec-7bc97856611a/image.png)
ind_agg = ind_agg.drop(columns = to_drop)
ind_feats = list(ind_agg.columns)
final = heads.merge(ind_agg, on = 'idhogar', how = 'left')
print('Final features shape: ', final.shape)
![](https://velog.velcdn.com/images/hsjunior1/post/d2c22f53-c0ef-4be4-a671-9794e0bd1a19/image.png)
final.head()
![](https://velog.velcdn.com/images/hsjunior1/post/db56ef4f-a8a2-4069-9608-71ff52582301/image.png)
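The left merge above keeps every row of `heads` and attaches the household aggregates where they exist. A minimal sketch with invented frames (these toy `heads` and `ind_agg` only mimic the shape of the real ones) shows the behaviour for a household with no matching aggregate row:

```python
import pandas as pd

# Hypothetical household heads; 'h3' has no aggregate row.
heads = pd.DataFrame({'idhogar': ['h1', 'h2', 'h3'], 'Target': [4, 2, 1]})

# Aggregates indexed by household id, as produced by groupby().agg().
ind_agg = pd.DataFrame(
    {'age-max': [40, 25]},
    index=pd.Index(['h1', 'h2'], name='idhogar'),
)

# how='left' preserves all heads; missing aggregates become NaN.
final = heads.merge(ind_agg, on='idhogar', how='left')
print(final)
```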
3.2.2 Final Data Exploration
corrs = final.corr()['Target']
corrs.sort_values().dropna().tail()
![](https://velog.velcdn.com/images/hsjunior1/post/d8547c02-63dd-4bba-918b-c57b187ac4ad/image.png)
plot_categoricals('escolari-max', 'Target', final, annotate=False);
![](https://velog.velcdn.com/images/hsjunior1/post/1e6aa6bd-6e41-4867-8770-9399870b75ac/image.png)
plt.figure(figsize = (10, 6))
sns.violinplot(x = 'Target', y = 'escolari-max', data = final)
plt.title('Max Schooling by Target')
![](https://velog.velcdn.com/images/hsjunior1/post/b7b114ec-5e3d-4cb8-9345-63a28f47db7e/image.png)
plt.figure(figsize = (10, 6))
sns.boxplot(x = 'Target', y = 'escolari-max', data = final)
plt.title('Max Schooling by Target')
![](https://velog.velcdn.com/images/hsjunior1/post/43033e19-29bb-44f8-bfd0-0cd8023860c9/image.png)
plt.figure(figsize = (10, 6))
sns.boxplot(x = 'Target', y = 'overcrowding', data = final)
plt.xticks([0, 1, 2, 3], poverty_mapping.values())
plt.title('Overcrowding by Target')
![](https://velog.velcdn.com/images/hsjunior1/post/69e420a7-f7f1-4975-9cf5-f097e77ebfae/image.png)
head_gender = ind.loc[ind['parentesco1'] == 1, ['idhogar', 'female']]
final = final.merge(head_gender, on = 'idhogar', how = 'left').rename(columns = {'female': 'female-head'})
final.groupby('female-head')['Target'].value_counts(normalize=True)
![](https://velog.velcdn.com/images/hsjunior1/post/634a34f0-e95c-4d00-a8da-78e90e421a49/image.png)
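As a small illustration of what `value_counts(normalize=True)` returns inside a `groupby`, the sketch below uses invented data. Each group's shares sum to 1, which makes groups of different sizes comparable.

```python
import pandas as pd

# Hypothetical households: four with a male head, two with a female head.
toy = pd.DataFrame({
    'female-head': [0, 0, 0, 0, 1, 1],
    'Target':      [4, 4, 4, 2, 4, 1],
})

# Share of each Target level within each female-head group.
shares = toy.groupby('female-head')['Target'].value_counts(normalize=True)
print(shares)
```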
sns.violinplot(x = 'female-head', y = 'Target', data = final);
plt.title('Target by Female Head of Household');
![](https://velog.velcdn.com/images/hsjunior1/post/f76cb88a-8f25-4d49-b40f-35fb12eb5eef/image.png)
plt.figure(figsize = (8, 8))
sns.boxplot(x = 'Target', y = 'meaneduc', hue = 'female-head', data = final);
plt.title('Average Education by Target and Female Head of Household', size = 16);
![](https://velog.velcdn.com/images/hsjunior1/post/34ee86c2-d8ba-4f8b-b15a-4e65c020f2cd/image.png)
final.groupby('female-head')['meaneduc'].agg(['mean', 'count'])
![](https://velog.velcdn.com/images/hsjunior1/post/8e0429df-c826-4335-b399-8b6035f18149/image.png)
- Overall, households with a female head appeared to have a higher level of education
- However, at the same time