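Unsupervised learning: any kind of machine learning in which the learning algorithm has to be trained without known output values or label information
Two main types: unsupervised transformation and clustering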
Unsupervised transformation: algorithms that create a new representation of the data that is easier for people or other machine learning algorithms to interpret than the original data
- Dimensionality reduction is the typical case: it represents high-dimensional data with fewer features while keeping the essential characteristics. A common example is reducing a dataset to two dimensions for visualization
- Unsupervised transformations can also find the units or components that make up the data, for example extracting topics from a large collection of text documents (useful for tracking social media discussions on topics such as elections, gun control, or pop stars)
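As a minimal sketch of the dimensionality reduction case above, the following reduces the four-feature iris dataset to two dimensions. PCA, the iris data, and standard scaling are assumptions chosen for illustration; the notes do not prescribe a specific method.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load a small dataset and use only the features (the labels are ignored).
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)  # 150 samples x 4 features

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                             # rows are kept, columns drop to 2
print(pca.explained_variance_ratio_.sum())    # share of variance kept in 2D
```

The two columns of `X_2d` can be passed straight to a scatter plot for visual inspection.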
Clustering: grouping similar data points together
- Example: uploading photos to a social media site
- To organize the uploaded photos, the site could group photos that show the same person, but it does not know who appears in each photo or how many different people appear in the whole album
- One workable approach is to extract every face in the photos and group similar-looking faces; if the faces in a group belong to the same person, the images have been grouped well
The hardest part is evaluating whether the algorithm learned something useful
Unsupervised learning is usually applied to unlabeled data, so the correct output is unknown
- Often the only way to evaluate an unsupervised result is to inspect it manually
- Widely used in the exploratory analysis stage, when a data scientist wants to understand the data better
- Also used as a preprocessing step for supervised learning. Training on the new representation produced by an unsupervised method can improve supervised accuracy and can save memory and time
- Preprocessing methods: preprocessing and scaling are often used with supervised algorithms, but scaling methods use no label information, so they are unsupervised
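The point that scaling is itself unsupervised can be sketched as follows: the scaler is fitted on features only, with no target passed in. The tiny arrays and variable names here are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices; no labels are involved anywhere.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# fit() learns only the per-feature mean and standard deviation.
scaler = StandardScaler().fit(X_train)

# The same learned statistics are applied to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

Fitting on the training data and reusing the fitted scaler on the test data keeps both sets in the same coordinate system.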
k-means clustering
- The simplest and most widely used clustering algorithm
- Finds cluster centers that represent particular regions of the data
Agglomerative clustering
- A family of clustering algorithms
- Merges the most similar clusters until a stopping criterion is met
DBSCAN
- Separates clusters based on the density of the region where the data lies
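A rough side-by-side of the three algorithms listed above, run on the same synthetic data. The blob centers, `eps`, and the other parameter values are assumptions picked so that the clusters are easy to separate; real data will usually need tuning.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three compact, well-separated blobs (synthetic data for illustration).
centers = [[0, 0], [10, 10], [-10, 10]]
X, y_true = make_blobs(n_samples=150, centers=centers,
                       cluster_std=0.5, random_state=0)

# k-means and agglomerative clustering need the number of clusters up front.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ag_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# DBSCAN infers the number of clusters from density; -1 marks noise.
db_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)
n_db_clusters = len(set(db_labels) - {-1})

print(adjusted_rand_score(y_true, km_labels))
print(adjusted_rand_score(y_true, ag_labels))
print(n_db_clusters)
```

The true blob labels are used here only to check the result; none of the three algorithms sees them.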
k-means measures the similarity between data points by the distance to each cluster center
Given k clusters, a data point in the vector space is assigned to the cluster whose center is nearest
Different clusters should lie at least some distance apart so that they can be told apart clearly
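The assignment rule described above can be sketched in a few lines of NumPy. This shows a single assignment step followed by one center update, not a full k-means implementation; the points and starting centers are made-up values.

```python
import numpy as np

# Four hypothetical 2D points and two hypothetical cluster centers (k = 2).
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centers = np.array([[1.0, 1.5], [8.5, 9.0]])

# Euclidean distance from every point to every center, shape (4, 2).
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Each point goes to the cluster whose center is nearest.
labels = dists.argmin(axis=1)

# Update step: each center moves to the mean of its assigned points.
new_centers = np.array([points[labels == j].mean(axis=0)
                        for j in range(len(centers))])

print(labels)
print(new_centers)
```

k-means repeats these two steps until the assignments stop changing.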
Model performance depends on k, the number of clusters the data is split into
In general a larger k improves the model's accuracy, but if k grows too large there are too many groups to choose from and the analysis loses its value
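One common way to look at this trade-off is to fit k-means for several values of k and compare the inertia (the within-cluster sum of squared distances). The sketch below uses synthetic data with four blobs; the data, the range of k, and the use of inertia as the criterion are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers,
                  cluster_std=1.0, random_state=0)

# Fit k-means for k = 1..8 and record the inertia of each fit.
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Inertia keeps shrinking as k grows, so the useful signal is where the improvement levels off rather than where the value is smallest.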
Setup
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
# Prepare the data
# Load the Wholesale customers dataset (source: UCI ML Repository)
# https://archive.ics.uci.edu/ml/datasets/wholesale+customers
# Annual spending of a wholesale distributor's clients in each product category
uci_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/\
00292/Wholesale%20customers%20data.csv'
df = pd.read_csv(uci_path, header=0)
1) FRESH: annual spending (m.u.) on fresh products (Continuous);
2) MILK: annual spending (m.u.) on milk products (Continuous);
3) GROCERY: annual spending (m.u.) on grocery products (Continuous);
4) FROZEN: annual spending (m.u.) on frozen products (Continuous);
5) DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous);
6) DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous);
7) CHANNEL: customers' channel - Horeca (Hotel/Restaurant/Café) or Retail channel (Nominal)
8) REGION: customers' region - Lisbon, Oporto or Other (Nominal)
X = df.iloc[:,:]
# Normalize the data (standard scaling)
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
from sklearn import cluster
kmeans = cluster.KMeans(n_clusters=5, random_state=7)
# Train the model
kmeans.fit(X)
cluster_label = kmeans.labels_ # cluster labels 0 to 4
# Visualize the clustered data
df['Cluster'] = cluster_label
df
# Relationship between channel and region
# 2 Cluster Channel 2 (Retail Channel), Region 1,2,3
# 1 Cluster Channel 1, Region 3
# 0 Cluster Channel 1, Region 1,2
df.plot(kind='scatter',x='Channel', y='Region',c='Cluster',cmap='Set1', figsize=(10,10))
# Milk vs. Fresh spending, colored by cluster
df.plot(kind='scatter',x='Milk', y='Fresh',c='Cluster',cmap='Set1', figsize=(10,10))
# Frozen vs. Detergents_Paper spending, colored by cluster
df.plot(kind='scatter',x='Frozen', y='Detergents_Paper',c='Cluster',cmap='Set1', figsize=(10,10))
# To look more closely at clusters 2 and 3 only
mask = (df['Cluster']==2) | (df['Cluster']==3)
ndf = df[mask]
ndf.Cluster.unique()
ndf.plot(kind='scatter',x='Frozen', y='Detergents_Paper',c='Cluster',cmap='Set1', figsize=(10,10))
# To look more closely at clusters 0, 1, and 4 (everything except 2 and 3)
mask = (df['Cluster']==2) | (df['Cluster']==3)
ndf = df[~mask]
ndf.Cluster.unique()
ndf.plot(kind='scatter',x='Frozen', y='Detergents_Paper',c='Cluster',cmap='Set1', figsize=(10,10))
# Scatter matrix of the DataFrame, colored by df['Cluster']
pd.plotting.scatter_matrix(df, c=df['Cluster'], figsize=(20,20), marker='o', hist_kwds={'bins':20}, s=60, alpha=0.8)
plt.tight_layout()
import numpy as np
plt.imshow([np.unique(df['Cluster'])])
plt.show()
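Alongside the plots, a numeric profile of each cluster can make the groups easier to interpret. The sketch below uses a small made-up table with two of the spending columns rather than the downloaded dataset, so every value in it is hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical spending data with two obvious customer groups.
toy = pd.DataFrame({
    'Fresh': [100, 120, 110, 900, 950, 1000],
    'Milk': [800, 850, 900, 90, 100, 110],
})

# Scale, cluster, and attach the labels as a new column.
X_toy = StandardScaler().fit_transform(toy)
toy['Cluster'] = KMeans(n_clusters=2, n_init=10,
                        random_state=0).fit_predict(X_toy)

# Size of each cluster and the mean of each feature per cluster.
sizes = toy['Cluster'].value_counts().sort_index()
means = toy.groupby('Cluster').mean()

print(sizes)
print(means)
```

Reading the per-cluster means side by side shows which features separate the groups, which is one form of the manual inspection that unsupervised results usually need.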
- Using the public 학교알리미 (School Info) data, cluster Seoul middle schools with similar attributes, based on the dataset of graduates' career paths and its high school admission rates
- Visualize the clustering result on a map
Setup
# Import libraries
import pandas as pd
import folium
# Prepare the data
df = pd.read_excel('/content/2016_middle_shcool_graduates_report.xlsx', index_col=0, header=0, engine='openpyxl') # engine used to open the Excel file
df.head()
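# Visualize middle school locations on a map
# Note: recent folium releases may no longer bundle the 'Stamen Terrain' tiles; if this fails, try another tile set such as 'OpenStreetMap'
mschool_map = folium.Map(location=[37.55, 126.98], tiles='Stamen Terrain', zoom_start=12)
# Mark each school with a CircleMarker and show the school name in the popup
# Columns needed: 위도 (latitude), 경도 (longitude), 학교명 (school name)
for name, lat, lng in zip(df.학교명, df.위도, df.경도):
    folium.CircleMarker([lat, lng], # latitude, longitude
                        radius=5, # radius
                        color='brown', # outline color
                        fill=True, fill_color='coral', # fill color
                        fill_opacity=0.7, # opacity
                        popup='<pre>'+name+'</pre>').add_to(mschool_map)
mschool_map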
Next, apply clustering to this map
# Encoding: label encoding (convert categories to integers); one-hot encoding is a possible improvement
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
l_location = label_encoder.fit_transform(df['지역']) # region: 25 values -> 0 to 24
l_code = label_encoder.fit_transform(df['코드']) # code: array([3,5,9]) -> 0,1,2 (relabeled to a compact range)
l_type = label_encoder.fit_transform(df['유형']) # school type: ['국립', '공립', '사립'] (national, public, private) -> 0,1,2
l_day = label_encoder.fit_transform(df['주야']) # day/evening: ['주간'] (daytime) -> 0
# Add as new columns to df (the originals are not replaced)
df['location'] = l_location
df['code'] = l_code
df['type'] = l_type
df['day'] = l_day
df.head()
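The one-hot alternative mentioned in the encoding comment could look roughly like this. The column name `school_type` and its values are hypothetical stand-ins for the categorical columns in the school dataset. Label encoding assigns integers whose order and spacing carry no meaning, which can mislead distance-based methods; one-hot encoding avoids that at the cost of more columns.

```python
import pandas as pd

# Hypothetical categorical column standing in for the school-type column.
toy = pd.DataFrame({'school_type': ['public', 'private', 'national', 'public']})

# One 0/1 indicator column per category.
one_hot = pd.get_dummies(toy['school_type'], prefix='type', dtype=int)

# Keep the original column and append the indicators.
toy_encoded = pd.concat([toy, one_hot], axis=1)

print(toy_encoded)
```

Each row has exactly one indicator set to 1, so no category is treated as numerically closer to another.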
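# Select the attributes used for clustering: admission rates to science high schools, foreign-language/international high schools, physical-education high schools, and autonomous private high schools
# Selected by column index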
columns_list = [9, 10, 11, 13]
X = df.iloc[:, columns_list] #[row_index, columns_index]
X.head()
# Normalize (standard scaling)
X = preprocessing.StandardScaler().fit(X).transform(X)
X
# Set up the DBSCAN model
from sklearn import cluster
# A point with at least min_samples points within distance eps becomes part of a cluster
# eps: Euclidean distance measured from a core point (the neighborhood radius)
# min_samples: number of samples required in the neighborhood for a point to count as a core point
dbm = cluster.DBSCAN(eps=0.2, min_samples=5)
dbm.fit(X)
import numpy as np
cluster_label = dbm.labels_ # a label of -1 in dbm.labels_ marks noise
# Add the predicted clusters to the DataFrame
df['Cluster'] = cluster_label
df.head(10)
# Number of data points in each cluster
df['Cluster'].value_counts()
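How many clusters and how many noise points DBSCAN reports depends heavily on `eps`. The sketch below illustrates this on random synthetic data, not on the school dataset; the three `eps` values are arbitrary choices meant to show the two extremes and one value in between.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# 200 synthetic 2D points drawn from a standard normal distribution.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Record (number of clusters, number of noise points) for each eps.
summary = {}
for eps in [0.001, 0.3, 10.0]:
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    n_clusters = len(set(labels) - {-1})   # -1 is the noise label
    n_noise = int(np.sum(labels == -1))
    summary[eps] = (n_clusters, n_noise)
    print(eps, n_clusters, n_noise)
```

A very small `eps` leaves every point as noise, a very large one merges everything into a single cluster, and useful settings lie somewhere between.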
# Group by cluster label and print the first 5 rows of each group
group_cols = [0, 1, 3] + columns_list # 지역 (region), 학교명 (school name), 유형 (type) + clustering columns
# groupby yields (key, value) pairs
grouped = df.groupby('Cluster')
# grouped['Cluster'].value_counts()
# index is the key, group is the value
for index, group in grouped:
    print('Cluster : ', index)
    print(' * len : ', len(group))
    print(group.iloc[:,group_cols].head())
    print('\n')
# Visualize middle school locations on a map, colored by cluster
cluster_map = folium.Map(location=[37.55, 126.98], tiles='Stamen Terrain', zoom_start=12)
colors = {-1:'gray', 0:'coral', 1:'blue', 2:'green', 3:'red', 4:'purple', 5:'orange'}
# Mark each school with a CircleMarker and show the school name in the popup
# Columns needed: 위도 (latitude), 경도 (longitude), 학교명 (school name), Cluster
for name, lat, lng, clus in zip(df.학교명, df.위도, df.경도, df.Cluster):
    folium.CircleMarker([lat, lng], # latitude, longitude
                        radius=5, # radius
                        color=colors[clus], # outline color
                        fill=True, fill_color=colors[clus], # fill color
                        fill_opacity=0.7, # opacity
                        popup='<pre>'+name+'</pre>').add_to(cluster_map)
cluster_map
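Unsupervised learning: summary
- Covered a range of unsupervised learning algorithms that can be used for exploratory data analysis and data preprocessing
- Representing the data well is essential for applying both supervised and unsupervised learning successfully
- Preprocessing and decomposition methods are a very important part of data preparation
- Even in supervised learning, data exploration tools are important for understanding the properties of the data
- When no label information is available, unsupervised learning is often the only way to analyze the data
- It helps to practice applying clustering and decomposition algorithms yourself on 2D example data and on the real datasets in scikit-learn, such as digits, iris, and cancer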
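As one possible starting point for that practice, the sketch below clusters the iris data with k-means and compares the result with the known species. The choice of k-means, k = 3, standard scaling, and the adjusted Rand index as the comparison score are all assumptions for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Scale the four iris features; the species labels are held back.
iris = load_iris()
X = StandardScaler().fit_transform(iris.data)

# Cluster into three groups without using the labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Compare with the true species only afterwards, as a sanity check.
ari = adjusted_rand_score(iris.target, labels)
print(round(ari, 2))
```

A score of 1.0 would mean the clusters match the species exactly and a score near 0 would mean the grouping is no better than chance. This kind of check is only possible because iris happens to come with labels.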