K-means clustering

ChangSeong Yooยท2023๋…„ 7์›” 25์ผ
0

Machine Learning

๋ชฉ๋ก ๋ณด๊ธฐ
5/10
post-thumbnail

๐Ÿ“์ด ํฌ์ŠคํŠธ๋Š” K-means clustering ์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.




์ •์˜

  • K-ํ‰๊ท  ํด๋Ÿฌ์Šคํ„ฐ๋ง์€ ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ๋ฅผ k๊ฐœ์˜ ๊ตฐ์ง‘(= class = cluster)๋กœ ๋ฌถ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋œปํ•ฉ๋‹ˆ๋‹ค.

  • K-ํ‰๊ท  ๊ตฐ์ง‘ํ™”๋Š” ์†์‹คํ•จ์ˆ˜ ๊ฐ’์ด ์ตœ์†Œํ™”๋  ๋•Œ๊นŒ์ง€ ๊ตฐ์ง‘์˜ ์ค‘์‹ฌ(Centroid)๊ณผ ๊ฐ ๋ฐ์ดํ„ฐ๊ฐ€ ํฌํ•จ๋  ๊ตฐ์ง‘์„ ๋ฐ˜๋ณตํ•ด์„œ ๊ณ„์‚ฐํ•œ๋‹ค.

  • ์ด K-ํ‰๊ท  ํด๋Ÿฌ์Šคํ„ฐ๋ง์€

    J=โˆ‘k=1Kโˆ‘iโˆˆCkd(xi,ฮผk)J = \sum^{K}_{k=1} \sum^{}_{i \in C_k}d(x_i, \mu_k)

    KK : ๊ตฐ์ง‘์˜ ๊ฐœ์ˆ˜
    CkC_k : kk ๋ฒˆ์งธ ๊ตฐ์ง‘์— ์†ํ•˜๋Š” ๋ฐ์ดํ„ฐ๋“ค์˜ ์ง‘ํ•ฉ
    ฮผk\mu_k : kk ๋ฒˆ์งธ ๊ตฐ์ง‘์˜ ๋ฌด๊ฒŒ์ค‘์‹ฌ(centroid)
    d(xi,ฮผk)d(x_i, \mu_k) : xix_i ์™€ ฮผk\mu_k ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋ฅผ ๋œปํ•œ๋‹ค. ๊ฑฐ๋ฆฌ๋ฅผ ๊ตฌํ•˜๋Š” ๋ฐฉ๋ฒ•์€ euclidean distance๋ฅผ ์ด์šฉํ•œ๋‹ค.




์›๋ฆฌ

  1. ๋ฌด๊ฒŒ์ค‘์‹ฌ์„ ๋žœ๋คํ•˜๊ฒŒ KK ๊ฐœ ์ƒ์„ฑํ•œ๋‹ค.
  2. ๋ชจ๋“  ๋ฐ์ดํ„ฐ xi(i=1,โ‹ฏโ€‰,N)x_i(i=1, \cdots, N) ์—์„œ ๊ฐ๊ฐ์˜ ๋ฌด๊ฒŒ์ค‘์‹ฌ ฮผk\mu_k๊นŒ์ง€์˜ ๊ฑฐ๋ฆฌ๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.
  3. ๊ฐ ๋ฐ์ดํ„ฐ๋ณ„๋กœ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๋ฌด๊ฒŒ์ค‘์‹ฌ์„ ์„ ํƒํ•œ๋‹ค.
  4. ์„ ํƒํ•œ ๋ฌด๊ฒŒ์ค‘์‹ฌ์ด ์†ํ•˜๋Š” ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ์ •ํ•œ๋‹ค.
  5. ๊ฐ ํด๋Ÿฌ์Šคํ„ฐ์— ๋Œ€ํ•ด ๋ฌด๊ฒŒ์ค‘์‹ฌ ฮผk\mu_k๋ฅผ ๋‹ค์‹œ ๊ณ„์‚ฐํ•œ๋‹ค.
  6. ์ตœ์ ์˜ ๋ฌด๊ฒŒ์ค‘์‹ฌ์„ ์ฐพ์•„ ๋” ์ด์ƒ ๋ฌด๊ฒŒ์ค‘์‹ฌ์˜ ์œ„์น˜๊ฐ€ ๋ณ€ํ•˜์ง€ ์•Š์„ ๋•Œ ๊นŒ์ง€ '์ˆœ์„œ 1' ์„ ์ œ์™ธํ•œ ๋ชจ๋“  ์ˆœ์„œ๋ฅผ ๋ฐ˜๋ณตํ•œ๋‹ค.




์ฝ”๋“œ ์ดํ•ด

์ €๋Š” iris dataset์„ ์ด์šฉํ•˜์—ฌ K-ํ‰๊ท  ํด๋Ÿฌ์Šคํ„ฐ๋ง์„ ์ˆ˜ํ–‰ํ•˜์˜€์Šต๋‹ˆ๋‹ค.
โš ๏ธ๊ฐ ๋ผ์ธ๋ณ„๋กœ ์ฃผ์„์„ ๋‹ฌ์•„์„œ ๊ฐ ๋ผ์ธ์ด ๋ฌด์Šจ ์˜๋ฏธ์ธ์ง€ ์•Œ๋ ค๋“œ๋ฆฌ๊ฒ ์Šต๋‹ˆ๋‹ค.

  1. ๋ฌด๊ฒŒ์ค‘์‹ฌ์„ ๋žœ๋คํ•˜๊ฒŒ K๊ฐœ ์ƒ์„ฑํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ•œ๋‹ค.
def random_centroids(values, K):      # K : ๊ตฐ์ง‘ ๊ฐœ์ˆ˜
    centroids = []
    # K๋ฒˆ ๋ฐ˜๋ณตํ•˜์—ฌ values ๋ฐ์ดํ„ฐ์ค‘์— K๊ฐœ๋ฅผ ๋ฌด๊ฒŒ์ค‘์‹ฌ์œผ๋กœ ์„ ์–ธ
    for i in range(K):
        centroid = values[rand.randint(0, len(values)-1)]
        centroids.append(centroid)
    return centroids
  1. ๋ชจ๋“  ๋ฐ์ดํ„ฐ xi(i=1,โ‹ฏโ€‰,N)x_i(i=1, \cdots, N) ์—์„œ ๊ฐ๊ฐ์˜ ๋ฌด๊ฒŒ์ค‘์‹ฌ ฮผk\mu_k๊นŒ์ง€์˜ ๊ฑฐ๋ฆฌ๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.
  2. ๊ฐ ๋ฐ์ดํ„ฐ๋ณ„๋กœ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๋ฌด๊ฒŒ์ค‘์‹ฌ์„ ์„ ํƒํ•œ๋‹ค.
  3. ์„ ํƒํ•œ ๋ฌด๊ฒŒ์ค‘์‹ฌ์ด ์†ํ•˜๋Š” ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ์ •ํ•œ๋‹ค.

์ด 3๊ฐœ๋ฅผ ๋ฌถ์–ด ๊ฐ ๋ฐ์ดํ„ฐ๋ณ„๋กœ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๊ตฐ์ง‘์— ์†Œ์†์‹œํ‚ค๋Š” ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ•œ๋‹ค.

def assign_cluster(values, centroids):
    assignments = []    #cluster assignments
    # ๊ฐ ๋ฐ์ดํ„ฐ๋“ค
    for value in values:
        # ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ์™€ K๊ฐœ์˜ ๋ฌด๊ฒŒ์ค‘์‹ฌ๊ณผ์˜ ๊ฑฐ๋ฆฌ๋ฅผ ์ €์žฅํ•  ๋ณ€์ˆ˜
        dist_point_clust = []

        # ๊ฐ ๋ฌด๊ฒŒ์ค‘์‹ฌ๋“ค
        for centroid in centroids:
            # ๋žœ๋คํ•œ ๋ฌด๊ฒŒ์ค‘์‹ฌ๊ณผ ๊ฐ ๋ฐ์ดํ„ฐ ๊ฐ„์˜ ์ธ๋ฑ์Šค๋ณ„๋กœ ๋บ„์…ˆ์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ๋ฅผ ๊ตฌํ•œ๋‹ค.
            d_clust = np.linalg.norm(np.array(value) - np.array(centroid))

            dist_point_clust.append(d_clust)
        # ๊ฑฐ๋ฆฌ๊ฐ€ ์ตœ์†Œ์ธ ๊ฐ’์˜ ์ธ๋ฑ์Šค๋ฅผ ์ €์žฅ
        assignment = np.argmin(dist_point_clust)

        assignments.append(assignment)

    return assignments
  1. ๊ฐ ํด๋Ÿฌ์Šคํ„ฐ์— ๋Œ€ํ•ด ๋ฌด๊ฒŒ์ค‘์‹ฌ ฮผk\mu_k๋ฅผ ๋‹ค์‹œ ๊ณ„์‚ฐํ•œ๋‹ค.
def new_centroids(values, centroids, assignments, K):
                # values : ๋ฐ์ดํ„ฐ
                        # centroids : ๋ฌด๊ฒŒ์ค‘์‹ฌ๋“ค์ด ๋ชจ์ธ ๋ฆฌ์ŠคํŠธ
                                   # assignments : ๊ฐ ์  ๋ณ„๋กœ ์–ด๋Š ๊ตฐ์ง‘์— ์†ํ•˜๋Š”์ง€ ์•Œ ์ˆ˜ ์žˆ๋Š” ๋ฆฌ์ŠคํŠธ
                                                # K : ๊ตฐ์ง‘์˜ ๊ฐœ์ˆ˜
    # ์ƒˆ๋กญ๊ฒŒ ๊ณ„์‚ฐ๋  ๋ฌด๊ฒŒ์ค‘์‹ฌ์„ ์ถ”๊ฐ€ํ•  ๋ฆฌ์ŠคํŠธ
    new_centroids = []

    for i in range(K):
        # i_cluster :
        i_cluster = []

        for x in range(len(values)):
            # x๋ฒˆ์งธ ๋ฐ์ดํ„ฐ๊ฐ€ ์†ํ•œ ๊ตฐ์ง‘์ด i๋ฒˆ์งธ ๊ตฐ์ง‘๊ณผ ๊ฐ™๋‹ค๋ฉด
            if (assignments[x] == i):
                # i_cluster์— ์ถ”๊ฐ€
                i_cluster.append(values[x])  # append all single cluster points

        # ๊ฐ ํด๋Ÿฌ์Šคํ„ฐ์— ์†ํ•˜๋Š” ๋ฐ์ดํ„ฐ๋“ค๋ผ๋ฆฌ ํ‰๊ท 
        mean_cluster = np.mean(i_cluster, axis=0)    # mean value of the cluster points will serve as new centroid
        # ๊ณ„์‚ฐ๋œ ํ‰๊ท  ์ง€์ ์„ ์ƒˆ๋กœ์šด ๋ฌด๊ฒŒ์ค‘์‹ฌ์œผ๋กœ ์„ค์ •
        new_centroids.append(mean_cluster)

    return new_centroids
  1. ์†์‹คํ•จ์ˆ˜๋ฅผ ๊ตฌํ•œ๋‹ค. ์†์‹คํ•จ์ˆ˜๊ฐ€ ์–ด๋Š ์กฐ๊ฑด์ผ ๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณตํ•˜์—ฌ ๋ฌด๊ฒŒ์ค‘์‹ฌ์„ ์žฌ์กฐ์ • ํ•˜๊ธฐ ์œ„ํ•ด์„œ ์†์‹ค์„ ๊ตฌํ•œ๋‹ค.
def sse(values, assignments, centroids):
      # values : ๋ฐ์ดํ„ฐ
              # assignments : ๊ฐ ์  ๋ณ„๋กœ ์–ด๋Š ๊ตฐ์ง‘์— ์†ํ•˜๋Š”์ง€ ์•Œ ์ˆ˜ ์žˆ๋Š” ๋ฆฌ์ŠคํŠธ
                           # centroids : ๋ฌด๊ฒŒ์ค‘์‹ฌ๋“ค์ด ๋ชจ์ธ ๋ฆฌ์ŠคํŠธ
    errors = []

    for i in range(len(values)):
        #get assigned centroid for each point
        centroid = centroids[assignments[i]]

        #compute the distance (error) between one point and its closest centroid
        error = np.linalg.norm(np.array(values[i]) - np.array(centroid))

        #append squared error to the list of error
        errors.append(error**2)

    #and sum up all the errors
    sse = sum(errors)

    return sse

์ด ๊ณผ์ •์„ ๋‹ค ํฌ๊ด„ํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ์ •์˜ํ–ˆ๋‹ค. ๋‚˜์ค‘์— K-mean clustering์„ ์“ฐ๊ณ  ์‹ถ์„ ๋•Œ ๋‹จ ํ•œ ๋ฒˆ์˜ ํ•จ์ˆ˜ ํ˜ธ์ถœ์„ ํ•˜๊ธฐ ์œ„ํ•จ์ด๋‹ค.

def kmeans_clustering(values, K, max_iter = 100, tol = pow(10,-3) ):
    # iteration : ๋ฐ˜๋ณต ํšŸ์ˆ˜
    it = -1
    all_sse = []
    assignments = []

    # STEP 1. ์ดˆ๊ธฐ ๋žœ๋ค ๋ฌด๊ฒŒ์ค‘์‹ฌ ์ƒ์„ฑ
    centroids = random_centroids(values, K)


    # ์—๋Ÿฌ๊ฐ€ 1 ์ดํ•˜์ด๊ฑฐ๋‚˜ ์ตœ๋Œ€๋ฐ˜๋ณตํšŸ์ˆ˜ ์•„๋ž˜์˜ ๋ฐ˜๋ณต์ค‘์ด๋ฉด์„œ { (ํ˜„์žฌ ์†์‹ค - ์ง์ „ ๊ณผ๊ฑฐ ์†์‹ค) / ์ง์ „ ๊ณผ๊ฑฐ ์†์‹ค } ์ด 0.001 ์ด์ƒ์ด๋ฉด ๋ฐ˜๋ณต ์ง„ํ–‰
    while (len(all_sse)<=1 or (it < max_iter and np.absolute(all_sse[it] - all_sse[it-1])/all_sse[it-1] >= tol)):
        it += 1
        # STEP 2. ์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ ๊ตฌํ•˜๊ธฐ
        assignments = assign_cluster(values, centroids)

        #STEP 3. ๋ฌด๊ฒŒ์ค‘์‹ฌ ์žฌ์กฐ์ •
        centroids = new_centroids(values, centroids, assignments, K)

        #STEP 4. ์†์‹ค๋ฅ  ๊ตฌํ•˜๊ธฐ
        sse_kmeans = sse(values, assignments, centroids)
        all_sse.append(sse_kmeans)

        print('์—๋Ÿฌ์œจ : {}'.format(sse_kmeans))

    return (assignments, centroids, all_sse, it)

์ ์šฉ :

result = kmeans_clustering(values=values, K=4)
์—๋Ÿฌ์œจ : 76.01591907424665
์—๋Ÿฌ์œจ : 72.44054722344684
์—๋Ÿฌ์œจ : 71.33622242453569
์—๋Ÿฌ์œจ : 71.33622242453569

result ๋ณ€์ˆ˜๋Š” 4์ฐจ์› ๋ณ€์ˆ˜ ์ผ ๊ฒƒ์ด๋‹ค.
์™œ๋ƒํ•˜๋ฉด kmean_clustering ๋ฉ”์†Œ๋“œ๊ฐ€ 4๊ฐœ์˜ ๊ฒฐ๊ณผ๋ฅผ returnํ•˜๊ธฐ ๋–„๋ฌธ์— ๊ธธ์ด๊ฐ€ 4์ผ ๊ฒƒ์ด๋‹ค.

์ด์ œ ์‹œ๊ฐํ™”๋ฅผ ํ•˜์—ฌ ๊ฐ ๋ฐ์ดํ„ฐ๋“ค์ด ์–ด๋Š ๊ตฐ์ง‘์— ์†ํ•˜๋Š”์ง€ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

centroids_x = [result[1][x][0] for x in range(len(result[1]))] #sepal_length: [0]
centroids_y = [result[1][x][2] for x in range(len(result[1]))] #petal_length: [2]

x = data['sepal_length']
y = data['petal_length']
assignments = result[0]		# result[0] = assignments
							# reuslt[1] = centroids
                            # result[2] = all_sse
                            # result[3] = it
                            
plt.scatter(x, y, c=assignments)
plt.plot(centroids_x,centroids_y, c='white', marker='.', linewidth='0.01', markerfacecolor='red', markersize=22)

plt.title("K-means Visualization")
plt.xlabel("sepal_length")
plt.ylabel("petal_length")

๋ณด์‹œ๋Š” ๋ฐ”์™€ ๊ฐ™์ด kmean_clustering() ๋ฉ”์†Œ๋“œ์— K=4 ๋ผ๊ณ  ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ž…๋ ฅํ•˜์—ฌ centroid(๋นจ๊ฐ„์ƒ‰ ์ )์ด 4๊ฐœ์ด๊ณ  ๋…ธ๋ž€์ƒ‰, ๋‚จ์ƒ‰, ๋ณด๋ผ์ƒ‰, ์ดˆ๋ก์ƒ‰์ด ๊ฐ๊ฐ์˜ ๊ตฐ์ง‘์ž…๋‹ˆ๋‹ค.


๐Ÿ’ป ๋”ฐ๋ผ์„œ ํ•ด๋‹น ๋ฐ์ดํ„ฐ์˜ ๋ผ๋ฒจ์„ ๋ชจ๋ฅด๋Š” ์ƒํƒœ์—์„œ ๋ถ„๋ฅ˜๋ฅผ ํ•˜์—ฌ์•ผ ํ•  ๋•Œ ๋ฌด๊ฒŒ์ค‘์‹ฌ์ด๋ผ๋Š” ๊ฐœ๋…์„ ํ†ตํ•ด์„œ ๋ฌด๊ฒŒ์ค‘์‹ฌ๊ณผ์˜ ๊ฑฐ๋ฆฌ๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ํด๋ž˜์Šค(๊ตฐ์ง‘)์„ ๋งŒ๋“ค์–ด์ฃผ๋Š” ์—ญํ• ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
์ด๋Ÿด ๋•Œ K-means-clustering ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ด์šฉํ•˜์—ฌ ํด๋ž˜์Šค๋ณ„๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฌถ์–ด ์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

profile
๐’ฅ๐“Š๐“ƒ๐’พโ„ด๐“‡ ๐’Ÿ๐’ถ๐“‰๐’ถ ๐’ฎ๐’ธ๐’พโ„ฏ๐“ƒ๐“‰๐’พ๐“ˆ๐“‰

0๊ฐœ์˜ ๋Œ“๊ธ€