Concept

  • K-Means is an unsupervised learning algorithm that attempts to group similar data points into clusters.

  • Example problems
    - Cluster Customers based on Features
    - Market Segmentation
    - Identify similar physical groups

  • Goal: divide data into distinct groups such that observations within each group are similar

Process

It's an EM (Expectation Maximization) Method

  • Choose a number of Clusters “K”
  • Randomly assign each point to a cluster
  • Until the cluster assignments stop changing, repeat the following (a minimal code sketch follows this list):
    - Assign each data point to the cluster whose centroid is closest (Expectation)
    - For each cluster, recompute the centroid as the mean vector of the points assigned to it (Maximization)
    - If the assignments no longer change, or the model reaches the maximum number of iterations, finish training
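
As a rough illustration of the loop above, here is a minimal from-scratch sketch in NumPy. It initializes centroids by picking k random points (one common variant of the random initialization step); the function name `kmeans`, the `max_iters` parameter, and the toy data are assumptions made up for this example, not a definitive implementation.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick k random points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Expectation: assign each point to the nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Maximization: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids (and hence the clusters) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on random 2-D data.
X = np.random.default_rng(1).normal(size=(200, 2))
labels, centroids = kmeans(X, k=3)
```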

Choosing K

  • There is no set way to choose K
  • One way is elbow method
    - compute the sum of squared error (SSE) for some values of k (for example 2, 4, 6, 8, etc.).
    - SSE: the sum of the squared distances between each member of a cluster and its centroid, summed over all clusters
  • As k increases, SSE decreases
    - this is because with more clusters, each cluster is smaller, so each point lies closer to its centroid and the distortion shrinks
  • Choose the k at the "elbow": the point after which SSE stops decreasing sharply (see the sketch below)
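
A hedged sketch of the elbow method using scikit-learn, where `KMeans.inertia_` gives the SSE described above; the candidate k values and the random toy data are assumptions made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(300, 2))

sse = {}
for k in (2, 4, 6, 8, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_  # sum of squared distances to the closest centroid

for k, err in sse.items():
    print(k, round(err, 2))
# Pick the k where the curve bends (the "elbow"): the SSE keeps falling,
# but the improvement per extra cluster drops off sharply after that point.
```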

Pros & Cons

  • Cons
    - a poor random initial assignment can distort the final result
    - if the sizes of the clusters differ greatly, the algorithm might not work well
    - if the densities of the clusters differ greatly, the algorithm might not work well (illustrated in the sketch below)
    - if the clusters have irregular, non-spherical shapes, the algorithm might not work well
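
To illustrate the size and density cons, here is a small hedged experiment: two synthetic groups, one small and dense and one large and spread out. All data and parameters are assumptions made up for this example; on data like this, the purely distance-based boundary often splits the spread-out group instead of recovering the two true groups.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
dense = rng.normal(loc=(0.0, 0.0), scale=0.2, size=(500, 2))   # small, dense group
sparse = rng.normal(loc=(3.0, 0.0), scale=2.0, size=(100, 2))  # large, spread-out group
X = np.vstack([dense, sparse])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Count how the points of each true group were split across the two clusters;
# a clean result would put each group entirely in its own cluster.
print("dense group :", np.bincount(labels[:500], minlength=2))
print("sparse group:", np.bincount(labels[500:], minlength=2))
```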
