[Clustering] DBSCAN

안암동컴맹 · February 26, 2024

DBSCAN

Introduction

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a popular clustering algorithm that is notable for its ability to find clusters of varying shapes and sizes in a dataset, while effectively identifying outliers. Introduced by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996, DBSCAN stands out for its minimal need for domain knowledge to determine the input parameters, making it widely applicable across different fields.

Background and Theory

DBSCAN is based on the concepts of density reachability and density connectivity. It classifies data points into three types — core points, border points, and noise points — based on the density of their local neighborhoods. The main idea is that for a point to be considered part of a cluster, its neighborhood of a given radius $\epsilon$ must contain at least a minimum number of points ($\text{MinPts}$), indicating a dense region.

Epsilon Neighborhood

For a point $p$, the $\epsilon$-neighborhood is defined as the set of points that are within distance $\epsilon$ from $p$, including $p$ itself. This can be mathematically represented as:

$$N_{\epsilon}(p) = \{\, q \in D \mid \text{dist}(p, q) \leq \epsilon \,\}$$

where $D$ is the dataset and $\text{dist}(p, q)$ is the distance between points $p$ and $q$, often calculated using the Euclidean distance for spatial data.
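
As a concrete illustration, the $\epsilon$-neighborhood of a single point can be computed with a brute-force distance check. The snippet below is a minimal NumPy sketch; the function name epsilon_neighborhood and the toy data are illustrative only, not part of any library:

import numpy as np

def epsilon_neighborhood(X, i, epsilon):
    # Indices of all points within distance epsilon of X[i] (X[i] itself included)
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.where(dists <= epsilon)[0]

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [3.0, 3.0]])
print(epsilon_neighborhood(X, 0, epsilon=0.15))  # [0 1] under Euclidean distance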

Core, Border, and Noise Points

  • Core Point: A point $p$ is a core point if the size of its $\epsilon$-neighborhood is at least a specified threshold $\text{MinPts}$:
    $|N_{\epsilon}(p)| \geq \text{MinPts}$
  • Border Point: A point $p$ is a border point if it is not a core point but falls within the $\epsilon$-neighborhood of a core point.
  • Noise Point: A point is considered noise if it is neither a core point nor a border point.
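
These three categories follow directly from the neighborhood counts. The helper below is a hypothetical sketch (assuming Euclidean distance) that labels every point of a dataset at once:

import numpy as np

def classify_points(X, epsilon, min_pts):
    # Pairwise Euclidean distances and neighborhood membership (self included)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = dists <= epsilon
    # Core: |N_eps(p)| >= MinPts
    is_core = neighbors.sum(axis=1) >= min_pts
    # Border: not core, but inside the eps-neighborhood of some core point
    is_border = ~is_core & (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, 'core', np.where(is_border, 'border', 'noise'))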

Density Reachable and Density Connected

  • Density Reachable: A point $p$ is density-reachable from a point $q$ if there is a chain of points $p_1, p_2, \dots, p_n$ such that $p_1 = q$, $p_n = p$, and each $p_{i+1}$ is within distance $\epsilon$ of $p_i$, with $p_i$ being a core point for all $i < n$. This ensures that each step of the chain passes through a dense area.
  • Density Connected: Two points $p$ and $q$ are density-connected if there exists a point $o$ such that both $p$ and $q$ are density-reachable from $o$.

Mathematical Formulation

Initialization

  • Let $C = 0$ be the initial count of clusters.
  • Mark all points in the dataset $D$ as unvisited.

Algorithm Steps

  • For each point $p$ in dataset $D$:
    • If $p$ is visited, skip to the next point.
    • Mark $p$ as visited.
    • Collect $p$'s $\epsilon$-neighborhood, $N_{\epsilon}(p)$.
    • If $|N_{\epsilon}(p)| < \text{MinPts}$, mark $p$ as noise (later, $p$ might be found in a dense neighborhood and thus become a border point).
    • Otherwise, increment $C$ to create a new cluster, and expand it by recursively adding points density-reachable from $p$.

Cluster Expansion

  • For a new core point $p$, all points in $N_{\epsilon}(p)$ are added to cluster $C$. For each point $p'$ in $N_{\epsilon}(p)$:
    • If $p'$ is unvisited, mark it as visited and retrieve its $\epsilon$-neighborhood. If $p'$ is a core point, recursively apply the expansion to $p'$.
    • If $p'$ is not yet a member of any cluster, add $p'$ to cluster $C$.

This recursive process of expanding the cluster $C$ through density-reachable points ensures that all points within a dense region are discovered and assigned to the same cluster, while noise remains outside of any cluster.
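
Putting the steps above together, a minimal, non-optimized DBSCAN fits in a few dozen lines. The sketch below is illustrative only (not taken from any particular library) and replaces the literal recursion with an iterative seed list, which is the usual way the expansion is implemented:

import numpy as np

def dbscan(X, epsilon, min_pts):
    n = len(X)
    labels = np.full(n, -1)                  # -1 marks noise / unassigned
    visited = np.zeros(n, dtype=bool)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    C = -1                                   # current cluster id

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        neighbors = list(np.where(dists[p] <= epsilon)[0])
        if len(neighbors) < min_pts:
            continue                         # p stays noise unless reached later
        C += 1                               # start a new cluster
        labels[p] = C
        while neighbors:                     # cluster expansion via a seed list
            q = neighbors.pop()
            if not visited[q]:
                visited[q] = True
                q_neighbors = np.where(dists[q] <= epsilon)[0]
                if len(q_neighbors) >= min_pts:   # q is itself a core point
                    neighbors.extend(q_neighbors)
            if labels[q] == -1:              # not yet a member of any cluster
                labels[q] = C
    return labels

In practice the full pairwise distance matrix makes this sketch O(n²) in time and memory; real implementations answer the neighborhood queries with a spatial index such as a k-d tree or R*-tree.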

Challenges and Considerations

  • Distance Metric: The choice of distance metric $\text{dist}(p, q)$ can significantly impact the clustering results. While Euclidean distance is common, other metrics such as Manhattan or Minkowski may be more appropriate for certain datasets.
  • Parameter Selection: The selection of $\epsilon$ and $\text{MinPts}$ is critical. An $\epsilon$ that is too large may merge separate clusters, while one that is too small may leave many points classified as noise. Similarly, $\text{MinPts}$ determines the density threshold for core points, influencing the granularity of the clustering (see the sketch after this list).
  • Handling High-Dimensional Data: As dimensionality increases, the concept of "density" becomes less intuitive due to the curse of dimensionality. Points tend to become nearly equidistant from one another, making the selection of $\epsilon$ challenging.
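
To make the parameter sensitivity concrete, one can sweep $\epsilon$ on a toy dataset and watch the cluster count and noise fraction change. The sketch below uses scikit-learn's DBSCAN purely for illustration; the exact numbers depend on the data:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

for eps in (0.05, 0.2, 0.5):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # -1 is noise
    print(f"eps={eps}: {n_clusters} clusters, {np.mean(labels == -1):.1%} noise")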

DBSCAN's mathematical framework allows it to adaptively cluster data based on local density conditions, making it robust against noise and capable of identifying clusters of arbitrary shapes. However, the algorithm's effectiveness is contingent on the appropriate selection of $\epsilon$ and $\text{MinPts}$, which may require domain expertise or adaptive techniques for parameter estimation.

Implementation

Parameters

  • epsilon: float, default = 0.1
    Radius of a neighborhood hypersphere
  • min_points: int, default = 5
    Minimum required points to form a cluster
  • metric: Literal['euclidean', 'minkowski'], default = 'euclidean'
    Distance metric for measuring distances between points

Examples

Test with circles dataset and moons dataset:

from luma.clustering.density import DBSCAN
from luma.visual.evaluation import ClusterPlot

from sklearn.datasets import make_circles, make_moons
import matplotlib.pyplot as plt


# Two toy datasets: concentric circles and interleaved half-moons
X1, y1 = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=42)
X2, y2 = make_moons(n_samples=500, noise=0.05, random_state=42)

# One DBSCAN model per dataset, with the same neighborhood settings
db1 = DBSCAN(epsilon=0.2,
             min_points=5,
             metric='euclidean')

db2 = DBSCAN(epsilon=0.2,
             min_points=5,
             metric='euclidean')

db1.fit(X1)
db2.fit(X2)

# Plot both clustering results side by side
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)

clst1 = ClusterPlot(db1, X1, cmap='rainbow')
clst1.plot(ax=ax1)

clst2 = ClusterPlot(db2, X2, cmap='summer')
clst2.plot(ax=ax2, show=True)

Applications

DBSCAN's ability to handle noise and find clusters of arbitrary shapes makes it useful in various applications:

  • Geospatial Data Analysis: Identifying regions of similar land use in earth observation datasets.
  • Anomaly Detection: Isolating fraudulent or unusual transactions in financial systems.
  • Biological Data Analysis: Grouping genes or proteins with similar expression levels or functions.
  • Image Segmentation: Separating different objects in an image based on pixel density.

Strengths and Limitations

Strengths

  • No Assumption on Cluster Shapes: Unlike k-means, DBSCAN does not assume clusters to be of any specific shape.
  • Robust to Outliers: Effectively differentiates between noise points and points belonging to a cluster.
  • Minimal Input Parameters: Requires only two parameters, $\epsilon$ and $\text{MinPts}$, which have intuitive physical interpretations.

Limitations

  • Parameter Sensitivity: The choice of $\epsilon$ and $\text{MinPts}$ can significantly affect the outcome, and finding suitable values can be non-trivial without domain knowledge.
  • Density Variation: Struggles with datasets where clusters have widely varying densities.
  • High-Dimensional Data: The effectiveness decreases in high-dimensional spaces due to the curse of dimensionality.

Advanced Topics

  • Parameter Selection Techniques: Research into methods for automatically selecting optimal values for $\epsilon$ and $\text{MinPts}$, such as using k-distance plots (a minimal sketch follows this list).
  • Scaling DBSCAN: Efforts to improve DBSCAN's scalability and efficiency, making it suitable for large datasets and real-time applications.
  • Extensions and Variants: Development of DBSCAN variants designed to handle specific challenges, such as varying densities (e.g., OPTICS) or high-dimensional data.
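
As a sketch of the k-distance heuristic mentioned above: sort every point's distance to its $k$-th nearest neighbor (with $k = \text{MinPts}$) and look for the "elbow" of the resulting curve, which suggests a candidate $\epsilon$. The code below assumes scikit-learn and matplotlib and is illustrative only:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
min_pts = 5

# kneighbors includes each query point itself, so column -1 is the radius of a
# ball containing MinPts points (consistent with |N_eps(p)| counting p itself)
dists, _ = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])

plt.plot(k_dist)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {min_pts}-th nearest neighbor')
plt.title('k-distance plot: the elbow suggests a candidate epsilon')
plt.show()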

References

  1. Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." In KDD, vol. 96, no. 34, pp. 226-231. 1996.