Learning Clustering Methods (1): The DBSCAN Algorithm and an Example

呆萌的代Ma · Published 2021-03-22 10:15:11

Introduction to DBSCAN

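DBSCAN (Density-Based Spatial Clustering of Applications with Noise) avoids a limitation of K-means and similar algorithms, whose clusters tend to come out convex and roughly elliptical. Its idea is to measure the density around a point by counting the neighbors that fall inside the point's neighborhood. Whenever the density of points in a region exceeds a threshold, those points are added to the nearby cluster and treated as belonging to the same class.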
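The density test described above can be sketched in a few lines of NumPy. This is only an illustration of the neighbor-counting step, not the full algorithm; the helper name `find_core_points` and the toy coordinates are assumptions made for this example. The count includes the point itself, which matches how scikit-learn documents `min_samples`.

```python
import numpy as np


def find_core_points(points, eps, min_samples):
    """Return a boolean mask that is True for points dense enough to be core points."""
    # Pairwise Euclidean distances, shape (n_points, n_points)
    diffs = points[:, None, :] - points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    # Number of points within eps of each point (the point itself is counted)
    neighbor_counts = (distances <= eps).sum(axis=1)
    return neighbor_counts >= min_samples


# Four points packed closely together plus one isolated point
points = np.array([[0.0, 0.0],
                   [0.1, 0.0],
                   [0.0, 0.1],
                   [0.1, 0.1],
                   [5.0, 5.0]])
is_core = find_core_points(points, eps=0.2, min_samples=3)
print(is_core)  # [ True  True  True  True False]
```

The four packed points each see four points within `eps`, so they pass the threshold; the isolated point sees only itself and does not.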

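Advantages and disadvantages

Advantages:

The number of clusters does not have to be specified, and clusters can take any shape
Robust to noise: points in low-density regions are labeled as noise instead of being forced into a cluster
The result is almost independent of the order in which points are visited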
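The first two advantages can be checked on a small synthetic dataset. The sketch below is an illustration under assumed data: two concentric rings (a shape that is not convex) plus one far-away point, with the hypothetical helper `ring` and the values `eps=0.3` and `min_samples=3` chosen for this toy case only.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def ring(radius, n_points):
    """Return n_points evenly spaced on a circle of the given radius."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return np.column_stack((radius * np.cos(angles), radius * np.sin(angles)))


inner = ring(1.0, 100)
outer = ring(3.0, 100)
outlier = np.array([[10.0, 10.0]])
points = np.vstack((inner, outer, outlier))

# No cluster count is passed in; DBSCAN derives it from the density structure
labels = DBSCAN(eps=0.3, min_samples=3).fit_predict(points)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
```

Each ring comes out as its own cluster, and the isolated point receives the label -1, which is how scikit-learn marks noise.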

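Disadvantages:

Clustering quality depends on the choice of distance metric, and the algorithm is subject to the curse of dimensionality
Data that is spread thinly across its dimensions (for example, a sparse matrix) does not cluster well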
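One way to see why high dimensionality hurts a distance-based method is to compare the nearest and farthest pairwise distances as the number of dimensions grows. The sketch below uses uniformly random points and a made-up helper, `relative_contrast`; it assumes SciPy is installed (it is a dependency of scikit-learn). The measure itself, (max - min) / min over all pairwise distances, is just one convenient illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)


def relative_contrast(n_dims, n_points=200):
    """(max - min) / min over all pairwise Euclidean distances of random points."""
    points = rng.random((n_points, n_dims))
    distances = pdist(points)
    return (distances.max() - distances.min()) / distances.min()


contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, value in contrast.items():
    print(f"{d:>4} dims: relative contrast = {value:.2f}")
```

As the contrast shrinks, almost every pair of points is roughly the same distance apart, so a single `eps` threshold separates dense regions from sparse ones less and less well.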

DBSCAN example

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt


def DBSCAN_display(epsilon, minimum_samples, data):
    """Fit DBSCAN on `data` and plot the result using the first two features.

    - epsilon: maximum distance between two samples for them to be considered
      as in the same neighbourhood (passed to DBSCAN as eps).
    - minimum_samples: number of samples in a neighbourhood, the point itself
      included, for a point to be considered a core point (min_samples).
    - data: the dataset, an array of shape (n_samples, n_features).
    """
    db = DBSCAN(eps=epsilon, min_samples=minimum_samples).fit(data)
    # Boolean mask with one entry per sample, initially all False
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)

    # Mark core samples as True; border points and noise stay False
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_

    # Number of clusters in labels, ignoring noise if present
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

    # Collect the distinct labels; the label -1 marks noise and is drawn in black below.
    unique_labels = set(labels)

    # Create colors for the clusters.
    colors = plt.get_cmap('Spectral')(np.linspace(0, 1, len(unique_labels)))

    # Plot the points with colors
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = 'k'

        class_member_mask = (labels == k)

        # Plot the core samples of this label with large markers
        xy = data[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0],
                 xy[:, 1],
                 'o',
                 markerfacecolor=col,
                 markeredgecolor='k',
                 markersize=14)

        # Plot the non-core samples of this label (border points, or noise when k == -1) with small markers
        xy = data[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0],
                 xy[:, 1],
                 'o',
                 markerfacecolor=col,
                 markeredgecolor='k',
                 markersize=6)

    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.show()


if __name__ == '__main__':
    X = np.random.random((1000, 4))
    X = StandardScaler().fit_transform(X)
    DBSCAN_display(0.6, 6, X)
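
The values 0.6 and 6 above are hand-picked. A common heuristic for choosing `eps`, sketched here as an assumption rather than as part of the original example, is to look at each point's distance to its `min_samples`-th nearest neighbor (the point itself counted) and pick a value near where the sorted curve bends sharply. The seed and the printed percentiles are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Same kind of data as the example: 1000 uniform random points in 4 dimensions
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.random((1000, 4)))

min_samples = 6
# Querying the training data returns each point as its own nearest neighbor,
# so the last column is the distance to the min_samples-th point, self included
neighbors = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = neighbors.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# A point is a core point for a given eps when its k-distance is <= eps
for q in (50, 90, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_distances, q):.3f}")
print("share of core points at eps=0.6:", np.mean(k_distances <= 0.6))
```

Plotting `k_distances` with `plt.plot` gives the usual k-distance curve. On uniformly random data like this there may be no clear bend, which is itself a hint that the data has no strong cluster structure.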