Python-混合高斯分布(GMM)的应用及K-S检验

内容简介：对几种经典的单分布模型与混合高斯分布模型进行简单介绍，随后用于交通数据分析，并进行K-S检验，讨论模型拟合情况。其中，单分布模型有：正态分布、Gamma分布、Weibull分布、Logistic分布；1 模型介绍1.1 单分布的概率密度函数介绍如下：1.2 混合高斯分布（GMM）模型介绍如下：GMM详细推导与介绍见：https://blog.csdn.net/jinping_shi/ar

你的陈某某

8352人浏览 · 2021-08-20 14:27:29

你的陈某某 · 2021-08-20 14:27:29 发布

内容简介：对几种经典的单分布模型与混合高斯分布模型进行简单介绍，随后用于交通数据分析，并进行K-S检验，讨论模型拟合情况。

其中，单分布模型有：正态分布、Gamma分布、Weibull分布、Logistic分布；

注意：没有绘制Gamma分布、Weibull分布、Logistic分布，有几个超参数，懒得调了，可以用matlab的数据分析组件进行绘图和分析。没有进行单分布K-S检验，肯定是通不过的，就没检验。文章内容主要是针对GMM进行拟合分析。

1 模型介绍

1.1 单分布的概率密度函数介绍如下：

注意：式子(1)有误，sigmoid不在根号下！！！
在这里插入图片描述

1.2 混合高斯分布（GMM）模型介绍如下：

在这里插入图片描述

GMM详细推导与介绍见：https://blog.csdn.net/jinping_shi/article/details/59613054

2 模型应用

数据集介绍：2017美国发生事故的总体情况，包括事故地点、时间、天气等N多种情况，比较全面。

注意：为简单起见，文章这里只对事故发生时间进行分析。

2.1 读取csv文件并统计事故时间

要求：对整个事故文件进行读取，并对事故时间进行统计（以hour为区间）

import pandas as pd

# 读取csv表格
df_0 = pd.read_csv('USaccident2017.csv')

# 统计事故时间与事故数量（hour为区间）
time = pd.DatetimeIndex(df_0["Start_Time"]).hour.value_counts().sort_index().index.tolist()
counts = pd.DatetimeIndex(df_0["Start_Time"]).hour.value_counts().sort_index().tolist()

print(sorted(list(pd.DatetimeIndex(df_0["Start_Time"]).hour)))
print(len(list(pd.DatetimeIndex(df_0["Start_Time"]).hour)))
print('time', time)
print('counts', counts)

2.2 直方分布图绘制

要求：统计2017一年中事故发生时间阶段，并绘制成直方分布图。

import pandas as pd
import matplotlib.pyplot as plt

# 读取csv表格
df_0 = pd.read_csv('USaccident2017.csv')

# 统计事故时间与事故数量（hour为区间），下面结果为列表形式
# time = pd.DatetimeIndex(df_0["Start_Time"]).hour.value_counts().sort_index().index.tolist()
# counts = pd.DatetimeIndex(df_0["Start_Time"]).hour.value_counts().sort_index().tolist()

# 数据中所有事故发生时间段汇集成的列表
hours = list(pd.DatetimeIndex(df_0["Start_Time"]).hour)


#  ################# 图的绘制 #########

# 自定义绘图字体
font1 = {'family': 'SimSun', 'weight': 'normal', 'size': 18}  # 自定义中文为“宋体”
font2 = {'family': 'SimSun', 'weight': 'normal', 'size': 12}  # 自定义中文为“宋体”

# 设置xtick和ytick的方向：in、out、inout
plt.rcParams['xtick.direction'] = 'in'
plt.rcParams['ytick.direction'] = 'in'
plt.rc('font', family='Times New Roman', size=18)  # 其余字体为“新罗马字体”

fig, ax = plt.subplots(figsize=(8, 6.5))  # 位置必须在定义字体下，figsize为画布大小

# 绘制直方分布图
plt.hist(hours, bins=23, color="w", density=1, label="事故时间直方分布图", edgecolor='k')


#  ################# 绘图补充设置 #####

# 设置x与y轴上下限
plt.xlim(0, 23)
plt.ylim(0, 0.1)

# 设置x轴刻度间距
plt.xticks(range(0, 23, 2))

# 去掉边框
# ax.spines['top'].set_visible(False)
# ax.spines['right'].set_visible(False)

# 设置轴标签
plt.xlabel(r'事故发生时间/h', font1)
plt.ylabel(r'概率密度', font1)

# 设置图例
plt.legend(loc='upper right', frameon=False, prop=font2)  # 边框不显示

# 设置图标题
# plt.title(r'..', fontsize=20)

plt.show()

运行结果如下：
在这里插入图片描述

2.3 单分布拟合

要求：绘制单分布函数拟合曲线

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats  # 分布函数


# 读取csv表格
df_0 = pd.read_csv('USaccident2017.csv')

# 统计事故时间与事故数量（hour为区间），下面结果为列表形式
# time = pd.DatetimeIndex(df_0["Start_Time"]).hour.value_counts().sort_index().index.tolist()
# counts = pd.DatetimeIndex(df_0["Start_Time"]).hour.value_counts().sort_index().tolist()

# 数据中所有事故发生时间段汇集成的列表
hours = list(pd.DatetimeIndex(df_0["Start_Time"]).hour)
hours = np.sort(hours)


#  ################# 图的绘制 #########

# 自定义绘图字体
font1 = {'family': 'SimSun', 'weight': 'normal', 'size': 18}  # 自定义中文为“宋体”
font2 = {'family': 'SimSun', 'weight': 'normal', 'size': 12}  # 自定义中文为“宋体”

# 设置xtick和ytick的方向：in、out、inout
plt.rcParams['xtick.direction'] = 'in'
plt.rcParams['ytick.direction'] = 'in'
plt.rc('font', family='Times New Roman', size=18)  # 其余字体为“新罗马字体”

fig, ax = plt.subplots(figsize=(8, 6.5))  # 位置必须在定义字体下，figsize为画布大小

# 绘制直方分布图
plt.hist(hours, bins=23, color="w", density=1, label="事故时间直方分布图", edgecolor='k')

# 绘制正态分布曲线
mu = np.mean(hours)  # 计算均值
sigma = np.std(hours)   # 计算标准差,注意与方差区别，是方差开方
print('均值{0},方差{1}'.format(mu, sigma))
y0 = stats.norm.pdf(hours, mu, sigma)  # 拟合一条最佳正态分布曲线y
plt.plot(hours, y0, label='正态分布', linewidth=1, c='r')

# 由于威布尔和Gamma函数需要超参数，先跳过


#  ################# 绘图补充设置 #####
# 设置x与y轴上下限
plt.xlim(0, 23)
plt.ylim(0, 0.1)

# 设置x轴刻度间距
plt.xticks(range(0, 23, 2))

# 去掉边框
# ax.spines['top'].set_visible(False)
# ax.spines['right'].set_visible(False)

# 设置轴标签
plt.xlabel(r'事故发生时间/h', font1)
plt.ylabel(r'概率密度', font1)

# 设置图例
plt.legend(loc='upper right', frameon=False, prop=font2)  # 边框不显示

# 设置图标题
# plt.title(r'..', fontsize=20)

plt.show()

运行结果如下：
在这里插入图片描述

2.4 GMM分布拟合

要求：通过AIC与BIC公式来选择GMM的最佳组分后，来求取多元GMM分布的超参数；得到AIC+BIC图与GMM图。
注意：代码中的公式书写参考上面GMM内容

# 根据车速数据表绘制高斯混合分布
# 根据AIC与BIC进行度量评估
# 注意两个超参数：56行 K_Search（组分数量设置） 、106行 steps_to_converge（EM迭代次数）
# 注意60行 MusDict 的range（）范围
# 其余如轴刻度，bin值是与事故发生时间（hour）对应

from __future__ import division
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from scipy.stats import norm


# ######## 文件读取 ######
# 读取csv表格
df_0 = pd.read_csv('USaccident2017.csv')

# 数据中所有事故发生时间段汇集成的列表
hours = list(pd.DatetimeIndex(df_0["Start_Time"]).hour)
hours = np.sort(hours)

# df = pd.read_excel('1.xlsx')
# x1 = np.sort(df["speed"].values)
# hours = x1


# ###### 绘图字体设置 #######

# 自定义绘图字体
font1 = {'family': 'SimSun', 'weight': 'normal', 'size': 18}  # 自定义中文为“宋体”
font2 = {'family': 'SimSun', 'weight': 'normal', 'size': 14}  # 自定义中文为“宋体”

# 设置xtick和ytick的方向：in、out、inout
plt.rcParams['xtick.direction'] = 'in'
plt.rcParams['ytick.direction'] = 'in'
plt.rc('font', family='Times New Roman', size=14)  # 其余字体为“新罗马字体”


# ####### 保存路径设置 ########

# 创立一个结果保存文件夹，内容：GMM拟合图、GMM参数、AIC与BIC图
script_dir = os.path.dirname(__file__)  # 去掉文件名，返回脚本目录
output_dir = os.path.join(script_dir, 'Results//Accident_Time')
if not os.path.isdir(output_dir):
   os.makedirs(output_dir)


# 样本个数
m = len(hours)  # Number of Samples


#############################################
# 设置待执行的亚类数范围，即从1->5的范围遍历
K_Search = 5  # 重要的超参数
#############################################

# Storing Means
MusDict = {'For K Gaussians': range(1, 6)}  # 注意：这个范围需要与K_Search对应
dfMus = pd.DataFrame(data=MusDict)

# Storing Variance
dfSigs = pd.DataFrame(data=MusDict)

# Storing Weights
dfPis = pd.DataFrame(data=MusDict)

# Storing Score
dfScore = pd.DataFrame(data=MusDict)

# Storing Score2
dfScore_2 = pd.DataFrame(data=MusDict)  # 添加的AIC！！

# Initializing
initvec = np.empty(K_Search)
initvec[:] = np.nan

for jj in range(1, K_Search+1):
    exec('dfMus[\'Mu k={}\']=initvec'.format(jj))
    exec('dfSigs[\'Sig k={}\']=initvec'.format(jj))
    exec('dfPis[\'Pi k={}\']=initvec'.format(jj))
    exec('dfScore[\'BIC\']=initvec')
    exec('dfScore_2[\'AIC\']=initvec')  # 添加的AIC！！

"""迭代 K 1:5"""
for k in range(K_Search):

    """初始化 EM 算法"""

    # 组分个数
    k = k+1  # 第一个值是0，需要+1

    # 随机选择一个均值
    mus = np.random.randint(min(hours), max(hours), size=k).astype("float")  # 随机返回min-max中k个整数

    # 对每个高斯分布初始化一个方差
    sigmas = np.ones(k)*np.std(hours)  # np.std(x1)结果是个1*1数值,np.ones(k)是一个1*k的数组

    # 对K个子分布分配权重
    pis = np.ones(k)/k


# ############## 通过EM算法计算GMM #################

    steps_to_converge = 100  # 迭代次数，重要的超参数

    for qq in range(steps_to_converge):
        print("Iteration Index {}".format(qq+1))

        """Expectation Stage"""  # E步
        priors = np.zeros((m, k))  # Will Hold the Probabilities For m Data Points for k distributions

        # # Inner Iteration over k
        for jj in range(k):
            # 计算每个亚类的概率密度函数
            priors[:, jj] = norm.pdf(hours, mus[jj], sigmas[jj])

        # Weighting by prior probability of a each gaussian then dividing by total weight
        Ri = pis*priors
        TotWeights = 1/np.sum(Ri, axis=1)
        NewProbs = TotWeights.reshape(m, 1)*Ri  # E步骤公式

        """Maximization Stage"""  # M步
        # 直接写出最后的公式
        for jj in range(k):

            # For Each cluster calculate mean probability
            pis[jj] = np.mean(NewProbs[:, jj])

            # Take Weighted Average of values to find means
            mus[jj] = np.average(hours, weights=(NewProbs[:, jj]))  # 与means不同的是average能计算加权平均值，如a[0]*w[0]/w.sum()+...

            # Take Weighted Average of Variances
            secmoment = np.square(hours-mus[jj])

            sqrsig = np.average(secmoment, weights=NewProbs[:, jj])

            sigmas[jj] = np.sqrt(sqrsig)

        """Plotting Results"""

    # ##迭代完后根据参数绘图
    # Settting Colors
    color = iter(plt.cm.jet(np.linspace(0, 1, k+1)))

    # Histogram
    fig, ax = plt.subplots(figsize=(8, 6.5))
    # sns.distplot(hours, bins=63, kde=False, ax=ax, norm_hist=True)
    plt.hist(hours, bins=23, color="w", density=1, label="直方分布图", edgecolor='k')

    # Final Distributions
    totd = np.zeros(len(hours))

    # ######### 保存绘图，但不显示 ############
    for qq in range(k):
        d0 = pis[qq]*norm.pdf(hours, mus[qq], sigmas[qq])  # 高斯分布
        totd = totd+d0
        c = next(color)
        plt.plot(hours, d0, c=c)  # 绘制亚类高斯模型

    # 绘制GMM图
    c = next(color)
    plt.plot(hours, totd, c=c, label="混合概率密度")  # 绘制GMM分布，该是迭代后的结果

    # 设置x与y轴上下限
    plt.xlim(0, 23)
    plt.ylim(0, 0.1)

    # 设置x轴刻度间距
    plt.xticks(range(0, 23, 2))

    # 设置轴标签
    plt.xlabel(r'事故发生时间/h', font1)
    plt.ylabel(r'概率密度', font1)

    # 设置图例
    plt.legend(loc='upper right', frameon=False, prop=font2)  # 边框不显示
    plt.title('Histogram of Accident After EM Algo')

    plt.savefig('{}/Final{}.png'.format(output_dir, k))  # 保存各GMM分布模型
    plt.close("all")


# ########### 贝叶斯信息准则(BIC)与赤池信息准则(AIC)计算与绘图############

    """First Calculate Square Error"""

    KDist = totd/np.sum(totd)  # normalizing

    # getting Relative Frequency from samples
    relfreq, binedges = np.histogram(hours, bins=23)
    relfreq = relfreq/np.sum(relfreq)
    binn = (binedges[1:] + binedges[:-1]) / 2

    # Set Search range and normalize bins to get appropriate indexes from
    # KDist
    high = max(hours)
    low = min(hours)
    span = high-low

    # Indexes to use:
    binn = np.round((len(hours)*(binn-low))/span).astype(int)

    # # Residual Sum of Squares (SSE)

    SSE = np.sum(np.square(KDist[binn]-relfreq))

    """BIC"""
    # BIC for Gaussian Special Case:
    # is k*ln(n)+n*(ln(SSE/n) Where n is the number of observations
    nn = len(hours)
    BIC = k*np.log(nn)+(nn*(np.log(SSE/nn)))

    """AIC"""
    # AIC for Gaussian Special Case:
    # is 2*k+n*(ln(SSE/n) Where n is the number of observations
    AIC = 2*k+(nn*(np.log(SSE/nn)))

    """Storage"""

    for jj in range(k):
        exec('dfMus.at[{}, \'Mu k={}\']=mus[{}]'.format(jj, k, jj))
        exec('dfSigs.at[{}, \'Sig k={}\']=sigmas[{}]'.format(jj, k, jj))
        exec('dfPis.at[{}, \'Pi k={}\']=pis[{}]'.format(jj, k, jj))
        exec('dfScore.at[{}, \'BIC\']=BIC'.format(k-1))
        exec('dfScore_2.at[{}, \'AIC\']=AIC'.format(k-1))  # 添加的AIC！！
        os.system('cls')

    print(dfScore)


#  ########### 进行AIC与BIC绘制 ###########

# AIC与BIC两者合在一个图上
plt.figure(figsize=(8, 6.5))
plt.plot(np.arange(1, K_Search+1), dfScore_2['AIC'], ls='--')
plt.plot(np.arange(1, K_Search+1), dfScore['BIC'], ls='-')
plt.xlabel('混合模型组分', font1)
plt.ylabel('信息准则度量', font1)
plt.title('AIC与BIC度量', font1)
plt.legend(['AIC', 'BIC'], loc='upper left', prop=font2)
plt.xlim(1, 5)
plt.xticks(range(1, 6, 1))  # 设置x轴刻度间距
plt.grid()  # 加网格
plt.savefig('{}/AIC+BIC.png'.format(output_dir))
plt.show()


# ########## 保存数据(可以选择隐藏) ##########

dfMus.to_csv('{}/Mus.csv'.format(output_dir), index=False, header=True)
dfSigs.to_csv('{}/Sigs.csv'.format(output_dir), index=False, header=True)
dfPis.to_csv('{}/Pis.csv'.format(output_dir), index=False, header=True)
dfScore.to_csv('{}/Score.csv'.format(output_dir), index=False, header=True)
dfScore_2.to_csv('{}/Score_2.csv'.format(output_dir), index=False, header=True)


# ########### 最终模型选择 ###########

# 最优的BIC对应的K个数
bestk = np.argmax(np.abs(np.diff(np.diff(dfScore['BIC']))))+2
print('Best BIC found at k = {}'.format(bestk))

# 最优的AIC对应的K个数
bestk = np.argmax(np.abs(np.diff(np.diff(dfScore_2['AIC']))))+2
print('Best AIC found at k = {}'.format(bestk))


# ########## 读取最佳组分的GMM分布模型图片并进行展示 #########

print('Plotting')
img = mpimg.imread('{}/Final{}.png'.format(output_dir, bestk))
imgplot = plt.imshow(img)
plt.axis('off')  # 隐藏坐标轴
plt.show()

运行结果如下：
AIC+BIC度量图：
在这里插入图片描述
组分K=2的GMM分布拟合图：

拟合分析：通过AIC+BIC求得最佳组分是2（但是图没表示出来，代码运行是2），可以用二元混合高斯模型进行拟合。从图中看出，事故大多数发生在早高峰与晚高峰时段，与现实情况基本一致，接下来进行KS检验，看是否该分布能够用于对事故发生时间数据进行描述。

2.5 K-S检验

要求：能够绘制概率累计函数图，并求取K-S检验D值，同时进行分布检验判断
在这里插入图片描述
注意：a取0.05

```python
# 验证其他组分只更改34行的 K值 即可，但注意保存的文件是否与4.py保存结果文件格式一致
# 注意74行的range（）范围，该范围与事故发生时间(hour)匹配


from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
import scipy.stats as stats  # 该模块包含了所有的统计分析函数
import os

# ######## 事故文件读取 ######

# 读取事故csv表格
df_0 = pd.read_csv('USaccident2017.csv')

# 数据中所有事故发生时间段汇集成的列表
hours = list(pd.DatetimeIndex(df_0["Start_Time"]).hour)
hours = np.sort(hours)


# ########### 读取EM算法求取的参数文件 ##########

# 读取保存的参数文件
script_dir = os.path.dirname(__file__)  # 返回当前脚本目录

df_Mus = pd.read_csv(os.path.join(script_dir, 'Results/Accident_Time/Mus.csv'))
df_Pis = pd.read_csv(os.path.join(script_dir, 'Results/Accident_Time/Pis.csv'))
df_Sigs = pd.read_csv(os.path.join(script_dir, 'Results/Accident_Time/Sigs.csv'))

# 设置亚类个数（可修改）
K = 2

# 创建一个总参数存储列表
Parameters = {}

# 初始化参数名字
Para_name = ['Mu', 'Sig', 'Pi']

# 更改参数名字，方便调用,更改后的格式如："Mu k=2"
for i in range(len(Para_name)):
    Para_name[i] = '{} k={}'.format(Para_name[i], K)

# 将提取的参数都保存到总参数类别Parameters中
# 均值部分
old_Mus_list = df_Mus[Para_name[0]].values.tolist()
new_Mus_list = [round(i, 3) for i in old_Mus_list if i == i]  # 去除nan值
Parameters[Para_name[0]] = new_Mus_list

# 方差部分
old_Sig_list = df_Sigs[Para_name[1]].values.tolist()
new_Sig_list = [round(i, 3) for i in old_Sig_list if i == i]  # 去除nan值
Parameters[Para_name[1]] = new_Sig_list

# 亚类权重部分
old_Pi_list = df_Pis[Para_name[2]].values.tolist()
new_Pi_list = [round(i, 3) for i in old_Pi_list if i == i]  # 去除nan值
Parameters[Para_name[2]] = new_Pi_list

# Parameters = {'Mu k=2': [16.59, 8.283], 'Sig k=2': [2.793, 2.961], 'Pi k=2': [0.485, 0.515]}
print('Parameters:', Parameters)

mus = Parameters[Para_name[0]]
sigmas = Parameters[Para_name[1]]
pis = Parameters[Para_name[2]]


# ######## 高斯混合模型累计曲线 #########

x = []
total_cdf = []
for i in range(0, 24, 1):  # 实际上只遍历到0-23
    x.append(i)
    cdf = 0
    for j in range(len(pis)):
        cdf += pis[j]*norm(mus[j], sigmas[j]).cdf(i)  # 各亚类概率分布累加函数值之和（每次累加计算到i）
    total_cdf.append(cdf)

plt.plot(x, total_cdf)
# plt.show()


# ########## 频率累计曲线 ##########

# 查看数据基本统计量
# 统计事故时间与事故数量（hour为区间）
time = pd.DatetimeIndex(df_0["Start_Time"]).hour.value_counts().sort_index().index.tolist()
counts = pd.DatetimeIndex(df_0["Start_Time"]).hour.value_counts().sort_index().tolist()

# 构建一个新表
df_s = pd.DataFrame({'事故发生时间': time, '次数': counts})

# 计算频率数据
df_s['累计次数'] = df_s['次数'].cumsum()
df_s['累计频率'] = df_s['累计次数'] / len(hours)

# 绘制频率累计曲线
plt.plot(df_s['事故发生时间'], df_s['累计频率'])
plt.savefig('累加图.png')
plt.show()

# print(df_s)


# ########### 求频率曲线于高斯混合累计曲线的最大纵向差值 ###########

# 注意：两者维度不一致
bias_value_list = []
for i in range(len(df_s['累计频率'].tolist())):
    sub_x1 = list(filter(lambda x: x <= df_s['事故发生时间'].tolist()[i], hours))

    sub_cdf = 0
    for u in range(len(mus)):
        sub_cdf += pis[u]*stats.norm.cdf(sub_x1, mus[u], sigmas[u])

    bias_value = np.abs(df_s['累计频率'].tolist()[i] - list(np.array(sub_cdf))[-1])
    bias_value_list.append(bias_value)

# 求取最大纵向偏差
D = max(bias_value_list)
print('D值:', max(bias_value_list))


# ########## 根据KS检验原理进行验证 ###########

# D_default为检测值，当且仅当 D_default > D时，检验通过
D_default = 1.36/pow(len(hours), 0.5)
print('D_default值:', D_default)

if D_default > D:
    print('检验通过，符合该分布！')
else:
    print('检验未通过！')