Handwritten Digit Recognition: Comparing SVM, Naive Bayes, Decision Tree, and KNN


秃秃然然 · Published 2021-08-03 17:10:04

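I. Assignment and Requirements

Objective: become proficient with neural network models and their applications. Basic task: recognize the digits in the MNIST handwritten dataset and report accuracy and other metrics.
Extension: which other algorithms could solve this problem? Implement several of them for comparison and use the experimental results to analyze how they differ.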

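II. Background and Related Work

Concepts

Image recognition is the use of computers to process, analyze, and understand images in order to identify targets and objects of various patterns.
The field has developed through three stages: character recognition, digital image processing and recognition, and object recognition. In machine learning, recognition problems of this kind are usually cast as classification problems.

Handwriting recognition is a common image recognition task: the computer identifies the characters in an image of handwriting. Unlike printed type, handwriting varies widely in style and size from person to person, which makes the task harder. Handwritten digit recognition is a comparatively simple case because the number of classes is small (the 10 digits 0 to 9). DBRHD and MNIST are two commonly used handwritten digit datasets.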

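Data

MNIST is a dataset of handwritten images of the digits 0 to 9; each image has been normalized to 28×28 pixels with the digit centered. MNIST consists of a training set and a test set of the following sizes:
Training set: 60,000 handwritten images with labels; test set: 10,000 handwritten images with labels.
Note that the source code in the appendix loads its data with scikit-learn's load_digits, which returns a smaller built-in set of 1,797 images at 8×8 pixels, split 75/25 into training and test data. The accuracies reported below were therefore measured on that set rather than on the 60,000/10,000 MNIST split.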
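Since the appendix scripts all obtain their data from `sklearn.datasets.load_digits`, it may help to look at what that call returns. The following is a minimal sketch; the variable name `digits` is chosen here for illustration and does not appear in the original scripts.

```python
from sklearn.datasets import load_digits

# Load scikit-learn's built-in digits dataset (the one used by the appendix code)
digits = load_digits()

# Each row of `data` is one image flattened into a feature vector
print(digits.data.shape)
# `images` holds the same pixels as 2-D arrays, convenient for plotting
print(digits.images.shape)
# Pixel intensities are small integers rather than 0-255 grayscale values
print(digits.data.min(), digits.data.max())
```

To run the same experiments on the full 28×28 MNIST data, one option would be `sklearn.datasets.fetch_openml('mnist_784')`, which downloads the data on first use; the original post does not do this.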


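Algorithms

1.SVM

A support vector machine (SVM) is a supervised learning method: given training points with known classes, it learns the relationship between points and classes so that the training set can be separated by class and the class of a new point can be predicted. It is mainly used for learning, classification, and prediction (the latter sometimes in the form of regression) on small samples, and it is often credited with being less prone to the overfitting that neural networks can suffer from on such data.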
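To make the idea of separating training points by class concrete, here is a small sketch on hand-made 2-D points. The data and variable names are invented for illustration, and `SVC(kernel="linear")` is used here instead of the `LinearSVC` from the appendix because it exposes the support vectors directly.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (hypothetical): two well-separated clusters in the plane
points = np.array([[0, 0], [1, 1], [0, 1], [4, 4], [5, 5], [4, 5]], dtype=float)
classes = np.array([0, 0, 0, 1, 1, 1])

# Fit a linear SVM: it looks for the separating line with the widest margin
clf = SVC(kernel="linear", C=1.0)
clf.fit(points, classes)

# The support vectors are the training points that determine the margin
print(clf.support_vectors_)

# Predict the class of two new points, one near each cluster
pred = clf.predict(np.array([[0.5, 0.5], [4.5, 4.5]]))
print(pred)
```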

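2.Naive Bayes

Naive Bayes is a very simple classification algorithm. It is called naive because its idea is so plain: for an item to be classified, compute the probability of each class given that this item has occurred, and assign the item to the class with the highest probability. Formally, given an item x with features $a_1, \dots, a_m$ and a set of classes $C_1, \dots, C_n$, the classifier computes $p(C_i \mid x)$ for every class and assigns x to the class $C_k$ for which $p(C_k \mid x)$ is largest.

3.Decision tree

A decision tree is a method for approximating discrete-valued functions. It is a typical classification method: the data is first processed, an induction algorithm generates readable rules and a decision tree, and the tree is then used to analyze new data. In essence, a decision tree classifies data through a series of rules.

4.KNN

The nearest-neighbor algorithm, or K-nearest neighbors (KNN), is one of the simplest classification methods in data mining. "K nearest neighbors" means the K closest neighbors: each sample can be represented by the K samples closest to it. KNN classifies every record in a dataset by the classes of its nearest neighbors.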

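III. Approach

This experiment uses four algorithms, SVM, naive Bayes, decision tree, and KNN, to recognize handwritten digits. For each one, it measures the accuracy and displays six handwritten digit images so they can be compared with the algorithm's predictions.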
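The overall procedure can be sketched as a single loop over the four classifiers. This is a condensed illustration, not the author's scripts: it reuses the estimators, split ratio, and `random_state` from the appendix, while the `models` and `accuracies` dictionaries are names introduced here. Exact accuracies may vary with the scikit-learn version, so no output is shown.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Same data source and split as the appendix code (8x8 digits, 25% held out)
digits = load_digits()
x, test_x, y, test_y = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=40)

# The four classifiers compared in this post, with the appendix settings
models = {
    "SVM": LinearSVC(),
    "Naive Bayes": MultinomialNB(),
    "Decision tree": DecisionTreeClassifier(criterion="entropy"),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

# Fit each model on the training part and measure accuracy on the held-out part
accuracies = {}
for name, model in models.items():
    model.fit(x, y)
    accuracies[name] = float(np.mean(model.predict(test_x) == test_y))
    print(f"{name}: {accuracies[name]:.3f}")
```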

1.SVM workflow

[Figure: SVM workflow diagram; image not preserved]

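2.Naive Bayes workflow

In machine learning, many classification methods derive from Bayes' theorem, and the most widely used of them is naive Bayes, a probabilistic classifier. The training set is represented by a feature vector A and a decision (class) variable C. The algorithm assumes that the features are mutually independent and that each one acts on the decision variable independently. Under this assumption, if the attributes $a_1, \dots, a_n$ that make up a sample A are mutually independent given the class, then $p(A \mid C_i)$ can be written as
$$p(A \mid C_i) = \prod_{j=1}^{n} p(a_j \mid C_i)$$
The posterior probability $p(C_i \mid A)$, that is, the probability that a sample with features A belongs to class $C_i$, then follows from Bayes' rule:
$$p(C_i \mid A) = \frac{p(A \mid C_i)\, p(C_i)}{p(A)}$$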
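The independence assumption and Bayes' rule described above can be turned into a few lines of code. The following from-scratch sketch works on categorical features; the function names `train_nb` and `predict_nb`, the toy weather data, and the Laplace smoothing parameter `alpha` are all assumptions made for illustration and are not part of the original post, which uses scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, labels, alpha=1.0):
    """Count how often each class and each (class, feature, value) occurs."""
    class_counts = Counter(labels)
    # feature_counts[c][j][v]: number of class-c samples whose j-th feature is v
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values seen for each feature index
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            feature_counts[c][j][v] += 1
            values[j].add(v)
    return class_counts, feature_counts, values, alpha

def predict_nb(model, x):
    """Return the class with the largest (log) posterior for sample x."""
    class_counts, feature_counts, values, alpha = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        # log p(C_i) + sum_j log p(a_j | C_i); p(A) is identical for every
        # class, so it can be dropped when only the argmax is needed
        score = math.log(n_c / total)
        for j, v in enumerate(x):
            # Laplace smoothing keeps unseen values from giving probability 0
            num = feature_counts[c][j][v] + alpha
            den = n_c + alpha * len(values[j])
            score += math.log(num / den)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy data (hypothetical): (weather, temperature) -> play outside?
samples = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(samples, labels)
print(predict_nb(model, ("sunny", "hot")))
print(predict_nb(model, ("rainy", "cool")))
```

Working in log space turns the product over features into a sum, which avoids numerical underflow when there are many features, such as the 64 pixels of a digit image.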

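3.Decision tree workflow

1) The tree starts as a single node representing all training samples.
2) If the samples all belong to the same class, the node becomes a leaf and is labeled with that class.
3) Otherwise, the algorithm selects the attribute with the greatest discriminating power as the current node of the decision tree.
4) According to the values of the current decision attribute, the training set is partitioned into subsets, one branch per value, so there are as many branches as there are values. The previous steps are then repeated on each subset, recursively building a decision tree on every partition. Once an attribute has been used at a node, it need not be considered again in any of that node's descendants.
5) The recursive partitioning stops only when one of the following conditions holds:
① All samples at the given node belong to the same class.
② No attributes remain for further partitioning. In this case majority voting is used: the node is converted into a leaf labeled with the class that has the most samples, and the class distribution of the node's samples may also be stored.
③ A branch contains no samples. In this case a leaf is created and labeled with the majority class of the samples.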
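Step 3 depends on a measure of how well an attribute separates the classes. One common choice is information gain based on entropy, which is consistent with the `criterion="entropy"` setting in the appendix code. The sketch below shows only that selection step; the helper names and the toy rows are hypothetical. Note also that scikit-learn's `DecisionTreeClassifier` builds binary splits on thresholds rather than one branch per attribute value, so this illustrates the textbook procedure above rather than the library's internals.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning the samples on `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    # Weighted average entropy of the subsets, one subset per attribute value
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attrs):
    """Pick the attribute with the highest information gain (step 3)."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))

# Toy data (hypothetical): attribute "a" determines the class, "b" is noise
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = [0, 0, 1, 1]
print(information_gain(rows, labels, "a"))
print(information_gain(rows, labels, "b"))
print(best_attribute(rows, labels, ["a", "b"]))
```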

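4.KNN workflow

① Prepare the data and preprocess it.
② Compute the distance from the test sample (the point to be classified) to every training sample.
③ Sort the distances and select the K points with the smallest distances.
④ Compare the classes of those K points and, by majority vote, assign the test sample to the class that is most common among them.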
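Steps ② to ④ can be sketched directly with NumPy. The function name `knn_predict` and the toy points are made up for illustration, and Euclidean distance is an assumption here, since the text above does not say which distance is used.

```python
from collections import Counter

import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training sample
    dists = np.linalg.norm(train_x - query, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Step 4: majority vote over the classes of those k neighbors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two clusters in the plane
train_x = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_x, train_y, np.array([0.2, 0.3])))
print(knn_predict(train_x, train_y, np.array([5.5, 5.5])))
```

Because every prediction scans the whole training set, the cost grows with the number of training samples, which is the "lazy algorithm" drawback noted in the summary.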

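IV. Experimental Results

1.SVM

The screenshot of the run (image not preserved) shows that SVM reached an accuracy of 0.942. The six displayed handwritten digits match the predictions: 4 5 6 7 3 9.

2.Decision tree

The screenshot of the run (image not preserved) shows that the decision tree reached an accuracy of 0.851. The six displayed handwritten digits match the predictions: 1 4 0 5 3 6.

3.Naive Bayes

The screenshot of the run (image not preserved) shows that naive Bayes reached an accuracy of 0.86. The six displayed handwritten digits match the predictions: 4 0 5 3 6 9.

4.KNN

The screenshot of the run (image not preserved) shows that KNN reached an accuracy of 0.982. The six displayed handwritten digits match the predictions: 4 4 5 0 8 9.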

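V. Analysis of Results

In this experiment I used four algorithms, SVM, decision tree, naive Bayes, and KNN, to classify the MNIST data. In terms of accuracy, the results rank as KNN > SVM > naive Bayes > decision tree.

Algorithm	SVM	Naive Bayes	Decision tree	KNN
Accuracy	0.942	0.86	0.851	0.982

For all four algorithms, the predictions for the six displayed handwritten digits matched the actual digits.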
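The assignment asks for accuracy "and other metrics", while the results above report accuracy only. As a sketch of how per-class figures could be obtained, the following uses `sklearn.metrics` on the KNN model with the same split as the appendix. It is an illustration of one possible extension, not something the original experiment ran, so no output values are quoted.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split as the appendix scripts
digits = load_digits()
x, test_x, y, test_y = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=40)

model = KNeighborsClassifier(n_neighbors=3).fit(x, y)
pred = model.predict(test_x)

# Overall accuracy: fraction of test images classified correctly
acc = accuracy_score(test_y, pred)
print("Accuracy:", acc)

# Rows are true digits, columns are predicted digits;
# off-diagonal entries show which digits get confused with each other
cm = confusion_matrix(test_y, pred)
print(cm)

# Precision, recall, and F1 score for each digit class
print(classification_report(test_y, pred))
```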

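VI. Summary

1.SVM
Advantages: well suited to small-sample, nonlinear, high-dimensional pattern recognition.
Disadvantages: expensive on large datasets; not naturally suited to multiclass problems; sensitive to missing data; requires choosing an appropriate kernel function.

2.Decision tree
Advantages: simple and easy to understand; can handle multi-output problems.
Disadvantages: prone to overfitting; tree construction is unstable, since small changes in the data can produce a different tree.

3.KNN
Advantages: simple and easy to understand; no training phase and no parameters to estimate; high accuracy; suitable for multi-label problems.
Disadvantages: a lazy algorithm, so prediction is slow and costly; accuracy suffers when class sizes are imbalanced; poor interpretability.

4.Naive Bayes
Advantages: stable classification; suitable for small datasets and incremental training; insensitive to missing data.
Disadvantages: performs poorly when attributes are strongly correlated; requires prior probabilities; sensitive to how the input data is represented.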

VII. Appendix: Source Code

1.SVM

import pickle

import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# load_digits() returns scikit-learn's built-in 8x8 digits set (1,797 images)
mnist = load_digits()
x, test_x, y, test_y = train_test_split(mnist.data, mnist.target, test_size=0.25, random_state=40)
model = svm.LinearSVC()
model.fit(x, y)
z = model.predict(test_x)
print('Accuracy:', np.sum(z == test_y) / z.size)
# Save the trained model; the raw string keeps the backslashes in the Windows path literal
with open(r'D:\A-Python\Work\HWR\data\model.pkl', 'wb') as file:
    pickle.dump(model, file)
# Predict images 520 to 525 with the trained model
print(model.predict(mnist.data[520:526]))
# True labels of images 520 to 525
print(mnist.target[520:526])
# Display images 520 to 525 in a 3x2 grid
for pos, idx in enumerate(range(520, 526), start=1):
    plt.subplot(3, 2, pos)
    plt.imshow(mnist.images[idx], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
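The SVM script saves the trained model with `pickle` but never reads it back. A minimal sketch of the full round trip follows; it writes to a temporary directory instead of the author's `D:\` path, and the names `path`, `restored`, and `same` are introduced here for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC

# Train a model to have something to save (same estimator as the SVM script)
digits = load_digits()
model = LinearSVC().fit(digits.data, digits.target)

# Write the model to a temporary file (hypothetical location)
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as file:
    pickle.dump(model, file)

# Read it back; only unpickle files from a trusted source
with open(path, "rb") as file:
    restored = pickle.load(file)

# The restored model should give the same predictions as the original
sample = digits.data[520:526]
same = bool((restored.predict(sample) == model.predict(sample)).all())
print(same)
```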

2.Naive Bayes

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# load_digits() returns scikit-learn's built-in 8x8 digits set (1,797 images)
mnist = load_digits()
x, test_x, y, test_y = train_test_split(mnist.data, mnist.target, test_size=0.25, random_state=40)
model = MultinomialNB()
model.fit(x, y)
z = model.predict(test_x)
print('Accuracy:', np.sum(z == test_y) / z.size)
# Predict images 1001 to 1006 with the trained model
print(model.predict(mnist.data[1001:1007]))
# True labels of images 1001 to 1006
print(mnist.target[1001:1007])
# Display images 1001 to 1006 in a 3x2 grid
for pos, idx in enumerate(range(1001, 1007), start=1):
    plt.subplot(3, 2, pos)
    plt.imshow(mnist.images[idx], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

3.Decision tree

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# load_digits() returns scikit-learn's built-in 8x8 digits set (1,797 images)
mnist = load_digits()
x, test_x, y, test_y = train_test_split(mnist.data, mnist.target, test_size=0.25, random_state=40)
model = DecisionTreeClassifier(criterion="entropy")
model.fit(x, y)
z = model.predict(test_x)
print('Accuracy:', np.sum(z == test_y) / z.size)
# Predict images 99 to 104 with the trained model
print(model.predict(mnist.data[99:105]))
# True labels of images 99 to 104
print(mnist.target[99:105])
# Display images 99 to 104 in a 3x2 grid
for pos, idx in enumerate(range(99, 105), start=1):
    plt.subplot(3, 2, pos)
    plt.imshow(mnist.images[idx], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
# Export the tree in Graphviz .dot format; features are named by pixel index (0 to 63)
feature_names = [str(i) for i in range(x.shape[1])]
with open(r"D:\A-Python\Work\HWR\data\JueCetree.dot", 'w') as f:
    export_graphviz(model, feature_names=feature_names, out_file=f)

4.KNN

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load_digits() returns scikit-learn's built-in 8x8 digits set (1,797 images)
mnist = load_digits()
x, test_x, y, test_y = train_test_split(mnist.data, mnist.target, test_size=0.25, random_state=40)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x, y)
z = model.predict(test_x)
print('Accuracy:', np.sum(z == test_y) / z.size)
# Predict images 1660 to 1665 with the trained model
print(model.predict(mnist.data[1660:1666]))
# True labels of images 1660 to 1665
print(mnist.target[1660:1666])
# Display images 1660 to 1665 in a 3x2 grid
for pos, idx in enumerate(range(1660, 1666), start=1):
    plt.subplot(3, 2, pos)
    plt.imshow(mnist.images[idx], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()