Pokémon Dataset Analysis and Prediction

F1F88 · Published 2021-04-02 19:57:57

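Preface

The following are notes I recorded while studying and are intended for learning purposes only. If you find any errors or omissions, please leave a comment to point them out. Thank you.

Dataset and code download, Baidu Cloud link: https://pan.baidu.com/s/1RFUEVcD85J2AQ3_hbI5pdA
Extraction code: qwer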

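Abstract

This article uses NumPy, Pandas, and Matplotlib on the Python platform, together with KNeighborsClassifier, a supervised machine learning algorithm provided by the scikit-learn library, to analyze and make predictions on a dataset of 1,013 Pokémon. After extracting, cleaning, and transforming the dataset and training a KNeighborsClassifier model, the goal is to predict whether a Pokémon falls into one of three categories: Legendary, Mythical, or Mega Evolution. We search for the best test-set proportion and the best hyperparameter k to find the best-performing KNeighborsClassifier model, then use that model to map every input to its corresponding output. A simple check on that output yields the prediction and classification.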

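1. Key Technologies

numpy

NumPy is an important Python extension library for scientific computing and a required dependency of many other libraries in data analysis and scientific computing, such as scipy, pandas, and sklearn. It provides powerful N-dimensional arrays and the operations on them, sophisticated broadcasting functions, tools for integrating C/C++ and Fortran code, and features such as linear algebra, Fourier transforms, and random number generation.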
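To make the N-dimensional array and broadcasting features concrete, here is a minimal sketch. The numbers are made up for illustration and are not taken from the dataset used in this article.

```python
import numpy as np

# Hypothetical base stats for three Pokémon: rows are Pokémon, columns are (hp, attack, defense).
stats = np.array([[45, 49, 49],
                  [60, 62, 63],
                  [80, 82, 83]], dtype=float)

# Broadcasting: the (3,) vector of column means is stretched across every row of the (3, 3) array.
col_mean = stats.mean(axis=0)
centered = stats - col_mean

print(centered.shape)
print(np.allclose(centered.mean(axis=0), 0.0))   # each column is now centered on zero
```

No explicit loop is needed: NumPy aligns the shapes and applies the subtraction row by row.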

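pandas

    Pandas processes data in service of data analysis. The data processing methods and tools it provides are grounded in mathematical statistics and cover many of the analysis methods used in everyday work. Learning Pandas means not only mastering its operations but also picking up data analysis theory and methods from the way it approaches problems. In particular, for those who want to become, or switch careers to, data analysts, data product managers, data developers, or other data-related roles, learning Pandas deepens your grasp of data theory and practice and helps you understand and apply data better.
    Pandas easily handles the tabular data that office workers deal with every day, and it is also used in finance, statistics, mathematical research, physics computation, social science, engineering, and other fields.
    Pandas can implement complex processing logic that tools such as Excel often cannot handle, and it can automate and batch that work, so the same processing does not have to be repeated by hand across large amounts of data.
    Pandas can produce striking visualizations: it integrates with many polished visualization libraries and can support interactive, dynamic data displays.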
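As a small illustration of this kind of table processing, the sketch below counts rows per group and sums a flag column per group. The miniature table is hypothetical; only the column names mirror the ones used later in this article.

```python
import pandas as pd

# Hypothetical miniature table, not the real dataset.
toy = pd.DataFrame({
    'generation':   [1, 1, 2, 2, 2],
    'type1':        ['Fire', 'Water', 'Fire', 'Psychic', 'Psychic'],
    'is_legendary': [0, 0, 0, 1, 1],
})

# How many rows fall into each generation, ordered by generation.
per_generation = toy['generation'].value_counts().sort_index()

# How many legendary entries each generation contains.
legendary_per_generation = toy.groupby('generation')['is_legendary'].sum()

print(per_generation.to_dict())
print(legendary_per_generation.to_dict())
```

The same two calls, value_counts and groupby, carry most of the analysis in section 2.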

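matplotlib

Data visualization is only one part of data analysis. Here it is handled by matplotlib, a powerful and widely used plotting library for Python. matplotlib holds an important place in the data analysis field and has a rich set of extensions that enable more advanced functionality. With only a few lines of code, developers can generate plots such as histograms, bar charts, error bar charts, and scatter plots, which makes visualizing data quick and convenient.

The visualization process goes through the following four stages:

Preprocessing: the raw collected data is processed first, its format is adjusted as needed, and the required data is prepared.
Mapping: after preprocessing, the data is mapped according to mapping rules into geometric elements such as points, lines, and solid shapes.
Drawing: the geometric elements from the previous stage are rendered.
Feedback: the rendered image is presented and used for follow-up visual analysis.

In practice, these four stages are repeated in a loop, and this iterative checking helps keep the visualization result accurate.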
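The four stages can be sketched in a few lines of matplotlib. This is only an illustration under stated assumptions: the records are invented, and the Agg backend is chosen so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')            # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# 1. Preprocessing: clean hypothetical raw records (drop the row with a missing generation).
raw = pd.DataFrame({'generation': [1, 1, 2, None, 3], 'name': list('abcde')})
clean = raw.dropna(subset=['generation']).astype({'generation': int})

# 2. Mapping: turn the data into geometric elements (one bar height per generation).
counts = clean['generation'].value_counts().sort_index()

# 3. Drawing: render those elements as bars.
fig, ax = plt.subplots()
bars = ax.bar(counts.index.astype(str), counts.values)

# 4. Feedback: label and render the figure, inspect it, and loop back if something looks off.
ax.set_xlabel('generation')
ax.set_ylabel('count')
fig.canvas.draw()
```

In an interactive session, plt.show() would take the place of fig.canvas.draw() in the feedback stage.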

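sklearn

sklearn (scikit-learn) is a very powerful third-party machine learning library for Python that covers everything from data preprocessing to model training. In practice, using scikit-learn greatly reduces both the time spent writing code and the amount of code needed, leaving more energy for analyzing data distributions, adjusting models, and tuning hyperparameters.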
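A minimal sketch of that "preprocessing to training" workflow is shown below. It uses synthetic data from make_classification rather than the Pokémon dataset, and the scaling step is an addition for illustration; the article's own code later trains on unscaled features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature table (not the Pokémon data).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Preprocessing and the model are chained, so the scaler is fitted on the training split only.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X_tr, y_tr)
score = pipe.score(X_te, y_te)
print(f'test accuracy: {score:.3f}')
```

Distance-based models such as KNN are sensitive to feature scale, which is why a scaler is a common first step in such a pipeline.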

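2. Data Analysis

Note: some of the matplotlib figures in this article use Chinese in their titles and labels, so garbled characters may appear when the code runs in certain environments.

Solution: https://www.zhihu.com/question/25404709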
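One commonly used fix is to point matplotlib at a font that contains Chinese glyphs before plotting. The sketch below assumes a font named SimHei is installed; that name is an assumption, so substitute any CJK-capable font available on your system.

```python
import matplotlib.pyplot as plt

# Assumption: 'SimHei' is installed. 'DejaVu Sans' is listed as a fallback for non-CJK text.
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']

# Keep the minus sign rendering correctly when a CJK font is active.
plt.rcParams['axes.unicode_minus'] = False
```

These two settings need to run before any figure is drawn.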

Import libraries (including the machine learning libraries)

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import pandas as pd
import numpy as np
mpl.use('TkAgg')
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.svm import SVC

Load the dataset

df = pd.read_csv('../data/Pokemon_data.csv', encoding='utf-8')
pd.set_option('display.width', 1000)        # set the display width
pd.set_option('display.max_columns', 12)    # set how many columns are displayed
print(df)


Number of Pokémon in each generation

data_generation = df['generation'].value_counts().sort_index()   # count per generation, sorted by generation
data_generation.plot.bar(rot=0, title='精灵各代数量排行', xlabel='代', ylabel='数量（只）')
for i, v in enumerate(data_generation):                          # label each bar with its value
    plt.text(i, v, v, ha='center', va='bottom')
plt.show()

As the chart shows, the first generation has the most Pokémon and the sixth generation has the fewest.

Number of Legendary and Mythical Pokémon in each generation

group_generation = df.groupby('generation')             # group by generation
# sum the is_legendary and is_mythical flags within each generation
group_generation = group_generation[['is_legendary', 'is_mythical']].sum()
group_generation.plot(kind='bar', rot=0)                # draw the grouped bar chart
plt.title('各代精灵为传奇或神话数量')
plt.xlabel('代')
plt.ylabel('数量')
plt.legend(loc='best')
for i, v in enumerate(group_generation.is_legendary):   # label the legendary bars
    plt.text(i - 0.12, v, v, ha='center', va='bottom')
for i, v in enumerate(group_generation.is_mythical):    # label the mythical bars
    plt.text(i + 0.12, v, v, ha='center', va='bottom')
plt.show()

Although the first generation has many Pokémon, relatively few of them are Legendary or Mythical.

Pokémon type counts (bar chart)

data_type1 = df['type1'].value_counts()              # count the values in the type1 column
# Some Pokémon have no second type (None), so add a None entry (empty type) to type1
# to keep its x-axis aligned with type2
data_type1['None'] = 0
data_type1 = data_type1.sort_index(ascending=True)   # sort by index (type name)
data_type2 = df['type2'].value_counts()              # count the values in the type2 column
data_type2 = data_type2.sort_index(ascending=True)   # sort by index (type name)
axis_type = np.arange(len(data_type1))               # tick positions on the type axis
width = 0.36
plt.bar(axis_type, data_type1, width, facecolor='#9999ff', label='type1')   # draw the bars
plt.bar(axis_type + width, data_type2, width, facecolor='#ff9999', label='type2')
for i, v in enumerate(data_type1):                   # label the type1 bars
    plt.text(i, v, v, ha='center', va='bottom')
for i, v in enumerate(data_type2):                   # label the type2 bars
    plt.text(i + width, v, v, ha='center', va='bottom')
# The None bar in type2 is too tall, so it is cut off to save space and annotated instead
plt.annotate('482', xy=(12+width, 169+width), xytext=(12-width, 165),
             arrowprops=dict(facecolor='black', width=1, headlength=1, headwidth=1))
plt.ylim(0, 170)                                     # set the y-axis range
plt.xticks(axis_type, data_type1.index)              # set the x-axis ticks
plt.legend(loc='upper left')                         # place the legend
plt.title('精灵属性分布')
plt.xlabel('属性')
plt.ylabel('数量（只）')
plt.show()

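Nearly half of the Pokémon (482 of the 1,013) have no secondary type (that is, type2 is None), while every Pokémon has a primary type.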

Type distribution of Legendary / Mythical Pokémon (pie charts)

# One pie chart per (type column, category column) pair, laid out on a 2x2 grid
combos = [('type1', 'is_legendary'), ('type1', 'is_mythical'),
          ('type2', 'is_legendary'), ('type2', 'is_mythical')]
for pos, (target_type, target_category) in enumerate(combos, start=1):
    plt.subplot(2, 2, pos)                                           # select the subplot
    group_type = df.groupby(target_type)                             # group by type
    group_type = group_type[target_category].sum().sort_values()     # count per type and sort
    group_type.plot.pie(autopct='%1.1f%%', ylabel='')                # draw the pie chart
    plt.title('Type ' + target_type[4:] + ' is ' + target_category[3:])

plt.show()

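Among both Legendary and Mythical Pokémon, Psychic is a common primary type. Dragon is also common among Legendary Pokémon, yet not a single Dragon-type Pokémon is Mythical. Most Legendary and Mythical Pokémon have no secondary type, which is why the type2 column is dropped for the prediction below.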

Counting dual-type Pokémon

plt.subplots(figsize=(10, 10))
sns.heatmap(df[df['type2']!='None'].groupby(['type1', 'type2']).size().unstack(), linewidths=1, annot=True, cmap="Blues" )
plt.xticks(rotation=35)
plt.show()


Battle stats analysis

We look only at the six base stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed.

interested = ['hp','attack','defense','sp_attack','sp_defense','speed']
sns.pairplot(df[interested])
plt.show()
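A pair plot shows the relationships visually. If numbers are also wanted, a correlation matrix over the same columns is one option. The sketch below uses random values as a hypothetical stand-in for df[interested], so its output says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

interested = ['hp', 'attack', 'defense', 'sp_attack', 'sp_defense', 'speed']

# Hypothetical random stats standing in for df[interested].
rng = np.random.default_rng(0)
toy_stats = pd.DataFrame(rng.integers(20, 150, size=(50, 6)), columns=interested)

# Pairwise Pearson correlation (the pandas default) between the six stats.
corr = toy_stats.corr()
print(corr.round(2))
```

With the real data, replacing toy_stats with df[interested] would give the numeric counterpart of the pair plot.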


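3. Prediction

With a general picture of the Pokémon data from the analysis above, we now move on to prediction. The main task is to predict whether a Pokémon is Legendary or Mythical. The prediction is based on the Pokémon's basic data (the columns used for learning): height, weight, attack, defense, capture difficulty, and so on, as well as whether it is a Mega Evolution.

Clean the data and split the dataset

The type1 column is converted to numeric values, and type2 is dropped (because type2 is empty for a large share of Pokémon).

(Changing the value of target in the code below is all it takes to change what is predicted, so the code is not pasted a second time, to save space.)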

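# Encode type1 as the integers 1..18. The result is assigned back to the column
# instead of using inplace=True on a column selection, which newer pandas versions warn about.
df['type1'] = df['type1'].replace(
    ['Bug', 'Dark', 'Dragon', 'Electric', 'Fairy', 'Fighting',
     'Fire', 'Flying', 'Ghost', 'Grass', 'Ground', 'Ice', 'Normal',
     'Poison', 'Psychic', 'Rock', 'Steel', 'Water'],
    list(range(1, 19)))
target = 'is_legendary'                                  # column to predict
# target = 'is_mythical'
X = df.iloc[:, 5:].drop(columns=[target, 'type2'])       # feature columns (target column removed)
Y = df[target]                                           # label column
X_train, X_test, y_train, y_test = train_test_split(     # split into training and test sets
    X, Y, test_size=0.5, random_state=18)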
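Legendary and Mythical Pokémon are a small minority of the rows, so a purely random split can leave the two halves with different shares of positive examples. One option, not used in the original code, is the stratify argument of train_test_split. The sketch below uses invented labels to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 positives out of 100, standing in for is_legendary.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=18, stratify=y_toy)

print(y_tr.sum(), y_te.sum())   # 5 positives end up in each half
```

Applied to the article's data, this would correspond to passing stratify=Y in the call above.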

Predicting the Pokémon category (Legendary/Mythical) with three algorithms

machine_name = np.array(['rbf_svm', 'linear_svm', 'KNN'])   # x-axis tick labels
machine_x = np.array([1, 2, 3])                             # numeric positions make plotting easier
machine_score = np.array([])                                # accuracy of each algorithm

Train with each algorithm and store the results

# 1. SVM with an RBF kernel
model = SVC(kernel='rbf', C=1, gamma=0.1)
model.fit(X_train, y_train)
y_svmpred = model.predict(X_test)
machine_score = np.append(machine_score, metrics.accuracy_score(y_test, y_svmpred))

# 2. SVM with a linear kernel (gamma is ignored by the linear kernel, so it is omitted)
model = SVC(kernel='linear', C=1)
model.fit(X_train, y_train)
y_lsvmpred = model.predict(X_test)
machine_score = np.append(machine_score, metrics.accuracy_score(y_test, y_lsvmpred))

# 3. KNeighborsClassifier with default settings
model = KNeighborsClassifier()
model.fit(X_train, y_train)
y_KNCpred = model.predict(X_test)
machine_score = np.append(machine_score, metrics.accuracy_score(y_test, y_KNCpred))

Plot the results

plt.plot(machine_x, machine_score*100)
plt.scatter(machine_x, machine_score*100, marker='*', color='red', s=80)
for x, y in zip(machine_x, machine_score*100):
    plt.text(x-0.07, y+0.3, '%.2f' % y, ha='left')
plt.title('三种算法预测 ' + target + ' 的正确率')
plt.xlabel('算法 / 模型')
plt.xticks(machine_x, machine_name)
plt.ylabel('Accuracy（ % ）')
plt.ylim(90, 100)
plt.show()
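When reading an accuracy comparison like this one, keep in mind that the positive class is rare, so accuracy alone can look high even for a model that never predicts a positive. The sketch below uses invented labels, not results from this article, to show how recall and a confusion matrix expose that case.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: 5 positives among 100 samples.
y_true = np.array([1] * 5 + [0] * 95)

# A "model" that always predicts the majority class.
y_majority = np.zeros(100, dtype=int)

accuracy = metrics.accuracy_score(y_true, y_majority)
recall = metrics.recall_score(y_true, y_majority)
cm = metrics.confusion_matrix(y_true, y_majority)

print(accuracy, recall)   # 0.95 0.0
print(cm)                 # rows are true classes, columns are predicted classes
```

Reporting recall or the confusion matrix next to accuracy would be one way to check how many Legendary or Mythical Pokémon the three models actually find.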


Tuning KNeighborsClassifier

How changing n_neighbors affects the prediction results

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=18)
k_values = range(1, 51)                             # candidate values of n_neighbors
y_list = np.array([])                               # accuracy for each k
for i in k_values:
    knn = KNeighborsClassifier(n_neighbors=i)       # change n_neighbors on each pass
    knn.fit(X_train, y_train.values.ravel())        # train
    y_KNCpred = knn.predict(X_test)                 # predict on the test set
    y_list = np.append(y_list, [metrics.accuracy_score(y_test, y_KNCpred) * 100])   # accuracy in %
plt.plot(k_values, y_list)                          # line chart with k (1 to 50) on the x-axis
plt.title('预测' + target + '改变n_neighbors数量')
plt.xlabel('n_neighbors ')
plt.ylabel('Accuracy ( % )')
plt.show()
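The loop above scores every k on a single train/test split. An alternative, shown here only as a sketch on synthetic data, is to let scikit-learn's GridSearchCV pick k by cross-validation, which makes the choice less dependent on one particular split. The search range of 1 to 20 is an arbitrary choice for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (X, Y); not the Pokémon data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=18)

# Try each candidate k with 5-fold cross-validation and keep the best mean accuracy.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': list(range(1, 21))},
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

best_k = search.best_params_['n_neighbors']
print(best_k, round(search.best_score_, 3))
```

With the article's data, X_demo and y_demo would be replaced by the training portion of X and Y, leaving the test set untouched for a final check.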


How different test_size values affect the prediction results

knn = KNeighborsClassifier()
y_list = np.array([])                                                   # accuracy for each test_size
test_list = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])     # candidate test_size values
for i in range(9):
    # split the dataset with the current test_size
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_list[i], random_state=18)
    knn.fit(X_train, y_train.values.ravel())                            # train
    y_KNCpred = knn.predict(X_test)                                     # predict on the test set
    y_list = np.append(y_list, [metrics.accuracy_score(y_test, y_KNCpred) * 100])   # accuracy in %
for i, v in enumerate(y_list):                                          # label each point with its value
    plt.text(test_list[i]*100-0.8, v+0.1, '%.2f' % v)
plt.plot(test_list * 100, y_list, 'b-o')                                # draw the line chart
plt.title('预测' + target + '改变test_size占比')
plt.xlabel('test_size ( % )')
plt.ylabel('Accuracy ( % )')
plt.ylim(95, 100)
plt.show()
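Each point in this curve comes from one split with random_state=18, so part of the variation between test_size values may be split noise. A minimal sketch of one way to check this is to average the accuracy over several seeds. The helper mean_accuracy and the synthetic data are hypothetical and are not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (X, Y); not the Pokémon data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

def mean_accuracy(test_size, seeds=range(10)):
    """Mean and standard deviation of KNN test accuracy over several random splits."""
    scores = []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_demo, y_demo, test_size=test_size, random_state=seed)
        scores.append(KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te))
    return np.mean(scores), np.std(scores)

for test_size in (0.2, 0.5, 0.8):
    mean, std = mean_accuracy(test_size)
    print(f'test_size={test_size}: {mean:.3f} +/- {std:.3f}')
```

If the spread across seeds is as large as the differences between test_size values, the single-seed curve should be read with caution.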