status状态字段是否要创建索引?

背景

​ 前段时间在公司工作中我建议创建的某个表字段索引在内部评审的时候被删除了,理由是”状态枚举值少的索引效果不好,甚至会引发全表扫描,例如性别男女“。虽然不是什么大事,直到现在也没有什么明显影响,但这个事情给我心底留下了疑问,“状态枚举字段是否应该建立索引?”。

​ 业务场景:商品报名业务,报名以后保存了其中一个表,状态为“待发送”。另外有个界面可查询此状态数据进行手动操作调用下游接口完成相应业务操作,或者由定时任务定期扫描待发送数据自动完成。send_flag字段0-未发送,1-已发送。send_flag=0的数据每天小于500条,表中记录约30万条记录,表在以每个月5万条记录增长。(数据库是mysql,使用的是Innodb引擎)

​ 相信这种类似业务在平时工作中经常会遇到。比如说状态status(0\1\2\3)、性别sex(0未知\1男\2女)等。你在遇到这类场景的时候,是否会给"send_flag"创建索引?为什么很多人会说可能会引发全表扫描?

准备数据

为了测试,我在本地创建了一个表test_product,插入了150万条记录,开启了慢查询时间为10毫秒。除自增ID以外,无索引。

验证以下几个问题

加索引以后是否会提升查询效率?

为了模拟业务场景,在150万条记录中我设置500条记录的send_flag=0,设置send_flag的记录是随机选的。

CREATE PROCEDURE test.test()
begin
	DECLARE i INT;# 申明变量
SET i = 0;  # 变量赋值
WHILE i<500 DO # 结束循环的条件
update test_product set send_flag = 0 where id = (select ceiling(rand()*1500000) from dual);
SET i = i+1;    # 循环一次,i加1
END WHILE;  # 结束while循环
#结束循环执行语句
end 

因为自己PC配置问题,执行时间较长。

  • send_flag未添加索引的查询情况

执行语句

mysql> select count(1) from test_product where send_flag=0;
+----------+
| count(1) |
+----------+
|      503 |
+----------+
1 row in set (2.26 sec)

mysql>

通过上述可以看到,大约在2秒。同样的语句我执行5次后看下慢查询

mysql> select start_time, query_time ,sql_text from mysql.slow_log order by start_time desc limit 5;
+----------------------------+-----------------+-----------------------------------------------------+
| start_time                 | query_time      | sql_text                                            |
+----------------------------+-----------------+-----------------------------------------------------+
| 2021-04-23 16:08:54.139661 | 00:00:02.260723 | select count(1) from test_product where send_flag=0 |
| 2021-04-23 16:08:51.221102 | 00:00:02.257314 | select count(1) from test_product where send_flag=0 |
| 2021-04-23 16:08:48.203821 | 00:00:02.361179 | select count(1) from test_product where send_flag=0 |
| 2021-04-23 16:08:45.247554 | 00:00:02.311678 | select count(1) from test_product where send_flag=0 |
| 2021-04-23 16:08:42.210513 | 00:00:02.402186 | select count(1) from test_product where send_flag=0 |
+----------------------------+-----------------+-----------------------------------------------------+
5 rows in set (0.00 sec)

mysql>

可以看到每次的执行时间都在2秒+。

为了看出效果,此处我执行一个无效语句用于分隔查询结果select sleep(5)

  • send_flag添加索引的查询情况

添加索引语句

CREATE INDEX test_product_send_flag_IDX USING BTREE ON test.test_product (send_flag);

同样执行语句

mysql> select count(1) from test_product where send_flag=0;
+----------+
| count(1) |
+----------+
|      503 |
+----------+
1 row in set (0.00 sec)

mysql>

执行了5次,结果都一样查询时间都是0秒。

再看下慢查询

mysql> select start_time, query_time ,sql_text from mysql.slow_log order by start_time desc limit 5;
+----------------------------+-----------------+-----------------------------------------------------+
| start_time                 | query_time      | sql_text                                            |
+----------------------------+-----------------+-----------------------------------------------------+
| 2021-04-23 16:13:39.746433 | 00:00:05.001184 | select sleep(5)                                     |
| 2021-04-23 16:08:54.139661 | 00:00:02.260723 | select count(1) from test_product where send_flag=0 |
| 2021-04-23 16:08:51.221102 | 00:00:02.257314 | select count(1) from test_product where send_flag=0 |
| 2021-04-23 16:08:48.203821 | 00:00:02.361179 | select count(1) from test_product where send_flag=0 |
| 2021-04-23 16:08:45.247554 | 00:00:02.311678 | select count(1) from test_product where send_flag=0 |
+----------------------------+-----------------+-----------------------------------------------------+
5 rows in set (0.00 sec)

mysql>

可以看到,慢查询日志最后一条语句还是我们刚才执行的sleep(5)分隔,并没有刚才执行的几条语句。

没加索引的查询耗时2s,加上索引后查询0s。由此可以说明:为send_flag创建索引能有效的提升查询效率!

但是,结论下的有点早!!!,我们接着往下看:

枚举值较少时是否会引发全表扫描?

以我的理解,应该分3种情况来验证:1)枚举值对应的数据量有较大差别,查询较少的这部分数据;2)枚举值对应的数据量有较大差别,查询较多的这部分数据;3)各类枚举值对应的数据量比较接近时查询;

接下来,我们先验证

  • 数据量有较大差别,查询较少的这部分数据

验证仍然使用上面的测试数据,总记录150万,send_flag=0有500条,剩下全部是send_flag=1。

执行explain分析

mysql> select count(1) from test_product where send_flag=0;
+----------+
| count(1) |
+----------+
|      503 |
+----------+
1 row in set (0.00 sec)

mysql> explain select count(1) from test_product where send_flag=0;
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+------+----------+-------------+
| id | select_type | table        | partitions | type | possible_keys              | key                        | key_len | ref   | rows | filtered | Extra       |
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | test_product | NULL       | ref  | test_product_send_flag_IDX | test_product_send_flag_IDX | 5       | const |  503 |   100.00 | Using index |
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql>

不出意外,此时肯定是在使用索引test_product_send_flag_IDX且查询类型type=ref。

因此,各枚举值对应的数据量有较大差别,查询较少的这部分数据是会使用索引的。

  • 数据量有较大差别,查询较多的这部分数据

验证仍然使用上面的测试数据,总记录150万,send_flag=0有500条,剩下全部是send_flag=1。

执行explain分析

mysql> select count(1) from test_product where send_flag=1;
+----------+
| count(1) |
+----------+
|  1521367 |
+----------+
1 row in set (0.37 sec)

mysql> explain select count(1) from test_product where send_flag=1;
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+--------+----------+-------------+
| id | select_type | table        | partitions | type | possible_keys              | key                        | key_len | ref   | rows   | filtered | Extra       |
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+--------+----------+-------------+
|  1 | SIMPLE      | test_product | NULL       | ref  | test_product_send_flag_IDX | test_product_send_flag_IDX | 5       | const | 866953 |   100.00 | Using index |
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql>

此时可以看到查询执行相对比上一个示例长了一些,但是仍然在使用索引test_product_send_flag_IDX查询类型是type=ref。

因此,各枚举值对应的数据量有较大差别,查询较多的这部分数据也是会使用索引的。

  • 各类枚举值对应的数据量比较接近时查询

此时我们需要修改表中的数据,使send_flag=0和send_flag=1的数据差不多。

执行explain分析

mysql> select count(1) from test_product where send_flag=1;
+----------+
| count(1) |
+----------+
|   760672 |
+----------+
1 row in set (0.20 sec)

mysql> select count(1) from test_product where send_flag=0;
+----------+
| count(1) |
+----------+
|   761198 |
+----------+
1 row in set (0.20 sec)

mysql> explain select count(1) from test_product where send_flag=1;
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+--------+----------+-------------+
| id | select_type | table        | partitions | type | possible_keys              | key                        | key_len | ref   | rows   | filtered | Extra       |
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+--------+----------+-------------+
|  1 | SIMPLE      | test_product | NULL       | ref  | test_product_send_flag_IDX | test_product_send_flag_IDX | 5       | const | 752050 |   100.00 | Using index |
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql>

可以不管是查询send_flag=0还是1,查询耗时都差不多,也都使用了索引。

因此,当各类枚举值对应的数据量比较接近时,查询也是会使用索引的。

为什么反对在status这类字段上加索引呢?

通过上面几个实验,我们可以看到索引对查询send_flag的效率有非常明显的提升。那为什么会有很多人反对在status这类字段上增加索引呢?我们再来做几个试验证。

  • 索引各类枚举值对应的数据量比较接近时,带条件查询

此时我们接着上面的数据(send_flag=0和1的数据相差不大)测试,但这次不同的是,除了索引字段我加入了其他过滤条件。

mysql> select count(1) from test_product where send_flag=1 and commodity_code = '1018223171';
+----------+
| count(1) |
+----------+
|        1 |
+----------+
1 row in set (4.00 sec)

慢查询

mysql>  select start_time, query_time ,sql_text from mysql.slow_log order by start_time desc limit 5;
+----------------------------+-----------------+---------------------------------------------------------------------------------------+
| start_time                 | query_time      | sql_text                                                                              |
+----------------------------+-----------------+---------------------------------------------------------------------------------------+
| 2021-04-25 11:10:56.584538 | 00:00:03.861784 | select count(1) from test_product where send_flag=1 and commodity_code = '1018223171' |
| 2021-04-25 11:10:44.473629 | 00:00:03.987070 | select count(1) from test_product where send_flag=1 and commodity_code = '1018223171' |
| 2021-04-25 11:10:38.429774 | 00:00:03.932597 | select count(1) from test_product where send_flag=1 and commodity_code = '1018223171' |
| 2021-04-25 11:10:31.988441 | 00:00:03.855316 | select count(1) from test_product where send_flag=1 and commodity_code = '1018223171' |
| 2021-04-25 11:10:26.854730 | 00:00:04.032600 | select count(1) from test_product where send_flag=1 and commodity_code = '1018223171' |
+----------------------------+-----------------+---------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)

mysql>

我们看到执行已经为4秒左右,这样的效果已经很差了,我们来看看是否使用了索引?

explain

mysql> explain select count(1) from test_product where send_flag=1 and commodity_code = '1018223171';
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+--------+----------+-------------+
| id | select_type | table        | partitions | type | possible_keys              | key                        | key_len | ref   | rows   | filtered | Extra       |
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+--------+----------+-------------+
|  1 | SIMPLE      | test_product | NULL       | ref  | test_product_send_flag_IDX | test_product_send_flag_IDX | 5       | const | 752050 |    10.00 | Using where |
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql>

我们可以看到这里的这里仍然使用到了索引test_product_send_flag_IDX查询类型type=ref。但是查询耗时却很多。

  • 无索引各类枚举值对应的数据量比较接近时,带条件查询

删除掉索引字段后,同样的语句我们再来执行一下。

mysql> select count(1) from test_product where send_flag=1 and commodity_code = '1018223171';
+----------+
| count(1) |
+----------+
|        1 |
+----------+
1 row in set (2.82 sec)

慢查询(执行5次)

mysql>  select start_time, query_time ,sql_text from mysql.slow_log order by start_time desc limit 5;
+----------------------------+-----------------+---------------------------------------------------------------------------------------+
| start_time                 | query_time      | sql_text                                                                              |
+----------------------------+-----------------+---------------------------------------------------------------------------------------+
| 2021-04-25 14:26:14.341608 | 00:00:02.635658 | select count(1) from test_product where send_flag=1 and commodity_code = '1018223171' |
| 2021-04-25 14:26:10.999191 | 00:00:02.816881 | select count(1) from test_product where send_flag=1 and commodity_code = '1018223171' |
| 2021-04-25 14:26:07.321773 | 00:00:02.844335 | select count(1) from test_product where send_flag=1 and commodity_code = '1018223171' |
| 2021-04-25 14:25:49.832506 | 00:00:02.821655 | select count(1) from test_product where send_flag=1 and commodity_code = '1018223171' |
| 2021-04-25 14:24:51.352390 | 00:00:02.982645 | select count(1) from test_product where send_flag=1 and commodity_code = '1018223171' |
+----------------------------+-----------------+---------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)

有没有发现很奇怪的现象,这次执行是2.8秒,同样的语句删除掉索引以后反而比加了索引查询时间少了很多。

explain

mysql> explain select count(1) from test_product where send_flag=1 and commodity_code = '1018223171';
+----+-------------+--------------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table        | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+--------------+------------+------+---------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | test_product | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 1504101 |     1.00 | Using where |
+----+-------------+--------------+------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

可以看到执行语句并没有任何的索引,使用的是全表扫描type=ALL。但事实情况就是这样,没有索引反而比有索引的时候要快1.2秒左右

我们再看下数据相差较大时的情况。

  • 有索引枚举字段相差较大时查询带条件

此时我对数据进行了修改,使send_flag=0只有343条,而send_flag=1有150万。send_flag有索引。

执行查询较多部分数据(send_flag=1)

mysql> select count(1) from test_product where send_flag=1 and commodity_code = '1018223171';
+----------+
| count(1) |
+----------+
|        1 |
+----------+
1 row in set (10.54 sec)

查询耗时10秒

explain

mysql> explain select count(1) from test_product where send_flag=1 and commodity_code = '1018223171';
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+--------+----------+-------------+
| id | select_type | table        | partitions | type | possible_keys              | key                        | key_len | ref   | rows   | filtered | Extra       |
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+--------+----------+-------------+
|  1 | SIMPLE      | test_product | NULL       | ref  | test_product_send_flag_IDX | test_product_send_flag_IDX | 5       | const | 752050 |    10.00 | Using where |
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

查询较多的这部分数据也是使用了索引的,但耗时长达10秒。

执行查询较多部分数据(send_flag=0)

mysql> select count(1) from test_product where send_flag=0 and commodity_code = '108023319';
+----------+
| count(1) |
+----------+
|        1 |
+----------+
1 row in set (0.36 sec)

查询耗时0.3秒左右

explain

mysql> explain select count(1) from test_product where send_flag=0 and commodity_code = '108023319';
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+------+----------+-------------+
| id | select_type | table        | partitions | type | possible_keys              | key                        | key_len | ref   | rows | filtered | Extra       |
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | test_product | NULL       | ref  | test_product_send_flag_IDX | test_product_send_flag_IDX | 5       | const |  343 |    10.00 | Using where |
+----+-------------+--------------+------------+------+----------------------------+----------------------------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

可以看到查询较少的这部分数据时孔在用索引,但耗时较少。

而如果没有索引时不管是查询send_flag=1或者send_flag=0查询耗时都比较平均

mysql> select count(1) from test_product where send_flag=1 and commodity_code = '1018223171';
+----------+
| count(1) |
+----------+
|        1 |
+----------+
1 row in set (2.93 sec)

mysql> select count(1) from test_product where send_flag=0 and commodity_code = '108023319';
+----------+
| count(1) |
+----------+
|        1 |
+----------+
1 row in set (2.75 sec)

实验总结

我们将上述的实验数据整理出来

  • 只查询索引字段条件
send_flag索引查询条件send_flag数量耗时/秒索引/全表
send_flag=05032.3全表
send_flag=15030索引
send_flag=1150万0.37索引
send_flag=175万0.2索引
  • 查询索引字段+其他非索引条件
send_flag索引查询条件send_flag数量耗时/秒索引/全表
send_flag=1 and commodity_code = ‘1018223171’75万4索引
send_flag=1 and commodity_code = ‘1018223171’75万2.8全表
send_flag=1 and commodity_code = ‘1018223171’150万10索引
send_flag=0 and commodity_code = ‘108023319’343<0.5索引
send_flag=1 and commodity_code = ‘1018223171’150万2.9全表
send_flag=0 and commodity_code = ‘108023319’3432.7全表

通过上述的实验数据,我们可以得出关于枚举字段索引的结论

  1. 如果where 只查索引字段,查询会使用索引,且效率提升明显!
  2. 如果where 查询索引字段+非索引字段,如果查询索引枚举值较少的这部分数据,效率有提升;
  3. 如果where 查询索引字段+非索引字段,如果查询枚举值相差不大或者查询较多的这部分数据时,索引大大降低了查询效率!可怕的是,比全表索引效率还要低的多!

枚举值是否需要建立索引?

通过上述的实验,我们可以看到有时我们添加索引不仅不会提升效率,反而变成了累赘。

因此对于“枚举值字段是否要建立索引?”这个问题需要考虑的因素比较多,比如表中已有总索引数量、查询频率、索引字段修改是否频繁、是否会索引字段与非索引字段组合查询等等,需要综合考虑,这可能也是为什么我的添加索引提议评审时被取消的原因。

经过这一系列的实验,我对评审的结果还是赞同的,并不是说增加索引不会提升效率,而是防止出现“索引字段+非索引字段”这种情况,毕竟这类枚举值字段用到的地方比较多。

但如果片面的来看的话,单纯从“提升查询未发送记录数据的效率” 角度来说的话,为send_flag建立索引是会提升效率的。但是得保证不能出现send_flag条件与其他非索引字段同时使用的情况,否则反而成了累赘。因此是否有必要创建索引,需要综合考虑到项目其他因素!

例如“sex男女(0\1\2)”这类字段无特殊要求不需要单纯查询,则不需要建立索引。

Logo

为开发者提供学习成长、分享交流、生态实践、资源工具等服务,帮助开发者快速成长。

更多推荐