Lessons from Failed Prometheus Restarts

wyp257 · Published 2021-06-29 15:08:07

How to restart

Run directly in the background

./prometheus &

or

nohup ./prometheus --config.file=./prometheus.yml --storage.tsdb.retention.time=90d --web.listen-address=:9090 &

Reference: Prometheus startup flags

Start as a service

Add a prometheus.service file to the /etc/systemd/system/ directory. On my machine, Prometheus is installed in /data/prometheus/.

[Unit]
Description=Prometheus service
After=network.target
Wants=network.target

[Service]
ExecStart=/data/prometheus/prometheus --config.file=/data/prometheus/prometheus.yml --storage.tsdb.retention.time=90d --web.listen-address=:9090

Restart=always
RestartSec=20
TimeoutSec=300
User=root
Group=root
StandardOutput=journal
StandardError=journal
WorkingDirectory=/data/prometheus/

[Install]
WantedBy=multi-user.target

Enable start on boot, then start the service

systemctl daemon-reload

systemctl enable prometheus

systemctl start prometheus

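Prometheus restart failures: cases and summary

The following cases summarize Prometheus startup failures I have run into.
😄 Because the target machines differ between cases, the Prometheus setup (for example the install path and startup flags) differs as well.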

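Configuration file: duplicate job_name

Symptoms

The disk holding /root/ was full, and Prometheus failed to start both with nohup and as a service.

Investigation

Searching for large files turned up Prometheus data files under /root/data/. The /etc/systemd/system/prometheus.service file contained WorkingDirectory=/root/, which placed the working directory in /root/. Because the default Prometheus storage path is data/, relative to the working directory, a large amount of data had accumulated in /root/data/. After the disk space was freed Prometheus still failed to start, and further investigation showed that prometheus.yml contained a duplicated configuration entry.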

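Solution

Change the setting in /etc/systemd/system/prometheus.service to WorkingDirectory=/data/prometheus/ (on my machine /data/ has plenty of disk space; any other directory on a large disk works too);
Delete the /root/data/ directory to free disk space;
Restart Prometheus as a service: systemctl daemon-reload; systemctl start prometheus.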

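However, Prometheus still failed to restart after these steps, and the output of service prometheus status did not reveal the error. Starting it with nohup and reading the log written to nohup.out showed that prometheus.yml had a duplicated job_name under scrape_configs (a pit I dug myself 😓: an automated deployment script had appended the same entry to the configuration file).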

level=error ts=***********  caller=main.go:290 msg="Error loading config (--config.file=./prometheus.yml)" err="parsing YAML file ./prometheus.yml: found multiple scrape configs with job name \"***_node\""

After making sure every job_name in prometheus.yml is unique, start Prometheus as a service.
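When a deployment script appends to the configuration file, a quick text check for repeated job names can catch this before a restart. The sketch below is a rough illustration, assuming each job_name sits on its own line; it does not parse YAML, and the function name duplicate_job_names is invented for this example.

```shell
# duplicate_job_names FILE: print every job_name value that occurs more than once.
# A rough text-based check; promtool remains the authoritative validator.
duplicate_job_names() {
    grep -E '^[[:space:]]*-?[[:space:]]*job_name:' "$1" \
        | sed -E 's/^[^:]*:[[:space:]]*//; s/[[:space:]]+#.*$//; s/[[:space:]]+$//' \
        | tr -d "\"'" \
        | sort | uniq -d
}

# Example usage:
#   duplicate_job_names ./prometheus.yml
```

An empty result means no job name was repeated; any printed name should appear only once in scrape_configs.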

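Configuration file: multiple static_configs under one job_name

Symptoms

After the Prometheus configuration file was modified, Prometheus failed to start. The error is shown in the screenshot:
(image: prometheus_error1)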

Investigation

Running ./prometheus --config.file=/usr/local/prometheus/prometheus.yml produced the following error:

level=error ts=2021-07-30T02:31:48.084Z caller=main.go:355 msg="Error loading config (--config.file=/usr/local/prometheus/prometheus.yml)" err="parsing YAML file /usr/local/prometheus/prometheus.yml: yaml: unmarshal errors:\n  line 43: field static_configs already set in type config.ScrapeConfig\n  line 50: field static_configs already set in type config.ScrapeConfig"

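The error shows that parsing the configuration file failed. Lines 43 and 50 of the configuration file reveal that several static_configs keys were defined under a single job_name, as shown in the screenshot:
(image: prometheus_mutiple_static_configs)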

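Solution

Remove the extra static_configs keys under the job_name and keep only one (static_configs takes a list, so several target groups can be listed under that single key). Then restart Prometheus.
In addition, static_configs is the default, static way of configuring targets: every change under it requires Prometheus to be restarted (or its configuration reloaded) before it takes effect. To avoid that, use one of the service discovery mechanisms Prometheus provides, such as file_sd_configs: only the referenced target files need to be edited, and Prometheus picks up the changes automatically.
For more details, see: Prometheus Configuration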
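To make the file_sd_configs option concrete, the sketch below writes a small example setup into a scratch directory. Everything in it is a placeholder chosen for illustration (the job name node, the targets/*.json pattern, the 192.0.2.x addresses, the env label, the 5m refresh interval); none of these values come from the original setup.

```shell
# Write an illustrative file_sd setup into a scratch directory.
# All names and addresses below are placeholders, not real values.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/targets"

# Scrape config fragment: one job whose targets are read from files on disk.
cat > "$demo_dir/prometheus.yml" <<'EOF'
scrape_configs:
  - job_name: node
    file_sd_configs:
      - files:
          - targets/*.json
        refresh_interval: 5m
EOF

# Target file: editing this file changes the scraped targets without a restart.
cat > "$demo_dir/targets/node.json" <<'EOF'
[
  {
    "targets": ["192.0.2.10:9100", "192.0.2.11:9100"],
    "labels": {"env": "demo"}
  }
]
EOF

echo "wrote example files to $demo_dir"
```

With a layout like this, adding or removing a host means editing only the JSON target file; the main prometheus.yml stays unchanged.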

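Alerting rule file: tab indentation

Symptoms

After the alertmanager alerting component was enabled in the Prometheus configuration file and an alerting rule file was defined, Prometheus failed to restart with the following error:
(image: prometheus_error2)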

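Investigation

The alerting-related part of the Prometheus configuration file is shown in the screenshot:
(image: prometheus_config_file)
While narrowing the problem down, I found that Prometheus failed to restart only after the alerting rule file node-up.rules was added, so the problem lies in that file.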

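Solution

Check the configuration file with promtool, the tool that ships with Prometheus. If you are not familiar with it, run ./promtool on its own first to see the help text.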

./promtool check config prometheus.yml

The output shows that the alerting rule file uses a tab character for indentation:

Checking prometheus.yml
  SUCCESS: 1 rule files found

Checking /usr/local/prometheus/rules/node-up.rules
  FAILED:
     /usr/local/prometheus/rules/node-up.rules: yaml: line 5: found a tab character that violates indentation
     /usr/local/prometheus/rules/node-up.rules: yaml: line 5: found a tab character that violates indentation

The alerting rule file looks like this:

  1 groups:
  2 - name: node-up
  3   rules:
  4   - alert: node-up
  5     expr: up{job="node_exporter"} == 0
  6     for: 15s
  7     labels:
  8       severity: 1
  9       team: node
 10     annotations:
 11       summary: "{{ $labels.instance }} has crashed over 15s! "

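After I went through the file line by line and replaced the tab indentation with spaces, the promtool check passed and Prometheus restarted successfully.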
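Tabs are hard to spot by eye, so instead of checking line by line it can help to let grep list them. This is a minimal sketch assuming a POSIX shell; the function name find_tabs is made up for this example.

```shell
# find_tabs FILE: print "line_number:content" for every line containing a tab.
# Returns non-zero when no tab is found (standard grep behaviour).
find_tabs() {
    tab="$(printf '\t')"
    grep -n "$tab" "$1"
}

# Example usage with the rule file from this case:
#   find_tabs /usr/local/prometheus/rules/node-up.rules
```

Once the reported lines are fixed, running ./promtool check config prometheus.yml again confirms whether the rule file now parses.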

prometheus.service file: extra double quotes

Symptoms

When Prometheus was started as a service with its startup flags wrapped in double quotes "" in prometheus.service, it failed to start.
(image: prometheus_service_error)

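Investigation

In prometheus.service, the value of the --config.file flag was wrapped in double quotes "", which caused the service to fail to parse its arguments. The root cause is still unclear.
Note that quoting each complete argument on its own does not cause an error, for example:
"--config.file=/usr/local/prometheus/prometheus.yml" works; wrapping all startup arguments in a single pair of double quotes, however, does cause an error. Also, when Prometheus is started with nohup, quoted flag values work fine: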

nohup ./prometheus --config.file="/usr/local/prometheus/prometheus.yml" --web.listen-address=":9090" &

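Solution

In the service unit file, do not wrap the startup flag values in double quotes.
The same quoted form works on the command line but not in the unit file; the root cause is probably related to how systemd processes service unit files (*.service).
😄 If you are interested, feel free to dig deeper, and please share what you find.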
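The shell half of this difference is easy to demonstrate. The sketch below only shows how a shell splits and unquotes arguments; count_args is a helper invented for this example. systemd does not pass ExecStart= through a shell and applies its own quoting rules (described in the systemd.service manual), and how it treats a quote in the middle of a word may depend on the systemd version, so that part remains an assumption to verify on the affected machine.

```shell
# count_args prints how many separate arguments it received.
count_args() { echo "$#"; }

# Quotes around individual values: the shell strips them, two arguments remain.
per_value=$(count_args --config.file="/usr/local/prometheus/prometheus.yml" --web.listen-address=":9090")

# One pair of quotes around everything: a single argument, which no program
# would recognise as two separate flags.
all_in_one=$(count_args "--config.file=/usr/local/prometheus/prometheus.yml --web.listen-address=:9090")

echo "per-value quotes: $per_value argument(s)"
echo "all-in-one quotes: $all_in_one argument(s)"
```

The second case matches the observation above that quoting all startup arguments together fails: the program receives one long string instead of individual flags.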

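SELinux configuration

Symptoms

Starting Prometheus as a service failed, while starting it as a background process with nohup succeeded. The log of the failed service start (xxx-xx-x-xxx is the hostname):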

Nov 30 15:48:13 xxx-xx-x-xxx systemd[1]: Started Prometheus service.
Nov 30 15:48:13 xxx-xx-x-xxx systemd[1]: prometheus.service: Main process exited, code=exited, status=203/EXEC
Nov 30 15:48:13 xxx-xx-x-xxx systemd[1]: prometheus.service: Failed with result 'exit-code'.
Nov 30 15:48:33 xxx-xx-x-xxx systemd[1]: prometheus.service: Service RestartSec=20s expired, scheduling restart.
Nov 30 15:48:33 xxx-xx-x-xxx systemd[1]: prometheus.service: Scheduled restart job, restart counter is at 1.
Nov 30 15:48:33 xxx-xx-x-xxx systemd[1]: Stopped Prometheus service.

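Investigation

Starting Prometheus as a service failed while starting it with nohup succeeded (with the same startup flags), which points to either the prometheus.service file or the system environment. Further investigation traced the problem to the SELinux configuration.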

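Solution

Temporarily disable SELinux (run setenforce 0 in a terminal), then start Prometheus as a service. Note that SELinux must also be temporarily disabled when restarting Prometheus or reloading its configuration with curl -XPOST http://ip:port/-/reload, otherwise the operation fails. The temporary setting is lost after a reboot, when the system applies the configuration in /etc/selinux/config again.
Permanently disable SELinux: edit /etc/selinux/config, set SELINUX=disabled, reboot the machine, then start Prometheus as a service.
Give up on running Prometheus as a service and start it as a background process with nohup instead.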
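Before changing the SELinux mode, it can be useful to record the current state. The sketch below is a small illustration assuming the usual SELinux userland tools; the function name selinux_report is hypothetical, and the script falls back to a message when getenforce is not installed.

```shell
# selinux_report BINARY: print the SELinux mode and the binary's security context.
# Degrades gracefully on systems without the SELinux tools installed.
selinux_report() {
    binary="$1"
    if command -v getenforce >/dev/null 2>&1; then
        echo "mode: $(getenforce)"
    else
        echo "mode: unknown (getenforce not installed)"
    fi
    # -Z asks ls to print the security context of the file.
    ls -Z "$binary" 2>/dev/null || echo "context: unavailable for $binary"
}

# Example usage with the install path from the unit file above:
#   selinux_report /data/prometheus/prometheus
```

Where the audit tools are installed, ausearch -m avc -ts recent lists recent SELinux denials. Whether such a denial explains the status=203/EXEC seen here is an assumption, since the original logs do not include the audit output.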

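Summary

❤️ After modifying the Prometheus configuration file, I strongly recommend running ./promtool check config prometheus.yml first to check it for problems;
If Prometheus is run with nohup or directly as ./prometheus, the log output shows what is wrong with the service, and the problem can be fixed accordingly;
If Prometheus fails to run as a service, check its unit file (/etc/systemd/system/prometheus.service). When service prometheus status does not give a clear cause, on CentOS you can read the logs with journalctl -u prometheus.service (prometheus.service must set StandardOutput=journal and StandardError=journal); you can also look through /var/log/messages and filter for the relevant entries, or start Prometheus with nohup and analyze its log output.
If you do not want to restart Prometheus after every configuration change, reload the configuration instead. Reloading requires adding --web.enable-lifecycle to the Prometheus startup flags (for example ./prometheus --web.enable-lifecycle). After modifying the configuration file (the parts not covered by service discovery), run curl -XPOST http://ip:port/-/reload, where ip:port is the IP of the node and the port Prometheus listens on.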
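The first and last recommendations combine naturally: validate first, reload only if validation passes. The sketch below is one way to wire that up, assuming promtool and curl are on PATH and Prometheus was started with --web.enable-lifecycle; the function name reload_if_valid and the example path and address are placeholders.

```shell
# reload_if_valid CONFIG BASE_URL
# Validate CONFIG with promtool; only on success ask Prometheus to reload.
# Requires Prometheus to be running with --web.enable-lifecycle.
reload_if_valid() {
    config="$1"
    base_url="$2"
    if promtool check config "$config"; then
        # -f makes curl return non-zero on HTTP errors.
        curl -fsS -X POST "$base_url/-/reload"
    else
        echo "config check failed, not reloading" >&2
        return 1
    fi
}

# Example usage (path and address are placeholders):
#   reload_if_valid /data/prometheus/prometheus.yml http://localhost:9090
```

Checking before reloading keeps a broken file from ever reaching the running server, which is the situation most of the cases above ended up in.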