Detailed steps and common problems when installing and deploying a Slurm cluster on CentOS 7.
Installation and deployment steps for a distributed Slurm cluster.
Reference articles:
https://www.cnblogs.com/liu-shaobo/p/13285839.html
https://cndaqiang.github.io/2019/09/19/Centos7-CC19/
https://blog.csdn.net/Datuqiqi/article/details/50827040
https://blog.csdn.net/weixin_42506905/article/details/100165253
https://www.cnblogs.com/liwanliangblog/p/9194244.html
https://blog.csdn.net/heguangsui123/article/details/94750192
Manuals:
https://bicmr.pku.edu.cn/~wenzw/pages/slurm.html
https://www.jianshu.com/p/ca12944eab67
Basic usage of the Slurm cluster resource manager:
https://cloud.tencent.com/developer/article/1607964
https://www.ityww.cn/1470.html
https://www.jianshu.com/p/99fd802f0577
I. Basic Environment
1. Hostnames and IPs
Control node: 192.168.8.150  m1
Compute node: 192.168.8.145  c1
Compute node: 192.168.8.144  c2
Set the hostname on each of the three nodes:
# hostnamectl set-hostname m1
# hostnamectl set-hostname c1
# hostnamectl set-hostname c2
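Munge and Slurm address nodes by hostname, so each node must be able to resolve the others' names. If no DNS is available, a minimal /etc/hosts fragment for the three nodes above (a sketch; append it to /etc/hosts on every node) would be:

```
192.168.8.150 m1
192.168.8.145 c1
192.168.8.144 c2
```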
2. Host specifications
OS: CentOS 7.6 x86_64
192.168.8.145
Disk: 234 GB  CPU: 2 cores  RAM: 15 GB
192.168.8.150
Disk: 234 GB  CPU: 2 cores  RAM: 15 GB
3. Disable the firewall
# systemctl stop firewalld
# systemctl disable firewalld
# systemctl stop iptables
# systemctl disable iptables
4. Adjust resource limits
# cat /etc/security/limits.conf
* hard nofile 1000000
* soft nofile 1000000
* soft core unlimited
* soft stack 10240
* soft memlock unlimited
* hard memlock unlimited
// Notes on limits.conf configuration:
https://blog.csdn.net/weixin_30917213/article/details/101093058
vi /etc/security/limits.conf
* soft nofile 655360  # open files (-n); do not set this to unlimited
* hard nofile 655360  # do not exceed the maximum of 1048576; do not set to unlimited
* soft nproc 655650
* hard nproc 655650   # max user processes (-u)
hive - nofile 655650
hive - nproc 655650
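Changes to limits.conf only take effect for new login sessions. A quick sketch for checking the limits that the current shell actually has:

```shell
# Print the soft and hard open-file limits of the current shell;
# after logging in again they should match the values configured above.
echo "soft nofile: $(ulimit -Sn)"
echo "hard nofile: $(ulimit -Hn)"
```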
5. Configure the time zone
Set the CST time zone:
# ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
Sync with an NTP server:
# yum install ntp -y
# systemctl start ntpd
# systemctl enable ntpd
Install the EPEL repository:
# yum install http://mirrors.sohu.com/fedora-epel/epel-release-latest-7.noarch.rpm
6. Install NFS (control node)
# yum -y install nfs-utils rpcbind
Create the exports file. Edit /etc/exports (vim /etc/exports) so that it contains:
/software/ *(rw,async,insecure,no_root_squash)
Check the status:
# systemctl status nfs
Start NFS:
# systemctl start nfs
# systemctl start rpcbind
# systemctl enable nfs
# systemctl enable rpcbind
Mount NFS on the clients
On S1:
# yum -y install nfs-utils
# mkdir /software
# mount 192.168.8.150:/software /software
On S2:
$ mount -t nfs 192.168.8.150:/software /software
The general form is: mount -t nfs <server IP>:<server's exported directory> <client's local directory>.
After mounting, working in this directory on the client is equivalent to working in the exported directory on the server; changes made on either side are visible on the other.
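The mount command above does not survive a reboot. To make the NFS mount persistent, an entry can be added to /etc/fstab on each client (a sketch using the server IP from this guide; the _netdev option delays mounting until the network is up):

```
# /etc/fstab on the NFS clients
192.168.8.150:/software  /software  nfs  defaults,_netdev  0 0
```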
7. Configure passwordless SSH:
# ssh-keygen
# ssh-copy-id -i .ssh/id_rsa.pub c1
# ssh-copy-id -i .ssh/id_rsa.pub c2
//----- Linux - passwordless SSH - basic usage of ssh-keygen -----
https://www.cnblogs.com/shoufeng/p/11022258.html
1. ssh-keygen creates a public/private key pair
2. ssh-copy-id sends A's public key to B
3. A can then log in to B without a password
Step 1: generate the key pair on the local machine with ssh-keygen:
$ ssh-keygen
Step 2: copy the public key to the remote machine with ssh-copy-id:
$ ssh-copy-id -i .ssh/id_rsa.pub root@192.168.8.150
Note: ssh-copy-id appends the key to ~/.ssh/authorized_keys on the remote machine.
Step 3: log in to the remote machine without entering a password:
$ ssh username@192.168.x.xxx
//---- Error when configuring passwordless SSH: /usr/bin/ssh-copy-id: ERROR: failed to open ID file '.ssh/id_rsa.pub': No such file or directory
https://blog.csdn.net/feifuzeng/article/details/111235160?utm_medium=distribute.pc_relevant_download.none-task-blog-baidujs-1.nonecase&depth_1-utm_source=distribute.pc_relevant_download.none-task-blog-baidujs-1.nonecase
First log in to 180.8.5.101 and perform the following three steps.
Step 1: run ssh-keygen in /root/.ssh to generate the key pair:
ssh-keygen
then press Enter through all the prompts.
Step 2: copy the public key to the remote machine with ssh-copy-id:
ssh-copy-id -i .ssh/id_rsa.pub root@192.168.8.145
Note: ssh-copy-id appends the key to ~/.ssh/authorized_keys on the remote machine.
Step 3: log in to the remote machine without entering a password:
ssh root@192.168.8.145
II. Configure Munge
Remove a previously failed Munge installation:
yum remove munge munge-libs munge-devel -y
Delete the munge user:
userdel -r munge
1. Create the Munge user
The munge user must have the same UID and GID on the master node and all compute nodes, and Munge must be installed on every node.
# groupadd -g 1108 munge
# useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge
2. Seed the entropy pool
# yum install -y rng-tools
Use /dev/urandom as the entropy source:
# rngd -r /dev/urandom
# vim /usr/lib/systemd/system/rngd.service
Change the following line in the [Service] section:
[Service]
ExecStart=/sbin/rngd -f -r /dev/urandom
Check the status:
# systemctl status rngd
Save and exit, then reload and start the service:
# systemctl daemon-reload
# systemctl start rngd
# systemctl enable rngd
3. Deploy Munge
Munge is an authentication service that validates the UID and GID of local or remote host processes.
# yum install munge munge-libs munge-devel -y
Create the global key
Create the cluster-wide key on the master node:
# /usr/sbin/create-munge-key -r
# dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
Copy the key to every compute node:
# scp -p /etc/munge/munge.key root@192.168.8.145:/etc/munge
# scp -p /etc/munge/munge.key root@192.168.8.144:/etc/munge
On every node, set the key's ownership and permissions:
# chown munge: /etc/munge/munge.key
# chmod 400 /etc/munge/munge.key
Check the status:
# systemctl status munge
Start the service on all nodes:
# systemctl start munge
# systemctl enable munge
// Stop the service
systemctl stop munge
// Check the status
systemctl status munge
//---------- Startup failure: Job for munge.service failed because the control process exited with error code
See "systemctl status munge.service" and "journalctl -xe" for details.
Check /var/log/munge/munged.log:
Error: Failed to check pidfile dir "/var/run/munge": cannot canonicalize "/var/run/munge": Permission denied
Check the ownership under /var/run and fix it:
chown -R munge: /var/run/munge
-rwxr-xr-x (755): only the owner has read, write, and execute permission; the group and others have only read and execute.
4. Test the Munge service
Verify the connection between each compute node and the control node.
Generate a credential locally:
# munge -n
Decode it locally:
# munge -n | unmunge
Verify a compute node by decoding remotely:
# munge -n | ssh 192.168.8.145 unmunge
//-------- Error: unmunge: Error: Invalid credential
Restart the munge service on the compute node.
Benchmark Munge credential throughput:
# remunge
III. Configure Slurm
1. Create the Slurm user
# groupadd -g 1109 slurm
# useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm
2. Install Slurm dependencies
# yum install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel -y
Build Slurm
# wget https://download.schedmd.com/slurm/slurm-20.02.7.tar.bz2
Build the Slurm RPM packages with rpmbuild (use the same version you downloaded; the examples below build 21.08.0-0rc1):
# yum install rpm-build
# rpmbuild -ta slurm-21.08.0-0rc1.tar.bz2
// If rpmbuild fails with the following error:
error: Failed build dependencies:
python3 is needed by slurm-21.08.0-0rc1.el7.x86_64
Solution: build the packages while ignoring dependency checks by adding --nodeps:
rpmbuild -ta --nodeps slurm-21.08.0-0rc1.tar.bz2
--nodeps  # do not check build-time dependencies
cd into the directory containing the built RPM packages:
# cd /root/rpmbuild/RPMS/x86_64/
Install Slurm on all nodes:
yum localinstall slurm-*
//-------------- yum install error: Error: Protected multilib versions:
Solution: append --setopt=protected_multilib=false to the command.
Compute-node build error: Bad exit status from /var/tmp/rpm-tmp.EJn6d9 (%build)
Python 3 needs to be installed first.
Extract:
tar -Jxf Python-3.6.2.tar.xz
Enter the directory:
cd Python-3.6.2
Create the install directory:
mkdir /usr/local/python3
Set the install prefix:
./configure --prefix=/usr/local/python3
Compile and install:
make && make install
Create symlinks:
ln -s /usr/local/python3/bin/python3 /usr/bin/python3  # symlink for python3
ln -s /usr/local/python3/bin/pip3 /usr/bin/pip3  # symlink for pip3
Verify:
python3  # start the interpreter
pip3 -V  # capital V
3. Configure Slurm on the control node
# cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
# cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
# vim /etc/slurm/slurm.conf
## Change the following settings
ControlMachine=m1
ControlAddr=192.168.8.150
SlurmUser=slurm
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
NodeName=c[1-2] RealMemory=3400 Sockets=1 CoresPerSocket=4 State=IDLE
PartitionName=all Nodes=c[1-2] Default=YES State=UP
Inspect the resulting configuration:
scontrol show config
Copy the control node's configuration files to the compute nodes:
# scp /etc/slurm/*.conf root@192.168.8.145:/etc/slurm/
# scp /etc/slurm/*.conf c2:/etc/slurm/
4. Create the Slurm directories on the control and compute nodes and set their ownership
# mkdir /var/spool/slurm
# chown slurm: /var/spool/slurm
# mkdir /var/log/slurm
# chown slurm: /var/log/slurm
5. Configure Slurm accounting on the control node
Accounting records collect information about job steps. They can be written to a plain text file or to a database; the text file grows without bound, so the simplest robust option is to store the records in MySQL.
Create the Slurm database user (install MySQL yourself):
mysql -u root -p
//----------- Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) --------
mysql -h192.168.8.144 -uroot -proot
mysql> grant all on slurm_acct_db.* to 'slurm'@'%' identified by 'root' with grant option;
//------------ MySQL password policy: Your password does not satisfy the current policy requirements
Check the current password policy:
SHOW VARIABLES LIKE 'validate_password%';
First lower the password validation strength by setting the global validate_password_policy parameter to LOW:
set global validate_password_policy=LOW;
The default minimum password length is 8. If that is acceptable, leave it alone; to allow a shorter password, lower the global validate_password_length parameter, e.g. to 4:
set global validate_password_length=4;
MySQL password-policy parameters:
1) validate_password_length: minimum total password length;
2) validate_password_dictionary_file: path of the dictionary file used for validation;
3) validate_password_mixed_case_count: minimum number of upper/lower-case letters in the password;
4) validate_password_number_count: minimum number of digits in the password;
5) validate_password_policy: password strength level, MEDIUM by default;
  1. LOW: checks length only;
  2. MEDIUM: checks length, digits, letter case, and special characters;
  3. STRONG: checks length, digits, letter case, special characters, and the dictionary file;
6) validate_password_special_char_count: minimum number of special characters in the password;
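If the accounting database does not exist yet, it can also be created ahead of time from the mysql prompt. A sketch, reusing the database name, user, and password from the examples above (slurmdbd can also create the database itself on first start if the slurm user has sufficient privileges):

```sql
-- Create the accounting database and grant the slurm user full access.
-- The password 'root' matches the StoragePass used in slurmdbd.conf below.
CREATE DATABASE IF NOT EXISTS slurm_acct_db;
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'%' IDENTIFIED BY 'root' WITH GRANT OPTION;
FLUSH PRIVILEGES;
```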
Configure the slurmdbd.conf file
# cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdAddr=192.168.8.150
DbdHost=m1
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=192.168.8.144
StorageUser=slurm
StoragePass=root
StorageLoc=slurm_acct_db
chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
chown slurm: /var/log/slurm/slurmdbd.log
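For slurmctld to actually send records to slurmdbd, slurm.conf also needs accounting-storage settings pointing at the slurmdbd host. A sketch (the host name m1 matches this guide's master node; these are standard slurm.conf parameters):

```
# In /etc/slurm/slurm.conf on all nodes:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=m1       # node running slurmdbd
JobAcctGatherType=jobacct_gather/linux
```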
6. Start the services on each node
Check the status:
# systemctl status slurmdbd
Start the slurmdbd service on the control node:
# systemctl start slurmdbd
# systemctl enable slurmdbd
//-------------- Failed to start slurmdbd.service: Unit not found
First check whether the service appears in the unit list:
systemctl list-unit-files --type=service
If it does:
systemctl daemon-reload
Start the slurmctld service on the control node
Start the cluster:
On the master node, run slurmctld -c and slurmd -c, both as root.
On every slave node, run slurmd -c.
# systemctl start slurmctld
# systemctl status slurmctld
# systemctl enable slurmctld
Start the service on the compute nodes:
# systemctl start slurmd
# systemctl status slurmd
# systemctl enable slurmd
//----------- On the control node, systemctl status slurmdbd reports: Can't open PID file /var/run/slurmdbd.pid (yet?) after start: No such file or directory
Check whether the file exists; if not, create the PID file:
touch /var/run/slurmdbd.pid
Grant permissions: chmod -R 777 /var/run/slurmdbd.pid
Problem solved.
//----------- On the control node, systemctl enable slurmctld reports: Failed to parse PID from file /var/run/slurmctld.pid: Invalid argument
Check the logs:
journalctl -xe
Then inspect the slurmctld unit file, /usr/lib/systemd/system/slurmctld.service,
which contains the line:
PIDFile=/var/run/slurmctld.pid
Comment out that line and restart; that resolves the problem:
systemctl daemon-reload
If the problem persists, close and reopen the Xshell session, then run systemctl start slurmd again.
//------- sinfo reports: slurm_load_partitions: Unable to contact slurm controller (connect failure)
# vim /etc/slurm/slurm.conf
## Change the following settings
ControlMachine=m1
ControlAddr=192.168.8.150
SlurmUser=slurm
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
NodeName=m1 NodeAddr=192.168.8.150 CPUs=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=200 Procs=1 State=UNKNOWN
NodeName=c1 NodeAddr=192.168.8.145 CPUs=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=200 Procs=1 State=UNKNOWN
PartitionName=control Nodes=m1 Default=NO MaxTime=INFINITE State=UP
PartitionName=compute Nodes=c1 Default=YES MaxTime=INFINITE State=UP
Note: values such as CPUs=24 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=30000 Procs=1 must be adjusted to match your own server's resources.
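To avoid guessing these values, the figures for each node's NodeName line can be read from the node itself. A small sketch (run on each node; RealMemory is in MB and is usually set somewhat below the physical total):

```shell
# Count usable CPUs and total memory in MB for the slurm.conf NodeName line.
cpus=$(nproc)
mem_mb=$(awk '/^MemTotal:/ {printf "%d", $2/1024}' /proc/meminfo)
echo "NodeName=$(hostname) CPUs=$cpus RealMemory=$mem_mb State=UNKNOWN"
```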
Inspect the configuration:
scontrol show config
If you modify slurm.conf, run scontrol reconfig on the master to reload the configuration.
Then reload systemd as well:
systemctl daemon-reload
IV. Verify the Slurm Cluster
View the cluster:
# sinfo
# scontrol show partition
# scontrol show node
Submit a job:
# srun -N2 hostname
# scontrol show jobs
View the queue:
# squeue -a
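Beyond srun, jobs are usually submitted as batch scripts with sbatch. A minimal sketch (the partition name compute comes from the slurm.conf example above; adjust it to your own partition):

```shell
#!/bin/bash
#SBATCH --job-name=hello       # job name shown in squeue
#SBATCH --partition=compute    # partition defined in slurm.conf
#SBATCH --nodes=1              # run on a single node
#SBATCH --ntasks=1             # one task
#SBATCH --output=hello_%j.out  # %j expands to the job ID

# The job body is ordinary shell; print the node the job ran on.
hostname
```

Submit with sbatch hello.sh, then watch the job with squeue and read the result from hello_<jobid>.out.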