Reference articles:

https://www.cnblogs.com/liu-shaobo/p/13285839.html

https://cndaqiang.github.io/2019/09/19/Centos7-CC19/

https://blog.csdn.net/Datuqiqi/article/details/50827040

https://blog.csdn.net/weixin_42506905/article/details/100165253

https://www.cnblogs.com/liwanliangblog/p/9194244.html

https://blog.csdn.net/heguangsui123/article/details/94750192

Manuals:

https://bicmr.pku.edu.cn/~wenzw/pages/slurm.html
https://www.jianshu.com/p/ca12944eab67

Basic usage of the Slurm cluster resource manager:

https://cloud.tencent.com/developer/article/1607964
https://www.ityww.cn/1470.html
https://www.jianshu.com/p/99fd802f0577

I. Basic environment

1. Hostnames and IPs
Control node: 192.168.8.150 m1
Compute node: 192.168.8.145 c1
Compute node: 192.168.8.144 c2

Set the hostname on each of the three nodes:

# hostnamectl set-hostname m1
# hostnamectl set-hostname c1
# hostnamectl set-hostname c2

2. Host specifications

OS: CentOS 7.6 x86_64

192.168.8.145
Disk: 234 GB, CPU: 2 cores, Memory: 15 GB

192.168.8.150
Disk: 234 GB, CPU: 2 cores, Memory: 15 GB

3. Disable the firewall

# systemctl stop firewalld
# systemctl disable firewalld
# systemctl stop iptables
# systemctl disable iptables
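If the firewall has to stay enabled, an alternative (a sketch, not part of the original setup) is to open the default Slurm ports instead of disabling firewalld; 6817 (slurmctld), 6818 (slurmd) and 6819 (slurmdbd) are the Slurm defaults:

# firewall-cmd --permanent --add-port=6817-6819/tcp
# firewall-cmd --reload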

4. Adjust resource limits


# cat /etc/security/limits.conf 
* hard nofile 1000000
* soft nofile 1000000
* soft core unlimited
* soft stack 10240
* soft memlock unlimited
* hard memlock unlimited

// Notes on limits.conf configuration:
https://blog.csdn.net/weixin_30917213/article/details/101093058

vi /etc/security/limits.conf

* soft nofile 655360 # open files (-n); do not set this to unlimited
* hard nofile 655360 # do not exceed the maximum of 1048576; do not set to unlimited
* soft nproc 655650
* hard nproc 655650 # max user processes (-u)

hive - nofile 655650
hive - nproc 655650

5. Configure the time zone
Set the CST (Asia/Shanghai) time zone:

# ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

Synchronize with an NTP server:

# yum install ntp -y
# systemctl start ntpd
# systemctl enable ntpd
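To confirm that time is actually being synchronized (clock skew will break Munge authentication later on), the ntp package's query tool can be used; this check is an addition, not one of the original steps:

# ntpq -p        # list the peers ntpd is syncing with
# timedatectl    # shows "NTP synchronized: yes" once sync is established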

Install the EPEL repository:

# yum install http://mirrors.sohu.com/fedora-epel/epel-release-latest-7.noarch.rpm

6. Install NFS (control node)

# yum -y install nfs-utils rpcbind

Create the exports file

Edit the file with vim /etc/exports so that it contains the following line:

/software/ *(rw,async,insecure,no_root_squash)

Check the status:

# systemctl status nfs

Start NFS:

# systemctl start nfs
# systemctl start rpcbind
# systemctl enable nfs
# systemctl enable rpcbind
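A quick way to confirm that the export is active before touching the clients (an extra check, using standard nfs-utils commands):

# exportfs -v                  # list the directories currently exported
# showmount -e 192.168.8.150   # what the clients will see from the server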

Mount the NFS share on the clients

On c1:

# yum -y install nfs-utils
# mkdir /software
# mount 192.168.8.150:/software /software

On c2:
$ mount -t nfs 192.168.8.150:/software /software
The general form is: mount -t nfs <server IP>:<directory exported by the server> <local mount point on the client>.
After mounting, working in that directory on the client is the same as working in the server's exported directory; likewise, changes made in the exported directory on the server are immediately visible on the client.
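The mount command above does not survive a reboot. A minimal sketch for making it permanent, assuming the same server and mount point as above, is an /etc/fstab entry on each compute node:

# echo "192.168.8.150:/software  /software  nfs  defaults,_netdev  0 0" >> /etc/fstab
# mount -a    # re-reads /etc/fstab and mounts anything not yet mounted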

7. Configure passwordless SSH:

# ssh-keygen
# ssh-copy-id -i .ssh/id_rsa.pub c1
# ssh-copy-id -i .ssh/id_rsa.pub c2

//----- Linux - Configuring passwordless SSH - basic usage of ssh-keygen -----
https://www.cnblogs.com/shoufeng/p/11022258.html

1. ssh-keygen creates the public/private key pair
2. ssh-copy-id sends A's public key to B
3. A can then log in to B without a password

Step 1: generate a key pair on the local machine with ssh-keygen:
$ ssh-keygen
Step 2: copy the public key to the remote machine with ssh-copy-id:
$ ssh-copy-id -i .ssh/id_rsa.pub root@192.168.8.150
Note: ssh-copy-id appends the key to the ~/.ssh/authorized_keys file on the remote machine.

Step 3: log in to the remote machine without entering a password:
$ ssh <username>@192.168.x.xxx

//---- Error while configuring passwordless SSH: /usr/bin/ssh-copy-id: ERROR: failed to open ID file '.ssh/id_rsa.pub': No such file or directory

https://blog.csdn.net/feifuzeng/article/details/111235160?utm_medium=distribute.pc_relevant_download.none-task-blog-baidujs-1.nonecase&depth_1-utm_source=distribute.pc_relevant_download.none-task-blog-baidujs-1.nonecase

First, log in to the source machine (180.8.5.101 in the referenced post; here that would be the control node) and carry out the following three steps.

Step 1: run ssh-keygen in the /root/.ssh directory to generate the key pair

ssh-keygen

then press Enter at every prompt

Step 2: copy the public key to the remote machine with ssh-copy-id

ssh-copy-id -i .ssh/id_rsa.pub root@192.168.8.145

Note: ssh-copy-id appends the key to ~/.ssh/authorized_keys on the remote machine

Step 3: log in to the remote machine without entering a password
ssh root@192.168.8.145
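After the keys are in place, a short loop from the control node verifies that both compute nodes are reachable without a password prompt (a sketch using the hostnames defined in section I):

# for h in c1 c2; do ssh -o BatchMode=yes $h hostname; done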

II. Configure Munge

Remove any previously failed Munge installation:

yum remove munge munge-libs munge-devel -y

Delete the user:

userdel -r munge

1. Create the Munge user
The munge user must have the same UID and GID on the master node and on all compute nodes, and Munge has to be installed on every node.

# groupadd -g 1108 munge
# useradd -m -c "Munge Uid 'N' Gid Emporium" -d /var/lib/munge -u 1108 -g munge -s /sbin/nologin munge

2. Feed the entropy pool:

# yum install -y rng-tools

Use /dev/urandom as the entropy source:

# rngd -r /dev/urandom
# vim /usr/lib/systemd/system/rngd.service

Modify the following parameter:
[Service]
ExecStart=/sbin/rngd -f -r /dev/urandom

Check the status:

# systemctl status rngd

Save and exit, then reload and start the service:

# systemctl daemon-reload
# systemctl start rngd
# systemctl enable rngd

3. Deploy Munge
Munge is an authentication service that verifies the UID and GID of processes on local or remote hosts.

# yum install munge munge-libs munge-devel -y

Create the global key
On the master node, create the key that the whole cluster will share (both commands below write /etc/munge/munge.key; the dd command overwrites it with fresh data from /dev/urandom):

# /usr/sbin/create-munge-key -r
# dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key

Copy the key to every compute node:

# scp -p /etc/munge/munge.key root@192.168.8.145:/etc/munge
# scp -p /etc/munge/munge.key root@192.168.8.144:/etc/munge
# chown munge: /etc/munge/munge.key
# chmod 400 /etc/munge/munge.key
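The chown/chmod above must be applied on every node that received the key, not only on the master. A small loop (a sketch, reusing the root SSH access configured earlier) saves repeating it by hand:

# for h in c1 c2; do ssh $h "chown munge: /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"; done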

Check the status:

# systemctl status munge

Run the start commands on all nodes:

# systemctl start munge
# systemctl enable munge

// Stop the service

systemctl stop munge

// Check the status

systemctl status munge

//---------- Startup failure: Job for munge.service failed because the control process exited with error code.
See "systemctl status munge.service" and "journalctl -xe" for details.

Check /var/log/munge/munged.log:
Error: Failed to check pidfile dir "/var/run/munge": cannot canonicalize "/var/run/munge": Permission denied
Check the ownership under /var/run and give the directory to the munge user:
chown -R munge /var/run/munge

-rwxr-xr-x (755): only the owner has read, write and execute permission; the group and others have read and execute only.

4. Test the Munge service
Verify the connection between each compute node and the control node.

Generate a credential locally:

# munge -n

Decode it locally:

# munge -n | unmunge

Verify a compute node by decoding the credential remotely:

# munge -n | ssh 192.168.8.145 unmunge

//-------- Error: unmunge: Error: Invalid credential
Restart the munge service on the compute node.

Benchmark Munge credential throughput:

# remunge
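To run the remote check against every compute node in one go (an extra convenience, assuming the hostnames from section I resolve), the credential can be piped to each node and only the status line kept:

# for h in c1 c2; do echo -n "$h: "; munge -n | ssh $h unmunge | grep STATUS; done
On success each node prints: STATUS: Success (0)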

III. Configure Slurm
1. Create the Slurm user

# groupadd -g 1109 slurm
# useradd -m -c "Slurm manager" -d /var/lib/slurm -u 1109 -g slurm -s /bin/bash slurm

2. Install Slurm dependencies

# yum install gcc gcc-c++ readline-devel perl-ExtUtils-MakeMaker pam-devel rpm-build mysql-devel -y

Build Slurm. Download the source tarball (note that the rpmbuild commands below use slurm-21.08.0-0rc1, so download the version you actually intend to build):

# wget https://download.schedmd.com/slurm/slurm-20.02.7.tar.bz2

Install rpm-build and use rpmbuild to produce the RPM packages from the tarball:

# yum install rpm-build
# rpmbuild -ta slurm-21.08.0-0rc1.tar.bz2

// If rpmbuild fails with the following error:
error: Failed build dependencies:
python3 is needed by slurm-21.08.0-0rc1.el7.x86_64

Fix: have rpmbuild skip the dependency check when building the package by adding --nodeps:
rpmbuild -ta --nodeps slurm-21.08.0-0rc1.tar.bz2

--nodeps # do not check build dependencies when building the package
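If the mirrors in use provide a python3 package (it is in the CentOS 7.7+ base repositories and in EPEL), installing it is an alternative to --nodeps and to the source build described further below, since it satisfies the build dependency directly; this is a suggestion, not part of the original procedure:

# yum install -y python3
# rpmbuild -ta slurm-21.08.0-0rc1.tar.bz2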

cd into the directory containing the built RPMs:

# cd /root/rpmbuild/RPMS/x86_64/

Install Slurm on all nodes:

yum localinstall slurm-*

//-------------- yum install error: Error: Protected multilib versions:
Fix: append --setopt=protected_multilib=false to the yum command.

Build error on the compute nodes: Bad exit status from /var/tmp/rpm-tmp.EJn6d9 (%build)
Python 3 needs to be installed.

Extract the archive:
tar -Jxf Python-3.6.2.tar.xz
Enter the directory:
cd Python-3.6.2
Create the installation directory:
mkdir /usr/local/python3
Set the installation prefix:
./configure --prefix=/usr/local/python3
Compile and install:
make && make install

Create symlinks:
ln -s /usr/local/python3/bin/python3 /usr/bin/python3 # symlink for python3
ln -s /usr/local/python3/bin/pip3 /usr/bin/pip3 # symlink for pip3

Verify:
python3
pip3 -V # capital V

3. Configure Slurm on the control node

# cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
# cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
# vim /etc/slurm/slurm.conf

## Modify the following entries

ControlMachine=m1
ControlAddr=192.168.8.150
SlurmUser=slurm
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
NodeName=c[1-2] RealMemory=3400 Sockets=1 CoresPerSocket=4 State=IDLE
PartitionName=all Nodes=c[1-2] Default=YES State=UP

View the running configuration:

scontrol show config

Copy the configuration files from the control node to the compute nodes:

# scp /etc/slurm/*.conf  root@192.168.8.145:/etc/slurm/
# scp /etc/slurm/*.conf  c2:/etc/slurm/

4. Create the spool and log directories and set permissions on the control and compute nodes

# mkdir /var/spool/slurm
# chown slurm: /var/spool/slurm
# mkdir /var/log/slurm
# chown slurm: /var/log/slurm

5. Configure Slurm Accounting on the control node
Accounting records collect information about job steps for Slurm. They can be written to a plain text file or to a database; since the text file grows without bound, the simplest approach is to store the data in MySQL.
Create the Slurm database user (install MySQL yourself):

mysql -u root -p

//----------- Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2) --------
If this occurs, connect over TCP instead:
mysql -h192.168.8.144 -uroot -proot

mysql> grant all on slurm_acct_db.* to 'slurm'@'%' identified by 'root' with grant option;

//------------ MySQL password policy error: Your password does not satisfy the current policy requirements
Check the current password policy:
SHOW VARIABLES LIKE 'validate_password%';

First lower the password validation level by setting the global validate_password_policy variable to LOW:
set global validate_password_policy=LOW;

The default minimum password length is 8. If that is acceptable, leave it; otherwise lower the validate_password_length global variable, for example:

set global validate_password_length=4;
MySQL password policy parameters:
1) validate_password_length: minimum total length of the password;
2) validate_password_dictionary_file: path to the dictionary file used for validation;
3) validate_password_mixed_case_count: minimum number of upper-/lower-case letters the password must contain;
4) validate_password_number_count: minimum number of digits the password must contain;
5) validate_password_policy: strength level of the validation, MEDIUM by default;
   1. LOW: checks length only;
   2. MEDIUM: checks length, digits, upper/lower case and special characters;
   3. STRONG: checks length, digits, upper/lower case, special characters and the dictionary file;
6) validate_password_special_char_count: minimum number of special characters the password must contain;

Configure the slurmdbd.conf file

# cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf
# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdAddr=192.168.8.150
DbdHost=m1
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=192.168.8.144
StorageUser=slurm
StoragePass=root
StorageLoc=slurm_acct_db

chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
chown slurm: /var/log/slurm/slurmdbd.log

6. Start the node services

Check the status:

# systemctl status slurmdbd

Start the slurmdbd service on the control node:

# systemctl start slurmdbd
# systemctl enable slurmdbd

//-------------- Failed to start slurmdbd.service: Unit not found
First check whether the service appears in the unit list:
systemctl list-unit-files --type=service
If it is there, reload systemd:
systemctl daemon-reload
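Once slurmdbd is up, the cluster still has to be registered in the accounting database before sacct/sreport show anything. A minimal sketch, assuming ClusterName in slurm.conf is "cluster" (adjust it to whatever your slurm.conf actually sets):

# sacctmgr -i add cluster cluster   # -i answers the confirmation prompt automatically
# sacctmgr show cluster             # verify that the cluster is registered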

Start the slurmctld service on the control node

Starting the cluster:

The master node needs to run slurmctld -c and slurmd -c, both as the root user.
Every slave (compute) node runs slurmd -c.

# systemctl start slurmctld
# systemctl status slurmctld
# systemctl enable slurmctld

Start the service on the compute nodes:

# systemctl start slurmd
# systemctl status slurmd
# systemctl enable slurmd

//----------- On the control node, systemctl status slurmdbd reports: Can't open PID file /var/run/slurmdbd.pid (yet?) after start: No such file or directory

Check whether the file exists; if it does not, create the PID file:
touch /var/run/slurmdbd.pid
Grant permissions: chmod -R 777 /var/run/slurmdbd.pid
Solved.

//----------- On the control node, systemctl enable slurmctld reports: Failed to parse PID from file /var/run/slurmctld.pid: Invalid argument
Check the logs:
journalctl -xe

Then look at the slurmctld unit file, /usr/lib/systemd/system/slurmctld.service.
It contains the line
PIDFile=/var/run/slurmctld.pid
Comment out that line, reload systemd and restart the service; the problem goes away.
systemctl daemon-reload

If the problem persists, close and reopen the Xshell session, then run systemctl start slurmd.

//------- sinfo error: slurm_load_partitions: Unable to contact slurm controller (connect failure)

# vim /etc/slurm/slurm.conf

## Modify the following entries

ControlMachine=m1
ControlAddr=192.168.8.150
SlurmUser=slurm
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
NodeName=m1 NodeAddr=192.168.8.150  CPUs=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=200 Procs=1 State=UNKNOWN
NodeName=c1 NodeAddr=192.168.8.145  CPUs=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=200 Procs=1 State=UNKNOWN
PartitionName=control Nodes=m1 Default=NO MaxTime=INFINITE State=UP
PartitionName=compute Nodes=c1 Default=YES MaxTime=INFINITE State=UP

Note: values such as CPUs=24 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=30000 Procs=1 must be adjusted to match the resources of your own servers.

View the running configuration:

scontrol show config

If slurm.conf is modified, run scontrol reconfig on the master to push the updated configuration to the cluster.
Then reload systemd:

systemctl daemon-reload
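If sinfo still lists a node as down or drained after the configuration is corrected, the reason can be inspected and the node returned to service with scontrol (a sketch; c1 is the compute node name used above):

# sinfo -R                                     # show the reason a node is down or drained
# scontrol update nodename=c1 state=resume     # put the node back into service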

IV. Check the Slurm cluster

Inspect the cluster:

# sinfo
# scontrol show partition
# scontrol show node

Submit a job:

# srun -N2 hostname
# scontrol show jobs

View the job queue:

# squeue -a
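Besides srun, batch jobs are normally submitted with sbatch. A minimal test script (a sketch; the partition name "compute" comes from the slurm.conf example above, adjust it if yours differs):

# cat > test_job.sh << 'EOF'
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=test_%j.out
hostname
sleep 30
EOF
# sbatch test_job.sh
# squeue        # the job should appear in the queue

The %j placeholder in --output is replaced by the job ID, so the result lands in a file such as test_42.out.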