Nomad Nginx: Exposing an IP Port and Verifying Restart/Reschedule

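1. Preparation

1.1 Set up the Nomad cluster

This test uses three Ubuntu 18.04 virtual machines with the following IP addresses:

VM 1: 192.168.60.10
VM 2: 192.168.60.11
VM 3: 192.168.60.12

For the setup procedure, see the article "Nomad cluster high-availability test".

1.2 Pull the nginx image

Run sudo docker pull nginx on all three VMs. It is best to also create a local copy of the image tagged nginx:v1. Otherwise every job start pulls the image again, which adds latency (Nomad's Docker driver always attempts a pull when the image tag is latest or omitted).

Create a Dockerfile on all three VMs:

FROM nginx:latest

Then run sudo docker build -t nginx:v1 . on all three VMs.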

2. Testing driver=docker

2.1 Exposing an IP port for nginx

Create the job file nginx.nomad:

job "nginx" {
  datacenters = ["dc1"]

  group "nginxg" {

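    count = 1                   # run only one instance

    network {
      port "nginxport" {        # custom port label
        static = 8765           # port exposed on the client (the VM) that runs the container; this is the port you connect to
        to = 80                 # maps to port 80 inside the container (8765 -> 80)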
      }
    }
    
    task "nginxt" {
      driver = "docker"

      config {
        image = "nginx:v1"
        ports = ["nginxport"]
      }
    }
  }
}

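The comments may need to be removed for the job to run. The datacenters value must also match the datacenter name configured on your Nomad clients.

Run nomad job run nginx.nomad on any of the VMs. In this test the job was placed on VM 1.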

ubuntu1@ubuntu1:~$ sudo docker ps           
CONTAINER ID   IMAGE          COMMAND                  CREATED          STATUS          PORTS                                                  NAMES
7ecc30f20c63   nginx:latest   "/docker-entrypoint.…"   19 seconds ago   Up 18 seconds   192.168.60.10:8765->80/tcp, 192.168.60.10:8765->80/udp   nginxt-b2eff4f3-20d6-a666-7771-e9c8bd4bd73f

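Now open a browser and visit 192.168.60.10:8765. The nginx welcome page appears, which shows that the IP and port are reachable.

[Figure: the nginx page shown in the browser]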
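
The browser check can also be scripted. The following is a minimal sketch, assuming Python 3 is available on a machine that can reach the VM. The helper name is_http_reachable is hypothetical. The host and port are the values used in this test.

```python
import urllib.request


def is_http_reachable(host, port, timeout=3.0):
    """Return True if an HTTP GET to http://host:port/ answers with status 200."""
    url = f"http://{host}:{port}/"
    # Bypass any proxy configured in the environment; the target is a LAN address.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts and HTTP error statuses
        # (urllib.error.URLError is a subclass of OSError).
        return False


if __name__ == "__main__":
    # Address and static port taken from the job file in this test.
    print(is_http_reachable("192.168.60.10", 8765))
```

A result of True only shows that something answers HTTP 200 on that address. It does not prove that the response comes from the nginx container.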

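2.2 Restart/reschedule verification

Reference: the restart and reschedule blocks in the Nomad job specification documentation.

Modify the nginx.nomad file above as follows: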

job "nginx" {
  datacenters = ["dc1"]

  group "nginxg" {

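    count = 1                   # run only one instance

    network {
      port "nginxport" {        # custom port label
        static = 8765           # port exposed on the client (the VM) that runs the container; this is the port you connect to
        to = 80                 # maps to port 80 inside the container (8765 -> 80)
      }
    }
    
    restart {
         attempts = 1           # maximum number of restarts allowed within interval
         interval = "30m"       # window for counting restarts; once attempts is exhausted within it, mode decides what happens
         delay    = "2s"        # wait before each restart; the minimum is 0s
         mode     = "fail"      # once attempts is exhausted, mark the allocation as failed so that it is rescheduled; see the reference for other modes
    }

    reschedule {
         attempts       = 15           # maximum number of reschedules allowed within interval
         interval       = "1h"         # once attempts is exhausted within this window, no further rescheduling happens
         delay          = "5s"         # wait before each reschedule; the minimum is 5s
         delay_function = "constant"   # how the delay grows; constant keeps it fixed, see the reference for other functions
         unlimited      = false        # disable unlimited rescheduling; when enabled, attempts and interval are ignored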
    }
      
    task "nginxt" {
      driver = "docker"

      config {
        image = "nginx:v1"
        ports = ["nginxport"]
      }
    }
  }
}
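
To make the interplay of attempts, interval and mode concrete, here is a deliberately simplified model of the restart decision. It is a sketch for illustration only, not Nomad's implementation: it assumes a sliding time window, and the function name action_on_failure is hypothetical. The timestamps are the two task terminations recorded later in this test.

```python
from datetime import datetime, timedelta


def action_on_failure(failure_times, attempts, interval, mode="fail"):
    """Decide what happens after the latest task failure.

    failure_times: chronologically sorted datetimes of task failures,
                   the last entry being the failure to decide on.
    Returns "restart", "reschedule" (mode "fail") or "delay" (mode "delay").
    """
    latest = failure_times[-1]
    # Restarts already used up = earlier failures that fall inside the window.
    used = sum(1 for t in failure_times[:-1] if latest - t < interval)
    if used < attempts:
        return "restart"
    return "reschedule" if mode == "fail" else "delay"


if __name__ == "__main__":
    first = datetime(2021, 8, 30, 11, 0, 27)
    second = datetime(2021, 8, 30, 11, 6, 49)
    window = timedelta(minutes=30)
    print(action_on_failure([first], 1, window))          # restart
    print(action_on_failure([first, second], 1, window))  # reschedule
```

With attempts = 1 and interval = "30m", the first failure is handled by a local restart. A second failure within the same 30 minutes exhausts the restart budget, so the allocation fails and the reschedule block takes over.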

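Repeat the steps from section 2.1 to run the job. The nginx container now runs on VM 1.

2.2.1 restart

Stop nginx with either sudo docker stop [container ID] or sudo kill -9 [container process ID]. Both have the same effect.

The container ID can be found with sudo docker ps, and the container process ID with ps -ef.

After stopping it, run sudo docker ps again. The container ID and creation time have changed, which shows that nginx was restarted.

The output of nomad job status nginx is: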

ubuntu1@ubuntu1$ nomad job status nginx   
ID            = nginx
Name          = nginx
Submit Date   = 2021-08-30T10:59:50+08:00
Type          = service
Priority      = 50
Datacenters   = home
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
nginxg      0       0         1        0       0         0

Latest Deployment
ID          = 3f292432
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
nginxg      1        1       1        0          2021-08-30T11:10:00+08:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
a90bf17e  d7dc7cb2  nginxg      0        run      running   21s ago     10s ago

Take the allocation ID a90bf17e from the output above and run nomad alloc status a90bf17e:

ubuntu1@ubuntu1$ nomad alloc status a90bf17e   
ID                   = a90bf17e-7bff-73e8-63c8-3daa2bbb0b65
Eval ID              = ed640bc9
Name                 = nginx.nginxg[0]
Node ID              = a90bf17e
Node Name            = ubuntu1
Job ID               = nginx
Job Version          = 0
Client Status        = failed
Client Description   = Failed tasks
Desired Status       = stop
Desired Description  = alloc was rescheduled because it failed
Created              = 7m42s ago
Modified             = 38s ago
Deployment ID        = 3f292432
Deployment Health    = healthy
Replacement Alloc ID = 768c9fbb

Allocation Addresses
Label       Dynamic  Address
*nginxport  yes      192.168.60.10:8765 -> 80

Task "nginxt" is "running"
Task Resources
CPU        Memory           Disk     Addresses
0/100 MHz  1.9 MiB/300 MiB  300 MiB  

Task Events:
Started At     = 2021-08-30T03:00:28Z
Finished At    = 2021-08-30T03:06:49Z
Total Restarts = 1
Last Restart   = 2021-08-30T11:00:27+08:00

Recent Events:
Time                       Type            Description
2021-08-30T11:00:28+08:00  Started         Task started by client
2021-08-30T11:00:27+08:00  Restarting      Task restarting in 1ns
2021-08-30T11:00:27+08:00  Terminated      Exit Code: 0
2021-08-30T10:59:50+08:00  Started         Task started by client
2021-08-30T10:59:50+08:00  Task Setup      Building Task Directory
2021-08-30T10:59:50+08:00  Received        Task received by client

The Recent Events section confirms that nginx was restarted.
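
Reading the Recent Events table by eye works for one allocation. For repeated tests it can help to count the event types programmatically. The sketch below assumes that the table layout matches the output shown in this article (columns separated by two or more spaces). The function name count_event_types is hypothetical, and the sample text is copied from the output above.

```python
import re
from collections import Counter


def count_event_types(alloc_status_text):
    """Count event types in the "Recent Events" table of `nomad alloc status` output."""
    counts = Counter()
    in_events = False
    for line in alloc_status_text.splitlines():
        if line.startswith("Recent Events:"):
            in_events = True
            continue
        if not in_events:
            continue
        # Columns are separated by two or more spaces: time, type, description.
        fields = re.split(r"\s{2,}", line.strip(), maxsplit=2)
        # Event rows start with a timestamp such as 2021-08-30T11:00:28+08:00.
        if len(fields) == 3 and re.match(r"\d{4}-\d{2}-\d{2}T", fields[0]):
            counts[fields[1]] += 1
    return counts


SAMPLE = """\
Recent Events:
Time                       Type            Description
2021-08-30T11:00:28+08:00  Started         Task started by client
2021-08-30T11:00:27+08:00  Restarting      Task restarting in 1ns
2021-08-30T11:00:27+08:00  Terminated      Exit Code: 0
2021-08-30T10:59:50+08:00  Started         Task started by client
2021-08-30T10:59:50+08:00  Task Setup      Building Task Directory
2021-08-30T10:59:50+08:00  Received        Task received by client
"""

if __name__ == "__main__":
    print(count_event_types(SAMPLE)["Restarting"])  # 1
```

One Restarting event followed by a second Started event is the signature of a local restart. If an allocation has several tasks, this sketch counts the events of all of them together.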

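2.2.2 reschedule

Continuing from the previous test, stop nginx once more. VM 1 no longer runs nginx. Instead, after a 5-second delay, nginx is moved to VM 2.

The output of nomad job status nginx is: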

ubuntu1@ubuntu1$ nomad job status nginx   
ID            = nginx
Name          = nginx
Submit Date   = 2021-08-30T10:59:50+08:00
Type          = service
Priority      = 50
Datacenters   = home
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
nginxg      0       0         1        1       0         0

Latest Deployment
ID          = 3f292432
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
nginxg      1        1       1        0          2021-08-30T11:10:00+08:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
768c9fbb  d7dc7cb2  nginxg      0        run      running   21s ago     10s ago
a90bf17e  a9470c7f  nginxg      4        stop     failed    7m25s ago   21s ago

A new allocation, 768c9fbb, has been added, and the earlier allocation a90bf17e has failed and been stopped.

The output of nomad alloc status a90bf17e is:

...
Recent Events:
Time                       Type            Description
2021-08-30T11:06:49+08:00  Not Restarting  Exceeded allowed attempts 1 in interval 30m0s and mode is "fail"
2021-08-30T11:06:49+08:00  Terminated      Exit Code: 0
2021-08-30T11:00:28+08:00  Started         Task started by client
2021-08-30T11:00:27+08:00  Restarting      Task restarting in 1ns
2021-08-30T11:00:27+08:00  Terminated      Exit Code: 0
2021-08-30T10:59:50+08:00  Started         Task started by client
2021-08-30T10:59:50+08:00  Task Setup      Building Task Directory
2021-08-30T10:59:50+08:00  Received        Task received by client

The output of nomad alloc status 768c9fbb is:

...
Recent Events:
Time                       Type        Description
2021-08-30T11:06:54+08:00  Started     Task started by client
2021-08-30T11:06:54+08:00  Task Setup  Building Task Directory
2021-08-30T11:06:54+08:00  Received    Task received by client

This confirms that nginx was rescheduled.
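
The configured reschedule delay of 5s can be checked against the two event logs: the old allocation gave up at 11:06:49 and the replacement was received at 11:06:54. A small sketch of that arithmetic follows. The helper name seconds_between is hypothetical, and the timestamps are copied from the outputs above.

```python
from datetime import datetime


def seconds_between(earlier, later):
    """Return the number of seconds between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(later) - datetime.fromisoformat(earlier)
    return delta.total_seconds()


if __name__ == "__main__":
    gave_up = "2021-08-30T11:06:49+08:00"      # "Not Restarting" on a90bf17e
    replacement = "2021-08-30T11:06:54+08:00"  # "Received" on 768c9fbb
    print(seconds_between(gave_up, replacement))  # 5.0
```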

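3. Testing driver=raw_exec

3.1 Restart/reschedule verification

Reference: the restart and reschedule blocks in the Nomad job specification documentation.

Write a simple program, main.c: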

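#include <stdio.h>
#include <unistd.h>   /* declares sleep() */

int main(void)
{
	/* Sleep forever so that Nomad has a long-running process to supervise. */
	while (1) { sleep(10); }
	return 0;
}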

Compile it with gcc main.c -o main.out.

Copy main.out to /usr with sudo cp main.out /usr/.

Add the following to /etc/nomad.d/nomad.hcl:

plugin "raw_exec" {
  config {
    enabled = true
  }
}

All of the steps above must be done on all three VMs.

Create the file exec.nomad as follows:

job "exec" {
    datacenters = ["home"]

    type = "batch"

    group "execg" {
        count = 1

        restart {
            attempts = 1
            interval = "30m"
            delay    = "0s"
            mode     = "fail"
        }

        reschedule {
            attempts       = 15
            interval       = "1h"
            delay          = "5s"
            delay_function = "constant"
            unlimited      = false
        }

        task "exect" {
            driver = "raw_exec"

            config {
                command = "/usr/main.out"
            }
        }
    }
}

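Run nomad job run exec.nomad. The main.out process now runs on VM 1.

3.1.1 restart

Kill the main.out process with sudo kill -9 [PID]. The PID can be found with ps -ef.

After killing it, run ps -ef again. The PID of main.out has changed, which shows that the process was restarted.

The output of nomad job status exec is: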

ubuntu1@ubuntu1$ nomad job status exec   
ID            = exec
Name          = exec
Submit Date   = 2021-08-30T16:31:06+08:00
Type          = batch
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
execg       0       0         1        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
e18abb7e  d7dc7cb2  execg       0        run      running   26m47s ago  44s ago

Take the allocation ID e18abb7e from the output above and run nomad alloc status e18abb7e:

ubuntu1@ubuntu1$ nomad alloc status e18abb7e   
ID                  = e18abb7e-4da8-44e6-17e3-2dc7fee4cac7
Eval ID             = 10353d31
Name                = exec.execg[0]
Node ID             = d7dc7cb2
Node Name           = ubuntu1
Job ID              = exec
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 28m3s ago
Modified            = 2m ago

Task "exect" is "running"
Task Resources
CPU        Memory          Disk     Addresses
0/100 MHz  34 MiB/300 MiB  300 MiB  

Task Events:
Started At     = 2021-08-30T08:57:33Z
Finished At    = N/A
Total Restarts = 1
Last Restart   = 2021-08-30T16:57:33+08:00

Recent Events:
Time                       Type        Description
2021-08-30T16:57:33+08:00  Started     Task started by client
2021-08-30T16:57:33+08:00  Restarting  Task restarting in 1ns
2021-08-30T16:57:33+08:00  Terminated  Exit Code: 137, Signal: 9
2021-08-30T16:31:31+08:00  Started     Task started by client
2021-08-30T16:31:31+08:00  Task Setup  Building Task Directory
2021-08-30T16:31:31+08:00  Received    Task received by client

The Recent Events section confirms that the main.out process was restarted.
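
The Terminated event reports "Exit Code: 137, Signal: 9". This is consistent with the common shell convention of reporting a process killed by signal N as exit code 128 + N. The sketch below reproduces that on a POSIX system with a stand-in child process instead of main.out. The function name kill_child_and_report is hypothetical, and the mapping to Nomad's output is an assumption based on the log above.

```python
import signal
import subprocess
import sys


def kill_child_and_report():
    """Start a sleeping child, SIGKILL it, and return (signal, shell-style exit code)."""
    # A stand-in for main.out: a child process that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(signal.SIGKILL)  # the same signal that `kill -9` sends
    child.wait()
    # On POSIX, Popen reports death by signal N as returncode -N.
    signal_number = -child.returncode
    return signal_number, 128 + signal_number


if __name__ == "__main__":
    print(kill_child_and_report())  # (9, 137)
```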

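3.1.2 reschedule

Continuing from the previous test, kill the main.out process once more. VM 1 no longer runs main.out. Instead, after a 5-second delay, the process is moved to VM 2.

The output of nomad job status exec is: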

ubuntu1@ubuntu1$ nomad job status exec   
ID            = exec
Name          = exec
Submit Date   = 2021-08-30T16:31:06+08:00
Type          = batch
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
execg       0       0         1        1       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
55eeaf58  a9470c7f  execg       2        run      running   46s ago     46s ago
e18abb7e  d7dc7cb2  execg       0        stop     failed    30m34s ago  46s ago

A new allocation, 55eeaf58, has been added, and the earlier allocation e18abb7e has failed and been stopped.

The output of nomad alloc status e18abb7e is:

...
Recent Events:
Time                       Type            Description
2021-08-30T16:57:36+08:00  Not Restarting  Exceeded allowed attempts 1 in interval 30m0s and mode is "fail"
2021-08-30T16:57:36+08:00  Terminated      Exit Code: 137, Signal: 9
2021-08-30T16:57:33+08:00  Started         Task started by client
2021-08-30T16:57:33+08:00  Restarting      Task restarting in 1ns
2021-08-30T16:57:33+08:00  Terminated      Exit Code: 137, Signal: 9
2021-08-30T16:31:31+08:00  Started         Task started by client
2021-08-30T16:31:31+08:00  Task Setup      Building Task Directory
2021-08-30T16:31:31+08:00  Received        Task received by client

The output of nomad alloc status 55eeaf58 is:

...
Recent Events:
Time                       Type        Description
2021-08-30T17:01:19+08:00  Started     Task started by client
2021-08-30T17:01:19+08:00  Task Setup  Building Task Directory
2021-08-30T17:01:19+08:00  Received    Task received by client

This confirms that the main.out process was rescheduled.
