Graceful Envoy Shutdown in Contour: Implementation and Source Code Analysis

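Goals

Provide a way to report the number of open connections and the load on the Envoy process
Lose as few connections as possible during a rolling upgrade of Envoy

Non-goals

Guarantee zero connection loss during a rolling upgrade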

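Background

Envoy is the data-plane component of Contour and sometimes has to be redeployed, for example because of an upgrade, a configuration change, or a node problem that forces the pod to be rescheduled.

Earlier versions of Contour shipped a preStop hook in the pod that signalled Envoy to start draining connections. The readiness probe then reports the instance as unhealthy and Envoy stops receiving new connections.

The main problem with that approach is that the preStop hook only sends the /healthcheck/fail request and does not wait for Envoy to finish handling its open connections, so clients receive errors when the container is restarted.

The new design adds a component that can tell whether Envoy still has open connections before SIGTERM is sent.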

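Design

A new subcommand, contour envoy shutdown-manager, sends the healthcheck fail request to Envoy and then polls the number of active connections on the HTTP listeners. The numbers come from the metrics exposed on the admin port at localhost:9001/stats.

An optional min-open-connections parameter lets the user set the number of open connections at or below which the wait ends.

The Kubernetes preStop hook gives a container time to clean up and do other work before SIGTERM is sent.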

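Detailed design

Implement a Contour subcommand named envoy shutdown-manager
The command exposes an HTTP endpoint, by default on port 8090, at the path /shutdown
A new container that runs this command and exposes the endpoint is added to the Envoy DaemonSet
Both the Envoy container and the new container are given a preStop hook
When the pod is asked to shut down, the preStop hook sends a request to localhost:8090/shutdown, which tells Envoy to start draining connections and then polls the active connection count, blocking until it drops to 0 or to the configured min-open-connections
The pod's terminationGracePeriodSeconds must be raised well above the 30s default to leave enough time for connections to drain. If the grace period expires first, Kubernetes forcibly kills the containers (SIGKILL)
A second endpoint, /healthz, reports the health of the shutdown-manager container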

    apiVersion: extensions/v1beta1
    kind: DaemonSet
    metadata:
      annotations:
      labels:
        app: envoy
      name: envoy
      namespace: projectcontour
    spec:
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: envoy
      template:
        metadata:
          annotations:
            prometheus.io/path: /stats/prometheus
            prometheus.io/port: "8002"
            prometheus.io/scrape: "true"
          creationTimestamp: null
          labels:
            app: envoy
        spec:
          automountServiceAccountToken: false
          containers:
          - command:  # <----- New Pod
            - /bin/contour
            args:
            - envoy
            - shutdown-manager
            image: stevesloka/envoyshutdown
            imagePullPolicy: Always
            lifecycle:
              preStop:  # <----- PreStop Hook
                exec:
                  command:
                  - /bin/contour
                  - envoy
                  - shutdown
            livenessProbe:  # <------ Liveness probe
              httpGet:
                path: /healthz
                port: 8090
              initialDelaySeconds: 3
              periodSeconds: 10
            name: shutdown-manager
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          - args:
            - -c
            - /config/envoy.json
            - --service-cluster $(CONTOUR_NAMESPACE)
            - --service-node $(ENVOY_POD_NAME)
            - --log-level info
            command:
            - envoy
            env:
            - name: CONTOUR_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: ENVOY_POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            image: docker.io/envoyproxy/envoy:v1.13.0
            imagePullPolicy: IfNotPresent
            lifecycle:  # <----- PreStop Hook
              preStop:
                httpGet:
                  path: /shutdown
                  port: 8090
                  scheme: HTTP
            name: envoy
            ports:
            - containerPort: 80
              hostPort: 80
              name: http
              protocol: TCP
            - containerPort: 443
              hostPort: 443
              name: https
              protocol: TCP
            readinessProbe:
              failureThreshold: 4
              httpGet:
                path: /ready
                port: 8002
                scheme: HTTP
              initialDelaySeconds: 3
              periodSeconds: 3
              successThreshold: 1
              timeoutSeconds: 1
            volumeMounts:
            - mountPath: /config
              name: envoy-config
            - mountPath: /certs
              name: envoycert
            - mountPath: /ca
              name: cacert
          dnsPolicy: ClusterFirst
          initContainers:
          - args:
            - bootstrap
            - /config/envoy.json
            - --xds-address=contour
            - --xds-port=8001
            - --envoy-cafile=/ca/cacert.pem
            - --envoy-cert-file=/certs/tls.crt
            - --envoy-key-file=/certs/tls.key
            command:
            - contour
            env:
            - name: CONTOUR_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            image: docker.io/projectcontour/contour:master
            imagePullPolicy: Always
            name: envoy-initconfig
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /config
              name: envoy-config
            - mountPath: /certs
              name: envoycert
              readOnly: true
            - mountPath: /ca
              name: cacert
              readOnly: true
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 300
          volumes:
          - emptyDir: {}
            name: envoy-config
          - name: envoycert
            secret:
              defaultMode: 420
              secretName: envoycert
          - name: cacert
            secret:
              defaultMode: 420
              secretName: cacert
      updateStrategy:
        rollingUpdate:
          maxUnavailable: 10%
        type: RollingUpdate

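Alternatives considered

A bash script could achieve the same goal, but it would be harder to implement and hard to test
Instead of the preStop hook mechanism, a separate binary could run the checks, but getting information out of the Envoy container is difficult (it would probably need a shared volume)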

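Flow diagram

1: The init container envoy-initconfig runs contour bootstrap to generate the configuration file /config/envoy.json

2: The main container shares the config file through a mounted volume and starts the envoy process with it as the configuration argument

3: Envoy and the server side use the xDS protocol for service discovery and route configuration

4-1: Before Envoy is stopped, its preStop hook runs and calls the /shutdown endpoint that shutdown-manager exposes on port 8090. The endpoint checks whether the /ok file exists. If it does, Envoy's connections have drained and the container can stop. If it does not, it is not yet safe to stop and the request blocks

4-2: At the same time (containers in a pod are stopped in no particular order, so think of it as running in parallel), the preStop hook of shutdown-manager runs the contour envoy shutdown command before that container is stopped

5: The command sends a POST request to the Envoy admin interface asking Envoy to shut down

6: It checks Envoy's metrics and treats the shutdown as successful only once the active connection count has dropped to the configured threshold

7: It then creates the /ok file, which lets the /shutdown request return successfully

8: Once the /ok file exists, Envoy has shut down gracefully and the envoy process can exit

Only after all of these steps can the pod as a whole exit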

Source code analysis

As the design document shows, graceful shutdown revolves around the shutdown-manager process, so this section walks through the source code of shutdown-manager.

shutdown-manager

Usage:

    $ ./contour envoy shutdown-manager -h
    usage: contour envoy shutdown-manager [<flags>]

    Start envoy shutdown-manager.

    Flags:
      -h, --help                   Show context-sensitive help (also try --help-long and --help-man).
          --serve-port=SERVE-PORT  Port to serve the http server on.

Entry point:

    // contour/cmd/contour/contour.go
    func main() {
        ...
        // Top-level subcommand: envoy
        envoyCmd := app.Command("envoy", "Sub-command for envoy actions.")
        sdm, shutdownManagerCtx := registerShutdownManager(envoyCmd, log)
        ...
        switch kingpin.MustParse(app.Parse(args)) {
        case sdm.FullCommand():
            doShutdownManager(shutdownManagerCtx)
        ...
        }
    }

Registering the command:

    // contour/cmd/contour/shutdownmanager.go
    func registerShutdownManager(cmd *kingpin.CmdClause, log logrus.FieldLogger) (*kingpin.CmdClause, *shutdownmanagerContext) {
        ctx := newShutdownManagerContext()
        ctx.FieldLogger = log.WithField("context", "shutdown-manager")

        // Second-level subcommand: shutdown-manager
        shutdownmgr := cmd.Command("shutdown-manager", "Start envoy shutdown-manager.")
        shutdownmgr.Flag("serve-port", "Port to serve the http server on.").IntVar(&ctx.httpServePort)

        return shutdownmgr, ctx
    }

    // Initialise the default parameters.
    func newShutdownManagerContext() *shutdownmanagerContext {
        // Set defaults for parameters which are then overridden via flags, ENV, or ConfigFile
        return &shutdownmanagerContext{
            httpServePort: 8090,
            // const shutdownReadyFile = "/ok"
            shutdownReadyFile:          shutdownReadyFile,
            shutdownReadyCheckInterval: shutdownReadyCheckInterval,
        }
    }

The command's action is to start an HTTP server. The server exposes the /shutdown endpoint, which the preStop hook calls before Envoy is stopped.

    // contour/cmd/contour/shutdownmanager.go
    func doShutdownManager(config *shutdownmanagerContext) {
        config.Info("started envoy shutdown manager")
        defer config.Info("stopped")

        // Expose two endpoints.
        http.HandleFunc("/healthz", config.healthzHandler)
        http.HandleFunc("/shutdown", config.shutdownReadyHandler)

        // Listens on port 8090 by default.
        log.Fatal(http.ListenAndServe(fmt.Sprintf(":%d", config.httpServePort), nil))
    }

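The handler behind the /shutdown endpoint:

The endpoint is called on Envoy's behalf to decide whether the service may be terminated
Once the connection count has dropped to the threshold (0 or the configured value), the /ok file is created
When /shutdown is requested, the presence of /ok decides whether Envoy can be terminated safely
While /ok does not exist, the request blocks

    func (s *shutdownmanagerContext) shutdownReadyHandler(w http.ResponseWriter, r *http.Request) {
        l := s.WithField("context", "shutdownReadyHandler")
        ctx := r.Context()
        for {
            // Check whether the /ok file exists.
            _, err := os.Stat(s.shutdownReadyFile)
            if os.IsNotExist(err) {
                // Not there yet: the connection count has not dropped to the
                // threshold, so it is not safe to exit. Wait for the next check.
                l.Infof("file %s does not exist; checking again in %v", s.shutdownReadyFile,
                    s.shutdownReadyCheckInterval)
            } else if err == nil {
                l.Infof("detected file %s; sending HTTP response", s.shutdownReadyFile)
                // http.StatusText only returns the string "OK" and has no effect here;
                // the 200 status is sent implicitly by w.Write below.
                http.StatusText(http.StatusOK)
                if _, err := w.Write([]byte("OK")); err != nil {
                    l.Error(err)
                }
                return
            } else {
                l.Errorf("error checking for file: %v", err)
            }

            select {
            // Sleep until the next check.
            case <-time.After(s.shutdownReadyCheckInterval):
            case <-ctx.Done():
                l.Infof("client request cancelled")
                return
            }
        }
    }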

shutdown

Usage is shown below. The command is run from the preStop hook of the shutdown-manager container, before that process is killed. It asks Envoy to shut down and writes the /ok file once the shutdown has succeeded.

    $ ./contour envoy shutdown -h
    usage: contour envoy shutdown [<flags>]

    Initiate an shutdown sequence which configures Envoy to begin draining connections.

    Flags:
      -h, --help                 Show context-sensitive help (also try --help-long and --help-man).
          --admin-port=ADMIN-PORT
                                 Envoy admin interface port.
          --check-interval=CHECK-INTERVAL
                                 Time to poll Envoy for open connections.
          --check-delay=60s      Time to wait before polling Envoy for open connections.
          --drain-delay=0s       Time to wait before draining Envoy connections.
          --min-open-connections=MIN-OPEN-CONNECTIONS
                                 Min number of open connections when polling Envoy.

Entry point:

    func main() {
        ...
        // Top-level subcommand: envoy
        envoyCmd := app.Command("envoy", "Sub-command for envoy actions.")
        sdmShutdown, sdmShutdownCtx := registerShutdown(envoyCmd, log)
        ...
        switch kingpin.MustParse(app.Parse(args)) {
        ...
        case sdmShutdown.FullCommand():
            sdmShutdownCtx.shutdownHandler()
        ...
        }
    }

Registering the command:

    func registerShutdown(cmd *kingpin.CmdClause, log logrus.FieldLogger) (*kingpin.CmdClause, *shutdownContext) {
        ctx := newShutdownContext()
        ctx.FieldLogger = log.WithField("context", "shutdown")

        // Second-level subcommand: shutdown
        shutdown := cmd.Command("shutdown", "Initiate an shutdown sequence which configures Envoy to begin draining connections.")
        shutdown.Flag("admin-port", "Envoy admin interface port.").IntVar(&ctx.adminPort)
        shutdown.Flag("check-interval", "Time to poll Envoy for open connections.").DurationVar(&ctx.checkInterval)
        shutdown.Flag("check-delay", "Time to wait before polling Envoy for open connections.").Default("60s").DurationVar(&ctx.checkDelay)
        shutdown.Flag("drain-delay", "Time to wait before draining Envoy connections.").Default("0s").DurationVar(&ctx.drainDelay)
        shutdown.Flag("min-open-connections", "Min number of open connections when polling Envoy.").IntVar(&ctx.minOpenConnections)

        return shutdown, ctx
    }

The command handler follows. Its core logic:

Send a request to the Envoy admin endpoint http://localhost:9001/healthcheck/fail to signal that Envoy should shut down
Read the Envoy metrics at http://localhost:9001/stats/prometheus to get the connection count and decide whether the connections have really been closed

    func (s *shutdownContext) shutdownHandler() {
        // Retry the shutdown request to Envoy.
        err := retry.OnError(wait.Backoff{
            Steps:    4,
            Duration: 200 * time.Millisecond,
            Factor:   5.0,
            Jitter:   0.1,
        }, func(err error) bool {
            // Always retry any error.
            return true
        }, func() error {
            s.Infof("attempting to shutdown")
            // Try to shut Envoy down.
            return shutdownEnvoy(s.adminPort)
        })
        ...
        time.Sleep(s.checkDelay)

        for {
            // Query the Envoy admin port for metrics
            // (http://localhost:9001/stats/prometheus) and read the connection
            // count from envoy_http_downstream_cx_active, label ingress_http.
            openConnections, err := getOpenConnections(s.adminPort)
            if err != nil {
                s.Error(err)
            } else {
                // The connection count is at or below the configured minimum.
                if openConnections <= s.minOpenConnections {
                    ...
                    // Create the /ok file to signal that the service can be terminated safely.
                    file, err := os.Create(shutdownReadyFile)
                    ...
                    return
                }
            }
            // Connections have not dropped yet; sleep before checking again.
            time.Sleep(s.checkInterval)
        }
    }

The logic that shuts Envoy down:

    func shutdownEnvoy(adminPort int) error {
        // Send a request to the admin port that makes health checks fail:
        // http://localhost:9001/healthcheck/fail
        healthcheckFailURL := fmt.Sprintf(healthcheckFailURLFormat, adminPort)

        // Send the POST request.
        resp, err := http.Post(healthcheckFailURL, "", nil)
        if err != nil {
            return fmt.Errorf("creating healthcheck fail POST request failed: %s", err)
        }
        defer resp.Body.Close()

        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("POST for %q returned HTTP status %s", healthcheckFailURL, resp.Status)
        }
        return nil
    }

Getting the connection count:

    func getOpenConnections(adminPort int) (int, error) {
        // Metrics address: http://localhost:9001/stats/prometheus
        prometheusURL := fmt.Sprintf(prometheusURLFormat, adminPort)

        // Send a GET request to fetch the metrics.
        resp, err := http.Get(prometheusURL)
        ...
        // Extract the connection count from the metrics.
        return parseOpenConnections(resp.Body)
    }

    // parseOpenConnections returns the sum of open connections from a Prometheus HTTP request
    func parseOpenConnections(stats io.Reader) (int, error) {
        ...
        // Convert the metrics text into objects.
        metricFamilies, err := parser.TextToMetricFamilies(stats)

        // Look up the envoy_http_downstream_cx_active metric.
        if _, ok := metricFamilies[prometheusStat]; !ok {
            return -1, fmt.Errorf("error finding Prometheus stat %q in the request result", prometheusStat)
        }

        // Find the samples labelled ingress_http and add them up.
        for _, metrics := range metricFamilies[prometheusStat].Metric {
            for _, labels := range metrics.Label {
                for _, item := range prometheusLabels() {
                    if item == labels.GetValue() {
                        openConnections += int(metrics.Gauge.GetValue())
                    }
                }
            }
        }
        return openConnections, nil
    }

Author: kinnylee
Link: https://juejin.cn/post/

