AlertManager

Alertmanager receives the alerts that Prometheus sends. It supports a rich set of notification channels and makes it easy to deduplicate, silence, and group alerts, which makes it a solid, modern alert-notification system.

# 1. Installation

First, create the ConfigMap manifest:

alertmanager-config.yaml

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-ops
data:
  alertmanager.yml: |-
    global:
      # How long to wait before declaring an alert resolved when it stops firing
      resolve_timeout: 5m
      # Email sending configuration
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'xxxs@163.com'
      smtp_auth_username: 'xxx@163.com'
      smtp_auth_password: 'xxxx'
      smtp_hello: '163.com'
      smtp_require_tls: false
    # The root route that every alert enters; it defines the dispatch policy
    route:
      # Labels used to regroup incoming alerts. For example, many alerts carrying
      # cluster=A and alertname=LatencyHigh would be aggregated into one group.
      group_by: ['alertname', 'cluster']
      # When a new alert group is created, wait at least group_wait before the first
      # notification, so that multiple alerts for the same group can be collected
      # and fired together.
      group_wait: 30s

      # After the first notification, wait group_interval before sending a new
      # batch of alerts for the same group.
      group_interval: 5m

      # If a notification has already been sent successfully, wait repeat_interval
      # before resending it.
      repeat_interval: 5m

      # The default receiver: alerts that match no sub-route go here.
      receiver: default

      # All of the properties above are inherited by the sub-routes and can be
      # overridden on each of them.
      routes:
      - receiver: email
        group_wait: 10s
        match:
          team: node
    receivers:
    - name: 'default'
      email_configs:
      - to: 'baidjay@163.com'
        send_resolved: true
    - name: 'email'
      email_configs:
      - to: '565361785@qq.com'
        send_resolved: true
```

Create the ConfigMap object:

```bash
# kubectl apply -f alertmanager-config.yaml
configmap/alertmanager-config created
```
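
To catch syntax mistakes early, you can optionally validate the configuration with amtool, which ships with Alertmanager and is included in the prom/alertmanager image; a quick sketch (the pod name in the second command is illustrative):

```bash
# Validate the configuration file locally.
amtool check-config alertmanager.yml

# Or run the same check inside a running Alertmanager pod once it is deployed
# (replace alertmanager-xxx with an actual pod name).
kubectl exec -n kube-ops alertmanager-xxx -- amtool check-config /etc/alertmanager/alertmanager.yml
```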

Then create the Alertmanager Deployment:

alertmanager-deploy.yaml

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-ops
spec:
  selector:
    matchLabels:
      app: alertmanager
  replicas: 2
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.19.0
          imagePullPolicy: IfNotPresent
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 100m
              memory: 256Mi
          volumeMounts:
            - name: alert-config
              mountPath: /etc/alertmanager
          ports:
            - name: http
              containerPort: 9093
      volumes:
        - name: alert-config
          configMap:
            name: alertmanager-config
```
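
After applying the Deployment, a few optional checks confirm the rollout and let you browse the UI before wiring anything else up:

```bash
# Wait for the Deployment to become ready and list its pods.
kubectl -n kube-ops rollout status deployment/alertmanager
kubectl -n kube-ops get pods -l app=alertmanager

# Forward a local port to the Deployment and browse http://localhost:9093.
kubectl -n kube-ops port-forward deployment/alertmanager 9093:9093
```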

Create the Service:

alertmanager-svc.yaml

```yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-svc
  namespace: kube-ops
  annotations:
    prometheus.io/scrape: "true"
spec:
  selector:
    app: alertmanager
  ports:
    - name: http
      port: 9093
```
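
It is worth confirming that the Service's in-cluster DNS name resolves and that Alertmanager answers its health endpoint. A throwaway check (curlimages/curl is just a convenient image choice):

```bash
# Run a one-off pod that curls Alertmanager's health endpoint through the Service.
kubectl run -n kube-ops curl-test --rm -it --restart=Never \
  --image=curlimages/curl -- curl -s http://alertmanager-svc:9093/-/healthy
```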

Configure the Alertmanager address in Prometheus:

prom-configmap.yaml

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-ops
data:
  prometheus.yaml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager-svc:9093"]
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
......
```

Then apply the updated ConfigMap and reload Prometheus:

```bash
# kubectl apply -f prom-configmap.yaml
# curl -X POST "http://10.68.254.74:9090/-/reload"
```
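
To confirm that Prometheus actually picked up the Alertmanager after the reload, you can query its API (same Prometheus address as above):

```bash
# List the Alertmanagers Prometheus is currently talking to; the response
# should show alertmanager-svc:9093 under "activeAlertmanagers".
curl -s "http://10.68.254.74:9090/api/v1/alertmanagers"
```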

# 2. Configuring Alerting Rules

So far we have only set up the alert receiver; no alerting rules are configured, so nothing will actually fire yet. Alerting rules let you define alert conditions with Prometheus's expression language and send notifications to an external receiver when an alert triggers.

First, define the alerting rules. We stay with the ConfigMap approach and put them directly into Prometheus's ConfigMap:

prom-configmap.yaml

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-ops
data:
  prometheus.yaml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s

    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager-svc:9093"]

    rule_files:
    - /etc/prometheus/rules.yaml

    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'redis'
      static_configs:
      - targets: ['redis.kube-ops.svc.cluster.local:9121']
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: "kubernetes-kubelet"
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
    - job_name: "kubernetes_cAdvisor"
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: '(.+)'
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        target_label: __metrics_path__
    - job_name: "kubernetes-apiserver"
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: "kubernetes-scheduler"
      kubernetes_sd_configs:
      - role: endpoints

    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

  rules.yaml: |
    groups:
    - name: test-rule
      rules:
      - alert: NodeMemoryUsage
        expr: (sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / sum(node_memory_MemTotal_bytes) * 100 > 5
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High Memory usage detected"
          description: "{{$labels.instance}}: Memory usage is above 5% (current value is: {{ $value }})"
```

Above we defined an alerting rule named NodeMemoryUsage, where:

  • The `for` clause makes Prometheus wait for the given duration: the expression must keep evaluating to true for that long before the alert actually fires.
  • The `labels` clause attaches an additional list of labels to the alert.
  • The `annotations` clause specifies another set of labels that are not part of the alert's identity; they typically carry extra information used when displaying the alert. (A quick way to validate the rules file offline is shown right after this list.)
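
promtool, which ships alongside Prometheus, can validate the rules file before it reaches the cluster, catching both YAML and PromQL mistakes; this assumes the rules.yaml content above has been saved locally:

```bash
# Check the rule file's syntax and PromQL expressions offline.
promtool check rules rules.yaml
```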

Then update the ConfigMap and reload Prometheus again:

```bash
# kubectl apply -f prom-configmap.yaml
# curl -X POST "http://10.68.140.137:9090/-/reload"
```


In the Prometheus UI we can now see the alerting rule we just defined, together with its state. Over its lifecycle, an alert is in one of three states:

  • inactive: the alert is neither pending nor firing
  • pending: the alert condition is true, but has not yet held for the configured `for` duration
  • firing: the condition has held for longer than the `for` duration, so the alert fires
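
The same state is also queryable: Prometheus exposes pending and firing alerts through the built-in ALERTS metric, so you can inspect it via the query API instead of the UI (Prometheus address as used above):

```bash
# Each returned series carries an "alertstate" label ("pending" or "firing").
curl -s "http://10.68.140.137:9090/api/v1/query" \
  --data-urlencode 'query=ALERTS{alertname="NodeMemoryUsage"}'
```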

Shortly afterwards, the alert notification arrives by email.


# 3. Webhook Alerts

## 3.1 Python

Let's write a simple DingTalk alerting program with Flask:

```python
#!/usr/bin/python
# -*- coding:utf-8 -*-
import os
import json
import requests

from flask import Flask
from flask import request

app = Flask(__name__)


@app.route('/', methods=['POST', 'GET'])
def send():
    if request.method == 'POST':
        post_data = request.get_data()
        post_data = format_message(bytes2json(post_data))
        print(post_data)
        send_alert(post_data)
        return 'success'
    else:
        return 'welcome to use prometheus alertmanager dingtalk webhook server!'


def bytes2json(data_bytes):
    data = data_bytes.decode('utf8').replace("'", '"')
    return json.loads(data)


def format_message(post_data):
    EXCLUDE_LIST = ['prometheus', 'endpoint']
    message_list = []
    message_list.append('#### Alert status: {}'.format(post_data['status']))
    # message_list.append('**alertname:**{}'.format(post_data['alerts'][0]['labels']['alertname']))
    message_list.append('> **startsAt: **{}'.format(post_data['alerts'][0]['startsAt']))
    message_list.append('##### Labels:')
    for label in post_data['alerts'][0]['labels'].keys():
        if label in EXCLUDE_LIST:
            continue
        else:
            message_list.append("> **{}: **{}".format(label, post_data['alerts'][0]['labels'][label]))
    message_list.append('##### Annotations:')
    for annotation in post_data['alerts'][0]['annotations'].keys():
        message_list.append('> **{}: **{}'.format(annotation, post_data['alerts'][0]['annotations'][annotation]))
    message = (" \n\n ".join(message_list))
    title = post_data['alerts'][0]['labels']['alertname']
    data = {"title": title, "message": message}
    return data


def send_alert(data):
    token = os.getenv('ROBOT_TOKEN')
    if not token:
        print('you must set ROBOT_TOKEN env')
        return
    url = 'https://oapi.dingtalk.com/robot/send?access_token=%s' % token
    send_data = {
        "msgtype": "markdown",
        "markdown": {
            "title": data['title'],
            "text": "{}".format(data['message'])
        }
    }
    req = requests.post(url, json=send_data)
    result = req.json()
    print(result)
    if result['errcode'] != 0:
        print('notify dingtalk error: %s' % result['errcode'])


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```

The code is straightforward: the group robot's token is passed in through a ROBOT_TOKEN environment variable, and the payload posted by the webhook is formatted and forwarded to the group robot as markdown text.
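
Before building an image, you can exercise the app locally by posting a hand-written payload shaped like Alertmanager's webhook body; the JSON below is a minimal sketch containing only the fields the code reads (status, startsAt, labels, annotations):

```bash
# Run the app with a token set, then simulate an Alertmanager webhook call.
export ROBOT_TOKEN=xxxxxx
python app.py &

curl -s -X POST http://localhost:5000/ \
  -H 'Content-Type: application/json' \
  -d '{"status": "firing", "alerts": [{"startsAt": "2019-11-29T08:00:00Z", "labels": {"alertname": "NodeMemoryUsage", "instance": "node01", "team": "node"}, "annotations": {"summary": "node01: High Memory usage detected"}}]}'
```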

The Dockerfile:

```dockerfile
FROM python:3.6.4

# set working directory
WORKDIR /src

# add app
ADD . /src

# install requirements
RUN pip install -r requirements.txt

# run server
CMD python app.py
```

requirements.txt

```
certifi==2018.10.15
chardet==3.0.4
Click==7.0
Flask==1.0.2
idna==2.7
itsdangerous==1.1.0
Jinja2==2.10
MarkupSafe==1.1.0
requests==2.20.1
urllib3==1.24.1
Werkzeug==0.14.1
```
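
Build and run the image locally to make sure everything is wired together (the tag is illustrative):

```bash
# Build the image and run it with the robot token injected.
docker build -t dingtalk-hook:v0.3 .
docker run --rm -e ROBOT_TOKEN=xxxxxx -p 5000:5000 dingtalk-hook:v0.3
```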

Now deploy the service in the cluster.

First, create a Secret that holds the token of the DingTalk custom robot:

```bash
# kubectl create secret generic dingtalk-secret --from-literal=token=xxxxxx -n kube-ops
```

dingtalk-hook.yaml

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dingtalk-hook
  namespace: kube-ops
spec:
  selector:
    matchLabels:
      app: dingtalk-hook
  template:
    metadata:
      labels:
        app: dingtalk-hook
    spec:
      containers:
        - name: dingtalk-hook
          image: registry.cn-hangzhou.aliyuncs.com/joker_kubernetes/dingtalk-hook:v0.3
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 5000
              name: http
          env:
            - name: ROBOT_TOKEN
              valueFrom:
                secretKeyRef:
                  name: dingtalk-secret
                  key: token
          resources:
            requests:
              cpu: 50m
              memory: 100Mi
            limits:
              cpu: 50m
              memory: 100Mi

---
apiVersion: v1
kind: Service
metadata:
  name: dingtalk-hook
  namespace: kube-ops
spec:
  selector:
    app: dingtalk-hook
  ports:
    - name: hook
      port: 5000
      targetPort: http
```

Apply the manifests:

```bash
# kubectl apply -f dingtalk-hook.yaml
deployment.apps/dingtalk-hook created
```
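
Since the Flask app answers GET requests with a welcome message, an easy smoke test is to port-forward the Service and hit it once:

```bash
# Forward the Service locally and hit the GET handler.
kubectl -n kube-ops port-forward svc/dingtalk-hook 5000:5000 &
curl -s http://localhost:5000/
# -> welcome to use prometheus alertmanager dingtalk webhook server!

# Once alerts start flowing, the formatted payloads show up in the logs.
kubectl -n kube-ops logs deploy/dingtalk-hook
```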

Next, modify the Alertmanager ConfigMap to add a webhook receiver:

alertmanager-config.yaml

```yaml
......
      - receiver: webhook
        group_wait: 10s
        match:
          filesystem: node
    receivers:
    - name: 'webhook'
      webhook_configs:
      - url: "http://dingtalk-hook.kube-ops.svc:5000"
        send_resolved: true
......
```

Update the configuration and recreate the Alertmanager Deployment:

```bash
# kubectl apply -f alertmanager-config.yaml
# kubectl delete -f alertmanager-deploy.yaml
deployment.apps "alertmanager" deleted
# kubectl apply -f alertmanager-deploy.yaml
deployment.apps/alertmanager created
```

Then add a rule to Prometheus as follows:

```yaml
- alert: NodeFilesystemUsage
  expr: (sum(node_filesystem_size_bytes{device="rootfs"}) - sum(node_filesystem_free_bytes{device="rootfs"})) / sum(node_filesystem_size_bytes{device="rootfs"}) * 100 > 10
  for: 2m
  labels:
    filesystem: node
  annotations:
    summary: "{{$labels.instance}}: High Filesystem usage detected"
    description: "{{$labels.instance}}: Filesystem usage is above 10% (current value is: {{ $value }})"
```
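
Before relying on the alert, you can evaluate the expression by hand against the query API to confirm it returns data and sits above the threshold (Prometheus address as used earlier):

```bash
# Evaluate the filesystem-usage expression directly.
curl -s "http://10.68.140.137:9090/api/v1/query" \
  --data-urlencode 'query=(sum(node_filesystem_size_bytes{device="rootfs"}) - sum(node_filesystem_free_bytes{device="rootfs"})) / sum(node_filesystem_size_bytes{device="rootfs"}) * 100'
```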

Update Prometheus's ConfigMap again and reload:

```bash
# kubectl apply -f prom-configmap.yaml
# curl -X POST "http://10.68.140.137:9090/-/reload"
```

We can then see the alert fire, and the notification arrives in the DingTalk group.


## 3.2 Go

A good ready-made implementation: https://github.com/timonwong/prometheus-webhook-dingtalk
