Prometheus Monitoring

Workflow overview

The overall flow for monitoring services with Prometheus and the Prometheus Operator on Kubernetes. This outline covers only the steps, without the YAML configuration files:

1. Install the Prometheus Operator
First, make sure the Prometheus Operator is installed in your Kubernetes cluster. The Operator manages Prometheus deployment, configuration, service discovery and monitoring. You can install it with Helm or by applying its YAML manifests directly.

2. Create a Service (expose the application)
Create a Kubernetes Service for your application. It exposes the application's port so that Prometheus can reach the metrics endpoint.

A Service exposes the application's port and interface so that other services (or external clients) can access it.

When creating it, specify the port, the selector (which Pods/containers are exposed), and so on; a minimal sketch follows.
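A hedged example, assuming a hypothetical application named my-app that serves its metrics on port 8080 (the names and ports are illustrative, not from the original text):

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app          # the ServiceMonitor later selects on this label
spec:
  selector:
    app: my-app          # matches the application's Pods
  ports:
  - name: metrics        # ServiceMonitors reference Service ports by name
    port: 8080
    targetPort: 8080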

3. Create Endpoints (optional)
Normally Kubernetes creates the Endpoints for your Service automatically, exposing the application's network addresses and ports. If your application lives at a static address or needs manual wiring, you can create an Endpoints resource yourself; a sketch follows.
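A hedged sketch of a manually maintained Endpoints object, for the case where the target sits behind a selector-less Service (the IP is purely illustrative):

apiVersion: v1
kind: Endpoints
metadata:
  name: my-app           # must match the Service name
  namespace: default
subsets:
- addresses:
  - ip: 10.0.0.10        # hypothetical static address of the target
  ports:
  - name: metrics
    port: 8080
    protocol: TCP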

4. Create a ServiceMonitor (Prometheus monitoring)
A ServiceMonitor is a resource provided by the Prometheus Operator that tells Prometheus which services to monitor. It is associated with a Kubernetes Service and defines how that service's metrics are scraped.

In the ServiceMonitor you specify the labels of the Service to monitor (for example, Services carrying app: my-app) and configure the scrape port, interval and so on; see the sketch after this step.

Prometheus then discovers the service automatically via the ServiceMonitor and scrapes its metrics.
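A hedged sketch that would scrape the hypothetical my-app Service shown above (all names are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    app: my-app
spec:
  selector:
    matchLabels:
      app: my-app        # pick up Services carrying this label
  namespaceSelector:
    matchNames:
    - default            # namespace where the Service lives
  endpoints:
  - port: metrics        # the named Service port to scrape
    path: /metrics
    interval: 30s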

5. Verify the Prometheus configuration
Once the ServiceMonitor is in place, Prometheus starts monitoring the service you defined. You can check the scrape targets and monitored services in the Prometheus web UI.

Open the Prometheus UI, look at the "Targets" page and confirm the target service shows up and is healthy (UP).

In the Prometheus UI, query the relevant metrics to make sure the service's data is actually being scraped.

6. Set up alerting (optional)
If you want alerting rules, Prometheus can watch the health of the service and fire alerts when something goes wrong, for example when the application's response time gets too long.

Define the alerting rules in a PrometheusRule resource, e.g. for slow HTTP responses or a high error rate; see the sketch after this step.

Together with Alertmanager, Prometheus can route alerts to different notification channels (Slack, e-mail, and so on).
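A hedged sketch of a PrometheusRule. The metric http_requests_total and the job label my-app are illustrative, and the labels the rule object needs depend on the ruleSelector configured in your Prometheus CR (kube-prometheus commonly matches prometheus: k8s and role: alert-rules):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
  - name: my-app.rules
    rules:
    - alert: MyAppHighErrorRate            # hypothetical alert on the 5xx ratio
      expr: |
        sum(rate(http_requests_total{job="my-app",code=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="my-app"}[5m])) > 0.05
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "my-app 5xx error rate has been above 5% for 10 minutes"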

7. Maintenance and optimization
Scaling and tuning: as the application grows, extend the Prometheus setup as needed, adding more ServiceMonitors or adjusting scrape intervals.

Troubleshooting: use the alerts and monitoring data to spot and fix service problems early; the monitoring data is also useful for root cause analysis (RCA).

Installation

1. Download and unpack the release
wget -O kube-prometheus-0.13.0.zip https://github.com/prometheus-operator/kube-prometheus/archive/refs/tags/v0.13.0.zip
unzip kube-prometheus-0.13.0.zip
rm -f kube-prometheus-0.13.0.zip && cd kube-prometheus-0.13.0

2. First, check which images are referenced
# find ./ -type f |xargs egrep 'image: quay.io|image: registry.k8s.io|image: grafana|image: docker.io'|awk '{print $3}'|sort|uniq

quay.io/prometheus-operator/prometheus-config-reloader:v0.67.1 # Note: this image is configured differently and is not picked up by the command above

grafana/grafana:9.5.3
docker.io/cloudnativelabs/kube-router
quay.io/brancz/kube-rbac-proxy:v0.14.2
quay.io/fabxc/prometheus_demo_service
quay.io/prometheus/alertmanager:v0.26.0
quay.io/prometheus/blackbox-exporter:v0.24.0
quay.io/prometheus/node-exporter:v1.6.1
quay.io/prometheus-operator/prometheus-operator:v0.67.1
quay.io/prometheus/prometheus:v2.46.0
registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.9.2
registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.11.1

# Some of these images cannot be pulled from networks inside China, so they have all been mirrored to Docker Hub (docker.io/bogeit); use the commands below to batch-replace the image addresses

find ./ -type f |xargs sed -ri 's+quay.io/.*/+docker.io/bogeit/+g'
find ./ -type f |xargs sed -ri 's+docker.io/cloudnativelabs/+docker.io/bogeit/+g'
find ./ -type f |xargs sed -ri 's+grafana/+docker.io/bogeit/+g'
find ./ -type f |xargs sed -ri 's+registry.k8s.io/.*/+docker.io/bogeit/+g'

3. Create all the resources
kubectl create -f manifests/setup
kubectl create -f manifests/
After a while, check the results:
kubectl -n monitoring get all
kubectl -n monitoring get pod -w

# Appendix: to wipe out everything deployed above:
kubectl delete --ignore-not-found=true -f manifests/ -f manifests/setup


Accessing the Prometheus UI

# vim prometheus-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: monitoring   # must be in the same namespace as the Service/Endpoints
  name: prometheus
spec:
  rules:
  - host: prometheus.k8s.com
    http:
      paths:
      - backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
        path: /
        pathType: Prefix


# kubectl -n monitoring apply -f prometheus-ingress.yaml

Creating the Grafana Ingress

# vim grafana-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: monitoring   # must be in the same namespace as the Service/Endpoints
  name: grafana
spec:
  rules:
  - host: grafana.k8s.com
    http:
      paths:
      - backend:
          service:
            name: grafana
            port:
              number: 3000
        path: /
        pathType: Prefix


# kubectl -n monitoring apply -f grafana-ingress.yaml

The Grafana username and password are both admin by default.

Note: delete the NetworkPolicies shipped with kube-prometheus, otherwise access to these services will be blocked:

kubectl -n monitoring delete networkpolicies.networking.k8s.io --all
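A less invasive alternative (a hedged sketch, not from the original text): instead of deleting every policy, add an allow rule so traffic from the ingress controller namespace can reach the Pods in monitoring. NetworkPolicies are additive, so this can coexist with the policies shipped by kube-prometheus; it assumes the ingress-nginx namespace carries the default kubernetes.io/metadata.name label.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress-nginx
  namespace: monitoring
spec:
  podSelector: {}                      # applies to every Pod in monitoring
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx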

Accessing through the Ingress

The access flow: the Service in the ingress-nginx namespace forwards to its associated controller Pod, which acts like an nginx reverse proxy.
That is, the Pod's node IP plus the port exposed by the Service proxies requests for grafana.k8s.com or prometheus.k8s.com,
e.g. grafana.k8s.com:30000 and prometheus.k8s.com:30000 (30000 being the NodePort mapped to port 80 below).

kubectl get svc -n ingress-nginx
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress-nginx-controller NodePort 10.96.186.188 <none> 80:30000/TCP,443:31388/TCP 7d3h

kubectl get po -n ingress-nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ingress-nginx-controller-x84p4 1/1 Running 6 (5m20s ago) 8m31s 192.168.85.129 k8s-node1 <none> <none>

Monitoring kube-controller-manager and kube-scheduler

# Here we can see that both services listen on 0.0.0.0, which is what we want
# ss -tlnp|egrep 'controller|schedule'
LISTEN 0 32768 *:10257 *:* users:(("kube-controller",pid=3528,fd=3))
LISTEN 0 32768 *:10259 *:* users:(("kube-scheduler",pid=837,fd=3))
-----------------------------------------------------------------------------------------------------------
If they listen on 127.0.0.1 instead (e.g. 127.0.0.1:10257), the fix is to set --bind-address=0.0.0.0 in the component's startup arguments and let it restart. Note that `kubectl edit po kube-controller-manager-k8s-master -n kube-system` will not persist for a static Pod; change the flag where the component is actually configured (the static Pod manifest, or the service unit's flags for a binary deployment).
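For a kubeadm-style cluster the flag lives in the static Pod manifests under /etc/kubernetes/manifests/; a sketch of the relevant fragment of kube-controller-manager.yaml (the kubelet restarts the Pod automatically once the file is saved):

spec:
  containers:
  - command:
    - kube-controller-manager
    - --bind-address=0.0.0.0   # listen on all interfaces instead of 127.0.0.1
    # ... other flags unchanged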


Because these two K8s core components are deployed in binary form here, we need to create matching Service and Endpoints objects so that the Prometheus running on K8s can discover them.
The ServiceMonitors for them already exist, so there is no need to create those:
kubectl get servicemonitors.monitoring.coreos.com -A
NAMESPACE NAME AGE
monitoring alertmanager-main 28h
monitoring blackbox-exporter 28h
monitoring coredns 28h
monitoring grafana 28h
monitoring kube-apiserver 28h
monitoring kube-controller-manager 28h
monitoring kube-scheduler 28h
monitoring kube-state-metrics 28h
monitoring kubelet 28h
monitoring node-exporter 28h
monitoring prometheus-adapter 28h
monitoring prometheus-k8s 28h
monitoring prometheus-operator 28h
------------------------------------------------------------------------------------------------------
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    app.kubernetes.io/name: kube-controller-manager
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10257          # default port
    targetPort: 10257
    protocol: TCP

---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.85.128   # IP of the node running this component; list every node if there are several
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP

---

apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    app.kubernetes.io/name: kube-scheduler
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: https-metrics
    port: 10259
    targetPort: 10259
    protocol: TCP

---
apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-scheduler
  name: kube-scheduler
  namespace: kube-system
subsets:
- addresses:
  - ip: 192.168.85.128   # IP of the node running this component; list every node if there are several
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
----------------------------------------------------------------------------------
Save the YAML above as repair-prometheus.yaml, then create it:

kubectl apply -f repair-prometheus.yaml

Confirm after creation:

# kubectl -n kube-system get svc |egrep 'controller|scheduler'
kube-controller-manager ClusterIP None <none> 10257/TCP 58s
kube-scheduler ClusterIP None <none> 10259/TCP 58s

Then go back to the Prometheus UI and wait a little while; the targets will show up as discovered:

serviceMonitor/monitoring/kube-controller-manager/0 (2/2 up)
serviceMonitor/monitoring/kube-scheduler/0 (2/2 up)

Monitoring etcd

Use the command below to see which metrics ETCD exposes:
curl -k --cacert /etc/kubernetes/pki/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://192.168.85.128:2379/metrics
--------------------------------------------------------------------------------------------
Once the output above looks fine, configure things so that ETCD can be discovered and monitored by Prometheus.

# First, create the ETCD certificates as a Secret
kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/pki/ca.crt --from-file=/etc/kubernetes/pki/etcd/server.crt --from-file=/etc/kubernetes/pki/etcd/server.key

# Then reference this Secret in the Prometheus custom resource
kubectl -n monitoring edit prometheus k8s

spec:
  ...
  secrets:
  - etcd-certs

# After saving and exiting, Prometheus automatically restarts its Pods to load this Secret. After a short wait, exec into the Pod to check that the ETCD certificates are mounted
# kubectl -n monitoring exec -it prometheus-k8s-0 -c prometheus -- sh
/prometheus $ ls /etc/prometheus/secrets/etcd-certs/
ca.crt  server.crt  server.key

-----------------------------------------------------------------------------------------------
Because etcd has no ServiceMonitor out of the box, we have to create the Service, Endpoints and ServiceMonitor ourselves.
Next, prepare the YAML for the Service, Endpoints and ServiceMonitor.

Note: replace the node IP below with the actual internal IP of the node(s) where ETCD runs.

# vim prometheus-etcd.yaml
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 192.168.156.128   # IP of the node running etcd; list every node if there are several
  ports:
  - name: api
    port: 2379
    protocol: TCP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app
  endpoints:
  - port: api
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/server.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/server.key
      # use insecureSkipVerify only if you cannot use a Subject Alternative Name
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - monitoring
-------------------------------------------------------------------------


Create the resources above:

# kubectl apply -f prometheus-etcd.yaml
service/etcd-k8s created
endpoints/etcd-k8s created
servicemonitor.monitoring.coreos.com/etcd-k8s created

After a short while you can see the ETCD cluster being monitored in the Prometheus UI:

serviceMonitor/monitoring/etcd-k8s/0 (3/3 up)

------------------------------------------------------------------------------
Search for etcd in the Grafana dashboards gallery and download the dashboard's JSON template file:
https://grafana.com/grafana/dashboards/3070-etcd/

Then open the home page of the Grafana instance deployed earlier,
click the top-left menu HOME --- Data source --- Add data source --- choose Prometheus,
look up the Prometheus address, enter it and save (the datasource sketch after these steps shows roughly what to expect):
# kubectl -n monitoring get secrets grafana-datasources -o yaml

Then click Import dashboard in the top-right (+) menu,
click the Upload .json File button and upload the 3070_rev3.json file downloaded above,
click Import, and the etcd cluster dashboards will be displayed.
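For reference, the datasources.yaml inside that grafana-datasources secret is a standard Grafana provisioning file and typically looks roughly like this (a sketch; the URL points at the in-cluster prometheus-k8s Service):

apiVersion: 1
datasources:
- name: prometheus
  type: prometheus
  access: proxy
  url: http://prometheus-k8s.monitoring.svc:9090   # in-cluster Prometheus address
  isDefault: true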

Monitoring ingress-nginx

Note: Prometheus lives in the monitoring namespace while the Ingress controller lives in ingress-nginx, so the Prometheus ServiceAccount needs to be bound to a role that lets it read resources across namespaces.
------------------------------------------------------------------------------------
cat cr.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-access-all-namespaces
rules:
- apiGroups: [""]
  resources: ["services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]   # watch is also needed for Prometheus's Kubernetes service discovery
----------------------------------------------------------------
cat crb.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-access-all-namespaces-binding
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: monitoring
roleRef:
  kind: ClusterRole
  name: prometheus-access-all-namespaces
  apiGroup: rbac.authorization.k8s.io
-----------------------------------------------------------------
Because the ingress-nginx controller is deployed as a DaemonSet and maps its ports onto the host, we can query the metrics endpoint directly via the node IP where the Pod runs:

kubectl -n ingress-nginx get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ingress-nginx-controller-mbs95 1/1 Running 0 74m 192.168.85.129 k8s-node1 <none> <none>


# Enable the metrics endpoint
# kubectl -n ingress-nginx edit ds ingress-nginx-controller
# Search for metrics, find - --enable-metrics= and set it to true (a sketch of the relevant fragment follows)
# curl 192.168.85.129:10254/metrics
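A sketch of the DaemonSet fragment being edited (field positions and the container name depend on your ingress-nginx install; the metrics port 10254 matches the curl above):

spec:
  template:
    spec:
      containers:
      - name: controller
        args:
        - /nginx-ingress-controller
        - --enable-metrics=true    # turn the Prometheus metrics endpoint on
        ports:
        - name: metrics
          containerPort: 10254     # the port scraped for /metrics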
------------------------------------------------------------------------------
Create a ServiceMonitor so that Prometheus can discover the ingress-nginx metrics:
cat ingress.servicemonitoring.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingress-servicemonitor
  namespace: monitoring
  labels:
    app.kubernetes.io/name: ingress-nginx
spec:
  jobLabel: ingress-test
  endpoints:
  - port: app
    interval: 30s
    scheme: http
    path: /metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  namespaceSelector:
    matchNames:
    - ingress-nginx
-------------------------------------------------------------------
Create it:

# kubectl apply -f ingress.servicemonitoring.yaml
servicemonitor.monitoring.coreos.com/ingress-servicemonitor created
# kubectl get servicemonitors.monitoring.coreos.com -n monitoring
NAME AGE
alertmanager-main 32h
blackbox-exporter 32h
coredns 32h
etcd-k8s 27h
grafana 32h
ingress-servicemonitor 78m
kube-apiserver 32h
kube-controller-manager 32h
kube-scheduler 32h
kube-state-metrics 32h
kubelet 32h
node-exporter 32h
prometheus-adapter 32h
prometheus-k8s 32h
prometheus-operator 32h


Check the Prometheus UI again; the new target shows up:

serviceMonitor/monitoring/ingress-servicemonitor/0 (1/1 up)

Download the Grafana dashboard template and import it:

https://grafana.com/grafana/dashboards/14314-kubernetes-nginx-ingress-controller-nextgen-devops-nirvana/

