Deploying a reliable monitoring system in a Kubernetes cluster is like fitting an airplane with a black box: you hope never to need it, but it can save you when it matters. AWS EKS is a managed Kubernetes service that simplifies cluster operations, but monitoring is still yours to build. Prometheus and Grafana are the classic pairing here: the former collects time-series data with precision, the latter turns it into dashboards, and together they give you a real-time picture of cluster health.
I deployed this stack for an e-commerce platform, and during the 618 shopping festival it flagged three node resource bottlenecks before they became incidents. This article walks through the full deployment from scratch, with the configuration tips and pitfall notes I collected in production.
Before starting, make sure your EKS cluster is running and reachable from your workstation. Verify with:
```bash
kubectl get nodes -o wide
aws eks describe-cluster --name your-cluster-name
```
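Helm drives the installation below, so it is worth confirming the client tooling as well; a quick sanity check (Helm 3 and the AWS CLI are assumed to be installed):

```bash
# Confirm client tooling; Helm 3.x is required by the chart used below
helm version --short
aws --version
kubectl version --client
```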
Based on several production rollouts, I recommend the following version combination:
| Component | Version | Rationale |
|---|---|---|
| Prometheus | v2.36.2 | Supports the latest Kubernetes service-discovery APIs |
| Grafana | 9.1.8 | Stable release, compatible with AWS SigV4 auth |
| kube-state-metrics | 2.5.0 | Accurately reflects K8s resource state |
| node-exporter | 1.3.1 | Low-overhead node metric collection |
Important: avoid the latest tag; always pin exact versions in production. I once hit an incompatibility between a latest Prometheus build and some custom metrics.
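Once the Helm repo is added in the next step, you can list the pinned chart versions that are available; each chart version bundles specific component image tags:

```bash
# Show the most recent pinned chart versions
helm search repo prometheus-community/kube-prometheus-stack --versions | head -n 5
```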
Prometheus Operator is the "brain" that manages the monitoring components, and Helm is the easiest way to install it:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --version 41.7.1 \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
```
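Before moving on, a quick check that everything came up (resource names assume the release is called prometheus, as above):

```bash
# All pods should reach Running/Ready within a few minutes
kubectl get pods -n monitoring
# The operator should have created these custom resources
kubectl get prometheus,alertmanager -n monitoring
```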
Key parameter: `serviceMonitorSelectorNilUsesHelmValues=false` lets Prometheus discover ServiceMonitors that are not managed by this Helm release.

The default install has no monitoring for AWS-specific resources. Add a ServiceMonitor like the one below (it assumes the AWS Load Balancer Controller is installed and exposes a port named `metrics`):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aws-alb-monitor
  namespace: monitoring
spec:
  endpoints:
    - port: metrics
      interval: 30s
  # Without a namespaceSelector only Services in this namespace are matched;
  # the controller usually runs in kube-system
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: aws-load-balancer-controller
```
```bash
kubectl apply -f alb-service-monitor.yaml
kubectl get servicemonitors -n monitoring
```
The default resource requests are rarely right for production; adjust them through values.yaml:
```yaml
prometheus:
  prometheusSpec:
    resources:
      limits:
        cpu: 2
        memory: 4Gi
      requests:
        cpu: 500m
        memory: 2Gi
grafana:
  resources:
    limits:
      cpu: 1
      memory: 1Gi
    requests:
      cpu: 200m
      memory: 512Mi
```
Rule of thumb: Prometheus memory usage scales with the number of active time series, at roughly 2GB per million series. The `prometheus_tsdb_head_series` gauge shows how many series your instance is actually holding, which makes sizing straightforward.
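A quick way to read that gauge, assuming a local port-forward and jq installed:

```bash
# Forward the Prometheus API locally (kill the port-forward when done)
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090 &
# Current number of in-memory time series
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series' \
  | jq '.data.result[0].value[1]'
```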
If Grafana reaches Prometheus through an IAM-authenticated endpoint (for example Amazon Managed Service for Prometheus), configure SigV4 on the data source; for a plain in-cluster Prometheus the sigV4 settings can be dropped:
```yaml
grafana:
  env:
    AWS_SDK_LOAD_CONFIG: "true"
    AWS_STS_REGIONAL_ENDPOINTS: "regional"
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-operated.monitoring.svc:9090
          jsonData:
            sigV4Auth: true
            sigV4AuthType: default
            sigV4Region: us-west-2
```
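To confirm Grafana sees the data source, you can hit its API through a port-forward. The service and secret names below assume the release is called prometheus and that you have not yet overridden the admin credentials:

```bash
# Fetch the chart-generated admin password
GRAFANA_PW=$(kubectl get secret -n monitoring prometheus-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &
# List configured data sources; requires jq
curl -s -u "admin:${GRAFANA_PW}" http://localhost:3000/api/datasources | jq '.[].name'
```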
The stock dashboards are fairly basic; community templates make a better starting point (the Grafana.com dashboard IDs used below are 13771, 6417, and 13877). To import them:
```bash
# Download each dashboard JSON from grafana.com, then load it as a ConfigMap
# (kubectl's --from-file does not accept URLs, so fetch the file locally first)
DASH_IDS="13771 6417 13877"
for ID in $DASH_IDS; do
  curl -sL "https://grafana.com/api/dashboards/${ID}/revisions/latest/download" \
    -o "dashboard-${ID}.json"
  kubectl create configmap "grafana-dashboard-${ID}" \
    --from-file="dashboard-${ID}.json" \
    -n monitoring
  # This label lets the Grafana sidecar auto-import the dashboard
  kubectl label configmap "grafana-dashboard-${ID}" grafana_dashboard=1 -n monitoring
done
```
Push critical alerts to Slack or email (example configuration):
```yaml
alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname']
      receiver: 'slack-notifications'
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/XXX'
            channel: '#alerts'
            send_resolved: true
            # "slack.title"/"slack.text" must be defined in custom template files;
            # the built-ins are "slack.default.title"/"slack.default.text"
            title: '{{ template "slack.title" . }}'
            text: '{{ template "slack.text" . }}'
```
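It pays to lint the rendered config before trusting it. A sketch using amtool (the exact secret name depends on your release name and operator version, so list the secrets in the namespace to find yours):

```bash
kubectl get secrets -n monitoring | grep alertmanager
# Extract the config Alertmanager is actually running with and validate it
kubectl get secret -n monitoring alertmanager-prometheus-kube-prometheus-alertmanager \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > /tmp/alertmanager.yaml
amtool check-config /tmp/alertmanager.yaml
```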
The default install uses emptyDir, so data is lost on every restart. Back Prometheus with an EBS volume instead:
```yaml
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
          storageClassName: gp3
```
Capacity formula: retention (days) × 86,400 s/day × samples per second × bytes per sample ÷ 1024³. For example, 15 days of retention at 100,000 samples/s and ~2 bytes per sample comes to roughly 240GiB (a 25GB figure would correspond to 10,000 samples/s), so size the volume for your actual ingest rate.
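The same arithmetic in shell form, as a sanity check:

```bash
# TSDB size estimate: days x 86400 s/day x samples/s x bytes/sample, in GiB
RETENTION_DAYS=15 SAMPLES_PER_SEC=100000 BYTES_PER_SAMPLE=2
echo "$(( RETENTION_DAYS * 86400 * SAMPLES_PER_SEC * BYTES_PER_SAMPLE / 1024**3 )) GiB"
# => 241 GiB
```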
Trim what gets scraped with relabel_configs to cut load:
```yaml
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: 'kubernetes-services'
        metrics_path: /metrics
        # Service discovery is required for the __meta_kubernetes_* labels below
        kubernetes_sd_configs:
          - role: service
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
```
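After Prometheus reloads, you can confirm which targets survived the keep rule through the API (with the port-forward from earlier still running):

```bash
# Count active targets per job; requires jq
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[].labels.job' | sort | uniq -c
```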
Restrict network access so that only Grafana can query Prometheus:

```yaml
networkPolicy:
  enabled: true
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana
```
Keep the Grafana admin credentials in a Secret rather than hard-coding them in values.yaml:

```bash
kubectl create secret generic grafana-admin \
  --from-literal=admin-user=admin \
  --from-literal=admin-password=ComplexP@ssw0rd \
  -n monitoring
```
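The chart then has to be pointed at this secret. In the bundled Grafana sub-chart that is done through the admin.existingSecret values; a sketch:

```bash
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --reuse-values \
  --set grafana.admin.existingSecret=grafana-admin \
  --set grafana.admin.userKey=admin-user \
  --set grafana.admin.passwordKey=admin-password
```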
Common issues and how to triage them:

| Symptom | Diagnostic command | Fix |
|---|---|---|
| Prometheus targets show no data | `kubectl get servicemonitors -A` | Check that ServiceMonitor labels match the target Service |
| Grafana login fails | `kubectl logs -f grafana-pod` | Check the admin password in the secret |
| Node metrics missing | `kubectl get pods -n monitoring -l app.kubernetes.io/name=node-exporter` | Confirm the node-exporter DaemonSet is running |
| Prometheus OOMKilled | `kubectl describe pod prometheus-pod` | Raise the memory limit or scrape less frequently |
Check the Prometheus Operator logs:

```bash
kubectl logs -l app.kubernetes.io/name=prometheus-operator -n monitoring --tail=50
```
Key error patterns:

- `level=error ts=... caller=manager.go:533 component="scrape manager" msg="error creating new scrape pool"` → check your scrape config syntax
- `context deadline exceeded` → increase the scrape_timeout value

Adding these parameters in values.yaml improves performance on large clusters:
```yaml
prometheus:
  prometheusSpec:
    scrapeInterval: 1m
    evaluationInterval: 1m
    retention: 15d
    enableAdminAPI: false
    query:
      maxConcurrency: 50
      timeout: 2m
```
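You can verify the settings actually landed, both on the custom resource and on the running server (port-forward assumed, as before):

```bash
kubectl get prometheus -n monitoring -o jsonpath='{.items[0].spec.scrapeInterval}'
# The flag should report the configured value (50); requires jq
curl -s http://localhost:9090/api/v1/status/flags | jq '.data."query.max-concurrency"'
```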
Use different scrape intervals for metrics of different criticality:
```yaml
additionalScrapeConfigs:
  - job_name: 'critical-metrics'
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ['service.critical:8080']
  - job_name: 'normal-metrics'
    scrape_interval: 1m
    metrics_path: /metrics
    static_configs:
      - targets: ['service.normal:8080']
```
Create a Grafana dashboard that queries CloudWatch cost data. This needs a CloudWatch data source in Grafana; note that billing metrics are only published in us-east-1, and the AWS/Billing namespace exposes EstimatedCharges rather than Cost Explorer's UnblendedCost:

```text
NAMESPACE=AWS/Billing METRICS=EstimatedCharges DIMENSIONS=ServiceName STATISTICS=Sum
```
Autoscale Grafana with KEDA based on a Prometheus metric:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: grafana-scaler
  namespace: monitoring
spec:
  scaleTargetRef:
    name: grafana
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: http_requests_total
        threshold: '1000'
        query: sum(rate(http_requests_total{job="grafana"}[1m]))
```
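Apply it and confirm KEDA accepted the trigger (this assumes KEDA is already installed in the cluster):

```bash
kubectl apply -f grafana-scaler.yaml
# The READY and ACTIVE columns show whether the trigger is being evaluated
kubectl get scaledobject grafana-scaler -n monitoring
```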