In cloud-native architectures, Kubernetes has become the de facto standard for container orchestration, and AWS EKS (Elastic Kubernetes Service), as a managed Kubernetes offering, greatly simplifies cluster management. But a challenge follows: how do you effectively monitor such a dynamic, distributed system? This is exactly where the Prometheus + Grafana combination earns its keep.

Having managed several production-grade EKS clusters, I've found that running without monitoring is like driving in the dark: you never know what's around the next corner. Prometheus, a CNCF graduated project, has a multi-dimensional data model and a powerful query language (PromQL) that suit Kubernetes environments particularly well, while Grafana turns dry metrics into intuitive visualizations.

Important: before deploying, make sure your AWS account has sufficient IAM permissions, particularly for the EKS, EC2, and IAM services. Insufficient permissions are a common cause of failed deployments.
A typical deployment architecture has the following layers:

```mermaid
graph TD
    A[Prometheus Server] -->|scrapes metrics| B(Kubernetes Pods)
    A --> C[node-exporter]
    A --> D[kube-state-metrics]
    A --> E[other exporters]
    F[Grafana] -->|queries data| A
    A -->|sends alerts| G[Alertmanager]
```
Suggested resource allocation by cluster size:

| Cluster nodes | Prometheus CPU | Prometheus memory | Grafana CPU | Grafana memory |
|---|---|---|---|---|
| <10 | 2 cores | 4 GiB | 1 core | 2 GiB |
| 10-50 | 4 cores | 8 GiB | 2 cores | 4 GiB |
| >50 | 8 cores+ | 16 GiB+ | 4 cores | 8 GiB |
In practice, each Prometheus sample occupies roughly 1-2 bytes on disk after compression; memory usage scales mainly with the number of active time series. For a cluster of around 500 nodes, expect to need 32 GiB or more of memory for stable operation.
First, install the required command-line tools:

```bash
# Install eksctl
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Install helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```
Using the Prometheus community chart is the recommended approach:

```bash
# Add the helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create a dedicated monitoring namespace
kubectl create ns monitoring

# Customize values.yaml
cat > prometheus-values.yaml <<EOF
alertmanager:
  enabled: true
  persistentVolume:
    enabled: true
    size: 10Gi
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        cpu: 2
        memory: 4Gi
      limits:
        cpu: 4
        memory: 8Gi
nodeExporter:
  enabled: true
kubeStateMetrics:
  enabled: true
EOF

# Install the Prometheus stack
helm install prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f prometheus-values.yaml
```
Key parameters:

- `storageSpec`: configures persistent storage so metrics survive pod restarts
- `resources`: resource requests and limits, per the sizing table above
- `nodeExporter`: collects node-level metrics (CPU, memory, disk, etc.)
- `kubeStateMetrics`: exposes Kubernetes object state as Prometheus metrics

Once the deployment completes, get the Grafana access address:
```bash
kubectl get svc -n monitoring prometheus-grafana \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```

(This assumes the Grafana service is of type LoadBalancer; the chart's default is ClusterIP, in which case use `kubectl port-forward` or an Ingress instead.)
Default login credentials (the username is `admin`; retrieve the password with):

```bash
kubectl get secret -n monitoring prometheus-grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode
```
Recommended dashboards to import (by Grafana.com dashboard ID), for example:

- 1860: Node Exporter Full
- 315: Kubernetes cluster monitoring (via Prometheus)

To import: in Grafana, open Dashboards → Import, enter the dashboard ID, and select your Prometheus data source.
For production environments, persistent storage is a must. Using EBS as an example:

```yaml
# prometheus-values.yaml addition
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
```
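The default `gp2` class works out of the box, but gp3 volumes are cheaper per GiB. A minimal StorageClass sketch, assuming the EBS CSI driver is installed (the class name is illustrative); `allowVolumeExpansion: true` also lets you grow the PVC later if Prometheus outgrows its disk:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```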
Expose Grafana through an ALB Ingress:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
  - host: grafana.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-grafana
            port:
              number: 80
```
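Note that the `kubernetes.io/ingress.class` annotation is deprecated; with recent AWS Load Balancer Controller versions (v2.4+), the `spec.ingressClassName` field is the current equivalent, assuming an IngressClass named `alb` exists (the controller's Helm chart creates one by default):

```yaml
# Alternative to the kubernetes.io/ingress.class annotation
spec:
  ingressClassName: alb
```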
An example Alertmanager configuration routing alerts to Slack:

```yaml
# alertmanager-config.yaml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'slack-notifications'
receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/XXX'
    channel: '#alerts'
    send_resolved: true
```
Apply the configuration:

```bash
kubectl create secret generic alertmanager-prometheus-kube-prometheus-alertmanager \
  -n monitoring \
  --from-file=alertmanager.yaml=alertmanager-config.yaml \
  --dry-run=client -o yaml | kubectl apply -f -
```
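The Alertmanager config above only routes alerts; the rules that actually fire them live in `PrometheusRule` objects. A minimal sketch (the alert name, threshold, and labels here are illustrative assumptions; the `release` label must match your Helm release name so the operator picks the rule up):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-cpu-alerts
  namespace: monitoring
  labels:
    release: prometheus  # must match the Helm release
spec:
  groups:
  - name: node.rules
    rules:
    - alert: NodeHighCPU
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Instance {{ $labels.instance }} CPU above 80% for 10 minutes"
```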
Symptom: the Prometheus pod restarts repeatedly, with "no space left on device" in its logs.

Solutions: either expand the PVC (requires a StorageClass with `allowVolumeExpansion: true`):

```bash
kubectl edit pvc prometheus-prometheus-kube-prometheus-prometheus-db -n monitoring
```

or shorten the retention window:

```yaml
prometheus:
  prometheusSpec:
    retention: 7d  # Prometheus defaults to 15d
```
Check steps (e.g. when Grafana cannot reach Prometheus):

```bash
# 1. Confirm the Prometheus service has endpoints
kubectl get endpoints -n monitoring prometheus-operated

# 2. Look for NetworkPolicies that might block traffic
kubectl describe networkpolicy -n monitoring | grep -i prometheus

# 3. Test connectivity from the Grafana pod
kubectl exec -it -n monitoring deploy/prometheus-grafana -- \
  curl http://prometheus-operated.monitoring.svc:9090
```
Typical error: "context deadline exceeded" (a scrape timing out).

Fix: tune the scrape interval and timeout:

```yaml
prometheus:
  prometheusSpec:
    scrapeInterval: 30s
    scrapeTimeout: 10s
```

To inspect individual targets, port-forward the Prometheus UI:

```bash
kubectl port-forward -n monitoring svc/prometheus-operated 9090
```

then open http://localhost:9090/targets.
For large clusters, consider offloading long-term storage to S3 with Thanos. An example deployment (after `helm repo add bitnami https://charts.bitnami.com/bitnami`):

```bash
helm install thanos bitnami/thanos \
  --set objstore.config.type=AWS \
  --set objstore.config.config.bucket=your-s3-bucket \
  --set objstore.config.config.region=us-west-2
```
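Whatever flag names your chart version uses (they vary between releases), the underlying Thanos object-storage configuration for S3 has this shape; bucket and endpoint are placeholders:

```yaml
# thanos-objstore.yaml -- Thanos S3 object storage configuration
type: S3
config:
  bucket: your-s3-bucket
  endpoint: s3.us-west-2.amazonaws.com
  region: us-west-2
```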
Trim unnecessary metric collection, and be deliberate about the discovery selectors. Note that setting the `*SelectorNilUsesHelmValues` flags to `false` actually widens discovery: Prometheus will pick up every ServiceMonitor, PodMonitor, and rule in the cluster, not just the chart's own:

```yaml
prometheus:
  prometheusSpec:
    ignoreNamespaceSelectors: false
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
```
The golden rule for tuning resource limits: derive sizes from the observed ingestion rate rather than node count alone.

Disk-sizing formula:

```
required_disk = retention_time_seconds × ingested_samples_per_second × bytes_per_sample
```

where bytes_per_sample is typically 1-2 bytes.
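As a worked example of the formula, under purely illustrative assumptions (500 nodes, ~1,000 active series per node, a 30-second scrape interval, 2 bytes per sample, 15-day retention):

```shell
# Rough Prometheus disk sizing from the formula above.
# All figures are assumptions for illustration, not measurements.
nodes=500
series_per_node=1000
scrape_interval=30      # seconds
bytes_per_sample=2      # upper end of the 1-2 byte range
retention_days=15

samples_per_second=$(( nodes * series_per_node / scrape_interval ))
retention_seconds=$(( retention_days * 24 * 3600 ))
disk_bytes=$(( retention_seconds * samples_per_second * bytes_per_sample ))
echo "$(( disk_bytes / 1024 / 1024 / 1024 )) GiB"   # prints: 40 GiB
```

So the 100Gi volume configured earlier leaves comfortable headroom for a cluster of this size.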