

    Install, deploy, and configure Grafana

    WHAT: a good-looking, powerful tool for visualizing monitoring metrics

    WHY: used in place of the native Prometheus UI

    # On the 200 machine, prepare the image and the resource manifests:
    ~]# docker pull grafana/grafana:5.4.2
    ~]# docker images|grep grafana
    ~]# docker tag 6f18ddf9e552 harbor.od.com/infra/grafana:v5.4.2
    ~]# docker push harbor.od.com/infra/grafana:v5.4.2
    ~]# mkdir /data/k8s-yaml/grafana/ /data/nfs-volume/grafana
    ~]# cd /data/k8s-yaml/grafana/
    grafana]# vi rbac.yaml
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      labels:
        addonmanager.kubernetes.io/mode: Reconcile
        kubernetes.io/cluster-service: "true"
      name: grafana
    rules:
    - apiGroups:
      - "*"
      resources:
      - namespaces
      - deployments
      - pods
      verbs:
      - get
      - list
      - watch
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      labels:
        addonmanager.kubernetes.io/mode: Reconcile
        kubernetes.io/cluster-service: "true"
      name: grafana
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: grafana
    subjects:
    - kind: User
      name: k8s-node
    
    grafana]# vi dp.yaml
    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      labels:
        app: grafana
        name: grafana
      name: grafana
      namespace: infra
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 7
      selector:
        matchLabels:
          name: grafana
      strategy:
        rollingUpdate:
          maxSurge: 1
          maxUnavailable: 1
        type: RollingUpdate
      template:
        metadata:
          labels:
            app: grafana
            name: grafana
        spec:
          containers:
          - name: grafana
            image: harbor.od.com/infra/grafana:v5.4.2
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 3000
              protocol: TCP
            volumeMounts:
            - mountPath: /var/lib/grafana
              name: data
          imagePullSecrets:
          - name: harbor
          securityContext:
            runAsUser: 0
          volumes:
          - nfs:
              server: hdss7-200
              path: /data/nfs-volume/grafana
            name: data
    
    grafana]# vi svc.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: grafana
      namespace: infra
    spec:
      ports:
      - port: 3000
        protocol: TCP
        targetPort: 3000
      selector:
        app: grafana
    
    grafana]# vi ingress.yaml
    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: grafana
      namespace: infra
    spec:
      rules:
      - host: grafana.od.com
        http:
          paths:
          - path: /
            backend:
              serviceName: grafana
              servicePort: 3000
    


    # On the 11 machine, add the DNS record:
    ~]# vi /var/named/od.com.zone
    # bump the zone's serial number by one
    
    grafana            A    10.4.7.10
    ~]# systemctl restart named
    ~]# ping grafana.od.com
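
    To confirm the record resolves, a quick check (a sketch; 10.4.7.11 as the DNS server address is an assumption based on the bind server used earlier in this series):

    ~]# dig -t A grafana.od.com @10.4.7.11 +short
    # expected output: 10.4.7.10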


    # On the 22 machine, apply the manifests:
    ~]# kubectl apply -f http://k8s-yaml.od.com/grafana/rbac.yaml
    ~]# kubectl apply -f http://k8s-yaml.od.com/grafana/dp.yaml
    ~]# kubectl apply -f http://k8s-yaml.od.com/grafana/svc.yaml
    ~]# kubectl apply -f http://k8s-yaml.od.com/grafana/ingress.yaml
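
    Optionally confirm that everything came up (a sketch; the exact pod name will differ):

    # 22 machine:
    ~]# kubectl get pods -n infra -o wide | grep grafana
    ~]# kubectl get svc,ingress -n infra | grep grafana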


    Open grafana.od.com in a browser.

    The default username and password are both admin.

    Change the password to admin123.


    Adjust the settings as shown in the screenshot.


    Install the plugins

    Enter the container (see the sketch below)
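
    A minimal sketch of getting a shell in the Grafana container (the pod name below is hypothetical; look yours up first, and fall back to /bin/sh if bash is not present in the image):

    # 22 machine:
    ~]# kubectl get pods -n infra | grep grafana
    ~]# kubectl exec -it grafana-d6588db94-xxxxx -n infra -- /bin/bash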


    # First: Kubernetes App
    grafana# grafana-cli plugins install grafana-kubernetes-app
    # Second: Clock Panel
    grafana# grafana-cli plugins install grafana-clock-panel
    # Third: Pie Chart
    grafana# grafana-cli plugins install grafana-piechart-panel
    # Fourth: D3 Gauge
    grafana# grafana-cli plugins install briangann-gauge-panel
    # Fifth: Discrete
    grafana# grafana-cli plugins install natel-discrete-panel


    After installation you can check them on the 200 machine.

    # 200 machine:
    ~]# cd /data/nfs-volume/grafana/plugins/
    plugins]# ll


    Delete the Grafana pod so that it restarts (a kubectl equivalent is sketched below).
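
    Deleting the pod makes the Deployment recreate it, which restarts Grafana and loads the plugins from the NFS volume. A sketch of doing it with kubectl instead of the dashboard (the label comes from the dp.yaml above):

    # 22 machine:
    ~]# kubectl delete pod -n infra -l name=grafana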


    After the restart completes,

    check grafana.od.com: the five plugins you just installed should all be there (remember to verify this).


    Add a data source: click Add data source.


    # Fill in the parameters:
    URL:http://prometheus.od.com
    TLS Client Auth✔    With CA Cert✔


    # Fill in the corresponding PEM contents:
    # Get the CA and client certificates on the 200 machine:
    ~]# cat /opt/certs/ca.pem
    ~]# cat /opt/certs/client.pem
    ~]# cat /opt/certs/client-key.pem


    Save.

    Then configure the kubernetes plugin under Plugins.


    A new button appears on the right; click it.


    # Fill in the parameters:
    Name:myk8s
    URL:https://10.4.7.10:7443
    Access:Server
    TLS Client Auth✔    With CA Cert✔


    # Fill in the parameters:
    # Get the CA and client certificates on the 200 machine:
    ~]# cat /opt/certs/ca.pem
    ~]# cat /opt/certs/client.pem
    ~]# cat /opt/certs/client-key.pem


    After saving, click the icon in the panel on the right, then click the Name.


    Scraping the data may take a little while (about two minutes).


    Click K8s Cluster in the top-right corner and select what you want to look at.


    The K8s Container dashboard is missing data.


    Let's fix that: delete the Cluster dashboard.


    Delete container as well.


    Delete deployment as well.


    Delete node as well.


    Import the dashboard JSON files I prepared for you.


    Import the four dashboards (node, deployment, cluster, container) the same way; an API-based alternative is sketched below.
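
    If you prefer not to click through the UI for every file, Grafana's HTTP API can import them as well. A sketch only (assumptions: the admin/admin123 credentials set earlier and a node.json export in the current directory; depending on how the JSON was exported you may need to set its "id" field to null first):

    ~]# curl -s -X POST http://admin:admin123@grafana.od.com/api/dashboards/db \
          -H 'Content-Type: application/json' \
          -d "{\"dashboard\": $(cat node.json), \"overwrite\": true}"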


    Look through each of them; they should now display correctly.

    Then import etcd, generic, and traefik as well.


    There is another way to import dashboards, using the official site:

    the Grafana official website

    Find a dashboard someone else has already built and open it.


    Its ID can be used directly.


    For blackbox we install dashboard ID 9965.


    Adjust the name and the Prometheus data source.


    Alternatively, you can use the one I uploaded (I used ID 7587).


    You can use both and compare them; keeping both is fine, it just consumes a bit more resources.

    JMX


    There is nothing in it yet.


    Get the Dubbo microservice metrics into Grafana

    dubbo-service


    # Edit a Daemon Set: add the following keys, and remember to add a comma to the line above them
      "prometheus_io_scrape": "true",
      "prometheus_io_port": "12346",
      "prometheus_io_path": "/"
    # paste them in and hit Update; the indentation is aligned automatically
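
    If you would rather edit the manifest than the dashboard JSON, the same keys look roughly like the sketch below as pod-template annotations (an assumption based on the Prometheus kubernetes_sd relabel rules set up earlier; the same applies to dubbo-consumer further down):

    spec:
      template:
        metadata:
          annotations:
            prometheus_io_scrape: "true"
            prometheus_io_port: "12346"
            prometheus_io_path: "/"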


    dubbo-consumer


    # Edit a Daemon Set: add the following keys, and remember to add a comma to the line above them
      "prometheus_io_scrape": "true",
      "prometheus_io_port": "12346",
      "prometheus_io_path": "/"
    # paste them in and hit Update; the indentation is aligned automatically


    Refresh the JMX dashboard (it can be slow; it took about a minute before the service showed up for me, my machine is struggling).


    Done.

    At this point you can really feel that Grafana is far more user-friendly than the built-in K8S dashboard UI.

    Install and deploy Alertmanager

    WHAT: after receiving alerts from the Prometheus server, it deduplicates them, groups them, and routes them to the configured receivers to send notifications. Common receivers include email, PagerDuty, and so on.

    WHY: so that the system's alerts reach us at any time

    # On the 200 machine, prepare the image and the resource manifests:
    ~]# mkdir /data/k8s-yaml/alertmanager
    ~]# cd /data/k8s-yaml/alertmanager
    alertmanager]# docker pull docker.io/prom/alertmanager:v0.14.0
    # note: a version other than 0.14 may give you errors here
    alertmanager]# docker images|grep alert
    alertmanager]# docker tag 23744b2d645c harbor.od.com/infra/alertmanager:v0.14.0
    alertmanager]# docker push harbor.od.com/infra/alertmanager:v0.14.0
    # note: change the email settings below to your own; the explanatory comments can be removed
    alertmanager]# vi cm.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: alertmanager-config
      namespace: infra
    data:
      config.yml: |-
        global:
          # how long without further firing before an alert is declared resolved
          resolve_timeout: 5m
          # email (SMTP) delivery settings
          smtp_smarthost: 'smtp.163.com:25'
          smtp_from: 'ben909336740@163.com'
          smtp_auth_username: 'ben909336740@163.com'
          smtp_auth_password: 'xxxxxx'
          smtp_require_tls: false
        # the root route that every incoming alert enters; it defines how alerts are dispatched
        route:
          # labels used to regroup incoming alerts; for example, alerts carrying cluster=A and
          # alertname=LatencyHigh will be aggregated into a single group
          group_by: ['alertname', 'cluster']
          # once a new group is created, wait at least group_wait before sending the first
          # notification, so that several alerts for the same group can be sent together
          group_wait: 30s

          # after the first notification, wait group_interval before notifying about new
          # alerts added to the group
          group_interval: 5m

          # if a notification has already been sent successfully, wait repeat_interval
          # before resending it
          repeat_interval: 5m

          # default receiver: alerts not matched by any route are sent here
          receiver: default

        receivers:
        - name: 'default'
          email_configs:
          - to: '909336740@qq.com'
            send_resolved: true
    
    alertmanager]# vi dp.yaml
    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: alertmanager
      namespace: infra
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: alertmanager
      template:
        metadata:
          labels:
            app: alertmanager
        spec:
          containers:
          - name: alertmanager
            image: harbor.od.com/infra/alertmanager:v0.14.0
            args:
              - "--config.file=/etc/alertmanager/config.yml"
              - "--storage.path=/alertmanager"
            ports:
            - name: alertmanager
              containerPort: 9093
            volumeMounts:
            - name: alertmanager-cm
              mountPath: /etc/alertmanager
          volumes:
          - name: alertmanager-cm
            configMap:
              name: alertmanager-config
          imagePullSecrets:
          - name: harbor
    
    alertmanager]# vi svc.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: alertmanager
      namespace: infra
    spec:
      selector: 
        app: alertmanager
      ports:
        - port: 80
          targetPort: 9093


    # On the 22 machine, apply the manifests:
    ~]# kubectl apply -f http://k8s-yaml.od.com/alertmanager/cm.yaml
    ~]# kubectl apply -f http://k8s-yaml.od.com/alertmanager/dp.yaml
    ~]# kubectl apply -f http://k8s-yaml.od.com/alertmanager/svc.yaml


    # On the 200 machine, configure the alerting rules:
    ~]# vi /data/nfs-volume/prometheus/etc/rules.yml
    groups:
    - name: hostStatsAlert
      rules:
      - alert: hostCpuUsageAlert
        expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }}%)"
      - alert: hostMemUsageAlert
        expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }}%)"
      - alert: OutOfInodes
        expr: node_filesystem_free{fstype="overlay",mountpoint ="/"} / node_filesystem_size{fstype="overlay",mountpoint ="/"} * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Out of inodes (instance {{ $labels.instance }})"
          description: "Disk is almost running out of available inodes (< 10% left) (current value: {{ $value }})"
      - alert: OutOfDiskSpace
        expr: node_filesystem_free{fstype="overlay",mountpoint ="/rootfs"} / node_filesystem_size{fstype="overlay",mountpoint ="/rootfs"} * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 10% left) (current value: {{ $value }})"
      - alert: UnusualNetworkThroughputIn
        expr: sum by (instance) (irate(node_network_receive_bytes[2m])) / 1024 / 1024 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual network throughput in (instance {{ $labels.instance }})"
          description: "Host network interfaces are probably receiving too much data (> 100 MB/s) (current value: {{ $value }})"
      - alert: UnusualNetworkThroughputOut
        expr: sum by (instance) (irate(node_network_transmit_bytes[2m])) / 1024 / 1024 > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual network throughput out (instance {{ $labels.instance }})"
          description: "Host network interfaces are probably sending too much data (> 100 MB/s) (current value: {{ $value }})"
      - alert: UnusualDiskReadRate
        expr: sum by (instance) (irate(node_disk_bytes_read[2m])) / 1024 / 1024 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk read rate (instance {{ $labels.instance }})"
          description: "Disk is probably reading too much data (> 50 MB/s) (current value: {{ $value }})"
      - alert: UnusualDiskWriteRate
        expr: sum by (instance) (irate(node_disk_bytes_written[2m])) / 1024 / 1024 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk write rate (instance {{ $labels.instance }})"
          description: "Disk is probably writing too much data (> 50 MB/s) (current value: {{ $value }})"
      - alert: UnusualDiskReadLatency
        expr: rate(node_disk_read_time_ms[1m]) / rate(node_disk_reads_completed[1m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk read latency (instance {{ $labels.instance }})"
          description: "Disk latency is growing (read operations > 100ms) (current value: {{ $value }})"
      - alert: UnusualDiskWriteLatency
        expr: rate(node_disk_write_time_ms[1m]) / rate(node_disk_writes_completed[1m]) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Unusual disk write latency (instance {{ $labels.instance }})"
          description: "Disk latency is growing (write operations > 100ms) (current value: {{ $value }})"
    - name: http_status
      rules:
      - alert: ProbeFailed
        expr: probe_success == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Probe failed (instance {{ $labels.instance }})"
          description: "Probe failed (current value: {{ $value }})"
      - alert: StatusCode
        expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Status Code (instance {{ $labels.instance }})"
          description: "HTTP status code is not 200-399 (current value: {{ $value }})"
      - alert: SslCertificateWillExpireSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SSL certificate will expire soon (instance {{ $labels.instance }})"
          description: "SSL certificate expires in 30 days (current value: {{ $value }})"
      - alert: SslCertificateHasExpired
        expr: probe_ssl_earliest_cert_expiry - time()  <= 0
        for: 5m
        labels:
          severity: error
        annotations:
          summary: "SSL certificate has expired (instance {{ $labels.instance }})"
          description: "SSL certificate has expired already (current value: {{ $value }})"
      - alert: BlackboxSlowPing
        expr: probe_icmp_duration_seconds > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Blackbox slow ping (instance {{ $labels.instance }})"
          description: "Blackbox ping took more than 2s (current value: {{ $value }})"
      - alert: BlackboxSlowRequests
        expr: probe_http_duration_seconds > 2 
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Blackbox slow requests (instance {{ $labels.instance }})"
          description: "Blackbox request took more than 2s (current value: {{ $value }})"
      - alert: PodCpuUsagePercent
        expr: sum(sum(label_replace(irate(container_cpu_usage_seconds_total[1m]),"pod","$1","container_label_io_kubernetes_pod_name", "(.*)"))by(pod) / on(pod) group_right kube_pod_container_resource_limits_cpu_cores *100 )by(container,namespace,node,pod,severity) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod cpu usage percent has exceeded 80% (current value: {{ $value }}%)"
    
    # Append the following at the end of the file
    ~]# vi /data/nfs-volume/prometheus/etc/prometheus.yml
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ["alertmanager"]
    rule_files:
     - "/data/etc/rules.yml"


    The rules.yml file holds the alerting rules; it is worth validating before loading (see the sketch below).
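
    A sketch of validating it with the promtool binary shipped inside the Prometheus image (the pod name is hypothetical; /data/etc/rules.yml is the in-container path referenced by rule_files above):

    # 22 machine:
    ~]# kubectl get pods -n infra | grep prometheus
    ~]# kubectl exec -n infra prometheus-7f656fb9c9-xxxxx -- promtool check rules /data/etc/rules.yml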

    At this point you could simply restart the Prometheus pod, but in production Prometheus is so heavyweight that deleting the pod can easily drag down the cluster, so we use another method that Prometheus supports: a graceful reload:

    # On the 21 machine (that is where our Prometheus runs), trigger a graceful reload:
    ~]# ps aux|grep prometheus
    # 1488 is the Prometheus PID shown by ps on this machine; substitute your own
    ~]# kill -SIGHUP 1488
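
    If you would rather not copy the PID by hand, the reload can be done in one line (a sketch; it assumes the Prometheus server is the only process on this node whose command line matches):

    ~]# kill -SIGHUP $(pgrep -f '/bin/prometheus' | head -1)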


    Now all the alerting rules are in place.

    Test Alertmanager's alerting

    First enable SMTP on both of the mail accounts involved.


    Let's test it: stop dubbo-service, so the consumer starts throwing errors.

    Scale the service down to 0 (a kubectl equivalent is sketched below).
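
    A sketch of scaling it down with kubectl instead of the dashboard (the dubbo-demo-service name and the app namespace follow the earlier Dubbo demo setup and may differ in your environment):

    # 22 machine:
    ~]# kubectl scale deployment dubbo-demo-service --replicas=0 -n app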


    Check blackbox.od.com: it already shows a failure.


    Check prometheus.od.com/alerts: two alerts have turned red (they start out yellow).


    You can now see the sent alert in the 163 mailbox.


    The QQ mailbox receives the alert.


    Done (remember to scale the service back to 1).

    About rules.yml: alerts must neither fire falsely nor be missed, so in practice we keep adjusting the rules to match our company's actual needs.

    When resources are tight, you can shut down some non-essential components.

    # On the 22 machine (this can also be done from the dashboard):
    ~]# kubectl scale deployment grafana --replicas=0 -n infra
    # out : deployment.extensions/grafana scaled
    ~]# kubectl scale deployment alertmanager --replicas=0 -n infra
    # out : deployment.extensions/alertmanager scaled
    ~]# kubectl scale deployment prometheus --replicas=0 -n infra
    # out : deployment.extensions/prometheus scaled
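
    When you need them again, scale back up the same way:

    ~]# kubectl scale deployment grafana --replicas=1 -n infra
    ~]# kubectl scale deployment alertmanager --replicas=1 -n infra
    ~]# kubectl scale deployment prometheus --replicas=1 -n infra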