node-exporter: 현재 node-exporter가 실행되고 있는 서버의 메트릭을 수집
sudo docker run --name node-exporter -d -p 9100:9100 prom/node-exporter
prometheus: 프로메테우스 서버, 원하는 경로의 메트릭을 pull 방식으로 수집·처리하는 오픈소스 모니터링 시스템
sudo docker run --name prometheus -d -p 9090:9090 -v /home/ubuntu/prometheus-2.41.0.linux-amd64:/etc prom/prometheus --config.file=/etc/prometheus.yml
mysqld-exporter: 원하는 mysql 계열 데이터베이스의 메트릭을 수집
sudo docker run --name mysqld-exporter -d -p 9104:9104 -e DATA_SOURCE_NAME="{id:password@(url)/}" -e collect.global_status=true prom/mysqld-exporter
alertmanager: 정해둔 rule에 따라 receiver에게 알림 전달
sudo docker run --name alertmanager -d -p 9093:9093 -v /home/ubuntu/alertmanager-0.25.0.linux-amd64:/etc prom/alertmanager --config.file=/etc/alertmanager.yml
grafana: prometheus 메트릭 시각화
sudo docker run --name grafana -d -p 3000:3000 grafana/grafana-enterprise
promtotwilio: sms를 보내기 위한 api
sudo docker run --name twilio -d -p 9091:9090 -e SID={{ twilio.sid }} -e TOKEN={{ twilio.token }} -e SENDER=+{{ twilio.sender_number }} -e RECEIVER=+{{ receiver_number }} swatto/promtotwilio
# prometheus.yml — Prometheus server configuration.
# NOTE(review): indentation was lost in the original paste (flat YAML does not
# parse); restored to the standard prometheus.yml layout. Bracketed values like
# [prometheus IP] are placeholders to be replaced before use.

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['[alertmanager IP]:[alertmanager PORT]']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - 'rules.yml'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["[prometheus IP]:[prometheus PORT]"]
  # node-exporter running next to the Prometheus server itself.
  - job_name: "prometheus-node"
    static_configs:
      - targets: ["[node-exporter IP]:[node-exporter PORT]"]
  # NestJS application metrics endpoint.
  - job_name: "nest"
    static_configs:
      - targets: ["[nestjs IP]:[nestjs PORT]"]
  # node-exporter on the NestJS host.
  - job_name: "nest-node"
    static_configs:
      - targets: ["[node-exporter IP]:[node-exporter PORT]"]
  # mysqld-exporter scraping the RDS instance.
  - job_name: "rds"
    static_configs:
      - targets: ["[mysqld-exporter IP]:[mysqld-exporter PORT]"]
# rules.yml — alerting rules evaluated by Prometheus every `evaluation_interval`.
# NOTE(review): indentation restored (original paste was flattened and would not
# parse). Also fixed "1 minutes" -> "1 minute" and made `severity` quoting
# consistent across rules.
groups:
  - name: alert.rules
    rules:
      # Fires when any scrape target has been unreachable for 1 minute.
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      # Less than 10% of memory available for 2 minutes.
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host out of memory (instance {{ $labels.instance }})"
          description: "Node memory is filling up (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

      # Sustained high rate of major page faults indicates memory pressure.
      - alert: HostMemoryUnderMemoryPressure
        expr: rate(node_vmstat_pgmajfault[1m]) > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host memory under memory pressure (instance {{ $labels.instance }})"
          description: "The node is under heavy memory pressure. High rate of major page faults\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

      # Please add ignored mountpoints in node_exporter parameters like
      # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
      # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

      # CPU usage above 80% (fires immediately: for: 0m).
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Host high CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
# alertmanager.yml — routes alerts to Slack, Telegram, and an SMS webhook.
# NOTE(review): indentation restored (original paste was flattened and would not
# parse). Bot token is now quoted: a value starting with bare `{{` is parsed by
# YAML as a flow mapping and is a syntax error before templating even runs.
global:
  http_config:
    tls_config:
      # WARNING(review): disables TLS certificate verification — test-only; do not ship.
      insecure_skip_verify: true

route:
  receiver: 'notifications'
  # NOTE(review): 2m re-notification is very aggressive; raise for production.
  repeat_interval: 2m
  routes:
    - receiver: 'notifications'

receivers:
  - name: 'notifications'
    slack_configs:
      # WARNING(review): a real Slack webhook URL appears committed here — rotate
      # the webhook and load it from a file/secret instead of VCS.
      - channel: '#웹훅-테스트'
        api_url: 'https://hooks.slack.com/services/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'
        send_resolved: true
        title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
    telegram_configs:
      - bot_token: '{{ personal_bot_token }}'
        # chat_id must substitute to a bare integer (Alertmanager expects int64).
        chat_id: {{ personal_chat_id }}
        api_url: 'https://api.telegram.org'
        message: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        parse_mode: 'HTML'
    webhook_configs:
      - url: 'http://15.165.19.150:9091/send' # promtotwilio
# Postfix (local SMTP) setup used for Alertmanager email notifications.
sudo apt-get update
# Installs postfix; the package's interactive installer prompts for the site type.
sudo apt-get install postfix
# Re-run the configuration wizard if the initial answers need changing.
sudo dpkg-reconfigure postfix
# Edit the main configuration file.
sudo nano /etc/postfix/main.cf
# NOTE(review): the next line is a setting to place inside /etc/postfix/main.cf,
# not a shell command.
myhostname = paycoq.com
# Apply the configuration change.
sudo systemctl restart postfix
# alertmanager.yml (email variant) — sends alerts via Gmail SMTP.
# NOTE(review): indentation restored (original paste was flattened and would not
# parse). The original listed both smtp_auth_password (account password) and
# smtp_auth_secret (CRAM-MD5 secret): Gmail uses LOGIN/PLAIN auth, which reads
# smtp_auth_password, and requires an *app password* — the account password and
# the CRAM-MD5 secret field do not work. Consolidated accordingly.
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: '<구글 사용자 이메일>'
  # Gmail app password (generated under Google account 2FA settings).
  smtp_auth_password: '<구글 앱 비밀번호>'
  smtp_require_tls: true

route:
  receiver: email-receiver

receivers:
  - name: email-receiver
    email_configs:
      - to: 'm16khb@gmail.com'
        send_resolved: true
서버리스 환경에서 메트릭 수집은 공급업체의 기능을 사용하는 것을 권장: https://discuss.prometheus.io/t/serverless-computing-prometheus-push/91
ECS에서 prometheus 구성:
https://jsonobject.tistory.com/567
참고
https://prometheus.io/docs/introduction/overview/
https://medium.com/finda-tech/prometheus란-cf52c9a8785f
https://blog.naver.com/PostView.nhn?blogId=whddbsml&logNo=222349178634
https://brunch.co.kr/@springboot/734
https://velog.io/@91savage/프로메테우스-Alertmanager
https://www.oss.kr/storage/app/public/oss/da/85/[Prometheus] Solution Guide.pdf
smtp 설정:
telegram 설정: https://gabrielkim.tistory.com/entry/Telegram-Bot-Token-및-Chat-Id-얻기