Building a Monitoring Stack in Practice (Prometheus + Grafana + Loki + Tempo)

fever · April 24, 2026

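While operating our in-house service, it took far too long to find the root cause whenever an incident occurred. Digging through logs by hand and working out when the problem had started was a chore in itself. Around that time I took a course on monitoring and was introduced to the combination of Prometheus, Grafana, Loki, and Tempo.

The role of each tool, in brief:

Tool	Role	Question it answers
Prometheus	Metrics collection	How much? (CPU, memory, response time)
Grafana	Visualization dashboards	What does it look like at a glance?
Loki	Log collection	What happened?
Tempo	Trace collection	Where is it slow?

When these four work together, you can quickly work out when, where, and why an incident happened.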

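Why not ELK?

When it comes to log collection, the ELK stack (Elasticsearch + Logstash + Kibana) is usually the first thing that comes to mind. It is a proven stack that many companies run in production.

Our infrastructure, however, was a traditional 3-tier architecture (Web - WAS - DB). Since we were not running on Kubernetes, we had little need for the strengths ELK offers.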

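Item	ELK	Loki
Resources	Heavy (Elasticsearch uses a lot of memory)	Lightweight
Log storage	Indexes the log content	Indexes labels only
Grafana integration	Through the Elasticsearch data source (Kibana is the stack's own UI)	Supported out of the box
Best-suited environment	Kubernetes, large distributed environments	Small, simple infrastructure
Learning curve	Steep	Gentle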

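What appealed to me was that Loki integrates naturally with Grafana and that, together with Prometheus for metrics and Tempo for traces, everything can be viewed from a single Grafana instance. ELK is powerful, but for a small service on a traditional 3-tier setup, the resource cost of Elasticsearch was close to overkill.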
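
The "indexes labels only" row in the table is the main reason Loki stays light. As a rough illustration, here is a toy Python model of the two indexing styles. It is not how either system is actually implemented, and the log lines and the `query_by_labels` helper are made up for this example. The point is that a label index grows with the number of distinct label sets, while a full-text index grows with the number of distinct terms.

```python
from collections import defaultdict

# Hypothetical log entries: (labels, line)
LOGS = [
    ({"app": "my-app", "level": "error"}, "payment failed for order 42"),
    ({"app": "my-app", "level": "info"}, "order 42 created"),
    ({"app": "batch", "level": "error"}, "payment gateway timeout"),
]

# Label-only index: one entry per distinct label set (a "stream").
# The lines themselves are stored but not indexed.
streams = defaultdict(list)
for labels, line in LOGS:
    streams[frozenset(labels.items())].append(line)

# Full-text index: one entry per distinct token, pointing at matching lines.
inverted = defaultdict(set)
for line_id, (_, line) in enumerate(LOGS):
    for token in line.split():
        inverted[token].add(line_id)


def query_by_labels(selector, needle):
    """Pick streams whose labels match, then scan their lines for a substring."""
    wanted = set(selector.items())
    return [
        line
        for label_set, lines in streams.items()
        if wanted <= label_set
        for line in lines
        if needle in line
    ]


print("label index entries:", len(streams))
print("full-text index entries:", len(inverted))
print(query_by_labels({"app": "my-app"}, "payment"))
```

The trade-off is that a content search in the label-only model has to scan every line of the selected streams, which is acceptable at the log volumes of a small service.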

Setting up the test server

I first brought up all four components on a test server. I used docker-compose, and attached Actuator and the OpenTelemetry agent to the Spring Boot application to collect metrics, logs, and traces.

services:
  prometheus:
    image: prom/prometheus:v3.5.1
    ports:
      - 9090:9090
  grafana:
    image: grafana/grafana:12.4.2
    ports:
      - 3000:3000
  loki:
    image: grafana/loki:3.4.2
    ports:
      - 3100:3100
  tempo:
    image: grafana/tempo:2.7.2
    ports:
      - 3200:3200
      - 4317:4317

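The workflow of spotting an anomalous interval in the metrics on a Grafana dashboard, checking the logs in Loki, and then following the trace in Tempo turned out to be far more powerful than I expected. Everything worked as expected on the test server.

Spring Boot integration

Setting up the monitoring stack is not the end of it. The application has to expose its metrics before Prometheus can scrape them.

Adding dependencies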

build.gradle

implementation 'org.springframework.boot:spring-boot-starter-actuator'
implementation 'io.micrometer:micrometer-registry-prometheus'

pom.xml

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

application.yml configuration

management:
  endpoints:
    web:
      exposure:
        include: prometheus, health, info
  endpoint:
    prometheus:
      enabled: true
  metrics:
    tags:
      application: ${spring.application.name}
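
With this configuration the application should serve its metrics as plain text at `/actuator/prometheus`. To make the format concrete, here is a minimal Python sketch that parses a hand-written sample. The sample values and the `parse_metrics` helper are illustrative assumptions, not captured output, and the parser ignores timestamps and unusual escaping.

```python
import re

# Hand-written sample in the Prometheus text exposition format (values are made up).
SAMPLE = """\
# TYPE process_cpu_usage gauge
process_cpu_usage{application="my-app"} 0.125
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{application="my-app",area="heap",id="G1 Eden Space"} 1.2E7
"""

# metric_name, optional {labels}, then the sample value
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines
        match = LINE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, label_text, value = match.groups()
        labels = dict(LABEL.findall(label_text or ""))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

The `application` label on every sample is what the `management.metrics.tags` setting contributes, and it is the label the alert descriptions later rely on.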

prometheus.yml scrape configuration

This tells Prometheus which endpoint to pull metrics from.

scrape_configs:
  - job_name: my-app
    metrics_path: /actuator/prometheus
    static_configs:
      - targets:
        - 192.168.0.1:8080
        labels:
          application: my-app
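
One detail worth checking in this scrape config: the application already tags its own metrics with `application` (through `management.metrics.tags`), and the target adds an `application` label as well. As far as I understand Prometheus's default behaviour (`honor_labels: false`), the target label wins and the scraped one is kept under an `exported_` prefix. Below is a simplified Python model of that rule. The `attach_target_labels` function and the `order-service` value are hypothetical, and the model ignores the case where the `exported_` name is itself already taken.

```python
def attach_target_labels(scraped, target, honor_labels=False):
    """Merge labels exposed by the app with labels attached to the scrape target."""
    if honor_labels:
        # Scraped labels take precedence; target labels only fill the gaps.
        return {**target, **scraped}
    merged = dict(scraped)
    for name, value in target.items():
        if merged.get(name):
            # Conflict: keep the scraped value under an exported_ prefix.
            merged["exported_" + name] = merged[name]
        merged[name] = value
    return merged


# Labels as exposed by the application (hypothetical values).
scraped = {"application": "order-service", "area": "heap"}
# Labels Prometheus attaches to the target: job, instance, and the static label.
target = {"job": "my-app", "instance": "192.168.0.1:8080", "application": "my-app"}

print(attach_target_labels(scraped, target))
print(attach_target_labels(scraped, target, honor_labels=True))
```

If both labels always carry the same value, setting it in only one of the two places would probably keep the stored series cleaner.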

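Applying it to the Production server: the internal-network wall

The problem was the Production server. It sat on an internal network with restricted communication to the outside. We connected to it over a VPN, and with the Windows server VPN the IP address changed on every connection.

This is where a fundamental difference between Prometheus and Loki/Tempo showed up.

Pull vs Push

Model	Tool	How it works
Pull	Prometheus	The monitoring server fetches metrics from the application
Push	Loki, Tempo	The application sends data to the collection server

Prometheus is pull-based: the monitoring server periodically sends requests to the application's endpoint to fetch metrics. As long as the monitoring server can reach the application server, that is enough, so it worked without trouble on the internal network.

Loki and Tempo, by contrast, are push-based: the application itself has to send data to the Loki/Tempo server. On the internal network the application could not make outbound connections, and because the VPN IP changed every time, configuring a fixed endpoint was not possible at all.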
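
The difference comes down to which side opens the connection. A tiny Python sketch of that reasoning follows. The `ALLOWED` set and the `can_collect` helper are assumptions that model this particular network, where only the monitoring side may initiate connections.

```python
# Allowed connection directions as (initiator, receiver).
# Assumption for this sketch: only monitoring -> app is open.
ALLOWED = {("monitoring", "app")}


def can_collect(mode, allowed=ALLOWED):
    """Return True if the data path for the given collection model can be opened."""
    if mode == "pull":
        connection = ("monitoring", "app")  # Prometheus scrapes the application
    elif mode == "push":
        connection = ("app", "monitoring")  # application sends to Loki/Tempo
    else:
        raise ValueError(f"unknown mode: {mode}")
    return connection in allowed


for tool, mode in [("Prometheus", "pull"), ("Loki", "push"), ("Tempo", "push")]:
    print(f"{tool:<10} {mode:<4} reachable={can_collect(mode)}")
```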

In the end I decided to apply only Prometheus + Grafana + AlertManager to the Production server.

Production server setup: Prometheus + Grafana + AlertManager

docker-compose configuration

services:
  prometheus:
    image: prom/prometheus:v3.5.1
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-lifecycle
      - --storage.tsdb.retention.time=30d
      - --storage.tsdb.retention.size=10GB
      - --web.enable-remote-write-receiver
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.31.1
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
    restart: unless-stopped
    depends_on:
      - prometheus-msteams

  grafana:
    image: grafana/grafana:12.4.2
    restart: unless-stopped
    depends_on:
      - prometheus

  prometheus-msteams:
    image: quay.io/prometheusmsteams/prometheus-msteams:latest
    environment:
      - TEAMS_INCOMING_WEBHOOK_URL=https://your-teams-webhook-url
      - TEAMS_REQUEST_URI=alertmanager
    command:
      - -workflow-webhook
    restart: unless-stopped
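
The two retention flags on the Prometheus service can be sanity-checked with the rule of thumb from the Prometheus storage documentation: needed disk space is roughly retention time in seconds × ingested samples per second × bytes per sample, at about 1 to 2 bytes per sample. Here is a rough Python calculation. The series count and the 15-second scrape interval are hypothetical, since neither appears in this post.

```python
def estimate_tsdb_bytes(retention_days, active_series, scrape_interval_s, bytes_per_sample=2.0):
    """Rule-of-thumb disk usage: retention seconds * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


# Hypothetical workload: 5,000 active series scraped every 15 seconds.
needed = estimate_tsdb_bytes(retention_days=30, active_series=5_000, scrape_interval_s=15)
size_cap = 10 * 10**9  # rough decimal reading of the 10GB size limit

print(f"estimated usage for 30d: {needed / 10**9:.2f} GB")
if needed < size_cap:
    print("the 30d time limit would be reached before the size cap")
else:
    print("the size cap would be reached before 30d")
```

When both limits are set, Prometheus applies whichever is hit first, so under these assumptions the 30-day limit would be the effective one and the size cap acts as a safety net.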

AlertManager Teams integration

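I did not wire the Teams webhook directly into AlertManager and instead used an intermediate bridge called prometheus-msteams. (Recent AlertManager releases also document native `msteams_configs` and `msteamsv2_configs` receivers, which may remove the need for a bridge; this setup does not use them.)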

Flow

Prometheus → AlertManager → prometheus-msteams → Teams

At first I tried to integrate through the legacy Incoming Webhook, but because support for the Teams Incoming Webhook ended after 2025, I had to switch to a Power Automate Workflow webhook. Generate the webhook URL in Power Automate and inject it as an environment variable.

alertmanager.yml

global:
  resolve_timeout: 5m
route:
  receiver: 'teams-notification'
  group_by: ['alertname', 'application']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
receivers:
- name: 'teams-notification'
  webhook_configs:
  - url: 'http://prometheus-msteams:2000/alertmanager'
    send_resolved: true
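
AlertManager delivers each notification group to the `url` above as an HTTP POST with a JSON body. The Python sketch below shows the general shape of that body and how a receiver might turn it into one line per alert. The payload values are made up, only a subset of the fields is shown, and `summarize` is a hypothetical helper, not what prometheus-msteams does internally.

```python
import json

# Trimmed example of an AlertManager webhook body (values are illustrative).
PAYLOAD = json.loads("""
{
  "version": "4",
  "status": "firing",
  "receiver": "teams-notification",
  "groupLabels": {"alertname": "ServiceDown", "application": "my-app"},
  "commonLabels": {"alertname": "ServiceDown", "application": "my-app", "severity": "critical"},
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "ServiceDown", "application": "my-app", "severity": "critical"},
      "annotations": {"description": "[my-app] service down"},
      "startsAt": "2026-04-24T01:00:00Z"
    }
  ]
}
""")


def summarize(payload):
    """Build one human-readable line per alert in the notification group."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        severity = labels.get("severity", "unknown")
        name = labels.get("alertname", "unknown")
        status = alert.get("status", "unknown").upper()
        description = annotations.get("description", "")
        lines.append(f"[{severity}] {status} {name}: {description}")
    return lines


for line in summarize(PAYLOAD):
    print(line)
```

Because `send_resolved: true` is set, the same endpoint also receives a body with `"status": "resolved"` once the alert clears, and `group_by` decides which alerts end up together in one `alerts` array.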

Alert rules

I set up three alerts: service down, high CPU usage, and high JVM heap usage.

alert_rules.yml

groups:
  - name: 기본알람
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "[{{ $labels.application }}] Service down"

      - alert: HighCpuUsage
        expr: process_cpu_usage * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "[{{ $labels.application }}] CPU usage {{ $value | printf \"%.1f\" }}% (above 80%)"

      - alert: HighHeapUsage
        expr: |
          (
            sum by(instance, application) (jvm_memory_used_bytes{area="heap"})
            /
            sum by(instance, application) (jvm_memory_max_bytes{area="heap"})
          ) * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "[{{ $labels.application }}] Heap usage {{ $value | printf \"%.1f\" }}% (above 85%)"
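
The `for:` clause is what keeps short spikes from triggering a notification: a rule whose expression is true is first pending, and only becomes firing once it has stayed true for the whole duration. Here is a minimal Python model of that behaviour. The 15-second evaluation interval and the `alert_states` function are assumptions for illustration; the real interval comes from `evaluation_interval` in prometheus.yml, which this post does not show.

```python
def alert_states(results, for_seconds, eval_interval_s):
    """Map a series of rule evaluations (True = expression matched) to alert states."""
    states = []
    active_since = None
    for step, matched in enumerate(results):
        now = step * eval_interval_s
        if not matched:
            active_since = None  # any gap resets the pending timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# ServiceDown uses `for: 1m`; evaluations assumed every 15 seconds.
evaluations = [True, True, False, True, True, True, True, True]
for step, state in enumerate(alert_states(evaluations, for_seconds=60, eval_interval_s=15)):
    print(f"t={step * 15:>3}s  {state}")
```

In this run the brief recovery at t=30s resets the timer, so the alert only fires at t=105s, one full minute after the expression became true again.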

Test vs Production comparison

Item	Test server	Production server
Metrics	Prometheus ✅	Prometheus ✅
Logs	Loki ✅	❌ (internal network)
Traces	Tempo ✅	❌ (internal network)
Alerting	AlertManager ✅	AlertManager ✅
Visualization	Grafana ✅	Grafana ✅

Giving up logs and traces was a pity, but Prometheus alone was enough to collect the core metrics: CPU, memory, JVM heap, response time, and service status.

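Wrapping up

The biggest change after adding monitoring was the speed of incident response. Previously we only became aware of a problem after a VOC (customer complaint) came in. Now we find out first through an alert and can immediately check on the Grafana dashboard what happened and when.

It was not a perfect setup, but it was a good exercise in finding and applying the best available option in a constrained environment. Later I plan to look further into ways of attaching Loki in an internal-network environment.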