Monitoring batch development - 3

zzery · March 26, 2022


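The batch has improved somewhat since the previous post.

Improvements

InfluxDB integration

The DB connection information is added to the config file.
Apart from measurement, the values below have to be filled in yourself, for example from your own DB installation.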

# influxDB
database:
    host: "http://localhost:8086"
    authToken: "TOKEN"
    org: "ORG"
    bucket: "BUCKET"
    measurement: "my-measurement"

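How data is written

measurement: the equivalent of a DB table
Point: a piece of data added to the table. It contains several keys.
Tag Key: similar to an index; the criterion used when querying with a SELECT statement
Field Key: the actual data itself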

func SendData(c config.URL, wrt api.WriteAPI, status int) {
	p := influxdb2.NewPointWithMeasurement(c.Measurement).
		AddTag("hostname", c.Name).
		AddField("Status", status).
		// AddField("RespCode", respCode).
		SetTime(time.Now())

	wrt.WritePoint(p)
}

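Full rewrite of the request code

The HTTP client, the DB write, and the alert sending are each split out into their own function.
For the HTTP client, the timeout (how long to wait for a response) is set in config.yml.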

func InitClient(timeout time.Duration) *http.Client {
	http.DefaultTransport.(*http.Transport).TLSClientConfig = &tls.Config{InsecureSkipVerify: true}
	client := &http.Client{
		Timeout: timeout,
	}
	return client
}

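The duplicated retries were caused by a retry sequence overlapping with the next cron run, so config.yml has to be written with the total duration of one check in mind.

The recommended setting is to run a check every 15 to 20 seconds.

scheduler: "@every 20s"
timeout: 3 (wait up to 3 seconds for a response)
Retry: 3 times (not configurable in config)

The worst case is when the request and all three retries time out.

Request -3s-> 1st Retry -3s-> 2nd Retry -3s-> 3rd Retry -3s-> ERROR

Each 3-second step here also includes a retry delay in the millisecond range (between 100ms and 300ms).
In other words, one check can take about 12 to 13 seconds, so the cron scheduler value has to be set to a longer interval than that.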

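Default settings and DB integration added to config.yml

Default settings and the DB connection section have been added.
If you mostly rely on the defaults, each entry under urls only needs the host name and the URL.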

# commons
commons:
  scheduler: "@every 20s"
  slack_token: "TOKEN/TOKEN/TOKEN"
  timeout: 3 # time per retry: under 4 seconds at most

# influxDB
database:
    host: "http://localhost:8086"
    authToken: "TOKEN"
    org: "ORG"
    bucket: "BUCKET"
    measurement: "my-measurement"

# URLs
urls:
  # test server 1
  - name: "check server1"
    url: http://localhost:5051

  # test server 2
  - name: "check server2" # default name = url
    url: http://localhost:5052 # Required
    status_code: 200  # default status_code = 200
    slack_token: "TOKEN2/TOKEN2/TOKEN2" # default slack_token = commons.slack_token
    scheduler: "@every 30s" # default scheduler = commons.scheduler

The defaults are applied in main.go.

defCode := 200 // default status code

for _, this := range config.Urls {

	if this.Name == "" { // SetDefName
		this.Name = this.URL
		log.Debugf("default server name defined: %s", this.Name)
	}

	if this.StatusCode == nil { // SetDefCode
		this.StatusCode = &defCode
		// log.Debugf("%s default status code defined: %d", this.Name, *this.StatusCode)
	}

	if this.Timeout == nil {
		this.Timeout = &config.Common.Timeout
	}

	if this.Measurement == "" { // influxDB Measurement
		this.Measurement = config.DB.Measurement
		// log.Noticef("Measurement Defined %s", this.Measurement)
	}

	if this.Scheduler == "" { // common Scheduler
		this.Scheduler = config.Common.Scheduler
		// log.Noticef("Scheduler Defined %s", this.Scheduler)
	}

	if this.SlackToken == "" { // common SlackToken
		this.SlackToken = config.Common.SlackToken
	}

	s.cron.AddJob(this.Scheduler, scheduler.New(this, config, log, writeAPI))
}
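The same idea can be tried in isolation with the sketch below. The URL and Common types are hypothetical, trimmed-down stand-ins for the real config structs, and withDefaults is an illustrative helper name. It shows why StatusCode is a pointer: nil means "not set", which a plain int could not distinguish from 0.

```go
package main

import "fmt"

// URL is a hypothetical, trimmed-down version of one entry under urls.
// StatusCode is a pointer so that "not set" (nil) can be told apart from 0.
type URL struct {
	Name       string
	URL        string
	StatusCode *int
	Scheduler  string
}

// Common holds the hypothetical shared defaults from the commons section.
type Common struct {
	Scheduler string
}

// withDefaults returns a copy of u with every empty field filled in.
func withDefaults(u URL, c Common) URL {
	if u.Name == "" {
		u.Name = u.URL // default name = url
	}
	if u.StatusCode == nil {
		code := 200 // fresh variable per entry, so entries do not share a pointer
		u.StatusCode = &code
	}
	if u.Scheduler == "" {
		u.Scheduler = c.Scheduler // default scheduler = commons.scheduler
	}
	return u
}

func main() {
	common := Common{Scheduler: "@every 20s"}
	u := withDefaults(URL{URL: "http://localhost:5051"}, common)
	fmt.Println(u.Name, *u.StatusCode, u.Scheduler)
}
```

Returning a filled-in copy keeps the original config untouched, which makes the defaulting step easy to test on its own.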

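Code structure overhaul

DB integration, config settings applied, timeout added

Overall, the requests are now issued at more sensible points in time.
However, things still behave strangely when the retry count gets large, so that needs to be checked.

Log format overhaul

Status code added

The status text for the response code is now included in the log automatically.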

# before
16:46:13 :: DEBUG  [200] -- check server2

# after
2022-03-27T00:38:12.002Z [DEBUG]:: [200 OK] check-server2
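A label like "[200 OK]" can be produced with the standard library alone. The sketch below is one way to do it and is not necessarily how the batch formats its logs; statusLabel is a hypothetical helper name. The Status field of http.Response already holds text in the same form (for example "200 OK"), so that field could be used instead.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusLabel formats a status code in the style of the "after" log line,
// for example "[200 OK]". http.StatusText returns "" for unknown codes.
func statusLabel(code int) string {
	return fmt.Sprintf("[%d %s]", code, http.StatusText(code))
}

func main() {
	fmt.Println(statusLabel(http.StatusOK), "check-server2")
	fmt.Println(statusLabel(http.StatusBadGateway), "check-server2")
}
```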

Connection refused log changes

When the connection itself fails, the related error message is now printed. There are still 3 retries, each performed after a delay of between 100ms and 300ms.

# before
16:46:13 :: ERROR  1th Retry to http://localhost:5051 -- 100ms
16:46:13 :: ERROR  2th Retry to http://localhost:5051 -- 138.065718ms
16:46:13 :: ERROR  3th Retry to http://localhost:5051 -- 195.417452ms
16:46:13 :: ERROR  Failed to Connect -- http://localhost:5051 (http://localhost:5051)
16:46:13 :: NOTICE  Sending Message to Slack: {"text":"💥 [Connection Failed] http://localhost:5051 -- http://localhost:5051"}

# after
2022-03-27T22:45:56.001Z [ERROR]:: [nginx-server] Get "http://localhost:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-03-27T22:45:56.001Z [ERROR]:: [nginx-server] 1th Retry (3ns 100ms)
2022-03-27T22:45:59.102Z [ERROR]:: [nginx-server] Get "http://localhost:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-03-27T22:45:59.102Z [ERROR]:: [nginx-server] 2th Retry (3ns 186.249143ms)
2022-03-27T22:46:02.29Z [ERROR]:: [nginx-server] Get "http://localhost:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-03-27T22:46:02.29Z [ERROR]:: [nginx-server] 3th Retry (3ns 187.934273ms)
2022-03-27T22:46:05.479Z [ERROR]:: [nginx-server] Get "http://localhost:8080/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2022-03-27T22:46:05.479Z [ERROR]:: [Connection Failed] nginx-server (http://localhost:8080/)
2022-03-27T22:46:05.479Z [INFO]:: Sending Message to Slack: {"text": "[Connection Failed] nginx-server -- http://localhost:8080/"}

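References

Concept

Danawa: Building a search monitoring system

The metric scheduler in that architecture diagram does something similar to what I want to implement, so I used it as a reference.

Network connection

When the environment is built with Docker, the containers that talk to each other have to be attached to the same network.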

❯ docker network create --driver=bridge [netName]
❯ docker network connect [netName] [ContainerName]
❯ docker inspect [ContainerName]

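I used this when connecting the DB to the dashboard, and when adding a load-balancing configuration to Nginx in order to produce 5xx errors for testing.

http.Client Timeout setting

Golang: http.Client needs a Timeout.

Previously there was none, so the client waited for the connection indefinitely.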

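// Timeout settings. A client with its own Transport does not use
// http.DefaultTransport, so the TLS config is set on this Transport.
client := &http.Client{
	Transport: &http.Transport{
		DialContext:         (&net.Dialer{Timeout: 5 * time.Second}).DialContext, // TCP connect timeout
		TLSHandshakeTimeout: 5 * time.Second,
		TLSClientConfig:     &tls.Config{InsecureSkipVerify: true}, // skips certificate verification
	},
	Timeout: 3 * time.Second, // limit for the whole request
}

resp, err := client.Get(c.URL)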

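Apparently it is enough to add only the Timeout: 3 * time.Second part to the client declaration.
The dialer timeout applies to the TCP connection and TLSHandshakeTimeout to HTTPS. If they are not set separately, they are covered by Timeout, which limits the entire request including connection time, redirects, and reading the response body. If Timeout is not set either, there is no limit and the client simply waits indefinitely.

I am not sure whether using InsecureSkipVerify is acceptable for HTTPS, though. The request will go through, at least.

Simple Golang HTTPS/TLS Examples

I will refer to this and fix the code if it becomes necessary.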

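InfluxDB concepts

[InfluxDB] Installation and usage

In a time series database, the equivalent of a table is called a Measurement, in the sense of something being measured. One database can contain several Measurements, and it is worth understanding their structure in detail. Just as rows accumulate in a table, data called Points accumulate in a Measurement. A Point represents the point in time at which the data was written, and each Point contains several keys.

Key

Tag Key: Similar to an index in an RDB; it is the criterion used when querying with a SELECT statement. Tag values are always stored as strings, so they have to be wrapped in quotes.
Field Key: The data itself.
Time Key: Holds the time of the measurement. It is filled in automatically, so there is no need to touch it.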

Flux query

Building a Grafana dashboard requires writing queries.
If the server names are set sensibly in the kubemon config, the dashboard can be organized by filtering on name in the query. Examples follow.

// Status values of all monitored servers
from(bucket:"test-bucket")
  |> range (start: -2h)
  |> filter (fn:(r) => r._measurement == "test-server-list" and r._field == "Status")

// Status values of the nginx servers
from(bucket:"test-bucket")
  |> range (start: -2h)
  |> filter (fn:(r) => r._measurement == "test-server-list" and r._field == "Status")
  |> filter (fn:(r) => r.hostname =~ /nginx/)

// Status values of servers whose hostname contains "check"
from(bucket:"test-bucket")
  |> range (start: -2h)
  |> filter (fn:(r) => r._measurement == "test-server-list" and r._field == "Status")
  |> filter (fn:(r) => r.hostname =~ /check/)
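If queries like these are needed for many servers, they could also be generated from the server name. The following is a minimal sketch under assumptions: statusQuery is a hypothetical helper, it matches the hostname tag exactly rather than by regular expression, and strconv.Quote is used as a simple way to produce a double-quoted string literal, which is adequate only for plain ASCII names like the ones in this post.

```go
package main

import (
	"fmt"
	"strconv"
)

// statusQuery builds a Flux query for the Status field of one host.
// The hostname tag is compared with ==, so the name must match exactly.
func statusQuery(bucket, measurement, hostname string) string {
	return fmt.Sprintf(`from(bucket: %s)
  |> range(start: -2h)
  |> filter(fn: (r) => r._measurement == %s and r._field == "Status")
  |> filter(fn: (r) => r.hostname == %s)`,
		strconv.Quote(bucket), strconv.Quote(measurement), strconv.Quote(hostname))
}

func main() {
	fmt.Println(statusQuery("test-bucket", "test-server-list", "nginx-server"))
}
```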

To do next

Move the alert messages out into a template
