Bigdata, Kafka / Optimization

Jeonghak Choยท2025๋…„ 4์›” 27์ผ

Bigdata

๋ชฉ๋ก ๋ณด๊ธฐ
20/30

๐Ÿ“— Kafka ํ”„๋กœ๋น„์ €๋‹ ( ๊ฐœ๋ฐœ๊ณ„ )

๐Ÿณ๏ธโ€๐ŸŒˆ [๊ถ๊ธˆํ•œ์ ]

  • ์นดํ”„์นด ์ตœ์ ํ™” ๋ฐฉ๋ฒ•

๐Ÿ”—๋ชฉ์ฐจ

์•„ํ‚คํ…์ฒ˜ ๊ตฌ์„ฑ ์š”์†Œ

๊ตฌ์„ฑ ์š”์†Œ์—ญํ•  ๋ฐ ์„ค์ • ๋‚ด์šฉ
Kafka Brokers- ์ตœ์†Œ 3๋Œ€ ์ด์ƒ ๊ตฌ์„ฑ
- ๊ณ ์„ฑ๋Šฅ ๋””์Šคํฌ (์˜ˆ: NVMe)
- JVM Heap Size (-Xmx, -Xms) ๋™์ผํ•˜๊ฒŒ ์„ค์ •
- num.io.threads, num.network.threads ์ ์ ˆ ์กฐ์ •
Producers- batch.size, linger.ms ์กฐ์ •์œผ๋กœ throughput ํ–ฅ์ƒ
- compression.type=snappy ์ถ”์ฒœ
- acks=all, retries ์„ค์ •์œผ๋กœ ์•ˆ์ •์„ฑ ๊ฐ•ํ™”
Consumers- Consumer Group์œผ๋กœ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ
- max.poll.records, fetch.min.bytes, fetch.max.wait.ms ์„ค์ • ์ตœ์ ํ™”
- offset ์ˆ˜๋™ ์ปค๋ฐ‹ ์„ค์ • (enable.auto.commit=false)
Partition ์„ค๊ณ„- ํ† ํ”ฝ๋‹น ํŒŒํ‹ฐ์…˜ ์ˆ˜๋Š” CPU ์ˆ˜ * 2 ์ด์ƒ ๊ถŒ์žฅ
- ํŒŒํ‹ฐ์…˜ ํ‚ค๋ฅผ ํ†ตํ•œ workload ๋ถ„์‚ฐ
- ๋ฆฌ๋” ๋ธŒ๋กœ์ปค ๊ณ ๋ฅด๊ฒŒ ๋ถ„์‚ฐ
Replication- ๊ฐ ํ† ํ”ฝ replication.factor๋Š” 3 ์ด์ƒ
- ISR(In-Sync Replica) ๋ชจ๋‹ˆํ„ฐ๋ง ์ค‘์š”
- ์žฅ์•  ๋ณต๊ตฌ ๋Œ€๋น„ ๊ตฌ์„ฑ
Storage- log.retention.hours, log.segment.bytes ํŠœ๋‹
- ์žฅ๊ธฐ ์ €์žฅ ์‹œ S3, HDFS ์—ฐ๋™ ๊ณ ๋ ค
- ๋””์Šคํฌ Full ๋ฐฉ์ง€๋ฅผ ์œ„ํ•œ ๋ชจ๋‹ˆํ„ฐ๋ง
Monitoring- Prometheus + Grafana + Kafka Exporter ๊ถŒ์žฅ
- Alertmanager ์—ฐ๋™์œผ๋กœ ์ด์ƒ ๊ฐ์ง€
- lag, throughput, ISR ๋“ฑ ์ฃผ์š” ์ง€ํ‘œ ํ™•์ธ
Scale-out ์ „๋žต- broker ์ˆ˜ ๋˜๋Š” ํŒŒํ‹ฐ์…˜ ์ˆ˜ ์ฆ๊ฐ€๋กœ ํ™•์žฅ
- ์ž๋™ ๋ฆฌ๋ฐธ๋Ÿฐ์‹ฑ ์ฃผ์˜ (๋™์‹œ ์ฒ˜๋ฆฌ๋Ÿ‰ ๊ณ ๋ ค)
- ํ† ํ”ฝ/ํŒŒํ‹ฐ์…˜ ๋ฐฐ์น˜ ์ตœ์ ํ™” ํ•„์š”

Zookeeper, KRaft ๋น„๊ต

ํ•ญ๋ชฉZookeeper ๊ธฐ๋ฐ˜ KafkaKRaft ๊ธฐ๋ฐ˜ Kafka (Zookeeper-less)
๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ๊ด€๋ฆฌ์™ธ๋ถ€ Zookeeper์—์„œ ๊ด€๋ฆฌKafka ์ž์ฒด์˜ Controller Quorum์—์„œ ๊ด€๋ฆฌ
๊ตฌ์„ฑ ๋ณต์žก๋„Zookeeper ๋ณ„๋„ ์„ค์น˜/์šด์˜ ํ•„์š”Zookeeper ๋ถˆํ•„์š” โ†’ ๊ตฌ์„ฑ ๋‹จ์ˆœํ™”
์‹œ์ž‘ ๋ฒ„์ „Kafka 0.x ~ ํ˜„์žฌ (Legacy)Kafka 2.8 (preview), Kafka 3.3๋ถ€ํ„ฐ ์•ˆ์ •ํ™”
๋ณต์ œ ๋ฐฉ์‹Zookeeper - ZAB ํ”„๋กœํ† ์ฝœKafka Raft (KRaft) ํ”„๋กœํ† ์ฝœ ์‚ฌ์šฉ
Controller ๊ตฌ์กฐ๋‹จ์ผ active controller + standbyMulti-controller (Raft quorum ๊ธฐ๋ฐ˜)
๊ณ ๊ฐ€์šฉ์„ฑZookeeper ์žฅ์•  ์‹œ controller ์„ ์ถœ ์ง€์—ฐRaft quorum์œผ๋กœ ๋น ๋ฅธ failover ๊ฐ€๋Šฅ
์šด์˜ ๋ณต์žก๋„Zookeeper์™€ Kafka ๋ฒ„์ „/๊ตฌ์„ฑ ๊ด€๋ฆฌ ํ•„์š”Kafka ํ•˜๋‚˜๋กœ ํ†ตํ•ฉ๋˜์–ด ์šด์˜ ๊ฐ„ํŽธ
๋ณด์•ˆ ์„ค์ •Kafka โ†” Zookeeper ๊ฐ„ ACL ๋ณ„๋„ ๊ด€๋ฆฌ ํ•„์š”Kafka ๋‚ด ์ผ์›ํ™”๋œ ๋ณด์•ˆ ๊ด€๋ฆฌ ๊ฐ€๋Šฅ
์„ฑ๋Šฅ/์ง€์—ฐZookeeper์™€์˜ ํ†ต์‹  ์ง€์—ฐ ์กด์žฌ๋” ๋น ๋ฅธ controller ๋ฐ˜์‘ ๊ฐ€๋Šฅ
์ด์‹์„ฑ/ํด๋ผ์šฐ๋“œStateful ์„œ๋น„์Šค + ์™ธ๋ถ€ ์ข…์†์„ฑZookeeper ์—†์ด ํด๋ผ์šฐ๋“œ ๋ฐฐํฌ ์šฉ์ด
์ง€์› ํ˜„ํ™ฉํ˜„์žฌ๊นŒ์ง€ ์•ˆ์ •์  ์šด์˜๋จKafka 3.5+์—์„œ ๊ธฐ๋ณธ ๊ตฌ์กฐ๋กœ ์ ์ฐจ ์ „ํ™˜ ์ค‘

Kafka ํ™œ์šฉ

  • ๋กœ๊ทธ ์ˆ˜์ง‘ ํ”Œ๋žซํผ: Fluent Bit/Logstash โ†’ Kafka โ†’ Elasticsearch
  • ์‹ค์‹œ๊ฐ„ ์ถ”์ฒœ ์‹œ์Šคํ…œ: Kafka โ†’ Flink/Spark Streaming โ†’ Redis
  • ๋ฐ์ดํ„ฐ ํŒŒ์ดํ”„๋ผ์ธ: Kafka โ†’ Kafka Connect โ†’ S3, HDFS, Iceberg ๋“ฑ

Kafka ์ตœ์ ํ™” ๊ฐ€์ด๋“œ

Kafka broker

  • ๋ธŒ๋กœ์ปค ์ˆ˜ ๋Š˜๋ฆฌ๊ธฐ: ํŠธ๋ž˜ํ”ฝ์— ๋”ฐ๋ผ ๋‹ค๋ฅด์ง€๋งŒ, ์ตœ์†Œ 3๋Œ€ ์ด์ƒ์œผ๋กœ ์‹œ์ž‘ํ•ด ์ƒค๋”ฉ๊ณผ ๋ณต์ œ๋ฅผ ๋ถ„์‚ฐ.
  • ํ† ํ”ฝ ํŒŒํ‹ฐ์…˜ ์ˆ˜ ์ฆ๊ฐ€: ๋ณ‘๋ ฌ ์†Œ๋น„ ๊ฐ€๋Šฅ์„ฑ์„ ํ™•๋ณดํ•˜๊ธฐ ์œ„ํ•ด ํ† ํ”ฝ๋‹น ๋งŽ์€ ํŒŒํ‹ฐ์…˜ ์ˆ˜ ์„ค์ • (์˜ˆ: 12, 24, 48 ๋“ฑ).
  • ๋ฆฌ์†Œ์Šค ์„ค์ •: ๋””์Šคํฌ๋Š” NVMe ๋˜๋Š” SSD ๊ถŒ์žฅ. RAM์€ ํŽ˜์ด์ง€ ์บ์‹œ์šฉ์œผ๋กœ ๋„‰๋„‰ํžˆ (16GB ์ด์ƒ ๊ถŒ์žฅ). num.network.threads, num.io.threads, socket.send.buffer.bytes, log.segment.bytes ๋“ฑ ํŠœ๋‹.

Kafka producer

  • Batching & ์••์ถ•: batch.size, linger.ms ํŠœ๋‹ํ•˜์—ฌ ์ „์†ก ํšจ์œจ ๊ทน๋Œ€ํ™”.compression.type์€ snappy ๋˜๋Š” lz4 ๊ถŒ์žฅ.
  • Ack ์„ค์ •: acks=all + retries ๋กœ ์•ˆ์ •์„ฑ ํ™•๋ณด.
  • ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ๋ฉ€ํ‹ฐ์Šค๋ ˆ๋“œ ๋˜๋Š” async ๋ฐฉ์‹์œผ๋กœ producer ํ’€ ๊ตฌ์„ฑ.

Kafka consumer

  • ์ปจ์Šˆ๋จธ ๊ทธ๋ฃน ๋ณ‘๋ ฌ์„ฑ: ํŒŒํ‹ฐ์…˜ ์ˆ˜ >= ์ปจ์Šˆ๋จธ ์ธ์Šคํ„ด์Šค ์ˆ˜๋กœ ์œ ์ง€. max.poll.records / fetch.min.bytes ๋“ฑ ํŠœ๋‹. ์—ญ์ง๋ ฌํ™” ์ฒ˜๋ฆฌ ์ตœ์ ํ™” (JSON, Avro ๋“ฑ ์ฒ˜๋ฆฌ ์†๋„ ๊ณ ๋ ค).

Zookeeper or KRaft (Kafka Raft)

Kafka 3.x ์ด์ƒ์ด๋ผ๋ฉด KRaft ๋ชจ๋“œ ๊ณ ๋ ค.์•ˆ์ •์  ์šด์˜ ์œ„ํ•ด ๋ณ„๋„ ๋…ธ๋“œ ๊ตฌ์„ฑ ๊ถŒ์žฅ.

๋ชจ๋‹ˆํ„ฐ๋ง & ์•Œ๋ฆผ

  • Prometheus + Grafana ์—ฐ๋™.
  • ์ฃผ์š” ๋ฉ”ํŠธ๋ฆญ: Under-replicated partitions, Consumer lag, Produce/Consume rate, Disk usage per broker

๋ถ„์‚ฐ ํŒŒ์ผ ์‹œ์Šคํ…œ/DB ์—ฐ๋™ (์˜ต์…˜)

๋Œ€๋Ÿ‰ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ํ›„ HDFS, S3, Iceberg, Delta Lake ๋“ฑ๊ณผ ์—ฐ๊ณ„.
Kafka Connect, Flink, Spark Streaming ์‚ฌ์šฉ ๊ฐ€๋Šฅ.

์ตœ์ ํ™” ์„ค์ • ์ถ”์ฒœ

๊ตฌ์„ฑ ์š”์†Œ์„ค์ • ํ•ญ๋ชฉ์„ค๋ช…์ถ”์ฒœ ๊ฐ’์„ค์ • ํšจ๊ณผํŠธ๋ ˆ์ด๋“œ์˜คํ”„ / ์ฃผ์˜์‚ฌํ•ญ
Produceracks๋ฉ”์‹œ์ง€ ์ˆ˜์‹  ํ™•์ธ ์ˆ˜์ค€all๋ชจ๋“  replica์— ์ €์žฅ๋  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐ โ†’ ๋ฐ์ดํ„ฐ ์•ˆ์ •์„ฑ โ†‘latency โ†‘, throughput โ†“ ๊ฐ€๋Šฅ
Producerretries์ „์†ก ์‹คํŒจ ์‹œ ์žฌ์‹œ๋„ ํšŸ์ˆ˜5 ์ด์ƒ๋„คํŠธ์›Œํฌ ์˜ค๋ฅ˜ ๋“ฑ ์žฌ์‹œ๋„ ์ฒ˜๋ฆฌ โ†’ ์‹ ๋ขฐ์„ฑ โ†‘์ˆœ์„œ ๋ณด์žฅ ์œ„ํ•ด max.in.flight.requests.per.connection ์กฐ์ • ํ•„์š”
Producerbatch.size๋ฐฐ์น˜ ์ „์†ก ํฌ๊ธฐ (bytes)32KB~512KB์—ฌ๋Ÿฌ ๋ฉ”์‹œ์ง€๋ฅผ ๋ชจ์•„ ์ „์†ก โ†’ ๋„คํŠธ์›Œํฌ ํšจ์œจ โ†‘๋„ˆ๋ฌด ํฌ๋ฉด latency โ†‘
Producerlinger.ms๋ฐฐ์น˜ ์ „ ๋Œ€๊ธฐ ์‹œ๊ฐ„ (ms)5~10ms๋ฉ”์‹œ์ง€ ๋ชจ์„ ์‹œ๊ฐ„ ํ™•๋ณด โ†’ Throughput โ†‘latency โ†‘ ๊ฐ€๋Šฅ์„ฑ
Producercompression.type๋ฉ”์‹œ์ง€ ์••์ถ• ๋ฐฉ์‹snappy, lz4๋ฐ์ดํ„ฐ ์ „์†ก๋Ÿ‰ โ†“ โ†’ throughput โ†‘CPU ์‚ฌ์šฉ๋Ÿ‰ ์ฆ๊ฐ€
Producermax.in.flight.requests.per.connection๋™์‹œ ์ „์†ก ์š”์ฒญ ์ˆ˜1~5๋†’์€ ๊ฐ’์œผ๋กœ throughput โ†‘retries์™€ ํ•จ๊ป˜ ์ˆœ์„œ ๊ผฌ์ž„ ์ฃผ์˜
Brokernum.network.threads๋„คํŠธ์›Œํฌ IO ์ฒ˜๋ฆฌ ์Šค๋ ˆ๋“œ ์ˆ˜CPU ์ˆ˜์˜ ์ ˆ๋ฐ˜์š”์ฒญ ์ฒ˜๋ฆฌ ์„ฑ๋Šฅ ํ–ฅ์ƒCPU ์‚ฌ์šฉ๋Ÿ‰ ์ฆ๊ฐ€
Brokernum.io.threads๋””์Šคํฌ IO ์Šค๋ ˆ๋“œ ์ˆ˜CPU ์ˆ˜ ๋˜๋Š” ๊ทธ ์ด์ƒ๋ณ‘๋ ฌ ๋””์Šคํฌ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅ โ†’ ์„ฑ๋Šฅ โ†‘context switching ๊ณผ๋‹ค ๊ฐ€๋Šฅ์„ฑ
Brokersocket.send.buffer.bytes
socket.receive.buffer.bytes
socket.request.max.bytes
์†Œ์ผ“ ๋ฒ„ํผ ํฌ๊ธฐ128KB ~ 512KB๋Œ€์šฉ๋Ÿ‰ ์ „์†ก ํšจ์œจ โ†‘๊ณผ๋„ํ•˜๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ๊ธ‰์ฆ
Brokerlog.segment.bytes๋กœ๊ทธ ์„ธ๊ทธ๋จผํŠธ ์ตœ๋Œ€ ํฌ๊ธฐ1GB ์ด์ƒ์„ธ๊ทธ๋จผํŠธ ํŒŒ์ผ ๋‹จ์œ„ ์กฐ์ ˆ ๊ฐ€๋Šฅ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด ์„ธ๊ทธ๋จผํŠธ ์ˆ˜ ์ฆ๊ฐ€๋กœ ๊ด€๋ฆฌ ๋ถ€๋‹ด
Brokerlog.retention.hours๋ฉ”์‹œ์ง€ ๋ณด๊ด€ ๊ธฐ๊ฐ„์—…๋ฌด ์š”๊ฑด์— ๋”ฐ๋ผ์ž๋™ ์‚ญ์ œ ์ฃผ๊ธฐ ์„ค์ • ๊ฐ€๋Šฅ๊ธธ๋ฉด ๋””์Šคํฌ ๊ณต๊ฐ„ ๋ถ€์กฑ ๊ฐ€๋Šฅ
Brokermessage.max.bytes๋‹จ์ผ ๋ฉ”์‹œ์ง€ ์ตœ๋Œ€ ํฌ๊ธฐ1MB ~ 10MB๋Œ€์šฉ๋Ÿ‰ ๋ฉ”์‹œ์ง€ ํ—ˆ์šฉ๋„ˆ๋ฌด ํฌ๋ฉด ์ฒ˜๋ฆฌ ๋ถ€๋‹ด, ์ž‘์œผ๋ฉด ์ „์†ก ์‹คํŒจ
Brokerreplica.fetch.max.bytes๋ฆฌ๋” โ†’ ํŒ”๋กœ์›Œ ๋ณต์ œ ์‹œ ์ตœ๋Œ€ ํฌ๊ธฐ10MB ~ 50MB๋ณต์ œ ํšจ์œจ โ†‘๋„ˆ๋ฌด ํฌ๋ฉด ๋””์Šคํฌ IO ๋ถ€๋‹ด, ์ž‘์œผ๋ฉด ์ง€์—ฐ ๋ฐœ์ƒ
Brokernum.partitionsํŒŒํ‹ฐ์…˜ ์ˆ˜workload ๊ธฐ๋ฐ˜ (์˜ˆ: 12, 24 ๋“ฑ)๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋Ÿ‰ ์ฆ๊ฐ€ํŒŒํ‹ฐ์…˜ ์ˆ˜ ๋งŽ์œผ๋ฉด ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ/๋ฆฌ๋” ๊ด€๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ฆ๊ฐ€
Consumerfetch.min.bytes์ตœ์†Œ fetch ํฌ๊ธฐ1KB ~ 50KB๋ฐ์ดํ„ฐ ๋ชจ์•„์„œ ๊ฐ€์ ธ์˜ค๊ธฐ โ†’ ๋Œ€์—ญํญ ํšจ์œจ โ†‘๋„ˆ๋ฌด ํฌ๋ฉด ์ฒ˜๋ฆฌ ์ง€์—ฐ ๊ฐ€๋Šฅ์„ฑ โ†‘
Consumerfetch.max.wait.ms์ตœ๋Œ€ ๋Œ€๊ธฐ ์‹œ๊ฐ„100 ~ 500ms๋ฐ์ดํ„ฐ ๋ถ€์กฑ ์‹œ ๋Œ€๊ธฐ ์‹œ๊ฐ„ ์ œํ•œ๋„ˆ๋ฌด ๊ธธ๋ฉด ์‘๋‹ต ์ง€์—ฐ, ์งง์œผ๋ฉด polling ์ฆ๊ฐ€
Consumermax.poll.recordspoll ๋‹น ๋ ˆ์ฝ”๋“œ ์ˆ˜500 ~ 5000๋ฉ”์‹œ์ง€ ์ฒ˜๋ฆฌ๋Ÿ‰ ์กฐ์ ˆ ๊ฐ€๋Šฅ๋„ˆ๋ฌด ํฌ๋ฉด ์ฒ˜๋ฆฌ ์ง€์—ฐ, ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด poll ๊ณผ๋‹ค ๋ฐœ์ƒ
Consumerenable.auto.commit์ž๋™ offset ์ปค๋ฐ‹false ๊ถŒ์žฅ์ˆ˜๋™ ์ปค๋ฐ‹ ๊ฐ€๋Šฅ โ†’ ๋ฐ์ดํ„ฐ ์ œ์–ด โ†‘๊ฐœ๋ฐœ์ž๊ฐ€ ์ปค๋ฐ‹ ์ง์ ‘ ๊ตฌํ˜„ํ•ด์•ผ ํ•จ
Consumermax.partition.fetch.bytesํŒŒํ‹ฐ์…˜๋‹น fetch ํฌ๊ธฐ1MB ~ 10MB๋Œ€์šฉ๋Ÿ‰ ๋ฉ”์‹œ์ง€ ์ฒ˜๋ฆฌ ๊ฐ€๋Šฅ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด fetch ํšŸ์ˆ˜ โ†‘
Consumersession.timeout.ms์ปจ์Šˆ๋จธ ์žฅ์•  ํƒ์ง€ ์‹œ๊ฐ„10000 (10์ดˆ) ์ด์ƒ๋น ๋ฅธ ์žฅ์•  ํƒ์ง€ ๊ฐ€๋Šฅ๋„ˆ๋ฌด ์งง์œผ๋ฉด false positive โ†‘

0๊ฐœ์˜ ๋Œ“๊ธ€