How does the sleep function actually work?

안태우 · August 26, 2023

Digging down to the bottom of how things work


When programming, there are quite a few occasions to use sleep. One example is inside the loop of cronjob-style code that runs repeatedly at a fixed interval. Written in Python, it has the following structure.

import time

while True:
  func()
  time.sleep(3600)  # repeat every hour

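Written this way, the program runs fine. But I suspect everyone has wondered at least once how the sleep function itself works. It does not seem to burn any CPU, so how does it manage to come back after roughly that much time has passed?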

Implementing it the way I imagine it

For instance, if I were to implement sleep myself, I might try something like this.

def consume_one_second():
   ...
   
def sleep(seconds: int):
  for _ in range(seconds):
    consume_one_second()

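consume_one_second can be implemented in many ways, but it becomes easy under the following rather rough assumptions.
1. The clock of the CPU running the program is fixed at $x$ GHz.
2. There is a particular CPU instruction, tentatively called consume, that always takes $y$ cycles.
3. Calling consume from Python costs nothing for the wrapping down to native code.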

Under these assumptions, it could be implemented as follows.

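def consume_one_second():
  # x (clock in GHz) and y (cycles per consume) come from the assumptions above
  for _ in range(int(x * 1e9 / y)):  # range() needs an int, not a float
    wrapped_consume()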
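To get a feel for the numbers under these assumptions, here is a small worked calculation. The values 3.0 GHz and 4 cycles are made up purely for illustration, and iterations_per_second is a hypothetical helper name, not something from a real library.

```python
def iterations_per_second(clock_ghz: float, cycles_per_call: int) -> int:
    """How many calls of a fixed-cost instruction fill one second."""
    cycles_per_second = clock_ghz * 1e9
    return int(cycles_per_second / cycles_per_call)


# Assumed values for illustration only: a 3.0 GHz clock and a 4-cycle instruction.
print(iterations_per_second(3.0, 4))  # 750000000
```

Even in this idealized model the loop has to run hundreds of millions of times per second, which already hints at how wasteful the approach is.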

Alternatively, instead of being this strict, we could build it from a statistical average.

import time

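def benchmark(func, iterations):
  # Print the average wall-clock time of func() over the given number of runs
  total = 0.0
  for _ in range(iterations):
    start = time.perf_counter()
    func()
    end = time.perf_counter()
    total += end - start

  print(total / iterations)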

def add():
  for _ in range(56_500):
    s = 0
    for i in range(1_000):
      s += i

benchmark(add, 10)  # prints about 1.00, i.e. add() takes roughly one second per call

The latter approach is more practical (the assumptions of the former are far too idealistic), so implementing and using sleep with it looks like this.

import time

def consume_one_second():
  for _ in range(56_500):
    s = 0
    for i in range(1_000):
      s += i

def sleep(seconds: int):
  for _ in range(seconds):
    consume_one_second()

while True:
  print(f'{time.time()} hello! next print will run after 10 seconds')
  sleep(10)  # repeat every 10 seconds

The actual output is shown below. It is a bit sloppy, but you can clearly see it repeating roughly every 10 seconds.

1693019196.186122 hello! next print will run after 10 seconds
1693019206.2261178 hello! next print will run after 10 seconds
1693019216.2263172 hello! next print will run after 10 seconds
1693019226.224003 hello! next print will run after 10 seconds
1693019236.2392802 hello! next print will run after 10 seconds
1693019246.449384 hello! next print will run after 10 seconds
1693019256.5064378 hello! next print will run after 10 seconds
1693019266.5285928 hello! next print will run after 10 seconds
1693019276.519737 hello! next print will run after 10 seconds
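To put a number on how sloppy it is, the gaps between consecutive timestamps in the log above can be computed directly. This is just arithmetic on the printed values; the variable names are my own.

```python
# Timestamps copied from the log above.
timestamps = [
    1693019196.186122,
    1693019206.2261178,
    1693019216.2263172,
    1693019226.224003,
    1693019236.2392802,
    1693019246.449384,
    1693019256.5064378,
    1693019266.5285928,
    1693019276.519737,
]

# Gap between each print and the next one.
intervals = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

for interval in intervals:
    print(f"{interval:.3f}")
print(f"min={min(intervals):.3f} max={max(intervals):.3f}")
```

The gaps hover around 10 seconds but wander by up to a couple of hundred milliseconds, since the calibrated loop depends on whatever else the machine is doing at that moment.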

But there is a problem. Apart from printing, this program does nothing during each 10-second gap, and yet you can observe that it keeps one core fully occupied. Waiting while directly occupying a core like this is called busy waiting.

If I use time.sleep instead of my hand-made sleep, CPU usage stays near zero. How is that possible? Let's look directly at Python's sleep implementation.
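One way to see the difference without a process monitor is to compare wall-clock time with CPU time for both kinds of waiting. The sketch below is my own illustration: busy_wait spins on a deadline instead of using the calibrated loop from earlier, and measure is a hypothetical helper.

```python
import time


def busy_wait(seconds: float) -> None:
    """Spin until the deadline passes, burning CPU the whole time."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass


def measure(wait, seconds: float) -> tuple[float, float]:
    """Return (wall-clock seconds, CPU seconds) spent inside wait(seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    wait(seconds)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


for name, wait in [("busy_wait", busy_wait), ("time.sleep", time.sleep)]:
    wall, cpu = measure(wait, 0.5)
    print(f"{name:>10}: wall={wall:.3f}s cpu={cpu:.3f}s")
```

On a typical machine the busy wait should report CPU time close to its wall-clock time, while time.sleep should report CPU time close to zero.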

Python's time.sleep implementation

The implementation of Python's time.sleep can be found at https://github.com/python/cpython.

Modules/timemodule.c

static PyObject *
time_sleep(PyObject *self, PyObject *timeout_obj)
{
    PySys_Audit("time.sleep", "O", timeout_obj);

    _PyTime_t timeout;
    if (_PyTime_FromSecondsObject(&timeout, timeout_obj, _PyTime_ROUND_TIMEOUT))
        return NULL;
    if (timeout < 0) {
        PyErr_SetString(PyExc_ValueError,
                        "sleep length must be non-negative");
        return NULL;
    }
    if (pysleep(timeout) != 0) {
        return NULL;
    }
    Py_RETURN_NONE;
}

static PyMethodDef time_methods[] = {
...
    {"sleep",           time_sleep, METH_O, sleep_doc},
...
}

// time.sleep() implementation.
// On error, raise an exception and return -1.
// On success, return 0.
static int
pysleep(_PyTime_t timeout)
{
    assert(timeout >= 0);

    struct timespec timeout_abs;
    
    _PyTime_t deadline, monotonic;
    int err = 0;

    if (get_monotonic(&monotonic) < 0) {
        return -1;
    }
    deadline = monotonic + timeout;

    if (_PyTime_AsTimespec(deadline, &timeout_abs) < 0) {
        return -1;
    }

    do {
        int ret;
        Py_BEGIN_ALLOW_THREADS
        
        ret = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &timeout_abs, NULL);
        err = ret;

        Py_END_ALLOW_THREADS

        if (ret == 0) {
            break;
        }

        if (err != EINTR) {
            errno = err;
            PyErr_SetFromErrno(PyExc_OSError);
            return -1;
        }

        /* sleep was interrupted by SIGINT */
        if (PyErr_CheckSignals()) {
            return -1;
        }
    } while (1);

    return 0;
}

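Since this is native code, the implementation of pysleep contains platform-specific branches. To keep the explanation reasonably general, I took the code path that applies when the platform is not Windows and HAVE_CLOCK_NANOSLEEP is defined. Looking at the code, it is easy to see that Python's sleep is mapped to time_sleep, and that time_sleep in turn calls pysleep to do the actual sleeping.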

pysleep takes a monotonic timestamp and uses it to call clock_nanosleep. Here, Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS are macros that release and reacquire the GIL. Presumably this is because, when several threads are managed under the GIL inside Python, a thread that is merely sleeping cannot cause a race condition, so the GIL is released around the call to let the other threads use the CPU in the meantime.
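The effect of releasing the GIL around the sleep can be observed from Python itself. In this small experiment (sleeper and run_concurrently are names I made up), four threads each sleep for 0.3 seconds. If the GIL stayed held during the sleep, the threads would have to take turns and the total would be around 1.2 seconds.

```python
import threading
import time


def sleeper(seconds: float) -> None:
    time.sleep(seconds)  # the GIL is released while this thread is blocked


def run_concurrently(count: int, seconds: float) -> float:
    """Start `count` sleeping threads and return the total wall-clock time."""
    threads = [
        threading.Thread(target=sleeper, args=(seconds,)) for _ in range(count)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


elapsed = run_concurrently(4, 0.3)
print(f"4 threads x 0.3s sleep took {elapsed:.2f}s in total")
```

The total should come out close to 0.3 seconds, which is consistent with the sleeps overlapping rather than being serialized.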

From here on, it continues into a Linux system call. https://man7.org/linux/man-pages/man2/clock_nanosleep.2.html

clock_nanosleep() suspends the execution of the calling thread
until either at least the time specified by request has elapsed,
or a signal is delivered that causes a signal handler to be
called or that terminates the process.

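It says that the kernel suspends the thread and returns either after at least the requested time has elapsed or when a signal arrives. So from this point on, everything is in the hands of the OS scheduler. The thread is suspended and managed much like a coroutine is. I would love to dig into the Linux kernel from here, but that is too complex, so I will skip it for now.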
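The retry loop in pysleep, combined with the "at least the requested time, or a signal" wording, boils down to waiting against a fixed deadline and going back to sleep whenever the wait ends early. Below is a rough Python analogue of that idea, not a translation of the C code: CPython passes an absolute time to the kernel via TIMER_ABSTIME, whereas this sketch recomputes the remaining time on each pass. flaky_wait is a hypothetical stand-in for a sleep that gets interrupted.

```python
import time


def sleep_until_deadline(timeout: float, wait=time.sleep) -> int:
    """Wait at least `timeout` seconds; return how many waits were needed.

    `wait` may return early (as an interrupted system call would), so the
    remaining time is recomputed from the fixed deadline on every pass.
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return attempts
        attempts += 1
        wait(remaining)


def flaky_wait(seconds: float) -> None:
    """Hypothetical interrupted sleep: never waits longer than 50 ms."""
    time.sleep(min(seconds, 0.05))


start = time.monotonic()
attempts = sleep_until_deadline(0.2, wait=flaky_wait)
elapsed = time.monotonic() - start
print(f"woke up after {elapsed:.3f}s using {attempts} waits")
```

Because the deadline is fixed up front, early wake-ups cost extra passes through the loop but do not make the total wait shorter or let the error accumulate.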

Python's asyncio.sleep implementation

While we are at it, how is Python's asyncio.sleep implemented?

Lib/asyncio/tasks.py

@types.coroutine
def __sleep0():
    """Skip one event loop run cycle.

    This is a private helper for 'asyncio.sleep()', used
    when the 'delay' is set to 0.  It uses a bare 'yield'
    expression (which Task.__step knows how to handle)
    instead of creating a Future object.
    """
    yield


async def sleep(delay, result=None):
    """Coroutine that completes after a given time (in seconds)."""
    if delay <= 0:
        await __sleep0()
        return result

    if math.isnan(delay):
        raise ValueError("Invalid delay: NaN (not a number)")

    loop = events.get_running_loop()
    future = loop.create_future()
    h = loop.call_later(delay,
                        futures._set_result_unless_cancelled,
                        future, result)
    try:
        return await future
    finally:
        h.cancel()

If delay is zero or negative, it simply yields. If it is positive, it arranges for the future's result to be set after that delay has passed, and then awaits the future.
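The same timer-plus-future structure can be rebuilt using only public asyncio APIs. The following is a simplified sketch of my own (my_sleep and wake_up are made-up names); it skips the zero-delay and NaN handling of the real function.

```python
import asyncio
import time


async def my_sleep(delay: float, result=None):
    """Simplified sleep: a timer callback resolves a future that we await."""
    loop = asyncio.get_running_loop()
    future = loop.create_future()

    def wake_up() -> None:
        # Only set the result if nobody cancelled the future in the meantime.
        if not future.cancelled():
            future.set_result(result)

    handle = loop.call_later(delay, wake_up)
    try:
        return await future
    finally:
        handle.cancel()  # harmless if the timer has already fired


async def main():
    start = time.perf_counter()
    value = await my_sleep(0.1, result="done")
    return value, time.perf_counter() - start


value, elapsed = asyncio.run(main())
print(value, f"{elapsed:.3f}s")
```

While the coroutine is suspended on the future, the event loop is free to run other tasks, which is the coroutine-level counterpart of the kernel suspending a thread.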

Lib/asyncio/base_events.py

def call_later(self, delay, callback, *args, context=None):
    """Arrange for a callback to be called at a given time.

    Return a Handle: an opaque object with a cancel() method that
    can be used to cancel the call.

    The delay can be an int or float, expressed in seconds.  It is
    always relative to the current time.

    Each callback will be called exactly once.  If two callbacks
    are scheduled for exactly the same time, it is undefined which
    will be called first.

    Any positional arguments after the callback will be passed to
    the callback when it is called.
    """
    if delay is None:
        raise TypeError('delay must not be None')
    timer = self.call_at(self.time() + delay, callback, *args,
                         context=context)
    if timer._source_traceback:
        del timer._source_traceback[-1]
    return timer

def call_at(self, when, callback, *args, context=None):
    """Like call_later(), but uses an absolute time.

    Absolute time corresponds to the event loop's time() method.
    """
    if when is None:
        raise TypeError("when cannot be None")
    self._check_closed()
    if self._debug:
        self._check_thread()
        self._check_callback(callback, 'call_at')
    timer = events.TimerHandle(when, callback, args, self, context)
    if timer._source_traceback:
        del timer._source_traceback[-1]
    heapq.heappush(self._scheduled, timer)
    timer._scheduled = True
    return timer

It creates a TimerHandle that records when to run and pushes it onto the scheduler's heap.
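To illustrate why a heap is a natural fit here, the toy model below keeps timers in a heapq and runs whichever ones are due. TinyTimerQueue is a hypothetical class of my own, not asyncio's implementation. One deliberate difference: it adds a counter as a tie-breaker, whereas the docstring above says the order of same-time callbacks in asyncio is undefined.

```python
import heapq
import itertools


class TinyTimerQueue:
    """Toy model of a timer heap: the earliest deadline is always popped first."""

    def __init__(self):
        self._scheduled = []
        # Tie-breaker so that callbacks themselves are never compared.
        self._counter = itertools.count()

    def call_at(self, when, callback, *args):
        heapq.heappush(self._scheduled, (when, next(self._counter), callback, args))

    def run_due(self, now):
        """Run every callback whose deadline is <= now; return how many ran."""
        ran = 0
        while self._scheduled and self._scheduled[0][0] <= now:
            _, _, callback, args = heapq.heappop(self._scheduled)
            callback(*args)
            ran += 1
        return ran


fired = []
timers = TinyTimerQueue()
timers.call_at(3.0, fired.append, "c")
timers.call_at(1.0, fired.append, "a")
timers.call_at(2.0, fired.append, "b")

timers.run_due(now=2.5)
print(fired)  # ['a', 'b']
timers.run_due(now=3.0)
print(fired)  # ['a', 'b', 'c']
```

The loop only ever needs to look at the top of the heap to know whether anything is due, and how long it can wait if nothing is.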

The interesting part here is probably the call to __sleep0. If a Python coroutine wants to yield to other coroutines and then resume as soon as possible, it can do so by calling await asyncio.sleep(0). I think needing to write code like this usually signals a poor design, but if a heavy synchronous task really has to run inside a coroutine, sprinkling await asyncio.sleep(0) through it is a trick that can work.
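Here is a minimal demonstration of that yielding behavior. The worker coroutine is a made-up example in which appending to a list stands in for a slice of synchronous work.

```python
import asyncio


async def worker(name: str, steps: int, log: list) -> None:
    for step in range(steps):
        log.append(f"{name}{step}")  # stand-in for a slice of synchronous work
        await asyncio.sleep(0)       # give other ready tasks a turn


async def main() -> list:
    log = []
    await asyncio.gather(worker("A", 3, log), worker("B", 3, log))
    return log


log = asyncio.run(main())
print(log)  # ['A0', 'B0', 'A1', 'B1', 'A2', 'B2']
```

Without the await asyncio.sleep(0) line, worker A would run all of its steps before B got a chance to start.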

Another sleep implementation

What about a program that runs without an OS? The most accessible example is probably Arduino. On embedded devices like this there is often no OS and no threads, and only the code you wrote runs, so you have probably had the experience of writing a loop by hand to get some periodic behavior. Arduino implements sleep as delay.
https://github.com/arduino/ArduinoCore-avr
cores/arduino/Arduino.h

void yield(void);

void delay(unsigned long ms);
void delayMicroseconds(unsigned int us);

cores/arduino/wiring.c

void delay(unsigned long ms)
{
	uint32_t start = micros();

	while (ms > 0) {
		yield();
		while ( ms > 0 && (micros() - start) >= 1000) {
			ms--;
			start += 1000;
		}
	}
}

/* Delay for the given number of microseconds.  Assumes a 1, 8, 12, 16, 20 or 24 MHz clock. */
void delayMicroseconds(unsigned int us)
{
	// call = 4 cycles + 2 to 4 cycles to init us(2 for constant delay, 4 for variable)

	// calling avrlib's delay_us() function with low values (e.g. 1 or
	// 2 microseconds) gives delays longer than desired.
	//delay_us(us);
#if F_CPU >= 24000000L
	// for the 24 MHz clock for the adventurous ones trying to overclock

	// zero delay fix
	if (!us) return; //  = 3 cycles, (4 when true)

	// the following loop takes a 1/6 of a microsecond (4 cycles)
	// per iteration, so execute it six times for each microsecond of
	// delay requested.
	us *= 6; // x6 us, = 7 cycles

	// account for the time taken in the preceding commands.
	// we just burned 22 (24) cycles above, remove 5, (5*4=20)
	// us is at least 6 so we can subtract 5
	us -= 5; //=2 cycles

#elif F_CPU >= 20000000L
	// for the 20 MHz clock on rare Arduino boards

	// for a one-microsecond delay, simply return.  the overhead
	// of the function call takes 18 (20) cycles, which is 1us
	__asm__ __volatile__ (
		"nop" "\n\t"
		"nop" "\n\t"
		"nop" "\n\t"
		"nop"); //just waiting 4 cycles
	if (us <= 1) return; //  = 3 cycles, (4 when true)

	// the following loop takes a 1/5 of a microsecond (4 cycles)
	// per iteration, so execute it five times for each microsecond of
	// delay requested.
	us = (us << 2) + us; // x5 us, = 7 cycles

	// account for the time taken in the preceding commands.
	// we just burned 26 (28) cycles above, remove 7, (7*4=28)
	// us is at least 10 so we can subtract 7
	us -= 7; // 2 cycles

#elif F_CPU >= 16000000L
	// for the 16 MHz clock on most Arduino boards

	// for a one-microsecond delay, simply return.  the overhead
	// of the function call takes 14 (16) cycles, which is 1us
	if (us <= 1) return; //  = 3 cycles, (4 when true)

	// the following loop takes 1/4 of a microsecond (4 cycles)
	// per iteration, so execute it four times for each microsecond of
	// delay requested.
	us <<= 2; // x4 us, = 4 cycles

	// account for the time taken in the preceding commands.
	// we just burned 19 (21) cycles above, remove 5, (5*4=20)
	// us is at least 8 so we can subtract 5
	us -= 5; // = 2 cycles,

#elif F_CPU >= 12000000L
	// for the 12 MHz clock if somebody is working with USB

	// for a 1 microsecond delay, simply return.  the overhead
	// of the function call takes 14 (16) cycles, which is 1.5us
	if (us <= 1) return; //  = 3 cycles, (4 when true)

	// the following loop takes 1/3 of a microsecond (4 cycles)
	// per iteration, so execute it three times for each microsecond of
	// delay requested.
	us = (us << 1) + us; // x3 us, = 5 cycles

	// account for the time taken in the preceding commands.
	// we just burned 20 (22) cycles above, remove 5, (5*4=20)
	// us is at least 6 so we can subtract 5
	us -= 5; //2 cycles

#elif F_CPU >= 8000000L
	// for the 8 MHz internal clock

	// for a 1 and 2 microsecond delay, simply return.  the overhead
	// of the function call takes 14 (16) cycles, which is 2us
	if (us <= 2) return; //  = 3 cycles, (4 when true)

	// the following loop takes 1/2 of a microsecond (4 cycles)
	// per iteration, so execute it twice for each microsecond of
	// delay requested.
	us <<= 1; //x2 us, = 2 cycles

	// account for the time taken in the preceding commands.
	// we just burned 17 (19) cycles above, remove 4, (4*4=16)
	// us is at least 6 so we can subtract 4
	us -= 4; // = 2 cycles

#else
	// for the 1 MHz internal clock (default settings for common Atmega microcontrollers)

	// the overhead of the function calls is 14 (16) cycles
	if (us <= 16) return; //= 3 cycles, (4 when true)
	if (us <= 25) return; //= 3 cycles, (4 when true), (must be at least 25 if we want to subtract 22)

	// compensate for the time taken by the preceding and next commands (about 22 cycles)
	us -= 22; // = 2 cycles
	// the following loop takes 4 microseconds (4 cycles)
	// per iteration, so execute it us/4 times
	// us is at least 4, divided by 4 gives us 1 (no zero delay bug)
	us >>= 2; // us div 4, = 4 cycles
	

#endif

	// busy wait
	__asm__ __volatile__ (
		"1: sbiw %0,1" "\n\t" // 2 cycles
		"brne 1b" : "=w" (us) : "0" (us) // 2 cycles
	);
	// return = 4 cycles
}
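The multipliers in the comments of delayMicroseconds (x6, x5, x4, x3, x2) follow from the fact that the final busy-wait loop costs 4 cycles per iteration. The arithmetic can be checked with a short calculation; loop_iterations_per_microsecond is a helper name I made up for this purpose.

```python
def loop_iterations_per_microsecond(f_cpu_hz: int, cycles_per_iteration: int = 4) -> float:
    """Iterations of a fixed-cost loop that fit into one microsecond."""
    cycles_per_microsecond = f_cpu_hz / 1_000_000
    return cycles_per_microsecond / cycles_per_iteration


for f_cpu in (24_000_000, 20_000_000, 16_000_000, 12_000_000, 8_000_000):
    iterations = loop_iterations_per_microsecond(f_cpu)
    print(f"{f_cpu // 1_000_000:>2} MHz -> {iterations:g} iterations per microsecond")
```

This is essentially the idealized fixed-clock, fixed-cycle-count model from the beginning of the article, which is realistic on a microcontroller where the clock really is fixed.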

cores/arduino/hooks.c

/**
 * Empty yield() hook.
 *
 * This function is intended to be used by library writers to build
 * libraries or sketches that supports cooperative threads.
 *
 * Its defined as a weak symbol and it can be redefined to implement a
 * real cooperative scheduler.
 */
static void __empty() {
	// Empty
}
void yield(void) __attribute__ ((weak, alias("__empty")));

As you can see, it implements a busy wait. For Arduino models that support cooperative threads, yield can be implemented so that the wait gives way to some degree and lets other threads run.
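The control flow of delay() is easy to miss in the C source, so here is a Python port driven by a fake clock. FakeClock and the yield_hook parameter are assumptions of mine for the sake of a runnable example, and unlike the C version this sketch ignores unsigned integer wrap-around.

```python
class FakeClock:
    """Hypothetical microsecond counter that advances on every read."""

    def __init__(self, step_us: int):
        self.now_us = 0
        self.step_us = step_us

    def micros(self) -> int:
        self.now_us += self.step_us
        return self.now_us


def delay(ms: int, micros, yield_hook=lambda: None) -> int:
    """Busy-wait for `ms` milliseconds; return how many times the hook ran."""
    start = micros()
    hook_calls = 0
    while ms > 0:
        yield_hook()  # empty by default, like the weak yield() symbol
        hook_calls += 1
        # Count off one millisecond for every full 1000 us that has elapsed.
        while ms > 0 and micros() - start >= 1000:
            ms -= 1
            start += 1000
    return hook_calls


clock = FakeClock(step_us=250)
calls = delay(3, clock.micros)
print(f"clock advanced to {clock.now_us} us, hook called {calls} times")
```

The loop never gives up the CPU on its own. The only chance for other work to run is the hook call at the top of each pass.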

Conclusion

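We saw that for programs running on an OS and for code running as coroutines, sleep is implemented by yielding. In embedded environments, on the other hand, it may be implemented as a direct busy wait. In fact, since there is some cost in going through the kernel, for a very short sleep, say only 100 ns, a busy wait may actually be more efficient than yielding (the same trade-off as between a mutex and a spinlock).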
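A rough way to explore this trade-off is to request a very short wait both ways and measure how long each actually takes. The helpers below are my own sketch. In Python the interpreter overhead alone exceeds 100 ns, so the numbers only indicate the relative gap, not what a native program would see.

```python
import time


def spin_wait_ns(duration_ns: int) -> int:
    """Busy-wait for duration_ns; return the time actually spent, in ns."""
    start = time.perf_counter_ns()
    deadline = start + duration_ns
    while time.perf_counter_ns() < deadline:
        pass
    return time.perf_counter_ns() - start


def sleep_wait_ns(duration_ns: int) -> int:
    """Ask the OS to sleep for duration_ns; return the time actually spent, in ns."""
    start = time.perf_counter_ns()
    time.sleep(duration_ns / 1e9)
    return time.perf_counter_ns() - start


requested_ns = 100
print("spin :", spin_wait_ns(requested_ns), "ns")
print("sleep:", sleep_wait_ns(requested_ns), "ns")
```

I would expect the sleep-based version to overshoot a 100 ns request by a much larger margin than the spinning version, but the exact figures depend heavily on the OS and hardware.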

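When I was young and programming in Java, I used Thread.sleep(1000) and was always curious how on earth it worked. Recently, while doing asyncio programming in Python, I suddenly wanted to look up the mechanism myself. I hope that by writing it up here, others can also come away thinking, "so that is how this works."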
