[운영체제] Lock-based Concurrent Data Structures

전윤혁·2024년 12월 20일

OS

목록 보기

15/18

이전 글에서 Lock에 대해 상당히 자세히 알아봤다. 이번 글에서는 실제로 Lock을 Linked List, Queue, Hash Table 등의 자료 구조에 적용하여 thread-safe하게 만드는 방법을 알아볼 것이다. 그 과정에서 Correctness와 Performance를 모두 고려해보자.

1. Concurrent Counters

가장 먼저 Counter, 즉 단순히 공유 자원의 값을 올리거나 내리는 작업의 동시성을 생각해보자.

1) Concurrent Counters without Locks

typedef struct __counter_t {
	int value;
} counter_t;

void init(counter_t *c) {
	c->value = 0;
}

void increment(counter_t *c) {
	c->value++;
}

void decrement(counter_t *c) {
	c->value--;
}

int get(counter_t *c) {
	return c->value;
}

위는 Lock이 구현되지 않은 Counter 코드이다. 이렇게 작성될 경우 value의 값을 올리거나 내리는 과정에서 Race Condition이 발생할 수 있다.

2) Concurrent Counters with Locks

이번에는 단순히 Single Lock을 통해 Lock을 구현한 예시이다.

typedef struct __counter_t {
  int value;
  pthread_mutex_t lock;
} counter_t;

void init(counter_t *c) {
  c->value = 0;
  Pthread_mutex_init(&c->lock, NULL);
}

void increment(counter_t *c) {
  Pthread_mutex_lock(&c->lock);
  c->value++;
  Pthread_mutex_unlock(&c->lock);
}

void decrement(counter_t *c) {
  Pthread_mutex_lock(&c->lock);
  c->value--;
  Pthread_mutex_unlock(&c->lock);
}

int get(counter_t *c) {
  Pthread_mutex_lock(&c->lock);
  int rc = c->value;
  Pthread_mutex_unlock(&c->lock);
  return rc;
}

위와 같이 작성될 경우, 동시에 여러 스레드가 value에 접근하더라도 동시성이 보장된다.

하지만 위와 같이 한 개의 Lock을 통해 동시성을 보장하면 성능상 문제가 발생할 수 있다. 아래에서 성능 지표를 살펴보자.

3) Overhead of Simple Approach

위의 그래프를 살펴보면, 스레드의 수가 늘어남에 따라 Counter을 업데이트하는 시간(Time)이 점점 증가하는 것을 알 수 있다. 즉, Scalability(확장성)가 부족하다.

그 이유는 무엇일까? 앞서 설명한 것처럼 increment, decrement, get 모두 한 개의 Lock을 같이 사용하기 때문에, Lock에 대한 병목 현상이 일어나는 것이다.

따라서 지금부터 우리의 목표는 병렬 작업이 서로 간섭하지 않고 독립적으로 처리될 수 있는, Perfect Scaling을 구현하는 것이다.

4) Sloppy counter

앞선 그래프를 살펴보면 사실 Traditional과 Sloppy Counter 두 가지 방식의 성능을 비교하고 있는데, o로 표시되어 있는 Sloppy Counter 방식의 경우 스레드가 증가되더라도 일관된 성능을 보여주고 있다. 어떻게?

Sloppy Counter 방식을 간단히 설명하면, 각 스레드에서 매번 Global Lock을 획득하여 각각의 작업을 수행하는 대신, 주기적으로 로컬 작업의 결과를 한 번에 업데이트 하는 것이다. 이를 통해 Lock에 대한 경합이 현저히 줄어들게 된다.

Local Counter
각 CPU 코어가 자체적으로 값을 증가시키는 로컬 카운터.
Global Counter
로컬 카운터의 값이 주기적으로 통합되는 전체 카운터.

만약 특정 CPU 코어에서 실행되고 있는 스레드가 Counter의 값을 업데이트하려고 하면, 현재 CPU 코어의 로컬 카운터를 증가시킨다.

일정 임계값(S)이 도달하면, 로컬 카운터 값을 글로벌 카운터에 더한다. 이 과정은 Global Lock을 획득하여 진행한다. 이후, 로컬 카운터는 0으로 초기화된다.

여기서 임계값인 S란 Sloppiness를 의미하는데, local-to-global transfer의 주기에 따라 아래와 같은 차이가 생기게 된다.

The smaller S
Counter가 non-scalable counter와 유사하게 동작한다.
The bigger S
Counter의 확장성이 좋아지지만, global value와 실제 값과의 오차가 커질 수 있다.

여기서 말하는 오차란, 위의 표에서 볼 수 있듯이 로컬 카운터의 값이 아직 글로벌 카운터에 반영되지 않았기 때문에 생기는 오차이다. 위의 표에서 L2와 L3의 로컬 카운터 값이 임계값(5) 미만인 2와 4이므로 아직 글로벌 카운터에 반영되지 않았고, 그 결과 6만큼 오차가 발생하고 있다. (실제 값: 16, 글로벌 카운터 값: 10)

임계값 S가 중요한 이유는, 위의 그래프에서 볼 수 있듯이 S 값에 따라 성능은 점점 좋아지지만, 실제 값과의 오차가 더 커질 수 있기 때문이다.

아래는 Sloppy Counter의 구현이다.

typedef struct __counter_t {
  int global; // global count
  pthread_mutex_t glock; // global lock
  int local[NUMCPUS]; // per-CPU count
  pthread_mutex_t llock[NUMCPUS]; // ... and locks
  int threshold; // update frequency
} counter_t;

// init: record threshold, init locks, init values
// of all local counts and global count
void init(counter_t *c, int threshold) {
  c->threshold = threshold;
  c->global = 0;
  
  pthread_mutex_init(&c->glock, NULL);
  
  int i;
  for (i = 0; i < NUMCPUS; i++) {
    c->local[i] = 0;
    pthread_mutex_init(&c->llock[i], NULL);
  }
}

// update: usually, just grab local lock and update
// local amount; once local count has risen ’threshold’,
// grab global lock and transfer local values to it
void update(counter_t *c, int threadID, int amt) {
  int cpu = threadID % NUMCPUS;
  
  pthread_mutex_lock(&c->llock[cpu]);
  
  c->local[cpu] += amt;
  if (c->local[cpu] >= c->threshold) {
    // transfer to global (assumes amt>0)
    pthread_mutex_lock(&c->glock);
    c->global += c->local[cpu];
    pthread_mutex_unlock(&c->glock);
    c->local[cpu] = 0;
  }
  
  pthread_mutex_unlock(&c->llock[cpu]);
}

// get: just return global amount (approximate)
int get(counter_t *c) {
  pthread_mutex_lock(&c->glock);
  int val = c->global;
  pthread_mutex_unlock(&c->glock);
  return val; // only approximate!
}

코드가 상당히 직관적이기 때문에 추가적인 설명이 크게 필요하진 않지만, pthread_mutex_t glock(Global Lock)과 pthread_mutex_t llock[NUMCPUS](Local Lock)의 두 가지 Lock을 사용했다는 점에 주목하자!

2. Concurrent Linked Lists

이번에는 Linked List에 동시성을 구현하는 과정을 살펴보자.

// basic node structure
typedef struct __node_t {
  int key;
  struct __node_t *next;
} node_t;

// basic list structure (one used per list)
typedef struct __list_t {
  node_t *head;
  pthread_mutex_t lock;
} list_t;

void List_Init(list_t *L) {
  L->head = NULL;
  pthread_mutex_init(&L->lock, NULL);
}

int List_Insert(list_t *L, int key) {
  pthread_mutex_lock(&L->lock);
  node_t *new = malloc(sizeof(node_t));
  
  if (new == NULL) {
    perror("malloc");
    pthread_mutex_unlock(&L->lock);
    return -1; // fail
  }
  
  new->key = key;
  new->next = L->head;
  L->head = new;
  
  pthread_mutex_unlock(&L->lock);
  return 0; // success
}

int List_Lookup(list_t *L, int key) {
  pthread_mutex_lock(&L->lock);
  node_t *curr = L->head;
  
  while (curr) {
      if (curr->key == key) {
        pthread_mutex_unlock(&L->lock);
        return 0; // success
    }
    curr = curr->next;
  }
  pthread_mutex_unlock(&L->lock);
  return -1; // failure
}

먼저, Global Lock 한 개만 쓴 코드의 예시이다. 성능을 검토하기 전, 아래의 구현을 먼저 짚고 넘어가보자.

if (new == NULL) {
    perror("malloc");
    pthread_mutex_unlock(&L->lock);
    return -1; // fail
  }

해당 부분에서 malloc에 실패했을 경우, 삽입 작업이 완료되지 않은 상태에서 함수가 종료된다. 현재 코드에서는 정상적으로 Unlock을 진행해주고 있지만, 특정 스레드가 Lock을 해제하지 않고 종료되는 에러는 리눅스에서도 실제로 빈번히 발생하는 오류라고 한다.

이를 완화할 수 있는 방법은 무엇이 있을까? 현재는 insert() 함수에 들어갈 때 전체 함수에 락을 걸고, 함수가 끝날 때 락을 해제하고 있는데, 이는 위에서 설명한 종류의 에러에 취약한 코드이다. (실수로 Unlock을 빠뜨리는 실수를 하기 쉽다.)

이에 대한 해결책으로는, Lock을 실제 크리티컬 섹션에만 적용하는 방법이 있다. 즉, insert() 작업 전체가 아니라, 실제 데이터 수정이 이루어지는 부분에만 Lock을 걸고, 해제하는 것이다.

이를 반영하여 수정한 코드이다.

void List_Init(list_t *L) {
    L->head = NULL;
    pthread_mutex_init(&L->lock, NULL);
}

void List_Insert(list_t *L, int key) {
    // synchronization not needed
    node_t *new = malloc(sizeof(node_t));
    if (new == NULL) {
        perror("malloc");
        return;
    }
    new->key = key;

    // just lock critical section
    pthread_mutex_lock(&L->lock);
    new->next = L->head;
    L->head = new;
    pthread_mutex_unlock(&L->lock);
}

int List_Lookup(list_t *L, int key) {
    int rv = -1;
    pthread_mutex_lock(&L->lock);
    node_t *curr = L->head;
    while (curr) {
        if (curr->key == key) {
            rv = 0;
            break;
        }
        curr = curr->next;
    }
    pthread_mutex_unlock(&L->lock);
    return rv; // now both success and failure
}

함수 시작 부분에 Lock을 걸어주는 대신, 실제 크리티컬 섹션에 진입하는 부분만 Lock으로 감싸주었다. 이를 통해 안정성과 성능 모두 개선되었다. (while문에서 break 이후 Lock을 해제해주는 수정 또한 코드 작성 실수를 보다 방지하기 위한, 일관적인 Unlock 형식을 위한 것으로 이해하면 되겠다.)

다음으로, Linked List를 Scaling하는 과정을 살펴보자. 이를 위해 Hand-over-hand Locking이란 방식을 사용한다.

Hand-over-hand Locking 기법을 사용하면, 전체 리스트에 대한 단일 Lock을 사용하는 대신, 각 노드마다 Lock을 사용하게 된다. 이는 우리가 앞서 살펴본 Counter에서의 approach와 유사한데, Lock을 분배하여 병목 현상을 없애는 목적이 있다. 리스트를 순회할 때, 다음 노드의 Lock을 획득하고, 현재 노드의 Lock을 해제하는 과정이 Hand-over-hand와 비슷하다고 하여 붙여진 이름이다.

하지만 Lock의 개수가 많아질수록, 리스트를 순회할 때 Lock을 걸고 해제하는 과정의 비용 또한 커지기 때문에 해당 방식이 성능 면에서 무조건 유리한 것은 아니다! (큰 리스트일 경우 악화되겠지?) 따라서 책에서는 일정 개수의 노드마다 하나의 Lock을 할당하는 Hybrid 방식이 더 낫다고 평가하고 있다.

3. Concurrent Queues

이번에는 큐에 동시성을 구현해보자. Michael and Scott Concurrent Queues는 다중 스레드 환경에서 큐에 대해 효율적인 동시성을 제공하는 자료 구조이다. 아래의 코드를 살펴보자.

typedef struct __node_t {
  int value;
  struct __node_t *next;
} node_t;

typedef struct __queue_t {
  node_t *head;
  node_t *tail;
  pthread_mutex_t head_lock, tail_lock;
} queue_t;

void Queue_Init(queue_t *q) {
  node_t *tmp = malloc(sizeof(node_t));
  tmp->next = NULL;
  q->head = q->tail = tmp;
  pthread_mutex_init(&q->head_lock, NULL);
  pthread_mutex_init(&q->tail_lock, NULL);
}

void Queue_Enqueue(queue_t *q, int value) {
  node_t *tmp = malloc(sizeof(node_t));
  assert(tmp != NULL);
  tmp->value = value;
  tmp->next = NULL;

  pthread_mutex_lock(&q->tail_lock);
  q->tail->next = tmp;
  q->tail = tmp;
  pthread_mutex_unlock(&q->tail_lock);
}

int Queue_Dequeue(queue_t *q, int *value) {
  pthread_mutex_lock(&q->head_lock);
  node_t *tmp = q->head;
  node_t *new_head = tmp->next;
  if (new_head == NULL) {
    pthread_mutex_unlock(&q->head_lock);
    return -1; // queue was empty
  }
  *value = new_head->value;
  q->head = new_head;
  pthread_mutex_unlock(&q->head_lock);
  free(tmp);
  return 0;
}

두 개의 Lock을 사용하여 큐의 head와 tail을 분리해서 관리한다. 이를 통해 enqueue와 dequeue 작업이 동시에 일어나도 경합이 발생하지 않도록 한다.
head_lock은 큐의 앞부분을 관리하는 Lock으로, dequeue 작업에서 사용된다.
tail_lock은 큐의 뒷부분을 관리하는 Lock으로, enqueue 작업에서 사용된다.

4. Concurrent Hash Table

마지막으로 해시 테이블에 동시성을 구현해보자!

#define BUCKETS (101)

typedef struct __hash_t {
	list_t lists[BUCKETS];
} hash_t;

void Hash_Init(hash_t *H) {
  int i;
  for (i = 0; i < BUCKETS; i++)
 	 List_Init(&H->lists[i]);
}

int Hash_Insert(hash_t *H, int key) {
	return List_Insert(&H->lists[key % BUCKETS], key);
}

int Hash_Lookup(hash_t *H, int key) {
	return List_Lookup(&H->lists[key % BUCKETS], key);
}

사실 해시 테이블에 동시성을 구현하는 개념은 비교적 단순한데, 해시 테이블의 각각의 버킷을 서로 다른 Lock을 갖고 있는 Concurrent Linked List로 설정하면 된다. 어차피 해시 테이블에서 서로 다른 버킷은 서로 간섭할 여지가 없기 때문에, 각각의 버킷에 서로 다른 Lock을 할당해줄 수 있기 때문이다!